Tag Archives: jupyter notebook

Docker – Improving Roberts Lab Reproducibility

In an attempt to maximize our lab’s reproducibility, I’ve been working on developing an all-encompassing Docker image. Docker works much like a virtual machine (i.e. a self-contained computer that runs within your computer), although, strictly speaking, it runs lightweight containers that share the host’s kernel rather than full virtual machines. For the Roberts Lab, the advantage of using Docker is that Docker images can be customized to run a specific suite of software, and these images can then be used by anyone else in the lab, regardless of operating system (assuming they can run Docker on it). In turn, if everyone is using the same Docker image (i.e. the same environment with all the same software), then we should be able to reproduce data analyses more reliably, because there won’t be differences between the software versions people are using. Additionally, using Docker greatly simplifies setting up new computers with the requisite software.

I’ve put together a Dockerfile (a Dockerfile is a text file/script containing the instructions Docker uses to retrieve software and build an image) which will automatically build a Docker image (i.e. virtual computer) that contains all of the normal bioinformatics software our lab uses. This has been a side project while I wait for Stacks analysis to complete (or fail, depending on the day) and it’s finally usable! The image that is built from this Dockerfile will even let the user run RStudio and/or Jupyter Notebooks in their browser (I’m excited about this part)!

Here’s the current list of software that will be installed:

bedtools 2.25.0
bismark 0.15.0
blast 2.3.0+
bowtie2 2.2.8
bsmap 2.90
cufflinks 2.1.1
fastqc 0.11.5
fastx_toolkit 0.0.13
R 3.2.5
RStudio Server 0.99
pyrad 3.0.66
samtools 0.1.19
stacks 1.40
tophat 2.1.1
trimmomatic 0.36

In order to set this up, you need to install Docker and download the Dockerfile (Dockerfile.bio) I’ve created.

I’ve written a bit of a user guide (specific to this Dockerfile) here to get people started: docker.md

The user guide explains a bit how all of this works and tries to progress from a “basic” this-is-how-to-get-started-with-Docker to an “advanced” description of how to map ports, mount local volumes in your containers, and how to start/attach previously used containers.
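As a rough illustration of the port mapping and volume mounting mentioned above, here is a minimal Python sketch that assembles a `docker run` command. The `-p host:container` and `-v host_dir:container_dir` flags are standard Docker options, but the image name, container name, and host directory below are hypothetical placeholders and are not taken from docker.md.

```python
import shlex


def docker_run_command(image, host_port, container_port,
                       host_dir, container_dir, name=None):
    """Build a `docker run` command that maps one port and mounts one directory."""
    cmd = ["docker", "run", "-it"]          # interactive session with a terminal
    if name:
        cmd += ["--name", name]             # a name makes the container easy to restart later
    cmd += ["-p", f"{host_port}:{container_port}"]  # host port -> container port
    cmd += ["-v", f"{host_dir}:{container_dir}"]    # host directory -> container directory
    cmd.append(image)
    return cmd


cmd = docker_run_command(
    image="bioinformatics:v1",      # hypothetical image name
    host_port=8888,                 # 8888 is Jupyter's default port
    container_port=8888,
    host_dir="/Users/sam/data",     # hypothetical local data directory
    container_dir="/data",
    name="oly_analysis",            # hypothetical container name
)
print(shlex.join(cmd))
```

The printed command can be pasted into a terminal. A named container that has been stopped can later be restarted with `docker start` and re-entered with `docker attach`.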

The next major goal I have with this Docker project is to get the R kernel installed for Jupyter Notebooks. Currently, the Jupyter Notebook installation is restricted to the default Python 2 kernel.

Additionally, I’d like to improve the usability of the Docker image by setting up aliases in the image, meaning a user who wants to use the bowtie program could just type “bowtie”. Currently, the user has to type “bowtie2_2.2.8”, which is a bit ugly (although, since this is in the system PATH and tab-completion works, it’s not that big of a deal).
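One possible way to get short program names is sketched below in Python. Note that it uses symbolic links rather than shell aliases, which is a substitution on my part, and it assumes every program is named `<name>_<version>` (as in “bowtie2_2.2.8”) and lives in a single directory. The function names are made up for this example.

```python
import re
from pathlib import Path

# Matches a trailing version such as "_2.2.8" or "_2.3.0+" (assumed naming scheme).
VERSION_SUFFIX = re.compile(r"_\d+(?:\.\d+)*\+?$")


def short_name(versioned):
    """Strip a trailing "_<version>" from a program name, if present."""
    return VERSION_SUFFIX.sub("", versioned)


def link_short_names(bin_dir):
    """Create version-free symlinks next to each versioned program in bin_dir.

    Returns a dict mapping each new short name to the program it points at.
    Existing files or links are never overwritten.
    """
    bin_dir = Path(bin_dir)
    created = {}
    for program in sorted(bin_dir.iterdir()):
        alias = short_name(program.name)
        link = bin_dir / alias
        if alias == program.name or link.exists() or link.is_symlink():
            continue  # nothing to strip, or the short name is already taken
        link.symlink_to(program.name)  # relative link within the same directory
        created[alias] = program.name
    return created
```

With this naming assumption, “bowtie2_2.2.8” would become reachable as “bowtie2”. A different short name (e.g. “bowtie”) would need an explicit mapping instead of the version-stripping rule.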

For some next-level stuff, I’d also like to set up all Roberts Lab computers to automatically launch the Docker image when the user opens a terminal. This would greatly simplify things for new lab members. They wouldn’t have to deal with going through the various Docker commands to start a Docker container. Instead, their terminal would just put them directly into the container and the user would be none the wiser. They’d be reproducibly conducting data analysis without even having to think about it.

Data Analysis – Oly GBS Data Using Stacks 1.37

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb

Data Management – Olympia Oyster Small Insert Library Genome Assembly from BGI

Received another set of Ostrea lurida genome assembly data from BGI. In this case, it’s data assembled from the small insert libraries they created for this project.

All data is stored here: http://owl.fish.washington.edu/O_lurida_genome_assemblies_BGI/20160512/

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. It includes the estimated genome size:

Olympia oyster genome size: 1898.92 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.
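For anyone unfamiliar with N50: it is the sequence length at which, going from the longest scaffold (or contig) to the shortest, the running total first reaches half of all assembled bases. The Python sketch below shows the calculation on made-up lengths. It is not BGI’s code, and none of the numbers come from their table.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted longest first; the N50 is the length of the
    sequence at which the running total reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Made-up contig lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

In this toy set the lengths total 54, so the halfway point is 27, which is reached after adding 10 + 9 + 8.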

Data management stuff was performed in a Jupyter (iPython) notebook; see below.

Jupyter Notebook: 20160516_Oly_Small_Insert_Library_Genome_Read_Counts.ipynb


 

Data Management – O.lurida Raw BGI GBS FASTQ Data

BGI had previously supplied us with demultiplexed GBS FASTQ files. However, they had not provided us with the information/data on how those files were created. I contacted them and they’ve given us the two original FASTQ files, as well as the library index file and corresponding script they used for demultiplexing all of the files. The Jupyter (iPython) notebook below updates our checksum and readme files in our server directory that’s hosting the files: http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

See Jupyter Notebook below for processing details.

Jupyter Notebook: 20160427_Oly_GBS_data_management.ipynb

Computing – Speed Benchmark Comparisons Between Local, External, & Server Files

I decided to run a quick test to see what difference in speed (i.e. time) we might see between handling files that are stored locally, on an external hard drive (HDD), or on our server (Owl).

This isn’t tightly controlled because it’s possible that other people were using resources on the server, thus slowing things down. However, that is exactly the real-world situation we work in, so it is probably an accurate representation of what we’d experience on a daily basis.

https://github.com/sr320/LabDocs/blob/master/jupyter_nbs/sam/20160427_speed_comparison.ipynb
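A comparison like this could be sketched in Python roughly as follows. This is not the code from the notebook above. It simply times how long it takes to read the same file from each location, and the function names are my own.

```python
import time


def time_read(path, chunk_size=1024 * 1024):
    """Read a file from start to finish; return (bytes_read, seconds_elapsed)."""
    start = time.perf_counter()
    bytes_read = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            bytes_read += len(chunk)
    return bytes_read, time.perf_counter() - start


def compare_locations(paths):
    """Time a full read of each file; `paths` maps a label to a file path."""
    results = {}
    for label, path in paths.items():
        _, seconds = time_read(path)
        results[label] = seconds
    return results
```

To use it, pass a dictionary such as `{"local": ..., "external": ..., "server": ...}` pointing at copies of the same file in each location. Keep in mind that the operating system caches recently read files, so a second read of the same file can look much faster than the first.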

Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.

In this instance, the subset of individuals were all analyzed as a single population. I have another Jupyter notebook running on a different computer that will separate the three populations that are present in this subset.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.

 

One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on the files when they’re stored locally. Somewhere in the notebook you’ll notice that I copy the entire working directory from the server (Owl) to the local machine (Hummingbird). Things proceeded very quickly after doing that. Didn’t realize this would have so much impact on speed!!

Jupyter Notebook: 20160418_pyrad_oly_PE-GBS.ipynb

NBviewer: 20160418_pyrad_oly_PE-GBS

Data Management – Concatenate FASTQ files from Oly MBDseq Project

Steven requested I concatenate the MBDseq files we received for this project:

  • concatenate the s4, s5, s6 file sets for each individual

  • concatenate the full file sets for each individual

Ran the concatenations in the Jupyter (iPython) notebook below. All files were saved to Owl/nightingales/O_lurida/2016
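The notebook has the actual commands. For illustration only, the same idea can be sketched in Python as below. The file naming scheme (`<individual>_<set>.fastq.gz`) is a hypothetical stand-in, since the real file names aren’t shown in this post. Gzipped files can be joined byte-for-byte because a gzip stream may contain multiple members.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def group_by_individual(paths, wanted_sets=None):
    """Group files by individual, assuming names like "<individual>_<set>.fastq.gz".

    If wanted_sets is given (e.g. {"s4", "s5", "s6"}), other sets are skipped.
    """
    groups = defaultdict(list)
    for path in sorted(Path(p) for p in paths):
        individual, _, rest = path.name.partition("_")
        set_name = rest.split(".")[0]
        if wanted_sets is None or set_name in wanted_sets:
            groups[individual].append(path)
    return dict(groups)


def concatenate_files(parts, destination):
    """Write each file in `parts`, in order, into `destination` byte-for-byte."""
    destination = Path(destination)
    with destination.open("wb") as out_handle:
        for part in parts:
            with open(part, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return destination
```

Calling `group_by_individual(files, wanted_sets={"s4", "s5", "s6"})` would cover the first request, and calling it without `wanted_sets` would cover the second. Each resulting list is then passed to `concatenate_files`.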

Jupyter Notebook: 20160411_Concatenate_Oly_MBDseq.ipynb

NBviewer: 20160411_Concatenate_Oly_MBDseq

Data Analysis – Oly GBS Data from BGI Using Stacks

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS data from BGI analyzed.

Installed Stacks (and its dependencies) on Hummingbird earlier today.

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS

Software Install – samtools-0.1.19 and stacks-1.37

Getting ready to analyze our Ostrea lurida genotype-by-sequencing data and wanted to use the Stacks software.

We have an existing version of Stacks on Hummingbird (the Apple server blade I will be running this analysis on), but I figured I might as well install the latest version (stacks-1.37).

Additionally, Stacks requires samtools-0.1.19 to run, which we did NOT have installed.

I tracked all of this in the Jupyter (iPython) notebook below.

Due to permissions issues during installation, I frequently had to leave the Jupyter notebook to run “sudo” in bash. As such, the notebook is messy, but does outline the necessary steps to get these two programs installed.

Jupyter notebook: 20160406_STACKS_install.ipynb

NBviewer: 20160406_STACKS_install.ipynb

Data Management – O. lurida genotype-by-sequencing (GBS) data from BGI

We received a hard drive from BGI on 20160223 (while I was out on paternity leave) containing the Ostrea lurida GBS data.

Briefly, three sets (i.e. populations) of Olympia oyster tissue were collected from oysters raised in Oyster Bay and were sent to BGI for DNA extraction and GBS. A total of 32 individuals from each of the following three populations were sequenced (a grand total of 96 samples):

  • 1HL – (Hood Canal, Long Spit)
  • 1NF – (North Sound, Fidalgo Bay)
  • 1SN – (South Sound, Oyster Bay)

An overview of this project can be viewed on our GitHub Olympia oyster wiki.

Data was copied from the HDD to the following location on Owl (our server): http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

The data was generated from paired-end Illumina sequencing, so there are two FASTQ files for each individual.

The files were analyzed to create MD5 checksums, perform read counts, and create a readme file (markdown format). This was performed in a Jupyter/iPython notebook (see below).
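For reference, the three steps can be sketched in Python as below. This is a minimal example and not the notebook’s actual code. It assumes standard four-line FASTQ records, and the table layout of the readme is just one plausible choice.

```python
import gzip
import hashlib
from pathlib import Path


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_fastq_reads(path):
    """Count reads in a FASTQ file (plain or .gz), assuming 4 lines per read."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def readme_table(paths):
    """Build a markdown table listing the checksum and read count of each file."""
    rows = ["| File | MD5 | Reads |", "|---|---|---|"]
    for path in sorted(Path(p) for p in paths):
        rows.append(
            f"| {path.name} | {md5_checksum(path)} | {count_fastq_reads(path)} |"
        )
    return "\n".join(rows)
```

The string returned by `readme_table` can be written straight to a readme.md file in the data directory.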

IMPORTANT NOTE: The directory where this data is housed was renamed AFTER the Jupyter notebook was run. As such, the directory listed above will not be seen in the Jupyter notebook.

Jupyter notebook file: 20160314_Olurida_GBS_data_management.ipynb

Notebook Viewer: 20160314_Olurida_GBS_data_management.ipynb