Computer Setup – Cluster Node003 Conversion

Here’s an overview of some of the struggles getting node003 converted/upgraded to function as an independent computer (as opposed to a slave node in the Apple computer cluster).

  • 6TB HDD
  • Only 2.2TB recognized when connected to Hummingbird via FireWire (the internet suggests that’s the FireWire maximum on an Xserve; USB might recognize the full drive) – Hummingbird is a converted Xserve running Mavericks
  • Reformatted on different Mac and full drive size recognized
  • Connected to Hummingbird (via USB) and full 6TB recognized
  • Connected to Mac Mini to install OS X
  • Tried installing OS X 10.8.5 (Mountain Lion) via Cmd+R at boot, but it failed partway through installation
  • Tried to reformat the drive with Disk Utility via Cmd+R at boot, but couldn’t
  • Identified a broken partition table on Linux and used GParted to establish a new one (see the sketch after this list); back on the Mac Mini, the OS X (Mountain Lion) install worked
  • Upgraded to OS X 10.11.5 (El Capitan)
  • Inserted drive into Mac cluster node003 – wouldn’t boot all the way: Apple icon, progress bar, then Do Not Enter symbol
  • Removed the drive, put the original back in, and connected the 6TB HDD via USB, but booting from USB wasn’t an option (when booting while holding the Option key)
  • Probably due to node003 being part of the cluster – reformatted the original node003 drive with a clean install of OS X Server
  • Booting from USB was now an option and worked with the 6TB HDD!
  • Put the 6TB HDD w/El Capitan in the internal sled and it wouldn’t boot! Apple icon, progress bar, then Do Not Enter symbol
  • Installed OS X 10.11.5 (El Capitan) on old 1TB drive and inserted into node003 – worked perfectly!
  • Will just use 1TB boot drive and figure out another use for 6TB HDD
  • Renamed node003 to roadrunner
  • Current plan is to upgrade from 12GB to 48GB of RAM and then automate moving data off this drive to long-term storage on Owl (Synology server).
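
For reference, here’s a minimal sketch of the partition-table fix performed on Linux (the device name is hypothetical; GParted’s GUI equivalent is Device > Create Partition Table):

    # Write a fresh GPT partition table to the 6TB HDD
    # (replace /dev/sdX with the actual device identifier)
    sudo parted /dev/sdX mklabel gpt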

Data Analysis – Oly GBS Data Using Stacks 1.37

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.
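
For context, here’s a hedged sketch of the core Stacks 1.x commands involved; the paths, enzyme, and barcode file are illustrative assumptions, not the exact notebook parameters:

    # Demultiplex and quality-filter the raw reads, then build loci for one sample
    process_radtags -p raw/ -b barcodes.txt -o samples/ -e apeKI -r -c -q
    ustacks -t fastq -f samples/sample_01.fq -o stacks/ -i 1 -p 4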

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb

Data Management – Olympia Oyster Small Insert Library Genome Assembly from BGI

Received another set of Ostrea lurida genome assembly data from BGI. In this case, it’s data assembled from the small insert libraries they created for this project.

All data is stored here: http://owl.fish.washington.edu/O_lurida_genome_assemblies_BGI/20160512/

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. In it is the estimated genome size:

Olympia oyster genome size: 1898.92 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.

Data management was performed in a Jupyter (iPython) notebook; see below.

Jupyter Notebook: 20160516_Oly_Small_Insert_Library_Genome_Read_Counts.ipynb
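
Since the notebook is mostly read counting, here’s the gist as a minimal bash sketch (filenames are illustrative, not the actual BGI file names): count lines in each gzipped FASTQ and divide by four, since each read spans four lines.

    # Count reads in each gzipped FASTQ (4 lines per read)
    for f in *.fq.gz; do
        echo "$f: $(( $(gunzip -c "$f" | wc -l) / 4 )) reads"
    done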

GBS Frustrations

This isn’t really a notebook entry – it’s more of a traditional blog post.

It’s a quick summary of the frustrations and struggles I’ve encountered while trying to analyze the Olympia oyster GBS data. Hopefully it will serve as a placeholder for others to find (and avoid) some of the pitfalls I’ve encountered so far. But, mostly, this is just for me to vent…

  1. Using the Stacks program (on Hummingbird over the network to our server Owl) takes forever and, more importantly, consistently fails to complete the ustacks and cstacks programs.

  2. Using the Stacks program (on Hummingbird via external HDD connected through FireWire) takes forever (combined, process_radtags and ustacks have been running since 20160428; that’s eight days)!!! Granted, this is running on all 96 samples, but, regardless, this type of time frame is not very conducive to productivity.

  3. The “raw” non-demultiplexed FASTQ files supplied by BGI have an ‘N’ in the barcode in the FASTQ header lines. This prevents Stacks (and possibly Tassel – I’ll get to this in a second) from being able to perform the demultiplexing. Here’s a screen shot of what I’m talking about:

  4. Cyverse has a program called Tassel that should be able to handle GBS data just like ours. However, it doesn’t produce the expected output needed to proceed to the second step. Although I haven’t tested it, it’s possible that the problem is related to the ‘N’ in the FASTQ header barcode sequence I mentioned above. I suspect it’s related because the first step in using Tassel is demultiplexing utilizing a supplied barcode keyfile.

  5. Cyverse has Stacks installed, but in order to use it, someone has to build a Cyverse “app.” I’ve tried, and the process is brutal. It’s not conducive to a program (really a suite of programs) like Stacks that has so many command line options and, depending on your input file types (e.g. “non-standard” Illumina filenames for paired-end sequencing), requires looping over filenames to specify corresponding file pairs.

  6. Pyrad actually worked relatively well, but the VCF output file (for visualizing in the Integrative Genomics Viewer) has an ill-formed header that IGV won’t accept. Attempts at tweaking the header don’t seem to resolve the issue. Additionally, it’s not apparent in the output files whether individuals get grouped, even though there is an option to specify which individuals should be grouped together.

  7. And, the most frustrating thing of all???!!! I just realized how to handle the problematic barcodes in the FASTQ headers!! Instead of trying to alter the FASTQ files (which I’ve been messing around with over the past few days), all I’ve needed to do this entire time is CHANGE THE BARCODE KEY FILE THAT STACKS AND/OR TASSEL USES TO HAVE AN ‘N’ AT THE BEGINNING OF EACH BARCODE! (A one-liner for this is sketched below.)
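
The fix itself is trivial; this sketch assumes a simple one-barcode-per-line key file (if the keyfile has additional columns, the ‘N’ only goes on the barcode field):

    # Prepend 'N' to every barcode in the key file
    sed 's/^/N/' barcodes.txt > barcodes_N.txt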

I’m going to go cry now…

Regardless of that last one, it doesn’t change the fact that Stacks is painfully slow and, at times, unreliable.

Goals – May 2016

Well, I guess the first goal is to remember to be more consistent about writing monthly goals…

Anyway, here they are – short and sweet. Most of them are really part of a to-do list, as opposed to goals, but I’ll still put them down.

  • Analyze Olympia oyster GBS data
  • Continue NGS archiving via SRA submissions
  • Continue to work on building functional Docker image to improve laboratory reproducibility
  • Build Stacks app in Cyverse Discovery Environment
  • Start using the UW Hyak computing cluster


SRA Release – Transcriptomic Profiles of Adult Female & Male Gonads in Panopea generosa (Pacific geoduck)

The RNAseq data that I previously submitted to the NCBI Sequence Read Archive (SRA) has been released to the public today. Here are the various links for the project:

Study: SRP072283 – http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP072283

BioProject: PRJNA316216 – http://www.ncbi.nlm.nih.gov/bioproject/PRJNA316216

Female Pool Experiment: SRX1659865 – http://www.ncbi.nlm.nih.gov/sra/SRX1659865

Male Pool Experiment: SRX1659866 – http://www.ncbi.nlm.nih.gov/sra/SRX1659866

Data Management – O. lurida Raw BGI GBS FASTQ Data

BGI had previously supplied us with demultiplexed GBS FASTQ files. However, they had not provided us with the information/data on how those files were created. I contacted them and they’ve given us the two original FASTQ files, as well as the library index file and corresponding script they used for demultiplexing all of the files. The Jupyter (iPython) notebook below updates our checksum and readme files in our server directory that’s hosting the files: http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/
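
Roughly, the update amounts to something like the following (the mount point and filenames are assumptions, not the notebook’s exact paths):

    # Append checksums and a file listing for the newly received files
    cd /Volumes/owl/nightingales/O_lurida/20160223_gbs
    md5 *.fq.gz >> checksums.md5     # OS X md5; use md5sum on Linux
    ls -lh *.fq.gz >> readme.md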

See Jupyter Notebook below for processing details.

Jupyter Notebook: 20160427_Oly_GBS_data_management.ipynb

Computing – Speed Benchmark Comparisons Between Local, External, & Server Files

I decided to run a quick test to see what difference in speed (i.e. time) we might see between handling files that are stored locally, on an external hard drive (HDD), or on our server (Owl).

This isn’t tightly controlled because it’s possible that other people were using resources on the server, thus slowing things down. However, that’s a true real-world situation, so it’s probably an accurate representation of what we’d experience on a daily basis.

https://github.com/sr320/LabDocs/blob/master/jupyter_nbs/sam/20160427_speed_comparison.ipynb
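
The benchmark boils down to timing the same file operation from each location; a hedged sketch (paths are hypothetical, not the notebook’s exact files):

    # Compare read speed: local disk vs. external HDD vs. mounted server share
    time cp ~/data/test.fastq /dev/null
    time cp /Volumes/ExternalHDD/test.fastq /dev/null
    time cp /Volumes/owl/test.fastq /dev/null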

Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.
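
For reference, the PyRAD invocation pattern looks roughly like this (params.txt is PyRAD’s standard parameters file; running all seven steps is an assumption about this particular analysis):

    # Run PyRAD steps 1-7 (demultiplexing through final output files)
    pyrad -p params.txt -s 1234567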

In this instance, the subset of individuals were all analyzed as a single population. I have another Jupyter notebook running on a different computer that will separate the three populations that are present in this subset.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.


One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on the files when they’re stored locally. Somewhere in the notebook you’ll notice that I copy the entire working directory from the server (Owl) to the local machine (Hummingbird); things proceeded very quickly after doing that. I didn’t realize this would have so much impact on speed!!
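
That copy step was along these lines (the mount point and directory names are assumptions):

    # Pull the working directory off the server share onto the local disk
    rsync -av /Volumes/owl/analyses/oly_gbs/ ~/oly_gbs/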

Jupyter Notebook: 20160418_pyrad_oly_PE-GBS.ipynb

NBviewer: 20160418_pyrad_oly_PE-GBS

Data Management – Concatenate FASTQ files from Oly MBDseq Project

Steven requested I concatenate the MBDseq files we received for this project:

  • concatenate the s4, s5, s6 file sets for each individual

  • concatenate the full file sets for each individual

Ran the concatenations in the Jupyter (iPython) notebook below. All files were saved to Owl/nightingales/O_lurida/2016
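
The concatenation itself is just cat on the gzipped FASTQ files (concatenated gzip streams are still valid gzip); the filenames below are illustrative, not the actual sample names:

    # Combine the s4/s5/s6 file set for one individual
    cat sample01_s4.fq.gz sample01_s5.fq.gz sample01_s6.fq.gz > sample01_s456.fq.gz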

Jupyter Notebook: 20160411_Concatenate_Oly_MBDseq.ipynb

NBviewer: 20160411_Concatenate_Oly_MBDseq