Tag Archives: stacks

Computing – Not Enough Power!

Well, I tackled the storage space issue by expanding the EC2 Instance to have a 1000GB of storage space. Now that that’s no longer a concern, it turns out I’m running up against processing/memory limits!

I’m running the EC2 c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM) instance.

I’m trying to run two programs simultaneously: PyRad and Stacks (specifically, the ustacks “sub” program).

PyRad keeps crashing with some memory error stuff (see embedded Jupyter Notebook at the end of this post).

Used the following Bash program to visualize what’s happening with the EC2 Instance resources (i.e. processors and RAM utilization):

htop

Downloaded/installed to EC2 Instance using:

sudo apt-get install htop

I see why PyRad is dying. Here are two screen captures that show what resources are being used (click to see detail):

The top image shows that ustacks is using 100% of all eight CPUs!

The second image shows when ustacks is finishing with one of the files it’s processing, it uses all of the memory (16GBs)!

So, I will have to wait until ustacks is finished running before being able to continue with PyRad.

If I want to be able to run these simultaneously, I can (using either of these options still requires me to wait until ustacks completes in order to manipulate the current EC2 instance to accommodate either of the two following options):

Increase the computing resources of this EC2 Instance
Create an additional EC2 Instance and run PyRad on one and Stacks programs on the other.

Here’s the Jupyter Notebook with the PyRad errors (see “Step 3: Clustering” section):

Data Analysis – Oly GBS Data Using Stacks 1.37

0000-0002-2747-368X

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb

GBS Frustrations

0000-0002-2747-368X

This isn’t really a notebook entry – it’s more of a traditional blog post.

It’s a quick summary of the frustrations and struggles I’ve encountered while trying to analyze the Olympia oyster GBS data. Hopefully it will serve as a place holder for others to find (and avoid) some of the pitfalls I’ve encountered so far. But, mostly, this is just for me to vent…

Using the Stacks program (on Hummingbird over the network to our server Owl) takes forever and, more importantly, consistently fails to complete the ustacks and cstacks programs.
Using the Stacks program (on Hummingbird via external HDD connected through Firewire) takes forever (combined, process_radtags and ustacks has been running since 20160428; that’s eight days)!!! Granted, this is running on all 96 samples, but, regardless, this type of time frame is not very conducive to productivity.
The “raw” non-demultiplexed fastq files supplied by BGI have a ‘N’ in the barcode in the FASTQ header lines. This prevents Stacks (and possibly Tassel – I’ll get to this in a second) from being able to perform the demultiplexing. Here’s a screen shot of what I’m talking about:

Cyverse has a program called Tassel that should be able to handle GBS data just like ours. However, it doesn’t produce the expected output to proceed to the second step. Although I haven’t tested it, it’s possible that the problem is related to the ‘N’ in the FASTQ header barcode sequence I mentioned above. I suspect it’s related because the first step in using Tassel is demultiplexing utilizing a supplied barcode keyfile.
Cyverse has Stacks installed, but in order to use it, someone has to build a Cyverse “app.” I’ve tried and the process is brutal. It’s not conducive for a program (that is really a suite of programs) like Stacks that has so many command line options and, depending on your input file types (e.g. “non-standard” Illumina filenames for paired-end sequencing), requires looping over filenames to specify corresponding file pairs.
Pyrad actually worked relatively well, but the VCF output file (for visualizing in the Integrative Genomics Viewer) has an ill-formed header that IGV won’t accept. Attempts at tweaking the header don’t seem to resolve the issue. Additionally, it’s not apparent in the output files if individuals get grouped, even though there is an option to specify which individuals should be grouped together.
And, the most frustrating thing of all???!!! I just realized how to handle the problematic barcodes in the FASTQ headers!! Instead of trying to alter the FASTQ files (which I’ve been messing around with over the past few days), all I’ve needed to do this entire time is CHANGE THE BARCODE KEY FILE THAT STACKS AND/OR TASSEL USES TO HAVE A ‘N’ AT THE BEGINNING OF EACH BARCODE!

I’m going to go cry now…

Regardless of that last one, it doesn’t change the fact that Stacks is painfully slow and, at times, unreliable.

Data Analysis – Oly GBS Data from BGI Using Stacks

0000-0002-2747-368X

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS from BGI data analyzed.

Installed Stacks (and its dependencies on Hummingbird earlier today).

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS

Software Install – samtools-0.1.19 and stacks-1.37

0000-0002-2747-368X

Getting ready to analyze our Ostrea lurida genotype-by-sequencing data and wanted to use the Stacks software.

We have an existing version of Stacks on Hummingbird (the Apple server blade I will be running this analysis on), but I figured I might as well install the latest version (stacks-1.37).

Additionally, Stacks requires samtools-0.1.19 to run, which we did NOT have installed.

I tracked all of this in the Jupyter (iPython) notebook below.

Due to permissions issues during installation, I frequently had to leave the Jupyter notebook to run “sudo” in bash. As such, the notebook is messy, but does outline the necessary steps to get these two programs installed.

Jupyter notebook: 20160406_STACKS_install.ipynb

NBviewer: 20160406_STACKS_install.ipynb

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab

Tag Archives: stacks

Computing – Not Enough Power!

Data Analysis – Oly GBS Data Using Stacks 1.37

GBS Frustrations

Data Analysis – Oly GBS Data from BGI Using Stacks

Software Install – samtools-0.1.19 and stacks-1.37