Data Analysis – fastStructure Population Analysis of Oly GBS PyRAD Output

After some background reading on what Fst is (see the notebook below for a definition and reference), I decided to try using fastStructure to analyze the PyRAD output from 20160727.

The quick TL;DR: after I spent a bunch of time installing the program, it turns out fastStructure doesn't handle the default Structure file (.str); it requires some companion file types that PyRAD doesn't output.

I’ve put this here for posterity and background reference on Fst…

Will proceed with the full-blown Structure program to try to glean some info from these three populations.

 

Jupyter Notebook: 20160816_oly_gbs_fst_calcs.ipynb

 

Computing – Amazon EC2 Cost “Analysis”

I recently moved some computing jobs over to Amazon's Elastic Compute Cloud (EC2) in an attempt to avoid some odd computing issues/errors I kept encountering on our lab computers (Apple Xserve 3,1).

The big trade-off here is that the lab computers are already paid for, and using EC2 means we'll be sinking more money into computing resources. With that expense should come faster processing (i.e. less time) to perform various analyses. As they say, time is money…

Let’s look at how things’ve worked out so far.

 

First, how much did we spend and how did we spend it (click on the image to enlarge)?

 

Of course, it's easy to see that for the instance I was running, it cost us $0.419/hr. That's great and all, but you sort of lose track of what that ends up costing over the long term. Let's look at how things break out over a larger time scale.

According to Amazon’s (very useful!) billing breakdown, we spent $187 in the month of July 2016. This doesn’t seem too bad. In fact, this would only cost us ~$2200/yr if we continue to run this instance in this fashion. However, let’s look at it a bit further.

We see that the instance ran for a total of 374 hrs during July 2016. Divide that by 24 hrs/day and we see that the instance was running for 15.6 days; just over half the month. That means we would've spent ~$374 for the full month, which would equate to ~$4488/yr. For our lab, that kind of money starts to add up, and one starts to wonder whether it wouldn't be better to invest in higher-end hardware for the lab: a single "sunk" cost that would last us many, many years.
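
For reference, here's the back-of-the-envelope math as a few shell one-liners (a sketch using bc; the $187 and 374 hr figures come from the billing summary above, and the small differences from the ~$374/month and $4488/yr figures quoted are just rounding):

# Project July's $187 bill (374 hrs of instance time) out to a full month and year
echo "scale=2; 187 * 12" | bc                   # ≈ $2244/yr at July's actual usage
echo "scale=2; 187 / 374 * 24 * 31" | bc        # ≈ $372 if the instance ran all of July
echo "scale=2; 187 / 374 * 24 * 31 * 12" | bc   # ≈ $4464/yr running continuously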

Regardless, with the lab’s current computing hardware, we should compare another factor that’s involved with the expense of using Amazon EC2 instead of our lab computers: time.

I performed a very rough “guestimation” of the time savings that EC2 has provided us.

 

I compared the length of “real” time for the first step in the PyRad program using the same data set on one of our lab computers (roadrunner) and the Amazon EC2 instance:

  • roadrunner: 1118 minutes

  • EC2: 771 minutes

 

Roadrunner takes nearly 1.5x as long as the EC2 instance! To really appreciate what type of impact that has, we should look at the run time for the full PyRad analysis:

  • roadrunner: 5546 minutes (NOTE: the full roadrunner analysis didn't complete, so its time is "guestimated" as 1.45 x the EC2 time, based on the Step 1 comparison above)

  • EC2: 3825 minutes

 

Let’s convert those numbers into something more easily understood – hours and days:

  • roadrunner: ~92 hrs

  • roadrunner: ~4 days

  • EC2: 63hrs

  • EC2: ~2.6 days
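
Those conversions come straight from the run times above; here's a quick sketch of the arithmetic with bc:

echo "scale=2; 1118 / 771" | bc    # ≈ 1.45, the roadrunner:EC2 ratio from Step 1
echo "scale=2; 3825 * 1.45" | bc   # ≈ 5546 min, the "guestimated" full roadrunner run
echo "scale=1; 5546 / 60" | bc     # ≈ 92.4 hrs for roadrunner
echo "scale=1; 3825 / 60" | bc     # ≈ 63.7 hrs for EC2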

 

Of course, these times don't take into account any technical issues we might encounter on either platform (and I have encountered many technical issues using roadrunner), but I can tell you that I've not had any headaches using EC2 (other than unintentional, self-imposed ones).

 

Another potential option is trying out InsideDNA. They offer cloud computing services specifically geared towards high-throughput bioinformatics analysis. They have many, many bioinformatics tools already installed and available on their platform. Additionally, they have nice tutorials on how to use some of these tools, which goes a long way toward getting started on any analyses using new software. Here are the various pricing tiers they offer:

 

 

 

The "Advanced" tier ($100/month) certainly seems like it could be better than using Amazon. However, this tier only offers 500GB of storage. If you look up above at the Amazon pricing breakdown, you'll notice that I've already used 466GB of storage for just that one experiment! Additionally, the 1000 CPU hours seems great, but remember, those hours get divided across the number of CPUs you end up using. The Amazon EC2 instance was running eight cores. If I were to run a similar setup on InsideDNA, 1000 CPU hours would work out to only 125 wall-clock hours on eight cores. Again, looking up above, we see that I ran the EC2 instance for 374 hours! That means the "Advanced" tier on InsideDNA wouldn't be enough to get our jobs done.
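
The CPU-hour math, as a quick bc sketch (eight cores and 374 hours are the EC2 numbers from above):

echo "1000 / 8" | bc    # 1000 CPU-hrs spread across 8 cores = 125 wall-clock hrs
echo "374 * 8" | bc     # July's 374 wall-clock hrs on 8 cores = 2992 CPU-hrs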

 

Anyway, in the grand scheme of things, using an Amazon EC2 instance periodically, as we need it, throughout the year isn't terrible. However, if we start using the University of Washington Hyak computing cluster, we may be able to avoid spending on EC2 while seeing similar time savings compared to using the lab computers. Need to get cracking on that…

Goals – August 2016

  • Complete Olympia oyster GBS data analysis – Progress has actually been made! After many struggles, I managed to get a PyRad analysis of the entire data set to complete. Now, I just have to figure out what to do with the output files…
  • Troubleshoot Stacks analysis of Olympia oyster GBS data – After switching computing to Amazon AWS, I thought this would be a breeze. However, the analysis keeps failing (without errors) on the "ustacks" portion of the pipeline; no output files are created even though the analysis runs (for 20 hrs!!). Although it would be nice to get Stacks to complete successfully (just once!), now that I have a completed PyRad analysis, troubleshooting this will be a little lower on the priority list.
  • Start using Hyak – We need computing power and Hyak is a free resource. Although Amazon AWS is pretty sweet, it ends up being a bit costly…

Data Analysis – PyRad Analysis of Olympia Oyster GBS Data

Previously, I ran a PyRad analysis on just a subset of these samples in an attempt to have some data for a grant pre-proposal.

I’ve now completed a PyRad analysis on the full set. Now, I just need to figure out what to do with the output from this…

Jupyter Notebook: 20160715_ec2_oly_gbs_pyrad.ipynb

Computing – Not Enough Power!

Well, I tackled the storage space issue by expanding the EC2 Instance to have 1000GB of storage space. Now that that's no longer a concern, it turns out I'm running up against processing/memory limits!

I’m running the EC2 c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM) instance.

I’m trying to run two programs simultaneously: PyRad and Stacks (specifically, the ustacks “sub” program).

PyRad keeps crashing with some memory error stuff (see embedded Jupyter Notebook at the end of this post).

Used the following command-line program to visualize what's happening with the EC2 Instance resources (i.e. processor and RAM utilization):

htop

Downloaded/installed to EC2 Instance using:

sudo apt-get install htop

 

I see why PyRad is dying. Here are two screen captures that show what resources are being used (click to see detail):

 

 

 

 

The top image shows that ustacks is using 100% of all eight CPUs!

The second image shows that, when ustacks is finishing with one of the files it's processing, it uses all of the memory (16GB)!

So, I will have to wait until ustacks is finished running before being able to continue with PyRad.

If I want to run these simultaneously, I have two options (using either one still requires waiting until ustacks completes before I can modify the current EC2 instance to accommodate it; see the sketch after this list):

  • Increase the computing resources of this EC2 Instance

  • Create an additional EC2 Instance and run PyRad on one and Stacks programs on the other.
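
For what it's worth, either option could also be handled from awscli once ustacks finishes. Here's a rough sketch (the instance ID, AMI ID, and key name are placeholders, and the instance has to be stopped before its type can be changed):

# Option 1: resize the existing instance to a larger type (e.g. c4.4xlarge)
aws ec2 stop-instances --instance-ids i-xxxxxxxx
aws ec2 modify-instance-attribute --instance-id i-xxxxxxxx --instance-type "{\"Value\": \"c4.4xlarge\"}"
aws ec2 start-instances --instance-ids i-xxxxxxxx

# Option 2: launch a second instance and run PyRad there
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c4.2xlarge --key-name my-key --count 1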

 

Here’s the Jupyter Notebook with the PyRad errors (see “Step 3: Clustering” section):





Computing – Amazon EC2 Instance Out of Space?

Running the PyRad analysis on the Olympia oyster GBS data. PyRad exited with warnings about running out of space. However, looking at the free disk space on the EC2 Instance suggests that there's still space left on the disk. Possibly PyRad monitors expected disk space usage during the analysis to verify there will be sufficient space to write to? Regardless, I will expand the EC2 instance's volume to a larger size…
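
For the record, the free-space check was just the standard disk usage commands (a minimal sketch; the ~/data directory is the upload location mentioned elsewhere in these notes):

df -h           # free space on each mounted filesystem
du -sh ~/data   # size of the uploaded sequence data directory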

 

 

Computing – A Very Quick “Guide” to Amazon EC2 Continued

Yesterday's post ended with me trying to mount an S3 bucket to my EC2 instance using s3fs-fuse.

Waited for the 36GB of data to copy over to the new bucket with proper naming (i.e. no capital letters in the name). Copying took hours; I left the lab before it completed.

 

Mount S3 bucket: kubu4

s3fs kubu4 /mnt/s3bucket/ -o passwd_file=/home/ubuntu/creds_s3fs

 

So, that didn't work. The reason it didn't work is that I uploaded the files to the S3 bucket via the Amazon AWS command line interface (awscli). Apparently, s3fs-fuse can't mount S3 buckets containing data uploaded via awscli [see this GitHub Issue for s3fs-fuse]! However, I had to upload them via awscli because the web interface kept failing!
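
For context, the awscli uploads were along these lines (a sketch; the bucket name and file names match the ones used elsewhere in this post):

aws s3 cp 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz s3://kubu4/
aws s3 cp 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz s3://kubu4/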

 

That means I need to upload the data directly to my EC2 instance, but my EC2 instance is set up with the default storage capacity of 8GB, so I need to increase the capacity to accommodate my two large files, as well as the anticipated intermediate files that will be generated by the types of analyses I plan on running. I'm guessing I'll need at least 100GB to be safe. To do this, I have to expand the Elastic Block Storage (EBS) volume of my instance. The rest of the steps below are fully explained and covered very well in the EBS expansion link in the previous sentence.

Don’t be fooled into thinking I figured any of this out on my own!

 

Expanding the EC2 Instance

The initial part of the process is creating a Snapshot of my instance. This took a long time (2.5 hrs). However, I finally decided to refresh the page when I noticed that the "Status" progress bar hadn't moved beyond 46% for well over an hour. After refreshing, the "Status" showed "Complete." Maybe it actually finished much faster, but the page didn't automatically refresh? Regardless, in retrospect, since this EC2 instance is pretty much brand new and doesn't have many changes from when it was initialized, I probably should've just created a brand new EC2 instance with the desired amount of EBS…

Created volume from that Snapshot with 150GB of magnetic storage.
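
For reference, the same snapshot-and-volume steps could be scripted with awscli (a sketch; the volume/snapshot IDs and availability zone are placeholders, and "standard" is the magnetic volume type):

aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "root volume before expansion"
aws ec2 create-volume --snapshot-id snap-xxxxxxxx --size 150 --volume-type standard --availability-zone us-east-1a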

Attached the volume to the EC2 instance at /dev/sda1 (the default setting, /dev/sdf, resulted in an error message about the instance not having a root volume) and SSH'd into the instance. Oddly, it seems to show that I still only have 8GB of storage (see the "Usage of…" in the screenshot below):

 

 

Checked to see whether I actually have the expanded storage volume or not. It turns out, I do! (Notice that the only drive listed is "xvda" and its partition, "xvda1", and they are equal in size, 150G):
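
The check itself is just the standard block device listing (a minimal sketch of what the screenshot shows):

lsblk    # xvda and its partition xvda1 should both show 150G
df -h    # the mounted filesystem should reflect the new size, too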

 

 

Time to upload the files (via the secure copy command, scp) to my EC2 instance! The following commands upload the files to a folder called "data" in my home directory. I also ran the "time" command at the beginning to get an idea of how long it takes to upload each of these files.

time scp -i ~/Dropbox/Lab/Sam/bioinformatics.pem /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz ubuntu@ec2.ip.address:~/data
time scp -i ~/Dropbox/Lab/Sam/bioinformatics.pem /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz ubuntu@ec2.ip.address:~/data

 

Details on upload times and file sizes:

 

Confirm the files now reside on my EC2 instance:

 

Alas, I should’ve captured all of this in a Jupyter Notebook. However, I didn’t because I thought I would need to enter passwords (which you can’t do with a Jupyter Notebook). It turns out, I didn’t need a password for anything; even when using “sudo” on the EC2 instance. Oh well, it’s set up and running with my data finally accessible. That’s all that really matters here.

Alrighty, time to get rolling on some data analysis with a fancy new Amazon EC2 instance!!!

Dissection – Frozen Geoduck & Pacific Oyster

We're working on a project with Washington Department of Natural Resources' (DNR) Micah Horwith to identify potential proteomic biomarkers in geoduck (Panopea generosa) and Pacific oyster (Crassostrea gigas). One aspect of the project is how best to sample juvenile geoduck and Pacific oysters to minimize changes in the proteome of ctenidia tissue during sampling. Generally, live animals are shucked, tissue is dissected, and then the tissue is "snap" frozen. However, Micah's crew will be collecting animals from wild sites around Puget Sound and, because of the remote locations and the means of collection, will have limited tools and time to perform this type of sampling. Time is a significant component that will have a great impact on the proteomic status of each individual.

As such, Micah and crew wanted to try out a different means of sampling that would help preserve the state of the proteome at the time of collection. They have collected some juveniles of both species and "snap" frozen them in the field in a dry ice/ethanol bath in hopes of best preserving the ctenidia proteome status. I'm attempting to dissect out the frozen ctenidia tissue from both types of animals and am reporting on the success/failure of this preservation/sampling protocol.

To test this, I transferred animals (contained in baggies) from the -80C to dry ice. Utensils and weigh boats were cooled on dry ice.

 

Results:

Quick summary: This method won't work, and I think sampling will have to take place in the field.

The details of why this won’t work (along with images of the process) are below.

 

The first issue with this sampling method (worth noting because I believe dry ice/ethanol baths will be used even with a different sampling methodology) is that the ethanol in the dry ice bath at the time of animal collection is a potential problem for labeling baggies. Notice in the screenshot below that the label for the geoduck baggie (the baggie on the left) has, for all intents and purposes, completely washed off:

 

 

Starting with C. gigas, opening the animal was relatively easy. Granted, the animal had become brittle, but access to, and identification of, tissues ended up being pretty easy:

 

 

 

 

However, dissecting out just the ctenidia is a lost cause. The mantle and the ctenidia are, essentially, fused together in a frozen block running through the oyster. Although the image below might look like part of the shell, it is not. It is strictly a chunk of frozen ctenidia/mantle tissue:

 

 

 

The geoduck were even more difficult. In fact, I couldn’t even manage to remove the soft tissue from the shell (for the uninitiated, there are two geoduck in the image below). I only managed to crush most of the tissue contained within the shell, making it even more impossible (if that’s possible) to identify and dissect out the ctenidia: