
Computing – Amazon EC2 Cost “Analysis”

I recently moved some computing jobs over to Amazon’s Elastic Compute Cloud (EC2) in an attempt to avoid some odd computing issues/errors I kept encountering on our lab computers (Apple Xserve 3,1).

The big trade-off here is that the lab computers are already paid for, while using EC2 means sinking more money into computing resources. With that expense should come faster processing (i.e. less time) to perform various analyses. As they say, time is money…

Let’s look at how things’ve worked out so far.

 

First, how much did we spend and how did we spend it (click on the image to enlarge)?

 

Of course, it’s easy to see that for the instance I was running, it cost us $0.419/hr. That’s great and all, but you sort of lose a sense of what that ends up costing over the long term. Let’s look at how things break out over a larger time scale.

According to Amazon’s (very useful!) billing breakdown, we spent $187 in the month of July 2016. This doesn’t seem too bad. In fact, this would only cost us ~$2200/yr if we continue to run this instance in this fashion. However, let’s look at it a bit further.

We see that the instance ran for a total of 374 hrs during July 2016. Divide that by 24 hrs/day and we see that the instance was running for 15.6 days; just over half the month. That means we would’ve spent ~$374 for the full month, which would equate to ~$4488/yr. For our lab, that kind of money starts to add up, and one starts to wonder whether it wouldn’t be better to invest in higher-end hardware for the lab: a single “sunk” cost that should last us many, many years.
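
For reference, here’s the back-of-the-envelope math behind those numbers (a rough sketch that just doubles the July bill to approximate a full month, then multiplies out to a year):

# Rough cost projection from the July 2016 bill ($187 for ~half a month of run time)
echo "187 * 2" | bc        # ~$374 for a full month
echo "187 * 2 * 12" | bc   # ~$4488 per year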

Regardless, given the lab’s current computing hardware, we should compare another factor involved in the expense of using Amazon EC2 instead of our lab computers: time.

I performed a very rough “guestimation” of the time savings that EC2 has provided us.

 

I compared the length of “real” time for the first step in the PyRad program using the same data set on one of our lab computers (roadrunner) and the Amazon EC2 instance:

  • roadrunner: 1118 minutes

  • EC2: 771 minutes

 

Roadrunner is nearly 1.5x slower than the EC2 instance! To really appreciate what type of impact that has, we should look at the run time for the full PyRad analysis:

  • roadrunner: 5546 minutes (NOTE: Due to incomplete analysis, roadrunner time is “guestimated” as 1.45 x EC2; see below)

  • EC2: 3825 minutes

 

Let’s convert those numbers into something more easily understood – hours and days:

  • roadrunner: ~92 hrs

  • roadrunner: ~4 days

  • EC2: ~64 hrs

  • EC2: ~2.6 days

 

Of course, these times don’t take into account any technical issues we might encounter on either platform (and I have encountered many technical issues using roadrunner), but I can tell you that I’ve not had any headaches using EC2 (other than unintentional, self-imposed ones).

 

Another potential option is trying out InsideDNA. They offer cloud computing services that are specifically geared towards high-throughput bioinformatics analysis. They have many, many bioinformatics tools already installed and available to use on their platform. Additionally, they have nice tutorials on how to use some of these tools, which goes a long way toward getting started on any analyses using new software. Here are the various pricing tiers they offer:


The “Advanced” tier ($100/month) certainly seems like it could be better than using Amazon. However, this tier only offers 500GB of storage. If you look at the Amazon pricing breakdown above, you’ll notice that I’ve already used 466GB of storage for just that one experiment! Additionally, the 1000 CPU hours seems great, but remember, those hours are likely consumed per core. The Amazon EC2 instance was running eight cores. If I were to run a similar setup on InsideDNA, that would amount to only 125 hours of wall-clock time (1000 CPU hours divided across 8 cores). Again, looking above, we see that I ran the EC2 instance for 374 hours! That means the “Advanced” tier on InsideDNA wouldn’t be enough to get our jobs done.
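
A quick sanity check of that CPU-hours math (this assumes CPU hours are billed per core, which is my reading of their pricing, not something I’ve confirmed):

# 1000 CPU hours spread across 8 cores = wall-clock hours available
echo "1000 / 8" | bc       # 125 wall-clock hours on an 8-core instance
# CPU hours actually consumed by the July run (374 wall-clock hours x 8 cores)
echo "374 * 8" | bc        # 2992 CPU hours -- well over the 1000-hour allowance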

 

Anyway, in the grand scheme of things, using an Amazon EC2 instance periodically throughout the year, as we need it, isn’t terrible. However, if we start using the University of Washington Hyak computing cluster, we may be able to avoid spending on EC2 while seeing similar time savings (compared to using the lab computers). Need to get cracking on that…

Computing – Not Enough Power!

Well, I tackled the storage space issue by expanding the EC2 Instance to have 1000GB of storage space. Now that that’s no longer a concern, it turns out I’m running up against processing/memory limits!

I’m running the EC2 c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM) instance.

I’m trying to run two programs simultaneously: PyRad and Stacks (specifically, the ustacks “sub” program).

PyRad keeps crashing with memory errors (see the embedded Jupyter Notebook at the end of this post).

Used the following command-line utility to visualize what’s happening with the EC2 Instance resources (i.e. processor and RAM utilization):

htop

Downloaded/installed to EC2 Instance using:

sudo apt-get install htop

 

I see why PyRad is dying. Here are two screen captures that show what resources are being used (click to see detail):


The top image shows that ustacks is using 100% of all eight CPUs!

The second image shows that when ustacks is finishing with one of the files it’s processing, it uses all of the memory (16GB)!

So, I will have to wait until ustacks is finished running before being able to continue with PyRad.

If I want to be able to run these simultaneously, I have two options (note that either option still requires waiting until ustacks completes before I can modify the current EC2 instance to accommodate it):

  • Increase the computing resources of this EC2 Instance

  • Create an additional EC2 Instance and run PyRad on one and the Stacks programs on the other (a rough CLI sketch follows below).
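
For that second option, something along these lines should work with the AWS CLI (a sketch; the AMI and security group IDs below are placeholders, not the actual values for this instance):

aws ec2 run-instances \
--image-id ami-xxxxxxxx \
--count 1 \
--instance-type c4.2xlarge \
--key-name bioinformatics \
--security-group-ids sg-xxxxxxxx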

 

Here’s the Jupyter Notebook with the PyRad errors (see “Step 3: Clustering” section):


Computing – The Very Quick “Guide” to Amazon Web Services Cloud Computing Instances (EC2)

This all takes a surprisingly long time to set up.

Set up AWS Identity and Access Management (IAM): http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html?icmpid=docs_iam_console

Install AWS command line interface: https://aws.amazon.com/cli/
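
After installing the CLI, the IAM credentials get wired in with:

aws configure
# prompts for the IAM user's Access Key ID, Secret Access Key, default region, and output format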

Copy files to S3 bucket:

aws s3 cp /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz s3://Samb
aws s3 cp /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz s3://Samb

Launch EC2 instance c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM). Configured to have SSH open (TCP, port 22) and also to be able to access Jupyter Notebook via tunnel (TCP, port 8888). Set with “My IP” to limit access to these ports.

Create new key pair. Have to change permissions:

chmod 400 bioinformatics.pem
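
For reference, the key pair can also be created from the CLI instead of the web console (a sketch, assuming the same key name as the .pem file above):

aws ec2 create-key-pair --key-name bioinformatics --query 'KeyMaterial' --output text > bioinformatics.pem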

 

Connect to instance

For Amazon AMI:

ssh -i "bioinformatics.pem" ec2-user@ip.address.of.instance

 

For Amazon Ubuntu Server:

ssh -i "bioinformatics.pem" ubuntu@ip.address.of.instance
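
To reach a Jupyter Notebook running on the instance via an SSH tunnel (as mentioned in the launch configuration above), local port forwarding works (a sketch; forwards local port 8888 to the notebook server on the instance):

ssh -i "bioinformatics.pem" -L 8888:localhost:8888 ubuntu@ip.address.of.instance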


Update/upgrade default Ubuntu packages after initial launch:

sudo apt-get update
sudo apt-get upgrade

 

Set up Docker

Install Docker for Ubuntu 14.04 and copy our bioinformatics Dockerfile to the home directory of the EC2 instance:

scp -i "bioinformatics.pem" /Users/Sam/GitRepos/LabDocs/code/dockerfiles/Dockerfile.bio ubuntu@ip.address.of.instance:
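
The Docker installation itself isn’t shown above; on Ubuntu 14.04, one common route at the time was Docker’s convenience script (a sketch, run on the instance):

wget -qO- https://get.docker.com/ | sh
sudo usermod -aG docker ubuntu    # let the ubuntu user run docker without sudo (takes effect on next login)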

Access data stored in Amazon S3 bucket(s)

Mounting S3 storage as a volume in the EC2 instance requires s3fs-fuse: https://github.com/s3fs-fuse/s3fs-fuse
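
s3fs needs a credentials file (access key and secret key, colon-separated, readable only by the owner) and a mount point before mounting; roughly (a sketch, with placeholder keys):

echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > /home/ubuntu/s3fs_creds
chmod 600 /home/ubuntu/s3fs_creds
sudo mkdir -p /mnt/s3bucket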

 

Mount bucket:

sudo s3fs Samb /mnt/s3bucket/ -o passwd_file=/home/ubuntu/s3fs_creds

 

Error:

s3fs: BUCKET Samb, name not compatible with virtual-hosted style.

 

Turns out, the error is due to the bucket name having an uppercase letter.

Made a new bucket in S3 (via the web interface) and copied the data files to the new bucket. Will try mounting again once the files are copied over (this will take a while; the two files total 36GB).
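
The same thing can be done from the CLI (a sketch; “samb” below is just an example of a lowercase-only bucket name, not necessarily the name I used):

aws s3 mb s3://samb
aws s3 sync s3://Samb s3://samb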

Docker – One-Liner to Create Docker Container

One-liner to create a Docker container for Jupyter notebook usage and data analysis on roadrunner (Xserve):

docker run -p 8888:8888 -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/notebooks -v /Users/sam/data/:/data -v /Users/sam/analysis/:/analysis -it kubu4/bioinformatics:v11 /bin/bash

This does the following:

  • Maps roadrunner port 8888 to Docker container port 8888 (for Jupyter notebook access outside of the Docker container)
  • Mounts my local Jupyter notebooks directory to the /notebooks directory in the Docker container

  • Mounts my local data directory to the /data directory in the Docker container

  • Mounts my local analysis directory to the /analysis directory in the Docker container

These port mappings and volume mounts allow me to interact with data outside of the Docker container.
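
Once inside the container, a Jupyter Notebook server can then be pointed at the mounted notebooks directory and exposed on the mapped port (a sketch; exact flags depend on the Jupyter version installed in the image):

jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 --notebook-dir=/notebooks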

Docker – Improving Roberts Lab Reproducibility

In an attempt at furthering our lab’s ability to maximize reproducibility, I’ve been working on developing an all-encompassing Docker image. Docker is, loosely speaking, a type of virtual machine (i.e. a self-contained computer that runs within your computer). For the Roberts Lab, the advantage of using Docker is that Docker images can be customized to run a specific suite of software, and these images can then be used by anyone else in the lab, regardless of operating system (assuming they can run Docker on their particular operating system). In turn, if everyone is using the same Docker image (i.e. the same virtual machine with all the same software), then we should be able to reproduce data analyses more reliably, because there won’t be differences between the software versions people are using. Additionally, using Docker greatly simplifies the setup of new computers with the requisite software.

I’ve put together a Dockerfile (a Dockerfile is a text file/script that Docker uses to retrieve software and build a computer image following those specific instructions), which will automatically build a Docker image (i.e. virtual computer) that contains all of the usual bioinformatics software our lab uses. This has been a side project while I wait for Stacks analysis to complete (or fail, depending on the day) and it’s finally usable! The image built from this Dockerfile will even let the user run RStudio and/or Jupyter Notebooks in their browser (I’m excited about this part)!

Here’s the current list of software that will be installed:

bedtools 2.25.0
bismark 0.15.0
blast 2.3.0+
bowtie2 2.2.8
bsmap 2.90
cufflinks 2.1.1
fastqc 0.11.5
fastx_toolkit 0.0.13
R 3.2.5
RStudio Server 0.99
pyrad 3.0.66
samtools 0.1.19
stacks 1.40
tophat 2.1.1
trimmomatic 0.36

In order to set this up, you need to install Docker and download the Dockerfile (Dockerfile.bio) I’ve created.
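
Building the image from the Dockerfile looks roughly like this (a sketch; the image tag is arbitrary, and the build takes a while given how much software gets installed):

docker build -t bioinformatics -f Dockerfile.bio .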

I’ve written a bit of a user guide (specific to this Dockerfile) here to get people started: docker.md

The user guide explains a bit about how all of this works and tries to progress from a “basic” this-is-how-to-get-started-with-Docker level to an “advanced” description of how to map ports, mount local volumes in your containers, and start/attach previously used containers.

The next major goal I have with this Docker project is to get the R kernel installed for Jupyter Notebooks. Currently, the Jupyter Notebook installation is restricted to the default Python 2 kernel.
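
The usual route for adding an R kernel to Jupyter is the IRkernel package; at the time, that generally meant installing it from GitHub inside the image and then registering the kernelspec (a sketch, assuming devtools is already available in the image; this is not yet part of the Dockerfile):

R -e "devtools::install_github('IRkernel/IRkernel'); IRkernel::installspec()"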

Additionally, I’d like to improve the usability of the Docker image by setting up aliases in the image. Meaning, a user who wants to use the bowtie2 program could just type “bowtie2”. Currently, the user has to type “bowtie2_2.2.8” (although, with this being in the system PATH and with tab-completion, it’s not that big of a deal), which is a bit ugly.
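
One lightweight way to do that would be symlinks (or shell aliases) baked into the image; for example (a sketch, run inside the image, which assumes the versioned binary name is already on the PATH as described above):

ln -s $(which bowtie2_2.2.8) /usr/local/bin/bowtie2    # un-versioned name points at the versioned binary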

For some next-level stuff, I’d also like to set up all Roberts Lab computers to automatically launch the Docker image when the user opens a terminal. This would greatly simplify things for new lab members. They wouldn’t have to deal with going through the various Docker commands to start a Docker container. Instead, their terminal would put them directly into a container, and the user would be none the wiser. They’d be reproducibly conducting data analyses without even having to think about it.

Computing – Speed Benchmark Comparisons Between Local, External, & Server Files

I decided to run a quick test to see what difference in speed (i.e. time) we might see between handling files that are stored locally, on an external hard drive (HDD), or on our server (Owl).

This isn’t tightly controlled, because it’s possible that other people were using resources on the server, thus slowing things down. However, that situation reflects true real-world use, so it’s probably an accurate representation of what we’d experience on a daily basis.
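
The comparison itself is in the notebook linked below; the basic approach is just timing the same file operation against each storage location (a sketch with hypothetical file paths; the actual commands and results are in the notebook):

time cp /Users/sam/data/some_file.fq.gz /dev/null                      # local disk
time cp /Volumes/ext_hdd/some_file.fq.gz /dev/null                     # external HDD
time cp /Volumes/web/nightingales/O_lurida/some_file.fq.gz /dev/null   # server (Owl)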

https://github.com/sr320/LabDocs/blob/master/jupyter_nbs/sam/20160427_speed_comparison.ipynb