Computing – The Very Quick “Guide” to Amazon Web Services Cloud Computing Instances (EC2)

This all takes a surprisingly long time to set up.

Set up AWS Identity and Access Management (IAM): http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html?icmpid=docs_iam_console

Install AWS command line interface: https://aws.amazon.com/cli/

Copy files to S3 bucket:

aws s3 cp /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz s3://Samb
aws s3 cp /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz s3://Samb
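
To confirm both transfers completed, listing the bucket should show the two files (something like the following, using the same bucket name as above):

aws s3 ls s3://Samb --human-readable --summarize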

Launched an EC2 c4.2xlarge instance (Ubuntu 14.04 LTS; 8 vCPUs, 15 GiB RAM). The security group was configured with SSH open (TCP, port 22) and with TCP port 8888 open for tunneling to Jupyter Notebook, both set to “My IP” to limit access to these ports.
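
The security group and launch were done through the web console; a rough CLI equivalent (the group name, AMI ID, and IP address below are placeholders) would look something like this:

aws ec2 create-security-group --group-name bioinformatics-sg --description "SSH and Jupyter from my IP"   # hypothetical group name
aws ec2 authorize-security-group-ingress --group-name bioinformatics-sg --protocol tcp --port 22 --cidr 203.0.113.10/32    # replace with your own IP
aws ec2 authorize-security-group-ingress --group-name bioinformatics-sg --protocol tcp --port 8888 --cidr 203.0.113.10/32
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c4.2xlarge --key-name bioinformatics --security-groups bioinformatics-sg   # ami-xxxxxxxx = Ubuntu 14.04 AMI for your region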

Create new key pair. Have to change permissions:

chmod 400 bioinformatics.pem
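
The key pair itself was created through the console; a CLI equivalent would be roughly:

aws ec2 create-key-pair --key-name bioinformatics --query 'KeyMaterial' --output text > bioinformatics.pem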

 

Connect to instance

For the Amazon Linux AMI:

ssh -i "bioinformatics.pem" ec2-user@ip.address.of.instance

 

For an Ubuntu Server AMI:

ssh -i "bioinformatics.pem" ubuntu@ip.address.of.instance


Update/upgrade the default Ubuntu packages after the initial launch:

sudo apt-get update
sudo apt-get upgrade

 

Set up Docker

Install Docker for Ubuntu 14.04 and copy our bioinformatics Dockerfile to the /home directory of the EC2 instance:

ssh -i "bioinformatics.pem" /Users/Sam/GitRepos/LabDocs/code/dockerfiles/Dockerfile.bio ubuntu@ip.address.of.instance:

Access data stored in Amazon S3 bucket(s)

Mounting S3 storage as a volume in the EC2 instance requires s3fs-fuse: https://github.com/s3fs-fuse/s3fs-fuse
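
s3fs also needs a credentials file (AWS access key ID and secret key, colon-separated) and a mount point before the mount command below will work; a minimal sketch with placeholder keys:

echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > /home/ubuntu/s3fs_creds   # placeholder credentials
chmod 600 /home/ubuntu/s3fs_creds   # s3fs refuses group/world-readable credential files
sudo mkdir -p /mnt/s3bucket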

 

Mount bucket:

sudo s3fs Samb /mnt/s3bucket/ -o passwd_file=/home/ubuntu/s3fs_creds

 

Error:

s3fs: BUCKET Samb, name not compatible with virtual-hosted style.

 

Turns out, the error is due to the bucket name having an uppercase letter.

Made a new bucket in S3 (via the web interface) and copied the data files to the new bucket. Will try mounting again once the files are copied over (this will take a while; the two files total 36 GB).
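
For reference, the same fix could also be scripted from the CLI (the lowercase bucket name here is hypothetical; the actual copy was kicked off through the web interface):

aws s3 mb s3://samb-gbs                # make a new, all-lowercase bucket (hypothetical name)
aws s3 sync s3://Samb s3://samb-gbs    # copy the existing files into it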

DNA Methylation Quantification – Coral DNA from Jose M. Eirin-Lopez (Florida International University)

Ran the coral DNA I quantified on 20160630 through the MethylFlash Methylated DNA Quantification Kit [Colorimetric] (Epigentek) to quantify global methylation.

Used 100 ng of DNA in 8 μL per replicate (2 replicates = 200 ng total in 16 μL). Calcs are here (Google Sheet): 20160705_coral_DNA_methylation_calcs

Manufacturer’s protocol was followed.

Dilutions of kit reagents:

ME5 (1:1000): 2.6 μL ME5 + 2597.4 μL diluted ME1
ME6 (1:2000): 1.3 μL ME6 + 2598.7 μL diluted ME1
ME7 (1:5000): 0.52 μL ME7 + 2599.48 μL diluted ME1

Samples were quantified on the Seeb lab’s plate reader @ 450 nm (Wallac 1420 Victor 2 [Perkin Elmer]).

Results:

Google Sheet: 20160707_coral_DNA_methylflash

Sample Treatment 5-mC (ng)
H1_1 nitrogen 0.8712248853
H1_10 nitrogen 0.6917168368
H1_12 control 0.2738478893
H1_5 nitrogen & phosphorous 0.9663585942
H1_6 control 0.6494783825
H1_8 nitrogen & phosphorous 0.4244913398
H24_1 nitrogen 0.372603297
H24_10 nitrogen 0.4237237786
H24_12 control 0.5350511937
H24_5 nitrogen & phosphorous 0.1495527697
H24_6 control 0.2291900804
H24_8 nitrogen & phosphorous 0.2213437801
H5_1 nitrogen -0.1233169902
H5_10 nitrogen 0.6997668774
H5_12 control 0.2307000493
H5_5 nitrogen & phosphorous -0.07790933048
H5_6 control 0.4562401662
H5_8 nitrogen & phosphorous 0.5949647121

 

Overall, it’s difficult to really interpret these results. I believe the data is a time course (e.g. H5 = hour 5, H24 = hour 24). Additionally, looking at treatments, there appear to be replicates, but it’s not clear what type of replicates they are (i.e. technical or biological). Generally, it seems like the control samples have lower quantities of methylated DNA than the treated samples. However, this doesn’t hold true for all three of the groups.

And, not that it really matters, but I don’t even know what species this is…

In any case, this was an attempt to gather some preliminary data for a grant that Steven is attempting to put together, so the original experiment and the subsequent data aren’t as robust as one would expect for a full-blown research project.

Goals – July 2016

Unfortunately, most of this month’s goals are the same as last month’s!

 

  • Process Olympia oyster GBS data. I’ve been running two different analyses (Stacks and PyRad) on two different machines (Hummingbird and Roadrunner, respectively) and I keep encountering different problems! For example, just yesterday, the following error popped up in Terminal on my SSH connection to a Docker container on Roadrunner running PyRad in a Jupyter Notebook:

 

The computer became completely unresponsive (for the second time in less than 24hrs). Maybe the problem is Docker. Maybe it’s creating a remote tunnel to a Docker container. Maybe it’s running Jupyter Notebook through a remote tunnel into a Docker container? I don’t know. At this point, I’ll just install PyRad directly on Roadrunner and try to get the analysis done that way. It certainly isn’t convenient because it means I have to be physically present at Roadrunner to execute commands and check on things…

 

  • Get stuff running on Amazon AWS and Hyak as soon as possible. I think the increased computing power and greater stability of those computing environments will improve my chances of actually completing the Oly GBS analysis.

  • Quantify coral DNA methylation. This should be straightforward and completed on Tuesday, 20160705.

 

DNA Quantification – Coral DNA from Jose M. Eirin-Lopez (Florida International University)

Quantified the DNA we received from Jose on 20160615 using the Qubit 3.0 Fluorometer (ThermoFisher) with the dsDNA Broad Range (BR) Kit, according to the manufacturer’s protocol. Used 1 μL of each sample.

Results are here (Google Sheet): Coral_DNA_QubitData_2016-06-30_08-45-56.xls

Here is a table of sample concentrations:

Sample Concentration (ng/μL)
H1 1 52.4
H1 5 34
H1 6 13
H1 8 22
H1 10 39
H1 12 52.4
H5 1 14.7
H5 5 20.8
H5 6 54
H5 8 18.4
H5 10 46.6
H5 12 29.8
H24 1 16.2
H24 5 25
H24 6 20.2
H24 8 22
H24 10 22
H24 12 30.6

 

Will proceed with DNA methylation assessment.

Samples Received – Coral DNA from Jose M. Eirin-Lopez (Florida International University)

Steven received these coral DNA samples today (stored @ 4°C in FTR 213). Here’s his post on Google Plus:

 

 

Here’s the email from Jose describing the samples:

“Dear Steven, the coral DNA samples were sent today by my student Javier (cc’ed here) to your lab. Here’s an excel attached with info for the samples including concentration and treatment of the coral from which they were extracted (N, nitrogen; NP, nitrogen+phosphorous; C, control).

Please let us know when you get these in the lab so we know all is fine!

thanks!

Chema”

Here’s the spreadsheet he sent (renamed for easier identification later on; the original file was titled DNA Qbit readings), uploaded to Google Drive:

20160615_coral_DNA_Qbit_readings.xls

Docker – VirtualBox Defaults on OS X

I noticed a discrepancy between what system info is detected natively on Roadrunner (Apple Xserve) and what was being shown when I started a Docker container.

Here’s what Roadrunner’s system info looks like outside of a Docker container:

 

However, here’s what is seen when running a Docker container:

 

 

It’s important to notice that the Docker container is only seeing 2 CPUs. Ideally, the Docker container would see that this system has 8 cores available. By default, however, it does not. To remedy this, the user has to adjust settings in VirtualBox. VirtualBox is virtualization software that gets installed with the Docker Toolbox for OS X. Apparently, Docker runs within a VirtualBox virtual machine, but this is not really transparent to a beginner Docker user on OS X.
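
A quick way to check what containers are seeing, without screenshots (the stock ubuntu:14.04 image is used here just as an example):

docker run --rm ubuntu:14.04 nproc                         # CPUs visible inside a container
docker run --rm ubuntu:14.04 grep MemTotal /proc/meminfo   # RAM visible inside a container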

To change the way VirtualBox (and, in turn, Docker) can access the full system hardware, you must launch the VirtualBox application (if you installed Docker using Docker Toolbox, you should be able to find this in your Applications folder). Once you’ve launched VirtualBox, you’ll have to turn off the virtual machine that’s currently running. Once that’s been accomplished, you can make changes and then restart the virtual machine.
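
If you’d rather skip the GUI, the same change can likely be made from the command line with docker-machine and VBoxManage (a sketch, assuming the Toolbox-created VM has the default name “default”):

docker-machine stop default                            # stop the VirtualBox VM that Docker Toolbox created
VBoxManage modifyvm default --cpus 8 --memory 24576    # 8 CPUs, 24GB RAM (value is in MB)
docker-machine start default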

 

Shut down the VirtualBox machine before making changes:

 

Here are the default CPU settings that VirtualBox is using:

 

 

Maxed out the CPU slider:


Here are the default RAM settings that VirtualBox is using:


Changed RAM slider to 24GB:


Now, let’s see what the Docker container reports for system info after making these changes:

 

Looking at the CPUs now, we see it has 8 listed (as opposed to only 2 initially). I think this means that Docker now has full access to the hardware on this machine.

This situation is a weird shortcoming of Docker (and/or VirtualBox). Additionally, I think this issue might only exist on the OS X and Windows versions of Docker, since they require the installation of the Docker Toolbox (which installs VirtualBox). I don’t think Linux installations suffer from this issue.

Docker – One liner to create Docker container

One liner to create Docker container for Jupyter notebook usage and data analysis on roadrunner (Xserve):

docker run -p 8888:8888 -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/notebooks -v /Users/sam/data/:/data -v /Users/sam/analysis/:/analysis -it kubu4/bioinformatics:v11 /bin/bash

This does the following:

  • Maps roadrunner port 8888 to Docker container port 8888 (for Jupyter notebook access outside of the Docker container)
  • Mounts my local Jupyter notebooks directory to the /notebooks directory in the Docker container
  • Mounts my local data directory to the /data directory in the Docker container
  • Mounts my local analysis directory to the /analysis directory in the Docker container
These port and volume mappings allow me to interact with data stored outside of the Docker container, and to reach the Jupyter notebook server from a browser outside the container.
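
From inside that container, Jupyter then gets started so it listens on the mapped port (a sketch, assuming the jupyter command is on the image’s PATH):

jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 --notebook-dir=/notebooks

With Docker Toolbox on OS X, the notebooks are then reachable in a browser at port 8888 on the docker-machine VM’s IP (rather than localhost); docker-machine ip default prints that address.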

Goals – June 2016

Current goals are as follows:

  • Complete Oly GBS data analysis. This is getting closer to actually being done. Had some issues with an external hard drive crashing. I’ve since replaced that and the analysis is running (it takes multiple days per stage of the analysis on Hummingbird).

  • Configure computing instances on Amazon AWS to improve our ability to handle these large data sets in a more timely fashion.

  • Begin using the UW’s Hyak computing cluster to improve our ability to handle these large data sets in a more timely fashion.

Docker – Improving Roberts Lab Reproducibility

In an attempt to further our lab’s ability to maximize reproducibility, I’ve been working on developing an all-encompassing Docker image. Docker is a type of virtual machine (i.e. a self-contained computer that runs within your computer). For the Roberts Lab, the advantage of using Docker is that images can be customized to run a specific suite of software and then used by anyone else in the lab who can run Docker on their particular operating system. In turn, if everyone is using the same Docker image (i.e. the same virtual machine with all the same software), we should be able to reproduce data analyses more reliably, because there won’t be differences between the software versions people are using. Additionally, using Docker greatly simplifies the setup of new computers with the requisite software.

I’ve put together a Dockerfile (a text file of instructions that Docker uses to retrieve software and build an image) which will automatically build a Docker image (i.e. a virtual computer) that contains all of the usual bioinformatics software our lab uses. This has been a side project while I wait for the Stacks analysis to complete (or fail, depending on the day) and it’s finally usable! The image built from this Dockerfile will even let the user run RStudio and/or Jupyter Notebooks in their browser (I’m excited about this part)!

Here’s the current list of software that will be installed:

bedtools 2.25.0
bismark 0.15.0
blast 2.3.0+
bowtie2 2.2.8
bsmap 2.90
cufflinks 2.1.1
fastqc 0.11.5
fastx_toolkit 0.0.13
R 3.2.5
RStudio Server 0.99
pyrad 3.0.66
samtools 0.1.19
stacks 1.40
tophat 2.1.1
trimmomatic 0.36

In order to set this up, you need to install Docker and download the Dockerfile (Dockerfile.bio) I’ve created.
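
Roughly, building and running the image from that Dockerfile looks like this (the image tag here is arbitrary, and Dockerfile.bio is assumed to be in the current directory):

docker build -t bioinformatics -f Dockerfile.bio .
docker run -p 8888:8888 -it bioinformatics /bin/bash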

I’ve written a bit of a user guide (specific to this Dockerfile) here to get people started: docker.md

The user guide explains a bit how all of this works and tries to progress from a “basic” this-is-how-to-get-started-with-Docker to an “advanced” description of how to map ports, mount local volumes in your containers, and how to start/attach previously used containers.

The next major goal I have with this Docker project is to get the R kernel installed for Jupyter Notebooks. Currently, the Jupyter Notebook installation is restricted to the default Python 2 kernel.

Additionally, I’d like to improve the usability of the Docker image by setting up aliases in the image, meaning a user who wants to run bowtie2 could just type “bowtie2”. Currently, the user has to type “bowtie2_2.2.8” (although, with this being in the system PATH and with tab-completion, it’s not that big of a deal), which is a bit ugly.
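
One likely way to do that is a single line added to the Dockerfile (a sketch, assuming an Ubuntu/Debian base image; this is not in Dockerfile.bio yet):

RUN echo 'alias bowtie2="bowtie2_2.2.8"' >> /etc/bash.bashrc   # makes "bowtie2" work in interactive bash shells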

For some next level stuff, I’d also like to setup all Roberts Lab computers to automatically launch the Docker image when the user opens a terminal. This would greatly simplify things for new lab members. They wouldn’t have to deal with going through the various Docker commands to start a Docker container. Instead, their terminal would just put them directly into the container and the user would be none-the-wiser. They’d be reproducibly conducting data analysis without even having to think about it.
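
A sketch of what that auto-launch could look like, appended to each user’s ~/.bash_profile (the container name here is made up):

docker run --name bioinfo_shell -it kubu4/bioinformatics:v11 /bin/bash 2>/dev/null || docker start -ai bioinfo_shell
# the first login creates the named container; later logins just re-attach to it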