Computing – Retrieve data from Amazon EC2 Instance

I had an existing EC2 instance that still had data on it from my PyRAD analysis on 20160727, and I needed to retrieve that data.

Logged into Amazon AWS via the web interface and started my existing instance (via the Actions > Instance State > Start menu). After the instance started and generated a new public IP address, I SSH’d into the instance:

ssh -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address

NOTE: I needed the full path to the PEM file! I tried multiple times using a relative path (e.g. ~/Documents/bioinformatics.pem) and received error messages that the file did not exist, followed by “Permission denied (publickey)”.
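
For what it’s worth, two things commonly produce exactly those errors (hedged guesses on my part, not confirmed causes here): a tilde inside double quotes is not expanded by the shell, so the quoted relative path is looked for literally, and SSH will refuse a key file whose permissions are too open.

# Tilde inside double quotes is not expanded, so "~/Documents/bioinformatics.pem" is taken literally
# and SSH then falls back to "Permission denied (publickey)". The absolute path (and a locked-down key) avoids both:
chmod 400 /full/path/to/bioinformatics.pem
ssh -i /full/path/to/bioinformatics.pem ubuntu@instance.public.ip.address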

Changed to the directory with the PyRAD analysis and created a tarball to speed up eventual download from the EC2 instance to my local computer:

tar -cvzf 20160715_pyrad_analysis.tar.gz /home/ubuntu/data/analysis/

After compression, I used secure copy to copy the file from the EC2 instance to my local computer:

scp -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address:/home/ubuntu/data/20160715_pyrad_analysis.tar.gz /Volumes/toaster/sam/

This didn’t work initially because I attempted to run the transfer from Hummingbird (instead of my computer), and the SSH connection kept timing out. The reason was that I hadn’t previously used Hummingbird to connect to the EC2 instance, so Hummingbird’s IP address wasn’t listed in the instance’s Security Group as being allowed to connect. I added an inbound rule for it using the Amazon AWS web interface.
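
For reference, a rough sketch of the equivalent change via the AWS CLI (the security group ID and IP below are placeholders, not the actual values used):

# Allow inbound SSH (TCP port 22) from Hummingbird's public IP (placeholder values)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --cidr hummingbird.public.ip.address/32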

Once transfer was complete, I terminated the EC2 instance and the corresponding data volume.

Goals – November 2016

Well, I’m serious this time. My goal for this month is to complete the Oly GBS data analysis and get the data sets and analysis prepared and placed in satisfactory repositories, in preparation for publication in Scientific Data.

Additionally, I feel like I need to better document what I spend (waste?) my time on. For example, last month I certainly got sidetracked trying to help troubleshoot Docker. Here are just some of the issues that were encountered:

Despite having that list, I really should have notebook entries for each day I’m in lab, even if my day is spent struggling to get software installed and I don’t have any “product” for the day. Having the documentation of what I tried, what worked/didn’t work, will be helpful for future troubleshooting, and will provide some evidence that I actually did stuff.

So, I guess that’s a second goal for the month: Improve notebook documentation for days when I don’t generate a “product.”

Data Management – Geoduck Small Insert Library Genome Assembly from BGI

Received another set of Panopea generosa genome assembly data from BGI back in May! I neglected to create MD5 checksums, as well as a readme file, for this data set! Of course, I recently needed some of the info that the readme file should’ve had, and it wasn’t there. So, here’s the skinny…

It’s data assembled from the small insert libraries they created for this project.

All data is stored here: http://owl.fish.washington.edu/P_generosa_genome_assemblies_BGI/20160512/

They’ve provided a Genome Survey (PDF) that has some info about the data they’ve assembled. In it is the estimated genome size:

Geoduck genome size: 2972.9 Mb

Additionally, there’s a table breaking down the N50 distributions of scaffold and contig sizes.

Data management stuff was performed in a Jupyter (IPython) notebook; see below.

Jupyter Notebook: 20161025_Pgenerosa_Small_Library_Genome_Read_Counts.ipynb
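
The notebook itself isn’t reproduced here, but the gist of the data management steps is sketched below; the file names/extensions are hypothetical stand-ins for the actual BGI assembly files.

# Generate MD5 checksums for the assembly files (file names are hypothetical)
for i in *.fa.gz; do md5 "$i" >> checksums.md5; done

# Count scaffolds in each FASTA and append the counts to a readme file
for i in *.fa.gz; do scaffolds=$(gunzip -c "$i" | grep -c '^>'); printf "%s\t%s\n" "$i" "$scaffolds" >> readme.md; done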

Data Received – Jay’s Coral epiRADseq – Not Demultiplexed

Previously, I downloaded Jay’s epiRADseq data provided by the Genomic Sequencing Laboratory at UC-Berkeley. It came already demultiplexed (which is very nice of them!). To be completionists on our end, we requested the non-demultiplexed data set as well.

Downloaded the FASTQ files from the project directory to Owl/nightingales/Porites_spp:

time wget -r -np -nc --ask-password ftp://gslftp@gslserver.qb3.berkeley.edu/160830_100PE_HS4KB_L4
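
# Flag notes: -r = recursive download, -np = don't ascend to the parent directory,
# -nc = skip files that already exist locally, --ask-password = prompt for the FTP account password.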


It took a while:

FINISHED --2016-09-19 11:39:21--
Total wall clock time: 4h 26m 21s
Downloaded: 11 files, 36G in 4h 17m 18s (2.39 MB/s)

Here are the files:

  • JD001_A_S1_L004_R2_001.fastq.gz
  • JD001_A_S1_L004_R1_001.fastq.gz
  • JD001_A_S1_L004_I1_001.fastq.gz
  • 160830_100PE_HS4KB_L4_Stats/
    • AdapterTrimming.txt
    • ConversionStats.xml
    • DemultiplexingStats.xml
    • DemuxSummaryF1L4.txt
    • FastqSummaryF1L4.txt


Generated MD5 checksums for each file:

for i in *.gz; do md5 $i >> checksums.md5; done
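
(If this were run on a Linux machine instead of OS X, the equivalent command would use md5sum; which machine this actually ran on is an assumption on my part.)

for i in *.gz; do md5sum "$i" >> checksums.md5; done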


Calculate total number of reads for this sequencing run:

totalreads=0; for i in *S1*R*.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total reads: 662,868,166 (this isn’t entirely accurate, as it is counting all three files; probably should’ve just counted the R1 and R2 files…)
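
A version that counts only the R1/R2 files and skips the I1 index file might look like the following (the glob is based on the file names listed above):

totalreads=0; for i in *_R[12]_*.fastq.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads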


Calculate read counts for each file and write the data to the readme.md file in the Owl/web/nightingales/Porites_spp directory:

for i in *S1*R*.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n" "$i" "$readcount" >> readme.md; done

See this Jupyter notebook for code explanations.

Added sequencing info to the Next_Gen_Seq_Library_Database (Google Sheet), the Nightingales Spreadsheet (Google Sheet), and the Nightingales Fusion Table (Google Fusion Table).

Oyster Sampling – Olympia Oyster OA Populations at Manchester

I helped Katherine Silliman with oyster sampling today for her ocean acidification experiment with Olympia oysters (Ostrea lurida) at the Kenneth K. Chew Center for Shellfish Research & Restoration, which is housed at the NOAA Northwest Fisheries Science Center at Manchester in partnership with the Puget Sound Restoration Fund (PSRF). We sampled the following tissues and stored them in 1 mL of RNAlater:

  • adductor muscle (A)
  • ctenidia (C)
  • mantle (M)

When there was sufficient ctenidia tissue, an additional sample was stored in 75% ethanol for potential microbial analysis.

Tissue was collected from two oysters from each of the following oyster populations:

  • British Columbia (BC)
  • California (CA)
  • Oregon (OR)

Oysters were sampled from each of the following tanks:

  • 1A
  • 2A
  • 3A
  • 4A
  • 1B
  • 2B
  • 3B
  • 4B

Tubes were labeled in the following fashion:

  1. Population & Tank (e.g. OR3B)
  2. Tag#
  3. Tissue

If no tag was present on the oyster, it was assigned a number (beginning at 150 and increasing sequentially) and photographed with a ruler for future measurement. For white tags, the tag number was recorded followed by the letter ‘W’ (e.g. 78W); no color info was recorded for tags of other colors.

Additionally, gonad developmental stage was roughly assessed: ripe, kinda ripe, or not ripe.

All info was recorded by Katherine in her notepad. All samples were retained by Katherine (not sure where she stored them).

Between oysters, utensils were flame-sterilized and gloves/work surfaces were washed with a 10% bleach solution.

Here are a few pics from the day:

Data Received – Jay’s Coral epiRADseq

We received notice that Jay’s coral (Porites spp) epiRADseq data was available from the Genomic Sequencing Laboratory at UC-Berkeley.

Downloaded the FASTQ files from the project directory to Owl/nightingales/Porites_spp:

time wget -r -np -nc -A "*.gz" --ask-password ftp://gslftp@gslserver.qb3.berkeley.edu/160830_100PE_HS4KB/Roberts
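
# Flag note: -A "*.gz" limits the recursive download to files matching *.gz (i.e. the gzipped FASTQ files).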


Generated MD5 checksums for each file:

for i in *.gz; do md5 $i >> checksums.md5; done


Calculate total number of reads for this sequencing run:

totalreads=0; for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total reads: 573,378,864


Calculate read counts for each file and write the data to the readme.md file in the Owl/web/nightingales/Porites_spp directory:

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n" "$i" "$readcount" >> readme.md; done

See this Jupyter notebook for code explanations.

Added sequencing info to the Next_Gen_Seq_Library_Database (Google Sheet), the Nightingales Spreadsheet (Google Sheet), and the Nightingales Fusion Table (Google Fusion Table).

Data Management – Synology Cloud Sync to UW Google Drive

After a bit of a scare this weekend (Synology DX513 expansion unit no longer detected two drives after a system update – resolved by powering down the system and rebooting it), we revisited our approach for backing up data.

Our decision was to utilize UW Google Drive, as it provides unlimited storage space!

Synology has an available app for syncing data to/from Google Drive, so I set up both Owl (Synology DS1812+) and Eagle (Synology DS413) to sync all of their data to a shared UW Google Drive account. This should provide a functional backup solution for the massive amounts of data we’re storing, and it will simplify tracking what is backed up and where. Now, instead of wondering whether certain directories are backed up via CrashPlan, Backblaze, or Time Backup to another Synology server, we know that everything is backed up to Google Drive.
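
For anyone without the Cloud Sync app, roughly the same one-way backup could be sketched from the command line with rclone; the tool, remote name, and paths below are purely illustrative and are not what was actually used here.

# Assumes an rclone remote named "uwgdrive" has already been configured (via rclone config)
# against the shared UW Google Drive account; one-way sync of a Synology shared folder:
rclone sync /volume1/web uwgdrive:owl_backup/web --progress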


Server HDD Failure – Owl

Noticed that Owl (Synology DS1812+ server) was beeping.

I also noticed, just like the last time we had to replace an HDD in Owl, that I didn’t receive a notification email… As it turns out, the reason this time was that I had changed my UW password, and we use my UW account to authenticate with the UW email server that Owl sends through. So the emails Owl has been trying to send were failing because the stored password was no longer valid… Yikes!

Anyway, I’ve updated the password on Owl for using the UW email servers and swapped out the bad drive with a backup drive we keep on hand for just such an occasion. See the first post about this subject for a bit more detail on the process of swapping hard drives.


Unfortunately, the dead HDD is out of warranty; however, we already have another backup drive on hand.

Below are some screen caps of today’s incident:

Notice the empty slot in the graphical representation of the disk layout, as well as the “Available Slots” count showing 1.

After replacing the HDD (but before the system has rebuilt onto the new drive), the formerly empty slot is represented as a green block, “Available Slots” is now 0, and “Unused Disks” is now 1.