Category Archives: Olympia oyster reciprocal transplant

FASTQC – Oly BGI GBS Raw Illumina Data

In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.

Below, is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded for quick viewing, but it might be easier to view the notebook via the GitHub link.

Results:

Well, I realized that running FastQC on the raw data might not reveal anything all too helpful. The reason for this is that the adaptor and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s intepretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.

However, I’ll need to discuss with Steven about whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…

Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb

Data Management – SRA Submission of Ostrea lurida GBS FASTQ Files

0000-0002-2747-368X

Prepared a short read archive (SRA) submission for archiving our Olympia oyster genotype-by-sequencing (GBS) data in NCBI. This is in preparation for submission of the mansucript we’re putting together.

I followed my outline/guideline for navigating the SRA submission process, as it’s a bit of a pain in the neck. Glad my notes were actually useful!

The following two files are currently being uploaded via FTP; the process will take about 3hrs, as each file is ~18GB in size:

They are being submitted under the following accession numbers (note: a final accession number will be provided once this is publicly available; I will update this post when that happens):

Manuscript Writing – The “Nuances” of Using Authorea

0000-0002-2747-368X

I’m currently trying to write a manuscript covering our genotype-by-sequencing data for the Olympia oyster using the Authorea.com platform and am encountering some issues that are a bit frustrating. Here’s what’s happening (and the ways I’ve managed to get around the problems).

PROBLEM: Authorea spits out a browser-crashing “unresponsive script” message (actually, lots and lots of them; clicking “Stop script” or “Continue” just results in additional messages) in Firefox (haven’t tried any other browsers). This renders the browser inoperable and I have to force quit. It doesn’t happen all of the time, so it’s hard to pinpoint what triggers this.

SOLUTION: Edit documents in Git/GitHub. I have my Authorea manuscript linked to a GitHub repo, which allows me to write without using Authorea.com. This is how I’ll be doing my writing the majority of the time anyway, but I would like to use Authorea.com to insert and manage citations…

PROBLEM: Authorea remains in a perpetual “saving…” state after inserting a citation. It also renders the page strangely, with HTML <br></br> tags (see the “Methods” section in the screen cap below).

SOLUTION: Type additional text somewhere, anywhere. This is an OK solution, but is particularly annoying if I just want to go through and add citations and have no intentions of doing any writing.

PROBLEM: Multi-author citations don’t get formatted with “et al.” By default, Authorea inserts all citations using the following LaTeX format:

\cite{Elshire_2011}

Result: (Elshire 2011).

This is a problem because this reference has multiple authors and should be written as: (Elshire et al., 2011).

SOLUTION: Change citation format to:

\citep{Elshire_2011}

Other citation formatting options can be found here (including multiple citations within one set of parentheses, and referring in-text author name with only publication year in parentheses):

How to add and manage citations and references in Authorea

PROBLEM: When a citation no longer exists in the manuscript, it still persists in the bibliography.

SOLUTION: A known bug with no current solution. Currently, have to delete them from the bibliography by hand (or, maybe figure out a way to do it programatically)…

PROBLEM: Cannot click-and-drag some references from Mendeley (haven’t tested other reference managers) without getting an error. To my knowledge, the BibTeX is valid, as it appears to be the same formatting as other references that can be inserted via the click-and-drag method. There are some references it won’t work for…

SOLUTION: Use the search bar in the citation insertion dialogue box. Not as convenient and slows down the workflow for citation insertion, but it works…

Data Analysis – Continued O.lurida Fst Analysis from GBS Data

0000-0002-2747-368X

Continued the analysis I started the other day. Still following Katherine Silliman’s notebook for guidance.

Quick summary of this analysis:

Mean Fst comparing all populations = 0.139539326187024
Mean Fst HL vs NF = 0.143075552548742
Mean Fst HL vs SN = 0.155234939276722
Mean Fst NF vs SN = 0.117889300124951

NOTE: Mean Fst values were calculated after replacing negative Fst values with 0. Thus, the means are higher than they would be had the raw data been used. I followed Katherine’s notebook and she doesn’t explicitly explain why she does this, nor what the potential implications are for interpreting the data. Will have to discus her rationale behind this with her.

Jupyter notebook: 20161201_docker_oly_vcf_analysis_R.ipynb

Data Analysis – Initial O.lurida Fst Determination from GBS Data

0000-0002-2747-368X

Finally running some analysis on the output from my PyRad analysison 20160727.

I’m following Katherine Silliman’s Jupyter notebook (2bRAD Subset Population Structure Analysis.ipynb) as a guide.

The initial analysis (which isn’t much) is in the Jupyter notebook below. The analysis will be continued on a later date.

Jupyter notebook: 20161117_docker_oly_vcf_analysis.ipynb

I’ve embedded the notebook below, but it’s much easier to view (there are many lengthy commands/filenames that wrap lines in the embedded version below) the actual file linked above.

Computing – Retrieve data from Amazon EC2 Instance

0000-0002-2747-368X

I had an existing instance that still had data on it from my PyRad analysis on 20160727 that I needed to retrieve.

Logged into Amazon AWS via the web interface and started my existing instance (via the Actions > Instance State > Start menu). After the instance started and generated a new public IP address, I SSH’d into the instance:

ssh -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address

NOTE: I needed the full path to the PEM file! Tried multiple times using a relative path (e.g. ~/Documents/bionformatics.pem) and received error messages that the file did not exist and “Permission denied (public key)”.

Changed to the directory with the PyRAD analysis and created a tarball to speed up eventual download from the EC2 instance to my local computer:

tar -cvzf 20160715_pyrad_analysis.tar.gz /home/ubuntu/data/analysis/

After compression, I used secure copy to copy the file from the EC2 instance to my local computer:

scp -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address:/home/ubuntu/data/20160715_pyrad_analysis.tar.gz /Volumes/toaster/sam/

This didn’t work initially because I attempted to transfer the file using Hummingbird (instead of my computer). The SSH connection kept timing out. The reason for this was that I hadn’t previously used Hummingbird to connect to the EC2 instance and Hummingbird’s IP address wasn’t listed in the Security Groups table as being allowed to connect. I made that change using the Amazon AWS web interface:

Once transfer was complete, I terminated the EC2 instance and the corresponding data volume.

Data Analysis – fastStructure Population Analysis of Oly GBS PyRAD Output

0000-0002-2747-368X

After some basal readings about what Fst is (see notebook below for a definition and reference), I decided to try to use fastStructure to analyze the PyRAD output from 20160727.

The quick, TL;DR: after spending a bunch of time installing the program, it doesn’t handle the default Structure file (.str); requires some companion file types that PyRAD doesn’t output.

I’ve put this here for posterity and background reference on Fst…

Will proceed with using the full blown Structure program to try to glean some info from these three populations.

Jupyter Notebook: 20160816_oly_gbs_fst_calcs.ipynb

Data Analysis – PyRad Analysis of Olympia Oyster GBS Data

0000-0002-2747-368X

Previously, I ran a PyRad analysis on just a subset of these samples in an attempt to have some data for a grant pre-proposal.

I’ve now completed a PyRad analysis on the full set. Now, I just need to figure out what to do with the output from this…

Jupyter Notebook: 20160715_ec2_oly_gbs_pyrad.ipynb

Computing – Amazon EC2 Instance Out of Space?

0000-0002-2747-368X

Running PyRad analysis on the Olympia oyster GBS data. PyRad exited with warnings about running out of space. However, looking at free disk space on the EC2 Instance suggests that there’s still space left on the disk. Possibly PyRad monitors the expected disk space usage during analysis to verify there will be sufficient disk space to write to? Regardless, will expand EC2 volume instance to a larger size…

Data Analysis – Oly GBS Data Using Stacks 1.37

0000-0002-2747-368X

This analysis ran (or, more properly, was attempted) for a couple of weeks and failed a few times. The failures seemed to be linked to the external hard drive I was reading/writing data to. It continually locked up, leading to “Segmentation fault” errors.

We’ve replaced the external with a different one in hopes that it’ll be able to handle the workload. Will be attempting to re-run Stacks with the new external hard drive.

I’m posting the Jupyter notebook here for posterity.

Jupyter notebook: 20160428_Oly_GBS_STACKS.ipynb

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab

Category Archives: Olympia oyster reciprocal transplant

FASTQC – Oly BGI GBS Raw Illumina Data

Data Management – SRA Submission of Ostrea lurida GBS FASTQ Files

Manuscript Writing – The “Nuances” of Using Authorea

Data Analysis – Continued O.lurida Fst Analysis from GBS Data

Data Analysis – Initial O.lurida Fst Determination from GBS Data

Computing – Retrieve data from Amazon EC2 Instance

Data Analysis – fastStructure Population Analysis of Oly GBS PyRAD Output

Data Analysis – PyRad Analysis of Olympia Oyster GBS Data

Computing – Amazon EC2 Instance Out of Space?

Data Analysis – Oly GBS Data Using Stacks 1.37