Computing – Oly BGI GBS Reproducibility Fail

0000-0002-2747-368X

Since we’re preparing a manuscript that relies on BGI’s manipulation/handling of the genotype-by-sequencing data, I attempted to reproduce the demultiplexing steps that BGI used in order to perform the SNP/genotyping on these samples.

The key word in the above sentence is “attempted.” Ugh, what a massive waste of time it turned out to be. I’ve contacted BGI to get some help on this.

In the meantime, here’s a brief (actually, not as brief as I’d like) rundown of my struggles.

The demultiplexing software that BGI used is something called “iTools”, which is bundled in this GitHub repo: Reseqtools

To demultiplex, they ran a script called: split.sh

The script seems fairly straightforward. Here is what it contains:

iTools Fqtools splitpool \
-InFq1 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz \
-InFq2 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz \
-Index index.lst \
-Flag enzyme.txt \
-MisMatch \
-OutDir split

It tells the iTools program to use the Fqtools tool “splitpool” to operate on a pair of gzipped FASTQ files. It also utilizes an index file (index.lst) which contains all the barcodes needed to identify, and separate, the individual samples that were combined prior to sequencing.
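To make the idea of barcode-based splitting concrete, here is a minimal sketch of demultiplexing by matching the start of each read against a barcode table. This is not how iTools splitpool is implemented (I have not seen its internals), and the barcodes, sample names, and the `max_mismatch` parameter below are made up for illustration; the real index.lst format is not shown here.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_seq, barcodes, max_mismatch=0):
    """Return the sample whose barcode best matches the read start, or None.

    Reads that match no barcode within max_mismatch, or that match two
    barcodes equally well, are left unassigned (None).
    """
    best_sample, best_dist, tie = None, max_mismatch + 1, False
    for barcode, sample in barcodes.items():
        prefix = read_seq[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than the barcode
        dist = hamming(prefix, barcode)
        if dist < best_dist:
            best_sample, best_dist, tie = sample, dist, False
        elif dist == best_dist:
            tie = True
    return None if tie else best_sample


def demultiplex(reads, barcodes, max_mismatch=0):
    """Group read sequences by assigned sample name."""
    bins = defaultdict(list)
    for seq in reads:
        sample = assign_sample(seq, barcodes, max_mismatch)
        bins[sample or "unassigned"].append(seq)
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTCAGCAAAA", "TGCACTGCGGGG", "ACGACAGCTTTT", "GGGGCAGCTTTT"]
print(demultiplex(reads, barcodes, max_mismatch=1))
```

A real tool also has to handle paired reads, quality lines, and barcodes of different lengths, none of which this sketch attempts.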

The first bump in the road is the -Flag enzyme.txt portion of the code. BGI did not provide me with this file. I recently asked them to send it to me (or its contents, since I suspected it was only a single-line text file). They sent me the contents of the file:

CAGC
CTGC

The next problem is that neither of those two sequences is the recognition site for the enzyme that was (supposedly) used: ApeKI. The recognition site for ApeKI is: GCWGC
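One thing worth checking before writing the file off: W is the IUPAC code for A or T, so GCWGC stands for GCAGC and GCTGC, and ApeKI cuts after the first G (G^CWGC). A read that starts at the cut site would therefore begin with CWGC, which is exactly CAGC or CTGC. The sketch below just confirms that the two enzyme.txt sequences are the 3' ends of the expanded recognition site. Whether that is what iTools uses the -Flag file for is my assumption; BGI has not confirmed it.

```python
from itertools import product

# Subset of IUPAC nucleotide codes needed for this site.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT"}


def expand(site):
    """Expand a site containing IUPAC codes into all concrete sequences."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in site))]


apeki_sites = expand("GCWGC")  # ['GCAGC', 'GCTGC']
enzyme_txt = ["CAGC", "CTGC"]  # contents of enzyme.txt as sent by BGI

for seq in enzyme_txt:
    matches = [site for site in apeki_sites if site.endswith(seq)]
    print(seq, "is the 3' end of", matches)
```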

Regardless, I decided to see if I could reproduce the demultiplexing using the info they’d provided me.

I cloned the Reseqtools repo, changed into the Reseqtools/iTools directory and typed make.

This resulted in an error informing me that it could not find boost/spirit/core.hpp

I tracked down the Boost library junk, downloaded the newest version and untarred it in /usr/local/bin.

Tried to run make in the Reseqtools/iTools directory and got the same error. Realized iTools might not be searching the system $PATH (this turned out to be correct), so I moved the contents of the Boost folder into the iTools directory, ran make and got the same error. Turns out, the newest version of Boost doesn’t have that core.hpp file any more. Looking at the iTools documentation, iTools was built around Boost 1.44. OMG…

Downloaded Boost 1.44 and went through the same steps as above. This eliminated the missing core.hpp error!

But, of course, led to another error. The error:

"Threading support unavaliable: it has been explicitly disabled with BOOST_DISABLE_THREADS"

That was related to something with newer versions of the GCC compiler (this is, essentially, built into the computer; it’s not worth trying to install/use old versions of GCC) trying to work with old versions of Boost. Found a patch for a config file here: libstdcpp3.hpp.patch

I made the appropriate edits to the file as shown in that link and ran make and it almost worked!

The current error is:

./src/Variants/soapsv-v1.02/include.h:15:16: fatal error: gd.h: No such file or directory

I gave up and contacted BGI to see if they can get me a functional version of iTools…
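For what it's worth, gd.h is the header for the GD graphics library (libgd), so that last error points to another missing dependency rather than a Boost problem. A quick way to see which headers the build can plausibly find is to look for them in a list of include directories. The sketch below does only that; the directories listed are common defaults and my assumption, not necessarily the ones the iTools makefile actually searches.

```python
from pathlib import Path


def find_header(header, include_dirs):
    """Return the first directory that contains the header, or None."""
    for directory in include_dirs:
        if (Path(directory) / header).is_file():
            return str(directory)
    return None


# Assumed search locations; adjust to match the makefile's -I flags.
include_dirs = ["/usr/include", "/usr/local/include", "."]

# Headers named in the two "not found" errors above.
for header in ["boost/spirit/core.hpp", "gd.h"]:
    location = find_header(header, include_dirs)
    print(header, "->", location or "not found")
```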

FASTQC – Oly BGI GBS Raw Illumina Data Demultiplexed

Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC was not pleased with the data because it’s expecting a “standard” library prep that’s already been trimmed and demultiplexed.

However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked (GitHub) and embedded below. I recommend viewing the Jupyter notebook on GitHub for easier viewing.

Results:

Pretty much the same, but with slight improvements due to removal of adapter and barcode sequences. The restriction site still leads FastQC to report errors, which is expected.

Links to all of the FastQC output files are at the bottom of the notebook.

Jupyter notebook (GitHub): 20170306_docker_fastqc_demultiplexed_bgi_oly_gbs.ipynb

FASTQC – Oly BGI GBS Raw Illumina Data

In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.

Below is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded it for quick viewing, but it might be easier to view the notebook via the GitHub link.

Results:

Well, I realized that running FastQC on the raw data might not reveal anything all too helpful. The reason for this is that the adapter and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s interpretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.
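To see the over-representation directly, one option is to count the most common read prefixes, since barcodes and the enzyme cut site sit at the start of each read. This is only a rough sketch, not what FastQC does internally; the toy records, `k`, and `max_reads` are made up for illustration.

```python
from collections import Counter


def leading_kmers(fastq_lines, k=10, max_reads=100_000):
    """Count the first k bases of each read in an iterable of FASTQ lines."""
    counts = Counter()
    n_reads = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        counts[line.strip()[:k]] += 1
        n_reads += 1
        if n_reads >= max_reads:
            break
    return counts


# Toy FASTQ records (hypothetical), standing in for a real file.
toy_fastq = [
    "@read1", "ACGTCAGCAA", "+", "IIIIIIIIII",
    "@read2", "ACGTCTGCAA", "+", "IIIIIIIIII",
    "@read3", "TTTTCAGCAA", "+", "IIIIIIIIII",
]
print(leading_kmers(toy_fastq, k=4).most_common(5))
```

For a real gzipped file, the same function can be fed a handle opened with `gzip.open(path, "rt")`.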

However, I’ll need to discuss with Steven about whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…

Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb

Goals – March 2017

Goal, singular: Get Oly GBS manuscript completed/submitted.

Oh, actually, there is another, smaller goal that will be very difficult to achieve: win Pub-a-Thon. Jay’s taken a massive lead and has a nearly complete manuscript ready for submission. His manuscript is pretty well fleshed out, so it’ll be very difficult to surpass him at this point. However, I’m always up for a challenge, so I’ll see what I can do…

Anyway, back to my main goal of completing my manuscript.

This should be do-able. I’ve completed the SRA submission process for the raw sequencing data. The stuff that remains is as follows:

Generate FASTQC analysis on FASTQ files (this is currently running – takes a while)
Try to replicate BGI’s FASTQ demultiplexing pipeline to verify that it is functional
Make decisions with Steven (and Brent?) about what information tables should contain
Write

The beauty of submitting this to the journal Scientific Data is that it doesn’t require in-depth analysis of your data sets. It merely requires an examination of the data to ensure its integrity, as well as a cursory assessment of the data to evaluate its usefulness to the scientific community. No need to delve deeper into the data and attempt to interpret, or draw conclusions about, what the data might mean; that can be left to other researchers who deem this data worthwhile to explore.

Manuscript Writing – More “Nuances” Using Authorea

I previously highlighted some of the issues I was having using Authorea.com as a writing platform.

As a collaborative writing platform, it also has issues.

I recently received email notifications about comments made on the manuscript. However, when visiting my manuscript on Authorea, there are no indications that any comments have been made…

As it turns out, comments are currently only viewable when using Private Browsing/Incognito modes on your browser!!!

I found this out by using the chat feature that’s built into Authorea. This feature is great and support is pretty quick to respond:

However, Josh at Authorea suggested this bug would be resolved by the end of the day (that was on February 24th). I took the above screenshots of my manuscript demonstrating that comments don’t show up when using a browser like a normal person, today, February 28th…

Another significant shortcoming to using Authorea as a collaborative writing platform, as it relates to comments:

You can’t reply to individual comments! You can only add comments in chronological order for any given section of the manuscript. For example, if multiple comments are made in the Methods section, it makes it extremely difficult to address individual comments that were made earlier, in a clear fashion. To reply to a comment, you have to type out which previous comment you’re currently addressing in the comment you’re writing to address a particular previous comment. See what I mean?

Data Received – Jay’s Coral RADseq and Hollie’s Geoduck Epi-RADseq

Jay received notice from UC Berkeley that the sequencing data from his coral RADseq was ready. In addition, the sequencing contains some epiRADseq data from samples provided by Hollie Putnam. See his notebook for multiple links that describe library preparation (indexing and barcodes), sample pooling, and species breakdown.

For quickest reference, here’s Jay’s spreadsheet with virtually all the sample/index/barcode/pooling info (Google Sheet): ddRAD/EpiRAD_Jan_16

I’ve downloaded both the demultiplexed and non-demultiplexed data, verified data integrity by generating and comparing MD5 checksums, copied the files to each of the three species folders on owl/nightingales that were sequenced (Panopea generosa, Anthopleura elegantissima, Porites astreoides), generated and compared MD5 checksums for the files in their directories on owl/nightingales, and created/updated the readme files in each respective folder.
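For reference, the checksum comparison boils down to hashing the source and the copy and confirming the digests match. The sketch below shows that idea with Python's hashlib; it is not the code from the notebook, and the function names are my own.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_path, copy_path):
    """True if the copy has the same MD5 digest as the source file."""
    return md5sum(source_path) == md5sum(copy_path)
```

Reading in chunks matters here because the sequencing files are far too large to load into memory at once.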

Data management is detailed in the Jupyter notebook below. The notebook is embedded in this post, but it may be easier to view on GitHub (linked below).

Readme files were updated outside of the notebook.

Jupyter notebook (GitHub): 20170227_docker_jay_ngs_data_retrieval.ipynb

Data Management – SRA Submission of Ostrea lurida GBS FASTQ Files

Prepared a short read archive (SRA) submission for archiving our Olympia oyster genotype-by-sequencing (GBS) data in NCBI. This is in preparation for submission of the manuscript we’re putting together.

I followed my outline/guideline for navigating the SRA submission process, as it’s a bit of a pain in the neck. Glad my notes were actually useful!

The following two files are currently being uploaded via FTP; the process will take about 3hrs, as each file is ~18GB in size:

They are being submitted under the following accession numbers (note: a final accession number will be provided once this is publicly available; I will update this post when that happens):

Goals – February 2017

First goal is to be the first person in lab to post their goals each month. Props to one of our new grad students, Yaamini Venkataraman, on beating me this month!

Next goal is to dominate this year’s Pub-a-thon. I’m working on two different manuscripts, this one and this one, but I still think I can win this!

Stuff that got tackled from last month’s goals:

Freezer organization – This has happened, albeit without much effort on my part. Many thanks to the Big Cheese and Grace for tackling this project (https://genefish.wordpress.com/2017/01/28/80-organization)!

Data Management Plan – Some progress has been made on this. I improved the instructions on the DMP a bit, but the master spreadsheet around which the DMP revolves (Nightingales) is still in a massive state of flux that needs a lot of attention.

Sequencing data handling – Thanks to Sean for making a serious dent in automating this. He wrote an R script to handle this sort of thing. I’m not entirely sure if he’s done testing it, but it seems to work so far. Next will be incorporating usage instructions for this R script into the DMP so that others can utilize it. On that note, I need to figure out where Sean is keeping this script (can’t seem to locate it in his notebook).

Manuscript Writing – The “Nuances” of Using Authorea

I’m currently trying to write a manuscript covering our genotype-by-sequencing data for the Olympia oyster using the Authorea.com platform and am encountering some issues that are a bit frustrating. Here’s what’s happening (and the ways I’ve managed to get around the problems).

PROBLEM: Authorea spits out a browser-crashing “unresponsive script” message (actually, lots and lots of them; clicking “Stop script” or “Continue” just results in additional messages) in Firefox (haven’t tried any other browsers). This renders the browser inoperable and I have to force quit. It doesn’t happen all of the time, so it’s hard to pinpoint what triggers this.

SOLUTION: Edit documents in Git/GitHub. I have my Authorea manuscript linked to a GitHub repo, which allows me to write without using Authorea.com. This is how I’ll be doing my writing the majority of the time anyway, but I would like to use Authorea.com to insert and manage citations…

PROBLEM: Authorea remains in a perpetual “saving…” state after inserting a citation. It also renders the page strangely, with HTML <br></br> tags (see the “Methods” section in the screen cap below).

SOLUTION: Type additional text somewhere, anywhere. This is an OK solution, but is particularly annoying if I just want to go through and add citations and have no intentions of doing any writing.

PROBLEM: Multi-author citations don’t get formatted with “et al.” By default, Authorea inserts all citations using the following LaTeX format:

\cite{Elshire_2011}

Result: (Elshire 2011).

This is a problem because this reference has multiple authors and should be written as: (Elshire et al., 2011).

SOLUTION: Change citation format to:

\citep{Elshire_2011}

Other citation formatting options can be found here (including multiple citations within one set of parentheses, and referring in-text author name with only publication year in parentheses):

How to add and manage citations and references in Authorea

PROBLEM: When a citation no longer exists in the manuscript, it still persists in the bibliography.

SOLUTION: A known bug with no current solution. Currently, I have to delete them from the bibliography by hand (or maybe figure out a way to do it programmatically)…
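One way the programmatic route might look, assuming the manuscript text and the BibTeX bibliography are both available as plain text in the linked GitHub repo: collect every key used in a \cite-style command, then keep only the bibliography entries with those keys. This is an untested-against-Authorea sketch; the regexes cover common cases only (for example, @string and @comment blocks are dropped), and the `Unused_1999` entry is a made-up example.

```python
import re

# Matches \cite{...}, \citep{...}, \citet[p. 3]{...}, and similar.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the start of a BibTeX entry such as "@article{Key,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex):
    """Return the set of citation keys used in the LaTeX source."""
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def split_entries(bib):
    """Yield (key, entry_text) pairs, using brace matching to find entry ends."""
    pos = 0
    while True:
        match = ENTRY_RE.search(bib, pos)
        if match is None:
            return
        depth = 0
        start = bib.index("{", match.start())
        for end in range(start, len(bib)):
            if bib[end] == "{":
                depth += 1
            elif bib[end] == "}":
                depth -= 1
                if depth == 0:
                    break
        yield match.group(2), bib[match.start():end + 1]
        pos = end + 1


def prune_bibliography(bib, tex):
    """Return the bibliography text with uncited entries removed."""
    keep = cited_keys(tex)
    return "\n\n".join(text for key, text in split_entries(bib) if key in keep)


tex = r"GBS libraries were prepared as described \citep{Elshire_2011}."
bib = (
    "@article{Elshire_2011,\n  title={A {robust} approach},\n  year={2011}\n}\n"
    "@article{Unused_1999,\n  title={No longer cited}\n}\n"
)
print(prune_bibliography(bib, tex))
```

It would be safest to write the pruned result to a new file and diff it against the original before replacing anything.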

PROBLEM: Cannot click-and-drag some references from Mendeley (haven’t tested other reference managers) without getting an error. To my knowledge, the BibTeX is valid, as it appears to be the same formatting as other references that can be inserted via the click-and-drag method. There are some references it won’t work for…

SOLUTION: Use the search bar in the citation insertion dialogue box. Not as convenient and slows down the workflow for citation insertion, but it works…

Sample Submission – Geoduck gDNA for Illumina Pilot Sequencing Project

Sent 10μg of the geoduck gDNA I isolated earlier today to Illumina on dry ice via FedEx Standard Overnight service.

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
