Data Analysis – Oly GBS Data from BGI Using Stacks

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS from BGI data analyzed.

Installed Stacks (and its dependencies) on Hummingbird earlier today.

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS

Software Install – samtools-0.1.19 and stacks-1.37

Getting ready to analyze our Ostrea lurida genotype-by-sequencing data and wanted to use the Stacks software.

We have an existing version of Stacks on Hummingbird (the Apple server blade I will be running this analysis on), but I figured I might as well install the latest version (stacks-1.37).

Additionally, Stacks requires samtools-0.1.19 to run, which we did NOT have installed.

I tracked all of this in the Jupyter (iPython) notebook below.

Due to permissions issues during installation, I frequently had to leave the Jupyter notebook to run “sudo” in bash. As such, the notebook is messy, but does outline the necessary steps to get these two programs installed.

Jupyter notebook: 20160406_STACKS_install.ipynb

NBviewer: 20160406_STACKS_install.ipynb

SRA Submission – Genome sequencing of the Olympia oyster (Ostrea lurida)

Adding our Olympia oyster genome sequencing (sequencing done by BGI) to the NCBI Sequence Read Archive (SRA). The current status can be seen in the screen cap below. Release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:


Paired-end sequencing files were uploaded together within a single “Run”.

SRA Info:
SRA: SRS1365663
Study: SRP072461
BioProject: PRJNA316624
BioSample: SAMN04588827

SRA Submission – Genome sequencing of the Pacific geoduck (Panopea generosa)

Adding our geoduck genome sequencing (sequencing done by BGI) to the NCBI Sequence Read Archive (SRA). The current status can be seen in the screen cap below. Release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:


Mate pair sequencing files were uploaded together within a single “Run”.

SRA Submission – Transcriptomic Profiles of Adult Female & Male Gonads in Panopea generosa (Pacific geoduck).

RNAseq experiment, which is part of a larger project that involves characterizing geoduck gonad development across multiple stages: histologically, proteomically, and transcriptomically. Initial sample collection performed by Grace Crandall.

The current status can be seen in the screen cap below. Current release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:


Mate pair sequencing files were uploaded together within a single “Run”.

SRA Submission – Individual Transcriptomic Profiles of C.gigas Before & After Heat Shock

RNA-seq experiment conducted by Claire in 2013.

She sampled mantle tissue from three adult oysters, allowed them to recover from the sampling (one week?), and then subjected those same oysters to a 1 hr heat shock at 40°C and collected mantle tissue from them again.

As this is our first Sequence Read Archive (SRA) submission in many years, I decided to start with this data set. The small number of samples (6) from the Illumina sequencing keeps the submission manageable.

An overview of the basic SRA submission process is here.

The current status can be seen in the screen cap below. Current release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:


SRA Accession: SRP072251

Data Management – SRA Submission Overview

We have an enormous backlog of high-throughput sequencing files (641 FASTQ files, to be exact) that we need/want to get added to the NCBI Sequence Read Archive (SRA).

This post provides a brief summary of what’s involved in the process (mostly via screen shots) and attempts to identify the various pitfalls/pains that I’ve already stumbled through trying to get a set of six FASTQ files submitted properly.

OVERALL – It’s horrible and tedious.

Important things to note:

  • Once any of the three required components for SRA submission have been created (SRA, BioProject, and BioSamples), they can no longer be edited/deleted by the user! Understandable if they’ve already been publicly released, but if they’re still in pre-public release status, I think the user should be able to make changes as they see fit. As it currently stands, the user has to email the help desk at SRA and/or BioProjects to make any changes.
  • Extremely difficult to figure out which information will show up (and where it will show up) in the final, formatted SRA record – no guide to this that I could find. Thus, if you screw it up, it’s a major, major hassle to try to change anything.
  • When creating a “Run” (within an “Experiment”, within your SRA submission), only include sequencing files that provide the same data (e.g. if you have multiple sequence files, each generated from different individuals/samples, then you need to create a separate “Experiment” and “Run” for each of those files – otherwise, all files uploaded to a “Run” are combined into a single SRA file that loses any distinguishing info from the separate sequencing files).
  • When creating a batch submission for BioSamples, there’s no way to set a Title attribute. This means all of your submissions (in my case) will have a title of “Invertebrate sample”. Considering that I will likely end up with dozens of BioSamples, that means there’s no easy way to distinguish them from each other without some extra clicking and poking around.


Here’s the best way to proceed:

  1. Create a BioProject. This will sit at the top of the hierarchy in the SRA submission and will be displayed as the STUDY associated with the SRA.

  2. Create BioSample(s). This will be the next level of the hierarchy in the SRA submission and will be displayed as SAMPLE. This only shows up in the SRA when you create a new “Experiment”.

  3. Create SRA. This will end up encompassing any BioProject(s) and BioSample(s) that you need to include to describe the sequencing files you’re submitting to the SRA.

  4. Create an Experiment.

  5. Create a Run. This option is available once you’ve saved your experiment. This is where you provide your sequencing filename and associated MD5 checksum. This will also provide you with the login info to upload your sequencing files via FTP to NCBI servers. You can associate multiple sequencing files within a single run. This should be done if your sequencing files all provide data for the BioSample you selected. However, if you have sequencing files that are associated with different BioSamples, then you need to create an individual Experiment (and Run) for each BioSample!
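The one-Experiment-and-Run-per-BioSample rule in the steps above can be sketched as a small planning step in Python. This is only an illustration of the grouping logic and has nothing to do with the SRA submission portal itself; the file names, the sample labels, and the `plan_runs` helper are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def plan_runs(file_to_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sequencing files by BioSample.

    Each key of the result stands for one Experiment (with one Run);
    the value lists the files that belong together in that Run.
    """
    runs = defaultdict(list)
    for filename, sample in file_to_sample.items():
        runs[sample].append(filename)
    return {sample: sorted(files) for sample, files in runs.items()}


# Hypothetical file names and BioSample labels, for illustration only.
files = {
    "oysterA_R1.fastq.gz": "oysterA",
    "oysterA_R2.fastq.gz": "oysterA",
    "oysterB_R1.fastq.gz": "oysterB",
}

runs = plan_runs(files)
for sample, fastqs in sorted(runs.items()):
    print(f"Experiment/Run for {sample}: {', '.join(fastqs)}")
```

Writing out the plan like this before touching the web forms makes it easier to spot files from different individuals that would otherwise be merged into a single Run.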


Here are some links that might come in handy (although, none are that great)…

SRA Getting Started (helpful):

SRA Metadata Overview (this is helpful):

SRA Submission Quick Start Guide (this is useful!):

FTP Upload Instructions:

User UN-friendly SRA Guide:


And, here are the screen caps, roughly in chronological order of how the process presents itself. It’s too time consuming to caption any of these, so I’m putting them up for a reference. Also, all of the information seen in these screen caps has been deleted (because the entire submission was totally jacked up in multiple facets), so don’t look for any of the various submission IDs – they no longer exist. This is really just to visually show how many steps there are in order to get stuff submitted – it’s brutal.


Data Management – O. lurida genotype-by-sequencing (GBS) data from BGI

We received a hard drive from BGI on 20160223 (while I was out on paternity leave) containing the Ostrea lurida GBS data.

Briefly, three sets (i.e. populations) of Olympia oyster tissue were collected from oysters raised in Oyster Bay and were sent to BGI for DNA extraction and GBS. A total of 32 individuals from each of the following three populations were sequenced (a grand total of 96 samples):

  • 1HL – (Hood Canal, Long Spit)
  • 1NF – (North Sound, Fidalgo Bay)
  • 1SN – (South Sound, Oyster Bay)

An overview of this project can be viewed on our GitHub Olympia oyster wiki.

Data was copied from the HDD to the following location on Owl (our server):

The data was generated from paired-end Illumina sequencing, so there are two FASTQ files for each individual.

The files were analyzed to create an MD5 checksum, perform read counts, and create a readme (markdown format) file. This was performed in a Jupyter/iPython notebook (see below).
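For reference, the three tasks above (checksums, read counts, readme) could look roughly like the following in Python. This is a minimal sketch and not the code from the notebook: the `*.fq.gz` file pattern, the function names, and the readme table layout are assumptions, and the read count assumes standard four-line FASTQ records.

```python
import gzip
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def summarize(directory, pattern="*.fq.gz"):
    """Return (file name, MD5, read count) for each matching file.

    The default pattern is an assumption about how the files are named.
    """
    return [
        (fastq.name, md5sum(fastq), count_reads(fastq))
        for fastq in sorted(Path(directory).glob(pattern))
    ]


def write_readme(rows, readme_path):
    """Write the summary as a markdown table."""
    lines = ["| File | MD5 | Reads |", "| --- | --- | --- |"]
    lines += [f"| {name} | {md5} | {reads} |" for name, md5, reads in rows]
    Path(readme_path).write_text("\n".join(lines) + "\n")
```

Calling `write_readme(summarize(data_dir), readme_path)` with the data directory and a target readme path would produce one table row per FASTQ file.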

IMPORTANT NOTE: The directory where this data is housed was renamed AFTER the Jupyter notebook was run. As such, the directory listed above will not be seen in the Jupyter notebook.

Jupyter notebook file: 20160314_Olurida_GBS_data_management.ipynb

Notebook Viewer: 20160314_Olurida_GBS_data_management.ipynb

Data Received – Initial Geoduck Genome Assembly from BGI

The initial assembly of the Panopea generosa (geoduck) genome is available from BGI. Currently, we’ve stashed it here:

The data provided consisted of the following three files:

  • md5.txt
  • N50.txt
  • scaffold.fa.fill

md5.txt – Checksum file to verify integrity of files after downloading.
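Checking the downloaded files against md5.txt could be done along these lines. This is a sketch under one assumption: that md5.txt uses the usual `md5sum` layout of one `<digest>  <filename>` pair per line. I have not confirmed that BGI's file looks like this, and the `verify_checksums` helper is hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(md5_file):
    """Return {filename: True/False} for each entry in an md5sum-style file.

    Assumes each non-empty line is '<digest>  <filename>' and that the
    listed files sit in the same directory as the checksum file.
    """
    md5_file = Path(md5_file)
    results = {}
    for line in md5_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        results[name] = file_md5(md5_file.parent / name) == expected.lower()
    return results
```

Any entry that comes back `False` means the file should be downloaded again before doing anything else with it.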

N50.txt – Contains some very limited stats on scaffolds provided.

scaffold.fa.fill – A FASTA file of scaffolds. Since these are scaffolds (and NOT contigs!), there are many regions containing NNNNNN’s that have been put in place for scaffold assembly based on paired-end spatial information. As such, the N50 information is not as useful as it would be if these were contigs.

Additional assemblies will be provided at some point. I’ve emailed BGI about what we should expect from this initial assembly and what subsequent assemblies should look like.

Data Received – Initial Olympia oyster Genome Assembly from BGI

The initial assembly of the Ostrea lurida genome is available from BGI. Currently, we’ve stashed it here:

The data provided consisted of the following three files:

  • md5.txt
  • N50.txt
  • scaffold.fa.fill

md5.txt – Checksum file to verify integrity of files after downloading.

N50.txt – Contains some very limited stats on scaffolds provided.

scaffold.fa.fill – A FASTA file of scaffolds. Since these are scaffolds (and NOT contigs!), there are many regions containing NNNNNN’s that have been put in place for scaffold assembly based on paired-end spatial information. As such, the N50 information is not as useful as it would be if these were contigs.
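To see how much the N-filled gaps inflate the N50, one option is to compute it twice: once on the scaffolds as provided and once on the pieces left after splitting each scaffold at runs of N. The sketch below does that in Python. It is my own illustration, not how BGI generated N50.txt; the helper names and the `min_gap` parameter are assumptions, and the toy sequences are made up.

```python
import re


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def split_at_gaps(sequence, min_gap=1):
    """Split a scaffold into contig-like pieces at runs of N (min_gap or longer)."""
    return [piece for piece in re.split(f"[Nn]{{{min_gap},}}", sequence) if piece]


# Toy scaffolds (made up): the first contains a 10 bp gap of Ns.
scaffolds = ["ACGT" + "N" * 10 + "ACGTAC", "ACG"]
contigs = [piece for s in scaffolds for piece in split_at_gaps(s)]

print("scaffold N50:", n50([len(s) for s in scaffolds]))  # 20
print("contig N50:", n50([len(c) for c in contigs]))      # 4
```

For the real file, the scaffold lengths would come from `[len(seq) for _, seq in read_fasta("scaffold.fa.fill")]`, with the path adjusted to wherever the file is stored.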

Additional assemblies will be provided at some point. I’ve emailed BGI about what we should expect from this initial assembly and what subsequent assemblies should look like.