Category Archives: Olympia oyster reciprocal transplant

Data Management – O.lurida Raw BGI GBS FASTQ Data

BGI had previously supplied us with demultiplexed GBS FASTQ files. However, they had not provided us with the information/data on how those files were created. I contacted them and they’ve given us the two original FASTQ files, as well as the library index file and corresponding script they used for demultiplexing all of the files. The Jupyter (iPython) notebook below updates our checksum and readme files in our server directory that’s hosting the files: http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

See Jupyter Notebook below for processing details.

Jupyter Notebook: 20160427_Oly_GBS_data_management.ipynb

Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.

In this instance, all individuals in the subset were analyzed as a single population. I have another Jupyter notebook running on a different computer that will analyze the three populations present in this subset separately.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.

(Screen cap: IGV error when loading the *.vcf file)

One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on the files when they’re stored locally. Somewhere in the notebook you’ll notice that I copied the entire working directory from the server (Owl) to the local machine (Hummingbird). Things proceeded very quickly after doing that. I didn’t realize this would have so much impact on speed!
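For reference, a server-to-local copy like the one in the notebook can be done with something as simple as the following (the paths here are hypothetical; the actual copy step is in the notebook):

# Copy the PyRAD working directory from the mounted Owl share to the local machine
rsync -av /Volumes/web/pyrad_working_dir/ ~/pyrad_working_dir/

Running the assembly on local copies avoids a network round trip for every file access, which is where the slowdown comes from.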

Jupyter Notebook: 20160418_pyrad_oly_PE-GBS.ipynb

NBviewer: 20160418_pyrad_oly_PE-GBS

Data Management – Concatenate FASTQ files from Oly MBDseq Project

Steven requested I concatenate the MBDseq files we received for this project:

  • concatenate the s4, s5, s6 file sets for each individual

  • concatenate the full file sets for each individual

Ran the concatenations in the Jupyter (iPython) notebook below. All files were saved to Owl/nightingales/O_lurida/2016
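For illustration, gzipped FASTQ files can be concatenated directly with cat; the resulting multi-member gzip file is read as a single file by gunzip and most downstream tools. The filenames below are hypothetical placeholders (the actual commands are in the notebook):

# Concatenate the s4, s5, and s6 file sets for one individual
cat sample1_s4.fastq.gz sample1_s5.fastq.gz sample1_s6.fastq.gz > sample1_s4s5s6.fastq.gz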

Jupyter Notebook: 20160411_Concatenate_Oly_MBDseq.ipynb

NBviewer: 20160411_Concatenate_Oly_MBDseq

Data Analysis – Oly GBS Data from BGI Using Stacks

UPDATE (20160418) : I’m posting this more for posterity, as Stacks continually locked up at both the “ustacks” and “cstacks” stages. These processes would take days to run (on the full 96 samples) and then the processes would become “stuck” (viewed via the top command in OS X).

Have moved on to trying PyRAD in the meantime.

Need to get the GBS data from BGI analyzed.

Installed Stacks (and its dependencies) on Hummingbird earlier today.

Below is the Jupyter (iPython) notebook I ran to perform this analysis.

Jupyter (iPython) Notebook: 20160406_Oly_GBS_STACKS.ipynb

Jupyter Notebook Viewer: 20160406_Oly_GBS_STACKS

Data Management – O. lurida genotype-by-sequencing (GBS) data from BGI

We received a hard drive from BGI on 20160223 (while I was out on paternity leave) containing the Ostrea lurida GBS data.

Briefly, three sets (i.e. populations) of Olympia oyster tissue samples were collected from oysters raised in Oyster Bay and sent to BGI for DNA extraction and GBS. A total of 32 individuals from each of the following three populations were sequenced (a grand total of 96 samples):

  • 1HL – (Hood Canal, Long Spit)
  • 1NF – (North Sound, Fidalgo Bay)
  • 1SN – (South Sound, Oyster Bay)

An overview of this project can be viewed on our GitHub Olympia oyster wiki.

Data was copied from the HDD to the following location on Owl (our server): http://owl.fish.washington.edu/nightingales/O_lurida/20160223_gbs/

The data was generated from paired-end Illumina sequencing, so there are two FASTQ files for each individual.

The files were processed to generate MD5 checksums, perform read counts, and create a readme (Markdown format) file. This was performed in a Jupyter/iPython notebook (see below).
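For reference, a minimal bash sketch of those three operations (the file extension and names are assumptions; the actual commands are in the notebook):

# Generate MD5 checksums for all FASTQ files (md5sum on Linux; md5 on OS X)
md5sum *.fq.gz >> checksums.md5

# Count reads in each FASTQ file (four lines per read) and append to the readme
for i in *.fq.gz; do readcount=$(( $(gunzip -c "$i" | wc -l) / 4 )); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done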

IMPORTANT NOTE: The directory where this data is housed was renamed AFTER the Jupyter notebook was run. As such, the directory listed above will not be seen in the Jupyter notebook.

Jupyter notebook file: 20160314_Olurida_GBS_data_management.ipynb

Notebook Viewer: 20160314_Olurida_GBS_data_management.ipynb

Data Management – O.lurida 2bRAD Dec2015 Undetermined FASTQ files

An astute observation by Katherine Silliman revealed that the FASTQ files I had moved to our high-throughput sequencing archive on our server Owl included only two files labeled as “undetermined”. Based on the number of lanes we had sequenced, we should have had many more. It turns out that the undetermined FASTQ files present in different sub-folders of the Genewiz project data were not uniquely named. Thus, when I moved them over (via a bash script), the undetermined files were continually overwritten, leaving us with only two FASTQ files labeled as undetermined.

So, I re-downloaded the entire project folder from the Genewiz servers, renamed the FASTQ files labeled as undetermined, and then copied them over to the archive on Owl:

http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/
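Here’s a rough sketch of the rename-and-copy approach, assuming each file’s parent sub-folder name can serve as a unique prefix (paths and filename patterns below are hypothetical; the actual steps are in the notebook):

# Prefix each undetermined FASTQ with its parent sub-folder name,
# then copy to the archive without overwriting (-n) existing files
find genewiz_project/ -name 'Undetermined*.fastq.gz' | while read -r f; do subdir=$(basename "$(dirname "$f")"); cp -n "$f" /path/to/archive/"${subdir}_$(basename "$f")"; done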

I also zipped up the raw data project from Genewiz and moved it to the same archive location and updated the checksum.md5 and readme.md files.

Details can be found in the Jupyter (iPython) notebook below.

Jupyter Notebook file: 20160308_find_rename_2bRAD_undetermined_fastqs.ipynb

Notebook Viewer: 20160308_find_rename_2bRAD_undetermined_fastqs.ipynb


Data Received – Ostrea lurida MBD-enriched BS-seq

Received the Olympia oyster, MBD-enriched BS-seq sequencing files (50bp, single read) from ZymoResearch (submitted 20151208). Here’s the sample list:

  • E1_hc1_2B
  • E1_hc1_4B
  • E1_hc2_15B
  • E1_hc2_17
  • E1_hc3_1
  • E1_hc3_5
  • E1_hc3_7
  • E1_hc3_10
  • E1_hc3_11
  • E1_ss2_9B
  • E1_ss2_14B
  • E1_ss2_18B
  • E1_ss3_3B
  • E1_ss3_14B
  • E1_ss3_15B
  • E1_ss3_16B
  • E1_ss3_20
  • E1_ss5_18

 

The 18 samples listed above had previously been MBD-enriched and then sent to ZymoResearch for bisulfite conversion, multiplex library construction, and subsequent sequencing. The library (multiplex of all samples) was sequenced in a single lane, three times. Thus, we would expect 54 FASTQ files. However, ZymoResearch was dissatisfied with the QC of the initial sequencing run (completed on 20160129), so they re-ran the samples (completed on 20160202). This created two sets of data, resulting in a total of 108 FASTQ files.

The ZymoResearch data portal does not allow bulk downloading of files. However, I ended up using the Chrono Download Manager extension for Google Chrome to automate downloading of each file (per ZymoResearch’s recommendation).

After download, the files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/O_lurida/20160203_mbdseq
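As a quick sanity check after the move, counting the files in the storage directory should return the expected 108 (the .fastq.gz extension here is an assumption):

ls -1 *.fastq.gz | wc -l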

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

 

Total reads generated for this project: 1,481,836,875

 

Jupyter Notebook file: 20160203_Olurida_Zymo_Data_Handling.ipynb

Notebook Viewer: 20160203_Olurida_Zymo_Data_Handling.ipynb

Data Received – Bisulfite-treated Illumina Sequencing from Genewiz

Received notice from Genewiz that the sequencing data was ready for the samples submitted 20151222.

Downloaded the FASTQ files from the Genewiz project directory:

wget -r -np -nc -A "*.gz" ftp://username:password@ftp2.genewiz.com/Project_BS1512183
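Code explanation:

wget -r -np -nc -A "*.gz"
  • Downloads recursively (-r) without ascending to the parent directory (-np), skips files that already exist locally (-nc, “no clobber”), and accepts only files matching “*.gz” (-A).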

Since two species were sequenced (C.gigas & O.lurida), the corresponding files are in the following locations:

http://owl.fish.washington.edu/nightingales/O_lurida/

http://owl.fish.washington.edu/nightingales/C_gigas/

 

In order to process the files, I needed to identify just the FASTQ files from this project and save the list of files to a bash variable called “bsseq”:

bsseq=$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")

Explanation:

bsseq=
  • This initializes a variable called “bsseq” with the output of the command following the equals sign.
$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")
  • This lists (ls) all files and pipes them (|) to grep, which keeps filenames that begin with (^) a digit ([0-9]\{1\}) followed by zero or more underscores (_*); since the anchor only requires the leading character to be a digit, this also matches two-digit prefixes like “10_”. Those results are piped (|) to a second grep, which excludes (-v) any results containing the text “2bRAD”.

 

FILENAME                          SAMPLE NAME    SPECIES
1_ATCACG_L001_R1_001.fastq.gz     1NF11          O.lurida
2_CGATGT_L001_R1_001.fastq.gz     1NF15          O.lurida
3_TTAGGC_L001_R1_001.fastq.gz     1NF16          O.lurida
4_TGACCA_L001_R1_001.fastq.gz     1NF17          O.lurida
5_ACAGTG_L001_R1_001.fastq.gz     2NF5           O.lurida
6_GCCAAT_L001_R1_001.fastq.gz     2NF6           O.lurida
7_CAGATC_L001_R1_001.fastq.gz     2NF7           O.lurida
8_ACTTGA_L001_R1_001.fastq.gz     2NF8           O.lurida
9_GATCAG_L001_R1_001.fastq.gz     M2             C.gigas
10_TAGCTT_L001_R1_001.fastq.gz    M3             C.gigas
11_GGCTAC_L001_R1_001.fastq.gz    NF2_6          O.lurida
12_CTTGTA_L001_R1_001.fastq.gz    NF_18          O.lurida

 

I wanted to add some information about the project to the readme file, such as the total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

totalreads=0; for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total reads = 138,530,448

C.gigas reads: 22,249,631

O.lurida reads: 116,280,817

Code explanation:

totalreads=0;
  • Creates variable called “totalreads” and initializes value to 0.
for i in $bsseq;
  • Initiates a for loop to process the list of files stored in the $bsseq variable. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
  • Creates variable called “linecount” that stores the results of the following command:
`gunzip -c "$i" | wc -l`;
  • Decompresses each file ($i) to stdout (-c) instead of writing an uncompressed copy to disk. The output is piped to the word count command with the lines flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
  • Adds the readcount for the current file to the running total stored in totalreads.
done;
  • Ends the for loop.
echo $totalreads
  • Prints the value of totalreads to the screen.

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file.

for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done

Code explanation:

for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
  • Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
  • This formats the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between entries. See the readme file linked above to see how the output looks.
>> readme.md; done
  • This appends the result from each loop to the readme.md file and ends the for loop (done).

 

Data Received – Oly 2bRAD Illumina Sequencing from Genewiz

The data was made available to us on 20151224 and took two days to download.

The full list of samples (and the individual samples/libraries/indexes) submitted to Genewiz for this project by Katherine Silliman & me can be seen here (Google Sheet): White_BS1511196_R2_barcodes

The data supplied included all of the Illumina output files, in addition to the demultiplexed FASTQ files. (We’re currently not entirely sure where/how we want to store all of this, but we’ll probably want to keep the raw output for attempting our own demultiplexing, since there was a significant number of reads that Genewiz was unable to demultiplex.) The FASTQ files were buried in inconvenient locations, and there are over 300 of them, so I used the power of the command line to find them and copy them to a single location: http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/

Find and copy all FASTQ files:

find /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=home/ -name '*.fastq.*' -exec cp -n '{}' /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=web/nightingales/O_lurida/2bRAD_Dec2015/ \;

Code explanation:

find
  • Command line program used for searching for files
/run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=home/ 
  • Location of the files I wanted to search through. The path looks a little crazy because I was working remotely and had the server share mounted.
-name '*.fastq.*'
  • The name argument tells the find command to look for filenames containing “.fastq.” (e.g. .fastq.gz).
-exec cp -n '{}'
  • The exec option tells the find command to execute a subsequent action upon finding a match. In this case, I’m using the copy command (cp) and telling it not to overwrite (“no clobber”, -n) any duplicate files.
/run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=web/nightingales/O_lurida/2bRAD_Dec2015 \;
  • The location where I want the matched files copied.

I created a readme file in the directory with these files: readme.md

I wanted to add some information about the project to the readme file, such as the total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

totalreads=0; for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total Reads: 588,396,334

Code explanation:

totalreads=0;
  • Creates variable called “totalreads” and initializes value to 0.
for i in *.gz;
  • Initiates a for loop to process any filenames that end with “.gz”. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
  • Creates variable called “linecount” that stores the results of the following command:
`gunzip -c "$i" | wc -l`;
  • Decompresses each file ($i) to stdout (-c) instead of writing an uncompressed copy to disk. The output is piped to the word count command with the lines flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
  • Adds the readcount for the current file to the running total stored in totalreads.
done;
  • Ends the for loop.
echo $totalreads
  • Prints the value of totalreads to the screen.

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file.

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done

Code explanation:

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
  • Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
  • This formats the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between entries. See the readme file linked above to see how the output looks.
>> readme.md; done
  • This appends the result from each loop to the readme.md file and ends the for loop (done).