Data Management – O.lurida 2bRAD Dec2015 Undetermined FASTQ files

An astute observation by Katherine Silliman revealed that, among the FASTQ files I had moved to our high-throughput sequencing archive on our server Owl, only two were labeled as “undetermined”. Based on the number of lanes we had sequenced, we should have had many more. It turns out that the undetermined FASTQ files present in the different sub-folders of the Genewiz project data were not uniquely named. Thus, when I moved them over (via a bash script), the undetermined files were continually overwritten, until we were left with only two FASTQ files labeled as undetermined.

So, I re-downloaded the entire project folder from the Genewiz servers, renamed the FASTQ files labeled as undetermined, and then copied them over to the archive on Owl:

http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/

I also zipped up the raw data project from Genewiz, moved it to the same archive location, and updated the checksum.md5 and readme.md files.
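For reference, the renaming boils down to prefixing each undetermined file with the name of the sub-folder it came from so the copies no longer collide. A minimal sketch of that approach (the actual steps are in the notebook linked below; the paths and the “Undetermined” filename pattern are illustrative assumptions):

# Prefix each undetermined FASTQ with its parent folder name and copy it to the archive.
find /path/to/genewiz_project/ -name 'Undetermined*.fastq.gz' | while read -r fastq; do
    parent=$(basename "$(dirname "$fastq")")
    cp -n "$fastq" "/path/to/owl/nightingales/O_lurida/2bRAD_Dec2015/${parent}_$(basename "$fastq")"
done

# Append checksums for the newly added files to the existing checksum list.
cd /path/to/owl/nightingales/O_lurida/2bRAD_Dec2015/
md5sum *Undetermined*.fastq.gz >> checksum.md5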

Details can be found in the Jupyter (iPython) notebook below.

Jupyter Notebook file: 20160308_find_rename_2bRAD_undetermined_fastqs.ipynb

Notebook Viewer: 20160308_find_rename_2bRAD_undetermined_fastqs.ipynb

 

 

Data Management – High-throughput Sequencing Data

We’ve had a recent influx of sequencing data, which is great, but it created a bit of a backlog documenting what we’ve received.

I updated our Google Sheet (Nightingales) with the geoduck genome sequencing data from BGI, the Olympia oyster genome sequencing data from BGI, and the MBD bisulfite sequencing data from ZymoResearch.

I also fixed the “FileLocation” column by replacing the “HYPERLINK” function with “CONCATENATE”.

Google Sheet: Nightingales

 

After updating the Nightingales Google Sheet, I updated the corresponding Google Fusion Table (also called Nightingales).

To update the Fusion Table, you have to do the following:

  • Delete all rows in the Nightingales Google Fusion Table (Edit > Delete all rows)
  • Import data from the Nightingales Google Spreadsheet (File > Import more rows…)

Fusion Table: Nightingales

At initial glance, the Fusion Table appears the same as the Google Sheet. However, if you follow the link to the full Fusion Table, it offers some unique ways to visually explore the data contained in the Fusion Table.

 

After that, I decided to deal with the fact that many of the directories on Owl (http://owl.fish.washington.edu/nightingales/) lack readme files and, consequently, any information about the sequencing files in those folders.

So, I took an inordinate amount of time to write a script that would automate as much of the process as I could think of.

The script is here (GitHub): https://github.com/kubu4/Scripts/blob/master/bash/ngs_automator.sh

The goal of the script is to perform the following (a simplified sketch follows the list):

  • Identify folders that do not have readme files.
  • Identify folders that do not have checksum files.
  • Create readme files in those directories lacking readme files.
  • Append the directory path to each new readme file.
  • Append sequencing file names and corresponding read counts to the new readme files.
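Conceptually, the automation looks something like this (a stripped-down sketch, not the actual script linked above; the top-level path and file patterns are assumptions):

# Walk each sequencing directory; flag missing checksum files, create a readme.md
# where one is missing, record the directory path, and append per-file read counts.
for dir in /path/to/nightingales/*/; do
    [ -e "${dir}checksum.md5" ] || echo "No checksum file: ${dir}"
    if [ ! -e "${dir}readme.md" ]; then
        echo "${dir}" > "${dir}readme.md"
        for fastq in "${dir}"*.fastq.gz; do
            [ -e "$fastq" ] || continue
            # Each Illumina read is four lines in a FASTQ file.
            readcount=$(( $(gunzip -c "$fastq" | wc -l) / 4 ))
            printf "%s\t%s\n\n" "$(basename "$fastq")" "$readcount" >> "${dir}readme.md"
        done
    fi
done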

Will run the script. Hope it works…

Data Received – Ostrea lurida MBD-enriched BS-seq

Received the Olympia oyster MBD-enriched BS-seq sequencing files (50bp, single read) from ZymoResearch (submitted 20151208). Here’s the sample list:

  • E1_hc1_2B
  • E1_hc1_4B
  • E1_hc2_15B
  • E1_hc2_17
  • E1_hc3_1
  • E1_hc3_5
  • E1_hc3_7
  • E1_hc3_10
  • E1_hc3_11
  • E1_ss2_9B
  • E1_ss2_14B
  • E1_ss2_18B
  • E1_ss3_3B
  • E1_ss3_14B
  • E1_ss3_15B
  • E1_ss3_16B
  • E1_ss3_20
  • E1_ss5_18

 

The 18 samples listed above had previously been MBD-enriched and then sent to ZymoResearch for bisulfite conversion, multiplex library construction, and subsequent sequencing. The library (multiplex of all samples) was sequenced in a single lane, three times. Thus, we would expect 54 FASTQ files. However, ZymoResearch was dissatisfied with the QC of the initial sequencing run (completed on 20160129), so they re-ran the samples (completed on 20160202). This created two sets of data, resulting in a total of 108 FASTQ files.

The ZymoResearch data portal does not allow bulk downloads of files. However, I ended up using the Chrono Download Manager extension for Google Chrome to automate downloading of each file (per ZymoResearch’s recommendation).

After download, the files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/O_lurida/20160203_mbdseq

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

 

Total reads generated for this project: 1,481,836,875

 

Jupyter Notebook file: 20160203_Olurida_Zymo_Data_Handling.ipynb

Notebook Viewer: 20160203_Olurida_Zymo_Data_Handling.ipynb

Data Received – Ostrea lurida genome sequencing files from BGI

Downloaded data from the BGI project portal to our server, Owl, using the Synology Download Station. Although the BGI portal is aesthetically nice, it’s set up poorly for bulk downloads, and it took a few tries to download all of the files.

Data integrity was assessed and read counts for each file were generated. The files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/O_lurida
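For reference, the integrity check and read counts amount to something like the following (the checksum-list filename and the .fq.gz extension are assumptions about how BGI delivered the files):

# Verify the downloaded files against the checksum list supplied by BGI (assumed filename).
md5sum -c md5.txt

# Count reads in each gzipped FASTQ (four lines per read).
for fastq in *.fq.gz; do
    printf "%s\t%s\n" "$fastq" "$(( $(gunzip -c "$fastq" | wc -l) / 4 ))"
done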

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

 

Total reads generated for this project: 1,225,964,680

BGI provided the raw data files for us to play around with, but they are also currently in the process of performing the genome assembly.

 

Jupyter Notebook file: 20160126_Olurida_BGI_data_handling.ipynb

Notebook Viewer: 20160126_Olurida_BGI_data_handling.ipynb

Data Received – Panopea generosa genome sequencing files from BGI

Downloaded data from the BGI project portal to our server, Owl, using the Synology Download Station. Although the BGI portal is aesthetically nice, it’s set up poorly for bulk downloads, and it took a few tries to download all of the files.

Data integrity was assessed and read counts for each file were generated. The files were moved to their permanent storage location on Owl: http://owl.fish.washington.edu/nightingales/P_generosa/

The readme.md file was updated to include project/file information.

The file manipulations were performed in a Jupyter notebook (see below).

 

Total reads generated for this project: 1,208,635,950

BGI provided the raw data files for us to play around with, but they are also currently in the process of performing the genome assembly.

 

Jupyter Notebook file: 20160126_Olurida_BGI_data_handling.ipynb

Notebook Viewer: 20160126_Olurida_BGI_data_handling.ipynb

Data Analysis – Identification of duplicate files on Eagle

Recently, we’ve been bumping into our storage limit on Eagle (our Synology DS413):

 

Being fairly certain that a significant amount of large datasets is duplicated throughout Eagle, I ran a Linux program called “fslint”. It searches for duplicate files based on a few parameters and is smart enough to compare files that have different filenames but share the same contents!

I decided to check for duplicate files in the Eagle/archive folder and the Eagle/web folder. Initially, I tried searching for duplicates across all of Eagle, but after a week of running I got tired of waiting for results and ran the analysis on those two directories independently. As such, there is a possibility that there are more duplicates (consuming even more space) across the remainder of Eagle that have not been identified. However, this is a good starting point.

Here are the two output files from the fslint analysis:

 

To get a summary of the fslint output, I tallied the total amount of duplicate files that were >100MB in size. This was performed in a Jupyter notebook (see below):
Notebook Viewer: 20160114_wasted_space_synologies.ipynb
Jupyter (IPython) Notebook File: 20160114_wasted_space_synologies.ipynb
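The tally itself was done in the notebook linked above; conceptually, it boils down to something like this, assuming the fslint results have first been reduced to a two-column, tab-separated “size-in-bytes<TAB>path” listing of duplicates (that format is an assumption, not fslint’s native output):

# Sum the sizes of listed duplicate files larger than 100MB.
awk -F'\t' '$1 > 100*1024*1024 { total += $1 } END { printf "%.1f GB in duplicates >100MB\n", total/(1024^3) }' duplicate_sizes.txt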

 

Here are the cleaned output files from the fslint analysis:

 

Summary

There are duplicate files (>100MB in size) that are consuming at least 730GB!

Since the majority of these files exist in the Eagle/web folder, careful consideration will be needed when determining which duplicates (if any) can be deleted, since it’s highly possible that some notebooks link to these files. Regardless, this analysis shows just how much space is being consumed by large, duplicate files; something to consider for future data handling/storage/analysis on Eagle.

Data Received – Bisulfite-treated Illumina Sequencing from Genewiz

Received notice from Genewiz that the sequencing data was ready for the samples submitted 20151222.

Downloaded the FASTQ files from the Genewiz project directory:

wget -r -np -nc -A "*.gz" ftp://username:password@ftp2.genewiz.com/Project_BS1512183
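For reference, the wget options used: -r (recursive), -np (don’t ascend to the parent directory), -nc (no-clobber; skip files that have already been downloaded), and -A "*.gz" (only accept files ending in .gz).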

Since two species were sequenced (C.gigas & O.lurida), the corresponding files are in the following locations:

http://owl.fish.washington.edu/nightingales/O_lurida/

http://owl.fish.washington.edu/nightingales/C_gigas/

 

In order to process the files, I needed to identify just the FASTQ files from this project and save the list of files to a bash variable called “bsseq”:

bsseq=$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")

Explanation:

bsseq=
  • This creates a variable called “bsseq” and assigns it the output of the command following the equals sign.
$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")
  • This lists (ls) all files and pipes (|) them to grep, which keeps filenames that begin with (^) a digit ([0-9]\{1\}) followed by zero or more underscores (_*); those results are piped (|) to a second grep, which excludes (-v) any results containing the text “2bRAD”.

 

FILENAME SAMPLE NAME SPECIES
1_ATCACG_L001_R1_001.fastq.gz 1NF11 O.lurida
2_CGATGT_L001_R1_001.fastq.gz 1NF15 O.lurida
3_TTAGGC_L001_R1_001.fastq.gz 1NF16 O.lurida
4_TGACCA_L001_R1_001.fastq.gz 1NF17 O.lurida
5_ACAGTG_L001_R1_001.fastq.gz 2NF5 O.lurida
6_GCCAAT_L001_R1_001.fastq.gz 2NF6 O.lurida
7_CAGATC_L001_R1_001.fastq.gz 2NF7 O.lurida
8_ACTTGA_L001_R1_001.fastq.gz 2NF8 O.lurida
9_GATCAG_L001_R1_001.fastq.gz M2 C.gigas
10_TAGCTT_L001_R1_001.fastq.gz M3 C.gigas
11_GGCTAC_L001_R1_001.fastq.gz NF2_6 O.lurida
12_CTTGTA_L001_R1_001.fastq.gz NF_18 O.lurida

 

I wanted to add some information about the project to the readme file, like the total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

totalreads=0; for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total reads = 138,530,448

C.gigas reads: 22,249,631

O.lurida reads: 116,280,817

Code explanation:

totalreads=0;
  • Creates a variable called “totalreads” and initializes its value to 0.
for i in $bsseq;
  • Initiates a for loop to process the list of files stored in the $bsseq variable. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
  • Creates a variable called “linecount” that stores the results of the following command:
`gunzip -c "$i" | wc -l`;
  • Unzips the file ($i) to stdout (-c) instead of uncompressing it on disk. This is piped to the word count command with the line flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
  • Adds the readcount for the current file to the running total in totalreads.
done;
  • Ends the for loop.
echo $totalreads
  • Prints the value of totalreads to the screen.

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file.

for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done

Code explanation:

for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
  • Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
  • This formats the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between entries. See the readme file linked above to see how the output looks.
>> readme.md; done
  • This appends the result from each loop to the readme.md file and ends the for loop (done).

 

Data Received – Oly 2bRAD Illumina Sequencing from Genewiz

The data was made available to us on 20151224 and took two days to download.

The full list of samples (and the individual samples/libraries/indexes) submitted to Genewiz for this project by Katherine Silliman & me can be seen here (Google Sheet): White_BS1511196_R2_barcodes

The data supplied included all of the raw Illumina output files (we’re currently not entirely sure where/how we want to store all of these, but we’ll probably want them for attempting our own demultiplexing, since there was a significant number of reads that Genewiz was unable to demultiplex), in addition to the demultiplexed FASTQ files. The FASTQ files were buried in inconvenient locations, and there are over 300 of them, so I used the power of the command line to find them and copy them to a single location: http://owl.fish.washington.edu/nightingales/O_lurida/2bRAD_Dec2015/

Find and copy all FASTQ files:

find /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=home/ -name '*.fastq.*' -exec cp -n '{}' /run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=web/nightingales/O_lurida/ \;

Code explanation:

find
  • Command line program used for searching for files
/run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=home/ 
  • Location of the files I wanted to search through. The path looks a little crazy because I was working remotely and had the server share mounted.
-name '*.fastq.*'
  • The name argument tells the find command to look for filenames containing “.fastq.” (e.g., .fastq.gz).
-exec cp -n '{}'
  • The exec option tells the find command to execute a subsequent action upon finding a match. In this case, I’m using the copy command (cp) with the no-clobber option (-n), which prevents overwriting any files that already exist at the destination.
/run/user/1000/gvfs/smb-share\:server\=owl.fish.washington.edu\,share\=web/nightingales/O_lurida/2bRAD_Dec2015 \;
  • The location where I want the matched files copied.

I created a readme file in the directory containing these files: readme.md

I wanted to add some information about the project to the readme file, like the total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

totalreads=0; for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads

Total Reads: 588,396,334

Code explanation:

totalreads=0;
  • Creates a variable called “totalreads” and initializes its value to 0.
for i in *.gz;
  • Initiates a for loop to process any filenames that end with “.gz”. The FASTQ files have been compressed with gzip and end with the .gz extension.
do linecount=
  • Creates a variable called “linecount” that stores the results of the following command:
`gunzip -c "$i" | wc -l`;
  • Unzips the file ($i) to stdout (-c) instead of uncompressing it on disk. This is piped to the word count command with the line flag (wc -l) to count the number of lines in the file.
readcount=$((linecount/4));
  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.
totalreads=$((readcount+totalreads));
  • Adds the readcount for the current file to the running total in totalreads.
done;
  • Ends the for loop.
echo $totalreads
  • Prints the value of totalreads to the screen.

Next, I wanted to generate a list of the FASTQ files and corresponding read counts, and append this information to the readme file.

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md; done

Code explanation:

for i in *.gz; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));
  • Same for loop as above that calculates the number of reads in each FASTQ file.
printf "%s\t%s\n\n" "$i" "$readcount" >> readme.md;
  • This formats the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between entries. See the readme file linked above to see how the output looks.
>> readme.md; done
  • This appends the result from each loop to the readme.md file and ends the for loop (done).

Illumina Methylation Library Quantification – BS-seq Oly/C.gigas Libraries

Re-quantified the libraries that were completed yesterday using the Qubit 3.0 dsDNA HS (high sensitivity) assay, because the library concentrations were too low for the standard broad range kit.

Results:

Qubit Quants and Library Normalization Calcs: 20151222_qubit_illumina_methylation_libraries

SAMPLE CONCENTRATION (ng/μL)
1NF11 2.42
1NF15 1.88
1NF16 2.74
1NF17 2.54
2NF5 2.72
2NF6 2.44
2NF7 2.38
2NF8 1.88
M2 2.18
M3 2.56
NF2_6 2.5
NF_18 2.66

 

Things look pretty good. The TruSeq DNA Methylation Library Kit (Illumina) suggests that the libraries produced should end up with concentrations >3ng/μL, but we have plenty of DNA here to make a pool for running on the HiSeq2500.
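For the pooling step, the normalization calcs in the spreadsheet linked above boil down to dividing a target DNA mass by each library’s concentration to get the volume to add per library. A minimal sketch of that arithmetic (the 10 ng target is a made-up number, not the value actually used, and only a few samples are shown):

# Volume (uL) to pool per library = target mass (ng) / concentration (ng/uL).
target_ng=10
while read -r sample conc; do
    printf "%s\t%.2f uL\n" "$sample" "$(echo "$target_ng / $conc" | bc -l)"
done <<EOF
1NF11 2.42
1NF15 1.88
M2 2.18
EOF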