Tag Archives: LSU

Sequence Data Analysis – LSU C.virginica Oil Spill MBD BS-Seq Data

Performed some rudimentary data analysis on the new, demultiplexed data downloaded earlier today:

2112_lane1_ACAGTG_L001_R1_001.fastq.gz
2112_lane1_ACAGTG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_001.fastq.gz
2112_lane1_ATCACG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_003.fastq.gz
2112_lane1_CAGATC_L001_R1_001.fastq.gz
2112_lane1_CAGATC_L001_R1_002.fastq.gz
2112_lane1_CAGATC_L001_R1_003.fastq.gz
2112_lane1_GCCAAT_L001_R1_001.fastq.gz
2112_lane1_GCCAAT_L001_R1_002.fastq.gz
2112_lane1_TGACCA_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_002.fastq.gz

Compared the total amount of data (in gigabytes) generated from each index. The commands below send the output of the ‘ls -l’ command to awk. Awk sums the file sizes, found in the 5th field ($5) of the ‘ls -l’ output, then prints the sum divided by 1024^3 to convert from bytes to gigabytes.

Index: ACAGTG

$ls -l 2112_lane1_AC* | awk '{sum += $5} END {print sum/1024/1024/1024}'
1.49652

Index: ATCACG

$ls -l 2112_lane1_AT* | awk '{sum += $5} END {print sum/1024/1024/1024}'
3.02269

Index: CAGATC

$ls -l 2112_lane1_CA* | awk '{sum += $5} END {print sum/1024/1024/1024}'
3.49797

Index: GCCAAT

$ls -l 2112_lane1_GC* | awk '{sum += $5} END {print sum/1024/1024/1024}'
2.21379

Index: TGACCA

$ls -l 2112_lane1_TG* | awk '{sum += $5} END {print sum/1024/1024/1024}'
0.687374

Index: TTAGGC

$ls -l 2112_lane1_TT* | awk '{sum += $5} END {print sum/1024/1024/1024}'
2.28902
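The six commands above differ only in the index, so they could be collapsed into one loop. This is a minimal sketch, not what was actually run: the function names index_size_gb and summarize_indexes are hypothetical, and it assumes every file for an index is named 2112_lane1_<index>_*.

```shell
# Hypothetical helper: total size, in gigabytes (bytes / 1024^3), of all
# files for one index. $1 = directory holding the files, $2 = 6 bp index.
index_size_gb() {
    ls -l "$1"/2112_lane1_"$2"_* | awk '{sum += $5} END {print sum/1024/1024/1024}'
}

# Hypothetical helper: print one "index<TAB>gigabytes" line per index.
# $1 = directory holding the files.
summarize_indexes() {
    for index in ACAGTG ATCACG CAGATC GCCAAT TGACCA TTAGGC; do
        printf '%s\t%s\n' "$index" "$(index_size_gb "$1" "$index")"
    done
}
```

From the folder holding the files, "summarize_indexes ." would print all six totals in one pass. An index with no matching files reports 0 (and ls complains on stderr).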

Ran FASTQC on the files downloaded earlier today. The FASTQC command is below. It runs FASTQC in a for loop over any files whose names begin with “2112_lane1_” followed by A, T, C, or G, and outputs the analyses to the Arabidopsis folder on Eagle:

$for file in /Volumes/nightingales/C_virginica/2112_lane1_[ATCG]*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done

From within the Eagle/Arabidopsis folder, I renamed the FASTQC output files to prepend today’s date:

$for file in 2112_lane1_[ATCG]*; do mv "$file" "20150413_$file"; done

Then, I unzipped the .zip files generated by FASTQC in order to have access to the images, to eliminate the need for screen shots for display in this notebook entry:

$for file in 20150413_2112_lane1_[ATCG]*.zip; do unzip "$file"; done

The unzip output retained the old naming scheme, so I renamed the unzipped folders:

$for file in 2112_lane1_[ATCG]*; do mv "$file" "20150413_$file"; done
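The two renaming loops above are identical and rely on the glob no longer matching files that already carry the date. A slightly more defensive version is sketched below; the function name prepend_prefix is hypothetical, and this is an illustration rather than the command that was run. It skips anything that already starts with the prefix, so running it twice is harmless.

```shell
# Hypothetical helper: prepend a prefix to the base name of each path given.
# $1 = prefix (e.g. 20150413_), remaining arguments = files or folders.
# Paths should not end in a trailing slash.
prepend_prefix() {
    prefix="$1"
    shift
    for path in "$@"; do
        [ -e "$path" ] || continue            # unmatched glob, nothing to do
        case "$path" in
            */*) dir=${path%/*}; base=${path##*/} ;;
            *)   dir=.; base=$path ;;
        esac
        case "$base" in
            "$prefix"*) continue ;;           # already renamed, leave alone
        esac
        mv -- "$path" "$dir/$prefix$base"
    done
}
```

Usage would look like: prepend_prefix 20150413_ 2112_lane1_[ATCG]*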

The FASTQC results are linked below:

Sequence Data – LSU C.virginica Oil Spill MBD BS-Seq Demultiplexed

0000-0002-2747-368X

I had previously contacted Doug Turnbull at the Univ. of Oregon Genomics Core Facility for help demultiplexing this data, as it was initially returned to us as a single data set with “no index” (i.e. barcode) set for any of the libraries that were sequenced. As it turns out, when multiplexed libraries are sequenced using the Illumina platform, an index read step needs to be “enabled” on the machine for sequencing. Otherwise, the machine does not perform the index read step (since it wouldn’t be necessary for a single library). Surprisingly, the sample submission form for the Univ. of Oregon Genomics Core Facility doesn’t request any information regarding whether or not a submitted sample has been multiplexed. However, by default, they enable the index read step on all sequencing runs. I provided them with the barcodes and they demultiplexed the data after the fact.

I downloaded the new, demultiplexed files to Owl/nightingales/C_virginica:

lane1_ACAGTG_L001_R1_001.fastq.gz
lane1_ACAGTG_L001_R1_002.fastq.gz
lane1_ATCACG_L001_R1_001.fastq.gz
lane1_ATCACG_L001_R1_002.fastq.gz
lane1_ATCACG_L001_R1_003.fastq.gz
lane1_CAGATC_L001_R1_001.fastq.gz
lane1_CAGATC_L001_R1_002.fastq.gz
lane1_CAGATC_L001_R1_003.fastq.gz
lane1_GCCAAT_L001_R1_001.fastq.gz
lane1_GCCAAT_L001_R1_002.fastq.gz
lane1_TGACCA_L001_R1_001.fastq.gz
lane1_TTAGGC_L001_R1_001.fastq.gz
lane1_TTAGGC_L001_R1_002.fastq.gz

Notice that the file names now contain the corresponding index!

Renamed the files to prepend the order number to the file names:

$for file in lane1*; do mv "$file" "2112_$file"; done

New file names:

Updated the checksums.md5 file to include the new files. The command is written to exclude the previously downloaded files that are named “2112_lane1_NoIndex_”: [^N] is a shell glob bracket expression (not a regex) that skips any file with a capital ‘N’ at that position in the file name:

$for file in 2112_lane1_[^N]*; do md5 "$file" >> checksums.md5; done
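To illustrate the exclusion pattern, here is a small sketch on empty stand-in files in a scratch folder (no sequence data involved; the scratch folder and the three file names chosen are only for the demo). It uses [!N], the POSIX spelling of the negated bracket; bash also accepts the [^N] form used above.

```shell
# Scratch folder with empty stand-in files (names only).
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/glob_demo.XXXXXX")
touch "$demo_dir/2112_lane1_NoIndex_L001_R1_001.fastq.gz" \
      "$demo_dir/2112_lane1_ACAGTG_L001_R1_001.fastq.gz" \
      "$demo_dir/2112_lane1_TTAGGC_L001_R1_001.fastq.gz"

# [!N] matches any single character except a capital N at that position,
# so the NoIndex file is skipped.
matched=$(cd "$demo_dir" && for file in 2112_lane1_[!N]*; do echo "$file"; done)
echo "$matched"

rm -r "$demo_dir"
```

Only the ACAGTG and TTAGGC names are printed.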

Updated the readme.md file to reflect the addition of these new files.

Epinext Adaptor 1 Counts – LSU C.virginica Oil Spill Samples

Before contacting the Univ. of Oregon facility for help with this sequence demultiplexing dilemma, I contacted Epigentek to find out the sequence of the other adaptor used in the EpiNext Post-Bisulfite DNA Library Preparation Kit (Illumina). I used grep and fastx_barcode_splitter to determine how many reads (if any) contained this adaptor sequence. All analysis was performed in the Jupyter (IPython) notebook embedded below.

NBviewer: 20150317_LSU_OilSpill_EpinextAdaptor1_ID.ipynb

Results:

This adaptor sequence is not present in any of the reads in the FASTQ file analyzed.

TruSeq Adaptor Counts – LSU C.virginica Oil Spill Sequences

Initial analysis, comparing barcode identification methods, revealed the following info about demultiplexing on untrimmed sequences:

Using grep:

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

Using fastx_barcode_splitter:

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually a percentage)

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

Decided to determine what percentage of the sequences in this FASTQ file have just the beginning of the adaptor sequence (up to the 6bp barcode/index):

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

This was done to see if the numbers increased without the barcode index (i.e. to see if the majority of sequences are being generated from “empty” adaptors lacking barcodes).

The analysis was performed in a Jupyter (IPython) notebook and the notebook is linked, and embedded, below.

NBViewer: 20150316_LSU_OilSpill_Adapter_ID.ipynb
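The notebook's code is not reproduced here. As a rough idea of the kind of count involved, the sketch below tallies exact matches to a sequence in the reads of a FASTQ stream. It is my own illustration, not the notebook's commands: the function name adaptor_match_counts is hypothetical, it assumes standard 4-line FASTQ records, and it only finds exact matches (fastx_barcode_splitter can allow mismatches, so its numbers need not agree).

```shell
# Hypothetical helper: count reads containing an exact copy of a sequence.
# $1 = sequence to look for; FASTQ is read from standard input.
# Only the sequence line of each 4-line record is examined.
adaptor_match_counts() {
    awk -v adaptor="$1" '
        NR % 4 == 2 {
            total++
            pos = index($0, adaptor)      # 0 if absent, 1 if at read start
            if (pos > 0)  anywhere++
            if (pos == 1) at_start++
        }
        END {
            if (total == 0) { print "no reads found"; exit }
            printf "anywhere: %d of %d reads (%.2f%%)\n", anywhere, total, 100 * anywhere / total
            printf "at start: %d of %d reads (%.2f%%)\n", at_start, total, 100 * at_start / total
        }
    '
}
```

For a gzipped file, the usage would be along the lines of: gunzip -c file.fastq.gz | adaptor_match_counts GATCGGAAGAGCACACGTCTGAACTCCAGTCAC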

Results:

Using grep:

15% of the sequences match

That’s about 3 percentage points more than when the adaptor and barcode are searched as one sequence.

Using fastx_barcode_splitter:

beginning of line – 17% match

end of line – 0.06% match

The beginning of line matches are ~2 percentage points higher than when the adaptor and barcode are searched as one sequence.

Will contact Univ. of Oregon to see if they can shed any light and/or help with the demultiplexing dilemma we have here. Lots of sequence, but how did it get generated if adaptors aren’t present on all of the reads?

Sequencing Data – LSU C.virginica MBD BS-Seq

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has been completed by the Univ. of Oregon Genomics Core Facility (order number 2112).

Samples sequenced/pooled for this run:

Sample	Treatment	Barcode
HB2	25,000ppm oil	ATCACG
HB16	25,000ppm oil	TTAGGC
HB30	25,000ppm oil	TGACCA
NB3	No oil	ACAGTG
NB6	No oil	GCCAAT
NB11	No oil	CAGATC
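For use on the command line, the table above can be written as a small lookup. This is only a convenience sketch; the function name sample_for_index is hypothetical, and the pairs are copied from the table.

```shell
# Hypothetical helper: print the sample name for a 6 bp barcode,
# using the sample/barcode pairs from the table above.
sample_for_index() {
    case "$1" in
        ATCACG) echo "HB2" ;;     # 25,000ppm oil
        TTAGGC) echo "HB16" ;;    # 25,000ppm oil
        TGACCA) echo "HB30" ;;    # 25,000ppm oil
        ACAGTG) echo "NB3" ;;     # No oil
        GCCAAT) echo "NB6" ;;     # No oil
        CAGATC) echo "NB11" ;;    # No oil
        *) echo "unknown index: $1" >&2; return 1 ;;
    esac
}
```

For example, "sample_for_index CAGATC" prints NB11, and an unrecognized barcode returns a non-zero status.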

All code listed below was run on OS X 10.9.5

Downloaded all 15 fastq.gz files to Owl/web/nightingales/C_virginica:

$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_001.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_002.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_003.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_004.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_005.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_006.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_007.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_008.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_009.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_010.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_011.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_012.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_013.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_014.fastq.gz
$curl -O http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_015.fastq.gz
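The 15 downloads above differ only in the three-digit file number, so a loop could generate them. The sketch below is an alternative, not what was run: the names base_url, remote_name, and download_all are hypothetical, and it saves each file directly under a 2112_ name with curl -o instead of using -O and renaming afterwards. The URL is quoted so the shell never treats the ? as a wildcard.

```shell
# Download location copied from the commands above.
base_url='http://gcf.uoregon.edu:8080/job/download/2112'

# Hypothetical helper: remote file name for file number $1 (1 to 15).
remote_name() {
    printf 'lane1_NoIndex_L001_R1_%03d.fastq.gz' "$1"
}

# Hypothetical helper: fetch all 15 files into the current folder.
# -f makes curl return an error on an HTTP failure instead of saving
# the server's error page.
download_all() {
    i=1
    while [ "$i" -le 15 ]; do
        name=$(remote_name "$i")
        curl -f -o "2112_$name" "$base_url?fileName=$name" || return 1
        i=$((i + 1))
    done
}
```

Nothing is downloaded until download_all is called from the target folder.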

Renamed all files by removing the beginning of each file name (2112?fileName=) and replacing that with 2112_:

$for file in 2112*lane1_NoIndex_L001_R1_0*; do mv "$file" "${file/#2112?fileName=/2112_}"; done
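A note on the substitution above: inside ${file/#.../...} the ? is a wildcard that matches any single character, which happens to include the literal question mark in these names. The sketch below shows an equivalent that matches the ? literally and uses only POSIX parameter expansion; the function name clean_name is hypothetical.

```shell
# Hypothetical helper: print the cleaned-up name for a file saved by curl -O.
# $1 = name such as 2112?fileName=lane1_NoIndex_L001_R1_001.fastq.gz
clean_name() {
    # Quoting the pattern makes the ? match only a literal question mark.
    stripped=${1#'2112?fileName='}
    printf '2112_%s\n' "$stripped"
}

clean_name '2112?fileName=lane1_NoIndex_L001_R1_001.fastq.gz'
```

The call on the last line prints 2112_lane1_NoIndex_L001_R1_001.fastq.gz. In a renaming loop the same helper would be used as: mv "$file" "$(clean_name "$file")"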

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file cannot be all upper-case.

Created MD5 checksums for each fastq.gz file: checksums.md5

$md5 *.gz >> checksums.md5
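The checksums are only useful if they get checked later. Here is a minimal sketch for re-verifying files against checksums.md5, assuming the list holds lines in the "MD5 (name) = hash" layout that the OS X md5 command writes. The function names file_md5 and verify_md5_list are hypothetical; md5sum is used where it is available and md5 -q otherwise.

```shell
# Hypothetical helper: print only the MD5 hash of file $1.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{print $1}'
    else
        md5 -q "$1"
    fi
}

# Hypothetical helper: check every "MD5 (name) = hash" line in list $1
# against the files in the current folder. Returns 1 if any file differs.
verify_md5_list() {
    status=0
    while IFS= read -r line; do
        case "$line" in
            'MD5 ('*') = '*) ;;
            *) continue ;;                 # skip lines in any other layout
        esac
        expected=${line##*' = '}
        name=${line#'MD5 ('}
        name=${name%') = '*}
        if [ "$(file_md5 "$name")" = "$expected" ]; then
            echo "OK $name"
        else
            echo "FAILED $name"
            status=1
        fi
    done < "$1"
    return "$status"
}
```

Run from the folder holding the files: verify_md5_list checksums.md5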

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
