Tag Archives: Crassostrea virginica

TruSeq Adaptor Counts – LSU C.virginica Oil Spill Sequences

Initial analysis, comparing barcode identification methods, revealed the following info about demultiplexing on untrimmed sequences:

Using grep:

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

Using fastx_barcode_splitter:

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually a percentage)

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

 

Decided to determine what percentage of the sequences in this FASTQ file have just the beginning of the adaptor sequence (up to the 6bp barcode/index):

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

This was done to see if the numbers increased without the barcode index (i.e. to see if the majority of sequences are being generated from “empty” adaptors lacking barcodes).
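A minimal sketch of this kind of count, written in Python rather than the grep and fastx_barcode_splitter calls used in the notebook. Only the adaptor sequence comes from the text above; the function name and the two made-up reads are hypothetical. For a real file, something like gzip.open(path, "rt") could supply the handle.

```python
import io

# Start of the TruSeq adaptor, up to (not including) the 6bp index; taken from the text.
ADAPTOR_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

def fraction_with_adaptor(fastq_handle, adaptor=ADAPTOR_PREFIX):
    """Return (matching_reads, total_reads, percent) for a 4-line-per-record FASTQ."""
    total = 0
    matches = 0
    for line_number, line in enumerate(fastq_handle):
        # In a standard FASTQ record the sequence is the 2nd of every 4 lines.
        if line_number % 4 == 1:
            total += 1
            if adaptor in line:
                matches += 1
    percent = 100.0 * matches / total if total else 0.0
    return matches, total, percent

# Two made-up reads (hypothetical data): one carries the adaptor prefix, one does not.
seq1 = "ACGT" + ADAPTOR_PREFIX + "ATCACG"
seq2 = "TTTTGGGGCCCCAAAA"
example = io.StringIO(
    f"@read1\n{seq1}\n+\n{'I' * len(seq1)}\n"
    f"@read2\n{seq2}\n+\n{'I' * len(seq2)}\n"
)
result = fraction_with_adaptor(example)
print(result)
```

Unlike a plain grep over the whole file, this only looks at sequence lines, so the counts could differ slightly from a line-based search.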

The analysis was performed in a Jupyter (IPython) notebook and the notebook is linked, and embedded, below.

NBViewer: 20150316_LSU_OilSpill_Adapter_ID.ipynb

 

Results:

Using grep:

15% of the sequences match

That’s about 3% more than when the adaptor and barcode are searched as one sequence.

Using fastx_barcode_splitter:

beginning of line – 17% match

end of line – 0.06% match

The beginning of line matches are ~2% higher than when the adaptor and barcode are searched as one sequence.

Will contact Univ. of Oregon to see if they can shed any light and/or help with the demultiplexing dilemma we have here. Lots of sequence, but how did it get generated if adaptors aren’t present on all of the reads?

TruSeq Adaptor Identification Method Comparison – LSU C.virginica Oil Spill Sequences

We recently received Illumina HiSeq2500 data back from this project. Initially looking at the data, something seems off. FASTQC shows that the quality drops off drastically over the last 20 bases of the reads. We also see a high degree of Illumina TruSeq adaptor/index sequences present in our data.

Since this sequencing run was multiplexed (i.e. multiple libraries were pooled and run together on the HiSeq), we need to demultiplex our sequences before performing any trimming. Otherwise, the trimming could remove the index (barcodes) sequences from the data and prevent us from separating out the different libraries from each other.

However, it turns out, demultiplexing is not a simple, straightforward task. There are a variety of programs available and they all have different options. I decided to compare TruSeq index identification using two programs:

-grep (grep is a standard Unix command line program that searches through files to find matches to user-provided patterns.)
-fastx_barcode_splitter.pl (fastx_barcode_splitter.pl is a component of the fastx_toolkit that searches through FASTQ files to identify matches to user-provided index/barcode sequences.)

The advantage(s) of using grep is that it’s extremely fast, easy to use, and already exists on most Unix-based computers (Linux, OS X), thus not requiring any software installation. The disadvantage(s) of using grep for a situation like this is that it is not amenable to allowing for mismatches and/or partial matches to the user-provided information.

The advantage(s) of using fastx_barcode_splitter.pl is that it can accept a user-defined number of mismatches and/or partial matches to the user-defined index/barcode sequences. The disadvantage(s) of using fastx_barcode_splitter.pl is that it requires the user to specify the expected location of the index/barcode sequence in the target sequence: either the beginning of the line or the end of the line. It will not search beyond the length(s) of the provided index/barcode sequences. That means if your index/barcode exists in the middle of your sequences, this program will not find it. Additionally, since this program doesn’t exist natively on Unix-based machines, it must be downloaded and installed by the user.
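To make the difference between the two search styles concrete, here is a small Python sketch. It is not how grep or fastx_barcode_splitter.pl are implemented (the latter also handles partial overlaps, which this ignores); the function names and example reads are hypothetical and only illustrate "exact match anywhere" versus "anchored match with mismatches allowed".

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def found_anywhere(read, barcode):
    """grep-style search: exact match at any position in the read."""
    return barcode in read

def found_at_start(read, barcode, max_mismatches=1):
    """Anchored search: compare only the first len(barcode) bases of the read."""
    window = read[:len(barcode)]
    return len(window) == len(barcode) and hamming(window, barcode) <= max_mismatches

def found_at_end(read, barcode, max_mismatches=1):
    """Anchored search: compare only the last len(barcode) bases of the read."""
    window = read[-len(barcode):]
    return len(window) == len(barcode) and hamming(window, barcode) <= max_mismatches

barcode = "ATCACG"            # index 1 from the sample list
middle_read = "GGGATCACGTTT"  # barcode sits in the middle of the read
noisy_read = "ATCTCGGGG"      # barcode at the start, with one mismatch

# Exact search finds the internal barcode; the anchored searches do not.
print(found_anywhere(middle_read, barcode), found_at_start(middle_read, barcode))
# Anchored search tolerates the mismatch; the exact search does not.
print(found_anywhere(noisy_read, barcode), found_at_start(noisy_read, barcode))
```

Each approach can miss reads the other finds, which is one plausible reason the two tools report such different percentages.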

So, I tested both of these programs to see how they compared at matching both long (the TruSeq adaptor/index sequences identified with FASTQC) and “short” (the actual 6bp index sequence) barcodes.

To simplify testing, only a single sequence file was used from the data set.

All analysis was done in a Jupyter (IPython) notebook.

FASTQC HTML file for easier viewing of FASTQC output.

NBViewer version of embedded notebook below.

 

Result:

grep

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

 

fastx_barcode_splitter

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually a percentage)

 

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

 

Overall, the comparison is interesting. However, the important take-home from this is that in the best-case scenario (grep, short barcodes), we’re only able to identify a barcode in 25% of the reads!

It should also be noted that my analysis only used sequences in one orientation. It would be a good idea to also do this analysis by searching with the reverse and reverse complements of these sequences.
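As a starting point for that follow-up, the reverse and reverse complement of each 6bp index could be generated with something like the sketch below. The barcodes are the ones assigned to the pooled samples; the helper names are hypothetical.

```python
# Translation table for complementing DNA bases (N is left as N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse(seq):
    """Sequence read backwards, without complementing."""
    return seq[::-1]

def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

# 6bp indices for the six pooled libraries.
BARCODES = {
    "HB2": "ATCACG",
    "HB16": "TTAGGC",
    "HB30": "TGACCA",
    "NB3": "ACAGTG",
    "NB6": "GCCAAT",
    "NB11": "CAGATC",
}

for sample, barcode in BARCODES.items():
    print(sample, barcode, reverse(barcode), reverse_complement(barcode))
```

Each of these variants could then be fed to the same grep or fastx_barcode_splitter searches used for the forward orientation.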

Sequencing Data – LSU C.virginica MBD BS-Seq

Our sequencing data (Illumina HiSeq2500, 100SE) for this project has been completed by the Univ. of Oregon Genomics Core Facility (order number 2112).

Samples sequenced/pooled for this run:

Sample   Treatment        Barcode
HB2      25,000ppm oil    ATCACG
HB16     25,000ppm oil    TTAGGC
HB30     25,000ppm oil    TGACCA
NB3      No oil           ACAGTG
NB6      No oil           GCCAAT
NB11     No oil           CAGATC

All code listed below was run on OS X 10.9.5

Downloaded all 15 fastq.gz files to Owl/web/nightingales/C_virginica:

$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_001.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_002.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_003.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_004.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_005.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_006.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_007.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_008.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_009.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_010.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_011.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_012.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_013.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_014.fastq.gz"
$curl -O "http://gcf.uoregon.edu:8080/job/download/2112?fileName=lane1_NoIndex_L001_R1_015.fastq.gz"

 

Renamed all files by removing the beginning of each file name (2112?fileName=) and replacing that with 2112_:

$for file in 2112*lane1_NoIndex_L001_R1_0*; do mv "$file" "${file/#2112?fileName=/2112_}"; done

 

Created a directory readme.md (markdown) file to list & describe directory contents: readme.md

$ls *.gz >> readme.md

Note: In order for the readme file to appear in the web directory listing, the file cannot be all upper-case.

 

Created MD5 checksums for each fastq.gz file: checksums.md5

$md5 *.gz >> checksums.md5
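For anyone re-checking the files on a machine without the OS X md5 command, a rough Python equivalent is sketched below. The function names are hypothetical, and the demo runs on a throwaway file rather than the real fastq.gz files; the output lines follow the "MD5 (file) = hash" layout that OS X md5 prints.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of_file(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a file, read in chunks so large fastq.gz files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_lines(directory=".", pattern="*.gz"):
    """One 'MD5 (name) = digest' line per matching file, sorted by file name."""
    return [
        f"MD5 ({path.name}) = {md5_of_file(path)}"
        for path in sorted(Path(directory).glob(pattern))
    ]

# Demo on a throwaway file so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.fastq.gz").write_bytes(b"hello")
    demo_lines = checksum_lines(tmp)
print(demo_lines)
```

Comparing these lines against checksums.md5 after any file transfer would confirm the files are unchanged.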

Library Cleanup – LSU C.virginica MBD BS Library

I was contacted by the sequencing facility at the University of Oregon regarding a sample quality issue with our library.  As evidenced by the electropherogram below, there is a great deal of adaptor primer dimer (the peak at 128bp):

 

This is a problem because such a high quantity of adaptor sequence will result in the majority of reads coming off the Illumina being just adaptor sequences.

With the remainder of the library sample prepared earlier, I performed the recommended clean up procedure for removing adaptor sequences in the EpiNext Post-Bisulfite DNA Library Preparation Kit – Illumina (Epigentek). Briefly:

  • Brought sample volume up to 20uL with NanoPure H2O (added 9.99uL)

  • Added equal volume of MQ Beads

  • Washed beads 3x w/80% EtOH

  • Eluted DNA w/12uL Buffer EB (Qiagen)

After clean up, quantified the sample via fluorescence using the Quant-iT DNA BR Kit (Life Technologies/Invitrogen).  Used 1uL of the sample and the standards.  All standards were run in duplicate and read on a FLx800 plate reader (BioTek).

Results are here: 20150122 – LSU_virginicaMBDlibraryCleanup

Library concentration = 2.46ng/uL

Brought the entire sample up to 20uL with Buffer EB (Qiagen) and a final concentration of 0.1% Tween-20 (required by the sequencing facility).

Sent sample to the University of Oregon to replace our previous submission.

Bisulfite NGS Library – LSU C.virginica Oil Spill MBD Bisulfite DNA Sequencing Submission

Combined the following libraries in equal quantities (17ng each) to create a single, multiplexed sample for sequencing (LSU_Oil_01):

  • HB2 – 1 (ATCACG)
  • HB16 – 3 (TTAGGC)
  • HB30 – 4 (TGACCA)
  • NB3 – 5 (ACAGTG)
  • NB6 – 6 (GCCAAT)
  • NB11 – 7 (CAGATC)

Quantified pooled libraries using the Quant-iT dsDNA BR Kit (Invitrogen) with a FLx800 plate reader (BioTek). Used 1μL of the pooled sample, run in duplicate. Used 1μL of standards, run in duplicate.

Results:

pooled libraries = 6.575ng/μL

Will submit to University of Oregon Genomics Core Facility for 100bp, single end Illumina HiSeq2500 sequencing. They need 10nM of sample. For a library with average size range of 300-400bp, this requires a sample volume of 20uL with a concentration of 2.28ng/μL in a solution of 0.1% Tween20 in Buffer EB (Qiagen).

Combined 6.94μL of pooled libraries with 13.06μL of 0.1% Tween20/EB solution.
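The arithmetic behind these volumes can be sketched as follows. Two values here are my assumptions, not stated above: an average fragment length of 350bp (the midpoint of 300-400bp) and 650 g/mol per base pair of dsDNA. Together they reproduce the 2.28ng/μL target. The function names are hypothetical.

```python
def nM_to_ng_per_uL(nM, mean_length_bp, g_per_mol_per_bp=650.0):
    """Convert a dsDNA molar concentration to a mass concentration.

    nmol/L * bp * g/(mol*bp) gives ng/L; dividing by 1e6 gives ng/uL.
    The 650 g/mol per bp default is an assumed average, not a value from the notebook.
    """
    return nM * mean_length_bp * g_per_mol_per_bp / 1e6

def dilution_volumes(stock_conc, target_conc, final_volume):
    """C1*V1 = C2*V2: return (stock volume, diluent volume)."""
    stock_volume = target_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume

# 10nM of an (assumed) 350bp average library.
target = nM_to_ng_per_uL(10, 350)
print(round(target, 3))  # 2.275, i.e. ~2.28ng/uL

# Pooled library at 6.575ng/uL diluted to 2.28ng/uL in 20uL.
stock_uL, diluent_uL = dilution_volumes(6.575, 2.28, 20)
print(round(stock_uL, 2), round(diluent_uL, 2))  # 6.94 13.06
```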

Submitted sample LSU_Oil_01 to University of Oregon Genomics Core Facility via O/N FedEx on dry ice. Sample was assigned order # 2112.

Bisulfite NGS Library Prep – LSU C.virginica Oil Spill MBD Bisulfite DNA and Emma’s C.gigas Larvae OA Bisulfite DNA (continued from yesterday)

Continued library prep from yesterday. Set up Library Amplification according to the protocol. The samples received the following Barcode Indices:

  • HB2 – 1 (ATCACG)
  • HB5 – 2 (CGATGT)
  • HB16 – 3 (TTAGGC)
  • HB30 – 4 (TGACCA)
  • NB3 – 5 (ACAGTG)
  • NB6 – 6 (GCCAAT)
  • NB11 – 7 (CAGATC)
  • NB21 – 12 (CTTGTA)
  • 1A1 – 2 (CGATGT)
  • 1A2 – 1 (ATCACG)
  • 6A1 – 4 (TGACCA)
  • 6A2 – 5 (ACAGTG)
  • 103B1 – 6 (GCCAAT)
  • 103B2 – 7 (CAGATC)
  • 105A4 – 12 (CTTGTA)
  • 105A5 – 11 (GGCTAC)

Due to differences in input DNA quantities, samples were run with different numbers of thermal cycles.

13 thermal cycles were run for the following samples:

  • 1A1
  • 105A4
  • 105A5

22 thermal cycles were run for the following samples:

  • HB2
  • HB5
  • HB16
  • HB30
  • NB3
  • NB6
  • NB11
  • NB21
  • 1A2
  • 6A1
  • 6A2
  • 103B1
  • 103B2

Samples were quantified with 1uL of each sample using the Quant-iT dsDNA BR Kit (Invitrogen). Used 5uL of each standard and standards were run in duplicate.

Results:

Bisulfite NGS Library Prep – LSU C.virginica Oil Spill Bisulfite DNA and Emma’s C.gigas Larvae OA Bisulfite DNA

Constructed next generation libraries (Illumina) using the bisulfite-treated DNA from yesterday using the EpiNext Post-Bisulfite DNA Library Preparation Kit – Illumina (Epigentek). Samples were processed according to the manufacturer’s protocol up to Section 8 (Library Amplification) with the following changes:

– Skipped Section 7.1 (recommended to do so in the protocol due to low quantity of input DNA)

Samples were stored O/N @ -20C.

dA Tailing Master Mix

10x Tailing Buffer 1.5uL x 17.6 = 26.4uL

Klenow 1uL x 17.6 = 17.6uL

H2O 0.5uL x 17.6 = 8.8uL

Add 3uL of master mix to each sample

Adaptor Ligation

2x Ligation Buffer 17uL x 17.6 = 299.2uL

T4 DNA Ligase 1uL x 17.6 = 17.6uL

Adaptors 1uL x 17.6 = 17.6uL

Added 19uL of master mix to each sample

dsDNA Conversion Master Mix

5x Conversion Buffer 4uL x 17.6 = 70.4uL

C.P. 2uL x 17.6 = 35.2uL

H2O 3uL x 17.6 = 52.8uL

Add 9uL of master mix to each sample

End Repair

10x Buffer 2uL x 17.6 = 35.2uL

Enzyme 1uL x 17.6 = 17.6uL

H2O 5uL x 17.6 = 88uL

Added 8uL of master mix to each sample
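The master mix volumes above can be double-checked with a short Python sketch. The per-sample volumes and the 17.6 multiplier come from the lists above; my guess is that 17.6 is 16 samples plus 10% overage, but that is an assumption. The function and dictionary names are hypothetical.

```python
# Per-sample reagent volumes (uL) for each master mix, as listed above.
RECIPES = {
    "dA Tailing": {"10x Tailing Buffer": 1.5, "Klenow": 1.0, "H2O": 0.5},
    "Adaptor Ligation": {"2x Ligation Buffer": 17.0, "T4 DNA Ligase": 1.0, "Adaptors": 1.0},
    "dsDNA Conversion": {"5x Conversion Buffer": 4.0, "C.P.": 2.0, "H2O": 3.0},
    "End Repair": {"10x Buffer": 2.0, "Enzyme": 1.0, "H2O": 5.0},
}

def master_mix(per_sample_uL, factor=17.6):
    """Scale per-sample volumes by the sample factor; round to 0.1uL for pipetting."""
    return {reagent: round(volume * factor, 1) for reagent, volume in per_sample_uL.items()}

def volume_per_sample(per_sample_uL):
    """Total master mix volume to add to each sample."""
    return sum(per_sample_uL.values())

for step, recipe in RECIPES.items():
    print(step, master_mix(recipe), "add per sample:", volume_per_sample(recipe), "uL")
```

The per-sample totals work out to 3, 19, 9, and 8uL, matching the volumes added at each step.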

Bisulfite Conversion – LSU C.virginica Oil Spill MBD DNA and Emma’s C.gigas Larvae OA DNA

Performed bisulfite conversion on MBD DNA samples from LSU C.virginica oil spill samples (see 20141202 and 20141126) and Emma’s C.gigas larvae OA DNA samples (see 20141121) with the Methylamp DNA Modification Kit (Epigentek).

Added 4uL of H2O to each of Emma’s DNA samples to bring them up to 24uL.

Samples were processed according to the manufacturer’s protocol.

Samples were eluted with 10uL of Solution R6 and stored @ -20C.

EtOH Precipitation – LSU C.virginica Oil Spill MBD Continued (from 20141126)

Precipitation was continued according to the MethylMiner Methylated DNA Enrichment Kit (Invitrogen). Since I will need sample volumes of 24uL for the subsequent bisulfite conversion, I resuspended the samples in 29uL of water (will use 2.5uL x 2 reps for quantification).

Samples to be quantified:

NC = non-captured (i.e. non-methylated)

E = eluted (i.e. methylated)

  • HB2 NC
  • HB5 NC
  • HB16 NC
  • HB30 NC
  • NB3 NC
  • NB6 NC
  • NB11 NC
  • NB21 NC
  • HB2 E
  • HB5 E
  • HB16 E
  • HB30 E
  • NB3 E
  • NB6 E
  • NB11 E
  • NB21 E
  • Control NC
  • Control E

Samples were quantified using the Quant-iT BS Kit (Invitrogen) with a plate reader (BioTek). All samples were run in duplicate. Used 2.5uL of each sample for quantification.

Samples were stored @ -20C (FTR 209) in the bisulfite seq box created by Claire for this project.

Results:

20141202_LSU_Virginica_MBD:

https://docs.google.com/spreadsheets/d/1NrrVmYsUQcstnrt4583mYN2PeVav54luyFvVUEkcjWE/edit?usp=sharing

Methylated DNA Enrichment (MBD) – LSU C.virginica Oil Spill gDNA

Enrichment was performed using the MethylMiner Methylated DNA Enrichment Kit (Invitrogen) according to the manufacturer’s protocol with the following changes:

– Used 25uL of Dynabeads M-280 (10uL/ug of input DNA) and 15uL of MBD-Biotin Protein (7uL/ug of input DNA).

– Followed the corresponding instructions for the volumes listed above and for quantities of input DNA > 1ug – 10ug

– A single elution with 2000mM NaCl was performed

– EtOH precipitation: Samples were incubated over the long weekend at -80C.