Epinext Adaptor 1 Counts – LSU C.virginica Oil Spill Samples

Before contacting the Univ. of Oregon facility for help with this sequence demultiplexing dilemma, I contacted Epigentek to find out what the other adaptor sequence that is used in the EpiNext Post-Bisulfite DNA Library Preparation Kit (Illumina). I used grep and fastx_barcode_splitter to determine how many reads (if any) contained this adaptor sequence. All analysis was performed in the embedded Jupyter (IPython) notebook embedded below.

NBviewer: 20150317_LSU_OilSpill_EpinextAdaptor1_ID.ipynb

 

Results:

This adaptor sequence is not present in any of the reads in the FASTQ file analyzed.

TruSeq Adaptor Counts – LSU C.virginica Oil Spill Sequences

Initial analysis, comparing barcode identification methods, revealed the following info about demultiplexing on untrimmed sequences:

Using grep:

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

Using fastx_barcode_splitter:

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually percentage)

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

 

Decided to determine what percentage of the sequences in this FASTQ file have just the beginning of the adaptor sequence (up to the 6bp barcode/index):

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

This was done to see if the numbers increased without the barcode index (i.e. see if majority of sequences are being generated from “empty” adaptors lacking barcodes).

The analysis was performed in a Jupyter (IPython) notebook and the notebook is linked, and embedded, below.

NBViewer: 20150316_LSU_OilSpill_Adapter_ID.ipynb

 

Results:

Using grep:

15% of the sequences match

That’s about 3% more than when the adaptor and barcode are searched as one sequence.

Using fastx_barcode_splitter:

beginning of line – 17% match

end of line – 0.06% match

The beginning of line matches are ~2% higher than when the adaptor and barcode are searched as one sequence.

Will contact Univ. of Oregon to see if they can shed any light and/or help with the demultiplexing dilemma we have here. Lots of sequence, but how did it get generated if adaptors aren’t present on all of the reads?

TruSeq Adaptor Identification Method Comparison – LSU C.virginica Oil Spill Sequences

We recently received Illumina HiSeq2500 data back from this project. Initially looking at the data, something seems off.  Using FASTQC, the quality drops of drastically towards the last 20 bases of the reads. We also see a high degree of Illumina TruSeq adaptor/index sequences present in our data.

Since this sequencing run was multiplexed (i.e. multiple libraries were pooled and run together on the HiSeq), we need to demultiplex our sequences before performing any trimming. Otherwise, the trimming could remove the index (barcodes) sequences from the data and prevent us from separating out the different libraries from each other.

However, it turns out, demultiplexing is not a simple, straightforward task. There are a variety of programs available and they all have different options. I decided to compare TruSeq index identification using two programs:

-grep (grep is a built-in command line (bash) program that searches through files to find matches to user-provided information.)
-fastx_barcode_splitter.pl (fastx_barcode_splitter.pl is a component of the fastx_tookit that searches through FASTQ files to identify matches to user-provided index/barcode sequences.)

The advantage(s) of using grep is that it’s extremely fast, easy to use, and already exists on most Unix-based computers (Linux, OS X), thus not requiring any software installation. The disadvantage(s) of using grep for a situation like this is that it is not amenable to allowing for mismatches and/or partial matches to the user-provided information.

The advantage(s) of using fastx_barcode_splitter.pl is that it can accept a user-defined number of mismatches and/or partial matches to the user-defined index/barcode sequences. The disadvantage(s) of using fastx_barcode_splitter.pl is that it requires the user to specify the expected location of the index/barcode sequence in the target sequence: either the beginning of the line or the end of the line. It will not search beyond the length(s) of the provided index/barcode sequences. That means if you index/barcode exists in the middle of your sequences, this program will not find it. Additionally, since this program doesn’t exist natively on Unix-based machines, it must be downloaded and installed by the user.

So, I tested both of these programs to see how they compared at matching both long (the TruSeq adaptor/index sequences identified with FASTQC) and “short” (the actual 6bp index sequence) barcodes.

To simplify testing, only a single sequence file was used from the data set.

All analysis was done in a Jupyter (IPython) notebook.

FASTQC HTML file for easier viewing of FASTQC output.

NBViewer version of embedded notebook below.

 

Result:

grep

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

 

fastx_barcode_splitter

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually percentage)

 

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

 

Overall, the comparison is interesting, however, the important take home from this is that in the best-case scenario (grep, short barcodes), we’re only able to identify 25% of the reads in our sequences!

It should also be noted that my analysis only used sequences in one orientation. It would be a good idea to also do this analysis by searching with the reverse and reverse complements of these sequences.

DNA Quantification – Claire’s C.gigas Sheared DNA

In an attempt to obtain the most accurate measurement of Claire’s sheared, heat shock mantle DNA, I quantified the samples using a third method: fluorescence.

Samples were quantified using the Quant-It DNA BR Kit (Life Technologies/Invitrogen) according the manufacturer’s protocol. Standards were run in triplicate. Due to low sample volumes, only 1μL of each sample was used and was not replicated.

Plate was read on a Perkin Elmer plate reader using the Wallac software. The plate was measured three times, with each well measured for a one second duration on each read.

 

Results:

Spreadsheet: 20150303_gigasHSshearedDNApico

 

 

Comparison of NanoDrop1000, Bioanalyzer, and fluorescence measurements:

Sample NanoDrop (ng/μL) Bioanalyzer (ng/μL) Fluorescence (ng/μL)
2M sheared 48.03 16.28 4.91
4M sheared 190.96 58.52 48.10
6M sheared 141.56 42.98 28.42
2MHS sheared 221.93 32.45 13.48
4MHS sheared 257.48 43.82 11.75
6MHS sheared 202.02 51.12 8.97

 

Not entirely surprising, but the fluorescence method is clearly the most conservative measurement of the three methods. However, I do find the difference between the Bioanalyzer and fluorescence measurements very surprising. I suspected the Bioanalyzer would underestimate the concentrations because I actively selected which peak regions to measure, possibly leaving out some aspect of the sample.

Regardless, will use the most conservative measurements (fluorescence) for decision making.

With our yields, we have insufficient DNA to conduct MeDIP and then subsequent bisulfite conversion and library prep on our own. The recovery from the MeDIP will result in too little input DNA for bisulfite conversion and, in turn, library prep.

However, we do have sufficient quantities of starting DNA (>200ng) for Epigentek’s MeDIP Methyl-seq. I have contacted Epigentek to see if their procedure includes bisulfite conversion after MeDIP (which the website workflow suggests that it does not).

Library Quality Assessment – C.gigas OA larvae Illumina libraries

Ran the 400ppm library and the 1000ppm library preps on a DNA1000 Assay Chip (Agilent) on the Agilent 2100 Bioanalyzer.

 

Results:

Data File (XAD): 2100_expert_DNA_1000_DE72902486_2015-03-02_09-18-02.xad

Electropherogram overlay of both samples:

Red = 400ppm

Blue = 1000ppm

 

 

 

Measurement data and parameters are here: 20150302_Bioanalyzer_Cgigas_400_1000ppm_BS-Seq

 

Both libraries look good; no adaptor contamination (peak would be present at ~125bp), good library sizes.

Pooled equal quantities of each library, based off the concentration values above, to prepare the sample for sequencing.

Component Volume (μL) Quantity (ng)
400ppm library 10 14.7
1000ppm library 1.09 14.7
Buffer EB 7.81 N/A
1% Tween20 2.1 N/A
Total 21 N/A

 

The pooled libraries will be submitted tomorrow to the Genomics Core Facility at the Univ. of Oregon for high-throughput sequencing (100bp, SE) on the HiSeq2500 (Illumina). Sample order #2212.

BS-seq Library Prep – C.gigas Larvae OA 1000ppm

Bisulfite Conversion

Pooled 200ng each of the sheared 1B1 (4μL) & 1B2 (used the entire sample, 20μL) 5.13.11 1000ppm C.gigas larvae DNA samples for a total of 400ng. Total volume = 24μL.

Quantified the pooled DNA using the NanoDrop1000 (ThermoFisher) prior to initiating bisulfite conversion.

Clearly, the NanoDrop measurements differ from the expected concentration. NanoDrop suggests the total amount of input DNA is ~1400ng (58ng/μL x 24μL = 1392ng). This is most likely due to RNA carryover, as DNA quantification using a fluorescence-based, double-stranded DNA assay performed previously shows a drastically lower concentration.

Proceeded with bisulfite conversion using the Methylamp DNA Modification Kit (Epigentek) in 1.5mL tube, according to the manufacturer’s protocol:

  • Added 1μL to sample, incubated 10mins @ 37C in water bath
  • Made fresh R1/R2/R3 solution (1.1mL R3 buffer added to vial of R2, vortexed 2mins, 40μL R1 added to mixture – Remainder stored @ -20C in “-20C Kit Components Box”)
  • Added 125μL of R1/R2/R3 solution to sample, incubated 90mins @ 65C in heating block with water
  • Addd 300μL R4 to sample, mixed, transferred to column, spun 12,000RPM 30s
  • Added 200μL R5 to column, spun 12,000RPM 30s
  • Added 50μL R1/ethanol solution to column, incubated 8mins @ RT, spun 12,000RPM 30s
  • Washed column with 200μL of 90% EtOH, spun 12,000RPM 30s; repeated one time.
  • Eluted DNA with 12μL R6, spun 12,000RPM 30s

Quantified post-bisulfite-treated sample on NanoDrop1000:

Definitely a low yield (~108ng) relative to the input (~400ng). Will proceed with Illumina library prep.

 

Library Prep

Illumina library prep was performed with EpiNext Post-Bisulfite DNA Library Preparation Kit (Illumina) (Epigentek).  Changes to the manufacturer’s protocol:

  • Samples were transferred to 1.5mL snap cap tubes for all magnetic bead steps in order to fit in our tube magnets.
  • PCR cycles: 15

No other changes were made to the manufacturer’s protocol.

Epigentek Barcode Indices assigned, per their recommendations for using two libraries for multiplexing (this will be combined with the 400ppm library):

Barcode #12 – CTTGTA

The library was stored @ -20C and will be checked via Bioanalyzer on Monday.

DNA Quantification – C.gigas Larvae 1000ppm

After the discovery that there wasn’t any DNA in the BS-seq Illumina library prep and no DNA in the bisulfite-treated DNA pool, I decided to try to recover any residual DNA left in the 1B2 sample. Sample 1B2 (sheared on 20150109) was dry, so I added 20μL of Buffer EB (Qiagen) to the tube. I vortexed both the 1B1 and 1B2 samples and quantified on the NanoDrop1000 (ThermoFisher). I also re-quantified the pooled BS-treated sample that had been used as input DNA for the libraries.

Results:

Spreadsheet: 20150226_Claire_sheared_Emma_1000ppm_OD260s

Sample 1B1 has ample DNA in it. Since these samples are pools of larvae, we may be able to just proceed with this sample and not worry about pooling with the biological replicate 1B2.

Sample 1B2 has a low amount of DNA, but it’s a usable quantity (total 400ng).

Pooled samples has nothing.

Will make a new pool of DNA from both 1B1 and 1B2 and attempt to make a new bisulfite-treated library.

DNA Quantification – Claire’s Sheared C.gigas Mantle Heat Shock Samples

I previously checked Claire’s sheared DNA on the Bioanalyzer to verify the fragment size and to quantify the samples. Looking at her notebook, her numbers differ greatly from the Bioanalyzer, possibly due to the fact that the DNA1000 assay chip used only measures DNA fragments up to 1000bp in size. If her shearing was incomplete, then there would be DNA fragments larger than 1000bp that wouldn’t have been measured by the Bioanalyzer. So, I decided to quantify the samples on the NanoDrop1000 (ThermoFisher) again.

 

Results:

Spreadsheet: 20150226_Claire_sheared_Emma_1000ppm_OD260s

 

 

 

Comparison of NanoDrop1000 and Bioanalyzer measurements.

Sample NanoDrop (ng/μL) Bioanalyzer (ng/μL)
2M sheared 48.03 16.28
4M sheared 190.96 58.52
6M sheared 141.56 42.98
2MHS sheared 221.93 32.45
4MHS sheared 257.48 43.82
6MHS sheared 202.02 51.12

The NanoDrop is known to overestimate sample quantities due to the indiscriminate nature of UV spectrophotometry and that could be the reason for the large discrepancy between the two measurements (i.e. RNA carryover may lead to overestimation). As such, I’ll quantify the samples using a fluorescence-based assay for double stranded DNA tomorrow in hopes of getting the most accurate measurement.

Bioanalyzer – C.gigas Sheared DNA from 20140108

To complement MBD ChiP-seq data and RNA-seq data that we have from this experiment, we want to generate, at a minimum, some BS-seq data from the same C.gigas individuals used for the other aspects of this experiment.  Claire had previously isolated DNA and sheared the DNA on 20140108. If possible, we’d like to perform MBD enrichment, but the current quantities of DNA may prevent us from this.

To quantify the DNA and evaluate the shearing profile, I ran 1μL of each of the following mantle pre-/post-heat shock samples on a DNA 1000 chip (Agilent) on the Agilent 2100 Bioanalyzer. in the Seeb Lab:

M = mantle
HS = heat shocked

  • 2M sheared
  • 4M sheared
  • 6M sheared
  • 2M HS sheared
  • 4M HS sheared
  • 6M HS sheared

Results:

Bioanalyzer Data File (XAD): 2100_expert_DNA_1000_DE72902486_2015-02-19_11-32-35(2).xad

 

Electropherograms

2100 Bioanalyzer electropherograms of Claire’s sheared C.gigas DNA.

 

Spreadsheet: 2100 expert_DNA 1000_DE72902486_2015-02-19_11-32-35_Results_001

 

Claire’s notebook entry doesn’t ever specify what her target shear size was, but the Bioanalyzer analysis suggests an average size of ~500bp.

Also interesting to note is that Claire’s sample concentrations (as measured on the NanoDrop1000) are significantly greater than what is calculated by the Bioanalyzer. Since the Bioanalyzer chip used (DNA1000) only goes to 1000bp, is it possible the differences in concentrations is due to incomplete shearing of the samples (e.g. a significant portion of the DNA is >1000bp in size and thus not factored in to the Bioanlyzer concentrations calculations)?

Will check sample volumes and determine total amount of remaining DNA for each sample and then assess how to proceed next (i.e. MBD or just BS-seq).

UPDATE 20150226:

Sample volumes were measured and total quantity (ng) of DNA in each sample were added to the spreadsheet above.

Based on the quantities of DNA we have for each sample, will discuss sequencing options (e.g. MBD or not, self-prepare libraries or not, etc) with Steven.

 

Library Prep – Quantification of C.gigas larvae OA 1000ppm library

The completed BS Illumina library made on Friday (1000ppm) was quantified via fluorescence using the Quant-iT DNA BR Kit (Life Technologies/Invitrogen).  Also quantified Jake’s libraries. Used 1μL of  each sample and the standards.  All standards were run in duplicate.  Due to limited sample, the libraries were only processed singularly, without replication.  Fluorescence was measured on a FLx800 plate reader (BioTek), using the Gen5 (BioTek) software for all calculations.

Results:

20150209_CgigasOA_BSlibrraryQuants_OluridaLibraryQuants

The good news is that the standard curve looked fine, with an R²=0.998.

The bad news is that there’s no detectable DNA in the sample, just like last time.

Possibly something is totally shot with this sample?  Will quantify the sheared DNA and decide what to do.

I quantified the sheared DNA and there’s nothing there! Where did it go? I just don’t get it. It was sheared, speed-vac’d and resuspended.  All the DNA should still be in the tubes…