Tag Archives: demultiplexing

Epinext Adaptor 1 Counts – LSU C.virginica Oil Spill Samples

Before contacting the Univ. of Oregon facility for help with this sequence demultiplexing dilemma, I contacted Epigentek to find out what the other adaptor sequence that is used in the EpiNext Post-Bisulfite DNA Library Preparation Kit (Illumina). I used grep and fastx_barcode_splitter to determine how many reads (if any) contained this adaptor sequence. All analysis was performed in the embedded Jupyter (IPython) notebook embedded below.

NBviewer: 20150317_LSU_OilSpill_EpinextAdaptor1_ID.ipynb

 

Results:

This adaptor sequence is not present in any of the reads in the FASTQ file analyzed.

TruSeq Adaptor Counts – LSU C.virginica Oil Spill Sequences

Initial analysis, comparing barcode identification methods, revealed the following info about demultiplexing on untrimmed sequences:

Using grep:

long barcodes: Found in ~12% of all reads

short barcodes: Found in ~25% of all reads

Using fastx_barcode_splitter:

long barcodes, beginning of line: Found in ~15% of all reads

long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually percentage)

short barcodes, beginning of line: Found in ~1.3% of all reads

short barcodes, end of line: Found in ~2.7% of all reads

 

Decided to determine what percentage of the sequences in this FASTQ file have just the beginning of the adaptor sequence (up to the 6bp barcode/index):

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

This was done to see if the numbers increased without the barcode index (i.e. see if majority of sequences are being generated from “empty” adaptors lacking barcodes).

The analysis was performed in a Jupyter (IPython) notebook and the notebook is linked, and embedded, below.

NBViewer: 20150316_LSU_OilSpill_Adapter_ID.ipynb

 

Results:

Using grep:

15% of the sequences match

That’s about 3% more than when the adaptor and barcode are searched as one sequence.

Using fastx_barcode_splitter:

beginning of line – 17% match

end of line – 0.06% match

The beginning of line matches are ~2% higher than when the adaptor and barcode are searched as one sequence.

Will contact Univ. of Oregon to see if they can shed any light and/or help with the demultiplexing dilemma we have here. Lots of sequence, but how did it get generated if adaptors aren’t present on all of the reads?