We recently received Illumina HiSeq2500 data back from this project. Initially looking at the data, something seems off. Using FASTQC, the quality drops of drastically towards the last 20 bases of the reads. We also see a high degree of Illumina TruSeq adaptor/index sequences present in our data.
Since this sequencing run was multiplexed (i.e. multiple libraries were pooled and run together on the HiSeq), we need to demultiplex our sequences before performing any trimming. Otherwise, the trimming could remove the index (barcodes) sequences from the data and prevent us from separating out the different libraries from each other.
However, it turns out, demultiplexing is not a simple, straightforward task. There are a variety of programs available and they all have different options. I decided to compare TruSeq index identification using two programs:
-grep (grep is a built-in command line (bash) program that searches through files to find matches to user-provided information.)
-fastx_barcode_splitter.pl (fastx_barcode_splitter.pl is a component of the fastx_tookit that searches through FASTQ files to identify matches to user-provided index/barcode sequences.)
The advantage(s) of using grep is that it’s extremely fast, easy to use, and already exists on most Unix-based computers (Linux, OS X), thus not requiring any software installation. The disadvantage(s) of using grep for a situation like this is that it is not amenable to allowing for mismatches and/or partial matches to the user-provided information.
The advantage(s) of using fastx_barcode_splitter.pl is that it can accept a user-defined number of mismatches and/or partial matches to the user-defined index/barcode sequences. The disadvantage(s) of using fastx_barcode_splitter.pl is that it requires the user to specify the expected location of the index/barcode sequence in the target sequence: either the beginning of the line or the end of the line. It will not search beyond the length(s) of the provided index/barcode sequences. That means if you index/barcode exists in the middle of your sequences, this program will not find it. Additionally, since this program doesn’t exist natively on Unix-based machines, it must be downloaded and installed by the user.
So, I tested both of these programs to see how they compared at matching both long (the TruSeq adaptor/index sequences identified with FASTQC) and “short” (the actual 6bp index sequence) barcodes.
To simplify testing, only a single sequence file was used from the data set.
All analysis was done in a Jupyter (IPython) notebook.
FASTQC HTML file for easier viewing of FASTQC output.
NBViewer version of embedded notebook below.
long barcodes: Found in ~12% of all reads
short barcodes: Found in ~25% of all reads
long barcodes, beginning of line: Found in ~15% of all reads
long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually percentage)
short barcodes, beginning of line: Found in ~1.3% of all reads
short barcodes, end of line: Found in ~2.7% of all reads
Overall, the comparison is interesting, however, the important take home from this is that in the best-case scenario (grep, short barcodes), we’re only able to identify 25% of the reads in our sequences!
It should also be noted that my analysis only used sequences in one orientation. It would be a good idea to also do this analysis by searching with the reverse and reverse complements of these sequences.