BLAST – C.gigas Larvae OA Illumina Data Against GenBank nt DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

Jupyter (IPython) Notebook : 20150501_Cgigas_larvae_OA_BLASTn_nt.ipynb

Notebook Viewer : 20150501_Cgigas_larvae_OA_BLASTn_nt

Results:

BLASTn Output File: 20150501_nt_blastn.tab

BLAST e-vals <= 0.001: 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Unique BLAST Species: 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt

 

Firstly, since this library was bisulfite converted, we know that matching won’t be as robust as we’d normally see.

However, the BLAST matches for this are terrible.

Only 0.65% of the BLAST matches (e-value <0.001) are to Crassostrea gigas. Yep, you read that correctly: 0.65%.

It’s nearly 40-fold less than the top species: Dictyostelium discoideum (a slime mold)

It’s 30-fold less than the next species: Danio rerio (zebra fish)

Then it’s followed up by human and mouse.

I think I will need to contact the Univ. of Oregon sequencing facility to see what their thoughts on this data is, because it’s not even remotely close to what we should be seeing, even with the bisulfite conversion…

6 comments

  1. You’ve made contigs of the bisulfite data? I feel like between trying to first assemble bisulfite data and then trying to align bisulfite data – where you expect 25% mismatches anyway (no C’s) – there could be enough ‘error’ built in that alignments may not necessarily hit C. gigas first?

    1. Yeah, Steven did a de-novo assembly with the BS-seq data because reads weren’t mapping to the C.gigas genome when using BS-Map. I took the de-novo-assembled BS-seq data (see this notebook entry) and BLASTed against the C.gigas genome and the results were not good. So, this particular BLAST against the GenBank nt DB was to potentially see if our sequencing data was actually “our” data by attempting to ID which species the data matched with the most. So, this wasn’t a “real” attempt at data analysis, it was mostly a long shot at seeing if our data was mostly comprised of another species, thus potentially being the wrong data set sent to use from the sequencing facility.

      1. Thanks for the reply and the link to the notebook entry. Yeah, the evalues are not too good – as expected for bisulfite data I would think. Just to make sure, when the bisulfite alignment was done in bsmap it included mapping to the CTOT and CTOB strands? The reads won’t align to the OT and OB strands – which I think is the default for both bsmap and bismark. I was getting similar mapping before I figured this out :)

      2. Sorry, just saw this. I know you already spoke to Steven about this, but I have no idea what you’re talking about with “CTOT” and “CTOB”! Steven’s been the one dealing with the bisulfite mapping. Thanks for the heads up, though! Does all this interaction with you on this project mean you’re going to come back to our lab? :)

  2. I haven’t done a BLAST with just raw sequencing reads (i.e. not already de-novo assembled into contigs), so not sure how it would play out. However, BLAST has an option (can’t remember if it’s “on” by default or if you have to use a flag in the BLAST arguments when running the command to enable it) that detects short query reads and takes their length in to account when evaluating database matches. This, in theory, should help ensure that matches using short query sequences are accurate. However, it probably would be prudent to run the BLAST and evaluate the top 10 matches (alignments, e-vals, bit score, etc) to see if the results agree with your expectations.

  3. This may be a silly question, but have you ever done this type of blast with an RNA-Seq dataset with reads of similar length? Yes, I would expect C. gigas would be the best hit, but is there something inherent in the database where ‘shortish’ reads would map to other species as a top hit?

Leave a Reply

Your email address will not be published. Required fields are marked *


e.g. 0000-0002-7299-680X

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>