One metric for evaluating the quality of a transcriptome assembly is to examine the number of transcripts that were assembled that appear to be full-length or nearly full-length. Such an analysis with a reference transcript set, such as from human or mouse, is relatively straightforward, since one can align the assembled transcripts to the reference transcripts and examine the length coverage. For non-model organisms, no such reference transcript set is available. If a high quality annotation exists for a closely related organism, then one might compare the assembled transcripts to that closely related transcriptome to examine full-length coverage. In other cases, a more general analysis to perform is to align the assembled transcripts against all known proteins and to determine the number of unique top matching proteins that align across more than X% of its length.

Trinity supports these analyses using BLAST+.

Useful protein databases to search include SwissProt and TrEMBL.

To examine the extent of top-matching BLAST alignments, first run BLAST, and then run the included analysis script below:

Build a blastable database:
%  makeblastdb -in uniprot_sprot.fasta -dbtype prot
Perform the blast search, reporting only the top alignment:
% blastx -query Trinity.fasta -db uniprot_sprot.fasta -out blastx.outfmt6 \
         -evalue 1e-20 -num_threads 6 -max_target_seqs 1 -outfmt 6

This will generate a blast output file called blastx.outfmt6 in tab-delimited format.

Note
Type blastx -help for a description of the tab-delimited output format.

Next, examine the percent of the target being aligned to by the best matching Trinity transcript, like so:

% util/analyze_blastPlus_topHit_coverage.pl blastx.outfmt6 Trinity.fasta uniprot_sprot.fasta

This will create a file called blastx.outfmt6.txt.w_pct_hit_length that adds fields to the blastx output file, including the top hit’s length, percent of the hit’s length included in the alignment to the Trinity transcript, and the header description for that database entry.

In addition, the distribution of percent length coverage for the top matching database entries is provided as follows.

hit_pct_cov_bin count_in_bin >bin_below

100

3242

3242

90

268

3510

80

186

3696

70

202

3898

60

216

4114

50

204

4318

40

164

4482

30

135

4617

20

76

4693

10

0

4693

0

0

4693

Note
Only the single best matching Trinity transcript is reported for each top matching database entry; in other words, target database entries are counted uniquely. If a target protein matches multiple Trinity transcripts as their best hits, that target protein is counted only once along with that Trinity transcript that provides the highest BLAST bit score and longest match length.

Statements we can make based on this table include:

  • There are 268 proteins that each match a Trinity transcript by >80% and ⇐ 90% of their protein lengths.

  • There are 3510 proteins that are represented by nearly full-length transcripts, having >80% alignment coverage.

  • There are 3242 proteins that are covered by more than 90% of their protein lengths.

Note
Similar analyses can be performed using nucleotide databases, running blastn instead of blastx.