Trinity: Abundance Estimation Using RSEM

RSEM can be used to estimate transcript abundance for transcriptome assemblies. The version of RSEM that is current as of the release date of each Trinity release is bundled with the Trinity software package.

To run the included version of RSEM, execute the following:

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl

#########################################################################
#
#  --transcripts <string>           transcript fasta file
#  --seqType <string>              fq|fa
#
#  If Paired-end:
#
#  --left <string>
#  --right <string>
#
#    or Single-end:
#
#  --single <string>
#
#
#
# Optional:
#
# --prefix <string>                prefix for RSEM output files (default: 'RSEM')
#
# --SS_lib_type <string>           strand-specific library type:  paired('RF' or 'FR'), single('F' or 'R').
#
# --no_group_by_component          Not Trinity transcripts (use --gene_trans_map to specify gene/trans relationships)
#
# --thread_count                   number of threads to use (default = 4)
#
# --debug                  retain intermediate files
#
#####################
#  Non-Trinity options:
#
#  --gene_trans_map <string>        file containing 'gene(tab)transcript' identifiers per line.
#
#
#########################################################################
#
#  To pass additional parameters to rsem-calculate-expression,
#    type ' -- ' followed by additional pass-through params
#
#########################################################################

An example command with paired-end (PE) reads would be:

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts Trinity.fasta \
        --seqType fq --left left.reads.fq --right right.reads.fq

Note	If you have strand-specific data, be sure to include the --SS_lib_type parameter.

Note	The run_RSEM_align_n_estimate.pl script simply maps the familiar Trinity parameters to those of the RSEM software and then executes RSEM accordingly. The RSEM command generated and executed will be shown via stdout. If you should encounter problems with running RSEM, please contact the RSEM developers and provide them with the corresponding auto-generated RSEM command, not the run_RSEM_align_n_estimate.pl parameters.
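
For scripted runs, it can help to assemble the wrapper invocation programmatically. The following is a minimal sketch, not part of Trinity: the helper name build_wrapper_command and the install path /opt/trinity are assumptions made for illustration, and only flags listed in the usage message above are emitted.

```python
import shlex


def build_wrapper_command(trinity_home, transcripts, seq_type,
                          left=None, right=None, single=None,
                          ss_lib_type=None, prefix=None, thread_count=None,
                          passthrough=()):
    """Return an argument list for run_RSEM_align_n_estimate.pl (hypothetical helper)."""
    if single is not None:
        if left is not None or right is not None:
            raise ValueError("use --single or --left/--right, not both")
        reads = ["--single", single]
    elif left is not None and right is not None:
        reads = ["--left", left, "--right", right]
    else:
        raise ValueError("paired-end runs need both left and right reads")

    script = f"{trinity_home}/util/RSEM_util/run_RSEM_align_n_estimate.pl"
    cmd = [script, "--transcripts", transcripts, "--seqType", seq_type, *reads]
    if ss_lib_type is not None:
        cmd += ["--SS_lib_type", ss_lib_type]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    if thread_count is not None:
        cmd += ["--thread_count", str(thread_count)]
    if passthrough:
        # Everything after ' -- ' is handed on to rsem-calculate-expression.
        cmd += ["--", *passthrough]
    return cmd


# /opt/trinity is a placeholder for your own $TRINITY_HOME.
cmd = build_wrapper_command("/opt/trinity", "Trinity.fasta", "fq",
                            left="left.reads.fq", right="right.reads.fq",
                            ss_lib_type="RF", thread_count=8)
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True); building a list rather than a single string avoids shell quoting problems with file names.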

Running the above command will generate two primary output files containing the abundance estimates (RSEM is the default output file prefix and can be changed with --prefix):

RSEM.isoforms.results  : EM read counts per Trinity transcript
RSEM.genes.results     : EM read counts per Trinity component (treated as a 'gene'; the term is used loosely here)

The output for the isoforms file looks like so:

transcript_id	gene_id	length	effective_length	expected_count	TPM	FPKM	IsoPct
comp128_c0_seq1	comp128_c0	209	1.73	0.00	0.00	0.00	0.00
comp13_c0_seq1	comp13_c0	235	7.16	1.00	12561.51	5282.75	100.00
comp22_c0_seq1	comp22_c0	215	2.62	0.00	0.00	0.00	0.00
comp28_c0_seq1	comp28_c0	329	54.60	4.00	6591.85	2772.21	100.00
comp33_c0_seq1	comp33_c0	307	40.30	3.00	6697.56	2816.66	100.00
comp35_c0_seq1	comp35_c0	219	3.33	0.00	0.00	0.00	0.00
comp35_c1_seq1	comp35_c1	204	1.19	1.00	75295.99	31665.75	100.00
comp39_c0_seq1	comp39_c0	348	68.20	1.00	1319.32	554.84	100.00
comp39_c0_seq2	comp39_c0	255	13.97	0.00	0.00	0.00	0.00
comp41_c0_seq1	comp41_c0	592	295.77	12.00	3650.37	1535.16	100.00
comp44_c0_seq1	comp44_c0	361	78.10	1.00	1151.96	484.46	100.00
comp44_c1_seq1	comp44_c1	280	25.22	1.00	3568.05	1500.54	100.00
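
The isoforms file is plain tab-separated text, so it can be read with standard tools. The sketch below parses a few rows copied from the listing above and checks one property of the columns: TPM is FPKM rescaled so that the values across the whole file sum to one million, so the TPM/FPKM ratio is the same for every expressed row. The function name read_rsem_results is an assumption for illustration.

```python
import csv
import io

# A few rows copied from the isoforms listing above (tab-separated).
ISOFORMS_TSV = (
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "comp13_c0_seq1\tcomp13_c0\t235\t7.16\t1.00\t12561.51\t5282.75\t100.00\n"
    "comp28_c0_seq1\tcomp28_c0\t329\t54.60\t4.00\t6591.85\t2772.21\t100.00\n"
    "comp39_c0_seq1\tcomp39_c0\t348\t68.20\t1.00\t1319.32\t554.84\t100.00\n"
    "comp39_c0_seq2\tcomp39_c0\t255\t13.97\t0.00\t0.00\t0.00\t0.00\n"
)

NUMERIC_COLUMNS = ("length", "effective_length", "expected_count",
                   "TPM", "FPKM", "IsoPct")


def read_rsem_results(handle):
    """Read an RSEM *.results table into a list of dicts with numeric columns as floats."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for column in NUMERIC_COLUMNS:
            if column in record:  # the genes file has no IsoPct column
                record[column] = float(record[column])
        rows.append(record)
    return rows


rows = read_rsem_results(io.StringIO(ISOFORMS_TSV))

# The ratio is constant (up to rounding in the file) for rows with nonzero expression.
ratios = [r["TPM"] / r["FPKM"] for r in rows if r["FPKM"] > 0]
print([round(x, 3) for x in ratios])
```

To read a real file, replace the io.StringIO object with open("RSEM.isoforms.results", newline="").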

The genes file provides expression results on a per-Trinity-component basis:

gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
comp128_c0	comp128_c0_seq1	0.00	0.00	0.00	0.00	0.00
comp13_c0	comp13_c0_seq1	235.00	7.16	1.00	12561.51	5282.75
comp22_c0	comp22_c0_seq1	0.00	0.00	0.00	0.00	0.00
comp28_c0	comp28_c0_seq1	329.00	54.60	4.00	6591.85	2772.21
comp33_c0	comp33_c0_seq1	307.00	40.30	3.00	6697.56	2816.66
comp35_c0	comp35_c0_seq1	0.00	0.00	0.00	0.00	0.00
comp35_c1	comp35_c1_seq1	204.00	1.19	1.00	75295.99	31665.75
comp39_c0	comp39_c0_seq1,comp39_c0_seq2	348.00	68.20	1.00	1319.32	554.84
comp41_c0	comp41_c0_seq1	592.00	295.77	12.00	3650.37	1535.16
comp44_c0	comp44_c0_seq1	361.00	78.10	1.00	1151.96	484.46
comp44_c1	comp44_c1_seq1	280.00	25.22	1.00	3568.05	1500.54
comp45_c0	comp45_c0_seq1	0.00	0.00	0.00	0.00	0.00
comp47_c1	comp47_c1_seq1	562.00	265.78	8.00	2708.23	1138.95
comp48_c0	comp48_c0_seq1	433.00	139.70	5.00	3220.28	1354.29
comp49_c0	comp49_c0_seq1	272.00	21.31	3.00	12667.38	5327.27
comp49_c1	comp49_c1_seq1	324.00	51.21	2.00	3514.23	1477.91
comp52_c0	comp52_c0_seq1	301.00	36.70	2.00	4902.98	2061.95
comp53_c0	comp53_c0_seq1	304.00	38.48	1.00	2337.98	983.24
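
The relationship between the two files can be sketched as a roll-up of isoform rows into gene rows: expected counts and TPM are summed over a component's transcripts, and the gene-level length and effective length are averages of the transcript values weighted by IsoPct. This explains why components with no expressed transcript report a length of 0.00. The code below is an illustration consistent with the sample rows above, not RSEM's own implementation; summarize_genes is a hypothetical name.

```python
from collections import defaultdict

# (transcript_id, gene_id, length, effective_length, expected_count, TPM, IsoPct)
# Rows copied from the isoforms listing above.
ISOFORMS = [
    ("comp128_c0_seq1", "comp128_c0", 209, 1.73, 0.00, 0.00, 0.00),
    ("comp39_c0_seq1", "comp39_c0", 348, 68.20, 1.00, 1319.32, 100.00),
    ("comp39_c0_seq2", "comp39_c0", 255, 13.97, 0.00, 0.00, 0.00),
    ("comp41_c0_seq1", "comp41_c0", 592, 295.77, 12.00, 3650.37, 100.00),
]


def summarize_genes(isoforms):
    """Aggregate isoform rows into per-gene (per-component) summaries."""
    genes = defaultdict(lambda: {"transcripts": [], "length": 0.0,
                                 "effective_length": 0.0,
                                 "expected_count": 0.0, "TPM": 0.0})
    for tid, gid, length, eff_length, count, tpm, isopct in isoforms:
        gene = genes[gid]
        weight = isopct / 100.0  # share of the gene's expression from this isoform
        gene["transcripts"].append(tid)
        gene["length"] += weight * length
        gene["effective_length"] += weight * eff_length
        gene["expected_count"] += count
        gene["TPM"] += tpm
    return dict(genes)


for gene_id, info in summarize_genes(ISOFORMS).items():
    print(gene_id, ",".join(info["transcripts"]),
          f'{info["length"]:.2f}', f'{info["effective_length"]:.2f}',
          f'{info["expected_count"]:.2f}', f'{info["TPM"]:.2f}', sep="\t")
```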

Filtering lowly supported transcripts

If you want to filter out likely transcript artifacts and lowly expressed transcripts, consider retaining only those transcripts that account for at least 1% of their component's expression (the IsoPct column). Because Trinity transcripts are not currently scaffolded across sequencing gaps, some smaller transcript fragments may lack enough properly paired read support to show up as expressed, even though they are otherwise supported by the read data. Filter cautiously: rather than discarding such lowly expressed (or seemingly unexpressed) transcripts, we recommend setting them aside for further study.
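
The "set aside rather than discard" advice can be sketched as a partition of the isoform rows into two lists. This is a minimal illustration with a hypothetical function name, using rows taken from the isoforms listing above; the 1% threshold is the one suggested in the text.

```python
# Rows taken from the isoforms listing above (only the columns needed here).
ROWS = [
    {"transcript_id": "comp13_c0_seq1", "IsoPct": 100.00},
    {"transcript_id": "comp35_c0_seq1", "IsoPct": 0.00},
    {"transcript_id": "comp39_c0_seq1", "IsoPct": 100.00},
    {"transcript_id": "comp39_c0_seq2", "IsoPct": 0.00},
]


def partition_by_isopct(rows, min_isopct=1.0):
    """Split transcript ids into (retained, set_aside) by an IsoPct threshold."""
    retained, set_aside = [], []
    for row in rows:
        target = retained if row["IsoPct"] >= min_isopct else set_aside
        target.append(row["transcript_id"])
    return retained, set_aside


retained, set_aside = partition_by_isopct(ROWS)
print("retained: ", retained)
print("set aside:", set_aside)  # keep this list for later inspection
```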

The utility script TRINITY_RNASEQ_ROOT/util/filter_fasta_by_rsem_values.pl can be used to filter your FASTA file of assembled transcripts using the values in the RSEM.isoforms.results file, according to minimum IsoPct, minimum FPKM, and minimum TPM. Note that you can provide a list of RSEM output files, one for each sample, and filter out those transcripts that do not meet the specified requirements in any of the samples given.
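
To make the multi-sample behaviour concrete, here is a rough Python sketch of the same idea. It is not the Perl script's implementation: it assumes that a transcript is kept when it meets every specified cutoff in at least one sample, which is one reading of the description above. All function names are hypothetical, the second sample's values are made up for illustration, and the sequences are placeholders.

```python
def passes(row, min_isopct=None, min_fpkm=None, min_tpm=None):
    """True if the row meets every cutoff that was specified."""
    checks = ((min_isopct, "IsoPct"), (min_fpkm, "FPKM"), (min_tpm, "TPM"))
    return all(row[col] >= cutoff for cutoff, col in checks if cutoff is not None)


def ids_passing_any_sample(samples, **cutoffs):
    """Assumed rule: keep a transcript that passes in at least one sample."""
    keep = set()
    for rows in samples:
        keep.update(r["transcript_id"] for r in rows if passes(r, **cutoffs))
    return keep


def filter_fasta(lines, keep_ids):
    """Yield only the FASTA records whose id (first header token) is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line


# Sample A uses rows from the isoforms listing above; sample B is made up.
sample_a = [
    {"transcript_id": "comp39_c0_seq1", "TPM": 1319.32, "FPKM": 554.84, "IsoPct": 100.0},
    {"transcript_id": "comp39_c0_seq2", "TPM": 0.0, "FPKM": 0.0, "IsoPct": 0.0},
    {"transcript_id": "comp128_c0_seq1", "TPM": 0.0, "FPKM": 0.0, "IsoPct": 0.0},
]
sample_b = [
    {"transcript_id": "comp39_c0_seq1", "TPM": 1200.0, "FPKM": 500.0, "IsoPct": 95.0},
    {"transcript_id": "comp39_c0_seq2", "TPM": 60.0, "FPKM": 25.0, "IsoPct": 5.0},
    {"transcript_id": "comp128_c0_seq1", "TPM": 0.0, "FPKM": 0.0, "IsoPct": 0.0},
]

fasta = [">comp39_c0_seq1 len=348", "ACGT",      # placeholder sequences
         ">comp128_c0_seq1 len=209", "TTTT",
         ">comp39_c0_seq2 len=255", "GGCC"]

keep = ids_passing_any_sample([sample_a, sample_b], min_isopct=1.0, min_tpm=1.0)
print("\n".join(filter_fasta(fasta, keep)))
```

Here comp39_c0_seq2 is retained because it passes in sample B even though it is unexpressed in sample A, while comp128_c0_seq1 fails in both samples and is dropped.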

Sample Data

Under TRINITY_RNASEQ_ROOT/sample_data/test_Trinity_Assembly, execute

% runMe.sh 1

to build Trinity transcript assemblies using the sample data, and then run through the downstream alignment and abundance estimation steps.