Large RNA-Seq data sets, such as those exceeding 300M pairs, are best suited for in silico normalization prior to running Trinity, in order to reduce memory requirements and greatly improve upon runtimes.

Note
Before running the normalization, be sure that in the case of paired reads, the left read names end with suffix /1 and the right read names end with /2.

To normalize your data set, run the included:

TRINITY_RNASEQ_ROOT/util/normalize_by_kmer_coverage.pl
###############################################################################
#
# Required:
#
#  --seqType <string>      :type of reads: ( 'fq' or 'fa')
#  --JM <string>            :(Jellyfish Memory) number of GB of system memory to use for
#                            k-mer counting by jellyfish  (eg. 10G) *include the 'G' char
#
#  --max_cov <int>         :targeted maximum coverage for reads.
#
#
#  If paired reads:
#      --left  <string>    :left reads
#      --right <string>    :right reads
#
#  Or, if unpaired reads:
#      --single <string>   :single reads
#
#  Or, if you have read collections in different files you can use 'list' files, where each line in a list
#  file is the full path to an input file.  This saves you the time of combining them just so you can pass
#  a single file for each direction.
#      --left_list  <string> :left reads, one file path per line
#      --right_list <string> :right reads, one file path per line
#
####################################
##  Misc:  #########################
#
#  --pairs_together                :process paired reads by averaging stats between pairs and retaining linking info.
#
#  --SS_lib_type <string>          :Strand-specific RNA-Seq read orientation.
#                                   if paired: RF or FR,
#                                   if single: F or R.   (dUTP method = RF)
#                                   See web documentation.
#  --output <string>               :name of directory for output (will be
#                                   created if it doesn't already exist)
#                                   default( "normalized_reads" )
#
#  --JELLY_CPU <int>                     :number of threads for Jellyfish to use (default: 2)
#  --PARALLEL_STATS                :generate read stats in parallel for paired reads (Figure 2X Inchworm memory requirement)
#
#  --KMER_SIZE <int>               :default 25
#
#  --min_kmer_cov <int>            :minimum kmer coverage for catalog construction (default: 1)
#
#  --max_pct_stdev <int>           :maximum pct of mean for stdev of kmer coverage across read (default: 100)
#
###############################################################################

This should be run on a machine that has a suitably large amount of RAM (typically hundreds of GB of RAM). The command-line options are quite similar to those used by Trinity.pl.

There are several potential invocations of the normalization process, depending on your interests.

Ideally, the process would be run as follows:

TRINITY_RNASEQ_ROOT/util/normalize_by_kmer_coverage.pl --seqType fq --JM 100G --max_cov 30 --left left.fq --right right.fq --pairs_together --PARALLEL_STATS --JELLY_CPU 10

To roughly halve memory requirements but take almost twice as long to run, remove the —PARALLEL_STATS parameter. In this case, the left.fq and right.fq files will be examined separately instead of in parallel.

Also, to greatly lessen memory requirements, include the option —min_kmer_cov 2, in which case no uniquely occurring kmer will be assayed.

Results

Normalized versions of the reads will be written, with file names corresponding to the original files but with an added extension to indicate the targeted maximum coverage levels.

Sample data

You can run the normalization process on the sample data provided at:

TRINITY_RNASEQ_ROOT/sample_data/test_Trinity_Assembly

by running

__run_read_normalization_pairs_together_fastq.sh

(note there are several such __run_read_normalization… scripts to demonstrate alternate scenarios for execution.)