What computing resources are required?

Ideally, you will have access to a large-memory server with roughly 1G of RAM per 1M reads to be assembled, although much less memory is often sufficient. We regularly run Trinity on our Dell server with 256G of RAM and 22 cores.
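
As a rough sizing aid, the ~1G per 1M reads rule of thumb can be turned into a quick estimate. This is only a sketch: it assumes uncompressed FASTQ input with exactly four lines per record, and the function names count_reads and estimate_ram_gb are illustrative, not part of Trinity.

```shell
#!/usr/bin/env bash
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Estimate RAM in GB from a read count using ~1G per 1M reads, rounded up.
estimate_ram_gb() {
    local reads=$1
    echo $(( (reads + 999999) / 1000000 ))
}

# Example (hypothetical file name):
#   estimate_ram_gb "$(count_reads left.fq)"
```

Treat the result as an upper-end planning figure, since actual memory use is often much lower.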

How long should this take?

It depends on a number of factors including the number of reads to be assembled and the complexity of the transcript graphs. The assembly from start to finish can take anywhere from ~1/2 hour to 1 hour per million reads (your mileage may vary).

To troubleshoot individual graphs using Butterfly, see Advanced Guide to Trinity.

How can I run this in parallel on a computing grid?

The initial Inchworm and Chrysalis steps need to be run on a single server as a single process. However, parts of Chrysalis and all of Butterfly can be run as naive parallel processes on a computing grid.

Running Trinity.pl with the option: --no_run_quantifygraph will stop Trinity just before the parallel processes can be executed. There will be two sets of files generated, each containing commands that should be executed on a computing grid.

The first set of commands corresponds to the Chrysalis quantifyGraph step, found at:

trinity_out_dir/chrysalis/quantifyGraph_commands

After executing these commands on your computing grid, and ensuring that each succeeded, you can then execute the Butterfly commands found at:

trinity_out_dir/chrysalis/butterfly_commands.adj

(be sure to use the .adj extension, since this file includes other butterfly parameters passed through the Trinity.pl wrapper).

How to get each of these commands to run on your grid will depend on your computing infrastructure. We have several ways to do this via LSF, but they’re all home-grown solutions and not described here.
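
Whatever scheduler you use, the important part is that every line of the command file runs and that failures are recorded. The sketch below is a local, serial stand-in for a grid submission, not one of the LSF solutions mentioned above. The function name run_command_file and the failed-commands file are hypothetical.

```shell
#!/usr/bin/env bash
# Run every command listed in a file (one per line) and record failures.
# On a real grid, replace the 'bash -c' call with your scheduler's submit step.
run_command_file() {
    local cmd_file=$1 failed_file=$2 cmd
    : > "$failed_file"                       # start with an empty failure list
    while IFS= read -r cmd || [ -n "$cmd" ]; do
        [ -z "$cmd" ] && continue            # skip blank lines
        # Redirect stdin so the command cannot consume the rest of the list.
        if ! bash -c "$cmd" < /dev/null; then
            printf '%s\n' "$cmd" >> "$failed_file"
        fi
    done < "$cmd_file"
    [ ! -s "$failed_file" ]                  # succeed only if nothing failed
}

# Example (the second argument is a hypothetical file name):
#   run_command_file trinity_out_dir/chrysalis/quantifyGraph_commands qg_failed.txt
```

Only move on to the Butterfly commands once the failure list for the quantifyGraph commands is empty.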

Once all butterfly commands have been executed, you can retrieve all the resulting assembled transcripts by concatenating the individual result files together like so:

find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} + > Trinity.fasta

How do I identify the specific reads that were incorporated into the transcript assemblies?

Currently, the mappings of reads to transcripts are not reported. To obtain this information, we recommend realigning the reads to the assembled transcripts using Bowtie or BWA. A wrapper around Bowtie that will generate alignments for you is included in the Trinity package as:

util/alignReads.pl

and example data and example execution of this is included at:

sample_data/read_alignment

for paired or single read data. See the various run.sh scripts available for examples.

How do I combine multiple libraries in a single Trinity run? Or, how do I combine paired and single reads?

If you have RNA-Seq data from multiple libraries and you want to run them all through Trinity in a single pass, simply combine all your left.fq files into one left.fq file, and combine all right.fq files into one right.fq file. Then run Trinity using these separately concatenated left and right input files.
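
For example, with two paired-end libraries the concatenation might look like the sketch below. The libA_* and libB_* file names are hypothetical, and same_line_count is an illustrative helper for checking that the combined left and right files are still the same length before any singletons are added. Listing the libraries in the same order for both files is a cautious default.

```shell
#!/usr/bin/env bash
# Hypothetical per-library file names:
#   cat libA_left.fq  libB_left.fq  > left.fq
#   cat libA_right.fq libB_right.fq > right.fq

# Succeed if two files have the same number of lines (and so, for
# four-line FASTQ records, the same number of reads).
same_line_count() {
    [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}

# Example:
#   same_line_count left.fq right.fq || echo "left/right read counts differ" >&2
```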

If you have additional singletons, add them to the .fq file that they correspond to based on the sequencing method used (if they’re equivalent to the left.fq entries, add them there, etc).

There is no good way to combine strand-specific data with non-strand-specific data, unless you decide to treat the entire data set as non-strand-specific.

Chrysalis died unexpectedly. What do I do?

In nearly every case examined thus far, this is because the stacksize memory grew beyond its maximal setting. Chrysalis can sometimes recurse fairly deeply, in which case the stacksize can grow substantially. Before running Trinity, be sure that your stacksize is set to unlimited. On CentOS Linux, under csh or tcsh, you can check your default settings like so (it may be different on your Linux distro or shell):

bhaas@hyacinth$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    100000 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  131072
memorylocked 256 kbytes
maxproc      2102272

and you would set the stacksize (and other settings) to unlimited like so:

bhaas@hyacinth$ unlimit

and verify the new settings:

bhaas@hyacinth$ limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  131072
memorylocked 256 kbytes
maxproc      2102272

On Solaris, Ubuntu Linux, and perhaps Mac OS X (or wherever your shell is bash or sh rather than csh or tcsh), the syntax is different, and might be:

ulimit -s unlimited

On Ubuntu, type: ulimit -a to examine and verify your settings.

On Mac OS X Snow Leopard, you cannot set the stacksize to unlimited for some reason (on older versions you could), so try setting it as high as the system allows.
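
In bash or sh, one way to "max it out" is to ask for unlimited and fall back to the hard limit if that is refused. This is a minimal sketch; the function name raise_stack_limit is illustrative, and the change applies only to the current shell and the processes it starts.

```shell
#!/usr/bin/env bash
# Raise the soft stack limit as far as the system allows (bash/sh syntax).
raise_stack_limit() {
    # Try unlimited first; if refused, fall back to the hard limit.
    ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -Hs)"
}

raise_stack_limit
ulimit -s    # print the resulting soft stack limit
```

Run this in the same shell session (or job script) that launches Trinity, since limits set in one shell do not carry over to others.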

QuantifyGraph of Chrysalis died due to std::bad_alloc

If one or more QuantifyGraph (of Chrysalis) jobs should fail like so:

terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc
We are sorry, commands in file: [failed_quantify_graph_commands.1234.txt] failed.  :-(

it is because these processes ran out of available memory. The fix is the following:

  1. Adjust the commands in the failed_quantify_graph_commands.1234.txt file, setting -max_reads 20000000 to a lower value (200000 is sufficient, and will be the new setting in post-Jan-2013 releases of Trinity). Run these adjusted commands to completion.

  2. Create the quantifygraph checkpoint file like so:

    % touch trinity_out_dir/chrysalis/quantifyGraph_commands.run.finished

  3. Rerun your original Trinity.pl command, which should then resume right after the quantifyGraph stage.
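
The edit in step 1 can be scripted rather than done by hand. The sketch below writes an adjusted copy of the failed-commands file and leaves the original untouched. The function name lower_max_reads and the output file name are hypothetical.

```shell
#!/usr/bin/env bash
# Write a copy of a failed-commands file with -max_reads set to a new value
# (200000 by default, the value suggested in step 1).
lower_max_reads() {
    local in_file=$1 out_file=$2 new_value=${3:-200000}
    sed "s/-max_reads [0-9][0-9]*/-max_reads ${new_value}/" "$in_file" > "$out_file"
}

# Example (the output name is hypothetical):
#   lower_max_reads failed_quantify_graph_commands.1234.txt adjusted_commands.txt
```

Check the adjusted file before running it, then continue with steps 2 and 3.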

Butterfly seems to run forever. What do I do?

Occasionally (very rarely, such as one component per tens of thousands, if at all) Butterfly will encounter a complicated transcript graph and seem to take an eternity to process it. You will notice this by running top and seeing a java process that has been running for a very long time. For example, I’m running a dozen butterfly commands on my large server (22 cores, 256 GB RAM) and I can see various butterfly jobs running as java in the view:

top - 09:13:33 up 131 days, 21:07,  4 users,  load average: 70.72, 53.70, 28.00
Tasks: 510 total,   9 running, 501 sleeping,   0 stopped,   0 zombie
Cpu(s): 89.1%us, 10.4%sy,  0.0%ni,  0.2%id,  0.0%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:  264349428k total, 48345144k used, 216004284k free,   126640k buffers
Swap:  8385920k total,   314336k used,  8071584k free, 18855720k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7775 bhaas     16   0 1373m 302m 8724 S 201.2  0.1   0:04.02 java
 7735 bhaas     17   0 1358m 329m 8776 S 171.1  0.1   0:04.47 java
 7310 bhaas     17   0 1300m 359m 8804 S 140.9  0.1   0:07.84 java
 8194 bhaas     17   0 1294m 165m 8680 S 125.8  0.1   0:01.88 java
 8313 bhaas     18   0 1356m  36m 8580 S  98.1  0.0   0:00.73 java
 8075 bhaas     17   0 1290m  53m 8668 S  93.1  0.0   0:01.18 java
10241 bhaas     18   0 1376m 604m 8820 S  88.0  0.2   4:31.80 java
32424 bhaas     18   0 1306m 474m 8816 S  88.0  0.2   0:58.53 java
 8143 bhaas     17   0 1292m  48m 8664 S  85.5  0.0   0:01.23 java
 8258 bhaas     17   0 1291m  48m 8656 S  80.5  0.0   0:01.07 java
 1305 bhaas     17   0 1377m 509m 8820 S  78.0  0.2   0:56.11 java
10247 bhaas     18   0 1356m 1.0g 8812 S  78.0  0.4   4:26.23 java
...

A way to see exactly what jobs are running is to execute the following:

bhaas@hyacinth$ ps auxww | grep Butterfly
bhaas     4588 50.3  0.1 1355708 435476 pts/4  Sl   09:17   0:38 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 9814096 -L 300 -F 300 -C chrysalis/RawComps.0/comp374 --edge-thr=0.16
bhaas     5920 51.3  0.1 1353604 409604 pts/4  Sl   09:18   0:33 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10114793 -L 300 -F 300 -C chrysalis/RawComps.0/comp412 --edge-thr=0.16
bhaas     7747 53.0  0.2 1325344 530752 pts/4  Sl   09:13   3:01 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 11032490 -L 300 -F 300 -C chrysalis/RawComps.0/comp127 --edge-thr=0.16
bhaas    10241 56.5  0.2 1409492 625972 pts/4  Sl   09:06   7:18 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10630881 -L 300 -F 300 -C chrysalis/RawComps.0/comp2 --edge-thr=0.16
bhaas    10247 51.9  0.4 1389204 1077640 pts/4 Sl   09:06   6:42 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10702374 -L 300 -F 300 -C chrysalis/RawComps.0/comp0 --edge-thr=0.16
bhaas    10249 51.8  0.4 1394704 1082764 pts/4 Sl   09:06   6:41 java -Xmx1000M -jar /seq/bhaas/SVN/trinityrnaseq/Butterfly/Butterfly.jar -N 10702374 -L 300 -F 300 -C chrysalis/RawComps.0/comp1 --edge-thr=0.16

Most of the butterfly commands have been running for only a short period of time (seconds), but there are a couple that have been running for several minutes. Most commands will take less than a few minutes to run, and some can take up to an hour. If you see a butterfly command (java) that has been running for many hours, you can consider killing it and trying it again later with altered butterfly parameters. There are a couple of ways to kill the process.

From the command line, you can kill it like so:

kill $pid

where $pid is the process ID in the first column of the top output or second column of the ps output.

From within top, you can kill it by typing k, enter, $pid, enter (this is how it works on Linux; your system may vary).
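
Rather than scanning top by eye, you can list only the Butterfly jobs that have exceeded some elapsed-time threshold. The sketch below assumes a Linux ps from procps that supports the etimes (elapsed seconds) field. The function names and the four-hour default are illustrative choices.

```shell
#!/usr/bin/env bash
# Read "pid elapsed_seconds command..." lines on stdin and print
# "pid elapsed_seconds" for Butterfly jobs older than the threshold.
filter_long_running() {
    local min_seconds=$1
    awk -v min="$min_seconds" '$2 > min && /Butterfly\.jar/ { print $1, $2 }'
}

# List Butterfly jobs running longer than the given number of seconds
# (default: 14400, i.e. four hours). Assumes ps supports the etimes field.
list_long_butterfly_jobs() {
    ps -eo pid=,etimes=,args= | filter_long_running "${1:-14400}"
}

# Example:
#   list_long_butterfly_jobs 36000
```

The first column of the output is the process ID to pass to kill.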

Once a Butterfly command has finished (or you’ve killed it to retry it later), the next butterfly command in the queue will take its place.

If all Butterfly commands complete successfully, then the Trinity.pl wrapper script will report success and concatenate all the individual butterfly assembly outputs into a single file (Trinity.fasta).

If any commands did not succeed, then the failed (or killed) commands will be reported and written to a file so that you can adjust the parameters and rerun. There are two primary reasons why a Butterfly command might run forever:

  1. The Needleman-Wunsch global alignment seems to lock up when aligning long sequences. Try running Butterfly in Smith-Waterman mode instead, by adding the --SW option.

  2. The transcript graph is highly complex. By compacting the graph further, Butterfly will be able to more easily traverse it, though it may reduce the sensitivity for transcript reconstruction and alt-splice detection, so do the following sparingly. You can add the option --edge-thr=$value to the butterfly command. By default, the $value is 0.05. Try setting it to 0.16 or higher to substantially decrease the complexity of the graph, and improve upon the Butterfly runtime. (For more information on the --edge-thr parameter, see Advanced Guide to Trinity).
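
When several commands need the higher threshold, the --edge-thr change can be applied to a whole file of commands at once. This sketch replaces an existing --edge-thr value or appends the option if it is missing. The function name set_edge_thr and the file names in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Set --edge-thr on every command in a file: replace an existing value,
# or append the option when it is absent. Blank lines are dropped.
set_edge_thr() {
    local in_file=$1 out_file=$2 value=${3:-0.16}
    awk -v v="$value" '
        NF == 0       { next }
        /--edge-thr=/ { sub(/--edge-thr=[0-9.]+/, "--edge-thr=" v); print; next }
                      { print $0 " --edge-thr=" v }
    ' "$in_file" > "$out_file"
}

# Example (both file names are hypothetical):
#   set_edge_thr slow_butterfly_cmds.txt slow_butterfly_cmds.edge016.txt
```

Apply it only to the commands that stalled, since a higher threshold can reduce sensitivity.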

Once all butterfly commands have been executed, you can retrieve all the resulting assembled transcripts by concatenating the individual result files together like so:

find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} + > Trinity.fasta

Some Butterfly commands fail

When butterfly commands fail, it’s usually due to not enough heap memory being available. This tends to happen with the largest component (comp0). Failed butterfly commands will be written to a failed_cmds.txt file and the Trinity process will terminate. Try rerunning the failed commands (which should be very few, such as 1-3 out of the tens of thousands of commands) manually and resetting the heap size -Xmx value to a higher number, such as 20G. After these failed processes complete successfully, you can combine all the Butterfly results into a single file like so:

find trinity_out_dir/chrysalis -name "*allProbPaths.fasta" -exec cat {} + > Trinity.fasta

and consider the entire run a success (albeit with a little hand-holding).
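
Resetting the heap size in the failed commands can be done with a one-line substitution. This is a sketch only: raise_heap and the output file name are hypothetical, and it assumes each failed command carries a -Xmx option of the form shown in the ps listing earlier (for example -Xmx1000M).

```shell
#!/usr/bin/env bash
# Write a copy of a commands file with the java -Xmx heap size replaced
# (20G by default, the value suggested above).
raise_heap() {
    local in_file=$1 out_file=$2 heap=${3:-20G}
    sed "s/-Xmx[0-9][0-9]*[kKmMgG]\{0,1\}/-Xmx${heap}/" "$in_file" > "$out_file"
}

# Example (the output name is hypothetical):
#   raise_heap failed_cmds.txt failed_cmds.bigheap.txt
```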

Butterfly fails with java Error: Cannot create GC thread. Out of system resources.

There are a couple of reasons why this error message might crop up.

  1. All memory has been consumed on the machine. Each butterfly process wants to reserve 20G of maximum heap space. If there’s less than 20G of free memory on the machine per butterfly (--CPU setting), then java may not be able to initialize (depends on your OS configuration). The -Xmx20G setting indicates that 20G of heap memory should be reserved, and you can lower this to 1G for most Butterfly commands, as needed.

  2. NUMA architecture: one of our users found that the java invocation required: -XX:ParallelGCThreads=<Numerical Thread Count>, otherwise it would try to use too many threads.
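
For the first cause, the arithmetic is simply the number of concurrent Butterfly processes times the -Xmx value, compared against free memory. The sketch below takes the conservative view that each process may need its full heap. The function names are illustrative, and all values are whole numbers of GB.

```shell
#!/usr/bin/env bash
# Succeed if cpus * heap_gb fits within avail_gb.
heap_fits() {
    local cpus=$1 heap_gb=$2 avail_gb=$3
    [ $(( cpus * heap_gb )) -le "$avail_gb" ]
}

# Largest whole-GB heap per process that fits for a given CPU count.
max_heap_gb() {
    local cpus=$1 avail_gb=$2
    echo $(( avail_gb / cpus ))
}

# Example: 22 processes at 20G each need 440G, which exceeds 256G.
#   heap_fits 22 20 256 || echo "lower -Xmx to $(max_heap_gb 22 256)G or reduce --CPU"
```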

How do I change the k-mer size?

Although Inchworm has the capability of running independently with different k-mer sizes up to 32 (the 64-bit limit with 2-bit base encoding), Chrysalis and Butterfly are currently fixed at the 25-mer k-mer size. In testing, we discovered early on that 25-mers appeared to be near-optimal across different transcriptomes and different read abundance levels, and so fixed the value accordingly as part of the Trinity process. Future development will aim to expose the k-mer setting as an option.

Can I combine strand-specific and non-strand-specific reads?

You can do so, but you wouldn’t be able to benefit from the strand-specificity, since you’ll need to run Trinity in non-strand-specific mode.

What is the naming convention of the final, exported set of transcripts?

The transcript names found in the output FASTA file are named like:

compX_cY_seqZ

This is a product of the graph algorithms used to generate them, which are explained in detail in the Advanced Guide to Trinity.

In short, X defines the graphical component generated by Chrysalis (from clustering inchworm contigs). Butterfly might tease subgraphs apart from each other within a single component, based on the read support data. This gives rise to subgraphs (cY). Each subgraph then gives rise to path sequences (seqZ).

While isoforms should have the same component number, transcripts sharing the same component number aren’t always isoforms. Those sequences with the same component number are derived from the same isolated de Bruijn graph.
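
Because the component number is encoded in each name, transcripts can be grouped by component straight from the FASTA headers. The sketch below assumes headers begin with >compX_cY_seqZ, possibly followed by other text. The function name count_by_component is illustrative.

```shell
#!/usr/bin/env bash
# Print "count component" for each compX prefix found in the FASTA headers.
count_by_component() {
    grep '^>' "$1" \
        | sed 's/^>\(comp[0-9][0-9]*\)_c[0-9][0-9]*_seq[0-9][0-9]*.*/\1/' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}

# Example:
#   count_by_component Trinity.fasta
```

A component with several sequences is a candidate set of isoforms, but as noted above, sharing a component number does not guarantee that.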