ORF Identification - L.staminea De Novo Transcriptome Assembly v1.0 Using Transdecoder on Mox

After performing a de novo transcriptome assembly with L.staminea RNA-seq data, the Trinity assembly stats were quite a bit more “exaggerated” than normally expected. In an attempt to get a better sense of which contigs might be more useful candidates for downstream analysis, I decided to run the assembly through Transdecoder to identify open reading frames (ORFs). This was run on Mox.

SLURM script (GitHub):

20230617-lsta-transdecoder-transcriptome_v1.0.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20230617-lsta-transdecoder-transcriptome_v1.0
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=05-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20230617-lsta-transdecoder-transcriptome_v1.0

# Transdecoder to identify ORFs and "functional" contigs from 20230616 L.staminea
# de novo transcriptome assembly v1.0.

###################################################################################
# These variables need to be set by user

## Assign Variables

threads=28


# Paths to input/output files

trinity_fasta="/gscratch/scrubbed/samwhite/outputs/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta"
trinity_gene_trans_map="/gscratch/scrubbed/samwhite/outputs/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta.gene_trans_map"

blastp_out_dir="blastp_out"
transdecoder_out_dir="${trinity_fasta##*/}.transdecoder_dir"
pfam_out_dir="pfam_out"
blastp_out="${blastp_out_dir}/blastp.outfmt6"

pfam_out="${pfam_out_dir}/pfam.domtblout"
lORFs_pep="${transdecoder_out_dir}/longest_orfs.pep"
pfam_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/Pfam-A.hmm"
sp_db="/gscratch/srlab/programs/Trinotate-v3.1.1/admin/uniprot_sprot.pep"

# Paths to programs
blast_dir="/gscratch/srlab/programs/ncbi-blast-2.8.1+/bin"
blastp="${blast_dir}/blastp"
hmmer_dir="/gscratch/srlab/programs/hmmer-3.2.1/src"
hmmscan="${hmmer_dir}/hmmscan"
transdecoder_dir="/gscratch/srlab/programs/TransDecoder-v5.5.0"
transdecoder_lORFs="${transdecoder_dir}/TransDecoder.LongOrfs"
transdecoder_predict="${transdecoder_dir}/TransDecoder.Predict"
###################################################################################


# Load Python Mox module for Python module availability
module load intel-python3_2017


# Make output directories
mkdir --parents "${blastp_out_dir}"
mkdir --parents "${pfam_out_dir}"

# Extract long open reading frames
${transdecoder_lORFs} \
-t ${trinity_fasta} \
--gene_trans_map ${trinity_gene_trans_map}

# Run blastp on long ORFs
${blastp} \
-query ${lORFs_pep} \
-db ${sp_db} \
-max_target_seqs 1 \
-outfmt 6 \
-evalue 1e-5 \
-num_threads ${threads} \
> ${blastp_out}

# Run pfam search
${hmmscan} \
--cpu ${threads} \
--domtblout ${pfam_out} \
${pfam_db} \
${lORFs_pep}

# Run Transdecoder with blastp and Pfam results
${transdecoder_predict} \
-t ${trinity_fasta} \
--retain_pfam_hits ${pfam_out} \
--retain_blastp_hits ${blastp_out}

####################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help

    # Handle fastp menu
    elif [[ "${program}" == "fastp" ]]; then
      ${programs_array[$program]} --help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
fi


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Run time was nearly 20hrs.

Screencap of L.staminea Transdecoder run time on Mox showing a run time of 19hrs, 12mins, 51secs

Output folder:

20230617-lsta-transdecoder-transcriptome_v1.0/

BED (text)
- lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.bed (30M)
  - MD5: 02497568c87e65279e83551ba8fb43ae
Coding Sequences (FastA)
- lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.cds (98M)
  - MD5: 1517d724b18d0bd8759cbccc0487e460
GFF (text)
- lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.gff3 (99M)
  - MD5: 901ab1c9ca1c0998714ff9cd602b2063
Peptide Sequences (FastA)
- lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.pep (52M)
  - MD5: 53752bdb28bf2f89a04620bcee109a4f
Pfam output (text)
- pfam.domtblout (132M)
- MD5: 18cd0fdff4993ed7f1d8f7d36f83e1d6
BLASTp output (format 6; text)
- blastp.outfmt6 (6.2M)
- MD5: c416fbe543555672083f1a63a4935ed0

When counting complete ORFs (awk -F"\t" '$3=="gene"' lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.gff3 | grep -c "complete"), the result is 28,451. This is a significant reduction in potential genes compared to what Trinity identified in the do novo assembly from yesterday (502,826 “genes”). Additionally, this is a much more realistic number of genes. Overall, ORF identification broke out like so:

CDS
exon
five_prime_UTR
gene
mRNA
three_prime_UTR

These numbers include partial ORFs. Even including the partial OFRs, these counts are much more realistic compared to the Trinity de novo assembly stats.

RESULTS

BED (text)

Coding Sequences (FastA)

GFF (text)

Peptide Sequences (FastA)

Pfam output (text)

BLASTp output (format 6; text)