Transcriptome Assembly - Genome-guided C.virginica Adult Gonad OA RNAseq Using Trinity on Mox

As part of this project, Steven’s asked that I identify long, non-coding RNAs (lncRNAs) (GitHub Issue) in the Crassostrea virginica (Eastern oyster) adult OA gonad RNAseq data we have. The initial step for this is to assemble transcriptome. I generated the necessary BAM alignment on 20220131. Next was to actually get the transcriptome assembled. I followed the Trinity genome-guided procedure.

I should add here that while this job was running, I figured out that the lncRNAs had already been annotated in the Crassostrea virginica (Eastern oyster) genome (NCBI GCF_002022765.2), so I had already tackled the lncRNA aspect of things on 20220217. However, having a gonad transcriptome assembly won’t hurt anything, so decided to let this continue running.

The initial run of this got interrupted by a corrupted SAM file (GitHub Issue). It’s unclear what caused this, but during the job, the Mox /gscratch/scrubbed/ directory went over the storage quota… The solution was to attempt a re-run, which ran without issue. Although, I did implement a change suggested in that issue which was to set the --max_memory to 100G. I had previously been using 500G, but the developer explained this was excessive and not necessary.

SBATCH script (GitHub):

20220207_cvir_trinity-gg_adult-oa-gonad_assembly-1.0.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20220212_cvir_trinity-gg_adult-oa-gonad_assembly-1.0
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=21-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220212_cvir_trinity-gg_adult-oa-gonad_assembly-1.0

### Genome-guided (NCBI RefSeq GCF_002022765.2) de novo transcriptome assembly of C.virginica adult OA gonda RNAseq.
### See input_fastqs.md5 file for list of input files used for assembly.


###################################################################################
# These variables need to be set by user

## Assign Variables

# These variables need to be set by user

# Path to this script
script_path=/gscratch/scrubbed/samwhite/outputs/20220207_cvir_trinity-gg_adult-oa-gonad_assembly-1.0/20220207_cvir_trinity-gg_adult-oa-gonad_assembly-1.0.sh

# RNAseq FastQs directory
reads_dir=/gscratch/srlab/sam/data/C_virginica/RNAseq

# Transcriptomes directory
transcriptomes_dir=/gscratch/srlab/sam/data/C_virginica/transcriptomes

# CPU threads
threads=40

# Capture specified RAM from this script
# Carrot needed to limit grep to line starting with #SBATCH
# Avoids grep-ing the command below.
max_mem=100G
# BAM file for genome guided assembly
sorted_bam=/gscratch/scrubbed/samwhite/outputs/20220131_cvir_hisat2-GCF_002022765.2_adult-oa-gonad/20210131-cvir-hisat2.sorted.bam

# Max intron size
max_intron=10000

# Name output files
fasta_name="cvir_GG-GCF_002022765.2_transcriptome_v1.0.fasta"
assembly_stats="${fast_name}_assembly_stats.txt"

# Paths to programs
samtools="/gscratch/srlab/programs/samtools-1.10/samtools"
trinity_dir="/gscratch/srlab/programs/trinityrnaseq-v2.9.0"

# Programs associative array
declare -A programs_array
programs_array=(
[samtools_faidx]="${samtools} faidx" \
[samtools_index]="${samtools} index" \
[samtools_sort]="${samtools} sort" \
[samtools_view]="${samtools} view" \
)

###################################################################################

# Exit script if a command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017



## Inititalize arrays
R1_array=()
R2_array=()

# Variables for R1/R2 lists
R1_list=""
R2_list=""

# Create array of fastq R1 files
R1_array=("${reads_dir}"/*R1.fastp-trim*.fq.gz)

# Create array of fastq R2 files
R2_array=("${reads_dir}"/*R2.fastp-trim*.fq.gz)

# Create list of fastq files used in analysis
## Uses parameter substitution to strip leading path from filename
for fastq in "${!R1_array[@]}"
do
  {
    md5sum "${R1_array[${fastq}]}"
    md5sum "${R2_array[${fastq}]}"
  } >> input_fastqs.md5
done

# Create comma-separated lists of FastQ reads
R1_list=$(echo "${R1_array[@]}" | tr " " ",")
R2_list=$(echo "${R2_array[@]}" | tr " " ",")


# Run Trinity
## Running as "stranded" (--SS_lib_type)
${trinity_dir}/Trinity \
--genome_guided_bam ${sorted_bam} \
--genome_guided_max_intron ${max_intron} \
--seqType fq \
--SS_lib_type RF \
--max_memory ${max_mem} \
--CPU ${threads} \
--left "${R1_list}" \
--right "${R2_list}"

# Rename generic assembly FastA
find . -name "Trinity*.fasta" -exec mv {} trinity_out_dir/"${fasta_name}" \;

# Assembly stats
${trinity_dir}/util/TrinityStats.pl trinity_out_dir/"${fasta_name}" \
> ${assembly_stats}

# Create gene map files
${trinity_dir}/util/support_scripts/get_Trinity_gene_to_trans_map.pl \
trinity_out_dir/"${fasta_name}" \
> trinity_out_dir/"${fasta_name}".gene_trans_map

# Create sequence lengths file (used for differential gene expression)
${trinity_dir}/util/misc/fasta_seq_length.pl \
trinity_out_dir/"${fasta_name}" \
> trinity_out_dir/"${fasta_name}".seq_lens

# Create FastA index
${programs_array[samtools_faidx]} \
trinity_out_dir/"${fasta_name}"

# Copy files to transcriptome directory
rsync -av \
trinity_out_dir/"${fasta_name}"* \
${transcriptome_dir}

# Generate FastA MD5 checksum
md5sum trinity_out_dir/"${fasta_name}" > fasta_checksum.md5

#######################################################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
fi


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Runtime was lengthy at almost 10 days. NOTE: The runtime screencap indicates the job failed. Although technically true, the job failed after the Trinity assembly was completed during a command to rsync the finished files to another directory on Mox. So, it’s all good!

screencap of C.virginica gonad transcriptome assembly runtime on Mox

Output folder:

20220212_cvir_trinity-gg_adult-oa-gonad_assembly-1.0/

MD5 checksums of FastQs used for assembly - also functions as list of FastQs (text):
- input_fastqs.md5 (8.0K)
Transcriptome assembly (FastA):
- cvir-GG-GCF-002022765.2-adult_gonad-transcriptome-v1.0.fasta (2.1G)
  - MD5: 88e6e3b1702e81e3034b8383f3c3efa0
Transcriptome FastA index (text):
- cvir-GG-GCF-002022765.2-adult_gonad-transcriptome-v1.0.fasta.fai (88M)
Trinity gene/transcript mapping file (text):
- cvir-GG-GCF-002022765.2-adult_gonad-transcriptome-v1.0.fasta.gene_trans_map (87M)
Trinity transcript sequence lengths (text):
- cvir-GG-GCF-002022765.2-adult_gonad-transcriptome-v1.0.fasta.seq_lens (54M)

Trinity assembly stats (text):

cvir-GG-GCF-002022765.2-adult_gonad-transcriptome-v1.0.fasta_assembly_stats.txt (4.0K)

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':	887315
Total trinity transcripts:	1849486
Percent GC: 36.26

########################################
Stats based on ALL transcript contigs:
########################################

  Contig N10: 7967
  Contig N20: 5284
  Contig N30: 3814
  Contig N40: 2801
  Contig N50: 2062

  Median contig length: 562
  Average contig: 1117.46
  Total assembled bases: 2066718534


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

  Contig N10: 6904
  Contig N20: 4398
  Contig N30: 3003
  Contig N40: 2120
  Contig N50: 1501

  Median contig length: 434
  Average contig: 860.79
  Total assembled bases: 763788564