Transcriptome Compression - P.generosa Transcriptome Assemblies Using CD-Hit-est on Mox

In continued attempts to get a grasp on the geoduck transcriptome size, I decided to “compress” our various assemblies by clustering similar transcripts in each assembly in to a single “representative” transcript, using CD-Hit-est. Settings use to run it were taken from the Trinity FAQ regarding “too many transcripts”.

A bash script was used to rsync files to Mox and then execute the SBATCH script.

Bash script (GitHub):

20190729_rsync_pgen_trinity_assemblies.sh

#!/usr/bin/bash

# Script to retrieve geoduck Trinity assemblies
# Assemblies will be used in SBATCH script called at end of this script.
# Script needs to be run within same directory as SBATCH script.

# Exit if any command fails
set -e

# Set rsync remote path
gannet="gannet:/volume2/web/Atumefaciens"
owl="owl:/volume1/web/Athaliana"

# Create array of directories for storing Trinity assemblies
assembly_dirs_array=(
/gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99)

# Array of Trinity assemblies remote paths for rysnc-ing
assemblies_array=(
20180827_trinity_geoduck_RNAseq/Trinity.fasta
20190409_trinity_pgen_ctenidia_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_gonad_RNAseq/trinity_out_dir/Trinity.fasta
20190215_trinity_geoduck_heart_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI115_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI116_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI123_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI124_RNAseq/trinity_out_dir/Trinity.fasta
20190409_trinity_pgen_EPI99_RNAseq/trinity_out_dir/Trinity.fasta)

# Retrieve FastA files via rsync
for index in "${!assemblies_array[@]}"
do
  # Remove everything after first slash
  assembly=$(echo "${assemblies_array[index]%%/*}")
  echo "Preparing to download ${assembly}..."
  if [ "${assembly}" = "20180827_trinity_geoduck_RNAseq" ]; then
    echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
    rsync \
    --archive \
    --progress \
    "${owl}/${assemblies_array[index]}" \
    "${assembly_dirs_array[index]}"
  else
  echo "Now syncing ${assembly} to ${assembly_dirs_array[index]}"
  rsync \
  --archive \
  --progress \
  "${gannet}/${assemblies_array[index]}" \
  "${assembly_dirs_array[index]}"
  fi
done

# Start SBATCH script to run CD-Hit on all transcriptome assemblies
sbatch 20190729_cdhit-est_pgen_transcriptomes.sh

SBATCH script (GitHub):

20190729_cdhit-est_pgen_transcriptomes.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=cdhit_pgen
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190729_cdhit-est_pgen_transcriptomes

# This script is called by 20190729_cdhit_pgen_trinity_assemblies.sh.
# That script uses rsync to transfer files to Mox via the login node.
# This is required because Mox execute nodes don't have internet access.

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

# Set CPU threads
threads=27

# Program paths
cd_hit_est="/gscratch/srlab/programs/cd-hit-v4.8.1-2019-0228/cd-hit-est"

# Create assembly paths array
assembly_dirs_array=(
/gscratch/srlab/sam/data/P_generosa/transcriptomes/20180827_assembly
/gscratch/srlab/sam/data/P_generosa/transcriptomes/ctenidia
/gscratch/srlab/sam/data/P_generosa/transcriptomes/gonad
/gscratch/srlab/sam/data/P_generosa/transcriptomes/heart
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI115
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI116
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI123
/gscratch/srlab/sam/data/P_generosa/transcriptomes/juvenile/EPI124
/gscratch/srlab/sam/data/P_generosa/transcriptomes/larvae/EPI99)

# Run cd-hit-est on each assembly
for index in "${!assembly_dirs_array[@]}"
do
  # Store individual sample name by removing
  # everything up to and including the last slash in path
  sample_name=$(echo "${assembly_dirs_array[index]##*/}")
  # Run cd-hit-est
  "${cd_hit_est}" \
  -o "${sample_name}".cdhit \
  -c 0.98 \
  -i "${assembly_dirs_array[index]}"/Trinity.fasta \
  -p 1 \
  -d 0 \
  -b 3 \
  -T "${threads}" \
  -M 0
done

RESULTS

This was quick (~30mins)!:

Screencap of CD-hit-est runtime on Mox for various geoduck transcriptome assemblies

Output folder:

20190729_cdhit-est_pgen_transcriptomes/

The output from CD-Hit is a FastA file. A quick grep -c ">" on the output files suggests that there were many “transcripts” that couldn’t be clustered together. That’s confirmed when taking a look at the end of the .cdhit.clstr files. Here’s an example (tail -n 25 ctenidia.cdhit.clst):

>Cluster 325771
0	189nt, >TRINITY_DN58755_c0_g1_i5... *
>Cluster 325772
0	189nt, >TRINITY_DN25471_c0_g1_i8... *
>Cluster 325773
0	187nt, >TRINITY_DN7576_c0_g1_i14... *
>Cluster 325774
0	186nt, >TRINITY_DN26627_c0_g1_i8... *
>Cluster 325775
0	186nt, >TRINITY_DN41216_c0_g1_i3... *
>Cluster 325776
0	185nt, >TRINITY_DN34387_c0_g1_i1... *
>Cluster 325777
0	185nt, >TRINITY_DN6164_c1_g1_i10... *
>Cluster 325778
0	184nt, >TRINITY_DN7464_c1_g2_i3... *
>Cluster 325779
0	184nt, >TRINITY_DN2395_c0_g1_i1... *
>Cluster 325780
0	184nt, >TRINITY_DN557_c0_g1_i3... *
>Cluster 325781
0	182nt, >TRINITY_DN53036_c0_g1_i3... *
>Cluster 325782
0	180nt, >TRINITY_DN12928_c1_g1_i8... *

Each of these transcripts from the assembly are very short (<200bp) and they do not cluster with any other transcripts (likely due to their short length). So, these are retained as individual transcripts, Although I haven’t taken the time to assess things, there are probably thousands (maybe more?) of these transcripts that are small and don’t cluster.

Anyway, I ran this to provide additional information on our transcriptome assemblies, not really to explore the results of CD-Hit in-depth. I’m putting together a summary post of our assemblies that will compile a variety of “stats”; hopefully providing some insight on a more realistic assessment of the number of transcripts present in the geoduck transcriptome.