Annotation - Geoduck Transcritpome with TransDecoder

I was tasked with generating some qPCR primers to analyze vitellogenin expression in geoduck. In order to do so, I needed to annotate a geoduck transcriptome in order to identify potential vitellogenin genes. I had previously assembled a geoduck transcriptome. For annotation, I used TransDecoder (v5.5.0). The annotation was run on our Mox HPC node.

Mox SBATCH script:

20181121_geo_transdecoder.sh


#!/bin/bash
## Job Name
#SBATCH --job-name=transcoder
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20181121_geo_transdecoder

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Make blast database available to blast
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008/

## Establish variables for more readable code
### Transdecoder programs
td_longorfs="/gscratch/srlab/programs/TransDecoder-v5.5.0/TransDecoder.LongOrfs"

td_predict="/gscratch/srlab/programs/TransDecoder-v5.5.0/TransDecoder.Predict"

### BLASTp
blastp="/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp"

### HMMscan
hmmscan="/gscratch/srlab/programs/hmmer-3.2.1/src/hmmscan"
### Transcriptome
geo_trinity_loc="/gscratch/srlab/sam/data/P_generosa/generosa_transcriptomes/20180827_trinity_geoduck.fasta"

geo_trinity="20180827_trinity_geoduck.fasta"

### UniProt database
uniprot="/gscratch/srlab/blastdbs/UniProtKB_20181008/20181008_uniprot_sprot.fasta"

### Pfam databases
pfam="/gscratch/srlab/sam/data/databases/pfam_db/Pfam31.0/Pfam-A.hmm"

####################
# Run transdecoder longorfs
${td_longorfs} -t ${geo_trinity_loc}

# Run blastp on UniProt database
${blastp} \
-query ${geo_trinity}.transdecoder_dir/longest_orfs.pep \
-db ${uniprot} \
-max_target_seqs 1 \
-outfmt 6 \
-evalue 1e-5 \
-num_threads 28 \
> blastp.outfmt6 \
2> blastp.err

# Run HMMscan on Pfam databases
${hmmscan} \
--cpu 28 \
--domtblout pfam.domtblout \
${pfam} \
${geo_trinity}.transdecoder_dir/longest_orfs.pep \
2> hmmscan.err

# Run transdecoder predict, using info from blatp and hmmscan
${td_predict} \
-t ${geo_trinity_loc} \
--retain_pfam_hits pfam.domtblout \
--retain_blastp_hits blastp.outfmt6

List of input files and databases used:

20180827_trinity_geoduck.fasta (FastA 972MB)
20181008_uniprot_sprot.fasta: Downloaded 20181008.
Pfam31.0/Pfam-A.hmm

Here’s a quick rundown of the TransDecoder process:

Identify longest open reading frames (ORFs)
BLASTp ORFs against UniProt database to identify protein matches.
HMMscan (Hidden Markov Model) against Pfam database to identify protein families.
TransDecoder predicts final coding sequences (CDS) using BLASTp and HMMscan info to provide additional support functional CDS identification.

RESULTS

Data was copied to Gannet via rsync

Output directory:

20181121_geo_transdecoder/

CDS FastA (271MB):

20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.cds

BED file (81MB):

20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.bed

GFF3 file (277MB):

20181121_geo_transdecoder/20180827_trinity_geoduck.fasta.transdecoder.gff3

Alrighty, now we have an annotated transcriptome that I can use for finding vitellogenin transcripts and designing some primers!