I was tasked with generating some qPCR primers to analyze vitellogenin expression in geoduck. In order to do so, I needed to annotate a geoduck transcriptome in order to identify potential vitellogenin genes. I had previously assembled a geoduck transcriptome. For annotation, I used TransDecoder (v5.5.0). The annotation was run on our Mox HPC node.
Mox SBATCH script:
#!/bin/bash
## Job Name
#SBATCH --job-name=transcoder
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20181121_geo_transdecoder
# Load Python Mox module for Python module availability
module load intel-python3_2017
# Document programs in PATH (primarily for program version ID)
date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log
# Make blast database available to blast
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008/
## Establish variables for more readable code
### Transdecoder programs
td_longorfs="/gscratch/srlab/programs/TransDecoder-v5.5.0/TransDecoder.LongOrfs"
td_predict="/gscratch/srlab/programs/TransDecoder-v5.5.0/TransDecoder.Predict"
### BLASTp
blastp="/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin/blastp"
### HMMscan
hmmscan="/gscratch/srlab/programs/hmmer-3.2.1/src/hmmscan"
### Transcriptome
geo_trinity_loc="/gscratch/srlab/sam/data/P_generosa/generosa_transcriptomes/20180827_trinity_geoduck.fasta"
geo_trinity="20180827_trinity_geoduck.fasta"
### UniProt database
uniprot="/gscratch/srlab/blastdbs/UniProtKB_20181008/20181008_uniprot_sprot.fasta"
### Pfam databases
pfam="/gscratch/srlab/sam/data/databases/pfam_db/Pfam31.0/Pfam-A.hmm"
####################
# Run transdecoder longorfs
${td_longorfs} -t ${geo_trinity_loc}
# Run blastp on UniProt database
${blastp} \
-query ${geo_trinity}.transdecoder_dir/longest_orfs.pep \
-db ${uniprot} \
-max_target_seqs 1 \
-outfmt 6 \
-evalue 1e-5 \
-num_threads 28 \
> blastp.outfmt6 \
2> blastp.err
# Run HMMscan on Pfam databases
${hmmscan} \
--cpu 28 \
--domtblout pfam.domtblout \
${pfam} \
${geo_trinity}.transdecoder_dir/longest_orfs.pep \
2> hmmscan.err
# Run transdecoder predict, using info from blatp and hmmscan
${td_predict} \
-t ${geo_trinity_loc} \
--retain_pfam_hits pfam.domtblout \
--retain_blastp_hits blastp.outfmt6
List of input files and databases used:
-
20181008_uniprot_sprot.fasta: Downloaded 20181008.
-
Pfam31.0/Pfam-A.hmm
Here’s a quick rundown of the TransDecoder process:
-
Identify longest open reading frames (ORFs)
-
BLASTp ORFs against UniProt database to identify protein matches.
-
HMMscan (Hidden Markov Model) against Pfam database to identify protein families.
-
TransDecoder predicts final coding sequences (CDS) using BLASTp and HMMscan info to provide additional support functional CDS identification.
RESULTS
Data was copied to Gannet via rsync
Output directory:
CDS FastA (271MB):
BED file (81MB):
GFF3 file (277MB):
Alrighty, now we have an annotated transcriptome that I can use for finding vitellogenin transcripts and designing some primers!