Genome Annotation - Pgenerosa_v071 Using GenSAS

In our ongoing attempts to get the Panopea generosa genome annotated in a way we’re comfortable with (the previous annotation attempts were lacking annotations on almost all of the largest scaffolds, which didn’t seem right), Steven stumbled across GenSAS, a web/GUI-based genome annotation program, so we gave it a shot.

This version of the genome annotation will be referred to as:

  • Panopea-generosa-vv0.71.a2

I uploaded the following to the GenSAS website to potentially use as “evidence files”:

Repeats Files


RESULTS

This took way longer than I was expecting: nearly an entire month, most of which (~3 weeks) was spent running the Augustus ab initio gene prediction:

v071 GenSAS project summary processes and runtimes screencap

Output folder:

Feature counts:

awk 'NR>3 { print $3 }' Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3 | sort | uniq -c

264153 CDS
264153 exon
 56167 gene
 56167 mRNA
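
The `NR>3` filter in the command above relies on the GFF3 having exactly three header lines. A minimal sketch of a variant that skips any line starting with `#` instead, and also reports the mean number of exons per mRNA, is below. The helper names (`count_features`, `mean_exons_per_mrna`, `row`) and the tiny demo file are assumptions made for illustration; they are not part of the GenSAS output.

```shell
#!/usr/bin/env bash
# Sketch: tally GFF3 feature types (column 3), skipping every line that starts with "#".

# Hypothetical helper: count_features <file.gff3>
count_features() {
  awk -F'\t' '!/^#/ && NF >= 3 { counts[$3]++ }
              END { for (f in counts) print counts[f], f }' "$1" | sort -k2,2
}

# Hypothetical helper: mean_exons_per_mrna <file.gff3>
mean_exons_per_mrna() {
  awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }
              END { if (n["mRNA"] > 0) printf "%.2f\n", n["exon"] / n["mRNA"] }' "$1"
}

# Tiny made-up GFF3 (not real geoduck data) so the sketch runs on its own.
demo_gff="$(mktemp)"
row() { printf 'scaffold_01\tdemo\t%s\t%s\t%s\t.\t+\t.\tID=%s\n' "$@"; }
{
  printf '##gff-version 3\n'
  row gene 1 900 g1;       row mRNA 1 900 m1
  row exon 1 300 e1;       row CDS 1 300 c1
  row exon 600 900 e2;     row CDS 600 900 c2
  printf '###\n'           # directive line in the middle of the file
  row gene 2000 2500 g2;   row mRNA 2000 2500 m2
  row exon 2000 2500 e3;   row CDS 2000 2500 c3
} > "$demo_gff"

count_features "$demo_gff"
mean_exons_per_mrna "$demo_gff"
```

Applied to the counts reported above, the same ratio is 264153 exons / 56167 mRNAs, or roughly 4.7 exons per predicted transcript.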

BUSCO assessment:

  • 80.7% complete BUSCOs present in predicted genes

Individual feature GFFs were made with the following shell commands:

features_array=(CDS exon gene mRNA)
input="Panopea-generosa-v1.0.a2-merged-2019-08-29-15-28-54.gff3"

for feature in "${features_array[@]}"
do
  output="Panopea-generosa-v1.0.a2.${feature}.gff3"
  # Copy the three GFF3 header lines; ">" overwrites any file left by an earlier run
  head -n 3 "${input}" > "${output}"
  # Append only the records whose type (column 3) matches the current feature
  awk -v feature="$feature" '$3 == feature {print}' "${input}" \
  >> "${output}"
done
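
The loop above reads the merged GFF once per feature type. As an alternative, here is a rough sketch of a single-pass split in awk, which is not what was actually run. The helper name `split_gff`, the output prefix and the demo data are all hypothetical. It also assumes the header is whatever block of `#` lines precedes the first record, rather than a fixed three lines.

```shell
#!/usr/bin/env bash
# Sketch: split a merged GFF3 into per-feature files in one pass over the input.

# Hypothetical helper: split_gff <merged.gff3> <output-prefix>
split_gff() {
  local input="$1" prefix="$2"
  awk -F'\t' -v prefix="$prefix" '
    /^#/ { if (!in_body) header = header $0 "\n"; next }  # keep leading header lines
    { in_body = 1 }
    $3 == "CDS" || $3 == "exon" || $3 == "gene" || $3 == "mRNA" {
      out = prefix "." $3 ".gff3"
      if (!(out in started)) {          # first record of this type
        printf "%s", header > out       # ">" truncates once, then awk keeps the file open
        started[out] = 1
      }
      print > out
    }' "$input"
}

# Tiny made-up input (not real geoduck data) so the sketch runs on its own.
demo_dir="$(mktemp -d)"
row() { printf 'scaffold_01\tdemo\t%s\t%s\t%s\t.\t+\t.\tID=%s\n' "$@"; }
{
  printf '##gff-version 3\n##demo header line 2\n##demo header line 3\n'
  row gene 1 900 g1;       row mRNA 1 900 m1
  row exon 1 300 e1;       row CDS 1 300 c1
  row exon 600 900 e2;     row CDS 600 900 c2
  row gene 2000 2500 g2;   row mRNA 2000 2500 m2
  row exon 2000 2500 e3;   row CDS 2000 2500 c3
} > "$demo_dir/merged.gff3"

split_gff "$demo_dir/merged.gff3" "$demo_dir/demo"
wc -l "$demo_dir"/demo.*.gff3
```

For a four-feature split the time saved is modest. The main practical difference is that each output file is truncated on its first write, so re-running the command does not append duplicate records.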

SwissProt functional annotations (tab-delimited text):

Pfam annotations (tab-delimited text):

Grabbed the top 10 most abundant Pfam Accessions to see how things looked:

Feature Count   Pfam Accession   Pfam Name
364             PF00643.19       B-box zinc finger
293             PF07690.11       Major facilitator family
228             PF00001.16       Rhodopsin-like receptors
220             PF12796.2        Ankyrin repeat
209             PF00651.26       BTB/POZ domain
206             PF00069.20       Protein kinase domain
180             PF00067.17       Cytochrome P450
175             PF02931.18       Ligand-gated ion channel
174             PF00400.27       WD40 repeat
174             PF00059.16       C-type lectin

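A rhodopsin protein family appears in the top 10 most abundant Pfams?! Rhodopsin itself is involved in light detection, but PF00001 covers the whole rhodopsin-like (class A) GPCR family, so these hits are not necessarily photoreceptors.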
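
A tally like the top 10 above could be produced along the lines of the sketch below. It assumes the Pfam annotations are tab-delimited with the accession in a single column. The column number, the helper name `top_values` and the demo rows are placeholders, since the real layout of the GenSAS Pfam file is not shown here.

```shell
#!/usr/bin/env bash
# Sketch: count how often each value occurs in one column of a tab-delimited file
# and print the N most frequent as "count<TAB>value".

# Hypothetical helper: top_values <file.tsv> <column-number> [n, default 10]
top_values() {
  local tsv="$1" col="$2" n="${3:-10}"
  cut -f "$col" "$tsv" \
    | sort \
    | uniq -c \
    | awk '{ print $1 "\t" $2 }' \
    | sort -t $'\t' -k1,1nr -k2,2 \
    | head -n "$n"
}

# Made-up rows (gene ID, Pfam accession); not taken from the real annotation file.
demo_tsv="$(mktemp)"
printf '%s\t%s\n' \
  PGEN_demo1 PF00643.19 \
  PGEN_demo2 PF00643.19 \
  PGEN_demo3 PF00001.16 \
  PGEN_demo4 PF00643.19 \
  PGEN_demo5 PF00001.16 \
  PGEN_demo6 PF07690.11 > "$demo_tsv"

# Accession assumed to be in column 2 of the demo file; show the top 2.
top_values "$demo_tsv" 2 2
```

The awk step keeps only the first whitespace-delimited token of each value. That works for accessions such as PF00643.19, but it would truncate free-text columns such as the Pfam name.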

InterProScan annotations (tab-delimited text):

Project Summary file (TEXT):


=================================
 Project Summary
---------------------------------
# Project Information
  Project Name         : Pgenerosa_v071
  Owner                : kubu4
  Create Date          : 2019-07-10 12:30:24

# Project Properties
  Genus                : Panopea
  Species              : generosa
  Project Type         : invertebrate
  Prefix               : PGEN_
  Common Name          : Pacific geoduck
  Genetic Code         : Standard Code

# Input FASTA
  Filename           : Pgenerosa_v071.fasta
  Filesize           : 1.32 GB
  Number of Sequence : 14014

=================================
 Job Information
---------------------------------
# Official Gene Set
  >PASA Refinement
  - version : 2.3.3
  - Transcripts FASTA file : Trinity.fasta

  # The source Job of the refinement job
    >Augustus-01
    - version : 3.3.1
    - Species : fly
    - Report genes on : both
    - Allowed gene structure : partial
    - cDNA (transcripts) sequences : Trinity.fasta
    - Protein sequences : 20180827_trinity_geoduck.fasta.transdecoder.fa


# The consensus mask Job
  >Masked Repeat Consensus

  # The source jobs for consensus mask job
    >RepeatMasker
    >RepeatModeler

  # Family copy number summary
    Family	Copy Numbers
    DNA	85
    DNA/Academ	264
    DNA/Crypton	200
    DNA/Kolobok-T2	188
    DNA/MuLE-MuDR	94
    DNA/PIF-Harbinger	482
    DNA/Sola	122
    DNA/TcMar-Mariner	599
    DNA/TcMar-Tc1	1266
    DNA/hAT-Tip100	808
    DNA/hAT-Tip100?	255
    Type:DNA	4363
    LINE	2153
    LINE/CR1	4122
    LINE/CR1-Zenon	1717
    LINE/I-Nimb	72
    LINE/Jockey	510
    LINE/L1-Tx1	967
    LINE/L2	1896
    LINE/Penelope	735
    LINE/Proto2	155
    LINE/R2-Hero	211
    LINE/RTE-X	2275
    LINE/Tad1	97
    Type:LINE	14910
    Type:SINE	0
    LTR/DIRS	192
    LTR/Gypsy	1420
    LTR/Ngaro	533
    LTR/Pao	146
    Type:LTR	2291
    Type:EVERYTHING_TE	21564
    Type:Simple_repeat	107
    Type:Unknown	115322

# The functional Jobs on the OGS
  >BLAST protein vs protein (blastp)_SP01
  - version : 2.7.1
  - Protein Data Set : SwissProt
  - Maximum HSP Distinace : 30000
  - Output type : tab
  - Matrix : BLOSUM62
  - Expect : 1e-8
  - Word Size : 3
  - Gap Open : 11
  - Gap Extend : 1

  >DIAMOND Functional SP01
  - version : 0.9.22
  - Protein Data Set : SwissProt

  >BLAST protein vs protein (blastp)
  - version : 2.7.1
  - Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
  - Maximum HSP Distinace : 30000
  - Output type : tab
  - Matrix : BLOSUM62
  - Expect : 1e-8
  - Word Size : 3
  - Gap Open : 11
  - Gap Extend : 1

  >DIAMOND Functional
  - version : 0.9.22
  - Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa

  >InterProScan
  - version : 5.29-68.0

  >Pfam
  - version : 1.6
  - E-value Sequence : 1
  - E-value Domain : 10

  >SignalP
  - version : 4.1
  - Organism group : euk
  - Method : best
  - D-cutoff for SignalP-noTM networks : 0.45
  - D-cutoff for SignalP-TM networks : 0.50
  - Minimal predicted signal peptide length : 10
  - Truncate to sequence length : 70

Overall, this annotation is much more believable than the previous MAKER annotations, because GenSAS actually predicts genes on all of the scaffolds (unlike MAKER)! It will be interesting to compare this to the GenSAS Panopea-generosa-vv0.74.a3 annotation.