Genome Annotation - Pgenerosa_v074 Using GenSAS

In our various attempts to get the Panopea generosa genome annotated in a manner we’re comfortable with (the previous annotation attempts were lacking annotations on almost all of the largest scaffolds, which didn’t seem right), Steven stumbled across GenSAS, a web/GUI-based genome annotation program, so we gave it a shot.

This version of the genome annotation will be referred to as:

  • Panopea-generosa-vv0.74.a3

I uploaded the following to the GenSAS website to potentially use as “evidence files”:

Repeats Files


RESULTS

This took way longer than I was expecting! The full run took nearly an entire month, and the majority of that time (~3 weeks) was spent running Augustus ab initio gene prediction:

GenSAS project summary processes and runtimes screencap


Output folder:

Merged GFF (SwissProt IDs in Column 9 - from BLASTp and DIAMOND):

Feature counts:

awk 'NR>3 { print $3 }' Panopea-generosa-vv0.74.a3-merged-2019-09-03-6-14-33.gff3 | sort | uniq -c

 192022 CDS
 192022 exon
  45748 gene
  45748 mRNA
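
The counts above work out to roughly 4.2 exons per mRNA (192022 / 45748). As a rough sketch, the same tally can be done in a single awk pass that skips comment lines by pattern instead of relying on `NR>3`, and prints that ratio as well. The function name `gff_feature_summary` is made up for illustration, and the sketch assumes the GFF3 is tab-delimited with the feature type in column 3:

```shell
# Tally GFF3 feature types (column 3) and report mean exons per mRNA.
# Assumes a tab-delimited GFF3; comment/directive lines start with "#".
gff_feature_summary() {
  awk -F'\t' '
    /^#/ || NF < 3 { next }            # skip comments, directives, blank lines
    { count[$3]++ }                    # one tally per feature type
    END {
      for (f in count) printf "%s\t%d\n", f, count[f]
      if (count["mRNA"] > 0)
        printf "exons_per_mRNA\t%.2f\n", count["exon"] / count["mRNA"]
    }
  ' "$1" | sort
}

# Usage (file name taken from the command above):
# gff_feature_summary Panopea-generosa-vv0.74.a3-merged-2019-09-03-6-14-33.gff3
```

Matching on `^#` rather than a fixed line count keeps the tally correct if the number of header lines ever changes.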

BUSCO assessment:

  • 68.4% complete BUSCOs present in predicted genes

Individual feature GFFs were made with the following shell commands:

features_array=(CDS exon gene mRNA)
input="Panopea-generosa-vv0.74.a3-merged-2019-09-24-9-20-04.gff3"

for feature in "${features_array[@]}"
do
  # Name each output file after its feature type
  output="Panopea-generosa-vv0.74.a3.${feature}.gff3"
  # Copy the three GFF header lines; ">" overwrites, so re-runs do not pile up duplicates
  head -n 3 "${input}" \
  > "${output}"
  # Append only the rows whose feature type (column 3) matches
  awk -v feature="$feature" '$3 == feature {print}' "${input}" \
  >> "${output}"
done
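
A quick sanity check on the split is to confirm that the per-feature files together hold as many feature rows as the merged GFF. This is only a sketch: `check_feature_split` is a hypothetical helper, and it assumes the merged file contains no feature types other than the four listed in the feature counts.

```shell
# Compare feature-row counts: merged GFF vs. the sum of the per-feature GFFs.
# Comment lines ("#...") and blank lines are ignored on both sides.
# Returns non-zero if the totals differ.
check_feature_split() {
  merged="$1"; shift
  merged_n=$(grep -c -v -e '^#' -e '^$' "$merged" || true)
  split_n=$(cat "$@" | grep -c -v -e '^#' -e '^$' || true)
  echo "merged=${merged_n} split=${split_n}"
  [ "$merged_n" -eq "$split_n" ]
}

# Usage:
# check_feature_split Panopea-generosa-vv0.74.a3-merged-2019-09-24-9-20-04.gff3 \
#   Panopea-generosa-vv0.74.a3.CDS.gff3 Panopea-generosa-vv0.74.a3.exon.gff3 \
#   Panopea-generosa-vv0.74.a3.gene.gff3 Panopea-generosa-vv0.74.a3.mRNA.gff3
```

A mismatch would point to either an unexpected feature type in the merged file or duplicated rows in an output file.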

SwissProt functional annotations (tab-delimited text):

Pfam annotations (tab-delimited text):

Grabbed the top 10 most abundant Pfam Accessions to see how things looked:

awk '{print $2}' Panopea-generosa-vv0.74.a3.5d65aaa449919-pfam.tab | sort | uniq -c | sort -nr | head

Feature Count   Pfam Accession   Pfam
1062            PF00078.22       Reverse transcriptase (RNA-dependent DNA polymerase)
370             PF00665.21       Integrase
353             PF13358.1        DDE superfamily endonuclease
264             PF03372.18       Endonuclease/Exonuclease/phosphatase family
213             PF14529.1        Endonuclease-reverse transcriptase
208             PF00643.19       B-box zinc finger
199             PF07690.11       Major Facilitator Superfamily
158             PF00001.16       7 transmembrane receptor (rhodopsin family)
148             PF12796.2        Ankyrin repeat
136             PF00096.21       Zinc finger

A couple of interesting things that I notice from this table:

  1. Four of the top five most abundant are involved in DNA transposition.

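  2. A rhodopsin protein family appears in the Top 10 most abundant Pfams?! Rhodopsin itself is involved in light detection, although PF00001 covers rhodopsin-like G protein-coupled receptors in general, so these hits are not necessarily light-related…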
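
To put the raw counts in context, it can help to see what share of all Pfam hits each accession represents. The sketch below makes the same assumption as the command above (accession in whitespace-delimited column 2, with no header row to skip), and `pfam_top` is a hypothetical helper name:

```shell
# Print the most abundant Pfam accessions as: count, percent of all hits, accession.
# $1 = Pfam table (accession assumed to be in column 2)
# $2 = number of rows to show (defaults to 10)
pfam_top() {
  awk '
    { count[$2]++; total++ }
    END {
      for (acc in count)
        printf "%d\t%.1f\t%s\n", count[acc], 100 * count[acc] / total, acc
    }
  ' "$1" | sort -k1,1nr -k3,3 | head -n "${2:-10}"
}

# Usage:
# pfam_top Panopea-generosa-vv0.74.a3.5d65aaa449919-pfam.tab 10
```

The percentages are relative to the total number of rows in the table, not to the number of predicted genes, since a single gene can carry several Pfam hits.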

InterProScan annotations (tab-delimited text):

Project Summary file (TEXT):


=================================
 Project Summary
---------------------------------
# Project Information
  Project Name         : Pgenerosa_v074
  Owner                : kubu4
  Create Date          : 2019-07-09 14:07:39

# Project Properties
  Genus                : Panopea
  Species              : generosa
  Project Type         : invertebrate
  Prefix               : PGEN_
  Common Name          : Pacific geoduck
  Genetic Code         : Standard Code

# Input FASTA
  Filename           : Pgenerosa_v074.fa
  Filesize           : 913.68 MB
  Number of Sequence : 18

=================================
 Job Information
---------------------------------
# Official Gene Set
  >PASA Refinement
  - version : 2.3.3
  - Transcripts FASTA file : Trinity.fasta

  # The source Job of the refinement job
    >Augustus-01
    - version : 3.3.1
    - Species : fly
    - Report genes on : both
    - Allowed gene structure : partial
    - cDNA (transcripts) sequences : Trinity.fasta
    - Protein sequences : 20180827_trinity_geoduck.fasta.transdecoder.fa


# The consensus mask Job
  >Masked Repeat Consensus

  # The source jobs for consensus mask job
    >RepeatMasker
    >RepeatModeler

  # Family copy number summary
    Family	Copy Numbers
    DNA	675
    DNA/Academ	1327
    DNA/Crypton	344
    DNA/Ginger	130
    DNA/Kolobok-T2	141
    DNA/Maverick	942
    DNA/MuLE-MuDR	201
    DNA/MuLE-NOF?	142
    DNA/P	167
    DNA/PIF-Harbinger	227
    DNA/RC	587
    DNA/Sola	508
    DNA/TcMar-Fot1	117
    DNA/TcMar-Mariner	6734
    DNA/TcMar-Tc1	3718
    DNA/hAT-Tip100	516
    DNA/hAT-hAT5	1037
    Type:DNA	17513
    LINE	883
    LINE/CR1	5204
    LINE/CR1-Zenon	14653
    LINE/I	980
    LINE/I-Nimb	1119
    LINE/L1	4031
    LINE/L1-Tx1	6620
    LINE/L2	8879
    LINE/L2-Hydra	113
    LINE/Penelope	1026
    LINE/RTE-X	21214
    Type:LINE	64722
    SINE/tRNA-Core-L2	41152
    Type:SINE	41152
    LTR/Caulimovirus	140
    LTR/DIRS	448
    LTR/Gypsy	1031
    LTR/Ngaro	343
    LTR/Pao	82
    Type:LTR	2044
    Type:EVERYTHING_TE	125431
    Type:Simple_repeat	19235
    Type:Unknown	1465471

# The functional Jobs on the OGS
  >InterProScan
  - version : 5.29-68.0

  >Pfam
  - version : 1.6
  - E-value Sequence : 1
  - E-value Domain : 10

  >SignalP
  - version : 4.1
  - Organism group : euk
  - Method : best
  - D-cutoff for SignalP-noTM networks : 0.45
  - D-cutoff for SignalP-TM networks : 0.50
  - Minimal predicted signal peptide length : 10
  - Truncate to sequence length : 70

  >BLAST protein vs protein (blastp)_SP01
  - version : 2.7.1
  - Protein Data Set : SwissProt
  - Maximum HSP Distinace : 30000
  - Output type : tab
  - Matrix : BLOSUM62
  - Expect : 1e-8
  - Word Size : 3
  - Gap Open : 11
  - Gap Extend : 1

  >DIAMOND Functional_SP01
  - version : 0.9.22
  - Protein Data Set : SwissProt

  >BLAST protein vs protein (blastp)
  - version : 2.7.1
  - Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
  - Maximum HSP Distinace : 30000
  - Output type : tab
  - Matrix : BLOSUM62
  - Expect : 1e-8
  - Word Size : 3
  - Gap Open : 11
  - Gap Extend : 1

  >DIAMOND Functional
  - version : 0.9.22
  - Protein Data Set : 20180827_trinity_geoduck.fasta.transdecoder.fa
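
The "Family copy number summary" in the project summary above lists per-family counts alongside `Type:` totals, and the family rows do add up to the reported class totals (for example, the DNA rows sum to 17513). A minimal sketch of how to check this automatically is below; `check_repeat_totals` is a hypothetical helper, and it assumes the summary has been saved to a text file in which the only two-field lines ending in an integer are rows of that table:

```shell
# Sum repeat family copy numbers per class (text before "/") and compare
# each sum with the matching "Type:<class>" total from the same file.
check_repeat_totals() {
  awk '
    NF == 2 && $2 ~ /^[0-9]+$/ {
      if ($1 ~ /^Type:/) { reported[substr($1, 6)] = $2; next }
      split($1, parts, "/")            # "DNA/Academ" -> class "DNA"
      summed[parts[1]] += $2
    }
    END {
      for (cls in summed) {
        status = (summed[cls] == reported[cls]) ? "OK" : "MISMATCH"
        printf "%s\tsummed=%d\treported=%d\t%s\n", cls, summed[cls], reported[cls], status
      }
    }
  ' "$1" | sort
}

# Usage (path is a placeholder for wherever the summary text is saved):
# check_repeat_totals project_summary.txt
```

Totals with no family rows of their own, such as `Type:Unknown` and `Type:Simple_repeat`, are simply not reported by this check.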

Overall, this annotation is much more believable than the previous MAKER annotations, because GenSAS actually predicts genes on all the scaffolds (unlike MAKER)! As such, this will likely become the canonical P. generosa genome annotation going forward. That said, we should still manually curate it when we have the time, to see how well the predictions line up with the evidence.