I’d previously annotated our Pgenerosa_v074 with GenSAS, but did so using limited options as we were (and still are) in need of an annotated genome to use for methylation data analysis. As such, I opted for speed, but a potentially a less accurate annotation.
Since then, I’ve re-run the annotation, but have, essentially, added all the options. Meaning, I’ve run additional ab initio gene predictors: BRAKER and GeneMarkES (previously, I only ran SNAP and Augustus). I also provided additional data to use as evidence; specifically a singular merged BAM file from the Stringtie Isoform ID I ran on 20190723. I also ran RNAmmer (rRNA identification) and tRNAscan (tRNA identification).
This should amount to a more accurate genome annotation, as the analysis will utilize a plethora of additional data to help predict genes.
This version of the genome annotation will be referred to as:
- Panopea-generosa-vv0.74.a4
RESULTS
Output folder:
Merged GFF (SwissProt IDs in Column 9 - from BLASTp and DIAMOND):
BUSCO assessment (metazoa_odb9):
- 71.5% complete BUSCOs present in predicted genes (Panopea-generosa-vv0.74.a3 had 68.4%)
Feature counts:
awk 'NR>3 {print $3}' Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gff3 | sort | uniq -c
236960 CDS
236960 exon
34947 gene
38326 mRNA
1527155 repeat_region
7 rRNA
15514 tRNA
Here’s a feature count comparison between the two different annotations:
Feature | Panopea-generosa-vv0.74.a3 | Panopea-generosa-vv0.74.a4 |
---|---|---|
CDS | 192022 | 236960 |
exon | 192022 | 236960 |
gene | 45748 | 34947 |
mRNA | 45748 | 38326 |
This is particularly interesting because we now see a difference in the number of mRNA/gene features in the Panopea-generosa-vv0.74.a4, as opposed to Panopea-generosa-vv0.74.a3 where these two feature counts are identical. The fact that Panopea-generosa-vv0.74.a4 has different counts (notice that mRNA
is greater than gene
) between these two features suggests that providing the Stringtie BAM files have resulted in identification/annotation of alternative isoforms for some genes.
Individual feature GFFs were made with the following shell commands:
features_array=(CDS exon gene mRNA repeat_region rRNA tRNA)
input="Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gff3"
for feature in ${features_array[@]}
do
output="Panopea-generosa-vv0.74.a4.${feature}.gff3"
head -n 3 ${input} \
>> ${output}
awk -v feature="$feature" '$3 == feature {print}' ${input} \
>> ${output}
done
SwissProt functional annotations (tab-delimited text):
- BLASTp
- DIAMOND
Pfam annotations (tab-delimited text):
Grabbed the top 10 most abundant Pfam Accessions to see how things looked:
awk '{print $2}' Panopea-generosa-vv0.74.a4.5d951bd7b0381-pfam.tab | sort | uniq -c | sort -nr | head
Feature Count | Pfam Accession | Pfam |
---|---|---|
250 | PF07690.11 | major facilitator superfamily (MFS) |
213 | PF00643.19 | B-box zinc finger |
202 | PF00069.20 | Protein kinase domain |
192 | PF00001.16 | Rhodopsin-like receptors |
190 | PF12796.2 | Ankyrin repeat |
147 | PF00400.27 | WD40 repeat |
145 | PF00084.15 | Selectin |
135 | PF00067.17 | Cytochrome P450 |
134 | PF00059.16 | C-type lectin |
133 | PF02931.18 | Ligand-gated ion channel |
A couple of interesting things that I notice from this table:
-
This annotation eliminates the DNA transposition Pfams that were present in the Panopea-generosa-vv0.74.a3 GenSAS annotation from 20190710
-
Repeat families are still present, but now make up the majority of the most abundant Pfams.
-
Rhodopsin Pfam is still amongst the Top 10 most abundant Pfams!
InterProScan annotations (tab-delimited text):
-
Panopea-generosa-vv0.74.a4.5d951bd337e9d-interproscan.tab
- Contains gene ontology (GO) terms
Project Summary file (TEXT):
=================================
Project Summary
---------------------------------
# Project Information
Project Name : Pgenerosa_v074
Owner : kubu4
Create Date : 2019-07-09 14:07:39
# Project Properties
Genus : Panopea
Species : generosa
Project Type : invertebrate
Prefix : PGEN_
Common Name : Pacific geoduck
Genetic Code : Standard Code
# Input FASTA
Filename : Pgenerosa_v074.fa
Filesize : 913.68 MB
Number of Sequence : 18
=================================
Job Information
---------------------------------
# Official Gene Set
>PASA Refinement BRAKER
- version : 2.3.3
- Transcripts FASTA file : Trinity.fasta
# The source Job of the refinement job
>BRAKER-01
- version : 2.1.0
- BAM File :
# The consensus mask Job
>Masked Repeat Consensus
# The source jobs for consensus mask job
>RepeatMasker
>RepeatModeler
# Family copy number summary
Family Copy Numbers
DNA 675
DNA/Academ 1327
DNA/Crypton 344
DNA/Ginger 130
DNA/Kolobok-T2 141
DNA/Maverick 942
DNA/MuLE-MuDR 201
DNA/MuLE-NOF? 142
DNA/P 167
DNA/PIF-Harbinger 227
DNA/RC 587
DNA/Sola 508
DNA/TcMar-Fot1 117
DNA/TcMar-Mariner 6734
DNA/TcMar-Tc1 3718
DNA/hAT-Tip100 516
DNA/hAT-hAT5 1037
Type:DNA 17513
LINE 883
LINE/CR1 5204
LINE/CR1-Zenon 14653
LINE/I 980
LINE/I-Nimb 1119
LINE/L1 4031
LINE/L1-Tx1 6620
LINE/L2 8879
LINE/L2-Hydra 113
LINE/Penelope 1026
LINE/RTE-X 21214
Type:LINE 64722
SINE/tRNA-Core-L2 41152
Type:SINE 41152
LTR/Caulimovirus 140
LTR/DIRS 448
LTR/Gypsy 1031
LTR/Ngaro 343
LTR/Pao 82
Type:LTR 2044
Type:EVERYTHING_TE 125431
Type:Simple_repeat 19235
Type:Unknown 1465471
# The functional Jobs on the OGS
>Pfam
- version : 1.6
- E-value Sequence : 1
- E-value Domain : 10
>SignalP
- version : 4.1
- Organism group : euk
- Method : best
- D-cutoff for SignalP-noTM networks : 0.45
- D-cutoff for SignalP-TM networks : 0.50
- Minimal predicted signal peptide length : 10
- Truncate to sequence length : 70
>BLAST protein vs protein (blastp)
- version : 2.7.1
- Protein Data Set : SwissProt
- Maximum HSP Distinace : 30000
- Output type : tab
- Matrix : BLOSUM62
- Expect : 1e-8
- Word Size : 3
- Gap Open : 11
- Gap Extend : 1
>DIAMOND Functional
- version : 0.9.22
- Protein Data Set : SwissProt
>InterProScan
- version : 5.29-68.0
Overall, I think this annotation is likely to be significantly better than the Panopea-generosa-vv0.74.a3 version. Particularly because we now have alternative isoform information for genes: