Genome Annotation - Pgenerosa_v074 MAKER on Mox

I previously created a subset of the Pgenerosa_v070 genome assembly that contains just the largest 18 scaffolds (these scaffolds were produced by Phase Genomics, utilizing some Hi-C sequencing). The new subsetted genome is labeled as Pgenerosa_v074.fa (914MB).

As part of that, Steven wanted this version annotated using MAKER.

The SBATCH script below performs the following:

  • one round of MAKER gene model predictions
  • two rounds of SNAP gene model training/predictions
  • renaming of gene models to NCBI-standardized convention
  • functional characterization of protein models (via BLASTp)
  • functional characterization of protein domains (via InterProScan5)

Here is a list of the input files used for the various components of the MAKER annotation:

NCBI Protein FastA files

  • NCBI Crassostrea gigas proteome (downloaded 20181119): GCA_000297895.1_oyster_v9_protein.faa

  • NCBI Crassostrea virginica proteome (downloaded 20181119): GCF_002022765.2_C_virginica-3.0_protein.faa

  • SwissProt BLASTp database (downloaded 20190109): uniprot_sprot.fasta

Repeats Files

SBATCH script (GitHub):

## Job Name
#SBATCH --job-name=maker_pgen074
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=2
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=40-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/outputs/20190701_pgen_maker_v074_annotation

# Exit if any command fails
set -e

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Load Open MPI module for parallel, multi-node processing

module load icc_19-ompi_3.1.2

# SegFault fix?

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo "${PATH}" | tr : \\n >> system_path.log

# Add BLAST to system PATH
export PATH=$PATH:/gscratch/srlab/programs/ncbi-blast-2.6.0+/bin
export BLASTDB=/gscratch/srlab/blastdbs/UniProtKB_20181008/

## Establish variables for more readable code


### Paths to Maker binaries



## Path to blastp

## Path to InterProScan5

## Store path to options control file

### Path to genome FastA file

### Paths to transcriptome FastA files

### Path to Crassostrea gigas NCBI protein FastA

### Path to Crassostrea virginica NCBI protein FastA

### Path to Panopea generosa TransDecoder protein FastAs

### Path to P.generosa-specific RepeatModeler library

### Path to P.generosa-specific RepeatMasker GFF

### Path to SwissProt database for BLASTp

## Make directories
mkdir blastp_annotation
mkdir interproscan_annotation
mkdir snap01
mkdir snap02

## Create the Maker control files needed for running Maker, only if they don't already exist, and then edit them.
### Edit options file
### Set paths to P.generosa genome and transcriptome.
### Set path to combined C. gigas, C.virginica, P.generosa proteomes.
### The use of the % symbol sets the delimiter sed uses for arguments.
### Normally, the delimiter that most examples use is a slash "/".
### But, we need to expand the variables into a full path with slashes, which screws up sed.
### Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
if [ ! -e maker_opts.ctl ]; then
  $maker -CTL
  sed -i "/^genome=/ s% %$genome %" "$maker_opts_file"

  # Set transcriptomes to use
  sed -i "/^est=/ s% %\
  ${EPI124_transcriptome} %" \
  "$maker_opts_file"

  # Set proteomes to use
  sed -i "/^protein=/ s% %\
  ${virginica_proteome} \
  %" \
  "$maker_opts_file"

  # Set RepeatModeler library to use
  sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"

  # Set RepeatMasker GFF to use
  sed -i "/^rm_gff=/ s% %${rm_gff} %" "$maker_opts_file"

  # Set est2genome to 1 - tells MAKER to use transcriptome FastAs
  sed -i "/^est2genome=0/ s/est2genome=0/est2genome=1/" "$maker_opts_file"

  # Set protein2genome to 1 - tells MAKER to use protein FastAs
  sed -i "/^protein2genome=0/ s/protein2genome=0/protein2genome=1/" "$maker_opts_file"
fi

## Run Maker
### Specify number of MPI processes to use.
mpiexec -n 56 $maker

## Merge gffs
${gff3_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log

## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log > Pgenerosa_v074.maker.all.noseqs.gff

## Merge all FastAs
${fasta_merge} -d Pgenerosa_v074.maker.output/Pgenerosa_v074_master_datastore_index.log

## Extract GFF alignments for use in subsequent MAKER rounds
### Transcript alignments
awk '{ if ($2 == "est2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.est2genome.gff
### Protein alignments
awk '{ if ($2 == "protein2genome") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.protein2genome.gff
### Repeat alignments
awk '{ if ($2 ~ "repeat") print $0 }' Pgenerosa_v074.maker.all.noseqs.gff > Pgenerosa_v074.maker.all.noseqs.repeats.gff

## Run SNAP training, round 1
cd "${wd}"
cd snap01
${maker2zff} ../Pgenerosa_v074.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v074_snap01.hmm

## Initiate second Maker run.
### Copy initial maker control files and
### Default gene prediction settings are 0 (i.e. don't generate Maker gene predictions)
### - use GFF subsets generated in first round of MAKER
### - set location of snaphmm file to use for gene prediction
### Percent symbols used below are the sed delimiters, instead of the default "/",
### due to the need to use file paths.
if [ ! -e maker_opts.ctl ]; then
  $maker -CTL
  sed -i "/^genome=/ s% %$genome %" maker_opts.ctl

  # Set transcriptomes to use
  sed -i "/^est=/ s% %\
  ${EPI124_transcriptome} %" \
  maker_opts.ctl

  # Set proteomes to use
  sed -i "/^protein=/ s% %\
  ${virginica_proteome} \
  %" \
  maker_opts.ctl

  # Set RepeatModeler library to use
  sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"

  sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
  sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
  sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl
  sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap01.hmm %" maker_opts.ctl
fi

## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v074_snap01

## Merge gffs
${gff3_merge} -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log

## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074_snap01.maker.output/Pgenerosa_v074_snap01_master_datastore_index.log > Pgenerosa_v074_snap01.maker.all.noseqs.gff

## Run SNAP training, round 2
cd "${wd}"
cd snap02
${maker2zff} ../snap01/Pgenerosa_v074_snap01.all.gff
${fathom} -categorize 1000 genome.ann genome.dna
${fathom} -export 1000 -plus uni.ann uni.dna
${forge} export.ann export.dna
${hmmassembler} genome . > Pgenerosa_v074_snap02.hmm

## Initiate third and final Maker run.

if [ ! -e maker_opts.ctl ]; then
  $maker -CTL
  sed -i "/^genome=/ s% %$genome %" maker_opts.ctl

  # Set transcriptomes to use
  sed -i "/^est=/ s% %\
  ${EPI124_transcriptome} %" \
  maker_opts.ctl

  # Set proteomes to use
  sed -i "/^protein=/ s% %\
  ${virginica_proteome} \
  %" \
  maker_opts.ctl

  # Set RepeatModeler library to use
  sed -i "/^rmlib=/ s% %$repeat_library %" "$maker_opts_file"

  sed -i "/^est_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.est2genome.gff %" maker_opts.ctl
  sed -i "/^protein_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.protein2genome.gff %" maker_opts.ctl
  sed -i "/^rm_gff=/ s% %../Pgenerosa_v074.maker.all.noseqs.repeats.gff %" maker_opts.ctl
  sed -i "/^snaphmm=/ s% %Pgenerosa_v074_snap02.hmm %" maker_opts.ctl
fi

## Run Maker
### Set basename of files and specify number of CPUs to use
mpiexec -n 56 $maker \
-base Pgenerosa_v074_snap02

## Merge gffs
${gff3_merge} \
-d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log

## GFF with no FastA in footer
${gff3_merge} -n -s -d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log > Pgenerosa_v074_snap02.maker.all.noseqs.gff

## Merge FastAs
${fasta_merge} \
-d Pgenerosa_v074_snap02.maker.output/Pgenerosa_v074_snap02_master_datastore_index.log

# Create copies of files for mapping
cp "${maker_prot_fasta}" "${maker_prot_fasta_renamed}"
cp "${maker_transcripts_fasta}" "${maker_transcripts_fasta_renamed}"
cp "${snap02_gff}" "${snap02_gff_renamed}"

# Map IDs
## Change gene names
${map_ids} \
--prefix PGEN_ \
--justify 8 \
"${snap02_gff}" \
> "${id_map}"

## Map GFF IDs
${map_gff_ids} \
"${id_map}" \
"${snap02_gff_renamed}"

## Map FastAs
### Proteins
${map_fasta_ids} \
"${id_map}" \
"${maker_prot_fasta_renamed}"

### Transcripts
${map_fasta_ids} \
"${id_map}" \
"${maker_transcripts_fasta_renamed}"

# Run InterProScan 5
## disable-precalc since this requires external database access (which Mox does not allow)
cd "${ips_dir}"

${interproscan} \
--input "${maker_prot_fasta_renamed}" \
--goterms \
--output-file-base "${ips_base}" \
--disable-precalc

# Run BLASTp
cd "${blastp_dir}"

${blastp} \
-query "${maker_prot_fasta_renamed}" \
-db ${sp_db_blastp} \
-out "${maker_blastp}" \
-max_target_seqs 1 \
-evalue 1e-6 \
-outfmt 6 \
-num_threads 28

# Functional annotations

cd "${wd}"

## Add putative gene functions
### GFF
${functional_gff} \
${sp_db_blastp} \
"${maker_blastp}" \
"${snap02_gff_renamed}" \
> ${put_func_gff}

### Proteins
${functional_fasta} \
${sp_db_blastp} \
"${maker_blastp}" \
"${maker_prot_fasta_renamed}" \
> ${put_func_prot}

### Transcripts
${functional_fasta} \
${sp_db_blastp} \
"${maker_blastp}" \
"${maker_transcripts_fasta_renamed}" \
> ${put_func_trans}

## Add InterProScan domain info
### Add searchable tags
${ipr_update_gff} \
${put_func_gff} \
"${ips_dir}"/${ips_name} \
> ${put_domain_gff}

### Add viewable features for genome browsers (JBrowse, Gbrowse, Web Apollo)
${iprscan2gff3} \
"${ips_dir}"/${ips_name} \
"${snap02_gff_renamed}" \
> ${ips_domains}
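
The script's comments explain why % is used as the sed delimiter when the replacement text is a file path. Here is a minimal sketch of that idea on a throwaway file. The control file contents and the genome path are made up for illustration (they are not the real MAKER defaults), and it assumes GNU sed for the -i flag:

```shell
# Hypothetical demo: the mock control file and genome path are illustrative only.
tmp_dir=$(mktemp -d)
ctl_file="${tmp_dir}/maker_opts.ctl"
genome="/path/to/example_genome.fa"

# Mimic the "key= #comment" layout of a freshly generated options file.
printf 'genome= #genome sequence\nest= #EST evidence\n' > "${ctl_file}"

# "%" is the delimiter, so the slashes in ${genome} are treated as literal text.
# On the line starting with "genome=", the first space is replaced with the
# path followed by a space.
sed -i "/^genome=/ s% %${genome} %" "${ctl_file}"

cat "${ctl_file}"
```

Only the line matching the address (/^genome=/) is edited; the est= line is left untouched.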


Output folder:

Well, this ran relatively quickly: A little over seven days.

Pgenerosa_v074 MAKER runtime screencap

The important files:

I should’ve just split the GFF as part of the Mox SBATCH script, but I didn’t, so I did it locally on my computer. Here are the commands for splitting the GFF. All the GFF files have been added to our Genomic Resources wiki (GitHub).

awk 'BEGIN { print "##gff-version 3" ; } $3 == "CDS" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.CDS.gff
awk 'BEGIN { print "##gff-version 3" ; } $3 == "exon" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.exon.gff
awk 'BEGIN { print "##gff-version 3" ; } $3 == "gene" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.gene.gff
awk 'BEGIN { print "##gff-version 3" ; } $3 == "mRNA" {print}' \
Pgenerosa_v074_genome_snap02.all.renamed.putative_function.domain_added.gff \
> Pgenerosa_v074.mRNA.gff
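
The four commands above differ only in the feature type, so the same split could be written as a loop. This is just a sketch run against a tiny mock GFF; the file names and feature lines are hypothetical and only there to make the example self-contained:

```shell
# Hypothetical demo on a tiny mock GFF; file names here are illustrative only.
work_dir=$(mktemp -d)
input_gff="${work_dir}/example.gff"

# Build a four-feature mock GFF (tab-delimited, nine columns).
{
  printf '##gff-version 3\n'
  printf 'scaf1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1\n'
  printf 'scaf1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-RA;Parent=g1\n'
  printf 'scaf1\tmaker\texon\t1\t900\t.\t+\t.\tID=g1-RA:1;Parent=g1-RA\n'
  printf 'scaf1\tmaker\tCDS\t1\t900\t.\t+\t0\tID=g1-RA:cds;Parent=g1-RA\n'
} > "${input_gff}"

# One pass per feature type. -F'\t' splits on tabs only, and -v passes the
# shell loop variable into awk for the column 3 comparison.
for feature in CDS exon gene mRNA; do
  awk -F'\t' -v feature="${feature}" \
    'BEGIN { print "##gff-version 3" } $3 == feature' \
    "${input_gff}" > "${work_dir}/example.${feature}.gff"
done

wc -l "${work_dir}"/example.*.gff
```

Each output file gets the GFF version header plus only the lines whose third column matches the feature type.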

With all of that out of the way, a cursory glance at the results is, honestly, quite shocking. A quick grep -c ">" on the FastA files reveals:

  • 1712 proteins/transcripts

This is compared to 53,035 proteins/transcripts identified/annotated in the Pgenerosa_v070 annotation and 20,813 proteins/transcripts in the Pgenerosa_v071 annotation (only contigs >10kbp).
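
For reference, the grep -c count mentioned above works because every FastA record has exactly one header line. A minimal sketch on a mock file (the sequences are made up), with the pattern anchored to the start of the line:

```shell
# Hypothetical demo: a mock two-record FastA file, not real data.
tmp_fasta=$(mktemp)
printf '>seq1 example protein\nMKT\n>seq2\nMLL\nAAA\n' > "${tmp_fasta}"

# Count header lines. Anchoring with ^ matches only lines that begin with ">".
record_count=$(grep -c '^>' "${tmp_fasta}")

echo "${record_count} sequences"
```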

Additionally, annotations only occur on 2 out of 18 scaffolds:

  • PGA_scaffold_17
  • PGA_scaffold_18.

Surprisingly, these are two of the three shortest (35 Mbp and 28 Mbp) of the 18 scaffolds subjected to annotation.
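
One way to check which scaffolds carry annotations is to tally gene features by the first GFF column. The sketch below runs on mock data; the scaffold names and counts are invented for the example and are not results from this annotation:

```shell
# Hypothetical demo: mock GFF lines, not real annotation results.
tmp_gff=$(mktemp)
{
  printf '##gff-version 3\n'
  printf 'scaffold_A\tmaker\tgene\t10\t500\t.\t+\t.\tID=g1\n'
  printf 'scaffold_A\tmaker\tmRNA\t10\t500\t.\t+\t.\tID=g1-RA;Parent=g1\n'
  printf 'scaffold_B\tmaker\tgene\t20\t700\t.\t-\t.\tID=g2\n'
  printf 'scaffold_A\tmaker\tgene\t900\t1500\t.\t+\t.\tID=g3\n'
} > "${tmp_gff}"

# Tally "gene" features per sequence ID (column 1), then sort by name.
gene_counts=$(awk -F'\t' '
  $3 == "gene" { n[$1]++ }
  END { for (s in n) print s "\t" n[s] }
' "${tmp_gff}" | sort)

echo "${gene_counts}"
```

Scaffolds with no gene features simply do not appear in the output, so comparing this list against the FastA headers would show which scaffolds were left unannotated.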

So, what does this mean? A few thoughts/conjectures in no particular order or prioritization:

  1. Limiting the annotation to the 18 longest scaffolds to create this Pgenerosa_v074 subset was the wrong approach. Looking back at the GitHub issue where scaffold selection was discussed reveals apparent confusion about which assembly provided by Phase Genomics is the proper one to use. However, that discussion seems to agree that the “manually curated” version of their genome assembly (which is what was used here) is the proper one.

  2. Something went awry in the annotation. This will be difficult/impossible to assess, particularly since two of the scaffolds were successfully annotated. If two worked, why wouldn’t the other 16? All 18 scaffolds were presented to MAKER as a single multi-FastA file. Presumably, each individual FastA would be processed by MAKER in the same way.

  3. The data is real and the majority of genes in the P.generosa genome lie outside of these large scaffolds. If this is the case, this means that there are lengthy stretches of DNA that do not have readily-identifiable genomic features. It also suggests that, overall, the assembly we have (Pgenerosa_v070) is quite fragmented.

So, to assess some of this (particularly points 2 and 3 above), I decided to look at the three annotations we have for P.generosa using IGV on some of the 18 scaffolds that are present in each of the three previous assembly variations we’ve annotated with MAKER.


No annotations were found in any of the three annotations. This is surprising, as this is the single longest scaffold in each of the three assemblies (89 Mbp).

IGV comparison of PGA_scaffold_1


Annotations only appear in the Pgenerosa_v070 assembly.

IGV comparison of PGA_scaffold_3


Annotations are found in each of the three assemblies.

IGV comparison of PGA_scaffold_18

Yeesh, so what do those IGV alignments tell us?

  1. The single longest scaffold appears to have no identifiable genomic features, as no annotations are present in any of the assemblies. That’s a pretty lengthy stretch of DNA to not have anything there. Caveat: I haven’t looked at transposable elements or other repeats. Is it possible that an 89 Mbp stretch of DNA could be composed solely of low-complexity sequence?

  2. The annotation process appears to be inconsistent. This is troublesome. Although this comes with a catch: I did not use the exact same data for each of the three annotations. For example, the Pgenerosa_v070 and Pgenerosa_v071 annotations used the singular P.generosa transcriptome assembly (from 20180827) for EST evidence, whereas the Pgenerosa_v074 annotation utilized tissue-specific transcriptome assemblies (from 20190409) and TransDecoder peptide data from each of these assemblies. Could these changes account for the differences we see? Oddly, and this thought is solely based on the limited IGV comparisons posted above, it seems like the fewer the overall number of sequences in an assembly, the fewer annotations MAKER is able to generate within any given scaffold.
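
A rough first check on the caveat in point 1 would be to look at how much of a scaffold consists of gap characters (N) or soft-masked (lowercase) bases. This is only a sketch on mock sequences; it assumes a soft-masked FastA is available, which has not been checked for this assembly:

```shell
# Hypothetical demo: mock sequences, and the soft-masking is an assumption.
tmp_fa=$(mktemp)
printf '>scaf_demo_1\nACGTNNNNacgt\nACGT\n>scaf_demo_2\nNNNN\n' > "${tmp_fa}"

# Per sequence: name, length, fraction of N/n, fraction of lowercase a/c/g/t.
composition=$(awk '
  /^>/ {
    if (name != "") report()
    name = substr($0, 2); total = 0; gaps = 0; lower = 0
    next
  }
  {
    total += length($0)
    # gsub returns the number of substitutions; run it on copies of the line.
    line = $0; gaps += gsub(/[Nn]/, "", line)
    line = $0; lower += gsub(/[acgt]/, "", line)
  }
  END { if (name != "") report() }
  function report() {
    if (total > 0)
      printf "%s\t%d\t%.2f\t%.2f\n", name, total, gaps / total, lower / total
  }
' "${tmp_fa}")

echo "${composition}"
```

A high N or lowercase fraction on the long scaffolds would point toward gaps or repeats as an explanation, while low fractions would make the missing annotations harder to explain that way.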

Now what?

Well, I have the Pgenerosa_v074 assembly currently running on the Genome Sequence Annotation Server (GenSAS) using most of the same files that I used for MAKER on Mox (exceptions are repeat masking/modeling files - those steps will be run on GenSAS). It will be interesting to see how the MAKER and GenSAS annotations compare.

I could also re-run the Pgenerosa_v071 annotation using the exact same files as Pgenerosa_v074 on GenSAS, as the files I need are already on their servers and it will be trivial to initiate the annotation process…