Data Wrangling - Create Canonical Olurida_v081 Genes FastA

I finally had some time to tackle this GitHub Issue and create a canonical genes FastA file using the MAKER IDs, instead of the original contig IDs from our Olympia oyster genome assembly - https://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa (FastA; 1.1GB).

Everything was documented in a Jupyter Notebook (see link below), but here’s the skinny on how I did it:

  1. Pull existing FastA-formatted sequences from the fully annotated GFF (GFF; 2.9GB; MAKER appended the FastAs to the end of the GFF).

  2. Use ‘bedtools fastaFromBed’ to create a FastA of all genes from the gene GFF coordinates, generating a unique FastA header for each sequence.

  3. Use sed to replace the bedtools fastaFromBed IDs in the FastA headers with the corresponding MAKER IDs.
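
As a rough illustration of step 1, here's a minimal Python sketch of how embedded sequences can be separated from a GFF. It's not the code from the notebook. It assumes the GFF follows the GFF3 convention of marking the start of the appended sequences with a `##FASTA` line; the function name `split_gff_fasta` and the tiny example records are made up for the demo.

```python
from typing import Iterable, List, Tuple


def split_gff_fasta(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split GFF3 lines into (annotation lines, embedded FastA lines).

    Assumes the embedded sequences follow a single '##FASTA' directive,
    as described in the GFF3 specification.
    """
    annotations: List[str] = []
    fasta: List[str] = []
    in_fasta = False
    for line in lines:
        line = line.rstrip("\n")
        if not in_fasta and line.strip() == "##FASTA":
            # Everything after this directive is FastA, not annotation.
            in_fasta = True
            continue
        (fasta if in_fasta else annotations).append(line)
    return annotations, fasta


# Hypothetical miniature GFF, for illustration only.
example = [
    "##gff-version 3",
    "Contig0\tmaker\tgene\t1\t8\t.\t+\t.\tID=hypothetical_gene_1",
    "##FASTA",
    ">Contig0",
    "ACGTACGTAC",
]

annotations, fasta = split_gff_fasta(example)
print(len(annotations), "annotation lines;", len(fasta), "FastA lines")
```

For a multi-gigabyte GFF you'd want to stream to output files rather than build lists in memory, but the splitting logic would be the same.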

Jupyter Notebook (GitHub):


RESULTS

This ran for a surprisingly long time: a bit over 17 hours just for a find/replace. I think I could’ve sped things up if the last sed command had looked only at lines beginning with “>”, instead of scanning every line for each possible match. Oh well.
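
For what it's worth, here's a sketch of what a header-only rename could look like. It's in Python rather than the sed I actually used, so treat it as an untested alternative and not what ran in the notebook. It assumes the old-to-new ID pairs fit in a dictionary; `rename_headers` and the example IDs are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def rename_headers(lines: Iterable[str], id_map: Dict[str, str]) -> Iterator[str]:
    """Yield FastA lines, swapping header IDs via a dictionary lookup.

    Only lines starting with '>' are examined; sequence lines pass
    through untouched. Headers with no entry in id_map are left as-is.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            old_id = line[1:]
            yield ">" + id_map.get(old_id, old_id)
        else:
            yield line


# Hypothetical IDs, for illustration only.
id_map = {"Contig0:0-8": "hypothetical_gene_1"}
records = [">Contig0:0-8", "ACGTACGT", ">Contig1:4-9", "TTGCA"]

renamed = list(rename_headers(records, id_map))
print("\n".join(renamed))
```

The idea is that each header costs a single dictionary lookup, instead of every line being tested against every possible substitution, so the work should scale with the number of lines in the file and not with lines times IDs.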

Output folder:

Renamed FastA ():

Renamed FastA Index (txt):

Will add to Genomic Resources wiki.