I finally had some time to tackle this GitHub Issue and create a canonical genes FastA file using the MAKER IDs, instead of the original contig IDs from our Olympia oyster genome assembly - https://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa (FastA; 1.1GB).
Everything was documented in a Jupyter Notebook (see link below), but here’s the skinny on how I did it:
- Pull existing FastA-formatted sequences from the fully annotated GFF (GFF; 2.9GB; MAKER appended the FastAs to the end of the GFF).
- Use ‘bedTools fastaFromBed’ to create FastA for all genes using gene GFF coordinates and generate unique FastA headers for each sequence.
- Use sed to replace the bedTools fastaFromBed IDs with the MAKER IDs.
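To make the ID pairing in the last two steps concrete, here is a minimal Python sketch of how a lookup table between coordinate-style headers and MAKER IDs could be built from the gene lines of a GFF. This is not the code from the notebook. It assumes the generated headers look like `contig:start-end` with a 0-based start (which, as far as I know, is the bedtools default when no naming option is used), and the function name `build_id_map` and the example gene ID are hypothetical.

```python
def build_id_map(gff_lines):
    """Map assumed 'contig:start-end' headers to the GFF ID attribute of each gene."""
    id_map = {}
    for line in gff_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Only keep well-formed feature lines of type "gene".
        if len(fields) < 9 or fields[2] != "gene":
            continue
        seqid, start, end, attrs = fields[0], int(fields[3]), int(fields[4]), fields[8]
        # GFF starts are 1-based; the assumed header format uses a 0-based start.
        header = f"{seqid}:{start - 1}-{end}"
        for attr in attrs.split(";"):
            key, _, value = attr.partition("=")
            if key.strip() == "ID":
                id_map[header] = value
                break
    return id_map


# Hypothetical two-feature example; real MAKER IDs will look different.
example_gff = [
    "##gff-version 3\n",
    "Contig0\tmaker\tgene\t101\t500\t.\t+\t.\tID=hypothetical-gene-0.1;Name=hypothetical-gene-0.1\n",
    "Contig0\tmaker\tmRNA\t101\t500\t.\t+\t.\tID=hypothetical-gene-0.1-mRNA-1\n",
]
id_map = build_id_map(example_gff)
print(id_map)
```

Only the `gene` line ends up in the table; the mRNA line and the header directive are skipped.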
Jupyter Notebook (GitHub):
RESULTS
This ran for a surprisingly long time - a bit over 17 hours just for a find/replace. I think I could’ve sped things up if the last sed command looked only at lines beginning with “>”, instead of scanning each line for each possible match. Oh well.
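For what it's worth, here is a rough Python sketch of the header-only idea. It is not what I actually ran. It assumes the old-to-new ID pairs are already in a dictionary, and the function name `rename_headers` and the example IDs are hypothetical. Each header gets a single dictionary lookup, and sequence lines pass through untouched, so the work scales with the number of lines rather than lines times IDs.

```python
def rename_headers(fasta_lines, id_map):
    """Yield FastA lines, swapping header IDs found in id_map and leaving the rest as-is."""
    for line in fasta_lines:
        if line.startswith(">"):
            old_id = line[1:].strip()
            # One dictionary lookup per header; unknown IDs are kept unchanged.
            new_id = id_map.get(old_id, old_id)
            yield f">{new_id}\n"
        else:
            # Sequence lines are never scanned for matches.
            yield line


# Hypothetical example data.
example_map = {"Contig0:100-500": "hypothetical-gene-0.1"}
example_fasta = [">Contig0:100-500\n", "ACGTACGT\n", ">Contig1:0-8\n", "TTTTAAAA\n"]
renamed = list(rename_headers(example_fasta, example_map))
print("".join(renamed), end="")
```

On a real file, the same generator could be fed an open file handle and its output written line by line, so the whole FastA never has to sit in memory.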
Output folder:
Renamed FastA ():
Renamed FastA Index (txt):
Will add to Genomic Resources wiki.