Author Archives: kubu4

FastQC/MultiQC/TrimGalore/MultiQC/FastQC/MultiQC – O.lurida WGBSseq for Methylation Analysis

I previously ran this data through the Bismark pipeline and followed up with MethylKit analysis. MethylKit analysis revealed an extremely low number of differentially methylated loci (DML), which seemed odd.

Steven and I met to discuss and compare our different variations on the analysis and decided to try out different tweaks to evaluate how they affect analysis.

I did the following tasks:

  1. Looked at original sequence data quality with FastQC.

  2. Summarized FastQC analysis with MultiQC.

  3. Trimmed data using TrimGalore!, trimming 10bp from 5′ end of reads (8bp is recommended by Bismark docs).

  4. Summarized trimming stats with MultiQC.

  5. Looked at trimmed sequence quality with FastQC.

  6. Summarized FastQC analysis with MultiQC.

This was run on the Univ. of Washington High Performance Computing (HPC) cluster, Mox.

Mox SBATCH submission script has all details on how the analyses were conducted:


RESULTS

Output folder:

Raw sequence FastQC output folder:

Raw sequence MultiQC report (HTML):

TrimGalore! output folder (trimmed FastQ files are here):

Trimming MultiQC report (HTML):

Trimmed FastQC output folder:

Trimmed MultiQC report (HTML):

Transposable Element Mapping – Crassostrea virginica Genome, Cvirginica_v300, using RepeatMasker 4.07

Per this GitHub issue, I’m IDing transposable elements (TEs) in the Crassostrea virginica genome.

Genome used:

I ran RepeatMasker (v4.07) with RepBase-20170127 and RMBlast 2.6.0 four times:

  1. Species = all

  2. Species = Crassostrea gigas (Pacific oyster)

  3. Species = Crassostrea virginica (Eastern oyster)

  4. Default settings (i.e. no species select – will use human genome).

The idea with running this with four different settings was to get a sense of how the analyses would differ with species specifications.

All runs were performed on roadrunner.

All commands were documented in a Jupyter Notebook (GitHub):

NOTE: RepeatMasker writes the desired output files (*.out, *.cat.gz, and *.gff) to the same directory that the genome is located in! If you conduct multiple runs with the same genome in the same directory, it will overwrite those files, as they are named using the genome assembly filename. Be sure to move files out of the genome directory after each run!


RESULTS:
RUN 1 (species – all)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:  113771462 bp ( 16.62 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        97003     27946871 bp    4.08 %
   SINEs:            48145      9242559 bp    1.35 %
   Penelope           1429       256929 bp    0.04 %
   LINEs:            27022     10570154 bp    1.54 %
    CRE/SLACS           28         2219 bp    0.00 %
     L2/CR1/Rex       2160       316660 bp    0.05 %
     R1/LOA/Jockey    3058       386611 bp    0.06 %
     R2/R4/NeSL        511       226938 bp    0.03 %
     RTE/Bov-B        7377      3276312 bp    0.48 %
     L1/CIN4          1331        95476 bp    0.01 %
   LTR elements:     21836      8134158 bp    1.19 %
     BEL/Pao          1807       936488 bp    0.14 %
     Ty1/Copia        3046       296183 bp    0.04 %
     Gypsy/DIRS1     12789      6060883 bp    0.89 %
       Retroviral     2369       152228 bp    0.02 %

DNA transposons     180693     29492426 bp    4.31 %
   hobo-Activator    12869      1114188 bp    0.16 %
   Tc1-IS630-Pogo    17233      2485049 bp    0.36 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           2388       405926 bp    0.06 %
   Tourist/Harbinger  9302       992476 bp    0.14 %
   Other (Mirage,      238        15946 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       137707     45460608 bp    6.64 %

Total interspersed repeats:   102899905 bp   15.03 %


Small RNA:           45243      9057873 bp    1.32 %

Satellites:           3852       760316 bp    0.11 %
Simple repeats:     203542      8946510 bp    1.31 %
Low complexity:      26205      1281043 bp    0.19 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be root          
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+


RUN 2 (species – Crassostrea gigas)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   93923386 bp ( 13.72 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        26397     15008601 bp    2.19 %
   SINEs:                4          722 bp    0.00 %
   Penelope            675       190160 bp    0.03 %
   LINEs:            17645      8922188 bp    1.30 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex         70        39188 bp    0.01 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          4         5110 bp    0.00 %
     RTE/Bov-B        6194      2718955 bp    0.40 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:      8748      6085691 bp    0.89 %
     BEL/Pao           933       788887 bp    0.12 %
     Ty1/Copia          47        82743 bp    0.01 %
     Gypsy/DIRS1      6819      4822734 bp    0.70 %
       Retroviral        0            0 bp    0.00 %

DNA transposons     163945     26422122 bp    3.86 %
   hobo-Activator     7742       720623 bp    0.11 %
   Tc1-IS630-Pogo    15615      2328538 bp    0.34 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac           2246       393498 bp    0.06 %
   Tourist/Harbinger  8431       876020 bp    0.13 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:       160681     41266796 bp    6.03 %

Total interspersed repeats:    82697519 bp   12.08 %


Small RNA:             214        40811 bp    0.01 %

Satellites:           1396       217317 bp    0.03 %
Simple repeats:     216869      9637447 bp    1.41 %
Low complexity:      27520      1418990 bp    0.21 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea gigas
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        

RUN 3 (species – Crassostrea virginica)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   46637065 bp ( 6.81 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements        43139      8952068 bp    1.31 %
   SINEs:            43139      8952068 bp    1.31 %
   Penelope              0            0 bp    0.00 %
   LINEs:                0            0 bp    0.00 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey       0            0 bp    0.00 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B           0            0 bp    0.00 %
     L1/CIN4             0            0 bp    0.00 %
   LTR elements:         0            0 bp    0.00 %
     BEL/Pao             0            0 bp    0.00 %
     Ty1/Copia           0            0 bp    0.00 %
     Gypsy/DIRS1         0            0 bp    0.00 %
       Retroviral        0            0 bp    0.00 %

DNA transposons       3538      1564942 bp    0.23 %
   hobo-Activator        0            0 bp    0.00 %
   Tc1-IS630-Pogo        0            0 bp    0.00 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger     0            0 bp    0.00 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:        65151     23982146 bp    3.50 %

Total interspersed repeats:    34499156 bp    5.04 %


Small RNA:           43353      8992879 bp    1.31 %

Satellites:              1          222 bp    0.00 %
Simple repeats:     232627     10544162 bp    1.54 %
Low complexity:      29762      1561018 bp    0.23 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be crassostrea virginica
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

RUN 4 (default settings – human genome)

Output folder:

Summary table (text):

Output table (GFF):

SUMMARY TABLE

==================================================
file name: Cvirginica_v300.fa       
sequences:            11
total length:  684741128 bp  (684675328 bp excl N/X-runs)
GC level:         34.83 %
bases masked:   13461422 bp ( 1.97 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:             2056       120820 bp    0.02 %
      ALUs            0            0 bp    0.00 %
      MIRs          240        14635 bp    0.00 %

LINEs:             3408       331585 bp    0.05 %
      LINE1         240        16835 bp    0.00 %
      LINE2         728        69177 bp    0.01 %
      L3/CR1       1369       135234 bp    0.02 %

LTR elements:       704       236625 bp    0.03 %
      ERVL           14          944 bp    0.00 %
      ERVL-MaLRs     12          892 bp    0.00 %
      ERV_classI    272        36695 bp    0.01 %
      ERV_classII     4          206 bp    0.00 %

DNA elements:      1088       100026 bp    0.01 %
     hAT-Charlie     27         1543 bp    0.00 %
     TcMar-Tigger   142         9891 bp    0.00 %

Unclassified:        57         6096 bp    0.00 %

Total interspersed repeats:   795152 bp    0.12 %


Small RNA:         3698       279669 bp    0.04 %

Satellites:          73         5524 bp    0.00 %
Simple repeats:  247957     10848509 bp    1.58 %
Low complexity:   30084      1536314 bp    0.22 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >=20 X/Ns in query were excluded in % calcs


The query species was assumed to be homo sapiens  
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
        
run with rmblastn version 2.6.0+

Assembly Stats – Geoduck Hi-C Final Assembly Comparison

We received the final geoduck genome assembly data from Phase Genomics, in which they updated the assembly by performing some manual curation:

There are additional assembly files that provide some additional assembly data. See the following directory:

Actual sequencing data and two previous assemblies were previously received on 20180421.

All assembly data (both old and new) from Phase Genomics was downloaded in full from the Google Drive link provided by them and stored here on Owl:

Ran Quast to compare all three assemblies provided (command run on Swoose):


/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180403_pga,20180421_pga,20180810_geo_manual \
/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-03 11:05:41.596285/PGA_assembly.fasta \ /mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-21 18:09:04.514704/PGA_assembly.fasta \ /mnt/owl/Athaliana/20180822_phase_genomics_geoduck_Results/geoduck_manual/geoduck_manual_scaffolds.fasta

Results:

Quast output folder: results_2018_08_23_07_38_28/

Quast report (HTML): results_2018_08_23_07_38_28/report.html

DNA Methylation Analysis – Bismark Pipeline on All Olympia oyster BSseq Datasets

Bismark analysis of all of our current Olympia oyster (Ostrea lurida) DNA methylation high-throughput sequencing data.

Analysis was run on Emu (Ubuntu 16.04LTS, Apple Xserve). The primary analysis took ~14 days to complete.

All operations are documented in a Jupyter notebook (GitHub):

Genome used:


Input files ( see Olympia oyster Genomic GitHub wiki for more info ):

WG BSseq of Fidalgo Bay offspring grown in Fidalgo Bay & Oyster Bay
  • 1_ATCACG_L001_R1_001.fastq.gz

  • 2_CGATGT_L001_R1_001.fastq.gz

  • 3_TTAGGC_L001_R1_001.fastq.gz

  • 4_TGACCA_L001_R1_001.fastq.gz

  • 5_ACAGTG_L001_R1_001.fastq.gz

  • 6_GCCAAT_L001_R1_001.fastq.gz

  • 7_CAGATC_L001_R1_001.fastq.gz

  • 8_ACTTGA_L001_R1_001.fastq.gz

MBDseq of two populations (Hood Canal & Oyster Bay) grown in Clam Bay
  • zr1394_10_s456.fastq.gz

  • zr1394_11_s456.fastq.gz

  • zr1394_12_s456.fastq.gz

  • zr1394_13_s456.fastq.gz

  • zr1394_14_s456.fastq.gz

  • zr1394_15_s456.fastq.gz

  • zr1394_16_s456.fastq.gz

  • zr1394_17_s456.fastq.gz

  • zr1394_18_s456.fastq.gz

  • zr1394_1_s456.fastq.gz

  • zr1394_2_s456.fastq.gz

  • zr1394_3_s456.fastq.gz

  • zr1394_4_s456.fastq.gz

  • zr1394_5_s456.fastq.gz

  • zr1394_6_s456.fastq.gz

  • zr1394_7_s456.fastq.gz

  • zr1394_8_s456.fastq.gz

  • zr1394_9_s456.fastq.gz


RESULTS:

With Bismark complete, these two sets of analyses can now be looked into further (and separately, as they are separate experiments) using things like MethylKit (R package) and
the Integrative Genomics Viewer (IGV).

Output folder:

Bismark Summary Report:

Individual Sample Reports:

Data Received – Geoduck Metagenome HiSeqX Data

Received the data from the geoduck metagenome libraries that I prepared and were sequenced at the Northwest Genomics Center at UW on the HiSeqX (Illumina) – PE 151bp.

FastQ files are being transferred to owl/nightingales/P_generosa.

These aren’t geoduck sequences, but they are part of a geoduck project. Maybe I should establish a metagenomics directory under nightingales?

Will verifiy md5 checksums and update readme file once the transfer is complete.

RNA Isolation & Quantificaiton – Tanner Crab Hemolymph

Isolated RNA from 40 Tanner crab hemolymph samples selected by Grace with the RNeasy Plus Micro Kit (Qiagen) according to the manufacturer’s protocol, with the following modifications:

  • Added mercaptoethanol (2-ME) to Buffer RLT Plus.

  • All spins were at 21,130g

  • Did not add RNA carrier

  • Used QIAshredder columns to aid in homogenization and removal of insoluble material

  • Eluted with 14uL

RNA was quantified using the Qubit RNA HS (high sensitivity) Assay and run on the Roberts Lab Qubit 3.0.

Used 1uL of sample for quantification.

RNA was returned to the -80C box from where original samples had been stored (Rack 2, Row 3, Column 4).


RESULTS

Qubit quantification (Google Sheet):

Overall, the results aren’t great. Only 15 samples (out of 40) had detectable amounts of RNA. Yields from those 15 samples ranged from 40ng – 300ng, with most landing between 50 – 100ng.

Will pass info along to Grace. Will likely meet with her and Steven to discuss plan on how to move forward.

Genome Annoation – Olympia oyster genome annotation results #02

Yesterday, I annotated our Olympia oyster genome using WQ-MAKER in just 7hrs!.

See that link for run setup and configuration. They are essentially the same, except for the change I’ll discuss below.

The results from that run can be seen here:

In that previous run, I neglected to provide a transposable elements FastA file for use with RepeatMasker.

I remedied that and re-ran it. I modified maker_opts.ctl to include the following:

repeat_protein=../../opt/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner

This TEs file is part of RepeatMasker.


RESULTS

Output folder:

Annotated genome file (GFF):



This run took about an hour longer than the previous run, but for some reason it ran with only 21 workers, instead of 22. This is probably the reason for the increased run time.

I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical to start working with will be those scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with NCBI protein IDs will give us a basic annotaiton of “functional” portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the previous run where I did not provide the transposable elements FastA file.

Genome Annoation – Olympia oyster genome annotation results #01

Yesterday, I annotated our Olympia oyster genome using WQ-MAKER in just 7hrs!.

See that link for run setup and configuration.


RESULTS

Before proceeding further, it should be noted that I neglected to provide Maker with a transposable elements FastA file for RepeatMasker to use.

The following line in the maker_opts.ctl file was originally populated with an absolute path to data I didn’t recognize, so I removed it:

repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner

I’m not entirely sure what the impacts will be on annotation, so I’ve re-run Maker with that line restored (using a relative path). You can find the results of that run here:

Output folder:

Annotated genome file (GFF):

I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical approach from here is to start working with scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with a database of NCBI protein IDs will give us a basic annotation of “functional” portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the follow up run where I provided the transposable elements FastA file to see what impacts the exclusion/inclusion of that info had on annotation.

Genome Annotation – Olympia oyster genome using WQ-MAKER Instance on Jetstream

Yesterday, our Xsede Startup Application (Google Doc) got approval for 100,000 Service Units (SUs) and 1TB of disk space on Xsede/Atmosphere/Jetstream (or, whatever it’s actually called!). The approval happened within an hour of submitting the application!

Here’s a copy of the approval notice:

Dear Dr. Roberts:

Your recently submitted an XSEDE Startup request has been reviewed and approved.

PI: Steven Roberts, University of Washington

Request: Annotation of Olympia oyster (Ostrea lurida) and Pacific geoduck (Panopea generosa) genomes using WQ_MAKER on Jetstream cloud.

Request Number: MCB180124 (New)

Start Date: N/A

End Date: 2019-08-05

Awarded Resources: IU/TACC (Jetstream): 100,000.0 SUs

IU/TACC Storage (Jetstream Storage): 1,000.0 GB

Allocations Admin Comments:

The estimated value of these awarded resources is $14,890.00. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for U.S. The dollar value of your allocation is estimated from the NSF awards supporting the allocated resources.

If XSEDE Extended Collaborative Support (ECSS) assistance was recommended by the review panel, you will be contacted by the ECSS team within the next two weeks to begin discussing this collaboration.

For details about the decision and reviewer comments, please see below or go to the XSEDE User Portal (https://portal.xsede.org), login, click on the ALLOCATIONS tab, then click on Submit/Review Request. Once there you will see your recently awarded research request listed on the right under the section ‘Approved’. Please select the view action to see reviewer comments along with the notes from the review meeting and any additional comments from the Allocations administrator.

By default the PI and all co-PIs will be added to the resources awarded. If this is an award on a renewal request, current users will have their account end dates modified to reflect the new end date of this award. PIs, co-PIs, or Allocation Managers can add users to or remove users from resources on this project by logging into the portal (https://portal.xsede.org) and using the ‘Add/Remove User’ form.

Share the impact of XSEDE! In exchange for access to the XSEDE ecosystem, we ask that all users let us know what XSEDE has helped you achieve:

  • For all publications, please acknowledge use of XSEDE and allocated resources by citing the XSEDE paper (https://www.xsede.org/how-to-acknowledge-xsede) and also add your publications to your user profile.
  • Tell us about your achievements (http://www.xsede.org/group/xup/science-achievements).
  • Help us improve our reporting by keeping your XSEDE user profile up to date and completing the demographic fields (https://portal.xsede.org/group/xup/profile).

For question regarding this decision, please contact help@xsede.org.

Best regards,
XSEDE Resource Allocations Service

===========================

REVIEWER COMMENTS

===========================

Review #0 – Excellent

Assessment and Summary:
Maker is a well-known and fitting workflow for Jetstream. This should be a good use of resources.

Appropriateness of Methodology:

After each user on the allocation has logged in, the PI will need to open a ticket via help@xsede.org and request the quota be set per the allocation to 1TB. Please provide the XSEDE portal ID of each user.

Appropriateness of Computational Research Plan:

We request that any publications stemming from work done using Jetstream cite us – https://jetstream-cloud.org/research/citing-jetstream.php

Efficient Use of Resources:


We had a tremendous amount of help from Upendra Devisetty at CyVerse in getting the Xsede Startup Application written, as well as running WQ-MAKER on Xsede/Atmosphere/Jetstream (or, whatever it’s called!).


Now, on to how I got the run going…

I initiated the Olympia oyster genome annotation using a WQ-MAKER instance (MAKER 2.31.9 with CCTools v3.1) on Jetstream:


I followed the excellent step-by-step directions here:

The “MASTER” machine was a “m1.xlarge” machine (i.e. CPU: 24, Mem: 60 GB, Disk: 60 GB).

I attached a 1TB volume to the MASTER machine.

I set up the run using 21 “WORKER” machines. Twenty WORKERS were “m1.large” machines (i.e. CPU: 10, Mem: 30 GB, Disk: 60 GB). The remaining WORKER was set at “m1.xlarge” (i.e. CPU: 24, Mem: 60 GB, Disk: 60 GB) to use up the rest of our allocated memory.


MASTER machine was initialized with the following command:

nohup wq_maker -contigs-per-split 1 -cores 1 -memory 2048 -disk 4096 -N wq_maker_oly${USER} -d all -o master.dbg -debug_size_limit=0 -stats test_out_stats.txt > log_file.txt 2>&1 &

Each WORKER was started with the following command:

nohup work_queue_worker -N wq_maker_oly${USER} --cores all --debug-rotate-max=0 -d all -o worker.dbg > log_file_2.txt 2>&1 &

When starting each WORKER, an error message was generated, but this doesn’t seem to have any impact on the ability of the program to run:


I checked on the status of the run and you can see that there are 22 WORKERS and 15,568 files “WAITING”. BUT, there are 159,429 contigs in our genome FastA!

Why don’t these match??!!

This is because WQ-MAKER splits the genome FastA into smaller FastA files containg only 10 sequences each. This is why we see 10-fold fewer files being processed than sequences in our genome file.


Here is the rest of the nitty gritty details:

Genome file:
Transcriptome file:
Proteome files (NCBI):

The two FastA files below were concatenated into a single FastA file (gigas_virginica_ncbi_proteomes.fasta) for use in WQ-MAKER.

Maker options control file (maker_opts.ctl):

This is a bit difficult to read in this notebook; copy and paste in text editor for easier viewing.

NOTE: Paths to files (e.g. genome FastA) have to be relative paths; cannot be absolute paths!


#-----Genome (these are always required)
genome=../data/oly_genome/Olurida_v081.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=../data/oly_transcriptome/Olurida_transcriptome_v3.fasta #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=../data/gigas_virginica_ncbi_proteomes.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

After the run finishes, it will have produced a corresponding GFF file for each scaffold. This is unwieldly, so the GFFs will be merged using the following code:

$ gff3_merge -n -d Olurida_v081.maker.output/Olurida_v081_master_datastore_index.log

All WORKERS were running as of 07:45 today. As of this posting (~3hrs later), WQ-MAKER had already processed ~45% of the files! Annotation will be finished by the end of today!!! Crazy!