Genome Assembly – Olympia Oyster Redundans with Illumina + PacBio

Redundans should assemble both Illumina and PacBio data, so let’s do that.

Sean had previously performed this – twice actually:

It wasn’t entirely clear how he had run Redundans the first time. The second time, he used his Platanus contig FASTA file as the necessary reference assembly when running Redundans.

Since he had produced a good looking assembly from PacBio data using Canu, I decided to give Redundans a rip using that assembly.

I then compared all three Redundans runs using QUAST.

Jupyter notebook (GitHub): 20171004_docker_oly_redundans.ipynb

Notebook is also embedded at the bottom of this notebook entry (but, it should be easier to view at the link provided above).

Sean’s Canu assembly (FASTA): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/oly_pacbio_.contigs.fasta
Sean’s first Redundans assembly (scaffolded assembly; FASTA): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output/scaffolds.reduced.fa
Sean’s second Redundans assembly (scaffolded assembly; FASTA): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output_Try_2/scaffolds.reduced.fa
Redundans Output folder: http://owl.fish.washington.edu/Athaliana/20171004_redundans/
Redundans assembly (scaffolded assembly; FASTA): http://owl.fish.washington.edu/Athaliana/20171004_redundans/scaffolds.reduced.fa
QUAST Output folder (default settings): http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_21_50/
QUAST Output folder (--scaffolds option): http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_28_51/

Of note, Redundans didn’t find any alignments for the paired reads in any of the following BGI mate-pair Illumina data files:

160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz

First, I ran QUAST with the default settings:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_21_50/report.html

Using that Canu assembly with Redundans certainly seems to result in a better assembly.

Decided to run QUAST with the --scaffolds option to see what happened:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_28_51/report.html

The scaffolds with the “Ns” removed from them are appended with “_broken”, meaning the scaffolds were broken apart into contigs. Things are certainly cleaner when using the --scaffolds option. However, as far as I can tell, QUAST doesn’t actually generate a FASTA file with the “_broken” scaffolds!
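If a FASTA file of the broken scaffolds is needed, one option is to split the scaffolds at runs of Ns outside of QUAST. The following is only a minimal sketch of that idea and not what QUAST does internally: the minimum N-run length of 10, the function names, and the "_broken_1", "_broken_2" naming scheme are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def break_scaffold(name: str, seq: str, min_n_run: int = 10) -> List[Tuple[str, str]]:
    """Split one scaffold into contigs at runs of at least min_n_run Ns.

    The min_n_run default and the "_broken_<i>" suffix are assumptions.
    """
    pieces = [p for p in re.split(r"[Nn]{%d,}" % min_n_run, seq) if p]
    return [(f"{name}_broken_{i}", piece) for i, piece in enumerate(pieces, start=1)]


def write_broken_fasta(in_path: str, out_path: str, min_n_run: int = 10) -> int:
    """Write every broken contig to out_path; return the number written."""
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for name, seq in read_fasta(src):
            for contig_name, contig_seq in break_scaffold(name, seq, min_n_run):
                dst.write(f">{contig_name}\n{contig_seq}\n")
                written += 1
    return written
```

Something like write_broken_fasta("scaffolds.reduced.fa", "scaffolds.reduced.broken.fa") would then produce a contig-level FASTA. Runs of Ns shorter than the threshold are left in place.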

Genome Assembly – minimap/miniasm/racon Overview

0000-0002-2747-368X

Previously, I used three tools (minimap, miniasm, and racon) to do a quick assembly of our Olympia oyster PacBio data.

I’m just posting this quick overview to make it easier to follow what was actually done without having to read through three different notebook entries and corresponding Jupyter notebooks.

When I say “quick assembly”, I mean it. The entire assembly process probably takes about an hour on the computer I used – that seems fast.

Here’s the quick and dirty of what was done:

1 Run minimap:

This uses a pre-built set of defaults (the ava-pb preset in the code below) for finding overlaps among PacBio reads. Minimap takes a target file and a query file, and to find overlaps between all of the reads you need to map your FASTQ file against itself. So, if you have multiple FASTQ sequencing files, you have to concatenate them into a single file prior to running minimap.

minimap2 -x ava-pb -t 23 \
20170911_oly_pacbio_cat.fastq \
20170911_oly_pacbio_cat.fastq \
> 20170911_minimap2_pacbio_oly.paf
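The concatenation that has to happen before this step can be sketched as follows. This is a minimal example and not necessarily how the concatenated file above was made; the function name is hypothetical, and it assumes uncompressed FASTQ files that each end with a newline.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concatenate_fastq(inputs: Iterable[str], output: str) -> int:
    """Append each input FASTQ file to output, in order.

    Returns the total number of bytes copied. Assumes every input ends
    with a newline; otherwise the last record of one file would run
    into the first record of the next.
    """
    total = 0
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                # Stream the file in chunks so large files are not loaded into memory.
                shutil.copyfileobj(in_handle, out_handle)
            total += Path(path).stat().st_size
    return total
```

The returned byte count should equal the summed sizes of the input files, which gives a quick check that no file was left out.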

2 Run miniasm:

This uses your concatenated FASTQ file and the PAF file output from the minimap step. The code below is taken from the example provided in the miniasm documentation; there are other options available.

miniasm \
-f \
/home/data/20170911_oly_pacbio_cat.fastq /home/data/20170911_minimap2_pacbio_oly.paf > /home/data/20170918_oly_pacbio_miniasm_reads.gfa

3 Convert miniasm output GFA to FASTA

The FASTA file is needed to re-run minimap in Step 4 below.

awk '$1 ~/S/ {print ">"$2"\n"$3}' 20170918_oly_pacbio_miniasm_reads.gfa > 20170918_oly_pacbio_miniasm_reads.fasta
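For anyone who finds the awk one-liner opaque, here is roughly the same conversion written out in Python. It is a sketch with hypothetical function names; it keeps only GFA segment ("S") lines, where the second field is the segment name and the third is its sequence.

```python
from typing import Iterable, Iterator


def gfa_segments_to_fasta(gfa_lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record per GFA segment (S) line."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # S lines are: record type, segment name, sequence, optional tags.
        if len(fields) >= 3 and fields[0] == "S":
            yield f">{fields[1]}\n{fields[2]}\n"


def convert_gfa_file(gfa_path: str, fasta_path: str) -> int:
    """Write all segments from gfa_path to fasta_path; return the record count."""
    count = 0
    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
        for record in gfa_segments_to_fasta(gfa):
            fasta.write(record)
            count += 1
    return count
```

The one difference from the awk version is that this matches the record type exactly ("S") instead of matching any first field that contains an S.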

4 Run minimap with default settings

Using the default settings, this maps the FASTQ reads back to the contigs (the FASTA file) created in Step 3. These mappings are required by Racon to build the consensus (Step 5).

minimap2 \
-t 23 \
20170918_oly_pacbio_miniasm_reads.fasta \
20170911_oly_pacbio_cat.fastq \
> 20170918_minimap2_mapping_fasta_oly_pacbio.paf
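Before handing a PAF file to Racon, it can be worth a quick look at how many reads actually mapped to each contig. The sketch below parses the 12 mandatory PAF columns and counts distinct reads per target; the class and function names are hypothetical, and it was not part of the original pipeline.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns; optional tags are ignored."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


def reads_per_target(paf_lines: Iterable[str]) -> Counter:
    """Count the distinct reads (queries) mapped to each target sequence."""
    seen = set()
    counts = Counter()
    for line in paf_lines:
        if not line.strip():
            continue
        record = parse_paf_line(line)
        key = (record.query_name, record.target_name)
        if key not in seen:  # a read can have several hits on the same target
            seen.add(key)
            counts[record.target_name] += 1
    return counts
```

Passing an open file handle for the PAF file to reads_per_target() returns a Counter keyed by contig name. Contigs with no entry have no reads mapped to them.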

5 Run racon

The output file is the consensus FASTA file, listed as the last argument below.

racon -t 24 \
20170911_oly_pacbio_cat.fastq \
20170918_oly_pacbio_minimap_mappings.paf \
20170918_oly_pacbio_miniasm_assembly.gfa \
20170918_oly_pacbio_racon1_consensus.fasta

Assembly Comparisons – Olympia oyster genome assemblies

— UPDATE 20171009 —

Having run through this a bunch of times now, I realized that the analysis below incorrectly identifies the outputs from Sean’s Redundans runs. The correct output from each of those runs should be the “scaffolds.reduced.fa” FASTA files. The “contigs.fa” files that I linked to below are actually the assemblies produced by other programs, which are required as an input for Redundans.

I recently completed an assembly of the UW PacBio sequencing data using Racon and wanted some assembly stats, as well as a way to compare this assembly to the assemblies Sean had completed.

Additionally, Steven recently performed an assembly comparison and I noticed he got some odd results. Specifically, of the three assemblies he compared (PacBio x 1, Illumina x 2), both of the Illumina assemblies had a large quantity of “Ns” in the assemblies. This didn’t seem right and the comparison program he used (QUAST) spit out a message indicating that it seemed like scaffolds were used, instead of contigs. So, I thought I’d give it a shot and see if I could track down non-scaffolded assemblies produced by Sean.

Jupyter notebook (GitHub): 20171003_docker_oly_assembly_comparisons.ipynb

First, I compared the following six assemblies (FASTA files) using QUAST:

Sean’s Assemblies:

PacBio (Canu): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/oly_pacbio_.contigs.fasta
Illumina (Platanus): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Illumina_Platanus_Assembly/Oly_Out__contig.fa
Illumina (Platanus): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Platanus_Assembly_Kmer-22/Oly_Out__contig.fa
Illumina/PacBio (Redundans): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output/contigs.fa
Illumina/PacBio (Redundans): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output_Try_2/contigs.fa

Sam’s Assembly:

PacBio (Racon): http://owl.fish.washington.edu/Athaliana/201709_oly_pacbio_assembly_minimap_asm_racon/20170918_oly_pacbio_racon1_consensus.fasta

QUAST output directory: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/

Here’s the assembly comparison of all assemblies:

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_genome_assemblies/report.html

The first thing that jumps out to me is the fact that two of the Illumina assemblies, which used different assemblers(!!), have the EXACT same assembly stats. This occurrence seems extremely unlikely. I’ve double-checked my Jupyter notebook to make sure that I didn’t assign the same file by accident (see Input #6).

Very strange!
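One way to rule out the possibility that two of the downloaded FASTA files are byte-for-byte the same file is to compare checksums. This is a minimal sketch with hypothetical function names, not something that was run in the notebook.

```python
import hashlib
from typing import Dict, Iterable, List


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical_files(paths: Iterable[str]) -> List[List[str]]:
    """Return groups of two or more paths whose contents are identical."""
    groups: Dict[str, List[str]] = {}
    for path in paths:
        groups.setdefault(file_md5(path), []).append(path)
    return [group for group in groups.values() if len(group) > 1]
```

An empty result means that all of the files differ. Any group that is returned points to files with identical contents, which would explain identical assembly stats.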

I also noticed that the first Redundans assembly of Sean’s has a ton of “Ns”, suggesting that it’s actually a scaffolded assembly. As with Steven’s QUAST run, QUAST spits out a message suggesting the use of the “--scaffolds” option for this file.

The other thing I noticed is the two PacBio assemblies (Canu & Racon) have a huge difference in the total number of bp (~13,000,000)! I ran a QUAST assembly comparison between just those two for easier viewing/comparison (http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/):

Interactive version of that graphic is here: http://owl.fish.washington.edu/Athaliana/20171003_quast_oly_pacbio_assemblies/report.html

The fact that there is such a large discrepancy in the total number of bps between these two assemblies really leads me to believe that I am missing a FASTQ file from my assembly. I’m going to go back and see if that is indeed the case or if this difference in the assemblies is real.
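The headline numbers being compared here (total bp, N50, and the number of Ns) are simple to compute directly from a FASTA file. The sketch below is for illustration only and uses hypothetical function names; QUAST applies its own contig-length filtering, so its reported values may not match these exactly.

```python
from typing import Dict, Iterable, List, Tuple


def sequence_lengths(fasta_lines: Iterable[str]) -> Tuple[List[int], int]:
    """Return (list of sequence lengths, total count of N bases)."""
    lengths: List[int] = []
    n_count = 0
    current = None  # length of the record being read; None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
            n_count += line.upper().count("N")
    if current is not None:
        lengths.append(current)
    return lengths, n_count


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def assembly_stats(fasta_lines: Iterable[str]) -> Dict[str, int]:
    lengths, n_count = sequence_lengths(fasta_lines)
    return {
        "sequences": len(lengths),
        "total_bp": sum(lengths),
        "n50": n50(lengths),
        "n_bases": n_count,
    }
```

Running assembly_stats() on an open handle for each assembly gives numbers that can be compared side by side, and a large n_bases value is a quick hint that a file holds scaffolds rather than contigs.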

Samples Received – C.virginica gonad tissue from Katie Lotterhos

Received and stored @-80C in rack 8, row 5, column 5.

The following information was sent with the samples:

Sample.ID	Date	Temp	pCO2	Notes
031	26-Aug-2016	15	400
032	26-Aug-2016	15	400
033	26-Aug-2016	15	400
034	26-Aug-2016	15	400
035	26-Aug-2016	15	400	All sample sent; it will be in 2mL screw-cap vial
036	26-Aug-2016	15	400
103	26-Aug-2016	15	2800
104	26-Aug-2016	15	2800
105	26-Aug-2016	15	2800
106	26-Aug-2016	15	2800	All sample sent; it will be in 2mL screw-cap vial
108	26-Aug-2016	15	2800

Katie sent this additional info in an email to Steven and me:

These C. virginica samples were exposed to control (400, 6 samples) and OA (2800, 5 samples) conditions for ~4 weeks at 15C. Gonad was carefully extracted by peeling back the outer membrane, flash frozen in liquid N, and placed in -80C (until today when we removed it). During sampling, it was difficult to get a lot of what we considered “pure” gonadal tissue. We sent you ~1/2 of the amount of tissue we have for all samples except for the two samples which were very low and we sent you all the tissue sample we have. Each should be about 10-20 mg of tissue, which I’m worried is not enough for MBD-BS seq. Fingers crossed.

Goals – October 2017

I guess one of my primary goals is to make sure I actually write my monthly goals each month.

Is it bad that I’m writing goals about writing goals? Or, is it meta?

Regardless, I’m actually going to put a lot down on paper, as much has happened since my last set of goals were posted.

We had a “hack week” back in August. For us, “hacking” means organizing and updating lab documentation.

We took on the following tasks:

“Decommission” the LabDocs GitHub repo. This had been the canonical location for all of our online lab resources and had served as the starting point. However, it was not organized particularly well for what we were using it for, and was out of date in a number of places. Additionally, this is a personal GitHub repository of Steven’s and it didn’t make logical sense to use it as a dedicated lab repo.

As part of the decommission, we migrated all of the open issues (we used this great little web-based tool: Issue Mover for GitHub) to our organization’s GitHub repository: Roberts Lab @ SAFS.

A massive reorganization, updating, and cleansing of files. We now have separate repositories for our onboarding practices (including an official Lab Code of Conduct), laboratory resources, and code (https://github.com/RobertsLab/code). Wiki pages have been created for each of these repos, and readme files have been created/updated to improve instructions on how to locate needed information. Overall, we feel this makes it simpler for lab members to find the information they need.

Here’s a graphic of the amount of love that went into the old LabDocs repo since its inception (337,000 additions to files!):

Anyway, on to the current stuff.

Primary goal will be to perform a comparison of Olympia oyster genome assemblies.

Next will be to continue generating a joint assembly of Illumina and PacBio sequencing data for the Olympia oyster genome. This will take over from where Sean Bennett left off.

After that, writing my November 2017 goals…

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

This notebook entry just contains the racon execution. This produced this assembly:

http://owl.fish.washington.edu/Athaliana/201709_oly_pacbio_assembly_minimap_asm_racon/20170918_oly_pacbio_racon1_consensus.fasta

All intermediate files generated from this pipeline are here:

http://owl.fish.washington.edu/Athaliana/201709_oly_pacbio_assembly_minimap_asm_racon/

I’ll put together a TL;DR post that provides an overview of the pipeline and an assessment of the final assembly.

Previously ran minimap and then miniasm.

Jupyter Notebook (GitHub): 20170918_docker_pacbio_oly_racon0.5.0.ipynb

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

Previously, I ran the first part of the pipeline: minimap.

This notebook entry just contains the miniasm execution. Will follow with racon.

Jupyter Notebook (GitHub): 20170918_docker_pacbio_oly_miniasm0.2.ipynb

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

This notebook entry just contains the initial minimap execution. Followed up with miniasm and then racon.

Jupyter Notebook (GitHub): 20170907_docker_pacbio_oly_minimap2.ipynb

Samples Submitted – Geoduck Ctenidia to Illumina for 10x Genomics Sequencing

Continuing Illumina’s generous efforts to use our geoduck samples to test out the robustness of their emerging sequencing technologies, they have requested we send them some geoduck tissue so that they can try to complete the genome sequencing efforts using the 10x genomics sequencing platform.

I sent two frozen pieces (~28mg each) of geoduck ctenidia tissue on dry ice. Tissue was collected by Brent & Steven on 20150811.

FedEx tracking: 770129114978

Project Progress – Olympia Oyster Genome Assemblies by Sean Bennett

Here’s a brief overview of what Sean has done on the Oly genome assembly front.

Metassembler

Assemble his BGI assembly and Platanus assembly? Confusing terms here; not sure what he means.
Failed due to 32-bit vs. 64-bit installation of MUMmer. He didn’t have the chance to re-compile MUMmer as 64-bit. However, a recent MUMmer announcement suggests that MUMmer can now handle genomes of unlimited size.
I believe he was planning on using (or was using?) GARM, which relies upon MUMmer and may also include a version of MUMmer (outdated version that led to Sean’s error message?).
Notebook entry

Canu

Assemble UW PacBio data (filenames beginning with m170211, m170315, m170308, and m170301).
Files (including Mox scripts, Pilon contig polishing, & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/
Notebook entry

Redundans

Assembled raw Illumina reads provided by BGI (filenames beginning with 151114 and 160103) & UW PacBio data (filenames beginning with m170211, m170315, m170308, and m170301).
Ran this two times.
First run
- Files (does NOT include Mox scripts!) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output/
- Notebook entry
Second run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output_Try_2/
- Notebook entry

Platanus

Assembled raw Illumina reads provided by BGI (beginning with 151114 and 160103).
Ran this two times.
First run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Illumina_Platanus_Assembly/
- Notebook entry
Second run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Platanus_Assembly_Kmer-22/
- Notebook entry

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
