
Assembly Comparisons – Oly Assemblies Using Quast

I ran Quast to compare all of our current Olympia oyster genome assemblies.

See Jupyter Notebook in Results section for Quast execution.

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/quast_results/results_2018_01_16_10_08_35/

Heatmapped table of results: http://owl.fish.washington.edu/Athaliana/quast_results/results_2018_01_16_10_08_35/report.html

Very enlightening!

After all the difficulties with PB Jelly, it has produced the most large contigs. However, it does also have the highest quantity and rate of N’s of all the assemblies produced to date.

BEST OF:

# contigs (>= 50000 bp): pbjelly_sjw_01 (894)
Largest Contig: redundans_sjw_02 (322,397bp)
Total Length: pbjelly_sjw_01 (1,180,563,613bp)
Total Length (>=50,000bp): pbjelly_sjw_01 (57,741,906bp)
N50: redundans_sjw_03 (17,679bp)
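For reference, N50 and L50 as reported in these comparisons can be computed from a list of contig lengths. The snippet below is a minimal sketch of the usual definition, not Quast's own code; the function name and the toy lengths are made up for illustration.

```python
def n50_l50(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths; fraction=0.5 gives N50/L50.

    Nx is the length of the contig at which the running total (largest
    contigs first) reaches the given fraction of the assembly; Lx is how
    many contigs it took to get there.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count


# Toy example with made-up contig lengths (not from any of the assemblies).
toy_lengths = [100, 80, 60, 40, 20]
print(n50_l50(toy_lengths))        # N50 / L50
print(n50_l50(toy_lengths, 0.75))  # N75 / L75
```

Passing `fraction=0.75` gives the N75/L75 pair that Quast also reports.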

Jupyter Notebook (GitHub): 20180116_swoose_oly_assembly_comparisons_quast.ipynb

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

After another attempt to fix PB Jelly, I ran it again.

We’ll see how it goes this time…

Re-ran this using the BGI Illumina scaffolds FASTA.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171130_oly_pbjelly/

Output FASTA file: http://owl.fish.washington.edu/Athaliana/20171130_oly_pbjelly/jelly.out.fasta

Quast assessment of output FASTA:

Assembly                      jelly.out
# contigs (>= 0 bp)           696946
# contigs (>= 1000 bp)        159429
# contigs (>= 5000 bp)        68750
# contigs (>= 10000 bp)       35320
# contigs (>= 25000 bp)       7048
# contigs (>= 50000 bp)       894
Total length (>= 0 bp)        1253001795
Total length (>= 1000 bp)     1140787867
Total length (>= 5000 bp)     932263178
Total length (>= 10000 bp)    691523275
Total length (>= 25000 bp)    261425921
Total length (>= 50000 bp)    57741906
# contigs                     213264
Largest contig                194507
Total length                  1180563613
GC (%)                        36.57
N50                           12433
N75                           5983
L50                           26241
L75                           60202
# N’s per 100 kbp             6580.58
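To put the N rate in perspective, here is a quick back-of-the-envelope conversion of the "# N’s per 100 kbp" value into a fraction of the assembly. It assumes Quast computes that rate over the same set of contigs as the "Total length" row, which I have not verified.

```python
# Values copied from the Quast table above.
total_length_bp = 1_180_563_613
ns_per_100_kbp = 6580.58

# Assumption: the N rate applies to the same contigs counted in "Total length".
n_fraction = ns_per_100_kbp / 100_000
estimated_n_bases = n_fraction * total_length_bp

print(f"{n_fraction:.2%} of bases are N (~{estimated_n_bases / 1e6:.1f} Mbp)")
```

So roughly 6.6% of this assembly is N's.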

Have added this assembly to our Olympia oyster genome assemblies table.

This took an insanely long time to complete (nearly six weeks)!!! After some internet searching, I’ve found a potential solution to this and have initiated another PB Jelly run to see if it will run faster. Regardless, it’ll be interesting to see how the results compare from two independent runs of PB Jelly.

Jupyter Notebook (GitHub): 20171130_emu_pbjelly.ipynb

Assembly Comparison – Oly Assemblies Using Quast

I ran Quast to compare all of our current Olympia oyster genome assemblies.

See Jupyter Notebook in Results section for Quast execution.

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_14_12_30_25/

Heatmapped table of results: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_14_12_30_25/report.html

Very enlightening!

BEST OF:

Largest Contig: redundans_sjw_02 (322,397bp)
Total Length: soap_bgi_01 & pbjelly_sjw_01 (697,528,655bp)
Total Length (>=50,000bp): redundans_sjw_03 (17,006,058bp)
N50: redundans_sjw_03 (17,679bp)

Interesting tidbit: The pbjelly_sjw_01 assembly is EXACTLY the same as the soap_bgi_01. Looking at the output messages from that PB Jelly assembly, one can see why. The messages indicate that no gaps were filled on the BGI scaffold reference! That means the PB Jelly output is just the BGI scaffold reference assembly!
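A quick way to confirm that two assemblies really contain the same sequences, without relying on matching Quast stats, is to compare digests of the sequences themselves. This is a minimal sketch; the function names are hypothetical, and it deliberately ignores headers, line wrapping, case, and record order.

```python
import hashlib


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_digest(path):
    """Digest of all sequences, ignoring headers, wrapping, case, and order."""
    per_record = sorted(
        hashlib.sha256(seq.upper().encode()).hexdigest()
        for _, seq in read_fasta(path)
    )
    return hashlib.sha256("".join(per_record).encode()).hexdigest()


def same_sequences(path_a, path_b):
    """True if both FASTA files hold the same collection of sequences."""
    return sequence_digest(path_a) == sequence_digest(path_b)
```

Calling `same_sequences()` on the PB Jelly output FASTA and the BGI scaffold FASTA would return `True` if no gaps were filled.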

Jupyter Notebook (GitHub): 20171114_swoose_oly_assembly_comparisons_quast.ipynb

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

Yesterday, I ran PB Jelly using Sean’s Platanus assembly, but that didn’t produce an assembly because PB Jelly was expecting gaps in the Illumina reference assembly (i.e. scaffolds, not contigs).

Re-ran this using the BGI Illumina scaffolds FASTA.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171114_oly_pbjelly/

Output FASTA file: http://owl.fish.washington.edu/Athaliana/20171114_oly_pbjelly/jelly.out.fasta

OK! This seems to have worked (and it was quick, like less than an hour!), as it actually produced a FASTA file! Will run QUAST with this and some assemblies to compare assembly stats. Have added this assembly to our Olympia oyster genome assemblies table.

Jupyter Notebook (GitHub): 20171114_emu_pbjelly_BGI_scaffold.ipynb

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/Platanus Assembly

Sean had previously attempted to run PB Jelly, but ran into some issues running on Hyak, so I decided to try this on Emu.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171113_oly_pbjelly/

This completed very quickly (like, just a couple of hours). I also didn’t experience the woes of multimillion temp file production that killed Sean’s attempt at running this on Mox (Hyak).

However, it doesn’t seem to have produced an assembly!

Looking through the output, it seems as though it didn’t produce an assembly because there weren’t any gaps to fill in the reference. This makes sense (in regards to the lack of gaps in the reference Illumina assembly) because I used the Platanus contig FASTA file (i.e. not a scaffolds file). I didn’t realize PB Jelly was just designed for gap filling. Guess I’ll give this another go using the BGI scaffold FASTA file and see what we get.
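Since PB Jelly only has work to do when the reference contains gaps, it is cheap to check a FASTA for runs of N's before starting a run. The following is a rough sketch of such a check; the function name is hypothetical and it treats any run of one or more N's as a gap.

```python
import re

GAP_RUN = re.compile(r"[Nn]+")


def count_gaps(fasta_path):
    """Return (number_of_N_runs, total_N_bases) across a FASTA file."""
    gap_runs = 0
    gap_bases = 0
    chunks = []

    def flush():
        # Join the wrapped lines of one record so runs spanning lines count once.
        nonlocal gap_runs, gap_bases
        for match in GAP_RUN.finditer("".join(chunks)):
            gap_runs += 1
            gap_bases += len(match.group())
        chunks.clear()

    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                flush()
            else:
                chunks.append(line.strip())
    flush()
    return gap_runs, gap_bases
```

A contig-only file like the Platanus contigs would be expected to return `(0, 0)`, or close to it.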

Jupyter Notebook (GitHub): 20171113_emu_pbjelly_22mer_plat.ipynb

Software Crash – Olympia oyster genome assembly with Masurca on Mox

Ah, the joys of bioinformatics. I just received an email from Mox indicating that the Masurca assembly I started 11 DAYS AGO (!!) crashed.

I’m probably not going to put much effort into figuring out what went wrong, but here are some log file snippets for reference. I’ll probably drop a line to the developers to see if they have any easy ways to address whatever caused the problems, but that’s about as much effort as I’m willing to put into troubleshooting this assembly.

Additionally, since this crashed, I’m not going to bother moving any of the files off of Mox. That means they will be deleted automatically by the system around Nov. 9th, 2017.


slurm-94620.out (tail)

compute_psa 6601202 2632582819
Refining alignments
Joining
Generating assembly input files
Coverage of the mega-reads less than 5 -- using the super reads as well
Coverage threshold for splitting unitigs is 138 minimum ovl 63
Running assembly
/gscratch/srlab/programs/MaSuRCA-3.2.3/bin/deduplicate_unitigs.sh: line 85: 24330 Aborted                 (core dumped) overlapStoreBuild -o $ASM_DIR/$ASM_PREFIX.ovlStore -M 65536 -g $ASM_DIR/$ASM_PREFIX.gkpStore $ASM_DIR/overlaps_dedup.ovb.gz > $ASM_DIR/overlapStore.rebuild.err 2>&1
Assembly stopped or failed, see CA.mr.41.15.17.0.029.log
[Mon Oct 30 23:19:37 PDT 2017] Assembly stopped or failed, see CA.mr.41.15.17.0.029.log

CA.mr.41.15.17.0.029.log (tail)

number of threads     = 28 (OpenMP default)

ERROR:  overlapStore '/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly/CA.mr.41.15.17.0.029/genome.ovlStore' is incomplete; previous overlapStoreBuild probably crashed.

----------------------------------------
Failure message:

failed to unitig

overlapStore.rebuild.err

Scanning overlap files to count the number of overlaps.
Found 277.972 million overlaps.
Memory limit 65536MB supplied.  Ill put 3246167525 IIDs (3435.97 million overlaps) into each of 1 buckets.
bucketizing CA.mr.41.15.17.0.029/overlaps_dedup.ovb.gz
bucketizing DONE!
overlaps skipped:
               0 OBT - low quality
               0 DUP - non-duplicate overlap
               0 DUP - different library
               0 DUP - dedup not requested
terminate called after throwing an instance of std::bad_alloc
  what():  std::bad_alloc

Failed with Aborted

Backtrace (mangled):

overlapStoreBuild[0x40523a]
/usr/lib64/libpthread.so.0(+0xf100)[0x2af83b3c0100]
/usr/lib64/libc.so.6(gsignal+0x37)[0x2af83c0395f7]
/usr/lib64/libc.so.6(abort+0x148)[0x2af83c03ace8]
/usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x2af83b62d9d5]
/usr/lib64/libstdc++.so.6(+0x5e946)[0x2af83b62b946]
/usr/lib64/libstdc++.so.6(+0x5e973)[0x2af83b62b973]
/usr/lib64/libstdc++.so.6(+0x5eb93)[0x2af83b62bb93]
/usr/lib64/libstdc++.so.6(_Znwm+0x7d)[0x2af83b62c12d]
/usr/lib64/libstdc++.so.6(_Znam+0x9)[0x2af83b62c1c9]
overlapStoreBuild[0x402e10]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2af83c025b15]
overlapStoreBuild[0x403089]

Backtrace (demangled):

[0] overlapStoreBuild() [0x40523a]
[1] /usr/lib64/libpthread.so.0::(null) + 0xf100  [0x2af83b3c0100]
[2] /usr/lib64/libc.so.6::(null) + 0x37  [0x2af83c0395f7]
[3] /usr/lib64/libc.so.6::(null) + 0x148  [0x2af83c03ace8]
[4] /usr/lib64/libstdc++.so.6::__gnu_cxx::__verbose_terminate_handler() + 0x165  [0x2af83b62d9d5]
[5] /usr/lib64/libstdc++.so.6::(null) + 0x5e946  [0x2af83b62b946]
[6] /usr/lib64/libstdc++.so.6::(null) + 0x5e973  [0x2af83b62b973]
[7] /usr/lib64/libstdc++.so.6::(null) + 0x5eb93  [0x2af83b62bb93]
[8] /usr/lib64/libstdc++.so.6::operator new(unsigned long) + 0x7d  [0x2af83b62c12d]
[9] /usr/lib64/libstdc++.so.6::operator new[](unsigned long) + 0x9  [0x2af83b62c1c9]
[10] overlapStoreBuild() [0x402e10]
[11] /usr/lib64/libc.so.6::(null) + 0xf5  [0x2af83c025b15]
[12] overlapStoreBuild() [0x403089]

GDB:

Genome Assembly – Olympia oyster Illumina & PacBio Reads Using Redundans

Had problems with Docker and Jupyter Notebook inexplicably dying and deleting all the files in the working directory of the Jupyter Notebook (which also happened to be the volume mounted in the Docker container).

So, I ran this on my computer, but didn’t have Jupyter installed (yet).

This utilized the Canu contigs file (FASTA) that I generated on 20171018.

Here’s the input command:

sudo python /home/sam/software/redundans/redundans.py \
-t 24 \
-l \
m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq.gz \
m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz \
m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz \
m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz \
m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz \
m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz \
m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz \
m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz \
m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz \
m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz \
m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz \
-i \
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz \
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz \
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz \
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz \
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz \
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz \
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz \
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz \
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz \
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz \
-f 20171018_oly_pacbio.contigs.fasta \
-o /home/data/20171024_docker_oly_redundans_01/
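With this many input files, it can be less error-prone to build the argument list in code than to paste one giant line. Below is a sketch that uses only the flags visible in the command above (`-t`, `-l`, `-i`, `-f`, `-o`); the helper name and the shortened file names are hypothetical, and the even-count check on the Illumina files is my own assumption about how the `_1`/`_2` files should be supplied.

```python
import shlex


def build_redundans_command(long_reads, illumina_reads, reference_fasta,
                            out_dir, threads=24, script="redundans.py"):
    """Assemble the argument list using only the flags shown in the command."""
    if len(illumina_reads) % 2 != 0:
        # Assumption: Illumina files are supplied as _1/_2 pairs.
        raise ValueError("expected Illumina files in _1/_2 pairs")
    return [
        "python", script,
        "-t", str(threads),
        "-l", *long_reads,
        "-i", *illumina_reads,
        "-f", reference_fasta,
        "-o", out_dir,
    ]


# Hypothetical, shortened file names purely for illustration.
command = build_redundans_command(
    long_reads=["pacbio_a.fastq.gz", "pacbio_b.fastq.gz"],
    illumina_reads=["lib1_1.fq.gz", "lib1_2.fq.gz"],
    reference_fasta="contigs.fasta",
    out_dir="redundans_out/",
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the real file names are filled in.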

This completed in just over 19hrs.

Copied output files to Owl: http://owl.fish.washington.edu/Athaliana/20171024_docker_oly_redundans_01/

Here’s the desired output file (FASTA): scaffolds.reduced.fa

Will add to our genome assemblies table.

Ran Quast on 20171103 for some assembly stats.

Quast output is here: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_03_22_43_06/

Genome Assembly – Olympia oyster Redundans/Canu vs. Redundans/Racon

Decided to compare Redundans using the Canu assembly as the reference against Redundans using the Racon assembly as the reference. Both reference assemblies were built from just our PacBio data.

Jupyter notebook (GitHub): 20171005_docker_oly_redundans.ipynb

Notebook is also embedded at the end of this post.

Results:

It should be noted that the paired reads from each of the BGI mate-pair Illumina data sets did not assemble, just like the last time I used them:

  • 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz

Redundans with Canu is better, suggesting that the Canu assembly is the better of the two PacBio assemblies (which we had already suspected).

QUAST comparison using default settings:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_06_22_21_06/report.html

QUAST comparison using --scaffolds setting:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_06_22_27_26/report.html

Genome Assembly – minimap/miniasm/racon Overview

Previously, I used three tools (minimap, miniasm, and racon) to do a quick assembly of our Olympia oyster PacBio data.

I’m just posting this quick overview to make it easier to follow what was actually done without having to read through three different notebook entries and corresponding Jupyter notebooks.

When I say “quick assembly”, I mean it. The entire assembly process probably takes about an hour on the computer I used – that seems fast.

Here’s the quick and dirty of what was done:

1 Run minimap:

This uses a pre-built set of defaults (the ava-pb in the code below) for analyzing PacBio data. Minimap only accepts two FASTQ files and you need to map your FASTQ file against itself. So, if you have multiple FASTQ sequencing files, you have to concatenate them into a single file prior to running minimap.

minimap2 -x ava-pb -t 23 \
20170911_oly_pacbio_cat.fastq \
20170911_oly_pacbio_cat.fastq \
> 20170911_minimap2_pacbio_oly.paf
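The concatenation mentioned in this step can be done with `cat`, or from within a notebook. Here is a minimal Python sketch, assuming uncompressed FASTQ inputs that each end with a newline; the function name is hypothetical.

```python
import shutil


def concatenate_files(input_paths, output_path):
    """Append each input file to output_path, in order, streaming as bytes."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # copyfileobj streams in chunks, so large FASTQ files are
                # never loaded into memory all at once.
                shutil.copyfileobj(src, out)
```

The single output file is then given to minimap twice, as in the command above.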

2 Run miniasm:

This uses your concatenated FASTQ file and the PAF file output from the minimap step. The code below is taken from the example provided in the miniasm documentation; there are other options available.

miniasm \
-f \
/home/data/20170911_oly_pacbio_cat.fastq /home/data/20170911_minimap2_pacbio_oly.paf > /home/data/20170918_oly_pacbio_miniasm_reads.gfa

3 Convert miniasm output GFA to FASTA

The FASTA file is needed to re-run minimap in Step 4 below.

awk '$1 ~/S/ {print ">"$2"\n"$3}' 20170918_oly_pacbio_miniasm_reads.gfa > 20170918_oly_pacbio_miniasm_reads.fasta
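For anyone who would rather do this conversion in Python, here is a rough equivalent of the awk one-liner. It is a sketch rather than a drop-in replacement: it splits on tabs and requires the record type to be exactly `S`, whereas the awk version splits on any whitespace and matches any first field containing an S. The function name is hypothetical.

```python
def gfa_segments_to_fasta(gfa_path, fasta_path):
    """Write each GFA segment (S) line as a FASTA record; return the count."""
    count = 0
    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
        for line in gfa:
            fields = line.rstrip("\n").split("\t")
            # S lines carry the segment name in column 2, sequence in column 3.
            if fields[0] == "S" and len(fields) >= 3:
                fasta.write(f">{fields[1]}\n{fields[2]}\n")
                count += 1
    return count
```

The return value gives a quick count of how many contigs were written.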

4 Run minimap with default settings

Using the default settings maps the FASTQ reads back to the contigs (the PAF file) created in the first step. These mappings are required for Racon assembly (Step 5).

minimap2 \
-t 23 \
20170918_oly_pacbio_miniasm_reads.fasta 20170905_minimap2_pacibio_oly.paf > 20170918_minimap2_mapping_fasta_oly_pacbio.paf
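To sanity-check a PAF file before handing it to the next tool, the twelve mandatory tab-separated columns can be parsed with a few lines of Python. This is a minimal sketch based on the PAF column layout; the class, field names, and the example line are made up for illustration.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line):
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(f)}")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


# Made-up example line, not taken from the real output.
example = "read1\t1000\t10\t900\t+\tutg1\t1200\t100\t1000\t800\t900\t60\ttp:A:P"
record = parse_paf_line(example)
print(record.query_name, record.target_name,
      record.matches / record.block_length)
```

Looping this over the first few lines of a PAF file is a fast way to spot an empty or malformed mapping file.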

5 Run racon

The output file is the FASTA file listed below.

racon -t 24 \
20170911_oly_pacbio_cat.fastq \
20170918_oly_pacbio_minimap_mappings.paf \
20170918_oly_pacbio_miniasm_assembly.gfa \
20170918_oly_pacbio_racon1_consensus.fasta

Data Received – Initial Geoduck Genome Assembly from BGI

The initial assembly of the Panopea generosa (geoduck) genome is available from BGI. Currently, we’ve stashed it here:

http://owl.fish.washington.edu/P_generosa_genome_assemblies_BGI/20160314/

The data provided consisted of the following three files:

  • md5.txt
  • N50.txt
  • scaffold.fa.fill

md5.txt – Checksum file to verify integrity of files after downloading.

N50.txt – Contains some very limited stats on scaffolds provided.

scaffold.fa.fill – A FASTA file of scaffolds. Since these are scaffolds (and NOT contigs!), there are many regions containing NNNNNN’s that have been put in place for scaffold assembly based on paired-end spatial information. As such, the N50 information is not as useful as it would be if these were contigs.
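To get contig-level numbers out of a scaffold file, one option is to split each scaffold at its runs of N's and compute stats on the resulting pieces. The snippet below is a small sketch of that idea; the function name, the `min_gap` parameter, and the toy sequence are hypothetical.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffold at runs of >= min_gap N's; return the contig pieces."""
    pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in pattern.split(sequence) if piece]


# Toy scaffold, not real data.
toy_scaffold = "ACGTACGTNNNNNNACGTNNACG"
contigs = split_scaffold(toy_scaffold)
print(len(toy_scaffold), [len(c) for c in contigs])
```

Raising `min_gap` keeps short N runs inside a contig instead of splitting on them, which matters if single N's represent ambiguous bases rather than scaffold gaps.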

Additional assemblies will be provided at some point. I’ve emailed BGI about what we should expect from this initial assembly and what subsequent assemblies should look like.
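For verifying the download against md5.txt, a short Python check could look like the following. It assumes md5.txt uses the common `md5sum` layout (checksum, whitespace, file name), which I have not confirmed for this particular file; the function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_listing(listing_path, base_dir="."):
    """Return {filename: True/False} for each '<md5> <filename>' line."""
    results = {}
    with open(listing_path) as listing:
        for line in listing:
            parts = line.strip().split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected, name = parts[0].lower(), parts[1].lstrip("*")
            actual = md5_of_file(os.path.join(base_dir, name))
            results[name] = (actual == expected)
    return results
```

Any `False` value in the returned dictionary means that file should be downloaded again.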