Tag Archives: PacBio

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

After another attempt to fix PB Jelly, I ran it again.

We’ll see how it goes this time…

Re-ran this using the BGI Illumina scaffolds FASTA.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171130_oly_pbjelly/

Output FASTA file: http://owl.fish.washington.edu/Athaliana/20171130_oly_pbjelly/jelly.out.fasta

Quast assessment of output FASTA:

Assembly jelly.out
# contigs (>= 0 bp) 696946
# contigs (>= 1000 bp) 159429
# contigs (>= 5000 bp) 68750
# contigs (>= 10000 bp) 35320
# contigs (>= 25000 bp) 7048
# contigs (>= 50000 bp) 894
Total length (>= 0 bp) 1253001795
Total length (>= 1000 bp) 1140787867
Total length (>= 5000 bp) 932263178
Total length (>= 10000 bp) 691523275
Total length (>= 25000 bp) 261425921
Total length (>= 50000 bp) 57741906
# contigs 213264
Largest contig 194507
Total length 1180563613
GC (%) 36.57
N50 12433
N75 5983
L50 26241
L75 60202
# N’s per 100 kbp 6580.58

Have added this assembly to our Olympia oyster genome assemblies table.

This took an insanely long time to complete (nearly six weeks)!!! After some internet searching, I’ve found a pontential solution to this and have initiated another PB Jelly run to see if it will run faster. Regardless, it’ll be interesting to see how the results compare from two independent runs of PB Jelly.

Jupyter Notebook (GitHub): 20171130_emu_pbjelly.ipynb

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/BGI Scaffold Assembly

Yesterday, I ran PB Jelly using Sean’s Platanus assembly, but that didn’t produce an assembly because PB Jelly was expecting gaps in the Illumina reference assembly (i.e. scaffolds, not contigs).

Re-ran this using the BGI Illumina scaffolds FASTA.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171114_oly_pbjelly/

Output FASTA file: http://owl.fish.washington.edu/Athaliana/20171114_oly_pbjelly/jelly.out.fasta

OK! This seems to have worked (and it was quick, like less than an hour!), as it actually produced a FASTA file! Will run QUAST with this and some assemblies to compare assembly stats. Have added this assembly to our Olympia oyster genome assemblies table.

Jupyter Notebook (GitHub): 20171114_emu_pbjelly_BGI_scaffold.ipynb

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/Platanus Assembly

Sean had previously attempted to run PB Jelly, but ran into some issues running on Hyak, so I decided to try this on Emu.

Here’s a brief rundown of how this was run:

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171113_oly_pbjelly/

This completed very quickly (like, just a couple of hours). I also didn’t experience the woes of multimillion temp file production that killed Sean’s attempt at running this on Mox (Hyak).

However, it doesn’t seem to have produced an assembly!

Looking through the output, it seems as though it didn’t produce an assembly because there weren’t any gaps to fill in the reference. This makes sense (in regards to the lack of gaps in the reference Illumina assembly) because I used the Platanus contig FASTA file (i.e. not a scaffolds file). I didn’t realize PB Jelly was just designed for gap filling. Guess I’ll give this another go using the BGI scaffold FASTA file and see what we get.

Jupyter Notebook (GitHub): 20171113_emu_pbjelly_22mer_plat.ipynb

Software Installation – PB Jelly Suite and Blasr on Emu

I followed along with what Sean previously did when installing on Emu, but it appears he didn’t install it in the shared location to make it accessible to all users. So, I’m installing it in the /home/shared/ directory.

First, I need to install legacy blasr from PacBio:

Installed in

cd /home/shared
git clone https://github.com/PacificBiosciences/pitchfork.git
cd pitchfork
git checkout legacy_blasr
make init PREFIX=/home/shared
make blasr  PREFIX=/home/shared

Ran into this error:

make[1]: Leaving directory '/home/shared/pitchfork/ports/thirdparty/zlib'
make -C ports/thirdparty/hdf5 do-install
make[1]: Entering directory '/home/shared/pitchfork/ports/thirdparty/hdf5'
/home/shared/pitchfork/bin/pitchfork fetch --url https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8.16/src/hdf5-1.8.16.tar.gz
fetching https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8.16/src/hdf5-1.8.16.tar.gz
tar zxf hdf5-1.8.16.tar.gz -C /home/shared/pitchfork/workspace

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Makefile:23: recipe for target '/home/shared/pitchfork/workspace/hdf5-1.8.16' failed
make[1]: *** [/home/shared/pitchfork/workspace/hdf5-1.8.16] Error 2
make[1]: Leaving directory '/home/shared/pitchfork/ports/thirdparty/hdf5'
Makefile:211: recipe for target 'hdf5' failed
make: *** [hdf5] Error 2

Luckily, I came across this GitHub Issue that addresses this exact problem.

I found the functional URL and downloaded the hdf5-1.8.16.tar.gz file to pitchfork/ports/thirdparty/hdf5. Re-ran make blasr PREFIX=/home/shared and things proceeded without issue. As Sean noted, this part takes a long time.

Load the setup-env.sh (this is located here: /home/shared/pitchfork/setup-env.sh

source setup-env.sh

Blasr install is complete!

Then, install networkx v1.1, per the PB Jelly documentation:

python pip -m install networkx==1.1

On to PB Jelly!

Edited the setup.sh file and entered in the path to the PB Jelly install on Emu (/home/shared/PBSuite_15.8.24/):

#/bin/bash

#If you use a virtual env - source it here
#source /hgsc_software/PBSuite/pbsuiteVirtualEnv/bin/activate

#This is the path where you've install the suite.
export SWEETPATH=/home/shared/PBSuite_15.8.24/
#for python modules 
export PYTHONPATH=$PYTHONPATH:$SWEETPATH
#for executables 
export PATH=$PATH:$SWEETPATH/bin/

Test it out with the test data:

  1. Edit the following file to reflect the paths on Emu to find this test data: /home/shared/PBSuite_15.8.24/docs/jellyExample/Protocol.xml

<jellyProtocol>
    <reference>/home/shared/PBSuite_15.8.24/docs/jellyExample/data/reference/lambda.fasta</reference>
    <outputDir>/home/shared/PBSuite_15.8.24/docs/jellyExample/</outputDir>
    <blasr>-minMatch 8 -minPctIdentity 70 -bestn 1 -nCandidates 20 -maxScore -500 -nproc 4 -noSplitSubreads</blasr>
    <input baseDir="/home/shared/PBSuite_15.8.24/docs/jellyExample/data/reads/">
        <job>filtered_subreads.fastq</job>
    </input>
</jellyProtocol>

I went through all the stages of the test data and got through it successfully. Seems ready to roll!

Genome Assembly – Olympia oyster Illumina & PacBio Reads Using Redundans

Had problems with Docker and Jupyter Notebook inexplicably dying and deleting all the files in the working directory of the Jupyter Notebook (which also happened to be the volume mounted in the Docker container).

So, I ran this on my computer, but didn’t have Jupyter installed (yet).

This utilized the Canu contigs file (FASTA) that I generated on 20171018.

Here’s the input command:

sudo python /home/sam/software/redundans/redundans.py -t 24 -l m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq.gz m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz -i 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz -f 20171018_oly_pacbio.contigs.fasta -o /home/data/20171024_docker_oly_redundans_01/

This completed in just over 19hrs.

Copied output files to Owl: http://owl.fish.washington.edu/Athaliana/20171024_docker_oly_redundans_01/

Here’s the desired output file (FASTA): scaffolds.reduced.fa

Will add to our genome assemblies table.

Ran Quast on 20171103 for some assembly stats.

Quast output is here: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_03_22_43_06/

Genome Assembly – Olympia oyster Illumina & PacBio reads using MaSuRCA

UPDATE 20171031 – This crashed. No plans to troubleshoot.

Ran this on Mox (hyak) node.

Create single PacBio FASTA file:

list of PacBio files:

m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz
m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz
m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz
m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz
m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz
m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz
m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz

Concatenate GZIPPED files:

$cat *.gz > pacbio_cat.fastq.gz

Convert FASTQ gzip to FASTA:

$zcat pacbio_cat.fastq.gz | awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' > oly_pacbio_concatentated.fa

Convert and concatenate single non-gzipped FASTQ file:

$awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq >> oly_pacbio_concatentated.fa
$cat *.gz > pacbio_cat.fastq.gz

GUNZIP Illumina GZIPPED FASTQ

$for i in *.gz; do gunzip < "$i" > ${i%%.gz}; done

Determine mean read length and standard deviation from Illumina FASTQ files (needed for MaSuRCA config file) (found code below from this Biostars thread:

$for i in *.fq; do echo "$i   "; awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}' $i; done

Output:

151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq   
total 61253141 avg=150.000000 stddev=0.000000
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq   
total 61253141 avg=150.000000 stddev=0.000000
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq   
total 58755925 avg=150.000000 stddev=0.000000
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq   
total 58755925 avg=150.000000 stddev=0.000000
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq   
total 43938762 avg=150.000000 stddev=0.000000
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq   
total 43938762 avg=150.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq   
total 87198584 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq   
total 87198584 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq   
total 43766527 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq   
total 43766527 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq   
total 87135961 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq   
total 87135961 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq   
total 43138844 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq   
total 43138844 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq   
total 95270180 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq   
total 95270180 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq   
total 92524416 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq   
total 92524416 avg=49.000000 stddev=0.000000

Here’s the config file I’m using:

/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly/20171019_masurca_oly_config.txt

# example configuration file

# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:
# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads
# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be
# innies, i.e. --->.<---, and JUMP are assumed to be outties
# <---.--->. If there are any jump libraries that are innies, such as
# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads
# are optional for PE libraries and mandatory for JUMP libraries. Any
# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first
# converted into Celera Assembler compatible .frg files (see
# http://wgs-assembler.sourceforge.com)
DATA
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq

JUMP= sh 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq
JUMP= sh 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq
JUMP= sh 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
PACBIO=/gscratch/scrubbed/samwhite/O_lurida/oly_pacbio/oly_pacbio_concatentated.fa
END

PARAMETERS
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 1
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used.  one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 28
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes
SOAP_ASSEMBLY=0
END

Execute the masurca script to generate assembly script (on Mox login node):

$cd /gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly
$/gscratch/srlab/programs/MaSuRCA-3.2.3/bin/masurca 20171019_masurca_oly_config.txt

Got this error:

I’ll edit config file to have standard deviations == 1 and try again.

Got another error:

I’ll edit config file to have different two letter prefix assignments…

It worked!

Final version of config:

# example configuration file

# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:
# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads
# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be
# innies, i.e. --->.<---, and JUMP are assumed to be outties
# <---.--->. If there are any jump libraries that are innies, such as
# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads
# are optional for PE libraries and mandatory for JUMP libraries. Any
# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first
# converted into Celera Assembler compatible .frg files (see
# http://wgs-assembler.sourceforge.com)
DATA
PE= aa 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq
PE= ab 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq
PE= ac 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq
PE= ad 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq
PE= ae 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq
PE= af 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq

JUMP= ba 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq
JUMP= bb 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq
JUMP= bc 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
PACBIO=/gscratch/scrubbed/samwhite/O_lurida/oly_pacbio/oly_pacbio_concatentated.fa
END

PARAMETERS
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 1
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used.  one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 28
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes
SOAP_ASSEMBLY=0
END

Submitted the job to Mox using the following command:

sbatch -p srlab -A srlab 20171019_masurca_oly_assembly.sh

Here’s the sbatch script used for the job:

#!/bin/bash
## Job Name
#SBATCH --job-name=20171019_masurca_oly
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly
/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly/assemble.sh

Genome Assembly – Olympia oyster PacBio Canu v1.6

I decided to run Canu myself, since documentation for Sean’s Canu run is a bit lacking. Additionally, it looks like he specified a genome size of 500Mbp, which is probably too small. For this assembly, I set the genome size to 1.9Gbp (based on the info in the BGI assembly report, using 17-mers for calculating genome size), which is probably on the large size.

Additionally, I remembered we had an old PacBio run that we had been forgetting about and thought it would be nice to have incorporated into an assembly.

See all the messy details of this in the Jupyter Notebook below, but here’s the core info about this Canu assembly.

PacBio Input files (available on Owl/nightingales/O_lurida:

m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz                                                               m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz
m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq                       m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz
m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz  m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz
m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz  m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz
m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz  m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz
m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz

Canu execution command (see the Jupyter Notebook below for more info):

$time canu \
useGrid=false \
-p 20171009_oly_pacbio \
-d /home/data/20171018_oly_pacbio_canu/ \
genomeSize=1.9g \
correctedErrorRate=0.075 \
-pacbio-raw m*

Results:

Well, this took a LONG time to run; a bit over two days!

The report file contains some interesting tidbits. For instance:

  • Unitgigging calculates only 1.84x coverage
  • Trimming removed >5 billion (!!) bases: 867850 reads 5755379456 bases (reads with no overlaps, deleted)
  • Unitigging unassembled: unassembled: 479693 sequences, total length 2277137864 bp

I’ll compare this Canu assembly against Sean’s Canu assembly and see how things look.

Report file (text file): http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.report

Contigs Assembly (FASTA): http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/20171018_oly_pacbio.contigs.fasta

Complete Canu output directory: http://owl.fish.washington.edu/Athaliana/20171018_oly_pacbio_canu/

Jupyter Notebook (GitHub): 20171018_docker_oly_canu.ipynb

Data Management – Convert Oly PacBio H5 to FASTQ

After working with all of this Olympia oyster genome sequencing data, I remembered that we had an old, singular PacBio SMRT cell file (from June 2013). This file didn’t seem to be included in any recent assemblies of Sean’s or mine. This is most likely because we have it in the PacBio H5 format and not in FASTQ.

I installed PacBio’s pbh5tools on my computer (swoose), converted the file and moved it to owl/nightingales/O_lurida

python bash5tools.py /mnt/owl/nightingales/O_lurida/m130619_081336_42134_c100525122550000001823081109281326_s1_p0.bas.h5 --outType fastq 

I generated an MD5 checksum and appended to the checksums.md5 file in /owl/nightingales/O_lurida using the following command:

md5sum m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq | awk '{print $2 " = " $1}' >> checksums.md5

The command above pipes the output to awk to format the output to match the existing format of the checksums.md5 file (i.e. filename = hash).

I’ve also updated our Nightingales spreadsheet (Google Sheet) to reflect this.

Will generate updated PacBio assemblies with Canu and/or Racon.

Genome Assembly – Olympia Oyster Redundans with Illumina + PacBio

Redundans should assemble both Illumina and PacBio data, so let’s do that.

Sean had previously performed this – twice actually:

It wasn’t entirely clear how he had run Redundans the first time and the second time he used his Platinus contig FASTA file as the necessary reference assembly when running Redundans.

Since he had produced a good looking assembly from PacBio data using Canu, I decided to give Redundans a rip using that assembly.

I then compared all three Redundans runs using QUAST.

Jupyter notebook (GitHub): 20171004_docker_oly_redundans.ipynb

Notebook is also embedded at the bottom of this notebook entry (but, it should be easier to view at the link provided above).

Of note, is that Redundans didn’t find any alignments for the paired reads for each of the BGI mate-pair Illumina data:

  • 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz
  • 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz

First, I ran QUAST with the default settings:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_21_50/report.html

Using that Canu assembly with Redundans certainly seems to results in a better assembly.

Decided to run QUAST with the –scaffolds option to see what happened:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_28_51/report.html

The scaffolds with the “Ns” removed from them are appended with “_broken” – meaning the scaffolds were broken apart into contigs. Things are certainly cleaner when using the --scaffolds option, however, as far as I can tell, QUAST doesn’t actually generate a FASTA file with the “_broken” scaffolds!

Genome Assembly – Olympia oyster PacBio minimap/miniasm/racon

In this GitHub Issue, Steven had suggested I try out the minimap/miniasm/racon pipeline for assembling our Olympia oyster PacBio data.

I followed the pipeline described by this paper: http://matzlab.weebly.com/uploads/7/6/2/2/76229469/racon.pdf.

Previously, ran the first part of the pipeline: minimap

This notebook entry just contains the miniasm execution. Will follow with racon.

Jupyter Notebook (GitHub): 20170918_docker_pacbio_oly_miniasm0.2.ipynb