Tag Archives: Illumina

Genome Assembly – Olympia Oyster Illumina & PacBio Using PB Jelly w/Platanus Assembly

Sean had previously attempted to run PB Jelly, but ran into some issues running on Hyak, so I decided to try this on Emu.

PB Jelly Documentation

Here’s a brief rundown of how this was run:

Default PB Jelly settings (including default settings for blasr).
Illumina reference FASTA: Sean’s Platanus kmer=22 assembly
PacBio reads for mapping
Protocol.xml file needed for PB Jelly to run

See the Jupyter Notebook for full details of run (see Results section below).

Results:

Output folder: http://owl.fish.washington.edu/Athaliana/20171113_oly_pbjelly/

This completed very quickly (like, just a couple of hours). I also didn’t experience the woes of multimillion temp file production that killed Sean’s attempt at running this on Mox (Hyak).

However, it doesn’t seem to have produced an assembly!

Looking through the output, it seems it didn’t produce an assembly because there weren’t any gaps to fill in the reference. This makes sense: I used the Platanus contig FASTA file (i.e. not a scaffolds file), and contigs don’t contain gaps. I didn’t realize PB Jelly was designed just for gap filling. Guess I’ll give this another go using the BGI scaffold FASTA file and see what we get.
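
Since PB Jelly only fills gaps, counting the runs of Ns in a reference before launching is a cheap way to see whether there is anything for it to do. Here’s a minimal sketch (not part of the original run). The name count_gaps is made up for illustration, and it assumes gaps are encoded as runs of N or n:

```shell
# count_gaps FASTA
# Hypothetical helper: prints the number of N-runs (gaps) in a FASTA file.
count_gaps() {
  awk '
    # Count the N-runs in the record collected so far, then reset it.
    function tally() {
      if (seq != "") gaps += gsub(/[Nn]+/, "", seq)
      seq = ""
    }
    /^>/ { tally(); next }   # header line: finish the previous record
    { seq = seq $0 }         # join wrapped lines so a run spanning lines counts once
    END { tally(); print gaps + 0 }
  ' "$1"
}
```

Called as count_gaps reference.fasta (placeholder filename), a contig-only file should print 0, while a scaffold file should print one count per gap.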

Jupyter Notebook (GitHub): 20171113_emu_pbjelly_22mer_plat.ipynb

Genome Assembly – Olympia oyster Illumina & PacBio Reads Using Redundans

0000-0002-2747-368X

Had problems with Docker and Jupyter Notebook inexplicably dying and deleting all the files in the working directory of the Jupyter Notebook (which also happened to be the volume mounted in the Docker container).

So, I ran this on my computer, but didn’t have Jupyter installed (yet).

This utilized the Canu contigs file (FASTA) that I generated on 20171018.

Here’s the input command:

sudo python /home/sam/software/redundans/redundans.py \
-t 24 \
-l \
m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq.gz \
m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz \
m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz \
m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz \
m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz \
m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz \
m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz \
m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz \
m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz \
m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz \
m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz \
-i \
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz \
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz \
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz \
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz \
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz \
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz \
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz \
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz \
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz \
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz \
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz \
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz \
-f 20171018_oly_pacbio.contigs.fasta \
-o /home/data/20171024_docker_oly_redundans_01/
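
With this many filenames on one command line, a typo in any of them can cost hours, so it can be worth confirming that every input exists and is non-empty before starting. A small sketch of such a check (not something I ran at the time; check_inputs is a made-up helper name):

```shell
# check_inputs FILE...
# Hypothetical helper: report every path that is missing or empty.
# Returns non-zero if at least one input fails the check.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_inputs *.fastq.gz *.fq.gz 20171018_oly_pacbio.contigs.fasta && echo "inputs OK" would only print the message if every matched file is present and has content.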

This completed in just over 19hrs.

Copied output files to Owl: http://owl.fish.washington.edu/Athaliana/20171024_docker_oly_redundans_01/

Here’s the desired output file (FASTA): scaffolds.reduced.fa

Will add to our genome assemblies table.

Ran Quast on 20171103 for some assembly stats.

Quast output is here: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_11_03_22_43_06/

Genome Assembly – Olympia oyster Illumina & PacBio reads using MaSuRCA

UPDATE 20171031 – This crashed. No plans to troubleshoot.

Ran this on a Mox (Hyak) node.

Create single PacBio FASTA file:

list of PacBio files:

m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq m170315_001112_42134_c101169372550000001823273008151717_s1_p0_filtered_subreads.fastq.gz
m170211_224036_42134_c101073082550000001823236402101737_s1_X0_filtered_subreads.fastq.gz m170315_063041_42134_c101169382550000001823273008151700_s1_p0_filtered_subreads.fastq.gz
m170301_100013_42134_c101174162550000001823269408211761_s1_p0_filtered_subreads.fastq.gz m170315_124938_42134_c101169382550000001823273008151701_s1_p0_filtered_subreads.fastq.gz
m170301_162825_42134_c101174162550000001823269408211762_s1_p0_filtered_subreads.fastq.gz m170315_190851_42134_c101169382550000001823273008151702_s1_p0_filtered_subreads.fastq.gz
m170301_225711_42134_c101174162550000001823269408211763_s1_p0_filtered_subreads.fastq.gz
m170308_163922_42134_c101174252550000001823269408211742_s1_p0_filtered_subreads.fastq.gz
m170308_230815_42134_c101174252550000001823269408211743_s1_p0_filtered_subreads.fastq.gz

Concatenate GZIPPED files:

$cat *.gz > pacbio_cat.fastq.gz

Convert FASTQ gzip to FASTA:

$zcat pacbio_cat.fastq.gz | awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' > oly_pacbio_concatentated.fa

Convert and concatenate single non-gzipped FASTQ file:

$awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' m130619_081336_42134_c100525122550000001823081109281326_s1_p0.fastq >> oly_pacbio_concatentated.fa
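
The gzipped and non-gzipped files can also be converted in a single pass, with a record count to confirm nothing was dropped. This is a minimal sketch rather than what was run here. The function names are made up, and it relies on gzip -cdf passing non-gzipped input through unchanged (documented GNU gzip behaviour; other gzip builds are an assumption):

```shell
# fastq_to_fasta FILE...
# Hypothetical helper: stream any mix of plain and gzipped FASTQ files
# to FASTA on stdout, using the same awk conversion as above.
fastq_to_fasta() {
  gzip -cdf "$@" |
    awk 'NR%4==1 { printf ">%s\n", substr($0, 2) } NR%4==2 { print }'
}

# fastq_records FILE...
# Number of FASTQ records (4 lines each) across the given files.
fastq_records() {
  gzip -cdf "$@" | awk 'END { print NR / 4 }'
}
```

After something like fastq_to_fasta *.fastq *.fastq.gz > out.fa (placeholder output name), the value from fastq_records on the same inputs should equal grep -c '^>' out.fa.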

GUNZIP Illumina GZIPPED FASTQ

$for i in *.gz; do gunzip < "$i" > ${i%%.gz}; done

Determine mean read length and standard deviation from Illumina FASTQ files (needed for MaSuRCA config file). Found the code below in a Biostars thread:

$for i in *.fq; do echo "$i   "; awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;v=sq/n-m*m;if(v<0)v=0;printf("total %d avg=%f stddev=%f\n",n,m,sqrt(v));}' "$i"; done

Output:

151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq   
total 61253141 avg=150.000000 stddev=0.000000
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq   
total 61253141 avg=150.000000 stddev=0.000000
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq   
total 58755925 avg=150.000000 stddev=0.000000
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq   
total 58755925 avg=150.000000 stddev=0.000000
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq   
total 43938762 avg=150.000000 stddev=0.000000
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq   
total 43938762 avg=150.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq   
total 87198584 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq   
total 87198584 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq   
total 43766527 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq   
total 43766527 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq   
total 87135961 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq   
total 87135961 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq   
total 43138844 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq   
total 43138844 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq   
total 95270180 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq   
total 95270180 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq   
total 92524416 avg=49.000000 stddev=0.000000
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq   
total 92524416 avg=49.000000 stddev=0.000000
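
The same statistics can be computed without decompressing every file first. This is a minimal sketch, not the command that produced the output above. read_len_stats is a made-up name, it assumes gzip -cdf passes plain files through unchanged, and it takes the square root of the variance so the last column is a standard deviation in bases:

```shell
# read_len_stats FILE
# Hypothetical helper: read count, mean read length and standard deviation
# for one FASTQ file (gzipped or plain).
read_len_stats() {
  gzip -cdf "$1" | awk '
    NR % 4 == 2 { n++; len = length($0); sum += len; sumsq += len * len }
    END {
      if (n == 0) { print "n=0"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean   # population variance
      if (var < 0) var = 0            # guard against floating-point round-off
      printf "n=%d mean=%f stddev=%f\n", n, mean, sqrt(var)
    }'
}
```

It can be looped over the compressed files directly, e.g. for i in *.fq.gz; do echo "$i"; read_len_stats "$i"; done.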

Here’s the config file I’m using:

/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly/20171019_masurca_oly_config.txt

# example configuration file

# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:
# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads
# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be
# innies, i.e. --->.<---, and JUMP are assumed to be outties
# <---.--->. If there are any jump libraries that are innies, such as
# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads
# are optional for PE libraries and mandatory for JUMP libraries. Any
# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first
# converted into Celera Assembler compatible .frg files (see
# http://wgs-assembler.sourceforge.com)
DATA
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq
PE= pe 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq

JUMP= sh 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq
JUMP= sh 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq
JUMP= sh 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
PACBIO=/gscratch/scrubbed/samwhite/O_lurida/oly_pacbio/oly_pacbio_concatentated.fa
END

PARAMETERS
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 1
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used.  one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 28
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes
SOAP_ASSEMBLY=0
END

Execute the masurca script to generate assembly script (on Mox login node):

$cd /gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly

$/gscratch/srlab/programs/MaSuRCA-3.2.3/bin/masurca 20171019_masurca_oly_config.txt

Got this error:

I’ll edit config file to have standard deviations == 1 and try again.

Got another error:

I’ll edit config file to have different two letter prefix assignments…

It worked!

Final version of config:

# example configuration file

# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:
# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads
# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be
# innies, i.e. --->.<---, and JUMP are assumed to be outties
# <---.--->. If there are any jump libraries that are innies, such as
# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads
# are optional for PE libraries and mandatory for JUMP libraries. Any
# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first
# converted into Celera Assembler compatible .frg files (see
# http://wgs-assembler.sourceforge.com)
DATA
PE= aa 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq
PE= ab 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq
PE= ac 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq
PE= ad 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq
PE= ae 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq
PE= af 49 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq

JUMP= ba 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq
JUMP= bb 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq
JUMP= bc 150 1 /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq /gscratch/scrubbed/samwhite/O_lurida/oly_illumina/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq
#pacbio reads must be in a single fasta file! make sure you provide absolute path
PACBIO=/gscratch/scrubbed/samwhite/O_lurida/oly_pacbio/oly_pacbio_concatentated.fa
END

PARAMETERS
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 1 if you have less than 20x long reads (454, Sanger, Pacbio) and less than 50x CLONE coverage by Illumina, Sanger or 454 mate pairs
#otherwise keep at 0
USE_LINKING_MATES = 1
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler.  do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS =  cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used.  one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 28
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 200000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module.  Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes
SOAP_ASSEMBLY=0
END
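
The JF_SIZE comment in the config suggests estimated_genome_size*estimated_coverage as a safe value. Since coverage is total sequenced bases divided by genome size, that product is simply the total number of bases, which can be worked out from the read counts in the awk output above without knowing the genome size. A rough sketch for the Illumina data only (variable names are mine, PacBio reads are not included, and whether the resulting value is appropriate for this run is untested):

```shell
# Reads per file, taken from the awk output above.
# Each count applies to both the _1 and the _2 file of a pair.
reads_150bp=$((61253141 + 58755925 + 43938762))
reads_49bp=$((87198584 + 43766527 + 87135961 + 43138844 + 95270180 + 92524416))

# Two files (forward and reverse) per library, every read at the
# fixed length reported above. Needs 64-bit shell arithmetic.
illumina_bases=$(( reads_150bp * 2 * 150 + reads_49bp * 2 * 49 ))
echo "Illumina bases (genome size x coverage): $illumina_bases"
```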

Submitted the job to Mox using the following command:

sbatch -p srlab -A srlab 20171019_masurca_oly_assembly.sh

Here’s the sbatch script used for the job:

#!/bin/bash
## Job Name
#SBATCH --job-name=20171019_masurca_oly
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly
/gscratch/scrubbed/samwhite/20171019_masurca_oly_assembly/assemble.sh

Genome Assembly – Olympia Oyster Redundans with Illumina + PacBio

Redundans should assemble both Illumina and PacBio data, so let’s do that.

Sean had previously performed this – twice actually:

It wasn’t entirely clear how he had run Redundans the first time. The second time, he used his Platanus contig FASTA file as the necessary reference assembly when running Redundans.

Since he had produced a good looking assembly from PacBio data using Canu, I decided to give Redundans a rip using that assembly.

I then compared all three Redundans runs using QUAST.

Jupyter notebook (GitHub): 20171004_docker_oly_redundans.ipynb

Notebook is also embedded at the bottom of this notebook entry (but, it should be easier to view at the link provided above).

Sean’s Canu assembly (FASTA): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/oly_pacbio_.contigs.fasta
Sean’s first Redundans assembly (scaffolded assembly; FASTA): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output/scaffolds.reduced.fa
Sean’s second Redundans assembly (scaffolded assembly; FASTA): http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output_Try_2/scaffolds.reduced.fa
Redundans Output folder: http://owl.fish.washington.edu/Athaliana/20171004_redundans/
Redundans assembly (scaffolded assembly; FASTA): http://owl.fish.washington.edu/Athaliana/20171004_redundans/scaffolds.reduced.fa
Quast Output folder (default settings): http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_21_50/
Quast Output folder (--scaffolds option): http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_28_51/

Of note, Redundans didn’t find any alignments for the paired reads in any of the following BGI mate-pair Illumina files:

160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz
160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz
160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz

First, I ran QUAST with the default settings:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_21_50/report.html

Using that Canu assembly with Redundans certainly seems to result in a better assembly.

Decided to run QUAST with the --scaffolds option to see what happened:

Interactive link: http://owl.fish.washington.edu/Athaliana/quast_results/results_2017_10_05_14_28_51/report.html

The scaffolds with the “Ns” removed from them are appended with “_broken”, meaning the scaffolds were broken apart into contigs. Things are certainly cleaner when using the --scaffolds option. However, as far as I can tell, QUAST doesn’t actually generate a FASTA file with the “_broken” scaffolds!
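
If a FASTA of broken scaffolds is wanted, it can be produced outside QUAST by splitting each scaffold at its runs of Ns. Here’s a minimal sketch with some assumptions: break_scaffolds and the _partN suffix are made-up names (not QUAST’s naming), and the default threshold of 10 consecutive Ns is my understanding of what --scaffolds uses, so check it against the QUAST manual before relying on it:

```shell
# break_scaffolds FASTA [MIN_N]
# Hypothetical helper: split each scaffold at runs of at least MIN_N Ns
# (default 10, must be >= 1) and print the pieces as FASTA on stdout.
break_scaffolds() {
  awk -v min="${2:-10}" '
    BEGIN {
      # Build a regex for at least min consecutive N/n without interval syntax
      gap_re = ""
      for (i = 1; i <= min; i++) gap_re = gap_re "[Nn]"
      gap_re = gap_re "+"
    }
    # Print the non-empty pieces of the scaffold collected so far, then reset it.
    function emit(   n, parts, i, k) {
      if (name == "") return
      n = split(seq, parts, gap_re)
      k = 0
      for (i = 1; i <= n; i++)
        if (parts[i] != "") printf ">%s_part%d\n%s\n", name, ++k, parts[i]
      seq = ""
    }
    /^>/ { emit(); name = substr($1, 2); next }   # keep only the first word of the header
    { seq = seq $0 }                              # join wrapped sequence lines
    END { emit() }
  ' "$1"
}
```

Usage would be along the lines of break_scaffolds scaffolds.reduced.fa > scaffolds.reduced.broken.fa, where the output filename is only a placeholder.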

Samples Submitted – Geoduck Ctenidia to Illumina for 10x Genomics Sequencing

Continuing Illumina’s generous efforts to use our geoduck samples to test out the robustness of their emerging sequencing technologies, they have requested we send them some geoduck tissue so that they can try to complete the genome sequencing efforts using the 10x Genomics sequencing platform.

I sent two frozen pieces (~28mg each) of geoduck ctenidia tissue on dry ice. Tissue was collected by Brent & Steven on 20150811.

FedEx tracking: 770129114978

Project Progress – Olympia Oyster Genome Assemblies by Sean Bennett

Here’s a brief overview of what Sean has done on the Oly genome assembly front.

Metassembler

Assemble his BGI assembly and Platanus assembly? Confusing terms here; not sure what he means.
Failed due to 32-bit vs. 64-bit installation of MUMmer. He didn’t have the chance to re-compile MUMmer as 64-bit. However, a recent MUMmer announcement suggests that MUMmer can now handle genomes of unlimited size.
I believe he was planning on using (or was using?) GARM, which relies upon MUMmer and may also include a version of MUMmer (outdated version that led to Sean’s error message?).
Notebook entry

Canu

Assemble UW PacBio data (filenames beginning with m170211, m170315, m170308, and m170301).
Files (including Mox scripts, Pilon contig polishing, & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Canu_Output/
Notebook entry

Redundans

Assembled raw Illumina reads provided by BGI (filenames beginning with 151114 and 160103) & UW PacBio data (filenames beginning with m170211, m170315, m170308, and m170301).
Ran this two times.
First run
- Files (does NOT include Mox scripts!) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output/
- Notebook entry
Second run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Redundans_Output_Try_2/
- Notebook entry

Platanus

Assembled raw Illumina reads provided by BGI (beginning with 151114 and 160103).
Ran this two times.
First run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Illumina_Platanus_Assembly/
- Notebook entry
Second run
- Files (including Mox scripts & output FASTA files) are here: http://owl.fish.washington.edu/scaphapoda/Sean/Oly_Platanus_Assembly_Kmer-22/
- Notebook entry

Data Management – Illumina Geoduck HiSeq & MiSeq Data

The HDD we received from Illumina last week only had data (i.e. fastq files) from the NovaSeq runs they performed; there was nothing from either the MiSeq or the HiSeq runs.

We contacted them about the missing data, they confirmed it was missing, and uploaded the remaining data to BaseSpace.

Began downloading the data – will take a while…

Files will be temporarily stored in these locations:

/volume1/web/nightingales/Geoduck_MiSeq/170317_M03814_0172_000000000-B2K79/Data/GeoDuckRNAMiSeq-35978947

/volume1/web/nightingales/Geoduck_HiSeq/170228_ST-K00104_0382_BHHGTLBBXX/Data/Ironman-35682656

/volume1/web/nightingales/Geoduck_HiSeq/170228_ST-K00104_0381_AHHHWNBBXX/Data/Ironman-35682656

Data Received – Geoduck Genome Sequencing by Illumina

We previously sent some geoduck samples to Illumina, as part of a pilot project for them to test out a new sequencing platform. The data has finally arrived!

It was sent on a 4TB Seagate external hard drive.

Due to weird connection issues we’ve recently encountered with our server, Owl (Synology DS1812+), I connected the HDD directly to Owl via USB (instead of connecting to a computer and transferring). I transferred the data using the Synology web interface to avoid any computer/NAS connection issues that might interrupt the transfer.

We have a meeting with the Illumina people tomorrow afternoon to review the data they’ve provided (looks like it’s going to take a while, though). Once that meeting takes place, we’ll figure out how to document this project in our data management plan.

Sample Submission – Geoduck gDNA for Illumina Pilot Sequencing Project

Sent 10μg of the geoduck gDNA I isolated earlier today to Illumina on dry ice via FedEx Standard Overnight service.

DNA Isolation – Geoduck gDNA for Illumina-initiated Sequencing Project

We were previously approached by Cindy Lawley (Illumina Market Development) for possible participation in an Illumina product development project, in which they wanted to have some geoduck tissue and DNA on-hand in case Illumina green-lighted the use of geoduck for testing out the new sequencing platform on non-model organisms. Well, guess what, Illumina has given the green light for sequencing our geoduck! However, they need at least 4μg of gDNA, so I’m isolating more.

Isolated DNA from ctenidia tissue from the same Panopea generosa individual used for the BGI sequencing efforts. Tissue was collected by Brent & Steven on 20150811.

Used the E.Z.N.A. Mollusc Kit (Omega) to isolate DNA from five separate ~60mg pieces of ctenidia tissue according to the manufacturer’s protocol, with the following changes:

Samples were homogenized with plastic, disposable pestle in 350μL of ML1 Buffer
Incubated homogenate at 60°C for 1hr
No optional steps were used
Performed three rounds of 24:1 chloroform:IAA treatment
Eluted each in 50μL of Elution Buffer and pooled into a single sample

Quantified the DNA using the Qubit dsDNA BR Kit (Invitrogen). Used 1μL of DNA sample.

Concentration = 162ng/μL (Quant data is here [Google Sheet]: 20170105_gDNA_geoduck_qubit_quant)

Yield is great (total = ~32μg).

Evaluated gDNA quality (i.e. integrity) by running 162ng (1μL) of sample on 0.8% agarose, low-TAE gel stained with ethidium bromide.

Used 5μL of O’GeneRuler DNA Ladder Mix (ThermoFisher).

Results:

DNA looks good: bright high molecular weight band, minimal smearing, and minimal RNA carryover (seen as more intense “smear” at ~500bp).

Will send off 10μg (they only requested 4μg) so that they have extra to work with in case they come across any issues.

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
