Tag Archives: Panopea generosa

Assembly Stats – Geoduck Genome Assembly Comparisons w/Quast – SparseAssembler, SuperNova, Hi-C

Ran the following Quast command:

/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180405_sparse_kmer101,supernova_pseudohap_duck4-p,20180421_Hi-C \
/mnt/owl/Athaliana/20180405_sparseassembler_kmer101_geoduck/Contigs.txt \
/mnt/owl//halfshell/bu-mox/analyses/0305b/duck4-p.fasta.gz \
/mnt/owl/Athaliana/20180419_geoduck_hi-c/Results/geoduck_roberts\ results\ 2018-04-21\ 18\:09\:04.514704/PGA_assembly.fasta

Results:

Quast output folder: results_2018_04_30_08_00_42/

Quast report (HTML): results_2018_04_30_08_00_42/report.html

The data’s pretty interesting and cool!

SparseAssembler has over 2x the amount of data (in base pairs), yet produces the worst assembly.

SuperNova and Hi-C assemblies are very close in nearly all categories. This isn’t surprising, as the SuperNova assembly was used as a reference assembly for the Hi-C assembly.

However, the Hi-C assembly is insanely better than the SuperNova assembly! For example:

Largest contig is ~7x larger than the SuperNova assembly.
The N50 size is ~243x larger than the SuperNova assembly!!
L50 is only 18, 46x smaller than the SuperNova assembly!
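
For anyone unfamiliar with the two metrics above, here is a minimal sketch of how N50 and L50 are commonly computed from a list of contig lengths. The function name and the toy lengths are made up for illustration; this is not Quast's code and the numbers are not from any of the geoduck assemblies.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in base pairs.

    N50: length of the contig at which the running total (longest first)
         first reaches half of the total assembly size.
    L50: number of contigs needed to reach that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Made-up contig lengths, purely illustrative.
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_lengths)
print(f"N50 = {n50} bp, L50 = {l50} contigs")
```

A larger N50 together with a smaller L50 means more of the assembly sits in fewer, longer sequences, which is why the Hi-C numbers stand out.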

This is pretty amazing, honestly. Even more amazing is that this data was sent over to us as some “preliminary” data for us to take a peek at!

Assembly Stats – Geoduck Hi-C Assembly Comparison

0000-0002-2747-368X

Ran the following Quast command to compare the two geoduck assemblies provided to us by Phase Genomics:

/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180403_pga,20180421_pga \
/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts\ results\ 2018-04-03\ 11\:05\:41.596285/PGA_assembly.fasta \
/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts\ results\ 2018-04-21\ 18\:09\:04.514704/PGA_assembly.fasta
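
The backslash escaping in those paths (spaces and colons in the Phase Genomics folder names) is easy to get wrong. As a rough sketch, the same command could be assembled in Python as an argument list, so no shell escaping is needed. The helper name build_quast_command is hypothetical; it only uses the -t and --labels options already shown in the command above.

```python
import shlex


def build_quast_command(quast_path, threads, labeled_assemblies):
    """Build a Quast argument list from (label, fasta_path) pairs."""
    labels = ",".join(label for label, _ in labeled_assemblies)
    paths = [path for _, path in labeled_assemblies]
    return [quast_path, "-t", str(threads), "--labels", labels, *paths]


results_dir = "/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results"
cmd = build_quast_command(
    "/home/sam/software/quast-4.5/quast.py",
    24,
    [
        ("20180403_pga",
         f"{results_dir}/geoduck_roberts results 2018-04-03 11:05:41.596285/PGA_assembly.fasta"),
        ("20180421_pga",
         f"{results_dir}/geoduck_roberts results 2018-04-21 18:09:04.514704/PGA_assembly.fasta"),
    ],
)

# Print a shell-safe version for the notebook record (Python 3.8+).
print(shlex.join(cmd))

# To actually run it, pass the list straight to subprocess:
# import subprocess
# subprocess.run(cmd, check=True)
```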

Results:

Quast Output folder: results_2018_04_30_11_16_04/

Quast report (HTML): results_2018_04_30_11_16_04/report.html

The two assemblies are nearly identical. Interesting…

DNA Isolation & Quantification – Metagenomics Water Filters

0000-0002-2747-368X

After discussing the preliminary DNA isolation attempt with Steven & Emma, we decided to proceed with DNA isolations on the remaining 0.22μm filters.

Isolated DNA from the following five filters:

DNA was isolated with the DNeasy Blood & Tissue Kit (Qiagen), following a modified version of the Gram-Positive Bacteria protocol:

filters were unfolded and unceremoniously stuffed into 1.7mL snap cap tubes
did not perform enzymatic lysis step
filters were incubated with 400μL of Buffer AL and 50μL of Proteinase K (both are double the volumes listed in the kit and are necessary to fully coat the filter in a 1.7mL snap cap tube)
56°C incubations were performed overnight
400μL of 100% ethanol was added to each after the 56°C incubation
samples were eluted in 50μL of Buffer AE
all spins were performed at 20,000g

Samples were quantified with the Roberts Lab Qubit 3.0 and the Qubit 1x dsDNA HS Assay Kit.

Used 5μL of each sample for measurement (see Results for update).

Results:

Raw data (Google Sheet): 20180426_qubit_metagenomics_filters

Sample	Concentration(ng/μL)	Initial_volume(μL)	Yield(ng)
Filter #10 pH 7.1 5/15/17	0.296	50	14.65
Filter #7 pH 8.2 5/15/17	8.44	50	422
Filter #7 pH 8.2 5/19/17	2.52	50	126
Filter #10 pH 7.1 5/22/17	2.0	50	100
Filter #10 pH 7.1 5/26/17	11.9	50	595
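
The Yield(ng) column is just concentration multiplied by elution volume. A minimal sketch of that arithmetic, using three rows copied from the table above (the function and variable names are my own, not from the Google Sheet):

```python
ELUTION_VOLUME_UL = 50  # all samples were eluted in 50 uL of Buffer AE


def total_yield_ng(concentration_ng_per_ul, volume_ul):
    """Total DNA yield in ng, rounded to two decimals."""
    return round(concentration_ng_per_ul * volume_ul, 2)


# (sample label, Qubit concentration in ng/uL) taken from the table above
samples = [
    ("Filter #7 pH 8.2 5/15/17", 8.44),
    ("Filter #10 pH 7.1 5/22/17", 2.0),
    ("Filter #10 pH 7.1 5/26/17", 11.9),
]

for label, concentration in samples:
    print(f"{label}\t{total_yield_ng(concentration, ELUTION_VOLUME_UL)} ng")
```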

Samples were stored in Sam gDNA Box #2, positions G8 – H3. (FTR 213, #27 (small -20°C freezer))

Assembly – SparseAssembler (k 111) on Geoduck Sequence Data

0000-0002-2747-368X

Continuing to try to find the best kmer setting to work with SparseAssembler. The last attempt failed because the kmer size was too large (k 131, which happens to be outside the max kmer size [127] for SparseAssembler), so I re-ran SparseAssembler with an arbitrarily selected kmer size < 131 (picked k 111).

The job was run on our Mox HPC node.

Slurm script: 20180423_sparse_assembler_kmer111_geoduck_slurm.sh

Results:

Output folder:

20180423_sparseassembler_kmer111_geoduck/

Slurm output file:

20180423_sparseassembler_kmer111_geoduck/slurm-164530.out

This failed with the following error message:

Error! K-mer size too large!

Well, this is disappointing. Not entirely sure why this is the case, as it’s below the max kmer setting for SparseAssembler. However, I’m not terribly surprised, as this happened previously (only using NovaSeq data) with a kmer setting of 117.

I’ve posted an issue on the kmergenie GitHub page; we’ll see what happens.

Assembly – SparseAssembler (k 131) on Geoduck Sequence Data

0000-0002-2747-368X

After some runs with kmergenie, I’ve decided to try re-running SparseAssembler using a kmer setting of 131.

The job was run on our Mox HPC node.

Slurm script: 20180422_sparse_assembler_kmer131_geoduck_slurm.sh

Results:

Output folder:

20180422_sparseassembler_kmer131_geoduck/

Slurm output file:

20180422_sparseassembler_kmer131_geoduck/slurm-163406.out

This failed with the following error message:

Error! K-mer size too large!

Looking into this, it’s because the maximum kmer size for SparseAssembler is 127! Doh!

It’d be nice if the program looked at that setting first before processing all the data files…

A bit disappointing, but I’ll give this a go with a lower kmer setting and see how it goes.
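
In the meantime, a cheap pre-flight check before submitting the Slurm job would at least catch the obvious case. This is only a sketch: the function name is hypothetical, and the limit of 127 is simply the maximum kmer size quoted in these notebook posts, not something read from the SparseAssembler source. Passing this check is no guarantee that the assembler itself will accept the value.

```python
# Maximum kmer size as quoted in this notebook (an assumption, not verified
# against the assembler's source code).
ASSUMED_MAX_KMER = 127


def check_kmer_size(k, max_k=ASSUMED_MAX_KMER):
    """Return k unchanged if it is within the assumed limit, else raise."""
    if k < 1:
        raise ValueError(f"kmer size must be positive, got {k}")
    if k > max_k:
        raise ValueError(f"kmer size {k} exceeds assumed maximum of {max_k}")
    return k


for candidate in (101, 111, 131):
    try:
        check_kmer_size(candidate)
        print(f"k {candidate}: within assumed limit")
    except ValueError as error:
        print(f"k {candidate}: {error}")
```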

Data Management – Geoduck Phase Genomics Hi-C Data

0000-0002-2747-368X

We received sequencing/assembly data from Phase Genomics.

The data contains two assemblies, produced on two different dates.

All data is here: 20180421_geoduck_hi-c

All FASTQ files (four files; Geoduck_HiC*.gz) were copied to Nightingales:

http://owl.fish.washington.edu/nightingales/P_generosa/

MD5 checksums were verified and appended to the Nightingales checksum file:

http://owl.fish.washington.edu/nightingales/P_generosa/checksums.md5
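
For reference, a minimal sketch of how the checksum verification could be scripted. It assumes the checksum file uses the usual md5sum layout (hash, whitespace, filename, one entry per line); I have not confirmed that against the actual file, and the function names are my own.

```python
import hashlib
import os


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_line(line):
    """Split an md5sum-style line into (expected_hash, filename)."""
    expected, filename = line.strip().split(maxsplit=1)
    # md5sum marks binary mode with a leading '*' on the filename
    return expected.lower(), filename.lstrip("*")


def verify_checksums(checksum_file, data_dir):
    """Return {filename: True/False} for every entry in the checksum file."""
    results = {}
    with open(checksum_file) as handle:
        for line in handle:
            if not line.strip():
                continue
            expected, filename = parse_checksum_line(line)
            results[filename] = md5sum(os.path.join(data_dir, filename)) == expected
    return results
```

Reading in chunks matters here because the FASTQ files are far too large to load into memory in one go.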

Nightingales sequencing inventory was updated (Google Sheet):

Nightingales inventory

The two assemblies (and assembly stats) they provided are here:

I’ve updated the project-geoduck-genome GitHub wiki with this info.

Kmer Estimation – Kmergenie (k 301) on Geoduck Sequence Data

0000-0002-2747-368X

Continuing the quest for the ideal kmer size to use for our geoduck assembly.

The previous two runs with kmergenie using the diploid setting were no good.

So, this time, I simply increased the maximum kmer size to 301 and left all other settings as default. I’m hoping this is large enough to produce a smooth curve, with a maximal value that can be determined from the output graph.

The job was run on our Mox HPC node.

Slurm script: 20180421_kmergenie_k301_geoduck_slurm.sh

Results:

Output folder:

20180421_kmergenie_k301_geoduck/

Slurm output file:

20180421_kmergenie_k301_geoduck/slurm-163019.out

Kmer histogram (HTML) reports:

20180421_kmergenie_k301_geoduck/histograms_report.html

Well, the graph is closer to what we’d expect, in that it appears to reach a zenith, but after that plateau, we see a sharp dropoff, as opposed to a gradual dropoff that mirrors the left half. Not entirely sure what the implications for this are, but I’ll go ahead and run SparseAssembler using a kmer size of 131 and see how it goes.

Kmer Estimation – Kmergenie Tweaks on Geoduck Sequence Data

0000-0002-2747-368X

Earlier today, I ran kmergenie on all of our geoduck DNA sequencing data to see what it would spit out for an ideal kmer setting, which I would then use in another assembly attempt using SparseAssembler, just to see how the assembly might change.

The output from that kmergenie run suggested that the ideal kmer size exceeded the default maximum (k = 121), so I decided to run kmergenie a few more times, with some slight changes.

All jobs were run on our Mox HPC node.

Run 1

Diploid
Slurm script: 20180419_kmergenie_diploid_geoduck_slurm.sh

Run 2

Diploid
k 301
Slurm script: 20180419_kmergenie_diploid_k301_geoduck_slurm.sh

Results:

Output folders:

Slurm output files:

Kmer histogram (HTML) reports:

Diploid

Diploid, k 301

Okay, well, these graphs clearly show that the diploid setting is no good.

We should be getting a nice, smooth, concave curve.

Will try running again, without diploid setting and just increasing the max kmer size.

Kmer Estimation – Kmergenie on Geoduck Sequence Data (default settings)

0000-0002-2747-368X

After the last SparseAssembler assembly completed, I wanted to do another run with a different kmer size (last time was arbitrarily set at 101). However, I didn’t really know how to decide, particularly since this assembly consisted of mixed read lengths (50bp and 100bp). So, I ran kmergenie on all of our geoduck (Panopea generosa) sequencing data in hopes of getting a kmer determination to apply to my next assembly.

The job was run on our Mox HPC node.

Slurm script: 20180419_kmergenie_geoduck_slurm.sh

Input files list (needed for kmergenie command – see Slurm script linked above): geoduck_fastq_list.txt
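
A quick sketch of how a file list like this could be generated rather than typed by hand. It assumes the list is plain text with one FASTQ path per line and that the files end in .fastq.gz; both are assumptions on my part, and the function name and example paths are hypothetical.

```python
from pathlib import Path


def write_fastq_list(fastq_dir, list_path, pattern="*.fastq.gz"):
    """Write absolute paths of matching files to list_path, one per line."""
    paths = sorted(str(p.resolve()) for p in Path(fastq_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example usage (directory name is illustrative only):
# write_fastq_list("/path/to/geoduck_fastqs", "geoduck_fastq_list.txt")
```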

Results:

Output folder: 20180419_kmergenie_geoduck/

Slurm output file: 20180419_kmergenie_geoduck/slurm-161551.out

Kmer histograms (HTML): 20180419_kmergenie_geoduck/histograms_report.html

Screen cap from Kmer report:

Kmergenie estimates the best kmer size for this data to be 121.

However, based on the kmergenie documentation, this is likely to be inaccurate. This inaccuracy is based on the fact that our kmer graph should be concave. Our graph, instead, is only partial – we haven’t reached a kmer size where the number of kmers is decreasing.
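
To make the "partial curve" problem concrete, here is a small sketch that checks whether the highest point of a kmer-count curve falls strictly inside the range of kmer sizes tested. If the maximum sits at the largest k tested, the curve never turned over and the reported best k is suspect. The function name and the example numbers are invented for illustration; they are not values from the kmergenie report.

```python
def has_interior_peak(histogram):
    """Return True if the highest count is not at either end of the k range.

    histogram: iterable of (kmer_size, kmer_count) pairs.
    """
    points = sorted(histogram)
    if len(points) < 3:
        raise ValueError("need at least three kmer sizes to judge the curve")
    counts = [count for _, count in points]
    peak_index = counts.index(max(counts))
    return 0 < peak_index < len(counts) - 1


# Invented numbers: still climbing at the largest k tested (no peak reached)
still_rising = [(21, 100), (61, 250), (101, 400), (121, 480)]
# Invented numbers: rises, peaks at k 101, then declines
turned_over = [(21, 100), (61, 250), (101, 400), (121, 350)]

print(has_interior_peak(still_rising))
print(has_interior_peak(turned_over))
```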

As such, I’ll try re-running with a different maximum kmer setting (default max is 121).

Assembly Stats – Quast Stats for Geoduck SparseAssembler Job from 20180405

0000-0002-2747-368X

The geoduck genome assembly started 20180405 completed this weekend.

This assembly utilized the BGI data and all of the Illumina project data (NMP and NovaSeq) with a kmer 101 setting.

I ran Quast to gather some assembly stats, using the following command:

python /home/sam/software/quast-4.5/quast.py -t 24 /mnt/owl/Athaliana/20180405_sparseassembler_kmer101_geoduck/Contigs.txt

Results:

Quast output folder: results_2018_04_15_13_45_03/

Quast report (HTML): results_2018_04_15_13_45_03/report.html

I’ve embedded the Quast HTML report below, but it may be easier to view by using the link above.

Sam's Notebook

University of Washington – Fishery Sciences – Roberts Lab
