Category Archives: Geoduck Genome Sequencing

Assembly Stats – Geoduck Hi-C Final Assembly Comparison

We received the final geoduck genome assembly data from Phase Genomics, in which they updated the assembly by performing some manual curation:

There are additional assembly files that provide some additional assembly data. See the following directory:

Actual sequencing data and two previous assemblies were previously received on 20180421.

All assembly data (both old and new) from Phase Genomics was downloaded in full from the Google Drive link provided by them and stored here on Owl:

Ran Quast to compare all three assemblies provided (command run on Swoose):


/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180403_pga,20180421_pga,20180810_geo_manual \
/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-03 11:05:41.596285/PGA_assembly.fasta \ /mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts results 2018-04-21 18:09:04.514704/PGA_assembly.fasta \ /mnt/owl/Athaliana/20180822_phase_genomics_geoduck_Results/geoduck_manual/geoduck_manual_scaffolds.fasta

Results:

Quast output folder: results_2018_08_23_07_38_28/

Quast report (HTML): results_2018_08_23_07_38_28/report.html

Data Management – Illumina NovaSeq Geoduck Genome Sequencing

As part of the Illumina collaborative geoduck genome sequencing project, their end goal has always been to sequence the genome in a single run.

They’ve finally attempted this by running 10x Genomics, Hi-C, Nextera, and TruSeq libraries in a single run of the NovaSeq.

I downloaded the data using the BaseSpace downloader using Chrome on a Windows 7 computer (this is not available on Ubuntu and the command line tools that are available from Illumina are too confusing for me to bother spending the time on to figure out how to use them just to download the data).

Data was saved here:

Generated MD5 checksums (using md5sum on Ubuntu) and appended to the checksums file:

Illumina was unable to provide MD5 checksums on their end, so I was unable to confirm data integrity post-download.

Illumina sample info is here:

Will add info to:

List of files received:

10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleA_S10_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleB_S11_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleC_S12_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x5-A3-MultipleD_S13_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleA_S14_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleB_S15_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleC_S16_L002_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L001_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L001_R2_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L002_R1_001.fastq.gz
10x-Genomics-Libraries-Geo10x6-B3-MultipleD_S17_L002_R2_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L001_R1_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L001_R2_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L002_R1_001.fastq.gz
HiC-Libraries-GeoHiC-C3-N701_S18_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP10-B2-AD013_S7_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP11-C2-AD014_S8_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP12-D2-AD015_S9_L002_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L001_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L001_R2_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L002_R1_001.fastq.gz
Nextera-Mate-Pair-Library-GeoNMP9-A2-AD002_S6_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA1-A1-NR006_S1_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA3-C1-NR012_S2_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA5-E1-NR005_S3_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA7-G1-NR019_S4_L002_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L001_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L001_R2_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L002_R1_001.fastq.gz
Trueseq-stranded-mRNA-libraries-GeoRNA8-H1-NR021_S5_L002_R2_001.fastq.gz

Assembly – Geoduck Hi-C Assembly Subsetting

Steven asked me to create a couple of subsets of our Phase Genomics Hi-C geoduck genome assembly (pga_02):

  • Contigs >10kbp

  • Contigs >30kbp

I used pyfaidx on Roadrunner and the following commands:

faidx --size-range 10000,100000000 PGA_assembly.fasta > PGA_assembly_10k_plus.fasta
faidx --size-range 30000,100000000 PGA_assembly.fasta > PGA_assembly_30k_plus.fasta

Ran Quast afterwards to get stats on the new FastA files just to confirm that the upper cutoff value was correct and didn’t get rid of the largest contig(s).

Results:

faidx Output folder: 20180512_geoduck_fasta_subsets/

10kbp contigs (FastA): 20180512_geoduck_fasta_subsets/PGA_assembly_10k_plus.fasta

30kbp contigs (FastA): 20180512_geoduck_fasta_subsets/PGA_assembly_30k_plus.fasta

Quast output folder: results_2018_05_14_06_26_26/

Quast report (HTML): results_2018_05_14_06_26_26/report.html

Everything looks good. The main thing I wanted to confirm by running Quast was that the largest contig in each subset was the same as the original PGA assembly (95,480,635bp.

Read Mapping – Mapping Illumina Data to Geoduck Genome Assemblies with Bowtie2

We have an upcoming meeting with Illumina to discuss how the geoduck genome project is coming along and to decide how we want to proceed.

So, we wanted to get a quick idea of how well our geoduck assemblies are by performing some quick alignments using Bowtie2.

Used the following assemblies as references:

  • sn_ph_01 : SuperNova assembly of 10x Genomics data

  • sparse_03 : SparseAssembler assembly of BGI and Illumina project data

  • pga_02 : Hi-C assembly of Phase Genomics data

The analysis is documented in a Jupyter Notebook.

Jupyter Notebook (GitHub):

NOTE: Due to large amount of stdout from first genome index command, the notebook does not render well on GitHub. I recommend downloading and opening notebook on a locally install version of Jupyter.

Here’s a brief overview of the process:

  1. Generate Bowtie2 indexes for each of the genome assemblies.
  2. Map 1,000,000 reads from the following Illumina NovaSeq FastQ files:

Results:

Bowtie2 Genome Indexes:

Bowtie2 sn_ph_01 alignment folder:

Bowtie2 sparse_03 alignment folder:

Bowtie2 pga_02 alignment folder:


MAPPING SUMMARY TABLE

All mapping data was pulled from the respective *.err file in the Bowtie2 alignment folders.

sequence_ID Assembler Alignment Rate (%)
sn_ph_01 SuperNova (10x) 79.89
sparse_03 SparseAssembler 85.83
pga_02 Hi-C (Phase Genomics) 79.90|

Mapping efficiency is similar for all assemblies. After speaking with Steven, we’ve decided we’ll begin exploring genome annotation pipelines.

Assembly & Stats – SparseAssembler (k95) on Geoduck Sequence Data > Quast for Stats

Had a successful assembly with SparseAssembler k101, but figured I’d just tweak the kmer setting and throw it in the queue and see how it compares; minimal effort/time needed.

Initiatied an assembly run using SparseAssembler on our Mox HPC node on all of our geoduck genomic sequencing data:

Kmer size set to 95.

Slurm script: 20180423_sparse_assembler_kmer95_geoduck_slurm.sh

After the run finished, I copied the files to our server (Owl) and then ran Quast on my computer to gather some assembly stats, using the following command:


/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180423_sparse_k95 \
/mnt/owl/Athaliana/20180423_sparseassembler_kmer95_geoduck/Contigs.txt \

Results:

SparseAssembler output folder: 20180423_sparseassembler_kmer95_geoduck/

SparseAsembler assembley (FastA; 15GB): 20180423_sparseassembler_kmer95_geoduck/Contigs.txt

Quast output folder: quast_results/results_2018_05_10_15_04_07

Quast report (HTML): quast_results/results_2018_05_10_15_04_07/report.html

I’ve embedded the Quast HTML report below, but it may be easier to view by using the link above.

Well, it’s remarkable how different this is than the previous SparseAssembler with k101 setting!

This assembly doesn’t have a single contig >50,000bp, while the previous one has four contigs over that threshold!

Definitely shows what a large impact the kmer setting in assembly software can have on the final assembly!

Assembly Stats – Geoduck Genome Assembly Comparisons w/Quast – SparseAssembler, SuperNova, Hi-C

Steven requested a comparison of geoduck genome assemblies.

Ran the following Quast command:

/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180405_sparse_kmer101,supernova_pseudohap_duck4-p,20180421_Hi-C \
/mnt/owl/Athaliana/20180405_sparseassembler_kmer101_geoduck/Contigs.txt \
/mnt/owl//halfshell/bu-mox/analyses/0305b/duck4-p.fasta.gz \
/mnt/owl/Athaliana/20180419_geoduck_hi-c/Results/geoduck_roberts\ results\ 2018-04-21\ 18\:09\:04.514704/PGA_assembly.fasta
Results:

Quast output folder: results_2018_04_30_08_00_42/

Quast report (HTML): results_2018_04_30_08_00_42/report.html

The data’s pretty interesting and cool!

SparseAssembler has over 2x the amount of data (in bas pairs), yet produces the worst assembly.

SuperNova and Hi-C assemblies are very close in nearly all categories. This isn’t surprising, as the SuperNova assembly was used as a reference assembly for the Hi-C assembly.

However, the Hi-C assembly is insanely better than the SuperNova assembly! For example:

  • Largest contig is ~7x larger than the SuperNova assembly.
  • The N50 size is ~243x larger than the SuperNova assembly!!
  • L50 is only 18, 46x smaller than the SuperNova assembly!

This is pretty amazing, honestly. Even more amazing is that this data was sent over to us as some “preliminary” data for us to take a peak at!

Assembly Stats – Geoduck Hi-C Assembly Comparison

Ran the following Quast command to compare the two geoduck assemblies provided to us by Phase Genomics:

/home/sam/software/quast-4.5/quast.py \
-t 24 \
--labels 20180403_pga,20180421_pga \
/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts\ results\ 2018-04-03\ 11\:05\:41.596285/PGA_assembly.fasta \
/mnt/owl/Athaliana/20180421_geoduck_hi-c/Results/geoduck_roberts\ results\ 2018-04-21\ 18\:09\:04.514704/PGA_assembly.fasta
Results:

Quast Output folder: results_2018_04_30_11_16_04/

Quast report (HTML): results_2018_04_30_11_16_04/report.html

The two assemblies are nearly identical. Interesting…

Assembly – SparseAssembler (k 111) on Geoduck Sequence Data

Continuing to try to find the best kmer setting to work with SparseAssemlber after the last attempt failed due to a kmer size that was too large (k 131; which happens to be outside the max kmer size [127] for SparseAssembler), I re-ran SparseAssembler with an arbitrarily selected kmer size < 131 (picked k 111).

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

This failed with the following error message:

Error! K-mer size too large!

Well, this is disappointing. Not entirely sure why this is the case, as it’s below the max kmer setting for SparseAssembler. However, I’m not terribly surprised, as this happened previously (only using NovaSeq data) with a kmer setting of 117.

I’ve posted an issue on the kmergenie GitHub page; we’ll see what happens.

Assembly – SparseAssembler (k 131) on Geoduck Sequence Data

After some runs with kmergenie, I’ve decided to try re-running SparseAssembler using a kmer setting of 131.

The job was run on our Mox HPC node.

Results:

Output folder:

Slurm output file:

This failed with the following error message:

Error! K-mer size too large!

Looking into this, it’s because the maximum kmer size for kmergenie is 127! Doh!

It’d be nice if the program looked at that setting first before processign all the data files…

A bit disappointing, but I’ll give this a go with a lower kmer setting and see how it goes.

Data Management – Geoduck Phase Genomics Hi-C Data

We received sequencing/assembly data from Phase Genomics.

The data contains two assemblies, produced on two different dates.

All data is here: 20180421_geoduck_hi-c

All FASTQ files (four files; Geoduck_HiC*.gz) were copied to Nightingales:

MD5 checksums were verified and appended to the Nightingales checksum file:

Nightingales sequencing inventory was updated (Google Sheet):

The two assemblies (and assembly stats) they provided are here:

I’ve updated the project-geoduck-genome GitHub wiki with this info.