Data Received – Crassostrea virginica MBD BS-seq from ZymoResearch

Received the sequencing data from ZymoResearch for the Crassostrea virginica gonad MBD DNA that was sent to them on 20180207 for bisulfite conversion, library construction, and sequencing.

Here's what was done with the gzipped FASTQ files:

  1. Downloaded to Owl/nightingales/C_virginica
  2. Verified MD5 checksums
  3. Appended MD5 checksums to the checksums.md5 file
  4. file updated
  5. Updated nightingales Google Sheet
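
A minimal sketch of the verify-then-append steps, assuming the sequencing provider supplies checksums in standard md5sum format. The function name verify_and_append and the file name provider_checksums.md5 are hypothetical; this is not the exact set of commands that was run.

```shell
#!/bin/bash
# Sketch only: verify files against a provider-supplied checksum list and,
# if every file passes, append those checksums to a master checksum file.
# Must be run from the directory that contains the downloaded files.

verify_and_append() {
  local provider_md5="$1"  # e.g. provider_checksums.md5 (hypothetical name)
  local master_md5="$2"    # e.g. checksums.md5

  # md5sum -c returns non-zero if any listed file is missing or mismatched
  if md5sum -c --quiet "${provider_md5}"; then
    cat "${provider_md5}" >> "${master_md5}"
  else
    echo "Checksum verification failed; nothing appended." >&2
    return 1
  fi
}

# Example (commented out so that sourcing this file has no side effects):
# verify_and_append provider_checksums.md5 checksums.md5
```

Appending only after a clean verification keeps bad checksums out of the master file.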

Here’s the list of files received:


Here’s the sample processing history:

FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

Yesterday, I transferred our BGI geoduck data to our Mox node and ran it through FASTQC.

I transferred our Illumina HiSeq data sets (*NMP*.fastq.gz) to our Mox node (/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq). These were part of the Illumina-sponsored sequencing project.

I verified the MD5 checksums (not documented) and then ran FASTQC, followed by MultiQC.

FastQC slurm script:

This was followed with MultiQC (locally, after copying the FastQC output to Owl).


FASTQC output: 20180328_illumina_hiseq_geoduck_fastqc

MultiQC output: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data

MultiQC HTML report: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data/multiqc_report.html

Well, lots of fails. There’s a high level of “Per Base N Content” (these are only warnings, but we also haven’t received data with these warnings before). Also, they all fail in the “Overrepresented sequences” analysis.

I’ll run these through TrimGalore! (probably twice), and see how things change.

TrimGalore!/FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Previous FastQC/MultiQC analysis of the geoduck Illumina HiSeq data (*NMP*.fastq.gz files) revealed a high level of overrepresented sequences, high levels of Per Base N Content, failure of Per Sequence GC Content, and a few other bad things.

Ran these through TrimGalore! on our Mox HPC node.

Added an option in TrimGalore! to automatically run FastQC on the trimmed output files.

TrimGalore! slurm script:


Slurm output file: slurm-153098.out

I received a job status email on 20180330:

SLURM Job_id=153098 Name=20180328_trim_galore_geoduck_hiseq Failed, Run time 1-17:22:47, FAILED, ExitCode 141

The slurm output file didn’t indicate any errors, so I restarted the job and contacted UW IT to see if I could get more info.


Here’s their response:

04/02/2018 9:13 AM PDT – Matt

Hi Sam,

Your job died because of a networking hiccup that caused GPFS (/gscratch filesystem and such) to expel the node from the GPFS cluster. It’s a symptom of a known ongoing network issue that we’re actively working with Lenovo/Intel/IBM. Things like this aren’t happening super frequently, but enough that we recognized something was wrong and started investigating with vendors. Unfortunately, your job was unlucky and got bitten by it.

So, in short, you or your job didn’t do anything wrong. If you haven’t already (and if it is possible for your use case), we would highly recommend building in some sort of periodic state-preserving behavior (and a method to “resume”) into your longer-running jobs. Jobs can unexpectedly die for any number of reasons, and it is nice not to lose days of compute progress when that happens.


Well, okay then.
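
For reference, here is one simple way to get the kind of "resume" behavior they describe. This is a sketch under assumptions, not something that was run: it assumes each input file can be processed independently, and the function name run_resumable and the ".done" marker files are hypothetical.

```shell
#!/bin/bash
# Sketch only: run a command once per input file and record success with a
# marker file, so a restarted job skips inputs that already finished.

run_resumable() {
  local cmd="$1"; shift   # command (or shell function) taking one file
  local f
  for f in "$@"; do
    # A marker from an earlier run means this input is already done
    if [ -e "${f}.done" ]; then
      continue
    fi
    # Write the marker only if the command exited successfully
    "${cmd}" "${f}" && touch "${f}.done"
  done
}

# Example (commented out; my_trim_step is a placeholder for a real command):
# run_resumable my_trim_step *.fastq.gz
```

If the job dies partway through, resubmitting the same script picks up at the first input without a marker instead of starting over.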

FastQC/MultiQC – BGI Geoduck Genome Sequencing Data

Since running SparseAssembler seems to be working and actually able to produce assemblies, I’ve decided I’ll try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

I transferred our BGI geoduck FASTQ files to our Mox node (/gscratch/scrubbed/samwhite/bgi_geoduck/).

I ran FASTQC on them to actually check them out and see if they needed any trimming, as I don’t believe this has been done!

FASTQC slurm script:

Side note: Initial FASTQC failed on one file. Turns out, it got corrupted during transfer! Serves as a good reminder about the importance of verifying MD5 checksums after file transfer, prior to attempting to work with files!

This was followed up with MultiQC (run locally from my computer on the files hosted on Owl). This was performed the following day (20180328).


FASTQC output: 20180327_bgi_fastqc

MultiQC output: 20180328_bgi_multiqc

MultiQC HTML report: 20180328_bgi_multiqc/multiqc_report.html

Everything looks nice and clean! Waiting on transfer and FASTQC of Illumina NMP data before proceeding to next assembly attempt.

Titrations – Hollie’s Seawater Samples

All data is deposited in the following GitHub repo:

Sample sizes: ~50g

LabX Method:

Daily pH calibration data file:

Daily pH log file:

Titrant batch:

CRM Batch:

Daily CRM data file:

Sample data file(s):

See metadata file for sample info (including links to master samples sheets):

Assembly – Geoduck NovaSeq using SparseAssembler kmer = 101

The prior run used a kmer size of 61, and the resulting assembly was rather poor (small N50).

For this run, I arbitrarily increased the kmer size to 101, in hopes that this will improve the assembly.

The job was run on our Mox node.

Here’s the batch script to initiate the job:

## Job Name
#SBATCH --job-name=20180322_sparse_assembler_geo_novaseq
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1   
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180322_SparseAssembler_novaseq_geoduck

/gscratch/srlab/programs/SparseAssembler/SparseAssembler \
LD 0 \
NodeCovTh 1 \
EdgeCovTh 0 \
k 101 \
g 15 \
PathCovTh 100 \
GS 2200000000 \
Output folder: 20180322_SparseAssembler_novaseq_geoduck/

This completed much more quickly than the previous run (kmer = 61). The previous assembly took ~10 days, while this assembly completed in ~4 days!

The primary output file of interest is this FASTA file:

In order to get a rough idea of how this assembly looks, I ran it through Quast Version: 4.5, 15ca3b9:

python software/quast-4.5/ \
-t 16

Quast output folder: results_2018_03_27_08_25_52/

Here are the stats on the assembly:

Quast output (text): results_2018_03_27_08_25_52/report.txt

Quast output (HTML): results_2018_03_27_08_25_52/report.html

This is definitely a better assembly than the kmer = 61 assembly.

N50 = 1149

Also, there’s a single, large contig of 56,361bp, and 54 contigs > 25,000bp. This is good.

Admittedly, I’m a little surprised (and disappointed) that the N50 is as small as it is. However, we have a pretty decent assembly on our hands!
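
As a reminder of what N50 measures: sort the contigs from longest to shortest and add up lengths until the running total reaches half of the total assembly length; the length of the contig that crosses that point is the N50. Below is a minimal sketch of that calculation. The function names n50 and fasta_lengths are hypothetical, and this is not how Quast was run here. Quast applies its own minimum contig length cutoff, so its reported N50 may not match a raw calculation like this one.

```shell
#!/bin/bash
# Sketch only: compute N50 from contig lengths.

# fasta_lengths FILE: print the length of each sequence in a FASTA file,
# one per line (sequence lines are summed until the next ">" header).
fasta_lengths() {
  awk '
    /^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
    END  { if (seen) print n }' "$1"
}

# n50: read contig lengths (one per line) on stdin and print the N50.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        # First contig at which the running sum reaches half the total
        if (running >= total / 2) { print len[i]; exit }
      }
    }'
}

# Example (commented out; assembly.fasta is a placeholder file name):
# fasta_lengths assembly.fasta | n50
```

For example, contigs of lengths 6, 5, 4, 3, and 2 total 20 bases; the running sum reaches 10 at the second contig, so the N50 is 5.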

Since SparseAssembler seems to actually run (and, relatively quickly), I’m very tempted to just throw ALL of our geoduck data at it and see how it turns out…

Assembly – Geoduck NovaSeq using SparseAssembler (TL;DR – it worked!)

The prior attempt using SparseAssembler failed due to a kmer size that was deemed too large.

For this run, I arbitrarily reduced the kmer size by ~half (k 61) in hopes that this will just get through an assembly. We can potentially explore the effects of kmer size on assemblies if/when this runs and depending on how the assembly looks.

The job was run on our Mox node.

Here’s the batch script to initiate the job:

## Job Name
#SBATCH --job-name=20180313_sparse_assembler_geo_novaseq
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes (We only get 1, so this is fixed)
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180312_SparseAssembler_novaseq_geoduck

LD 0 
NodeCovTh 1 
EdgeCovTh 0 
k 61 
g 15 
PathCovTh 100 
GS 2200000000 
Output folder: 20180312_SparseAssembler_novaseq_geoduck

IT WORKED!!! At last; we have an assembly of the geoduck NovaSeq data!! It took ~10 days to complete.

The primary output file of interest is this FASTA file:

In order to get a rough idea of how this assembly looks, I ran it through Quast Version: 4.5, 15ca3b9:

python software/quast-4.5/ \
-t 16

Quast output folder: results_2018_03_22_08_12_12

Here are the stats on the assembly:

Quast output (text): results_2018_03_22_08_12_12/report.txt

Quast output (HTML): results_2018_03_22_08_12_12/report.html

Overall, the assembly doesn’t look great. The N50 = 645 is really, really low. One would hope for a much larger number for a quality assembly. As it stands, this assembly is comprised of many small contigs.

Looks like we’ll have to fiddle with the kmer size used for SparseAssembler and see if we can improve upon this.

Despite that, it’s an accomplishment to finally get any sort of assembler to run to completion for this data set!

Titrations – Hollie’s Seawater Samples

All data is deposited in the following GitHub repo:

Sample sizes: ~50g

LabX Method:

Daily pH calibration data file:

Daily pH log file:

Titrant batch:

CRM Batch:

Daily CRM data file:

Sample data file(s):

See metadata file for sample info (including links to master samples sheets):

DNA Isolation & Quantification – Geoduck larvae metagenome filter rinses

Isolated DNA from two of the geoduck hatchery metagenome samples Emma delivered on 20180313 to get an idea of what type of yields we might get from these.

  • MG 5/15 #8
  • MG 5/19 #6

As mentioned in my notebook entry upon receipt of these samples, I’m a bit skeptical we’ll get any sort of recovery, based on sample preservation.

Isolated DNA using DNAzol (MRC, Inc.) in the following manner:

  1. Added 1mL of DNAzol to each sample; mixed by pipetting.
  2. Added 0.5mL of 100% ethanol; mixed by inversion.
  3. Pelleted DNA 5,000g x 5mins @ RT.
  4. Discarded supernatants.
  5. Washed pellets (not visible) with 1mL 75% ethanol by dribbling down side of tubes.
  6. Pelleted DNA 5,000g x 5mins @ RT.
  7. Discarded supernatants and dried pellets for 5mins.
  8. Resuspended DNA in 20uL of Buffer EB (Qiagen).

Samples were quantified using the Roberts Lab Qubit 3.0 with the Qubit High Sensitivity dsDNA Kit (Invitrogen).

5uL of each sample were used.


As expected, neither sample yielded any detectable DNA.

Will discuss with Steven what should be done with the remaining samples.

Titrations – Hollie’s Seawater Samples

Performed total alkalinity (TA) titrations on Hollie’s samples using our T5 Excellence titrator (Mettler Toledo) and Rondolino sample changer.

All data is deposited in the following GitHub repo:

Sample sizes: ~50g

LabX Method:

Daily pH calibration data file:

Daily pH log file:

Titrant batch:

CRM Batch:

Daily CRM data file:

Sample data file(s):

See metadata file for sample info (including links to master samples sheets):