Tag Archives: FASTQC

TrimGalore/FastQC/MultiQC – 14bp Trim C.virginica MBD BS-seq FASTQ data

Yesterday, I ran TrimGalore/FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch with the default settings (i.e. “auto-trim”). There was still some variability in the first ~15bp of the reads and Steven wanted to see how a hard trim would change things.

I ran TrimGalore (using the built-in FastQC option) with a hard trim of the first 14bp of each read, and followed up with MultiQC for a summary of the FastQC reports.
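The actual job script is linked below; the hard trim itself boils down to Trim Galore's built-in clipping options. A minimal sketch (the output directory name is hypothetical and only one sample pair is shown):

# Hard-trim the first 14bp of R1 and R2, auto-detect/remove adapters,
# and run FastQC on the trimmed output
trim_galore \
--paired \
--clip_R1 14 \
--clip_R2 14 \
--fastqc \
--output_dir trimgalore_14bp_Cvirginica_MBD \
zr2096_1_s1_R1.fastq.gz zr2096_1_s1_R2.fastq.gz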

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:
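Both of those steps are just standard shell one-liners; roughly (the script and checksum file names are hypothetical, and the glob assumes Trim Galore's default *_val_*.fq.gz naming for paired output):

# Redirect stderr from the TrimGalore job script to a log file
bash trimgalore_14bp_Cvirginica_MBD.sh 2> trimgalore_14bp_Cvirginica_MBD.err

# Generate MD5 checksums for the trimmed FASTQ files
md5sum *_val_1.fq.gz *_val_2.fq.gz > trimmed_fastq_checksums.md5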

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer (data not shown).

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

OK, this trimming definitely took care of the variability seen in the first ~15bp of all the reads.

However, I noticed that the last 2bp of each of the Read 1 seqs all have some wonky stuff going on. I’m guessing I should probably trim that stuff off, too…
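If that 3′ trimming does turn out to be necessary, Trim Galore has a built-in option for it; a sketch I have not run (one sample pair shown):

# Remove the last 2bp from each Read 1 sequence after adapter/quality trimming
trim_galore \
--paired \
--three_prime_clip_R1 2 \
--fastqc \
zr2096_1_s1_R1.fastq.gz zr2096_1_s1_R2.fastq.gz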

TrimGalore/FastQC/MultiQC – Auto-trim C.virginica MBD BS-seq FASTQ data

Yesterday, I ran FastQC/MultiQC on the Crassostrea virginica MBD BS-seq data from ZymoResearch. Steven wanted to trim it and see how things turned out.

I ran TrimGalore (using the built-in FastQC option) and followed up with MultiQC for a summary of the FastQC reports.
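The default (“auto-trim”) run is essentially Trim Galore with adapter auto-detection plus the built-in FastQC call; roughly (one sample pair shown):

# Default run: adapter auto-detection, quality trimming, and FastQC on the trimmed output
trim_galore \
--paired \
--fastqc \
zr2096_1_s1_R1.fastq.gz zr2096_1_s1_R2.fastq.gz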

TrimGalore job script:

Standard error was redirected on the command line to this file:

MD5 checksums were generated on the resulting trimmed FASTQ files:

All data was copied to my folder on Owl.

Checksums for FASTQ files were verified post-data transfer.

Results:

Output folder:

FastQC output folder:

MultiQC output folder:

MultiQC HTML report:

Overall, the auto-trim didn’t alter things too much. Specifically, Steven is concerned about the variability in the first 15bp (seen in the Per Base Sequence Content section of the MultiQC output). It was reduced, but not greatly. Will perform an independent run of TrimGalore with a hard trim of the first 14bp of each read and see how that looks.

FastQC/MultiQC – C. virginica MBD BS-seq Data

Per Steven’s GitHub Issues request, I ran FastQC on the Eastern oyster MBD bisulfite sequencing data we recently got back from ZymoResearch.

Ran FastQC locally with the following script: 20180409_fastqc_Cvirginica_MBD.sh


#!/bin/bash

# Run FastQC on all 20 MBD BS-seq FASTQ files (R1 and R2 for each of the 10 samples),
# using 18 threads and writing reports to the 20180409 output directory
/home/sam/software/FastQC/fastqc \
--threads 18 \
--outdir /home/sam/20180409_fastqc_Cvirginica_MBD \
/mnt/owl/nightingales/C_virginica/zr2096_10_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_10_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_1_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_1_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_2_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_2_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_3_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_3_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_4_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_4_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_5_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_5_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_6_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_6_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_7_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_7_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_8_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_8_s1_R2.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_9_s1_R1.fastq.gz \
/mnt/owl/nightingales/C_virginica/zr2096_9_s1_R2.fastq.gz

MultiQC was then run on the FastQC output files.
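MultiQC only needs to be pointed at the directory containing the FastQC reports; something along these lines (the exact output location may differ from the folders linked below):

# Aggregate all FastQC reports in the output directory into a single MultiQC report
multiqc /home/sam/20180409_fastqc_Cvirginica_MBD \
--outdir /home/sam/20180409_fastqc_Cvirginica_MBD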

All files were moved to Owl after the jobs completed.

Results:

FastQC Output folder: 20180409_fastqc_Cvirginica_MBD/

MultiQC Output folder: 20180409_fastqc_Cvirginica_MBD/multiqc_data/

MultiQC report (HTML): 20180409_fastqc_Cvirginica_MBD/multiqc_data/multiqc_report.html

Everything looks good to me.

Steven’s interested in seeing what the trimmed output would look like (and, how it would impact mapping efficiencies). Will initiate trimming.

See the GitHub issue linked above for the full discussion.

TrimGalore!/FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data Continued

The previous attempt at this was interrupted by a random glitch with our Mox HPC node.

I removed the last files processed by TrimGalore!, just in case they were incomplete. I updated the slurm script to process only the remaining files that had not been processed when the Mox glitch happened (including the files I deemed “incomplete”).
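The “process only the remaining files” logic can be as simple as skipping any read pair that already has trimmed output; a sketch of the idea (assumes an _R1/_R2 naming convention, Trim Galore’s default _val_1.fq.gz output naming, and that the loop runs from the output directory):

# Re-run Trim Galore only on read pairs that don't already have trimmed output
for r1 in /gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/*_R1*.fastq.gz
do
  r2=${r1/_R1/_R2}
  base=$(basename "${r1}" .fastq.gz)
  # Skip pairs that were already processed before the glitch
  if [ -e "${base}_val_1.fq.gz" ]; then
    continue
  fi
  trim_galore --paired --fastqc "${r1}" "${r2}"
done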

As in the initial run, I kept the option in TrimGalore! to automatically run FastQC on the trimmed output files.

TrimGalore! slurm script: 20180401_trim_galore_illumina_geoduck_hiseq_slurm.sh

MultiQC was run locally once the files were copied to Owl.

Results:

Job completed on 20180404.

Trimmed FASTQs: 20180328_trim_galore_illumina_hiseq_geoduck/

MD5 checksums: 20180328_trim_galore_illumina_hiseq_geoduck/checksums.md5

  • MD5 checksums were generated on Mox node and verified after copying to Owl.

Slurm output file: 20180401_trim_galore_illumina_geoduck_hiseq_slurm.sh

TrimGalore! output: 20180328_trim_galore_illumina_hiseq_geoduck/20180404_trimgalore_reports/

FastQC output: 20180328_trim_galore_illumina_hiseq_geoduck/20180328_fastqc_trimmed_hiseq_geoduck/

MultiQC output: 20180328_trim_galore_illumina_hiseq_geoduck/20180328_fastqc_trimmed_hiseq_geoduck/multiqc_data/

MultiQC HTML report: 20180328_trim_galore_illumina_hiseq_geoduck/20180328_fastqc_trimmed_hiseq_geoduck/multiqc_data/multiqc_report.html

Trimming completed and the FastQC results look much better than before.

Will proceed with full-blown assembly!

FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Since SparseAssembler seems to be working and is actually able to produce assemblies, I’ve decided to try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

Yesterday, I transferred our BGI geoduck data to our Mox node and ran it through FASTQC.

I transferred our Illumina HiSeq data sets (*NMP*.fastq.gz) to our Mox node (/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq). These were part of the Illumina-sponsored sequencing project.

I verified the MD5 checksums (not documented) and then ran FASTQC, followed by MultiQC.

FastQC slurm script: 20180328_fastqc_illumina_geoduck_hiseq_slurm.sh
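The linked script is the authoritative version; for orientation, here’s a stripped-down sketch of what an sbatch wrapper for FastQC on Mox looks like (account, partition, resource requests, and program/output paths are placeholders, not the actual values from the script):

#!/bin/bash
#SBATCH --job-name=fastqc_illumina_geoduck_hiseq
#SBATCH --nodes=1
#SBATCH --time=1-00:00:00
#SBATCH --mem=100G
#SBATCH --account=placeholder_account
#SBATCH --partition=placeholder_partition

# Run FastQC on all of the HiSeq NMP FASTQ files, writing reports to a single output directory
fastqc \
--threads 28 \
--outdir /gscratch/scrubbed/samwhite/20180328_illumina_hiseq_geoduck_fastqc \
/gscratch/scrubbed/samwhite/illumina_geoduck_hiseq/*NMP*.fastq.gz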

This was followed with MultiQC (run locally, after copying the FastQC output to Owl).

Results:

FASTQC output: 20180328_illumina_hiseq_geoduck_fastqc

MultiQC output: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data

MultiQC HTML report: 20180328_illumina_hiseq_geoduck_fastqc/multiqc_data/multiqc_report.html

Well, lots of fails. A high level of “Per Base N Content” (these are only warnings, but we haven’t received data with these warnings before). Also, they all fail the “Overrepresented sequences” analysis.

I’ll run these through TrimGalore! (probably twice), and see how things change.

TrimGalore!/FastQC/MultiQC – Illumina HiSeq Genome Sequencing Data

Previous FastQC/MultiQC analysis of the geoduck Illumina HiSeq data (NMP.fastq.gz files) revealed a high level of overrepresented sequences, high levels of Per Base N Content, failure of Per Sequence GC Content, and a few other bad things.

Ran these through TrimGalore! on our Mox HPC node.

Added an option in TrimGalore! to automatically run FastQC on the trimmed output files.

TrimGalore! slurm script: 20180328_trim_galore_illumina_geoduck_hiseq_slurm.sh

Results:

Slurm output file: slurm-153098.out

I received a job status email on 20180330:

SLURM Job_id=153098 Name=20180328_trim_galore_geoduck_hiseq Failed, Run time 1-17:22:47, FAILED, ExitCode 141

The slurm output file didn’t indicate any errors, so I restarted the job and contacted UW IT to see if I could get more info.

UPDATE

Here’s their response:

04/02/2018 9:13 AM PDT – Matt

Hi Sam,

Your job died because of a networking hiccup that caused GPFS (/gscratch filesystem and such) to expel the node from the GPFS cluster. It’s a symptom of a known ongoing network issue that we’re actively working with Lenovo/Intel/IBM. Things like this aren’t happening super frequently, but enough that we recognized something was wrong and started investigating with vendors. Unfortunately, your job was unlucky and got bitten by it.

So, in short, you or your job didn’t do anything wrong. If you haven’t already (and if it is possible for your use case), we would highly recommend building in some sort of periodic state-preserving behavior (and a method to “resume”) into your longer-running jobs. Jobs can unexpectedly die for any number of reasons, and it is nice not to lose days of compute progress when that happens.

-Matt

Well, okay then.

FastQC/MultiQC – BGI Geoduck Genome Sequencing Data

Since SparseAssembler seems to be working and is actually able to produce assemblies, I’ve decided to try to beef up the geoduck genome assembly with the rest of our existing genomic sequencing data.

I transferred our BGI geoduck FASTQ files to our Mox node (/gscratch/scrubbed/samwhite/bgi_geoduck/).

I ran FASTQC on them to actually check them out and see if they needed any trimming, as I don’t believe this has been done!

FASTQC slurm script: 20180327_fastqc_bgi_geoduck_slurm.sh

Side note: Initial FASTQC failed on one file. Turns out, it got corrupted during transfer! Serves as a good reminder of the importance of verifying MD5 checksums after file transfer, prior to attempting to work with the files!
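That verification is a quick one-liner against the checksum file that travels with the data; for example (the checksum file name is hypothetical):

# Compare each file's current MD5 against the recorded values;
# any corrupted file will be reported as FAILED
md5sum -c checksums.md5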

This was followed up with MultiQC (run locally from my computer on the files hosted on Owl). This was performed the following day (20180328).

Results:

FASTQC output: 20180327_bgi_fastqc

MultiQC output: 20180328_bgi_multiqc

MultiQC HTML report: 20180328_bgi_multiqc/multiqc_report.html

Everything looks nice and clean! Waiting on transfer and FASTQC of Illumina NMP data before proceeding to next assembly attempt.

Adapter Trimming and FASTQC – Illumina Geoduck Novaseq Data

We would like to get an assembly of the geoduck NovaSeq data that Illumina provided us with.

Steven previously ran the raw data through FASTQC and there was a significant amount of adapter contamination (up to 44% in some libraries) present (see his FASTQC report here).

So, I trimmed them using TrimGalore and re-ran FASTQC on them.

This required two rounds of trimming using the “auto-detect” feature of Trim Galore.

  • Round 1: remove NovaSeq adapters
  • Round 2: remove standard Illumina adapters

See Jupyter notebook below for the gritty details.
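The core of it is simply running Trim Galore twice with adapter auto-detection, with the second pass run on the first pass’s output; a simplified sketch (file names are hypothetical and assume Trim Galore’s default _val_* naming; the real commands live in the notebook below):

# Round 1: auto-detect and remove the NovaSeq adapters
trim_galore --paired --fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# Round 2: re-trim the round 1 output to remove standard Illumina adapters
trim_galore --paired --fastqc sample_R1_val_1.fq.gz sample_R2_val_2.fq.gz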

Results:

All data for this NovaSeq assembly project can be found here: http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/.

Round 1 Trim Galore reports: [20180125_trim_galore_reports/](http://owl.fish.washington.edu/Athaliana/20180125_geoduck_novaseq/20180125_trim_galore_reports/)
Round 1 FASTQC: 20180129_trimmed_multiqc_fastqc_01
Round 1 FASTQC MultiQC overview: 20180129_trimmed_multiqc_fastqc_01/multiqc_report.html

Round 2 Trim Galore reports: 20180125_geoduck_novaseq/20180205_trim_galore_reports/
Round 2 FASTQC: 20180205_trimmed_fastqc_02/
Round 2 FASTQC MultiQC overview: 20180205_trimmed_multiqc_fastqc_02/multiqc_report.html

The astute observer might notice that “Per Base Sequence Content” generates a “Fail” warning for all samples. Per the FASTQC help, this is likely expected (because NovaSeq libraries are prepared using transposases) and doesn’t have any downstream impact on analyses.

Jupyter Notebook (GitHub): 20180125_roadrunner_trimming_geoduck_novaseq.ipynb

FASTQC – Oly BGI GBS Raw Illumina Data Demultiplexed

Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC wasn’t pleased with the data, because it expects a “standard” library prep that has already been trimmed and demultiplexed.

However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked (GitHub) and embedded below; I recommend viewing it on GitHub for easier reading.

Results:

Pretty much the same, but with slight improvements due to the removal of adapter and barcode sequences. The restriction site still leads FastQC to report errors, which is expected.

Links to all of the FastQC output files are at the bottom of the notebook.

Jupyter notebook (GitHub): 20170306_docker_fastqc_demultiplexed_bgi_oly_gbs.ipynb

FASTQC – Oly BGI GBS Raw Illumina Data

In getting things prepared for the manuscript we’re writing about the Olympia oyster genotype-by-sequencing data from BGI, I felt we needed to provide a FastQC analysis of the raw data (since these two files are what we submitted to the NCBI short read archive) to provide support for the Technical Validation section of the manuscript.

Below is the Jupyter notebook I used to run the FastQC analysis on the two files. I’ve embedded it for quick viewing, but it might be easier to view the notebook via the GitHub link.
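Since the notebook runs FastQC from within a Docker container, the core of it is a docker run that mounts the data directory; a rough sketch (the image name and file paths are placeholders, not the actual container or files used in the notebook):

# Run FastQC inside a Docker container, mounting the local directory holding the raw FASTQ files
docker run --rm \
-v /path/to/oly_bgi_gbs:/data \
placeholder/fastqc-image \
fastqc --outdir /data /data/raw_reads_1.fq.gz /data/raw_reads_2.fq.gz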

Results:

Well, I realized that running FastQC on the raw data might not reveal anything too helpful. The reason is that the adapter and barcode sequences are still present on all the reads. This will lead to over-representation of these sequences in all of the samples, which, in turn, will skew FastQC’s interpretation of the read qualities. For comparison, I’ll run FastQC on the demultiplexed data provided by BGI and see what the FastQC report looks like on trimmed data.

However, I’ll need to discuss with Steven whether or not providing the FastQC analysis is worthwhile as part of the “technical validation” aspect of the manuscript. I guess it can’t hurt to provide it, but I’m not entirely sure that the FastQC report provides any real information regarding the quality of the sequencing reads that we received…

Jupyter notebook (GitHub): 20170301_docker_fastqc_nondemultiplexed_bgi_oly_gbs.ipynb