FastQC-MultiQC - M.magister MBD-BSseq Pool Test MiSeq Run on Mox

Earlier today we received the M.magister (C.magister; Dungeness crab) MiSeq data from Mac.

I ran FastQC and MultiQC on Mox.

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=20201211_mmag_fastqc_multiqc_mbd-bsseq_miseq
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20201211_mmag_fastqc_multiqc_mbd-bsseq_miseq


### FastQC assessment of raw MiSeq sequencing test run for
### MBD-BSseq pool of M.magister samples from 20201202.


###################################################################################
# These variables need to be set by user

# FastQC output directory
output_dir=$(pwd)

# Set number of CPUs to use
threads=28

# Input/output files
checksums=fastq_checksums.md5
fastq_list=fastq_list.txt
raw_reads_dir=/gscratch/srlab/sam/data/C_magister/MBD-BSseq/

# Paths to programs
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc


# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}" \
[multiqc]="${multiqc}"
)

###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${raw_reads_dir}"CH*.fastq.gz .

# Populate array with FastQ files
fastq_array=(CH*.fastq.gz)

# Pass array contents to new variable
fastqc_list=$(echo "${fastq_array[*]}")

# Run FastQC
# NOTE: Do NOT quote ${fastqc_list}
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${fastqc_list}


# Create list of fastq files used in analysis
echo "${fastqc_list}" | tr " " "\n" >> ${fastq_list}

# Generate checksums for reference
while read -r line
do

	# Generate MD5 checksums for each input FastQ file
	echo "Generating MD5 checksum for ${line}."
	md5sum "${line}" >> "${checksums}"
	echo "Completed: MD5 checksum for ${line}."
	echo ""

	# Remove fastq files from working directory
	echo "Removing ${line} from directory"
	rm "${line}"
	echo "Removed ${line} from directory"
	echo ""
done < ${fastq_list}

# Run MultiQC
${programs_array[multiqc]} .


# Capture program options
for program in "${!programs_array[@]}"
do
	{
  echo "Program options for ${program}: "
	echo ""
  # Handle samtools help menus
  if [[ "${program}" == "samtools_index" ]] \
  || [[ "${program}" == "samtools_sort" ]] \
  || [[ "${program}" == "samtools_view" ]]
  then
    ${programs_array[$program]}
  fi
	${programs_array[$program]} -h
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
} &>> program_options.log || true

  # If MultiQC is in programs_array, copy the config file to this directory.
  if [[ "${program}" == "multiqc" ]]; then
  	cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
  fi
done


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Runtime was fast, ~3.5mins:

Cumulative runtime for FastQC and MultiQC on C.magister MiSeq data

Will add FastQC report links to Nightingales spreadsheet (Google Sheet) for those that did not fail.

NOTE: This post was updated on 20201217 using a newly transferred set of FastQs that Mac set up. See the previous commit of this post for more info.

Output folder:

MultiQC Report (HTML; open with web browser):

Individual FastQC reports can be found by browsing the output folder linked above and/or by clicking through the MultiQC report that’s linked above.

This test run was to help evaluate evenness of the sample pooling, as well as identify any other possible issues. Evenness appears OK (not great), but I’m not entirely sure how this would be addressed, as an aliquot of each library was created at a concentration of 4nM and then 1uL of each of these aliquots was combined. Is it safe to assume that any sequencing biases leading to preferential library sequencing is due to the individual libraries? And, that this can be adjusted for when making the final pooling that gets sent off for a full sequencing run? Also, I’m a bit surprised at the high levels of adapter content. I’m curious how these data will look after trimming. Anyway, at this point, I’ll let Laura Spencer and Mac make decisions about going forward with a full sequencing run, as it’s really their project anyway.