Data Wrangling - NanoPore Fast5 Conversion to FastQ of C.bairdi 20102558-2729 Run-02 on Mox with GPU Node

Continuing to work with the NanoPore data that I generated back in January (???!!!). In order to proceed, I first need to convert the raw Fast5 files to FastQ. To do so, I’ll use the NanoPore program guppy. I converted the first run from this flowcell earlier today.

As noted in that previous conversion, using a Mox GPU node decreases processing time by a ridiculous amount compared to using CPUs. The only rub is that, since we don’t own a GPU node, any jobs we submit:

  • are lowest priority in any queue

  • can be interrupted at any time by jobs submitted by the node owner

I’ll be submitting these very early in the morning and with runtimes this fast, I shouldn’t encounter any issues. Exciting!

SBATCH script (GitHub):

#!/bin/bash
## Job Name
#SBATCH --job-name=cbai_guppy_nanopore_20102558-2729
## Allocation Definition
#SBATCH --account=srlab-ckpt
#SBATCH --partition=ckpt
## Resources
## GPU
#SBATCH --gres=gpu:P100:1
#SBATCH --constraint=gpu_default
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=0-01:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20200114_cbai_guppy_nanopore_20102558-2729

## Script for running ONT guppy to perform
## basecalling (i.e. convert raw ONT Fast5 to FastQ) of NanoPore data generated
## on 20200110 from C.bairdi 20102558-2729 gDNA. It is a second run using the same flowcell
## used on 20200110.

## This script utilizes a GPU node. These nodes are only available as part of the checkpoint
## partition/account. Since we don't own a GPU node, our GPU jobs are lowest priority and
## can be interrupted at any time if the node owner submits a new job.

# These variables need to be set by user


# Programs array
declare -A programs_array

# Establish variables for more readable code

# Input files directory

# Output directory

# CPU threads

# Flowcell type

# Sequencing kit used

# GPU devices setting

# Set number of FastQ sequences written per file (0 means all in one file)


# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Load CUDA GPU module
module load cuda/10.1.105_418.39

# Run guppy basecaller (variables quoted to guard against spaces in paths)
${programs_array[guppy_basecaller]} \
--input_path "${fast5_dir}" \
--save_path "${out_dir}" \
--flowcell "${flowcell}" \
--kit "${kit}" \
--device "${GPU_devices}" \
--records_per_fastq "${records_per_fastq}" \
--num_callers "${threads}"


# Document programs in PATH (primarily for program version ID)
{
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr ':' '\n'
} >> system_path.log

# Capture program options
for program in "${!programs_array[@]}"
do
	{
	echo "Program options for ${program}: "
	echo ""
	${programs_array[$program]} --help
	echo ""
	echo ""
	echo "----------------------------------------------"
	echo ""
	echo ""
	} &>> program_options.log || true
done
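
The script relies on a handful of user-set variables (fast5_dir, out_dir, threads, flowcell, kit, GPU_devices, records_per_fastq) and a programs_array entry for guppy_basecaller. Below is a minimal sketch of how that block could be filled in and sanity-checked before any GPU time is requested. Every value shown is a hypothetical placeholder, not a setting used for this run, and the check_required_vars helper is my own illustration rather than part of the original script.

```shell
#!/bin/bash
# Sketch only: all values below are hypothetical placeholders,
# NOT the settings used for this run.

# Programs array (placeholder path)
declare -A programs_array=(
  [guppy_basecaller]="/path/to/guppy_basecaller"
)

fast5_dir="/path/to/fast5_dir"   # placeholder: directory with raw Fast5 files
out_dir="/path/to/output_dir"    # placeholder: where FastQ files get written
threads=1                        # placeholder: value passed to --num_callers
flowcell="FLOWCELL_ID"           # placeholder: flowcell type
kit="KIT_ID"                     # placeholder: sequencing kit
GPU_devices="cuda:0"             # placeholder: value passed to --device
records_per_fastq=0              # placeholder: value passed to --records_per_fastq

# Hypothetical helper: return non-zero if any named variable is unset or empty.
check_required_vars() {
  local name
  local missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name
    if [[ -z "${!name:-}" ]]; then
      echo "Missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Fail early, before guppy is ever called
check_required_vars fast5_dir out_dir threads flowcell kit GPU_devices records_per_fastq
```

Since the script uses set -e, placing the check before the guppy call would stop the job immediately if anything was left blank, instead of burning a checkpoint-queue slot on a run that cannot succeed.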


It took ~6 mins to convert the six Fast5 files:

Fast5 to FastQ conversion runtime with Mox GPU node

Output folder:

Sequencing Summary (4.7MB; TXT)

All the resulting FastQ files can be accessed in the output folder linked above with this pattern:

  • *.fastq

It turns out I misinterpreted the behavior of the program. I thought the FastQs from all of the Fast5 files would be concatenated into a single FastQ. However, that’s not the case: each Fast5 was converted to its own FastQ. So, I now have six FastQ files instead of just one. Not a big deal, as I can concatenate these at a later date.
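
For when I get around to it, here is a minimal sketch of how that concatenation could be done. FastQ is plain text with four lines per record, so simply appending the files end to end should be enough. The concat_fastqs function name and the paths in the usage note are hypothetical, and this assumes the FastQ files are uncompressed.

```shell
#!/bin/bash
# Sketch only: concatenate every *.fastq file in a directory into one file.
# Function name and arguments are hypothetical.
concat_fastqs() {
  local in_dir="$1"     # directory holding the per-Fast5 FastQ files
  local out_file="$2"   # combined FastQ to create
  local fastq

  # Start from an empty output file so a re-run does not append twice
  : > "${out_file}"

  for fastq in "${in_dir}"/*.fastq; do
    # If the glob matched nothing, the literal pattern is returned; skip it
    if [[ ! -e "${fastq}" ]]; then
      continue
    fi
    # Never read the output file back into itself
    if [[ "${fastq}" -ef "${out_file}" ]]; then
      continue
    fi
    cat "${fastq}" >> "${out_file}"
  done
}
```

Usage would look something like concat_fastqs /path/to/output_dir /path/to/combined.fastq (both paths are placeholders). A quick check afterwards is that the line count of the combined file equals the sum of the line counts of the inputs, and that it is divisible by four.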

Next, I’ll run these through some QC software (FastQC, NanoPlot) to get an idea of how things look before processing them further.