SRA Data - S.namaycush SRA BioProject PRJNA674328 Download and QC

Per this GitHub Issue, which I accidentally forgot about for three weeks (!), Steven wanted me to download the lake trout (Salvelinus namaycush) RNAseq data from NCBI BioProject PRJNA674328 and run QC on the data.

Prior to QC, I had to get the data:

Acquired accessions list file:

Screenshot of NCBI BioProject PRJNA674328 page showing dropdown menu to generate accessions list file

Downloaded associated SRA runs (done on Mox using a build node):

/gscratch/srlab/programs/sratoolkit.2.11.3-centos_linux64/bin/prefetch \
--option-file /gscratch/srlab/sam/data/S_namaycush/RNAseq/SraAccList-PRJNA316738.txt

Converted .sra files to FastQ (done on Mox using a build node):

for file in *.sra; do /gscratch/srlab/programs/sratoolkit.2.11.3-centos_linux64/bin/fasterq-dump "${file}"; done

FastQC, fastp, and MultiQC were run on Mox.

SBATCH script (GitHub):

20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738.sh

#!/bin/bash
## Job Name
#SBATCH --job-name=20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=2-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738

### FastQC and fastp trimming of lake trout (Salvelinus namaycush) SRA BioProject PRJNA674328 RNAseq data.

### fastp expects input FastQ files to be in format: *.fastp-trim.20220706.fq.gz



###################################################################################
# These variables need to be set by user

## Assign Variables

# Set number of CPUs to use
threads=40

# Input/output files
trimmed_checksums=trimmed_fastq_checksums.md5
reads_dir=/gscratch/srlab/sam/data/S_namaycush/RNAseq/
fastq_checksums=input_fastq_checksums.md5

# FastQC output directory
output_dir=$(pwd)

# Paths to programs
fastp=/gscratch/srlab/programs/fastp-0.20.0/fastp
fastqc=/gscratch/srlab/programs/fastqc_v0.11.9/fastqc
multiqc=/gscratch/srlab/programs/anaconda3/bin/multiqc

## Inititalize arrays
fastq_array_R1=()
R1_names_array=()
trimmed_fastq_array_R1=()


# Programs associative array
declare -A programs_array
programs_array=(
[fastqc]="${fastqc}"
[fastp]="${fastp}" \
[multiqc]="${multiqc}"
)


###################################################################################

# Exit script if any command fails
set -e

# Load Python Mox module for Python module availability
module load intel-python3_2017

# Capture date
timestamp=$(date +%Y%m%d)

# Sync raw FastQ files to working directory
rsync --archive --verbose \
"${reads_dir}"*.fastq .

# Create arrays of fastq files and sample names
for fastq in *.fastq
do
  fastq_array_R1+=("${fastq}")
  R1_names_array+=("$(echo "${fastq}" | awk 'FS = "." {print $1}')")
done

# Pass array contents to new variable
raw_fastqc_list=$(echo "${fastq_array_R1[*]}")

# Create MD5 checksums for reference
for fastq in *.fastq
do
  echo "Generating checksum for ${fastq}"
  md5sum ${fastq} | tee --append ${fastq_checksums}
  echo ""
done

# Run FastQC
# NOTE: Do NOT quote ${raw_fastqc_list}
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${raw_fastqc_list}

# Run fastp on files
# Adds JSON report output for downstream usage by MultiQC
for index in "${!fastq_array_R1[@]}"
do
  R1_sample_name=$(echo "${R1_names_array[index]}")
  ${programs_array[fastp]} \
  --in1 ${fastq_array_R1[index]} \
  --thread ${threads} \
  --html "${R1_sample_name}".trimmed."${timestamp}".report.html \
  --json "${R1_sample_name}".trimmed."${timestamp}".report.json \
  --out1 "${R1_sample_name}".trimmed."${timestamp}".fq.gz

  # Generate md5 checksums for newly trimmed files
  {
      md5sum "${R1_sample_name}".trimmed."${timestamp}".fq.gz
  } >> "${trimmed_checksums}"
done



### Run FastQC
### NOTE: Do NOT quote ${trimmed_fastqc_list}

# Create array of trimmed FastQs
trimmed_fastq_array=(*trimmed*.fq.gz)

# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastq_array[*]}")

# Run FastQC
${programs_array[fastqc]} \
--threads ${threads} \
--outdir ${output_dir} \
${trimmed_fastqc_list}

# Run MultiQC
${programs_array[multiqc]} .

####################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
fi


# Document programs in PATH (primarily for program version ID)
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log

RESULTS

Run time was fast, ~50 minutes:

Screencap showing runtime of Mox job

Output folder:

20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738/

Since there are 72 FastQ files associated with this BioProject, I won’t list them all here. Raw FastQs can be retrieved with the following bash wildcard pattern:

*.fastq

Trimmed FastQs can be retrieved with the following bash wildcard pattern:

*trimmed*.fq.gz

Raw FastQ MD5s (text):

20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738/input_fastq_checksums.md5

Trimmed FastQ MD5s (text):

20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738/trimmed_fastq_checksums.md5

MultiQC Report (HTML):

20220706-snam-fastqc-fastp-multiqc-rnaseq-PRJNA316738/multiqc_report.html