BSseq SNP Analysis - Nextflow EpiDiverse SNP Pipeline for C.virginica CEABIGR BSseq data

Steven asked that I identify SNPs from our C.virginica CEABIGR BSseq data (GitHub Issue). So, I ran sorted, deduplicated Bismark BAMs that Steven generated through the EpiDiverse/snp Nextflow pipeline. The job was run on Mox.

SBATCH script:

#!/bin/bash
## Job Name
#SBATCH --job-name=20220921-cvir-ceabigr-nextflow-epidiverse-snp
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=12-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20220921-cvir-ceabigr-nextflow-epidiverse-snp

# Run EpiDiverse/snp on C.virginica Bismark BAMs generated by Steven.
# Requires a FastA file with extension: .fa
# Requires a FastA index file to be in same directory as FastA.

###################################################################################
# These variables need to be set by user

## Directory with BAM(s)
bams_dir="/gscratch/scrubbed/samwhite/data/C_virginica/BSseq/120321-cvBS"

## Location of EpiDiverse/snp pipeline directory
epi_snp="/gscratch/srlab/programs/epidiverse-pipelines/snp"

## FastA file is required to end with .fa
## Requires FastA index file to be present in same directory as FastA
genome_fasta="/gscratch/srlab/sam/data/C_virginica/genomes/GCF_002022765.2_C_virginica-3.0_genomic.fa"

## Location of Nextflow
nextflow="/gscratch/srlab/programs/nextflow-21.10.6-all"

## Specify desired/needed version of Nextflow
nextflow_version="20.07.1"


###################################################################################


# Exit script if a command fails
set -e

# Load Anaconda
# Uknown why this is needed, but Anaconda will not run if this line is not included.
. "/gscratch/srlab/programs/anaconda3/etc/profile.d/conda.sh"

# Activate NF-core conda environment
conda activate epidiverse-snp_env


## Run EpiDiverse/snp
NXF_VER=${nextflow_version} \
${nextflow} run \
${epi_snp} \
--input ${bams_dir} \
--reference ${genome_fasta} \
--variants \
--clusters

###################################################################################
# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
  echo "Logging program options..."
  for program in "${!programs_array[@]}"
  do
    {
    echo "Program options for ${program}: "
    echo ""
    # Handle samtools help menus
    if [[ "${program}" == "samtools_index" ]] \
    || [[ "${program}" == "samtools_sort" ]] \
    || [[ "${program}" == "samtools_view" ]]
    then
      ${programs_array[$program]}

    # Handle DIAMOND BLAST menu
    elif [[ "${program}" == "diamond" ]]; then
      ${programs_array[$program]} help

    # Handle NCBI BLASTx menu
    elif [[ "${program}" == "blastx" ]]; then
      ${programs_array[$program]} -help
    fi
    ${programs_array[$program]} -h
    echo ""
    echo ""
    echo "----------------------------------------------"
    echo ""
    echo ""
  } &>> program_options.log || true

    # If MultiQC is in programs_array, copy the config file to this directory.
    if [[ "${program}" == "multiqc" ]]; then
      cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
    fi
  done
  echo "Finished logging programs options."
  echo ""
fi


# Document programs in PATH (primarily for program version ID)
echo "Logging system \$PATH..."
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system $PATH."

RESULTS

Run time was actually a bit longer than I was expecting (nearly two days). I think this could be shortened by adjusting some of the Nextflow config settings, as many steps seem limited to 2GB of RAM and 8CPUs for most steps. The pipeline was designed to run this type of stuff on less powerful computers, so I believe that’s why those limits are there, but it’s disappointing the pipeline isn’t designed to scale a bit better.

Screencap of Mox emails showing start/end and a runtime of 1 day, 23hours, 6 minutes, and 36 seconds

Output folder:

Clustering results look interesting (sort of?). They separate out by sex, but then there are definitely within-sex clusters:

Clustering tree generated by EpiDiverse/snp pipeline shows clear separation by sex (samples labeled as "Xnumber[FM]) and then some additional separation within sex clusters.

Will add links to CEABIGR GitHub wiki