Tag Archives: mox

Transcriptome Alignment & Bedgraph – Olympia oyster transcriptome with Olurida_v080 genome assembly

Yesterday, I produced a bedgraph file of our Olympia oyster RNAseq data coverage using our Olurida_v081 genome.

I decided that I wanted to use the Olurida_v080 version instead (or, in addition to?), as the Olurida_v080 version has not been size restricted (the Olurida_v081 version includes only contigs >1000bp). I feel like we could miss some important regions, so I wanted to run this analysis using all of the genome data we currently have available. Additionally, this will be consistent with my previous Bismark (DNA methylation) analysis.

Used HISAT2 on our HPC Mox node to align our RNAseq reads to our Olurida_v080 genome assembly:

SBATCH script file:

NOTE: For brevity's sake, I have not listed all of the input RNAseq files below. Please see the full script, which is linked above.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_hisat2
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_genome_hisat2_bedgraph

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Set genome assembly path
oly_genome_path=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=20180926_Olurida_v080.sorted.bam

# Set hisat2 basename
hisat2_basename=Olurida_v080

# Set program paths
## hisat2
hisat2=/gscratch/srlab/programs/hisat2-2.1.0

## bedtools
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin

## samtools
stools=/gscratch/srlab/programs/samtools-1.9/samtools

# Build hisat2 genome index
${hisat2}/hisat2-build \
-f ${oly_genome_path}/Olurida_v080.fa \
Olurida_v080 \
-p 28

# Align reads to oly genome assembly
${hisat2}/hisat2 \
--threads 28 \
-x "${hisat2_basename}" \
-q \
-1 \
-2 \
-S 20180926_"${hisat2_basename}".sam

# Convert SAM file to BAM
"${stools}" view \
--threads 28 \
-b 20180926_"${hisat2_basename}".sam > 20180926_"${hisat2_basename}".bam

# Sort BAM
"${stools}" sort \
--threads 28 \
20180926_"${hisat2_basename}".bam \
-o 20180926_"${hisat2_basename}".sorted.bam

# Index for use in IGV
## -@ specifies thread count; the --threads option is not available for samtools index
"${stools}" index \
-@ 28 \
20180926_"${hisat2_basename}".sorted.bam


# Create bedgraph
## Reports depth at each position in bedgraph format, including regions with zero coverage (-bga).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-ibam ${oly_transcriptome_bam} \
-bga \
-split \
-trackline \
> 20180926_oly_RNAseq.bedgraph
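
For reference, the -1 and -2 arguments above each take a comma-separated list of FastQ files (R1 files to -1, matching R2 files to -2); the actual file lists are in the full script linked above. A minimal sketch of the same command, with made-up filenames standing in for the real inputs, looks like:

# Sketch only; sample_01/sample_02 filenames are placeholders, not the actual inputs
${hisat2}/hisat2 \
--threads 28 \
-x "${hisat2_basename}" \
-q \
-1 sample_01_R1.fq.gz,sample_02_R1.fq.gz \
-2 sample_01_R2.fq.gz,sample_02_R2.fq.gz \
-S 20180926_"${hisat2_basename}".sam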

The script performs the following functions:

  • Genome indexing
  • RNAseq alignment to genome
  • Convert SAM to BAM
  • Sort and index BAM
  • Determine RNAseq coverage

RESULTS

Output folder:

Bedgraph file (1.9GB):

Loaded into IGV to verify things looked OK:

Bedgraph – Olympia oyster transcriptome with Olurida_v081 genome assembly

I took the sorted BAM file from yesterday’s corrected RNAseq genome alignment and converted it to a bedgraph using BEDTools genomeCoverageBed tool.

Analysis took place on our HPC Mox node.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=/gscratch/scrubbed/samwhite/20180925_oly_RNAseq_genome_hisat2/20180925_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools


# Create bedgraph
## Reports depth at each position in bedgraph format, including regions with zero coverage (-bga).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-ibam ${oly_transcriptome_bam} \
-bga \
-split \
-trackline \
> 20180926_oly_RNAseq.bedgraph

RESULTS

Output folder:

Bedgraph file (1.2GB):

Loaded into IGV to verify things looked OK:

Transcriptome Alignment – Olympia oyster RNAseq reads aligned to genome with HISAT2

Yesterday’s attempt at producing a bedgraph was a failure and the product of a major brain fart. The worst part is that I was questioning what I was doing the entire time, but still went through with the process! Yeesh!

The problem was that I tried to take our Trinity-assembled transcriptome and somehow align that to our genome. This can’t work because the two assemblies don’t share a coordinate system, so neither one knows the coordinates used by the other. As a result, you end up with a bedgraph that shows zero coverage for all genome contigs.
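
A quick way to confirm this is to count the bedgraph intervals with non-zero coverage (column 4). Run against the bedgraph produced by the failed pipeline below, this should come back as zero (skipping the track line in row 1):

awk 'NR > 1 && $4 > 0' 20180924_oly_RNAseq.bedgraph | wc -l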

Anyway, here’s the correct procedure!

Used HISAT2 on our HPC Mox node to align our RNAseq reads to our Olurida_v081 genome assembly:

SBATCH script files:

PERFORM GENOME INDEXING & ALIGNMENT
20180925_oly_RNAseq_genome_hisat2.sh

SORT & INDEX ALIGNMENT OUTPUT
20180925_oly_RNAseq_genome_sort_bam.sh


RESULTS

Output folder:

Sorted BAM file (58GB):

Will get the sorted BAM file converted to a bedgraph showing genome coverage for use in IGV.

Bedgraph – Olympia oyster transcriptome (FAIL)

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Then, I aligned the assembled transcriptome to our genome using Bowtie2.

Finally, I used BEDTools to convert the BAM to BED to bedgraph.

This required an initial indexing of our Olympia oyster genome FastA using the samtools faidx tool.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180924_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180924_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Set genome assembly FastA
oly_genome_fasta=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa

# Set indexed genome assembly file
oly_genome_indexed=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa.fai

# Set sorted transcriptome assembly bam file
oly_transcriptome=/gscratch/scrubbed/samwhite/20180919_oly_transcriptome_bowtie2/20180919_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools

# Index genome FastA
${samtools} faidx ${oly_genome_fasta}

# Format indexed genome for bedtools
## Requires only two columns: name and length (tab-separated)
awk -v OFS='\t' '{print $1,$2}' ${oly_genome_indexed} > Olurida_v081.fa.fai.genome
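## The resulting .genome file should look something like this (tab-separated; the first two
## contig lengths from the Olurida_v081 assembly shown for illustration):
## Contig0	116746
## Contig1	87411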

# Create bed file
${bedtools}/bamToBed \
-i ${oly_transcriptome} \
> 20180924_oly_RNAseq.bam.bed


# Create bedgraph
## Reports depth at each position in bedgraph format, including regions with zero coverage (-bga).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-i ${PWD}/20180924_oly_RNAseq.bam.bed \
-g Olurida_v081.fa.fai.genome \
-bga \
-split \
-trackline \
> 20180924_oly_RNAseq.bedgraph

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Indexed and formatted genome file:

Bedgraph file (for IGV):


This doesn’t appear to have worked properly. Here’s a view of the bedgraph file:


track type=bedGraph
Contig0 0   116746  0
Contig1 0   87411   0
Contig2 0   139250  0
Contig3 0   141657  0
Contig4 0   95692   0
Contig5 0   130522  0
Contig6 0   94893   0
Contig7 0   109667  0
Contig8 0   95943   0

I’d expect multiple entries for each contig (ideally), indicating start/stop positions for where transcripts align within a given contig. However, this appears to simply be a list of all the genome contigs and their lengths (Start=0, Stop=n).

I would expect to see something more like this instead (made-up coverage values, purely to illustrate the format):
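
track type=bedGraph
Contig0 0       1500    0
Contig0 1500    2250    12
Contig0 2250    2900    35
Contig0 2900    116746  0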

I’ll look into this further and see where this pipeline goes wrong.

Transcriptome Alignment – Olympia oyster Trinity transcriptome aligned to genome with Bowtie2

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Next up, align transcriptome to Olympia oyster genome.

Alignment and creation of BAM files was done using Bowtie2 on our HPC Mox node.
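
The full SBATCH script is linked below; as a rough sketch (not the exact commands), the core steps would have been along these lines, treating the Trinity FastA as unpaired FastA input and sorting/indexing the output with samtools:

# Build a Bowtie2 index from the genome assembly (paths are placeholders)
bowtie2-build Olurida_v081.fa Olurida_v081

# Align the Trinity transcriptome FastA (-f) as unpaired input (-U) to the genome index
bowtie2 \
--threads 28 \
-f \
-x Olurida_v081 \
-U Trinity.fasta \
-S 20180919_Olurida_v081.sam

# Convert to BAM, sort, and index for IGV
samtools view -b 20180919_Olurida_v081.sam | samtools sort -o 20180919_Olurida_v081.sorted.bam
samtools index 20180919_Olurida_v081.sorted.bam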

SBATCH script file:

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Sorted BAM file:

Sorted & indexed BAM file (for IGV):

Will get the sorted BAM file converted to a bedgraph for use in IGV.

Transcriptome Assembly – Olympia oyster RNAseq Data with Trinity

Used all of our current oly RNAseq data to assemble a transcriptome using Trinity.

Trinity was run on our Mox HPC node.

Reads were trimmed using the built-in version of Trimmomatic with the default settings.
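
The full SBATCH script is linked below; the Trinity call itself was presumably along the same lines as the geoduck run shown further down this page (a sketch, with the input read lists omitted):

/gscratch/srlab/programs/trinityrnaseq-Trinity-v2.8.3/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 500G \
--CPU 28 \
--left <comma-separated list of R1 FastQ files> \
--right <comma-separated list of R2 FastQ files>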

SBATCH script:

Despite the naming conventions, this job was submitted to the Mox scheduler on 20180912 and finished on 20180913.

After job completion, the entire folder was gzipped, using an interactive node (the following method of gzipping is SUPER fast, btw):

tar -c 20180827_trinity_oly_RNAseq | pigz > 20180827_trinity_oly_RNAseq.tar.gz
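
For reference, the resulting archive can be unpacked later with:

pigz -dc 20180827_trinity_oly_RNAseq.tar.gz | tar -xf -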

RESULTS:

Output folder:

Trinity assembly (FastA):

Next up, I’ll follow up on this GitHub issue and get some bedgraphs generated.

Transcriptome Assembly – Geoduck RNAseq data

Used all of our current geoduck RNAseq data to assemble a transcriptome using Trinity.

Trinity was run on our Mox HPC node. Specifically, I had to use just a single node with 500GB of RAM; Trinity could not run with much less than that. Initially, I attempted to run with two nodes, but our smaller node (120GB) ended up limiting the available RAM (the system is limited by the RAM available on the smallest node; it cannot pool RAM across nodes or dynamically shift work to a node with more RAM when needed), and Trinity consistently crashed due to memory limitations.

Reads were trimmed using the built-in version of Trimmomatic with the default settings.

SBATCH script:

Due to the huge number of input files, I won’t post the entire script contents here. Instead, here’s a snippet of the script showing the commands used to start the Trinity run:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180829_trinity
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180827_trinity_geoduck_RNAseq

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Run Trinity
/gscratch/srlab/programs/trinityrnaseq-Trinity-v2.8.3/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 500G \
--CPU 28 \
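## (remainder of the command, i.e. the input FastQ file lists, omitted here for brevity; see the full script linked above)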

Despite the naming conventions, this job was submitted to the Mox scheduler on 20180829 and finished on 20180901.

After job completion, the entire folder was gzipped (the following method of gzipping is SUPER fast, btw):

tar -c 20180827_trinity_geoduck_RNAseq | pigz > 20180827_trinity_geoduck_RNAseq.tar.gz

RESULTS:

Output folder:

Trinity assembly (FastA):

Next up, I’ll get some annotations going by running through TransDecoder and blastx.

FastQC/MultiQC/TrimGalore/MultiQC/FastQC/MultiQC – O.lurida WGBSseq for Methylation Analysis

I previously ran this data through the Bismark pipeline and followed up with MethylKit analysis. MethylKit analysis revealed an extremely low number of differentially methylated loci (DML), which seemed odd.

Steven and I met to discuss and compare our different variations on the analysis and decided to try out different tweaks to evaluate how they affect analysis.

I did the following tasks:

  1. Looked at original sequence data quality with FastQC.

  2. Summarized FastQC analysis with MultiQC.

  3. Trimmed data using TrimGalore!, trimming 10bp from the 5′ end of reads (8bp is recommended by the Bismark docs).

  4. Summarized trimming stats with MultiQC.

  5. Looked at trimmed sequence quality with FastQC.

  6. Summarized FastQC analysis with MultiQC.

This was run on the Univ. of Washington High Performance Computing (HPC) cluster, Mox.
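
The exact commands are in the SBATCH submission script linked below; as a rough sketch of just the trimming step (placeholder filenames, and the flag set is my assumption), the Trim Galore! call would look something like:

# Hypothetical example: hard-clip 10bp from the 5' end of both reads in each pair
trim_galore \
--paired \
--clip_R1 10 \
--clip_R2 10 \
--output_dir ./trimgalore_output \
sample_R1.fastq.gz sample_R2.fastq.gz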

Mox SBATCH submission script has all details on how the analyses were conducted:


RESULTS

Output folder:

Raw sequence FastQC output folder:

Raw sequence MultiQC report (HTML):

TrimGalore! output folder (trimmed FastQ files are here):

Trimming MultiQC report (HTML):

Trimmed FastQC output folder:

Trimmed MultiQC report (HTML):

Mox – Over quota: Olympia oyster genome annotation progress (using Maker 2.31.10)

UPDATE 20180730

Per this GitHub Issue, Steven has cleared out files.

Additionally, it appears that my job has just continued from where it was stuck. Very nice!

ORIGINAL POST IS BELOW


Well, this is an issue. Checked in on job progress and noticed that we’ve exceeded our storage quota on Mox:



Will have to post an issue on GitHub to help get unnecessary files cleaned out. Kind of shocking that we’re using >6TB of space already…

Mox – Olympia oyster genome annotation progress (using Maker 2.31.10)

TL;DR – It appears to be continuing where it left off!

I decided to spend some time to figure out what was actually happening, as it’s clear that the annotation process is going to need some additional time to run and may span an additional monthly maintenance shutdown.

This is great, because, otherwise, this will take an eternity to actually complete (particularly because we’d have to move the job to run on one of our lab’s computers – which pale in comparison to the specs of our Mox nodes).

However, it’s a bit shocking that this is taking this long, even on a Mox node!

I started annotating the Olympia oyster genome on 20180529. Since then, the job has been interrupted twice by monthly Mox maintenance (which happens on the 2nd Tuesday of each month). Additionally, when this happens, the SLURM output file is overwritten, making it difficult to assess whether Maker continues where it left off or starts over from scratch.

Anyway, here’s how I deduced that the program is continuing where it left off.

  1. I figured out that it produces a generic feature format (GFF) file for each contig.

  2. Decided to search for the first contig GFF and look at its last modified date. This would tell me if it was newly generated (i.e. on the date that the job was restarted after the maintenance shutdown) or if it was old. Additionally, if there were more than one of these files, then I’d also know that Maker was just starting at the beginning and writing data to a different location.

    This shows:

    1. Only one copy of Contig0.gff exists.

    2. Last modified date is 20180530.

  3. Check the slurm output file for info.

    This reveals an important piece of info:

    MAKER WARNING: The file 20180529_oly_annotation_01.maker.output/20180529_oly_annotation_01_datastore/AC/68/Contig215522//theVoid.Contig215522/0/Contig215522.0.all.rb.out
    did not finish on the last run

All of these taken together lead me to confidently conclude that Maker is not restarting from the beginning and is, indeed, continuing where it left off. WHEW!
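
For reference, checks along these lines can be done with commands like the following (the output directory name is taken from the MAKER warning above; these are illustrative rather than the exact commands I ran):

# Locate Contig0.gff in the MAKER datastore and show its last-modified date
find 20180529_oly_annotation_01.maker.output -name "Contig0.gff" -exec ls -lh {} \;

# Check the SLURM output file for MAKER's restart warning
grep -B1 "did not finish on the last run" slurm-*.out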

Just for kicks, I also ran a count of GFF files to see where this stands so far:

Wow! 622,010 GFFs!!!

Finally, for posterity, here’s the SLURM script I used to submit this job, back in May! I’ll have all of the corresponding genome files, proteome files, transcriptome files, etc. on one of our servers once the job completes.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180529_oly_maker_genome_annotation
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/srlab/sam/outputs/20180529_oly_maker_genome_annotation

## Establish variables for more readable code

### Path to Maker executable
maker=/gscratch/srlab/programs/maker-2.31.10/bin/maker

### Path to Olympia oyster genome FastA file
oly_genome=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/jelly.out.fasta

### Path to Olympia oyster transcriptome FastA file
oly_transcriptome=/gscratch/srlab/sam/data/O_lurida/oly_transcriptome_assemblies/Olurida_transcriptome_v3.fasta

### Path to Crassostrea gigas NCBI protein FastA
gigas_proteome=/gscratch/srlab/sam/data/C_gigas/gigas_ncbi_protein/GCA_000297895.1_oyster_v9_protein.faa

### Path to Crassostrea virginica NCBI protein FastA
virginica_proteome=/gscratch/srlab/sam/data/C_virginica/virginica_ncbi_protein/GCF_002022765.2_C_virginica-3.0_protein.faa

## Create Maker control files needed for running Maker
$maker -CTL

## Store path to options control file
maker_opts_file=./maker_opts.ctl

## Create combined proteome FastA file
touch gigas_virginica_ncbi_proteomes.fasta
cat "$gigas_proteome" >> gigas_virginica_ncbi_proteomes.fasta
cat "$virginica_proteome" >> gigas_virginica_ncbi_proteomes.fasta

## Edit options file

### Set paths to O.lurida genome and transcriptome.
### Set path to combined C. gigas and C.virginica proteomes.
## The use of the % symbol sets the delimiter sed uses for arguments.
## Normally, the delimiter that most examples use is a slash "/".
## But, we need to expand the variables into a full path with slashes, which screws up sed.
## Thus, the use of % symbol instead (it could be any character that is NOT present in the expanded variable; doesn't have to be "%").
sed -i "/^genome=/ s% %$oly_genome %" "$maker_opts_file"
sed -i "/^est=/ s% %$oly_transcriptome %" "$maker_opts_file"
sed -i "/^protein=/ s% %$gigas_virginica_ncbi_proteomes %" "$maker_opts_file"

## Run Maker
### Set basename of files and specify number of CPUs to use
$maker \
-base 20180529_oly_annotation_01 \
-cpus 24