Tag Archives: bedgraph

Transcriptome Alignment & Bedgraph – Olympia oyster transcriptome with Olurida_v080 genome assembly

Yesterday, I produced a bedgraph file of our Olympia oyster RNAseq data coverage using our Olurida_v081 genome.

I decided that I wanted to use the Olurida_v080 version instead (or, in addtion to?), as the Olurida_v080 version has not been size restricted (the Olurida v081 version is only contigs >1000bp). I feel like we could miss some important regions, so wanted to run this analysis using all of the genome data we currently have available. Additionally, this will be consistent with my previous Bismark (DNA methylation analysis).

Used HISAT2 on our HPC Mox node to align our RNAseq reads to our Olurida_v080 genome assembly:

SBATCH script file:

NOTE: For brevity sake, I have not listed all of the input RNAseq files below. Please see the full script, which is linked above.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_hisat2
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_genome_hisat2_bedgraph

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Set genome assembly path
oly_genome_path=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=20180926_Olurida_v080.sorted.bam

# Set hisat2 basename
hisat2_basename=Olurida_v080

# Set program paths
## hisat2
hisat2=/gscratch/srlab/programs/hisat2-2.1.0

## bedtools
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin

## samtools
stools=/gscratch/srlab/programs/samtools-1.9/samtools

# Build hisat2 genome index
${hisat2}/hisat2-build \
-f ${oly_genome_path}/Olurida_v080.fa \
Olurida_v080 \
-p 28

# Align reads to oly genome assembly
${hisat2}/hisat2 \
--threads 28 \
-x "${hisat2_basename}" \
-q \
-1 \
-2 \
-S 20180926_"${hisat2_basename}".sam

# Convert SAM file to BAM
"${stools}" view \
--threads 28 \
-b 20180926_"${hisat2_basename}".sam > 20180926_"${hisat2_basename}".bam

# Sort BAM
"${stools}" sort \
--threads 28 \
20180926_"${hisat2_basename}".bam \
-o 20180926_"${hisat2_basename}".sorted.bam

# Index for use in IGV
##-@ specifies thread count; --thread option not available in samtools index
"${stools}" index \
-@ 28 \
20180926_"${hisat2_basename}".sorted.bam


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-ibam ${oly_transcriptome_bam} \
-bga \
-split \
-trackline \
> 20180926_oly_RNAseq.bedgraph

The script performs the following functions:

  • Genome indexing
  • RNAseq alignment to genome
  • Convert SAM to BAM
  • Sort and index BAM
  • Determine RNAseq coverage

RESULTS

Output folder:

Bedgraph file (1.9GB):

Loaded in to IGV to verify things looked OK:

Bedgraph – Olympia oyster transcriptome with Olurida_v081 genome assembly

I took the sorted BAM file from yesterday’s corrected RNAseq genome alignment and converted it to a bedgraph using BEDTools genomeCoverageBed tool.

Analysis took place on our HPC Mox node.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=/gscratch/scrubbed/samwhite/20180925_oly_RNAseq_genome_hisat2/20180925_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-ibam ${oly_transcriptome_bam} \
-bga \
-split \
-trackline \
> 20180926_oly_RNAseq.bedgraph

RESULTS

Output folder:

Bedgraph file (1.2GB):

Loaded in to IGV to verify things looked OK:

Bedgraph – Olympia oyster transcriptome (FAIL)

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Then, aligned the assembled transcriptome to our genome using Bowtie2.

Finally, I used BEDTools to convert the BAM to BED to bedgraph.

This required an initial indexing of our Olympia oyster genome FastA using samtools faidx tool.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180924_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180924_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Set genome assembly FastA
oly_genome_fasta=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa

# Set indexed genome assembly file
oly_genome_indexed=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa.fai

# Set sorted transcriptome assembly bam file
oly_transcriptome=/gscratch/scrubbed/samwhite/20180919_oly_transcriptome_bowtie2/20180919_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools

# Index genome FastA
${samtools} faidx ${oly_genome_fasta}

# Format indexed genome for bedtools
## Requires only two columns: namelength
awk -v OFS='\t' {'print $1,$2'} ${oly_genome_indexed} > Olurida_v081.fa.fai.genome

# Create bed file
${bedtools}/bamToBed \
-i ${oly_transcriptome} \
> 20180924_oly_RNAseq.bam.bed


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-i ${PWD}/20180924_oly_RNAseq.bed \
-g Olurida_v081.fa.fai.genome \
-bga \
-split \
-trackline \
> 20180924_oly_RNAseq.bed

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Indexed and formatted genome file:

Bedgraph file (for IGV):


This doesn’t appear to have worked properly. Here’s a view of the bedgraph file:


track type=bedGraph
Contig0 0   116746  0
Contig1 0   87411   0
Contig2 0   139250  0
Contig3 0   141657  0
Contig4 0   95692   0
Contig5 0   130522  0
Contig6 0   94893   0
Contig7 0   109667  0
Contig8 0   95943   0

I’d expect multiple entries for each contig (ideally), indicating start/stop positions for where transcripts align within a given contig. However, this appears to simply be a list of all the genome contigs and their lengths (Start=0, Stop=n).

I would expect to see something li

I’ll look into this further and see where this pipeline goes wrong.