Author Archives: kubu4

Transcriptome Alignment – Olympia oyster Trinity transcriptome aligned to genome with Bowtie2

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Next up, align transcriptome to Olympia oyster genome.

Alignment and creation of BAM files was done using Bowtie2 on our HPC Mox node.

SBATCH script file:

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Sorted BAM file:

Sorted & indexed BAM file (for IGV):

Will get the sorted BAM file converted to a bedgraph for use in IGV.

Transcriptome Assembly – Olympia oyster RNAseq Data with Trinity

Used all of our current oly RNAseq data to assemble a transcriptome using Trinity.

Trinity was run our our Mox HPC node.

Reads were trimmed using the built-in version of Trimmomatic with the default settings.

SBATCH script:

Despite the naming conventions, this job was submitted to the Mox scheduler on 201800912 and finished on 20180913.

After job completion, the entire folder was gzipped, using an interactive node (the following method of gzipping is SUPER fast, btw):

tar -c 20180827_trinity_oly_RNAseq | pigz > 20180827_trinity_oly_RNAseq.tar.gz

RESULTS:

Output folder:

Trinity assembly (FastA):

Next up, I’ll follow up on this GitHub issue and get some bedgraphs generated.

DNA Methylation Analysis – Olympia Oyster Whole Genome BSseq Bismark Pipeline MethylKit Comparison

I previously ran two variations on the Bismark analysis for our Olympia oyster whole genome bisulfite sequencing data:

I followed this up using the MethylKit R package to identify differentially modified loci (DML), based on differing amounts of coverage (1x, 3x, 5x, & 10x) and percent methylation differences between the two groups of oysters (25%, 50%, & 75%).

See the project wiki for experimental design info).

Both sets of analyses were documented in R Projects:

Default Bismark settings:
“Relaxed” Bismark settings

RESULTS

BedGraphs (1x coverage, 25% diff in methylation):

Default Bismark settings:
“Relaxed” Bismark settings:

The BedGraph outputs from the least stringent coverage/percent difference in methylation for both Bismark pipelines yield suprisingly low numbers of DML.

They yield 22 and 21 DML, respectively. Of course, more stringent BedGraphs have fewer DML, but may be more believable due to having a more robust set of data.

Interestingly, the two analyses reveal that a single contig contains the majority of DML, all within a 1000bp range.

Will continue to examine this data by examining Bismark BedGraphs in IGV, and running some additional MethylKit analysis looking at differentially modified regions(DMRs) to see what we can gleen from this.

RNA Isolation – Lyophilized Tanner Crab Hemolymph in RNAlater

Due to difficulties getting RNA from hemolymph samples stored in RNAlater, Grace is testing out lyophilizing samples before extraction. Who knows what impact this will have on RNA, but it’s worth a shot!

Isolated RNA from three crab hemolymph samples preserved in RNAlater (Test 1, Test 2, Test 3) that had been lyophilized overnight last week.

Samples were provided by Grace.

I believe the primary purpose for this particular test was to verify that the freeze dryer was a feasible tool, since Grace experienced a minor mishap when she attempted the lyohpilization initially.

Lyophilization was successful, without any mess.

TEST 3 LYOPHILIZATION


Isolated RNA using TriReagent, according to manufacturer’s protocol:

Added 1mL TriReagent to each tube, vortexed to mix/dissolve solute, incubated 5mins at RT, added 200uL of chloroform, vortexed 15s to mix, incubated at RT for 5mins, centrifuged 15mins, 12,000g, 4oC, transferred aqueous phase to new tube, added 500uL isopropanol to aqueous phase, mixed, incubated at RT for 10mins, centrifuged 8mins, 12,000g, at RT, discarded supernatant, added 1mL 75% ethanol, centrifuged 5mins, 12,000g at RT, discarded supernatant and resuspended in 10uL of 0.1% DEPC-treated H2O.

Quantified RNA using Roberts Lab Qubit 3.0 with the Qubit RNA high sensitivity kit. Used 5uL of each sample.


RESULTS

Qubit (Google Sheet):

Only one sample (Test 3) had detectable levels of RNA (20.4ng/uL).

So, this little test demonstrates that RNA can be isolated from lyophilized samples and extracted with TriReagent. However, I have not evaluated RNA integrity on the Bioanalyzer. I think Grace has some additional samples she wanted to test this method on, so I think we’ll wait until there are more samples before we use the Bioanalyzer.

Will give sample to Grace for -80oC storage.

DNA Methylation Analysis – Olympia Oyster Whole Genome BSseq Bismark Pipeline Comparison

Ran Bismark using our high performance computing (HPC) node, Mox, with two different bowtie2 settings:

  1. Default settings

  2. –score_min L,0,-0.6

The second setting is a bit less stringent than the default settings and should result in a higher percentage of reads mapping. However, not entirely sure what the actual implications will be (if any) for interpreting the resulting data.

Input data was previously trimmed per Bismark’s recommendation for Illumina TruSeq libraries (TrimGalore! 5′ 10bp):

List of input files and Bismark configurations can be seen in the SLURM scripts:


RESULTS

Output folders:

DNA Quantification – Sea Lice DNA from 20180523

We previously received sea lice (Caligus tape) DNA from Cris Gallardo-Escarate at Universidad de Concepción.

Steven asked that I quantify and assess the DNA quality.

Ran the samples on the Roberts Lab Qubit 3.0 using the dsDNA BR assay (Invitrogen) and 1uL of template DNA.

Ran the samples on the Roberts Lab NanoDrop1000 to get 260/280 values for quality assessment using 2uL of template DNA. NanoDrop1000 was blanked with water, but I don’t know what solvent the DNA is currently resuspended in.


RESULTS

Qubit data (Google Sheet):

SAMPLE CONCENTRATION (ng/uL)
FEMALE 1 23.8
FEMALE 2 9.6

NANODROP DATA

TABLE


ABSORBANCE PLOTS


DNA looks super clean. Not sure what this DNA is intended for so can’t speculate much on what the implications might be for the concentrations on downstream usage.

TrimGalore/FastQC/MultiQC – C.virginica Oil Spill MBDseq Concatenated Sequences

Previously concatenated and analyzed our Crassostrea virginica oil spill MBDseq data with FastQC.

We decided to try improving things by running them through TrimGalore! to remove adapters and poor quality sequences.

Processed the samples on Roadrunner (Apple Xserve; Ubuntu 16.04) using default TrimGalore! settings.

After trimming, TrimGalore! output was summarized using MultiQC. Trimmed FastQ files were then analyzed with FastQC and followed up with MultiQC.

Documented in Jupyter Notebook (see below).

Jupyter Notebook (GitHub):


RESULTS

TrimGalore! output folder:

TrimGalore! MultiQC report (HTML):

TrimGalore! FastQC output folder:

FastQC MultiQC report (HTML:

Overall, things look a bit better, but there are still some issues. Will likely eliminate sample 2112_lane_1_TGACCA from analysis and apply some additional sequence filtering, based on sequence length.


SEQUENCE CONTENT PLOT


SHORT SEQUENCE CONTAMINATION


Sequencing Data Analysis – C.virginica Oil Spill MBDseq Concatenation & FastQC

Per Steven’s request, I concatenated our Crassostrea virginica LSU oil spill MBDseq sequencing data and ran FastQC on the concatenated files.

Here’s the list of input files:

2112_lane1_ACAGTG_L001_R1_001.fastq.gz
2112_lane1_ACAGTG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_001.fastq.gz
2112_lane1_ATCACG_L001_R1_002.fastq.gz
2112_lane1_ATCACG_L001_R1_003.fastq.gz
2112_lane1_CAGATC_L001_R1_001.fastq.gz
2112_lane1_CAGATC_L001_R1_002.fastq.gz
2112_lane1_CAGATC_L001_R1_003.fastq.gz
2112_lane1_GCCAAT_L001_R1_001.fastq.gz
2112_lane1_GCCAAT_L001_R1_002.fastq.gz
2112_lane1_TGACCA_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_001.fastq.gz
2112_lane1_TTAGGC_L001_R1_002.fastq.gz

All commands were run on roadrunner (Apple Xserve; Ubuntu 16.04). See Jupyter notebook below for details.

Jupyter notebook (GitHub):


RESULTS:

The concatenated gzip files and FastQC/MultiQC files are in the output folder linked below.

Output folder:

MultiQC report (HTML):

DNA Methylation Analysis – Olympia oyster BSseq MethylKit Analysis

NOTE: IMPORTANT CAVEATS – READ POST BEFORE USING DATA.
I’m posting this for posterity and to provide an overview of what (and whatnot) to do. Plus, this has a good R script for using MethylKit that can be used for subsequent analyses.

The goal of this analysis was to compare the methylation profiles of Olympia oysters originating from a common population (Fidalgo Bay) that were raised in two different locations (Fidalgo Bay & Oyster Bay).

An overview of the experiment can be viewed here:

I previously ran all of Olympia oyster DNA methylation sequencing data through the Bismark pipeline, and then processed them using the MethylKit R library.

First mistake (Bismark):

  • Trimmed FastQ files “incorrectly”.

Bismark provides an excellent user guide and provides a handy table on how to decide on trimming parameters, but I mistakenly trimmed these according to the recommendations for a different library preparation technique. I trimmed based on the Zymo Pico-Methyl Kit (which was used for the other group of data that I processed simultaneously), instead of the TruSeq library prep.

So, “incorrectly” isn’t necessarily the proper term here. The analysis can still be used, however, it’s likely that the excessive trimming results in reducing sequencing coverage, and, in turn, making the downstream analysis result in a highly conservative output. Thus, the data isn’t wrong or bad, it is just very limited.

And, this leads to the second mistake (Bismark):

  • Bowtie alignment score too strict

There’s a bit of a weird “battle” between Bismark and bowtie2. Bismark uses bowtie2 for generating alignments, but bowtie2’s default cutoff score overrides Bismark’s. So, to adjust the score value, you have to explicitly add the scoring parameters to your Bismark parameters during the alignment step. I did not do this.

Again, it’s not wrong, per se, but leads to a significantly limited set of data in the final analysis.

The data were analyzed based on a minimum of:

  • 3x coverage

  • 25% difference in methylation


RESULTS:

Methylkit analysis (R project; GitHub):

BedGraph file (BED):

The analysis resulted in a total of seven (yes, 7) differentially methylated loci (DML) between the two groups. It was this result that made Steven and me revisit the initial Bismark analysis. He has done this previously (but differently) and gotten significantly greater numbers of DML.

Knowing all of this, I will re-trim the data and adjust Bismark alignment score thresholds and then re-analyze with MethylKit.

Regardless here’re some plots to add some visual flair to this notebook entry (these, and more, are available in the GitHub repo):

CLUSTERING DENDROGRAM


PCA PLOT

Transcriptome Assembly – Geoduck RNAseq data

Used all of our current geoduck RNAseq data to assemble a transcriptome using Trinity.

Trinity was run our our Mox HPC node. Specifically, I had to use just a single node with 500GB of RAM. Trinity could not run with much less than that. Initially, I attempted to run with two nodes, but our smaller node (120GB) ended up limiting the available RAM (the system only uses the RAM available on the smallest node; it cannot combine RAM or dynamically allocate computing to a node with larger RAM when needed) and Trinity consistently crashed due to memory limitations.

Reads were trimmed using the built-in version of Trimmomatic with the default settings.

SBATCH script:

Due to the huge number of input files, I won’t post the entire script contents here. Instead, here’s a snippet of the script showing the commands used to start the Trinity run:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180829_trinity
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=30-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180827_trinity_geoduck_RNAseq

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Run Trinity
/gscratch/srlab/programs/trinityrnaseq-Trinity-v2.8.3/Trinity \
--trimmomatic \
--seqType fq \
--max_memory 500G \
--CPU 28 \

Despite the naming conventions, this job was submitted to the Mox scheduler on 20180829 and finished on 20180901.

After job completion, the entire folder was gzipped (the following method of gzipping is SUPER fast, btw):

tar -c 20180827_trinity_geoduck_RNAseq | pigz > 20180827_trinity_geoduck_RNAseq.tar.gz

RESULTS:

Output folder:

Trinity assembly (FastA):

Next up, I’ll get some annotations going by running through TransDecoder and blastx.