Author Archives: kubu4

VCF Splitting with bcftools

Steven asked for some help trying to split a VCF in to individual VCF files.

VCF file (15GB): SNP.TRSdp5g95FnDNAmaf05.vcf.gz

Skip to the Results section if you don’t want to read through the tials and tribulations of getting this to work.

Here’s an overview of how I managed to get this to work and what didn’t work.

Figured out the VCF file needed to be sorted, bgzipped (part of htslib), and indexed with tabix, due to the following error when initially trying to process with VCF file using bcftools:

[W::vcf_parse] contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

So, I did that:

  • Sort and bgzip:
cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | \
bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz
  • Index with tabix:
tabix --preset vcf SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

This produced a separate file:

  • SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz.tbi.

It seems as though this file must exist in the same directory as the source VCF for it to be utilized, although no commands work directly with this index file.

Then, tried biostars solution, but produces an error


#!/bin/bash
for file in *.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file
  done
done

Resulting error:

[E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174

And empty split VCF files…

Tried tabix on unsorted bgzipped file yields this error:

[E::hts_idx_push] chromosome blocks not continuous

Tried modified sort:


cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1V -k2,2n"}' | \
bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

Produces this error:

[E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174

And empty split VCF files…

Changed to new version of “view” – trying “call” instead (it seems that bcftools view is deprecated?):


#!/bin/bash
for file in *.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools call \
    --consensus-caller \
    --output-type z \
    --threads 18 \
    --samples $sample 
    --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
    $file
  done
done

Still results in empty output files.

Based off of the repeated error about AN/AC counts, tried to fill AN/AC values…


bcftools plugin fill-AN-AC SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz \
--output-type z \
--threads 18 \
--output SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz

And, ran this code:


#!/bin/bash
for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools call \
    --consensus-caller \
    --output-type z \
    --threads 18 \
    --samples $sample 
    --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
    $file
  done
done

Still results in empty files…

Try original code again (expanded shortened arguments to improve readability):


#!/bin/bash
for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools view \
    --min-ac 1 \
    --output-type z \
    --samples $sample \
    --output-file ${file/.vcf*/.$sample.vcf.gz} \
    --threads 18 \
    $file
  done
done

P.S. I realize the outermost for loop is not necessary, but it was faster/easier to just quickly modify the code from that Biostars solution.


RESULTS

That worked! Or, at least it is producing non-empty, split VCF files! I’ll let Steven know and let him decide what impact (if any) the fill-AN-AC plugin had on the file(s)!

BTW, running with 18 threads on my computer, this took ~30mins to split each sample into its own VCF file. However, checking performance via htop, it certainly does not appear to be using 18 threads…

Output folder: 20181015_vcf_split

RNA Isolation – Tanner Crab Hemolymph Pellet in RNAlater using TriReagent

I previously isolated RNA from crab hemolymp from a lyophilized sample using TriReagent and Grace recently tried isolating RNA from crab hemolyph pellet (non-lyophilized) using TriReagent. The results for her extractions weren’t so great, so I’m giving it a shot with the following samples:

  • crab 424

  • crab 429

  • crab 438

Isolated RNA using TriReagent, according to manufacturer’s protocol:

Added 1mL TriReagent to each tube, vortexed to mix/dissolve solute, incubated 5mins at RT, added 200uL of chloroform, vortexed 15s to mix, incubated at RT for 5mins, centrifuged 15mins, 12,000g, 4oC, transferred aqueous phase to new tube, added 500uL isopropanol to aqueous phase, mixed, incubated at RT for 10mins, centrifuged 8mins, 12,000g, at RT, discarded supernatant, added 1mL 75% ethanol, centrifuged 5mins, 12,000g at RT, discarded supernatant and resuspended in 10uL of 0.1% DEPC-treated H2O.

Phase separation after chloroform addition was not particularly good. Aqueous phases in sample 424 was a bit cloudy (salty?) with no defined interphase. The remaining two samples did exhibit a defined interphase and were the aqueous phases were less cloudy than sample 424, but were far from ideal.

Quantified RNA using Roberts Lab Qubit 3.0 with the Qubit RNA high sensitivity kit. Used 5uL of each sample.


RESULTS

No detectable RNA in any samples. Samples were discarded.

As has been the case for all samples in this project, RNA isolation methodologies have produced wildly inconsistent results.

RNA Isolation – Ronit’s C.gigas diploid/triploid dessication/heat shock ctenidia tissues

Isolated RNA from a subset of Ronit’s Crassostrea gigas ctenidia samples (see Ronit’s notebook for experiment deets):

  • D01 C

  • D02 C

  • D19 C

  • D20 C

  • T01 C

  • T02 C

  • T19 C

  • T20 C

RNA was isolated using RNAzol RT (Molecular Research Center) in the following way:

  • Unweighed, frozen tissue transferred to 500uL of RNAzol RT and immediately homogenized with disposable pestle.

  • Added additional 500uL of RNAzol RT and vortexed to mix.

  • Added 400uL of 0.1% DEPC-treated H2O, vortexed and incubated 15mins at RT.

  • Centrifuged 12,000g for 15mins at RT.

  • Transferred 750uL of supernatant to clean tube (discarded remainder), added 1 volume (750uL) of isopropanol, vortexed, and incubated at RT for 10mins.

  • Centrifuged 12,000g for 10mins at RT.

  • Discarded supernatant.

  • Washed pellet with 75% ethanol (made with 0.1% DEPC-treated H2O).

  • Centrifuged 4,000g for 2mins at RT.

  • Discarded supernatant and repeated wash steps.

Pellet was resuspended in 50uL of 0.1% DEPC-treated H2O and stored @ -80oC in Ronit’s temporary box.

SRA Submission – Olymia oyster Whole Genome BS-seq Data

Submitted our whole genome bisulfite sequencing data to NCBI Sequence Read Archive (SRA).

Relevant SRA info is below.

Have updated nightingales Google Sheet with SRA info.

SAMPLE SRA (Study) BioProject BioSample
1NF11 SRP163248 PRJNA494552 SAMN10172233
1NF15 SRP163248 PRJNA494552 SAMN10172234
1NF16 SRP163248 PRJNA494552 SAMN10172235
1NF17 SRP163248 PRJNA494552 SAMN10172236
2NF5 SRP163248 PRJNA494552 SAMN10172237
2NF6 SRP163248 PRJNA494552 SAMN10172238
2NF7 SRP163248 PRJNA494552 SAMN10172239
2NF8 SRP163248 PRJNA494552 SAMN10172240

Installation – Microsoft Machine Learning Server (Microsoft R Open) on Emu/Roadrunner R Studio Server

Steven recently saw an announcement that Microsoft R Open now handles multi-threaded processing (default R does not), so we were interested in trying it out. I installed MLR/MRO on Emu/Roadrunner (Apple Xserve; Ubuntu 16.04). Followed the Microsoft installation directions for Ubuntu. In retrospect, I think I could’ve just installed MRO, but this gets the job done as well and won’t hurt anything.

I’ve set both Emu & Roadrunner R Studio Server to use this installation of R by changing the /etc/restudio/rserver.conf file to the following:


# Server Configuration File

# Use Microsoft R Open instead of default R version.
# Comment out and restart R Studio Server (sudo rstudio-server restart)
# to restore default R version.

rsession-which-r=/opt/microsoft/ropen/3.4.3/lib64/R/bin/R

I have confirmed that R Studio Server on both machines starts up and is using MRO instead of the default version of R.

Ubuntu – Fix “No Video Signal” Issue on Emu

An issue with Emu cropped up a few weeks ago that was seemingly caused by upgrading from Ubuntu 16.04 to 18.04.

However, the problems only seemed related to using Emu via the GUI; users could still use Emu as a headless computer via SSH.

Today, I was upgrading some packages and noticed two things:

  1. When initially logging in to Emu.
    
    sam@swoose:~$ ssh emu
    Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-57-generic x86_64)
    
    * Documentation:  https://help.ubuntu.com
    * Management:     https://landscape.canonical.com
    * Support:        https://ubuntu.com/advantage
    
    0 packages can be updated.
    0 updates are security updates.
    
    New release '18.04.1 LTS' available.
    Run 'do-release-upgrade' to upgrade to it.
    
    You have mail.
    Last login: Tue Oct  2 07:30:32 2018 from 128.95.149.75
    

    This is showing that Emu is still running Ubuntu 16.04, not 18.04 as presumed!

  2. An error in the GRUB config generation process when upgrading packages.


run-parts: executing /etc/kernel/postrm.d/zz-update-grub 4.4.0-134-generic /boot/vmlinuz-4.4.0-134-generic
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.0-137-generic
Found initrd image: /boot/initrd.img-4.4.0-137-generic
Found linux image: /boot/vmlinuz-4.4.0-135-generic
Found initrd image: /boot/initrd.img-4.4.0-135-generic
Found linux image: /boot/vmlinuz-4.4.0-57-generic
Found initrd image: /boot/initrd.img-4.4.0-57-generic
Found linux image: /boot/vmlinuz-4.4.0-53-generic
Found initrd image: /boot/initrd.img-4.4.0-53-generic
error: syntax error.
error: Incorrect command.
error: syntax error.
Syntax error at line 98
Syntax errors are detected in generated GRUB config file.
Ensure that there are no errors in /etc/default/grub
and /etc/grub.d/* files or please file a bug report with
/boot/grub/grub.cfg.new file attached.
done
Processing triggers for libc-bin (2.23-0ubuntu10) ...

These two bits of information led me to believe the problem wasn’t that the system upgrade to 18.04 was incompatible with these old Apple Xserve hardware (since the upgrade didn’t actually get implemented) and instead was that the upgrade might have been initiated, but aborted, which modified the GRUB configuration file(s), breaking the GUI; much like the problem I previously addressed earlier this summer.

When I fixed the display/GUI issues with Emu and Roadrunner earlier this summer, I noted that the /etc/default/grub files on each of the computers were slightly different, despite the fact that these two computers should be identical. So, I replaced the /etc/default/grub file on Emu with the file from Roadrunner and rebooted Emu.

Contents of /etc/default/grub file on Emu/Roadrunner, for future reference:


# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX=""

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"

Voila! Emu now has a functional display/GUI again!

Transcriptome Alignment & Bedgraph – Olympia oyster transcriptome with Olurida_v080 genome assembly

Yesterday, I produced a bedgraph file of our Olympia oyster RNAseq data coverage using our Olurida_v081 genome.

I decided that I wanted to use the Olurida_v080 version instead (or, in addtion to?), as the Olurida_v080 version has not been size restricted (the Olurida v081 version is only contigs >1000bp). I feel like we could miss some important regions, so wanted to run this analysis using all of the genome data we currently have available. Additionally, this will be consistent with my previous Bismark (DNA methylation analysis).

Used HISAT2 on our HPC Mox node to align our RNAseq reads to our Olurida_v080 genome assembly:

SBATCH script file:

NOTE: For brevity sake, I have not listed all of the input RNAseq files below. Please see the full script, which is linked above.


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_hisat2
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_genome_hisat2_bedgraph

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Set genome assembly path
oly_genome_path=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=20180926_Olurida_v080.sorted.bam

# Set hisat2 basename
hisat2_basename=Olurida_v080

# Set program paths
## hisat2
hisat2=/gscratch/srlab/programs/hisat2-2.1.0

## bedtools
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin

## samtools
stools=/gscratch/srlab/programs/samtools-1.9/samtools

# Build hisat2 genome index
${hisat2}/hisat2-build \
-f ${oly_genome_path}/Olurida_v080.fa \
Olurida_v080 \
-p 28

# Align reads to oly genome assembly
${hisat2}/hisat2 \
--threads 28 \
-x "${hisat2_basename}" \
-q \
-1 \
-2 \
-S 20180926_"${hisat2_basename}".sam

# Convert SAM file to BAM
"${stools}" view \
--threads 28 \
-b 20180926_"${hisat2_basename}".sam > 20180926_"${hisat2_basename}".bam

# Sort BAM
"${stools}" sort \
--threads 28 \
20180926_"${hisat2_basename}".bam \
-o 20180926_"${hisat2_basename}".sorted.bam

# Index for use in IGV
##-@ specifies thread count; --thread option not available in samtools index
"${stools}" index \
-@ 28 \
20180926_"${hisat2_basename}".sorted.bam


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-ibam ${oly_transcriptome_bam} \
-bga \
-split \
-trackline \
> 20180926_oly_RNAseq.bedgraph

The script performs the following functions:

  • Genome indexing
  • RNAseq alignment to genome
  • Convert SAM to BAM
  • Sort and index BAM
  • Determine RNAseq coverage

RESULTS

Output folder:

Bedgraph file (1.9GB):

Loaded in to IGV to verify things looked OK:

Bedgraph – Olympia oyster transcriptome with Olurida_v081 genome assembly

I took the sorted BAM file from yesterday’s corrected RNAseq genome alignment and converted it to a bedgraph using BEDTools genomeCoverageBed tool.

Analysis took place on our HPC Mox node.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180926_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180926_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log

# Set sorted transcriptome assembly bam file
oly_transcriptome_bam=/gscratch/scrubbed/samwhite/20180925_oly_RNAseq_genome_hisat2/20180925_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-ibam ${oly_transcriptome_bam} \
-bga \
-split \
-trackline \
> 20180926_oly_RNAseq.bedgraph

RESULTS

Output folder:

Bedgraph file (1.2GB):

Loaded in to IGV to verify things looked OK:

Transcriptome Alignment – Olympia oyster RNAseq reads aligned to genome with HISAT2

Yesterday’s attempt at producing a bedgraph was a failure and a prodcuct of a major brain fart. The worst part is that I was questioning what I was doing the entire time, but still went through with the process! Yeesh!

The problem was that I tried to take our Trinity-assembled transcriptome and somehow align that to our genome. This can’t work because each of those assemblies don’t know the coordinates used by the other. So, as was the case, you end up with a bedgraph that shows zero coverage for all genome contigs.

Anyway, here’s the correct procedure!

Used HISAT2 on our HPC Mox node to align our RNAseq reads to our Olurida_v081 genome assembly:

SBATCH script files:

PERFORM GENOME INDEXING & ALIGNMENT
20180925_oly_RNAseq_genome_hisat2.sh

SORT & INDEX ALIGNMENT OUTPUT
20180925_oly_RNAseq_genome_sort_bam.sh


RESULTS

Output folder:

Sorted BAM file (58GB):

Will get the sorted BAM file converted to a bedgraph showing genome coverage for use in IGV.

Bedgraph – Olympia oyster transcriptome (FAIL)

Progress on generating bedgraphs from our Olympia oyster transcriptome continues.

Transcriptome assembly with Trinity completed 20180919.

Then, aligned the assembled transcriptome to our genome using Bowtie2.

Finally, I used BEDTools to convert the BAM to BED to bedgraph.

This required an initial indexing of our Olympia oyster genome FastA using samtools faidx tool.

SBATCH script file:


#!/bin/bash
## Job Name
#SBATCH --job-name=20180924_oly_bedgraphs
## Allocation Definition 
#SBATCH --account=srlab
#SBATCH --partition=srlab
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=5-00:00:00
## Memory per node
#SBATCH --mem=500G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/scrubbed/samwhite/20180924_oly_RNAseq_bedgraphs

# Load Python Mox module for Python module availability

module load intel-python3_2017

# Document programs in PATH (primarily for program version ID)

date >> system_path.log
echo "" >> system_path.log
echo "System PATH for $SLURM_JOB_ID" >> system_path.log
echo "" >> system_path.log
printf "%0.s-" {1..10} >> system_path.log
echo ${PATH} | tr : \\n >> system_path.log


# Set genome assembly FastA
oly_genome_fasta=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa

# Set indexed genome assembly file
oly_genome_indexed=/gscratch/srlab/sam/data/O_lurida/oly_genome_assemblies/Olurida_v081.fa.fai

# Set sorted transcriptome assembly bam file
oly_transcriptome=/gscratch/scrubbed/samwhite/20180919_oly_transcriptome_bowtie2/20180919_Olurida_v081.sorted.bam


# Set program paths
bedtools=/gscratch/srlab/programs/bedtools-2.27.1/bin
samtools=/gscratch/srlab/programs/samtools-1.9/samtools

# Index genome FastA
${samtools} faidx ${oly_genome_fasta}

# Format indexed genome for bedtools
## Requires only two columns: namelength
awk -v OFS='\t' {'print $1,$2'} ${oly_genome_indexed} > Olurida_v081.fa.fai.genome

# Create bed file
${bedtools}/bamToBed \
-i ${oly_transcriptome} \
> 20180924_oly_RNAseq.bam.bed


# Create bedgraph
## Reports depth at each position (-bg in bedgraph format) and report regions with zero coverage (-a).
## Screens for portions of reads coming from exons (-split).
## Add genome browser track line to header of bedgraph file.
${bedtools}/genomeCoverageBed \
-i ${PWD}/20180924_oly_RNAseq.bed \
-g Olurida_v081.fa.fai.genome \
-bga \
-split \
-trackline \
> 20180924_oly_RNAseq.bed

Alignment was done using the following version of the Olympia oyster genome assembly:


RESULTS:

Output folder:

Indexed and formatted genome file:

Bedgraph file (for IGV):


This doesn’t appear to have worked properly. Here’s a view of the bedgraph file:


track type=bedGraph
Contig0 0   116746  0
Contig1 0   87411   0
Contig2 0   139250  0
Contig3 0   141657  0
Contig4 0   95692   0
Contig5 0   130522  0
Contig6 0   94893   0
Contig7 0   109667  0
Contig8 0   95943   0

I’d expect multiple entries for each contig (ideally), indicating start/stop positions for where transcripts align within a given contig. However, this appears to simply be a list of all the genome contigs and their lengths (Start=0, Stop=n).

I would expect to see something li

I’ll look into this further and see where this pipeline goes wrong.