Sam's Notebook – University of Washington, Fishery Sciences, Roberts Lab (http://onsnetwork.org/kubu4)

VCF Splitting with bcftools (Mon, 15 Oct 2018)
http://onsnetwork.org/kubu4/2018/10/15/vcf-splitting-with-bcftools/

Steven asked for some help splitting a VCF file into individual, per-sample VCF files.

VCF file (15GB): SNP.TRSdp5g95FnDNAmaf05.vcf.gz

Skip to the Results section if you don’t want to read through the trials and tribulations of getting this to work.

Here’s an overview of how I managed to get this to work and what didn’t work.

Figured out the VCF file needed to be sorted, bgzipped (part of htslib), and indexed with tabix, due to the following error when initially trying to process the VCF file with bcftools:

[W::vcf_parse] contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

So, I did that:

  • Sort and bgzip:
cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | \
bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz
  • Index with tabix:
tabix --preset vcf SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

This produced a separate file:

  • SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz.tbi.

It seems as though this index file just needs to exist in the same directory as the source VCF in order to be used; none of the commands below reference it directly.
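
As a side note, the index only gets used implicitly. For example, a region query like the one below (the region itself is just an arbitrary example, not part of this analysis) will only work when the .tbi sits next to the .vcf.gz:

bcftools view \
--regions NC_035780.1:1-100000 \
SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz | head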

Then, tried a Biostars solution, but it produces an error:


#!/bin/bash
# For each bgzipped VCF in the directory, pull out every sample listed in the
# header and write it to its own VCF, keeping only sites where that sample
# has at least one non-reference allele (-c1).
for file in *.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file
  done
done

Resulting error:

[E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174

And empty split VCF files…
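
(If you want to confirm a split file really is empty of variant records, counting the non-header lines in it does the trick; the per-sample filename below is just a made-up example:)

zcat SNP.TRSdp5g95FnDNAmaf05.sorted.SAMPLE_1.vcf.gz | grep -c -v '^#'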

Trying tabix on the unsorted bgzipped file yields this error:

[E::hts_idx_push] chromosome blocks not continuous

Tried a modified sort (version sort on the chromosome column):


cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1V -k2,2n"}' | \
bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

Produces this error:

[E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174

And empty split VCF files…

Changed to the newer replacement for “view” – trying “call” instead (it seems the old calling behavior of bcftools view is deprecated?):


#!/bin/bash
for file in *.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools call \
    --consensus-caller \
    --output-type z \
    --threads 18 \
    --samples $sample \
    --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
    $file
  done
done

Still results in empty output files.

Based on the repeated error about AN/AC counts, tried to fill in the AN/AC values…


bcftools plugin fill-AN-AC SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz \
--output-type z \
--threads 18 \
--output SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz
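
A quick way to check whether the plugin actually changed anything is to spot-check the INFO column (column 8) at the site from the earlier error, e.g.:

zcat SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz | \
awk '$1 == "NC_035780.1" && $2 == 26174 {print $8}'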

And, ran this code:


#!/bin/bash
for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools call \
    --consensus-caller \
    --output-type z \
    --threads 18 \
    --samples $sample \
    --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
    $file
  done
done

Still results in empty files…

Tried the original code again (with the shortened arguments expanded to improve readability):


#!/bin/bash
for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools view \
    --min-ac 1 \
    --output-type z \
    --samples $sample \
    --output-file ${file/.vcf*/.$sample.vcf.gz} \
    --threads 18 \
    $file
  done
done

P.S. I realize the outermost for loop is not necessary, but it was faster/easier to just quickly modify the code from that Biostars solution.
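
For reference, without that outer loop the same split would look something like this (same flags, with the input file simply hard-coded; a sketch rather than what I actually ran):

#!/bin/bash
# Per-sample split without the single-iteration outer loop.
file=SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz
for sample in `bcftools query -l $file`; do
  bcftools view \
  --min-ac 1 \
  --output-type z \
  --samples $sample \
  --output-file ${file/.vcf*/.$sample.vcf.gz} \
  --threads 18 \
  $file
done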


RESULTS

That worked! Or, at least it is producing non-empty, split VCF files! I’ll let Steven know and let him decide what impact (if any) the fill-AN-AC plugin had on the file(s)!

BTW, running with 18 threads on my computer, this took ~30mins to split each sample into its own VCF file. However, checking performance via htop, it certainly does not appear to be using 18 threads…
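
My guess (not verified) is that --threads mainly helps with compressing the output, so the per-sample subsetting itself runs essentially single-threaded. If speed becomes an issue, one option would be to run the per-sample extractions as separate concurrent jobs, e.g. with GNU parallel (assuming it’s installed):

# Run one bcftools view job per sample, up to 18 at a time.
file=SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz
bcftools query -l $file | \
parallel --jobs 18 \
  bcftools view --min-ac 1 --output-type z --samples {} \
  --output-file "${file/.vcf*/.{}.vcf.gz}" $file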

Output folder: 20181015_vcf_split

Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD (Mon, 18 Apr 2016)
http://onsnetwork.org/kubu4/2016/04/18/data-analysis-subset-olympia-oyster-gbs-data-from-bgi-as-single-population-using-pyrad/

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.

In this instance, the subset of individuals were all analyzed as a single population. I have another Jupyter notebook running on a different computer that will separate the three populations that are present in this subset.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.

[Screen cap: IGV error when loading the PyRAD-generated VCF]

One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on the files when they’re stored locally. Somewhere in the notebook you’ll notice that I copy the entire working directory from the server (Owl) to the local machine (Hummingbird). Things proceeded very quickly after doing that. Didn’t realize this would have so much impact on speed!!
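
The copy itself could be as simple as an rsync from the mounted share to local disk; something along these lines, with made-up placeholder paths rather than the actual Owl/Hummingbird locations:

# Copy the PyRAD working directory from the mounted server share to local disk.
rsync -avh /Volumes/owl/path/to/pyrad_working_dir/ ~/pyrad_working_dir/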

Jupyter Notebook: 20160418_pyrad_oly_PE-GBS.ipynb

NBviewer: 20160418_pyrad_oly_PE-GBS
