Tag Archives: htslib

VCF Splitting with bcftools

Steven asked for some help trying to split a VCF in to individual VCF files.

VCF file (15GB): SNP.TRSdp5g95FnDNAmaf05.vcf.gz

Skip to the Results section if you don’t want to read through the tials and tribulations of getting this to work.

Here’s an overview of how I managed to get this to work and what didn’t work.

Figured out the VCF file needed to be sorted, bgzipped (part of htslib), and indexed with tabix, due to the following error when initially trying to process with VCF file using bcftools:

[W::vcf_parse] contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

So, I did that:

  • Sort and bgzip:
cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | \
bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz
  • Index with tabix:
tabix --preset vcf SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

This produced a separate file:

  • SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz.tbi.

It seems as though this file must exist in the same directory as the source VCF for it to be utilized, although no commands work directly with this index file.

Then, tried biostars solution, but produces an error


#!/bin/bash
for file in *.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools view -c1 -Oz -s $sample -o ${file/.vcf*/.$sample.vcf.gz} $file
  done
done

Resulting error:

[E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174

And empty split VCF files…

Tried tabix on unsorted bgzipped file yields this error:

[E::hts_idx_push] chromosome blocks not continuous

Tried modified sort:


cat SNP.TRSdp5g95FnDNAmaf05.vcf | \
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1V -k2,2n"}' | \
bgzip --threads 20 > SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz

Produces this error:

[E::bcf_calc_ac] Incorrect AN/AC counts at NC_035780.1:26174

And empty split VCF files…

Changed to new version of “view” – trying “call” instead (it seems that bcftools view is deprecated?):


#!/bin/bash
for file in *.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools call \
    --consensus-caller \
    --output-type z \
    --threads 18 \
    --samples $sample 
    --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
    $file
  done
done

Still results in empty output files.

Based off of the repeated error about AN/AC counts, tried to fill AN/AC values…


bcftools plugin fill-AN-AC SNP.TRSdp5g95FnDNAmaf05.sorted.vcf.gz \
--output-type z \
--threads 18 \
--output SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz

And, ran this code:


#!/bin/bash
for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools call \
    --consensus-caller \
    --output-type z \
    --threads 18 \
    --samples $sample 
    --output-file ${file/.vcf.gz/.$sample.vcf.gz} \
    $file
  done
done

Still results in empty files…

Try original code again (expanded shortened arguments to improve readability):


#!/bin/bash
for file in SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.vcf.gz; do
  for sample in `bcftools query -l $file`; do
    bcftools view \
    --min-ac 1 \
    --output-type z \
    --samples $sample \
    --output-file ${file/.vcf*/.$sample.vcf.gz} \
    --threads 18 \
    $file
  done
done

P.S. I realize the outermost for loop is not necessary, but it was faster/easier to just quickly modify the code from that Biostars solution.


RESULTS

That worked! Or, at least it is producing non-empty, split VCF files! I’ll let Steven know and let him decide what impact (if any) the fill-AN-AC plugin had on the file(s)!

BTW, running with 18 threads on my computer, this took ~30mins to split each sample into its own VCF file. However, checking performance via htop, it certainly does not appear to be using 18 threads…

Output folder: 20181015_vcf_split