bcftools
bcftools copied to clipboard
bcftools isec --complement non-optimal behavior when duplicate variants are present
This is a bit of a weird corner case, but if I have a VCF file with two identical variants chr1 20000 A G:
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr1\t10000\tx\tA\tC\t.\t.\t."
echo -e "chr1\t20000\ty\tA\tG\t.\t.\t."
echo -e "chr1\t20000\tz\tA\tG\t.\t.\t."
echo -e "chr1\t30000\tt\tA\tT\t.\t.\t.") | bcftools view --no-version -Oz -o A.vcf.gz
bcftools index -ft A.vcf.gz
And I attempt to remove the duplicate variant chr1 20000 A G using the simple VCF:
(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr1\t20000\ty\tA\tG\t.\t.\t.") | bcftools view --no-version -Oz -o B.vcf.gz
bcftools index -ft B.vcf.gz
I obtain using bcftools isec --complement:
$ bcftools isec --no-version --complement --write 1 A.vcf.gz B.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 10000 x A C . . .
chr1 20000 z A G . . .
chr1 30000 t A T . . .
That only one instance of the chr1 20000 A G variant gets removed. Wouldn't it be more appropriate that both instances be removed?
There is the bcftools norm --rm-dup option, would that be a good solution to your problem?
Indeed, that is what I am using now. But I was wondering whether I had discovered by chance an unintended behavior of bcftools isec.