bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools isec --complement non-optimal behavior when duplicate variants are present

Open freeseek opened this issue 4 years ago • 2 comments

This is a bit of a weird corner case, but if I have a VCF file with two identical variants chr1 20000 A G:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr1\t10000\tx\tA\tC\t.\t.\t."
echo -e "chr1\t20000\ty\tA\tG\t.\t.\t."
echo -e "chr1\t20000\tz\tA\tG\t.\t.\t."
echo -e "chr1\t30000\tt\tA\tT\t.\t.\t.") | bcftools view --no-version -Oz -o A.vcf.gz
bcftools index -ft A.vcf.gz

And I attempt to remove the duplicate variant chr1 20000 A G using the simple VCF:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr1\t20000\ty\tA\tG\t.\t.\t.") | bcftools view --no-version -Oz -o B.vcf.gz
bcftools index -ft B.vcf.gz

I obtain using bcftools isec --complement:

$ bcftools isec --no-version --complement --write 1 A.vcf.gz B.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	10000	x	A	C	.	.	.
chr1	20000	z	A	G	.	.	.
chr1	30000	t	A	T	.	.	.

That only one instance of the chr1 20000 A G variant gets removed. Wouldn't it be more appropriate that both instances be removed?

freeseek avatar Jul 24 '21 22:07 freeseek

There is the bcftools norm --rm-dup option, would that be a good solution to your problem?

pd3 avatar Aug 03 '21 09:08 pd3

Indeed, that is what I am using now. But I was wondering whether I had discovered by chance an unintended behavior of bcftools isec.

freeseek avatar Aug 03 '21 14:08 freeseek