bcftools +prune behavior question: --AF-tag AF must be set for -N maxAF to work?
Hi bcftools team,
Thank you so much for this phenomenal set of tools. If you had a few minutes, I was hoping to ask some questions about the +prune plugin's behavior. I'm not sure if this is a bug or if there is something I'm missing here.
I've got a file that I used +fill-tags to add AF to. The ref sites (0/0) only had the AN field present, whereas the alt sites (0/1, 1/1) had AC and AN and thus properly had AF filled in. Nothing happened to the ref sites, since I guess AF couldn't be calculated there.
Then I wanted to use +prune to prioritize keeping the position with maxAF (so 1/1, then 0/1, and hopefully 0/0 after those) in windows of 1kb. To understand what was going on, I counted the number of sites in each genotype class before it was pruned and afterwards.
Unpruned file - 0/0: 23446373, 0/1: 209220, 1/1: 66777
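Per-class counts like these can be obtained in several ways. The following is a minimal sketch, not necessarily the exact command used here. It assumes a single-sample VCF (so the genotype is in column 10), and the function name `count_gt_classes` is made up for illustration.

```shell
# Hypothetical helper: tally genotype classes from VCF text on stdin.
# Assumes one sample (column 10) and GT as the first FORMAT field,
# which the VCF specification requires whenever GT is present.
count_gt_classes() {
  awk -F'\t' '
    !/^#/ {                      # skip header lines
      split($10, fields, ":")    # sample column, e.g. "0/1:12:..."
      n[fields[1]]++             # fields[1] is the GT string
    }
    END { for (gt in n) print gt ": " n[gt] }
  ' | LC_ALL=C sort
}

# Example use (bcftools view writes uncompressed VCF text to stdout by default):
#   bcftools view "$FILE.vcf.gz" | count_gt_classes
```

Running it on both the unpruned and pruned files gives numbers directly comparable to the ones reported in this post.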
With the understanding that the -N default was maxAF, I then ran:
1)
bcftools +prune $FILE.vcf.gz -Oz -n 1 -w 1kb > ${FILE}_pruned.vcf.gz
RESULT: pruned file - 0/0: 34868, 0/1: 47254, 1/1: 80
The result seems really odd to me - as if it is selecting against the maxAF (1/1) sites. The resulting proportion of the GT classes is very biased and thus the selection couldn't be random. Given that 99% of the sites in the vcf are 0/0, I also don't think this outcome is due to a -N first scenario either.
So I tried running it with -N maxAF stated explicitly:
2)
bcftools +prune $FILE.vcf.gz -Oz -n 1 -w 1kb -N maxAF > ${FILE}_pruned.vcf.gz
RESULT: pruned file - 0/0: 34868, 0/1: 47254, 1/1: 80
And I got the exact same thing. I was extra confused at this point since issue #1050 implied this command should have worked. However, it wasn't until I added the --AF-tag AF as well that the numbers started to make more sense...
3)
bcftools +prune $FILE.vcf.gz -Oz -n 1 -w 1kb --AF-tag AF -N maxAF > ${FILE}_pruned.vcf.gz
RESULT: pruned file - 0/0: 26580, 0/1: 29351, 1/1: 26350
And actually, the most important part of the command seems to be the --AF-tag AF, since when I ran it without -N maxAF but with --AF-tag AF, I got the same results as in scenario 3.
Is this normal behavior? I would have thought not specifying -N maxAF and letting the defaults take over would have implied --AF-tag AF, but even adding -N maxAF still led to a weird result.
Any idea why 1) and 2) gave those outcomes, and why it wasn't until --AF-tag AF was added that the results seemed to be more in line with what was expected?
Really appreciate your time and I hope at the very least this helped someone. Could very well be that I am missing something entirely, so sorry if that's the case! I should mention, bcftools was used for the calling, filtering, and all file processing - but happy to expand on the generation of the file if needed. Thanks again!
Kindly, Charity