The genotypes do not reflect the allele coverage frequency
Dear,
I have 384 bam aligned samples from GBS technology. I ran the Thermofish Variant Caller (TVC) on them in order to get the genotypes of my variants. That seemed to work well but I wanted to use bcftools in order to compare both vcf files.
Here are 4 variants called with TVC at a specific locus for one single sample :
h2tg000001l 385471 A G 0/1
h2tg000001l 385474 A G 0/1
h2tg000001l 385508 G A 0/1
h2tg000001l 385517 A G 1/1
I ran bcftools on that sample with :
bcftools mpileup \
--threads 128 \
-q 10 -Q 30 --max-depth 1000 \
--skip-indels \
-f reference.fasta sample1.bam \
--annotate FORMAT/AD,FORMAT/DP,INFO/AD | \
bcftools call \
--threads 128 \
-mv -Oz -o sample1.vcf.gz
I usually use -q 10 -Q 30 to take in account only the robust variants, and --max-depth 1000 because some loci have high coverage. I got :
h2tg000001l 385471 A G 1/1
h2tg000001l 385517 A G 1/1
The differencies between TVC and bcftools :
- 4 variants detected by TVC, 2 by bcftools
- 2 are common. 1 out of the 2 has a different genotype
Here are the alignements at the locus :
There are the four potential variants (in the coverage track), and the black line representing the MAPQ10 threshold. bcftools removes the <MAPQ10 (-q 10) and then consider only the variants 1 and 4, as homozygous. I tried to trick the parameters of TVC to remove the <MAPQ10 alignments but nothing happens, and I got the same genotypes.
The variant 1 for example is given as 0/1 for TVC and 1/1 for bcftools. I did a test with -q 0 -Q 0 to consider all the reads. bcftools read the allele depth as 50%/50% for the variant 1 (then 0/1), but still outputed it as 1/1. Does bcftools do not consider the low mapq alignments even if I set up -q 10 ?
Also, in your opinion, do the soft-clipped reads could correspond to orthologs/paralogs that bcftools do not consider? Because we can see they could carry potentiel alleles at the variants 2 and 3.
Best