vt
vt copied to clipboard
Incorrect decompose
Example:
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4
11 101 . GCGT G,GCGA,GTGA,CCGT 199 PASS . GT 0/1 1/2 2/3 2/4
The vt decompose -s | vt normalize
output is (irrelevant fields are skipped):
11 101 GCGT G 0/1 1/. ./. ./.
11 101 G C 0/. ./. ./. ./1
11 102 CGT TGA 0/. ./. ./1 ./.
11 104 T A 0/. ./1 1/. 1/.
But I tend to believe the right one should be:
11 101 GCGT G 0/1 1/. ./. ./.
11 101 G C 0/. ./0 0/0 0/1
11 102 C T 0/. ./0 0/1 0/0
11 104 T A 0/. ./1 1/1 1/0
There are two problems with the vt
output. Firstly, CGT=>TGA
has not been decomposed. Secondly, several .
should be replaced with the reference 0
. For example, in sample S3, the original genotype is 2/3
. Both allele 2
and 3
have the reference base at 11:101
. The genotype at 101:G=>C
should be 0/0
, not ./.
. To do the decompose right, vt
needs to be aware of the allele sequences. BTW, also note that at 11:104
, S3 has a homozygous A/A
genotype, although in the original VCF, 2/3
appears to be a heterozygote.
EDIT: err... actually there is no "GCGT" on human chr11. Nonetheless, if we change coordinate to 61842 where there is a GCGT on the reference, the output is the same.
agreed. Justin Zook @jzook also raised a similar example in the past.
If you are just interested in sites correctness, the sequence of operations can be decompose | decompose_blocksub | normalize.
The original design was that decompose simply perform "vertical" decomposition and decompose_blocksub to perform "horizontal" decomposition.
To ensure genotype correctness, there's definitely a need to perform horizontal decomposition simultaneously with vertical decomposition.
I'll keep thinking about this and will provide a solution by next week. Thanks!
@atks Was this ever resolved?
no, there are quite a few complications to handle this and I've completely forgotten about this. Sorry about that. I can't revisit this right away, but I'll put it on my radar again.