vt icon indicating copy to clipboard operation
vt copied to clipboard

Incorrect decompose

Open lh3 opened this issue 10 years ago • 3 comments

Example:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=11,length=135006516>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3  S4
11  101 .   GCGT    G,GCGA,GTGA,CCGT    199 PASS    .   GT  0/1 1/2 2/3 2/4

The vt decompose -s | vt normalize output is (irrelevant fields are skipped):

11      101     GCGT    G       0/1     1/.     ./.     ./.
11      101     G       C       0/.     ./.     ./.     ./1
11      102     CGT     TGA     0/.     ./.     ./1     ./.
11      104     T       A       0/.     ./1     1/.     1/.

But I tend to believe the right one should be:

11       101     GCGT    G       0/1      1/.      ./.      ./.
11       101     G       C       0/.      ./0      0/0      0/1
11       102     C       T       0/.      ./0      0/1      0/0
11       104     T       A       0/.      ./1      1/1      1/0

There are two problems with the vt output. Firstly, CGT=>TGA has not been decomposed. Secondly, several . should be replaced with the reference 0. For example, in sample S3, the original genotype is 2/3. Both allele 2 and 3 have the reference base at 11:101. The genotype at 101:G=>C should be 0/0, not ./.. To do the decompose right, vt needs to be aware of the allele sequences. BTW, also note that at 11:104, S3 has a homozygous A/A genotype, although in the original VCF, 2/3 appears to be a heterozygote.

EDIT: err... actually there is no "GCGT" on human chr11. Nonetheless, if we change coordinate to 61842 where there is a GCGT on the reference, the output is the same.

lh3 avatar May 18 '15 05:05 lh3

agreed. Justin Zook @jzook also raised a similar example in the past.

If you are just interested in sites correctness, the sequence of operations can be decompose | decompose_blocksub | normalize.

The original design was that decompose simply perform "vertical" decomposition and decompose_blocksub to perform "horizontal" decomposition.

To ensure genotype correctness, there's definitely a need to perform horizontal decomposition simultaneously with vertical decomposition.

I'll keep thinking about this and will provide a solution by next week. Thanks!

atks avatar May 18 '15 12:05 atks

@atks Was this ever resolved?

tseemann avatar Mar 30 '16 04:03 tseemann

no, there are quite a few complications to handle this and I've completely forgotten about this. Sorry about that. I can't revisit this right away, but I'll put it on my radar again.

atks avatar Mar 30 '16 05:03 atks