Incorrect liftover - deletion within deleted region
Thanks for making such a useful tool! I have discovered one unusual edge case where it doesn't seem to work right. In VEGFC, there is a 3bp deletion in hg38 with respect to hg19 (177605082-177605084 in hg19). I have a variant that is a 2bp deletion within those three bases that were deleted. It should be lifted over to a 1bp insertion, but instead gets lifted over to a 2bp insertion.
hg19 nt hg38 4-177605081-CTCA-C 4-177605082-TCA-T 4-177605085-TC-T
177605081 C 176683930 C C
177605082 T - T
177605083 C - -
177605084 A - -
177605085 T 176683931 T T T
177605086 C 176683932 C C -
correct liftover 4-176683930-C-CTCA 4-176683930-C-CT 4-176683931-TC-T
CORRECT INCORRECT CORRECT
BCFtools liftover 4-176683930-C-CCA
Here's the line from the lifted VCF if that helps.
chr4 176683930 4-177605082-TCA-T C CCA . PASS SRC_CHROM=chr4;SRC_POS=177605082;SRC_REF_ALT=TCA,T;SWAP=1
Hmm, or maybe this variant doesn't make sense at all in hg38, if the two alleles are CTCA and CT, neither of which matches the reference of C. So should this one have failed?
This is not an easy one. If you try the following:
$ samtools faidx hg19.fa chr4:177605079-177605089
>chr4:177605079-177605089
AGCTCATCATT
$ samtools faidx hg38.fa chr4:176683927-176683936
>chr4:176683927-176683936
TAGCTCATTT
And if you try aligning the bases with liftOver one by one:
$ echo -e "chr4\t177605080\t177605081" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
chr4 176683929 176683930
$ echo -e "chr4\t177605081\t177605082" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
#Deleted in new
$ echo -e "chr4\t177605082\t177605083" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
#Deleted in new
$ echo -e "chr4\t177605083\t177605084" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
#Deleted in new
$ echo -e "chr4\t177605084\t177605085" | liftOver /dev/stdin hg19ToHg38.over.chain.gz /dev/stdout /dev/stderr
chr4 176683930 176683931
So that the alignment between the two references looks like this:
...AAGCTCATCATTT...
...TAGC---TCATTT...
And so the problem here is that BCFtools/liftover was unable to map the left anchor and tried to realign the region against the new reference but since neither the previous reference allele nor the previous alternate allele matched the new reference allele, it ended up doing something that you might judge as non optimal. You can get a bit more insight using the --no-left-align and the --write-nw options:
$ (echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr4>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "chr4\t177605082\t.\tTCA\tT\t.\t.\t.") | \
bcftools +liftover -- -s hg19.fa -c hg19ToHg38.over.chain.gz -f hg38.fa --no-left-align --write-nw | \
bcftools view -H
chr4 176683930 . CT CCAT . . NW=---|TCA,-|T;SWAP=1
But overall you have to be wary when you liftover variants in regions with additional variation across reference genomes
Thank you for your answer!