bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Difference between --atom-overlaps and --multi-overlaps

Open alexvasilikop opened this issue 1 year ago • 3 comments

Could someone briefly explain the difference between these two options?

Thanks a lot

alexvasilikop avatar Mar 27 '24 13:03 alexvasilikop

It's a bit confusing, isn't it.

The option --multi-overlaps is tied to the --multiallelics option. This one splits multiallelic sites into biallelic sites, for example the row AA > A,CC will be split into two rows AA>A and AA>CC.

On the other hand, the option --atom-overlaps is ties to the --atomize option. This one breaks a complex allele into its atomic constituents, for example the row AA>CC will be split into two rows A>C and A>C.

In both cases the program has to decide what to put in when heterozygous genotypes are split. For the --atomize case, see the example in the manual page

    # Before atomization:
    100  CC  C,GG   1/2

    # After:
    #   bcftools norm --atomize --atom-overlaps .
    100  C   G      ./1
    100  CC  C      1/.
    101  C   G      ./1

    # After:
    #   bcftools norm --atomize --atom-overlaps '*'
    #   bcftools norm --atomize --atom-overlaps\*
    100  C   G,*    2/1
    100  CC  C,*    1/2
    101  C   G,*    2/1

Analogously for the --multiallelics case.

pd3 avatar Mar 27 '24 16:03 pd3

As an aside, what does . mean in a single GT field? The specification impies everyt GT is element is numeric or every element is ., but doesn't allude to partial dots. If this has a meaning, then it's not explicitly listed in the specification. Logically extending it, I'd guess that ./1 is something the genotype is a mix of something unknown and ALT 1. How is that helpful though? It could be 0/1 or 1/1, yet we know the original data was 1/2 so 1/1 is an impossibility.

Also, why is atomization not using the phasing | notation? We know the two "C G" lines are phased as they started as CC,GG.

jkbonfield avatar Mar 27 '24 16:03 jkbonfield

I assume that the dot then means something unknown but by definition different than 1 since otherwise it would not appear as a variant. From what I understand when splitting multiple alleles then default option is to use another alternative allele in the biallelic site and not the reference for comparison.

In any case: it seems that multi-allelics --multi-overlaps is the option I am looking for ( I am looking to perform both splitting (of any type of ALT variants when they exist) into single .ALT variants and subsequently normalization. Would this be the options below. It is not clear to me that if splitting is done then normalization would also be performed.

bcftools sort -O v "!{sample_id}.single_sample.vcf" | \
 bcftools norm -m-any --multi-overlaps 0 --fasta-ref "!{ref_name}" --threads "!{num_threads}" -O v - | \
 bcftools sort -O v -o "${OUTPUT}

Another question, what is the difference between "any" and "both" in the -m option? Is "any' splitting any type of variants whereas "both" only if they are both of the same type?

alexvasilikop avatar Mar 28 '24 07:03 alexvasilikop

Variants are normalized after splitting. (Note there were some improvements in the latest release 1.20 to handle simultaneous atomization and splitting of multiallelics.)

Regarding 'both' and 'any', it is described in the manual page http://samtools.github.io/bcftools/bcftools.html#norm

indels .. only indel records should be split or merged
snps .. only SNV records 
both .. SNVs and indels should be merged separately into two records
any .. this is for merging only, SNVs and indels should be merged into a single record

pd3 avatar Apr 28 '24 11:04 pd3