Improved Structural Variant Support
Various drafts of improved Structural Variant support have been floating around for 2.5 years (see #231, #266) but never merged. I'm attempting to collate everything together in this PR but it is unclear as to what is in scope.
The following changes were agreed back in 2017 by Cristina Yenyxe Gonzalez, Steve Huang, Daniel Cameron, Xuefang Zhao, Tobias Rausch, Tim Hefferon, John Lopez, Chris Whelan:
- Restrict SVTYPE to the 6 primitives
- SVTYPE will be re-cast as a "basic primitive"; its distinction from EVENTTYPE will be made clear in the spec
- SVTYPE will have the following closed controlled vocabulary: DEL, INS, DUP, INV, CNV, BND
- No colons or subtypes are allowed in SVTYPE value
- All SV VCF records must include a value for SVTYPE
- EVENTTYPE will be added to the spec. It will contain the "biological interpretation" of the variant
- EVENTTYPE will have an open controlled vocabulary, to include: INS, MEI, ALU, L1, SVA, HERV, DEL, -MEI-, -ALU-, -L1-, -SVA-, -HERV-, INV, DUP, DELINS, CNV
- EVENT will continue to be used to specify identifiers to link together multiple VCF records
- EVENT is currently discussed primarily in the graphical BND-notation section of the spec; text will be added elsewhere, in relevant context area(s) of the spec.
- CIPOS and CIEND will continue to be used as they have in the past: as the primary means of representing an interval within which a breakpoint is likely to fall.
- The spec will be amended with examples, as needed, to illustrate the preferred usage of CIPOS and CIEND.
- New tags CIPOSPROB and CIENDPROB will be introduced to the spec as a means to indicate the level of confidence implied by CIPOS and CIEND, respectively.
- Each CI*PROB value shall contain two (2) values separated by a comma. These values shall represent the proportion of a normal distribution expected to fall outside the values recorded in CIPOS. For example, CIPOS=(-50,100) and CIPOSPROB=(0,0.95) indicate the probability that the breakpoint lay more than 50 bp to the left of POS is zero, and the probability that it lay more than 100 bp to the right of POS is 0.95.
- New tag HOMPOS will be introduced to the spec. Its value shall represent the coordinate on the given contig of the first basepair of the microhomology indicated by HOMSEQ.
- The definition of HOMPOS will be: Position (relative to POS) of base pair identical micro-homology around event breakpoint. Note that length(HOMSEQ)=HOMLEN=HOMPOS[1]-HOMPOS[0]
- BND will be officially added to the SVTYPE controlled vocabulary, and it will be referenced consistently throughout the spec (unlike now). It will NOT be replaced by other terms that were discussed, such as TRA (for translocation) or ADJ (for adjacency).
- Examples of each type of event shall be drawn up and added to the spec, including one in "standard" notation and another in the equivalent "BND" notation. Tim and Daniel will draft these examples, respectively.
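The SVTYPE rules in the list above (closed vocabulary, no colons or subtypes, mandatory on SV records) could be checked mechanically. The following is only a minimal sketch under those stated rules; `parse_info` and `check_svtype` are hypothetical helper names, not part of any existing library or of the spec.

```python
# Closed SVTYPE vocabulary as listed in the agreed changes above.
SVTYPE_VOCABULARY = {"DEL", "INS", "DUP", "INV", "CNV", "BND"}


def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def check_svtype(info):
    """Return a list of problems found with the SVTYPE value (hypothetical helper)."""
    svtype = parse_info(info).get("SVTYPE")
    if svtype is None or svtype is True:
        return ["SV record has no SVTYPE value"]
    if ":" in svtype:
        return [f"SVTYPE={svtype} contains a colon or subtype"]
    if svtype not in SVTYPE_VOCABULARY:
        return [f"SVTYPE={svtype} is not in the closed vocabulary"]
    return []


print(check_svtype("SVTYPE=DEL;END=2000"))   # no problems reported
print(check_svtype("SVTYPE=DUP:TANDEM"))     # colon/subtype is flagged
print(check_svtype("SVTYPE=TRA"))            # TRA was explicitly not adopted
```

The check is deliberately limited to SVTYPE; EVENTTYPE has an open vocabulary and would need a different treatment.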
Additional changes that I'd like to see are:
- Either 1 SV per record, or SV fields counts standardised to handle multiple SV records
  - the latter was blocked by VCF not supporting list of lists
- A field to resolve STR expansion ambiguity (e.g. HOMSTRIDE)
- Sub-clonality support (for all variants)
- genotyping support for somatic SVs. There are two issues with the current specs:
  - copy number != ploidy
  - maternal/paternal haplotypes are still meaningful for somatic SV, it's just that there can be many copies of each.
- Karyotype reconstruction
  - a 'next SV' field is sufficient
  - PSL could be adapted to handle this
- Explicit clarification around SVTYPE about what the claim is
  - currently both CNV and SV callers write DELs so it's unclear if a DEL is claim of a breakpoint adjacency, a segmental loss, or both
@lbergelson @yfarjoun @pd3 What's the best way forward from here that minimises the chances of us sitting on PRs for another 2 years?
See #448 for an example of how real-world tools are using fields in a non-compliant manner to work around the lack of proper subclonal support.
I've now stripped the bit where DUP subtypes defined different breakpoints than the root types. This badly breaks backwards compatibility and can be better handled as part of the EVENT/EVENTTYPE PR.
@d-cameron Added a few comments on places that I think could use clarifications -- sorry that I didn't make them sooner. Again, thanks for spearheading this change.
Ok, there's quite a bit of discussion around DUP events. I think the real underlying issue is that we don't all agree on what a claim of a DUP is, and we already have an ecosystem in which different tools are making different claims.
Tools based on micro-array or copy number evidence report DUP as a region of elevated copy number (typically +1 copy). They are making a claim about the number of copies of the duplicated region. Tools based on NGS report DUP when they find a breakpoint in an orientation consistent with a tandem duplication.
In conclusion, the VCF specs don't actually specify what symbolic structural variants mean, so different tools make different claims. Our options are:
1. Define DEL/DUP as breakpoint claims
2. Define DEL/DUP as CN claim
3. Define DEL/DUP as both breakpoint and CN claim
4. Add an additional field which is used to clarify structural symbolic allele claims, and existing calls remain ambiguous.
My preference is option 4.
Thoughts?
Option 5: grandfather v4.2 or earlier as ambiguous, and make an unambiguous choice if the header is v4.3.
Tools based on micro-array or copy number evidence report DUP as a region of elevated copy number (typically +1 copy). They are making a claim about the number of copies of the duplicated region. Tools based on NGS report DUP when they find a breakpoint in an orientation consistent with a tandem duplication.
Most integrated WGS/NGS germline SV calling pipelines that I've seen being developed for large scale studies include both PE/SR/ASM based calls that would have breakend support as well as depth based calls that don't. You just can't capture the breakpoints of a lot of germline CNVs with short reads, and I think we'll be getting lots of mixed VCFs in the future as integrated pipelines are more widely deployed.
- Add an additional field which is used to clarify structural symbolic allele claims, and existing calls remain ambiguous.
Just to clarify, are you suggesting something like a SVBKPT INFO field, which, if present, clarifies that there is a breakend claim?
Just chiming in here w/ opinion that tandem duplications & repeat expansions are different from CNVs & should be considered (a) separate entit(y|ies) (e.g. having annotations from sequence, times).
Otherwise liking the move towards resolving ambiguous annotations... But IMO a column for a second reference (chro) is needed, when keeping the columnar format :-)
Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing).
And on a tangent, what is the likelihood of a Tabix index optionally supporting CIPOS/CIEND?
But IMO a column for a second reference (chro) is needed, when keeping the columnar format :-) @mbaudis, I am not clear on what you mean here. Please elaborate.
Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing). @rhdolin, by "PRECISE" do you mean the examples had a PRECISE flag, or just that they did not have the IMPRECISE flag? If the former, I admit I find that confusing as well.
@rhdolin
Doesn't PRECISE/IMPRECISE achieve what INFO.SVBKPT would? (Although I've seen examples that are both PRECISE and have CIPOS and CIEND, which I find confusing).
Usually IMPRECISE is only used by callers that are using paired-end mapping signal to identify variants that show a read mapping signature but for which the exact breakpoint cannot be determined, either because there are no split reads or assembly-based evidence at the site, or the caller didn't look for it. This is different from a depth-based CNV caller, which is not making a claim about the breakpoint structure -- it's just reporting that an extra copy of a reference segment exists in the sample. The breakpoint structure of a CNV call could be much more complex than a simple tandem duplication -- imagine that segments from two chromosomes are duplicated, joined together, and inserted on a third chromosome. If the spec said something like, "if an SVTYPE=DUP variant is marked IMPRECISE, it is making a claim of a tandem duplication", it might have the same effect as the SVBKPT INFO field I was talking about above (or some other similar solution), but I think it would be a bit backwards from its original intent and hard to comprehend.
@d-cameron I think my top preference from your list is your option 4: Keep DEL / DUP ambiguous but add an INFO field indicating whether there's a breakpoint claim of a simple deletion or tandem duplication (which can then be used with IMPRECISE and related fields).
I would also consider option 5 (as I understand it this is making DUP and DEL into breakpoint claims and requiring non-tandem DUPs to be written with SVTYPE=CNV) as long as the new version of the spec contained strong, clear language warning about the difference in the interpretation, particularly of DUP, from older VCFs. Do you know if any SV tools are currently writing VCFs marked 4.3? I would hate to change the interpretation of variants from tools that are in active use without changing the VCF version number.
Just to be clear, I support the goal of better codifying symbolic SV alleles so that it's easier to produce a less ambiguous sequence interpretation.
One more comment on CNVs: cohort-based CNV callers can disentangle different copy number variable alleles even without breakpoint evidence. For example, in the case of overlapping duplications with clearly different boundaries (but for which we don't know the breakpoint structure):
DUP1: |-------------|
DUP2: |------------|
There are three "copy number variable regions" here, but you can sometimes use phasing and parsimonious modeling of allele frequencies to distinguish each sample's genotype for the DUP1 and DUP2 alleles. If we restrict non-breakpoint based claims to use the <CNV> alt allele, I think we'd have to change the description of the alt allele in the spec away from Copy number variable region to something more like Allele that changes the copy number of the reference segment. We'd probably also then need an additional INFO field to say in what direction and by how much the copy number changes for that particular allele. To me this lends support for my preference for your option 4 (allowing the use of <DUP> as the alts for these variants, and adding INFO fields for breakpoint support if they are tandem/simple deletions).
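To make the "three copy number variable regions" concrete: two overlapping duplication alleles split the locus into a DUP1-only segment, a shared segment, and a DUP2-only segment. The sketch below derives those segments from allele intervals. The coordinates are invented for illustration (the diagram above does not give any), intervals are treated as half-open, and `copy_number_segments` is a hypothetical helper.

```python
def copy_number_segments(alleles):
    """Split overlapping allele intervals into segments with a constant set of covering alleles.

    alleles: dict mapping allele name -> (start, end), half-open coordinates.
    Returns a list of (start, end, [names of covering alleles]).
    """
    # Every allele boundary is a potential change point in copy number.
    boundaries = sorted({pos for start, end in alleles.values() for pos in (start, end)})
    segments = []
    for left, right in zip(boundaries, boundaries[1:]):
        covering = sorted(
            name for name, (start, end) in alleles.items()
            if start <= left and right <= end
        )
        if covering:  # skip gaps that no allele covers
            segments.append((left, right, covering))
    return segments


# Hypothetical coordinates for two overlapping duplications.
for segment in copy_number_segments({"DUP1": (1000, 5000), "DUP2": (3000, 7000)}):
    print(segment)
```

This only recovers the segmentation; assigning each sample a genotype for DUP1 and DUP2 would still need the phasing or allele-frequency modelling described above.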
I suggest we adopt something like @cwhelan 's working definition of Option #5, perhaps: “DUP and DEL are breakpoint claims; non-tandem DUPs (i.e. strictly copy number based claims) should be written as SVTYPE=CNV.” However, there is a consequence: moving dispersed DUPs to CNV requires broadening the definition of THAT category to include single-copy-gain dispersed dups.
The August 22, 2019 pdf of v4.3 uses the following definition for CNV:
“Copy number variable region (may be both deletion and duplication)”
or, as pointed out earlier in the current discussion (https://github.com/samtools/hts-specs/pull/465#pullrequestreview-347971907):
“Copy number variable region (multiallelic)”
A freshly-redefined ‘CNV’ would include all of the following variants:
1. regions in which both a duplication and a deletion have been observed
2. regions one would call “multiallelic” – that is, in which three or more alleles (reflecting discrete copy number increments) have been reported
3. the “new kid on the block”: dispersed duplications (for now let's limit these to increases of just one copy number relative to reference)
#3, as defined above, can reasonably be called ‘biallelic’. In contrast, the very broad perception is that “SVTYPE=CNV” implies multiple alleles (in fact, gnomAD's SV VCF uses 'MCNV' instead of 'CNV' - maybe this is a change we want to consider adopting in the spec). So the ‘CNV’ definition will need to be re-written to include the new kid on the block.
I think it may also be important to recognize, if only conceptually, that any called insertion can represent (depending on the nature and extent of the analysis involved) a dispersed duplication or a tandem duplication. Distinguishing among these possibilities requires an analysis of the insertion’s sequence content and immediate genomic context.
Hi, all,
Chris Whelan brought this discussion to my attention. I want to share some of the use cases I currently implement in Genome STRiP and other software we develop/use in our lab.
From my perspective, it would be ideal if the VCF specification allowed these kinds of representations. In general, I also think it is useful to allow the VCF format to be flexible enough to permit tool-specific extensions in a spec-compliant VCF.
1. SVTYPE. Although we could do this with a different tag, we currently use SVTYPE to default certain genotyping behavior. As one example, if SVTYPE=DEL, then we will never assign INFO:CN > ploidy and all genotype likelihood calculations assume a bi-allelic variant where one allele is the reference and the other a deletion allele. This is in contrast to SVTYPE=CNV, which allows INFO:CN to be greater or less than ploidy.
2. For unphased copy number variants, we use <CNV> as the ALT allele. The code is also not picky about the ref allele, and in particular N is allowed (regardless of the actual reference base at POS). We use the <CNV> representation, in particular, when there is uncertainty about the true breakpoint (for example, if there are segmental duplications) or when the code does not want to make any assertion about the breakpoint location.
3. For phased copy number variants (or for "partitioned" but not phased variants), we use the notation <CN:n> to represent the allelic copy number. For example, something like this (tabs changed to spaces to condense):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 SAMP2 SAMP3 SAMP4
chr1 1000 . <CN:1> <CN:0>,<CN:2> . . . GT:CN 0|0:2 0|1:1 2|2:4 1|2:2
Older versions of the code used <CNn> (with no colon) as the allele representation, but this caused difficulties because strict interpretation of the specification said that all ALT alleles had to be defined in the header and this in turn required knowing in advance all of the alleles that might be encountered or writing the output file in two passes. The idea behind <CN:n> is that the allele type/template is defined once in the header, but n is variable.
4. For complex structures, such as the C4 locus (Sekar, 2016; Kamitaki, 2020 (under review)) and other examples, we represent the structural haplotypes as particular alleles with encodings of the structure representing the key biology. For example, at C4, we previously defined ALT alleles like <H_n_n_n_n_x>, for example <H_2_1_1_1_B>. The encoding is described in context (or in the description field). For C4, the encoding is the (allelic) copy number of total C4, C4A, C4B, the HERV element and an optional character suffix identifying a particular haplotype (among structurally equivalent, recurrent haplotypes that appear to have arisen independently in humans). As one additional example, we might label an allele as <H_3_1_2_3_insCT>, representing a haplotype that carries 1 copy of C4AL, two copies of C4BL, and a common frame shift variant of potential phenotypic importance.
While there are other representations one could choose, these were convenient, reasonably human readable (at least for us) and worked well with other tools, such as beagle, without requiring modifications to the downstream tools.
I would also like to say that while I don't expect such representations to be standardized to the point of interoperability or diagnostic use, I do think it is useful to allow the VCF format to have some flexibility, enough to permit experimentation and innovation within the standard through the use of certain conventions, such as allele naming conventions like I describe above, which might only be understood by certain tools.
At the same time, I think it is good if it is easy for tools encountering an unrecognized allele format to not make any special assumptions about it.
If VCF does not have this flexibility, then I think the alternative will be to use other non-standard file formats, which I think leads to more file conversion, more friction trying to get tools to work together, more potential for errors, etc.
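As a rough illustration of how the <CN:n> notation in the example record above can be read, the sketch below maps each allele to its allelic copy number and sums the copy numbers of the alleles named in GT. It mirrors the record exactly as written in this comment (including the symbolic REF); the function names are hypothetical and this is not Genome STRiP's actual code.

```python
import re

# Matches a parameterized allele such as <CN:2> and captures n.
CN_ALLELE = re.compile(r"<CN:(\d+)>")


def allele_copy_numbers(ref, alt):
    """Return the allelic copy number of each allele, indexed as in GT (0 = REF)."""
    copies = []
    for allele in [ref] + alt.split(","):
        match = CN_ALLELE.fullmatch(allele)
        if match is None:
            raise ValueError(f"not a <CN:n> allele: {allele}")
        copies.append(int(match.group(1)))
    return copies


def total_copy_number(gt, copies):
    """Sum allelic copy numbers over the alleles listed in a GT string like '1|2'."""
    return sum(copies[int(index)] for index in re.split(r"[|/]", gt))


copies = allele_copy_numbers("<CN:1>", "<CN:0>,<CN:2>")
for gt in ("0|0", "0|1", "2|2", "1|2"):
    print(gt, total_copy_number(gt, copies))
```

For the four samples in the example, the sums agree with the FORMAT CN values given in the record (2, 1, 4 and 2).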
It seems to me like @bhandsaker 's use cases shake out like this with respect to this pull request:
1. This is incompatible with the proposed changes since it can be applied to copy number variants without breakpoint claims. Either we'd have to revert to leaving SVTYPEs of DEL and DUP ambiguous with respect to breakpoint claims and encode that information in an INFO field (as in the option 4 above), or Genome STRiP would have to switch to using a different INFO tag to encode this logic (which doesn't seem like the end of the world to me personally, although @bhandsaker might have a different reaction, as long as we carefully explain the differences in semantics between VCF versions in the specs).
2. This seems fine with the PR language: <CNV> is an allowed top-level allele.
3. <CN:x> is not allowed by the current language in the PR, but maybe CN is worth adding as another type of top-level allele, for phased or "partitioned" copy number genotypes that are not making breakend claims.
4. It seems to me that arbitrary symbolic alleles can still be allowed as long as they are not given an SVTYPE (which would trigger the more rigorous parsing that this PR is trying to enable). The internal structure of this sort of allele shouldn't really be standardized in the spec.
For 3 and 4, it'd be very useful if the spec allowed "parameterized" symbolic alleles -- like <CN:x>, but where you don't have to enumerate every single possible value of x with its own header line. Maybe that's something to tackle in a different PR, though?
@d-cameron @thefferon @bhandsaker Does this summary seem right to you? Thoughts?
It seems to me that arbitrary symbolic alleles can still be allowed as long as they are not given an SVTYPE (which would trigger the more rigorous parsing that this PR is trying to enable). The internal structure of this sort of allele shouldn't really be standardized in the spec.
CN:x is not allowed by the current language in the PR
CN:x is still allowed, it just has undefined semantics and is not considered a "non-structural symbolic allele". We just need to remove this phrase and define unknown symbolic alleles as having implementation-dependent semantics.
maybe CN is worth adding as another type of top-level allele, for phased or "partitioned" copy number genotypes that are not making breakend claims.
Allele-specific copy number is already supported purely by specifying the sample alt allele genotype. We're spreading the info required to correctly parse an event around quite a bit with such a change. We can already encode this information as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP1 SAMP2 SAMP3 SAMP4
chr1 1000 . N <CNV>,<CNV> . . CN=0,2 GT:CN 0|0:2 0|1:1 2|2:4 1|2:2
Except for the problem that, like many other SV fields, the CN field only supports a single value and thus only a single SV allele per record. If we fix that issue, then there wouldn't be a need for CN:x ALTs.
This is incompatible with the proposed changes since it can be applied to copy number variants without breakpoint claims. Either we'd have to revert to leaving SVTYPEs of DEL and DUP ambiguous
I quite like the idea of using SVTYPE for specifying the claim type. To make a copy number based del claim, you would write ALT=DEL, SVTYPE=CNV. It's still not backwards compatible but it's less of a change than any of the other proposals. ALT=DEL, SVTYPE=CNV, would essentially be a generic claim of copy number loss, or a specific claim of ASCN=0 when genotyped. Similarly DUP/CNV would be defined generically as elevated copy number up to twice sample ploidy, or ASCN=2 when genotyped.
it'd be very useful if the spec allowed "parameterized" symbolic alleles
At a high level, I'm not a fan of encoding information into the ALT allele when it can already be specified elsewhere.
we use the notation CN:n to represent the allelic copy number.
Technically using CN:1 as REF is a violation of the specs as REF must be [ACGTN]+. 1.6.1.4 specifies that the REF should be the base immediately preceding the variant, so the CN:1, whilst human-readable and explicitly a CN call, is not syntactically valid and will result in a parsing failure if spec-compliant.
Edit: CN exists as both INFO and FORMAT
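The REF constraint mentioned above is easy to express as a check. This is a minimal sketch using the [ACGTN]+ pattern exactly as quoted in this comment; restricting it to uppercase is an assumption of the sketch, and `is_valid_ref` is a hypothetical name.

```python
import re

# Pattern as quoted in the comment above; uppercase-only is an assumption here.
REF_PATTERN = re.compile(r"[ACGTN]+")


def is_valid_ref(ref):
    """Return True if the whole REF string consists of one or more of A, C, G, T, N."""
    return REF_PATTERN.fullmatch(ref) is not None


print(is_valid_ref("N"))        # a single N is accepted
print(is_valid_ref("<CN:1>"))   # a symbolic allele in REF is rejected
```

A strict parser applying this rule would reject the `<CN:1>` REF from the earlier example before it ever looked at the ALT alleles.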
@d-cameron Thanks for your answer.
CN:x is still allowed, it just has undefined semantics and is not considered a "non-structural symbolic allele". We just need to remove this phrase and define unknown symbolic alleles as having implementation-dependent semantics.
Just to be 100% clear, what triggers having something be a "structural" vs "non-structural" symbolic allele? Is it just that the top level is one of the enumerated types in the SV section? Or is it triggered by having an SVTYPE on the variant, i.e. if you have an SVTYPE defined you need to use one of the closed set of top-level alleles? In the conversation above, @jmmut said it should be the former, which is fine, but I think the spec should say so explicitly.
Allele-specific copy number is already supported purely by specifying the sample alt allele genotype. We're spreading the info required to correctly parse an event around quite a bit with such a change. We can already encode this information as follows:
chr1 1000 . N <CNV>,<CNV> . . CN=0,2 GT:CN 0|0:2 0|1:1 2|2:4 1|2:2
I had actually forgotten about the INFO level CN tag -- I'm not actually sure what it's supposed to mean from reading the description in the current spec.. the copy number of the alternate allele? Does anyone use it? Maybe it might be better to rename it to ASCN rather than CN to avoid confusion with the FORMAT CN?
This could work, but I don't really love having multiple identical alleles in the ALT field -- I think it really decreases human readability, and seems like it could be prone to bugs in code. I guess in practice the alleles could be renamed to <CNV:0>, <CNV:2> etc by individual tools for readability while keeping a top level allele in the closed set.
Technically using CN:1 as REF is a violation of the specs as REF must be [ACGTN]+. 1.6.1.4 specifies that the REF should be the base immediately preceding the variant, so the CN:1, whilst human-readable and explicitly a CN call, is not syntactically valid and will result in a parsing failure if spec-compliant.
Fair enough, that's a good point.
Hi,
Most of the folks on this thread are YEARS ahead of me in understanding the nuances of SVs, so I hope folks can bear with me for a bit of a more basic comment... The biggest challenge I'm facing now, trying to extract SVs from VCFs for use in a clinical decision support pipeline, is the inconsistent use of INFO.SVTYPE, PRECISE/IMPRECISE, INFO.CIPOS, INFO.CIEND, FORMAT.GT, and FORMAT.CN. I'm afraid I don't remember where some of my sample VCFs originated, but I'm seeing cases where, for instance, SVTYPE=INS, GT=0/1, CN=2 AND SVTYPE=INS, GT=1/1, CN=2. I'm seeing cases where SVTYPE=CNV and there is no CN. I'm seeing cases that are PRECISE and include CIPOS/CIEND. etc.
What would be super helpful would be more guidance in the VCF spec around how each of these fields are to be used depending on SVTYPE. I'd like guidance on how CIPOS/CIEND overlap with dbVAR's inner/outer start/end. It would help to have a validation engine that allows me to determine if a particular variant caller is putting out conformant SV calls. Also, it would be useful if I could optionally include CIPOS/CIEND in tabix indexing so that I can more reliably extract all or potentially all SVs in a range of interest.
There you have my wish list :-)
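For the range-extraction wish in particular, one workaround that needs no index changes is to widen each record by its confidence intervals before testing overlap. The sketch below is a plain linear check, not a tabix feature; it assumes 1-based inclusive coordinates, that CIPOS and CIEND are offsets relative to POS and END, and that a missing END means END equals POS. The helper names are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def widest_interval(pos, info):
    """Return (start, end) extended by the lower CIPOS and upper CIEND offsets."""
    fields = parse_info(info)
    end = int(fields.get("END", pos))
    cipos_low, _ = (int(v) for v in fields.get("CIPOS", "0,0").split(","))
    _, ciend_high = (int(v) for v in fields.get("CIEND", "0,0").split(","))
    return pos + cipos_low, end + ciend_high


def may_overlap(pos, info, query_start, query_end):
    """True if the CI-extended record could overlap the query range."""
    start, end = widest_interval(pos, info)
    return start <= query_end and query_start <= end


info = "SVTYPE=DEL;END=2000;CIPOS=-50,100;CIEND=-100,50;IMPRECISE"
print(widest_interval(1000, info))
print(may_overlap(1000, info, 2020, 2100))
```

A record that ends at 2000 would be missed by a query starting at 2020 on POS/END alone, but is retained here because CIEND allows the end to lie up to 50 bp further right.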
A bit of feedback:
- I am in favor of making SVTYPE=DUP synonymous with 'tandem duplication'*. As reflected in the January 30, 2020 pdf of the spec.
- I am not convinced in the case of deletions, however. Is there a valid distinction, for the purposes of the spec, between a "breakpoint claim" loss and a "copy number claim" loss? It makes sense to me that we move read-depth-detected copy number gain from SVTYPE=DUP to SVTYPE=CNV – because you don't know the genomic location of the gain. But the same rationale does not apply to a read-depth-detected copy number loss – where we do know the location of the loss (albeit not to basepair resolution).
* A tandem duplication can be:
- direct (same strand orientation as reference copy)
- inverted-upstream (opposite strand orientation, inserted upstream of reference copy)
- inverted-downstream (opposite strand orientation, inserted downstream of reference copy)
I am still concerned that the current spec can only describe the first case. Maybe this is not a big real-world problem; if that is the case then I am willing to drop the subject.
I hope I haven't missed the point in the above discussion, but I support the extant practice (and its explicit addition to the spec) of treating allele-specific copy number changes as SVTYPE=CNV and ALT=<CN=0>,<CN=1>,<CN=2>,<CN=3>... even if this means making 'CN=*' a member of the closed set of first-level ALT values. In my experience this model works, so I prefer to allow it unless it causes undue difficulties to tool developers or obscures data comprehension.
I am afraid this PR is getting dangerously technical and the risk of it getting stuck for another two years is rising. I suggest that we focus on a minimal set of changes for the most basic SV types only and do not continue discussion of more complicated cases in this PR. Once it is merged, we can build on it with other PRs.
In terms of the actual changes, I propose the following approach. Much of it has already been said before, I am mostly just summarising.
Consensus proposal, v1
Completely decouple breakpoint and copy number claims
- DEL, DUP, INS, INV, BND would be strictly breakpoint claims. DUP would represent a simple tandem duplication with the same orientation to the reference.
- CNV would be a strictly copy number claim. For consistency, it would have to include non-multiallelic events, for example simple deletions and duplications, if they have not been detected up to breakpoint resolution. We can focus on further standardising CNVs in a later PR. (In fact, I would be willing to take this part of the work, since I also participate in the ELIXIR hCNV project.)
Synchronise the lists of top-level types accepted by symbolic alleles and by SVTYPE
Currently the lists of six SV types are the same, but their meanings are slightly different in those two contexts, and the specification does not provide the allowed combinations of symbolic alleles and SVTYPEs, or their interpretations. This, for example, allows the existence of:
- <CNV> with SVTYPE=DUP;
- <DUP> with SVTYPE=CNV;
- <CNV> with SVTYPE=CNV and elevated copy number specified by CN.
This has got to be really confusing for a user because it is not clear if those combinations are valid and whether they are identical to each other. Additional confusion is brought about by the fact that DUP means a strictly tandem duplication in a symbolic allele context, but any duplication when in SVTYPE context.
I suggest exactly synchronising the meaning of SVTYPE and the symbolic alleles. In fact, we could get rid of SVTYPE definitions entirely and just say that the list of allowable values is exactly equal to the top-level SV types for symbolic alleles, and that the value of SVTYPE must be always exactly equal to the top-level type of the structural symbolic allele.
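The synchronisation rule suggested here amounts to a one-line comparison once the top-level type has been extracted from the symbolic allele. A minimal sketch, assuming only angle-bracket symbolic alleles (breakend-notation ALTs are not handled) and using hypothetical function names:

```python
def top_level_type(alt):
    """Return the top-level type of a symbolic allele, e.g. '<DUP:TANDEM>' -> 'DUP'.

    Returns None for anything that is not written in <...> form.
    """
    if not (alt.startswith("<") and alt.endswith(">")):
        return None
    return alt[1:-1].split(":")[0]


def svtype_matches_allele(alt, svtype):
    """True if SVTYPE is exactly the top-level type of the symbolic allele."""
    return top_level_type(alt) == svtype


# The three combinations listed above, plus a subtype example.
for alt, svtype in [("<CNV>", "DUP"), ("<DUP>", "CNV"), ("<CNV>", "CNV"), ("<DUP:TANDEM>", "DUP")]:
    print(alt, svtype, svtype_matches_allele(alt, svtype))
```

Under this rule the first two combinations from the list would be rejected, and only the matching pairs would remain valid.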
@d-cameron, @jmmut, (and others), please let me know what you think about this. I have a number of comments and corrections for this PR (mostly wording and consistency things), but I don't want to introduce them before we have settled on a specific approach.
@tskir Unless I'm misunderstanding, under your proposal a bi-allelic or rare duplication discovered by a read depth-based method would have to be represented by SVTYPE=CNV, an ALT of <CNV>, and INFO/CN=2 or FORMAT/CN=3 or both. Depending on your interpretation of what a 'breakend claim' is, the same might apply to a deletion discovered with a depth based method (since it has no resolved breakpoints), but with different CN values. My issue is that this gives no easy way to, for example, query the VCF file for all duplications, especially since interpreting the CN fields as deletions or duplications depends on knowing the expected ploidy of the reference contig.
Please correct me if I'm wrong about this. If you are suggesting adding something to address this use case in a later PR, I get a little nervous about decoupling the two changes -- it seems like we'd leave the spec in a slightly broken state between the two PRs.
I think that allowing SVTYPE to vary from ALT (so that you can say SVTYPE=DEL; ALT=<CNV>) gives people flexibility they need to continue working with non-breakend-based CNV claims while separating them from breakend-based claims enough for downstream tools to know when they can try to construct an alternate sequence from them. I agree that it could be a little confusing but I'd recommend getting around that by adding detailed wording and examples to the spec document.
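The querying concern raised above can be seen in a few lines: the same CN value is a loss, a gain, or neutral depending on the ploidy expected for the contig. This is only an illustrative sketch; the ploidy table is a made-up example for one male sample, and `classify_copy_number` is a hypothetical helper.

```python
def classify_copy_number(cn, expected_ploidy):
    """Classify a total copy number relative to the ploidy expected for the contig."""
    if cn < expected_ploidy:
        return "loss"
    if cn > expected_ploidy:
        return "gain"
    return "neutral"


# Hypothetical expected ploidy per contig for a male sample.
expected_ploidy = {"chr1": 2, "chrX": 1}

for contig, cn in [("chr1", 1), ("chrX", 1), ("chr1", 3), ("chrX", 2)]:
    print(contig, cn, classify_copy_number(cn, expected_ploidy[contig]))
```

A query for "all duplications" written against CN alone therefore needs per-sample, per-contig ploidy as an extra input, which an explicit DUP label would avoid.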
@cwhelan I can see your point. How about this then?
Consensus proposal, version 2
- For breakpoint claims, top level of the symbolic structural allele type must be one of {<DEL>, <INS>, <DUP> (only tandem), <INV>, <BND>}. In this case, SVTYPE must match the top level of the symbolic structural allele exactly.
- In case breakpoints are unknown or not reported, symbolic allele must be <CNV>. In this case, SVTYPE must be one of exactly three types:
  - SVTYPE=DEL: copy number decrease compared to the reference;
  - SVTYPE=DUP: copy number increase compared to the reference (in this case the implied duplication is not necessarily tandem);
  - SVTYPE=CNV: multiallelic copy number region of both increase and decrease of copy number compared to the reference.
It's pretty much already how all of this is supposed to work, but my point is that it needs to be explicitly and very carefully worded in the specification.
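Written out as a check, version 2 of the proposal is a small rule table keyed on the top-level type of the symbolic allele. The sketch below is one possible reading of the two rules above, not spec text; it handles only angle-bracket symbolic alleles and the function name is hypothetical.

```python
# Top-level alleles that make a breakpoint claim under version 2 of the proposal.
BREAKPOINT_ALLELES = {"DEL", "INS", "DUP", "INV", "BND"}
# SVTYPE values permitted when the allele is <CNV> (no breakpoint claim).
CNV_SVTYPES = {"DEL", "DUP", "CNV"}


def is_valid_combination(alt, svtype):
    """Check an ALT/SVTYPE pair against the version 2 rules (symbolic alleles only)."""
    if not (alt.startswith("<") and alt.endswith(">")):
        return False  # breakend-notation and sequence alleles are out of scope here
    top_level = alt[1:-1].split(":")[0]
    if top_level == "CNV":
        return svtype in CNV_SVTYPES
    if top_level in BREAKPOINT_ALLELES:
        return svtype == top_level
    return False


for alt, svtype in [("<CNV>", "DEL"), ("<DUP>", "DUP"), ("<DUP>", "CNV"), ("<CNV>", "INS")]:
    print(alt, svtype, is_valid_combination(alt, svtype))
```

Compared with version 1, the only combinations gained are <CNV> with SVTYPE=DEL and <CNV> with SVTYPE=DUP, which is what lets depth-based calls keep a direction label.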
(Sorry for the confusion, I misclicked and sent the message before it was ready, so I deleted it. This is the complete message.)
After reading multiple times this thread and the related ones, and also the current spec, I think the purpose of the current spec writing (before this PR) was:
symbolic SV ALTs:
- DEL INS DUP INV CNV: read-depth claim. E.g. see the current wording for DUP: "Region of elevated copy number relative to the reference". CNV is the general category: DEL is equivalent to CNV/CN0, INS is CNV/CN1+ of new sequence, DUP is CNV/CN2+, etc. and the specific should be preferred over the general CNV.
- BND: breakpoint claim. Different claim than the other ALTs.
With that, I could see that a read depth claim DUP, can use ALT=DUP and any SVTYPE (possibly DUP for simplicity), but SVTYPE becomes useful for a breakpoint claim where you use the breakend notation in ALT (to make a breakpoint claim) and put the SVTYPE=DUP. I'm not saying we should keep this as it is, I'm just trying to understand the history of the current writing and whether it can be clarified or needs a breaking rewrite.
This is kind of similar to the last proposal by @tskir, where a read depth claim is specified with ALT=CNV and anything else is breakpoint claim.
If we go with the meaning I just explained, from tskir's comment I can see how it may not apply to INS and INV if those cannot be identified by a read depth analysis (I'm no expert in that field), but I wonder if the main problem @d-cameron explained offline (about other callers misusing the ALT and SVTYPE fields) still applies? The callers express the evidence in ALT (BND is breakpoint claim, anything else is read depth claim; or we can change to a similar combination as tskir's one), and the interpretation is expressed in SVTYPE. Please let me know if you have seen callers that do not comply with this split.
One concern with that approach is if breakpoint claims are not a superset of read depth claims, and if it would make sense to be able to state both at the same time.
Also, tskir, how do you classify a non-tandem DUP with known location in your suggestion? With a breakend in ALT? is DUP:TANDEM unnecessary then?
@jmmut You raised some very good points. I also had to re-read parts of the specification to address them.
I think the purpose of the current spec writing (before this PR) was: symbolic SV ALTs:
- DEL INS DUP INV CNV: read-depth claim. [...]
- BND: breakpoint claim. Different claim than the other ALTs.
You were quite right to notice that BND is different from other symbolic allele types, and that the other types do not make breakpoint claims. However, those other types are not read depth claims either in the specification. “Read depth” refers to a specific set of methods of (imprecisely) detecting increase and decrease in segment copy numbers. Rather, the specification makes the distinction between “precise” and “imprecise” calls, regardless of the method of detection. In section 1.4.5 “Alternative allele field format”, subsection “Structural Variants” starts:
In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant...
That means that if you have an imprecise structural variant (meaning it has not been detected up to base pair resolution, using whatever method), you specify it using:
- DEL: for any decreased copy number (for example, CN2 → CN0 and/or CN1)
- DUP: for any increased copy number (for example, CN2 → CN3, CN4 and so on). Since the specification reserves the additional DUP:TANDEM subtype, this implies that the “regular” DUP can be any type of duplication, including dispersed, more than one additional copy inserted, different orientations to the reference, etc.
- CNV: only to be used when there are both DELs and DUPs in a single call
- INS: when a novel sequence is inserted (this is not necessarily related to copy number change, since the sequence is marked as novel)
- INV: when a portion of the reference sequence is inverted (this is also not related to copy number change)
The current wording of section 1.4.5 leaves it ambiguous what to do with precise structural variants — e. g. when you know coordinates of a huge deletion up to single nucleotide resolution. Some examples use the INFO/IMPRECISE key to indicate this, although this key is not officially reserved for this purpose.
but SVTYPE becomes useful for a breakpoint claim where you use the breakend notation in ALT (to make a breakpoint claim) and put the SVTYPE=DUP
I'm not necessarily against doing it this way, but:
- BND notation is in general used for complex rearrangements which do not necessarily fit into the five simple SV types discussed above. Hence, in most cases it would be impossible to assign a standard SVTYPE to a BND.
- The specification doesn't currently say a word about this, and actually all breakend examples in the current specification are using SVTYPE=BND. If we decide to allow specifying other SVTYPEs for BNDs, this needs to be explicitly clarified in the specification.
- Currently, filtering on SVTYPE=BND is the only simple way to find all BND records in the VCF. You can't filter by symbolic allele, because BNDs use a specific format for the ALT allele.
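To illustrate the last point, here is a minimal Python sketch of the two filtering strategies. The helper names (`parse_info`, `is_bnd_record`, `has_symbolic_alt`) and the two records are made up for illustration; the first record is only modelled on the style of the breakend examples in the specification.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag keys map to True."""
    info = {}
    if info_field in ("", "."):
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def is_bnd_record(vcf_line):
    """Return True if the record's INFO column carries SVTYPE=BND."""
    columns = vcf_line.rstrip("\n").split("\t")
    return parse_info(columns[7]).get("SVTYPE") == "BND"


def has_symbolic_alt(vcf_line):
    """Return True if any ALT allele is symbolic, i.e. written as <ID>."""
    alts = vcf_line.rstrip("\n").split("\t")[4].split(",")
    return any(a.startswith("<") and a.endswith(">") for a in alts)


# Illustrative records only (hypothetical data).
records = [
    "2\t321681\tbnd_W\tG\tG]17:198982]\t6\tPASS\tSVTYPE=BND",
    "1\t2827694\t.\tC\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2827708",
]

# Filtering on INFO/SVTYPE finds the breakend record ...
bnd_records = [r for r in records if is_bnd_record(r)]
# ... whereas a symbolic-allele test on ALT does not match it.
bnd_has_symbolic_alt = has_symbolic_alt(records[0])
```

If SVTYPE on breakend records were allowed to take other values, a filter like `is_bnd_record` would stop finding them, and tools would have to pattern-match the ALT column instead.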
Also, tskir, how do you classify a non-tandem DUP with known location in your suggestion? With a breakend in ALT? is DUP:TANDEM unnecessary then?
In light of the points you raised, I think I have an idea for a better proposal which would be much more consistent and also mostly compatible with the current specification version. I will post it shortly.
Based on feedback from @cwhelan and @jmmut, I present to you:
Consensus proposal, version 3
Retain the same SV types for symbolic alleles; expand & clarify their definitions
The types currently present in the specification are just fine, but poorly defined (it took me and @jmmut a couple of days to understand their true intended meaning). Let's define them very explicitly:
- DEL: any copy number decrease compared to the reference (for example, CN2 → CN1 and/or CN0).
- DUP: any copy number increase compared to the reference (for example, CN2 → CN3 and/or CN4 and so on). By default this can refer to any type of duplication, including tandem or dispersed, adding one or several copies, same or different orientations to the reference.
  - DUP:TANDEM subtype: tandem duplication in the same orientation to the reference. This can also include more than one additional copy.
- CNV: only to be used when there are both DELs and DUPs in a single call (region where increased and decreased copy number is observed).
  - For DEL, DUP and CNV, specific copy number changes can be specified using the INFO/CN and FORMAT/CN tags.
- INS: when a novel sequence is inserted (this is not necessarily related to copy number change, since the sequence is marked as novel).
- INV: when a portion of the reference sequence is inverted (this is also not related to copy number change).
- BND: breakend notation, no changes to the current spec.
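The copy-number part of these definitions can be sketched as a small decision rule. This is only an illustration of the proposed wording, not anything from the specification; the function name `symbolic_allele_for_copy_numbers` and its arguments are hypothetical, and INS, INV and BND are out of scope because they are not defined by copy number.

```python
def symbolic_allele_for_copy_numbers(reference_cn, observed_cns):
    """Pick DEL, DUP or CNV for a region from its observed copy numbers.

    reference_cn: copy number of the region in the reference (e.g. 2).
    observed_cns: copy numbers observed for the region in a single call.
    Returns None when no copy number differs from the reference.
    """
    has_loss = any(cn < reference_cn for cn in observed_cns)
    has_gain = any(cn > reference_cn for cn in observed_cns)
    if has_loss and has_gain:
        # Both decreased and increased copy number in a single call.
        return "CNV"
    if has_loss:
        return "DEL"
    if has_gain:
        return "DUP"
    return None


deletion = symbolic_allele_for_copy_numbers(2, [1, 0])     # CN2 -> CN1 and/or CN0
duplication = symbolic_allele_for_copy_numbers(2, [3, 4])  # CN2 -> CN3 and/or CN4
mixed = symbolic_allele_for_copy_numbers(2, [1, 3])        # losses and gains together
```

The sketch deliberately does not try to decide between DUP and DUP:TANDEM, since that depends on where the extra copies are located rather than on the copy number itself.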
Explicitly make SV calls of all types imprecise by default. Mark precise calls using CIPOS and CIEND
Again, this already looks like the way the current specification is intended to work, it's just not clear. Let's explicitly say that by default all structural symbolic alleles denote an approximate variant location with the start/end position as best estimates.
To make a “breakpoint claim”—that is, to specify that the start and/or end of a variant are known to single base resolution—existing CIPOS and/or CIEND fields must be set to (0, 0) values. (Alternatively, if people prefer, we could set up a special value, for example CIPOS=PRECISE, for this purpose, but I think double zero works just fine.) If CIPOS and CIEND are not specified, it must be assumed that the call coordinates are not precise, but uncertainties are not available or not reported.
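As a rough sketch of how a consumer could apply this rule, the Python below classifies each end of a call from its CIPOS/CIEND values. The function names and the three status labels are assumptions made for illustration; the INFO column is assumed to be already parsed into a dict of strings.

```python
def parse_confidence_interval(value):
    """Turn a CIPOS/CIEND string such as '-10,25' into a pair of ints."""
    lower, upper = (int(part) for part in value.split(","))
    return lower, upper


def coordinate_precision(info, key):
    """Classify one end of a call under the proposed rule.

    info: dict of INFO keys to string values (hypothetical pre-parsed form).
    key:  'CIPOS' for the start or 'CIEND' for the end.
    """
    if key not in info:
        # Imprecise by default; the uncertainty is simply not reported.
        return "unreported"
    if parse_confidence_interval(info[key]) == (0, 0):
        return "precise"
    return "imprecise"


# Hypothetical call: start known to the base pair, end only approximately.
info = {"SVTYPE": "DEL", "CIPOS": "0,0", "CIEND": "-10,25"}
start_status = coordinate_precision(info, "CIPOS")
end_status = coordinate_precision(info, "CIEND")
```

Treating start and end separately matches the "start and/or end" wording above: a call can be precise at one end and imprecise at the other.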
Exactly synchronise the lists of top level structural symbolic alleles and SVTYPE values
In this version of the proposal, I'm back with my suggestion to completely synchronise the allowable values of the two lists. Since there will already be a way to discern exact (breakpoint) claims from inexact (e. g. read depth) claims, there is no need to mix the terms together between the symbolic alleles and the SVTYPE.
As far as I can see, this proposal addresses all concerns raised by @cwhelan:
- There will be a simple way to query the VCF for all duplications, whether they were discovered using breakpoints or read depth, because they will all have ALT=DUP (possibly with subtypes) and SVTYPE=DUP.
- There is a way to filter by precise and imprecise claims, using CIPOS/CIEND, without the need to mix different types in symbolic alleles and SVTYPEs.
- The spec will not be left in a broken state.
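The duplication query from the first point could look roughly like this in Python. It assumes the ALT column holds a single allele and that INFO is already parsed into a dict; `is_duplication` is a hypothetical helper, not part of any existing tool.

```python
def is_duplication(alt, info):
    """Return True for records with a <DUP> or <DUP:...> ALT and SVTYPE=DUP."""
    if info.get("SVTYPE") != "DUP":
        return False
    if not (alt.startswith("<") and alt.endswith(">")):
        return False
    # Compare only the top-level type so that subtypes such as
    # DUP:TANDEM are matched as well.
    top_level_type = alt[1:-1].split(":")[0]
    return top_level_type == "DUP"


plain_dup = is_duplication("<DUP>", {"SVTYPE": "DUP"})
tandem_dup = is_duplication("<DUP:TANDEM>", {"SVTYPE": "DUP"})
deletion = is_duplication("<DEL>", {"SVTYPE": "DEL"})
```

Because the top-level symbolic allele and SVTYPE would always agree under this proposal, checking either one alone would in principle be enough; the sketch checks both only to make the synchronisation explicit.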
And the ones by @jmmut:
- This proposal retains as much backwards compatibility with the existing specification as possible; it only uses existing fields and is not introducing any new ones.
- It decouples breakpoint (“preciseness”) information from the actual variant type, allowing the users to specify a broad range of both precise and imprecise structural variant types.
- As for your question about specifying a non-tandem DUP with known insertion location: this will have to be done using BNDs; however, the current specification version also does not provide a way to do this, so there is no regression here. This new variant type can possibly be introduced in the future.
@d-cameron @cwhelan @jmmut Please let me know what you think.