More complete rGFA support (experimental)
Changelog Entry
To be copied to the draft changelog by merger:
-
vg paths -Roption added to compute an rGFA cover (based on a reference sample) from a (mutable) graph, and add it as path fragments.
Description
This is a fixup and refactor of #3891. It uses the same greedy (on steps) method to select snarl traversals to add to the cover. The cover is now stored in a (really simple) in-memory index that maps nodes to fragments that cover them, and these fragments can be walked back to the rank-0 reference.
rGFA covers can be computed from full paths, loaded from path fragments, and saved to path fragments. They are always defined relative to a rank-0 / reference sample which stays as a normal reference path in the graph.
Off-reference rGFA path fragments are REFERENCE-sense paths with a special sample name (_rGFA_) which allows them to be easily selected. Would like to find something less hacky, but this should work for experiments now.
Corresponding Cactus update is: https://github.com/ComparativeGenomicsToolkit/cactus/pull/1186
Just wanted to say I've been experimenting with an end user hack to do something similar*, and I've seen the largest impact in accuracy from variants called around insertions. Reads which straddle a reference and insertion contig will align to one or the other.
As a result, depth around insertions flattens out, or even dips a little. Here's an example in HG002 around chr10:133111600, where vg haplotypes identified two overlapping 900bp insertions (one maternal, one paternal).
Surjecting reads to both insertions and GRCh38 led to reduced depth in the region, causing deepvariant to miss some GIAB annotated variants. However, the reads which surjected to the reference sequence now had their non-reference sequence softclipped. This led to fewer false positive calls around insertions due to misaligned reads.
PS: This is a really good test region. Both the insertions and the surrounding reference sequence are complex enough to allow for uniquely mapping reads, and the accuracy of variant calling is greatly improved when reads from either insertion are not mismapped to the linear reference sequence. It is also in an intron of a CNS expressed gene that has been linked to appetite and overeating, suggesting potential clinical relevance.
I'm very excited to eventually see this code on the main branch!
- KMC
- Create personalized GFA with haplotypes
- Deconstruct GFA against reference of interest
- Isolate insertions > 100bp, extract paths from deconstruction vcf
- Format paths as GFA haplotypes, append to end of GFA file
- contig names were just chr|start|stop|unique_id
- Convert to GBZ, mark all insertion paths as references
- Extract per-sample fastas
- Bring all fastas in cohort together, harmonize the unique id of each insertion. Create translation file of old contig name -> harmonized contig name.
Then: Map reads to GBZ Use samtools reheader, using sed to remove haplotype information and awk to apply the old contig name -> harmonized contig name translation
Hi @glennhickey,
I wonder if you are still planning to implement this feature?
-Joe
Yes! But I've been completely bogged down with other things. The only good news is that I think I'm on the hook to present about it in a few weeks so I won't have much choice but to get moving. I still have your examples to reproduce problems that I will check out. Sorry for the long wait.
Not a problem @glennhickey, I'm so grateful for the work you've put in. I just wanted to double-check, as I'm a few months out from putting together a manuscript comparing the effect of vg-deepvariant with bwamem-haplotypecaller and bwamem-deepvariant, and I think this would be a good addition to the paper.
I'm in the middle of a too-many-projects-too-little-time period myself, so I am very, very familiar with the feeling.