vg construct including haplotypes
Starting from VCFs that have phased haplotypes, what steps should we follow to build a GFA in which the haplotypes are represented as paths?
It seems that we might be able to do this by building a GBWT and then projecting it out to GFA. But the exact steps have been obscured by changes over time.
(I'm at CPANG22 and some participants are requesting this capability.)
The basic idea is this:
vg construct -r foo.fa -v foo.vcf.gz -a > foo.vg
vg gbwt -v foo.vcf.gz -x foo.vg -g foo.gbz --gbz-format
vg convert -Z -f foo.gbz > foo.gfa
First you build the graph with vg construct, including the allele paths with -a and using other options when appropriate.
Then you build the GBWT from the graph and the VCF file with vg gbwt and convert everything to GBZ format. There are many options that could be useful in various situations (particularly --buffer-size, --num-jobs, --force-phasing, and --discard-overlaps).
Finally you convert the GBZ graph to GFA with vg convert. Option -Z indicates that the input is in GBZ format and that the conversion should use the GBWTGraph algorithm.
That will output the paths as W-lines, and many of the more advanced options are not exposed. If you want P-lines, you can use the gfa2gbwt binary included in the GBWTGraph package:
gfa2gbwt -d --pan-sn foo
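For anyone who wants to script the three vg steps above, here is a minimal sketch. It uses only the commands and options already shown in this comment. The function name build_haplotype_gfa and the convention that the inputs are named PREFIX.fa and PREFIX.vcf.gz are my own assumptions for illustration, not anything vg requires.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the three commands from this thread in one function.
# The function name and the PREFIX.fa / PREFIX.vcf.gz naming are assumptions.

build_haplotype_gfa() {
    local prefix="$1" f

    # Fail early if an input file is missing.
    for f in "${prefix}.fa" "${prefix}.vcf.gz"; do
        [ -f "$f" ] || { echo "missing input: $f" >&2; return 1; }
    done

    # 1. Build the graph, keeping the allele paths (-a).
    vg construct -r "${prefix}.fa" -v "${prefix}.vcf.gz" -a > "${prefix}.vg" || return 1

    # 2. Build the GBWT from the graph and the phased VCF, written as GBZ.
    vg gbwt -v "${prefix}.vcf.gz" -x "${prefix}.vg" -g "${prefix}.gbz" --gbz-format || return 1

    # 3. Convert the GBZ to GFA; haplotype paths are written as W-lines.
    vg convert -Z -f "${prefix}.gbz" > "${prefix}.gfa" || return 1
}
```

After sourcing the file, build_haplotype_gfa foo should leave foo.vg, foo.gbz and foo.gfa next to the inputs. A quick sanity check afterwards is grep -c '^W' foo.gfa, which counts the W-lines in the output.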
@ekg, @jltsiren, please: I'm working on a similar topic and was wondering whether this transformation from alt path projection to haplotype paths causes a loss of variation information. I'm building a vg graph starting from a primary one (FASTA only) and aligning the other species as follows:
- primary > map genome1 short reads > get VCF > phase it with whatshap > use it + FASTA = updated graph > giraffe genome2 short reads > get VCF, and so on. The idea is that I want to build a final graph containing my species collection, represented as haplotype paths with the variation information preserved. (What I understood is that in this case I need 2 graphs for one collection, correct?)
Yes, this is actually very tricky, and something I would also like to figure out. I'd be happy to discuss. Feel free to email me to set up a time to talk. The pattern you are describing is something several participants at CPANG22 had in mind. I'm not sure it would require two graphs, though some kind of iteration would be needed. Unfortunately, we don't yet have any graph-native haplotype phasing method.