help interpreting output
Hi Kevin, I was looking at some GIAB data this morning and found the link to your tool. I gave it a whirl with this command:
vgraph repmatch --include-regions GIAB/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_nosomaticdel.bed --reference /home/pubseq/genomes/Homo_sapiens/GRCh37/1000genomes/bwa_ind/genome/GRCh37-lite.fa GIAB/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz gsc/GSC.vcf.gz > out.txt
The output file contained these match lines:
107 MATCH== TYPE=H
2 MATCH=. TYPE=N
3429981 MATCH== TYPE=T
176428 MATCH=X TYPE=H
I think I can guess what the bottom two lines represent, but I was wondering if you could explain all 4 lines? If there is a better way to quantify a match I'd be happy to know that as well.
thanks, Richard
Hi Richard,
Thanks for asking.
- Type=T represents a trivial match, where the two superloci are identical in terms of genomic coordinates, alleles and genotypes, i.e. there is no need to invoke the full power of the haplotype matcher.
- Type=H is where the haplotype matcher is needed.
- Match="=" are superloci that match
- Match="X" are superloci that don't match.
- Match="N" are nocalls, typically due to out-of-spec VCF records that overlap, which GATK occasionally generates.
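As a rough way to turn these categories into a single concordance figure, the counts can be tallied with a short script. This is only a minimal sketch, not part of vgraph: it assumes summary lines of the form `<count> MATCH=<code> TYPE=<code>`, as in the four lines quoted in the question, and the names `tally` and `summarize` are made up for illustration. Anything that is neither `=` nor `X` is lumped into an "other" bucket.

```python
import re
from collections import Counter

# Assumed line format: "<count> MATCH=<code> TYPE=<code>", e.g. "107 MATCH== TYPE=H".
LINE_RE = re.compile(r"^\s*(\d+)\s+MATCH=(\S+)\s+TYPE=(\S+)\s*$")

# The four summary lines quoted in the question.
EXAMPLE = """\
107 MATCH== TYPE=H
2 MATCH=. TYPE=N
3429981 MATCH== TYPE=T
176428 MATCH=X TYPE=H
"""


def tally(lines):
    """Sum superlocus counts per (match code, type code) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that is not a summary line
        n, match_code, type_code = m.groups()
        counts[(match_code, type_code)] += int(n)
    return counts


def summarize(counts):
    """Collapse the tallies into matched / mismatched / other superloci."""
    total = sum(counts.values())
    matched = sum(n for (code, _), n in counts.items() if code == "=")
    mismatched = sum(n for (code, _), n in counts.items() if code == "X")
    return {
        "total": total,
        "matched": matched,        # trivial (T) plus haplotype (H) matches
        "mismatched": mismatched,
        "other": total - matched - mismatched,  # e.g. nocalls
        "match_fraction": matched / total if total else 0.0,
    }


if __name__ == "__main__":
    print(summarize(tally(EXAMPLE.splitlines())))
```

Note that these are counts of superloci, not of individual variants, so the fraction is a superlocus-level concordance and does not say which call set contributed the discordant records.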
Perfect. Many thanks.
One more question: how would you recommend counting the variants called uniquely in my set or in the GIAB set?
I have been working on a wrapper around vgraph that does much more detailed accounting. I'll see if I can share it, as it was developed as part of my day job.
Thanks. Any word on permission to share your code?
I've asked and am waiting for an answer. I expect to hear back by the end of next week.
Many thanks. I'm not up against a deadline or anything; I just wanted to try it out.