Accessing translation table from giraffe autoindex workflow

Open ASLeonard opened this issue 4 years ago • 1 comments

This is a bit of a winding journey to try and solve #3126 , so hopefully haven't missed out too many details.

I'm using a minigraph GFA as input (so without P/W lines currently) for downstream mapping with giraffe. The vg autoindex pipeline works fine, and can map sequences. I'm trying to relate mapping back to the original GFA nodes.

For (most?) non-reference nodes, this works fine, as the --request XG can output an xg graph, and can extract the P lines from vg view -g {xg} | awk '$1=="P"'. These nodes tend to have specific names like

P	10_A_clr[7985222]	38462387+,38462388+,38462389+	*

which I can easily relate back to the original minigraph GFA given the source (10_A_clr) and offset (7985222).

The issue is the reference nodes are given as

P 10_ARS <millions of node IDs> *

Which breaks the ability to directly associate giraffe mapped IDs to input GFA IDs.

I've read in the gbwt wiki that there is a translation file included, as hinted at in #3126, but I've found no way to access this. I've tried building the gbwt separately to use the --translation output option, but since my input doesn't have P/W lines it fails. Similarly, trying to use the .gbz output from autoindex as input doesn't seem to get far, failing with

vg gbwt -Z /data/L50.giraffe.gbz -o /data/test.gbwt -g /data/test.gwbtg --translation /data/trans.late -p
-----
Building input GBWTs
Input type: GBZ
Loading GBZ from /data/L50.giraffe.gbz
GBWTs built in 31.7329 seconds, 9.24089 GiB

Saving compressed GBWT to /data/test.gbwt
GBWT serialized in 2.14897 seconds, 9.24089 GiB

Serializing segment to node translation to /data/trans.late
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
-----
Stack trace (most recent call last):
#5    Object "/vg/bin/vg", at 0x5adf8d, in _start
#4    Object "/vg/bin/vg", at 0x1c3f95f, in __libc_start_main
#3    Object "/vg/bin/vg", at 0x58286e, in main
#2    Object "/vg/bin/vg", at 0xc075bb, in vg::subcommand::Subcommand::operator()(int, char**) const
#1    Object "/vg/bin/vg", at 0xc47882, in main_gbwt(int, char**)
#0    Object "/vg/bin/vg", at 0xc3fdd1, in GraphHandler::serialize_segment_translation(GBWTConfig const&) const

Is there actually a way to access the translation table (or was that dropped), or to force continuous sequence split into different nodes (the reference backbone from minigraph) into different nodes for the request XG style translation table, so it would look like

P 10_ARS[2517258]   38462387+,38462388+,38462389+	*
P 10_ARS[2514258]   38462390+,38462391+,38462392+	*

Maybe this is just trying to stretch a use-case too far, but thanks for any input.

Dec 02 '21 10:12 ASLeonard

There is a data model mismatch between minigraph and vg. Minigraph outputs rGFA, which does not store the sequences as full-length paths. Instead, it annotates each segment (node) with information about the first sequence that required that segment.

When vg reads an rGFA file, it stores the annotations as a set of paths. Generally only rank-0 sequences (reference sequences) become full-length paths. For other sequences, we only get short paths that describe novel sequence. And if we convert the graph back to GFA, we get P-lines instead of S-line annotations.

Translation tables are used for another purpose. GFA segments have string names and they can be arbitrarily long, while vg nodes have integer ids and many tasks restrict their length to 1024 bp. A translation table stores a mapping between segment names and (intervals of) node ids. We don't really use it for anything at the moment. You can output the table when you convert GFA into a graph with vg convert or when you build a GBWT from GFA with vg gbwt. In the latter case, the translation can also be stored in a GBZ / GBWTGraph file, but vg can't extract it from there at the moment.

Finally, minigraph uses segment names of the form sN, where N is an integer. When vg parses a minigraph GFA, it assigns the corresponding nodes integer identifiers starting from 1. When the GFA is read from a file, this happens in the order each segment name first occurs in the file. When the GFA is parsed in memory (because it was piped to vg, for example), the order may be different.

Dec 03 '21 06:12 jltsiren