Allow setting the HashMap / Hasher implementation for GraphMap
Summary
The feature would allow setting the underlying HashMap (or at least the Hasher) used in the GraphMap. This would allow using a deterministically seeded hash function to fix the iteration order over runs when the graph is populated in the same order. This would be a very useful feature in at least one important use case for us, as it would allow us to obtain deterministic results between runs.
Motivation
My lab develops a number of tools in rust, and a number of these make use of petgraph. The one motivating this feature request is alevin-fry. This is a tool for processing genomics data. Within the tool, we apply a heuristic to a hard graph problem (minimum cardinality cover by monochromatic arboresences). Because the iteration order for the underlying graph implementation is non-deterministic and changes between runs, and because we can't always solve the problem optimally (hence the heuristic), we can get different results between runs. In our case, the differences are small, but it would nonetheless be ideal to have the option to have results between runs remain identical for the same input. We can use Hashers with fixed seeds (e.g. ahash) to ensure that e.g. hashes are consistent. If we can then also ensure that has build order is consistent, we should be able to obtain consistent iteration order between runs. However, this is currently not possible, as the HashMap implementation underlying the GraphMap is not exposed to the user.
Details
-
What code needs to change? New traits? Which modules?
The GraphMap types would need to take an extra generic argument corresponding to the hasher to be used.
-
Is this a new graph algorithm? Include links to papers, Wikipedia, or other implementations of this algorithm that exist.
No
-
Are you willing to implement this yourself? Mentor someone else and help them implement it?
Unfortunately, I do not have time to implement it myself. However, I would be happy to test it and to help out if someone else can take the lead.
GraphMap is using IndexMaps internally for the nodes and edges, which specifically states:
The key-value pairs have a consistent order that is determined by the sequence of insertion and removal calls on the map. The order does not depend on the keys or the hash function at all. [docs]
Do you have a minimal example where the graph is behaving non-deterministically ?
However, I don't discard the idea of adding a hasher parameter to GraphMap. It would be an useful choice to have.
I haven't had time or opportunity, but one could think about the idea of a different kind of graphmap, maybe one that can separate the node index or node id and key as separate concepts.
@ABorgna,
Thanks for the pointer. I realize now that there is code that makes use of the graph that itself hold e.g. vertex subsets in a hash set. It's totally conceivable that the order of iteration over those subsets (which affects subsequent removals from the graph) is what is actually changing "randomly" between runs. The issue with the default / common behavior of the randomized seeding of the default hash map and hash set is that it is not possible to e.g. globally disable for debugging purposes, so identifying the actual source of non-determinism is non-trivial. Nonetheless, I learned that ahash let's me create a RandomState with a specified seed to avoid this randomness, so I'm going to propagate that change throughout the codebase and see if the behavior becomes deterministic (as is expected).
Regarding the more general utility of such a possibility (i.e. allowing to specify the hasher used), I agree that it might still be useful in a general case. For example, depending on what your graph is being constructed over, there may be a desire to use custom hashers that are e.g. much more efficient or more secure for the objects being hashed, and having the ability to specify this in a constructor could be nice.