[FEA] Add option to report hash collisions
Is your feature request related to a problem? Please describe.
Hash collisions impact performance of hash map insert and probe. It will be useful to find a way to report the number of collisions for static_map and dynamic_map to help assess performance of insert or probe when developers are evaluating perf on their dataset. It would also allow developers to tune the hash function or occupancy to reduce collisions and find the right balance for their scenario.
Describe the solution you'd like
The map can have an optional template argument that specifies if we need to count collisions (disabled by default), so it's opt-in and doesn't impact perf for the standard case. The number of collisions would be stored in a class variable that's accessible with something like get_num_collisions(). Implementation: allocate memory for device variable uint64_t *d_num_collisions, update all insert and find device code to do atomicAdd(d_num_collisions, 1) to that variable, then copy the contents to the host variable after the kernel. Here is where we can count the collisions for no-CG static_map::find:
https://github.com/NVIDIA/cuCollections/blob/2196040f0562a0280292eebef5295d914f615e63/include/cuco/detail/static_map.inl#L252
The atomic will be guarded by the template argument check, so should only impact perf if we're asked to count collisions. Similarly, would have to update all other insert and find functions.
Describe alternatives you've considered None.
Additional context None.
I understand the desire for this, but it would introduce some awkwardness into the interfaces and the implementation.
What if instead we had a utility or benchmark/test function for computing the number of hash collisions for a given set of keys/hash function? Something like:
template <typename InputKey, typename Hasher>
size_t count_collisions(InputKey begin, InputKey end, Hasher h);
All this would do is keep an array of proxy "slots" and compute i = h(begin + j). If slots[i] == FILLED, then increment num_collisions and advance to slots[i+1] until an EMPTY slot is found.
This allows us to keep the collision counting logic independent of the implementation complexities of the static_map.