arkouda
arkouda copied to clipboard
Optimizations for GroupBy on Strings
The current GroupBy logic for strings is optimized for grouping a single array of long (or variable-length) strings. However, there are at least two other cases that could benefit from different logic:
- A single array of short strings. Currently, all strings get hashed to 128-bit values, which are then sorted to achieve grouping. However, if the maximum length of the strings is 16 bytes or less, then directly sorting the strings would use the same or fewer radix digits as sorting the hashes, and you wouldn't have to compute the hashes. In practice, it might even make sense to set the length threshold a little higher, e.g. directly sort strings when the max length is 20 bytes or less.
- Multiple arrays, at least one of which is strings. Currently, when grouping by multiple arrays, arkouda will first hash any strings arrays (via
Strings._get_grouping_keys()) so that all arrays are represented by integers, and then it will hash all the integer representations (insideUniqueMsg.chpl) and accumulate those into a single array of 128-bit hashes for sorting. In this situation, strings effectively get hashed twice, which is wasteful. I think we can skip the first hash and reworkUniqueMsgto just hash each strings array once.