Create benchmark for sorting relevant distributions of data
I can't seem to find the discussion right now, but we've been mulling over whether to explore a different sorting algorithm, such as the twoArrayRadixSort already in Chapel. The current LSD radix sort has the advantages of simplicity and complete insensitivity to data distribution, which are powerful. However, there are conditions (such as mostly-sorted input) under which an MSD radix sort would perform significantly faster, and it's not clear to me which sort is best for the majority of data distributions we are likely to encounter in the wild.
To that end, I'd like to create a benchmark that evaluates the performance of sorting and co-sorting on several distributions of data. For now, only the existing LSD radix sort would be used, and I would expect the performance to be the same on all inputs of a given size, regardless of distribution. But if/when we include the Chapel twoArrayRadixSort as a runtime choice, the benchmarking infrastructure will exist to compare the two algorithms side by side.
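For reference, here is a minimal sketch of the timing harness I have in mind, assuming the current Python client API (`ak.argsort`, `pdarray.size`); the `time_sort` helper and the GiB/s metric are placeholders of my own, not an existing interface:

```python
import time
import arkouda as ak  # assumes the standard Arkouda Python client

def time_sort(name, a, trials=3):
    """Time ak.argsort (currently the LSD radix sort) on pdarray `a`.

    Reports the best of `trials` runs plus an effective rate assuming
    8-byte values; both the helper and the metric are placeholders.
    """
    times = []
    for _ in range(trials):
        start = time.time()
        _perm = ak.argsort(a)  # permutation discarded; we only care about elapsed time
        times.append(time.time() - start)
    best = min(times)
    gib = a.size * 8 / 2**30
    print(f"{name:>24s}: {best:6.3f} s   {gib / best:6.2f} GiB/s")
```

Co-sort cases (the refinement and IPv4/IPv6 ones below) would call `ak.coargsort` on a list of arrays instead, but the shape of the harness is the same.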
Cases I would like to test (a rough data-generation sketch for several of these follows the list):
- [x] uniform random integers
  - `[0, 2**16)`
  - `[0, 2**32)`
  - `[0, 2**64)`
- [x] power-law integers and floats
- [x] RMAT-generated vertex identifiers (ints)
- [x] block-sorted data, i.e. `concatenate(arrays)`, where each `arrays[i]` is sorted
- [x] refinements, e.g. `coargsort([a, b])`, where `a` is sorted but `b` is not
- [x] `datetime64[ns]`-like data, i.e. values whose range is much smaller than their magnitude
- [x] IPv4/IPv6-like data, e.g. with 90% of values in `[0, 2**32)` and 10% in `[2**32, 2**128)` (this would require a coargsort)
- [ ] Strings of uniformly distributed length
- [ ] Strings with log-normally distributed length
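For concreteness, a rough sketch of how several of the integer cases above might be generated, again assuming the current Python client (`ak.randint`, `ak.sort`, `ak.concatenate`, `ak.coargsort`, `ak.array`). The sizes and constants are arbitrary, and the power-law case samples with NumPy on the client and ships values over via `ak.array`, which is illustration only and would not scale to benchmark-sized arrays:

```python
import numpy as np
import arkouda as ak  # assumes the standard Arkouda Python client

N = 10**8  # per-array benchmark size; arbitrary placeholder

# uniform random integers over three ranges
uniform16 = ak.randint(0, 2**16, N)
uniform32 = ak.randint(0, 2**32, N)
uniform64 = ak.randint(0, 2**63 - 1, N)  # the true [0, 2**64) case would need uint64

# power-law integers (client-side NumPy sample, illustration only)
powerlaw = ak.array(np.random.zipf(1.5, size=10**6))

# block-sorted data: concatenation of individually sorted blocks
blocks = [ak.sort(ak.randint(0, 2**32, N // 10)) for _ in range(10)]
block_sorted = ak.concatenate(blocks)

# refinement: co-sort where a is already sorted but b is not
a = ak.sort(ak.randint(0, 2**32, N))
b = ak.randint(0, 2**32, N)
refinement_perm = ak.coargsort([a, b])

# datetime64[ns]-like: range much smaller than magnitude
base = 1_600_000_000 * 10**9                 # ~2020 as nanoseconds since the epoch
datetimes = base + ak.randint(0, 10**9, N)   # one second of spread
```

The RMAT, IPv4/IPv6, and string cases need a bit more machinery (graph generation, a two-array co-sort split across magnitudes, and string generation), so I've left them out of the sketch.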
Any other suggestions or additional cases?