Study to compare t-Digest and REQ sketch
Compare the performance of t-Digest with its closest competitor in the library, the REQ sketch. REQ is the closest competitor because it prioritizes either high rank accuracy (HRA mode) or low rank accuracy (LRA mode), unlike the other quantile sketches (KLL, classic), which have the same rank error at any rank. There are a few obvious differences:
- REQ sketch can work with any data type that has a comparator; t-Digest is limited to numeric data (floating-point types).
- REQ sketch retains and returns only values observed in the input: it has no notion of distance (only less-than comparison) and does no interpolation. t-Digest is based on computing means and does interpolation.
- t-Digest prioritizes both high rank and low rank accuracy at the same time with the default scaling function. Perhaps this can be changed with a different scaling function.
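To make the HRA/LRA distinction concrete, here is a small illustration (not the DataSketches API) of the *shape* of the two error guarantees: a uniform-error sketch (KLL/classic style) allows the same additive rank error everywhere, while an HRA-style relative-error guarantee shrinks in proportion to the distance from the top rank. The epsilon value is made up for illustration.

```python
# Illustrative only -- not the DataSketches API. Shows the shape of a
# uniform rank-error guarantee (KLL/classic style) versus an HRA-style
# relative-error guarantee that favors high ranks. eps is arbitrary.

def uniform_error_bound(rank, eps=0.01):
    """Same additive rank error at every rank."""
    return eps

def hra_error_bound(rank, eps=0.01):
    """Error proportional to (1 - rank): high ranks near 1.0
    are estimated much more accurately."""
    return eps * (1.0 - rank)

for r in (0.5, 0.9, 0.99, 0.999):
    print(f"rank {r}: uniform ±{uniform_error_bound(r):.5f}, "
          f"HRA ±{hra_error_bound(r):.5f}")
```

An LRA-mode bound would mirror this, shrinking in proportion to the rank itself, so low ranks near 0 get the tight error.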
Here is what I have so far. Surprisingly, REQ sketch with k=10 and k=12 behaves very similarly.
We compared t-digest and ReqSketch in a paper at KDD'21 with Graham and people from Splunk. I think t-digest has not been updated much since then, so I hope the points below are still true.
Here's the upshot (see Section 6 of the paper):
- We were able to construct inputs for t-digest that make its error arbitrarily bad, even though these inputs are highly unlikely to be encountered in practice (N.B. we needed to use numbers ranging from astronomically large to infinitesimally small). We were also able to come up with distributions such that, on i.i.d. samples from them, t-digest has worse error than ReqSketch while using similar space (again, these distributions may seem artificial because they require a large scale). See the plots in Section 5 of the paper. Thus, t-digest is not robust to adversarial inputs.
- Of course, on more realistic distributions, t-digest works much better in terms of accuracy.
- t-digest has worse update time; see Table 1. This is because the sampling used by KLL and ReqSketch does not require as many expensive operations as maintaining centroids in t-digest does.
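The paper's adversarial construction is more involved, but as a rough illustration of what "astronomically large to infinitesimally small" means, a stream spanning an enormous dynamic range can be generated like this (function name and parameters are mine, for illustration only):

```python
import random

# Rough illustration only -- NOT the paper's adversarial construction.
# Generates a stream whose values span a huge dynamic range, from
# infinitesimally small to astronomically large; this is the kind of
# input that stresses t-digest's mean-based centroids.
def wide_range_stream(n, min_exp=-300, max_exp=300, seed=0):
    rng = random.Random(seed)
    return [10.0 ** rng.uniform(min_exp, max_exp) for _ in range(n)]

stream = wide_range_stream(1000)
print(min(stream), max(stream))
```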
I'd say that values of $k \le 12$ are somewhat small, but of course such small values are needed to get space somewhat comparable to t-digest (at least for short streams). Actually, in the KDD paper we used k = 4 to achieve similar space for a stream of length $2^{20}$.
The choice of the uniform distribution may not be the best for a comparison. I'd suggest also including the log-uniform distribution (uniform on the log scale) that we used in the paper, or a normal distribution.
@PavelVesely although slightly off topic, do you understand why REQ k=12 and k=10 end up retaining about the same number of items and having about the same rank error? Do you think it might be a bug in the implementation?
My guess is that for such a small k, rounding yields sketches of pretty much the same size. Specifically, since k is the initial section size and the section size decreases by a factor of $\sqrt{2}$, the next section size (after the first doubling of the number of sections) is the same for k=10 and k=12: the nearest even number to $10/\sqrt{2}$ is 8, which is the same as for $12/\sqrt{2}$.
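The rounding argument above is easy to check numerically. This little sketch assumes (per the discussion) that the section size is divided by $\sqrt{2}$ and rounded to the nearest even number:

```python
import math

# Check the rounding argument: assume the REQ section size starts at k,
# is divided by sqrt(2) at each doubling of the number of sections, and
# is rounded to the nearest even number (assumption from the discussion).
def next_section_size(k):
    # 2 * round(x / 2) rounds x to the nearest even integer
    return 2 * round(k / math.sqrt(2) / 2)

print(next_section_size(10))  # -> 8
print(next_section_size(12))  # -> 8
```

Both k=10 and k=12 collapse to the same section size after one reduction, which would explain the nearly identical retained-item counts and rank errors.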
Apologies for the belated answer! (You can let me know by email if you'd like me to look at something, but I'm not watching the dev mailing list closely.)
@AlexanderSaydakov The plots are nice, although it's not surprising that t-digest is much better on the uniform distribution. Can you try another distribution that is more skewed?
Sorry for the delay. I plan to do this eventually, but was busy with more urgent tasks.
I measured rank error with input from an exponential distribution (lambda=1.5). I don't see a drastic difference from the uniform distribution.
Thanks for the update! In the KDD paper, we tried the log-uniform distribution, that is, choosing a random $R\in (0,1)$ and outputting item $10^{R*E}$, where $E$ is the maximum exponent. Generally, the larger $E$, the worse for t-digest.
(We actually tried some variations of the log-uniform distribution, like additionally choosing a random sign, or squaring $R$ in the exponent, that is, $10^{R^2*E}$, which made the error of t-digest even worse.)
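For reference, a minimal generator for the log-uniform distribution described above, including the sign and squared-exponent variations (function and parameter names are mine):

```python
import random

# Log-uniform generator as described above: pick R uniform in (0,1)
# and output 10**(R*E), with E the maximum exponent. The `signed` and
# `squared` options correspond to the variations mentioned (random sign,
# and squaring R in the exponent, i.e. 10**(R*R*E)).
def log_uniform(n, E=30, signed=False, squared=False, seed=0):
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        r = rng.random()
        exp = (r * r if squared else r) * E
        x = 10.0 ** exp
        if signed and rng.random() < 0.5:
            x = -x
        out.append(x)
    return out

print(log_uniform(5, E=10))
```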
@AlexanderSaydakov @leerho Btw, are there any differences between your implementation of t-digest and the "official" implementation? I mean primarily algorithmic differences that would influence the accuracy on some datasets.
We implemented merge slightly differently, since the reference implementation modifies the input sketches. My testing shows that our version does a bit better in terms of accuracy after merge. Other than that, I believe our implementation is equivalent to the reference implementation.