Update to reflect new changes to the t-digest algorithm
Lately there have been many changes to the t-digest that improve both accuracy and speed of the algorithm. Having the changes incorporated into this module will be very useful when this implementation of t-digest is used to implement the user facing quantile function in Dask.
There are couple of questions open still before the work can begin:
- Is the algorithm described in this paper the final one?
- Does the Java implementation in the t-digest repository contain the new changes? Can it be used as a reference?
Agreed. I'd have to give the paper a read to see what changes were made since I implemented this. However, I suggest you start with adding the existing implementation to Dask, and update the algorithm here afterwards. The algorithm as written works, and seeing it in use will likely influence what changes (if any) need to be made.
Is the algorithm described in this paper the final one?
Yes. The paper is tracked in the github repo as well: https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf.
Does the Java implementation in the t-digest repository contain the new changes? Can it be used as a reference?
Yes, and kind of. The tdigest license (apache) is included in our license file with a note in the relevant files that the algorithm is borrowed from the tdigest repo. However, the fastest way to do things in java is not the same as in C - you will likely have to make changes rather than doing a direct port.
I personally have no bandwidth right now to help with this project, but would accept whatever changes are made, provided they come with speed/accuracy improvements and proper tests.
It makes sense to do the Dask implementation first. The speed improvement is nice to have but probably not essential, the current version should work alright.
Thanks for the answers. My C is actually better than my Java so hopefully it's not too complicated to convert it.
I should have time to do this after I have the Dask implementation ready so I'll just put this on my pile.