Add cardinality estimate stat

Open a10y opened this issue 1 year ago • 3 comments

Useful for the compressor to decide whether Dict compression is worthwhile.

There's a Rust crate already implementing it: https://docs.rs/hyperloglogplus/latest/hyperloglogplus/struct.HyperLogLogPlus.html

The estimate can be used:

  • At compress time: to determine whether Dict is worth trying, or whether to fall back directly to FSST
  • At query time: DataFusion allows reporting cardinality estimates, which are used for join selection: https://github.com/apache/datafusion/blob/8ba6732af5f4f32cbe0a23ef6bc2f393c640898b/datafusion/physical-plan/src/joins/utils.rs#L905
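
To make the compress-time use concrete, here is a rough, std-only sketch of a plain HyperLogLog estimator (not HLL++, and not the API of the crate linked above) feeding a Dict-or-not decision. The names `Hll`, `dict_worth_trying`, and the `max_ratio` threshold are hypothetical and chosen for illustration only; they are not part of vortex.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Precision: 2^12 = 4096 registers, roughly 1.6% standard error.
const P: u32 = 12;
const M: usize = 1 << P;

/// Hypothetical minimal HyperLogLog sketch (illustrative, not a vortex type).
struct Hll {
    registers: Vec<u8>,
}

impl Hll {
    fn new() -> Self {
        Hll { registers: vec![0; M] }
    }

    /// Hash the value, use the top P bits as the register index and the
    /// position of the first set bit in the remaining bits as the rank.
    fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let x = hasher.finish();
        let idx = (x >> (64 - P)) as usize;
        let rest = x << P;
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Raw HyperLogLog estimate, with linear counting for small cardinalities.
    fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Hypothetical heuristic: try Dict only when the estimated fraction of
/// distinct values is at most `max_ratio` (the threshold is an assumption).
fn dict_worth_trying(estimated_distinct: f64, len: usize, max_ratio: f64) -> bool {
    len > 0 && estimated_distinct / (len as f64) <= max_ratio
}
```

A compressor could feed every value of an array into `Hll::insert`, then call `dict_worth_trying(hll.estimate(), len, 0.5)` and skip straight to FSST when it returns `false`. The 0.5 cutoff is a placeholder; a real threshold would need to be tuned against actual data.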

a10y avatar Sep 23 '24 15:09 a10y

I think we want this instead of HLL++: https://www.cidrdb.org/cidr2019/papers/p23-freitag-cidr19.pdf

lwwmanning avatar Sep 23 '24 16:09 lwwmanning

(In particular, it gives good cardinality estimates for arbitrary combinations of attributes rather than just single attributes, which is cool / handy for compound join keys)

lwwmanning avatar Sep 23 '24 16:09 lwwmanning

If we're taking something off the shelf, this crate looks potentially better: https://github.com/cloudflare/cardinality-estimator/tree/main

lwwmanning avatar Oct 04 '24 19:10 lwwmanning

Previously mentioned in #85

robert3005 avatar Nov 05 '24 13:11 robert3005