auto sharding strategy for theta sketch

Open patelprateek opened this issue 3 years ago • 3 comments

I was going through the PR https://github.com/apache/pinot/pull/5316. Can you please point me to how or where this is implemented? How do we define the high-cardinality threshold?

I am running into issues where different sets can have very different cardinalities and the error is high, and I wanted insights on how to tune the theta parameters during my indexing phase. What is a reasonable theta threshold for deciding that a set is high cardinality?

patelprateek, Sep 20 '22 09:09

Here is the doc for the function: https://docs.pinot.apache.org/configuration-reference/functions/distinctcountthetasketch. You may also learn more about theta sketches here: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

There is one parameter that can be passed to the function: nominalEntries. By default it is set to 4096; you may try a higher value to get better accuracy, at the cost of worse performance.
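As a rough guide to that trade-off, here is a minimal sketch that assumes the usual rule of thumb for a single theta sketch in estimation mode: a relative standard error of about 1/sqrt(k - 1), where k is nominalEntries. The helper name `theta_rse` is made up for illustration and is not part of Pinot or DataSketches. The rule covers a single sketch only, not the result of an intersection.

```python
import math


def theta_rse(nominal_entries: int) -> float:
    """Approximate relative standard error of one theta sketch.

    Assumes the sketch is in estimation mode (more distinct values than
    nominal_entries) and uses the rule of thumb RSE ~ 1 / sqrt(k - 1).
    """
    return 1.0 / math.sqrt(nominal_entries - 1)


# Each 4x increase in nominalEntries roughly halves the error,
# while the number of retained hashes grows by about 4x.
for k in (4096, 16384, 65536):
    print(f"nominalEntries={k:>6}  RSE ~ {theta_rse(k):.2%}")
# nominalEntries=  4096  RSE ~ 1.56%
# nominalEntries= 16384  RSE ~ 0.78%
# nominalEntries= 65536  RSE ~ 0.39%
```

Under this assumption, halving the error costs roughly four times as many retained entries per sketch.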

Jackie-Jiang, Sep 21 '22 20:09

Maybe my question wasn't clear. I understand what theta sketches are, but I am trying to understand how you build auto sharding for high-cardinality segments when constructing a theta sketch: what is considered high cardinality, and what are the thresholds? IIUC, intersection(theta_sketch(a), theta_sketch(b)) can have a high error rate when the Jaccard similarity is low or the difference between the cardinalities of sets A and B is large, so you also shard the bigger set to make it smaller. I am trying to better understand how this sharding is implemented.
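To make the intersection concern concrete, here is a toy simulation. It is a simplified KMV-style sketch written for illustration only; it is not the DataSketches or Pinot implementation, and all function names (`h`, `build_sketch`, `estimate`, `intersect`) are hypothetical. It compares an intersection of two similarly sized sets with an intersection of a large set and a small one.

```python
import hashlib


def h(item: int) -> float:
    """Map an item to a deterministic pseudo-uniform value in [0, 1]."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(items, k):
    """Toy theta sketch: returns (theta, set of retained hashes below theta)."""
    hashes = sorted({h(x) for x in items})
    if len(hashes) <= k:
        return 1.0, set(hashes)      # exact mode: nothing is sampled away
    theta = hashes[k]                # the (k+1)-th smallest hash is the threshold
    return theta, set(hashes[:k])    # keep the k hashes strictly below it


def estimate(sketch):
    """Cardinality estimate: retained count scaled up by 1 / theta."""
    theta, retained = sketch
    return len(retained) / theta


def intersect(a, b):
    """Intersect two sketches: take the smaller theta, keep common hashes below it."""
    theta = min(a[0], b[0])
    return theta, {x for x in a[1] & b[1] if x < theta}


k = 1024
big = build_sketch(range(100_000), k)
similar = build_sketch(range(50_000, 150_000), k)  # true overlap with big: 50,000
small = build_sketch(range(1_000), k)              # true overlap with big: 1,000

for name, other, truth in (("similar sizes", similar, 50_000),
                           ("skewed sizes", small, 1_000)):
    result = intersect(big, other)
    print(f"{name}: true={truth}, estimate={estimate(result):.0f}, "
          f"retained samples={len(result[1])}")
```

In the skewed case the result inherits the large set's small theta, so only a handful of the small set's hashes survive the intersection, and the estimate rests on very few samples. That is the effect sharding the larger set would aim to reduce.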

patelprateek, Sep 21 '22 23:09

The implementation for this support is in the DistinctCountThetaSketchAggregationFunction class. With the current implementation, we don't shard the set. I think this can be a good optimization, and we need some research to decide a cardinality threshold to shard the set. We can also consider providing this threshold as a parameter to the function. Do you want to help contribute this feature?

Jackie-Jiang, Sep 23 '22 18:09