Exact Cardinality Count extension
Description
This PR introduces the druid-exact-count extension, which provides a new aggregation function for computing the exact distinct count of values within a dimension. Unlike approximate estimators such as HyperLogLog, this extension returns precise results, which matters for use cases that demand exact figures.
The patch achieves this by leveraging the 64-bit support in RoaringBitmap, a compressed bitmap structure for storing and manipulating sets of integers. The extension includes the components needed to integrate this functionality into Druid's query processing engine.
Exact Count Aggregation
The core of this PR is the implementation of an exact count aggregator using RoaringBitmap64.
- Behavioral aspects:
  - The aggregator is invoked via the `bitmap64ExactCount` type in native queries or a corresponding SQL function. It's designed to ingest values from any dimension as long as they are of type `long`.
  - Configuration is minimal.
  - Empty inputs or all-null inputs correctly result in a count of 0.
  - String columns are not supported.
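The behavior listed above can be illustrated with a minimal sketch. This is not the extension's code: `ExactCountSketch` and its methods are hypothetical names, and a plain `HashSet<Long>` stands in for the 64-bit RoaringBitmap so the example runs with the JDK alone. It only shows the semantics: nulls are skipped, duplicates collapse, partial results merge by union, and empty or all-null input yields 0.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative stand-in for an exact distinct-count aggregator over long values.
 * Assumption: a HashSet<Long> replaces the 64-bit bitmap used by the extension.
 */
final class ExactCountSketch {
    private final Set<Long> seen = new HashSet<>();

    /** Build step: add one raw column value; nulls are skipped. */
    void add(Long value) {
        if (value != null) {
            seen.add(value);
        }
    }

    /** Merge step: union another partial result into this one. */
    void merge(ExactCountSketch other) {
        seen.addAll(other.seen);
    }

    /** Exact number of distinct non-null values seen so far. */
    long count() {
        return seen.size();
    }

    /** Convenience helper (hypothetical): build a sketch from a list of values. */
    static ExactCountSketch of(List<Long> values) {
        ExactCountSketch sketch = new ExactCountSketch();
        for (Long v : values) {
            sketch.add(v);
        }
        return sketch;
    }

    public static void main(String[] args) {
        ExactCountSketch segmentA = of(Arrays.asList(1L, 2L, 2L, null));
        ExactCountSketch segmentB = of(Arrays.asList(2L, 3L));
        segmentA.merge(segmentB);
        System.out.println(segmentA.count()); // prints 3

        ExactCountSketch allNull = of(Arrays.asList((Long) null, (Long) null));
        System.out.println(allNull.count()); // prints 0
    }
}
```

A set-based stand-in keeps the counting logic visible; the compressed bitmap in the extension serves the same role while using far less memory for large inputs.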
Integration Tests
- Changed `./it.sh` to also include `extension-contrib` packages when building the Druid image.
- Added an IT for druid-exact-count.
Differences with Distinct Count Aggregator
| Exact Count | Distinct Count |
|---|---|
| No prerequisites such as configuring hash partitioning or segment granularity | Prerequisites must be met before aggregation can be performed |
| Works on 64-bit number columns only (BIGINT) | Works on dimension columns (Including Strings, Complex Types, etc) |
Release note
Introduced a new extension druid-exact-count which provides an aggregator BITMAP64_EXACT_COUNT(columnName) for computing exact distinct counts on numerical columns.
Key changed/added classes in this PR
- `Bitmap64ExactCountAggregatorFactory`
- `Bitmap64ExactCountBuildAggregatorFactory`
- `Bitmap64ExactCountMergeAggregatorFactory`
- `Bitmap64ExactCountBuildAggregator`
- `Bitmap64ExactCountMergeAggregator`
- `Bitmap64ExactCountBuildBufferAggregator`
- `Bitmap64ExactCountMergeBufferAggregator`
- `RoaringBitmap64Counter`
- `Bitmap64` interface, for extensibility when newer/faster bitmap functions are introduced.
- `Bitmap64ExactCountBuildComplexMetricSerde`
- `Bitmap64ExactCountMergeComplexMetricSerde`
- `Bitmap64ExactCountModule`
- `Bitmap64ExactCountPostAggregator`
- `Bitmap64ExactCountSqlAggregator`
- `it.sh`
- Docs @ `druid-exact-count.md`
- Integration Test files
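The extensibility point mentioned for the `Bitmap64` interface can be sketched roughly as follows. Everything here is an assumption for illustration: `LongCounter`, `HashCounter`, `SortedCounter`, and `CounterDemo` are hypothetical names, the method signatures are not taken from the PR, and JDK sets stand in for bitmap implementations. The point is only that code written against the interface does not change when a different backing structure is swapped in.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

/** Hypothetical abstraction over a set of 64-bit values; not the PR's actual interface. */
interface LongCounter {
    void add(long value);

    long cardinality();

    Iterable<Long> values();

    /** Union another counter into this one, regardless of its implementation. */
    default void fold(LongCounter other) {
        for (long v : other.values()) {
            add(v);
        }
    }
}

/** One possible backing structure (stand-in for a bitmap implementation). */
final class HashCounter implements LongCounter {
    private final Set<Long> set = new HashSet<>();

    public void add(long value) { set.add(value); }

    public long cardinality() { return set.size(); }

    public Iterable<Long> values() { return set; }
}

/** A second backing structure that can replace the first without touching callers. */
final class SortedCounter implements LongCounter {
    private final Set<Long> set = new TreeSet<>();

    public void add(long value) { set.add(value); }

    public long cardinality() { return set.size(); }

    public Iterable<Long> values() { return set; }
}

final class CounterDemo {
    /** Counts distinct values across two inputs using whichever implementation the factory supplies. */
    static long distinct(Supplier<LongCounter> factory, long[] a, long[] b) {
        LongCounter left = factory.get();
        for (long v : a) {
            left.add(v);
        }
        LongCounter right = factory.get();
        for (long v : b) {
            right.add(v);
        }
        left.fold(right);
        return left.cardinality();
    }

    public static void main(String[] args) {
        long[] a = {1L, 2L, 2L};
        long[] b = {2L, 3L};
        System.out.println(distinct(HashCounter::new, a, b));   // prints 3
        System.out.println(distinct(SortedCounter::new, a, b)); // prints 3
    }
}
```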
This PR has:
- [x] been self-reviewed.
- [x] added documentation for new or modified features or behaviors.
- [x] a release note entry in the PR description.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added or updated version, license, or notice information in licenses.yaml
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [x] added integration tests.
- [x] been tested in a test Druid cluster.
@FrankChen021 Thanks for the feedback. I have added integration tests by allowing CI to build the Druid image with all extension-contrib packages.
Added examples for Bitmap64ExactCountBuild + Bitmap64ExactCountMerge from rolled-up Bitmap64 columns. Please take a look again.