cube icon indicating copy to clipboard operation
cube copied to clipboard

feat(redshift) support approx count distinct

Open pauldheinrichs opened this issue 1 year ago • 9 comments

Check List

  • [ ] Tests has been run in packages where changes made if available
  • [ ] Linter has been run for changed code
  • [ ] Tests for the changes have been added if not covered yet
  • [ ] Docs have been added / updated if required

Add support for redshift approx distinct calls (i think) please feel free to make changes or whatever to help get this through. I would just like to start using this for additive measures in cube and would love this to get in sooner rather than later 😅

Docs: https://docs.aws.amazon.com/redshift/latest/dg/hyperloglog-functions.html

Issue Reference this PR resolves Fixes: https://github.com/cube-js/cube/issues/8510

Description of Changes Made (if issue reference is not provided)

[Description goes here]

pauldheinrichs avatar Jul 24 '24 15:07 pauldheinrichs

The latest updates on your projects. Learn more about Vercel for Git ↗︎

8 Skipped Deployments
Name Status Preview Comments Updated (UTC)
examples-angular-dashboard ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-react-d3 ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-react-dashboard ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-react-data-table ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-react-highcharts ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-react-material-ui ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-react-pivot-table ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm
examples-vue-query-builder ⬜️ Ignored (Inspect) Visit Preview Sep 9, 2024 4:05pm

vercel[bot] avatar Jul 24 '24 15:07 vercel[bot]

If this gets through @paveltiunov's review, I think changes to this docs section would be needed: https://cube.dev/docs/product/configuration/data-sources/aws-redshift#count_distinct_approx

igorlukanin avatar Jul 26 '24 17:07 igorlukanin

Tested with pre-aggregation:

  • Successfully created pre-aggregation SQL
 - Executing Load Pre Aggregation SQL: scheduler-ba104a65-512d-4700-aaeb-67caa9743162
--
  CREATE TABLE dev_pre_aggregations.session_funnel_main20240101_ubycvf3l_wiz2w3y_1jafait AS SELECT
      "session_funnel".step "session_funnel__step", date_trunc('day', ("session_funnel".t::timestamptz AT TIME ZONE 'UTC')) "session_funnel__time_day", count("session_funnel".user_id) "session_funnel__conversions", CASE WHEN HLL("session_funnel".first_step_user_id) > 0 THEN 100.0 * count("session_funnel".user_id) / HLL("session_funnel".first_step_user_id) ELSE NULL END "session_funnel__conversions_percent", HLL("session_funnel".first_step_user_id) "session_funnel__first_step_conversions"
    FROM

pauldheinrichs avatar Jul 29 '24 14:07 pauldheinrichs

Validated non pre-aggregated queries,


SELECT
      date_trunc('day', ("session_funnel".t::timestamptz AT TIME ZONE 'UTC')) "session_funnel__time_day", CASE WHEN APPROXIMATE COUNT(DISTINCT "session_funnel".first_step_user_id) > 0 THEN 100.0 * count("session_funnel".user_id) / APPROXIMATE COUNT(DISTINCT "session_funnel".first_step_user_id) ELSE NULL END "session_funnel__conversions_percent"
    FROM
      (
      ```

pauldheinrichs avatar Jul 29 '24 15:07 pauldheinrichs

@pauldheinrichs Thanks for contributing! I guess hllInit should use https://docs.aws.amazon.com/redshift/latest/dg/r_HLL_CREATE_SKETCH.html

paveltiunov avatar Aug 24 '24 01:08 paveltiunov

@pauldheinrichs Thanks for contributing! I guess hllInit should use https://docs.aws.amazon.com/redshift/latest/dg/r_HLL_CREATE_SKETCH.html

@paveltiunov I attempted that in this https://github.com/cube-js/cube/pull/8512/commits/71e088275864cb58f2a9ed075732386e00701a3e commit actually and resulting not super succesfully when testing? Unless i tested wrong? 🤔

pauldheinrichs avatar Aug 24 '24 13:08 pauldheinrichs

@pauldheinrichs Hard to tell what's went wrong but according to the doc HLL already does an estimation and it isn't something we want to happen in hllInit.

paveltiunov avatar Sep 09 '24 15:09 paveltiunov

@pauldheinrichs It feels we need HLL_CARDINALITY(HLL_COMBINE(${sql})) for hllMerge and also HLL_COMBINE(${sql}) for hllMergeOnly. But this one needs to be tested.

paveltiunov avatar Sep 28 '24 18:09 paveltiunov

@pauldheinrichs Given that AWS HLL format spec is proprietary and HLL support can't be also brought to Cube Store, I wonder how useful this addition to the Redshift driver is. WDYT?

igorlukanin avatar Jun 09 '25 11:06 igorlukanin