elementary icon indicating copy to clipboard operation
elementary copied to clipboard

Request for volume anomalies to pre-aggregate data before computing statistics

Open garfieldthesam opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe. I need to run volume anomaly tests on very large tables. However, I cannot performantly do so because the compiled query does not pre-aggregate the row count data. For example this is what the first CTE looks like for a test run on my Databricks cluster:

with monitored_table as (
  select
    *
  from
    <very large table>
  where
    <where predicates>
),

For large tables (especially ones that are very wide), this is prohibitively expensive for 2 reasons:

  • A select * fetches all table columns, which isn't necessary in principle for creating a time series; this is especially costly for columnar databases
  • An aggregated time series can be generated from the table, providing a much smaller dataset for later CTEs in the compiled query to apply their transformations/filters to

Describe the solution you'd like I'd like to request a rearchitecture of how the volume anomalies code works to improve performance for large tables. For a volume anomaly test the first CTE may look something like this:

with monitored_table_volume_counts as (
  select
    date_trunc(datepart, timestamp_column) as time_bucket,
    count(1) as num_rows
  from
    <very large table>
  where
    <where predicates>
  group by 
    date_trunc(datepart, timestamp_column)
),

Describe alternatives you've considered We already have a solution, but it is cumbersome and not ideal:

We've had to create derived data quality metrics tables summarizing the large table's metrics each over time. We then run elementary column sum tests on the aggregated metrics.

Additional context Somewhat related to but still distinct from feature request #1158

Would you be willing to contribute this feature? Absolutely willing if I have the bandwidth (though that seems unlikely in the near future).

garfieldthesam avatar Dec 14 '23 02:12 garfieldthesam

Hi @garfieldthesam ! Thanks for opening this issue and sorry about the late response. We've actually implemented since the issue was opened performance optimizations to the volume anomalies test (and others) - so I wonder if the issue is still relevant.

Closing but feel free to re-open.

haritamar avatar May 29 '24 00:05 haritamar

please reopen, not fixed (the first CTE should be an aggregated time series, as OP describes) tested with 0.15.2

data-blade avatar Jul 05 '24 10:07 data-blade

fyi @haritamar

data-blade avatar Jul 10 '24 16:07 data-blade

Hey @data-blade, sorry for the late response 🙏🏻 I am re-opening this issue, and will talk with the team about it.

IDoneShaveIt avatar Aug 15 '24 08:08 IDoneShaveIt