Request for volume anomalies to pre-aggregate data before computing statistics
Is your feature request related to a problem? Please describe. I need to run volume anomaly tests on very large tables. However, I cannot performantly do so because the compiled query does not pre-aggregate the row count data. For example this is what the first CTE looks like for a test run on my Databricks cluster:
with monitored_table as (
select
*
from
<very large table>
where
<where predicates>
),
For large tables (especially ones that are very wide), this is prohibitively expensive for 2 reasons:
- A
select *fetches all table columns, which isn't necessary in principle for creating a time series; this is especially costly for columnar databases - An aggregated time series can be generated from the table, providing a much smaller dataset for later CTEs in the compiled query to apply their transformations/filters to
Describe the solution you'd like I'd like to request a rearchitecture of how the volume anomalies code works to improve performance for large tables. For a volume anomaly test the first CTE may look something like this:
with monitored_table_volume_counts as (
select
date_trunc(datepart, timestamp_column) as time_bucket,
count(1) as num_rows
from
<very large table>
where
<where predicates>
group by
date_trunc(datepart, timestamp_column)
),
Describe alternatives you've considered We already have a solution, but it is cumbersome and not ideal:
We've had to create derived data quality metrics tables summarizing the large table's metrics each over time. We then run elementary column sum tests on the aggregated metrics.
Additional context Somewhat related to but still distinct from feature request #1158
Would you be willing to contribute this feature? Absolutely willing if I have the bandwidth (though that seems unlikely in the near future).
Hi @garfieldthesam ! Thanks for opening this issue and sorry about the late response. We've actually implemented since the issue was opened performance optimizations to the volume anomalies test (and others) - so I wonder if the issue is still relevant.
Closing but feel free to re-open.
please reopen, not fixed (the first CTE should be an aggregated time series, as OP describes) tested with 0.15.2
fyi @haritamar
Hey @data-blade, sorry for the late response 🙏🏻 I am re-opening this issue, and will talk with the team about it.