Support for hourly, weekly and monthly bucket sizes in anomaly tests
In anomaly detection, the default bucket size is 1 day. I couldn't find a way to configure the size of these buckets.
Checking the code here, I think it's not an easy fix, since it also depends on how the data is collected before this query is executed.
I would keep exploring, but perhaps you can guide me on what should be done to be able to do this.
Hi @rloredo! Thanks for opening this issue, makes perfect sense. You are right that this is not an easy fix. However, we had this in mind as a future option when we wrote the tests, so it's not a fundamental change. I'll test a bit and describe the changes required here, so we can discuss further :)
Thanks again @rloredo, let me know what you think about this:
Here's a short background on our tests -
Each anomalies test is a macro under ./macros/edr/tests/. Each test runs 2 queries - one to collect new metrics, and one to calculate the anomaly score. The results of the queries are saved to tables, and merged into the elementary models on an on-run-end hook.
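To make that flow concrete, here's a rough sketch of the shape such a test takes (the macro body, the argument lists and the `is_anomaly` column are simplified illustrations, not the actual implementation):

```sql
{# Simplified illustration of the two-query flow described above --
   the real macros under ./macros/edr/tests/ are more involved, and the
   argument lists and the is_anomaly column here are assumptions. #}
{% test column_anomalies_sketch(model, column_name) %}

    {# Query 1: collect fresh metrics for the tested column into a table #}
    {% do run_query(elementary.column_monitoring_query(model, column_name)) %}

    {# Query 2: score the new metrics against the metric history #}
    {% set scores_query = elementary.get_anomaly_scores_query(model, column_name) %}

    {# The test fails when this select returns rows; the saved results are
       merged into the elementary models on an on-run-end hook #}
    select * from ({{ scores_query }}) scores
    where scores.is_anomaly

{% endtest %}
```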

My suggestion is to break down the change into pieces. I think the best place to start would be the column-level tests. After we add this e2e, it should be pretty easy to reuse the macros and change the other tests as well.
Here's a description of what's needed to support this:

- We need to add `bucket_size` (is this the best name?) to the config of the tests. I think it makes sense to configure it at the model and/or test level. You can look at the implementation of `timestamp_column` here. I suggest having a list of supported time formats, and if the config is not in the list, returning `day` as the default (see the sketch right after this list).
- This config needs to be added as a param to these tests:
  - `test_all_columns_anomalies`
  - `test_column_anomalies`
- The tests should pass the bucket size as a param to the following queries:
  - `column_monitoring_query`
  - `get_anomaly_scores_query`
- In the queries, there is a macro named `elementary.date_trunc` that today always receives `'day'`, but it can receive any date part supported by the 4 platforms we support (`'hour'`, `'week'` and `'month'` are all supported!).
- In the `metrics_final` CTE, the following columns should be impacted by the config (see the SQL sketch below):
  - `bucket_end`
  - `bucket_duration_hours`
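As a starting point for the first bullet, a minimal sketch of the fallback logic could look like this (the macro name and argument are mine, not the repo's; the real config plumbing would follow the `timestamp_column` implementation):

```sql
{# Hypothetical sketch: validate a configured bucket size against the
   supported date parts and fall back to 'day' when the value is missing
   or unsupported. #}
{% macro get_bucket_size(configured_bucket_size) %}
    {% set supported_bucket_sizes = ['hour', 'day', 'week', 'month'] %}
    {% if configured_bucket_size and configured_bucket_size | lower in supported_bucket_sizes %}
        {{ return(configured_bucket_size | lower) }}
    {% endif %}
    {# Unsupported or missing config -> keep today's default #}
    {{ return('day') }}
{% endmacro %}
```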
As the anomaly scores query is shared between all the tests and has complex logic, I think it's better if the team makes the required changes to support the different buckets on that one.
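To illustrate the `bucket_end` / `bucket_duration_hours` point above, here's a hedged SQL sketch of how those columns could follow the configured bucket size. It assumes `elementary.date_trunc` takes the date part as its first argument, uses dbt's cross-platform `dateadd` macro for the bucket boundary, and leaves `'month'` out of the duration mapping since its length in hours isn't constant:

```sql
{# Illustrative only -- not the real metrics_final CTE. bucket_size is
   assumed to be a Jinja string in scope, e.g. 'hour' / 'day' / 'week'. #}
{% set hours_per_bucket = {'hour': 1, 'day': 24, 'week': 168} %}

with metrics as (
    select * from {{ ref('stg_metrics') }}  -- hypothetical metrics source
),

metrics_final as (
    select
        metric_name,
        metric_value,
        -- bucket_end = start of the bucket that follows the metric's timestamp
        {{ dbt.dateadd(bucket_size, 1, elementary.date_trunc(bucket_size, 'metric_timestamp')) }} as bucket_end,
        {{ hours_per_bucket[bucket_size] }} as bucket_duration_hours
    from metrics
)

select * from metrics_final
```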
Hi @Maayan-s! Thank you for your fast response. It looks clearer now :) I can dedicate some time to doing part of the steps, but the last part wasn't clear to me: do you want to leave this enhancement to the team? I wouldn't mind waiting for you to do it either.
For the `bucket_size` name, I think it's good to change it... I think we can use this as inspiration. Note that they also use `lookback_periods`, `trend_periods`, and `test_periods`. I guess that `days_back` should also change to `periods_back` (or `lookback_periods`).
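To make that concrete, a test signature under the proposed naming might look roughly like this (purely hypothetical; the defaults and body are placeholders):

```sql
{# Hypothetical signature only -- period would replace bucket_size, and
   periods_back would replace days_back; the body is a placeholder. #}
{% test column_anomalies(model, column_name, period='day', periods_back=14,
                         trend_periods=7, test_periods=1) %}
    select 1 as placeholder from (select 1) as t where 1 = 0
{% endtest %}
```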
Let me know what you think and if you want me to take some of this!
Thanks, @rloredo!
We would do it at some point, but I don't think we will get to it in the next several weeks.
If you are open to taking on the steps I described, this will make it easier for us to prioritize.
We could have someone on the team finish the `anomaly_scores_query` changes and the integration with the UI as well.
What do you think?
Great, I can give it a try then :)
So, I did most of the changes in https://github.com/elementary-data/dbt-data-reliability/pull/85. These tests are working and writing the data correctly to the `__test` tables:
- `table_anomalies`
- `all_columns_anomalies`
- `column_anomalies`
I couldn't make `dimension_anomalies` work, mainly because of the `metrics_anomaly_score` view. Also, I've only tested the `daily_buckets_cte` (renamed to `period_buckets_cte`) for Snowflake and BQ; I don't have access to Redshift. I didn't run any integration or e2e tests either.
The `#TODO`s are marked in the code.
This is as far as I can get! Let me know if someone will pick it up and if you need any clarification :)
Wow @rloredo, thank you so much! Awesome work, you actually implemented a lot more than I expected :) I'll find time to review in the next few days! Also, we'll prioritize the remaining tasks internally to make this functionality operational.