dbt-data-reliability icon indicating copy to clipboard operation
dbt-data-reliability copied to clipboard

Support for hourly, weekly, monthly bucket size in anomaly tests

Open rloredo opened this issue 3 years ago • 7 comments

In anomaly detection, the default bucket size is 1 day. I couldn't find a way to configure the size of these buckets.

Checking the code here I think that it's not an easy fix, since it also depends on how the data is collected before this query is executed.

I would keep exploring, but perhaps you can guide me on what should be done to be able to do this.

rloredo avatar Aug 21 '22 16:08 rloredo

Hi @rloredo! Thanks for opening this issue, makes perfect sense. You are right that this is not an easy fix. However, we had this in mind as a future option when we wrote the tests, so it's not a fundamental change. I'll test a bit and describe the changes required here, so we can discuss further :)

Maayan-s avatar Aug 21 '22 17:08 Maayan-s

Thanks again @rloredo, let me know what do you think about this:

Here's a short background on our tests - Each anomalies test is a macro under ./macros/edr/tests/. Each test runs 2 queries - one to collect new metrics, and one to calculate the anomaly score. The results of the queries are saved to tables, and merged to the elementary models on an on-run-end hook.

elementary anomalies tests

My suggestion is to break down the change to pieces. I think the best place to start would the column-level tests. After we add this e2e, it should be pretty easy to reuse the macros and change the other tests as well.

Here's a description on what's needed to support this -

  1. We need to add bucket_size (is this the best name?) to the config of tests. I think it makes sense to configure it in the model and / or test level. You can look at the implementation of timestamp_column here. I suggest having a list of supported time formats, and if the config is not in the list, return day as default.

  2. This config needs to be added as a param to 4 tests:

  • test_all_columns_anomalies
  • test_column_anomalies
  1. The tests should pass the bucket size as a param to the following queries:
  • column_monitoring_query
  • get_anomaly_scores_query
  1. In the queries, there is a macro named elementary.date_trunc that today always receives 'day', but can receive any date part supported by the 4 platforms we support ('hour', 'week' and 'month' are all supported!).

  2. In the metrics_final CTE the following columns should be impacted by the config:

  • bucket_end
  • bucket_duration_hours

As the anomaly scores query is shared between all the tests and has a complex logic, I think it's better the team makes the required changes to support the different buckets on that one.

Maayan-s avatar Aug 21 '22 18:08 Maayan-s

Hi, @Maayan-s! thank you for your fast response. It looks clearer now :) I can dedicate some time to doing part of the steps, but the last part wasn't clear to me: do you want to leave this enhancement to the team? I wouldn't mind waiting for you to do it either.

For the bucket_size thing, I think it's good to change the name... I think we can use this as inspiration. Note that they also use lookback_periods, trend_periods, and test_periods. I guess that days_back should also change to periods_back (or lookback_periods).

Let me know what you think and if you want me to take some of this!

rloredo avatar Aug 22 '22 08:08 rloredo

Thanks, @rloredo! We would do it at some point, but I don't think we will get to it in the next several weeks. If you are open to taking on the steps I described, this will make it easier for us to prioritize. We could have someone on the team finish the anomaly_scores_query changes and integration to the UI as well.

What do you think?

Maayan-s avatar Aug 22 '22 08:08 Maayan-s

Great, I can give it a try then :)

rloredo avatar Aug 22 '22 15:08 rloredo

So, I did most of the changes in https://github.com/elementary-data/dbt-data-reliability/pull/85 These tests are working and writing the data correctly to __test tables:

  • table_anomalies
  • all_columns_anomalies
  • column_anomalies

I couldn't make dimension_anomalies work, mainly because of the metrics_anomaly_score view. Also, I've only tested the daily_buckets_cte (renamed to period_buckets_cte) for Snowflake and BQ. I don't have access to Redshift. I didn't run any integration or e2e test either.

The #TODO's are marked in the code.

This is as far as I can get! Let me know if someone will pick it up and I if you need any clarification :)

rloredo avatar Aug 28 '22 16:08 rloredo

Wow @rloredo, thank you so much! Awesome work, you actually implemented a lot more than I expected :) I'll find time to review in the next few days! Also we will prioritize internally the remaining tasks to make this functionality operational.

Maayan-s avatar Aug 29 '22 11:08 Maayan-s