Rolling mean on resampled data produces incorrect graph

Open matthew-at-qamcom opened this issue 3 years ago • 9 comments

I cannot correctly graph a rolling average on resampled data.

How to reproduce the bug

  1. Add this CSV file as a dataset: demo.csv
  2. Create a "Time-series Line Chart" based on the dataset provided
  3. Set the metric to be "AVG(value)"
  4. At this stage, if you click "Update chart" you'll see a straight line (y=5). Note, for example, there is no value for 2000-01-03, as expected.
  5. Open "Advanced Analytics"
  6. From the resampling rules, select "1 calendar day frequency"
  7. From fill method, select "Zero imputation" (or "Sum values", they both give the same outcome)
  8. If you update the chart now, you will see many days with zero values. The line is no longer the simple y=5. This is as expected.
  9. Select "mean" as the rolling window function.
  10. Set period and min periods to, say, 5.
  11. Update the chart
  12. Note that the graph is not a smooth curve, but rather only has values at y=5 and y=0:

Expected results

I expected to see a smooth curve, with values between zero and 5, similar to: image
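For reference, the expected result can be sketched directly in pandas. This is only an illustration under stated assumptions: the data below is a hypothetical stand-in for demo.csv (a constant value of 5 with some days missing), not the actual file, and the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for demo.csv: the value is always 5,
# but several calendar days (e.g. 2000-01-03) have no row.
df = pd.DataFrame(
    {"value": [5.0, 5.0, 5.0, 5.0]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-07"]),
)

# Step 1: resample to one row per calendar day; empty days become NaN,
# which is then replaced by zero ("Zero imputation").
daily = df["value"].resample("1D").mean().fillna(0)

# Step 2: rolling mean over the resampled series (period = min periods = 5).
smoothed = daily.rolling(window=5, min_periods=5).mean()

print(smoothed)
```

With this data, the first four days are NaN (fewer than 5 periods available) and the remaining days take values strictly between 0 and 5, which is the kind of curve I expected.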

Actual results

We see values at y=5 and y=0, not the values that would be expected from a rolling mean on resampled data: image

Environment

  • browser type and version: Firefox 109.0.1
  • superset version: 0.0.0-dev. I've also tried this on Superset 2.3
  • python version: 3.8.13

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • [ x ] I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • [ x ] I have reproduced the issue with at least the latest released version of superset.
  • [ x ] I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

I'm using the apache/superset Docker images.

matthew-at-qamcom avatar Feb 14 '23 01:02 matthew-at-qamcom

For more context, I posted a related question on Stack Overflow.

matthew-at-qamcom avatar Feb 14 '23 01:02 matthew-at-qamcom

I encountered this problem again. Here's an indication of the size of the effect, with two graphs superimposed. In blue we have Superset's results and in red we have a plot of data manipulated directly using pandas. (image: superimposed)

The difference is caused by the fact that we do not have data for weekends. These days are just ignored by Superset, but a correct manipulation would fill with zeros before calculating the rolling values.
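To give a rough idea of the size of the weekend effect, here is a small pandas sketch. The data is hypothetical (a constant 5 on every business day, no rows on weekends) and the 7-day window is chosen only for illustration; it is not the data behind the graphs above.

```python
import pandas as pd

# Hypothetical data: value 5 on each business day of two weeks,
# with no rows at all for Saturday and Sunday.
idx = pd.bdate_range("2023-05-01", "2023-05-12")
weekdays = pd.Series(5.0, index=idx)

# Rolling over the existing rows only: missing weekends are ignored,
# so every full window averages to 5.
rows_only = weekdays.rolling(window=7, min_periods=7).mean()

# Fill missing calendar days with zero first, then roll:
# every full 7-day window now contains two zero days.
calendar = weekdays.asfreq("D", fill_value=0.0)
filled_first = calendar.rolling(window=7, min_periods=7).mean()

print(rows_only.dropna().unique())     # rows-only result
print(filled_first.dropna().unique())  # zero-filled result
```

In this sketch the rows-only rolling mean stays at 5, while the zero-filled version settles at 25/7 (about 3.57), since each 7-day window covers five weekdays and two zero-filled weekend days.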

matthew-at-qamcom avatar May 07 '23 01:05 matthew-at-qamcom

This still seems to be an issue in 3.x. Thanks for giving sample data and a detailed repro flow. Keeping this open.

rusackas avatar Feb 28 '24 06:02 rusackas

I wonder if @zhaoyongjie knows what's going on here?

rusackas avatar Feb 28 '24 06:02 rusackas

Still an issue in 4.0.1. I think the issue can be reduced to the fact that the rolling values are calculated first, after which the interpolation is carried out.

I can't really think of scenarios when you would want this to be the order instead of first interpolating, so a change of default would imho be justified.
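A minimal pandas sketch of the two orderings, to make the difference concrete. The helper `rolling_mean_daily` and its `resample_first` flag are hypothetical names for this example and are not Superset's actual post-processing code; the input is made-up data with a value of 5 on every other day.

```python
import pandas as pd


def rolling_mean_daily(series, window, resample_first=True):
    """Hypothetical helper: daily resample with zero fill plus rolling mean,
    applied in one of the two possible orders."""
    if resample_first:
        # Resample and zero-fill first, then compute the rolling mean.
        daily = series.resample("1D").mean().fillna(0)
        return daily.rolling(window=window, min_periods=window).mean()
    # Rolling mean over the raw rows first, then resample and zero-fill.
    rolled = series.rolling(window=window, min_periods=window).mean()
    return rolled.resample("1D").mean().fillna(0)


# Made-up data: value 5 on every other day.
raw = pd.Series(5.0, index=pd.date_range("2000-01-01", periods=8, freq="2D"))

rolling_first = rolling_mean_daily(raw, window=3, resample_first=False)
resample_first = rolling_mean_daily(raw, window=3, resample_first=True)

print(sorted(rolling_first.unique()))  # only 0 and 5 appear
print(resample_first.dropna().round(3).unique())  # values between 0 and 5
```

With rolling applied first, the output only ever contains 0 and 5, matching the graph in the report; with resampling applied first, the values fall strictly between 0 and 5.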

Rydberg95 avatar May 23 '24 08:05 Rydberg95

@dosu-bot

rusackas avatar Mar 20 '25 20:03 rusackas

This has been silent for nearly a year, but I'll leave it open since it's a data correctness issue. Volunteers welcome to contribute though, since this doesn't seem to be getting much interest or prioritization.

rusackas avatar Mar 20 '25 21:03 rusackas