Expressions should produce continuous data during low or zero traffic

Open clux opened this issue 3 years ago • 3 comments

The generated SLIs do not currently produce smooth graphs in Grafana or Prometheus when there is low traffic or missing data, but they could easily do so with a couple of minor additions.

The two cases:

  • No errors happening recently
  • No traffic happening recently

When there have been no recent errors, the numerator in error_query / total_query will often be absent, because users have not initialised their error metrics to zero. This can be handled by adding or on() vector(0) to the numerator (or across the whole fraction); however, this fix does not work when there is also no traffic.
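As a rough sketch of the numerator-only fallback (the metric name http_requests_total, the code label, and the 5m window are assumptions for illustration, not taken from Sloth's generated rules):

```promql
# Hypothetical metric and label names; 5m window chosen only for illustration.
(
  # Error rate; returns no sample if no 5xx series has ever been exported.
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  # Fall back to a label-less 0 when the error rate is absent.
  or on() vector(0)
)
/
sum(rate(http_requests_total[5m]))
```

With traffic but no 5xx series this evaluates to 0. With a zero denominator it is still 0/0, which is why this alone is not enough.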

If there's no traffic, then the denominator in that query is zero (at least if the metrics are properly initialised). This means we get no usable value in Prometheus (0/0 evaluates to NaN, which graphs as missing data), and in Grafana it's even worse because zero division actually yields something pretty buggy ( https://github.com/grafana/grafana/issues/59349 ). At any rate, restricting the denominator explicitly to non-zero values lets us default the undefined/missing parts equally and gives us a smooth default in both Prometheus and Grafana:

(error_query / (total_query > 0)) or on() vector(0)

I.e. it should be a fairly easy thing to add to Sloth. We avoid dividing by zero and return an absent metric instead (when total_query returns zero), so the fallback kicks in. This catches both the case where either of the metrics is uninitialised and the case where the expression is zero over zero.
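To make the cases concrete, here is a sketch of the combined expression with hypothetical metric names (http_requests_total, the code label, and the 5m window are assumptions, not Sloth output), with the expected behaviour per case noted in comments:

```promql
# Hypothetical metric and label names; 5m window chosen only for illustration.
(
    sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
    # Keep the denominator only when there is traffic; otherwise it is dropped.
    (sum(rate(http_requests_total[5m])) > 0)
)
# Default to 0 whenever the fraction above produced no sample.
or on() vector(0)

# errors > 0,     total > 0  -> the real error ratio
# errors absent,  total > 0  -> division has no left-hand sample, fallback gives 0
# errors = 0,     total = 0  -> "> 0" drops the denominator, fallback gives 0
# both absent                -> nothing to divide, fallback gives 0
```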

WDYT? Would you be open to a change like this?

clux, Nov 26 '22 10:11

I have updated the issue a bit to clarify that this is not just a Grafana display issue (though it is worse in Grafana), but about producing a continuous SLI output even when traffic is low or zero.

clux, Nov 28 '22 15:11

Hi, @clux. I had the same question and solved it with a raw query. Perhaps this can also help you:

- name: "requests-availability"
  objective: 95
  sli:
    raw:
      errorRatioQuery: |
        (
          (sum(rate(istio_requests_total{reporter="source", destination_service="app", response_code=~"5.."}[{{.window}}])))
          /
          (sum(rate(istio_requests_total{reporter="source", destination_service="app"}[{{.window}}])) > 0)
        ) OR on() vector(0)
  alerting:
    name: high_error_rate
    labels:
      category: "availability"

zhdanovartur, Nov 29 '22 06:11

Ah, good to see it is possible. I was hoping that this type of thing could perhaps be the default within Sloth though, so that not every user would have to discover this.

clux, Nov 29 '22 07:11