Response time histograms using the Prometheus sink are no longer in seconds
We migrated from StatsD to the Prometheus sink and use the following mapper snippet to monitor our infrastructure:
- match: "ratelimit_server.*.response_time"
  name: "ratelimit_service_response_time_seconds"
  timer_type: histogram
  labels:
    grpc_method: "$1"
These metrics used to be reported in seconds, but are now reported in milliseconds.
As stated in the statsd-exporter README:
Statsd timer data is transmitted in milliseconds, while Prometheus expects the unit to be seconds. The exporter converts all timer observations to seconds. Histogram and distribution events (`h` and `d` metric type) are not subject to unit conversion.
This conversion used to happen when parsing observer events: https://github.com/prometheus/statsd_exporter/blob/c18857b71b4afc2c304e4d34aa431a41234843ac/pkg/line/line.go#L82. In the new implementation, the histogram value is taken as-is: https://github.com/envoyproxy/ratelimit/blob/28b1629a21e885bdd2b527d6a1c1de8483dc47d4/src/stats/prom/prometheus_sink.go#L157.
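To make the expected behaviour concrete, here is a minimal sketch of the unit rule described in the README quote above: timer values (`ms`) are divided by 1000, while `h` and `d` values pass through unchanged. The helper name `observationInSeconds` is hypothetical and is not taken from statsd_exporter or ratelimit.

```go
package main

import "fmt"

// observationInSeconds is a hypothetical helper, not an existing API.
// statsdType is the StatsD type suffix: "ms" for timers, "h" for
// histograms, "d" for distributions.
func observationInSeconds(value float64, statsdType string) float64 {
	if statsdType == "ms" {
		// Timers arrive in milliseconds; Prometheus expects seconds.
		return value / 1000.0
	}
	// Histogram and distribution events are not converted.
	return value
}

func main() {
	// A 250 ms timer observation is converted to seconds.
	fmt.Println(observationInSeconds(250, "ms"))
	// A histogram observation is passed through as-is.
	fmt.Println(observationInSeconds(0.25, "h"))
}
```

With a rule like this applied in the sink, a timer routed to a histogram through the mapper would be observed in seconds again.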
This change (regression?) means that the default histogram buckets, which are sized for values in seconds, no longer make sense for values reported in milliseconds. I think we need to implement the same kind of unit switch in the Prometheus sink.
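As an illustration of the bucket problem, the sketch below assumes the default buckets are those of the Prometheus Go client (`prometheus.DefBuckets`, with upper bounds from 0.005 to 10); I have not checked that the sink uses exactly these. The helper `bucketFor` is hypothetical and only mimics the "less than or equal to upper bound" rule.

```go
package main

import (
	"fmt"
	"sort"
)

// defBuckets copies the upper bounds of prometheus.DefBuckets.
// That these are the buckets in use here is an assumption.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// bucketFor is a hypothetical helper: it returns the upper bound of the
// first bucket whose bound is >= v, or "+Inf" if v exceeds every bound.
func bucketFor(v float64) string {
	i := sort.SearchFloat64s(defBuckets, v)
	if i == len(defBuckets) {
		return "+Inf"
	}
	return fmt.Sprint(defBuckets[i])
}

func main() {
	// The same 250 ms response, observed in seconds and in milliseconds.
	fmt.Println("as seconds:     ", bucketFor(0.25))
	fmt.Println("as milliseconds:", bucketFor(250))
}
```

Under that assumption, any response slower than 10 ms lands in the `+Inf` bucket when observed in milliseconds, so the histogram loses almost all resolution.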
WDYT?