Instrument with OTel tracing
Fixes #3670
I've pulled in tracing/tracing.go from prometheus/prometheus to kickstart this, and to provide a familiar (to Prometheus operators) configuration mechanism.
It's possible that it'd be worth pushing some of this up into prometheus/common, but I'll defer on that one.
So far, all incoming requests are instrumented, and notifications are instrumented, with webhooks and e-mails getting slightly more detailed instrumentation; both propagate trace context downstream.
The decoupled nature of how alerts are handled within Alertmanager means that there are a number of disjointed spans, but I've attempted to tie some of those together using span links.
Unclear why CI is failing - I wonder if that test is flaky for unrelated reasons?
I've been wondering about the support of this for months and it's awesome to see movement on this front. Much appreciated!
@els0r you're welcome! Let me know if there are any particular things you want traced - this is really just a start.
@hairyhenderson, can you show me how to test this next week so I can better understand how it all works?
Sorry for the delay... a laptop refresh caused me to lose my test configs 🤦♂️
I tested this by setting up a local Grafana, Tempo, and Agent (configured to receive OTLP).
Then I fired up the webhook echo service (`go run ./examples/webhook/echo.go` - I actually modified that code first to also print headers, so I could see the trace headers).
I then ran alertmanager like this:
```console
$ ./alertmanager --web.listen-address=127.0.0.1:9093 --log.level=debug --config.file=am-trace-test.yml
```
Here's am-trace-test.yml:
```yaml
global:
  smtp_smarthost: 'localhost:2525'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - matchers:
        - severity="page"
      receiver: web.hook
    - matchers:
        - severity="email"
      receiver: smtp

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'smtp'
    email_configs:
      - to: '[email protected]'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true
```
Then I ran prometheus like this:
```console
$ ./prometheus --config.file=prom-alert-tracing-test.yml --web.listen-address=127.0.0.1:9990
```
Here's prom-alert-tracing-test.yml:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093

rule_files:
  - prom-alert-tracing-rules.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9990"]

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true
```
And here's the referenced prom-alert-tracing-rules.yml:
```yaml
groups:
  - name: alerts
    rules:
      - alert: AlwaysFiring
        expr: '1'
        labels:
          severity: page
      - alert: AlsoAlwaysFiring
        expr: '1'
        labels:
          severity: email
```
For the smtp receiver, I also set up an instance of smtprelay and a simple SMTP echo service. I've lost the configs for those, and can whip something up if you need...
Let me know if this makes sense, otherwise let's set up some time tomorrow or next week to pair on this 😉
IIRC this

```
--- FAIL: TestResolved (5.38s)
    acceptance.go:182: failed to start alertmanager cluster: unable to get a successful response from the Alertmanager: Get "http://127.0.0.1:40783/api/v2/status": dial tcp 127.0.0.1:40783: connect: connection refused
```

is a flaky test; re-running should fix it.