
Instrument with OTel tracing

Open hairyhenderson opened this issue 2 years ago • 8 comments

Fixes #3670

I've pulled in tracing/tracing.go from prometheus/prometheus to kickstart this, and to provide a familiar (to Prometheus operators) configuration mechanism.

It's possible that it'd be worth pushing some of this up into prometheus/common, but I'll defer on that one.

So far, all incoming requests and notifications are instrumented. Webhooks and e-mails get slightly more detailed instrumentation, both with downward trace propagation.
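
For readers unfamiliar with what downward propagation puts on the wire: assuming the W3C Trace Context propagator is in use (an assumption based on the tracing setup borrowed from prometheus/prometheus, not something this description states), the receiving service sees a `traceparent` header. The following is a minimal, stdlib-only Go sketch that splits such a header into its fields. The `traceParent` type and `parseTraceParent` function are hypothetical names for illustration and are not part of this PR.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the four fields of a version-00 W3C traceparent value.
// Hypothetical helper type, used only for illustration.
type traceParent struct {
	Version string
	TraceID string
	SpanID  string
	Sampled bool
}

// parseTraceParent splits "version-traceid-spanid-flags" and checks field
// lengths and hex encoding. It is a lenient sketch: it accepts upper-case hex
// and does not reject all-zero IDs, both of which the specification forbids.
func parseTraceParent(v string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(v), "-")
	if len(parts) != 4 {
		return traceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return traceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil {
			return traceParent{}, fmt.Errorf("traceparent: field %d is not hex: %w", i, err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return traceParent{
		Version: parts[0],
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01, // lowest bit is the "sampled" flag
	}, nil
}

func main() {
	// Example value in the format defined by the W3C Trace Context specification.
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("trace=%s span=%s sampled=%v\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

A downstream webhook receiver could use the trace ID from this header to look up the corresponding trace in whichever backend is collecting spans.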

The decoupled nature of how alerts are handled within Alertmanager means that there are a number of disjointed spans, but I've attempted to rectify some of that by using span links.

hairyhenderson avatar Jan 16 '24 22:01 hairyhenderson

It's unclear why CI is failing. I wonder if that test is flaky for unrelated reasons?

hairyhenderson avatar Jan 16 '24 22:01 hairyhenderson

I've been wondering about the support of this for months and it's awesome to see movement on this front. Much appreciated!

els0r avatar Jan 19 '24 14:01 els0r

@els0r you're welcome! Let me know if there are any particular things you want traced - this is really just a start.

hairyhenderson avatar Jan 19 '24 15:01 hairyhenderson

@grobinson-grafana

@hairyhenderson, can you show me how to test this next week so I can better understand how it all works?

Sorry for the delay... a laptop refresh caused me to lose my test configs 🤦‍♂️

I tested this by setting up a local Grafana, Tempo, and Agent (configured to receive OTLP).

Then I fired up the webhook echo service (go run ./examples/webhook/echo.go). I actually modified that code first to print headers too, so I could see the trace headers.

I then ran alertmanager like this:

$ ./alertmanager --web.listen-address=127.0.0.1:9093 --log.level=debug --config.file=am-trace-test.yml

Here's am-trace-test.yml:

global:
  smtp_smarthost: 'localhost:2525'
  smtp_from: '[email protected]'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - matchers:
        - severity="page"
      receiver: web.hook
    - matchers:
        - severity="email"
      receiver: smtp
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'smtp'
    email_configs:
      - to: '[email protected]'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

Then I ran prometheus like this:

$ ./prometheus --config.file=prom-alert-tracing-test.yml --web.listen-address=127.0.0.1:9990

Here's prom-alert-tracing-test.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 127.0.0.1:9093

rule_files:
  - prom-alert-tracing-rules.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9990"]

tracing:
  client_type: "grpc"
  endpoint: 'localhost:4317'
  sampling_fraction: 1.0
  insecure: true

And here's the referenced prom-alert-tracing-rules.yml:

groups:
  - name: alerts
    rules:
      - alert: AlwaysFiring
        expr: '1'
        labels:
          severity: page
      - alert: AlsoAlwaysFiring
        expr: '1'
        labels:
          severity: email

For the smtp receiver, I also set up an instance of smtprelay and a simple SMTP echo service. I've lost the configs for those, but I can whip something up if you need them.

Let me know if this makes sense, otherwise let's set up some time tomorrow or next week to pair on this 😉

hairyhenderson avatar Feb 08 '24 22:02 hairyhenderson

IIRC this

--- FAIL: TestResolved (5.38s)
    acceptance.go:182: failed to start alertmanager cluster: unable to get a successful response from the Alertmanager: Get "http://127.0.0.1:40783/api/v2/status": dial tcp 127.0.0.1:40783: connect: connection refused

is a flaky test; re-running should fix it.

TheMeier avatar Mar 10 '24 12:03 TheMeier