Feature request: make slack_configs.send_resolved a templatable field
What did you do?
I'm trying to set up a receiver that can take a dynamic value for send_resolved, such as:
## Route alerts based on slack_channel label
- name: global_slack_channel_label
  slack_configs:
    - api_url: <some-url>
      channel: '{{ printf "#%s" (index ((index .Alerts 0).Labels) "slack_channel") }}'
      send_resolved: {{ (index ((index .Alerts 0).Labels) "send_resolved") | quote | default true | replace "\"" "" }}
With this, we get line 894: cannot unmarshal !!map into bool, as Alertmanager requires this field to be a boolean rather than a tmpl_string. There are various approaches to how this could be done, so I'm not worried about this particular template working or not; the point is that the send_resolved field does not allow templating as-is.
What did you expect to see?
I would like to be able to render true/false boolean based on a label in an alert. This would reduce the number of receivers required.
I think, at least for me, it is hard to understand what the use case is for this feature. Making notification decisions based on individual alerts also doesn't fit well in the Alertmanager design, which is based around grouping related alerts together. It would also be very easy for someone to write an invalid template and accidentally break all their notifications without realizing it.
We have a centralised alertmanager with a configuration containing routes and receivers for a dozen or more teams, with hundreds of different alerts flowing to different channels.
For some alerts teams will want send_resolved notifications, for other alerts they won't. The only way to do this currently is to have two routes and two receivers: one route with a send_resolved: false matcher on it, which routes to the receiver with send_resolved set to false. Multiply this by the number of teams and the different permutations of send_resolved true/false, and you get a messy configuration.
It would massively simplify (halve!) our receivers configuration if we could inherit send_resolved (and other route-level attributes like group_by) from the alert labels.
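For concreteness, a rough sketch of the current two-route/two-receiver workaround for a single team (the team name, receiver names, and channel are placeholders, not our real configuration):

route:
  routes:
    - receiver: slack_team_a
      match_re:
        team: team-a
      routes:
        # child route catches alerts that opt out of resolved notifications
        - receiver: slack_team_a_no_resolved
          match:
            send_resolved: "false"
receivers:
  - name: slack_team_a
    slack_configs:
      - channel: '#team-a-alerts'
        send_resolved: true
  - name: slack_team_a_no_resolved
    slack_configs:
      - channel: '#team-a-alerts'
        send_resolved: false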
Isn't this just "moving" the messiness from Alertmanager to Prometheus? Instead of a receiver per team, all alert rules must have a custom label send_resolved?
It would be an optional label with send_resolved: true default in the receiver, with alerts optionally able to turn off resolved notifications, yes.
That complexity per alert makes way more sense to me than it does at a receiver level; a lot of alerts just generate noise with resolved notifications, but some are genuinely useful.
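To illustrate the proposal (this is purely hypothetical: the templated send_resolved below is exactly what Alertmanager rejects today, and the alert name and channel are made up):

# Prometheus alert rule: opt out of resolved notifications via a label
- alert: NoisyLogAlert
  expr: vector(1)  # placeholder expression
  labels:
    send_resolved: "false"

# Alertmanager receiver: would default to true when the label is absent (not supported today)
- name: global_slack_channel_label
  slack_configs:
    - channel: '#alerts'
      send_resolved: {{ (index ((index .Alerts 0).Labels) "send_resolved") | default "true" }}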
A practical example of how we use dynamic configuration, taking a generic Kubernetes alert:
- alert: CPUThrottlingHigh
  annotations:
    description: "{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}."
    summary: Processes experience elevated CPU throttling.
    runbook: <link to our runbook>
  expr: |
    sum by (container, pod, namespace, team) (increase(container_cpu_cfs_throttled_periods_total[5m]))
    /
    sum by (container, pod, namespace, team) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100)
  for: 15m
  labels:
    group_by: pod
We want this CPUThrottlingHigh alert routed to the correct Slack channel based on the pod's team label, but we want this particular alert grouped by pod, so we get individual alerts per pod instead of grouping them and risking missed notifications (or notifications that arrive too slowly because of group_interval). We set a group_by: pod label on the alert, and then we have to have a route per team like:
- receiver: slack_devops
  group_by: [alertname, pod]
  match_re:
    team: devops|platform-infrastructure|cloud-infrastructure
    group_by: pod
Being able to set the group_by attribute from the alert labels would reduce the number of routes we require, and it's the same idea with making send_resolved templatable. It makes perfect sense to me for each alert to be able to specify what it should be grouped by, its repeat_interval, group_wait, send_resolved, and all those other route/receiver-level attributes.
It would be an optional label with send_resolved: true default in the receiver, with alerts optionally able to turn off resolved notifications, yes.
I still think there are a lot of disadvantages here, and not a lot of advantages. For example:
- Having support for templates in send_resolved makes diagnosing issues like "why didn't I get a resolved notification?" much more difficult to solve, because it's no longer static, and now depends on the alerts at the time and their labels.
- There is an issue here where, as far as I can tell, alerts in Alertmanager are unsorted. That would mean if a group has two alerts where one has send_resolved: true and the other has send_resolved: false, then a template that observes alerts using indexes would be non-deterministic:
  send_resolved: {{ (index ((index .Alerts 0).Labels) "send_resolved") | quote | default true | replace "\"" "" }}
  This would also cause problems in high availability mode where one Alertmanager is sending resolved notifications and the others are not.
- In your example, one team could accidentally disable resolved notifications for all other teams just by having a send_resolved: false label in one of their alerts:
  - receiver: slack_devops
    group_by: [alertname, pod]
    match_re:
      team: devops|platform-infrastructure|cloud-infrastructure
      group_by: pod
Are you sure this couldn't be solved a better way, for example, by provisioning your Alertmanager configuration file using automation or CI?
Are you sure this couldn't be solved a better way, for example, by provisioning your Alertmanager configuration file using automation or CI?
I'm open to suggestions. What would you recommend for controlling which alerts are resolved and which aren't?
If I understand the problem, you are concerned with having a large number of routes and receivers for the teams in your organization, and find it difficult to manage them all due to their total size.
My suggestion would be to provision your Alertmanager configuration automatically, so you don't need to edit it by hand. For example, each team could specify their own route and receiver configurations, and then you can merge them into a single Alertmanager configuration file using a script or automated job. You can avoid conflicts by suffixing the team name to each receiver, like slack_devops and slack_platform_infrastructure, and then using these in the routes.
If you just want individual teams to choose whether to use send_resolved, without configuring their own routes and receivers, then just give them that option, and for each team automatically create a receiver and route in the configuration file with their send_resolved preference.
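As a rough sketch of what such a generated configuration could look like (the channels and per-team preferences here are invented for illustration):

route:
  routes:
    # generated from the devops team's preferences
    - receiver: slack_devops
      match:
        team: devops
    # generated from the platform-infrastructure team's preferences
    - receiver: slack_platform_infrastructure
      match:
        team: platform-infrastructure
receivers:
  - name: slack_devops
    slack_configs:
      - channel: '#devops-alerts'
        send_resolved: true
  - name: slack_platform_infrastructure
    slack_configs:
      - channel: '#platform-alerts'
        send_resolved: false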
It isn't as black and white as teams will or won't want resolved notifications, though. It is still very much context specific to each alert as to whether a resolved notification is required.
Even with automation generating routes and receivers, it doesn't change the initial problem.
each alert as to whether a resolved notification is required
I can see you are thinking about it in terms of individual alerts, but Alertmanager is very much a tool for grouping alerts. Can you rephrase your problem in terms of groups? How should a send_resolved decision be made for a group of alerts in your organization?
I have the same use case as dwilliams. Based on the nature of the alert, some of my alerts shouldn't have send_resolved set, and those alerts are probably part of a group of alerts that do need send_resolved set. It's not like "all alerts from this service, or to this team, don't need send_resolved"; it's more like "log alerts for any service written in this way don't need send_resolved". In fact, I feel like this goes against the design goal of thinking in terms of groups of alerts, because I'm forced to ungroup along business-logic lines and regroup along technical-implementation lines.
The feature request (as described) doesn't support enabling/disabling send_resolved for specific alerts. In fact, it does the opposite. It supports enabling/disabling send_resolved for the group based on the labels of a specific alert.
Plus one to this issue from me as well. I have the very same problem as dwilliams (my config is almost identical to his, and so is the use case: one Alertmanager for many teams, don't want to double the routes; furthermore, some alerts do need the resolved-state notification while others don't). It would be nice to enable a dynamic send_resolved field from a template.
I still don't think the feature request makes sense as explained, for the reasons mentioned (i.e. having individual alerts change the send_resolved behavior for the rest of the group would cause a lot of surprising edge cases where send_resolved does/doesn't work because an alert is missing or has missing labels).
In such a case it could fall back to the specific receiver's send_resolved default value.
I fear that will be a real nightmare to debug, so much so that people will create GitHub issues about send_resolved not working, putting an additional burden on maintainers. For me this is a 👎 I'm afraid.
+1 for this feature request, but I would need it in the context of pagerduty_config.