alertmanager

Muting interacts with group_interval in unexpected (?) fashion

Open beorn7 opened this issue 4 years ago • 1 comments

If all alerts in a notification group are muted (silenced or inhibited) at the time the group would notify for the first time, no notification is sent. So far so good. However, if an alert in that group becomes unmuted, the group won't notify immediately. The muted notification counts as "this group has already notified", so the next notification will only be sent once group_interval has passed. This is particularly problematic with a very long group_interval, let's say one day. In extreme cases, the group might never notify: if another condition mutes the alert again at the moment the group_interval runs out, the next notification is suppressed as well. This is particularly easy to hit with the new time-based muting feature.

More discussion here.

I can totally imagine that the current semantics has its reasons, and that changing it is not only difficult to implement but might create some other unexpected side effects. I still want to file this issue so that we are aware of this behavior. Should the conclusion be that it cannot be changed, we can declare it a feature, document it properly, and close this as "works as intended". Hopefully, though, we can find a better way.

Note that the same thought process needed to resolve this issue might also help with the long-standing problem of when to send a resolve notification in case an alert is muted.

beorn7 avatar Mar 01 '21 17:03 beorn7

I can confirm that this is still present in v0.29. Here's a little more context.

The problem is that Alertmanager resets the timer before flushing the alerts (here), even if all of them will be muted by the MuteStage in the notification pipeline (here).

The sequence from the discussion above still reproduces the problem:

  • A new aggregation group is created with alerts that are all silenced
  • Timer fires for the first attempt
  • The timer is reset and flush is called, executing the notification pipeline
  • All alerts in the pipeline are filtered out by MuteStage
  • No notification is sent, but the timer is already counting down from group_interval
  • Later, an alert in the group becomes unmuted (silence expired, was removed, etc.)
  • The alert is added to the group, but there's no mechanism to trigger an immediate notification. The group must wait until group_interval passes
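
The sequence above can be sketched as a small simulation. This is a toy model, not Alertmanager's code: the `Group` class, its `tick` method, the alert name `HighLatency`, and the chosen timings (30 s group_wait, one day group_interval) are all assumptions made for illustration. The only behavior taken from the steps above is that the timer is reset before the mute filter runs.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Toy aggregation group (hypothetical; not Alertmanager's real types)."""
    group_wait: float
    group_interval: float
    next_flush: float = 0.0
    sent: list = field(default_factory=list)  # (time, alert names) per notification

    def __post_init__(self):
        # The group is created at t=0; the first flush is due after group_wait.
        self.next_flush = self.group_wait

    def tick(self, now, alerts, muted):
        """Check the timer at time `now` and flush if it has fired."""
        if now < self.next_flush:
            return
        # The timer is reset BEFORE the pipeline runs, as described above.
        self.next_flush = now + self.group_interval
        # Stand-in for MuteStage: drop silenced or inhibited alerts.
        remaining = [a for a in alerts if a not in muted]
        if remaining:
            self.sent.append((now, remaining))


def simulate_current():
    day = 24 * 3600
    g = Group(group_wait=30, group_interval=day)
    g.tick(30, ["HighLatency"], muted={"HighLatency"})  # first flush, all muted
    g.tick(3600, ["HighLatency"], muted=set())          # silence expired after 1h
    g.tick(30 + day, ["HighLatency"], muted=set())      # group_interval elapsed
    return g.sent


print(simulate_current())
```

In this model the only notification goes out at t = 86430 s, a full group_interval after the muted first flush, even though the alert has been unmuted since t = 3600 s.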

The docs for group_interval say:

When the first notification was sent, wait 'group_interval' to send a batch of new alerts that started firing for that group.

So the expected behavior is to send a notification immediately for the first unmuted alert if no notification has been sent.
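
One way to picture that expectation, as a sketch only and not a proposed patch: start counting group_interval only after a notification has actually been sent. The `ExpectedGroup` class, the `has_notified` flag, and the choice to re-check every group_wait while nothing has been sent are all assumptions of this toy model, not Alertmanager behavior or API.

```python
class ExpectedGroup:
    """Toy model of the expected semantics (hypothetical, for illustration)."""

    def __init__(self, group_wait, group_interval):
        self.group_wait = group_wait
        self.group_interval = group_interval
        self.next_flush = group_wait  # group created at t=0
        self.has_notified = False
        self.sent = []  # (time, alert names) per notification

    def tick(self, now, alerts, muted):
        if now < self.next_flush:
            return
        # Stand-in for the mute filter: drop silenced or inhibited alerts.
        remaining = [a for a in alerts if a not in muted]
        if remaining:
            self.sent.append((now, remaining))
            self.has_notified = True
        # Assumption: group_interval applies only once a first notification
        # went out; until then the group is re-checked after group_wait.
        delay = self.group_interval if self.has_notified else self.group_wait
        self.next_flush = now + delay


def simulate_expected():
    day = 24 * 3600
    g = ExpectedGroup(group_wait=30, group_interval=day)
    g.tick(30, ["HighLatency"], muted={"HighLatency"})  # first flush, all muted
    g.tick(3600, ["HighLatency"], muted=set())          # silence expired after 1h
    g.tick(3660, ["HighLatency"], muted=set())          # within group_interval
    return g.sent


print(simulate_expected())
```

Here the notification is sent at t = 3600 s, at the first check after the silence expires, and group_interval only starts counting from that point.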

waltherlee avatar Nov 30 '25 05:11 waltherlee