Muting interacts with group_interval in unexpected (?) fashion
If all alerts in a notification group are muted (silenced or inhibited) at the time the group would notify for the first time, no notification is sent. So far so good. However, if an alert in that group becomes unmuted, the group won't notify immediately. The muted notification counts as "this group has already notified", so the next notification will only be sent once group_interval has passed. This is particularly problematic with a very long group_interval, let's say one day. In extreme cases, the group might never notify at all (if, at the time the group_interval runs out, another condition mutes the alert again, which is particularly easy with the new time-based muting feature).
More discussion here.
I can totally imagine that the current semantics has its reasons, and that changing it is not only difficult to implement but might create some other unexpected side effects. I still want to file this issue so that we are aware of this behavior. Should the conclusion be that it cannot be changed, we can declare it a feature, document it properly, and close this as "works as intended". Hopefully, though, we can find a better way.
Note that the same thought process needed to resolve this issue might also help with the long-standing problem of when to send a resolve notification when an alert is muted.
I can confirm that this is still in v0.29. Here's a little more context.
The problem is that the alertmanager resets the timer before flushing the alerts (here), even if they all will be muted by the MuteStage in the notification pipeline (here).
The sequence from the discussion above still reproduces the problem:
- A new aggregation group is created with alerts that are all silenced
- Timer fires for the first attempt
- The timer is reset and flush is called, executing the notification pipeline
- All alerts in the pipeline are filtered out by MuteStage
- No notification is sent, but the timer is already counting down from group_interval
- Later, an alert in the group becomes unmuted (silence expired, was removed, etc)
- The alert is added to the group, but there's no mechanism to trigger an immediate notification. The group must wait until group_interval passes
The docs for group_interval say:
When the first notification was sent, wait 'group_interval' to send a batch of new alerts that started firing for that group.
So the expected behavior is to send a notification immediately for the first unmuted alert if no notification has been sent.
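To make the timing concrete, here is a toy model of the sequence above. It is not Alertmanager's code: the function `first_notification_time` and its `notify_on_unmute` flag are hypothetical names invented for this sketch, and the numbers (30 s group_wait, one day group_interval, silence expiring after 10 minutes) are illustrative assumptions. It only models a group whose alerts stay unmuted once the silence ends.

```python
def first_notification_time(group_wait, group_interval, unmuted_at,
                            notify_on_unmute=False):
    """Return the time (in seconds) at which a toy group first notifies.

    All alerts in the group are muted until `unmuted_at` and unmuted after.
    `notify_on_unmute` is a hypothetical switch for the expected behavior;
    it does not correspond to any real Alertmanager option.
    """
    now = group_wait  # the timer fires for the first attempt
    while True:
        muted = now < unmuted_at
        if not muted:
            return now  # the pipeline lets the alerts through
        # Reported behavior: the timer is re-armed to group_interval
        # even though the muted flush sent nothing.
        next_fire = now + group_interval
        if notify_on_unmute and unmuted_at < next_fire:
            # Expected behavior: nothing has been sent yet, so the first
            # unmuted alert triggers a notification right away.
            return unmuted_at
        now = next_fire


DAY = 24 * 60 * 60

# Silence expires 10 minutes in; group_interval is one day.
reported = first_notification_time(30, DAY, 600)
expected = first_notification_time(30, DAY, 600, notify_on_unmute=True)
print(reported, expected)  # 86430 600
```

In this model the group waits until 86430 s (group_wait plus one full group_interval) instead of notifying at 600 s, when the silence expires, which is a delay of almost a day.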