
Alertmanager loses in-memory alerts when it restarts

Open lavkesh opened this issue 5 years ago • 4 comments

What did you do? Alertmanager keeps the fingerprints of current alerts, as sent by clients, in memory. It stores nf-logs and silences on disk (if configured). When Alertmanager starts, it reads the nf-logs and silences from disk, but the in-memory alert store is recreated from scratch. If a client stopped sending alerts while Alertmanager was down, AM will not be able to send resolve notifications to receivers.

What did you expect to see? AM should be able to send resolve notification after it restarts.

Can we also store the in-memory alerts on disk via a snapshot (similar to nf-logs and silences) and/or during graceful shutdown, and recover them from disk when AM restarts?
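To make the request concrete, here is a minimal sketch of what snapshot and restore could look like. It is not Alertmanager code: the `Alert` fields, the JSON file format, and the function names `save_snapshot` and `load_snapshot` are all assumptions made for illustration.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class Alert:
    # Hypothetical, simplified alert record (not Alertmanager's real type).
    fingerprint: str
    labels: dict
    starts_at: float  # Unix seconds
    ends_at: float    # Unix seconds; the alert counts as resolved after this


def save_snapshot(alerts, path):
    """Write alerts to `path` via a temp file plus rename, so a crash
    mid-write does not leave a truncated snapshot behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump([asdict(a) for a in alerts], f)
    os.replace(tmp_path, path)


def load_snapshot(path):
    """Return the alerts stored at `path`, or an empty list if there is
    no snapshot (for example on first start)."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [Alert(**item) for item in json.load(f)]
```

In this sketch `save_snapshot` would be called periodically and/or from the graceful shutdown path, and `load_snapshot` once at startup before the first notifications are evaluated.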

lavkesh avatar Sep 10 '20 06:09 lavkesh

Something similar has already been discussed a while ago (see #2042). The maintainers concluded that it isn't something we're willing to support for now. To increase Alertmanager resiliency, you should first look at running several Alertmanager instances.

simonpasquier avatar Sep 28 '20 13:09 simonpasquier

Something similar has already been discussed a while ago (see #2042). The maintainers concluded that it isn't something we're willing to support for now.

@simonpasquier Hi, my understanding is that the discussion in #2042 is about persisting alerts at all times, while the main topic here is persisting alerts at graceful shutdown. The latter is much simpler, as the OP mentioned:

Can we also store the in-memory alerts on disk via a snapshot (similar to nf-logs and silences) and/or during graceful shutdown, and recover them from disk when AM restarts?

And about:

To increase Alertmanager resiliency, you should first look at running several Alertmanager instances.

This solution cannot cope with all scenarios, such as rolling updates. You will eventually lose all alerts no matter how many instances you have. Rolling updates are a common scenario and requirement, aren't they? So I support the OP, and if I can, I'd be happy to open a PR to achieve this.

glidea avatar Oct 02 '22 13:10 glidea

IMHO we need more details about how it would work in practice and look at edge cases. For instance, when reloading alerts from disk, should Alertmanager consider all past alerts or discard alerts older than a certain period? I'm sure that there are other scenarios needing consideration. A document describing the issue + possible solution would be helpful.
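As a starting point for that discussion, here is one possible reload policy, sketched under assumptions: alerts still within their `ends_at` are kept as firing, alerts that ended recently are kept so a resolve notification can still go out, and anything older is dropped. The `SnapshotAlert` type, the `classify_on_reload` function, and the `max_resolved_age` cutoff are hypothetical and not part of Alertmanager.

```python
from dataclasses import dataclass


@dataclass
class SnapshotAlert:
    # Hypothetical minimal record read back from a snapshot.
    fingerprint: str
    ends_at: float  # Unix seconds


def classify_on_reload(alerts, now, max_resolved_age):
    """Split reloaded alerts into (firing, resolved, dropped).

    firing:   ends_at is still in the future.
    resolved: ended at most `max_resolved_age` seconds ago, so a resolve
              notification is arguably still useful.
    dropped:  ended longer ago than that and is discarded.
    """
    firing, resolved, dropped = [], [], []
    for alert in alerts:
        if alert.ends_at > now:
            firing.append(alert)
        elif now - alert.ends_at <= max_resolved_age:
            resolved.append(alert)
        else:
            dropped.append(alert)
    return firing, resolved, dropped
```

Whether the cutoff should be fixed, configurable, or tied to an existing setting is exactly the kind of edge case a design document would need to settle.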

simonpasquier avatar Jan 06 '23 14:01 simonpasquier

The main issue with rolling releases of Alertmanager arises when the rollout completes within Prometheus's ResendDelay. In that case all the Alertmanagers are restarted before Prometheus has re-sent its alerts, so none of them have alerts. To avoid this situation you should wait ResendDelay seconds between rolling out pods in a deployment, as this will ensure at least one Alertmanager has alerts at all times. However, I don't think this is possible in Kubernetes.

A more nuanced issue is that rolling releases reset both the Group wait and Group interval timers. That means if you have an alert that is at the fourth minute of a five-minute Group wait when a rollout starts, the alert will need to wait another five minutes before a notification is sent.
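The arithmetic for that example, as a small sketch. It assumes the alert is received again immediately after the restart, which is a simplification; the function name is made up for illustration.

```python
def delay_until_first_notification(group_wait, elapsed_before_restart):
    """Seconds from the alert first arriving until its first notification,
    when a restart resets the Group wait timer.

    Assumes the alert is re-received immediately after the restart.
    """
    if elapsed_before_restart >= group_wait:
        # The notification already went out before the restart.
        return group_wait
    # Time already waited is lost, and a full Group wait starts again.
    return elapsed_before_restart + group_wait


# Restart at minute four of a five-minute Group wait.
total = delay_until_first_notification(group_wait=5 * 60, elapsed_before_restart=4 * 60)
print(total / 60)  # total delay in minutes
```

So instead of five minutes, the first notification for that alert arrives nine minutes after the alert started.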

This means the recommendation above needs to be restricted further: at a minimum, you should wait the longer of Group wait and Group interval between rolling out pods.
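Putting both recommendations together, the pause between pods can be computed as below. This is only a sketch: the function name is hypothetical, and it assumes you take the largest Group wait and Group interval across all routes, since these can differ per route.

```python
def min_pause_between_pods(resend_delay, routes):
    """Smallest pause (seconds) between restarting two Alertmanager pods
    that satisfies both recommendations.

    resend_delay: Prometheus ResendDelay in seconds.
    routes: iterable of (group_wait, group_interval) pairs in seconds,
            one per route in the Alertmanager configuration.
    """
    longest_timer = max(
        (max(group_wait, group_interval) for group_wait, group_interval in routes),
        default=0,
    )
    return max(resend_delay, longest_timer)


# Illustrative values only: a 1m ResendDelay and two routes.
pause = min_pause_between_pods(60, [(30, 300), (10, 120)])
print(pause)  # pause in seconds
```

With these example values the Group interval of the first route dominates, so the pause is five minutes per pod rather than the one minute that ResendDelay alone would suggest.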

grobinson-grafana avatar Nov 19 '23 14:11 grobinson-grafana