[Feature Request] Provide capability for `DELETE /silences/<id>` to really delete silences
Problem
Currently, calling the delete silence API expires the silence rather than deleting it. An expired silence is only actually deleted 5 days later by default. While this behavior is useful for use cases like auditing, it can become troublesome for users and system admins when there are too many expired silences. When users list their silences, chances are high that they don't care about ones that expired more than 1 or 2 days ago. System admins have to constantly worry about the performance impact of a large number of expired silences held in memory. That impact is real, and system admins currently have no mechanism to mitigate it.
Proposal
The DELETE /silences/<id> API accepts a boolean parameter such as soft: when soft=true the API expires the silence, and when soft=false the API hard deletes the silence.
This way a user can reduce noise by hard deleting silences, and a system admin can mitigate the issue by asking their customers to hard delete silences (or by hard deleting the silences themselves, depending on who they work for).
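To make the proposed semantics concrete, here is a minimal sketch of the two delete modes. It uses a hypothetical in-memory `SilenceStore`; the class, its methods, and the `soft` keyword are illustrative assumptions and not Alertmanager's actual code or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Silence:
    id: str
    ends_at: datetime


class SilenceStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self):
        self._silences = {}

    def add(self, silence):
        self._silences[silence.id] = silence

    def get(self, silence_id):
        return self._silences.get(silence_id)

    def delete(self, silence_id, soft=True, now=None):
        now = now or datetime.now(timezone.utc)
        silence = self._silences[silence_id]  # raises KeyError if unknown
        if soft:
            # Current behavior: expire the silence but keep the record.
            silence.ends_at = min(silence.ends_at, now)
        else:
            # Proposed behavior: drop the record entirely.
            del self._silences[silence_id]
```

With `soft=True` the record stays and merely ends early; with `soft=False` it disappears from the store.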
Hi! :wave: I think the reason the Delete API expires silences, instead of deleting them, is because Alertmanager needs to create tombstones to achieve eventual consistency of its replicated state. Without tombstones, deleted silences could "come back" when running a number of Alertmanagers in HA.
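The "come back" problem can be illustrated with a toy model. The sketch below assumes a simplified newest-update-wins merge between two replicas; it is not Alertmanager's real replication code, and the `merge` function and the dictionary layout are made up for the example.

```python
def merge(local, remote):
    """Merge two replica states; for each silence ID the newest update wins."""
    merged = dict(local)
    for sid, entry in remote.items():
        if sid not in merged or entry["updated_at"] > merged[sid]["updated_at"]:
            merged[sid] = entry
    return merged


# Replica B still holds the silence as active.
replica_b = {"s1": {"state": "active", "updated_at": 1}}

# Case 1: replica A hard deleted the silence, so it has no record at all.
replica_a_hard = {}
after_hard = merge(replica_a_hard, replica_b)
# "s1" is back on replica A, still active: the delete was lost.

# Case 2: replica A kept a tombstone (an expired record with a newer timestamp).
replica_a_tombstone = {"s1": {"state": "expired", "updated_at": 2}}
after_tombstone = merge(replica_a_tombstone, replica_b)
# The tombstone is newer, so "s1" stays expired.
```

In the first case replica A cannot tell "never seen" apart from "deleted", so it accepts the stale record from replica B. The tombstone gives the merge something newer to compare against.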
As an interim fix, you can decrease the data retention period from 120h. This will mean expired silences are deleted sooner. However, be careful not to lower the retention period too much as the retention period also deletes stale notification logs. I would not set the retention period lower than your largest repeat interval.
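As a rough sanity check for that rule of thumb, the sketch below compares a candidate retention period against the repeat intervals in use. The helper names are hypothetical, and the intervals are assumed to be collected by hand from the `repeat_interval` settings in your routing configuration; the retention itself is set with Alertmanager's `--data.retention` flag.

```python
from datetime import timedelta


def min_safe_retention(repeat_intervals):
    """Smallest retention that is not lower than any repeat interval."""
    return max(repeat_intervals)


def is_retention_safe(retention, repeat_intervals):
    """True if the retention is at least as long as the largest repeat interval."""
    return retention >= min_safe_retention(repeat_intervals)


# Example values, assumed for illustration.
intervals = [timedelta(hours=4), timedelta(hours=24), timedelta(hours=12)]
candidate = timedelta(hours=48)
safe = is_retention_safe(candidate, intervals)
```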
As a longer term fix, perhaps we could look into two improvements:
- Adding a configuration option that sets the maximum age of expired silences returned by the API.
- Storing expired silences on disk rather than in memory, if you are experiencing very high memory usage due to having a lot of expired silences.
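The first improvement could look roughly like the filter below. This is only a sketch: the `max_expired_age` setting and the `visible_silences` function are hypothetical names, and silences are modeled as plain dictionaries with a timezone-aware `ends_at` field.

```python
from datetime import datetime, timedelta, timezone


def visible_silences(silences, max_expired_age, now):
    """Return silences that are still active or expired within max_expired_age."""
    cutoff = now - max_expired_age
    # Active silences end in the future, so they always pass the cutoff.
    return [s for s in silences if s["ends_at"] > cutoff]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
silences = [
    {"id": "active", "ends_at": now + timedelta(hours=1)},
    {"id": "expired-recently", "ends_at": now - timedelta(hours=12)},
    {"id": "expired-long-ago", "ends_at": now - timedelta(days=4)},
]
recent = visible_silences(silences, max_expired_age=timedelta(days=2), now=now)
```

The older records stay in the underlying state, so replication is unaffected; only the API response shrinks.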
👋 @grobinson-grafana
Thank you so much for your response and for the background on why silences are expired instead of deleted. The 2 improvements you proposed make sense, especially the 2nd option.
For option 1 though, in addition to a configuration option, it would also be useful to have some kind of API that lets a user mark a silence as invisible, because users would then have more control over what they do and don't need. Invisible just means the silence is expired and should not be returned to customers through API calls; the sole purpose of keeping invisible silences is eventual consistency. Invisible silences are also always good candidates to be stored on disk instead of in memory.
So I would still say that having DELETE /silences/<id> support a parameter like soft=false, which marks the silence as invisible, would be useful :)
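A minimal sketch of this "invisible" variant follows, under the assumption that each record carries an extra `invisible` flag. The `SilenceRecord` type and both functions are hypothetical; nothing here reflects an existing Alertmanager field or endpoint.

```python
from dataclasses import dataclass


@dataclass
class SilenceRecord:
    id: str
    expired: bool = False
    invisible: bool = False  # hypothetical flag, kept only for listing purposes


def delete_silence(records, silence_id, soft=True):
    """Expire the silence; with soft=False also hide it from listings."""
    record = records[silence_id]
    record.expired = True  # the record always stays, so it can act as a tombstone
    if not soft:
        record.invisible = True


def list_silences(records):
    """Return only the records that should be shown to API clients."""
    return [r for r in records.values() if not r.invisible]


records = {"a": SilenceRecord("a"), "b": SilenceRecord("b")}
delete_silence(records, "a", soft=True)   # expired, still listed
delete_silence(records, "b", soft=False)  # expired and hidden
listed = list_silences(records)
```

Both records remain in `records`, which is what would let them keep serving eventual consistency while the hidden one no longer clutters listings.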
Hi, I'd like to add a further justification for the proposal. When trying to view expired silences in the Alertmanager Web Interface, the browser can lock up for several seconds before it responds. Filtering the results becomes difficult because the filter is dynamic and causes a re-render on each character typed. Limiting the number of results returned (much like Prometheus does in its Web Interface) would be ideal.