alertmanager icon indicating copy to clipboard operation
alertmanager copied to clipboard

Fix for deadlock issue #3682

Open rajagopalanand opened this issue 2 years ago • 4 comments

What does this PR do?

This PR is a fix for #3682. In some instances, mem.Alerts.Subscribe() and store.gc() can get deadlocked

  1. Lock acquisition in store.Alerts.gc():
    • The method store.Alerts.gc() acquires a lock on its internal mutex store.Alerts.mtx
    • While holding the lock, it then calls the callback function which will try to acquire a lock on mem.Alerts.mtx
  2. Concurrent Execution:
    • mem.Alerts.Subscribe() acquires a lock on its internal mutex mem.Alerts.mtx and calls store.Alerts.List()
  3. Deadlock situation:
    • Callback function tries to acquire a lock on mem.Alerts.mtx. However this lock is already being held by mem.Subscribe()
    • Similarly mem.Subscribe() cannot proceed because store.List() cannot acquire lock (store.Alerts.mtx) because it is being held by store.gc()

Another way of summarizing this is store.Alerts.gc() was holding the lock until callback function completed which in turn was waiting to acquire the lock. Callback function could not acquire the lock because Subscribe() was holding the lock. Subscribe() cannot progress because it calls store.Alerts.List() which was waiting for lock acquisition which was being held by store.Alerts.gc(). This fix releases the lock held by store.Alerts.gc() prior to calling the callback function

AM_Deadlock

rajagopalanand avatar Feb 08 '24 19:02 rajagopalanand

@rajagopalanand thank you very much for your contribution - can you please fix the linter?

gotjosh avatar Feb 13 '24 10:02 gotjosh

This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

gotjosh avatar Feb 13 '24 10:02 gotjosh

This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

Was not aware of this but I will take a look

rajagopalanand avatar Feb 14 '24 03:02 rajagopalanand

This is a duplicate of https://github.com/prometheus/alertmanager/pull/3648/files, right?

Was not aware of this but I will take a look

The other PR addresses issues more than just the deadlock issue mentioned in #3682. If this can be reviewed and merged, then when the other PR gets reviewed and merged it can address race conditions and can retain a test from this PR

rajagopalanand avatar Feb 14 '24 21:02 rajagopalanand