Fix false critical on OMD backup job when agent runs at the time the backup is about to start
Prevent this false critical alert:
| Host | |
|---|---|
| Service | OMD |
| Event | OK → CRITICAL |
| Time | Mon Oct 23 01:30:05 EDT 2023 |
| Summary | Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 (CRIT) |
| Details | Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 (CRIT) |
| Host Metrics | rta=0.010ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.038ms;;;; rtmin=0.002ms;;;; |
| Service Metrics | backup_duration=123.582501;;;; backup_avgspeed=865828.190744;;;; backup_size=446827456;;;; |
Basically, this happens when the backup is about to start (here at 01:30:00) but hasn't started yet by the time the agent checks (also around 01:30:00 in this case, while the alert itself was generated at 01:30:05). According to the logs, the backup actually started at 01:30:03; it's normal for a cron job to start with a very small delay. Add to that the gap between the check and the alert time reported by Checkmk, and we get a false critical. The 30-second buffer will prevent this corner case from ever happening again.
I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
Just curious, is there anything else to do? I'm aware a hard-coded 30 seconds isn't exactly elegant; we could instead require 2 consecutive failed checks (or similar) before the service turns critical.
Dear Checkmk Contributor! Unfortunately, we had to re-write our git-repo history, rendering your PR auto-closed. We will therefore rebase your PR onto the current master and reopen it again. Sorry for the inconvenience.
Dear Contributor. Unfortunately, we learned that re-opening a PR which was force-rebased, is not possible (see https://github.com/isaacs/github/issues/361). Therefore we kindly ask you to create a new PR for your change. We apologize for the circumstances and will implement technical measures to prevent such incidents in the future.
@TimotheusBachinger I apologize for writing here; I don't have a PR to share, just a comment.
The issue can still happen if the Checkmk check runs delayed by more than 30 seconds. For example, the scheduled backup is at 01:00:00, but the check due at 00:59 actually runs at 00:59:46. `next_run < time.time() + 30` will then be true, since the `next_run` time still comes from yesterday's backup.
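To make that timeline concrete, here is a small reconstruction of the example (illustrative timestamps, evaluating the exact condition quoted above):

```python
from datetime import datetime

# The check due at 00:59 actually runs at 00:59:46; the agent data still
# carries the next_run computed after yesterday's backup, i.e. today at
# 01:00:00.
now = datetime(2023, 10, 30, 0, 59, 46).timestamp()
next_run = datetime(2023, 10, 30, 1, 0, 0).timestamp()

# The quoted condition is true even though the scheduled start time
# (01:00:00) has not passed yet, so the false critical can still fire.
print(next_run < now + 30)  # -> True (01:00:00 < 01:00:16)
```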
A workaround is to set "Maximum number of check attempts for service" to 2 or higher.
Hi @tr3pan, I would actually need a new PR in order to get it into our processing queue. If you're interested, feel free to take the current state of this one.
@TimotheusBachinger Here it is https://github.com/Checkmk/checkmk/pull/841