Fix false critical on OMD backup job when agent runs at the time the backup is about to start

Open dnlldl opened this issue 2 years ago • 3 comments

Prevent this false critical alert:

Host
Service OMD backup
Event OK → CRITICAL
Time Mon Oct 23 01:30:05 EDT 2023
Summary Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Details Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Host Metrics rta=0.010ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.038ms;;;; rtmin=0.002ms;;;;
Service Metrics backup_duration=123.582501;;;; backup_avgspeed=865828.190744;;;; backup_size=446827456;;;;

Basically, this happens when the backup is about to start (here at 01:30:00) but hasn't started yet when the agent checks (also around 01:30:00 in this case, although the alert was generated at 01:30:05). According to the logs, the backup actually started at 01:30:03. It's normal for a cron job to start with a very small delay; add to that the discrepancy between the check and the alert time reported by Checkmk, and we get a false critical in this case. The 30-second buffer will prevent this corner case from ever happening again.
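
The race described above can be reproduced with the timestamps from the alert. This is only a minimal sketch, not the plugin code or the diff of this PR: `backup_state` and its `grace` parameter are hypothetical names, the check time of 01:30:01 is an assumption (after the scheduled time, before the logged start at 01:30:03), and the sketch assumes the check turns critical as soon as the stored next run lies in the past.

```python
from datetime import datetime, timedelta


def backup_state(next_run: datetime, now: datetime,
                 grace: timedelta = timedelta(0)) -> str:
    """Return "CRIT" once next_run is overdue by more than grace, else "OK".

    Hypothetical helper for illustration; not taken from Checkmk.
    """
    return "CRIT" if next_run + grace < now else "OK"


next_run = datetime(2023, 10, 23, 1, 30, 0)    # "Next run" from the alert
check_time = datetime(2023, 10, 23, 1, 30, 1)  # assumed agent check time

# Without a buffer the stale "Next run" is already overdue, even though
# cron starts the backup only three seconds late.
print(backup_state(next_run, check_time))  # CRIT

# With a 30-second buffer the same evaluation stays OK.
print(backup_state(next_run, check_time, grace=timedelta(seconds=30)))  # OK
```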

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

dnlldl avatar Oct 26 '23 00:10 dnlldl

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

github-actions[bot] avatar Oct 26 '23 00:10 github-actions[bot]

Just curious, is there anything else to do? I'm aware 30 seconds isn't exactly cute; it could instead take 2 checks or similar before it turns critical.

dnlldl avatar Mar 23 '24 18:03 dnlldl

Dear Checkmk Contributor! Unfortunately, we had to re-write our git-repo history, rendering your PR auto-closed. We will therefore rebase your PR onto the current master and reopen it again. Sorry for the inconvenience.

TimotheusBachinger avatar Jun 19 '24 08:06 TimotheusBachinger

Dear Contributor. Unfortunately, we learned that re-opening a PR that was force-rebased is not possible (see https://github.com/isaacs/github/issues/361). Therefore we kindly ask you to create a new PR for your change. We apologize for the circumstances and will implement technical measures to prevent such incidents in the future.

TimotheusBachinger avatar Jun 19 '24 13:06 TimotheusBachinger

@TimotheusBachinger I apologize for writing here. I don't have a PR to share, just a comment.

The issue can still happen if the Checkmk check is delayed by more than 30 seconds. For example, the scheduled backup is at 01:00:00, but the Checkmk check for 00:59 runs at 00:59:46. next_run < time.time() + 30 will then be true, since the next_run time is still from yesterday's backup.
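
To make the timing concrete, the sketch below evaluates the comparison exactly as it is quoted in this comment, using the example times above expressed as seconds since midnight. The helper name `is_critical` and its `buffer_s` parameter are placeholders chosen for illustration; the real plugin code is not shown in this thread.

```python
def is_critical(next_run: float, now: float, buffer_s: float = 30) -> bool:
    # Comparison as quoted in the comment: next_run < time.time() + 30
    return next_run < now + buffer_s


# Times of day as seconds since midnight, taken from the example.
next_run = 1 * 3600      # 01:00:00, scheduled backup
on_time = 59 * 60        # 00:59:00, check running on schedule
delayed = 59 * 60 + 46   # 00:59:46, check delayed by 46 seconds

print(is_critical(next_run, on_time))  # False
print(is_critical(next_run, delayed))  # True
```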

A workaround is to set Maximum number of check attempts for service to 2 or higher.
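
The effect of this workaround can be sketched as follows, assuming the usual Nagios-style behaviour in which a problem only becomes a hard state (and notifies) after that many consecutive non-OK results. `reaches_hard_problem` is a hypothetical helper, not a Checkmk API, and the result sequence is purely illustrative.

```python
def reaches_hard_problem(results: list[str], max_attempts: int) -> bool:
    """Return True if max_attempts consecutive non-OK results occur."""
    streak = 0
    for state in results:
        streak = 0 if state == "OK" else streak + 1
        if streak >= max_attempts:
            return True
    return False


# One spurious CRIT just before the backup starts, OK again on the next check.
results = ["OK", "CRIT", "OK"]

print(reaches_hard_problem(results, max_attempts=1))  # True
print(reaches_hard_problem(results, max_attempts=2))  # False
```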

tr3pan avatar Jul 31 '25 13:07 tr3pan

Hi @tr3pan, I would actually need a new PR in order to get it into our processing queue. If you're interested, feel free to take the current state of this one.

TimotheusBachinger avatar Jul 31 '25 13:07 TimotheusBachinger

@TimotheusBachinger Here it is https://github.com/Checkmk/checkmk/pull/841

tr3pan avatar Jul 31 '25 13:07 tr3pan