checkmk Fix false critical on OMD backup job when agent runs at the time the backup is about to start

Prevent this false critical alert:

Host
Service	OMD backup
Event	OK → CRITICAL
Time	Mon Oct 23 01:30:05 EDT 2023
Summary	Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Details	Backup completed, it was running for 2 minutes 4 seconds from 2023-10-16 01:30:03 till 2023-10-16 01:32:06, Size: 426 MiB, Next run: 2023-10-23 01:30:00 CRIT
Host Metrics	rta=0.010ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.038ms;;;; rtmin=0.002ms;;;;
Service Metrics	backup_duration=123.582501;;;; backup_avgspeed=865828.190744;;;; backup_size=446827456;;;;

This happens when the backup is about to start (here at 01:30:00) but has not yet started when the agent runs (also around 01:30:00 in this case; the alert was generated at 01:30:05). According to the logs, the backup actually started at 01:30:03. It is normal for a cron job to start a few seconds late; add the gap between the check and the alert time reported by Checkmk, and we get a false critical. The 30-second buffer will prevent this corner case from ever happening again.
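To illustrate the buffer idea, here is a minimal sketch. It is not the actual diff: the function name `next_run_state`, the `grace_seconds` parameter, and the exact comparison are assumptions made for illustration. It only assumes the check turns critical once the stored "Next run" timestamp lies in the past.

```python
from datetime import datetime, timezone

OK, CRIT = 0, 2  # monitoring state codes


def next_run_state(next_run: float, now: float, grace_seconds: float = 30.0) -> int:
    """Hypothetical helper: CRIT only if the scheduled run is overdue
    by more than grace_seconds, otherwise OK."""
    if now > next_run + grace_seconds:
        return CRIT
    return OK


# Timestamps from the alert above. UTC is used only to make the example
# deterministic; only the 5-second difference matters.
scheduled = datetime(2023, 10, 23, 1, 30, 0, tzinfo=timezone.utc).timestamp()
checked = datetime(2023, 10, 23, 1, 30, 5, tzinfo=timezone.utc).timestamp()

print(next_run_state(scheduled, checked, grace_seconds=0))  # no buffer: 2 (CRIT)
print(next_run_state(scheduled, checked))                   # 30 s buffer: 0 (OK)
```

With no buffer, the check that runs 5 seconds after the scheduled time sees an overdue backup; with the 30-second buffer, the same check stays OK as long as the backup's new state is written within that window.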

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

Oct 26 '23 00:10 dnlldl

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

Oct 26 '23 00:10 github-actions[bot]

Just curious, is there anything else to do? I'm aware 30 seconds isn't exactly cute; the check could instead take 2 failed checks or similar before it turns critical.

Mar 23 '24 18:03 dnlldl

Dear Checkmk Contributor! Unfortunately, we had to re-write our git-repo history, rendering your PR auto-closed. We will therefore rebase your PR onto the current master and reopen it again. Sorry for the inconvenience.

Jun 19 '24 08:06 TimotheusBachinger

Dear Contributor. Unfortunately, we learned that re-opening a PR which was force-rebased is not possible (see https://github.com/isaacs/github/issues/361). Therefore we kindly ask you to create a new PR for your change. We apologize for the circumstances and will implement technical measures to prevent such incidents in the future.

Jun 19 '24 13:06 TimotheusBachinger

@TimotheusBachinger I apologize for writing here. I don't have a PR to share, just a comment.

The issue can still happen if the Checkmk check is delayed by more than 30 seconds. For example, the scheduled backup is at 01:00:00, but the Checkmk check for 00:59 happens at 00:59:46. next_run < time.time() + 30 will be true since the next_run time is still from yesterday's backup.

A workaround is to set Maximum number of check attempts for service to 2 or higher.
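As a rough illustration of why more check attempts help, here is a simplified model, not Checkmk's actual soft/hard state logic. The function name `hard_states` and the `max_attempts` parameter are hypothetical. A non-OK result only becomes the notified state after `max_attempts` consecutive non-OK results.

```python
def hard_states(results: list[int], max_attempts: int = 2) -> list[int]:
    """Simplified model: return the notified state after each check result.
    0 is OK, anything else is a problem state."""
    hard = 0     # state that would trigger a notification
    streak = 0   # consecutive non-OK results seen so far
    out = []
    for state in results:
        if state == 0:
            streak = 0
            hard = 0
        else:
            streak += 1
            if streak >= max_attempts:
                hard = state
        out.append(hard)
    return out


# One spurious CRIT around the backup start never becomes a notified state.
print(hard_states([0, 2, 0, 0]))  # [0, 0, 0, 0]
# A problem that persists for two checks does.
print(hard_states([0, 2, 2, 0]))  # [0, 0, 2, 0]
```

The trade-off is that a genuine failure is also reported one check interval later.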

Jul 31 '25 13:07 tr3pan

Hi @tr3pan, I would actually need a new PR in order to get it into our processing queue. If you're interested, feel free to take the current state of this one here.

Jul 31 '25 13:07 TimotheusBachinger

@TimotheusBachinger Here it is https://github.com/Checkmk/checkmk/pull/841

Jul 31 '25 13:07 tr3pan