cancel postgres check within timeout
What does this PR do?
Motivation
Now that we can set the cancel timeout with the env var check_cancel_timeout, we should read the value from the agent config instead of hard-coding it.
This PR makes sure the postgres check is canceled within the timeout (0.4s). This fixes an intermittent cancellation timeout (the default check cancel timeout is 0.5s) seen in cluster runners, which prevents new checks from being rescheduled.
2023-12-12 14:51:10 UTC | CORE | ERROR | (pkg/collector/scheduler.go:106 in Unschedule) | Error stopping check postgres:2f8838d1de1b4182: an error occurred while calling check.Cancel(): timeout while calling check.Cancel() on check ID postgres:2f8838d1de1b4182
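For reviewers, here is a rough sketch of the idea rather than the code in this PR: the check derives its own cancel budget from the configured timeout, leaving some headroom, and joins its background threads against one shared deadline. The names `cancel_budget` and `cancel_within` and the 0.1s headroom are hypothetical. In the real check the configured value would come from the agent config rather than a function argument.

```python
import threading
import time

DEFAULT_CHECK_CANCEL_TIMEOUT = 0.5  # seconds; the agent default mentioned above


def cancel_budget(configured_timeout=None, headroom=0.1):
    """Return how long the check may spend in cancel(), in seconds.

    `configured_timeout` stands in for the value read from the agent config
    (hypothetical parameter); `headroom` is an assumed safety margin.
    """
    if configured_timeout is None:
        timeout = DEFAULT_CHECK_CANCEL_TIMEOUT
    else:
        timeout = float(configured_timeout)
    return max(timeout - headroom, 0.0)


def cancel_within(stop_event, workers, budget):
    """Signal workers to stop and wait at most `budget` seconds in total.

    Returns the workers that were still alive when the budget ran out.
    """
    deadline = time.monotonic() + budget
    stop_event.set()
    for worker in workers:
        # Each join only gets whatever is left of the shared budget.
        remaining = max(deadline - time.monotonic(), 0.0)
        worker.join(timeout=remaining)
    return [w for w in workers if w.is_alive()]
```

Using one deadline for all joins, instead of a fixed timeout per thread, keeps the total time spent in cancel under the budget no matter how many background threads the check runs.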
Additional Notes
This change does introduce a new concern: database connections could be left idle if they are not gracefully closed within the timeout.
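To make the trade-off concrete, here is a hedged sketch (not the PR's implementation) of closing connections against a deadline and returning the ones that were skipped or failed to close, so they can at least be logged. `close_connections` is a hypothetical helper, and it assumes each connection object exposes a `close()` method. Note that a single blocking `close()` can still overrun the deadline; the sketch only stops starting new closes once the deadline has passed.

```python
import time


def close_connections(connections, deadline):
    """Close connections until `deadline` (a time.monotonic() value) passes.

    Returns the connections that were not closed, so the caller can log them.
    """
    leftover = []
    for conn in connections:
        if time.monotonic() >= deadline:
            # Out of budget: leave this connection for the server to reap.
            leftover.append(conn)
            continue
        try:
            conn.close()
        except Exception:
            # A failed close also leaves the connection open on the server.
            leftover.append(conn)
    return leftover
```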
Review checklist (to be filled by reviewers)
- [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
- [ ] Changelog entries must be created for modifications to shipped code
- [ ] Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
Test Results
16 files, 16 suites, 19m 49s
266 tests: 264 passed, 2 skipped, 0 failed
2 136 runs: 2 032 passed, 104 skipped, 0 failed
Results for commit 4352a9af.
Codecov Report
Merging #16408 (4352a9a) into master (60ea0d0) will increase coverage by 0.10%. Report is 13 commits behind head on master. The diff coverage is 61.11%.
:exclamation: Current head 4352a9a differs from pull request most recent head 524dd6a. Consider uploading reports for the commit 524dd6a to get more accurate results
Additional details and impacted files
| Flag | Coverage Δ | |
|---|---|---|
| activemq | ? | |
| cassandra | ? | |
| confluent_platform | ? | |
| hive | ? | |
| hivemq | ? | |
| hudi | ? | |
| ignite | ? | |
| jboss_wildfly | ? | |
| kafka | ? | |
| postgres | 92.33% <61.11%> (+0.06%) | :arrow_up: |
| presto | ? | |
| solr | ? | |
| tomcat | ? | |
| weblogic | ? | |

Flags with carried forward coverage won't be shown.
postgres check is canceled within timeout (4s)
That's huge. Do we know where the time is being spent: in the db or in the Python code?
@nenadnoveljic we don't know exactly where the time is spent, mainly because the timeout happens so rarely with no clear reproducible pattern (we see it happen roughly once or twice a month).
When it happens, is it possible for us to check where the time is spent?