integrations-core

cancel postgres check within timeout

Open lu-zhengda opened this issue 2 years ago • 5 comments

What does this PR do?

Motivation

Now that we can set the cancel timeout with the env var check_cancel_timeout, we should read the value from the agent config instead of hard-coding it.

This PR makes sure the postgres check is canceled within the timeout (0.4s). This fixes the intermittent cancellation timeouts (the default check cancel timeout is 0.5s) seen in cluster runners, which prevent the check from being rescheduled.

2023-12-12 14:51:10 UTC | CORE | ERROR | (pkg/collector/scheduler.go:106 in Unschedule) | Error stopping check postgres:2f8838d1de1b4182: an error occurred while calling check.Cancel(): timeout while calling check.Cancel() on check ID postgres:2f8838d1de1b4182
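
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `cancel_budget`, `cancel_within`, the dict-style `agent_config`, and the assumption that `check_cancel_timeout` is expressed in seconds are all hypothetical. It derives a budget slightly below the agent's cancel timeout and waits for cleanup work only until that budget is spent.

```python
import threading
import time


def cancel_budget(agent_config, default_s=0.5, margin_s=0.1):
    """Return how long cancel() may block, in seconds.

    Assumption: agent_config is a plain dict and check_cancel_timeout is in
    seconds. The margin keeps the check below the agent's own timeout.
    """
    timeout_s = float(agent_config.get("check_cancel_timeout", default_s))
    return max(0.0, timeout_s - margin_s)


def cancel_within(close_fns, budget_s):
    """Run every cleanup function in its own daemon thread and wait at most
    budget_s seconds in total. Returns the names of cleanups still running."""
    deadline = time.monotonic() + budget_s
    threads = []
    for name, fn in close_fns.items():
        t = threading.Thread(target=fn, name=name, daemon=True)
        t.start()
        threads.append(t)

    unfinished = []
    for t in threads:
        # Each join only gets whatever is left of the shared budget.
        remaining = max(0.0, deadline - time.monotonic())
        t.join(remaining)
        if t.is_alive():
            unfinished.append(t.name)
    return unfinished
```

With the default of 0.5s and a 0.1s margin this yields the 0.4s budget mentioned above. Any names returned by `cancel_within` identify cleanup work that was still running when the budget ran out, which is worth logging.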

Additional Notes

This change does introduce a new concern: database connections could be left idle if they are not gracefully closed within the timeout.

Review checklist (to be filled by reviewers)

  • [ ] Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • [ ] Changelog entries must be created for modifications to shipped code
  • [ ] Add the qa/skip-qa label if the PR doesn't need to be tested during QA.

lu-zhengda, Dec 12 '23 18:12

Test Results

16 files, 16 suites, 19m 49s :stopwatch:
266 tests: 264 :heavy_check_mark:, 2 :zzz:, 0 :x:
2 136 runs: 2 032 :heavy_check_mark:, 104 :zzz:, 0 :x:

Results for commit 4352a9af.

:recycle: This comment has been updated with latest results.

github-actions[bot], Dec 12 '23 19:12

Codecov Report

Merging #16408 (4352a9a) into master (60ea0d0) will increase coverage by 0.10%. Report is 13 commits behind head on master. The diff coverage is 61.11%.

:exclamation: Current head 4352a9a differs from pull request most recent head 524dd6a. Consider uploading reports for the commit 524dd6a to get more accurate results

Additional details and impacted files
Flag Coverage Δ
activemq ?
cassandra ?
confluent_platform ?
hive ?
hivemq ?
hudi ?
ignite ?
jboss_wildfly ?
kafka ?
postgres 92.33% <61.11%> (+0.06%) :arrow_up:
presto ?
solr ?
tomcat ?
weblogic ?

Flags with carried forward coverage won't be shown.

codecov[bot], Dec 12 '23 19:12

> postgres check is canceled within timeout (4s)

That's huge. Do we know where the time is being spent: in the db or in the Python code?

nenadnoveljic, Dec 13 '23 10:12

>> postgres check is canceled within timeout (4s)
>
> That's huge. Do we know where the time is being spent: in the db or in the Python code?

@nenadnoveljic we don't know exactly where the time is spent, mainly because the timeout happens so rarely with no clear reproducible pattern (we see it happen roughly once or twice a month).

lu-zhengda, Dec 13 '23 12:12

> @nenadnoveljic we don't know exactly where the time is spent, mainly because the timeout happens so rarely with no clear reproducible pattern (we see it happen roughly once or twice a month).

When it happens, is it possible for us to check where the time is spent?
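
One possible way to capture this, sketched under assumptions: wrap each step of the cancel path in a timer that records its duration and logs a warning when a step is slow. The helper `timed_step`, the logger name, and the 0.1s warning threshold are hypothetical and not part of the postgres check.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen for illustration only.
log = logging.getLogger("postgres.cancel")


@contextmanager
def timed_step(name, timings, warn_after_s=0.1):
    """Record how long the wrapped block took in timings[name] and log a
    warning if it exceeded warn_after_s seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so failed steps are timed too.
        elapsed = time.perf_counter() - start
        timings[name] = elapsed
        if elapsed > warn_after_s:
            log.warning("cancel step %r took %.3fs", name, elapsed)
```

Wrapping, for example, the connection close and the background job shutdown in separate `with timed_step(...)` blocks would show in the logs which step used up the budget the next time the timeout occurs.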

nenadnoveljic, Dec 13 '23 12:12