Slow running rules from one tenant can cause PrometheusRules API to timeout for all tenants

Open emanlodovice opened this issue 2 years ago • 3 comments

Describe the bug

Currently the manager's SyncRuleGroups and GetRules methods share the same lock. If SyncRuleGroups becomes slow, GetRules has to wait a long time to acquire the lock.

SyncRuleGroups can become slow when we update a rule group that contains slow-running rules, because the rule group waits for the running rule to finish before it stops.

https://github.com/prometheus/prometheus/blob/main/rules/group.go#L249
https://github.com/prometheus/prometheus/blob/main/rules/group.go#L426-L430

Additional Context

emanlodovice avatar Jan 24 '24 22:01 emanlodovice

Maybe we can snapshot the tenant's RuleGroups before updating the manager and have GetRules read from the snapshot while SyncRuleGroups is running.

emanlodovice avatar Jan 24 '24 22:01 emanlodovice

My 2 cents:

It is definitely something we need to fix. GetRules shouldn't be impacted by that user manager lock. I think it is fine to read from the snapshot as you mentioned. We might not have the up-to-date rule groups at each ruler, but that is OK since we are eventually consistent, and we have the global rules merge, too.

yeya24 avatar Jan 25 '24 01:01 yeya24

Thanks @yeya24, I will try to create a PR to address this issue.

emanlodovice avatar Jan 25 '24 02:01 emanlodovice