Slow-running rules from one tenant can cause the PrometheusRules API to time out for all tenants
Describe the bug
Currently the manager's SyncRuleGroups and GetRules methods share the same lock. This means that if SyncRuleGroups becomes slow then GetRules will have to wait a long time to acquire the lock.
SyncRuleGroups can become slow when we are updating a Rule group with slow running rules because the RuleGroup will wait for the Rule to finish before it stops.
https://github.com/prometheus/prometheus/blob/main/rules/group.go#L249 https://github.com/prometheus/prometheus/blob/main/rules/group.go#L426-L430
Additional Context
Maybe we can snapshot the tenant's RuleGroups before updating the manager and have GetRules read from the snapshot while SyncRuleGroups is running.
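A rough sketch of one way the snapshot idea could look, under my own assumptions: `Manager`, `RuleGroup`, `syncMtx`, `snapMtx`, and the `apply` callback are hypothetical stand-ins, not the existing Cortex types. In this variant GetRules always reads the last published snapshot, and the snapshot lock is only held for the map access, never for the slow update.

```go
package main

import (
	"fmt"
	"sync"
)

// RuleGroup is a hypothetical stand-in for a tenant's rule group.
type RuleGroup struct {
	Name string
}

// Manager is a hypothetical, simplified multi-tenant manager.
type Manager struct {
	syncMtx sync.Mutex // serializes (possibly slow) syncs

	snapMtx   sync.RWMutex // guards only the snapshot map; held briefly
	snapshots map[string][]RuleGroup
}

func NewManager() *Manager {
	return &Manager{snapshots: map[string][]RuleGroup{}}
}

// SyncRuleGroups runs the slow update (simulated by apply) under syncMtx
// only, then publishes a copy of the new groups as the tenant's snapshot.
func (m *Manager) SyncRuleGroups(tenant string, groups []RuleGroup, apply func()) {
	m.syncMtx.Lock()
	defer m.syncMtx.Unlock()

	apply() // e.g. waiting for old groups with slow rules to stop

	cp := append([]RuleGroup(nil), groups...)
	m.snapMtx.Lock()
	m.snapshots[tenant] = cp
	m.snapMtx.Unlock()
}

// GetRules returns a copy of the last published snapshot and never waits
// on syncMtx, so a slow sync cannot block it.
func (m *Manager) GetRules(tenant string) []RuleGroup {
	m.snapMtx.RLock()
	defer m.snapMtx.RUnlock()
	return append([]RuleGroup(nil), m.snapshots[tenant]...)
}

func main() {
	m := NewManager()
	m.SyncRuleGroups("tenant-a", []RuleGroup{{Name: "old"}}, func() {})

	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		m.SyncRuleGroups("tenant-a", []RuleGroup{{Name: "new"}}, func() {
			close(started)
			<-release // simulate a slow-running rule delaying the update
		})
	}()

	<-started
	fmt.Println(m.GetRules("tenant-a")[0].Name) // old (sync still running)
	close(release)
	<-done
	fmt.Println(m.GetRules("tenant-a")[0].Name) // new
}
```

The trade-off is that readers may see the previous rule groups until the sync finishes, so results are only eventually consistent.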
My 2 cents:
It is definitely something we need to fix. GetRules shouldn't be impacted by that user manager lock.
I think it is fine to read from the snapshot as you mentioned. Each ruler might not have fully up-to-date rule groups, but that is OK since we are eventually consistent anyway. And we have the global rules merge, too.
Thanks @yeya24, I will try to create a PR to address this issue.