Find the right home for maximum number of parallel retries

Open zhaohuabing opened this issue 2 years ago • 6 comments

Description: EG needs to find the right home for "maximum number of parallel retries"

I prefer option 1 because, even though both have "retries" in their name, "maximum number of parallel retries" and "request retries" serve two different purposes. The maximum-parallel-retries setting is inherently associated with the Circuit Breaker, which fails requests quickly when a lot of retries happen and applies back pressure on the downstream. Request retries, on the other hand, are specifically designed to mitigate transient network issues. I would love more insights from @kflynn and other @envoyproxy/gateway-maintainers.
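To illustrate the distinction drawn above, here is a minimal Python sketch, not Envoy's or Envoy Gateway's actual implementation. `ParallelRetryBreaker`, `send_with_retries`, and all parameter names are hypothetical: the breaker models a cluster-wide cap on retries in flight, while `num_retries` models the per-request retry count.

```python
import threading


class ParallelRetryBreaker:
    """Hypothetical cluster-wide cap on retries in flight (the circuit-breaker knob)."""

    def __init__(self, max_parallel_retries: int) -> None:
        self.max_parallel_retries = max_parallel_retries
        self.active_retries = 0
        self.overflow = 0  # illustrative count of retries rejected by the breaker
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a retry slot, or reject immediately if the cap is reached."""
        with self._lock:
            if self.active_retries >= self.max_parallel_retries:
                self.overflow += 1
                return False
            self.active_retries += 1
            return True

    def release(self) -> None:
        """Return a retry slot once the retry attempt has finished."""
        with self._lock:
            self.active_retries -= 1


def send_with_retries(send, breaker: ParallelRetryBreaker, num_retries: int) -> bool:
    """Per-request retry policy: one initial attempt, then up to num_retries retries.

    `send` is any callable returning True on success. Only retries, not the
    initial attempt, count against the shared breaker.
    """
    if send():
        return True
    for _ in range(num_retries):
        if not breaker.try_acquire():
            return False  # fail fast: too many retries already outstanding
        try:
            if send():
                return True
        finally:
            breaker.release()
    return False
```

In this sketch the per-request count bounds how persistent a single request is, whereas the breaker bounds the total retry load on the backend regardless of how many requests are failing at once.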

Relevant links:

  • https://github.com/envoyproxy/gateway/pull/2284
  • https://github.com/envoyproxy/gateway/pull/2168

zhaohuabing avatar Dec 19 '23 03:12 zhaohuabing

Should they be configured separately, since they have different purposes?

Xunzhuo avatar Dec 20 '23 05:12 Xunzhuo

Should they be configured separately, since they have different purposes?

That's my proposal. Need more input on this.

zhaohuabing avatar Dec 20 '23 11:12 zhaohuabing

What's the actual purpose of configuring parallel retries? It clearly doesn't make any sense for a single request to be sent to multiple upstream instances in parallel, so this implies that many downstream requests are arriving, failing, and being retried.

Given that Envoy configures budgeted retries in circuit breaking, and counted retries separately, I wonder if it also separates its counting of parallel retries between the two. Does anyone happen to know the answer to that?

I would hope that it doesn't, in which case I would hope that this would be a single setting in the Retry configuration. But I don't know what Envoy actually does...

kflynn avatar Dec 20 '23 18:12 kflynn

Given that Envoy configures budgeted retries in circuit breaking, and counted retries separately, I wonder if it also separates its counting of parallel retries between the two. Does anyone happen to know the answer to that?

@kflynn I'm not sure I get your question here. If you're referring to max_retries and retry_budget in the CircuitBreakers, I believe only one of them should be specified, and according to the Envoy docs, retry_budget is the recommended one. So I would suggest that EG expose only retry_budget in the Circuit Breaker API.

Cluster maximum active retries: The maximum number of retries that can be outstanding to all hosts in a cluster at any given time. In general we recommend using retry budgets; however, if static circuit breaking is preferred it should aggressively circuit break retries. This is so that retries for sporadic failures are allowed, but the overall retry volume cannot explode and cause large scale cascading failure. If this circuit breaker overflows the upstream_rq_retry_overflow counter for the cluster will increment.

References:

  • https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/circuit_breaking
  • https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds
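As a rough illustration of how a retry budget differs from a static limit, the sketch below computes the number of concurrent retries a budget would allow. It assumes, based on my reading of the Envoy documentation linked above, that the budget is a percentage of active plus pending requests with a floor of `min_retry_concurrency`, and that the defaults are 20% and 3; treat the formula, the truncation to an integer, and the defaults as assumptions to verify. The function name is hypothetical.

```python
def retry_budget_limit(
    active_requests: int,
    budget_percent: float = 20.0,      # assumed default, verify against the Envoy docs
    min_retry_concurrency: int = 3,    # assumed default, verify against the Envoy docs
) -> int:
    """Return the number of concurrent retries allowed under a retry budget.

    `active_requests` stands for active plus pending requests to the cluster.
    The limit scales with load but never drops below `min_retry_concurrency`.
    """
    scaled = int(active_requests * budget_percent / 100.0)
    return max(min_retry_concurrency, scaled)


# A static limit stays fixed as load grows; the budget grows with it.
for load in (10, 100, 1000):
    print(load, retry_budget_limit(load))
```

Under these assumptions, a cluster with 10 active requests is still allowed 3 concurrent retries, while one with 1000 active requests is allowed 200, which is why a budget tolerates bursts against a busy backend better than a small static value.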

zhaohuabing avatar Dec 25 '23 09:12 zhaohuabing

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Jan 24 '24 12:01 github-actions[bot]

While working on #2725, I noticed that the default Envoy max_retries limit is quite low (3), leading to retry overflow when there is a significant number of requests to a restarting backend. Envoy provides a sample that increases this value to 10 to deal with restart scenarios. Should we use a more permissive default, e.g. a 20% default retry budget?

guydc avatar Mar 02 '24 00:03 guydc