api icon indicating copy to clipboard operation
api copied to clipboard

OutlierDetection documentation wrong for error types

Open howardjohn opened this issue 6 years ago • 21 comments

https://istio.io/docs/reference/config/networking/v1alpha3/destination-rule/#OutlierDetection

Under consecutiveErrors: 502, 503, and 504s only, 500 codes will not trigger the outlier detection

But in the docs above:

For HTTP services, hosts that continually return 5xx errors for API calls are ejected from the pool for a pre-defined period of time.

So sounds like the docs are incorrect here. It will only work for 502/503/504 from looking at the code and https://github.com/istio/api/pull/617

howardjohn avatar Apr 25 '19 15:04 howardjohn

@howardjohn Thanks for clearing it up.

Just one question why isn't 500 status code included in the circuit breaking? In my view a service that returns continuously Internal Server Errors should also be also ejected from the pool. Otherwise we are keeping a service that is failing for whatever reason.

codependent avatar Apr 25 '19 15:04 codependent

Some context from the change #617 that introduced it:

We try not to expose the plethora of sometimes confusing Envoy options to end users, in the routing api.

Within a mesh, gateway errors will be more common (502/503/504) while most sensible external services will return a 503 to shed load.

Secondly we just made the outlier detection generic to both tcp and http. The consecutive gateway error applies only to http and will make no sense in tcp context.

I also feel that 500 error code is not something indicative of overload. The whole idea behind outliers is to remove overloaded servers from the lb pool.

We don’t have very many users relying on this behavior I think. We kept it intentionally generic so that we can switch to a more specific error code in future (which happens to be now).

howardjohn avatar Apr 25 '19 15:04 howardjohn

I understand the motivation now however I think there's another valid scenario that is not supported.

In our microservice infrastructure we were relying on Hystrix as an application circuit breaker. When downstream services are crashing (500 errors) we don't want to keep sending traffic to them, and Hystrix removed them temporaly from the pool.

Right now with Istio, this failing services are being invoked despite being actually down. That is to say we are protected from latency/overloading but not from failure. We can always translate 500 errors to 502/503/504 at the microservices but this seems like a work around.

Does this make sense?

codependent avatar Apr 25 '19 17:04 codependent

Envoy explicitly distinguishes between consecutive gateway failures (502,503,504) errors and consecutive errors (5xx), supporting both of them.

https://www.envoyproxy.io/docs/envoy/v1.5.0/intro/arch_overview/outlier

Istio’s consecutiveErrors field is therefore misleading, shouldn’t it be aligned with Envoys terminology? Also, why not allow both kinds of configurations?

codependent avatar Apr 25 '19 18:04 codependent

I think the 500 status code should be included in the circuit breaking, resilience4j and lagom give this support too.

venosov avatar Apr 26 '19 08:04 venosov

@rshriram any thoughts ^?

howardjohn avatar Apr 26 '19 14:04 howardjohn

So there are really 2 issues here.

  1. The documentation is currently wrong. The consecutiveErrors field is documented correctly, but the overview/example is wrong and needs to be fixed.

  2. Should Istio support both kinds of outlier detection that Envoy supports (5xx and Gateway failures). If so, would it be sufficeint to add a boolean (e.g., allFailures to choose weather consecutiveErrors maps to all 5xx errors or just the Gateway ones?

@rshriram ^^^^

frankbu avatar Aug 19 '19 15:08 frankbu

wow.. this has come a full circle. So options here are to change the semantics of consecutiveErrors once again [but then people like @arnecls will want to set this up only for 502/3,etc.]. Or as @frankbu says, we could add a boolean that says "treat500ErrorsAsFailure` such that people can decide to use application error (HTTP 500) as a sign of failure.

rshriram avatar Aug 19 '19 16:08 rshriram

Just wanted to bump this and give a +1 to adding a boolean to support all 5XX-level errors instead of just 502-504 gateway errors.

My reasoning for wanting this is when dealing with third parties, you really can't control what kinds of failure codes they return. If a third party returns 500s when it's down and not 503s, yes that's semantically incorrect, but that's not something we control as integrators of an API.

Also, I think it's generally valid to want to treat multiple consecutive 500s as a good sign of "this host isn't healthy and we shouldn't route traffic to it".

Would love to see this issue get some more love!

embeaken avatar Nov 04 '19 18:11 embeaken

+1 on supporting consecutive 5xx errors.

another option for supporting this instead of adding a boolean: we could add another field so consecutiveErrors = consecutive gateway errors consecutive5xxErrors = consecutive 5xx errors the benefit of this is that we could set separate threshold for the two just like envoy

yingzhuivy avatar Nov 12 '19 21:11 yingzhuivy

We have two implementation options to support outlier detection for consecutive 5xx errors (500-599) instead of just gateway errors (502/503/504):

  1. Add a boolean, all_server_errors. By default, this is false, which means only 502, 503 and 504 return codes are considered errors. When this is set to true, all 5xx return codes are considered errors.
  2. Add a new int32, consecutive_server_errors that configures the number of consecutive server errors (all 5xx) before an endpoint is ejected. If the value is not set (default), endpoints are not ejected for non-gateway errors (e.g. 500, 505). Ejection could still be performed on gateway errors (502, 503, 504) using consecutive_errors.

Option 1 is clear. Users can configure exactly one way to reject endpoints based on upstream errors.

Option 2 provides a more flexible API. Users could configure more stringent outlier detection for gateway errors than for server errors. This comes at the cost of a more complex API: consecutive_errors and consecutive_server_errors will be confusing to configure for new users, especially because consecutive_errors actually means consecutive_gateway_errors.

I’m not sure how useful it is to be able to set two thresholds separately, which option do you guys prefer? @rshriram @frankbu @howardjohn

yingzhuivy avatar Nov 22 '19 01:11 yingzhuivy

@yingzhuivy I think Option 2 (the more flexible API) would be my choice. I think we can avoid confusing users by picking a better name for the new field. My suggestion for the 2 fields:

  1. consecutive_errors: consecutive gateway errors (502, 503, 504) as currently defined
  2. consecutive_5xx: consecutive errors of any 5xx type

frankbu avatar Nov 22 '19 15:11 frankbu

@frankbu sounds good to me, although I would think consecutive_errors means ANY errors. But not worth changing the API to make that more clear. Instead we should just make it clearly documented.

howardjohn avatar Nov 24 '19 22:11 howardjohn

Current documentation for consecutiveErrors has no description on the interpretation of zero value. Looking at the implementation, we do not distinguish zero from not set, and consecutiveErrors field has the following semantics:

  • > 0, enable consecutive gateway errors with the value
  • =0/unset, enable consecutive gateway errors with default value 5
  • <0, not valid

For the new consecutive5xxErrors field we can follow the pattern to have:

  • >0, enable consecutive 5xx errors with the value
  • =0/unset, disable consecutive 5xx errors
  • <0, not valid

The problem with this design is that there is no way to turn off consecutive gateway errors.

Some options to consider: Option 1: Distinguish between 0 and unset. 0 means disable the feature while unset means default.

consecutiveErrors:

  • >0, enable consecutive gateway errors with the value
  • =0, disable consecutive gateway errors
  • unset, enable consecutive gateway errors with default value 5
  • <0, not valid

consecutive5xxErrors:

  • >0, enable consecutive 5xx errors with the value
  • =0, disable consecutive 5xx errors
  • unset, disable consecutive 5xx errors
  • <0, not valid

This is the most straightforward and ideal approach, but need to change existing types to google.protobuf.UInt32Value. Or we could handle detecting unset value during yaml parsing. If this is not feasible, we could consider:

Option 2: Allow negative value, which represents turning the feature off; 0 still means default.

consecutiveErrors:

  • >0, enable consecutive gateway errors with the value
  • =0/unset, enable consecutive gateway errors with default value 5
  • <0, disable consecutive gateway errors

consecutive5xxErrors:

  • >0, enable consecutive 5xx errors with the value
  • =0/unset, disable consecutive 5xx errors
  • <0, disable consecutive 5xx errors

Option 3: Another option is to keep the existing API same, when either of the following is true, disable consecutive gateway errors outlier detection:

  • consecutiveErrors = consecutive5xxErrors ≠ 0
  • consecutiveErrors=0 & consecutive5xxErrors=5

(If a user wants to enable consecutive 5xx only, they need to set consecutiveErrors equal to consecutive5xxErrors) This option is too complex and is hard to understand on user’s perspective; and there is still no way to disable both at the same time.

Thoughts? @frankbu @howardjohn

Also, with any of the options above, we need to have clear documentation, including the interpretation of zero/negative values.

yingzhuivy avatar Dec 03 '19 23:12 yingzhuivy

Maybe this is a good excuse to deprecate the consecutive_errors field, and add two new fields consecutive_gateway_errors and consecutive_5xx_errors with Option 1 semantics.

  // $hide_from_docs
  int32 consecutive_errors = 1 [deprecated=true];

frankbu avatar Dec 04 '19 15:12 frankbu

ok, what's next step? should I submit a PR for deprecating consecutive_errors and adding two new fields? Is there a deprecation process that I should follow?

yingzhuivy avatar Dec 05 '19 20:12 yingzhuivy

@yingzhuivy Yes, just update the consecutive_errors field as I've shown above (mark as deprecated and hide from docs) and then add the two new fields with proper doc for how they work. Then you need to add the new code, but make sure to leave the old field working as before (the validator should complain if both fields are set).

frankbu avatar Dec 05 '19 20:12 frankbu

How about this:

monitorErrors:
  gatewayErrors: ..
  all5xxErrors: ..

rshriram avatar Dec 10 '19 19:12 rshriram

@rshriram I think we are not expecting any other types of monitoring errors right? if that's the case, wouldn’t it be simpler to put directly instead of combining them into a struct?

cc @jhahn21 @brianwolfe

yingzhuivy avatar Dec 10 '19 21:12 yingzhuivy

From someone new to Istio (and I have no idea about complexity),

outlierDetection: 
      httpStatusCodes: [500, 501, 502, 503, 504] # Default: [502, 503, 504]
      maxEjectionPercent: 100
      consecutiveErrors: 1
      interval: 5s

would be both simple to understand and easy to use (and therefore match all scenarios).

If, for some reason, people want to use 403 or 401 (auth service is failing?) or any other scenario, they could use it as they please.

But I guess this would mean a change in EnvoyProxy first to open that possibility.

[Edit] After thinking back about it, I realize this is not helpful while Envoy does not implement it. Sorry for wasting your time. I leave it here if any Envoy contributor comes by.

MaximeBernard avatar Mar 25 '20 17:03 MaximeBernard

Looking at the API docs now, it has fields for consecutiveGatewayErrors (502, 503, 504) and consecutive5xxErrors (any 500-class), and no longer for consecutiveErrors.

I think that means this can be closed?

craigbox avatar Dec 11 '24 23:12 craigbox