envoy icon indicating copy to clipboard operation
envoy copied to clipboard

ORCA Header Propagation

Open davinci26 opened this issue 9 months ago • 18 comments

Title: ORCA Header Propagation

Currently Envoy propagates the ORCA header it gets from upstream to downstream. Consider the following case:

Image

Envoy will propagate the header endpoint-load-metrics to the client and that might cause the client to load-balance against Envoy based on the upstream utilization. This is can cause surprising load for different Envoy instances.

Note that there are other usecases where Envoy is acting like a sidecar proxy for the application and in that case it makes sense to forward the headers. In my view there is no one-fits-all-answer for all different envoy usecases,

Generally we have 2 options to address this issue:

  1. Strip load report headers based on an API field/change. This has the cost of adding/maintaining an additional API field
  2. Document the behaviour and point operators to the http filter that can be used to strip the headers from the response. This has the cost of making the feature complicated to use and can lead to surprises for the users.

Looking forward to your thoughts

davinci26 avatar May 07 '25 18:05 davinci26

You can also use the response_headers_to_remove field on Route Configuration: https://github.com/envoyproxy/envoy/blob/1e68b78b9f26e4d93b01faae257158b3ec54798a/api/envoy/config/route/v3/route.proto#L66

paul-r-gall avatar May 08 '25 12:05 paul-r-gall

speaking of which another thing I realized is that grpc-go (and maybe other implementations) put the ORCA headers in the trailers so an operator would need to remove both the headers and the trailers to make this implementation robust.

You can also use the response_headers_to_remove field on Route Configuration:

Yeah, that is another straight forward alternative to use

davinci26 avatar May 08 '25 13:05 davinci26

yes; I believe that gRPC will put the ORCA metrics in trailers generally. and the response_headers_to_remove only removes response headers, not response trailers.

paul-r-gall avatar May 08 '25 13:05 paul-r-gall

yes; I believe that gRPC will put the ORCA metrics in trailers generally. and the response_headers_to_remove only removes response headers, not response trailers.

Also I cant seem to find an api to remove response trailers which does make things a bit more complicated

davinci26 avatar May 08 '25 13:05 davinci26

yurp, that API doesn't exist today. would be a welcome addition if you'd like to open a PR?

paul-r-gall avatar May 08 '25 13:05 paul-r-gall

yurp, that API doesn't exist today. would be a welcome addition if you'd like to open a PR?

Yeah for sure,

Let me summarize bellow the full recommendation

Problem Statement

When the upstream client is using ORCA it advertises the load-reports in two possible mechanism:

  1. Response Trailers
  2. Response Headers

Those headers will be propagated by default from Envoy to the downstream client and that might cause it load balance against Envoy using those headers and cause unintended consequences.

Proposed solution

In the ORCA filter documentation we will tell to Envoy operators who use ORCA about this case and suggest as a way forward to combine ORCA with a router option to remove the ORCA headers.

However there is no current way to remove response trailers with the current API surface so we recommend the addition of an additional field in envoy/api/envoy/config/route/v3/route.proto called response_trailers_to_remove

Do we want to tag api-shepherds to get a signoff on the api addition before moving forward with the PR?

davinci26 avatar May 08 '25 13:05 davinci26

if we're doing response_trailers_to_remove we might as well do all the other *_trailers_to_* API fields. @envoyproxy/api-shepherds sound reasonable?

paul-r-gall avatar May 08 '25 13:05 paul-r-gall

One concern with removal of endpoint-load-metrics headers/trailers is that this will make it more difficult to observe / debug ORCA reports as downstream filters as well as clients will not see them.

As a mitigation we can optionally add a parsed OrcaLoadReport to the filter state before removing header/trailer.

I'm not sure whether it would make sense to add config knobs to the router filter config or cluster or some place else.

I think tagging api-shepherds before making forward with the PR makes sense.

cc: @adisuissa @yanavlasov @AndresGuedez @wbpcode @markdroth

efimki avatar May 08 '25 13:05 efimki

One concern with removal of endpoint-load-metrics headers/trailers is that this will make it more difficult to observe / debug ORCA reports as downstream filters as well as clients will not see them.

For my own learning/mental model. Is this a common usecase, I was under the impression that the router filter is the last filter in the HTTP filter chain in the most common case so wouldnt all other HTTP filters have access to the headers before the router removes them?

davinci26 avatar May 08 '25 13:05 davinci26

On the response path the router filter goes first:

Request path: Filter A --> Fitler B --> Router --> upstream

Response path: Filter A <-- Filter B <-- Router <-- upstream.

paul-r-gall avatar May 08 '25 13:05 paul-r-gall

On the response path the router filter goes first:

Yeah in that case the above makes sense

davinci26 avatar May 08 '25 13:05 davinci26

I agree that there are reasonable use-cases where we would want to remove the header, but there are also reasonable use-cases where we wouldn't (e.g., in a service mesh desployment, a sidecar proxy on the server side would want to pass this data through to the downstream client), so clearly this needs to be configurable. I think the key question is, which behavior do we expect will be the common case, which should be our default behavior?

If we expect that passing the header through is the common case, then I think that's the right default behavior, and we can just provide a knob to remove it for the cases where that behavior is desirable. My inclination would be to say that users should use the header mutation filter for this, rather than add yet another first-class knob to the RouteConfiguration. (In principle, I would view even the existing response_headers_to_remove field as being deprecated in favor of the header mutation filter.)

On the other hand, if we expect that removing the header is the common case, then we should consider changing our default behavior and providing a knob to inhibit that default. If we go this route, then we need to decide where this knob goes -- it could be either in the router filter config or in the cluster resource. And since this would be a default behavior change, we would probably need a runtime knob as a migration mechanism.

One other note:

One concern with removal of endpoint-load-metrics headers/trailers is that this will make it more difficult to observe / debug ORCA reports as downstream filters as well as clients will not see them.

As a mitigation we can optionally add a parsed OrcaLoadReport to the filter state before removing header/trailer.

If there are actually multiple filters that want to see this data, we don't want each of them to have to parse the data independently for performance reasons, so we would definitely want to do something like that anyway, regardless of whether we remove the header or not. Note that this would also be true if we had a upstream filter use this data, since that would run before the router filter.

If we have any use-case where multiple filters need the data, then I think we'd want to use the approach that we originally talked about, where we have an API to get the data that checks to see if we've already parsed it and stored the parsed data, if not parses it and stores the parsed data, and then returns the stored parsed data. That way, whatever filter needs it first will trigger the parsing, and any filter that needs it later would just use the already-parsed data.

markdroth avatar May 08 '25 15:05 markdroth

If it helps with the decision making I think the current default of propagating makes sense due to the docs here https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge#best-practices-edge

Envoy is a production-ready edge proxy, however, the default settings are tailored for the service mesh use case, and some values need to be adjusted when using Envoy as an edge proxy.

But being open to feedback makes sense

davinci26 avatar May 08 '25 15:05 davinci26

After doing a bit of research and playing around with the existing API I validated that:

          - name: downstream-header-mutation
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.header_mutation.v3.HeaderMutation
              mutations:
                response_mutations:
                - remove: "trailer-key"

does not mutate actually the response trailers (I wanted to confirm that myself). I guess the suggestion then is to leave the *_trailers_to_* API as it is an only add the extension to type.googleapis.com/envoy.extensions.filters.http.header_mutation.v3.HeaderMutation maybe in the form of: trailer_mutations field that has the same exact format as response_mutations.

On the second point I think what everyone saying is making sense. Do we see part 2 (internal filter api for load reporting) as a requirement before we implement part 1 or can those be implemented independently?

davinci26 avatar May 08 '25 19:05 davinci26

I think you need request trailer mutations and response trailers mutations (separately)

repeated config.common.mutation_rules.v3.HeaderMutation response_trailer_mutations = 4;
repeated config.common.mutation_rules.v3.HeaderMutation request_trailer_mutations = 5;

added to https://github.com/envoyproxy/envoy/blob/f2a1a61901d7b7ed3a7ced176fd0540859ef5d9c/api/envoy/extensions/filters/http/header_mutation/v3/header_mutation.proto#L19.

Definitely these can be implemented independently.

paul-r-gall avatar May 08 '25 19:05 paul-r-gall

Linking this here as well https://github.com/envoyproxy/envoy/pull/39433 since I assume folks are interested. It is marked as draft because I wanted the CI to be green, but I have tested it locally and it works fine

davinci26 avatar May 12 '25 13:05 davinci26

The mutation PR was merged yesterday, wanted to do some doc changes on top of it to explain to users about the edge case and then we can close the parent issue as mitigated

davinci26 avatar May 15 '25 12:05 davinci26

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 14 '25 16:06 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] avatar Jun 21 '25 16:06 github-actions[bot]