
router: retry on connect failure

Open fl0Lec opened this issue 1 year ago • 37 comments

Fix #32464

Commit Message: retry on connect failure
Additional Description: see below
Risk Level: High
Testing: unit test and build locally
Docs Changes: No
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

More context

Description:

When retrying on connection failure, any request whose body is larger than the connection buffer limit will not be retried.

Given the configuration below, all requests bigger than 32768 bytes will not be retried. If not overridden per route, the limit comes from the listener config per_connection_buffer_limit_bytes and defaults to 1 MiB.

The following happens:

  1. the router receives the body and proxies it to the upstream request
  2. the buffer limit is reached at the router level
  3. the router gives up on retries but keeps forwarding data to the upstream request
  4. the upstream connection's buffer fills up and it notifies the downstream connection to stop reading.

If the upstream request then fails, it will not be retried.
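
To illustrate, here is a minimal C++ sketch (not the actual router.cc code; the StreamState type and onRequestData function are hypothetical names) of the behaviour behind the "Give up on the retry/shadow" log shown further down: once the buffered body exceeds the retry/shadow buffer limit, retry state is dropped even though data keeps flowing upstream.

#include <cstdint>
#include <iostream>

struct StreamState {
  bool retry_enabled = true;
  uint64_t buffered_bytes = 0;
};

void onRequestData(StreamState& stream, uint64_t chunk_bytes,
                   uint64_t retry_shadow_buffer_limit) {
  stream.buffered_bytes += chunk_bytes;
  if (stream.retry_enabled && stream.buffered_bytes > retry_shadow_buffer_limit) {
    // Corresponds to the "Give up on the retry/shadow" log below: the request
    // can no longer be replayed even though it has not failed yet.
    std::cout << "payload exceeds buffer limit " << retry_shadow_buffer_limit
              << ", giving up on retry/shadow\n";
    stream.retry_enabled = false;
  }
  // Data is still forwarded to the upstream request regardless, so a later
  // connect failure cannot be retried.
}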

Proposed solution:

The router can register a callback via addUpstreamCallbacks to be notified when the upstream request obtains a connection.

If the router buffer fills up before any upstream connection has been established, keep buffering above the limit and call the onDecoderFilterAboveWriteBufferHighWatermark() callback so we stop reading from the downstream. This allows retries to continue while keeping the buffer close to the desired limit.

Once a new connection is established, drain the buffer (which re-enables reads from the downstream) and drop retry, shadow, and buffering.

For all other retries the behaviour is unchanged.
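
A rough C++ sketch of the proposal (illustrative only; the RouterStream type and function names are hypothetical, and only addUpstreamCallbacks and onDecoderFilterAboveWriteBufferHighWatermark refer to the existing callbacks mentioned above):

#include <cstdint>

struct RouterStream {
  bool upstream_connected = false;
  bool retry_enabled = true;
  bool downstream_read_paused = false;
  uint64_t buffered_bytes = 0;
};

// Called when the router's retry/shadow buffer crosses its limit.
void onRouterBufferHighWatermark(RouterStream& s) {
  if (!s.upstream_connected) {
    // No upstream connection yet: keep buffering above the limit and pause
    // downstream reads (via onDecoderFilterAboveWriteBufferHighWatermark())
    // so retries stay possible while the buffer stays near the limit.
    s.downstream_read_paused = true;
  } else {
    // Unchanged behaviour: give up on retry/shadow once the limit is exceeded.
    s.retry_enabled = false;
  }
}

// Registered via addUpstreamCallbacks; fires once a connection is acquired.
void onUpstreamConnectionEstablished(RouterStream& s) {
  s.downstream_read_paused = false;  // re-enable reads from the downstream
  s.retry_enabled = false;           // drop retry, shadow, and extra buffering
  s.buffered_bytes = 0;              // buffer is drained to the upstream
}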

Example:

config:

static_resources:
  listeners:
  - name: test_connect_failure    
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 8100
    per_connection_buffer_limit_bytes: 32768
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: AUTO
          stat_prefix: web1
          route_config:
            virtual_hosts:
            - name: backend
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: local_service_fail
                  timeout: 0s
                  retry_policy:
                    retry_on: "connect-failure"
                    num_retries: 3
                    retry_back_off:
                      base_interval: 0.050s
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: local_service_fail
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    upstream_connection_options:
      tcp_keepalive:
        keepalive_time: 300
    common_lb_config:
      healthy_panic_threshold:
        value: 40
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                # port doesn't match anything
                port_value: 8085

With the Envoy Docker image "envoyproxy/envoy:v1.29.0":

Only the relevant logs:

[2024-02-13 09:25:46.619][21][debug][connection] [./source/common/network/connection_impl.h:98] [Tags: "ConnectionId":"0"] current connecting state: false
[2024-02-13 09:25:46.619][21][debug][router] [source/common/router/router.cc:514] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] cluster 'local_service_fail' match for URL '/'
[2024-02-13 09:25:46.619][21][debug][router] [source/common/router/router.cc:731] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] router decoding headers:
':authority', 'localhost:8100'
':path', '/error'
':method', 'POST'
':scheme', 'http'
'user-agent', 'Go-http-client/1.1'
'content-length', '92768'
'accept-encoding', 'gzip'
'x-forwarded-proto', 'http'
'x-request-id', '00db4f6a-b6cb-4fe3-b4d4-34bad1e002f2'

[2024-02-13 09:25:46.619][21][debug][connection] [./source/common/network/connection_impl.h:98] [Tags: "ConnectionId":"1"] current connecting state: true
[2024-02-13 09:25:46.619][21][debug][connection] [source/common/network/connection_impl.cc:1009] [Tags: "ConnectionId":"1"] connecting to 127.0.0.1:8085
[2024-02-13 09:25:46.619][21][debug][connection] [source/common/network/connection_impl.cc:1028] [Tags: "ConnectionId":"1"] connection in progress
[2024-02-13 09:25:46.619][21][trace][connection] [source/common/network/connection_impl.cc:654] [Tags: "ConnectionId":"0"] read ready. dispatch_buffered_data=0
[2024-02-13 09:25:46.619][21][trace][connection] [source/common/network/raw_buffer_socket.cc:25] [Tags: "ConnectionId":"0"] read returns: 32768
[2024-02-13 09:25:46.619][21][debug][router] [source/common/router/router.cc:856] The request payload has at least 65412 bytes data which exceeds buffer limit 32768. Give up on the retry/shadow.
[2024-02-13 09:25:46.619][21][debug][http] [source/common/http/filter_manager.cc:541] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] request data too large watermark exceeded
[2024-02-13 09:25:46.619][21][debug][http] [source/common/http/conn_manager_impl.cc:1922] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] Read-disabling downstream stream due to filter callbacks.
[2024-02-13 09:25:46.619][21][trace][connection] [source/common/network/connection_impl.cc:401] [Tags: "ConnectionId":"0"] readDisable: disable=true disable_count=0 state=0 buffer_length=0
[2024-02-13 09:25:46.619][21][debug][connection] [source/common/network/connection_impl.cc:757] [Tags: "ConnectionId":"1"] delayed connect error: 111
[2024-02-13 09:25:46.619][21][debug][connection] [source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"1"] closing socket: 0
[2024-02-13 09:25:46.619][21][trace][connection] [source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"1"] raising connection event 0
[2024-02-13 09:25:46.619][21][debug][router] [source/common/router/router.cc:1331] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] upstream reset: reset reason: remote connection failure, transport failure reason: delayed connect error: 111
[2024-02-13 09:25:46.619][21][debug][http] [source/common/http/conn_manager_impl.cc:1912] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] Read-enabling downstream stream due to filter callbacks.
[2024-02-13 09:25:46.619][21][trace][connection] [source/common/network/connection_impl.cc:401] [Tags: "ConnectionId":"0"] readDisable: disable=false disable_count=1 state=0 buffer_length=0
[2024-02-13 09:25:46.619][21][debug][http] [source/common/http/filter_manager.cc:1065] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] Sending local reply with details upstream_reset_before_response_started{remote_connection_failure,delayed_connect_error:_111}
[2024-02-13 09:25:46.619][21][debug][http] [source/common/http/conn_manager_impl.cc:1869] [Tags: "ConnectionId":"0","StreamId":"742379432996652670"] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '152'
'content-type', 'text/plain'
'date', 'Tue, 13 Feb 2024 09:25:46 GMT'
'server', 'envoy'
'connection', 'close'

Test

  • unit test
  • integration test
  • tested in one AZ of a staging environment; saw the UF response flags going away

fl0Lec avatar Feb 13 '24 10:02 fl0Lec

Hi @fl0Lec, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/32359 was opened by fl0Lec.

see: more, trace.

As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/32359 was opened by fl0Lec.

see: more, trace.

/retest

fl0Lec avatar Feb 15 '24 10:02 fl0Lec

also opened an issue: https://github.com/envoyproxy/envoy/issues/32464

fl0Lec avatar Feb 21 '24 08:02 fl0Lec

@fl0Lec thank you for the change. Can you describe the problem that you are trying to solve, please? I.e. is there an open issue you are fixing?

I think retries on TCP connection failures should already be working. Are they not?

@alyssawilk for more comments.

/wait-any

yanavlasov avatar Feb 27 '24 16:02 yanavlasov

+1 - I think it would help if you could write an integration test which simulates these conditions, fails without your change and succeeds with your change. Mind giving that a shot? /wait

alyssawilk avatar Feb 27 '24 18:02 alyssawilk

Makes sense. I forgot to add the link to the issue I created: https://github.com/envoyproxy/envoy/issues/32464. It adds more context on the problem I am trying to solve here.

I will look into adding an integration test

fl0Lec avatar Feb 28 '24 11:02 fl0Lec

Ah, with the attached issue that makes much more sense, thanks so much. So to summarize: if we'd be willing to buffer X bytes in an upstream send buffer before applying flow control, we should be willing to buffer X bytes (larger than the default of Y) in the retry buffer when we don't have an upstream connection? As long as we cancel the retry once we connect so we don't buffer the extra bytes for a long duration (which it looks like you're doing), I think that's sensible. Definitely let's get e2e tests, ideally a bit more commenting, and a runtime guard as well, as it's a pretty high risk change.

alyssawilk avatar Feb 28 '24 15:02 alyssawilk

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/32359 was synchronize by fl0Lec.

see: more, trace.

I added per your suggestion @alyssawilk:

  • runtime guard (defaulting to false as it is a pretty risky change)
  • renamed some variables for clarity
  • 2 integration tests:
    • 1 with the new behaviour (retry even when buffer limit exceeded)
    • 1 with the old behaviour (return 5xx)

I also updated the PR description to give more context. This PR has also been tested in part of our staging environment; I will link a graph in a PR comment soon.

fl0Lec avatar Mar 08 '24 13:03 fl0Lec

/retest

fl0Lec avatar Mar 10 '24 11:03 fl0Lec

adding in @KBaichoo for first pass

alyssawilk avatar Mar 11 '24 14:03 alyssawilk

/retest

alyssawilk avatar Mar 27 '24 17:03 alyssawilk

hm, can you look into test failures? may need a main merge? /wait

alyssawilk avatar Mar 27 '24 19:03 alyssawilk

hm, can you look into test failures? may need a main merge?

Merged main; still need to figure out the new test failures /wait

fl0Lec avatar Mar 28 '24 20:03 fl0Lec

Tests look better now. Only one test failed (not arm64-related); will retest it /retest

fl0Lec avatar Mar 31 '24 17:03 fl0Lec

/retest

fl0Lec avatar Apr 03 '24 13:04 fl0Lec

@alyssawilk maintainer on-call ping

RyanTheOptimist avatar Apr 08 '24 14:04 RyanTheOptimist

To clarify the goal of this PR:

The intent is to ensure we retry on connect failure until we get at least one connection with an upstream. While the router is waiting for a connection with an upstream, we allow buffering up to the cluster connection buffer limit plus 1 MiB. Once a connection is acquired, we fall back to the default retry_per_shadow_buffer_limit_ (the minimum of the downstream connection buffer limit and the route buffer limit).

Why do we need the extra 1 MiB? We rely on the connection filter to pause reads on the downstream connection. Therefore, the router must be able to buffer more than the upstream buffer limit, as reads will not be paused until that buffer is full (has reached the high watermark).
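
A rough sketch of the limits described above (illustrative C++ only; the function names are hypothetical and the real logic lives in the router):

#include <algorithm>
#include <cstdint>

constexpr uint64_t kHeadroomBytes = 1024 * 1024;  // the "plus 1 MiB" discussed above

// Limit used once an upstream connection has been acquired: the minimum of the
// downstream connection buffer limit and the route buffer limit.
uint64_t retryShadowBufferLimit(uint64_t downstream_connection_limit, uint64_t route_limit) {
  return std::min(downstream_connection_limit, route_limit);
}

// Limit used while the router is still waiting for an upstream connection:
// the cluster connection buffer limit plus 1 MiB of headroom.
uint64_t limitWhileWaitingForUpstream(uint64_t cluster_connection_limit) {
  return cluster_connection_limit + kHeadroomBytes;
}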

fl0Lec avatar Apr 10 '24 14:04 fl0Lec

/retest

fl0Lec avatar Apr 10 '24 14:04 fl0Lec

To clarify the goal of this PR:

The intent is to ensure we retry on connect failure until we get at least one connection with an upstream. While the router is waiting for a connection with an upstream, we allow buffering up to the cluster connection buffer limit plus 1 MiB. Once a connection is acquired, we fall back to the default retry_per_shadow_buffer_limit_ (the minimum of the downstream connection buffer limit and the route buffer limit).

Why do we need the extra 1 MiB? We rely on the connection filter to pause reads on the downstream connection. Therefore, the router must be able to buffer more than the upstream buffer limit, as reads will not be paused until that buffer is full (has reached the high watermark).

So in general, the router will buffer request data without sending a 4xx until it starts sending data upstream. After data has been sent upstream, the router will not be able to retry requests unless the request body fits in the shadow/retry buffer.

All increasing the router buffer does is increase the amount of data the router will buffer in this manner before pushing back downstream. This increases the risk of OOM so without a concrete value add I don't think it's worth tackling. If you want larger buffers, you can configure larger buffers. I don't think it's worth the additional complexity to already complicated code to reduce latency by buffering more if and only if we are waiting on upstream connections.

Your e2e test passes just fine with a clean envoy checkout (after removing the reloadable flag set), so I think that shows envoy already does retries for large requests on connect failure?

bazel test --config=remote //test/integration:integration_test --test_timeout=120 --test_arg=--gtest_filter=IpVersionsAndHttp1Parser/IntegrationTest.EnsureConnectionRetry/IPv4BalsaParser

alyssawilk avatar Apr 10 '24 20:04 alyssawilk

So in general, the router will buffer request data without sending a 4xx until it starts sending data upstream. After data has been sent upstream, the router will not be able to retry requests unless the request body fits in the shadow/retry buffer.

Yes, that was the intention. I also proposed pausing via onDecoderFilterAboveWriteBufferHighWatermark once we reach the shadow/retry buffer limit, but this was pushed back. https://github.com/envoyproxy/envoy/pull/32359#discussion_r1526455610

All increasing the router buffer does is increase the amount of data the router will buffer in this manner before pushing back downstream. This increases the risk of OOM so without a concrete value add I don't think it's worth tackling. If you want larger buffers, you can configure larger buffers. I don't think it's worth the additional complexity to already complicated code to reduce latency by buffering more if and only if we are waiting on upstream connections.

The idea would be to ensure we retry on connection failure without allowing a huge buffer for every request. Because Envoy has one connection pool per thread, connection reuse is not optimal and we are seeing a large number of connections, which can time out from time to time due to a lost SYN packet. This is affecting our SLOs and we would like to tackle the issue here if possible. This is the implementation that came out of the review of this PR, but we are open to other alternatives.

Your e2e test passes just fine with a clean envoy checkout (after removing the reloadable flag set), so I think that shows envoy already does retries for large requests on connect failure? bazel test --config=remote //test/integration:integration_test --test_timeout=120 --test_arg=--gtest_filter=IpVersionsAndHttp1Parser/IntegrationTest.EnsureConnectionRetry/IPv4BalsaParser

The test does not pass for me on a clean checkout of main:

[2024-04-11 13:33:49.283][9387219][critical][assert] [test/integration/http_integration.cc:559] assert failure: 0. Details: Timed out waiting for new connection.

I also added a test inside this PR that shows the difference of behaviour: BufferOverflowConnectionRetry https://github.com/DataDog/envoy/blob/3763185bf4ce0fa6a3667fcd11921aa3edcdaf45/test/integration/integration_test.cc#L2488

fl0Lec avatar Apr 11 '24 17:04 fl0Lec

Ok, I looked into this more (thanks for your patience here). The test passed for me, but I downloaded it and patched and edited it again and it is flaky :-) so I see why it didn't work for you. I forgot how much more complicated the upstream path has gotten.

Fundamentally, the router used to be the point at which data was shipped upstream. So it would buffer the data in its own buffers until an upstream was selected, then send data upstream.

Nowadays we create an upstream request immediately, and start sending data to that upstream request immediately, regardless of whether a connection was established.

I'm not a fan of increasing the buffer limits temporarily during connection establishment. It may help your use case, but it adds a lot of complexity, and opens up an OOM vector for everyone who doesn't want a per-connection limit applied on a per-stream basis when upstream connections are slow to establish. It also doesn't help if you have small upstream connection buffers. It masks the real problem, which is that we're willing to accept "infinite" data from the client because we spool it upstream, without knowing the upstream is there. I think a better fix would be to (optionally?) defer sending data upstream until the connection is established, trusting the HCM/watermark chain to keep the buffer small until we know we can safely proxy. If we had a connect failure, we can in that one case retry using the HCM buffer instead of shadow/retry.

I'd like to get @mattklein123 @KBaichoo @botengyao to weigh in on this. I'm unfortunately headed out for a week so won't be around for further conversation or review.

Aside, @botengyao should the fix here be returning StopIterationAndWatermark rather than sendLocalReply? Seems if we've never sent data upstream, we should just delay until we have one. https://github.com/envoyproxy/envoy/commit/3c234770d2b1660fa6f3fdb1f9c3916933ec7b84#diff-e3584a650478334f7373c0df344c85d499b5123f5237ce2bbabaad5ee53b899eR910

alyssawilk avatar Apr 11 '24 18:04 alyssawilk

/assign @botengyao

botengyao avatar Apr 11 '24 18:04 botengyao

Fundamentally, the router used to be the point at which data was shipped upstream. So it would buffer the data in its own buffers until an upstream was selected, then send data upstream. I think a better fix would be to (optionally?) defer sending data upstream until the connection is established, trusting the HCM/watermark chain to keep the buffer small until we know we can safely proxy. If we had a connect failure we can in that one case retry using the HCM buffer instead of shadow/retry.

This all sounds more than reasonable to me. I would just have two concerns with this approach:

  • waiting to get a callback from the upstream to inform the router that it has a connection might decrease Envoy performance. From what I understand, today, once the upstream gets a connection, it can immediately start sending data. With this approach, it would have to wait for the data to be sent by the router and processed by the filters
  • when relying on the HCM buffer plus watermark, the router will exceed the buffer watermark, which will pause reading on the downstream listener. Until then, everything is fine. But once it starts sending the data upstream, right now the router copies the data, which makes sense when a non-connection-failure retry is enabled (not when only connection-failure is enabled; maybe an optimization in the future). So the buffer will stay above the watermark. We cannot really drain it because other filters might be using it. So what would be the path there?

I also have a question regarding the "HCM buffer instead of shadow/retry". From my understanding, they are the same: https://github.com/DataDog/envoy/blob/3763185bf4ce0fa6a3667fcd11921aa3edcdaf45/source/common/router/router.cc#L934. Please bear with me, as this is my first contribution to this repo.

but it adds a lot of complexity, and opens up an OOM vector for everyone who doesn't want a per-connection limit applied on a per-stream basis when upstream connections are slow to establish.

I agree that the workaround that I implemented introduces some memory risk and is not ideal. But I believe the discussion we are having now can bring great benefits.

fl0Lec avatar Apr 12 '24 11:04 fl0Lec

Aside, @botengyao should the fix here be returning StopIterationAndWatermark rather than sendLocalReply? Seems if we've never sent data upstream, we should just delay until we have one. https://github.com/envoyproxy/envoy/commit/3c234770d2b1660fa6f3fdb1f9c3916933ec7b84#diff-e3584a650478334f7373c0df344c85d499b5123f5237ce2bbabaad5ee53b899eR910

I don't think so. Since retry & shadow are disabled and the upstream_request is created while decoding headers, empty here means no more upstream_requests should happen, so we can send a local reply.

botengyao avatar Apr 18 '24 05:04 botengyao

/wait

adisuissa avatar Apr 18 '24 13:04 adisuissa

Aside, @botengyao should the fix here be returning StopIterationAndWatermark rather than sendLocalReply? Seems if we've never sent data upstream, we should just delay until we have one. 3c23477#diff-e3584a650478334f7373c0df344c85d499b5123f5237ce2bbabaad5ee53b899eR910

I don't think so. Since retry & shadow are disabled and the upstream_request is created while decoding headers, empty here means no more upstream_requests should happen, so we can send a local reply.

From my understanding, empty means that we are waiting for a retry to happen (we are in the backoff wait). I believe this issue is linked to https://github.com/envoyproxy/envoy/issues/15211. Ideally we would wait for the retry to happen instead of returning a 507.

fl0Lec avatar Apr 23 '24 13:04 fl0Lec

Sorry, was OOO.

I would also rely on the buffer reaching the watermark to pause the read. So:

  • getting a connection from the downstream
  • trying to get a connection with the upstream
  • the filter buffer gets full (we go over the watermark, which pauses the downstream read)
  • we get a connection

At that point, to go below the watermark, we have the following options:

  1. draining the buffer: not possible, as other filters might also use it.
  2. increasing the limit to sit right above the currently buffered data (see the sketch after this list). Probably the option for now.
  3. do we have other options?
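
A minimal sketch of option 2 (illustrative only; the function name is hypothetical and the real change would go through the filter callbacks):

#include <algorithm>
#include <cstdint>

// Once an upstream connection is acquired while the stream is over the
// watermark, raise the effective limit to sit just above what is already
// buffered, so the stream drops back below the watermark without having to
// drain a buffer that other filters may still be using.
uint64_t limitAfterUpstreamConnect(uint64_t current_limit, uint64_t currently_buffered_bytes) {
  return std::max(current_limit, currently_buffered_bytes + 1);
}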

WDYT @botengyao ?

fl0Lec avatar Apr 23 '24 14:04 fl0Lec

Aside, @botengyao should the fix here be returning StopIterationAndWatermark rather than sendLocalReply? Seems if we've never sent data upstream, we should just delay until we have one. 3c23477#diff-e3584a650478334f7373c0df344c85d499b5123f5237ce2bbabaad5ee53b899eR910

I don't think so. Since retry & shadow are disabled and the upstream_request is created while decoding headers, empty here means no more upstream_requests should happen, so we can send a local reply.

From my understanding, empty means that we are waiting for a retry to happen (we are in the backoff wait). I believe this issue is linked to #15211. Ideally we would wait for the retry to happen instead of returning a 507.

This is not the same issue, I believe; let's go back to this PR ;-)

botengyao avatar Apr 23 '24 18:04 botengyao