envoy quic: propagate default network change event to upstream connections

Commit Message: Upon default network changes on devices, existing QUIC connections should be closed or drained depending on whether it is idle or not (having in-flight requests). Draining the non-idle connection is needed before E-M is able to migrate those connections to the new default network to avoid TOO_MANY_RTO error on them later. This PR does the necessary plumbing to pass the network change event from ConnectivityManager to EnvoyQuicClientSession and H3 conn_pool for them to act upon.

Additional Description: Introduce a new classes QuicNetworkConnectivityObserver and EnvoyQuicNetworkObserverRegistry/EnvoyQuicNetworkObserverRegistryFactory in Envoy to register each QUIC connection. And extends EnvoyQuicNetworkObserverRegistry in mobile to propagate network change events to each connections.

Risk Level: high, change upstream QUIC connection lifetime Testing: new e2e and unit tests Docs Changes: N/A Release Notes: Yes Platform Specific Features: enabled only on Android Runtime guard: envoy_reloadable_features_quic_upstream_connection_handle_network_change

Aug 21 '24 17:08 danzh2010

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/35775 was opened by danzh2010.

see: more, trace.

Aug 21 '24 17:08 repokitteh-read-only[bot]

/retest

Aug 22 '24 14:08 danzh2010

/retest

Aug 22 '24 19:08 danzh2010

/retest

Aug 22 '24 21:08 danzh2010

/assign @RyanTheOptimist

Aug 23 '24 19:08 danzh2010

/retest

Aug 23 '24 20:08 danzh2010

/retest

Aug 23 '24 22:08 danzh2010

/wait

Aug 27 '24 19:08 RyanTheOptimist

I split the mechanical plumbing of register and register factory into the https://github.com/envoyproxy/envoy/pull/35912.

Aug 29 '24 19:08 danzh2010

/retest

Sep 09 '24 14:09 danzh2010

Will look later today but why is this QUIC-only? Shouldn't we have consistent draining for H2 and H1? if not please update PR description to explain why H3 only.

Sep 09 '24 14:09 alyssawilk

/retest

Sep 11 '24 15:09 danzh2010

sorry coming back to large(r) design questions Now that we agree this needs to be done for all protocols, I'm wondering if we should revisit the approach. Specifically we already have a function cluster_manager_.drainConnections which we call on onDnsResolutionComplete, iff we think the dns resolution was due to a network change and enable_drain_post_dns_refresh_ is true I suggest instead of adding a second parallel mechanism we revisit/adapt the existing mechanism, or add a second cluster manager method if we need two (one on network change, one on DNS resolution complete)

I thought about this before. We would need to add 4 new interfaces for network changes corresponding to the 4 Android NetworkCallback methods that we are interested in (onAvailable(), onCapabilitiesChanged(), onLosing(), onLost()) and plumb them thru cluster manager runOnAllThread() all the way to connections if we were to follow the same pattern of drainConnections(). Iteration on the interfaces changes will be a pain. Instead of the massive plumbing, the new approach is passing down a registry in the same way as cluster_manager_.drainConnections() and handles these network change callbacks within the mobile's own implementation of the registry and thus less Envoy core code modification once the registry is plumbed through.

Sep 12 '24 17:09 danzh2010

Mmm, I hear that, but I think doing a factory (edited: not registry) per protocol is as complicated, and much less clear to understand what's going on. Also would we need four functions or one function with an enum indicating what change took place? it's not like we have a onReadEvent onWriteEvent onReadWriteEvent onCloseEvent for fds. Ryan, thoughts on pros and cons of the two approaches? /wait-any

Sep 12 '24 19:09 alyssawilk

Offline, sounds like we could get away with one non-quic-specific registry. I think my take-away from a conversation with Ryan is I want to wait on the design doc before proceeding with this review =P /wait

Sep 12 '24 19:09 alyssawilk

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Oct 12 '24 20:10 github-actions[bot]

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Oct 20 '24 00:10 github-actions[bot]

This needs to be sync'ed with main. Not ready for review yet.

Aug 27 '25 18:08 danzh2010

/wait

Sep 09 '25 15:09 RyanTheOptimist

CI seems to be failing /wait

Sep 15 '25 21:09 RyanTheOptimist

/wait

Sep 16 '25 14:09 danzh2010

I split this PR into smaller ones. The current one is simply refining QuicNetworkConnectivityObserver interface. The rest of the plumbing is moved into a follow up PR. PTAL

Oct 09 '25 18:10 danzh2010

Addressed test coverage. PTAL!

Oct 10 '25 00:10 danzh2010

This is core code, so please also wait for Ryan's approval

Oct 10 '25 13:10 abeyad