quic: propagate default network change event to upstream connections
Commit Message: Upon a default network change on a device, existing QUIC connections should be closed or drained depending on whether they are idle or have in-flight requests. Draining non-idle connections is needed before Envoy Mobile can migrate them to the new default network, to avoid a TOO_MANY_RTO error on them later. This PR does the necessary plumbing to pass the network change event from ConnectivityManager to EnvoyQuicClientSession and the H3 conn_pool so they can act on it.
Additional Description: Introduces new classes QuicNetworkConnectivityObserver and EnvoyQuicNetworkObserverRegistry/EnvoyQuicNetworkObserverRegistryFactory in Envoy so that each QUIC connection can be registered, and extends EnvoyQuicNetworkObserverRegistry in Envoy Mobile to propagate network change events to each connection.
Risk Level: High (changes upstream QUIC connection lifetime)
Testing: New e2e and unit tests
Docs Changes: N/A
Release Notes: Yes
Platform Specific Features: Enabled only on Android
Runtime guard: envoy_reloadable_features_quic_upstream_connection_handle_network_change
CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).
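For readers following along, here is a minimal sketch of the observer/registry relationship the description refers to; the class bodies and method names (`onNetworkChanged`, `registerObserver`, `notifyNetworkChanged`) are assumptions for illustration, not the actual interfaces added by this PR.

```cpp
// Illustrative sketch only: hypothetical shapes for the observer/registry
// classes named in the PR description. Real signatures in the PR may differ.
#include <unordered_set>

// One observer per QUIC connection; the session reacts to default-network
// changes through it (hypothetical single callback).
class QuicNetworkConnectivityObserver {
public:
  virtual ~QuicNetworkConnectivityObserver() = default;
  virtual void onNetworkChanged() = 0; // assumed method name
};

// Registry that connections register with; Envoy Mobile's implementation
// forwards ConnectivityManager events to every registered observer.
class EnvoyQuicNetworkObserverRegistry {
public:
  virtual ~EnvoyQuicNetworkObserverRegistry() = default;

  void registerObserver(QuicNetworkConnectivityObserver& observer) { observers_.insert(&observer); }
  void unregisterObserver(QuicNetworkConnectivityObserver& observer) { observers_.erase(&observer); }

protected:
  // Called by the platform-specific subclass when the default network changes.
  void notifyNetworkChanged() {
    for (QuicNetworkConnectivityObserver* observer : observers_) {
      observer->onNetworkChanged();
    }
  }

private:
  std::unordered_set<QuicNetworkConnectivityObserver*> observers_;
};
```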
/retest
/retest
/retest
/assign @RyanTheOptimist
/retest
/retest
/wait
I split the mechanical plumbing of the registry and registry factory into https://github.com/envoyproxy/envoy/pull/35912.
/retest
Will look later today, but why is this QUIC-only? Shouldn't we have consistent draining for H2 and H1? If not, please update the PR description to explain why it's H3-only.
/retest
Sorry, coming back to large(r) design questions. Now that we agree this needs to be done for all protocols, I'm wondering if we should revisit the approach. Specifically, we already have a function, cluster_manager_.drainConnections, which we call in onDnsResolutionComplete iff we think the DNS resolution was due to a network change and enable_drain_post_dns_refresh_ is true. I suggest that instead of adding a second parallel mechanism we revisit/adapt the existing mechanism, or add a second cluster manager method if we need two (one on network change, one on DNS resolution complete).
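For context, here is a self-contained paraphrase of the existing drain path mentioned above; the names drainConnections, onDnsResolutionComplete, and enable_drain_post_dns_refresh_ come from this comment, but the stubbed types and signatures are placeholders rather than the real Envoy code.

```cpp
// Illustrative paraphrase of the existing DNS-refresh-driven drain flow.
// All types here are stubs; only the control flow mirrors the comment above.
struct ClusterManagerStub {
  void drainConnections() { /* drain existing upstream connections */ }
};

class DnsRefreshDrainer {
public:
  DnsRefreshDrainer(ClusterManagerStub& cluster_manager, bool enable_drain_post_dns_refresh)
      : cluster_manager_(cluster_manager),
        enable_drain_post_dns_refresh_(enable_drain_post_dns_refresh) {}

  // Invoked when a DNS refresh completes. Drains only when the refresh is
  // believed to be caused by a network change and draining is enabled.
  void onDnsResolutionComplete(bool resolution_due_to_network_change) {
    if (enable_drain_post_dns_refresh_ && resolution_due_to_network_change) {
      cluster_manager_.drainConnections();
    }
  }

private:
  ClusterManagerStub& cluster_manager_;
  const bool enable_drain_post_dns_refresh_;
};
```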
I thought about this before. We would need to add 4 new interfaces for network changes, corresponding to the 4 Android NetworkCallback methods we are interested in (onAvailable(), onCapabilitiesChanged(), onLosing(), onLost()), and plumb them through the cluster manager's runOnAllThread() all the way to the connections if we were to follow the same pattern as drainConnections(). Iterating on those interface changes would be a pain. Instead of that massive plumbing, the new approach passes down a registry in the same way as cluster_manager_.drainConnections() and handles these network change callbacks within Envoy Mobile's own implementation of the registry, so less Envoy core code needs to change once the registry is plumbed through.
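To make the trade-off concrete, this is roughly what the four per-connection hooks mirroring Android's NetworkCallback would look like if each were plumbed individually; the names are hypothetical and chosen only to line up with the NetworkCallback methods listed above.

```cpp
// Hypothetical four-method observer mirroring Android's NetworkCallback.
// Each method would need its own plumbing through the cluster manager to
// reach every connection, which is the cost discussed in this comment.
class FourMethodNetworkObserver {
public:
  virtual ~FourMethodNetworkObserver() = default;
  virtual void onNetworkAvailable() = 0;           // NetworkCallback#onAvailable
  virtual void onNetworkCapabilitiesChanged() = 0; // NetworkCallback#onCapabilitiesChanged
  virtual void onNetworkLosing() = 0;              // NetworkCallback#onLosing
  virtual void onNetworkLost() = 0;                // NetworkCallback#onLost
};
```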
Mmm, I hear that, but I think doing a factory (edited: not registry) per protocol is just as complicated, and much less clear to understand what's going on. Also, would we need four functions, or one function with an enum indicating what change took place? It's not like we have an onReadEvent, onWriteEvent, onReadWriteEvent, and onCloseEvent for fds. Ryan, thoughts on the pros and cons of the two approaches? /wait-any
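For comparison, the single-method alternative floated here could look something like the following; the enum values and method name are hypothetical.

```cpp
// Hypothetical single-callback alternative: one method plus an enum that
// says which kind of network change occurred.
enum class NetworkChangeType { Available, CapabilitiesChanged, Losing, Lost };

class SingleMethodNetworkObserver {
public:
  virtual ~SingleMethodNetworkObserver() = default;
  virtual void onNetworkChange(NetworkChangeType change) = 0;
};
```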
Offline, it sounds like we could get away with one non-QUIC-specific registry. I think my take-away from a conversation with Ryan is that I want to wait on the design doc before proceeding with this review =P /wait
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This needs to be sync'ed with main. Not ready for review yet.
/wait
CI seems to be failing. /wait
/wait
I split this PR into smaller ones. The current one simply refines the QuicNetworkConnectivityObserver interface; the rest of the plumbing has moved into a follow-up PR. PTAL
Addressed test coverage. PTAL!
This is core code, so please also wait for Ryan's approval