Add support for OD-CDS concurrent with SoTW CDS
Add support for OD-CDS concurrent with SoTW CDS
Currently, SoTW CDS updates delete all OD-CDS delivered clusters. This cluster removal is unnecessary, causes extra load on management servers, and was the cause of tricky to debug cluster_not_found responses on SoTW updates.
There's been similar concerns & solutions for related SoTW CDS interactions:
- xDS-TP (https://github.com/envoyproxy/envoy/pull/41117)
- DynamicForwardProxy (https://github.com/envoyproxy/envoy/issues/35171#issuecomment-2228235125)
Example
Consider the following setup with both static clusters & ODCDS:
# bootstrap file referencing static clusters
dynamic_resources:
cds_config:
path_config_source:
path: /envoy/configs/cds.yaml
resource_api_version: V3
lds_config:
path_config_source:
path: /envoy/configs/lds.yaml
resource_api_version: V3
# cds.yaml w static clusters A,B,C
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
name: A
...
# lds.yaml route with OnDemand CDS for clusters D,E,F
...
- match: ...
route:
weighted_clusters:
clusters:
- name: D
weight: 100
typed_per_filter_config:
envoy.filters.http.on_demand:
"@type": type.googleapis.com/envoy.extensions.filters.http.on_demand.v3.PerRouteConfig
odcds: ...
timeout: 45s
- Envoy starts and loads clusters A,B,C from
cds.yaml - Over time, Envoy discovers clusters D,E,F via ODCDS
- SoTW CDS update occurs. This can happen by replacing the
lds.yamlfile. - SoTW CDS removes ODCDS clusters D,E,F
Proposal
I have a few independent high level ideas to make this work:
- Extend https://github.com/envoyproxy/envoy/pull/41117, making SoTW CDS only able to remove its own clusters.
- Have ODCDS add Clusters similar to DFP, piping
avoid_cds_removalthrough to thecds_api_helper. - Add explicit owner tracking in the Cluster Manager. SoTW CDS, ODCDS, DFP, etc use a unique identifier when interacting with the Cluster Manager. The Cluster Manager rejects requests related to clusters with a mismatched owner.
I'm in favor of option 1 since it's a straightforward change and implements what the intended behavior of SoTW CDS should be.
@adisuissa, related to https://github.com/envoyproxy/envoy/pull/41117
I believe this is more challenging than the description above. Each mux needs to track its own set of subscriptions, and when the cluster manager will merge the view from all the muxes. It will require proper implementation of the xDS-reources manager, and have a cache for each mux.
Does the same concern apply to enabling resource tracking for the SoTW API, cds_api_impl.cc similar to https://github.com/envoyproxy/envoy/pull/41117, either config flagged or by default? This solves the mentioned SoTW <> ODCDS interaction and is a more straightforward approach.
I moved the Cluster owner idea to the proposal section for clarity as it was just one idea for solving the SoTW <> ODCDS interaction.
The goal is to have multiple xDS services each owning their resources, and sending the updates to the relevant Envoy components. The Envoy components should not "know" whether SotW or delta is used.
My high level goal is to prevent CDS SotW from removing OD CDS clusters. https://github.com/envoyproxy/envoy/pull/41117 does exactly this but only for ODCDS over xDS-TP (support_multi_ads_sources enables tracking for CdsApi but is only true when xDS-TP is enabled).
Shouldn't CdsApi also track in other cases since ODCDS over traditional xDS faces the same problem.
// In case cds_config is configured and the new xDS-TP configs are used,
// then the CdsApi will need to track the resources, as the xDS-TP configs
// may be used for OD-CDS. If this is not set, the SotW update may override
// the OD-CDS resources.
const bool support_multi_ads_sources =
bootstrap.has_default_config_source() || !bootstrap.config_sources().empty();
Let me know if I'm misunderstanding something.
Hey @adisuissa, following up on this. Thanks
The proper way to solve this is to have per-source cache that will remove the clusters that it previously added. The issue is that currently SotW doesn't keep that state. I'm not saying it's not possible, but it will require a design first that works well with #24773.
I see what you mean. Is it not possible to leverage sotw_resource_names_ here though or is that only meant to be temporary and not the proper way to handle it?