XDS DeltaVirtualHosts gRPC config stream to xxx closed: grpc: received message larger than max
Title: xDS server sends larger message than max
Description:
I have an envoy configuration that uses RDS with more than 50k routes. Even though I am using DELTA_GRPC, the proxy sometimes ends up unable to receive any new updates, with the error message:
[2024-09-16 17:17:12.814][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] DeltaVirtualHosts gRPC config stream to xds_cluster closed since 105s ago: 8, grpc: received message larger than max (21173076 vs. 4194304)
Locally, I created a gRPC client with a higher receive limit (grpc.MaxCallRecvMsgSize(math.MaxInt32)) and it worked, but I am curious whether this is something we would want to be able to configure on envoy as well.
Any idea why using the delta API is not enough, and it still batches updates bigger than the gRPC client can handle?
I also saw that in grpc-go defaultServerMaxSendMessageSize = math.MaxInt32 while defaultClientMaxReceiveMessageSize = 1024 * 1024 * 4, which is exactly the mismatch that makes the issue appear.
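For reference, a minimal sketch of that local test client, assuming plain grpc-go (the server address and the insecure transport below are placeholders, not taken from this setup):

package main

import (
	"log"
	"math"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// grpc-go's defaultClientMaxReceiveMessageSize is 4 MiB; raising the
	// per-call receive limit lets this test client accept the ~21 MB message
	// that the stream otherwise rejects.
	conn, err := grpc.Dial(
		"xds-server:18000", // hypothetical xDS server address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(math.MaxInt32)),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// ... open a delta xDS stream on conn as usual.
}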
Repro steps:
- Use a simple cache xDS server from https://github.com/envoyproxy/go-control-plane
- Add tens of thousands of routes via RDS (a rough sketch of this step follows below)
- The error message appears
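A rough sketch of that second step, assuming a recent go-control-plane v3 simple cache API (signatures differ across releases; the node ID, route-config name, cluster name, and route count are placeholders):

package main

import (
	"context"
	"fmt"
	"log"

	route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cache "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resource "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	// ads=true, no logger.
	snapshotCache := cache.NewSnapshotCache(true, cache.IDHash{}, nil)

	// Build one RouteConfiguration with ~50k routes so the serialized
	// resource comfortably exceeds the 4 MiB default.
	vh := &route.VirtualHost{Name: "all", Domains: []string{"*"}}
	for i := 0; i < 50000; i++ {
		vh.Routes = append(vh.Routes, &route.Route{
			Name:  fmt.Sprintf("route-%d", i),
			Match: &route.RouteMatch{PathSpecifier: &route.RouteMatch_Prefix{Prefix: fmt.Sprintf("/svc-%d/", i)}},
			Action: &route.Route_Route{Route: &route.RouteAction{
				ClusterSpecifier: &route.RouteAction_Cluster{Cluster: "some_cluster"},
			}},
		})
	}
	rc := &route.RouteConfiguration{
		Name:         "big-route-config", // must match route_config_name in the listener's rds block
		VirtualHosts: []*route.VirtualHost{vh},
	}

	snap, err := cache.NewSnapshot("1", map[resource.Type][]types.Resource{
		resource.RouteType: {rc},
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := snapshotCache.SetSnapshot(context.Background(), "my-node-id", snap); err != nil {
		log.Fatal(err)
	}
	// ... serve snapshotCache with server.NewServer and a gRPC listener as in
	// the go-control-plane examples; the delta update for this snapshot is
	// what trips the receive limit.
}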
Config: envoy.yaml
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: {{ .Values.proxy.config.info_port }}
dynamic_resources:
  ads_config:
    api_type: DELTA_GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
    set_node_on_first_message_only: true
  cds_config:
    resource_api_version: V3
    ads: {}
  lds_config:
    path_config_source:
      path: {{ .Values.configgen.config.lds_path }}
node:
  cluster: envoy-cluster
  id: {{ .Values.global.xdsNodeID }}
static_resources:
  clusters:
    - name: xds_cluster
      type: STRICT_DNS
      connect_timeout: 10s
      load_assignment:
        cluster_name: xds_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: {{ .Values.global.xdsAddress }}
                      port_value: {{ .Values.global.xdsPort }}
      http2_protocol_options: {}
layered_runtime:
  layers:
    - name: runtime-0
      rtds_layer:
        rtds_config:
          resource_api_version: V3
          api_config_source:
            transport_api_version: V3
            api_type: DELTA_GRPC
            grpc_services:
              - envoy_grpc:
                  cluster_name: xds_cluster
        name: runtime-0
lds.yaml
version_info: "0"
resources:
  - "@type": "type.googleapis.com/envoy.config.listener.v3.Listener"
    name: http_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 80
    filter_chains:
      - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: http
              codec_type: AUTO
              server_name: "abc"
              strip_any_host_port: true
              rds:
                route_config_name: "{{ .Route_config_name }}"
                config_source:
                  resource_api_version: V3
                  api_config_source:
                    api_type: DELTA_GRPC
                    transport_api_version: V3
                    grpc_services:
                      - envoy_grpc:
                          cluster_name: xds_cluster
                    set_node_on_first_message_only: true
              http_filters:
                - name: envoy.filters.http.router
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
Logs:
[2024-09-16 17:17:12.814][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] DeltaVirtualHosts gRPC config stream to xds_cluster closed since 105s ago: 8, grpc: received message larger than max (21173076 vs. 4194304)
Based on my knowledge of delta xDS, we could probably split up the subscription requests to avoid hitting the default receive message size limit. But that limit is somewhat arbitrary: I don't recall it showing up in the gRPC spec, and there's nothing to prevent a different gRPC server implementation from choosing a different limit. I think this ends up being a well-meaning default for servers with untrusted clients that trips up systems with trusted clients as their scale grows.
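To make the "split up the subscription requests" idea concrete: the failing message is roughly 21 MB against a 4 MiB limit, about five times over, so the resources would have to be spread across several smaller exchanges. Below is a rough, hypothetical sketch of a stand-alone delta RDS client that subscribes in small named batches rather than to everything at once; the resource names, batch size, node ID, and server address are made up, and Envoy itself has no such batching today:

package main

import (
	"context"
	"fmt"
	"log"

	core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	routeservice "github.com/envoyproxy/go-control-plane/envoy/service/route/v3"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const routeTypeURL = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"

func main() {
	conn, err := grpc.Dial("xds-server:18000", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	stream, err := routeservice.NewRouteDiscoveryServiceClient(conn).DeltaRoutes(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical: route configs named rc-0 ... rc-49999, subscribed in
	// batches of 1000 so each delta exchange stays well under 4 MiB.
	const total, batch = 50000, 1000
	node := &core.Node{Id: "my-node-id", Cluster: "envoy-cluster"}
	var nonce string
	for start := 0; start < total; start += batch {
		// Each request ACKs the previous response (via ResponseNonce) and
		// widens the subscription by one more batch of names.
		req := &discovery.DeltaDiscoveryRequest{Node: node, TypeUrl: routeTypeURL, ResponseNonce: nonce}
		for i := start; i < start+batch && i < total; i++ {
			req.ResourceNamesSubscribe = append(req.ResourceNamesSubscribe, fmt.Sprintf("rc-%d", i))
		}
		if err := stream.Send(req); err != nil {
			log.Fatal(err)
		}
		resp, err := stream.Recv()
		if err != nil {
			log.Fatal(err) // with a 4 MiB receive limit, an oversized reply would surface here
		}
		nonce = resp.Nonce
	}
}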
I am using the https://github.com/envoyproxy/go-control-plane implementation of the xDS server, and it looks like it doesn't split anything up: it just sends all the deltas at once, regardless of size.
In any case, the 4 MB limit on envoy's receiving end seems very low. I haven't used or seen other control-plane implementations (e.g. https://github.com/envoyproxy/java-control-plane), so it might be that only the golang one is problematic and does not split up messages.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.