Istio sidecars flood CoreDNS because of ExternalName services with ports that Knative creates for inference services
What version of Knative?
0.23.3
Summary
Knative creates ExternalName services for each inference service to redirect traffic to the Istio IngressGateway. For each such service, all Istio sidecars try to resolve the specified DNS target every 5 seconds, and as a result CoreDNS gets flooded. This is even worse on EKS, where you have ndots: 5 and ec2.internal in the search domains: each DNS query turns into five, one of which gets forwarded to AWS nameservers outside the cluster.
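For context, a pod's /etc/resolv.conf on EKS in us-east-1 looks roughly like this (a sketch; the nameserver IP and namespace vary per cluster):

```
nameserver 10.52.0.2
search my-namespace.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
```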
Steps to Reproduce the Problem
- We use Knative Serving 0.23.3 with Istio 1.9.6 and KFServing 0.6.1.
- We create an InferenceService with a transformer and a predictor.
- KFServing creates an ExternalName service without a ports configuration. In our setup it points to knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.
- Knative creates two ExternalName services with a ports configuration (see https://github.com/knative/serving/commit/09986741f0bc6e369ed99370728ad715054656d5). In our setup they point to knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local (a sketch of one such Service is shown after this list).
- Istio creates a STRICT_DNS cluster only for ExternalName services with a ports configuration (see also https://github.com/istio/istio/issues/23463, https://github.com/istio/istio/issues/37331).
- All Istio sidecars (Envoy) running in the cluster resolve the DNS target of each STRICT_DNS cluster every 5 seconds (see https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/service_discovery#strict-dns). In our setup this target is knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.
- On EKS, because of ndots: 5, each resolution results in 5 DNS requests (4 NXDOMAIN, one per search domain, plus one NOERROR) (see also https://discuss.istio.io/t/flood-of-nxdomain-lookups-to-coredns-from-istio-sidecar/11588).
- On EKS, because the last search domain is ec2.internal (on us-east-1), one of the above DNS requests is forwarded to AWS nameservers, which respond with NXDOMAIN.
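For illustration, a hypothetical sketch of the kind of ExternalName Service in question (the name, namespace, and port values are placeholders, not what Knative literally generates); the ports block is what causes Istio to model it as a STRICT_DNS cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-inference-service    # placeholder name
  namespace: my-namespace       # placeholder namespace
spec:
  type: ExternalName
  externalName: knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local
  # With a ports section present, Istio builds a STRICT_DNS cluster for this
  # Service; without ports, no such cluster (and no 5s re-resolution) exists.
  ports:
    - name: http2
      port: 80
```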
In a cluster with many pods running Istio sidecars and many inference services, CoreDNS gets flooded with DNS requests. We have seen it go into CrashLoopBackOff and hit i/o timeouts when talking to AWS nameservers:
[ERROR]: plugins/error 2 knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local.ec2.internal. A: read udp 10.52.76.80:XXX -> 10.52.0.2:53: i/o timeout
Any news on this? @mattmoor What is the rationale for creating ExternalName services with ports (see https://github.com/knative/serving/commit/09986741f0bc6e369ed99370728ad715054656d5)? Should we consider the observed behavior an Istio bug, i.e. the fact that ExternalName services with ports are handled differently and we end up with tons of DNS queries from Istio sidecars?
We are having a similar issue; it would be great to get this fixed.

I'm observing the same issue in Istio without Knative. Seems like an Istio bug IMO
Thinking straightforwardly: if ExternalName services with ports are the culprit, would removing the port config resolve this?
@kyue1005 Removing the port definition on the ExternalName Service does prevent this behavior, yes.
However, when you have Istio mTLS configured in STRICT mode, the DestinationRules won't have a port defined for those ExternalNames, and traffic to those ExternalName Services will not work. Defining a port on the ExternalName allows this to function, but Istio then goes nuts resolving those names constantly, effectively DDoS'ing CoreDNS.
My temporary solution is to use a fully qualified domain name for my local gateway address to avoid the ndots search issue. It relieves the DNS load a bit, but the root cause still lies in the STRICT_DNS behavior; I hope there will be a fix for that soon.
This is the exact issue we are running into. It seems that at a certain number of KServices, say 100, we start to see DNS failures and CoreDNS crushing under the load. Strict mTLS is a requirement of our product. Any ideas on where and how we can target a fix?
@kyue1005 where did you configure the fully qualified domain for the local gateway address? We're seeing a similar issue although we don't have sidecars in our environment.
@daraghlowe I updated config-istio as below:
local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local."
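For reference, a minimal sketch of that change, assuming the net-istio config-istio ConfigMap in the knative-serving namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-istio
  namespace: knative-serving
data:
  # The trailing dot makes the name fully qualified, so resolvers skip the
  # ndots:5 search-domain expansion (no extra NXDOMAIN lookups).
  local-gateway.knative-serving.knative-local-gateway: "knative-local-gateway.istio-system.svc.cluster.local."
```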
@kyue1005 Hi! Does the change above need a full restart of the istio-proxy on the target pods? (so like delete/respawn of the pod or similar)
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
I came across Istio's Smart DNS Proxy feature. The underlying issue of ExternalName service DNS resolution every 5 seconds will still be there, but it appears that after installing the Smart DNS Proxy the DNS queries are significantly reduced, even with ndots:5 in /etc/resolv.conf. Any thoughts?
https://istio.io/latest/blog/2020/dns-proxy/ https://istio.io/latest/docs/ops/configuration/traffic-management/dns-proxy/
"With Istio’s implementation of the CoreDNS style auto-path technique, the sidecar agent will detect the real hostname being queried within the first query and return a cname record to productpage.ns1.svc.cluster.local as part of this DNS response, as well as the A/AAAA record for productpage.ns1.svc.cluster.local. The application receiving this response can now extract the IP address immediately and proceed to establishing a TCP connection to that IP. The smart DNS proxy in the Istio agent dramatically cuts down the number of DNS queries from 12 to just 2!"
Is there an istio issue to track this perf problem?
edit - I just made one asking for recommendations - https://github.com/istio/istio/issues/44169
Hey folks, it was pointed out here that the 5s sync interval is configurable: https://github.com/istio/istio/issues/44169#issuecomment-1489830835
The refresh interval should arguably be 30s (the CoreDNS TTL).
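A minimal sketch of what that would look like, assuming the knob is meshConfig.dnsRefreshRate (which controls Envoy's STRICT_DNS re-resolution interval):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # Envoy re-resolves STRICT_DNS cluster targets at this interval
    # (the thread above reports a 5s default); 30s matches the CoreDNS TTL.
    dnsRefreshRate: 30s
```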
The local-gateway workaround is probably of limited use. Can the ExternalName be removed?
Since Istio supports local DNS, if the external name is knative-serving-cluster-ingressgateway.knative-serving.svc.cluster.local (a ClusterIP Kubernetes Service), the sidecar itself can serve the DNS response, with no redirecting to CoreDNS.