HTTPRoute intermittently fails to distribute traffic
What is the issue?
When using an HTTPRoute to dynamically redistribute load from one Service to a multicluster mirrored Service, traffic is only intermittently delivered correctly.
How can it be reproduced?
- 2 clusters, east and west, joined by a multicluster link that mirrors appropriately labelled services deployed in west into east
- a Service foo in cluster east (but no deployment to receive traffic)
- a mirrored Service in east called foo-west. This should pass traffic to a deployment of something that will return basic acks e.g. curls.
- an HTTPRoute directing traffic received by parentRef Service foo to backendRef foo-east.
- send traffic to
Logs, error output, etc
Application curl logs:
❯ kubectl exec -it busybox-5cd4968444-zn549 -- wget http://APP.APP.svc.cluster.local/ping -O -
Defaulted container "main" out of: main, linkerd-init (init), linkerd-proxy (init)
Connecting to APP.APP.svc.cluster.local (IPADDR:80)
writing to stdout
written to stdout
☸ non-prod
❯ kubectl exec -it busybox-5cd4968444-zn549 -- wget http://APP.APP.svc.cluster.local/ping -O -
Defaulted container "main" out of: main, linkerd-init (init), linkerd-proxy (init)
Connecting to APP.APP.svc.cluster.local (IPADDR:80)
wget: server returned error: HTTP/1.1 504 Gateway Timeout
command terminated with exit code 1
Proxy sidecar:
[ 853.183882s] INFO ThreadId(01) outbound:proxy{addr=10.100.238.202:80}:service{ns=APP name=APP port=80}: linkerd_proxy_api_resolve::resolve: No endpoints
[ 856.184109s] INFO ThreadId(01) outbound:proxy{addr=10.100.238.202:80}:service{ns=APP name=APP port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 856.184575s] INFO ThreadId(01) outbound:proxy{addr=10.100.238.202:80}:rescue{client.addr=172.27.8.216:48586}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.238.202:80: route default.http: backend Service.APP.APP:80: Service.APP.APP:80: service in fail-fast error.sources=[route default.http: backend Service.APP.APP:80: Service.APP.APP:80: service in fail-fast, backend Service.APP.APP:80: Service.APP.APP:80: service in fail-fast, Service.APP.APP:80: service in fail-fast, service in fail-fast]
output of linkerd check -o short
❯ linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
is running version 24.3.2 but the latest edge version is 24.5.3
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 24.5.1 but the latest edge version is 24.5.3
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-888c96b5b-7pwmc (edge-24.5.1)
* linkerd-destination-888c96b5b-hl54h (edge-24.5.1)
* linkerd-destination-888c96b5b-vn62f (edge-24.5.1)
* linkerd-identity-56bbfdc7b6-2cfhj (edge-24.5.1)
* linkerd-identity-56bbfdc7b6-f9bvq (edge-24.5.1)
* linkerd-identity-56bbfdc7b6-h67sk (edge-24.5.1)
* linkerd-proxy-injector-68c6b7bc6-5vxm6 (edge-24.5.1)
* linkerd-proxy-injector-68c6b7bc6-hgmks (edge-24.5.1)
* linkerd-proxy-injector-68c6b7bc6-l45wh (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
linkerd-destination-888c96b5b-7pwmc running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
some proxies are not running the current version:
* collector-7db4655-sdwth (edge-24.5.1)
* jaeger-5c4c9ff587-5c729 (edge-24.5.1)
* jaeger-injector-6cb867b4f8-5mhnd (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
collector-7db4655-sdwth running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-db8857cf8-mfw6c (edge-24.5.1)
* metrics-api-db8857cf8-p59sg (edge-24.5.1)
* metrics-api-db8857cf8-wxm87 (edge-24.5.1)
* tap-6d6cf4c465-2rzj8 (edge-24.5.1)
* tap-6d6cf4c465-8bshr (edge-24.5.1)
* tap-6d6cf4c465-bg6sd (edge-24.5.1)
* tap-injector-66c6f694f4-7rwx4 (edge-24.5.1)
* tap-injector-66c6f694f4-9hjpw (edge-24.5.1)
* tap-injector-66c6f694f4-vqw6r (edge-24.5.1)
* web-56d54f864d-82jcp (edge-24.5.1)
* web-56d54f864d-j4vbv (edge-24.5.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-db8857cf8-mfw6c running edge-24.5.1 but cli running edge-24.3.2
see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes v1.29.3
- EKS cluster
- Bottlerocket nodes
- Cilium CNI in AWS VPC replacement mode
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
maybe
Folding this into #12610
Hi @Sierra1011. That error log from the proxy indicates that it doesn't have any endpoints in the Service.APP.APP:80 backend service to route to. Can you confirm that the service exists and that it has endpoints? You can use kubectl get service and kubectl get endpoints to confirm this. You can also use the linkerd diagnostics endpoints command to see Linkerd's view of what endpoints the service has, if any.
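For reference, the three checks suggested above could be assembled roughly as follows. This is only a minimal sketch: the helper name endpoint_check_commands is hypothetical, cluster.local is assumed as the cluster domain (it matches the hostnames in the logs above), and the commands are built and printed rather than executed.

```python
import shlex
from typing import List


def endpoint_check_commands(name: str, namespace: str, port: int,
                            cluster_domain: str = "cluster.local") -> List[List[str]]:
    """Build (but do not run) the checks for one Service.

    Hypothetical helper; cluster_domain is an assumption, not taken from the issue.
    """
    # linkerd diagnostics endpoints expects an authority of the form host:port.
    authority = f"{name}.{namespace}.svc.{cluster_domain}:{port}"
    return [
        ["kubectl", "get", "service", name, "-n", namespace],
        ["kubectl", "get", "endpoints", name, "-n", namespace],
        ["linkerd", "diagnostics", "endpoints", authority],
    ]


# Print the commands for the redacted APP service in the APP namespace.
for argv in endpoint_check_commands("APP", "APP", 80):
    print(shlex.join(argv))
```

Running the printed commands against both the original Service and the mirrored one should show whether Kubernetes and Linkerd agree on which endpoints exist.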
Hi @adleong, I'll set up a test similar to the one described in #12610 to troubleshoot it exactly, hopefully today if nothing is on fire :crossed_fingers:
So, I deployed a full stack of emojivoto (emoji, voting, vote-bot, web) in cluster 1, and a deployment of emoji to cluster 2, with a service mirrored to cluster 1. You're right; there are no endpoints shown for the mirrored emoji service, and if I scale down the original emoji deployment, no endpoints are shown at all.
Playing around running curl to emoji while running linkerd viz tap on the respective deployments showed that it was at least hitting the relevant deployments.
So, that seems to be working fine, but I'm not in a position to go back and reimplement our app as it was when I raised this as an issue (having received a ton of 5xx errors). I'll try it elsewhere and come back with some more info.
OK, so it's been a fairly slow chase down on this I'm afraid.
So, I'm going to talk in real terms rather than the emojivoto service I'm deploying for funsies. I have some deployments with services on one cluster; let's call them monolith and legacy-assets and they live in the monolith namespace. monolith depends on legacy-assets being reachable in order to start up.
I'm migrating the deployment of services from one cluster to a new cluster which is called eks-non-prod-primary. Standard A to B stuff.
My intention is to use pod-to-pod multicluster from Linkerd and HTTPRoutes to avoid changing config in the actual app; I can just create the HTTPRoute and dynamically move traffic from the in-cluster service to the new cluster.
So I deploy legacy-assets to the new cluster. It's got remote-discovery enabled, so it creates a Service called legacy-assets-eks-non-prod-primary in the monolith namespace. I make my HTTPRoute:
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: legacy-assets
  namespace: monolith
spec:
  parentRefs:
  - group: core
    kind: Service
    name: legacy-assets
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: legacy-assets-eks-non-prod-primary
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: legacy-assets
      port: 80
      weight: 0
    matches:
    - path:
        type: PathPrefix
        value: /
What should happen is all traffic goes to the other cluster. But what actually happens is I get HTTP 500 responses.
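To spell out the expectation: a minimal sketch of the split those weights imply, assuming the usual Gateway API semantics where each backend's share is its weight divided by the sum of all weights. The function name traffic_shares is hypothetical, and raising an error when every weight is zero is a simplification of this sketch, not Linkerd's behaviour.

```python
from typing import Dict


def traffic_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Return the expected fraction of requests per backend.

    Assumes share = weight / sum(weights); a backend with weight 0 gets nothing.
    """
    total = sum(weights.values())
    if total <= 0:
        # Simplification for this sketch: refuse an all-zero weight set.
        raise ValueError("at least one backend needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Weights taken from the HTTPRoute above.
shares = traffic_shares({
    "legacy-assets-eks-non-prod-primary": 100,
    "legacy-assets": 0,
})
print(shares)
```

With 100 and 0, the mirrored Service gets a share of 1.0 and the in-cluster Service 0.0, which is why every request depends on legacy-assets-eks-non-prod-primary being resolvable.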
I got this from the linkerd-proxy container (adding line breaks for legibility purposes):
outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}:
linkerd_app_core::errors::respond:
HTTP/1.1 request failed error=logical service 10.100.127.47:80:
route HTTPRoute.monolith.legacy-assets: backend default.fail:
HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary
error.sources=[route HTTPRoute.monolith.legacy-assets:
backend default.fail: HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary, backend default.fail:
HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary,
HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary]
(and in one line to preserve the full error from logs)
outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.127.47:80: route HTTPRoute.monolith.legacy-assets: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary error.sources=[route HTTPRoute.monolith.legacy-assets: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary, backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary, HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary]
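When scanning a lot of proxy output, the useful parts of a line like this (the route, the status, and the Service that could not be found) can be pulled out with a few regular expressions. This is a rough sketch only: the patterns are inferred from the single log line above, not from any documented log format, and summarize_proxy_error is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Patterns inferred from the log line above; the format is an assumption.
ROUTE_RE = re.compile(r"route (HTTPRoute\.[a-z0-9-]+\.[a-z0-9-]+):")
STATUS_RE = re.compile(r"configured to fail with (\d{3})")
MISSING_RE = re.compile(r"Service not found ([a-z0-9-]+)")


def summarize_proxy_error(line: str) -> Optional[Dict[str, str]]:
    """Extract route, status and missing Service from one proxy log line.

    Returns None when the line does not mention a missing Service.
    """
    missing = MISSING_RE.search(line)
    if missing is None:
        return None
    route = ROUTE_RE.search(line)
    status = STATUS_RE.search(line)
    return {
        "route": route.group(1) if route else "",
        "status": status.group(1) if status else "",
        "missing_service": missing.group(1),
    }


# Abbreviated copy of the log line above (error.sources omitted).
sample = (
    "outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}: "
    "linkerd_app_core::errors::respond: HTTP/1.1 request failed "
    "error=logical service 10.100.127.47:80: route HTTPRoute.monolith.legacy-assets: "
    "backend default.fail: HTTP request configured to fail with 500 Internal Server Error: "
    "Service not found legacy-assets-eks-non-prod-primary"
)
print(summarize_proxy_error(sample))
```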
The only thing I really have to go on is that we don't have nativeSidecar enabled on these old clusters, whereas the new ones do. As the pod starts, the container immediately queries the service, but if the proxy isn't ready it fails with generic networking issues.
Any suggestions to get more info out of it?
Alright, I'll hold my hands up here and say there may be a big old "but" here - I upgraded to 24.5.5 a few days ago and saw that it made its way to the top environment without issue. However, it actually got stuck on that particular cluster.
Having fixed it so we're running a later version of edge (I saw in #12610 a fix mentioned) we now are no longer seeing this error. Please ignore me while I continue testing this on the actual latest version - if I have any issues I'll come back to it.