linkerd2 HTTPRoute intermittently fails to distribute traffic

What is the issue?

When using an HTTPRoute to dynamically shift load from one Service to a multicluster-mirrored Service, requests succeed only intermittently; the rest fail with a 504 Gateway Timeout.

How can it be reproduced?

2 clusters, east and west, joined by a multicluster link that mirrors appropriately labelled services deployed in west into east
a Service foo in cluster east (but no deployment to receive traffic)
a mirrored Service in east called foo-west. This should pass traffic to a deployment in west that returns a basic acknowledgement, e.g. in response to curl requests.
an HTTPRoute directing traffic received by parentRef Service foo to backendRef foo-west.
send traffic to

Logs, error output, etc

Application curl logs:

❯ kubectl exec -it busybox-5cd4968444-zn549 -- wget http://APP.APP.svc.cluster.local/ping -O -
Defaulted container "main" out of: main, linkerd-init (init), linkerd-proxy (init)
Connecting to APP.APP.svc.cluster.local (IPADDR:80)
writing to stdout
written to stdout

☸ non-prod
❯ kubectl exec -it busybox-5cd4968444-zn549 -- wget http://APP.APP.svc.cluster.local/ping -O -
Defaulted container "main" out of: main, linkerd-init (init), linkerd-proxy (init)
Connecting to APP.APP.svc.cluster.local (IPADDR:80)
wget: server returned error: HTTP/1.1 504 Gateway Timeout
command terminated with exit code 1
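
To put a number on "intermittently" rather than retrying by hand, the request above could be repeated in a loop. This is only a sketch: `tally_probe` is a hypothetical helper name, not part of kubectl or Linkerd, and it simply counts the exit codes of whatever command it is given.

```shell
# Run a probe command N times and report how many attempts succeeded or failed.
# Usage: tally_probe <attempts> <command> [args...]
# (tally_probe is an illustrative helper, not an existing tool.)
tally_probe() {
  attempts="$1"
  shift
  ok=0
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Count the attempt as a success only if the command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=$ok failed=$failed"
}
```

For example, `tally_probe 20 kubectl exec busybox-5cd4968444-zn549 -- wget -q -O /dev/null http://APP.APP.svc.cluster.local/ping` would run the same request 20 times and print the success and failure counts.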

Proxy sidecar:

[   853.183882s]  INFO ThreadId(01) outbound:proxy{addr=10.100.238.202:80}:service{ns=APP name=APP port=80}: linkerd_proxy_api_resolve::resolve: No endpoints
[   856.184109s]  INFO ThreadId(01) outbound:proxy{addr=10.100.238.202:80}:service{ns=APP name=APP port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[   856.184575s]  INFO ThreadId(01) outbound:proxy{addr=10.100.238.202:80}:rescue{client.addr=172.27.8.216:48586}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.238.202:80: route default.http: backend Service.APP.APP:80: Service.APP.APP:80: service in fail-fast error.sources=[route default.http: backend Service.APP.APP:80: Service.APP.APP:80: service in fail-fast, backend Service.APP.APP:80: Service.APP.APP:80: service in fail-fast, Service.APP.APP:80: service in fail-fast, service in fail-fast]

output of `linkerd check -o short`

❯ linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 24.3.2 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-888c96b5b-7pwmc (edge-24.5.1)
        * linkerd-destination-888c96b5b-hl54h (edge-24.5.1)
        * linkerd-destination-888c96b5b-vn62f (edge-24.5.1)
        * linkerd-identity-56bbfdc7b6-2cfhj (edge-24.5.1)
        * linkerd-identity-56bbfdc7b6-f9bvq (edge-24.5.1)
        * linkerd-identity-56bbfdc7b6-h67sk (edge-24.5.1)
        * linkerd-proxy-injector-68c6b7bc6-5vxm6 (edge-24.5.1)
        * linkerd-proxy-injector-68c6b7bc6-hgmks (edge-24.5.1)
        * linkerd-proxy-injector-68c6b7bc6-l45wh (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-888c96b5b-7pwmc running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
        * collector-7db4655-sdwth (edge-24.5.1)
        * jaeger-5c4c9ff587-5c729 (edge-24.5.1)
        * jaeger-injector-6cb867b4f8-5mhnd (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
    collector-7db4655-sdwth running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-db8857cf8-mfw6c (edge-24.5.1)
        * metrics-api-db8857cf8-p59sg (edge-24.5.1)
        * metrics-api-db8857cf8-wxm87 (edge-24.5.1)
        * tap-6d6cf4c465-2rzj8 (edge-24.5.1)
        * tap-6d6cf4c465-8bshr (edge-24.5.1)
        * tap-6d6cf4c465-bg6sd (edge-24.5.1)
        * tap-injector-66c6f694f4-7rwx4 (edge-24.5.1)
        * tap-injector-66c6f694f4-9hjpw (edge-24.5.1)
        * tap-injector-66c6f694f4-vqw6r (edge-24.5.1)
        * web-56d54f864d-82jcp (edge-24.5.1)
        * web-56d54f864d-j4vbv (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-db8857cf8-mfw6c running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes v1.29.3
EKS cluster
Bottlerocket nodes
Cilium CNI in AWS VPC replacement mode

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

May 16 '24 15:05 Sierra1011

Folding this into #12610

May 16 '24 17:05 olix0r

Hi @Sierra1011. That error log from the proxy indicates that it doesn't have any endpoints to route to in the Service.APP.APP:80 backend service. Can you confirm that the service exists and that it has endpoints? You can use `kubectl get service` and `kubectl get endpoints` to confirm this. You can also use the `linkerd diagnostics endpoints` command to see Linkerd's view of what endpoints the service has, if any.
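
The three checks above could be wrapped in one shell function. This is a rough sketch: `show_endpoints` is a hypothetical helper name, and it assumes the cluster domain is `cluster.local`, as in the hostnames earlier in this issue.

```shell
# Compare the Kubernetes view and Linkerd's view of a Service's endpoints.
# Usage: show_endpoints <namespace> <service> <port>
# (show_endpoints is an illustrative helper; cluster.local is assumed.)
show_endpoints() {
  ns="$1"
  svc="$2"
  port="$3"

  # Does the Service exist, and does Kubernetes list endpoints for it?
  kubectl get service "$svc" -n "$ns"
  kubectl get endpoints "$svc" -n "$ns"

  # What Linkerd's destination service resolves for the same authority.
  linkerd diagnostics endpoints "$svc.$ns.svc.cluster.local:$port"
}
```

With the placeholder names from the logs this would be called as `show_endpoints APP APP 80`.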

May 21 '24 23:05 adleong

Hi @adleong, I'll set up a test similar to the one described in #12610 to troubleshoot it exactly, hopefully today if nothing is on fire :crossed_fingers:

May 22 '24 08:05 Sierra1011

So, I deployed a full stack of emojivoto (emoji, voting, vote-bot, web) in cluster 1, and a deployment of emoji to cluster 2, with a service mirrored to cluster 1. You're right; there are no endpoints shown for the mirrored emoji service, and if I scale down the original emoji deployment, no endpoints are shown at all.

Running curl against emoji while running `linkerd viz tap` on the respective deployments showed that traffic was at least reaching the relevant deployments.

So, that seems to be working fine. I'm not in a position to go back and reimplement our app as it was when I raised this issue (having received a ton of 5xx errors), but I'll try it elsewhere and come back with some more info.

May 22 '24 15:05 Sierra1011

OK, so it's been a fairly slow chase down on this I'm afraid.

So, I'm going to talk in real terms rather than the emojivoto service I'm deploying for funsies. I have some deployments with services on one cluster; let's call them monolith and legacy-assets and they live in the monolith namespace. monolith depends on legacy-assets being reachable in order to start up. I'm migrating the deployment of services from one cluster to a new cluster which is called eks-non-prod-primary. Standard A to B stuff.

My intention is to use pod-to-pod multicluster from Linkerd and HTTPRoutes to avoid changing config in the actual app; I can just create the HTTPRoute and dynamically move traffic from the in-cluster service to the new cluster.

So I deploy legacy-assets to the new cluster. It's got remote-discovery enabled, so it creates a Service called legacy-assets-eks-non-prod-primary in the monolith namespace. I make my HTTPRoute:

apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: legacy-assets
  namespace: monolith
spec:
  parentRefs:
  - group: core
    kind: Service
    name: legacy-assets
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: legacy-assets-eks-non-prod-primary
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: legacy-assets
      port: 80
      weight: 0
    matches:
    - path:
        type: PathPrefix
        value: /

What should happen is all traffic goes to the other cluster. But what actually happens is I get HTTP 500 responses.

I got this from the linkerd-proxy container (adding line breaks for legibility purposes):

outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}: 
linkerd_app_core::errors::respond: 
HTTP/1.1 request failed error=logical service 10.100.127.47:80: 
route HTTPRoute.monolith.legacy-assets: backend default.fail: 
HTTP request configured to fail with 500 Internal Server Error: 
Service not found legacy-assets-eks-non-prod-primary 
error.sources=[route HTTPRoute.monolith.legacy-assets: 
backend default.fail: HTTP request configured to fail with 500 Internal Server Error: 
Service not found legacy-assets-eks-non-prod-primary, backend default.fail: 
HTTP request configured to fail with 500 Internal Server Error: 
Service not found legacy-assets-eks-non-prod-primary, 
HTTP request configured to fail with 500 Internal Server Error: 
Service not found legacy-assets-eks-non-prod-primary]

(and in one line to preserve the full error from logs)

outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.127.47:80: route HTTPRoute.monolith.legacy-assets: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary error.sources=[route HTTPRoute.monolith.legacy-assets: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary, backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary, HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary]
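
Because the same cause is repeated several times in each log line, a small filter can reduce the proxy logs to the distinct missing backends. This is a sketch only: `missing_backends` is a hypothetical helper name, and it assumes the message wording shown above ("Service not found <name>") and a `grep` that supports `-o`.

```shell
# Print each distinct "Service not found <name>" message read from stdin.
# (missing_backends is an illustrative helper; the pattern assumes the log
# wording shown in this issue and Service names made of a-z, 0-9, '.' and '-'.)
missing_backends() {
  grep -o 'Service not found [a-z0-9.-]*' | sort -u
}
```

It could be fed from the sidecar, e.g. `kubectl logs <pod> -n monolith -c linkerd-proxy | missing_backends`, and each reported name then checked with `kubectl get service <name> -n monolith`.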

The only thing I really have to go on is that we don't have nativeSidecar enabled on these old clusters, and the new ones do. As the pod starts, the container is immediately querying the service, but if the proxy isn't ready it fails with generic networking issues.

Any suggestions to get more info out of it?

Jun 06 '24 17:06 Sierra1011

Alright, I'll hold my hands up here and say there may be a big old "but" here - I upgraded to 24.5.5 a few days ago and saw that it made its way to the top environment without issue. However, it actually got stuck on that particular cluster.

Having fixed it so we're running a later version of edge (I saw a fix mentioned in #12610), we are no longer seeing this error. Please ignore me while I continue testing this on the actual latest version - if I have any issues I'll come back to it.
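
To spot a cluster where an upgrade has stalled, the proxy version of each pod can be listed. This is a hedged sketch: `proxy_versions` is a hypothetical helper name, and it assumes meshed pods carry the `linkerd.io/proxy-version` annotation added by the proxy injector.

```shell
# List each pod in a namespace with the proxy version recorded in its
# linkerd.io/proxy-version annotation (blank if the annotation is absent).
# (proxy_versions is an illustrative helper, not an existing command.)
proxy_versions() {
  ns="$1"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.linkerd\.io/proxy-version}{"\n"}{end}'
}
```

Running `proxy_versions linkerd` and `proxy_versions monolith` on each cluster would show whether the control plane and the workloads are on the edge release you expect.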

Jun 07 '24 11:06 Sierra1011
