linkerd_reconnect: Failed to connect error=Connection refused (os error 111) after installing 2.11.1
What is the issue?
I am seeing the errors below in the linkerd-destination pod after installing 2.11.1 on an AKS cluster; the pod logs are included below. We previously ran 2.10 without any issues. This was not an in-place upgrade: we removed 2.10 and then installed 2.11.1. Please let me know if any other logs would help with troubleshooting.
How can it be reproduced?
Install Linkerd 2.11.1 fresh (after removing 2.10) on an AKS cluster.
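Roughly, that was along these lines (a hedged sketch; the exact flags used are not recorded in this issue):
# Remove the previous 2.10 control plane (assumes it was CLI-managed)
linkerd uninstall | kubectl delete -f -
# Install stable-2.11.1 with the 2.11.1 CLI and verify
linkerd install | kubectl apply -f -
linkerd check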
Logs, error output, etc
[ 0.001240s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.001570s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.001583s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.001586s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.001587s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.001589s] INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[ 0.001591s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.001593s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[ 0.002035s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 0.003857s] WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[ 0.112761s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 0.332287s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 0.738942s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 1.240545s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 1.742524s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 2.067324s] INFO ThreadId(02) daemon:admin{listen.addr=0.0.0.0:4191}: linkerd_app_core::serve: Connection closed error=TLS detection timed out
[ 72.931085s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 73.431840s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 73.932647s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 74.434329s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 74.936055s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 75.436486s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 75.938267s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 76.440036s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 76.940731s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 77.442481s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 77.944229s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 78.444986s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 78.945760s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 79.446562s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 79.948405s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 80.450125s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 80.950652s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 81.452416s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 81.954388s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 82.456028s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 82.957336s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[ 83.459112s] WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
output of linkerd check -o short
~ linkerd check
Linkerd core checks
===================
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days
linkerd-version
---------------
‼ can determine the latest version
Get "https://versioncheck.linkerd.io/version.json?version=stable-2.11.1&uuid=58eb0377-e4d1-43a5-8baf-9c9c44545559&source=cli": net/http: TLS handshake timeout
see https://linkerd.io/2.11/checks/#l5d-version-latest for hints
‼ cli is up-to-date
unsupported version channel: stable-2.11.1
see https://linkerd.io/2.11/checks/#l5d-version-cli for hints
control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
unsupported version channel: stable-2.11.1
see https://linkerd.io/2.11/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-7d9d7865ff-8kkzh (stable-2.11.1)
* linkerd-identity-5f8f46575-fdzjb (stable-2.11.1)
* linkerd-proxy-injector-56fd45796f-8m7cx (stable-2.11.1)
see https://linkerd.io/2.11/checks/#l5d-cp-proxy-version for hints
√ control plane proxies and cli versions match
Status check results are √
Linkerd extensions checks
=========================
linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
could not find proxy container for prometheus-86bdfbd9d6-z55qz pod
see https://linkerd.io/2.11/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
prometheus-86bdfbd9d6-24t68 status is Failed
see https://linkerd.io/2.11/checks/#l5d-viz-pods-running for hints
× viz extension proxies are healthy
The "linkerd-proxy" container in the "prometheus-86bdfbd9d6-24t68" pod is not ready
see https://linkerd.io/2.11/checks/#l5d-viz-proxy-healthy for hints
Environment
- k8s version -- 1.20
- cluster env -- AKS
- Host OS -- linux
- Linkerd Version -- 2.11.1
I also see these errors in the linkerd-proxy pod logs:
[ 33.007036s] WARN ThreadId(01) policy:watch{port=8080}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-policy.linkerd.svc.cluster.local. type: SRV class: IN
Hi @prydeep! Based on those logs, it looks like the destination controller is unable to connect to the policy controller (which runs in the same pod on port 8090). Do you see any errors in the policy controller's container logs, or any warnings in the Kubernetes events (which you can see with a kubectl describe on the destination pod)?
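For example, something along these lines should gather both (a sketch; it assumes the default linkerd namespace and that the policy controller runs as the policy container of the destination pod):
# Policy controller logs from the destination deployment
kubectl logs -n linkerd deploy/linkerd-destination -c policy
# Kubernetes events and container statuses for the destination pod
kubectl describe pod -n linkerd -l linkerd.io/control-plane-component=destination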
Thank you for responding, @adleong. Please find the policy container logs below:
2022-04-26T18:59:50.571586Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T18:59:51.572722Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:04:10.126764Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:04:11.128036Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:05:00.598921Z INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:05:01.600125Z INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:08:29.808007Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:08:30.809124Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:09:20.473420Z INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:09:21.474626Z INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:13:25.822166Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2022-04-26T19:13:26.824169Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:13:39.602616Z INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:13:40.603818Z INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:15:05.162470Z INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out
2022-04-26T19:15:06.163672Z INFO pods: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:17:45.238031Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:17:46.238604Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:17:58.886482Z INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:17:59.887775Z INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting
2022-04-26T19:22:04.549612Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
2022-04-26T19:22:05.550852Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
@Team, any help is appreciated. We were running without any issues until we moved to 2.11.1.
We're having the same issue.
It looks like the policy controller is unable to contact the Kubernetes API. Can you try the latest Linkerd stable-2.11.2 to confirm whether the problem is still present?
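For a CLI-managed install, moving to stable-2.11.2 would look roughly like this (a hedged sketch; adjust if the control plane was installed with Helm):
# Update the local CLI first (assumes the curl|sh installer was used originally)
curl -sL https://run.linkerd.io/install | sh
linkerd version --client
# Re-render and apply the control-plane manifests, then re-run the health checks
linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -
linkerd check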
Tried 2.11.2, and I see the following errors in the policy container:
2022-04-30T01:57:03.013009Z INFO grpc{port=8090}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2022-05-04T13:03:54.605543Z WARN pods: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2022-05-04T13:03:54.605773Z WARN servers: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2022-05-04T13:03:54.606000Z WARN serverauthorizations: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2022-05-04T13:03:54.620562Z ERROR pods: kube_client::client: failed with error error trying to connect: Connection reset by peer (os error 104)
2022-05-04T13:03:54.620730Z ERROR servers: kube_client::client: failed with error error trying to connect: Connection reset by peer (os error 104)
2022-05-04T13:03:54.620778Z ERROR serverauthorizations: kube_client::client: failed with error error trying to connect: Connection reset by peer (os error 104)
2022-05-04T13:03:54.621493Z ERROR pods: kube_client::client: failed with error error trying to connect: tcp connect error: Connection refused (os error 111)
2022-05-04T13:04:12.857362Z ERROR serverauthorizations: kube_client::client: failed with error error trying to connect: tcp connect error: Connection refused (os error 111)
2022-05-04T13:04:12.863107Z ERROR pods: kube_client::client: failed with error error trying to connect: tcp connect error: Connection refused (os error 111)
@prydeep Were you able to resolve this issue?
I set proxy.logLevel=warn,linkerd=debug,warn and got this output:
DEBUG ThreadId(01) policy:watch{port=9443}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_tls::client: Peer does not support TLS reason=loopback
@jayjaytay I accidentally closed the issue. It is still there for me.
DEBUG ThreadId(01) policy:watch{port=9443}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_tls::client: Peer does not support TLS reason=loopback
This is innocuous. It's indicating that the proxy shouldn't attempt mTLS to a container in the same pod.
Based on my understanding of the logs above:
- the policy controller is unable to reach the Kubernetes API for some reason
- there are no errors in the proxy related to outbound traffic
You might try installing Linkerd with --set policyController.logLevel=info\,linkerd=trace\,kubert=debug; this will enable verbose logs from the policy controller.
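For a CLI-managed install, that would look roughly like this (a sketch; a Helm-managed install would pass the same value via helm upgrade instead):
# Re-render the control plane with a verbose policy-controller log level and apply it
linkerd upgrade --set policyController.logLevel='info\,linkerd=trace\,kubert=debug' \
  | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -
# Then follow the policy container's logs (container name assumed to be "policy")
kubectl logs -n linkerd deploy/linkerd-destination -c policy -f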
Outbound traffic on port 443 (where the Kubernetes API is usually hosted) is not proxied on the control plane, so I'd probably ignore the destination controller's proxy logs unless we have some indication that this traffic is being proxied.
We're running linkerd 2.11.2 on AKS and have not encountered this issue, so there are probably some missing relevant details... How were these clusters created? What CNIs are being used? What would we have to do, specifically, to try to reproduce this problem?
@olix0r sorry for the late reply; we are using Azure CNI.
This could then be Azure/AKS#2750. @prydeep there's a small repro in there you can try to confirm that's the issue.
It also sounds like #8296 may help resolve this in some cases
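While confirming, a generic in-cluster API-server reachability check (not the exact repro from Azure/AKS#2750, just a hedged sanity test) could look like:
# Run a throwaway pod and hit the Kubernetes API service directly
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk -o /dev/null -w '%{http_code}\n' https://kubernetes.default.svc/version
# Any HTTP status (200/401/403) means the API server is reachable; timeouts or
# connection resets point at the kind of SNAT/CNI problem the AKS issue describes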
@alpeb tried that and, yes, saw the same thing. What is the next step? Is there a fix for this, and will it be merged into 2.11.2 stable?
That's an AKS bug unfortunately, so there's nothing we can do on our side besides voicing your concern in that ticket in the hope it'll get better visibility.
Thank you @alpeb
Hi @alpeb, I'm facing the same issue after upgrading from stable-2.10.2 to 2.11.2. I'm using AKS with kubenet and k8s version 1.21.7.
All Linkerd control-plane components start failing with the same linkerd-proxy logs provided in the issue description. Is there any known mitigation here?
Logs from the linkerd-proxy container of linkerd-proxy-injector and linkerd-destination:
[0.005486s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.005494s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[0.008909s] WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[0.012103s] WARN ThreadId(01) daemon:identity: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.cluster.local. type: SRV class: IN
[0.028235s] INFO ThreadId(01) linkerd_proxy::signal: received SIGTERM, starting shutdown
Hey @alpeb, any updates here? We're stuck on an upgrade.
Hi @alpeb, we switched to Azure CNI for our AKS clusters; now only one of our linkerd-destination pods is failing with:
WARN ThreadId(01) policy:watch{port=8090}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
I tried the repro for Azure/AKS#2750, but that is not the case for us.
And it's only failing in one of our clusters; any idea what might be causing this? We're on Linkerd 2.11.0 (as 2.11.1 and 2.11.2 were causing the issues above).
I'm using EKS and getting this error too
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Same error in EKS as well: 2.12.2 with the Linkerd CNI plugin enabled.
Also seeing this in EKS; it popped up out of the blue and seems to affect only one deployment. Currently running version 2.11.4, and I'm not in a position to upgrade Linkerd right now.
Edit: this was actually a misconfiguration in the app the sidecar was proxying. It was set to listen on 127.0.0.1:8080 instead of 0.0.0.0:8080, meaning the sidecar couldn't connect to the app. All fine now.
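For anyone hitting the same symptom, a quick way to check what address the app container is actually bound to (a hedged sketch; it assumes ss is available in the image, otherwise kubectl debug with a tooling image works):
# List listening TCP sockets inside the app container (placeholders are examples)
kubectl exec -it <pod-name> -c <app-container> -- ss -lnt
# If it only shows 127.0.0.1:8080, the app isn't listening on the pod IP, which
# matches the misconfiguration above; binding 0.0.0.0:8080 resolves it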
I'm going to close this issue out for now. This was originally opened for stable 2.11.1 and we are now on stable 2.12.2. We have been unable to reproduce this and enough comments have happened about slightly different issues that I feel we've deviated from the parent issue.
If you do see this, please feel free to reopen a new issue with as much detail as possible. Thanks!