Experiencing Intermittent 401 Unauthorized Errors from Kube API Server
Describe the bug
We use the Kubernetes JavaScript client library in our Node.js application. Recently, we have been experiencing intermittent 401 Unauthorized errors from the Kube API server.
Error trace:
body: {
kind: 'Status',
apiVersion: 'v1',
metadata: {},
status: 'Failure',
message: 'Unauthorized',
reason: 'Unauthorized',
code: 401
}
Within the Node.js application, the logic lists pods. Most of the time no errors are observed, but occasionally this 401 error is thrown by the Kubernetes client. We started noticing this issue after the latest Kubernetes upgrade, from version 1.22.1 to 1.24.4.
Initially, we suspected the Kubernetes service account token. Starting from Kubernetes 1.24, the token is no longer mounted from a Secret by default; instead, it is projected into the container and refreshed by the kubelet every hour, whereas in 1.22 this token was stored as a Secret. We print the error stack trace in the app. By decoding the token that is passed in the request headers to the kube API server, we found that the token had been generated only a few seconds earlier, and the error happens intermittently when a new token is used right after the hourly token rotation.
However, upon further analysis, it became evident that this was not the root cause: for the most part everything operates smoothly, and the 401 errors appear to be sporadic and random.
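For reference, this is roughly how we decoded the token to check its timestamps (a minimal sketch; the path assumes the default projected service account token, and the helper name is illustrative, not code from our actual app):

import { readFileSync } from 'fs';

// Default location of the projected service account token inside the pod
// (assumption: the default token projection is used).
const TOKEN_PATH = '/var/run/secrets/kubernetes.io/serviceaccount/token';

// Decode the JWT payload (middle, base64url-encoded segment) without verifying the signature.
function decodeTokenPayload(token: string): Record<string, any> {
  const payload = token.split('.')[1];
  const json = Buffer.from(payload.replace(/-/g, '+').replace(/_/g, '/'), 'base64').toString('utf8');
  return JSON.parse(json);
}

const claims = decodeTokenPayload(readFileSync(TOKEN_PATH, 'utf8').trim());
console.log('iat:', new Date(claims.iat * 1000).toISOString());
console.log('exp:', new Date(claims.exp * 1000).toISOString());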
We need help finding the concrete root cause of this issue.
Client Version
0.16.3
Server Version
1.24.4
Example Code
Sample code snippet; the error is thrown from line 5:
1 let kubeConfig = new k8s.KubeConfig();
2 kubeConfig.loadFromDefault();
3 let kubeApi = kubeConfig.makeApiClient(k8s.CoreV1Api);
4 let labelSelector = 'app=' + appName;
5 let res = await kubeApi.listPodForAllNamespaces(false, null, 'status.phase=Running', labelSelector);
Environment (please complete the following information):
- OS: Linux
- Node.js Version: 12.22.12
- Cloud runtime: N/A
There's a comment here: https://github.com/kubernetes-client/javascript/blob/master/src/file_auth.ts#L43
We only poll the file for changes every 60 seconds. That means that we likely cache the token across the token refresh and that may be too long.
We should probably use filesystem events to get an event when the file changes.
I suspect that is what is causing your problem, but it's hard to know without logs or more details.
If you wanted to send a PR to update that code to use events we'd be happy to take it.
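A rough sketch of what that could look like (illustrative only, not the library's current implementation; a real change would need debouncing and a polling fallback, since fs.watch behavior varies by platform):

import { watch, readFileSync } from 'fs';

const TOKEN_DIR = '/var/run/secrets/kubernetes.io/serviceaccount';
const TOKEN_PATH = TOKEN_DIR + '/token';

let cachedToken = readFileSync(TOKEN_PATH, 'utf8').trim();

// The kubelet updates the projected token via a symlink swap, so watching the
// directory tends to be more reliable than watching the file itself.
watch(TOKEN_DIR, () => {
  try {
    cachedToken = readFileSync(TOKEN_PATH, 'utf8').trim();
  } catch {
    // The file may be mid-swap; keep the previous token and wait for the next event.
  }
});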
We only poll the file for changes every 60 seconds. That means that we likely cache the token across the token refresh and that may be too long.
That seems unlikely... the kubelet refreshes the token at 80% of its lifetime, and the minimum lifetime is 10 minutes, which means the file should be getting updated with at least 2 minutes remaining on the previous token's lifetime.
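To make that margin concrete (a quick worked example, using the 1-hour projected token lifetime mentioned in this issue):

// Worked example of the refresh margin (assumes a 1-hour token lifetime).
const lifetimeSeconds = 60 * 60;                 // 3600s
const refreshAt = 0.8 * lifetimeSeconds;         // kubelet refreshes at ~2880s (48 min)
const remaining = lifetimeSeconds - refreshAt;   // ~720s (12 min) left on the old token
// Even a 60-second client-side polling cache would still be serving a token with
// several minutes of validity remaining, so the cache alone should not produce a 401.
console.log({ refreshAt, remaining });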
Thanks @brendandburns and @liggitt. I want to add some more context: we had enabled debug-level logging and got the token from the response header.
When I decoded the token, it looked like the newly created token was indeed the one used by the request, but I am not sure why it was still throwing the 401 error. I am attaching the decoded token output for reference; let me know if anything else would be useful for understanding the issue.
{
"aud": [
"api",
"vault",
"factors"
],
"exp": 1723243859,
"iat": 1691707859,
"iss": "api",
"kubernetes.io": {
"namespace": "my-namespace",
"pod": {
"name": "my-pod-name",
"uid": "dd69f3de-64ad-4230-bbe5-1ae099f164b6"
},
"serviceaccount": {
"name": "my-pod-service-account",
"uid": "a77fc43b-933a-414a-b2d7-bd785d54794f"
},
"warnafter": 1691711466
},
"nbf": 1691707859,
"sub": "system:serviceaccount:my-namespace:my-pod-service-account"
}
Here, the iat (issued-at time) 1691707859 is 10 August 2023 22:50:59 UTC. The error appears in my application logs at exactly 10 August 2023 22:51:02 UTC. So from my understanding, we get this error when a newly created token is used almost immediately after it is issued.
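For completeness, this is how the two timestamps line up (the error time comes from our application logs):

// Convert the iat claim (seconds since epoch) to a UTC timestamp.
const iat = 1691707859;
console.log(new Date(iat * 1000).toISOString()); // 2023-08-10T22:50:59.000Z
// The 401 was logged at 2023-08-10 22:51:02 UTC, roughly 3 seconds after the
// token was issued, i.e. the failing request carried a brand-new token.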
@brendandburns @liggitt can you help out here?
I'm not very familiar with the kubelet token regeneration flow; the client library simply picks up the token from the file and sends it as a header.
Given what @liggitt said about the Kubelet regeneration, I agree that the polling interval is unlikely to be the cause here unless your nodes are seriously overloaded.
If it seems like this is due to kubelet/apiserver interactions, then it probably makes more sense to file a bug on the main kubernetes repository.
I wonder if there is clock skew between your node(s) and the API Server?
I wonder if there is clock skew between your node(s) and the API Server?
getting a 401 response when using a ~brand new token would be more likely to be due to clock skew between API servers, if anything... the node clock isn't in play for validating a brand new token
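If it helps to rule out client-side skew anyway, one rough sanity check is to compare the API server's HTTP Date response header with the local clock (a diagnostic sketch only; per the comment above, skew between API servers themselves would not show up in this check):

import * as https from 'https';

// Assumption: in-cluster API server address; even an unauthorized response carries a Date header.
const apiServer = 'https://kubernetes.default.svc';

// rejectUnauthorized: false skips TLS verification for this throwaway check only.
https.get(apiServer + '/version', { rejectUnauthorized: false }, (res) => {
  const serverDate = new Date(res.headers['date'] as string);
  const skewMs = Date.now() - serverDate.getTime();
  console.log('approximate local clock skew vs API server:', skewMs, 'ms');
  res.resume();
});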
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.