
kubernetes 1.30.5 support

Open karatkep opened this issue 1 year ago • 37 comments

Summary

Dear Community,

Could you please help me verify if Eclipse Che 7.93.0 supports Kubernetes 1.30.5? The che-dashboard and che pods stopped working when our Kubernetes cluster was updated to version 1.30.5.

Here is a sample of the error in the che-dashboard:

ERROR[12:03:22 UTC]: [HTTP request failed]
    err: {
      "type": "le",
      "message": "HTTP request failed",
      "stack":
          HttpError: HTTP request failed
              at q._callback (/backend/server/backend.js:8:898957)
              at t._callback.t.callback.t.callback (/backend/server/backend.js:14:1087840)
              at q.emit (node:events:517:28)
              at q.<anonymous> (/backend/server/backend.js:14:1100418)
              at q.emit (node:events:517:28)
              at IncomingMessage.<anonymous> (/backend/server/backend.js:14:1099250)
              at Object.onceWrapper (node:events:631:28)
              at IncomingMessage.emit (node:events:529:35)
              at endReadableNT (node:internal/streams/readable:1400:12)
              at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
      "response": {
        "statusCode": 401,
        "body": {
          "kind": "Status",
          "apiVersion": "v1",
          "metadata": {},
          "status": "Failure",
          "message": "Unauthorized",
          "reason": "Unauthorized",
          "code": 401
        },
        "headers": {
          "audit-id": "6b14e1b5-8a08-41a8-a093-5e00693737a6",
          "cache-control": "no-cache, private",
          "content-type": "application/json",
          "date": "Mon, 04 Nov 2024 12:03:21 GMT",
          "content-length": "129",
          "connection": "close"
        },
        "request": {
          "uri": {
            "protocol": "https:",
            "slashes": true,
            "auth": null,
            "host": "10.1.0.1:443",
            "port": "443",
            "hostname": "10.1.0.1",
            "hash": null,
            "search": null,
            "query": null,
            "pathname": "/apis/org.eclipse.che/v2/checlusters",
            "path": "/apis/org.eclipse.che/v2/checlusters",
            "href": "https://10.1.0.1:443/apis/org.eclipse.che/v2/checlusters"
          },
          "method": "GET",
          "headers": {
            "Accept": "application/json",
            "Authorization": "Bearer MASKED"
          }
        }
      },
      "body": {
        "type": "Object",
        "message": "Unauthorized",
        "stack":
            
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {},
        "status": "Failure",
        "reason": "Unauthorized",
        "code": 401
      },
      "statusCode": 401,
      "name": "HttpError"
    }

The same issue affects the che pod. It appears that both lost access to the Kubernetes API after the upgrade to version 1.30.5.

ServiceAccounts, ClusterRoles, and ClusterRoleBindings are in place for both the che-dashboard and che pods.

Relevant information

No response

karatkep avatar Nov 04 '24 14:11 karatkep

@karatkep Could you show the che pod logs?

I've tried to reproduce it on Minikube with Kubernetes 1.31.0, but no luck.

tolusha avatar Nov 04 '24 16:11 tolusha

@tolusha According to the che logs, the che pod starts receiving 401 errors from the kube-api exactly one hour after the pod launches:

06-Nov-2024 08:26:02.136 INFO [main] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [/home/user/eclipse-che/tomcat/webapps/ROOT.war] has finished in [2,488] ms
06-Nov-2024 08:26:02.138 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
06-Nov-2024 08:26:02.144 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [40907] milliseconds
2024-11-06 09:26:32,950[c4d-k5x9l-37628]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@6c199c1d] for cluster [RemoteSubscriptionChannel], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:42,473[4c4d-k5x9l-3460]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@f31944b] for cluster [WorkspaceStateCache], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:47,468[c4d-k5x9l-46003]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@5ed91d32] for cluster [WorkspaceLocks], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]

karatkep avatar Nov 06 '24 10:11 karatkep

@tolusha, as far as I can see, the issue is that the token is not being refreshed. It is issued for 1 hour, and after that time the che-dashboard continues to use it despite its expiration. Is there any way to prompt the che-dashboard to refresh it before using it for kube-api calls?

karatkep avatar Nov 11 '24 16:11 karatkep

@karatkep Could you share CheCluster CR? What OIDC provider do you use?

tolusha avatar Nov 12 '24 13:11 tolusha

@tolusha, yes, of course, I will provide the CheCluster CR. However, I don't think the issue lies with the CheCluster CR or OIDC. The same version of Eclipse Che, 7.93.0, was deployed in two identical AKS clusters (Kubernetes 1.27.9), and everything was fine until one of the clusters was upgraded to 1.30.5. Immediately after this update, the problems with the kube-api started. Reviewing the token used by, for example, the che-dashboard, I see that the expiration field "exp" is always the same and lies in the past. From this I conclude that on Kubernetes 1.30.5 the token is not being refreshed.

karatkep avatar Nov 12 '24 18:11 karatkep

@tolusha, @ibuziuk, we found the root cause of the issue. In Kubernetes 1.27.9, the token (located at /var/run/secrets/kubernetes.io/serviceaccount/token) is issued for one year, although the file is refreshed every hour (or, more precisely, every 50 minutes). In Kubernetes 1.30.5, the token is issued for one hour and is likewise refreshed every 50 minutes. However, Che (che-dashboard, che, and most likely che-gateway) reads this token once at startup, caches it, and keeps using it. Consequently, there is no problem on Kubernetes 1.27.9, since the token is valid for a year, but on Kubernetes 1.30.5 the problem begins one hour after startup, because the cached, now-expired token is still being used.
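
For context, the rotation described above comes from the bound service account token volume that the kubelet injects automatically. A rough, illustrative sketch of that projected volume (the generated volume name and exact values vary per cluster and are not taken from the reporter's environment):

volumes:
  - name: kube-api-access-xxxxx          # auto-generated name; illustrative only
    projected:
      sources:
        - serviceAccountToken:
            expirationSeconds: 3607      # ~1 hour; the kubelet rewrites the file once ~80% of the TTL has elapsed (~50 minutes)
            path: token                  # surfaced as /var/run/secrets/kubernetes.io/serviceaccount/token
        # ...plus configMap (ca.crt) and downwardAPI (namespace) sources

Only the file on disk is rotated, so a client that reads the token once at startup and caches it in memory keeps presenting the expired value, which matches the behaviour described above. Whether the API server additionally extends the actual "exp" claim far beyond the requested hour (which may explain the one-year lifetime observed on the 1.27.9 cluster) depends on cluster configuration.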

karatkep avatar Nov 12 '24 22:11 karatkep

@karatkep So, if you restart all pods, Che will continue working, right?

tolusha avatar Nov 13 '24 08:11 tolusha

@tolusha Correct, we need to restart the Che pods every hour to ensure they remain operational.

karatkep avatar Nov 13 '24 09:11 karatkep

@tolusha, @ibuziuk, Could you please share information and plans regarding this issue? Is everything clear and understandable? Were you able to reproduce it? Are you currently working on a resolution, or do you have plans to start working on it soon?

Just to be on the same page - there is absolutely no pressure from my side; I just want to understand the current status and plans regarding this issue. For my part, I have already applied one of the possible workarounds and written a CronJob that restarts the necessary Che pods. If other Eclipse Che users are facing or will face the same issue, I am more than willing to share this workaround.
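
For anyone else hitting this, here is a minimal sketch of what such a workaround CronJob could look like. All names, the schedule, and the service account are assumptions rather than the reporter's actual manifest; the service account needs RBAC permissions to restart deployments, and the deployment names may differ per installation:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: che-pod-restarter                        # hypothetical name
  namespace: eclipse-che
spec:
  schedule: "0 * * * *"                          # hourly, i.e. before the ~1h token lifetime runs out
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: che-pod-restarter  # assumed SA allowed to patch deployments in this namespace
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - >-
                  kubectl -n eclipse-che rollout restart
                  deployment/che deployment/che-dashboard deployment/che-gateway

A rolling restart keeps the pods picking up freshly issued tokens; it is a stopgap, not a fix for the underlying caching.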

karatkep avatar Nov 15 '24 09:11 karatkep

@karatkep Thank you for the follow-up and investigation details - https://github.com/eclipse-che/che/issues/23230#issuecomment-2471679757

I'm still wondering whether the token lifetime is configurable on the k8s side in general. Do you happen to have a link to the release notes, docs, or commit where this change to the lifetime was introduced? Could it be some AKS config?

The issue has been planned for the next sprint (Nov 20 - Dec 10), however, so far @tolusha was not able to reproduce it on vanilla minikube.

@karatkep also, contributions from the Community are most welcome if you would like to change or update the caching mechanism in the project ;-)

ibuziuk avatar Nov 15 '24 10:11 ibuziuk

@ibuziuk, when I was researching this issue, I came across the documentation at https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#tokenrequest-api, which contains detailed information about configuring the token lifetime. Moreover, I ran an experiment: I disabled the che-operator (so it wouldn't revert my changes) and used the expirationSeconds field to modify the lifetime of the token, setting it to one day (86400 seconds) in the che-dashboard deployment. After restarting the che-dashboard pod, I confirmed that the lifetime of the token (located at /var/run/secrets/kubernetes.io/serviceaccount/token) had indeed changed.
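
A minimal sketch of the kind of deployment change described above, assuming the token was supplied via an explicit projected volume mounted over the default path. The names and structure are illustrative, not the reporter's actual manifest, and the che-operator would normally revert such a manual edit (hence disabling it for the experiment):

spec:
  template:
    spec:
      automountServiceAccountToken: false        # stop injecting the default kube-api-access volume
      containers:
        - name: che-dashboard
          volumeMounts:
            - name: sa-token                     # hypothetical volume name
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              readOnly: true
      volumes:
        - name: sa-token
          projected:
            sources:
              - serviceAccountToken:
                  expirationSeconds: 86400       # 1 day instead of the default ~1 hour
                  path: token
              - configMap:
                  name: kube-root-ca.crt         # keep the API server CA available to the pod
                  items:
                    - key: ca.crt
                      path: ca.crt
              - downwardAPI:
                  items:
                    - path: namespace
                      fieldRef:
                        fieldPath: metadata.namespace

Even with a longer expirationSeconds the kubelet keeps rotating the file, so this only widens the window in which a cached token stays valid; it does not remove the need for the application to re-read the token.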

P.S. Frankly speaking, I do not like the option of using a long-lived token - it contradicts security best practices. It seems to me that whoever made this change (token lifetime: 1y -> 1h) took a step in the right direction towards short-lived tokens, and in my opinion a well-written application should not cache the token indefinitely.

karatkep avatar Nov 15 '24 10:11 karatkep

I managed to decrease the Kubernetes token lifetime to 10 minutes, and I confirm that Kubernetes connection failure warnings start appearing every second right after the token expiration time. However, since Kubernetes refreshes the token in every pod, I could not reproduce the dashboard error, and all Kubernetes-related actions work fine even after the token expiration. Currently I am working on updating the jgroups-kubernetes che-server dependency; this library writes the error to the che-server log after the token expires.

vinokurig avatar Dec 06 '24 11:12 vinokurig

Unfortunately, updating the jgroups-kubernetes dependency to the latest version did not solve the cyclic log warning in che-server, so I filed an upstream issue. As for the dashboard log error, I could not reproduce it with the refreshed Kubernetes token; all dashboard Kubernetes-related actions work fine, e.g. PAT token add/list.

vinokurig avatar Dec 09 '24 09:12 vinokurig

@karatkep could you please elaborate on what exactly does not work, regardless of the log errors? Can you open the dashboard page and navigate to user preferences?

vinokurig avatar Dec 09 '24 15:12 vinokurig

To summarize:

  • If the Kubernetes service account token is refreshed after expiration, all functionality works as expected, except for the cyclic error in the che-server logs.
  • The che-server log error is caused by the jgroups-kubernetes dependency. The dependency is not used by the current functionality, so we should either update it when a new version with the fix is available, or remove it as a leftover and check that doing so does not break anything.
  • We are going to update the fabric8 kubernetes client to the latest version.

vinokurig avatar Dec 10 '24 13:12 vinokurig

@karatkep my understanding is that so far @vinokurig was not able to reproduce the error even with the short-lived token. Steps to reproduce would be highly appreciated.

Basically, all k8s interactions for che-server happen through the Fabric8-Kubernetes-Client, and we plan to bump it to version 7.0.0 next sprint. cc @manusa maybe you have some input on this situation? Do we need to take care of updating the token (/var/run/secrets/kubernetes.io/serviceaccount/token) ourselves, or does the client handle the update under the hood - https://github.com/eclipse-che/che/issues/23230#issuecomment-2471679757 ?

ibuziuk avatar Dec 10 '24 15:12 ibuziuk

cc @manusa maybe you have some input on this situation? Do we need to take care of updating the token (/var/run/secrets/kubernetes.io/serviceaccount/token) ourselves, or does the client handle the update under the hood - #23230 (comment) ?

I understand that the Kubernetes Client in use is 6.10.0.

In this case, yes, there's a TokenRefreshInterceptor that reloads the config in case there is an auth client error in the HTTP response.

https://github.com/fabric8io/kubernetes-client/blob/9101a2fa4a8f912ff6cda23e4d4b59895ccdc755/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/utils/TokenRefreshInterceptor.java#L123-L126

The interceptor logic will work and reload the Config as long as the Config was not provided manually. Does this ring a bell? Setting a breakpoint in the mentioned lines of code should allow you to debug what's going on at the moment the authorization fails.

manusa avatar Dec 11 '24 04:12 manusa

Hello @ibuziuk, @vinokurig. Please allow me to gather more details regarding this case. I will share them later today or tomorrow.

karatkep avatar Dec 11 '24 17:12 karatkep

@karatkep could you please elaborate more on what exactly does not work regardless the logs errors? Can you open dashboard page, navigate to user preferences?

@vinokurig, the dashboard issue arises when a user attempts to start a devworkspace. Please see the screenshot below: [screenshot]

The endpoint /dashboard/api/devworkspace/running-workspaces-cluster-limit-exceeded fails because it attempts to call the Kubernetes API endpoint /apis/org.eclipse.che/v2/checlusters, which returns a 401 error since the token has expired.

karatkep avatar Dec 16 '24 01:12 karatkep

Hello @vinokurig, just wanted to check if you need anything else from my side to unblock your investigation.

karatkep avatar Dec 19 '24 22:12 karatkep

Hello @karatkep, sorry for the late response. I managed to reproduce the unauthorized error on the dashboard; investigating...

vinokurig avatar Dec 23 '24 09:12 vinokurig

@karatkep could you please confirm that the issue is fixed with 7.97.0 release?

ibuziuk avatar Jan 08 '25 13:01 ibuziuk

@ibuziuk, we started receiving the 418 "I'm a teapot" error as soon as we updated Eclipse Che to version 7.97.0 and stopped the hourly restarts of the che, che-gateway, and che-dashboard pods.

From che-gateway pod oauth-proxy container logs:

10.10.10.10:58724 - 10fc3e3ebefed28125004a26512bad11 - [email protected] [2025/01/11 13:29:19] test-che.qwe.com GET / "/dashboard/api/devworkspace/running-workspaces-cluster-limit-exceeded" HTTP/1.1 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" 200 5 0.036
10.10.10.10:58724 - 1078721a6dfed6dfe08ac6caca30cd11 - [email protected] [2025/01/11 13:29:19] test-che.qwe.com GET / "/qwerty-qwe-com/test/3100/?tkn=eclipse-che" HTTP/1.1 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" 418 0 0.001

From che-gateway pod configbump container logs:

E0111 13:29:19.199505       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.ConfigMap: Unauthorized

karatkep avatar Jan 11 '25 13:01 karatkep

@karatkep Could you provide the che-operator logs? Did you get 418 immediately or after an hour?

tolusha avatar Jan 13 '25 09:01 tolusha

@tolusha, there is nothing suspicious in the che-operator logs.

time="2025-01-13T12:10:55Z" level=info msg="Successfully reconciled."
2025-01-13T12:11:35Z	INFO	controllers.DevWorkspaceRouting	Reconciling DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
2025-01-13T12:11:35Z	INFO	controllers.DevWorkspaceRouting	Reconciling DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
2025-01-13T12:11:36Z	INFO	controllers.DevWorkspaceRouting	Reconciling DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
2025-01-13T12:11:36Z	INFO	controllers.DevWorkspaceRouting	Adding Finalizer for the DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
2025-01-13T12:11:36Z	INFO	controllers.DevWorkspaceRouting	Reconciling DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
2025-01-13T12:11:36Z	ERROR	Reconciler error	{"controller": "devworkspacerouting", "controllerGroup": "controller.devfile.io", "controllerKind": "DevWorkspaceRouting", "DevWorkspaceRouting": {"name":"routing-workspace87f1f5feed574d56","namespace":"test-che.qwe.com-4m30a0"}, "namespace": "test-che.qwe.com-4m30a0", "name": "routing-workspace87f1f5feed574d56", "reconcileID": "8561d8fc-f183-4a72-99d1-cd8283bfa28b", "error": "Operation cannot be fulfilled on devworkspaceroutings.controller.devfile.io \"routing-workspace87f1f5feed574d56\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/che-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/che-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/che-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
2025-01-13T12:11:36Z	INFO	controllers.DevWorkspaceRouting	Reconciling DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
2025-01-13T12:11:36Z	INFO	controllers.DevWorkspaceRouting	Reconciling DevWorkspaceRouting	{"Request.Namespace": "test-che.qwe.com-4m30a0", "Request.Name": "routing-workspace87f1f5feed574d56", "devworkspace_id": "workspace87f1f5feed574d56"}
time="2025-01-13T12:11:50Z" level=info msg="Successfully reconciled."

I restarted the che-gateway, and the issue was resolved. However, after one hour, the issue returned.

karatkep avatar Jan 13 '25 12:01 karatkep

@karatkep

We started receiving the 418 "I'm a teapot" error as soon as we updated Eclipse Che to version 7.97.0 and stopped the hourly restarts of the che, che-gateway, and che-dashboard pods.

Do you mean that Che stops working after an hour? Can you access the Dashboard and start a workspace?

vinokurig avatar Jan 15 '25 08:01 vinokurig

I did not see anything unusual after the service account token refresh in the che-gateway pod.

vinokurig avatar Jan 15 '25 08:01 vinokurig

I did not see anything unusual after the service account token refresh in the che-gateway pod.

As I mentioned above, I see the following error in configbump occurring one hour after the start:

E0111 13:29:19.199505       1 reflector.go:126] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: Failed to list *v1.ConfigMap: Unauthorized

karatkep avatar Jan 15 '25 10:01 karatkep

@karatkep Can you access the Dashboard and start a workspace while that error is present?

vinokurig avatar Jan 15 '25 13:01 vinokurig

@vinokurig, yes, I can access the Dashboard, see all workspaces, and start a workspace. I am successfully redirected to the 'Starting workspace' page, but once all steps are completed, I am redirected to the IDE where I encounter a 418 error:

10.10.10.10:58724 - 1078721a6dfed6dfe08ac6caca30cd11 - [email protected] [2025/01/11 13:29:19] test-che.qwe.com GET / "/qwerty-qwe-com/test/3100/?tkn=eclipse-che" HTTP/1.1 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" 418 0 0.001

karatkep avatar Jan 15 '25 14:01 karatkep