
Worker nodes are not updated by kops upgrade / tf apply / kops rolling-update

Open marek-obuchowicz opened this issue 11 months ago • 10 comments

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.32.0 (git-v1.32.0)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Initially 1.27.16, then upgraded to v1.28.15, followed by an upgrade to v1.29.15.

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops edit cluster [changed kubernetesVersion `1.27.16`  to `1.28.15`]
kops update cluster --target terraform
cd out/terraform; terraform plan; terraform apply; cd ../..
kops rolling-update cluster --validation-timeout 2h --yes

For the next update cycle, 1.28 to 1.29.15, I did the same process as above.

5. What happened after the commands executed? Just after rolling-update rotated one control-plane node:

$ kubectl get nodes -owide
NAME                  STATUS   ROLES                       AGE     VERSION    INTERNAL-IP     EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
i-02ca04220fdea027a   Ready    control-plane,spot-worker   45d     v1.27.16   172.20.65.80    ---              Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.22
i-044d3c800ebd9182f   Ready    node,spot-worker            3d12h   v1.27.16   172.20.41.249   ---              Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.22
i-049c7e07ada7bd67f   Ready    control-plane,spot-worker   79s     v1.28.15   172.20.34.8     ---              Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.25
i-091c6621ef56fb552   Ready    control-plane,spot-worker   105d    v1.27.16   172.20.155.6    ---              Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.22
i-0e7b6c4d5e0cc899d   Ready    node,spot-worker            42h     v1.27.16   172.20.92.38    ---              Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.22

After kops rolling-update cluster ... finished for the 1.27 -> 1.28 upgrade, all nodes had been rotated and containerd had been upgraded everywhere from 1.7.22 to 1.7.25. However, the Kubernetes version appears to have been upgraded only on the control plane; the worker nodes still run the old version:

$ kubectl get nodes -owide
NAME                  STATUS   ROLES                       AGE   VERSION    INTERNAL-IP      EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
i-02fdccb13d452b12c   Ready    node,spot-worker            22m   v1.27.16   172.20.49.100    ---             Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.25
i-049c7e07ada7bd67f   Ready    control-plane,spot-worker   44m   v1.28.15   172.20.34.8      ---             Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.25
i-067307066010c55d8   Ready    control-plane,spot-worker   37m   v1.28.15   172.20.157.199   ---             Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.25
i-0954159b2b57b93f8   Ready    node,spot-worker            27m   v1.27.16   172.20.70.145    ---             Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.25
i-0f9a889b6d013dc78   Ready    control-plane,spot-worker   31m   v1.28.15   172.20.78.182    ---             Ubuntu 22.04.5 LTS   6.8.0-1021-aws   containerd://1.7.25

I've tried to run rolling-update again, but none of the nodes was in NeedsUpdate status.

As a next step, I upgraded to v1.29.15 (same process as described above) and, long story short, after the rolling-update the control plane is running v1.29.15 while the nodes now seem to be upgraded to 1.28.15, but not to 1.29. So somehow the control plane ends up running one version higher than the nodes. Manually rotating the worker nodes didn't fix the issue.

6. What did you expect to happen? I expected the version upgrade to work, with the nodes running the same Kubernetes version as the control plane.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2017-06-29T12:09:24Z"
  generation: 57
  name: REDACTED
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ecr:DescribeImages",
            "ecr:BatchGetImage",
            "ecr:InitiateLayerUpload",
            "ecr:UploadLayerPart",
            "ecr:CompleteLayerUpload",
            "ecr:PutImage",
            "ecr:CreateRepository"
          ],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "elasticfilesystem:DescribeAccessPoints",
            "elasticfilesystem:DescribeFileSystems",
            "elasticfilesystem:DescribeMountTargets",
            "ec2:DescribeAvailabilityZones"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "elasticfilesystem:CreateAccessPoint"
          ],
          "Resource": "*",
          "Condition": {
            "StringLike": {
              "aws:RequestTag/efs.csi.aws.com/cluster": "true"
            }
          }
        },
        {
          "Effect": "Allow",
          "Action": "elasticfilesystem:DeleteAccessPoint",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aws:ResourceTag/efs.csi.aws.com/cluster": "true"
            }
          }
        }
      ]
  api:
    dns: {}
  authorization:
    alwaysAllow: {}
  awsLoadBalancerController:
    enabled: true
  certManager:
    enabled: true
    managed: false
  channel: stable
  cloudProvider: aws
  configBase: s3://REDACTED/REDACTED
  containerRuntime: containerd
  dnsZone: REDACTED
  etcdClusters:
  - backups:
      backupStore: s3://REDACTED/REDACTED/backups/etcd/main/
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1b
      name: b
    manager:
      backupRetentionDays: 90
    name: main
    provider: Manager
  - backups:
      backupStore: s3://REDACTED/REDACTED/backups/etcd/events/
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1b
      name: b
    manager:
      backupRetentionDays: 90
    name: events
    provider: Manager
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    enableAdmissionPlugins:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - NodeRestriction
    - PersistentVolumeClaimResize
    - ResourceQuota
  kubeDNS:
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    cgroupDriver: systemd
    imageGCHighThresholdPercent: 75
    imageGCLowThresholdPercent: 60
  kubernetesApiAccess:
  - REDACTED
  - REDACTED
  - REDACTED
  kubernetesVersion: 1.29.15
  masterPublicName: REDACTED
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nodeProblemDetector:
    cpuRequest: 10m
    enabled: true
    memoryRequest: 32Mi
  nodeTerminationHandler:
    enableRebalanceMonitoring: false
    enableSQSTerminationDraining: true
    enabled: true
    managedASGTag: aws-node-termination-handler/managed
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  - cidr: 172.20.96.0/19
    name: us-east-1e
    type: Public
    zone: us-east-1e
  - cidr: 172.20.128.0/19
    name: us-east-1b
    type: Public
    zone: us-east-1b
  - cidr: 172.20.160.0/19
    name: us-east-1d
    type: Public
    zone: us-east-1d
  - cidr: 172.20.192.0/19
    name: us-east-1f
    type: Public
    zone: us-east-1f
  topology:
    dns:
      type: Public


9. Anything else do we need to know? As we use terraform, kops reconcile is not an option, so I've been using the old process.

marek-obuchowicz avatar May 22 '25 15:05 marek-obuchowicz

I've checked the nodeup config (S3 bucket - igconfig/node/nodes/nodeupconfig) and I see the versions are mixed up here:

Assets:
  amd64:
  - b07a27fd5bd2419c9c623de15c1dd339af84eb27e9276c81070071065db00036@https://dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubelet,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubelet
  - 1f7651ad0b50ef4561aa82e77f3ad06599b5e6b0b2a5fb6c4f474d95a77e41c5@https://dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubectl,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubectl
  - 5035d7814c95cd3cedbc5efb447ef25a4942ef05caab2159746d55ce1698c74a@https://artifacts.k8s.io/binaries/cloud-provider-aws/v1.27.1/linux/amd64/ecr-credential-provider-linux-amd64
  - f3a841324845ca6bf0d4091b4fc7f97e18a623172158b72fc3fdcdb9d42d2d37@https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-amd64-v1.2.0.tgz,https://github.com/containernetworking/plugins/releases/download/v1.2.0/cni-plugins-linux-amd64-v1.2.0.tgz
  - 02990fa281c0a2c4b073c6d2415d264b682bd693aa7d86c5d8eb4b86d684a18c@https://github.com/containerd/containerd/releases/download/v1.7.25/containerd-1.7.25-linux-amd64.tar.gz
  - e83565aa78ec8f52a4d2b4eb6c4ca262b74c5f6770c1f43670c3029c20175502@https://github.com/opencontainers/runc/releases/download/v1.2.4/runc.amd64
  - 71aee9d987b7fad0ff2ade50b038ad7e2356324edc02c54045960a3521b3e6a7@https://github.com/containerd/nerdctl/releases/download/v1.7.4/nerdctl-1.7.4-linux-amd64.tar.gz
  - d16a1ffb3938f5a19d5c8f45d363bd091ef89c0bc4d44ad16b933eede32fdcbb@https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
  arm64:
  - 7dfb8087ee0eff9a3f667e1ec749b5a57a0848e59ce9ed42ad00e7ece1c55274@https://dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubelet,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubelet
  - 7d45d9620e67095be41403ed80765fe47fcfbf4b4ed0bf0d1c8fe80345bda7d3@https://dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubectl,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubectl
  - b3d567bda9e2996fc1fbd9d13506bd16763d3865b5c7b0b3c4b48c6088c04481@https://artifacts.k8s.io/binaries/cloud-provider-aws/v1.27.1/linux/arm64/ecr-credential-provider-linux-arm64
  - 525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57@https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz,https://github.com/containernetworking/plugins/releases/download/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz
  - e9201d478e4c931496344b779eb6cb40ce5084ec08c8fff159a02cabb0c6b9bf@https://github.com/containerd/containerd/releases/download/v1.7.25/containerd-1.7.25-linux-arm64.tar.gz
  - 285f6c4c3de1d78d9f536a0299ae931219527b2ebd9ad89df5a1072896b7e82a@https://github.com/opencontainers/runc/releases/download/v1.2.4/runc.arm64
  - d8df47708ca57b9cd7f498055126ba7dcfc811d9ba43aae1830c93a09e70e22d@https://github.com/containerd/nerdctl/releases/download/v1.7.4/nerdctl-1.7.4-linux-arm64.tar.gz
  - 0b615cfa00c331fb9c4524f3d4058a61cc487b33a3436d1269e7832cf283f925@https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-arm64.tar.gz
CAs: {}
ClusterName: REDACTED
Hooks:
- null
- null
InstallCNIAssets: true
KeypairIDs:
  kubernetes-ca: "6817834704856542238920816512"
KubeProxy:
  clusterCIDR: 100.96.0.0/11
  cpuRequest: 100m
  image: registry.k8s.io/kube-proxy:v1.29.15@sha256:243026cfce3209b89d9f883789108276ffec87d98190ac2a77776edd4e0e6015
  logLevel: 2
KubeletConfig:
  anonymousAuth: false
  authenticationTokenWebhook: true
  authorizationMode: Webhook
  cgroupDriver: systemd
  cgroupRoot: /
  cloudProvider: external
  clusterDNS: 169.254.20.10
  clusterDomain: cluster.local
  enableDebuggingHandlers: true
  evictionHard: memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
  featureGates:
    InTreePluginAWSUnregister: "true"
  imageGCHighThresholdPercent: 75
  imageGCLowThresholdPercent: 60
  kubeconfigPath: /var/lib/kubelet/kubeconfig
  logLevel: 2
  nodeLabels:
    node-role.kubernetes.io/node: ""
  podInfraContainerImage: registry.k8s.io/pause:3.9@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
  podManifestPath: /etc/kubernetes/manifests
  protectKernelDefaults: true
  registerSchedulable: true
  shutdownGracePeriod: 30s
  shutdownGracePeriodCriticalPods: 10s
KubernetesVersion: 1.28.15
Networking:
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceClusterIPRange: 100.64.0.0/13
UpdatePolicy: automatic
UsesKubenet: true
containerdConfig:
  logLevel: info
  runc:
    version: 1.2.4
  version: 1.7.25
usesLegacyGossip: false
usesNoneDNS: false

  • Some components, like kube-proxy and crictl, are on 1.29, but KubernetesVersion and kubelet are still 1.28.
  • The corresponding nodeupconfig for the masters looks fine; everything there is correctly set to 1.29.
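
For reference, the nodeup config for an instance group can be pulled straight from the state store and checked for version mismatches with something along these lines (bucket, cluster, and instance-group names are placeholders, and the exact object key may differ):

# hypothetical paths: substitute your state-store bucket, cluster name, and instance-group name
aws s3 cp "s3://STATE_STORE_BUCKET/CLUSTER_NAME/igconfig/node/nodes/nodeupconfig.yaml" - \
  | grep -E 'KubernetesVersion|kubelet|kube-proxy'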

marek-obuchowicz avatar May 22 '25 15:05 marek-obuchowicz

If you run kops upgrade cluster once more, followed by your usual kops update cluster; terraform apply; kops rolling-update cluster commands, does it upgrade the remaining components to 1.29?
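
In other words, roughly this sequence (output directory and flags are illustrative):

kops upgrade cluster --yes
kops update cluster --target terraform --yes
cd out/terraform && terraform apply && cd ../..
kops rolling-update cluster --yes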

rifelpet avatar May 23 '25 06:05 rifelpet

I've run kops upgrade cluster - it proposed changing only the AMI images. After doing that, the files (like nodeupconfig) were generated correctly, with all masters/nodes on the same 1.29 version.

I then tried to upgrade to 1.30 and the same error happened. I did the following:

kops edit cluster (change `kubernetesVersion`)
kops upgrade cluster (nothing happened, `No upgrade required`)
kops update cluster --out terraform --yes

and it generated TF manifests with 1.30 (masters) and 1.29 (nodes).

I then reverted the change via kops edit cluster (setting kubernetesVersion back to 1.29) and ran the upgrade command with an explicit version instead of editing the cluster:

kops upgrade cluster --kubernetes-version 1.30.13 --yes
kops update cluster --target terraform --yes

and this did not work either - the nodeup manifest for the nodes still has 1.29.

So I went for it again:

  • ran kops rolling-update cluster --yes - it rolled all nodes, but the masters came up with 1.30 and the nodes with 1.29
  • ran kops upgrade cluster - it doesn't do anything besides reporting that cluster version "1.30.13" is greater than the desired version "1.30.12"

marek-obuchowicz avatar May 29 '25 11:05 marek-obuchowicz

I did one more step that worked :o

  • kops edit cluster -> replace 1.30.13 with 1.30.12
  • kops upgrade cluster -> No upgrade required
  • kops update cluster --out terraform --yes -> this has now successfully generated nodeup configs with 1.30.12

I believe this may be related to the stable channel: https://github.com/kubernetes/kops/blob/master/channels/stable#L129 - 1.30.13 isn't available there yet.

I will go on with updates on live cluster now... and will update here.

marek-obuchowicz avatar May 29 '25 11:05 marek-obuchowicz

The same happened on the production cluster. I believe I found a workaround: adding --ignore-kubelet-version-skew to kops update cluster seems to restore the expected behavior. This may be related to the new kops reconcile implementation, although that wasn't apparent from any documentation. For example, https://kops.sigs.k8s.io/operations/updates_and_upgrades/ does not mention this flag, and its default value (false) breaks existing processes.

marek-obuchowicz avatar May 29 '25 22:05 marek-obuchowicz

Thanks for investigating this. Can you confirm this is the sequence that works for you? If so, we can update the docs for using kops with terraform.

# update cluster spec with `kops upgrade cluster` or `kops edit cluster`
kops update cluster --out terraform --ignore-kubelet-version-skew --yes
terraform apply
kops rolling-update cluster --yes

We likely won't be able to improve terraform support for kops reconcile, given how interleaved the terraform commands need to be with the kops operations.

We may be able to recommend a sequence of terraform apply -target commands to apply the control plane's resources (aws_s3_object, aws_autoscaling_group, aws_launch_template, etc.) before applying the node resources.
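
As a rough sketch of that idea (the resource addresses below are hypothetical; the real names depend on the cluster and instance-group names in the generated terraform):

# apply the control-plane resources first
terraform apply \
  -target=aws_s3_object.nodeupconfig-master-us-east-1a \
  -target=aws_launch_template.master-us-east-1a-masters-example-com \
  -target=aws_autoscaling_group.master-us-east-1a-masters-example-com
# ...repeat for the remaining control-plane instance groups, roll the control plane,
# then run a plain `terraform apply` for the node resources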

rifelpet avatar Jun 12 '25 02:06 rifelpet

I was just bit by this. Using kops 1.32.0, I'm upgrading from k8s 1.30.8 to 1.31.9. I did kops edit cluster and modified kubernetesVersion. The first run of

kops update cluster --target terraform ..

produced nodeupconfig objects in S3 with the new version for control-plane nodes and the old version (1.30.8) for regular nodes. I applied terraform and ran kops rolling-update .. --instance-group-roles control-plane,apiserver to update my control-plane nodes.

A second run of

kops update cluster --target terraform .. (exactly the same command as above, no additional flags)

produced nodeupconfig objects in S3 with the new version (1.31.9) for the regular nodes. A second terraform apply and kops rolling-update .. then updated the worker nodes.

This isn't a bad way to handle the kubelet version-skew issue, especially if new flags could be added to kops update cluster --target terraform, like --update-control-plane and --update-workers, to make it explicit what is being updated. As it is right now, it seems the behavior I experienced is a coincidence rather than intended, although it worked well.
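
Condensed, the two-pass flow described above looks roughly like this (directories and flags are illustrative; this describes the observed behavior, not a documented procedure):

# pass 1: regenerate configs, then roll only the control plane
kops update cluster --target terraform --out out/terraform --yes
terraform -chdir=out/terraform apply
kops rolling-update cluster --instance-group-roles control-plane,apiserver --yes

# pass 2: rerun the same update; the worker nodeupconfig now picks up the new version
kops update cluster --target terraform --out out/terraform --yes
terraform -chdir=out/terraform apply
kops rolling-update cluster --yes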

azhelev avatar Jun 12 '25 14:06 azhelev

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 10 '25 14:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 10 '25 15:10 k8s-triage-robot

/remove-lifecycle rotten

geckofu avatar Oct 11 '25 13:10 geckofu