Worker nodes are not updated by kops upgrade / tf apply / kops rolling-update
/kind bug
1. What kops version are you running? The command kops version will display this information.
Client version: 1.32.0 (git-v1.32.0)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Initially 1.27.16, then upgraded to v1.28.15, followed by an upgrade to v1.29.15.
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops edit cluster [changed kubernetesVersion `1.27.16` to `1.28.15`]
kops update cluster --target terraform
cd out/terraform; terraform plan; terraform apply; cd ../..
kops rolling-update cluster --validation-timeout 2h --yes
For the next update cycle, 1.28.15 to 1.29.15, I did the same process as above.
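Spelled out with what each step changes (a sketch; assumes KOPS_STATE_STORE is exported and the terraform working dir is out/terraform):
kops edit cluster                                          # bump spec.kubernetesVersion
kops update cluster --target terraform                     # regenerates out/terraform and the nodeup config payloads
cd out/terraform; terraform plan; terraform apply; cd ../..   # updates launch templates, ASGs, managed S3 objects
kops rolling-update cluster --validation-timeout 2h --yes     # replaces instances so they boot with the new config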
5. What happened after the commands executed? Just after rolling-update rotated one control-plane node:
$ kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
i-02ca04220fdea027a Ready control-plane,spot-worker 45d v1.27.16 172.20.65.80 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.22
i-044d3c800ebd9182f Ready node,spot-worker 3d12h v1.27.16 172.20.41.249 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.22
i-049c7e07ada7bd67f Ready control-plane,spot-worker 79s v1.28.15 172.20.34.8 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.25
i-091c6621ef56fb552 Ready control-plane,spot-worker 105d v1.27.16 172.20.155.6 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.22
i-0e7b6c4d5e0cc899d Ready node,spot-worker 42h v1.27.16 172.20.92.38 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.22
After kops rolling-update cluster ... finished for the 1.27 -> 1.28 upgrade, all nodes had been rotated and containerd had been upgraded everywhere from 1.7.22 to 1.7.25. However, the k8s version was upgraded only on the control plane - worker nodes still run the old k8s version:
$ kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
i-02fdccb13d452b12c Ready node,spot-worker 22m v1.27.16 172.20.49.100 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.25
i-049c7e07ada7bd67f Ready control-plane,spot-worker 44m v1.28.15 172.20.34.8 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.25
i-067307066010c55d8 Ready control-plane,spot-worker 37m v1.28.15 172.20.157.199 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.25
i-0954159b2b57b93f8 Ready node,spot-worker 27m v1.27.16 172.20.70.145 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.25
i-0f9a889b6d013dc78 Ready control-plane,spot-worker 31m v1.28.15 172.20.78.182 --- Ubuntu 22.04.5 LTS 6.8.0-1021-aws containerd://1.7.25
I tried running rolling-update again, but none of the nodes were in NeedsUpdate status.
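For reference, the dry run (omitting --yes) prints the per-instance-group status table, which makes the skew easy to see next to what the nodes actually report:
kops rolling-update cluster        # dry run: shows how many instances per group need updating
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'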
As a next step, I upgraded to v1.29.15 (same process as described above). Long story short: after rolling-update, the control plane is running v1.29.15 and the nodes are now upgraded to 1.28.15 - but not to 1.29. So somehow the control plane always ends up one version ahead of the nodes. Manually rotating the worker nodes didn't fix the issue.
6. What did you expect to happen?
I expected the version upgrade to work, with nodes running the same k8s version as the control plane.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2017-06-29T12:09:24Z"
  generation: 57
  name: REDACTED
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "ecr:DescribeImages",
            "ecr:BatchGetImage",
            "ecr:InitiateLayerUpload",
            "ecr:UploadLayerPart",
            "ecr:CompleteLayerUpload",
            "ecr:PutImage",
            "ecr:CreateRepository"
          ],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "elasticfilesystem:DescribeAccessPoints",
            "elasticfilesystem:DescribeFileSystems",
            "elasticfilesystem:DescribeMountTargets",
            "ec2:DescribeAvailabilityZones"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "elasticfilesystem:CreateAccessPoint"
          ],
          "Resource": "*",
          "Condition": {
            "StringLike": {
              "aws:RequestTag/efs.csi.aws.com/cluster": "true"
            }
          }
        },
        {
          "Effect": "Allow",
          "Action": "elasticfilesystem:DeleteAccessPoint",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aws:ResourceTag/efs.csi.aws.com/cluster": "true"
            }
          }
        }
      ]
  api:
    dns: {}
  authorization:
    alwaysAllow: {}
  awsLoadBalancerController:
    enabled: true
  certManager:
    enabled: true
    managed: false
  channel: stable
  cloudProvider: aws
  configBase: s3://REDACTED/REDACTED
  containerRuntime: containerd
  dnsZone: REDACTED
  etcdClusters:
  - backups:
      backupStore: s3://REDACTED/REDACTED/backups/etcd/main/
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1b
      name: b
    manager:
      backupRetentionDays: 90
    name: main
    provider: Manager
  - backups:
      backupStore: s3://REDACTED/REDACTED/backups/etcd/events/
    etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    - instanceGroup: master-us-east-1c
      name: c
    - instanceGroup: master-us-east-1b
      name: b
    manager:
      backupRetentionDays: 90
    name: events
    provider: Manager
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    enableAdmissionPlugins:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - NodeRestriction
    - PersistentVolumeClaimResize
    - ResourceQuota
  kubeDNS:
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    cgroupDriver: systemd
    imageGCHighThresholdPercent: 75
    imageGCLowThresholdPercent: 60
  kubernetesApiAccess:
  - REDACTED
  - REDACTED
  - REDACTED
  kubernetesVersion: 1.29.15
  masterPublicName: REDACTED
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nodeProblemDetector:
    cpuRequest: 10m
    enabled: true
    memoryRequest: 32Mi
  nodeTerminationHandler:
    enableRebalanceMonitoring: false
    enableSQSTerminationDraining: true
    enabled: true
    managedASGTag: aws-node-termination-handler/managed
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-1a
    type: Public
    zone: us-east-1a
  - cidr: 172.20.64.0/19
    name: us-east-1c
    type: Public
    zone: us-east-1c
  - cidr: 172.20.96.0/19
    name: us-east-1e
    type: Public
    zone: us-east-1e
  - cidr: 172.20.128.0/19
    name: us-east-1b
    type: Public
    zone: us-east-1b
  - cidr: 172.20.160.0/19
    name: us-east-1d
    type: Public
    zone: us-east-1d
  - cidr: 172.20.192.0/19
    name: us-east-1f
    type: Public
    zone: us-east-1f
  topology:
    dns:
      type: Public
9. Anything else we need to know?
As we use terraform, kops reconcile is not an option, so I've been using the old process.
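(For contrast: on non-terraform targets, kops >= 1.31 collapses this whole loop into a single command that interleaves per-instance-group updates and rolls - but it produces no terraform output, so it doesn't fit our workflow:)
kops reconcile cluster --yes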
I've checked the nodeup config for the worker instance group (in the S3 state store, under igconfig/node/nodes/nodeupconfig) and the versions are mixed up there:
Assets:
  amd64:
  - b07a27fd5bd2419c9c623de15c1dd339af84eb27e9276c81070071065db00036@https://dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubelet,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubelet
  - 1f7651ad0b50ef4561aa82e77f3ad06599b5e6b0b2a5fb6c4f474d95a77e41c5@https://dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubectl,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/amd64/kubectl
  - 5035d7814c95cd3cedbc5efb447ef25a4942ef05caab2159746d55ce1698c74a@https://artifacts.k8s.io/binaries/cloud-provider-aws/v1.27.1/linux/amd64/ecr-credential-provider-linux-amd64
  - f3a841324845ca6bf0d4091b4fc7f97e18a623172158b72fc3fdcdb9d42d2d37@https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-amd64-v1.2.0.tgz,https://github.com/containernetworking/plugins/releases/download/v1.2.0/cni-plugins-linux-amd64-v1.2.0.tgz
  - 02990fa281c0a2c4b073c6d2415d264b682bd693aa7d86c5d8eb4b86d684a18c@https://github.com/containerd/containerd/releases/download/v1.7.25/containerd-1.7.25-linux-amd64.tar.gz
  - e83565aa78ec8f52a4d2b4eb6c4ca262b74c5f6770c1f43670c3029c20175502@https://github.com/opencontainers/runc/releases/download/v1.2.4/runc.amd64
  - 71aee9d987b7fad0ff2ade50b038ad7e2356324edc02c54045960a3521b3e6a7@https://github.com/containerd/nerdctl/releases/download/v1.7.4/nerdctl-1.7.4-linux-amd64.tar.gz
  - d16a1ffb3938f5a19d5c8f45d363bd091ef89c0bc4d44ad16b933eede32fdcbb@https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
  arm64:
  - 7dfb8087ee0eff9a3f667e1ec749b5a57a0848e59ce9ed42ad00e7ece1c55274@https://dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubelet,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubelet
  - 7d45d9620e67095be41403ed80765fe47fcfbf4b4ed0bf0d1c8fe80345bda7d3@https://dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubectl,https://cdn.dl.k8s.io/release/v1.28.15/bin/linux/arm64/kubectl
  - b3d567bda9e2996fc1fbd9d13506bd16763d3865b5c7b0b3c4b48c6088c04481@https://artifacts.k8s.io/binaries/cloud-provider-aws/v1.27.1/linux/arm64/ecr-credential-provider-linux-arm64
  - 525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57@https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz,https://github.com/containernetworking/plugins/releases/download/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz
  - e9201d478e4c931496344b779eb6cb40ce5084ec08c8fff159a02cabb0c6b9bf@https://github.com/containerd/containerd/releases/download/v1.7.25/containerd-1.7.25-linux-arm64.tar.gz
  - 285f6c4c3de1d78d9f536a0299ae931219527b2ebd9ad89df5a1072896b7e82a@https://github.com/opencontainers/runc/releases/download/v1.2.4/runc.arm64
  - d8df47708ca57b9cd7f498055126ba7dcfc811d9ba43aae1830c93a09e70e22d@https://github.com/containerd/nerdctl/releases/download/v1.7.4/nerdctl-1.7.4-linux-arm64.tar.gz
  - 0b615cfa00c331fb9c4524f3d4058a61cc487b33a3436d1269e7832cf283f925@https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-arm64.tar.gz
CAs: {}
ClusterName: REDACTED
Hooks:
- null
- null
InstallCNIAssets: true
KeypairIDs:
  kubernetes-ca: "6817834704856542238920816512"
KubeProxy:
  clusterCIDR: 100.96.0.0/11
  cpuRequest: 100m
  image: registry.k8s.io/kube-proxy:v1.29.15@sha256:243026cfce3209b89d9f883789108276ffec87d98190ac2a77776edd4e0e6015
  logLevel: 2
KubeletConfig:
  anonymousAuth: false
  authenticationTokenWebhook: true
  authorizationMode: Webhook
  cgroupDriver: systemd
  cgroupRoot: /
  cloudProvider: external
  clusterDNS: 169.254.20.10
  clusterDomain: cluster.local
  enableDebuggingHandlers: true
  evictionHard: memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
  featureGates:
    InTreePluginAWSUnregister: "true"
  imageGCHighThresholdPercent: 75
  imageGCLowThresholdPercent: 60
  kubeconfigPath: /var/lib/kubelet/kubeconfig
  logLevel: 2
  nodeLabels:
    node-role.kubernetes.io/node: ""
  podInfraContainerImage: registry.k8s.io/pause:3.9@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
  podManifestPath: /etc/kubernetes/manifests
  protectKernelDefaults: true
  registerSchedulable: true
  shutdownGracePeriod: 30s
  shutdownGracePeriodCriticalPods: 10s
KubernetesVersion: 1.28.15
Networking:
  nonMasqueradeCIDR: 100.64.0.0/10
  serviceClusterIPRange: 100.64.0.0/13
UpdatePolicy: automatic
UsesKubenet: true
containerdConfig:
  logLevel: info
  runc:
    version: 1.2.4
  version: 1.7.25
usesLegacyGossip: false
usesNoneDNS: false
- some components, like kube-proxy and crictl, are on 1.29, but KubernetesVersion and kubelet are still on 1.28
- the corresponding nodeupconfig for masters looks fine - everything there is correctly set to 1.29
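A quick way to compare the two configs side by side (a sketch - bucket and prefix follow the layout above; the exact S3 keys, and whether the control-plane segment is control-plane or master, may differ by kops version):
aws s3 cp s3://REDACTED/REDACTED/igconfig/node/nodes/nodeupconfig.yaml - | grep KubernetesVersion
aws s3 cp s3://REDACTED/REDACTED/igconfig/control-plane/master-us-east-1a/nodeupconfig.yaml - | grep KubernetesVersion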
If you run kops upgrade cluster once more, followed by your usual kops update cluster; terraform apply; kops rolling-update cluster commands, does it upgrade the remaining components to 1.29?
I've run kops upgrade cluster - it proposed changing the AMI images (only). After doing that, the files (like nodeupconfig) were generated correctly - all masters/nodes on the same 1.29 version.
I then tried to upgrade to 1.30 and the same error happened. I did the following:
kops edit cluster (change `kubernetesVersion`)
kops upgrade cluster (nothing happened, `No upgrade required`)
kops update cluster --out terraform --yes
and it generated TF manifests with 1.30 (masters) and 1.29 (nodes)
I then reverted the change via kops edit cluster (kubernetesVersion back to 1.29) and ran the upgrade command with an explicit version instead of editing the cluster:
kops upgrade cluster --kubernetes-version 1.30.13 --yes
kops update cluster --target terraform --yes
and this did not work either - the nodeup manifest for nodes still has 1.29.
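One way to catch this before applying is to grep the generated terraform payloads (assuming the usual layout where kops writes managed file contents under out/terraform/data/):
grep -r 'KubernetesVersion:' out/terraform/data/ | sort -u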
So I went for it again:
- ran kops rolling-update cluster --yes - it rolled all nodes, but masters came up with 1.30 and nodes with 1.29
- ran kops upgrade cluster - it doesn't do anything, besides reporting cluster version "1.30.13" is greater than the desired version "1.30.12"
I did one more step that worked :o
kops edit cluster -> replace 1.30.13 with 1.30.12
kops upgrade cluster -> No upgrade required
kops update cluster --out terraform --yes -> this has now successfully generated nodeup configs with 1.30.12
I believe this may be related to this: https://github.com/kubernetes/kops/blob/master/channels/stable#L129 - 1.30.13 isn't available there yet.
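(The versions the stable channel currently publishes can be checked directly; something like this should show the recommended version per range:)
curl -s https://raw.githubusercontent.com/kubernetes/kops/master/channels/stable | grep -E 'range:|recommendedVersion:'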
I will go on with updates on the live cluster now... and will update here.
The same has happened on the production cluster. I believe I found a workaround: adding --ignore-kubelet-version-skew to kops update cluster seems to restore the expected behavior. This may be related to the new kops reconcile implementation, though nothing in the documentation suggests so. For example, https://kops.sigs.k8s.io/operations/updates_and_upgrades/ does not mention this flag - and its default value (false) breaks existing processes.
Thanks for investigating this. Can you confirm this is the sequence that works for you? If so, we can update the docs for kops with terraform.
# update cluster spec with `kops upgrade cluster` or `kops edit cluster`
kops update cluster --out terraform --ignore-kubelet-version-skew --yes
terraform apply
kops rolling-update cluster --yes
We likely won't be able to improve terraform support for kops reconcile, given how interleaved the terraform commands need to be with the kops operations.
We may be able to recommend a sequence of terraform apply -target commands to apply the control plane's resources (aws_s3_object, aws_autoscaling_group, aws_launch_template, etc.) before applying the node resources, as sketched below.
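Roughly something like this (resource addresses are illustrative - the real names come from the generated kubernetes.tf and embed the cluster name):
# phase 1: control plane only
terraform apply \
  -target='aws_s3_object.nodeupconfig-master-us-east-1a' \
  -target='aws_launch_template.master-us-east-1a-masters-REDACTED' \
  -target='aws_autoscaling_group.master-us-east-1a-masters-REDACTED'
kops rolling-update cluster --instance-group-roles control-plane --yes
# phase 2: everything else
terraform apply
kops rolling-update cluster --yes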
I was just bitten by this, using kops 1.32.0.
I'm upgrading from k8s 1.30.8 to 1.31.9.
I did kops edit cluster and modified kubernetesVersion
The first run of kops update cluster --target terraform .. produced nodeupconfig objects in S3 with the new version for control-plane nodes and the old version (1.30.8) for regular nodes.
I applied terraform and did kops rolling-update .. --instance-group-roles control-plane,apiserver to update my control plane nodes.
A second run of kops update cluster --target terraform .. (exactly the same command as above, no additional flags) produced nodeupconfig objects in S3 with the new version (1.31.9) for the regular nodes.
A second terraform apply and kops rolling-update .. then updated the worker nodes; the full sequence is consolidated below.
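Consolidated, the two-pass sequence that worked (commands as described above; .. stands for the usual cluster/state-store flags):
# pass 1: only the control plane's nodeupconfig gets the new version
kops update cluster --target terraform ..
terraform apply
kops rolling-update cluster --instance-group-roles control-plane,apiserver --yes
# pass 2: re-running update now emits the new version for the workers too
kops update cluster --target terraform ..
terraform apply
kops rolling-update cluster --yes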
This isn't a bad way to handle the kubelet version skew issue, especially if new flags could be added to kops update cluster --target terraform, like --update-control-plane and --update-workers, to make it explicit what we're updating.
As it is right now, it seems the current behavior I experienced is a coincidence rather than intended, although it worked well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten