Pods with disks can schedule to the wrong node
We should use `nodeSelector`s to ensure that pods with disks are scheduled onto the correct nodes (either nodes with no zone, or nodes in the matching zone).
Kubeflow and other systems have some pods that depend on specific node locations.
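As a rough illustration (the pod name, image, and claim name below are placeholders, not the actual AAW manifests), pinning a pod with a zonal disk to the matching zone with a `nodeSelector` would look something like this:

```yaml
# Minimal sketch (hypothetical names): schedule a pod that mounts a zonal disk
# only onto nodes labelled with the matching topology.kubernetes.io/zone value.
apiVersion: v1
kind: Pod
metadata:
  name: example-with-disk
spec:
  nodeSelector:
    topology.kubernetes.io/zone: canadacentral-1
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc
```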
The shared-daaas-system Elasticsearch nodes have been having this problem for a while, but we seem to be stuck now: there no longer appear to be any nodes in zone '1' (and only a temp node in zone '0' currently), as they have been switched over to canadacentral-x.
```
HTTPStatusCode: 400, RawError: {
  "error": {
    "code": "BadRequest",
    "message": "Disk /subscriptions/9f29402c-64f1-4691-853c-a14607472bdc/resourceGroups/aaw-prod-cc-00-rg-aks-managed/providers/Microsoft.Compute/disks/restore-52eecf64-79d8-4c69-9354-7e432c99cdc9 cannot be attached to the VM because it is not in zone '1'."
  }
}
```
Is `zone: canadacentral-1` the same as the previous `zone: 1`? Where is the zone mapping configured? Is it on the volume itself?
c.c. @chuckbelisle
This is where AAW prod is in a bit of an odd place: the 'zone 1' disks are the ones that were migrated from the old environment and are technically not in an availability zone. Therefore, the workloads using them need to be scheduled on one of the temp nodes.
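For context, and as a hedged sketch rather than the exact AAW objects: with the Azure Disk CSI driver, a zonal disk's PersistentVolume typically records its zone as node affinity on the PV itself (the exact label key depends on the provisioner), which is why attachment fails when no node carries a matching zone label. Something along these lines:

```yaml
# Hedged sketch: how a zonal Azure disk PV typically records its zone.
# Names, sizes, and the volumeHandle are illustrative placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-zonal-pv
spec:
  capacity:
    storage: 64Gi
  accessModes:
  - ReadWriteOnce
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<disk-name>
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - canadacentral-1
```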
Likely the workloads are missing `nodeSelector`s, which would force them onto the right nodes at scheduling time.
@zachomedia thanks for the background. I was going to try the nodeSelector approach for this specific issue, but there doesn't seem to be any temp node in zone 1 anymore, only zone 0... how are those nodes provisioned? :)
Also, is there a plan to migrate the storage so we don't need the temp nodes anymore?
@vexingly I don't believe the zone matters on the temp nodes because they are not officially attached to a zone.
Oh, it's confusing that they are labeled with a zone... but you're right it doesn't seem to matter for accessing the storage! 👍
@zachomedia are you looking at this for this sprint?
I think @vexingly has it under control, but if anyone needs my help I'm around :)
The shared-daaas-system Elasticsearch has the nodeSelector now and is working well. I don't know of other pods that need something similar, but if we see any of them start to fail we can fix them up then.
This issue arose once again on Dec. 2nd, 2023 during the upgrade to Kubernetes 1.26.
While trying to fix it by setting nodeSelectors, it turned out that there are policies in place that strip them from some workloads. To get around this, a nodeAffinity was set on the PVs that are not zonal (mostly just disks restored from the previous cluster):
```yaml
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - '0'
```
Warning:
`nodeAffinities` are immutable. Use caution when setting them, and set `persistentVolumeReclaimPolicy` to `Retain` if you need to delete and recreate the PV/PVC.
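For reference, a minimal sketch of what that looks like on the PV spec (the PV name and sizes below are placeholders, not the actual restored disks):

```yaml
# Hedged sketch (hypothetical PV name): Retain keeps the underlying Azure disk
# when the PV/PVC objects are deleted, so the PV can be recreated with new nodeAffinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restore-example-pv
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 32Gi
  accessModes:
  - ReadWriteOnce
  # ... remainder of the original PV spec (disk source, nodeAffinity, etc.)
```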