Pods with disks can schedule to the wrong node
We should use `nodeSelector`s to ensure that pods with disks are scheduled onto the correct nodes (either nodes with no zone, or nodes in the matching zone).
Kubeflow and other systems have some pods that depend on specific node locations.
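As a rough illustration (the pod name, image, and claim name below are placeholders, not the actual AAW manifests), pinning a pod with a zonal disk to the matching zone with a `nodeSelector` would look something like this:

```yaml
# Minimal sketch (hypothetical names): schedule a pod that mounts a zonal disk
# only onto nodes labelled with the matching topology.kubernetes.io/zone value.
apiVersion: v1
kind: Pod
metadata:
  name: example-with-disk
spec:
  nodeSelector:
    topology.kubernetes.io/zone: canadacentral-1
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc
```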
The shared-daaas-system Elasticsearch nodes have been having this problem for a while, but we seem to be stuck now: there no longer appear to be any nodes in zone '1' (and only a temp node in zone '0' currently), as they have been switched over to canadacentral-x.
```
HTTPStatusCode: 400, RawError: {
  "error": {
    "code": "BadRequest",
    "message": "Disk /subscriptions/9f29402c-64f1-4691-853c-a14607472bdc/resourceGroups/aaw-prod-cc-00-rg-aks-managed/providers/Microsoft.Compute/disks/restore-52eecf64-79d8-4c69-9354-7e432c99cdc9 cannot be attached to the VM because it is not in zone '1'."
  }
}
```
Is `zone: canadacentral-1` the same as the previous `zone: 1`? Where is the zone mapping configured? Is it on the volume itself?
c.c. @chuckbelisle
This is where AAW prod is in a bit of an odd place: the 'zone 1' disks are the ones that were migrated from the old environment and are technically not in an availability zone. Therefore, the workloads using them need to be scheduled on one of the temp nodes.
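For context, and as a hedged sketch rather than the exact AAW objects: with the Azure Disk CSI driver, a zonal disk's PersistentVolume typically records its zone as node affinity on the PV itself (the exact label key depends on the provisioner), which is why attachment fails when no node carries a matching zone label. Something along these lines:

```yaml
# Hedged sketch: how a zonal Azure disk PV typically records its zone.
# Names, sizes, and the volumeHandle are illustrative placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-zonal-pv
spec:
  capacity:
    storage: 64Gi
  accessModes:
  - ReadWriteOnce
  csi:
    driver: disk.csi.azure.com
    volumeHandle: /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/<disk-name>
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - canadacentral-1
```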
Likely the workloads are missing `nodeSelector`s, which would force them onto the right nodes at scheduling time.
@zachomedia thanks for the background. I was going to try the nodeSelector approach for this specific issue, but there doesn't seem to be any temp node in zone 1 anymore, only zone 0... how are those nodes provisioned? :)
Also, is there a plan to migrate the storage so we don't need the temp nodes anymore?
@vexingly I don't believe the zone matters on the temp nodes because they are not officially attached to a zone.
Oh, it's confusing that they are labeled with a zone... but you're right it doesn't seem to matter for accessing the storage! 👍
@zachomedia are you looking at this for this sprint?
I think @vexingly has it under control, but if anyone needs my help I'm around :)
The shared-daaas-system Elasticsearch has the nodeSelector now and is working well. I don't know of other pods that need something similar, but if we see any of them start to fail we can fix them up then.
This issue arose once again on Dec. 2nd, 2023 during the upgrade to Kubernetes 1.26.
While trying to fix it by setting nodeSelectors, it turned out that there are policies in place that strip them from some workloads. To get around this, a nodeAffinity was set on the PVs that are not zonal (mostly just disks restored from the previous cluster):
```yaml
nodeAffinity:
  required:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - '0'
```
Warning:
`nodeAffinities` are immutable. Use caution when setting them, and set `persistentVolumeReclaimPolicy` to `Retain` if you need to delete and recreate the PV/PVC.
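For reference, a minimal sketch of what that looks like on the PV spec (the PV name and sizes below are placeholders, not the actual restored disks):

```yaml
# Hedged sketch (hypothetical PV name): Retain keeps the underlying Azure disk
# when the PV/PVC objects are deleted, so the PV can be recreated with new nodeAffinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restore-example-pv
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 32Gi
  accessModes:
  - ReadWriteOnce
  # ... remainder of the original PV spec (disk source, nodeAffinity, etc.)
```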