
[BUG] Scale out with NAP recently started failing

Open vikas-rajvanshy opened this issue 1 year ago • 9 comments

Describe the bug NodeClaims created by NAP are not launching, which causes scale out to fail. This seems to be a recent regression; describing the node claim leads to this message:

{ "error": { "code": "MissingApiVersionParameter", "message": "The api-version query parameter (?api-version=) is required for all requests." } }

To Reproduce Repros consistently on one of our clusters, but not the other. Perhaps this regression is starting to roll out.

Create a workload that needs to add nodes and uses NAP.

You will see the following message, but the node is never added to the cluster successfully. [Pod should schedule on: nodeclaim/default-x7kct]

kubectl describe nodeclaim -n kube-system

RESPO...
Reason:                 LaunchFailed
Status:                 False
Type:                   Launched
Last Transition Time:   2024-09-11T17:44:19Z
Message:                Node not launched
Reason:                 NotLaunched
Status:                 False
Type:                   Ready
Last Transition Time:   2024-09-11T17:44:19Z
Message:                Node not launched
Reason:                 NotLaunched
Status:                 False
Type:                   Registered
Events:
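For anyone triaging the same failure, the condition data above can also be pulled in one shot; a sketch (the NodeClaim name is taken from this repro, and `jq` is optional):

```shell
# NodeClaims are cluster-scoped in Karpenter/NAP, so no namespace is needed.
kubectl get nodeclaims

# Dump just the status conditions of the failing claim as JSON
# (claim name from the scheduling message above):
kubectl get nodeclaim default-x7kct -o jsonpath='{.status.conditions}' | jq .

# Surface launch failures via events across all namespaces:
kubectl get events -A --field-selector reason=LaunchFailed
```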

Expected behavior Nodes launch and scale out the workload as expected.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • Kubernetes version 1.30.3

vikas-rajvanshy avatar Sep 11 '24 18:09 vikas-rajvanshy

@tallaxes @Bryce-Soghigian

justindavies avatar Sep 13 '24 20:09 justindavies

Searched logs based on the nodeclaim you provided and found this error message on the PUT for the network interface:

"code": "CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool",\n    "message": "Mixing backend ipconfigurations and IPAddresses in backend pool /subscriptions/<REDACTED>/resourceGroups/<REDACTED>/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes is not allowed."

Bryce-Soghigian avatar Sep 13 '24 20:09 Bryce-Soghigian

Thanks for looking this up, Bryce. What could cause this to happen? Is there a setting in AKS that could cause it?

vikas-rajvanshy avatar Sep 13 '24 21:09 vikas-rajvanshy

Is this cluster (possibly unlike others) using IP-based SLB?

tallaxes avatar Sep 13 '24 21:09 tallaxes

I'm using a common bicep file to provision both of my clusters, so they should both have the same settings. I do have IP address pool management turned on (by using backend pool type = NodeIP); not sure if that could cause this. The cluster also uses Istio mesh and an ingress gateway.

vikas-rajvanshy avatar Sep 13 '24 21:09 vikas-rajvanshy

I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this

That's what I suspect

tallaxes avatar Sep 13 '24 22:09 tallaxes
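For context, the SLB backend pool type tallaxes is asking about can be inspected (and, on recent CLI versions, changed) with the Azure CLI; a sketch with placeholder resource names, noting that the reporter later found applying the change required rebuilding the cluster:

```shell
# Show whether the cluster's load balancer uses IP-based (nodeIP) or
# IP-configuration-based (nodeIPConfiguration) backend pools:
az aks show -g <resource-group> -n <cluster-name> \
  --query networkProfile.loadBalancerProfile.backendPoolType -o tsv

# Switch back to IP-configuration-based backend pools (flag name per
# recent az aks docs; verify with `az aks update --help`):
az aks update -g <resource-group> -n <cluster-name> \
  --load-balancer-backend-pool-type nodeIPConfiguration
```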

Thanks for the suggestion - I'll try turning it off later this evening to see if it mitigates the issue.

vikas-rajvanshy avatar Sep 13 '24 23:09 vikas-rajvanshy

I tried the mitigation - applying the fix required me to tear down and rebuild the cluster. It seemed to be working fine for 3-4 days and then I ran into a similar set of symptoms again this morning. The logs look different this time though.

NodeClaims fail with:

- lastTransitionTime: '2024-09-19T19:35:34Z'
  message: Node not registered with cluster
  reason: NodeNotFound
  status: 'False'
  type: Registered

Any ideas? Could this be related to https://github.com/Azure/AKS/issues/4545?

vikas-rajvanshy avatar Sep 19 '24 20:09 vikas-rajvanshy

The only way to find out if it's related to the other issue is to either:

  • check if we are talking about the same node image (the ubuntu2204 from 13.09)
  • SSH into the instance and verify whether the kubelet file is missing the default IMDS environment variables

Node not registered / not found issues are often caused by a connectivity problem between the node's kubelet and the API server. I would suggest making sure your firewall allows this traffic. Looking at the kubelet's logs gives the answer most of the time.

CCOLLOT avatar Sep 20 '24 06:09 CCOLLOT
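A sketch of the checks suggested above, assuming `kubectl debug` access to the node (the node name and API server FQDN are placeholders):

```shell
# Get a shell on the stuck node without SSH (node name is a placeholder):
kubectl debug node/<node-name> -it --image=busybox
# Inside the debug pod, the node filesystem is mounted at /host:
chroot /host

# Check kubelet health and recent logs for registration errors:
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | tail -n 50

# Verify the node can reach the API server
# (FQDN is in the node's kubeconfig under /etc/kubernetes):
curl -vk https://<apiserver-fqdn>:443/healthz
```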

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @Bryce-Soghigian, @pavneeta, @AllenWen-at-Azure

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. vikas-rajvanshy, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.