
[BUG] Scale out with NAP recently started failing

Open vikas-rajvanshy opened this issue 1 year ago • 9 comments

Describe the bug NodeClaims created by NAP are not launching, which causes scale out to fail. This seems to be a recent regression; describing the node claim leads to this message:

{ "error": { "code": "MissingApiVersionParameter", "message": "The api-version query parameter (?api-version=) is required for all requests." } }

To Reproduce Repros consistently on one of our clusters, but not the other. Perhaps this regression is starting to roll out.

Create a workload that needs to add nodes and uses NAP.

You will see the following message, but the node is never added to the cluster successfully. [Pod should schedule on: nodeclaim/default-x7kct]

kubectl describe nodeclaim -n kube-system

RESPO...
Reason:                 LaunchFailed
Status:                 False
Type:                   Launched
Last Transition Time:   2024-09-11T17:44:19Z
Message:                Node not launched
Reason:                 NotLaunched
Status:                 False
Type:                   Ready
Last Transition Time:   2024-09-11T17:44:19Z
Message:                Node not launched
Reason:                 NotLaunched
Status:                 False
Type:                   Registered
Events:
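For anyone triaging the same failure, the condition data above can also be pulled in one shot; a sketch (the NodeClaim name is taken from this repro, and `jq` is optional):

```shell
# NodeClaims are cluster-scoped in Karpenter/NAP, so no namespace is needed.
kubectl get nodeclaims

# Dump just the status conditions of the failing claim as JSON
# (claim name from the scheduling message above):
kubectl get nodeclaim default-x7kct -o jsonpath='{.status.conditions}' | jq .

# Surface launch failures via events across all namespaces:
kubectl get events -A --field-selector reason=LaunchFailed
```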

Expected behavior Nodes launch and scale out the workload as expected.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • Kubernetes version 1.30.3

vikas-rajvanshy avatar Sep 11 '24 18:09 vikas-rajvanshy

@tallaxes @Bryce-Soghigian

justindavies avatar Sep 13 '24 20:09 justindavies

Searched logs based on the nodeclaim you provided and found this error message on the PUT for the network interface:

"code": "CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool",\n    "message": "Mixing backend ipconfigurations and IPAddresses in backend pool /subscriptions/<REDACTED>/resourceGroups/<REDACTED>/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes is not allowed."

Bryce-Soghigian avatar Sep 13 '24 20:09 Bryce-Soghigian

Thanks for looking this up, Bryce. What could cause this to happen? Is there a setting in AKS that could cause it?

vikas-rajvanshy avatar Sep 13 '24 21:09 vikas-rajvanshy

Is this cluster (possibly unlike others) using IP-based SLB?

tallaxes avatar Sep 13 '24 21:09 tallaxes

I'm using a common bicep file to provision both of my clusters, so they should both have the same settings. I do have IP address pool management turned on (by using backend pool type = NodeIP); not sure if that could cause this. The cluster also uses Istio mesh and an ingress gateway.

vikas-rajvanshy avatar Sep 13 '24 21:09 vikas-rajvanshy

I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this

That's what I suspect

tallaxes avatar Sep 13 '24 22:09 tallaxes
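For context, the SLB backend pool type tallaxes is asking about can be inspected (and, on recent CLI versions, changed) with the Azure CLI; a sketch with placeholder resource names, noting that the reporter later found applying the change required rebuilding the cluster:

```shell
# Show whether the cluster's load balancer uses IP-based (nodeIP) or
# IP-configuration-based (nodeIPConfiguration) backend pools:
az aks show -g <resource-group> -n <cluster-name> \
  --query networkProfile.loadBalancerProfile.backendPoolType -o tsv

# Switch back to IP-configuration-based backend pools (flag name per
# recent az aks docs; verify with `az aks update --help`):
az aks update -g <resource-group> -n <cluster-name> \
  --load-balancer-backend-pool-type nodeIPConfiguration
```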

Thanks for the suggestion - I'll try turning it off later this evening to see if it mitigates the issue.

vikas-rajvanshy avatar Sep 13 '24 23:09 vikas-rajvanshy

I tried the mitigation - applying the fix required me to tear down and rebuild the cluster. It seemed to be working fine for 3-4 days and then I ran into a similar set of symptoms again this morning. The logs look different this time though.

NodeClaims fail with:

- lastTransitionTime: '2024-09-19T19:35:34Z'
  message: Node not registered with cluster
  reason: NodeNotFound
  status: 'False'
  type: Registered

Any ideas? Could this be related to https://github.com/Azure/AKS/issues/4545?

vikas-rajvanshy avatar Sep 19 '24 20:09 vikas-rajvanshy

The only way to find out if it's related to the other issue is to either:

  • check if we are talking about the same node image (the ubuntu2204 from 13.09)
  • SSH into the instance and verify whether the kubelet file is missing the default IMDS environment variables

Node not registered / not found issues are often caused by a connectivity problem between the node's kubelet and the API server. I would suggest making sure your firewall allows this traffic. Looking at the kubelet's logs gives the answer most of the time.

CCOLLOT avatar Sep 20 '24 06:09 CCOLLOT
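A sketch of the checks suggested above, assuming `kubectl debug` access to the node (the node name and API server FQDN are placeholders):

```shell
# Get a shell on the stuck node without SSH (node name is a placeholder):
kubectl debug node/<node-name> -it --image=busybox
# Inside the debug pod, the node filesystem is mounted at /host:
chroot /host

# Check kubelet health and recent logs for registration errors:
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | tail -n 50

# Verify the node can reach the API server
# (FQDN is in the node's kubeconfig under /etc/kubernetes):
curl -vk https://<apiserver-fqdn>:443/healthz
```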

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @Bryce-Soghigian, @pavneeta, @AllenWen-at-Azure

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. vikas-rajvanshy, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.