[BUG] Scale out with NAP recently started failing
**Describe the bug**
NodeClaims created by NAP are not launching, which causes scale out to fail. This appears to be a recent regression; describing the node claim shows this message:
```json
{
  "error": {
    "code": "MissingApiVersionParameter",
    "message": "The api-version query parameter (?api-version=) is required for all requests."
  }
}
```
**To Reproduce**
Repros consistently on one of our clusters, but not the other; perhaps the regression is still rolling out.
Create a workload that needs to add nodes and uses NAP. You will see the following event, but the node is never added to the cluster successfully:

`Pod should schedule on: nodeclaim/default-x7kct`
```
$ kubectl describe nodeclaim
  Message:                RESPO...
  Reason:                 LaunchFailed
  Status:                 False
  Type:                   Launched
  Last Transition Time:   2024-09-11T17:44:19Z
  Message:                Node not launched
  Reason:                 NotLaunched
  Status:                 False
  Type:                   Ready
  Last Transition Time:   2024-09-11T17:44:19Z
  Message:                Node not launched
  Reason:                 NotLaunched
  Status:                 False
  Type:                   Registered
Events:
```
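For triage, the failing conditions above can also be pulled out programmatically. A minimal sketch, assuming the `.status.conditions` shape returned by `kubectl get nodeclaim -o json` (the sample data below is transcribed from the describe output; field names follow the standard Kubernetes condition schema):

```python
import json

# Sample status transcribed from the describe output above; in practice this
# would come from `kubectl get nodeclaim <name> -o json` (.status).
nodeclaim_status = json.loads("""
{
  "conditions": [
    {"type": "Launched",   "status": "False", "reason": "LaunchFailed",
     "lastTransitionTime": "2024-09-11T17:44:19Z"},
    {"type": "Ready",      "status": "False", "reason": "NotLaunched",
     "message": "Node not launched", "lastTransitionTime": "2024-09-11T17:44:19Z"},
    {"type": "Registered", "status": "False", "reason": "NotLaunched",
     "message": "Node not launched", "lastTransitionTime": "2024-09-11T17:44:19Z"}
  ]
}
""")

def failing_conditions(status: dict) -> list[dict]:
    """Return every condition whose status is not 'True'."""
    return [c for c in status.get("conditions", []) if c.get("status") != "True"]

for cond in failing_conditions(nodeclaim_status):
    print(f"{cond['type']}: {cond.get('reason')} - {cond.get('message', '')}")
```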
**Expected behavior**
Nodes launch and the workload scales out as expected.
**Environment:**
- Kubernetes version: 1.30.3
@tallaxes @Bryce-Soghigian
Searched logs based on the nodeclaim you provided and found this error message on the PUT for the network interface:
```json
{
  "code": "CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool",
  "message": "Mixing backend ipconfigurations and IPAddresses in backend pool /subscriptions/<REDACTED>/resourceGroups/<REDACTED>/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes is not allowed."
}
```
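Azure Resource Manager error payloads like this one can be nested under one or more `"error"` envelopes depending on where in the logs they were captured. A small sketch for unwrapping them to the innermost code and message (the payload below is an illustrative reconstruction of the log entry above, with the long backend pool path elided):

```python
import json

# Illustrative ARM-style error payload modeled on the log entry above.
payload = json.loads("""
{
  "error": {
    "code": "CannotMixIPBasedAddressesAndIPConfigurationsOnLoadBalancerBackendAddressPool",
    "message": "Mixing backend ipconfigurations and IPAddresses in backend pool ... is not allowed."
  }
}
""")

def innermost_error(obj: dict) -> tuple[str, str]:
    """Unwrap nested {"error": {...}} envelopes and return (code, message)."""
    while isinstance(obj.get("error"), dict):
        obj = obj["error"]
    return obj.get("code", ""), obj.get("message", "")

code, message = innermost_error(payload)
print(code)
```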
Thanks for looking this up, Bryce. What could cause this to happen? Is there a setting in AKS that could cause it?
Is this cluster (possibly unlike others) using IP-based SLB?
I'm using a common bicep file to provision both of my clusters, so they should have the same settings. I do have IP address pool management turned on (backend pool type = NodeIP); not sure if this could cause it. The cluster also uses Istio mesh and an ingress gateway.
> I do have IP address pool management turned on (by using backend pool type = NodeIP), not sure if this could cause this

That's what I suspect.
Thanks for the suggestion - I'll try turning it off later this evening to see if it mitigates the issue.
I tried the mitigation - applying the fix required me to tear down and rebuild the cluster. It seemed to be working fine for 3-4 days and then I ran into a similar set of symptoms again this morning. The logs look different this time though.
NodeClaims fail with:

```yaml
- lastTransitionTime: '2024-09-19T19:35:34Z'
  message: Node not registered with cluster
  reason: NodeNotFound
  status: 'False'
  type: Registered
```
Any ideas? Could this be related to https://github.com/Azure/AKS/issues/4545?
The only way to find out if it's related to the other issue is to either:
- check if we are talking about the same node image (the ubuntu2204 from 13.09)
- SSH into the instance and verify whether the kubelet file is missing the default IMDS environment variables
"Node not registered" / NodeNotFound issues are often caused by a connectivity problem between the node's kubelet and the API server. Make sure your firewall rules allow this traffic; looking at the kubelet's logs gives the answer most of the time.
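To quickly rule out a network block, a basic TCP reachability probe can be run from the node toward the API server. A minimal sketch (the endpoint in the comment is hypothetical; use the server address and port from your kubeconfig):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical API server endpoint taken from a kubeconfig:
# can_reach("myaks-dns-12345678.hcp.eastus.azmk8s.io", 443)
```

A successful TCP connect does not prove TLS or auth work, but a failure here points straight at firewall/NSG/UDR configuration rather than at the kubelet itself.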
This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. @Bryce-Soghigian, @pavneeta, @AllenWen-at-Azure
This issue will now be closed because it hasn't had any activity for 7 days after going stale. @vikas-rajvanshy, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion.