Issue with autoscaler scheduling
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
- Kernel Version: 5.15.0-1057-aws
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.27
- GPU Operator Version: v23.9.2
2. Issue or feature description
We are using the gpu-operator to manage GPU drivers on EC2 instances managed by the Kubernetes cluster-autoscaler. The autoscaler is configured to scale up from 0 to 8 instances.
The issue we are seeing is that starting a single GPU workload will trigger multiple (~4) node scale-ups before marking all but one as unneeded and scaling them back down.
We also attempted the following with no success: https://github.com/NVIDIA/gpu-operator/issues/140#issuecomment-847998871
Currently, our best guess is that the first upcoming node is already marked as ready even though the GPU operator has not finished its setup. As a result, no GPU resource is allocatable yet, which causes our GPU workload Pod to remain in an unschedulable state.
The cluster-autoscaler therefore sees that the node is ready while the workload Pod is still unschedulable, and triggers an additional scale-up. This repeats until the first node has completed the GPU setup, providing the requested GPU resource and making the workload Pod schedulable.
3. Steps to reproduce the issue
- Autoscaling setup with a minimum node count of 0
- Start a workload requesting GPU resources
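For illustration, a minimal workload of the kind used in the second step, written in the same Terraform style as the workaround further down this thread (pod name and image are placeholders; the nvidia.com/gpu resource request is what triggers the scale-up):

# Hypothetical sketch of a workload that requests a GPU and therefore
# triggers a scale-up from 0; pod name and image are placeholders.
resource "kubernetes_manifest" "gpu_smoke_test" {
  manifest = {
    "apiVersion" = "v1"
    "kind"       = "Pod"
    "metadata" = {
      "name"      = "gpu-smoke-test"
      "namespace" = "default"
    }
    "spec" = {
      "restartPolicy" = "Never"
      "containers" = [
        {
          "name"    = "cuda"
          "image"   = "nvidia/cuda:12.3.1-base-ubuntu22.04" # placeholder image
          "command" = ["nvidia-smi"]
          "resources" = {
            "limits" = {
              "nvidia.com/gpu" = "1"
            }
          }
        },
      ]
    }
  }
}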
It takes 3-5 minutes for the GPU stack to be ready on the node (driver installation, container-toolkit setup, etc.). Please increase the timeout the autoscaler uses to mark the node as ready to resolve this.
:wave:
> It takes 3-5 minutes for the GPU stack to be ready on the node (driver installation, container-toolkit setup, etc.). Please increase the timeout the autoscaler uses to mark the node as ready to resolve this.
I don't think it is that simple. The cluster-autoscaler will watch for the node.kubernetes.io/not-ready taint to disappear when bringing up a node for scheduling the GPU workload. The problem is that the node is set to ready before the gpu-operator has done its thing, so the cluster-autoscaler thinks the workload should already be assigned to the node, which it is not, since the requested GPU resources cannot be fulfilled.
The "proper" way would probably to have the operator add autoscaler startup taint which the operator can then remove once the GPU stack is ready.
FWIW, we are currently working around this by using kyverno:
- We define our ASG with the tag "k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu" = "true:NoSchedule", which causes the autoscaler to add the specified taint to the node on scale-up.
- The GPU operator does its thing.
- We use the following kyverno mutating policy (Terraform HCL format) to remove the taint once the GPUs are available (i.e. the nvidia.com/gpu.count label exists):
resource "kubernetes_manifest" "clusterpolicy_untaint_node_when_gpu_ready" {
manifest = {
"apiVersion" = "kyverno.io/v1"
"kind" = "ClusterPolicy"
"metadata" = {
"name" = "untaint-node-when-gpu-ready"
}
"spec" = {
"background" = false
"rules" = [
{
"context" = [
{
"name" = "newtaints"
"variable" = {
"jmesPath" = "request.object.spec.taints[?key!='ignore-taint.cluster-autoscaler.kubernetes.io/gpu-node-ready']"
}
},
]
"match" = {
"any" = [
{
"operations" = [
"CREATE",
"UPDATE",
]
"resources" = {
"kinds" = [
"Node",
],
"selector" = {
"matchExpressions" = [
{ "key" = "nvidia.com/gpu.count", "operator" = "Exists", "values" = [] }
]
}
}
},
]
}
"mutate" = {
"patchesJson6902" = <<-EOT
- path: /spec/taints
op: replace
value: {{ newtaints }}
EOT
}
"name" = "remove-taint-when-gpu-ready"
},
]
}
}
}
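For completeness, a rough sketch of how the ASG tag from the first bullet above could be declared in Terraform (the aws_autoscaling_group resource name is an assumption; note that the taint key has to match whatever key the kyverno policy filters on):

# Hypothetical sketch: tag an existing ASG so the cluster-autoscaler node
# template already carries the GPU taint when scaling up from 0.
resource "aws_autoscaling_group_tag" "gpu_node_template_taint" {
  autoscaling_group_name = aws_autoscaling_group.gpu_nodes.name # assumed ASG resource

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu"
    value               = "true:NoSchedule"
    propagate_at_launch = false
  }
}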
Hello, this might be a little late, but there's a hack in the cluster-autoscaler for that purpose that is worth trying:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/customresources/gpu_processor.go#L45
So if you make sure that the nodes get provisioned with the label your cloud provider sets, it won't over-provision nodes. (I don't know if you can preset labels on EC2 nodes.)
As for the scale-up, do not use a selector; use the nvidia.com/gpu resource request.
On AWS it looks like it's this label: k8s.amazonaws.com/accelerator
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L38
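A hedged sketch of how the label route could look with the same ASG-tag mechanism used above (the resource name and accelerator value are assumptions; the actual nodes additionally need the label applied by the kubelet at join time):

# Hypothetical sketch: advertise the AWS GPU label on the cluster-autoscaler
# node template so fresh nodes with the label but no allocatable nvidia.com/gpu
# are treated as "GPU node, not ready yet" instead of triggering more scale-ups.
resource "aws_autoscaling_group_tag" "gpu_node_template_label" {
  autoscaling_group_name = aws_autoscaling_group.gpu_nodes.name # assumed ASG resource

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator"
    value               = "nvidia-tesla-t4" # placeholder accelerator type
    propagate_at_launch = false
  }
}

# The real node must also carry the same label, e.g. by passing it to the
# kubelet in the node bootstrap (sketch; exact wiring depends on the AMI and
# bootstrap script in use):
#   --kubelet-extra-args '--node-labels=k8s.amazonaws.com/accelerator=nvidia-tesla-t4'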
@Jasper-Ben thanks for providing the information on the workaround you are currently using. Would the recommendation in https://github.com/NVIDIA/gpu-operator/issues/708#issuecomment-3035818041 be a viable solution as well? Looking at https://github.com/kubernetes/autoscaler/blob/722902dce0d3fba2a1eb9904a4d91e448496a7f8/cluster-autoscaler/processors/customresources/gpu_processor.go#L36-L70 it does appear that nodes provisioned with GPU labels specific to the cloud provider (e.g. k8s.amazonaws.com/accelerator for AWS) would not be marked ready until GPU resources are allocatable.
The "proper" way would probably to have the operator add autoscaler startup taint which the operator can then remove once the GPU stack is ready.
Thanks for suggesting this. The one possible concern I have (without knowing much) is that the node could be marked as ready before the operator gets the chance to add the taint. I would be hesitant to explore this option unless there are concrete ways to avoid this race.
Hi @cdesiniotis,
> Would the recommendation in https://github.com/NVIDIA/gpu-operator/issues/708#issuecomment-3035818041 be a viable solution as well?
Uff, it has been a while since I worked on this, so please stop me if I am talking complete nonsense 😅
From the description and the link that @nox-404 provided, it does seem like that approach should be workable. But then again, this would be cloud-provider specific.
> Thanks for suggesting this. The one possible concern I have (without knowing much) is that the node could be marked as ready before the operator gets the chance to add the taint. I would be hesitant to explore this option unless there are concrete ways to avoid this race.
Yeah, if the operator adds the taint during startup that could be a valid concern.
A possible solution could be to have the user add the taint to the autoscaling group resource themselves via a node template, similar to what is described here: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0. That way the node should come up with the taint straight away.
The GPU operator would then have to know about this taint (either a well-known or a configurable one), have the appropriate tolerations configured, and remove the taint once the node is ready.
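To make that concrete, a hedged sketch of the operand-toleration side using the gpu-operator Helm chart (the taint key is hypothetical, the daemonsets.tolerations values path is an assumption about the chart, and the taint removal itself would be the new operator behaviour discussed here):

# Hypothetical sketch: install the gpu-operator so its operand daemonsets
# tolerate a user-defined startup taint that the node comes up with (set via
# the cluster-autoscaler node-template mechanism shown earlier in the thread).
# Removing the taint once the GPU stack is ready would be the proposed
# operator feature, not something the sketch below implements.
resource "helm_release" "gpu_operator" {
  name       = "gpu-operator"
  repository = "https://helm.ngc.nvidia.com/nvidia"
  chart      = "gpu-operator"
  namespace  = "gpu-operator"

  values = [yamlencode({
    daemonsets = {
      tolerations = [
        {
          key      = "example.com/gpu-stack-not-ready" # hypothetical startup taint key
          operator = "Exists"
          effect   = "NoSchedule"
        },
      ]
    }
  })]
}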