Node doesn't expose GPU resource on g4dn.[n]xlarge
Image I'm using:
System Info:
- Kernel Version: 5.15.160
- OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia)
- Operating System: linux
- Architecture: amd64
- Container Runtime Version: containerd://1.6.31+bottlerocket
- Kubelet Version: v1.26.14-eks-b063426
- Kube-Proxy Version: v1.26.14-eks-b063426
What I expected to happen:
Every time I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) node (ami-09469fd78070eaac6) on a g4dn.[n]xlarge instance type in EKS, it should expose the GPU count for pods:
Capacity:
...
nvidia.com/gpu: 1
...
Allocatable:
...
nvidia.com/gpu: 1
...
What actually happened:
Roughly 5% of the time, a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) node (ami-09469fd78070eaac6) started in EKS on a g4dn.[n]xlarge instance type did not expose the GPU count, so pods requesting nvidia.com/gpu: 1 could not be scheduled and stayed in Pending, waiting for a node:
Capacity:
cpu: 8
ephemeral-storage: 61904460Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32366612Ki
pods: 29
Allocatable:
cpu: 7910m
ephemeral-storage: 55977408418
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31676436Ki
pods: 29
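A quick way to spot an affected node is to check whether nvidia.com/gpu shows up in its allocatable resources, for example (illustrative; the label selector assumes the NodePool shown below):

kubectl get nodes -l karpenter.sh/nodepool=random-name \
  -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu'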
How to reproduce the problem:
Note: this issue has existed for more than a year; see the Slack thread here.
Current settings:
- EKS K8s v1.26
- Karpenter Autoscaler v0.33.5
- Karpenter Nodepool:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: random-name
spec:
disruption:
consolidationPolicy: WhenUnderutilized
expireAfter: 720h
limits:
cpu: 1000
template:
metadata:
labels:
company.ai/node: random-name
spec:
nodeClassRef:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
name: random-name
requirements:
- key: node.kubernetes.io/instance-type
operator: In
values:
- g4dn.xlarge
- g4dn.2xlarge
- g4dn.4xlarge
- key: topology.kubernetes.io/zone
operator: In
values:
- eu-central-1a
- eu-central-1b
- eu-central-1c
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.sh/capacity-type
operator: In
values:
- spot
- on-demand
taints:
- effect: NoSchedule
key: nvidia.com/gpu
status:
resources:
cpu: "8"
ephemeral-storage: 61904460Ki
memory: 32366612Ki
nvidia.com/gpu: "1"
pods: "29"
vpc.amazonaws.com/pod-eni: "39"
- Karpenter EC2NodeClass:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: random-name
spec:
amiFamily: Bottlerocket
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
deleteOnTermination: true
volumeSize: 4Gi
volumeType: gp3
- deviceName: /dev/xvdb
ebs:
deleteOnTermination: true
iops: 3000
snapshotID: snap-d4758cc7f5f11
throughput: 500
volumeSize: 60Gi
volumeType: gp3
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 2
httpTokens: required
role: KarpenterNodeRole-prod
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: prod
subnetSelectorTerms:
- tags:
Name: '*Private*'
karpenter.sh/discovery: prod
tags:
nodepool: random-name
purpose: prod
vendor: random-name
status:
amis:
- id: ami-0a3eb13c0c420309b
name: bottlerocket-aws-k8s-1.26-nvidia-aarch64-v1.20.3-5d9ac849
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-gpu-count
operator: Exists
- id: ami-0a3eb13c0c420309b
name: bottlerocket-aws-k8s-1.26-nvidia-aarch64-v1.20.3-5d9ac849
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-accelerator-count
operator: Exists
- id: ami-0e68f27f62340664d
name: bottlerocket-aws-k8s-1.26-aarch64-v1.20.3-5d9ac849
requirements:
- key: kubernetes.io/arch
operator: In
values:
- arm64
- key: karpenter.k8s.aws/instance-gpu-count
operator: DoesNotExist
- key: karpenter.k8s.aws/instance-accelerator-count
operator: DoesNotExist
- id: ami-09469fd78070eaac6
name: bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.20.3-5d9ac849
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-gpu-count
operator: Exists
- id: ami-09469fd78070eaac6
name: bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.20.3-5d9ac849
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-accelerator-count
operator: Exists
- id: ami-096d4acd33c9e9449
name: bottlerocket-aws-k8s-1.26-x86_64-v1.20.3-5d9ac849
requirements:
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.k8s.aws/instance-gpu-count
operator: DoesNotExist
- key: karpenter.k8s.aws/instance-accelerator-count
operator: DoesNotExist
instanceProfile: prod_29098703412147
securityGroups:
- id: sg-71ed7b6a7c7
name: eks-cluster-sg-prod-684080
subnets:
- id: subnet-aa204f6e57f07
zone: eu-central-1a
- id: subnet-579874f746c1b
zone: eu-central-1c
- id: subnet-07ce8a8349377
zone: eu-central-1b
- Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: deploy-gpu
spec:
template:
spec:
containers:
- ...
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
nodeSelector:
company.ai/node: random-name
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
...
Karpenter managed process:
- The Deployment is deployed in the cluster.
- There are no nodes that can fulfill the requirements of resources, node labels and tolerations.
- Karpenter detects it can fulfill the requirements with the described NodePool and EC2NodeClass.
- Karpenter creates a new node in the pool karpenter.sh/nodepool=random-name.
- The node that starts is healthy, but the scheduler can't schedule the pod on it because it is not exposing the GPU:
'0/16 nodes are available: 1 Insufficient nvidia.com/gpu, 10 node(s) didn't match Pod's node affinity/selector, 5 node(s) had untolerated taint {deepc-cpu: }. preemption: 0/16 nodes are available: 1 No preemption victims found for incoming pod, 15 Preemption is not helpful for scheduling.'
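The stuck pod and the scheduler's view of it can be inspected with standard commands (illustrative):

kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <namespace>   # the Events section shows the message above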
Node created:
kubectl describe no ip-192-168-164-242.eu-central-1.compute.internal
Name: ip-192-168-164-242.eu-central-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=g4dn.2xlarge
beta.kubernetes.io/os=linux
company.ai/node=random-name
failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1b
k8s.io/cloud-provider-aws=e7b76f679c563363cec5c6d5c3
karpenter.k8s.aws/instance-category=g
karpenter.k8s.aws/instance-cpu=8
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=g4dn
karpenter.k8s.aws/instance-generation=4
karpenter.k8s.aws/instance-gpu-count=1
karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
karpenter.k8s.aws/instance-gpu-memory=16384
karpenter.k8s.aws/instance-gpu-name=t4
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-local-nvme=225
karpenter.k8s.aws/instance-memory=32768
karpenter.k8s.aws/instance-network-bandwidth=10000
karpenter.k8s.aws/instance-size=2xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/nodepool=random-name
karpenter.sh/registered=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-164-242.eu-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=g4dn.2xlarge
topology.ebs.csi.aws.com/zone=eu-central-1b
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1b
Annotations: alpha.kubernetes.io/provided-node-ip: 192.168.164.242
csi.volume.kubernetes.io/nodeid:
{"csi.tigera.io":"ip-192-168-164-242.eu-central-1.compute.internal","ebs.csi.aws.com":"i-0fd5f7a7969d63c9d"}
karpenter.k8s.aws/ec2nodeclass-hash: 15616957348189460630
karpenter.k8s.aws/ec2nodeclass-hash-version: v1
karpenter.sh/nodepool-hash: 14407783392627717656
karpenter.sh/nodepool-hash-version: v1
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 08 Jul 2024 08:57:11 -0500
Taints: nvidia.com/gpu:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-164-242.eu-central-1.compute.internal
AcquireTime: <unset>
RenewTime: Fri, 12 Jul 2024 13:43:42 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:11 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:11 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:11 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:19 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.164.242
InternalDNS: ip-192-168-164-242.eu-central-1.compute.internal
Hostname: ip-192-168-164-242.eu-central-1.compute.internal
Capacity:
cpu: 8
ephemeral-storage: 61904460Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32366612Ki
pods: 29
Allocatable:
cpu: 7910m
ephemeral-storage: 55977408418
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31676436Ki
pods: 29
System Info:
Machine ID: ec2a5800025dbe346ecc517c4de3
System UUID: ec2a58-0002-5dbe-346e-cc517c4de3
Boot ID: c25bd0fd-67f0-4681-946a-64f8c5c57878
Kernel Version: 5.15.160
OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.31+bottlerocket
Kubelet Version: v1.26.14-eks-b063426
Kube-Proxy Version: v1.26.14-eks-b063426
ProviderID: aws:///eu-central-1b/i-d5f7a7969d63c
Non-terminated Pods: (8 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-rlbvk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
calico-system csi-node-driver-x5jn7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
kube-system aws-node-txk8m 50m (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
kube-system ebs-csi-node-b8ln5 30m (0%) 0 (0%) 120Mi (0%) 768Mi (2%) 4d4h
kube-system kube-proxy-zkn9m 100m (1%) 0 (0%) 0 (0%) 0 (0%) 4d4h
loki loki-promtail-6644l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
monitoring monitoring-prometheus-node-exporter-fx749 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
tigera-operator tigera-operator-7b594b484b-rkn5g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d21h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 180m (2%) 0 (0%)
memory 120Mi (0%) 768Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
As you can see, the node created fulfills the node labels and tolerations requirements, but not the resources (GPU) requirement.
Inspecting the node:
Using the session manager -> admin container -> sheltie
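For reference, the host root shell was reached roughly like this (illustrative; command names as provided by the Bottlerocket control and admin containers):

enter-admin-container   # from the SSM session in the control container
sudo sheltie            # switch into a root shell in the host namespaces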
bash-5.1# lsmod | grep nvidia
nvidia_uvm 1454080 0
nvidia_modeset 1265664 0
nvidia 56004608 2 nvidia_uvm,nvidia_modeset
drm 626688 1 nvidia
backlight 24576 2 drm,nvidia_modeset
i2c_core 102400 2 nvidia,drm
bash-5.1# systemctl list-unit-files | grep nvidia
nvidia-fabricmanager.service enabled enabled
nvidia-k8s-device-plugin.service enabled enabled
bash-5.1# journalctl -b -u nvidia-k8s-device-plugin
-- No entries --
bash-5.1# journalctl -b -u nvidia-fabricmanager.service
-- No entries --
bash-5.1# journalctl --list-boots
IDX BOOT ID FIRST ENTRY LAST ENTRY
0 c25bd0fd67f04681946a64f8c5c57878 Tue 2024-07-09 22:06:03 UTC Fri 2024-07-12 22:02:13 UTC
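A few additional node-side checks that might help narrow this down (illustrative; the socket directory assumes the default kubelet layout):

bash-5.1# systemctl status nvidia-k8s-device-plugin
bash-5.1# ls /var/lib/kubelet/device-plugins/
bash-5.1# nvidia-smi -L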
From the Slack thread, someone suggested this:
Grasping at straws, but I wonder if this is some sort of initialization race condition where the kubelet service starts before the NVIDIA device is ready.
Thanks for reporting this! By any chance do you have the instance running? It seems odd that the device plugin isn't showing any output.
Yes sir, I have the instance running.
I agree; the previous time I reported the incident (Slack thread), the output was different:
# journalctl -u nvidia-k8s-device-plugin
Apr 14 06:03:47 ip-192-168-114-245.eu-central-1.compute.internal systemd[1]: Dependency failed for Start NVIDIA kubernetes device plugin.
Apr 14 06:03:47 ip-192-168-114-245.eu-central-1.compute.internal systemd[1]: nvidia-k8s-device-plugin.service: Job nvidia-k8s-device-plugin.service/start failed with result 'dependency'.
But this time it is empty.
@arnaldo2792 let me know if there are any steps you'd like me to perform to diagnose the issue.
I am investigating on this end. On the EC2 g4dn.* instance family Bottlerocket may require manual intervention to disable GSP firmware download. This has to happen during boot, before Bottlerocket loads the nvidia kmod. I will find the relevant API to set this as a boot parameter and test the results. Here's the relevant line from nvidia-smi -q:
GSP Firmware Version : 535.183.01
This shows that the nvidia kmod downloaded firmware to the GSP during boot. The desired state is:
GSP Firmware Version : N/A
The slightly better news is that we do have an issue open internally to select the "no GSP download" option on appropriate hardware, without requiring any configuration.
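On a running node, the current state of the driver can be checked from the sheltie shell (illustrative):

bash-5.1# nvidia-smi -q | grep "GSP Firmware"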
@larvacea I want to thank you for taking the time to investigate this strange issue. Also I am happy that you found some breadcrumbs on what the problem is. :clap:
Here's one way to set the relevant kernel parameter using apiclient:
apiclient apply <<EOF
[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware"=["0"]
[settings.boot]
reboot-to-reconcile = true
EOF
apiclient reboot
After the instance reboots, nvidia-smi -q should report N/A for GSP Firmware Version. One can use the same TOML fragment as part of instance user data. That's why the TOML includes reboot-to-reconcile: this should result in Bottlerocket rebooting automatically whenever the kernel-parameters setting changes the kernel command line.
I do not know if this is responsible for the 5% failure rate you see. I'd love to hear if this helps or not.
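After the reboot, the applied setting and the resulting kernel command line can be verified with something like (illustrative):

apiclient get settings.boot
cat /proc/cmdline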
My understanding is that if I set the kernel parameter "nvidia.NVreg_EnableGpuFirmware"=["0"] I can be 100% sure that GSP firmware won't be downloaded, and that would be enough for my use case, where Karpenter is in charge of starting and shutting down nodes on demand (I don't have long-living nodes).
Also, my understanding is that the reboot-to-reconcile = true parameter is there to help fix a long-living node by setting the firmware parameter, which is not required in my use case.
Based on that understanding, my fix would be to add the firmware parameter to the userData of the Karpenter EC2NodeClass as follows:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: random-name
spec:
amiFamily: Bottlerocket
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
deleteOnTermination: true
volumeSize: 4Gi
volumeType: gp3
- deviceName: /dev/xvdb
ebs:
deleteOnTermination: true
iops: 3000
snapshotID: snap-d4758cc7f5f11
throughput: 500
volumeSize: 60Gi
volumeType: gp3
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 2
httpTokens: required
role: KarpenterNodeRole-prod
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: prod
subnetSelectorTerms:
- tags:
Name: '*Private*'
karpenter.sh/discovery: prod
tags:
nodepool: random-name
purpose: prod
vendor: random-name
userData: |-
[settings.boot.kernel-parameters]
"nvidia.NVreg_EnableGpuFirmware"=["0"]
However, I don't know the internals of that process; maybe my understanding is wrong and I need to use the reboot-to-reconcile setting too.
Please correct me if I am wrong.
The reboot-to-reconcile setting solves an ordering problem in Bottlerocket boot on AWS EC2 instances. We can't access user data until the network is available. If anything in user data changes the kernel command line, we need to persist the command line and reboot for the new kernel command line to have any effect. If reboot-to-reconcile is true and the desired kernel command line is different from the one that Bottlerocket booted with, we reboot. On this second boot, the kernel command line does not change, so we will not reboot (and thus will not enter a reboot loop that prevents the instance from starting).
We intend to add logic to automate this and set the desired kmod option before we load the driver. In general-purpose Linux operating systems, one could solve the problem by putting the desired configuration in /etc/modprobe.d. The driver is a loadable kmod, so modprobe will find this configuration file if it exists before the kmod is loaded. On a general-purpose Linux machine, the system administrator has access to /etc, and /etc persists across boots.
In Bottlerocket, /etc is not persisted. It is a memory-resident file system (tmpfs) and built during boot by systemd. One can place the driver configuration in the kernel command line even though the driver is not resident; modprobe reads the command line and adds any configuration it finds to the variables it sourced from /etc/modprobe.d (or possibly a few other locations).
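For reference, on a general-purpose distribution that modprobe configuration would look roughly like this (illustrative; not applicable on Bottlerocket, where /etc is not persisted):

# /etc/modprobe.d/nvidia-gsp.conf
options nvidia NVreg_EnableGpuFirmware=0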
Hope this helps.
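Given that explanation, the userData in the EC2NodeClass above would presumably need both settings when applied through user data (a sketch, not verified):

userData: |-
  [settings.boot]
  reboot-to-reconcile = true

  [settings.boot.kernel-parameters]
  "nvidia.NVreg_EnableGpuFirmware" = ["0"]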
@andrescaroc Is your Karpenter solution working? We are facing similar issues with Bottlerocket.