gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset, and nvidia-driver-daemonset pods are not removed after the GPU node is drained and removed from the cluster
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Flatcar
- Kernel Version:
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s 1.26
- GPU Operator Version: v23.3.1 (gpu-operator image v22.9.0-ubi8)
2. Issue or feature description
The gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset, and nvidia-driver-daemonset pods are not removed after the GPU node is drained and removed from the cluster. The description of these pods shows the following event:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NodeNotReady 3m49s node-controller Node is not ready
k get events -n gpu-operator
LAST SEEN TYPE REASON OBJECT MESSAGE
4m Warning NodeNotReady pod/gpu-feature-discovery-rq8p4 Node is not ready
4m1s Warning NodeNotReady pod/gpu-operator-node-feature-discovery-worker-jqfk8 Node is not ready
4m1s Warning NodeNotReady pod/nvidia-container-toolkit-daemonset-6twjt Node is not ready
93s Normal TaintManagerEviction pod/nvidia-cuda-validator-7zxdq Cancelling deletion of Pod gpu-operator/nvidia-cuda-validator-7zxdq
4m1s Warning NodeNotReady pod/nvidia-dcgm-exporter-vffrj Node is not ready
4m1s Warning NodeNotReady pod/nvidia-device-plugin-daemonset-jwlqh Node is not ready
93s Normal TaintManagerEviction pod/nvidia-device-plugin-validator-gvtsg Cancelling deletion of Pod gpu-operator/nvidia-device-plugin-validator-gvtsg
4m1s Warning NodeNotReady pod/nvidia-driver-daemonset-8jbgc Node is not ready
4m1s Warning NodeNotReady pod/nvidia-operator-validator-62h5p Node is not ready
Logs of the k8s-driver-manager container before terminating the GPU node:
k logs nvidia-driver-daemonset-5gt5q -c k8s-driver-manager -n gpu-operator
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Current value of AUTO_UPGRADE_POLICY_ENABLED='
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/ip-10-222-101-214.ec2.internal labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-hhrnx condition met
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-kb6s4 condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto upgrade policy of the GPU driver on the node ip-10-222-101-214.ec2.internal is disabled
Cordoning node ip-10-222-101-214.ec2.internal...
node/ip-10-222-101-214.ec2.internal cordoned
Draining node ip-10-222-101-214.ec2.internal of any GPU pods...
W0922 16:03:37.375717 7767 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-09-22T16:03:37Z" level=info msg="Identifying GPU pods to delete"
time="2023-09-22T16:03:37Z" level=info msg="No GPU pods to delete. Exiting."
unbinding device 0000:00:1e.0
Auto upgrade policy of the GPU driver on the node ip-10-222-101-214.ec2.internal is disabled
Uncordoning node ip-10-222-101-214.ec2.internal...
node/ip-10-222-101-214.ec2.internal uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/ip-10-222-101-214.ec2.internal labeled
Values we are passing to Helm:
source:
  path: deployments/gpu-operator
  repoURL: https://github.com/NVIDIA/gpu-operator.git
  targetRevision: v23.3.1
  helm:
    releaseName: gpu-operator
    values: |-
      validator:
        repository: our-repo/nvidia
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      daemonsets:
        priorityClassName: system-node-critical
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      operator:
        repository: our-repo/nvidia
        image: gpu-operator
        version: v22.9.0-ubi8
        imagePullSecrets: [image-secret]
        defaultRuntime: containerd
        tolerations:
          - key: "node-role.kubernetes.io/master"
            operator: "Equal"
            value: ""
            effect: "NoSchedule"
      driver:
        enabled: true
        repository: our-repo/nvidia
        image: nvidia-kmods-driver-flatcar
        version: '{{values.driverImage}}'
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      toolkit:
        enabled: true
        repository: our-repo/nvidia
        image: container-toolkit
        version: v1.13.0-ubuntu20.04
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      devicePlugin:
        repository: our-repo/nvidia
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      dcgm:
        repository: our-repo/nvidia
        image: 3.1.7-1-ubuntu20.04
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      dcgmExporter:
        repository: our-repo/nvidia
        image: dcgm-exporter
        imagePullSecrets:
          - frog-auth
        version: 3.1.7-3.1.4-ubuntu20.04
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      gfd:
        repository: our-repo/nvidia
        image: gpu-feature-discovery
        version: v0.8.0-ubi8
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      migManager:
        enabled: true
        repository: our-repo/nvidia
        image: k8s-mig-manager
        version: v0.5.2-ubuntu20.04
        imagePullSecrets:
          - image-secret
        tolerations:
          - key: gpu.kubernetes.io/gpu-exists
            operator: Exists
            effect: NoSchedule
      node-feature-discovery:
        image:
          repository: our-repo/nvidia/node-feature-discovery
        imagePullSecrets:
          - name: image-secret
        worker:
          tolerations:
            - key: "gpu.kubernetes.io/gpu-exists"
              operator: "Equal"
              value: ""
              effect: "NoSchedule"
          nodeSelector:
            beta.kubernetes.io/os: linux
Please let us know how to control this pod eviction when a GPU node is scaled down, as these pods still show as Running even after the GPU node has been removed from the cluster.
@shivamerla @cdesiniotis Please advise on this.
@shnigam2 Can you share the YAML manifest of your GPU node?
@tariq1890 Please find the manifest of the GPU node while all NVIDIA pods are in the Running state:
k get po -n gpu-operator -o wide |grep -i ip-10-222-100-91.ec2.internal
gpu-feature-discovery-zzkqg 1/1 Running 0 6m44s 100.119.232.78 ip-10-222-100-91.ec2.internal <none> <none>
gpu-operator-node-feature-discovery-worker-2vqg7 1/1 Running 0 7m52s 100.119.232.69 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-container-toolkit-daemonset-ksp5q 1/1 Running 0 6m44s 100.119.232.73 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-cuda-validator-ccgrb 0/1 Completed 0 5m13s 100.119.232.76 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-dcgm-exporter-tjpz9 1/1 Running 0 6m44s 100.119.232.75 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-device-plugin-daemonset-xc7rb 1/1 Running 0 6m44s 100.119.232.77 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-device-plugin-validator-c6qzp 0/1 Completed 0 4m26s 100.119.232.79 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-driver-daemonset-cxjdf 1/1 Running 0 7m20s 100.119.232.72 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-operator-validator-tq797 1/1 Running 0 6m44s 100.119.232.74 ip-10-222-100-91.ec2.internal <none> <none>
k get nodes ip-10-222-100-91.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
annotations:
csi.volume.kubernetes.io/nodeid: '{"csi.oneagent.dynatrace.com":"ip-10-222-100-91.ec2.internal","csi.tigera.io":"ip-10-222-100-91.ec2.internal","ebs.csi.aws.com":"i-054d7daae0d81b5ec"}'
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
nfd.node.kubernetes.io/extended-resources: ""
nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-cpuid.AVX512VL,cpu-cpuid.AVX512VNNI,cpu-cpuid.CMPXCHG8,cpu-cpuid.FMA3,cpu-cpuid.FXSR,cpu-cpuid.FXSROPT,cpu-cpuid.HYPERVISOR,cpu-cpuid.LAHF,cpu-cpuid.MOVBE,cpu-cpuid.MPX,cpu-cpuid.OSXSAVE,cpu-cpuid.SYSCALL,cpu-cpuid.SYSEE,cpu-cpuid.X87,cpu-cpuid.XGETBV1,cpu-cpuid.XSAVE,cpu-cpuid.XSAVEC,cpu-cpuid.XSAVEOPT,cpu-cpuid.XSAVES,cpu-hardware_multithreading,cpu-model.family,cpu-model.id,cpu-model.vendor_id,kernel-config.NO_HZ,kernel-config.NO_HZ_IDLE,kernel-version.full,kernel-version.major,kernel-version.minor,kernel-version.revision,nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.driver.rev,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor,nvidia.com/gfd.timestamp,nvidia.com/gpu.compute.major,nvidia.com/gpu.compute.minor,nvidia.com/gpu.count,nvidia.com/gpu.family,nvidia.com/gpu.machine,nvidia.com/gpu.memory,nvidia.com/gpu.product,nvidia.com/gpu.replicas,nvidia.com/mig.capable,nvidia.com/mig.strategy,pci-10de.present,pci-1d0f.present,storage-nonrotationaldisk,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major,system-os_release.VERSION_ID.minor
nfd.node.kubernetes.io/worker.version: v0.12.1
node.alpha.kubernetes.io/ttl: "0"
projectcalico.org/IPv4Address: 10.222.100.91/24
projectcalico.org/IPv4IPIPTunnelAddr: 100.119.232.64
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2023-09-23T02:36:25Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: g4dn.xlarge
beta.kubernetes.io/os: linux
failure-domain.beta.kubernetes.io/region: us-east-1
failure-domain.beta.kubernetes.io/zone: us-east-1a
feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512BW: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512CD: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512VL: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI: "true"
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: "true"
feature.node.kubernetes.io/cpu-cpuid.FMA3: "true"
feature.node.kubernetes.io/cpu-cpuid.FXSR: "true"
feature.node.kubernetes.io/cpu-cpuid.FXSROPT: "true"
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR: "true"
feature.node.kubernetes.io/cpu-cpuid.LAHF: "true"
feature.node.kubernetes.io/cpu-cpuid.MOVBE: "true"
feature.node.kubernetes.io/cpu-cpuid.MPX: "true"
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: "true"
feature.node.kubernetes.io/cpu-cpuid.SYSCALL: "true"
feature.node.kubernetes.io/cpu-cpuid.SYSEE: "true"
feature.node.kubernetes.io/cpu-cpuid.X87: "true"
feature.node.kubernetes.io/cpu-cpuid.XGETBV1: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVE: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVEC: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVES: "true"
feature.node.kubernetes.io/cpu-hardware_multithreading: "true"
feature.node.kubernetes.io/cpu-model.family: "6"
feature.node.kubernetes.io/cpu-model.id: "85"
feature.node.kubernetes.io/cpu-model.vendor_id: Intel
feature.node.kubernetes.io/kernel-config.NO_HZ: "true"
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE: "true"
feature.node.kubernetes.io/kernel-version.full: 5.15.125-flatcar
feature.node.kubernetes.io/kernel-version.major: "5"
feature.node.kubernetes.io/kernel-version.minor: "15"
feature.node.kubernetes.io/kernel-version.revision: "125"
feature.node.kubernetes.io/pci-10de.present: "true"
feature.node.kubernetes.io/pci-1d0f.present: "true"
feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
feature.node.kubernetes.io/system-os_release.ID: flatcar
feature.node.kubernetes.io/system-os_release.VERSION_ID: 3510.2.7
feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "3510"
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "2"
instance-group: cpu-g4dn-xlarge
kubernetes.io/arch: amd64
kubernetes.io/hostname: ip-10-222-100-91.ec2.internal
kubernetes.io/os: linux
kubernetes.io/role: node
our-registry.cloud/gpu: "true"
node-role.kubernetes.io/node: ""
node.kubernetes.io/instance-type: g4dn.xlarge
node.kubernetes.io/role: node
nvidia.com/cuda.driver.major: "525"
nvidia.com/cuda.driver.minor: "105"
nvidia.com/cuda.driver.rev: "17"
nvidia.com/cuda.runtime.major: "12"
nvidia.com/cuda.runtime.minor: "0"
nvidia.com/gfd.timestamp: "1695436816"
nvidia.com/gpu.compute.major: "7"
nvidia.com/gpu.compute.minor: "5"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.deploy.container-toolkit: "true"
nvidia.com/gpu.deploy.dcgm: "true"
nvidia.com/gpu.deploy.dcgm-exporter: "true"
nvidia.com/gpu.deploy.device-plugin: "true"
nvidia.com/gpu.deploy.driver: "true"
nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
nvidia.com/gpu.deploy.node-status-exporter: "true"
nvidia.com/gpu.deploy.nvsm: ""
nvidia.com/gpu.deploy.operator-validator: "true"
nvidia.com/gpu.family: turing
nvidia.com/gpu.machine: g4dn.xlarge
nvidia.com/gpu.memory: "15360"
nvidia.com/gpu.present: "true"
nvidia.com/gpu.product: Tesla-T4
nvidia.com/gpu.replicas: "1"
nvidia.com/mig.capable: "false"
nvidia.com/mig.strategy: single
topology.ebs.csi.aws.com/zone: us-east-1a
topology.kubernetes.io/region: us-east-1
topology.kubernetes.io/zone: us-east-1a
name: ip-10-222-100-91.ec2.internal
resourceVersion: "36894521"
uid: d5c9ddb2-3379-4c9f-942e-0b65d1162edb
spec:
podCIDR: 100.96.37.0/24
podCIDRs:
- 100.96.37.0/24
providerID: aws:///us-east-1a/i-054d7daae0d81b5ec
taints:
- effect: NoSchedule
key: gpu.kubernetes.io/gpu-exists
status:
addresses:
- address: 10.222.100.91
type: InternalIP
- address: ip-10-222-100-91.ec2.internal
type: Hostname
- address: ip-10-222-100-91.ec2.internal
type: InternalDNS
allocatable:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: "88450615150"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 15980652Ki
nvidia.com/gpu: "1"
pods: "110"
capacity:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: 95975060Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 16083052Ki
nvidia.com/gpu: "1"
pods: "110"
conditions:
- lastHeartbeatTime: "2023-09-23T02:37:03Z"
lastTransitionTime: "2023-09-23T02:37:03Z"
message: Calico is running on this node
reason: CalicoIsUp
status: "False"
type: NetworkUnavailable
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:25Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:25Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:25Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:57Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/nvidia-kmods-driver-flatcar@sha256:3e83fc8abe394bb2a86577a2e936e425ec4c3952301cb12712f576ba2b642cb4
sizeBytes: 1138988828
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/dcgm-exporter@sha256:ae014d7f27c32ba83128ba31e2f8ab3a0910a46607e63d2ae7a90ae3551e3330
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
sizeBytes: 1059498968
- names:
- our-registry-cngccp-docker.jfrog.io/splunk/fluentd-hec@sha256:9f6b4642a22f8942bb4d6c5357ee768fe515fa21d49577b88ba12098c382656b
- our-registry-cngccp-docker.jfrog.io/splunk/fluentd-hec:1.2.8
sizeBytes: 315828956
- names:
- xpj245675755234.live.dynatrace.com/linux/oneagent@sha256:a44033e943518221fd657d033845c12850ba872d9e61616c192f406919b87bb3
- xpj245675755234.live.dynatrace.com/linux/oneagent:1.265.152
sizeBytes: 227902134
- names:
- nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:cab21c93987a5c884075efe0fb4a8abaa1997e1696cbc773ba69889f42f8329b
- nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.1
sizeBytes: 213778085
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/k8s-device-plugin@sha256:46ce950d29cd67351c37850cec6aafa718d346f181c956d73bec079f9d96fbc1
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/k8s-device-plugin:v0.14.0-ubi8
sizeBytes: 165982184
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-feature-discovery@sha256:b1c162fb5fce21a684b4e28dae2c37d60b2d3c47b7270dd0bce835b7ce9e5a24
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
sizeBytes: 162038014
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:f6bf463459a61aa67c5f9e4f4f97797609b85bf77aaef88b0e78536889a7e517
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:devel-ubi8
sizeBytes: 141870962
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:91e028c8177b4896b7d79f08c64f3a84cb66a0f5a3f32b844d909ebbbd7e0369
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04
sizeBytes: 127160969
- names:
- docker.io/calico/cni@sha256:9a2c99f0314053aa11e971bd5d72e17951767bf5c6ff1fd9c38c4582d7cb8a0a
- docker.io/calico/cni:v3.25.1
sizeBytes: 89884044
- names:
- docker.io/calico/node@sha256:0cd00e83d06b3af8cd712ad2c310be07b240235ad7ca1397e04eb14d20dcc20f
- docker.io/calico/node:v3.25.1
sizeBytes: 88335791
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery@sha256:a498b39f2fd7435d8862a9a916ef6eb4d2a4d8d5b4c6788fb48bdb11b008e87a
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery:v0.12.1
sizeBytes: 73669012
- names:
- our-registry-cngccp-docker.jfrog.io/dynatrace/dynatrace-operator@sha256:ce621425125ba8fdcfa0f300c75e0167e9301a4654fcd1c14baa75f4d41151a3
- our-registry-cngccp-docker.jfrog.io/dynatrace/dynatrace-operator:v0.9.1
sizeBytes: 43133681
- names:
- public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver@sha256:2d1ecf57fcfde2403a66e7709ecbb55db6d2bfff64c5c71225c9fb101ffe9c30
- public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.18.0
sizeBytes: 30176686
- names:
- registry.k8s.io/kube-proxy@sha256:8d998d77a1fae5d933a7efea97faace684559d70a37a72dba7193ed84e1bc45d
- registry.k8s.io/kube-proxy:v1.26.7
sizeBytes: 21764578
- names:
- our-registry-cngccp-docker.jfrog.io/kube2iam@sha256:aba84ebec51b25a22ffbcf3fe1599dabb0c88d7de87f07f00b85b79ddd72d672
- our-registry-cngccp-docker.jfrog.io/kube2iam:imdsv2-fix
sizeBytes: 14666113
- names:
- docker.io/calico/node-driver-registrar@sha256:5954319e4dbf61aac2e704068e9f3cd083d67f630c08bc0d280863dbf01668bc
- docker.io/calico/node-driver-registrar:v3.25.1
sizeBytes: 11695360
- names:
- docker.io/calico/csi@sha256:1f17de674c15819408c02ea5699bc3afe75f3120fbaf9c23ad5bfa2bca01814c
- docker.io/calico/csi:v3.25.1
sizeBytes: 11053330
- names:
- docker.io/calico/pod2daemon-flexvol@sha256:66629150669c4ff7f70832858af28139407da59f61451a8658f15f06b9f20436
- docker.io/calico/pod2daemon-flexvol:v3.25.1
sizeBytes: 7167792
- names:
- public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:6ad0cae2ae91453f283a44e9b430e475b8a9fa3d606aec9a8b09596fffbcd2c9
- public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.7.0-eks-1-26-7
sizeBytes: 6560300
- names:
- public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:d9e11b42ae5f4f2f7ea9034e68040997cdbb04ae9e188aa897f76ae92698d78a
- public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.9.0-eks-1-26-7
sizeBytes: 6086054
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/logrotate@sha256:26454d4621f3ed8c1d048fbc3a25b31a00f45a4404c1d3716845cb154b571e3e
- our-registry-cngccp-docker-k8s.jfrog.io/logrotate:1.0_5469f66
sizeBytes: 5572108
- names:
- registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db
- registry.k8s.io/pause:3.6
sizeBytes: 301773
nodeInfo:
architecture: amd64
bootID: 1e548d4e-2bf3-4de0-ae7f-017980214214
containerRuntimeVersion: containerd://1.6.16
kernelVersion: 5.15.125-flatcar
kubeProxyVersion: v1.26.7
kubeletVersion: v1.26.7
machineID: ec27866d6a0f6aeff75511a5668b6a78
operatingSystem: linux
osImage: Flatcar Container Linux by Kinvolk 3510.2.7 (Oklo)
systemUUID: ec27866d-6a0f-6aef-f755-11a5668b6a78
How are you draining these nodes?
Please ensure --ignore-daemonsets is set to false when running the kubectl drain command
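For illustration, a drain invocation along those lines (node name taken from the manifest above, purely as an example) would be:

# With --ignore-daemonsets=false, kubectl drain refuses to proceed while
# DaemonSet-managed pods (such as the GPU operator daemonsets) are still running on the node
kubectl drain ip-10-222-100-91.ec2.internal --ignore-daemonsets=false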
@tariq1890 We terminate the backing EC2 instance directly. Up to K8s 1.24 that removed all of these NVIDIA pods, but on K8s 1.26 these four pods still show as Running even though the underlying instance has already been removed. Is there any parameter we need to pass for K8s 1.26?
@tariq1890 @cdesiniotis @shivamerla Please let us know how to fix this issue. DaemonSet pods are not being cleaned up when the cluster autoscaler terminates a node; node removal should ideally remove all NVIDIA DaemonSet pods, which is not happening in our case.
@shivamerla Could you please help us understand the cause of this behavior? We are using Flatcar on the worker nodes.
@shivamerla @tariq1890 @cdesiniotis Could you please help us fix this behaviour? Because of it, the namespace keeps showing pods that no longer actually exist, since their node has already been scaled down.
@shnigam2 Can you provide logs from the k8s controller-manager pod to check for errors when cleaning up these pods? Are you using images from a private registry (i.e. using pullSecrets)?
@shivamerla Yes, we are using a private registry. Please find the controller-manager logs showing the errors:
I1110 02:29:52.043817 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-krj5j"
E1110 02:29:52.048104 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/gpu-feature-discovery-krj5j; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.048189 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-vzcrj"
E1110 02:29:52.057008 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-vzcrj; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.057063 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-device-plugin-daemonset-xztw8"
E1110 02:29:52.061290 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-device-plugin-daemonset-xztw8; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.061316 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-device-plugin-daemonset-lwhk6"
E1110 02:29:52.065459 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-device-plugin-daemonset-lwhk6; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.065625 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-container-toolkit-daemonset-fzg45"
E1110 02:29:52.071929 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-container-toolkit-daemonset-fzg45; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.071967 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-container-toolkit-daemonset-wdlkq"
E1110 02:29:52.076635 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-container-toolkit-daemonset-wdlkq; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.076784 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-bh6jn"
E1110 02:29:52.080977 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/gpu-feature-discovery-bh6jn; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.081028 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-8vv6j"
E1110 02:29:52.085230 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-8vv6j; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
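For reference, the duplicated secret reported above can be confirmed directly on one of the stuck pods, for example (pod name taken from the logs above):

kubectl get pod nvidia-driver-daemonset-vzcrj -n gpu-operator -o jsonpath='{.spec.imagePullSecrets}'

If the same secret name appears twice in the output, the pod is affected by the issue described in the reply below.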
@shivamerla Can you please check and help with this?
@shnigam2 We have a known issue that will be fixed in the next patch, v23.9.1 (later this month). The problem is that we add duplicate pullSecrets to the pod spec. You can avoid this by not specifying the pullSecret for the validator image in ClusterPolicy. We use the validator image as an initContainer, so the same secret ends up being added twice, for the initContainer as well as the main container, in every DaemonSet.
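In terms of the Helm values shared earlier, that workaround roughly amounts to dropping the pull secret from the validator section only, for example (a sketch; it assumes the secret remains configured for the other components):

validator:
  repository: our-repo/nvidia
  # imagePullSecrets intentionally omitted for the validator image, so the
  # generated DaemonSet pod specs no longer carry the secret twice
  tolerations:
    - key: gpu.kubernetes.io/gpu-exists
      operator: Exists
      effect: NoSchedule

Pods that are already stuck may still need to be removed manually after the change, since the duplicate entry is baked into their existing specs.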