shobhit_n

Results 13 comments of shobhit_n

@tariq1890 Please find the pod listing for the GPU node when all NVIDIA pods are in the Running state:

```
k get po -n gpu-operator -o wide | grep -i ip-10-222-100-91.ec2.internal
gpu-feature-discovery-zzkqg 1/1 Running...
```

@tariq1890 For example, on direct termination of the backing EC2 instance: up to k8s 1.24 that removed all of these NVIDIA pods, but on k8s 1.26 these 4 pods still show as Running even...

@tariq1890 @cdesiniotis @shivamerla Please let me know how to fix this issue. DaemonSet pods are not getting cleaned up when the cluster autoscaler terminates a node. This would ideally have removed all nvidia...

@shivamerla @tariq1890 @cdesiniotis Could you please help us fix this behaviour? Because of it, the namespace unnecessarily shows pods that no longer exist, since the node has already been scaled...
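Until the root cause is fixed, one possible workaround is to clean up pods that still reference nodes the API server no longer knows about. This is only a sketch, not anything from the thread: it assumes the official `kubernetes` Python client and the `gpu-operator` namespace mentioned above, and it separates the pure filtering logic from the cluster calls so the logic can be checked without a cluster.

```python
def stale_pods(pods, live_nodes):
    """Return names of pods scheduled on nodes that no longer exist.

    `pods` is a list of (pod_name, node_name) tuples; `live_nodes` is the
    set of node names currently registered with the API server.
    """
    return [name for name, node in pods if node not in live_nodes]


def cleanup(namespace="gpu-operator"):
    """Live-cluster usage sketch (requires a kubeconfig and
    `pip install kubernetes`); names here are assumptions, not values
    confirmed by the thread."""
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()
    live = {n.metadata.name for n in v1.list_node().items}
    pods = [(p.metadata.name, p.spec.node_name)
            for p in v1.list_namespaced_pod(namespace).items]
    for name in stale_pods(pods, live):
        # Force-delete, mirroring what PodGC should do for orphaned pods.
        v1.delete_namespaced_pod(name, namespace, grace_period_seconds=0)
```

This is a stopgap; the underlying PodGC/DaemonSet behaviour on 1.26 still needs the fix discussed above.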

@shivamerla Yes, we are using a private registry. Please find the controller-manager log errors:

```
I1110 02:29:52.043817 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-krj5j"
E1110 02:29:52.048104 1 gc_controller.go:255]...
```

@cartermckinnon We followed the steps below on an existing 1.26 cluster to make it ready for the 1.27 upgrade:

```
On existing version 1.26
Add tag to each node [kubernetes.io/cluster/cluster-name: owned
k...
```
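The node-tagging step above can be sketched with the AWS CLI; the instance ID and `cluster-name` here are placeholders, not values from the thread:

```
# Hypothetical example: tag the EC2 instance backing a node so the AWS
# cloud provider associates it with the cluster. Replace the instance ID
# and "cluster-name" with your own values.
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=kubernetes.io/cluster/cluster-name,Value=owned
```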

@cartermckinnon Let me share the 10-kubeadm-conf and kubeadm-config we currently have on 1.26, where in-tree cloud provider support is present:

```
10-kubeadm-conf
# Note: This dropin only works with...
```
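Since the drop-in above is truncated, for reference this is what the stock kubeadm kubelet drop-in looks like upstream; a sketch only, with the `--cloud-provider` note added as an assumption about where in-tree vs. external support is selected:

```
# Sketch of the stock kubeadm kubelet drop-in
# (/usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf).
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# kubeadm writes --cloud-provider into kubeadm-flags.env: "aws" for the
# in-tree provider, "external" once migrated to cloud-provider-aws.
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
```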