gpu-operator
Operator-validator enabled, validators cannot be started due to tolerations
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [x] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
Some nodes have specific taints. The DaemonSets can be configured with tolerations, but as far as I can see, the validators don't support that. Can I have some help with this, please?
2. Steps to reproduce the issue
`values.yaml` config:

```yaml
daemonsets:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: dedicated
      operator: Equal
      effect: NoExecute
      value: customValue100
```
Node taints:

```yaml
taints:
  - effect: NoExecute
    key: dedicated
    value: customValue100
```
Node labels:

```yaml
nvidia.com/gpu.deploy.container-toolkit: "true"
nvidia.com/gpu.deploy.dcgm: "true"
nvidia.com/gpu.deploy.dcgm-exporter: "true"
nvidia.com/gpu.deploy.device-plugin: "true"
nvidia.com/gpu.deploy.driver: "false"
nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
nvidia.com/gpu.deploy.node-status-exporter: "true"
nvidia.com/gpu.deploy.operator-validator: "true"
nvidia.com/gpu.present: "true"
```
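To illustrate the failure mode above, here is a minimal sketch of Kubernetes taint/toleration matching semantics, not the gpu-operator's actual code; the function names (`tolerates`, `evicted_by`) are hypothetical. It shows why the DaemonSet pods, which carry the custom `dedicated` toleration, stay on the node, while a validator pod that did not inherit that toleration is removed by the `NoExecute` taint:

```python
# Simplified sketch of Kubernetes taint/toleration matching (illustrative
# only; helper names are made up, not gpu-operator APIs).

def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration covers a single taint."""
    # An empty effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        # Exists with an empty key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: key and value must both match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))

def evicted_by(taints: list, tolerations: list) -> list:
    """Return the NoExecute taints that no toleration covers."""
    return [t for t in taints
            if t["effect"] == "NoExecute"
            and not any(tolerates(tol, t) for tol in tolerations)]

# The node taint from this issue:
node_taints = [{"key": "dedicated", "value": "customValue100",
                "effect": "NoExecute"}]

# Tolerations configured for the DaemonSets in values.yaml:
ds_tolerations = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    {"key": "dedicated", "operator": "Equal", "effect": "NoExecute",
     "value": "customValue100"},
]

# A validator pod that only carries the default GPU toleration:
validator_tolerations = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
]

print(evicted_by(node_taints, ds_tolerations))        # [] -> pods stay
print(evicted_by(node_taints, validator_tolerations)) # taint listed -> evicted
```

This mirrors the `TaintManagerEviction` event in the logs below: the taint manager deletes any pod on the node whose tolerations do not cover the `NoExecute` taint.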
3. Information to attach (optional if deemed irrelevant)
- [x] If a pod/ds is in an error state or pending state: `kubectl logs -n NAMESPACE POD_NAME`

```
time="2022-06-23T14:56:38Z" level=info msg="pod nvidia-cuda-validator-lb52p is curently in Pending phase"
time="2022-06-23T14:56:44Z" level=info msg="Error: error validating cuda workload: failed to get pod nvidia-cuda-validator-lb52p, err pods \"nvidia-cuda-validator-lb52p\" not found"
```
- [x] If a pod is deleted: `kubectl get events -w`

```
0s  Normal  TaintManagerEviction  pod/nvidia-cuda-validator-28cfx  Marking for deletion Pod gpu-operator/nvidia-cuda-validator-lb52p
```
Thanks for reporting this; we will fix it in an upcoming patch. PR: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/497
The fix is part of the v1.11.1 release. Thank you very much for all your help!