
Operator-validator enabled, validators cannot be started due to tolerations

Open catalinpan opened this issue 3 years ago • 1 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [x] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [x] Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

2. Issue or feature description

Some nodes have specific taints. The daemonsets can be configured with tolerations, but as far as I can see the validators don't support them. Can I have some help with this, please?
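As a sanity check (the namespace and label selector below are assumptions based on a default install), the missing tolerations can be confirmed by inspecting the validator pod spec directly:

# Print the tolerations of the operator-validator pods; the custom
# "dedicated" toleration from values.yaml does not appear on them.
kubectl get pods -n gpu-operator \
  -l app=nvidia-operator-validator \
  -o jsonpath='{.items[*].spec.tolerations}'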

3. Steps to reproduce the issue

values.yaml config

daemonsets:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: dedicated
    operator: Equal
    effect: NoExecute
    value: customValue100
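(For completeness, assuming the chart was installed as a Helm release named gpu-operator from the nvidia chart repo, the values above would be applied with something like:)

# Release and repo names here are assumptions; adjust to your install.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator -f values.yaml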

Node taints:

  taints:
  - effect: NoExecute
    key: dedicated
    value: customValue100
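For reproduction, the same taint can be applied with (node name is a placeholder):

# <node-name> stands in for the tainted GPU node.
kubectl taint nodes <node-name> dedicated=customValue100:NoExecute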

Node labels:

    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
    nvidia.com/gpu.deploy.device-plugin: "true"
    nvidia.com/gpu.deploy.driver: "false"
    nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
    nvidia.com/gpu.deploy.node-status-exporter: "true"
    nvidia.com/gpu.deploy.operator-validator: "true"
    nvidia.com/gpu.present: "true"
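(These labels can be read back with, e.g.:)

# Show the nvidia.com/gpu.* labels on the node; <node-name> is a placeholder.
kubectl get node <node-name> --show-labels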

4. Information to attach (optional if deemed irrelevant)

  • [x] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME
time="2022-06-23T14:56:38Z" level=info msg="pod nvidia-cuda-validator-lb52p is curently in Pending phase"
time="2022-06-23T14:56:44Z" level=info msg="Error: error validating cuda workload: failed to get pod nvidia-cuda-validator-lb52p, err pods \"nvidia-cuda-validator-lb52p\" not found"
  • [x] If a pod is deleted kubectl get events -w
0s          Normal    TaintManagerEviction   pod/nvidia-cuda-validator-28cfx                                    Marking for deletion Pod gpu-operator/nvidia-cuda-validator-lb52p
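The eviction is consistent with the missing toleration: the cuda-validator pod is created without the dedicated=customValue100:NoExecute toleration, so the taint manager marks it for deletion as soon as it lands on the node. As a temporary workaround (not a fix), removing the NoExecute taint lets validation complete:

# The trailing "-" removes the taint; <node-name> is a placeholder.
kubectl taint nodes <node-name> dedicated=customValue100:NoExecute-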

catalinpan avatar Jun 23 '22 15:06 catalinpan

Thanks for reporting this; we will fix it in an upcoming patch. PR: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/497

shivamerla avatar Jul 11 '22 17:07 shivamerla

The fix is part of the v1.11.1 release. Thank you very much for all your help!

catalinpan avatar Oct 03 '22 22:10 catalinpan