Kevin Klues
Hmm, that's really strange. I've not had anyone report this before, and I've not experienced it myself. What do the logs for the plugin show? Is the currently running plugin...
It looks like the plugin was sent a SIGHUP and restarted internally (without actually shutting down), rather than killed and respawned. There are known issues with the SIGHUP handling. If you kill...
The only thing I can think of that would explain this is if you somehow had a rogue plugin binary running on your system in addition to the one managed...
Also -- just to be sure -- you are seeing this under *Allocatable* and not *Capacity* in the output of `kubectl describe node`, correct? Because I wouldn't expect the numbers...
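To compare the two fields directly rather than reading them off `kubectl describe node`, something like the following might help. It is only a sketch: it assumes the GPUs are advertised under the resource name `nvidia.com/gpu` and that `kubectl` is on the PATH; `gpu_counts` and `fetch_node` are hypothetical helper names, not part of the plugin.

```python
import json
import subprocess


def gpu_counts(node, resource="nvidia.com/gpu"):
    """Return (capacity, allocatable) for one resource from a Node object.

    `node` is the parsed JSON of a Kubernetes Node. Either value is None
    when the node does not report the resource in that field.
    The resource name is an assumption; adjust it for your setup.
    """
    status = node.get("status", {})
    capacity = status.get("capacity", {}).get(resource)
    allocatable = status.get("allocatable", {}).get(resource)
    return capacity, allocatable


def fetch_node(name):
    """Fetch a Node object as a dict by shelling out to kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example (replace "my-node" with a real node name):
#   capacity, allocatable = gpu_counts(fetch_node("my-node"))
#   print("Capacity:", capacity, "Allocatable:", allocatable)
```

If the two values differ, it is the *Allocatable* one that matters for the question above.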
That said -- it certainly shouldn't stick around for 10 hours -- I just tested your exact scenario and (in my setup) the *Allocatable* values were updated immediately after the...
Another thing that just occurred to me. I'm not sure how/why it wouldn't be cleaned up, but can you check the contents of the /var/lib/kubelet/device-plugins folder? If there is an...
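For anyone who wants to script that check, here is a rough sketch that lists what is in the folder along with each entry's type. It makes no assumption about what should or shouldn't be there; `list_plugin_dir` is a hypothetical helper name, and reading the directory will probably need root on most hosts.

```python
import os
import stat


def list_plugin_dir(plugin_dir="/var/lib/kubelet/device-plugins"):
    """Return a sorted list of (name, kind) for every entry in plugin_dir.

    kind is one of "socket", "dir", "file", or "other".
    """
    entries = []
    for entry in os.scandir(plugin_dir):
        # Inspect the entry itself rather than whatever a symlink points to.
        mode = entry.stat(follow_symlinks=False).st_mode
        if stat.S_ISSOCK(mode):
            kind = "socket"
        elif stat.S_ISDIR(mode):
            kind = "dir"
        elif stat.S_ISREG(mode):
            kind = "file"
        else:
            kind = "other"
        entries.append((entry.name, kind))
    return sorted(entries)


def main():
    # Run this on the host itself, not inside a container.
    for name, kind in list_plugin_dir():
        print(f"{kind:7s} {name}")
```

Call `main()` on the node and paste the output into the issue.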
Needs to be set on the host, not inside a container. Here’s a link to the details: https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit
The plugin never calls `nvml.NewDevice()` anywhere -- it only calls `nvml.NewDeviceLite()`, which shouldn't have this issue. That said, the issue comes from the fact that you can't call `GetMemoryInfo()` without...
No, you must stop the kubelet. You must also stop the device plugin (and any other clients of the GPU, e.g. gpu-feature-discovery, etc.). Please see the discussion starting here: https://github.com/NVIDIA/k8s-device-plugin/issues/180#issuecomment-682344586
Also, if what you are trying to do is run the plugin with MIG support, this is a good resource: https://docs.google.com/document/d/1bshSIcWNYRZGfywgwRHa07C0qRyOYKxWYxClbeJM-WM/edit