Support automatic discovery of MIG devices
Using k8s-device-plugin in our Kubernetes cluster, we found that in MIG mode:
- The device plug-in instance corresponding to the newly created GI is not started
- The status of the newly created CI in the node is not displayed
When we delete the Pod corresponding to k8s-device-plugin and trigger a rebuild, the resources are displayed normally. It seems that the newly created MIG resources are not automatically discovered.
That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.
Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
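For anyone landing here, a rough sketch of the two options described above, written as a small Python helper that only builds the `kubectl` commands (and prints them unless `dry_run=False`). The node name `gpu-node-1` is hypothetical. The `nvidia.com/mig.config` label and the `all-1g.5gb` value follow the linked GPU operator docs. The daemonset name and namespace are assumptions based on the static device-plugin manifest and may differ in your deployment.

```python
import shlex
import subprocess


def label_mig_config_cmd(node, mig_config):
    """kubectl command that asks the mig-manager to apply a MIG config to a node."""
    return [
        "kubectl", "label", "nodes", node,
        f"nvidia.com/mig.config={mig_config}", "--overwrite",
    ]


def restart_device_plugin_cmd(daemonset="nvidia-device-plugin-daemonset",
                              namespace="kube-system"):
    """kubectl command that restarts the device plugin (standalone deployments).

    The default daemonset name and namespace are assumptions; adjust to your cluster.
    """
    return ["kubectl", "-n", namespace, "rollout", "restart", f"daemonset/{daemonset}"]


def run(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # With the GPU operator: let the mig-manager handle reconfiguration.
    run(label_mig_config_cmd("gpu-node-1", "all-1g.5gb"))
    # Without the operator: restart the plugin after reconfiguring MIG by hand.
    run(restart_device_plugin_cmd())
```

With the GPU operator, only the first command should be needed; the restart is for setups that run the device plugin on its own.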
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
As far as I know, MIG reconfiguration requires that no jobs are running on the GPU. However, in a MIG configuration with 4g.20gb and 2g.10gb instances, is it possible to delete the 2g.10gb instance and create new 1g.5gb instances without affecting the job running on the 4g.20gb instance?
Yes, it is possible to do this dynamic reconfiguration without affecting the job running on the 4g.20gb instance.
The entire GPU does not have to be idle: recent NVIDIA drivers and the MIG (Multi-Instance GPU) architecture support dynamic reconfiguration. You can create and destroy GPU Instances (GIs) and Compute Instances (CIs) on the fly, and so replace an instance with differently sized ones, provided the specific instances you are modifying are not currently in use.
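As a minimal sketch of what that looks like on the node, the helper below builds the `nvidia-smi mig` commands to tear down one GI and create new ones in its place. It only returns the command lists and does not run anything. The GPU index `0` and GI ID `5` are hypothetical (look up the real ID with `nvidia-smi mig -lgi`), and the exact flag spelling is worth checking against `nvidia-smi mig -h` for your driver version.

```python
def replace_gi_cmds(gpu_index, old_gi_id, new_profiles):
    """Build nvidia-smi commands that replace one GPU Instance with new ones.

    gpu_index and old_gi_id are placeholders supplied by the caller;
    new_profiles is a list of MIG profile names such as "1g.5gb".
    """
    gpu = str(gpu_index)
    gi = str(old_gi_id)
    return [
        # 1. Destroy the compute instances inside the old GPU instance.
        ["nvidia-smi", "mig", "-dci", "-gi", gi, "-i", gpu],
        # 2. Destroy the old GPU instance itself.
        ["nvidia-smi", "mig", "-dgi", "-gi", gi, "-i", gpu],
        # 3. Create the new GPU instances, with default compute instances (-C).
        ["nvidia-smi", "mig", "-cgi", ",".join(new_profiles), "-C", "-i", gpu],
    ]


if __name__ == "__main__":
    # Swap a 2g.10gb instance (hypothetical GI ID 5) for two 1g.5gb instances.
    for cmd in replace_gi_cmds(0, 5, ["1g.5gb", "1g.5gb"]):
        print(" ".join(cmd))
```

The 4g.20gb instance is never referenced, so its job keeps running. As noted earlier in this thread, the device plugin still has to be restarted afterwards so that the new instances are advertised to Kubernetes.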
I am closing this issue, as @klueska's answer properly addresses the main question.
I am closing this issue. If any problems persist using the latest version of the NVIDIA K8S-Device-Plugin, please open a new issue.