k8s-device-plugin: Support automatic discovery of MIG devices

Using k8s-device-plugin in our Kubernetes cluster, we found that in MIG mode:

- The device plugin instance corresponding to a newly created GI is not started
- The status of a newly created CI is not displayed on the node

When we delete the Pod corresponding to k8s-device-plugin and trigger a rebuild, the resources are displayed normally. It seems that the newly created MIG resources are not automatically discovered.

Oct 15 '24 02:10 DrAuYueng

That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
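
For anyone not running the GPU operator, the manual restart could be scripted along these lines. This is only a sketch: the DaemonSet name `nvidia-device-plugin-daemonset` and the `kube-system` namespace are assumptions based on a typical static deployment and may differ in your cluster, and the helper names are hypothetical.

```python
import subprocess


def restart_device_plugin_cmd(daemonset="nvidia-device-plugin-daemonset",
                              namespace="kube-system"):
    """Build the kubectl command that restarts the device plugin pods.

    The default DaemonSet name and namespace are assumptions; adjust them
    to match how the plugin was deployed in your cluster.
    """
    return ["kubectl", "rollout", "restart",
            f"daemonset/{daemonset}", "--namespace", namespace]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    if dry_run:
        print(" ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing is changed until you opt in.
    run(restart_device_plugin_cmd())
```

Run it with `dry_run=False` only after the MIG reconfiguration has finished, so the restarted plugin enumerates the new devices.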

If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.

Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
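
As a rough illustration of the node-label approach, the sketch below builds the `kubectl label` command that requests a new MIG layout. The `nvidia.com/mig.config` label key is the one described in the linked GPU operator documentation; the node name `gpu-node-1` and the profile name `all-1g.5gb` are placeholders, and the profile must exist in the mig-manager configuration on your cluster.

```python
import subprocess


def mig_config_label_cmd(node, mig_config):
    """Build the kubectl command that asks mig-manager for a new MIG layout.

    `node` and `mig_config` are supplied by the caller; the values used in
    the example below are placeholders, not recommendations.
    """
    return ["kubectl", "label", "node", node,
            f"nvidia.com/mig.config={mig_config}", "--overwrite"]


def run(cmd, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it."""
    if dry_run:
        print(" ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical node name and profile, shown as a dry run only.
    run(mig_config_label_cmd("gpu-node-1", "all-1g.5gb"))
```

`--overwrite` is needed because the label usually already carries the previous configuration name.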

Oct 15 '24 05:10 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

Jan 14 '25 04:01 github-actions[bot]

As far as I know, MIG reconfiguration requires that no jobs are running on the GPU. However, in a MIG configuration with 4g.20gb and 2g.10gb instances, is it possible to delete the 2g.10gb instance and create new 1g.5gb instances without affecting the job running on the 4g.20gb instance?

Jun 12 '25 18:06 kimmstop

> As far as I know, MIG reconfiguration requires that no jobs are running on the GPU. However, in a MIG configuration with 4g.20gb and 2g.10gb instances, is it possible to delete the 2g.10gb instance and create new 1g.5gb instances without affecting the job running on the 4g.20gb instance?

Yes, it is possible to do this dynamic reconfiguration without affecting the job running on the 4g.20gb instance.

Contrary to the belief that the entire GPU must be idle, modern NVIDIA drivers and MIG (Multi-Instance GPU) architecture support dynamic reconfiguration. This allows you to add, remove, or resize GPU Instances (GIs) and Compute Instances (CIs) on the fly, provided the specific resources you are modifying are not currently in use.
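
For the scenario in the question, the sequence on the node might look like the sketch below. It only builds the `nvidia-smi mig` command lines and does not run them unless asked. The flags (`-lgi`, `-dci`, `-dgi`, `-cgi`, `-C`) are taken from the NVIDIA MIG user guide, but please confirm them with `nvidia-smi mig -h` for your driver version. The GPU index, GI ID, and CI ID are hypothetical values that you would read from the `-lgi` and `-lci` output first.

```python
import subprocess


def replace_gi_cmds(gpu_index, gi_id, ci_id, new_profiles):
    """Build the commands that swap one idle GPU Instance for new ones.

    gpu_index, gi_id and ci_id are hypothetical inputs; look up the real
    IDs of the idle 2g.10gb instance before running anything.
    """
    gpu, gi, ci = str(gpu_index), str(gi_id), str(ci_id)
    return [
        # List current GPU Instances to confirm which GI is idle.
        ["nvidia-smi", "mig", "-lgi", "-i", gpu],
        # Destroy the Compute Instance on the idle GI first.
        ["nvidia-smi", "mig", "-dci", "-i", gpu, "-gi", gi, "-ci", ci],
        # Then destroy the idle GI itself.
        ["nvidia-smi", "mig", "-dgi", "-i", gpu, "-gi", gi],
        # Create the new GIs; -C also creates a default CI on each.
        ["nvidia-smi", "mig", "-cgi", ",".join(new_profiles), "-C", "-i", gpu],
    ]


def run_all(cmds, dry_run=True):
    """Print each command in dry-run mode, otherwise execute in order."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical IDs: GPU 0, GI 5, CI 0, replaced by two 1g.5gb instances.
    run_all(replace_gi_cmds(0, 5, 0, ["1g.5gb", "1g.5gb"]))
```

The 4g.20gb instance is never referenced, which is the point: only the idle instance is destroyed and recreated. The device plugin still needs a restart afterwards to advertise the new devices.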

Nov 20 '25 13:11 ArangoGutierrez

> That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
>
> If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.
>
> Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles

I am closing this issue, as @klueska's answer addresses the main question. If any problems persist with the latest version of the NVIDIA k8s-device-plugin, please open a new issue.

Nov 20 '25 14:11 ArangoGutierrez