Support automatic discovery of MIG devices

Open DrAuYueng opened this issue 1 year ago • 1 comments

Using k8s-device-plugin in our kubernetes cluster, we found that in MIG mode:

  1. The device plug-in instance corresponding to the newly created GI is not started
  2. The status of the newly created CI on the node is not displayed

When we delete the Pod corresponding to k8s-device-plugin and trigger a rebuild, the resources are displayed normally. It seems that the newly created MIG resources are not automatically discovered.

DrAuYueng avatar Oct 15 '24 02:10 DrAuYueng

That is correct. The device-plugin needs to be restarted after a MIG reconfiguration.
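
As a rough sketch of that restart step, the snippet below only builds the `kubectl` command. The namespace and DaemonSet name are assumptions based on a typical standalone device-plugin deployment and will likely differ in your cluster.

```python
from typing import List

# Assumed values for illustration; check `kubectl get daemonsets -A`
# for the names used in your cluster.
NAMESPACE = "kube-system"
DAEMONSET = "nvidia-device-plugin-daemonset"


def restart_command(namespace: str = NAMESPACE, daemonset: str = DAEMONSET) -> List[str]:
    """Build the kubectl command that restarts the device-plugin pods."""
    return ["kubectl", "-n", namespace, "rollout", "restart", f"daemonset/{daemonset}"]


# Print the command for review; run it yourself (or pass the list to
# subprocess.run) once the names match your cluster.
print(" ".join(restart_command()))
```

After the pods come back up, the plugin re-enumerates the MIG devices that exist at startup.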

If you use the GPU operator, this process is automated for you by a component called the mig-manager, so that you don't have to manage this complexity yourself.

Using the mig-manager, you can dynamically reconfigure the set of available MIG devices on a node by setting a node label. Details can be found here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html#example-reconfiguring-mig-profiles
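
A minimal sketch of the node-label step, assuming the `nvidia.com/mig.config` label key described in the linked documentation. The node name `gpu-node-1` and the profile `all-1g.5gb` are placeholders; the profile has to exist in the mig-manager configuration for your GPU model.

```python
from typing import List

# Label key read by the mig-manager (per the linked GPU operator docs).
MIG_CONFIG_LABEL = "nvidia.com/mig.config"


def mig_label_command(node: str, profile: str) -> List[str]:
    """Build the kubectl command that requests a MIG profile on a node."""
    return ["kubectl", "label", "nodes", node, f"{MIG_CONFIG_LABEL}={profile}", "--overwrite"]


# "gpu-node-1" and "all-1g.5gb" are hypothetical example values.
print(" ".join(mig_label_command("gpu-node-1", "all-1g.5gb")))
# prints: kubectl label nodes gpu-node-1 nvidia.com/mig.config=all-1g.5gb --overwrite
```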

klueska avatar Oct 15 '24 05:10 klueska

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Jan 14 '25 04:01 github-actions[bot]

As far as I know, MIG reconfiguration requires that no jobs are running on the GPU. However, in a MIG configuration with 4g.20gb and 2g.10gb instances, is it possible to delete the 2g.10gb instance and create new 1g.5gb instances without affecting the job running on the 4g.20gb instance?

kimmstop avatar Jun 12 '25 18:06 kimmstop

Yes, it is possible to do this dynamic reconfiguration without affecting the job running on the 4g.20gb instance.

Contrary to the belief that the entire GPU must be idle, modern NVIDIA drivers and MIG (Multi-Instance GPU) architecture support dynamic reconfiguration. This allows you to add, remove, or resize GPU Instances (GIs) and Compute Instances (CIs) on the fly, provided the specific resources you are modifying are not currently in use.
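
For illustration, the sequence on the node could look roughly like the sketch below, which only builds `nvidia-smi mig` command lines. The GPU index and GPU Instance ID are hypothetical and have to be read from the `-lgi` output, whether the new profiles fit depends on the placement rules of the GPU, and the exact flags should be checked against `nvidia-smi mig -h` for your driver version.

```python
from typing import List


def swap_gpu_instance(gpu: int, old_gi_id: int, new_profiles: List[str]) -> List[List[str]]:
    """Build the commands that replace one idle GPU Instance with new ones."""
    base = ["nvidia-smi", "mig"]
    target = ["-i", str(gpu)]
    return [
        # 1. List GPU Instances to find the ID of the idle one.
        base + ["-lgi"] + target,
        # 2. Destroy the Compute Instances inside that GPU Instance.
        base + ["-dci"] + target + ["-gi", str(old_gi_id)],
        # 3. Destroy the GPU Instance itself.
        base + ["-dgi"] + target + ["-gi", str(old_gi_id)],
        # 4. Create the new GPU Instances; -C also creates default Compute Instances.
        base + ["-cgi", ",".join(new_profiles)] + target + ["-C"],
    ]


# Hypothetical values: GPU 0, idle 2g.10gb instance with GI ID 5.
for cmd in swap_gpu_instance(0, 5, ["1g.5gb", "1g.5gb"]):
    print(" ".join(cmd))
```

Only the idle 2g.10gb instance is touched here; the 4g.20gb instance is never referenced. In a Kubernetes cluster, the device-plugin still needs a restart afterwards to advertise the new devices.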

ArangoGutierrez avatar Nov 20 '25 13:11 ArangoGutierrez

I am closing this issue, as @klueska's answer properly addresses the main question.

If any problems persist with the latest version of the NVIDIA K8S-Device-Plugin, please open a new issue.

ArangoGutierrez avatar Nov 20 '25 14:11 ArangoGutierrez