feat: Update k8s-device-plugin to v0.14.5 to resolve nanoGPT runtime issue

Open haitwang-cloud opened this issue 1 year ago • 1 comment

What type of PR is this?

During an offline debugging session with @archlitchi, we identified that the current NVIDIA device plugin (v1.4.0) causes compatibility issues with nanoGPT and prevents it from running properly. The issue persists even after setting CUDA_DISABLE_CONTROL to true and removing ld.so.preload from the GPU node. We confirmed that the problem also occurs in version 0.14.0 of the k8s-device-plugin. To resolve it, we need to update the k8s-device-plugin to at least version 0.14.5.

/kind bug

What this PR does / why we need it:

This update ensures that our GPU resources are used effectively and that applications like nanoGPT can run without hitting this compatibility issue.

Which issue(s) this PR fixes:

Fixes https://github.com/Project-HAMi/HAMi/issues/347

Special notes for your reviewer:

I cannot build HAMi in my local environment, so I am not able to run the E2E tests. Could you please help build a test image with this PR?

Does this PR introduce a user-facing change?:

No

haitwang-cloud, Jul 18 '24 06:07

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haitwang-cloud

Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

hami-robott[bot], Jul 18 '24 06:07

Replaced by a new PR: https://github.com/Project-HAMi/HAMi/pull/855

haitwang-cloud, Mar 18 '25 08:03