TimWang
@elezar Could you PTAL?
I already tried `echo get_default_active_thread_percentage | nvidia-cuda-mps-control` from https://github.com/NVIDIA/k8s-device-plugin/issues/647; everything looks fine, but the hanging-process issue still happens.
```
root@gpu-pod:/# echo get_default_active_thread_percentage | nvidia-cuda-mps-control
25.0
```
@janetat No need. That is how I am using it right now. These components do not conflict because they serve different purposes: Node Feature Discovery and GPU Feature Discovery are meant to discover node-related information and expose it as new labels, while HAMi does not depend on them for its functionality.
@janetat You can build your own custom image, but you need to install CUDA and other dependencies yourself. You can refer to https://github.com/Project-HAMi/ai-benchmark/blob/main/Dockerfile
@thungrac You can use the shell's `>` or `>>` operators to redirect output to a file. The `>` operator creates a new file, or overwrites the file if it already exists, while `>>` appends to the end of an existing file.
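To illustrate the difference, a minimal sketch (the file name `results.log` is just an example):

```shell
# '>' truncates: it creates the file, or overwrites it if it already exists.
echo "first run"  >  results.log
# '>>' appends: the file now contains both lines.
echo "second run" >> results.log
# Redirecting with '>' again discards the previous contents.
echo "fresh"      >  results.log
cat results.log   # only "fresh" remains
```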
As a user for the k8s-vgpu-scheduler , I would suggest that @chaunceyjiang could add more UT for your code change to ensure that the new code is fully tested.
@wawa0210 @archlitchi PTAL
Appending the log after setting `LIBCUDA_LOG_LEVEL` to `4`:
```
(base) (⎈|N/A:N/A)➜ cat output.txt | grep -i error
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlErrorString:2
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceClearEccErrorCounts:10
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetDetailedEccErrors:38
...
```