TimWang
@elezar Could you PTAL?
I already tried `echo get_default_active_thread_percentage | nvidia-cuda-mps-control` from https://github.com/NVIDIA/k8s-device-plugin/issues/647; everything looks fine, but the hanging-process issue still happens.
```
root@gpu-pod:/# echo get_default_active_thread_percentage | nvidia-cuda-mps-control
25.0
```
@janetat No need. That is how I am using it right now. These components do not conflict because they serve different purposes: Node Feature Discovery and GPU Feature Discovery are meant to discover node-related information and expose it as new labels, while HAMi does not depend on them for its functionality.
@janetat You can build your own custom image, but you need to install CUDA and other dependencies yourself. You can refer to https://github.com/Project-HAMi/ai-benchmark/blob/main/Dockerfile
@thungrac You can use the shell's `>` or `>>` operators to redirect output to a file. The `>` operator creates a new file, or overwrites the file if it already exists, while `>>` appends to the end of an existing file.
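To illustrate the difference, a minimal sketch (the file name `results.log` is just an example):

```shell
# '>' truncates: it creates the file, or overwrites it if it already exists.
echo "first run"  >  results.log
# '>>' appends: the file now contains both lines.
echo "second run" >> results.log
# Redirecting with '>' again discards the previous contents.
echo "fresh"      >  results.log
cat results.log   # only "fresh" remains
```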
As a user for the k8s-vgpu-scheduler , I would suggest that @chaunceyjiang could add more UT for your code change to ensure that the new code is fully tested.
@wawa0210 @archlitchi PTAL
Appending the log after setting `LIBCUDA_LOG_LEVEL` to `4`:
```
(base) (⎈|N/A:N/A)➜ cat output.txt | grep -i error
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlErrorString:2
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceClearEccErrorCounts:10
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetDetailedEccErrors:38
...
```