coldzerofear
/assign @k82cn
/assign @shinytang6
/assign @lowang-bh
> > volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/vgpu.(*GPUDevices).GetStatus(0xc0007173c0?)
> > /go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/vgpu/metrics.go:71 +0x18
> >
> > The logs show that it panics here; we should also assert them.

The assertions have been resolved on the outer layer,...
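For context, the usual Go pattern for guarding against such a panic is the comma-ok type assertion rather than a bare one. A minimal sketch below — the type and function names are illustrative placeholders, not the actual volcano code:

```go
package main

import "fmt"

// Device is a stand-in for the scheduler's device interface (illustrative).
type Device interface{ GetStatus() string }

// GPUDevices is a stand-in for the vgpu devices type (illustrative).
type GPUDevices struct{}

func (g *GPUDevices) GetStatus() string { return "ok" }

// describe uses the comma-ok form, so a value of an unexpected
// (or nil) dynamic type falls through instead of panicking.
func describe(v interface{}) string {
	if d, ok := v.(Device); ok && d != nil {
		return d.GetStatus()
	}
	return "unknown"
}

func main() {
	fmt.Println(describe(&GPUDevices{})) // prints "ok"
	fmt.Println(describe(nil))           // prints "unknown"
}
```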
/assign @thor-wl
You can use the environment variable `LIBCUDA_LOG_LEVEL` to increase the logging level of HAMi-core and obtain more context.
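For reference, a minimal sketch of setting this variable in a pod spec — the pod/container names, image, and the vGPU resource key are assumptions for illustration, not taken from this thread:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-debug            # hypothetical name
spec:
  containers:
    - name: cuda-app          # placeholder container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      env:
        - name: LIBCUDA_LOG_LEVEL   # raises HAMi-core log verbosity
          value: "4"
      resources:
        limits:
          volcano.sh/vgpu-number: 1  # assumed vGPU resource name
```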
> After setting `LIBCUDA_LOG_LEVEL` to `4`:
>
> ```
> (base) (⎈|N/A:N/A)➜ cat output.txt | grep -i error
> [HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlErrorString:2
> [HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceClearEccErrorCounts:10
> [HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetDetailedEccErrors:38
> ...
If there are only two physical GPUs on your node, a single container can only request a maximum of 2 vGPUs.