fighterhit
@Dreamsorcerer Yes, relying on external events would let us learn about changes in the number of server instances promptly, but it is more complicated. It would be simpler if it could be...
I hit this on driver `525.85.12` with an A30.
Hi @aritger, is there any solution for this?
Thanks @aritger, here is the log: [nvidia-bug-report.log.gz](https://github.com/NVIDIA/open-gpu-kernel-modules/files/10836852/nvidia-bug-report.log.gz). The problem appears after running in a Kubernetes environment for some time, and `nvidia-smi` hangs for a while. The specific symptoms are similar...
@jelmd +1, I ran into this problem again. Hi @aritger @Joshua-Ashton, maybe this is a driver issue; please take a look. ``` Fri Mar 3 14:41:55 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI...
> Hi @lpla, what environment are you running in? Kubernetes?
> There is no environment. It triggered the bug several times on both the 525 and 530 drivers. It is a Machine Learning inference command line written in PyTorch. Have you...
> We also use PyTorch on the GPUs, but the 470 driver we used before was more stable.
### UPDATE

They have confirmed this `Xid 119` bug. They said that the `GSP` feature was introduced in version 510, and that the bug has not been fixed yet. They only gave...
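
For anyone who wants to check whether their own nodes are hitting the same error, here is a minimal sketch that scans kernel log text for Xid events and filters for code 119. It assumes the driver writes lines of the form `NVRM: Xid (PCI:<address>): <code>, ...` to the kernel log; the function name `find_xid_events` and the sample lines are hypothetical and written for illustration, not taken from this thread.

```python
import re

# Assumed kernel-log shape: "NVRM: Xid (PCI:0000:3b:00): 119, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)")


def find_xid_events(log_text, code=None):
    """Return (device, xid_code, raw_line) tuples found in log_text.

    If code is given, only events with that Xid code are returned.
    """
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        xid = int(match.group("code"))
        if code is None or xid == code:
            events.append((match.group("dev"), xid, line.strip()))
    return events


# Made-up sample lines for illustration only (not real output from this issue).
sample = """\
[1234.5] NVRM: Xid (PCI:0000:3b:00): 119, pid=4242, (illustrative message)
[1240.1] usb 1-1: some unrelated kernel message
[1300.9] NVRM: Xid (PCI:0000:af:00): 79, pid=1, (illustrative message)
"""

for dev, xid, _line in find_xid_events(sample, code=119):
    print(dev, xid)
```

In practice the text would come from the kernel log (for example the output of `dmesg`) instead of the `sample` string.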
Hi @stephenroller @liming5619, it may be better to downgrade the driver version. On one hand, the GSP feature was introduced by NVIDIA in 510, and this bug has not been fixed yet....