YangYuxuan
Hi, thanks for the work. I ran into a problem in a no-peer-access environment: ``` (WorkerDict pid=2890081) [2025-03-13 15:00:53 TP0] Scheduler hit an exception: Traceback (most recent call last): (WorkerDict pid=2890081)...
@fzyzcjy Thanks for the explanation, I guess that is the root of the issue. I disabled the CUDA_VISIBLE_DEVICES overrides, and the error is resolved. I think the problem...
@fzyzcjy #### 1. Disabling Overrides and this PR https://github.com/pytorch/pytorch/pull/149248 This is what I meant by disabling the CUDA_VISIBLE_DEVICES overrides. I do not actually alter Ray's device isolation. And this current fix...
@fzyzcjy BTW, the peer access error appears regardless of which devices are used, so I guess it is not a device issue.
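For anyone debugging a similar setup, here is a rough sketch of how one might check which devices a process sees and which device pairs lack peer access. The helper names `visible_devices` and `peer_access_report` are made up for illustration and are not part of any library; the only PyTorch calls used are `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.can_device_access_peer`, and the script assumes nothing about how Ray or SGLang set up the workers.

```python
import os
from itertools import permutations


def visible_devices(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of entries (None if unset).

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def peer_access_report(can_access, device_count):
    """Return ordered pairs (i, j) where device i cannot access device j.

    `can_access` is any callable taking two device indices, for example
    torch.cuda.can_device_access_peer. Hypothetical helper.
    """
    return [
        (i, j)
        for i, j in permutations(range(device_count), 2)
        if not can_access(i, j)
    ]


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES:", visible_devices())
    try:
        import torch
    except ImportError:
        torch = None
    # Only query the GPUs when PyTorch and CUDA are actually present.
    if torch is not None and torch.cuda.is_available():
        n = torch.cuda.device_count()
        blocked = peer_access_report(torch.cuda.can_device_access_peer, n)
        print("pairs without peer access:", blocked)
```

Running this inside the worker process (rather than the driver) should show the device list that process actually sees, which may differ from the driver's when something overrides the variable.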
@fzyzcjy Thanks!! The script runs just fine in the container. I managed to reproduce and fix the problem in my environment; it turns out that the torch-memory-saver is wrongly installed...