
ERROR - An error occurred during score calculation: Task was killed due to the node running low on memory.

Open Siri-2001 opened this issue 1 year ago • 2 comments

I ran into the following problem while running multi-node, multi-GPU training with Ray. What are some ways to solve it?

Image

Siri-2001 · Apr 03 '25 05:04

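Following solutions I found online, I have already set RAY_DEBUG_DISABLE_MEMORY_MONITOR=1, but the problem persists. The maximum memory per node is currently 1000 GB and cannot be changed. What optimizations are possible?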

Siri-2001 · Apr 03 '25 05:04

@Siri-2001 Did you find a solution in the end?

xs1997zju · Apr 21 '25 02:04
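
For readers who land here with the same error, below is a minimal sketch of how the memory monitor settings discussed above can be passed to Ray. It is not a confirmed fix for this issue. It assumes the two environment variables described in Ray's out-of-memory prevention documentation: `RAY_memory_usage_threshold` (the fraction of node memory at which Ray starts killing tasks) and `RAY_memory_monitor_refresh_ms` (the check interval, where 0 disables the monitor). The helper name `ray_memory_monitor_env` is hypothetical and exists only for illustration.

```python
import os


def ray_memory_monitor_env(usage_threshold=0.95, refresh_ms=250, base_env=None):
    """Return a copy of an environment with Ray memory monitor settings applied.

    Hypothetical helper for illustration. The variable names are taken from
    Ray's out-of-memory prevention documentation; check them against the Ray
    version you are running.
    """
    if not 0.0 <= usage_threshold <= 1.0:
        raise ValueError("usage_threshold must be between 0.0 and 1.0")
    if refresh_ms < 0:
        raise ValueError("refresh_ms must be non-negative")

    # Copy so the caller's environment is not modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_memory_usage_threshold"] = str(usage_threshold)
    # A refresh interval of 0 disables the memory monitor entirely.
    env["RAY_memory_monitor_refresh_ms"] = str(refresh_ms)
    return env


if __name__ == "__main__":
    env = ray_memory_monitor_env(usage_threshold=0.98, refresh_ms=250)
    for key in ("RAY_memory_usage_threshold", "RAY_memory_monitor_refresh_ms"):
        print(f"{key}={env[key]}")
    # The variables must be present when Ray itself starts on every node,
    # for example (not executed here):
    #   subprocess.run(["ray", "start", "--head"], env=env, check=True)
```

Raising the threshold or disabling the monitor only changes when Ray intervenes. It does not lower the memory the job actually uses, so the operating system's own OOM killer may still terminate processes if the workload exceeds the node's physical memory.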