Results 9 comments of Neo`

Same here. After training for a while it always OOMs (host memory, not GPU memory), Ray kills the process and the job dies, and after resuming from the checkpoint the training state is nothing like what it was before. ![Image](https://github.com/user-attachments/assets/b78bb192-4c9a-4919-993e-a8bcbfa83def)

> > Same problem here, and the memory eventually blew up :(
> >
> > It showed up after upgrading vllm to 0.7.3 and updating verl's code to the latest version, with the VLLM_ATTENTION_BACKEND=XFORMERS setting turned off.

Didn't this problem exist before?
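For reference, that backend is selected through a vLLM environment variable. A minimal sketch, assuming the variable is set before vLLM is imported in the trainer process:

```python
# Minimal sketch: select vLLM's xFormers attention backend via an environment
# variable; it must be set before vLLM initializes its attention backend.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # unset it to fall back to vLLM's default backend
```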

We have solved this problem in issue #429.

> I suspect that this is a memory leak when saving checkpoints. But I only saved the checkpoint once or twice during the training process, so if it's a checkpoint saving...

> [@PeterSH6](https://github.com/PeterSH6) would you mind adding the label `ray` to the issue? Thanks!

Okay, I'll look into that in the next few days.

> In my case, I set `actor_rollout_ref.rollout.free_cache_engine=False` to solve this problem. My vllm version is 0.6.3.

It does not work for me; my vllm version is 0.7.2.
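For anyone trying this workaround, a minimal sketch of how such a dotted override is typically passed to the verl trainer; the entrypoint module and the omitted overrides are assumptions and should be adapted to your own launch script:

```python
# Hypothetical launcher sketch: verl accepts Hydra-style dotted overrides on the
# command line, so the cache-engine setting can be flipped without editing YAML.
import subprocess

subprocess.run(
    [
        "python3", "-m", "verl.trainer.main_ppo",             # entrypoint name is an assumption
        "actor_rollout_ref.rollout.free_cache_engine=False",  # keep vLLM's cache engine resident between rollouts
        # ...remaining model/data overrides go here
    ],
    check=True,
)
```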

@kevin85421 Hi! I've generated some `.heap` memory snapshot files using `jemalloc`, and I'm wondering how I can characterize memory leaks from these heap files (sorry, I know very little about this)...
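For anyone reproducing this, a minimal sketch of how such `.heap` snapshots can be produced, assuming a jemalloc build with profiling support (`--enable-prof`) is available; the library path and dump interval below are placeholders:

```python
# Sketch: preload jemalloc into the training process and have it dump a .heap
# profile periodically, so a leaking worker can later be diffed across snapshots.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so"  # placeholder path
env["MALLOC_CONF"] = (
    "prof:true,"               # enable heap profiling
    "prof_prefix:jeprof.out,"  # prefix of the dumped *.heap files
    "lg_prof_interval:30"      # dump after roughly every 2^30 bytes allocated
)

subprocess.run(["python3", "-m", "verl.trainer.main_ppo"], env=env, check=True)
```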

@kevin85421 I think I have discovered a clue pointing to continuously increasing memory usage. I used `jeprof` to print memory snapshots of a specific Ray worker process at intervals i500,...
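As a sketch of the comparison step: `jeprof` can take an earlier snapshot as a `--base` so the report only shows allocations that grew between two dumps. The file names and interpreter path below are placeholders:

```python
# Sketch: diff two consecutive .heap snapshots of the same worker; the entries in
# the report are allocations that grew between the baseline and the later snapshot.
import subprocess

subprocess.run(
    [
        "jeprof",
        "--text",                               # plain-text call-stack report
        "--base=jeprof.out.12345.0.i500.heap",  # earlier snapshot (placeholder name)
        "/usr/bin/python3",                     # binary the worker was running (placeholder)
        "jeprof.out.12345.1.i501.heap",         # later snapshot (placeholder name)
    ],
    check=True,
)
```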

@hiyouga It works! So it's a vLLM problem, thanks a lot! ![Image](https://github.com/user-attachments/assets/ca497af8-5652-4ccb-800c-fb240bfcd18e)