Same here. After training for a while it always OOMs (host memory, not GPU memory), so Ray kills the process and the run dies; after resuming from the checkpoint, the training state is no longer what it was at the start.
> > Same problem here; the memory eventually blew up :(
> > It appeared after upgrading vLLM to 0.7.3 and updating the verl code to the latest version, with the `VLLM_ATTENTION_BACKEND=XFORMERS` setting disabled.

Did this problem not occur before?
We have solved this problem in issue #429.
> I suspect that this is a memory leak when saving checkpoints. But I only saved the checkpoint once or twice during the training process, so if it's a checkpoint saving...
> [@PeterSH6](https://github.com/PeterSH6) would you mind adding the label `ray` to the issue? Thanks!

Okay, I'll look into that in the next few days.
> In my case, I set `actor_rollout_ref.rollout.free_cache_engine=False` to solve this problem. My vllm version is 0.6.3.

It does not work well for me; my version is 0.7.2.
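For anyone trying this workaround: a setting like this is typically passed as a Hydra-style override on the verl training command line. The entry point and the surrounding flag shown below are assumptions based on common verl PPO setups, not taken from this thread; adapt them to your own launch script.

```shell
# Hypothetical sketch: pass the rollout override on the trainer command line.
# verl.trainer.main_ppo and trainer.total_epochs are assumed names; replace
# with your actual entry point and the rest of your usual config overrides.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.free_cache_engine=False \
    trainer.total_epochs=1
```

Note that keeping the vLLM cache engine resident trades host-memory stability for higher steady-state GPU memory use, so it may not be viable on tightly packed GPUs.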
@kevin85421 Hi! I've generated some `.heap` memory snapshot files using `jemalloc`, and I'm wondering how I can characterize memory leaks from these heap files (sorry, I know very little about this)....
@kevin85421 I think I have found a clue pointing to continuously increasing memory usage. I used `jeprof` to print memory snapshots of a specific Ray worker process at intervals i500,...
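As a rough sketch of the kind of workflow described above (the snapshot file names and the Python binary path are placeholders, not taken from this thread), jemalloc heap dumps can be collected and then diffed with `jeprof` like this:

```shell
# Assumed setup: enable jemalloc heap profiling for the target process.
# lg_prof_interval:30 dumps a .heap snapshot roughly every 2^30 bytes allocated;
# prof_prefix controls the snapshot file name prefix.
export MALLOC_CONF="prof:true,prof_prefix:jeprof.out,lg_prof_interval:30"

# Later, diff an early snapshot against a late one taken from the same process.
# Allocation sites whose retained bytes keep growing between the two dumps are
# leak candidates. The binary argument must be the interpreter that ran the
# workload so jeprof can symbolize the stacks.
jeprof --text --base=jeprof.out.early.heap \
    "$(which python3)" jeprof.out.late.heap
```

Diffing with `--base` is usually more informative than inspecting a single snapshot, since it subtracts the steady-state allocations and leaves only what grew.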
@hiyouga It works! So it's a vLLM problem, thanks a lot! 