Results 9 comments of Neo`

Same here. After training for a while it always OOMs (host memory, not GPU memory), Ray kills the process and the job dies, and after resuming from the checkpoint the training state is nothing like what it was before. ![Image](https://github.com/user-attachments/assets/b78bb192-4c9a-4919-993e-a8bcbfa83def)

> > Same problem here, and the memory eventually blew up :(
> >
> > It showed up after upgrading vllm to 0.7.3 and updating verl's code to the latest version, with the VLLM_ATTENTION_BACKEND=XFORMERS setting turned off.

Didn't this problem exist before?
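For reference, that backend is selected through a vLLM environment variable. A minimal sketch, assuming the variable is set before vLLM is imported in the trainer process:

```python
# Minimal sketch: select vLLM's xFormers attention backend via an environment
# variable; it must be set before vLLM initializes its attention backend.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # unset it to fall back to vLLM's default backend
```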

We have solved this problem in issue #429.

> I suspect that this is a memory leak when saving checkpoints. But I only saved the checkpoint once or twice during the training process, so if it's a checkpoint saving...

> [@PeterSH6](https://github.com/PeterSH6) would you mind adding the label `ray` to the issue? Thanks!

Okay, I'll look into that in the next few days.

> In my case, I set `actor_rollout_ref.rollout.free_cache_engine=False` to solve this problem. My vllm version is 0.6.3.

It does not work for me; my vllm version is 0.7.2.
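For anyone trying this workaround, a minimal sketch of how such a dotted override is typically passed to the verl trainer; the entrypoint module and the omitted overrides are assumptions and should be adapted to your own launch script:

```python
# Hypothetical launcher sketch: verl accepts Hydra-style dotted overrides on the
# command line, so the cache-engine setting can be flipped without editing YAML.
import subprocess

subprocess.run(
    [
        "python3", "-m", "verl.trainer.main_ppo",             # entrypoint name is an assumption
        "actor_rollout_ref.rollout.free_cache_engine=False",  # keep vLLM's cache engine resident between rollouts
        # ...remaining model/data overrides go here
    ],
    check=True,
)
```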

@kevin85421 Hi! I've generated some `.heap` memory snapshot files using `jemalloc`, and I'm wondering how I can characterize memory leaks from these heap files (sorry, I know very little about this)...
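For anyone reproducing this, a minimal sketch of how such `.heap` snapshots can be produced, assuming a jemalloc build with profiling support (`--enable-prof`) is available; the library path and dump interval below are placeholders:

```python
# Sketch: preload jemalloc into the training process and have it dump a .heap
# profile periodically, so a leaking worker can later be diffed across snapshots.
import os
import subprocess

env = dict(os.environ)
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so"  # placeholder path
env["MALLOC_CONF"] = (
    "prof:true,"               # enable heap profiling
    "prof_prefix:jeprof.out,"  # prefix of the dumped *.heap files
    "lg_prof_interval:30"      # dump after roughly every 2^30 bytes allocated
)

subprocess.run(["python3", "-m", "verl.trainer.main_ppo"], env=env, check=True)
```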

@kevin85421 I think I have discovered a clue pointing to continuously increasing memory usage. I used `jeprof` to print memory snapshots of a specific Ray worker process at intervals i500,...
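As a sketch of the comparison step: `jeprof` can take an earlier snapshot as a `--base` so the report only shows allocations that grew between two dumps. The file names and interpreter path below are placeholders:

```python
# Sketch: diff two consecutive .heap snapshots of the same worker; the entries in
# the report are allocations that grew between the baseline and the later snapshot.
import subprocess

subprocess.run(
    [
        "jeprof",
        "--text",                               # plain-text call-stack report
        "--base=jeprof.out.12345.0.i500.heap",  # earlier snapshot (placeholder name)
        "/usr/bin/python3",                     # binary the worker was running (placeholder)
        "jeprof.out.12345.1.i501.heap",         # later snapshot (placeholder name)
    ],
    check=True,
)
```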

@hiyouga It works! So it's a vLLM problem, thanks a lot! ![Image](https://github.com/user-attachments/assets/ca497af8-5652-4ccb-800c-fb240bfcd18e)