
What situations could cause the 'Memory usage increased after sleeping' problem?

Open shenofusc opened this issue 4 months ago • 2 comments

Hi, I'm using the verl framework to train Qwen3-30B-A3B. The training task ran for more than 200 steps, but at step 223 we hit the 'Memory usage increased after sleeping' error. It's a bit weird, and I don't know how to fix it. Can anybody give me some clues to help fix this problem? Thanks!

My environment is as follows:
hardware: Huawei Ascend NPU 910B, 8 nodes with 8 NPUs each, 64 NPUs in total
CANN: 8.2.RC1
Python: 3.10.18
vllm: 0.9.1
vllm-ascend: 0.9.1rc3
torch: 2.5.1
torch-npu: 2.5.1.post1

Here are the error logs:

  File "/cache/verl_algo/verl/single_controller/ray/base.py", line 766, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/cache/verl_algo/verl/single_controller/base/decorator.py", line 430, in inner
    return func(*args, **kwargs)
  File "/cache/verl_algo/verl/utils/profiler/mstx_profile.py", line 210, in wrapper
    return func(self, *args, **kwargs)
  File "/cache/verl_algo/verl/workers/fsdp_workers.py", line 762, in generate_sequences
    with self.rollout_sharding_manager:
  File "/cache/verl_algo/verl/utils/profiler/performance.py", line 105, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/cache/verl_algo/verl/utils/profiler/performance.py", line 118, in log
    output = func(*args, **kwargs)
  File "/cache/verl_algo/verl/workers/sharding_manager/fsdp_vllm.py", line 240, in __exit__
    self.inference_engine.sleep(level=1)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1322, in sleep
    self.llm_engine.sleep(level=level)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1860, in sleep
    self.model_executor.sleep(level=level)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 207, in sleep
    self.collective_rpc("sleep", kwargs=dict(level=level))
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
    return func(*args, **kwargs)
  File "/cache/verl_env/lib/python3.10/site-packages/vllm_ascend/worker/worker.py", line 202, in sleep
    assert freed_bytes >= 0, "Memory usage increased after sleeping."
AssertionError: Memory usage increased after sleeping.
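To make the failing assertion easier to reason about, here is a minimal sketch of the kind of check the last frame performs. It assumes that `freed_bytes` is the difference between device-wide free memory measured after and before the engine offloads its buffers; that is an assumption, since the traceback only shows the assertion itself. `FakeDevice` and `sleep_check` are hypothetical names used for illustration and are not part of vllm, vllm-ascend, or torch-npu. Under that assumption, anything else that allocates memory on the same device between the two measurements could push the difference below zero.

```python
class FakeDevice:
    """Hypothetical stand-in for an accelerator; not a real vllm or torch-npu API."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.used_bytes = 0

    def mem_get_info(self):
        # Returns (free_bytes, total_bytes) for the whole device.
        return self.total_bytes - self.used_bytes, self.total_bytes

    def allocate(self, n):
        self.used_bytes += n

    def release(self, n):
        self.used_bytes -= n


def sleep_check(device, offloadable_bytes, concurrent_alloc_bytes=0):
    """Assumed shape of the check: compare device-wide free memory around the offload."""
    free_before, _ = device.mem_get_info()

    # Memory the sleep step is expected to give back to the device.
    device.release(offloadable_bytes)
    # Memory taken by some other consumer of the same device in the meantime.
    device.allocate(concurrent_alloc_bytes)

    free_after, _ = device.mem_get_info()
    freed_bytes = free_after - free_before
    assert freed_bytes >= 0, "Memory usage increased after sleeping."
    return freed_bytes


if __name__ == "__main__":
    quiet = FakeDevice(total_bytes=100)
    quiet.allocate(40)
    print(sleep_check(quiet, offloadable_bytes=30))  # nothing else allocates: passes

    busy = FakeDevice(total_bytes=100)
    busy.allocate(40)
    try:
        # The concurrent allocation (25) outweighs what sleep released (10).
        sleep_check(busy, offloadable_bytes=10, concurrent_alloc_bytes=25)
    except AssertionError as err:
        print("AssertionError:", err)
```

If the real check works this way, comparing device memory usage just before and just after the sleep call at the failing step may show whether another process or allocator on the same NPU is growing during that window.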

shenofusc, Oct 25 '25 03:10

The vLLM community's answer: https://discuss.vllm.ai/t/what-situations-could-cause-the-memory-usage-increased-after-sleeping-problem/1769/2

shenofusc, Oct 27 '25 03:10

Hi, please provide your script.

1k77, Dec 03 '25 09:12