
[Bug] Make `free_cache_engine` option workable in latest vLLM/SGLang

Open HollowMan6 opened this issue 9 months ago • 10 comments

Checklist Before Starting

  • [X] Search for similar PR(s).

What does this PR do?

Make free_cache_engine option workable in latest vLLM/SGLang

High-Level Design

It looks like the `actor_rollout_ref.rollout.free_cache_engine` control option only takes effect for vLLM 0.5.4 and 0.6.3. With the latest versions, the sleep / wake-up mode in the vLLM engine, as well as release / resume memory occupation in SGLang, is enabled by default and there is no way to turn them off.

While always allowing the inference engine to free its cache can be ideal, it is unfortunately not supported on some devices, such as the AMD MI250X, which does not support virtual memory management: https://github.com/vllm-project/vllm/pull/12695#issuecomment-2633919751

So we would need to be able to turn it off so that verl can run on those devices.

In addition, it looks like we no longer need to enforce eager mode in the latest vLLM when we choose to free the cache, so this PR also lifts that restriction.
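
As a concrete illustration of what the option should control, below is a minimal sketch of the gating, written against vLLM's public sleep-mode API (`enable_sleep_mode`, `sleep`, `wake_up`); the surrounding structure (the `config` object and the `training_step` callback) is purely illustrative and does not mirror verl's actual rollout worker code.

```python
# Illustrative sketch only: gate vLLM's sleep/wake-up calls on free_cache_engine.
# `config`, `prompts` and `training_step` are placeholders, not verl objects.
from vllm import LLM

def rollout_then_train(config, prompts, training_step):
    llm = LLM(
        model=config.model_path,
        enable_sleep_mode=config.free_cache_engine,  # sleep mode needs VMM support on the device
        enforce_eager=config.enforce_eager,          # no longer forced on when freeing the cache
    )
    outputs = llm.generate(prompts)

    # Only release the KV cache / weights when the option is on; on devices without
    # virtual memory management (e.g. MI250X) keep the engine resident instead.
    if config.free_cache_engine:
        llm.sleep(level=1)
    training_step(outputs)
    if config.free_cache_engine:
        llm.wake_up()
    return outputs
```

On the verl side the switch is the `actor_rollout_ref.rollout.free_cache_engine` key mentioned above, with the same gating applied to SGLang's release / resume memory occupation calls.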

Additional Info.

  • Issue Number: None
  • Training: both FSDP and Megatron
  • Inference: both vLLM and SGLang

Checklist Before Submitting

  • [X] Read the Contribute Guide.
  • [X] Apply pre-commit checks.
  • [X] Add [BREAKING] to the PR title if it breaks any API.
  • [X] Update the documentation about your changes in the docs.
  • [X] Add CI test(s) if necessary.

HollowMan6 avatar May 09 '25 11:05 HollowMan6

We are going to review the SGLang part in the next few days.

zhaochenyang20 avatar May 09 '25 16:05 zhaochenyang20

@yushengsu-thu Yusheng, could you help with the AMD part? Releasing the KV cache and offloading work slowly on ROCm.

zhaochenyang20 avatar May 09 '25 16:05 zhaochenyang20

Fixed https://github.com/volcengine/verl/actions/runs/14928202581/job/41953954187. https://github.com/volcengine/verl/actions/runs/14928202571/job/41953954048 should not be caused by this PR. https://github.com/volcengine/verl/actions/runs/14943863586/job/41997145096?pr=1464 and https://github.com/volcengine/verl/actions/runs/14928202608/job/41953954881 are also fixed.

HollowMan6 avatar May 09 '25 16:05 HollowMan6

@zhaochenyang20 , actually @YangWang92 and I are working on this virtual memory management issue. vLLM part: https://github.com/vllm-project/vllm/pull/12695#issuecomment-2633919751

SGLang part: https://github.com/sgl-project/sglang/issues/5093#issuecomment-2785084577 https://github.com/fzyzcjy/torch_memory_saver/issues/9#issuecomment-2850012653 Once we resolve the issues above, this problem will be addressed. Stay tuned.

But I believe the feature @HollowMan6 proposed is a general and useful one, as the issue of memory not being properly released can occur when certain kernels lack proper support (I've seen others encounter similar problems on different GPUs as well), so we can add it.

yushengsu-thu avatar May 09 '25 16:05 yushengsu-thu

Btw, that said, this feature seems more like a general one, as the issue of memory not being properly released can occur when certain kernels lack proper support (I've seen others encounter similar problems on different GPUs as well), so we can add it.

@yushengsu-thu Yes, and not only that: I think we will also need this PR to make verl support older AMD GPUs, such as the MI250X I'm using, since they unfortunately don't support the virtual memory management APIs at the hardware level. https://github.com/ROCm/hip-tests/blob/84a460d96bafb00615969304cc5eaddc3b20bc3d/catch/unit/memory/hip_vmm_common.hh#L27-L37
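
For reference, whether a device exposes this capability can be probed as a driver attribute. The sketch below assumes the NVIDIA `cuda-python` bindings; on ROCm the analogous query is the virtual memory management device attribute checked in the header linked above (hipDeviceAttributeVirtualMemoryManagementSupported, if I recall the name correctly).

```python
# Rough probe (assumes the `cuda-python` package) for the driver attribute that the
# sleep-mode allocator relies on; on MI250X-class GPUs the HIP equivalent reports
# unsupported, which is why sleep mode cannot work there.
from cuda import cuda

(err,) = cuda.cuInit(0)
err, dev = cuda.cuDeviceGet(0)
err, vmm_supported = cuda.cuDeviceGetAttribute(
    cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED,
    dev,
)
print("virtual memory management supported:", bool(vmm_supported))
```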

I was the original author of https://github.com/vllm-project/vllm/pull/12695, and I've been following your progress since I found out about the issue with supporting sleep mode on the MI250X. It will be hard to support unless we stop relying on the virtual memory management APIs or find an alternative.

HollowMan6 avatar May 09 '25 17:05 HollowMan6

@HollowMan6, I would like to follow up on this discussion. You mentioned finding an alternative to replace the virtual memory management APIs. Have you tried a different API in ROCm, or do you have other designs in mind to mitigate this issue?

yushengsu-thu avatar May 09 '25 19:05 yushengsu-thu

@yushengsu-thu To the best of my knowledge, I'm not aware of other ROCm APIs that achieve the same feature, but this is definitely worth considering, and I would assume you can check with your AMD colleagues; they might have something in mind.

For other designs: when I initially opened the feature request for sleep mode on the vLLM side (https://github.com/vllm-project/vllm/issues/10714), I sent a temporary PR with my solution for destroying all KV cache during runtime: https://github.com/vllm-project/vllm/pull/10810. That approach is device-agnostic, as it doesn't use any low-level API and relies entirely on garbage collection, but since it has to make sure all the reference counters are cleared, it's not very stable or fully transparent to the application, and might be hard to maintain in the long run. So I guess that's why they chose to come up with a more elegant solution, https://github.com/vllm-project/vllm/pull/11743, which involves the low-level virtual memory management API.
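
For the curious, here is a minimal sketch of that garbage-collection-based idea (not the actual code from vllm-project/vllm#10810; the `gpu_cache` attribute name is hypothetical):

```python
# Device-agnostic but fragile: drop every reference to the KV cache tensors and let
# the allocator reclaim them. Any lingering reference elsewhere keeps the memory
# alive, which is why this is hard to make robust and transparent to the application.
import gc
import torch

def destroy_kv_cache(cache_engine):
    cache_engine.gpu_cache = None  # hypothetical attribute holding the cache tensors
    gc.collect()                   # drop the Python-side references
    torch.cuda.empty_cache()       # hand freed blocks back from PyTorch's caching allocator
```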

HollowMan6 avatar May 09 '25 19:05 HollowMan6

@HollowMan6, yes, we are working with AMD and MSR colleagues to solve this issue. I think your previous solution might also be a good idea, and it's better to have a device-agnostic solution. @zhaochenyang20, should we pull Tom in to brainstorm this (a device-agnostic alternative to virtual memory management) on the SGLang side for better RL training?

yushengsu-thu avatar May 09 '25 20:05 yushengsu-thu

LGTM

I am running the CI. We can merge it after your approval. Thanks!

zhaochenyang20 avatar May 11 '25 19:05 zhaochenyang20

The latest CI failure https://github.com/volcengine/verl/actions/runs/14958501396/job/42018309219?pr=1464 doesn't seem to be specific to this PR; maybe it's a good idea to rerun it.

```
(WorkerDict pid=15082) [2025-05-11 19:51:54 TP0] Scheduler hit an exception: Traceback (most recent call last):
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 2014, in run_scheduler_process
(WorkerDict pid=15082)     scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 258, in __init__
(WorkerDict pid=15082)     self.tp_worker = TpWorkerClass(
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
(WorkerDict pid=15082)     self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 75, in __init__
(WorkerDict pid=15082)     self.model_runner = ModelRunner(
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 177, in __init__
(WorkerDict pid=15082)     min_per_gpu_memory = self.init_torch_distributed()
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 360, in init_torch_distributed
(WorkerDict pid=15082)     init_distributed_environment(
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 1052, in init_distributed_environment
(WorkerDict pid=15082)     torch.distributed.init_process_group(
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(WorkerDict pid=15082)     return func(*args, **kwargs)
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
(WorkerDict pid=15082)     func_return = func(*args, **kwargs)
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1714, in init_process_group
(WorkerDict pid=15082)     store, rank, world_size = next(rendezvous_iterator)
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 226, in _tcp_rendezvous_handler
(WorkerDict pid=15082)     store = _create_c10d_store(
(WorkerDict pid=15082)   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 194, in _create_c10d_store
(WorkerDict pid=15082)     return TCPStore(
(WorkerDict pid=15082) RuntimeError: The server socket has failed to listen on any local network address. port: 30856, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
```

HollowMan6 avatar May 11 '25 20:05 HollowMan6

@zhaochenyang20 @yushengsu-thu @eric-haibin-lin Any plans to merge this soon? New conflicts keep arising, and I'd like to get this merged quickly so I don't have to keep resolving them.

HollowMan6 avatar Jun 04 '25 01:06 HollowMan6

@eric-haibin-lin @zhaochenyang20: it looks good to me. Since the current version uses a hybrid-engine design, some issues have arisen that were not present in the disaggregated design; releasing and resuming memory at the kernel level is one of them (the AMD side also has this issue, but we have already solved it and will release the updated Docker image soon). I'm aware that GPUs from other vendors face similar problems if their memory allocation implementation hasn't been adapted (see https://github.com/fzyzcjy/torch_memory_saver/issues/9#issuecomment-2910919609), so they might also encounter this issue. Therefore, merging this feature could enable more users (those not on NVIDIA GPUs) to train RL models on their setups; it just requires more memory during training.

yushengsu-thu avatar Jun 04 '25 18:06 yushengsu-thu

Merged for now. Thanks for the contribution. @HollowMan6, in the next PR would you mind adding a unit test that sets free_cache_engine=False and runs generate_sequence, just to make sure this option will not be broken again in the future? Testing with the latest SGLang/vLLM should be sufficient. Thanks!
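
Something along these lines could serve as that regression test; this is only a sketch, and the helper names (`build_rollout_worker`, `dummy_prompt_batch`) are made up, since the real test would reuse verl's existing test utilities.

```python
# Sketch only: build_rollout_worker and dummy_prompt_batch are hypothetical helpers
# standing in for verl's real test fixtures.
def test_generate_sequences_with_free_cache_engine_disabled():
    config = {"free_cache_engine": False, "enforce_eager": False}
    worker = build_rollout_worker(config)       # hypothetical: builds an SGLang/vLLM rollout worker
    prompts = dummy_prompt_batch(batch_size=2)  # hypothetical: a tiny prompt batch

    # Two consecutive generations: the second one exercises the path where the cache
    # engine was never freed in between, which is exactly what this option disables.
    first = worker.generate_sequences(prompts)   # the generate_sequence call referred to above
    second = worker.generate_sequences(prompts)
    assert first is not None and second is not None
```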

eric-haibin-lin avatar Jun 29 '25 15:06 eric-haibin-lin

This seems to break when setting free_cache_engine to False, at least for the async vLLM rollout -- the first rollout was good, but after the actor update, the second rollout would be corrupted (generating random tokens). Setting free_cache_engine to True is fine.

Cysu avatar Jul 09 '25 15:07 Cysu

One possible reason is that some code forgets to check free_cache_engine before sleep calls, but if I remember correctly, vLLM should raise errors if it's in sleep mode and receives new requests. Feel free to submit a PR if you spot the culprit.

HollowMan6 avatar Jul 09 '25 19:07 HollowMan6