add SGLang as rollout engine to verl
#22. WIP, will add more details tomorrow :)
@ocss884 Also, this should be rebased with main. Thanks!
Hi, thanks for the work. I have a problem in no-peer-access environments.
(WorkerDict pid=2890081) [2025-03-13 15:00:53 TP0] Scheduler hit an exception: Traceback (most recent call last):
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 2264, in run_scheduler_process
(WorkerDict pid=2890081) scheduler.event_loop_overlap()
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(WorkerDict pid=2890081) return func(*args, **kwargs)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 502, in event_loop_overlap
(WorkerDict pid=2890081) self.process_input_requests(recv_reqs)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 595, in process_input_requests
(WorkerDict pid=2890081) output = self._request_dispatcher(recv_req)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/utils.py", line 444, in __call__
(WorkerDict pid=2890081) return fn(obj)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 2059, in update_weights_from_tensor
(WorkerDict pid=2890081) success, message = self.tp_worker.update_weights_from_tensor(recv_req)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 229, in update_weights_from_tensor
(WorkerDict pid=2890081) success, message = self.worker.update_weights_from_tensor(recv_req)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 215, in update_weights_from_tensor
(WorkerDict pid=2890081) success, message = self.model_runner.update_weights_from_tensor(
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 567, in update_weights_from_tensor
(WorkerDict pid=2890081) named_tensors = [
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 568, in <listcomp>
(WorkerDict pid=2890081) (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1021, in _unwrap_tensor
(WorkerDict pid=2890081) return tensor.get(tp_rank)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1033, in get
(WorkerDict pid=2890081) return MultiprocessingSerializer.deserialize(self.values[rank])
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/sglang/srt/utils.py", line 1290, in deserialize
(WorkerDict pid=2890081) return ForkingPickler.loads(data)
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
(WorkerDict pid=2890081) storage = storage_cls._new_shared_cuda(
(WorkerDict pid=2890081) File "/home/yyx/miniconda3/envs/verl-sglang/lib/python3.10/site-packages/torch/storage.py", line 1434, in _new_shared_cuda
(WorkerDict pid=2890081) return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
(WorkerDict pid=2890081) RuntimeError: CUDA error: peer access is not supported between these two devices
I did some tests, and I guess the problem is something like this.
When a tensor is serialized by a ray process launched by verl, the GPU id is allocated by the resource pool and is enumerated from 0 (btw, I set all parallel sizes to 1, so each process only sees one GPU, with id 0). When sglang launches its own scheduler process, it has access to all GPUs with their global ids (I guess this is because sglang has its own dp management?).
In the current setting, the above process should not trigger any peer communication; each rollout weight should be updated from the actor weight on the same device. However, when a tensor is serialized in the local environment and deserialized in the global environment, the mismatch happens.
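To make the suspected mismatch concrete, here is a minimal two-process sketch (my own illustration, not code from this PR, and it assumes a machine with at least two GPUs); on a no-peer-access machine I would expect the consumer side to hit the same peer-access error:
import os
import time

import torch
import torch.multiprocessing as mp


def producer(q):
    # Restrict this process to one physical GPU *before* CUDA is initialized,
    # mimicking ray's device isolation: torch then enumerates it as cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    t = torch.randn(4, device="cuda:0")
    q.put(t)  # serialized as a CUDA IPC handle recorded against device index 0
    time.sleep(10)  # keep the tensor alive until the consumer has mapped it


if __name__ == "__main__":
    # The parent sees all GPUs, so its cuda:0 is a different physical device
    # than the producer's cuda:0; rebuilding the IPC handle crosses devices.
    mp.set_start_method("spawn")
    q = mp.Queue()
    p = mp.Process(target=producer, args=(q,))
    p.start()
    t = q.get()
    print(t.device, t.sum())
    p.join()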
For a temporary fix, I can move the tensors to the CPU first, as this step should not be the bottleneck (and it works). However, when torch-memory-saver is enabled, many invalid-memory-access errors happen. So I guess there should be a better solution.
P.S. I also wonder why verl should manage dp itself; shouldn't this part be left to the inference engine?
cc @fzyzcjy could you take a look? thanks!
@Dutch-voyage Hi, it seems this PR overrides the visible devices, so each process should see all GPUs instead of a single GPU. It would therefore be great if more information could be shared, e.g. are you directly using the latest code in this PR? (Too tired now, so maybe my words are super silly!)
@PeterSH6 I rebased with main and everything looks good on my side. The docs are updated. Ready to roll!
@fzyzcjy Thanks for the explanation, I guess that is the root of the issue. I disabled the overrides of CUDA_VISIBLE_DEVICES, and the error is resolved.
I think the problem is with ray's device isolation: when each ActorRolloutRef process is launched, the actor module is bound to its isolated device id (starting from 0), so simply exposing the CUDA devices to the rollout engine (sglang here) is not enough. We need consistency here.
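One way to get that consistency (just a sketch of the idea, not code from this PR; the helper name is mine) would be to translate ray's isolated device index back into the machine-global index before handing it to the rollout engine:
import os


def local_to_global_gpu_id(local_id: int) -> int:
    # Under ray's isolation, CUDA_VISIBLE_DEVICES lists the physical GPUs this
    # process may see, so local index i corresponds to the i-th entry.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        return local_id  # no isolation: local and global indices coincide
    return int(visible.split(",")[local_id])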
Another issue: after disabling the CUDA_VISIBLE_DEVICES overrides, it hangs when torch-memory-saver is enabled (it works fine without it, so I guess the two are related).
Fantastic!
@Dutch-voyage IIRC ocss884 mentioned that Verl (e.g. its FSDP) will have some issues if ray's device isolation is fully disabled, so I am not sure whether that direction is OK.
I have created an issue in PyTorch, https://github.com/pytorch/pytorch/issues/149196, for the "multiprocess + CUDA_VISIBLE_DEVICES" issue. Since I do not have a no-peer-access environment, could you please check whether that script, besides giving a buggy device id, also reproduces the "CUDA error: peer access is not supported between these two devices" error on your device?
As for the error in the "move-to-CPU + memory-saver" case, could you please share a bit more information (e.g. the code)? Then I can have a brief look. Quick guess: maybe try .to(current gpu) for the CPU tensor and see whether the error disappears; IIRC wrong devices can cause errors.
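For reference, this is roughly what I mean by .to(current gpu) (a minimal sketch; the helper name is my own):
import torch


def to_current_gpu(t: torch.Tensor) -> torch.Tensor:
    # torch.cuda.current_device() is the device index as seen by *this* process,
    # which may differ from the machine-global index under device isolation.
    return t.to(f"cuda:{torch.cuda.current_device()}")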
@Dutch-voyage In addition, maybe try patching your local PyTorch source with https://github.com/pytorch/pytorch/pull/149248 to check whether that solves it in the no-peer-access env.
@fzyzcjy
1. Disabling Overrides and this PR https://github.com/pytorch/pytorch/pull/149248
This is what I meant by disabling the overrides of CUDA_VISIBLE_DEVICES. I do not actually alter ray's device isolation, and this current fix no longer moves tensors to the CPU. In verl/workers/rollout/sglang_rollout.py:
class SGLangRollout(BaseRollout):
    def __init__(
        ...
    ):
        super().__init__()
        self.config = config
        # disable overrides of CUDA_VISIBLE_DEVICES
        # del os.environ["CUDA_VISIBLE_DEVICES"]
        # if os.environ["ENSURE_CUDA_VISIBLE_DEVICES"]:
        #     os.environ["CUDA_VISIBLE_DEVICES"] = os.environ[
        #         "ENSURE_CUDA_VISIBLE_DEVICES"
        #     ]
and
self.inference_engine = VerlEngine(
    model_path=actor_module,
    dtype=config.dtype,
    mem_fraction_static=config.gpu_memory_utilization,
    device_mesh_cpu=device_mesh_cpu["tp"],
    base_gpu_id=0,  # changed from "base_gpu_id=src_rank"
    gpu_id_step=1,
    # log_level="INFO",
    # log_requests=True,
    # log_requests_level=2,
    max_running_requests=1,
)
I also tried this PR; it works fine (no additional changes from the original code, env vars unchanged, tensors not moved to CPU) when torch-memory-saver is off. I am a bit curious, though, whether this is really a bug in torch.mp; only storing the device info from the local env also seems reasonable to me.
2. Both methods + torch-memory-saver
For both methods, disabling the overrides and the torch.mp patch, the same error is thrown when torch-memory-saver is on.
(WorkerDict pid=559716) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=559716) what(): CUDA error: an illegal memory access was encountered
(WorkerDict pid=559716) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=559716) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=559716) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=559716)
(WorkerDict pid=559716) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=559716) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f69fd757446 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=559716) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f69fd7016e4 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=559716) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f69fd843a18 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=559716) frame #3: <unknown function> + 0x28d2c (0x7f69fd813d2c in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=559716) frame #4: <unknown function> + 0x29011 (0x7f69fd814011 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
(WorkerDict pid=559716) frame #5: <unknown function> + 0x81af88 (0x7f69fc3baf88 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
(WorkerDict pid=559716) frame #6: <unknown function> + 0x5fa2d8 (0x7f69fc19a2d8 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
(WorkerDict pid=559716) frame #7: <unknown function> + 0x6f66d (0x7f69fd73866d in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=559716) frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f69fd73137b in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=559716) frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f69fd731529 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libc10.so)
(WorkerDict pid=559716) frame #10: <unknown function> + 0x8c1a98 (0x7f69fc461a98 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
(WorkerDict pid=559716) frame #11: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f69fc461de6 in /home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
(WorkerDict pid=559716) frame #12: <unknown function> + 0x1306ba (0x5578c75c76ba in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #13: <unknown function> + 0x124583 (0x5578c75bb583 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #14: <unknown function> + 0x13d697 (0x5578c75d4697 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #15: <unknown function> + 0x150525 (0x5578c75e7525 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #16: _PyEval_EvalFrameDefault + 0x13ca (0x5578c75cc8fa in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #17: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #18: _PyEval_EvalFrameDefault + 0x72c (0x5578c75cbc5c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #19: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #20: _PyEval_EvalFrameDefault + 0x72c (0x5578c75cbc5c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #21: <unknown function> + 0x1504f2 (0x5578c75e74f2 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #22: _PyEval_EvalFrameDefault + 0x320 (0x5578c75cb850 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #23: _PyObject_FastCallDictTstate + 0xd0 (0x5578c75d3f50 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #24: _PyObject_Call_Prepend + 0x69 (0x5578c75e5ba9 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #25: <unknown function> + 0x2114c9 (0x5578c76a84c9 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #26: _PyObject_MakeTpCall + 0x26b (0x5578c75d4a6b in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #27: _PyEval_EvalFrameDefault + 0x54a6 (0x5578c75d09d6 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #28: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #29: _PyEval_EvalFrameDefault + 0x72c (0x5578c75cbc5c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #30: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #31: _PyEval_EvalFrameDefault + 0x2d80 (0x5578c75ce2b0 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #32: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #33: _PyEval_EvalFrameDefault + 0x72c (0x5578c75cbc5c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #34: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #35: _PyEval_EvalFrameDefault + 0x2d80 (0x5578c75ce2b0 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #36: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #37: _PyEval_EvalFrameDefault + 0x72c (0x5578c75cbc5c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #38: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #39: _PyEval_EvalFrameDefault + 0x72c (0x5578c75cbc5c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #40: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #41: _PyEval_EvalFrameDefault + 0x320 (0x5578c75cb850 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #42: _PyFunction_Vectorcall + 0x6c (0x5578c75db99c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #43: _PyEval_EvalFrameDefault + 0x13ca (0x5578c75cc8fa in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #44: <unknown function> + 0x1d7f90 (0x5578c766ef90 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #45: PyEval_EvalCode + 0x87 (0x5578c766eed7 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #46: <unknown function> + 0x20842a (0x5578c769f42a in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #47: <unknown function> + 0x203833 (0x5578c769a833 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #48: PyRun_StringFlags + 0x7d (0x5578c7692c3d in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #49: PyRun_SimpleStringFlags + 0x3c (0x5578c7692a7c in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #50: Py_RunMain + 0x26b (0x5578c769198b in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #51: Py_BytesMain + 0x37 (0x5578c7662527 in sglang::scheduler_TP0)
(WorkerDict pid=559716) frame #52: <unknown function> + 0x29d90 (0x7f6a07b92d90 in /lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=559716) frame #53: __libc_start_main + 0x80 (0x7f6a07b92e40 in /lib/x86_64-linux-gnu/libc.so.6)
(WorkerDict pid=559716) frame #54: <unknown function> + 0x1cb421 (0x5578c7662421 in sglang::scheduler_TP0)
(WorkerDict pid=559716)
(WorkerDict pid=559716) Fatal Python error: Aborted
(WorkerDict pid=559716)
(WorkerDict pid=559716) Thread 0x00007f68777ce640 (most recent call first):
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1400 in watchdog_thread
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 953 in run
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 973 in _bootstrap
(WorkerDict pid=559716)
(WorkerDict pid=559716) Thread 0x00007f6876fcd640 (most recent call first):
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 320 in wait
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/queue.py", line 171 in get
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 124 in forward_thread_func_
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 112 in forward_thread_func
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 953 in run
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 973 in _bootstrap
(WorkerDict pid=559716)
(WorkerDict pid=559716) Thread 0x00007f68767cc640 (most recent call first):
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 324 in wait
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 607 in wait
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 973 in _bootstrap
(WorkerDict pid=559716)
(WorkerDict pid=559716) Thread 0x00007f6875fcb640 (most recent call first):
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 324 in wait
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 607 in wait
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/threading.py", line 973 in _bootstrap
(WorkerDict pid=559716)
(WorkerDict pid=559716) Current thread 0x00007f6a05d4d740 (most recent call first):
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 215 in update_weights_from_tensor
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 232 in update_weights_from_tensor
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1549 in update_weights_from_tensor
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/utils.py", line 440 in __call__
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 601 in process_input_requests
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 508 in event_loop_overlap
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1757 in run_scheduler_process
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/multiprocessing/process.py", line 108 in run
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
(WorkerDict pid=559716) File "/home/yyx/miniconda3/envs/verl-sgl-test/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
(WorkerDict pid=559716) File "<string>", line 1 in <module>
3. Test script
As for the test script, this is what I wrote, kept as concise as possible:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["ENSURE_CUDA_VISIBLE_DEVICES"] = "0,1"
# Initialize ray
if not ray.is_initialized():
ray.init()
# Set up model path - replace with appropriate test model
model_path = os.environ.get("TEST_MODEL_PATH", "/home/yyx/models/Qwen2.5-0.5B")
# Create config
# I set all parallel size = 1 by default
config = OmegaConf.load("verl/trainer/config/ppo_trainer.yaml")
config = config.actor_rollout_ref
config.model.path = model_path
config.rollout.name = "sglang"
# Create a resource pool directly
resource_pool = RayResourcePool(
process_on_nodes=[2], # 2 GPUs on 1 node
use_gpu=True,
max_colocate_count=1,
name_prefix="test_pool"
)
# I followed the logic of main_ppo.py and ray_trainer.py faithfully here
actor_rollout_cls = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker),
config=config,
role='actor_rollout')
all_wg = {}
class_dict = {"actor_rollout": actor_rollout_cls}
worker_dict_cls = create_colocated_worker_cls(class_dict=class_dict)
wg_dict = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
spawn_wg = wg_dict.spawn(prefix_set=class_dict.keys())
all_wg.update(spawn_wg)
worker = all_wg['actor_rollout']
worker.init_model()
# ...
# omit data preprocess here, I get a DataProto from raw prompts here
outputs = workers.generate_sequences(prompts)
Personal suggestions
I am quite new to these multiprocessing errors, so I am not sure whether a simple fix exists. But I would personally recommend that verl let the inference engine manage data parallelism (especially considering sglang has a dp router design to improve the prefix cache hit rate). That would require a big refactor, though.
@fzyzcjy BTW, the peer-access error appears regardless of which devices are used, so I guess it is not an issue with a specific device.
@Dutch-voyage Hi, since I cannot reproduce errors using 8xH100, could you please run the following reproduction commands and paste outputs:
Create a new container with image lmsysorg/sglang:dev, and execute the following:
# Inside the `lmsysorg/sglang:dev` container
cd /root
# Download
git clone -b dev_sglang https://github.com/ocss884/verl
git clone -b feat/patch_torch https://github.com/fzyzcjy/sglang
# Patch verl
# paste the patch below to the diff file first
(cd verl && git apply ../patch_to_verl.diff)
# install
python3 -m pip install --upgrade pip && python3 -m pip install --upgrade uv
(cd verl && python3 -m uv pip install -e .)
(cd sglang/python && python3 -m uv pip install -e '.[all]' --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python)
python3 -m uv pip install flash-attn --no-build-isolation --no-deps
python3 -m uv pip install torch_memory_saver
# prepare data
(cd verl && python3 examples/data_preprocess/gsm8k.py)
# Check versions and paste outputs
(cd verl && git rev-parse HEAD)
(cd sglang && git rev-parse HEAD)
# Run and paste outputs
(cd verl && TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 NCCL_NVLS_ENABLE=0 python3 -m verl.trainer.main_ppo \
actor_rollout_ref.rollout.name=sglang \
data.train_files=/root/data/gsm8k/train.parquet \
data.val_files=/root/data/gsm8k/test.parquet \
data.train_batch_size=64 \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=1 \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.rollout.free_cache_engine=True \
actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
+actor_rollout_ref.rollout.sampling_params.max_new_tokens=100 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
critic.model.enable_gradient_checkpointing=True \
critic.ppo_micro_batch_size=16 \
critic.model.fsdp_config.param_offload=True \
critic.model.fsdp_config.optimizer_offload=True \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger="['console']" \
+trainer.val_before_train=True \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=2 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=10 \
trainer.total_epochs=1)
Patch to verl:
diff --git a/verl/workers/rollout/sglang_rollout/sglang_rollout.py b/verl/workers/rollout/sglang_rollout/sglang_rollout.py
index efd0264..c9e73d2 100644
--- a/verl/workers/rollout/sglang_rollout/sglang_rollout.py
+++ b/verl/workers/rollout/sglang_rollout/sglang_rollout.py
@@ -137,6 +137,7 @@ class SGLangRollout(BaseRollout):
device_mesh_cpu=device_mesh_cpu["tp"],
base_gpu_id=src_rank,
gpu_id_step=1,
+ enable_memory_saver=True,
# NOTE(Chenyang): if you want to debug the sglang engine
# please set the following parameters
# Otherwise, it will make the engine run too slow
diff --git a/verl/workers/sharding_manager/fsdp_sglang.py b/verl/workers/sharding_manager/fsdp_sglang.py
index 8b760b0..a45ad67 100644
--- a/verl/workers/sharding_manager/fsdp_sglang.py
+++ b/verl/workers/sharding_manager/fsdp_sglang.py
@@ -67,6 +67,11 @@ class FSDPSGLangShardingManager(BaseShardingManager):
self.gen_random_states = None
def __enter__(self):
+ self.inference_engine.resume_memory_occupation()
+ print('after resume, sleep for 5 second to check nvidia-smi...')
+ import time
+ time.sleep(5)
+
log_gpu_memory_usage('Before state_dict() in sharding manager memory', logger=logger)
params = self.module.state_dict()
log_gpu_memory_usage('After state_dict() in sharding manager memory', logger=logger)
@@ -93,7 +98,10 @@ class FSDPSGLangShardingManager(BaseShardingManager):
def __exit__(self, exc_type, exc_value, traceback):
log_gpu_memory_usage('Before SGLang offload in sharding manager', logger=logger)
- self.inference_engine.release_memory_occupation
+ self.inference_engine.release_memory_occupation()
+ print('after release, sleep for 5 second to check nvidia-smi...')
+ import time
+ time.sleep(5)
log_gpu_memory_usage('After SGLang offload in sharding manager', logger=logger)
# self.module.to('cuda')
Remark: NCCL_NVLS_ENABLE=0 may not be needed on your device
@fzyzcjy Thanks!! The script runs just fine in the container. I managed to reproduce and fix the problem in my environment; it turns out that torch-memory-saver was installed incorrectly from a stale pip cache. After reinstalling torch-memory-saver, it works fine too. Thanks again for your help! Really appreciate it.
@ocss884 @PeterSH6 As of now, people from the Qwen team are reporting issues on multi-node setups. We will merge single-node SGLang into veRL ASAP and work with @Qiaolin-Yu on multi-node training with Ray. cc Bao Rong at FDU
I'm very concerned about this PR, so if there's anything I can do to help, please don't hesitate to ask.
@vermouth1992 Hi, the dataset.yml test encounters:
../../../../.local/lib/python3.10/site-packages/torch/__init__.py:290: in <module>
from torch._C import * # noqa: F403
E ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
Torch seems unable to find the regular CUDA deps located at site-packages/nvidia/cudnn/lib/libcudnn.so.9. Could you help check whether any recent changes relate to this? It looks like another PR also encounters this error: https://github.com/volcengine/verl/actions/workflows/dataset.yml. Thanks!
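In case it helps, a quick check I would try (my own snippet, independent of torch) to see where the pip-installed cuDNN lives and what the loader search path is:
import importlib.util
import os

spec = importlib.util.find_spec("nvidia.cudnn")
print("nvidia.cudnn location:", list(spec.submodule_search_locations) if spec else None)
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))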
@PeterSH6 Could you please help on this?
@ocss884 this seems to be an environment problem. We can skip this. Will check this CI later.
It seems that the DTensor module was made public as torch.distributed.tensor in PyTorch 2.5; it was torch.distributed._tensor in PyTorch 2.4. I guess we need to add compatibility code for both versions.
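A minimal sketch of such a compatibility shim (assuming the module names above):
try:
    from torch.distributed.tensor import DTensor  # PyTorch >= 2.5 (public module)
except ImportError:
    from torch.distributed._tensor import DTensor  # PyTorch 2.4.x (private module)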
LGTM! Shall we merge then @PeterSH6 ?
Nice!!
Have you solved this problem (the no-peer-access issue reported above)?
@yuleiqin This has been solved long ago (see the discussion following the original report above). But if you find any regressions, feel free to post details.