Question about actor training-rollout resharding
Hi verl developers, I've recently been trying to understand the 3D-HybridEngine. Here are my questions:
- From the source code, I observed that it currently requires infer_tensor_parallel_size <= train_tensor_parallel_size, as suggested in megatron_vllm.py. In practice, we may want a larger TP size during inference, since it can allow greater batch sizes to reduce the total memory access for model parameters. Any plan to extend the 3D-HybridEngine to this case?
- In the corresponding paper, for the 70B model the 3D-HybridEngine effectively reduces the transition time from ~30s to ~5s. I'm curious about how this is achieved. In the vanilla version, the worst-case pure transmission time should be 70B * 2 / (400Gbps / 8) = 2.8s, so it seems all of the remaining 30 - 2.8 belongs to software overhead, such as the slow parameter update process (i.e., the refit problem) of the vLLM engine (https://github.com/vllm-project/vllm/issues/1897). Maybe the greatest benefit is from the adapted vLLM engine? (Quick sanity-check calculation below.)
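For reference, here is the back-of-the-envelope calculation behind the 2.8s number, under my own assumptions (bf16/fp16 weights, a single 400 Gbps link, no software overhead):

```python
# Back-of-the-envelope estimate of the pure transfer time for 70B parameters.
params = 70e9            # 70B parameters
bytes_per_param = 2      # bf16 / fp16
link_gbps = 400          # interconnect bandwidth in Gbit/s

payload_gb = params * bytes_per_param / 1e9   # 140 GB of weights
bandwidth_gbs = link_gbps / 8                 # 50 GB/s
print(f"pure transfer time ~ {payload_gb / bandwidth_gbs:.1f} s")  # ~2.8 s
```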
Happy new year and smooth coding!
Hi @0oshowero0, thanks for your question!
In practice, we may want a larger TP size during inference, since it can allow greater batch sizes to reduce the total memory access for model parameters.
What do you mean by greater batch sizes in the context of inference? In my opinion, inference with continuous batching may not have a fixed batch-size concept. We observe that using a smaller TP size in inference with more replicas achieves higher throughput. That's why we make this assumption.
We may indeed sometimes need infer_tensor_parallel_size > train_tensor_parallel_size when serving larger models. However, this requires splitting the model chunks rather than performing an all_gather, and such a solution may also be limited by the number of heads in the qkv_proj.
Therefore, we don't have a plan to support infer_tensor_parallel_size > train_tensor_parallel_size at the moment, but we welcome contributions.
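To illustrate the asymmetry, here is a toy single-process sketch (my own illustration, not the verl code) of why going to a smaller inference TP size is the easy direction:

```python
# Going down in TP size is just a concat of existing shards (an all_gather in
# practice); going up would require re-splitting tensors whose shard boundaries
# (e.g. fused qkv_proj heads) may not divide evenly.
import torch

train_tp, infer_tp = 4, 2
hidden = 8
# pretend each training TP rank holds a column shard of one weight matrix
train_shards = [torch.randn(hidden, hidden // train_tp) for _ in range(train_tp)]

# infer_tp <= train_tp: each inference rank concatenates a contiguous group of
# train_tp // infer_tp training shards
group = train_tp // infer_tp
infer_shards = [
    torch.cat(train_shards[r * group:(r + 1) * group], dim=1)
    for r in range(infer_tp)
]
assert infer_shards[0].shape == (hidden, hidden // infer_tp)

# infer_tp > train_tp would instead require slicing each training shard into
# smaller pieces, which must respect the head boundaries inside qkv_proj.
```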
We may focus on supporting pipeline parallelism in inference to support larger-scale models.
So it seems all the other 30-2.8 belongs to software overhead, such as the slow parameter update process (i.e., refit problem) of the vllm engine. Maybe the greatest benefit is from the adapted vLLM engine?
There are several factors for this time reduction:
- In the OpenRLHF design, the vLLM rollout is deployed on separate machines. Therefore, it requires gathering all the actor model parameters and sending them to the distributed rollout workers. This process (including a 2xM-sized communication and building the comm buffer) may incur large memory overhead and has to be done per parameter.
- In the HybridFlow design, we use a contiguous buffer, so the transfer can be done with a few one-time all_gathers in parallel. Moreover, the 3D-HybridEngine doesn't need to gather all the model weights during the sync; it only needs to collect part of the model parameters. This is beneficial for large-scale models whose full size may exceed GPU memory capacity. (A toy sketch of the buffer-packing idea follows this list.)
- Our adapted vLLM engine mainly targets the SPMD execution pattern. This helps us implement the 3D-HybridEngine design and can improve inference throughput thanks to the low overhead of SPMD.
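Here is a toy, single-process sketch of the buffer-packing point (my own illustration, not the verl implementation): packing many parameters into one flat buffer means one collective call instead of one call, and one temporary buffer, per parameter.

```python
import torch

# hypothetical parameter names, for illustration only
params = {
    "qkv_proj.weight": torch.randn(1024, 3 * 1024),
    "o_proj.weight": torch.randn(1024, 1024),
    "mlp.up.weight": torch.randn(1024, 4096),
}

# per-parameter path: len(params) separate communications / allocations
# packed path: a single flat buffer, moved with one all_gather / broadcast
flat = torch.cat([p.reshape(-1) for p in params.values()])

# on the receiving side, the buffer is sliced back into named tensors
views, offset = {}, 0
for name, p in params.items():
    views[name] = flat[offset:offset + p.numel()].view_as(p)
    offset += p.numel()

assert torch.equal(views["o_proj.weight"], params["o_proj.weight"])
```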
Happy New Year :)
@PeterSH6 Thank you for your detailed comment!
Here, the batch size means the upper limit of continuous batching. For example, in a 4-GPU setting we can serve 10 requests with continuous batching; but if we have 8 GPUs, we may extend the continuous batching to 20 requests, since we have more (total) HBM.
Either a bigger TP or a bigger PP size during actor rollout can be beneficial for raising the upper limit of continuous batching, and as you suggested, maybe increasing PP is the better choice :) (Rough capacity arithmetic below.)
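For what it's worth, here is the rough capacity arithmetic I have in mind; all numbers (80 GB HBM per GPU, 140 GB of weights, 4 GB of KV cache per request, 10% overhead) are illustrative assumptions, not measurements:

```python
def max_inflight_requests(num_gpus, hbm_per_gpu_gb=80, weights_gb=140,
                          kv_per_request_gb=4, overhead_frac=0.1):
    """Estimate how many requests fit in the KV cache once the sharded
    weights and a fixed overhead fraction are subtracted."""
    usable_gb = num_gpus * hbm_per_gpu_gb * (1 - overhead_frac) - weights_gb
    return int(max(usable_gb, 0) // kv_per_request_gb)

print(max_inflight_requests(4))   # smaller KV-cache budget
print(max_inflight_requests(8))   # roughly triple the headroom for batching
```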
We may focus on supporting pipeline parallelism in inference to support larger-scale models.
Closed due to lack of activity. Feel free to reopen or create a new issue in case of new questions.
Hi, I'm having an issue with the 3D-HybridEngine. I didn't find any code about it or any use of micro-dp. What should I do to run with the resharding strategy described in the paper?
I'm reading the resharding code. I find that resharding is completed in the following steps (a rough sketch of the collectives is below the list):
- broadcast in pp group
- allgather in ep group
- allgather in tp group
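Here is how I read those three steps as collectives, assuming torch.distributed is already initialized (e.g. launched with torchrun) and that handles for the pp/ep/tp process groups are available; the function name reshard_param and the concat dimensions are my own illustration, not the actual verl code:

```python
import torch
import torch.distributed as dist

def reshard_param(param: torch.Tensor, pp_src_rank: int,
                  pp_group, ep_group, tp_group) -> torch.Tensor:
    # 1) broadcast in the PP group: ranks that don't own this layer's
    #    parameters receive them from the owning pipeline stage
    dist.broadcast(param, src=pp_src_rank, group=pp_group)

    # 2) all_gather in the EP group: collect expert-parallel shards
    ep_shards = [torch.empty_like(param) for _ in range(dist.get_world_size(ep_group))]
    dist.all_gather(ep_shards, param, group=ep_group)
    param = torch.cat(ep_shards, dim=0)

    # 3) all_gather in the TP group: collect tensor-parallel shards and
    #    concatenate along whichever dim the layer is actually sharded on
    tp_shards = [torch.empty_like(param) for _ in range(dist.get_world_size(tp_group))]
    dist.all_gather(tp_shards, param, group=tp_group)
    return torch.cat(tp_shards, dim=-1)
```

Is that the right picture?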
Thanks