verl
Support for vLLM async pipeline parallelism
Feature request
In many cases, the limited memory of a single GPU means some models can only run in vLLM with pipeline parallelism, even at the cost of some training efficiency. It would be great to have official support for this.
Following #2851, I have successfully gotten this approach running, but the latest code has changed significantly.
Model: Qwen2.5-7B (num_attention_heads=28, vocab_size=152064). Only tp=4 can be used, not tp=28: vLLM requires vocab_size to be divisible by the tensor-parallel size, and 152064 is not divisible by 28.
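A quick sketch of the constraint described above (the helper name is hypothetical, not a vLLM API): a tensor-parallel size must divide both the attention head count and, because of vLLM's vocab-parallel embedding, the vocab size.

```python
def valid_tp_sizes(num_attention_heads: int, vocab_size: int, max_tp: int = 32):
    # Hypothetical helper: list TP sizes that divide both the head count
    # and the vocab size, mirroring vLLM's sharding requirements.
    return [tp for tp in range(1, max_tp + 1)
            if num_attention_heads % tp == 0 and vocab_size % tp == 0]

# Qwen2.5-7B: 28 heads, vocab_size 152064
print(valid_tp_sizes(28, 152064))  # → [1, 2, 4]
```

Since 152064 = 2^9 * 3^3 * 11 has no factor of 7, tp=7, 14, or 28 are ruled out even though they divide the head count, which is why pipeline parallelism is the only way to shard this model further.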
Motivation
Single-GPU memory is limited.
Your contribution
Following #2851, I have successfully gotten this approach running, but the latest code has changed significantly.