[DRAFT] vLLM integration
-- UPDATE 7/7/2024: after chatting with @lewtun, we'd like to see if vLLM is willing to support https://github.com/vllm-project/vllm/issues/6189 officially before merging this PR, as merging first may cause confusion for users.
This PR adds a vLLM backend for generation purposes. Preliminary testing shows generation is ~8x faster: given 80 minutes of training, the run with HF generation completed 2,650 episodes, whereas the run with vLLM generation completed 16k episodes.
Note that your mileage may vary with different hardware and generation lengths. For example, in TL;DR with 1B models, vLLM does not seem to provide much speed benefit, likely due to the short generation length.
Note that we have to use our custom vLLM build to achieve precise device placement (so that we can place the vLLM instance on the 8th GPU, leaving the others free for training). See vwxyzjn/vllm#1
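For readers wondering what "precise device placement" means in practice, here is a minimal sketch of the general idea: pin the generation engine to one GPU by masking the others via `CUDA_VISIBLE_DEVICES`. The helper name is hypothetical and this is not the mechanism from the custom build in vwxyzjn/vllm#1, just an illustration of the placement goal.

```python
import os

def device_placement_env(gpu_index: int) -> dict:
    """Return environment overrides that pin a subprocess (e.g. a
    generation engine) to a single GPU by hiding all the others.
    Hypothetical helper for illustration only."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

# Place the generation engine on the 8th GPU (index 7),
# leaving GPUs 0-6 for the training processes.
env = device_placement_env(7)
print(env["CUDA_VISIBLE_DEVICES"])  # -> 7
```

A launcher would pass `env` to the subprocess running the engine; the custom build instead lets the in-process vLLM instance be placed directly, which is why it is needed here.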
I'm really looking forward to this integration! Just out of curiosity, do you think using optimum or torch.compile as a generation backend is possible? @vwxyzjn
Yes, I think torch.compile would be an option, with the caveat that currently only a few model architectures are supported.
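To make the torch.compile option concrete, here is a minimal sketch of compiling a single greedy decoding step. It uses the `"eager"` backend so the example runs anywhere without codegen; a real generation backend would compile the model's forward pass instead, which is where the architecture-support caveat bites.

```python
import torch

@torch.compile(backend="eager")  # "eager" backend: no codegen, portable sketch
def greedy_step(logits: torch.Tensor) -> torch.Tensor:
    """Pick the highest-probability token id from a batch of logits."""
    return torch.argmax(logits, dim=-1)

logits = torch.tensor([[0.1, 2.0, 0.3]])
token = greedy_step(logits)
print(token.item())  # -> 1
```

With the default inductor backend, torch.compile can fuse and specialize the decode loop, but only models whose graphs compile cleanly see the speedup, hence the caveat above.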
Hi, are there any updates? Thanks!