Dong-Yong Lee
Same here; marking this to follow.
In my humble opinion, there might be a problem when loading the model checkpoint. https://github.com/vllm-project/vllm/blob/bbbf86565f2fb2bab0cf6675f9ebefcd449390bd/vllm/model_executor/models/llama.py#L336-L339 In this loop, each GPU device needs some CPU memory to load a checkpoint...
Indeed, after sharding my model's checkpoint into smaller pieces (see the sketch below), it works normally for me.
I know that there is no way to partially load a large checkpoint file at the code level. (To load a checkpoint file, memory of the same size as the checkpoint...
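For anyone hitting the same issue, here is a minimal sketch of the sharding workaround (file names and the shard size are only illustrative):

```python
import torch

def shard_checkpoint(src_path: str, out_prefix: str, tensors_per_shard: int = 64) -> None:
    # Load the full checkpoint once on CPU (this step still needs CPU memory
    # equal to the checkpoint size, as noted above), then write smaller shards.
    state_dict = torch.load(src_path, map_location="cpu")
    items = list(state_dict.items())
    for shard_id, start in enumerate(range(0, len(items), tensors_per_shard)):
        shard = dict(items[start:start + tensors_per_shard])
        torch.save(shard, f"{out_prefix}-{shard_id:05d}.bin")

# e.g. shard_checkpoint("pytorch_model.bin", "pytorch_model-shard")
```

The split itself still needs enough CPU memory to hold the full checkpoint once, but afterwards each GPU worker only has to keep one smaller shard in memory at a time.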
I agree with @dimitry12. Among other things, I believe that changing the model to accept embeds as input would be the smallest first step towards supporting multi-modality. The case for a multi-modal...
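To make the proposal concrete, here is a hypothetical usage sketch. The `prompt_embeds` keyword is the argument this PR proposes, not part of the released vLLM API, and the embedding lookup below simply reuses the model's own input-embedding table as a stand-in for a multi-modal encoder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# Build embeddings for a text prompt by reusing the model's input-embedding
# table; in a real multi-modal setup these would come from a separate encoder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(input_ids)[0]  # (seq_len, hidden)

llm = LLM(model="gpt2")
params = SamplingParams(temperature=0.0, max_tokens=16)
# Assumed call shape under this PR: pass embeddings instead of a text prompt.
outputs = llm.generate(prompt_embeds=[prompt_embeds], sampling_params=params)
print(outputs[0].outputs[0].text)
```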
We conducted several tests and confirmed that the performance degradation was not significant. In fact, we ran the benchmark 5 times each for the main branch and the feature branch using the...
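Roughly, each measurement follows the pattern below; this is a simplified stand-in for the actual benchmark script, and the model, prompt set, and request count are placeholders:

```python
import statistics
import time
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")                      # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Hello, my name is"] * 256        # placeholder request batch

runtimes = []
for _ in range(5):                           # 5 repetitions, as described above
    start = time.perf_counter()
    llm.generate(prompts, params)
    runtimes.append(time.perf_counter() - start)

print(f"mean={statistics.mean(runtimes):.2f}s stdev={statistics.stdev(runtimes):.2f}s")
```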
@WoosukKwon @zhuohan123 Hello authors, I have tested this PR and aligned it with the latest prepare_inputs function. Could you please review this PR?
I am facing the following problem: in some cases, the output of the sequence after some iterations is different. I have tested this with the gpt2 model. Output from embeds: ```python...
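# A minimal sketch of the kind of comparison involved, using the HuggingFace
# gpt2 model directly; the exact script whose output appears above is
# truncated, so the prompt here is only illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    # Path 1: feed token IDs directly.
    logits_ids = model(input_ids=input_ids).logits
    # Path 2: feed the same tokens as precomputed embeddings.
    embeds = model.get_input_embeddings()(input_ids)
    logits_embeds = model(inputs_embeds=embeds).logits

# The greedy next-token choices should match between the two paths.
print(torch.equal(logits_ids.argmax(-1), logits_embeds.argmax(-1)))
print((logits_ids - logits_embeds).abs().max())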
Hi @WoosukKwon, I've made some changes to the PR since you last saw it, so I'd like to ask you to review it again. - Updated the source code to accept a prompt_embeds argument in...
Hello @will-wiki, thank you for your interest in my work! I installed the latest vLLM release, v0.2.4, and ran the script you provided. As a result, the latest vLLM...