Dong-Yong Lee
Same here; marking this to follow.
In my humble opinion, there might be a problem when loading the model checkpoint. https://github.com/vllm-project/vllm/blob/bbbf86565f2fb2bab0cf6675f9ebefcd449390bd/vllm/model_executor/models/llama.py#L336-L339 In this loop, each GPU device needs some CPU memory to load a checkpoint...
Indeed, after sharding my model's checkpoint into smaller pieces (see the sketch below), it works normally for me.
I know that there is no way to partially load a large checkpoint file at the code level. (To load a checkpoint file, memory of the same size as the checkpoint...
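For anyone hitting the same issue, here is a minimal sketch of the sharding workaround (file names and the shard size are only illustrative):

```python
import torch

def shard_checkpoint(src_path: str, out_prefix: str, tensors_per_shard: int = 64) -> None:
    # Load the full checkpoint once on CPU (this step still needs CPU memory
    # equal to the checkpoint size, as noted above), then write smaller shards.
    state_dict = torch.load(src_path, map_location="cpu")
    items = list(state_dict.items())
    for shard_id, start in enumerate(range(0, len(items), tensors_per_shard)):
        shard = dict(items[start:start + tensors_per_shard])
        torch.save(shard, f"{out_prefix}-{shard_id:05d}.bin")

# e.g. shard_checkpoint("pytorch_model.bin", "pytorch_model-shard")
```

The split itself still needs enough CPU memory to hold the full checkpoint once, but afterwards each GPU worker only has to keep one smaller shard in memory at a time.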
I agree with @dimitry12. Among other things, I believe that changing the model to accept embeds as input would be the smallest first step towards supporting multi-modality. The case for a multi-modal...
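To make the proposal concrete, here is a hypothetical usage sketch. The `prompt_embeds` keyword is the argument this PR proposes, not part of the released vLLM API, and the embedding lookup below simply reuses the model's own input-embedding table as a stand-in for a multi-modal encoder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# Build embeddings for a text prompt by reusing the model's input-embedding
# table; in a real multi-modal setup these would come from a separate encoder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(input_ids)[0]  # (seq_len, hidden)

llm = LLM(model="gpt2")
params = SamplingParams(temperature=0.0, max_tokens=16)
# Assumed call shape under this PR: pass embeddings instead of a text prompt.
outputs = llm.generate(prompt_embeds=[prompt_embeds], sampling_params=params)
print(outputs[0].outputs[0].text)
```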
We conducted several tests and confirmed that the performance degradation was not significant. In fact, we ran the benchmark 5 times each for the main branch and the feature branch using the...
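Roughly, each measurement follows the pattern below; this is a simplified stand-in for the actual benchmark script, and the model, prompt set, and request count are placeholders:

```python
import statistics
import time
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")                      # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["Hello, my name is"] * 256        # placeholder request batch

runtimes = []
for _ in range(5):                           # 5 repetitions, as described above
    start = time.perf_counter()
    llm.generate(prompts, params)
    runtimes.append(time.perf_counter() - start)

print(f"mean={statistics.mean(runtimes):.2f}s stdev={statistics.stdev(runtimes):.2f}s")
```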
@WoosukKwon @zhuohan123 Hello authors, I have tested this PR and aligned it with the latest prepare_inputs function. Could you please review this PR?
I am facing the following problem: in some cases, the output of the sequence after some iterations is different. I have tested this with the gpt2 model. Output from embeds: ```python...
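# A minimal sketch of the kind of comparison involved, using the HuggingFace
# gpt2 model directly; the exact script whose output appears above is
# truncated, so the prompt here is only illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    # Path 1: feed token IDs directly.
    logits_ids = model(input_ids=input_ids).logits
    # Path 2: feed the same tokens as precomputed embeddings.
    embeds = model.get_input_embeddings()(input_ids)
    logits_embeds = model(inputs_embeds=embeds).logits

# The greedy next-token choices should match between the two paths.
print(torch.equal(logits_ids.argmax(-1), logits_embeds.argmax(-1)))
print((logits_ids - logits_embeds).abs().max())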
Hi @WoosukKwon, I've made some changes to the PR since you last saw it, so I'd like to ask you to review it again. - Updated the source code to accept a prompt_embeds argument in...
Hello @will-wiki, thank you for your interest in my work! I installed the latest vLLM release, v0.2.4, and ran the script you provided. As a result, the latest vLLM...