Dimensionzw
> > Hello
> > Will it be possible to include support for the Qwen2-VL model? Thank you
>
> It may be difficult right now because trtllm does not support M-ROPE ([NVIDIA/TensorRT-LLM#2183](https://github.com/NVIDIA/TensorRT-LLM/issues/2183)). I...
> The master branch has support. The Docker image is a beta version for now and will be updated in the future.

Thank you very much, but the image registry.cn-hangzhou.aliyuncs.com/opengrps/grps_gpu:grps1.1.0_cuda12.5_cudnn9.2_trtllm0.16.0_py3.12...
> demo start args.
>
> ### pd master
>
> python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b --run_mode "pd_master" --host `hostname -i` --port 60011
>
> ### prefill node
>
> nvidia-cuda-mps-control -d...
@fong-git In my tests, with tp=4 on both, lmdeploy is about 500 ms slower than vllm. Feature inference time is basically identical; the gap is in the "to cpu" step. vllm passes the GPU torch tensor straight into the downstream pipeline:

```python
def merge_multimodal_embeddings(input_ids: torch.Tensor,
                                inputs_embeds: torch.Tensor,
                                multimodal_embeddings: NestedTensors,
                                placeholder_token_id: int) -> torch.Tensor:
    """
    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
    positions in ``inputs_embeds`` corresponding to...
```
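To illustrate the point about staying on-device, here is a minimal sketch of such a merge. It assumes `multimodal_embeddings` is a single flat tensor whose rows line up with the placeholder positions (the real vllm function also handles nested tensors), and it uses plain boolean-mask assignment, so the whole operation runs on whatever device the tensors already live on, with no GPU → CPU copy:

```python
import torch


def merge_multimodal_embeddings_sketch(input_ids: torch.Tensor,
                                       inputs_embeds: torch.Tensor,
                                       multimodal_embeddings: torch.Tensor,
                                       placeholder_token_id: int) -> torch.Tensor:
    # Boolean mask of the token positions reserved for multimodal features.
    mask = input_ids == placeholder_token_id
    # Overwrite those rows in place. Everything stays on the same device
    # (CPU or GPU), which is the behavior the comment above contrasts with
    # an explicit "to cpu" round trip.
    inputs_embeds[mask] = multimodal_embeddings.to(inputs_embeds.dtype)
    return inputs_embeds


# Toy usage: 4 tokens with hidden size 2; token id 99 marks image slots.
ids = torch.tensor([1, 99, 99, 2])
embeds = torch.zeros(4, 2)
feats = torch.ones(2, 2)
out = merge_multimodal_embeddings_sketch(ids, embeds, feats, 99)
# Rows 1 and 2 now hold the image features; rows 0 and 3 are untouched.
```

The names and the flat-tensor assumption here are for illustration only; the actual vllm signature takes `NestedTensors` as shown in the quoted snippet.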