Dimensionzw
> > Hello
> > Will it be possible to include support for the Qwen2-VL model? Thank you
>
> It may be difficult right now because trtllm does not support M-ROPE ([NVIDIA/TensorRT-LLM#2183](https://github.com/NVIDIA/TensorRT-LLM/issues/2183)). I...
> The master branch has support. The Docker image is a beta version for now and will be updated in the future.

Thank you very much, but the image registry.cn-hangzhou.aliyuncs.com/opengrps/grps_gpu:grps1.1.0_cuda12.5_cudnn9.2_trtllm0.16.0_py3.12...
> demo start args.
>
> ### pd master
>
> python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b --run_mode "pd_master" --host `hostname -i` --port 60011
>
> ### prefill node
>
> nvidia-cuda-mps-control -d...
@fong-git In my tests, with tp=4 on both, lmdeploy is about 500 ms slower than vllm. Feature inference time is basically identical; the gap is in the "to cpu" step. vllm passes the GPU torch tensor straight into the downstream pipeline:

```python
def merge_multimodal_embeddings(input_ids: torch.Tensor,
                                inputs_embeds: torch.Tensor,
                                multimodal_embeddings: NestedTensors,
                                placeholder_token_id: int) -> torch.Tensor:
    """
    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
    positions in ``inputs_embeds`` corresponding to...
```
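To illustrate the point about staying on-device, here is a minimal sketch of such a merge. It assumes `multimodal_embeddings` is a single flat tensor whose rows line up with the placeholder positions (the real vllm function also handles nested tensors), and it uses plain boolean-mask assignment, so the whole operation runs on whatever device the tensors already live on, with no GPU → CPU copy:

```python
import torch


def merge_multimodal_embeddings_sketch(input_ids: torch.Tensor,
                                       inputs_embeds: torch.Tensor,
                                       multimodal_embeddings: torch.Tensor,
                                       placeholder_token_id: int) -> torch.Tensor:
    # Boolean mask of the token positions reserved for multimodal features.
    mask = input_ids == placeholder_token_id
    # Overwrite those rows in place. Everything stays on the same device
    # (CPU or GPU), which is the behavior the comment above contrasts with
    # an explicit "to cpu" round trip.
    inputs_embeds[mask] = multimodal_embeddings.to(inputs_embeds.dtype)
    return inputs_embeds


# Toy usage: 4 tokens with hidden size 2; token id 99 marks image slots.
ids = torch.tensor([1, 99, 99, 2])
embeds = torch.zeros(4, 2)
feats = torch.ones(2, 2)
out = merge_multimodal_embeddings_sketch(ids, embeds, feats, 99)
# Rows 1 and 2 now hold the image features; rows 0 and 3 are untouched.
```

The names and the flat-tensor assumption here are for illustration only; the actual vllm signature takes `NestedTensors` as shown in the quoted snippet.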