[Bug] Deploying internlm2-chat-20b-4bits on a 3090: asking a question hangs with no response

Open makefree3 opened this issue 1 year ago • 5 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.

Describe the bug

(screenshots attached showing the hang)

Reproduction

Convert the model:
lmdeploy convert internlm2-chat-20b /vllmapi/internlm2-chat-20b-4bits --model-format awq --group-size 128 --tp 1

Run inference:
lmdeploy chat turbomind ./workspace

After I type "你好", a warning message is printed and then the process just hangs with no further output.

Environment

Docker environment

Error traceback

No response

makefree3 avatar Feb 21 '24 14:02 makefree3

I tried this on a 3090 and could not reproduce the problem. Which image are you using? What is the CUDA driver version on the host? Please paste the output of nvidia-smi.

lvhan028 avatar Feb 22 '24 07:02 lvhan028

Here is my result; the steps are the same as yours. (screenshot attached)

lvhan028 avatar Feb 22 '24 07:02 lvhan028

Using @makefree3's environment, I reproduced the hang. The gdb backtrace is as follows: (screenshot attached)

Cross-referencing with the source code: (screenshot attached)

On a Windows system, the code should take the #ifdef _MSC_VER branch. But because this is a Docker container running on Windows, the #else branch is executed instead, and that branch is what causes the hang.

The fix is to use cudaStreamSynchronize in both branches, but we need to verify whether this change affects performance on Linux.

lvhan028 avatar Feb 27 '24 10:02 lvhan028
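
For reference, a minimal sketch of the branch structure being described above (illustrative only, not the exact TurboMind source; the function name wait_for_stream and the spin-poll loop are assumptions based on the comment):

#include <cuda_runtime.h>

// Sketch of the synchronization pattern discussed above.
static void wait_for_stream(cudaStream_t stream)
{
#ifdef _MSC_VER
    // Native Windows (MSVC) build: block until the stream's work completes.
    cudaStreamSynchronize(stream);
#else
    // Non-MSVC build -- also the branch taken inside a Linux container on a
    // Windows host. Per the report above, this is where the process hangs.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        /* busy-wait until the stream drains */
    }
#endif
}

// Proposed fix from the comment above: synchronize unconditionally, pending
// verification that this does not hurt performance on Linux.
static void wait_for_stream_fixed(cudaStream_t stream)
{
    cudaStreamSynchronize(stream);
}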

We still recommend deploying directly on bare-metal Windows.

lvhan028 avatar Apr 22 '24 12:04 lvhan028

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] avatar Apr 30 '24 02:04 github-actions[bot]

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.

github-actions[bot] avatar May 05 '24 02:05 github-actions[bot]

We still recommend deploying directly on bare-metal Windows.

Hello, I deployed internvl2-8B locally on bare-metal Windows and ran the sample code:

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
from lmdeploy.vl import load_image

# Local path of the InternVL2-8B checkpoint
model = 'D:\\xxxx\\InternVL2-8B'
system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
chat_template_config = ChatTemplateConfig('internvl-internlm2')
chat_template_config.meta_instruction = system_prompt
pipe = pipeline(model, chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))

# Single-image, multi-turn chat
image = load_image('D:\\xxx.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

There is no output at all. Here are some of the warnings: Flash Attention is not available, use_flash_attn is set to false. .... gemm_config.in is not found; using default GEMM algo

The other repo says this is an issue with lmdeploy's limited Windows support. Is there a good workaround? My environment: torch 2.2.2+cu121, lmdeploy 0.5.2.post1

humphreyde avatar Aug 06 '24 02:08 humphreyde

When creating the pipeline, add the parameter log_level="INFO", run the demo again, and paste the detailed log.

lvhan028 avatar Aug 28 '24 16:08 lvhan028
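
For example (a minimal sketch based on the demo above; only the log_level argument is new):

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

# Same pipeline as the demo above, with verbose engine logging enabled
pipe = pipeline('D:\\xxxx\\InternVL2-8B',
                chat_template_config=ChatTemplateConfig('internvl-internlm2'),
                backend_config=TurbomindEngineConfig(session_len=8192),
                log_level='INFO')  # emit detailed INFO-level engine logs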

When creating the pipeline, add the parameter log_level="INFO", run the demo again, and paste the detailed log.

Switching to the quantized version solved it.

humphreyde avatar Aug 30 '24 05:08 humphreyde
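
For readers who hit the same symptom: a minimal sketch of loading a 4-bit AWQ checkpoint instead of the fp16 model. The model id 'OpenGVLab/InternVL2-8B-AWQ' is an assumption; substitute the path of whichever quantized checkpoint you actually use.

from lmdeploy import pipeline, TurbomindEngineConfig

# model_format='awq' tells the TurboMind backend the weights are 4-bit AWQ
pipe = pipeline('OpenGVLab/InternVL2-8B-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq',
                                                     session_len=8192))
print(pipe('你好').text)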