[Bug] Deploying internlm2-chat-20b-4bits on a 3090: asking a question hangs with no response

Open makefree3 opened this issue 1 year ago • 5 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.

Describe the bug

(screenshots attached showing the hang)

Reproduction

Convert the model:
lmdeploy convert internlm2-chat-20b /vllmapi/internlm2-chat-20b-4bits --model-format awq --group-size 128 --tp 1

Run inference:
lmdeploy chat turbomind ./workspace

After I type "你好", a warning message is printed and then the process just hangs with no further output.

Environment

Docker environment

Error traceback

No response

makefree3 avatar Feb 21 '24 14:02 makefree3

I tried this on a 3090 and could not reproduce the problem. Which image are you using? What is the CUDA driver version on the host? Please paste the output of nvidia-smi.

lvhan028 avatar Feb 22 '24 07:02 lvhan028

Here is my result; the steps are the same as yours. (screenshot attached)

lvhan028 avatar Feb 22 '24 07:02 lvhan028

Using @makefree3's environment, I reproduced the hang. The gdb backtrace is as follows: (screenshot attached)

Cross-referencing with the source code: (screenshot attached)

On a Windows system, the code should take the #ifdef _MSC_VER branch. But because this is a Docker container running on Windows, the #else branch is executed instead, and that branch is what causes the hang.

The fix is to use cudaStreamSynchronize in both branches, but we need to verify whether this change affects performance on Linux.

lvhan028 avatar Feb 27 '24 10:02 lvhan028
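
For reference, a minimal sketch of the branch structure being described above (illustrative only, not the exact TurboMind source; the function name wait_for_stream and the spin-poll loop are assumptions based on the comment):

#include <cuda_runtime.h>

// Sketch of the synchronization pattern discussed above.
static void wait_for_stream(cudaStream_t stream)
{
#ifdef _MSC_VER
    // Native Windows (MSVC) build: block until the stream's work completes.
    cudaStreamSynchronize(stream);
#else
    // Non-MSVC build -- also the branch taken inside a Linux container on a
    // Windows host. Per the report above, this is where the process hangs.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        /* busy-wait until the stream drains */
    }
#endif
}

// Proposed fix from the comment above: synchronize unconditionally, pending
// verification that this does not hurt performance on Linux.
static void wait_for_stream_fixed(cudaStream_t stream)
{
    cudaStreamSynchronize(stream);
}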

We still recommend deploying directly on bare-metal Windows.

lvhan028 avatar Apr 22 '24 12:04 lvhan028

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.

github-actions[bot] avatar Apr 30 '24 02:04 github-actions[bot]

This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.

github-actions[bot] avatar May 05 '24 02:05 github-actions[bot]

We still recommend deploying directly on bare-metal Windows.

Hello, I deployed internvl2-8B locally on bare-metal Windows and ran the sample code:

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
from lmdeploy.vl import load_image

# Local path of the InternVL2-8B checkpoint
model = 'D:\\xxxx\\InternVL2-8B'
system_prompt = '我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
chat_template_config = ChatTemplateConfig('internvl-internlm2')
chat_template_config.meta_instruction = system_prompt
pipe = pipeline(model, chat_template_config=chat_template_config,
                backend_config=TurbomindEngineConfig(session_len=8192))

# Single-image, multi-turn chat
image = load_image('D:\\xxx.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)

There is no output at all. Here are some of the warnings: Flash Attention is not available, use_flash_attn is set to false. .... gemm_config.in is not found; using default GEMM algo

The other repo says this is an issue with lmdeploy's limited Windows support. Is there a good workaround? My environment: torch 2.2.2+cu121, lmdeploy 0.5.2.post1

humphreyde avatar Aug 06 '24 02:08 humphreyde

When creating the pipeline, add the parameter log_level="INFO", run the demo again, and paste the detailed log.

lvhan028 avatar Aug 28 '24 16:08 lvhan028
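
For example (a minimal sketch based on the demo above; only the log_level argument is new):

from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

# Same pipeline as the demo above, with verbose engine logging enabled
pipe = pipeline('D:\\xxxx\\InternVL2-8B',
                chat_template_config=ChatTemplateConfig('internvl-internlm2'),
                backend_config=TurbomindEngineConfig(session_len=8192),
                log_level='INFO')  # emit detailed INFO-level engine logs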

When creating the pipeline, add the parameter log_level="INFO", run the demo again, and paste the detailed log.

Switching to the quantized version solved it.

humphreyde avatar Aug 30 '24 05:08 humphreyde
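
For readers who hit the same symptom: a minimal sketch of loading a 4-bit AWQ checkpoint instead of the fp16 model. The model id 'OpenGVLab/InternVL2-8B-AWQ' is an assumption; substitute the path of whichever quantized checkpoint you actually use.

from lmdeploy import pipeline, TurbomindEngineConfig

# model_format='awq' tells the TurboMind backend the weights are 4-bit AWQ
pipe = pipeline('OpenGVLab/InternVL2-8B-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq',
                                                     session_len=8192))
print(pipe('你好').text)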