[Feature] Multi-GPU deployment of the 6b-32k model keeps failing, but the 2k model works fine. Any help?
Is your feature request related to a problem? Please describe.
(chatglm) dingyifan@ciccy004:~/code/ChatGLM2-6B$ python web_demo.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00, 1.07it/s]
web_demo.py:89: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().
Traceback (most recent call last):
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/queueing.py", line 388, in call_prediction
output = await route_utils.call_process_api(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/route_utils.py", line 219, in call_process_api
output = await app.get_blocks().process_api(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/blocks.py", line 1437, in process_api
result = await self.call_function(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/blocks.py", line 1123, in call_function
prediction = await utils.async_iteration(iterator)
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/utils.py", line 503, in async_iteration
return await iterator.__anext__()
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/utils.py", line 496, in __anext__
return await anyio.to_thread.run_sync(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/utils.py", line 479, in run_sync_iterator_async
return next(iterator)
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/gradio/utils.py", line 629, in gen_wrapper
yield from f(*args, **kwargs)
File "web_demo.py", line 65, in predict
for response, history, past_key_values in model.stream_chat(tokenizer, input, history, past_key_values=past_key_values,
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/home/dingyifan/.cache/huggingface/modules/transformers_modules/model-32k/modeling_chatglm.py", line 1072, in stream_chat
for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/home/dingyifan/.cache/huggingface/modules/transformers_modules/model-32k/modeling_chatglm.py", line 1157, in stream_generate
outputs = self(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dingyifan/.cache/huggingface/modules/transformers_modules/model-32k/modeling_chatglm.py", line 946, in forward
transformer_outputs = self.transformer(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dingyifan/.cache/huggingface/modules/transformers_modules/model-32k/modeling_chatglm.py", line 836, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/home/anaconda/envs/chatglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dingyifan/.cache/huggingface/modules/transformers_modules/model-32k/modeling_chatglm.py", line 655, in forward
presents = torch.cat((presents, kv_cache), dim=0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
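Judging by the accelerate/hooks.py frames, the model was dispatched across GPUs by accelerate, so later GLM blocks live on cuda:1 while the accumulated `presents` cache stays on cuda:0, and the `torch.cat` at modeling_chatglm.py line 655 then sees two devices. Below is a minimal standalone repro plus a workaround sketch (hypothetical, assumes at least two CUDA devices; the `.to(presents.device)` move is a guess at a local patch, not an official fix):

```python
import torch

# Hypothetical standalone repro (assumes >= 2 CUDA GPUs). torch.cat raises the
# exact RuntimeError above when its inputs live on different devices.
presents = torch.zeros(1, 4, device="cuda:0")   # cache accumulated on GPU 0
kv_cache = torch.zeros(1, 4, device="cuda:1")   # a later layer's cache on GPU 1

# torch.cat((presents, kv_cache), dim=0)  # RuntimeError: Expected all tensors...

# Workaround: move the per-layer cache onto the device holding `presents`
# before concatenating, mirroring the failing line in modeling_chatglm.py.
presents = torch.cat((presents, kv_cache.to(presents.device)), dim=0)
print(presents.shape, presents.device)  # torch.Size([2, 4]) cuda:0
```

Applying the same `.to(presents.device)` move inside the cached modeling_chatglm.py would presumably unblock streaming, at the cost of an extra cross-GPU copy per layer per step.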
Solutions
I don't know.
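One thing worth trying (a sketch, not a confirmed fix): the ChatGLM2-6B repo's README loads multi-GPU through `utils.load_model_on_gpus`, which dispatches layers with a hand-written device map instead of accelerate's automatic one. Assuming the 32k checkpoint keeps the same layer layout that utils.py was written for, the loading code in web_demo.py would look roughly like:

```python
from transformers import AutoTokenizer
from utils import load_model_on_gpus  # utils.py ships with the ChatGLM2-6B repo

# Hypothetical replacement for the model loading in web_demo.py. MODEL_PATH and
# num_gpus are assumptions; point MODEL_PATH at the local chatglm2-6b-32k checkpoint.
MODEL_PATH = "THUDM/chatglm2-6b-32k"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = load_model_on_gpus(MODEL_PATH, num_gpus=2)
model = model.eval()
```

If the 32k variant has a different number of transformer layers, the device map built inside load_model_on_gpus would need to be adjusted to match.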
Additional context
No response
I'm hitting the same problem; switching back to ChatGLM2-6B works fine across multiple GPUs.
Same problem here; hoping this gets fixed.