I ran into a problem when using multiple GPUs.
Environment: Ubuntu 22.04 with Anaconda, Python 3.10, CUDA 11.8. Hardware: four P104-100 8 GB cards. The model has already been downloaded locally.
Following https://github.com/THUDM/VisualGLM-6B/issues/102: can VisualGLM-6B be deployed across multiple GPUs?
Because the cards' VRAM is limited, I can't use the unquantized model directly. In the /VisualGLM-6B directory, "python web_demo_hf.py --quant 4 --share" runs normally; up to that point everything works and the web page opens.
But when I tried to deploy across multiple GPUs I ran into a problem:
torchrun --nnode 1 --nproc_per_node= 4 web_demo_hf.py --quant 4 --share
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 632, in determine_local_world_size
    return int(nproc_per_node)
ValueError: invalid literal for int() with base 10: ''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
There's an extra space between the equals sign and the 4.
OK, I'll give it a try.
torchrun --nnode 1 --nproc_per_node=4 web_demo_hf.py --quant 4 --share

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-09-20 02:54:07,442] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)  (printed once per rank, 4 times in total)
[2023-09-20 02:54:16,232] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-20 02:54:16,283] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
  (the warning above is printed once per rank, 4 times in total)
Loading checkpoint shards:  60%|████████████████████████████████████████████████████▏ | 3/5 [01:33<01:05, 32.98s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 913 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 914 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 915 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 916) of binary: /home/wtchen/anaconda3/envs/v6b2/bin/python
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
web_demo_hf.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_02:56:44
  host      : wtcai4x104
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 916)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 916
RAM filled up. Swap still had plenty of free space, but the process was killed anyway.
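For what it's worth, exit code -9 is SIGKILL, which on Linux is typically the kernel's OOM killer; it can fire even with free swap. A rough sketch of why the multi-GPU launch runs out of host RAM when a single process doesn't: torchrun starts nproc_per_node full copies of web_demo_hf.py, and each copy deserializes the entire checkpoint by itself. The checkpoint and overhead sizes below are illustrative assumptions, not measured values.

```python
# Back-of-envelope estimate of peak host RAM under torchrun.
# Every rank is an independent copy of the demo script, so the
# checkpoint is loaded once per process; with 4 ranks the transient
# host-RAM requirement is roughly 4x that of a single process.
def peak_host_ram_gib(checkpoint_gib: float, nproc: int,
                      per_proc_overhead_gib: float = 2.0) -> float:
    """Upper-bound estimate while all ranks load the checkpoint at once."""
    return nproc * (checkpoint_gib + per_proc_overhead_gib)

single = peak_host_ram_gib(16.0, nproc=1)  # one process: ~18 GiB
multi = peak_host_ram_gib(16.0, nproc=4)   # four ranks: ~72 GiB
print(single, multi)
```

This matches the observed behavior: loading stalled at shard 3/5 and rank 3 was SIGKILLed once physical memory was exhausted.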
Removing the space did fix part of the problem, and mounting a larger swap file later stopped the process from being killed, but it still doesn't work. I'll try again later.

torchrun --nnode 1 --nproc_per_node=4 web_demo_hf.py --quant 4 --share

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-09-20 04:27:47,668] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)  (printed once per rank, 4 times in total)
[2023-09-20 04:27:56,387] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-09-20 04:27:56,431] [INFO] [RANK 0] You are using model-only mode. For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
  (the warning above is printed once per rank, 4 times in total)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 [06:11<00:00, 74.26s/it]
  (all 4 ranks finish loading, each at roughly 74 s per shard)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:104: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
image_path = gr.Image(type="filepath", label="Image Prompt", value=None).style(height=504)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:106: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
chatbot = gr.Chatbot().style(height=480)
/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py:116: GradioDeprecationWarning: The style method is deprecated. Please set these arguments in the constructor instead.
user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=4).style(
3.44.4
Running on local URL: http://0.0.0.0:9088
  (the three GradioDeprecationWarning lines and the Gradio version line "3.44.4" above are printed again by each of the remaining ranks; each of those ranks then fails to bind the port:)

Traceback (most recent call last):
  File "/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py", line 143, in <module>
    main(args)
  File "/home/wtchen/ai/VisualGLM-6B/web_demo_hf.py", line 135, in main
    demo.queue().launch(share=args.share, inbrowser=True, server_name='0.0.0.0', server_port=9088)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/gradio/blocks.py", line 1907, in launch
    ) = networking.start_server(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/gradio/networking.py", line 207, in start_server
    raise OSError(
OSError: Cannot find empty port in range: 9088-9088. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the server_port parameter to launch().
  (the traceback above repeats for each of the remaining ranks, which all try to bind port 9088)
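The OSError follows directly from the launch model: under torchrun every rank executes main(), so all four processes try to bind the same hard-coded port 9088 and only one succeeds. A minimal sketch of a possible workaround (hypothetical, not the project's official fix), relying on the LOCAL_RANK and RANK environment variables that torchrun sets for each worker:

```python
import os

# Hypothetical patch idea for web_demo_hf.py: either give each local
# rank its own port, or start the Gradio UI on rank 0 only.
BASE_PORT = 9088  # the port hard-coded in the original demo

def port_for_this_rank(base_port: int = BASE_PORT) -> int:
    """Derive a distinct port per local rank so launches don't collide."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return base_port + local_rank

def should_serve_ui() -> bool:
    """Alternative: only the global rank-0 process serves the web UI."""
    return int(os.environ.get("RANK", "0")) == 0

# In main(), the hard-coded server_port=9088 could then become:
# demo.queue().launch(share=args.share, inbrowser=True,
#                     server_name='0.0.0.0', server_port=port_for_this_rank())
```

Note that this only removes the port collision; each rank would still be an independent copy of the model rather than a sharded deployment.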
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1657 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1655) of binary: /home/wtchen/anaconda3/envs/v6b2/bin/python
Traceback (most recent call last):
  File "/home/wtchen/anaconda3/envs/v6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wtchen/anaconda3/envs/v6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
web_demo_hf.py FAILED
Failures:
[1]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1656)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1658)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-09-20_04:43:15
  host      : wtcai4x104
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1655)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Hey, have you managed to get multi-GPU inference working? I'm hitting the same problem and can't solve it.
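One observation from the logs above, hedged as a suggestion rather than a confirmed fix: torchrun replicates the whole single-process demo per GPU, which is what causes both the host OOM and the port clash. For fitting one model onto several small cards, the approach usually suggested for Hugging Face checkpoints is single-process layer sharding via device_map="auto" with a max_memory budget. Whether that path composes with web_demo_hf.py's --quant 4 code is untested here; the per-device numbers below are assumptions for 4x 8 GB P104-100 cards.

```python
# Sketch of a max_memory budget for sharding one model across 4 GPUs
# in a single process, leaving headroom on each card for activations.
def max_memory_map(num_gpus: int, per_gpu_gib: int,
                   headroom_gib: int = 2, cpu_gib: int = 24) -> dict:
    """Per-device memory limits in the format Hugging Face expects."""
    budget = {i: f"{per_gpu_gib - headroom_gib}GiB" for i in range(num_gpus)}
    budget["cpu"] = f"{cpu_gib}GiB"  # spill-over allowance on host RAM
    return budget

print(max_memory_map(4, 8))
# Would then be passed along these lines (standard HF API names):
# model = AutoModel.from_pretrained(local_path, trust_remote_code=True,
#                                   device_map="auto",
#                                   max_memory=max_memory_map(4, 8))
```

The key difference from torchrun is that this loads the checkpoint once and splits layers across devices, so neither the 4x host-RAM spike nor the Gradio port collision occurs.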