LLM-Tuning

Error when running on multiple GPUs

Open ghlili opened this issue 2 years ago • 8 comments

Has anyone trained with multiple GPUs? Single-GPU works fine for me, but multi-GPU throws an error.

ghlili avatar Jun 27 '23 08:06 ghlili

When you open an issue, please paste your code, the error message, and so on~

beyondguo avatar Jun 27 '23 08:06 beyondguo

```
Traceback (most recent call last):
  File "/root/LLM-Tuning/chatglm_lora_tuning.py", line 141, in <module>
    main()
  File "/root/LLM-Tuning/chatglm_lora_tuning.py", line 134, in main
    trainer.train()
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/transformers/trainer.py", line 2645, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/LLM-Tuning/chatglm_lora_tuning.py", line 56, in compute_loss
    return model(
           ^^^^^^
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/llmtune/lib/python3.11/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in replica 1 on device 1.
```

I'm running on my company's platform, so the job isn't launched with CUDA_VISIBLE_DEVICES, and I don't see anything DDP-related in the code. I'm reading the Hugging Face docs now...

ghlili avatar Jun 27 '23 09:06 ghlili

I can't make much sense of this error yet. Is your environment identical to mine? As for DDP, the Hugging Face Trainer executes and allocates it automatically, so you don't need to write it yourself.
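For example, if your platform lets you run arbitrary commands, launching through torchrun should be enough for the Trainer to pick up the distributed environment (a minimal sketch, assuming 2 GPUs on one node):

```bash
# Sketch: one process per GPU; the HF Trainer reads the LOCAL_RANK/WORLD_SIZE
# environment variables that torchrun sets and switches to DDP automatically.
torchrun --nproc_per_node=2 chatglm_lora_tuning.py
```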

beyondguo avatar Jun 27 '23 10:06 beyondguo

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Have you ever run into tensors ending up on two different GPUs like this?

ghlili avatar Jun 27 '23 12:06 ghlili

I printed the model, and everything is on cuda:0, which is strange.

ghlili avatar Jun 27 '23 12:06 ghlili

Try running print(model.hf_device_map) to see how the individual layers are assigned (first make sure you load the model with device_map="auto").
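Something like this (a minimal sketch; "THUDM/chatglm-6b" is just a stand-in for whatever checkpoint you actually load):

```python
from transformers import AutoModel

# Sketch: load with device_map="auto" so accelerate distributes layers across
# the visible GPUs, then inspect the resulting layer -> device assignment.
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",  # placeholder; substitute your model path
    trust_remote_code=True,
    device_map="auto",
)
print(model.hf_device_map)  # e.g. {'transformer.layers.0': 0, 'transformer.layers.1': 1, ...}
```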

beyondguo avatar Jun 28 '23 05:06 beyondguo

It prints {'': 0}. I haven't modified your code. I'm not sure whether it's a data-loading issue; with DDP I've only ever seen tensors split between CPU and CUDA before, never across two CUDA devices. Someone here ran into the same problem, but it doesn't look like it was resolved: https://discuss.huggingface.co/t/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least-two-devices-cuda-1-and-cuda-0/39548/3
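From what I can tell, {'': 0} means the root module, i.e. the whole model, was placed on cuda:0. One workaround I've seen suggested (untested here) is to pin the model to each process's own GPU instead of relying on device_map="auto":

```python
import os
from transformers import AutoModel

# Untested sketch: give each process the entire model on its own GPU,
# instead of letting device_map="auto" pin everything to cuda:0.
# "THUDM/chatglm-6b" is a placeholder; substitute the actual model path.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",
    trust_remote_code=True,
    device_map={"": local_rank},  # "" = root module -> whole model on this rank's GPU
)
```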

ghlili avatar Jun 28 '23 05:06 ghlili

You might not be using accelerate... supposedly it's required. Give it a try and see if it works for you.
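For example (standard accelerate usage, a sketch, not specific to this repo):

```bash
pip install accelerate
accelerate config                          # answer the prompts: multi-GPU, number of processes, etc.
accelerate launch chatglm_lora_tuning.py   # starts one process per GPU
```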

AaronZLT avatar Jun 28 '23 09:06 AaronZLT