
Training fails when running on multiple GPUs with torch 1.5.

Open YYGe01 opened this issue 5 years ago • 2 comments

Hello, I have 8 GPUs available, PyTorch version 1.5, CUDA 10.1.

The error occurs when I run `python classifier_pytorch/run_classifier.py` with the following arguments:

```
python classifier_pytorch/run_classifier.py \
  --model_type=bert \
  --model_name_or_path=chinese_L-12_H-768_A-12 \
  --task_name=iflytek \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir=clue_data/iflytek_public/ \
  --max_seq_length=128 \
  --per_gpu_train_batch_size=16 \
  --per_gpu_eval_batch_size=16 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --logging_steps=759 \
  --save_steps=759 \
  --output_dir=outputs/iflytek_output/ \
  --overwrite_output_dir \
  --seed=42
```

It fails with the following error:

```
Traceback (most recent call last):
  File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 569, in <module>
    main()
  File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 504, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 113, in train
    outputs = model(**inputs)
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 897, in forward
    head_mask=head_mask)
  File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 606, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration
```

However, when I pin a single GPU, there is no error: `os.environ["CUDA_VISIBLE_DEVICES"] = "0"`

I also hit this problem running the `examples/glue.py` example from Hugging Face's transformers library. My initial suspicion is that the PyTorch version is too new: after I downgraded PyTorch to 1.2, training runs without error even without pinning a GPU. The exact cause of the failure under torch 1.5 is still unclear.
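For context, a torch-free sketch of the failure mechanism (class names here are hypothetical stand-ins, not PyTorch APIs): in torch 1.5, `DataParallel.replicate` stopped registering parameters on the copied modules, so inside a replica `self.parameters()` yields nothing and the `next(...)` call in `modeling_bert.py` raises `StopIteration`, which `parallel_apply` then re-raises on device 0.

```python
class ReplicaModule:
    """Stand-in for a torch 1.5 DataParallel replica whose parameter dict is empty."""

    def __init__(self):
        self._parameters = {}  # torch 1.5 replicas no longer populate this

    def parameters(self):
        # Same shape as nn.Module.parameters(): a generator over registered params.
        yield from self._parameters.values()

replica = ReplicaModule()
try:
    # Mirrors the failing call: next(self.parameters()).dtype
    dtype = next(replica.parameters())
except StopIteration:
    dtype = None  # this is the exception that surfaces as "Caught StopIteration in replica 0"

assert dtype is None
```

On a single GPU (or on torch 1.2) the module is not replicated, `parameters()` is non-empty, and the call succeeds, which matches the behavior reported above.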

YYGe01 avatar May 07 '20 07:05 YYGe01

Hello, has this problem been solved?

zhhao1 avatar Oct 16 '20 16:10 zhhao1

I found a fix. On a single GPU I printed `next(self.parameters()).dtype`, and it is always `torch.float32`, so this should indeed be a version issue. Just replace the expression directly and it works.
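A sketch of the edit described above, applied to the line named in the traceback (`modeling_bert.py`, line 606). Hard-coding `torch.float32` assumes the model runs in fp32, which the printed dtype confirmed here; it would break genuine fp16 runs, so treat it as a workaround rather than a general fix.

```python
# transformers/modeling_bert.py, BertModel.forward, around line 606

# Before -- raises StopIteration inside torch 1.5 DataParallel replicas,
# because self.parameters() is empty in a replica:
#   extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility

# After -- use the dtype observed when printing on a single GPU:
extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)
```

Alternatively, downgrading to torch 1.2 or pinning one GPU via `CUDA_VISIBLE_DEVICES` avoids the replica code path entirely, as noted earlier in the thread.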

zhhao1 avatar Oct 16 '20 16:10 zhhao1