Hi, I have 8 GPUs available, with PyTorch 1.5 and CUDA 10.1.
When I run python classifier_pytorch/run_classifier.py with the following arguments:
--model_type=bert
--model_name_or_path=chinese_L-12_H-768_A-12
--task_name=iflytek
--do_train
--do_eval
--do_lower_case
--data_dir=clue_data/iflytek_public/
--max_seq_length=128
--per_gpu_train_batch_size=16
--per_gpu_eval_batch_size=16
--learning_rate=2e-5
--num_train_epochs=3.0
--logging_steps=759
--save_steps=759
--output_dir=outputs/iflytek_output/
--overwrite_output_dir
--seed=42
it fails with the following error:
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 569, in <module>
main()
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 504, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 113, in train
outputs = model(**inputs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 897, in forward
head_mask=head_mask)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 606, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
However, when I pin the run to a single GPU, there is no error:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
I also hit this problem when running the examples/glue.py script from Hugging Face's transformers library. My initial suspicion is that the PyTorch version is too new: after downgrading PyTorch to 1.2, the run succeeds even without pinning a GPU. I still don't know the exact reason torch 1.5 raises this error.
I found a workaround. Running on a single GPU, I printed next(self.parameters()).dtype and it is always torch.float32, so this does look like a version issue. Replacing the expression with torch.float32 directly fixes it.
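A minimal sketch of the failure mode and the workaround (the module name MaskCaster is hypothetical; the pattern mirrors the line at modeling_bert.py:606 from the traceback). The assumption here is that under DataParallel on torch 1.5 the replica's parameters() generator comes back empty, so next() raises StopIteration:

```python
import torch
import torch.nn as nn

class MaskCaster(nn.Module):
    """Hypothetical module with no registered parameters, mimicking a
    DataParallel replica whose parameters() generator is empty."""

    def forward(self, mask):
        # Original pattern from modeling_bert.py line 606 -- raises
        # StopIteration when self.parameters() yields nothing:
        #   mask.to(dtype=next(self.parameters()).dtype)
        # Workaround: hardcode the dtype observed on a single GPU.
        return mask.to(dtype=torch.float32)

# next() on an empty parameters() generator is exactly the crash seen above:
empty_module = MaskCaster()
try:
    next(empty_module.parameters())
except StopIteration:
    print("StopIteration: parameters() is empty on this module")

# The patched line casts the attention mask without touching parameters():
mask = torch.ones(2, 3, dtype=torch.int64)
out = empty_module(mask)
print(out.dtype)  # torch.float32
```

Hardcoding torch.float32 loses fp16 compatibility (the original line's comment), so this is a stopgap rather than a proper fix.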