NPU fine-tuning error: ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}
Environment: Ascend 910B3
I set up the environment following this document: https://github.com/modelscope/swift/blob/main/docs/source/LLM/NPU%E6%8E%A8%E7%90%86%E4%B8%8E%E5%BE%AE%E8%B0%83%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md
Traceback (most recent call last):
File "/train/swift/swift/cli/sft.py", line 5, in <module>
[INFO:swift] Start time of running main: 2024-05-31 10:59:56.122626
Traceback (most recent call last):
File "/data/swift/swift/cli/sft.py", line 5, in <module>
File "/data/swift/swift/llm/utils/argument.py", line 880, in _init_training_args
File "<string>", line 136, in __init__
File "/data/swift/swift/trainers/arguments.py", line 38, in __post_init__
super().__post_init__()
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
and (self.device.type != "cuda")
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
return self._setup_devices
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
cached = self.fget(obj)
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2026, in _setup_devices
self.distributed_state = PartialState(
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/state.py", line 185, in __init__
raise ValueError("Your assigned backend {original_backend} is not avaliable, please use {backend}")
ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}
[2024-05-31 11:00:12,621] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1355830) of binary: /data/anaconda3/envs/swift-npu/bin/python
Traceback (most recent call last):
File "/data/anaconda3/envs/swift-npu/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/data/swift/swift/cli/sft.py FAILED
Failures:
[1]:
  time      : 2024-05-31_11:00:12
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1355831)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-31_11:00:12
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1355832)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-31_11:00:12
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1355833)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_11:00:12
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1355830)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Adding --ddp_backend hccl to the training script fixes it.
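For anyone scripting their launches: the literal `{original_backend}` and `{backend}` in the message appear to be uninterpolated placeholders, so the error does not say which backend it wanted. On Ascend NPUs the collective backend is HCCL, which matches the `--ddp_backend hccl` fix. Below is a rough sketch of a launcher-side helper that always passes an explicit backend for the device in use. The names `DEFAULT_BACKENDS`, `pick_ddp_backend` and `with_ddp_backend` are hypothetical and not part of swift, transformers or accelerate; only the `--ddp_backend` flag itself comes from this thread.

```python
from typing import List, Sequence

# Assumed mapping from device type to the collective backend usually paired
# with it in torch.distributed (hypothetical table, not taken from swift).
DEFAULT_BACKENDS = {"npu": "hccl", "cuda": "nccl", "cpu": "gloo"}


def pick_ddp_backend(device_type: str) -> str:
    """Return the backend conventionally used for the given device type."""
    try:
        return DEFAULT_BACKENDS[device_type]
    except KeyError:
        raise ValueError(f"no default backend known for device type {device_type!r}")


def with_ddp_backend(argv: Sequence[str], device_type: str) -> List[str]:
    """Return a copy of argv with an explicit --ddp_backend for the device.

    Any --ddp_backend already present (either "--ddp_backend X" or
    "--ddp_backend=X") is dropped and replaced.
    """
    backend = pick_ddp_backend(device_type)
    out: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the value that followed a bare --ddp_backend flag.
            skip_next = False
            continue
        if arg == "--ddp_backend":
            skip_next = True
            continue
        if arg.startswith("--ddp_backend="):
            continue
        out.append(arg)
    return out + ["--ddp_backend", backend]


if __name__ == "__main__":
    # Illustrative arguments only; substitute your real training flags.
    print(with_ddp_backend(["--sft_type", "lora"], "npu"))
```

The returned list can be appended to whatever command you already use to start training. The helper only rewrites arguments; it does not check that the backend is actually installed on the machine.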