
NPU fine-tuning error: ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}

Open FlynnShi opened this issue 1 year ago • 1 comments

Environment: Ascend 910B3

I set up the environment by following this document: https://github.com/modelscope/swift/blob/main/docs/source/LLM/NPU%E6%8E%A8%E7%90%86%E4%B8%8E%E5%BE%AE%E8%B0%83%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md

Traceback (most recent call last):
  File "/train/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/train/swift/swift/utils/run_utils.py", line 21, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/train/swift/swift/utils/utils.py", line 102, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/usr/local/Python3.10.12/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 150, in __init__
  File "/train/swift/swift/llm/utils/argument.py", line 808, in __post_init__
    self._init_training_args()
  File "/train/swift/swift/llm/utils/argument.py", line 837, in _init_training_args
    training_args = Seq2SeqTrainingArguments(
  File "<string>", line 134, in __init__
  File "/train/swift/swift/trainers/arguments.py", line 36, in __post_init__
    super().__post_init__()
  File "/usr/local/Python3.10.12/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
    and (self.device.type != "cuda")
  File "/usr/local/Python3.10.12/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
    return self._setup_devices
  File "/usr/local/Python3.10.12/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/usr/local/Python3.10.12/lib/python3.10/site-packages/transformers/training_args.py", line 2026, in _setup_devices
    self.distributed_state = PartialState(
  File "/usr/local/Python3.10.12/lib/python3.10/site-packages/accelerate/state.py", line 185, in __init__
    raise ValueError("Your assigned backend {original_backend} is not avaliable, please use {backend}")
ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}

The error message is shown above.

FlynnShi, May 30 '24 01:05

[INFO:swift] Start time of running main: 2024-05-31 10:59:56.122626
Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 21, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/data/swift/swift/utils/utils.py", line 102, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 166, in __init__
  File "/data/swift/swift/llm/utils/argument.py", line 851, in __post_init__
  File "/data/swift/swift/llm/utils/argument.py", line 880, in _init_training_args
  File "<string>", line 136, in __init__
  File "/data/swift/swift/trainers/arguments.py", line 38, in __post_init__
    super().__post_init__()
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
    and (self.device.type != "cuda")
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
    return self._setup_devices
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2026, in _setup_devices
    self.distributed_state = PartialState(
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/state.py", line 185, in __init__
    raise ValueError("Your assigned backend {original_backend} is not avaliable, please use {backend}")
ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}
[INFO:swift] Setting template_type: qwen
[INFO:swift] Setting args.lazy_tokenize: False
Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 21, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/data/swift/swift/utils/utils.py", line 102, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 166, in __init__
  File "/data/swift/swift/llm/utils/argument.py", line 851, in __post_init__
  File "/data/swift/swift/llm/utils/argument.py", line 880, in _init_training_args
  File "<string>", line 136, in __init__
  File "/data/swift/swift/trainers/arguments.py", line 38, in __post_init__
    super().__post_init__()
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
    and (self.device.type != "cuda")
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
    return self._setup_devices
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2026, in _setup_devices
    self.distributed_state = PartialState(
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/state.py", line 185, in __init__
    raise ValueError("Your assigned backend {original_backend} is not avaliable, please use {backend}")
ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}
Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 21, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/data/swift/swift/utils/utils.py", line 102, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 166, in __init__
  File "/data/swift/swift/llm/utils/argument.py", line 851, in __post_init__
  File "/data/swift/swift/llm/utils/argument.py", line 880, in _init_training_args
  File "<string>", line 136, in __init__
  File "/data/swift/swift/trainers/arguments.py", line 38, in __post_init__
    super().__post_init__()
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
    and (self.device.type != "cuda")
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
    return self._setup_devices
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2026, in _setup_devices
    self.distributed_state = PartialState(
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/state.py", line 185, in __init__
    raise ValueError("Your assigned backend {original_backend} is not avaliable, please use {backend}")
ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}
Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 21, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/data/swift/swift/utils/utils.py", line 102, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 166, in __init__
  File "/data/swift/swift/llm/utils/argument.py", line 851, in __post_init__
  File "/data/swift/swift/llm/utils/argument.py", line 880, in _init_training_args
  File "<string>", line 136, in __init__
  File "/data/swift/swift/trainers/arguments.py", line 38, in __post_init__
    super().__post_init__()
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
    and (self.device.type != "cuda")
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
    return self._setup_devices
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
    cached = self.fget(obj)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/transformers/training_args.py", line 2026, in _setup_devices
    self.distributed_state = PartialState(
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/accelerate/state.py", line 185, in __init__
    raise ValueError("Your assigned backend {original_backend} is not avaliable, please use {backend}")
ValueError: Your assigned backend {original_backend} is not avaliable, please use {backend}
[2024-05-31 11:00:12,621] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1355830) of binary: /data/anaconda3/envs/swift-npu/bin/python
Traceback (most recent call last):
  File "/data/anaconda3/envs/swift-npu/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/anaconda3/envs/swift-npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/swift/swift/cli/sft.py FAILED

Failures:
[1]:
  time : 2024-05-31_11:00:12
  host : localhost.localdomain
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 1355831)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2024-05-31_11:00:12
  host : localhost.localdomain
  rank : 2 (local_rank: 2)
  exitcode : 1 (pid: 1355832)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-05-31_11:00:12
  host : localhost.localdomain
  rank : 3 (local_rank: 3)
  exitcode : 1 (pid: 1355833)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2024-05-31_11:00:12
  host : localhost.localdomain
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 1355830)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

hunterhome, May 31 '24 03:05

Adding --ddp_backend hccl to the training script fixes this.
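As a rough illustration of that fix, the sketch below picks a distributed backend from the available hardware and appends --ddp_backend to an argument list only if it is not already set. The helper names (detect_npu, pick_ddp_backend, with_ddp_backend) and the sample arguments are hypothetical and not part of swift; the check for the torch_npu package is only an assumed proxy for "running on Ascend".

```python
import importlib.util
from typing import List


def detect_npu() -> bool:
    """Assumption: an installed torch_npu package means we target Ascend NPUs."""
    return importlib.util.find_spec("torch_npu") is not None


def pick_ddp_backend(npu_available: bool, cuda_available: bool) -> str:
    """Map hardware to a collective backend: hccl (Ascend), nccl (CUDA), gloo (CPU)."""
    if npu_available:
        return "hccl"
    if cuda_available:
        return "nccl"
    return "gloo"


def with_ddp_backend(argv: List[str], backend: str) -> List[str]:
    """Return a copy of argv with --ddp_backend appended unless already present."""
    if "--ddp_backend" in argv:
        return list(argv)
    return [*argv, "--ddp_backend", backend]


if __name__ == "__main__":
    # Hypothetical training arguments, for illustration only.
    args = ["--sft_type", "lora", "--output_dir", "output"]
    backend = pick_ddp_backend(detect_npu(), cuda_available=False)
    print(with_ddp_backend(args, backend))
```

In practice the flag is simply written into the launch command by hand; the helper only makes the hardware-to-backend mapping explicit.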

atomrun39, Jun 26 '24 07:06