
Error when resuming from checkpoint with single-node multi-GPU DDP+MP

Open kratorado opened this issue 1 year ago • 3 comments

Single machine, 8 GPUs, NPROC_PER_NODE=4. Training was interrupted and then resumed from the checkpoint, and it fails with the error below:

Traceback (most recent call last):
  File "~/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "~/swift/swift/utils/run_utils.py", line 31, in x_main
    result = llm_x(args, **kwargs)
  File "~/swift/swift/llm/sft.py", line 228, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "~/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/transformers/trainer.py", line 1917, in _inner_training_loop
    self.optimizer.step()
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/accelerate/optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 452, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 350, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/accelerate/optimizer.py", line 185, in patched_step
    return method(*args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/adamw.py", line 187, in step
    adamw(
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/adamw.py", line 339, in adamw
    func(
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/adamw.py", line 516, in _multi_tensor_adamw
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/optim/optimizer.py", line 409, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "~/miniconda3/envs/torch2.2/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 38, in _group_tensors_by_device_and_dtype
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding

It looks like this is caused by MP.

kratorado avatar Mar 23 '24 14:03 kratorado
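For context on the traceback: the failure happens in AdamW's multi-tensor path, where _group_tensors_by_device_and_dtype requires each parameter and its exp_avg/exp_avg_sq state to sit on the same device (only the `step` counter is allowed to stay on CPU as float32). When the model is split across GPUs with a device map (MP), the optimizer state restored from the checkpoint can land on a different device than its parameter. Below is a minimal workaround sketch, not an official ms-swift or transformers fix; the helper name is hypothetical, and it assumes you can reach the trainer's optimizer after the checkpointed optimizer state has been loaded but before the first step:

import torch

def realign_optimizer_state(optimizer: torch.optim.Optimizer) -> None:
    # Hypothetical helper (not part of ms-swift/transformers): move every
    # optimizer state tensor onto the device of its parameter so the
    # foreach/multi-tensor AdamW kernels see consistent device groups.
    # The `step` counter is skipped, since AdamW allows it on CPU as float32.
    for group in optimizer.param_groups:
        for param in group["params"]:
            state = optimizer.state.get(param)
            if not state:
                continue
            for key, value in state.items():
                if key == "step" or not torch.is_tensor(value):
                    continue
                if value.device != param.device:
                    state[key] = value.to(param.device)

One place to call it is from a TrainerCallback's on_train_begin on trainer.optimizer, on the assumption that your transformers version has already loaded the optimizer checkpoint by that point; verify the load order for your version before relying on it.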

Sorry, I don't have time to look into the resume-from-checkpoint issue right now; I'll get to it in April.

Jintao-Huang avatar Mar 23 '24 16:03 Jintao-Huang

Has this been resolved? I ran into a similar problem today.

edgeinfinity-wzt avatar Jun 13 '24 14:06 edgeinfinity-wzt

Ran into a similar problem as well. +1

shaoyan1223 avatar Jul 29 '24 09:07 shaoyan1223