
None of the inputs have requires_grad=True. Gradients will be None

Open · Anorid opened this issue 1 year ago · 1 comment

**Describe the bug** What the bug is, and how to reproduce, better with screenshots.

Training the yi-vl and Phi-3-vision-128k-instruct models from the web UI page fails with an error. The log output is:

```
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The SftArguments will be saved in: /root/autodl-tmp/result-phi/phi3-vision-128k-instruct/v8-20240530-164513/sft_args.json
[INFO:swift] The Seq2SeqTrainingArguments will be saved in: /root/autodl-tmp/result-phi/phi3-vision-128k-instruct/v8-20240530-164513/training_args.json
[INFO:swift] The logging file will be saved in: /root/autodl-tmp/result-phi/phi3-vision-128k-instruct/v8-20240530-164513/logging.jsonl
```

```
Train:   0%|          | 0/100 [00:00<?, ?it/s]
/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
  File "/root/autodl-tmp/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/root/autodl-tmp/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/root/autodl-tmp/swift/swift/llm/sft.py", line 298, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/root/autodl-tmp/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 2125, in backward
    loss.backward(**kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Train:   0%|          | 0/100 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/root/autodl-tmp/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/root/autodl-tmp/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/root/autodl-tmp/swift/swift/llm/sft.py", line 298, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/root/autodl-tmp/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 2125, in backward
    loss.backward(**kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26530 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26529) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/autodl-tmp/swift/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-30_16:45:42
  host      : autodl-container-ade5118aae-191466bc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 26529)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
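For context, a minimal sketch of how this `RuntimeError` typically arises (an assumption on my part, not the actual ms-swift code path): when every parameter feeding `torch.utils.checkpoint` is frozen, e.g. a fully frozen backbone whose trainable adapter layers were not attached correctly, reentrant gradient checkpointing sees no input that requires grad, emits exactly the `UserWarning` from the log, and `backward()` then fails:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for a fully frozen backbone (not the ms-swift model code).
layer = nn.Linear(4, 4)
for p in layer.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 4)  # input activations do not require grad either

# Reentrant checkpointing sees no input requiring grad: the output has no
# grad_fn, so backward() raises the same RuntimeError as in the log above.
out = checkpoint(layer, x, use_reentrant=True)
try:
    out.sum().backward()
except RuntimeError as e:
    print("reproduced:", e)

# A common workaround is to force the checkpoint inputs to require grad
# (transformers exposes model.enable_input_require_grads() for this), or manually:
x.requires_grad_(True)
out = checkpoint(layer, x, use_reentrant=True)
out.sum().backward()  # succeeds; gradients flow back to x
```

Whether this is the actual root cause here depends on how the web UI wires up the trainable parameters for these two multimodal models.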

**Your hardware and system info** Write your system info like CUDA version/system/GPU/torch version here.

Package and CUDA versions:

```
root@autodl-container-ade5118aae-191466bc:~/autodl-tmp# pip show torch
Name: torch
Version: 2.0.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /root/miniconda3/lib/python3.8/site-packages
Requires: jinja2, triton, networkx, typing-extensions, filelock, sympy
Required-by: xtuner, trl, triton, torchvision, peft, optimum, flash-attn, bitsandbytes, accelerate

root@autodl-container-ade5118aae-191466bc:~/autodl-tmp# pip show transformers
Name: transformers
Version: 4.40.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /root/miniconda3/lib/python3.8/site-packages
Requires: tokenizers, packaging, regex, pyyaml, tqdm, safetensors, huggingface-hub, numpy, filelock, requests
Required-by: xtuner, trl, transformers-stream-generator, peft, optimum, ms-swift

root@autodl-container-ade5118aae-191466bc:~/autodl-tmp# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:01:00.0 Off |                  Off |
|  0%   23C    P8    22W / 300W |      2MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:41:00.0 Off |                  Off |
|  0%   21C    P8    12W / 300W |      2MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

**Additional context** Add any other context about the problem here.

Anorid avatar May 30 '24 08:05 Anorid

Please paste the command you ran and I'll take a look.

Jintao-Huang avatar May 31 '24 06:05 Jintao-Huang

fixed

Jintao-Huang avatar Jun 01 '24 07:06 Jintao-Huang