None of the inputs have requires_grad=True. Gradients will be None
Describe the bug
What the bug is, and how to reproduce, preferably with screenshots.
When training the yi-vl model and the Phi-3-vision-128k-instruct model from the WebUI page, training fails with an error. The error output is below:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The SftArguments will be saved in: /root/autodl-tmp/result-phi/phi3-vision-128k-instruct/v8-20240530-164513/sft_args.json
[INFO:swift] The Seq2SeqTrainingArguments will be saved in: /root/autodl-tmp/result-phi/phi3-vision-128k-instruct/v8-20240530-164513/training_args.json
[INFO:swift] The logging file will be saved in: /root/autodl-tmp/result-phi/phi3-vision-128k-instruct/v8-20240530-164513/logging.jsonl
Train:   0%|          | 0/100 [00:00<?, ?it/s]
/root/miniconda3/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
(the warning above is repeated several more times)
Train:   0%|          | 0/100 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/root/autodl-tmp/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/root/autodl-tmp/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/root/autodl-tmp/swift/swift/llm/sft.py", line 298, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/root/autodl-tmp/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 3147, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 2125, in backward
    loss.backward(**kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26530 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 26529) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/root/autodl-tmp/swift/swift/cli/sft.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2024-05-30_16:45:42
  host       : autodl-container-ade5118aae-191466bc
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 26529)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
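For context on why the warning and the RuntimeError go together: this failure pattern typically appears when gradient checkpointing re-runs a block whose parameters are all frozen (e.g. a LoRA-style setup where the base model is frozen) and whose inputs do not require grad, so the checkpointed output has no grad_fn and loss.backward() fails. The sketch below reproduces both symptoms with a plain frozen layer and shows the usual workaround of making the inputs require grad (transformers exposes this as model.enable_input_require_grads()); it is an illustration of the mechanism, not the exact swift code path.

```python
import torch
import torch.utils.checkpoint as cp

# A stand-in for a frozen transformer block (all parameters frozen).
frozen = torch.nn.Linear(4, 4)
for p in frozen.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 4)  # inputs do not require grad either

# Reentrant checkpointing detaches its inputs; since neither inputs nor
# weights require grad, the output has no grad_fn. This is exactly the
# "None of the inputs have requires_grad=True" warning from the log.
y = cp.checkpoint(frozen, x, use_reentrant=True)
print(y.requires_grad)  # False -> loss.backward() would raise the RuntimeError

# Workaround: force the inputs to require grad before checkpointing
# (what transformers' model.enable_input_require_grads() does via a hook).
x.requires_grad_(True)
y = cp.checkpoint(frozen, x, use_reentrant=True)
print(y.requires_grad)  # True -> backward now has a graph to traverse
```

With the hook in place the autograd graph reaches back through the checkpointed block, so the trainable adapter parameters downstream receive gradients even though the block itself is frozen.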
Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here.

Here is the CUDA/torch info:

root@autodl-container-ade5118aae-191466bc:~/autodl-tmp# pip show torch
Name: torch
Version: 2.0.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /root/miniconda3/lib/python3.8/site-packages
Requires: jinja2, triton, networkx, typing-extensions, filelock, sympy
Required-by: xtuner, trl, triton, torchvision, peft, optimum, flash-attn, bitsandbytes, accelerate

root@autodl-container-ade5118aae-191466bc:~/autodl-tmp# pip show transformers
Name: transformers
Version: 4.40.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /root/miniconda3/lib/python3.8/site-packages
Requires: tokenizers, packaging, regex, pyyaml, tqdm, safetensors, huggingface-hub, numpy, filelock, requests
Required-by: xtuner, trl, transformers-stream-generator, peft, optimum, ms-swift

root@autodl-container-ade5118aae-191466bc:~/autodl-tmp# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40           On  | 00000000:01:00.0 Off |                  Off |
|  0%   23C    P8    22W / 300W |      2MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40           On  | 00000000:41:00.0 Off |                  Off |
|  0%   21C    P8    12W / 300W |      2MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Additional context
Add any other context about the problem here.
Paste the command you ran and I'll take a look.
fixed