SenseVoice

Problem encountered during fine-tuning

Open wangchao112211 opened this issue 6 months ago • 1 comment

Notice: In order to resolve issues more efficiently, please raise issues following the template and fill in the details.

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I ran into an unknown error.

Code

What have you tried?

Changing the parameters in the .sh script did not help, and running on CUDA did not help either.

What's your environment?

  • OS (e.g., Linux):
  • FunASR Version (e.g., 1.2.6):
  • ModelScope Version (e.g., 1.13.3):
  • PyTorch Version (e.g., 2.3.1):
  • How you installed funasr (pip, source):
  • yes
  • Python version: 3.10
  • GPU (e.g., V100M32) A800*2
  • CUDA/cuDNN version (e.g., cuda_11.0):
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
  • Any other relevant information:

The error message is as follows:

Type: torch.float32
[2025-07-15 10:31:33,167][root][INFO] - Build optim
[2025-07-15 10:31:33,170][root][INFO] - Build scheduler
[2025-07-15 10:31:33,171][root][INFO] - Build dataloader
[2025-07-15 10:31:33,171][root][INFO] - Build dataloader
[2025-07-15 10:31:33,181][root][INFO] - Build optim
[2025-07-15 10:31:33,184][root][INFO] - Build scheduler
[2025-07-15 10:31:33,184][root][INFO] - Build dataloader
[2025-07-15 10:31:33,184][root][INFO] - Build dataloader
[2025-07-15 10:31:34,835][root][INFO] - total_num of samplers: 226156, /data/ASR/SenseVoice/data/train_example.jsonl
[2025-07-15 10:31:34,835][root][INFO] - total_num of samplers: 6, /data/ASR/SenseVoice/data/val_example.jsonl
[2025-07-15 10:31:34,845][root][INFO] - total_num of samplers: 226156, /data/ASR/SenseVoice/data/train_example.jsonl
[2025-07-15 10:31:34,845][root][INFO] - total_num of samplers: 6, /data/ASR/SenseVoice/data/val_example.jsonl
[2025-07-15 10:31:35,394][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 21665, after: 21665
[2025-07-15 10:31:35,414][root][INFO] - rank: 1, dataloader start from step: 0, batch_num: 21665, after: 21665
W0715 10:31:36.284000 139943396536896 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 32993 closing signal SIGTERM
E0715 10:31:48.072000 139943396536896 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 1 (pid: 32994) of binary: /data/anaconda/envs/cosyvoice/bin/python3.10
Traceback (most recent call last):
  File "/data/anaconda/envs/cosyvoice/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/anaconda/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/anaconda/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/data/anaconda/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/data/anaconda/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/anaconda/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/FunASR/funasr/bin/train_ds.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2025-07-15_10:31:36
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 32994)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 32994

Is this a version problem?

wangchao112211 · Jul 15 '25 02:07

@wangchao112211 Hello, did you manage to solve this? What was the cause?

liuhuang31 · Sep 01 '25 13:09
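
A note on reading the log: the `exitcode : -11` reported for local_rank 1 follows the Python subprocess convention, where a negative exit code means the worker was killed by that signal number. That matches the `Signal 11 (SIGSEGV)` line, so the crash happened in native code and left no Python traceback. The sketch below is not taken from FunASR; `describe_exit_code` is a hypothetical helper name. It decodes such an exit code and shows how the standard-library `faulthandler` module can be enabled so that a segfaulting process dumps its Python stack to stderr first. Whether that dump points at the real cause in this case is not known.

```python
import faulthandler
import signal
import sys


def describe_exit_code(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a readable label.

    Hypothetical helper for illustration. A negative value means the
    process was terminated by signal number -exitcode.
    """
    if exitcode < 0:
        try:
            return signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not defined on this platform.
            return f"unknown signal {-exitcode}"
    return f"exit status {exitcode}"


if __name__ == "__main__":
    # -11 is the exit code shown for local_rank 1 in the log above.
    print(describe_exit_code(-11))

    # Ask the interpreter to print the Python stack of every thread
    # to stderr if it receives a fatal signal such as SIGSEGV.
    faulthandler.enable(file=sys.stderr, all_threads=True)
```

The same effect is available without editing the training script by setting the environment variable `PYTHONFAULTHANDLER=1` before launching `torchrun`, since worker processes inherit the environment. The resulting stack shows which Python call was active when the native crash occurred, which may help tell a version mismatch apart from, for example, a data-loading problem.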