
KeyError: 6491 when fine-tuning SenseVoiceSmall on dialect data

Open · lukeewin opened this issue 8 months ago · 3 comments

Environment: Ubuntu Server 22.04, Python 3.11, CUDA 11.8. Running training fails with the following error:

[2025-05-01 18:32:21,169][root][INFO] - Validate epoch: 1, rank: 0

[2025-05-01 18:32:21,172][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 1, after: 1
[2025-05-01 18:32:21,291][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 1, after: 1
Error executing job with overrides: ['++model=iic/SenseVoiceSmall', '++train_data_set_list=/usr/local/src/FunASR/examples/industrial_data_pretraining/sense_voice/data/train_test.jsonl', '++valid_data_set_list=/usr/local/src/FunASR/examples/industrial_data_pretraining/sense_voice/data/val_test.jsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=6000', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=8', '++train_conf.max_epoch=150', '++train_conf.log_interval=1', '++train_conf.resume=true', '++train_conf.validate_interval=2000', '++train_conf.save_checkpoint_interval=2000', '++train_conf.keep_nbest_models=10', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/usr/local/src/FunASR/examples/industrial_data_pretraining/sense_voice/../../ds_stage1.json', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
  File "/usr/local/src/FunASR/examples/industrial_data_pretraining/sense_voice/../../../funasr/bin/train_ds.py", line 228, in <module>
    main_hydra()
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/src/FunASR/examples/industrial_data_pretraining/sense_voice/../../../funasr/bin/train_ds.py", line 56, in main_hydra
    main(**kwargs)
  File "/usr/local/src/FunASR/examples/industrial_data_pretraining/sense_voice/../../../funasr/bin/train_ds.py", line 201, in main
    trainer.validate_epoch(model=model, dataloader_val=dataloader_val, epoch=epoch + 1)
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/funasr/train_utils/trainer_ds.py", line 771, in validate_epoch
    self.forward_step(model, batch, loss_dict=loss_dict)
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/funasr/train_utils/trainer_ds.py", line 670, in forward_step
    retval = model(**batch)
             ^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/funasr/models/sense_voice/model.py", line 697, in forward
    encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, text)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/funasr/models/sense_voice/model.py", line 759, in encode
    [[self.textnorm_int_dict[int(style)]] for style in text[:, 3]]
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/funasr/models/sense_voice/model.py", line 759, in <listcomp>
    [[self.textnorm_int_dict[int(style)]] for style in text[:, 3]]
      ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 6491
E0501 18:32:25.772000 140269714682944 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 77200) of binary: /root/miniconda3/envs/funasr/bin/python3.11
Traceback (most recent call last):
  File "/root/miniconda3/envs/funasr/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/funasr/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../../funasr/bin/train_ds.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-01_18:32:25
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 77200)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
^C

Does anyone know how to fix this?
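The traceback fails at `self.textnorm_int_dict[int(style)]`, where `style` comes from `text[:, 3]`, i.e. the fourth token of each label sequence. That position is apparently expected to hold a text-normalization tag, so a `KeyError` on 6491 suggests it held an ordinary text token instead. One possible cause (an unconfirmed hypothesis; the thread has no accepted fix) is jsonl records that lack the prompt-tag fields. Because the crash happens during "Validate epoch: 1", the validation file is worth checking too. The following minimal sketch scans jsonl lines for those fields. The field names (`text_language`, `emo_target`, `event_target`, `with_or_wo_itn`) and the tag values `<|withitn|>` / `<|woitn|>` are assumptions based on the SenseVoice fine-tuning data format, so verify them against your FunASR version.

```python
import json
from typing import Iterable

# ASSUMPTION: keys follow the SenseVoice fine-tuning jsonl format.
# Adjust if your FunASR version expects different field names.
REQUIRED_KEYS = (
    "source",
    "target",
    "text_language",
    "emo_target",
    "event_target",
    "with_or_wo_itn",
)
# ASSUMPTION: the 4th prompt token (the one looked up in textnorm_int_dict)
# must be one of these text-normalization tags.
VALID_ITN_TAGS = {"<|withitn|>", "<|woitn|>"}


def check_jsonl_lines(lines: Iterable[str]) -> list[str]:
    """Return a list of human-readable problems found in jsonl lines."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        # Report every required key the record is missing.
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
            continue
        # Flag ITN tags that would not be found in textnorm_int_dict.
        if record["with_or_wo_itn"] not in VALID_ITN_TAGS:
            problems.append(
                f"line {lineno}: unexpected with_or_wo_itn "
                f"{record['with_or_wo_itn']!r}"
            )
    return problems


def check_jsonl_file(path: str) -> list[str]:
    """Check every line of a jsonl file on disk."""
    with open(path, encoding="utf-8") as f:
        return check_jsonl_lines(f)
```

For example, run `check_jsonl_file(...)` on both `data/train_test.jsonl` and `data/val_test.jsonl` and print any returned problems. An empty list means only that the assumed fields are present. It does not prove that the tokenizer maps them to the IDs the model expects.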

lukeewin, May 01 '25

Hi, has this been solved yet?

deegy666, May 14 '25

Hi, has this been solved yet?

lukeewin, May 14 '25

Has anyone solved this yet? If so, how?

sinkup27, Oct 29 '25