
How to train Paraformer-Large from scratch instead of fine-tuning

Open · Dongru1 opened this issue 11 months ago · 1 comment

Hello,

I want to pretrain Paraformer-large instead of fine-tuning it. Since my language is different, I need to rebuild the tokenizer and cannot use the fine-tune.sh script. Based on the Paraformer-large config at examples/industrial_data_pretraining/paraformer/modelscope_models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/config.yaml and the Paraformer training code in examples/aishell/paraformer, I started training the model, but the loss is not converging. Could you confirm that the config.yaml released with Paraformer-large is the same config you used to pretrain it?
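
For context, this is roughly how I am adapting the released config for my new vocabulary. The key names (`tokenizer_conf`, `token_list`, `seg_dict_file`) and the output paths are my own assumptions about the config layout, not something I have confirmed against the code:

```python
# Minimal sketch (plain PyYAML, not a FunASR API): copy the released
# Paraformer-large config and point it at a new vocabulary built from my
# corpus. Key names and paths below are assumptions -- please correct me
# if the actual config.yaml uses different ones.
import yaml

SRC = ("examples/industrial_data_pretraining/paraformer/modelscope_models/"
       "speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/config.yaml")
DST = "conf/paraformer_large_from_scratch.yaml"

with open(SRC, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Point the tokenizer at the vocabulary/seg_dict built from my own corpus
# (hypothetical paths, replace with real ones).
cfg.setdefault("tokenizer_conf", {})["token_list"] = "data/lang/tokens.txt"
cfg["tokenizer_conf"]["seg_dict_file"] = "data/lang/seg_dict"

with open(DST, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True, sort_keys=False)
```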

Also, I did not find an example of generating seg_dict in examples/aishell/paraformer. Where can I find the scripts to generate it from my corpus?
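
In case it clarifies what I am looking for, below is the kind of script I have in mind. I am assuming seg_dict has one entry per line in the form `word unit1 unit2 ...`, and I simply split each word into characters here; if the official recipe segments into BPE pieces or uses a different format, please point me to it:

```python
# Rough sketch of building a seg_dict from a plain-text transcript file:
# one line per unique word, "word unit1 unit2 ...". Character-level
# segmentation is my assumption; the real recipe may differ.
def build_seg_dict(corpus_path: str, out_path: str) -> None:
    words = {}  # insertion-ordered in Python 3.7+
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            # assume whitespace-tokenized transcripts, no utterance IDs
            for w in line.split():
                words.setdefault(w, " ".join(list(w)))
    with open(out_path, "w", encoding="utf-8") as f:
        for w, segs in words.items():
            f.write(f"{w} {segs}\n")

if __name__ == "__main__":
    build_seg_dict("data/train/text", "data/lang/seg_dict")  # hypothetical paths
```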

Thank you so much.

Dongru1 · Feb 28, 2025

A few more questions.

  1. The CTC loss weight in the Paraformer-large config is 0.0, while in Paraformer-base it is 0.3. Is that a mistake, or is 0.0 the best practice for Paraformer-large? (The sketch after this list shows how I am overriding it locally for my experiments.)
  2. My corpus is about 12,000; how many epochs should it take to get a good result?
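
For question 1, while waiting for an answer I am experimenting with a nonzero CTC weight that mirrors the Paraformer-base value. The `model_conf`/`ctc_weight` key names and the config path are assumptions on my side:

```python
# Hedged sketch: override the CTC weight in my local from-scratch config,
# mirroring the 0.3 used in Paraformer-base. "model_conf" / "ctc_weight"
# are assumed key names; the config path is the hypothetical one above.
import yaml

CFG = "conf/paraformer_large_from_scratch.yaml"

with open(CFG, encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("model_conf", {})["ctc_weight"] = 0.3

with open(CFG, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True, sort_keys=False)
```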

Dongru1 · Feb 28, 2025