How to train Paraformer-Large from scratch instead of fine-tuning
Hello,
I want to pretrain Paraformer-large instead of fine-tuning it. Since my language is different, I need to rebuild the tokenizer and therefore cannot use the fine-tune.sh script.
Based on the Paraformer-large config at examples/industrial_data_pretraining/paraformer/modelscope_models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/config.yaml and the Paraformer training code in examples/aishell/paraformer, I started training the model, but the loss is not converging. Could you confirm that the config.yaml released with Paraformer-large is the same config you used to pretrain it?
Also, I could not find an example of how to generate seg_dict in examples/aishell/paraformer. Where can I find the script to generate it from my own corpus?
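For reference, this is roughly what I am doing at the moment. I am assuming seg_dict maps each word to its modeling units (characters here), one entry per line in the form `<word> <unit1> <unit2> ...`; the file names are just placeholders. Please correct me if the expected format or generation procedure is different.

```python
# Rough sketch of how I currently build seg_dict from my own corpus.
# Assumption: one line per word, "<word> <unit1> <unit2> ...",
# with characters used as the modeling units.

def build_seg_dict(vocab_path: str, seg_dict_path: str) -> None:
    with open(vocab_path, encoding="utf-8") as fin, \
         open(seg_dict_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.strip().split()
            if not parts:
                continue
            word = parts[0]
            # Split the word into characters as its modeling units.
            units = " ".join(list(word))
            fout.write(f"{word} {units}\n")

if __name__ == "__main__":
    # Hypothetical file names, just for illustration.
    build_seg_dict("vocab.txt", "seg_dict")
```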
Thank you so much.
A few more questions:
- The CTC loss weight in the Paraformer-large config is 0.0, while in the Paraformer-base config it is 0.3. Is this a mistake, or is 0.0 the recommended setting for Paraformer-large? (See the sketch after this list for my understanding of how this weight is applied.)
- My corpus is about 12,000; roughly how many epochs should it take to get a good result?
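For context, this is my understanding of how ctc_weight combines the two losses in hybrid CTC/attention training; it is only a sketch of the usual formulation, not the actual FunASR code, so please correct me if Paraformer handles it differently:

```python
# Sketch of the standard hybrid CTC/attention loss combination
# (my assumption about what ctc_weight in config.yaml controls).
import torch

def combined_loss(loss_att: torch.Tensor, loss_ctc: torch.Tensor,
                  ctc_weight: float) -> torch.Tensor:
    # ctc_weight = 0.0 (Paraformer-large config) would drop the CTC branch
    # entirely; ctc_weight = 0.3 (Paraformer-base) keeps a 30% CTC term.
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```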