About the training of the flow model
Hey guys, I have recently been trying to train the flow model from scratch.
But I am a bit confused about the training pipeline of the flow model. As suggested in #281 by @aluminumbox, the flow can be trained by simply changing the model param in the run.sh script to 'flow', which initializes the whole flow model, i.e. flow: !new:cosyvoice.flow.flow.MaskedDiffWithXvec in cosyvoice.yaml. What confuses me is that the loss function of the flow only takes the loss of cosyvoice.flow.flow.MaskedDiffWithXvec.decoder into account, with no contribution from the other modules of the flow (e.g., encoder, length_regulator).
Question:
- Is this setup suitable for flow training, considering that the loss is computed only from the decoder (cosyvoice.flow.flow.MaskedDiffWithXvec.decoder) of the flow model? Should we freeze part of the flow (e.g., the encoder, length_regulator, and nn.Embedding()) when training it?
- How do the parameters of the other modules (besides the decoder) get updated if we initialize the whole flow model?
PS: the reason I ask is that I found training unstable when the whole flow model was initialized and trained, whereas training was stable when I initialized only the decoder and froze the other modules with:
import torch.optim as optim

model = configs['flow']
# freeze every module of the flow first ...
for param in model.parameters():
    param.requires_grad = False
# ... then unfreeze only the decoder
for param in model.decoder.parameters():
    param.requires_grad = True
...
# build the optimizer over the decoder parameters only
# (model.decoder rather than model.module.decoder, since model is not DDP-wrapped here)
optimizer = optim.Adam(model.decoder.parameters(), **configs['train_conf']['optim_conf'])
Here are the loss curves without freezing (the first one) and with freezing (the second one). In the first curve there is a crash at around 15k steps (the generated audio also gets worse after 15k steps), while the second one is much more stable under the same settings except for the frozen modules.
Do not freeze the encoder; reduce the learning rate or increase the batch size instead.
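One note on the second question above: even though the loss is computed only inside the decoder, the encoder and length_regulator outputs are the conditioning that the decoder is trained on, so their parameters still receive gradients through that loss; that is also why the encoder should not be frozen. If the full model is unstable, a smaller learning rate is the first thing to try. A minimal sketch of the override (assuming the optim_conf dict from the snippet above exposes a standard 'lr' key; check your train_conf for the actual names):

import torch.optim as optim

# sketch only: 1e-4 is an illustrative value, not a recommendation from the repo
configs['train_conf']['optim_conf']['lr'] = 1e-4
model = configs['flow']                 # keep all modules trainable this time
optimizer = optim.Adam(model.parameters(), **configs['train_conf']['optim_conf'])

Increasing the effective batch size (larger batches per GPU, or gradient accumulation) has a similar stabilizing effect.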
I compared the flow model config in cosyvoice.yaml and cosyvoice.fromscratch.yaml; the former is used to fine-tune the pretrained model, and the latter to train from scratch on a small dataset. Compared with the big model (the open-source one configured in cosyvoice.yaml), the small model's parameter reduction mainly happens in the Conformer encoder, which drops from 6 blocks to 3, while in the decoder the ConditionalDecoder (which acts as the ODE estimator) only removes 4 mid_blocks. Why is such a heavy parameter budget kept on the ODE estimator? If I want to reduce the model size further, where can I cut more?
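To decide where further cuts pay off, it can help to look at how the parameters are actually distributed over the flow's submodules (a small sketch, assuming the flow is built from the yaml configs as in the snippet above):

model = configs['flow']
for name, child in model.named_children():
    n_params = sum(p.numel() for p in child.parameters())
    print(f'{name}: {n_params / 1e6:.2f}M parameters')

If most of the parameters turn out to sit in the decoder's estimator, reducing its channel width or the number of mid blocks will shrink the model far more than trimming the encoder further, at the likely cost of some synthesis quality.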
Some details of the flow model's length_regulator are not clear to me. My understanding is: when the speech tokens are extracted, S3 encodes at a 50 Hz token rate, so one token covers 320 samples; the mel spectrogram uses a hop size of 256, i.e. one mel frame per 256 samples, so for 16 kHz audio the frame rate is 62.5 Hz. The flow therefore has to turn a 50 Hz sequence into a 62.5 Hz one, and the length_regulator just interpolates the 50 Hz sequence up to the length it would have at 62.5 Hz. My feeling is that the stack of convolutions inside length_regulator only smooths the result of that hard interpolation. But in length_regulator's inference path the sequence is split into segments and then recombined; what is that logic doing?
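The basic 50 Hz to 62.5 Hz resampling described above can be sketched with a plain linear interpolation (an illustration of the idea only, not the repository's InterpolateRegulator code):

import torch
import torch.nn.functional as F

token_emb = torch.randn(1, 100, 512)   # 100 speech tokens at 50 Hz = 2 s of audio
target_len = int(100 / 50 * 62.5)      # the same 2 s is 125 mel frames at 62.5 Hz
mel_cond = F.interpolate(token_emb.transpose(1, 2), size=target_len,
                         mode='linear', align_corners=True).transpose(1, 2)
print(mel_cond.shape)                  # torch.Size([1, 125, 512])

This is consistent with the intuition in the question that the convolutions afterwards mainly smooth the hard-interpolated result.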
Under normal circumstances, how long should one training step of the flow model take, and how many steps are needed? Why is our training so slow?
Hi huskyachao, what is your progress now? Have you gotten good results training the flow from scratch? I ran into the same problem as you: the loss jumps back up during the warm-up steps. What were the final learning rate and batch size of your successful run?
Because the speech token rate and the mel frame rate are not integer multiples of each other, inference smooths the head, middle and tail segments separately; this reduces the discontinuity at the overlap points in the streaming case.
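A rough picture of what the head/middle/tail handling amounts to (a conceptual sketch with made-up chunk and context sizes, not the repository's implementation): each chunk is interpolated together with a few frames of neighbouring context, and the upsampled context is dropped afterwards, so the interpolation at every chunk boundary has seen both sides of the seam.

import torch
import torch.nn.functional as F

def chunked_upsample(token_emb, chunk=25, context=5, ratio=62.5 / 50):
    # token_emb: (batch, T, channels) at 50 Hz; returns roughly (batch, T * ratio, channels)
    T = token_emb.size(1)
    pieces = []
    for start in range(0, T, chunk):
        stop = min(start + chunk, T)
        left = max(0, start - context)    # head chunk has no left context
        right = min(T, stop + context)    # tail chunk has no right context
        seg = token_emb[:, left:right].transpose(1, 2)
        up = F.interpolate(seg, scale_factor=ratio, mode='linear',
                           align_corners=False).transpose(1, 2)
        lo = round((start - left) * ratio)                    # drop the upsampled left context
        hi = min(up.size(1), lo + round((stop - start) * ratio))
        pieces.append(up[:, lo:hi])
    return torch.cat(pieces, dim=1)

The actual inference path is more involved than this, but the goal is the same: avoid a hard seam where consecutive chunks are stitched together.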
However, the flow's streaming synthesis still does not seem to work well at the moment; there are still pops at the junctions (I tested with ground-truth speech tokens, not length-regulated tokens generated by the LLM). And we have seen before that this problem is not limited to the flow side: when we previously did streaming synthesis from mel spectrograms with HiFiGAN, there were also pops at the junctions. Does hift here include any special design for streaming?
Hi, how much data did you use for training, and to what range did the loss converge? I am currently using about 50k hours of data, and the loss converges much more slowly than in the curves you posted, and to a larger value.
I would suggest training with more than 100k hours of data, and pay attention to data quality. Training for 2 epochs is basically usable; if you have enough compute, train for 4 or more. We found about 1.1M steps gave quite good results in earlier tests. Note the comments in the open-source config: the 1e-5 learning rate there is what they use for fine-tuning; you should certainly not use that setting when training from scratch, so if the loss converges slowly, increase the learning rate. Also, multi-GPU training is more stable than single-GPU training and gives a nicer loss curve. Beyond that you will have to explore on your own; there are always pitfalls to step into, and each one is different.
