Chinese TTS implemented with VITS

This project is for learning and academic purposes only.

This is a fork of https://github.com/lutianxiong/vits_chinese
The original version of VITS: https://github.com/jaywalnut310/vits

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

ESPnet link: github.com/espnet/espnet/tree/master/espnet2/gan_tts/vits

coqui-ai/TTS link: github.com/coqui-ai/TTS/tree/main/recipes/ljspeech/vits_tts

Workflow for building a 48 kHz Baker TTS with VITS

Prerequisites

apt-get install espeak
pip install -r requirements.txt
cd monotonic_align
python setup.py build_ext --inplace

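Copy the Baker (标贝) annotations to ./filelists/

Either delete entries 2365 and 2762, so that mixed Chinese-English text is not used for training,
or change entries 2365 and 2762 to the following. Both are annotation errors in Baker, and the English characters they use cannot be recognized, which makes encoding fail. (This project does not use these two entries; they are cleaned out.)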

002365 这图#2难不成#2是#1P过的#4？ zhe4 tu2 nan2 bu4 cheng2 shi4 pi1 guo4 de5

002762 我是#2善良#1活泼#3、好奇心#1旺盛的#2B型血#4。 wo3 shi4 shan4 liang2 huo2 po1 hao4 qi2 xin1 wang4 sheng4 de5 bi4 xing2 xie3
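
The deletion option above can be scripted. The following is a minimal sketch, not the project's own cleaning code: the function name `drop_entries` is hypothetical, and it assumes each entry starts with a line whose first field is the utterance ID, optionally followed by an indented continuation line holding the pinyin.

```python
from typing import Iterable, List, Set

# IDs of the two mixed Chinese-English entries named above.
BAD_IDS = {"002365", "002762"}


def drop_entries(lines: Iterable[str], bad_ids: Set[str] = BAD_IDS) -> List[str]:
    """Return the lines with every entry whose ID is in bad_ids removed.

    Assumption: an entry begins with a non-indented line whose first
    whitespace-separated field is the utterance ID; any following lines
    that start with whitespace (for example a pinyin line) belong to it.
    """
    kept = []
    skipping = False
    for line in lines:
        is_continuation = line[:1].isspace()
        if not is_continuation:
            # A new entry starts here: decide whether to skip it.
            fields = line.split(maxsplit=1)
            skipping = bool(fields) and fields[0] in bad_ids
        if not skipping:
            kept.append(line)
    return kept
```

To apply it, read the annotation file under ./filelists/ with `encoding="utf-8"`, pass the lines through `drop_entries`, and write the result back before running the preprocessing step.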

Run the preprocessing command

python preprocess.py

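Copy the Baker audio to ./baker_waves/ and start training

The labels used are the version with five pause levels, initials and finals split, and no erhua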
python train.py -c configs/baker_base.json -m baker_base

Training takes more than 40 hours on a single RTX 3090 (24 GB)

Inference

Point the script at the corresponding model, then run inference:
python inference.py

Possible problems and solutions:

RuntimeError: view_as_complex is only supported for float and double tensors, but got a tensor of scalar type: Half

This is a half-precision problem that occurs during audio processing; the solution is in this issue

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

This problem occurs when using DistributedDataParallel. Adding find_unused_parameters=True to the DDP call works around it, although this does not seem to be the best solution

running build_ext
copying build/lib.linux-x86_64-3.8/monotonic_align/core.cpython-38-x86_64-linux-gnu.so -> monotonic_align
error: could not create 'monotonic_align/core.cpython-38-x86_64-linux-gnu.so': No such file or directory

Create another monotonic_align folder inside the monotonic_align folder
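
The build is run from inside monotonic_align, so the copy target in the error message is a nested monotonic_align directory that does not exist yet. A minimal sketch of the fix in Python follows; the helper name `ensure_nested_build_dir` is hypothetical, and a plain mkdir from the shell does the same job.

```python
from pathlib import Path


def ensure_nested_build_dir(package_dir: str = "monotonic_align") -> Path:
    """Create <package_dir>/<package name> and return its path.

    With the default argument, run from the repository root, this creates
    monotonic_align/monotonic_align, the directory the in-place build
    tries to copy the compiled extension into.
    """
    pkg = Path(package_dir)
    nested = pkg / pkg.name
    # exist_ok=True makes the call safe to repeat.
    nested.mkdir(parents=True, exist_ok=True)
    return nested
```

After the directory exists, rerun python setup.py build_ext --inplace from inside monotonic_align.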

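Abnormal pauses. Boundaries are already force-inserted after the phonemes, and VITS force-inserts boundaries again through add_blank. The relevant config parameter is: "add_blank": false
Another possible cause of pause problems is the stochastic duration predictor. The relevant config parameter is: use_sdp=True
Synthesized audio sounds electronic or robotic. Moderately raising the noise parameters at inference time can noticeably improve synthesis quality: noise_scale=0.667, noise_scale_w=0.8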
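
The two pause-related settings above can be changed in the JSON config before training. The sketch below is an assumption-laden illustration: the helper `set_pause_options` is hypothetical, and it assumes "add_blank" sits under "data" and "use_sdp" under "model", as in the upstream VITS configs. Check configs/baker_base.json before relying on that layout.

```python
import json


def set_pause_options(config: dict, add_blank: bool = False, use_sdp: bool = True) -> dict:
    """Return a copy of config with add_blank and use_sdp set.

    Assumption: "add_blank" belongs to the "data" section and "use_sdp"
    to the "model" section; missing sections are created.
    """
    # Round-trip through JSON to deep-copy, so the input dict is not mutated.
    updated = json.loads(json.dumps(config))
    updated.setdefault("data", {})["add_blank"] = add_blank
    updated.setdefault("model", {})["use_sdp"] = use_sdp
    return updated
```

A typical use would be to load the config with `json.load`, pass it through `set_pause_options`, and write the result back with `json.dump(..., indent=2)`. The noise_scale and noise_scale_w values are not covered here, since they are set at inference time rather than in this part of the config.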
