FastSpeech2

How can I train this model with data at sample_rate=16k?

Tian14267 opened this issue · 6 comments

Hello guys, I have a question: my data's **sample_rate** is 16k, and I want to use this 16k data to train the model. Which parameters should I modify? And for the HiFi-GAN model, how can I get a model with sample_rate=16k, and which parameters should I change?

Tian14267 · Nov 22 '21 09:11

If you want to use your model with the pretrained HiFiGAN as vocoder, you need to mimic its short-time Fourier transform (STFT) window and hop length.

The window length of the pretrained HiFiGAN is 1024 / 22050 ≈ 0.046440 s. Its hop length is 256 / 22050 ≈ 0.011610 s.

Now transforming that to 16 kHz: your window length is 16000 × 0.046440 ≈ 743.04 ≈ 743 samples; your hop length is 16000 × 0.011610 ≈ 185.76 ≈ 186 samples. fmin and fmax stay the same.
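The conversion above can be sketched in a few lines (the constants 1024, 256, and 22050 come from the pretrained HiFiGAN settings quoted above; rounding to the nearest integer is an assumption on my part):

```python
# Sketch: convert the pretrained HiFiGAN STFT parameters, defined in
# samples at 22050 Hz, to their closest equivalents at 16000 Hz.
SRC_SR = 22050  # sampling rate the pretrained HiFiGAN was trained on
DST_SR = 16000  # target sampling rate of the new dataset

win_length = round(1024 * DST_SR / SRC_SR)  # 1024 samples at 22050 Hz -> 743
hop_length = round(256 * DST_SR / SRC_SR)   # 256 samples at 22050 Hz  -> 186
```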

So, for example, a 16 kHz preprocess.yaml would look like:

```yaml
preprocessing:
  val_size: 512
  text:
    text_cleaners: ["english_cleaners"]
    language: "en"
  audio:
    sampling_rate: 16000
    max_wav_value: 32767.0
  stft:
    filter_length: 743
    hop_length: 186
    win_length: 743
  mel:
    n_mel_channels: 80
    mel_fmin: 0
    mel_fmax: 8000 # please set to 8000 for HiFi-GAN vocoder, set to null for MelGAN vocoder
  pitch:
    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
    normalization: True
  energy:
    feature: "phoneme_level" # support 'phoneme_level' or 'frame_level'
    normalization: True
```

Also set max_wav_value to 32767; the default is a bug in the authors' implementation and will cause artifacts in the generated audio.
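For context, max_wav_value is the divisor used to scale int16 PCM into floats. A minimal sketch of that normalization (assuming the preprocessing does wav / max_wav_value, as is common in these repos; the repository's actual code may differ):

```python
import numpy as np

# Sketch: int16 PCM spans -32768..32767; max_wav_value is the divisor
# that brings samples into roughly [-1.0, 1.0] before the STFT.
max_wav_value = 32767.0

wav_int16 = np.array([-16384, 0, 32767], dtype=np.int16)
wav = wav_int16.astype(np.float32) / max_wav_value  # positive peak hits 1.0
```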

dunky11 · Nov 22 '21 17:11

> If you want to use your model with the pretrained HiFiGAN as Vocoder you need to mimic its short-time fourier transform window and hop length. […]

Thank you very much! So I also need to train a new HiFi-GAN model with sample_rate=16k? Another question: if I fine-tune the authors' model on my own data to get my own voice, what is the right way to do it? My fine-tuning result is bad; maybe I am not doing it the right way?

Tian14267 · Nov 24 '21 03:11

No, you don't need to train a new HiFiGAN; its output will be 22050 Hz even though you trained FastSpeech 2 on 16 kHz mel spectrograms.
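To illustrate why the durations still roughly match, a back-of-the-envelope sketch (not code from the repo):

```python
# Sketch: a 16 kHz clip framed with hop_length=186 yields about
# sr / hop = 16000 / 186 ≈ 86 frames per second; the pretrained
# HiFiGAN then upsamples each mel frame by 256 samples at 22050 Hz.
duration_s = 1.0
n_frames = duration_s * 16000 / 186  # frames FastSpeech 2 produces
out_samples = n_frames * 256         # samples HiFiGAN emits
out_duration = out_samples / 22050   # ≈ 1.0 s, so timing is preserved
```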

dunky11 · Nov 26 '21 22:11

> If you want to use your model with the pretrained HiFiGAN as Vocoder you need to mimic its short-time fourier transform window and hop length. […]

Your guide is simple to understand, but I saw something very strange with my custom data:

1. I only changed the sampling rate to 16000 (in preprocessor.py: librosa.load(wav_path, sr=16000)). You can hear the difference between the two wavs (37.wav is the raw data at sr 16000 Hz): the speech is the same, but the voice sounds as if read at a lower frequency. 37.zip 37_reconstruced.zip

2. I changed all the parameters to your config (sr, filter_length, hop_length, win_length). As a result, the voice is also different and the speed is slower. 37_reconstructed_2.zip

Can you explain and give me some advice?

dohuuphu · Feb 28 '22 15:02

mark

leslie2046 · Apr 10 '22 09:04

> If you want to use your model with the pretrained HiFiGAN as Vocoder you need to mimic its short-time fourier transform window and hop length. […]

Was this solved in the end?

leslie2046 · Apr 17 '22 01:04