
FastPitch: How to properly calculate pitch mean, std, fmin and fmax given the pitch estimated in shape of [1xmel_frames]?

Open yerzhan7orazayev opened this issue 4 years ago • 3 comments

Dear @alancucki ,

How do I properly calculate the pitch mean, std, fmin and fmax, given an estimated pitch of shape [1 x mel_frames]?

Yerzhan.

yerzhan7orazayev, Nov 25 '21 13:11

Hi @yerzhan7orazayev ,

Sorry for the late reply. For the pitch mean and std, just calculate those statistics over all pitch values in all audio files in the dataset. As for fmin and fmax, keep the defaults for 22 kHz audio.
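A minimal sketch of that computation, assuming each utterance's pitch is already available as an array of shape [1 x mel_frames]. The helper name `pitch_stats` and the `voiced_only` switch are illustrative and not taken from the FastPitch code; whether unvoiced frames (F0 = 0) should be excluded is an assumption here, so it is left as an option.

```python
import numpy as np


def pitch_stats(pitch_tracks, voiced_only=True):
    """Return (mean, std) pooled over every frame of every utterance.

    pitch_tracks: iterable of arrays shaped [1, mel_frames] (F0 in Hz).
    voiced_only:  ASSUMPTION, not confirmed in this thread: drop frames
                  whose F0 is 0 or NaN, i.e. unvoiced frames.
    """
    # Flatten each [1, mel_frames] track and pool all frames together.
    frames = np.concatenate(
        [np.asarray(p, dtype=np.float64).reshape(-1) for p in pitch_tracks]
    )
    if voiced_only:
        frames = frames[np.isfinite(frames) & (frames > 0)]
    if frames.size == 0:
        raise ValueError("no pitch frames left to compute statistics from")
    return float(frames.mean()), float(frames.std())


# Two toy utterances; zeros mark unvoiced frames.
tracks = [np.array([[0.0, 100.0, 120.0]]), np.array([[140.0, 0.0]])]
mean, std = pitch_stats(tracks)
```

Pooling frames rather than averaging per-file means gives longer utterances proportionally more weight, which matches "over all pitch values in all audio files".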

alancucki, Jan 10 '22 21:01

@alancucki Do you mind elaborating on the procedure for calculating the pitch mean and std over the entire dataset? What if the dataset contains a mixture of female and male speakers? Does using a single pitch mean and std still work?

Also, is there a particular standard for specifying fmin and fmax for different sampling rates? For example, I have a dataset sampled at 16 kHz. I still used the default 22 kHz fmin and fmax for it and didn't hear much of a difference (I could be wrong), so I was wondering how fmin and fmax were chosen.

Thanks in advance

jinny1208, Jan 28 '22 13:01

fmin and fmax can be derived from the sampling rate of the audio. When you train on 16 kHz samples, (fmin, fmax) should be (0, 8000), i.e. up to the Nyquist frequency, which is half the sampling rate. To reduce the effect of noise in the samples, you can tune those two values, e.g. (40, 7200) for a male speaker and (60, 7800) for a female speaker.
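To make the arithmetic concrete, here is a small sketch that checks a candidate band against the Nyquist limit. `nyquist_hz` and `check_band` are hypothetical helper names, not part of FastPitch or librosa.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def check_band(sample_rate_hz, fmin_hz, fmax_hz):
    """Hypothetical helper: True if 0 <= fmin < fmax <= Nyquist."""
    return 0.0 <= fmin_hz < fmax_hz <= nyquist_hz(sample_rate_hz)


# The three bands mentioned above, checked against 16 kHz audio.
for band in [(0, 8000), (40, 7200), (60, 7800)]:
    print(band, check_band(16000, *band))
```

An fmax above half the sampling rate, such as 11025 Hz for 16 kHz audio, fails this check.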

As for pitch-mean and pitch-std, I am also waiting for an answer to your question.

I have two other questions as well. First, should the F0 sequence used to compute pitch-mean and pitch-std contain zero values? The zeros come from unvoiced segments. Second, I see these arguments for computing pitch in the code:

librosa.pyin(
    fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz('C7'), frame_length=1024
)

The values of fmin, fmax and frame_length are not identical to those in the mel-spectrogram config. Is that still OK if I changed the mel-spectrogram arguments before training?
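For reference, in `librosa.pyin` the fmin and fmax arguments bound the F0 search range, which is a separate setting from the mel filterbank limits. The note names map to roughly 65.4 Hz (C2) and 2093 Hz (C7). A minimal sketch of that conversion without librosa, assuming equal temperament with A4 = 440 Hz; this `note_to_hz` is a simplified stand-in that handles only natural notes, not the library function itself.

```python
def note_to_hz(note):
    """Simplified stand-in for librosa.note_to_hz (illustrative only).

    Handles natural notes such as "C2" or "A4" (no sharps or flats),
    assuming equal temperament with A4 = 440 Hz.
    """
    semitone = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
    name, octave = note[0].upper(), int(note[1:])
    midi = 12 * (octave + 1) + semitone[name]  # C4 -> 60, A4 -> 69
    return 440.0 * 2.0 ** ((midi - 69) / 12)


fmin = note_to_hz("C2")  # about 65.4 Hz
fmax = note_to_hz("C7")  # about 2093 Hz
```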

JohnHerry, May 20 '22 09:05