About pitch_predictor at different resolutions
Thanks for the good work.
While reading the code, one question keeps me from understanding it fully: why can the pitch_predictor predict pitch at different resolutions?
I see no difference between predicting pitch at the "phoneme level" and at the "frame level", except for the mask argument: in the former case the mask has the length of the input characters, while in the latter it has the length of the mel frames.
So why does this work? Does the frame-level pitch prediction produce a result with the same length as the input characters, which is then filled with zeros via masked_fill?
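To make the question concrete, here is a minimal pure-Python sketch of the two call patterns as I understand them. The names `length_regulate` and `toy_pitch_predictor` are hypothetical stand-ins, not the repository's code, and the sketch assumes that in the frame-level case the predictor is called on the hidden sequence after it has been expanded by the durations. If that assumption is right, the output would already have mel-frame length and the mask would only zero the padding; please correct me if the real code differs.

```python
def length_regulate(hidden, durations):
    """Repeat each phoneme-level vector durations[i] times (hypothetical helper)."""
    frames = []
    for vec, d in zip(hidden, durations):
        frames.extend([vec] * d)
    return frames


def toy_pitch_predictor(sequence, mask):
    """Stand-in for pitch_predictor: one value per input position.

    Positions where mask is True (padding) are set to 0.0,
    mimicking what masked_fill does in the real model.
    """
    return [0.0 if m else sum(vec) / len(vec) for vec, m in zip(sequence, mask)]


# Two real phonemes plus one padding position (made-up values).
hidden = [[1.0, 3.0], [2.0, 4.0], [0.0, 0.0]]
src_mask = [False, False, True]
durations = [2, 3, 0]

# "Phoneme level": predictor sees the character-length sequence.
phoneme_pitch = toy_pitch_predictor(hidden, src_mask)

# "Frame level" (assumed): predictor sees the duration-expanded sequence.
frames = length_regulate(hidden, durations)
mel_mask = [False] * len(frames)
frame_pitch = toy_pitch_predictor(frames, mel_mask)

print(len(phoneme_pitch), len(frame_pitch))
```

In this toy case the same predictor gives one value per character in the first call and one value per mel frame in the second, simply because its input has a different length.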
Have you tried predicting pitch and energy with "frame_level"? I have tried that configuration, but the result is terrible: the audio generated at inference contains a lot of noise and mispronunciations, while the audio synthesized for validation during training sounds good. Do you know why there is such a big difference?