Martin Kristiansen
@adrianastan, have you tried concatenating the speaker embeddings to the text encoding (by repeating it for each symbol)?
@adrianastan No, the speaker embedding is summed with the input and positional encoding, not concatenated. This kind of summation should be acceptable for positional encoding, but it is not suited...
@adrianastan In my experiment I increased the dimensionality of the encoder to fit the embedded symbols with speaker information concatenated to them. That way, the encoder receives intact and clearly...
@adrianastan Hope it works out for you. Just wanted to add that I got great results by concatenating the speaker embedding directly to the input of the 1) pitch predictor...
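The concatenation idea above can be sketched in a few lines. This is a minimal numpy illustration (not the repository's actual code), assuming a `(T, d_text)` text encoding and a fixed-size speaker embedding; the dimensions are made up for the example:

```python
import numpy as np

def concat_speaker_embedding(text_enc, spk_emb):
    """Repeat the speaker embedding once per symbol and concatenate it
    to the text encoding along the feature axis.

    text_enc: (T, d_text) encoder input/output for T symbols
    spk_emb:  (d_spk,) fixed-size speaker embedding
    returns:  (T, d_text + d_spk)
    """
    repeated = np.tile(spk_emb, (text_enc.shape[0], 1))  # (T, d_spk)
    return np.concatenate([text_enc, repeated], axis=-1)

# Hypothetical sizes: 10 symbols, 256-dim encoding, 64-dim speaker embedding
out = concat_speaker_embedding(np.zeros((10, 256)), np.ones(64))
print(out.shape)  # (10, 320)
```

Because the concatenated width is `d_text + d_spk`, the encoder (or pitch/energy predictor) input dimensionality has to grow accordingly, which is the change described above.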
@Moon-sung-woo I was facing this problem too. The current implementation doesn't support separate mean/std values for each speaker, but that is clearly needed in the multi-speaker setting. Calculate these values...
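To make the per-speaker statistics concrete, here is a small sketch of the idea, with hypothetical speaker IDs and pitch contours standing in for what the preprocessing step would produce:

```python
import numpy as np

# Hypothetical per-utterance pitch contours grouped by speaker ID;
# in practice these come from the pitch extraction step.
pitch_by_speaker = {
    "spk1": [np.array([110.0, 115.0, 120.0]), np.array([108.0, 112.0])],
    "spk2": [np.array([210.0, 220.0]), np.array([205.0, 215.0, 225.0])],
}

# One mean/std pair per speaker instead of a single global pair.
stats = {}
for spk, contours in pitch_by_speaker.items():
    values = np.concatenate(contours)
    stats[spk] = (values.mean(), values.std())

def normalize_pitch(pitch, spk):
    """Normalize a pitch contour with its own speaker's statistics."""
    mean, std = stats[spk]
    return (pitch - mean) / std
```

Normalizing each speaker with their own statistics keeps a low-pitched and a high-pitched voice on the same scale for the pitch predictor, instead of letting the speaker identity leak into the pitch target.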
Also, to mention another problem with that function: the fmax setting is way too high. `librosa.note_to_hz('C7')` equals 2093 Hz, and nobody speaks at that frequency. Probabilistic YIN takes MUCH...
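To show why C7 is far above any speaking F0, here is the standard MIDI note-to-Hz conversion (the same mapping `librosa.note_to_hz` implements), written out in plain Python; the comparison note C6 is just an illustration of a still-generous lower ceiling:

```python
# MIDI-style note-to-Hz conversion: A4 (MIDI 69) is 440 Hz,
# and each semitone multiplies the frequency by 2**(1/12).
def note_to_hz(midi_note):
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

print(round(note_to_hz(96)))  # C7 (MIDI 96) -> 2093 Hz, far above speech F0
print(round(note_to_hz(84)))  # C6 (MIDI 84) -> 1047 Hz, still generous for speech
```

Since pYIN searches the whole fmin-fmax range, lowering fmax to a few hundred Hz (typical speech F0 tops out well under 500 Hz) also shrinks the search space and speeds the extraction up.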