Adrian Lancucki

Results 11 comments of Adrian Lancucki

@subhankar-ghosh looks good! I wonder if extra conditioning improves the loss of predictors.

Thanks! IMHO, proper emotive synthesis requires an emotive dataset. In FastPitch, the speaker conditioning implemented for multi-speaker models could be overloaded to handle...

Hi @yerzhan7orazayev , sorry for a late reply. For pitch mean/std, just calculate those statistics over all pitch values across all audio files in the dataset. As for fmin...
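A minimal sketch of that aggregation (the `pitch_stats` helper and the toy per-file pitch tracks are hypothetical; FastPitch's own pitch extraction produces the per-file arrays):

```python
import numpy as np

def pitch_stats(pitch_arrays):
    """Pool pitch values from all files, skipping unvoiced (zero) frames,
    then return the global mean and standard deviation."""
    all_pitch = np.concatenate([p[p > 0] for p in pitch_arrays])
    return float(all_pitch.mean()), float(all_pitch.std())

# Toy example: two hypothetical per-file pitch tracks in Hz,
# with zeros marking unvoiced frames.
files = [np.array([0.0, 110.0, 112.0]), np.array([108.0, 0.0, 114.0])]
mean, std = pitch_stats(files)
```

The key point is to pool the raw values across the whole dataset before computing the statistics, rather than averaging per-file means.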

Hi @brentcty-2020 , sorry for a late reply. Have you managed to resolve your issue? After 100 epochs you should get something very intelligible, just a little bit noisy ([sample](https://github.com/NVIDIA/DeepLearningExamples/files/7849935/001_the_overwhelming_majority_of_people_in.wav.gz))....

Hi @Pawel-VRtechnology , That should be fairly straightforward. You'll need to adjust `--text-cleaners` and `--symbol-set`, and either get ahold of a Polish pronunciation dictionary and supply it with `--cmudict-path`, or...
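A rough sketch of how those flags might be combined (the symbol-set name and dictionary path below are placeholders; a Polish symbol set and cleaner would have to be added to the FastPitch code first):

```shell
# Hypothetical invocation -- the polish_basic symbol set and the
# dictionary file do not ship with FastPitch and must be added first.
python train.py \
    --text-cleaners basic_cleaners \
    --symbol-set polish_basic \
    --cmudict-path data/polish_dict.txt \
    ... # remaining training arguments unchanged
```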

Hi @wiamfa there is a [pre-trained](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_fr_quartznet15x5) model for French from the [NeMo](https://github.com/NVIDIA/NeMo) project. You can use it in NeMo, or follow these steps to load it in DeepLearningExamples: 1. Change...

Hi @karanveersingh5623 , Sorry for a late reply. AFAIK DALI uses custom pipelines in which only DALI operations are permitted, so no `sox`. It might suffice to build...

Hi @adrianastan Sorry for replying late. I haven't got much experience with that many speakers, but I'd try to add capacity to the model and look at class imbalance -...

> I assume that the simple summation of the speaker embedding to the text encoding is not strong enough to preserve the speaker identity

That might be the case. Positional...