[MMS TTS] - Can we change the speaker's voice (not language), without fine-tuning? Any controllable parameters, or seed?
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
I am using MMS TTS and it's amazing. So far, each language (e.g. eng) has a single speaker's voice. Are there any parameters or random seeds that can be changed to get an entirely different person's voice, without fine-tuning? Even if we can't control emotion or, say, voice pitch, can we at least get a random new, natural-sounding speaker?
Code
What have you tried?
MMS TTS and Hugging Face mms-tts
What's your environment?
- fairseq Version (e.g., 1.0 or main): main
- PyTorch Version (e.g., 1.0) - 1.13
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.10
JFYI, for now the sampling rate is the only thing that can tune this a little: a higher value gives you a deeper (slower) voice, while a lower value gives a thinner (faster) voice.
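The comment above does not say exactly how the sampling rate was changed, so here is a minimal sketch of one possible reading: resample the generated waveform to more (or fewer) samples and keep playing it at the model's native rate. The names `stretch`, `dominant_freq` and `NATIVE_RATE` are made up for this illustration, the 16 kHz value is an assumption about the checkpoint, and a sine tone stands in for real TTS output. Note that this shifts pitch and speed together; it does not produce a new speaker identity.

```python
import math

NATIVE_RATE = 16000  # assumed output rate of the TTS checkpoint (check your model's config)


def stretch(samples, factor):
    """Linearly interpolate `samples` to round(len * factor) points.

    Played back at the unchanged native rate, factor > 1 sounds slower and
    deeper, and factor < 1 sounds faster and thinner.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i / factor              # fractional position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def dominant_freq(samples, rate):
    """Rough pitch estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate / len(samples)


# Stand-in for TTS output: one second of a 200 Hz tone.
tone = [math.sin(2 * math.pi * 200 * t / NATIVE_RATE) for t in range(NATIVE_RATE)]

deeper = stretch(tone, 1.25)   # longer clip, lower pitch at NATIVE_RATE
thinner = stretch(tone, 0.8)   # shorter clip, higher pitch at NATIVE_RATE

print(len(deeper) / NATIVE_RATE, dominant_freq(deeper, NATIVE_RATE))
print(len(thinner) / NATIVE_RATE, dominant_freq(thinner, NATIVE_RATE))
```

With real output you would pass the model's waveform (as a list of floats) instead of `tone` and write the result to a WAV file at `NATIVE_RATE`. Factors far from 1 quickly sound unnatural, which fits the "a little" in the comment.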
@QaisarRajput For now, controllable generation (e.g., change gender, emotion, etc) is not supported yet. You could consider cascading the MMS TTS model with an off-the-shelf voice cloning model to achieve this.
Could you please name a VITS-based voice cloning repo that can do this? I found that fine-tuning the Korean model directly gives very poor results.
Not sure how this would work, but here is one example for voice conversion.
I suggest looking into Coqui which has recipes for using MMS-TTS (FairSeq) alongside voice cloning; I've used it successfully for gender.
As for emotion and similar controls, Bark looks promising, but I haven't tested it yet.