[MMS TTS] - Can we change the speaker's voice (not language), without fine-tuning? Any controllable parameters, or seed?

Open QaisarRajput opened this issue 2 years ago • 6 comments

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am using MMS TTS and it's amazing. So far, each language (e.g., eng) has a single speaker's voice. Are there any parameters or random seeds that can be changed to get an entirely different person's voice, without fine-tuning? Even if we can't control emotion or, say, voice pitch, can we at least get a random new, natural-sounding speaker?

Code

What have you tried?

MMS TTS and Hugging Face mms-tts

What's your environment?

  • fairseq Version (e.g., 1.0 or main): main
  • PyTorch Version (e.g., 1.0) - 1.13
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.10

QaisarRajput avatar Jun 11 '23 16:06 QaisarRajput

JFYI: for now, the sampling rate is the only thing that can tune this a little. A higher value gives a deeper (slower) voice, while a lower value gives a thinner (faster) voice.
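The sampling-rate trick can be sketched with the standard library alone. This is a minimal illustration, not MMS code: `make_tone` is a hypothetical stand-in for a synthesized waveform, and `MODEL_RATE = 16000` is an assumption about the model's output rate that should be checked against the checkpoint's config. Note that in this sketch the rate being changed is the one declared in the WAV header, so a higher header rate plays the same samples faster and higher in pitch. Resampling the audio to a higher rate and then playing it at the original rate stretches it instead, giving a slower, deeper result, so the direction depends on where the rate is changed.

```python
import math
import os
import struct
import tempfile
import wave

MODEL_RATE = 16000  # assumed output rate of the TTS model; verify for your checkpoint


def make_tone(freq_hz=220.0, seconds=1.0, rate=MODEL_RATE):
    """Hypothetical stand-in for a synthesized waveform: mono floats in [-1, 1]."""
    n = int(seconds * rate)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def write_wav(path, samples, header_rate):
    """Write samples as 16-bit mono PCM, declaring header_rate as the playback rate."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(header_rate)
        f.writeframes(frames)


def duration_seconds(path):
    """Playback duration implied by the frame count and the header rate."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def pitch_factor(header_rate, model_rate=MODEL_RATE):
    """Factor by which pitch and speed scale when playing at header_rate."""
    return header_rate / model_rate


if __name__ == "__main__":
    samples = make_tone(seconds=1.0)
    out_dir = tempfile.mkdtemp()
    for rate in (12800, MODEL_RATE, 20000):
        path = os.path.join(out_dir, f"voice_{rate}.wav")
        write_wav(path, samples, rate)
        print(rate, duration_seconds(path), pitch_factor(rate))
```

Pitch and speed change together here, which is why this only shifts the voice "a little": the result sounds like the same speaker played back faster or slower, not like a different person.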

QaisarRajput avatar Jun 11 '23 17:06 QaisarRajput

@QaisarRajput For now, controllable generation (e.g., changing gender, emotion, etc.) is not supported. You could consider cascading the MMS TTS model with an off-the-shelf voice cloning model to achieve this.

chevalierNoir avatar Jun 12 '23 03:06 chevalierNoir

> @QaisarRajput For now, controllable generation (e.g., changing gender, emotion, etc.) is not supported. You could consider cascading the MMS TTS model with an off-the-shelf voice cloning model to achieve this.

Could you please name a VITS-based voice cloning repo that can achieve this? I found that directly fine-tuning the Korean model gives very poor results.

CopyNinja1999 avatar Jun 12 '23 12:06 CopyNinja1999

Not sure how this would work, but here is one example for voice conversion.

chevalierNoir avatar Jun 12 '23 14:06 chevalierNoir

I suggest looking into Coqui, which has recipes for using MMS-TTS (FairSeq) alongside voice cloning; I've used it successfully to change the speaker's gender.
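A minimal sketch of that cascade, assuming the Coqui `TTS` package is installed. The model id pattern `tts_models/<iso_code>/fairseq/vits` and the `tts_with_vc_to_file` call follow the usage shown in the Coqui TTS README, so verify both against your installed version. `fairseq_model_name` and `speak_in_target_voice` are hypothetical helper names introduced here for illustration, and `speaker_wav` is a path to a short reference recording of the target voice that you supply.

```python
def fairseq_model_name(iso_code):
    """Build the Coqui model id for an MMS (Fairseq) VITS checkpoint, e.g. 'eng'."""
    return f"tts_models/{iso_code}/fairseq/vits"


def speak_in_target_voice(text, speaker_wav, out_path, iso_code="eng"):
    """Synthesize text with MMS-TTS, then convert it to the voice in speaker_wav."""
    # Imported lazily so this module loads even without the TTS package installed.
    from TTS.api import TTS

    tts = TTS(fairseq_model_name(iso_code))
    # Assumption: tts_with_vc_to_file runs synthesis followed by voice conversion
    # toward the reference speaker, as described in the Coqui README.
    tts.tts_with_vc_to_file(text, speaker_wav=speaker_wav, file_path=out_path)
    return out_path


if __name__ == "__main__":
    # Paths are placeholders; the first call downloads the model weights.
    speak_in_target_voice("Hello from MMS.", "target_speaker.wav", "output.wav")
```

The MMS checkpoint itself is left untouched; only the output audio is converted, so no fine-tuning is involved.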

As for emotion and similar controls, Bark looks promising, but I haven't tested it yet.

khof312 avatar Mar 22 '24 19:03 khof312