OpenVoice Can we convert speech?

Hello, is it possible to convert audio recordings of speech to a new voice? This is for a low-resource language, so TTS will not work.

Also will phonemes like a strong rolling "R" be reproduced by the model? Are there any ways to improve the appropriate phoneme generation?

Thanks!

Oct 04 '25 05:10 ChristopherBrown1

Hi @ChristopherBrown1! I'm Dosu and I’m helping the OpenVoice team.

Yes, OpenVoice can convert audio recordings of speech to a new voice. The workflow first transcribes the source audio to text (using Whisper ASR or similar), then synthesizes speech with a base speaker TTS model, and finally applies the tone color converter to match the target timbre. This works for multiple languages, including low-resource ones, as long as the transcription and TTS components support the language. Some users report difficulty with language and voice matching for non-English languages.

Phoneme handling is based on converting text to phoneme sequences (such as IPA), and the model includes mechanisms for phoneme-level processing. Complex phonemes, such as a strong rolling "R", can be reproduced if they are represented in the input and supported by the base speaker model, but accuracy depends on the quality of the phoneme mapping and on the underlying TTS model.

To improve phoneme generation accuracy, especially for challenging sounds or low-resource languages, use high-quality, clean reference audio, check that the phoneme mapping is correct, and consider customizing phoneme definitions for your target language or sound. Manual correction or augmentation of the phoneme mapping may help. You can also experiment with the tau parameter in the ToneColorConverter to influence conversion strength and intelligibility, which may affect voice quality and phoneme clarity.
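
As a rough illustration of the manual phoneme correction mentioned above, the sketch below rewrites a phoneme sequence before synthesis so that approximant or tapped "r" symbols become the trilled IPA "r". The `R_OVERRIDES` table and `apply_phoneme_overrides` helper are hypothetical names invented for this example and are not part of OpenVoice; where such a pass would hook into a given TTS front end is an assumption you would need to check.

```python
# Hypothetical override table (not an OpenVoice API): map the IPA
# approximant and tap symbols to the alveolar trill "r".
R_OVERRIDES = {"ɹ": "r", "ɾ": "r"}


def apply_phoneme_overrides(phonemes, overrides):
    """Return a new phoneme list with every overridden symbol replaced."""
    return [overrides.get(p, p) for p in phonemes]


if __name__ == "__main__":
    # Example sequence; symbols without an override pass through unchanged.
    print(apply_phoneme_overrides(["p", "e", "ɾ", "o"], R_OVERRIDES))
```

Whether the trill is actually pronounced still depends on the base speaker model having learned that phoneme.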

For best results, use clean, mono, 22.05 kHz audio inputs for both neutral and reference speech. Persistent issues with voice conversion (like noise or silence) are usually due to asset or environment mismatches, not code or configuration errors.
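
A minimal sketch of preparing such inputs with only the Python standard library is shown below. It assumes 16-bit PCM WAV files, and the function names are made up for this example. The resampler uses plain linear interpolation, which is enough to illustrate the idea but is lower quality than a dedicated resampling tool.

```python
import array
import sys
import wave

TARGET_RATE = 22050  # 22.05 kHz, as recommended above


def downmix_to_mono(samples, channels):
    """Average interleaved channels into a single mono channel."""
    if channels == 1:
        return list(samples)
    return [
        sum(samples[i:i + channels]) // channels
        for i in range(0, len(samples) - channels + 1, channels)
    ]


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (a simple, low-quality method)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, int(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(int(round(samples[left] * (1 - frac) + samples[right] * frac)))
    return out


def prepare_wav(src_path, dst_path, dst_rate=TARGET_RATE):
    """Write a mono, dst_rate copy of a 16-bit PCM WAV file."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = src.getnchannels()
        rate = src.getframerate()
        pcm = array.array("h")
        pcm.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    out = array.array("h", resample_linear(downmix_to_mono(pcm, channels), rate, dst_rate))
    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(dst_rate)
        dst.writeframes(out.tobytes())
```

For example, `prepare_wav("reference.wav", "reference_22k.wav")` would write a mono 22.05 kHz copy of a hypothetical `reference.wav`; the file names are placeholders.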

Oct 04 '25 05:10 dosubot[bot]

The language I want to work with is not supported by any TTS model, so is there a way to convert directly from speech?

Oct 04 '25 16:10 ChristopherBrown1

OpenVoice does not support direct speech-to-speech conversion for languages that lack a TTS model. The pipeline always requires converting speech to text (using ASR like Whisper), then generating new speech from text with a base speaker TTS model, and finally applying timbre transfer. If there is no TTS model for your language, OpenVoice cannot synthesize output in that language, even if it can extract voice characteristics from reference audio. This is a core limitation confirmed by the codebase and documentation.

To support a new language, you would need to train or obtain a base speaker TTS model for that language. There is no alternative speech-to-speech path in OpenVoice for unsupported languages.

Oct 04 '25 16:10 dosubot[bot]