
Generated audio swapping accents over time

Open spike4379 opened this issue 2 years ago • 3 comments

Love the TTS, this is amazing. However, I thought I would bring up that no matter which clip I use, whether it is WAV or MP3, and even when the clip itself is perfect, the generated speech always drifts between an American and a British accent. Is there a known way to label the audio sample as American or British so the model knows which one to stick to?

If this topic needs to go elsewhere please let me know.

spike4379 avatar Nov 21 '23 01:11 spike4379

Unfortunately, this is an issue with the underlying model. Try shortening your input audio to 5-9 seconds where the accent is very noticeable; that might help.
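As a rough illustration of cutting a sample down to that length, here is a minimal sketch using only Python's standard `wave` module. The function name `trim_wav` and the file paths are hypothetical, and it assumes the clip is already an uncompressed PCM WAV file (it will not read MP3).

```python
import wave


def trim_wav(src_path, dst_path, start_s, duration_s):
    """Copy duration_s seconds of audio, starting at start_s, into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        start_frame = int(start_s * rate)
        n_frames = int(duration_s * rate)
        # Clamp the start position so it never points past the end of the file.
        src.setpos(min(start_frame, src.getnframes()))
        frames = src.readframes(n_frames)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate; the frame count in the
        # header is corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

For example, `trim_wav("voice_full.wav", "voice_8s.wav", start_s=2.0, duration_s=8.0)` would keep seconds 2 to 10 of the source clip. Picking the window where the accent is clearest is still a manual listening job.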

kanttouchthis avatar Nov 21 '23 05:11 kanttouchthis

I had to get a pretty good quality, clean sample of someone for it to sound and remain sounding like them. There is a very occasional slip in the audio, but 95%+ sounds good. I also don't think longer clips necessarily give better results (though I've not done much testing on that and kept my samples around the 8-9 second mark).

Other things that may help:

  • Make sure the audio is downsampled to mono, 24000 Hz, 16-bit.
  • If you need to do any audio cleaning, do it before you compress it down to the above settings.
  • Ensure the clip you use doesn't have background noise or music, e.g. lots of movies have quiet music playing while the actors are talking. Bad-quality audio will have hiss that needs clearing up. The AI will pick this up, even if we don't.
  • Try to make your clip one of nice, flowing speech, like the included example.wav file.
  • Make sure the clip doesn't start or end with breathy sounds (breathing in/out etc.).
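The first point (mono, 24000 Hz, 16-bit) can be sketched in pure Python as below. This is only an illustration under stated assumptions: the input is a 16-bit PCM WAV, the function name `to_mono_24k_16bit` is made up for this example, and the resampling is naive linear interpolation with no anti-aliasing filter. A dedicated audio editor or resampling tool will generally give cleaner results.

```python
import array
import sys
import wave

TARGET_RATE = 24000  # Hz, the rate suggested in this thread


def to_mono_24k_16bit(src_path, dst_path, target_rate=TARGET_RATE):
    """Rewrite a 16-bit PCM WAV as mono, 16-bit, target_rate Hz (naive resampling)."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM input")
        channels = src.getnchannels()
        src_rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Downmix: average the channels of each frame.
    mono = [sum(samples[i:i + channels]) / channels
            for i in range(0, len(samples) - channels + 1, channels)]

    # Resample by linear interpolation between neighbouring samples.
    n_out = int(len(mono) * target_rate / src_rate)
    out = array.array("h")
    for j in range(n_out):
        pos = j * src_rate / target_rate
        i = int(pos)
        frac = pos - i
        nxt = mono[i + 1] if i + 1 < len(mono) else mono[i]
        out.append(int(round(mono[i] * (1 - frac) + nxt * frac)))

    if sys.byteorder == "big":
        out.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)  # 2 bytes = 16-bit
        dst.setframerate(target_rate)
        dst.writeframes(out.tobytes())
```

As the second point says, any noise removal should happen on the original file first, with this conversion as the last step.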

I've not tested this yet, but I also wonder whether a clip that was itself AI-generated will come out sounding right compared with genuine audio of a real person. There could be a law of diminishing returns at work, causing degradation in quality.

My experience so far is that the better the sample, the more the output sounds like the original person: their accent, nuances, etc.

EDIT - Changed the suggested Hz

erew123 avatar Nov 21 '23 11:11 erew123

The model samples at 24 kHz mono, so that's probably what you want your source audio to be.
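To check whether an existing sample already matches that format, something like the following sketch could be used. The helper names `describe_wav` and `matches_suggested_format` are hypothetical, and the standard `wave` module only reads uncompressed PCM WAV files, so MP3 samples would need converting first.

```python
import wave


def describe_wav(path):
    """Return the basic format properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_suggested_format(info):
    """True if the clip is mono, 24000 Hz, 16-bit, as suggested in this thread."""
    return (info["channels"] == 1
            and info["sample_rate_hz"] == 24000
            and info["bits_per_sample"] == 16)
```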

kanttouchthis avatar Nov 21 '23 12:11 kanttouchthis