tts: add speaker file support
- Added support for TTS speaker files, including a new command-line option `--tts-speaker-file` to specify the file path.
- Implemented JSON handling in `tts.cpp` to load and parse speaker data, enhancing audio generation capabilities.
@edwko Could you please have a look at this PR?
@ngxson @dm4 Looks good! Just a couple of thoughts: this would handle only v0.2. It might make sense to do this more dynamically, maybe adding versioning logic similar to this PR: https://github.com/ggml-org/llama.cpp/pull/11287
Maybe get the version from `common_get_builtin_chat_template`, or I could add more metadata to the speaker files (like a version field) to construct the prompt based on the specific version.
Something like this:

```cpp
double get_speaker_version(const json & speaker) {
    if (speaker.contains("version")) {
        return speaker["version"].get<double>();
    }
    // Could also get the version from the model itself:
    // if (common_get_builtin_chat_template(model) == "outetts-0.3") {
    //     return 0.3;
    // }
    return 0.2;
}

static std::string audio_text_from_speaker(const json & speaker) {
    std::string audio_text = "<|text_start|>";
    const double version = get_speaker_version(speaker);
    if (version <= 0.3) {
        const std::string separator = (version == 0.3) ? "<|space|>" : "<|text_sep|>";
        for (const auto & word : speaker["words"]) {
            audio_text += word["word"].get<std::string>() + separator;
        }
    } else {
        // Future version support could be added here
    }
    return audio_text;
}

// static std::string audio_data_from_speaker(json speaker) would also need
// some adjustments to support different versions.
```
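To illustrate, here is a rough, self-contained sketch of how `audio_data_from_speaker` could reuse the same version dispatch. Everything here is an assumption for illustration: the `speaker_word` struct stands in for the parsed speaker JSON, the `<|audio_start|>` prefix is a placeholder token, and the per-word audio code tokens that the real function would emit are omitted.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the parsed speaker JSON (the real code uses
// nlohmann::json); the field name is an assumption for illustration.
struct speaker_word {
    std::string word;
};

// Sketch: pick the word separator from the speaker-file version, using the
// same dispatch as audio_text_from_speaker above. The real
// audio_data_from_speaker would also append each word's audio codes.
static std::string audio_data_sketch(const std::vector<speaker_word> & words, double version) {
    std::string audio_data = "<|audio_start|>"; // placeholder prefix token
    const std::string separator = (version == 0.3) ? "<|space|>" : "<|text_sep|>";
    for (const auto & w : words) {
        audio_data += w.word + separator;
    }
    return audio_data;
}
```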
Hello @ngxson and @edwko, I have already added support for version 0.3. Since `common_get_builtin_chat_template()` was removed in this commit, I have switched to using `llama_model_chat_template()` to obtain the model's `tokenizer.chat_template` metadata.
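For reference, the version lookup against the chat template string could be sketched as below. The `"outetts-0.3"` template name and the 0.2 fallback follow the convention discussed earlier in this thread; treat the mapping as an assumption, not the exact PR code.

```cpp
#include <string>

// Sketch: derive the speaker-file version from the chat template string
// (the model's tokenizer.chat_template metadata, e.g. as returned by
// llama_model_chat_template()). Names and fallback are assumptions.
static double version_from_chat_template(const std::string & tmpl) {
    if (tmpl == "outetts-0.3") {
        return 0.3;
    }
    return 0.2; // fall back to v0.2 behaviour for older models
}
```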
@ggerganov merge it please
Can you provide example commands for both v0.2 and v0.3 so I can run some tests?
Example commands for v0.2 and v0.3 are identical:

```shell
llama-tts -m OuteTTS-v2-or-v3 -mv Wavtokenizer -c 4096 --tts-use-guide-tokens --tts-speaker-file en_female_1.json -p "Hello world"
```

- Speaker files: https://github.com/edwko/OuteTTS/tree/main/outetts/version/v1/default_speakers
- OuteTTS v0.3: https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/tree/main
- OuteTTS v0.2: https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/tree/main
- Wavtokenizer: https://huggingface.co/novateur/WavTokenizer-large-speech-75token/tree/main (must be converted to GGUF)

`--tts-use-guide-tokens` is optional; it sometimes gives better results for v0.2.
For prompts longer than 10 words it can hit this assert and stop generation (tested only on CPU; not related to this PR, as the same assert error is present on all previous builds): https://github.com/ggml-org/llama.cpp/blob/14dec0c2f29ae56917907dbf2eed6b19438d0a0e/src/llama.cpp#L8470 Removing this assert allows for longer prompt generation.
Awesome! With OuteTTS v0.3 it even generates all punctuation correctly! To be honest, this is already quite good quality for such a small model. Perhaps it is worth updating examples/tts/README and adding the `-ub 4096` argument, as it is necessary for correct generation.

I would like to see more PRs merged into this; for example, the #11070 server example works really well without unloading models from memory. It is possible to develop this example further, as it is of great value to many, especially considering that OuteTTS 0.3-500M is allowed for commercial use. Although Kokoro TTS is now considered the highest-quality text-to-speech model with the smallest size, it is worth noting that it is a heavily distilled model that cannot emulate voices beyond simply blending existing ones. OuteTTS, meanwhile, offers a novel approach to speech synthesis using a simple LLM, where anyone can generate a copy of a voice in just a few seconds by passing a simple JSON file.
Feel free to contribute improvements. I think the tts example is in a very hacky state and can be improved in many ways. Ideally, it should become a more general purpose TTS example that would support more TTS models. But we first need the infra for that to be added to libllama, which I am working on atm.
Also, I think figuring out streaming first is crucial before making major changes and additions.