
tts: add speaker file support

Open dm4 opened this issue 11 months ago • 2 comments

  • Added support for TTS speaker files, including a new command-line option --tts-speaker-file to specify the file path.
  • Implemented JSON handling in tts.cpp to load and parse speaker data, enhancing audio generation capabilities.

dm4 avatar Feb 24 '25 08:02 dm4

@edwko Could you please have a look at this PR?

ngxson avatar Feb 26 '25 13:02 ngxson

@ngxson @dm4 Looks good! Just a couple of thoughts: this would handle only v0.2, so it might make sense to do this more dynamically, maybe by adding versioning logic similar to this PR https://github.com/ggml-org/llama.cpp/pull/11287

Maybe get the version from common_get_builtin_chat_template, or I could add more metadata to the speaker files (like a version field) to construct the prompt based on the specific version.

// Something like this:

double get_speaker_version(const json & speaker) {
    if (speaker.contains("version")) {
        return speaker["version"].get<double>();
    }
    // The version could also be read from the model itself:
    // if (common_get_builtin_chat_template(model) == "outetts-0.3") {
    //     return 0.3;
    // }
    return 0.2; // default to v0.2
}

static std::string audio_text_from_speaker(const json & speaker) {
    std::string audio_text = "<|text_start|>";
    double version = get_speaker_version(speaker);

    if (version <= 0.3) {
        std::string separator = (version == 0.3) ? "<|space|>" : "<|text_sep|>";
        for (const auto & word : speaker["words"]) {
            audio_text += word["word"].get<std::string>() + separator;
        }
    } else {
        // future version support could be added here
    }

    return audio_text;
}
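For reference, here is a minimal speaker file consistent with the fields the snippet reads. Only `version`, `words`, and the per-entry `word` key are grounded in the code above; real OuteTTS speaker files carry additional per-word audio token data, which is omitted here:

```json
{
  "version": 0.3,
  "words": [
    { "word": "hello" },
    { "word": "world" }
  ]
}
```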

// static std::string audio_data_from_speaker(json speaker) would also need some adjustments to support different versions.
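One wrinkle in the sketch above is comparing versions as `double`s (`version == 0.3`): parsing and the literal happen to produce the same nearest double here, but integer comparison is more robust. A minimal, hypothetical alternative (`parse_version` is not part of the PR) that parses a "major.minor" string into an integer pair:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper (not in the PR): parse "major.minor" into an
// integer pair so version checks avoid floating-point equality.
static std::pair<int, int> parse_version(const std::string & s) {
    const size_t dot = s.find('.');
    const int major = std::stoi(s.substr(0, dot));
    const int minor = (dot == std::string::npos) ? 0 : std::stoi(s.substr(dot + 1));
    return {major, minor};
}
```

With this, a v0.3 check becomes `parse_version(v) == std::make_pair(0, 3)`, and ordering between versions falls out of the pair's lexicographic comparison.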

edwko avatar Feb 27 '25 09:02 edwko

Hello @ngxson and @edwko, I have already added support for version 0.3. Since common_get_builtin_chat_template() was removed in this commit, I have switched to using llama_model_chat_template() to obtain the model's tokenizer.chat_template metadata.
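As a sketch of that fallback, assuming the chat template string is already in hand (the `"outetts-0.3"` template name follows edwko's commented-out example above; treat the mapping itself as illustrative, not as the PR's exact code):

```cpp
#include <cassert>
#include <cstring>

// Illustrative mapping from the model's tokenizer.chat_template metadata
// string to an OuteTTS interface version; defaults to v0.2 when the
// template is absent or unrecognized.
static double version_from_chat_template(const char * tmpl) {
    if (tmpl != nullptr && std::strcmp(tmpl, "outetts-0.3") == 0) {
        return 0.3;
    }
    return 0.2;
}
```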

dm4 avatar Mar 01 '25 12:03 dm4

@ggerganov merge it please

Koalamana9 avatar Mar 01 '25 22:03 Koalamana9

Can you provide example commands for both v0.2 and v0.3 so I can run some tests?

ggerganov avatar Mar 02 '25 18:03 ggerganov

Example commands for v0.2 and v0.3 are identical:

`llama-tts -m OuteTTS-v2-or-v3 -mv Wavtokenizer -c 4096 --tts-use-guide-tokens --tts-speaker-file en_female_1.json -p "Hello world"`

Speaker files from here: https://github.com/edwko/OuteTTS/tree/main/outetts/version/v1/default_speakers

  • OuteTTS v0.3: https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/tree/main
  • OuteTTS v0.2: https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/tree/main
  • Wavtokenizer: https://huggingface.co/novateur/WavTokenizer-large-speech-75token/tree/main (must be converted to gguf)

--tts-use-guide-tokens is optional; it sometimes gives better results for v0.2.

For prompts longer than 10 words, generation can hit this assert and stop (tested only on CPU; not related to this PR, as the same assert error is present on all previous builds): https://github.com/ggml-org/llama.cpp/blob/14dec0c2f29ae56917907dbf2eed6b19438d0a0e/src/llama.cpp#L8470 Removing this assert allows longer prompts to generate.

Koalamana9 avatar Mar 02 '25 20:03 Koalamana9

Awesome! With OuteTTS v0.3 it even generates all punctuation correctly! Honestly, this is already quite good quality for such a small model. It might be worth updating examples/tts/README and adding the -ub 4096 argument, as it is necessary for correct generation.

I would like to see more PRs merged on top of this; for example, the server example in #11070 works really well without unloading models from memory. This example is worth developing further, as it is of great value to many, especially since OuteTTS 0.3-500M is allowed for commercial use. Although Kokoro TTS is currently considered the highest-quality text-to-speech model at the smallest size, it is a heavily distilled model that cannot emulate voices beyond simply blending existing ones. OuteTTS, by contrast, offers a novel approach to speech synthesis using a plain LLM, where anyone can clone a voice in just a few seconds by passing a simple JSON file.

Koalamana9 avatar Mar 03 '25 16:03 Koalamana9

Feel free to contribute improvements. I think the tts example is in a very hacky state and can be improved in many ways. Ideally, it should become a more general purpose TTS example that would support more TTS models. But we first need the infra for that to be added to libllama, which I am working on atm.

Also, I think figuring out streaming first is crucial before making major changes and additions.

ggerganov avatar Mar 03 '25 16:03 ggerganov