# Support for OuteTTS 1.0
Since v1.0 has simplified processing, this implementation provides full feature support.
## Changes and Features
- **JSON Speaker Loading:**
  - Added support for the new JSON speaker format, which includes an interface version.
  - OuteTTS 1.0 is supported using interface version 3 (a loading sketch follows this list).
- **Text Chunking for Long Inputs:**
  - Enables processing of very long input texts by splitting them into chunks.
  - Splitting respects minimum and maximum word boundaries (min = 10, max = 30 words per chunk); see the chunking sketch after this list.
  - Supports multilingual text.
  - Can be disabled via `--tts-no-text-chunking` (default: enabled).
- **Text Preprocessing & Prompt:**
  - While optional, a light cleanup and normalization step is included to improve output quality (illustrated after this list).
  - Added the new required prompt handling for v1.0.
- **Code Organization:**
  - The implementation is located in `tts-outetts-v1.cpp`.
  - A default speaker is included as JSON in the header file `default_speaker.h`.
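For illustration, a minimal sketch of the version-gated speaker loading; the field name `interface_version` and the use of `nlohmann::json` are assumptions for the sketch, not the exact code in `tts-outetts-v1.cpp`:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

#include <nlohmann/json.hpp>

// Hypothetical sketch: load a speaker JSON file and check its interface
// version. The field name "interface_version" is assumed for illustration.
static nlohmann::json load_speaker_json(const std::string & path) {
    std::ifstream f(path);
    if (!f) {
        throw std::runtime_error("cannot open speaker file: " + path);
    }
    nlohmann::json speaker = nlohmann::json::parse(f);
    // OuteTTS 1.0 speakers use interface version 3
    if (speaker.value("interface_version", 0) != 3) {
        throw std::runtime_error("unsupported speaker interface version");
    }
    return speaker;
}
```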
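The chunking itself is conceptually simple. Here is a minimal sketch, assuming plain whitespace word boundaries; the real splitter also has to handle multilingual text and sentence punctuation:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Minimal sketch: split text into chunks of roughly min_words..max_words,
// breaking only at whitespace word boundaries. A too-short tail is folded
// into the previous chunk so no chunk falls below the minimum.
static std::vector<std::string> chunk_text(const std::string & text,
                                           size_t min_words = 10,
                                           size_t max_words = 30) {
    std::istringstream iss(text);
    std::vector<std::string> chunks;
    std::string word, chunk;
    size_t n = 0;
    while (iss >> word) {
        chunk += chunk.empty() ? word : " " + word;
        if (++n >= max_words) {
            chunks.push_back(chunk);
            chunk.clear();
            n = 0;
        }
    }
    if (!chunk.empty()) {
        if (n < min_words && !chunks.empty()) {
            chunks.back() += " " + chunk; // merge short tail into previous chunk
        } else {
            chunks.push_back(chunk);
        }
    }
    return chunks;
}
```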
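And a sketch of the kind of light cleanup meant above, assuming whitespace collapsing and trimming; the actual normalization step may do more (e.g. punctuation handling):

```cpp
#include <cctype>
#include <string>

// Illustrative sketch of light text cleanup: collapse runs of whitespace
// into single spaces and trim both ends. Not the exact normalization used.
static std::string clean_text(const std::string & text) {
    std::string out;
    bool in_space = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            in_space = !out.empty(); // leading whitespace is dropped entirely
        } else {
            if (in_space) {
                out += ' ';
                in_space = false;
            }
            out += (char) c;
        }
    }
    return out; // trailing whitespace is never flushed, so it is trimmed
}
```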
## TODO / Help Needed
- **DAC (Descript Audio Codec) Integration:**
  - The decoder layers from DAC need to be implemented: `descript-audio-codec/dac/model/dac.py`
  - Model used: `weights_24khz_1.5kbps_v1.0.pth`
  - DAC is supported by the `transformers` library and can be converted to `safetensors`, which might help with the implementation. Also see the PR I submitted to fix a dependency issue in the conversion script for compatibility with newer PyTorch versions: transformers PR #36393
  - Requesting assistance from @ngxson and @ggerganov with implementing this part.
## Example Commands
Default generation automatically uses the default speaker and chunked text:

```sh
build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "A very very long text"
```

Disable text chunking:

```sh
build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "Hello, how are you doing?" --tts-no-text-chunking
```

With a custom speaker file:

```sh
build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "A very very long text" --tts-speaker-file "path/to/speaker.json"
```
---

> The decoder layers from DAC need to be implemented

FYI, we're currently missing Snake1d, which should be implemented via https://github.com/ggml-org/llama.cpp/pull/12487
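For context, DAC's Snake activation is snake(x) = x + sin^2(alpha * x) / alpha with a learned per-channel alpha. A rough ggml sketch, assuming ggml's `ggml_sin` op and broadcasting of the per-channel `alpha` tensor; this is not the implementation from that PR:

```cpp
#include "ggml.h"

// Rough sketch of DAC's Snake1d activation in ggml:
//   snake(x) = x + (1/alpha) * sin^2(alpha * x)
// x:     [T, C, N] activations
// alpha: [1, C, 1] learned per-channel parameter, broadcast over time.
// DAC adds a small epsilon to alpha before dividing; omitted here.
static struct ggml_tensor * snake_1d(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct ggml_tensor  * alpha) {
    struct ggml_tensor * ax = ggml_mul(ctx, x, alpha);          // alpha * x
    struct ggml_tensor * s  = ggml_sqr(ctx, ggml_sin(ctx, ax)); // sin^2(alpha * x)
    return ggml_add(ctx, x, ggml_div(ctx, s, alpha));           // x + sin^2(..)/alpha
}
```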
---

Does DAC replace WavTokenizer?
---

> Does DAC replace WavTokenizer?

Yes, since this model is multilingual, DAC is a better fit for reconstructing audio across languages.
---

It would be really great if this got merged. However, I was wondering whether it'd also be possible to add multilingual support to llama-server?
---

FYI: OuteTTS 1.0 is supported by chatllm.cpp. You can find DAC & SNAC implementations there.