# Support for OuteTTS 1.0
Since v1.0 has simplified processing, this implementation provides full feature support.
## Changes and Features
- **JSON Speaker Loading:**
  - Added support for the new JSON speaker format, which includes an interface version.
  - OuteTTS 1.0 is supported using interface version 3 (a loading sketch follows this list).
- **Text Chunking for Long Inputs:**
  - Enables processing of very long input texts by splitting them into chunks.
  - Splitting respects minimum and maximum word boundaries (min = 10, max = 30 words per chunk); see the chunking sketch after this list.
  - Supports multilingual text.
  - Can be disabled via `--tts-no-text-chunking` (default: enabled).
- **Text Preprocessing & Prompt:**
  - While optional, a light cleanup and normalization step is included to improve output quality (illustrated after this list).
  - Added the new required prompt handling for v1.0.
- **Code Organization:**
  - The implementation is located in `tts-outetts-v1.cpp`.
  - A default speaker is included as JSON in the header file `default_speaker.h`.
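For illustration, a minimal sketch of the version-gated speaker loading; the field name `interface_version` and the use of `nlohmann::json` are assumptions for the sketch, not the exact code in `tts-outetts-v1.cpp`:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

#include <nlohmann/json.hpp>

// Hypothetical sketch: load a speaker JSON file and check its interface
// version. The field name "interface_version" is assumed for illustration.
static nlohmann::json load_speaker_json(const std::string & path) {
    std::ifstream f(path);
    if (!f) {
        throw std::runtime_error("cannot open speaker file: " + path);
    }
    nlohmann::json speaker = nlohmann::json::parse(f);
    // OuteTTS 1.0 speakers use interface version 3
    if (speaker.value("interface_version", 0) != 3) {
        throw std::runtime_error("unsupported speaker interface version");
    }
    return speaker;
}
```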
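The chunking itself is conceptually simple. Here is a minimal sketch, assuming plain whitespace word boundaries; the real splitter also has to handle multilingual text and sentence punctuation:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Minimal sketch: split text into chunks of roughly min_words..max_words,
// breaking only at whitespace word boundaries. A too-short tail is folded
// into the previous chunk so no chunk falls below the minimum.
static std::vector<std::string> chunk_text(const std::string & text,
                                           size_t min_words = 10,
                                           size_t max_words = 30) {
    std::istringstream iss(text);
    std::vector<std::string> chunks;
    std::string word, chunk;
    size_t n = 0;
    while (iss >> word) {
        chunk += chunk.empty() ? word : " " + word;
        if (++n >= max_words) {
            chunks.push_back(chunk);
            chunk.clear();
            n = 0;
        }
    }
    if (!chunk.empty()) {
        if (n < min_words && !chunks.empty()) {
            chunks.back() += " " + chunk; // merge short tail into previous chunk
        } else {
            chunks.push_back(chunk);
        }
    }
    return chunks;
}
```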
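And a sketch of the kind of light cleanup meant above, assuming whitespace collapsing and trimming; the actual normalization step may do more (e.g. punctuation handling):

```cpp
#include <cctype>
#include <string>

// Illustrative sketch of light text cleanup: collapse runs of whitespace
// into single spaces and trim both ends. Not the exact normalization used.
static std::string clean_text(const std::string & text) {
    std::string out;
    bool in_space = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            in_space = !out.empty(); // leading whitespace is dropped entirely
        } else {
            if (in_space) {
                out += ' ';
                in_space = false;
            }
            out += (char) c;
        }
    }
    return out; // trailing whitespace is never flushed, so it is trimmed
}
```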
## TODO / Help Needed
- **DAC (Descript Audio Codec) Integration:**
  - The decoder layers from DAC need to be implemented: `descript-audio-codec/dac/model/dac.py`
  - Model used: `weights_24khz_1.5kbps_v1.0.pth`
  - DAC is supported by the `transformers` library and can be converted to `safetensors`, which might help with the implementation. Also see the PR I submitted to fix a dependency issue in the conversion script for compatibility with newer PyTorch versions: transformers PR #36393
  - Requesting assistance from @ngxson and @ggerganov with implementing this part.
## Example Commands
Default generation automatically uses the default speaker and chunked text:

```sh
build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "A very very long text"
```

Disable text chunking:

```sh
build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "Hello, how are you doing?" --tts-no-text-chunking
```

With a custom speaker file:

```sh
build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "A very very long text" --tts-speaker-file "path/to/speaker.json"
```
---

> The decoder layers from DAC need to be implemented

FYI, we're currently missing Snake1d, which should be implemented via https://github.com/ggml-org/llama.cpp/pull/12487
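For context, DAC's Snake activation is snake(x) = x + sin^2(alpha * x) / alpha with a learned per-channel alpha. A rough ggml sketch, assuming ggml's `ggml_sin` op and broadcasting of the per-channel `alpha` tensor; this is not the implementation from that PR:

```cpp
#include "ggml.h"

// Rough sketch of DAC's Snake1d activation in ggml:
//   snake(x) = x + (1/alpha) * sin^2(alpha * x)
// x:     [T, C, N] activations
// alpha: [1, C, 1] learned per-channel parameter, broadcast over time.
// DAC adds a small epsilon to alpha before dividing; omitted here.
static struct ggml_tensor * snake_1d(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,
        struct ggml_tensor  * alpha) {
    struct ggml_tensor * ax = ggml_mul(ctx, x, alpha);          // alpha * x
    struct ggml_tensor * s  = ggml_sqr(ctx, ggml_sin(ctx, ax)); // sin^2(alpha * x)
    return ggml_add(ctx, x, ggml_div(ctx, s, alpha));           // x + sin^2(..)/alpha
}
```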
---

Does DAC replace WavTokenizer?
---

> Does DAC replace WavTokenizer?

Yes, since this model is multilingual, DAC is a better fit for reconstructing audio across languages.
---

It would be really great if this got merged. However, I was wondering whether it'd also be possible to add multilingual support to llama-server?
---

FYI: OuteTTS 1.0 is supported by chatllm.cpp. You can find DAC & SNAC implementations there.