
Support for OuteTTS 1.0

Open edwko opened this issue 10 months ago • 4 comments

Since v1.0 simplifies processing, this implementation provides full feature support.

Changes and Features

  • JSON Speaker Loading:
    • Added support for the new JSON speaker format, which includes an interface version.
    • OuteTTS 1.0 is supported using interface version 3.
  • Text Chunking for Long Inputs:
    • Enables processing of very long input texts by splitting them.
    • Splitting respects minimum and maximum word boundaries (min = 10, max = 30).
    • Supports multilingual text.
    • Can be disabled via --tts-no-text-chunking (default: enabled).
  • Text Preprocessing & Prompt:
    • While optional, a light cleanup and normalization step is included to improve output quality.
• Added the new prompt handling required for v1.0.
  • Code Organization:
    • Implementation is located in: tts-outetts-v1.cpp.
    • A default speaker is embedded as JSON in the header file default_speaker.h.
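The chunking rule described above (chunks bounded by min = 10 and max = 30 words) can be sketched roughly as follows. This is an illustrative Python sketch only, not the actual llama.cpp code; `chunk_text` is a hypothetical helper, and real multilingual chunking would also respect sentence punctuation rather than splitting purely on whitespace:

```python
def chunk_text(text, min_words=10, max_words=30):
    """Split text into chunks of roughly min_words..max_words words.

    Greedy split on whitespace; shrinks a chunk when the leftover tail
    would otherwise fall below min_words. Texts shorter than min_words
    are returned as a single (short) chunk.
    """
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        take = min(max_words, len(words) - i)
        # Avoid leaving a trailing chunk shorter than min_words.
        remaining = len(words) - i - take
        if 0 < remaining < min_words:
            take -= min_words - remaining
        chunks.append(" ".join(words[i:i + take]))
        i += take
    return chunks
```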

TODO / Help Needed

  • DAC (Descript Audio Codec) Integration:
    • The decoder layers from DAC need to be implemented: descript-audio-codec/dac/model/dac.py
    • Model used:
      weights_24khz_1.5kbps_v1.0.pth
    • DAC is supported by the transformers library and can be converted to safetensors, which may help with the implementation.
      Also, see this PR I submitted to fix a dependency issue in the conversion script for compatibility with newer PyTorch versions:
      transformers PR #36393
    • Requesting assistance from @ngxson and @ggerganov for implementing this part.

Example Commands

By default, generation uses the default speaker and chunked text:

build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "A very very long text"

With text chunking disabled:

build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "Hello, how are you doing?" --tts-no-text-chunking

With custom speaker file:

build/bin/llama-tts-outetts-v1 -m "path/to/model.gguf" -p "A very very long text" --tts-speaker-file "path/to/speaker.json"

edwko avatar Apr 07 '25 09:04 edwko

  • The decoder layers from DAC need to be implemented

FYI, currently we're missing Snake1d which should be implemented via https://github.com/ggml-org/llama.cpp/pull/12487
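For context, Snake1d applies the Snake activation, x + sin²(αx)/α, with a learnable per-channel α. Below is a rough NumPy sketch of the math only; the actual llama.cpp port would need to express this with ggml primitives (the `snake1d` name and shapes here are illustrative assumptions, not the library's API):

```python
import numpy as np

def snake1d(x, alpha):
    """Snake activation: x + sin^2(alpha * x) / alpha.

    x:     array of shape (channels, time)
    alpha: learnable per-channel parameter, shape (channels, 1),
           broadcast across the time axis.
    """
    return x + np.sin(alpha * x) ** 2 / alpha
```

Note the activation is identity-like for small inputs (sin²(αx)/α ≈ αx² near zero), which is part of why it works well for periodic audio signals.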

ngxson avatar Apr 07 '25 11:04 ngxson

Does DAC replace WavTokenizer?

ggerganov avatar Apr 22 '25 12:04 ggerganov

Does DAC replace WavTokenizer?

Yes, since this model is multilingual, DAC is a better fit for reconstructing audio across languages.

edwko avatar Apr 22 '25 17:04 edwko

It would be really great if this got merged. However, I was wondering whether it would also be possible to add multilingual support to llama-server?

Horschig avatar Apr 25 '25 09:04 Horschig

FYI: OuteTTS 1.0 is supported by chatllm.cpp. You can find DAC & SNAC implementations there.

foldl avatar May 19 '25 06:05 foldl