Support of multilingual nvidia / parakeet-tdt-0.6b-v3

Open eqikkwkp25-cyber opened this issue 8 months ago • 9 comments

NVIDIA has finally released a multilingual version of its Parakeet TDT model, which can be found here: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3

See https://github.com/k2-fsa/sherpa-onnx/pull/2500

eqikkwkp25-cyber avatar Aug 18 '25 05:08 eqikkwkp25-cyber

I downloaded the encoder, decoder and joiner from https://github.com/k2-fsa/sherpa-onnx/pull/2500 and modified ASR/tdt_asr.py to match their sha256 checksums. I guess this approach was somewhat stupid, as I get the error below.

uv run glados start --config configs/assistant_config.yaml
Traceback (most recent call last):
  File "/home/xyz/GLaDOS/.venv/bin/glados", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/xyz/GLaDOS/src/glados/cli.py", line 340, in main
    start(args.config)
  File "/home/xyz/GLaDOS/src/glados/cli.py", line 242, in start
    glados = Glados.from_config(glados_config)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xyz/GLaDOS/src/glados/engine.py", line 286, in from_config
    asr_model = get_audio_transcriber(
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xyz/GLaDOS/src/glados/ASR/__init__.py", line 44, in get_audio_transcriber
    return TDTTranscriber()
           ^^^^^^^^^^^^^^^^
  File "/home/xyz/GLaDOS/src/glados/ASR/tdt_asr.py", line 338, in __init__
    raise ValueError(
ValueError: Joiner output dimension mismatch: expected 1030, got 8198

What needs to be done to get Parakeet v3 working with Glados?

eqikkwkp25-cyber avatar Aug 22 '25 15:08 eqikkwkp25-cyber

Maybe you need to check the model configs. Ideally, the dimension info should be read from the model.
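
A minimal sketch of where the two numbers in the error plausibly come from. It assumes the TDT joiner emits one logit per vocabulary token, one for the blank symbol, and one per duration bin, and it assumes 5 duration bins with 1024 tokens for v2 and 8192 tokens for v3. `TdtDims` and `infer_vocab_size` are hypothetical names for illustration, not part of GLaDOS or sherpa-onnx.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TdtDims:
    """Sizes that determine the joiner output width (assumed layout)."""

    vocab_size: int      # number of real tokens, excluding blank
    num_durations: int   # number of TDT duration bins

    @property
    def joiner_output_dim(self) -> int:
        # tokens + 1 blank + duration logits
        return self.vocab_size + 1 + self.num_durations


def infer_vocab_size(joiner_output_dim: int, num_durations: int = 5) -> int:
    """Derive the token count from the joiner width instead of hard-coding it."""
    vocab_size = joiner_output_dim - 1 - num_durations
    if vocab_size <= 0:
        raise ValueError(
            f"Joiner output dim {joiner_output_dim} is too small "
            f"for {num_durations} duration bins plus blank"
        )
    return vocab_size


v2 = TdtDims(vocab_size=1024, num_durations=5)  # assumed v2 sizes
v3 = TdtDims(vocab_size=8192, num_durations=5)  # assumed v3 sizes
print(v2.joiner_output_dim, v3.joiner_output_dim)
```

Under these assumptions the two widths match the "expected 1030, got 8198" in the traceback, which suggests the check is hard-coded for the v2 vocabulary. With onnxruntime, the actual width could presumably be read from the joiner session's `get_outputs()[0].shape` and passed through `infer_vocab_size`, provided the last dimension is not symbolic.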

csukuangfj avatar Aug 22 '25 15:08 csukuangfj

Still on holiday, should be a quick fix when I'm back.

It will need Prompt translations, and GLaDOS voices for each supported language though.

Maybe let's discuss this in the Discord. I'll need help getting this done.

dnhkng avatar Aug 22 '25 16:08 dnhkng

I'm also looking forward to multilingual support and have already made some attempts in that direction. Replacing the ASR model with the new multilingual v3 version should be rather easy. Obviously we will also need to train a new Piper voice (the most difficult part) and modify the prompts (I don't think we necessarily need to translate them entirely; it should be enough to add something like "always reply in X language"). We should probably have some dicts in the config file specifying the supported languages, the paths to their Piper voice models, etc.

However, I think the current code's main limitation is actually not in the ASR but in the TTS part, where we rely on some English-specific phonemizers etc. It would require some refactoring to make that part work in other languages, but it's definitely doable.
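
A rough sketch of the per-language config dicts and the "always reply in X language" prompt suffix suggested above. Everything here is hypothetical: the `LanguageConfig` class, the `LANGUAGES` table, the `build_system_prompt` helper, and the voice model paths are placeholders for illustration, not GLaDOS's actual config schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Per-language settings (hypothetical schema)."""

    name: str              # language name used in the prompt suffix
    voice_model_path: str  # placeholder path to a voice model


# Hypothetical table of supported languages; paths are placeholders.
LANGUAGES: dict[str, LanguageConfig] = {
    "en": LanguageConfig("English", "models/TTS/voice_en.onnx"),
    "de": LanguageConfig("German", "models/TTS/voice_de.onnx"),
}


def build_system_prompt(base_prompt: str, lang_code: str) -> str:
    """Append a reply-language instruction instead of translating the prompt."""
    if lang_code not in LANGUAGES:
        supported = ", ".join(sorted(LANGUAGES))
        raise ValueError(f"Unsupported language {lang_code!r}; supported: {supported}")
    if lang_code == "en":
        # The base prompt is assumed to be written in English already.
        return base_prompt
    lang = LANGUAGES[lang_code]
    return f"{base_prompt.rstrip()} Always reply in {lang.name}."


print(build_system_prompt("You are GLaDOS.", "de"))
```

Keeping the language table separate from the prompt text means adding a language only requires a new entry and a trained voice, not a translated prompt.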

jhajducz avatar Aug 24 '25 08:08 jhajducz

It's not too bad actually.

The TTS is phoneme based. I didn't like the GPL licence of espeak-ng, so I created my own neural net phonemiser (autoregressive, I never published the training code, sorry).

However, we can use espeak-ng for all the languages used by the ASR, and train the TTS using the phonemes generated with the specified language.

This is how it's done with Piper, but I find that package way too bloated for my purposes.

I would be fine with adding espeak-ng back in as a package, or using its outputs to train a new neural phonemiser model for each language.

The latter is nice, as it can be MIT licenced, and is also generally faster than espeak-ng when running on GPU!

dnhkng avatar Aug 24 '25 10:08 dnhkng

Great to hear that, I keep my fingers crossed! :)

jhajducz avatar Aug 25 '25 02:08 jhajducz

Interesting discussion / approach. Honestly, I thought that with support for parakeet-tdt-0.6b-v3 I could do ASR in German, write a small piece of code to translate the prompts to English, and pipe the result to Glados for the LLM + TTS part in English.

End-to-end Glados in German is very promising. I checked "Glados Deutsch" on YouTube and the voice sounds mean and sarcastic :-)

eqikkwkp25-cyber avatar Aug 25 '25 09:08 eqikkwkp25-cyber

I have this working, I'll push the changes soon.

I just have to update the models on the releases page etc.

dnhkng avatar Nov 01 '25 07:11 dnhkng

> I just have to update the models on the releases page etc.

Great news David. I am looking forward to testing it.

eqikkwkp25-cyber avatar Nov 01 '25 09:11 eqikkwkp25-cyber