Support for multilingual nvidia/parakeet-tdt-0.6b-v3
Finally, Nvidia delivered a multilingual version of their Parakeet TDT model, which can be found here: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
I downloaded the encoder, decoder and joiner from https://github.com/k2-fsa/sherpa-onnx/pull/2500 and modified ASR/tdt_asr.py to match the sha256 checksums. I guess this approach was a bit naive, as I get the error below.
uv run glados start --config configs/assistant_config.yaml
Traceback (most recent call last):
  File "/home/xyz/GLaDOS/.venv/bin/glados", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/xyz/GLaDOS/src/glados/cli.py", line 340, in main
    start(args.config)
  File "/home/xyz/GLaDOS/src/glados/cli.py", line 242, in start
    glados = Glados.from_config(glados_config)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xyz/GLaDOS/src/glados/engine.py", line 286, in from_config
    asr_model = get_audio_transcriber(
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xyz/GLaDOS/src/glados/ASR/__init__.py", line 44, in get_audio_transcriber
    return TDTTranscriber()
           ^^^^^^^^^^^^^^^^
  File "/home/xyz/GLaDOS/src/glados/ASR/tdt_asr.py", line 338, in __init__
    raise ValueError(
ValueError: Joiner output dimension mismatch: expected 1030, got 8198
What needs to be done to get Parakeet v3 working with Glados?
Maybe you need to check the model configs. Ideally, the dimension info should be read from the model.
Still on holiday, should be a quick fix when I'm back.
It will need prompt translations and GLaDOS voices for each supported language, though.
Maybe let's discuss this in the Discord. I'll need help getting this done.
I'm also looking forward to multilingual support and have already made some attempts in that direction. Replacing the ASR model with the new multilingual v3 version should be rather easy. Obviously we will need to train new Piper voices (the most difficult part) and modify the prompts (I don't think we necessarily need to translate them entirely; it should be enough to add something like "always reply in language X"). We should probably have some dicts in the config file specifying the supported languages, the paths to their Piper voice models, etc.
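As a rough sketch of what such a per-language dict might look like (all keys, paths, and the English fallback here are hypothetical, not actual GLaDOS config):

```python
# Hypothetical per-language config; keys, paths, and fallback behaviour
# are illustrative only.
LANGUAGES = {
    "en": {
        "voice_model": "models/TTS/glados_en.onnx",
        "prompt_suffix": "",  # base prompts are already English
    },
    "de": {
        "voice_model": "models/TTS/glados_de.onnx",
        "prompt_suffix": "Antworte immer auf Deutsch.",  # "always reply in German"
    },
}

def voice_for(lang: str) -> str:
    # Fall back to English for unsupported language codes.
    cfg = LANGUAGES.get(lang, LANGUAGES["en"])
    return cfg["voice_model"]
```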
However, I think the current code's main limitation is actually not in the ASR but in the TTS part, where we rely on English-specific phonemizers. Making that part work in other languages would require some refactoring, but it's definitely doable.
It's not too bad actually.
The TTS is phoneme based. I didn't like the GPL licence of espeak-ng, so I created my own neural net phonemiser (autoregressive; I never published the training code, sorry).
However, we can use espeak-ng for all the languages used by the ASR, and train the TTS using the phonemes generated with the specified language.
This is how it's done with Piper, but I find that package way too bloated for my purposes.
I would be fine with adding espeak-ng back in as a package, or using its outputs to generate a new neural phonemiser model per language.
The latter is nice, as it can be MIT licenced, and is also generally faster than espeak-ng when running on a GPU!
Great to hear that, I keep my fingers crossed! :)
Interesting discussion / approach. Honestly, I thought that with support for parakeet-tdt-0.6b-v3 I could do ASR in German, write some small code to translate the prompts to English, and pipe that into GLaDOS for the LLM + TTS part in English.
End-to-end GLaDOS in German is very promising; I checked "Glados Deutsch" on YouTube and the voice sounds mean and sarcastic :-)
I have this working, I'll push the changes soon.
I just have to update the models on the releases page etc.
Great news David. I am looking forward to testing it.