vosk-api Adding Malayalam language support to Vosk

I have built a Vosk-compatible model for the Malayalam language. The source code, the model in zip format, links to the training and test data, and the test WER are provided in this repository.

How can I help get this model listed on the Vosk website?

Sep 24 '21 05:09 kavyamanohar

Thank you Kavya, looks great! I'll try to add this model ASAP.

We also need to integrate Malayalam data from https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus eventually.

Sep 24 '21 09:09 nshmyrev

Thanks @nshmyrev.

The Malayalam dataset in https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus is currently 'unlabelled'. I think we cannot use it unless transcripts are available.

Sep 26 '21 11:09 kavyamanohar

I reviewed this. It looks like we need to work more on the model; the error rates are currently too high.

Oct 30 '21 22:10 nshmyrev

Thanks for your time and effort in reviewing it, @nshmyrev. The WERs are higher on the test sets where the OOV rates are quite high:

Test set 1 - 8% WER (1% OOV)
Test set 2 - 31% WER (8% OOV)
Test set 3 - 85% WER (36% OOV)
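
For reference, WER and OOV rate are commonly computed along the lines sketched below. This is only a minimal illustration in Python, not the scoring script behind the figures above; the function names (`edit_distance`, `wer`, `oov_rate`) and the plain whitespace tokenization are assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs, hyps):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def oov_rate(refs, vocab):
    """Fraction of reference tokens that are missing from the model vocabulary."""
    tokens = [w for r in refs for w in r.split()]
    return sum(w not in vocab for w in tokens) / len(tokens)
```

Every OOV token in the reference is a guaranteed error for a word-level model, since the decoder cannot output a word outside its lexicon, so the OOV rate is a rough lower bound on the WER of each test set.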

Considering the agglutinative nature of the Malayalam language, do you have any suggestions for improving WER by working on the language modeling side? How is good WER achieved for languages like German, which forms morphologically complex words? Thanks in advance for any pointers.

Nov 01 '21 05:11 kavyamanohar

@kavyamanohar good work

Aug 15 '25 13:08 lonelypx