
Adding Malayalam language support to Vosk

Open kavyamanohar opened this issue 4 years ago • 5 comments

I have built a Vosk-compatible model for the Malayalam language. The source code, the model in zip format, links to the training and test data, and the test WER are provided in this repository.

How can I help get this model listed on the Vosk website?

kavyamanohar avatar Sep 24 '21 05:09 kavyamanohar

Thank you Kavya, looks great! I'll try to add this model ASAP.

We also need to integrate Malayalam data from https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus eventually.

nshmyrev avatar Sep 24 '21 09:09 nshmyrev

Thanks @nshmyrev.

The Malayalam dataset in https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus is currently 'unlabelled'. I think we cannot use it unless transcripts are available.

kavyamanohar avatar Sep 26 '21 11:09 kavyamanohar

I reviewed this; it looks like we need to work more on the model. Otherwise the error rates are too high.

nshmyrev avatar Oct 30 '21 22:10 nshmyrev

Thanks for your time and effort in reviewing it, @nshmyrev. The WERs are higher on test datasets where OOV rates are quite high:

Test set 1: 8% WER (1% OOV)
Test set 2: 31% WER (8% OOV)
Test set 3: 85% WER (36% OOV)
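For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how WER and OOV rate are commonly computed. It is not the evaluation script used for this model; the function names `word_error_rate` and `oov_rate` and the placeholder sentences are assumptions made for illustration only.

```python
from typing import Iterable, Set


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(test_tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of test tokens that do not appear in the model vocabulary."""
    tokens = list(test_tokens)
    if not tokens:
        raise ValueError("test_tokens must not be empty")
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


if __name__ == "__main__":
    # Placeholder data, not taken from the Malayalam test sets.
    print(word_error_rate("a b c d", "a x c"))
    print(oov_rate("a b c d".split(), {"a", "b"}))
```

With a word-level lexicon, an out-of-vocabulary word can never be recognized correctly, so the OOV rate puts a floor under the achievable WER, which is consistent with the trend across the three test sets.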

Considering the agglutinative nature of the Malayalam language, do you have any suggestions for improving WER by working on the language modeling aspect? How are good WERs achieved in languages like German, which forms morphologically complex words? Thanks in advance for any pointers.

kavyamanohar avatar Nov 01 '21 05:11 kavyamanohar

@kavyamanohar good work

lonelypx avatar Aug 15 '25 13:08 lonelypx