## Thai-language specific metrics

### Sequence classification
`sklearn` implementation
- [ ] accuracy
- [ ] F1
- [ ] precision
- [ ] recall
- [ ] prevalence

### Token classification
`seqeval` at entity level
- [...
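For concreteness, a minimal sketch of the metrics above, assuming `scikit-learn` and `seqeval` are installed; the labels and tags below are made-up examples, and "prevalence" is taken here to mean the positive-class rate in the gold labels:

```python
# Sketch of the sequence- and token-classification metrics listed above.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from seqeval.metrics import classification_report

# Sequence classification: per-example gold and predicted labels (toy data).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("accuracy  ", accuracy_score(y_true, y_pred))
print("f1        ", f1_score(y_true, y_pred))
print("precision ", precision_score(y_true, y_pred))
print("recall    ", recall_score(y_true, y_pred))
# "Prevalence" interpreted as the share of positive gold labels.
print("prevalence", sum(y_true) / len(y_true))

# Token classification: entity-level scores from seqeval on IOB2 tag sequences.
tags_true = [["B-PER", "I-PER", "O", "B-LOC"]]
tags_pred = [["B-PER", "I-PER", "O", "O"]]
print(classification_report(tags_true, tags_pred))
```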
`transformers` is currently the de facto way to train NLP models (maybe speech and image soon?). For the Thai language, we run into some difficulties with the default settings; for example, tokenization...
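To illustrate the tokenization point, a small sketch assuming `pythainlp` and `transformers` are installed; the checkpoint name is just one public WangchanBERTa model used for illustration:

```python
# Thai has no whitespace between words, so whitespace-based defaults break down.
from pythainlp.tokenize import word_tokenize
from transformers import AutoTokenizer

text = "ผมชอบกินข้าวผัด"  # no spaces between the words

# Word-level tokenization with PyThaiNLP's dictionary-based newmm engine.
print(word_tokenize(text, engine="newmm"))

# Subword tokenization from a pretrained SentencePiece-based checkpoint,
# which segments the raw text without relying on whitespace pre-tokenization.
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer.tokenize(text))
```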
- [x] zh_cn to th notebook
- [ ] zh_cn to th script
Both datasets have over 100k questions. Translations will make training sets:
- [ ] 1.1 machine translation
- [ ] 2.0 machine translation
- [ ] 1.1 human translation
- ...
Benchmark WangchanBERTa results (all models; see https://arxiv.org/abs/2101.09635) against [AI4thai APIs](https://aiforthai.in.th/service_bn.php):
- [x] en-th machine translation
- [ ] zh-th machine translation (pending model from AI Builders)
- [ ] word...
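As a hedged sketch of the scoring side of this benchmark (not the API calls themselves), BLEU via `sacrebleu`; the sentences are placeholders, and Thai needs a non-default tokenizer because the default `13a` tokenizer assumes whitespace-delimited words:

```python
# Score system translations against references with corpus-level BLEU.
import sacrebleu

hypotheses = ["ฉันชอบแมว"]        # system outputs, e.g. returned by a translation API
references = [["ฉันชอบแมวมาก"]]    # one list per reference set, aligned with hypotheses

# Character-level tokenization is a common workaround for unsegmented Thai text.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="char")
print(bleu.score)
```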
Source data to pretrain a new WangchanBERTa on the legal domain
* Experiment scripts
* Finetuning scripts / notebooks
* Training MLM / finetuning MLM scripts / notebooks
Refactoring thai2transformers as

```
Huggingface Utility Functions, Scripts and Notebooks for Thai language

thai2transformers provides utility functions, scripts and notebooks to pretrain, finetune, evaluate and infer Huggingface models and datasets...
```
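As an illustration of the kind of call such utilities would wrap (this is plain Hugging Face usage, not the actual thai2transformers interface), an inference sketch with a public WangchanBERTa checkpoint:

```python
# Masked-language-model inference through the standard Hugging Face pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="airesearch/wangchanberta-base-att-spm-uncased")

# "<mask>" is the mask token of this SentencePiece/Camembert-style checkpoint.
for candidate in fill_mask("ผมชอบกิน<mask>"):
    print(candidate["token_str"], candidate["score"])
```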
# thai2fit subword version

## token level
- newmm
- sefr cut
- ssg
- sentencepiece (ask louise)

## datasets for LM
- wikipedia (ask louise)
- prachathai67k
- thaisum...
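For the sentencepiece option above, a hedged sketch of training a subword model on a plain-text corpus (one sentence per line); the file path, model prefix, and vocabulary size are made-up placeholders:

```python
# Train a unigram SentencePiece model and check its segmentation of a Thai sentence.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="thai_corpus.txt",      # hypothetical corpus file, e.g. extracted Wikipedia text
    model_prefix="thai2fit_sp",   # writes thai2fit_sp.model and thai2fit_sp.vocab
    vocab_size=24000,
    character_coverage=0.9995,    # keep rare Thai characters in the vocabulary
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="thai2fit_sp.model")
print(sp.encode("ผมชอบกินข้าวผัด", out_type=str))
```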
Are there any rules/standards for word tokenization in Burmese? If not, we can also use datasets such as this one as a training set to train word tokenization models.