## Thai-language specific metrics

### Sequence classification
`sklearn` implementation
- [ ] accuracy
- [ ] F1
- [ ] precision
- [ ] recall
- [ ] prevalence

### Token classification
`seqeval` at entity level
- [...
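For concreteness, a minimal sketch of the metrics above, assuming `scikit-learn` and `seqeval` are installed; the labels and tags below are made-up examples, and "prevalence" is taken here to mean the positive-class rate in the gold labels:

```python
# Sketch of the sequence- and token-classification metrics listed above.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from seqeval.metrics import classification_report

# Sequence classification: per-example gold and predicted labels (toy data).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("accuracy  ", accuracy_score(y_true, y_pred))
print("f1        ", f1_score(y_true, y_pred))
print("precision ", precision_score(y_true, y_pred))
print("recall    ", recall_score(y_true, y_pred))
# "Prevalence" interpreted as the share of positive gold labels.
print("prevalence", sum(y_true) / len(y_true))

# Token classification: entity-level scores from seqeval on IOB2 tag sequences.
tags_true = [["B-PER", "I-PER", "O", "B-LOC"]]
tags_pred = [["B-PER", "I-PER", "O", "O"]]
print(classification_report(tags_true, tags_pred))
```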
`transformers` is currently the de facto way to train NLP models (maybe speech and image soon?). For the Thai language, we run into some difficulties with the default settings; for example, tokenization...
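To illustrate the tokenization point, a small sketch assuming `pythainlp` and `transformers` are installed; the checkpoint name is just one public WangchanBERTa model used for illustration:

```python
# Thai has no whitespace between words, so whitespace-based defaults break down.
from pythainlp.tokenize import word_tokenize
from transformers import AutoTokenizer

text = "ผมชอบกินข้าวผัด"  # no spaces between the words

# Word-level tokenization with PyThaiNLP's dictionary-based newmm engine.
print(word_tokenize(text, engine="newmm"))

# Subword tokenization from a pretrained SentencePiece-based checkpoint,
# which segments the raw text without relying on whitespace pre-tokenization.
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer.tokenize(text))
```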
- [x] zh_cn to th notebook
- [ ] zh_cn to th script
Both datasets have over 100k questions. Translations will make training sets:
- [ ] 1.1 machine translation
- [ ] 2.0 machine translation
- [ ] 1.1 human translation
- ...
Benchmark WangchanBERTa results (all models; see https://arxiv.org/abs/2101.09635) against [AI4thai APIs](https://aiforthai.in.th/service_bn.php):
- [x] en-th machine translation
- [ ] zh-th machine translation (pending model from AI Builders)
- [ ] word...
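As a hedged sketch of the scoring side of this benchmark (not the API calls themselves), BLEU via `sacrebleu`; the sentences are placeholders, and Thai needs a non-default tokenizer because the default `13a` tokenizer assumes whitespace-delimited words:

```python
# Score system translations against references with corpus-level BLEU.
import sacrebleu

hypotheses = ["ฉันชอบแมว"]        # system outputs, e.g. returned by a translation API
references = [["ฉันชอบแมวมาก"]]    # one list per reference set, aligned with hypotheses

# Character-level tokenization is a common workaround for unsegmented Thai text.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="char")
print(bleu.score)
```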
Source data to pretrain a new WangchanBERTa on the legal domain
* Experiment scripts
* Finetuning scripts / notebooks
* Training MLM / finetuning MLM scripts / notebooks
Refactoring thai2transformers as

```
Huggingface Utility Functions, Scripts and Notebooks for Thai language

thai2transformers provides utility functions, scripts and notebooks to pretrain, finetune, evaluate and infer Huggingface models and datasets...
```
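As an illustration of the kind of call such utilities would wrap (this is plain Hugging Face usage, not the actual thai2transformers interface), an inference sketch with a public WangchanBERTa checkpoint:

```python
# Masked-language-model inference through the standard Hugging Face pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="airesearch/wangchanberta-base-att-spm-uncased")

# "<mask>" is the mask token of this SentencePiece/Camembert-style checkpoint.
for candidate in fill_mask("ผมชอบกิน<mask>"):
    print(candidate["token_str"], candidate["score"])
```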
# thai2fit subword version

## token level
- newmm
- sefr cut
- ssg
- sentencepiece (ask louise)

## datasets for LM
- wikipedia (ask louise)
- prachathai67k
- thaisum...
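For the sentencepiece option above, a hedged sketch of training a subword model on a plain-text corpus (one sentence per line); the file path, model prefix, and vocabulary size are made-up placeholders:

```python
# Train a unigram SentencePiece model and check its segmentation of a Thai sentence.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="thai_corpus.txt",      # hypothetical corpus file, e.g. extracted Wikipedia text
    model_prefix="thai2fit_sp",   # writes thai2fit_sp.model and thai2fit_sp.vocab
    vocab_size=24000,
    character_coverage=0.9995,    # keep rare Thai characters in the vocabulary
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="thai2fit_sp.model")
print(sp.encode("ผมชอบกินข้าวผัด", out_type=str))
```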
Are there any rules/standards for word tokenization in Burmese? If not, we can also use datasets such as this one as a training set to train word tokenization models.