microbert
A tiny BERT for low-resource monolingual models
⚠️ NOTE: If you want to train a MicroBERT for your language, please see lgessler/microbert2.
Introduction
MicroBERT is a BERT variant intended for training monolingual models for low-resource languages. It reduces the model's parameter count and supplements the usual masked language modeling objective with multitask learning on part-of-speech tagging and dependency parsing.
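As a rough illustration of the multitask idea (not the authors' implementation, and with the dependency parsing head omitted for brevity), the following toy PyTorch sketch shares one small encoder between a masked language modeling head and a part-of-speech tagging head and simply sums the two losses; all sizes and names are made up.

import torch
import torch.nn as nn

class TinyMultitaskModel(nn.Module):
    """Toy shared encoder with an MLM head and an XPOS tagging head."""
    def __init__(self, vocab_size=1000, n_xpos=20, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(hidden, vocab_size)   # masked language modeling
        self.xpos_head = nn.Linear(hidden, n_xpos)      # part-of-speech tagging

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), self.xpos_head(h)

model = TinyMultitaskModel()
tokens = torch.randint(0, 1000, (2, 16))        # fake batch of token ids
mlm_targets = torch.randint(0, 1000, (2, 16))   # fake MLM labels
xpos_targets = torch.randint(0, 20, (2, 16))    # fake XPOS labels
mlm_logits, xpos_logits = model(tokens)
loss = (nn.functional.cross_entropy(mlm_logits.flatten(0, 1), mlm_targets.flatten())
        + nn.functional.cross_entropy(xpos_logits.flatten(0, 1), xpos_targets.flatten()))
loss.backward()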
For more information, please see our paper. If you'd like to cite our work, please use the following citation:
@inproceedings{gessler-zeldes-2022-microbert,
    title = "{M}icro{BERT}: Effective Training of Low-resource Monolingual {BERT}s through Parameter Reduction and Multitask Learning",
    author = "Gessler, Luke and Zeldes, Amir",
    booktitle = "Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.mrl-1.9",
    pages = "86--99",
}
Pretrained Models
The following pretrained models are available.
Note that each model's suffix indicates the tasks that were used to pretrain it: masked language modeling (m),
XPOS tagging (x), or dependency parsing (p).
- Ancient Greek: microbert-ancient-greek-m, microbert-ancient-greek-mx, microbert-ancient-greek-mxp
- Coptic: microbert-coptic-m, microbert-coptic-mx, microbert-coptic-mxp
- Indonesian: microbert-indonesian-m, microbert-indonesian-mx, microbert-indonesian-mxp
- Maltese: microbert-maltese-m, microbert-maltese-mx, microbert-maltese-mxp
- Uyghur: microbert-uyghur-m, microbert-uyghur-mx, microbert-uyghur-mxp
- Tamil: microbert-tamil-m, microbert-tamil-mx, microbert-tamil-mxp
- Wolof: microbert-wolof-m, microbert-wolof-mx, microbert-wolof-mxp
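The checkpoints are BERT-style models, so they can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch: the repository id lgessler/microbert-coptic-mxp is an assumption about where that checkpoint is hosted, so substitute the identifier (or local path) of the model you actually want.

from transformers import AutoModel, AutoTokenizer

model_id = "lgessler/microbert-coptic-mxp"  # assumed Hub id; adjust to your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("some text in the model's language", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)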
Usage
Setup
- Ensure submodules are initialized:
git submodule update --init --recursive
- Create a new environment:
conda create --name embur python=3.9
conda activate embur
- Install PyTorch and related packages (a quick sanity check follows this list):
conda install pytorch torchvision cudatoolkit -c pytorch
- Install dependencies:
pip install -r requirements.txt
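Once the environment is set up, an optional sanity check in a Python shell confirms that PyTorch imports and reports whether a CUDA device is visible; this is a generic check, not a project-specific requirement.

import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if the CUDA build can see a GPU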
Experiments
This repo is exposed as a CLI with the following commands:
├── data # Data prep commands
│ ├── prepare-mlm
│ └── prepare-ner
├── word2vec # Static embedding condition
│ ├── train
│ ├── evaluate-ner
│ └── evaluate-parser
├── mbert # Pretrained MBERT
│ ├── evaluate-ner
│ └── evaluate-parser
├── mbert-va # Pretrained MBERT with Chau et al. (2020)'s VA method
│ ├── evaluate-ner
│ ├── evaluate-parser
│ └── train
├── bert # Monolingual BERT (main experimental condition)
│ ├── evaluate-ner
│ ├── evaluate-parser
│ └── train
├── evaluate-ner-all # Convenience to perform evals on all NER conditions
├── evaluate-parser-all # Convenience to perform evals on all parser conditions
└── stats # Supporting commands for statistical summaries
└── format-metrics
To see more information, add --help at the end of any partial subcommand, e.g. python main.py --help,
python main.py bert --help, python main.py word2vec train --help.
Adding a language
For each new language to be added, you'll want to follow these conventions:
- Put all data under data/$NAME/, with "raw" data going in some kind of subdirectory. (If it is a UD corpus, the standard UD name would be good, e.g. data/coptic/UD_Coptic-Scriptorium.)
- Ensure that it will be properly handled by the module embur.commands.data. Put a script at embur/scripts/$NAME_data_prep.py that will take the dataset's native format and write it out into data/$NAME/converted, if appropriate. (A hypothetical skeleton follows this list.)
- Update embur.language_configs with the language's information.
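As a rough illustration of the second convention above, here is a hypothetical skeleton for embur/scripts/$NAME_data_prep.py, shown for a made-up language name mylang. The directory layout follows the conventions above, but the plain-text pass-through conversion and the output format are assumptions; check embur.commands.data for what the pipeline actually expects under data/$NAME/converted.

from pathlib import Path

RAW_DIR = Path("data/mylang/raw")             # wherever the native-format data lives
CONVERTED_DIR = Path("data/mylang/converted")

def convert(raw_path: Path) -> str:
    """Turn one native-format file into the converted representation."""
    # Placeholder: real logic depends on the corpus's native format.
    return raw_path.read_text(encoding="utf-8")

def main() -> None:
    CONVERTED_DIR.mkdir(parents=True, exist_ok=True)
    for raw_path in sorted(RAW_DIR.glob("*")):
        out_path = CONVERTED_DIR / raw_path.name
        out_path.write_text(convert(raw_path), encoding="utf-8")
        print(f"wrote {out_path}")

if __name__ == "__main__":
    main()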
If you'd like to add a language's Wikipedia dump, see wiki-thresher.
Please don't hesitate to email me ([email protected]) if you have any questions.