Finetuning NLLB models fails with "ValueError: --share-all-embeddings requires a joined dictionary", need help!
❓ Questions and Help
I want to test finetuning the NLLB models (3.3B). I followed the Finetuning NLLB models doc and ran this command:
```
python fairseq/examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=fairseq/examples/nllb/modeling/train/conf/cfg/dataset/fbseed_chat.yaml \
    cfg.dataset.lang_pairs="deu_Latn-eng_Latn" \
    cfg.fairseq_root=fairseq \
    cfg.output_dir=nllb_fine_tuned \
    cfg.dropout=0.1 \
    cfg.warmup=10 \
    cfg.finetune_from_model=nllb_models/model_3B/checkpoint.pt
```
fbseed_chat.yaml is as follows:

```yaml
defaults:
  - default

dataset_name: "fbseed_chat"
num_shards: 1
langs_file: "examples/nllb/modeling/scripts/flores200/langs.txt"
lang_pairs: "deu_Latn-eng_Latn"
data_prefix:
  localcluster: fairseq/data-bin/iwslt14.tokenized.de-en
```
Files in the data folder are as follows:

```
dict.deu_Latn.txt
dict.eng_Latn.txt
test.de-en.de.bin
test.de-en.de.idx
test.de-en.en.bin
test.de-en.en.idx
train.de-en.de.bin
train.de-en.de.idx
train.de-en.en.bin
train.de-en.en.idx
valid.de-en.de.bin
valid.de-en.de.idx
valid.de-en.en.bin
valid.de-en.en.idx
```
The data files were produced by `fairseq/examples/translation/prepare-iwslt14.sh`.
After executing the finetuning command, I get this error:
```
Traceback (most recent call last):
  File "./slurm_snapshot_code/2022-09-05T09_08_29.058828/train.py", line 14, in <module>
    cli_main()
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq_cli/train.py", line 634, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/distributed/utils.py", line 371, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/distributed/utils.py", line 345, in distributed_main
    main(cfg, **kwargs)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq_cli/train.py", line 113, in main
    model = fsdp_wrap(task.build_model(cfg.model))
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/tasks/translation_multi_simple_epoch.py", line 246, in build_model
    return super().build_model(args, from_checkpoint)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/tasks/fairseq_task.py", line 694, in build_model
    model = models.build_model(args, self, from_checkpoint)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/models/__init__.py", line 107, in build_model
    return model.build_model(cfg, task)
  File "/home/translation/fairseq/slurm_snapshot_code/2022-09-05T09_08_29.058828/fairseq/models/transformer/transformer_legacy.py", line 112, in build_model
    raise ValueError("--share-all-embeddings requires a joined dictionary")
ValueError: --share-all-embeddings requires a joined dictionary
```
Need help! Thanks very much!
What's your environment?
- fairseq version: 1.0.0a0+f87107c
- PyTorch version: 1.12.1+cu102
- OS: Ubuntu 20.04
- How you installed fairseq: source
- Python version: 3.8
You cannot use the iwslt14 data with an NLLB model. fairseq models are bound to one or two vocabulary files (dict.txt): the line count of dict.txt plus the special tokens determines the model's input and output feature sizes.
prepare-iwslt14.sh builds its data with a dict.txt made for the iwslt14 model.
You need to prepare your data with NLLB's own vocabulary dict.txt (roughly 256k entries) instead.
The command may look like `fairseq-preprocess {iwslt data} --srcdict {nllb vocab} --joined-dictionary ...`. I suggest you read NLLB's page again.
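For reference, a fleshed-out version of that command might look like the sketch below. All paths are placeholders, and it assumes the text has already been encoded with NLLB's flores200 SentencePiece model; the flags shown (`--srcdict`, `--joined-dictionary`, etc.) are standard `fairseq-preprocess` options.

```bash
# Minimal sketch, paths are hypothetical -- adjust to your setup.
# The input text must already be SPM-encoded with NLLB's flores200 model.
# --srcdict pins the vocabulary to NLLB's dict.txt, and --joined-dictionary
# shares it between source and target, which --share-all-embeddings requires.
fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref iwslt14.spm-encoded/train \
    --validpref iwslt14.spm-encoded/valid \
    --testpref iwslt14.spm-encoded/test \
    --srcdict nllb/dictionary.txt \
    --joined-dictionary \
    --destdir data-bin/iwslt14.nllb-vocab.de-en \
    --workers 8
```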
@gmryu Thanks for your reply! I skipped the data preparation for a quick finetuning test, as I don't have my own data at the moment. I have another question: if NLLB's vocabulary dict.txt doesn't contain some words in my data, can I add new words to NLLB's dict.txt and finetune based on this new dict.txt?
@cokuehuang The total vocabulary size must stay the same because, as I wrote before, a model's input and output feature sizes are determined by the given vocabulary size. If the vocabulary size differs, you get an error about a mismatch between the loaded weights and the initialized weights.
So, in other words, you can:
- alter words already in dict.txt to other words. Words near the end of dict.txt are probably less frequent (no guarantee); I would alter those first.
- edit the checkpoint (check out "prune" or "distill"). A recent issue about editing a model is here: https://github.com/facebookresearch/fairseq/issues/4664. If you are interested in this, you should read it to the end.
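To see how the vocabulary size is baked into the weights, here is a minimal inspection sketch. It assumes fairseq's standard Transformer parameter names and reuses the checkpoint path from the original post; adjust both to your setup.

```python
import torch

# Load the fairseq checkpoint on CPU; the "model" entry holds the state dict.
ckpt = torch.load("nllb_models/model_3B/checkpoint.pt", map_location="cpu")
state = ckpt["model"]

# Standard fairseq Transformer parameter names (may differ per model).
for key in (
    "encoder.embed_tokens.weight",
    "decoder.embed_tokens.weight",
    "decoder.output_projection.weight",
):
    if key in state:
        # Shape is (vocab_size, embed_dim): the first dimension is exactly
        # the dict.txt line count plus the special tokens.
        print(key, tuple(state[key].shape))
```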
Hello,
Can anyone please share the code/notebook for NLLB finetuning, if possible?
Hi, I would appreciate the code/notebook for NLLB finetuning too.
Did you get the code for finetuning the NLLB model?