Error in LM pretraining
What I did:
- Checked out the `pretrain-lm` branch because it has clear instructions on how to pretrain an LM (#57).
- Installed the required packages.
- Executed `bash prepare_wiki.sh de`.
- Executed `python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100`.
- Received the following traceback:
```
python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100
Setting LM weights seed seed to 0
Running tokenization: 'lm-notst' ...
Wiki text was split to 1 articles
Wiki text was split to 1 articles
Wiki text was split to 1 articles
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/multifit/multifit/__main__.py", line 16, in <module>
    fire.Fire(Experiment())
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 468, in _Fire
    target=component.__name__)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/multifit/multifit/training.py", line 587, in train_
    self.pretrain_lm.train_(pretrain_dataset)
  File "/home/ubuntu/multifit/multifit/training.py", line 275, in train_
    learn = self.get_learner(data_lm=dataset.load_lm_databunch(bs=self.bs, bptt=self.bptt, limit=self.limit))
  File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 208, in load_lm_databunch
    limit=limit)
  File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 258, in load_n_cache_databunch
    databunch = self.databunch_from_df(bunch_class, train_df, valid_df, **args)
  File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 271, in databunch_from_df
    **args)
  File "/home/ubuntu/multifit/fastai_contrib/text_data.py", line 147, in make_data_bunch_from_df
    TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fastai/data_block.py", line 434, in __init__
    if not self.train.ignore_empty and len(self.train.items) == 0:
TypeError: len() of unsized object
```
From some initial debugging, `train.items` is an ndarray with shape `()`. When I print it, it returns articles in German. I suspect the log line `Wiki text was split to 1 articles` points at the problem: the wiki text should be split into more than one article. So maybe something goes wrong in `read_wiki_articles()` in `dataset.py` (see the sketch below for what I mean). This is my educated guess, but I don't know where to go from here.
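To make that guess concrete, here is a minimal sketch of the failure mode I have in mind (my own illustration with hypothetical text and a hypothetical split pattern, not multifit's actual code): if the article-boundary pattern never matches, the whole dump stays a single item, and a lone string turned into an ndarray is 0-dimensional, so `len()` fails exactly as in the traceback.

```python
import re
import numpy as np

wiki_text = "= Artikel 1 =\nErster Absatz ...\n= Artikel 2 =\nZweiter Absatz ..."

# Hypothetical boundary pattern that never matches this dump:
# re.split then returns the whole text as one "article", which
# would explain "Wiki text was split to 1 articles".
articles = re.split(r"^== .+ ==$", wiki_text, flags=re.MULTILINE)
print(len(articles))  # 1

# A single string wrapped by np.array() is a 0-d array (shape ()),
# and calling len() on it raises the same error as in the traceback.
items = np.array(articles[0])
print(items.shape)  # ()
len(items)          # TypeError: len() of unsized object
```

If that is indeed the mechanism, the fix would be on the article-splitting side rather than in fastai's `len()` check.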
My package versions differ slightly from those in `requirements.txt`; maybe the `sacremoses` version is related:
```
fire          0.3.0
sacremoses    0.0.38
sentencepiece 0.1.85
fastai        1.0.47
```
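For anyone trying to reproduce this, a quick way to dump the installed versions for comparison against `requirements.txt` (package names taken from the list above):

```python
import pkg_resources

# Print the installed version of each package mentioned above.
for pkg in ["fire", "sacremoses", "sentencepiece", "fastai"]:
    print(pkg, pkg_resources.get_distribution(pkg).version)
```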