python_autocomplete

size mismatch for weights and bias

Open · 1e0ndavid opened this issue 4 years ago · 5 comments

Hi there. After training the model, I ran "python serve.py" to check whether the model is usable; before that, I changed run_uuid and checkpoint to those of my model. Any idea why it raises "RuntimeError: Error(s) in loading state_dict for TransformerXLModel:"? Thanks.
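For context, the loading step serve.py goes through boils down to roughly this (a hedged sketch; `experiment.load(...)` is assumed from the labml API, and the run UUID / checkpoint values are the ones from my log below):

```python
from labml import experiment

# Point the experiment at an existing training run before starting it.
# experiment.load(...) is an assumption based on the labml API; the values
# are the run UUID and checkpoint step from the log below.
experiment.load(run_uuid='b32da5eea23711eb982bccbbfe110075', checkpoint=1744896)

# start() restores the run and calls load_state_dict on the model,
# which is where the RuntimeError is raised.
experiment.start()
```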


(autocomplete) daijianbo@ubuntu18:~/python_autocomplete-master-old/python_autocomplete$ python serve.py

LABML WARNING Not a valid git repository: /home/daijianbo/python_autocomplete-master-old

Prepare model...
Prepare n_tokens...
Prepare tokenizer...[DONE] 1.27ms
Prepare n_tokens...[DONE] 2.10ms
Prepare transformer...[DONE] 1.33ms
Prepare ffn...[DONE] 0.30ms
Prepare device...
Prepare device_info...[DONE] 23.29ms
Prepare device...[DONE] 23.51ms
Prepare model...[DONE] 107.18ms
Selected experiment = source_code run = b32da5eea23711eb982bccbbfe110075 checkpoint = 1744896
Loading checkpoint...[FAIL] 840.09ms
Traceback (most recent call last):
  File "serve.py", line 18, in <module>
    predictor = get_predictor()
  File "/home/daijianbo/python_autocomplete-master-old/python_autocomplete/evaluate/factory.py", line 39, in get_predictor
    conf = load_experiment()
  File "/home/daijianbo/python_autocomplete-master-old/python_autocomplete/evaluate/factory.py", line 33, in load_experiment
    experiment.start()
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/experiment.py", line 256, in start
    return _experiment_singleton().start(run_uuid=_load_run_uuid, checkpoint=_load_checkpoint)
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/__init__.py", line 407, in start
    global_step = self.__start_from_checkpoint(run_uuid, checkpoint)
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/__init__.py", line 312, in __start_from_checkpoint
    self._load_checkpoint(checkpoint_path)
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/__init__.py", line 280, in _load_checkpoint
    self.checkpoint_saver.load(checkpoint_path)
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/__init__.py", line 118, in load
    saver.load(checkpoint_path, info[name])
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/labml/internal/experiment/pytorch.py", line 66, in load
    self.model.load_state_dict(state)
  File "/home/daijianbo/miniconda3/envs/autocomplete/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerXLModel:
    size mismatch for src_embed.weight: copying a param with shape torch.Size([1096, 512]) from checkpoint, the shape in current model is torch.Size([1097, 512]).
    size mismatch for generator.weight: copying a param with shape torch.Size([1096, 512]) from checkpoint, the shape in current model is torch.Size([1097, 512]).
    size mismatch for generator.bias: copying a param with shape torch.Size([1096]) from checkpoint, the shape in current model is torch.Size([1097]).

1e0ndavid avatar Apr 29 '21 12:04 1e0ndavid

Looks like the number of tokens is different from the number of tokens when it was trained. Did you change the dataset or run BPE again?
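For illustration, the same failure can be reproduced with plain PyTorch (a minimal sketch using the sizes from the traceback above, not the project's actual model code):

```python
import torch.nn as nn

# A vocabulary-sized embedding as it was at training time
# (1096 tokens, d_model = 512 -- the sizes in the traceback).
trained = nn.Embedding(num_embeddings=1096, embedding_dim=512)
state = trained.state_dict()

# The same layer rebuilt after the token count drifted by one (1097),
# e.g. because the tokenizer/BPE produced a slightly different vocabulary.
current = nn.Embedding(num_embeddings=1097, embedding_dim=512)

# Raises: RuntimeError: Error(s) in loading state_dict for Embedding:
#   size mismatch for weight: copying a param with shape
#   torch.Size([1096, 512]) from checkpoint, the shape in current
#   model is torch.Size([1097, 512]).
current.load_state_dict(state)
```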

vpj avatar Apr 30 '21 05:04 vpj

> Looks like the number of tokens is different from the number of tokens when it was trained. Did you change the dataset or run BPE again?

No, I don't think I did. The weird thing is that a friend of mine hit the same problem, and the gap between his two dimensions was even bigger than mine: he got [1084, 512] and [1092, 512] respectively. One way we worked around it was to train again and pick another checkpoint, which sometimes works. I'm not sure what goes wrong here; could it be in the "segment-level recurrence" part? I have no idea, since I haven't reviewed the code carefully.
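(A hypothetical debugging helper, in case it is useful for narrowing this down: it prints only the parameters whose shapes differ between a freshly built model and a saved state_dict. `model`, `state`, and `diff_shapes` are illustrative names, not part of the repo; they stand in for the TransformerXLModel instance and the checkpoint state_dict that labml passes to `load_state_dict`.)

```python
def diff_shapes(model, state):
    """Print every parameter whose shape differs between `model`
    (a freshly constructed nn.Module) and `state` (a checkpoint's
    state_dict), so the drifting tensors are easy to spot."""
    for name, param in model.state_dict().items():
        if name in state and tuple(state[name].shape) != tuple(param.shape):
            print(f'{name}: checkpoint {tuple(state[name].shape)} '
                  f'vs model {tuple(param.shape)}')
```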

1e0ndavid avatar Apr 30 '21 07:04 1e0ndavid

This sounds like a bug. The dimensions of the embedding weights are the number of tokens and the number of embedding features (d_model).
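Concretely (a sketch that mirrors the parameter names in the traceback; the repo's actual module definitions may differ), all three mismatching tensors have the token count as their first dimension, which is why a one-token difference breaks all of them at once:

```python
import torch.nn as nn

n_tokens, d_model = 1097, 512  # the shapes in the error message

src_embed = nn.Embedding(n_tokens, d_model)  # input token embedding
generator = nn.Linear(d_model, n_tokens)     # output projection over the vocabulary

print(src_embed.weight.shape)  # torch.Size([1097, 512])
print(generator.weight.shape)  # torch.Size([1097, 512]) (out_features x in_features)
print(generator.bias.shape)    # torch.Size([1097])
```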

vpj avatar Apr 30 '21 13:04 vpj

I will give it a try and see if I can reproduce. Are you running the latest master? Did you make changes? Also is the dataset the same?

vpj avatar Apr 30 '21 13:04 vpj

> I will give it a try and see if I can reproduce. Are you running the latest master? Did you make changes? Also is the dataset the same?

Ok, try it and see what happens, lol. Yep, I downloaded the code several days ago, so I suppose I am running the latest master. I haven't made any changes other than commenting out the data-download part: I used my own data and copied it in directly from another folder. At first I also suspected that I had edited some key code, but after comparing with the original version I don't think so. I tried several more times from the very beginning, from downloading the code to getting it running, and the same issue still exists. Btw, my friend also hit this, so maybe there is something wrong in the model?

And yeah, I always keep the dataset the same.

1e0ndavid avatar May 01 '21 14:05 1e0ndavid