
Failed loading model: Could not find key /_1 in the model file

danielhers opened this issue 6 years ago

@OfirArviv, this is an issue we've been talking about a bit. I guess it could be called a DyNet issue, but I'll try to work around it.

In the mrp branch, whenever I train a model without BERT and then try to load it, I get this error:

...
[dynet] 2.1
Loading from 'test_files/models/ucca.enum'... Done (0.000s).
Loading model from 'test_files/models/ucca':   0%|          | 0/13 [00:00<?, ?param/s]Traceback (most recent call last):
  File "tupa/tupa/model.py", line 234, in load
    self.classifier.load(self.filename)
  File "tupa/tupa/classifiers/classifier.py", line 125, in load
    self.load_model(filename, d)
  File "tupa/tupa/classifiers/nn/neural_network.py", line 474, in load_model
    values = self.load_param_values(filename, d)
  File "tupa/classifiers/nn/neural_network.py", line 503, in load_param_values
    desc="Loading model from '%s'" % filename, unit="param"))
  File "tupa/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1005, in __iter__
    for obj in iterable:
  File "_dynet.pyx", line 450, in load_generator
  File "_dynet.pyx", line 453, in _dynet.load_generator
  File "_dynet.pyx", line 327, in _dynet._load_one
  File "_dynet.pyx", line 1482, in _dynet.ParameterCollection.load_lookup_param
  File "_dynet.pyx", line 1497, in _dynet.ParameterCollection.load_lookup_param
RuntimeError: Could not find key /_1 in the model file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tupa/tupa/parse.py", line 653, in <module>
    main()
  File "tupa/tupa/parse.py", line 649, in main
    list(main_generator())
  File "tupa/tupa/parse.py", line 631, in main_generator
    yield from train_test(test=args.input, args=args)
  File "tupa/tupa/parse.py", line 557, in train_test
    yield from filter(None, parser.train(train, dev=dev, test=test is not None, iterations=args.iterations))
  File "tupa/tupa/parse.py", line 457, in train
    self.model.load()
  File "tupa/tupa/model.py", line 244, in load
    raise IOError("Failed loading model from '%s'" % self.filename) from e
OSError: Failed loading model from 'test_files/models/ucca'

Looking at the model's .data file, I couldn't find any problem: it certainly contains a line starting with #LookupParameter# /_1.
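For reference, here is a quick standalone check (not part of tupa) that lists the keys actually present in that file. It assumes the file is the traceback's basename plus .data, and that every saved object has a header line like the #LookupParameter# /_1 one above:

```python
# Standalone sketch: list the parameter keys DyNet wrote to the model file.
# Assumptions: the file is "test_files/models/ucca.data" (basename from the
# traceback plus ".data") and each saved object starts with a header line such
# as "#Parameter# /_0 ..." or "#LookupParameter# /_1 ...".
with open("test_files/models/ucca.data") as f:
    for line in f:
        if line.startswith(("#Parameter#", "#LookupParameter#")):
            kind, key = line.split()[:2]
            print(kind, key)
```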

However, looking at the training log, I found that all updates resulted in this error:

Error in update(): Magnitude of gradient is bad: -nan

I could then reproduce the problem by adding (loss/0).backward() before self.trainer.update() in https://github.com/danielhers/tupa/blob/20e7b12e3c91a5839c73db8433f0968fbd965bc6/tupa/classifiers/nn/neural_network.py#L422
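For context, here is a minimal standalone DyNet sketch of what that reproduction step does (this is not the tupa code; the shapes, the scalarInput variant of the division, and the final check are all made up for illustration). The division by zero makes the backward pass non-finite, the trainer then complains about the gradient magnitude, and any nan/inf that ends up in the stored values is what would get written to the model file on save:

```python
import dynet as dy
import numpy as np

# Standalone sketch mimicking the (loss/0).backward() trick above.
# The parameter shapes are arbitrary stand-ins for tupa's embedding tables.
pc = dy.ParameterCollection()
trainer = dy.SimpleSGDTrainer(pc)
lookup = pc.add_lookup_parameters((10, 8))

dy.renew_cg()
emb = dy.lookup(lookup, 3)
loss = dy.squared_norm(emb)

# Division by zero poisons the gradient; written with scalarInput to stay on
# documented API, but it plays the same role as (loss/0) above.
(loss / dy.scalarInput(0.0)).backward()

try:
    # Expected to complain about a bad gradient magnitude, as in the training
    # log; depending on the DyNet version this is printed or raised.
    trainer.update()
except RuntimeError as e:
    print("update failed:", e)

# If the stored values themselves became non-finite, that is what gets saved.
values = lookup.as_array()
print("non-finite values in lookup table:", not np.isfinite(values).all())
```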

So now the question is just why all updates result in -nan gradients in the normal code.
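One way to narrow that down, sketched below as a hypothetical helper rather than existing tupa code, is to check around self.trainer.update() whether the loss itself is already non-finite or whether only the stored parameter values go bad after the update:

```python
import math
import numpy as np

def report_non_finite(loss, pc):
    """Hypothetical debugging helper: report whether the loss expression or any
    stored parameter values in the dynet ParameterCollection `pc` are
    non-finite. Not part of tupa; the attribute holding the collection in
    neural_network.py may be named differently."""
    if not math.isfinite(loss.scalar_value()):
        print("loss is already non-finite:", loss.scalar_value())
    for p in pc.parameters_list() + pc.lookup_parameters_list():
        if not np.isfinite(p.as_array()).all():
            print("non-finite values in", p.name())
```

Calling it just before loss.backward() and again just after self.trainer.update() should show whether the -nan originates in the loss computation itself or only appears once the bad gradient hits the parameters, which is what the (loss/0) reproduction above suggests.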

danielhers · Aug 09 '19 10:08