
Issue with loading Additional Entities

Open seanaedmiston opened this issue 2 years ago • 8 comments

I have tried to load additional entities as per the README by running preprocess_all. Everything appears to run fine; however, when I try to load the ReFinED model afterwards with something like:

refined = Refined(
    model_file_or_model=data_dir + "/wikipedia_model_with_numbers/model.pt",
    model_config_file_or_model_config=data_dir + "/wikipedia_model_with_numbers/config.json",
    entity_set="wikidata",
    data_dir=data_dir,
    use_precomputed_descriptions=False,
    download_files=False,
    preprocessor=preprocessor,
)

I get an error like:

Traceback (most recent call last):
  File "/home/azureuser/Hafnia/email_ee/email_refined.py", line 91, in <module>
    refined = Refined(
  File "/home/azureuser/ReFinED/src/refined/inference/processor.py", line 100, in __init__
    self.model = RefinedModel.from_pretrained(
  File "/home/azureuser/ReFinED/src/refined/model_components/refined_model.py", line 643, in from_pretrained
    model.load_state_dict(checkpoint, strict=False)
  File "/home/azureuser/.pyenv/versions/venv3108/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RefinedModel:
        size mismatch for entity_typing.linear.weight: copying a param with shape torch.Size([1369, 768]) from checkpoint, the shape in current model is torch.Size([1447, 768]).
        size mismatch for entity_typing.linear.bias: copying a param with shape torch.Size([1369]) from checkpoint, the shape in current model is torch.Size([1447]).
        size mismatch for entity_disambiguation.classifier.weight: copying a param with shape torch.Size([1, 1372]) from checkpoint, the shape in current model is torch.Size([1, 1450]).

To the best of my understanding, this is because the number of classes in the Wikidata dump has changed since the original model was trained (class_to_label.json now has 1446 entries). Is there any way to accommodate this without completely retraining the model?
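For reference, the mismatch can be confirmed with something like the sketch below (not part of ReFinED; the paths assume the default data_dir layout and may differ in your setup):

# Rough diagnostic sketch: compare the number of classes the checkpoint was
# trained with against the rebuilt class list. Paths are placeholders.
import json
import torch

data_dir = "/path/to/data_dir"  # hypothetical

state_dict = torch.load(data_dir + "/wikipedia_model_with_numbers/model.pt",
                        map_location="cpu")
print("classes in checkpoint:", state_dict["entity_typing.linear.weight"].shape[0])

with open(data_dir + "/wikipedia_data/class_to_label.json") as f:
    class_to_label = json.load(f)
print("classes in rebuilt class_to_label.json:", len(class_to_label))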

seanaedmiston avatar Mar 06 '23 02:03 seanaedmiston

I went through a very similar issue after updating the files with the latest wiki dumps. I believe it is indeed attributable to the different shape of the classes tensor.

To perform zero-shot inference without retraining your model, you may want to use a mixture of original files (the ones built with the old number of classes) and newly generated ones.

The combination that I figured out to run the model effectively is the following:

  • class_to_idx.json (original)
  • class_to_label.json (original)
  • descriptions_tns.pt (new)
  • human_qcodes.json (new)
  • nltk_sentence_splitter_english.pickle (new)
  • pem.lmdb (new)
  • qcode_to_class_tns_<number>.pt (original)
  • qcode_to_idx.lmdb (original)
  • qcode_to_wiki.lmdb (see NOTE below)
  • subclasses.lmdb (new)

NOTE: qcode_to_wiki.lmdb is generated by translating qcode_to_idx.json into an LMDB dictionary, which means that instead of mapping qcodes to Wikipedia titles (as intended), it returns numerical indexes. This might be a bug worthy of a new issue. However, I worked around it by simply renaming the newly generated additional_data/qcode_to_label.lmdb to qcode_to_wiki.lmdb, and it works just fine.
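For anyone who wants to script this combination, a minimal copy sketch along the lines of the list above might look like the following (all paths and directory names are placeholders, and the .lmdb entries are handled as either files or directories since that depends on how they were written):

# Sketch only: assemble a data directory that mixes the original files with the
# newly generated ones, following the combination listed above.
import shutil
from pathlib import Path

original = Path("/path/to/original/wikipedia_data")        # files shipped with the model
regenerated = Path("/path/to/regenerated/wikipedia_data")  # output of preprocess_all
regenerated_additional = Path("/path/to/regenerated/additional_data")
mixed = Path("/path/to/mixed/wikipedia_data")
mixed.mkdir(parents=True, exist_ok=True)

def copy_item(src: Path, dst: Path) -> None:
    # LMDB stores may be directories rather than single files, so handle both.
    if src.is_dir():
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)

for name in ["class_to_idx.json", "class_to_label.json", "qcode_to_idx.lmdb"]:
    copy_item(original / name, mixed / name)           # original (old class set)
for path in original.glob("qcode_to_class_tns_*"):
    copy_item(path, mixed / path.name)                 # original class tensor
for name in ["descriptions_tns.pt", "human_qcodes.json",
             "nltk_sentence_splitter_english.pickle", "pem.lmdb", "subclasses.lmdb"]:
    copy_item(regenerated / name, mixed / name)        # newly generated files

# Workaround from the NOTE above: use the regenerated qcode_to_label.lmdb
# in place of qcode_to_wiki.lmdb.
copy_item(regenerated_additional / "qcode_to_label.lmdb", mixed / "qcode_to_wiki.lmdb")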

lucatorellimxm avatar Mar 14 '23 11:03 lucatorellimxm

Thanks heaps for replying @lucatorellimxm. With your suggestions I was at least able to run the model... but for whatever reason the performance is way off. Some entities that it was previously disambiguating/linking are no longer linking correctly, and my 'additional entities' are also not linking.

seanaedmiston avatar Mar 16 '23 23:03 seanaedmiston

Just an update in case anyone ever looks here... I eventually got everything working well, but discovered two things:

  1. To use 'additional_entities' without retraining the model in full, the trick is to copy the 'chosen_classes.txt' file from the original data. This means that when all of the indexes are rebuilt with the additional entities in them, they use the exact same classes the original model was trained on. (This avoids the error I initially reported above; a rough copy sketch follows this list.)
  2. Even having done that, linking performance was terrible. I eventually tracked it down to an issue processing Wikipedia redirects. (Redirects turn out to be one of the biggest sources of disambiguation data.) For 'new' Wikipedia dumps, the redirect handling was completely broken. Reworked in my fork here: https://github.com/Simbolo-io/ReFinED
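A rough sketch of point 1 (the paths, and the exact directory preprocess_all reads chosen_classes.txt from, are assumptions; check where your preprocessing run expects the file):

# Sketch of point 1: reuse the original model's chosen_classes.txt so the rebuilt
# indexes use exactly the classes the checkpoint was trained on. Paths are placeholders.
import shutil
from pathlib import Path

original_data = Path("/path/to/original/data_dir")        # data shipped with the model
rebuild_data = Path("/path/to/new/preprocessing/output")  # where preprocess_all works

rebuild_data.mkdir(parents=True, exist_ok=True)
shutil.copy2(original_data / "chosen_classes.txt", rebuild_data / "chosen_classes.txt")
# Then run preprocess_all against rebuild_data so it picks up the copied class list
# instead of generating a new (larger) one.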

seanaedmiston avatar Apr 11 '23 23:04 seanaedmiston

Great advice, thank you @seanaedmiston.

Does point 2 still hold true in the case of full model training? I am experiencing some linking issues with rather easy mentions even after training the model from scratch on new data, and that could be the cause.

lucatorellimxm avatar Apr 12 '23 09:04 lucatorellimxm

Yes - I saw poor linking performance (point 2) even with full model training. Fixing the 'redirect' parsing problems I found should address that; it made a huge difference for me. My fork is a bit of a mess, but the only changes you should need are in process_wiki.py: https://github.com/amazon-science/ReFinED/compare/main...Simbolo-io:ReFinED:main#diff-7aac257f29f9e00bda22f968125b52fc5bc3ced71e9627c5bf51780c4a8230c3

One little wrinkle: in the latest Wikipedia dumps there is an article title that consists of just a backslash. If that causes you problems, you may need the additional fix to loaders.py here: https://github.com/amazon-science/ReFinED/compare/main...Simbolo-io:ReFinED:main#diff-7fbb3c56891f6094624a3872d81cde9dab1d4585452975093f5fdd63dece42ea
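For anyone curious what the redirect data looks like, the following is not the change from the fork, just a minimal illustration of how redirect records appear in a MediaWiki pages-articles XML dump and how title-to-target pairs can be pulled out (the bare-backslash guard is only an example):

# Illustrative only -- not the fix from the fork above. Redirect pages in a
# pages-articles dump carry a <redirect title="Target"/> child element.
import xml.etree.ElementTree as ET

def iter_redirects(dump_path: str):
    """Yield (page_title, redirect_target) pairs from a pages-articles XML dump."""
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the MediaWiki XML namespace
        if tag != "page":
            continue
        title_el = elem.find("{*}title")
        redirect_el = elem.find("{*}redirect")
        if title_el is not None and redirect_el is not None:
            title = title_el.text or ""
            target = redirect_el.get("title", "")
            if title.strip("\\").strip() and target:  # crude guard against titles like "\"
                yield title, target
        elem.clear()  # keep memory bounded on a multi-GB dump

# Example (hypothetical file name):
# num_redirects = sum(1 for _ in iter_redirects("enwiki-latest-pages-articles.xml"))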

seanaedmiston avatar Apr 12 '23 11:04 seanaedmiston

I am trying to add additional entities without retraining, but I am not able to find the file "chosen_classes.txt" in the original folder, which contains:

additional_data:
datasets:
roberta-base: config.json merges.txt pytorch_model.bin vocab.json
wikipedia_data: class_to_idx.json descriptions_tns.pt nltk_sentence_splitter_english.pickle qcode_to_class_tns_6269457-138.np qcode_to_wiki.lmdb class_to_label.json human_qcodes.json pem.lmdb qcode_to_idx.lmdb subclasses.lmdb
wikipedia_model: config.json model.pt
wikipedia_model_with_numbers: config.json model.pt

Where can I find it? Thanks in advance.

yhifny avatar Sep 06 '23 12:09 yhifny