Model loading doesn't work for SentencePieceTokenizer

Open siin-lab opened this issue 1 year ago • 0 comments

The following code fails when tokenizing with a model that was saved and then loaded again:

import tkseem as tk

tokenizer_path = 'model.pl'
tokenizer = tk.SentencePieceTokenizer()
tokenizer.train(dataset_file)

# save the tokenizer to a file
tokenizer.save_model(tokenizer_path)

# load the tokenizer from a file
tokenizer = tk.SentencePieceTokenizer()
tokenizer.load_model(tokenizer_path)

# test the tokenizer
a = tokenizer.tokenize("السلام عليكم")

The error message is:

Traceback (most recent call last):
  File "/Users/user/Desktop/Projects/train-tokenizer.py", line 15, in <module>
    a = tokenizer.tokenize("السلام عليكم")
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/site-packages/tkseem/sentencepiece_tokenizer.py", line 50, in tokenize
    return self.sp.encode(text, out_type=str)
AttributeError: 'bool' object has no attribute 'encode'

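The traceback shows that self.sp is a bool rather than a SentencePieceProcessor once "load_model" has run. The solution is to update "load_model" so that it stores the processor itself: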

    def load_model(self, file_path):
        """Load a saved sp model

        Args:
            file_path (str): file path of the trained model
        """
        # Read the serialized model, close the file, and keep the processor object
        with open(file_path, "rb") as f:
            self.sp = spm.SentencePieceProcessor(model_proto=f.read())
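
To illustrate why the change helps, here is a minimal sketch that reproduces the same AttributeError without tkseem or sentencepiece installed. Everything in it is a stand-in: FakeProcessor, BrokenTokenizer and FixedTokenizer are hypothetical names, and the "broken" loader is only my assumption about how self.sp could end up as a bool (by keeping the return value of a load call instead of the processor). It is not the library's actual code.

```python
# Stand-in classes only: this does not import tkseem or sentencepiece.


class FakeProcessor:
    """Hypothetical processor whose load() reports success as a bool."""

    def __init__(self, model_proto=None):
        self.model = model_proto

    def load(self, data):
        self.model = data
        return True  # a status flag, not the processor itself

    def encode(self, text, out_type=str):
        if self.model is None:
            raise RuntimeError("no model loaded")
        return text.split()  # whitespace split stands in for real tokenization


class BrokenTokenizer:
    """Assumed shape of the failing loader: keeps load()'s return value."""

    def load_model(self, data):
        self.sp = FakeProcessor().load(data)  # self.sp is now True

    def tokenize(self, text):
        return self.sp.encode(text, out_type=str)


class FixedTokenizer(BrokenTokenizer):
    """Mirrors the proposed fix: keeps the processor object in self.sp."""

    def load_model(self, data):
        self.sp = FakeProcessor(model_proto=data)


broken = BrokenTokenizer()
broken.load_model(b"model bytes")
try:
    broken.tokenize("السلام عليكم")
    error = None
except AttributeError as exc:
    error = str(exc)  # same message as in the traceback above

fixed = FixedTokenizer()
fixed.load_model(b"model bytes")
tokens = fixed.tokenize("السلام عليكم")

print(error)
print(tokens)
```

In the sketch the only difference between the two loaders is what gets assigned to self.sp, which is also the only thing the proposed "load_model" changes.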

Apr 06 '24 18:04 siin-lab