albert
tokenization encode_pieces error
I created a custom SentencePiece model and vocab from Korean Wikipedia lines, and the `encode_pieces` function in tokenization.py raises an error:
"'int' object has no attribute 'lower'"
I think the step that strips the trailing "," causes the error (for example, turning "12,345" into "12345").
In my case, this line:

```python
cur_pieces = sp_model.EncodeAsPieces(
    six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
```

returns the pieces as bytes, so I changed it to:

```python
cur_pieces = sp_model.EncodeAsPieces(
    six.ensure_str(piece[:-1]).replace(SPIECE_UNDERLINE.decode("utf-8"), ""))
```
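For context, a minimal stdlib-only sketch (no sentencepiece or six needed) of why pieces returned as bytes trigger this exact error in Python 3: indexing or iterating a `bytes` object yields `int`s, which have no `.lower()` method, while indexing a `str` yields one-character strings. The variable names below are illustrative, not from the library.

```python
# Illustrative stand-ins for what EncodeAsPieces returns for
# bytes input vs. str input (assumption, for demonstration only).
piece_bytes = b"12345"
piece_str = "12345"

# In Python 3, indexing bytes yields an int ...
assert isinstance(piece_bytes[0], int)

# ... so calling .lower() on it raises the error from the issue:
try:
    piece_bytes[0].lower()
except AttributeError as e:
    print(e)  # 'int' object has no attribute 'lower'

# Indexing a str yields a one-character str, so .lower() works:
assert piece_str[0].lower() == "1"
```

This is why converting the input with `six.ensure_str` (so `EncodeAsPieces` returns `str` pieces) avoids the error, whereas `six.ensure_binary` produces bytes pieces that fail downstream.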
Is there any problem with this change?
I ran into the same problem and am using your solution too.