tokenization encode_pieces error

Open akakakakakaa opened this issue 6 years ago • 1 comments

I created a custom SentencePiece model and vocab based on Korean wiki lines.

When I call the function encode_pieces in tokenization.py, I get this error:

"'int' object has no attribute 'lower'"

I think the step that removes the "," causes the error (for example, turning "12,345" into "12345").

In my case, the code

cur_pieces = sp_model.EncodeAsPieces(six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))

produces the pieces as binary (bytes), so I changed the code to

cur_pieces = sp_model.EncodeAsPieces(six.ensure_str(piece[:-1]).replace(SPIECE_UNDERLINE.decode("utf-8"), ""))

Is there any problem with this change?
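
For reference, here is a minimal sketch of what probably goes wrong, assuming Python 3 and assuming SPIECE_UNDERLINE is a bytes constant (which the .decode("utf-8") call above implies). It does not call sentencepiece at all; the piece value is a hypothetical example, and .encode("utf-8") stands in for six.ensure_binary, which does the same thing for a str input on Python 3. Indexing a bytes object in Python 3 yields an int, so any later code that treats an element as text and calls .lower() on it would fail with exactly the reported message. Where that call actually happens in the real code path is not confirmed here.

```python
# Assumption: SPIECE_UNDERLINE is bytes, as the .decode("utf-8") in the fix implies.
SPIECE_UNDERLINE = u"\u2581".encode("utf-8")

# Hypothetical piece: the SentencePiece underline, digits, then a trailing comma.
piece = u"\u258112,"

# Bytes path (original code): strip the comma, encode, remove the underline.
as_bytes = piece[:-1].encode("utf-8").replace(SPIECE_UNDERLINE, b"")

# Text path (proposed fix): stay in str and remove the decoded underline.
as_text = piece[:-1].replace(SPIECE_UNDERLINE.decode("utf-8"), "")

first_of_bytes = as_bytes[0]  # an int in Python 3
first_of_text = as_text[0]    # a one-character str

# Calling a str method on an element of a bytes object reproduces the message.
try:
    first_of_bytes.lower()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(type(first_of_bytes).__name__, type(first_of_text).__name__)
print(error_message)
```

Keeping the value as str, as in the changed line, avoids this because every element of a str is itself a str.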

Jan 03 '20 19:01 akakakakakaa

I ran into the same problem and am using your solution too.

Jan 10 '20 18:01 pvcastro