
Rewrite and Speed Up Tokenizer

Open jonthegeek opened this issue 5 years ago • 1 comment

As an RBERT user, I'd like the tokenizer to be as fast as it can be, so that I don't have to wait for this step more than is absolutely necessary.

First thing to check: Does keras::text_tokenizer (and friends) do what we need? If so, we should be able to save_text_tokenizer() when the model is downloaded for #51.

jonthegeek avatar Nov 02 '20 13:11 jonthegeek

Oh, duh, no, keras::text_tokenizer doesn't easily do the wordpiece stuff.

Check out wordpiece_encode in https://github.com/bnosac/sentencepiece, though, to see whether that looks efficient.
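For context on why a plain word-level tokenizer isn't enough: BERT vocabularies contain subword pieces, and tokenization is a greedy longest-match-first scan over each word. Below is a minimal illustrative sketch of that standard algorithm (in Python rather than R, and with a toy vocabulary; the function name and vocab are made up here, not RBERT's actual API). Any faster replacement would need to reproduce this behavior.

```python
# Sketch of greedy longest-match-first WordPiece tokenization --
# the subword step that keras's word-level tokenizer does not perform.
# The "##" continuation prefix and [UNK] fallback follow the BERT convention.

def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##", max_chars=100):
    """Split a single word into WordPiece subtokens via greedy longest match."""
    if len(word) > max_chars:
        return [unk]
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece_found = None
        # Shrink the window from the right until a vocab entry matches.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces get the "##" prefix
            if piece in vocab:
                piece_found = piece
                break
            end -= 1
        if piece_found is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece_found)
        start = end
    return tokens

# Toy vocabulary for illustration only:
vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Because the inner loop re-hashes a fresh substring at every shrink step, a naive R translation of this would be slow; that lookup loop is exactly where a C++-backed implementation like the bnosac one could win.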

jonthegeek avatar Nov 02 '20 13:11 jonthegeek