
Fast WordPiece tokenizer implementation

catleeball opened this issue 3 years ago · 4 comments

Feature request

It would be nice to implement the Fast WordPiece tokenization algorithm.

Motivation

Fast WordPiece performs sub-word tokenization in O(n) time using failure links similar to Aho-Corasick. It also consolidates the whitespace/punctuation splitting pass into the tokenization loop.
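For context, the baseline that Fast WordPiece improves on is the classic greedy longest-match ("MaxMatch") WordPiece, which can re-scan characters and is O(n²) per word in the worst case; Fast WordPiece avoids the re-scanning with Aho-Corasick-style failure links. A minimal sketch of that baseline (illustrative only, not the paper's or the library's implementation):

```rust
use std::collections::HashSet;

/// Greedy longest-match ("MaxMatch") WordPiece: the O(n^2)-per-word baseline
/// that Fast WordPiece reduces to O(n) via Aho-Corasick-style failure links.
/// Illustrative sketch only.
fn wordpiece(word: &str, vocab: &HashSet<String>) -> Option<Vec<String>> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining span first and shrink until a vocab hit.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{}", piece); // continuation-piece prefix
            }
            if vocab.contains(&piece) {
                found = Some((piece, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((piece, next)) => {
                pieces.push(piece);
                start = next;
            }
            // Real tokenizers emit [UNK] here; we just bail out.
            None => return None,
        }
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<String> = ["un", "##aff", "##able", "##ab", "##le"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", wordpiece("unaffable", &vocab));
}
```

The inner shrinking loop is exactly the re-scanning that the paper's failure links eliminate: when a long candidate fails, the automaton jumps straight to the next viable state instead of retrying shorter prefixes from scratch.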

The paper benchmarks Fast WordPiece's performance against standard WordPiece and one of HuggingFace's tokenizers.

Your contribution

If possible, I'd like to try implementing this and sending a PR if that's okay!

I'm considering writing a paper with some possible improvements on Fast WordPiece, and wanted the experience of implementing Fast WordPiece to better understand it. Additionally, I'd like to keep improving at Rust, so this is a nice excuse to practice. :)

Thanks!

catleeball avatar Sep 16 '22 18:09 catleeball

cc @Narsil @SaulLu

LysandreJik avatar Sep 16 '22 18:09 LysandreJik

Ah, I just realized I should have opened this in huggingface/tokenizers, I can re-create this bug over there if that's better!

catleeball avatar Sep 16 '22 21:09 catleeball

Let me move it for you :)

LysandreJik avatar Sep 16 '22 22:09 LysandreJik

Ah, thank you! :)

catleeball avatar Sep 17 '22 00:09 catleeball

I am not sure what they compared against.

I already attempted aho-corasick, and in practice it performed poorly: https://github.com/huggingface/tokenizers/pull/863

This is because we already have a pretokenizer that splits items on spaces (in 90%+ of tokenizers), which means the current WordPiece implementation is actually faster than the Aho-Corasick implementation.

It might be faster if you remove a bunch of features of this lib (like normalization and keeping offsets) to make it a pure Aho-Corasick implementation, but as far as I know it's not really possible to keep all features and implement it this way. Unigram, for starters, is not implementable as Aho-Corasick. Tools like BlingFire from Microsoft claim to be faster and purely DFA-based: https://github.com/microsoft/BlingFire.

I am highly unsure how they can make that work for both BPE and Unigram, which are not (AFAIK) equivalent to greedy algorithms... but they seem to have parity, so there must be a solution.

Narsil avatar Sep 27 '22 14:09 Narsil

Sorry for my slow reply here @Narsil ! I had my github emails going to a filter and missed this update.

I am not sure what they compared against. [...] I am highly unsure how they can make that happen both on BPE and Unigram

I wanted to investigate the results from section 5 of their paper in more detail anyway, so I sent Dr. Song an email asking if he's able to share his implementation or other details. I'll follow up here.

This is because we already have a pretokenizer that splits items on spaces (in 90%+ of tokenizers) which means the wordpiece current implementation is actually faster than the aho-corasick implementation.

In addition to using an Aho-Corasick automaton, they also collapsed the whitespace splitting pass into the tokenization pass. That might explain a little of the delta you saw in your implementation's benchmarks, but it doesn't seem like it would be that large. Maybe there's something else missing.

catleeball avatar Sep 28 '22 00:09 catleeball

but it doesn't seem like it would be that large.

Yes, it is huge. In the first case you're splitting on whitespace (which is quite fast) and then running, say, 30 invocations of an operator that in the vast majority of cases is a single lookup in a table (most words are NOT split, since they are in the vocabulary). When a split does happen it's slower, and it gets slower the more pieces a token has, but since it only operates on individual words, the input size is usually small.

If you could fold everything into a single pass (normalization, pretokenization, tokenization), it would indeed be faster, but transformers, the biggest user of this lib, quite often needs individual passes for flexibility.
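The fast path described above can be sketched roughly as follows (a hypothetical simplification, not the library's actual code path): after whitespace pretokenization, the common case is one hash lookup per whole word, and only out-of-vocabulary words fall through to sub-word matching.

```rust
use std::collections::HashSet;

// Hypothetical sketch of the common-case fast path: with a whitespace
// pretokenizer, most words are found whole in the vocabulary, so
// tokenization costs a single hash lookup per word. Only the rare
// out-of-vocab word would pay for sub-word (WordPiece) matching,
// which is stubbed out as [UNK] here.
fn tokenize(text: &str, vocab: &HashSet<String>) -> Vec<String> {
    let mut out = Vec::new();
    for word in text.split_whitespace() {
        if vocab.contains(word) {
            out.push(word.to_string()); // common case: one lookup
        } else {
            out.push("[UNK]".to_string()); // real code runs WordPiece here
        }
    }
    out
}

fn main() {
    let vocab: HashSet<String> =
        ["the", "cat"].iter().map(|s| s.to_string()).collect();
    println!("{:?}", tokenize("the cat sat", &vocab));
}
```

A single-pass automaton has to beat this per-word hash lookup in the common case, which is why the earlier Aho-Corasick attempt lost ground.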

Narsil avatar Sep 29 '22 09:09 Narsil

Understood, thanks for explaining that! It sounds like Fast WordPiece's approach might not mesh nicely with HF's current tokenizer organization, at least not without a lot of non-trivial work.

Feel free to close this then if you think it won't be viable!

P.S. Dr. Song got back to me. He no longer has the original code or sampled data, but he did supply me with his general benchmarking template code that he said would be very similar to what he used during the experiments.

I'll likely do benchmarking over the weekend, attempting to reproduce the paper's results, and also single-language benchmarks to compare. Would you want me to keep this bug updated with the results if they'll be useful to you or your future developments?

catleeball avatar Sep 29 '22 17:09 catleeball

Feel free to close this then if you think it won't be viable!

This is a hunch, but doing the work will probably prove the results. Please go ahead and do it; I just wanted to manage expectations and show you previous work so that you could get started faster.

Narsil avatar Oct 05 '22 08:10 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 01 '24 01:02 github-actions[bot]