Count the number of tokens a tokenizer might produce without actually tokenizing?
Using len(tokenizer(_2encode_line_list).input_ids[0]) is fast for counting the number of tokens a line might produce, but is there any faster way? The tokenization process does some mapping in the vocab, and I just want to know the number, and know it fast.
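For context, a minimal sketch of how I count right now (the model name and sample lines are placeholders, not my actual setup):

from transformers import AutoTokenizer

# placeholder model; any fast (Rust-backed) tokenizer behaves the same way
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lines = ["mov eax, ebx", "push rbp"]  # placeholder input lines

# batch-encode once, then read the per-line token counts
counts = [len(ids) for ids in tokenizer(lines).input_ids]
print(counts)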
Sorry, but no: there's no fast way to know unless you do the full tokenization.
Mileage may vary, and on specific tokenizers you could go faster than this library by taking shortcuts, but in general you can't. The regular BPE algorithm is O(n log(n)); there's no real way to go faster.
But encoding in general is pretty fast. Do you mind sharing what kind of data you want to work on, and the speed you imagine getting?
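As a rough illustration (a minimal sketch, not a proper benchmark; the model and workload are placeholder assumptions), counting via full batch encoding is usually fast enough in practice:

import time
from tokenizers import Tokenizer

# placeholder pretrained tokenizer; substitute your own
tok = Tokenizer.from_pretrained("bert-base-uncased")
lines = ["push rbp", "mov rbp, rsp"] * 5_000  # placeholder workload

start = time.perf_counter()
encodings = tok.encode_batch(lines)  # encoding runs in parallel in Rust
counts = [len(e.ids) for e in encodings]
print(f"{len(lines)} lines in {time.perf_counter() - start:.3f}s")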
I'm working on a task to compare function disassembly from binary files. The maximum token length of each function is set to 512, but for functions longer than 512 I need to decide which instruction disassembly to keep and which to ignore, based on the token length of each instruction's disassembly (see the sketch after the tokenizer definition below).
And I'm not using the BPE algorithm, but rather WordLevel from scratch:
from tokenizers import Tokenizer, processors
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer

# word-level model: any word not in the vocab maps to [UNK]
m = WordLevel(unk_token='[UNK]')
# define tokenizer
wl_tokenizer = Tokenizer(m)
wl_tokenizer.pre_tokenizer = BertPreTokenizer()
wl_tokenizer.add_special_tokens(["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
wl_tokenizer.normalizer = BertNormalizer()
# wrap each sequence as [CLS] ... [SEP]
wl_tokenizer.post_processor = processors.RobertaProcessing(
    sep=("[SEP]", wl_tokenizer.token_to_id("[SEP]")),
    cls=("[CLS]", wl_tokenizer.token_to_id("[CLS]")),
)
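A minimal sketch of the selection step I have in mind (the instruction list is a placeholder, it assumes the vocab has already been trained, and the 2-token reserve assumes the [CLS]/[SEP] added by the post-processor above):

MAX_TOKENS = 512
instructions = ["push rbp", "mov rbp, rsp"]  # placeholder: one disassembled instruction each

kept, used = [], 2  # reserve 2 tokens for [CLS] and [SEP]
for ins in instructions:
    # count tokens per instruction, excluding special tokens
    n = len(wl_tokenizer.encode(ins, add_special_tokens=False).ids)
    if used + n > MAX_TOKENS:
        break  # or skip this instruction and keep checking shorter ones
    kept.append(ins)
    used += n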
Seems interesting!
WordPiece is definitely the best candidate for faster tokenization; it should be O(n). In practice, though, it is not obvious that the "slow" algorithm is actually slower:
You can check out this PR to see if it improves things on your end (if you run into issues with it, please ping; it might still contain bugs): https://github.com/huggingface/tokenizers/pull/863
The original intent of this PR was triggered by https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html. However, since Bert splits on whitespace by default, it seems that the O(n) algorithm is a bit slower than its O(n²) counterpart. My guess from initial investigation is that we operate on small English words which are almost always already in the dictionary, so we are actually O(1) in most cases and only O(n²) on very rare occasions (it would require a word of length n to contain exactly n tokens).
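For illustration, a hedged sketch of that rare worst case (the vocab is artificial, and max_input_chars_per_word is raised because the WordPiece model caps word length at 100 by default):

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# artificial vocab where every character is its own token
vocab = {"[UNK]": 0, "a": 1, "##a": 2}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]", max_input_chars_per_word=2000))

word = "a" * 1000  # a length-n word that tokenizes into exactly n tokens
encoding = tok.encode(word)
print(len(encoding.ids))  # 1000: the greedy longest-match loop rescans suffixes, hence O(n²)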
The slowdown seems relatively small, however, so if your use case would hugely benefit from such a change we'll definitely consider switching (the slowdown is small on our benchmark, so if we add another benchmark where the difference is night and day, the change would be a definite net gain).
To test, just check out the PR and install the local version of the Python bindings with pip install -e bindings/python.
Eager to see if it makes a difference for you.