
Optimize statistical unigram tokenizer `decode_forward`

Open aria42 opened this issue 4 years ago • 2 comments

I was testing the SentencePieceModel tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Looking at the code for the `decode_forward` pass, it considers candidate spans `(char_start, char_end)` of arbitrary length, even though the vocabulary has some maximum-length element. Constraining decoding to spans of at most this maximum length yields the same result, since no longer substring can be present in the vocabulary, and the change has a dramatic impact when tokenizing longer pieces of text. This PR addresses the problem by computing a `max_vocab_codeunit_len` field for the `SentencePieceModel`, caching the longest code-unit length of any vocabulary element; that field is then used to truncate the search during decoding.
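To illustrate the idea, here is a minimal sketch of a Viterbi-style forward decode whose candidate spans are capped at the longest vocabulary entry, measured in code units. `TinyVocab`, `decode_forward_sketch`, and the toy vocabulary are illustrative names and not the actual WordTokenizers.jl implementation:

```julia
# Minimal sketch of a forward decode with spans capped at the longest
# vocabulary entry (in code units). Hypothetical names, not the package API.

struct TinyVocab
    logprob::Dict{String,Float64}
    max_codeunit_len::Int   # cached longest vocabulary entry, in code units
end

TinyVocab(lp::Dict{String,Float64}) =
    TinyVocab(lp, maximum(ncodeunits(k) for k in keys(lp)))

function decode_forward_sketch(v::TinyVocab, text::String)
    n = ncodeunits(text)
    best = fill(-Inf, n + 1)   # best[b+1] = best score over the first b code units
    best[1] = 0.0
    prev = zeros(Int, n + 1)   # start index of the winning piece ending at b
    for i in eachindex(text)   # candidate starts: valid character boundaries
        best[i] == -Inf && continue
        j = i
        while j <= n
            stop = nextind(text, j) - 1                  # last code unit of the char at j
            stop - i + 1 > v.max_codeunit_len && break   # the key optimization
            lp = get(v.logprob, text[i:j], -Inf)
            if lp > -Inf && best[i] + lp > best[stop + 1]
                best[stop + 1] = best[i] + lp
                prev[stop + 1] = i
            end
            j = nextind(text, j)
        end
    end
    # Backtrace (assumes the text is fully coverable by the vocabulary).
    pieces = String[]
    b = n + 1
    while b > 1
        i = prev[b]
        push!(pieces, text[i:thisind(text, b - 1)])
        b = i
    end
    reverse(pieces)
end
```

With a toy vocabulary where "ab" scores better than "a" plus "b", the sketch prefers the longer piece while never scanning more than `max_codeunit_len` code units ahead of any start position.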

The gist below highlights the performance gap. There is existing test coverage for this code and those tests still pass, but I'm happy to add more if there's something else to test in the implementation.

using HTTP
using WordTokenizers

spm = load(ALBERT_V1)
# Download Hamlet text and truncate to roughly the first 5k bytes;
# thisind snaps the cut point back to a valid character boundary.
long_text = String(HTTP.get("https://dlg.usg.edu/record/dlg_zlgb_gb5027/fulltext.text").body)
max_len = 5000
long_text = long_text[begin:thisind(long_text, max_len)]
@time spm(long_text)

Before this PR: 7.098319 seconds (195.33 k allocations: 11.746 MiB, 1.41% compilation time)
With this PR: 0.016252 seconds (8.23 k allocations: 1.026 MiB)

aria42 · Dec 30 '21 22:12

Thanks @aviks. I'm assuming this means I can merge the PR, or do I need another approver?

aria42 · Jan 07 '22 04:01

@aviks I think you need to merge it, since I can't do so myself.

aria42 · Jan 24 '22 23:01