unexpected behaviour byte char level tokenizer
I have a file that contains 1M times the following line (and nothing else): <DOCUMENT> \test{bla} thisisatest </DOCUMENT>
I learn a tokenizer on it with:
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer(add_prefix_space=False)
tokenizer.train(
    [FILEPATH],
    vocab_size=10000,
    min_frequency=0,
    show_progress=True,
)
then, when doing:
encoded = tokenizer.encode(r"<DOCUMENT> \test{bla} thisisatest </DOCUMENT>")
print(encoded.tokens)
I get: ['<', 'DOCUMENT', '>', 'Ġ\\', 'test', '{', 'bla', '}', 'Ġthisisatest', 'Ġ</', 'DOCUMENT', '>']
and I don't understand why. It seems like the tokenizer is not processing some chars like "<", ">", "{" like the others, as if they were special tokens. What should I do to have <DOCUMENT> as a single token? I also tried repeating this line mixed with other regular text, but I always have the same issue, i.e. <DOCUMENT> never becomes a token.
Hi @glample ,
This is expected because the pre_tokenizer splits with this regex:
https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/byte_level.rs#L35 https://github.com/huggingface/transformers/blob/master/src/transformers/models/gpt2/tokenization_gpt2.py#L193
Which can be traced to the original implementation: https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53
Because the tokens are pre-split, the BPE algorithm never sees them as a single token.
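For illustration, here is a rough stdlib approximation of that GPT-2 pattern (the real one uses the regex module with Unicode classes like \p{L} and \p{N}; ASCII classes are substituted here), showing how the pre-tokenizer splits your line before BPE ever runs:

```python
import re

# ASCII-only approximation of the GPT-2 pre-tokenization pattern
# (the real pattern uses \p{L} / \p{N} via the `regex` module).
GPT2_LIKE = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""
)

pieces = GPT2_LIKE.findall(r"<DOCUMENT> \test{bla} thisisatest </DOCUMENT>")
print(pieces)
# '<', 'DOCUMENT' and '>' come out as separate pieces, so BPE can
# never merge them back into a single <DOCUMENT> token.
```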
One simple way to change that is to use pre_tokenizers.WhitespaceSplit(), which uses the char::is_whitespace method to split.
from tokenizers import ByteLevelBPETokenizer, pre_tokenizers

tokenizer = ByteLevelBPETokenizer(add_prefix_space=False)
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
...
This might lead to other consequences though, so maybe be careful about the decoder part.
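To see the difference, plain whitespace splitting (which WhitespaceSplit approximates; Python's str.split() is used here as a stand-in, ignoring the offset tracking the real pre-tokenizer does) keeps <DOCUMENT> intact as a single pre-token:

```python
# str.split() splits on runs of whitespace, similar in spirit to
# pre_tokenizers.WhitespaceSplit() using char::is_whitespace.
line = r"<DOCUMENT> \test{bla} thisisatest </DOCUMENT>"
pieces = line.split()
print(pieces)
# <DOCUMENT> survives as one piece, so BPE can learn it as one token.
```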
The best way might be to try to use the raw components directly:
Components https://huggingface.co/docs/tokenizers/python/latest/components.html and how to assemble them.
Thanks for your answer. WhitespaceSplit alone does not seem compatible with the ByteLevelBPETokenizer, does it? Typically, if I try to encode characters never seen before, it fails to decode them. However, I got it working with:
from tokenizers import ByteLevelBPETokenizer, pre_tokenizers

tokenizer = ByteLevelBPETokenizer(add_prefix_space=False)
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(),
])
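For reference, Digits(individual_digits=True) isolates every digit as its own pre-token; a stdlib regex sketch of the same split:

```python
import re

# Mimics pre_tokenizers.Digits(individual_digits=True): each digit
# becomes its own piece, non-digit runs stay together.
def split_digits(text):
    return re.findall(r"\d|\D+", text)

print(split_digits("version2021 test"))
# ['version', '2', '0', '2', '1', ' test']
```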
However, the problem is that the ByteLevel at the end always applies the same regex as above, and the things I would want to see as a single token are always split. In the code, I can see a TODO here regarding the possibility to modify this regex:
https://github.com/huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/byte_level.rs#L90
Would it be possible to add this option? That would be very useful! I'm really surprised by this regex, actually. To me, hardcoding a regex like this is like baking in a custom pre-tokenizer, which kind of defeats the purpose of ByteLevel. A pure byte-level split (even without whitespace splitting) is what I thought was behind ByteLevel. It's probably too late to change the default behavior, but having the option to modify the regex would be convenient.
Thank you!
You are entirely correct that it should be overridable, as mentioned in the TODO.
To my knowledge this is doable; the Replace normalizer does it (https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.normalizers.Replace).
Personally, I don't think I will have much time to dedicate to this, but PRs are welcome (overall it should be very similar to what happens for Replace).
It's even more confusing since byte level doesn't really handle bytes (ByT5 in transformers does that, and it doesn't use tokenizers since it works on plain bytes). It just maps individual bytes to printable UTF-8 characters, which means outputs are always valid UTF-8 strings (which wouldn't be the case for raw bytes). It's close enough, but as you mention, it does add this odd split.
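Concretely, that byte-to-character mapping (reimplemented here following the original GPT-2 encoder.py linked above) takes each of the 256 byte values to a printable character; the space byte becomes 'Ġ', which is why that marker shows up in your encoded tokens:

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character,
    as in OpenAI's GPT-2 encoder.py."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, etc.) are shifted past 255
    # so they still get a unique printable stand-in.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # 'Ġ' — the space marker seen in the tokens
print(len(mapping))       # 256: every byte has a printable stand-in
```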
If you're fine with char (https://doc.rust-lang.org/std/primitive.char.html) level whitespace split, you can try:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Digits(individual_digits=True)]
)
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    min_frequency=0,
    show_progress=True,
)
tokenizer.train(["test.txt"], trainer)
print(tokenizer.encode(r"<DOCUMENT> \test{bla} thisisatest </DOCUMENT>").tokens)
print(tokenizer.decode(tokenizer.encode(r"<DOCUMENT> \test{bla} thisisatest </DOCUMENT>").ids))
tokenizer.save("tokenizer.json")
Would that work in your use case?