How can I get the byte-to-Unicode character mapping used by the fast tokenizer?

Open LuoKaiGSW opened this issue 1 year ago • 5 comments

I have a model that uses BloomTokenizerFast, which does not have attributes like byte_decoder or sp_model, so I can't figure out how it maps byte values to Unicode characters. I've looked through the source code and only found that the pre_tokenize_str function converts the characters of the input text into these Unicode characters, but I couldn't find the mapping it relies on. So how can I find this mapping? Or does the fast tokenizer use the same mapping as GPT-2?

LuoKaiGSW avatar Jun 04 '24 07:06 LuoKaiGSW

Hey! I suppose you are using Python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

ArthurZucker avatar Jun 05 '24 07:06 ArthurZucker

> Hey! I suppose you are using Python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

Thank you for your reply, but I didn't fully understand what you meant. After using tokenizer._tokenizer.model, I got a BPE object, but I didn't see the attribute I wanted in it - that is, the mapping from byte values to Unicode. Could you explain it a bit more clearly, please?

LuoKaiGSW avatar Jun 05 '24 08:06 LuoKaiGSW

You cannot see any attributes because neither __repr__ nor __str__ is implemented.

ArthurZucker avatar Jun 11 '24 13:06 ArthurZucker

> You cannot see any attributes because neither __repr__ nor __str__ is implemented.

So, is it impossible to read this mapping from the fast tokenizer?

LuoKaiGSW avatar Jun 11 '24 13:06 LuoKaiGSW

It is coming with the PR that I linked 😉

ArthurZucker avatar Jun 11 '24 16:06 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 12 '24 01:08 github-actions[bot]

Closing as we do have the capabilities merged now!

ArthurZucker avatar Aug 16 '24 09:08 ArthurZucker
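
For readers landing here with the same question: the thread never spells the mapping out, so below is a minimal sketch of the byte-to-Unicode table as GPT-2's reference encoder defines it. It assumes the fast tokenizer's byte-level pre-tokenizer uses that same table, which is worth checking against the output of pre_tokenize_str on your own tokenizer. The names BYTE_TO_CHAR, CHAR_TO_BYTE, to_byte_level and from_byte_level are hypothetical helpers for illustration and are not part of the tokenizers API.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table: byte value (0-255) -> one printable Unicode char."""
    # Bytes that are already printable, non-space characters map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control chars, space, etc.) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Hypothetical helper names, introduced only for this example.
BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Encode text as UTF-8, then map each byte to its stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(mapped):
    """Invert the mapping: stand-in characters -> bytes -> UTF-8 text."""
    return bytes(CHAR_TO_BYTE[c] for c in mapped).decode("utf-8")


if __name__ == "__main__":
    print(to_byte_level("hello world"))
    print(from_byte_level(to_byte_level("hello world")))
```

If the assumption holds, comparing to_byte_level(text) with the pieces returned by pre_tokenize_str for the same text is a quick way to confirm that both use the same table.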