How to load a tokenizer trained by sentencepiece or tiktoken
Hi, does this lib support loading a pre-trained tokenizer trained with other libs, like sentencepiece and tiktoken? Many models on the HF Hub store their tokenizers in these formats.
For sentencepiece the conversion mostly lives in transformers, and for tiktoken we don't have one directly 😢 It's planned for both!
@xenova could you share some of your automations?
Here's my tiktoken-to-hf conversion script: https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee And then we already have an SPM converter :)
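For the sentencepiece side, here's a minimal sketch of the transformers route (assuming a Llama-style `tokenizer.model`; the slow tokenizer class and filenames will differ per model, so treat these as placeholders):

```python
# Sketch: convert a sentencepiece model into a fast (tokenizers) tokenizer via transformers.
# Assumes a Llama-style sentencepiece vocab file; swap in the slow tokenizer class
# that matches your model.
from transformers import LlamaTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = LlamaTokenizer("tokenizer.model")   # path to the sentencepiece vocab file
fast = convert_slow_tokenizer(slow)        # returns a tokenizers.Tokenizer
fast.save("tokenizer.json")                # loadable with Tokenizer.from_file or PreTrainedTokenizerFast
```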
Transformers now has an "official" tiktoken converter: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1478