How to load tokenizer trained by sentencepiece or tiktoken

Open jordane95 opened this issue 1 year ago • 4 comments

Hi, does this lib support loading pre-trained tokenizers trained by other libs, like sentencepiece and tiktoken? Many models on the HF Hub store their tokenizers in these formats.

jordane95 avatar Mar 13 '24 10:03 jordane95

For sentencepiece, conversion mostly lives in transformers; for tiktoken we don't have a converter directly 😢 It's planned for both!

ArthurZucker avatar Mar 27 '24 15:03 ArthurZucker

@xenova if you can share some automations!

ArthurZucker avatar Mar 27 '24 15:03 ArthurZucker

Here's my tiktoken-to-hf conversion script: https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
And we already have an SPM converter :)

xenova avatar Mar 27 '24 15:03 xenova
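For anyone wondering what such a conversion involves, here is a minimal stdlib-only sketch of the first step. It assumes the usual tiktoken rank file layout (one `<base64 token> <rank>` pair per line) and a byte-level BPE target that uses the GPT-2 byte-to-unicode table. The function names are made up for illustration and are not taken from the gist or from transformers. It only builds the vocab; recovering the merges list and the pre-tokenizer regex is not shown.

```python
import base64


def bytes_to_unicode():
    """GPT-2 style table mapping each byte (0-255) to a printable unicode char."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def parse_tiktoken_ranks(text):
    """Parse '<base64 token> <rank>' lines into a {bytes: rank} dict."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


def to_byte_level_vocab(ranks):
    """Re-key the ranks by their byte-level string form (hypothetical helper)."""
    table = bytes_to_unicode()
    return {"".join(table[b] for b in token): rank for token, rank in ranks.items()}


# Tiny hand-made sample: the tokens b"a", b" " and b" a".
sample = "YQ== 0\nIA== 1\nIGE= 2\n"
vocab = to_byte_level_vocab(parse_tiktoken_ranks(sample))
print(vocab)
```

The resulting dict has the same shape as the vocab of a byte-level BPE model, which is why the space byte shows up as `Ġ`. A real conversion still has to rebuild the merges and carry over special tokens.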

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 27 '24 01:04 github-actions[bot]

Transformers now has an "official" tiktoken converter: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1478

ArthurZucker avatar Apr 30 '24 10:04 ArthurZucker