tokenizers
🔥 Fast State-of-the-Art Tokenizers optimized for Research and Production
I'm not sure whether this is related to https://github.com/huggingface/tokenizers/issues/892: the code below replaces digits with the special `` token. The `tokens` and `ids` are correct, but the offsets of...
@Narsil, this is based on my comment on https://github.com/huggingface/tokenizers/issues/985. I'd like to introduce a special token `` into my vocabulary and replace any digits in the input (i.e., using the...
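For context, a minimal sketch of that setup, assuming a hypothetical `[NUM]` special token (both comments above elide the actual token name) and the Python bindings' `Replace` normalizer driven by a `Regex` pattern:

```
from tokenizers import Tokenizer, Regex, AddedToken
from tokenizers.models import WordLevel
from tokenizers.normalizers import Replace
from tokenizers.pre_tokenizers import WhitespaceSplit

# Tiny vocabulary; "[NUM]" stands in for the elided special token.
vocab = {"[UNK]": 0, "[NUM]": 1, "i": 2, "have": 3, "cats": 4}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()

# Rewrite every digit run into the special token before the model sees it.
tokenizer.normalizer = Replace(Regex(r"\d+"), "[NUM]")
# normalized=True so the added token is matched *after* normalization runs.
tokenizer.add_special_tokens([AddedToken("[NUM]", normalized=True)])

enc = tokenizer.encode("i have 42 cats")
print(enc.tokens)   # tokens and ids come out as expected
print(enc.offsets)  # the offsets after the replacement are what the issues dispute
```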
I get the following error when installing tokenizers from source (I'm on a MacBook M1, so I believe I can't install using pip):
```
Copying rust artifact from /Users/karimfoda/Documents/STUDIES/PYTHON/SHORTFORM/tokenizers/bindings/python/target/release/libtokenizers.dylib to build/lib.macosx-11.1-arm64-cpython-310/tokenizers/tokenizers.cpython-310-darwin.so...
```
I'd like to turn off the output that Hugging Face generates when I use _unique_no_split_tokens_, so that the following code executes cleanly without all the "Using ..." messages: > In[2] tokenizer...
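A likely fix, as a hedged sketch: transformers ships a logging helper that raises the verbosity threshold, which should silence informational lines like these (assuming the messages go through the library's logger):

```
from transformers import logging

# Only errors are printed from here on; the "Using ..." lines are
# emitted at a lower level and should disappear.
logging.set_verbosity_error()
```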
Hi, I'm attempting to simply serialize and then deserialize a trained tokenizer. When I run the following code:
```
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'...
```
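For reference, a round-trip sketch under a recent tokenizers release (where `train` takes the file list first); the paths mirror the snippet above:

```
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=280, special_tokens=["[UNK]"])
tokenizer.train(["preprocessing/corpus/corpus.txt"], trainer)

# Serialize to JSON, then rebuild an equivalent tokenizer from the file.
tokenizer.save("preprocessing/tokenizer.json")
restored = Tokenizer.from_file("preprocessing/tokenizer.json")
```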
I'm trying to save and load a fine-tuned model that I trained following https://huggingface.co/transformers/v1.0.0/model_doc/overview.html#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump, but I'm facing the error below. Version: transformers 4.20.1. Traceback (most recent...
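For comparison, the usual save/load round trip in transformers 4.x looks like this sketch; `my-finetuned-model` is a placeholder directory:

```
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ... fine-tune the model here ...

# Persist weights, config, and tokenizer files into one directory.
model.save_pretrained("my-finetuned-model")
tokenizer.save_pretrained("my-finetuned-model")

# Restore later from that same directory.
model = AutoModelForSequenceClassification.from_pretrained("my-finetuned-model")
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-model")
```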
Do you plan to officially support such a binding? It seems pretty logical; after all, Rust produces native code. We have a C++ product and need to implement...
Hi, I was wondering whether there is regex support for the Replace normalizer. It seems that Replace just replaces strings verbatim right now. Thanks! Stéphan
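For what it's worth, recent releases of the Python bindings expose a `tokenizers.Regex` class that `Replace` accepts in place of a verbatim string; a minimal sketch:

```
from tokenizers import Regex
from tokenizers.normalizers import Replace

literal = Replace("abc", "x")           # plain string: matched verbatim
regex = Replace(Regex(r"\s+"), " ")     # Regex: collapse whitespace runs

print(regex.normalize_str("a   b\t c"))  # -> "a b c"
```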
I've added an arm64 runner that you can check out in the Settings tab, and edited the CI so it releases arm64 binaries!
```
$ pip install tokenizers
Collecting tokenizers
  Downloading tokenizers-0.12.1.tar.gz (220 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 220.7/220.7 kB 4.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done...
```