[tokenizer] Inconsistent behavior in slow tokenizer and fast tokenizer
System Info
- transformers version: 4.35.2
- Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
- Python version: 3.8.18
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no need
- Using distributed or parallel set-up in script?: no need
Who can help?
@ArthurZucker and @younesbelkada
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer
def answer_or_exception(tokenizer, id):
    # Print the decoded string for a single id, or the exception message if decoding fails.
    print(f'<<<<<<{tokenizer.__class__}>>>>>>')
    try:
        print(f'"{tokenizer.decode([id])}"')
    except Exception as e:
        print(e)
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=False)
# vocab size: 50294
answer_or_exception(tokenizer, 50294) # correct
answer_or_exception(tokenizer, 50295) # wrong
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=True)
# vocab size: 50294
answer_or_exception(tokenizer, 50294) # correct
answer_or_exception(tokenizer, 50295) # correct
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=False)
# vocab size: 31999
answer_or_exception(tokenizer, 31999) # correct
answer_or_exception(tokenizer, 32000) # wrong
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=True)
# vocab size: 31999
answer_or_exception(tokenizer, 31999) # correct
answer_or_exception(tokenizer, 32000) # correct
Output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
" "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
sequence item 0: expected str instance, NoneType found
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
" "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
""
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
"ç»™"
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
piece id is out of range.
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
"ç»™"
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
""
Expected behavior
Consistent decode behavior between the slow and fast tokenizers when the id exceeds the vocab size. For example, instead of raising an exception, the slow tokenizer could output an empty string like the fast tokenizer does.
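In the meantime, a possible user-side workaround (a minimal sketch; safe_decode is a hypothetical helper, not part of transformers):

def safe_decode(tokenizer, ids):
    # Drop ids outside the tokenizer's known range so the slow tokenizer
    # behaves like the fast one (empty output) instead of raising.
    vocab_size = len(tokenizer)  # includes added special tokens
    return tokenizer.decode([i for i in ids if 0 <= i < vocab_size])

print(safe_decode(tokenizer, [31999, 32000]))  # decodes 31999, silently skips 32000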
Hey! Thanks for opening an issue. A few things first: you are using a custom / local checkpoint with trust_remote_code.
The fast tokenizer does not error out when you feed it an out-of-vocabulary id, while the slow one does, which is indeed inconsistent. Would you like to open a PR for a fix? 🤗
Yes, I'll try that. Thank you for your reply!
@ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?
I'm OK with that. I've been busy with other things recently.
Sure 🤗
@ArthurZucker @Ki-Seki Has it been fixed yet? I want to start working on it.
The PR is closed, so you can probably work on it!
@ArthurZucker What should the behavior be when both tokenizer types encounter an OOV token? Should it simply raise an index error exception, or do you have something else in mind?
Not entirely sure, but imo let's align with what the fast tokenizer does!
@ArthurZucker We can start by raising a warning about this new change. After a certain version release, we can introduce a custom exception for OOV ids in the tokenizer.
Yeah that sounds reasonable
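A rough sketch of that staged approach (illustrative only; _check_decode_ids is a hypothetical helper, not the actual transformers internals):

import warnings

def _check_decode_ids(ids, vocab_size):
    # Warn about OOV ids for now and filter them out; a later release could
    # turn this warning into a dedicated exception.
    oov = [i for i in ids if i < 0 or i >= vocab_size]
    if oov:
        warnings.warn(
            f"Token ids {oov} are out of the vocabulary and will be ignored; "
            "a future release will raise an error instead.",
            FutureWarning,
        )
    return [i for i in ids if 0 <= i < vocab_size]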
@ArthurZucker
My plan is to start raising a warning in the fast tokenizer implementation so that users know when out-of-vocabulary (OOV) ids are being ignored. In the slow tokenizer, OOV ids currently trigger different errors by default.
The main challenge is that most model-specific tokenizers don't go through the base decode method in tokenization_utils; instead, they override it in their own tokenizer classes.
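To illustrate the overriding problem (the class below is purely illustrative, not taken from the repository):

from transformers import PreTrainedTokenizer

class MyCustomTokenizer(PreTrainedTokenizer):
    # A tokenizer that overrides decode directly. A warning added to the base
    # decode in tokenization_utils would never run here, so the OOV check may
    # need to live in a lower-level method such as convert_ids_to_tokens.
    def decode(self, token_ids, **kwargs):
        ...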
Could you please review the pull request (#32912) and share your thoughts?