
[tokenizer] Inconsistent behavior in slow tokenizer and fast tokenizer

Open Ki-Seki opened this issue 1 year ago • 12 comments

System Info

  • transformers version: 4.35.2
  • Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
  • Python version: 3.8.18
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no need
  • Using distributed or parallel set-up in script?: no need

Who can help?

@ArthurZucker and @younesbelkada

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer


def answer_or_exception(tokenizer, token_id):
    # Print the tokenizer class, then either the decoded string or the raised exception.
    print(f'<<<<<<{tokenizer.__class__}>>>>>>')
    try:
        print(f'"{tokenizer.decode([token_id])}"')
    except Exception as e:
        print(e)


tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=False)
# max valid token id: 50294 (vocab size 50295)
answer_or_exception(tokenizer, 50294)  # in vocab: decodes fine
answer_or_exception(tokenizer, 50295)  # OOV: raises an exception

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=True)
# max valid token id: 50294 (vocab size 50295)
answer_or_exception(tokenizer, 50294)  # in vocab: decodes fine
answer_or_exception(tokenizer, 50295)  # OOV: returns an empty string


tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=False)
# max valid token id: 31999 (vocab size 32000)
answer_or_exception(tokenizer, 31999)  # in vocab: decodes fine
answer_or_exception(tokenizer, 32000)  # OOV: raises an exception

tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=True)
# max valid token id: 31999 (vocab size 32000)
answer_or_exception(tokenizer, 31999)  # in vocab: decodes fine
answer_or_exception(tokenizer, 32000)  # OOV: returns an empty string

Output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
"               "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
sequence item 0: expected str instance, NoneType found
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
"               "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
""
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
"ç»™"
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
piece id is out of range.
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
"ç»™"
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
""

Expected behavior

Consistent decode behavior between the slow and fast tokenizers when an id exceeds the vocabulary size. For example, instead of raising an exception, the slow tokenizer could output an empty string like the fast tokenizer does.
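
For reference, the slow-tokenizer failures above originate in convert_ids_to_tokens: the CodeGen slow tokenizer returns None for an out-of-range id, which then breaks the "".join(...) inside decode (hence "expected str instance, NoneType found"), while the Llama slow tokenizer lets sentencepiece raise "piece id is out of range.". A minimal sketch of one way to align the slow path with the fast one follows; the helper name and exact handling are illustrative, not actual transformers code:

from typing import List

def safe_convert_ids_to_tokens(tokenizer, ids: List[int]) -> List[str]:
    # Hypothetical helper: make OOV ids decode to nothing, mirroring the
    # fast tokenizer, instead of surfacing as None (CodeGen slow) or as
    # an IndexError from sentencepiece (Llama slow).
    tokens = []
    for token_id in ids:
        try:
            token = tokenizer.convert_ids_to_tokens(token_id)
        except IndexError:  # sentencepiece: "piece id is out of range."
            continue
        if token is None:  # vocab lookup miss in dict-based tokenizers
            continue
        tokens.append(token)
    return tokens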

Ki-Seki avatar Feb 21 '24 02:02 Ki-Seki

Hey! Thanks for opening an issue. A few things first: you are using a custom / local checkpoint with trust_remote_code.

The fast tokenizer is not erroring out when you feed it an OOV id, while the slow one is, and that is indeed inconsistent. Would you like to open a PR for a fix? 🤗

ArthurZucker avatar Feb 21 '24 03:02 ArthurZucker

Yes, I'll try that. Thank you for your reply!

Ki-Seki avatar Feb 21 '24 03:02 Ki-Seki

@ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?

hackpk avatar May 03 '24 05:05 hackpk

> @ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?

I'm OK with that; I've been busy with other things lately. 😭

Ki-Seki avatar May 03 '24 06:05 Ki-Seki

Sure 🤗

ArthurZucker avatar May 03 '24 07:05 ArthurZucker

@ArthurZucker @Ki-Seki Has it been fixed yet? I want to start working on it.

akkefa avatar Aug 16 '24 12:08 akkefa

The PR is closed, so you can probably work on it!

ArthurZucker avatar Aug 19 '24 14:08 ArthurZucker

@ArthurZucker What should the behavior be when both tokenizer types encounter an OOV token? Should it simply raise an index error exception, or do you have something else in mind?

akkefa avatar Aug 20 '24 09:08 akkefa

Not entirely sure, but IMO let's align with what fast does!

ArthurZucker avatar Aug 20 '24 12:08 ArthurZucker

@ArthurZucker We can start by raising a warning about this change. After a certain version release, we can introduce a custom exception for OOV ids in the tokenizer.

akkefa avatar Aug 20 '24 17:08 akkefa

Yeah that sounds reasonable

ArthurZucker avatar Aug 20 '24 17:08 ArthurZucker

@ArthurZucker

My plan is to start raising a warning in the fast tokenizer implementation so that users know when out-of-vocabulary (OOV) ids are being ignored. In the slow tokenizer, OOV ids already raise various errors by default.

The main challenge here is that many model implementations don't use the base decode method from tokenization_utils; instead, they override it in their custom tokenizer classes.
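
For illustration, the warning could look roughly like this; the function name and exact message are placeholders, not the code in the PR:

import warnings
from typing import List

def warn_on_oov_ids(ids: List[int], vocab_size: int) -> None:
    # Hypothetical check for the fast decode path: warn when
    # out-of-vocabulary ids are about to be silently dropped.
    oov = [i for i in ids if i < 0 or i >= vocab_size]
    if oov:
        warnings.warn(
            f"Token ids {oov} are outside the vocabulary (size {vocab_size}) "
            "and will be ignored during decoding. A future release may raise "
            "an error for these ids instead.",
            UserWarning,
        )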

Could you please review the pull request (#32912) and share your thoughts?

akkefa avatar Aug 21 '24 10:08 akkefa