[LLaVa-v1.6] Difference between slow and fast tokenizer for liuhaotian/llava-v1.6-34b
System Info
Transformers v4.39.dev
Who can help?
@ArthurZucker will know this
Reproduction
So when converting the LLaVa-NeXT models, I noticed a discrepancy between the slow/fast tokenizers of this checkpoint: liuhaotian/llava-v1.6-34b
Namely, when tokenizing the following string, I'd expect equivalent input_ids:
from transformers import AutoTokenizer, AddedToken
slow_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", use_fast=False)
slow_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b")
fast_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
input_ids = slow_tokenizer(prompt).input_ids
fast_input_ids = fast_tokenizer(prompt).input_ids
assert input_ids == fast_input_ids
However they are not the same:
print(input_ids)
print(fast_input_ids)
gives
[6, 10707, 144, 47329, 567, 3275, 98, 7, 6, 3903, 144, 64000, 144, 5697, 620, 2709, 594, 719, 2728, 100, 7, 6, 765, 13611, 144]
[6, 1328, 144, 47329, 567, 3275, 98, 7, 6, 2942, 144, 64002, 59568, 144, 5697, 620, 2709, 594, 719, 2728, 100, 7, 6, 14135, 144]
This is currently causing an issue, as reported here, because the AutoTokenizer class loads the fast tokenizer by default, which uses 64002 as the image_token_index, whereas the model has it set to 64000. The slow tokenizer returns the correct input_ids.
The current workaround is to use the slow tokenizer.
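To make the mismatch concrete, here is a minimal sketch of the consistency check the workaround relies on: the tokenizer's id for <image> must equal the model config's image_token_index. This is plain Python using the ids reported in this issue, not a fresh run against the checkpoint.

```python
# Sketch only: 64000 is the id the slow tokenizer assigns to "<image>" and
# the value of image_token_index in this checkpoint's model config; 64002 is
# the id the fast tokenizer assigns (as reported above). The checkpoint is
# not loaded here.

def image_token_consistent(tokenizer_image_id: int, image_token_index: int) -> bool:
    """True if the tokenizer's <image> id matches the model config."""
    return tokenizer_image_id == image_token_index

print(image_token_consistent(64000, 64000))  # slow tokenizer: True
print(image_token_consistent(64002, 64000))  # fast tokenizer: False
```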
Expected behavior
I'd expect both tokenizers to produce equivalent input_ids.
Hey! That is expected, but should be fixed. Basically, the <|startoftext|> and <|endoftext|> tokens are added at the beginning, but the slow tokenizer overwrites existing vocabulary entries while the fast one doesn't. The converter fixes this when you pass from_slow=True. Overwriting should not be the default, but since the slow tokenizer supports it, the fast one should as well:
In [8]: fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", bos_token="<|startoftext|>", eos_token ="<|endoftext|>", from_slow=True)
In [9]: fast_tokenizer
Out[9]:
LlamaTokenizerFast(name_or_path='liuhaotian/llava-v1.6-34b', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64001: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
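As a toy illustration of the overwrite-vs-append behaviour (this is not the actual transformers internals, just a sketch of the two id-assignment policies): the slow path claims a requested id, evicting whatever token held it, while the fast path always appends the token after the existing vocabulary.

```python
# Toy model of the two policies, not the real transformers implementation.

def add_token_slow(vocab: dict, token: str, token_id: int) -> int:
    # Slow path: evict whichever token currently holds token_id, then claim it.
    for existing, i in list(vocab.items()):
        if i == token_id:
            del vocab[existing]
    vocab[token] = token_id
    return token_id

def add_token_fast(vocab: dict, token: str) -> int:
    # Fast path: always append the token with a fresh id after the vocabulary.
    new_id = max(vocab.values()) + 1
    vocab[token] = new_id
    return new_id

base = {"<unk>": 0, "<s>": 1, "</s>": 2}

slow_vocab = dict(base)
fast_vocab = dict(base)

print(add_token_slow(slow_vocab, "<|startoftext|>", 1))  # 1 (id 1 overwritten)
print(add_token_fast(fast_vocab, "<|startoftext|>"))     # 3 (appended)
```

This is why the same prompt tokenizes to different ids: tokens that the slow tokenizer places at low ids end up above vocab_size in the fast tokenizer.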
@ArthurZucker thanks, however could you clarify? Your code snippet seems to use id 64000 for the "<|startoftext|>" token, whereas the "<image>" token needs to have that index. Do you mean that this is something that is currently not supported?
Using from_slow=True still produces the same fast_input_ids as before:
from transformers import AutoTokenizer, AddedToken
slow_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", use_fast=False)
slow_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", bos_token="<|startoftext|>", eos_token ="<|endoftext|>", from_slow=True)
fast_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
input_ids = slow_tokenizer(prompt).input_ids
fast_input_ids = fast_tokenizer(prompt).input_ids
print(input_ids)
print(fast_input_ids)
assert input_ids == fast_input_ids
I'll push a PR one sec
@ArthurZucker One (possibly separate) issue I notice when comparing the two tokenizers is that for the slow tokenizer the eos_token is <|im_end|>, whereas for the fast tokenizer it's <|endoftext|>.
>>> slow_tokenizer
LlamaTokenizer(name_or_path='liuhaotian/llava-v1.6-34b', vocab_size=64000, model_max_length=4096, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|im_end|>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<image>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
>>> fast_tokenizer
LlamaTokenizerFast(name_or_path='liuhaotian/llava-v1.6-34b', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
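Such discrepancies can be surfaced programmatically by diffing the two tokenizers' special_tokens_map dictionaries (a real attribute on both tokenizer classes). The sketch below hard-codes the values from the reprs above instead of loading the checkpoint:

```python
def diff_special_tokens(slow_map: dict, fast_map: dict) -> dict:
    """Map each differing key to its (slow, fast) pair of values."""
    keys = set(slow_map) | set(fast_map)
    return {k: (slow_map.get(k), fast_map.get(k))
            for k in sorted(keys) if slow_map.get(k) != fast_map.get(k)}

# Values copied from the two printed tokenizers above; in real usage these
# would be slow_tokenizer.special_tokens_map and fast_tokenizer.special_tokens_map.
slow_map = {"bos_token": "<|startoftext|>", "eos_token": "<|im_end|>",
            "unk_token": "<unk>", "pad_token": "<unk>",
            "additional_special_tokens": ["<image>"]}
fast_map = {"bos_token": "<|startoftext|>", "eos_token": "<|endoftext|>",
            "unk_token": "<unk>", "pad_token": "<unk>"}

print(diff_special_tokens(slow_map, fast_map))
# {'additional_special_tokens': (['<image>'], None),
#  'eos_token': ('<|im_end|>', '<|endoftext|>')}
```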
Mmm on main I am getting this:
Ah, I see what happened - I was taking the examples from here, but the token is set to override the default for one of the tokenizers but not the other. My bad!
this actually isn't fixed.
@bghira Are you running on a source install of transformers? This has been merged into main but isn't part of the latest stable release.
yep
cc @ArthurZucker
This is fixed. If not then please share a reproducer.
you're right, neither fast nor slow work for me
Does this mean that the fast tokenizer should be working? If I run the code below, it confirms that it's the fast one, but then it throws a warning that says "You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers". Does that mean that it is not actually using the fast tokenizer?
fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", bos_token="<|startoftext|>", eos_token="<|endoftext|>", from_slow=True)
fast_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
print(fast_tokenizer)  # confirms that it is the fast one
image_processor = LlavaNextImageProcessor.from_pretrained(MODEL_DIR, local_files_only=True)
self.processor = LlavaNextProcessor(image_processor=image_processor, tokenizer=fast_tokenizer)
self.model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    local_files_only=True,
    use_flash_attention_2=True,
)
Inference is very slow even on an A100, so I suspect something is not working. Here is what the print statement outputs:
LlamaTokenizerFast(name_or_path='/model', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|im_end|>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<image>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
One thing to note: I am not building transformers from source. I saw in a comment that that might be an issue?
"The tokenizer needs to be converted from the slow tokenizers" means that the fast tokenizer is re-computed from the slow tokenizer's files, which is what you want here.
Your output is correct @david-vectorflow