[LLaVa-v1.6] Difference between slow and fast tokenizer for liuhaotian/llava-v1.6-34b
System Info
Transformers v4.39.dev
Who can help?
@ArthurZucker will know this
Reproduction
So when converting the LLaVa-NeXT models, I noticed a discrepancy between the slow/fast tokenizers of this checkpoint: liuhaotian/llava-v1.6-34b
Namely, when tokenizing the following string, I'd expect equivalent input_ids:
from transformers import AutoTokenizer, AddedToken
slow_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", use_fast=False)
slow_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b")
fast_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
input_ids = slow_tokenizer(prompt).input_ids
fast_input_ids = fast_tokenizer(prompt).input_ids
assert input_ids == fast_input_ids
However they are not the same:
print(input_ids)
print(fast_input_ids)
gives
[6, 10707, 144, 47329, 567, 3275, 98, 7, 6, 3903, 144, 64000, 144, 5697, 620, 2709, 594, 719, 2728, 100, 7, 6, 765, 13611, 144]
[6, 1328, 144, 47329, 567, 3275, 98, 7, 6, 2942, 144, 64002, 59568, 144, 5697, 620, 2709, 594, 719, 2728, 100, 7, 6, 14135, 144]
This is currently causing an issue, as reported here, because the AutoTokenizer class loads the fast tokenizer by default, which uses 64002 as the image_token_index, whereas the model has it set to 64000. The slow tokenizer returns the correct input_ids.
The current workaround is to use the slow tokenizer.
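To make the mismatch concrete, here is a minimal sketch of the consistency check the workaround relies on: the tokenizer's id for <image> must equal the model config's image_token_index. This is plain Python using the ids reported in this issue, not a fresh run against the checkpoint.

```python
# Sketch only: 64000 is the id the slow tokenizer assigns to "<image>" and
# the value of image_token_index in this checkpoint's model config; 64002 is
# the id the fast tokenizer assigns (as reported above). The checkpoint is
# not loaded here.

def image_token_consistent(tokenizer_image_id: int, image_token_index: int) -> bool:
    """True if the tokenizer's <image> id matches the model config."""
    return tokenizer_image_id == image_token_index

print(image_token_consistent(64000, 64000))  # slow tokenizer: True
print(image_token_consistent(64002, 64000))  # fast tokenizer: False
```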
Expected behavior
I'd expect both tokenizers to produce equivalent input_ids.
Hey! That is expected, but should be fixed. Basically, the <|startoftext|> and <|endoftext|> tokens are added at the beginning, but the slow tokenizer overwrites existing vocabulary entries while the fast one doesn't. The converter fixes this when you pass from_slow=True. Overwriting should not be the default, but since the slow tokenizer supports it, the fast one should as well:
In [8]: fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", bos_token="<|startoftext|>", eos_token ="<|endoftext|>", from_slow=True)
In [9]: fast_tokenizer
Out[9]:
LlamaTokenizerFast(name_or_path='liuhaotian/llava-v1.6-34b', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64001: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
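As a toy illustration of the overwrite-vs-append behaviour (this is not the actual transformers internals, just a sketch of the two id-assignment policies): the slow path claims a requested id, evicting whatever token held it, while the fast path always appends the token after the existing vocabulary.

```python
# Toy model of the two policies, not the real transformers implementation.

def add_token_slow(vocab: dict, token: str, token_id: int) -> int:
    # Slow path: evict whichever token currently holds token_id, then claim it.
    for existing, i in list(vocab.items()):
        if i == token_id:
            del vocab[existing]
    vocab[token] = token_id
    return token_id

def add_token_fast(vocab: dict, token: str) -> int:
    # Fast path: always append the token with a fresh id after the vocabulary.
    new_id = max(vocab.values()) + 1
    vocab[token] = new_id
    return new_id

base = {"<unk>": 0, "<s>": 1, "</s>": 2}

slow_vocab = dict(base)
fast_vocab = dict(base)

print(add_token_slow(slow_vocab, "<|startoftext|>", 1))  # 1 (id 1 overwritten)
print(add_token_fast(fast_vocab, "<|startoftext|>"))     # 3 (appended)
```

This is why the same prompt tokenizes to different ids: tokens that the slow tokenizer places at low ids end up above vocab_size in the fast tokenizer.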
@ArthurZucker thanks, however could you clarify? Your code snippet seems to use id 64000 for the "<|startoftext|>" token, whereas the "<image>" token needs to have that index. Do you mean that this is something that is currently not supported?
Using from_slow=True still produces the same fast_input_ids as before:
from transformers import AutoTokenizer, AddedToken
slow_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", use_fast=False)
slow_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", bos_token="<|startoftext|>", eos_token ="<|endoftext|>", from_slow=True)
fast_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
input_ids = slow_tokenizer(prompt).input_ids
fast_input_ids = fast_tokenizer(prompt).input_ids
print(input_ids)
print(fast_input_ids)
assert input_ids == fast_input_ids
I'll push a PR one sec
@ArthurZucker One (possibly separate) issue I notice when comparing the two tokenizers is that for the slow tokenizer the eos_token is <|im_end|>, whereas for the fast tokenizer it's <|endoftext|>.
>>> slow_tokenizer
LlamaTokenizer(name_or_path='liuhaotian/llava-v1.6-34b', vocab_size=64000, model_max_length=4096, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|im_end|>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<image>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
>>> fast_tokenizer
LlamaTokenizerFast(name_or_path='liuhaotian/llava-v1.6-34b', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
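Such discrepancies can be surfaced programmatically by diffing the two tokenizers' special_tokens_map dictionaries (a real attribute on both tokenizer classes). The sketch below hard-codes the values from the reprs above instead of loading the checkpoint:

```python
def diff_special_tokens(slow_map: dict, fast_map: dict) -> dict:
    """Map each differing key to its (slow, fast) pair of values."""
    keys = set(slow_map) | set(fast_map)
    return {k: (slow_map.get(k), fast_map.get(k))
            for k in sorted(keys) if slow_map.get(k) != fast_map.get(k)}

# Values copied from the two printed tokenizers above; in real usage these
# would be slow_tokenizer.special_tokens_map and fast_tokenizer.special_tokens_map.
slow_map = {"bos_token": "<|startoftext|>", "eos_token": "<|im_end|>",
            "unk_token": "<unk>", "pad_token": "<unk>",
            "additional_special_tokens": ["<image>"]}
fast_map = {"bos_token": "<|startoftext|>", "eos_token": "<|endoftext|>",
            "unk_token": "<unk>", "pad_token": "<unk>"}

print(diff_special_tokens(slow_map, fast_map))
# {'additional_special_tokens': (['<image>'], None),
#  'eos_token': ('<|im_end|>', '<|endoftext|>')}
```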
Mmm on main I am getting this:
Ah, I see what happened - I was taking the examples from here, but the token is set to override the default for one of the tokenizers but not the other. My bad!
this actually isn't fixed.
@bghira Are you running on a source install of transformers? This has been merged into main but isn't part of the latest stable release.
yep
cc @ArthurZucker
This is fixed. If not then please share a reproducer.
you're right, neither fast nor slow work for me
Does this mean that the fast tokenizer should be working? If I run the code below, it confirms that it's the fast one, but then it throws a warning that says "You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers". Does that mean that it is not actually using the fast tokenizer?
fast_tokenizer = AutoTokenizer.from_pretrained("liuhaotian/llava-v1.6-34b", bos_token="<|startoftext|>", eos_token="<|endoftext|>", from_slow=True)
fast_tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
print(fast_tokenizer)  # confirms that it is the fast one
image_processor = LlavaNextImageProcessor.from_pretrained(MODEL_DIR, local_files_only=True)
self.processor = LlavaNextProcessor(image_processor=image_processor, tokenizer=fast_tokenizer)
self.model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    local_files_only=True,
    use_flash_attention_2=True,
)
Inference is very slow even on an A100, so I suspect something is not working. Here is what the print statement outputs:
LlamaTokenizerFast(name_or_path='/model', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|im_end|>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<image>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
6: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
7: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
64000: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
One thing to note: I am not building transformers from source. I saw in a comment that that might be an issue?
"The tokenizer needs to be converted from the slow tokenizers" means that the fast tokenizer is re-computed from the slow tokenizer's files, which is what you want here.
Your output is correct @david-vectorflow