Clément Dumas
Hi, I think you should use `.from_pretrained` with the official Mistral name and pass `hf_model=hf_model_of_your_finetuned_model`
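Something like this, as a rough sketch (the repo id is a placeholder for your fine-tuned checkpoint, and you may also want to pass its tokenizer):

```py
from transformers import AutoModelForCausalLM
from transformer_lens import HookedTransformer

# Placeholder repo id for the fine-tuned checkpoint
hf_model = AutoModelForCausalLM.from_pretrained("your-org/your-finetuned-mistral")

# Load under the official Mistral name so TransformerLens resolves the config,
# but take the weights from the fine-tuned HF model
model = HookedTransformer.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    hf_model=hf_model,
)
```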
yeah but your cfg is the same as the official Mistral one, right? I think this should work, do you mind trying and sharing the error if there is...
The problem is that Mistral's config enforces `cfg.d_vocab = 32000`: https://github.com/TransformerLensOrg/TransformerLens/blob/5a374ec4b33cec6281b37494175d14f06c75dcfd/transformer_lens/loading_from_pretrained.py#L938 A quick hack to fix your problem is to change this line to `32002`. For a long-term solution, we...
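If you'd rather not edit the package in place, here is a rough, untested monkey-patch sketch of the same hack (the function name `convert_hf_model_config` and the call path are assumptions based on that same file, so check they match your TransformerLens version):

```py
import transformer_lens.loading_from_pretrained as loading

_original_convert = loading.convert_hf_model_config

def _patched_convert(model_name, *args, **kwargs):
    cfg_dict = _original_convert(model_name, *args, **kwargs)
    if "mistral" in str(model_name).lower():
        # Account for the two extra tokens in the fine-tuned checkpoint
        cfg_dict["d_vocab"] = 32002
    return cfg_dict

loading.convert_hf_model_config = _patched_convert
```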
Maybe another solution would be to let people pass `hf_config` as an argument 🤔 But then we'd have to make the `elif architecture == "MistralForCausalLM":` case use `hf_config`, as right...
Oh wait @ArthurZucker, is that what you're fixing here in https://github.com/huggingface/tokenizers/pull/1568 ?
Same issue with unnormalized non-special tokens: ```py from tokenizers import AddedToken from transformers import AutoTokenizer tok_name = "meta-llama/llama-2-7b-hf" fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True) slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False) tok = "" t...
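Here is a hedged reconstruction of that comparison, since the snippet above is cut off (`"<tok>"` is a stand-in for the truncated token string):

```py
from tokenizers import AddedToken
from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)

# Add the same unnormalized, non-special token to both tokenizers
new_tok = AddedToken("<tok>", normalized=False, special=False)
fast_tokenizer.add_tokens([new_tok])
slow_tokenizer.add_tokens([new_tok])

text = "Hello <tok> world"
print(fast_tokenizer.tokenize(text))
print(slow_tokenizer.tokenize(text))
# The fast and slow tokenizers can split the text around the added token
# differently, so the two outputs may not match.
```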
And there are even more differences when you add `normalized=True` for special tokens ... ```py from tokenizers import AddedToken from transformers import AutoTokenizer tok_name = "meta-llama/llama-2-7b-hf" fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)...
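Again a hedged sketch of that variant (same stand-in token as above):

```py
from tokenizers import AddedToken
from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)

# This time the added token is a special token with normalized=True
special_tok = AddedToken("<tok>", normalized=True, special=True)
fast_tokenizer.add_special_tokens({"additional_special_tokens": [special_tok]})
slow_tokenizer.add_special_tokens({"additional_special_tokens": [special_tok]})

text = "Hello <tok> world"
print(fast_tokenizer.tokenize(text))
print(slow_tokenizer.tokenize(text))
# With normalized=True the fast tokenizer matches the added token against the
# normalized input, which diverges even further from the slow tokenizer.
```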
Also, if you specify the `add_prefix_space` arg, the tokenizer is actually using the slow implementation, which leads to different behavior for the above code! https://github.com/huggingface/transformers/blob/9485289f374d4df7e8aa0ca917dc131dcf64ebaf/src/transformers/models/llama/tokenization_llama_fast.py#L154
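A small illustration of that code path (just a sketch; the point is only that specifying the argument triggers the slow-to-fast conversion):

```py
from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"

# Specifying add_prefix_space at all makes LlamaTokenizerFast rebuild itself
# from the slow tokenizer (from_slow=True), so rerunning the comparisons above
# with this tokenizer can give yet another set of outputs.
converted_fast = AutoTokenizer.from_pretrained(
    tok_name, use_fast=True, add_prefix_space=True
)
print(converted_fast.is_fast)  # True, but the backend was converted from the slow tokenizer
```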
Hey @ArthurZucker, thanks for your answer. I'm using 0.19.1, which should have the fix. I'm really confused right now. Why isn't the fact that `use_fast` alters the behavior of the...
Ok, so I should write some unit tests and choose different kwargs depending on the tokenizer to get the same behavior?
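If it helps, this is the kind of check I had in mind, as a rough sketch (the helper name and test strings are made up):

```py
from transformers import AutoTokenizer

def assert_fast_slow_match(tok_name: str, texts, **kwargs):
    """Hypothetical helper: fail if the fast and slow tokenizers disagree."""
    fast = AutoTokenizer.from_pretrained(tok_name, use_fast=True, **kwargs)
    slow = AutoTokenizer.from_pretrained(tok_name, use_fast=False, **kwargs)
    for text in texts:
        assert fast.encode(text) == slow.encode(text), f"fast/slow mismatch for {text!r}"

assert_fast_slow_match("meta-llama/llama-2-7b-hf", ["Hello world", "  leading spaces"])
```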