Clément Dumas
Hi, I think you should use `.from_pretrained` with the official Mistral name and pass `hf_model=hf_model_of_your_finetuned_model`
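Something like this, as a rough sketch (the repo id is a placeholder for your fine-tuned checkpoint, and you may also want to pass its tokenizer):

```py
from transformers import AutoModelForCausalLM
from transformer_lens import HookedTransformer

# Placeholder repo id for the fine-tuned checkpoint
hf_model = AutoModelForCausalLM.from_pretrained("your-org/your-finetuned-mistral")

# Load under the official Mistral name so TransformerLens resolves the config,
# but take the weights from the fine-tuned HF model
model = HookedTransformer.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    hf_model=hf_model,
)
```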
yeah but your cfg is the same as the official Mistral one, right? I think this should work, do you mind trying and sharing the error if there is...
The problem is that Mistral's config enforces `cfg.d_vocab = 32000`: https://github.com/TransformerLensOrg/TransformerLens/blob/5a374ec4b33cec6281b37494175d14f06c75dcfd/transformer_lens/loading_from_pretrained.py#L938 A quick hack to fix your problem is to change this line to `32002`. For a long-term solution, we...
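If you'd rather not edit the package in place, here is a rough, untested monkey-patch sketch of the same hack (the function name `convert_hf_model_config` and the call path are assumptions based on that same file, so check they match your TransformerLens version):

```py
import transformer_lens.loading_from_pretrained as loading

_original_convert = loading.convert_hf_model_config

def _patched_convert(model_name, *args, **kwargs):
    cfg_dict = _original_convert(model_name, *args, **kwargs)
    if "mistral" in str(model_name).lower():
        # Account for the two extra tokens in the fine-tuned checkpoint
        cfg_dict["d_vocab"] = 32002
    return cfg_dict

loading.convert_hf_model_config = _patched_convert
```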
Maybe another solution would be to let people pass `hf_config` as an argument 🤔 But then we'd have to make the `elif architecture == "MistralForCausalLM":` case use `hf_config`, as right...
Oh wait @ArthurZucker, is that what you're fixing here in https://github.com/huggingface/tokenizers/pull/1568 ?
Same issue with unnormalized non-special tokens: ```py from tokenizers import AddedToken from transformers import AutoTokenizer tok_name = "meta-llama/llama-2-7b-hf" fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True) slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False) tok = "" t...
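Here is a hedged reconstruction of that comparison, since the snippet above is cut off (`"<tok>"` is a stand-in for the truncated token string):

```py
from tokenizers import AddedToken
from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)

# Add the same unnormalized, non-special token to both tokenizers
new_tok = AddedToken("<tok>", normalized=False, special=False)
fast_tokenizer.add_tokens([new_tok])
slow_tokenizer.add_tokens([new_tok])

text = "Hello <tok> world"
print(fast_tokenizer.tokenize(text))
print(slow_tokenizer.tokenize(text))
# The fast and slow tokenizers can split the text around the added token
# differently, so the two outputs may not match.
```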
And there are even more differences when you add `normalized=True` for special tokens ... ```py from tokenizers import AddedToken from transformers import AutoTokenizer tok_name = "meta-llama/llama-2-7b-hf" fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)...
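Again a hedged sketch of that variant (same stand-in token as above):

```py
from tokenizers import AddedToken
from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)

# This time the added token is a special token with normalized=True
special_tok = AddedToken("<tok>", normalized=True, special=True)
fast_tokenizer.add_special_tokens({"additional_special_tokens": [special_tok]})
slow_tokenizer.add_special_tokens({"additional_special_tokens": [special_tok]})

text = "Hello <tok> world"
print(fast_tokenizer.tokenize(text))
print(slow_tokenizer.tokenize(text))
# With normalized=True the fast tokenizer matches the added token against the
# normalized input, which diverges even further from the slow tokenizer.
```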
Also, if you specify the `add_prefix_space` arg, the tokenizer is actually using the slow implementation, which leads to different behavior for the above code! https://github.com/huggingface/transformers/blob/9485289f374d4df7e8aa0ca917dc131dcf64ebaf/src/transformers/models/llama/tokenization_llama_fast.py#L154
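A small illustration of that code path (just a sketch; the point is only that specifying the argument triggers the slow-to-fast conversion):

```py
from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"

# Specifying add_prefix_space at all makes LlamaTokenizerFast rebuild itself
# from the slow tokenizer (from_slow=True), so rerunning the comparisons above
# with this tokenizer can give yet another set of outputs.
converted_fast = AutoTokenizer.from_pretrained(
    tok_name, use_fast=True, add_prefix_space=True
)
print(converted_fast.is_fast)  # True, but the backend was converted from the slow tokenizer
```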
Hey @ArthurZucker, thanks for your answer. I'm using 0.19.1, which should have the fix. I'm really confused right now. Why isn't the fact that `use_fast` alters the behavior of the...
Ok, so I should write some unit tests and choose different kwargs depending on the tokenizer to get the same behavior?
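If it helps, this is the kind of check I had in mind, as a rough sketch (the helper name and test strings are made up):

```py
from transformers import AutoTokenizer

def assert_fast_slow_match(tok_name: str, texts, **kwargs):
    """Hypothetical helper: fail if the fast and slow tokenizers disagree."""
    fast = AutoTokenizer.from_pretrained(tok_name, use_fast=True, **kwargs)
    slow = AutoTokenizer.from_pretrained(tok_name, use_fast=False, **kwargs)
    for text in texts:
        assert fast.encode(text) == slow.encode(text), f"fast/slow mismatch for {text!r}"

assert_fast_slow_match("meta-llama/llama-2-7b-hf", ["Hello world", "  leading spaces"])
```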