
[BUG] CodeGen 2.5 Tokenizer cannot be initialized anymore

Open · AlEscher opened this issue 1 year ago · 6 comments

The code from https://huggingface.co/Salesforce/codegen25-7b-multi_P#causal-sampling-code-autocompletion and https://github.com/salesforce/CodeGen/tree/main/codegen25#sampling currently does not work. Creating the tokenizer like

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
# or
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)

gives an error:

Traceback (most recent call last):
  File "C:\teamscale\teamscale\server\com.teamscale.service\src\main\resources\com\teamscale\service\testimpact\embeddings_prioritization\ml\code_gen_embedder.py", line 4, in <module>
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Alessandro\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 905, in from_pretrained
    return tokenizer_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Alessandro\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\tokenization_utils_base.py", line 2213, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Alessandro\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\tokenization_utils_base.py", line 2447, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Alessandro\.cache\huggingface\modules\transformers_modules\Salesforce\codegen25-7b-multi\0bdf3f45a09e4f53b333393205db1388634a0e2e\tokenization_codegen25.py", line 136, in __init__
    super().__init__(
  File "C:\Users\Alessandro\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\tokenization_utils.py", line 435, in __init__
    super().__init__(**kwargs)
  File "C:\Users\Alessandro\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\tokenization_utils_base.py", line 1592, in __init__
    raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
AttributeError: add_special_tokens conflicts with the method add_special_tokens in CodeGen25Tokenizer

I have installed tiktoken==0.8.0, since installing tiktoken==0.4.0 via pip fails.

AlEscher · Nov 16 '24 16:11

Seems to be an issue with transformers: https://github.com/huggingface/transformers/issues/33453
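
From the traceback, the remote tokenization_codegen25.py seems to forward add_special_tokens as a plain keyword argument to super().__init__(), and newer transformers releases reject any init kwarg whose name shadows an existing tokenizer method. Roughly this (a simplified sketch of the clash, not the actual transformers or Salesforce source):

# Simplified sketch: newer transformers raises when an __init__ kwarg
# shadows a method on the tokenizer class.
class TokenizerBase:
    def add_special_tokens(self, mapping):
        ...  # real tokenizers expose this as a method

    def __init__(self, **kwargs):
        for key in kwargs:
            if hasattr(self, key) and callable(getattr(self, key)):
                raise AttributeError(
                    f"{key} conflicts with the method {key} in {type(self).__name__}"
                )

class CodeGen25LikeTokenizer(TokenizerBase):
    def __init__(self):
        # The remote CodeGen 2.5 tokenizer forwards add_special_tokens=... like this,
        # which is what triggers the AttributeError in the traceback above.
        super().__init__(add_special_tokens=False)

CodeGen25LikeTokenizer()  # AttributeError: add_special_tokens conflicts with ...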

AlEscher · Nov 16 '24 17:11

Any fix for this? I'm having trouble quantizing the model into GGUF format.

ahmedashraf443 · Nov 26 '24 19:11

> Any fix for this? I'm having trouble quantizing the model into GGUF format.

The only workaround I have found so far is https://onetwobytes.com/2024/10/07/codegen2-5-llm-not-working-with-latest-huggingface-transformers/
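
i.e. pinning an older transformers release. With that in place the tokenizer initializes again for me; the exact pin below is only a guess, use whatever version the post specifies:

# Works after downgrading transformers, e.g. pip install "transformers==4.33.*"
# (the exact version is an assumption -- check the linked post).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/codegen25-7b-multi", trust_remote_code=True
)
print(tokenizer("def hello_world():").input_ids)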

AlEscher · Nov 26 '24 20:11

@AlEscher I tried converting it to GGML so that I can quantize it and convert it to GGUF, but I'm still having trouble. Did you manage to quantize it, or are you running the full weights?

ahmedashraf443 · Nov 27 '24 05:11

@ahmedashraf443 I am running the full weights. By installing the transformers version specified in the article, I am able to load the model and use it.
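
Essentially the sampling snippet from the model card runs for me once the tokenizer loads (prompt and max_length below are arbitrary examples):

# Sampling sketch mirroring the model card; prompt and max_length are examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/codegen25-7b-multi", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-multi")

inputs = tokenizer("def hello_world():", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0], skip_special_tokens=True))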

AlEscher · Nov 28 '24 15:11

Yeah, I thought so. Sadly I can't run the full weights on my laptop, and my attempts at quantizing the model never work. Guess I'll have to stick to Qwen2.5-Coder.

ahmedashraf443 · Nov 28 '24 15:11