Unable to load fine-tuned StarCoder2 model: quantization errors
System Info
I am running on an A100 with 40 GB of GPU memory.
Who can help?
@SunMarc and @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
1. I have an SFT-tuned StarCoder2 model.
2. I am trying to load it using AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path):
model = from_pretrained_wrapper(model_name_or_path,
  File "/app/code/evaluation/evaluation_utils.py", line 189, in from_pretrained_wrapper
    AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3039, in from_pretrained
    config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
  File "/usr/local/lib/python3.8/dist-packages/transformers/quantizers/auto.py", line 149, in merge_quantization_configs
    quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
  File "/usr/local/lib/python3.8/dist-packages/transformers/quantizers/auto.py", line 73, in from_dict
    raise ValueError(
ValueError: Unknown quantization type, got bitsandbytes - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto']
Expected behavior
It should be able to load the model properly.
Could you share a full reproducer?
model config used:
{ "_name_or_path": "/app/mnt/models_cache/bigcode/starcoder2-7b", "activation_function": "gelu", "architectures": [ "Starcoder2ForCausalLM" ], "attention_dropout": 0.1, "attention_softmax_in_fp32": true, "bos_token_id": 0, "embedding_dropout": 0.1, "eos_token_id": 0, "hidden_act": "gelu_pytorch_tanh", "hidden_size": 4608, "initializer_range": 0.018042, "intermediate_size": 18432, "layer_norm_epsilon": 1e-05, "max_position_embeddings": 16384, "mlp_type": "default", "model_type": "starcoder2", "norm_epsilon": 1e-05, "norm_type": "layer_norm", "num_attention_heads": 36, "num_hidden_layers": 32, "num_key_value_heads": 4, "quantization_config": { "_load_in_4bit": false, "_load_in_8bit": false, "bnb_4bit_compute_dtype": "float32", "bnb_4bit_quant_storage": "uint8", "bnb_4bit_quant_type": "fp4", "bnb_4bit_use_double_quant": false, "llm_int8_enable_fp32_cpu_offload": false, "llm_int8_has_fp16_weight": false, "llm_int8_skip_modules": null, "llm_int8_threshold": 6.0, "load_in_4bit": false, "load_in_8bit": false, "quant_method": "bitsandbytes" }, "residual_dropout": 0.1, "rope_theta": 1000000, "scale_attention_softmax_in_fp32": true, "scale_attn_weights": true, "sliding_window": 4096, "torch_dtype": "bfloat16", "transformers_version": "4.39.1", "use_bias": true, "use_cache": true, "vocab_size": 49152 }
That is not a full reproducer; we need the full code that you are running.
I am getting the same error.
Package versions: bitsandbytes 0.43.1, transformers 4.40.0, torch 2.2.2+cu118, torchaudio 2.2.2+cu118, torchvision 0.17.2+cu118
My steps are as follows.
First, I use quantization code to quantize Chinese-Llama-2-7b into Chinese-Llama-2-7b-4bit. This is my quantization code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "LinkSoul/Chinese-Llama-2-7b"

# Load the model with a bitsandbytes quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map='auto'
)

if __name__ == '__main__':
    import os

    output = "soulteary/Chinese-Llama-2-7b-4bit"
    if not os.path.exists(output):
        os.mkdir(output)

    # Save the model to disk
    model.save_pretrained(output)
    print("done")
Then I get the quantized model soulteary/Chinese-Llama-2-7b-4bit, and I want to load it with transformers using the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig

model_id = 'soulteary/Chinese-Llama-2-7b-4bit'

if torch.cuda.is_available():
    quantization_config = BitsAndBytesConfig(
        bnb_4bit_quant_type="bitsandbytes_4bit",
    )
    # Load the saved checkpoint from local files
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        local_files_only=True,
        torch_dtype=torch.float16,
        device_map='auto'
    )
else:
    model = None
The error appears:
Traceback (most recent call last):
  File "/home/soikit/LLM/app.py", line 6, in <module>
    from model import run
  File "/home/soikit/LLM/model.py", line 15, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3155, in from_pretrained
    config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/quantizers/auto.py", line 149, in merge_quantization_configs
    quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/quantizers/auto.py", line 73, in from_dict
    raise ValueError(
ValueError: Unknown quantization type, got bitsandbytes - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto']
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Similar issue... not able to load the model after saving it in 4-bit.
ValueError: Supplied state dict for model.layers.16.self_attn.vision_expert_dense.weight does not contain `bitsandbytes__*` and possibly other `quantized_stats` components.
cc @SunMarc and @younesbelkada
Hi @1049451037, can you share a simple and short reproducible snippet? Can you also try with the latest transformers: pip install -U transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    quant_method='nf4'
)
model = AutoModelForCausalLM.from_pretrained('THUDM/cogvlm2-llama3-chat-19B', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained('THUDM/cogvlm2-llama3-chat-19B')
# save int4
model.save_pretrained('./cogvlm2-llama3-chat-19B-int4')
tokenizer.save_pretrained('./cogvlm2-llama3-chat-19B-int4')
# load failed
model = AutoModelForCausalLM.from_pretrained('./cogvlm2-llama3-chat-19B-int4', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained('./cogvlm2-llama3-chat-19B-int4')
On it !
cc @SunMarc
Hello, have you solved this problem?
Hey, in both snippets the quantization method being used is either nf4, which does not exist in transformers, or bitsandbytes, which on its own does not exist either.
You should properly set quantization_type="bitsandbytes_4bit", for example, to fix this!
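For reference, a minimal sketch of a working 4-bit bitsandbytes setup (the model id is just the one from the snippets above and is only illustrative):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bnb_4bit_quant_type selects the 4-bit data type ("nf4" or "fp4");
# the bitsandbytes backend itself is chosen via load_in_4bit / load_in_8bit.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "LinkSoul/Chinese-Llama-2-7b",  # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)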
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @ArthurZucker, thanks for your reply. Where should I specify that parameter? I'm facing a similar issue (https://stackoverflow.com/questions/79068298/valueerror-supplied-state-dict-for-layers-does-not-contain-bitsandbytes-an) and I would truly appreciate your help.
@1049451037 did you solve it? My issue is almost identical to yours
cc @SunMarc on this 🤗
Hey @llealgt, I'm not able to reproduce your error. Make sure that you have installed the latest versions of transformers and bitsandbytes. Could you try creating a Google Colab notebook with your expected error using a smaller model such as meta-llama/Meta-Llama-3.1-8B-Instruct?
Thanks @SunMarc. Yes, I'm using the latest version of both transformers and bitsandbytes. I created the Colab as you suggested (same issue there): https://colab.research.google.com/drive/1BGlj8zJYisJaJNIwjLinukcaaLMPuFAC?usp=sharing
For context, my use case is simple: quantize Llama 3.1 70B, save it on the server, and then load it on different servers, but the loading part results in the error.
Found the issue: make sure to load the model back with the same class you used. Also, you don't need to pass the quantization_config again, as it has been saved:
loaded_model = AutoModelForCausalLM.from_pretrained(pt_save_directory)
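For completeness, a minimal sketch of the full round trip (the model id and save directory are placeholders, not your exact setup): quantize and save once, then reload from disk with the same Auto class and no quantization config:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # placeholder checkpoint
pt_save_directory = "./llama-3.1-8b-instruct-4bit"   # placeholder save path

# Quantize on the first machine and save the 4-bit checkpoint.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained(pt_save_directory)
tokenizer.save_pretrained(pt_save_directory)

# Reload elsewhere: same class, no quantization_config, since the saved
# config.json already contains the bitsandbytes settings.
loaded_model = AutoModelForCausalLM.from_pretrained(pt_save_directory, device_map="auto")
loaded_tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)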
@SunMarc thank you so much! It seems like it worked: in the Colab and locally I was able to load the saved quantized model. On the remote server I have a new issue, but I guess that is another thing; the logs don't say much (remote Amazon SageMaker server), just:
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
UPDATE: Hidden in a different log, I found this is a CUDA out-of-memory error.
Thanks
@SunMarc I apologize, this may not be the original problem, but it is definitely related, and your help could save a lot of time. The fix you provided solved my initial issue. After that I started facing lots of different errors, which I have been solving one by one, but now there is one more (hopefully the last): when running in the real scenario (Llama 70B instead of 8B, and on Amazon notebooks instead of Colab), I am able to save and load the model (similar to the Colab) in my dev environment. But after copying the model files to S3 to deploy to a remote server/endpoint, loading fails. The code is the same (and obviously I change the paths to the right ones); it looks to me as if the model is being loaded onto a single GPU even when setting device_map="auto". As mentioned, the process works fine in the dev environment, and the remote server/endpoint is in theory identical to it, but for some reason it doesn't work there. The error is:
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained.
UPDATE: I found my issue, and it has nothing to do with quantization or transformers/Hugging Face. The issue is that my target system (Amazon SageMaker) calls the model load multiple times (once per GPU), so the first time it works and the next time it throws a memory error because the GPUs are already loaded. https://repost.aws/questions/QUzI0YGYPCS4yLWtsxTaNlfg/model-fn-called-multiple-times-1-per-gpu-during-deployment
Glad that you managed to find the solution!
There is a config.json file in the quantized model directory.
I changed load_in_4bit: false to load_in_4bit: true in that config.json file and it solved the problem.
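In case it helps others, a minimal sketch of that workaround (the checkpoint path is a placeholder); it edits the saved config.json in place so the bitsandbytes config is resolved as the supported bitsandbytes_4bit type on load:
import json
import os

checkpoint_dir = "./my-4bit-checkpoint"  # placeholder path to the saved quantized model
config_path = os.path.join(checkpoint_dir, "config.json")

with open(config_path) as f:
    config = json.load(f)

# Mark the saved bitsandbytes config as 4-bit so newer transformers versions
# map the legacy "bitsandbytes" quant_method to "bitsandbytes_4bit".
config["quantization_config"]["load_in_4bit"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)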