Unable to load fine-tuned StarCoder2 model: quantization errors
System Info
I am running on an A100 with 40 GB of GPU memory.
Who can help?
@SunMarc and @younesbelkada
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
1. I have an SFT-tuned StarCoder2 model.
2. I am trying to load it using AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path):
model = from_pretrained_wrapper(model_name_or_path,
  File "/app/code/evaluation/evaluation_utils.py", line 189, in from_pretrained_wrapper
    AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 3039, in from_pretrained
    config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
  File "/usr/local/lib/python3.8/dist-packages/transformers/quantizers/auto.py", line 149, in merge_quantization_configs
    quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
  File "/usr/local/lib/python3.8/dist-packages/transformers/quantizers/auto.py", line 73, in from_dict
    raise ValueError(
ValueError: Unknown quantization type, got bitsandbytes - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto']
Expected behavior
It should be able to load the model properly.
Could you share a full reproducer?
model config used:
{ "_name_or_path": "/app/mnt/models_cache/bigcode/starcoder2-7b", "activation_function": "gelu", "architectures": [ "Starcoder2ForCausalLM" ], "attention_dropout": 0.1, "attention_softmax_in_fp32": true, "bos_token_id": 0, "embedding_dropout": 0.1, "eos_token_id": 0, "hidden_act": "gelu_pytorch_tanh", "hidden_size": 4608, "initializer_range": 0.018042, "intermediate_size": 18432, "layer_norm_epsilon": 1e-05, "max_position_embeddings": 16384, "mlp_type": "default", "model_type": "starcoder2", "norm_epsilon": 1e-05, "norm_type": "layer_norm", "num_attention_heads": 36, "num_hidden_layers": 32, "num_key_value_heads": 4, "quantization_config": { "_load_in_4bit": false, "_load_in_8bit": false, "bnb_4bit_compute_dtype": "float32", "bnb_4bit_quant_storage": "uint8", "bnb_4bit_quant_type": "fp4", "bnb_4bit_use_double_quant": false, "llm_int8_enable_fp32_cpu_offload": false, "llm_int8_has_fp16_weight": false, "llm_int8_skip_modules": null, "llm_int8_threshold": 6.0, "load_in_4bit": false, "load_in_8bit": false, "quant_method": "bitsandbytes" }, "residual_dropout": 0.1, "rope_theta": 1000000, "scale_attention_softmax_in_fp32": true, "scale_attn_weights": true, "sliding_window": 4096, "torch_dtype": "bfloat16", "transformers_version": "4.39.1", "use_bias": true, "use_cache": true, "vocab_size": 49152 }
That is not a full reproducer; we need the full code that you are running.
I am getting the same error.
Package versions: bitsandbytes 0.43.1, transformers 4.40.0, torch 2.2.2+cu118, torchaudio 2.2.2+cu118, torchvision 0.17.2+cu118
My steps are as follows.
First, I use quantization code to quantize Chinese-Llama-2-7b into Chinese-Llama-2-7b-4bit. This is my quantization code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "LinkSoul/Chinese-Llama-2-7b"

# Load the model with a bitsandbytes quantization config
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map='auto'
)

if __name__ == '__main__':
    import os

    output = "soulteary/Chinese-Llama-2-7b-4bit"
    if not os.path.exists(output):
        os.mkdir(output)

    # Save the model to disk
    model.save_pretrained(output)
    print("done")
Then I get the quantized model soulteary/Chinese-Llama-2-7b-4bit, and I want to load it with transformers using the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig

model_id = 'soulteary/Chinese-Llama-2-7b-4bit'

if torch.cuda.is_available():
    quantization_config = BitsAndBytesConfig(
        bnb_4bit_quant_type="bitsandbytes_4bit",
    )
    # Load the saved checkpoint from local files
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        local_files_only=True,
        torch_dtype=torch.float16,
        device_map='auto'
    )
else:
    model = None
The error appears:
Traceback (most recent call last):
  File "/home/soikit/LLM/app.py", line 6, in <module>
    from model import run
  File "/home/soikit/LLM/model.py", line 15, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3155, in from_pretrained
    config.quantization_config = AutoHfQuantizer.merge_quantization_configs(
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/quantizers/auto.py", line 149, in merge_quantization_configs
    quantization_config = AutoQuantizationConfig.from_dict(quantization_config)
  File "/home/soikit/bj20_venv/lib/python3.11/site-packages/transformers/quantizers/auto.py", line 73, in from_dict
    raise ValueError(
ValueError: Unknown quantization type, got bitsandbytes - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto']
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Similar issue... not able to load the model after saving it in 4-bit.
ValueError: Supplied state dict for model.layers.16.self_attn.vision_expert_dense.weight does not contain `bitsandbytes__*` and possibly other `quantized_stats` components.
cc @SunMarc and @younesbelkada
Hi @1049451037, can you share a simple and short reproducible snippet? Can you also try with the latest transformers: pip install -U transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    quant_method='nf4'
)
model = AutoModelForCausalLM.from_pretrained('THUDM/cogvlm2-llama3-chat-19B', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained('THUDM/cogvlm2-llama3-chat-19B')
# save int4
model.save_pretrained('./cogvlm2-llama3-chat-19B-int4')
tokenizer.save_pretrained('./cogvlm2-llama3-chat-19B-int4')
# load failed
model = AutoModelForCausalLM.from_pretrained('./cogvlm2-llama3-chat-19B-int4', quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained('./cogvlm2-llama3-chat-19B-int4')
On it !
cc @SunMarc
Hello, have you solved this problem?
Hey, in both snippets the quantization method being used is either nf4, which does not exist in transformers, or bitsandbytes, which on its own does not exist either.
You should properly set quantization_type="bitsandbytes_4bit", for example, to fix this!
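For reference, a minimal sketch of a working 4-bit bitsandbytes setup (the model id is just the one from the snippets above and is only illustrative):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bnb_4bit_quant_type selects the 4-bit data type ("nf4" or "fp4");
# the bitsandbytes backend itself is chosen via load_in_4bit / load_in_8bit.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "LinkSoul/Chinese-Llama-2-7b",  # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)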
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @ArthurZucker, thanks for your reply. Where should I specify that parameter? I'm facing a similar issue (https://stackoverflow.com/questions/79068298/valueerror-supplied-state-dict-for-layers-does-not-contain-bitsandbytes-an) and I would truly appreciate your help.
@1049451037 did you solve it? My issue is almost identical to yours
cc @SunMarc on this 🤗
Hey @llealgt, I'm not able to reproduce your error. Make sure that you have installed the latest versions of transformers and bitsandbytes. Could you try creating a Google Colab notebook with your expected error using a smaller model such as meta-llama/Meta-Llama-3.1-8B-Instruct?
Thanks @SunMarc. Yes, I'm using the latest version of both transformers and bitsandbytes. I created the Colab as you suggested (same issue there): https://colab.research.google.com/drive/1BGlj8zJYisJaJNIwjLinukcaaLMPuFAC?usp=sharing
For context, my use case is simple: quantize Llama 3.1 70B, save it on the server, and then load it on different servers, but the loading part results in the error.
Found the issue: make sure to load the model back with the same class you used. Also, you don't need to pass the quantization_config again, as it has been saved:
loaded_model = AutoModelForCausalLM.from_pretrained(pt_save_directory)
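For completeness, a minimal sketch of the full round trip (the model id and save directory are placeholders, not your exact setup): quantize and save once, then reload from disk with the same Auto class and no quantization config:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # placeholder checkpoint
pt_save_directory = "./llama-3.1-8b-instruct-4bit"   # placeholder save path

# Quantize on the first machine and save the 4-bit checkpoint.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained(pt_save_directory)
tokenizer.save_pretrained(pt_save_directory)

# Reload elsewhere: same class, no quantization_config, since the saved
# config.json already contains the bitsandbytes settings.
loaded_model = AutoModelForCausalLM.from_pretrained(pt_save_directory, device_map="auto")
loaded_tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)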
@SunMarc thank you so much! It seems like it worked: in the Colab and locally I was able to load the saved quantized model. On the remote server I have a new issue, but I guess that is another thing; the logs don't say much (remote Amazon SageMaker server), just:
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
UPDATE: Hidden in a different log, I found this is a CUDA out-of-memory error.
Thanks
@SunMarc I apologize, this may not be the original problem, but it is definitely related, and your help could save a lot of time. The fix you provided solved my initial issue. After that I started facing lots of different errors, which I have been solving one by one, but now there is one more (hopefully the last): when running in the real scenario (Llama 70B instead of 8B, and on Amazon notebooks instead of Colab), I am able to save and load the model (similar to the Colab) in my dev environment. But after copying the model files to S3 to deploy to a remote server/endpoint, loading fails. The code is the same (and obviously I change the paths to the right ones); it looks to me as if the model is being loaded onto a single GPU even when setting device_map="auto". As mentioned, the process works fine in the dev environment, and the remote server/endpoint is in theory identical to it, but for some reason it doesn't work there. The error is:
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained.
UPDATE: I found my issue, and it has nothing to do with quantization or transformers/Hugging Face. The issue is that my target system (Amazon SageMaker) calls the model load multiple times (once per GPU), so the first time it works and the next time it throws a memory error because the GPUs are already loaded. https://repost.aws/questions/QUzI0YGYPCS4yLWtsxTaNlfg/model-fn-called-multiple-times-1-per-gpu-during-deployment
Glad that you managed to find the solution!
There is a config.json file in the quantized model directory.
I changed load_in_4bit: false to load_in_4bit: true in that config.json file and it solved the problem.
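In case it helps others, a minimal sketch of that workaround (the checkpoint path is a placeholder); it edits the saved config.json in place so the bitsandbytes config is resolved as the supported bitsandbytes_4bit type on load:
import json
import os

checkpoint_dir = "./my-4bit-checkpoint"  # placeholder path to the saved quantized model
config_path = os.path.join(checkpoint_dir, "config.json")

with open(config_path) as f:
    config = json.load(f)

# Mark the saved bitsandbytes config as 4-bit so newer transformers versions
# map the legacy "bitsandbytes" quant_method to "bitsandbytes_4bit".
config["quantization_config"]["load_in_4bit"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)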