[BUG] The dynamically quantized MoE model failed to deploy in vLLM.
Describe the bug
Deploying a quantized DeepSeek-V2-Lite-Chat model with vLLM fails with a KeyError. The error occurs during weight loading: the key names in the model's named_parameters() dictionary (params_dict) do not match the key names derived from the quantized weight file. Specifically:
Key Mismatch:
The original key in self.named_parameters() is 'model.layers.21.mlp.experts.w2_qweight'. The processed key in the quantized weight file is 'model.layers.21.mlp.experts.w2_weight'.
Error Cause:
When loading the weights, the code attempts to access param = params_dict[name], but the name from the weight file does not exist in params_dict, resulting in a KeyError.
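The mismatch can be reproduced outside vLLM with plain dictionaries. This is only a minimal sketch: `find_unmatched_keys` is a hypothetical helper, not a vLLM function, and the two dictionaries are stand-ins built from the key names quoted above.

```python
def find_unmatched_keys(params_dict, checkpoint_names):
    """Return the checkpoint names that have no entry in params_dict."""
    return [name for name in checkpoint_names if name not in params_dict]


# Stand-in for dict(self.named_parameters()) on the model side.
params_dict = {"model.layers.21.mlp.experts.w2_qweight": None}

# Stand-in for the processed name coming from the quantized weight file.
checkpoint_names = ["model.layers.21.mlp.experts.w2_weight"]

missing = find_unmatched_keys(params_dict, checkpoint_names)
print(missing)  # these are the names that would raise KeyError in params_dict[name]
```

Running a check like this before `param = params_dict[name]` would list every offending name at once instead of stopping at the first KeyError.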
vllm version:
vllm-main commit hash debd6bb
or
https://github.com/ZZBoom/vllm/commits/main/
commit hash fc7c714854f422a7e000bcc9fa31d4f61796a7b6
How can this issue be resolved?
Error stack trace:
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
driver_worker_output = run_method(self.driver_worker, sent_method,
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/utils.py", line 2238, in run_method
return func(*args, **kwargs)
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/model_runner.py", line 1113, in load_model
self.model = get_model(vllm_config=self.vllm_config)
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/loader.py", line 426, in load_model
loaded_weights = model.load_weights(
File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/models/deepseek_v2.py", line 790, in load_weights
param = params_dict[name]
KeyError: 'model.layers.10.mlp.experts.w2_weight'
My dynamic quantization settings are the default example configuration from the gptqmodel homepage:
```python
dynamic = {
    # .*\. matches the layers_node prefix
    # layer index starts at 0
    # positive match: layer 19, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    # positive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
    # negative match: skip layer 21, gate module
    r"-:.*\.20\..*gate.*": {},
    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}
```
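To make the effect of these rules concrete, here is a rough sketch of how the `+:` / `-:` prefixes could be resolved for a given module name. It is an illustration under assumptions, not gptqmodel's actual implementation: `resolve_dynamic` is a hypothetical function, the module names are made up in the usual Hugging Face style, and "first matching rule wins" is an assumed precedence.

```python
import re

dynamic = {
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
    r"-:.*\.20\..*gate.*": {},
    r"-:.*down.*": {},
}


def resolve_dynamic(module_name, rules):
    """Hypothetical resolver: returns False to skip quantization,
    a dict of overrides for a positive match, or None if no rule matches."""
    for pattern, overrides in rules.items():
        if pattern.startswith("-:"):
            # Negative rule: a matching module is left unquantized.
            if re.match(pattern[2:], module_name):
                return False
        else:
            # "+:" is optional; a missing prefix is treated as positive.
            regex = pattern[2:] if pattern.startswith("+:") else pattern
            if re.match(regex, module_name):
                return overrides
    return None


for name in (
    "model.layers.18.mlp.gate_proj",
    "model.layers.20.mlp.gate_proj",
    "model.layers.5.mlp.experts.3.down_proj",
):
    print(name, "->", resolve_dynamic(name, dynamic))
```

Under these assumptions every `down_proj` module, including those inside the experts, is skipped and therefore stays in floating point.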
The config after quantization is:
{
"_name_or_path": "/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/",
"architectures": [
"DeepseekV2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV2Config",
"AutoModel": "modeling_deepseek.DeepseekV2Model",
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
},
"aux_loss_alpha": 0.001,
"bos_token_id": 100000,
"eos_token_id": 100001,
"ep_size": 1,
"first_k_dense_replace": 1,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 10944,
"kv_lora_rank": 512,
"max_position_embeddings": 163840,
"model_type": "deepseek_v2",
"moe_intermediate_size": 1408,
"moe_layer_freq": 1,
"n_group": 1,
"n_routed_experts": 64,
"n_shared_experts": 2,
"norm_topk_prob": false,
"num_attention_heads": 16,
"num_experts_per_tok": 6,
"num_hidden_layers": 27,
"num_key_value_heads": 16,
"pretraining_tp": 1,
"q_lora_rank": null,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"quantization_config": {
"bits": 8,
"checkpoint_format": "gptq",
"desc_act": false,
"dynamic": {
"+:.*\\.18\\..*gate.*": {
"bits": 4,
"group_size": 32
},
"-:.*\\.20\\..*gate.*": {},
"-:.*down.*": {},
".*\\.19\\..*gate.*": {
"bits": 8,
"group_size": 64
}
},
"group_size": 64,
"lm_head": false,
"meta": {
"damp_auto_increment": 0.0025,
"damp_percent": 0.01,
"mse": 0.0,
"quantizer": [
"gptqmodel:2.0.0-dev"
],
"static_groups": false,
"true_sequential": true,
"uri": "https://github.com/modelcloud/gptqmodel"
},
"pack_dtype": "int32",
"quant_method": "gptq",
"sym": true
},
"rms_norm_eps": 1e-06,
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 0.707,
"mscale_all_dim": 0.707,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"rope_theta": 10000,
"routed_scaling_factor": 1.0,
"scoring_func": "softmax",
"seq_aux": true,
"tie_word_embeddings": false,
"topk_group": 1,
"topk_method": "greedy",
"torch_dtype": "bfloat16",
"transformers_version": "4.48.3",
"use_cache": true,
"v_head_dim": 128,
"vocab_size": 102400
}
Additionally, loading the model with GPTQModel.load works fine.
GPU Info H20
Show output of:
nvidia-smi
Software Info
Operation System/Version + Python Version
Show output of:
pip show gptqmodel torch transformers accelerate triton
# pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, colorlog, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by:
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, flash_attn, flashinfer-python, gptqmodel, lm_eval, optimum, outlines, peft, runai-model-streamer, timm, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.48.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: auto_gptq, compressed-tensors, gptqmodel, lm_eval, optimum, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.3.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: auto_gptq, gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License:
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch
@liweiqing1997 Can you upload the model to HF for us to test? ModelScope is ok too.
You can use this model: https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite-Chat/files. We also used this model, but only utilized 27 of its layers.
@liweiqing1997 Can you upload the quantized model so we can skip the slow quant stage and directly run it?
I'm very sorry, our model involves private data and may not be convenient to share.
Do you have the resources to quantize a small MoE model, such as DeepSeek-V2-Lite-Chat? If it's not convenient for you, I'll think of other solutions. Thank you very much.
@liweiqing1997 Totally understand. We will try to quantize the model and fix this by next week. Based on your stack traces, the bug is most likely caused by vLLM changing the model parameter names.
I figured out what caused this bug. If an expert is not activated, gptqmodel keeps it in float16/bf16 format. This "mixed precision" strategy caused the KeyError when vLLM loaded the model.
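A quick way to check a checkpoint for this situation is to look for expert modules that ship a plain `weight` tensor but no `qweight`. The following is only a sketch under assumptions: `find_unquantized_expert_modules` is a hypothetical helper (not part of gptqmodel or vLLM), and the tensor names are invented examples that assume the Hugging Face style `...mlp.experts.<i>.<proj>.<tensor>` layout.

```python
import re

# Assumed naming scheme: <prefix>.mlp.experts.<index>.<proj>.(qweight|weight)
EXPERT_TENSOR = re.compile(r"^(.*\.mlp\.experts\.\d+\.\w+)\.(qweight|weight)$")


def find_unquantized_expert_modules(tensor_names):
    """Return expert modules that have a float `weight` but no `qweight`."""
    quantized, floating = set(), set()
    for name in tensor_names:
        match = EXPERT_TENSOR.match(name)
        if match is None:
            continue  # not an expert weight tensor (e.g. scales, qzeros, attention)
        module, kind = match.groups()
        (quantized if kind == "qweight" else floating).add(module)
    return sorted(floating - quantized)


# Invented sample names: expert 0 is quantized, expert 1 was left in bf16.
sample_names = [
    "model.layers.10.mlp.experts.0.down_proj.qweight",
    "model.layers.10.mlp.experts.0.down_proj.scales",
    "model.layers.10.mlp.experts.1.down_proj.weight",
    "model.layers.10.self_attn.o_proj.qweight",
]

unquantized = find_unquantized_expert_modules(sample_names)
print(unquantized)
```

If the returned list is non-empty, the checkpoint mixes quantized and floating-point experts, which matches the "mixed precision" situation described above.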