GPTQModel [BUG] The dynamically quantized MoE model failed to deploy in vLLM.

Describe the bug

A KeyError is raised when deploying a quantized DeepSeek-V2-Lite-Chat model with vLLM. The error occurs during weight loading: the key names in the model's named_parameters() dictionary (params_dict) do not match the key names in the quantized weight file. Specifically:

Key Mismatch:

The original key in self.named_parameters() is 'model.layers.21.mlp.experts.w2_qweight'. The processed key in the quantized weight file is 'model.layers.21.mlp.experts.w2_weight'.

Error Cause:

When loading the weights, the code attempts to access param = params_dict[name], but the name from the weight file does not exist in params_dict, resulting in a KeyError.
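
The failure mode can be reproduced in isolation. The sketch below is a toy stand-in for the loading loop, not vLLM's actual code: `params_dict` and the checkpoint names are hand-written examples, and `find_unmatched` is a hypothetical helper that lists every checkpoint name without a matching parameter instead of stopping at the first one.

```python
def find_unmatched(checkpoint_names, params_dict):
    """Return checkpoint names that have no entry in params_dict (hypothetical helper)."""
    return [name for name in checkpoint_names if name not in params_dict]


# Toy stand-ins: the model registered a quantized parameter (w2_qweight) ...
params_dict = {"model.layers.21.mlp.experts.w2_qweight": object()}
# ... but the name derived from the weight file refers to w2_weight.
checkpoint_names = ["model.layers.21.mlp.experts.w2_weight"]

unmatched = find_unmatched(checkpoint_names, params_dict)
print("names missing from params_dict:", unmatched)

# A direct lookup, as in `param = params_dict[name]`, raises KeyError.
try:
    params_dict[checkpoint_names[0]]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("direct lookup failed:", lookup_failed)
```

Collecting all unmatched names before loading may make it easier to see whether the mismatch affects one expert, one layer, or every layer.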

vllm version:

vllm-main commit hash debd6bb
or
https://github.com/ZZBoom/vllm/commits/main/ 
commit hash fc7c714854f422a7e000bcc9fa31d4f61796a7b6

How can this issue be resolved?

Error stack trace:


  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/utils.py", line 2238, in run_method
    return func(*args, **kwargs)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/worker/model_runner.py", line 1113, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/model_loader/loader.py", line 426, in load_model
    loaded_weights = model.load_weights(
  File "/mnt/lwq/lwq/quant/vllm/vllm-main-debd6bb/vllm/model_executor/models/deepseek_v2.py", line 790, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.10.mlp.experts.w2_weight'

My dynamic quantization settings are set to the default configuration from the gptqmodel homepage:


dynamic = {
    # .*\. matches the layers_node prefix
    # layer index starts at 0

    # positive match: layer 19, gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},

    # positive match: layer 20, gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},

    # negative match: skip layer 21, gate module
    r"-:.*\.20\..*gate.*": {},

    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}
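
To see which modules these rules touch, here is a minimal sketch of rule resolution. It is an assumption about the matching behavior, not GPTQModel's implementation: `resolve_dynamic` is a hypothetical helper that tries the rules in insertion order with `re.match`, treats `-:` as skip, and treats `+:` or no prefix as a positive match. The module names are illustrative.

```python
import re


def resolve_dynamic(module_name, dynamic):
    """Return overrides for a module, False if it is skipped, None if no rule matches.

    Hypothetical helper: rules are tried in insertion order, first match wins.
    """
    for pattern, overrides in dynamic.items():
        if pattern.startswith("-:"):
            # Negative rule: the module is excluded from quantization.
            if re.match(pattern[2:], module_name):
                return False
        else:
            # Positive rule: an optional "+:" prefix is stripped before matching.
            regex = pattern[2:] if pattern.startswith("+:") else pattern
            if re.match(regex, module_name):
                return overrides
    return None  # no rule matched, so the base quantization config would apply


dynamic = {
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
    r"-:.*\.20\..*gate.*": {},
    r"-:.*down.*": {},
}

names = [
    "model.layers.18.mlp.gate_proj",
    "model.layers.19.mlp.gate_proj",
    "model.layers.20.mlp.gate_proj",
    "model.layers.5.mlp.down_proj",
    "model.layers.5.self_attn.q_proj",
    "model.layers.3.mlp.experts.18.gate_proj",
]
for name in names:
    print(name, "->", resolve_dynamic(name, dynamic))
```

Under this matching rule, a name such as `model.layers.3.mlp.experts.18.gate_proj` also matches the `.18.` pattern, because the expert index is followed by a dot just like the layer index. For an MoE model that may or may not be the intended behavior.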

The config after quantization is:


{
  "_name_or_path": "/mnt/models/deepseek-ai/DeepSeek-V2-Lite-Chat/",
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 10944,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1408,
  "moe_layer_freq": 1,
  "n_group": 1,
  "n_routed_experts": 64,
  "n_shared_experts": 2,
  "norm_topk_prob": false,
  "num_attention_heads": 16,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 27,
  "num_key_value_heads": 16,
  "pretraining_tp": 1,
  "q_lora_rank": null,
  "qk_nope_head_dim": 128,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "bits": 8,
    "checkpoint_format": "gptq",
    "desc_act": false,
    "dynamic": {
      "+:.*\\.18\\..*gate.*": {
        "bits": 4,
        "group_size": 32
      },
      "-:.*\\.20\\..*gate.*": {},
      "-:.*down.*": {},
      ".*\\.19\\..*gate.*": {
        "bits": 8,
        "group_size": 64
      }
    },
    "group_size": 64,
    "lm_head": false,
    "meta": {
      "damp_auto_increment": 0.0025,
      "damp_percent": 0.01,
      "mse": 0.0,
      "quantizer": [
        "gptqmodel:2.0.0-dev"
      ],
      "static_groups": false,
      "true_sequential": true,
      "uri": "https://github.com/modelcloud/gptqmodel"
    },
    "pack_dtype": "int32",
    "quant_method": "gptq",
    "sym": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 0.707,
    "mscale_all_dim": 0.707,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },
  "rope_theta": 10000,
  "routed_scaling_factor": 1.0,
  "scoring_func": "softmax",
  "seq_aux": true,
  "tie_word_embeddings": false,
  "topk_group": 1,
  "topk_method": "greedy",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.48.3",
  "use_cache": true,
  "v_head_dim": 128,
  "vocab_size": 102400
}

In addition, loading the quantized model with GPTQModel.load works without errors.

GPU Info H20

Show output of:

nvidia-smi

Software Info

Operating System/Version + Python Version

Show output of:

pip show gptqmodel torch transformers accelerate triton
# pip show gptqmodel torch transformers accelerate triton
Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, colorlog, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, flash_attn, flashinfer-python, gptqmodel, lm_eval, optimum, outlines, peft, runai-model-streamer, timm, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.48.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: auto_gptq, compressed-tensors, gptqmodel, lm_eval, optimum, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.3.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: auto_gptq, gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch

Mar 13 '25 09:03 liweiqing1997

@liweiqing1997 Can you upload the model to HF for us to test? ModelScope is ok too.

Mar 13 '25 11:03 Qubitium

You can use this model: https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite-Chat/files. We also used this model, but only utilized 27 of its layers.

Mar 13 '25 11:03 liweiqing1997

@liweiqing1997 Can you upload the quantized model so we can skip the slow quant stage and directly run it?

Mar 13 '25 11:03 Qubitium

I'm very sorry, but our model involves private data, so we may not be able to share it.

Do you have the resources to quantize a small MoE model, such as DeepSeek-V2-Lite-Chat? If it's not convenient for you, I'll think of other solutions. Thank you very much.

Mar 13 '25 12:03 liweiqing1997

@liweiqing1997 Totally understand. We will try to quantize the model and fix this by next week. Based on your stack trace, the bug is most likely caused by vLLM changing the model parameter names.

Mar 13 '25 12:03 Qubitium

I figured out what caused this bug. If an expert is not activated, gptqmodel keeps it in float16/bf16 format. This "mixed precision" strategy causes the KeyError when vLLM loads the model.
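
One way to check a checkpoint for this situation is to look at the tensor names alone. The following is a minimal sketch, assuming GPTQ-packed modules store a `qweight` tensor while unquantized modules store a plain `weight`, and that expert modules are named `...experts.<index>.<proj>`. `split_experts_by_format` is a hypothetical helper and the names below are toy examples; for a sharded checkpoint the real names could be taken from the keys of `weight_map` in `model.safetensors.index.json`.

```python
import re
from collections import defaultdict

# Matches e.g. "model.layers.10.mlp.experts.7.up_proj.weight" (assumed naming scheme).
EXPERT_RE = re.compile(r"^(?P<module>.*\.experts\.\d+\.\w+)\.(?P<tensor>qweight|weight)$")


def split_experts_by_format(tensor_names):
    """Group expert modules into 'quantized' and 'unquantized' (hypothetical helper)."""
    groups = defaultdict(set)
    for name in tensor_names:
        match = EXPERT_RE.match(name)
        if match:
            kind = "quantized" if match.group("tensor") == "qweight" else "unquantized"
            groups[kind].add(match.group("module"))
    return {kind: sorted(modules) for kind, modules in groups.items()}


# Toy tensor names: expert 0 is packed, expert 7 was left in float16/bf16.
tensor_names = [
    "model.layers.10.mlp.experts.0.up_proj.qweight",
    "model.layers.10.mlp.experts.0.up_proj.scales",
    "model.layers.10.mlp.experts.7.up_proj.weight",
]

groups = split_experts_by_format(tensor_names)
mixed = bool(groups.get("quantized")) and bool(groups.get("unquantized"))
print(groups)
print("mixed precision experts:", mixed)
```

If both groups are non-empty, the checkpoint mixes packed and unpacked experts, which matches the situation described above.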

Jul 15 '25 02:07 Sekri0