Support q2-k to q4-k

Open wenhuach21 opened this issue 1 year ago • 5 comments

need to support double quant in algorithm part
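
For context, "double quant" here means that the per-group quantization parameters (scales and mins) are themselves quantized against a coarser per-super-block scale, which is what the q2_k to q4_k GGUF formats expect. A minimal sketch of the idea (illustrative only, not auto-round's implementation; the 6-bit sub-scales and 8 sub-groups per super-block follow the Q4_K convention):

  import numpy as np

  # Illustrative sketch of "double quant": per-group scales are quantized again
  # against one per-super-block scale, so only 6-bit codes plus a single fp
  # super-scale need to be stored per super-block (Q4_K-style assumption).
  group_scales = np.abs(np.random.randn(8)).astype(np.float32)  # 8 sub-group scales

  d_scale = group_scales.max() / 63                        # super-scale shared by the sub-groups
  q_scales = np.round(group_scales / d_scale).clip(0, 63)  # 6-bit codes
  restored_scales = q_scales * d_scale                     # what the kernel actually uses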

wenhuach21 avatar Feb 12 '25 02:02 wenhuach21

Hi @wenhuach21 @n1ck-guo, does export for q4_k work right now? I tried to adapt it for torchao and serve the result with vllm (vllm serve ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf --tokenizer microsoft/Phi-4-mini-instruct --device cuda -O3), but there seems to be a shape mismatch:

  File ".../llama.cpp/gguf-py/gguf/gguf_reader.py", line 364, in _build_tensors
    data = self._get(data_offs, item_type, item_count).reshape(np_dims),
ValueError: cannot reshape array of size 1536 into shape (3072,)
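
(Side note: a clean factor-of-two gap like 1536 vs 3072 often points at an element-count or element-size mismatch between the written data and the shape in the header. For reference, a minimal way to sanity-check packed sizes before loading, assuming the standard llama.cpp Q4_K layout of 256 values per 144-byte super-block; this is just a sketch, not auto-round or gguf-py code:)

  QK_K = 256              # values per Q4_K super-block
  Q4_K_BLOCK_BYTES = 144  # 2x fp16 super-scales + 12 bytes of 6-bit sub-scales/mins + 128 bytes of 4-bit codes

  def expected_q4_k_bytes(n_elements: int) -> int:
      # Number of bytes a Q4_K tensor with n_elements values should occupy on disk.
      assert n_elements % QK_K == 0, "Q4_K tensors must hold a multiple of 256 elements"
      return n_elements // QK_K * Q4_K_BLOCK_BYTES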

Can you help me take a look at https://gist.github.com/jerryzh168/fac8f8c8f89c65ef7cc3d76fdc74ba04#file-gistfile1-txt-L48? I'm wondering whether the argument list for ggml_quant is correct:

data = ggml_quant(
    float_data,
    data_qtype.name.lower(),
    scale,
    None,
    wmin_m=wmin_m,
    d_scale=d_scale,
    d_wmin_m=d_wmin_m)

float_data is the original floating point data, and scale, wmin_m, d_scale, d_wmin_m are calculated with https://github.com/intel/auto-round/blob/37341f5ee2ac1f63f1bf03fe8652a126d12cba6b/auto_round/data_type/int.py#L77
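
For reference, this is my reading of how those four tensors relate in an asymmetric double-quant scheme (a sketch only; the group size, the 8 sub-groups per super-group, and the 6-bit sub-quantization are assumptions based on the Q4_K format, not auto-round's exact code):

  import torch

  def fake_quant_asym_dq(x, bits=4, group_size=32, super_group=8):
      # Sketch of the asymmetric double-quant relations; assumes x.numel() is
      # divisible by group_size * super_group.
      maxq = 2 ** bits - 1
      xg = x.reshape(-1, group_size)
      wmin_m = (-xg.min(dim=1, keepdim=True).values).clamp(min=0)   # negated per-group min
      scale = ((xg.max(dim=1, keepdim=True).values + wmin_m) / maxq).clamp(min=1e-8)

      # double quant: per-group scale / wmin_m are re-quantized to 6 bits
      # against the per-super-group scales d_scale / d_wmin_m
      s = scale.reshape(-1, super_group, 1)
      w = wmin_m.reshape(-1, super_group, 1)
      d_scale = (s.max(dim=1, keepdim=True).values / 63).clamp(min=1e-8)
      d_wmin_m = (w.max(dim=1, keepdim=True).values / 63).clamp(min=1e-8)
      scale = (torch.round(s / d_scale).clamp(0, 63) * d_scale).reshape(-1, 1)
      wmin_m = (torch.round(w / d_wmin_m).clamp(0, 63) * d_wmin_m).reshape(-1, 1)

      q = torch.round((xg + wmin_m) / scale).clamp(0, maxq)         # integer codes
      x_dq = (q * scale - wmin_m).reshape(x.shape)                  # dequantized tensor
      return x_dq, scale, wmin_m, d_scale, d_wmin_m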

jerryzh168 avatar Apr 11 '25 02:04 jerryzh168

Thank you for the report; we will check the related issues immediately.

n1ck-guo avatar Apr 11 '25 03:04 n1ck-guo

@n1ck-guo if you want to repro the issue, here are the steps (roughly consolidated into the shell sketch after this list):

  1. create a conda env
  2. patch https://github.com/pytorch/ao/pull/2042 for torchao
  3. install torchao from source: python setup.py develop
  4. use https://gist.github.com/jerryzh168/898b2d84c380fdd8d10ee97c5546af85 to upload the checkpoint
  5. patch https://github.com/intel/auto-round/pull/504
  6. use https://gist.github.com/jerryzh168/25f6d2fd0687d1df1246c55706f061e7 to convert the model to gguf
  7. serve with vllm: vllm serve ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf --tokenizer microsoft/Phi-4-mini-instruct --device cuda -O3, where ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf is the GGUF file generated in step 6.
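
A rough consolidation of the steps above into shell commands (the Python version, the way the PRs are applied, and the gist script names upload_checkpoint.py / convert_to_gguf.py are placeholders and assumptions, not the actual file names; adjust to however you saved the gists):

  # 1. conda env (Python version is an assumption)
  conda create -n torchao-ar python=3.10 -y && conda activate torchao-ar

  # 2-3. torchao with PR #2042 applied, built from source
  git clone https://github.com/pytorch/ao.git && cd ao
  git fetch origin pull/2042/head && git checkout FETCH_HEAD
  python setup.py develop && cd ..

  # 4. upload the checkpoint (script from the first gist; name assumed)
  python upload_checkpoint.py

  # 5. auto-round with PR #504 applied
  git clone https://github.com/intel/auto-round.git && cd auto-round
  git fetch origin pull/504/head && git checkout FETCH_HEAD
  pip install -e . && cd ..

  # 6. convert the model to gguf (script from the second gist; name assumed)
  python convert_to_gguf.py

  # 7. serve the generated gguf with vllm
  vllm serve ./phi4-mini-torchao-ar-gguf-q4_k-3.8B-Q4_K_S.gguf \
      --tokenizer microsoft/Phi-4-mini-instruct --device cuda -O3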

jerryzh168 avatar Apr 11 '25 03:04 jerryzh168

We have tested the q4_k_s export code. It works well for some other models, but for microsoft/Phi-4-mini-instruct the export fails with an error. This is because our code relies on the original export code from llama.cpp (convert_hf_to_gguf.py), which does not seem to work well with Phi-4. We will also try to reproduce the problem using the steps you provided and track down the cause. Thank you again for the report; we will do our best to solve it.

n1ck-guo avatar Apr 16 '25 02:04 n1ck-guo

@jerryzh168 Thank you for waiting. This issue seems to be caused by the llama.cpp version. Could you please try this PR https://github.com/intel/auto-round/pull/524 together with the latest gguf-py? You can install the latest gguf-py with: git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp/gguf-py && pip install .
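
After installing, a quick way to confirm which gguf-py version is picked up (uses Python's standard package metadata, nothing auto-round specific):

  python -c "import importlib.metadata as m; print(m.version('gguf'))"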

n1ck-guo avatar Apr 17 '25 06:04 n1ck-guo