llama 7B model cannot use q2_k
I built llama.cpp from master and converted the https://huggingface.co/decapoda-research/llama-7b-hf model to ggml. I used this command:
CUDA_VISIBLE_DEVICES=0 ./quantize ../../models/ggml-model-f16.bin ../../models/ggml-model-q4_k_s.bin 3
It works, and I get the quantized model file, like this:
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A30
main: build = 635 (5c64a09)
main: quantizing '../../models/ggml-model-f16.bin' to '../../models/ggml-model-q4_k_s.bin' as q4_1
llama.cpp: loading model from ../../models/ggml-model-f16.bin
llama.cpp: saving model to ../../models/ggml-model-q4_k_s.bin
[ 1/ 291] tok_embeddings.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 78.12 MB | hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.082 0.067 0.051 0.037 0.025 0.040
[ 2/ 291] norm.weight - 4096, type = f32, size = 0.016 MB
[ 3/ 291] output.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 78.12 MB | hist: 0.040 0.025 0.037 0.051 0.067 0.082 0.095 0.102 0.102 0.095 0.083 0.067 0.052 0.037 0.025 0.040
.....
llama_model_quantize_internal: model size = 12853.02 MB
llama_model_quantize_internal: quant size = 4017.27 MB
llama_model_quantize_internal: hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.083 0.067 0.051 0.037 0.025 0.040
main: quantize time = 47205.34 ms
main: total time = 47205.34 ms
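Note that the trailing number in the quantize command selects the target type by its llama_ftype enum value, which is why passing 3 produced q4_1 even though the output file was named q4_k_s. The two values used in this thread (3 and 10) are confirmed by the logs; the rest are paraphrased from llama.h of that era and worth double-checking against your checkout:

    enum llama_ftype {
        LLAMA_FTYPE_ALL_F32       = 0,
        LLAMA_FTYPE_MOSTLY_F16    = 1,
        LLAMA_FTYPE_MOSTLY_Q4_0   = 2,
        LLAMA_FTYPE_MOSTLY_Q4_1   = 3,  // what the first run actually used
        // ...
        LLAMA_FTYPE_MOSTLY_Q2_K   = 10, // what the second run below requests
        // ...
        LLAMA_FTYPE_MOSTLY_Q4_K_S = 14, // likely what was intended here
    };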
But if I quantize with q2_k, an error occurs:
CUDA_VISIBLE_DEVICES=0 ./quantize ../../models/ggml-model-f16.bin ../../models/ggml-model-q2_k.bin 10 10
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A30
main: build = 635 (5c64a09)
main: quantizing '../../models/ggml-model-f16.bin' to '../../models/ggml-model-q2_k.bin' as q2_K using 10 threads
llama.cpp: loading model from ../../models/ggml-model-f16.bin
llama.cpp: saving model to ../../models/ggml-model-q2_k.bin
[ 1/ 291] tok_embeddings.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 0.00 MB | hist:
Floating point exception (core dumped)
Floating-point exceptions can occur due to processor misdetection, but how do I fix this? Thank you very much.
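A note on the error itself: on Linux, "Floating point exception" is the message for SIGFPE, which is most often raised by an integer division by zero rather than by anything floating-point or by processor misdetection. The "250.00 MB -> 0.00 MB" line above suggests the k-quant type was effectively absent from the build, so a zero size or block count ends up as a divisor somewhere. A minimal standalone sketch of that failure mode (an illustration, not llama.cpp's exact code path):

    #include <stdio.h>

    int main(void) {
        int nelements  = 4096 * 32000; /* tok_embeddings.weight */
        int block_size = 0;            /* an unregistered quant type can report 0 here */

        /* Integer division by zero raises SIGFPE on x86, and the shell
           reports it as "Floating point exception (core dumped)". */
        int nblocks = nelements / block_size;

        printf("%d blocks\n", nblocks);
        return 0;
    }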
Your first run is converting to q4_1, not q4_k_s.
If you built using CMake from current master, there is no k-quants support; you can either add a line to CMakeLists.txt (https://github.com/ggerganov/llama.cpp/pull/1748) or build using the Makefile for now.
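For reference, the Makefile route from the same checkout would look roughly like this (assuming the LLAMA_CUBLAS flag the Makefile used for CUDA builds around build 635; adjust to your setup):

    make clean
    LLAMA_CUBLAS=1 make quantize

The Makefile compiled k_quants.c by default at that point, which is why it works while the CMake build of the same revision does not.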
Edit: Fixed now.
Yes, it is q4_1.
Thank you very much, that solved my problem.