llama 7B model cannot use q2_k
I built llama.cpp from master and converted the https://huggingface.co/decapoda-research/llama-7b-hf model to ggml. I used this command:
CUDA_VISIBLE_DEVICES=0 ./quantize ../../models/ggml-model-f16.bin ../../models/ggml-model-q4_k_s.bin 3
It works, and I get the quantized model file, like this:
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A30
main: build = 635 (5c64a09)
main: quantizing '../../models/ggml-model-f16.bin' to '../../models/ggml-model-q4_k_s.bin' as q4_1
llama.cpp: loading model from ../../models/ggml-model-f16.bin
llama.cpp: saving model to ../../models/ggml-model-q4_k_s.bin
[ 1/ 291] tok_embeddings.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 78.12 MB | hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.082 0.067 0.051 0.037 0.025 0.040
[ 2/ 291] norm.weight - 4096, type = f32, size = 0.016 MB
[ 3/ 291] output.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 78.12 MB | hist: 0.040 0.025 0.037 0.051 0.067 0.082 0.095 0.102 0.102 0.095 0.083 0.067 0.052 0.037 0.025 0.040
.....
llama_model_quantize_internal: model size = 12853.02 MB
llama_model_quantize_internal: quant size = 4017.27 MB
llama_model_quantize_internal: hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.083 0.067 0.051 0.037 0.025 0.040
main: quantize time = 47205.34 ms
main: total time = 47205.34 ms
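Note that the trailing number in the quantize command selects the target type by its llama_ftype enum value, which is why passing 3 produced q4_1 even though the output file was named q4_k_s. The two values used in this thread (3 and 10) are confirmed by the logs; the rest are paraphrased from llama.h of that era and worth double-checking against your checkout:

    enum llama_ftype {
        LLAMA_FTYPE_ALL_F32       = 0,
        LLAMA_FTYPE_MOSTLY_F16    = 1,
        LLAMA_FTYPE_MOSTLY_Q4_0   = 2,
        LLAMA_FTYPE_MOSTLY_Q4_1   = 3,  // what the first run actually used
        // ...
        LLAMA_FTYPE_MOSTLY_Q2_K   = 10, // what the second run below requests
        // ...
        LLAMA_FTYPE_MOSTLY_Q4_K_S = 14, // likely what was intended here
    };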
But if I quantize with q2_k, an error occurs:
CUDA_VISIBLE_DEVICES=0 ./quantize ../../models/ggml-model-f16.bin ../../models/ggml-model-q2_k.bin 10 10
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A30
main: build = 635 (5c64a09)
main: quantizing '../../models/ggml-model-f16.bin' to '../../models/ggml-model-q2_k.bin' as q2_K using 10 threads
llama.cpp: loading model from ../../models/ggml-model-f16.bin
llama.cpp: saving model to ../../models/ggml-model-q2_k.bin
[ 1/ 291] tok_embeddings.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 0.00 MB | hist:
Floating point exception (core dumped)
Floating-point exceptions can occur due to processor misdetection, but how do I fix this? Thank you very much.
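A note on the error itself: on Linux, "Floating point exception" is the message for SIGFPE, which is most often raised by an integer division by zero rather than by anything floating-point or by processor misdetection. The "250.00 MB -> 0.00 MB" line above suggests the k-quant type was effectively absent from the build, so a zero size or block count ends up as a divisor somewhere. A minimal standalone sketch of that failure mode (an illustration, not llama.cpp's exact code path):

    #include <stdio.h>

    int main(void) {
        int nelements  = 4096 * 32000; /* tok_embeddings.weight */
        int block_size = 0;            /* an unregistered quant type can report 0 here */

        /* Integer division by zero raises SIGFPE on x86, and the shell
           reports it as "Floating point exception (core dumped)". */
        int nblocks = nelements / block_size;

        printf("%d blocks\n", nblocks);
        return 0;
    }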
Your first run is converting to q4_1, not q4_k_s.
If you built using CMake from current master, there is no k-quants support; you can either add a line to CMakeLists.txt (https://github.com/ggerganov/llama.cpp/pull/1748) or build using the Makefile for now.
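For reference, the Makefile route from the same checkout would look roughly like this (assuming the LLAMA_CUBLAS flag the Makefile used for CUDA builds around build 635; adjust to your setup):

    make clean
    LLAMA_CUBLAS=1 make quantize

The Makefile compiled k_quants.c by default at that point, which is why it works while the CMake build of the same revision does not.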
Edit: Fixed now.
Yes, it is q4_1.
Thank you very much, that solved my problem.