
quantize.exe bug(s)? --token-embedding-type / --output-tensor-type, and missing documentation / advanced usage context?


Windows 11. Use of quantize.exe - missing documentation?

I am trying to locate information on:

--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: use importance matrix for this/these tensor(s)

Specifically, the format of the "tensor_name(s)" to be used and/or the file to be provided with these options. Is it looking for an imatrix.dat, or a file with entries like "tensor_name : Q6_K", for example?

I can see the tensor names in the output during execution; I just need to know what format(s) "--include-weights" expects as valid. Not sure if this is a bug or not.
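For reference, my best guess at the expected format (an assumption on my part, not something I could confirm in any docs) is that each flag takes a single tensor name exactly as printed in the quantize log, and can be repeated for multiple tensors, e.g.:

./quantize --imatrix imatrix.dat --include-weights token_embd.weight --include-weights output.weight models/TinyLlama/ggml-model-f32.gguf models/TinyLlama/TinyLlama-IQ4_XS-we5.gguf IQ4_XS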

Same for these ( BUG? ):

--token-embedding-type ggml_type:
--output-tensor-type ggml_type:

These do not seem to work when using "Q8_0", "Q6_0", etc., as in:

--token-embedding-type Q8_0
--token-embedding-type ggml_type_Q8_0
--token-embedding-type ggml_type:Q8_0

Example:

./quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 --imatrix imatrix.dat models/TinyLlama/ggml-model-f32.gguf models/TinyLlama/TinyLlama-IQ4_XS-we5.gguf IQ4_XS

This is what shows when I try to quant with these two flags set:

[ 1/ 201] output.weight - [ 2048, 32000, 1, 1], type = f32, ====== llama_model_quantize_internal: did not find weights for output.weight converting to q6_K .. size = 250.00 MiB -> 51.27 MiB
[ 2/ 201] token_embd.weight - [ 2048, 32000, 1, 1], type = f32, ====== llama_model_quantize_internal: did not find weights for token_embd.weight converting to iq4_xs .. size = 250.00 MiB -> 33.20 MiB

(REF: https://github.com/ggerganov/llama.cpp/pull/6239 )
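One guess as to the cause (an assumption, not something I have found documented): the ggml_type argument may be matched case-sensitively against the lowercase type names the tool itself prints (q8_0, q6_K, iq4_xs, ...), in which case the lowercase spelling would be required:

./quantize --output-tensor-type q8_0 --token-embedding-type q8_0 --imatrix imatrix.dat models/TinyLlama/ggml-model-f32.gguf models/TinyLlama/TinyLlama-IQ4_XS-we5.gguf IQ4_XS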

NOTE: --leave-output-tensor works 100% (the tensor stays in FP32/16), and --pure works 100%.

And an example for: --override-kv KEY=TYPE:VALUE

If you can point me to the documentation and/or show a brief example, that would be great. I have reviewed the source code directly, but it also does not reveal the supported format(s).

For " --override-kv KEY=TYPE:VALUE " ; a brief example or point to documentation would be great.

Thank you

usage: F:\llamacpp\quantize.exe [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
--imatrix file_name: use data in file_name as importance matrix for quant optimizations
--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: use importance matrix for this/these tensor(s)
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
--override-kv KEY=TYPE:VALUE Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

David-AU-github, Apr 20 '24