
Custom quantization schemes

Open jubruckne opened this issue 1 year ago • 3 comments

This is not ready to merge but I wanted to get your opinion if it’s something you’d be interested in including. If so, I can clean it up and improve it a little.

The idea is to allow creating a custom quantization mix by reading the per-layer quant type from a config file, specifying CUSTOM as the type, like so:

./quantize --allow-requantize ../models/Meta-Llama-3-8B-Instruct.Q8_0.gguf ./llama3-q.gguf CUSTOM

The config file is currently hardcoded to read quant.cfg from the current directory (a sample cfg is included). In the config file I allow specifying a default type for tensors that are not explicitly overridden, plus tensor name / type pairs with the requested type for the overrides.

Possible improvements would be:

  • specifying types as strings instead of enum values ("Q3_K" instead of 11)
  • wildcards or regex to specify tensor names (like "blk\d{2}.ffn_down.weight")
  • relative types (like -1, +2): if the default is Q8_K and quant.cfg says -1 for a tensor, you get Q6_K
  • making the filename for quant.cfg configurable via a command line switch.

jubruckne avatar Apr 23 '24 12:04 jubruckne

This would be handy, as I like to experiment with different custom quants, and it's a little clunky having to modify and rebuild llama.cpp every time I want to change something. For example, I found that with Mixtral, having token_embd and attn_v/k/q/output as Q6_K with IQ4_XS weights typically scores better than the standard IQ4_XS. Weirdly, it even slightly outperforms the above at Q8_0 with IQ4_XS weights.

askmyteapot avatar Apr 23 '24 13:04 askmyteapot

Yes, this functionality is welcome

ggerganov avatar Apr 23 '24 14:04 ggerganov

Excellent idea. I wanted to see such a feature but am unable to build it myself. I will use it... a lot!

All the possible improvements you mention are pertinent.

Also, this tool should ideally support variable quantization. For example, it can be useful to quantize a fraction of a given weight in one quant and the other half in another. Example: the ffn.down.weight is usually the "lead" of the 3 ffn weights in terms of influence over perplexity. Simply quantizing half of the ffn.down.weight tensors in the immediately superior quant gives a very good perplexity shrink on most models, not to mention other benchmarks like ARC.

Moreover, and that's a bit more complex, the ideal combination might be a customizable form of the "more_bits" feature (search for it in the llama.cpp file) to make that partial quant, plus the ability to select a layer range of a given weight to quantize with a higher quant. Example: take a 70b model with 80 layers at LLAMA_FTYPE IQ2_S; I'd like to quantize the ffn.down.weight as follows without recompiling llama.cpp:

  • The first 10 (or any number of) layers in IQ3_XXS.
  • One in every x layers in IQ3_XXS between layer 11 and 70 (for example).
  • The last 10 (or any number of) layers in IQ3_XXS.
  • The rest in IQ2_S. Of course, these numbers are arbitrary, and I'd be curious to know which layers are actually the most influential over a model, and thus would deserve the higher bitrate of a variable quant.

I'm currently toying with the code in the llama.cpp file, and it's quite indigestible and not practical, especially because 2 approaches were used to define the quant strategies:

  • The IQ1 and IQ2 quant strategies form a tree, with the weights as branches.
  • The other quants (IQ and Q) are branches in per-weight trees. That coexistence of 2 approaches is confusing to me and should ideally be harmonized into one or the other (by weight, or by quant strategy).

Nexesenex avatar Apr 23 '24 17:04 Nexesenex

I'd like to quantize the ffn.down.weight as follows without recompiling llama.cpp

Yeah, that's the idea. I actually explained my intentions slightly incorrectly in the first post above. It's actually about allowing individual quantisation for each tensor (not layer). So you can have a config file like this:

# use default quantisation of Q8_0
ftype=7

# override tensors matching a pattern with a specific quant:
blk.10.ffn_up.weight=7
blk.1?.ffn_up.weight=10
blk.2?.ffn_up.weight=10
blk.1?.attn*=23
blk.2?.attn*=23
*down*=14
*gate*=12

jubruckne avatar Apr 24 '24 09:04 jubruckne

Moreover, and that's a bit more complex, the ideal combination might be a customizable form of the "more_bits" feature (search for it in the llama.cpp file) to make that partial quant, plus the ability to select a layer range of a given weight to quantize with a higher quant.

Exactly my plan :) The idea here would be that instead of setting a specific quant type, increments of +1, +2, -1, ... relative to the default could be used. For example:

# use default quantisation of Q4_K_S
ftype=14

# override tensors matching a pattern with a specific quant:
*ffn_up.weight=+1
*ffn_down.weight=-1

The challenge lies in defining what the sequence of quant types should be. One possibility is to establish a sequence that transitions between similar quant types of different "bit" rates, such as from x_K to x+1_K or from IQx_S to IQx-1_S. For example:

  1. IQ1_S, IQ1_M
  2. IQ2_XXS, IQ2_XS, IQ2_S, Q2_K
  3. IQ3_XXS, IQ3_S, Q3_K
  4. Q4_0, Q4_1, Q4_K, IQ4_XS, IQ4_NL
  5. Q5_0, Q5_1, Q5_K
  6. Q6_K
  7. Q8_K

Using this sequence, a default of Q4_K would transition to Q5_K with a +1 adjustment and to Q3_K with a -1.
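
To make the offset idea concrete, here is a minimal C++ sketch of what stepping along such a ladder could look like. The ordering below is only illustrative, and QUANT_LADDER / apply_offset are made-up names, not anything in llama.cpp:

#include <algorithm>
#include <vector>

#include "ggml.h"

// Illustrative ordering of quant types from smallest to largest, roughly
// following the coarse sequence above; not an official llama.cpp table.
static const std::vector<ggml_type> QUANT_LADDER = {
    GGML_TYPE_IQ1_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_Q2_K,
    GGML_TYPE_IQ3_XXS, GGML_TYPE_Q3_K, GGML_TYPE_Q4_K, GGML_TYPE_Q5_K,
    GGML_TYPE_Q6_K, GGML_TYPE_Q8_0,
};

// Map a base type plus a relative offset (+1, -2, ...) to a new type,
// clamping at both ends of the ladder.
static ggml_type apply_offset(ggml_type base, int offset) {
    auto it = std::find(QUANT_LADDER.begin(), QUANT_LADDER.end(), base);
    if (it == QUANT_LADDER.end()) {
        return base; // type not on the ladder: leave it unchanged
    }
    int idx = int(it - QUANT_LADDER.begin()) + offset;
    idx = std::max(0, std::min(idx, int(QUANT_LADDER.size()) - 1));
    return QUANT_LADDER[idx];
}

With this ladder, apply_offset(GGML_TYPE_Q4_K, +1) gives GGML_TYPE_Q5_K and apply_offset(GGML_TYPE_Q4_K, -1) gives GGML_TYPE_Q3_K, matching the example above.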

A more detailed sequence might look like this:

  1. IQ1_S
  2. IQ1_M
  3. IQ2_XXS
  4. IQ2_XS
  5. IQ2_S
  6. Q2_K
  7. IQ3_XXS
  8. IQ3_S
  9. Q3_K
  10. Q4_0, Q4_1, Q4_K
  11. IQ4_XS, IQ4_NL
  12. Q5_0, Q5_1, Q5_K
  13. Q6_K
  14. Q8_K

However, now that I've tried to write down sensible sequences, I realize that defining one that is universally applicable is challenging due to the varying nature of quant types, and it probably doesn't make sense in most cases.

Any thoughts?

jubruckne avatar Apr 24 '24 10:04 jubruckne

Well, ideally the whole pattern would be definable so the system can be universally applied. There's no premade recipe that commands consensus, nor should there be, because we are still empirically discovering the effects of particular quantization strategies as we try them.

Here's a reformulation of my idea, compatible with your plans:

  • Optionally define, for each tensor, an offset in relative terms (+1, -1) or an absolute GGML_TYPE.
  • Optionally define, within a tensor, one or several ranges of layers (relative or absolute) to be quantized either with the baseline quant, with a relative offset to the baseline quant, with a specific GGML_TYPE, or with a mix of 2 quants over a given layer interval.

Example in layman's terms, for each tensor chosen for a customized quantization away from a base quantization strategy. Say the base is Q4_K on a 70b Llama-2 model with 80 layers, and we want to customize the ffn.down without even using Q4_K, for the sake of the example: ffn.down -> layers 1:15 (or the first 20%): Q5_K (or +1); layers 16:65: Q5_K (or +1) every x layers, the rest Q3_K (or -1); layers 66:80 (or the last 20%): Q5_K (or +1). The "every x layers" pattern is of course also applicable to the first or last range of layers, not only the intermediary one.
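
In code terms, a rough sketch of such a layer-range rule could look like the following. This is purely illustrative (pick_ffn_down_type is a made-up helper and the thresholds are arbitrary); in llama.cpp it would correspond to using the i_layer / n_layer values that llama_tensor_get_type() already computes for the ffn_down branch:

#include "ggml.h"

// Illustrative only: pick a type for a ffn_down tensor from its layer index,
// bumping the first/last 20% of layers and one in every three middle layers.
static ggml_type pick_ffn_down_type(int i_layer, int n_layer) {
    const int first = n_layer * 20 / 100;           // first 20% of layers
    const int last  = n_layer - n_layer * 20 / 100; // last 20% of layers
    if (i_layer < first || i_layer >= last) {
        return GGML_TYPE_Q5_K;                      // outer ranges: higher quant
    }
    if ((i_layer - first) % 3 == 0) {
        return GGML_TYPE_Q5_K;                      // "one every x layers" in the middle
    }
    return GGML_TYPE_Q3_K;                          // rest of the middle range
}

With 80 layers this would give layers 0-15 and 64-79 the higher quant, matching the first/last 20% idea above.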

That might require a slight overhaul of the quant strategy part of llama.cpp, and potentially a harmonization of its hierarchical tree with respect to the IQ1 and IQ2 groups, but if possible, that'd offer the widest range of possibilities.

I'm sorry for my lack of coding proficiency; I have no background in coding beyond mimicking what I see, understanding and adapting a few formatting tricks, and changing values.

Nexesenex avatar Apr 24 '24 18:04 Nexesenex

I think this should be ready. I added parsing of enum names (so that friendly names like Q8_0 can be used instead of their numeric values), wildcards for tensor names, and the possibility to specify the cfg file to use.

To use, specify the new CUSTOM type on ./quantize like so: ./quantize ../models/Phi-3-mini-4k-instruct-fp16.gguf ./phi3-q.gguf CUSTOM:quant.cfg

The quant.cfg should be pretty self-explanatory:

# Defines the default ftype (the quantization mix code 
# that you pass to quantize when not using a custom mix).
# Tensors that are not overridden below will be quantized 
# according to this mix.
#
# Must be one of
#    Q4_0, Q4_1, Q5_0, Q5_1, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, 
#    IQ1_S, IQ1_M, Q2_K, Q2_K_S, IQ3_XXS, IQ3_S, IQ3_M, Q3_K,
#    IQ3_XS, Q3_K_S, Q3_K_M, Q3_K_L, IQ4_NL, IQ4_XS, Q4_K, 
#    Q4_K_S, Q4_K_M, Q5_K, Q5_K_S, Q5_K_M, Q6_K, Q8_0, F16

ftype=Q6_K

# Defines overrides for tensors with names matching a given 
# pattern. Filters are processed in the order given; the 
# first match wins. 
#
# Wildcards are allowed:
#     ? single character
#     * multiple characters
#
# Type must be one of 
#     F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, Q2_K, Q3_K, 
#     Q4_K, Q5_K, Q6_K, Q8_K, IQ2_XXS, IQ2_XS, IQ3_XXS, 
#     IQ1_S, IQ4_NL, IQ3_S, IQ2_S, IQ4_XS, IQ1_M

blk.10.ffn_up.weight=Q5_K
blk.1?.ffn_up.weight=Q4_K
blk.23.*=Q2_K
blk.24.*=Q2_K
blk.25.*=Q2_K
blk.2?.ffn_up.weight=Q4_K
*_gate*=Q4_K
*.attn*=IQ4_XS
*_down*=IQ3_S
output.weight=Q5_K
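
For reference, here is a minimal sketch of the kind of ? / * matching and first-match-wins lookup the config describes. This is not the PR's actual code; match_pattern and find_override are made-up names for illustration:

#include <string>
#include <utility>
#include <vector>

#include "ggml.h"

// Glob-style match supporting '?' (any single character) and '*' (any run of
// characters), as described in the config comments above.
static bool match_pattern(const char * pat, const char * str) {
    if (*pat == '\0') return *str == '\0';
    if (*pat == '*')  return match_pattern(pat + 1, str) || (*str != '\0' && match_pattern(pat, str + 1));
    if (*str == '\0') return false;
    return (*pat == '?' || *pat == *str) && match_pattern(pat + 1, str + 1);
}

// Overrides are checked in the order they appear in the file; the first
// matching pattern wins, otherwise the default choice applies.
static ggml_type find_override(const std::vector<std::pair<std::string, ggml_type>> & overrides,
                               const std::string & tensor_name, ggml_type default_type) {
    for (const auto & ov : overrides) {
        if (match_pattern(ov.first.c_str(), tensor_name.c_str())) {
            return ov.second;
        }
    }
    return default_type;
}

So with the file above, blk.10.ffn_up.weight hits the first rule (Q5_K) before the blk.1?.ffn_up.weight rule below it ever gets a chance.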

jubruckne avatar Apr 25 '24 09:04 jubruckne

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 556 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8437.61ms p(95)=19797.62ms fails=, finish reason: stop=488 truncated=68
  • Prompt processing (pp): avg=93.93tk/s p(95)=352.4tk/s
  • Token generation (tg): avg=33.41tk/s p(95)=49.02tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=20b22433f0cf941c1b43e27c086e2ef71798fd57

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]
github-actions[bot] avatar Apr 25 '24 10:04 github-actions[bot]

Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0:

F16:  Final estimate: PPL = 6.7647
Q8_0: Final estimate: PPL = 6.7646
Q4_0: Final estimate: PPL = 7.2904
Q5_1: Final estimate: PPL = 6.8849

This is a 7.7% difference, but these numbers are even worse earlier on in evaluation.

Mistral PPL:
F16:  Final estimate: PPL = 5.6925
Q8_0: Final estimate: PPL = 5.6918
Q4_0: Final estimate: PPL = 5.8192

Only 2.2% difference for mistral.

Different UIs using the Q4_0 series would be getting higher quality degradation for llama3 than for llama2 or mistral.

This isn't a llama.cpp issue; most GPU quantizations will get similar results. Is there a pre-existing quantization sweet spot suitable as the de facto choice for llama3?

BarfingLemurs avatar Apr 26 '24 04:04 BarfingLemurs

Hi, I'm noticing we don't have good perplexity/quality scores for llama3 8b at Q4_0:

F16:  Final estimate: PPL = 6.7647
Q8_0: Final estimate: PPL = 6.7646
Q4_0: Final estimate: PPL = 7.2904
Q5_1: Final estimate: PPL = 6.8849

This is a 7.7% difference, but these numbers are even worse earlier on in evaluation.

Mistral PPL:
F16:  Final estimate: PPL = 5.6925
Q8_0: Final estimate: PPL = 5.6918
Q4_0: Final estimate: PPL = 5.8192

Only 2.2% difference for mistral.

llama3 reacts more strongly to quantization, probably because it makes more use of the bits/precision it was trained on.

Someone should use MAP to find the frontier of best PPL vs. size (or any other 2-dimensional metric).

Green-Sky avatar May 09 '24 12:05 Green-Sky

Is this PR still working? I'd be interested to try it on the new deepseek-v2 models to see if using lower quants for the later layers is feasible.

jukofyork avatar Jun 27 '24 00:06 jukofyork

This PR seems dead, but I have found where you can hack this in: llama.cpp::llama_tensor_get_type().

Interestingly, there look to be a lot of hard-coded tests of n_expert == 8 that might be hurting the quantization of some of the newer MoE models that use more experts, like dbrx, deepseek-v2, Qwen-MoE, etc:

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

jukofyork avatar Jun 27 '24 10:06 jukofyork

static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);

    // TODO: avoid hardcoded tensor names - use the TN_* constants
    const llm_arch arch = qs.model.arch;
    const auto       tn = LLM_TN(arch);

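    // use_more_bits: extra precision for the first eighth and last eighth of layers, plus every third layer in between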
    auto use_more_bits = [](int i_layer, int num_layers) -> bool {
        return i_layer < num_layers/8 || i_layer >= 7*num_layers/8 || (i_layer - num_layers/8)%3 == 2;
    };
    const int n_expert = std::max(1, (int)qs.model.hparams.n_expert);
    auto layer_info = [n_expert] (int i_layer, int n_layer, const char * name) {
        if (n_expert > 1) {
            // Believe it or not, "experts" in the FFN of Mixtral-8x7B are not consecutive, but occasionally randomly
            // sprinkled in the model. Hence, simply dividing i_ffn_down by n_expert does not work
            // for getting the current layer as I initially thought, and we need to resort to parsing the
            // tensor name.
            if (sscanf(name, "blk.%d.", &i_layer) != 1) {
                throw std::runtime_error(format("Failed to determine layer for tensor %s", name));
            }
            if (i_layer < 0 || i_layer >= n_layer) {
                throw std::runtime_error(format("Bad layer %d for tensor %s. Must be in [0, %d)", i_layer, name, n_layer));
            }
        }
        return std::make_pair(i_layer, n_layer);
    };

    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
        if (qs.params->output_tensor_type < GGML_TYPE_COUNT) {
            new_type = qs.params->output_tensor_type;
        } else {
            int nx = tensor->ne[0];
            if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
                new_type = GGML_TYPE_Q8_0;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ2_S  || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M   ||
                     ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q5_K;
            }
            else if (new_type != GGML_TYPE_Q8_0) {
                new_type = GGML_TYPE_Q6_K;
            }
        }
    } else if (name == "token_embd.weight") {
        if (qs.params->token_embedding_type < GGML_TYPE_COUNT) {
            new_type = qs.params->token_embedding_type;
        } else {
            if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
                ftype == LLAMA_FTYPE_MOSTLY_IQ1_S   || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
                new_type = GGML_TYPE_Q2_K;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) {
                new_type = GGML_TYPE_IQ3_S;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
                new_type = GGML_TYPE_IQ3_S;
            }
        }
    } else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
               ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M    || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
        if (name.find("attn_v.weight") != std::string::npos) {
            if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
            else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
            ++qs.i_attention_wv;
        }
        else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (name.find("ffn_down") != std::string::npos) {
            if (qs.i_ffn_down < qs.n_ffn_down/8) {
                new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
            }
            ++qs.i_ffn_down;
        }
        else if (name.find("attn_output.weight") != std::string::npos) {
            if (qs.model.hparams.n_expert == 8) {
                new_type = GGML_TYPE_Q5_K;
            } else {
                if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) new_type = GGML_TYPE_IQ2_XXS;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_S;
            }
        }
    } else if (name.find("attn_v.weight") != std::string::npos) {
        if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
            new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = qs.model.hparams.n_gqa() >= 4 ? GGML_TYPE_Q4_K : !qs.has_imatrix ? GGML_TYPE_IQ3_S : GGML_TYPE_IQ3_XXS;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S) && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q5_K;
        else if ((ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && qs.model.hparams.n_gqa() >= 4) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) &&
                use_more_bits(qs.i_attention_wv, qs.n_attention_wv)) new_type = GGML_TYPE_Q6_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) new_type = GGML_TYPE_Q5_K;
        if (qs.model.type == MODEL_70B) {
            // In the 70B model we have 8 heads sharing the same attn_v weights. As a result, the attn_v.weight tensor is
            // 8x smaller compared to attn_q.weight. Hence, we can get a nice boost in quantization accuracy with
            // nearly negligible increase in model size by quantizing this tensor with more bits:
            if (new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K) new_type = GGML_TYPE_Q5_K;
        }
        if (qs.model.hparams.n_expert == 8) {
            // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
            // TODO: explore better strategies
            new_type = GGML_TYPE_Q8_0;
        }
        ++qs.i_attention_wv;
    } else if (name.find("attn_k.weight") != std::string::npos) {
        if (qs.model.hparams.n_expert == 8) {
            // for the 8-expert model, bumping this to Q8_0 trades just ~128MB
            // TODO: explore better strategies
            new_type = GGML_TYPE_Q8_0;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = GGML_TYPE_IQ2_S;
        }
    } else if (name.find("attn_q.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) {
            new_type = GGML_TYPE_IQ2_S;
        }
    } else if (name.find("ffn_down") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K_S) {
            if (i_layer < n_layer/8) new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS && !qs.has_imatrix) {
            new_type = i_layer < n_layer/8 ? GGML_TYPE_Q4_K : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
            new_type = i_layer < n_layer/16 ? GGML_TYPE_Q5_K
                     : arch != LLM_ARCH_FALCON || use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q4_K
                     : GGML_TYPE_Q3_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M && (i_layer < n_layer/8 ||
                    (qs.model.hparams.n_expert == 8 && use_more_bits(i_layer, n_layer)))) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
            new_type = arch == LLM_ARCH_FALCON ? GGML_TYPE_Q4_K : GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
            if (arch == LLM_ARCH_FALCON) {
                new_type = i_layer < n_layer/16 ? GGML_TYPE_Q6_K :
                           use_more_bits(i_layer, n_layer) ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
            } else {
                if (use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
            }
        }
        else if (i_layer < n_layer/8 && (ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) && !qs.has_imatrix) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M && use_more_bits(i_layer, n_layer)) new_type = GGML_TYPE_Q6_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && arch != LLM_ARCH_FALCON && i_layer < n_layer/8) {
            new_type = GGML_TYPE_Q5_K;
        }
        else if ((ftype == LLAMA_FTYPE_MOSTLY_Q4_0 || ftype == LLAMA_FTYPE_MOSTLY_Q5_0)
                && qs.has_imatrix && i_layer < n_layer/8) {
            // Guard against craziness in the first few ffn_down layers that can happen even with imatrix for Q4_0/Q5_0.
            // We only do it when an imatrix is provided because a) we want to make sure that one can always get the
            // same quantization as before imatrix stuff, and b) Q4_1/Q5_1 do go crazy on ffn_down without an imatrix.
            new_type = ftype == LLAMA_FTYPE_MOSTLY_Q4_0 ? GGML_TYPE_Q4_1 : GGML_TYPE_Q5_1;
        }
        ++qs.i_ffn_down;
    } else if (name.find("attn_output.weight") != std::string::npos) {
        if (arch != LLM_ARCH_FALCON) {
            if (qs.model.hparams.n_expert == 8) {
                if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q3_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL  ||
                    ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S || ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S  ||
                    ftype == LLAMA_FTYPE_MOSTLY_IQ3_M  || ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) {
                    new_type = GGML_TYPE_Q5_K;
                }
            } else {
                if      (ftype == LLAMA_FTYPE_MOSTLY_Q2_K   ) new_type = GGML_TYPE_Q3_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS) new_type = GGML_TYPE_IQ3_S;
                else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M ) new_type = GGML_TYPE_Q4_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L ) new_type = GGML_TYPE_Q5_K;
                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_M  ) new_type = GGML_TYPE_Q4_K;
            }
        } else {
            if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q4_K;
        }
    }
    else if (name.find("attn_qkv.weight") != std::string::npos) {
        if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L || ftype == LLAMA_FTYPE_MOSTLY_IQ3_M) {
            new_type = GGML_TYPE_Q4_K;
        }
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) new_type = GGML_TYPE_Q5_K;
        else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) new_type = GGML_TYPE_Q6_K;
    }
    else if (name.find("ffn_gate") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_gate;
    }
    else if (name.find("ffn_up") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (ftype == LLAMA_FTYPE_MOSTLY_IQ3_XS && (i_layer >= n_layer/8 && i_layer < 7*n_layer/8)) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_up;
    }

    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
    //}
    // IK: let's remove this, else Q2_K is almost the same as Q3_K_S
    //else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q3_K;
    //}
    // This can be used to reduce the size of the Q5_K_S model.
    // The associated PPL increase is fully in line with the size reduction
    //else {
    //    if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_S) new_type = GGML_TYPE_Q4_K;
    //}
    bool convert_incompatible_tensor = false;
    if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
        new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K || new_type == GGML_TYPE_IQ4_XS ||
        new_type == GGML_TYPE_IQ2_XS || new_type == GGML_TYPE_IQ2_XXS || new_type == GGML_TYPE_IQ2_S ||
        new_type == GGML_TYPE_IQ3_XXS || new_type == GGML_TYPE_IQ1_S || new_type == GGML_TYPE_IQ3_S ||
        new_type == GGML_TYPE_IQ1_M) {
        int nx = tensor->ne[0];
        int ny = tensor->ne[1];
        if (nx % QK_K != 0) {
            LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
            convert_incompatible_tensor = true;
        } else {
            ++qs.n_k_quantized;
        }
    }
    if (convert_incompatible_tensor) {
        switch (new_type) {
            case GGML_TYPE_IQ2_XXS:
            case GGML_TYPE_IQ2_XS:
            case GGML_TYPE_IQ2_S:
            case GGML_TYPE_IQ3_XXS:
            case GGML_TYPE_IQ3_S:
            case GGML_TYPE_IQ1_S:
            case GGML_TYPE_IQ1_M:
            case GGML_TYPE_Q2_K:
            case GGML_TYPE_Q3_K:
            case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;
            case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0;   break;
            case GGML_TYPE_Q5_K:   new_type = GGML_TYPE_Q5_1;   break;
            case GGML_TYPE_Q6_K:   new_type = GGML_TYPE_Q8_0;   break;
            default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
        }
        LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
        ++qs.n_fallback;
    }

    return new_type;
}

What criterion was used to find these combinations originally?

If we can test each configuration in a reasonable amount of time then it would be quite feasible to optimize this automatically using the Cross-Entropy Method (there is another version [for optimization] not shown on the Wikipedia page that optimizes discrete Bernoulli and/or categorical / "multinoulli" distributions [see Chapter 5 of Rubinstein's book]).

The dimensions are likely to be almost independent and it might even be nearly as easy to optimize "layer-index specific" quant schemes.

From previous experience using CEM on a highly independent set of variables like this, you would need to be able to perform a minimum of 10-20 evaluations per variable to be optimized (you need much, much more though if you need to assume a non-diagonal covariance matrix [or conditional dependence for the discrete case] - which I don't think this would need, and CMA-ES would be more suitable in that case anyway...).

It's very robust to noise so a noisy/quick evaluation criterion like perplexity will be preferable to a slow/precise criterion like KL-divergence.

One potential problem is if the optimization boundaries are hard to set due to, say, perplexity returning NaN, causing lots of samples to be discarded.

jukofyork avatar Jun 27 '24 10:06 jukofyork

What criterion was used to find these combinations originally?

The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process, I would expect that different models will benefit from different quantization schemes.

One potential problem is if the optimization boundaries are hard to set due to, say, perplexity returning NaN, causing lots of samples to be discarded.

This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.

slaren avatar Jun 27 '24 13:06 slaren

What criterion was used to find these combinations originally?

The original combinations were created by @ikawrakow from his own tests on llama and llama2 models, as far as I know. I think it would be very good to be able to automate this process, I would expect that different models will benefit from different quantization schemes.

One potential problem is if the optimization boundaries are hard to set due to, say, perplexity returning NaN, causing lots of samples to be discarded.

This should never happen unless there is a bug in llama.cpp, in which case it needs to be fixed rather than ignored.

I can't promise how soon I can look at it, but it is definitely possible even without understanding any of the original logic:

new_type = GGML_TYPE_XXX;

It would just need a modified version of the function that can select this from a categorical distribution (what people in the ML community have started calling "multinoulli").

The name "Cross Entropy Method" might sound intimidating, but it is actually super-simple (a rough code sketch follows the list below):

  1. Randomly initialize the distribution for each variable (intelligently if possible).
  2. Take N samples (usually 100) from the distribution and evaluate each sample.
  3. Rank the samples and choose the top 0.1 * N of the samples.
  4. Calculate the new Maximum Likelihood distribution to use from these samples.
  5. Go to step 2.
  • For (1) you could set the initial categorical distribution to be weighted heavily towards @ikawrakow's choices and possibly also set hard boundaries on what you think are sensible for the memory size budget you are looking at.
  • For (2a), since we are assuming independence of the variables it will just be a simple "weighted roulette wheel" selection process.
  • For (2b), since we have a memory size budget this will have to be incorporated as a constraint into the evaluation via a penalty (a soft penalty preferably so as not to discard too many samples... You can progressively "harden" the penalty during the run to enforce the constraint though).
  • For (4), this just comes down to the empirical fraction of counts in each bin for the discrete case. You have to be slightly careful that none of the bins get set to zero (this is easily solved via Additive smoothing and IIRC explained in Rubinstein's book).
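
To make those steps concrete, here is a minimal, self-contained sketch of CEM over independent categorical variables. It is a toy stand-in, not hooked up to llama-quantize: evaluate() is a placeholder objective, and the variable count, population size, elite fraction and iteration budget are just illustrative numbers.

#include <algorithm>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Placeholder objective: in practice this would quantize the model with the
// chosen per-tensor types and return perplexity plus a size penalty.
static double evaluate(const std::vector<int> & choice) {
    double s = 0.0;
    for (int c : choice) s += (c - 2) * (c - 2); // toy objective, optimum at category 2
    return s;
}

int main() {
    const int    n_vars    = 8;    // e.g. number of tensor groups to pick a type for
    const int    n_cats    = 5;    // e.g. number of candidate quant types per group
    const int    n_samples = 100;  // population size per iteration
    const int    n_elite   = 10;   // top 10% kept for the update
    const double smooth    = 1e-2; // additive smoothing so no bin ever hits zero

    std::mt19937 rng(42);
    // probs[v][c] = probability of picking category c for variable v (uniform init)
    std::vector<std::vector<double>> probs(n_vars, std::vector<double>(n_cats, 1.0 / n_cats));

    for (int iter = 0; iter < 30; ++iter) {
        // step 2: sample and evaluate a population
        std::vector<std::pair<double, std::vector<int>>> scored;
        for (int s = 0; s < n_samples; ++s) {
            std::vector<int> choice(n_vars);
            for (int v = 0; v < n_vars; ++v) {
                std::discrete_distribution<int> d(probs[v].begin(), probs[v].end());
                choice[v] = d(rng);
            }
            scored.emplace_back(evaluate(choice), choice);
        }
        // step 3: rank and keep the elite fraction (lower objective = better)
        std::sort(scored.begin(), scored.end());
        // step 4: maximum-likelihood update, i.e. empirical fraction of counts per bin, smoothed
        for (int v = 0; v < n_vars; ++v) {
            std::vector<double> counts(n_cats, smooth);
            for (int e = 0; e < n_elite; ++e) counts[scored[e].second[v]] += 1.0;
            double total = 0.0;
            for (double c : counts) total += c;
            for (int c = 0; c < n_cats; ++c) probs[v][c] = counts[c] / total;
        }
        printf("iter %2d  best objective %.3f\n", iter, scored[0].first);
    }
    return 0;
}

In the real setting, evaluate() would run the quantization and a short perplexity pass, with the memory-budget penalty from (2b) folded into the score.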

Just eyeballing the function there looks to be maybe 5-10 choices for a given model, so using a population of 100 and assuming 10-20 evaluations per variable: 5*100*10 = 5000 .. 10*100*20 = 20000 evaluations per model to be optimized (minimum), but it is likely a lot could be learnt from small models and used to constrain the search for larger models.

A week has 7*24*60 = 10080 minutes, so it would need to take no longer than 2-5 minutes per evaluation to be feasible IMO. It is very easy to parallelize using MPI though so could be run on a cluster of machines if needed.

jukofyork avatar Jun 27 '24 14:06 jukofyork

That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.

slaren avatar Jun 27 '24 14:06 slaren

That sounds very interesting. I am not sure what parameter you would use to optimize a quantization scheme. I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time. Not sure if that affects your estimation of the number of samples.

Generally, the more you constrain it and the more the independence assumption is broken, the more samples you need (i.e. it will waste samples trying to pass over the constraints and will have trouble navigating non-orthogonal "valleys" otherwise).

If the independence assumption is very wrong then it's almost certainly better to use CMA-ES instead (CEM does have a version using a non-diagonal covariance matrix, but it requires a Cholesky Factorization to sample from and suffers from needing many more samples to reliably estimate the covariance matrix compared to CMA-ES's incremental method).

There are likely other things like using a clipped-Gaussian instead of a categorical distribution (as the choices are ordered) that can be tried to reduce the number of samples needed.

It works really well in practice and often can find solutions a human could not due to the human getting stuck in a local optima where they can't escape by tuning a single variable alone.


If the optimization landscape is very smooth and "nice", there are other methods that can use far fewer samples. Somebody with an OR background would likely be able to suggest even better ways of tackling this - I've just had success in the past using CEM for problems almost exactly like this (and SPSA for problems with homogeneous variables and low-noise evaluations available).

jukofyork avatar Jun 27 '24 14:06 jukofyork

Sorry I missed this part of your question:

I guess that what we really would want to find is the pareto front of all the quantization schemes that are not worse than any other scheme in both file size and perplexity at the same time.

This can likely be found in a single run by starting with the maximum allowable memory budget, converging to a (fairly) stable solution, and then reducing the budget constraint/penalty downwards (or vice versa).

If you search for "L1 regularization path" you'll see plots like this found all in 1 run:

[image: example L1 regularization path plot]

Which are basically doing the same thing by reducing (or increasing) the penalty during a single run of the optimization algorithm.

jukofyork avatar Jun 27 '24 15:06 jukofyork

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

Out of luck trying to do anything with the "shared_experts":

[  28/ 959]           blk.1.ffn_down_exps.weight - [ 1536,  5120,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  29/ 959]           blk.1.ffn_gate_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  30/ 959]             blk.1.ffn_up_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB

jukofyork avatar Jun 27 '24 18:06 jukofyork

If I can get this PR working or figure out how to hack the llama.cpp::llama_tensor_get_type() code, I'm going to try using bigger quants for the early layers' expert tensors and smaller ones for the later layers.

I can't find it now but read a paper that hypothesised the later layers don't do all that much and mostly just do averaging (link me if you know this paper please!). This paper (which came later IIRC) also shows this:

https://arxiv.org/pdf/2403.17887

[image: layer-similarity heatmaps from the paper]

Starting around 60th percentile layer in.

"Across models, the deeper layers tend to be very similar, though the deepest blocks that include the final layer (squares along the outer diagonal) are (near-)maximally dissimilar."


Charles Goddard (the Mergekit creator) tried the above method here:

https://huggingface.co/chargoddard/llama3-42b-v0

but I think it's got a much better chance keeping the layers and just having them more heavily quantized... Deepseek-v2 looks like the perfect model to try this on, as it's 90% MLP.

jukofyork avatar Jun 27 '24 18:06 jukofyork

The new "shared experts" might need thinking about now too:

  "n_routed_experts": 160,
  "n_shared_experts": 2,
  "num_experts_per_tok": 6

as they will be disproportionately affected by quantization (i.e. used 100% of the time vs 4/160 = 2.5% of the time, for the example deepseek-v2 config above).

Out of luck trying to do anything with the "shared_experts":

[  28/ 959]           blk.1.ffn_down_exps.weight - [ 1536,  5120,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  29/ 959]           blk.1.ffn_gate_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB
[  30/ 959]             blk.1.ffn_up_exps.weight - [ 5120,  1536,   160,     1], type =    f16, converting to iq3_xxs .. size =  2400.00 MiB ->   459.38 MiB

Actually I've found they are in separate tensors and are named differently: ffn_up_shexp.weight, ffn_gate_shexp.weight, and ffn_down_shexp.weight.

I've also found the low-rank attn_q_a.weight, attn_q_b.weight, attn_kv_a_mqa.weight and attn_kv_b.weight tensors were falling through and getting quantized using the lowest default... This is very bad, as these are actually tiny compared to the rest of the giant MLP tensors, and the W.W^T products that this creates will likely have O(((w-q)^2)^2) rate-distortion (i.e. 4th-power quantization error!).

So I've tried to look through the function to distil what @ikawrakow obviously must have spent hours figuring out, and have come up with this:

    // ### JUK'S DEEPSEEK V2 CUSTOM CONFIG (Use: 'llama-quantize --imatrix ... ... ... Q5_K_M') ###
    if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
         new_type = GGML_TYPE_Q6_K;
    } else if (name == "token_embd.weight") {
         new_type = GGML_TYPE_Q5_K;
    } else if (name.find("attn_q_a.weight") != std::string::npos || name.find("attn_q_b.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    } else if (name.find("attn_kv_a_mqa.weight") != std::string::npos || name.find("attn_kv_b.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
        // ++qs.i_attention_wv; @@@ Looks to be used for 'use_more_bits' tests and not outside this function... @@@
    } else if (name.find("attn_output.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q5_K;
    } else if (name.find("shexp.weight") != std::string::npos) {
        new_type = GGML_TYPE_Q8_0;
    } else if (name.find("ffn_down_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_down, qs.n_ffn_down, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ4_XS;
        }
        else {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        ++qs.i_ffn_down;
    } else if (name.find("ffn_gate_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_gate, qs.n_ffn_gate, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
        ++qs.i_ffn_gate;
    } else if (name.find("ffn_up_exps.weight") != std::string::npos) {
        auto info = layer_info(qs.i_ffn_up, qs.n_ffn_up, name.c_str());
        int i_layer = info.first, n_layer = info.second;
        if (i_layer < n_layer/8 || i_layer >= 7*n_layer/8) {
            new_type = GGML_TYPE_IQ3_XXS;
        }
        else {
            new_type = GGML_TYPE_IQ2_S;
        }
        ++qs.i_ffn_up;
    } else
    // ### JUK ###

It needs to be copied right before this line:

if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")))

The mix of the GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS and GGML_TYPE_IQ2_S is just my attempt at getting this to fit in 96GB of VRAM...

Hopefully this helps, as the IQ3_XXS version I made using the stock settings, with the problems outlined above (and that let me get a whopping 1K of context in 96GB VRAM!), was as dumb as a post... 🙁


I will also try just leaving all these as f16 later, as they are tiny in comparison to everything else, and the ffn_gate_inp.weight routing tensors are already left as f32 for this reason:

[  16/ 959]          blk.1.ffn_down_shexp.weight - [ 3072,  5120,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  17/ 959]          blk.1.ffn_gate_shexp.weight - [ 5120,  3072,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  18/ 959]            blk.1.ffn_up_shexp.weight - [ 5120,  3072,     1,     1], type =    f16, converting to q8_0 .. size =    30.00 MiB ->    15.94 MiB
[  20/ 959]           blk.1.attn_kv_a_mqa.weight - [ 5120,   576,     1,     1], type =    f16, converting to q8_0 .. size =     5.62 MiB ->     2.99 MiB
[  21/ 959]               blk.1.attn_kv_b.weight - [  512, 32768,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  24/ 959]                blk.1.attn_q_a.weight - [ 5120,  1536,     1,     1], type =    f16, converting to q8_0 .. size =    15.00 MiB ->     7.97 MiB
[  25/ 959]                blk.1.attn_q_b.weight - [ 1536, 24576,     1,     1], type =    f16, converting to q8_0 .. size =    72.00 MiB ->    38.25 MiB

jukofyork avatar Jun 28 '24 13:06 jukofyork

Yeah, I think quantizing the low-rank attention weights was absolutely killing the model... I've put in a PR to fix this: https://github.com/ggerganov/llama.cpp/pull/8194.

jukofyork avatar Jun 28 '24 15:06 jukofyork

Giving some usage feedback so that this gets merged.

This PR works pretty much out-of-the-box (you just need to define quant types in lowercase instead of uppercase in the config, plus a few tweaks to adapt to the latest llama.cpp state).

It introduces very useful functionality. Props to @jubruckne !

HaroldBenoit avatar Sep 11 '24 12:09 HaroldBenoit