Changes to the existing quant strategies / FTYPEs, and new ones
Here are a few edits I consider useful to improve the IQ2 quant strategies for some models (a short code sketch of these rules follows the list):
- The tensor attn.v.weight is passed in Q4_K for models like Gemma v2 (GQA 2) and the various franken-MoEs with 2 experts, so as not to sabotage them with a too-small value-head quant (Q2_K is mediocre for such an important tensor), while that head remains small relative to the total size of the affected models.
- The tensor attn.k.weight is passed in Q4_K for models with 8 experts or more, rather than only exactly 8.
- The tensor attn.output.weight is passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to get a progression between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of IQ3_S over IQ3_XXS for that tensor is nearly nonexistent at IQ2_S and IQ2_M, especially compared to the size bump it causes.
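To make the three bullet points above concrete, here is a small self-contained C++ sketch of the proposed rules. It is illustrative only: the enum and function names (`QuantType`, `Ftype`, `pick_attn_v`, `pick_attn_k`, `pick_attn_output`) are hypothetical and do not match the actual type-selection code in llama.cpp.

```cpp
// Illustrative sketch only: hypothetical names, not the actual llama.cpp code.
#include <cstdint>

enum class QuantType { IQ2_XS, IQ3_XXS, IQ3_S, Q4_K };
enum class Ftype     { IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS };

// Rule 1: attn_v.weight -> Q4_K on GQA-2 models (e.g. Gemma 2) and 2-expert MoEs,
// because that tensor is small relative to the model but matters a lot for quality.
QuantType pick_attn_v(uint32_t n_gqa, uint32_t n_expert, QuantType ftype_default) {
    if (n_gqa == 2 || n_expert == 2) return QuantType::Q4_K;
    return ftype_default;
}

// Rule 2: attn_k.weight -> Q4_K for 8 experts *or more*, not only exactly 8.
QuantType pick_attn_k(uint32_t n_expert, QuantType ftype_default) {
    return n_expert >= 8 ? QuantType::Q4_K : ftype_default;   // was: n_expert == 8
}

// Rule 3: attn_output.weight steps IQ2_XS -> IQ3_XXS -> IQ3_S across the ftypes,
// instead of jumping straight from IQ2_XS to IQ3_S at IQ2_S / IQ2_M.
QuantType pick_attn_output(Ftype ftype) {
    switch (ftype) {
        case Ftype::IQ2_XXS:
        case Ftype::IQ2_XS: return QuantType::IQ2_XS;
        case Ftype::IQ2_S:
        case Ftype::IQ2_M:  return QuantType::IQ3_XXS;  // was IQ3_S on master
        default:            return QuantType::IQ3_S;    // IQ3_XXS ftype and above
    }
}
```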
More broadly, I think the whole IQ2 group of quant strategies should be harmonized/refactored the way the rest of the quant strategies are established (tensor by tensor), rather than kept under a separate tree mixing these 5 quant strategies; a rough structural sketch of that idea follows.
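The sketch below reuses the hypothetical enums and helpers from the previous block (again, not the actual llama.cpp implementation): one branch per tensor name, each branch covering all FTYPEs, IQ1/IQ2 included, instead of a dedicated low-bit tree handling every tensor at once.

```cpp
// Structural sketch only, building on the hypothetical declarations above;
// not the llama.cpp implementation.
#include <string>

QuantType pick_type(const std::string & name, Ftype ftype,
                    uint32_t n_gqa, uint32_t n_expert, QuantType ftype_default) {
    // One branch per tensor name, each covering every ftype (IQ1/IQ2 included),
    // rather than a single dedicated "low-bit ftypes" tree for all tensors at once.
    if (name.find("attn_v.weight") != std::string::npos) {
        return pick_attn_v(n_gqa, n_expert, ftype_default);
    }
    if (name.find("attn_k.weight") != std::string::npos) {
        return pick_attn_k(n_expert, ftype_default);
    }
    if (name.find("attn_output.weight") != std::string::npos) {
        return pick_attn_output(ftype);
    }
    // ... ffn_down, token_embd, output, etc. would follow the same pattern.
    return ftype_default;
}
```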
I have been using these settings (among many other edits) for a long time, with benefit, and I think they could be used as defaults.
Partial "changelog" :
Edit : I applied the attn.v.weight modifications to the IQ3 quant strategies as well. Edit 2 : I looked furthermore at the attn.v.weight "tree" and made changes in coherence with what I did for IQ2 and IQ3, but only when Ikawrakow's code was clearly going in such direction already. Edit 3 : I harmonized all the n_expert == 8 into n_expert >= 8 within the quant strategies. Edit 4 and 5 : attn_output.weight and attn_qkv.weight changed from Q4_K (4.5bpw) to IQ4_XS (4.25bpw) for FTYPE IQ3_M, considering that FTYPE IQ4_XS is a more qualitative quant and has both tensors in IQ4_XS.
Edit 6/7 : attn.k.weight is relatively small on GQA & MOE models, and penalizing it on IQ3_XXS and IQ3_XS quant strategies isn't pertinent imo on such models. I simply removed the penalty in such cases. Edit 8 : bolster a bit IQ3_M with a bump on its attn_v and attn_k tensors. Edit 9 : get rid of the IQ1/IQ2 quant strategy tree, and replace these quants in the usual tensors tree, and increase their attn_ tensors when they are used to quantize MOEs. Edit 10 : Shorten a bit the formatting to remove a few lines.
Edit 11 : Refactor partly the attn_k tensors tree and add progressivity in the quants. Edit 12 : Lower the threshold of 8 to 4 for the big MOE-specific quant parameters. Edit 13 : Rework a bit the attn.v tensors tree for more progressivity. Edit 14 : Some revamp done on token embeddings, attn_qkv, and the ffns. Edit 15 and 16 : New quants : Q2_K_L, IQ2_XL, IQ3_XL
Edit 17 : Merge master b3565 Edit 18 and 19 : New Quant : IQ1_XS Edit 20 and 21 : Some adjustments and reversals. Edit 21 : New IQ1_XL quant strategy, and some corrections Edit 22 : Merge master b3569
Examples:
Current quants are here : https://huggingface.co/Nexesenex/google_gemma-2-9b-it_iMat.GGUF/tree/main
Results:
IQ1_XS
PR Current : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.15 GiB (2.00 BPW)
Arc-C 299 42.80936455
Arc-E 570 68.24561404
PPL 512 wikitext : 15.1105 +/- 0.11363
IQ1_S
MASTER : Gemma 2 9b It IQ1_S, quant made from BF16
Size : 2.21 GiB (2.05 BPW)
Arc-C 299 42.47491639
Arc-E 570 66.84210526
PPL 512 wikitext : 15.9317 +/- 0.11979
PR Current : Gemma 2 9b It IQ1_S quant made from BF16
Size : 2.23 GiB (2.07 BPW)
Arc-C 299 43.14381271
Arc-E 570 68.42105263
PPL 512 wikitext : 14.1578 +/- 0.10530
IQ1_M
MASTER : Gemma 2 9b It IQ1_M, quant made from BF16
Size : 2.37 GiB (2.20 BPW)
Arc-C 299 45.81939799
Arc-E 570 73.85964912
PPL 512 wikitext : 13.7215 +/- 0.10231
PR Current : Gemma 2 9b It IQ1_M quant made from BF16
Size : 2.36 GiB (2.19 BPW)
Arc-C 299 45.81939799
Arc-E 570 74.56140351
PPL 512 wikitext : 12.6773 +/- 0.09336
IQ1_XL
PR Current : Gemma 2 9b It IQ1_XL quant made from BF16
Size : 2.48 GiB (2.30 BPW)
Arc-C 299 47.49163880
Arc-E 570 73.33333333
PPL 512 wikitext : 11.5001 +/- 0.08487
IQ2_XXS
MASTER : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.63 GiB (2.44 BPW)
Arc-C 299 48.16053512
Arc-E 570 73.15789474
PPL 512 wikitext : 11.2527 +/- 0.08307
PR Current : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.73 GiB (2.54 BPW)
Arc-C 299 48.82943144
Arc-E 570 74.56140351
PPL 512 wikitext : 10.8439 +/- 0.08026
IQ2_XS
MASTER : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.85 GiB (2.65 BPW)
Arc-C 299 49.49832776
Arc-E 570 78.24561404
PPL 512 wikitext : 10.5698 +/- 0.07803
PR Current : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.91 GiB (2.70 BPW)
Arc-C 299 49.16387960
Arc-E 570 78.59649123
PPL 512 wikitext : 10.3607 +/- 0.07660
IQ2_S
MASTER : Gemma 2 9b It IQ2_S (with iMatrix, attn_output and attn.v in IQ3_S), quant made from BF16
Size : 2.99 GiB (2.77 BPW)
Arc-C 299 52.84280936
Arc-E 570 77.54385965
PPL 512 wikitext : 10.3868 +/- 0.07787
PR init : Gemma 2 9b It IQ2_S (with Imatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299 49.83277592
Arc-E 570 77.71929825
PPL 512 wikitext : 10.1303 +/- 0.07486
PR Current : Gemma 2 9b It IQ2_S, quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299 52.17391304
Arc-E 570 77.89473684
PPL 512 wikitext : 10.1071 +/- 0.07450
IQ2_M
MASTER : Gemma 2 9b It IQ2_M (with iMatrix, attn_output and attn.v in IQ3_S), quant made from BF16
Size : 3.19 GiB (2.97 BPW)
Arc-C 299 56.52173913
Arc-E 570 77.01754386
PPL 512 wikitext : 9.8154 +/- 0.07324
PR init : Gemma 2 9b It IQ2_M (with Imatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.20 GiB (2.98 BPW)
Arc-C 299 54.18060201
Arc-E 570 78.07017544
PPL 512 wikitext : 9.5734 +/- 0.07040
PR CURRENT : Gemma 2 9b It IQ2_M, quant made from BF16
Size : 3.29 GiB (3.06 BPW)
Arc-C 299 55.85284281
Arc-E 570 78.07017544
PPL 512 wikitext : 9.4128 +/- 0.06881
IQ2_XL
PR CURRENT : Gemma 2 9b It IQ2_XL, quant made from BF16
Size : 3.41 GiB (3.17 BPW)
Arc-C 299 56.18729097
Arc-E 570 78.07017544
PPL 512 wikitext : 9.3283 +/- 0.06820
Q2_K_L
PR CURRENT : Gemma 2 9b It Q2_K_L, quant made from BF16
Size : 3.70 GiB (3.44 BPW)
Arc-C 299 58.19397993
Arc-E 570 79.29824561
PPL 512 wikitext : around 9.25
IQ3_XXS
MASTER : Gemma 2 9b It IQ3_XXS (with iMatrix, attn_k in IQ2_S, and attn_v in IQ3_XXS), quant made from BF16
Size : 3.53 GiB (3.28 BPW)
Arc-C 299 56.52173913
Arc-E 570 79.12280702
PPL 512 wikitext : 9.4116 +/- 0.06982
PR CURRENT : Gemma 2 9b It IQ3_XXS (with Imatrix, attn_k in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.60 GiB (3.35 BPW)
Arc-C 299 56.18729097
Arc-E 570 78.77192982
PPL 512 wikitext : 9.2026 +/- 0.06781
IQ3_XS
MASTER : Gemma 2 9b It IQ3_XS (with iMatrix), quant made from BF16
Size : 3.85 GiB (3.58 BPW)
Arc-C 299 58.86287625
Arc-E 570 78.94736842
PPL 512 wikitext : 9.2584 +/- 0.06866
PR CURRENT : Gemma 2 9b It IQ3_XS (with Imatrix), quant made from BF16
Size : 3.82 GiB (3.55 BPW)
Arc-C 299 57.19063545
Arc-E 570 78.07017544
PPL 512 wikitext : 9.0658 +/- 0.06633
IQ3_S
MASTER : Gemma 2 9b It IQ3_S (with iMatrix, attn_v in IQ3_S), quant made from BF16
Size : 4.03 GiB (3.75 BPW)
Arc-C 299 57.52508361
Arc-E 570 77.71929825
PPL 512 wikitext : 9.2100 +/- 0.06859
PR : Gemma 2 9b It IQ3_S (with Imatrix, attn_v in Q4_K), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299 57.19063545
Arc-E 570 78.07017544
PPL 512 wikitext : 9.0082 +/- 0.06633
PR rev 2: Gemma 2 9b It IQ3_S (with Imatrix), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299 56.85618729
Arc-E 570 78.42105263
PPL 512 wikitext : 9.0082 +/- 0.06633
(I think ARC differences are due to the b3565 merge)
PR rev3 - CURRENT: Gemma 2 9b It IQ3_S (with Imatrix), quant made from BF16
Size : 4.05 GiB (3.76 BPW)
Arc-C 299 57.52508361
Arc-E 570 78.42105263
PPL 512 wikitext : 8.9969 +/- 0.06610
IQ3_M
MASTER : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in Q4_K), quant made from BF16
Size : 4.18 GiB (3.89 BPW)
Arc-C 299 56.85618729
Arc-E 570 77.71929825
PPL 512 wikitext : 8.9697 +/- 0.06598
PR : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS), quant made from BF16
Size : 4.16 GiB (3.87 BPW)
Arc-C 299 57.19063545
Arc-E 570 77.71929825
PPL 512 wikitext : 8.9556 +/- 0.06586
PR rev2 : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS, attn.v Q5_K), quant made from BF16
Size : 4.20 GiB (3.90 BPW)
Arc-C 299 58.52842809
Arc-E 570 77.54385965
PPL 512 wikitext : 8.9445 +/- 0.06576
PR rev3 - CURRENT : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS, attn.v Q5_K, attn.k IQ4_XS), quant made from BF16
Size : 4.23 GiB (3.93 BPW)
Arc-C 299 58.19397993
Arc-E 570 77.19298246
PPL 512 wikitext : 8.9082 +/- 0.06536
IQ3_XL
PR CURRENT : Gemma 2 9b It IQ3_XL (with Imatrix), quant made from BF16
Size : 4.50 GiB (4.18 BPW)
Arc-C 299 56.85618729
Arc-E 570 78.42105263
PPL 512 wikitext : 8.8843 +/- 0.06558
IQ4_XS
MASTER : Gemma 2 9b It IQ4_XS (with iMatrix), quant made from BF16
Size : 4.87 GiB (4.52 BPW)
Arc-C 299 57.52508361
Arc-E 570 78.24561404
PPL 512 wikitext : 8.8456 +/- 0.06533
FP16
MASTER : Gemma 2 9b It F16.
Size : 17.22 GiB (16.00 BPW)
Arc-C 299 59.53177258
Arc-E 570 78.77192982
PPL 512 wikitext : 8.7881 +/- 0.06533
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
It would be useful to have some objective data (such as perplexity tests) to evaluate the effect of these changes.
@Slaren:
Relevant examples are in the head post.
Considering the results obtained, I think it's worth it: the size remains around 260 MB below IQ3_XS. The overall high BPW of Gemma is explained mostly by the monolithic embd/output tensor being in Q5_K, as output.weight usually is on FTYPE IQ3_XXS.
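As a rough back-of-the-envelope check of that BPW point (my own numbers, not from the PR: I'm assuming Gemma 2 9B has a ~256k vocabulary and a 3584 hidden size, Q5_K at 5.5 bpw, and the ~9.2B total weight count implied by the F16 line above; the exact GGUF metadata should be checked):

$$
256000 \times 3584 \approx 0.92\text{B weights}, \qquad 0.92\text{B} \times 5.5\ \text{bpw} \approx 5.0\ \text{Gbit} \approx 0.59\ \text{GiB}
$$

$$
5.0\ \text{Gbit} \,/\, 9.2\text{B total weights} \approx 0.55\ \text{bpw of the overall figure}
$$

So the tied embd/output tensor alone would account for roughly half a bit per weight of the overall BPW, which is consistent with Gemma GGUFs looking heavier than other models at the same FTYPE.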
Non-GQA / non-MoE models are not affected.
As for FTYPE IQ3_M having attn_output.weight and attn_qkv.weight in Q4_K: Ikawrakow's setting was made prior to the IQ4_XS quants and never edited since, and FTYPE IQ4_XS has attn.output.weight in IQ4_XS without any problem reported about it. A PPL test is imo not even necessary there.
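A minimal sketch of that specific change (hypothetical names again, not the actual selection code in llama.cpp): for FTYPE IQ3_M, attn_output.weight and attn_qkv.weight move from Q4_K (4.5 bpw) to IQ4_XS (4.25 bpw), aligning with what FTYPE IQ4_XS already uses for those tensors.

```cpp
// Illustrative sketch only: hypothetical names, not the actual llama.cpp code.
enum class QType { Q4_K, IQ4_XS };
enum class FType { IQ3_M, IQ4_XS, OTHER };

// attn_output.weight / attn_qkv.weight: FTYPE IQ3_M aligns on IQ4_XS,
// which FTYPE IQ4_XS already uses for the same tensors without reported issues.
QType pick_attn_output_or_qkv(FType ftype) {
    switch (ftype) {
        case FType::IQ3_M:  return QType::IQ4_XS;  // was Q4_K before this change
        case FType::IQ4_XS: return QType::IQ4_XS;  // unchanged
        default:            return QType::Q4_K;    // placeholder for other ftypes
    }
}
```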
For FTYPE IQ2_S/IQ2_M, jumping from attn_output in IQ2_XS (FTYPE IQ2_XS) straight to attn_output in IQ3_S (FTYPE IQ2_S, which is otherwise mainly made of IQ2_XS tensors) is overkill; Ikawrakow likely didn't pay attention to it and just threw in a value, so both IQ2_S and IQ3_XXS were simply skipped over for that tensor.
Edit: I will actually do the tests and share the quants on Huggingface.
Edit 2: tests made on Gemma 2 9b It. I think they're conclusive.
Hi @Nexesenex, very interesting work on quantization.
Perhaps try the 1.58-bit dynamic quants? It might be nice to have them in llama.cpp. Have you investigated the dynamic quants, like these?
https://unsloth.ai/blog/dynamic-4bit
and
https://unsloth.ai/blog/deepseekr1-dynamic
Thanks! Have a nice day.