Changes to the existing quant strategies / FTYPEs, and new ones
Here are a few edits I consider useful to improve the IQ2 quant strategies for some models (a short code sketch of these rules follows the list):
- The tensor attn.v.weight is passed in Q4_K for models like Gemma v2 (GQA 2) and the various franken-MoEs with 2 experts, so as not to sabotage them with a too-small value-head quant (Q2_K is mediocre for such an important tensor), while that head remains small relative to the total size of the affected models.
- The tensor attn.k.weight is passed in Q4_K for models with 8 experts or more, rather than only exactly 8.
- The tensor attn.output.weight is passed in IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to get a progression between the IQ2_XS quant strategies (which use IQ2_XS for attn.output.weight) and the IQ3_XXS quant strategies (which use IQ3_S for attn.output.weight). The benefit of IQ3_S over IQ3_XXS for that tensor is nearly nonexistent at IQ2_S and IQ2_M, especially compared to the size bump it causes.
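To make the three bullet points above concrete, here is a small self-contained C++ sketch of the proposed rules. It is illustrative only: the enum and function names (`QuantType`, `Ftype`, `pick_attn_v`, `pick_attn_k`, `pick_attn_output`) are hypothetical and do not match the actual type-selection code in llama.cpp.

```cpp
// Illustrative sketch only: hypothetical names, not the actual llama.cpp code.
#include <cstdint>

enum class QuantType { IQ2_XS, IQ3_XXS, IQ3_S, Q4_K };
enum class Ftype     { IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS };

// Rule 1: attn_v.weight -> Q4_K on GQA-2 models (e.g. Gemma 2) and 2-expert MoEs,
// because that tensor is small relative to the model but matters a lot for quality.
QuantType pick_attn_v(uint32_t n_gqa, uint32_t n_expert, QuantType ftype_default) {
    if (n_gqa == 2 || n_expert == 2) return QuantType::Q4_K;
    return ftype_default;
}

// Rule 2: attn_k.weight -> Q4_K for 8 experts *or more*, not only exactly 8.
QuantType pick_attn_k(uint32_t n_expert, QuantType ftype_default) {
    return n_expert >= 8 ? QuantType::Q4_K : ftype_default;   // was: n_expert == 8
}

// Rule 3: attn_output.weight steps IQ2_XS -> IQ3_XXS -> IQ3_S across the ftypes,
// instead of jumping straight from IQ2_XS to IQ3_S at IQ2_S / IQ2_M.
QuantType pick_attn_output(Ftype ftype) {
    switch (ftype) {
        case Ftype::IQ2_XXS:
        case Ftype::IQ2_XS: return QuantType::IQ2_XS;
        case Ftype::IQ2_S:
        case Ftype::IQ2_M:  return QuantType::IQ3_XXS;  // was IQ3_S on master
        default:            return QuantType::IQ3_S;    // IQ3_XXS ftype and above
    }
}
```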
More broadly, I think the whole IQ2 group of quant strategies should be harmonized/refactored the way the rest of the quant strategies are established (tensor by tensor), rather than kept under a separate tree mixing these 5 quant strategies; a rough structural sketch of that idea follows.
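The sketch below reuses the hypothetical enums and helpers from the previous block (again, not the actual llama.cpp implementation): one branch per tensor name, each branch covering all FTYPEs, IQ1/IQ2 included, instead of a dedicated low-bit tree handling every tensor at once.

```cpp
// Structural sketch only, building on the hypothetical declarations above;
// not the llama.cpp implementation.
#include <string>

QuantType pick_type(const std::string & name, Ftype ftype,
                    uint32_t n_gqa, uint32_t n_expert, QuantType ftype_default) {
    // One branch per tensor name, each covering every ftype (IQ1/IQ2 included),
    // rather than a single dedicated "low-bit ftypes" tree for all tensors at once.
    if (name.find("attn_v.weight") != std::string::npos) {
        return pick_attn_v(n_gqa, n_expert, ftype_default);
    }
    if (name.find("attn_k.weight") != std::string::npos) {
        return pick_attn_k(n_expert, ftype_default);
    }
    if (name.find("attn_output.weight") != std::string::npos) {
        return pick_attn_output(ftype);
    }
    // ... ffn_down, token_embd, output, etc. would follow the same pattern.
    return ftype_default;
}
```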
I have been using these settings (among many other edits) for a long time, with benefit, and I think they could be used as defaults.
Partial "changelog" :
Edit : I applied the attn.v.weight modifications to the IQ3 quant strategies as well. Edit 2 : I looked furthermore at the attn.v.weight "tree" and made changes in coherence with what I did for IQ2 and IQ3, but only when Ikawrakow's code was clearly going in such direction already. Edit 3 : I harmonized all the n_expert == 8 into n_expert >= 8 within the quant strategies. Edit 4 and 5 : attn_output.weight and attn_qkv.weight changed from Q4_K (4.5bpw) to IQ4_XS (4.25bpw) for FTYPE IQ3_M, considering that FTYPE IQ4_XS is a more qualitative quant and has both tensors in IQ4_XS.
Edit 6/7 : attn.k.weight is relatively small on GQA & MOE models, and penalizing it on IQ3_XXS and IQ3_XS quant strategies isn't pertinent imo on such models. I simply removed the penalty in such cases. Edit 8 : bolster a bit IQ3_M with a bump on its attn_v and attn_k tensors. Edit 9 : get rid of the IQ1/IQ2 quant strategy tree, and replace these quants in the usual tensors tree, and increase their attn_ tensors when they are used to quantize MOEs. Edit 10 : Shorten a bit the formatting to remove a few lines.
Edit 11 : Refactor partly the attn_k tensors tree and add progressivity in the quants. Edit 12 : Lower the threshold of 8 to 4 for the big MOE-specific quant parameters. Edit 13 : Rework a bit the attn.v tensors tree for more progressivity. Edit 14 : Some revamp done on token embeddings, attn_qkv, and the ffns. Edit 15 and 16 : New quants : Q2_K_L, IQ2_XL, IQ3_XL
Edit 17 : Merge master b3565 Edit 18 and 19 : New Quant : IQ1_XS Edit 20 and 21 : Some adjustments and reversals. Edit 21 : New IQ1_XL quant strategy, and some corrections Edit 22 : Merge master b3569
Examples:
Current quants are here : https://huggingface.co/Nexesenex/google_gemma-2-9b-it_iMat.GGUF/tree/main
Results:
IQ1_XS
PR Current : Gemma 2 9b It IQ1_XS quant made from BF16
Size : 2.15 GiB (2.00 BPW)
Arc-C 299 42.80936455
Arc-E 570 68.24561404
PPL 512 wikitext : 15.1105 +/- 0.11363
IQ1_S
MASTER : Gemma 2 9b It IQ1_S, quant made from BF16
Size : 2.21 GiB (2.05 BPW)
Arc-C 299 42.47491639
Arc-E 570 66.84210526
PPL 512 wikitext : 15.9317 +/- 0.11979
PR Current : Gemma 2 9b It IQ1_S quant made from BF16
Size : 2.23 GiB (2.07 BPW)
Arc-C 299 43.14381271
Arc-E 570 68.42105263
PPL 512 wikitext : 14.1578 +/- 0.10530
IQ1_M
MASTER : Gemma 2 9b It IQ1_M, quant made from BF16
Size : 2.37 GiB (2.20 BPW)
Arc-C 299 45.81939799
Arc-E 570 73.85964912
PPL 512 wikitext : 13.7215 +/- 0.10231
PR Current : Gemma 2 9b It IQ1_M quant made from BF16
Size : 2.36 GiB (2.19 BPW)
Arc-C 299 45.81939799
Arc-E 570 74.56140351
PPL 512 wikitext : 12.6773 +/- 0.09336
IQ1_XL
PR Current : Gemma 2 9b It IQ1_XL quant made from BF16
Size : 2.48 GiB (2.30 BPW)
Arc-C 299 47.49163880
Arc-E 570 73.33333333
PPL 512 wikitext : 11.5001 +/- 0.08487
IQ2_XXS
MASTER : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.63 GiB (2.44 BPW)
Arc-C 299 48.16053512
Arc-E 570 73.15789474
PPL 512 wikitext : 11.2527 +/- 0.08307
PR Current : Gemma 2 9b It IQ2_XXS, quant made from BF16
Size : 2.73 GiB (2.54 BPW)
Arc-C 299 48.82943144
Arc-E 570 74.56140351
PPL 512 wikitext : 10.8439 +/- 0.08026
IQ2_XS
MASTER : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.85 GiB (2.65 BPW)
Arc-C 299 49.49832776
Arc-E 570 78.24561404
PPL 512 wikitext : 10.5698 +/- 0.07803
PR Current : Gemma 2 9b It IQ2_XS, quant made from BF16
Size : 2.91 GiB (2.70 BPW)
Arc-C 299 49.16387960
Arc-E 570 78.59649123
PPL 512 wikitext : 10.3607 +/- 0.07660
IQ2_S
MASTER : Gemma 2 9b It IQ2_S (with iMatrix, attn_output and attn.v in IQ3_S), quant made from BF16
Size : 2.99 GiB (2.77 BPW)
Arc-C 299 52.84280936
Arc-E 570 77.54385965
PPL 512 wikitext : 10.3868 +/- 0.07787
PR init : Gemma 2 9b It IQ2_S (with Imatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299 49.83277592
Arc-E 570 77.71929825
PPL 512 wikitext : 10.1303 +/- 0.07486
PR Current : Gemma 2 9b It IQ2_S, quant made from BF16
Size : 3.00 GiB (2.79 BPW)
Arc-C 299 52.17391304
Arc-E 570 77.89473684
PPL 512 wikitext : 10.1071 +/- 0.07450
IQ2_M
MASTER : Gemma 2 9b It IQ2_M (with iMatrix, attn_output and attn.v in IQ3_S), quant made from BF16
Size : 3.19 GiB (2.97 BPW)
Arc-C 299 56.52173913
Arc-E 570 77.01754386
PPL 512 wikitext : 9.8154 +/- 0.07324
PR init : Gemma 2 9b It IQ2_M (with Imatrix, attn_output in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.20 GiB (2.98 BPW)
Arc-C 299 54.18060201
Arc-E 570 78.07017544
PPL 512 wikitext : 9.5734 +/- 0.07040
PR CURRENT : Gemma 2 9b It IQ2_M, quant made from BF16
Size : 3.29 GiB (3.06 BPW)
Arc-C 299 55.85284281
Arc-E 570 78.07017544
PPL 512 wikitext : 9.4128 +/- 0.06881
IQ2_XL
PR CURRENT : Gemma 2 9b It IQ2_XL, quant made from BF16
Size : 3.41 GiB (3.17 BPW)
Arc-C 299 56.18729097
Arc-E 570 78.07017544
PPL 512 wikitext : 9.3283 +/- 0.06820
Q2_K_L
PR CURRENT : Gemma 2 9b It Q2_K_L, quant made from BF16
Size : 3.70 GiB (3.44 BPW)
Arc-C 299 58.19397993
Arc-E 570 79.29824561
PPL 512 wikitext : around 9.25
IQ3_XXS
MASTER : Gemma 2 9b It IQ3_XXS (with iMatrix, attn_k in IQ2_S, and attn_v in IQ3_XXS), quant made from BF16
Size : 3.53 GiB (3.28 BPW)
Arc-C 299 56.52173913
Arc-E 570 79.12280702
PPL 512 wikitext : 9.4116 +/- 0.06982
PR CURRENT : Gemma 2 9b It IQ3_XXS (with Imatrix, attn_k in IQ3_XXS, and attn_v in Q4_K), quant made from BF16
Size : 3.60 GiB (3.35 BPW)
Arc-C 299 56.18729097
Arc-E 570 78.77192982
PPL 512 wikitext : 9.2026 +/- 0.06781
IQ3_XS
MASTER : Gemma 2 9b It IQ3_XS (with iMatrix), quant made from BF16
Size : 3.85 GiB (3.58 BPW)
Arc-C 299 58.86287625
Arc-E 570 78.94736842
PPL 512 wikitext : 9.2584 +/- 0.06866
PR CURRENT : Gemma 2 9b It IQ3_XS (with Imatrix), quant made from BF16
Size : 3.82 GiB (3.55 BPW)
Arc-C 299 57.19063545
Arc-E 570 78.07017544
PPL 512 wikitext : 9.0658 +/- 0.06633
IQ3_S
MASTER : Gemma 2 9b It IQ3_S (with iMatrix, attn_v in IQ3_S), quant made from BF16
Size : 4.03 GiB (3.75 BPW)
Arc-C 299 57.52508361
Arc-E 570 77.71929825
PPL 512 wikitext : 9.2100 +/- 0.06859
PR : Gemma 2 9b It IQ3_S (with Imatrix, attn_v in Q4_K), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299 57.19063545
Arc-E 570 78.07017544
PPL 512 wikitext : 9.0082 +/- 0.06633
PR rev 2: Gemma 2 9b It IQ3_S (with Imatrix), quant made from BF16
Size : 4.07 GiB (3.79 BPW)
Arc-C 299 56.85618729
Arc-E 570 78.42105263
PPL 512 wikitext : 9.0082 +/- 0.06633
(I think ARC differences are due to the b3565 merge)
PR rev3 - CURRENT: Gemma 2 9b It IQ3_S (with Imatrix), quant made from BF16
Size : 4.05 GiB (3.76 BPW)
Arc-C 299 57.52508361
Arc-E 570 78.42105263
PPL 512 wikitext : 8.9969 +/- 0.06610
IQ3_M
MASTER : Gemma 2 9b It IQ3_M (with iMatrix, attn_output in Q4_K), quant made from BF16
Size : 4.18 GiB (3.89 BPW)
Arc-C 299 56.85618729
Arc-E 570 77.71929825
PPL 512 wikitext : 8.9697 +/- 0.06598
PR : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS), quant made from BF16
Size : 4.16 GiB (3.87 BPW)
Arc-C 299 57.19063545
Arc-E 570 77.71929825
PPL 512 wikitext : 8.9556 +/- 0.06586
PR rev2 : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS, attn.v Q5_K), quant made from BF16
Size : 4.20 GiB (3.90 BPW)
Arc-C 299 58.52842809
Arc-E 570 77.54385965
PPL 512 wikitext : 8.9445 +/- 0.06576
PR rev3 - CURRENT : Gemma 2 9b It IQ3_M (with Imatrix, attn_output in IQ4_XS, attn.v Q5_K, attn.k IQ4_XS), quant made from BF16
Size : 4.23 GiB (3.93 BPW)
Arc-C 299 58.19397993
Arc-E 570 77.19298246
PPL 512 wikitext : 8.9082 +/- 0.06536
IQ3_XL
PR CURRENT : Gemma 2 9b It IQ3_XL (with Imatrix), quant made from BF16
Size : 4.50 GiB (4.18 BPW)
Arc-C 299 56.85618729
Arc-E 570 78.42105263
PPL 512 wikitext : 8.8843 +/- 0.06558
IQ4_XS
MASTER : Gemma 2 9b It IQ4_XS (with iMatrix), quant made from BF16
Size : 4.87 GiB (4.52 BPW)
Arc-C 299 57.52508361
Arc-E 570 78.24561404
PPL 512 wikitext : 8.8456 +/- 0.06533
FP16
MASTER : Gemma 2 9b It F16.
Size : 17.22 GiB (16.00 BPW)
Arc-C 299 59.53177258
Arc-E 570 78.77192982
PPL 512 wikitext : 8.7881 +/- 0.06533
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
It would be useful to have some objective data (such as perplexity tests) to evaluate the effect of these changes.
@Slaren:
Relevant examples are in the head post.
Considering the results obtained, I think it's worth it: the size remains around 260 MB below IQ3_XS. The overall high BPW of Gemma is explained mostly by the monolithic embd/output tensor being in Q5_K, as output.weight usually is on FTYPE IQ3_XXS.
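As a rough back-of-the-envelope check of that BPW point (my own numbers, not from the PR: I'm assuming Gemma 2 9B has a ~256k vocabulary and a 3584 hidden size, Q5_K at 5.5 bpw, and the ~9.2B total weight count implied by the F16 line above; the exact GGUF metadata should be checked):

$$
256000 \times 3584 \approx 0.92\text{B weights}, \qquad 0.92\text{B} \times 5.5\ \text{bpw} \approx 5.0\ \text{Gbit} \approx 0.59\ \text{GiB}
$$

$$
5.0\ \text{Gbit} \,/\, 9.2\text{B total weights} \approx 0.55\ \text{bpw of the overall figure}
$$

So the tied embd/output tensor alone would account for roughly half a bit per weight of the overall BPW, which is consistent with Gemma GGUFs looking heavier than other models at the same FTYPE.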
Non-GQA / non-MoE models are not affected.
As for FTYPE IQ3_M having attn_output.weight and attn_qkv.weight in Q4_K: Ikawrakow's setting was made prior to the IQ4_XS quants and never edited since, and FTYPE IQ4_XS has attn.output.weight in IQ4_XS without any problem reported about it. A PPL test is imo not even necessary there.
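A minimal sketch of that specific change (hypothetical names again, not the actual selection code in llama.cpp): for FTYPE IQ3_M, attn_output.weight and attn_qkv.weight move from Q4_K (4.5 bpw) to IQ4_XS (4.25 bpw), aligning with what FTYPE IQ4_XS already uses for those tensors.

```cpp
// Illustrative sketch only: hypothetical names, not the actual llama.cpp code.
enum class QType { Q4_K, IQ4_XS };
enum class FType { IQ3_M, IQ4_XS, OTHER };

// attn_output.weight / attn_qkv.weight: FTYPE IQ3_M aligns on IQ4_XS,
// which FTYPE IQ4_XS already uses for the same tensors without reported issues.
QType pick_attn_output_or_qkv(FType ftype) {
    switch (ftype) {
        case FType::IQ3_M:  return QType::IQ4_XS;  // was Q4_K before this change
        case FType::IQ4_XS: return QType::IQ4_XS;  // unchanged
        default:            return QType::Q4_K;    // placeholder for other ftypes
    }
}
```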
For FTYPE IQ2_S/IQ2_M, jumping from attn_output in IQ2_XS (FTYPE IQ2_XS) straight to attn_output in IQ3_S (FTYPE IQ2_S, which is otherwise mainly made of IQ2_XS tensors) is overkill; Ikawrakow likely didn't pay attention to it and just threw in a value, so both IQ2_S and IQ3_XXS were simply skipped over for that tensor.
Edit: I will actually do the tests and share the quants on Huggingface.
Edit 2: tests made on Gemma 2 9b It. I think they're conclusive.
Hi @Nexesenex, very interesting work on quantization.
Perhaps try the 1.58-bit dynamic quants? It might be nice to have them in llama.cpp. Have you investigated the dynamic quants, like these?
https://unsloth.ai/blog/dynamic-4bit
and
https://unsloth.ai/blog/deepseekr1-dynamic
Thanks! Have a nice day.