Vulkan: improve mul_mat_vec_iq1_m
```
./build/bin/Release/test-backend-ops.exe perf -o MUL_MAT -p type_a=iq1_m
```
Tested on the Radeon 780M iGPU of an AMD Ryzen 7 8845HS:
| n | PR: μs/run | PR: GFLOPS | Main: μs/run | Main: GFLOPS | Speedup vs Main |
|---|---|---|---|---|---|
| 1 | 224.28 | 523.63 | 282.44 | 415.80 | 1.26x |
| 2 | 310.53 | 756.38 | 385.04 | 610.01 | 1.24x |
| 3 | 408.65 | 862.15 | 515.79 | 683.08 | 1.26x |
| 4 | 589.40 | 797.02 | 1244.08 | 377.60 | 2.11x |
| 5 | 1075.96 | 545.75 | 4427.85 | 132.62 | 4.11x |
| 8 | 2576.61 | 364.64 | 4985.43 | 188.45 | 1.94x |
| 512 | 11601.05 | 5180.00 | 11948.15 | 5030.00 | 1.03x |
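For reference, the speedup column is just the ratio of per-run times; a minimal Python sketch that reproduces it from the values above:

```python
# Reproduce the "Speedup vs Main" column from the measured per-run times.
# Times (us/run) are copied from the table above (780M iGPU).
pr_us   = {1: 224.28, 2: 310.53, 3: 408.65, 4: 589.40, 5: 1075.96, 8: 2576.61, 512: 11601.05}
main_us = {1: 282.44, 2: 385.04, 3: 515.79, 4: 1244.08, 5: 4427.85, 8: 4985.43, 512: 11948.15}

for n, pr in pr_us.items():
    # Higher is better: speedup > 1 means the PR shader is faster than main.
    print(f"n={n:>3}: {main_us[n] / pr:.2f}x")
```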
I don't see much of a difference either way. Maybe a slight improvement on RDNA3 for n=1, maybe a slight regression on GCN, Nvidia, and Intel. Hard to tell; the differences are close to run-to-run variance (the n=1 before/after ratios are summarized in a short sketch after the logs below).
AMD RX 8060S
```
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```
before:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 13632 runs - 74.80 us/run - 117.44 MFLOP/run - 1.57 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 11928 runs - 85.84 us/run - 234.88 MFLOP/run - 2.74 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 8804 runs - 113.87 us/run - 352.32 MFLOP/run - 3.09 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 7029 runs - 144.38 us/run - 469.76 MFLOP/run - 3.25 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3420 runs - 300.60 us/run - 587.20 MFLOP/run - 1.95 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 856 runs - 1247.46 us/run - 939.52 MFLOP/run - 753.15 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 182 runs - 5539.74 us/run - 60.13 GFLOP/run - 10.85 TFLOPS
```
after:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 14484 runs - 71.63 us/run - 117.44 MFLOP/run - 1.64 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 11076 runs - 92.69 us/run - 234.88 MFLOP/run - 2.53 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 9088 runs - 113.53 us/run - 352.32 MFLOP/run - 3.10 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 7242 runs - 140.27 us/run - 469.76 MFLOP/run - 3.35 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 6156 runs - 165.65 us/run - 587.20 MFLOP/run - 3.54 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 749 runs - 1423.48 us/run - 939.52 MFLOP/run - 660.02 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 174 runs - 5764.40 us/run - 60.13 GFLOP/run - 10.43 TFLOPS
```
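As a sanity check on the units in these logs, the MFLOP/run figure follows from the usual 2·m·n·k FLOP count for a matrix multiplication, and TFLOPS is that work divided by the per-run time; a minimal sketch, assuming that convention:

```python
# Verify the reported MFLOP/run and TFLOPS for the n=1 "after" result above.
m, n, k = 4096, 1, 14336
flop = 2 * m * n * k                       # standard 2*m*n*k GEMM FLOP count

us_per_run = 71.63                         # n=1 "after" time on the RX 8060S
print(f"{flop / 1e6:.2f} MFLOP/run")       # 117.44, as reported
print(f"{flop / (us_per_run * 1e-6) / 1e12:.2f} TFLOPS")  # ~1.64, as reported
```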
AMD Radeon Pro VII
```
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
```
before:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 11076 runs - 97.17 us/run - 117.44 MFLOP/run - 1.21 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 7242 runs - 144.39 us/run - 234.88 MFLOP/run - 1.63 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 4260 runs - 249.32 us/run - 352.32 MFLOP/run - 1.41 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3195 runs - 313.11 us/run - 469.76 MFLOP/run - 1.50 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2736 runs - 368.32 us/run - 587.20 MFLOP/run - 1.59 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 428 runs - 3086.22 us/run - 939.52 MFLOP/run - 304.43 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 78 runs - 12824.58 us/run - 60.13 GFLOP/run - 4.69 TFLOPS
```
after:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 9372 runs - 110.16 us/run - 117.44 MFLOP/run - 1.07 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 6816 runs - 151.09 us/run - 234.88 MFLOP/run - 1.55 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 4260 runs - 236.26 us/run - 352.32 MFLOP/run - 1.49 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3621 runs - 281.72 us/run - 469.76 MFLOP/run - 1.67 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3078 runs - 325.78 us/run - 587.20 MFLOP/run - 1.80 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 321 runs - 3558.65 us/run - 939.52 MFLOP/run - 264.01 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 78 runs - 12860.79 us/run - 60.13 GFLOP/run - 4.68 TFLOPS
```
Intel A770
```
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
```
before:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 7668 runs - 139.32 us/run - 117.44 MFLOP/run - 842.96 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2556 runs - 405.05 us/run - 234.88 MFLOP/run - 579.89 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 1136 runs - 956.64 us/run - 352.32 MFLOP/run - 368.29 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 426 runs - 3181.79 us/run - 469.76 MFLOP/run - 147.64 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 342 runs - 5578.36 us/run - 587.20 MFLOP/run - 105.26 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 107 runs - 9632.67 us/run - 939.52 MFLOP/run - 97.54 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 66 runs - 15407.65 us/run - 60.13 GFLOP/run - 3.90 TFLOPS
```
after:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 7668 runs - 143.76 us/run - 117.44 MFLOP/run - 816.93 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2982 runs - 377.97 us/run - 234.88 MFLOP/run - 621.42 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 1420 runs - 747.32 us/run - 352.32 MFLOP/run - 471.45 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 639 runs - 1968.56 us/run - 469.76 MFLOP/run - 238.63 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 513 runs - 2413.24 us/run - 587.20 MFLOP/run - 243.33 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 107 runs - 9919.79 us/run - 939.52 MFLOP/run - 94.71 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 66 runs - 15310.42 us/run - 60.13 GFLOP/run - 3.93 TFLOPS
```
Nvidia RTX 3090
```
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
```
before:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 11928 runs - 83.89 us/run - 117.44 MFLOP/run - 1.40 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 9798 runs - 103.71 us/run - 234.88 MFLOP/run - 2.26 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 4828 runs - 208.74 us/run - 352.32 MFLOP/run - 1.69 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 5112 runs - 201.51 us/run - 469.76 MFLOP/run - 2.33 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2907 runs - 358.87 us/run - 587.20 MFLOP/run - 1.64 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2247 runs - 448.54 us/run - 939.52 MFLOP/run - 2.09 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 682 runs - 1467.35 us/run - 60.13 GFLOP/run - 40.98 TFLOPS
```
after:
```
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 11928 runs - 85.92 us/run - 117.44 MFLOP/run - 1.37 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 6816 runs - 148.84 us/run - 234.88 MFLOP/run - 1.58 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 5112 runs - 198.05 us/run - 352.32 MFLOP/run - 1.78 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3621 runs - 286.45 us/run - 469.76 MFLOP/run - 1.64 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3249 runs - 311.52 us/run - 587.20 MFLOP/run - 1.88 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 749 runs - 1439.02 us/run - 939.52 MFLOP/run - 652.89 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 706 runs - 1418.96 us/run - 60.13 GFLOP/run - 42.38 TFLOPS
```
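To put a number on the "hard to tell" comment above, here is a quick sketch of the n=1 before/after ratios, with times copied from the four logs:

```python
# n=1 per-run times (us) before and after the PR, copied from the logs above.
n1_us = {
    "AMD RX 8060S":       (74.80, 71.63),
    "AMD Radeon Pro VII": (97.17, 110.16),
    "Intel A770":         (139.32, 143.76),
    "Nvidia RTX 3090":    (83.89, 85.92),
}
for gpu, (before, after) in n1_us.items():
    # Ratio > 1 means the PR is faster at n=1 on that GPU.
    print(f"{gpu}: {before / after:.2f}x")
```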
The change seems to improve performance only on Windows; on Linux, I cannot see the improvement. For comparison, the same test under ROCm produced:
```
D:\llama_latest>build\bin\test-backend-ops.exe perf -o MUL_MAT -p iq1_m
HIP Library Path: C:\WINDOWS\SYSTEM32\amdhip64_7.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M Graphics, gfx1103 (0x1103), VMM: no, Wave Size: 32
Testing 2 devices

Backend 1/2: ROCm0
  Device description: AMD Radeon 780M Graphics
  Device memory: 59327 MB (59175 MB free)

MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 5112 runs - 210.56 us/run - 117.44 MFLOP/run - 557.76 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 4260 runs - 257.42 us/run - 234.88 MFLOP/run - 912.44 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 3124 runs - 323.46 us/run - 352.32 MFLOP/run - 1.09 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2556 runs - 414.22 us/run - 469.76 MFLOP/run - 1.13 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 2052 runs - 495.51 us/run - 587.20 MFLOP/run - 1.19 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 1284 runs - 808.00 us/run - 939.52 MFLOP/run - 1.16 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1): 88 runs - 11595.92 us/run - 60.13 GFLOP/run - 5.19 TFLOPS
  Backend ROCm0: OK

Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK
```
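For a rough Vulkan-vs-ROCm picture on the 780M, a sketch comparing the PR's Vulkan times from the table at the top with the ROCm times above (this assumes both runs came from the same 780M system, which the logs suggest but do not prove):

```python
# Per-run times (us) on the Radeon 780M: Vulkan (this PR) vs ROCm.
vulkan_us = {1: 224.28, 2: 310.53, 3: 408.65, 4: 589.40, 5: 1075.96, 8: 2576.61, 512: 11601.05}
rocm_us   = {1: 210.56, 2: 257.42, 3: 323.46, 4: 414.22, 5: 495.51, 8: 808.00, 512: 11595.92}

for n in vulkan_us:
    # Ratio > 1 means ROCm is faster for that batch size.
    print(f"n={n:>3}: Vulkan/ROCm = {vulkan_us[n] / rocm_us[n]:.2f}")
```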