Move a repeated calculation to a const
Not sure if the Metal compiler would do this automatically.
Is there a way to test speed differences in Metal code?
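To illustrate the pattern, here is a minimal sketch using a toy dot-product kernel; the names and signature are invented for illustration and this is not the actual ggml-metal.metal code:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical toy kernel, not the real code: each thread computes the dot
// product of one row of y with w.
kernel void dot_rows(device const float * y   [[buffer(0)]],
                     device const float * w   [[buffer(1)]],
                     device       float * dst [[buffer(2)]],
                     constant     int   & n   [[buffer(3)]],
                     constant     int   & ld  [[buffer(4)]],
                     uint row [[thread_position_in_grid]]) {
    // Hoist the loop-invariant address calculation into a const pointer
    // once, instead of recomputing y[row*ld + i] on every iteration.
    device const float * const yrow = y + (int)row * ld;

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += yrow[i] * w[i];
    }
    dst[row] = sum;
}
```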
Haha, no. This is not the right intermediate type; the results are... not correct.
I was testing with random seeds, and one of them produced crazy answers. Using the same seed, the output with and without this change is identical. Ignore my previous comment.
FYI: I unrolled the inner loop and tried to vectorize it, but I'm not sure that helps much more than this change does.
https://github.com/sroussey/llama.cpp/pull/1/files
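Roughly the shape of what I tried, as a hedged sketch (again an invented toy kernel, not the code in the linked branch):

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch of unrolling via float4 loads (illustrative only).
// Assumes the row length is a multiple of 4, so each row holds n4 float4s.
kernel void dot_rows_vec4(device const float4 * y   [[buffer(0)]],
                          device const float4 * w   [[buffer(1)]],
                          device       float  * dst [[buffer(2)]],
                          constant     int    & n4  [[buffer(3)]],
                          uint row [[thread_position_in_grid]]) {
    device const float4 * const yrow = y + (int)row * n4;

    // Accumulate four lanes at a time, then reduce at the end.
    float4 acc = float4(0.0f);
    for (int i = 0; i < n4; ++i) {
        acc += yrow[i] * w[i];
    }
    dst[row] = acc.x + acc.y + acc.z + acc.w;
}
```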
> Is there a way to test speed differences in Metal code?
You just run it a few consecutive times with and without the change and note down the eval time per token.
I don't think this change will make a big difference, but if you observe an improvement, we can add it. We would probably need to apply it to all the other kernels as well to keep the implementation consistent.
I'd suggest comparing the assembly code generated by the Metal compiler (I don't know how to do that for Metal, but it is easy with a C++ compiler). Common subexpression elimination is a standard technique in modern compilers.
I ran it several times, both before and after, and found a 5% improvement. I have a non-GPU process that has been running in the background for several days; it should finish today, and then I'll rerun to be sure.
This common expression then gets dereferenced, so I don't think it is being pulled out automatically (it's a Metal thing), though it certainly could be.
This code block is only in the Q4_K kernel.
I'll retest my conversion of the code that unrolls the inner loop and uses float4. It didn't seem to be any better than this version, however, at least in early tests.
5% sounds great. I'll wait for your confirmation, and then we can merge.
Oddly, it seems to be 5% faster when the GPU is cold, but only about 1-2% faster when it is hot.
I noticed that the same pattern is copied in a few other places, so I'll make those changes and report back.
I see no difference on my machine. If anything, the original tends to be slightly faster, with run times for e.g. Q4_K_S fluctuating between 23.9 and 24.0 ms/token, while the version in this PR varies between 23.95 and 24.05 ms/token. This is on the latest master with ggml-metal.metal copied over.
OK, then we should close this.
How do you get such precise measurements? I have an M2 Max with a 30-core GPU and don't ever see 23.x ms/token. I wait until the SoC temperature is 39 °C and then test with -n 1000 (not sure what you use, but my guess is that it is shorter). Even so, I get a wide range.
For reference:
```sh
repeat 5 ./main -m "./models/7B/ggml-model-q4k.bin" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes I love: " -ngl 1 -n 1000 -s 1
```
> OK, then we should close this.
>
> How do you get such precise measurements? I have an M2 Max with a 30-core GPU and don't ever see 23.x ms/token. I wait until the SoC temperature is 39 °C and then test with -n 1000 (not sure what you use, but my guess is that it is shorter). Even so, I get a wide range.
I have the Mac on wall power sitting unused and completely idle. I SSH into it from another machine and run the calculation with a 5 second break between each run. As I'm not interested in measuring how the OS deals with stress, thermal management, etc., I run short calculations (128 tokens). This also reduces the impact of matrix multiplications, which get more important with increasing context size. When I haven't used a model for a while, I always ignore the first run. All this combined gives pretty reproducible results.
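In shell terms, the procedure amounts to something like this sketch (the flags and prompt are reused from the command above; the 128-token length and 5-second pause are the points that matter, and the first run's timings get discarded):

```sh
# Sketch of the benchmarking loop: short runs, a pause between them,
# and the first (cold) run ignored when reading the eval time per token.
for i in 1 2 3 4 5; do
  ./main -m "./models/7B/ggml-model-q4k.bin" \
         -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes I love: " \
         -ngl 1 -n 128 -s 1
  sleep 5
done
```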
Closing per OP comment https://github.com/ggerganov/llama.cpp/pull/1794#issuecomment-1595006828