Move a repeated calculation to a const
Not sure if the Metal compiler would do this automatically.
Is there a way to test speed differences in Metal code?
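To illustrate the pattern, here is a minimal sketch using a toy dot-product kernel; the names and signature are invented for illustration and this is not the actual ggml-metal.metal code:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical toy kernel, not the real code: each thread computes the dot
// product of one row of y with w.
kernel void dot_rows(device const float * y   [[buffer(0)]],
                     device const float * w   [[buffer(1)]],
                     device       float * dst [[buffer(2)]],
                     constant     int   & n   [[buffer(3)]],
                     constant     int   & ld  [[buffer(4)]],
                     uint row [[thread_position_in_grid]]) {
    // Hoist the loop-invariant address calculation into a const pointer
    // once, instead of recomputing y[row*ld + i] on every iteration.
    device const float * const yrow = y + (int)row * ld;

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += yrow[i] * w[i];
    }
    dst[row] = sum;
}
```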
Haha, no. This is not the right intermediate type; the results are... not correct.
I was testing with random seeds, and one of them produced crazy answers. Using the same seed, the output with and without this change is identical. Ignore my previous comment.
FYI: I unrolled the inner loop and tried to vectorize it, but I'm not sure that helps much more than this change does.
https://github.com/sroussey/llama.cpp/pull/1/files
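Roughly the shape of what I tried, as a hedged sketch (again an invented toy kernel, not the code in the linked branch):

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch of unrolling via float4 loads (illustrative only).
// Assumes the row length is a multiple of 4, so each row holds n4 float4s.
kernel void dot_rows_vec4(device const float4 * y   [[buffer(0)]],
                          device const float4 * w   [[buffer(1)]],
                          device       float  * dst [[buffer(2)]],
                          constant     int    & n4  [[buffer(3)]],
                          uint row [[thread_position_in_grid]]) {
    device const float4 * const yrow = y + (int)row * n4;

    // Accumulate four lanes at a time, then reduce at the end.
    float4 acc = float4(0.0f);
    for (int i = 0; i < n4; ++i) {
        acc += yrow[i] * w[i];
    }
    dst[row] = acc.x + acc.y + acc.z + acc.w;
}
```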
> Is there a way to test speed differences in Metal code?
You just run it a few consecutive times with and without the change and note down the eval time per token.
I don't think this change will make a big difference, but if you observe an improvement, we can add it. We would probably need to apply it to all the other kernels as well to keep the implementation consistent.
I'd suggest comparing the assembly code generated by the Metal compiler (I don't know how to do that for Metal, but it is easy with a C++ compiler). Common subexpression elimination is a standard technique in modern compilers.
I ran it several times, both before and after, and found a 5% improvement. I have a non-GPU process that has been running in the background for several days; it should finish today, and then I'll rerun to be sure.
This common expression then gets dereferenced, so I don't think it is being pulled out automatically (it's a Metal thing), though it certainly could be.
This code block is only in the Q4_K kernel.
I'll retest my conversion of the code that unrolls the inner loop and uses float4. It didn't seem to be any better than this version, however, at least in early tests.
5% sounds great. I'll wait for your confirmation, and then we can merge.
Oddly, it seems to be 5% faster when the GPU is cold, but only about 1-2% faster when it is hot.
I noticed that the same pattern is copied in a few other places, so I'll make those changes and report back.
I see no difference on my machine. If anything, the original tends to be slightly faster, with run times for e.g. Q4_K_S fluctuating between 23.9 and 24.0 ms/token, while the version in this PR varies between 23.95 and 24.05 ms/token. This is on the latest master with ggml-metal.metal copied over.
OK, then we should close this.
How do you get such precise measurements? I have an M2 Max with a 30-core GPU and don't ever see 23.x ms/token. I wait until the SoC temperature is 39 °C and then test with -n 1000 (not sure what you use, but my guess is that it is shorter). Even so, I get a wide range.
For reference:
```sh
repeat 5 ./main -m "./models/7B/ggml-model-q4k.bin" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes I love: " -ngl 1 -n 1000 -s 1
```
> OK, then we should close this.
>
> How do you get such precise measurements? I have an M2 Max with a 30-core GPU and don't ever see 23.x ms/token. I wait until the SoC temperature is 39 °C and then test with -n 1000 (not sure what you use, but my guess is that it is shorter). Even so, I get a wide range.
I have the Mac on wall power sitting unused and completely idle. I SSH into it from another machine and run the calculation with a 5 second break between each run. As I'm not interested in measuring how the OS deals with stress, thermal management, etc., I run short calculations (128 tokens). This also reduces the impact of matrix multiplications, which get more important with increasing context size. When I haven't used a model for a while, I always ignore the first run. All this combined gives pretty reproducible results.
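In shell terms, the procedure amounts to something like this sketch (the flags and prompt are reused from the command above; the 128-token length and 5-second pause are the points that matter, and the first run's timings get discarded):

```sh
# Sketch of the benchmarking loop: short runs, a pause between them,
# and the first (cold) run ignored when reading the eval time per token.
for i in 1 2 3 4 5; do
  ./main -m "./models/7B/ggml-model-q4k.bin" \
         -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes I love: " \
         -ngl 1 -n 128 -s 1
  sleep 5
done
```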
Closing per OP comment https://github.com/ggerganov/llama.cpp/pull/1794#issuecomment-1595006828