
Move a repeated calc to const

Open sroussey opened this issue 2 years ago • 13 comments

Not sure if the Metal compiler would do this automatically.

Is there a way to test speed differences in metal code?

sroussey avatar Jun 10 '23 20:06 sroussey
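For context, the pattern under discussion hoists a repeated index calculation out of an inner loop into a `const` local. A minimal plain-C++ sketch of the idea, using a hypothetical row dot product (not the actual llama.cpp Metal kernel code):

```cpp
#include <cstddef>
#include <vector>

// Before: the offset expression row * cols is recomputed on every iteration
// (unless the compiler hoists it automatically).
float dot_row_naive(const std::vector<float>& m, const std::vector<float>& v,
                    std::size_t row, std::size_t cols) {
    float sum = 0.0f;
    for (std::size_t j = 0; j < cols; ++j)
        sum += m[row * cols + j] * v[j];
    return sum;
}

// After: the repeated calculation is moved to a const before the loop,
// mirroring the kind of change proposed in this PR.
float dot_row_hoisted(const std::vector<float>& m, const std::vector<float>& v,
                      std::size_t row, std::size_t cols) {
    const std::size_t base = row * cols;  // hoisted once
    float sum = 0.0f;
    for (std::size_t j = 0; j < cols; ++j)
        sum += m[base + j] * v[j];
    return sum;
}
```

Both versions compute the same result; the question raised in the thread is whether the Metal compiler already performs this hoisting on its own.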

Haha, no. This is not the right intermediate type, the results are... not correct.

sroussey avatar Jun 10 '23 20:06 sroussey

I was testing with random seeds, and one of them produced crazy answers. Using the same seed, with and without this produces the same content. Ignore previous comment.

sroussey avatar Jun 10 '23 20:06 sroussey

FYI: I unrolled the inner loop and tried to vectorize but I am not sure that helps much more than this.

https://github.com/sroussey/llama.cpp/pull/1/files

sroussey avatar Jun 11 '23 01:06 sroussey
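The linked branch unrolls the inner loop and accumulates four values at a time, in the spirit of Metal's `float4`. A hypothetical plain-C++ sketch of that transformation (a manual 4-wide accumulator stands in for Metal's built-in vector type; this is not the actual kernel code):

```cpp
#include <cstddef>

// Hypothetical 4-way unrolled dot product: four independent accumulators
// give the compiler an easy vectorization target, similar to accumulating
// into a float4 in a Metal kernel. Assumes n is a multiple of 4 for brevity.
float dot_unrolled4(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Horizontal reduction of the four lanes at the end.
    return (s0 + s1) + (s2 + s3);
}
```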

> Is there a way to test speed differences in metal code?

You just run it a few consecutive times with and without the change and note down the eval time per token.

I don't think this change will make a big difference, but if you observe an improvement, we can add it. We would probably need to apply it to all the other kernels as well, for consistency in the implementation.

ggerganov avatar Jun 11 '23 09:06 ggerganov

I suggest comparing or checking the assembly code generated for the Metal shader (I don't know how to do that for Metal, but it is easy with a C++ compiler). Common subexpression elimination is a standard compiler technique these days.

howard0su avatar Jun 12 '23 00:06 howard0su

I ran it several times both before and after and found a 5% improvement. I have a non-GPU process that has been running in the background for several days. I'm waiting for that to finish today, and I'll rerun to be sure.

The common expression then gets dereferenced, so I don't think it's being pulled out automatically (it's a Metal thing), though it certainly could be.

This code block is only in the 4_k kernel.

I'll retest my conversion of the code to unroll the inner loop and use float4. It didn't seem to be any better than this one, however, at least in early tests.

sroussey avatar Jun 12 '23 17:06 sroussey

5% sounds great. Will wait for your confirmation and we can merge

ggerganov avatar Jun 12 '23 17:06 ggerganov

Oddly, it seems to be 5% faster when cold, but only about 1-2% faster when the GPU is hot.

sroussey avatar Jun 14 '23 18:06 sroussey

I noticed that the same pattern is copied in a few other places, so I will make the changes and report back.

sroussey avatar Jun 14 '23 18:06 sroussey

I see no difference on my machine. If anything, the original tends to be slightly faster, with run times for e.g. Q4_K_S fluctuating between 23.9 and 24.0 ms/token, while the version in this PR varies between 23.95 and 24.05 ms/token. This is on latest master with ggml-metal.metal copied over.

ikawrakow avatar Jun 16 '23 16:06 ikawrakow

OK, then we should close this.

How do you get such precise measurements? I have an M2 Max 30c GPU and don't ever see 23.x. I wait until the SoC temp is 39C and then test with -n 1000 (not sure what you use, but my guess is that it is shorter). Even so, I get a wide range.

sroussey avatar Jun 16 '23 17:06 sroussey

For reference:

```shell
repeat 5 ./main -m "./models/7B/ggml-model-q4k.bin" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes I love: " -ngl 1 -n 1000 -s 1
```

sroussey avatar Jun 16 '23 17:06 sroussey

> OK, then we should close this.
>
> How do you get such precise measurements? I have an M2 Max 30c GPU and don't ever see 23.x. I wait until the SoC temp is 39C and then test with -n 1000 (not sure what you use, but my guess is that it is shorter). Even so, I get a wide range.

I have the Mac on wall power sitting unused and completely idle. I SSH into it from another machine and run the calculation with a 5 second break between each run. As I'm not interested in measuring how the OS deals with stress, thermal management, etc., I run short calculations (128 tokens). This also reduces the impact of matrix multiplications, which get more important with increasing context size. When I haven't used a model for a while, I always ignore the first run. All this combined gives pretty reproducible results.

ikawrakow avatar Jun 16 '23 18:06 ikawrakow

Closing per OP comment https://github.com/ggerganov/llama.cpp/pull/1794#issuecomment-1595006828

evanmiller avatar Jul 07 '23 12:07 evanmiller