
Metal GPU not utilising efficiently on M2 pro

Open akashicMarga opened this issue 1 year ago • 8 comments

[Screenshot: GPU utilisation, 2024-05-03 at 4:49:55 PM]

The llama-3 example is running slow and not utilising the Metal GPU: utilisation is mostly 0%, with occasional spikes to 20 or 35%.

akashicMarga avatar May 03 '24 11:05 akashicMarga

Hi, can you pull the latest main branch and let me know if it's still happening? It seems like it isn't compiling with Metal for you.

jafioti avatar May 03 '24 23:05 jafioti

Same, it's not compiling for Metal.

https://github.com/jafioti/luminal/assets/18519731/676d2f7d-3eeb-4605-964c-6f2c597b2e1e

akashicMarga avatar May 04 '24 07:05 akashicMarga

Would you be able to set the number of tokens generated to 1 and call execute_debug in the decoding loop? My guess is there is still some op taking 90% of the time. The debug printout will tell you the shape and the time each op took.
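The idea behind that kind of debug pass can be sketched in plain Rust: wrap each op in a wall-clock timer and print a per-op report, which is roughly the information an execute_debug trace would surface. This is a standalone illustration, not luminal's actual API; the `time_ops` helper and the stand-in ops are hypothetical.

```rust
use std::time::Instant;

// Hypothetical per-op profiler: run each named "op" once and record how long
// it took, in microseconds. A real execute_debug trace would also report
// tensor shapes, but the timing idea is the same.
fn time_ops(ops: &[(&str, fn(usize) -> usize)], input: usize) -> Vec<(String, u128)> {
    ops.iter()
        .map(|(name, f)| {
            let start = Instant::now();
            let _ = f(input); // execute the op on the sample input
            (name.to_string(), start.elapsed().as_micros())
        })
        .collect()
}

fn main() {
    // Two stand-in "ops"; in a real trace these would be graph kernels.
    let ops: [(&str, fn(usize) -> usize); 2] = [
        ("matmul", |n| (0..n).sum()),
        ("softmax", |n| n * 2),
    ];
    // Generating a single token and printing this table makes the dominant
    // op obvious at a glance.
    for (name, us) in time_ops(&ops, 1_000) {
        println!("{name}: {us} us");
    }
}
```

With one token generated, a single slow op dominating the table usually points at the kernel that failed to compile for the GPU and fell back to a slow path.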

jafioti avatar May 05 '24 22:05 jafioti

On discord you mentioned this is for the M3. Is it the M3 or M2 Pro?

jafioti avatar May 08 '24 13:05 jafioti

I mentioned it there. It's a MacBook Pro with just the M2.

akashicMarga avatar May 08 '24 14:05 akashicMarga

@akashicMarga What tool do you use to get the GPU diagnostic and memory usage on the right in your screenshots?

jafioti avatar Jun 07 '24 03:06 jafioti

https://github.com/tlkh/asitop

akashicMarga avatar Jun 07 '24 04:06 akashicMarga

@akashicMarga I got my hands on a 16 GB machine and tested it out. It's weird, but it turns out the memory usage isn't being properly reported. Phi worked, but llama did not, and memory usage was already >9 GB before running luminal. So I think the issue is still that memory runs out and the model gets kicked to swap, but that isn't correctly reported.
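A quick back-of-envelope check makes the swap theory plausible: Q8 quantization stores roughly one byte per weight, so an 8B-parameter llama needs about 8 GB for weights alone, before the KV cache and activations. This is a rough estimate under that assumption, not a measurement; the parameter count is the commonly cited 8B figure.

```rust
// Rough memory estimate for a Q8-quantized model: ~1 byte per parameter.
fn q8_weight_bytes(params: u64) -> u64 {
    params
}

fn main() {
    let params: u64 = 8_000_000_000; // assumed 8B-parameter llama
    let gb = q8_weight_bytes(params) as f64 / 1e9;
    println!("~{gb:.1} GB of weights");
    // On a 16 GB machine with >9 GB already in use before launch,
    // the weights alone cannot fit in the remaining memory, so the
    // model spills to swap and GPU utilisation collapses.
}
```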

Did you say you got candle or llama.cpp running with Q8 llama on your machine?

jafioti avatar Jun 08 '24 17:06 jafioti