
Metal GPU not utilising efficiently on M2 pro

Open akashicMarga opened this issue 1 year ago • 8 comments

[Screenshot: GPU utilisation, 2024-05-03 at 4:49:55 PM]

The llama-3 example is running slow and not utilising the Metal GPU: utilisation is mostly 0%, with occasional spikes to 20 or 35%.

akashicMarga avatar May 03 '24 11:05 akashicMarga

Hi, can you pull the latest main branch and let me know if it's still happening? It seems like it isn't compiling with Metal for you.

jafioti avatar May 03 '24 23:05 jafioti

Same, it's not compiling for Metal.

https://github.com/jafioti/luminal/assets/18519731/676d2f7d-3eeb-4605-964c-6f2c597b2e1e

akashicMarga avatar May 04 '24 07:05 akashicMarga

Would you be able to set the number of tokens generated to 1 and call execute_debug in the decoding loop? My guess is there is still some op taking 90% of the time. The debug printout will tell you the shape and the time each op took.
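The idea behind that kind of debug pass can be sketched in plain Rust: wrap each op in a wall-clock timer and print a per-op report, which is roughly the information an execute_debug trace would surface. This is a standalone illustration, not luminal's actual API; the `time_ops` helper and the stand-in ops are hypothetical.

```rust
use std::time::Instant;

// Hypothetical per-op profiler: run each named "op" once and record how long
// it took, in microseconds. A real execute_debug trace would also report
// tensor shapes, but the timing idea is the same.
fn time_ops(ops: &[(&str, fn(usize) -> usize)], input: usize) -> Vec<(String, u128)> {
    ops.iter()
        .map(|(name, f)| {
            let start = Instant::now();
            let _ = f(input); // execute the op on the sample input
            (name.to_string(), start.elapsed().as_micros())
        })
        .collect()
}

fn main() {
    // Two stand-in "ops"; in a real trace these would be graph kernels.
    let ops: [(&str, fn(usize) -> usize); 2] = [
        ("matmul", |n| (0..n).sum()),
        ("softmax", |n| n * 2),
    ];
    // Generating a single token and printing this table makes the dominant
    // op obvious at a glance.
    for (name, us) in time_ops(&ops, 1_000) {
        println!("{name}: {us} us");
    }
}
```

With one token generated, a single slow op dominating the table usually points at the kernel that failed to compile for the GPU and fell back to a slow path.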

jafioti avatar May 05 '24 22:05 jafioti

On discord you mentioned this is for the M3. Is it the M3 or M2 Pro?

jafioti avatar May 08 '24 13:05 jafioti

I mentioned it there. It's a MacBook Pro with just the M2.

akashicMarga avatar May 08 '24 14:05 akashicMarga

@akashicMarga What tool do you use to get the GPU diagnostic and memory usage on the right in your screenshots?

jafioti avatar Jun 07 '24 03:06 jafioti

https://github.com/tlkh/asitop

akashicMarga avatar Jun 07 '24 04:06 akashicMarga

@akashicMarga I got my hands on a 16 GB machine and tested it out. It's weird, but it turns out the memory usage isn't being properly reported. Phi worked, but llama did not, and memory usage was already >9 GB before running luminal. So I think the issue is still that memory runs out and the model gets kicked to swap, but that isn't correctly reported.
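A quick back-of-envelope check makes the swap theory plausible: Q8 quantization stores roughly one byte per weight, so an 8B-parameter llama needs about 8 GB for weights alone, before the KV cache and activations. This is a rough estimate under that assumption, not a measurement; the parameter count is the commonly cited 8B figure.

```rust
// Rough memory estimate for a Q8-quantized model: ~1 byte per parameter.
fn q8_weight_bytes(params: u64) -> u64 {
    params
}

fn main() {
    let params: u64 = 8_000_000_000; // assumed 8B-parameter llama
    let gb = q8_weight_bytes(params) as f64 / 1e9;
    println!("~{gb:.1} GB of weights");
    // On a 16 GB machine with >9 GB already in use before launch,
    // the weights alone cannot fit in the remaining memory, so the
    // model spills to swap and GPU utilisation collapses.
}
```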

Did you say you got candle or llama.cpp running with Q8 llama on your machine?

jafioti avatar Jun 08 '24 17:06 jafioti