
Low GPU usage of quantized Mixtral 8x22B for prompt processing on Metal

Open · beebopkim opened this issue 1 year ago · 1 comment

My computer is an M1 Max Mac Studio with a 32-core GPU and 64 GB of RAM, running macOS Sonoma 14.4.1.

I ran llama-bench from commit 4cc120c7443cf9dab898736f3c3b45dc8f14672b, and it shows low GPU usage during prompt processing. Naturally, inference with main and server shows the same low GPU usage.

[Screenshot 2024-04-13 at 12 40 32 AM: llama-bench results showing low GPU usage]

In the above image, I ran benchmarks for IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, and Q2_K_S, but IQ1_S and IQ1_M from https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-v0.1-GGUF show the same low GPU usage.
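For reference, a run like the one described above can be reproduced with a small loop over the quant types. This is a sketch only: the model filenames and the `-ngl 99` (offload all layers to the GPU) flag are assumptions about how the benchmark was invoked, not taken from the report; `echo` is used here so the loop merely prints each command.

```shell
# Sketch: print a llama-bench invocation for each quant type tested above.
# Model paths and flags are assumptions; remove `echo` to actually run them.
for q in IQ2_XXS IQ2_XS IQ2_S IQ2_M Q2_K_S; do
  echo ./llama-bench -m "models/Mixtral-8x22B-v0.1.${q}.gguf" -ngl 99
done
```

The `-m` and `-ngl` flags are standard llama-bench options; on Metal, `-ngl 99` offloads every layer to the GPU so that any remaining low utilization points at the Metal kernels rather than CPU fallback.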

beebopkim avatar Apr 12 '24 15:04 beebopkim

#6740

stefanvarunix avatar Apr 19 '24 09:04 stefanvarunix

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 04 '24 01:06 github-actions[bot]