Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit
Research Stage
- [ ] Background Research (Let's try to avoid reinventing the wheel)
- [ ] Hypothesis Formed (How do you think this will work and what will its effect be?)
- [ ] Strategy / Implementation Forming
- [x] Analysis of results
- [ ] Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
Command
./llama.cpp/build/bin/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>What is the capital of Italy?<|Assistant|>"
Model
DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S 1.58Bit, 131GB
Hardware
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 58W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:2A:00.0 Off | 0 |
| N/A 32C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Hypothesis
Reported performance is 140 tokens/second
Implementation
No response
Analysis
Llama.cpp Performance Analysis
Raw Benchmarks
llama_perf_sampler_print: sampling time = 2.45 ms / 35 runs ( 0.07 ms per token, 14297.39 tokens per second)
llama_perf_context_print: load time = 20988.11 ms
llama_perf_context_print: prompt eval time = 1233.88 ms / 10 tokens ( 123.39 ms per token, 8.10 tokens per second)
llama_perf_context_print: eval time = 2612.63 ms / 24 runs ( 108.86 ms per token, 9.19 tokens per second)
llama_perf_context_print: total time = 3869.00 ms / 34 tokens
Detailed Analysis
1. Token Sampling Performance
- Total Time: 2.45 ms for 35 runs
- Per Token: 0.07 ms
- Speed: 14,297.39 tokens per second
- Description: This represents the speed at which the model can select the next token after processing. This is extremely fast compared to the actual generation speed, as it only involves the final selection process.
2. Model Loading
- Total Time: 20,988.11 ms (≈21 seconds)
- Description: One-time initialization cost to load the model into memory. This happens only at startup and doesn't affect ongoing performance.
3. Prompt Evaluation
- Total Time: 1,233.88 ms for 10 tokens
- Per Token: 123.39 ms
- Speed: 8.10 tokens per second
- Description: Initial processing of the prompt is slightly slower than subsequent token generation, as it needs to establish the full context for the first time.
4. Generation Evaluation
- Total Time: 2,612.63 ms for 24 runs
- Per Token: 108.86 ms
- Speed: 9.19 tokens per second
- Description: This represents the actual speed of generating new tokens, including all neural network computations.
5. Total Processing Time
- Total Time: 3,869.00 ms
- Tokens Processed: 34 tokens
- Average Speed: ≈8.79 tokens per second
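As a quick sanity check (my own arithmetic, using only the totals reported above), the average follows directly from total tokens over total time:

```bash
# 34 tokens processed in 3.869 s, from the llama_perf_context_print totals above
echo "scale=3; 34 / 3.869" | bc   # -> 8.787, i.e. the ≈8.79 tokens per second quoted
```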
Key Insights
- Performance Bottlenecks:
  - The main bottleneck is the evaluation phase (actual token generation)
  - While sampling can handle 14K+ tokens per second, actual generation is limited to about 9 tokens per second
  - This difference highlights that the neural network computations, not the token selection process, are the limiting factor
- Processing Stages:
  - Model loading is a significant but one-time cost
  - Prompt evaluation is slightly slower than subsequent token generation
  - Sampling is extremely fast compared to evaluation
- Overall Performance:
  - The system shows the performance characteristics expected for a very large MoE model offloaded across two A100 GPUs
  - The total processing rate of ~9 tokens per second is reasonable for local inference of a 671B-parameter model at 1.58-bit quantization
Relevant log output
Thanks for the replication! Here is a silly nitpick - shouldn't it be 14 tokens per second?
Reported performance is 140 tokens/second
Original:
The 1.58bit quantization should fit in 160GB of VRAM for fast inference (2x H100 80GB), with it attaining around 14 tokens per second. You don't need VRAM (GPU) to run 1.58bit R1, just 20GB of RAM (CPU) will work however it maybe slow. For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.
Hey! Whoops guys apologies - just found out it should be 10 to 14 tokens / s for generation speed and not 140 (140 tok/s is the prompt eval time) on 2xH100. 😢
Sorry I didn't get any sleep over the past week since I was too excited to pump out the 1.58bit and release it to everyone. 😢
I mentioned most people should expect to get 1 to 3 tokens / s on most local GPUs, so I'm unsure how I missed the 140 tokens / s.
The 140 tokens / s is the prompt eval time - the generation / decode speed is in fact 10 to 14 tokens / s - so I must have reported the wrong line.
Eg - 137.66 tok / s for prompt processing and 10.69 tok / s for decoding:
llama_perf_sampler_print: sampling time = 199.35 ms / 2759 runs ( 0.07 ms per token, 13839.98 tokens per second)
llama_perf_context_print: load time = 32281.52 ms
llama_perf_context_print: prompt eval time = 1598.12 ms / 220 tokens ( 7.26 ms per token, 137.66 tokens per second)
llama_perf_context_print: eval time = 237358.50 ms / 2538 runs ( 93.52 ms per token, 10.69 tokens per second)
llama_perf_context_print: total time = 239477.62 ms / 2758 tokens
I've changed the blog post, docs and everywhere to reflect this issue.
I also uploaded a screen recording GIF showing 140 tok/s for prompt eval and 10 tok/s for generation, covering the 1st minute and the last minute, as an example.
So 140 tok/s is the prompt eval time, and so I reported the wrong line - decoding speed is 10 to 14 tok/s.
On more analysis - I can see via Open Router https://openrouter.ai/deepseek/deepseek-r1 the API tokens / s is around 3 or 4 tokens / s for R1.
Throughput though is a different measure - https://artificialanalysis.ai/models/deepseek-r1/providers reports 60 tok / s for DeepSeek's official API.
Assuming ~6 tok/s per single user for DeepSeek, that 60 tok/s throughput figure would correspond to roughly 10x the single-user rate.
Also @loretoparisi, I extremely appreciate the testing, so thanks again!
Again thank you for testing the model out - hope the 1.58bit model functions well!
You can also try offloading the non-repeating tensors by using -ngl 62 instead of -ngl 61. You might have to lower the physical batch size to -ub 128 or -ub 256 to reduce compute buffer sizes and maybe improve the pipeline parallelism with 2 GPUs.
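For reference, the full adjusted invocation would look roughly like this - a sketch based on the command at the top of this issue, with only --n-gpu-layers and -ub changed as suggested:

```bash
# Same run as the original command, but also offloading the non-repeating output
# layer (-ngl 62) and shrinking the physical batch (-ub 256) to reduce compute buffers.
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    -ub 256 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"
```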
Btw, here is a data point for M2 Studio:
https://github.com/user-attachments/assets/73094fcb-8030-424d-9be1-299877d03035
The prompt processing reported by llama-bench is only 23 t/s, which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. ~Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention, which can also be improved.~
@ggerganov Super cool! Glad it worked well on Mac! I'm a Linux user so I had to ask someone else to test it, but good thing you verified it works smoothly :) Good work again on llama.cpp!
@ggerganov why did you strike out the fa? Is it working now?
Sorry for the confusion. At first I thought that enabling FA only requires supporting n_embd_head_k != n_embd_head_v, which is doable. But then I remembered that DS uses MLA and thought that the FA implementation we have is not compatible with this attention mechanism, so I struck it out. But now I look at the code and it is actually compatible. So the initial point remains valid and FA can be enabled with some work.
@ggerganov this would enable v quantization, right? And maybe some speed ups?
It will:
- Reduce compute memory usage
- Enable V quantization that reduces the KV cache memory
- Improve performance at longer contexts
The Metal changes for FA should be relatively simple I think if someone wants to take a stab at it.
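Once FA works for this model, turning it on together with V-cache quantization should just be a matter of the existing llama.cpp flags. A hypothetical, untested sketch:

```bash
# Untested sketch: -fa enables flash attention (not usable for this model at the time
# of this thread); --cache-type-v only takes effect once FA is active.
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -fa --cache-type-k q4_0 --cache-type-v q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    --ctx-size 8192
```

For scale, the f16 V cache reported later in this thread is ~19.5 GiB at 10k context, so quantizing it would free a substantial chunk of VRAM.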
@ggerganov thanks! I ran with --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 -ub 256:
sampler seed: 3407
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1
with a small improvement:
llama_perf_sampler_print: sampling time = 1.98 ms / 35 runs ( 0.06 ms per token, 17667.84 tokens per second)
llama_perf_context_print: load time = 27176.99 ms
llama_perf_context_print: prompt eval time = 916.83 ms / 10 tokens ( 91.68 ms per token, 10.91 tokens per second)
llama_perf_context_print: eval time = 2308.80 ms / 24 runs ( 96.20 ms per token, 10.40 tokens per second)
how to change pipe parallelism?
It's enabled by default. The -ub parameter will affect the prompt processing speed and you can tune the value for optimal performance on your system. Just use a larger prompt, or llama-bench because just 10 tokens for prompt will not give you meaningful results.
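A minimal llama-bench sweep along those lines might look like this (values are just examples; llama-bench accepts comma-separated lists, so several -ub settings can be compared in one run):

```bash
# pp512 measures prompt processing, tg128 measures generation, as in the tables below.
./llama.cpp/build/bin/llama-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 62 -t 12 -ctk q4_0 \
    -p 512 -n 128 \
    -ub 128,256,512
```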
These are the updated results with a larger prompt. It scored 9.41 tokens per second for generation:
llama_perf_sampler_print: sampling time = 103.30 ms / 1337 runs ( 0.08 ms per token, 12942.63 tokens per second)
llama_perf_context_print: load time = 23387.53 ms
llama_perf_context_print: prompt eval time = 1102.60 ms / 20 tokens ( 55.13 ms per token, 18.14 tokens per second)
llama_perf_context_print: eval time = 139817.28 ms / 1316 runs ( 106.24 ms per token, 9.41 tokens per second)
llama_perf_context_print: total time = 141240.50 ms / 1336 tokens
@ggerganov these are test results from llama-bench
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
| model | size | params | backend | ngl | threads | type_k | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | pp512 | 189.38 ± 1.11 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | tg128 | 10.32 ± 0.01 |
Further tests on 2x H100 80GB, reaching roughly 12 tokens per second:
llama_perf_sampler_print: sampling time = 70.01 ms / 1128 runs ( 0.06 ms per token, 16112.67 tokens per second)
llama_perf_context_print: load time = 28143.05 ms
llama_perf_context_print: prompt eval time = 54405.96 ms / 20 tokens ( 2720.30 ms per token, 0.37 tokens per second)
llama_perf_context_print: eval time = 94147.64 ms / 1107 runs ( 85.05 ms per token, 11.76 tokens per second)
llama_perf_context_print: total time = 148778.61 ms / 1127 tokens
and benchmarks:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model | size | params | backend | ngl | threads | type_k | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | pp512 | 276.56 ± 1.24 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | tg128 | 11.89 ± 0.01 |
while with 4x H100 80GB @ 214 TFLOPS we have:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model | size | params | backend | ngl | threads | type_k | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | pp512 | 273.10 ± 1.41 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | tg128 | 11.84 ± 0.00 |
or
llama_perf_sampler_print: sampling time = 57.98 ms / 1128 runs ( 0.05 ms per token, 19453.98 tokens per second)
llama_perf_context_print: load time = 23185.90 ms
llama_perf_context_print: prompt eval time = 49140.45 ms / 20 tokens ( 2457.02 ms per token, 0.41 tokens per second)
llama_perf_context_print: eval time = 95991.54 ms / 1107 runs ( 86.71 ms per token, 11.53 tokens per second)
llama_perf_context_print: total time = 145329.21 ms / 1127 tokens
So I didn't see any significant improvement from adding more GPUs or increasing threads beyond 12 right now. According to nvidia-smi, all {0,1,2,3} GPUs were in use.
This model is really something. I came up with a fun puzzle:
What could this mean: 'gwkki qieks'?
Solution by DeepSeek-R1 IQ1_S
It does not always get it right, but neither does the API.
Surprised the generation speed (i.e. non-prompt processing) is so similar for H100, A100 and M2 Ultra?
Isn't the memory bandwidth approximately: H100 = 2 x A100 = 4 x M2 Ultra?
I'm even more confused now as people seem to be getting ~1.5 tokens per second using SSDs:
https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/13
https://old.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/
At best these have 1/100th the memory bandwidth of even the M2 Ultra?
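For what it's worth, a very rough back-of-envelope (my own numbers, not from this thread: ~37B active parameters at ~1.6 bits/weight is ~7.4 GB of weights touched per generated token, and approximate peak bandwidths of ~3350 GB/s for H100 SXM, ~2000 GB/s for A100 80GB, ~800 GB/s for M2 Ultra):

```bash
# Bandwidth-bound upper limit on decode speed = memory bandwidth / bytes read per token.
for bw in 3350 2000 800; do
    echo "$bw GB/s -> ~$(echo "$bw / 7.4" | bc) tok/s upper bound"
done
```

All three bounds come out far above the ~10 tok/s people are seeing, which would suggest these runs are not memory-bandwidth-bound in the first place - consistent with the low GPU utilisation reported further down.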
There's a Reddit thread on this now:
https://www.reddit.com/r/LocalLLaMA/comments/1idivqe/the_mac_m2_ultra_is_faster_than_2xh100s_in/
I have 9x3090 and while running Deepseek 2.5 q4, I got about 25 tok/s
With R1 IQ1_S I get 2.5 tok/s. There is a bottleneck somewhere.
IQ1_S is seemingly not the best quant for the CUDA backend. What about Q2_K?
It would be interesting to see the results for K quants (and _0 if anyone can run them).
@ggerganov This is vLLM for comparison started as
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 8192 --enforce-eager
It seems that when GPU KV cache usage is >0.0%, generation throughput is ~30 tokens/s:
...
INFO 01-30 16:15:44 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:49 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:54 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:59 engine.py:291] Aborted request chatcmpl-04d27b242dbc4bc0b0743235a31d53d7.
INFO 01-30 16:16:09 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 01-30 16:16:19 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
In fact, when GPU KV cache usage is 0.0% you get:
INFO 01-30 16:16:09 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Worth noting that PyTorch eager execution mode was enforced.
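Since --enforce-eager disables CUDA graph capture in vLLM, one follow-up worth trying (a sketch, otherwise identical to the command above) would be:

```bash
# Same launch without --enforce-eager, so vLLM can capture CUDA graphs,
# which usually improves decode throughput.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 --max-model-len 8192
```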
@loretoparisi That 10.5 t/s is just the average between 31 t/s and 0 t/s after your prompt ended - not actually a meaningful number, right?
@loretoparisi DeepSeek-R1-Distill-Qwen-32B isn't deepseek?
It's the official R1 distillation from Qwen2.5-32B.
But the whole point of this thread is to benchmark the deepseek-v3 architecture? :)
Reporting 8 tok/s on 2x A100 (pcie).
1.58bit R1:
- Reporting 3 tokens/s on 1x 4090 24GB with 192 CPU cores / huge CPU memory (>100GB).
- Reporting ~0.5 tokens/s on 1x 4090 24GB with limited CPU memory (~60GB).
- #1's config system_info: n_threads = 192 (n_threads_batch = 192) / 384 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Here are our results for DeepSeek-R1 IQ1_S 1.58bit:
AMD EPYC 9654 96-Core 768GB RAM, 1 * Nvidia RTX 3090 (24GB VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 8 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"
Results:
llama_perf_sampler_print: sampling time = 29.78 ms / 526 runs ( 0.06 ms per token, 17662.27 tokens per second)
llama_perf_context_print: load time = 21075.00 ms
llama_perf_context_print: prompt eval time = 2659.54 ms / 13 tokens ( 204.58 ms per token, 4.89 tokens per second)
llama_perf_context_print: eval time = 155924.77 ms / 512 runs ( 304.54 ms per token, 3.28 tokens per second)
llama_perf_context_print: total time = 158686.10 ms / 525 tokens
AMD EPYC 7713 64-Core 952GB RAM, 8 * Nvidia L40 (45GB VRAM, 360GB total VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 62 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"
Results:
llama_perf_sampler_print: sampling time = 55.95 ms / 1106 runs ( 0.05 ms per token, 19768.71 tokens per second)
llama_perf_context_print: load time = 26832.58 ms
llama_perf_context_print: prompt eval time = 47971.09 ms / 13 tokens ( 3690.08 ms per token, 0.27 tokens per second)
llama_perf_context_print: eval time = 96577.12 ms / 1092 runs ( 88.44 ms per token, 11.31 tokens per second)
llama_perf_context_print: total time = 144832.33 ms / 1105 tokens
AMD EPYC 7V12 64-Core 1820GB RAM, 8 * A100 SXM4 (80GB VRAM, 640GB total VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 62 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"
Results:
llama_perf_sampler_print: sampling time = 126.63 ms / 1593 runs ( 0.08 ms per token, 12579.66 tokens per second)
llama_perf_context_print: load time = 33417.77 ms
llama_perf_context_print: prompt eval time = 1183.44 ms / 13 tokens ( 91.03 ms per token, 10.98 tokens per second)
llama_perf_context_print: eval time = 209953.23 ms / 1579 runs ( 132.97 ms per token, 7.52 tokens per second)
llama_perf_context_print: total time = 211573.62 ms / 1592 tokens
6-11 t/s on 8x3090. Bigger context gives the lower end, and vice versa.
Speed is plenty good for generation 👍 Can you share the prompt processing speed? Preferably on a 10k+ prompt.
Out of curiosity, I went searching for a dense model with a total parameter count similar to R1's active parameters. Found an IQ1_S of Falcon 40B. (Yes, it's basically braindead lol)
I'm running 10x P40's, so both models fit in Vram. Tested with 500 input tokens and 500 output tokens:
Falcon 40B IQ1_S - Prompt: 160 t/s, Generation: 9.5 t/s
DeepSeek R1 IQ1_S - Prompt: 53 t/s, Generation: 6.2 t/s
6-11 t/s on 8x3090. Bigger context gives the lower end, and vice versa.
Speed is plenty good for generation 👍 Can you share the prompt processing speed? Preferably on a 10k+ prompt.
So I had some issues with CUDA running out of memory during prompt processing at 10k+ context, even though it would let me load the model fine.
I also picked up another 3090 today, so I have 9x3090 now. I loaded the DeepSeek-R1-UD-IQ1_M model instead of the 1.58bit one. However, I had to power-limit the GPUs to 280 W as I only have 2x 1500 W PSUs.
It's worth mentioning that with the MoE architecture each GPU only pulls ~130-150 W during inference, not higher, but I believe it's the peak utilisation spikes that are the problem, so I limited them to 280 W.
I'm still playing around with -ub at 128/256 and finding a context size that balances nicely.
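For reference, the 280 W cap described above can be set per GPU with nvidia-smi - a sketch, assuming the nine 3090s are indexed 0-8 (requires root):

```bash
# Cap each of the nine GPUs at 280 W (-i selects the device, -pl sets the limit in watts).
for i in $(seq 0 8); do
    sudo nvidia-smi -i "$i" -pl 280
done
```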
Without FA there's a lot of VRAM usage for context, so I'm also using -ub 128 to fit a bigger context. The layer split is also unbalanced; here's how it looks. It's hard to get it balanced right with tensor split.
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CUDA0 model buffer size = 15740.59 MiB
load_tensors: CUDA1 model buffer size = 16315.85 MiB
load_tensors: CUDA2 model buffer size = 19035.16 MiB
load_tensors: CUDA3 model buffer size = 19035.16 MiB
load_tensors: CUDA4 model buffer size = 19035.16 MiB
load_tensors: CUDA5 model buffer size = 16315.85 MiB
load_tensors: CUDA6 model buffer size = 19035.16 MiB
load_tensors: CUDA7 model buffer size = 19035.16 MiB
load_tensors: CUDA8 model buffer size = 17040.83 MiB
load_tensors: CPU_Mapped model buffer size = 497.11 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 10240
llama_init_from_model: n_ctx_per_seq = 10240
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 128
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 0.025
llama_init_from_model: n_ctx_per_seq (10240) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 10240, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 3640.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 2730.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 2730.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA8 KV buffer size = 2730.00 MiB
llama_init_from_model: KV self size = 27755.00 MiB, K (q4_0): 8235.00 MiB, V (f16): 19520.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.49 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model: CUDA0 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA1 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA2 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA3 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA4 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA5 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA6 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA7 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA8 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 23.51 MiB
llama_init_from_model: graph nodes = 5025
llama_init_from_model: graph splits = 10
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240
Prompt wise, here you go:
prompt eval time = 97867.98 ms / 7744 tokens ( 12.64 ms per token, 79.13 tokens per second)
eval time = 450625.60 ms / 2000 tokens ( 225.31 ms per token, 4.44 tokens per second)
total time = 548493.58 ms / 9744 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 192.168.1.64 200
slot launch_slot_: id 0 | task 4105 | processing task
slot update_slots: id 0 | task 4105 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9920
slot update_slots: id 0 | task 4105 | kv cache rm [2, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.206452
slot update_slots: id 0 | task 4105 | kv cache rm [2050, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 4098, n_tokens = 2048, progress = 0.412903
slot update_slots: id 0 | task 4105 | kv cache rm [4098, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 6146, n_tokens = 2048, progress = 0.619355
slot update_slots: id 0 | task 4105 | kv cache rm [6146, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 8194, n_tokens = 2048, progress = 0.825806
slot update_slots: id 0 | task 4105 | kv cache rm [8194, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 9920, n_tokens = 1726, progress = 0.999798
slot update_slots: id 0 | task 4105 | prompt done, n_past = 9920, n_tokens = 1726
slot release: id 0 | task 4105 | stop processing: n_past = 9969, truncated = 0
slot print_timing: id 0 | task 4105 |
prompt eval time = 129716.52 ms / 9918 tokens ( 13.08 ms per token, 76.46 tokens per second)
eval time = 11962.27 ms / 50 tokens ( 239.25 ms per token, 4.18 tokens per second)
total time = 141678.79 ms / 9968 tokens
Prompt processing is quite slow with -ub 128. Token generation also got quite a bit slower. I would say that's a combination of bigger quant, -ub 128, and GPUs limited to 280w.
GPU utilisation during inference really sits around 10%, so I believe there is huge potential for optimisation here.
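One easy way to confirm that (my suggestion, not something done in this thread) is to leave nvidia-smi's device monitor running during a generation:

```bash
# Prints per-GPU power draw plus SM and memory utilization once per second.
nvidia-smi dmon -s pu
```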