Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit
Research Stage
- [ ] Background Research (Let's try to avoid reinventing the wheel)
- [ ] Hypothesis Formed (How do you think this will work and what will its effect be?)
- [ ] Strategy / Implementation Forming
- [x] Analysis of results
- [ ] Debrief / Documentation (So people in the future can learn from us)
Previous existing literature and research
Command
./llama.cpp/build/bin/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>What is the capital of Italy?<|Assistant|>"
Model
DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S 1.58Bit, 131GB
Hardware
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 58W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:2A:00.0 Off | 0 |
| N/A 32C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Hypothesis
Reported performance is 140 tokens/second
Implementation
No response
Analysis
Llama.cpp Performance Analysis
Raw Benchmarks
llama_perf_sampler_print: sampling time = 2.45 ms / 35 runs ( 0.07 ms per token, 14297.39 tokens per second)
llama_perf_context_print: load time = 20988.11 ms
llama_perf_context_print: prompt eval time = 1233.88 ms / 10 tokens ( 123.39 ms per token, 8.10 tokens per second)
llama_perf_context_print: eval time = 2612.63 ms / 24 runs ( 108.86 ms per token, 9.19 tokens per second)
llama_perf_context_print: total time = 3869.00 ms / 34 tokens
Detailed Analysis
1. Token Sampling Performance
- Total Time: 2.45 ms for 35 runs
- Per Token: 0.07 ms
- Speed: 14,297.39 tokens per second
- Description: This represents the speed at which the model can select the next token after processing. This is extremely fast compared to the actual generation speed, as it only involves the final selection process.
2. Model Loading
- Total Time: 20,988.11 ms (≈21 seconds)
- Description: One-time initialization cost to load the model into memory. This happens only at startup and doesn't affect ongoing performance.
3. Prompt Evaluation
- Total Time: 1,233.88 ms for 10 tokens
- Per Token: 123.39 ms
- Speed: 8.10 tokens per second
- Description: Initial processing of the prompt is slightly slower than subsequent token generation, as it needs to establish the full context for the first time.
4. Generation Evaluation
- Total Time: 2,612.63 ms for 24 runs
- Per Token: 108.86 ms
- Speed: 9.19 tokens per second
- Description: This represents the actual speed of generating new tokens, including all neural network computations.
5. Total Processing Time
- Total Time: 3,869.00 ms
- Tokens Processed: 34 tokens
- Average Speed: ≈8.79 tokens per second
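As a quick sanity check (my own arithmetic, using only the totals reported above), the average follows directly from total tokens over total time:

```bash
# 34 tokens processed in 3.869 s, from the llama_perf_context_print totals above
echo "scale=3; 34 / 3.869" | bc   # -> 8.787, i.e. the ≈8.79 tokens per second quoted
```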
Key Insights
- Performance Bottlenecks:
  - The main bottleneck is the evaluation phase (actual token generation)
  - While sampling can handle 14K+ tokens per second, actual generation is limited to about 9 tokens per second
  - This difference highlights that the neural network computations, not the token selection process, are the limiting factor
- Processing Stages:
  - Model loading is a significant but one-time cost
  - Prompt evaluation is slightly slower than subsequent token generation
  - Sampling is extremely fast compared to evaluation
- Overall Performance:
  - The system shows the performance characteristics expected for a very large MoE model offloaded across two A100 GPUs
  - The total processing rate of ~9 tokens per second is reasonable for local inference of a 671B-parameter model at 1.58-bit quantization
Relevant log output
Thanks for the replication! Here is a silly nitpick - shouldn't it be 14 tokens per second?
Reported performance is 140 tokens/second
Original:
The 1.58bit quantization should fit in 160GB of VRAM for fast inference (2x H100 80GB), with it attaining around 14 tokens per second. You don't need VRAM (GPU) to run 1.58bit R1, just 20GB of RAM (CPU) will work however it maybe slow. For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.
Hey! Whoops guys apologies - just found out it should be 10 to 14 tokens / s for generation speed and not 140 (140 tok/s is the prompt eval time) on 2xH100. 😢
Sorry I didn't get any sleep over the past week since I was too excited to pump out the 1.58bit and release it to everyone. 😢
I mentioned most people should expect to get 1 to 3 tokens / s on most local GPUs, so I'm unsure how I missed the 140 tokens / s.
The 140 tokens / s is the prompt eval time - the generation / decode speed is in fact 10 to 14 tokens / s - so I must have reported the wrong line.
Eg - 137.66 tok / s for prompt processing and 10.69 tok / s for decoding:
llama_perf_sampler_print: sampling time = 199.35 ms / 2759 runs ( 0.07 ms per token, 13839.98 tokens per second)
llama_perf_context_print: load time = 32281.52 ms
llama_perf_context_print: prompt eval time = 1598.12 ms / 220 tokens ( 7.26 ms per token, 137.66 tokens per second)
llama_perf_context_print: eval time = 237358.50 ms / 2538 runs ( 93.52 ms per token, 10.69 tokens per second)
llama_perf_context_print: total time = 239477.62 ms / 2758 tokens
I've changed the blog post, docs and everywhere to reflect this issue.
I also uploaded a screen recording GIF showing 140 tok/s for prompt eval and 10 tok/s for generation, covering the 1st minute and the last minute, as an example.
So 140 tok/s is the prompt eval time, and so I reported the wrong line - decoding speed is 10 to 14 tok/s.
On more analysis - I can see via Open Router https://openrouter.ai/deepseek/deepseek-r1 the API tokens / s is around 3 or 4 tokens / s for R1.
Throughput though is a different measure - https://artificialanalysis.ai/models/deepseek-r1/providers reports 60 tok / s for DeepSeek's official API.
Assuming ~6 tok/s per single user for DeepSeek, that 60 tok/s throughput figure would correspond to roughly 10x the single-user rate.
Also @loretoparisi, I extremely appreciate the testing, so thanks again!
Again thank you for testing the model out - hope the 1.58bit model functions well!
You can also try offloading the non-repeating tensors by using -ngl 62 instead of -ngl 61. You might have to lower the physical batch size to -ub 128 or -ub 256 to reduce compute buffer sizes and maybe improve the pipeline parallelism with 2 GPUs.
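For reference, the full adjusted invocation would look roughly like this - a sketch based on the command at the top of this issue, with only --n-gpu-layers and -ub changed as suggested:

```bash
# Same run as the original command, but also offloading the non-repeating output
# layer (-ngl 62) and shrinking the physical batch (-ub 256) to reduce compute buffers.
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    -ub 256 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"
```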
Btw, here is a data point for M2 Studio:
https://github.com/user-attachments/assets/73094fcb-8030-424d-9be1-299877d03035
The prompt processing reported by llama-bench is only 23 t/s, which is quite low, but the Metal backend is very poorly optimized for MoE, so maybe it can be improved a bit. ~Also, currently we have to disable FA because of the unusual shapes of the tensors in the attention, which can also be improved.~
@ggerganov Super cool! Glad it worked well on Mac! I'm a Linux user so I had to ask someone else to test it, but good thing you verified it works smoothly :) Good work again on llama.cpp!
@ggerganov why did you strike out the fa? Is it working now?
Sorry for the confusion. At first I thought that enabling FA only requires supporting n_embd_head_k != n_embd_head_v, which is doable. But then I remembered that DS uses MLA and thought that the FA implementation we have is not compatible with this attention mechanism, so I struck it out. But now I look at the code and it is actually compatible. So the initial point remains valid and FA can be enabled with some work.
@ggerganov this would enable v quantization, right? And maybe some speed ups?
It will:
- Reduce compute memory usage
- Enable V quantization that reduces the KV cache memory
- Improve performance at longer contexts
The Metal changes for FA should be relatively simple I think if someone wants to take a stab at it.
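Once FA works for this model, turning it on together with V-cache quantization should just be a matter of the existing llama.cpp flags. A hypothetical, untested sketch:

```bash
# Untested sketch: -fa enables flash attention (not usable for this model at the time
# of this thread); --cache-type-v only takes effect once FA is active.
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -fa --cache-type-k q4_0 --cache-type-v q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 \
    --ctx-size 8192
```

For scale, the f16 V cache reported later in this thread is ~19.5 GiB at 10k context, so quantizing it would free a substantial chunk of VRAM.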
@ggerganov thanks! I ran with --threads 12 -no-cnv --n-gpu-layers 62 --prio 2 -ub 256:
sampler seed: 3407
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1
with a small improvement:
llama_perf_sampler_print: sampling time = 1.98 ms / 35 runs ( 0.06 ms per token, 17667.84 tokens per second)
llama_perf_context_print: load time = 27176.99 ms
llama_perf_context_print: prompt eval time = 916.83 ms / 10 tokens ( 91.68 ms per token, 10.91 tokens per second)
llama_perf_context_print: eval time = 2308.80 ms / 24 runs ( 96.20 ms per token, 10.40 tokens per second)
how to change pipe parallelism?
It's enabled by default. The -ub parameter will affect the prompt processing speed and you can tune the value for optimal performance on your system. Just use a larger prompt, or llama-bench because just 10 tokens for prompt will not give you meaningful results.
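A minimal llama-bench sweep along those lines might look like this (values are just examples; llama-bench accepts comma-separated lists, so several -ub settings can be compared in one run):

```bash
# pp512 measures prompt processing, tg128 measures generation, as in the tables below.
./llama.cpp/build/bin/llama-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -ngl 62 -t 12 -ctk q4_0 \
    -p 512 -n 128 \
    -ub 128,256,512
```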
These are the updated results with a larger prompt. It scored 9.41 tokens per second for generation:
llama_perf_sampler_print: sampling time = 103.30 ms / 1337 runs ( 0.08 ms per token, 12942.63 tokens per second)
llama_perf_context_print: load time = 23387.53 ms
llama_perf_context_print: prompt eval time = 1102.60 ms / 20 tokens ( 55.13 ms per token, 18.14 tokens per second)
llama_perf_context_print: eval time = 139817.28 ms / 1316 runs ( 106.24 ms per token, 9.41 tokens per second)
llama_perf_context_print: total time = 141240.50 ms / 1336 tokens
@ggerganov these are test results from llama-bench
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
| model | size | params | backend | ngl | threads | type_k | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | pp512 | 189.38 ± 1.11 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | tg128 | 10.32 ± 0.01 |
Further tests on 2x H100 80GB, reaching roughly 12 tokens per second:
llama_perf_sampler_print: sampling time = 70.01 ms / 1128 runs ( 0.06 ms per token, 16112.67 tokens per second)
llama_perf_context_print: load time = 28143.05 ms
llama_perf_context_print: prompt eval time = 54405.96 ms / 20 tokens ( 2720.30 ms per token, 0.37 tokens per second)
llama_perf_context_print: eval time = 94147.64 ms / 1107 runs ( 85.05 ms per token, 11.76 tokens per second)
llama_perf_context_print: total time = 148778.61 ms / 1127 tokens
and benchmarks:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model | size | params | backend | ngl | threads | type_k | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | pp512 | 276.56 ± 1.24 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | tg128 | 11.89 ± 0.01 |
while with 4x H100 80GB @ 214 TFLOPS we have:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model | size | params | backend | ngl | threads | type_k | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | pp512 | 273.10 ± 1.41 |
| deepseek2 671B IQ1_S - 1.5625 bpw | 130.60 GiB | 671.03 B | CUDA | 62 | 12 | q4_0 | tg128 | 11.84 ± 0.00 |
or
llama_perf_sampler_print: sampling time = 57.98 ms / 1128 runs ( 0.05 ms per token, 19453.98 tokens per second)
llama_perf_context_print: load time = 23185.90 ms
llama_perf_context_print: prompt eval time = 49140.45 ms / 20 tokens ( 2457.02 ms per token, 0.41 tokens per second)
llama_perf_context_print: eval time = 95991.54 ms / 1107 runs ( 86.71 ms per token, 11.53 tokens per second)
llama_perf_context_print: total time = 145329.21 ms / 1127 tokens
So I didn't see any significant improvement from adding more GPUs or increasing threads beyond 12 right now. According to nvidia-smi, all {0,1,2,3} GPUs were in use.
This model is really something. I came up with a fun puzzle:
What could this mean: 'gwkki qieks'?
Solution by DeepSeek-R1 IQ1_S
It does not always get it right, but neither does the API.
Surprised the generation speed (i.e. non-prompt processing) is so similar for H100, A100 and M2 Ultra?
Isn't the memory bandwidth approximately: H100 = 2 x A100 = 4 x M2 Ultra?
I'm even more confused now as people seem to be getting ~1.5 tokens per second using SSDs:
https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/13
https://old.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/
At best these have 1/100th the memory bandwidth of even the M2 Ultra?
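For what it's worth, a very rough back-of-envelope (my own numbers, not from this thread: ~37B active parameters at ~1.6 bits/weight is ~7.4 GB of weights touched per generated token, and approximate peak bandwidths of ~3350 GB/s for H100 SXM, ~2000 GB/s for A100 80GB, ~800 GB/s for M2 Ultra):

```bash
# Bandwidth-bound upper limit on decode speed = memory bandwidth / bytes read per token.
for bw in 3350 2000 800; do
    echo "$bw GB/s -> ~$(echo "$bw / 7.4" | bc) tok/s upper bound"
done
```

All three bounds come out far above the ~10 tok/s people are seeing, which would suggest these runs are not memory-bandwidth-bound in the first place - consistent with the low GPU utilisation reported further down.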
There's a Reddit thread on this now:
https://www.reddit.com/r/LocalLLaMA/comments/1idivqe/the_mac_m2_ultra_is_faster_than_2xh100s_in/
I have 9x3090 and while running Deepseek 2.5 q4, I got about 25 tok/s
With R1 IQ1_S I get 2.5 tok/s. There is a bottleneck somewhere.
IQ1_S is seemingly not the best quant for the CUDA backend. What about Q2_K?
It would be interesting to see the results for K quants (and _0 if anyone can run them).
@ggerganov This is vLLM for comparison started as
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 8192 --enforce-eager
It seems that when GPU KV cache usage is >0.0%, generation throughput is ~30 tokens/s:
...
INFO 01-30 16:15:44 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:49 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 32.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:54 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 31.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
INFO 01-30 16:15:59 engine.py:291] Aborted request chatcmpl-04d27b242dbc4bc0b0743235a31d53d7.
INFO 01-30 16:16:09 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 01-30 16:16:19 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
In fact, when GPU KV cache usage is 0.0% you get:
INFO 01-30 16:16:09 metrics.py:453] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Worth noting that PyTorch eager execution mode was enforced.
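Since --enforce-eager disables CUDA graph capture in vLLM, one follow-up worth trying (a sketch, otherwise identical to the command above) would be:

```bash
# Same launch without --enforce-eager, so vLLM can capture CUDA graphs,
# which usually improves decode throughput.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 --max-model-len 8192
```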
@loretoparisi That 10.5 t/s is just the average between 31 t/s and 0 t/s after your prompt ended - not actually a meaningful number, right?
@loretoparisi DeepSeek-R1-Distill-Qwen-32B isn't deepseek?
It's the official R1 distillation from Qwen2.5-32B.
But the whole point of this thread is to benchmark the deepseek-v3 architecture? :)
Reporting 8 tok/s on 2x A100 (pcie).
1.58bit R1:
- Reporting 3 tokens/s on 1x 4090 24GB with 192 CPU cores / huge CPU memory (>100GB).
- Reporting ~0.5 tokens/s on 1x 4090 24GB with limited CPU memory (~60GB).
- #1's config system_info: n_threads = 192 (n_threads_batch = 192) / 384 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Here are our results for DeepSeek-R1 IQ1_S 1.58bit:
AMD EPYC 9654 96-Core 768GB RAM, 1 * Nvidia RTX 3090 (24GB VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 8 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"
Results:
llama_perf_sampler_print: sampling time = 29.78 ms / 526 runs ( 0.06 ms per token, 17662.27 tokens per second)
llama_perf_context_print: load time = 21075.00 ms
llama_perf_context_print: prompt eval time = 2659.54 ms / 13 tokens ( 204.58 ms per token, 4.89 tokens per second)
llama_perf_context_print: eval time = 155924.77 ms / 512 runs ( 304.54 ms per token, 3.28 tokens per second)
llama_perf_context_print: total time = 158686.10 ms / 525 tokens
AMD EPYC 7713 64-Core 952GB RAM, 8 * Nvidia L40 (45GB VRAM, 360GB total VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 62 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"
Results:
llama_perf_sampler_print: sampling time = 55.95 ms / 1106 runs ( 0.05 ms per token, 19768.71 tokens per second)
llama_perf_context_print: load time = 26832.58 ms
llama_perf_context_print: prompt eval time = 47971.09 ms / 13 tokens ( 3690.08 ms per token, 0.27 tokens per second)
llama_perf_context_print: eval time = 96577.12 ms / 1092 runs ( 88.44 ms per token, 11.31 tokens per second)
llama_perf_context_print: total time = 144832.33 ms / 1105 tokens
AMD EPYC 7V12 64-Core 1820GB RAM, 8 * A100 SXM4 (80GB VRAM, 640GB total VRAM)
./llama.cpp/llama-cli --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 12 -no-cnv --prio 2 --n-gpu-layers 62 --temp 0.6 --ctx-size 8192 --seed 3407 --prompt "<|User|>Why the Saturn planet in our system has rings?.<|Assistant|>"
Results:
llama_perf_sampler_print: sampling time = 126.63 ms / 1593 runs ( 0.08 ms per token, 12579.66 tokens per second)
llama_perf_context_print: load time = 33417.77 ms
llama_perf_context_print: prompt eval time = 1183.44 ms / 13 tokens ( 91.03 ms per token, 10.98 tokens per second)
llama_perf_context_print: eval time = 209953.23 ms / 1579 runs ( 132.97 ms per token, 7.52 tokens per second)
llama_perf_context_print: total time = 211573.62 ms / 1592 tokens
6-11 t/s on 8x3090. Bigger context gives the lower end, and vice versa.
Speed is plenty good for generation 👍 Can you share the prompt processing speed? Preferably on a 10k+ prompt.
Out of curiosity, I went searching for a dense model with a total parameter count similar to R1's active parameters. Found an IQ1_S of Falcon 40B. (Yes, it's basically braindead lol)
I'm running 10x P40's, so both models fit in Vram. Tested with 500 input tokens and 500 output tokens:
Falcon 40B IQ1_S - Prompt: 160 t/s, Generation: 9.5 t/s
DeepSeek R1 IQ1_S - Prompt: 53 t/s, Generation: 6.2 t/s
6-11 t/s on 8x3090. Bigger context gives the lower end, and vice versa.
Speed is plenty good for generation 👍 Can you share the prompt processing speed? Preferably on a 10k+ prompt.
So I had some issues with CUDA running out of memory during prompt processing at 10k+ context, even though it would let me load the model fine.
I also picked up another 3090 today, so I have 9x3090 now. I loaded the DeepSeek-R1-UD-IQ1_M model instead of the 1.58bit one. However, I had to power-limit the GPUs to 280 W as I only have 2x 1500 W PSUs.
It's worth mentioning that with the MoE architecture each GPU only pulls ~130-150 W during inference, not higher, but I believe it's the peak utilisation spikes that are the problem, so I limited them to 280 W.
I'm still playing around with -ub at 128/256 and finding a context size that balances nicely.
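For reference, the 280 W cap described above can be set per GPU with nvidia-smi - a sketch, assuming the nine 3090s are indexed 0-8 (requires root):

```bash
# Cap each of the nine GPUs at 280 W (-i selects the device, -pl sets the limit in watts).
for i in $(seq 0 8); do
    sudo nvidia-smi -i "$i" -pl 280
done
```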
Without FA there's a lot of VRAM usage for context, so I'm also using -ub 128 to fit a bigger context. The layer split is also unbalanced; here's how it looks. It's hard to get it balanced right with tensor split.
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CUDA0 model buffer size = 15740.59 MiB
load_tensors: CUDA1 model buffer size = 16315.85 MiB
load_tensors: CUDA2 model buffer size = 19035.16 MiB
load_tensors: CUDA3 model buffer size = 19035.16 MiB
load_tensors: CUDA4 model buffer size = 19035.16 MiB
load_tensors: CUDA5 model buffer size = 16315.85 MiB
load_tensors: CUDA6 model buffer size = 19035.16 MiB
load_tensors: CUDA7 model buffer size = 19035.16 MiB
load_tensors: CUDA8 model buffer size = 17040.83 MiB
load_tensors: CPU_Mapped model buffer size = 497.11 MiB
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 10240
llama_init_from_model: n_ctx_per_seq = 10240
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 128
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 0.025
llama_init_from_model: n_ctx_per_seq (10240) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 10240, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: CUDA0 KV buffer size = 3640.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 2730.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 2730.00 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 3185.00 MiB
llama_kv_cache_init: CUDA8 KV buffer size = 2730.00 MiB
llama_init_from_model: KV self size = 27755.00 MiB, K (q4_0): 8235.00 MiB, V (f16): 19520.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.49 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model: CUDA0 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA1 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA2 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA3 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA4 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA5 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA6 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA7 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA8 compute buffer size = 712.50 MiB
llama_init_from_model: CUDA_Host compute buffer size = 23.51 MiB
llama_init_from_model: graph nodes = 5025
llama_init_from_model: graph splits = 10
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 10240
Prompt wise, here you go:
prompt eval time = 97867.98 ms / 7744 tokens ( 12.64 ms per token, 79.13 tokens per second)
eval time = 450625.60 ms / 2000 tokens ( 225.31 ms per token, 4.44 tokens per second)
total time = 548493.58 ms / 9744 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 192.168.1.64 200
slot launch_slot_: id 0 | task 4105 | processing task
slot update_slots: id 0 | task 4105 | new prompt, n_ctx_slot = 10240, n_keep = 0, n_prompt_tokens = 9920
slot update_slots: id 0 | task 4105 | kv cache rm [2, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.206452
slot update_slots: id 0 | task 4105 | kv cache rm [2050, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 4098, n_tokens = 2048, progress = 0.412903
slot update_slots: id 0 | task 4105 | kv cache rm [4098, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 6146, n_tokens = 2048, progress = 0.619355
slot update_slots: id 0 | task 4105 | kv cache rm [6146, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 8194, n_tokens = 2048, progress = 0.825806
slot update_slots: id 0 | task 4105 | kv cache rm [8194, end)
slot update_slots: id 0 | task 4105 | prompt processing progress, n_past = 9920, n_tokens = 1726, progress = 0.999798
slot update_slots: id 0 | task 4105 | prompt done, n_past = 9920, n_tokens = 1726
slot release: id 0 | task 4105 | stop processing: n_past = 9969, truncated = 0
slot print_timing: id 0 | task 4105 |
prompt eval time = 129716.52 ms / 9918 tokens ( 13.08 ms per token, 76.46 tokens per second)
eval time = 11962.27 ms / 50 tokens ( 239.25 ms per token, 4.18 tokens per second)
total time = 141678.79 ms / 9968 tokens
Prompt processing is quite slow with -ub 128. Token generation also got quite a bit slower. I would say that's a combination of bigger quant, -ub 128, and GPUs limited to 280w.
GPU utilisation during inference really sits around 10%, so I believe there is huge potential for optimisation here.
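One easy way to confirm that (my suggestion, not something done in this thread) is to leave nvidia-smi's device monitor running during a generation:

```bash
# Prints per-GPU power draw plus SM and memory utilization once per second.
nvidia-smi dmon -s pu
```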