Bug: llama.cpp CPU/GPU (AVX, AVX2) inference with IQ1xx, IQ2xx, IQ3xx quants overheats the CPU (90 C on a Ryzen 9 7950X3D), but IQ4xx and other quants do not (CPU 65 C)
What happened?
CPU: Ryzen 9 7950X3D, Windows 11
Mistral-Large-Instruct-2407.IQ3_XS.gguf ( CPU 90 C )
Meta-Llama-3-70B-Instruct.Q4_K_M.gguf (CPU 66 C )
The temperature is higher than in the CPU torture tests from CPU-Z, where the max I get is 83 C. That happens ONLY with Mistral-Large-Instruct-2407.IQ3_XS.gguf for me. Even if I set --threads 1, my CPU heats up like crazy to 90 C, but Task Manager shows only 1 thread used by llama.cpp....
Mistral-Large-Instruct-2407.IQ3_XS.gguf
llama-cli.exe --model models/new3/Mistral-Large-Instruct-2407.IQ3_XS.gguf --color --threads 1 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 39 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template chatml
llama-cli.exe --model models/new3/Mistral-Large-Instruct-2407.IQ3_XS.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 39 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template chatml
Log start
main: build = 3488 (75af08c4)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1722289609
llama_model_loader: loaded meta data with 41 key-value pairs and 795 tensors from models/new3/Mistral-Large-Instruct-2407.IQ3_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Mistral Large Instruct 2407
llama_model_loader: - kv 3: general.version str = 2407
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Mistral
llama_model_loader: - kv 6: general.size_label str = Large
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = mrl
llama_model_loader: - kv 9: general.license.link str = https://mistral.ai/licenses/MRL-0.1.md
llama_model_loader: - kv 10: general.languages arr[str,10] = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv 11: llama.block_count u32 = 88
llama_model_loader: - kv 12: llama.context_length u32 = 32768
llama_model_loader: - kv 13: llama.embedding_length u32 = 12288
llama_model_loader: - kv 14: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 15: llama.attention.head_count u32 = 96
llama_model_loader: - kv 16: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: general.file_type u32 = 22
llama_model_loader: - kv 20: llama.vocab_size u32 = 32768
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = Mistral-Large-Instruct-2407-IMat-GGUF...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = Mistral-Large-Instruct-2407-IMat-GGUF...
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 616
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 148
llama_model_loader: - kv 38: split.no u16 = 0
llama_model_loader: - kv 39: split.count u16 = 0
llama_model_loader: - kv 40: split.tensors.count i32 = 795
llama_model_loader: - type f32: 177 tensors
llama_model_loader: - type q4_K: 88 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq3_xxs: 308 tensors
llama_model_loader: - type iq3_s: 221 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_layer = 88
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = IQ3_XS - 3.3 bpw
llm_load_print_meta: model params = 122.61 B
llm_load_print_meta: model size = 46.70 GiB (3.27 BPW)
llm_load_print_meta: general.name = Mistral Large Instruct 2407
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 39 repeating layers to GPU
llm_load_tensors: offloaded 39/89 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 26799.61 MiB
llm_load_tensors: CUDA0 buffer size = 21018.94 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 8224
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1574.12 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1252.88 MiB
llama_new_context_with_model: KV self size = 2827.00 MiB, K (f16): 1413.50 MiB, V (f16): 1413.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1669.13 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.07 MiB
llama_new_context_with_model: graph nodes = 2822
llama_new_context_with_model: graph splits = 543
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to the AI, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
> hello /
Hello! How can I assist you today? Let's have a friendly and respectful conversation. 😊
> tell me a stry /
I'd be happy to share a short story with you! Here we go:
Once upon a time in a small town nestled between rolling hills and a sparkling river, there lived a little girl named Lily. Lily was known for her vibrant imagination and her love for drawing. She could spend hours by the river, sketching the ducks, the flowers, and the clouds above.
> llama_print_timings: load time = 18929.92 ms
llama_print_timings: sample time = 3.32 ms / 107 runs ( 0.03 ms per token, 32248.34 tokens per second)
llama_print_timings: prompt eval time = 18667.41 ms / 64 tokens ( 291.68 ms per token, 3.43 tokens per second)
llama_print_timings: eval time = 56435.98 ms / 105 runs ( 537.49 ms per token, 1.86 tokens per second)
llama_print_timings: total time = 123532.93 ms / 169 tokens
Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
llama-cli.exe --model models/new3/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 42 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template llama3
Log start
main: build = 3488 (75af08c4)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1722289776
llama_model_loader: loaded meta data with 33 key-value pairs and 724 tensors from models/new3/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 70B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 70B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 80
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 8192
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 13: llama.attention.head_count u32 = 64
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/Meta-Llama-3.1-70B-Instru...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.68 MiB
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloaded 42/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 19985.43 MiB
llm_load_tensors: CUDA0 buffer size = 20557.70 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 8224
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1220.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1349.25 MiB
llama_new_context_with_model: KV self size = 2570.00 MiB, K (f16): 1285.00 MiB, V (f16): 1285.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1140.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 32.07 MiB
llama_new_context_with_model: graph nodes = 2566
llama_new_context_with_model: graph splits = 498
main: chat template example: <|start_header_id|>system<|end_header_id|>
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to the AI, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
> tell me a story /
Once upon a time, in a small village nestled in the rolling hills of Tuscany, there was a tiny shop called "Mirabel's Marvels." The shop was run by a kind and gentle woman named Mirabel, who was known throughout the village for her extraordinary talent: she could hear the whispers of inanimate objects.
llama_print_timings: load time = 15492.02 ms
llama_print_timings: sample time = 13.44 ms / 72 runs ( 0.19 ms per token, 5358.34 tokens per second)
llama_print_timings: prompt eval time = 9574.75 ms / 14 tokens ( 683.91 ms per token, 1.46 tokens per second)
llama_print_timings: eval time = 30604.81 ms / 72 runs ( 425.07 ms per token, 2.35 tokens per second)
Any idea what is happening?
Name and Version
llama-cli --version version: 3488 (75af08c4) built with MSVC 19.29.30154.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
No response
I don't know what the "issue" is ... but 1 or 2 elements:
- for Mistral, use "--chat-template llama2" ... not chatml (https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template#supported-templates) if you want "good" results.
- --threads 30: you have a 16-core/32-thread CPU; with llamafile activated, do not use more than 16 threads.
Next ... (not 100% sure):
- -ngl 39:
llm_load_tensors: offloaded 39/89 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 26799.61 MiB
llm_load_tensors: CUDA0 buffer size = 21018.94 MiB
It may not do what you think. With partial offload, it does not mean that part of the layers execute on the GPU and the others on the CPU; it means that all possible layers execute on the GPU, but only the weights are partially offloaded, and the others are copied from RAM when needed.
That said, a possibility (less than 10% confidence ;) ):
- Meta-Llama-3.1-70B-Instruct-Q4_K_M => can be and is 100% computed on the GPU (Q quant...) => not really any CPU power needed
- Mistral-Large-Instruct-2407.IQ3_XS.gguf => not all ops can be computed on the GPU (IQ?), so some are computed on the CPU
It's not killing your CPU; Ryzens are designed to run hot. Basically they boost until they hit one of the limits, like power or temperature. 90 C is the throttling point for the 3D cache variants; if you don't have top cooling you will reach it. My 5800X3D almost hits 80 C, and it is in a custom watercooling loop with 2x 360 radiators.
If you use --threads 1 it will only use a single thread (that is the point of the parameter), and single-thread loads run hotter on these CPUs; use something like --threads 16 to spread the load across 16 threads.
Too few or too many threads are both slower, you have to play around.
As you can read in my first post, I tested with 1 thread and am still getting close to 90 C .. it's like 88 C on 1 thread ... How is it possible that ONE thread out of 32 could heat up the CPU to 90 C?
The CPU is hottest during text generation.
It's insane; no CPU torture-test application heats my CPU up to 90 C...
OK
I have a very good CPU cooling system. I made more tests with -ngl 0 (no GPU).
Literally after 10 seconds of generating output:
-ngl 0 -threads 30 Mistral-Large-Instruct-2407.IQ3_XS.gguf - CPU 90 C
llama_print_timings: sample time = 2.93 ms / 95 runs ( 0.03 ms per token, 32456.44 tokens per second)
llama_print_timings: prompt eval time = 5388.26 ms / 41 tokens ( 131.42 ms per token, 7.61 tokens per second)
llama_print_timings: eval time = 76124.89 ms / 94 runs ( 809.84 ms per token, 1.23 tokens per second)
llama_print_timings: total time = 81538.07 ms / 135 tokens
-ngl 0 -threads 1 Mistral-Large-Instruct-2407.IQ3_XS.gguf - CPU 83 C
llama_print_timings: load time = 45392.05 ms
llama_print_timings: sample time = 0.17 ms / 6 runs ( 0.03 ms per token, 34682.08 tokens per second)
llama_print_timings: prompt eval time = 18023.36 ms / 41 tokens ( 439.59 ms per token, 2.27 tokens per second)
llama_print_timings: eval time = 52078.78 ms / 5 runs (10415.76 ms per token, 0.10 tokens per second)
llama_print_timings: total time = 75544.36 ms / 46 tokens
-ngl 0 -threads 30 Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf - CPU 63 C
llama_print_timings: load time = 17336.42 ms
llama_print_timings: sample time = 6.83 ms / 83 runs ( 0.08 ms per token, 12146.93 tokens per second)
llama_print_timings: prompt eval time = 6319.47 ms / 22 tokens ( 287.25 ms per token, 3.48 tokens per second)
llama_print_timings: eval time = 70359.14 ms / 83 runs ( 847.70 ms per token, 1.18 tokens per second)
llama_print_timings: total time = 76719.51 ms / 105 tokens
-ngl 0 -threads 1 Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf - CPU 61 C
llama_print_timings: load time = 22971.33 ms
llama_print_timings: sample time = 1.02 ms / 14 runs ( 0.07 ms per token, 13752.46 tokens per second)
llama_print_timings: prompt eval time = 37650.82 ms / 22 tokens ( 1711.40 ms per token, 0.58 tokens per second)
llama_print_timings: eval time = 28594.53 ms / 13 runs ( 2199.58 ms per token, 0.45 tokens per second)
llama_print_timings: total time = 67562.07 ms / 35 tokens
With CPU-only inference it is still heating up like crazy with Mistral-Large-Instruct-2407.IQ3_XS.gguf, but not with Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf.
For comparison, CPU stress test - all cores:
AVX - 79 C
AVX2 - 79 C
AVX512 - 75 C
More tests: CPU inference ONLY.
30 threads, Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf:
- noavx - 82 C - 0.35 t/s
- avx - 63 C - 1.50 t/s
- avx2 - 58 C - 1.55 t/s
30 threads, Mistral-Large-Instruct-2407.IQ3_XS.gguf:
- noavx - 82 C - 0.23 t/s
- avx - 90 C - 1.13 t/s
- avx2 - 90 C - 1.22 t/s
Is it possible that llama.cpp is using the AVX and AVX2 extensions wrongly with Mistral-Large-Instruct-2407.IQ3_XS.gguf? Probably with IQxx ... need to make more tests ....
As noavx seems to give the same CPU temp.
More tests v2:
I made tests with CPU-only inference for Llama 3.1 8B (to save data usage ;) )
The quants affected by the insane CPU overheating seem to be:
- IQ1xx - 90 C
- IQ2xx - 90 C
- IQ3xx - 90 C
- IQ4xx - 59 C - seems to work fine, like a normal Q4_K_M, no CPU overheating
It's like 60 W vs 170 W of power used.
Final test v3
30 threads, cuda 12, Mistral-Large-Instruct-2407.IQ4_XS.gguf
llama-cli.exe --model models/new4/Mistral-Large-Instruct-2407.IQ4_XS.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 30 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template chatml
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
ggml_cuda_host_malloc: failed to allocate 41305.73 MiB of pinned memory: out of memory
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/89 layers to GPU
llm_load_tensors: CPU buffer size = 41305.73 MiB
llm_load_tensors: CUDA0 buffer size = 21096.56 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 8224
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1863.25 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 963.75 MiB
llama_new_context_with_model: KV self size = 2827.00 MiB, K (f16): 1413.50 MiB, V (f16): 1413.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1666.50 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1747453952
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'models/new4/Mistral-Large-Instruct-2407.IQ4_XS.gguf'
main: error: unable to load model
Cannot load it, in spite of 64 GB RAM and 24 GB VRAM, even with the parameter --no-mmap
;-(
FINAL
So something is wrong with how llama.cpp (GPU and CPU inference with AVX, AVX2) handles IQ1xx, IQ2xx, IQ3xx; IQ4xx is not affected:
IQ1xx - 90c
IQ2xx - 90c
IQ3xx - 90c
IQ4xx - 59c
EDIT
Without --no-mmap I was able to load the model successfully, although it used 60 GB of RAM, not 40 GB like it should with --no-mmap ... Why?
llama-cli.exe --model models/new4/Mistral-Large-Instruct-2407.IQ4_XS.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 8196 --interactive -ngl 34 --simple-io -e --multiline-input --no-display-prompt --conversation --temp 0.6 --chat-template chatml
Log start
main: build = 3490 (6e2b6000)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1722357037
llama_model_loader: loaded meta data with 41 key-value pairs and 795 tensors from models/new4/Mistral-Large-Instruct-2407.IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Mistral Large Instruct 2407
llama_model_loader: - kv 3: general.version str = 2407
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Mistral
llama_model_loader: - kv 6: general.size_label str = Large
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = mrl
llama_model_loader: - kv 9: general.license.link str = https://mistral.ai/licenses/MRL-0.1.md
llama_model_loader: - kv 10: general.languages arr[str,10] = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv 11: llama.block_count u32 = 88
llama_model_loader: - kv 12: llama.context_length u32 = 131072
llama_model_loader: - kv 13: llama.embedding_length u32 = 12288
llama_model_loader: - kv 14: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 15: llama.attention.head_count u32 = 96
llama_model_loader: - kv 16: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: general.file_type u32 = 30
llama_model_loader: - kv 20: llama.vocab_size u32 = 32768
llama_model_loader: - kv 21: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 23: tokenizer.ggml.model str = llama
llama_model_loader: - kv 24: tokenizer.ggml.pre str = default
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 26: tokenizer.ggml.scores arr[f32,32768] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,32768] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = /models_out/Mistral-Large-Instruct-24...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 616
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 148
llama_model_loader: - kv 38: split.no u16 = 0
llama_model_loader: - kv 39: split.count u16 = 0
llama_model_loader: - kv 40: split.tensors.count i32 = 795
llama_model_loader: - type f32: 177 tensors
llama_model_loader: - type q5_K: 88 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_xs: 529 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1732 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_layer = 88
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params = 122.61 B
llm_load_print_meta: model size = 60.94 GiB (4.27 BPW)
llm_load_print_meta: general.name = Mistral Large Instruct 2407
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 34 repeating layers to GPU
llm_load_tensors: offloaded 34/89 layers to GPU
llm_load_tensors: CPU buffer size = 62402.30 MiB
llm_load_tensors: CUDA0 buffer size = 23909.44 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 8224
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 1734.75 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1092.25 MiB
llama_new_context_with_model: KV self size = 2827.00 MiB, K (f16): 1413.50 MiB, V (f16): 1413.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1666.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 40.07 MiB
llama_new_context_with_model: graph nodes = 2822
llama_new_context_with_model: graph splits = 598
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 30 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8224, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to the AI, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
> tell me a story /
Once upon a time, in a small town nestled between the rolling hills and the sparkling sea, there lived a young girl named Lily
>
llama_print_timings: load time = 27200.28 ms
llama_print_timings: sample time = 8.22 ms / 28 runs ( 0.29 ms per token, 3405.50 tokens per second)
llama_print_timings: prompt eval time = 6747.93 ms / 33 tokens ( 204.48 ms per token, 4.89 tokens per second)
llama_print_timings: eval time = 19665.08 ms / 28 runs ( 702.32 ms per token, 1.42 tokens per second)
And as I suspected, IQ4xx is not heating up my CPU - 66 C, not 90 C like IQ1xx, IQ2xx, IQ3xx.
Definitely something is wrong with the implementation in llama.cpp for IQ1xx, IQ2xx, IQ3xx (AVX and AVX2?).
My guess is that inference with more bits per parameter bottlenecks on RAM bandwidth, allowing the CPU to rest a little between computations, while with less bits per parameter, memory may be fast enough to load the parameters in time to keep the CPU busy all the time.
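(Rough numbers to illustrate, taking ~65 GB/s as a ballpark for this platform's effective RAM bandwidth, which is an assumption: streaming the 39.59 GiB Q4_K_M once per token takes at least ~0.65 s, so if decoding those weights costs the cores less time than that, they sit idle part of the time; the 46.70 GiB IQ3_XS takes ~0.77 s to stream, and if its decode costs more than that, the cores never get a chance to rest.)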
As for it being harsher than CPU stress test, if we get to the bottom of it, we may find an interesting idea for a better stress test.
I'm removing issue labels because it is not clear that anything is wrong here.
I will test the old Q2 and Q3 to be sure.
Your idea seems strange ... why, even with 1 thread (I have 32), is IQ3, IQ2 or IQ1 heating up my CPU to 90 C? It should not be possible to use 1 thread and get power usage of about 170 W, when with 1 thread and IQ4 the CPU is at 48 C (4 C more than idle).
The power consumption is high, while all but one core are mostly idle? If so, I would think that there is something wrong with the CPU.
As for temperature, if compute thread runs on one particular core, that core will heat up.
Yes, llama.cpp is heating up my CPU to 90 C even with one thread (not core) when using IQ1, IQ2, IQ3 models .. crazy ....
The CPU has been tested for stability many times: no errors, no overheating, totally normal operation ... only llama.cpp does this, when using AVX or AVX2 instructions and IQ1, IQ2, IQ3 models (IQ4, Q4, Q6, Q8, Q4_K_M, Q5, Q5_K_M behave OK, no overheating, even with 30 threads around 60 C, not 90 C!) ... I still have to test the old Q2 and Q3.
No other program stress-testing the CPU on 1 core or all cores behaves like that. 1 core is almost meaningless for power consumption.
With llama.cpp set to use 1 core, my Ryzen 9 7950X3D switches cores every few seconds within the CCD0 cluster (set up in my BIOS as the primary cluster, with the extra 128 MB 3D cache). That's normal behaviour for that CPU.
Practically, what would be the solution here? Should we add a delay when using an AMD processor to avoid overheating it?
I'm trying to understand why this is happening.
No other application is able to do that. Applications for CPU stress testing (even for AVX, AVX2, AVX512) hardly reach 80 C.
Is it possible that something is messed up in how AVX instructions are handled by llama.cpp with those IQ1, IQ2, IQ3 models? For CPU inference in llama.cpp, the CPU is a bit cooler with AVX512, 2-3 C less (87-88 C), compared to AVX and AVX2.
The CPU is throttling ... I have a big water-cooling setup and it is not able to cool it.
I wonder if someone could test power consumption on an Intel CPU (11th gen or newer), IQ4 vs Q4_K_M, for CPU/GPU AVX/AVX2 inference. I'm curious whether only Ryzen CPUs experience this.
I cannot imagine what could possibly be done wrong by llama.cpp, it is just using AVX intrinsics the way they are supposed to be used. For instance, this is the iq3_xxs AVX2 dot product implementation:
https://github.com/ggerganov/llama.cpp/blob/44d28ddd5caaa5e9de573bdaaa5b5b2448a29ace/ggml/src/ggml-quants.c#L10152-L10196
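For anyone who wants a feel for the structural difference without reading the intrinsics, here is a rough scalar sketch. It is not the real ggml code: the codebook contents, group size and packing below are made up purely for illustration (the real tables live in ggml-quants.c). The point it shows is that a codebook-based quant pays an extra, data-dependent table load per group of weights, while a plain bit-field quant decodes with shifts and masks only:

```c
/* Toy comparison of the two decode styles -- NOT the actual ggml kernels. */
#include <stdint.h>
#include <stdio.h>

/* "k-style": 4-bit weights packed two per byte, decoded with shift/mask only. */
static float dot_k_style(const uint8_t *q, const float *x, int n, float scale) {
    float sum = 0.0f;
    for (int i = 0; i < n / 2; ++i) {
        int lo = (q[i] & 0x0F) - 8;   /* low nibble  -> signed weight */
        int hi = (q[i] >> 4)   - 8;   /* high nibble -> signed weight */
        sum += x[2*i + 0] * (float) lo;
        sum += x[2*i + 1] * (float) hi;
    }
    return scale * sum;
}

/* "i-style": each byte is an index into a codebook of 4-weight groups.
 * Entries here are arbitrary; only the access pattern matters for the example. */
static const int8_t codebook[256][4] = {
    { 1, -1, 2, -2 }, { -3, 0, 3, 1 }, /* remaining entries default to 0 */
};

static float dot_i_style(const uint8_t *idx, const float *x, int n, float scale) {
    float sum = 0.0f;
    for (int i = 0; i < n / 4; ++i) {
        const int8_t *g = codebook[idx[i]];   /* extra dependent load per group */
        for (int j = 0; j < 4; ++j)
            sum += x[4*i + j] * (float) g[j];
    }
    return scale * sum;
}

int main(void) {
    uint8_t q[8] = { 0x12, 0x34, 0x01, 0x00, 0x01, 0x00, 0x01, 0x00 };
    float   x[16];
    for (int i = 0; i < 16; ++i) x[i] = 0.5f;
    printf("k-style dot: %f\n", dot_k_style(q, x, 16, 1.0f));
    printf("i-style dot: %f\n", dot_i_style(q, x, 16, 1.0f));
    return 0;
}
```

Roughly speaking, the real AVX2 i-quant kernels do the same thing with per-group grid lookups feeding the SIMD multiply-adds, while the K-quant kernels can unpack their weights with shifts and masks alone, which is cheaper per byte.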
Later I will test the old Q2 and Q3 ... and let you know ....
Thanks for looking into it .
To exclude the possibility that the CPU is problematic, the issue can be reproduced on other CPUs. For that, SHA256 of the file that you see the problem with would be useful, and a link to download it.
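(On Windows the hash can be computed with, e.g., `certutil -hashfile <model>.gguf SHA256`.)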
Is it possible that something is messed up in how AVX instructions are handled by llama.cpp with those IQ1, IQ2, IQ3 models?
Instructions are executed/handled by the CPU, so when there is a problem with some instructions, it usually lies within the hardware. I think you meant to say that the instructions may be used by llama.cpp in an incorrect way. It is possible, but even then, the abnormal power consumption pattern would still indicate that there is a problem with the hardware and/or OS as well.
Models taken from https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/tree/main
-ngl 0 <-- for a mostly-CPU test. Command:
llama-cli.exe --model Meta-Llama-3.1-8B-Instruct-Q3_K_M.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 100000 --interactive -ngl 0 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template llama3
OK ... I made tests with the old Q2, Q3_K_M, Q3_K_L, Q4_K_M, Q8, and IQ4_XS quants:
Q2 - OK - 57 C
Q3_K_M - OK - 57 C
Q3_K_L - OK - 57 C
Q4_K_M - OK - 57 C
Q8 - OK - 57 C
IQ2_M - HOT! - 88 C
IQ3_XS - HOT! - 88 C
IQ3_M - HOT! - 88 C
IQ4_XS - OK - 57 C
Clearly the only models affected by the abnormal, crazy CPU utilization are IQ2_M, IQ3_XS and IQ3_M.
All models except IQ2_M, IQ3_XS, IQ3_M - 57 C - around 90 W
Literally after a few seconds, ONLY IQ2_M, IQ3_XS, IQ3_M - 90 C - around 150 W
I do not know what more proof I can provide. All models, big quants and small quants, work as expected, BUT those 3 - IQ2_M, IQ3_XS, IQ3_M - llama.cpp handles differently and tries to kill my CPU ;)
Any idea if it is fixable?
I tested and observed that CPU does get hot with these models:
849856e1e7eff8ea7425a2a4cee50f3d547165194f50ea3112c9fc07cb08daad Meta-Llama-3.1-8B-Instruct-IQ2_M.gguf
e15a3e54436cb7de4b201c084ae59755627908341faa15bfc1f2259b8ae63e96 Meta-Llama-3.1-8B-Instruct-IQ3_M.gguf
c57b55244c1bb8e754c7d2d07bfb37cf3c65838d56d602269c33096adb5707b1 Meta-Llama-3.1-8B-Instruct-IQ3_XS.gguf
Generally, the smaller the model, the hotter the CPU gets, but some differences are within the margin of error. This seems to be consistent with my guess about CPU saturation.
The highest temperature was achieved during prompt processing with the IQ2_M model (I used a CPU build for that), and was approximately the same as under a CPU stress test. CPU architecture is znver3; other information:
build=3504 commit="e09a800f" n_threads=8 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
With 1 compute thread (-t 1 -tb 1), CPU was significantly cooler.
Overall, I haven't observed any anomalies.
Yes ... no more anomalies, except that IQ1, IQ2, IQ3 take a lot more energy for no apparent reason ... why are Q2, Q3, Q4 or IQ4 free from it ....
Just guessing, but the i-quants use lookup tables that might be stressing the cache more than the other quants.
IQ4 is not doing it ...
A possibility (pure speculation ... ;) ): IQ2/3 use a lookup table, so the L3 cache is highly stressed; IQ4 doesn't use a lookup table, so the L3 cache is not as stressed.
Is it possible that the high IQ2/3 temperature is not because of the CPU cores but because of the cache accesses?
In my case I can't see much temperature / power difference with my CPUs:
- a Ryzen 7940HS (zen4) => all ~100°C and power ~60W
- a Ryzen 9 5950X 16-Core Processor (znver3) => ~56°C (direct water-cooled...)
This bench was done with llamafile's bench ... (I need more time to re-run my test with llama-bench)
| cpu_info | model_filename | size | test | t/s |
|---|---|---|---|---|
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-IQ3_XXS | 43.78 GiB | pp32 | 0.79 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-IQ3_XXS | 43.78 GiB | pp64 | 0.83 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-IQ3_XXS | 43.78 GiB | pp128 | 0.83 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-IQ3_XXS | 43.78 GiB | tg16 | 0.77 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-Q2_K | 42.09 GiB | pp32 | 5.12 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-Q2_K | 42.09 GiB | pp64 | 5.39 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-Q2_K | 42.09 GiB | pp128 | 5.38 |
| AMD Ryzen 9 7940HS (znver4) | Mistral-Large-Instruct-2407-Q2_K | 42.09 GiB | tg16 | 1.25 |
| AMD Ryzen 9 7940HS (znver4) | Meta-Llama-3-70B-Instruct.Q4_K_M | 39.59 GiB | pp32 | 8.95 |
| AMD Ryzen 9 7940HS (znver4) | Meta-Llama-3-70B-Instruct.Q4_K_M | 39.59 GiB | pp64 | 9.16 |
| AMD Ryzen 9 7940HS (znver4) | Meta-Llama-3-70B-Instruct.Q4_K_M | 39.59 GiB | pp128 | 9.25 |
| AMD Ryzen 9 7940HS (znver4) | Meta-Llama-3-70B-Instruct.Q4_K_M | 39.59 GiB | tg16 | 1.37 |
Yes, as you see, in this case Q2 is 6x the speed of IQ3 with llamafile ... for me, using a lookup table can only be fast on an FPGA ... not on a CPU/GPU.
Note: it is not the first time I see that RAM access needs significant POWER. It may be good to see what happens with this wonderful 3D cache CPU.
Could you test what happens with llama-bench:
llama-bench -ngl "0,8,16" -p "32,64,128,256,512" -n "16" -m "<model1>,<model2>,..."
(Note: llamafile has some optimisations using AVX512/BF16 dot products that llama.cpp doesn't, so we get 2x on zen4 CPUs ... so don't compare your results with these.)
I think, with p ≥ 32, it can still be processed on GPU, even with ngl = 0. I used CPU build for testing pp.
I'll check that .. thanks. Also, the 7950X3D has 2 CCD modules, CCD0 and CCD1. Only the CCD0 module has access to the 128 MB 3D cache. I will run llama.cpp on the CCD1 module only, without the 3D cache, and test whether IQ2, IQ3 still heat up the CPU like crazy ...
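(If it helps: on Windows, one way to pin the process to a single CCD is something like `start "" /wait /affinity FFFF0000 llama-cli.exe ...`, assuming CCD1 maps to logical processors 16-31 on this system; the mask is an assumption and may need adjusting.)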
Also, for me IQ3_XS is a bit faster than Q3_K_S. Only slightly.
Bench with something like this (mmap doesn't change perf; not reported here):
llama-bench --mmap "0,1" -r 3 -ngl 0 -p "8,16,32,64,128,256,512,1024" -n "16" -m "Mistral-7B-Instruct-v0.3-f32.gguf,Mistral-7B-Instruct-v0.3.BF16.gguf,Mistral-7B-Instruct-v0.3.F16.gguf,Mistral-7B-Instruct-v0.3-IQ4_NL.gguf,Mistral-7B-Instruct-v0.3-IQ3_M.gguf,Mistral-7B-Instruct-v0.3-Q8_0.gguf,Mistral-7B-Instruct-v0.3-Q6_K.gguf,Mistral-7B-Instruct-v0.3-Q5_K_M.gguf,Mistral-7B-Instruct-v0.3-Q4_K_M.gguf,Mistral-7B-Instruct-v0.3-Q3_K_L.gguf"
AMD Ryzen 9 5950X 16-Core Processor (znver3)
=> so a zen3
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp8 | 7.55 ± 0.00 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp16 | 14.88 ± 0.03 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp32 | 29.39 ± 0.02 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp64 | 51.21 ± 0.10 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp128 | 61.25 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp256 | 62.79 ± 0.03 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp512 | 58.63 ± 0.12 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | pp1024 | 57.95 ± 0.33 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 16 | tg16 | 1.74 ± 0.00 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp8 | 22.66 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp16 | 26.76 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp32 | 28.03 ± 0.03 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp64 | 28.73 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp128 | 28.94 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp256 | 29.05 ± 0.01 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp512 | 28.95 ± 0.04 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | pp1024 | 28.61 ± 0.02 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 16 | tg16 | 3.58 ± 0.00 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp8 | 15.87 ± 0.04 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp16 | 31.78 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp32 | 45.06 ± 0.08 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp64 | 49.29 ± 0.07 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp128 | 52.56 ± 0.10 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp256 | 53.58 ± 0.01 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp512 | 51.89 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | pp1024 | 51.20 ± 0.09 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 16 | tg16 | 3.49 ± 0.00 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp8 | 45.33 ± 0.09 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp16 | 47.46 ± 0.05 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp32 | 48.18 ± 0.04 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp64 | 48.69 ± 0.02 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp128 | 48.68 ± 0.01 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp256 | 48.36 ± 0.01 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp512 | 47.84 ± 0.16 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | pp1024 | 47.01 ± 0.01 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 16 | tg16 | 12.17 ± 0.00 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp8 | 23.06 ± 0.05 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp16 | 23.36 ± 0.03 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp32 | 23.56 ± 0.02 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp64 | 23.70 ± 0.00 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp128 | 23.70 ± 0.02 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp256 | 23.60 ± 0.00 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp512 | 23.32 ± 0.12 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | pp1024 | 23.22 ± 0.01 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 16 | tg16 | 14.92 ± 0.01 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp8 | 51.28 ± 0.08 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp16 | 71.16 ± 0.01 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp32 | 72.64 ± 0.25 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp64 | 74.48 ± 0.25 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp128 | 71.51 ± 0.09 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp256 | 69.37 ± 0.02 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp512 | 67.68 ± 0.02 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | pp1024 | 66.57 ± 0.24 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 16 | tg16 | 6.59 ± 0.00 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp8 | 46.39 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp16 | 50.50 ± 0.00 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp32 | 51.93 ± 0.09 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp64 | 52.35 ± 0.00 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp128 | 52.38 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp256 | 51.84 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp512 | 51.19 ± 0.25 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | pp1024 | 49.96 ± 0.51 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 16 | tg16 | 8.51 ± 0.00 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp8 | 47.38 ± 0.04 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp16 | 50.72 ± 0.13 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp32 | 52.44 ± 0.09 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp64 | 53.12 ± 0.01 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp128 | 53.11 ± 0.01 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp256 | 52.58 ± 0.01 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp512 | 51.84 ± 0.01 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | pp1024 | 50.96 ± 0.11 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 16 | tg16 | 9.81 ± 0.00 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp8 | 62.26 ± 0.23 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp16 | 68.69 ± 0.06 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp32 | 71.30 ± 0.00 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp64 | 71.86 ± 0.14 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp128 | 71.75 ± 0.11 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp256 | 70.63 ± 0.02 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp512 | 69.20 ± 0.01 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | pp1024 | 67.68 ± 0.20 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 16 | tg16 | 11.45 ± 0.00 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp8 | 52.08 ± 0.11 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp16 | 55.18 ± 0.09 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp32 | 56.22 ± 0.06 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp64 | 56.76 ± 0.01 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp128 | 56.86 ± 0.00 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp256 | 56.31 ± 0.02 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp512 | 55.53 ± 0.18 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | pp1024 | 54.21 ± 0.02 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 16 | tg16 | 13.02 ± 0.00 |
build: b72c20b8 (3505)
Most models are from https://huggingface.co/lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF/tree/main
Also for me IQ3_XS is bit faster than Q3_K_S. Only slightly.
On zen3 (I need more time for the zen4 bench): for me on zen3, IQ3_M is 2x slower than Q3_K_M => it looks like the 3D/L3 cache did a good job in your case (need to compare with zen4 to be fair ;) )
I can run llamacpp on the module ctx1 only without 3d cache and test IQ2 , IQ3 is not heat up the CPU like crazy ...
~~It could confirm that the high temperature is due to the 3D cache... but is it your CPU's issue or something that happens to all 3D cache CPUs...~~
Need to wait for Mirek190's test ;)
no no no I did not test yet ;) I wanted to say "I will run..."
AMD Ryzen 9 7940HS (znver4)
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 8 | pp64 | 34.00 ± 0.13 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 8 | pp128 | 35.72 ± 0.04 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 8 | pp256 | 36.20 ± 0.05 |
| llama 7B all F32 | 27.00 GiB | 7.25 B | CPU | 8 | tg16 | 2.00 ± 0.00 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 8 | pp64 | 44.04 ± 0.90 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 8 | pp128 | 44.09 ± 0.05 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 8 | pp256 | 44.05 ± 0.04 |
| llama 7B BF16 | 13.50 GiB | 7.25 B | CPU | 8 | tg16 | 4.08 ± 0.00 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 8 | pp64 | 30.06 ± 0.23 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 8 | pp128 | 31.44 ± 0.02 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 8 | pp256 | 30.06 ± 0.03 |
| llama 7B F16 | 13.50 GiB | 7.25 B | CPU | 8 | tg16 | 3.99 ± 0.01 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 8 | pp64 | 28.86 ± 0.08 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 8 | pp128 | 28.58 ± 0.01 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 8 | pp256 | 28.34 ± 0.01 |
| llama 7B IQ4_NL - 4.5 bpw | 3.85 GiB | 7.25 B | CPU | 8 | tg16 | 13.66 ± 0.01 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 8 | pp64 | 14.45 ± 0.00 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 8 | pp128 | 14.43 ± 0.02 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 8 | pp256 | 14.39 ± 0.01 |
| llama 7B IQ3_S mix - 3.66 bpw | 3.06 GiB | 7.25 B | CPU | 8 | tg16 | 12.02 ± 0.01 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 8 | pp64 | 47.05 ± 0.09 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 8 | pp128 | 45.52 ± 0.07 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 8 | pp256 | 44.32 ± 0.17 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 8 | tg16 | 7.50 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 8 | pp64 | 41.80 ± 0.03 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 8 | pp128 | 41.38 ± 0.12 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 8 | pp256 | 40.84 ± 0.01 |
| llama 7B Q6_K | 5.54 GiB | 7.25 B | CPU | 8 | tg16 | 9.70 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 8 | pp64 | 34.47 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 8 | pp128 | 34.28 ± 0.03 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 8 | pp256 | 33.97 ± 0.02 |
| llama 7B Q5_K - Medium | 4.78 GiB | 7.25 B | CPU | 8 | tg16 | 11.18 ± 0.03 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 8 | pp64 | 48.47 ± 0.02 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 8 | pp128 | 48.23 ± 0.06 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 8 | pp256 | 47.57 ± 0.04 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.25 B | CPU | 8 | tg16 | 13.05 ± 0.07 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 8 | pp64 | 36.08 ± 0.01 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 8 | pp128 | 35.76 ± 0.01 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 8 | pp256 | 35.38 ± 0.02 |
| llama 7B Q3_K - Large | 3.56 GiB | 7.25 B | CPU | 8 | tg16 | 14.72 ± 0.06 |
I did some token-generation testing on the 7950 x3d with IQ and regular quants. It appears that the IQ quants are simply way more computationally expensive than the K quants, which is why the CPU gets hot.
The CPU has CCD 0 with the 3D cache on top, and CCD 1 that can clock slightly higher. Each CCD has 8 cores / 16 threads. With IQ quants the cores on CCD 0 always get hot, while CCD 1 just gets warm. The L3 cache on both CCDs stays cool.
Here is a test with Llama-3.1-70B-Instruct Q5_K (50 GB)
| Threads | CCD0 / CCD1 | ms/token |
|---|---|---|
| 6 | 3/3 | 800 |
| 8 | 4/4 | 800 |
These results are consistent with other tests, like Mistral 7B Q4_K_M and several other K quants of other architectures. Running 6 threads, manually split between both CCDs, results in the highest token generation speed. 6 threads are just enough to saturate my RAM, while not causing additional overhead. CCD 0 gets warm, CCD 1 stays relatively cool.
The 800 ms/token for a 50 GB model are consistent with the roughly 65 GB/s bandwidth that benchmarking tools measure for my RAM, confirming that the inference is RAM-bound at 6 threads.
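(As a sanity check: 50 GB ÷ 65 GB/s ≈ 0.77 s, i.e. roughly 770 ms per token if every weight is read from RAM once, close to the measured 800 ms.)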
Things look different for Mistral large IQ3_M (55 GB):
| Threads | CCD0 / CCD1 | ms/token |
|---|---|---|
| 6 | 6/0 | 2350 |
| 6 | 0/6 | 2000 |
| 6 | 3/3 | 1650 |
| 8 | 8/0 | 1850 |
| 8 | 0/8 | 1800 |
| 8 | 4/4 | 1350 |
| 10 | 5/5 | 1150 |
| 12 | 6/6 | 1050 |
| 14 | 7/7 | 950 |
| 16 | 8/8 | 900 |
| 16 | Automatic | 1250 |
| 31 | Automatic | 900 |
Here we can see that 6 threads aren't enough to saturate the RAM as with the K quants. Also, 6 threads on the CCD that clocks higher are faster, so this also doesn't depend on the L3 cache size or speed. The RAM only gets saturated at 14 to 16 cores, meaning the IQ3 quant requires at least 2.5x more calculation than the K quant. Given that the cores stay relatively cool for the K quants as they mostly wait for the RAM, the calculation requirement is probably way higher than just 2.5x. That means IQ quant inference is not RAM-bound but CPU-bound on weaker CPUs, like prompt processing always is. Aside from that, we can see that the automatic thread-to-core mapping isn't optimal here.
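(Making that estimate explicit: if about 6 threads are enough to keep the RAM saturated with K-quant decoding, needing 14 to 16 threads to reach the same saturation point suggests roughly 14/6 ≈ 2.3x to 16/6 ≈ 2.7x as much decode work per byte, which is where the "at least 2.5x" figure comes from.)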
So no, it doesn't seem to be the L3 / X3D cache that gets the CPU hot. CCD 0 always runs a bit hotter than CCD 1 as it's more difficult to get the generated heat out of it, and the IQ quants are less RAM-bound than the K quants. Given that splitting the load between cores has large impact, I guess that the L2 cache is relevant here.