Eval bug: Several models producing gibberish

Open iamangus opened this issue 11 months ago • 13 comments

Name and Version

[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
version: 4753 (51f311e0)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux

Operating systems

Mac, Linux

GGML backends

HIP

Hardware

CPU = Intel Celeron G3930, GPU = 2x AMD Instinct MI50

Models

https://huggingface.co/microsoft/phi-4-gguf/blob/main/phi-4-q4.gguf
https://huggingface.co/YorkieOH10/Meta-Llama-3.1-8B-Instruct-Q8_0-GGUF/resolve/main/meta-llama-3.1-8b-instruct-q8_0.gguf?download=true

Problem description & steps to reproduce

Getting random character strings when offloading to GPU.

~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999

Installed ROCm on AlmaLinux 8.10 following the steps here: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-methods/package-manager/package-manager-rhel.html

Built llama.cpp following the HIP build steps here: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip
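For reference, the HIP build from the linked docs looks roughly like the sketch below; the gfx906 target is an assumption based on the Radeon VII / MI50 devices reported above, and exact flags may differ between llama.cpp versions.

# Hedged sketch of the HIP build (gfx906 assumed for Radeon VII / MI50):
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j 4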

It seems to work fine when not offloading to the GPU and just running on the CPU. Slowly, of course, but it works.

First Bad Commit

No response

Relevant log output

[root@localhost ~]# ~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
build: 4753 (51f311e0) with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon VII) - 16348 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 243 tensors from /root/phi-4-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 4
llama_model_loader: - kv   3:                            general.version str              = 4
llama_model_loader: - kv   4:                       general.organization str              = Microsoft
llama_model_loader: - kv   5:                           general.basename str              = phi
llama_model_loader: - kv   6:                         general.size_label str              = 15B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,7]       = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  11:                        phi3.context_length u32              = 16384
llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 16384
llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv  15:                           phi3.block_count u32              = 40
llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 250000.000000
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 0
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 100257
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  101 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.43 GiB (4.94 BPW) 
load: special tokens cache size = 96
load: token to piece cache size = 0.6151 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 40
print_info: n_head_kv        = 10
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1280
print_info: n_embd_v_gqa     = 1280
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 17920
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 250000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.66 B
print_info: general.name     = Phi 4
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|endoftext|>'
print_info: EOS token        = 100257 '<|endoftext|>'
print_info: EOT token        = 100257 '<|endoftext|>'
print_info: PAD token        = 100257 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: EOG token        = 100257 '<|endoftext|>'
print_info: EOG token        = 100265 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   275.62 MiB
load_tensors:        ROCm0 model buffer size =  4163.91 MiB
load_tensors:        ROCm1 model buffer size =  4190.80 MiB
.......................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 250000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   420.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =   380.00 MiB
llama_init_from_model: KV self size  =  800.00 MiB, K (f16):  400.00 MiB, V (f16):  400.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.38 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model:      ROCm0 compute buffer size =   437.01 MiB
llama_init_from_model:      ROCm1 compute buffer size =   437.02 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    42.02 MiB
llama_init_from_model: graph nodes  = 1606
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Hello<|im_end|><|im_start|>assistant<|im_sep|>Hi there<|im_end|><|im_start|>user<|im_sep|>How are you?<|im_end|><|im_start|>assistant<|im_sep|>

system_info: n_threads = 2 (n_threads_batch = 2) / 2 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 1123923216
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

systemHello!


> 
&3%9C(6B>;#$C/F;;8/49=0%41%588-6.D5>BB8;)/H@=!$9+,GC51(>40=&89&$'G>2GFF0C*69F-8/<$A88>;@+CB6-#C1B!*=<"5-:4.<'*&E7/A>(G%!-:G*72D+/G+B*://;;3"A'9,*E<FDHG4-524"E:$5F1:7;AA4(45/%%%2;81;8./#5'C'2E$@>@8(%;2<<F
> hello!
%':8B2+F@H!/;,7*;F$"'@!&/&<E6;06:@H(8);-;50>337";*
> 
llama_perf_sampler_print:    sampling time =      37.41 ms /    60 runs   (    0.62 ms per token,  1603.93 tokens per second)
llama_perf_context_print:        load time =   13567.06 ms
llama_perf_context_print: prompt eval time =    4097.00 ms /    17 tokens (  241.00 ms per token,     4.15 tokens per second)
llama_perf_context_print:        eval time =    6391.57 ms /   251 runs   (   25.46 ms per token,    39.27 tokens per second)
llama_perf_context_print:       total time =  104277.98 ms /   268 tokens
Interrupted by user
[root@localhost ~]# ^C

iamangus avatar Feb 21 '25 20:02 iamangus

Here is an attempt using a model out of the llama.cpp walkthrough: https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
register_backend: registered backend ROCm (2 devices)
register_device: registered device ROCm0 (AMD Radeon VII)
register_device: registered device ROCm1 (AMD Radeon VII)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Celeron(R) CPU G3930 @ 2.90GHz)
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /root/llama.cpp/build/bin/libggml-cpu.so
build: 4753 (51f311e0) with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-23) for x86_64-redhat-linux (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon VII) - 16348 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from /root/gemma-1.1-7b-it.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-1.1-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.96 GiB (4.99 BPW) 
load: control-looking token:    107 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 1.6014 MB
print_info: arch             = gemma
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 3072
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 256
print_info: n_swa            = 0
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 4096
print_info: n_embd_v_gqa     = 4096
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 24576
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 8.54 B
print_info: general.name     = gemma-1.1-7b-it
print_info: vocab type       = SPM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 107 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 227 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 107 '<end_of_turn>'
print_info: max token length = 93
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   615.23 MiB
load_tensors:        ROCm0 model buffer size =  2379.45 MiB
load_tensors:        ROCm1 model buffer size =  2697.64 MiB
.............................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =   832.00 MiB
llama_init_from_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.98 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
llama_init_from_model:      ROCm0 compute buffer size =   212.01 MiB
llama_init_from_model:      ROCm1 compute buffer size =   562.02 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    38.02 MiB
llama_init_from_model: graph nodes  = 931
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model


system_info: n_threads = 2 (n_threads_batch = 2) / 2 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 2656632722
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> 
<unused9><unused8><unused27><unused11><unused31>[@BOS@]<unused10><unused9><unused24><2mass><mask><unused32><unused27><unused19><unused0><unused4><unused9><unused15><unused31><unused22><unused27><unused21><unused7><unused28><unused21><unused1><2mass><unused14><unused25><unused6><unused1><unused20><unused8><unused17><mask><2mass>

> Hello!
<unused32><unused24><unused29><unused26><unused2><unused4><unused14><mask><unused4><unused2><unused0>

> 
llama_perf_sampler_print:    sampling time =      23.50 ms /    41 runs   (    0.57 ms per token,  1745.05 tokens per second)
llama_perf_context_print:        load time =   18996.88 ms
llama_perf_context_print: prompt eval time =     181.81 ms /    29 tokens (    6.27 ms per token,   159.50 tokens per second)
llama_perf_context_print:        eval time =    4246.69 ms /    56 runs   (   75.83 ms per token,    13.19 tokens per second)
llama_perf_context_print:       total time =  152630.96 ms /    85 tokens
Interrupted by user
[root@localhost ~]#

iamangus avatar Feb 21 '25 21:02 iamangus

Do these models produce correct results when using a CPU-only build? Note: if llama.cpp was compiled with GPU support, the GPUs can still be used even at 0 GPU layers.
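For reference, a plain CPU-only build is just the default CMake configuration with no HIP flags; a minimal sketch, assuming a clean checkout and a separate build directory:

# Minimal sketch of a CPU-only build:
cmake -S . -B build-cpu -DCMAKE_BUILD_TYPE=Release
cmake --build build-cpu --config Release -- -j 2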

JohannesGaessler avatar Feb 21 '25 21:02 JohannesGaessler

It is working when not using -ngl 999. I do see VRAM utilization when running rocm-smi, but the output is CPU-slow. I will clone again, do a CPU-only build, and test.

iamangus avatar Feb 21 '25 21:02 iamangus

Unfortunately the CPU build is failing.

[  4%] Built target ggml-base
[  8%] Built target ggml-cpu
[  9%] Built target ggml
[ 19%] Built target llama
[ 19%] Built target build_info
[ 20%] Building CXX object common/CMakeFiles/common.dir/arg.cpp.o
[ 20%] Building CXX object common/CMakeFiles/common.dir/chat.cpp.o
[ 21%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 21%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 22%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 22%] Building CXX object common/CMakeFiles/common.dir/llguidance.cpp.o
[ 23%] Building CXX object common/CMakeFiles/common.dir/log.cpp.o
[ 23%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o
[ 25%] Linking CXX static library libcommon.a
[ 25%] Built target common
[ 26%] Building CXX object tests/CMakeFiles/test-tokenizer-0.dir/test-tokenizer-0.cpp.o
[ 26%] Linking CXX executable ../bin/test-tokenizer-0
../bin/libggml.so: undefined reference to `vtable for std::filesystem::__cxx11::filesystem_error'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::directory_iterator::operator*() const'
../bin/libggml.so: undefined reference to `typeinfo for std::filesystem::__cxx11::filesystem_error'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::path::_M_find_extension() const'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::filesystem_error::~filesystem_error()'
../bin/libggml.so: undefined reference to `std::filesystem::status(std::filesystem::__cxx11::path const&)'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::path::_M_split_cmpts()'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::directory_iterator::operator++()'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::directory_iterator::directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)'
../bin/libggml.so: undefined reference to `std::filesystem::__cxx11::filesystem_error::_M_gen_what()'
collect2: error: ld returned 1 exit status
gmake[2]: *** [tests/CMakeFiles/test-tokenizer-0.dir/build.make:102: bin/test-tokenizer-0] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1858: tests/CMakeFiles/test-tokenizer-0.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2

I attempted to use the ROCm images referenced here: https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md

However, it doesn't look like the ROCm images actually exist: https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/versions

iamangus avatar Feb 21 '25 21:02 iamangus

Sorry, I didn't read your original post correctly. If your ROCm build has a binary called test-backend-ops, does it report a test failure if you run it?
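For reference, test-backend-ops normally ends up next to the other binaries when the tests are built; a rough sketch of building and running it follows (the LLAMA_BUILD_TESTS flag name is an assumption based on the project's CMake options):

# Re-configure with tests enabled and rebuild (sketch):
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release -- -j 2
# Then run the backend op tests; failures point at broken GPU kernels:
~/llama.cpp/build/bin/test-backend-ops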

JohannesGaessler avatar Feb 21 '25 22:02 JohannesGaessler

Unfortunately I do not have that binary and am not able to find it.

iamangus avatar Feb 22 '25 15:02 iamangus

@IMbackK do you have access to a Radeon VII to run test-backend-ops? I am unable to reproduce the issue on my RX 6800, so I suspect that the configuration for that specific GPU is wrong, and it would be helpful to know which kernel to look at.

JohannesGaessler avatar Feb 22 '25 15:02 JohannesGaessler

It may be worth trying to build with GGML_CUDA_NO_PEER_COPY, in case the issue is with copying data between the GPUs. If it works with a single GPU, that's likely to be the case.
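A rough sketch of both checks; the build flags mirror the HIP configuration above, and HIP_VISIBLE_DEVICES is one way to restrict the run to a single GPU:

# Rebuild with peer-to-peer copies disabled (sketch):
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DGGML_CUDA_NO_PEER_COPY=ON
cmake --build build --config Release -- -j 2
# Single-GPU check by exposing only device 0 to the HIP runtime:
HIP_VISIBLE_DEVICES=0 ~/llama.cpp/build/bin/llama-cli -m ~/phi-4-q4.gguf -p "Hello!" -ngl 999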

slaren avatar Feb 22 '25 17:02 slaren

Building with -DGGML_CUDA_NO_PEER_COPY=ON does seem to fix the issue. However, it looks like the entire model is loaded into the VRAM of both GPUs. Is there a way around this?

iamangus avatar Feb 22 '25 17:02 iamangus

Upon further inspection, it looks like it is actually splitting the model between the GPUs. Are you able to provide any details on the impact of building with that flag?

iamangus avatar Feb 22 '25 17:02 iamangus

It causes data transfers between GPUs to be done through system memory, which may be slower. From what I understand, this problem is usually caused by using a kernel without the AMD patches for ROCm.

slaren avatar Feb 22 '25 18:02 slaren

Shouldn't that be taken care of by installing amdgpu-dkms? This might be specific to the install method I went with on AlmaLinux. I will try the same steps on Ubuntu and see if that is any better.

iamangus avatar Feb 23 '25 19:02 iamangus

Okay, I have done some more testing. I tried AlmaLinux again, but with ROCm 6.2.2, and it was the same deal. I just finished trying Ubuntu 22.04 with ROCm 6.2.2 and I am seeing the same issue.

Is it possible that this is a hardware limitation of my motherboard, or that a BIOS setting needs to be changed? I am running these out of a no-name mining chassis. This is the chassis

iamangus avatar Feb 24 '25 16:02 iamangus

I don't know; that's only what I remember from other people discussing this here, as I don't have any experience working with ROCm. It may also be because this GPU is not well supported in ROCm.

slaren avatar Feb 25 '25 23:02 slaren

Unfortunately, I've seen a few other people using multiples of these GPUs without issues; the only difference is they have them in old Supermicro servers. Either way, I will stick with the workaround you provided. Thank you for your help!

iamangus avatar Feb 26 '25 00:02 iamangus

@iamangus Did you ever figure this out? I'm having the same issue on a similar chassis. I'm using the Octominer...

segmond avatar Apr 20 '25 04:04 segmond

It seems PCIe P2P doesn't work on those bridge chips. AMD explicitly doesn't support ROCm P2P over bridge chips because those are generally buggy and untested (both hardware- and kernel-driver-wise). P2P is also fundamentally impossible in cases where some GPUs are behind a bridge and others are not, and ROCm doesn't handle this case correctly by declining to take the P2P path. It's thus not surprising at all that those chassis are broken, either due to P2P being broken in hardware or the kernel driver, or due to ROCm going down the P2P path when the PCIe topology means it should not. Regardless, this is nothing that llama.cpp has anything to do with, and your best recourse is not to set -DGGML_CUDA_NO_PEER_COPY but to disable P2P entirely via kconfig.

I would also like to note that NVIDIA avoids this mess of broken, untested PCIe bridge implementations by, AFAIK, not supporting P2P at all on consumer GPUs. Effectively it's always GGML_CUDA_NO_PEER_COPY, just handled by the driver, same as amdgpu/amdhsa with P2P disabled in kconfig.

IMbackK avatar Apr 20 '25 09:04 IMbackK

@IMbackK Are you referring to the PCIe bridge chips on these specific motherboards being the issue? I had assumed it had something to do with the motherboard, so I have a different one coming that I am hoping will do the trick.

iamangus avatar Apr 21 '25 20:04 iamangus

I got this resolved. I just reinstalled the AMD driver and installed ROCm, following the instructions (use the package-manager instructions, not the AMD install script). I set iommu=pt, and I did use -DGGML_CUDA_NO_PEER_COPY to get multi-GPU support; without setting it, the most GPUs I could use was 2, but once I set it to ON and rebuilt, I could infer across 6 GPUs. @IMbackK How do you disable P2P via kconfig?
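For anyone following along, a sketch of how iommu=pt is typically set on the kernel command line (exact files and commands vary by distro):

# Append iommu=pt to the kernel command line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... iommu=pt"
# Then regenerate the GRUB config and reboot (Ubuntu shown; RHEL-family distros use grub2-mkconfig):
sudo update-grub && sudo reboot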

segmond avatar Apr 22 '25 04:04 segmond

You need to be running amdgpu from the mainline kernel, NOT amdgpu-dkms from the ROCm install. You then need to disable CONFIG_DMABUF_MOVE_NOTIFY and CONFIG_HSA_AMD_P2P.
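A sketch of what that looks like, assuming a self-built mainline kernel (the config path and the scripts/config helper are standard kernel-tree tooling):

# Check what the running kernel was built with:
grep -E 'CONFIG_DMABUF_MOVE_NOTIFY|CONFIG_HSA_AMD_P2P' /boot/config-"$(uname -r)"
# In the kernel source tree, disable both options before rebuilding:
scripts/config --disable DMABUF_MOVE_NOTIFY --disable HSA_AMD_P2P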

IMbackK avatar Apr 22 '25 11:04 IMbackK

@IMbackK Are you referring to the PCIe bridge chips on these specific motherboards being the issue? I had assumed it had something to do with the motherboard, so I have a different one coming that I am hoping will do the trick.

All mainboards of this type are going to be broken. Either they use bridge chips for some of the slots, which makes P2P impossible, or they use PCIe lanes from the chipset, which on modern-ish platforms is just a PCIe bridge integrated onto the chipset, again making PCIe P2P impossible. PCIe P2P is only possible if all the GPUs are connected to the CPU directly, or, theoretically (but not supported by ROCm), when all GPUs are behind the same bridge.

IMbackK avatar Apr 22 '25 11:04 IMbackK