Eval bug: inference of 32B eats too much memory on ROCM HIP (5x AMD Radeon Instinct Mi50 (gfx906))
Name and Version
./llama-cli --version
ROCm calling rocblas_initialize as a workaround for a rocBLAS bug
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 ROCm devices:
Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 4: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
version: 1 (10f2e81)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
HIP
Hardware
8 * Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
5 * AMD Radeon Instinct Mi50 (gfx906)
Models
Qwen2.5-32B-Instruct-Q4_K_M.gguf
Problem description & steps to reproduce
When I run ./llama-cli with context size -c 32768 and a large prompt, it eats too much VRAM and fails with OOM. The same run on NVIDIA GPUs works fine.
This is my exact command:
./llama-cli -m models/Qwen2.5-32B-Instruct-Q4_K_M.gguf --n-gpu-layers 100 -c 32768 -f full_prompt.txt
What happens: first the model is loaded and occupies about 50% of the VRAM on all 5 GPUs. Then it starts reading the prompt, VRAM gradually grows until it is exhausted, and the run ends with an error:
llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error: out of memory
The same behaviour is observed when using llama-server.
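To observe the growth while the prompt is being processed, per-GPU VRAM usage can be polled in a second terminal; a minimal sketch using rocm-smi (assuming the ROCm tools are on PATH):

```shell
# Poll per-device VRAM usage once per second while llama-cli reads the prompt.
# rocm-smi ships with the ROCm stack; --showmeminfo vram prints used/total VRAM per GPU.
watch -n 1 rocm-smi --showmeminfo vram
```

On the affected setup this shows the used VRAM on all five devices climbing steadily during prompt processing until the OOM abort.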
Build command
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_CURL=ON -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
I also added LLAMA_CUDA_NO_PEER_COPY=1 to prevent gibberish output, as mentioned in https://github.com/ggml-org/llama.cpp/issues/3051, but it has no effect on the VRAM problem.
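For reference, a sketch of a configure line with peer copies disabled at build time, based on the command above. The GGML_CUDA_NO_PEER_COPY option name is an assumption based on the GGML_-prefixed rename of the CUDA/HIP options; older trees used LLAMA_CUDA_NO_PEER_COPY, so verify against your checkout:

```shell
# Same configure line as above, with peer-to-peer copies disabled at build time.
# GGML_CUDA_NO_PEER_COPY is assumed from the GGML_-prefixed option rename;
# older llama.cpp trees spelled it LLAMA_CUDA_NO_PEER_COPY.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
      -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 \
      -DGGML_CUDA_NO_PEER_COPY=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
      -DLLAMA_CURL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```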
I ran the same inference on NVIDIA GPUs and it worked fine: during prompt reading, VRAM only grows by 10-15% compared to the model-loading stage, and then stops growing.
First Bad Commit
No response
Relevant log output
ROCm calling rocblas_initialize as a workaround for a rocBLAS bug
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 ROCm devices:
Device 0: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 4: AMD Radeon VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
build: 1 (10f2e81) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon VII) - 15922 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm2 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm3 (AMD Radeon VII) - 16348 MiB free
llama_model_load_from_file_impl: using device ROCm4 (AMD Radeon VII) - 16348 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 771 tensors from models/Qwen2.5-32B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 32B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 32B
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 64
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: general.file_type u32 = 15
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: quantize.imatrix.file str = /models_out/Qwen2.5-32B-Instruct-GGUF...
llama_model_loader: - kv 35: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 448
llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 18.48 GiB (4.85 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 32.76 B
print_info: general.name = Qwen2.5 32B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: ROCm0 model buffer size = 3726.02 MiB
load_tensors: ROCm1 model buffer size = 3581.64 MiB
load_tensors: ROCm2 model buffer size = 3545.55 MiB
load_tensors: ROCm3 model buffer size = 3545.55 MiB
load_tensors: ROCm4 model buffer size = 4109.59 MiB
load_tensors: CPU_Mapped model buffer size = 417.66 MiB
................................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 32768
llama_init_from_model: n_ctx_per_seq = 32768
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1664.00 MiB
llama_kv_cache_init: ROCm1 KV buffer size = 1664.00 MiB
llama_kv_cache_init: ROCm2 KV buffer size = 1664.00 MiB
llama_kv_cache_init: ROCm3 KV buffer size = 1664.00 MiB
llama_kv_cache_init: ROCm4 KV buffer size = 1536.00 MiB
llama_init_from_model: KV self size = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
llama_init_from_model: ROCm_Host output buffer size = 0.58 MiB
llama_init_from_model: ROCm0 compute buffer size = 2664.00 MiB
llama_init_from_model: ROCm1 compute buffer size = 2664.00 MiB
llama_init_from_model: ROCm2 compute buffer size = 2664.00 MiB
llama_init_from_model: ROCm3 compute buffer size = 2664.00 MiB
llama_init_from_model: ROCm4 compute buffer size = 2664.00 MiB
llama_init_from_model: ROCm_Host compute buffer size = 74.01 MiB
llama_init_from_model: graph nodes = 2246
llama_init_from_model: graph splits = 6
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 4 (n_threads_batch = 4) / 8 | ROCm : NO_VMM = 1 | NO_PEER_COPY = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 1112466201
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 32768
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
/home/rig/llamafix/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error
ROCm error: out of memory
current device: 4, in function alloc at /home/rig/llamafix/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:366
ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Inappropriate ioctl for device.
No stack.
The program is not being run.
./run_llama_cli.sh: line 1: 24869 Aborted (core dumped) ./llama-cli -m models/Qwen2.5-32B-Instruct-Q4_K_M.gguf --n-gpu-layers 100 -c 32768 -f full_prompt.txt
I also tried building llama.cpp with Vulkan; on the same 5x AMD Radeon Instinct Mi50, the same inference command works fine (though 2x slower), and VRAM does not grow during prompt reading.
I cannot reproduce this with a trio of gfx908 devices, nor with gfx1030, on ROCm 6.3.2 and latest master llama.cpp, using ./llama-cli -m models/Qwen2.5-32B-Instruct-Q4_K_M.gguf --n-gpu-layers 100 -c 32768
There is no difference in the amount allocated regardless of whether there is no prompt or I start with a 16K-long prompt via -f. It's possible that the issue is in your ROCm environment, or in a code path not hit by gfx908 or gfx1030. Could you capture an HSA trace with rocprof? That would show the allocations.
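An HSA trace can be captured by wrapping the failing command in rocprof; a sketch, assuming the classic rocprof CLI that ships with these ROCm versions (model path and options taken from the report above):

```shell
# Capture an HSA API trace (which includes memory allocations) of the failing run.
# --hsa-trace enables HSA API tracing; -o sets the output CSV; companion trace
# files are written alongside it.
rocprof --hsa-trace -o rocprof_out/results.csv \
    ./llama-cli -m models/Qwen2.5-32B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 100 -c 32768 -f full_prompt.txt
```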
I use rocm-5.7.3 and the llama.cpp head commit is 10f2e81809bbb69ecfe64fc8b4686285f84b0c07.
I launched rocprof (see attachment).
I will take a look, but 5.7.3 is very old and I am aware of several issues with this version; I would strongly suggest upgrading to at least 6.2.
Hi @IMbackK, I tested with ROCm 6.3 and got the same behaviour: it eats all VRAM and fails with OOM.
rocprof files attached.
This issue was closed because it has been inactive for 14 days since being marked as stale.