Is command-r-plus-104b-q5_k_s really too large for 3x24 GB?
Hello!
I can run command-r-plus-104b-iq4_xs successfully on my Ubuntu machine with 3x RTX 4090, but I cannot get the next larger model, command-r-plus-104b-q5_k_s, to load. There are 3x24 GB of VRAM (minus about 2 GB used by system processes), so roughly 70 GB of VRAM available in total, and the model's size is under 67 GB. I tried to split the model across the three GPUs with
./main -m /model/ggml-c4ai-command-r-plus-104b-q5_k_s.gguf --n-gpu-layers 65 --tensor-split 499,524,523 -n 256 --keep 48 --repeat_penalty 1.0 --temp 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
But whatever split factors I use, one of the devices (0, 1, or 2) always reports OOM. Here is one of my logs:
Log start
main: build = 2647 (8228b66d)
main: built with cc (Ubuntu 11.3.0-12ubuntu1) 11.3.0 for x86_64-linux-gnu
main: seed = 1713048786
llama_model_loader: loaded meta data with 26 key-value pairs and 642 tensors from /model/ggml-c4ai-command-r-plus-104b-q5_k_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = command-r
llama_model_loader: - kv 1: general.name str = 313aab747f8c3aefdd411b1f6a5a555dd421d9e8
llama_model_loader: - kv 2: command-r.block_count u32 = 64
llama_model_loader: - kv 3: command-r.context_length u32 = 131072
llama_model_loader: - kv 4: command-r.embedding_length u32 = 12288
llama_model_loader: - kv 5: command-r.feed_forward_length u32 = 33792
llama_model_loader: - kv 6: command-r.attention.head_count u32 = 96
llama_model_loader: - kv 7: command-r.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: command-r.rope.freq_base f32 = 75000000,000000
llama_model_loader: - kv 9: command-r.attention.layer_norm_epsilon f32 = 0,000010
llama_model_loader: - kv 10: general.file_type u32 = 16
llama_model_loader: - kv 11: command-r.logit_scale f32 = 0,833333
llama_model_loader: - kv 12: command-r.rope.scaling.type str = none
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,256000] = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,253333] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 5
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 255001
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: split.no u16 = 0
llama_model_loader: - kv 24: split.count u16 = 0
llama_model_loader: - kv 25: split.tensors.count i32 = 642
llama_model_loader: - type f32: 193 tensors
llama_model_loader: - type q5_K: 448 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = command-r
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 253333
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 1,0e-05
llm_load_print_meta: f_norm_rms_eps = 0,0e+00
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale = 8,3e-01
llm_load_print_meta: n_ff = 33792
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = none
llm_load_print_meta: freq_base_train = 75000000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Small
llm_load_print_meta: model params = 103,81 B
llm_load_print_meta: model size = 66,86 GiB (5,53 BPW)
llm_load_print_meta: general.name = 313aab747f8c3aefdd411b1f6a5a555dd421d9e8
llm_load_print_meta: BOS token = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token = 0 '<PAD>'
llm_load_print_meta: LF token = 136 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0,98 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 23721,00 MiB on device 1: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/model/ggml-c4ai-command-r-plus-104b-q5_k_s.gguf'
main: error: unable to load model
Is there some kind of overhead that reliably pushes the usage above the available VRAM?
overhead that reliably pushes the usage above the available VRAM? ... model size = 66,86 GiB ... allocating 23721,00 MiB
66.86 GiB (model) + roughly 2.37 GiB of overhead (KV cache and other buffers) = 69.23 GiB, so yes. And it has to fit per card: with your split the weights alone want 23721 MiB on device 1, which leaves too little of that 24 GB for its CUDA context and compute buffers.
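If you want to see how much room each card actually has before the load, a standard nvidia-smi query helps; and since --tensor-split takes relative proportions, weighting it by the reported free memory (less on the card that carries the desktop) is worth a try. A sketch, where 22,24,24 is only an example ratio and should be replaced with whatever your cards actually report:

# show per-GPU total/used/free memory so the split can follow the real free amounts
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv

# same command as before, with the split weighted by free memory (example values)
./main -m /model/ggml-c4ai-command-r-plus-104b-q5_k_s.gguf --n-gpu-layers 65 --tensor-split 22,24,24 -n 256 --keep 48 --repeat_penalty 1.0 --temp 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt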
Oh! Thank you, @Jeximo! So it seems it could still work if I can free part of the ~2 GB that system processes occupy in the VRAM of device 0. Or am I missing anything else that makes it impossible to fit this model fully into 3x24 GB?
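For what it's worth, on a stock Ubuntu desktop most of that system-side VRAM is usually the display server, so my plan would be roughly this (assuming the machine stays reachable over SSH once the desktop is gone):

# plain nvidia-smi prints a per-process VRAM table, showing what occupies device 0
nvidia-smi

# stop the graphical session to free the display server's VRAM;
# "sudo systemctl isolate graphical.target" brings the desktop back afterwards
sudo systemctl isolate multi-user.target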
Or am I missing anything else that makes it impossible to fit this model fully into 3x24 GB?
It may be possible if you can spare a bit of system space, but your main parameters don't include --ctx-size, so by default you'll only get 512 tokens of context.
More context will increase VRAM requirements too, so I guess it's up to you to decide if it's worth it.
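As a rough back-of-the-envelope estimate based on the metadata printed above (n_layer = 64, n_embd_k_gqa = n_embd_v_gqa = 1024) and the default f16 KV cache, the cache grows linearly with context; compute buffers and the per-GPU CUDA contexts come on top of that:

# KV cache ~= n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * 2 bytes (f16)
for n_ctx in 512 4096 8192 32768; do
  echo "$n_ctx tokens -> $(( 64 * n_ctx * (1024 + 1024) * 2 / 1024 / 1024 )) MiB of KV cache"
done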
Good point, thank you, @Jeximo!
Does this resolve your issue, @Marcophono2?
If so, it would be kind of you to close it.
Not really, @arnfaldur, but the smaller model was good enough for me.
Best regards, Marc