llama.cpp Segmentation Fault 11 on M2 Ultra 192GB when offloading more than 110GB into Metal

This issue is occurring on a Mac Studio with an M2 Ultra processor and 192GB of RAM.
I originally hit the issue when attempting to load a 120b model using the latest version of Oobabooga. I then pulled the latest version of llama.cpp (b2167) to confirm that the issue persisted. The issue is present in both, and both behave identically, so I will be using my output from Oobabooga below. I did confirm that the exact same load settings produce the same failure output in llama.cpp b2167.
When loading a model in llama.cpp, I always set no-mmap and mlock.
A few weeks back, using a sudo command, I increased my Mac's usable VRAM from 147GB to 170GB. As best I can tell, this has not caused a problem until now, but I wanted to point out that it is not a new change.

My Issue:

Up until today, I was using a version of Oobabooga from mid-December, and was successfully running 120b models with full metal offload. Today I updated Oobabooga to the latest version, and with it came a newer version of Llama.cpp.

Up until now, Llama.cpp on the Mac used either 0 or 1 for ngl; 0 off, 1 on. This version now respects the ngl flag completely, and a 120b model now can manually offload 141 layers on the Mac.

On the previous version of Llama.cpp, and all versions up until now, I've been able to load a 120b completely into the metal working space without issue. As of some recent version released since mid-December, I am now unable to increase the Metal memory usage past ~110GB.

Something odd happens when I attempt to offload more layers after hitting 110GB of usage; the console output looks like it splits out a new metal buffer line and then crashes. On a 120b model, this means the cut-off is 127 layers. If I go past that to 128, it crashes. I have attempted going up to 141 and even 256 layers; same result.
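To make the cut-off concrete, here is a rough arithmetic sketch using only figures from the load logs in this report. It estimates the size of one repeating layer from the successful 127-layer load and compares it with the two buffers allocated in the failed 128-layer load. The variable names are just for illustration.

```python
# Figures copied from the load logs in this report (all in MiB).
metal_127_layers = 110116.94           # "Metal buffer size" with 127 layers offloaded
failed_buffers = [110508.00, 476.00]   # the two buffers allocated before the 128-layer crash

# Rough per-layer size, assuming the 127 repeating layers are equally sized.
per_layer = metal_127_layers / 127
total_128 = sum(failed_buffers)

print(f"approx. size per layer: {per_layer:.2f} MiB")
print(f"128-layer total:        {total_128:.2f} MiB")
print(f"127 layers + one more:  {metal_127_layers + per_layer:.2f} MiB")
```

If the equal-size assumption holds, the two buffers in the failed run add up to almost exactly one layer more than the 127-layer load, which suggests the extra layer is what pushes the allocation into a second buffer.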

Example of Successful Load on older Llama.cpp from mid-December, with ngl set to 1

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 121920.51 MiB
llm_load_tensors: mem required  = 121920.51 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  = 8960.00 MiB, K (f16): 4480.00 MiB, V (f16): 4480.00 MiB
llama_build_graph: non-view tensors processed: 2944/2944
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/user/text-generation-webui-main-2/installer_files/env/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 178257.92 MB

Example of Failed Load on new Llama.cpp, offloading 128 out of 141 layers.

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110508.00 MiB, (110508.06 / 170000.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   476.00 MiB, (110984.06 / 170000.00)
/bin/sh: line 1: 77807 Segmentation fault: 11
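For anyone comparing logs, a small sketch like the following can pull the Metal buffer allocations out of a load log and show how far the running total is from the limit. The regular expression is my own guess at the line format, based only on the two allocation lines above, and `summarize_allocations` is a hypothetical helper, not part of llama.cpp.

```python
import re

# Pattern inferred from the two "allocated buffer" lines in the failed log above.
ALLOC_RE = re.compile(
    r"allocated buffer, size =\s*([\d.]+) MiB, \(\s*([\d.]+) /\s*([\d.]+)\)"
)

def summarize_allocations(log_text):
    """Return (buffer sizes, last running total, limit), all in MiB."""
    sizes, total, limit = [], 0.0, 0.0
    for match in ALLOC_RE.finditer(log_text):
        size, total, limit = (float(g) for g in match.groups())
        sizes.append(size)
    return sizes, total, limit

failed_log = """\
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110508.00 MiB, (110508.06 / 170000.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   476.00 MiB, (110984.06 / 170000.00)
"""

sizes, total, limit = summarize_allocations(failed_log)
print(f"{len(sizes)} buffers, {total:.2f} of {limit:.2f} MiB ({total / limit:.0%} of the limit)")
```

On the failed log this reports two buffers and a running total well below the 170000 MiB limit, so the crash does not look like a simple case of exceeding the configured working set.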

Example of Successful Load on new Llama.cpp, offloading 127 out of 141 layers.

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110116.94 MiB, (110117.00 / 170000.00)
llm_load_tensors: offloading 127 repeating layers to GPU
llm_load_tensors: offloaded 127/141 layers to GPU
llm_load_tensors:        CPU buffer size = 11803.09 MiB
llm_load_tensors:      Metal buffer size = 110116.94 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/user/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 178257.92 MB

Feb 17 '24 05:02 SomeOddCodeGuy

Assuming you use the same command as I do, sudo sysctl iogpu.wired_limit_mb=29500 (with your specific number), you have to run it again after every reboot, because the setting does not persist.
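As a sketch of how the numbers in this thread relate to that setting: the snippet below builds the command string for a given limit (it does not run it) and checks two figures from the logs. It assumes the value originally set was 170000, which is what the limit column in the first failed log suggests, since the post itself only says "170GB". It also assumes the sysctl value is counted in MiB, which is an inference from the logs rather than something stated in the thread. `wired_limit_command` is a hypothetical helper.

```python
MIB_TO_MB = 1024 * 1024 / 1_000_000   # 1 MiB = 1.048576 MB

def wired_limit_command(limit_mib):
    """Build (but do not run) the sysctl command quoted above for a given limit."""
    return f"sudo sysctl iogpu.wired_limit_mb={int(limit_mib)}"

raised_limit = 170000         # assumed setting; matches the "/ 170000.00" column in the log
default_limit = 147456        # limit reported in the log taken after a reboot
total_ram_mib = 192 * 1024    # 192GB machine, treated as GiB

print(wired_limit_command(raised_limit))
# The same limit in decimal MB, for comparison with recommendedMaxWorkingSetSize.
print(f"{raised_limit * MIB_TO_MB:.2f} MB")
# Fraction of total RAM that the post-reboot limit represents.
print(default_limit / total_ram_mib)
```

Under those assumptions, the raised limit converts to the 178257.92 MB shown as recommendedMaxWorkingSetSize, and the post-reboot limit of 147456 works out to three quarters of the machine's RAM.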

Feb 17 '24 20:02 dreambottle

heh I might not reboot as often as I should... it's a headless Mac that I use as a server in my house, so it can go more than a week or two without a reboot. I know that I should reboot more, but I honestly haven't experienced performance issues doing that.

Just to confirm, though, I did just now reboot it and you are correct: I'm back to 147GB.

Also to confirm, I retried the scenario stated in the ticket above, and the issue persists even with the original working set size.

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 140
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 120.32 B
llm_load_print_meta: model size       = 119.06 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = D:\text-generation-webui-main\models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.96 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 110508.00 MiB, (110508.38 / 147456.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   476.00 MiB, (110984.38 / 147456.00)
/bin/sh: line 1:   822 Segmentation fault: 11

Feb 18 '24 02:02 SomeOddCodeGuy

Note: This error does not occur in Koboldcpp version 1-58, which was showing as 36 commits behind llama.cpp last I looked. I am able to load the full layers of models up to 155b (when using the command to increase VRAM to 170GB) without issue.

Feb 21 '24 01:02 SomeOddCodeGuy

This issue was closed because it has been inactive for 14 days since being marked as stale.

Apr 08 '24 01:04 github-actions[bot]