
Intermittent segmentation faults in llama_sample_top_p_top_k()

Open rlanday opened this issue 2 years ago • 4 comments

Expected Behavior

I have been getting intermittent segfaults for no apparent reason. Sometimes they occur right at the beginning of text generation, and sometimes they occur after a lot of text has already been generated. They seem to be deterministic in that I can sometimes work around them by changing the prompt, but if I don’t change the prompt, they consistently occur. I normally use the 65B model, which exhibits the problem, but I am attaching a repro for the 13B model. I am not 100% sure but I believe the issue affects all four model sizes (7B, 13B, 30B, 65B).

Current Behavior

Intermittent segfaults

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

2019 16-inch MacBook Pro, 2.3 GHz 8-Core Intel Core i9, 64 GB of RAM

  • Operating System, e.g. for Linux:

$ uname -a
Darwin Ryans-MBP-2.lan 22.3.0 Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64 x86_64

  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.10.0

$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0

$ g++ --version
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: x86_64-apple-darwin22.3.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Failure Information (for bugs)

See log below

Steps to Reproduce

I can consistently reproduce on my machine by running the following command:

rlanday@Ryans-MBP-2 llama.cpp % ./main --ctx_size 2048 -m ./models/13B/ggml-model-q4_0.bin --top_p 0 --top_k 40 --temp 0.7 --repeat_penalty 1.176470588235294 -t 8 -n -1 --repeat_last_n 16384 -p "Active Internet connections" -s 1680491962

Failure Logs

Environment info:

commit cc9cee8e9e7598bd280295f6264f36d3a9224006

rlanday@Ryans-MBP-2 ~ % sysctl -a | grep machdep.cpu
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.tsc_ccc.numerator: 192
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.arch_perf.version: 4
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.size: 256
machdep.cpu.max_basic: 22
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
machdep.cpu.family: 6
machdep.cpu.model: 158
machdep.cpu.extmodel: 9
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 13
machdep.cpu.feature_bits: 9221960262849657855
machdep.cpu.leaf7_feature_bits: 43804591 1073741824
machdep.cpu.leaf7_feature_bits_edx: 3154120192
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.signature: 591597
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 AVX2 SMEP BMI2 ERMS INVPCID FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT SGXLC MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
machdep.cpu.microcode_version: 244
machdep.cpu.processor_flag: 5
machdep.cpu.core_count: 8
machdep.cpu.thread_count: 16

llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy                        1.22.2
sentencepiece                0.1.97
torch                        1.13.0

llama.cpp$ make --version | head -1
GNU Make 4.3

$ md5sum ./models/13B/ggml-model-q4_0.bin
0abc81985f6c529faaa661dee3674efd  ./models/13B/ggml-model-q4_0.bin

Here is the ASan output:

rlanday@Ryans-MBP-2 llama.cpp % ./main --ctx_size 2048 -m ./models/13B/ggml-model-q4_0.bin --top_p 0 --top_k 40 --temp 0.7 --repeat_penalty 1.176470588235294 -t 8 -n -1 --repeat_last_n 16384 -p "Active Internet connections" -s 1680491962
main(30917,0x7ff844413680) malloc: nano zone abandoned due to inability to preallocate reserved vm space.
main: seed = 1680491962
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from './models/13B/ggml-model-q4_0.bin'
llama_model_load: model size =  7759.39 MB / num tensors = 363
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.700000, top_k = 40, top_p = 0.000000, repeat_last_n = 16384, repeat_penalty = 1.176471
generate: n_ctx = 2048, n_batch = 8, n_predict = -1, n_keep = 0


 Active Internet connections=================================================================
==30917==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000000000 at pc 0x000107c8c2e1 bp 0x7ff7b8a76690 sp 0x7ff7b8a75e58
READ of size 65536 at 0x625000000000 thread T0
    #0 0x107c8c2e0 in __asan_memmove+0xe0 (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x472e0) (BuildId: 756bb7515781379f84412f22c4274ffd2400000010000000000a0a0000030d00)
    #1 0x107499506 in std::__1::pair<int const*, int*> std::__1::__copy_impl[abi:v15006]<int const, int, void>(int const*, int const*, int*) copy.h:56
    #2 0x107498d42 in std::__1::pair<int const*, int*> std::__1::__copy[abi:v15006]<int const*, int const*, int*, 0>(int const*, int const*, int*) copy.h:94
    #3 0x107498958 in int* std::__1::copy[abi:v15006]<int const*, int*>(int const*, int const*, int*) copy.h:103
    #4 0x107498868 in int* std::__1::__uninitialized_allocator_copy[abi:v15006]<std::__1::allocator<int>, int, int, (void*)0>(std::__1::allocator<int>&, int const*, int const*, int*) uninitialized_algorithms.h:575
    #5 0x1074985e4 in std::__1::enable_if<__is_cpp17_forward_iterator<int const*>::value, void>::type std::__1::vector<int, std::__1::allocator<int> >::__construct_at_end<int const*>(int const*, int const*, unsigned long) vector:1031
    #6 0x10760a283 in std::__1::vector<int, std::__1::allocator<int> >::vector<int const*>(int const*, std::__1::enable_if<(__is_cpp17_forward_iterator<int const*>::value) && (is_constructible<int, std::__1::iterator_traits<int const*>::reference>::value), int const*>::type) vector:1158
    #7 0x10751cdd4 in std::__1::vector<int, std::__1::allocator<int> >::vector<int const*>(int const*, std::__1::enable_if<(__is_cpp17_forward_iterator<int const*>::value) && (is_constructible<int, std::__1::iterator_traits<int const*>::reference>::value), int const*>::type) vector:1152
    #8 0x10751cb73 in llama_sample_top_p_top_k llama.cpp:1808
    #9 0x10748cd9d in main main.cpp:292
    #10 0x7ff8007a330f in start+0x97f (dyld:x86_64+0xfffffffffff7230f) (BuildId: bba777096cad3592ab0309d0f7b8610e32000000200000000100000000020d00)

0x625000000000 is located 256 bytes to the left of 8192-byte region [0x625000000100,0x625000002100)
freed by thread T0 here:
    #0 0x107c9d17d in wrap__ZdlPv+0x7d (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x5817d) (BuildId: 756bb7515781379f84412f22c4274ffd2400000010000000000a0a0000030d00)
    #1 0x107494d84 in void std::__1::__libcpp_operator_delete[abi:v15006]<void*>(void*) new:256
    #2 0x107494d68 in void std::__1::__do_deallocate_handle_size[abi:v15006]<>(void*, unsigned long) new:280
    #3 0x107494d40 in std::__1::__libcpp_deallocate[abi:v15006](void*, unsigned long, unsigned long) new:290
    #4 0x107545189 in std::__1::allocator<float>::deallocate[abi:v15006](float*, unsigned long) allocator.h:128
    #5 0x107544e34 in std::__1::allocator_traits<std::__1::allocator<float> >::deallocate[abi:v15006](std::__1::allocator<float>&, float*, unsigned long) allocator_traits.h:282
    #6 0x1075e9162 in std::__1::__split_buffer<float, std::__1::allocator<float>&>::~__split_buffer() __split_buffer:355
    #7 0x1075e4dd4 in std::__1::__split_buffer<float, std::__1::allocator<float>&>::~__split_buffer() __split_buffer:352
    #8 0x10760952f in std::__1::vector<float, std::__1::allocator<float> >::__append(unsigned long) vector:1051
    #9 0x107514c6f in std::__1::vector<float, std::__1::allocator<float> >::resize(unsigned long) vector:1918
    #10 0x10751ba07 in llama_eval_internal(llama_context&, int const*, int, int, int) llama.cpp:982
    #11 0x107519a90 in llama_eval llama.cpp:1727
    #12 0x10748c7ec in main main.cpp:267
    #13 0x7ff8007a330f in start+0x97f (dyld:x86_64+0xfffffffffff7230f) (BuildId: bba777096cad3592ab0309d0f7b8610e32000000200000000100000000020d00)

previously allocated by thread T0 here:
    #0 0x107c9cd5d in wrap__Znwm+0x7d (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x57d5d) (BuildId: 756bb7515781379f84412f22c4274ffd2400000010000000000a0a0000030d00)
    #1 0x1074974e4 in void* std::__1::__libcpp_operator_new[abi:v15006]<unsigned long>(unsigned long) new:246
    #2 0x1074974c8 in std::__1::__libcpp_allocate[abi:v15006](unsigned long, unsigned long) new:272
    #3 0x1075e591c in std::__1::allocator<float>::allocate[abi:v15006](unsigned long) allocator.h:112
    #4 0x1075e56ba in std::__1::__allocation_result<std::__1::allocator_traits<std::__1::allocator<float> >::pointer> std::__1::__allocate_at_least[abi:v15006]<std::__1::allocator<float> >(std::__1::allocator<float>&, unsigned long) allocate_at_least.h:54
    #5 0x1075e52c3 in std::__1::__split_buffer<float, std::__1::allocator<float>&>::__split_buffer(unsigned long, unsigned long, std::__1::allocator<float>&) __split_buffer:316
    #6 0x1075e46dc in std::__1::__split_buffer<float, std::__1::allocator<float>&>::__split_buffer(unsigned long, unsigned long, std::__1::allocator<float>&) __split_buffer:312
    #7 0x107514adb in std::__1::vector<float, std::__1::allocator<float> >::reserve(unsigned long) vector:1500
    #8 0x10750c92e in llama_init_from_file llama.cpp:1652
    #9 0x107489ae9 in main main.cpp:102
    #10 0x7ff8007a330f in start+0x97f (dyld:x86_64+0xfffffffffff7230f) (BuildId: bba777096cad3592ab0309d0f7b8610e32000000200000000100000000020d00)

SUMMARY: AddressSanitizer: heap-buffer-overflow (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x472e0) (BuildId: 756bb7515781379f84412f22c4274ffd2400000010000000000a0a0000030d00) in __asan_memmove+0xe0
Shadow bytes around the buggy address:
  0x1c49ffffffb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c49ffffffc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c49ffffffd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c49ffffffe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c49fffffff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1c4a00000000:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c4a00000010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c4a00000020: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x1c4a00000030: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x1c4a00000040: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x1c4a00000050: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==30917==ABORTING
zsh: abort      ./main --ctx_size 2048 -m ./models/13B/ggml-model-q4_0.bin --top_p 0 --top_k 

rlanday avatar Apr 07 '23 12:04 rlanday

If I had to guess, the problem is that I’m passing an enormous value (16384) for repeat_last_n (since I was running into problems with repetitive output and wanted to basically max out this value) and that’s getting used for a pointer subtraction here:

    id = llama_sample_top_p_top_k(ctx,
            last_n_tokens.data() + n_ctx - params.repeat_last_n,
            params.repeat_last_n, top_k, top_p, temp, repeat_penalty);

This looks super sketchy. I personally wouldn’t pass this as a raw pointer, and there definitely needs to be parameter sanitization/bounds-checking.
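
For illustration, here is a standalone sketch (my own reduction, not code from the repo; it only borrows the names n_ctx, repeat_last_n, and last_n_tokens from main.cpp) showing how the window arithmetic walks off the front of the buffer when repeat_last_n exceeds n_ctx:

    // Standalone sketch (not llama.cpp code): reproduce the window arithmetic
    // from main.cpp with the values used in the failing command.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n_ctx         = 2048;   // --ctx_size
        const int repeat_last_n = 16384;  // --repeat_last_n
        std::vector<int> last_n_tokens(n_ctx, 0);

        // main.cpp computes last_n_tokens.data() + n_ctx - repeat_last_n as the
        // start of the penalty window; with these values the offset is negative,
        // so the resulting pointer (and the 16384-element copy that follows
        // inside llama_sample_top_p_top_k) lands far outside the buffer.
        const long long start_offset = (long long) n_ctx - repeat_last_n;
        std::printf("window start offset: %lld elements into a %zu-element buffer\n",
                    start_offset, last_n_tokens.size());
        return 0;
    }

The 65536-byte read in the ASan report is exactly 16384 four-byte tokens, which matches this oversized window.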

rlanday avatar Apr 07 '23 12:04 rlanday

repeat_last_n should be bounds-checked to be <= ctx_size when parsing the command-line arguments; that should be enough. And raw pointers == speed: it's always best to use raw pointers and fix the code instead of slowing it down with additional sanitization checks.

Also, you want to tweak the repeat_penalty argument to affect the probability of repetition.
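
A minimal sketch of that clamp (the struct and function names here are invented for illustration, not taken from the actual argument parser):

    // Hypothetical clamp applied once when parsing arguments (names invented
    // for this sketch, not the project's real parsing code).
    #include <cstdio>

    struct sampling_params {
        int n_ctx         = 2048;
        int repeat_last_n = 64;
    };

    void clamp_repeat_last_n(sampling_params & p) {
        if (p.repeat_last_n > p.n_ctx) {
            std::fprintf(stderr,
                         "warning: --repeat_last_n (%d) > --ctx_size (%d), clamping\n",
                         p.repeat_last_n, p.n_ctx);
            p.repeat_last_n = p.n_ctx;
        }
    }

    int main() {
        sampling_params p;
        p.repeat_last_n = 16384;  // the value from the failing command line
        clamp_repeat_last_n(p);
        std::printf("repeat_last_n = %d\n", p.repeat_last_n);  // prints 2048
        return 0;
    }

Whether the clamp warns, errors out, or silently caps the value is a design choice; the important part is that it runs once, before generation starts.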

anzz1 avatar Apr 08 '23 23:04 anzz1

Thanks for the additional information. This code appears to only be called once per generated token, so I doubt there'd be any noticeable performance impact from adding a bounds check here (certainly much less than the time wasted by having the code crash repeatedly and the job needing to be restarted manually). I see there's already a TODO where last_n_tokens is defined to replace it with a ring buffer, so maybe this fix can be included in that change.
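
For reference, a ring buffer along those lines could look roughly like this (a sketch of the general idea, not the design that TODO has in mind); the point is that a window request larger than what is stored gets clamped instead of turning into out-of-bounds pointer arithmetic:

    // Sketch of a fixed-capacity ring buffer for the last n_ctx tokens
    // (an illustration of the idea, not code from the repo).
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct token_ring {
        std::vector<int> buf;
        std::size_t head  = 0;  // index of the oldest stored token
        std::size_t count = 0;  // number of valid tokens

        explicit token_ring(std::size_t capacity) : buf(capacity, 0) {}

        void push(int tok) {
            buf[(head + count) % buf.size()] = tok;
            if (count < buf.size()) {
                ++count;
            } else {
                head = (head + 1) % buf.size();  // overwrite the oldest token
            }
        }

        // Return at most the `last_n` most recent tokens; the window is clamped
        // to what is actually stored, so an oversized repeat_last_n is harmless.
        std::vector<int> last(std::size_t last_n) const {
            const std::size_t n = std::min(last_n, count);
            std::vector<int> out(n);
            for (std::size_t i = 0; i < n; ++i) {
                out[i] = buf[(head + count - n + i) % buf.size()];
            }
            return out;
        }
    };

Usage would then replace the raw-pointer window with something like ring.last(params.repeat_last_n) before calling the sampler.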

rlanday avatar Apr 10 '23 09:04 rlanday

This code appears to only be called once per generated token, so I doubt there'd be any noticeable performance impact from adding a bounds check here (certainly much less than the time wasted by having the code crash repeatedly and the job needing to be restarted manually).

Sure, but there's no need for that: repeat_last_n shouldn't ever be > ctx_size, so the argument should be clamped once at startup.

anzz1 avatar Apr 11 '23 12:04 anzz1

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 11 '24 01:04 github-actions[bot]