llama.cpp
Run several single-thread operators in parallel
In testing with 18 threads, this shows about a 5% performance gain.
In my testing, this gives a noticeable difference when running with a high number of threads:
Before
Running with 18 threads...
18 threads | run 1/4 | current token time 237.93 ms - eval time 32116.6 ms - prompt eval time 1903.46 ms
18 threads | run 2/4 | current token time 223.18 ms - eval time 33081.06 ms - prompt eval time 1785.41 ms
18 threads | run 3/4 | current token time 560.18 ms - eval time 42127.41 ms - prompt eval time 4481.46 ms
18 threads | run 4/4 | current token time 226.64 ms - eval time 32145.81 ms - prompt eval time 1813.12 ms
After
Running with 18 threads...
18 threads | run 1/4 | current token time 221.68 ms - eval time 30907.99 ms - prompt eval time 1773.44 ms
18 threads | run 2/4 | current token time 225.93 ms - eval time 31100.67 ms - prompt eval time 1807.41 ms
18 threads | run 3/4 | current token time 222.29 ms - eval time 30184.69 ms - prompt eval time 1778.33 ms
18 threads | run 4/4 | current token time 233.21 ms - eval time 31018.9 ms - prompt eval time 1865.65 ms
Eval time:
+--------------------------------------------------------------------------+
|+ + + + x x|
| |__________A____M____| |_____M_______A_____________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 3 32116.6 33081.06 32145.81 32447.823 548.59349
+ 4 30184.69 31100.67 31018.9 30803.062 419.74213
Difference at 95.0% confidence
-1644.76 +/- 933.691
-5.06894% +/- 2.87751%
(Student's t, pooled s = 475.491)
In testing with 18 threads, this shows about a 5% performance gain.
18 threads on a machine with how many cores/threads?
A 10-core / 20-thread Win10 box.
Getting a segfault
/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/LLaMA/7B/ggml-model-q4_0.bin --prompt "William Safire will walk us through the nuances of bad" --threads 1 --seed 1 --n_predict 16 --tfs 0.97 --mirostat 2 --mirostat_ent 5
main: build = 513 (11702ed)
main: seed = 1
llama.cpp: loading model from models/LLaMA/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 1 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.970000, top_p = 1.000000, typical_p = 1.000000, temp = 1.000000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 16, n_keep = 0
William Safire will walk us through
/p/i/llama.cpp/ggml.c:12128:49: runtime error: member access within null pointer of type 'struct ggml_compute_state'
    #0 0x508ca7 in ggml_graph_compute /p/i/llama.cpp/ggml.c:12128 (main+0x508ca7)
    #1 0x56e9c8 in llama_eval_internal /p/i/llama.cpp/llama.cpp:1272 (main+0x56e9c8)
    #2 0x571ff0 in llama_eval /p/i/llama.cpp/llama.cpp:2726 (main+0x571ff0)
    #3 0x41a78a in main /p/i/llama.cpp/examples/main/main.cpp:360 (main+0x41a78a)
    #4 0x7ff119a4a50f in __libc_start_call_main (libc.so.6+0x2750f)
    #5 0x7ff119a4a5c8 in __libc_start_main@GLIBC_2.2.5 (libc.so.6+0x275c8)
    #6 0x423624 in _start (main+0x423624)
AddressSanitizer:DEADLYSIGNAL
Process finished with exit code 1
To support this properly would require deeper changes, at least:
- Ensuring that the dependencies of each operation are respected, so that no operations are run before their dependencies
- Ensuring that enough work buffer memory is allocated to run multiple operations concurrently
- Ensuring that each operation has a different, non-overlapping work buffer