
Run several single-threaded operators in parallel

Open howard0su opened this issue 2 years ago • 4 comments

In testing with 18 threads, this shows about a 5% performance gain.

howard0su avatar Apr 08 '23 13:04 howard0su
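
For context, ggml_graph_compute runs graph nodes one after another, so a node whose op only uses a single thread leaves the remaining workers idle. Below is a minimal sketch of the batching idea, with a hypothetical node type standing in for ggml_tensor and raw pthreads for the workers; it illustrates the approach, not the actual patch.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical stand-in for ggml_tensor: just the fields the
     * sketch needs. */
    struct node {
        void (*compute)(struct node *); /* the op's kernel */
        int n_tasks;                    /* 1 => single-threaded op */
    };

    static void *run_node(void *arg) {
        struct node *n = arg;
        n->compute(n);
        return NULL;
    }

    /* Run a batch of consecutive single-threaded nodes concurrently
     * instead of serially. The caller must have checked that no node
     * in the batch depends on another node in the batch. */
    static void run_batch_parallel(struct node **nodes, int count) {
        pthread_t tid[count];
        for (int i = 0; i < count; i++) {
            pthread_create(&tid[i], NULL, run_node, nodes[i]);
        }
        for (int i = 0; i < count; i++) {
            pthread_join(tid[i], NULL);
        }
    }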

In my testing, this gives a noticeable difference when running with a high number of threads:

Before

Running with 18 threads...
         18 threads | run 1/4 | current token time 237.93 ms - eval time 32116.6 ms - prompt eval time 1903.46 ms
         18 threads | run 2/4 | current token time 223.18 ms - eval time 33081.06 ms - prompt eval time 1785.41 ms
         18 threads | run 3/4 | current token time 560.18 ms - eval time 42127.41 ms - prompt eval time 4481.46 ms
         18 threads | run 4/4 | current token time 226.64 ms - eval time 32145.81 ms - prompt eval time 1813.12 ms

After

Running with 18 threads...
         18 threads | run 1/4 | current token time 221.68 ms - eval time 30907.99 ms - prompt eval time 1773.44 ms
         18 threads | run 2/4 | current token time 225.93 ms - eval time 31100.67 ms - prompt eval time 1807.41 ms
         18 threads | run 3/4 | current token time 222.29 ms - eval time 30184.69 ms - prompt eval time 1778.33 ms
         18 threads | run 4/4 | current token time 233.21 ms - eval time 31018.9 ms - prompt eval time 1865.65 ms

howard0su avatar Apr 08 '23 13:04 howard0su

Eval time:

+--------------------------------------------------------------------------+
|+                 +  + +                         x                       x|
|     |__________A____M____|                |_____M_______A_____________|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   3       32116.6      33081.06      32145.81     32447.823     548.59349
+   4      30184.69      31100.67       31018.9     30803.062     419.74213
Difference at 95.0% confidence
        -1644.76 +/- 933.691
        -5.06894% +/- 2.87751%
        (Student's t, pooled s = 475.491)

howard0su avatar Apr 09 '23 13:04 howard0su
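
The table above is ministat output: the 42127.41 ms run in the "before" series was evidently excluded as an outlier, leaving N = 3 against N = 4, and the reported bound is a pooled-variance two-sample Student's t interval. A self-contained sketch that reproduces the numbers, with the critical value t = 2.571 (95% two-sided, 5 degrees of freedom) taken from a t table:

    #include <math.h>
    #include <stdio.h>

    static double mean(const double *v, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += v[i];
        return s / n;
    }

    static double svar(const double *v, int n, double m) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += (v[i] - m) * (v[i] - m);
        return s / (n - 1); /* sample variance */
    }

    int main(void) {
        const double before[] = {32116.60, 33081.06, 32145.81};           /* x, N=3 */
        const double after[]  = {30907.99, 31100.67, 30184.69, 31018.90}; /* +, N=4 */
        const int n1 = 3, n2 = 4;
        const double t95 = 2.571; /* Student's t, 95% two-sided, 5 dof */

        double m1 = mean(before, n1), m2 = mean(after, n2);
        double sp = sqrt(((n1 - 1) * svar(before, n1, m1) +
                          (n2 - 1) * svar(after,  n2, m2)) / (n1 + n2 - 2));
        double hw = t95 * sp * sqrt(1.0 / n1 + 1.0 / n2);

        /* prints -1644.76 +/- 933.69 ms, i.e. -5.07% +/- 2.88% */
        printf("%.2f +/- %.2f ms (%.2f%% +/- %.2f%%)\n",
               m2 - m1, hw, 100.0 * (m2 - m1) / m1, 100.0 * hw / m1);
        return 0;
    }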

In testing with 18 threads, this shows about a 5% performance gain.

18 threads on a machine with how many cores/threads?

jon-chuang avatar Apr 13 '23 08:04 jon-chuang

In testing with 18 threads, this shows about a 5% performance gain.

18 threads on a machine with how many cores/threads?

A 10-core / 20-thread Windows 10 box.

howard0su avatar Apr 13 '23 17:04 howard0su
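
A note on the thread-count choice: on an SMT machine like this, the compute-bound ggml kernels tend to scale with physical cores rather than logical threads, and using 18 of the 20 logical threads leaves headroom for the OS. A small illustrative helper for picking a default on POSIX systems; llama.cpp itself just takes the count from --threads, so this helper and its policy are assumptions for illustration:

    #include <unistd.h>

    /* Default to all online logical CPUs minus a small reserve for
     * the OS and the main thread. Illustrative policy only. */
    static int default_n_threads(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN); /* logical CPUs online */
        return n > 2 ? (int)(n - 2) : 1;
    }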

Getting a segfault:

/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/LLaMA/7B/ggml-model-q4_0.bin --prompt "William Safire will walk us through the nuances of bad" --threads 1 --seed 1 --n_predict 16 --tfs 0.97 --mirostat 2 --mirostat_ent 5
main: build = 513 (11702ed)
main: seed  = 1
llama.cpp: loading model from models/LLaMA/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 1 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.970000, top_p = 1.000000, typical_p = 1.000000, temp = 1.000000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 16, n_keep = 0


 William Safire will walk us through/p/i/llama.cpp/ggml.c:12128:49: runtime error: member access within null pointer of type 'struct ggml_compute_state'
pc_0x508ca7###func_ggml_graph_compute###file_/p/i/llama.cpp/ggml.c###line_12128###obj_(main+0x508ca7)
pc_0x56e9c8###func_llama_eval_internal###file_/p/i/llama.cpp/llama.cpp###line_1272###obj_(main+0x56e9c8)
pc_0x571ff0###func_llama_eval###file_/p/i/llama.cpp/llama.cpp###line_2726###obj_(main+0x571ff0)
pc_0x41a78a###func_main###file_/p/i/llama.cpp/examples/main/main.cpp###line_360###obj_(main+0x41a78a)
pc_0x7ff119a4a50f###func___libc_start_call_main###file_<null>###line_0###obj_(libc.so.6+0x2750f)
pc_0x7ff119a4a5c8###func___libc_start_main@GLIBC_2.2.5###file_<null>###line_0###obj_(libc.so.6+0x275c8)
pc_0x423624###func__start###file_<null>###line_0###obj_(main+0x423624)

AddressSanitizer:DEADLYSIGNAL

Process finished with exit code 1

ivanstepanovftw avatar May 05 '23 18:05 ivanstepanovftw
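
The UBSan trace says ggml_graph_compute performed a member access through a null struct ggml_compute_state pointer when running with --threads 1. A hypothetical reduction of that failure mode (not the actual code at ggml.c:12128): the worker-state array is only allocated when there are extra threads, but is later dereferenced unconditionally:

    #include <stdlib.h>

    struct compute_state {
        int ith; /* worker index */
    };

    static void graph_compute(int n_threads) {
        /* worker states exist only for the n_threads - 1 extra threads */
        struct compute_state *workers = n_threads > 1
            ? calloc((size_t)(n_threads - 1), sizeof(*workers))
            : NULL;

        /* bug: touching workers without re-checking n_threads; with
         * --threads 1 this is a member access through a null pointer,
         * matching the sanitizer report above */
        workers[0].ith = 1;

        free(workers);
    }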

To support this properly would require deeper changes, at least the following (the three points are sketched together below):

  • Ensuring that the dependencies of each operation are respected, so that no operations are run before their dependencies
  • Ensuring that enough work buffer memory is allocated to run multiple operations concurrently
  • Ensuring that each operation has a different, non-overlapping work buffer

slaren avatar May 05 '23 19:05 slaren
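
Tying those three points together, a rough sketch with hypothetical types: a pending-input count per op enforces dependency order, and each concurrently scheduled op gets a disjoint slice of one work buffer sized for the whole batch. This is an assumed design, not how ggml is actually structured:

    #include <stddef.h>
    #include <stdint.h>

    struct op {
        struct op *src[2];    /* input dependencies */
        int        pending;   /* inputs not yet computed; init to #inputs */
        size_t     work_size; /* scratch bytes this op's kernel needs */
        uint8_t   *work;      /* disjoint slice assigned at schedule time */
    };

    /* Point 1: an op may be scheduled only once every input is done. */
    static int is_ready(const struct op *o) {
        return o->pending == 0;
    }

    /* Called when an op finishes, to unblock its consumers. */
    static void mark_done(struct op **consumers, int n) {
        for (int i = 0; i < n; i++) consumers[i]->pending--;
    }

    /* Points 2 and 3: size the work buffer for everything that runs
     * concurrently, and give each op its own non-overlapping slice. */
    static size_t assign_work(struct op **batch, int n, uint8_t *buf) {
        size_t off = 0;
        for (int i = 0; i < n; i++) {
            batch[i]->work = buf + off; /* no two ops share bytes */
            off += batch[i]->work_size;
        }
        return off; /* buf must be at least this many bytes */
    }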