llama.cpp
Run several single-thread operators in parallel
In testing with 18 threads, this shows about a 5% performance gain.
In my testing, this gives a noticeable difference when running with a high number of threads:
Before
Running with 18 threads...
18 threads | run 1/4 | current token time 237.93 ms - eval time 32116.6 ms - prompt eval time 1903.46 ms
18 threads | run 2/4 | current token time 223.18 ms - eval time 33081.06 ms - prompt eval time 1785.41 ms
18 threads | run 3/4 | current token time 560.18 ms - eval time 42127.41 ms - prompt eval time 4481.46 ms
18 threads | run 4/4 | current token time 226.64 ms - eval time 32145.81 ms - prompt eval time 1813.12 ms
After
Running with 18 threads...
18 threads | run 1/4 | current token time 221.68 ms - eval time 30907.99 ms - prompt eval time 1773.44 ms
18 threads | run 2/4 | current token time 225.93 ms - eval time 31100.67 ms - prompt eval time 1807.41 ms
18 threads | run 3/4 | current token time 222.29 ms - eval time 30184.69 ms - prompt eval time 1778.33 ms
18 threads | run 4/4 | current token time 233.21 ms - eval time 31018.9 ms - prompt eval time 1865.65 ms
Eval time:
+--------------------------------------------------------------------------+
|+ + + + x x|
| |__________A____M____| |_____M_______A_____________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 3 32116.6 33081.06 32145.81 32447.823 548.59349
+ 4 30184.69 31100.67 31018.9 30803.062 419.74213
Difference at 95.0% confidence
-1644.76 +/- 933.691
-5.06894% +/- 2.87751%
(Student's t, pooled s = 475.491)
In testing with 18 threads, this shows about a 5% performance gain.
18 threads on a machine with how many cores/threads?
A 10-core / 20-thread Win10 box.
Getting a segfault
/p/i/llama.cpp/cmake-build-relwithdebinfo/bin/main -m models/LLaMA/7B/ggml-model-q4_0.bin --prompt "William Safire will walk us through the nuances of bad" --threads 1 --seed 1 --n_predict 16 --tfs 0.97 --mirostat 2 --mirostat_ent 5
main: build = 513 (11702ed)
main: seed = 1
llama.cpp: loading model from models/LLaMA/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 1 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 0, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 0, tfs_z = 0.970000, top_p = 1.000000, typical_p = 1.000000, temp = 1.000000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 16, n_keep = 0
William Safire will walk us through
/p/i/llama.cpp/ggml.c:12128:49: runtime error: member access within null pointer of type 'struct ggml_compute_state'
    #0 0x508ca7 in ggml_graph_compute /p/i/llama.cpp/ggml.c:12128 (main+0x508ca7)
    #1 0x56e9c8 in llama_eval_internal /p/i/llama.cpp/llama.cpp:1272 (main+0x56e9c8)
    #2 0x571ff0 in llama_eval /p/i/llama.cpp/llama.cpp:2726 (main+0x571ff0)
    #3 0x41a78a in main /p/i/llama.cpp/examples/main/main.cpp:360 (main+0x41a78a)
    #4 0x7ff119a4a50f in __libc_start_call_main (libc.so.6+0x2750f)
    #5 0x7ff119a4a5c8 in __libc_start_main@GLIBC_2.2.5 (libc.so.6+0x275c8)
    #6 0x423624 in _start (main+0x423624)
AddressSanitizer:DEADLYSIGNAL
Process finished with exit code 1
To support this properly would require deeper changes, at least:
- Ensuring that the dependencies of each operation are respected, so that no operations are run before their dependencies
- Ensuring that enough work buffer memory is allocated to run multiple operations concurrently
- Ensuring that each operation has a different, non-overlapping work buffer