
Numa

Open zrm opened this issue 2 years ago • 22 comments

2S Xeon E5-2690v1

Before:

./main -n 1024 -m models/7B/ggml-model-q4_0.bin --ignore-eos -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 796.65 ms / 1024 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 44892.59 ms / 785 tokens ( 57.19 ms per token)
llama_print_timings: eval time = 178337.97 ms / 1020 runs ( 174.84 ms per token)

# echo 3 > /proc/sys/vm/drop_caches

After:

./main -n 1024 -m models/7B/ggml-model-q4_0.bin --ignore-eos -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 795.63 ms / 1024 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 46996.62 ms / 785 tokens ( 59.87 ms per token)
llama_print_timings: eval time = 101232.21 ms / 1020 runs ( 99.25 ms per token)

zrm avatar May 21 '23 21:05 zrm

Not sure if this one is alright (9d058c2096b9f1f300e1ee16f5740a6a0a342917). On machines with high thread synchronization overhead, distributing GGML_TASK_FINALIZE to the thread pool adds about 10% to eval time, so it should only be done for ops whose finalize pass actually does something. But am I correct that currently none of them need it?
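For reference, here is a minimal, self-contained sketch of that idea (hypothetical op names and a stubbed dispatcher, not the actual ggml scheduler): only fan the finalize task out to the worker pool when the op's finalize pass does real work, and skip the second synchronization round otherwise.

#include <stdbool.h>
#include <stdio.h>

enum task_type { TASK_COMPUTE, TASK_FINALIZE };
enum op_type   { OP_MUL_MAT, OP_ADD, OP_COUNT };

/* hypothetical table: which ops actually do work in their finalize pass */
static const bool op_has_finalize[OP_COUNT] = { false, false };

/* stand-in for waking the worker pool and waiting on its barrier */
static void dispatch_to_pool(enum op_type op, enum task_type task, int n_threads) {
    printf("op %d: task %d on %d threads\n", (int) op, (int) task, n_threads);
}

static void run_node(enum op_type op, int n_threads) {
    dispatch_to_pool(op, TASK_COMPUTE, n_threads);
    if (op_has_finalize[op]) {
        /* only pay the second synchronization round when finalize does work */
        dispatch_to_pool(op, TASK_FINALIZE, n_threads);
    }
}

int main(void) {
    run_node(OP_MUL_MAT, 32);
    run_node(OP_ADD, 32);
    return 0;
}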

zrm avatar May 21 '23 22:05 zrm

The approach in your pull request should be the right solution, and I'm excited about the improvement. I'm assembling a NUMA machine with 2 CPUs and 8 channels of memory (E5-2698Bv3 x2 / DDR3 1866 32G x8), and I'll test it later.

rankaiyx avatar May 22 '23 01:05 rankaiyx

Here are some results on a dual-socket Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz. Each socket has 18 cores.

master

# 8 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 8 -s 6

llama_print_timings:        load time =  2649.90 ms
llama_print_timings:      sample time =    14.44 ms /    26 runs   (    0.56 ms per token)
llama_print_timings: prompt eval time =   488.38 ms /     8 tokens (   61.05 ms per token)
llama_print_timings:        eval time =  3450.39 ms /    25 runs   (  138.02 ms per token)
llama_print_timings:       total time =  6120.78 ms

# 12 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 12 -s 6

llama_print_timings:        load time =  2558.62 ms
llama_print_timings:      sample time =    13.90 ms /    26 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   335.17 ms /     8 tokens (   41.90 ms per token)
llama_print_timings:        eval time =  2190.39 ms /    25 runs   (   87.62 ms per token)
llama_print_timings:       total time =  4768.89 ms

# 16 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 16 -s 6

llama_print_timings:        load time =  2422.32 ms
llama_print_timings:      sample time =    13.93 ms /    26 runs   (    0.54 ms per token)
llama_print_timings: prompt eval time =   264.24 ms /     8 tokens (   33.03 ms per token)
llama_print_timings:        eval time =  2008.39 ms /    25 runs   (   80.34 ms per token)
llama_print_timings:       total time =  4450.76 ms

# 18 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 18 -s 6

llama_print_timings:        load time =  2467.76 ms
llama_print_timings:      sample time =    14.18 ms /    26 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   306.80 ms /     8 tokens (   38.35 ms per token)
llama_print_timings:        eval time =  1668.66 ms /    25 runs   (   66.75 ms per token)
llama_print_timings:       total time =  4156.82 ms

# 36 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 36 -s 6


llama_print_timings:        load time =  2440.28 ms
llama_print_timings:      sample time =    13.64 ms /    26 runs   (    0.52 ms per token)
llama_print_timings: prompt eval time =   283.70 ms /     8 tokens (   35.46 ms per token)
llama_print_timings:        eval time =  2977.92 ms /    25 runs   (  119.12 ms per token)
llama_print_timings:       total time =  5438.13 ms

zrm:numa

# 8 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 8 -s 6

llama_print_timings:        load time =  2784.40 ms
llama_print_timings:      sample time =    12.51 ms /    26 runs   (    0.48 ms per token)
llama_print_timings: prompt eval time =   626.98 ms /     8 tokens (   78.37 ms per token)
llama_print_timings:        eval time =  3395.67 ms /    25 runs   (  135.83 ms per token)
llama_print_timings:       total time =  6198.56 ms

# 12 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 12 -s 6

llama_print_timings:        load time =  2611.76 ms
llama_print_timings:      sample time =    12.17 ms /    26 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   407.14 ms /     8 tokens (   50.89 ms per token)
llama_print_timings:        eval time =  2819.72 ms /    25 runs   (  112.79 ms per token)
llama_print_timings:       total time =  5449.82 ms

# 16 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 16 -s 6

llama_print_timings:        load time =  2542.99 ms
llama_print_timings:      sample time =    11.76 ms /    26 runs   (    0.45 ms per token)
llama_print_timings: prompt eval time =   335.91 ms /     8 tokens (   41.99 ms per token)
llama_print_timings:        eval time =  2551.17 ms /    25 runs   (  102.05 ms per token)
llama_print_timings:       total time =  5111.75 ms

# 18 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 18 -s 6

llama_print_timings:        load time =  2500.30 ms
llama_print_timings:      sample time =    11.47 ms /    26 runs   (    0.44 ms per token)
llama_print_timings: prompt eval time =   300.49 ms /     8 tokens (   37.56 ms per token)
llama_print_timings:        eval time =  2538.30 ms /    25 runs   (  101.53 ms per token)
llama_print_timings:       total time =  5055.94 ms

# 36 threads
make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --no-mmap -n 32 -t 36 -s 6

llama_print_timings:        load time =  2448.93 ms
llama_print_timings:      sample time =    11.44 ms /    26 runs   (    0.44 ms per token)
llama_print_timings: prompt eval time =   253.50 ms /     8 tokens (   31.69 ms per token)
llama_print_timings:        eval time =  2497.20 ms /    25 runs   (   99.89 ms per token)
llama_print_timings:       total time =  4963.27 ms

I don't observe the speed-up from the zrm:numa branch. I suspect that this has something to do with mmap.

@zrm has done their tests without the --no-mmap option, which means mmap is used by default. On my system, using mmap causes a significant slowdown - here is master using 18 threads:

make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 6

llama_print_timings:        load time =   434.70 ms
llama_print_timings:      sample time =    14.18 ms /    26 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   259.91 ms /     8 tokens (   32.49 ms per token)
llama_print_timings:        eval time =  3285.42 ms /    25 runs   (  131.42 ms per token)
llama_print_timings:       total time =  3740.55 ms

I recommend re-doing the analysis with --no-mmap always enabled and seeing whether there is a benefit from NUMA in that case.

ggerganov avatar May 24 '23 06:05 ggerganov

I recommend re-doing the analysis with --no-mmap always enabled and seeing whether there is a benefit from NUMA in that case.

master:

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32 --no-mmap
...
llama_print_timings: sample time = 399.22 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 15385.30 ms / 265 tokens ( 58.06 ms per token)
llama_print_timings: eval time = 86911.83 ms / 510 runs ( 170.42 ms per token)

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32
...
llama_print_timings: sample time = 397.81 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 15456.71 ms / 265 tokens ( 58.33 ms per token)
llama_print_timings: eval time = 97122.96 ms / 510 runs ( 190.44 ms per token)

The reason for this is that when master uses mmap, it passes MAP_POPULATE to prefault the pages, and the kernel then loads the model into memory on an arbitrary NUMA node, commonly putting the whole thing on one node. This is basically the worst-case scenario on NUMA, so any change makes it better, including --no-mmap.

The numa branch disables MAP_POPULATE and MADV_WILLNEED on NUMA systems in favor of MADV_RANDOM, allowing pages to be faulted in based on the access pattern, and pins threads to nodes so that the same thread number always runs on the same node. That puts each page on the node that actually accesses it. But if you run it after you've already run master, the pages are already in the page cache and it barely does anything:

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32 --no-mmap
...
llama_print_timings: sample time = 396.27 ms / 512 runs ( 0.77 ms per token)
llama_print_timings: prompt eval time = 16307.36 ms / 265 tokens ( 61.54 ms per token)
llama_print_timings: eval time = 84633.24 ms / 510 runs ( 165.95 ms per token)

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32
...
llama_print_timings: sample time = 396.57 ms / 512 runs ( 0.77 ms per token)
llama_print_timings: prompt eval time = 18226.49 ms / 265 tokens ( 68.78 ms per token)
llama_print_timings: eval time = 84807.91 ms / 510 runs ( 166.29 ms per token)

So you have to drop the page cache once first:

# echo 3 > /proc/sys/vm/drop_caches

There is also a kernel parameter called numa_balancing which tries to migrate pages to the node it thinks will use them, but for pages that are already on the appropriate node it only makes it worse and should be disabled:

# echo 0 > /proc/sys/kernel/numa_balancing

The numa branch warns if this is enabled (which it is by default on many distributions).
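For illustration, here is a small self-contained sketch of that check (not the branch's exact code): read /proc/sys/kernel/numa_balancing and warn if it is non-zero.

#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
    if (!f) {
        return 0; /* file not present: nothing to check */
    }
    int val = 0;
    if (fscanf(f, "%d", &val) == 1 && val != 0) {
        fprintf(stderr,
                "warning: /proc/sys/kernel/numa_balancing is enabled, this can hurt "
                "performance; consider 'echo 0 > /proc/sys/kernel/numa_balancing'\n");
    }
    fclose(f);
    return 0;
}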

Then you get this:

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32 --no-mmap
...
llama_print_timings: sample time = 396.31 ms / 512 runs ( 0.77 ms per token)
llama_print_timings: prompt eval time = 15202.07 ms / 265 tokens ( 57.37 ms per token)
llama_print_timings: eval time = 81976.09 ms / 510 runs ( 160.74 ms per token)

./main -n 512 -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -t 32
...
llama_print_timings: sample time = 397.33 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 15527.65 ms / 265 tokens ( 58.59 ms per token)
llama_print_timings: eval time = 43868.49 ms / 510 runs ( 86.02 ms per token)

(note: don't actually run --no-mmap first, that loads the model into the page cache again)
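For illustration, here is a minimal sketch of the mapping strategy described above (assuming a model path passed on the command line, not the actual llama.cpp loader): map the file without MAP_POPULATE and advise MADV_RANDOM, so pages are faulted in only when a node-pinned thread first touches them and land on that thread's node.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "models/7B/ggml-model-q4_0.bin";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    /* master's path would add MAP_POPULATE here, prefaulting the whole model
     * (typically onto one node); the NUMA-friendly variant omits it */
    void *addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* let the access pattern, not readahead, decide where pages land:
     * the first touch by a node-pinned thread places each page locally */
    if (madvise(addr, (size_t) st.st_size, MADV_RANDOM) != 0) { perror("madvise"); }

    /* ... hand the mapping to the compute threads here ... */

    munmap(addr, (size_t) st.st_size);
    close(fd);
    return 0;
}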

zrm avatar May 24 '23 20:05 zrm

Let's see if I can better explain how this is working. ggml_compute_forward_mul_mat_f16_f32():

https://github.com/ggerganov/llama.cpp/blob/ac7876ac20124a15a44fd6317721ff1aa2538806/ggml.c#L9802-L9807

Thread ith always gets the same rows of the model, and they span several memory pages. If we use mmap() but don't prefault, then on the first eval each thread reads from that memory and hits page faults, because the pages haven't been loaded from the filesystem yet. The kernel then loads each page into the memory of the node of the thread that hit the page fault. Since we pinned the threads to nodes, thread ith always runs on the same node, and when it accesses the same page of the model for the next token, that page is in the memory of the local node.

This is much faster than all the pages being in the memory of one node and it's even faster than node interleave.
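As a rough, self-contained illustration of this access pattern (simplified, not the actual ggml kernel): each thread owns a fixed contiguous slice of rows, so the first touch of the pages backing that slice happens on whichever node the pinned thread runs on. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N_ROWS    4096
#define ROW_BYTES 4096   /* pretend each row spans one page */
#define N_THREADS 4

typedef struct { int ith; int nth; unsigned char *data; } worker_arg;

static void *worker(void *p) {
    worker_arg *a = (worker_arg *) p;

    /* same fixed split every eval: thread ith always owns rows [ir0, ir1) */
    int dr  = (N_ROWS + a->nth - 1) / a->nth;
    int ir0 = dr * a->ith;
    int ir1 = ir0 + dr < N_ROWS ? ir0 + dr : N_ROWS;

    unsigned long sum = 0;
    for (int r = ir0; r < ir1; ++r) {
        /* with an un-prefaulted mmap, the first touch of these pages happens
         * here, on the node this (pinned) thread is running on */
        sum += a->data[(size_t) r * ROW_BYTES];
    }
    printf("thread %d: rows [%d, %d), checksum %lu\n", a->ith, ir0, ir1, sum);
    return NULL;
}

int main(void) {
    unsigned char *data = calloc((size_t) N_ROWS * ROW_BYTES, 1);
    if (!data) return 1;

    pthread_t tid[N_THREADS];
    worker_arg args[N_THREADS];
    for (int i = 0; i < N_THREADS; ++i) {
        args[i].ith = i; args[i].nth = N_THREADS; args[i].data = data;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < N_THREADS; ++i) pthread_join(tid[i], NULL);

    free(data);
    return 0;
}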

If the model was already in the system page cache when the program starts, there are no page faults. The kernel already has the model cached in RAM. If it was loaded by a previous run that accessed the pages like this, that's great, because we don't have to load it from the filesystem again and each thread is still going to access the same part of the model. But if the model was initially read into the page cache in some other way, e.g. using --no-mmap or mmap() with MAP_POPULATE, then the pages aren't where we want them and to get them there we have to drop the page cache (or reboot the machine, or call madvise() with MADV_DONTNEED) so the next run will reload the model from the filesystem.

zrm avatar May 24 '23 23:05 zrm

@zrm

Thanks for the detailed explanation. It makes a lot of sense and it is clear now.

The strange thing is that I cannot make my system go faster with 36 threads. I tried following your instructions, but the speed is still worse compared to master with 18 threads:

Here is a full log of a few runs on zrm/numa and then back on master:

gg:/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg:/llama.cpp$ sudo sh -c "echo 0 > /proc/sys/kernel/numa_balancing"
gg:/llama.cpp$ git checkout zrm/numa 
Note: switching to 'zrm/numa'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 8502d51 fix args
gg:/llama.cpp$ make clean && make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'build-info.h'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o -o vdot 

====  Run ./main -h for help.  ====

main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 51042.35 ms
llama_print_timings:      sample time =    15.32 ms /    32 runs   (    0.48 ms per token)
llama_print_timings: prompt eval time = 50070.67 ms /     8 tokens ( 6258.83 ms per token)
llama_print_timings:        eval time =  3589.51 ms /    31 runs   (  115.79 ms per token)
llama_print_timings:       total time = 54655.16 ms
gg:/llama.cpp$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   366.90 ms
llama_print_timings:      sample time =    15.52 ms /    32 runs   (    0.49 ms per token)
llama_print_timings: prompt eval time =   352.65 ms /     8 tokens (   44.08 ms per token)
llama_print_timings:        eval time =  4386.11 ms /    31 runs   (  141.49 ms per token)
llama_print_timings:       total time =  4776.22 ms
gg:/llama.cpp$ sudo sh -c "echo 0 > /proc/sys/kernel/numa_balancing"
gg:/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg:/llama.cpp$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 51688.09 ms
llama_print_timings:      sample time =    15.15 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time = 50843.78 ms /     8 tokens ( 6355.47 ms per token)
llama_print_timings:        eval time =  2967.22 ms /    31 runs   (   95.72 ms per token)
llama_print_timings:       total time = 54678.23 ms
gg:/llama.cpp$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   371.08 ms
llama_print_timings:      sample time =    14.89 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   357.25 ms /     8 tokens (   44.66 ms per token)
llama_print_timings:        eval time =  4162.02 ms /    31 runs   (  134.26 ms per token)
llama_print_timings:       total time =  4555.69 ms
gg:/llama.cpp$ git checkout master
Previous HEAD position was 8502d51 fix args
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
gg:/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o -o vdot 

====  Run ./main -h for help.  ====

main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   770.77 ms
llama_print_timings:      sample time =    16.58 ms /    32 runs   (    0.52 ms per token)
llama_print_timings: prompt eval time =   335.86 ms /     8 tokens (   41.98 ms per token)
llama_print_timings:        eval time =  2636.63 ms /    31 runs   (   85.05 ms per token)
llama_print_timings:       total time =  3431.53 ms
gg:/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg:/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 10943.07 ms
llama_print_timings:      sample time =    17.51 ms /    32 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   250.01 ms /     8 tokens (   31.25 ms per token)
llama_print_timings:        eval time =  2212.98 ms /    31 runs   (   71.39 ms per token)
llama_print_timings:       total time = 13181.17 ms
gg:/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   427.84 ms
llama_print_timings:      sample time =    17.69 ms /    32 runs   (    0.55 ms per token)
llama_print_timings: prompt eval time =   243.10 ms /     8 tokens (   30.39 ms per token)
llama_print_timings:        eval time =  2141.87 ms /    31 runs   (   69.09 ms per token)
llama_print_timings:       total time =  2594.99 ms

Is it possible that I have some BIOS setting wrong?

ggerganov avatar May 25 '23 18:05 ggerganov

@ggerganov Have you tried "numactl --interleave=all"? Will it improve things?

rankaiyx avatar May 26 '23 01:05 rankaiyx

E5-2698Bv3 x 2 (16 cores / 32 threads each), DDR3 1866 32G x 8, 100 GB/s

master:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"

$numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 128792 MB
node 0 free: 126491 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 129010 MB
node 1 free: 126895 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 17.66 ms / 32 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 387.29 ms / 8 tokens ( 48.41 ms per token)
llama_print_timings: eval time = 3655.15 ms / 31 runs ( 117.91 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 122891 MB
node 1 size: 129010 MB
node 1 free: 126834 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 14.38 ms / 26 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 287.04 ms / 8 tokens ( 35.88 ms per token)
llama_print_timings: eval time = 3420.18 ms / 25 runs ( 136.81 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 122893 MB
node 1 size: 129010 MB
node 1 free: 126815 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 32 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 9.78 ms / 17 runs ( 0.58 ms per token)
llama_print_timings: prompt eval time = 351.71 ms / 8 tokens ( 43.96 ms per token)
llama_print_timings: eval time = 2204.62 ms / 16 runs ( 137.79 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 126579 MB
node 1 size: 129010 MB
node 1 free: 123042 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 60 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 17.75 ms / 32 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 383.18 ms / 8 tokens ( 47.90 ms per token)
llama_print_timings: eval time = 4577.02 ms / 31 runs ( 147.65 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 126554 MB
node 1 size: 129010 MB
node 1 free: 123107 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 16 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 17.82 ms / 32 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 434.39 ms / 8 tokens ( 54.30 ms per token)
llama_print_timings: eval time = 3614.41 ms / 31 runs ( 116.59 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 122971 MB
node 1 size: 129010 MB
node 1 free: 126673 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$numactl --interleave=0-1 ./main -m ../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 32 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 18.05 ms / 32 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 279.41 ms / 8 tokens ( 34.93 ms per token)
llama_print_timings: eval time = 2837.69 ms / 31 runs ( 91.54 ms per token)
again:
llama_print_timings: sample time = 18.02 ms / 32 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 303.77 ms / 8 tokens ( 37.97 ms per token)
llama_print_timings: eval time = 3170.54 ms / 31 runs ( 102.28 ms per token)
again:
llama_print_timings: sample time = 18.32 ms / 32 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 311.67 ms / 8 tokens ( 38.96 ms per token)
llama_print_timings: eval time = 2847.70 ms / 31 runs ( 91.86 ms per token)
$numactl -H
node 0 size: 128792 MB
node 0 free: 124720 MB
node 1 size: 129010 MB
node 1 free: 124749 MB

zrm:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 71.39 ms / 128 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 363.49 ms / 8 tokens ( 45.44 ms per token)
llama_print_timings: eval time = 12101.28 ms / 127 runs ( 95.29 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 125005 MB
node 1 size: 129010 MB
node 1 free: 124674 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 53.07 ms / 94 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 353.08 ms / 8 tokens ( 44.13 ms per token)
llama_print_timings: eval time = 7880.36 ms / 93 runs ( 84.74 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 124720 MB
node 1 size: 129010 MB
node 1 free: 124951 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 60 -p "I believe the meaning of life is"
llama_print_timings: sample time = 49.05 ms / 88 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 368.89 ms / 8 tokens ( 46.11 ms per token)
llama_print_timings: eval time = 8753.94 ms / 87 runs ( 100.62 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 124774 MB
node 1 size: 129010 MB
node 1 free: 124942 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-7B.ggmlv3.q4_0.bin -n 128 -t 32 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 44.54 ms / 79 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 394.15 ms / 8 tokens ( 49.27 ms per token)
llama_print_timings: eval time = 9881.16 ms / 78 runs ( 126.68 ms per token)
numactl -H
node 0 size: 128792 MB
node 0 free: 122947 MB
node 1 size: 129010 MB
node 1 free: 126708 MB

rankaiyx avatar May 26 '23 20:05 rankaiyx

65b

master:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.79 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 3194.02 ms / 8 tokens ( 399.25 ms per token)
llama_print_timings: eval time = 16712.37 ms / 15 runs ( 1114.16 ms per token)
node 0 size: 128792 MB
node 0 free: 126487 MB
node 1 size: 129010 MB
node 1 free: 79816 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.00 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2286.92 ms / 8 tokens ( 285.86 ms per token)
llama_print_timings: eval time = 19443.28 ms / 15 runs ( 1296.22 ms per token)
node 0 size: 128792 MB
node 0 free: 79686 MB
node 1 size: 129010 MB
node 1 free: 126610 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$numactl --interleave=0-1 ./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 18.26 ms / 32 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 2232.64 ms / 8 tokens ( 279.08 ms per token)
llama_print_timings: eval time = 23214.56 ms / 31 runs ( 748.86 ms per token)
node 0 size: 128792 MB
node 0 free: 102975 MB
node 1 size: 129010 MB
node 1 free: 103229 MB

zrm:

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.85 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 3446.24 ms / 8 tokens ( 430.78 ms per token)
llama_print_timings: eval time = 10474.60 ms / 15 runs ( 698.31 ms per token)
node 0 size: 128792 MB
node 0 free: 103510 MB
node 1 size: 129010 MB
node 1 free: 103068 MB

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.87 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 2340.34 ms / 8 tokens ( 292.54 ms per token)
llama_print_timings: eval time = 8888.35 ms / 15 runs ( 592.56 ms per token)

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
$./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 60 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.02 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2353.47 ms / 8 tokens ( 294.18 ms per token)
llama_print_timings: eval time = 8453.18 ms / 15 runs ( 563.55 ms per token)

rankaiyx avatar May 26 '23 21:05 rankaiyx

--no-mmap

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is" --no-mmap
llama_print_timings: sample time = 8.85 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 3461.39 ms / 8 tokens ( 432.67 ms per token)
llama_print_timings: eval time = 11593.47 ms / 15 runs ( 772.90 ms per token)

again:
llama_print_timings: sample time = 8.93 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 3142.02 ms / 8 tokens ( 392.75 ms per token)
llama_print_timings: eval time = 20544.88 ms / 15 runs ( 1369.66 ms per token)

again:
llama_print_timings: sample time = 8.96 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 6109.34 ms / 8 tokens ( 763.67 ms per token)
llama_print_timings: eval time = 37335.85 ms / 15 runs ( 2489.06 ms per token)

again:
llama_print_timings: sample time = 8.83 ms / 16 runs ( 0.55 ms per token)
llama_print_timings: prompt eval time = 7148.55 ms / 8 tokens ( 893.57 ms per token)
llama_print_timings: eval time = 49204.25 ms / 15 runs ( 3280.28 ms per token)

numactl -H
node 0 size: 128792 MB
node 0 free: 126433 MB
node 1 size: 129010 MB
node 1 free: 79861 MB

rankaiyx avatar May 26 '23 21:05 rankaiyx

~~If one CPU core is occupied by another program, it has a great impact on speed.~~ The real reason is that my CPU frequency had been reduced after a period of working at full load.

I'm downloading guanaco-65b_Q4_0 and will test it later. Now I'm going to bed.

rankaiyx avatar May 26 '23 21:05 rankaiyx

zrm 65b q4_0

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.01 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2885.66 ms / 8 tokens ( 360.71 ms per token)
llama_print_timings: eval time = 7546.16 ms / 15 runs ( 503.08 ms per token)

$sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
./main -m ../../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 60 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.07 ms / 16 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 1964.91 ms / 8 tokens ( 245.61 ms per token)
llama_print_timings: eval time = 6788.23 ms / 15 runs ( 452.55 ms per token)

disable numa in BIOS

master q5
./main -m ../ggmlv3/guanaco-65B.ggmlv3.q5_1.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 8.92 ms / 16 runs ( 0.56 ms per token)
llama_print_timings: prompt eval time = 2177.03 ms / 8 tokens ( 272.13 ms per token)
llama_print_timings: eval time = 9704.86 ms / 15 runs ( 646.99 ms per token)

q4
./main -m ../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.20 ms / 16 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 1860.88 ms / 8 tokens ( 232.61 ms per token)
llama_print_timings: eval time = 7922.27 ms / 15 runs ( 528.15 ms per token)

./main -m ../ggmlv3/guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 62 -p "I believe the meaning of life is"
llama_print_timings: sample time = 9.06 ms / 16 runs ( 0.57 ms per token)
llama_print_timings: prompt eval time = 1863.77 ms / 8 tokens ( 232.97 ms per token)
llama_print_timings: eval time = 8602.44 ms / 15 runs ( 573.50 ms per token)

rankaiyx avatar May 27 '23 11:05 rankaiyx

It doesn't seem to work on the Windows platform. Win10 2022H2

zrm:
D:\zrm\llama.cpp-win> .\main.exe -m ....\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 10.23 ms / 16 runs ( 0.64 ms per token)
llama_print_timings: prompt eval time = 6351.19 ms / 8 tokens ( 793.90 ms per token)
llama_print_timings: eval time = 55692.97 ms / 15 runs ( 3712.86 ms per token)

reboot
D:\zrm\llama.cpp-win> .\main.exe -m ....\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 17.52 ms / 16 runs ( 1.09 ms per token)
llama_print_timings: prompt eval time = 7219.65 ms / 8 tokens ( 902.46 ms per token)
llama_print_timings: eval time = 67883.02 ms / 15 runs ( 4525.53 ms per token)

master:
D:\llama-master-66874d4-bin-win-avx2-x64> .\main.exe -m ..\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 16 -p "I believe the meaning of life is"
llama_print_timings: sample time = 4.62 ms / 16 runs ( 0.29 ms per token)
llama_print_timings: prompt eval time = 4916.49 ms / 8 tokens ( 614.56 ms per token)
llama_print_timings: eval time = 33004.08 ms / 15 runs ( 2200.27 ms per token)

D:\llama-master-66874d4-bin-win-avx2-x64> .\main.exe -m ..\ggmlv3\guanaco-65B.ggmlv3.q4_0.bin -n 16 -t 32 -p "I believe the meaning of life is"
llama_print_timings: sample time = 4.24 ms / 16 runs ( 0.27 ms per token)
llama_print_timings: prompt eval time = 7452.86 ms / 8 tokens ( 931.61 ms per token)
llama_print_timings: eval time = 23078.24 ms / 15 runs ( 1538.55 ms per token)

rankaiyx avatar May 28 '23 13:05 rankaiyx

Windows does not seem to have a tool like numactl, and the NUMA-disable setting in the BIOS has no effect on Windows.

~~Under Linux, my CPU reduces its frequency after working at full load for dozens of seconds. I have changed a lot of BIOS settings and tried multiple distributions, and still can't solve it.~~

~~I don't have a good idea right now.~~ The performance degradation was due to the memory voltage; setting it to 1.5 V solved it.

rankaiyx avatar May 28 '23 14:05 rankaiyx

It doesn't seem to work on the Windows platform.

I didn't write a Windows implementation because the APIs are different, I'm not especially familiar with them, and I haven't got a Windows NUMA system to test with anyway. Someone so inclined could plausibly get it to work there assuming the Windows APIs to do it exist.

zrm avatar May 29 '23 17:05 zrm

Someone so inclined could plausibly get it to work there assuming the Windows APIs to do it exist.

There is a support article about these APIs: https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support. I am in the same situation with no access to a Windows NUMA machine.

Piezoid avatar May 29 '23 17:05 Piezoid

The strange thing is that I cannot make my system go faster with 36 threads.

I'm trying to guess what causes this, but it's hard without access to the same type of system to test.

You have more cores and the frequent thread synchronization is significantly more expensive across sockets. That could be contributing but I doubt it's the main problem. What happens on this system when the numa branch is run with 18 threads without --no-mmap?

What happens on this system if you drop the page cache and then run master with 36 threads under numactl --interleave=all? In this case the performance shouldn't be different with and without --no-mmap (both the mmap'd file and the copy in memory should be interleaved), so if there is a difference, we have something to investigate.

This code is an attempt to get the system to load pages into the right node without explicitly specifying it with libnuma. It's possible that some hardware difference is making that less effective on this system, e.g. your system has AVX-512 which could change the memory access pattern. Calling madvise() with MADV_RANDOM had no major effect on my 2-node system but significantly improved performance on the 4-node system (and the 4N system has AVX2 where the 2N system only has AVX, which may or may not be related). I might try writing the code that actually uses libnuma just to see if it works any better than this.
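For reference, the explicit libnuma placement mentioned above could look roughly like this minimal sketch (assumed equal slicing of the weights across nodes; not part of the branch, just an illustration). Link with -lnuma.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int    n_nodes  = numa_num_configured_nodes();
    size_t total    = 64u * 1024 * 1024;   /* stand-in for the model size */
    size_t per_node = total / (size_t) n_nodes;

    /* place each slice on its node explicitly instead of relying on
     * first-touch faulting */
    enum { MAX_NODES = 64 };
    void *slices[MAX_NODES] = { 0 };
    for (int node = 0; node < n_nodes && node < MAX_NODES; ++node) {
        slices[node] = numa_alloc_onnode(per_node, node);
        if (!slices[node]) {
            fprintf(stderr, "allocation failed on node %d\n", node);
            return 1;
        }
        memset(slices[node], 0, per_node);  /* commit the pages on that node */
    }

    /* a worker thread for a given node would bind itself before computing,
     * e.g. numa_run_on_node(node); */

    for (int node = 0; node < n_nodes && node < MAX_NODES; ++node) {
        numa_free(slices[node], per_node);
    }
    printf("placed %zu bytes on each of %d nodes\n", per_node, n_nodes);
    return 0;
}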

zrm avatar May 30 '23 02:05 zrm

Since you only implemented the Linux part, why does it degrade performance on Windows? Is it because building with w64devkit makes the compiler think it is targeting Linux?

rankaiyx avatar May 30 '23 02:05 rankaiyx

Is it possible that I have some BIOS setting wrong?

If NUMA is disabled in the BIOS, memory interleaving may happen automatically, which can lead to this behaviour.

rankaiyx avatar May 30 '23 02:05 rankaiyx

NUMA is niche knowledge for most people. Could we rely on ChatGPT's assistance to complete the Windows part? I'm willing to test it. I used the following prompt to try to get ChatGPT to complete the code; see whether it can be used as a reference.

I: Here is a C language program that only implements the Linux part. As the developers are not familiar with the Windows API, can you help complete the Windows part? By the way, can you help explain all of this code?

ChatGPT: Of course! I'll be happy to help you complete the Windows part of your C language program and explain the code to you. Please provide the code you have so far, and let me know which specific parts you need assistance with.

I: [the code]

ChatGPT:

#ifdef __linux__
void set_numa_thread_affinity(int thread_n, int n_threads)
{
    if (!ggml_is_numa()) { return; }
    // run thread on node_num thread_n / (threads per node)
    int node_num = thread_n / (n_threads / ggml_numa.n_nodes);
    struct ggml_numa_node *node = &ggml_numa.nodes[node_num];
    size_t setsize = CPU_ALLOC_SIZE(ggml_numa.total_cpus);
    cpu_set_t *cpus = CPU_ALLOC(ggml_numa.total_cpus);
    CPU_ZERO_S(setsize, cpus);
    for (size_t i = 0; i < node->n_cpus; ++i) {
        CPU_SET_S(node->cpus[i], setsize, cpus);
    }
    int rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
    if (rv) {
        fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n", strerror(rv));
    }
    CPU_FREE(cpus);
}

void clear_numa_thread_affinity(void)
{
    if (!ggml_is_numa()) { return; }
    size_t setsize = CPU_ALLOC_SIZE(ggml_numa.total_cpus);
    cpu_set_t *cpus = CPU_ALLOC(ggml_numa.total_cpus);
    CPU_ZERO_S(setsize, cpus);
    for (unsigned i = 0; i < ggml_numa.total_cpus; ++i) {
        CPU_SET_S(i, setsize, cpus);
    }
    int rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
    if (rv) {
        fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n", strerror(rv));
    }
    CPU_FREE(cpus);
}

#elif defined(_WIN32)

#include <windows.h>

void set_numa_thread_affinity(int thread_n, int n_threads)
{
    if (!ggml_is_numa()) { return; }

    int node_num = thread_n / (n_threads / ggml_numa.n_nodes);
    struct ggml_numa_node *node = &ggml_numa.nodes[node_num];

    DWORD_PTR affinityMask = 0;
    for (size_t i = 0; i < node->n_cpus; ++i) {
        affinityMask |= (1ULL << node->cpus[i]);
    }

    DWORD_PTR previousAffinityMask = SetThreadAffinityMask(GetCurrentThread(), affinityMask);
    if (previousAffinityMask == 0) {
        fprintf(stderr, "warning: SetThreadAffinityMask failed: %lu\n", GetLastError());
    }
}

void clear_numa_thread_affinity(void)
{
    if (!ggml_is_numa()) { return; }

    DWORD_PTR processAffinityMask;
    DWORD_PTR systemAffinityMask;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processAffinityMask, &systemAffinityMask)) {
        fprintf(stderr, "warning: GetProcessAffinityMask failed: %lu\n", GetLastError());
        return;
    }

    DWORD_PTR affinityMask = 0;
    for (unsigned i = 0; i < ggml_numa.total_cpus; ++i) {
        affinityMask |= (1ULL << i);
    }

    DWORD_PTR previousAffinityMask = SetThreadAffinityMask(GetCurrentThread(), affinityMask);
    if (previousAffinityMask == 0) {
        fprintf(stderr, "warning: SetThreadAffinityMask failed: %lu\n", GetLastError());
    }
}

#else
// Other platforms
void set_numa_thread_affinity(int thread_n, int n_threads) {}
void clear_numa_thread_affinity(void) {}
#endif

In the Windows implementation:

set_numa_thread_affinity:

It calculates the node number and retrieves the corresponding NUMA node information as in the Linux implementation. The Windows API function is used to set the thread affinity. It takes the current thread handle (obtained using ) and an affinity mask. The affinity mask is created by bitwise OR'ing the individual processor masks associated with the CPUs of the NUMA node.SetThreadAffinityMaskGetCurrentThread() If fails, an error message is printed.SetThreadAffinityMask

clear_numa_thread_affinity:

It retrieves the process affinity mask and system affinity mask using the Windows API function GetProcessAffinityMask. It creates an affinity mask with all the available CPUs set. It calls SetThreadAffinityMask with the created affinity mask to clear the thread affinity. If SetThreadAffinityMask fails, an error message is printed.

rankaiyx avatar May 30 '23 02:05 rankaiyx

@zrm

What happens on this system if you drop the page cache and then run master with 36 threads under numactl --interleave=all? In this case the performance shouldn't be different with and without --no-mmap (both the mmap'd file and the copy in memory should be interleaved), so if there was we have something to investigate.

Here are 3 consecutive runs without --no-mmap and another 3 with it after dropping caches:

gg$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
gg$ make clean
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'build-info.h'
gg$ make -j
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o -o vdot 

====  Run ./main -h for help.  ====

gg$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 10820.97 ms
llama_print_timings:      sample time =    16.93 ms /    32 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   255.50 ms /     8 tokens (   31.94 ms per token)
llama_print_timings:        eval time =  2840.72 ms /    31 runs   (   91.64 ms per token)
llama_print_timings:       total time = 13686.18 ms
gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   444.10 ms
llama_print_timings:      sample time =    16.95 ms /    32 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   257.23 ms /     8 tokens (   32.15 ms per token)
llama_print_timings:        eval time =  3302.28 ms /    31 runs   (  106.53 ms per token)
llama_print_timings:       total time =  3770.90 ms
gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   458.03 ms
llama_print_timings:      sample time =    16.86 ms /    32 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   271.82 ms /     8 tokens (   33.98 ms per token)
llama_print_timings:        eval time =  3747.79 ms /    31 runs   (  120.90 ms per token)
llama_print_timings:       total time =  4230.27 ms


# drop caches and use --no-mmap


gg$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 11154.26 ms
llama_print_timings:      sample time =    16.83 ms /    32 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   251.21 ms /     8 tokens (   31.40 ms per token)
llama_print_timings:        eval time =  3816.48 ms /    31 runs   (  123.11 ms per token)
llama_print_timings:       total time = 14995.15 ms
gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =  2298.71 ms
llama_print_timings:      sample time =    16.84 ms /    32 runs   (    0.53 ms per token)
llama_print_timings: prompt eval time =   270.88 ms /     8 tokens (   33.86 ms per token)
llama_print_timings:        eval time =  3681.84 ms /    31 runs   (  118.77 ms per token)
llama_print_timings:       total time =  6005.08 ms
gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap
main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =  2186.45 ms
llama_print_timings:      sample time =    16.75 ms /    32 runs   (    0.52 ms per token)
llama_print_timings: prompt eval time =   279.02 ms /     8 tokens (   34.88 ms per token)
llama_print_timings:        eval time =  3977.38 ms /    31 runs   (  128.30 ms per token)
llama_print_timings:       total time =  6188.21 ms
gg$ 

It's weird that during the first 3 runs the eval time keeps increasing from one run to the next. I think someone else also reported similar results earlier in the thread.

ggerganov avatar May 30 '23 07:05 ggerganov

This patch is really useful, especially for models larger than 13B. Looking forward to an early merge!

rankaiyx avatar Jun 01 '23 01:06 rankaiyx

The reason is that I am still not 100% convinced that this improves performance always, so it has to be optional.

I'm not sure this is always optimal either; I just need to find some time to look into it further.

Since you only implemented the Linux part, why does it degrade performance on Windows?

The only thing this should be doing on Windows is not sending the finalize ops to the thread pool when they don't do anything (because that isn't platform-specific), which should be a modest improvement. Is it actually worse than master or is it just not any better?
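Roughly, that guard amounts to something like the following (a hedged sketch, not the actual ggml code; ggml_op_has_finalize() is a hypothetical predicate for "this op's finalize pass does real work"):

// only pay the extra synchronization round when finalize actually does something
if (ggml_op_has_finalize(node->op)) {
    // hand GGML_TASK_FINALIZE to the worker threads as before
} else {
    // skip distributing finalize for this node entirely
}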

Can we rely on the assistance of chatgpt to complete the development of the Windows part?

The code looks plausible but only implements setting thread affinity. It would also need an implementation of ggml_numa_init(), and then something like "if(ggml_is_numa()) { prefetch = false; }" at the top of the Windows implementation of llama_mmap.
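Something along these lines might work for the Windows side of ggml_numa_init() (an untested sketch using GetNumaHighestNodeNumber / GetNumaNodeProcessorMaskEx / GetActiveProcessorCount; it assumes the same ggml_numa fields as the code quoted above, a single processor group, and that the nodes[] and cpus[] arrays are large enough):

#ifdef _WIN32
#include <windows.h>

void ggml_numa_init(void)
{
    ULONG highest_node = 0;
    if (!GetNumaHighestNodeNumber(&highest_node)) {
        return; // leave n_nodes at 0 so ggml_is_numa() stays false
    }
    ggml_numa.n_nodes    = highest_node + 1;
    ggml_numa.total_cpus = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
    for (ULONG n = 0; n < ggml_numa.n_nodes; ++n) {
        GROUP_AFFINITY ga;
        if (!GetNumaNodeProcessorMaskEx((USHORT) n, &ga)) { continue; }
        struct ggml_numa_node *node = &ggml_numa.nodes[n];
        node->n_cpus = 0;
        for (int bit = 0; bit < 64; ++bit) {
            if (ga.Mask & (1ULL << bit)) {
                // assumes a single processor group; multi-group systems
                // would need the group offset added to the CPU index
                node->cpus[node->n_cpus++] = bit;
            }
        }
    }
}
#endif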

But you're the one with the code and the Windows machine. If you can get it to generate those, try it and see if it works.

Here are 3 consecutive runs without --no-mmap and another 3 with it after dropping caches:

Am I correct that this was without interleave?

Expected:

gg$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches" gg$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 ... gg$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches" gg$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap

Observed:

gg$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches" gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 ... gg$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches" gg$ ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap

It might also be interesting to see the output of numactl -H after each run.

zrm avatar Jun 11 '23 07:06 zrm

I am also in favor of gating experimental features that have not been fully verified behind a special option that is disabled by default. That lets the project merge new features quickly, and the option can be switched to enabled by default once the feature has matured. Looking forward to an early merge!

rankaiyx avatar Jun 12 '23 02:06 rankaiyx

Here is an updated run using numactl --interleave=all. The last run is back on master using 18 threads:

github/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
github/llama.cpp$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 52565.75 ms
llama_print_timings:      sample time =    14.88 ms /    32 runs   (    0.46 ms per token)
llama_print_timings: prompt eval time = 51737.01 ms /     8 tokens ( 6467.13 ms per token)
llama_print_timings:        eval time =  3846.54 ms /    31 runs   (  124.08 ms per token)
llama_print_timings:       total time = 56435.07 ms
github/llama.cpp$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 95348 MB
node 0 free: 87720 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 96757 MB
node 1 free: 89888 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
github/llama.cpp$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   280.25 ms
llama_print_timings:      sample time =    14.31 ms /    32 runs   (    0.45 ms per token)
llama_print_timings: prompt eval time =   265.86 ms /     8 tokens (   33.23 ms per token)
llama_print_timings:        eval time =  2649.89 ms /    31 runs   (   85.48 ms per token)
llama_print_timings:       total time =  2952.27 ms
github/llama.cpp$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 95348 MB
node 0 free: 87710 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 96757 MB
node 1 free: 89884 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
github/llama.cpp$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =   269.95 ms
llama_print_timings:      sample time =    15.05 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   257.25 ms /     8 tokens (   32.16 ms per token)
llama_print_timings:        eval time =  2869.96 ms /    31 runs   (   92.58 ms per token)
llama_print_timings:       total time =  3162.75 ms
github/llama.cpp$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 95348 MB
node 0 free: 87704 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 96757 MB
node 1 free: 89880 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
github/llama.cpp$ sudo sh -c "/usr/bin/echo 3 > /proc/sys/vm/drop_caches"
github/llama.cpp$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time = 12084.76 ms
llama_print_timings:      sample time =    14.94 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   254.01 ms /     8 tokens (   31.75 ms per token)
llama_print_timings:        eval time =  2618.52 ms /    31 runs   (   84.47 ms per token)
llama_print_timings:       total time = 14726.04 ms
github/llama.cpp$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 95348 MB
node 0 free: 87670 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 96757 MB
node 1 free: 89842 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
github/llama.cpp$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =  2787.36 ms
llama_print_timings:      sample time =    16.30 ms /    32 runs   (    0.51 ms per token)
llama_print_timings: prompt eval time =   261.98 ms /     8 tokens (   32.75 ms per token)
llama_print_timings:        eval time =  3063.27 ms /    31 runs   (   98.82 ms per token)
llama_print_timings:       total time =  5875.18 ms
github/llama.cpp$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70
node 0 size: 95348 MB
node 0 free: 87664 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
node 1 size: 96757 MB
node 1 free: 89838 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
github/llama.cpp$ numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 36 -s 7 --no-mmap
main: build = 587 (8502d51)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 36 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =  2718.69 ms
llama_print_timings:      sample time =    14.96 ms /    32 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =   279.15 ms /     8 tokens (   34.89 ms per token)
llama_print_timings:        eval time =  2658.41 ms /    31 runs   (   85.76 ms per token)
llama_print_timings:       total time =  5399.77 ms
github/llama.cpp$ git checkout master
mPrevious HEAD position was 8502d51 fix args
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
github/llama.cpp$ make clean
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'build-info.h'
github/llama.cpp$ make -j && ./main -m models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -n 32 -t 18 -s 7 --no-mmap
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o -o vdot 

====  Run ./main -h for help.  ====

main: build = 587 (c31bbe9)
main: seed  = 7
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 18 / 72 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 I believe the meaning of life is to live out your purpose.
I believe in treating others as you would like to be treated. If you are not treated with respect and kindness,
llama_print_timings:        load time =  2311.91 ms
llama_print_timings:      sample time =    17.31 ms /    32 runs   (    0.54 ms per token)
llama_print_timings: prompt eval time =   242.81 ms /     8 tokens (   30.35 ms per token)
llama_print_timings:        eval time =  2171.10 ms /    31 runs   (   70.04 ms per token)
llama_print_timings:       total time =  4507.93 ms
github/llama.cpp$

It seems that --interleave=all with 36 threads does help, but it still does not become better compared to 18 threads

ggerganov avatar Jun 12 '23 17:06 ggerganov

If the number of threads is odd, the computed node number can exceed the highest valid NUMA node index (for example, with 31 threads on 2 nodes, n_threads / n_nodes truncates to 15 and thread 30 maps to node 2, which does not exist).

One solution is: int node_num = thread_n / ((float)n_threads / ggml_numa.n_nodes);

A better solution (which can improve performance by 20%) is: int node_num = thread_n % ggml_numa.n_nodes;

Only 13B and 65B were tested: 13B improved (from 122 ms to 105 ms), while 65B did not (439 ms to 440 ms).
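To make the difference between the two mappings concrete, here is a tiny standalone example (purely illustrative, not part of the patch) that prints which node each thread would land on with 30 threads and 2 nodes:

#include <stdio.h>

int main(void) {
    const int n_threads = 30; // threads requested on the test machine
    const int n_nodes   = 2;  // NUMA nodes
    for (int t = 0; t < n_threads; ++t) {
        int contiguous  = t / ((float)n_threads / n_nodes); // adjacent threads share a node
        int round_robin = t % n_nodes;                      // threads alternate between nodes
        printf("thread %2d -> contiguous: node %d, round-robin: node %d\n",
               t, contiguous, round_robin);
    }
    return 0;
}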

rankaiyx avatar Jun 17 '23 14:06 rankaiyx

It seems that --interleave=all with 36 threads does help, but it still does not become better compared to 18 threads

This is starting to look like it is the thread synchronization overhead across multiple sockets.

That system has many fast cores with AVX-512, so it finishes the actual work much faster but might not be that much faster at synchronization, so the overhead is proportionally higher and exceeds the benefit of having more cores and memory channels. I'll try posting something later which reduces the synchronization overhead some more.

(I suspect the reason the numa branch is slower than master with 18 threads on this system is that it will still schedule half of the threads on each node and incur the cross-socket synchronization overhead.)

If the number of threads is odd, the result exceeds the maximum number of nodes in numa.

I fixed that already but haven't done the commit yet. But n_threads not a multiple of n_nodes is going to degrade performance anyway, so don't do that.

A better solution (which can improve performance by 20%) is: int node_num = thread_n % ggml_numa.n_nodes;

That would cause every thread to run on a different node than the ones adjacent to it instead of only the ones at a node boundary. On my system that results in a significant slowdown because it causes more pages to be accessed by threads on multiple nodes. What makes you think it improves performance?

zrm avatar Jun 17 '23 18:06 zrm

A better solution (which can improve performance by 20%) is: int node_num = thread_n % ggml_numa.n_nodes;

That would cause every thread to run on a different node than the ones adjacent to it instead of only the ones at a node boundary. On my system that results in a significant slowdown because it causes more pages to be accessed by threads on multiple nodes. What makes you think it improves performance?

I did observe a performance improvement, or at least no decrease. My guess is that the threads themselves need to be dispersed, and forcibly dividing them into two large contiguous groups may lead to more cross-node access.

My test method is to make two copies of the model and test the two allocation methods repeatedly, alternating between them.

My test environment: E5-2698Bv3 x2 / DDR3-1866 32 GB x8; 32 cores, 64 threads, two NUMA nodes, 4 memory channels per node. The number of threads used is 30, with a fixed random seed.

rankaiyx avatar Jun 18 '23 03:06 rankaiyx

It just occurred to me that a two-node computer is very much like the human brain.

rankaiyx avatar Jun 18 '23 03:06 rankaiyx

There is now a new version of this which adds a --numa option that has to be specified, otherwise ggml_numa_init() is not called; then numa.n_nodes remains 0 and the NUMA-specific operations do nothing (because 0 is less than 2, so ggml_is_numa() is false).

It also changes the way ggml_graph_compute works to reduce synchronization overhead. In master, when the last thread finishes it sets an atomic flag, the main thread is waiting on this flag, the main thread performs finalize and then init for the next graph node, and the main thread sets another flag to distribute the next node to the other threads. In this version there is no specific "main" thread: finalize and init are performed immediately by the thread that sees the others are finished, which then notifies the other threads of the next graph node to be distributed. It seems to be faster.

This doesn't implement threaded finalize at all, since it didn't seem to be used. If it becomes necessary in the future it could be implemented by adding an "n_active_finalize" atomic. Then, for graph nodes that require threaded finalize, the ready threads would spin on n_active until it becomes zero, immediately perform the threaded finalize method and decrement n_active_finalize when they're done, and the last to finish finalize would do init for the next graph node.
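Roughly, the new scheme looks like this (a simplified sketch using C11 atomics, not the actual code; the compute/finalize/init calls are stand-ins for the real per-node work, and node 0 is assumed to be initialized and n_active set to n_threads before the workers start):

#include <stdatomic.h>

struct compute_state {
    atomic_int n_active;  // threads still working on the current graph node
    atomic_int node_n;    // index of the graph node currently being computed
    int        n_threads;
    int        n_nodes;   // number of graph nodes
};

// every worker runs this loop; there is no dedicated "main" thread
static void worker(struct compute_state *st, int ith) {
    (void) ith;  // only used by the real compute step
    int node_n = 0;
    while (node_n < st->n_nodes) {
        // compute(node_n, ith);  // this thread's GGML_TASK_COMPUTE share
        if (atomic_fetch_sub(&st->n_active, 1) == 1) {
            // last thread to finish this node: do finalize here, init the next
            // node, reset the worker count, then publish the new node index
            // finalize(node_n); init(node_n + 1);
            atomic_store(&st->n_active, st->n_threads);
            atomic_store(&st->node_n, node_n + 1);
        } else {
            // everyone else spins until the next node index is published
            while (atomic_load(&st->node_n) == node_n) { /* spin */ }
        }
        node_n = atomic_load(&st->node_n);
    }
}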

This is modifying synchronization primitives, so it would be good if anyone could test this on ARM or whatever unusual systems they may have.

My test method is to make two copies of the model and test the two allocation methods repeatedly, alternating between them.

If the model was loaded into the system page cache using the other code, that would explain why it could perform worse with code that distributes the thread numbers to different nodes. How does your system do with the latest version?

Why do you use 30 threads when your system has 32 cores?

zrm avatar Jun 18 '23 17:06 zrm