Jeximo
> 96 cores 192 threads ... the peek inferencing speed tops at around 60 threads

This sounds normal. The CPU may be over-saturated: [token generation performance tips readme](https://github.com/ggerganov/llama.cpp/blob/f87f7b898651339fe173ddf016ca826163e899d8/docs/token_generation_performance_tips.md#verifying-that-the-cpu-is-not-oversaturated)
> how to keep the output response of API is the same as the output of main command.

It's unclear what settings you used. The [README shows](https://github.com/ggerganov/llama.cpp/blob/4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc/examples/server/README.md) `seed` and `temperature` for the API...
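A rough sketch of what I mean (model path and prompt are placeholders, and flag names can drift between llama.cpp versions, so double-check against your build):

```sh
# CLI run with sampling pinned down: greedy temperature and a fixed seed.
./main -m ./models/ggml-model-q4_0.gguf -p "Hello" -n 64 --temp 0 --seed 42

# Same settings sent to the server's /completion endpoint (default port 8080).
curl http://localhost:8080/completion -d '{
  "prompt": "Hello",
  "n_predict": 64,
  "temperature": 0,
  "seed": 42
}'
```

If the seed, temperature, and the rest of the sampling settings match on both sides, the outputs should match too.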
> n_threads = 112 / 224

Test `--threads N`. I don't know what's optimal for your system; usually it's best to start at 1, then see if token generation speed...
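Something like this (placeholder model path and prompt; the timing line comes from `llama_print_timings` on stderr in the builds I've used):

```sh
# Sweep thread counts and compare tokens/second from the eval timing line.
for t in 1 2 4 8 16 32 64; do
  echo "=== --threads $t ==="
  ./main -m ./models/ggml-model-q4_0.gguf -p "Hello" -n 64 --threads "$t" 2>&1 | grep "eval time"
done
```

Whichever thread count gives the best eval speed is usually near your physical core count, not the logical thread count.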
There's a PR to implement this: https://github.com/ggerganov/llama.cpp/pull/6741
> Hi,
>
> I am using this model ggml-model-q4_0.gguf and ggml-model-f32.gguf

Unclear, but this doesn't seem to be the focus of your question.

> My issues is...
> Why does this commit alter output?

@Azirine To figure out the difference, please show the steps for how you decided https://github.com/ggerganov/llama.cpp/commit/c47cf414efafb8f60596edc7edb5a2d68065e992 lowered output quality.
> CPU only beat GPU output hands down. ...
> GPU 75% / CPU 25% -> Always seems to yield higher quality output.
> GPU 50% / CPU 50% -> Even better...
> LMSYS Chatbot Arena

@Azirine See I didn't say "_LMSYS_". Please do not read things I didn't say, **that'd be great**.

> alters the model's outputs even with identical prompts,...
> n_gpu_layer= -1,

This isn't a thing; you've set your GPU to use no layers. Increase the number.
> Even when I try with 30 it still the same issue

https://github.com/ggerganov/llama.cpp/blob/4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc/examples/main/README.md?plain=1#L318

The original post has a typo in the parameter; the correct form is `--n-gpu-layers N`. Did you use `--n-gpu-layers 30`, and...
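For reference, something along these lines (placeholder model path and prompt; `-ngl` is the short form on the builds I've used):

```sh
# Correct spelling of the flag: --n-gpu-layers (not n_gpu_layer).
./main -m ./models/ggml-model-q4_0.gguf -p "Hello" -n 64 --n-gpu-layers 30
```

The startup log should report how many layers were actually offloaded to the GPU, so you can confirm the flag took effect.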