
How to predict a specific length of tokens?

Open · simmonssong opened this issue 10 months ago · 3 comments

In llama.cpp, the --n-predict option is used to set the number of tokens to predict when generating text.

I can't find the binding for that in the docs.

simmonssong · Mar 19 '25

Hi, the binding for that parameter is max_tokens.
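For reference, here is a minimal sketch of passing max_tokens through the high-level API (the model path and prompt below are placeholders):

```python
from llama_cpp import Llama

# Placeholder path: point this at any local GGUF model file.
llm = Llama(model_path="./models/model.gguf")

# max_tokens caps the number of generated tokens, analogous to --n-predict.
out = llm.create_completion(
    "Q: Name the planets in the solar system. A:",
    max_tokens=32,
)
print(out["choices"][0]["text"])
```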

DanieleMorotti · Mar 19 '25

max_tokens cannot guarantee an exact number of predicted tokens; sometimes the model predicts fewer than max_tokens.
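For example, an early stop shows up in the finish_reason field of the response (a rough sketch; the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/model.gguf")  # placeholder path

out = llm.create_completion("Write a short poem:", max_tokens=64)

# "length" -> the max_tokens budget was exhausted;
# "stop"   -> the model emitted EOS (or a stop sequence) before that.
print(out["choices"][0]["finish_reason"])
print(out["usage"]["completion_tokens"])
```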

simmonssong · Mar 21 '25

Yes, and the --n-predict option in llama.cpp won't do that either unless you ignore the EOS token, as explained here. So I'm not sure if this is what you were looking for: sampling until the --n-predict value is reached and then truncating.

I wasn't able to find such an option in the high-level API of this repo; you could have a look at this example, which uses the low-level API.
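As a rough illustration of the idea (not the linked example itself), here is a hedged sketch that drives Llama.generate() directly, ignores EOS, and truncates at exactly N tokens. The model path and prompt are placeholders, and generate()'s exact behaviour may vary across llama-cpp-python versions:

```python
from llama_cpp import Llama

N = 32  # desired exact number of generated tokens
llm = Llama(model_path="./models/model.gguf")  # placeholder path

prompt_tokens = llm.tokenize(b"Once upon a time")

generated = []
for tok in llm.generate(prompt_tokens, temp=0.8):
    generated.append(tok)      # keep the token even if it is EOS
    if len(generated) >= N:    # stop only once N tokens have been sampled
        break

print(llm.detokenize(generated).decode("utf-8", errors="ignore"))
```

Keep in mind that text sampled past an EOS token may be incoherent, which is exactly why the llama.cpp route discussed above also requires ignoring EOS.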

DanieleMorotti · Mar 21 '25