How to predict a specific length of tokens?
In llama.cpp, the --n-predict option is used to set the number of tokens to predict when generating text.
I can't find the binding for that in the docs.
Hi, the binding for that parameter is max_tokens.
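For example, with the high-level API something along these lines should cap generation (the model path is just a placeholder):

```python
from llama_cpp import Llama

# Placeholder path; point this at any local GGUF model.
llm = Llama(model_path="./models/model.gguf")

# max_tokens caps generation at 64 tokens, but the model may stop
# earlier if it emits an end-of-sequence (EOS) token.
output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```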
max_tokens cannot guarantee an exact number of predicted tokens; sometimes the model predicts fewer than max_tokens.
Yes, and the --n-predict option in llama.cpp won't give an exact length either unless you ignore the EOS token, as explained here. So I'm not sure if that is what you were looking for: sampling until the --n-predict value is reached and then truncating.
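If that is the behaviour you want, one possible workaround is to suppress the EOS token with a logits processor so sampling always runs up to max_tokens. This is only a sketch: it assumes the logits_processor parameter and the (input_ids, scores) callable signature of the installed version of the bindings, so please check against your version.

```python
import numpy as np
from llama_cpp import Llama, LogitsProcessorList

llm = Llama(model_path="./models/model.gguf")  # placeholder path

def ignore_eos(input_ids, scores):
    # Push the EOS logit to -inf so it can never be sampled,
    # mimicking llama.cpp's --ignore-eos behaviour.
    scores[llm.token_eos()] = -np.inf
    return scores

output = llm(
    "Once upon a time",
    max_tokens=128,  # with EOS suppressed, all 128 tokens get generated
    logits_processor=LogitsProcessorList([ignore_eos]),
)
print(output["choices"][0]["text"])
```

Note that text generated after the point where the model wanted to stop may not be coherent.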
I was not able to find such an option in the high-level API of this repo; maybe you can have a look at this example, which uses the low-level API.
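As a rough alternative that stays close to the high-level API, one could drive Llama.generate() directly, keep sampling regardless of EOS, and truncate after exactly n_predict tokens. The tokenize/generate/detokenize calls are assumed to behave as in recent versions of the bindings, so treat this as a sketch rather than a definitive solution:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/model.gguf")  # placeholder path
n_predict = 128                                # exact number of tokens to keep

prompt_tokens = llm.tokenize(b"Once upon a time")

generated = []
for token in llm.generate(prompt_tokens, temp=0.8):
    # Deliberately do not break on llm.token_eos(): keep sampling past
    # EOS and simply stop once n_predict tokens have been produced.
    generated.append(token)
    if len(generated) >= n_predict:
        break

print(llm.detokenize(generated).decode("utf-8", errors="ignore"))
```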