Add CLI arg `generate_until_token` to support reasoning and CoT models
As noted in #8 and #513, LightEval expects models to follow a question with an immediate answer, but chain-of-thought and reasoning models (such as DeepSeek) generate many tokens of reasoning before arriving at a more accurate, thought-out answer.
This PR adds `--generate-until-token '</think>'` as the syntax to support these models.
It must be run with `--use-chat-template` and a `TransformerModel` model; otherwise it raises an exception.
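For reference, an invocation might look like the sketch below. The model args and task spec are placeholders, and the exact positional/flag format depends on your lighteval version; the only thing added by this PR is `--generate-until-token`.

```bash
# Placeholder model args and task spec -- adapt them to your lighteval version.
# Only --generate-until-token is new in this PR; it requires --use-chat-template.
lighteval accelerate \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    "bigbench|some_task|0|0" \
    --use-chat-template \
    --generate-until-token '</think>'
```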
I have a Colab notebook running a BigBench task which I didn't run to the end, but I used `logger.info` to confirm it was generating reasoning text. In a previous test linked in #513, I confirmed this approach works on a short task.
Notes:
- "generate until token" or "wait until..." is the clearest name I could think of to remind people to use the ending token
- this doesn't inspect the tokenizer's chat template string, but that could be a helpful way to detect the appropriate ending token
- Should I set `do_sample=True` when generating the reasoning text? Is that reproducible?
- Is there a way to see `logger.debug()` output when calling lighteval from the command line? I can remove the logging of reasoning text if it isn't helpful
- For better results on my custom task in #513, I had to change answers from `["A", "B", ...]` to `["The answer is A", ...]` (see the sketch after this list)
- thoughts about using the template string to set post-reasoning text and be compatible with more evals?
Hi! Is this currently working?
@EdwardSJ151 you should be able to run it with the code in this PR. If you run into issues, please comment. This might help:
- change the `logger.debug` line to `logger.info` to confirm that reasoning text is there
- if your answers are "A", "B", "C", you might need to change them to "The answer is A"
Hi! Thanks for the PR! We implemented it slightly differently here, letting evals run their course and filtering out the reasoning tokens before computing the metrics.
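Roughly, that filtering idea looks like the sketch below (an illustration only, not the actual lighteval implementation):

```python
# Illustration only (not lighteval's actual code): let the model generate freely,
# then keep only the text after the end-of-reasoning token before scoring.
def strip_reasoning(generation: str, end_token: str = "</think>") -> str:
    """Drop everything up to and including the last end-of-reasoning token."""
    return generation.split(end_token)[-1].strip()

print(strip_reasoning("<think>Let me work through the options...</think>The answer is A"))
# -> "The answer is A"
```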