
Add CLI arg `generate_until_token` to support reasoning and CoT models

Open mapmeld opened this issue 10 months ago • 2 comments

As noted in #8 and #513, LightEval expects models to answer immediately after the prompt, but chain-of-thought and reasoning models (such as DeepSeek) generate many reasoning tokens to arrive at a more accurate, thought-out result before answering.

This PR would add `--generate-until-token '</think>'` as the syntax to support these models. It must be run with `--use-chat-template` and a `TransformerModel`, or it will raise an Exception.
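Conceptually, the flag lets generation continue until the stop token's ids appear at the end of the output stream. A minimal sketch of that tail check (pure Python; the function name and the example token ids are made up for illustration, not the PR's actual code):

```python
def ends_with_stop_sequence(generated_ids, stop_ids):
    """Return True once the tail of generated_ids matches the stop sequence."""
    if len(generated_ids) < len(stop_ids):
        return False
    return generated_ids[-len(stop_ids):] == stop_ids

# Hypothetical ids: suppose '</think>' encodes to [27, 14023, 771, 29].
stop_ids = [27, 14023, 771, 29]
stream = [101, 42, 7, 27, 14023, 771, 29]
print(ends_with_stop_sequence(stream, stop_ids))  # True
```

In practice this check would run inside the model's generation loop after each new token.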

I have a Colab notebook running a BigBench task; I didn't run it to the end, but I used logger.info to confirm it was generating reasoning text. In a previous test linked in #513, I confirmed this method works on a short task.

Notes:

  • "generate until token" (or "wait until...") was the clearest name I could think of to remind people that they need to supply the ending token
  • this doesn't inspect the tokenizer's chat template string, but doing so could help detect the appropriate token automatically
  • Should I set do_sample=True when generating the reasoning text? Is that reproducible?
  • Is there a way to see logger.debug() output when calling lighteval from the command line? I can remove the logging of reasoning text if it isn't helpful
  • For better results on my custom task in #513, I had to change answers ["A", "B", ...] to ["The answer is A", ...] — any thoughts on using the template string to set the post-reasoning text, so this stays compatible with more evals?
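On the chat-template point above, one way to auto-detect the ending token would be to scan the template string for known end-of-reasoning markers. A sketch (the marker list and the template snippet are illustrative assumptions, not lighteval code or a real model's template):

```python
# Assumed list of common end-of-reasoning markers; extend as needed.
KNOWN_REASONING_END_MARKERS = ["</think>", "</reasoning>"]

def detect_reasoning_end_marker(chat_template):
    """Return the first known end-of-reasoning marker found in the template, or None."""
    for marker in KNOWN_REASONING_END_MARKERS:
        if marker in chat_template:
            return marker
    return None

# Illustrative template fragment, not any real model's chat template.
template = "{{ bos }}{% for m in messages %}...{% endfor %}<think>{{ cot }}</think>"
print(detect_reasoning_end_marker(template))  # </think>
```

This would only be a fallback default; an explicit `--generate-until-token` value should still win.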

mapmeld avatar Mar 17 '25 17:03 mapmeld

Hi! Is this currently working?

EdwardSJ151 avatar Apr 02 '25 00:04 EdwardSJ151

@EdwardSJ151 you should be able to run it with the code in this PR. If you run into issues, please comment. This might help:

  • change the logger.debug line to logger.info to confirm that reasoning text is there
  • if your answers are "A", "B", "C", you might need to change them to "The answer is A"
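The remapping in the second point is just a string transform over the task's choices; roughly this (a sketch, not tied to any lighteval API):

```python
def remap_choices(choices, template="The answer is {}"):
    """Prefix each bare answer letter so it matches post-reasoning text."""
    return [template.format(choice) for choice in choices]

print(remap_choices(["A", "B", "C"]))
# ['The answer is A', 'The answer is B', 'The answer is C']
```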

mapmeld avatar Apr 03 '25 03:04 mapmeld

Hi! Thanks for the PR, we implemented it slightly differently here, allowing evals to run their course and filtering out the reasoning tokens for the metrics.
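For anyone landing here: that filtering approach amounts to removing the reasoning span from the completion before the metrics see it. A rough sketch of such a filter (the regex and tag names are assumptions, not the exact lighteval implementation):

```python
import re

def strip_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Remove reasoning spans; if only a closing tag appears, keep what follows it."""
    pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
    text = re.sub(pattern, "", text, flags=re.DOTALL)
    # Some models emit only the closing tag; keep the text after the last one.
    if close_tag in text:
        text = text.rsplit(close_tag, 1)[1]
    return text.strip()

print(strip_reasoning("<think>step 1... step 2</think>The answer is A"))
# The answer is A
```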

clefourrier avatar Aug 05 '25 11:08 clefourrier