Add CLI arg `generate_until_token` to support reasoning and CoT models
As noted in #8 and #513, LightEval expects models to follow a question with an immediate answer, but chain-of-thought and reasoning models (such as DeepSeek) generate many tokens of reasoning before arriving at a more accurate, thought-out answer.
This PR adds `--generate-until-token '</think>'` as the syntax to support these models.
It must be run with `--use-chat-template` and a `TransformerModel` model; otherwise it raises an exception.
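For reference, an invocation might look like the sketch below. The model args and task spec are placeholders, and the exact positional/flag format depends on your lighteval version; the only thing added by this PR is `--generate-until-token`.

```bash
# Placeholder model args and task spec -- adapt them to your lighteval version.
# Only --generate-until-token is new in this PR; it requires --use-chat-template.
lighteval accelerate \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    "bigbench|some_task|0|0" \
    --use-chat-template \
    --generate-until-token '</think>'
```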
I have a Colab notebook running a BigBench task which I didn't run to the end, but I used `logger.info` to confirm it was generating reasoning text. In a previous test linked in #513, I confirmed this approach works on a short task.
Notes:
- "generate until token" or "wait until..." is the clearest name I could think of to remind people to use the ending token
- this doesn't inspect the tokenizer's chat template string, but that could be a helpful way to detect the appropriate ending token
- Should I set `do_sample=True` when generating the reasoning text? Is that reproducible?
- Is there a way to see `logger.debug()` output when calling lighteval from the command line? I can remove the logging of reasoning text if it isn't helpful
- For better results on my custom task in #513, I had to change answers from `["A", "B", ...]` to `["The answer is A", ...]` (see the sketch after this list)
- thoughts about using the template string to set post-reasoning text and be compatible with more evals?
Hi! Is this currently working?
@EdwardSJ151 you should be able to run it with the code in this PR. If you run into issues, please comment. This might help:
- change the `logger.debug` line to `logger.info` to confirm that reasoning text is there
- if your answers are "A", "B", "C", you might need to change them to "The answer is A"
Hi! Thanks for the PR! We implemented it slightly differently here, letting evals run their course and filtering out the reasoning tokens before computing the metrics.
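Roughly, that filtering idea looks like the sketch below (an illustration only, not the actual lighteval implementation):

```python
# Illustration only (not lighteval's actual code): let the model generate freely,
# then keep only the text after the end-of-reasoning token before scoring.
def strip_reasoning(generation: str, end_token: str = "</think>") -> str:
    """Drop everything up to and including the last end-of-reasoning token."""
    return generation.split(end_token)[-1].strip()

print(strip_reasoning("<think>Let me work through the options...</think>The answer is A"))
# -> "The answer is A"
```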