
<think> tags for thinking models

Open JoelNiklaus opened this issue 1 year ago • 2 comments

Thinking models like DeepSeek-R1 emit <think> tags in their output. Is there an easy way to filter these out? Currently the reasoning text ends up directly in the prediction and throws off the metrics.
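For illustration, this is the kind of post-processing I mean: a minimal sketch (not an existing lighteval option; the names are mine) that strips the reasoning block before the prediction reaches the metric:

    import re

    THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

    def strip_think(prediction: str) -> str:
        # Drop the <think>...</think> reasoning block so only the final answer
        # is scored; predictions without the tags pass through unchanged.
        return THINK_BLOCK.sub("", prediction).strip()

    strip_think("<think>Let me work through this...</think>\nThe answer is B")
    # -> "The answer is B"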

JoelNiklaus · Jan 24 '25

Nope, not at the moment! We used to have regex parsers but they were underused so we removed them.

clefourrier · Jan 25 '25

I had the same question and got something working. For now it's more of a hack, but hopefully this is a starting point to get it working generally. Here is my branch and demo notebook.

Notes:

  • could add a flag (similar to use_chat_template). I saw some code uses add_reasoning_prompt
  • <think> is already in tokenizer_config.json's chat_template field, so instead of hardcoding it, it might be possible to detect common reasoning tags there and re-insert them into the chat template (see the sketch after this list)
  • I needed to change my answer options from ["A", "B"...] to ["The answer is A", ...]
  • 2048 new tokens seems like the right budget for the reasoning trace
  • didn't test few-shot
  • this isn't maximally efficient because it works on one doc at a time, but I don't know how well large batches would fit
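To illustrate the chat-template note above, here is a rough sketch of detecting the reasoning tag instead of hardcoding it (assuming a standard Hugging Face tokenizer; the helper name and tag list are just illustrative):

    from transformers import AutoTokenizer

    def detect_reasoning_tag(model_name: str) -> str | None:
        # Scan the tokenizer's chat template for known reasoning tags
        # (the tag list is a guess, not exhaustive).
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        template = tokenizer.chat_template if isinstance(tokenizer.chat_template, str) else ""
        for tag in ("<think>", "<reasoning>"):
            if tag in template:
                return tag
        return None

    # Expected to return "<think>" for DeepSeek-R1-style templates that embed the tag.
    detect_reasoning_tag("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")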

Here's the key section in prompt_manager:

        elif use_chat_template:
            # Render the conversation built above to text via the chat template.
            chat_preview = self.model.tokenizer.apply_chat_template(
                output, tokenize=False, add_generation_prompt=True
            )
            tokenized = self.model.tokenizer(chat_preview, return_tensors="pt").to(self.model.device)
            prepared_batch = Batch(
                input_ids=tokenized["input_ids"],
                input_mask=tokenized["attention_mask"],
                input_lengths=[len(tokenized["input_ids"][0])],
                truncated=[False],
                padded=[False],
            )
            # Generate the reasoning trace, stopping once the model closes the block.
            response = self.model._generate(
                batch=prepared_batch,
                max_new_tokens=2048,
                stop_tokens=["</think>"],
            )
            # Re-attach the closing tag (consumed as a stop token) so the returned
            # context ends with a complete <think>...</think> block.
            all_start = chat_preview + response[0].result[0] + "</think>"
            return all_start, num_effective_fewshots
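In short: the reasoning is generated first and cut off at </think>, then the closing tag is re-attached, so the string handed back is the chat prompt plus a complete reasoning block; the rephrased answer options from the notes above are meant to read as natural continuations of it.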

mapmeld · Mar 14 '25

Hi! This is now fixed on main after this PR. The details also contain both the original prediction and the post-processed one.

clefourrier · Aug 05 '25