Qwen2.5-Math issues

The 24 game still cannot be solved

1

Math demo https://qwen2.org/math: How can 24 be obtained from 11, 2, 23, and 8 using simple addition, subtraction, multiplication, and division, with each number used exactly once? ``` To obtain 24 using the numbers 11, 2, 23, and 8 with basic arithmetic operations (addition, subtraction, multiplication, and division),...

novvoo
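For reference, the puzzle in this issue does have a solution (for example, 23 + 11 - 8 - 2 = 24). A minimal brute-force sketch that checks this kind of puzzle is shown below. It is not part of the Qwen2.5-Math repository; the function names `solve_24` and `_search` are made up for illustration, and exact arithmetic via `Fraction` is an assumption chosen to avoid floating-point error in divisions.

```python
from fractions import Fraction
from itertools import combinations


def solve_24(numbers, target=24):
    """Return one expression reaching `target` with each number used once, or None."""
    # Pair every exact value with the expression string that produced it.
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two items, combine them with one operation, recurse on the rest.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        # Division is only tried when the divisor is non-zero.
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


solution = solve_24([11, 2, 23, 8])
print(solution)
```

The search returns the first expression it finds, so the printed form may differ from the example above while still evaluating to 24.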

Few-shot prompt adaptation issue for GSM8K evaluation with 8-shot setting

Hi, when running the GSM8K evaluation experiments in the 8-shot setting, I noticed that the few-shot examples were not applied. Specifically, the current implementation triggers (in [here](https://github.com/QwenLM/Qwen2.5-Math/blob/a45202bd16f1ec06f433442dc1152d0074773465/evaluation/utils.py#L204C5-L214C59)). ```python if...

passing2961

For generated solutions with non-python code, how are the solution COT generated by the model cleaned in the evaluation process?

I can't locate the core logic for cleaning non-Python code out of the model's generated solution anywhere in the code under the evaluation directory. Currently, I can only see how dataset ground truth...

Ming593

Generation hangs when the output exceeds the length limit

I call the model with the Transformers library with max_new_tokens set to 20000. When the generated length reaches 4096, a warning appears: This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or...

EdmunddzzZ
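One way to stay under the limit named in the warning is to cap the generation budget before calling the model. The sketch below is only an illustration: `clamp_max_new_tokens` is a hypothetical helper, not something from Transformers or this repository, and the default of 4096 is taken from the warning text quoted above rather than read from any model config.

```python
def clamp_max_new_tokens(prompt_tokens, requested_new_tokens, context_limit=4096):
    """Cap the number of new tokens so prompt + output fits in `context_limit`.

    `context_limit` defaults to the 4096 quoted in the warning (an assumption;
    check the actual model configuration before relying on it).
    """
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room within {context_limit}"
        )
    return min(requested_new_tokens, remaining)


# A 96-token prompt with the 20000-token request from the issue.
budget = clamp_max_new_tokens(prompt_tokens=96, requested_new_tokens=20000)
print(budget)  # 4000
```

The resulting value could then be passed as `max_new_tokens` to the generation call in place of 20000.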

Abnormal TIR experiment results

6

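Following the TIR prompt, I ran experiments on the 1.5B and 7B Qwen2.5-Math models, and the resulting metrics are worse than with CoT. I suspect my implementation is missing some steps. Could you describe the implementation in more detail? I implemented TIR based on the prompt below: ``` # TIR messages = [ {"role": "system", "content": "Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."},...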

wangzhihao-coder

exceed the model's predefined maximum length (4096)

2

When I use the Qwen2.5-Math-7B model for inference, I get the following information: This is a friendly reminder - the current text generation call will exceed the model's predefined maximum...

zhoumengbo

The score calculation logic seems to have a problem that makes n_sampling ineffective?

1

As I understand the code below, for each question only the score of the 0th of the n samples goes into the mean that is reported as acc, so n_sampling > 1 has no effect. [evaluate.py#line78](https://github.com/QwenLM/Qwen2.5-Math/blob/a45202bd16f1ec06f433442dc1152d0074773465/evaluation/evaluate.py#L78) ```python score_mat = [] for sample in samples: sample['score'] = scores[idx: idx+len(sample['pred'])] assert len(sample['score']) == len(sample['pred']) score_mat.append(sample['score']) idx += len(sample['pred']) max_len = max([len(s) for s in score_mat]) for...

gantuo
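To make the concern in this issue concrete, the sketch below contrasts three ways of turning a per-question score matrix into a single accuracy figure. It is an illustration only and does not reproduce the repository's `evaluate.py`; the three function names and the toy `score_mat` are made up, with `score_mat` assumed to hold one list of boolean scores per question as in the quoted snippet.

```python
def first_sample_accuracy(score_mat):
    """Accuracy that looks only at the first sample of each question."""
    return sum(bool(row[0]) for row in score_mat) / len(score_mat)


def mean_sample_accuracy(score_mat):
    """Per-question fraction of correct samples, averaged over questions."""
    per_question = [sum(bool(s) for s in row) / len(row) for row in score_mat]
    return sum(per_question) / len(per_question)


def any_correct_accuracy(score_mat):
    """Fraction of questions with at least one correct sample."""
    return sum(any(row) for row in score_mat) / len(score_mat)


# Toy data: two questions, four samples each (hypothetical scores).
score_mat = [
    [False, True, True, True],
    [True, True, False, False],
]

print(first_sample_accuracy(score_mat))  # 0.5
print(mean_sample_accuracy(score_mat))   # 0.625
print(any_correct_accuracy(score_mat))   # 1.0
```

If only the first figure is reported, increasing the number of samples cannot change it, which is the behavior the issue describes.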

Add a license

1

Hi authors, I’m currently a research scientist at NVIDIA working on mathematical reasoning. I came across your repository, and I really appreciate the work you’ve done! We’re also working on...

wedu-nvidia

Results from testing Qwen2.5-Math-1.5B with the evaluation code differ considerably from the report

6

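Hello, is there a dedicated prompt for evaluating the base model? When I test Qwen2.5-Math-1.5B directly with the evaluation code for the instruct model, the results differ considerably from those in the report.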

pipixiaqishi1

Inquiry on the Composition of Pre-training Dataset for Qwen-2.5-math and How to Replicate

I am interested in understanding the composition of the pre-training dataset used for Qwen-2.5-math. Specifically, I would like to know: 1. What are the primary sources or types of datasets...

TrishKyrie
