APPS: Automated Programming Progress Standard (NeurIPS 2021)
Hi @xksteven, I have a question about why you advise running the evaluation code for one solution at a time instead of doing it for all generations at...
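One common reason to evaluate a single solution at a time is isolation: generated programs can hang, leak memory, or crash the interpreter, and running each one in its own subprocess with a timeout keeps a single bad generation from taking down the whole batch. A minimal sketch of that pattern (the `eval_single.py` script and its flags are hypothetical placeholders, not the repo's actual CLI):

```python
# Hypothetical sketch: run each generation in its own subprocess so a hanging
# or crashing solution cannot corrupt the rest of the evaluation run.
# "eval_single.py" and its flags are placeholders, not the repo's real CLI.
import subprocess

def evaluate_one(problem_idx: int, generation_idx: int, timeout: int = 60) -> bool:
    try:
        proc = subprocess.run(
            ["python", "eval_single.py",
             "--problem", str(problem_idx),
             "--generation", str(generation_idx)],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # treat a hang as a failed solution
```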
Hello, I am trying to evaluate my model's generated code using the scripts in eval. However, for a particular problem, results[index] turns out to be an empty array as a result...
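An empty `results[index]` usually means the harness produced no per-test outcomes for that problem (for example, it crashed or timed out before running any test case), so it is worth guarding for that case before averaging. A hedged sketch, assuming `results[index]` holds one list of per-test-case outcomes per generation (True for a pass, False for a wrong answer, negative codes for errors):

```python
# Sketch: compute per-problem accuracy while flagging problems whose result
# list came back empty instead of letting them crash or skew the average.
from typing import Dict, List
import numpy as np

def per_problem_accuracy(results: Dict[int, List]) -> Dict[int, float]:
    accs = {}
    for index, res in results.items():
        if not res or not res[0]:
            # The harness returned nothing for this problem; count it as 0 and flag it.
            print(f"Warning: empty results for problem {index}")
            accs[index] = 0.0
            continue
        outcomes = np.asarray(res[0])  # one entry per test case
        accs[index] = float(np.mean(outcomes == True))  # only exact True counts as a pass
    return accs
```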
Hi, thanks for your work. I don't quite understand the role of check5 in the evaluation process; it seems to produce some incorrect results. Here is an example of 4496...
Hi, thanks for the amazing work! I want to ask about the detailed steps for post-processing generated code solutions when testing one solution. (e.g. After a code solution was generated, did...
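For what it's worth, a minimal post-processing pass (a hedged guess, not necessarily the authors' exact procedure) is to cut the echoed prompt off at the final `ANSWER:` marker and drop anything after an end-of-text token:

```python
# Hedged sketch of a minimal post-processing step; the "ANSWER:" marker follows
# the APPS-style prompt format and "<|endoftext|>" is the GPT-2 end token.
def postprocess(generation: str) -> str:
    # If the prompt was echoed back, keep only what follows the last ANSWER: marker.
    if "ANSWER:" in generation:
        generation = generation.split("ANSWER:")[-1]
    # Drop anything the model emitted after an end-of-text token.
    generation = generation.split("<|endoftext|>")[0]
    return generation.strip()
```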
I'm trying to run the pre-trained 1.5B model linked in [the README](https://drive.google.com/file/d/1XW1Od9L-5l9zXl1HUCyER5pS9zQTbIvU/view?usp=sharing) on the APPS test set. I downloaded the dataset and ran the script `train/apps_create_split.py` on it, then ran...
Some problems in APPS are long, so I truncated them after encoding. But the output of the model is in the form of "problem + answer", so output...
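One way to avoid having to strip the echoed problem text at all is to decode only the tokens generated past the prompt length. A sketch using Hugging Face `transformers` (the model name and prompt shape below are placeholders):

```python
# Sketch: slice off the prompt tokens before decoding, so the (possibly
# truncated) problem statement never appears in the returned string.
# "gpt2" and the prompt text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "QUESTION:\n<problem statement here>\nANSWER:\n"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Everything before inputs["input_ids"].shape[1] is just the echoed prompt.
generated_ids = output_ids[0][inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(generated_ids, skip_special_tokens=True)
```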
Not necessarily an issue, but I noticed that for train/val the answer_type is based on whether starter_code exists, while at eval time it's based on fn_name. Is there a...
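For reference, the two signals are normally two views of the same split: call-based problems are the ones whose `input_output.json` carries a `fn_name` and which ship starter code, while standard-input problems have neither. A hedged sketch of that check (file names follow the APPS per-problem layout; the returned labels are only illustrative):

```python
# Hedged sketch: derive the answer type from either signal and flag problems
# where the two disagree. File names follow the APPS per-problem layout.
import json
import os

def answer_type(problem_dir: str) -> str:
    io_path = os.path.join(problem_dir, "input_output.json")
    starter_path = os.path.join(problem_dir, "starter_code.py")

    has_fn_name = False
    if os.path.exists(io_path):
        with open(io_path) as f:
            has_fn_name = "fn_name" in json.load(f)

    has_starter = os.path.exists(starter_path) and os.path.getsize(starter_path) > 0

    if has_fn_name != has_starter:
        # The two signals should normally agree; mismatches are worth inspecting.
        print(f"Mismatch in {problem_dir}: fn_name={has_fn_name}, starter_code={has_starter}")

    return "call-based" if has_fn_name else "standard-input"
```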
Hi, I'm encountering a problem in evaluating the solutions. For a preliminary pipeline in which I want to process the whole APPS benchmark with an LLM, I'm just taking one random...
pyext no longer installs on Python 3.11 because of changes to the inspect module; this fork fixes the install issue.
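For anyone hitting this before switching to the fork: the likely culprit is that Python 3.11 removed `inspect.getargspec` and `inspect.formatargspec`, which stock pyext still references. A quick way to confirm which interpreter you are on and whether the old names are gone:

```python
# Python 3.11 removed inspect.getargspec and inspect.formatargspec; their
# absence is the likely reason stock pyext no longer installs there.
import inspect
import sys

print(sys.version_info)
print("getargspec:", hasattr(inspect, "getargspec"))          # False on 3.11+
print("getfullargspec:", hasattr(inspect, "getfullargspec"))  # True; the replacement
```

Until the fix lands, installing under Python 3.10 or earlier is the simplest workaround.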
Most of the major benchmarks have a leaderboard to rank LLMs against one another, but not this one? Where could I find such results?