tju01
I'm going to work on this now over at https://github.com/tju01/oasst-automatic-model-eval. I started with the OpenAI evals framework, which also had its own issue #2348, but I'm going to extend that...
I'm interested in this issue and have started working on it.
I have evaluated the OpenAssistant RLHF model and built a simple UI for viewing the scores as well as the raw outputs, since the scores on their own can often be misleading...
I've had my questions answered on the Discord server. I have done a basic evaluation of multiple models, but there is lots of room for improvement. I'm going to continue...
1. https://github.com/OpenBMB/ToolBench works like the Vicuna benchmark and simply asks an OpenAI model to evaluate the output (see the sketch below). See https://github.com/OpenBMB/ToolBench/tree/master/toolbench/evaluation.
2. https://github.com/ShishirPatil/gorilla compares the model output to the ground truth using...
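For reference, the LLM-grading approach boils down to something like the following sketch. The prompt wording, the 1-10 scale, the judge model name and the function names are my own assumptions for illustration, not the exact setup used by ToolBench or the Vicuna benchmark:

```python
# Minimal sketch of LLM-as-judge grading (Vicuna-benchmark style):
# an OpenAI model is asked to score a candidate answer.
# Requires openai>=1.0; prompt, scale and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
question on a scale from 1 to 10 and reply with only the number.

Question: {question}
Answer: {answer}"""


def llm_judge_score(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model for a 1-10 score of `answer` to `question`."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
        temperature=0,
    )
    # Assumes the judge follows the instruction and replies with just a number.
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    print(llm_judge_score("What is the capital of France?", "Paris."))
```

The main downside, as noted above, is that the scores depend entirely on the judge model, which is why looking at the raw outputs alongside the scores matters.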
https://github.com/sambanova/toolbench also already has a leaderboard here: https://huggingface.co/spaces/qiantong-xu/toolbench-leaderboard
Also consider https://github.com/princeton-nlp/intercode
Also https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks. And see https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/issues/8 for some other papers.
Current state of my research: [Gorilla](https://github.com/ShishirPatil/gorilla) seems quite limited to evaluating knowledge about how to call a set of ML models on some input data. Could be part of a...
Actually, https://github.com/OpenBMB/ToolBench also does some other evaluation; see https://github.com/OpenBMB/ToolBench#model-experiment. It _also_ does LLM grading, but that's not the only thing it does.
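For contrast with the LLM grading above, the ground-truth style of evaluation is conceptually something like this sketch. The normalization and the exact-match metric are my own simplification, not the actual matching logic used by ToolBench or Gorilla:

```python
# Minimal sketch of ground-truth evaluation: compare a model's predicted
# API/tool call against a reference call and report exact-match accuracy.
# Normalization and metric are illustrative assumptions, not either project's method.
import re


def normalize_call(call: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return re.sub(r"\s+", " ", call.strip().lower())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match the ground truth after normalization."""
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    hits = sum(normalize_call(p) == normalize_call(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ['weather.get_forecast(city="Berlin")', 'calculator.add(2, 3)']
    refs = ['weather.get_forecast(city="berlin")', 'calculator.subtract(2, 3)']
    print(exact_match_accuracy(preds, refs))  # 0.5
```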