
Results: 42 comments of tju01

I'm going to work on this now over at https://github.com/tju01/oasst-automatic-model-eval. I started with the OpenAI evals framework, which also had its own issue #2348, but I'm going to extend that...

I'm interested in this issue and have started working on it.

I have evaluated the OpenAssistant RLHF model and built a simple UI to view the scores as well as the outputs, because the scores on their own can often be misleading...

I've had my questions answered on the Discord server. I have done a basic evaluation of multiple models, but there is lots of room for improvement. I'm going to continue...

1. https://github.com/OpenBMB/ToolBench works like the Vicuna benchmark and just asks an OpenAI model to evaluate the output (rough sketch of that grading approach below). See https://github.com/OpenBMB/ToolBench/tree/master/toolbench/evaluation.
2. https://github.com/ShishirPatil/gorilla compares the model output to the ground truth using...
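For context, a minimal sketch of what that Vicuna-style LLM grading looks like, assuming the current `openai` Python client; the judge prompt, model name, 1–10 scale and the `grade_output` helper are my own placeholders, not ToolBench's actual evaluation setup:

```python
# Toy sketch of Vicuna-style "LLM as judge" grading (placeholder prompt, not ToolBench's).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_output(question: str, answer: str, judge_model: str = "gpt-4") -> str:
    """Ask an OpenAI model to score a model answer from 1 to 10 and justify the score."""
    prompt = (
        "You are grading the answer of an AI assistant.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer on a scale of 1 to 10. "
        "Reply with 'Score: <n>' on the first line and a short justification on the second."
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(grade_output("What does `git rebase` do?",
                   "It replays your commits on top of another base branch."))
```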

https://github.com/sambanova/toolbench also already has a leaderboard here: https://huggingface.co/spaces/qiantong-xu/toolbench-leaderboard

Also consider https://github.com/princeton-nlp/intercode

Also https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks. And see https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/issues/8 for some other papers.

Current state of my research: [Gorilla](https://github.com/ShishirPatil/gorilla) seems quite limited to evaluating knowledge about how to call a bunch of ML models on some input data. Could be part of a...
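As a toy illustration of the ground-truth comparison mentioned in point 2 above: one can parse a generated call and check it against a reference call. This is my own simplification using Python's standard `ast` module (the helper names are made up), not Gorilla's actual evaluation code:

```python
# Toy sketch of comparing a generated API call against a ground-truth call.
# My own simplification, not Gorilla's evaluation code.
import ast

def call_signature(code: str):
    """Extract (dotted function name, positional args, keyword args) from one call expression."""
    node = ast.parse(code, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    name = ast.unparse(node.func)                                   # e.g. "pipeline" or "torch.hub.load"
    args = [ast.literal_eval(a) for a in node.args]                 # only literal arguments are handled
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs

def matches_ground_truth(prediction: str, ground_truth: str) -> bool:
    try:
        return call_signature(prediction) == call_signature(ground_truth)
    except (SyntaxError, ValueError):
        return False  # unparsable or non-literal output counts as a miss

print(matches_ground_truth(
    'pipeline(task="image-classification", model="google/vit-base-patch16-224")',
    'pipeline(model="google/vit-base-patch16-224", task="image-classification")',
))  # True: same call, keyword order does not matter
```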

Actually https://github.com/OpenBMB/ToolBench also does some other evaluation. See https://github.com/OpenBMB/ToolBench#model-experiment. It _also_ does LLM grading, but that's not the only evaluation it does.