tju01
I'm going to work on this now over at https://github.com/tju01/oasst-automatic-model-eval. I started with the OpenAI evals framework, which also had its own issue #2348, but I'm going to extend that...
I'm interested in this issue and have started working on it.
I have evaluated the OpenAssistant RLHF model and built a simple UI for viewing the scores as well as the raw outputs, since the scores on their own can often be misleading...
I've had my questions answered on the Discord server. I have done a basic evaluation of multiple models, but there is lots of room for improvement. I'm going to continue...
1. https://github.com/OpenBMB/ToolBench works like the Vicuna benchmark and simply asks an OpenAI model to evaluate the output (see the sketch below). See https://github.com/OpenBMB/ToolBench/tree/master/toolbench/evaluation.
2. https://github.com/ShishirPatil/gorilla compares the model output to the ground truth using...
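For reference, the LLM-grading approach boils down to something like the following sketch. The prompt wording, the 1-10 scale, the judge model name and the function names are my own assumptions for illustration, not the exact setup used by ToolBench or the Vicuna benchmark:

```python
# Minimal sketch of LLM-as-judge grading (Vicuna-benchmark style):
# an OpenAI model is asked to score a candidate answer.
# Requires openai>=1.0; prompt, scale and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
question on a scale from 1 to 10 and reply with only the number.

Question: {question}
Answer: {answer}"""


def llm_judge_score(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model for a 1-10 score of `answer` to `question`."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}
        ],
        temperature=0,
    )
    # Assumes the judge follows the instruction and replies with just a number.
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    print(llm_judge_score("What is the capital of France?", "Paris."))
```

The main downside, as noted above, is that the scores depend entirely on the judge model, which is why looking at the raw outputs alongside the scores matters.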
https://github.com/sambanova/toolbench also already has a leaderboard here: https://huggingface.co/spaces/qiantong-xu/toolbench-leaderboard
Also consider https://github.com/princeton-nlp/intercode
Also https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks. And see https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/issues/8 for some other papers.
Current state of my research: [Gorilla](https://github.com/ShishirPatil/gorilla) seems quite limited to evaluating knowledge about how to call a set of ML models on some input data. Could be part of a...
Actually, https://github.com/OpenBMB/ToolBench also does some other evaluation; see https://github.com/OpenBMB/ToolBench#model-experiment. It _also_ does LLM grading, but that's not the only thing it does.
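For contrast with the LLM grading above, the ground-truth style of evaluation is conceptually something like this sketch. The normalization and the exact-match metric are my own simplification, not the actual matching logic used by ToolBench or Gorilla:

```python
# Minimal sketch of ground-truth evaluation: compare a model's predicted
# API/tool call against a reference call and report exact-match accuracy.
# Normalization and metric are illustrative assumptions, not either project's method.
import re


def normalize_call(call: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return re.sub(r"\s+", " ", call.strip().lower())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match the ground truth after normalization."""
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    hits = sum(normalize_call(p) == normalize_call(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ['weather.get_forecast(city="Berlin")', 'calculator.add(2, 3)']
    refs = ['weather.get_forecast(city="berlin")', 'calculator.subtract(2, 3)']
    print(exact_match_accuracy(preds, refs))  # 0.5
```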