Documentation on Evaluation is not present at the landing page for the Documentation link in the Help and Feedback pane.
Currently covered are Models, Playground, and Finetune; Bulk Run and Evaluation are both missing.
Note that while the evaluations appear to be standard, it won't be clear how they apply in this specific context, and a web search can surface conflicting documentation. For example, searching for F1 score lands on descriptions of the harmonic mean of precision and recall, which isn't helpful for understanding how that applies to LLM results.
Of specific interest: which of the evaluations require the use of a model? gpt-4o-mini was the only choice offered, but it was unclear which eval caused that choice to be presented (I am assuming METEOR?).
Thanks for your feedback; we'll update the documentation to cover the topics you mentioned.
For your question about the judge model requirement: currently, only LLM-based evaluators require a model. At this time, only Azure OpenAI models and models compatible with the OpenAI API (including OpenAI models and GitHub Models) are supported.
You'll first need to add a model from the Model Catalog; then you'll be able to select it when creating a new evaluation.
LLM-based:
- Coherence
- Fluency
- Relevance
- Similarity
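
For illustration, here is a minimal sketch of how an LLM-based evaluator uses a judge model to score a response. The prompt wording, the 1-5 scale, and the client setup are assumptions for the example (not the extension's actual evaluator implementation); it only assumes an OpenAI-compatible endpoint such as the ones listed above.

```python
# Sketch of an LLM-based evaluator: a judge model rates the response.
# Assumptions: the prompt and 1-5 scale are illustrative, not the product's
# actual prompts; point the client at your OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url=..., api_key=...) for other compatible endpoints

def judge_coherence(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask the judge model for a 1-5 coherence rating (illustrative prompt)."""
    prompt = (
        "Rate the coherence of the answer to the question on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```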
Code-based:
- BLEU
- GLEU
- F1 Score
- METEOR
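
In contrast, the code-based evaluators run locally and need no judge model. On the F1 Score point from the feedback: for LLM outputs it is typically computed over token overlap between the generated answer and the reference text, not over classification labels. A minimal sketch, with deliberately simplified tokenization:

```python
# Sketch of a code-based metric: token-overlap F1 between a generated answer
# and a reference. No model call involved; precision and recall are over
# shared tokens rather than classification labels.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Word order doesn't matter for token-overlap F1, so this scores 1.0.
print(token_f1("Paris is the capital of France", "The capital of France is Paris"))
```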