llm-evaluation-framework topic
promptfoo
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
deepeval
The LLM Evaluation Framework
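A minimal pytest-style sketch of what a deepeval check can look like, following the quickstart-style API (LLMTestCase, AnswerRelevancyMetric, assert_test); the input/output strings are hypothetical, and the relevancy metric assumes an LLM judge is configured (e.g. an OpenAI API key).

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_store_hours_answer():
    # Hypothetical example: a question, the model's answer, and the retrieved context.
    test_case = LLMTestCase(
        input="What are your store hours?",
        actual_output="We are open 9am to 5pm, Monday through Friday.",
        retrieval_context=["Store hours: 9am-5pm, Mon-Fri."],
    )
    # Fail the test if answer relevancy scores below the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

A file like this can be executed as an ordinary pytest suite or through deepeval's CLI (deepeval test run), which makes it straightforward to drop into CI.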
parea-sdk-py
Python SDK for experimenting with, testing, evaluating, and monitoring LLM-powered applications - Parea AI (YC S23)
agentic_security
Agentic LLM Vulnerability Scanner / AI red teaming kit 🧪
MixEval
The official evaluation suite and dynamic data release for MixEval.
KIEval
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
fm-leaderboarder
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your own data, tasks, and prompts.
realign
Realign is a testing and simulation framework for AI applications.
qa_metrics
An easy Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics as well as semantic evaluation metrics driven by black-box and open-source large language model prompting.
contextcheck
MIT-licensed framework for testing LLMs, RAGs, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.