agent-benchmark topic
agent-benchmark repositories
ai-agents-reality-check
Stars: 51, Forks: 0, Watchers: 51
Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible...
eval-view
Stars: 31, Forks: 3, Watchers: 31
Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.