agent-benchmark topic


ai-agents-reality-check

Stars: 51, Forks: 0, Watchers: 51

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible...

eval-view

Stars: 31, Forks: 3, Watchers: 31

Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.