agent-benchmark topic

List agent-benchmark repositories

ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible...

Cre4T3Tiv3

agent-architecture

agent-benchmark

agent-evaluation

agent-performance
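
The ai-agents-reality-check description above mentions 95% confidence intervals and Cohen's h. As a rough illustration of what those two statistics are, and not a reproduction of that repository's code, the sketch below computes both for two success rates. The function names and the sample counts are hypothetical, and the interval uses the simple normal (Wald) approximation, which is an assumption made here for brevity.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    # Arcsine-square-root transform of each proportion, then the difference.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def wald_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a success rate (normal approximation)."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    # Clamp to [0, 1] because the normal approximation can overshoot.
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Made-up counts purely for illustration: two systems, 100 tasks each.
    agent_ok, wrapper_ok, n = 80, 50, 100
    print("Cohen's h:", cohens_h(agent_ok / n, wrapper_ok / n))
    print("Agent 95% CI:", wald_ci(agent_ok, n))
    print("Wrapper 95% CI:", wald_ci(wrapper_ok, n))
```

By Cohen's conventional thresholds, an |h| near 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large.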

eval-view

Catch AI agent regressions before you ship. YAML test cases, golden baselines, execution tracing, cost tracking, CI integration. LangGraph, CrewAI, Anthropic, OpenAI.

hidai25

agent

agent-benchmark

agent-evaluation

agentic-ai