lechmazur
confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
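A minimal sketch of how a document-grounded QA evaluation loop could look, assuming an exact-match scorer and a placeholder model call; the actual harness presumably uses an LLM judge rather than string comparison:

```python
# Minimal sketch of a document-grounded QA evaluation loop.
# `ask_model` and the exact-match scorer are illustrative placeholders,
# not the repo's actual harness.
from dataclasses import dataclass

@dataclass
class QAItem:
    document: str         # source text the answer must be grounded in
    question: str
    verified_answer: str  # human-verified reference answer

def ask_model(document: str, question: str) -> str:
    """Placeholder for an LLM call; swap in a real API client."""
    return "..."

def evaluate(items: list[QAItem]) -> float:
    """Fraction of answers matching the human-verified reference
    (a real harness would use an LLM judge, not exact match)."""
    correct = sum(
        ask_model(i.document, i.question).strip().lower()
        == i.verified_answer.strip().lower()
        for i in items
    )
    return correct / len(items)

items = [QAItem("The meeting moved to Tuesday.", "When is the meeting?", "Tuesday")]
print(evaluate(items))  # 0.0 with the stub model above
```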
deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
bazaar
The BAZAAR challenges LLMs to navigate a double-auction marketplace in which buyers and sellers make strategic decisions under incomplete information. Each agent receives a private value and must decide what to bid or ask without knowing the other participants' valuations.
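A minimal sketch of one clearing step in a double auction, assuming the best bid is matched with the best ask at the midpoint price when the market crosses; the benchmark's actual matching and pricing rules may differ:

```python
# Minimal sketch of one clearing step in a double auction. Midpoint
# pricing is an assumption for illustration, not the BAZAAR spec.

def clear_once(bids: dict[str, float], asks: dict[str, float],
               values: dict[str, float], costs: dict[str, float]):
    buyer = max(bids, key=bids.get)    # highest bidder
    seller = min(asks, key=asks.get)   # lowest asker
    if bids[buyer] < asks[seller]:
        return None                    # market doesn't cross: no trade
    price = (bids[buyer] + asks[seller]) / 2
    buyer_profit = values[buyer] - price   # surplus over private value
    seller_profit = price - costs[seller]  # surplus over private cost
    return buyer, seller, price, buyer_profit, seller_profit

# Buyer B1 values the good at 10 and bids 8; seller S1's cost is 4
# and asks 6 -> they trade at 7, each earning a surplus of 3.
print(clear_once({"B1": 8.0}, {"S1": 6.0}, {"B1": 10.0}, {"S1": 4.0}))
```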
step_game
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move.
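A minimal sketch of one way such a step race can work, assuming a move set of {1, 3, 5}, a collision rule under which matching picks are blocked, and a target of 21 steps; random picks stand in for the LLMs' secret choices:

```python
import random
from collections import Counter

# Minimal sketch of a step race. The move set {1, 3, 5}, the collision
# rule (players who pick the same number are blocked that round), and
# the target of 21 steps are assumptions for illustration.
MOVES = (1, 3, 5)
TARGET = 21

def play_round(positions: dict[str, int], picks: dict[str, int]) -> None:
    counts = Counter(picks.values())
    for player, pick in picks.items():
        if counts[pick] == 1:          # unique pick: the move succeeds
            positions[player] += pick  # colliding picks go nowhere

positions = {"P1": 0, "P2": 0, "P3": 0}
while max(positions.values()) < TARGET:
    # Random picks stand in for the secret choices the LLMs make
    # after the public conversation phase.
    play_round(positions, {p: random.choice(MOVES) for p in positions})
print(positions)
```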
divergent
LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words starting with a given letter, with no connections to each other or to 50 initial random words.
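A minimal sketch of the structural checks on a submission (25 words, all unique, correct starting letter); scoring the "no connections" requirement would need embeddings or an LLM judge and is out of scope here:

```python
# Minimal sketch of the structural checks on a submission: 25 words,
# all unique, all starting with the required letter. The semantic
# "no connections" criterion is not checkable with string logic.

def is_valid_submission(words: list[str], letter: str) -> bool:
    normalized = [w.strip().lower() for w in words]
    return (
        len(normalized) == 25
        and len(set(normalized)) == 25                     # all unique
        and all(w.startswith(letter.lower()) for w in normalized)
    )

print(is_valid_submission(["apple"] * 25, "a"))  # False: duplicates
```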
writing
This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) into a short creative story.
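A minimal sketch of a surface-level filter that flags mandatory elements a story never mentions; the element names below are illustrative, and judging how well elements are actually woven in would require model-based grading:

```python
# Minimal sketch of a surface-level filter: flag mandatory elements
# the story never mentions. Substring matching is only a first pass.

def missing_elements(story: str, elements: list[str]) -> list[str]:
    text = story.lower()
    return [e for e in elements if e.lower() not in text]

story = "At dawn the lighthouse keeper polished the brass compass."
required = ["lighthouse keeper", "brass compass", "dawn"]  # illustrative
print(missing_elements(story, required))  # [] -> every element appears
```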
elimination_game
A multi-player tournament benchmark that tests LLMs on social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.
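A minimal sketch of the vote-tally step that closes a round, assuming plurality voting with random tie-breaking; the benchmark's actual tie rules may differ:

```python
import random
from collections import Counter

# Minimal sketch of one elimination vote. Plurality with random
# tie-breaking is an assumption, not the benchmark's documented rule.

def eliminate(votes: dict[str, str]) -> str:
    """votes maps each voter to their chosen target."""
    tally = Counter(votes.values())
    top = max(tally.values())
    tied = [player for player, n in tally.items() if n == top]
    return random.choice(tied)

print(eliminate({"P1": "P3", "P2": "P3", "P3": "P1", "P4": "P1"}))  # P1 or P3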
pgg_bench
Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing economic game.
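A minimal sketch of one Contribute & Punish round, using the textbook public-goods defaults (a 1.6 pot multiplier, an endowment of 10, and a 1:3 punishment cost ratio) as assumptions; the benchmark's parameters may differ:

```python
# Minimal sketch of one Contribute & Punish round. The multiplier,
# endowment, and punishment ratio are textbook defaults assumed here.
MULTIPLIER = 1.6
ENDOWMENT = 10.0
PUNISH_COST = 1.0  # punisher pays 1 per unit...
PUNISH_HIT = 3.0   # ...to remove 3 from the target

def payoffs(contributions: dict[str, float],
            punishments: dict[tuple[str, str], float]) -> dict[str, float]:
    """contributions: player -> amount put into the shared pot.
    punishments: (punisher, target) -> punishment units spent."""
    share = MULTIPLIER * sum(contributions.values()) / len(contributions)
    result = {p: ENDOWMENT - c + share for p, c in contributions.items()}
    for (punisher, target), units in punishments.items():
        result[punisher] -= PUNISH_COST * units
        result[target] -= PUNISH_HIT * units
    return result

# P1 free-rides while P2 and P3 contribute fully; P2 then spends 2
# units to punish P1, cutting P1's round payoff by 6.
print(payoffs({"P1": 0, "P2": 10, "P3": 10}, {("P2", "P1"): 2}))
```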