lechmazur

8 repositories owned by lechmazur

confabulations

240 stars · 7 forks · 240 watchers

A document-based benchmark for hallucinations (confabulations) in RAG. Includes human-verified questions and answers.

deception

31 stars · 2 forks · 31 watchers

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...

bazaar

35 stars · 4 forks · 35 watchers

The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomplete information. Each agent receives a private value and must...
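The double-auction setting can be sketched as a single sealed-bid clearing round. This is a generic illustration of the market structure, not BAZAAR's actual matching or pricing rules; agent names and values are made up.

```python
# Sketch of one sealed-bid double-auction round: highest bids are
# matched with lowest asks, and a trade occurs whenever bid >= ask.
# Illustrative only; BAZAAR's real mechanics are defined in the repo.

def clear_double_auction(bids, asks):
    """bids/asks: lists of (agent_id, price).

    Returns a list of (buyer, seller, trade_price), pricing each
    trade at the midpoint of the matched bid and ask.
    """
    bids = sorted(bids, key=lambda x: -x[1])  # highest bid first
    asks = sorted(asks, key=lambda x: x[1])   # lowest ask first
    trades = []
    for (buyer, bid), (seller, ask) in zip(bids, asks):
        if bid < ask:
            break  # no further profitable matches possible
        trades.append((buyer, seller, (bid + ask) / 2))
    return trades

# A buyer profits only if its private value exceeds the trade price,
# so an agent must decide how far below its value to bid.
trades = clear_double_auction(
    bids=[("B1", 90), ("B2", 60)],
    asks=[("S1", 50), ("S2", 80)],
)
# B1 (bid 90) matches S1 (ask 50) at 70.0; B2's bid 60 < S2's ask 80.
```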

step_game

81 stars · 2 forks · 81 watchers

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a mo...
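A plausible collision rule for this kind of step race can be sketched as follows. The actual rules and scoring live in the repo; this only illustrates the "secretly pick a move" tension, where players who pick the same step block each other.

```python
# Hypothetical resolution of one step-race round: each player secretly
# picks a step size, and only players with a unique pick advance.
# This rule is an illustrative assumption, not the benchmark's spec.
from collections import Counter

def resolve_round(moves):
    """moves: {player: step}. Returns {player: steps_advanced}."""
    counts = Counter(moves.values())
    return {p: (s if counts[s] == 1 else 0) for p, s in moves.items()}

# A and B both pick 5 and collide; only C advances.
print(resolve_round({"A": 5, "B": 5, "C": 3}))  # {'A': 0, 'B': 0, 'C': 3}
```

Under a rule like this, the public conversation phase matters: convincing an opponent to switch (or not) directly changes who moves.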

divergent

33 stars · 1 fork · 33 watchers

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each other or to 50 initial random words.
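The surface constraints in that description (25 unique words, all starting with the given letter) can be checked mechanically. This sketch covers only those stated constraints; the benchmark's semantic "no connections" scoring is separate and lives in the repo.

```python
# Validity check for a divergent-thinking submission, covering only
# the constraints stated in the description: exactly n unique words,
# each starting with the required letter. Semantic-distance scoring
# is out of scope here.

def valid_submission(words, letter, n=25):
    words = [w.lower() for w in words]
    return (
        len(words) == n
        and len(set(words)) == n                         # all unique
        and all(w.startswith(letter.lower()) for w in words)
    )

print(valid_submission(["apple", "anchor", "apple"], "a", n=3))  # False: duplicate
```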

writing

327 stars · 7 forks · 327 watchers

This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) in a short creative story.

elimination_game

295 stars · 11 forks · 295 watchers

A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.

pgg_bench

39 stars · 2 forks · 39 watchers

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing economi...
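The underlying game is the standard public goods game, whose payoff can be sketched directly. The endowment and multiplier below are illustrative assumptions; the benchmark's actual parameters and its punishment phase are defined in the repo.

```python
# Standard public-goods payoff: each agent keeps what it doesn't
# contribute, and the pooled contributions are multiplied and split
# equally. Parameters here are illustrative, not pgg_bench's own.

def pgg_payoffs(contributions, endowment=20, multiplier=1.6):
    n = len(contributions)
    pot_share = multiplier * sum(contributions) / n
    return [endowment - c + pot_share for c in contributions]

# Full cooperation beats full defection collectively, but each agent
# is individually tempted to free-ride on the others' contributions.
print(pgg_payoffs([20, 20, 20]))  # everyone cooperates: [32.0, 32.0, 32.0]
print(pgg_payoffs([0, 20, 20]))   # the free-rider earns the most
```

This tension between the cooperative optimum and the individually dominant strategy is exactly what the benchmark probes in LLM agents.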