lechmazur

8 repositories owned by lechmazur

confabulations

240 stars · 7 forks · 240 watchers

A document-based benchmark for hallucinations (confabulations) in RAG. Includes human-verified questions and answers.

deception

31 stars · 2 forks · 31 watchers

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...

bazaar

35 stars · 4 forks · 35 watchers

The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomplete information. Each agent receives a private value and must...
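The double-auction setting can be sketched as a single sealed-bid clearing round. This is a generic illustration of the market structure, not BAZAAR's actual matching or pricing rules; agent names and values are made up.

```python
# Sketch of one sealed-bid double-auction round: highest bids are
# matched with lowest asks, and a trade occurs whenever bid >= ask.
# Illustrative only; BAZAAR's real mechanics are defined in the repo.

def clear_double_auction(bids, asks):
    """bids/asks: lists of (agent_id, price).

    Returns a list of (buyer, seller, trade_price), pricing each
    trade at the midpoint of the matched bid and ask.
    """
    bids = sorted(bids, key=lambda x: -x[1])  # highest bid first
    asks = sorted(asks, key=lambda x: x[1])   # lowest ask first
    trades = []
    for (buyer, bid), (seller, ask) in zip(bids, asks):
        if bid < ask:
            break  # no further profitable matches possible
        trades.append((buyer, seller, (bid + ask) / 2))
    return trades

# A buyer profits only if its private value exceeds the trade price,
# so an agent must decide how far below its value to bid.
trades = clear_double_auction(
    bids=[("B1", 90), ("B2", 60)],
    asks=[("S1", 50), ("S2", 80)],
)
# B1 (bid 90) matches S1 (ask 50) at 70.0; B2's bid 60 < S2's ask 80.
```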

step_game

81 stars · 2 forks · 81 watchers

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a mo...
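A plausible collision rule for this kind of step race can be sketched as follows. The actual rules and scoring live in the repo; this only illustrates the "secretly pick a move" tension, where players who pick the same step block each other.

```python
# Hypothetical resolution of one step-race round: each player secretly
# picks a step size, and only players with a unique pick advance.
# This rule is an illustrative assumption, not the benchmark's spec.
from collections import Counter

def resolve_round(moves):
    """moves: {player: step}. Returns {player: steps_advanced}."""
    counts = Counter(moves.values())
    return {p: (s if counts[s] == 1 else 0) for p, s in moves.items()}

# A and B both pick 5 and collide; only C advances.
print(resolve_round({"A": 5, "B": 5, "C": 3}))  # {'A': 0, 'B': 0, 'C': 3}
```

Under a rule like this, the public conversation phase matters: convincing an opponent to switch (or not) directly changes who moves.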

divergent

33 stars · 1 fork · 33 watchers

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each other or to 50 initial random words.
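The surface constraints in that description (25 unique words, all starting with the given letter) can be checked mechanically. This sketch covers only those stated constraints; the benchmark's semantic "no connections" scoring is separate and lives in the repo.

```python
# Validity check for a divergent-thinking submission, covering only
# the constraints stated in the description: exactly n unique words,
# each starting with the required letter. Semantic-distance scoring
# is out of scope here.

def valid_submission(words, letter, n=25):
    words = [w.lower() for w in words]
    return (
        len(words) == n
        and len(set(words)) == n                         # all unique
        and all(w.startswith(letter.lower()) for w in words)
    )

print(valid_submission(["apple", "anchor", "apple"], "a", n=3))  # False: duplicate
```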

writing

327 stars · 7 forks · 327 watchers

This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) in a short creative story.

elimination_game

295 stars · 11 forks · 295 watchers

A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.

pgg_bench

39 stars · 2 forks · 39 watchers

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing economi...
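The underlying game is the standard public goods game, whose payoff can be sketched directly. The endowment and multiplier below are illustrative assumptions; the benchmark's actual parameters and its punishment phase are defined in the repo.

```python
# Standard public-goods payoff: each agent keeps what it doesn't
# contribute, and the pooled contributions are multiplied and split
# equally. Parameters here are illustrative, not pgg_bench's own.

def pgg_payoffs(contributions, endowment=20, multiplier=1.6):
    n = len(contributions)
    pot_share = multiplier * sum(contributions) / n
    return [endowment - c + pot_share for c in contributions]

# Full cooperation beats full defection collectively, but each agent
# is individually tempted to free-ride on the others' contributions.
print(pgg_payoffs([20, 20, 20]))  # everyone cooperates: [32.0, 32.0, 32.0]
print(pgg_payoffs([0, 20, 20]))   # the free-rider earns the most
```

This tension between the cooperative optimum and the individually dominant strategy is exactly what the benchmark probes in LLM agents.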