🌩️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models
KUMO is a novel benchmark for systematically evaluating complex reasoning capabilities in Large Language Models (LLMs) through procedurally generated reasoning games. This repository contains the official implementation of our research paper.
🚀 Quick Links
- 📂 Benchmark Dataset
- 📑 Benchmark Format
- ⚙️ Environment Setup
- 📈 Evaluation
- 🛠️ Dataset Generation
📂 Benchmark Dataset
The KUMO benchmark introduces procedurally generated reasoning games structured around:
- 🔍 Truth Set ($T$): Possible truths.
- 🎯 Action Set ($A$): Available actions.
- 🌟 Outcomes ($\mathcal{O}$): Action-based outcomes.
- 📚 Knowledge Book ($K$): Detailed guidelines linking truths, actions, and outcomes.
Gameplay Mechanics:
- A valid truth ($t^*$) is secretly chosen.
- Players take actions and observe outcomes.
- The player's goal is to deduce the hidden truth in as few actions as possible, using logic and reasoning.
🧑‍⚕️ Example Scenario: diagnosing diseases using medical tests (see the sketch below).
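The flow of one game round can be sketched in a few lines of Python. Everything below (truths, tests, outcomes) is made up for illustration and is not the repository's API; it only shows how observing outcomes narrows the set of candidate truths:

```python
import random

# Toy truth set T and action set A; each action maps every truth to an observable outcome.
truths = ["Flu", "Covid", "Allergy", "Cold"]
actions = {
    "PCR test":    {"Flu": "negative", "Covid": "positive", "Allergy": "negative", "Cold": "negative"},
    "Pollen test": {"Flu": "negative", "Covid": "negative", "Allergy": "positive", "Cold": "negative"},
    "Blood test":  {"Flu": "positive", "Covid": "negative", "Allergy": "negative", "Cold": "negative"},
}

hidden_truth = random.choice(truths)   # the secretly chosen valid truth t*
candidates = set(truths)

for action, outcome_of in actions.items():   # the player acts and observes an outcome
    outcome = outcome_of[hidden_truth]
    candidates = {t for t in candidates if outcome_of[t] == outcome}
    print(f"{action}: {outcome} -> remaining candidates: {sorted(candidates)}")

print("Hidden truth was:", hidden_truth)
```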
📌 Provided Domains:
- 100 autogenerated exemplar domains
- Categories: Computer Science, Biology, Art, and more
- Typical domain: ~50 truths, ~30 actions

📑 Benchmark Format
The KUMO dataset is provided in JSON format (one task per line in a .jsonl file), which simplifies integration and customization. The data lives under kumo/env:
```
kumo/
└── env/
    ├── data/
    │   └── [DomainName]_data.py
    ├── [DomainName]/
    │   ├── knowledge_book/
    │   │   └── truth_num=4+action_num=6+valid_truth_num=1/
    │   │       ├── seed=0.txt
    │   │       └── ...
    │   └── truth_num=4+action_num=6+valid_truth_num=1.jsonl
    └── [DomainName].py
```
⚙️ Customize parameters (truth_num, action_num, etc.) easily for tailored benchmarking.
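As a quick orientation, the snippet below reads one generated setting with plain Python. It assumes the MedicalEnv domain used later in this README and the example setting from the tree above (paths are relative to the repository root); the JSONL field names are whatever the generator wrote, so it only prints them:

```python
import json
from pathlib import Path

env = Path("env")  # per the directory tree above
setting = "truth_num=4+action_num=6+valid_truth_num=1"

# Task instances: one JSON object per line.
tasks_file = env / "MedicalEnv" / f"{setting}.jsonl"
tasks = [json.loads(line) for line in tasks_file.read_text().splitlines() if line]
print(len(tasks), "tasks; fields of the first one:", sorted(tasks[0].keys()))

# The matching knowledge book for seed 0 is plain text.
book = (env / "MedicalEnv" / "knowledge_book" / setting / "seed=0.txt").read_text()
print(book[:300])
```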
⚙️ Environment Setup
🔽 Clone the Repository
```bash
git clone https://github.com/linhaowei1/kumo.git
cd kumo
```
📦 Install Dependencies
Recommended: a Conda environment with Python 3.10–3.12:
```bash
conda create -n kumo python=3.12
conda activate kumo
pip install -r requirements.txt
```
Hardware requirements: a CPU is sufficient; no GPU is needed, since model inference happens through API calls.
📈 Evaluation
We recommend calling LLMs through the OpenAI API. Edit examples/main.sh to add your own API key and model name. Results are written to results/.
Expected runtime depends on API latency; with this script, GPT-4o takes roughly 3 hours to run all 100 easy-setting tasks.
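Stripped to its essentials, an API-based evaluation call looks roughly like the snippet below. The model name, prompt wording, and file path are placeholders rather than the repository's actual evaluation code; examples/main.sh remains the real entry point:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # or set the OPENAI_API_KEY environment variable

# Hand the knowledge book to the model and ask for its next action.
book = open("env/MedicalEnv/knowledge_book/"
            "truth_num=4+action_num=6+valid_truth_num=1/seed=0.txt").read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are playing a deduction game described by a knowledge book."},
        {"role": "user", "content": book + "\n\nWhich action do you take first, and why?"},
    ],
)
print(response.choices[0].message.content)
```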
🛠️ Dataset Generation
Create customized domains and scenarios with the pipeline below:
1️⃣ Seed Configuration
Generate domain configurations with an LLM:
```bash
python generate/config_generation.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --data_path ./templates/config_generation.jsonl
```
2️⃣ Task Instances via SAT Sampling
Generate task instances for a chosen domain:
```bash
python SAT_sampling.py \
    --truth_num 4 \
    --action_num 6 \
    --valid_truth_num 1 \
    --data_num 50 \
    --domain MedicalEnv
```
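Conceptually, the sampler has to find configurations in which exactly valid_truth_num truths remain consistent with the game's constraints. The toy z3 snippet below (pip install z3-solver) illustrates that style of constraint solving; it is not the encoding used by SAT_sampling.py:

```python
from z3 import Bools, Solver, Or, And, Not, sat

# Four candidate truths; the sampled game must admit exactly one valid truth.
t = Bools("t0 t1 t2 t3")
s = Solver()

s.add(Or(t))                          # at least one truth is valid
for i in range(len(t)):
    for j in range(i + 1, len(t)):
        s.add(Not(And(t[i], t[j])))   # at most one truth is valid

# A toy action constraint: some observed outcome is only possible under t0 or t2.
s.add(Or(t[0], t[2]))

if s.check() == sat:
    print(s.model())                  # one satisfying configuration
```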
3️⃣ Knowledge Book Generation
Automatically build detailed knowledge books:
```bash
python knowledge_book_generation.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --data_num 50 \
    --truth_num 4 \
    --action_num 6 \
    --valid_truth_num 1 \
    --domain MedicalEnv
```
🔗 Example
4️⃣ Optional Knowledge Book Refinement
Improve generated knowledge books:
```bash
python generate/knowledge_book_revision.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --domain MedicalEnv \
    --revision_template_path ./templates/revision_template.md
```
💬 Support & Questions
For support, feedback, or inquiries, please:
- Open an issue on GitHub
- Contact the repository maintainers directly