🌩️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models
KUMO is a novel benchmark for systematically evaluating complex reasoning capabilities in Large Language Models (LLMs) through procedurally generated reasoning games. This repository contains the official implementation of our research paper.
🚀 Quick Links
- 📂 Benchmark Dataset
- 📑 Benchmark Format
- ⚙️ Environment Setup
- 📈 Evaluation
- 🛠️ Dataset Generation
📂 Benchmark Dataset
The KUMO benchmark introduces procedurally generated reasoning games structured around:
- 🔍 Truth Set ($T$): Possible truths.
- 🎯 Action Set ($A$): Available actions.
- 🌟 Outcomes ($\mathcal{O}$): Action-based outcomes.
- 📚 Knowledge Book ($K$): Detailed guidelines linking truths, actions, and outcomes.
Gameplay Mechanics:
- A valid truth ($t^*$) is secretly chosen.
- Players take actions and observe outcomes.
- The player's goal is to deduce the hidden truth in as few actions as possible, using logic and reasoning.
🧑‍⚕️ Example Scenario: diagnosing diseases using medical tests (see the sketch below).
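The flow of one game round can be sketched in a few lines of Python. Everything below (truths, tests, outcomes) is made up for illustration and is not the repository's API; it only shows how observing outcomes narrows the set of candidate truths:

```python
import random

# Toy truth set T and action set A; each action maps every truth to an observable outcome.
truths = ["Flu", "Covid", "Allergy", "Cold"]
actions = {
    "PCR test":    {"Flu": "negative", "Covid": "positive", "Allergy": "negative", "Cold": "negative"},
    "Pollen test": {"Flu": "negative", "Covid": "negative", "Allergy": "positive", "Cold": "negative"},
    "Blood test":  {"Flu": "positive", "Covid": "negative", "Allergy": "negative", "Cold": "negative"},
}

hidden_truth = random.choice(truths)   # the secretly chosen valid truth t*
candidates = set(truths)

for action, outcome_of in actions.items():   # the player acts and observes an outcome
    outcome = outcome_of[hidden_truth]
    candidates = {t for t in candidates if outcome_of[t] == outcome}
    print(f"{action}: {outcome} -> remaining candidates: {sorted(candidates)}")

print("Hidden truth was:", hidden_truth)
```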
📌 Provided Domains:
- 100 autogenerated exemplar domains
- Categories: Computer Science, Biology, Art, and more
- Typical domain: ~50 truths, ~30 actions

📑 Benchmark Format
The KUMO dataset is provided in JSON format (one task per line in a .jsonl file), which simplifies integration and customization. The data lives under kumo/env:
```
kumo/
└── env/
    ├── data/
    │   └── [DomainName]_data.py
    ├── [DomainName]/
    │   ├── knowledge_book/
    │   │   └── truth_num=4+action_num=6+valid_truth_num=1/
    │   │       ├── seed=0.txt
    │   │       └── ...
    │   └── truth_num=4+action_num=6+valid_truth_num=1.jsonl
    └── [DomainName].py
```
⚙️ Customize parameters (truth_num, action_num, etc.) easily for tailored benchmarking.
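As a quick orientation, the snippet below reads one generated setting with plain Python. It assumes the MedicalEnv domain used later in this README and the example setting from the tree above (paths are relative to the repository root); the JSONL field names are whatever the generator wrote, so it only prints them:

```python
import json
from pathlib import Path

env = Path("env")  # per the directory tree above
setting = "truth_num=4+action_num=6+valid_truth_num=1"

# Task instances: one JSON object per line.
tasks_file = env / "MedicalEnv" / f"{setting}.jsonl"
tasks = [json.loads(line) for line in tasks_file.read_text().splitlines() if line]
print(len(tasks), "tasks; fields of the first one:", sorted(tasks[0].keys()))

# The matching knowledge book for seed 0 is plain text.
book = (env / "MedicalEnv" / "knowledge_book" / setting / "seed=0.txt").read_text()
print(book[:300])
```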
⚙️ Environment Setup
🔽 Clone the Repository
```bash
git clone https://github.com/linhaowei1/kumo.git
cd kumo
```
📦 Install Dependencies
Recommended: a Conda environment with Python 3.10–3.12:
```bash
conda create -n kumo python=3.12
conda activate kumo
pip install -r requirements.txt
```
Hardware requirements: a CPU is sufficient; no GPU is needed, since model inference happens through API calls.
📈 Evaluation
We recommend calling LLMs through the OpenAI API. Edit examples/main.sh to add your own API key and model name. Results are written to results/.
Expected runtime depends on API latency; with this script, GPT-4o takes roughly 3 hours to run all 100 easy-setting tasks.
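Stripped to its essentials, an API-based evaluation call looks roughly like the snippet below. The model name, prompt wording, and file path are placeholders rather than the repository's actual evaluation code; examples/main.sh remains the real entry point:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # or set the OPENAI_API_KEY environment variable

# Hand the knowledge book to the model and ask for its next action.
book = open("env/MedicalEnv/knowledge_book/"
            "truth_num=4+action_num=6+valid_truth_num=1/seed=0.txt").read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are playing a deduction game described by a knowledge book."},
        {"role": "user", "content": book + "\n\nWhich action do you take first, and why?"},
    ],
)
print(response.choices[0].message.content)
```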
🛠️ Dataset Generation
Create customized domains and scenarios with the pipeline below:
1️⃣ Seed Configuration
Generate domain configurations with an LLM:
```bash
python generate/config_generation.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --data_path ./templates/config_generation.jsonl
```
2️⃣ Task Instances via SAT Sampling
Generate task instances for a chosen domain:
```bash
python SAT_sampling.py \
    --truth_num 4 \
    --action_num 6 \
    --valid_truth_num 1 \
    --data_num 50 \
    --domain MedicalEnv
```
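Conceptually, the sampler has to find configurations in which exactly valid_truth_num truths remain consistent with the game's constraints. The toy z3 snippet below (pip install z3-solver) illustrates that style of constraint solving; it is not the encoding used by SAT_sampling.py:

```python
from z3 import Bools, Solver, Or, And, Not, sat

# Four candidate truths; the sampled game must admit exactly one valid truth.
t = Bools("t0 t1 t2 t3")
s = Solver()

s.add(Or(t))                          # at least one truth is valid
for i in range(len(t)):
    for j in range(i + 1, len(t)):
        s.add(Not(And(t[i], t[j])))   # at most one truth is valid

# A toy action constraint: some observed outcome is only possible under t0 or t2.
s.add(Or(t[0], t[2]))

if s.check() == sat:
    print(s.model())                  # one satisfying configuration
```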
3️⃣ Knowledge Book Generation
Automatically build detailed knowledge books:
```bash
python knowledge_book_generation.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --data_num 50 \
    --truth_num 4 \
    --action_num 6 \
    --valid_truth_num 1 \
    --domain MedicalEnv
```
🔗 Example
4️⃣ Optional Knowledge Book Refinement
Improve generated knowledge books:
```bash
python generate/knowledge_book_revision.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --domain MedicalEnv \
    --revision_template_path ./templates/revision_template.md
```
💬 Support & Questions
For support, feedback, or inquiries, please:
- Open an issue on GitHub
- Contact the repository maintainers directly