DeepSearch
This is the official code of the DeepSearch paper!
✅ DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
📖 Overview
DeepSearch is a novel reinforcement learning framework that addresses the fundamental challenge of balancing exploration breadth and depth in Monte Carlo Tree Search (MCTS). By introducing a rollout-guided exploration mechanism with adaptive replay buffers, DeepSearch achieves superior performance on mathematical reasoning tasks.
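For context, the breadth-versus-depth tension can be seen in the standard UCT selection rule that MCTS methods build on. The sketch below is a generic Python illustration of UCT, not the DeepSearch rollout-guided selection rule itself (see the paper for that):
# Generic UCT selection used by MCTS-style search (illustrative only; the
# actual DeepSearch selection rule is described in the paper).
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    def uct_score(self, parent_visits: int, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")  # expand unvisited children first
        exploit = self.value_sum / self.visits  # depth: follow good paths
        explore = c * math.sqrt(math.log(parent_visits) / self.visits)  # breadth
        return exploit + explore

def select_child(node: Node) -> Node:
    return max(node.children, key=lambda ch: ch.uct_score(node.visits))
Unvisited children score infinity, so every branch is expanded at least once (breadth) before the exploitation term steers the search toward promising paths (depth).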
Built on VERL commit 2bd291e5494db03ba358ef279a334c2f0829b979
🔧 Environment Setup
Prerequisites
DeepSearch requires the following dependencies:
- Python >= 3.10
- VERL framework
- SGLang for efficient inference
Installation
1. Install the VERL framework:
cd DeepSearch
# Follow the official VERL installation guide
# https://verl.readthedocs.io/en/latest/start/install.html
2. Install SGLang:
# Follow the official SGLang installation guide
# https://docs.sglang.ai/get_started/install.html
pip install sglang
3. Install DeepSearch:
cd DeepSearch
pip install -e .  # to activate `import deepsearch`
🚀 Quick Start
Inference with vLLM
Our pre-trained model fangwu97/DeepSearch-1.5B supports efficient inference using vLLM.
Install dependencies:
pip install "vllm>=0.8.5.post1"
pip install "transformers>=4.52.4"
Example usage:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
def convert_question_to_messages(question: str):
    messages = [
        {"role": "user",
         "content": question + " Let's think step by step and output the final answer within \\boxed{}."}
    ]
    return messages
model_id = "fangwu97/DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768
)
model = LLM(
    model=model_id,
    tensor_parallel_size=1
)
prompt = tokenizer.apply_chat_template(
    convert_question_to_messages("Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."),
    add_generation_prompt=True,
    tokenize=False
)
outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
🎓 Training
DeepSearch employs a multi-round training strategy with iterative data filtering and replay buffer accumulation to progressively improve model performance.
Dataset
We provide the complete training data used in our experiments under the data/ directory:
data/
├── deepmath103k_round0/
│ ├── train_0.parquet # Problems with Pass1@4=0% on nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
│ └── train_25.parquet # Problems with Pass1@4=25% on nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
├── deepmath103k_round1/
│ ├── train_0.parquet # Problems with Pass1@4=0% on Round0 best checkpoint
│ └── train_25.parquet # Problems with Pass1@4=25% on Round0 best checkpoint
└── val/
└── aime2025_boxed_n32.parquet # Validation set (placeholder, not actively used)
Data source: zwhe99/DeepMath-103K
The dataset is progressively filtered based on model performance across training rounds, focusing on challenging problems that the model has not yet mastered.
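If you want to sanity-check a split before training, a quick look with pandas is enough. The snippet below does not assume any particular column names, since the exact schema follows VERL's parquet conventions:
# Peek at one training split; column names depend on VERL's parquet schema,
# so we print them instead of hard-coding field names.
import pandas as pd

df = pd.read_parquet("data/deepmath103k_round0/train_0.parquet")
print(df.shape)             # number of problems and columns
print(df.columns.tolist())  # available fields
print(df.iloc[0])           # one example row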
Round 0: Initial Training
Launch training:
bash round0.sh
Output structure:
results/
└── deepsearch_1_5b_round0/
├── checkpoints/ # Model checkpoints saved every 5 steps
│ ├── global_step_5/
│ ├── global_step_10/
│ └── ...
└── rollout_data/ # MCTS rollout trajectories saved every step
├── 1.jsonl
├── 2.jsonl
└── ...
Convert checkpoints to HuggingFace format:
Follow the VERL checkpoint conversion guide to merge FSDP/Megatron checkpoints into standard HuggingFace format.
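For reference, one possible invocation is sketched below; the merger's module path, subcommand, and flags vary across VERL versions (older versions ship it as scripts/model_merger.py), so treat this as an assumption and defer to the guide above:
# Example only: adjust module path, flags, and checkpoint step to your VERL version
python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir results/deepsearch_1_5b_round0/checkpoints/global_step_5/actor \
    --target_dir results/deepsearch_1_5b_round0/hf_ckpt/global_step_5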
Replay Buffer Registration (Round 0)
Register replay buffer from rollout data:
python deepsearch/data_setup/register_replay_buffer.py \
--rollout_dir results/deepsearch_1_5b_round0/rollout_data
Output:
results/
└── deepsearch_1_5b_round0/
└── rollout_data/
├── 1.jsonl
├── 2.jsonl
├── ...
└── replay_buffer_0.25.json # Replay buffer with 25% success threshold
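To sanity-check the registered buffer, you can load it and report its size. The snippet below deliberately avoids assuming the file's internal schema, which is defined by register_replay_buffer.py:
# Sanity-check the registered replay buffer without assuming its schema.
import json

with open("results/deepsearch_1_5b_round0/rollout_data/replay_buffer_0.25.json") as f:
    buffer = json.load(f)

print(type(buffer))
print(len(buffer))  # number of top-level entries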
Data Filtering for Round 1
We provide pre-filtered training data in data/deepmath103k_round1/. To create your own filtered dataset:
1. Generate predictions using the best Round 0 checkpoint:
- Perform n=4 generations per problem on data/deepmath103k_round0
- Follow the vLLM inference example shown above
2. Calculate Pass1@4 scores:
- Use the evaluation metric from deepsearch/data_setup/register_replay_buffer.py
3. Filter data based on performance (see the sketch after this list):
- Keep problems with Pass1@4 = 0% (still challenging)
- Keep problems with Pass1@4 = 25% (moderate difficulty)
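Below is a minimal sketch of the Pass1@4 bucketing in steps 2-3, assuming you have already judged each of the 4 generations per problem as correct or incorrect; the records container and its contents are illustrative placeholders, not part of the repo:
# Compute Pass1@4 per problem and keep only the 0% and 25% buckets.
# `records` maps a problem id to 4 booleans (one per sampled generation);
# replace it with the output of your own verification step.
def pass1_at_4(correct_flags):
    return sum(correct_flags) / len(correct_flags)

records = {
    "problem_0": [False, False, False, False],  # Pass1@4 = 0%
    "problem_1": [True, False, False, False],   # Pass1@4 = 25%
    "problem_2": [True, True, False, False],    # Pass1@4 = 50% -> dropped
}

kept = {pid: flags for pid, flags in records.items()
        if pass1_at_4(flags) in (0.0, 0.25)}
print(sorted(kept))  # ['problem_0', 'problem_1']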
Round 1: Training with Replay Buffer
Launch Round 1 training:
bash round1.sh
This round incorporates:
- ✅ Filtered training data from Round 0 evaluation
- ✅ Accumulated replay buffer from Round 0
- ✅ Same rollout-guided MCTS exploration strategy
Output structure:
results/
└── deepsearch_1_5b_round1/
├── checkpoints/ # Round 1 checkpoints
│ ├── global_step_5/
│ └── ...
└── rollout_data/ # Round 1 rollout data
├── 1.jsonl
└── ...
Extending to More Rounds
To continue training for additional rounds (Round 2, 3, ...):
1. Register a new replay buffer:
python deepsearch/data_setup/register_replay_buffer.py \
    --rollout_dir results/deepsearch_1_5b_round{N}/rollout_data
⚠️ Important: Combine the new replay buffer with previous ones to accumulate high-quality trajectories (a merge sketch follows these steps).
2. Filter training data:
- Evaluate the current round's best model on the previous round's training data
- Select problems with Pass1@4 = 0% or 25%
- Update data.train_files in the training configuration
3. Update training configuration:
- Set data.train_files to the newly filtered dataset path
- Set actor_rollout_ref.rollout.mcts_config.replay_buffer to the accumulated replay buffer path
4. Launch the next round:
bash round{N}.sh
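For the accumulation step above, a minimal merge sketch is shown below. It assumes each replay buffer file deserializes to either a list of entries or a dict keyed by problem id, and that later rounds reuse the Round 0 filename pattern; the actual schema is determined by register_replay_buffer.py, so verify before use:
# Merge the new round's replay buffer into the accumulated one.
# Assumption: each file is JSON holding either a list of trajectory entries
# or a dict keyed by problem id; adjust to the schema that
# register_replay_buffer.py actually writes.
import json

def load(path):
    with open(path) as f:
        return json.load(f)

old = load("results/deepsearch_1_5b_round0/rollout_data/replay_buffer_0.25.json")
new = load("results/deepsearch_1_5b_round1/rollout_data/replay_buffer_0.25.json")

if isinstance(old, dict):
    merged = {**old, **new}  # new round overrides duplicate problem ids
else:
    merged = old + new       # simple concatenation for list-style buffers

with open("replay_buffer_accumulated.json", "w") as f:
    json.dump(merged, f)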
📊 Results
DeepSearch achieves competitive performance on challenging mathematical reasoning benchmarks. Please refer to our paper for detailed experimental results and analysis.
📝 Citation
If you find DeepSearch useful in your research, please consider citing:
@article{wu2025deepsearch,
title={DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
author={Wu, Fang and Xuan, Weihao and Qi, Heli and Lu, Ximing and Tu, Aaron and Li, Li Erran and Choi, Yejin},
journal={arXiv preprint arXiv:2509.25454},
year={2025}
}
🙏 Acknowledgments
- Built on the VERL reinforcement learning framework
- Training data sourced from DeepMath-103K
- Base model: nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
🤝 Contributing
We welcome contributions! Please feel free to submit issues or pull requests.
📧 Contact
For questions or feedback, please open an issue on GitHub.