# Open-LLaVA-Video-R1
[LLaVA-Video-R1]✨First Adaptation of R1 to LLaVA-Video (2025-03-18)
The open-source code currently available for multimodal DeepSeek-R1/GRPO is predominantly based on Qwen2VL. However, in video understanding, LLaVA-Video, one of the most important baselines, still has no related open-source project (as of 2025/03/18). We therefore aim to fill this gap by releasing a codebase, Open-LLaVA-Video-R1.
## News
- [2025/03/19] We release the codebase of Open-LLaVA-Video-R1.
## What we did
To the best of our knowledge, we are the first to adapt R1/GRPO to the LLaVA-Video architecture. Specifically, we train LLaVA-Video using GRPO with accuracy and format rewards on the DVD-counting dataset. Training the 7B model on the DVD dataset takes approximately 5.5 hours on 8 x A800 (80 GB) GPUs. The training curve is as follows:
## Performance
The experimental setting is the same as the Qwen-based Video-R1, validated on the DVD-counting task. As shown in the table, an 11.5-point accuracy gain is observed after GRPO training on LLaVA-Video-Qwen.
| Dataset | LLaVA-Video-7B | LLaVA-Video-7B+GRPO |
|---|---|---|
| DVD-counting-test | 20.5 | 32.0 (11.5↑) |
## Setup
```bash
git clone https://github.com/Hui-design/Open-LLaVA-Video-R1.git
cd Open-LLaVA-Video-R1
```
Our environment is basically the same as Open-r1-video and r1-video. If you have already installed either of them, you can reuse that environment directly. Otherwise, you can try the following commands:
```bash
conda create -n LLaVA-Video-R1 python=3.10
conda activate LLaVA-Video-R1
pip3 install -e ".[dev]"
pip3 install flash_attn --no-build-isolation
```
## Dataset
We use the same task as r1-video, based on the DVD-counting dataset.
Our dataset organization is:
```
dvd_dataset
├── dvd
│   └── *.mp4
├── train_dvd.jsonl
└── test_dvd.jsonl
```
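Since the annotations are stored as JSON Lines, they can be loaded generically with the standard library; a minimal sketch (the field names inside each record depend on the dataset and are not shown here):

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Load one JSON object per line, skipping blank lines."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing/blank lines
                samples.append(json.loads(line))
    return samples

# Example usage:
# samples = load_jsonl("dvd_dataset/train_dvd.jsonl")
# print(len(samples), samples[0].keys())
```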
## GRPO on LLaVA-Video
First download LLaVA-Video-Qwen, then set `model_name_or_path` in `train_llava_video.sh`:
```bash
# to run GRPO on llava_video
bash train_llava_video.sh
```
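As a rough illustration of the rule-based accuracy and format rewards used in this style of GRPO training, here is a minimal sketch; the `<think>…</think><answer>…</answer>` template and function shapes follow the common R1 recipe and are assumptions, not necessarily this repo's exact implementation:

```python
import re

def format_reward(completion: str) -> float:
    # 1.0 if the completion follows the R1-style template:
    # reasoning inside <think>...</think>, final result inside <answer>...</answer>.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # 1.0 if the content of <answer> matches the ground-truth count exactly.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

GRPO then normalizes these rewards within each group of sampled completions to compute advantages, so no learned reward model is needed.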
## Evaluation
Evaluation on the video counting task:

```bash
python llava_video_inference.py
```
## Citation
```bibtex
@misc{Tang2025LlavaVideoR1,
  author       = {Canhui Tang},
  title        = {Open LLaVA-Video-R1},
  howpublished = {\url{https://github.com/Hui-design/Open-LLaVA-Video-R1}},
  note         = {Accessed: 2025-03-18},
  year         = {2025}
}
```
## Acknowledgements
We sincerely appreciate the contributions of the open-source community. The related projects are as follows: