FrameFusion
Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models (ICCV 2025)
FrameFusion reduces the number of tokens in Large Vision-Language Models (LVLMs) by combining similarity-based merging with importance-based pruning. It achieves a 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups with minimal performance impact.
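As a rough mental model of the two reduction steps (an illustrative toy sketch, not the actual FrameFusion implementation): merge vision tokens that nearly duplicate their temporal neighbor, then prune the least important of the remaining tokens. The thresholds, importance scores, and random inputs below are placeholders.
import torch
import torch.nn.functional as F

def toy_merge_then_prune(tokens, importance, sim_thresh=0.6, keep_ratio=0.3):
    # tokens: (N, D) vision-token features; importance: (N,) scores (e.g., from attention)
    # 1) similarity-based merging: drop tokens that nearly duplicate their predecessor
    #    (the real method fuses merged tokens rather than simply dropping them)
    normed = F.normalize(tokens, dim=-1)
    sim_to_prev = (normed[1:] * normed[:-1]).sum(-1)
    keep = torch.ones(tokens.size(0), dtype=torch.bool)
    keep[1:] = sim_to_prev < sim_thresh
    merged, merged_imp = tokens[keep], importance[keep]
    # 2) importance-based pruning: keep only the most important merged tokens
    k = min(max(1, int(keep_ratio * tokens.size(0))), merged.size(0))
    top = merged_imp.topk(k).indices.sort().values  # preserve temporal order
    return merged[top]

reduced = toy_merge_then_prune(torch.randn(1024, 128), torch.rand(1024))
print(reduced.shape)  # roughly 30% of the original 1024 tokens remain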
https://github.com/user-attachments/assets/bb9d3b25-6f21-4863-b7c8-27b88356fdcf
This demo can be reproduced with script/demo/llava_video_compare.py.
Feel free to star the repo or cite the paper if you find it interesting.
@inproceedings{fu2025framefusion,
title={FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models},
author={Fu, Tianyu and Liu, Tengxuan and Han, Qinghao and Dai, Guohao and Yan, Shengen and Yang, Huazhong and Ning, Xuefei and Wang, Yu},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={22654--22663},
year={2025}
}
News
- [2025/08] Update webpage, check our interactive demos here
- [2025/06] Our paper is accepted by ICCV'25
- [2025/05] Support Qwen2-VL and InternVL2.5
- [2025/04] Support the NVILA model family
Environment Setup
General
Create a new environment:
conda create -n framefusion python=3.10
conda activate framefusion
Install FrameFusion:
pip install -e .
Working with Other Models
Important: the NVILA and Llava-Video dependencies conflict with each other. FrameFusion supports both, but please install only one of them in a given environment.
Llava-Video
To install Llava-Video LVLM dependencies:
- Clone the LLaVA-NeXT repository:
  git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
  cd LLaVA-NeXT
- Install via:
  pip install -e .[llava-video]
NVILA
To install NVILA dependencies:
- Clone the VILA repository:
  git clone https://github.com/NVlabs/VILA.git
  cd VILA
- Run the environment setup script to install dependencies into the current conda environment:
  ./environment_setup.sh
- Install via:
  pip install -e .
Qwen2-VL
After the standard installation, reinstall transformers==4.51.3 to ensure version compatibility:
pip install -e .[qwen2-vl]
For all other models, continue using transformers==4.45.2.
How to
Run an example
We provide an example in script/playground/example_llava.py that runs inference on a video with the LLaVA-Video-7B model, with and without FrameFusion.
python script/playground/example_llava.py
Apply FrameFusion
You can apply FrameFusion to any huggingface model that supports the interface with just a few lines of code in your own project. Here is an example:
from llava.model.builder import load_pretrained_model
from framefusion.interface import apply_framefusion
# set attn_implementation to be sdpa
tokenizer, model, image_processor, max_length = load_pretrained_model("lmms-lab/LLaVA-Video-7B-Qwen2", None, "llava_qwen", torch_dtype="bfloat16", attn_implementation='sdpa', device_map="auto")
# apply FrameFusion
apply_framefusion(model, cost=0.3, similarity_lower_bound=0.6, ratio_lower_bound=0.1)
# use the model as usual
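For reference, the "use the model as usual" step can look like the sketch below, which continues the snippet above and follows the common LLaVA-Video inference flow (decord frame sampling, qwen_1_5 conversation template). The video path, frame count, prompt, and generation settings are illustrative placeholders; see script/playground/example_llava.py for the pipeline used in this repo.
import numpy as np
import torch
from decord import VideoReader, cpu
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

# sample 64 frames uniformly from a video (the path is a placeholder)
vr = VideoReader("example/video.mp4", ctx=cpu(0))
frame_idx = np.linspace(0, len(vr) - 1, 64, dtype=int).tolist()
frames = vr.get_batch(frame_idx).asnumpy()

# preprocess the frames and build a prompt with the qwen_1_5 conversation template
video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.bfloat16)
conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this video in detail.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# generate as usual; FrameFusion reduces the vision tokens transparently
output_ids = model.generate(input_ids, images=[video], modalities=["video"], do_sample=False, max_new_tokens=256)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])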
Evaluate FrameFusion
We use lmms-eval to evaluate FrameFusion.
To apply FrameFusion, clone the official lmms-eval repository, install it from source, and insert the following lines into evaluator.py after the standard model initialization of lm (around line 187):
from framefusion.interface import apply_framefusion
model_to_compress = getattr(lm, "_model", lm.model)
apply_framefusion(model_to_compress, cost=0.3, similarity_lower_bound=<S_th from our paper>, ratio_lower_bound=0.1)
Please refer to our paper for the recommended similarity_lower_bound (S_th) values for different models.
Example
As an example, you can evaluate FrameFusion on the LLaVA-Video-7B-Qwen2 model using the following command (replace $BENCHMARK with the lmms-eval task name you want to evaluate).
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port 28997 -m lmms_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,device_map=auto,max_frames_num=64,overwrite=False,force_sample=True,torch_dtype="bfloat16",add_time_instruction=True,conv_template=qwen_1_5 \
--tasks $BENCHMARK \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_vid_7b \
--output_path ./logs/
Adapt to new models
Understand Code Structure
- framefusion/: The main package for FrameFusion.
  - models/: The adapters for different models.
  - main.py: The main implementation of FrameFusion.
  - interface.py: The interface for applying FrameFusion.
- script/: Scripts for running experiments.
  - evaluate/: Scripts for evaluating model performance.
  - playground/: Scripts for running misc experiments.
- example/: Example input videos.
Modify the code
- Add a new model adapter in framefusion/models/; it applies FrameFusion after the attention module. Three model functions are required: llm_forward, decoder_forward, and attention_forward. These forward functions are easily adapted from the corresponding modeling_<MODEL>.py functions in huggingface transformers, and all modifications are marked with ### comments. For the LLM part, see framefusion/models/qwen2/modeling_qwen2.py as an example.
- Register the model in framefusion/interface.py; this applies FrameFusion to the correct model class (a generic sketch of this patching pattern follows this list).
- Add a new example in script/playground/ that shows how to apply FrameFusion to the model.
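For orientation, here is a generic, hypothetical sketch of the patching pattern such a registration can follow; it is not the repository's actual interface.py code, and the real dispatch may differ. It builds a tiny Qwen2 model from a config so it runs without downloads, and installs pass-through wrappers where a real adapter would install its llm_forward, decoder_forward, and attention_forward.
import types
from transformers import Qwen2Config, Qwen2ForCausalLM

# tiny Qwen2 model built from a config so the sketch runs without downloading weights
config = Qwen2Config(hidden_size=64, intermediate_size=128, num_hidden_layers=2,
                     num_attention_heads=4, num_key_value_heads=2, vocab_size=1000)
model = Qwen2ForCausalLM(config)

def make_patched_attention_forward(orig_forward):
    # placeholder wrapper: a real adapter would instead bind the FrameFusion
    # attention_forward from framefusion/models/<model>/modeling_<model>.py
    def attention_forward(self, *args, **kwargs):
        return orig_forward(*args, **kwargs)
    return attention_forward

# bind the (here: pass-through) forward to every attention module; the same idea
# applies to the decoder layers and the top-level LLM forward
for layer in model.model.layers:
    layer.self_attn.forward = types.MethodType(
        make_patched_attention_forward(layer.self_attn.forward), layer.self_attn
    )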
Happy to help
If you have any questions on applying FrameFusion to a new model, please feel free to open an issue. We are happy to help you and expand the adapter for more models.
Supported Model List
MiniCPM-V
Llava-Video
NVILA
- Efficient-Large-Model/NVILA-Lite-2B
- Efficient-Large-Model/NVILA-8B-Video
- Efficient-Large-Model/NVILA-Lite-15B-Video
Qwen2-VL
Note: Please use transformers==4.51.3 when running Qwen2-VL series models.