NoteMR
This is the official implementation of the paper "Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering", which has been accepted at CVPR 2025.
Abstract
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language models (MLLMs), recent methods have begun leveraging the MLLM as an implicit knowledge base for reasoning. However, directly using the MLLM with raw external knowledge may result in reasoning errors caused by misdirected knowledge. Additionally, the MLLM may lack fine-grained perception of visual features, which can lead to hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides the MLLM toward better reasoning by utilizing knowledge notes and visual notes. Specifically, we first obtain explicit knowledge from an external knowledge base. This explicit knowledge, combined with the image, is then used to assist the MLLM in generating knowledge notes. These notes are designed to filter the explicit knowledge and identify relevant implicit knowledge within the MLLM. We then identify regions that are highly correlated between the image and the knowledge notes, retaining them as visual notes to enhance the model's fine-grained perception, thereby mitigating MLLM-induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model's reasoning capabilities. Our method achieves state-of-the-art performance on the OK-VQA and A-OKVQA datasets, demonstrating its robustness and effectiveness across diverse VQA scenarios.
Model Architecture
Environment Requirements
The experiments were conducted on an NVIDIA RTX A6000 GPU with 48 GB of memory.
- Python 3.10.14
- PyTorch 2.0.1
- CUDA 11.7
To run the MLLM reasoning code, you need to install the requirements:
pip install -r requirements.txt
Data Download
We evaluate our model on two publicly available KB-VQA datasets.
- OK-VQA
- A-OKVQA
Run Code
Step. 1-1 Retrieval (FLMR/PreFLMR)
We use the pre-trained PreFLMR knowledge retriever to extract the top-k passages related to the input image and question.
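For reference, here is a minimal sketch of the late-interaction (MaxSim) scoring and top-k selection that a PreFLMR-style retriever performs. The `encode_query` and `encode_passages` functions are placeholders standing in for the pre-trained PreFLMR encoders (they return random tensors here), so treat the shapes and names as assumptions for illustration, not the actual retriever API.

```python
import torch
import torch.nn.functional as F

def encode_query(image, question):
    # Placeholder: PreFLMR produces per-token query embeddings from the image
    # and question; here we just return a random (num_query_tokens, dim) tensor.
    return torch.randn(32, 128)

def encode_passages(passages):
    # Placeholder: per-token passage embeddings, (num_passages, num_passage_tokens, dim).
    return torch.randn(len(passages), 180, 128)

def retrieve_topk(image, question, passages, k=5):
    q = F.normalize(encode_query(image, question), dim=-1)
    p = F.normalize(encode_passages(passages), dim=-1)
    # Late interaction (MaxSim): each query token takes its best-matching
    # passage token, and the per-token maxima are summed into a passage score.
    sim = torch.einsum("qd,npd->nqp", q, p)        # (num_passages, q_tokens, p_tokens)
    scores = sim.max(dim=-1).values.sum(dim=-1)    # (num_passages,)
    topk = scores.topk(min(k, len(passages)))
    return [passages[i] for i in topk.indices], topk.values

passages = [f"passage {i}" for i in range(100)]
top_passages, top_scores = retrieve_topk(None, "What sport is shown?", passages, k=5)
```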
Step. 1-2 Generate Knowledge Notes
python generate_knowledge_notes.py
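For orientation, the sketch below shows one way the prompt for this step could be assembled: the retrieved passages and the question are handed to the MLLM, which is asked to keep only the relevant knowledge and add what it already knows. The prompt template and the `mllm_generate` call are hypothetical placeholders, not the released `generate_knowledge_notes.py` code.

```python
def build_knowledge_note_prompt(question: str, passages: list[str]) -> str:
    # Hypothetical template; the actual prompt used by the paper may differ.
    knowledge = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "You are given an image, a question, and retrieved passages.\n"
        f"Question: {question}\n"
        f"Retrieved knowledge:\n{knowledge}\n"
        "Write a short knowledge note that keeps only the passages relevant to "
        "the question and adds any helpful facts you already know."
    )

# knowledge_note = mllm_generate(image, build_knowledge_note_prompt(question, top_passages))
# mllm_generate is a stand-in for whatever MLLM inference call you use.
```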
Step. 2 Generate Visual Notes (GradCAM)
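The idea in this step is to compute a Grad-CAM heatmap, keep the highly activated regions, and use the masked image as the visual note. The sketch below illustrates that mechanic with a torchvision ResNet-50 classifier as a stand-in scorer; NoteMR instead measures the correlation between the image and the knowledge note, so the model choice, the example file name, and the 0.5 threshold are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layer = model.layer4          # last convolutional block
feats, grads = {}, {}

def fwd_hook(_, __, output):
    feats["act"] = output            # (1, C, H, W) feature maps

def bwd_hook(_, grad_in, grad_out):
    grads["grad"] = grad_out[0]      # gradients w.r.t. the feature maps

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file

logits = model(img)
logits[0, logits.argmax()].backward()     # back-propagate the top score

weights = grads["grad"].mean(dim=(2, 3), keepdim=True)             # GAP over gradients
cam = F.relu((weights * feats["act"]).sum(dim=1, keepdim=True))    # weighted sum + ReLU
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Keep only highly activated regions as the "visual note" mask.
visual_note = img * (cam > 0.5)
```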
Step. 3 Generate Output
python generate_output.py
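Conceptually, this step feeds both notes back into the MLLM to produce the final answer. The sketch below is again a hypothetical illustration: the template and `mllm_generate` are placeholders, not the released `generate_output.py` code.

```python
def build_answer_prompt(question: str, knowledge_note: str) -> str:
    # Hypothetical template; the released code may phrase this differently.
    return (
        "Answer the question using the original image, the masked image "
        "(visual note), and the knowledge note below.\n"
        f"Knowledge note: {knowledge_note}\n"
        f"Question: {question}\n"
        "Answer with a short phrase."
    )

# answer = mllm_generate([image, visual_note], build_answer_prompt(question, knowledge_note))
# mllm_generate is a stand-in for whatever MLLM inference call you use.
```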
Papers for the Project & How to Cite
If you use or extend our work, please cite the paper as follows:
@InProceedings{Fang_2025_CVPR,
author = {Fang, Wenlong and Wu, Qiaofeng and Chen, Jing and Xue, Yun},
title = {Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {19597-19607}
}