NoteMR
This is the official implementation of the paper "Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering", which has been accepted at CVPR 2025.
Abstract
The knowledge-based visual question answering (KB-VQA) task involves using external knowledge about the image to assist reasoning. Building on the impressive performance of multimodal large language models (MLLMs), recent methods have begun leveraging the MLLM as an implicit knowledge base for reasoning. However, directly using the MLLM with raw external knowledge may result in reasoning errors caused by misdirected knowledge. Additionally, the MLLM may lack fine-grained perception of visual features, which can lead to hallucinations during reasoning. To address these challenges, we propose Notes-guided MLLM Reasoning (NoteMR), a novel framework that guides the MLLM toward better reasoning by utilizing knowledge notes and visual notes. Specifically, we first obtain explicit knowledge from an external knowledge base. This explicit knowledge, combined with the image, is then used to assist the MLLM in generating knowledge notes. These notes are designed to filter the explicit knowledge and identify relevant implicit knowledge within the MLLM. We then identify regions that are highly correlated between the image and the knowledge notes, retaining them as visual notes to enhance the model's fine-grained perception, thereby mitigating MLLM-induced hallucinations. Finally, both notes are fed into the MLLM, enabling a more comprehensive understanding of the image-question pair and enhancing the model's reasoning capabilities. Our method achieves state-of-the-art performance on the OK-VQA and A-OKVQA datasets, demonstrating its robustness and effectiveness across diverse VQA scenarios.
Model Architecture
Environment Requirements
The experiments were conducted on an NVIDIA RTX A6000 GPU with 48 GB of memory.
- Python 3.10.14
- PyTorch 2.0.1
- CUDA 11.7
To run the MLLM reasoning code, you need to install the requirements:
pip install -r requirements.txt
Data Download
We evaluate our model on two publicly available KB-VQA datasets.
- OK-VQA
- A-OKVQA
Run Code
Step. 1-1 Retrieval (FLMR/PreFLMR)
We use the pre-trained PreFLMR knowledge retriever to extract the top-k passages related to the input image and question.
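For reference, here is a minimal sketch of the late-interaction (MaxSim) scoring and top-k selection that a PreFLMR-style retriever performs. The `encode_query` and `encode_passages` functions are placeholders standing in for the pre-trained PreFLMR encoders (they return random tensors here), so treat the shapes and names as assumptions for illustration, not the actual retriever API.

```python
import torch
import torch.nn.functional as F

def encode_query(image, question):
    # Placeholder: PreFLMR produces per-token query embeddings from the image
    # and question; here we just return a random (num_query_tokens, dim) tensor.
    return torch.randn(32, 128)

def encode_passages(passages):
    # Placeholder: per-token passage embeddings, (num_passages, num_passage_tokens, dim).
    return torch.randn(len(passages), 180, 128)

def retrieve_topk(image, question, passages, k=5):
    q = F.normalize(encode_query(image, question), dim=-1)
    p = F.normalize(encode_passages(passages), dim=-1)
    # Late interaction (MaxSim): each query token takes its best-matching
    # passage token, and the per-token maxima are summed into a passage score.
    sim = torch.einsum("qd,npd->nqp", q, p)        # (num_passages, q_tokens, p_tokens)
    scores = sim.max(dim=-1).values.sum(dim=-1)    # (num_passages,)
    topk = scores.topk(min(k, len(passages)))
    return [passages[i] for i in topk.indices], topk.values

passages = [f"passage {i}" for i in range(100)]
top_passages, top_scores = retrieve_topk(None, "What sport is shown?", passages, k=5)
```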
Step. 1-2 Generate Knowledge Notes
python generate_knowledge_notes.py
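For orientation, the sketch below shows one way the prompt for this step could be assembled: the retrieved passages and the question are handed to the MLLM, which is asked to keep only the relevant knowledge and add what it already knows. The prompt template and the `mllm_generate` call are hypothetical placeholders, not the released `generate_knowledge_notes.py` code.

```python
def build_knowledge_note_prompt(question: str, passages: list[str]) -> str:
    # Hypothetical template; the actual prompt used by the paper may differ.
    knowledge = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "You are given an image, a question, and retrieved passages.\n"
        f"Question: {question}\n"
        f"Retrieved knowledge:\n{knowledge}\n"
        "Write a short knowledge note that keeps only the passages relevant to "
        "the question and adds any helpful facts you already know."
    )

# knowledge_note = mllm_generate(image, build_knowledge_note_prompt(question, top_passages))
# mllm_generate is a stand-in for whatever MLLM inference call you use.
```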
Step. 2 Generate Visual Notes (GradCAM)
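The idea in this step is to compute a Grad-CAM heatmap, keep the highly activated regions, and use the masked image as the visual note. The sketch below illustrates that mechanic with a torchvision ResNet-50 classifier as a stand-in scorer; NoteMR instead measures the correlation between the image and the knowledge note, so the model choice, the example file name, and the 0.5 threshold are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layer = model.layer4          # last convolutional block
feats, grads = {}, {}

def fwd_hook(_, __, output):
    feats["act"] = output            # (1, C, H, W) feature maps

def bwd_hook(_, grad_in, grad_out):
    grads["grad"] = grad_out[0]      # gradients w.r.t. the feature maps

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file

logits = model(img)
logits[0, logits.argmax()].backward()     # back-propagate the top score

weights = grads["grad"].mean(dim=(2, 3), keepdim=True)             # GAP over gradients
cam = F.relu((weights * feats["act"]).sum(dim=1, keepdim=True))    # weighted sum + ReLU
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Keep only highly activated regions as the "visual note" mask.
visual_note = img * (cam > 0.5)
```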
Step. 3 Generate Output
python generate_output.py
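Conceptually, this step feeds both notes back into the MLLM to produce the final answer. The sketch below is again a hypothetical illustration: the template and `mllm_generate` are placeholders, not the released `generate_output.py` code.

```python
def build_answer_prompt(question: str, knowledge_note: str) -> str:
    # Hypothetical template; the released code may phrase this differently.
    return (
        "Answer the question using the original image, the masked image "
        "(visual note), and the knowledge note below.\n"
        f"Knowledge note: {knowledge_note}\n"
        f"Question: {question}\n"
        "Answer with a short phrase."
    )

# answer = mllm_generate([image, visual_note], build_answer_prompt(question, knowledge_note))
# mllm_generate is a stand-in for whatever MLLM inference call you use.
```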
Papers for the Project & How to Cite
If you use or extend our work, please cite the paper as follows:
@InProceedings{Fang_2025_CVPR,
author = {Fang, Wenlong and Wu, Qiaofeng and Chen, Jing and Xue, Yun},
title = {Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {19597-19607}
}