
.. image:: ./meerqat_logo_by_hlb.png

meerqat
=======

Source code and data used in the papers:

- `ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities <https://hal.science/hal-03650618>`__ (Lerner et al., SIGIR'22)
- `Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering <https://hal.science/hal-03933089>`__ (Lerner et al., ECIR'23)
- `Cross-modal Retrieval for Knowledge-based Visual Question Answering <https://hal.science/hal-04384431>`__ (Lerner et al., ECIR'24)

See also the `MEERQAT project <https://www.meerqat.fr/>`__.

Getting the ViQuAE dataset and KB
---------------------------------

The data is provided in two formats: Hugging Face ``datasets`` (based on Apache Arrow) and plain-text JSONL files (one JSON object per line). Both formats can be used in the same way, since ``datasets`` parses each object into a Python ``dict`` (see below); however, our code only supports (and is heavily based upon) ``datasets``. Images are distributed separately, in standard formats (e.g. jpg).
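
If you work from the plain-text release, each line of a JSONL file parses into the same kind of ``dict`` that ``datasets`` yields. A minimal sketch, assuming a hypothetical local file named ``train.jsonl`` (adapt the name to whichever file you actually downloaded):

.. code:: py

   import json

   # each line of the plain-text release is one JSON object,
   # equivalent to one item of the datasets version
   def read_jsonl(path):
       with open(path, encoding='utf-8') as f:
           for line in f:
               yield json.loads(line)

   # hypothetical file name: adapt to the split you downloaded
   for item in read_jsonl('train.jsonl'):
       print(item['input'])  # the question string
       break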

The images
~~~~~~~~~~

Here’s how to get the images grounding the questions of the dataset:

.. code:: sh

   # get the images. TODO integrate this in a single dataset
   git clone https://huggingface.co/datasets/PaulLerner/viquae_images
   # to get ALL images (dataset+KB) use https://huggingface.co/datasets/PaulLerner/viquae_all_images instead
   cd viquae_images
   # in viquae_all_images, the archive is split into parts of 5GB
   cat parts/* > images.tar.gz
   tar -xzvf images.tar.gz
   export VIQUAE_IMAGES_PATH=$PWD/images

Alternatively, you can download images from Wikimedia Commons using ``meerqat.data.kilt2vqa download`` (see below).
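
If you only need a handful of images, you can also fetch them one by one from their Wikimedia Commons URL, which every dataset item stores in its ``url`` field (see the next section). A minimal sketch using ``requests``, which is not a dependency of this repository; the example URL and file name are taken from the item shown below:

.. code:: py

   import requests

   # example URL taken from a dataset item (item['url'], see below)
   url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Jackie_Wilson.png/512px-Jackie_Wilson.png'

   # Wikimedia Commons expects a descriptive User-Agent
   response = requests.get(url, headers={'User-Agent': 'viquae-example/0.1'}, timeout=30)
   response.raise_for_status()

   # save under the file name used by the dataset (item['image'], see below)
   with open('512px-Jackie_Wilson.png', 'wb') as f:
       f.write(response.content)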

The ViQuAE dataset
~~~~~~~~~~~~~~~~~~

If you don’t want to use datasets you can get the data directly from https://huggingface.co/datasets/PaulLerner/viquae_dataset (e.g. git clone https://huggingface.co/datasets/PaulLerner/viquae_dataset).

The dataset format largely follows `KILT <https://huggingface.co/datasets/kilt_tasks>`__. Here I'll describe the dataset without pre-computed features. Pre-computed features are basically the output of each step described in `EXPERIMENTS.rst <./EXPERIMENTS.rst>`__.

.. code:: py

   In [1]: from datasets import load_dataset
      ...: dataset = load_dataset('PaulLerner/viquae_dataset')

   In [2]: dataset
   Out[2]:
   DatasetDict({
       train: Dataset({
           features: ['image', 'input', 'kilt_id', 'id', 'meta', 'original_question', 'output', 'url', 'wikidata_id'],
           num_rows: 1190
       })
       validation: Dataset({
           features: ['image', 'input', 'kilt_id', 'id', 'meta', 'original_question', 'output', 'url', 'wikidata_id'],
           num_rows: 1250
       })
       test: Dataset({
           features: ['image', 'input', 'kilt_id', 'id', 'meta', 'original_question', 'output', 'url', 'wikidata_id'],
           num_rows: 1257
       })
   })

   In [3]: item = dataset['test'][0]

   # this is now a dict, like the JSON object loaded from the JSONL files
   In [4]: type(item)
   Out[4]: dict

   # URL of the grounding image
   In [5]: item['url']
   Out[5]: 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Jackie_Wilson.png/512px-Jackie_Wilson.png'

   # file name of the grounding image as stored in $VIQUAE_IMAGES_PATH
   In [6]: item['image']
   Out[6]: '512px-Jackie_Wilson.png'

   # you can thus load the image from $VIQUAE_IMAGES_PATH/item['image']
   # meerqat.data.loading.load_image_batch does that for you
   In [7]: from meerqat.data.loading import load_image_batch

   # fake batch of size 1
   In [8]: image = load_image_batch([item['image']])[0]

   # it returns a PIL Image; all images have been resized to a width of 512
   In [9]: type(image), image.size
   Out[9]: (PIL.Image.Image, (512, 526))

   # question string
   In [10]: item['input']
   Out[10]: "this singer's re-issued song became the UK Christmas number one after helping to advertise what brand?"

   # answer string
   In [11]: item['output']['original_answer']
   Out[11]: "Levi's"

   # processing the data:
   In [12]: dataset.map(my_function)

   # this is almost the same as (see how you can adapt the code if you don't want to use the datasets library)
   In [13]: for item in dataset:
       ...:     my_function(item)
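
If you prefer not to depend on ``meerqat.data.loading.load_image_batch``, loading a grounding image boils down to joining ``$VIQUAE_IMAGES_PATH`` with ``item['image']``. A minimal sketch with Pillow, reusing the ``item`` from the session above:

.. code:: py

   import os
   from PIL import Image

   # directory set when extracting the image archive (see "The images" above)
   images_path = os.environ['VIQUAE_IMAGES_PATH']

   # item['image'] is the file name; all images have been resized to a width of 512
   image = Image.open(os.path.join(images_path, item['image']))
   print(image.size)  # e.g. (512, 526)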

The ViQuAE Knowledge Base (KB)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Again, the format of the KB is very similar to `KILT's Wikipedia <https://huggingface.co/datasets/kilt_wikipedia>`__, so I will not describe all fields exhaustively.

.. code:: py

   # again, you can also clone directly from https://huggingface.co/datasets/PaulLerner/viquae_wikipedia to get the raw data
   data_files = dict(
       humans_with_faces='humans_with_faces.jsonl.gz',
       humans_without_faces='humans_without_faces.jsonl.gz',
       non_humans='non_humans.jsonl.gz'
   )
   kb = load_dataset('PaulLerner/viquae_wikipedia', data_files=data_files)
   kb
   DatasetDict({
       humans_with_faces: Dataset({
           features: ['anchors', 'categories', 'image', 'kilt_id', 'text', 'url', 'wikidata_info', 'wikipedia_id', 'wikipedia_title'],
           num_rows: 506237
       })
       humans_without_faces: Dataset({
           features: ['anchors', 'categories', 'image', 'kilt_id', 'text', 'url', 'wikidata_info', 'wikipedia_id', 'wikipedia_title'],
           num_rows: 35736
       })
       non_humans: Dataset({
           features: ['anchors', 'categories', 'image', 'kilt_id', 'text', 'url', 'wikidata_info', 'wikipedia_id', 'wikipedia_title'],
           num_rows: 953379
       })
   })

   item = kb['humans_with_faces'][0]
   item['wikidata_info']['wikidata_id'], item['wikidata_info']['wikipedia_title']
   ('Q313590', 'Alain Connes')

   # file name of the reference image as stored in $VIQUAE_IMAGES_PATH
   # you can use meerqat.data.loading.load_image_batch like above
   item['image']
   '512px-Alain_Connes.jpg'

   # the text is stored in a list of strings, one per paragraph
   type(item['text']['paragraph']), len(item['text']['paragraph'])
   (list, 25)
   item['text']['paragraph'][1]
   "Alain Connes (; born 1 April 1947) is a French mathematician, currently Professor at the Collège de France, IHÉS, Ohio State University and Vanderbilt University. He was an Invited Professor at the Conservatoire national des arts et métiers (2000).\n"

   # you might want to concatenate these three datasets to get a single dataset (e.g. to split the articles in passages)
   from datasets import concatenate_datasets
   kb['humans_with_faces'] = kb['humans_with_faces'].map(lambda item: {'is_human': True})
   kb['humans_without_faces'] = kb['humans_without_faces'].map(lambda item: {'is_human': True})
   kb['non_humans'] = kb['non_humans'].map(lambda item: {'is_human': False})
   kb_recat = concatenate_datasets([kb['non_humans'], kb['humans_with_faces'], kb['humans_without_faces']])
   kb_recat.save_to_disk('data/viquae_wikipedia_recat')
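
The ``is_human`` flag added above then lets you work with the single concatenated KB while still distinguishing entity types. A minimal sketch reusing the ``kb_recat`` dataset saved above:

.. code:: py

   from datasets import load_from_disk

   # reload the concatenated KB saved above
   kb_recat = load_from_disk('data/viquae_wikipedia_recat')

   # keep only articles about humans (with or without a detected face)
   humans = kb_recat.filter(lambda item: item['is_human'])
   print(len(humans))  # 506237 + 35736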

To format the articles into text passages, follow the instructions at `EXPERIMENTS.rst <./EXPERIMENTS.rst>`__ (*Preprocessing passages* section). Alternatively, get them from https://huggingface.co/datasets/PaulLerner/viquae_v4-alpha_passages (``load_dataset('PaulLerner/viquae_v4-alpha_passages')``).

Formatting WIT for multimodal ICT
---------------------------------

WIT (Srinivasan et al., http://arxiv.org/abs/2103.01913) is available at https://github.com/google-research-datasets/wit. (If you happen to have access to Jean Zay, it is also available at ``$DSDIR/WIT``.)

Follow the instructions at ``meerqat.data.wit`` (see ``meerqat.data.wit.html`` in the docs) or get it from https://huggingface.co/datasets/PaulLerner/wit_for_mict (``load_dataset('PaulLerner/wit_for_mict')``).

Annotation of the ViQuAE data
-----------------------------

Please refer to `ANNOTATION.md <./ANNOTATION.md>`__ for the annotation instructions.

Experiments
-----------

Please refer to `EXPERIMENTS.rst <./EXPERIMENTS.rst>`__ for instructions to reproduce our experiments.

Reference
---------

If you use the ViQuAE dataset or KB, please cite: ::

@inproceedings{lerner2022viquae,
  AUTHOR = {Paul Lerner and Olivier Ferret and Camille Guinaudeau and Le Borgne, Hervé and Romaric Besançon and Moreno, Jose G and Lovón Melgarejo, Jesús},
  YEAR = {2022},
  TITLE = {{ViQuAE}, a Dataset for Knowledge-based Visual Question Answering about Named Entities},
  BOOKTITLE = {Proceedings of The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  SERIES = {SIGIR'22},
  URL = {https://hal.archives-ouvertes.fr/hal-03650618},
  DOI = {10.1145/3477495.3531753},
  PUBLISHER = {Association for Computing Machinery},
  ADDRESS = {New York, NY, USA}
}

If you use this code for multimodal information retrieval or early fusion or Inverse Cloze Task pre-training, please cite: ::

@inproceedings{lerner2023ict,
  TITLE = {{Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering}},
  AUTHOR = {Lerner, Paul and Ferret, Olivier and Guinaudeau, Camille},
  URL = {https://hal.science/hal-03933089},
  BOOKTITLE = {{European Conference on Information Retrieval (ECIR 2023)}},
  ADDRESS = {Dublin, Ireland},
  YEAR = {2023},
  MONTH = Apr,
  KEYWORDS = {Visual Question Answering ; Pre-training ; Multimodal Fusion},
  PDF = {https://hal.science/hal-03933089v2/file/ecir-2023-vf-authors.pdf},
  HAL_ID = {hal-03933089},
  HAL_VERSION = {v2},
}

If you use this code for mono- or cross-modal information retrieval with CLIP or fine-tuning CLIP, please cite: ::

@unpublished{lerner2024cross,
  TITLE = {{Cross-modal Retrieval for Knowledge-based Visual Question Answering}},
  AUTHOR = {Lerner, Paul and Ferret, Olivier and Guinaudeau, Camille},
  URL = {https://hal.science/hal-04384431},
  NOTE = {Accepted at ECIR 2024},
  YEAR = {2024},
  MONTH = Jan,
  KEYWORDS = {Visual Question Answering ; Multimodal ; Cross-modal Retrieval ; Named Entities},
  PDF = {https://hal.science/hal-04384431/file/camera_ecir_2024_cross_modal_arXiv.pdf},
  HAL_ID = {hal-04384431},
  HAL_VERSION = {v1},
}

Installation
------------

Install PyTorch 1.9.0 following `the official instructions for your distribution <https://pytorch.org/get-started/locally/>`__ (preferably in a virtual environment).

Also install `Elasticsearch <https://www.elastic.co/fr/downloads/elasticsearch>`__ (and run it) or `pyserini <https://github.com/castorini/pyserini>`__ if you want to do sparse retrieval.

The rest should be installed using pip:

.. code:: sh

   $ git clone https://github.com/PaulLerner/ViQuAE.git
   $ pip install -e ViQuAE
   $ python
   >>> import meerqat

Docs
----

`Read the docs! <https://paullerner.github.io/ViQuAE/meerqat.ir.search.html>`__

To build the docs locally, run ``sphinx-apidoc -o source_docs/ -f -e -M meerqat``, then ``sphinx-build -b html source_docs/ docs/``.