

i-CIR: Instance-Level Composed Image Retrieval (NeurIPS 2025)


Official implementation of our Baseline Approach for SurprIsingly strong Composition (BASIC) and the instance-level composed image retrieval (i-CIR) dataset.
[arXiv] · [paper] · [project page]

TL;DR: We introduce BASIC, a training-free VLM-based method that centers and projects image embeddings, and i-CIR, a curated, instance-level composed image retrieval benchmark with rich hard negatives that is compact yet challenging.

Overview

This repository contains a clean implementation for performing composed image retrieval (CIR) on the i-CIR dataset using vision-language models (CLIP/SigLIP).

Method (BASIC)

Our BASIC method decomposes multimodal queries into object and style components through the following steps (a sketch of steps 1 and 2 follows this list):

  1. Feature Standardization: Centering features using LAION-1M statistics
  2. Contrastive PCA Projection: Separating information using positive and negative text corpora
  3. Query Expansion: Refining queries with top-k similar database images
  4. Harris Corner Fusion: Combining image and text similarities with geometric weighting
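A minimal NumPy sketch of steps 1–2, assuming pre-extracted embeddings and a precomputed LAION mean; the names (laion_mean, pos_corpus, neg_corpus) and the random placeholders are illustrative, not the repository's API:

import numpy as np

def contrastive_pca_basis(pos_feats, neg_feats, num_components=250, alpha=0.2):
    """Top eigenvectors of cov(positive corpus) - alpha * cov(negative corpus)."""
    diff = np.cov(pos_feats, rowvar=False) - alpha * np.cov(neg_feats, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(diff)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
    return eigvecs[:, order[:num_components]]        # (dim, num_components) projection basis

def center_and_project(feats, mean, basis):
    """Step 1: center with a precomputed mean; step 2: project onto the contrastive basis."""
    projected = (feats - mean) @ basis
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)  # re-normalize for cosine similarity

# Illustrative usage with random placeholders standing in for real embeddings.
rng = np.random.default_rng(0)
dim = 768
laion_mean = rng.normal(size=dim)            # precomputed LAION mean (placeholder)
pos_corpus = rng.normal(size=(1000, dim))    # text embeddings of generic subjects (objects)
neg_corpus = rng.normal(size=(1000, dim))    # text embeddings of generic styles
db_feats = rng.normal(size=(5000, dim))      # database image embeddings

basis = contrastive_pca_basis(pos_corpus, neg_corpus, num_components=250, alpha=0.2)
db_proj = center_and_project(db_feats, laion_mean, basis)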


Dataset

Well-curated

i-CIR is an instance-level composed image retrieval benchmark where each instance is a specific, visually indistinguishable object (e.g., the Temple of Poseidon). Each query composes an image of the instance with a text modification. For every instance we curate a shared database and define composed positives plus a rich set of hard negatives: visual (same/similar object, wrong text), textual (right text semantics, different instance, often of the same category), and composed (nearly matches both parts but fails one).


Compact but hard

Built by combining human curation with automated retrieval from LAION, followed by filtering (quality/duplicates/PII) and manual verification of positives and hard negatives, i-CIR is compact yet challenging: it rivals searching with >40M distractor images for simple baselines, while keeping per-query databases manageable. Key stats:

  • Instances: 202
  • Total images: ~750K
  • Composed queries: 1,883
  • Image queries / instance: 1–46
  • Text queries / instance: 1–5
  • Positives / composed query: 1–127
  • Hard negatives / instance: 951–10,045
  • Avg database size / query: ~3.7K images

Truly compositional

Performance peaks at interior text–image fusion weights ($\lambda$) and shows large composition gains over the best uni-modal baselines—evidence that both modalities must work together.
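For illustration, consider a simple convex fusion of per-image similarities (an assumed form; BASIC's Harris fusion is more elaborate): $s(\lambda) = (1-\lambda)\, s_{\text{image}} + \lambda\, s_{\text{text}}$ with $\lambda \in [0, 1]$. A peak at an interior $\lambda$ means neither modality alone suffices.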


🔽 Download the i-CIR dataset

i-CIR is stored here.

Option A — Direct tarball (recommended):

# Download 
wget https://vrg.fel.cvut.cz/icir/icir_v1.0.0.tar.gz -O icir_v1.0.0.tar.gz
# Extract
tar -xzf icir_v1.0.0.tar.gz
# Verify (requires the icir_v1.0.0.sha256 checksum file in the working directory)
sha256sum -c icir_v1.0.0.sha256   # should print OK

Resulting layout:

icir/
├── database/
├── query/
├── database_files.csv
├── query_files.csv
├── VERSION.txt
├── LICENSE
└── checksums.sha256

Installation

Requirements

  • Python 3.9+
  • PyTorch 2.0+
  • CUDA-capable GPU (recommended)

Setup

# Clone the repository
git clone https://github.com/billpsomas/icir.git
cd icir

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

1. Prepare Data

Ensure you have the following structure:

icir/
├── data/
│   ├── icir/                       # i-CIR dataset
│   └── laion_mean/                 # Pre-computed LAION means
├── corpora/
│   ├── generic_subjects.csv        # Positive corpus (objects)
│   └── generic_styles.csv          # Negative corpus (styles)
└── synthetic_data/                 # Score normalization data
    ├── dataset_1_sd_clip.pkl.npy
    └── dataset_1_sd_siglip.pkl.npy

2. Extract Features

Extract features for the i-CIR dataset and text corpora:

# Extract i-CIR dataset features
python3 create_features.py --dataset icir --backbone clip --batch 512 --gpu 0

# Extract corpus features
python3 create_features.py --dataset corpus --backbone clip --batch 512 --gpu 0

Features will be saved to features/{backbone}_features/.
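For reference, a minimal sketch of the kind of image-feature extraction create_features.py performs, using OpenCLIP ViT-L/14; the paths, output filename, and save format below are illustrative assumptions, not the script's exact behavior:

import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model = model.to(device).eval()

@torch.no_grad()
def encode_images(paths, batch_size=512):
    feats = []
    for i in range(0, len(paths), batch_size):
        imgs = [preprocess(Image.open(p).convert("RGB")) for p in paths[i:i + batch_size]]
        f = model.encode_image(torch.stack(imgs).to(device))
        feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())  # L2-normalize for cosine similarity
    return torch.cat(feats)

# Illustrative usage (paths are placeholders):
# feats = encode_images(list_of_database_image_paths)
# torch.save(feats, "features/clip_features/database.pt")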

3. Run Retrieval

The easiest way is to use method presets with --use_preset:

# Full BASIC method (recommended)
python3 run_retrieval.py --method basic --use_preset

# Baseline methods
python3 run_retrieval.py --method sum --use_preset
python3 run_retrieval.py --method product --use_preset
python3 run_retrieval.py --method image --use_preset
python3 run_retrieval.py --method text --use_preset

For advanced usage with custom parameters:

python3 run_retrieval.py \
  --method basic \
  --backbone clip \
  --dataset icir \
  --results_dir results/ \
  --specified_corpus generic_subjects \
  --specified_ncorpus generic_styles \
  --num_principal_components_for_projection 250 \
  --aa 0.2 \
  --standardize_features \
  --use_laion_mean \
  --project_features \
  --do_query_expansion \
  --contextualize \
  --normalize_similarities \
  --path_to_synthetic_data ./synthetic_data \
  --harris_lambda 0.1

Methods

The codebase implements several retrieval methods (a score-fusion sketch follows this list):

  • basic: Full decomposition method with all components (PCA projection, query expansion, Harris fusion)
  • sum: Simple sum of image and text similarities
  • product: Simple product of image and text similarities
  • image: Image-only retrieval (ignores text)
  • text: Text-only retrieval (ignores image)
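A minimal sketch of these baseline fusions, assuming img_sims and txt_sims hold cosine similarities of the query image and query text against the per-query database (illustrative, not the repository's exact scoring code):

import numpy as np

def fuse_scores(img_sims, txt_sims, method):
    """Baseline score fusion; higher fused score = better match."""
    if method == "sum":
        return img_sims + txt_sims
    if method == "product":
        # Assumes similarities are non-negative (or have been shifted) so the product stays monotone.
        return img_sims * txt_sims
    if method == "image":
        return img_sims
    if method == "text":
        return txt_sims
    raise ValueError(f"unknown method: {method}")

img_sims = np.array([0.31, 0.05, 0.44])
txt_sims = np.array([0.12, 0.40, 0.08])
ranking = np.argsort(-fuse_scores(img_sims, txt_sims, "sum"))  # database indices, best first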

Key Parameters

  • --method: Retrieval method (basic, sum, product, image, text)
  • --backbone: Vision-language model (clip for ViT-L/14, siglip for ViT-L-16-SigLIP-256)
  • --use_preset: Use predefined method configurations (recommended)
  • --specified_corpus: Positive corpus for projection (default: generic_subjects)
  • --specified_ncorpus: Negative corpus for projection (default: generic_styles)
• --num_principal_components_for_projection: number of PCA components; values > 1 give an exact count, values in (0, 1) an energy-fraction threshold (default: 250; see the sketch after this list)
  • --aa: Negative corpus weight in contrastive PCA (default: 0.2)
  • --harris_lambda: Harris fusion parameter (default: 0.1)
• --contextualize: Add corpus objects to the text query to provide additional context
  • --standardize_features: Center features before projection
  • --use_laion_mean: Use pre-computed LAION mean for centering
  • --project_features: Apply PCA projection
  • --do_query_expansion: Expand queries with retrieved images
  • --normalize_similarities: Apply score normalization using synthetic data
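A hedged sketch of how the >1 / <1 convention for --num_principal_components_for_projection could be interpreted (an assumption based on the description above, not the exact implementation):

import numpy as np

def select_num_components(eigvals_desc, k):
    """k > 1: exact component count; 0 < k < 1: smallest count reaching that fraction of total energy."""
    if k > 1:
        return int(k)
    energy = np.cumsum(eigvals_desc) / np.sum(eigvals_desc)
    return int(np.searchsorted(energy, k) + 1)

eigvals = np.array([5.0, 3.0, 1.0, 0.5, 0.5])     # eigenvalues, descending
print(select_num_components(eigvals, 250))         # -> 250 (explicit count)
print(select_num_components(eigvals, 0.9))         # -> 3 (smallest count with >= 90% energy)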

Corpus Files

Text corpora define semantic spaces for PCA projection:

  • generic_subjects.csv: General object/subject descriptions (positive corpus)
  • generic_styles.csv: General style/attribute descriptions (negative corpus)

Corpora are CSV files with a single column of text descriptions, loaded from the corpora/ directory.
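A minimal sketch of loading a corpus and encoding it with OpenCLIP's text tower, assuming a single text column and no header row (the actual CSV layout and the script's loading code may differ):

import pandas as pd
import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

# Assumes a single column of text descriptions with no header row.
texts = pd.read_csv("corpora/generic_subjects.csv", header=None)[0].astype(str).tolist()

@torch.no_grad()
def encode_texts(texts, batch_size=256):
    feats = []
    for i in range(0, len(texts), batch_size):
        tokens = tokenizer(texts[i:i + batch_size]).to(device)
        f = model.encode_text(tokens)
        feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())
    return torch.cat(feats)

corpus_feats = encode_texts(texts)   # (num_texts, embed_dim) L2-normalized embeddings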

Output

Results are saved to the specified results directory (default: results/):

results/
└── icir/
    └── {method_variant}/
        └── mAP_table.csv          # Mean Average Precision results

Each result file includes:

  • mAP score for the retrieval method
  • Configuration parameters used (for basic method only)
  • Timestamp of the experiment
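For reference, a minimal sketch of mean Average Precision over ranked retrieval results (illustrative; not necessarily the repository's exact evaluation code):

import numpy as np

def average_precision(ranked_ids, positive_ids):
    """AP for one query: mean precision at the ranks where positives appear."""
    positives = set(positive_ids)
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positives:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(all_ranked, all_positives):
    return float(np.mean([average_precision(r, p) for r, p in zip(all_ranked, all_positives)]))

# Example: positives {3, 7} retrieved at ranks 1 and 4 -> (1/1 + 2/4) / 2 = 0.75
print(average_precision([3, 5, 9, 7, 2], [3, 7]))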

Results (mAP %)

Method          ImageNet-R   NICO    Mini-DN   LTLL    i-CIR
Text                  0.74    1.09      0.57    5.72     3.01
Image                 3.84    6.32      6.66   16.49     3.04
Text + Image          6.21    9.30      9.33   17.86     8.20
Text × Image          7.83    9.79      9.86   23.16    17.48
WeiCom               10.47   10.54      8.52   26.60    18.03
PicWord               7.88    9.76     12.00   21.27    19.36
CompoDiff            12.88   10.32     22.95   21.61     9.63
CIReVL               18.11   17.80     26.20   32.60    18.66
Searle               14.04   15.13     21.78   25.46    19.90
MCL                   8.13   19.09     18.41   16.67    19.89
MagicLens             9.13   19.66     20.06   24.21    27.35
CoVR                 11.52   24.93     27.76   24.68    28.50
FREEDOM              29.91   26.10     37.27   33.24    17.24
FREEDOM†             25.81   23.24     32.14   30.82    15.76
BASIC                32.13   31.65     39.58   41.38    31.64
BASIC†               27.54   28.90     35.75   38.22    34.35

† Without query expansion.

Project Structure

icir/
├── run_retrieval.py           # Main retrieval script
├── create_features.py         # Feature extraction script
├── utils.py                   # General utilities (device setup, text processing, evaluation)
├── utils_features.py          # Feature I/O and model loading
├── utils_retrieval.py         # Core retrieval algorithms
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── LICENSE                    # MIT License
├── data/                      # Dataset and normalization data
├── corpora/                   # Text corpus files
├── features/                  # Extracted features (generated)
└── results/                   # Retrieval results (generated)

Citation

If you use this code in your research, please cite:

@inproceedings{
    psomas2025instancelevel,
    title={Instance-Level Composed Image Retrieval},
    author={Bill Psomas and George Retsinas and Nikos Efthymiadis and Panagiotis Filntisis and Yannis Avrithis and Petros Maragos and Ondrej Chum and Giorgos Tolias},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025}
}

License

  • This code is licensed under the MIT License - see the LICENSE file for details.
  • The dataset is licensed under the CC BY-NC-SA License - see the dataset's LICENSE file for details.

Acknowledgments

  • Vision-language models via OpenCLIP
  • LAION-1M statistics for feature standardization

Contact

For questions or issues, please open an issue on GitHub.