- Create the Python environment.
  ```bash
  conda create -n ret2 -y --no-default-packages python==3.10.16
  conda activate ret2
  ```
- Install PyTorch.
  ```bash
  pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
  ```
- Install Faiss.
  ```bash
  conda install -n ret2 -y -c conda-forge faiss-gpu==1.7.4
  ```
- Clone the repo and install the other dependencies.
  ```bash
  git clone https://github.com/aimagelab/ReT-2.git
  cd ReT-2
  pip install -r requirements.txt
  ```
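Before moving on, a quick sanity check (our own snippet, not part of the official setup) can confirm that PyTorch sees the GPU and that the pinned `faiss-gpu` build imports correctly:

```python
# Quick environment sanity check (illustrative; not part of the official setup).
import torch
import faiss

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"faiss {faiss.__version__}, GPUs visible to faiss: {faiss.get_num_gpus()}")
```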
The example below loads a pretrained ReT-2 model and computes the similarity between a multimodal query and a multimodal passage:

```python
from src.models import Ret2Model
import requests
from PIL import Image
from io import BytesIO
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Multimodal query: an image plus a question about it.
query_img_url = 'https://upload.wikimedia.org/wikipedia/commons/8/84/Ghirlandina_%28Modena%29.jpg'
response = requests.get(query_img_url, headers=headers)
query_image = Image.open(BytesIO(response.content)).convert('RGB')
query_text = 'Where is this building located?'

# Multimodal passage: an image plus its associated text.
passage_img_url = 'https://upload.wikimedia.org/wikipedia/commons/0/09/Absidi_e_Ghirlandina.jpg'
response = requests.get(passage_img_url, headers=headers)
passage_image = Image.open(BytesIO(response.content)).convert('RGB')
passage_text = (
    "The Ghirlandina is the bell tower of the Cathedral of Modena, in Modena, Italy. "
    "It is 86.12 metres (282.7 ft) high and is the symbol of the city. "
    "It was built in Romanesque style in the 12th century and is part of a UNESCO World Heritage Site."
)

model = Ret2Model.from_pretrained('aimagelab/ReT2-M2KR-ColBERT-SigLIP2-ViT-L', device_map=device)

query_txt_inputs = model.tokenizer([query_text], return_tensors='pt').to(device)
query_img_inputs = model.image_processor([query_image], return_tensors='pt').to(device)
passage_txt_inputs = model.tokenizer([passage_text], return_tensors='pt').to(device)
passage_img_inputs = model.image_processor([passage_image], return_tensors='pt').to(device)

with torch.inference_mode():
    query_feats = model.get_ret_features(
        input_ids=query_txt_inputs.input_ids,
        attention_mask=query_txt_inputs.attention_mask,
        pixel_values=query_img_inputs.pixel_values
    )
    passage_feats = model.get_ret_features(
        input_ids=passage_txt_inputs.input_ids,
        attention_mask=passage_txt_inputs.attention_mask,
        pixel_values=passage_img_inputs.pixel_values
    )

sim = F.normalize(query_feats, p=2, dim=-1) @ F.normalize(passage_feats, p=2, dim=-1).T
print(f"query-passage similarity: {sim.item():.3f}")
```
The core script to reproduce the results shown in the paper is `evaluate.py`. The evaluation is split into three stages:

- `index`: typically the longest stage; it extracts embeddings from the multimodal passages and saves them to the file system.
- `create_index`: reads the embeddings and indexes them as a `faiss.IndexFlatIP` for efficient inner-product similarity search. This stage outputs two files: (1) `knn.index`, the actual Faiss index; (2) `knn.json`, a JSON list of 2-tuples, where the first item is the (unique) passage ID (i.e. `pid`) and the second item is the textual content associated with that passage.
- `search`: queries the index and computes the metrics. The output is: (1) `metrics.json`, a JSON dict containing the retrieval metrics; (2) `metrics.txt`, containing the metrics written in Markdown for pretty-printing; (3) `ranking.tsv`, a .tsv file where each row has 4 fields:
  - `qid`: a unique identifier of a query.
  - `pidx`: the 0-based index of the retrieved passage in the JSONL file used to create the index in stage 1.
  - `rank`: the rank of the retrieved passage. Note that the top-1 retrieved passage has rank equal to 0.
  - `score`: the query-passage cosine similarity.
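To make the `create_index` and `search` stages concrete, here is a minimal, self-contained sketch of the same pattern on random embeddings. It only illustrates the `faiss.IndexFlatIP` workflow and the role of the output files; it is not the actual `evaluate.py` implementation.

```python
import json
import faiss
import numpy as np

# Toy passage embeddings (L2-normalized so inner product == cosine similarity).
num_passages, dim = 1000, 128
passage_embs = np.random.randn(num_passages, dim).astype('float32')
passage_embs /= np.linalg.norm(passage_embs, axis=1, keepdims=True)

# "create_index": build an exact inner-product index and save it with its metadata.
index = faiss.IndexFlatIP(dim)
index.add(passage_embs)
faiss.write_index(index, 'knn.index')
with open('knn.json', 'w') as f:
    json.dump([(f'pid_{i}', f'passage text {i}') for i in range(num_passages)], f)

# "search": query the index with a (normalized) query embedding.
query = np.random.randn(1, dim).astype('float32')
query /= np.linalg.norm(query, axis=1, keepdims=True)
scores, pidx = index.search(query, 5)  # top-5 similarities and 0-based passage indices
print(scores[0], pidx[0])
```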
Both the multimodal passages to be indexed and the multimodal queries are expected to be in JSONL format. To index and search through custom data collections, please refer to our Hugging Face dataset for examples on how to prepare your data for indexing and search.
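Since JSONL simply means one JSON object per line, reading a data file reduces to the sketch below. The field names inspected here are placeholders; the actual schema is documented in the Hugging Face dataset mentioned above.

```python
import json

# Generic JSONL reader: one JSON object per line.
# The keys printed below are whatever the dataset defines; see the Hugging Face
# dataset referenced above for the exact schema.
with open('passages.jsonl') as f:
    passages = [json.loads(line) for line in f]

print(f"loaded {len(passages)} passages; first record keys: {list(passages[0].keys())}")
```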
Our experiments have been run on the Leonardo HPC cluster, which uses SLURM as the resource manager. Inside the `scripts` folder, we provide SLURM sbatch scripts to reproduce the paper results. As the `index` stage can take very long, depending on the number of passages to be indexed, `evaluate.py` supports single-node multi-GPU acceleration to speed up the process. Conversely, the `search` stage currently works with a single GPU only.
To reproduce the results on the LLaVA, KVQA, OVEN, and IGLUE tasks of M2KR, run `eval_m2kr_single_gpu.sh`.
For OKVQA, InfoSeek, E-VQA, and WIT, run `eval_m2kr_multi_gpu.sh`.
Before running either script, you have to set two variables:

- `IMAGE_ROOT_PATH`: the path where the M2KR images have been downloaded.
- `JSONL_ROOT_PATH`: the path where the JSONL annotations have been downloaded.
To reproduce the results on M-BEIR, run `eval_mbeir_multi_gpu.sh`. Similarly to M2KR, you have to set:

- `IMAGE_ROOT_PATH`: the path where the M-BEIR images have been downloaded. Please refer to the official repository to download them.
- `JSONL_ROOT_PATH`: the path where the JSONL annotations have been downloaded.
For the best VQA performance, we recommend using ReT-2 paired with ColBERT + SigLIP2 as the retrieval engine.
The first step is to index the Wikipedia knowledge base (KB) and retrieve the top-k passages for each visual question. This step is built on the same `evaluate.py` script used for M2KR and M-BEIR, but you have to set some additional variables according to the chosen VQA benchmark: either InfoSeek or Encyclopedic-VQA (E-VQA).
Next is the answer generation stage, where the generator is a Multimodal Large Language Model (MLLM), whose prompt for each question is enriched with the top-k retrieved passages from the previous stage.
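As a rough illustration of what "enriching the prompt" means (our own sketch; the actual prompt templates are defined inside the RAG scripts), the generator simply receives the question together with the top-k retrieved texts:

```python
def build_rag_prompt(question: str, retrieved_passages: list[str]) -> str:
    # Illustrative prompt assembly; the real templates live in the RAG scripts.
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "Use the following passages to answer the question.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

print(build_rag_prompt(
    "Where is this building located?",
    ["The Ghirlandina is the bell tower of the Cathedral of Modena, in Modena, Italy."],
))
```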
Modify `infoseek_index_and_search.sh` as follows:

- `dataset_path_index`: a JSONL file containing the multimodal passages of your KB. For InfoSeek, this is a collection of 1M (image)-text pairs from Wikipedia pages. You can download it from here. Note: we also release a smaller KB of 525k passages, which is the KB used in our previous work, ReT.
- `image_root_index`: the path that the image paths in `dataset_path_index` are relative to. InfoSeek shares the same KB as OVEN, and the images can be downloaded as a tarfile from here.
- `dataset_path_query`: a JSONL file containing the retrieval queries. You can download it from here.
- `image_root_query`: the path that the image paths in `dataset_path_query` are relative to. The images can be downloaded by following the official OVEN guidelines.
This script outputs `ranking.jsonl`, which is similar to the `ranking.tsv` file produced after searching through an index, with the addition of the textual content associated with each retrieved passage.
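If you want to inspect `ranking.jsonl` yourself, a sketch like the one below groups the retrieved passages by query. The field names (`qid`, `rank`, `text`) are assumptions based on the `ranking.tsv` description above; adapt them to the actual keys in the file.

```python
import json
from collections import defaultdict

# Group retrieved passages by query. Field names ('qid', 'rank', 'text') are assumptions;
# check the actual keys in your ranking.jsonl.
per_query = defaultdict(list)
with open('ranking.jsonl') as f:
    for line in f:
        row = json.loads(line)
        per_query[row['qid']].append(row)

for qid, rows in per_query.items():
    top_k_texts = [r['text'] for r in sorted(rows, key=lambda r: r['rank'])]
    print(qid, len(top_k_texts))
```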
Run `infoseek_rag_llava_more.sh` to generate answers with LLaVA-MORE-8B, or `infoseek_rag_qwen.sh` to generate them with Qwen2.5-VL-7B-Instruct. In either case, modify the scripts as follows:

- `dataset_path_query`: same as above.
- `image_root_query`: same as above.
- `ranking_path`: path to the `ranking.jsonl` file generated by `infoseek_index_and_search.sh`.
- `output_path`: the JSONL file that stores the answer to each visual question.
To compute the evaluation metrics, run the following:
```bash
python src/rag/metrics/infoseek_compute_metrics.py \
    --prediction_paths /path/to/my_predictions.jsonl \
    --experiment_names my_experiment \
    --reference_path infoseek_val.jsonl \
    --reference_qtype_path infoseek_val_qtype.jsonl
```
where `infoseek_val.jsonl` and `infoseek_val_qtype.jsonl` can be downloaded from the official InfoSeek repository.
Modify `evqa_index_and_search.sh` as follows:

- `dataset_path_index`: a JSONL file containing the multimodal passages of your KB. For E-VQA, this is a collection of 15.9M (image)-text pairs from Wikipedia pages, so indexing can take a long time, depending on the capabilities of your GPUs. You can download it from here.
- `image_root_index`: the path that the image paths in `dataset_path_index` are relative to. Images for the E-VQA KB have been sampled from WikiWeb2M and can be downloaded according to the official E-VQA repository. Because E-VQA uses only a subset of 2.4M images out of the original 11M, we have packed them as tarfiles for convenience. You can download them from here.
- `dataset_path_query`: a JSONL file containing the retrieval queries. You can download it from here.
- `image_root_query`: the path that the image paths in `dataset_path_query` are relative to. According to the official E-VQA repository, query images come from two datasets: iNaturalist 2021 and Google Landmarks Dataset v2. After downloading them, your `image_root_query` should look like this:
```
.
└── image_root_query/
    ├── iNaturalist/
    │   ├── train/
    │   │   ├── 03248_Animalia_Chordata_Aves_Anseriformes_Anatidae_Dendrocygna_viduata/
    │   │   │   ├── 90af92cc-254a-43c9-90bd-cb894fc8613e.jpg
    │   │   │   ├── 2f2a2358-3429-492c-a90e-63d59dc2dfe3.jpg
    │   │   │   ├── fbbbe163-5e0f-4b90-a948-33373aff2e24.jpg
    │   │   │   └── ...
    │   │   └── ...
    │   ├── val
    │   └── test
    └── Google_Landmarks_v2/
        ├── 0/
        │   ├── 0/
        │   │   ├── 0/
        │   │   │   ├── 0000059611c7d079.jpg
        │   │   │   ├── 0000070506c174cc.jpg
        │   │   │   ├── 000008ae30de967e.jpg
        │   │   │   └── ...
        │   │   ├── 1
        │   │   ├── 2
        │   │   └── ...
        │   └── ...
        └── ...
```
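As a hedged convenience check (not part of the repository), you can verify that both top-level image folders exist under your `image_root_query`:

```python
from pathlib import Path

# Sanity-check the expected top-level folders under image_root_query.
image_root_query = Path('/path/to/image_root_query')
for sub in ('iNaturalist', 'Google_Landmarks_v2'):
    path = image_root_query / sub
    print(f"{path}: {'OK' if path.is_dir() else 'MISSING'}")
```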
Run `evqa_rag_llava_more.sh` to generate answers with LLaVA-MORE-8B, or `evqa_rag_qwen.sh` to generate them with Qwen2.5-VL-7B-Instruct. In either case, modify the scripts as follows:

- `dataset_path_query`: same as above.
- `image_root_query`: same as above.
- `ranking_path`: path to the `ranking.jsonl` file generated by `evqa_index_and_search.sh`.
- `output_path`: the JSONL file that stores the answer to each visual question.
In E-VQA, the evaluation metric is accuracy, where an answer is deemed correct if its BEM score against the reference is greater than 0.5. To compute it, run the following:
```bash
python src/rag/metrics/evqa_compute_metrics.py --input_path /path/to/my_predictions.jsonl
```
`evqa_compute_metrics.py` requires TensorFlow to run. The official Python dependencies can be found here. Unfortunately, we did not succeed in installing TensorFlow in our environment due to some conflicts, so you may need a dedicated environment for this step.
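For reference, once per-example BEM scores are available, the accuracy is simply the fraction of scores above 0.5. The sketch below assumes a hypothetical `bem_scores` list; computing the scores themselves requires the TensorFlow-based BEM model used by `evqa_compute_metrics.py`.

```python
# Hedged sketch: accuracy from precomputed BEM scores (hypothetical `bem_scores` list).
# Producing the scores requires the TensorFlow-based BEM model used by
# evqa_compute_metrics.py; this only shows the thresholding step.
bem_scores = [0.91, 0.12, 0.73, 0.49]
accuracy = sum(s > 0.5 for s in bem_scores) / len(bem_scores)
print(f"accuracy: {accuracy:.3f}")
```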
If you use our work, please cite it with the following BibTeX:
```bibtex
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

@article{caffagni2025recurrencemeetstransformers,
  title={{Recurrence Meets Transformers for Universal Multimodal Retrieval}},
  author={Davide Caffagni and Sara Sarto and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara},
  journal={arXiv preprint arXiv:2509.08897},
  year={2025}
}
```