Efficient Text-Image Retrieval Using Large Language Models
Abstract
Efficient retrieval from large-scale image databases is a key challenge, particularly as applications increasingly rely on multimodal models such as CLIP. While CLIP offers strong joint image–text representations for semantic search, its globally pooled embeddings often struggle with fine-grained, multi-concept queries, leading to high false-positive rates and reliance on costly verification models. To address this, we propose a hybrid framework that structures the embedding space through feature clustering and models candidate selection as a multi-armed bandit problem. Each cluster acts as an arm, with relevance scores from ground-truth systems serving as rewards. Using Thompson Sampling, the approach balances exploration and exploitation to quickly identify promising clusters, reducing unnecessary ground-truth queries. Experiments show that our method significantly improves precision and lowers computational cost in multi-keyword retrieval tasks, enabling scalable, fine-grained retrieval in resource-constrained settings. This structured, adaptive approach effectively enhances CLIP-based retrieval pipelines.
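As a rough illustration of the bandit component described above (not the authors' implementation), the sketch below applies Beta-Bernoulli Thompson Sampling over pre-computed embedding clusters, treating each cluster as an arm and a costly verification call as the binary reward. The function names `thompson_retrieval` and `verify_with_ground_truth`, the cluster representation, and the fixed query budget are illustrative assumptions.

```python
# Minimal Thompson Sampling sketch over embedding clusters (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def thompson_retrieval(clusters, verify_with_ground_truth, budget):
    """Select clusters to query under a fixed ground-truth budget.

    clusters: list of per-cluster candidate lists (one arm per cluster).
    verify_with_ground_truth: callable(candidate) -> 1 if relevant else 0
        (stands in for the costly verification model used as the reward).
    budget: total number of ground-truth calls allowed.
    """
    n = len(clusters)
    alpha = np.ones(n)   # Beta prior successes per arm
    beta = np.ones(n)    # Beta prior failures per arm
    relevant = []

    for _ in range(budget):
        # Sample a relevance estimate for every arm, then exploit the best one.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        if not clusters[arm]:
            beta[arm] += 1   # exhausted cluster: count as a failure and move on
            continue
        candidate = clusters[arm].pop()
        reward = verify_with_ground_truth(candidate)  # 1 = relevant, 0 = not
        alpha[arm] += reward
        beta[arm] += 1 - reward
        if reward:
            relevant.append(candidate)

    return relevant
```

Under this setup, clusters that yield verified matches are sampled more often, while unpromising clusters receive fewer of the limited ground-truth calls, which is the exploration-exploitation trade-off the abstract refers to.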