Efficient Text-Image Retrieval Using Large Language Models


Authors

Liu, Jiahao

Abstract

Efficient retrieval from large-scale image databases is a key challenge, particularly as applications increasingly rely on multimodal models such as CLIP. While CLIP offers strong joint image–text representations for semantic search, its globally pooled embeddings often struggle with fine-grained, multi-concept queries, leading to high false-positive rates and reliance on costly verification models. To address this, we propose a hybrid framework that structures the embedding space through feature clustering and models candidate selection as a multi-armed bandit problem. Each cluster acts as an arm, with relevance scores from a ground-truth verification system serving as rewards. Using Thompson Sampling, the framework balances exploration and exploitation to quickly identify promising clusters, reducing unnecessary ground-truth queries. Experiments show that our method significantly improves precision and lowers computational cost in multi-keyword retrieval tasks, enabling scalable, fine-grained retrieval in resource-constrained settings. This structured, adaptive approach effectively enhances CLIP-based retrieval pipelines.
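The cluster-as-arm formulation described in the abstract can be sketched as a standard Thompson Sampling bandit with Beta posteriors. The sketch below is illustrative only, not the thesis implementation: the class name, the binary-reward assumption (a relevant/irrelevant verdict from the expensive verification model), and the Beta(1, 1) priors are all assumptions for the example.

```python
import random

class ThompsonSamplingRetriever:
    """Illustrative sketch: each embedding cluster is a bandit arm whose
    probability of yielding a relevant image gets a Beta(alpha, beta) posterior."""

    def __init__(self, num_clusters, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) uniform prior per cluster (an assumption for this sketch).
        self.alpha = [1.0] * num_clusters  # 1 + observed relevant results
        self.beta = [1.0] * num_clusters   # 1 + observed irrelevant results

    def select_cluster(self):
        # Thompson Sampling: draw one plausible relevance rate per cluster
        # from its posterior, then query the cluster with the highest draw.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, cluster, relevant):
        # Binary reward from the costly ground-truth verifier updates
        # that cluster's posterior, so bad clusters are queried less often.
        if relevant:
            self.alpha[cluster] += 1.0
        else:
            self.beta[cluster] += 1.0
```

In a simulation where one cluster has a clearly higher true relevance rate, the sampler concentrates its ground-truth queries on that cluster after a brief exploration phase, which is the cost-saving behavior the abstract describes.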

Keywords

Information technology
