Efficient Text-Image Retrieval Using Large Language Models
Abstract
Efficient retrieval from large-scale image databases is a key challenge, particularly as applications increasingly rely on multimodal models such as CLIP. While CLIP offers strong joint image–text representations for semantic search, its globally pooled embeddings often struggle with fine-grained, multi-concept queries, leading to high false-positive rates and reliance on costly verification models. To address this, we propose a hybrid framework that structures the embedding space through feature clustering and models candidate selection as a multi-armed bandit problem. Each cluster acts as an arm, with relevance scores from ground-truth systems serving as rewards. Using Thompson Sampling, the approach balances exploration and exploitation to quickly identify promising clusters, reducing unnecessary ground-truth queries. Experiments show that our method significantly improves precision and lowers computational cost in multi-keyword retrieval tasks, enabling scalable, fine-grained retrieval in resource-constrained settings. This structured, adaptive approach effectively enhances CLIP-based retrieval pipelines.
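As a rough illustration of the bandit component described above (not the authors' implementation), the sketch below applies Beta-Bernoulli Thompson Sampling over pre-computed embedding clusters, treating each cluster as an arm and a costly verification call as the binary reward. The function names `thompson_retrieval` and `verify_with_ground_truth`, the cluster representation, and the fixed query budget are illustrative assumptions.

```python
# Minimal Thompson Sampling sketch over embedding clusters (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def thompson_retrieval(clusters, verify_with_ground_truth, budget):
    """Select clusters to query under a fixed ground-truth budget.

    clusters: list of per-cluster candidate lists (one arm per cluster).
    verify_with_ground_truth: callable(candidate) -> 1 if relevant else 0
        (stands in for the costly verification model used as the reward).
    budget: total number of ground-truth calls allowed.
    """
    n = len(clusters)
    alpha = np.ones(n)   # Beta prior successes per arm
    beta = np.ones(n)    # Beta prior failures per arm
    relevant = []

    for _ in range(budget):
        # Sample a relevance estimate for every arm, then exploit the best one.
        theta = rng.beta(alpha, beta)
        arm = int(np.argmax(theta))
        if not clusters[arm]:
            beta[arm] += 1   # exhausted cluster: count as a failure and move on
            continue
        candidate = clusters[arm].pop()
        reward = verify_with_ground_truth(candidate)  # 1 = relevant, 0 = not
        alpha[arm] += reward
        beta[arm] += 1 - reward
        if reward:
            relevant.append(candidate)

    return relevant
```

Under this setup, clusters that yield verified matches are sampled more often, while unpromising clusters receive fewer of the limited ground-truth calls, which is the exploration-exploitation trade-off the abstract refers to.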