Multimodal Representation Learning in Medicine Using Vision-Language Models
Abstract
Recent advances in multimodal models such as LLaVA and InstructBLIP highlight the importance of high-quality image encoders, particularly in the biomedical domain, where figures and captions are complex. Existing medical vision–language datasets primarily emphasize scale and often overlook data quality. In this thesis, we introduce OPEN-PMC, a carefully curated collection of biomedical image–text pairs derived from PubMed Central. OPEN-PMC incorporates multiple refinement steps, including compound-figure decomposition with a modality-robust detector, subcaption segmentation, in-text reference extraction and summarization, and modality-aware classification. Using the resulting OPEN-PMC-2M dataset, we conduct controlled experiments that quantify the effect of each processing step on contrastive pretraining. Our findings show that subfigure decomposition and enriched captions substantially improve retrieval, zero-shot classification, and robustness, outperforming larger but noisier datasets. Scaling to OPEN-PMC-18M, one of the largest curated biomedical vision–language datasets to date, we demonstrate state-of-the-art performance while discussing remaining limitations in large-scale contextual augmentation and clinical validation.
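To make the pretraining setup concrete, the sketch below shows the symmetric contrastive (CLIP-style) objective commonly used for image–caption pairs of the kind OPEN-PMC provides. It is a minimal illustration, not the thesis implementation: the encoders are stand-ins, and the batch size, embedding dimension, and temperature are assumed values.

import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the image and text encoders.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    imgs = torch.randn(8, 512)   # hypothetical image embeddings
    caps = torch.randn(8, 512)   # hypothetical caption embeddings
    print(clip_contrastive_loss(imgs, caps).item())

Under this objective, cleaner pairs (decomposed subfigures matched to their subcaptions and in-text context) reduce label noise on the diagonal, which is the mechanism by which the curation steps above can improve retrieval and zero-shot performance.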