Authors: Dolatabadi, Elham; Baghbanzadeh, Negin
Dates: 2025-11-11; 2025-11-11; 2025-08-12; 2025-11-11
URI: https://hdl.handle.net/10315/43371

Abstract: Recent advances in multimodal models such as LLaVA and InstructBLIP highlight the importance of high-quality image encoders, particularly in the biomedical domain where figures and captions are complex. Existing medical vision–language datasets primarily emphasize scale, often overlooking data quality. In this thesis, we introduce OPEN-PMC, a carefully curated collection of biomedical image–text pairs derived from PubMed Central. OPEN-PMC incorporates multiple refinement steps, including compound figure decomposition with a modality-robust detector, subcaption segmentation, in-text reference extraction and summarization, and modality-aware classification. Using OPEN-PMC-2M, we conduct controlled experiments to quantify the effect of each processing step on contrastive pretraining. Our findings show that subfigure decomposition and enriched captions substantially improve retrieval, zero-shot classification, and robustness, outperforming larger but noisier datasets. Scaling to OPEN-PMC-18M, one of the largest curated biomedical VL datasets to date, we demonstrate state-of-the-art performance while discussing remaining limitations in large-scale contextual augmentation and clinical validation.

Rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
Subjects: Computer science; Medical imaging and radiology
Title: Multimodal Representation Learning in Medical Using Vision-Language Models
Type: Electronic Thesis or Dissertation
Date: 2025-11-11
Keywords: Multimodal; Contrastive learning; Medical
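The abstract refers to contrastive pretraining on curated image–text pairs. As a point of reference only, the sketch below shows a generic CLIP-style symmetric InfoNCE objective over a batch of paired embeddings; it is a minimal NumPy illustration, not the thesis's implementation, and the function name, embedding dimension, batch size, and temperature value are assumptions.

```python
# Minimal sketch of a CLIP-style contrastive objective for image-text pairs.
# Assumptions: embeddings are precomputed, batch entries at the same index
# are matched pairs, and temperature=0.07; none of this is taken from the thesis.
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot products below are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix scaled by temperature; diagonal entries
    # correspond to the matched image-text pairs.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(logits.shape[0])

    def cross_entropy(lgts, lbls):
        # Log-softmax with max subtraction for numerical stability.
        lgts = lgts - lgts.max(axis=1, keepdims=True)
        log_probs = lgts - np.log(np.exp(lgts).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lbls)), lbls].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imgs = rng.normal(size=(8, 256))   # batch of 8 image embeddings (illustrative)
    txts = rng.normal(size=(8, 256))   # batch of 8 caption embeddings (illustrative)
    print(f"loss: {clip_contrastive_loss(imgs, txts):.4f}")
```

In this formulation, each image embedding is pulled toward its own caption and pushed away from the other captions in the batch (and vice versa), which is the standard mechanism behind the contrastive pretraining experiments the abstract describes.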