Multimodal Representation Learning in Medicine Using Vision-Language Models
| Field | Value |
| --- | --- |
| dc.contributor.advisor | Dolatabadi, Elham |
| dc.contributor.author | Baghbanzadeh, Negin |
| dc.date.accessioned | 2025-11-11T20:13:17Z |
| dc.date.available | 2025-11-11T20:13:17Z |
| dc.date.copyright | 2025-08-12 |
| dc.date.issued | 2025-11-11 |
| dc.date.updated | 2025-11-11T20:13:16Z |
| dc.degree.discipline | Computer Science |
| dc.degree.level | Master's |
| dc.degree.name | MSc - Master of Science |
| dc.description.abstract | Recent advances in multimodal models such as LLaVA and InstructBLIP highlight the importance of high-quality image encoders, particularly in the biomedical domain, where figures and captions are complex. Existing medical vision–language datasets primarily emphasize scale and often overlook data quality. In this thesis, we introduce OPEN-PMC, a carefully curated collection of biomedical image–text pairs derived from PubMed Central. OPEN-PMC incorporates multiple refinement steps: compound-figure decomposition with a modality-robust detector, subcaption segmentation, in-text reference extraction and summarization, and modality-aware classification. Using the resulting OPEN-PMC-2M dataset, we conduct controlled experiments to quantify the effect of each processing step on contrastive pretraining (a minimal sketch of this objective follows the table below). Our findings show that subfigure decomposition and enriched captions substantially improve retrieval, zero-shot classification, and robustness, outperforming larger but noisier datasets. Scaling to OPEN-PMC-18M, one of the largest curated biomedical vision–language datasets to date, we demonstrate state-of-the-art performance while discussing remaining limitations in large-scale contextual augmentation and clinical validation. |
| dc.identifier.uri | https://hdl.handle.net/10315/43371 |
| dc.language | en |
| dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. |
| dc.subject | Computer science |
| dc.subject | Medical imaging and radiology |
| dc.subject.keywords | Multimodal |
| dc.subject.keywords | Contrastive learning |
| dc.subject.keywords | Medical |
| dc.title | Multimodal Representation Learning in Medicine Using Vision-Language Models |
| dc.type | Electronic Thesis or Dissertation |
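
The abstract centers on contrastive pretraining over curated image–caption pairs. As a reference point, the following is a minimal sketch of the symmetric CLIP-style InfoNCE objective such pretraining typically uses; the function name, batch size, and embedding width are illustrative assumptions, not code from the thesis.

```python
# Minimal sketch of a symmetric CLIP-style contrastive objective.
# All names and dimensions here are hypothetical illustrations.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 image/caption embedding pairs of width 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt))
```

Under this objective, every non-matching caption in the batch serves as a negative for each image (and vice versa), which is why the data-quality steps the abstract describes, such as subfigure decomposition and caption enrichment, directly affect the training signal.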
Files
Original bundle
- Name: Baghbanzadeh_Negin_2025_MSc.pdf
- Size: 2.8 MB
- Format: Adobe Portable Document Format