Multimodal Representation Learning in Medical Using Vision-Language Models

dc.contributor.advisorDolatabadi, Elham
dc.contributor.authorBaghbanzadeh, Negin
dc.date.accessioned2025-11-11T20:13:17Z
dc.date.available2025-11-11T20:13:17Z
dc.date.copyright2025-08-12
dc.date.issued2025-11-11
dc.date.updated2025-11-11T20:13:16Z
dc.degree.disciplineComputer Science
dc.degree.levelMaster's
dc.degree.nameMSc - Master of Science
dc.description.abstractRecent advances in multimodal models such as LLaVA and InstructBLIP highlight the importance of high-quality image encoders, particularly in the biomedical domain where figures and captions are complex. Existing medical vision–language datasets primarily emphasize scale, often overlooking data quality. In this thesis, we introduce OPEN-PMC, a carefully curated collection of biomedical image–text pairs derived from PubMed Central. OPEN-PMC incorporates multiple refinement steps, including compound figure decomposition with a modality-robust detector, subcaption segmentation, in-text reference extraction and summarization, and modality-aware classification. Using OPEN-PMC-2M, we conduct controlled experiments to quantify the effect of each processing step on contrastive pretraining. Our findings show that subfigure decomposition and enriched captions substantially improve retrieval, zero-shot classification, and robustness, outperforming larger but noisier datasets. Scaling to OPEN-PMC-18M, one of the largest curated biomedical VL datasets to date, we demonstrate state-of-the-art performance while discussing remaining limitations in large-scale contextual augmentation and clinical validation.
dc.identifier.urihttps://hdl.handle.net/10315/43371
dc.languageen
dc.rightsAuthor owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subjectComputer science
dc.subjectMedical imaging and radiology
dc.subject.keywordsMultimodal
dc.subject.keywordsContrastive learning
dc.subject.keywordsMedical
dc.titleMultimodal Representation Learning in Medical Using Vision-Language Models
dc.typeElectronic Thesis or Dissertation

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Baghbanzadeh_Negin_2025_MSc.pdf
Size:
2.8 MB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 2 of 2
Loading...
Thumbnail Image
Name:
license.txt
Size:
1.87 KB
Format:
Plain Text
Description:
Loading...
Thumbnail Image
Name:
YorkU_ETDlicense.txt
Size:
3.39 KB
Format:
Plain Text
Description:

Collections