Multimodal Representation Learning in Medicine Using Vision-Language Models
| Field | Value |
| --- | --- |
| dc.contributor.advisor | Dolatabadi, Elham |
| dc.contributor.author | Baghbanzadeh, Negin |
| dc.date.accessioned | 2025-11-11T20:13:17Z |
| dc.date.available | 2025-11-11T20:13:17Z |
| dc.date.copyright | 2025-08-12 |
| dc.date.issued | 2025-11-11 |
| dc.date.updated | 2025-11-11T20:13:16Z |
| dc.degree.discipline | Computer Science |
| dc.degree.level | Master's |
| dc.degree.name | MSc - Master of Science |
| dc.description.abstract | Recent advances in multimodal models such as LLaVA and InstructBLIP highlight the importance of high-quality image encoders, particularly in the biomedical domain, where figures and captions are complex. Existing medical vision–language datasets primarily emphasize scale and often overlook data quality. In this thesis, we introduce OPEN-PMC, a carefully curated collection of biomedical image–text pairs derived from PubMed Central. OPEN-PMC incorporates multiple refinement steps: compound-figure decomposition with a modality-robust detector, subcaption segmentation, in-text reference extraction and summarization, and modality-aware classification. Using the resulting OPEN-PMC-2M dataset, we conduct controlled experiments to quantify the effect of each processing step on contrastive pretraining (a minimal sketch of this objective follows the table below). Our findings show that subfigure decomposition and enriched captions substantially improve retrieval, zero-shot classification, and robustness, outperforming larger but noisier datasets. Scaling to OPEN-PMC-18M, one of the largest curated biomedical vision–language datasets to date, we demonstrate state-of-the-art performance while discussing remaining limitations in large-scale contextual augmentation and clinical validation. |
| dc.identifier.uri | https://hdl.handle.net/10315/43371 |
| dc.language | en |
| dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. |
| dc.subject | Computer science |
| dc.subject | Medical imaging and radiology |
| dc.subject.keywords | Multimodal |
| dc.subject.keywords | Contrastive learning |
| dc.subject.keywords | Medical |
| dc.title | Multimodal Representation Learning in Medicine Using Vision-Language Models |
| dc.type | Electronic Thesis or Dissertation |
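
The abstract centers on contrastive pretraining over curated image–caption pairs. As a reference point, the following is a minimal sketch of the symmetric CLIP-style InfoNCE objective such pretraining typically uses; the function name, batch size, and embedding width are illustrative assumptions, not code from the thesis.

```python
# Minimal sketch of a symmetric CLIP-style contrastive objective.
# All names and dimensions here are hypothetical illustrations.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 image/caption embedding pairs of width 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt))
```

Under this objective, every non-matching caption in the batch serves as a negative for each image (and vice versa), which is why the data-quality steps the abstract describes, such as subfigure decomposition and caption enrichment, directly affect the training signal.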
Files
Original bundle
- Name: Baghbanzadeh_Negin_2025_MSc.pdf
- Size: 2.8 MB
- Format: Adobe Portable Document Format