SWE-Bench+: Enhanced Coding Benchmark For LLMs
dc.contributor.advisor | Song Wang | |
dc.contributor.author | Aleithan, Reem | |
dc.date.accessioned | 2025-07-23T15:15:11Z | |
dc.date.available | 2025-07-23T15:15:11Z | |
dc.date.copyright | 2025-04-10 | |
dc.date.issued | 2025-07-23 | |
dc.date.updated | 2025-07-23T15:15:10Z | |
dc.degree.discipline | Computer Science | |
dc.degree.level | Master's | |
dc.degree.name | MSc - Master of Science | |
dc.description.abstract | Large Language Models (LLMs) can offer valuable assistance with coding tasks in Software Engineering (SE). To enable rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench itself remains missing. In this thesis, we address this gap with an empirical analysis of the SWE-bench dataset. We manually screen instances that SWE-Agent + GPT-4 successfully resolved, comparing the model-generated patches with the developer-written pull requests. Our analysis reveals two critical issues: (1) 33.47% of patches involve solution leakage, where the fix is directly or indirectly revealed in the issue report or its comments; and (2) 24.70% of successful patches are suspicious because weak test cases fail to detect incorrect, incomplete, or irrelevant fixes. Filtering out these problematic instances drops SWE-Agent + GPT-4’s resolution rate from 12.47% to 4.58%. Motivated by these findings, we propose SWE-Bench+, a refined version of the benchmark built with two LLM-based tools: SoluLeakDetector, which identifies solution-leak issues, and TestEnhancer, which reduces weak test cases. Together, these tools identify solution-leak issues with 86% accuracy and reduce suspicious patches by 19%. To mitigate potential data leakage, we also collect a new set of GitHub issues created after the models’ training cutoff. Evaluating models on this dataset, we observe a consistent performance drop across all models, highlighting the extent to which solution leakage and weak tests inflate resolution rates in current benchmarks. | |
dc.identifier.uri | https://hdl.handle.net/10315/43000 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Computer science | |
dc.subject | Computer engineering | |
dc.subject.keywords | SWE-Bench | |
dc.subject.keywords | Evaluation Benchmarks | |
dc.subject.keywords | Software Engineering | |
dc.subject.keywords | Artificial Intelligence | |
dc.title | SWE-Bench+: Enhanced Coding Benchmark For LLMs | |
dc.type | Electronic Thesis or Dissertation |