SWE-Bench+: Enhanced Coding Benchmark For LLMs
dc.contributor.advisor | Song Wang | |
dc.contributor.author | Aleithan, Reem | |
dc.date.accessioned | 2025-07-23T15:15:11Z | |
dc.date.available | 2025-07-23T15:15:11Z | |
dc.date.copyright | 2025-04-10 | |
dc.date.issued | 2025-07-23 | |
dc.date.updated | 2025-07-23T15:15:10Z | |
dc.degree.discipline | Computer Science | |
dc.degree.level | Master's | |
dc.degree.name | MSc - Master of Science | |
dc.description.abstract | Large Language Models (LLMs) can offer valuable assistance with coding tasks in Software Engineering (SE). To enable rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench itself remains missing. In this thesis, we address this gap with an empirical analysis of the SWE-bench dataset. We manually screen instances that SWE-Agent + GPT-4 successfully resolved, comparing the model-generated patches with the developer-written pull requests. Our analysis reveals two critical issues: (1) 33.47% of patches involve solution leakage, where the fix is directly or indirectly revealed in the issue report or its comments; and (2) 24.70% of successful patches are suspicious because weak test cases fail to detect incorrect, incomplete, or irrelevant fixes. Filtering out these problematic instances drops SWE-Agent + GPT-4’s resolution rate from 12.47% to 4.58%. Motivated by these findings, we propose SWE-Bench+, a refined version of the benchmark built with two LLM-based tools: SoluLeakDetector, which identifies solution-leak issues, and TestEnhancer, which reduces weak test cases. Together, these tools identify solution-leak issues with 86% accuracy and reduce suspicious patches by 19%. To mitigate potential data leakage, we also collect a new set of GitHub issues created after the models’ training cutoff. Evaluating models on this dataset, we observe a consistent performance drop across all models, highlighting the extent to which solution leakage and weak tests inflate resolution rates in current benchmarks. | |
dc.identifier.uri | https://hdl.handle.net/10315/43000 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Computer science | |
dc.subject | Computer engineering | |
dc.subject.keywords | SWE-Bench | |
dc.subject.keywords | Evaluation Benchmarks | |
dc.subject.keywords | Software Engineering | |
dc.subject.keywords | Artificial Intelligence | |
dc.title | SWE-Bench+: Enhanced Coding Benchmark For LLMs | |
dc.type | Electronic Thesis or Dissertation |