SWE-Bench+: Enhanced Coding Benchmark For LLMs

dc.contributor.advisor: Song Wang
dc.contributor.author: Aleithan, Reem
dc.date.accessioned: 2025-07-23T15:15:11Z
dc.date.available: 2025-07-23T15:15:11Z
dc.date.copyright: 2025-04-10
dc.date.issued: 2025-07-23
dc.date.updated: 2025-07-23T15:15:10Z
dc.degree.discipline: Computer Science
dc.degree.level: Master's
dc.degree.name: MSc - Master of Science
dc.description.abstract: Large Language Models (LLMs) can offer valuable assistance for coding tasks in Software Engineering (SE). To facilitate a rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench itself remains missing. In this thesis, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We manually screen instances where SWE-Agent + GPT-4 successfully resolved the issues by comparing the model-generated patches with the developer-written pull requests. Our analysis reveals two critical issues: (1) 33.47% of the patches exhibit solution leakage, where the fix is directly or indirectly revealed in the issue report or its comments; and (2) 24.70% of the successful patches are suspicious because weak test cases fail to detect incorrect, incomplete, or irrelevant fixes. Filtering out these problematic instances drops SWE-Agent + GPT-4's resolution rate from 12.47% to 4.58%. Motivated by these findings, we propose SWE-Bench+, a refined version of the benchmark built with two LLM-based tools: SoluLeakDetector, which identifies solution-leak issues, and TestEnhancer, which strengthens weak test cases. SWE-Bench+ identifies solution-leak issues with 86% accuracy and reduces suspicious patches by 19%. To reduce the risk of data leakage, we also collect a new set of GitHub issues created after the models' training cutoff. Evaluating models on this dataset, we observe a consistent performance drop across all models, highlighting how solution leakage and weak tests inflate resolution rates in current benchmarks.
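For illustration only, the following minimal Python sketch shows one way a solution-leak check of the kind described in the abstract could be approximated: flag an instance when code lines added by a patch already appear verbatim in the issue report. The function names and the sample issue and patch below are hypothetical; this is not the thesis's SoluLeakDetector, which is an LLM-based tool.

import re

def added_lines(patch: str) -> list[str]:
    # Collect non-trivial lines added by a unified diff (hypothetical helper).
    lines = []
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            stripped = line[1:].strip()
            if len(stripped) > 10:  # ignore blank or trivial additions
                lines.append(stripped)
    return lines

def looks_like_leak(issue_text: str, patch: str) -> bool:
    # Heuristic leak check: the fix counts as "revealed" if any added line
    # occurs verbatim (whitespace-normalized) in the issue report or comments.
    normalized_issue = re.sub(r"\s+", " ", issue_text)
    return any(re.sub(r"\s+", " ", l) in normalized_issue for l in added_lines(patch))

# Hypothetical example: the issue text already contains the one-line fix.
issue = "Bug: foo() returns negatives. Suggested fix: return max(x, 0) inside foo()."
patch = "--- a/foo.py\n+++ b/foo.py\n@@\n-    return x\n+    return max(x, 0)"
print(looks_like_leak(issue, patch))  # True -> potential solution leakage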
dc.identifier.uri: https://hdl.handle.net/10315/43000
dc.language: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Computer science
dc.subject: Computer engineering
dc.subject.keywords: SWE-Bench
dc.subject.keywords: Evaluation Benchmarks
dc.subject.keywords: Software Engineering
dc.subject.keywords: Artificial Intelligence
dc.title: SWE-Bench+: Enhanced Coding Benchmark For LLMs
dc.type: Electronic Thesis or Dissertation

Files

Original bundle
Name: Aleithan_Reem_2025_MSc.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format