SWE-Bench+: Enhanced Coding Benchmark For LLMs
Abstract
Large Language Models (LLMs) can offer valuable assistance for coding tasks in Software Engineering (SE). To facilitate rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench itself remains missing. In this thesis, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We manually screen the instances that SWE-Agent + GPT-4 reportedly resolved, comparing the model-generated patches against the developer-written pull requests. Our analysis reveals two critical issues: (1) 33.47% of the patches exhibit solution leakage, where the fix is directly or indirectly revealed in the issue report or its comments; and (2) 24.70% of the successful patches are suspicious because weak test cases fail to detect incorrect, incomplete, or irrelevant fixes. Filtering out these problematic instances drops SWE-Agent + GPT-4's resolution rate from 12.47% to 4.58%. Motivated by these findings, we propose SWE-Bench+, a refined version of the benchmark built with two LLM-based tools: SoluLeakDetector to identify solution-leak issues and TestEnhancer to strengthen weak test cases. SWE-Bench+ identifies solution-leak issues with 86% accuracy and reduces suspicious patches by 19%. To further reduce the risk of data leakage, we collect a new set of GitHub issues created after the models' training cutoff. Evaluating models on this dataset, we observe a consistent performance drop across all models. This highlights how solution leakage and weak tests inflate resolution rates in current benchmarks.
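
To make the notion of solution leakage concrete, the sketch below flags an instance when a non-trivial line added by the developer's gold patch already appears verbatim in the issue text. This is only a hypothetical heuristic for illustration; the thesis's SoluLeakDetector is LLM-based, and the field names and example data in the snippet are invented (loosely mirroring SWE-bench's problem_statement and patch columns).

# Hypothetical heuristic sketch of a solution-leak check, not the actual
# SoluLeakDetector: flag an instance when a substantial line added by the
# developer's gold patch already appears verbatim in the issue report text.

def added_lines(gold_patch: str, min_len: int = 20) -> list[str]:
    """Collect substantial code lines added by a unified-diff patch ('+' lines)."""
    lines = []
    for line in gold_patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            stripped = line[1:].strip()
            if len(stripped) >= min_len:  # skip short/trivial lines to reduce noise
                lines.append(stripped)
    return lines

def is_possible_solution_leak(issue_text: str, gold_patch: str) -> bool:
    """True if any substantial added line of the fix already appears in the issue."""
    return any(line in issue_text for line in added_lines(gold_patch))

# Example on a made-up instance: the issue text quotes the exact fix.
instance = {
    "problem_statement": "parse() crashes; adding `if value is None: return default` fixes it",
    "patch": "--- a/lib.py\n+++ b/lib.py\n+    if value is None: return default\n",
}
print(is_possible_solution_leak(instance["problem_statement"], instance["patch"]))  # True

A lexical check like this only catches direct leakage; indirect leakage (the fix described in prose rather than quoted code) is the case that motivates an LLM-based detector.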