Title: Investigating Effectiveness of Large Language Models in Automated Software Engineering
Contributors: Wang, Song; Shin, Ji-ho
Dates: 2025-08-27; 2025-11-11
URI: https://hdl.handle.net/10315/43345
Type: Electronic Thesis or Dissertation

Abstract: Artificial intelligence has reshaped automated software engineering (ASE), shifting practice from task-specific fine-tuning of deep learning (DL) models towards prompt-based interaction with large language models (LLMs). This work addresses the unresolved question of when and how to use traditional fine-tuned models versus LLM prompting for code, and proposes guidance grounded in empirical evidence. First, we investigate machine learning (ML) code generation with fine-tuned models. We observe that models perform better on ML tasks than on general programs; however, generated code improves developer performance primarily by hinting at correct APIs, which suggests decomposing generation into smaller steps and incorporating human feedback. Second, we compare GPT-4 prompting with fine-tuned baselines on ASE tasks, i.e., code summarization, code generation, and code translation. GPT-4 excels with task-specific prompts on some tasks (e.g., comment generation), but fine-tuned models remain stronger for code generation and translation. A user study shows that conversational, iterative prompting significantly boosts perceived usefulness and output quality. Third, we assess neural test oracle generators, focusing on the correlation between textual similarity metrics (static) and test adequacy metrics (dynamic). We find that textual similarity metrics correlate only weakly with dynamic adequacy metrics such as coverage and mutation score, so relying solely on static metrics is misleading; execution-based metrics should be the primary means of assessing test oracle effectiveness. Finally, we investigate the effectiveness of different knowledge sources for RAG-based LLM test generation. Specifically, we mine API documentation, GitHub issues, and StackOverflow Q&As to form the RAG database. We find that RAG significantly improves coverage and bug detection, and that GitHub issues are especially beneficial because they contain rich, context-specific usage examples and bug-inducing behaviours. Overall, the dissertation demonstrates that combining domain adaptation (fine-tuning) and retrieval-based prompting can mitigate the limitations of both paradigms. Future work will broaden the evaluation to non-functional requirements and extend to other tasks, e.g., issue resolution and feature implementation.

Rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
Subjects: Computer science; Artificial intelligence; Engineering
Keywords: AI4SE; FM4SE; LLM4SE; Code generation; Unit test generation; Test oracle generation; Code summarization; Code translation; Prompt engineering; Fine-tuning models
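
The third study's central claim, that static textual-similarity scores are weak proxies for dynamic test adequacy, can be illustrated with a simple rank-correlation check. The sketch below is not the dissertation's evaluation code; the metric values are hypothetical placeholders and the analysis is a minimal Spearman correlation over them.

    # Minimal sketch: rank-correlate a static textual-similarity metric with
    # dynamic adequacy metrics for a set of generated test oracles.
    # The numbers below are hypothetical placeholders, not results from the thesis.
    from scipy.stats import spearmanr

    # One entry per generated oracle:
    # (textual similarity to reference, line coverage, mutation score).
    oracles = [
        (0.91, 0.42, 0.30),
        (0.35, 0.78, 0.65),
        (0.67, 0.51, 0.40),
        (0.80, 0.44, 0.33),
        (0.22, 0.69, 0.58),
    ]

    similarity = [o[0] for o in oracles]
    coverage = [o[1] for o in oracles]
    mutation = [o[2] for o in oracles]

    for name, dynamic in (("coverage", coverage), ("mutation score", mutation)):
        rho, p = spearmanr(similarity, dynamic)
        # A |rho| near 0 (or an insignificant p-value) indicates the static metric
        # says little about the oracle's execution-based effectiveness.
        print(f"similarity vs {name}: rho={rho:.2f}, p={p:.3f}")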
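
The fourth study mines API documentation, GitHub issues, and StackOverflow Q&As to build a RAG database for LLM test generation. The retrieval step of such a pipeline can be sketched as below; TF-IDF retrieval, the corpus entries, and the Parser.parse focal method are illustrative assumptions, not the retriever or subjects used in the dissertation.

    # Minimal sketch of the retrieval step in RAG-based unit test generation:
    # index mined snippets (API docs, GitHub issues, StackOverflow Q&As) and
    # retrieve the most relevant ones to prepend to an LLM prompt.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical mined snippets; a real database would hold thousands.
    corpus = [
        "API doc: Parser.parse(text) raises SyntaxError on unbalanced brackets.",
        "GitHub issue #123: Parser.parse crashes on empty input instead of returning None.",
        "StackOverflow: How do I make Parser.parse handle unicode escapes?",
    ]

    # Query built from the focal method under test (hypothetical).
    query = "generate unit tests for Parser.parse handling empty and malformed input"

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([query])

    # Rank mined snippets by cosine similarity to the focal-method query.
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    top_k = scores.argsort()[::-1][:2]

    retrieved = "\n".join(corpus[i] for i in top_k)
    prompt = (
        "You are generating unit tests.\n"
        f"Context mined from project artifacts:\n{retrieved}\n"
        "Write tests for Parser.parse covering the behaviours above."
    )
    print(prompt)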