Investigating Effectiveness of Large Language Models in Automated Software Engineering

dc.contributor.advisorWang, Song
dc.contributor.authorShin, Ji-ho
dc.date.accessioned2025-11-11T20:10:00Z
dc.date.available2025-11-11T20:10:00Z
dc.date.copyright2025-08-27
dc.date.issued2025-11-11
dc.date.updated2025-11-11T20:10:00Z
dc.degree.disciplineElectrical Engineering & Computer Science
dc.degree.levelDoctoral
dc.degree.namePhD - Doctor of Philosophy
dc.description.abstractArtificial intelligence has reshaped automated software engineering (ASE), shifting practice from task-specific fine-tuning of deep learning (DL) models towards prompt-based interaction with large language models (LLMs). This work addresses the unresolved question of when and how to use traditional fine-tuned models versus LLM prompting for code, and proposes guidance grounded in empirical evidence. First, we investigate Machine Learning (ML) code generation with fine-tuned models. We observe that models perform better on ML tasks than on general programs; however, the generated code improves developer performance primarily by hinting at correct APIs, suggesting that decomposing the generation process and incorporating human feedback would further help developers. Second, we compare GPT-4 prompting with fine-tuned baselines on ASE tasks, namely code summarization, code generation, and code translation. GPT-4 with task-specific prompts excels at some tasks (e.g., comment generation), but fine-tuned models remain stronger for code generation and translation. A user study shows that conversational, iterative prompting significantly boosts perceived usefulness and output quality. Third, we assess neural test oracle generators, focusing on the correlation between textual similarity metrics (static) and test adequacy metrics (dynamic). We find that textual similarity metrics correlate only weakly with dynamic adequacy metrics such as coverage and mutation score, so relying solely on static metrics is misleading; execution-based metrics should therefore be the primary means of assessing test oracle effectiveness. Lastly, we investigate the effectiveness of different knowledge sources for retrieval-augmented generation (RAG) in LLM-based test generation. Specifically, we mine API documentation, GitHub issues, and Stack Overflow Q&As to form the RAG database. We find that RAG significantly improves coverage and bug detection, and that GitHub issues are especially beneficial because they contain rich, context-specific usage examples and bug-inducing behaviours. Overall, the dissertation demonstrates that combining domain adaptation (fine-tuning) and retrieval-based prompting can mitigate the limitations of both paradigms. Future work will broaden the evaluation to non-functional requirements and extend to other tasks, e.g., issue resolution and feature implementation.
dc.identifier.urihttps://hdl.handle.net/10315/43345
dc.languageen
dc.rightsAuthor owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subjectComputer science
dc.subjectArtificial intelligence
dc.subjectEngineering
dc.subject.keywordsAI4SE
dc.subject.keywordsFM4SE
dc.subject.keywordsLLM4SE
dc.subject.keywordsCode generation
dc.subject.keywordsUnit test generation
dc.subject.keywordsTest oracle generation
dc.subject.keywordsCode summarization
dc.subject.keywordsCode translation
dc.subject.keywordsPrompt engineering
dc.subject.keywordsFine-tuning models
dc.titleInvestigating Effectiveness of Large Language Models in Automated Software Engineering
dc.typeElectronic Thesis or Dissertation

Files

Original bundle

Name: Shin_Ji-ho_2025_PhD.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.87 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text