Investigating Effectiveness of Large Language Models in Automated Software Engineering

dc.contributor.advisorWang, Song
dc.contributor.authorShin, Ji-ho
dc.date.accessioned2025-11-11T20:10:00Z
dc.date.available2025-11-11T20:10:00Z
dc.date.copyright2025-08-27
dc.date.issued2025-11-11
dc.date.updated2025-11-11T20:10:00Z
dc.degree.disciplineElectrical Engineering & Computer Science
dc.degree.levelDoctoral
dc.degree.namePhD - Doctor of Philosophy
dc.description.abstractArtificial intelligence has reshaped automated software engineering (ASE), shifting practice from task-specific fine-tuning of deep learning (DL) models towards prompt-based interaction with large language models (LLMs). This work addresses the unresolved question of when and how to use traditional fine-tuned models versus LLM prompting for code, and proposes guidance grounded in empirical evidence. First, we investigate Machine Learning (ML) code generation with fine-tuned models. We observe that models perform better on ML tasks than on general programs; however, the generated code improves developer performance primarily by hinting at correct APIs, suggesting that decomposing the generation process and incorporating human feedback would further help developers. Second, we compare GPT-4 prompting with fine-tuned baselines on ASE tasks, namely code summarization, code generation, and code translation. GPT-4 with task-specific prompts excels at some tasks (e.g., comment generation), but fine-tuned models remain stronger for code generation and translation. A user study shows that conversational, iterative prompting significantly boosts perceived usefulness and output quality. Third, we assess neural test oracle generators, focusing on the correlation between textual similarity metrics (static) and test adequacy metrics (dynamic). We find that textual similarity metrics correlate only weakly with dynamic adequacy metrics such as coverage and mutation score, so relying solely on static metrics is misleading; execution-based metrics should therefore be the primary means of assessing test oracle effectiveness. Lastly, we investigate the effectiveness of different knowledge sources for retrieval-augmented generation (RAG) in LLM-based test generation. Specifically, we mine API documentation, GitHub issues, and Stack Overflow Q&As to form the RAG database. We find that RAG significantly improves coverage and bug detection, and that GitHub issues are especially beneficial because they contain rich, context-specific usage examples and bug-inducing behaviours. Overall, the dissertation demonstrates that combining domain adaptation (fine-tuning) and retrieval-based prompting can mitigate the limitations of both paradigms. Future work will broaden the evaluation to non-functional requirements and extend to other tasks, e.g., issue resolution and feature implementation.
dc.identifier.urihttps://hdl.handle.net/10315/43345
dc.languageen
dc.rightsAuthor owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subjectComputer science
dc.subjectArtificial intelligence
dc.subjectEngineering
dc.subject.keywordsAI4SE
dc.subject.keywordsFM4SE
dc.subject.keywordsLLM4SE
dc.subject.keywordsCode generation
dc.subject.keywordsUnit test generation
dc.subject.keywordsTest oracle generation
dc.subject.keywordsCode summarization
dc.subject.keywordsCode translation
dc.subject.keywordsPrompt engineering
dc.subject.keywordsFine-tuning models
dc.titleInvestigating Effectiveness of Large Language Models in Automated Software Engineering
dc.typeElectronic Thesis or Dissertation

Files

Original bundle

Name: Shin_Ji-ho_2025_PhD.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.87 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text