Service-Level Prediction And Anomaly Detection Towards System Failure Root Cause Explainability In Microservice Applications

dc.contributor.advisor: Litoiu, Marin
dc.contributor.author: Rouf, Raphael
dc.date.accessioned: 2025-07-23T15:19:15Z
dc.date.available: 2025-07-23T15:19:15Z
dc.date.copyright: 2025-04-22
dc.date.issued: 2025-07-23
dc.date.updated: 2025-07-23T15:19:14Z
dc.degree.discipline: Information Systems and Technology
dc.degree.level: Master's
dc.degree.name: MA - Master of Arts
dc.description.abstract: Microservice and cloud computing operations are increasingly adopting automation. Models that foster resilient and efficient adaptive architectures are central to ensuring that services operate with the expected behavior and performance. To effectively predict, detect, and explain system failures, it is essential to develop a comprehensive understanding of the affected application. This means examining its unexpected behaviors from multiple perspectives, including logs, metrics, and dependencies, to uncover the underlying root causes. This thesis presents a novel approach to system failure prediction, root cause analysis, and explainable failure type analysis that leverages three modalities of IT observability data: logs, metrics, and traces. The proposed methodology integrates Graph Neural Networks (GNN) to capture spatial information and Gated Recurrent Units (GRU) to capture the temporal aspects of the data. A key emphasis lies in using a stitched representation derived from logs, microservice events, and resource metrics to predict system failures proactively. The traces are aggregated to construct a comprehensive service call flow graph, which is represented as a dynamic graph. Furthermore, permutation testing is applied to obtain node scores, which aid in identifying the root causes of these failures. We evaluate our approach on three open-source datasets, MicroSS, QOTD, and Train Ticket, which capture various types of system faults, such as resource overload and wrong manipulation faults. Our findings on real-world cases demonstrate that the proposed approach, which combines the three observability modalities with enhanced preprocessing, a GNN-GRU model, and a gradient distance explanation model, predicts failure types well and explains them effectively, helping engineers debug and diagnose issues.
dc.identifier.uri: https://hdl.handle.net/10315/43032
dc.language: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Information technology
dc.subject.keywords: Microservices
dc.subject.keywords: Multimodal data
dc.subject.keywords: Graph neural networks
dc.subject.keywords: Gated recurrent units
dc.subject.keywords: Temporal dynamics
dc.subject.keywords: Spatial features
dc.subject.keywords: Anomaly detection
dc.subject.keywords: System failure prediction
dc.subject.keywords: System failure localization
dc.subject.keywords: System failure root cause analysis
dc.title: Service-Level Prediction And Anomaly Detection Towards System Failure Root Cause Explainability In Microservice Applications
dc.type: Electronic Thesis or Dissertation

Files

Original bundle

Name: Rouf_Raphael_2025_MA.pdf
Size: 1.28 MB
Format: Adobe Portable Document Format
