Litoiu, Marin
Rouf, Raphael
2025-07-23
2025-07-23
2025-04-22
2025-07-23
https://hdl.handle.net/10315/43032
Microservice and cloud computing operations are increasingly adopting automation. Models play a central role in building resilient and efficient adaptive architectures, ensuring that services operate with the expected behavior and performance. To effectively predict, detect, and explain system failures, it is essential to develop a comprehensive understanding of the affected application. This means examining its unexpected behaviors from multiple perspectives, including logs, metrics, and dependencies, to uncover the underlying root causes. This thesis presents a novel approach to system failure prediction, root cause analysis, and explainable failure type analysis by leveraging a three-fold modality of IT observability data: logs, metrics, and traces. The proposed methodology integrates Graph Neural Networks (GNN) to capture spatial information and Gated Recurrent Units (GRU) to capture the temporal aspects of the data. A key emphasis lies in using a stitched representation derived from logs, microservice events, and resource metrics to predict system failures proactively. The traces are aggregated to construct a comprehensive service call flow graph, which is represented as a dynamic graph. Furthermore, permutation testing is applied to derive node scores, which aid in identifying the root causes behind these failures. We evaluate our approach on three open-source datasets, MicroSS, QOTD, and Train Ticket, which capture various types of system faults, such as resource overload and wrong manipulation faults.
Our findings on real-world cases demonstrate that the proposed approach, which combines the three-fold modality of observability data with enhanced preprocessing, a GNN-GRU model, and a gradient distance explanation model, predicts failure types well and explains them effectively, helping engineers debug and diagnose issues.
Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
Information technology
Service-Level Prediction And Anomaly Detection Towards System Failure Root Cause Explainability In Microservice Applications
Electronic Thesis or Dissertation
2025-07-23
Microservices
Multimodal data
Graph neural networks
Gated recurrent units
Temporal dynamics
Spatial features
Anomaly detection
System failure prediction
System failure localization
System failure root cause analysis
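
The abstract states that traces are aggregated into a service call flow graph and represented as a dynamic graph. The record does not give the thesis's actual pipeline, so the following is only a minimal sketch of that general idea under stated assumptions: the span tuple layout, the service names, the 60-second window, and the function name `build_dynamic_call_graph` are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical span records (assumed layout, not from the thesis):
# (trace_id, span_id, parent_span_id, service, start_time_seconds)
SPANS = [
    ("t1", "a", None, "frontend", 0.0),
    ("t1", "b", "a", "orders", 0.2),
    ("t1", "c", "b", "db", 0.4),
    ("t2", "d", None, "frontend", 61.0),
    ("t2", "e", "d", "orders", 61.3),
    ("t2", "f", "d", "payments", 61.5),
]


def build_dynamic_call_graph(spans, window_s=60.0):
    """Return {window_index: Counter({(caller, callee): call_count})}.

    Each window is one snapshot of the service call graph, so the
    sequence of snapshots forms a simple dynamic graph.
    """
    # Map each span to its service so a child span can look up its caller.
    service_of = {(t, s): svc for t, s, _, svc, _ in spans}
    snapshots = defaultdict(Counter)
    for trace_id, span_id, parent_id, service, start in spans:
        if parent_id is None:
            continue  # a root span has no caller, so it adds no edge
        caller = service_of.get((trace_id, parent_id))
        if caller is None:
            continue  # parent span missing from the sample; skip the edge
        window = int(start // window_s)
        snapshots[window][(caller, service)] += 1
    return dict(snapshots)


graph = build_dynamic_call_graph(SPANS)
for window, edges in sorted(graph.items()):
    print(window, dict(edges))
```

In a setup like the one the abstract describes, each snapshot's edges would presumably feed the spatial (GNN) component while the sequence of windows feeds the temporal (GRU) component; that mapping is an assumption here, not something this record confirms.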