Improving the Reliability of AI Infrastructure Software with Data-Driven Software Analytics
Abstract
Today, AI systems are increasingly deployed in safety-critical domains such as transportation, finance, and robotics. While AI offers many benefits that simplify daily life, its widespread adoption has also increased security threats, making the need for secure AI urgent. Failing to protect AI systems against these threats could have disastrous consequences. Like traditional software, AI applications are built on multiple layers: application and service, model, framework, library and compiler, and hardware.
In this thesis, we first conduct an empirical study to characterize and understand security weaknesses in AI frameworks. We identify Memory Leak (CWE-401) and Integer Overflow (CWE-190) as the two most prevalent bug types, with improper validation of tensor properties and poor memory management as their most common root causes. Next, we assess the effectiveness of five popular static analysis tools at identifying bugs in AI frameworks. Our study shows that these tools detect only a small fraction of the bugs; key limitations include missing support for AI-specific macros and APIs, tensor data types, and computation graphs. We then evaluate dynamic analysis techniques, specifically DL fuzz testing tools, on real-world bugs in AI frameworks. Our findings show that DL fuzzers detect only 6.5% (34 out of 517) of the unique bugs in our benchmark dataset, and we identify two main factors limiting their effectiveness.
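To make the API-level fuzzing idea concrete, the sketch below shows one common strategy: feeding boundary-value tensors to a framework API and flagging inputs that trigger anything other than a clean validation error. The helper fuzz_api and the edge-case list are illustrative assumptions for exposition, not code from the thesis or from any of the evaluated fuzzers; the example assumes TensorFlow as the target framework.

    # Minimal sketch of API-level DL fuzzing with boundary-value tensor inputs.
    # Hypothetical harness; assumes TensorFlow as the fuzzing target.
    import tensorflow as tf

    EDGE_CASES = [
        tf.zeros([0]),                             # empty tensor
        tf.zeros([1, 0, 3]),                       # zero-sized dimension
        tf.constant(float("nan")),                 # NaN scalar
        tf.constant(2**31 - 1, dtype=tf.int32),    # int32 boundary (CWE-190 territory)
    ]

    def fuzz_api(api):
        """Invoke `api` on each edge-case input and flag unexpected failures."""
        for x in EDGE_CASES:
            try:
                api(x)
            except (tf.errors.InvalidArgumentError, ValueError, TypeError):
                pass  # graceful rejection: the input check exists
            except Exception as exc:  # crash-style failure: a candidate bug
                print(f"potential bug on input {x!r}: {exc!r}")

    fuzz_api(tf.math.reduce_sum)  # example target API

A real DL fuzzer would generate many more input combinations and monitor for process crashes and sanitizer reports rather than only Python exceptions, but the core loop is the same.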
Based on these findings, we develop Orion, a novel API-level DL fuzzer that addresses the limitations of existing fuzzers and identifies new bugs in AI backend implementations. Our study confirms that most bugs stem from inadequate checks on tensor properties. In the final chapter, we characterize DL checker bugs and propose TensorGuard, a tool designed to detect and repair them. TensorGuard achieves an accuracy of 11.1%, surpassing the state-of-the-art bug repair baseline by 2%. We also test TensorGuard on six months of checker-related updates (493 changes) in Google’s JAX library, where it successfully detects 64 checker bugs.
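As an illustration of the checker-bug pattern described above, the hypothetical NumPy-based reshape below first omits a tensor-property check and then adds it. The function names and the repair are assumptions for exposition, not output of TensorGuard or code from JAX.

    # Illustrative checker bug and repair around tensor-property validation.
    # Hypothetical functions, not taken from any framework's codebase.
    import numpy as np

    def reshape_buggy(x: np.ndarray, new_shape: tuple) -> np.ndarray:
        # BUG: element counts are never compared, so excess data is silently
        # truncated; in a C++ kernel the same omission can read out of bounds.
        return x.ravel()[: int(np.prod(new_shape))].reshape(new_shape)

    def reshape_fixed(x: np.ndarray, new_shape: tuple) -> np.ndarray:
        # REPAIR: validate tensor properties up front, the pattern that
        # checker-bug fixes typically add.
        if any(d < 0 for d in new_shape):
            raise ValueError(f"negative dimension in shape {new_shape}")
        if x.size != int(np.prod(new_shape)):
            raise ValueError(f"cannot reshape {x.size} elements into {new_shape}")
        return x.reshape(new_shape)

A checker fix of this shape, validating dimensions and element counts before the computation, is the kind of change pattern that can be mined from a framework's commit history.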
Taken together, the findings from these five studies provide strong evidence that data-driven software analytics, applied to publicly available historical artifacts of AI frameworks such as code repositories and bug databases, holds great potential for advancing the reliability of AI infrastructure software.