Information Systems and Technology
Permanent URI for this collection: https://hdl.handle.net/10315/27588
Recent Submissions
Item type: Item, Access status: Open Access
Characterizing Osteosarcopenia In Spinal Metastases Patients Undergoing Stereotactic Body Radiotherapy (SBRT): Leveraging Deep Learning For Improved Outcome Prediction (2025-07-23)
Castano Sainz, Yessica Caridad; Chen, Stephen

Stereotactic body radiotherapy (SBRT) is commonly used to treat spinal metastases, offering excellent local control and pain relief. However, it carries an average 14% risk of vertebral compression fractures (VCFs), and despite growing evidence linking osteosarcopenia to adverse clinical outcomes, musculoskeletal health is not routinely assessed during SBRT treatment planning. This thesis introduces a fully automated pipeline for extracting musculoskeletal biomarkers from CT, combining deep learning–based segmentation with vertebral landmark–guided cropping and volumetric analysis. Sarcopenia thresholds were derived for volumetric indices using height-based and vertebral-based normalization, guided by established literature cutoffs for the Psoas Muscle Index (PMI). Osteoporosis was defined using trabecular bone density. In this SBRT cohort, 58% of patients met criteria for osteoporosis, 45% for sarcopenia, and 31% for osteosarcopenia. In multivariable logistic regression analyses, significant associations between fracture risk and both osteoporosis and lower psoas muscle density were observed in specific models, warranting further investigation. Additionally, categorical definitions of sarcopenia and osteosarcopenia were significantly associated with reduced overall survival.
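The normalization and classification logic described above can be sketched minimally. This is an illustration only: the cutoff values below are placeholder assumptions, not the thresholds derived in the thesis, and the function names are hypothetical.

```python
# Illustrative sketch: height-normalized Psoas Muscle Index (PMI) and a
# simple osteosarcopenia flag. Cutoffs here are placeholders, not the
# thesis-derived thresholds.
def psoas_muscle_index(psoas_area_cm2: float, height_m: float) -> float:
    """PMI = psoas cross-sectional area normalized by height squared."""
    return psoas_area_cm2 / height_m ** 2

def classify(pmi: float, trabecular_hu: float,
             pmi_cutoff: float = 5.0, bone_cutoff_hu: float = 110.0) -> str:
    """Combine sarcopenia (low PMI) and osteoporosis (low trabecular density)."""
    sarcopenic = pmi < pmi_cutoff
    osteoporotic = trabecular_hu < bone_cutoff_hu
    if sarcopenic and osteoporotic:
        return "osteosarcopenia"
    if sarcopenic:
        return "sarcopenia"
    if osteoporotic:
        return "osteoporosis"
    return "normal"

print(classify(psoas_muscle_index(12.0, 1.70), 95.0))  # → osteosarcopenia
```

The same pattern extends to vertebral-based normalization by swapping the denominator.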
The pipeline was extended to MRI using CT-based segmentations as weak labels for training nnU-Net models, achieving high segmentation accuracy and supporting future radiation-free musculoskeletal biomarker assessment.

Item type: Item, Access status: Open Access
GRASP: A Graph-Based SLA Breach Prediction Framework at the Service Level in Neural Inference (2025-07-23)
Fehresti, Sara; Litoiu, Marin

Cloud computing, as the backbone of modern adaptive software architectures, has revolutionized data storage and processing, driven by the power and flexibility of microservices. Despite their advantages in fault isolation and flexible deployment, microservices often experience unpredictable latency spikes, leading to costly Service Level Agreement (SLA) violations. This thesis introduces a multiscale time-spectrum framework called GRASP (Graph-based SLA Breach Prediction). It leverages time-series data, sequence processing, and graph-based modeling to proactively detect performance anomalies and predict SLA breaches in microservice-based systems within upcoming time windows. In our framework, raw data is transformed into graph representations and fed into deep learning models to capture both topological and temporal characteristics. By combining graph analysis with sequential modeling, our dual approach not only identifies critical service dependencies but also pinpoints potential end-to-end bottlenecks.
Evaluations on microservice datasets demonstrate its superiority over baseline methods in issuing early warnings, forecasting breaches, and localizing root causes at the service level, underscoring its potential to enhance the reliability and efficiency of microservice-based applications in cloud environments.

Item type: Item, Access status: Open Access
Service-Level Prediction And Anomaly Detection Towards System Failure Root Cause Explainability In Microservice Applications (2025-07-23)
Rouf, Raphael; Litoiu, Marin

Microservice and cloud computing operations are increasingly adopting automation. The importance of models in fostering resilient and efficient adaptive architectures is central to ensuring that services operate with expected behavior and performance. To effectively predict, detect, and explain system failures, it is essential to develop a comprehensive understanding of the affected application. This means examining its unexpected behaviors from multiple perspectives, including logs, metrics, and dependencies, to uncover the underlying root causes. This thesis presents a novel approach to system failure prediction, root cause analysis, and explainable failure type analysis by leveraging a three-fold modality of IT observability data: logs, metrics, and traces. The proposed methodology integrates Graph Neural Networks (GNN) to capture spatial information and Gated Recurrent Units (GRU) to encapsulate the temporal aspects within the data. A key emphasis lies in utilizing a stitched representation derived from logs, microservice events, and resource metrics to predict system failures proactively. The traces are aggregated to construct a comprehensive service call flow graph, represented as a dynamic graph. Furthermore, permutation testing is applied to harness node scores, aiding in the identification of the root causes behind these failures.
We evaluate our approach on the open-source MicroSS, QOTD, and Train Ticket datasets, which capture various types of system faults such as resource overload and wrong-manipulation faults. Our findings on real-world cases demonstrate that the proposed three-fold modality of observability data, combined with enhanced preprocessing and a GNN-GRU plus gradient-distance explanation model, predicts failure types well and explains them effectively, helping engineers debug and diagnose issues.

Item type: Item, Access status: Open Access
Application of Remote Sensing and Machine Learning in Vegetation Phenology and Climate Change Studies (2025-07-23)
Suleman, Masooma Ali Raza; Khaiter, Peter A.

Remote sensing and machine learning (ML) have revolutionized phenology studies by offering scalable and automated methods for monitoring vegetation growth patterns. Traditional phenology detection methods, which rely on field observations, are often labor-intensive and geographically constrained. This thesis introduces a novel deep learning model, the Temporal Multivariate Attention Network (TMANet), which integrates remote sensing data, climate indices, and ground observations to enhance phenological stage detection in crops. Focusing on corn phenology, the study explores how remote sensing data preprocessing optimizes its utility for phenology applications, how ML techniques improve detection accuracy, and how TMANet outperforms traditional models in capturing temporal and environmental dependencies. The proposed framework provides a robust, data-driven approach to understanding vegetation responses to climate variability, supporting sustainable agricultural management.
The findings contribute to advancing phenology research by offering a scalable and efficient methodology for monitoring crop development and assessing climate change impacts on vegetation phenology.

Item type: Item, Access status: Open Access
XAI-Driven Malicious Encrypted Traffic Detection and Characterization to Enhance Information Security (2025-07-23)
Sharma, Adit; Habibi Lashkari, Arash

Securing information through encryption is essential in data communication, but to effectively detect malicious activities, it is crucial to distinguish between encrypted and non-encrypted traffic. Traditional encrypted traffic classification methods, including rule-based systems and conventional machine learning approaches, often struggle with scalability, generalization, and class imbalance, leading to suboptimal classification performance. This study introduces a novel hybrid model for encrypted traffic classification, integrating Multi-Head Attention mechanisms for feature enhancement and LightGBM as the final classifier. The proposed model follows a two-step classification process: first, performing binary classification to separate encrypted from non-encrypted traffic, and second, applying multi-class classification to categorize encrypted traffic into Tor, VPN, I2P, ZeroNet, and Freenet. To improve model interpretability, SHAP is employed to validate the importance of attention-based features, while LIME provides insights into misclassified instances, enabling adjustments such as weight-threshold tuning and handling of class imbalances. Furthermore, this study incorporates a refined dataset preprocessing pipeline, leveraging NTL Flowlyzer, an advanced traffic analyzer that extracts over 400 features, including entropy-based attributes. To address class imbalance, strategic adjustments such as SMOTE augmentation for Freenet and class-specific threshold tuning were applied based on SHAP and LIME insights, resulting in improved classification performance.
The experimental evaluation demonstrates that the proposed hybrid model outperforms existing approaches in accuracy, precision, and recall while maintaining efficiency in both time and computational complexity. By integrating explainable AI techniques and adaptive optimization strategies, our approach enhances classification performance and improves the transparency and interpretability of encrypted traffic detection. These findings contribute to advancing cybersecurity by enabling more robust and interpretable encrypted traffic classification models.

Item type: Item, Access status: Open Access
Improving User Sparse Query Interpretation Through Pseudo-Relevance Retrieval Methods (2025-04-10)
Pei, Quanli; Huang, Jimmy

Despite the rapid development of information retrieval technology, understanding sparse user queries remains a significant challenge. Users often input short, ambiguous, or context-lacking queries when searching, making it difficult for retrieval systems to capture user intent. This thesis focuses on this critical issue and proposes three innovative models based on Pseudo-Relevance Feedback: CNRoc, CLRoc, and LLM-PRF, with the aim of enhancing the performance of retrieval systems. The CNRoc model enriches query expansions by incorporating external conceptual knowledge, enabling it to capture the subtle meanings of query terms and generate more semantically relevant expansion terms. The CLRoc model combines weak and strong relevance signals, using Contrastive Learning to optimize document selection and improve the alignment between user intent and retrieved documents. The LLM-PRF model integrates Large Language Models to improve the query representation capability of dense retrieval systems, further enhancing the understanding of user intent. Experimental results demonstrate that these models significantly outperform traditional methods on multiple evaluation metrics, providing effective solutions for handling sparse queries.
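For context, the classic Rocchio scheme, a standard pseudo-relevance feedback baseline, can be sketched as follows. This illustrates PRF generally, not CNRoc, CLRoc, or LLM-PRF themselves; the toy vectors and weights are assumptions.

```python
# Minimal classic Rocchio pseudo-relevance feedback over bag-of-words
# vectors represented as dicts: move the query toward the centroid of
# the top-ranked (pseudo-relevant) documents.
def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75):
    """Return alpha * query + beta * mean(feedback_docs) as a term-weight dict."""
    expanded = {term: alpha * w for term, w in query.items()}
    for doc in feedback_docs:
        for term, weight in doc.items():
            expanded[term] = expanded.get(term, 0.0) + beta * weight / len(feedback_docs)
    return expanded

query = {"jaguar": 1.0}
top_docs = [{"jaguar": 0.5, "cat": 0.4}, {"jaguar": 0.6, "wildlife": 0.3}]
print(rocchio_expand(query, top_docs))
```

The expanded vector now carries terms like "cat" and "wildlife" that never appeared in the original one-word query, which is exactly the sparsity problem PRF targets.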
Ultimately, this thesis lays the groundwork for future advancements in Information Retrieval, ensuring that users can more effectively retrieve the information they want and make informed decisions.

Item type: Item, Access status: Open Access
Instruction-Tuning For Chart Comprehension And Reasoning (2025-04-10)
Shah Mohammadi, Mehrad; Prince, Emmanuel Hoque

Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision–language tasks. However, such task-specific models are not capable of solving a wide range of chart-related tasks, constraining their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision–language instruction-following dataset comprising instructions generated with distinct charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model and (2) a pipeline model employing a two-step approach. Evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks.

Item type: Item, Access status: Open Access
Deconstructing And Restyling SVG Charts Using Large Language Models (2025-04-10)
Zaidi, Syed Muhammad Ali Raza; Hoque Prince, Enamul

SVG charts are very common on the Web; however, reusing, editing, and restyling these charts is difficult. To facilitate this process, this thesis explores the challenges of extracting data and visual encodings from SVG chart images and restyling them based on user queries.
We leverage large language models (LLMs) to facilitate this process using few-shot prompting, enabling users to deconstruct and restyle existing Vega-Lite visualizations through natural language input. Our evaluation on 800 SVG charts and 250 natural language queries shows that our system accurately deconstructs 93.4% of charts and successfully restyles 38.6% of queries. Finally, based on the above techniques, we develop a Chrome plugin that detects and deconstructs SVG charts on a web page and then restyles them based on user input.

Item type: Item, Access status: Open Access
Proactive & Fine-Grained Monitoring For Microservice Call Chains In Cloud-Native Applications Through Latency Distribution Prediction (2025-04-10)
Hussain, Hamza; Litoiu, Marin

Modern cloud-native applications are distributed in nature and have their health monitored through multiple channels. In this study, we propose a novel approach that leverages multi-channel monitoring data for fine-grained performance analysis, proactive anomaly prediction, and root-cause analysis in microservice-based applications. To this end, we employ microservice embeddings, Graph Neural Networks (GNN), and Gated Recurrent Units (GRU) to predict the latency distribution, as opposed to a single latency value, for individual calls within a microservice call chain, as well as the distribution of end-to-end latency. Thus, our approach enables deeper insights into system performance and targeted diagnostics for anomalies. We use several benchmark datasets containing anomalies and show that our approach performs consistently across the latency spectrum while outperforming baseline latency prediction approaches by about 6%.
Lastly, we show that our approach can be used efficiently to automate trace-based anomaly prediction and to perform root-cause analysis.

Item type: Item, Access status: Open Access
Stock Price Prediction Using Sentiment and Technical Analysis (2025-04-10)
Gao, Huaqi; Huang, Jimmy

With the rapid advancement of the economy, the stock market has garnered extensive attention in both business and academic fields. Due to the dynamic, unstable, and information-sensitive nature of the stock market, obtaining accurate stock price predictions is extremely challenging. This study explores the integration of sentiment data from financial news headlines with historical stock data to predict stock prices. The research gathered historical price and trading volume data for the S&P 500 index, sourced from Yahoo Finance, along with 106,494 financial news titles obtained from Reuters. This dataset encompasses the period around the 2008 financial crisis, from Oct 20, 2006, to Nov 19, 2013. Empirical implementation of the proposed methodology revealed the substantial value of incorporating sentiment and historical information to enhance the accuracy of stock price prediction.

Item type: Item, Access status: Open Access
Simulation Optimization Of Operating Room Schedules For Elective Orthopaedic Surgeries (2025-04-10)
Maltseva, Daria Victorovna; Chen, Stephen

The aim of this thesis was to solve the problem of scheduling elective surgeries in a multiple-operating-room setting with the goal of minimizing the amount of overtime incurred. While surgical durations cannot always be perfectly estimated and vary by procedure and surgeon, we propose an approach that leverages the stochastic nature of surgical durations to simulate each operating day and estimate the probability of incurring overtime under a given schedule of surgeries.
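The simulation idea, sampling stochastic durations to estimate the chance a day runs over, can be sketched with the standard library. The duration parameters, the truncated-normal model, and the 480-minute day are illustrative assumptions, not the thesis's calibrated inputs.

```python
import random

# Monte Carlo sketch of one operating day: sample stochastic surgical
# durations and estimate the probability the schedule incurs overtime.
def overtime_probability(surgeries, day_minutes=480, trials=10_000, seed=42):
    """surgeries: list of (mean, sigma) normal duration parameters in
    minutes, truncated at zero. Returns the fraction of simulated days
    whose total duration exceeds day_minutes."""
    rng = random.Random(seed)
    over = 0
    for _ in range(trials):
        total = sum(max(0.0, rng.gauss(mean, sigma)) for mean, sigma in surgeries)
        if total > day_minutes:
            over += 1
    return over / trials

schedule = [(120, 30), (90, 20), (150, 40), (60, 15)]  # four elective cases
print(f"P(overtime) = {overtime_probability(schedule):.2f}")
```

An optimizer can then re-order or re-assign cases and re-run the same estimator to compare candidate schedules.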
Through experimentation with three optimization techniques that strategically re-schedule surgeries, two showed promising results, reducing the total number of overtime surgeries by 12–15%, equivalent to approximately one hour of total monthly overtime. This approach serves as a tool for improving schedules and supporting decision makers at any hospital dealing with elective surgeries. Our contribution involves introducing the simulation optimization model and describing the data-driven approach to analyzing the scheduling problem.

Item type: Item, Access status: Open Access
Comparative Analysis of Language Models on Augmented Low-Resource Datasets for Application in Question & Answering Systems (2024-11-07)
Ranjbargol, Seyedehsamaneh; Erechtchoukova, Marina G.

This thesis aims to advance natural language processing (NLP) in question-answering (QA) systems for low-resource domains. The research presents a comparative analysis of several pre-trained language models, highlighting their performance enhancements when fine-tuned with augmented data, to address several critical questions, such as the effectiveness of synthetic data and the efficiency of data augmentation techniques for improving QA systems in specialized contexts. The study focuses on developing a hybrid QA framework that can be integrated with a cloud-based information system. This approach refines the functionality and applicability of QA systems, boosting their performance in low-resource settings through targeted fine-tuning and advanced transformer models. The successful application of this method demonstrates the significant potential for specialized, AI-driven QA systems to adapt and thrive in specific environments.

Item type: Item, Access status: Open Access
Improving Seafood Production Through Data Science Methods (2024-11-07)
Teimouri Lotfabadi, Bahareh; Khaiter, Peter A.

Global production of seafood has quadrupled over the past 50 years.
Seafood production is characterized by one of the highest waste rates in the food industry, reaching up to 50% of the original raw material. Therefore, seafood companies are interested in reducing their waste rates and thereby increasing production yields. In this thesis, we apply a Data Science (DS) methodology and suggest an extended DS framework to address theoretical and practical issues in the seafood industry. The framework encapsulates data processing, statistical, machine learning, visualization, and optimization capabilities. The research employs unique real-world data collected in a seafood production facility over a two-year period. The study will contribute to the economic well-being of individual seafood producers, who could perform their business planning and forecasting in a more informed and predictive way, as well as to the overall sustainability of the seafood industry through waste rate reduction.

Item type: Item, Access status: Open Access
Progressive Hierarchical Classification For Multi-Category Image Classification (2024-10-28)
Kuo, Te-Chuan; Chen, Stephen

This thesis evaluates a hierarchical classification model applied to the CIFAR-10 dataset, focusing on addressing the limitations of existing methods, which often struggle with (i) overlapping features and (ii) poor interpretability of classification decisions. The hierarchical model was implemented to mitigate these issues by refining classification through a multi-stage process that progressively narrows the focus. Our hierarchical approach has demonstrated its ability to focus on the distinguishing features critical to specific classes and groups/pairs.
Furthermore, the hierarchical models provide enhanced transparency over the baseline model by allowing a granular examination of classification performance across multiple stages.

Item type: Item, Access status: Open Access
Data-Driven Causal Decision Support for Business Process Management (2024-07-18)
Jandaghi Alaee, Ali; Senderovich, Arik

Control-flow and resource assignment decisions influence business processes. Recorded process data can be used to identify which decisions are informed by data, to predict their outcomes, and to guide interventions as part of a what-if analysis. The latter requires causal models that explain decisions. Yet, existing methods are limited: they focus on control-flow decisions only, ignore potential confounders, and use ad-hoc methods to resolve causal conflicts. We fill this gap by introducing a causal decision modeling framework that uncovers confounding effects and captures resource decisions. Moreover, we provide a process-aware causal discovery algorithm that takes process precedence into account. In addition, we employ domain knowledge to include unobserved factors. We address the problem of identification, conduct interventional outcome prediction, and improve decision-making by acquiring unavailable data to maximize the utility of interventions. We demonstrate the feasibility of our approach through a set of experiments on synthetically generated and real-world datasets.

Item type: Item, Access status: Open Access
Optimizing Data Compression via Data Reordering Strategies (2024-07-18)
Du, Qinxin; Yu, Xiaohui

To improve the efficiency and cost-effectiveness of handling large tabular datasets stored in databases, a range of data compression techniques are employed. Among these, dictionary-based compression methods such as LZ4, Gzip, and Zstandard are commonly utilized to decrease data size.
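A minimal standard-library illustration of why the placement of similar records matters to such dictionary coders: zlib's DEFLATE only reuses repeats within a 32 KB window, so similar chunks placed far apart compress as if they were unrelated. The synthetic data below is an assumption for demonstration, not the thesis's datasets.

```python
import random
import zlib

# Build two distinct 20 KB pseudo-random text chunks ("record groups").
rng = random.Random(0)
def chunk():
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(20_000))

a, b = chunk(), chunk()
interleaved = (a + b + a + b).encode()  # repeats 40 KB apart: outside the window
grouped = (a + a + b + b).encode()      # repeats adjacent: inside the window

size_i = len(zlib.compress(interleaved))
size_g = len(zlib.compress(grouped))
print(size_g, "<", size_i)  # grouped compresses markedly smaller
```

Reordering rows so that similar ones are adjacent, as the LSH-based approaches below do at record granularity, exploits exactly this effect.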
However, while these traditional dictionary-based compression techniques can reduce data size to some degree, they are not able to identify the internal patterns within given datasets. Thus, there remains substantial potential for further data size reduction by identifying repetitive data patterns. This thesis proposes two novel approaches to improve tabular data compression performance. Both methods involve data preprocessing using an advanced data encoding technique called locality-sensitive hashing (LSH). One approach utilizes clustering for data reordering, while the other employs a heuristic-based solver for the Travelling Salesman Problem (TSP). The data encoding process enables the identification of internal repetitive patterns within the original datasets. Records with similar features are grouped together and, after reordering, compress to a much smaller size. Furthermore, a novel table partitioning strategy based on the number of distinct values in each column is designed to further improve the compression ratio of the entire table. Extensive experiments are then conducted on one synthetic dataset and three real datasets to evaluate the performance of the proposed algorithms by varying parameters of interest. The data encoding and reordering methods show significant efficiency improvements, resulting in reduced data size and substantially increased compression ratios.

Item type: Item, Access status: Open Access
Revolutionizing Time Series Data Preprocessing with a Novel Cycling Layer in Self-Attention Mechanisms (2024-07-18)
Chen, Jiyan; Yang, Zijiang

This thesis presents a novel method for improving time series data preprocessing by incorporating a cycling layer into self-attention mechanisms. Traditional techniques often struggle to capture the cyclical nature of time series data, impacting predictive model accuracy.
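For context, a common baseline for exposing cyclical time structure to a model is sin/cos encoding, which maps a periodic timestamp onto the unit circle so that, for example, hour 23 sits next to hour 0. This is a standard illustration of the problem being addressed, not the cycling layer proposed in this thesis.

```python
import math

# Standard cyclical feature encoding: project a periodic value onto the
# unit circle so the model sees that the period wraps around.
def cyclical_encode(value: float, period: float) -> tuple[float, float]:
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

for hour in (0, 6, 23):
    s, c = cyclical_encode(hour, 24)
    print(f"hour {hour:2d} -> ({s:+.2f}, {c:+.2f})")
```

With a raw hour feature, 23 and 0 are maximally distant; after this encoding they are one step apart, which is the kind of wrap-around structure the cycling layer is designed to capture inside the model itself.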
By integrating a cycling layer, this thesis aims to enhance the ability of models to recognize and utilize cyclical patterns within datasets, exemplified by the Jena Climate dataset from the Max Planck Institute for Biogeochemistry. Empirical results demonstrate that the proposed method not only improves forecast accuracy but also increases model fitting speed compared to conventional approaches. This thesis contributes to the advancement of time series analysis by offering a more effective preprocessing technique.

Item type: Item, Access status: Open Access
Integrating Natural Language and Visualizations for Exploring Data on Smartwatch (2024-07-18)
Varadarajan, Kaavya; Prince, Enamul Hoque

Smartwatches are increasingly popular for collecting and exploring personal data, including health, stock, and weather information. However, the use of micro-visualizations to present such data faces challenges due to limited screen size and interactivity. To address this problem, we propose integrating natural language (voice) with micro-visualizations (charts) to enhance user comprehension and insights. Leveraging a large language model like ChatGPT, we automatically summarize micro-visualizations and combine them with audio narrations and interactive visualizations to aid users in understanding the data. A user study with sixteen participants suggests that the combination of voice and charts results in superior accuracy, preference, and usefulness compared to presenting charts alone. This highlights the efficacy of integrating natural language with visualizations on smartwatches to improve user interaction and data comprehension.

Item type: Item, Access status: Open Access
Machine learning algorithms for Long COVID effects detection (2024-03-16)
Ahuja, Harit; Litoiu, Marin; Sergio, Lauren

In the realm of the Internet of Things (IoT) and Machine Learning (ML), there is a growing demand for applications that can improve healthcare.
By integrating sensors, cloud computing, and ML, we can create a powerful platform that enables insights into healthcare. Building upon these concepts, we propose a novel approach to address the widespread problem of long COVID. We utilize a wearable device to capture electroencephalogram (EEG) readings, which are then transformed through a set of processing steps into actionable decisions. Our methodology initiates data collection from a Cognitive-Motor Integration (CMI) task, followed by data preprocessing, feature engineering, and the application of ML and advanced Deep Learning (DL) algorithms. To address challenges like data scarcity and privacy concerns, we generate synthetic data and train the same model on it as on the original data for comparative analysis. Our method was tested on real cases and achieved prominent results: the CNN-LSTM model achieved 83% accuracy with the original data, surging to 93% with synthetic data.

Item type: Item, Access status: Open Access
Unveiling the Complexities of Student Satisfaction in E-learning: An Integrated Framework for the Context of COVID-19 (2024-03-16)
Lin, Rui; Huang, Jimmy

Amidst the global pandemic's reshaping of education, our study investigates e-learning dynamics in Canadian higher education. Integrating the Technology Acceptance Model (TAM), the DeLone and McLean Information Systems Success Model (D&M ISS), and the Expectation Confirmation Model (ECM), we introduce the innovative C-RES framework. This framework, which stands for COVID-19 Remote E-learning System, uniquely addresses the complexities of e-learning systems and their role in student satisfaction during COVID-19. Through Structural Equation Modeling (SEM) analysis of responses from a diverse pool of graduate students across Canada, we uncover relationships among psychological factors, quality dimensions, and social influences.
We demonstrate how self-efficacy, IT anxiety, and perceived system and information quality significantly influence students’ ease of use and usefulness perceptions, impacting their satisfaction and commitment to Learning Management Systems (LMS). Our findings reveal that e-learning quality lies not only in technology but also in content, and highlight the significant influence of individual confidence and community dynamics on student experiences. These insights provide actionable strategies for enhancing the effectiveness and resilience of e-learning systems, especially in crises. While focusing on the Canadian pandemic context, our research suggests exploring demographic influences in future studies. This thesis serves as a foundation for future e-learning explorations, pushing educational technology boundaries during global disruptions and offering key strategies for resilience and effectiveness in higher education.