Information Systems and Technology

Permanent URI for this collection

https://hdl.handle.net/10315/27588

Open Access
Improving User Sparse Query Interpretation Through Pseudo-Relevance Retrieval Methods
(2025-04-10) Pei, Quanli; Huang, Jimmy
Despite the rapid development of information retrieval technology, understanding sparse user queries remains a significant challenge. Users often enter short, ambiguous, or context-lacking queries when searching, making it difficult for retrieval systems to capture user intent. This thesis focuses on this critical issue and proposes three innovative models based on Pseudo-Relevance Feedback: CNRoc, CLRoc, and LLM-PRF, with the aim of enhancing the performance of retrieval systems. The CNRoc model enriches query expansion by incorporating external conceptual knowledge, enabling it to capture the subtle meanings of query terms and generate more semantically relevant expansion terms. The CLRoc model combines weak and strong relevance signals, utilizing Contrastive Learning to optimize document selection and enhance the alignment between user intent and result documents. The LLM-PRF model integrates a Large Language Model to improve the query representation capability of dense retrieval systems, further enhancing the understanding of user intent. Experimental results demonstrate that these models significantly outperform traditional methods on multiple evaluation metrics, providing effective solutions for handling sparse queries. Ultimately, this thesis lays the groundwork for future advancements in Information Retrieval, ensuring that users can more effectively retrieve the information they want and make informed decisions.
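For readers unfamiliar with pseudo-relevance feedback, the general idea can be sketched as follows. This is a minimal Rocchio-style illustration, not the CNRoc, CLRoc, or LLM-PRF models from the thesis; the function name `rocchio_expand`, the parameters `alpha` and `beta`, and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def rocchio_expand(query_terms, ranked_docs, k=2, n_terms=2, alpha=1.0, beta=0.75):
    """Expand a query with terms from the top-k ranked documents.

    The top-k documents are *assumed* relevant (the "pseudo" in
    pseudo-relevance feedback); no user judgments are involved.
    """
    # Original query terms start with weight alpha.
    weights = Counter({term: alpha for term in query_terms})
    top = ranked_docs[:k]
    for doc in top:
        # Add beta times the length-normalized term frequency,
        # averaged over the feedback documents.
        for term, tf in Counter(doc).items():
            weights[term] += beta * tf / (len(doc) * len(top))
    # Highest-weighted new terms first; ties broken alphabetically.
    candidates = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [t for t, _ in candidates if t not in query_terms][:n_terms]
    return list(query_terms) + expansion


# Hypothetical first-pass ranking for the ambiguous query "jaguar".
ranked = [
    ["jaguar", "cat", "habitat"],
    ["jaguar", "cat", "speed"],
    ["jaguar", "car", "dealer"],
]
expanded = rocchio_expand(["jaguar"], ranked, k=2, n_terms=1)
```

With the toy ranking above, the term "cat" occurs in both feedback documents and is the one added to the query, which nudges the second retrieval pass toward the animal sense of the word.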
Open Access
Instruction-Tuning For Chart Comprehension And Reasoning
(2025-04-10) Shah Mohammadi, Mehrad; Prince, Emmanuel Hoque
Charts provide visual representations of data and are widely used for analyzing information, addressing queries, and conveying insights to others. Various chart-related downstream tasks have emerged recently, such as question answering and summarization. A common strategy to solve these tasks is to fine-tune various models originally trained on vision-language tasks. However, such task-specific models are not capable of solving a wide range of chart-related tasks, constraining their real-world applicability. To overcome these challenges, we introduce ChartInstruct: a novel chart-specific vision-language instruction-following dataset comprising instructions generated with distinct charts. We then present two distinct systems for instruction tuning on such datasets: (1) an end-to-end model and (2) a pipeline model employing a two-step approach. Evaluation shows that our instruction-tuning approach supports a wide array of real-world chart comprehension and reasoning scenarios, thereby expanding the scope and applicability of our models to new kinds of tasks.
Open Access
Deconstructing And Restyling SVG Charts Using Large Language Models
(2025-04-10) Zaidi, Syed Muhammad Ali Raza; Hoque Prince, Enamul
SVG charts are very common on the Web; however, reusing, editing, and restyling them is very difficult. To facilitate this process, this thesis explores the challenges of extracting data and visual encodings from SVG chart images and restyling them based on user queries. We leverage large language models (LLMs) with few-shot prompting, enabling users to deconstruct and restyle existing Vega-Lite visualizations through natural language input. Our evaluation on 800 SVG charts and 250 natural language queries shows that our system accurately deconstructs 93.4% of the charts and successfully restyles charts for 38.6% of the queries. Finally, based on the above techniques, we develop a Chrome plugin tool that detects and deconstructs SVG charts from the web page and then restyles the charts based on user input.
Open Access
Proactive & Fine-Grained Monitoring For Microservice Call Chains In Cloud-Native Applications Through Latency Distribution Prediction
(2025-04-10) Hussain, Hamza; Litoiu, Marin
Modern cloud-native applications are distributed in nature and have their health monitored through multiple channels. In this study, we propose a single novel approach that leverages multi-channel monitoring data for fine-grained performance analysis, proactive anomaly prediction, and root-cause analysis in microservices-based applications. To this end, we employ Microservice Embeddings, Graph Neural Networks (GNN), and Gated Recurrent Units (GRU) to predict the latency distribution, as opposed to a single latency value, for individual calls within a microservice call chain, as well as the distribution of end-to-end latency. Thus, our approach enables deeper insights into system performance and targeted diagnostics for anomalies. We use several benchmark datasets containing anomalies and show that our approach performs consistently across the latency spectrum while outperforming baseline latency prediction approaches by about 6%. Lastly, we show that our approach can be efficiently used to automate the process of trace-based anomaly prediction and perform root-cause analysis.
Open Access
Stock Price Prediction Using Sentiment and Technical Analysis
(2025-04-10) Gao, Huaqi; Huang, Jimmy
With the rapid advancement of the economy, the stock market has garnered extensive attention in both business and academic fields. Due to the dynamic, unstable, and information-sensitive nature of the stock market, obtaining an accurate stock price prediction is extremely challenging. This study explores the integration of sentiment data from financial news headlines with historical stock data to predict stock prices. This research gathered historical price and trading volume data for the S&P 500 index, sourced from Yahoo Finance, along with 106,494 financial news titles obtained from Reuters. This dataset encompasses the period around the 2008 financial crisis, from Oct 20, 2006, to Nov 19, 2013. Empirical implementation of the proposed methodology revealed the substantial value of incorporating sentiment and historical information to enhance the accuracy of stock price prediction.
Open Access
Simulation Optimization Of Operating Room Schedules For Elective Orthopaedic Surgeries
(2025-04-10) Maltseva, Daria Victorovna; Chen, Stephen
The aim of this thesis was to solve the problem of scheduling elective surgeries in a multiple operating room setting with the goal of minimizing the amount of overtime incurred. Because surgical durations cannot always be perfectly estimated and vary by procedure and surgeon, we propose an approach that leverages the stochastic nature of surgical durations to simulate each operating day and estimate the probability of incurring overtime under a given schedule of surgeries. Of the three optimization techniques tested for strategically re-scheduling surgeries, two showed promising results, reducing the total number of overtime surgeries by 12-15%, equivalent to approximately 1h of total monthly overtime. This approach serves as a tool for improving schedules and supporting decision makers at any hospital dealing with elective surgeries. Our contribution involves introducing the simulation optimization model and describing the data-driven approach to analyzing the scheduling problem.
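The simulation step described in this abstract can be illustrated with a small Monte Carlo sketch. It is not the thesis's model: the lognormal duration distribution, the 480-minute operating day, the single operating room, and the names `overtime_probability`, `light`, and `heavy` are assumptions chosen only to show how an overtime probability can be estimated for a given schedule.

```python
import math
import random


def overtime_probability(schedule, day_minutes=480, trials=10_000, seed=0):
    """Estimate P(total surgery time > day_minutes) by simulation.

    schedule: list of (mu, sigma) pairs, the log-scale parameters of an
    assumed lognormal duration (in minutes) for each surgery of the day.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    overtime_days = 0
    for _ in range(trials):
        # Draw one duration per surgery and add them up for the day.
        total = sum(rng.lognormvariate(mu, sigma) for mu, sigma in schedule)
        if total > day_minutes:
            overtime_days += 1
    return overtime_days / trials


# Two hypothetical schedules: three surgeries with a 60-minute median,
# and five surgeries with a 120-minute median.
light = [(math.log(60), 0.3)] * 3
heavy = [(math.log(120), 0.3)] * 5

p_light = overtime_probability(light)
p_heavy = overtime_probability(heavy)
```

An optimizer can then compare candidate schedules by this estimate, for example by moving a surgery from a day with a high overtime probability to one with a low probability and re-running the simulation.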
Open Access
Comparative Analysis of Language Models on Augmented Low-Resource Datasets for Application in Question & Answering Systems
(2024-11-07) Ranjbargol, Seyedehsamaneh; Erechtchoukova, Marina G.
This thesis aims to advance natural language processing (NLP) in question-answering (QA) systems for low-resource domains. The research presents a comparative analysis of several pre-trained language models, highlighting their performance enhancements when fine-tuned with augmented data to address several critical questions, such as the effectiveness of synthetic data and the efficiency of data augmentation techniques for improving QA systems in specialized contexts. The study focuses on developing a hybrid QA framework that can be integrated with a cloud-based information system. This approach refines the functionality and applicability of QA systems, boosting their performance in low-resource settings by using targeted fine-tuning and advanced transformer models. The successful application of this method demonstrates the significant potential for specialized, AI-driven QA systems to adapt and thrive in specific environments.
Open Access
Improving Seafood Production Through Data Science Methods
(2024-11-07) Teimouri Lotfabadi, Bahareh; Khaiter, Peter A.
Global production of seafood has quadrupled over the past 50 years. Seafood production is characterized by one of the highest waste rates in the food industry, reaching up to 50% of the original raw material. Therefore, seafood companies are interested in reducing their waste rates, thus increasing production yields. In this thesis, we apply a Data Science (DS) methodology and suggest an extended DS framework to address theoretical and practical issues in the seafood industry. The framework encapsulates data processing, statistical, machine learning, visualization, and optimization capabilities. The research employs unique real-world data collected in a seafood production facility over a 2-year period. The study will contribute to the economic well-being of individual seafood producers, who could perform their business planning and forecasting in a more informed and predictive way, as well as to the overall sustainability of the seafood industry through waste rate reduction.
Open Access
Progressive Hierarchical Classification For Multi-Category Image Classification
(2024-10-28) Kuo, Te-Chuan; Chen, Stephen
This thesis evaluates a hierarchical classification model applied to the CIFAR-10 dataset, focusing on addressing the limitations of existing methods, which often struggle with (i) overlapping features and (ii) poor interpretability of classification decisions. The hierarchical model was implemented to mitigate these issues by refining classification through a multi-stage process that narrows the focus progressively. Our hierarchical approach has demonstrated its ability to focus on distinguishing features critical to specific classes and groups/pairs. Furthermore, the hierarchical models provide enhanced transparency over the baseline model by allowing a granular examination of classification performance across multiple stages.
Open Access
Data-Driven Causal Decision Support for Business Process Management
(2024-07-18) Jandaghi Alaee, Ali; Senderovich, Arik
Control-flow and resource assignment decisions influence business processes. Recorded process data can be used to identify which decisions are informed by data to predict their outcome, and to guide interventions as part of a what-if analysis. The latter requires causal models that explain decisions. Yet, existing methods are limited: they focus on control-flow decisions only, ignore potential confounders, and use ad-hoc methods to resolve causal conflicts. We fill this gap by introducing a causal decision modeling framework that uncovers confounding effects and captures resource decisions. Moreover, we provide a process-aware causal discovery algorithm that takes process precedence into account. In addition, we employ domain knowledge to include unobserved factors. We address the problem of identification, conduct interventional outcome prediction, and improve decision-making by acquiring unavailable data to maximize the utility of interventions. We demonstrate the feasibility of our approach through a set of experiments on synthetically generated and real-world datasets.
Open Access
Optimizing Data Compression via Data Reordering Strategies
(2024-07-18) Du, Qinxin; Yu, Xiaohui
To improve the efficiency and cost-effectiveness of handling large tabular datasets stored in databases, a range of data compression techniques are employed. Among these, dictionary-based compression methods such as LZ4, Gzip, and Zstandard are commonly utilized to decrease data size. However, while these traditional dictionary-based compression techniques can reduce data size to some degree, they are not able to identify the internal patterns within given datasets. Thus, there remains substantial potential for further data size reduction by identifying repetitive data patterns. This thesis proposes two novel approaches to improve tabular data compression performance. Both methods involve data preprocessing using an advanced data encoding technique called locality-sensitive hashing (LSH). One approach utilizes clustering for data reordering, while the other employs a heuristic-based solver for the Travelling Salesman Problem (TSP). The data encoding process enables the identification of internal repetitive patterns within the original datasets. Records with similar features are grouped together and compressed into a much smaller size after reordering. Furthermore, a novel table partitioning strategy based on the number of distinct values in each column is designed to further improve the compression ratio of the entire table. Extensive experiments are then conducted on one synthetic dataset and three real datasets to evaluate the performance of the proposed algorithms by varying parameters of interest. The data encoding and reordering methods show significant efficiency improvements, resulting in reduced data size and substantially increased data compression ratios.
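The effect that reordering has on a dictionary-based compressor can be seen in a small sketch. Plain lexicographic sorting is used here as a stand-in for the LSH-based clustering and TSP orderings proposed in the thesis, and zlib as a stand-in for the compressors it names; the synthetic rows and the helper `compressed_size` are assumptions made for the example.

```python
import random
import zlib


def compressed_size(rows):
    """Return the zlib-compressed size in bytes of rows joined by newlines."""
    return len(zlib.compress("\n".join(rows).encode("utf-8"), 9))


# Synthetic table: 2000 records drawn at random from 50 distinct rows,
# so similar records are scattered throughout the file.
rng = random.Random(0)
distinct = [f"region-{i:02d},product-{i * 7 % 50:02d},status-{i % 3}" for i in range(50)]
rows = [rng.choice(distinct) for _ in range(2000)]

# Reordering step: sorting places identical records next to each other,
# which lets the compressor encode them as long back-references.
reordered = sorted(rows)

size_original = compressed_size(rows)
size_reordered = compressed_size(reordered)
```

Both orderings contain exactly the same records; only their positions differ. Restoring the original order after decompression would need extra bookkeeping, such as a stored permutation, which this sketch leaves out.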
Open Access
Revolutionizing Time Series Data Preprocessing with a Novel Cycling Layer in Self-Attention Mechanisms
(2024-07-18) Chen, Jiyan; Yang, Zijiang
This thesis presents a novel method for improving time series data preprocessing by incorporating a cycling layer into self-attention mechanisms. Traditional techniques often struggle to capture the cyclical nature of time series data, impacting predictive model accuracy. By integrating a cycling layer, this thesis aims to enhance the ability of models to recognize and utilize cyclical patterns within datasets, exemplified by the Jena Climate dataset from the Max Planck Institute for Biogeochemistry. Empirical results demonstrate that the proposed method not only improves the accuracy of forecasts but also increases model fitting speed compared to conventional approaches. This thesis contributes to the advancement of time series analysis by offering a more effective preprocessing technique.
Open Access
Integrating Natural Language and Visualizations for Exploring Data on Smartwatch
(2024-07-18) Varadarajan, Kaavya; Prince, Enamul Hoque
Smartwatches are increasingly popular for collecting and exploring personal data, including health, stocks, and weather information. However, the use of micro-visualizations to present such data faces challenges due to limited screen size and interactivity. To address this problem, we propose integrating natural language (voice) with micro-visualizations (charts) to enhance user comprehension and insights. Leveraging a large language model like ChatGPT, we automatically summarize micro-visualizations and combine them with audio narrations and interactive visualizations to aid users in understanding the data. A user study with sixteen participants suggests that the combination of voice and charts results in superior accuracy, preference, and usefulness compared to presenting charts alone. This highlights the efficacy of integrating natural language with visualizations on smartwatches to improve user interaction and data comprehension.
Open Access
Machine learning algorithms for Long COVID effects detection
(2024-03-16) Ahuja, Harit; Litoiu, Marin; Sergio, Lauren
In the realm of the Internet of Things (IoT) and Machine Learning (ML), there is a growing demand for applications that can improve healthcare. By integrating sensors, cloud computing, and ML, we can create a powerful platform that enables insights into healthcare. Building upon these concepts, we propose a novel approach to address the widespread problem of long COVID. We utilize a wearable device to capture electroencephalogram (EEG) readings, which are then transformed through a set of processing steps into actionable decisions. Our methodology starts with data collection from a Cognitive-Motor Integration (CMI) task, followed by data preprocessing, feature engineering, and then the application of ML and advanced Deep Learning (DL) algorithms. To address challenges like data scarcity and privacy concerns, we generate synthetic data and train the same model on it as on the original data for comparative analysis. Our method was tested on real cases and achieved notable results: the CNN-LSTM model achieved 83% accuracy with original data and reached 93% using synthetic data.
Open Access
Unveiling the Complexities of Student Satisfaction in E-learning: An Integrated Framework for the Context of COVID-19
(2024-03-16) Lin, Rui; Huang, Jimmy
Amidst the global pandemic’s reshaping of education, our study investigates e-learning dynamics in Canadian higher education. Integrating the Technology Acceptance Model (TAM), the DeLone and McLean Information Systems Success Model (D&M ISS), and the Expectation Confirmation Model (ECM), we introduce the innovative C-RES framework. This framework, which stands for COVID-19 Remote E-learning System, uniquely addresses the complexities of e-learning systems and their role in student satisfaction during COVID-19. Through Structural Equation Modeling (SEM) analysis of responses from a diverse pool of graduate students across Canada, we uncover relationships among psychological factors, quality dimensions, and social influences. We demonstrate how self-efficacy, IT anxiety, and perceived system and information quality significantly influence students’ ease of use and usefulness perceptions, impacting their satisfaction and commitment to Learning Management Systems (LMS). Our findings reveal that e-learning quality lies not only in technology but also in content, and highlight the significant influence of individual confidence and community dynamics on student experiences. These insights provide actionable strategies for enhancing the effectiveness and resilience of e-learning systems, especially in crises. While focusing on the Canadian pandemic context, our research suggests exploring demographic influences in future studies. This thesis serves as a foundation for future e-learning explorations, pushing educational technology boundaries during global disruptions and offering key strategies for resilience and effectiveness in higher education.
Open Access
Exploratory Analysis of Water Quality in a Small Urbanized Watershed Using Deep Learning
(2023-12-08) Ofosu, Alfred; Erechtchoukova, Marina G.
Water is a life-sustaining resource for living organisms inside and outside water bodies. Natural waters serve as municipal and industrial water supplies, sources for agricultural irrigation, homes for aquatic ecosystems, recreation, and other essential uses. The quality of water determines its use. Therefore, it must be monitored, managed, and reported to help stakeholders in decision-making that can protect watershed ecosystems and improve measures to mitigate factors adversely affecting water bodies. Water quality is represented by a set of parameters that describe specific characteristics or properties of water. These parameters are determined by measuring water's physical and chemical characteristics and concentration levels of various substances in a water column with subsequent sample analysis in laboratories. This results in low frequencies of observations for water quality parameters compared to hydrometric and meteorological data. Frequencies of observation adopted by many water quality monitoring systems vary between 4 and 12 samples per year, which suggests applying modelling techniques to support decision-making. The study aims to develop a data-driven computational tool for water quality modelling in a small, highly urbanized watershed of the Don River, Ontario, Canada. The study focuses on major ions, namely, cations: calcium (Ca²⁺), magnesium (Mg²⁺), sodium (Na⁺), and potassium (K⁺), and anions such as bicarbonate (HCO₃⁻), carbonate (CO₃²⁻), chloride (Cl⁻), and sulphate (SO₄²⁻). These parameters are not affected significantly by the aquatic ecosystem. The hydrological and meteorological processes mainly determine their dynamics. The study uses data from different monitoring systems belonging to the Toronto and Region Conservation Authority (TRCA) and Environment and Climate Change Canada (ECCC). It consists of water quality parameters and hydrometric and meteorological characteristics observed in the watershed over 57 years.
Concentrations of selected water quality parameters are modelled using deep neural networks. The data pre-processing framework for cleansing and integrating data observed at different frequencies from different locations is developed. The framework is applied for the comparative analysis of neural networks of various configurations. Two sets of computational experiments were conducted. In the first set of experiments, integrated data from all monitoring stations in the watershed was fed into the deep learning algorithms to train a neural network to predict the concentration of major ions for the upcoming month (t+1). The second set of experiments uses upstream environmental parameters to train the model and predict the major ion concentrations in the lower subwatershed. The study investigates the performance of developed models in accurately predicting ion concentrations and provides insights into the relationship between environmental factors and water quality in the investigated watershed. The findings have practical applications for water resource management and pollution prevention efforts.
Open Access
Enhancing General Language Models for Biomedical Text Retrieval via Diversified Prior Knowledge
(2023-12-08) Huang, Yizheng; Huang, Jimmy
The thesis introduces the Diversified Prior Knowledge Enhanced General Language Model (DPK-GLM) to improve the efficacy of general language models in biomedical Information Retrieval (IR). General language models often struggle with biomedical data due to its specialized terminology and the need for precise matching. DPK-GLM tackles these challenges by integrating domain-specific knowledge, thereby enhancing the model's ability to understand and process biomedical information. The framework comprises three core components. The first, Knowledge-based Query Expansion, leverages authoritative biomedical databases to enrich search queries with domain-specific entities. The second, Aspect-based Filter, identifies documents that are highly relevant to the query. The third, Diversity-based Score Reweighting, re-ranks these filtered documents by combining similarity and diversity scores, yielding more accurate results. Experimental tests on public biomedical IR datasets confirm that DPK-GLM significantly improves retrieval performance.
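The third component, re-ranking by a combination of similarity and diversity scores, can be illustrated with a greedy sketch in the spirit of maximal marginal relevance. This is not the DPK-GLM scoring formula, which the abstract does not give; the weight `lam`, the Jaccard overlap used as a redundancy measure, and the toy candidates are assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard overlap between two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(candidates, lam=0.7):
    """Greedily order documents by similarity minus redundancy.

    candidates: dict mapping doc_id -> (similarity_to_query, set_of_terms).
    lam: weight on query similarity; (1 - lam) weights the penalty for
    overlapping with documents that were already selected.
    """
    selected = []
    remaining = set(candidates)

    def score(doc_id):
        sim, terms = candidates[doc_id]
        redundancy = max(
            (jaccard(terms, candidates[s][1]) for s in selected), default=0.0
        )
        return lam * sim - (1 - lam) * redundancy

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical by id).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical filtered documents: d1 and d2 cover the same aspect.
docs = {
    "d1": (0.90, {"gene", "brca1", "cancer"}),
    "d2": (0.85, {"gene", "brca1", "cancer"}),
    "d3": (0.80, {"protein", "folding"}),
}
diverse_order = rerank(docs, lam=0.5)
similarity_order = rerank(docs, lam=1.0)
```

With `lam=0.5` the near-duplicate d2 drops below d3 once d1 has been selected, while `lam=1.0` reduces to ranking by similarity alone.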
Open Access
Using Data Analytics and Machine Learning in Sustainable Forest Management from Remote Sensing Data
(2023-08-04) Sysoeva, Polina; Khaiter, Peter A.
Nowadays, remote sensing has become a widely used technique to acquire data for ecosystem service assessment (ESA) and other sustainable management practices. Remotely Sensed Data (RSD) is particularly crucial in locations where in situ observations are either limited or completely impossible due to their inaccessibility, such as mountainous areas. However, due to the unique features of the RSD, obtaining substantial insights requires specific preprocessing steps and strong computational algorithms, such as machine learning (ML). In this research, we present a methodology integrating RSD with data analytics and machine learning techniques for the needs of ESA. A pipeline for preprocessing EOS data, transforming it into features, and experimenting with tuning of the ML algorithms is developed. A practical application of the proposed approach is demonstrated through assessing the impact of extreme weather events on forest ecosystems and their carbon sequestration abilities in two areas of the Kashmir Valley, Jammu & Kashmir, India.
Open Access
Comparative Analysis of Transformer-Based Language Models for Text Analysis in the Domain of Sustainable Development
(2023-08-04) Safwat, Nabil; Erechtchoukova, Marina G.
With advancements in Artificial Intelligence, Natural Language Processing (NLP) has gained a lot of attention because of its potential to facilitate complex human-machine interactions, enhance language-based applications, and automate processing of unstructured texts. The study investigates the transfer learning approach on Transformer-based language models, the abstractive text summarization approach, and their application to the domain of Sustainable Development, with the goal of determining how SDGs are represented in scientific publications using the text summarization technique. To achieve this, the traditional transfer learning framework was expanded so that: (1) the relevance of textual documents to a specified text can be evaluated, (2) neural language models, namely BART and T5, were selected, and (3) 8 text similarity measures were investigated to identify the most informative ones. Both the BART and T5 models were fine-tuned on an acquired domain-specific corpus of scientific publications extracted from the Elsevier Scopus database. The relevance of recently published works to an SDG was determined by calculating semantic similarity scores between each model-generated summary and the SDG’s description. The proposed framework made it possible to identify goals that dominated the developed corpus and those that require further attention from the research community.
Open Access
Dynamic Elastic Provisioning For NFV-Enabled 5G Networks Using Machine Learning
(2023-03-28) Ali, Khalid; Jammal, Manar
5G networks are expected to support a variety of services and applications with more stringent latency, reliability, and bandwidth requirements than previous generations. To meet these requirements, the Open Radio Access Network (O-RAN) has been proposed. The O-RAN Alliance assumes O-RAN components to be Virtualized Network Functions (VNFs). Furthermore, O-RAN allows employing Machine Learning (ML) solutions to tackle challenges in resource management. However, intelligently managing resources for O-RAN can prove challenging. Network providers need to dynamically scale resources in response to incoming traffic. Elastically allocating resources provides higher flexibility, reduces OPerational EXpenditure (OPEX), and increases resource utilization. In this work, we propose and evaluate an elastic VNF orchestration framework for O-RAN. The proposed system consists of a traffic forecasting-based dynamic scaling scheme using ML, and a Reinforcement Learning (RL) based VNF placement policy. The models are evaluated based on their predictive capabilities subject to all Service-Level Agreements.