Electrical Engineering and Computer Science
Browsing Electrical Engineering and Computer Science by Subject "Artificial intelligence"
Now showing 1 - 7 of 7
Item Open Access
A Cloud-Based Extensible Avatar For Human Robot Interaction (2019-07-02)
AlTarawneh, Enas Khaled Ahm; Jenkin, Michael
Adding an interactive avatar to a human-robot interface requires the development of tools that animate the avatar so as to simulate an intelligent conversation partner. Here we describe a toolkit that supports interactive avatar modeling for human-computer interaction. The toolkit utilizes cloud-based speech-to-text software that provides active listening, a cloud-based AI to generate appropriate textual responses to user queries, and a cloud-based text-to-speech generation engine to generate utterances for this text. This output is combined with a cloud-based 3D avatar animation synchronized to the spoken response. Generated text responses are embedded within an XML structure that allows the nature of the avatar animation to be tuned to simulate different emotional states, and an expression package controls the avatar's facial expressions. The rendering latency introduced by this pipeline is masked through parallel processing and an idle-loop process that animates the avatar between utterances. The efficiency of the approach is validated through a formal user study.

Item Open Access
A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation (2024-07-18)
Karim, Rezaul; Wildes, Richard P.
This dissertation presents an end-to-end trainable and unified multiscale encoder-decoder transformer for dense video estimation, with a focus on segmentation. We investigate this direction by exploring unified multiscale processing throughout the pipeline of feature encoding, context encoding and object decoding in an encoder-decoder model. Correspondingly, we present a Multiscale Encoder-Decoder Video Transformer (MED-VT) that uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on additional preprocessing, such as computing object proposals or optical flow; (ii) temporal consistency at encoding; and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we explore temporal consistency through a transductive learning scheme that exploits many-to-many label propagation across time. To demonstrate the applicability of the approach, we provide empirical evaluation of MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) and a multimodal task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on additional preprocessing such as object proposals or optical flow. We also document the model's internal learned representations through a detailed interpretability study, encompassing both quantitative and qualitative analyses.
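As a rough illustration of the kind of architecture the abstract above describes (and not the authors' MED-VT code), the sketch below flattens multiscale spatiotemporal backbone features into tokens, refines them jointly with a transformer encoder, and decodes learned object queries into per-query segmentation masks against the finest scale. All module sizes, the query count, and the toy feature shapes are assumptions made for the example.

```python
# Hypothetical sketch of a multiscale encoder-decoder video transformer.
import torch
import torch.nn as nn

class MultiscaleEncoderDecoder(nn.Module):
    def __init__(self, channels=(256, 512, 1024), dim=128, num_queries=8):
        super().__init__()
        # Project each backbone scale to a common token dimension.
        self.proj = nn.ModuleList(nn.Conv3d(c, dim, kernel_size=1) for c in channels)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.mask_head = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, feats):
        # feats: list of (B, C_i, T, H_i, W_i) maps from a video backbone,
        # ordered fine -> coarse (decreasing spatial resolution).
        tokens = [p(f).flatten(2).transpose(1, 2)        # (B, T*H*W, dim) per scale
                  for p, f in zip(self.proj, feats)]
        memory = self.encoder(torch.cat(tokens, dim=1))  # joint multiscale context
        q = self.queries.unsqueeze(0).expand(feats[0].shape[0], -1, -1)
        q = self.decoder(q, memory)                      # queries attend to all scales
        # Dot-product the queries with the finest-scale features to form masks.
        fine = self.mask_head(self.proj[0](feats[0]))    # (B, dim, T, H, W)
        return torch.einsum('bqd,bdthw->bqthw', q, fine)

# Toy usage: three feature scales for a 4-frame clip.
feats = [torch.randn(1, 256, 4, 16, 16),
         torch.randn(1, 512, 4, 8, 8),
         torch.randn(1, 1024, 4, 4, 4)]
print(MultiscaleEncoderDecoder()(feats).shape)  # torch.Size([1, 8, 4, 16, 16])
```

Because every scale contributes tokens to one shared encoder, coarse semantics and fine localization are mixed in a single attention pass, which is the intuition behind the coarse-to-fine benefit claimed in the abstract.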
Item Open Access
Exploiting Novel Deep Learning Architecture in Character Animation Pipelines (2022-12-14)
Ghorbani, Saeed; Troje, Nikolaus
This doctoral dissertation presents a body of work aimed at improving different blocks of character animation pipelines, resulting in less manual work and more realistic character animation. To that purpose, we describe a variety of cutting-edge deep learning approaches that have been applied to human motion modelling and character animation. Recent advances in motion capture systems and processing hardware have shifted the field from physics-based approaches to the data-driven approaches heavily used in current game production frameworks. Despite these significant successes, shortcomings remain. For example, existing production pipelines contain processing steps, such as marker labelling in the motion capture pipeline or annotating motion primitives, that must still be done manually. In addition, most current approaches for character animation used in game production are limited by the amount of stored animation data, resulting in many duplicates and repeated patterns. We present our work in four main chapters. First, we present a large dataset of human motion called MoVi. Second, we show how machine learning approaches can be used to automate the data preprocessing blocks of optical motion capture pipelines. Third, we show how generative models can be used to generate batches of synthetic motion sequences given only weak control signals. Finally, we show how novel generative models can be applied to real-time character control in game production.

Item Open Access
Investigating and Modeling the Effects of Task and Context on Drivers' Attention (2024-07-18)
Kotseruba, Iuliia; Tsotsos, John K.
Driving, despite its widespread nature, is a demanding and inherently risky activity. Any lapse in focus, such as failing to look at traffic signals or not noticing the actions of other road users, can lead to severe consequences. Technology for driver monitoring and assistance aims to mitigate these issues, but requires a deeper understanding of how drivers observe their surroundings to make decisions. In this dissertation, we investigate the link between where drivers look, the tasks they perform, and the surrounding context. To do so, we first conduct a meta-study of the behavioral literature that documents the overwhelming importance of top-down (task-driven) effects on gaze. Next, we survey applied research to show that most models do not make this connection explicit and instead establish correlations between where drivers looked and images of the scene, without explicitly considering drivers' actions and environment. We then annotate and analyze the four largest publicly available datasets that contain driving footage and eye-tracking data. The new annotations for task and context show that the data are dominated by trivial scenarios (e.g., driving straight, standing still) and help uncover problems with typical data recording and processing pipelines that result in noisy, missing, or inaccurate data, particularly during safety-critical scenarios (e.g., intersections). For the only dataset with raw data available, we create a new ground truth that alleviates some of the discovered issues, and we provide recommendations for future data collection. Using the new annotations and ground truth, we benchmark a representative set of bottom-up models for gaze prediction (i.e., those that do not represent the task explicitly). We conclude that while the corrected ground truth boosts performance, the implicit representation is not sufficient to capture the effects of task and context on where drivers look. Lastly, motivated by these findings, we propose a task- and context-aware model for drivers' gaze prediction with explicit representation of the drivers' actions and context. The first version of the model, SCOUT, improves state-of-the-art performance by over 80% overall and 30% on the most challenging scenarios. We then propose SCOUT+, which relies on more readily available route and map information, similar to what the driver might see on an in-car navigation screen. SCOUT+ achieves results comparable to the version that uses more precise numeric and text labels.
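A minimal sketch of how explicit task and context conditioning might be wired into a gaze prediction model, in the spirit of (but not identical to) SCOUT: embeddings of discrete action and context labels modulate visual features before a saliency decoder. The label vocabularies, layer sizes, and input resolution are illustrative assumptions.

```python
# Hypothetical sketch of task- and context-aware driver gaze prediction.
import torch
import torch.nn as nn

class TaskContextGazeModel(nn.Module):
    def __init__(self, num_actions=8, num_contexts=6, dim=64):
        super().__init__()
        self.visual = nn.Sequential(                  # toy frame encoder
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.action_emb = nn.Embedding(num_actions, dim)
        self.context_emb = nn.Embedding(num_contexts, dim)
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1))                     # per-pixel gaze logits

    def forward(self, frame, action_id, context_id):
        feat = self.visual(frame)                     # (B, dim, H/4, W/4)
        # Explicit representation of the driver's action and the scene context.
        cond = self.action_emb(action_id) + self.context_emb(context_id)   # (B, dim)
        feat = feat * cond[:, :, None, None]          # modulate visual features
        return torch.sigmoid(self.decoder(feat))      # predicted gaze saliency map

# Toy usage: one frame, hypothetical "turn left" action at an "intersection" context.
model = TaskContextGazeModel()
saliency = model(torch.randn(1, 3, 128, 256), torch.tensor([2]), torch.tensor([1]))
print(saliency.shape)  # torch.Size([1, 1, 32, 64])
```

A purely bottom-up model would omit the two embedding tables and predict the same map for a given frame regardless of the maneuver, which is exactly the limitation the benchmark in the abstract points at.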
Item Open Access
Learned Exposure Selection for High Dynamic Range Image Synthesis (2021-03-08)
Segal, Shane Maxwell; Brown, Michael; Brubaker, Marcus
High dynamic range (HDR) imaging is a photographic technique that captures a greater range of luminance than standard imaging techniques. Traditionally captured with specialized sensors, HDR images are now regularly created through the fusion of multiple low dynamic range (LDR) images that can be captured by smartphones or other consumer-grade hardware. Three or more images are typically required to generate a well-exposed HDR image. This thesis presents a novel system for the fast synthesis of HDR images by means of exposure fusion that requires only two images. Experiments show that a sufficiently trained neural network can predict a suitable exposure value for the next image to be captured when given an initial image as input. With these images fed into the exposure fusion algorithm, a high-quality HDR image can be quickly generated.

Item Open Access
Leveraging Dual-Pixel Sensors for Camera Depth of Field Manipulation (2022-03-03)
Abuolaim, Abdullah Ahmad Taleb; Brown, Michael S.
Capturing a photo with clear scene details is important in photography and for computer vision applications. The range of distances in the real world over which a scene's objects appear in sharp detail is known as the camera's depth of field (DoF). The DoF is controlled by adjusting the lens-to-sensor distance (i.e., focus distance), the aperture size, and/or the focal length of the camera. At capture time, especially for video recording, DoF adjustment is often restricted to lens movements, as adjusting the other parameters introduces artifacts that can be visible in the recorded video. Nevertheless, the desired DoF is not always achievable at capture time, for reasons such as the physical constraints of the camera optics. This motivates a second direction: adjusting the DoF after capture as a post-processing step. Although pre- and post-capture DoF manipulation are both essential, there are few datasets and simulation platforms that enable investigating DoF at capture time. Another limitation is the lack of real datasets for DoF extension (i.e., defocus deblurring), where prior work relies on synthesizing defocus blur and ignores the physical formation of defocus blur in real cameras (e.g., lens aberration and radial distortion). To address this research gap, this thesis revisits DoF manipulation from two points of view: (1) adjusting DoF at capture time, a.k.a. camera autofocus (AF), within the context of dynamic scenes (i.e., video AF); and (2) computationally manipulating the DoF as a post-capture process. To this aim, we leverage a new imaging sensor technology known as the dual-pixel (DP) sensor. DP sensors are used to optimize camera AF and can provide good cues to estimate the amount of defocus blur present at each pixel location. In particular, this thesis provides the first 4D temporal focal stack dataset, along with an AF platform, to examine video AF. It also presents insights about user preference that lead to two novel video AF algorithms. As for post-capture DoF manipulation, we examine the problem of reducing defocus blur (i.e., extending the DoF) by introducing a new camera aperture adjustment to collect the first dataset that contains images with real defocus blur and their corresponding all-in-focus ground truth. We also propose the first end-to-end learning-based defocus deblurring method. We extend image defocus deblurring to a new domain application (i.e., video defocus deblurring) by designing a data synthesis framework that generates realistic DP video data through modeling physical camera constraints, such as lens aberration and radial distortion. Finally, we build on this data synthesis framework to synthesize shallow DoF with other aesthetic effects, such as multi-view synthesis and image motion.
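To make the post-capture direction concrete, here is a minimal, hypothetical sketch of learning-based defocus deblurring from dual-pixel data, not the thesis implementation: the two DP sub-aperture views are stacked and a small residual network predicts the all-in-focus image, supervised with an L1 loss. The network depth, channel counts, and the synthetic tensors are assumptions for the example.

```python
# Hypothetical sketch of dual-pixel defocus deblurring with a residual CNN.
import torch
import torch.nn as nn

class DPDeblurNet(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 3, 3, padding=1))

    def forward(self, dp_left, dp_right):
        x = torch.cat([dp_left, dp_right], dim=1)              # (B, 6, H, W) stacked DP views
        # Predict a residual on top of one DP view and clamp to valid intensities.
        return torch.clamp(x[:, :3] + self.net(x), 0.0, 1.0)

# Toy training step against an all-in-focus ground-truth image.
model = DPDeblurNet()
left, right = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
loss = nn.functional.l1_loss(model(left, right), target)
loss.backward()
print(float(loss))
```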
Item Open Access
Machine Learning and Digital Histopathology Analysis for Tissue Characterization and Treatment Response Prediction in Breast Cancer (2023-12-08)
Saednia, Khadijeh Shirin; Sadeghi-Naini, Ali
Breast cancer is the most common type of diagnosed cancer and the leading cause of cancer-related death in women. Early diagnosis and prognosis in breast cancer patients can permit more therapeutic options and possibly improve their survival and quality of life. The gold-standard approach for breast cancer diagnosis and characterization is histopathology assessment on biopsy specimens, which is time- and resource-demanding. In this dissertation project, state-of-the-art machine learning (ML) methods have been developed and investigated for breast tissue characterization, nuclei segmentation, and chemotherapy response prediction in breast cancer patients using pre-treatment digitized histopathology images. First, a novel multi-scale attention-guided deep learning model is introduced to characterize breast tissue on digital pathology images according to four histological types. Evaluation results on the test set show the effectiveness of the proposed approach, with an image classification accuracy of 97.5%. Next, a cascaded deep-learning-based model is proposed to accurately delineate tumor nuclei in digital pathology images, an essential step for extracting hand-crafted quantitative features for analysis with conventional ML models. The proposed model achieves an F1 score of 0.83 on an independent test set. Finally, two novel ML frameworks are introduced and investigated for chemotherapy response prediction. In the first approach, a digital histopathology image analysis framework is developed to extract various subsets of quantitative features from the segmented digitized slides for conventional ML model development. Several ML experiments are conducted with different feature sets to develop prediction models of therapy response using a gradient boosting machine with decision trees. The proposed model with the optimal feature set achieves an accuracy of 84%, sensitivity of 85%, and specificity of 82% on an independent test set.
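A small sketch of the conventional-ML route described in the first approach, assuming hypothetical hand-crafted features and synthetic labels rather than the dissertation's actual feature sets and patient data: per-patient feature vectors derived from the segmented slides feed a gradient boosting machine with decision trees to predict therapy response.

```python
# Hypothetical sketch: gradient boosting on hand-crafted histopathology features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Each row: one patient; columns: illustrative nuclei morphology / texture statistics
# (e.g., mean nucleus area, eccentricity, GLCM contrast) -- synthetic stand-ins here.
X = rng.normal(size=(120, 12))
y = rng.integers(0, 2, size=120)        # 1 = responder, 0 = non-responder (synthetic)

gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
print("CV accuracy:", cross_val_score(gbm, X, y, cv=5, scoring="accuracy").mean())
```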
The second approach introduces a hierarchical self-attention-guided deep learning framework to predict breast cancer response to chemotherapy using digital histopathology images of pre-treatment tumor biopsies. The whole slide images (WSIs) are processed automatically through the proposed hierarchical framework, which consists of patch-level and tumor-level processing modules followed by a patient-level response prediction component; a combination of convolutional and transformer modules is utilized at each processing level. The proposed framework outperforms the conventional ML models, with a test accuracy, sensitivity, and specificity of 86%, 87%, and 83%, respectively. The proposed methods and reported results in this dissertation are steps toward streamlining the histopathology workflow and implementing response-guided precision oncology for breast cancer patients.
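As a hedged illustration of that hierarchical idea, and not the thesis architecture itself, the sketch below embeds patches with a small CNN, pools patches into tumor-level vectors with one transformer encoder, and pools tumor regions into a patient-level response probability with another. All shapes, module sizes, and the toy input are invented for the example.

```python
# Hypothetical sketch of hierarchical patch -> tumor -> patient WSI processing.
import torch
import torch.nn as nn

class HierarchicalWSIPredictor(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.patch_cnn = nn.Sequential(                 # patch-level embedding
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), 1)
        self.patch_pool = make_encoder()                # attention over patches in a region
        self.tumor_pool = make_encoder()                # attention over tumor regions
        self.head = nn.Linear(dim, 1)                   # patient-level response logit

    def forward(self, patches):
        # patches: (regions, patches_per_region, 3, H, W) for one patient.
        r, p = patches.shape[:2]
        emb = self.patch_cnn(patches.flatten(0, 1)).view(r, p, -1)
        tumor = self.patch_pool(emb).mean(dim=1)                    # (regions, dim)
        patient = self.tumor_pool(tumor.unsqueeze(0)).mean(dim=1)   # (1, dim)
        return torch.sigmoid(self.head(patient))                    # P(response)

# Toy usage: 4 tumor regions, 16 patches each, 64x64 RGB patches.
prob = HierarchicalWSIPredictor()(torch.rand(4, 16, 3, 64, 64))
print(prob.shape, float(prob))  # torch.Size([1, 1]) and a probability in [0, 1]
```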