Theses

This collection consists of theses and dissertations produced by graduate students affiliated with the York Centre for Vision Research. These works represent significant contributions to the interdisciplinary field of vision science and have been approved in accordance with the academic standards of their respective graduate programs (including Biology, Digital Media, Electrical Engineering and Computer Science, Interdisciplinary Studies, Kinesiology & Health Science, Philosophy, Physics & Astronomy, Psychology, and others). This collection is managed, and deposits are authorized, by the Coordinator for the Centre.

Recent Submissions

Now showing 1 - 20 of 60
  • Item (Open Access)
    A Solution for Scale Ambiguity in Generative Novel View Synthesis
    (2025-04-10) Forghani, Fereshteh; Brubaker, Marcus
    Generative Novel View Synthesis (GNVS) uses generative models to produce plausible unseen views of a scene given an initial view and the relative camera motion between the input and target views. A key limitation of current generative methods lies in their susceptibility to scale ambiguity, an inherent challenge in multi-view datasets caused by the use of monocular techniques to estimate camera positions from uncalibrated video frames. In this work, we present a novel approach to tackle this scale ambiguity in multi-view GNVS by optimizing the scales as parameters in an end-to-end fashion. We also introduce Sample Flow Consistency (SFC), a novel metric designed to assess scale consistency across samples with the same camera motion. Through various experiments, we demonstrate that our approach yields improvements in terms of SFC, providing more consistent and reliable novel view synthesis.
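    As a rough illustration of the scale-optimization idea described above, the sketch below treats one scene scale per training sequence as a learnable parameter applied to the camera translation; all names, shapes, and training details are illustrative assumptions, not the thesis code.

```python
# Hypothetical sketch: per-sequence scene scales optimized end-to-end.
import torch

class LearnableScales(torch.nn.Module):
    def __init__(self, num_sequences: int):
        super().__init__()
        # One log-scale per training sequence; exp() keeps scales positive.
        self.log_scale = torch.nn.Parameter(torch.zeros(num_sequences))

    def forward(self, seq_idx: torch.Tensor, translation: torch.Tensor):
        # Rescale the metrically ambiguous camera translation so that all
        # sequences share a consistent scale during training.
        return translation * torch.exp(self.log_scale[seq_idx]).unsqueeze(-1)

scales = LearnableScales(num_sequences=100)
t = torch.randn(8, 3)              # a batch of relative camera translations
idx = torch.randint(0, 100, (8,))  # which sequence each view pair came from
t_rescaled = scales(idx, t)        # passed on to the generative model;
# scales.log_scale receives gradients from the generative loss end-to-end.
```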
  • Item (Open Access)
    From Discrete to Continuous: Learning 3D Geometry from Unstructured Points by Random Continuous Space Queries
    (2025-04-10) Jia, Meng; Kyan, Matthew J.
    In this dissertation, we focus on generalizing recent point convolution methods and building well-behaved point-cloud 3D shape features to achieve more robust, invariant, and versatile implicit neural representations (INR) of 3D shapes. In recent efforts to explore point-cloud based learning methods to improve 3D shape analysis, much attention has been paid to the use of INR-based frameworks. Existing methods, however, mostly formulate models with an encoder-decoder architecture that incorporates a global shape embedding space, which often fails to model fine-grained local details efficiently, limiting overall generalization performance. To overcome this problem, we propose a convolutional feature space sampling operation (Dual-Feature Sampling, or DFS) and develop a novel INR learning framework (Stochastic Continuous Function Learning, or SCFL). This framework is first adapted and evaluated for surface reconstruction of generic objects from sparsely sampled point clouds, a task that has been extensively used to benchmark INR 3D shape learning methods. This study demonstrates impressive capabilities of our method, namely: 1) an ability to faithfully recover fine details and uncommon shape characteristics; 2) improved robustness to point-cloud rotation; 3) flexibility to handle different levels of sparsity in the input point clouds; 4) significantly better generalization in the presence of unseen shape categories. In addition, the DFS operator proposed for this framework is well-formulated and general enough to be easily integrated into existing systems designed to address more complex 3D shape tasks. In this work, we harness this powerful ability to represent shape within a newly proposed SCFL-based occupancy network, applied to shape-based processing problems in medical image registration and segmentation. Specifically, our network is adapted and applied to two different, traditionally challenging problems: 1) liver image-to-physical registration; and 2) tumour-bearing whole-brain segmentation. In both of these tasks, significant deformation can severely degrade and hinder performance. We illustrate, however, that accuracy in both tasks can be considerably improved over baseline methods using our proposed network. Finally, through the course of the investigations conducted, an intensive effort has been made throughout the dissertation to review, analyze, and offer speculative insights into the features of these proposed innovations, their role in the configurations presented, and their possible utility in other scenarios and configurations that may warrant future investigation. It is our hope that the work in this dissertation may help to spark new ideas to advance the state of the art in learning-based representation of 3D shapes and encourage more interest in novel applications of INR to solve real-world problems.
  • Item (Open Access)
    Underwater gesture-based human-to-robot communication
    (2025-04-10) Codd-Downey, Robert Frank; Jenkin, Michael
    Underwater human-to-robot interaction presents significant challenges due to the harsh environment, including reduced visibility from suspended particulate matter and high attenuation of light and electromagnetic waves generally. Divers have developed an application-specific gesture language that has proven effective for diver-to-diver communication underwater. Given the wide acceptance of this language for underwater communication, it would seem an appropriate mechanism for diver-to-robot communication as well. Effective gesture recognition systems must address several challenges. Designing a gesture language involves balancing expressiveness and system complexity. Detection techniques range from traditional computer vision methods, suitable for small gesture sets, to neural networks for larger sets requiring extensive training data. Accurate gesture detection must handle noise and distinguish between repeated gestures and single gestures held for longer durations. Reliable communication also necessitates a feedback mechanism to allow users to correct miscommunications. Such systems must also deal with the need to recognize individual gesture tokens and their sequences, a problem that is hampered by the lack of large-scale labelled datasets of individual tokens and gesture sequences. Here these problems are addressed through weakly supervised learning and a sim2real approach that reduces by several orders of magnitude the effort required to obtain the necessary labelled dataset. This work addresses the communication task by (i) developing a traditional diver and diver-part recognition system (SCUBANetV1+), (ii) using this recognition within a weak supervision approach to train SCUBANetV2, a diver hand gesture recognition system, and (iii) feeding SCUBANetV2's individual gesture recognitions to the Sim2Real-trained SCUBALang LSTM network, which translates temporal gesture sequences into phrases. This neural network pipeline effectively recognizes diver hand gestures in video data, demonstrating success on structured sequences. Each of the individual network components is evaluated independently, and the entire pipeline is evaluated formally using imagery obtained in both the open ocean and in pool environments. As a final evaluation, the resulting system is deployed within a feedback structure and evaluated using a custom unmanned underwater vehicle. Although this work concentrates on underwater gesture-based communication, the technology and learning process introduced here can be deployed in other environments for which application-specific gesture languages exist.
  • Item (Open Access)
    Normalized Moments for Photo-realistic Style Transfer
    (2025-04-10) Canham, Trevor Dalton; Brown, Michael S.
    Style transfer, the operation of matching appearance features between source and target images, is a complex and highly subjective problem. Due to the profundity of the concept of artistic style, the optimal solution is ill-defined, so the variety of approaches that have been proposed represents partial solutions with varying degrees of efficiency, usability, and appearance of results. In this work, a photo-realistic style transfer method for image and video is proposed that is based on vision science principles and on a recent mathematical formulation for the deterministic decoupling of features. As a proxy for mimicking the effects of camera color rendering or post-processing, the employed features (the first through fourth order moments of the color distribution) represent important cues for visual adaptation and pre-attentive processing. The method is evaluated on the above criteria in a series of application-relevant experiments and is shown to produce results of high visual quality, without spatio-temporal artifacts; validation tests in the form of observer preference experiments show that it compares very well with the state of the art (deep learning, optimal transport, etc.). The computational complexity of the algorithm is low, and a numerical implementation that is amenable to real-time video application is proposed and demonstrated. Finally, general recommendations for photo-realistic style transfer are discussed.
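    As a minimal illustration of moment matching between color distributions, the sketch below matches only the first two per-channel moments (mean and standard deviation); the method above goes further, deterministically decoupling and matching moments up to fourth order (skewness and kurtosis). Function names are illustrative.

```python
# Hypothetical sketch: per-channel matching of the first two color moments.
import numpy as np

def match_mean_std(source, target):
    """source, target: float arrays of shape (H, W, 3) in [0, 1]."""
    out = np.empty_like(source)
    for c in range(3):
        s, t = source[..., c], target[..., c]
        # Standardize the source channel, then impose the target's
        # mean and standard deviation.
        out[..., c] = (s - s.mean()) / (s.std() + 1e-8) * t.std() + t.mean()
    return np.clip(out, 0.0, 1.0)
```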
  • Item (Open Access)
    Investigating Pannexin1a-Mediated Mechanisms Of Pain And Neuroinflammation Using Zebrafish
    (2025-04-10) Jeyarajah, Darren; Zoidl, George
    This thesis explores the role of Panx1a in modulating pain and neuroinflammatory responses in zebrafish (Danio rerio). Nociception was induced using acetic acid (AA) treatments. Behavioral assays conducted on Panx1a knockout (KO) zebrafish larvae demonstrate significant alterations in response to AA-induced pain. Pharmacological interventions using probenecid, a Panx1 inhibitor, and ibuprofen, a cyclooxygenase (COX) inhibitor, reveal their potential in modulating pain behaviors and rescuing nociceptive deficits. Furthermore, molecular analyses employing quantitative polymerase chain reaction (qPCR) and RNA sequencing (RNA-seq) elucidate the regulatory impact of Panx1a on gene expression related to nociception, neuroinflammation, and synaptic plasticity. In summary, this thesis provides evidence of Panx1a's involvement in pain and neuroinflammation, proposing zebrafish as a viable model for studying nociception.
  • Item (Open Access)
    Developing A Non-Human Primate Model To Dissect The Neural Mechanism Of Facial Emotion Processing
    (2025-04-10) Taghian Alamooti, Shirin; Kar, Kohitij
    Facial emotion recognition is a cornerstone of social cognition, vital for interpreting social cues and fostering communication. Despite extensive research in human subjects, the neural mechanisms underlying this process remain incompletely understood. This thesis investigates these mechanisms using a non-human primate model to provide deeper insights into the neural circuitry involved in facial emotion processing. We embarked on a comparative analysis of facial emotion recognition between humans and rhesus macaques. Using a carefully curated set of facial expression images from the Montreal Set of Facial Displays of Emotion (MSFDE), we designed a series of binary emotion discrimination tasks. Our innovative approach involved detailed behavioral metrics that revealed significant parallels in emotion recognition patterns between the two species. These findings highlight the macaques’ potential as a robust model for studying human-like facial emotion recognition. Building on these behavioral insights, the second phase of our research delved into the neural underpinnings of this cognitive process. We conducted large-scale, chronic multi-electrode recordings in the inferior temporal (IT) cortex of rhesus macaques. By mapping the neural activity associated with the classification of different facial emotions, we uncovered specific neural markers that correlate strongly with behavioral performance. These neural signatures provide compelling evidence for the role of the IT cortex in processing complex emotional cues. Our findings bridge the gap between behavioral and neural perspectives on facial emotion recognition, offering a comprehensive understanding of the underlying mechanisms. This research not only underscores the evolutionary continuity of social cognition across primate species but also sets the stage for future explorations into the neural basis of emotion processing. The integration of behavioral analysis with advanced neural recording techniques presents a powerful framework for advancing our knowledge of social cognition and its disorders.
  • Item (Open Access)
    Gaze-Contingent Multi-Modal and Multi-Sensory Applications
    (2024-11-07) Vinnikov, Margarita; Allison, Robert
    Gaze-contingent displays are applications that are driven by the user's gaze. They are an important tool for many multi-modal and multi-sensory domains. They can be used to precisely control the retinal image in real time to study visual control of natural behaviour through experimentation, or to improve user experience in virtual reality applications. In this dissertation, I explored the application of gaze-contingent display technology to different modalities and senses and evaluated whether such applications can be useful for simulation, psychophysical research, and human-computer interaction. Specifically, I looked at a visual gaze-contingent display and an audio gaze-contingent display. I examined the effects of simulated visual defects on users' perception and control of self-motion during locomotion. I found that gaze-contingent display simulations of visual defects significantly altered visual patterns and impaired the accuracy and precision of judgements of heading. I also examined the impact of simulating gaze-contingent depth-of-field for monocular and stereoscopic displays. The experimental data showed that the alleviation of negative effects associated with stereo displays depends on the user's age and the types of scenes that are viewed. Finally, I simulated gaze-contingent audio displays that imitated the cocktail party effect. My audio enhancement techniques turned out to be very beneficial for applications that have to deal with the user's attention to multiple sources of sound (speakers), such as teleconferences and social games. Overall, in this dissertation, I demonstrated that gaze-contingent systems can be used in many aspects of virtual system design and, if combined together (used for multiple cues and senses), can be a very powerful tool for augmenting and improving the overall user experience.
  • Item (Open Access)
    Image White Balance for Multi-Illuminant Scenes
    (2024-11-07) Arora, Aditya; Derpanis, Konstantinos G.
    Performing white-balance (WB) correction for scenes with multiple illuminants remains a challenging task in computer vision. Most previous methods estimate per-pixel scene illumination directly in the RAW sensor image space. Recent work explored an alternative fusion strategy, where a neural network fuses multiple white-balanced versions of the input image processed to sRGB using pre-defined white-balance settings. Inspired by this line of work, we present two contributions targeting fusion-based multi-illuminant WB correction. First, we introduce a large-scale multi-illumination dataset rendered from RAW images to support training and evaluating fusion models. The dataset comprises over 16,000 sRGB images with corresponding ground-truth white-balance-corrected sRGB images. Next, we introduce an attention-based architecture to fuse five white-balance settings. This architecture yields an improvement of up to 25% over prior work.
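    A toy sketch of the fusion step, assuming per-pixel weights over K pre-rendered WB settings; the tiny convolutional weight predictor below is a stand-in assumption, not the attention architecture from the thesis.

```python
# Hypothetical sketch: per-pixel weighted fusion of K white-balance settings.
import torch
import torch.nn as nn

class WBFusion(nn.Module):
    def __init__(self, k: int = 5):
        super().__init__()
        # Stand-in weight predictor; the thesis uses an attention-based model.
        self.weight_net = nn.Sequential(
            nn.Conv2d(3 * k, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k, 3, padding=1),
        )

    def forward(self, renditions: torch.Tensor) -> torch.Tensor:
        # renditions: (B, K, 3, H, W), the same image under K WB settings.
        b, k, c, h, w = renditions.shape
        logits = self.weight_net(renditions.reshape(b, k * c, h, w))
        weights = torch.softmax(logits, dim=1)                  # (B, K, H, W)
        return (weights.unsqueeze(2) * renditions).sum(dim=1)   # (B, 3, H, W)

fused = WBFusion()(torch.rand(2, 5, 3, 64, 64))
```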
  • Item (Open Access)
    Symmetry-based monocular 3D vehicle ground-truthing for traffic analytics
    (2024-11-07) Tran, Trong Thao; Elder, James
    3D object detection is critical for autonomous driving and traffic analytics. Current research relies on LiDAR-derived ground truth for training and evaluation. However, LiDAR ground truth is expensive and usually inaccurate in the far field due to sparse LiDAR returns. Assuming a fully calibrated camera and a 3D terrain model, we explore whether inexpensive RGB imagery can be used to obtain 3D ground truth based on the bilateral symmetry of motor vehicles. From manually annotated symmetry points and tire-ground contact points, we infer a vertical symmetry plane and 3D point cloud to estimate vehicle location, pose, and dimensions. These estimates are input into a probabilistic model derived from a standard public motor vehicle dataset to form maximum a posteriori estimates of remaining dimensions. Evaluations on a public traffic dataset show that this novel symmetry-based approach is more accurate than LiDAR-based ground-truthing on single frames and comparable to LiDAR-based methods that propagate information across frames.
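    As an illustration of the geometric core, the sketch below fits a symmetry plane to manually annotated pairs of bilaterally symmetric points; the simple averaging scheme is an illustrative assumption, not necessarily the estimator used in the thesis.

```python
# Hypothetical sketch: fitting a symmetry plane (unit normal n, point p0)
# from corresponding left/right symmetric 3D points.
import numpy as np

def fit_symmetry_plane(left, right):
    """left, right: (N, 3) arrays of corresponding symmetric points."""
    diffs = left - right
    # Each left-right difference is (ideally) parallel to the plane normal;
    # normalize, align signs, and average for a simple robust estimate.
    dirs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    dirs *= np.sign(dirs @ dirs[0])[:, None]
    n = dirs.mean(axis=0)
    n /= np.linalg.norm(n)
    # The plane passes through the midpoints of all symmetric pairs.
    p0 = ((left + right) / 2.0).mean(axis=0)
    return n, p0
```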
  • Item (Open Access)
    Probing Human Visual Strategies Using Interpretability Methods for Artificial Neural Networks
    (2024-10-28) Kashef Alghetaa, Yousif Khalid Faeq; Kar, Kohitij
    Unraveling human visual strategies during object recognition remains a challenge in vision science. Existing psychophysical methods used to investigate these strategies are limited in how accurately they can interpret human decisions. Recently, artificial neural network (ANN) models, which show remarkable similarities to human vision, have provided a window into human visual strategies. However, inconsistencies among different techniques hinder the use of explainable AI (XAI) methods to interpret ANN decision-making. Here, we first develop and validate, in silico, a novel surrogate method that uses behavioral probes in ANNs with explanation-masked images to address these challenges. Then, by identifying the XAI method and ANN with the highest human alignment, we provide a working hypothesis and an effective approach to explain human visual strategies during object recognition -- a framework relevant to many other behaviors.
  • Item (Open Access)
    Influence of a visual landmark shift on memory-guided reaching in the monkey
    (2024-03-16) Lin, Jennifer Yi Xuan; Crawford, John Douglas
    Reach and gaze data were collected from one female Macaca mulatta monkey (ML) trained to perform a memory-guided reaching task, to determine the influence of allocentric cue shifts on reaching responses in the non-human primate. A landmark (4 ‘dots’ spaced 10° apart, forming the corners of a virtual square) was presented at 1 of 15 locations on a touch screen. The landmark either reappeared at the same location (stable landmark condition) or shifted by 8° in one of 8 directions (landmark shift condition). ‘No-landmark’ controls were the same, but without the landmark. The presence of a stable landmark increased the accuracy of both gaze and touch responses and the precision of gaze. In the landmark shift condition, reaches shifted partially (mean = 29%) with the landmark. Overall, these data suggest that the monkey is influenced by visual landmarks when reaching to remembered targets in a similar way to humans.
  • Item (Open Access)
    Key-Frame Based Motion Representations for Pose Sequences
    (2024-03-16) Thasarathan, Harrish Patrick; Derpanis, Konstantinos
    Modelling human motion is critical for computer vision tasks that aim to perceive human behaviour. Extending current learning-based approaches to successfully model long-term motions remains a challenge. Recent works rely on autoregressive methods, in which motions are modelled sequentially. These methods tend to accumulate errors and, when applied to typical motion modelling tasks, are limited to sequences of only about four seconds. We present a non-autoregressive framework that represents motion sequences as a set of learned key-frames without explicit supervision. We explore continuous and discrete generative frameworks for this task and design a key-framing transformer architecture to distill a motion sequence into key-frames and their relative placements in time. We compare our learned key-frame placement approach against a naive uniform placement strategy, and further compare key-frame distillation using our transformer architecture with an alternative common sequence modelling approach. We demonstrate the effectiveness of our method by reconstructing motions up to 12 seconds long.
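    To make the decoding step concrete, the sketch below reconstructs a dense pose sequence from key-frames and their placements in time; plain linear interpolation stands in for whatever decoder the model actually learns, and all shapes are illustrative.

```python
# Hypothetical sketch: decoding a dense pose sequence from key-frames.
import numpy as np

def decode_keyframes(key_poses, key_times, T):
    """key_poses: (K, D); key_times: (K,) in [0, 1]; returns (T, D)."""
    t = np.linspace(0.0, 1.0, T)
    order = np.argsort(key_times)
    kt, kp = key_times[order], key_poses[order]
    # Interpolate each pose dimension independently between key-frames.
    return np.stack(
        [np.interp(t, kt, kp[:, d]) for d in range(kp.shape[1])], axis=1)

# E.g., 6 key-frames of a 51-D pose decoded to a 300-frame sequence.
seq = decode_keyframes(np.random.randn(6, 51), np.sort(np.random.rand(6)), 300)
```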
  • Item (Open Access)
    Examining Autoexposure for Challenging Scenes
    (2024-03-16) Yang, Beixuan; Brown, Michael S.
    Autoexposure (AE) is a critical step that cameras apply to ensure properly exposed images. While current AE algorithms are effective in well-lit environments with unchanging illumination, these algorithms still struggle in environments with bright light sources or scenes with abrupt changes in lighting. A significant hurdle in developing new AE algorithms for challenging environments, especially those with time-varying lighting, is the lack of platforms to evaluate AE algorithms and suitable image datasets. To address this issue, we have designed a software platform allowing AE algorithms to be used in a plug-and-play manner with the dataset. In addition, we have captured a new 4D exposure dataset that provides a complete solution space (i.e., all possible exposures) over a temporal sequence with moving objects, bright lights, and varying lighting. Our dataset and associated platform enable repeatable evaluation of different AE algorithms and provide a much-needed starting point to develop better AE methods.
  • Item (Open Access)
    Active Visual Search: Investigating human strategies and how they compare to computational models
    (2024-03-16) Wu, Tiffany; Tsotsos, John K.
    Real-world visual search by fully active observers has not been sufficiently investigated. Whilst the visual search paradigm has been widely used, most studies use a 2D, passive observation task, in which immobile subjects search through stimuli on a screen. Computational models have similarly been compared to human performance only to the degree of 2D image search. I conducted an active search experiment in a 3D environment, measuring eye and head movements of untethered subjects during search. Results reveal patterns that constitute search strategies, such as repeated search paths within and across subjects. Learning trends were found, but only in target-present trials. Foraging models encapsulate subjects' location-leaving actions, whilst robotics models capture viewpoint-selection behaviours. Eye movement models were less applicable to 3D search. The richness of the data collected from this experiment opens many avenues of exploration, and the possibility of modelling active visual search in a more human-informed manner.
  • Item (Open Access)
    Fine Granularity is Critical for Intelligent Neural Network Pruning
    (2023-12-08) Heyman, Andrew Baldwin; Zylberberg, Joel
    Neural network pruning is a popular approach to reducing the computational costs of training and/or deploying a network, and aims to do so while minimizing accuracy loss. Pruning methods that remove individual weights (fine granularity) yield better ratios of accuracy to parameter count, while methods that preserve some or all of a network’s structure (coarser granularity, e.g. pruning channels from a CNN) take better advantage of hardware and software optimized for dense matrix computations. We compare intelligent iterative pruning using several different criteria sampled from the literature against random pruning at initialization across multiple granularities on two different image classification architectures and tasks. We find that the advantage of intelligent pruning (with any criterion) over random pruning decreases dramatically as granularity becomes coarser. Our results suggest that, compared to coarse pruning, fine pruning combined with efficient implementation of the resulting networks is a more promising direction for improving accuracy-to-cost ratios.
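    The two granularities compared above can be made concrete with a magnitude-pruning sketch; the L1 channel criterion, fractions, and tensor shapes below are illustrative assumptions, not the exact criteria sampled in the thesis.

```python
# Hypothetical sketch: magnitude pruning at fine vs. coarse granularity.
import torch

def prune_fine(w: torch.Tensor, frac: float) -> torch.Tensor:
    # Fine granularity: zero out the smallest-magnitude individual weights.
    k = max(1, int(frac * w.numel()))
    thresh = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > thresh).float()

def prune_channels(w: torch.Tensor, frac: float) -> torch.Tensor:
    # Coarse granularity: w is (out_ch, in_ch, kH, kW); drop whole output
    # channels with the smallest L1 norm (hardware-friendly, less flexible).
    norms = w.abs().sum(dim=(1, 2, 3))
    k = max(1, int(frac * w.shape[0]))
    keep = (norms > norms.kthvalue(k).values).float()
    return w * keep[:, None, None, None]

w = torch.randn(64, 32, 3, 3)
print(prune_fine(w, 0.9).eq(0).float().mean())      # ~90% of weights zeroed
print(prune_channels(w, 0.5).eq(0).float().mean())  # ~half the channels zeroed
```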
  • Item (Open Access)
    Mug Shots: Systematic Biases in the Perception of Facial Orientation within Pictorial Spaces
    (2023-12-08) Esser, Maxwell Jacob Rosenstein; Troje, Nikolaus
    Pictures are 2-D projections of a 3-D world, so pictorial spaces behave differently than the 3-D visual spaces we inhabit. For instance, the angular orientation of a face pictured in half-profile view is systematically overestimated by the human observer – a 35° view is estimated to be approximately 45°. What causes this perceptual orientation bias? We tested three different hypotheses. (1) The phenomenon is specific to pictorial projections, due to the twofoldness of the medium, and does not occur in 3-D space. (2) It can be explained by the depth compression expected when the vantage point of the observer is closer to the picture than the point of projection. (3) The visual system uses a shape prior that does not match the elliptical horizontal cross-section of a typical head. Our results support the third hypothesis, and the effect can be mitigated by adding geometric information through structure-from-motion.
  • Item (Open Access)
    A 360-degree Omnidirectional Photometer Using a Ricoh Theta Z1
    (2023-12-08) MacPherson, Ian Michael; Brown, Michael S.
    Spot photometers measure the luminance emitted or reflected from a small surface area in a physical environment. Because the measurement is limited to a "spot," capturing dense luminance readings for an entire environment is impractical. This thesis demonstrates the potential of using an off-the-shelf commercial camera to operate as a 360-degree luminance meter. The method uses the Ricoh Theta Z1 camera, which provides a full 360-degree omnidirectional field of view and an API to access the camera's minimally processed RAW images. Working from the RAW images, this thesis describes a calibration method to map the RAW images under different exposures and ISO settings to luminance values. By combining the calibrated sensor with multi-exposure high-dynamic-range imaging, a cost-effective mechanism for capturing dense luminance maps of environments is provided. The results show that the Ricoh Theta calibrated as a luminance meter performs well when validated against a significantly more expensive spot photometer.
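    A much-simplified sketch of the calibration idea: assuming a linear sensor after black-level subtraction, a single gain K fitted against a reference spot photometer maps RAW values (normalized by exposure time and ISO) to luminance. The single-constant model, the black level, and all names are assumptions, not the thesis procedure.

```python
# Hypothetical sketch: mapping RAW pixel values to luminance (cd/m^2).
import numpy as np

def raw_to_luminance(raw, exposure_s, iso, K, black_level=64.0):
    # Assume the sensor is linear after black-level subtraction, so the
    # normalized signal is proportional to scene luminance.
    signal = np.maximum(np.asarray(raw, dtype=float) - black_level, 0.0)
    return K * signal / (exposure_s * iso)

def fit_gain(raw_patches, exposures, isos, meter_cd_m2, black_level=64.0):
    # Least-squares fit of the gain K from patches whose luminance was also
    # measured with a reference spot photometer.
    x = np.maximum(np.asarray(raw_patches, dtype=float) - black_level, 0.0)
    x /= exposures * isos
    return float(np.dot(x, meter_cd_m2) / np.dot(x, x))
```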
  • Item (Open Access)
    Volumetric Attribute Compression for 3D Point Clouds using Feedforward Network with Geometric Attention
    (2023-08-04) Do, Viet Ho Tam Thuc; Cheung, Gene
    We study 3D point cloud attribute compression using a volumetric approach: given a target volumetric attribute function $f : \mathbb{R}^3 \rightarrow \mathbb{R}$, we quantize and encode a parameter vector $\theta$ that characterizes $f$ at the encoder, for reconstruction $f_{\hat{\theta}}(\mathbf{x})$ at known 3D points $\mathbf{x}$ at the decoder, where $\hat{\theta}$ is a quantized version of $\theta$. Extending a previous work, the Region Adaptive Hierarchical Transform (RAHT), which employs piecewise constant functions to span a nested sequence of function spaces, we propose a feedforward linear network that implements higher-order B-spline bases spanning function spaces without eigen-decomposition. The feedforward network architecture means that the system is amenable to end-to-end neural learning. The key to our network is a space-varying convolution, similar to a graph operator, whose weights are computed from the known 3D geometry for normalization. We show that the number of layers in the normalization at the encoder is equivalent to the number of terms in a matrix inverse Taylor series. Experimental results on real-world 3D point clouds show up to 2-3 dB gain over RAHT in energy compaction and 20-30% bitrate reduction.
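    As a toy illustration of the volumetric formulation (not the proposed network), $f_\theta$ can be evaluated at arbitrary 3D points as a weighted sum of shifted separable B-spline bases; a first-order (hat) kernel on a unit grid stands in for the higher-order bases used above.

```python
# Toy sketch: evaluating a volumetric attribute function as a sum of
# coefficients times shifted tensor-product B-spline bases.
import numpy as np

def hat(t):
    # First-order (linear) B-spline kernel on a unit grid.
    return np.maximum(1.0 - np.abs(t), 0.0)

def f_theta(x, theta, centers):
    """x: (M, 3) query points; theta: (N,) coefficients; centers: (N, 3)."""
    # (M, N) separable basis evaluations, weighted by the coefficients.
    B = (hat(x[:, None, 0] - centers[None, :, 0])
         * hat(x[:, None, 1] - centers[None, :, 1])
         * hat(x[:, None, 2] - centers[None, :, 2]))
    return B @ theta

# Example: a 4x4x4 grid of basis centers and random coefficients.
g = np.stack(np.meshgrid(*[np.arange(4.0)] * 3, indexing="ij"), -1).reshape(-1, 3)
vals = f_theta(np.random.rand(10, 3) * 3, np.random.randn(64), g)
```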
  • Item (Open Access)
    Sparse Shape Encoding for Improved Instance Segmentation
    (2023-08-04) Liu, Keyi; Elder, James
    Neurophysiological studies suggest that neurons in the intermediate visual area V4 of the primate cortex encode a sparse representation of object shape. While there are metabolic arguments for such sparse representations, there are also potential advantages for inference. Here we explore whether sparse shape encoding can yield benefits for instance segmentation. Specifically, we encode 2D object shape using a Distance Transform Map (DTM) and learn a sparse basis for this representation. To make use of this encoding, we design an instance segmentation head to estimate the sparse coefficients of each object, and then recover the shape from the zero-crossing level set of the corresponding DTM. Our novel SparseShape encoding approach produces fewer topological errors than the state-of-the-art, yields competitive mask AP on the MS COCO benchmark, and exhibits superior generalization performance on the Cityscapes traffic instance segmentation task.
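    A minimal sketch of the shape representation described above: encode a binary mask as a signed DTM, project it onto a low-dimensional linear basis (an orthonormal basis such as PCA stands in here for the learned sparse basis), and recover the mask from the zero-crossing level set of the reconstruction.

```python
# Hypothetical sketch: signed DTM encoding and zero-crossing decoding.
import numpy as np
from scipy.ndimage import distance_transform_edt as edt

def signed_dtm(mask):
    # mask: binary {0,1} array; positive distances inside, negative outside.
    m = mask.astype(np.uint8)
    return edt(m) - edt(1 - m)

def encode_decode(mask, basis, mean):
    """basis: (K, H*W) orthonormal rows; mean: (H*W,) mean DTM."""
    d = signed_dtm(mask).ravel() - mean
    coeffs = basis @ d                            # K shape coefficients
    recon = (basis.T @ coeffs + mean).reshape(mask.shape)
    return recon > 0                              # zero-crossing level set
```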
  • Item (Open Access)
    Augmented Reality Water-Level Task
    (2023-08-04) Abadi, Romina; Allison, Robert
    The "Water-Level Task" asks participants to draw the water level in a tilted container. Studies have shown that many adults have difficulty with the task. Our study aimed to determine whether this misconception about water orientation persists in a more natural environment. We implemented a water-in-container effect to create an augmented reality (AR) version of the Water-Level Task (AR-WLT). In the AR-WLT, participants interacted with two containers half full of water in a HoloLens 2 AR display and were asked to determine which looked more natural. In at least one of the two simulations, the water surface did not remain horizontal. A traditional online WLT was created to recruit low- and high-scoring participants. Our results showed that low-scoring individuals were more likely to make errors in the AR version. However, participants did not choose simulations close to their 2D drawings, suggesting that different cognitive and perceptual factors were involved in the different environments.