Supervisor: Wildes, Richard P.
Author: Karim, Rezaul
Date accessioned/available: 2024-07-18
Copyright date: 2024-06-25
Date issued: 2024-07-18
URI: https://hdl.handle.net/10315/42219

Abstract: This dissertation presents an end-to-end trainable, unified multiscale encoder-decoder transformer for dense video estimation, with a focus on segmentation. We investigate this direction by exploring unified multiscale processing throughout the pipeline of feature encoding, context encoding and object decoding in an encoder-decoder model. Correspondingly, we present a Multiscale Encoder-Decoder Video Transformer (MED-VT) that uses multiscale representation throughout and can employ an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics, without reliance on additional preprocessing such as computing object proposals or optical flow; (ii) temporal consistency at encoding; and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we explore temporal consistency through a transductive learning scheme that exploits many-to-many label propagation across time. To demonstrate the applicability of the approach, we provide empirical evaluation of MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) and a multimodal task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on additional preprocessing such as object proposals or optical flow. We also document the model's internal learned representations through a detailed interpretability study encompassing both quantitative and qualitative analyses.

Rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.

Subjects: Computer science; Artificial intelligence; Robotics

Title: A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation
Type: Electronic Thesis or Dissertation

Keywords: Computer science; Computer vision; Deep learning; Robotics; Video understanding; Segmentation; Automatic video object segmentation; Actor-action segmentation; Video semantic segmentation; Multimodal; Audio-visual segmentation; Sound source segmentation; Camouflage object segmentation; DAVIS 2016; YouTube-Objects; MoCA; VSPW; A2D; AVSBench; Temporal consistency; Robustness; Transformers; Encoder-decoder; Feature interaction; Object queries; Model dissection; Interpretability; Explainable AI; Machine learning; CVPR; MED-VT
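
The abstract's central architectural idea, multiscale context encoding at the coarsest scale combined with coarse-to-fine object-query decoding across scales, can be illustrated with a minimal sketch. The code below is a hypothetical toy rendering in PyTorch, not the dissertation's actual MED-VT implementation; the class name MultiscaleEncoderDecoderSketch, the layer counts, dimensions and mask head are all illustrative assumptions.

```python
# Minimal sketch of multiscale encoding plus coarse-to-fine query decoding.
# NOT the MED-VT implementation; all design choices here are assumptions.
import torch
import torch.nn as nn


class MultiscaleEncoderDecoderSketch(nn.Module):
    def __init__(self, dims=(256, 128, 64), d_model=128, num_queries=8):
        super().__init__()
        # Project each backbone scale (ordered coarse -> fine) to one width.
        self.proj = nn.ModuleList(nn.Conv2d(c, d_model, 1) for c in dims)
        # Self-attention over the coarsest features ("context encoding").
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One cross-attention decoder per scale: object queries attend to
        # progressively finer features (coarse-to-fine localization).
        self.decoders = nn.ModuleList(
            nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=1,
            )
            for _ in dims
        )
        self.queries = nn.Embedding(num_queries, d_model)
        self.mask_head = nn.Conv2d(d_model, d_model, 1)

    def forward(self, feats):
        # feats: list of (B*T, C_i, H_i, W_i) maps, ordered coarse -> fine;
        # frames are flattened into the batch dimension for simplicity.
        feats = [p(f) for p, f in zip(self.proj, feats)]
        b, d, h, w = feats[0].shape
        # Flatten the coarsest scale to tokens and run the context encoder.
        tokens = self.encoder(feats[0].flatten(2).transpose(1, 2))
        feats[0] = tokens.transpose(1, 2).reshape(b, d, h, w)
        # Coarse-to-fine decoding: refine the same queries at each scale.
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)  # (B, Q, D)
        for dec, f in zip(self.decoders, feats):
            q = dec(q, f.flatten(2).transpose(1, 2))
        # Dot product of refined queries with the finest feature map
        # yields per-query segmentation mask logits.
        fine = self.mask_head(feats[-1])                         # (B, D, Hf, Wf)
        return torch.einsum('bqd,bdhw->bqhw', q, fine)


# Usage with random multiscale features (coarse -> fine):
model = MultiscaleEncoderDecoderSketch()
feats = [torch.randn(2, 256, 8, 8),
         torch.randn(2, 128, 16, 16),
         torch.randn(2, 64, 32, 32)]
print(model(feats).shape)  # torch.Size([2, 8, 32, 32])
```

The sketch keeps only the structural point made in the abstract: high-level semantics are resolved against coarse features first, and the same queries are then sharpened against finer features for precise localization, with no object proposals or optical flow in the pipeline.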