A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation

dc.contributor.advisor: Wildes, Richard P.
dc.contributor.author: Karim, Rezaul
dc.date.accessioned: 2024-07-18T21:31:33Z
dc.date.available: 2024-07-18T21:31:33Z
dc.date.copyright: 2024-06-25
dc.date.issued: 2024-07-18
dc.date.updated: 2024-07-18T21:31:32Z
dc.degree.discipline: Electrical Engineering & Computer Science
dc.degree.level: Doctoral
dc.degree.name: PhD - Doctor of Philosophy
dc.description.abstract: This dissertation presents an end-to-end trainable and unified multiscale encoder-decoder transformer for dense video estimation, with a focus on segmentation. We investigate this direction by exploring unified multiscale processing throughout the pipeline of feature encoding, context encoding and object decoding in an encoder-decoder model. Correspondingly, we present a Multiscale Encoder-Decoder Video Transformer (MED-VT) that uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on additional preprocessing, such as computing object proposals or optical flow, (ii) temporal consistency at encoding and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we explore temporal consistency through a transductive learning scheme that exploits many-to-many label propagation across time. To demonstrate the applicability of the approach, we provide empirical evaluation of MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) and one multimodal task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on additional preprocessing, such as object proposals or optical flow. We also document the model’s internal learned representations through a detailed interpretability study, encompassing both quantitative and qualitative analyses.
dc.identifier.uri: https://hdl.handle.net/10315/42219
dc.language: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Computer science
dc.subject: Artificial intelligence
dc.subject: Robotics
dc.subject.keywords: Computer science
dc.subject.keywords: Computer vision
dc.subject.keywords: Deep learning
dc.subject.keywords: Robotics
dc.subject.keywords: Video understanding
dc.subject.keywords: Segmentation
dc.subject.keywords: Automatic video object segmentation
dc.subject.keywords: Actor action segmentation
dc.subject.keywords: Video semantic segmentation
dc.subject.keywords: Multimodal
dc.subject.keywords: Audio-visual segmentation
dc.subject.keywords: Sound source segmentation
dc.subject.keywords: Camouflage object segmentation
dc.subject.keywords: DAVIS 2016
dc.subject.keywords: YouTube Objects
dc.subject.keywords: MoCA
dc.subject.keywords: VSPW
dc.subject.keywords: A2D
dc.subject.keywords: AVSBench
dc.subject.keywords: Temporal consistency
dc.subject.keywords: Robustness
dc.subject.keywords: Transformers
dc.subject.keywords: Encoder-decoder
dc.subject.keywords: Feature interaction
dc.subject.keywords: Object queries
dc.subject.keywords: Model dissection
dc.subject.keywords: Interpretability
dc.subject.keywords: Explainable AI
dc.subject.keywords: Machine learning
dc.subject.keywords: CVPR
dc.subject.keywords: MED-VT
dc.title: A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation
dc.type: Electronic Thesis or Dissertation

Files

Original bundle
Name: Karim_Rezaul_2024_PhD.pdf
Size: 45.03 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.87 KB
Format: Plain Text
Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text