A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation
dc.contributor.advisor | Wildes, Richard P. | |
dc.contributor.author | Karim, Rezaul | |
dc.date.accessioned | 2024-07-18T21:31:33Z | |
dc.date.available | 2024-07-18T21:31:33Z | |
dc.date.copyright | 2024-06-25 | |
dc.date.issued | 2024-07-18 | |
dc.date.updated | 2024-07-18T21:31:32Z | |
dc.degree.discipline | Electrical Engineering & Computer Science | |
dc.degree.level | Doctoral | |
dc.degree.name | PhD - Doctor of Philosophy | |
dc.description.abstract | This dissertation presents an end-to-end trainable, unified multiscale encoder-decoder transformer for dense video estimation, with a focus on segmentation. We investigate this direction by exploring unified multiscale processing throughout the pipeline of feature encoding, context encoding and object decoding in an encoder-decoder model. Correspondingly, we present a Multiscale Encoder-Decoder Video Transformer (MED-VT) that uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on additional preprocessing, such as computing object proposals or optical flow; (ii) temporal consistency at encoding; and (iii) coarse-to-fine detection of high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we explore temporal consistency through a transductive learning scheme that exploits many-to-many label propagation across time. To demonstrate the applicability of the approach, we provide empirical evaluation of MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) and a multimodal task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on additional preprocessing, such as object proposals or optical flow. We also document the model’s internal learned representations through a detailed interpretability study, encompassing both quantitative and qualitative analyses. | |
dc.identifier.uri | https://hdl.handle.net/10315/42219 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Computer science | |
dc.subject | Artificial intelligence | |
dc.subject | Robotics | |
dc.subject.keywords | Computer science | |
dc.subject.keywords | Computer vision | |
dc.subject.keywords | Deep learning | |
dc.subject.keywords | Robotics | |
dc.subject.keywords | Video understanding | |
dc.subject.keywords | Segmentation | |
dc.subject.keywords | Automatic video object segmentation | |
dc.subject.keywords | Actor action segmentation | |
dc.subject.keywords | Video semantic segmentation | |
dc.subject.keywords | Multimodal | |
dc.subject.keywords | Audio-visual segmentation | |
dc.subject.keywords | Sound source segmentation | |
dc.subject.keywords | Camouflaged object segmentation
dc.subject.keywords | DAVIS 2016 | |
dc.subject.keywords | YouTube-Objects
dc.subject.keywords | MoCA | |
dc.subject.keywords | VSPW | |
dc.subject.keywords | A2D | |
dc.subject.keywords | AVSBench | |
dc.subject.keywords | Temporal consistency | |
dc.subject.keywords | Robustness | |
dc.subject.keywords | Transformers | |
dc.subject.keywords | Encoder-decoder | |
dc.subject.keywords | Feature interaction | |
dc.subject.keywords | Object queries | |
dc.subject.keywords | Model dissection | |
dc.subject.keywords | Interpretability | |
dc.subject.keywords | Explainable AI
dc.subject.keywords | Machine learning | |
dc.subject.keywords | CVPR | |
dc.subject.keywords | MED-VT | |
dc.title | A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation | |
dc.type | Electronic Thesis or Dissertation |