A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation
dc.contributor.advisor | Wildes, Richard P. | |
dc.contributor.author | Karim, Rezaul | |
dc.date.accessioned | 2024-07-18T21:31:33Z | |
dc.date.available | 2024-07-18T21:31:33Z | |
dc.date.copyright | 2024-06-25 | |
dc.date.issued | 2024-07-18 | |
dc.date.updated | 2024-07-18T21:31:32Z | |
dc.degree.discipline | Electrical Engineering & Computer Science | |
dc.degree.level | Doctoral | |
dc.degree.name | PhD - Doctor of Philosophy | |
dc.description.abstract | This dissertation presents an end-to-end trainable, unified multiscale encoder-decoder transformer for dense video estimation, with a focus on segmentation. We investigate this direction by exploring unified multiscale processing throughout the pipeline of feature encoding, context encoding and object decoding in an encoder-decoder model. Correspondingly, we present a Multiscale Encoder-Decoder Video Transformer (MED-VT) that uses multiscale representation throughout and employs an optional input beyond video (e.g., audio), when available, for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on additional preprocessing, such as computing object proposals or optical flow; (ii) temporal consistency at encoding; and (iii) coarse-to-fine detection of high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we explore temporal consistency through a transductive learning scheme that exploits many-to-many label propagation across time. To demonstrate the applicability of the approach, we provide empirical evaluation of MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation and Video Semantic Segmentation (VSS)) and a multimodal task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on additional preprocessing, such as object proposals or optical flow. We also document the model’s internal learned representations through a detailed interpretability study, encompassing both quantitative and qualitative analyses. | |
dc.identifier.uri | https://hdl.handle.net/10315/42219 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Computer science | |
dc.subject | Artificial intelligence | |
dc.subject | Robotics | |
dc.subject.keywords | Computer science | |
dc.subject.keywords | Computer vision | |
dc.subject.keywords | Deep learning | |
dc.subject.keywords | Robotics | |
dc.subject.keywords | Video understanding | |
dc.subject.keywords | Segmentation | |
dc.subject.keywords | Automatic video object segmentation | |
dc.subject.keywords | Actor action segmentation | |
dc.subject.keywords | Video semantic segmentation | |
dc.subject.keywords | Multimodal | |
dc.subject.keywords | Audio-visual segmentation | |
dc.subject.keywords | Sound source segmentation | |
dc.subject.keywords | Camouflaged object segmentation
dc.subject.keywords | DAVIS 2016 | |
dc.subject.keywords | YouTube-Objects
dc.subject.keywords | MoCA | |
dc.subject.keywords | VSPW | |
dc.subject.keywords | A2D | |
dc.subject.keywords | AVSBench | |
dc.subject.keywords | Temporal consistency | |
dc.subject.keywords | Robustness | |
dc.subject.keywords | Transformers | |
dc.subject.keywords | Encoder-decoder | |
dc.subject.keywords | Feature interaction | |
dc.subject.keywords | Object queries | |
dc.subject.keywords | Model dissection | |
dc.subject.keywords | Interpretability | |
dc.subject.keywords | Explainable AI
dc.subject.keywords | Machine learning | |
dc.subject.keywords | CVPR | |
dc.subject.keywords | MED-VT | |
dc.title | A Unified Multiscale Encoder-Decoder Transformer for Video Segmentation | |
dc.type | Electronic Thesis or Dissertation |