Araslanov, Nikita (2022)
Deep Visual Parsing with Limited Supervision.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00022514
Ph.D. Thesis, Primary publication, Publisher's Version
nikita_araslanov_phd_thesis.pdf (66 MB), licensed under CC BY-SA 4.0 International (Creative Commons Attribution-ShareAlike)
Item Type: Ph.D. Thesis
Type of entry: Primary publication
Title: Deep Visual Parsing with Limited Supervision
Language: English
Referees: Roth, Prof. Stefan; Vedaldi, Prof. Andrea
Date: 2022
Place of Publication: Darmstadt
Collation: xvi, 174 pages
Date of oral examination: 14 September 2022
DOI: 10.26083/tuprints-00022514
Abstract: Scene parsing entails interpreting the visual world in terms of meaningful semantic concepts. Automating such analysis with machine learning techniques is not a purely scientific endeavour: it holds transformative potential for emerging technologies, such as autonomous driving and robotics, where deploying a human expert can be economically unfeasible or hazardous. Recent methods based on deep learning have made substantial progress towards realising this potential. However, to achieve high accuracy on application-specific formulations of the scene parsing task, such as semantic segmentation, deep learning models require significant amounts of high-quality dense annotation. Obtaining such supervision with human labour is costly and time-consuming. Reducing the need for precise annotation without sacrificing model accuracy is therefore essential for deploying these models at scale.

In this dissertation, we advance towards this goal by progressively reducing the amount of required supervision in the context of semantic image segmentation, where the aim is to label every pixel in the image with its semantic category. We formulate and implement four novel deep learning techniques operating under varying levels of task supervision.

First, we develop a recurrent model for instance segmentation, which sequentially predicts one object mask at a time. Sequential models can exploit temporal context: segmenting prominent instances first may disambiguate mask prediction for harder objects (e.g. those under occlusion) later on. However, such an advantageous prediction order is typically unavailable. Our proposed actor-critic framework discovers such orderings and provides empirical accuracy benefits over a baseline without this capability.

Second, we consider weakly supervised semantic segmentation, where the model must produce object masks with only image-level labels available as training supervision. In contrast to previous works, we approach this problem with a practical single-stage model. Despite its simple design, it produces highly accurate segmentation, competitive with, and in some cases improving upon, several multi-stage methods.

Reducing the amount of supervision further, we next study unsupervised domain adaptation. In this scenario, no labels are available for real-world data; we may only use the labels of synthetically generated visual scenes. We propose a novel approach that adapts a segmentation model trained on synthetic data to unlabelled real-world images using pseudo-labels. Crucially, we construct these pseudo-labels by leveraging the equivariance of the semantic segmentation task to similarity transformations. At the time of publication, our adaptation framework achieved state-of-the-art accuracy, on some benchmarks substantially surpassing that of prior art.

Last, we present an unsupervised technique for representation learning. We define the desired representation to be useful for video object segmentation, which requires establishing dense object-level correspondences in video sequences. Learning such features efficiently in a fully convolutional regime is prone to degenerate solutions, yet our approach circumvents them with a simple and effective mechanism based on the same model equivariance to similarity transformations. We empirically show that our framework attains new state-of-the-art video segmentation accuracy at a significantly reduced computational cost.
Status: Publisher's Version
URN: urn:nbn:de:tuda-tuprints-225141
Classification DDC: 000 Generalities, computers, information > 004 Computer science
Divisions: 20 Department of Computer Science > Visual Inference
Date Deposited: 17 Oct 2022 12:03
Last Modified: 21 Oct 2022 12:47
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/22514
PPN: 500483175
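To make the first contribution concrete, the following is a minimal sketch (in PyTorch) of sequential instance segmentation as the abstract describes it: one mask is predicted per step, and each step is conditioned on the pixels already explained. This is not the thesis architecture; the network, halting rule, and thresholds are hypothetical, and the actor-critic mechanism that discovers the prediction order is deliberately omitted.

```python
import torch
import torch.nn as nn

class RecurrentSegmenter(nn.Module):
    """Hypothetical one-mask-per-step segmenter (illustrative only)."""
    def __init__(self, channels=16):
        super().__init__()
        # Input: RGB image plus a 1-channel canvas of pixels explained so far.
        self.encoder = nn.Conv2d(3 + 1, channels, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel mask logits
        self.stop_head = nn.Linear(channels, 1)                 # "no objects left" score

    def forward(self, image, canvas):
        h = torch.relu(self.encoder(torch.cat([image, canvas], dim=1)))
        return self.mask_head(h), self.stop_head(h.mean(dim=(2, 3)))

@torch.no_grad()
def segment_sequentially(model, image, max_instances=8):
    """Greedy inference loop; assumes batch size 1."""
    canvas = torch.zeros_like(image[:, :1])      # nothing segmented yet
    masks = []
    for _ in range(max_instances):
        mask_logits, stop_logit = model(image, canvas)
        if torch.sigmoid(stop_logit).item() > 0.5:
            break                                # model chose to halt
        mask = (torch.sigmoid(mask_logits) > 0.5).float()
        masks.append(mask)
        canvas = (canvas + mask).clamp(max=1.0)  # mark these pixels as explained
    return masks

# Example: run the loop on a random 3x64x64 image.
masks = segment_sequentially(RecurrentSegmenter(), torch.rand(1, 3, 64, 64))
```

In the dissertation, the order in which instances are segmented is itself learned with an actor-critic objective; this sketch simply consumes whatever order the mask head happens to produce.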
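The abstract invokes the same mechanism twice: the equivariance of segmentation to similarity transformations. Below is a minimal sketch of how an equivariance-based pseudo-labelling step could look, assuming a PyTorch segmentation model that returns per-class logits of shape (B, C, H, W). The `apply_similarity` helper, the two-view fusion rule, and the confidence threshold are illustrative assumptions, not the thesis's actual procedure.

```python
import math
import torch
import torch.nn.functional as F

def apply_similarity(x, angle, scale):
    """Warp a batch of maps by a similarity transform (rotation + isotropic scale).
    Composing (angle, scale) with (-angle, 1/scale) returns the original map,
    up to interpolation and border effects."""
    cos, sin = math.cos(angle) * scale, math.sin(angle) * scale
    theta = torch.tensor([[cos, -sin, 0.0], [sin, cos, 0.0]], device=x.device)
    theta = theta.unsqueeze(0).repeat(x.size(0), 1, 1)          # (B, 2, 3)
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

@torch.no_grad()
def make_pseudo_labels(model, image, angle=0.3, scale=1.2,
                       conf_thresh=0.9, ignore_index=255):
    # Equivariance: segmenting a transformed image should equal transforming
    # the segmentation. Predict on both views, map them into the same frame,
    # fuse, and keep only confident pixels as pseudo-labels.
    probs = model(image).softmax(dim=1)
    probs_aug = model(apply_similarity(image, angle, scale)).softmax(dim=1)
    probs_aug = apply_similarity(probs_aug, -angle, 1.0 / scale)  # undo the transform
    fused = 0.5 * (probs + probs_aug)
    conf, labels = fused.max(dim=1)
    labels[conf < conf_thresh] = ignore_index   # low-confidence pixels are ignored
    return labels
```

Per the abstract, the same consistency idea, applied to dense features rather than class probabilities, also underpins the representation-learning contribution for video object segmentation: agreement between the two views serves as supervision obtained for free.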