Stelzner, Karl (2023)
Elements of Unsupervised Scene Understanding: Objectives, Structures, and Modalities.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00026355
Ph.D. Thesis, Primary publication, Publisher's Version
Full text: dissertation_stelzner.pdf (9 MB)
Copyright Information: CC BY-SA 4.0 International - Creative Commons, Attribution ShareAlike
| Item Type: | Ph.D. Thesis |
|---|---|
| Type of entry: | Primary publication |
| Title: | Elements of Unsupervised Scene Understanding: Objectives, Structures, and Modalities |
| Language: | English |
| Referees: | Kersting, Prof. Dr. Kristian; Kosiorek, PhD Adam R.; Vergari, Prof. Dr. Antonio |
| Date: | 13 December 2023 |
| Place of Publication: | Darmstadt |
| Collation: | xv, 150 pages |
| Date of oral examination: | 21 November 2023 |
| DOI: | 10.26083/tuprints-00026355 |
| Abstract: | Enabling robust interactions between automated systems and the real world is a major goal of artificial intelligence. A key ingredient towards this goal is scene understanding: the ability to process visual imagery into a concise representation of the depicted scene, including the identity, position, and geometry of objects. While supervised deep learning approaches have proven effective at processing visual inputs, the cost of supplying human annotations for training quickly becomes infeasible as the diversity of the inputs and the required level of detail increase, putting full real-world scene understanding out of reach. For this reason, this thesis investigates unsupervised methods for scene understanding. In particular, we utilize generative models with structured latent variables to facilitate the learning of object-based representations. We start our investigation in an autoencoding setting, where we highlight the capability of such systems to identify objects without human supervision, as well as the advantages of integrating tractable components within them. At the same time, we identify some limitations of this setting, which prevent success in more visually complex environments. Based on this, we then turn to video data, where we leverage the prediction of dynamics both to regularize the representation learning task and to enable applications to reinforcement learning. Finally, to take another step towards a real-world setting, we investigate the learning of representations encoding 3D geometry. We discuss various methods to encode and learn about 3D scene structure, and present a model which simultaneously infers the geometry of a given scene and segments it into objects. We conclude by discussing future challenges and lessons learned. In particular, we touch on the challenge of modelling uncertainty when inferring 3D geometry, the tradeoffs between various data sources, and the cost of including model structure. |
| Status: | Publisher's Version |
| URN: | urn:nbn:de:tuda-tuprints-263552 |
| Classification DDC: | 000 Generalities, computers, information > 004 Computer science |
| Divisions: | 20 Department of Computer Science > Artificial Intelligence and Machine Learning |
| Date Deposited: | 13 Dec 2023 13:02 |
| Last Modified: | 15 Dec 2023 10:21 |
| URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/26355 |
| PPN: | 514077468 |
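
The abstract refers to generative models with structured latent variables that decompose a scene into object-based representations. As an illustrative sketch only, and not the architecture used in the thesis, the following minimal PyTorch module shows one common way such object "slots" can be inferred from image features via iterative attention (in the style of Slot Attention); all class names, dimensions, and hyperparameters here are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn


class SlotAttention(nn.Module):
    """Illustrative slot-attention-style grouping module (assumption, not the thesis's model).

    Maps a set of image features to a fixed number of object "slots" by letting
    the slots compete for input features over a few attention iterations.
    """

    def __init__(self, num_slots=4, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.dim, self.iters = num_slots, dim, iters
        self.scale = dim ** -0.5
        # Learned Gaussian from which the initial slots are sampled.
        self.slot_mu = nn.Parameter(torch.zeros(1, 1, dim))
        self.slot_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (batch, num_inputs, dim) flattened image features.
        b = feats.shape[0]
        feats = self.norm_inputs(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slot_mu + self.slot_logsigma.exp() * torch.randn(
            b, self.num_slots, self.dim, device=feats.device
        )
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis makes slots compete for each input.
            attn = torch.softmax(
                torch.einsum("bkd,bnd->bkn", q, k) * self.scale, dim=1
            )
            attn = attn / attn.sum(dim=-1, keepdim=True)  # normalize per slot
            updates = torch.einsum("bkn,bnd->bkd", attn, v)
            slots = self.gru(
                updates.reshape(-1, self.dim), slots.reshape(-1, self.dim)
            ).reshape(b, self.num_slots, self.dim)
        return slots  # one latent vector per putative object


# Example usage: infer 4 slots from hypothetical 16x16 CNN feature maps.
features = torch.randn(2, 16 * 16, 64)
slots = SlotAttention()(features)  # shape: (2, 4, 64)
```

In an unsupervised autoencoding setup of this kind, a decoder would reconstruct the image from the slots, so the reconstruction loss alone drives the discovery of objects, which is the general idea the abstract describes.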