Mahajan, Shweta (2022)
Multimodal Representation Learning for Diverse Synthesis with Deep Generative Models.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00021651
Ph.D. Thesis, Primary publication, Publisher's Version
Text: mahajan-phd.pdf (57 MB)
Copyright Information: CC BY-SA 4.0 International - Creative Commons, Attribution ShareAlike
Item Type: Ph.D. Thesis
Type of entry: Primary publication
Title: Multimodal Representation Learning for Diverse Synthesis with Deep Generative Models
Language: English
Referees: Roth, Prof. Stefan; Schwing, Prof. Dr. Alexander
Date: 2022
Place of Publication: Darmstadt
Collation: xxi, 191 pages
Date of oral examination: 13 June 2022
DOI: 10.26083/tuprints-00021651
Abstract:

One of the key factors driving the success of machine learning for scene understanding is the development of data-driven approaches that can extract information automatically from the vast expanse of available data. Multimodal representation learning has emerged as one of the most demanding areas, aiming to draw meaningful information from the input data and achieve human-like performance. The challenges in learning representations can be ascribed to the heterogeneity of the available datasets, where the information comes from various modalities or domains, such as visual signals in the form of images and videos or textual signals in the form of sentences. Moreover, one encounters far more unlabeled data in the form of highly multimodal, complex image distributions.

In this thesis, we advance the field of multimodal representation learning for diverse synthesis, with applications in vision and language as well as complex imagery. We take a probabilistic approach and leverage deep generative models to capture the multimodality of the underlying true data distribution, offering the strong advantage of learning from unlabeled data. To this end, in the first part, we focus on cross-domain data of images and text. We develop joint deep generative frameworks to encode the joint representations of the two distributions, which follow distinct generative processes. The latent spaces are structured to encode the semantic information available from the paired training data as well as the domain-specific variations in the data. Furthermore, we introduce intricate data-dependent priors to capture the multimodality of the two distributions. The benefits of the presented frameworks are manifold. The semi-supervised techniques preserve the structural information of the input representations in each modality, with the potential to include information that may be missing in other modalities, resulting in embeddings that generalize across datasets. The approaches consequently resolve the ambiguities of the joint distribution and allow for many-to-many mappings. We also introduce a novel factorization of the latent space that encodes contextual information independently of the object information. This factorization can leverage diverse contextual descriptions from the annotations of images that share similar contextual information, leading to an enriched multimodal latent space and thus increased diversity in the generated captions.

Perception plays a vital role in human understanding of the environment. As image data becomes abundant and complex, it is inevitable for AI systems to learn the underlying structure of these multimodal distributions for general scene understanding. Even though popular deep generative models for image distributions, such as GANs and VAEs, have made advancements, there are still gaps in capturing the underlying true data distribution. GANs are not designed to provide density estimates, and VAEs only approximate the underlying data-generating distribution with intractable likelihoods, posing challenges in both training and inference. To resolve these limitations, in the second part of the thesis, we construct powerful normalizing flow and autoregressive approaches for image distributions. Normalizing flows and autoregressive generative methods belong to the class of exact-inference models that optimize the exact log-likelihood of the data.

Our first approach enhances the representational power of flow-based models, which is constrained by the invertibility of the flow layers, by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors. The carefully designed prior better captures dependencies in complex multimodal data and achieves state-of-the-art density estimation results as well as improved image generation quality. Our second method concentrates on autoregressive models with their highly flexible functional forms. The sequential ordering of the dimensions, however, makes these models computationally expensive. To address this, we propose a block-autoregressive approach employing a lossless pyramid decomposition with scale-specific representations. The resulting sparse dependency structure makes it easier to encode the joint distribution of image pixels. Our approach yields state-of-the-art results for density estimation on various image datasets, especially for high-resolution data, and exhibits sampling speeds superior even to easily parallelizable flow-based models.
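As a hedged illustration of the variational setup behind the joint image-text frameworks summarized in the abstract (a generic two-modality evidence lower bound with assumed symbols x for an image, y for a caption, and a shared latent z; not the thesis's exact objective):

$$
\log p_\theta(x, y) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(x \mid z) + \log p_\theta(y \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x, y) \,\|\, p(z)\right)
$$

Replacing the fixed prior p(z) with a learned, data-dependent prior is one way to capture the multimodality the abstract refers to.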
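The exact-inference models in the second part build on the change-of-variables identity; sketched here only in generic notation (an invertible flow f and a channel-wise autoregressive prior over the latent z, as an illustration rather than the thesis's formulation):

$$
\log p(x) \;=\; \log p(z) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|, \qquad z = f(x), \qquad p(z) = \prod_{c=1}^{C} p\!\left(z_c \mid z_{<c}\right)
$$

The autoregressive factorization of p(z) adds dependencies across latent channels without affecting the invertibility of the flow layers themselves.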
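The block-autoregressive method relies on a lossless pyramid decomposition with scale-specific representations. The snippet below is a minimal sketch under simplifying assumptions (a plain polyphase subsampling in PyTorch, not the decomposition used in the thesis; the function names are made up for illustration): each level splits an image into a 2x-subsampled base and three detail components, and the original is recovered exactly.

```python
import torch

def pyramid_decompose(x, levels=3):
    """Losslessly split an image into a coarse base plus per-scale detail tensors."""
    details = []
    for _ in range(levels):
        # Four polyphase components of each 2x2 block; 'base' is the top-left subsample.
        base = x[:, :, 0::2, 0::2]
        detail = torch.cat([x[:, :, 0::2, 1::2],
                            x[:, :, 1::2, 0::2],
                            x[:, :, 1::2, 1::2]], dim=1)
        details.append(detail)
        x = base
    return x, details

def pyramid_reconstruct(base, details):
    """Exactly invert pyramid_decompose by re-interleaving the polyphase components."""
    x = base
    for detail in reversed(details):
        n, c, h, w = x.shape
        tr, bl, br = detail.chunk(3, dim=1)
        out = x.new_empty(n, c, 2 * h, 2 * w)
        out[:, :, 0::2, 0::2] = x
        out[:, :, 0::2, 1::2] = tr
        out[:, :, 1::2, 0::2] = bl
        out[:, :, 1::2, 1::2] = br
        x = out
    return x

img = torch.rand(2, 3, 32, 32)                               # toy batch of RGB images
base, details = pyramid_decompose(img)                       # base: 2x3x4x4, details at 3 scales
assert torch.equal(pyramid_reconstruct(base, details), img)  # decomposition is lossless
```

One would then model the coarse base first and each detail tensor conditioned only on coarser scales, which gives the kind of sparse dependency structure mentioned in the abstract.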
Status: Publisher's Version
URN: urn:nbn:de:tuda-tuprints-216515
Classification DDC: 000 Generalities, computers, information > 004 Computer science
Divisions: 20 Department of Computer Science > Visual Inference
Date Deposited: 22 Jul 2022 12:35
Last Modified: 07 Dec 2022 10:36
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/21651
PPN: 497916320