Rücklé, Andreas (2021)
Representation Learning and Learning from Limited Labeled Data for Community Question Answering.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00018508
Ph.D. Thesis, Primary publication, Publisher's Version
Text: representation-learning-and-learning-from-limited-labeled-data-for-cqa.pdf
Copyright Information: CC BY-ND 4.0 International - Creative Commons, Attribution NoDerivs.
Item Type: | Ph.D. Thesis | ||||
---|---|---|---|---|---|
Type of entry: | Primary publication | ||||
Title: | Representation Learning and Learning from Limited Labeled Data for Community Question Answering | ||||
Language: | English | ||||
Referees: | Gurevych, Prof. Dr. Iryna ; Berant, Prof. Dr. Jonathan ; Glavaš, Prof. Dr. Goran | ||||
Date: | 2021 | ||||
Place of Publication: | Darmstadt | ||||
Collation: | xi, 214 pages | ||||
Date of oral examination: | 12 April 2021 | ||||
DOI: | 10.26083/tuprints-00018508 | ||||
Abstract: | The amount of information published on the Internet is growing steadily. Accessing this vast body of knowledge more effectively is a fundamental goal of many tasks in natural language processing. In this thesis, we address this challenge from the perspective of community question answering by leveraging data from web forums and Q&A communities to find and identify answers for given questions automatically. More precisely, we are concerned with fundamental challenges that arise from this setting, broadly categorized into (1) obtaining better text representations and (2) dealing with scenarios where we have little or no labeled training data. We first study attention mechanisms for learning representations of questions and answers to compare them efficiently and effectively. A limitation of previous approaches is that they leverage question information when learning answer representations. This procedure of dependent encoding requires us to obtain separate answer representations for each question, which is inefficient. To remedy this, we propose a self-attentive model that does not suffer from this drawback. We show that our model achieves on-par or better performance for answer selection tasks compared to other approaches while allowing us to encode questions and answers independently. Due to the importance of attention mechanisms, we present a framework to effortlessly transform answer selection models into prototypical question answering systems for the interactive inspection and side-by-side comparison of attention weights. Besides purely monolingual approaches, we study how to transfer text representations across languages. A popular concept for obtaining universally re-usable representations is that of sentence embeddings. Previous work studied them either only monolingually or cross-lingually for only a few individual datasets. 
We go beyond this by studying universal cross-lingual sentence embeddings, which are re-usable across many different classification tasks and across languages. Our training-free approach generalizes the concept of average word embeddings by concatenating different kinds of word embeddings and by computing several generalized means. Due to its simplicity, we can effortlessly extend our approach to new languages by incorporating cross-lingual word embeddings. We show that our sentence embeddings outperform more complex techniques monolingually on nine tasks and achieve the best results cross-lingually for the transfer from English to German and French. We complement this by studying an orthogonal approach where we machine translate the input from German to English and continue monolingually. We investigate the impact of a standard neural machine translation model on the performance of models for determining question similarity in programming and operating systems forums. We highlight that translation mistakes can have a substantial performance impact, and we mitigate this by adapting our machine translation models to these specialized domains using back-translation. In the second part, we study monolingual scenarios with (a) little labeled data, (b) only unlabeled data, and (c) no target dataset information. These are critical challenges in our setting because a large number of web forums contain only a few labeled question-answer pairs and no labeled similar questions. One approach to generalize from small training data is to use simple models with few trainable layers. We present COALA, a shallow task-specific network architecture specialized in answer selection, containing only one trainable layer. This layer learns representations of word n-grams in questions and answers, which we compare and aggregate for scoring. 
Our approach improves upon a more complex compare-aggregate architecture by 4.5 percentage points on average, across six datasets with small training data. Moreover, it outperforms standard IR baselines with as few as 25 labeled instances. The standard method for training models to determine question similarity requires labeled question pairs, which do not exist for many forums. Therefore, we investigate alternatives such as self-supervised training with question title-body information, and we propose duplicate question generation. By leveraging larger amounts of unlabeled data, we show that both methods can achieve substantial improvements over adversarial domain transfer and outperform supervised in-domain training on two datasets. We find that duplicate question generation transfers well to unseen domains, and that we can leverage self-supervised training to obtain suitable answer selection models based on state-of-the-art pre-trained transformers. Finally, we argue that it can be prohibitive to train separate specialized models for each forum. It is desirable to obtain one model that generalizes well to several unseen scenarios. Towards this goal, we broadly study the zero-shot transfer capabilities of text matching models in community question answering. We train 140 models with self-supervised training signals on different forums and transfer them to nine evaluation datasets of question similarity and answer selection tasks. We find that the large majority of models generalize surprisingly well, and in six cases, all models outperform standard IR baselines. Our analyses reveal that considering a broad selection of source domains is crucial because the best zero-shot transfer performance often correlates with neither domain similarity nor training data size. We investigate different combination techniques and propose incorporating self-supervised and supervised multi-task learning with data from all source forums. 
Our best model for zero-shot transfer, MultiCQA, outperforms in-domain models on six datasets even though it has not seen target-domain data during training. |
||||
Status: | Publisher's Version | ||||
URN: | urn:nbn:de:tuda-tuprints-185080 | ||||
Classification DDC: | 004 Computer science (ddc_dnb_004) | ||||
Divisions: | fb20_uw | ||||
Date Deposited: | 28 Jun 2021 09:11 | ||||
Last Modified: | 28 Jun 2021 09:11 | ||||
URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/18508 | ||||
PPN: | 483225991 | ||||
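
The abstract describes training-free sentence embeddings that generalize average word embeddings by concatenating different kinds of word embeddings and computing several generalized means. The following is a minimal sketch of that idea, not the thesis implementation. The function names, the toy embedding tables, and the choice of powers (arithmetic mean, maximum, minimum) are assumptions made for illustration only.

```python
import math


def power_mean(values, p):
    """Generalized (power) mean of a list of floats.

    p = 1 gives the arithmetic mean, p = +inf the maximum, and
    p = -inf the minimum. Other values of p must be nonzero integers
    in this sketch, so that negative inputs do not yield complex numbers.
    """
    if p == 0:
        raise ValueError("p must be nonzero")
    if p == math.inf:
        return max(values)
    if p == -math.inf:
        return min(values)
    mean_of_powers = sum(v ** p for v in values) / len(values)
    # Signed root, so that odd p stays well-defined for negative means.
    return math.copysign(abs(mean_of_powers) ** (1.0 / p), mean_of_powers)


def sentence_embedding(tokens, embedding_tables,
                       powers=(1, math.inf, -math.inf)):
    """Concatenate power means over every embedding table.

    embedding_tables is a list of dicts mapping token -> vector
    (hypothetical stand-ins for different kinds of word embeddings).
    """
    parts = []
    for table in embedding_tables:
        vectors = [table[t] for t in tokens if t in table]
        if not vectors:
            raise ValueError("no token is covered by this embedding table")
        dim = len(vectors[0])
        for p in powers:
            # One power mean per dimension, computed across all tokens.
            parts.extend(
                power_mean([vec[d] for vec in vectors], p) for d in range(dim)
            )
    return parts


# Toy data (invented): two embedding tables of different dimensionality.
table_a = {"how": [1.0, -2.0], "install": [3.0, 0.0]}
table_b = {"how": [0.5], "install": [-0.5]}

embedding = sentence_embedding(["how", "install"], [table_a, table_b])
print(len(embedding))  # (2 + 1) dimensions x 3 power means = 9
```

Because no parameters are trained, supporting another language in a sketch like this would only require adding an embedding table in a shared cross-lingual space, which is in line with the simplicity argument made in the abstract.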