Tauchmann, Christopher (2021)
Advanced Corpus Annotation Strategies for NLP. Applications in Automatic Summarization and Text Classification.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00017576
Ph.D. Thesis, Primary publication, Publisher's Version
Text: PhDThesis_ChristopherTauchmann.pdf
Copyright Information: CC BY-SA 4.0 International - Creative Commons, Attribution ShareAlike.
Item Type: | Ph.D. Thesis | ||||
---|---|---|---|---|---|
Type of entry: | Primary publication | ||||
Title: | Advanced Corpus Annotation Strategies for NLP. Applications in Automatic Summarization and Text Classification | ||||
Language: | English | ||||
Referees: | Kersting, Prof. Dr. Kristian ; Mieskes, Prof. Dr. Margot | ||||
Date: | 2021 | ||||
Place of Publication: | Darmstadt | ||||
Collation: | x, 179 pages | ||||
Date of oral examination: | 5 February 2021 | ||||
DOI: | 10.26083/tuprints-00017576 | ||||
Abstract: | Natural Language Processing (NLP) methods demand elaborate strategies for the creation of corpora that are fundamental to well-working NLP systems. In this thesis, we present different corpus creation strategies and application scenarios for different NLP tasks and show how they can benefit a task. One focus lies on automatic summarization and summary evaluation, and the other on corpus creation for text classification tasks. To this end, in the first part of the thesis we provide the necessary background on corpus annotation: Chapter 2 details research on corpus annotation theory and annotation practices in different disciplines such as Corpus Linguistics and Computational Linguistics/NLP. It also introduces the crowdsourcing approach to language annotations. Chapter 3 shows how different annotator populations annotate datasets with different annotation strategies that combine human and machine input. Chapter 4 provides the background and a historical overview of the foundations of automatic summarization and summary evaluation. We show that automatic summarization is a challenging NLP task and highlight the limiting research focus on short English newswire datasets, which can lead to rather skewed results. The second part deals with specific application scenarios in automatic summarization and summary evaluation. Chapter 5 describes the creation of a hierarchical summarization dataset. This dataset addresses two limitations in research: the focus on news datasets is broadened to include heterogeneous documents, and the source documents for the summaries are longer. Our research makes use of both crowdworkers and expert annotators and shows how the strengths of both populations can be meaningfully combined in a larger corpus.
Chapter 6 shows how research can benefit from extending an existing heterogeneous summarization corpus from the educational domain with a range of further topics from this domain. Furthermore, we introduce an evaluation of summarization difficulty using heterogeneity estimators based on measures from information theory and cosine similarity. Chapter 7 outlines the creation of a summary evaluation corpus with annotations of a content-based evaluation metric, the Pyramid method. We apply an existing automatic method to create the Pyramids on the same corpus and show that they correspond well to manual expert Pyramids. In the third part, the focus lies on general corpus creation, illustrated by two other tasks, both of which are machine learning (ML)-oriented. Chapter 8 describes a crowdsourcing method to annotate items based on measuring input data complexity with measures from language learning, NLP, and information theory. We create different subsets of data that also serve to train and filter crowdworkers. We test the method on an existing three-class sentence classification dataset from argument mining and show that our method needs fewer annotators to achieve the same inter-annotator agreement as randomly distributed dataset portions. Chapter 9 presents the creation of a dataset that includes discourse conventions in texts from the social sciences that concern the topic of Artificial Intelligence (AI). The dataset consists of subsets of data from different domains: software development, research paper abstracts, and online discussions. We annotate the dataset with expert active learning, where the ML model “asks” for annotations on certain items. Moreover, we evaluate the conventions that an ML model predicts and explain why these conventions can be detected correctly by the model. |
||||
Status: | Publisher's Version | ||||
URN: | urn:nbn:de:tuda-tuprints-175768 | ||||
Classification DDC: | 000 Generalities, computers, information > 004 Computer science | ||||
Divisions: | 20 Department of Computer Science > Artificial Intelligence and Machine Learning | ||||
Date Deposited: | 22 Oct 2021 07:09 | ||||
Last Modified: | 22 Oct 2021 07:09 | ||||
URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/17576 | ||||
PPN: | 487412036 | ||||