Klie, Jan-Christoph (2024)
Improving Natural Language Dataset Annotation Quality and Efficiency.
Technische Universität Darmstadt
doi: 10.26083/tuprints-00026580
Ph.D. Thesis, Primary publication, Publisher's Version
dissertation_jck_final_20240502.pdf (6 MB). Copyright Information: CC BY-NC-ND 4.0 International (Creative Commons Attribution-NonCommercial-NoDerivatives).
| Item Type: | Ph.D. Thesis |
|---|---|
| Type of entry: | Primary publication |
| Title: | Improving Natural Language Dataset Annotation Quality and Efficiency |
| Language: | English |
| Referees: | Gurevych, Prof. Dr. Iryna ; Webber, Prof. Ph.D. Bonnie |
| Date: | 7 June 2024 |
| Place of Publication: | Darmstadt |
| Collation: | xi, 242 pages |
| Date of oral examination: | 18 April 2024 |
| DOI: | 10.26083/tuprints-00026580 |
| Abstract: | Annotated data is essential in many scientific disciplines, including natural language processing, linguistics, language acquisition research, bioinformatics, healthcare, and the digital humanities. Datasets are used to train and evaluate machine learning models, to derive new knowledge, and to suggest appropriate revisions to existing theories. Especially in machine learning, large, high-quality datasets play a crucial role in advancing the field and evaluating new approaches. Two central topics arise when creating these datasets: annotation efficiency and annotation quality. This thesis improves on both. While annotated data is fundamental and sought after, creating it via manual annotation is expensive, time-consuming, and often requires experts. It is therefore highly desirable to reduce the cost and increase the speed of data annotation, two significant aspects of annotation efficiency. In this thesis, we hence propose several ways of improving annotation efficiency, including human-in-the-loop label suggestions, interactive annotator training, and community annotation. To train well-performing models and to evaluate them accurately, the data itself needs to be of the highest quality. Errors in a dataset can degrade downstream task performance and lead to biased or even harmful predictions. In addition, when erroneous data is used to evaluate or compare model architectures, algorithms, training regimes, or other scientific contributions, the relative ranking of their performance might change; dataset errors can thus cause incorrect conclusions to be drawn. The focus of most machine learning work is on developing new models and methods; data quality is often overlooked. To alleviate quality issues, this thesis presents two contributions to improving annotation quality. First, we analyze best practices of annotation quality management, examine how it is conducted in practice, and derive recommendations for future dataset creators on how to structure the annotation process and manage quality. Second, we survey the field of automatic annotation error detection, formalize it, and re-implement and study the effectiveness of the most commonly used methods. Based on extensive experiments, we provide insights and recommendations regarding which methods should be used in which context. |
| Status: | Publisher's Version |
| URN: | urn:nbn:de:tuda-tuprints-265805 |
| Classification DDC: | 000 Generalities, computers, information > 004 Computer science |
| Divisions: | 20 Department of Computer Science > Ubiquitous Knowledge Processing |
| TU-Projects: | DFG\|GU798/21-1\|Infrastruktur für in ; DFG\|EC503/1-1\|Infrastruktur für in |
| Date Deposited: | 07 Jun 2024 12:07 |
| Last Modified: | 10 Jun 2024 05:31 |
| URI: | https://tuprints.ulb.tu-darmstadt.de/id/eprint/26580 |
| PPN: | 518988678 |