2021
Secondary publication
Article
Publisher's version

Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

File(s)
Main publication
s13042-022-01553-3.pdf
CC BY 4.0 International
Format: Adobe PDF
Size: 1.24 MB
TUDa URI
tuda/9350
URN
urn:nbn:de:tuda-tuprints-221643
DOI
10.26083/tuprints-00022164
Authors
Bayer, Markus ORCID 0000-0002-2040-5609
Kaufhold, Marc-André ORCID 0000-0002-0387-9597
Buchhold, Björn
Keller, Marcel
Dallmeyer, Jörg
Reuter, Christian ORCID 0000-0003-1920-038X
Abstract

Research suggests that in many machine learning settings, the development of training data can matter more than the choice and modelling of the classifiers themselves. Data augmentation methods have therefore been developed to improve classifiers with artificially created training data. In NLP, the challenge is to establish universal rules for text transformations that provide new linguistic patterns. In this paper, we present and evaluate a text generation method suited to increasing the performance of classifiers for long and short texts. Enhancing the training data with our text generation method yielded promising improvements on both short and long text tasks. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As such constructed regimes are not universally applicable, we also show major improvements in several real-world low data tasks (up to +4.84 F1-score). Since we evaluate the method from many perspectives (11 datasets in total), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach to different types of datasets.

Keywords

Textual data augmenta...

Small text data analy...

Text generation

Long and short text c...

Language
English
Department/Field
20 Department of Computer Science > Science and Technology for Peace and Security (PEASEC)
Research and xchange profile
Research fields > Information and Intelligence > Cybersecurity & Privacy
DDC
000 Generalities, computer science, information science > 004 Computer science
Institution
Universitäts- und Landesbibliothek Darmstadt
Place
Darmstadt
Journal / series title
International Journal of Machine Learning and Cybernetics
ISSN
1868-808X
Publisher
Springer
Year of first publication
2021
Publisher DOI
10.1007/s13042-022-01553-3
PPN
506942228
