2023
Secondary publication
Article
Publisher's version

Label modification and bootstrapping for zero-shot cross-lingual hate speech detection

File(s)
Main publication
s10579-023-09637-4.pdf
CC BY 4.0 International
Format: Adobe PDF
Size: 1.12 MB
TUDa URI
tuda/12510
URN
urn:nbn:de:tuda-tuprints-284293
DOI
10.26083/tuprints-00028429
Authors
Bigoulaeva, Irina ORCID 0000-0002-6955-981X
Hangya, Viktor ORCID 0000-0002-5144-3069
Gurevych, Iryna ORCID 0000-0003-2187-7621
Fraser, Alexander ORCID 0000-0003-4891-682X
Abstract

The goal of hate speech detection is to filter negative online content that targets certain groups of people. Because social media platforms are easily accessible and multilingual, it is crucial to protect everyone, which requires building hate speech detection systems for a wide range of languages. However, the available labeled hate speech datasets are limited, making it difficult to build systems for many languages. In this paper, we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages, while highlighting label issues across application scenarios, such as inconsistent label sets of corpora or differing hate speech definitions, which hinder the application of such methods. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language, which lacks labeled examples, and show that good performance can be achieved. We then incorporate unlabeled target language data for further model improvements by bootstrapping labels using an ensemble of different model architectures. Furthermore, we investigate the issue of label imbalance in hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance. We test simple data undersampling and oversampling techniques and show their effectiveness.
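
The abstract mentions simple undersampling and oversampling without stating the exact procedure, so the following is only a minimal sketch of generic random resampling, not the authors' implementation. The function name `resample`, its parameters, and the toy data are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def resample(examples, labels, mode="over", seed=0):
    """Balance a labeled dataset by random over- or undersampling.

    Hypothetical helper (not from the paper):
    mode="over"  duplicates random minority examples up to the largest class size;
    mode="under" keeps a random subset of each class at the smallest class size.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)

    sizes = [len(xs) for xs in by_label.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    balanced = []
    for y, xs in by_label.items():
        if mode == "over":
            # Keep every original example, then draw extra copies with replacement.
            picked = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        else:
            # Draw a subset without replacement.
            picked = rng.sample(xs, target)
        balanced.extend((x, y) for x in picked)

    rng.shuffle(balanced)
    return balanced


# Toy data (assumed): six non-hate examples and two hate examples.
texts = ["n1", "n2", "n3", "n4", "n5", "n6", "h1", "h2"]
tags = ["non-hate"] * 6 + ["hate"] * 2

oversampled = resample(texts, tags, mode="over")
undersampled = resample(texts, tags, mode="under")
```

With this toy data, oversampling yields six examples per class and undersampling yields two per class. Resampling of this kind is normally applied to the training split only, so that evaluation data keeps its original label distribution.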

Keywords

Hate speech

Cross-lingual transfe...

Class imbalance

BERT

CNN

LSTM

Language
English
Department/Field
20 Department of Computer Science > Ubiquitous Knowledge Processing
DDC
000 Generalities, computer science, information science > 004 Computer science
Institution
Universitäts- und Landesbibliothek Darmstadt
Place
Darmstadt
Journal / series title
Language Resources and Evaluation
Start page
1515
End page
1546
Journal volume
57
Journal issue number
4
ISSN
1574-0218
Publisher
Springer Netherlands
Place of first publication
Dordrecht
Year of first publication
2023
Publisher DOI
10.1007/s10579-023-09637-4
PPN
542355213
