2025
First publication
Book
Publisher's version

Grouping the Unstructured – A Comparison of Methods for Unsupervised Document Clustering of a Specialised Corpus

File(s)
Main publication
Schlander_Masterarbeit_Evolving_Scholarship_vol009.pdf
CC BY 4.0 International
Format: Adobe PDF
Size: 13.65 MB
TUDa URI
tuda/14363
URN
urn:nbn:de:tuda-tuprints-311197
DOI
10.26083/tuprints-00031119
Author
Schlander, Anna
Abstract

The rapid growth of digital corpora creates a need for methods that automatically organise large collections of domain-specific texts into meaningful, interpretable groups. This study evaluates the effectiveness of document vectorisation combined with clustering for this purpose. It compares three prominent embedding approaches (Word2Vec, FastText, and Sentence-BERT) in combination with three clustering algorithms (k-means, DBSCAN, and hierarchical agglomerative clustering). The corpus processed in this work is the PubMed Abstracts corpus, which consists of academic abstracts from the field of neuroscience.
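As a rough illustration of what "vectorisation combined with clustering" can look like, the following minimal sketch mean-pools word vectors into document vectors and groups them with a plain k-means. It is not taken from the thesis: the toy vectors in `WORD_VECTORS`, the helper names `document_vector` and `kmeans`, and the four sample documents are all hypothetical, and mean-pooling is only one common way to turn word embeddings such as Word2Vec or FastText into document vectors. Sentence-level models such as Sentence-BERT yield one vector per text directly, and DBSCAN or agglomerative clustering would replace the k-means step.

```python
import math

# Hypothetical 2-D toy word vectors; in practice a trained embedding
# model would supply vectors with hundreds of dimensions.
WORD_VECTORS = {
    "neuron": (1.0, 0.1),
    "synapse": (0.9, 0.2),
    "cortex": (0.8, 0.0),
    "trial": (0.1, 1.0),
    "placebo": (0.0, 0.9),
    "dose": (0.2, 0.8),
}


def document_vector(tokens):
    """Mean-pool the vectors of all known tokens into one document vector."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        raise ValueError("no known tokens in document")
    dims = len(known[0])
    return tuple(sum(v[d] for v in known) / len(known) for d in range(dims))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids
    so that the result is deterministic (an assumption of this sketch)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


docs = [
    ["neuron", "synapse"],
    ["trial", "placebo"],
    ["cortex", "neuron"],
    ["dose", "trial"],
]
vectors = [document_vector(d) for d in docs]
labels = kmeans(vectors, k=2)
print(labels)
```

With these toy inputs the first and third documents end up in one cluster and the second and fourth in the other. Real corpora need tokenisation, handling of out-of-vocabulary words, and a more careful centroid initialisation than this sketch uses.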

The study examines the characteristics, pitfalls and specific advantages of the vectorisation and clustering methods. Combining vectorisation and clustering with methods from corpus linguistics and statistics further allows us to seek and identify the "linguistic triggers" that lead to specific behaviour of embeddings and clustering algorithms.

A qualitative analysis framework is applied to assess cluster coherence and interpretability. Quantitative measures are presented alongside visual analyses of clustering results, including statistics for cluster-based subcorpora, inferred qualitative categories, and their distribution across clusters. Cramér’s V is employed to quantify associations between clustering methods and category assignments. The observations demonstrate distinct operational characteristics and trade-offs across vectorisation–clustering combinations.
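For readers unfamiliar with Cramér's V, the sketch below computes it for two lists of categorical labels, for example cluster assignments and qualitative categories. It uses the standard definition V = sqrt(χ² / (n · (min(r, c) − 1))), where χ² is Pearson's chi-squared statistic of the r × c contingency table and n is the number of observations. The function name `cramers_v` and the sample labels are hypothetical, no bias correction is applied, and whether the thesis uses this exact variant is not stated in the abstract.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of categorical labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    row_totals = Counter(xs)          # marginal counts of the first variable
    col_totals = Counter(ys)          # marginal counts of the second variable
    observed = Counter(zip(xs, ys))   # cell counts of the contingency table
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        raise ValueError("each variable needs at least two distinct values")
    chi2 = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            expected = r_n * c_n / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical labels: cluster ids versus manually inferred categories.
clusters = [0, 0, 1, 1]
categories = ["methods", "methods", "clinical", "clinical"]
print(cramers_v(clusters, categories))
```

A value near 1 indicates that cluster membership almost determines the category, while a value near 0 indicates no association between the two.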

The findings inform methodological selection for large-scale text analysis and offer a framework for exploring scalable, interpretable, and linguistically informed clustering approaches. Ultimately, this work discusses and answers the question of whether cluster analysis can, given limited prior knowledge, create meaningful groups of documents and improve the accessibility of domain-specific corpora, a task that gains relevance as digital corpora grow.

Language
English
Editors
Bartsch, Sabine ORCID 0000-0001-7379-2158
Gius, Evelyn ORCID 0000-0001-8888-8419
Müller, Marcus ORCID 0000-0003-4921-4512
Rapp, Andrea ORCID 0000-0003-4933-4397
Weitin, Thomas ORCID 0000-0002-9003-5746
Department/Field
02 Department of History and Social Sciences > Institute of Linguistics and Literary Studies > German Studies - Computer Philology and Medieval Studies
DDC
000 Generalities, computer science, information science > 000 Generalities, science
400 Language > 400 Language, linguistics
400 Language > 420 English
400 Language > 430 German
800 Literature > 800 Literature, rhetoric, literary studies
800 Literature > 820 English literature
800 Literature > 830 German literature
Institution
Universitäts- und Landesbibliothek Darmstadt
Place
Darmstadt
Series title
Digital Philology | Evolving Scholarship in Digital Philology
Series volume number
9
ISSN
2701-8210
PPN
532702093
