2025
First publication
Preprint

Reconstructing Shuffled Text: Bad Results for NLP, but Good News for Using In-Copyright Texts

File(s)
Main publication
4163_Reconstructing_shuffled_text_Conference_Version.pdf
CC BY 4.0 International
Format: Adobe PDF
Size: 361.04 KB
TUDa URI
tuda/13838
URN
urn:nbn:de:tuda-tuprints-301406
DOI
10.26083/tuprints-00030140
Authors
Du, Keli ORCID 0000-0001-7800-0682
Ackerschewski, Sarah
Navruz, Uygar
Sinir, Nazan
Valline, Julian ORCID 0009-0007-7096-0348
Schöch, Christof ORCID 0000-0002-4557-2753
Abstract

Existing copyright laws in the European Union, the United States, and many other jurisdictions worldwide impose limitations on Text and Data Mining that affect the storage, publication, and reuse of datasets built from in-copyright texts. Derived text formats (DTFs) have been proposed to address these limitations. One important aspect of DTFs with regard to copyright law is the reconstructibility of the source text from its corresponding DTF. In this paper, we present the first of a series of experiments we plan to conduct on this issue. For this experiment, we fine-tuned a large language model to reconstruct source texts from DTFs. The results of the reconstruction are mixed, but on the whole not very successful. This suggests that reconstructing text from DTFs is not as simple as is sometimes assumed. We believe that this result may encourage scholars to convert their in-copyright texts to DTFs and publish them as research data.

Keywords

Derived text format

copyright

reconstructibility

evaluation

Language
English
Department/Field
02 Fachbereich Gesellschafts- und Geschichtswissenschaften > Institut für Sprach- und Literaturwissenschaft > Digital Philology - Neuere deutsche Literaturwissenschaft
DDC
800 Literature > 800 Literature, rhetoric, literary studies
Institution
Universitäts- und Landesbibliothek Darmstadt
Place
Darmstadt
Series title
CCLS2025 Conference Preprints
Series volume number
4
Journal issue number
1
Additional information
This paper has been submitted to the conference track of JCLS. It has been peer reviewed and accepted for presentation and discussion at the 4th Annual Conference of Computational Literary Studies in Krakow, Poland, in July 2025.
Additional links (organization)
https://jcls.io/site/ccls2025/