Reconstructing Shuffled Text: Bad Results for NLP, but Good News for Using In-Copyright Texts
Existing copyright laws in the European Union, the United States, and many other jurisdictions worldwide impose limitations on Text and Data Mining that affect the storage, publication, and reuse of datasets built from in-copyright texts. To address this, derived text formats (DTFs) have been proposed. With respect to copyright law, one important property of a DTF is whether the source text can be reconstructed from it. In this paper, we present the first of a series of experiments we plan to conduct on this issue. For this experiment, we fine-tuned a large language model to reconstruct source texts from their DTFs. The reconstruction results are mixed but, on the whole, not very successful. This suggests that reconstructing text from DTFs is not as simple as is sometimes assumed, and we believe this result may encourage scholars to convert their in-copyright texts to DTFs and publish them as research data.
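To make the notion of a derived text format concrete, here is a minimal sketch of one possible DTF: shuffling the tokens of each sentence independently. The function name and the sentence-level granularity are illustrative assumptions, not the specific format used in the paper.

```python
import random

def to_shuffled_dtf(text, seed=0):
    """Toy derived text format (DTF): shuffle tokens per sentence.

    The bag of words of each sentence is preserved, but the word
    order (and thus the original expression) is destroyed.
    This is only an illustrative sketch, not the paper's format.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible example
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    shuffled = []
    for sentence in sentences:
        tokens = sentence.split()
        rng.shuffle(tokens)  # in-place shuffle of this sentence's tokens
        shuffled.append(" ".join(tokens))
    return ". ".join(shuffled) + "."

original = "The quick brown fox jumps over the lazy dog."
dtf = to_shuffled_dtf(original)
# The multiset of tokens survives the transformation, only the order changes.
assert sorted(dtf.rstrip(".").split()) == sorted(original.rstrip(".").split())
```

A reconstruction experiment like the one described above would then try to recover `original` from `dtf`, which requires re-inferring the word order rather than simply copying the input.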

