The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia.
Technische Universität, Darmstadt
[Ph.D. Thesis], (2014)
Thesis_OF_published_print.pdf - Published Version
Available under Creative Commons Attribution Non-commercial No Derivatives, 2.5.
Download (3MB) | Preview
|Item Type:||Ph.D. Thesis|
|Title:||The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia|
Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The main properties of this user generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it causes an even higher variance in quality than the already heterogeneous quality of traditional web content. Wikipedia is the prime example for a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm. Even though recent studies have confirmed that the overall quality of Wikipedia is high, there is still a wide gap that must be bridged before Wikipedia reaches the state of a reliable, citable source.
A key prerequisite to reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and its open and almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on a massive scale.
In the first part of this thesis, we establish the theoretical foundations for our work. We first introduce the relatively new concept of open online collaboration with a particular focus on collaborative writing and proceed with a detailed discussion of Wikipedia and its role as an encyclopedia, a community, an online collaboration platform, and a knowledge resource for language technology applications. We then proceed with the three main contributions of this thesis.
Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated writing quality as a central factor. Since Wikipedia is not only a repository of mere facts but rather consists of full text articles, the writing quality of these articles has to be taken into consideration when judging article quality. As the first main contribution of this thesis, we therefore define a comprehensive article quality model that aims to consolidate both the quality of writing and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 dimensions segmented into the four layers of intrinsic quality, contextual quality, writing quality and organizational quality.
As a second main contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles. Even though the general idea of quality detection has been introduced in previous work, we dissect the approach to find that the task is inherently prone to a topic bias which results in unrealistically high cross-validated evaluation results that do not reflect the classifier’s real performance on real world data.
We solve this problem with a novel data sampling approach based on the full article revision history that is able to avoid this bias. It furthermore allows us not only to identify flawed articles but also to find reliable counterexamples that do not exhibit the respective quality flaws. For automatically detecting quality flaws in unseen articles, we present FlawFinder, a modular system for supervised text classification. We evaluate the system on a novel corpus of Wikipedia articles with neutrality and style flaws. The results confirm the initial hypothesis that the reliable classifiers tend to exhibit a lower cross-validated performance than the biased ones but the scores more closely resemble their actual performance in the wild.
As a third main contribution, we present an approach for automatically segmenting and tagging the user contributions on article Talk pages to improve the work coordination among Wikipedians. These unstructured discussion pages are not easy to navigate and information is likely to get lost over time in the discussion archives. By automatically identifying the quality problems that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions in the future.
Our contribution in this area is threefold: (i) We describe a novel algorithm for segmenting the unstructured dialog on Wikipedia Talk pages using their revision history. In contrast to related work, which mainly relies on the rudimentary markup, this new algorithm can reliably extract meta data, such as the identity of a user, and is moreover able to handle discontinuous turns. (ii) We introduce a novel scheme for annotating the turns in article discussions with dialog act labels for capturing the coordination efforts of article improvement. The labels reflect the types of criticism discussed in a turn, for example missing information or inappropriate language, as well as any actions proposed for solving the quality problems. (iii) Based on this scheme, we created two automatically segmented and manually annotated discussion corpora extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages and achieve a cross-validated performance of F1 = 0.82 on the SEWD corpus while we obtain an average performance of F1 = 0.78 on the larger and more complex EWD corpus.
|Place of Publication:||Darmstadt|
|Uncontrolled Keywords:||Natural Language Processing, Wikipedia, Information Quality Management, Collaboration, Collaborative Writing, Topic Bias, Text Classification, Machine Learning, Text Quality, Readability|
|Classification DDC:||000 Allgemeines, Informatik, Informationswissenschaft > 004 Informatik
000 Allgemeines, Informatik, Informationswissenschaft > 020 Bibliotheks- und Informationswissenschaft
400 Sprache > 400 Sprache, Linguistik
|Divisions:||Fachbereich Informatik > Ubiquitäre Wissensverarbeitung|
|Date Deposited:||25 Aug 2014 13:51|
|Last Modified:||25 Aug 2014 13:51|
|Referees:||Gurevych, Prof. Dr. Iryna and Schütze, Prof. Dr. Hinrich and Rosé, Prof. Dr. Carolyn|
|Refereed:||15 July 2014|