Towards Context-free Information Importance Estimation

The amount of information contained in heterogeneous text documents such as news articles, blogs, social media posts, scientific articles, discussion forums, and microblogging platforms is already huge and will continue to grow. Humans cannot cope with this flood of information, so important information is neither found nor put to use. This situation is unfortunate, since information is the key driver in many areas of society in the present Information Age. Hence, developing automatic means that help people handle the information overload is crucial. Developing methods for the automatic estimation of information importance is an essential step towards this goal.

The guiding hypothesis of this work is that prior methods for automatic information importance estimation are inherently limited because they rely on signals that merely correlate with information importance but are not causally linked to it. To resolve this issue, this work lays the foundations for a fundamentally new approach to importance estimation. The key idea of context-free information importance estimation is to equip machine learning models with world knowledge so that they can estimate information importance based on causal reasons.

In the first part of this work, we lay the theoretical foundations for context-free information importance estimation. First, we discuss how the abstract concept of information importance can be formally defined. The research community has so far lacked a formal definition of this concept. We close this gap by discussing two definitions, which equate the importance of information with its impact on the behavior of the information recipients and with its impact on the course of their lives, respectively. Second, we discuss how the ability to estimate information importance can be assessed. Usually, this is done by performing automatic summarization of text documents. However, we find that this approach is not ideal. Instead, we propose ranking, regression, and preference prediction tasks as alternatives for future work. Third, we derive context-free information importance estimation as a logical consequence of the previously introduced importance definitions. We find that reliable importance estimation, in particular for heterogeneous text documents, is only possible with context-free methods.
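As an illustration of the preference prediction task mentioned above, the following minimal sketch measures how often an importance estimator agrees with annotated pairwise preferences. It is not taken from the thesis: the function name `preference_accuracy`, the toy scorer, and the example pairs are all hypothetical, and a real estimator would replace the word-count heuristic.

```python
from typing import Callable, Sequence, Tuple


def preference_accuracy(
    score: Callable[[str], float],
    pairs: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of (preferred, other) pairs in which `score` ranks
    the preferred item strictly higher than the other item."""
    if not pairs:
        raise ValueError("need at least one annotated pair")
    correct = sum(1 for preferred, other in pairs if score(preferred) > score(other))
    return correct / len(pairs)


def toy_score(text: str) -> float:
    """Hypothetical stand-in estimator: more words means more important."""
    return float(len(text.split()))


# Hypothetical annotations: the first element of each pair was judged more important.
toy_pairs = [
    ("Earthquake strikes the capital, thousands evacuated", "Weather was mild"),
    ("Rain", "Central bank raises interest rates sharply"),
]

accuracy = preference_accuracy(toy_score, toy_pairs)
print(f"pairwise preference accuracy: {accuracy:.2f}")
```

The same scorer could be evaluated on a ranking task by sorting items by score, or on a regression task by comparing scores against numeric importance labels.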

In the second part, we develop the first machine learning models based on the idea of context-free information importance estimation. To this end, we first tackle the lack of suitable datasets for training and testing machine learning models. In particular, large and heterogeneous datasets for investigating automatic summarization of multiple source documents are missing, because their construction is complicated and costly. To solve this problem, we present a simple and cost-efficient corpus construction approach and demonstrate its applicability by creating new multi-document summarization datasets. Second, we develop a new machine learning approach for context-free information importance estimation, implement a concrete realization, and demonstrate its advantages over contextual importance estimators. Third, we develop a new method to evaluate automatic summarization methods. Previous methods rely on expensive reference summaries and unreliable semantic comparisons of text documents. In contrast, our approach uses cheap pairwise preference annotations and requires only much simpler sentence-level similarity estimation.
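To make the idea of evaluating summaries from pairwise preferences and sentence-level similarity more concrete, here is a rough sketch under strong assumptions. It is not the evaluation method developed in the thesis: the win-rate aggregation, the Jaccard token overlap used as similarity, and all names (`win_rates`, `jaccard`, `summary_score`) are illustrative choices made for this example only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def win_rates(preferences: List[Tuple[str, str]]) -> Dict[str, float]:
    """Turn (winner, loser) sentence pairs into a per-sentence win rate."""
    wins: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {sentence: wins[sentence] / total[sentence] for sentence in total}


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a simple stand-in for sentence similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def summary_score(summary: List[str], importance: Dict[str, float]) -> float:
    """Average, over summary sentences, of the importance of the most similar
    annotated sentence, weighted by that similarity."""
    if not summary or not importance:
        return 0.0
    total = 0.0
    for sentence in summary:
        best = max(importance, key=lambda ref: jaccard(sentence, ref))
        total += jaccard(sentence, best) * importance[best]
    return total / len(summary)


# Hypothetical annotations: the first sentence of each pair was preferred.
prefs = [("a b c", "d e f"), ("a b c", "g h i"), ("d e f", "g h i")]
importance = win_rates(prefs)
good = summary_score(["a b c"], importance)
poor = summary_score(["g h i"], importance)
print(good, poor)
```

In this sketch no reference summary is needed: a summary scores well when its sentences resemble sentences that annotators preferred.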

This work lays the foundations for context-free information importance estimation. We hope that future research will explore whether this fundamentally new type of information importance estimation can eventually lead to human-level information importance estimation abilities.

Language

English

Alternative title

Richtung kontextfreie Informationswichtigkeitsbewertung

Alternative abstract

The amount of information in heterogeneous texts such as news articles, blogs, social media posts, scientific articles, discussion forums, and microblogging platforms is already enormous today and will continue to grow. Humans cannot handle this flood of information, so important information is not found and therefore cannot be put to use. This is regrettable, because information is the driving force in many areas of society in today's Information Age. Automatic systems that help people cope with the flood of information are therefore needed. Developing methods for the automated estimation of information importance is an essential step towards this.

The fundamental hypothesis of this work is that previous methods for the automated estimation of information importance are inherently limited, because they rely on signals that are merely correlated with information importance but stand in no causal relationship to it. To solve this problem, this work lays the foundations for a fundamentally new approach to automated importance estimation. The core idea of context-free information importance estimation is to equip machine learning models with world knowledge so that they can estimate the importance of information on the basis of its causal reasons.

In the first part of this work, we lay the theoretical foundations for context-free information importance estimation. First, we discuss how the abstract concept of information importance can be formally defined, since the research community has so far lacked a clear definition of this concept. We close this gap by discussing two definitions that equate information importance with the impact on the behavior and on the lives of the information recipients. Second, we discuss how the ability to estimate the importance of information can be assessed. Usually, this problem is assessed in the context of automated summarization of text documents. It turns out, however, that this is not ideal. Instead, we propose that future research use ranking, regression, and the prediction of pairwise preferences as alternatives. Third, context-free information importance estimation is derived as a logical consequence of the previously introduced importance definitions. It turns out that reliable estimation of information importance, in particular in heterogeneous texts, is only possible with context-free methods.

In the second part, we develop the first machine learning models for context-free information importance estimation. To this end, we first address the lack of suitable datasets needed to train and test the models. In particular, large and heterogeneous datasets needed to investigate the automated summarization of multiple source documents have been missing so far, because their construction is complicated and costly. We solve this problem by developing a simple and cost-efficient approach and demonstrating its applicability by creating new datasets. Second, we develop a new machine learning approach for context-free information importance estimation, implement a concrete realization, and demonstrate its advantages over context-dependent systems. Third, we present a new method for evaluating automated systems. Earlier work relies on expensive reference summaries and unreliable semantic comparisons of text documents. Our approach, in contrast, uses cheap pairwise preference annotations and simpler semantic comparisons at the sentence level.

This work lays the foundation for context-free information importance estimation. We hope that future research will explore whether this fundamentally new kind of information importance estimation can lead to human-like abilities in this area.

Department/field

20 Department of Computer Science > Knowledge Engineering

DDC

000 General works, computer science, information science > 004 Computer science

Institution

Technische Universität Darmstadt

Place

Darmstadt