Information Preparation with the Human in the Loop

Polisetty Venkata Sai, Avinesh (2020)
Information Preparation with the Human in the Loop.
TU Darmstadt
doi: 10.25534/tuprints-00011839
Ph.D. Thesis, Primary publication

Copyright Information: CC BY-SA 4.0 International - Creative Commons, Attribution ShareAlike.

Item Type: Ph.D. Thesis
Type of entry: Primary publication
Title: Information Preparation with the Human in the Loop
Language: English
Referees: Gurevych, Prof. Dr. Iryna ; Sanderson, Prof. Mark
Date: 22 June 2020
Place of Publication: Darmstadt
Date of oral examination: 18 July 2019
DOI: 10.25534/tuprints-00011839

Abstract:

With the advent of the World Wide Web (WWW) and the rise of digital media consumption, abundant information is now available on any topic. As a result, users often suffer from information overload, which makes finding relevant and important information a great challenge. To alleviate this overload and provide significant value to users, automatic information preparation methods are needed. Such methods need to support users by discovering and recommending important information while filtering out redundant and irrelevant information. They need to ensure that users do not drown in the prepared information but rather benefit from it. However, what counts as relevant and important is subjective and highly specific to the user’s information need and the task at hand. Therefore, a method must continually learn from the feedback of its users. In this thesis, we propose new approaches that put the human in the loop in order to interactively prepare information along three major lines of research: information aggregation, condensation, and recommendation.

For multiple well-studied tasks in natural language processing, we point out the limitations of existing methods and discuss how our approach can close the gap to the human upper bound by considering user feedback and adapting to the user’s information need. We put a particular focus on applications in digital journalism and introduce the new task of live blog summarization. We show that the corpora we create for this task are highly heterogeneous compared to standard summarization datasets and pose new challenges to previously proposed non-interactive methods.

One way to alleviate information overload is information aggregation. We focus on the corresponding task of multi-document summarization and argue that previously proposed methods are of limited use in real-world applications because they do not take the users’ goals into account. To address these drawbacks, we propose an interactive summarization loop that iteratively creates and refines multi-document summaries based on the users’ feedback. We investigate sampling strategies based on active machine learning and joint optimization to reduce the number of iterations and the amount of user feedback required. Our approach significantly improves the quality of the summaries and reaches a performance near the human upper bound. We present a system demonstration implementing the interactive summarization loop, study its scalability, and highlight its use cases in exploring document collections and creating focused summaries in journalism.
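The interactive summarization loop described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the bigram-based `concepts` function, the greedy `summarize` selector, the fixed feedback weights (10.0 and 0.0), and the simulated `user_likes` callback are all assumptions chosen to keep the example small and self-contained.

```python
from collections import Counter

def concepts(sentence):
    """Hypothetical concept extractor: the set of lowercase word bigrams."""
    words = sentence.lower().split()
    return set(zip(words, words[1:]))

def summarize(sentences, weights, budget):
    """Greedily pick up to `budget` sentences by weight of newly covered concepts."""
    chosen, covered = [], set()
    while len(chosen) < budget:
        best, best_gain = None, 0.0
        for s in sentences:
            if s in chosen:
                continue
            gain = sum(weights[c] for c in concepts(s) - covered)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:  # no remaining sentence adds any weight
            break
        chosen.append(best)
        covered |= concepts(best)
    return chosen

def interactive_loop(sentences, user_likes, budget=1, max_rounds=5):
    """Alternate between summarizing and collecting feedback on shown concepts."""
    # Initial weights: number of sentences in which each concept occurs.
    counts = Counter(c for s in sentences for c in concepts(s))
    weights = {c: float(n) for c, n in counts.items()}
    asked = set()
    summary = summarize(sentences, weights, budget)
    for _ in range(max_rounds):
        new = {c for s in summary for c in concepts(s)} - asked
        if not new:  # nothing left to ask about: the summary is stable
            break
        for c in new:
            # Assumed update rule: accepted concepts get a high weight,
            # rejected concepts are zeroed out.
            weights[c] = 10.0 if user_likes(c) else 0.0
        asked |= new
        summary = summarize(sentences, weights, budget)
    return summary

docs = [
    "the storm hit the coast",
    "the storm hit the city",
    "the mayor praised the volunteers",
]
# Simulated user who only cares about the mayor and the volunteers.
likes = lambda c: "mayor" in c or "volunteers" in c
print(interactive_loop(docs, likes))  # -> ['the mayor praised the volunteers']
```

In this toy run the first summary is the most frequent storm sentence. Once the simulated user rejects its concepts, the loop converges on the sentence that matches the user's interest.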

For information condensation, we investigate a text compression setup. We address the problem that neural models require huge amounts of training data and propose a new interactive text compression method that reduces the need for large-scale annotated data. We employ state-of-the-art Seq2Seq text compression methods as our base models and propose an active learning setup with multiple sampling strategies to make efficient use of minimal training data. We find that our method significantly reduces the amount of data needed for training and that it adapts well to new datasets and domains.
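One common sampling strategy in active learning is uncertainty sampling, sketched below for a deletion-style compression model. This is an illustrative assumption, not necessarily one of the strategies studied in the thesis: the per-token keep probabilities, the entropy-based score, and the names `sentence_uncertainty` and `select_for_annotation` are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a keep/delete decision with keep probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def sentence_uncertainty(keep_probs):
    """Mean per-token entropy; higher means the model is less sure."""
    if not keep_probs:
        return 0.0
    return sum(binary_entropy(p) for p in keep_probs) / len(keep_probs)

def select_for_annotation(pool, k):
    """Return the ids of the k most uncertain sentences in the unlabeled pool.

    `pool` maps a sentence id to the model's per-token keep probabilities.
    """
    ranked = sorted(pool, key=lambda sid: sentence_uncertainty(pool[sid]),
                    reverse=True)
    return ranked[:k]

pool = {
    "a": [0.99, 0.01],  # model is confident about both tokens
    "b": [0.50, 0.60],  # model is close to guessing
    "c": [0.90, 0.20],
}
print(select_for_annotation(pool, 2))  # -> ['b', 'c']
```

Sentences selected this way would be sent to an annotator, added to the training set, and the model retrained before the next selection round.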

Finally, we focus on information recommendation and discuss the need for explainable models in machine learning. We propose a new recommendation system that jointly performs rating prediction and review summarization, and which shows major improvements over state-of-the-art systems in both tasks. By solving these tasks jointly with multi-task learning techniques, we furthermore obtain explanations for a rating by showing the generated review summary, highlighted according to the model’s attention, together with a histogram of user preferences learned from the users’ reviews.

We conclude the thesis with a summary of how human-in-the-loop approaches improve information preparation systems and envision the use of interactive machine learning methods in other areas of natural language processing as well.

URN: urn:nbn:de:tuda-tuprints-118394
Classification DDC: 000 Generalities, computers, information > 004 Computer science; 400 Language > 400 Language, linguistics
Divisions: 20 Department of Computer Science > Ubiquitous Knowledge Processing
Date Deposited: 01 Jul 2020 08:39
Last Modified: 09 Jul 2020 06:35