Mining User-Generated Content for Incidents.
Technische Universität, Darmstadt
[Ph.D. Thesis], (2014)
733_Schultz_redu.pdf - Accepted Version
Available under Creative Commons Attribution Non-commercial No Derivatives, 2.5.
|Item Type:||Ph.D. Thesis|
|Title:||Mining User-Generated Content for Incidents|
Social media has changed human interaction by allowing people to connect to each other anytime and from anywhere, resulting in much valuable information being shared about a variety of domains. Because of the large amount of data created every day, social media analytics has become an important means of making use of this source of information. Most importantly, people contribute much valuable information about crisis events such as small-scale incidents, which is currently not taken into account by decision makers in emergency management. The reasons are the sheer amount of information as well as the heterogeneous and unstructured nature of the data, both of which hinder the use of this source of information.
In this dissertation, we address the question: 'How can user-generated content be made a usable and valuable source of information for situational awareness of decision makers?' To answer it, we present a framework consisting of the steps necessary to process a large amount of social media data in such a way that incident-related information is identified and aggregated into distinct incident clusters, each one containing information about an individual real-world incident. With the contributions presented in this dissertation, previously unusable user-generated content becomes a valuable source for decision making in emergency management.
In the first part of the dissertation, we introduce the requirements of a system for small-scale incident detection, which are derived from the (1) spatial, (2) temporal, and (3) thematic dimensions defining an incident. Based on these dimensions, we develop a framework for processing a large amount of user-generated content. As a first step of the framework, we introduce how user-generated content is collected to create an initial information base, which is processed in the subsequent steps of the framework.
In the second part, we introduce several steps to preprocess unstructured social media data so that it can be used in the subsequent steps of the framework. We show how named entities and temporal expressions are identified and extracted so that they can serve as additional information when creating a machine learning model that generalizes to data from a different city. We introduce a set of adaptations to standard techniques that allow us to extract named entities and temporal expressions from unstructured text. We also present how we use the temporal expressions to infer the point in time at which an incident occurred.
In the same part, we also address the problem of inferring the spatial dimension of user-generated content. We contribute a novel approach for the geolocalization of tweets that can infer the home location of a Twitter user, the point of origin from which a tweet was sent, and the location focus of a tweet message. We show that the approach locates 92% of all tweets with a median accuracy below 30 km, thus outperforming related approaches. Furthermore, it predicts the user's residence with a median accuracy below 5.1 km. Finally, the same approach estimates the focus of incident-related tweets with a median accuracy below 250 m.
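As an illustration of how a median localization figure of this kind can be computed, the following minimal sketch measures the great-circle distance between predicted and true coordinates and takes the median. It is not taken from the thesis: the function names `haversine_km` and `median_error_km`, the haversine formula as the distance measure, and the coordinate format are all assumptions made for this example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, actual):
    """Median distance between predicted and true (lat, lon) pairs."""
    return median(haversine_km(*p, *a) for p, a in zip(predicted, actual))
```

The median is less sensitive than the mean to a few tweets that are located on the wrong continent, which is presumably why localization results are commonly reported this way.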
In the third part of the dissertation, we present approaches for inferring the thematic dimension. We contribute a general approach for applying crowdsourcing to manually classify and aggregate user-generated content according to the information need of the command staff in emergency management. With this approach, we are able to differentiate incident-related information from information not related to an incident. Evaluation results with end users show that this approach is indeed valuable for the command staff.
As crowdsourcing is limited when it comes to the timely filtering of a large amount of information, we further contribute an approach for automatically detecting incident-related information in user-generated content. Based on an extensive evaluation of different feature groups, we present a highly precise machine learning approach that is able to classify the incident type with an F-measure of more than 90%. We also deal with the dynamism and regional variation of user-generated content and contribute a concept that allows creating features that are not city-specific and support training a generalized model.
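For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below shows the standard definition; the abstract does not state which variant was used, so the default `beta=1.0` (the balanced F1 score) and the function name `f_measure` are assumptions for illustration.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was classified correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With `beta=1.0` the score is the harmonic mean of precision and recall, so a classifier must do well on both to reach a value above 90%.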
Based on the temporal, spatial, and thematic dimensions of each information item, we present a clustering approach that is able to detect incidents in a large amount of social media data. The approach clusters all information related to the same incident and is able to cope with different organizational incident type vocabularies. Evaluation results show that the approach detects more than 50% of the real-world incidents published in an official emergency management system. Moreover, 32.14% of the detected incidents lie within a 500 m radius and a 10 min time interval of the real-world incident, allowing precise spatial and temporal localization. With this recall, we outperform related approaches, which detect only about 5% of the real-world incidents. In addition, more than 77% of the incident clusters created with our approach are indeed related to incidents, which significantly reduces the quantity of irrelevant information. We also underline the importance of incident-related tweets by conducting a user study of situational information shared in user-generated content, showing that valuable situational information is indeed shared in social media.
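The idea of grouping items along the three dimensions can be sketched as follows. This is a deliberately simple greedy scheme and not the clustering algorithm of the thesis: the `Item` and `Cluster` types, the `cluster_items` function, the exact-match comparison of incident types, and the reuse of the 500 m and 10 min figures as clustering thresholds (the abstract reports them only as evaluation criteria) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Item:
    lat: float
    lon: float
    time: datetime
    incident_type: str  # thematic dimension, e.g. "fire" (hypothetical label)


@dataclass
class Cluster:
    seed: Item
    members: list = field(default_factory=list)


def distance_m(a, b):
    """Equirectangular approximation of the distance in metres (fine for short ranges)."""
    mean_lat = math.radians((a.lat + b.lat) / 2)
    dx = math.radians(b.lon - a.lon) * math.cos(mean_lat)
    dy = math.radians(b.lat - a.lat)
    return 6371000.0 * math.hypot(dx, dy)


def cluster_items(items, max_dist_m=500.0, max_gap=timedelta(minutes=10)):
    """Assign each item to the first cluster whose seed matches in type, time, and space."""
    clusters = []
    for item in sorted(items, key=lambda i: i.time):
        for c in clusters:
            if (c.seed.incident_type == item.incident_type
                    and abs(item.time - c.seed.time) <= max_gap
                    and distance_m(c.seed, item) <= max_dist_m):
                c.members.append(item)
                break
        else:
            # no existing cluster is close enough in all three dimensions
            clusters.append(Cluster(seed=item, members=[item]))
    return clusters
```

Comparing against a fixed seed keeps the sketch short; a realistic system would more likely compare against a cluster centroid and map differing incident type vocabularies onto a shared one before matching.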
Finally, we contribute an approach for reducing labeling costs of user-generated content. The presented algorithms make use of temporal, spatial, and thematic metadata to determine the most valuable instances to label. Our evaluation shows that the approach outperforms current state-of-the-art approaches. Furthermore, we show that labeling costs can indeed be reduced.
|Place of Publication:||Darmstadt|
|Classification DDC:||000 General works, computer science, information science > 004 Computer science
600 Technology, medicine, applied sciences > 620 Engineering
|Divisions:||20 Department of Computer Science
20 Department of Computer Science > Knowledge Engineering
20 Department of Computer Science > Telecooperation
|Date Deposited:||12 Aug 2014 09:22|
|Last Modified:||12 Aug 2014 09:22|
|Referees:||Mühlhäuser, Prof. Dr. Max and Sure-Vetter, Prof. Dr. York|
|Refereed:||23 June 2014|