JOHANNES KEPLER UNIVERSITÄT LINZ

Technisch-Naturwissenschaftliche Fakultät

Semantic Enriched Aggregation of Social Media in Crowd Situation Awareness

MASTER'S THESIS

for the attainment of the academic degree Diplom-Ingenieurin

in the Master's Program Computer Science

Submitted by: Carina Reiter, BSc

Carried out at: FAW - Institute for Application Oriented Knowledge Processing

Assessor: a.Univ.-Prof. Dr. Birgit Pröll

Linz, October 2015

Kurzfassung

This thesis was written within the crowdSA project at the Institute for Application Oriented Knowledge Processing of the Johannes Kepler University Linz. The crowdSA project deals with the development of a crisis management system which automatically extracts data from social networks, analyzes and processes it, and finally presents it clearly to domain experts via a user interface.

The scope of this thesis covers the aggregation of messages from social media. Preprocessed, annotated contents serve as input, and clusters corresponding to real-life events are generated on the basis of similarity matrices. The similarity between messages is computed from metadata, such as location, and payload data, such as semantic and syntactic information, and is then passed on to the clustering component. The resulting event clusters are visualized by means of a graphical user interface.

For the evaluation of this work, five scenarios were extracted from Twitter data of the 2013 floods in Austria and Germany. With the help of the metrics recall and precision and the resulting F1-score, the correct assignment of messages to an event cluster can be assessed. In a comparison of different parametrizations, the best result of 45% was achieved with a preferred weighting of syntactic similarity values.

Further improvements, particularly regarding the selection of similarity features and algorithms, could lead to a better F1-score. Nevertheless, the implementation carried out within the scope of this thesis already achieves substantially better results than the reference system CrisisTracker (best value: 10%). The goals defined in advance were thus achieved.

Abstract

This thesis is situated within the crowdSA research project at the Institute for Application Oriented Knowledge Processing of the Johannes Kepler University Linz and deals with social media monitoring and analysis during crisis situations. The project aims to create a system which processes and enriches social media data and presents the results in a graphical user interface, to be used by crisis management actors.

The scope of this work lies in the message aggregation component, which takes annotated social media content as input and creates clusters that represent real-life events. The annotations, e.g. location, semantic concepts, part-of-speech tags, etc., are added to the content in a preprocessing step. Based on these tags, a similarity matrix over all combinations of social media contents is calculated and passed to the clustering framework, which then aggregates similar contents into event clusters. The results are presented in a graphical user interface, which also visualizes the algorithm parametrization.

To evaluate the results, five scenarios of the 2013 flooding in Germany and Austria are used. Recall, precision, and the resulting F1-score are calculated for various parametrization options. The best result, an F1-score of 45%, is achieved by using geographical, semantic, and syntactic similarity values, with a higher weighting for the latter.

Further improvements in the context of feature and algorithm selection might result in a better F1-score. Nevertheless, the implementation already outperforms the reference tool CrisisTracker and thus meets all predefined goals.

Acknowledgements

The practical work of this thesis was developed in collaboration with Gerald Madlsperger [47] in order to build a system which addresses both of our theses. Therefore, selected chapters, as indicated, are identical in both documents.

Contents

Kurzfassung

Abstract

Acknowledgements

Contents

List of Figures

1 Introduction
  1.1 Motivation
  1.2 State of Research
    1.2.1 CrisisTracker
  1.3 Aims and Objectives
    1.3.1 Data Analysis and Definition of a Feature Hierarchy
    1.3.2 Cluster Retrieval based on Semantic Enriched and Geolocated Data

2 Dataset
  2.1 Data
  2.2 Scenarios
    2.2.1 Scenario: "Bridge Blockade"
    2.2.2 Scenario: "Sandbags"
    2.2.3 Scenario: "Drinking Water Supply"
    2.2.4 Scenario: "Riverdams"
    2.2.5 Scenario: "Roadblock"

3 Concept
  3.1 Features
    3.1.1 Content Features
    3.1.2 Location Based Features
    3.1.3 User Based Features
  3.2 Architecture
  3.3 Components
    3.3.1 Controller Component
    3.3.2 PreProcessor


    3.3.3 Aggregation Component
    3.3.4 Object Extraction Component
    3.3.5 Evolution Analysis Component
    3.3.6 Visualization Component
  3.4 Data Model
    3.4.1 Prerequisites
    3.4.2 CrowdSA Data-layer

4 Implementation
  4.1 PreProcessing
    4.1.1 Topic Fencing: Part-of-Speech Tagging
    4.1.2 Geo Fencing
  4.2 Aggregation
    4.2.1 Similarity Calculation
    4.2.2 Clustering
    4.2.3 Clustering tools
  4.3 Visualization
    4.3.1 Home (Historical Data)
    4.3.2 Timeslice Details
    4.3.3 Event Detail
    4.3.4 Timeline Result

5 Results and Evaluation
  5.1 Evaluation Criteria
    5.1.1 Cluster Evaluation
    5.1.2 Recall
    5.1.3 Precision
    5.1.4 F1-Score
  5.2 Evaluation Results
    5.2.1 Semantic Level
    5.2.2 Geolocation Level
    5.2.3 Scenario 1: Bridge Blockade
    5.2.4 Scenario 2: Sandbags
    5.2.5 Scenario 3: Drinking Water
    5.2.6 Scenario 4: River Dams
    5.2.7 Scenario 5: Roadblock
    5.2.8 Evaluation Summary

6 Conclusion
  6.1 Summary
  6.2 Concluding Statements

7 Future Work
  7.1 PreProcessor
    7.1.1 Time Fencing
  7.2 Aggregation
    7.2.1 Similarity Framework
  7.3 Visualization

    7.3.1 Live Data

Bibliography

Declaration of Authorship

List of Figures

1.1 Social Media Monitoring Tools Overview and Functionality

3.1 Similarity Feature Hierarchy
3.2 General Architecture
3.3 Overall Pipeline
3.4 Abstract Pipeline Architecture
3.5 Class Diagram of Controller Architecture
3.6 Package diagram of preprocessing component
3.7 Class diagram of preprocessing component
3.8 Class diagram of the TweetLocator
3.9 Package diagram of aggregation component
3.10 Class diagram of aggregation component
3.11 Site Map of the Prototype Visualization
3.12 Home Page of Live Data Usage
3.13 Home Page of Historical Data Usage
3.14 Result Page of an Event Clustering Timeline
3.15 Result Page of an Event Cluster
3.16 Result Page of an Episode Timeline
3.17 Generic Source Layer
3.18 Source Layer from CrisisTracker
3.19 Generic Aggregated Layer
3.20 Aggregated Layer from CrisisTracker
3.21 Integrated Layer

4.1 Visualization component diagram
4.2 Home (Historical Data)
4.3 Timeslice Details
4.4 Event Details

7.1 HeidelTime Demo example [31]
7.2 Homepage of live data usage

Chapter 1

Introduction

This chapter gives an overview of the background of this thesis. We introduce the research topic, survey existing projects and literature, and state the aims and objectives of this work.

1.1 Motivation

In the past decade, social media platforms have become an integral part of people's lives. This has also resulted in a huge amount of data being produced and stored on the World Wide Web. With this change in lifestyle, projects appeared which perceived an immense value in this data for various aims. Not only commercial insights can be retrieved from social media data; these platforms also carry information about events and critical situations, which can in turn be used to help people.

The Institute for Application Oriented Knowledge Processing1 and the Institute for Cooperative Information Systems2 of the Johannes Kepler University Linz work on a project in collaboration with the Austrian Red Cross and others, in order to create a Crowd Situation Awareness [62] system. This framework should extract, analyze, and process data from social media platforms and present enriched information to partners, e.g. at the Austrian Red Cross, so that they can act faster and more efficiently. For the architecture of the framework, Salfinger et al. [62] introduce six levels of information fusion, which are based on the JDL (Joint Directors of Laboratories) data fusion model by Llinas et al. [43] and involve the following:

1 http://faw.jku.at
2 http://cis.jku.at


L0 - Crowd Sensing Observing social media and analyzing/preprocessing the data with respect to quality assurance and semantic annotation.

L1 - Crowd Perception Detecting event situations based on content similarity and geo-fencing. Furthermore, these clustered social media contents are mapped to a crisis ontology.

L2 - Comprehension Discovering evolving situations and analyzing their relationships in order to better profile a situation and to automatically learn from these events.

L3 - Projection Forecasting the impact of future events by comparison with related existing events. This also involves informing emergency managers and judging the likelihood that the future event will happen.

L4 - Resource Management Including incoming data from authority sensors in order to obtain a bigger picture of the situation.

L5 - User Refinement Presenting the results of the system to the operating user based on geographical maps. Further adjustments/feedback are gathered through the visualization component.

This master's thesis is located between L0 and L1 of the crowdSA project and evaluates possible strategies in the field of semantically enriched message aggregation.

1.2 State of Research

In a previous seminar project [24] we evaluated various so-called Social Media Monitoring tools, which observe, analyze, and process crowd-sourced data for different purposes. For this master's thesis these tools are especially interesting, since they provide a benchmark of how crowd data can be extracted and processed.

In order to get an overview of the features and conditions of these tools, we came up with five main evaluation criteria: functionality, employment, access, services, and supplemented data sources.

Access The access criterion describes how data is gathered from the supported sources and which functions are applied directly to the fetched data. Furthermore, it covers the supported social media platforms, such as Facebook, Twitter, Blogspot, and so forth.

Functionality This criterion covers the core of the inspected social media monitoring tools. It tries to give a good overview of the most common and most relevant features in relation to crisis detection. For each feature, the evaluation should contain enough information to convey the functional principle, but also to distinguish the method from other methods.

Services In contrast to the functionality criterion, this one does not describe the functions and algorithms used to manipulate and explore the data, but which functionalities and metadata are provided to the users of the system via APIs, user interfaces, or simple export facilities. Services are always configurable and controllable by the user of the system.

Supplemented Data Sources Supplemented data sources are used to obtain further data from sources other than social media platforms. This data can be used to enrich the social media data or to feed algorithms such as filtering functions or spam probability calculations.

Employment Employment describes architectural and usability aspects of the inspected tools.

With those aspects in mind, it was possible to evaluate the most prominent tools in this research field, which include the following.

Attensity Attensity, Palo Alto, California, USA http://www.attensity.com/products/technology/integration/

Attentio Attentio SA, London, United Kingdom http://attentio.com/

Brandwatch Brandwatch, Brighton, United Kingdom http://www.brandwatch.com/

BuzzCapture BuzzCapture b.v., Amsterdam, Netherlands http://www.buzzcapture.com/brand-monitor/

Crimson Hexagon Crimson Hexagon, Boston, Massachusetts, USA http://www.crimsonhexagon.com/forsight-platform/

Crisis Tracker [58] Madeira Interactive Technologies Institute, University of Madeira, Portugal http://ufn.virtues.fi/crisistracker

Cyfe Cyfe Inc., Framingham, Massachusetts, USA http://www.cyfe.com/

Dialogix Dialogix, Southport BC, Queensland, Australia http://www.dialogix.com.au/

eCairn eCairn Inc., Mountain View, California, USA http://www.ecairn.com/

Engagor Engagor BVBA, Gent, Belgium http://engagor.com/

iScienceMaps [56] iScience, Experimental Psychology and Internet Science, Universidad de Deusto, Bilbao, Spain. http://maps.iscience.deusto.es/

Open-Social-Media-Monitoring [2] Openstream Internet Solutions, Zürich, Switzerland http://openstream.github.io/open-social-media-monitoring/

Repustate Repustate Inc., Toronto, Canada https://www.repustate.com/

Semiocast Semiocast, Paris, France http://developer.semiocast.com/

SensePlace2 [45] GeoVista, Penn State University, State College, Pennsylvania, USA http://www.geovista.psu.edu/SensePlace2/

SocialMention Jon Cianciullo, Toronto, Canada http://socialmention.com/

SocialSearcher Social Searcher, St. Petersburg, Russia http://www.social-searcher.com/

Spinn3r [26][28] Tailrank Inc, San Francisco, USA http://www.spinn3r.com/

Spiral16 Spiral16, Kansas City, Kansas, USA http://www.spiral16.com/software-services/

SwiftRiver [3] Ushahidi, Inc., Nairobi, Kenya http://www.ushahidi.com/products/swiftriver-platform

Tattler Phase2 Technology, New York, USA http://tattlerapp.com/

Topsy Topsy Labs, San Francisco, USA http://about.topsy.com/

Twazzup Twazzup, San Francisco, USA http://www.twazzup.com/

Twitcident [4] Twitcident, Delft, Netherlands http://twitcident.com/

Twitris [54] Wright State University, Dayton, Ohio, USA http://twitris.knoesis.org/

Tweettronics Tweettronics, San Francisco, USA http://www.tweettronics.com

TwitterEcho [13] Faculty of Engineering, University of Porto, Portugal http://robinson.fe.up.pt/~projects/twitter_crawler/wiki/TwitterEcho

uberVU uberVU, Cambridge, Massachusetts, USA http://www.ubervu.com/

Viralheat Viralheat, San Mateo, California, USA https://www.viralheat.com/

Whostalkin Joe Hall, Columbia, South Carolina, USA http://www.whostalkin.com/

We finally came up with figure 1.1, which assesses the tools according to the subcategories of the functionality criterion. This category is also crucial in our current project, as it focuses on aggregation, evolution, information extraction and semantic enrichment, provenance, and quality analysis. The darker the color in a table cell, the higher the value of the feature within the tool.

Figure 1.1: Social Media Monitoring Tools Overview and Functionality

Concluding, the assessment resulted in a good evaluation of the tool CrisisTracker [59], due to its functionalities in the fields of aggregation and quality analysis. As our system partly rests on this tool, we summarize its main features in the following section.

1.2.1 CrisisTracker

This section was composed together with Gerald Madlsperger [47], since it affects both of our theses.

CrisisTracker [59] is a Social Media Monitoring tool specialized in discovering crises, realized by the Madeira Interactive Technologies Institute, University of Madeira, Portugal. It tracks and analyzes Tweets for certain keywords to identify catastrophic events. The basic idea of CrisisTracker is to cluster Tweets with a defined word similarity. The similarity is based on the bag-of-words principle, which treats each word within a text equally, ignoring position and word order. The actual metric is then calculated using a cosine similarity combined with locality-sensitive hashing, which considers the location metadata of a Tweet. The resulting clusters are called stories and allow the user to understand the density and spread of certain information. As it is an open source project by the University of Madeira, it has great potential for extension with interesting features.

An important advantage of CrisisTracker is its ability to cluster the collected Tweet information and to group it into stories. To this end it calculates a similarity value for each pair of Tweets. Furthermore, it is possible to define classes which the system automatically fills with suitable Tweets.

After aggregation, the tool ranks the Tweets by simply evaluating the size of a cluster. Too large or single-item clusters receive a bad ranking, since such elements are often considered spam. Additionally, CrisisTracker uses user feedback for further quality evaluation. CrisisTracker also makes use of lexical analysis to avoid spam information: it checks the number of popular words, e.g. cool, omg, gosh, amazing, etc., combined with a shortened URL within the Tweet, and derives a weighting of how likely it is that the Tweet is spam.

The tool needs to be downloaded and run locally. The results are presented in a graphical user interface.

1.3 Aims and Objectives

The aims and objectives of this thesis result from the general motivation of the crowdSA project, but also from the given dataset and its scenarios. Since our example data, which concerns the flooding in Austria and Germany in 2013, reflects Tweets of various situations, e.g. immediate danger, validated information spreading, help requests, etc., and backgrounds, e.g. official institutions, emergency services, private affected people, etc., it is necessary to represent these motives in the resulting clusters. The objectives of the users might vary, and therefore configurability is an important requirement. As mentioned in the previous section, existing systems do not provide the desired functionality concerning the aggregation of geographically and semantically close Tweets. For that reason we aim to achieve better results, in the context of aggregation, than the CrisisTracker system.

Having these factors in mind, we came up with the following two main goals for this thesis:

1.3.1 Data Analysis and Definition of a Feature Hierarchy

In order to successfully process the data, it is necessary to analyze and understand its content, and therefore to extract and illustrate all facets of the dataset. As features we define those characteristics which represent the content-based, syntactic, and author-based information of a Tweet.

These objectives are defined concerning the data analysis:

• Feature Hierarchy should make aggregation configurable

• Features should cover different aspects of the dataset

• A focus is set on semantic and geolocation Features

1.3.2 Cluster Retrieval based on Semantic Enriched and Geolocated Data

After identifying the features of the data, similar Tweets need to be grouped. Therefore it is necessary to define a configurable threshold for identifying which Tweets belong together.

The objectives of this step include the following:

• Clustering operates with similarity matrices (calculated on feature similarities)

• Configurability and weighting of matrices should have an impact on the clustering results

Chapter 2

Dataset

In this chapter the data used for our flooding scenario is described. The chapter was composed together with Gerald Madlsperger [47], since it affects both of our theses.

2.1 Data

The data used in our project is real-life Twitter data, collected during the flooding in Austria and Germany in June 2013. The tweets were extracted with the tool CrisisTracker, which was developed at the University of Madeira [59]. For the extraction, the following keywords were used: Hochwasser, Flut, Donau, Linz, Passau, Wasserstand, Wasserpegel, Fluss, Überschwemmung, Überflutung, Sandsäcke, Unwetter, Regen, Starkregen. As the keyword list contains only German terms, we retrieved primarily tweets from Austria and Germany, but tweets from the Czech Republic, the Netherlands, etc. were found as well. Overall we collected 46,579 tweets within the time period of 03.06.2013 - 11.06.2013. CrisisTracker's main functionality is to aggregate tweets and form informative stories. The implementation works in two stages: first, a clustering algorithm forms TweetClusters. Over time those clusters could change, but instead of changing the clusters, CrisisTracker introduces Stories, which are the consolidation of belonging clusters. Those stories evolve over time. Our dataset showed the following statistics:

Tweets: 46,579
TweetClusters: 15,620
Stories: 11,814

At this point it should also be mentioned that 1,196 of these aggregated Stories consist of only one single tweet, which means that the actual number of informative stories is less than 11,814. The single-tweet stories emerge from the fact that CrisisTracker's clustering approach starts with every single tweet as its own cluster and merges clusters in subsequent steps.

During the review of the dataset, we identified several topics which comprise many distinct messages from different users, which marks them as very informative topics. Furthermore, those topics have several instantiations at different places and points in time. Some of them were also detected by CrisisTracker and are part of the 11,814 stories. They include the following examples:

• information about blockades and impassable streets (confirmed and unconfirmed information)

• breaking of dams

• problems with drinking water supply

• help needed for building up sandbag barriers

• generally many pictures were posted

Furthermore, we recognized that many public news or broadcasting stations were involved in the discussions about the above topics. Their messages often contained the word "Liveticker".

2.2 Scenarios

For developing and optimizing the results of the practical part of this thesis, we decided to extract five scenarios manually from the whole dataset, where a scenario corresponds to the instantiation of one of the above topics at a certain point in time and space. As mentioned above, we chose these scenarios based on their amount of distinct information, their significance for the user, and their evolutionary structure. As the scenarios were extracted manually, those attributes were not evaluated with specific metrics.

For evaluating information extraction algorithms it is customary to separate the dataset into a development set and an evaluation set. This is also called held-out evaluation and was described by Kanoulas et al. [36]. Therefore we also extracted five similar evaluation scenarios, which we excluded from the development data. As we are dealing with real-life data, equality in structure, size, and evolutionary behaviour compared to the development data could not be ensured. For us it was more important to have detailed datasets for the development phase than for the evaluation phase, as we hoped for better optimized algorithms with this strategy. Therefore we decided to take the more complete scenarios as the development set and the others as the evaluation set.

In the following subsections we present the regions of interest of the corresponding scenarios, which consist of distinct tweets with informative content. In fact, those scenarios contain duplicates and noisy data, such as emotional tweets about the situation, which are not useful for us at all. As a manual clustering of the whole dataset is not viable, we extracted the already mentioned regions of interest for the given topics. Furthermore, we provide the number of all tweets contained in the same time-span as the regions of interest. This gives an impression of the granularity the desired algorithms have to deal with.

2.2.1 Scenario: "Bridge Blockade"

This data contains Tweets about bridges which had to be closed due to the risk of flooding, including the consequences for the traffic situation.

2.2.1.1 Development Set

The development set contains data very similar to the evaluation set, but takes place in Dessau, Germany, and concerns the bridge "Friedensbrücke".

The scenario was extracted by an SQL query which defined the region of interest. Afterwards we checked how many other tweets occurred within the same time-span as our region of interest.

SQL-Query:
select * from socialmediacontent where text ilike '%Friedensbr%' order by entrytime;

Size of the region of interest: ∼ 21
Number of tweets in time-slice: 25200

2013-06-03 15:25:52  Katastrophenschutzstab dementiert Sperrung der Friedensbrücke für heute, 20 Uhr. Ob und wann könne nicht gesagt werden. #Hochwasser
2013-06-03 16:33:09  Verschont vom Hochwasser #FzC Besucht uns die kommenden Tag an der #Friedensbrücke http://t.co/muxkpX6Nop http://t.co/M4l4PL0aBb
2013-06-03 17:21:59  RT @mz dessau: Nun doch: Kleutsch und Sollnitz werden evakuiert. Friedensbrücke ist ab etwa 23 Uhr dicht. #Hochwasser
2013-06-04 03:27:27  RT @MDRaktuell: Kritische #Hochwasser-Lage in #Dessau: Friedensbrücke über die Mulde gesperrt
2013-06-04 05:57:13  Friedensbrücke in #Dessau jetzt auch für Fußgänger gesperrt. Von Osten geht es nur über die Autobahnabfahrt Süd in die Stadt. #Hochwasse
2013-06-04 09:03:45  Im MOment kommen Radfahrer und Fußgänger wieder über die Friedensbrücke in #Dessau. #Hochwasser http://t.co/JT1ndJxjjk

Table 2.1: Samples of the Region of Interest of the Development Set "Bridge Blockade"

2.2.1.2 Evaluation Set

The corresponding evaluation set is located in Dresden, Germany, where the bridge "Blaues Wunder" had to be closed because of the flooding. The tweets start with speculations about the bridge being closed within a few hours. After a few hours the bridge is closed, and a few days later it is reopened.

SQL-Query:
select * from socialmediacontent where text ilike '%Blau%Wund%' order by entrytime;

Size of the region of interest: ∼ 36
Number of tweets in time-slice: 36018

2013-06-03 12:38:29  RT @RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!
2013-06-03 13:12:00  #hochwasser #dresden blaues wunder ist IM MOMENT noch offen. Wird aber sicher nicht mehr lange dauern
2013-06-03 16:00:49  #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.
2013-06-03 19:48:04  #Hochwasser #Dresden: Schließung der Brücke Blaues Wunder für Autoverkehr wird vorbereitet, alle Infos: http://t.co/Vylgb4DdwU (kf)
2013-06-03 21:46:16  RT @MDRINFO: Erste Elbebrücke in Dresden gesperrt: "Blaues Wunder" nicht mehr für Autos befahrbar.
2013-06-04 07:07:00  Schillerstrasse leer wie nie. Blaues Wunder scheint wirklich gesperrt zu sein #hochwasser http://t.co/m48JrSoYPe
2013-06-09 08:22:21  #Hochwasser #Dresden AKTUELL: Blaues Wunder soll am Abend/in der Nacht bei Pegel unter 7,10m und nach kurzer Kontrolle freigegeben werden!

Table 2.2: Samples of the Region of Interest of the Evaluation Set "Bridge Blockade"

2.2.2 Scenario: "Sandbags"

This scenario contains data about collecting, filling, and placing sandbags to prevent damage caused by the flooding.

2.2.2.1 Development Set

The development set takes place in Halle, next to the river Saale. The first few tweets call for volunteers to fill sandbags. Afterwards those sandbags are delivered to the Gimritzer Damm dike. The last tweets report the successful protection.

SQL-Query:
select * from socialmediacontent where text ilike '%Halle%sand%' or text ilike '%sand%Halle%' or text ilike '%Halle%säcke%' or text ilike '%säcke%Halle%' or text ilike '%sand%Gimritzer%' or text ilike '%Gimritzer%sand%' order by entrytime;

Size of the region of interest: ∼ 50
Number of tweets in time-slice: 33325

2013-06-03 14:12:48  #Hochwasser: In #Halle läuft das Befüllen von Sandsäcken am Waldkater auf Hochtouren
2013-06-03 15:09:41  Weiss jemand bereits, wo man sich am morgigen Tag in der @stadt halle einfinden kann, um Sandsäcke zu befüllen o.ä.? #Hochwasser #LSA
2013-06-03 19:20:36  #Hochwasser Stadt #Halle ruft Freiwillige zur Sandabfüllstation am Hubertusplatz. Die Sandsäcke werden dringen am Gimritzer Damm gebraucht.
2013-06-04 01:11:09  #Saale-#Hochwasser: #Halle erwartet am Morgen 100.000 Sandsäcke aus Hannover. Jeder gefüllte Sandsack geht laut Stadt in den Gimritzer Damm
2013-06-04 03:00:51  100.000 Sandsäcke werden nach Halle gebracht. Die Saale hat 7,50 Meter überschritten http://t.co/3gJjirUXs7 #Hochwasser
2013-06-04 06:46:12  RT @ZDFmagdeburg: Hochwasser an der Saale in Halle: 100.000 Sandsäcke aus Hannover schützen den Gimmritzer Damm.

Table 2.3: Samples of the Region of Interest of the Development Set "Sandbags"

2.2.2.2 Evaluation Set

The evaluation set is again similar to the development set, but does not provide the detailed evolutionary content. During the first days, information about the need for sandbags is spread, but without saying where the sandbags are used. After five days there are again some tweets about the successful protection.

SQL-Query:
select * from socialmediacontent where text ilike '%Dresden%sand%' or text ilike '%sand%Dresden%' or text ilike '%Dresden%säcke%' or text ilike '%säcke%Dresden%' order by entrytime;

Size of the region of interest: ∼ 161
Number of tweets in time-slice: 38663

2013-06-03 13:22:11  RT @pandanananas: Hi. Irgendwer im Raum #Dresden hier, der noch'n paar Sandsäcke zur Verfügung stellen kann? Hier wird's knapp. #rt #follow...
2013-06-03 13:27:44  Die Jüdische Gemeinde sucht dringend Helfer/innen zum Sandsack füllen und stapeln! #Hochwasser #Dresden
2013-06-03 14:37:07  Das #Stadtteilhaus in der #Neustadt von #Dresden braucht ab 17:30 Uhr Hilfe um Sandsäcke zu stapeln! Kommt zahlreich, DANKE
2013-06-03 15:06:36  #Hochwasser #Dresden Rund 200 Freiwillige Helfer befüllen seit 14 Uhr Sandsäcke an der Straßenmeisteri Hansastraße. 8000 Säcke befüllt
2013-06-03 16:36:22  Hansastraße hat keine Sandsäcke mehr – Helfer werden erst in ein paar Stunden wieder gebraucht! #Hochwasser #Dresden
2013-06-08 12:55:35  RT @sandsteinpost: Hochwasser in Dresden geht zurück - Deiche und Schutzwälle weiter stabil. Zur Seite: http://t.co/3TJlf0O0Nu

Table 2.4: Samples of the Region of Interest of the Evaluation Set "Sandbags"

2.2.3 Scenario: "Drinking Water Supply"

This section concerns the impact of the flooding on the drinking water supply. As this is a rather rare scenario in central Europe, we were not able to extract a good evaluation set. This uniqueness makes the scenario special; therefore we decided to use it even without a good evaluation set, hoping to gather further insights for this project even if we cannot evaluate this special case.

2.2.3.1 Development Set

The development set contains rather unspecific data. The scenario concerns Munich, Germany, where the drinking water was chlorinated.

SQL-Query:
select * from socialmediacontent where text ilike '%München%trinkw%' or text ilike '%trinkw%München%' order by entrytime;

Size of the region of interest: ∼ 16

Number of tweets in time-slice: 26215

2013-06-03 13:57:06  #TelMi #telmi Nach dem anhaltenden Starkregen: Vorsorgemaßnahme für das Münchner Trinkwasser: München,... http://t.co/fvJdUZt4mr
2013-06-03 15:17:49  Chlor im Tee wegen #Hochwasser: RT @StadtMuenchen Hinweis der Stadtwerke München: Leichte Chlorung des Trinkwassers http://t.co/aqcpnityqC
2013-06-03 16:17:32  DTN Germany: Hochwasser im Live-Ticker - Jahrtausend-Flut: Stadtwerke in München chloren das Trinkwasser: Evak... http://t.co/zOyuQEmCxC
2013-06-04 07:52:38  Das #Hochwasser hat jetzt auch #München erreicht: das #Trinkwasser wird wohl unerträglich - #kalk!!!

Table 2.5: Samples of the Region of Interest of the Development Set "Drinking Water Supply"

2.2.3.2 Evaluation Set

The evaluation set provides information about problems with the drinking water supply in Passau, Germany. The data starts with speculations about shutting down the drinking water supply. Afterwards people start to panic-buy bottled water. In the end the water supply is actually shut down.

SQL-Query:
select * from socialmediacontent where text ilike '%Passau%trinkw%' or text ilike '%trinkw%Passau%' order by entrytime;

Size of the region of interest: ∼ 235
Number of tweets in time-slice: 31526

2013-06-03 12:39:11  #Hochwasser: Passauer Stadtwerke stellen #Trinkwasserversorgung ein http://t.co/HIufRF71QG
2013-06-03 12:45:50  Hochwasser in Deutschland: Trinkwasser in Passau soll abgestellt werden.. #tech http://t.co/fngidpWfOi
2013-06-03 12:48:40  Die #Hochwasser-Lage ist vielerorts weiter kritisch: In #Passau wird die Trinkwasserversorgung eingestellt. http://t.co/OFORkVXObG
2013-06-03 13:17:00  RT @msnde: Hamsterkäufe in #Passau: Trinkwasser wird abgestellt http://t.co/qZVES3Kw4h #hochwasser #bayern
2013-06-03 13:52:36  Trinkwasser-Versorgung in Passau eingestellt.
2013-06-03 14:29:38  Die Juni-Flut: Kein Trinkwasser mehr in Passau - Hamburger Abendblatt http://t.co/ZivTxT5F9J
2013-06-03 16:33:31  Passau: Kein Strom, kein Trinkwasser, Altstadt bis zum ersten Stock überflutet.

Table 2.6: Samples of the Region of Interest of the Evaluation Set "Drinking Water Supply"

2.2.4 Scenario: "Riverdams"

The river dams scenario is about situations where dams have to open their gates because of too much pressure induced by the masses of water.

2.2.4.1 Development Set

The development set takes place in Thüringen, Germany, and concerns the Bleiloch-Stausee dam. First there are speculations about opening the dam completely; in the end it was not opened, but a lot of water had to be drained, which affected other critical places downstream.

SQL-Query:
select * from socialmediacontent where text ilike '%bleiloch%' order by entrytime;

Size of the region of interest: ∼ 11

Number of tweets in time-slice: 28766

2013-06-03 13:31:20  #de99x #995ap Hochwasser in Ostthüringen : +++ Mehr Wasser aus Bleiloch-Stausee +++ MDR http://t.co/ZpvU0y3Pm0
2013-06-03 14:09:46  #Hochwasser #Thüringen Landratsamt SOK: Situation an #Bleiloch-Talsperre bleibt aufgrund der Wetterlage & Sperranlagen beherrschbar
2013-06-03 15:05:30  #de99x #994we Hochwasser in Thüringen: +++ Bleiloch-Öffnung abgesagt +++ Entspannung http://t.co/VncCGpakSd
2013-06-04 07:06:27  RT @mdr th: Betreiber der #Bleilochtalsperre muss mehr Wasser ablassen, weil Stausee zu voll ist- Dadurch Gefahr für #Ziegenrück. #Hochwasser
2013-06-04 07:31:50  RT @MDRINFO: #Hochwasser im #Saale Orla Kreis:Aus der #TalsperreBleiloch werden große Wassermengen abgelassen.

Table 2.7: Samples of the Region of Interest of the Development Set "Riverdams"

2.2.4.2 Evaluation Set

The evaluation set is very similar to the development set, but takes place at the Spremberg dam.

SQL-Query:
select * from socialmediacontent where text ilike '%spremberg%' or text ilike '%spree%talsperre%' or text ilike '%talsperre%spree%' or text ilike '%spree%schlamm%' or text ilike '%schlamm%spree%' order by entrytime;

Size of the region of interest: ∼ 24
Number of tweets in time-slice: 37277

2013-06-03 13:15:25  RT @lr online: #Hochwasser: Spree in #Spremberg steigt über die Ufer: http://t.co/PCBVSaNJ1G
2013-06-03 19:02:48  Hochwasser: Talsperre Spremberg muss geöffnet werden und braune Schlammflut rollt auf Spreewald zu http://t.co/O8WnWIHKzI #Tagebaufolgen
2013-06-04 04:16:03  #Hochwasser-Alarmstufe 4 für Spremberg. Experten planen mehr Wasser abzulassen. Die Folge: Brauner Eisenschlamm fließt in den #Spreewald.
2013-06-04 07:50:03  Durch das Hochwasser droht die Talsperre Bautzen derzeit überzulaufen. Was das für die Talsperre in Spremberg... http://t.co/x5oW7X4XSF

Table 2.8: Samples of the Region of Interest of the Evaluation Set "Riverdams"

2.2.5 Scenario: "Roadblock"

As the name suggests, this scenario is about roads which had to be closed because of water on the roads.

2.2.5.1 Development Set

The development set deals with the blockade of a smaller but important road called Holzhofgasse. After the blockade we found tweets about detours. A few days later the street was reopened.

SQL-Query:
select * from socialmediacontent where text ilike '%Holzhofgasse%' order by entrytime;

Size of the region of interest: ∼ 25
Number of tweets in time-slice: 33108

2013-06-03 13:29:23  RT @RadioDresden: #Hochwasser #Dresden PLS RT! Am Nachmittag wird mit Sperrung von Holzhofgasse und Blauem Wunder gerechnet!
2013-06-03 14:44:33  #Hochwasser #Dresden Ab 17 Uhr ist die Holzhofgasse gesperrt! Umleitung der Bautzner Straße via Königsbrücker und Stauffenbergallee
2013-06-03 14:57:39  17 Uhr: Die Holzhofgasse am Diako wird jetzt endgültig gesperrt. Umleitung über Königsbrücker und Stauffenbergallee. #Hochwasser #Dresden
2013-06-03 16:37:27  Blaues Wunder, Holzhofgasse, Laubegast: Die @DVBAG bereiten sich auf Umleitungen vor. http://t.co/USFtMF0P2f #hochwasser #dresden http://t.co/Q7PUfoNusV
2013-06-11 11:47:27  #Hochwasser #Dresden Aktuelle Info zur Holzhofgasse. Ab Donnerstagfrüh (3.30 Uhr) soll die Strecke wieder frei sein!

Table 2.9: Samples of the Region of Interest of the Development Set "Roadblock"

2.2.5.2 Evaluation Set

The second set is about the highway A9 in Germany. The first part of the information concerns the building of dikes to prevent the water from flooding the highway. In the end the road had to be closed partly, despite reinforcement of the dikes.

SQL-Query:
select * from socialmediacontent where text ilike '%A9%Dessau%' or text ilike '%Dessau%A9%' or text ilike '%Dessau%Autobahn%' or text ilike '%Autobahn%Dessau%' or text ilike '%Dessau%sperr%' or text ilike '%sperr%Dessau%' order by entrytime;

Size of the region of interest: ∼ 80
Number of tweets in time-slice: 39618

2013-06-03 13:51:34  #Hochwasser Autobahn: Die A9 bei Dessau bekommt einen Hilfsdeich http://t.co/bQIDVdiOY5
2013-06-03 13:58:24  Ab 18 Uhr gibt es Stau auf der A9 bei #Dessau: Autobahn bekommt einen Hilfsdeich. #Hochwasser http://t.co/Dz2Rkez1kp
2013-06-03 17:28:11  RT @MDR SAN: #Hochwasser: Bei #Dessau wird ein Deich verstärkt, um eine Überflutung der Autobahn 9 zu verhindern.
2013-06-04 04:31:45  #Hochwasser-Schutzmaßnahmen auf der #A9 zwischen #Dessau-Süd und Dessau-Ost. Autobahn deshalb teilweise gesperrt. http://t.co/Q7PUfoNusV

Table 2.10: Samples of the Region of Interest of the Evaluation Set "Roadblock"

Chapter 3

Concept

In this chapter we describe the conceptual design of the thesis' practical work. We introduce the main components and suggest methods for the implementation of a prototype. The chapter was partly composed together with Gerald Madlsperger [47], since it affects both of our theses.

3.1 Features

In order to later calculate similarity measures between Tweets, we first define features. Therefore we came up with the feature hierarchy shown in figure 3.1. The filled boxes in the graph are the ones suggested for implementation in the prototype. Reasons for the selection of the features are given in the following sections.

We mainly distinguish between Content Features, which concern the text of a Tweet, Location Based Features, which provide information about the geographical environment of the Tweet, and User Based Features, which deal with the author of the message.


Figure 3.1: Similarity Feature Hierarchy

3.1.1 Content Features

This type of feature is defined by the message of a Tweet. The message itself is divided into hashtags, links, and the remaining plain text, each of which has a different purpose within a message. Hashtags represent keywords, links point to additional information sources, and the plain text expresses an opinion or general statement of the author. We regard hashtags and links on the one side, and semantic and syntactic information of the whole text on the other side, as features, as described in the following sections.

3.1.1.1 Syntactic

Syntactic features characterize the structure of a sentence and can therefore be measured by linguistic analysis metrics. We describe one of the most relevant methods in the following.

N-grams: N-grams are commonly used when analyzing texts, as they give a good and simple possibility to compare words and phrases and thus calculate their similarity. N-grams can be formed on the basis of characters or words, bringing different opportunities. Character n-grams make it possible to find related words ignoring their grammatical position. Word n-grams, in contrast, bring the potential of finding phrases and negations within a text. For the German language it is recommended to use n-grams with a length of 4, since this has proven to give the best results [34].

Here is an example of how n-grams are calculated:

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.

The two Tweets above share, among others, the following character 4-grams: #Hoc, Hoch, wass, sser, Dres, #Dre, sden, Blau, aues, Wund, nder, Stad, tadt, gesp, sper, errt, ...

Similarity Method: Kondrak [38] suggests comparing all n-grams of two tweets with each other in order to find the longest common subsequence. His algorithm recursively matches all possible n-grams of all words in a text against each other and takes the maximum of the sum of subsequent equal n-grams in both strings. A further refinement is suggested in order to achieve a more accurate similarity for a pair of n-grams where some, but not all, letters match; here the number of matching letters is set in ratio to the length n.

In the above given example the longest common subsequence would be the following 11 4-grams: #Hoc, chwa, sser, #Dre, sden, Blau, esWu, nder, wird

This value is then set in ratio to the total number of 4-grams of the longer of the two Tweets to obtain a normalized value.
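As a minimal sketch under simplifying assumptions, the snippet below computes the share of common character 4-grams, normalized by the larger of the two 4-gram sets, instead of Kondrak's recursive longest-common-subsequence matching; the class and method names are illustrative and not taken from the project code.

```java
import java.util.HashSet;
import java.util.Set;

public class NGramSimilarity {

    // Extracts all character n-grams of the given length from a text,
    // ignoring whitespace as in the 4-gram example above.
    static Set<String> charNGrams(String text, int n) {
        String s = text.replaceAll("\\s+", "");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Ratio of shared 4-grams to the 4-gram count of the longer Tweet.
    static double similarity(String tweet1, String tweet2) {
        Set<String> g1 = charNGrams(tweet1, 4);
        Set<String> g2 = charNGrams(tweet2, 4);
        Set<String> shared = new HashSet<>(g1);
        shared.retainAll(g2);
        int maxSize = Math.max(g1.size(), g2.size());
        return maxSize == 0 ? 0.0 : (double) shared.size() / maxSize;
    }

    public static void main(String[] args) {
        String t1 = "#Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt gesperrt!";
        String t2 = "#Hochwasser #Dresden Blaues Wunder wird laut Stadt gesperrt!";
        System.out.println(similarity(t1, t2)); // similarity value in [0,1]
    }
}
```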

We will make use of n-grams in combination with part-of-speech tag similarity, as mentioned in the next section. More details on the implementation can be found in chapter 4.

3.1.1.2 Semantic

Named Entities Named entities are used to analyze the semantics of a message by classifying words into pre-defined categories, such as places, things, persons, etc. This also makes it possible to calculate the similarity of two messages by comparing their named entities. Commonly, an additional knowledge base is used to calculate the relationship between entities.

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.

In this example, "Dresden" would be classified by annotation tools, which are discussed later on, as the entity type "Location", and "Stadt" in Tweet 2 would also be labeled as "Location". To understand that these words are still semantically not the same, a knowledge base, e.g. DBpedia, can be used. This also allows finding structures in the entity relations. In our example, Dresden would be a subcategory of Stadt and therefore results in a moderate similarity value. If the entities are the same or synonyms, the similarity value would be 1.0.

Similarity Method: Bontcheva et al. [18] introduced an algorithm which finds named entities using the tool ANNIE and then matches the entities to DBpedia entries. The similarity is then calculated in different ways (string, structural, contextual).

This feature is not implemented in this work, but a named entity component is described in the thesis of Gerald Madlsperger [47].

Part-of-Speech-Tag Part-of-speech tags bring additional information which can make the similarity calculation of texts more accurate than when using syntactic features alone. The Stanford PoS tagger [64] classifies each input word into a lexical class, e.g. nouns, verbs, adjectives, etc. Concretely, this means that words that have the same type and also a close syntactic similarity are more likely to have the same semantics. [19]

Here is an example of how PoS tags are compared:

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.

In this example, for each word in Tweet 1 the maximal similarity to all words in Tweet 2 with the same PoS tag is taken.

w = gesperrt, PoS-tag = main verb
main verbs of Tweet 2 = {wird, gesperrt}

This means that for the word "gesperrt" in Tweet 1, which is a main verb, all words of the same type are taken from Tweet 2. Since "gesperrt" is also in the set of main verbs of Tweet 2, the maximal similarity in this example is 1.0.

Similarity Method: Xie et al. [68] propose to calculate a maximal similarity only between words which have the same PoS tag, using the following formula, where $w$ denotes a word in Tweet $T$ and $pos(w)$ returns the PoS tag of word $w$:

$$\mathit{maxSim}(w, T_i) = \max\{\, \mathit{sim}(w, w_i) \mid w_i \in T_i,\ pos(w_i) = pos(w) \,\} \tag{3.1}$$

The similarity of the text elements which belong to the same PoS tag is calculated with the help of n-grams, as explained in section 4.2.1.2, and cosine similarity. Zhang et al. [72] also use cosine similarity for the PoS-tag similarity calculation, by forming a vector of the occurrence counts of the tags and applying the formula below, where $T$ is the vector of the tag occurrences of Tweet 1 and $w_k$ is one particular tag occurrence in $T$:

$$\cos(T, T_i) = \frac{\sum_{k=1}^{m} w_k(T)\, w_k(T_i)}{\lVert T \rVert \cdot \lVert T_i \rVert} \tag{3.2}$$
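A minimal sketch of equation (3.2), assuming the PoS tags have already been produced by a tagger such as the Stanford tagger; the class, the method names, and the example tag sets are illustrative only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PosTagCosine {

    // Builds a tag -> occurrence-count vector from a tagged tweet.
    static Map<String, Integer> tagCounts(List<String> posTags) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : posTags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two tag-occurrence vectors, cf. equation (3.2).
    static double cosine(Map<String, Integer> t1, Map<String, Integer> t2) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : t1.entrySet()) {
            dot += e.getValue() * t2.getOrDefault(e.getKey(), 0);
        }
        double norm1 = 0.0, norm2 = 0.0;
        for (int v : t1.values()) norm1 += v * v;
        for (int v : t2.values()) norm2 += v * v;
        if (norm1 == 0 || norm2 == 0) return 0.0;
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        // Hypothetical tag sequences for the two example tweets above.
        Map<String, Integer> t1 = tagCounts(List.of("NN", "NN", "ADJA", "NN", "VAFIN", "VVPP"));
        Map<String, Integer> t2 = tagCounts(List.of("NN", "NN", "ADJA", "NN", "VAFIN", "VVPP", "ADV"));
        System.out.println(cosine(t1, t2)); // similarity value in [0,1]
    }
}
```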

More details on the implementation can be found in chapter 4.

3.1.1.3 Hashtag

Hashtags are treated in a section of their own, since they do not follow the usual structure of text messages. They consist of single or artificially combined words, which makes analyzing their syntax and semantics a challenge. As the methodology for hashtags differs slightly from the methods described in the previous section, we discuss them in more detail here.

Count The number of hashtags delivers information about the quality of a Tweet: the more hashtags are used, the more reflection on the content was required. Since this is not the main focus of this project, this similarity feature will not be implemented.

Syntactic Content It makes sense to use a syntactical comparison of hashtags, since they mostly contain topic words. Thus, it is often not necessary to make use of stemming, and the syntactic similarity can be calculated rather quickly.

Similarity Method: Since hashtags are rather short words or phrases, cosine similarity seems to be an appropriate method of comparing their syntactical content. This can be implemented similarly to Zangerle et al. [71], e.g. by calculating the Levenshtein distance, which counts the letter edits necessary to turn hashtag 1 into hashtag 2, as sketched below.
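A minimal sketch of the Levenshtein comparison mentioned above, assuming plain lower-cased hashtag strings; the class and method names are illustrative.

```java
public class HashtagDistance {

    // Classic dynamic-programming Levenshtein edit distance between two hashtags.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // 0 edits means identical hashtags; larger values mean less similar strings.
        System.out.println(levenshtein("hochwasser", "hochwasser")); // 0
        System.out.println(levenshtein("hochwasser", "wasser"));     // 4
    }
}
```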

Yet hashtags have a large semantic importance, and therefore it is more useful to use the semantic similarity approach mentioned below. A combination of semantic and syntactic features would deliver the optimum in similarity analysis, as suggested by Bansal et al. [10], who first segment the syntactic content of a hashtag and then apply semantic analysis. However, this would exceed the scope of this thesis, which is why we focus on the implementation of semantic hashtag analysis only.

Semantic Content Hashtags deliver short and concise information about the topic of a Tweet. Yet similar Tweets do not necessarily use the same hashtags, and therefore it is necessary to make use of a semantic hashtag analysis.

Similarity Method: Moreno introduces in his PhD thesis [65] a framework which uses the WordNet1 ontology in order to calculate a semantic similarity. An ontology-based semantic relatedness measure, such as the Wu & Palmer measure [67], can be used to obtain an actual similarity value.

1 http://wordnet.princeton.edu/
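For reference, the Wu & Palmer measure, which is not spelled out in this chapter, relates two concepts $c_1$ and $c_2$ through the depth of their least common subsumer (lcs) in the ontology's taxonomy:

$$\mathit{sim}_{WP}(c_1, c_2) = \frac{2 \cdot \mathit{depth}(\mathit{lcs}(c_1, c_2))}{\mathit{depth}(c_1) + \mathit{depth}(c_2)}$$

Synonymous concepts mapped to the same synset obtain the maximal value of 1.0, which matches the hashtag example below.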

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Überflutung #Deutschland #Landunter auch in #Dresden!

In this example two hashtags are used which are synonyms of each other. #Überflutung and #Hochwasser are recognized as synonyms in the German version of WordNet, GermaNet2, and therefore have a similarity of 1.0.

More details on the implementation can be found in chapter 4.

3.1.1.4 Linking

This feature is divided into URLs which point to external websites and URLs which link to media files. We differentiate between those because the linking target, either text or media content, offers different kinds of information and opportunities. Multimedia data often contains a visualization of the current situation, whereas text content rather provides reports or literature information.

External URL These URLs, in contrast to internal URLs, which in the context of Twitter represent Retweets, point to content on other websites. Considering these external URLs as a feature is rather useful, since Tweets containing the same URL are likely to have related content. A problem with URLs on Twitter is that URL shorteners are used, which makes comparing the URL text difficult. Yang et al. [69] introduced an approach where this drawback is overcome by using the Longurl API3. This tool not only expands the URL, but also provides information such as the website title, meta description, and content type.

Similarity Method: N-grams or cosine similarity can be applied to the meta description and the expanded URL. As this feature does not provide any required additional information gain for our project, it will not be implemented.

3 http://longurl.org/

Multimedia URL Since only the URL is available for comparing images and videos, most of the similarity calculation methods mentioned in the External URL section also apply here. Yet it is possible to identify whether the URL includes multimedia data or refers to a webpage, such as the frequently used multimedia upload tools, e.g. TwitPic4. Since the Tweets in our dataset contain rather few multimedia elements and this feature does not provide any necessary additional information gain for our project, it will not be implemented.

4 http://www.twitpic.com/

3.1.2 Location Based Features

Generally there are two ways of calculating a location-based similarity metric. The first method uses content similarity approaches on the location name. The second method uses the GPS coordinates stored in the metadata of a Tweet. Unfortunately, according to Rogstadius et al. [59], only 1% of Tweets have location information in their metadata; therefore we need to enrich the Tweets with location data retrieved from the content.

3.1.2.1 Content Location

Location information can also be extracted from the content of a tweet; therefore it is necessary to identify location entities in a preprocessing step. The location names can then be compared with the content similarity approaches mentioned above. A special case for this category are locations mentioned in hashtags. Since it is common to hashtag a location belonging to the content of the tweet, it might be useful to give more importance to those. More details on the implementation can be found in chapter 4.2.1.3.

3.1.2.2 Meta Data Location

The simplest way of extracting the location information of a Tweet is from the metadata. The exact GPS coordinates of the place where the tweet was posted are stored there and can be used for comparison.

Similarity Method: With the Haversine distance [66] it is possible to calculate the shortest distance between two points on the earth's surface, modeled as a sphere, where each point consists of a latitude and a longitude value. More details on the implementation can be found in chapter 4.2.1.1.
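A minimal sketch of the Haversine calculation, assuming a spherical earth with a mean radius of 6371 km; the class and method names are illustrative.

```java
public class GeoDistance {

    private static final double EARTH_RADIUS_KM = 6371.0; // mean earth radius

    // Haversine great-circle distance in kilometers between two lat/lon points (degrees).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        // Distance between Linz (48.3069, 14.2858) and Passau (48.5665, 13.4310), roughly 70 km.
        System.out.println(haversineKm(48.3069, 14.2858, 48.5665, 13.4310));
    }
}
```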

3.1.3 User based features

These features are based on the assumption that users who have similar characteristics are likely to talk about similar things on Twitter. [15]

3.1.3.1 User Location

This takes the location mentioned in a user profile into account and implies that users living in the same area have a common interest in disasters happening there; therefore the tweets of those users might be clustered together during emergency situations. The reason for splitting the user location and the Tweet location results mainly from the reflection that the location in the content varies frequently, whereas the home location in the user's profile is rather a long-term setting. Since this feature is not in the focus of this thesis' topic, it will not be implemented.

3.1.3.2 Profile Meta Data

Similar users and their interests might be detected by comparing metadata, such as the description in a user profile. For the description text, the similarity can be calculated like the normal content-based features. Since this feature is not in the focus of this thesis' topic, it will not be implemented.

3.1.3.3 Number of Tweets, Retweets and Followers

The numbers of tweets, Retweets, and followers mostly give information on the quality and trustworthiness of the user rather than on the similarity of the information found in the Tweets. Therefore these features will not be looked into in more detail and will not be implemented.

3.2 Architecture

In this section we present an overview of the overall pipeline architecture, which was composed together with Gerald Madlsperger [47]. Package diagrams for the whole pipeline and a class diagram for the abstract execution pipeline are provided. As all components described later in this chapter are based on this architecture, they have a similar style.

In order to get an overview of the whole architecture, we provide, together with Gerald Madlsperger [47], figure 3.2, which also illustrates which components are discussed in which thesis.

Figure 3.2: General Architecture

First of all we need a database which contains the dataset described in the previous chapter. The first component of the pipeline is called PreProcessor and is responsible for annotating the data with additional information. This additional information is used by the following components to compute the desired results. After the PreProcessor, the component called Aggregation is executed. This component combines similar Tweets into bigger clusters with the help of configurable similarity features. In the next step we want to identify real-world objects which are discussed within the tweets. For this purpose the Object Extraction component is introduced, which identifies objects with the help of natural language processing (NLP) tools. The last part of the pipeline is the Evolution Analysis component. It calculates relations of aggregates and objects over certain periods of time. Every component stores its results in the database. After a successful execution of the pipeline, the data is ready to be read by the Visualization component and to be presented to the end user.

Figure 3.2 already depicts that we are dealing with separate components, executed one by one. Furthermore, we identified modularity and interchangeability, but also performance aspects, as the main architectural requirements. Therefore we decided to use a pipeline architecture which was already used for the partner project CSI by the CIS Institute [12]. The pipeline architecture ensures modularity and, with the help of software design patterns like the strategy pattern, interchangeability. The performance requirements are also supported by the modularity, as it enables parallelization of the execution. The communication between the blocks within one pipeline is done by data access objects, which are simple Java beans containing the parameters needed for the execution of the follow-up pipeline block.

Part of this architecture is also the high degree of freedom in combining and nesting the components. Because of the generic interfaces only the type of the data access objects has to be known and passed from one component to the other in order to combine them. Further, pipelines can be nested; for example, the later introduced controller pipeline combines several execution pipelines which are doing the actual job. The work of the controller is to configure the execution pipelines according to the configuration files but also to parallelize them.

Figure 3.3: Overall Pipeline

Figure 3.3 shows the structure of the whole pipeline and the interaction of the single components. The controller triggers the components in the correct order, as some of them may rely on data of other ones. The details about every component and their inputs and outputs are discussed in the next section.

Figure 3.4: Abstract Pipeline Architecture

Figure 3.4 on the other hand shows the internal structure and generic architecture of every component. Each should provide one class which implements the Pipeline interface and starts the internal processes of the component. Therefore it takes configuration input in form of DataAccessObjects. In Figure 3.4 those DataAccessObjects are encoded as generic Objects. As soon as the components become concrete, they specify new DataAccessObject classes, which are java beans containing the necessary information for execution.
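To make this generic structure more tangible, the following minimal Java sketch shows how such a Pipeline interface and a data access object could look. All names (Pipeline, PreProcessorDao, Tweet) are illustrative assumptions and not the project's actual classes.

import java.util.List;

// Hypothetical generic pipeline block: takes one data access object
// as configuration/input and returns the one for the follow-up block.
public interface Pipeline<I, O> {
    O execute(I input);
}

// A simple java-bean acting as data access object between two blocks;
// Tweet stands in for a project-specific content class.
class PreProcessorDao {
    private List<Tweet> tweets;        // content to be annotated
    private String configurationPath;  // parameters from the configuration file

    public List<Tweet> getTweets() { return tweets; }
    public void setTweets(List<Tweet> tweets) { this.tweets = tweets; }
    public String getConfigurationPath() { return configurationPath; }
    public void setConfigurationPath(String path) { this.configurationPath = path; }
}

// Minimal stand-in for the project's Tweet class.
class Tweet { }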

3.3 Components

In this section every component is shortly described. Furthermore the input, output and possible frameworks or algorithms for implementation are listed. Of course the implementation should follow the previously shown architecture. This is shown within package diagrams and also class diagrams for every single module.

3.3.1 Controller Component

The controller should be a mediator between the user interface and the other components described within this chapter. It is initialized by the user interface and executes the different processing pipelines according to the configuration. Figure 3.5 shows the execution order of the components we used for our implementation from left to right. The order is defined by the output the components are creating. For example the output of the preprocessor is needed for all following components, and the output of the message aggregation component is needed for the object extraction, and so forth.

Furthermore the controller component should be able to initialize the data sources which are used for the execution. For this work we will only use the static data, which was gathered during the flooding in 2013 in Austria. However, it is flexible enough to trigger a twitter-api wrapper which fetches live data according to given parameters. Therefore the architecture of the controller is open for extensions in this direction, which is given by the pipeline architecture used for every single component in this project.

Input The controller needs configuration parameters in form of a configuration file for execution. This configuration file contains the desired values of all configurable parameters, like clustering method or clustering features, length of time-slices, algorithms used for evolution and object extraction, and so forth. The actual content of this configuration file is not part of the concept chapter, but of the implementation chapter.

Output The controller does not provide an output; it just executes the other components and handles possible errors by itself or forwards them to the user interface if user interaction is necessary.

Diagrams Class diagram of the controller architecture.

Figure 3.5: Class Diagram of Controller Architecture

3.3.2 PreProcessor

The pre-processing component is the first step in the architecture. Here the Tweets are taken from the database and processed according to the needs of all other components. There are three main sub-components, which include geo-, time- and topic-fencing, each editing the data to its own requirements. This component uses external tools which help to gain additional information on the stored data; e.g. locations are mapped and corresponding entities are found.

Input As an input the preprocessor takes the unprocessed Tweets fetched from Twitter and stored in the database.

Output This component gives four different types of outputs, all of which represent annotations for tweets. Those are POS-, Temporal-, Spatial- and NamedEntity-tags.

Diagrams Class- and package diagrams of the preprocessor component architecture.

Figure 3.6: Package diagram of preprocessing component

Figure 3.7: Class diagram of preprocessing component

3.3.2.1 Geo Fencing

The need for a geo fencing component (Geolocator) arises because, according to Rogstadius et al. [59], only 1% of the tweets contain geographical coordinates in their meta data. For our system it is crucial to have this information, and therefore this component is responsible for enriching each Tweet with location information retrieved from the content.

Algorithms We will be using the geo fencing component which we implemented in our practical work [23]. Michael Jahn [33] implemented a tool which retrieves geo-information from the Tweet content on a basic level. We used this program and adapted it to our needs.

In the solution of Michael Jahn, a so called 'oe citylist' list, which includes only major Austrian cities, was stored in a DB. The Tweet keywords are matched against these cities. This idea could be extended by storing all found locations with their corresponding longitude and latitude and thereby implementing a kind of caching strategy. This would decrease the number of API requests, which has a positive impact on the performance.
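Such a caching strategy could be sketched as follows; the geocode method and the class name are assumptions, with geocode standing in for the actual Geonames API request.

import java.util.HashMap;
import java.util.Map;

// Illustrative cache: coordinates of already resolved toponyms are
// stored so that repeated mentions cause no further API requests.
public class LocationCache {
    private final Map<String, double[]> cache = new HashMap<>();

    public double[] resolve(String toponym) {
        return cache.computeIfAbsent(toponym, this::geocode);
    }

    private double[] geocode(String toponym) {
        // Placeholder for the Geonames web service call;
        // would return {longitude, latitude} of the toponym.
        return new double[] { 0.0, 0.0 };
    }
}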

Another desired functionality of the geo locator is the possibility to store multiple locations per Tweet in our database, as there are many Tweets referring to multiple location names, e.g.

Tweet 1: The water in #Passau is rising, but it's not as bad as in #Schärding yet!

For this case, in the current solution of Jahn only the first appearing location is detected and stored; in the given example this would be Passau. For each Tweet the text is analyzed and all nouns and hashtags are matched with Geonames. The first occurring location is taken and its coordinates are stored in the database. We want to improve this behavior by analyzing all mentioned locations within a Tweet and identifying the most defining one.

Algorithms Various literature exists on ambiguous geographical annotations. Peng et al. [53] suggest a model based machine learning approach, which requires a rather large training dataset. The results are then mapped to a location ontology. For our project, a machine learning approach would exceed the scope, and therefore we decided to focus on a simplified version. For each Tweet multiple locations will be recognized and matched to the Geonames ontology [22]. The decision which of the locations' coordinates will eventually be stored is made based on the level of the concept hierarchy in the ontology. Further details will be explained in the implementation chapter 4.1.2.

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Elberadweg wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

In this example Dresden and Elberadweg are considered as locations. Since Elberadweg is a more precise location and a child element of the city Dresden, the coordinates of Elberadweg would be chosen for the location annotation.

As shown in Figure 3.8 there is the possibility to locate single Tweets but also a whole cluster. For a whole Event Cluster a convex hull over all its Tweets and a center point are calculated.

Figure 3.8: Class diagram of the TweetLocator
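A minimal sketch of locating a whole cluster with the JTS Topology Suite, which is also used for the similarity calculation in chapter 4.2.1.1; the coordinates are made-up sample values.

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;

public class ClusterLocator {
    public static void main(String[] args) {
        // Made-up coordinates of the Tweets within one event cluster.
        Coordinate[] tweetLocations = {
            new Coordinate(13.43, 48.57),
            new Coordinate(13.44, 48.31),
            new Coordinate(13.49, 48.42)
        };
        GeometryFactory factory = new GeometryFactory();
        Geometry points = factory.createMultiPoint(tweetLocations);

        Geometry hull = points.convexHull();   // convex hull over all Tweets
        Point center = hull.getCentroid();     // center point of the cluster
        System.out.println("hull: " + hull + ", center: " + center);
    }
}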

3.3.2.2 Time Fencing

This component is interesting, since the meta-data timestamp does not always correspond to the actual time when the event was happening.

Since this component is not essential for our prototype and the implementation of existing tools and algorithms would exceed the scope of this work, Time Fencing is declared as Future Work. Further information can be found in chapter 7.

3.3.2.3 Topic Fencing: Part-of-Speech Tagging

A part-of-speech (PoS) tagger is used in order to analyze the structure of a text. Not only the grammatical tags are found, but it is also possible to understand the relationship between words and phrases.

Algorithms There exist well established PoS-Taggers for multiple languages, yet it is a rather new challenge to deal with microblogs such as Twitter. Still, there exist some research and development projects for Twitter PoS-Taggers, such as the GATE plugin TwitIE [14]. This project contains an extended Stanford tagger [64], which is able to recognize Twitter specific tags such as 'Retweet'. Also abbreviations and shortened writings, e.g. "2moro", "lol", "luv", etc., are common in Tweets and have been added to the vocabulary of the TwitIE tagger.

Basically TwitIE works with English texts only, yet the authors offer a possibility to adapt the project to German or Spanish, since those languages are supported in GATE itself. It is necessary though to train the tagger on German Twitter datasets.

3.3.3 Aggregation Component

In the aggregation component the goal is to cluster similar Tweets to so called Events. The list of Tweets is analyzed for several given similarity metrics, upon which the aggregation algorithms are then applied. To support the further processing in the Object Extraction and Evolution Analysis components, keywords for the found clusters are also identified and stored.

Input As an input the aggregation component takes, apart from the original collection of Tweets, also the temporal and geolocation annotations.

Output The aggregation component delivers Events as an output. An event is a cluster of Tweets that fulfill a defined similarity. Further, the event also provides a list of keywords that are descriptive for this cluster.

Diagrams Class- and package diagrams of the aggregation component architecture.

Figure 3.9: Package diagram of aggregation component

Figure 3.10: Class diagram of aggregation component

3.3.3.1 Similarity Framework

The similarity framework takes the set of Tweets and their time and geolocation annotations as an input and calculates the similarity metrics. The features which should be used for the similarity calculation are selectable through a configuration file via the visualization component.

In our system we decided to use affinity matrices for taking all similarity features into account. In the end, one value should express the similarity of two Tweets.

Algorithms The details on the algorithms used for calculating the similarity metrics can be found in chapter 3.1.

3.3.3.2 Cluster Framework

The clustering framework takes the Tweets of one timeslice and the similarity values of each pair, in the form of a matrix, as an input. On this data it is possible to apply different clustering algorithms. As an output this component returns Event Clusters.

Algorithms There exist various clustering libraries with which the desired and best fitting algorithms can be executed. We will compare them and decide upon one framework in chapter 4.2.3. Concerning specific clustering methods, Affinity Propagation [40] is suggested, as explained in chapter 4.2.2. Further, Hierarchical Clustering [30] will be used for experimenting.

3.3.3.3 Keyword Extraction

The component Keyword Extraction is useful for the final visualization of the results, as it gives a brief overview of the topic within a cluster. Further it is interesting to compare the vocabulary, which is used in different Tweets of one cluster.

Algorithms Keyword Extraction in general is often done by calculating a term frequency, such as tf*idf [9]. The weighting, and therefore the size of one term in a Tag Cloud, results from its frequency within the input text. As there exist various Java libraries for extracting keywords and creating a Tag Cloud, e.g. OpenCloud5, one of them might be useful for application.
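As an illustration of the frequency-based weighting, and not of the OpenCloud API itself, a plain term-frequency count over the concatenated Tweet texts of one cluster could look as follows; terms with higher counts would be rendered larger in the Tag Cloud.

import java.util.HashMap;
import java.util.Map;

public class KeywordExtractor {
    // Counts how often each term occurs within the cluster text;
    // tokens are split on non-letter characters to handle umlauts.
    public static Map<String, Integer> termFrequencies(String clusterText) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : clusterText.toLowerCase().split("[^\\p{L}]+")) {
            if (token.length() > 3) {          // skip very short tokens
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }
}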

3.3.4 Object Extraction Component

This part is needed for the overall system, but will not be described in this work. It involves the step after the aggregation component, where objects are identified for each Tweet cluster. See the Master's thesis of Gerald Madlsperger [47] for more details.

3.3.5 Evolution Analysis Component

This part is needed for the overall system, but will not be described in this work. The component finds temporal and evolutionary connections between aggregates and objects, which are extracted in the previous components. See the Master's thesis of Gerald Madlsperger [47] for more details.

5http://opencloud.mcavallo.org/

3.3.6 Visualization Component

This section describes the visualization part of this project, which was developed together with Gerald Madlsperger [47]. It is used to visualize the results of the other components on the one hand, but also to find useful visualization techniques for the whole project. Therefore we will take a look at some other work in this field before we introduce our own visualization concept.

3.3.6.1 Theory

A lot of work has already been done in the field of visualization for crisis, news and twitter data; most influencing for this thesis was the work of MacEachren et al. [44], Rogstadius et al. [59], Inoue et al. [32] and Ye et al. [70]. Nevertheless, not all of those concepts are useful for reaching our goals, which are the compact visualization of aggregated twitter information, which we call episodes, in the context of space and time. The user of the interface should be able to find important respectively urgent situations and track their evolution or development over time. A lot of the existing tools, which we mentioned within the introduction chapter 1, concentrate on the temporal behavior only, like Topsy [41], which shows two-dimensional graphs representing the amount of tweets over time. Others are focused on the spatial context, like CrisisTracker [59], where all data or the data for a certain timespan is visualized within one geo-map. But to show the results of all our components we have to combine both spaces. In the following we will present some of the most important concepts with their advantages and disadvantages.

Trajectory-Oriented view The simplest way to visualize spatio-temporal data for the user is static geomaps with trajectories as marks, where the map represents the spatial space and the trajectories the temporal one, as presented in the work of Andrienko [7]. However, as stated by Andrienko, this kind of trajectory-oriented view is not applicable for huge amounts of data without previous aggregation.

This view does not need user interaction to provide all the information and is therefore easy to understand; of course user interactions can be integrated to get detailed information. For our application this kind of visualization is not ideal, as the data is huge in its size on small geographical areas even after aggregation. The cause of this is hidden in the nature of our data: it is user generated and is not bound to rules except its length. Therefore the variety and size of unrelated data points is not predictable.

Of course for future work the data can be filtered with the help of a quality detection component, but this is not part of this work.

Small Multiples: Another way to visualize both the spatial and the temporal aspect is to show static maps next to each other, where every map corresponds to one point in time, or to one time-slice in our case. In the work of Ye et al. [70] this kind of visualization was reduced to two parallel maps only and is called bi-temporal view. In the field of information visualization this provides the possibility to explore two situations at different times simultaneously and compare them to each other. Therefore the operator of the visualization is able to detect changes and anomalies fast. The downside is that the information which should be visualized is detailed, and as soon as the amount of parallel maps rises, the amount of information will do so too, and therefore the possibility to detect anomalies in it will decrease [8].

Nevertheless we decided to use this kind of visualization as the basic view for exploring the processed data. In our case we decided for three synchronized maps, where the center one corresponds to the current time-slice. The others correspond to the time-slices before and after. We added some user interaction so that the operator is able to travel through time by clicking the previous or following map, making it the current time-slice.

Animation: A very popular way of presenting data over time are animations. They are very natural for the operator, as the time axis is also presented in form of time. According to Archambault et al. [8], animations are very good for clustering similar moving objects independent of their distance. But they are lacking in the detection of changes over time, because a direct comparison, as small multiples provide it, is not possible.

For later implementations of our prototype, animations could be used for showing the movements over the whole data set in time lapse. For showing the results of our components this kind of visualization is not useful, as the comparison possibilities are not supported.

Timemap: Another useful view of spatio-temporal data is called timemap, which combines and synchronizes the concepts of geo-maps, known from google-maps [25] for example, and SIMILE time-lines [50]. Hsu et al. [29] and Inoue et al. [32] used those visualizations in their systems to provide an overview of relations over historical data.

The advantage of this kind of view is the combination of an overview with the possibility to dig into details. Therefore it can be used to compare objects between certain time-slices but also to detect similarities in movements of different objects over longer time periods. The disadvantage is that the view gets overloaded very fast when big amounts of data have to be handled.

Therefore we used this kind of view to follow the movements of a certain object over time and not to compare the movement of several objects.

3.3.6.2 Prototype of the User Interface

Site Map The visualization component consists of four main pages. The home screen, which will be described in one of the following paragraphs, consists of two page variations. Initially the Historical Data Home screen is shown, but it should be possible to switch to the Live Data Home screen in future implementations. The first screen is used to present recorded data, which is already analyzed and stored in the database. The second one enables the user to start the whole processing pipeline using live data. The processing of live data is not part of this work; because of that, only the Historical Data Home screen is implemented.

Nevertheless, in both cases a button links to a Result page called Timeslice Details. This screen presents the results of the aggregation component, which we already described as events. If this page is accessed from the Live Data Home screen, a pop-up will appear showing the streaming status. After the streaming and calculation for the first time slices is finished, the Result page is shown. The events visualized within this first result screen can be clicked and redirect to a detailed result page for the selected event. This page is called Event Details and shows the outcome of the object extraction component, known as episodes. The basic structure of this page is similar to the Timeslice Details screen and is further described in one of the following paragraphs. To enable the user to explore the whole evolution of an episode found in this view, links to another screen called Episode Timeline are provided for every episode.

Figure 3.11: Site Map of the Prototype Visualization

Home - Live Data (Figure 7.2) The home screen offers two possibilities to deal with the data. If the option "Live Data" is chosen, by a toggle button changing between live and historical data, the Tweets are streamed from Twitter starting at the current point in time. Therefore it is not possible to fetch any past data. You can follow the screen description with Figure 7.2 for better understanding.

For the fetching of the data it is possible to set a time slice size. As soon as a time slice has ended, it is visualized in a Result Page. Further, it is possible to select whether only German tweets or also all other languages should be shown in the Timeslice Details. Generally, for the Episode Extraction only German tweets are taken into consideration.

Filter options are available to narrow the result set. On the one side it is possible to enter keywords which should be used by the Twitter Streamer. Further, a geographical polygon should be taken into account in the preprocessing steps of the result data. This means that only Tweets which are located within this area are considered for the Event Clustering and Episode Extraction.

Figure 3.12: Home Page of Live Data Usage

Home - Historical Data (Figure 3.13) If historical data is chosen, it is possible to select an already existing data set. In this implementation of the project we will focus on this option for analyzing the data. Generally this means that the Event Clustering and Episode Extraction have already been done for all time frames in the data set.

Figure 3.13: Home Page of Historical Data Usage

Timeslice Detail (Figure 3.14) On this page the result of the Event Clustering and Event Evolution is displayed. It includes a list of all events and their key facts, shown in Figure 3.14 at the bottom. Further, the time slices will be visualized with Small Multiples, either within a CoverFlow or three of them next to each other, where it is possible to pass backwards or forwards. The concept of Small Multiples was already explained within section 3.3.6.1. A CoverFlow is an interactive visualization of pictures and enables the user to visually navigate through them. Instead of the pictures we could show the geo-maps. This enables the user to experience the outcome of the evolution component. The visualization of events is done by showing the centroids of each event on a map per time slice as so called markers [25]. The event markers, representing the centroids, are clickable and forward to a detail page of the selected event.

Figure 3.14: Result Page of an Event Clustering Timeline

Event Detail (Figure 3.15) On this page the details for an event are shown, including its episodes, which are the outcome of the object extraction component.

On top you can see a map visualization containing a polygon illustrating the surrounding of an event, and star-objects, which display all episodes that were found within the event. As you may have noticed, there are again maps for preceding and succeeding time slices displayed in Figure 3.15 in form of a Coverflow. These are only shown if similar clusters and episodes on succeeding or preceding time slices were found. The star-objects will be color-encoded so that the user can track which episode evolved in which way.

Underneath the maps, detailed statistics for an event, such as size, duration, area size and a list of all tweets, are visualized. Further, a tag cloud with the most common keywords is shown.

At the bottom a list of episodes is provided, containing information about the content in form of the extracted event phrases and the quantity of tweets. As soon as the user selects one of the episodes, the tweets table under the episodes table is populated with all tweets contained by the selected episode.

Figure 3.15: Result Page of an Event Cluster

Episode Timeline (Figure 3.16) Within the Event Detail screen it is possible to get further details on the evolution of a specific episode. Each episode will provide a link to the Episode Timeline screen. There the evolution of this episode will be displayed with the help of a timemap, which was already described in section 3.3.6.1. Instead of a mockup we will present a real screen shot for this component, as this kind of visualization could not be characterized fully with the concepts of the mockup tool.

Figure 3.16: Result Page of an Episode Timeline

3.4 Data Model

This section deals with the adaption of the database, which was created for our system together with Gerald Madlsperger [47]. We will also explain the changes and design decisions which were made during the work. The data-layer is based on the CrisisTracker implementation and was only extended to meet our requirements.

3.4.1 Prerequisites

The project CSI relies on a data-layer modeled with the tool Visual Paradigm, which is able to generate the whole hibernate layer [11], the interface between the java based code and the database. We decided to follow the same approach for this work. As we already stated in the architecture chapter, the system was designed for flexibility. Therefore we decided to split our data model into a CrisisTracker layer and a CrowdSA layer, where the first layer is only a copy of the original data schema of CrisisTracker, but modeled with Visual Paradigm for generating the hibernate mappings. The CrowdSA layer is also based on the CrisisTracker schema but provides the extensions we need for the improved functionality of our system. The new schema is able to store the annotations created by the preprocessor; further, the episodes had to be modeled, and the evolutionary connections between episodes, but also aggregates respectively events, are part of the new schema.

For copying the data from the old schema to the CrowdSA schema we came up with a data-adapter, which fetches the data from the already filled database, containing the data described within chapter 2, and writes it into the CrowdSA database.

3.4.2 CrowdSA Data-layer

As mentioned, the same modeling software as for the CSI project was used. The cause for this is a planned future combination of the crowdSA components and the CSI components, as described within the introduction chapter 1. To be conform with the CSI data-layer we split the CrisisTracker schema into a layered one with three layers. The first layer is called Source Layer, the second one Aggregated Layer and the third Integrated Layer.

Source Layer It should be possible to extend the system for other sources than twitter; for this purpose we introduced a generic model for social media content. The generic model is shown in Figure 3.17. As you can see, it provides a class called SocialMediaContent, which will later be inherited by the class called Tweet within the Twitter specialized Source Layer. The annotation class represents the results of the preprocessing component. Further, a location can be stored for every SocialMediaContent instance.

Figure 3.17: Generic Source Layer
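The following plain-Java sketch mirrors the generic model of Figure 3.17; the field names are assumptions derived from the description above, while the real classes are generated from the Visual Paradigm model as a hibernate layer.

import java.util.ArrayList;
import java.util.List;

// Generic content class of the Source Layer.
public class SocialMediaContent {
    private String text;
    private Location location;                                      // optional geolocation
    private final List<Annotation> annotations = new ArrayList<>(); // preprocessor results

    public void addAnnotation(Annotation annotation) { annotations.add(annotation); }
}

// Tweet specializes the generic content within the Twitter Source Layer.
class Tweet extends SocialMediaContent {
    private long twitterId;
    private String userName;
}

// Minimal stand-ins for the associated classes of the model.
class Location { double longitude, latitude; }
class Annotation { String key, value; }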

Figure 3.18 shows the Source Layer for the twitter data. It was copied from CrisisTracker and extended for the generalization to the SocialMediaContent class from Figure 3.17. Within those tables the dataset migrated from the CrisisTracker database is stored.

Figure 3.18: Source Layer from CrisisTracker

Aggregated Layer The Aggregated Layer is used for storing the results of the aggregation component, described in section 3.3.3.

As for the Source Layer, we introduced a generic model to be able to reuse the aggregation implementation for other sources. Like the SocialMediaContent class, the generic SocialMediaAggregate can have a location; further, it contains several instances of SocialMediaContent classes.

Figure 3.19: Generic Aggregated Layer

We also copied the CrisisTracker model for this part. However, it was not used in a productive way but only for the evaluation of the aggregation component. All the results of our aggregation component conform to the Generic Aggregated Layer.

Figure 3.20: Aggregated Layer from CrisisTracker

Integrated Layer The Integrated Layer is the place where the results of the object extraction component are stored. It was implemented from scratch within the thesis of Gerald Madlsperger [47], as CrisisTracker had no functionality for extracting objects. In Figure 3.21 we can see that a CrowdObject can have one of two types. The first type, called GemetType, is a leftover from the preceding practical work for this thesis. Instead of a complex object extraction component we mapped the tweets against the Gemet Thesaurus [20] and stored the results as objects. We did not want to lose this functionality, and instead of deleting this part in the schema, we extended it with the EpisodeType, which corresponds to the results of the object extraction component.

Figure 3.21: Integrated Layer

Chapter 4

Implementation

This chapter discusses the concrete algorithms that were used for the implementation of the components PreProcessor, Aggregation and Visualization, covered in this project. It was partly composed together with Gerald Madlsperger [47], since it affects both of our theses.

4.1 PreProcessing

Two sub-components of the preprocessor, the Part-of-Speech tagger and the Geolocator, are covered within this work; the others are implemented and discussed in the Master's thesis of Gerald Madlsperger [47]. Both take the original Tweets as an input and store the processed annotations in the operating database.

4.1.1 Topic Fencing: Part-of-Speech Tagging

The PoS-Tagger is implemented with the GATE ANNIE [17] pipeline, where the tool annotates the input Tweets according to their Part-of-Speech values. Each Tweet refers to a list of PoS-annotations in the database. These annotations are stored as key-value pairs, where the key consists of indices which describe the part of the Tweet which belongs to one PoS-Tag. The value element refers to the PoS-type of the word.

The following example will explain the format of the PoS annotations:


Tweet 1: RT RadioDresden: #Hochwasser #Dresden BlauesWunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

In this example Stadt is stored as the following key-value pair: a(Stadt) = {(73, 78), NN}. This means that in Tweet 1 the characters at positions 73 to 78 were identified as a noun (NN).

In further consequence the similarity of a pair of Tweets can be calculated as described in section 4.2.1.2.

4.1.2 Geo Fencing

As mentioned in chapter 3.3.2.1, our Geo Fencing component is based on previous projects. Yet it was necessary to adapt this tool to handle common situations like identifying multiple locations. Generally there are four levels of geolocalisation that can be selected through the configuration file, which is accessible through the visualization component.

• Meta data - only uses the location information available from the Tweet meta data

• Hashtags - extracts all hashtags within a Tweet and maps them to Geonames1 locations

• Named Entity - extracts all Named Entities of a Tweet and maps them to Geonames

• Nouns - extracts all Nouns of a Tweet and maps them to Geonames

Based on these configuration settings the outcome may be a list of multiple locations, of which only one will be stored in the database. In order to identify the most accurate position, it is necessary to analyze the hierarchy of the locations. Therefore Geonames [22] is requested through their webservice for the complete location hierarchy of the respective toponym. A HierarchyManager deals with those hierarchies and also retrieves the most precise location.

Hierarchy Manager Geonames knows 9 different location categories, so called Feature Classes. These classes cover toponyms such as countries, rivers, roads, parks, etc. and identify all of them with feature codes. In order to find the most accurate location, we split the hierarchy elements into their categories and identify the most frequently mentioned location in each category. We give priority to the category Road and always return a toponym of this type if one is found. For all other categories the most frequent element is chosen as the most accurate location. In case the occurrences of toponyms are equal, the first appearance is taken.

1http://www.geonames.org/
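The selection logic just described can be sketched as follows; Toponym and its fields are simplified stand-ins for the Geonames webservice response, and "R" is the Geonames feature class covering roads.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified representation of one Geonames hierarchy element.
class Toponym {
    String name;
    String featureClass;  // e.g. "R" for roads, "P" for populated places
}

public class HierarchyManager {
    // Roads are preferred; otherwise the most frequently mentioned
    // toponym wins, and ties keep the first appearance.
    public Toponym mostAccurate(List<Toponym> hierarchyElements) {
        Map<String, Integer> counts = new HashMap<>();
        Toponym best = null;
        int bestCount = 0;
        for (Toponym t : hierarchyElements) {
            if ("R".equals(t.featureClass)) {
                return t;                      // priority for category Road
            }
            int count = counts.merge(t.name, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = t;
            }
        }
        return best;
    }
}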

For the following example, the whole Tweet is iterated and each word is sent to Geonames. In case a location is returned, the whole location hierarchy is stored in the location hierarchy list of the Tweet.

Tweet 1: Prießnitzstraße ist noch trocken. Ecke Nordstraße ca. 30 cm Luft und Bischofsweg ca 80cm Luft bis zur Straße. #Hochwasser #Dresden #Neustadt

For the above given example, the following locations would be retrieved with the respective configuration settings:

Geo Level       Location
Meta Data       -
Hashtag         Dresden
Named Entity    Prießnitzstraße
Noun            Prießnitzstraße

4.2 Aggregation

4.2.1 Similarity Calculation

The first subcomponent of the message aggregation is formed by the similarity calculation. Since we have a large set of Tweets, in which the similarity of each possible pair has to be calculated, we decided to make use of an affinity matrix [48]. The similarity of two Tweets is calculated based on three features, which result in three similarity values, namely Geolocation, Part-of-Speech and Semantic Similarity. Upon those values a weighted sum is calculated and stored in the respective cell of the affinity matrix. The weighting for each similarity setting is defined in the configuration file and sums up to 1.0.
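The weighted sum can be sketched as follows; the Tweet class and the three similarity methods are placeholders for the features explained in sections 4.2.1.1 to 4.2.1.3, and the weights come from the configuration file in the real system.

public class AffinityMatrixBuilder {
    private final double geoWeight, posWeight, semanticWeight; // sum up to 1.0

    public AffinityMatrixBuilder(double geoWeight, double posWeight, double semanticWeight) {
        this.geoWeight = geoWeight;
        this.posWeight = posWeight;
        this.semanticWeight = semanticWeight;
    }

    public double[][] build(Tweet[] tweets) {
        int n = tweets.length;
        double[][] affinity = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double similarity = geoWeight * geoSimilarity(tweets[i], tweets[j])
                        + posWeight * posSimilarity(tweets[i], tweets[j])
                        + semanticWeight * semanticSimilarity(tweets[i], tweets[j]);
                affinity[i][j] = similarity;
                affinity[j][i] = similarity;   // the matrix is symmetric
            }
        }
        return affinity;
    }

    // Placeholders for the similarity features of the following sections.
    private double geoSimilarity(Tweet a, Tweet b) { return 0.0; }
    private double posSimilarity(Tweet a, Tweet b) { return 0.0; }
    private double semanticSimilarity(Tweet a, Tweet b) { return 0.0; }
}

// Stand-in for the project's Tweet class.
class Tweet { }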

4.2.1.1 Geolocation Similarity

The similarity of two locations is calculated with a 2D Euclidean distance measure. This method is embedded in the Vivid Solutions JTS Topology Suite2, which we used for dealing with geometries and topologies. The distance method of this suite takes two coordinates as an input, but ignores the Z coordinate. The Euclidean distance of two 2D points p and q is calculated as follows:

d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} \qquad (4.1)

In our case the Euclidean distance is a plausible measure, because we do not require an infrastructural distance based on roadmaps. This method is also followed by several research projects, such as Ruiz et al. [61] for measuring movements in Microblogs.
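With JTS the calculation reduces to a single call; the distance method of Coordinate computes exactly this 2D measure and ignores the Z value, as noted above. The coordinates below are made-up sample values.

import com.vividsolutions.jts.geom.Coordinate;

public class GeoSimilarityExample {
    public static void main(String[] args) {
        Coordinate p = new Coordinate(13.43, 48.57);
        Coordinate q = new Coordinate(13.74, 51.05);
        double distance = p.distance(q);   // Euclidean distance as in (4.1)
        System.out.println("distance = " + distance);
    }
}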

4.2.1.2 Part-of-Speech Similarity

The idea of Part-of-Speech similarity is to compare two Tweets according to the PoS-Tags of their content. Each pair of Tweets is analyzed for their PoS annotations. For each word combination of Tweet1 and Tweet2 which has the same PoS annotation type, the similarity is calculated. The similarity value is retrieved by N-gram similarity. We therefore use the solution of the Apache Lucene SpellChecker suite3, which offers to calculate similarities between two Strings with either N-gram Similarity [34] or Levenshtein Distance [42]. As mentioned already in chapter 3.1.1, N-gram similarity is more useful in our context, since we want to compare text phrases. Levenshtein Distance is more appropriate for overcoming spelling errors.

For the reasons mentioned in chapter 3.1.1, our implementation takes N-grams of size 4 for each pair of words. Before those words are used by the Lucene method, they are converted to lower case, since this has an impact on the result; for our use case, however, there is no difference between upper and lower case words. The Lucene SpellChecker suite further uses the N-gram distance measure introduced by Kondrak [38], which has already been mentioned in chapter 3.1.1.
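The comparison of one word pair then becomes a short call into the Lucene SpellChecker suite; the two words are sample inputs, and the lower-casing mirrors the conversion described above.

import org.apache.lucene.search.spell.NGramDistance;

public class PosSimilarityExample {
    public static void main(String[] args) {
        NGramDistance ngram = new NGramDistance(4);   // N-grams of size 4
        float similarity = ngram.getDistance(
                "Hochwasser".toLowerCase(),
                "Hochwassers".toLowerCase());
        System.out.println("similarity = " + similarity);
    }
}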

4.2.1.3 Semantic Similarity

In order to take the semantic content of a Tweet into account, we implemented this component. The text of two Tweets is analyzed for their semantic relatedness. Therefore keywords of each Tweet are extracted. Since we desire to have the system configurable, it is possible to select whether the semantic similarity should be calculated based on all hashtags or all nouns of the Tweets. The configuration settings, which are set via the visualization tool, can have the values HASHTAG or ALL. In case ALL is chosen, all hashtags and nouns in the Tweet are set as keywords.

2http://www.vividsolutions.com/jts/
3https://wiki.apache.org/lucene-java/SpellChecker

The following example will show the keywords of the different settings:

Tweet 1: Prießnitzstraße ist noch trocken. Ecke Nordstraße ca. 30 cm Luft und Bischofsweg ca 80cm Luft bis zur Straße. #Hochwasser #Dresden #Neustadt

Semantic Level   Keywords
Hashtag          Hochwasser, Dresden, Neustadt
All              Hochwasser, Dresden, Neustadt, Prießnitzstraße, Ecke, Nordstraße, Luft, Bischofsweg, Luft, Straße

We use the DISCO library by LinguaTools4 to retrieve the semantic relatedness of the extracted keywords. DISCO works with a Wikipedia knowledge base and calculates the distance value of two words in the wordspace by Cosine Similarity. It would also be possible to choose the Kolb Similarity Algorithm [37], but it turned out that Cosine Similarity works better in our case.

Another consideration was to use WordNet5, respectively GermaNet6, for finding a Semantic Similarity. Unfortunately the German language tool GermaNet is not available for free use and therefore was not applicable in our project. An advantage of GermaNet compared to DISCO would have been that it is based on WordNet, which is widely used in research projects. Nevertheless we also achieved good results with DISCO and could make use of the built-in similarity calculations.

4.2.2 Clustering

In this chapter we will introduce the clustering algorithms which we used in the prototype, but also evaluate external tools, of which we will integrate one in our system. Further, references from other research projects are given, which used our suggested algorithms in the context of microblogging data.

4http://www.linguatools.de/disco/
5http://wordnet.princeton.edu/
6http://www.sfs.uni-tuebingen.de/GermaNet/

4.2.2.1 Affinity Propagation

Affinity Propagation describes a rather new algorithm in the field of data clustering. It was first introduced in 2007 by Frey and Dueck [21] as Clustering by Passing Messages between Data Points. In comparison to K-Means Clustering [46], it is not necessary to know the cluster size. Further, a calculated central vector is not required; the clustering is done based on actual data points and simultaneously takes all of those into consideration. These so called 'exemplars' are data points which best describe the cluster they are in. To find those exemplars, all data points are seen as nodes within a network. For each pair of points, energy functions are calculated and transmitted as messages. There exist two kinds of energy functions that describe the affinity between two points: first the responsibility and second the availability.

• Responsibility r(i, k) shows how well suited point k is to be an exemplar for point i:

r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \left\{ a(i, k') + s(i, k') \right\} \qquad (4.2)

• Availability a(i, k) shows how appropriate it would be for point i to take point k as its exemplar:

a(i, k) \leftarrow \min\left\{ 0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0,\, r(i', k)\} \right\} \qquad (4.3)

These functions are repeatedly calculated until an appropriate setting for exemplars is found. This applies when the minimum of the energy functions is reached.

Vasantha et al. [73] show in their work that Affinity Propagation works efficiently and effectively also for text data. Zhao [40] uses Affinity Propagation on Tweets as an example of Social Media Data.
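To make the message passing concrete, the following self-contained sketch implements the updates of equations (4.2) and (4.3) with damping; it illustrates the algorithm only and is not the ELKI implementation used later.

public class AffinityPropagationSketch {
    // s is the similarity matrix with preferences on the diagonal;
    // returns for every point the index of its chosen exemplar.
    public static int[] cluster(double[][] s, int iterations, double damping) {
        int n = s.length;
        double[][] r = new double[n][n];   // responsibilities
        double[][] a = new double[n][n];   // availabilities
        for (int iter = 0; iter < iterations; iter++) {
            // responsibility update, equation (4.2)
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int kp = 0; kp < n; kp++) {
                        if (kp != k) max = Math.max(max, a[i][kp] + s[i][kp]);
                    }
                    r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - max);
                }
            }
            // availability update, equation (4.3)
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double sum = 0;
                    for (int ip = 0; ip < n; ip++) {
                        if (ip != i && ip != k) sum += Math.max(0, r[ip][k]);
                    }
                    double value = (i == k) ? sum : Math.min(0, r[k][k] + sum);
                    a[i][k] = damping * a[i][k] + (1 - damping) * value;
                }
            }
        }
        // each point chooses the exemplar k maximizing a(i,k) + r(i,k)
        int[] exemplar = new int[n];
        for (int i = 0; i < n; i++) {
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][exemplar[i]] + r[i][exemplar[i]]) {
                    exemplar[i] = k;
                }
            }
        }
        return exemplar;
    }
}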

4.2.2.2 Hierarchical Clustering

Hierarchical Clustering is a widely known and used method for data aggregation and analysis. It is an iterative and monotonic approach, where each data point is merged or split with another data point based on their similarity value. When a certain threshold is reached, the splitting/merging stops and a final set of clusters remains. The literature [60] distinguishes between two kinds of hierarchical clustering:

Agglomerative Clustering Each data point is its own cluster; with each iteration the clusters are merged bottom up according to their similarity.

Divisive Clustering The whole data set is one single cluster; with each iteration the clusters are split top down according to their similarity.

Hierarchical Clustering is also applied in many research projects concerning Microblogs such as Twitter. A quite commonly referenced paper by Olariu [51] proposes a solution for Event Detection on a Twitter Stream with the help of Hierarchical Agglomerative Clustering. Hereby he first classifies the Tweets to certain topics and then applies the clustering algorithm in order to retrieve event clusters. Ifrim et al. [30] also gained reasonable results in using this aggregation method for finding topic clusters in Twitter data. As an advantage they mentioned the possibility of setting a threshold and therefore being able to determine the tightness or looseness of a cluster, without stating a number of clusters as with K-Means. Especially since the objectives of clustering may vary, it can be useful to set a larger threshold for finding more general topics with different sub-clusters. This setting applies also in our project.
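A threshold-controlled agglomerative variant can be sketched in a few lines; single linkage over a similarity matrix is assumed here for simplicity, merging clusters until the best pairwise cluster similarity drops below the threshold.

import java.util.ArrayList;
import java.util.List;

public class AgglomerativeClusteringSketch {
    // sim holds pairwise Tweet similarities; the threshold controls
    // the tightness of the resulting clusters.
    public static List<List<Integer>> cluster(double[][] sim, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {   // start with singletons
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double best = threshold;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    // single linkage: similarity of the closest member pair
                    double link = 0;
                    for (int i : clusters.get(a))
                        for (int j : clusters.get(b))
                            link = Math.max(link, sim[i][j]);
                    if (link >= best) { best = link; bestA = a; bestB = b; }
                }
            }
            if (bestA < 0) break;                 // nothing above the threshold
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }
}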

4.2.3 Clustering tools

Since there exist various toolkits and libraries with clustering algorithms, we evaluated some of them in order to find the best fitting tool. Our requirements on the library include the following:

• Open Source respectively available for free

• Good Java support

• Accept similarity/distance matrices as an input

• Possibility of using Affinity Propagation and Hierarchical Clustering algorithms

In conclusion, we decided to use the ELKI toolkit [6], due to its prevailing advantages. Although the Java support is officially still in a Beta version, it has proven good functionality and results. In the following sections all tools we took into consideration are introduced and shortly evaluated according to the above mentioned requirements.

The following table gives an overview of the evaluated tools and their advantages and disadvantages.

Tool             Open Source   Java support   Similarity matrix   AP and HC algorithms
ELKI [6]         +             +              +                   +
JML [49]         +             +              -                   -
Mahout [52]      +             +              +                   -
R [55]           +             ~              +                   +
RapidMiner [1]   +             +              ~                   -
S-Space [35]     +             +              -                   -
WEKA [27]        +             +              -                   -

4.2.3.1 ELKI[6]

ELKI stands for Environment for Developing KDD-Applications Supported by Index-Structures and is an open source tool developed by the Ludwig-Maximilians-Universität in Munich, Germany. It offers various algorithms for tasks such as clustering and outlier detection.

Algorithms

• Affinity Propagation Clustering

• Hierarchical Clustering

• K-means

• Expectation Maximization

• etc.

Evaluation Since it offers the two algorithms we require for our system, it was interesting to take a closer look at ELKI. The documentation states that for now the tool is rather used with a standalone GUI and is still in a Beta version as a Java library. Still, the JavaDoc is comprehensive and an extensive tutorial shows good examples on how to use the clustering algorithms within a Java program. Also, the community behind ELKI is still very active and we were able to solve occurring problems in direct contact with the developers. Another major advantage is the possibility to use pre-computed distance/similarity files. Those files represent our similarity matrices and work with ELKI quite smoothly. All in all, ELKI has proven to be a good and easy choice for our requirements, with correspondingly good clustering results.

4.2.3.2 JML[49]

JML stands for Java Library for Machine Learning and offers a simple Java toolkit for various applications in Machine Learning and Data Mining. There exists a component dealing with cluster analysis, yet it only offers a small number of implemented algorithms. Meanwhile JML was replaced by the project LAML (Linear Algebra and Machine Learning), which is in general more comprehensive and faster. Still, it does not provide any clustering mechanisms different from those of JML.

Algorithms

• K-Means

Evaluation JML has no implementation of any of the desired clustering algorithms, yet Spectral Clustering might have been an interesting approach for our data analysis. JML is developed for Java use only and therefore has a well working library. Unfortunately it only works with data point matrices and does not take similarity/distance values as an input. This makes the use of the tool within our project impossible, and therefore JML was rejected.

4.2.3.3 Mahout[52]

Apache Mahout is a widely known and used framework for Machine Learning applications. It is commonly applied for its comprehensive MapReduce and Recommender functionality. Still, it also offers several clustering algorithms, which are all based on K-Means.

Algorithms

• K-Means

• Spectral K-Means

• Fuzzy K-means

• Streaming K-means

Evaluation Mahout works as a Java library and offers an easy way to implement their algorithms. Further, the tutorials give a good overview and show examples for each topic. Mahout clustering also works with similarity matrices, which are committed as a document file. Unfortunately the offered clustering algorithms are all based on K-Means, which implies that the number of resulting clusters is required. For our project it is impossible to define the k clusters in the beginning. Therefore it is not possible to work with Apache Mahout.

4.2.3.4 R[55]

R is a widely known and established framework for mathematical computations. In comparison to similar projects like MATLAB, R is completely open source. Another focus topic of R is the graphical presentation of data and functions. The vast variety of clustering algorithms also includes Hierarchical Clustering and Affinity Propagation.

Algorithms

• Affinity Propagation

• Hierarchical Clustering

• K-Means

• Expectation Maximization

• etc.

Evaluation Both desired clustering algorithms are available in so called CRAN packages of R. For Hierarchical Clustering there even exist several different implementations and settings. Further, the algorithms may take similarity/distance matrices as an input for the clustering. Unfortunately the integration of R in a Java project is rather complex, since it requires R to run on the machine separately. The complexity of the system, compared to the rather manageable task the library should fulfill in our project, was the main reason why we decided against the integration.

4.2.3.5 RapidMiner[1]

RapidMiner is an open source platform for data mining and analysis, initially introduced as a research project at the Technical University of Dortmund. RapidMiner finds application in various commercial industries and therefore is worldwide known for its expertise in machine learning.

Algorithms

• K-Means

• Agglomerative Clustering

• etc.

Evaluation RapidMiner offers a good Java library which can easily be integrated in existing projects. Yet it showed difficulties in dealing with similarity/distance matrices, because it would have required modifying the package and file structure of the source code. This, and the fact that there is no implementation of Affinity Propagation, were the main reasons for not working with RapidMiner.

4.2.3.6 S-Space[35]

S-Space is a small research project developed by the University of California. It is a collection of algorithms dealing with a semantic space and other forms of language/text processing. Since this also includes text clustering, we considered the project for our evaluation.

Algorithms

• K-Means

• Hierarchical Clustering

• Spectral Clustering

• etc.

Evaluation Although S-Space exists as a Java Maven project, the disadvantages prevail. Apart from the missing implementation of Affinity Propagation, there is also no possibility to commit a similarity/distance matrix. It is only possible to work with a matrix of data points, which is not useful within our project.

4.2.3.7 WEKA[27]

WEKA is a framework of machine learning algorithms and other data mining tools. It is an open source project fully implemented and available in Java. WEKA is often used in combination with its Graphical User Interface, which provides results not only in written form but also as nicely presented graphs.

Algorithms

• K-Means

• Hierarchical Clustering

• Expectation Maximization

• etc.

Evaluation The integration of WEKA into our project would work quite smoothly, yet it does not provide an implementation of Affinity Propagation. Another problem comes up with the fact that it is not possible to work with similarity/distance matrices, but only with actual data points. For these reasons WEKA will not be used in this project.

4.3 Visualization

The Visualization component was implemented in collaboration with Gerald Madlsperger [47], since the results of both of our projects should be presented in one fluent passage. Basically we implemented the User Interface with the Java Server Pages technology. Therefore we created five JSP files, which represent the five pages available in our visualization. These pages include the following four, each explained in its own section, and a simple Error page.

Because of the long execution times of the pipeline for the whole dataset, we decided to fully decouple the visualization from the rest of the pipeline and pre-compute the results we want to see. Figure 4.1 depicts that the visualization component is only connected with the database, from where it reads the already analyzed data.

Figure 4.1: Visualization component diagram

4.3.1 Home (Historical Data)

This describes the initial page of the User Interface, where various configuration files can be selected. Each configuration stands for an already executed program run. Altogether we identified 10 parameters with which the pipeline can be initialized. Those are summarized in the following, with references to their detailed description; a hypothetical example of such a configuration file is sketched after the list.

• Boundingbox: This parameter is used to specify a geographical area in which we are operating.

• PoS-Weight: Specifies the weighting of the PoS-similarity for aggregation.

• Geo-Weight: Specifies the weighting of the Geo-similarity for aggregation.

• Semantic-Weight: Specifies the weighting of the Semantic-similarity for aggrega- tion.

• Clustering Algorithm: Selects the preferred clustering algorithm.

• LDA Algorithm: Selects the preferred LDA algorithm (With/Without model se- lection) [47] 1.

• Evolution Feature: Selects the feature used for evolution calculation (Ontology, Entity) [47] 1

• LDA Feature: Selects the feature used for LDA (Ontology, Event phrase) [47] 1.

• Aggregation Evolution Threshold: Threshold which is used to find dependent ag- gregates over time [47] 1.

• Episode Evolution Threshold: Threshold which is used to find dependent episodes over time [47] 1.
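A hypothetical configuration file in Java properties style could combine these parameters as shown below; all key names and values are illustrative assumptions, not the project's actual file format.

# illustrative parametrization of one pipeline run
boundingbox = 9.5,46.3,17.2,49.1
pos.weight = 0.5
geo.weight = 0.2
semantic.weight = 0.3
clustering.algorithm = AFFINITY_PROPAGATION
lda.algorithm = WITH_MODEL_SELECTION
evolution.feature = ONTOLOGY
lda.feature = EVENT_PHRASE
aggregation.evolution.threshold = 0.6
episode.evolution.threshold = 0.6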

Figure 4.2: Home (Historical Data)

1Those parameters are part of the object extraction and evolution components; find more details in the work of Gerald Madlsperger [47].

4.3.2 Timeslice Details

The Timeslice Details screen was implemented with three Google Maps instances, on which all events over three time slices are shown. Everything else was implemented in a straightforward manner according to the concept described in chapter 3.3.6.2.

Figure 4.3: Timeslice Details

4.3.3 Event Detail

At first glance the Event Detail screen looks identical to the Timeslice Details screen. But the episodes shown on the preceding and succeeding maps are not necessarily within the following resp. preceding time slice, but are directly connected to one episode of the current event. Also the coloring of the episodes matches over all maps; therefore the detection of connected episodes is simplified.

Figure 4.4: Event Details

4.3.4 Timeline Result

As stated in the concept chapter, this page contains a timemap visualization. The implementation was done with a javascript library called timemap [16], which is based on the SIMILE timeline [50] implementation. This page is used only for the component of Gerald Madlsperger and therefore will not be explained any further within this thesis.

Chapter 5

Results and Evaluation

In this chapter we will present the results of the system according to different configuration settings. Further, we will evaluate those results and draw a conclusion.

5.1 Evaluation Criteria

Evaluation criteria are important to define, since they provide the basis for an objective evaluation. In Information Retrieval and Analysis there exist various metrics which give an insight into the quality and correctness of clustering. We chose those among them which deliver the most information.

5.1.1 Cluster Evaluation

Within ELKI there exists a cluster analysis functionality, which calculates different metrics to evaluate the clustering of the given data set. Achtert et al. [5] state that the evaluation of clustering algorithms can turn out difficult, because each dataset and its requirements differ. Also, it may be demanding to find the one perfect data partitioning. In this paper they also propose an evaluation method which calculates metrics based on pairs of datapoints within clusters. They calculate widely known metrics such as Precision, Recall, F-Measure, Rand, Jaccard, Purity and Mutual Information. This approach is further discussed by Kriegel et al. [39], where they apply the so called pair-counting method not only on individual data points, but also on whole clusters, in order to compare the results of various clustering algorithms.

Although those clustering metrics were calculated by ELKI during the execution, the results were not very useful for further analysis. The input of a reference clustering in the form of an external file is not possible. Reference clusters can only be submitted by defining a second clustering algorithm which calculates a reference result based on the same dataset.

To overcome this problem we define some basic statistics which give an overview of the data structure after clustering. Usually the cluster size and the number of clusters are used as input parameters before the clustering takes place. In this project it was not possible to define those measures beforehand, which also made the application of e.g. K-Means impossible. We therefore calculate an average for these metrics based on the clustering result and try to interpret the outcomes.

Average Cluster Size The average cluster size offers an insight into the segmentation of the data, relating to the cohesiveness of the individual data points. In our case it shows the scope of one event within one timeslice. The higher the ratio of average cluster size to the total number of data points is, the lower is the number of clusters. This might occur if the segmentation into few but large data groups is very clear, or if the parameters for clustering are not very strict. On the opposite, a tightly defined parametrization results in many clusters with single elements. In order to get a meaningful value for our clustering results, the average was calculated based on the results of all settings introduced in chapter 5.2.

Average # of Tweets per Timeslice   Average Cluster Size
100                                 13

This result shows a quite good ratio of the average cluster size to the total number of Tweets. Naturally the quality of the result also depends on the number of clusters that were found within one timeslice.

Average Number of Clusters The average number of clusters serves a similar purpose, since this measure stands in close relationship with the average cluster size. The higher the number of clusters, the more likely it is that the cluster size is low. On the other hand, if the number of clusters is low, chances are high that those few clusters are big and therefore not very compact and precise.

Average # of Tweets per Timeslice   Average # of Clusters
100                                 9

Also this result shows that the number of clusters within one timeslice offers a decent distribution of Tweets. Yet these metrics do not provide any information about the correctness of the assignment of Tweets to their cluster, which is why we have to come up with another evaluation metric.

5.1.2 Recall

As mentioned above, there exist many measures in information retrieval which evaluate the quality of a clustering algorithm. For many reasons, e.g. the unknown total scope of documents, the vague definition of a correct clustering, etc., it is not possible, respectively it does not make sense, to calculate most of them. Yet the metric of recall is an ideal measure for defining the quality of our clustering results.

Recall was first mentioned by Van Rijsbergen in his book Information Retrieval [57], where he describes recall as the proportion of relevant documents that are retrieved. This metric does not require exact knowledge of the total document collection to calculate a precise value, which makes it useful in this thesis.

Reference Data In our project there is no single perfect clustering, because the assignment of a Tweet to a cluster is very subjective. Since we have defined various scenarios within the dataset, we evaluate the composition of the clusters based on those. We define SQL queries which best represent these scenarios and check whether the Tweets of those scenarios are also clustered together.
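A hypothetical scenario query of this kind, here embedded as a Java string constant, might look as follows; the table and column names are invented for this sketch and do not reflect the actual schema:

// Selects the Tweets that make up the "Sandbags" scenario (all identifiers are
// illustrative; date range and keywords relate to the 2013 flooding scenarios).
static final String SANDBAG_SCENARIO_SQL =
    "SELECT tweet_id FROM tweet "
  + "WHERE created_at BETWEEN '2013-06-01' AND '2013-06-10' "
  + "AND (content LIKE '%Sandsack%' OR content LIKE '%Sandsäcke%') "
  + "AND location LIKE '%Dresden%'";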

True Positives True positives represent Tweets which are correctly assigned to one cluster. Since we cannot define whether Cluster_i or Cluster_j is the accurate cluster, for us the true positive value depends on the other elements in the cluster. This means that for each Tweet we check how many of its neighbors from the reference cluster are also in the actual cluster.

False Negatives As false negatives, in comparison, we describe Tweets which should be in the same cluster as the TPs but are clustered to a different one. The calculation of false negatives is based on the true positives and the number of elements in the reference cluster: those Tweets that are in the reference cluster but are not true positives are false negatives, as sketched below.
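As a sketch of this counting scheme (plain Java with Tweets represented by their IDs; a simplified illustration rather than the actual implementation):

import java.util.Set;

public class PairCounts {

    // TP: neighbors of the Tweet in the reference cluster that also ended up
    // in the Tweet's actual cluster.
    static int truePositives(String tweet, Set<String> referenceCluster, Set<String> actualCluster) {
        int tp = 0;
        for (String neighbor : referenceCluster) {
            if (!neighbor.equals(tweet) && actualCluster.contains(neighbor)) {
                tp++;
            }
        }
        return tp;
    }

    // FN: reference-cluster neighbors that were clustered somewhere else.
    static int falseNegatives(Set<String> referenceCluster, int truePositives) {
        return referenceCluster.size() - 1 - truePositives; // -1 excludes the Tweet itself
    }
}

For the worked example below, this yields TP = 1 and FN = 1 for Tweet_1, matching the recall of 0.5 in equation (5.2).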

The recall metric [57] is defined as follows:

Recall = TP / (TP + FN)    (5.1)

Example In this example we have the following setup of a reference and actual cluster:

Reference Cluster_i: Tweet_1, Tweet_2, Tweet_3

Actual Cluster_i: Tweet_1, Tweet_3, Tweet_4, Tweet_5

Actual Cluster_j: Tweet_2, Tweet_6

To calculate the TP for Tweet_1 we check whether Tweet_2 and Tweet_3 are also in the actual cluster. In this example only Tweet_3 can be found in Actual Cluster_i, which results in a TP value of 1.

To count all FN for Tweet_1, on the other hand, we check the remaining Tweets in the reference cluster. In this case Tweet_2 is positioned in a different cluster than Actual Cluster_i, which results in an FN value of 1.

The final recall calculation for Tweet_1 results in the following:

Recall(Tweet_1) = 1 / (1 + 1) = 0.5    (5.2)

5.1.3 Precision

The precision metric [57] is required alongside the recall measure in order to acquire an objective evaluation. It represents the proportion of retrieved documents that are relevant to the initial query. Since we have only a vague description of what is relevant in our whole dataset, the precision metric would distort the outcome, because it penalizes documents which are actually relevant but not known in advance. Therefore we decided to calculate the recall and precision metrics only for a smaller dataset which includes only the predefined scenarios. In those cases we have already defined the relevant Tweets for each scenario beforehand.

For the calculation of the precision we also need the true positives as described in the previous section; basically they describe the portion of all retrieved data which is correctly identified as relevant. Further we also need false positives.

False Positives In our project we define false positives as Tweets which were positioned in a cluster different from the one given by the reference data defined in the previous section. Concretely, all neighbors of the Tweet in the actual cluster are compared to the neighbors of the Tweet in the reference cluster.

Precision = TP / (TP + FP)    (5.3)

Example Using the same example as for the recall, we have the following setup of a reference and actual cluster:

Reference Cluster_i: Tweet_1, Tweet_2, Tweet_3

Actual Cluster_i: Tweet_1, Tweet_3, Tweet_4, Tweet_5

Actual Cluster_j: Tweet_2, Tweet_6

As calculated in the previous section, we have a TP value of 1.

To count the FP values, we need to check which Tweets are clustered to Actual Cluster_i although they are not found in Reference Cluster_i. In this case this affects two Tweets, namely Tweet_4 and Tweet_5.

The final precision calculation for Tweet_1 results in the following:

Precision(Tweet_1) = 1 / (1 + 2) = 0.33    (5.4)
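Analogously to the TP/FN sketch above, false positives can be counted as the actual-cluster neighbors that do not appear in the reference cluster (again a simplified illustration, extending the hypothetical PairCounts class):

// FP: neighbors in the actual cluster that the reference cluster does not contain.
static int falsePositives(String tweet, Set<String> referenceCluster, Set<String> actualCluster) {
    int fp = 0;
    for (String neighbor : actualCluster) {
        if (!neighbor.equals(tweet) && !referenceCluster.contains(neighbor)) {
            fp++;
        }
    }
    return fp;
}

Applied to the example, the neighbors of Tweet_1 in Actual Cluster_i are Tweet_3, Tweet_4 and Tweet_5; only Tweet_4 and Tweet_5 are missing from the reference cluster, so FP = 2 and the precision is 1 / (1 + 2) = 0.33.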

5.1.4 F1-Score

In order to get one overall measure for the aggregation, the F1-score can be derived from recall and precision. It was first mentioned by Van Rijsbergen [57] to measure the effectiveness of an information retrieval system. Basically it calculates the harmonic mean of recall and precision with the following formula:

F1-Score = 2 * (precision * recall) / (precision + recall)    (5.5)

Example Using the same example as for the recall, we have the following setup of a reference and actual cluster:

Reference Cluster_i: Tweet_1, Tweet_2, Tweet_3

Actual Cluster_i: Tweet_1, Tweet_3, Tweet_4, Tweet_5

Actual Cluster_j: Tweet_2, Tweet_6

Further we calculated the following measures:

Recall(Tweet_1): 0.5

Precision(Tweet_1): 0.33

This would lead to the following final calculation:

F1-Score(Tweet_1) = 2 * (0.33 * 0.5) / (0.33 + 0.5) = 0.3976    (5.6)

This is a rather moderate result, since the best possible F1-measure lies at 1.0 and the lowest at 0.0.
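The per-Tweet F1 computation then reduces to a few lines; this sketch reproduces the value of equation (5.6):

// Harmonic mean of precision and recall; 0.0 when both are zero.
static double f1Score(double precision, double recall) {
    if (precision + recall == 0.0) {
        return 0.0; // avoid division by zero when nothing was retrieved
    }
    return 2.0 * precision * recall / (precision + recall);
}

// f1Score(0.33, 0.5) evaluates to 0.39759..., i.e. the 0.3976 of equation (5.6).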

5.2 Evaluation Results

In this section we present and discuss the results based on the aforementioned evaluation criteria. We calculated the recall and precision metrics for clustering results with differing parametrizations. First we try to identify the best parametrization for the semantic and geolocation levels, which we introduced in the previous chapter. We then apply these settings to the predefined dataset scenarios using different similarity weightings.
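The similarity weightings used throughout the following tables can be pictured as a convex combination of the three per-pair similarity values; a minimal sketch with hypothetical names (the actual similarity framework is described in chapter 4):

// Combines the geographical, Part-of-Speech and semantic similarity of a Tweet
// pair (each in [0, 1]) using weights that are expected to sum to 1, e.g.
// 0.33/0.33/0.33 or 0.50/0.25/0.25 as in the result tables below.
static double combinedSimilarity(double geoSim, double posSim, double semSim,
                                 double wGeo, double wPos, double wSem) {
    return wGeo * geoSim + wPos * posSim + wSem * semSim;
}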

As this project was initially motivated by improving the tool CrisisTracker, we use its clustering result as a reference for our work. The recall is calculated in the same manner as for our system's results and is based on the same scenario dataset. In the CrisisTracker documentation [59] the authors report a very low recall and fairly good precision, due to the fact that CrisisTracker tends to produce small clusters to make sure unwanted Tweets do not end up in a cluster.

5.2.1 Semantic Level

As a first setting we differentiate between the two semantic levels which are configurable in the system. With the setting HASHTAG all hashtags of a Tweet are taken into consideration for semantic analysis, in contrast to the setting ALL which considers all words contained in a Tweet; for further details see chapter 4.2.1.3. The other parameters were set to an equal weighting of 0.33 for all three similarity options and we use affinity propagation clustering, because this proved to be one of the best settings. The following result was achieved:

Semantic Level   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
HASHTAG          0.33   0.33   0.33       AP1          0.56174   0.14717     0.21979
ALL              0.33   0.33   0.33       AP           0.27931   0.09438     0.13178

1 Affinity Propagation

Conclusion Although one might expect that the more semantic analysis is taken into consideration, the better the result, our evaluation showed a different outcome. We attribute this to the fact that with the setting ALL too many word pairs are semantically analyzed, which lowers the average similarity value per Tweet. Since these Tweets then have a low similarity, they are not clustered together. Another drawback of the setting ALL is the weak performance of the algorithm: the execution with all word pairs takes too much time to be used in real-time systems, whereas a run with the setting HASHTAG takes a reasonable amount of time.

According to this result we will use the semantic level configuration of HASHTAG in the following scenario evaluations.

5.2.2 Geolocation Level

Similar to the semantic level, we try to find the best setting for the geolocation level. Here we distinguish between the three levels HASHTAGS, NAMED ENTITY and NOUNS as described in chapter 4.1.2. As suggested in the previous section, the three similarity options were set to an equal weighting of 0.33 and we use affinity propagation clustering, because this already proved to be a good setting in the previous run. This resulted in the following outcomes:

Geolocation Level   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
HASHTAGS            0.33   0.33   0.33       AP1          0.56174   0.14717     0.21979
NAMED ENTITY        0.33   0.33   0.33       AP           0.43895   0.15679     0.21359
NOUNS               0.33   0.33   0.33       AP           0.33495   0.16577     0.20799

1 Affinity Propagation

Conclusion Again we obtain the best results when using the HASHTAGS level, yet the values lie closer to each other in the geolocation context. This can be explained by the fact that social media users tend to use locations as hashtags. Moreover, when taking named entities or all nouns into account, many non-location words are found and annotated with syntactically similar-sounding locations; e.g. the noun Regen (German for rain) is mistakenly mapped to the city of Regen in Germany. Therefore we decided to use the setting HASHTAGS for the following scenario evaluations.

5.2.3 Scenario 1: Bridge Blockade

The first scenario describes a situation where the bridge Blaues Wunder in Dresden was blocked. Here we used several different settings which we expected to lead to a reasonable outcome. In the further scenarios we will focus on the best four settings resulting from this evaluation.

5.2.3.1 Reference: CrisisTracker

As mentioned above, the evaluation measures for CrisisTracker are calculated for the resulting aggregates when applying the tool to the same dataset as used in our project.

                Recall    Precision   F1
CrisisTracker   0.11305   0.03726     0.05604

5.2.3.2 Own Implementation

As mentioned above, for this scenario we decided to calculate the metrics for an extensive set of parameters to get an overview and a good impression of the results. The best F1 value is marked with an asterisk.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.91787   0.13108     0.22940
2         1.00   0.00   0.00       NAHC2        0.02872   0.48152     0.09176
3         0.00   1.00   0.00       AP           0.52157   0.36645     0.43046 *
4         0.00   1.00   0.00       NAHC         0.49225   0.37074     0.42294
5         0.00   0.00   1.00       AP           0.94444   0.10996     0.19698
6         0.00   0.00   1.00       NAHC         0.27669   0.12046     0.16785
7         0.25   0.25   0.50       AP           0.64227   0.13372     0.22136
8         0.25   0.25   0.50       NAHC         0.32916   0.20303     0.25115
9         0.25   0.50   0.25       AP           0.56463   0.19192     0.28646
10        0.25   0.50   0.25       NAHC         0.39304   0.26970     0.31993
11        0.33   0.33   0.33       AP           0.55096   0.12535     0.20423
12        0.33   0.33   0.33       NAHC         0.25436   0.11463     0.12590
13        0.50   0.25   0.25       AP           0.52896   0.16036     0.24611
14        0.50   0.25   0.25       NAHC         0.35662   0.24951     0.29360

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion The first noticeable aspect of this result is the fact that the best value was achieved by the sole use of Part-of-Speech (PoS) features. We attribute this to the circumstance that PoS tags also include location tags, so they can improve the similarity value in case many location tags were used within the Tweet content. Generally, distributed weightings also lead to good results as long as they include PoS features. The choice of clustering algorithm has a mixed impact, yet in this scenario the results turned out better when using affinity propagation. Another interesting observation emerges when looking at the recall and precision values in detail: for some settings we achieve extraordinarily high recall values which are contrasted by a very low precision, which can be explained by the average cluster size of the setting. Big clusters which contain all desired Tweets, but also many unwanted ones, result in a high recall and a low precision. Further it is encouraging that all settings yielded a better F1 measure than CrisisTracker.

5.2.4 Scenario 2: Sandbags

The second scenario deals with the need for sandbags in Dresden. This evaluation set contains more Tweets than the previous scenario, which might lead to a variation in the results.

5.2.4.1 Reference: CrisisTracker

The F1 measure for the CrisisTracker clustering is better than in the previous scenario.

                Recall   Precision   F1
CrisisTracker   0.168    0.07525     0.10394

5.2.4.2 Own Implementation

In this scenario we apply the same parametrizations and mark the best value with an asterisk.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.95      0.1015      0.18350
2         1.00   0.00   0.00       NAHC2        0.00932   0.50140     0.01829
3         0.00   1.00   0.00       AP           0.33066   0.62489     0.43246 *
4         0.00   1.00   0.00       NAHC         0.14984   0.62067     0.24141
5         0.00   0.00   1.00       AP           0.91667   0.12923     0.22652
6         0.00   0.00   1.00       NAHC         0.33235   0.14181     0.19879
7         0.25   0.25   0.50       AP           0.41679   0.2198      0.28781
8         0.25   0.25   0.50       NAHC         0.19502   0.20307     0.19896
9         0.25   0.50   0.25       AP           0.26493   0.22731     0.24468
10        0.25   0.50   0.25       NAHC         0.25946   0.27147     0.26533
11        0.33   0.33   0.33       AP           0.41094   0.18274     0.25298
12        0.33   0.33   0.33       NAHC         0.25067   0.14377     0.18273
13        0.50   0.25   0.25       AP           0.39334   0.2235      0.28504
14        0.50   0.25   0.25       NAHC         0.23395   0.25630     0.24461

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion Here again the setting which takes only Part-of-Speech features into account achieves the best result. Generally, the values apart from the highest are lower than those of the previous scenario, which might be due to the fact that the Tweets are distributed among a higher number of timeslices. The number of Tweets belonging together within a timeslice is therefore lower, which results in a weaker clustering. Nevertheless the overall result is again better than with CrisisTracker.

5.2.5 Scenario 3: Drinking Water

The third scenario contains information concerning the drinking water supply in Passau. It is the scenario with the highest number of Tweets and also one of those spread over the longest timespan.

5.2.5.1 Reference: CrisisTracker

For this scenario CrisisTracker achieves a similar value to the previous scenario.

                Recall    Precision   F1
CrisisTracker   0.12115   0.07635     0.09367

5.2.5.2 Own Implementation

For the evaluation we again used the same settings as in the previous scenarios and achieved fairly good results.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.64      0.10686     0.18314
2         1.00   0.00   0.00       NAHC2        0.01871   0.32718     0.0354
3         0.00   1.00   0.00       AP           0.36183   0.34304     0.35219
4         0.00   1.00   0.00       NAHC         0.39133   0.32201     0.35330 *
5         0.00   0.00   1.00       AP           0.8       0.10080     0.17905
6         0.00   0.00   1.00       NAHC         0.33188   0.17509     0.22924
7         0.25   0.25   0.50       AP           0.44036   0.17449     0.24993
8         0.25   0.25   0.50       NAHC         0.12828   0.09525     0.10933
9         0.25   0.50   0.25       AP           0.35470   0.24831     0.29212
10        0.25   0.50   0.25       NAHC         0.34877   0.21836     0.26857
11        0.33   0.33   0.33       AP           0.38846   0.18885     0.25414
12        0.33   0.33   0.33       NAHC         0.23702   0.12302     0.16197
13        0.50   0.25   0.25       AP           0.33329   0.20900     0.25690
14        0.50   0.25   0.25       NAHC         0.21582   0.16197     0.18506

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion Here the hierarchical clustering achieves better results than affinity propagation. With 39% of correct cluster assignments we accomplish a good result, although the spread of the relevant Tweets across the timeslices is quite high.

5.2.6 Scenario 4: River Dams

The fourth scenario deals with the threat of dam breaks on the river Spree in Spremberg. With under 30 Tweets, this scenario has the smallest scope in our set of scenarios.

5.2.6.1 Reference: CrisisTracker

In this scenario the F1 value of CrisisTracker is rather low. The reason may be the smaller set of Tweets within the scenario.

                Recall    Precision   F1
CrisisTracker   0.16667   0.02869     0.04895

5.2.6.2 Own Implementation

For the same parametrizations that we used in the previous scenarios we obtain mixed results.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.75      0.15811     0.2611
2         1.00   0.00   0.00       NAHC2        0.14583   0.29513     0.19520
3         0.00   1.00   0.00       AP           0.75      0.06568     0.12079
4         0.00   1.00   0.00       NAHC         0.5625    0.38523     0.45729 *
5         0.00   0.00   1.00       AP           0.75      0.04144     0.07855
6         0.00   0.00   1.00       NAHC         0.17916   0.01329     0.02474
7         0.25   0.25   0.50       AP           0.75      0.15235     0.2532
8         0.25   0.25   0.50       NAHC         0.3875    0.16023     0.2267
9         0.25   0.50   0.25       AP           0.6875    0.05727     0.10574
10        0.25   0.50   0.25       NAHC         0.4916    0.18538     0.26924
11        0.33   0.33   0.33       AP           0.75      0.04180     0.07919
12        0.33   0.33   0.33       NAHC         0.36667   0.04460     0.07933
13        0.50   0.25   0.25       AP           0.75      0.15629     0.25868
14        0.50   0.25   0.25       NAHC         0.42916   0.16141     0.23460

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion Here we achieve the best F1 value with the 100% PoS setting. Surprisingly, all other values are far behind this result and do not achieve a very respectable outcome. We explain this with the fact that only a small set of Tweets is available for this scenario. This implies that within one timeslice there are only very few relevant Tweets, which furthermore results in similar values per timeslice; the overall average is then composed of these resembling outcomes.

5.2.7 Scenario 5: Roadblock

The final scenario involves road blockings of the highway A9 in Germany. It has the lowest ratio of Tweets to scenario duration, which might result in a very low number of Tweets per timeslice.

5.2.7.1 Reference: CrisisTracker

In this scenario CrisisTracker achieves the lowest recall, precision and therefore also F1 value.

                Recall    Precision   F1
CrisisTracker   0.06667   0.01111     0.01904

5.2.7.2 Own Implementation

Here we again look at the settings mentioned before, taking both single-weighted parameters and distributed weightings of all similarity options into consideration.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.875     0.04190     0.07997
2         1.00   0.00   0.00       NAHC2        0.09416   0.44947     0.15570
3         0.00   1.00   0.00       AP           0.27125   0.32812     0.29699
4         0.00   1.00   0.00       NAHC         0.44164   0.20450     0.27956
5         0.00   0.00   1.00       AP           0.875     0.04190     0.07997
6         0.00   0.00   1.00       NAHC         0.36915   0.10101     0.15862
7         0.25   0.25   0.50       AP           0.65833   0.06103     0.11172
8         0.25   0.25   0.50       NAHC         0.28445   0.03441     0.06139
9         0.25   0.50   0.25       AP           0.67708   0.10931     0.18823
10        0.25   0.50   0.25       NAHC         0.49740   0.04881     0.08890
11        0.33   0.33   0.33       AP           0.7083    0.19712     0.30841
12        0.33   0.33   0.33       NAHC         0.18355   0.03632     0.0606
13        0.50   0.25   0.25       AP           0.7083    0.19719     0.30850 *
14        0.50   0.25   0.25       NAHC         0.34487   0.032025    0.058607

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion In this scenario, for the first time, the best results were obtained by a distributed weighting. The consideration of all features, with a higher weight on geolocation, achieved a fairly good F1 value of roughly 30%. Generally the F1 values are quite low, which might be due to the aforementioned low number of Tweets compared to the duration of the whole scenario.

5.2.8 Evaluation Summary

To sum up the evaluation section, Naive Agglomerative Hierarchical Clustering and Affinity Propagation basically achieve equally good results over the total number of settings. We attribute this to the fact that each of the algorithms has its strengths: affinity propagation is a rather new technique which applies well to short messages such as microblog posts, whereas hierarchical clustering is a widely acknowledged clustering method and has been offered by ELKI from the beginning. Also, the ELKI toolkit offers more configuration options for NAHC than for affinity propagation.

The importance of the different similarity options emerges especially when combining multiple features. The best achievable results depend largely on the composition of the dataset. In several scenarios the best results are achieved when using Part-of-Speech features only, although we also consistently obtain good results when weighting all options equally.

As mentioned above, the composition of the dataset has a high impact on the outcomes, since it is important to have a larger amount of relevant Tweets within one timeslice. If this is not the case, experience has shown that the number of false positives increases tremendously. In conclusion, our system achieved by far better results than CrisisTracker in all scenarios. Therefore we consider our basic objective of an improved aggregation achieved.

Chapter 6: Conclusion

To conclude this thesis we briefly summarize the content and reflect on the achieved goals and results.

6.1 Summary

To summarize this work, we created a system which takes social media content as input and offers a visualization component that allows the user to interact with the parametrization and the presentation of the results. The architecture is modular in order to keep the system adaptable, since it is to be integrated into the overall crowdSA framework.

The two main parts of this project are the similarity and the clustering components, which offer a wide range of configuration possibilities. For this parametrization we came up with a feature hierarchy according to which the similarity of a pair of Tweets can be calculated. Of those features we implemented the most important ones in the prototype system, namely geographical, semantic and syntactic features. For each of them we defined various feature levels to make the parametrization of the system even more fine-grained. Since there exists a vast number of clustering tools and algorithms, we decided to implement two of them in order to retrieve comparable results. For the implementation we used an external library which fully met our needs.

We evaluated the system by defining metrics which draw a comparison between the outcomes of different settings. We learned that a distributed set of similarity features performs well, but that, depending on the dataset composition, using Part-of-Speech features only can obtain the best results. The performance of the two clustering algorithms was good in both cases; still, as expected, affinity propagation obtained the better overall results.

6.2 Concluding Statements

Finally, we can state that we succeeded in creating a modular and compact system which meets the requirements of the crowdSA project. We achieved better clustering results than the reference tool CrisisTracker, which we attribute to our solid feature hierarchy. Semantic and geolocation aspects played a big role in creating the feature hierarchy and in calculating the similarity measures. The aggregation component is based on various parameters which can be configured through a user interface.

Chapter 7: Future Work

In this chapter we point out a few ideas which could improve the results of this system but were not possible to implement within the scope of a master's thesis. Since the system's architecture was designed in a very modular way, future add-ons can easily be integrated.

7.1 PreProcessor

There are several interesting concepts which could be implemented within the preprocessor component. Improvements of the current functionality are especially relevant with respect to performance and quality of results.

7.1.1 Time Fencing

The most relevant future feature of the preprocessor is time fencing, as already mentioned in chapter 3.3.2.2.

This component takes Tweets as input and delivers annotations as output. These annotations should provide information about when the situation described in the Tweet content actually took place. Therefore the system should be able to identify temporal information, such as dates or phrases like "tomorrow", "today", "next week", etc., from the social media content.

Algorithms HeidelTime is a temporal tagger which uses the TIMEX3 annotation standard, part of TimeML. The HeidelTime project [63] offers the possibility to extract temporal information from texts in several languages, including German. Since our extracted Tweets mainly use German, this is very useful. Unfortunately there is no HeidelTime plugin for GATE, yet it can be included into the UIMA pipeline (http://uima.apache.org/). Apache UIMA is a framework which makes it possible to analyze and annotate unstructured text.

Figure 7.1 shows an example with different temporal tags. The output in this example is a TimeML file with TIMEX3 tags.

Figure 7.1: HeidelTime demo example [31]
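A minimal usage sketch of HeidelTime's standalone wrapper is shown below; the package paths, enum values and constructor signature are assumptions based on the HeidelTime distribution and may differ between versions:

import java.util.Date;

import de.unihd.dbs.heideltime.standalone.DocumentType;
import de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone;
import de.unihd.dbs.heideltime.standalone.OutputType;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class TimeTaggingDemo {
    public static void main(String[] args) throws Exception {
        // COLLOQUIAL is intended for informal text such as Tweets.
        HeidelTimeStandalone tagger = new HeidelTimeStandalone(
                Language.GERMAN, DocumentType.COLLOQUIAL, OutputType.TIMEML);
        // The document creation time anchors relative expressions like "morgen".
        String timeml = tagger.process("Morgen werden in Dresden Sandsäcke benötigt.", new Date());
        System.out.println(timeml); // TimeML document containing TIMEX3 tags
    }
}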

7.2 Aggregation

For the aggregation there exist various options for extending the component, since the parametrization of similarity features and clustering algorithms offers the capability to easily integrate new functionality.

7.2.1 Similarity Framework

Apart from improvements of the existing similarity calculations, new features and their similarity aspects should also be taken into consideration for implementation.

Semantic Relatedness The similarity feature of semantic relatedness could use some improvements, especially regarding performance. Currently we offer two configuration levels, HASHTAG and ALL, which decide on the word types taken into account for semantic analysis. When only considering hashtags, the performance of the algorithm is quite good, but as soon as all words are used, the duration of the execution stands in no reasonable relation to the quality of the results. Therefore it would be necessary to implement further configuration options, for example taking only named entities into account.


Time Fencing As mentioned in the previous section, time fencing should play an important role in future implementations. Therefore it is also necessary to consider similarity metrics for the relatedness of temporal tags. The basic idea of the time annotations is to find out the date and time when the described content actually happened. Temporal similarity might be computed by setting a scaling threshold which maps time distance to a similarity value; e.g. Tweets occurring within a time frame of 15 minutes have a similarity of 100%, those 15 minutes later only a similarity of 75%, etc.
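A sketch of such a stepwise decay, using 15-minute windows where each further window reduces the similarity by 25% as in the example above (a hypothetical helper, not part of the current system):

import java.time.Duration;
import java.time.Instant;

public class TemporalSimilarity {

    // 0 elapsed 15-minute windows -> 1.0, one window -> 0.75, ..., floored at 0.0.
    static double similarity(Instant a, Instant b) {
        long minutes = Math.abs(Duration.between(a, b).toMinutes());
        long windows = minutes / 15;
        return Math.max(0.0, 1.0 - 0.25 * windows);
    }
}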

7.3 Visualization

This section was composed together with Gerald Madlsperger [47], since it affects both of our theses.

A graphical user interface always includes aspects whose usability could be improved. For our project we primarily wanted to present and visualize the configuration and the results of our system.

7.3.1 Live Data

As mentioned in chapter 3.3.6.2, in the future it should also be possible to stream live data from Twitter or other social media platforms in order to present real-time results, as shown in figure 7.2. Since this implies a higher implementation effort, we already came up with some conceptual ideas.

For fetching the data it should be possible to set a Time Frame size. As soon as a Time Frame has ended, it is visualized on a Result Page. Further it is possible to select whether only German Tweets or Tweets in all languages should be shown in the Time Frame Result. Generally, only German Tweets are taken into consideration for the Episode Extraction.

Filter options are available to narrow the result set. On the one hand it is possible to enter keywords to be used by the Twitter streamer. Further, a geographical polygon should be taken into account in the preprocessing steps of the result data; this means that only Tweets located within this area are considered for the event clustering and Episode Extraction, as sketched below.
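Such a geographical filter could be sketched with the JDK's own Path2D, treating the polygon vertices as longitude/latitude pairs; the coordinates below are made-up illustrative values:

import java.awt.geom.Path2D;

public class GeoFilter {

    // Builds a closed polygon from lon/lat vertex pairs.
    static Path2D.Double polygon(double[][] vertices) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(vertices[0][0], vertices[0][1]);
        for (int i = 1; i < vertices.length; i++) {
            path.lineTo(vertices[i][0], vertices[i][1]);
        }
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        // Rough bounding polygon around Dresden (illustrative values only).
        Path2D.Double area = polygon(new double[][] {
            {13.6, 50.9}, {13.9, 50.9}, {13.9, 51.2}, {13.6, 51.2}
        });
        System.out.println(area.contains(13.73, 51.05)); // true: inside the area
    }
}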

Figure 7.2: Homepage of live data usage

Glossary

Component A component is a part of code which conforms to the interface description of the pipeline architecture and implements exactly one function.

Episode An episode describes a specific situation within a bigger event; for example, within an event about the flooding in Dresden, an episode could be the situation where people are looking for sandbags to build up a dam.

Event An event describes the aggregation of messages talking about the same topic, like all Tweets about the flooding in Dresden.

Feature A feature is a characteristic which distinguishes one Social Media Content from another. Features offer the possibility to categorize different Social Media Contents and therefore provide the opportunity to perform message aggregation.

Pipeline This project is built on the basis of a stackable pipeline architecture. This means that the whole project consists of several components which can be combined into executable pipelines.

Time Slice A time slice is a given span of time. Within this work the term is often used, as the data is split according to pre-defined time slices at the beginning of the pipeline.

Bibliography

[1] RapidMiner. Online, April 2012. URL http://rapid-i.com/content/view/181/190/.

[2] http://www.magentocommerce.com/magento-connect/open-social-media-monitoring-3195.html, 2013.

[3] https://wiki.ushahidi.com/display/WIKI/SwiftRiver, 2013.

[4] Fabian Abel, Claudia Hauff, Geert-Jan Houben, Richard Stronkman, and Ke Tao. Twitcident: fighting fire with information from social web streams. In Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, pages 305–308, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1230-1. doi: 10.1145/2187980.2188035. URL http://doi.acm.org/10.1145/2187980.2188035.

[5] Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Evaluation of clusterings - metrics and visual support. In Anastasios Kementsietsidis and Marcos Antonio Vaz Salles, editors, IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 1285–1288. IEEE Computer Society, 2012. ISBN 978-0-7695-4747-3. doi: 10.1109/ICDE.2012.128. URL http://dx.doi.org/10.1109/ICDE.2012.128.

[6] Elke Achtert, Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Interactive data mining with 3d-parallel-coordinate-trees. In Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 1009–1012. ACM, 2013. ISBN 978-1-4503-2037-5. doi: 10.1145/2463676.2463696. URL http://doi.acm.org/10.1145/2463676.2463696.

[7] G. Andrienko and N. Andrienko. Spatio-temporal aggregation for visual analysis of movements. In Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on, pages 51–58, Oct 2008. doi: 10.1109/VAST.2008.4677356.

[8] D. Archambault, H.C. Purchase, and B. Pinaud. Animation, small multiples, and the effect of mental map preservation in dynamic graphs. Visualization and Computer Graphics, IEEE Transactions on, 17(4):539–552, April 2011. ISSN 1077-2626. doi: 10.1109/TVCG.2010.78.

[9] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. ISBN 0-201-39829-X.

[10] Piyush Bansal, Somay Jain, and Vasudeva Varma. Towards semantic retrieval of hashtags in microblogs. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 7–8, Republic and Canton of Geneva, Switzerland, 2015. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-3473-0. doi: 10.1145/2740908.2742717. URL http://dx.doi.org/10.1145/2740908.2742717.

[11] Christian Bauer and Gavin King. Java Persistence with Hibernate. Manning Publications Co., Greenwich, CT, USA, 2006. ISBN 1932394885.

[12] Norbert Baumgartner, Stefan Mitsch, Andreas Müller, Werner Retschitzegger, Andrea Salfinger, and Wieland Schwinger. The Situation Radar - Visualizing Collaborative Situation Awareness in Traffic Control Systems. In Proceedings of the 19th World Congress on Intelligent Transport Systems, 2012.

[13] Matko Bošnjak, Eduardo Oliveira, José Martins, Eduarda Mendes Rodrigues, and Luís Sarmento. Twitterecho: a distributed focused crawler to support open research with twitter data. In Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, pages 1233–1240, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1230-1. doi: 10.1145/2187980.2188266. URL http://doi.acm.org/10.1145/2187980.2188266.

[14] Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A. Greenwood, Diana Maynard, and Niraj Aswani. TwitIE: An open-source information extraction pipeline for microblog text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, 2013.

[15] Wenlong Chen, Shaoyin Cheng, Xing He, and Fan Jiang. Influencerank: An efficient social influence measurement for millions of users in microblog. In Cloud and Green Computing (CGC), 2012 Second International Conference on, pages 563–570, Nov 2012. doi: 10.1109/CGC.2012.31.

[16] Community. timemap: javascript library to help use a SIMILE timeline with online maps including Google, OpenLayers, and Bing, 2011. URL http://code.google.com/p/timemap/. [Online; accessed 28-July-2015].

[17] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), 2002.

[18] Danica Damljanovic and Kalina Bontcheva. Named entity disambiguation using linked data. In Proceedings of the 9th Extended Semantic Web Conference (ESWC 2012), Poster session, Heraklion, Crete, 2012. URL http://2012.eswc-conferences.org/sites/default/files/eswc2012_submission_334.pdf.

[19] Leon Derczynski, Alan Ritter, and Sam Clark. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Advances in Natural Language Processing, pages 198–206, 2013.

[20] EIONET. EIONET GEMET Thesaurus. URL http://www.eionet.europa.eu/gemet/rdf?langcode=en. [Online; accessed 13-August-2015].

[21] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.

[22] GeoNames, a project of Unxos GmbH, Switzerland. GeoNames. URL http://download.geonames.org/export/dump/.

[23] Gerald Madlsperger and Carina Reiter. Project in Intelligent Information Systems: CrowdSA Pipeline, 2014.

[24] Gerald Madlsperger, Sebastian Pöll, and Carina Reiter. Evaluation of Social Media Search / Monitoring Tools, 2013.

[25] Google. Google Maps APIs, 2015. URL https://developers.google.com/maps/?hl=en. [Online; accessed 18-August-2015].

[26] Viet Ha-Thuc, Yelena Mejova, Christopher Harris, and Padmini Srinivasan. Event intensity tracking in weblog collections.

[27] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://doi.acm.org/10.1145/1656274.1656278.

[28] Alyssa Harding, Brian Ngo, Brian Steadman, and Nina Liong. Fintan: an algorithmic approach to news aggregation.

[29] W. H. Hsu, M. Abduljabbar, R. Osuga, M. Lu, and W. Elshamy. Visualization of clandestine labs from seizure reports: Thematic mapping and data mining research directions. In Proceedings of the 2nd European Workshop on Human-Computer Interaction and Information Retrieval, Nijmegen, The Netherlands, August 25, 2012, pages 43–46, 2012.

[30] Georgiana Ifrim, Bichen Shi, and Igor Brigadir. Event detection in twitter using aggressive filtering and hierarchical tweet clustering. In Proceedings of the SNOW 2014 Data Challenge co-located with 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April 8, 2014, pages 33–40, 2014. URL http://ceur-ws.org/Vol-1150/ifrim.pdf.

[31] Uni-Heidelberg Informatik. HeidelTime, Google Code. Online, October 2014. URL https://code.google.com/p/heideltime/.

[32] Y. Inoue, K. Tsuruoka, and M. Arikawa. Spatio-temporal story mapping animation based on structured causal relationships of historical events. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XL-4:101–103, 2014. doi: 10.5194/isprsarchives-XL-4-101-2014.

[33] Michael Jahn. Extraktion von Krisenrelevanten Inhalten aus Twitter mit ausgewählten Systemen, 2014.

[34] S. Jarvis and S.A. Crossley. Approaching Language Transfer through Text Classification: Explorations in the Detection-based Approach. Channel View Publications, 2012. ISBN 9781847697004. URL https://books.google.at/books?id=dyH0_DPdQV0C.

[35] David Jurgens and Keith Stevens. The s-space package: An open source package for word space models. In Proceedings of the ACL 2010 System Demonstrations, ACLDemos '10, pages 30–35, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858933.1858939.

[36] E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, and E. Toms. Information Access Evaluation – Multilinguality, Multimodality, and Interaction: 5th International Conference of the CLEF Initiative, CLEF 2014, Sheffield, UK, September 15-18, 2014, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, 2014. ISBN 9783319113821. URL https://books.google.at/books?id=z_JOBAAAQBAJ.

[37] Peter Kolb. Experiments on the difference between semantic similarity and relatedness. In Kristiina Jokinen and Eckhard Bick, editors, Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009, volume 4, pages 81–88. Northern European Association for Language Technology, 2009. URL http://dspace.utlib.ee/dspace/bitstream/10062/9731/1/paper37.pdf.

[38] Grzegorz Kondrak. N-gram similarity and distance. In Mariano P. Consens and Gonzalo Navarro, editors, 12th International Conference String Processing and Information Retrieval (SPIRE), volume 3772 of Lecture Notes in Computer Science, pages 115–126, Buenos Aires, Argentina, 2005. Springer, Berlin, Germany. URL http://www.cs.ualberta.ca/~kondrak/papers/spire05.ps.

[39] Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Evaluation of multiple clustering solutions. In Emmanuel Müller, Stephan Günnemann, Ira Assent, and Thomas Seidl, editors, Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, Athens, Greece, September 5, 2011, in conjunction with ECML/PKDD 2011, volume 772 of CEUR Workshop Proceedings, pages 55–66. CEUR-WS.org, 2011. URL http://ceur-ws.org/Vol-772/multiclust2011-paper7.pdf.

[40] Jagadeesh Majji Kumar Vasantha. An efficient text clustering algorithm using affinity propagation. Indian Journal of Computer Science and Engineering (IJCSE), 2013.

[41] Topsy Labs. Search and analyze the social web, 2015. URL http://topsy.com/. [Online; accessed 18-August-2015].

[42] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, February 1966.

[43] James Llinas, Christopher Bowman, Galina Rogova, Alan Steinberg, Ed Waltz, and Frank White. Revisiting the JDL data fusion model II. In P. Svensson and J. Schubert, editors, Proceedings of the Seventh International Conference on Information Fusion (FUSION 2004), pages 1218–1230, 2004.

[44] A. M. MacEachren, A. Jaiswal, A. C. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, and J. Blanford. Senseplace2: Geotwitter analytics support for situational awareness. In Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, pages 181–190, 2011. doi: 10.1109/VAST.2011.6102456.

[45] A.M. MacEachren, A. Jaiswal, A.C. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, and J. Blanford. Senseplace2: Geotwitter analytics support for situational awareness. In Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, pages 181–190, 2011. doi: 10.1109/VAST.2011.6102456.

[46] David J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA, 2002. ISBN 0521642981.

[47] Gerald Madlsperger. Object Extraction and Evolution in Crowd Situation Awareness, 2015.

[48] Debanjan Mahata, John R. Talburt, and Vivek Kumar Singh. Identification and ranking of event-specific entity-centric informative content from twitter. In Chris Biemann, Siegfried Handschuh, André Freitas, Farid Meziane, and Elisabeth Métais, editors, Natural Language Processing and Information Systems, volume 9103 of Lecture Notes in Computer Science, pages 275–281. Springer International Publishing, 2015. ISBN 978-3-319-19580-3. doi: 10.1007/978-3-319-19581-0_24. URL http://dx.doi.org/10.1007/978-3-319-19581-0_24.

[49] Mingjie Qian, Department of Computer Science, University of Illinois at Urbana-Champaign. Java library for machine learning (jml), 2014. URL https://sites.google.com/site/qianmingjie/home/toolkits/jml. [Online; accessed 06-July-2015].

[50] Massachusetts Institute of Technology. SIMILE: Semantic interoperability of metadata and information in unlike environments, 2008. URL http://simile.mit.edu/wiki/Main_Page. [Online; accessed 28-July-2015].

[51] Andrei Olariu. Hierarchical clustering in improving microblog stream summarization. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 7817 of Lecture Notes in Computer Science, pages 424–435. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-37255-1. doi: 10.1007/978-3-642-37256-8_35. URL http://dx.doi.org/10.1007/978-3-642-37256-8_35.

[52] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning Publications Co., Greenwich, CT, USA, 2011. ISBN 1935182684, 9781935182689.

[53] Yefei Peng, Daqing He, and Ming Mao. Geographic named entity disambiguation with automatic profile generation. In Web Intelligence, pages 522–525. IEEE Computer Society, 2006. ISBN 0-7695-2747-7. URL http://dblp.uni-trier.de/db/conf/webi/webi2006.html#PengHM06.

[54] Hemant Purohit and Amit Sheth. Twitris v3: From citizen sensing to analysis, coordination and action, 2013. URL https://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/6106.

[55] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL http://www.R-project.org.

[56] Ulf-Dietrich Reips and Pablo Garaizar. Mining twitter: A source for psychological wisdom of the crowds. Behavior Research Methods, 43(3):635–642, 2011. doi: 10.3758/s13428-011-0116-6. URL http://dx.doi.org/10.3758/s13428-011-0116-6.

[57] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979. ISBN 0408709294.

[58] J. Rogstadius, C. Teixeira, M. Vukovic, V. Kostakos, E. Karapanos, and J. Laredo. IBM Journal of Research and Development (2013), 2013.

[59] J. Rogstadius, M. Vukovic, C.A. Teixeira, V. Kostakos, E. Karapanos, and J.A. Laredo. Crisistracker: Crowdsourced social media curation for disaster awareness. IBM Journal of Research and Development, 57(5):4:1–4:13, Sept 2013. ISSN 0018-8646. doi: 10.1147/JRD.2013.2260692.

[60] Lior Rokach and Oded Maimon. Clustering methods. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 321–352. Springer US, 2005. ISBN 978-0-387-24435-8. doi: 10.1007/0-387-25465-X_15. URL http://dx.doi.org/10.1007/0-387-25465-X_15.

[61] Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, and Aristides Gionis. Measuring and summarizing movement in microblog postings. In Emre Kiciman, Nicole B. Ellison, Bernie Hogan, Paul Resnick, and Ian Soboroff, editors, ICWSM. The AAAI Press, 2013. ISBN 978-1-57735-610-3.

[62] Andrea Salfinger, Werner Retschitzegger, Wieland Schwinger, and Birgit Pröll. crowdSA - Towards Adaptive and Situation-Driven Crowd-Sensing for Disaster Situation Awareness. In Proceedings of IEEE International Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA 2015), Orlando, USA, 03 2015.

[63] Jannik Strötgen and Michael Gertz. Heideltime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 321–324, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1859664.1859735.

[64] Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, EMNLP '00, pages 63–70, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics. doi: 10.3115/1117794.1117802. URL http://dx.doi.org/10.3115/1117794.1117802.

[65] Carlos Vicient. Moving towards the Semantic Web: enabling new technologies through the semantic annotation of social contents. PhD thesis, Computer Engineering (Enginyeria Informàtica), Universitat Rovira i Virgili, 2015.

[66] Wikipedia. Haversine formula, 2014. URL http://en.wikipedia.org/wiki/Haversine_formula. [Online; accessed 05-October-2014].

[67] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, ACL '94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. doi: 10.3115/981732.981751. URL http://dx.doi.org/10.3115/981732.981751.

[68] Shasha Xie and Yang Liu. Using corpus and knowledge-based similarity measure in maximum marginal relevance for meeting summarization. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 4985–4988, March 2008. doi: 10.1109/ICASSP.2008.4518777.

[69] Yang Zhang and Qing Yang. Clustering users in twitter based on interests. In National Conference on Information Retrieval, Japan, 2011. URL http://www.nlpr.ia.ac.cn/2011papers/gnhy/nh4.pdf.

[70] F. Ye, H. Wang, S. Ouyang, X. Tang, Z. Li, and M. Prakash. Spatio-temporal analysis and visualization using sph for dam-break and flood disasters in a gis environment. In Geomatics for Integrated Water Resources Management (GIWRM), 2012 International Symposium on, pages 1–6, 2012. doi: 10.1109/GIWRM.2012.6349636.

[71] Eva Zangerle, Wolfgang Gassler, and Günther Specht. On the impact of text similarity functions on hashtag recommendations in microblogging environments. Social Network Analysis and Mining, 3(4):889–898, 2013. ISSN 1869-5450. doi: 10.1007/s13278-013-0108-x. URL http://dx.doi.org/10.1007/s13278-013-0108-x.

[72] Yi Zhang and Flora S. Tsai. Combining named entities and tags for novel sentence detection. In Proceedings of the WSDM '09 Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR '09, pages 30–34, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-430-0. doi: 10.1145/1506250.1506256. URL http://doi.acm.org/10.1145/1506250.1506256.

[73] Zhe Zhao, University of Michigan. Replication on affinity propagation: Clustering by passing messages between data points, 2014. URL http://www-personal.umich.edu/~zhezhao/papers/AffinityPropagation.pdf. [Online; accessed 03-March-2015].

Eidesstattliche Erklärung (Statutory Declaration)

I declare in lieu of an oath that I have written this master's thesis independently and without outside assistance, that I have used no sources or aids other than those indicated, and that all passages taken literally or in substance from other sources are marked as such. This master's thesis is identical to the electronically submitted text document.

Place and date                Carina Reiter

Curriculum Vitae

Carina Reiter
Born: 17 July 1990
Nationality: Austria

4020 Linz

Email: [email protected]

Professional Experience

08/2015 – present   Junior Data Warehouse Developer (full-time), bet-at-home.com Entertainment GmbH. MS SSIS/SSAS/SSRS; data warehouse.

09/2013 – 06/2014   Software Testing and Development (part-time), Emporia Telecom Produktions- und Vertriebs-GmbH & Co.KG. Black-box software testing of mobile phones; web development of a software test environment.

07/2013 – 09/2013   Software Development Internship, Borland Entwicklung GmbH (a Micro Focus Company). Design and development of a software test environment; SCRUM; Java and Web 2.0 development.

07/2012 – 08/2012   Software Development Internship, Borland Entwicklung GmbH (a Micro Focus Company). Design and development of a software test environment; SCRUM; Java and Web 2.0 development.

Education

03/2013 – present   Johannes Kepler Universität Linz. Dipl.-Ing., Computer Science (Intelligent Information Systems).

09/2014 – 12/2014   University of Jyväskylä, Finland. Information Technology, Erasmus exchange semester.

10/2009 – 02/2013   Johannes Kepler Universität. Bachelor of Science, Computer Science.

09/2004 – 07/2009   BHAK Perg. Matura (school-leaving examination), International Business.

Skills

Advanced: database development, database systems, HTML, Java, Linux, machine learning, MS Office, ontologies, SQL, web development, Windows operating systems.

Further skills: Business Intelligence, C#, C/C++, Data Mining, Data Warehouse, JavaScript, NoSQL, PHP, RapidMiner, Scrum.

Languages

German (native), English (fluent), Czech (good), French (basic), Russian (basic).

Interests

Travelling, photography and image editing, sports (running, yoga), cinema and the film industry.

Linz, 13 October 2015