JOHANNES KEPLER UNIVERSITÄT LINZ

Technisch-Naturwissenschaftliche Fakultät

Semantic Enriched Aggregation of Social Media in Crowd Situation Awareness

MASTER'S THESIS

for the attainment of the academic degree Diplom-Ingenieurin

in the Master's Program Computer Science

Submitted by: Carina Reiter, BSc

Carried out at: FAW - Institute for Application Oriented Knowledge Processing

Assessor: a.Univ.-Prof. Dr. Birgit Pröll

Linz, October 2015

Kurzfassung

This thesis was written within the crowdSA project at the Institute for Application Oriented Knowledge Processing of the Johannes Kepler University Linz. The crowdSA project deals with the development of a crisis management system which automatically extracts data from social networks, analyzes and processes it, and finally presents it clearly to domain experts via a user interface.

The scope of this thesis covers the aggregation of messages from social media. Preprocessed, annotated contents serve as input, and clusters corresponding to real-life events are generated on the basis of similarity matrices. The similarity between messages is computed from metadata, such as location, and payload data, such as semantic and syntactic information, and is then passed on to the clustering component. The resulting event clusters are visualized by means of a graphical user interface.

For the evaluation of this work, five scenarios were extracted from Twitter data of the 2013 floods in Austria and Germany. With the help of the metrics recall and precision and the resulting F1-score, the correct assignment of messages to an event cluster can be assessed. In a comparison of different parametrizations, the best result of 45% was achieved with a preferred weighting of syntactic similarity values.

Further improvements, particularly regarding the selection of similarity features and algorithms, could lead to a better F1-score. Nevertheless, the implementation carried out within the scope of this thesis already achieves substantially better results than the reference system CrisisTracker (best value: 10%). The goals defined in advance were thus achieved.

Abstract

This thesis is situated within the crowdSA research project at the Institute for Application Oriented Knowledge Processing of the Johannes Kepler University Linz and deals with social media monitoring and analysis during crisis situations. The project aims to create a system which processes and enriches social media data and presents the results in a graphical user interface, to be used by crisis management actors.

The scope of this work lies in the message aggregation component, which takes annotated social media content as input and creates clusters that represent real-life events. The annotations, e.g. location, semantic concepts, part-of-speech tags, etc., are added to the content in a preprocessing step. Based on these tags, a similarity matrix over all combinations of social media contents is calculated and passed to the clustering framework, which then aggregates similar contents into event clusters. The results are presented in a graphical user interface, which also visualizes the algorithm parametrization.

To evaluate the results, five scenarios of the 2013 flooding in Germany and Austria are used. Recall, precision, and the resulting F1-score are calculated for various parametrization options. The best result, an F1-score of 45%, is achieved by using geographical, semantic, and syntactic similarity values, with a higher weighting for the latter.

Further improvements in the context of feature and algorithm selection might result in a better F1-score. Nevertheless, the implementation already outperforms the reference tool CrisisTracker and thus meets all predefined goals.

Acknowledgements

The practical work of this thesis was developed in collaboration with Gerald Madlsperger [47] in order to build a system which addresses both of our theses. Therefore, selected chapters, as indicated, are identical in both documents.

Contents

Kurzfassung

Abstract

Acknowledgements

Contents

List of Figures

1 Introduction
  1.1 Motivation
  1.2 State of Research
    1.2.1 CrisisTracker
  1.3 Aims and Objectives
    1.3.1 Data Analysis and Definition of a Feature Hierarchy
    1.3.2 Cluster Retrieval based on Semantic Enriched and Geolocated Data

2 Dataset
  2.1 Data
  2.2 Scenarios
    2.2.1 Scenario: "Bridge Blockade"
    2.2.2 Scenario: "Sandbags"
    2.2.3 Scenario: "Drinking Water Supply"
    2.2.4 Scenario: "Riverdams"
    2.2.5 Scenario: "Roadblock"

3 Concept
  3.1 Features
    3.1.1 Content Features
    3.1.2 Location Based Features
    3.1.3 User Based Features
  3.2 Architecture
  3.3 Components
    3.3.1 Controller Component
    3.3.2 PreProcessor


    3.3.3 Aggregation Component
    3.3.4 Object Extraction Component
    3.3.5 Evolution Analysis Component
    3.3.6 Visualization Component
  3.4 Data Model
    3.4.1 Prerequisites
    3.4.2 CrowdSA Data-layer

4 Implementation
  4.1 PreProcessing
    4.1.1 Topic Fencing: Part-of-Speech Tagging
    4.1.2 Geo Fencing
  4.2 Aggregation
    4.2.1 Similarity Calculation
    4.2.2 Clustering
    4.2.3 Clustering tools
  4.3 Visualization
    4.3.1 Home (Historical Data)
    4.3.2 Timeslice Details
    4.3.3 Event Detail
    4.3.4 Timeline Result

5 Results and Evaluation
  5.1 Evaluation Criteria
    5.1.1 Cluster Evaluation
    5.1.2 Recall
    5.1.3 Precision
    5.1.4 F1-Score
  5.2 Evaluation Results
    5.2.1 Semantic Level
    5.2.2 Geolocation Level
    5.2.3 Scenario 1: Bridge Blockade
    5.2.4 Scenario 2: Sandbags
    5.2.5 Scenario 3: Drinking Water
    5.2.6 Scenario 4: River Dams
    5.2.7 Scenario 5: Roadblock
    5.2.8 Evaluation Summary

6 Conclusion
  6.1 Summary
  6.2 Concluding Statements

7 Future Work
  7.1 PreProcessor
    7.1.1 Time Fencing
  7.2 Aggregation
    7.2.1 Similarity Framework
  7.3 Visualization

    7.3.1 Live Data

Bibliography

Declaration of Authorship

List of Figures

1.1 Social Media Monitoring Tools Overview and Functionality

3.1 Similarity Feature Hierarchy
3.2 General Architecture
3.3 Overall Pipeline
3.4 Abstract Pipeline Architecture
3.5 Class Diagram of Controller Architecture
3.6 Package diagram of preprocessing component
3.7 Class diagram of preprocessing component
3.8 Class diagram of the TweetLocator
3.9 Package diagram of aggregation component
3.10 Class diagram of aggregation component
3.11 Site Map of the Prototype Visualization
3.12 Home Page of Live Data Usage
3.13 Home Page of Historical Data Usage
3.14 Result Page of an Event Clustering Timeline
3.15 Result Page of an Event Cluster
3.16 Result Page of an Episode Timeline
3.17 Generic Source Layer
3.18 Source Layer from CrisisTracker
3.19 Generic Aggregated Layer
3.20 Aggregated Layer from CrisisTracker
3.21 Integrated Layer

4.1 Visualization component diagram
4.2 Home (Historical Data)
4.3 Timeslice Details
4.4 Event Details

7.1 HeidelTime Demo example [31]
7.2 Homepage of live data usage

Chapter 1

Introduction

This chapter gives an overview of the background of this thesis. We introduce the research topic, survey existing projects and literature, and state the aims and objectives of this work.

1.1 Motivation

In the past decade, social media platforms have become an integral part of people's lives. This has also resulted in a huge amount of data being produced and stored on the World Wide Web. With this change in lifestyle, projects appeared which perceived an immense value in this data for various aims. Not only commercial insights can be retrieved from social media data; these platforms also carry information about events and critical situations, which can in turn be used to help people.

The Institute for Application Oriented Knowledge Processing1 and the Institute for Cooperative Information Systems2 of the Johannes Kepler University Linz work on a project in collaboration with the Austrian Red Cross and others, in order to create a Crowd Situation Awareness [62] system. This framework should extract, analyze, and process data from social media platforms and present enriched information to partners, e.g. at the Austrian Red Cross, so that they can act faster and more efficiently. For the architecture of the framework, Salfinger et al. [62] introduce six levels of information fusion, which are based on the JDL (Joint Directors of Laboratories) data fusion model by Llinas et al. [43] and involve the following:

1 http://faw.jku.at
2 http://cis.jku.at


L0 - Crowd Sensing Observing social media and analyzing/preprocessing the data with respect to quality assurance and semantic annotation.

L1 - Crowd Perception Detecting event situations based on content similarity and geo-fencing. Furthermore, these clustered social media contents are mapped to a crisis ontology.

L2 - Comprehension Discovering evolving situations and analyzing their relationships in order to better profile a situation and to automatically learn from these events.

L3 - Projection Forecasting the impact of future events by comparison with related existing events. This also involves informing emergency managers and judging the likelihood that the future event will happen.

L4 - Resource Management Including incoming data from authority sensors in order to obtain a bigger picture of the situation.

L5 - User Refinement Presenting the results of the system to the operating user based on geographical maps. Further adjustments/feedback are gathered through the visualization component.

This master's thesis is located between L0 and L1 of the crowdSA project and evaluates possible strategies in the field of semantically enriched message aggregation.

1.2 State of Research

In a previous seminar project [24] we evaluated various so-called Social Media Monitoring tools, which observe, analyze, and process crowd-sourced data for different purposes. For this master's thesis these tools are especially interesting, since they provide a benchmark of how crowd data can be extracted and processed.

In order to get an overview of the features and conditions of these tools, we came up with five main evaluation criteria: functionality, employment, access, services, and supplemented data sources.

Access The access criterion describes how data is gathered from the supported sources and which functions are applied directly to the fetched data. Furthermore, it covers the supported social media platforms, such as Facebook, Twitter, Blogspot, and so forth.

Functionality This criterion covers the core of the inspected social media monitoring tools. It tries to give a good overview of the most common and most relevant features in relation to crisis detection. For each feature, the evaluation should contain enough information to convey the functional principle, but also to distinguish the method from other methods.

Services In contrast to the functionality criterion, this one does not describe the functions and algorithms used to manipulate and explore the data, but which functionalities and metadata are provided to the users of the system via APIs, user interfaces, or simple export facilities. Services are always configurable and controllable by the user of the system.

Supplemented Data Sources Supplemented data sources are used to obtain further data from sources other than social media platforms. This data can be used to enrich the social media data or to feed algorithms such as filtering functions or spam probability calculations.

Employment Employment describes architectural and usability aspects of the inspected tools.

With those aspects in mind, it was possible to evaluate the most prominent tools in this research field, which include the following.

Attensity Attensity, Palo Alto, California, USA http://www.attensity.com/products/technology/integration/

Attentio Attentio SA, London, United Kingdom http://attentio.com/

Brandwatch Brandwatch, Brighton, United Kingdom http://www.brandwatch.com/

BuzzCapture BuzzCapture b.v., Amsterdam, Netherlands http://www.buzzcapture.com/brand-monitor/

Crimson Hexagon Crimson Hexagon, Boston, Massachusetts, USA http://www.crimsonhexagon.com/forsight-platform/

Crisis Tracker [58] Madeira Interactive Technologies Institute, University of Madeira, Portugal http://ufn.virtues.fi/crisistracker

Cyfe Cyfe Inc., Framingham, Massachusetts, USA http://www.cyfe.com/

Dialogix Dialogix, Southport BC, Queensland, Australia http://www.dialogix.com.au/

eCairn eCairn Inc., Mountain View, California, USA http://www.ecairn.com/

Engagor Engagor BVBA, Gent, Belgium http://engagor.com/

iScienceMaps [56] iScience, Experimental Psychology and Internet Science, Universidad de Deusto, Bilbao, Spain. http://maps.iscience.deusto.es/

Open-Social-Media-Monitoring [2] Openstream Internet Solutions, Zürich, Switzerland http://openstream.github.io/open-social-media-monitoring/

Repustate Repustate Inc., Toronto, Canada https://www.repustate.com/

Semiocast Semiocast, Paris, France http://developer.semiocast.com/

SensePlace2 [45] GeoVista, Penn State University, State College, Pennsylvania, USA http://www.geovista.psu.edu/SensePlace2/

SocialMention Jon Cianciullo, Toronto, Canada http://socialmention.com/

SocialSearcher Social Searcher, St. Petersburg, Russia http://www.social-searcher.com/

Spinn3r [26][28] Tailrank Inc, San Francisco, USA http://www.spinn3r.com/

Spiral16 Spiral16, Kansas City, Kansas, USA http://www.spiral16.com/software-services/

SwiftRiver [3] Ushahidi, Inc., Nairobi, Kenya http://www.ushahidi.com/products/swiftriver-platform

Tattler Phase2 Technology, New York, USA http://tattlerapp.com/

Topsy Topsy Labs, San Francisco, USA http://about.topsy.com/

Twazzup Twazzup, San Francisco, USA http://www.twazzup.com/

Twitcident [4] Twitcident, Delft, Netherlands http://twitcident.com/

Twitris [54] Wright State University, Dayton, Ohio, USA http://twitris.knoesis.org/

Tweettronics Tweettronics, San Francisco, USA http://www.tweettronics.com

TwitterEcho [13] Faculty of Engineering, University of Porto, Portugal http://robinson.fe.up.pt/~projects/twitter_crawler/wiki/TwitterEcho

uberVU uberVU, Cambridge, Massachusetts, USA http://www.ubervu.com/

Viralheat Viralheat, San Mateo, California, USA https://www.viralheat.com/

Whostalkin Joe Hall, Columbia, South Carolina, USA http://www.whostalkin.com/

We finally came up with figure 1.1, which assesses the tools according to the subcategories of the functionality criterion. This category is also crucial in our current project, as it focuses on aggregation, evolution, information extraction and semantic enrichment, provenance, and quality analysis. The darker the color in a table cell, the higher the value of the feature within the tool.

Figure 1.1: Social Media Monitoring Tools Overview and Functionality

Concluding, the assessment resulted in a good evaluation of the tool CrisisTracker [59], due to its functionalities in the fields of aggregation and quality analysis. As our system partly rests on this tool, we summarize its main features in the following section.

1.2.1 CrisisTracker

This section was composed together with Gerald Madlsperger [47], since it affects both of our theses.

CrisisTracker [59] is a Social Media Monitoring tool specialized in discovering crises, realized by the Madeira Interactive Technologies Institute, University of Madeira, Portugal. It tracks and analyzes Tweets for certain keywords to identify catastrophic events. The basic idea of CrisisTracker is to cluster Tweets with a defined word similarity. The similarity is based on the bag-of-words principle, which treats each word within a text equally, ignoring position and word order. The actual metric is then calculated using a cosine similarity combined with locality-sensitive hashing, which considers the location metadata of a Tweet. The resulting clusters are called stories and allow the user to understand the density and spread of certain information. As it is an open source project by the University of Madeira, it has great potential for extension with interesting features.

An important advantage of CrisisTracker is its ability to cluster the collected Tweet information and to group it into stories. To this end it calculates a similarity value for each pair of Tweets. Furthermore, it is possible to define classes which the system automatically fills with suitable Tweets.

After aggregation, the tool ranks the Tweets by simply evaluating the size of a cluster. Too large or single-item clusters receive a bad ranking, since such elements are often considered spam. Additionally, CrisisTracker uses user feedback for further quality evaluation. CrisisTracker also makes use of lexical analysis to avoid spam information: it checks the number of popular words, e.g. cool, omg, gosh, amazing, etc., combined with a shortened URL within the Tweet, and derives a weighting of how likely it is that the Tweet is spam.

The tool needs to be downloaded and run locally. The results are presented in a graphical user interface.

1.3 Aims and Objectives

The aims and objectives of this thesis result from the general motivation of the crowdSA project, but also from the given dataset and its scenarios. Since our example data, which concerns the flooding in Austria and Germany in 2013, reflects Tweets of various situations, e.g. immediate danger, validated information spreading, help requests, etc., and backgrounds, e.g. official institutions, emergency services, private affected people, etc., it is necessary to represent these motives in the resulting clusters. The objectives of the users might vary, and therefore configurability is an important requirement. As mentioned in the previous section, existing systems do not provide the desired functionality concerning the aggregation of geographically and semantically close Tweets. For that reason we aim to achieve better results, in the context of aggregation, than the CrisisTracker system.

Having these factors in mind, we came up with the following two main goals for this thesis:

1.3.1 Data Analysis and Definition of a Feature Hierarchy

In order to successfully process the data, it is necessary to analyze and understand its content, and therefore to extract and illustrate all facets of the dataset. As features we define those characteristics which represent the content-based, syntactic, and author-based information of a Tweet.

These objectives are defined concerning the data analysis:

• Feature Hierarchy should make aggregation configurable

• Features should cover different aspects of the dataset

• A focus is set on semantic and geolocation Features

1.3.2 Cluster Retrieval based on Semantic Enriched and Geolocated Data

After identifying the features of the data, similar Tweets need to be grouped. Therefore it is necessary to define a configurable threshold for identifying which Tweets belong together.

The objectives of this step include the following:

• Clustering operates with similarity matrices (calculated on feature similarities)

• Configurability and weighting of matrices should have an impact on the clustering results

Chapter 2

Dataset

In this chapter the data used for our flooding scenario is described. The chapter was composed together with Gerald Madlsperger [47], since it affects both of our theses.

2.1 Data

The data used in our project is real-life Twitter data, collected during the flooding in Austria and Germany in June 2013. The tweets were extracted with the tool CrisisTracker, which was developed at the University of Madeira [59]. For the extraction, the following keywords were used: Hochwasser, Flut, Donau, Linz, Passau, Wasserstand, Wasserpegel, Fluss, Überschwemmung, Überflutung, Sandsäcke, Unwetter, Regen, Starkregen. As the keyword list contains only German terms, we retrieved primarily tweets from Austria and Germany, but tweets from the Czech Republic, the Netherlands, etc. were found as well. Overall we collected 46,579 tweets within the time period of 03.06.2013 - 11.06.2013. CrisisTracker's main functionality is to aggregate tweets and form informative stories. The implementation works in two stages: first, a clustering algorithm forms TweetClusters. Over time those clusters could change, but instead of changing the clusters, CrisisTracker introduces Stories, which are the consolidation of belonging clusters. Those stories evolve over time. Our dataset showed the following statistics:

Tweets: 46,579
TweetClusters: 15,620
Stories: 11,814

At this point it should also be mentioned that 1,196 of these aggregated Stories consist of only one single tweet, which means that the actual number of informative stories is less than 11,814. The single-tweet stories emerge from the fact that CrisisTracker's clustering approach starts with every single tweet as its own cluster and merges clusters in subsequent steps.

During the review of the dataset, we identified several topics which comprise many distinct messages from different users, which marks them as very informative topics. Furthermore, those topics have several instantiations at different places and points in time. Some of them were also detected by CrisisTracker and are part of the 11,814 stories. They include the following examples:

• information about blockades and impassable streets (confirmed and unconfirmed information)

• breaking of dams

• problems with drinking water supply

• help needed for building up sandbag barriers

• generally many pictures were posted

Furthermore, we recognized that many public news or broadcasting stations were involved in the discussions about the above topics. Their messages often contained the word "Liveticker".

2.2 Scenarios

For developing and optimizing the results of the practical part of this thesis, we decided to extract five scenarios manually from the whole dataset, where a scenario corresponds to the instantiation of one of the above topics at a certain point in time and space. As mentioned above, we chose these scenarios based on their amount of distinct information, their significance for the user, and their evolutionary structure. As the scenarios were extracted manually, those attributes were not evaluated with specific metrics.

For evaluating information extraction algorithms it is customary to separate the dataset into a development set and an evaluation set. This is also called held-out evaluation and was described by Kanoulas et al. [36]. Therefore we also extracted five similar evaluation scenarios, which we excluded from the development data. As we are dealing with real-life data, equality in structure, size, and evolutionary behaviour compared to the development data could not be ensured. For us it was more important to have detailed datasets for the development phase than for the evaluation phase, as we hoped for better optimized algorithms with this strategy. Therefore we decided to take the more complete scenarios as the development set and the others as the evaluation set.

In the following subsections we present the regions of interest of the corresponding scenarios, which consist of distinct tweets with informative content. In fact, those scenarios contain duplicates and noisy data, such as emotional tweets about the situation, which are not useful for us at all. As a manual clustering of the whole dataset is not viable, we extracted the already mentioned regions of interest for the given topics. Furthermore, we provide the number of all tweets contained in the same time-span as the regions of interest. This gives an impression of the granularity the desired algorithms have to deal with.

2.2.1 Scenario: "Bridge Blockade"

This data contains Tweets about bridges which had to be closed due to the risk of flooding, including the consequences for the traffic situation.

2.2.1.1 Development Set

The development set contains data very similar to the evaluation set, but takes place in Dessau, Germany, and concerns the bridge "Friedensbrücke".

The scenario was extracted by an SQL query which defined the region of interest. Afterwards we checked how many other tweets occurred within the same time-span as our region of interest.

SQL-Query:
select * from socialmediacontent where text ilike '%Friedensbr%' order by entrytime;

Size of the region of interest: ∼ 21
Number of tweets in time-slice: 25200

2013-06-03 15:25:52  Katastrophenschutzstab dementiert Sperrung der Friedensbrücke für heute, 20 Uhr. Ob und wann könne nicht gesagt werden. #Hochwasser
2013-06-03 16:33:09  Verschont vom Hochwasser #FzC Besucht uns die kommenden Tag an der #Friedensbrücke http://t.co/muxkpX6Nop http://t.co/M4l4PL0aBb
2013-06-03 17:21:59  RT @mz dessau: Nun doch: Kleutsch und Sollnitz werden evakuiert. Friedensbrücke ist ab etwa 23 Uhr dicht. #Hochwasser
2013-06-04 03:27:27  RT @MDRaktuell: Kritische #Hochwasser-Lage in #Dessau: Friedensbrücke über die Mulde gesperrt
2013-06-04 05:57:13  Friedensbrücke in #Dessau jetzt auch für Fußgänger gesperrt. Von Osten geht es nur über die Autobahnabfahrt Süd in die Stadt. #Hochwasse
2013-06-04 09:03:45  Im MOment kommen Radfahrer und Fußgänger wieder über die Friedensbrücke in #Dessau. #Hochwasser http://t.co/JT1ndJxjjk

Table 2.1: Samples of the Region of Interest of the Development Set "Bridge Blockade"

2.2.1.2 Evaluation Set

The corresponding evaluation set is located in Dresden, Germany, where the bridge "Blaues Wunder" had to be closed because of the flooding. The tweets start with speculations about the bridge being closed within a few hours. After a few hours the bridge is closed, and a few days later it is reopened.

SQL-Query:
select * from socialmediacontent where text ilike '%Blau%Wund%' order by entrytime;

Size of the region of interest: ∼ 36
Number of tweets in time-slice: 36018

2013-06-03 12:38:29  RT @RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!
2013-06-03 13:12:00  #hochwasser #dresden blaues wunder ist IM MOMENT noch offen. Wird aber sicher nicht mehr lange dauern
2013-06-03 16:00:49  #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.
2013-06-03 19:48:04  #Hochwasser #Dresden: Schließung der Brücke Blaues Wunder für Autoverkehr wird vorbereitet, alle Infos: http://t.co/Vylgb4DdwU (kf)
2013-06-03 21:46:16  RT @MDRINFO: Erste Elbebrücke in Dresden gesperrt: "Blaues Wunder" nicht mehr für Autos befahrbar.
2013-06-04 07:07:00  Schillerstrasse leer wie nie. Blaues Wunder scheint wirklich gesperrt zu sein #hochwasser http://t.co/m48JrSoYPe
2013-06-09 08:22:21  #Hochwasser #Dresden AKTUELL: Blaues Wunder soll am Abend/in der Nacht bei Pegel unter 7,10m und nach kurzer Kontrolle freigegeben werden!

Table 2.2: Samples of the Region of Interest of the Evaluation Set "Bridge Blockade"

2.2.2 Scenario: "Sandbags"

This scenario contains data about collecting, filling, and placing sandbags to prevent damage caused by the flooding.

2.2.2.1 Development Set

The development set takes place in Halle, next to the river Saale. The first few tweets call for volunteers to fill sandbags. Afterwards those sandbags are delivered to the Gimritzer Damm dike. The last tweets report the successful protection.

SQL-Query:
select * from socialmediacontent where text ilike '%Halle%sand%' or text ilike '%sand%Halle%' or text ilike '%Halle%säcke%' or text ilike '%säcke%Halle%' or text ilike '%sand%Gimritzer%' or text ilike '%Gimritzer%sand%' order by entrytime;

Size of the region of interest: ∼ 50
Number of tweets in time-slice: 33325

2013-06-03 14:12:48  #Hochwasser: In #Halle läuft das Befüllen von Sandsäcken am Waldkater auf Hochtouren
2013-06-03 15:09:41  Weiss jemand bereits, wo man sich am morgigen Tag in der @stadt halle einfinden kann, um Sandsäcke zu befüllen o.ä.? #Hochwasser #LSA
2013-06-03 19:20:36  #Hochwasser Stadt #Halle ruft Freiwillige zur Sandabfüllstation am Hubertusplatz. Die Sandsäcke werden dringen am Gimritzer Damm gebraucht.
2013-06-04 01:11:09  #Saale-#Hochwasser: #Halle erwartet am Morgen 100.000 Sandsäcke aus Hannover. Jeder gefüllte Sandsack geht laut Stadt in den Gimritzer Damm
2013-06-04 03:00:51  100.000 Sandsäcke werden nach Halle gebracht. Die Saale hat 7,50 Meter überschritten http://t.co/3gJjirUXs7 #Hochwasser
2013-06-04 06:46:12  RT @ZDFmagdeburg: Hochwasser an der Saale in Halle: 100.000 Sandsäcke aus Hannover schützen den Gimmritzer Damm.

Table 2.3: Samples of the Region of Interest of the Development Set "Sandbags"

2.2.2.2 Evaluation Set

The evaluation set is again similar to the development set, but does not provide the detailed evolutionary content. During the first days, information about the need for sandbags is spread, but without saying where the sandbags are used. After five days there are again some tweets about the successful protection.

SQL-Query:
select * from socialmediacontent where text ilike '%Dresden%sand%' or text ilike '%sand%Dresden%' or text ilike '%Dresden%säcke%' or text ilike '%säcke%Dresden%' order by entrytime;

Size of the region of interest: ∼ 161
Number of tweets in time-slice: 38663

2013-06-03 13:22:11  RT @pandanananas: Hi. Irgendwer im Raum #Dresden hier, der noch'n paar Sandsäcke zur Verfügung stellen kann? Hier wird's knapp. #rt #follow...
2013-06-03 13:27:44  Die Jüdische Gemeinde sucht dringend Helfer/innen zum Sandsack füllen und stapeln! #Hochwasser #Dresden
2013-06-03 14:37:07  Das #Stadtteilhaus in der #Neustadt von #Dresden braucht ab 17:30 Uhr Hilfe um Sandsäcke zu stapeln! Kommt zahlreich, DANKE
2013-06-03 15:06:36  #Hochwasser #Dresden Rund 200 Freiwillige Helfer befüllen seit 14 Uhr Sandsäcke an der Straßenmeisteri Hansastraße. 8000 Säcke befüllt
2013-06-03 16:36:22  Hansastraße hat keine Sandsäcke mehr – Helfer werden erst in ein paar Stunden wieder gebraucht! #Hochwasser #Dresden
2013-06-08 12:55:35  RT @sandsteinpost: Hochwasser in Dresden geht zurück - Deiche und Schutzwälle weiter stabil. Zur Seite: http://t.co/3TJlf0O0Nu

Table 2.4: Samples of the Region of Interest of the Evaluation Set "Sandbags"

2.2.3 Scenario: "Drinking Water Supply"

This section concerns the impact of the flooding on the drinking water supply. As this is a rather rare scenario in central Europe, we were not able to extract a good evaluation set. This uniqueness makes the scenario special; therefore we decided to use it even without a good evaluation set, hoping to gather further insights for this project even if we cannot evaluate this special case.

2.2.3.1 Development Set

The development set contains rather unspecific data. The scenario concerns Munich, Germany, where the drinking water was chlorinated.

SQL-Query:
select * from socialmediacontent where text ilike '%München%trinkw%' or text ilike '%trinkw%München%' order by entrytime;

Size of the region of interest: ∼ 16

Number of tweets in time-slice: 26215

2013-06-03 13:57:06  #TelMi #telmi Nach dem anhaltenden Starkregen: Vorsorgemaßnahme für das Münchner Trinkwasser: München,... http://t.co/fvJdUZt4mr
2013-06-03 15:17:49  Chlor im Tee wegen #Hochwasser: RT @StadtMuenchen Hinweis der Stadtwerke München: Leichte Chlorung des Trinkwassers http://t.co/aqcpnityqC
2013-06-03 16:17:32  DTN Germany: Hochwasser im Live-Ticker - Jahrtausend-Flut: Stadtwerke in München chloren das Trinkwasser: Evak... http://t.co/zOyuQEmCxC
2013-06-04 07:52:38  Das #Hochwasser hat jetzt auch #München erreicht: das #Trinkwasser wird wohl unerträglich - #kalk!!!

Table 2.5: Samples of the Region of Interest of the Development Set "Drinking Water Supply"

2.2.3.2 Evaluation Set

The evaluation set provides information about problems with the drinking water supply in Passau, Germany. The data starts with speculations about shutting down the drinking water supply. Afterwards people start to panic-buy bottled water. In the end the water supply is actually shut down.

SQL-Query:
select * from socialmediacontent where text ilike '%Passau%trinkw%' or text ilike '%trinkw%Passau%' order by entrytime;

Size of the region of interest: ∼ 235
Number of tweets in time-slice: 31526

2013-06-03 12:39:11  #Hochwasser: Passauer Stadtwerke stellen #Trinkwasserversorgung ein http://t.co/HIufRF71QG
2013-06-03 12:45:50  Hochwasser in Deutschland: Trinkwasser in Passau soll abgestellt werden.. #tech http://t.co/fngidpWfOi
2013-06-03 12:48:40  Die #Hochwasser-Lage ist vielerorts weiter kritisch: In #Passau wird die Trinkwasserversorgung eingestellt. http://t.co/OFORkVXObG
2013-06-03 13:17:00  RT @msnde: Hamsterkäufe in #Passau: Trinkwasser wird abgestellt http://t.co/qZVES3Kw4h #hochwasser #bayern
2013-06-03 13:52:36  Trinkwasser-Versorgung in Passau eingestellt.
2013-06-03 14:29:38  Die Juni-Flut: Kein Trinkwasser mehr in Passau - Hamburger Abendblatt http://t.co/ZivTxT5F9J
2013-06-03 16:33:31  Passau: Kein Strom, kein Trinkwasser, Altstadt bis zum ersten Stock überflutet.

Table 2.6: Samples of the Region of Interest of the Evaluation Set "Drinking Water Supply"

2.2.4 Scenario: "Riverdams"

The river dams scenario is about situations where dams have to open their gates because of too much pressure induced by the masses of water.

2.2.4.1 Development Set

The development set takes place in Thüringen, Germany, and concerns the Bleiloch-Stausee dam. First there are speculations about opening the dam completely; in the end it was not opened, but a lot of water had to be drained, which affected other critical places downstream.

SQL-Query:
select * from socialmediacontent where text ilike '%bleiloch%' order by entrytime;

Size of the region of interest: ∼ 11

Number of tweets in time-slice: 28766

2013-06-03 13:31:20  #de99x #995ap Hochwasser in Ostthüringen : +++ Mehr Wasser aus Bleiloch-Stausee +++ MDR http://t.co/ZpvU0y3Pm0
2013-06-03 14:09:46  #Hochwasser #Thüringen Landratsamt SOK: Situation an #Bleiloch-Talsperre bleibt aufgrund der Wetterlage & Sperranlagen beherrschbar
2013-06-03 15:05:30  #de99x #994we Hochwasser in Thüringen: +++ Bleiloch-Öffnung abgesagt +++ Entspannung http://t.co/VncCGpakSd
2013-06-04 07:06:27  RT @mdr th: Betreiber der #Bleilochtalsperre muss mehr Wasser ablassen, weil Stausee zu voll ist- Dadurch Gefahr für #Ziegenrück. #Hochwasser
2013-06-04 07:31:50  RT @MDRINFO: #Hochwasser im #Saale Orla Kreis:Aus der #TalsperreBleiloch werden große Wassermengen abgelassen.

Table 2.7: Samples of the Region of Interest of the Development Set "Riverdams"

2.2.4.2 Evaluation Set

The evaluation set is very similar to the development set, but takes place at the Spremberg dam.

SQL-Query:
select * from socialmediacontent where text ilike '%spremberg%' or text ilike '%spree%talsperre%' or text ilike '%talsperre%spree%' or text ilike '%spree%schlamm%' or text ilike '%schlamm%spree%' order by entrytime;

Size of the region of interest: ∼ 24
Number of tweets in time-slice: 37277

2013-06-03 13:15:25  RT @lr online: #Hochwasser: Spree in #Spremberg steigt über die Ufer: http://t.co/PCBVSaNJ1G
2013-06-03 19:02:48  Hochwasser: Talsperre Spremberg muss geöffnet werden und braune Schlammflut rollt auf Spreewald zu http://t.co/O8WnWIHKzI #Tagebaufolgen
2013-06-04 04:16:03  #Hochwasser-Alarmstufe 4 für Spremberg. Experten planen mehr Wasser abzulassen. Die Folge: Brauner Eisenschlamm fließt in den #Spreewald.
2013-06-04 07:50:03  Durch das Hochwasser droht die Talsperre Bautzen derzeit überzulaufen. Was das für die Talsperre in Spremberg... http://t.co/x5oW7X4XSF

Table 2.8: Samples of the Region of Interest of the Evaluation Set "Riverdams"

2.2.5 Scenario: "Roadblock"

As the name suggests, this scenario is about roads which had to be closed because of water on the roads.

2.2.5.1 Development Set

The development set deals with the blockade of a smaller but important road called Holzhofgasse. After the blockade we found tweets about detours. A few days later the street was reopened.

SQL-Query:
select * from socialmediacontent where text ilike '%Holzhofgasse%' order by entrytime;

Size of the region of interest: ∼ 25
Number of tweets in time-slice: 33108

2013-06-03 13:29:23  RT @RadioDresden: #Hochwasser #Dresden PLS RT! Am Nachmittag wird mit Sperrung von Holzhofgasse und Blauem Wunder gerechnet!
2013-06-03 14:44:33  #Hochwasser #Dresden Ab 17 Uhr ist die Holzhofgasse gesperrt! Umleitung der Bautzner Straße via Königsbrücker und Stauffenbergallee
2013-06-03 14:57:39  17 Uhr: Die Holzhofgasse am Diako wird jetzt endgültig gesperrt. Umleitung über Königsbrücker und Stauffenbergallee. #Hochwasser #Dresden
2013-06-03 16:37:27  Blaues Wunder, Holzhofgasse, Laubegast: Die @DVBAG bereiten sich auf Umleitungen vor. http://t.co/USFtMF0P2f #hochwasser #dresden http://t.co/Q7PUfoNusV
2013-06-11 11:47:27  #Hochwasser #Dresden Aktuelle Info zur Holzhofgasse. Ab Donnerstagfrüh (3.30 Uhr) soll die Strecke wieder frei sein!

Table 2.9: Samples of the Region of Interest of the Development Set "Roadblock"

2.2.5.2 Evaluation Set

The second set is about the highway A9 in Germany. The first part of the information concerns the building of dikes to prevent the water from flooding the highway. In the end the road had to be closed partly, despite reinforcement of the dikes.

SQL-Query:
select * from socialmediacontent where text ilike '%A9%Dessau%' or text ilike '%Dessau%A9%' or text ilike '%Dessau%Autobahn%' or text ilike '%Autobahn%Dessau%' or text ilike '%Dessau%sperr%' or text ilike '%sperr%Dessau%' order by entrytime;

Size of the region of interest: ∼ 80
Number of tweets in time-slice: 39618

2013-06-03 13:51:34  #Hochwasser Autobahn: Die A9 bei Dessau bekommt einen Hilfsdeich http://t.co/bQIDVdiOY5
2013-06-03 13:58:24  Ab 18 Uhr gibt es Stau auf der A9 bei #Dessau: Autobahn bekommt einen Hilfsdeich. #Hochwasser http://t.co/Dz2Rkez1kp
2013-06-03 17:28:11  RT @MDR SAN: #Hochwasser: Bei #Dessau wird ein Deich verstärkt, um eine Überflutung der Autobahn 9 zu verhindern.
2013-06-04 04:31:45  #Hochwasser-Schutzmaßnahmen auf der #A9 zwischen #Dessau-Süd und Dessau-Ost. Autobahn deshalb teilweise gesperrt. http://t.co/Q7PUfoNusV

Table 2.10: Samples of the Region of Interest of the Evaluation Set "Roadblock"

Chapter 3

Concept

In this chapter we describe the conceptual design of the thesis' practical work. We introduce the main components and suggest methods for the implementation of a prototype. The chapter was partly composed together with Gerald Madlsperger [47], since it affects both of our theses.

3.1 Features

In order to later calculate similarity measures between Tweets, we first define features. Therefore we came up with the feature hierarchy shown in figure 3.1. The filled boxes in the graph are the ones suggested for implementation in the prototype. Reasons for the selection of the features are given in the following sections.

We mainly distinguish between Content Features, which concern the text of a Tweet, Location Based Features, which provide information about the geographical environment of the Tweet, and User Based Features, which deal with the author of the message.


Figure 3.1: Similarity Feature Hierarchy

3.1.1 Content Features

This type of feature is defined by the message of a Tweet. The message itself is divided into hashtags, links, and the remaining plain text, each of which has a different purpose within a message. Hashtags represent keywords, links point to additional information sources, and the plain text expresses an opinion or general statement of the author. We regard hashtags and links on the one side, and semantic and syntactic information of the whole text on the other side, as features, as described in the following sections.

3.1.1.1 Syntactic

Syntactic features characterize the structure of a sentence and can therefore be measured by linguistic analysis metrics. We describe one of the most relevant methods in the following.

N-grams: N-grams are commonly used when analyzing texts, as they give a good and simple possibility to compare words and phrases and thus calculate their similarity. N-grams can be formed on the basis of characters or words, bringing different opportunities. Character n-grams make it possible to find related words ignoring their grammatical position. Word n-grams, in contrast, bring the potential of finding phrases and negations within a text. For the German language it is recommended to use n-grams with a length of 4, since this has proven to give the best results [34].

Here is an example of how n-grams are calculated:

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.

The two Tweets above share, among others, the following character 4-grams: #Hoc, Hoch, wass, sser, Dres, #Dre, sden, Blau, aues, Wund, nder, Stad, tadt, gesp, sper, errt, ...

Similarity Method: Kondrak [38] suggests comparing all n-grams of two tweets with each other in order to find the longest common subsequence. His algorithm recursively matches all possible n-grams of all words in a text against each other and takes the maximum of the sum of subsequent equal n-grams in both strings. A further refinement is suggested in order to achieve a more accurate similarity for a pair of n-grams where some, but not all, letters match; here the number of matching letters is set in ratio to the length n.

In the above given example the longest common subsequence would be the following 11 4-grams: #Hoc, chwa, sser, #Dre, sden, Blau, esWu, nder, wird

This value is then set in ratio to the total number of 4-grams of the longer of the two Tweets to obtain a normalized value.
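As a minimal sketch under simplifying assumptions, the snippet below computes the share of common character 4-grams, normalized by the larger of the two 4-gram sets, instead of Kondrak's recursive longest-common-subsequence matching; the class and method names are illustrative and not taken from the project code.

```java
import java.util.HashSet;
import java.util.Set;

public class NGramSimilarity {

    // Extracts all character n-grams of the given length from a text,
    // ignoring whitespace as in the 4-gram example above.
    static Set<String> charNGrams(String text, int n) {
        String s = text.replaceAll("\\s+", "");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Ratio of shared 4-grams to the 4-gram count of the longer Tweet.
    static double similarity(String tweet1, String tweet2) {
        Set<String> g1 = charNGrams(tweet1, 4);
        Set<String> g2 = charNGrams(tweet2, 4);
        Set<String> shared = new HashSet<>(g1);
        shared.retainAll(g2);
        int maxSize = Math.max(g1.size(), g2.size());
        return maxSize == 0 ? 0.0 : (double) shared.size() / maxSize;
    }

    public static void main(String[] args) {
        String t1 = "#Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt gesperrt!";
        String t2 = "#Hochwasser #Dresden Blaues Wunder wird laut Stadt gesperrt!";
        System.out.println(similarity(t1, t2)); // similarity value in [0,1]
    }
}
```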

We will make use of n-grams in combination with part-of-speech tag similarity, as mentioned in the next section. More details on the implementation can be found in chapter 4.

3.1.1.2 Semantic

Named Entities Named entities are used to analyze the semantics of a message by classifying words into pre-defined categories, such as places, things, persons, etc. This also makes it possible to calculate the similarity of two messages by comparing their named entities. Commonly, an additional knowledge base is used to calculate the relationship between entities.

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.

In this example, "Dresden" would be classified by annotation tools, which are discussed later on, as the entity type "Location", and "Stadt" in Tweet 2 would also be labeled as "Location". To understand that these words are still semantically not the same, a knowledge base, e.g. DBpedia, can be used. This also allows finding structures in the entity relations. In our example, Dresden would be a subcategory of Stadt and therefore results in a moderate similarity value. If the entities are the same or synonyms, the similarity value would be 1.0.

Similarity Method: Bontcheva et al. [18] introduced an algorithm which finds named entities using the tool ANNIE and then matches the entities to DBpedia entries. The similarity is then calculated in different ways (string, structural, contextual).

This feature is not implemented in this work, but a named entity component is described in the thesis of Gerald Madlsperger [47].

Part-of-Speech-Tag Part-of-speech tags bring additional information which can make the similarity calculation of texts more accurate than when using syntactic features alone. The Stanford PoS tagger [64] classifies each input word into a lexical class, e.g. nouns, verbs, adjectives, etc. Concretely, this means that words that have the same type and also a close syntactic similarity are more likely to have the same semantics. [19]

Here is an example of how PoS tags are compared:

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Hochwasser #Dresden Blaues Wunder wird laut Stadt bis spätestens Morgenfrüh gesperrt! Derzeit noch frei.

In this example, for each word in Tweet 1 the maximal similarity to all words in Tweet 2 with the same PoS tag is taken.

w = gesperrt, PoS-tag = main verb
main verbs of Tweet 2 = {wird, gesperrt}

This means that for the word "gesperrt" in Tweet 1, which is a main verb, all words of the same type are taken from Tweet 2. Since "gesperrt" is also in the set of main verbs of Tweet 2, the maximal similarity in this example is 1.0.

Similarity Method: Xie et al. [68] propose to calculate a maximal similarity only between words which have the same PoS tag, using the following formula, where $w$ denotes a word in Tweet $T$ and $pos(w)$ returns the PoS tag of word $w$:

$$\mathit{maxSim}(w, T_i) = \max\{\, \mathit{sim}(w, w_i) \mid w_i \in T_i,\ pos(w_i) = pos(w) \,\} \tag{3.1}$$

The similarity of the text elements which belong to the same PoS tag is calculated with the help of n-grams, as explained in section 4.2.1.2, and cosine similarity. Zhang et al. [72] also use cosine similarity for the PoS-tag similarity calculation, by forming a vector of the occurrence counts of the tags and applying the formula below, where $T$ is the vector of the tag occurrences of Tweet 1 and $w_k$ is one particular tag occurrence in $T$:

$$\cos(T, T_i) = \frac{\sum_{k=1}^{m} w_k(T)\, w_k(T_i)}{\lVert T \rVert \cdot \lVert T_i \rVert} \tag{3.2}$$
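A minimal sketch of equation (3.2), assuming the PoS tags have already been produced by a tagger such as the Stanford tagger; the class, the method names, and the example tag sets are illustrative only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PosTagCosine {

    // Builds a tag -> occurrence-count vector from a tagged tweet.
    static Map<String, Integer> tagCounts(List<String> posTags) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : posTags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two tag-occurrence vectors, cf. equation (3.2).
    static double cosine(Map<String, Integer> t1, Map<String, Integer> t2) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : t1.entrySet()) {
            dot += e.getValue() * t2.getOrDefault(e.getKey(), 0);
        }
        double norm1 = 0.0, norm2 = 0.0;
        for (int v : t1.values()) norm1 += v * v;
        for (int v : t2.values()) norm2 += v * v;
        if (norm1 == 0 || norm2 == 0) return 0.0;
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        // Hypothetical tag sequences for the two example tweets above.
        Map<String, Integer> t1 = tagCounts(List.of("NN", "NN", "ADJA", "NN", "VAFIN", "VVPP"));
        Map<String, Integer> t2 = tagCounts(List.of("NN", "NN", "ADJA", "NN", "VAFIN", "VVPP", "ADV"));
        System.out.println(cosine(t1, t2)); // similarity value in [0,1]
    }
}
```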

More details on the implementation can be found in chapter 4.

3.1.1.3 Hashtag

Hashtags are treated in a section of their own, since they do not follow the usual structure of text messages. They consist of single or artificially combined words, which makes analyzing their syntax and semantics a challenge. As the methodology for hashtags differs slightly from the methods described in the previous section, we discuss them in more detail here.

Count The number of hashtags delivers information about the quality of a Tweet: the more hashtags are used, the more reflection on the content was required. Since this is not the main focus of this project, this similarity feature will not be implemented.

Syntactic Content It makes sense to use a syntactical comparison of hashtags, since they mostly contain topic words. Thus, it is often not necessary to make use of stemming, and the syntactic similarity can be calculated rather quickly.

Similarity Method: Since hashtags are rather short words or phrases, cosine similarity seems to be an appropriate method of comparing their syntactical content. This can be implemented similarly to Zangerle et al. [71], e.g. by calculating the Levenshtein distance, which counts the letter edits necessary to turn hashtag 1 into hashtag 2, as sketched below.
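A minimal sketch of the Levenshtein comparison mentioned above, assuming plain lower-cased hashtag strings; the class and method names are illustrative.

```java
public class HashtagDistance {

    // Classic dynamic-programming Levenshtein edit distance between two hashtags.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // 0 edits means identical hashtags; larger values mean less similar strings.
        System.out.println(levenshtein("hochwasser", "hochwasser")); // 0
        System.out.println(levenshtein("hochwasser", "wasser"));     // 4
    }
}
```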

Yet hashtags have a large semantic importance, and therefore it is more useful to use the semantic similarity approach mentioned below. A combination of semantic and syntactic features would deliver the optimum in similarity analysis, as suggested by Bansal et al. [10], who first segment the syntactic content of a hashtag and then apply semantic analysis. However, this would exceed the scope of this thesis, which is why we focus on the implementation of semantic hashtag analysis only.

Semantic Content Hashtags deliver short and concise information about the topic of a Tweet. Yet similar Tweets do not necessarily use the same hashtags, and therefore it is necessary to make use of a semantic hashtag analysis.

Similarity Method: Moreno introduces in his PhD thesis [65] a framework which uses the WordNet1 ontology in order to calculate a semantic similarity. An ontology-based semantic relatedness measure, such as the Wu & Palmer measure [67], can be used to obtain an actual similarity value.

1 http://wordnet.princeton.edu/
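For reference, the Wu & Palmer measure, which is not spelled out in this chapter, relates two concepts $c_1$ and $c_2$ through the depth of their least common subsumer (lcs) in the ontology's taxonomy:

$$\mathit{sim}_{WP}(c_1, c_2) = \frac{2 \cdot \mathit{depth}(\mathit{lcs}(c_1, c_2))}{\mathit{depth}(c_1) + \mathit{depth}(c_2)}$$

Synonymous concepts mapped to the same synset obtain the maximal value of 1.0, which matches the hashtag example below.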

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Blaues Wunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

Tweet 2: #Überflutung #Deutschland #Landunter auch in #Dresden!

In this example two hashtags are used which are synonyms of each other. #Überflutung and #Hochwasser are recognized as synonyms in the German version of WordNet, GermaNet2, and therefore have a similarity of 1.0.

More details on the implementation can be found in chapter 4.

3.1.1.4 Linking

This feature is divided into URLs which point to external websites and URLs which link to media files. We differentiate between those because the linking target, either text or media content, offers different kinds of information and opportunities. Multimedia data often contains a visualization of the current situation, whereas text content rather provides reports or literature information.

External URL These URLs, in contrast to internal URLs, which in the context of Twitter represent Retweets, point to content on other websites. Considering these external URLs as a feature is rather useful, since Tweets containing the same URL are likely to have related content. A problem with URLs on Twitter is that URL shorteners are used, which makes comparing the URL text difficult. Yang et al. [69] introduced an approach where this drawback is overcome by using the Longurl API3. This tool not only expands the URL, but also provides information such as the website title, meta description, and content type.

Similarity Method: N-grams or cosine similarity can be applied to the meta description and the expanded URL. As this feature does not provide any required additional information gain for our project, it will not be implemented.

3 http://longurl.org/

Multimedia URL Since only the URL is available for comparing images and videos, most of the similarity calculation methods mentioned in the External URL section also apply here. Yet it is possible to identify whether the URL includes multimedia data or refers to a webpage, such as the frequently used multimedia upload tools, e.g. TwitPic4. Since the Tweets in our dataset contain rather few multimedia elements and this feature does not provide any necessary additional information gain for our project, it will not be implemented.

4 http://www.twitpic.com/

3.1.2 Location Based Features

Generally there are two ways of calculating a location-based similarity metric. The first method uses content similarity approaches on the location name. The second method uses the GPS coordinates stored in the metadata of a Tweet. Unfortunately, according to Rogstadius et al. [59], only 1% of Tweets have location information in their metadata; therefore we need to enrich the Tweets with location data retrieved from the content.

3.1.2.1 Content Location

Location information can also be extracted from the content of a tweet; therefore it is necessary to identify location entities in a preprocessing step. The location names can then be compared with the content similarity approaches mentioned above. A special case for this category are locations mentioned in hashtags. Since it is common to hashtag a location belonging to the content of the tweet, it might be useful to give more importance to those. More details on the implementation can be found in chapter 4.2.1.3.

3.1.2.2 Meta Data Location

The simplest way of extracting the location information of a Tweet is from the metadata. The exact GPS coordinates of the place where the tweet was posted are stored there and can be used for comparison.

Similarity Method: With the Haversine distance [66] it is possible to calculate the shortest distance between two points on the earth's surface, modeled as a sphere, where each point consists of a latitude and a longitude value. More details on the implementation can be found in chapter 4.2.1.1.
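A minimal sketch of the Haversine calculation, assuming a spherical earth with a mean radius of 6371 km; the class and method names are illustrative.

```java
public class GeoDistance {

    private static final double EARTH_RADIUS_KM = 6371.0; // mean earth radius

    // Haversine great-circle distance in kilometers between two lat/lon points (degrees).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        // Distance between Linz (48.3069, 14.2858) and Passau (48.5665, 13.4310), roughly 70 km.
        System.out.println(haversineKm(48.3069, 14.2858, 48.5665, 13.4310));
    }
}
```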

3.1.3 User based features

These features are based on the assumption that users who have similar characteristics are likely to talk about similar things on Twitter. [15]

3.1.3.1 User Location

This takes the location mentioned in a user profile into account and implies that users living in the same area have a common interest in disasters happening there; therefore the tweets of those users might be clustered together during emergency situations. The reason for splitting the user location and the Tweet location results mainly from the reflection that the location in the content varies frequently, whereas the home location in the user's profile is rather a long-term setting. Since this feature is not in the focus of this thesis' topic, it will not be implemented.

3.1.3.2 Profile Meta Data

Similar users and their interests might be detected by comparing metadata, such as the description in a user profile. For the description text, the similarity can be calculated like the normal content-based features. Since this feature is not in the focus of this thesis' topic, it will not be implemented.

3.1.3.3 Number of Tweets, Retweets and Followers

The numbers of tweets, Retweets, and followers mostly give information on the quality and trustworthiness of the user rather than on the similarity of the information found in the Tweets. Therefore these features will not be looked into in more detail and will not be implemented.

3.2 Architecture

In this section we present an overview of the overall pipeline architecture, which was composed together with Gerald Madlsperger [47]. Package diagrams for the whole pipeline and a class diagram for the abstract execution pipeline are provided. As all components described later in this chapter are based on this architecture, they have a similar style.

In order to get an overview of the whole architecture, we provide, together with Gerald Madlsperger [47], figure 3.2, which also illustrates which components are discussed in which thesis.

Figure 3.2: General Architecture

First of all we need a database which contains the dataset described in the previous chapter. The first component of the pipeline is called PreProcessor and is responsible for annotating the data with additional information. This additional information is used by the following components to compute the desired results. After the PreProcessor, the component called Aggregation is executed. This component combines similar Tweets into bigger clusters with the help of configurable similarity features. In the next step we want to identify real-world objects which are discussed within the tweets. For this purpose the Object Extraction component is introduced, which identifies objects with the help of natural language processing (NLP) tools. The last part of the pipeline is the Evolution Analysis component. It calculates relations of aggregates and objects over certain periods of time. Every component stores its results in the database. After a successful execution of the pipeline, the data is ready to be read by the Visualization component and to be presented to the end user.

Figure 3.2 already depicts that we are dealing with separate components, executed one by one. Furthermore, we identified modularity and interchangeability, but also performance aspects, as the main architectural requirements. Therefore we decided to use a pipeline architecture which was already used for the partner project CSI by the CIS Institute [12]. The pipeline architecture ensures modularity and, with the help of software design patterns like the strategy pattern, interchangeability. The performance requirements are also supported by the modularity, as it enables parallelization of the execution. The communication between the blocks within one pipeline is done by data access objects, which are simple Java beans containing the parameters needed for the execution of the follow-up pipeline block.

Part of this architecture is also the high degree of freedom in combining and nesting the components. Because of the generic interfaces only the type of the data access objects has to be known and passed from one component to the other in order to combine them. Further, pipelines can be nested; for example, the later introduced controller pipeline combines several execution pipelines which are doing the actual job. The work of the controller is to configure the execution pipelines according to the configuration files but also to parallelize them.

Figure 3.3: Overall Pipeline

Figure 3.3 shows the structure of the whole pipeline and the interaction of the single components. The controller triggers the components in the correct order, as some of them may rely on data of other ones. The details about every component and their inputs and outputs are discussed in the next section.

Figure 3.4: Abstract Pipeline Architecture

Figure 3.4 on the other hand shows the internal structure and generic architecture of every component. Each should provide one class which implements the Pipeline interface and starts the internal processes of the component. Therefore it takes configuration input in form of DataAccessObjects. In Figure 3.4 those DataAccessObjects are encoded as generic Objects. As soon as the components become concrete, they specify new DataAccessObject classes, which are java beans containing the necessary information for execution.
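To make this generic structure more tangible, the following minimal Java sketch shows how such a Pipeline interface and a data access object could look. All names (Pipeline, PreProcessorDao, Tweet) are illustrative assumptions and not the project's actual classes.

import java.util.List;

// Hypothetical generic pipeline block: takes one data access object
// as configuration/input and returns the one for the follow-up block.
public interface Pipeline<I, O> {
    O execute(I input);
}

// A simple java-bean acting as data access object between two blocks;
// Tweet stands in for a project-specific content class.
class PreProcessorDao {
    private List<Tweet> tweets;        // content to be annotated
    private String configurationPath;  // parameters from the configuration file

    public List<Tweet> getTweets() { return tweets; }
    public void setTweets(List<Tweet> tweets) { this.tweets = tweets; }
    public String getConfigurationPath() { return configurationPath; }
    public void setConfigurationPath(String path) { this.configurationPath = path; }
}

// Minimal stand-in for the project's Tweet class.
class Tweet { }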

3.3 Components

In this section every component is shortly described. Furthermore the input, output and possible frameworks or algorithms for implementation are listed. Of course the implementation should follow the previously shown architecture. This is shown within package diagrams and also class diagrams for every single module.

3.3.1 Controller Component

The controller should be a mediator between the user interface and the other components described within this chapter. It is initialized by the user interface and executes the different processing pipelines according to the configuration. Figure 3.5 shows the execution order of the components we used for our implementation from left to right. The order is defined by the output the components are creating. For example the output of the preprocessor is needed for all following components, and the output of the message aggregation component is needed for the object extraction, and so forth.

Furthermore the controller component should be able to initialize the data sources which are used for the execution. For this work we will only use the static data, which was gathered during the flooding in 2013 in Austria. However, it is flexible enough to trigger a twitter-api wrapper which fetches live data according to given parameters. Therefore the architecture of the controller is open for extensions in this direction, which is given by the pipeline architecture used for every single component in this project.

Input The controller needs configuration parameters in form of a configuration file for execution. This configuration file contains the desired values of all configurable parameters, like clustering method or clustering features, length of time-slices, algorithms used for evolution and object extraction, and so forth. The actual content of this configuration file is not part of the concept chapter, but of the implementation chapter.

Output The controller does not provide an output; it just executes the other components and handles possible errors by itself or forwards them to the user interface if user interaction is necessary.

Diagrams Class diagram of the controller architecture.

Figure 3.5: Class Diagram of Controller Architecture

3.3.2 PreProcessor

The pre-processing component is the first step in the architecture. Here the Tweets are taken from the database and processed according to the needs of all other components. There are three main sub-components, which include geo-, time- and topic-fencing, each editing the data to its own requirements. This component uses external tools which help to gain additional information on the stored data; e.g. locations are mapped and corresponding entities are found.

Input As an input the preprocessor takes the unprocessed Tweets fetched from Twitter and stored in the database.

Output This component gives four different types of outputs, all of which represent annotations for tweets. Those are POS-, Temporal-, Spatial- and NamedEntity-tags.

Diagrams Class- and package diagrams of the preprocessor component architecture.

Figure 3.6: Package diagram of preprocessing component

Figure 3.7: Class diagram of preprocessing component

3.3.2.1 Geo Fencing

The need for a geo fencing component (Geolocator) arises because, according to Rogstadius et al. [59], only 1% of the tweets contain geographical coordinates in their meta data. For our system it is crucial to have this information, and therefore this component is responsible for enriching each Tweet with location information retrieved from the content.

Algorithms We will be using the geo fencing component which we implemented in our practical work [23]. Michael Jahn [33] implemented a tool which retrieves geo-information from the Tweet content on a basic level. We used this program and adapted it to our needs.

In the solution of Michael Jahn, a so called 'oe citylist' list, which includes only major Austrian cities, was stored in a DB. The Tweet keywords are matched against these cities. This idea could be extended by storing all found locations with their corresponding longitude and latitude and thereby implementing a kind of caching strategy. This would decrease the number of API requests, which has a positive impact on the performance.
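Such a caching strategy could be sketched as follows; the geocode method and the class name are assumptions, with geocode standing in for the actual Geonames API request.

import java.util.HashMap;
import java.util.Map;

// Illustrative cache: coordinates of already resolved toponyms are
// stored so that repeated mentions cause no further API requests.
public class LocationCache {
    private final Map<String, double[]> cache = new HashMap<>();

    public double[] resolve(String toponym) {
        return cache.computeIfAbsent(toponym, this::geocode);
    }

    private double[] geocode(String toponym) {
        // Placeholder for the Geonames web service call;
        // would return {longitude, latitude} of the toponym.
        return new double[] { 0.0, 0.0 };
    }
}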

Another desired functionality of the geo locator is the possibility to store multiple locations per Tweet in our database, as there are many Tweets referring to multiple location names, e.g.

Tweet 1: The water in #Passau is rising, but it's not as bad as in #Schärding yet!

For this case, in the current solution of Jahn only the first appearing location is detected and stored; in the given example this would be Passau. For each Tweet the text is analyzed and all nouns and hashtags are matched with Geonames. The first occurring location is taken and its coordinates are stored in the database. We want to improve this behavior by analyzing all mentioned locations within a Tweet and identifying the most defining one.

Algorithms Various literature exists on ambiguous geographical annotations. Peng et al. [53] suggest a model based machine learning approach, which requires a rather large training dataset. The results are then mapped to a location ontology. For our project, a machine learning approach would exceed the scope, and therefore we decided to focus on a simplified version. For each Tweet multiple locations will be recognized and matched to the Geonames ontology [22]. The decision which of the locations' coordinates will eventually be stored is made based on the level of the concept hierarchy in the ontology. Further details will be explained in the implementation chapter 4.1.2.

Tweet 1: RT RadioDresden: #Hochwasser #Dresden Elberadweg wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

In this example Dresden and Elberadweg are considered as locations. Since Elberadweg is a more precise location and a child element of the city Dresden, the coordinates of Elberadweg would be chosen for the location annotation.

As shown in Figure 3.8 there is the possibility to locate single Tweets but also a whole cluster. For a whole Event Cluster a convex hull over all its Tweets and a center point are calculated.

Figure 3.8: Class diagram of the TweetLocator
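A minimal sketch of locating a whole cluster with the JTS Topology Suite, which is also used for the similarity calculation in chapter 4.2.1.1; the coordinates are made-up sample values.

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;

public class ClusterLocator {
    public static void main(String[] args) {
        // Made-up coordinates of the Tweets within one event cluster.
        Coordinate[] tweetLocations = {
            new Coordinate(13.43, 48.57),
            new Coordinate(13.44, 48.31),
            new Coordinate(13.49, 48.42)
        };
        GeometryFactory factory = new GeometryFactory();
        Geometry points = factory.createMultiPoint(tweetLocations);

        Geometry hull = points.convexHull();   // convex hull over all Tweets
        Point center = hull.getCentroid();     // center point of the cluster
        System.out.println("hull: " + hull + ", center: " + center);
    }
}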

3.3.2.2 Time Fencing

This component is interesting, since the meta-data timestamp does not always correspond to the actual time when the event was happening.

Since this component is not essential for our prototype and the implementation of existing tools and algorithms would exceed the scope of this work, Time Fencing is declared as Future Work. Further information can be found in chapter 7.

3.3.2.3 Topic Fencing: Part-of-Speech Tagging

A part-of-speech (PoS) tagger is used in order to analyze the structure of a text. Not only the grammatical tags are found, but it is also possible to understand the relationship between words and phrases.

Algorithms There exist well established PoS-Taggers for multiple languages, yet it is a rather new challenge to deal with microblogs such as Twitter. Still, there exist some research and development projects for Twitter PoS-Taggers, such as the GATE plugin TwitIE [14]. This project contains an extended Stanford tagger [64], which is able to recognize Twitter specific tags such as 'Retweet'. Also abbreviations and shortened writings, e.g. "2moro", "lol", "luv", etc., are common in Tweets and have been added to the vocabulary of the TwitIE tagger.

Basically TwitIE works with English texts only, yet the authors offer a possibility to adapt the project to German or Spanish, since those languages are supported in GATE itself. It is necessary though to train the tagger on German Twitter datasets.

3.3.3 Aggregation Component

In the aggregation component the goal is to cluster similar Tweets to so called Events. The list of Tweets is analyzed for several given similarity metrics, upon which the aggregation algorithms are then applied. To support the further processing in the Object Extraction and Evolution Analysis components, keywords for the found clusters are also identified and stored.

Input As an input the aggregation component takes, apart from the original collection of Tweets, also the temporal and geolocation annotations.

Output The aggregation component delivers Events as an output. An event is a cluster of Tweets that fulfill a defined similarity. Further, the event also provides a list of keywords that are descriptive for this cluster.

Diagrams Class- and package diagrams of the aggregation component architecture.

Figure 3.9: Package diagram of aggregation component

Figure 3.10: Class diagram of aggregation component

3.3.3.1 Similarity Framework

The similarity framework takes the set of Tweets and their time and geolocation annotations as an input and calculates the similarity metrics. The features which should be used for the similarity calculation are selectable through a configuration file via the visualization component.

In our system we decided to use affinity matrices for taking all similarity features into account. In the end, one value should express the similarity of two Tweets.

Algorithms The details on the algorithms used for calculating the similarity metrics can be found in chapter 3.1.

3.3.3.2 Cluster Framework

The clustering framework takes the Tweets of one timeslice and the similarity values of each pair, in the form of a matrix, as an input. On this data it is possible to apply different clustering algorithms. As an output this component returns Event Clusters.

Algorithms There exist various clustering libraries with which the desired and best fitting algorithms can be executed. We will compare them and decide upon one framework in chapter 4.2.3. Concerning specific clustering methods, Affinity Propagation [40] is suggested, as explained in chapter 4.2.2. Further, Hierarchical Clustering [30] will be used for experimenting.

3.3.3.3 Keyword Extraction

The component Keyword Extraction is useful for the final visualization of the results, as it gives a brief overview of the topic within a cluster. Further it is interesting to compare the vocabulary, which is used in different Tweets of one cluster.

Algorithms Keyword Extraction in general is often done by calculating a term frequency, such as tf*idf [9]. The weighting, and therefore the size of one term in a Tag Cloud, results from its frequency within the input text. As there exist various Java libraries for extracting keywords and creating a Tag Cloud, e.g. OpenCloud5, one of them might be useful for application.
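As an illustration of the frequency-based weighting, and not of the OpenCloud API itself, a plain term-frequency count over the concatenated Tweet texts of one cluster could look as follows; terms with higher counts would be rendered larger in the Tag Cloud.

import java.util.HashMap;
import java.util.Map;

public class KeywordExtractor {
    // Counts how often each term occurs within the cluster text;
    // tokens are split on non-letter characters to handle umlauts.
    public static Map<String, Integer> termFrequencies(String clusterText) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : clusterText.toLowerCase().split("[^\\p{L}]+")) {
            if (token.length() > 3) {          // skip very short tokens
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }
}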

3.3.4 Object Extraction Component

This part is needed for the overall system, but will not be described in this work. It involves the step after the aggregation component, where objects are identified for each Tweet cluster. See the Master's thesis of Gerald Madlsperger [47] for more details.

3.3.5 Evolution Analysis Component

This part is needed for the overall system, but will not be described in this work. The component finds temporal and evolutionary connections between aggregates and objects, which are extracted in the previous components. See the Master's thesis of Gerald Madlsperger [47] for more details.

5http://opencloud.mcavallo.org/

3.3.6 Visualization Component

This section describes the visualization part of this project, which was developed together with Gerald Madlsperger [47]. It is used to visualize the results of the other components on the one hand, but also to find useful visualization techniques for the whole project. Therefore we will take a look at some other work in this field before we introduce our own visualization concept.

3.3.6.1 Theory

A lot of work has already been done in the field of visualization for crisis, news and twitter data; most influencing for this thesis was the work of MacEachren et al. [44], Rogstadius et al. [59], Inoue et al. [32] and Ye et al. [70]. Nevertheless, not all of those concepts are useful for reaching our goals, which are the compact visualization of aggregated twitter information, which we call episodes, in the context of space and time. The user of the interface should be able to find important respectively urgent situations and track their evolution or development over time. A lot of the existing tools, which we mentioned within the introduction chapter 1, concentrate on the temporal behavior only, like Topsy [41], which shows two-dimensional graphs representing the amount of tweets over time. Others are focused on the spatial context, like CrisisTracker [59], where all data or the data for a certain timespan is visualized within one geo-map. But to show the results of all our components we have to combine both spaces. In the following we will present some of the most important concepts with their advantages and disadvantages.

Trajectory-Oriented view The simplest way to visualize spatio-temporal data for the user is static geomaps with trajectories as marks, where the map represents the spatial space and the trajectories the temporal one, as presented in the work of Andrienko [7]. However, as stated by Andrienko, this kind of trajectory-oriented view is not applicable for huge amounts of data without previous aggregation.

This view does not need user interaction to provide all the information and is therefore easy to understand; of course user interactions can be integrated to get detailed information. For our application this kind of visualization is not ideal, as the data is huge in its size on small geographical areas even after aggregation. The cause of this is hidden in the nature of our data: it is user generated and is not bound to rules except its length. Therefore the variety and size of unrelated data points is not predictable.

Of course for future work the data can be filtered with the help of a quality detection component, but this is not part of this work.

Small Multiples: Another way to visualize both the spatial and the temporal aspect is to show static maps next to each other, where every map corresponds to one point in time, or to one time-slice in our case. In the work of Ye et al. [70] this kind of visualization was reduced to two parallel maps only and is called bi-temporal view. In the field of information visualization this provides the possibility to explore two situations at different times simultaneously and compare them to each other. Therefore the operator of the visualization is able to detect changes and anomalies fast. The downside is that the information which should be visualized is detailed, and as soon as the amount of parallel maps rises, the amount of information will do so too, and therefore the possibility to detect anomalies in it will decrease [8].

Nevertheless we decided to use this kind of visualization as the basic view for exploring the processed data. In our case we decided for three synchronized maps, where the center one corresponds to the current time-slice. The others correspond to the time-slices before and after. We added some user interaction so that the operator is able to travel through time by clicking the previous or following map, making it the current time-slice.

Animation: A very popular way of presenting data over time are animations. They are very natural for the operator, as the time axis is also presented in form of time. According to Archambault et al. [8], animations are very good for clustering similar moving objects independent of their distance. But they are lacking in the detection of changes over time, because a direct comparison, as small multiples provide it, is not possible.

For later implementations of our prototype, animations could be used for showing the movements over the whole data set in time lapse. For showing the results of our components this kind of visualization is not useful, as the comparison possibilities are not supported.

Timemap: Another useful view of spatio-temporal data is called timemap, which combines and synchronizes the concepts of geo-maps, known from google-maps [25] for example, and SIMILE time-lines [50]. Hsu et al. [29] and Inoue et al. [32] used those visualizations in their systems to provide an overview of relations over historical data.

The advantage of this kind of view is the combination of an overview with the possibility to dig into details. Therefore it can be used to compare objects between certain time-slices but also to detect similarities in movements of different objects over longer time periods. The disadvantage is that the view gets overloaded very fast when big amounts of data have to be handled.

Therefore we used this kind of view to follow the movements of a certain object over time and not to compare the movement of several objects.

3.3.6.2 Prototype of the User Interface

Site Map The visualization component consists of four main pages. The home screen, which will be described in one of the following paragraphs, consists of two page variations. Initially the Historical Data Home screen is shown, but it should be possible to switch to the Live Data Home screen in future implementations. The first screen is used to present recorded data, which is already analyzed and stored in the database. The second one enables the user to start the whole processing pipeline using live data. The processing of live data is not part of this work; because of that, only the Historical Data Home screen is implemented.

Nevertheless, in both cases a button links to a Result page called Timeslice Details. This screen presents the results of the aggregation component, which we already described as events. If this page is accessed from the Live Data Home screen, a pop-up will appear showing the streaming status. After the streaming and calculation for the first time slices is finished, the Result page is shown. The events visualized within this first result screen can be clicked and redirect to a detailed result page for the selected event. This page is called Event Details and shows the outcome of the object extraction component, known as episodes. The basic structure of this page is similar to the Timeslice Details screen and is further described in one of the following paragraphs. To enable the user to explore the whole evolution of an episode found in this view, links to another screen called Episode Timeline are provided for every episode.

Figure 3.11: Site Map of the Prototype Visualization

Home - Live Data (Figure 7.2) The home screen offers two possibilities to deal with the data. If the option "Live Data" is chosen, by a toggle button changing between live and historical data, the Tweets are streamed from Twitter starting at the current point in time. Therefore it is not possible to fetch any past data. You can follow the screen description with Figure 7.2 for better understanding.

For the fetching of the data it is possible to set a time slice size. As soon as a time slice has ended, it is visualized in a Result Page. Further, it is possible to select whether only German tweets or also all other languages should be shown in the Timeslice Details. Generally, for the Episode Extraction only German tweets are taken into consideration.

Filter options are available to narrow the result set. On the one side it is possible to enter keywords which should be used by the Twitter Streamer. Further, a geographical polygon should be taken into account in the preprocessing steps of the result data. This means that only Tweets which are located within this area are considered for the Event Clustering and Episode Extraction.

Figure 3.12: Home Page of Live Data Usage

Home - Historical Data (Figure 3.13) If historical data is chosen, it is possible to select an already existing data set. In this implementation of the project we will focus on this option for analyzing the data. Generally this means that the Event Clustering and Episode Extraction have already been done for all time frames in the data set.

Figure 3.13: Home Page of Historical Data Usage

Timeslice Detail (Figure 3.14) On this page the result of the Event Clustering and Event Evolution is displayed. It includes a list of all events and their key facts, shown in Figure 3.14 at the bottom. Further, the time slices will be visualized with Small Multiples, either within a CoverFlow or three of them next to each other, where it is possible to pass backwards or forwards. The concept of Small Multiples was already explained within section 3.3.6.1. A CoverFlow is an interactive visualization of pictures and enables the user to visually navigate through them. Instead of the pictures we could show the geo-maps. This enables the user to experience the outcome of the evolution component. The visualization of events is done by showing the centroids of each event on a map per time slice as so called markers [25]. The event markers, representing the centroids, are clickable and forward to a detail page of the selected event.

Figure 3.14: Result Page of an Event Clustering Timeline

Event Detail (Figure 3.15) On this page the details for an event are shown, including its episodes, which are the outcome of the object extraction component.

On top you can see a map visualization containing a polygon illustrating the surrounding of an event, and star-objects, which display all episodes that were found within the event. As you may have noticed, there are again maps for preceding and succeeding time slices displayed in Figure 3.15 in form of a Coverflow. These are only shown if similar clusters and episodes on succeeding or preceding time slices were found. The star-objects will be color-encoded so that the user can track which episode evolved in which way.

Underneath the maps, detailed statistics for an event, such as size, duration, area size and a list of all tweets, are visualized. Further, a tag cloud with the most common keywords is shown.

At the bottom a list of episodes is provided, containing information about the content in form of the extracted event phrases and the quantity of tweets. As soon as the user selects one of the episodes, the tweets table under the episodes table is populated with all tweets contained by the selected episode.

Figure 3.15: Result Page of an Event Cluster

Episode Timeline (Figure 3.16) Within the Event Detail screen it is possible to get further details on the evolution of a specific episode. Each episode will provide a link to the Episode Timeline screen. There the evolution of this episode will be displayed with the help of a timemap, which was already described in section 3.3.6.1. Instead of a mockup we will present a real screen shot for this component, as this kind of visualization could not be characterized fully with the concepts of the mockup tool.

Figure 3.16: Result Page of an Episode Timeline

3.4 Data Model

This section deals with the adaption of the database, which was created for our system together with Gerald Madlsperger [47]. We will also explain the changes and design decisions which were made during the work. The data-layer is based on the CrisisTracker implementation and was only extended to meet our requirements.

3.4.1 Prerequisites

The project CSI relies on a data-layer modeled with the tool Visual Paradigm, which is able to generate the whole hibernate layer [11], the interface between the java based code and the database. We decided to follow the same approach for this work. As we already stated in the architecture chapter, the system was designed for flexibility. Therefore we decided to split our data model into a CrisisTracker layer and a CrowdSA layer, where the first layer is only a copy of the original data schema of CrisisTracker, but modeled with Visual Paradigm for generating the hibernate mappings. The CrowdSA layer is also based on the CrisisTracker schema but provides the extensions we need for the improved functionality of our system. The new schema is able to store the annotations created by the preprocessor; further, the episodes had to be modeled, and the evolutionary connections between episodes, but also aggregates respectively events, are part of the new schema.

For copying the data from the old schema to the CrowdSA schema we came up with a data-adapter, which fetches the data from the already filled database, containing the data described within chapter 2, and writes it into the CrowdSA database.

3.4.2 CrowdSA Data-layer

As mentioned, the same modeling software as for the CSI project was used. The cause for this is a planned future combination of the crowdSA components and the CSI components, as described within the introduction chapter 1. To be conform with the CSI data-layer we split the CrisisTracker schema into a layered one with three layers. The first layer is called Source Layer, the second one Aggregated Layer and the third Integrated Layer.

Source Layer It should be possible to extend the system for other sources than twitter; for this purpose we introduced a generic model for social media content. The generic model is shown in Figure 3.17. As you can see, it provides a class called SocialMediaContent, which will later be inherited by the class called Tweet within the Twitter specialized Source Layer. The annotation class represents the results of the preprocessing component. Further, a location can be stored for every SocialMediaContent instance.

Figure 3.17: Generic Source Layer
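The following plain-Java sketch mirrors the generic model of Figure 3.17; the field names are assumptions derived from the description above, while the real classes are generated from the Visual Paradigm model as a hibernate layer.

import java.util.ArrayList;
import java.util.List;

// Generic content class of the Source Layer.
public class SocialMediaContent {
    private String text;
    private Location location;                                      // optional geolocation
    private final List<Annotation> annotations = new ArrayList<>(); // preprocessor results

    public void addAnnotation(Annotation annotation) { annotations.add(annotation); }
}

// Tweet specializes the generic content within the Twitter Source Layer.
class Tweet extends SocialMediaContent {
    private long twitterId;
    private String userName;
}

// Minimal stand-ins for the associated classes of the model.
class Location { double longitude, latitude; }
class Annotation { String key, value; }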

Figure 3.18 shows the Source Layer for the twitter data. It was copied from CrisisTracker and extended for the generalization to the SocialMediaContent class from Figure 3.17. Within those tables the dataset migrated from the CrisisTracker database is stored.

Figure 3.18: Source Layer from CrisisTracker

Aggregated Layer The Aggregated Layer is used for storing the results of the aggregation component, described in section 3.3.3.

As for the Source Layer, we introduced a generic model to be able to reuse the aggregation implementation for other sources. Like the SocialMediaContent class, the generic SocialMediaAggregate can have a location; further, it contains several instances of SocialMediaContent classes.

Figure 3.19: Generic Aggregated Layer

We also copied the CrisisTracker model for this part. However, it was not used in a productive way but only for the evaluation of the aggregation component. All the results of our aggregation component conform to the Generic Aggregated Layer.

Figure 3.20: Aggregated Layer from CrisisTracker

Integrated Layer The Integrated Layer is the place where the results of the object extraction component are stored. It was implemented from scratch within the thesis of Gerald Madlsperger [47], as CrisisTracker had no functionality for extracting objects. In Figure 3.21 we can see that a CrowdObject can have one of two types. The first type, called GemetType, is a leftover from the preceding practical work for this thesis. Instead of a complex object extraction component we mapped the tweets against the Gemet Thesaurus [20] and stored the results as objects. We did not want to lose this functionality, and instead of deleting this part in the schema, we extended it with the EpisodeType, which corresponds to the results of the object extraction component.

Figure 3.21: Integrated Layer

Chapter 4

Implementation

This chapter discusses the concrete algorithms that were used for the implementation of the components PreProcessor, Aggregation and Visualization, covered in this project. It was partly composed together with Gerald Madlsperger [47], since it affects both of our theses.

4.1 PreProcessing

Two sub-components of the preprocessor, the Part-of-Speech tagger and the Geolocator, are covered within this work; the others are implemented and discussed in the Master's thesis of Gerald Madlsperger [47]. Both take the original Tweets as an input and store the processed annotations in the operating database.

4.1.1 Topic Fencing: Part-of-Speech Tagging

The PoS-Tagger is implemented with the GATE ANNIE [17] pipeline, where the tool annotates the input Tweets according to their Part-of-Speech values. Each Tweet refers to a list of PoS-annotations in the database. These annotations are stored as key-value pairs, where the key consists of indices which describe the part of the Tweet which belongs to one PoS-Tag. The value element refers to the PoS-type of the word.

The following example will explain the format of the PoS annotations:


Tweet 1: RT RadioDresden: #Hochwasser #Dresden BlauesWunder wird nach Angaben der Stadt in den nächsten zwei Stunden gesperrt!

In this example Stadt is stored as the following key-value pair: a(Stadt) = {(73, 78), NN}. This means that in Tweet 1 the characters at positions 73 to 78 were identified as a noun (NN).

In further consequence the similarity of a pair of Tweets can be calculated as described in section 4.2.1.2.

4.1.2 Geo Fencing

As mentioned in chapter 3.3.2.1, our Geo Fencing component is based on previous projects. Yet it was necessary to adapt this tool to handle common situations like identifying multiple locations. Generally there are four levels of geolocalisation that can be selected through the configuration file, which is accessible through the visualization component.

• Meta data - only uses the location information available from the Tweet meta data

• Hashtags - extracts all hashtags within a Tweet and maps them to Geonames1 locations

• Named Entity - extracts all Named Entities of a Tweet and maps them to Geonames

• Nouns - extracts all Nouns of a Tweet and maps them to Geonames

Based on these configuration settings the outcome may be a list of multiple locations, of which only one will be stored in the database. In order to identify the most accurate position, it is necessary to analyze the hierarchy of the locations. Therefore Geonames [22] is requested through their webservice for the complete location hierarchy of the respective toponym. A HierarchyManager deals with those hierarchies and also retrieves the most precise location.

Hierarchy Manager Geonames knows 9 different location categories, so called Feature Classes. These classes cover toponyms such as countries, rivers, roads, parks, etc. and identify all of them with feature codes. In order to find the most accurate location, we split the hierarchy elements into their categories and identify the most frequently mentioned location in each category. We give priority to the category Road and always return a toponym of this type if one is found. For all other categories the most frequent element is chosen as the most accurate location. In case the occurrences of toponyms are equal, the first appearance is taken.

1http://www.geonames.org/
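The selection logic just described can be sketched as follows; Toponym and its fields are simplified stand-ins for the Geonames webservice response, and "R" is the Geonames feature class covering roads.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified representation of one Geonames hierarchy element.
class Toponym {
    String name;
    String featureClass;  // e.g. "R" for roads, "P" for populated places
}

public class HierarchyManager {
    // Roads are preferred; otherwise the most frequently mentioned
    // toponym wins, and ties keep the first appearance.
    public Toponym mostAccurate(List<Toponym> hierarchyElements) {
        Map<String, Integer> counts = new HashMap<>();
        Toponym best = null;
        int bestCount = 0;
        for (Toponym t : hierarchyElements) {
            if ("R".equals(t.featureClass)) {
                return t;                      // priority for category Road
            }
            int count = counts.merge(t.name, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = t;
            }
        }
        return best;
    }
}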

For the following example, the whole Tweet is iterated and each word is sent to Geonames. In case a location is returned, the whole location hierarchy is stored in the location hierarchy list of the Tweet.

Tweet 1: Prießnitzstraße ist noch trocken. Ecke Nordstraße ca. 30 cm Luft und Bischofsweg ca 80cm Luft bis zur Straße. #Hochwasser #Dresden #Neustadt

For the above given example, the following locations would be retrieved with the respective configuration settings:

Geo Level       Location
Meta Data       -
Hashtag         Dresden
Named Entity    Prießnitzstraße
Noun            Prießnitzstraße

4.2 Aggregation

4.2.1 Similarity Calculation

The first subcomponent of the message aggregation is formed by the similarity calculation. Since we have a large set of Tweets, in which the similarity of each possible pair has to be calculated, we decided to make use of an affinity matrix [48]. The similarity of two Tweets is calculated based on three features, which result in three similarity values, namely Geolocation, Part-of-Speech and Semantic Similarity. Upon those values a weighted sum is calculated and stored in the respective cell of the affinity matrix. The weighting for each similarity setting is defined in the configuration file and sums up to 1.0.
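The weighted sum can be sketched as follows; the Tweet class and the three similarity methods are placeholders for the features explained in sections 4.2.1.1 to 4.2.1.3, and the weights come from the configuration file in the real system.

public class AffinityMatrixBuilder {
    private final double geoWeight, posWeight, semanticWeight; // sum up to 1.0

    public AffinityMatrixBuilder(double geoWeight, double posWeight, double semanticWeight) {
        this.geoWeight = geoWeight;
        this.posWeight = posWeight;
        this.semanticWeight = semanticWeight;
    }

    public double[][] build(Tweet[] tweets) {
        int n = tweets.length;
        double[][] affinity = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double similarity = geoWeight * geoSimilarity(tweets[i], tweets[j])
                        + posWeight * posSimilarity(tweets[i], tweets[j])
                        + semanticWeight * semanticSimilarity(tweets[i], tweets[j]);
                affinity[i][j] = similarity;
                affinity[j][i] = similarity;   // the matrix is symmetric
            }
        }
        return affinity;
    }

    // Placeholders for the similarity features of the following sections.
    private double geoSimilarity(Tweet a, Tweet b) { return 0.0; }
    private double posSimilarity(Tweet a, Tweet b) { return 0.0; }
    private double semanticSimilarity(Tweet a, Tweet b) { return 0.0; }
}

// Stand-in for the project's Tweet class.
class Tweet { }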

4.2.1.1 Geolocation Similarity

The similarity of two locations is calculated with a 2D Euclidean distance measure. This method is embedded in the Vivid Solutions JTS Topology Suite2, which we used for dealing with geometries and topologies. The distance method of this suite takes two coordinates as an input, but ignores the Z coordinate. The Euclidean distance of two 2D points p and q is calculated as follows:

d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2} \qquad (4.1)

In our case the Euclidean distance is a plausible measure, because we do not require an infrastructural distance based on roadmaps. This method is also followed by several research projects, such as Ruiz et al. [61] for measuring movements in Microblogs.
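With JTS the calculation reduces to a single call; the distance method of Coordinate computes exactly this 2D measure and ignores the Z value, as noted above. The coordinates below are made-up sample values.

import com.vividsolutions.jts.geom.Coordinate;

public class GeoSimilarityExample {
    public static void main(String[] args) {
        Coordinate p = new Coordinate(13.43, 48.57);
        Coordinate q = new Coordinate(13.74, 51.05);
        double distance = p.distance(q);   // Euclidean distance as in (4.1)
        System.out.println("distance = " + distance);
    }
}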

4.2.1.2 Part-of-Speech Similarity

The idea of Part-of-Speech similarity is to compare two Tweets according to the PoS-Tags of their content. Each pair of Tweets is analyzed for their PoS annotations. For each word combination of Tweet1 and Tweet2 which has the same PoS annotation type, the similarity is calculated. The similarity value is retrieved by N-gram similarity. We therefore use the solution of the Apache Lucene SpellChecker suite3, which offers to calculate similarities between two Strings with either N-gram Similarity [34] or Levenshtein Distance [42]. As mentioned already in chapter 3.1.1, N-gram similarity is more useful in our context, since we want to compare text phrases. Levenshtein Distance is more appropriate for overcoming spelling errors.

For the reasons mentioned in chapter 3.1.1, our implementation takes N-grams of size 4 for each pair of words. Before those words are used by the Lucene method, they are converted to lower case, since this has an impact on the result; for our use case, however, there is no difference between upper and lower case words. The Lucene SpellChecker suite further uses the N-gram distance measure introduced by Kondrak [38], which has already been mentioned in chapter 3.1.1.
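The comparison of one word pair then becomes a short call into the Lucene SpellChecker suite; the two words are sample inputs, and the lower-casing mirrors the conversion described above.

import org.apache.lucene.search.spell.NGramDistance;

public class PosSimilarityExample {
    public static void main(String[] args) {
        NGramDistance ngram = new NGramDistance(4);   // N-grams of size 4
        float similarity = ngram.getDistance(
                "Hochwasser".toLowerCase(),
                "Hochwassers".toLowerCase());
        System.out.println("similarity = " + similarity);
    }
}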

4.2.1.3 Semantic Similarity

In order to take the semantic content of a Tweet into account, we implemented this component. The text of two Tweets is analyzed for their semantic relatedness. Therefore keywords of each Tweet are extracted. Since we desire to have the system configurable, it is possible to select whether the semantic similarity should be calculated based on all hashtags or all nouns of the Tweets. The configuration settings, which are set via the visualization tool, can have the values HASHTAG or ALL. In case ALL is chosen, all hashtags and nouns in the Tweet are set as keywords.

2http://www.vividsolutions.com/jts/
3https://wiki.apache.org/lucene-java/SpellChecker

The following example will show the keywords of the different settings:

Tweet 1: Prießnitzstraße ist noch trocken. Ecke Nordstraße ca. 30 cm Luft und Bischofsweg ca 80cm Luft bis zur Straße. #Hochwasser #Dresden #Neustadt

Semantic Level   Keywords
Hashtag          Hochwasser, Dresden, Neustadt
All              Hochwasser, Dresden, Neustadt, Prießnitzstraße, Ecke, Nordstraße, Luft, Bischofsweg, Luft, Straße

We use the DISCO library by LinguaTools4 to retrieve the semantic relatedness of the extracted keywords. DISCO works with a Wikipedia knowledge base and calculates the distance value of two words in the wordspace by Cosine Similarity. It would also be possible to choose the Kolb Similarity Algorithm [37], but it turned out that Cosine Similarity works better in our case.

Another consideration was to use WordNet5, respectively GermaNet6, for finding a Semantic Similarity. Unfortunately the German language tool GermaNet is not available for free use and therefore was not applicable in our project. An advantage of GermaNet compared to DISCO would have been that it is based on WordNet, which is widely used in research projects. Nevertheless we also achieved good results with DISCO and could make use of the built-in similarity calculations.

4.2.2 Clustering

In this chapter we will introduce the clustering algorithms which we used in the prototype, but also evaluate external tools, of which we will integrate one in our system. Further, references from other research projects are given, which used our suggested algorithms in the context of microblogging data.

4http://www.linguatools.de/disco/
5http://wordnet.princeton.edu/
6http://www.sfs.uni-tuebingen.de/GermaNet/

4.2.2.1 Affinity Propagation

Affinity Propagation describes a rather new algorithm in the field of data clustering. It was first introduced in 2007 by Frey and Dueck [21] as Clustering by Passing Messages between Data Points. In comparison to K-Means Clustering [46], it is not necessary to know the cluster size. Further, a calculated central vector is not required; the clustering is done based on actual data points and simultaneously takes all of those into consideration. These so called 'exemplars' are data points which best describe the cluster they are in. To find those exemplars, all data points are seen as nodes within a network. For each pair of points, energy functions are calculated and transmitted as messages. There exist two kinds of energy functions that describe the affinity between two points: first the responsibility and second the availability.

• Responsibility r(i, k) shows how well suited point k is to be an exemplar for point i:

r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \left\{ a(i, k') + s(i, k') \right\} \qquad (4.2)

• Availability a(i, k) shows how appropriate it would be for point i to take point k as its exemplar:

a(i, k) \leftarrow \min\left\{ 0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0,\, r(i', k)\} \right\} \qquad (4.3)

These functions are repeatedly calculated until an appropriate setting for exemplars is found. This applies when the minimum of the energy functions is reached.

Vasantha et al. [73] show in their work that Affinity Propagation works efficiently and effectively also for text data. Zhao [40] uses Affinity Propagation on Tweets as an example of Social Media Data.
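To make the message passing concrete, the following self-contained sketch implements the updates of equations (4.2) and (4.3) with damping; it illustrates the algorithm only and is not the ELKI implementation used later.

public class AffinityPropagationSketch {
    // s is the similarity matrix with preferences on the diagonal;
    // returns for every point the index of its chosen exemplar.
    public static int[] cluster(double[][] s, int iterations, double damping) {
        int n = s.length;
        double[][] r = new double[n][n];   // responsibilities
        double[][] a = new double[n][n];   // availabilities
        for (int iter = 0; iter < iterations; iter++) {
            // responsibility update, equation (4.2)
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double max = Double.NEGATIVE_INFINITY;
                    for (int kp = 0; kp < n; kp++) {
                        if (kp != k) max = Math.max(max, a[i][kp] + s[i][kp]);
                    }
                    r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - max);
                }
            }
            // availability update, equation (4.3)
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double sum = 0;
                    for (int ip = 0; ip < n; ip++) {
                        if (ip != i && ip != k) sum += Math.max(0, r[ip][k]);
                    }
                    double value = (i == k) ? sum : Math.min(0, r[k][k] + sum);
                    a[i][k] = damping * a[i][k] + (1 - damping) * value;
                }
            }
        }
        // each point chooses the exemplar k maximizing a(i,k) + r(i,k)
        int[] exemplar = new int[n];
        for (int i = 0; i < n; i++) {
            for (int k = 1; k < n; k++) {
                if (a[i][k] + r[i][k] > a[i][exemplar[i]] + r[i][exemplar[i]]) {
                    exemplar[i] = k;
                }
            }
        }
        return exemplar;
    }
}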

4.2.2.2 Hierarchical Clustering

Hierarchical Clustering is a widely known and used method for data aggregation and analysis. It is an iterative and monotonic approach, where each data point is merged or split with another data point based on their similarity value. When a certain threshold is reached, the splitting/merging stops and a final set of clusters remains. The literature [60] distinguishes between two kinds of hierarchical clustering:

Agglomerative Clustering Each data point is its own cluster; with each iteration the clusters are merged bottom up according to their similarity.

Divisive Clustering The whole data set is one single cluster; with each iteration the clusters are split top down according to their similarity.

Hierarchical Clustering is also applied in many research projects concerning Microblogs such as Twitter. A quite commonly referenced paper by Olariu [51] proposes a solution for Event Detection on a Twitter Stream with the help of Hierarchical Agglomerative Clustering. Hereby he first classifies the Tweets to certain topics and then applies the clustering algorithm in order to retrieve event clusters. Ifrim et al. [30] also gained reasonable results in using this aggregation method for finding topic clusters in Twitter data. As an advantage they mentioned the possibility of setting a threshold and therefore being able to determine the tightness or looseness of a cluster, without stating a number of clusters as with K-Means. Especially since the objectives of clustering may vary, it can be useful to set a larger threshold for finding more general topics with different sub-clusters. This setting applies also in our project.
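A threshold-controlled agglomerative variant can be sketched in a few lines; single linkage over a similarity matrix is assumed here for simplicity, merging clusters until the best pairwise cluster similarity drops below the threshold.

import java.util.ArrayList;
import java.util.List;

public class AgglomerativeClusteringSketch {
    // sim holds pairwise Tweet similarities; the threshold controls
    // the tightness of the resulting clusters.
    public static List<List<Integer>> cluster(double[][] sim, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {   // start with singletons
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double best = threshold;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    // single linkage: similarity of the closest member pair
                    double link = 0;
                    for (int i : clusters.get(a))
                        for (int j : clusters.get(b))
                            link = Math.max(link, sim[i][j]);
                    if (link >= best) { best = link; bestA = a; bestB = b; }
                }
            }
            if (bestA < 0) break;                 // nothing above the threshold
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }
}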

4.2.3 Clustering tools

Since there exist various toolkits and libraries with clustering algorithms, we evaluated some of them in order to find the best fitting tool. Our requirements on the library include the following:

• Open Source respectively available for free

• Good Java support

• Accept similarity/distance matrices as an input

• Possibility of using Affinity Propagation and Hierarchical Clustering algorithms

In conclusion, we decided to use the ELKI toolkit [6], due to its prevailing advantages. Although the Java support is officially still in a Beta version, it has proven good functionality and results. In the following sections all tools we took into consideration are introduced and shortly evaluated according to the above mentioned requirements.

The following table gives an overview of the evaluated tools and their advantages and disadvantages.

Tool             Open Source   Java support   Similarity matrix   AP and HC algorithms
ELKI [6]         +             +              +                   +
JML [49]         +             +              -                   -
Mahout [52]      +             +              +                   -
R [55]           +             ~              +                   +
RapidMiner [1]   +             +              ~                   -
S-Space [35]     +             +              -                   -
WEKA [27]        +             +              -                   -

4.2.3.1 ELKI[6]

ELKI stands for Environment for Developing KDD-Applications Supported by Index-Structures and is an open source tool developed by the Ludwig-Maximilians-Universität in Munich, Germany. It offers various algorithms for tasks such as clustering and outlier detection.

Algorithms

• Affinity Propagation Clustering

• Hierarchical Clustering

• K-means

• Expectation Maximization

• etc.

Evaluation Since it offers the two algorithms we require for our system, it was interesting to take a closer look at ELKI. The documentation states that for now the tool is rather used with a standalone GUI and is still in a Beta version as a Java library. Still, the JavaDoc is comprehensive and an extensive tutorial shows good examples on how to use the clustering algorithms within a Java program. Also, the community behind ELKI is still very active and we were able to solve occurring problems in direct contact with the developers. Another major advantage is the possibility to use pre-computed distance/similarity files. Those files represent our similarity matrices and work with ELKI quite smoothly. All in all, ELKI has proven to be a good and easy choice for our requirements, with correspondingly good clustering results.

4.2.3.2 JML[49]

JML stands for Java Library for Machine Learning and offers a simple Java toolkit for various applications in Machine Learning and Data Mining. There exists a component dealing with cluster analysis, yet it only offers a small number of implemented algorithms. Meanwhile JML was replaced by the project LAML (Linear Algebra and Machine Learning), which is in general more comprehensive and faster. Still, it does not provide any clustering mechanisms different from those of JML.

Algorithms

• K-Means

Evaluation JML has no implementation of any of the desired clustering algorithms, yet Spectral Clustering might have been an interesting approach for our data analysis. JML is developed for Java use only and therefore has a well working library. Unfortunately it only works with data point matrices and does not take similarity/distance values as an input. This makes the use of the tool within our project impossible, and therefore JML was rejected.

4.2.3.3 Mahout[52]

Apache Mahout is a widely known and used framework for Machine Learning applications. It is commonly applied for its comprehensive MapReduce and Recommender functionality. Still, it also offers several clustering algorithms, which are all based on K-Means.

Algorithms

• K-Means

• Spectral K-Means

• Fuzzy K-means

• Streaming K-means

Evaluation Mahout works as a Java library and offers an easy way to implement their algorithms. Further, the tutorials give a good overview and show examples for each topic. Mahout clustering also works with similarity matrices, which are committed as a document file. Unfortunately the offered clustering algorithms are all based on K-Means, which implies that the number of resulting clusters is required. For our project it is impossible to define the k clusters in the beginning. Therefore it is not possible to work with Apache Mahout.

4.2.3.4 R[55]

R is a widely known and established framework for mathematical computations. In comparison to similar projects like MATLAB, R is completely open source. Another focus topic of R is the graphical presentation of data and functions. The vast variety of clustering algorithms also includes Hierarchical Clustering and Affinity Propagation.

Algorithms

• Affinity Propagation

• Hierarchical Clustering

• K-Means

• Expectation Maximization

• etc.

Evaluation Both desired clustering algorithms are available in so called CRAN packages of R. For Hierarchical Clustering there even exist several different implementations and settings. Further, the algorithms may take similarity/distance matrices as an input for the clustering. Unfortunately the integration of R in a Java project is rather complex, since it requires R to run on the machine separately. The complexity of the system, compared to the rather manageable task the library should fulfill in our project, was the main reason why we decided against the integration.

4.2.3.5 RapidMiner[1]

RapidMiner is an open source platform for data mining and analysis, initially introduced as a research project at the Technical University of Dortmund. RapidMiner finds application in various commercial industries and therefore is worldwide known for its expertise in machine learning.

Algorithms

• K-Means

• Agglomerative Clustering

• etc.

Evaluation RapidMiner offers a good Java library which can easily be integrated in existing projects. Yet it showed difficulties in dealing with similarity/distance matrices, because it would have required modifying the package and file structure of the source code. This, and the fact that there is no implementation of Affinity Propagation, were the main reasons for not working with RapidMiner.

4.2.3.6 S-Space[35]

S-Space is a small research project developed by the University of California. It is a collection of algorithms dealing with a semantic space and other forms of language/text processing. Since this also includes text clustering, we considered the project for our evaluation.

Algorithms

• K-Means

• Hierarchical Clustering

• Spectral Clustering

• etc.

Evaluation Although S-Space exists as a Java Maven project, the disadvantages prevail. Apart from the missing implementation of Affinity Propagation, there is also no possibility to commit a similarity/distance matrix. It is only possible to work with a matrix of data points, which is not useful within our project.

4.2.3.7 WEKA[27]

WEKA is a framework of machine learning algorithms and other data mining tools. It is an open source project fully implemented and available in Java. WEKA is often used in combination with its Graphical User Interface, which provides results not only in written form but also as nicely presented graphs.

Algorithms

• K-Means

• Hierarchical Clustering

• Expectation Maximization

• etc.

Evaluation The integration of WEKA into our project would work quite smoothly, yet it does not provide an implementation of Affinity Propagation. Another problem comes up with the fact that it is not possible to work with similarity/distance matrices, but only with actual data points. For these reasons WEKA will not be used in this project.

4.3 Visualization

The Visualization component was implemented in collaboration with Gerald Madlsperger [47], since the results of both of our projects should be presented in one fluent passage. Basically we implemented the User Interface with the Java Server Pages technology. Therefore we created five JSP files, which represent the five pages available in our visualization. These pages include the following four, each explained in its own section, and a simple Error page.

Because of the long execution times of the pipeline for the whole dataset, we decided to fully decouple the visualization from the rest of the pipeline and pre-compute the results we want to see. Figure 4.1 depicts that the visualization component is only connected with the database, from where it reads the already analyzed data.

Figure 4.1: Visualization component diagram

4.3.1 Home (Historical Data)

This describes the initial page of the User Interface, where various configuration files can be selected. Each configuration stands for an already executed program run. Altogether we identified 10 parameters with which the pipeline can be initialized. Those are summarized in the following, with references to their detailed description; a hypothetical example of such a configuration file is sketched after the list.

• Boundingbox: This parameter is used to specify a geographical area in which we are operating.

• PoS-Weight: Specifies the weighting of the PoS-similarity for aggregation.

• Geo-Weight: Specifies the weighting of the Geo-similarity for aggregation.

• Semantic-Weight: Specifies the weighting of the Semantic-similarity for aggrega- tion.

• Clustering Algorithm: Selects the preferred clustering algorithm.

• LDA Algorithm: Selects the preferred LDA algorithm (With/Without model se- lection) [47] 1.

• Evolution Feature: Selects the feature used for evolution calculation (Ontology, Entity) [47] 1

• LDA Feature: Selects the feature used for LDA (Ontology, Event phrase) [47] 1.

• Aggregation Evolution Threshold: Threshold which is used to find dependent ag- gregates over time [47] 1.

• Episode Evolution Threshold: Threshold which is used to find dependent episodes over time [47] 1.
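A hypothetical configuration file in Java properties style could combine these parameters as shown below; all key names and values are illustrative assumptions, not the project's actual file format.

# illustrative parametrization of one pipeline run
boundingbox = 9.5,46.3,17.2,49.1
pos.weight = 0.5
geo.weight = 0.2
semantic.weight = 0.3
clustering.algorithm = AFFINITY_PROPAGATION
lda.algorithm = WITH_MODEL_SELECTION
evolution.feature = ONTOLOGY
lda.feature = EVENT_PHRASE
aggregation.evolution.threshold = 0.6
episode.evolution.threshold = 0.6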

Figure 4.2: Home (Historical Data)

1Those parameters are part of the object extraction and evolution components; find more details in the work of Gerald Madlsperger [47].

4.3.2 Timeslice Details

The Timeslice Details screen was implemented with three Google Maps instances, on which all events over three time slices are shown. Everything else was implemented in a straightforward manner according to the concept described in chapter 3.3.6.2.

Figure 4.3: Timeslice Details

4.3.3 Event Detail

At first glance the Event Detail screen looks identical to the Timeslice Details screen. But the episodes shown on the preceding and succeeding maps are not necessarily within the following resp. preceding time slice, but are directly connected to one episode of the current event. Also the coloring of the episodes matches over all maps; therefore the detection of connected episodes is simplified.

Figure 4.4: Event Details

4.3.4 Timeline Result

As stated in the concept chapter, this page contains a timemap visualization. The implementation was done with a javascript library called timemap [16], which is based on the SIMILE timeline [50] implementation. This page is used only for the component of Gerald Madlsperger and therefore will not be explained any further within this thesis.

Chapter 5

Results and Evaluation

In this chapter we will present the results of the system according to different configuration settings. Further, we will evaluate those results and draw a conclusion.

5.1 Evaluation Criteria

Evaluation criteria are important to define, since they provide the basis for an objective evaluation. In Information Retrieval and Analysis there exist various metrics which give an insight into the quality and correctness of clustering. We chose those among them which deliver the most information.

5.1.1 Cluster Evaluation

Within ELKI there exists a cluster analysis functionality, which calculates different metrics to evaluate the clustering of the given data set. Achtert et al. [5] state that the evaluation of clustering algorithms can turn out difficult, because each dataset and its requirements differ. Also, it may be demanding to find the one perfect data partitioning. In this paper they also propose an evaluation method which calculates metrics based on pairs of datapoints within clusters. They calculate widely known metrics such as Precision, Recall, F-Measure, Rand, Jaccard, Purity and Mutual Information. This approach is further discussed by Kriegel et al. [39], where they apply the so called pair-counting method not only on individual data points, but also on whole clusters, in order to compare the results of various clustering algorithms.

Although those clustering metrics were calculated by ELKI during the execution, the results were not very useful for further analysis. The input of a reference clustering in the form of an external file is not possible. Reference clusters can only be submitted by defining a second clustering algorithm which calculates a reference result based on the same dataset.

To overcome this problem we define some basic statistics which give an overview of the data structure after clustering. Usually the cluster size and the number of clusters are used as input parameters before the clustering takes place. In this project it was not possible to define those measures beforehand, which also made the application of e.g. K-Means impossible. We therefore calculate an average for these metrics based on the clustering result and try to interpret the outcomes.

Average Cluster Size The average cluster size offers an insight into the segmentation of the data, relating to the cohesiveness of the individual data points. In our case it shows the scope of one event within one timeslice. The higher the ratio of average cluster size to the total number of data points is, the lower is the number of clusters. This might occur if the segmentation into few but large data groups is very clear, or if the parameters for clustering are not very strict. On the opposite, a tightly defined parametrization results in many clusters with single elements. In order to get a meaningful value for our clustering results, the average was calculated based on the results of all settings introduced in chapter 5.2.

Average # of Tweets per Timeslice   Average Cluster Size
100                                 13

This result shows a quite good ratio of the average cluster size to the total number of Tweets. Naturally the quality of the result also depends on the number of clusters that were found within one timeslice.

Average Number of Clusters The average number of clusters serves a similar purpose, since this measure stands in close relationship with the average cluster size. The higher the number of clusters, the more likely it is that the cluster size is low. On the other hand, if the number of clusters is low, chances are high that those few clusters are big and therefore not very compact and precise.

Average # of Tweets per Timeslice   Average # of Clusters
100                                 9

Also this result shows that the number of clusters within one timeslice offers a decent distribution of Tweets. Yet these metrics do not provide any information about the correctness of the assignment of Tweets to their cluster, which is why we have to come up with another evaluation metric.

5.1.2 Recall

As mentioned above, there exist many measures in information retrieval which evaluate the quality of a clustering algorithm. For many reasons, e.g. the unknown total scope of documents, the vague definition of a correct clustering, etc., it is not possible, respectively it does not make sense, to calculate most of them. Yet the metric of recall is an ideal measure for defining the quality of our clustering results.

Recall was first mentioned by Van Rijsbergen in his book Information Retrieval [57], where he describes recall as the proportion of relevant documents that are retrieved. This metric does not require exact knowledge of the total document collection to calculate a precise value, which makes it useful in this thesis.

Reference Data In our project there is no single perfect clustering, because the assignment of a Tweet to a cluster is very subjective. Since we have defined various scenarios within the dataset, we evaluate the composition of the clusters based on those. We define SQL queries which best represent these scenarios and check whether the Tweets of those scenarios are also clustered together.
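A hypothetical scenario query of this kind, here embedded as a Java string constant, might look as follows; the table and column names are invented for this sketch and do not reflect the actual schema:

// Selects the Tweets that make up the "Sandbags" scenario (all identifiers are
// illustrative; date range and keywords relate to the 2013 flooding scenarios).
static final String SANDBAG_SCENARIO_SQL =
    "SELECT tweet_id FROM tweet "
  + "WHERE created_at BETWEEN '2013-06-01' AND '2013-06-10' "
  + "AND (content LIKE '%Sandsack%' OR content LIKE '%Sandsäcke%') "
  + "AND location LIKE '%Dresden%'";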

True Positives True positives represent Tweets which are correctly assigned to one cluster. Since we cannot define whether Cluster_i or Cluster_j is the accurate cluster, for us the true positive value depends on the other elements in the cluster. This means that for each Tweet we check how many of its neighbors from the reference cluster are also in the actual cluster.

False Negatives As false negatives, in comparison, we describe Tweets which should be in the same cluster as the TPs but are clustered to a different one. The calculation of false negatives is based on the true positives and the number of elements in the reference cluster: those Tweets that are in the reference cluster but are not true positives are false negatives, as sketched below.
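As a sketch of this counting scheme (plain Java with Tweets represented by their IDs; a simplified illustration rather than the actual implementation):

import java.util.Set;

public class PairCounts {

    // TP: neighbors of the Tweet in the reference cluster that also ended up
    // in the Tweet's actual cluster.
    static int truePositives(String tweet, Set<String> referenceCluster, Set<String> actualCluster) {
        int tp = 0;
        for (String neighbor : referenceCluster) {
            if (!neighbor.equals(tweet) && actualCluster.contains(neighbor)) {
                tp++;
            }
        }
        return tp;
    }

    // FN: reference-cluster neighbors that were clustered somewhere else.
    static int falseNegatives(Set<String> referenceCluster, int truePositives) {
        return referenceCluster.size() - 1 - truePositives; // -1 excludes the Tweet itself
    }
}

For the worked example below, this yields TP = 1 and FN = 1 for Tweet_1, matching the recall of 0.5 in equation (5.2).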

The recall metric [57] is defined as follows:

Recall = TP / (TP + FN)    (5.1)

Example In this example we have the following setup of a reference and actual cluster:

Reference Cluster_i: Tweet_1, Tweet_2, Tweet_3

Actual Cluster_i: Tweet_1, Tweet_3, Tweet_4, Tweet_5

Actual Cluster_j: Tweet_2, Tweet_6

To calculate the TP for Tweet_1 we check whether Tweet_2 and Tweet_3 are also in the actual cluster. In this example only Tweet_3 can be found in Actual Cluster_i, which results in a TP value of 1.

To count all FN for Tweet_1, on the other hand, we check the remaining Tweets in the reference cluster. In this case Tweet_2 is positioned in a different cluster than Actual Cluster_i, which results in an FN value of 1.

The final recall calculation for Tweet_1 results in the following:

Recall(Tweet_1) = 1 / (1 + 1) = 0.5    (5.2)

5.1.3 Precision

The precision metric [57] is required alongside the recall measure in order to acquire an objective evaluation. It represents the proportion of retrieved documents that are relevant to the initial query. Since we have only a vague description of what is relevant in our whole dataset, the precision metric would distort the outcome, because it penalizes documents which are actually relevant but not known in advance. Therefore we decided to calculate the recall and precision metrics only for a smaller dataset which includes only the predefined scenarios. In those cases we have already defined the relevant Tweets for each scenario beforehand.

For the calculation of the precision we also need the true positives as described in the previous section; basically they describe the portion of all retrieved data which is correctly identified as relevant. Further we also need false positives.

False Positives In our project we define false positives as Tweets which were positioned in a cluster different from the one given by the reference data defined in the previous section. Concretely, all neighbors of the Tweet in the actual cluster are compared to the neighbors of the Tweet in the reference cluster.

Precision = TP / (TP + FP)    (5.3)

Example Using the same example as for the recall, we have the following setup of a reference and actual cluster:

Reference Cluster_i: Tweet_1, Tweet_2, Tweet_3

Actual Cluster_i: Tweet_1, Tweet_3, Tweet_4, Tweet_5

Actual Cluster_j: Tweet_2, Tweet_6

As calculated in the previous section, we have a TP value of 1.

To count the FP values, we need to check which Tweets are clustered to Actual Cluster_i although they are not found in Reference Cluster_i. In this case this affects two Tweets, namely Tweet_4 and Tweet_5.

The final precision calculation for Tweet_1 results in the following:

Precision(Tweet_1) = 1 / (1 + 2) = 0.33    (5.4)
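Analogously to the TP/FN sketch above, false positives can be counted as the actual-cluster neighbors that do not appear in the reference cluster (again a simplified illustration, extending the hypothetical PairCounts class):

// FP: neighbors in the actual cluster that the reference cluster does not contain.
static int falsePositives(String tweet, Set<String> referenceCluster, Set<String> actualCluster) {
    int fp = 0;
    for (String neighbor : actualCluster) {
        if (!neighbor.equals(tweet) && !referenceCluster.contains(neighbor)) {
            fp++;
        }
    }
    return fp;
}

Applied to the example, the neighbors of Tweet_1 in Actual Cluster_i are Tweet_3, Tweet_4 and Tweet_5; only Tweet_4 and Tweet_5 are missing from the reference cluster, so FP = 2 and the precision is 1 / (1 + 2) = 0.33.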

5.1.4 F1-Score

In order to get one overall measure for the aggregation, the F1-score can be derived from recall and precision. It was first mentioned by Van Rijsbergen [57] to measure the effectiveness of an information retrieval system. Basically it calculates the harmonic mean of recall and precision with the following formula:

F1-Score = 2 * (precision * recall) / (precision + recall)    (5.5)

Example Using the same example as for the recall, we have the following setup of a reference and actual cluster:

Reference Cluster_i: Tweet_1, Tweet_2, Tweet_3

Actual Cluster_i: Tweet_1, Tweet_3, Tweet_4, Tweet_5

Actual Cluster_j: Tweet_2, Tweet_6

Further we calculated the following measures:

Recall(Tweet_1): 0.5

Precision(Tweet_1): 0.33

This would lead to the following final calculation:

F1-Score(Tweet_1) = 2 * (0.33 * 0.5) / (0.33 + 0.5) = 0.3976    (5.6)

This is a rather moderate result, since the best possible F1-measure lies at 1.0 and the lowest at 0.0.
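The per-Tweet F1 computation then reduces to a few lines; this sketch reproduces the value of equation (5.6):

// Harmonic mean of precision and recall; 0.0 when both are zero.
static double f1Score(double precision, double recall) {
    if (precision + recall == 0.0) {
        return 0.0; // avoid division by zero when nothing was retrieved
    }
    return 2.0 * precision * recall / (precision + recall);
}

// f1Score(0.33, 0.5) evaluates to 0.39759..., i.e. the 0.3976 of equation (5.6).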

5.2 Evaluation Results

In this section we present and discuss the results based on the aforementioned evaluation criteria. We calculated the recall and precision metrics for clustering results with differing parametrizations. First we try to identify the best parametrization for the semantic and geolocation levels, which we introduced in the previous chapter. We then apply these settings to the predefined dataset scenarios using different similarity weightings.
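The similarity weightings used throughout the following tables can be pictured as a convex combination of the three per-pair similarity values; a minimal sketch with hypothetical names (the actual similarity framework is described in chapter 4):

// Combines the geographical, Part-of-Speech and semantic similarity of a Tweet
// pair (each in [0, 1]) using weights that are expected to sum to 1, e.g.
// 0.33/0.33/0.33 or 0.50/0.25/0.25 as in the result tables below.
static double combinedSimilarity(double geoSim, double posSim, double semSim,
                                 double wGeo, double wPos, double wSem) {
    return wGeo * geoSim + wPos * posSim + wSem * semSim;
}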

As this project was initially motivated by improving the tool CrisisTracker, we use its clustering result as a reference for our work. The recall is calculated in the same manner as for our system's results and is based on the same scenario dataset. In the CrisisTracker documentation [59] the authors report a very low recall and fairly good precision, due to the fact that CrisisTracker tends to produce small clusters to make sure unwanted Tweets do not end up in a cluster.

5.2.1 Semantic Level

As a first setting we differentiate between the two semantic levels which are configurable in the system. With the setting HASHTAG all hashtags of a Tweet are taken into consideration for semantic analysis, in contrast to the setting ALL which considers all words contained in a Tweet; for further details see chapter 4.2.1.3. The other parameters were set to an equal weighting of 0.33 for all three similarity options and we use affinity propagation clustering, because this proved to be one of the best settings. The following result was achieved:

Semantic Level   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
HASHTAG          0.33   0.33   0.33       AP1          0.56174   0.14717     0.21979
ALL              0.33   0.33   0.33       AP           0.27931   0.09438     0.13178

1 Affinity Propagation

Conclusion Although one might expect that the more semantic analysis is taken into consideration, the better the result, our evaluation showed a different outcome. We attribute this to the fact that with the setting ALL too many word pairs are semantically analyzed, which lowers the average similarity value per Tweet. Since these Tweets then have a low similarity, they are not clustered together. Another drawback of the setting ALL is the weak performance of the algorithm: the execution with all word pairs takes too much time to be used in real-time systems, whereas a run with the setting HASHTAG takes a reasonable amount of time.

According to this result we will use the semantic level configuration of HASHTAG in the following scenario evaluations.

5.2.2 Geolocation Level

Similar to the semantic level, we try to find the best setting for the geolocation level. Here we distinguish between the three levels HASHTAGS, NAMED ENTITY and NOUNS as described in chapter 4.1.2. As suggested in the previous section, the three similarity options were set to an equal weighting of 0.33 and we use affinity propagation clustering, because this already proved to be a good setting in the previous run. This resulted in the following outcomes:

Geolocation Level   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
HASHTAGS            0.33   0.33   0.33       AP1          0.56174   0.14717     0.21979
NAMED ENTITY        0.33   0.33   0.33       AP           0.43895   0.15679     0.21359
NOUNS               0.33   0.33   0.33       AP           0.33495   0.16577     0.20799

1 Affinity Propagation

Conclusion Again we obtain the best results when using the HASHTAGS level, yet the values lie closer to each other in the geolocation context. This can be explained by the fact that social media users tend to use locations as hashtags. Moreover, when taking named entities or all nouns into account, many non-location words are found and annotated with syntactically similar-sounding locations; e.g. the noun Regen (German for rain) is mistakenly mapped to the city of Regen in Germany. Therefore we decided to use the setting HASHTAGS for the following scenario evaluations.

5.2.3 Scenario 1: Bridge Blockade

The first scenario describes a situation where the bridge Blaues Wunder in Dresden was blocked. Here we used several different settings which we expected to lead to a reasonable outcome. In the further scenarios we will focus on the best four settings resulting from this evaluation.

5.2.3.1 Reference: CrisisTracker

As mentioned above, the evaluation measures for CrisisTracker are calculated for the resulting aggregates when applying the tool to the same dataset as used in our project.

                Recall    Precision   F1
CrisisTracker   0.11305   0.03726     0.05604

5.2.3.2 Own Implementation

As mentioned above, for this scenario we decided to calculate the metrics for an extensive set of parameters to get an overview and a good impression of the results. The best F1 value is marked with an asterisk.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.91787   0.13108     0.22940
2         1.00   0.00   0.00       NAHC2        0.02872   0.48152     0.09176
3         0.00   1.00   0.00       AP           0.52157   0.36645     0.43046 *
4         0.00   1.00   0.00       NAHC         0.49225   0.37074     0.42294
5         0.00   0.00   1.00       AP           0.94444   0.10996     0.19698
6         0.00   0.00   1.00       NAHC         0.27669   0.12046     0.16785
7         0.25   0.25   0.50       AP           0.64227   0.13372     0.22136
8         0.25   0.25   0.50       NAHC         0.32916   0.20303     0.25115
9         0.25   0.50   0.25       AP           0.56463   0.19192     0.28646
10        0.25   0.50   0.25       NAHC         0.39304   0.26970     0.31993
11        0.33   0.33   0.33       AP           0.55096   0.12535     0.20423
12        0.33   0.33   0.33       NAHC         0.25436   0.11463     0.12590
13        0.50   0.25   0.25       AP           0.52896   0.16036     0.24611
14        0.50   0.25   0.25       NAHC         0.35662   0.24951     0.29360

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion The first noticeable aspect of this result is the fact that the best value was achieved by the sole use of Part-of-Speech (PoS) features. We attribute this to the circumstance that PoS tags also include location tags, so they can improve the similarity value in case many location tags were used within the Tweet content. Generally, distributed weightings also lead to good results as long as they include PoS features. The choice of clustering algorithm has a mixed impact, yet in this scenario the results turned out better when using affinity propagation. Another interesting observation emerges when looking at the recall and precision values in detail: for some settings we achieve extraordinarily high recall values which are contrasted by a very low precision, which can be explained by the average cluster size of the setting. Big clusters which contain all desired Tweets, but also many unwanted ones, result in a high recall and a low precision. Further it is encouraging that all settings yielded a better F1 measure than CrisisTracker.

5.2.4 Scenario 2: Sandbags

The second scenario deals with the need for sandbags in Dresden. This evaluation set contains more Tweets than the previous scenario, which might lead to a variation in the results.

5.2.4.1 Reference: CrisisTracker

The F1 measure for the CrisisTracker clustering is better than in the previous scenario.

                Recall   Precision   F1
CrisisTracker   0.168    0.07525     0.10394

5.2.4.2 Own Implementation

In this scenario we apply the same parametrizations and mark the best value with an asterisk.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.95      0.1015      0.18350
2         1.00   0.00   0.00       NAHC2        0.00932   0.50140     0.01829
3         0.00   1.00   0.00       AP           0.33066   0.62489     0.43246 *
4         0.00   1.00   0.00       NAHC         0.14984   0.62067     0.24141
5         0.00   0.00   1.00       AP           0.91667   0.12923     0.22652
6         0.00   0.00   1.00       NAHC         0.33235   0.14181     0.19879
7         0.25   0.25   0.50       AP           0.41679   0.2198      0.28781
8         0.25   0.25   0.50       NAHC         0.19502   0.20307     0.19896
9         0.25   0.50   0.25       AP           0.26493   0.22731     0.24468
10        0.25   0.50   0.25       NAHC         0.25946   0.27147     0.26533
11        0.33   0.33   0.33       AP           0.41094   0.18274     0.25298
12        0.33   0.33   0.33       NAHC         0.25067   0.14377     0.18273
13        0.50   0.25   0.25       AP           0.39334   0.2235      0.28504
14        0.50   0.25   0.25       NAHC         0.23395   0.25630     0.24461

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion Here again the setting which takes only Part-of-Speech features into account achieves the best result. Generally, the values apart from the highest are lower than those of the previous scenario, which might be due to the fact that the Tweets are distributed among a higher number of timeslices. The number of Tweets belonging together within a timeslice is therefore lower, which results in a weaker clustering. Nevertheless the overall result is again better than with CrisisTracker.

5.2.5 Scenario 3: Drinking Water

The third scenario contains information concerning the drinking water supply in Passau. It is the scenario with the highest number of Tweets and also one of those spread over the longest timespan.

5.2.5.1 Reference: CrisisTracker

For this scenario CrisisTracker achieves a similar value to the previous scenario.

                Recall    Precision   F1
CrisisTracker   0.12115   0.07635     0.09367

5.2.5.2 Own Implementation

For the evaluation we again used the same settings as in the previous scenarios and achieved fairly good results.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.64      0.10686     0.18314
2         1.00   0.00   0.00       NAHC2        0.01871   0.32718     0.0354
3         0.00   1.00   0.00       AP           0.36183   0.34304     0.35219
4         0.00   1.00   0.00       NAHC         0.39133   0.32201     0.35330 *
5         0.00   0.00   1.00       AP           0.8       0.10080     0.17905
6         0.00   0.00   1.00       NAHC         0.33188   0.17509     0.22924
7         0.25   0.25   0.50       AP           0.44036   0.17449     0.24993
8         0.25   0.25   0.50       NAHC         0.12828   0.09525     0.10933
9         0.25   0.50   0.25       AP           0.35470   0.24831     0.29212
10        0.25   0.50   0.25       NAHC         0.34877   0.21836     0.26857
11        0.33   0.33   0.33       AP           0.38846   0.18885     0.25414
12        0.33   0.33   0.33       NAHC         0.23702   0.12302     0.16197
13        0.50   0.25   0.25       AP           0.33329   0.20900     0.25690
14        0.50   0.25   0.25       NAHC         0.21582   0.16197     0.18506

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion Here the hierarchical clustering achieves better results than affinity propagation. With 39% of correct cluster assignments we accomplish a good result, although the spread of the relevant Tweets across the timeslices is quite high.

5.2.6 Scenario 4: River Dams

The fourth scenario deals with the threat of dam breaks on the river Spree in Spremberg. With under 30 Tweets, this scenario has the smallest scope in our set of scenarios.

5.2.6.1 Reference: CrisisTracker

In this scenario the F1 value of CrisisTracker is rather low. The reason may be the smaller set of Tweets within the scenario.

                Recall    Precision   F1
CrisisTracker   0.16667   0.02869     0.04895

5.2.6.2 Own Implementation

For the same parametrizations that we used in the previous scenarios we obtain mixed results.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.75      0.15811     0.2611
2         1.00   0.00   0.00       NAHC2        0.14583   0.29513     0.19520
3         0.00   1.00   0.00       AP           0.75      0.06568     0.12079
4         0.00   1.00   0.00       NAHC         0.5625    0.38523     0.45729 *
5         0.00   0.00   1.00       AP           0.75      0.04144     0.07855
6         0.00   0.00   1.00       NAHC         0.17916   0.01329     0.02474
7         0.25   0.25   0.50       AP           0.75      0.15235     0.2532
8         0.25   0.25   0.50       NAHC         0.3875    0.16023     0.2267
9         0.25   0.50   0.25       AP           0.6875    0.05727     0.10574
10        0.25   0.50   0.25       NAHC         0.4916    0.18538     0.26924
11        0.33   0.33   0.33       AP           0.75      0.04180     0.07919
12        0.33   0.33   0.33       NAHC         0.36667   0.04460     0.07933
13        0.50   0.25   0.25       AP           0.75      0.15629     0.25868
14        0.50   0.25   0.25       NAHC         0.42916   0.16141     0.23460

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion Here we achieve the best F1 value with the 100% PoS setting. Surprisingly, all other values are far behind this result and do not achieve a very respectable outcome. We explain this with the fact that only a small set of Tweets is available for this scenario. This implies that within one timeslice there are only very few relevant Tweets, which furthermore results in similar values per timeslice; the overall average is then composed of these resembling outcomes.

5.2.7 Scenario 5: Roadblock

The final scenario involves road blockings of the highway A9 in Germany. It has the lowest ratio of Tweets to scenario duration, which might result in a very low number of Tweets per timeslice.

5.2.7.1 Reference: CrisisTracker

In this scenario CrisisTracker achieves the lowest recall, precision and therefore also F1 value.

                Recall    Precision   F1
CrisisTracker   0.06667   0.01111     0.01904

5.2.7.2 Own Implementation

Here we again look at the settings mentioned before, taking both single-weighted parameters and distributed weightings of all similarity options into consideration.

Setting   Geo    PoS    Semantic   Clustering   Recall    Precision   F1
1         1.00   0.00   0.00       AP1          0.875     0.04190     0.07997
2         1.00   0.00   0.00       NAHC2        0.09416   0.44947     0.15570
3         0.00   1.00   0.00       AP           0.27125   0.32812     0.29699
4         0.00   1.00   0.00       NAHC         0.44164   0.20450     0.27956
5         0.00   0.00   1.00       AP           0.875     0.04190     0.07997
6         0.00   0.00   1.00       NAHC         0.36915   0.10101     0.15862
7         0.25   0.25   0.50       AP           0.65833   0.06103     0.11172
8         0.25   0.25   0.50       NAHC         0.28445   0.03441     0.06139
9         0.25   0.50   0.25       AP           0.67708   0.10931     0.18823
10        0.25   0.50   0.25       NAHC         0.49740   0.04881     0.08890
11        0.33   0.33   0.33       AP           0.7083    0.19712     0.30841
12        0.33   0.33   0.33       NAHC         0.18355   0.03632     0.0606
13        0.50   0.25   0.25       AP           0.7083    0.19719     0.30850 *
14        0.50   0.25   0.25       NAHC         0.34487   0.032025    0.058607

1 Affinity Propagation 2 Naive Agglomerative Hierarchical Clustering

Conclusion In this scenario, for the first time, the best results were obtained by a distributed weighting. The consideration of all features, with a higher weight on geolocation, achieved a fairly good F1 value of roughly 30%. Generally the F1 values are quite low, which might be due to the aforementioned low number of Tweets compared to the duration of the whole scenario.

5.2.8 Evaluation Summary

To sum up the evaluation section, Naive Agglomerative Hierarchical Clustering and Affinity Propagation basically achieve equally good results over the total number of settings. We attribute this to the fact that each of the algorithms has its strengths: affinity propagation is a rather new technique which applies well to short messages such as microblog posts, whereas hierarchical clustering is a widely acknowledged clustering method and has been offered by ELKI from the beginning. Also, the ELKI toolkit offers more configuration options for NAHC than for affinity propagation.

The importance of the different similarity options emerges especially when combining multiple features. The best achievable results depend largely on the composition of the dataset. In several scenarios the best results are achieved when using Part-of-Speech features only, although we also consistently obtain good results when weighting all options equally.

As mentioned above, the composition of the dataset has a high impact on the outcomes, since it is important to have a larger amount of relevant Tweets within one timeslice. If this is not the case, experience has shown that the number of false positives increases tremendously. In conclusion, our system achieved by far better results than CrisisTracker in all scenarios. Therefore we consider our basic objective of an improved aggregation achieved.

Chapter 6: Conclusion

To conclude this thesis we briefly summarize the content and reflect on the achieved goals and results.

6.1 Summary

To summarize this work, we created a system which takes social media content as input and offers a visualization component that allows the user to interact with the parametrization and the presentation of the results. The architecture is modular in order to keep the system adaptable, since it is to be integrated into the overall crowdSA framework.

The two main parts of this project are the similarity and the clustering components, which offer a wide range of configuration possibilities. For this parametrization we came up with a feature hierarchy according to which the similarity of a pair of Tweets can be calculated. Of those features we implemented the most important ones in the prototype system, namely geographical, semantic and syntactic features. For each of them we defined various feature levels to make the parametrization of the system even more fine-grained. Since there exists a vast number of clustering tools and algorithms, we decided to implement two of them in order to retrieve comparable results. For the implementation we used an external library which fully met our needs.

We evaluated the system by defining metrics which draw a comparison between the outcomes of different settings. We learned that a distributed set of similarity features performs well, but that, depending on the dataset composition, using Part-of-Speech features only can obtain the best results. The performance of the two clustering algorithms was good in both cases; still, as expected, affinity propagation obtained the better overall results.

6.2 Concluding Statements

Finally, we can state that we succeeded in creating a modular and compact system which meets the requirements of the crowdSA project. We achieved better clustering results than the reference tool CrisisTracker, which we attribute to our solid feature hierarchy. Semantic and geolocation aspects played a big role in creating the feature hierarchy and in calculating the similarity measures. The aggregation component is based on various parameters which can be configured through a user interface.

Chapter 7: Future Work

In this chapter we point out a few ideas which could improve the results of this system but were not possible to implement within the scope of a master's thesis. Since the system's architecture was designed in a very modular way, future add-ons can easily be integrated.

7.1 PreProcessor

There are several interesting concepts which could be implemented within the preprocessor component. Improvements of the current functionality are especially relevant with respect to performance and quality of results.

7.1.1 Time Fencing

The most relevant future feature of the preprocessor is time fencing, as already mentioned in chapter 3.3.2.2.

This component takes Tweets as input and delivers annotations as output. These annotations should provide information about when the situation described in the Tweet content actually took place. Therefore the system should be able to identify temporal information, such as dates or phrases like "tomorrow", "today", "next week", etc., from the social media content.

Algorithms HeidelTime is a temporal tagger which uses the TIMEX3 annotation standard, part of TimeML. The HeidelTime project [63] offers the possibility to extract temporal information from texts in several languages, including German. Since our extracted Tweets mainly use German, this is very useful. Unfortunately there is no HeidelTime plugin for GATE, yet it can be included into the UIMA pipeline (http://uima.apache.org/). Apache UIMA is a framework which makes it possible to analyze and annotate unstructured text.

Figure 7.1 shows an example with different temporal tags. The output in this example is a TimeML file with TIMEX3 tags.

Figure 7.1: HeidelTime demo example [31]
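A minimal usage sketch of HeidelTime's standalone wrapper is shown below; the package paths, enum values and constructor signature are assumptions based on the HeidelTime distribution and may differ between versions:

import java.util.Date;

import de.unihd.dbs.heideltime.standalone.DocumentType;
import de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone;
import de.unihd.dbs.heideltime.standalone.OutputType;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class TimeTaggingDemo {
    public static void main(String[] args) throws Exception {
        // COLLOQUIAL is intended for informal text such as Tweets.
        HeidelTimeStandalone tagger = new HeidelTimeStandalone(
                Language.GERMAN, DocumentType.COLLOQUIAL, OutputType.TIMEML);
        // The document creation time anchors relative expressions like "morgen".
        String timeml = tagger.process("Morgen werden in Dresden Sandsäcke benötigt.", new Date());
        System.out.println(timeml); // TimeML document containing TIMEX3 tags
    }
}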

7.2 Aggregation

For the aggregation there exist various options for extending the component, since the parametrization of similarity features and clustering algorithms offers the capability to easily integrate new functionality.

7.2.1 Similarity Framework

Apart from improvements of the existing similarity calculations, new features and their similarity aspects should also be taken into consideration for implementation.

Semantic Relatedness The similarity feature of semantic relatedness could use some improvements, especially regarding performance. Currently we offer two configuration levels, HASHTAG and ALL, which decide on the word types taken into account for semantic analysis. When only considering hashtags, the performance of the algorithm is quite good, but as soon as all words are used, the duration of the execution stands in no reasonable relation to the quality of the results. Therefore it would be necessary to implement further configuration options, for example taking only named entities into account.


Time Fencing As mentioned in the previous section, time fencing should play an important role in future implementations. Therefore it is also necessary to consider similarity metrics for the relatedness of temporal tags. The basic idea of the time annotations is to find out the date and time when the described content actually happened. Temporal similarity might be computed by setting a scaling threshold which maps time distance to a similarity value; e.g. Tweets occurring within a time frame of 15 minutes have a similarity of 100%, those 15 minutes later only a similarity of 75%, etc.
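A sketch of such a stepwise decay, using 15-minute windows where each further window reduces the similarity by 25% as in the example above (a hypothetical helper, not part of the current system):

import java.time.Duration;
import java.time.Instant;

public class TemporalSimilarity {

    // 0 elapsed 15-minute windows -> 1.0, one window -> 0.75, ..., floored at 0.0.
    static double similarity(Instant a, Instant b) {
        long minutes = Math.abs(Duration.between(a, b).toMinutes());
        long windows = minutes / 15;
        return Math.max(0.0, 1.0 - 0.25 * windows);
    }
}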

7.3 Visualization

This section was composed together with Gerald Madlsperger [47], since it affects both of our theses.

A graphical user interface always includes aspects whose usability could be improved. For our project we primarily wanted to present and visualize the configuration and the results of our system.

7.3.1 Live Data

As mentioned in chapter 3.3.6.2, in the future it should also be possible to stream live data from Twitter or other social media platforms in order to present real-time results, as shown in figure 7.2. Since this implies a higher implementation effort, we already came up with some conceptual ideas.

For fetching the data it should be possible to set a Time Frame size. As soon as a Time Frame has ended, it is visualized on a Result Page. Further it is possible to select whether only German Tweets or Tweets in all languages should be shown in the Time Frame Result. Generally, only German Tweets are taken into consideration for the Episode Extraction.

Filter options are available to narrow the result set. On the one hand it is possible to enter keywords to be used by the Twitter streamer. Further, a geographical polygon should be taken into account in the preprocessing steps of the result data; this means that only Tweets located within this area are considered for the event clustering and Episode Extraction, as sketched below.
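Such a geographical filter could be sketched with the JDK's own Path2D, treating the polygon vertices as longitude/latitude pairs; the coordinates below are made-up illustrative values:

import java.awt.geom.Path2D;

public class GeoFilter {

    // Builds a closed polygon from lon/lat vertex pairs.
    static Path2D.Double polygon(double[][] vertices) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(vertices[0][0], vertices[0][1]);
        for (int i = 1; i < vertices.length; i++) {
            path.lineTo(vertices[i][0], vertices[i][1]);
        }
        path.closePath();
        return path;
    }

    public static void main(String[] args) {
        // Rough bounding polygon around Dresden (illustrative values only).
        Path2D.Double area = polygon(new double[][] {
            {13.6, 50.9}, {13.9, 50.9}, {13.9, 51.2}, {13.6, 51.2}
        });
        System.out.println(area.contains(13.73, 51.05)); // true: inside the area
    }
}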

Figure 7.2: Homepage of live data usage

Glossary

Component A component is a part of code which conforms to the interface description of the pipeline architecture and implements exactly one function.

Episode An episode describes a specific situation within a bigger event; for example, within an event about the flooding in Dresden, an episode could be the situation where people are looking for sandbags to build up a dam.

Event An event describes the aggregation of messages talking about the same topic, like all Tweets about the flooding in Dresden.

Feature A feature is a characteristic which distinguishes one Social Media Content from another. Features offer the possibility to categorize different Social Media Contents and therefore provide the opportunity to perform message aggregation.

Pipeline This project is built on the basis of a stackable pipeline architecture. This means that the whole project consists of several components which can be combined into executable pipelines.

Time Slice A time slice is a given span of time. Within this work the term is often used, as the data is split according to pre-defined time slices at the beginning of the pipeline.

Bibliography

[1] RapidMiner. Online, April 2012. URL http://rapid-i.com/content/view/181/190/.

[2] http://www.magentocommerce.com/magento-connect/open-social-media-monitoring-3195.html, 2013.

[3] https://wiki.ushahidi.com/display/WIKI/SwiftRiver, 2013.

[4] Fabian Abel, Claudia Hauff, Geert-Jan Houben, Richard Stronkman, and Ke Tao. Twitcident: fighting fire with information from social web streams. In Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, pages 305–308, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1230-1. doi: 10.1145/2187980.2188035. URL http://doi.acm.org/10.1145/2187980.2188035.

[5] Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Evaluation of clusterings - metrics and visual support. In Anastasios Kementsietsidis and Marcos Antonio Vaz Salles, editors, IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA (Arlington, Virginia), 1-5 April, 2012, pages 1285–1288. IEEE Computer Society, 2012. ISBN 978-0-7695-4747-3. doi: 10.1109/ICDE.2012.128. URL http://dx.doi.org/10.1109/ICDE.2012.128.

[6] Elke Achtert, Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Interactive data mining with 3d-parallel-coordinate-trees. In Kenneth A. Ross, Divesh Srivastava, and Dimitris Papadias, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 1009–1012. ACM, 2013. ISBN 978-1-4503-2037-5. doi: 10.1145/2463676.2463696. URL http://doi.acm.org/10.1145/2463676.2463696.

[7] G. Andrienko and N. Andrienko. Spatio-temporal aggregation for visual analysis of movements. In Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on, pages 51–58, Oct 2008. doi: 10.1109/VAST.2008.4677356.

[8] D. Archambault, H.C. Purchase, and B. Pinaud. Animation, small multiples, and the effect of mental map preservation in dynamic graphs. Visualization and Computer Graphics, IEEE Transactions on, 17(4):539–552, April 2011. ISSN 1077-2626. doi: 10.1109/TVCG.2010.78.

[9] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. ISBN 0-201-39829-X.

[10] Piyush Bansal, Somay Jain, and Vasudeva Varma. Towards semantic retrieval of hashtags in microblogs. In Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, pages 7–8, Republic and Canton of Geneva, Switzerland, 2015. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-3473-0. doi: 10.1145/2740908.2742717. URL http://dx.doi.org/10.1145/2740908.2742717.

[11] Christian Bauer and Gavin King. Java Persistence with Hibernate. Manning Publications Co., Greenwich, CT, USA, 2006. ISBN 1932394885.

[12] Norbert Baumgartner, Stefan Mitsch, Andreas Müller, Werner Retschitzegger, Andrea Salfinger, and Wieland Schwinger. The Situation Radar - Visualizing Collaborative Situation Awareness in Traffic Control Systems. In Proceedings of the 19th World Congress on Intelligent Transport Systems, 2012.

[13] Matko Bošnjak, Eduardo Oliveira, José Martins, Eduarda Mendes Rodrigues, and Luís Sarmento. Twitterecho: a distributed focused crawler to support open research with twitter data. In Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, pages 1233–1240, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1230-1. doi: 10.1145/2187980.2188266. URL http://doi.acm.org/10.1145/2187980.2188266.

[14] Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A. Greenwood, Diana Maynard, and Niraj Aswani. TwitIE: An open-source information extraction pipeline for microblog text. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, 2013.

[15] Wenlong Chen, Shaoyin Cheng, Xing He, and Fan Jiang. Influencerank: An efficient social influence measurement for millions of users in microblog. In Cloud and Green Computing (CGC), 2012 Second International Conference on, pages 563–570, Nov 2012. doi: 10.1109/CGC.2012.31.

[16] Community. timemap: javascript library to help use a SIMILE timeline with online maps including Google, OpenLayers, and Bing, 2011. URL http://code.google.com/p/timemap/. [Online; accessed 28-July-2015].

[17] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), 2002.

[18] Danica Damljanovic and Kalina Bontcheva. Named entity disambiguation using linked data. In Proceedings of the 9th Extended Semantic Web Conference (ESWC 2012), Poster session, Heraklion, Crete, 2012. URL http://2012.eswc-conferences.org/sites/default/files/eswc2012_submission_334.pdf.

[19] Leon Derczynski, Alan Ritter, and Sam Clark. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Advances in Natural Language Processing, pages 198–206, 2013.

[20] EIONET. EIONET GEMET Thesaurus. URL http://www.eionet.europa.eu/gemet/rdf?langcode=en. [Online; accessed 13-August-2015].

[21] Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.

[22] GeoNames, a project of Unxos GmbH, Switzerland. GeoNames. URL http://download.geonames.org/export/dump/.

[23] Gerald Madlsperger and Carina Reiter. Project in Intelligent Information Systems: CrowdSA Pipeline, 2014.

[24] Gerald Madlsperger, Sebastian Pöll, and Carina Reiter. Evaluation of Social Media Search / Monitoring Tools, 2013.

[25] Google. Google Maps APIs, 2015. URL https://developers.google.com/maps/?hl=en. [Online; accessed 18-August-2015].

[26] Viet Ha-Thuc, Yelena Mejova, Christopher Harris, and Padmini Srinivasan. Event intensity tracking in weblog collections.

[27] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL http://doi.acm.org/10.1145/1656274.1656278.

[28] Alyssa Harding, Brian Ngo, Brian Steadman, and Nina Liong. Fintan: an algorithmic approach to news aggregation.

[29] W. H. Hsu, M. Abduljabbar, R. Osuga, M. Lu, and W. Elshamy. Visualization of clandestine labs from seizure reports: Thematic mapping and data mining research directions. In Proceedings of the 2nd European Workshop on Human-Computer Interaction and Information Retrieval, Nijmegen, The Netherlands, August 25, 2012, pages 43–46, 2012.

[30] Georgiana Ifrim, Bichen Shi, and Igor Brigadir. Event detection in twitter using aggressive filtering and hierarchical tweet clustering. In Proceedings of the SNOW 2014 Data Challenge co-located with 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April 8, 2014, pages 33–40, 2014. URL http://ceur-ws.org/Vol-1150/ifrim.pdf.

[31] Uni-Heidelberg Informatik. HeidelTime, Google Code. Online, October 2014. URL https://code.google.com/p/heideltime/.

[32] Y. Inoue, K. Tsuruoka, and M. Arikawa. Spatio-temporal story mapping animation based on structured causal relationships of historical events. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XL-4:101–103, 2014. doi: 10.5194/isprsarchives-XL-4-101-2014.

[33] Michael Jahn. Extraktion von Krisenrelevanten Inhalten aus Twitter mit ausgewählten Systemen, 2014.

[34] S. Jarvis and S.A. Crossley. Approaching Language Transfer through Text Classification: Explorations in the Detection-based Approach. Channel View Publications, 2012. ISBN 9781847697004. URL https://books.google.at/books?id=dyH0_DPdQV0C.

[35] David Jurgens and Keith Stevens. The s-space package: An open source package for word space models. In Proceedings of the ACL 2010 System Demonstrations, ACLDemos '10, pages 30–35, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858933.1858939.

[36] E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, and E. Toms. Information Access Evaluation – Multilinguality, Multimodality, and Interaction: 5th International Conference of the CLEF Initiative, CLEF 2014, Sheffield, UK, September 15-18, 2014, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, 2014. ISBN 9783319113821. URL https://books.google.at/books?id=z_JOBAAAQBAJ.

[37] Peter Kolb. Experiments on the difference between semantic similarity and relatedness. In Kristiina Jokinen and Eckhard Bick, editors, Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009, volume 4, pages 81–88. Northern European Association for Language Technology, 2009. URL http://dspace.utlib.ee/dspace/bitstream/10062/9731/1/paper37.pdf.

[38] Grzegorz Kondrak. N-gram similarity and distance. In Mariano P. Consens and Gonzalo Navarro, editors, 12th International Conference String Processing and Information Retrieval (SPIRE), volume 3772 of Lecture Notes in Computer Science, pages 115–126, Buenos Aires, Argentina, 2005. Springer, Berlin, Germany. URL http://www.cs.ualberta.ca/~kondrak/papers/spire05.ps.

[39] Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. Evaluation of multiple clustering solutions. In Emmanuel Müller, Stephan Günnemann, Ira Assent, and Thomas Seidl, editors, Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, Athens, Greece, September 5, 2011, in conjunction with ECML/PKDD 2011, volume 772 of CEUR Workshop Proceedings, pages 55–66. CEUR-WS.org, 2011. URL http://ceur-ws.org/Vol-772/multiclust2011-paper7.pdf.

[40] Jagadeesh Majji Kumar Vasantha. An efficient text clustering algorithm using affinity propagation. Indian Journal of Computer Science and Engineering (IJCSE), 2013.

[41] Topsy Labs. Search and analyze the social web, 2015. URL http://topsy.com/. [Online; accessed 18-August-2015].

[42] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, February 1966.

[43] James Llinas, Christopher Bowman, Galina Rogova, Alan Steinberg, Ed Waltz, and Frank White. Revisiting the JDL data fusion model II. In P. Svensson and J. Schubert, editors, Proceedings of the Seventh International Conference on Information Fusion (FUSION 2004), pages 1218–1230, 2004.

[44] A. M. MacEachren, A. Jaiswal, A. C. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, and J. Blanford. Senseplace2: Geotwitter analytics support for situational awareness. In Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, pages 181–190, 2011. doi: 10.1109/VAST.2011.6102456.

[45] A.M. MacEachren, A. Jaiswal, A.C. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, and J. Blanford. Senseplace2: Geotwitter analytics support for situational awareness. In Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, pages 181–190, 2011. doi: 10.1109/VAST.2011.6102456.

[46] David J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA, 2002. ISBN 0521642981.

[47] Gerald Madlsperger. Object Extraction and Evolution in Crowd Situation Awareness, 2015.

[48] Debanjan Mahata, John R. Talburt, and Vivek Kumar Singh. Identification and ranking of event-specific entity-centric informative content from twitter. In Chris Biemann, Siegfried Handschuh, André Freitas, Farid Meziane, and Elisabeth Métais, editors, Natural Language Processing and Information Systems, volume 9103 of Lecture Notes in Computer Science, pages 275–281. Springer International Publishing, 2015. ISBN 978-3-319-19580-3. doi: 10.1007/978-3-319-19581-0_24. URL http://dx.doi.org/10.1007/978-3-319-19581-0_24.

[49] Mingjie Qian, Department of Computer Science, University of Illinois at Urbana-Champaign. Java library for machine learning (jml), 2014. URL https://sites.google.com/site/qianmingjie/home/toolkits/jml. [Online; accessed 06-July-2015].

[50] Massachusetts Institute of Technology. SIMILE: Semantic interoperability of metadata and information in unlike environments, 2008. URL http://simile.mit.edu/wiki/Main_Page. [Online; accessed 28-July-2015].

[51] Andrei Olariu. Hierarchical clustering in improving microblog stream summarization. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 7817 of Lecture Notes in Computer Science, pages 424–435. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-37255-1. doi: 10.1007/978-3-642-37256-8_35. URL http://dx.doi.org/10.1007/978-3-642-37256-8_35.

[52] Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning Publications Co., Greenwich, CT, USA, 2011. ISBN 1935182684, 9781935182689.

[53] Yefei Peng, Daqing He, and Ming Mao. Geographic named entity disambiguation with automatic profile generation. In Web Intelligence, pages 522–525. IEEE Computer Society, 2006. ISBN 0-7695-2747-7. URL http://dblp.uni-trier.de/db/conf/webi/webi2006.html#PengHM06.

[54] Hemant Purohit and Amit Sheth. Twitris v3: From citizen sensing to analysis, coordination and action, 2013. URL https://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/6106.

[55] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL http://www.R-project.org.

[56] Ulf-Dietrich Reips and Pablo Garaizar. Mining twitter: A source for psychological wisdom of the crowds. Behavior Research Methods, 43(3):635–642, 2011. doi: 10.3758/s13428-011-0116-6. URL http://dx.doi.org/10.3758/s13428-011-0116-6.

[57] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979. ISBN 0408709294.

[58] J. Rogstadius, C. Teixeira, M. Vukovic, V. Kostakos, E. Karapanos, and J. Laredo. IBM Journal of Research and Development (2013), 2013.

[59] J. Rogstadius, M. Vukovic, C.A. Teixeira, V. Kostakos, E. Karapanos, and J.A. Laredo. Crisistracker: Crowdsourced social media curation for disaster awareness. IBM Journal of Research and Development, 57(5):4:1–4:13, Sept 2013. ISSN 0018-8646. doi: 10.1147/JRD.2013.2260692.

[60] Lior Rokach and Oded Maimon. Clustering methods. In Oded Maimon and Lior Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 321–352. Springer US, 2005. ISBN 978-0-387-24435-8. doi: 10.1007/0-387-25465-X_15. URL http://dx.doi.org/10.1007/0-387-25465-X_15.

[61] Eduardo J. Ruiz, Vagelis Hristidis, Carlos Castillo, and Aristides Gionis. Measuring and summarizing movement in microblog postings. In Emre Kiciman, Nicole B. Ellison, Bernie Hogan, Paul Resnick, and Ian Soboroff, editors, ICWSM. The AAAI Press, 2013. ISBN 978-1-57735-610-3.

[62] Andrea Salfinger, Werner Retschitzegger, Wieland Schwinger, and Birgit Pröll. crowdSA - Towards Adaptive and Situation-Driven Crowd-Sensing for Disaster Situation Awareness. In Proceedings of IEEE International Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA 2015), Orlando, USA, 03 2015.

[63] Jannik Strötgen and Michael Gertz. Heideltime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, pages 321–324, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1859664.1859735.

[64] Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, EMNLP '00, pages 63–70, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics. doi: 10.3115/1117794.1117802. URL http://dx.doi.org/10.3115/1117794.1117802.

[65] Carlos Vicient. Moving towards the Semantic Web: enabling new technologies through the semantic annotation of social contents. PhD thesis, Computer Engineering (Enginyeria Informàtica), Universitat Rovira i Virgili, 2015.

[66] Wikipedia. Haversine formula, 2014. URL http://en.wikipedia.org/wiki/Haversine_formula. [Online; accessed 05-October-2014].

[67] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, ACL '94, pages 133–138, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. doi: 10.3115/981732.981751. URL http://dx.doi.org/10.3115/981732.981751.

[68] Shasha Xie and Yang Liu. Using corpus and knowledge-based similarity measure in maximum marginal relevance for meeting summarization. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 4985–4988, March 2008. doi: 10.1109/ICASSP.2008.4518777.

[69] Yang Zhang and Qing Yang. Clustering users in twitter based on interests. In National Conference on Information Retrieval, Japan, 2011. URL http://www.nlpr.ia.ac.cn/2011papers/gnhy/nh4.pdf.

[70] F. Ye, H. Wang, S. Ouyang, X. Tang, Z. Li, and M. Prakash. Spatio-temporal analysis and visualization using sph for dam-break and flood disasters in a gis environment. In Geomatics for Integrated Water Resources Management (GIWRM), 2012 International Symposium on, pages 1–6, 2012. doi: 10.1109/GIWRM.2012.6349636.

[71] Eva Zangerle, Wolfgang Gassler, and Günther Specht. On the impact of text similarity functions on hashtag recommendations in microblogging environments. Social Network Analysis and Mining, 3(4):889–898, 2013. ISSN 1869-5450. doi: 10.1007/s13278-013-0108-x. URL http://dx.doi.org/10.1007/s13278-013-0108-x.

[72] Yi Zhang and Flora S. Tsai. Combining named entities and tags for novel sentence detection. In Proceedings of the WSDM '09 Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR '09, pages 30–34, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-430-0. doi: 10.1145/1506250.1506256. URL http://doi.acm.org/10.1145/1506250.1506256.

[73] Zhe Zhao, University of Michigan. Replication on affinity propagation: Clustering by passing messages between data points, 2014. URL http://www-personal.umich.edu/~zhezhao/papers/AffinityPropagation.pdf. [Online; accessed 03-March-2015].

Eidesstattliche Erklärung (Statutory Declaration)

I declare in lieu of an oath that I have written this master's thesis independently and without outside assistance, that I have used no sources or aids other than those indicated, and that all passages taken literally or in substance from other sources are marked as such. This master's thesis is identical to the electronically submitted text document.

Place and date                Carina Reiter

Curriculum Vitae

Carina Reiter
Born: 17 July 1990
Nationality: Austria

4020 Linz

Email: [email protected]

Professional Experience

08/2015 – present   Junior Data Warehouse Developer (full-time), bet-at-home.com Entertainment GmbH. MS SSIS/SSAS/SSRS; data warehouse.

09/2013 – 06/2014   Software Testing and Development (part-time), Emporia Telecom Produktions- und Vertriebs-GmbH & Co.KG. Black-box software testing of mobile phones; web development of a software test environment.

07/2013 – 09/2013   Software Development Internship, Borland Entwicklung GmbH (a Micro Focus Company). Design and development of a software test environment; SCRUM; Java and Web 2.0 development.

07/2012 – 08/2012   Software Development Internship, Borland Entwicklung GmbH (a Micro Focus Company). Design and development of a software test environment; SCRUM; Java and Web 2.0 development.

Education

03/2013 – present   Johannes Kepler Universität Linz. Dipl.-Ing., Computer Science (Intelligent Information Systems).

09/2014 – 12/2014   University of Jyväskylä, Finland. Information Technology, Erasmus exchange semester.

10/2009 – 02/2013   Johannes Kepler Universität. Bachelor of Science, Computer Science.

09/2004 – 07/2009   BHAK Perg. Matura (school-leaving examination), International Business.

Skills

Advanced: database development, database systems, HTML, Java, Linux, machine learning, MS Office, ontologies, SQL, web development, Windows operating systems.

Further skills: Business Intelligence, C#, C/C++, Data Mining, Data Warehouse, JavaScript, NoSQL, PHP, RapidMiner, Scrum.

Languages

German (native), English (fluent), Czech (good), French (basic), Russian (basic).

Interests

Travelling, photography and image editing, sports (running, yoga), cinema and the film industry.

Linz, 13 October 2015