SPOTHOT: Scalable Detection of Geo-spatial Events in Large Textual Streams Erich Schubert Michael Weiler Hans-Peter Kriegel Institut für Informatik Institut für Informatik Institut für Informatik LMU Munich LMU Munich LMU Munich Germany Germany Germany
[email protected]fi.lmu.de
[email protected]fi.lmu.de
[email protected]fi.lmu.de ABSTRACT 1. INTRODUCTION AND MOTIVATION The analysis of social media data poses several challenges: first of Social media such as Twitter produces a fast-flowing data stream all, the data sets are very large, secondly they change constantly, with thousands of new documents every minute, containing only a and third they are heterogeneous, consisting of text, images, geo- short fragment of text (up to 140 characters), some annotated en- graphic locations and social connections. In this article, we focus tities and links to other users as well as web sites, and sometimes on detecting events consisting of text and location information, and location information on where the message originated from. Many introduce an analysis method that is scalable both with respect to traditional analysis methods do not work well on such data: the in- volume and velocity. We also address the problems arising from formal language with many special abbreviations, emoji icons and differences in adoption of social media across cultures, languages, constantly changing portmanteau words prevents many advanced and countries in our event detection by efficient normalization. linguistic techniques from understanding the contents. Further- We introduce an algorithm capable of processing vast amounts of more, many methods in topic modeling need to visit data multiple data using a scalable online approach based on the SigniTrend event times to optimize the inferred structure, which is infeasible when detection system, which is able to identify unusual geo-textual pat- thousands of new documents arrive every minute.