Event Detection in High Throughput Social Media

Michael Weiler

München 2016


Dissertation at the Faculty of Mathematics, Computer Science and Statistics of the Ludwig-Maximilians-Universität München

Submitted by Michael Weiler from Munich

Munich, October 20, 2016
First reviewer: Prof. Dr. Hans-Peter Kriegel
Second reviewer: Prof. Dr. Michael Gertz
Oral examination: December 20, 2016

Declaration of Authorship (see the doctoral degree regulations of July 12, 2011, § 8, sec. 2, item 5)

I hereby declare under oath that I have written this dissertation independently and without unauthorized assistance.

Michael Weiler

Surname, first name

Place, date · Signature of the doctoral candidate

Contents

Abstract

Zusammenfassung (Abstract in German)

I Preface

1 Introduction

2 Thesis Overview and Contributions
   2.1 Trend Mining and Event Detection
   2.2 Geo-social Co-location Mining of Events

3 Preliminaries
   3.1 Hashing
   3.2 Bloom Filter
   3.3 Count–Min Sketch
   3.4 Zipf's Law
   3.5 Term Frequency–Inverse Document Frequency
   3.6 Stemming and Tokenization
   3.7 Outliers
   3.8 Trends, Events and Emerging Topics
   3.9 Gold Standard
   3.10 Twitter
   3.11 Time Series Data
      3.11.1 The Importance of Aggregation
      3.11.2 Normalization

4 Incorporated Publications and Co-authorship
   4.1 SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
   4.2 SpotHot: Scalable Detection of Geo-spatial Events in Large Textual Streams
   4.3 Scalable Detection of Emerging Topics and Geo-spatial Events in Large Textual Streams
   4.4 Outlier Detection and Trend Detection: Two Sides of the Same Coin
   4.5 Geo-social Co-location Mining
   4.6 Socio Textual Mapping
   4.7 TrendTracker: Modelling the Motion of Trends in Space and Time

II Trend Mining and Event Detection

5 Emerging Topic Detection
   5.1 Introduction
   5.2 Challenges
   5.3 Related Work
   5.4 Detection of Significant Topics
      5.4.1 Emerging and Trending Topics
      5.4.2 Emerging Topics are Outliers
      5.4.3 Trend Detection on Co-occurrences
      5.4.4 Early Detection of Trends
   5.5 Scalability by Hashing
      5.5.1 Trend Redundancy and Refinement

6 Geo-spatial Event Detection
   6.1 Motivation
   6.2 Requirements
   6.3 Challenges
   6.4 Problem Definition
   6.5 Related Work
   6.6 Symbolic Representation of Location
      6.6.1 Grid-based Token Generation
      6.6.2 Tokens based on Administrative Boundaries
   6.7 Incorporate Geographical Information
   6.8 Significance of Location-Events
   6.9 Updating the Moving Averages
   6.10 Significance Computation
   6.11 Hash Table Update Process

7 Data Sets
   7.1 News Articles
   7.2 Stack Overflow
   7.3 Twitter
      7.3.1 Geographic Distribution

8 Experiments
   8.1 Manual Analysis of Real World Trends
   8.2 Scalability
      8.2.1 Using only Text Data
      8.2.2 Incorporating Location Data
      8.2.3 Distributed Computing
      8.2.4 Hash Table Size
   8.3 Comparison with Exact Frequencies
   8.4 Most Significant Regional Events
   8.5 New Year's Eve
   8.6 WikiTimes Events
   8.7 Earthquakes
   8.8 Online Demonstration

9 Relationship of Trends and Outliers
   9.1 Introduction
   9.2 Traditional Outlier Detection
      9.2.1 Outlier Detection in Euclidean Space
      9.2.2 Specialized Outlier Detection
      9.2.3 Outlier Detection in Data Streams
      9.2.4 Generalization of Outlier Detection
   9.3 Limitations of Traditional Outlier Detection
      9.3.1 Example: KDD Cup '99
      9.3.2 Example: United States Census Data
      9.3.3 Example: Traffic Accidents
      9.3.4 Observations
      9.3.5 Bridge the Gap
   9.4 Relationship between Trends and Outliers

10 Conclusion

III Geo-social Co-location Mining of Events

11 Trend Patterns and Dissemination Prediction
   11.1 Introduction
   11.2 Preliminaries
   11.3 Spatio-Temporal Trend Dissemination Rule Mining
      11.3.1 Space Decomposition Scheme
      11.3.2 Trend Flow Modeling
      11.3.3 Trend Flow Mining
      11.3.4 Trend Archetype Clustering
      11.3.5 Trend Archetype Flow Modelling
   11.4 Related Work and Discussion
   11.5 Experimental Evaluation
      11.5.1 Parameters and Data Set
      11.5.2 Evaluation of Trend Archetypes
      11.5.3 Evaluation of Approximation Quality
      11.5.4 Evaluation of Algorithmic Runtime
   11.6 Conclusions

12 Event Co-location Mining
   12.1 Introduction
   12.2 Problem Definition
   12.3 Related Work
   12.4 Probabilistic Frequent Co-Location Mining
      12.4.1 Occurrence Probability Estimation
      12.4.2 Transformation to Probabilistic Frequent Itemset Mining
   12.5 Experiments
   12.6 Conclusions

13 Socio Textual Mapping of Events (Vision)
   13.1 Introduction
   13.2 Overview
   13.3 Socio Textual Maps
   13.4 Theoretic Foundation
   13.5 Proof of Concept
      13.5.1 Feature Selection and Transformation
      13.5.2 Clustering
   13.6 Challenges

IV Conclusion

Acknowledgements

Bibliography

List of Figures

1.1 Data never Sleeps
1.2 Event detection as preprocessing filter step

3.1 Example usage of a Bloom filter
3.2 Example usage of a Count-Min Sketch
3.3 Term Frequency Distribution within Twitter
3.4 Results for LOF on unemployment rates of Bavaria
3.5 Hype Cycle for Emerging Technologies
3.6 ECG anomaly detection from Mateo et al. [131]
3.7 Aggregation of one dimensional time series
3.8 Twitter usage varies due to day-night cycles of Twitter users

5.1 Visualization of selected word occurrences for the Boston Marathon bombing on news data

6.1 Grid tokens for a location in Washington, DC
6.2 Pigeonhole principle with 2-dimensional grids
6.3 Reverse Geocoding with OpenStreetMap Data
6.4 Tweet tokenization and generation of geo tokens of an example tweet
6.5 Hash table maintenance

8.1 Even Justin Bieber can trend on Twitter – referring to a Bieber look-alike kissing a man
8.2 Google Trends chart for some of the top 10 Events in 2013
8.3 Scalability of GeoScope and our method (SpotHot)
8.4 Performance with varying hash table size ℓ, k = 4
8.5 Recall and precision compared to exact counting
8.6 New Year around the world at σ ≥ 3
8.7 Frequency of the term “earthquake” globally vs. locally
8.8 Web app overview of year summary
8.9 Web app detail for the Malaysia airplane crash

9.1 “Pizza” seems to be particularly popular on weekends
9.2 Statistical model used by SigniTrend

11.1 Distribution of trend “MH17”
11.2 Distribution of trend “PokémonGo!”
11.3 A spatio-temporal trend dissemination example
11.4 A k-d tree based space decomposition
11.5 Trend flow modelling
11.6 Trend Flow Modelling – Tensor Decomposition
11.7 Dissemination of trends “MH370” and “Ferguson”
11.8 Approximation fit of factorized tensor
11.9 Fit over tree cells for varying latent features
11.10 Fit over trends for varying latent features
11.11 Runtime over tree cells for varying latent features
11.12 Runtime over trends for varying latent features

12.1 Spatial co-location mining in uncertain spatio-temporal data
12.2 Workflow of probabilistic spatial co-location mining
12.3 Grid based spatial index for efficient nearest neighbor queries
12.4 Example of an uncertain co-location database where users are co-located at a certain location (e.g., restaurant)
12.5 Calculation runtime comparison between enumeration and an estimation of support probability

13.1 Searching in collections of multi-represented users
13.2 Visual clustering results

List of Tables

3.1 Example mapping of a simple hash function for strings
3.2 Term frequency statistics for one typical week of English tweets in our data set

7.1 Geographic Distribution of Twitter Data (September 10, 2014 to February 19, 2015)
7.2 Data set statistics after stop word removal
7.3 Vocabulary statistics after stop word removal

8.1 Excerpt of top 50 trends on news data set 2013 (dominated by economy, sports and politics)
8.2 Top 40 trends on Twitter data
8.3 Top 10 “Most Searched News Stories” from Microsoft Bing in 2013
8.4 Terms related to names of football players during the FIFA World Cup Football match Germany vs. Brazil in 2014
8.5 Top Trends during the FIFA World Cup Football match Germany vs. Brazil in 2014
8.6 Recall and precision compared to exact counting
8.7 Most Significant Events in their Most Significant Location Each
8.8 Events that have been detected at the same time as the reported Earthquakes
8.9 Events Validated via WikiTimes [184]
8.10 Significant Earthquakes that exhibit a minimum of 10 tweets within 1 hour of the actual time

11.1 Trend Archetypes of 2014

Abstract

Social media platforms like Twitter or Facebook are a popular source for live textual data. Together with daily articles from newspapers and web blogs they yield an overwhelming amount of data. This flood of information makes it infeasible for individuals to keep an overview beyond a certain level of popularity (like the global top stories of popular newspapers).

By detecting emerging topics, data could be summarised by its corresponding series of events, which makes it easier for users to understand what actually happens regarding multiple topics at the same time. But analysing such data is challenging: Algorithms need to handle very large, rapidly changing and fast arriving data sets that are heterogeneous, consisting of text, geographic locations and images. Due to the growing need for summarisation and abstraction combined with the increasing availability of public textual data sources, the field of emerging topic detection has recently gained a lot of traction in research as well as in industry. However, existing methods are often limited in what they track: They require a user to specify a set of keywords in advance, operate only on a limited subset like hashtags, or are only able to detect events of global magnitude such as the NFL Super Bowl or natural disasters.

In this thesis, we provide contributions to the field of emerging topic detection with respect to text as well as geo-spatial data. As a first contribution we introduce a method capable of processing vast amounts of data using an efficient online algorithm, which is nevertheless able to identify unusual textual patterns in the data stream without requiring the user to specify any constraints in advance (such as hashtags). To this end, we propose a significance measure that can be used to detect emerging topics early, long before they become “hot tags”, by drawing upon experience from Outlier Detection. By using hash tables in a heavy-hitters type algorithm for establishing a noise baseline, we show how to track even all word pairs using only a fixed amount of memory. By comparing the frequencies of newly observed values to statistics from earlier data, we can immediately report significant deviations with minimal delay, which makes our system capable of reporting “Breaking News” in real time as it happens in social media.

As a second contribution, we address the unequally distributed adoption of social media, which leads to distorted events when not considered carefully (e.g., in our 1-year Twitter sample, London holds a 0.67% share, which is more than twice as large as all of Germany's 0.31% share). By using geo-enriched textual data we estimate each geographic region's local density, which is then used for normalization. Location is approximated both using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels at the same time.

In addition to these main contributions we discuss the similarities of event detection and Outlier Detection. Further, we show how complex algorithms, such as the classification of trends based on their dissemination patterns or geo-spatial co-location mining, can benefit from Event Detection as a preprocessing step or become feasible at all.
Furthermore, we demonstrate the accuracy of our methods against a controlled data flow with artificially injected events, and we illustrate their effectiveness with real-world examples such as the reporting of significant earthquakes, the detection of names of football players during famous soccer matches, and the study of New Year celebrations around the world.

Zusammenfassung

Social media platforms like Twitter or Facebook are a popular source of real-time textual data. Together with articles published daily by newspapers and web blogs, they form an overwhelming amount of data. This flood makes it impossible for people to process all of this information beyond the popular headlines.

Through event detection, this data could be summarised into trending topics and thus be grasped more easily by users. But analysing such data is a challenge: algorithms must cope with large amounts of data and be able to adapt quickly. Records arrive at a high frequency and are extremely heterogeneous, as they consist of text, geographic locations, and images. Due to the growing demand for methods of summarisation and abstraction, paired with the increasing availability of public textual data sources, trend and event detection has received increased attention, from research as well as from industry.

Existing methods, however, are often severely limited in their functionality: they only work with keywords specified by the user, analyse only a small subset (such as hashtags), or are only able to detect global events (such as the NFL Super Bowl or natural disasters).

In this thesis, we contribute our research to the field of event detection. As a first contribution, we present a method that is able to process large amounts of data. This is achieved by an efficient incremental algorithm that identifies unusual textual patterns in the data stream. The user does not need to specify any restrictions on the data (such as hashtags). To this end, we present a significance measure for detecting events before they become headlines. With methods from the research field of outlier detection and with hash tables similar to the heavy-hitters data structure, we design an algorithm that can efficiently process all words, word pairs, and location data. By comparing newly observed values with statistics of earlier data, we can immediately report significant deviations with minimal delay. This turns our system into a fully automatic social media reporting service operating in real time.

As a second contribution, we address the unequally distributed usage of social media, which leads to distorted results if it is not examined carefully. For example, London contributes a share of 0.67% to the total Twitter volume; this is more than twice the share of Germany, with merely 0.31%. By using the texts enriched with location data, we can estimate the local density of every geographic region. This enables us to normalize these inequalities and to detect locally significant events.

In addition to these two main contributions, we discuss the similarities between event detection and outlier detection. Furthermore, we show how complex algorithms can benefit from event detection as a preprocessing step, or even only become feasible through it. Detected events can, for example, be classified by their geographic dissemination patterns in order to predict their further spread.
Moreover, we show how we find social groups that frequently visit the same locations, and how we use events to create a social map that is grouped into similar topic areas. Beyond that, we show the correctness of our methods on a data set with artificially injected events and demonstrate the effectiveness of our algorithm on real-world examples such as the detection of earthquakes, or through the detailed analysis of individual major events such as the FIFA World Cup or New Year's Eve.

Part I

Preface

Chapter 1

Introduction

Live textual data from social media, blogs and daily news produces an overwhelming amount of data, and many reports exist about this ever-increasing volume. In Figure 1.1 you can see how much data was generated on average in the year 2016: every minute, Twitter users send about 9,678 tweets containing emojis, and 159,380 pieces of content are created at BuzzFeed¹. Social networks and microblogging services such as Twitter have gained a lot of interest due to the large amount of data published every second by their users. Furthermore, there are a number of global events – such as the Arab Spring – where Twitter is reported to have had a major influence [80]. But also news agencies like Reuters and Bloomberg

(a) Data Never Sleeps 4.0 (b) What happens in one Internet Minute

Figure 1.1: Blog posts from Domo [1] and Excelacom [2] showing how much data is generated per minute in the year 2016.

¹ https://www.buzzfeed.com

publish thousands of articles daily, covering a wide range of topics. This flood of information makes it infeasible for individuals to keep an overview beyond a certain level of popularity, like the global top stories of popular newspapers. This information explosion calls for new tools and approaches to process this amount of data, as a single user can no longer read all the information available. Instead, automatic filters are used everywhere: email servers reject the two thirds of our email traffic deemed spam, and the Facebook news feed is also heavily filtered. Facebook reported that the average user would have 1,500 potential stories each day, out of which around 300 are automatically prioritized and shown by Facebook; yet most users only read about half of these on average.² In addition, some hard-coded filters are used to, e.g., merge all birthday wishes into a single story to reduce clutter.

By detecting emerging topics, data could be summarised by its corresponding series of events, which makes it easier for users to understand what actually happens regarding multiple topics at the same time. But analysing such data is challenging: Algorithms need to handle very large, rapidly changing and fast arriving data sets that are heterogeneous, consisting of text, geographic locations and images. Due to the growing need for summarization and abstraction combined with the increasing availability of public textual data sources, the field of emerging topic detection has recently gained a lot of traction in research as well as in industry.

However, existing methods for emerging topic detection are often only able to detect events of a global magnitude such as the NFL Super Bowl, natural disasters or celebrity deaths. Furthermore, existing methods are often limited in what they track: They require a user to specify a set of keywords in advance or operate only on a limited subset like hashtags. Interesting emerging topics may, however, be of much smaller magnitude and may involve the combination of two or more words that themselves are not unusually hot at that time. In Chapter 9 we will discuss how this Event Detection process is closely related to the field of Outlier Detection. We expand a generalized framework for Outlier Detection proposed by Schubert et al. [170] to also cover trend detection and emerging topic detection.

With the event detection method proposed in Part II we are able to track all words, all locations and even all pairs of co-occurring words simultaneously. This means that we do not need to restrict the set of words or locations beforehand, and we are thus able to use Event Detection as a preprocessing step. Algorithms that would not be feasible on the raw input data can instead work solely on the remaining outliers (trends and events) and thus be much more efficient, or become feasible at all. For best flexibility of our system design (such as the one shown in Figure 1.2), we need to make sure that we minimize the probability of discarding information that could be potentially important to an algorithm further down our processing pipeline. For example,

² http://www.facebook.com/business/news/News-Feed-FYI-A-Window-Into-News-Feed

a sentiment analysis algorithm would perform badly if its input data consisted solely of popular hashtags, as most sentiment analysis algorithms rely on emoticons to classify sentiments correctly [97]. As we do not suffer from such restrictions, we are able to use our Event Detection as a decoupled microservice in a larger distributed system and enable a large variety of algorithms to work efficiently on the remaining event candidates. In Part III we will discuss in detail further algorithms that use events as input to

• classify them based on their spatial dissemination features (see Chapter 11),

• build a Socio Textual map (see Chapter 13) that reveals groups with similar interests and thoughts, and

• use co-location mining of events (see Chapter 12) to identify social groups that are frequently found at the same location.

[Figure 1.2: the processing pipeline – Input (millions of observations) → Processing (scalable event detection such as SigniTrend for the efficient detection of event candidates, using a significance score based on EWMA and EWMVar statistics) → Refinement (detection of false positives) → Analysis (complex algorithms such as clustering, topic modeling and classification on only a few remaining items).]

Figure 1.2: Event detection as preprocessing filter step for further analysis and expensive algorithms.

Chapter 2

Thesis Overview and Contributions

This chapter gives an overview of this thesis and highlights the contributions of the publications included in it. A more detailed discussion of the individual contributions can be found in the chapters covering the corresponding publications.

2.1 Trend Mining and Event Detection

In Part II we will introduce the publications “SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds” [163], “SpotHot: Scalable Detection of Geo-spatial Events in Large Textual Streams” [166], “Scalable Detection of Emerging Topics and Geo-spatial Events in Large Textual Streams” [165] and “Outlier Detection and Trend Detection: Two Sides of the Same Coin” [167]. Our contributions to the field of emerging topics and event detection are as follows.

Significance Measure: We propose a significance measure that can be used to detect emerging topics early, long before they become “hot topics”, by drawing upon experience from Outlier Detection.

Probabilistic Counting: By using hash tables in a heavy-hitters type algorithm for establishing a noise baseline, we show how to track even all word pairs using only a fixed amount of memory.

Aggregation into Topics: We aggregate the detected co-trends into larger topics using clustering approaches, as a single event will often cause multiple word combinations to trend at the same time.

Geo Normalization and Location Modeling: We also address the problems arising from differences in the adoption of social media across cultures, languages, and countries by efficient normalization. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at city, regional, and global levels at the same time.

Relationship to Outlier Detection: To bridge the gap between events and outliers, we show their relationship and propose an extension of a generalized framework for Outlier Detection to also cover emerging topic detection.

2.2 Geo-social Co-location Mining of Events

In Part III we introduce advanced event analytic methods that use our Trend Mining and Event Detection from Part II as a “filtering” step. Thus we are left with only a small fraction of words that now represent topics and events. In the following, we describe our contributions based on the publications “Geo-social Co-location Mining” [193], “Socio Textual Mapping” [194] and “TrendTracker: Modelling the Motion of Trends in Space and Time” [161].

Trend Archetypes: We postulate and verify the hypothesis that trends follow different archetypes, which differ strongly in terms of their dissemination patterns. Using a clustering approach, we identify these archetypes and are thus able to classify newly discovered trends based on their corresponding archetype features.

Geo-social Co-location Mining: For the problem of finding social groups that are frequently found at the same location, we propose a probabilistic model. With this model we are able to estimate the probability of a user being located at a given location at a given time. We extend solutions for probabilistic frequent itemset mining to be able to process large amounts of data that have a high degree of uncertainty.

Geo-social Map: We define a dissimilarity measure and utilize a metric clustering approach to obtain a social map of regions with similar events. We formalized this idea as a vision towards a semantic clustering approach and provide an initial proof of concept.

Chapter 3

Preliminaries

This chapter gives a brief introduction to basic concepts that are used in this thesis, such as hashing, Bloom filters, and natural language processing methods. It is meant to be neither a comprehensive nor a complete coverage of these topics; its purpose lies solely in making the reader familiar with the relevant methods and concepts. When discussing the preliminaries we will focus strongly on their actual use cases and benefits regarding the challenges in this thesis.

3.1 Hashing

As we will see later in Part II, hashing forms the core of our algorithms to speed up the trend and event detection computations.

Hash Function: A hash function maps data from a large domain of arbitrary size to a smaller domain of fixed size. A simple hash function for strings (text) could map a string to its first character. As shown in Table 3.1, such a function would produce the following mappings for the strings “foo”, “bar” and “baz”.

Hash Key    Corresponding Value
f           foo
b           bar, baz

Table 3.1: Example mapping of a simple hash function for strings.

Hash functions are applied in a variety of use cases, such as quick lookups in hash table data structures (also called “dictionaries”) or within the field of cryptography. A real-world example of a hash function can be found in the implementation of the method hashCode() from Java (see the java.lang.String implementation in Java SE¹). For a given string s, the corresponding hash function is defined as follows:

$$h(s) = \sum_{i=0}^{n-1} s[i] \cdot 31^{n-1-i}$$

where s[i] denotes the ordinal value of the i-th character and n the total number of characters of the string s.

Hash Collisions: As shown in Table 3.1, different input values can produce the same hash value. Most applications need to handle these cases of collisions carefully so as not to produce incorrect results. As we will see later in Part II, we design our trend and event detection algorithms with collisions in mind. When tracking statistics of current words from high-throughput text data streams, we work with those collisions: we accept that they happen frequently, and we design our calculations with additional strategies that guarantee a bounded error with upper and lower limits. Additionally, we will see how we can ensure that the error decreases as the occurrence frequency of words increases. This means that we can track events (which, by definition, are not rare) with minimal differences to their actual values.

¹ http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Object.html#hashCode
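To make the polynomial definition above concrete, the following snippet (a minimal illustrative sketch, not code from this thesis) evaluates it via Horner's rule; for any string it returns the same value as Java's built-in s.hashCode():

```java
// Computes h(s) = s[0]*31^(n-1) + ... + s[n-1]*31^0 via Horner's rule;
// equivalent to java.lang.String#hashCode().
class StringHash {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```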

3.2 Bloom Filter

A Bloom filter [38] is a data structure to test set membership of elements very efficiently regarding both memory space and computation time. This performance comes at a price: Bloom filter membership tests can only be executed in a probabilistic manner, so that false positive matches are possible. This means that a Bloom filter can answer a membership query either with “No!” or “Maybe?!”. Furthermore, a Bloom filter does not actually store the elements; thus, one cannot retrieve all “inserted” elements.

A Bloom filter consists of a bit array of b elements (also called buckets), which are initially set to 0. In the following, we call this parameter b the bucket size of our Bloom filter. When elements are added, k different hash values H = {h1, ..., hk} are calculated, where each hi represents a hash function that maps elements from the input domain to an index in [0; b[ to address one of the Bloom filter's buckets. For answering membership queries we simply calculate the k hash indices of the query element. If we then find a 0 among the buckets we can respond with “false”, as the query element can definitely not be in our set (because inserting it previously would have flipped all those bits to 1).

In Figure 3.1 we can see a small Bloom filter example with k = 2 hash functions and a bucket size of b = 8. When inserting words like “Apple”, “Banana” and “Cherry” we flip all bits corresponding to their hash indices to 1. A query for “Blackberry” would result in a true negative membership test, since at least one of its hashes does not collide with any other hash (here only one hash collides with one hash of the word “Banana”). However, a membership query for the word

“Blueberry” would produce a false positive match, since all of its hashes collide with at least one of the hashes (here with “Apple” and “Cherry”).


Figure 3.1: Example usage of a Bloom filter with k = 2 hash functions.
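The following minimal sketch illustrates the add/query logic described above. It is an illustration under assumptions, not the thesis implementation: the k hash functions are derived from Java's hashCode() via double hashing instead of being independent hash functions.

```java
import java.util.BitSet;

// Minimal Bloom filter: b buckets, k hash functions.
class BloomFilter {
    private final BitSet bits;
    private final int b, k;

    BloomFilter(int buckets, int hashes) {
        b = buckets;
        k = hashes;
        bits = new BitSet(buckets);
    }

    // i-th hash index via double hashing: h1 + i*h2 (mod b).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1; // force odd so the probe sequence varies
        return Math.floorMod(h1 + i * h2, b);
    }

    void add(String s) {
        for (int i = 0; i < k; i++) bits.set(index(s, i));
    }

    // false = "definitely not contained", true = "maybe contained".
    boolean mightContain(String s) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(s, i))) return false;
        return true;
    }
}
```

A filter with b = 8 buckets and k = 2 hash functions, as in Figure 3.1, would be created via new BloomFilter(8, 2).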

3.3 Count–Min Sketch

The Count-Min sketch, introduced in 2003 by Cormode and Muthukrishnan [58], is a data structure that stores the frequencies of observations. Similar to the Bloom filter, the counting of the frequencies is done in a probabilistic manner for maximum efficiency. This means that unlike an associative array (such as the Java HashMap²), where each key is mapped to a certain value³, the Count-Min sketch uses hash functions to determine the corresponding buckets and updates the frequencies in those buckets. As this can cause collisions, the Count-Min sketch may overestimate certain input values.

In Figure 3.2 an example Count-Min sketch is illustrated with k = 2 hash functions and a bucket size of b = 8 buckets, similar to the Bloom filter discussed in Section 3.2. When the words “Apple”, “Banana” and “Cherry” are inserted, the two observations of “Apple” and “Cherry” override the single observation of “Banana” at the penultimate bucket. Due to such collisions, a frequency lookup can result in an overestimation: for instance, a lookup for the word “Banana” results in an overestimated frequency of 2 (instead of its real frequency of 1), since 2 is the minimum value over all hash buckets colliding with “Apple” (frequency 3) and “Cherry” (frequency 2). False positives can still be produced, as in the example of the lookup for “Blueberry”: as this word was never inserted, the frequency should be 0; instead we see an overestimation of 2 (the minimum of 3 from “Apple” and 2 from “Cherry”).

² https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
³ With a specific strategy to handle hash collisions (such as buckets with associated lists of overflow entries).


Figure 3.2: Example usage of a Count-Min sketch with k = 2 hash functions. Frequencies can be overestimated and false positives are still possible.
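For comparison, here is a sketch of the textbook Count-Min layout with one counter row per hash function. Note that this differs slightly from the single shared array drawn in Figure 3.2, and the seeded hash functions are an illustrative assumption; the estimate is always an upper bound on the true frequency.

```java
import java.util.Random;

// Count-Min sketch: k rows of b counters each.
class CountMinSketch {
    private final long[][] counts;
    private final int[] seeds;
    private final int b;

    CountMinSketch(int buckets, int hashes) {
        b = buckets;
        counts = new long[hashes][buckets];
        seeds = new int[hashes];
        Random rnd = new Random(42);
        for (int i = 0; i < hashes; i++) seeds[i] = rnd.nextInt();
    }

    private int index(String s, int row) {
        return Math.floorMod(s.hashCode() ^ seeds[row], b);
    }

    // Record one observation of s.
    void add(String s) {
        for (int row = 0; row < counts.length; row++)
            counts[row][index(s, row)]++;
    }

    // Minimum over all rows: never underestimates the true frequency.
    long estimate(String s) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counts.length; row++)
            min = Math.min(min, counts[row][index(s, row)]);
        return min;
    }
}
```

After three add("Apple"), one add("Banana") and two add("Cherry") calls, estimate("Banana") can only return the true count of 1 or, due to collisions, a larger value — never less.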

3.4 Zipf’s Law

Zipf's law is an empirical formulation of the observation that many types of data in the field of social sciences can be approximated with a Zipfian distribution. For N elements, Zipf's law predicts the frequency of the element with rank k as follows:

$$f(k, s, N) = \frac{1/k^s}{\sum_{n=1}^{N} (1/n^s)}$$

where s denotes the value of the exponent that characterizes the distribution. In natural languages like English or German, word frequencies have a long-tailed distribution. Such word frequency distributions can be modeled by a Zipf distribution with s ≈ 1 [122]. This means that there are only a few words with very high frequencies and a lot of words that appear rarely. On Twitter this holds in particular, as there are a lot of spelling errors and abbreviations. Additionally, users often include other user names (the so-called “user mentions”) within their tweets to talk about other users. All this leads to a large amount of distinct vocabulary, because there are many ways in which words can be misspelled. Likewise, the majority of user names will appear rarely in other tweets (except the names of some popular users).
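A small helper makes the formula tangible (a hypothetical illustration; this function is not used elsewhere in this thesis):

```java
class Zipf {
    // Predicted relative frequency of the rank-k element (exponent s, N elements).
    static double frequency(int k, double s, int N) {
        double norm = 0;
        for (int n = 1; n <= N; n++) norm += 1.0 / Math.pow(n, s);
        return (1.0 / Math.pow(k, s)) / norm;
    }
}
```

For s = 1 and N = 10 this yields frequency(1, 1, 10) ≈ 0.34 and frequency(2, 1, 10) ≈ 0.17: the second most frequent element is expected to occur half as often as the most frequent one.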

In Figure 3.3 we show the frequency distribution of a typical week of tweets.

Figure 3.3: Zipf-like long-tail distribution of term frequencies for one typical week of our tweet data set. X-axis: the occurrence frequency. Y-axis: percentiles (p99.925 to p99.999) with their corresponding words.

Table 3.2: Term frequency statistics for one typical week of English tweets in our data set.

Description                               Value
Number of terms                           127,976,527
Number of distinct terms (vocabulary)     6,452,098
Number of terms that appeared only once   4,978,292
Maximum term frequency                    3,774,815
Standard deviation of term frequency      3,625
Mean term frequency                       19.8
Median term frequency                     1

To extract word tokens we used a standard Lucene tokenizer (see Section 3.6 for more details on tokenization). As can be seen, only a few terms – mostly stop words with low information content – such as “lol”, “night” or “ain't” account for the high frequencies within the top percentiles. The row “lol p99.999”, for example, shows that the term “lol” is ranked at position 99.999% in a list of words sorted ascendingly by their occurrence frequency. For visualization reasons we excluded the most frequent term “rt” (which would correspond to percentile p = 100%). The term “rt” is especially common on Twitter, as it is the Twitter-specific abbreviation denoting a re-tweet. The long tail of this distribution is built by words with low frequencies; even within our small excerpt of the top percentiles (99.925% to 99.999%), the frequencies drop very fast. Further statistics can be obtained from Table 3.2: from the total number of 127,976,527 observed terms, a vocabulary consisting of 6,452,098 distinct terms was built. A large fraction of ≈77% of this vocabulary appeared only a single time within this one week.

3.5 Term Frequency–Inverse Document Frequency

The term frequency–inverse document frequency, or short tf-idf, is a concept from information retrieval to measure how important a word is to a document. Tf-idf is based on the heuristic that rare words contribute more information to a document in which they appear often. As a consequence, frequent words like “the”, “a” and “this” will get low scores in typical English documents, as they are frequently used in almost all documents. For a given set of documents D, the tf-idf score for a term t and a document d ∈ D is given by the product of the term frequency tf and the inverse document frequency idf:

$$\mathrm{tfidf}_{t,d,D} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t,D}$$

The term frequency tf_{t,d} measures how important a term t is for a document d ∈ D, whereas the inverse document frequency idf_{t,D} measures how important a term is within the complete document corpus D. A simple implementation of tf_{t,d} uses the raw number of occurrences of term t within document d. Let N = |D| denote the number of documents in our document corpus D. The idf score of term t in corpus D is then defined as follows:

$$\mathrm{idf}_{t,D} = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where |{d ∈ D : t ∈ d}| denotes the number of all documents in which the term t appears. Many other variants of both tf and idf have been established since tf-idf was introduced by Luhn [127] in the year 1957. For the scope of this thesis, the aforementioned definitions are sufficient; for more details and theoretical arguments regarding this topic see [153] and [159]. As we will discuss in Section 5.4, we cannot use tf-idf directly, because tf-idf was designed to quantify the similarity of two documents. Instead, our Event Detection system can benefit from such term weighting after words have been detected as part of an event.
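The following sketch computes the score exactly as defined above, with raw counts as the term frequency. It is a minimal illustration (assuming documents are given as term-to-count maps), not the thesis implementation:

```java
import java.util.List;
import java.util.Map;

class TfIdf {
    // tf-idf of term t in document d, relative to the corpus docs.
    static double tfidf(String t, Map<String, Integer> d, List<Map<String, Integer>> docs) {
        int tf = d.getOrDefault(t, 0);                                      // raw term frequency
        long df = docs.stream().filter(doc -> doc.containsKey(t)).count(); // |{d in D : t in d}|
        if (tf == 0 || df == 0) return 0.0;
        double idf = Math.log((double) docs.size() / df);                   // log(N / df)
        return tf * idf;
    }
}
```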

3.6 Stemming and Tokenization

Tokenization is the process of splitting an input text into chunks that represent meaningful entities such as words and punctuation. Throughout this thesis we use the notions “word”, “term” and “token” interchangeably. To produce a list of tokens we make use of the method tokenStream() of the StandardAnalyzer class⁴, which is part of the open-source information retrieval software library Apache Lucene. The following Java code snippet shows an example usage:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Tokenize an example string with Lucene's StandardAnalyzer.
TokenStream stream = new StandardAnalyzer().tokenStream(null, new StringReader("hello, world!"));
stream.reset();
while (stream.incrementToken()) {
    String token = stream.getAttribute(CharTermAttribute.class).toString();
}
```

A commonly used preprocessing step is the reduction of tokens to their stem (or root) form (e.g., “running” and “runs” are each mapped to their base form “run”). In information retrieval this process is called stemming. The stem itself does not need to be a valid root (e.g., “argue”, “argued” and “argues” can all be reduced to “argu”). In Part II we will apply stemming to generate more stable statistics while reducing the size of our vocabulary. For the preprocessing step in our Event Detection we make use of the Xapian search engine library⁵, which provides language-specific stemming methods.

⁴ https://lucene.apache.org/core/5_2_0/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html
⁵ Open source, http://xapian.org/
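As a small illustration of stemming, the following sketch uses Lucene's EnglishAnalyzer, which applies Porter stemming — an assumption made for this example; our actual pipeline uses the Xapian stemmers mentioned above:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Prints the Porter-stemmed forms: run, run, argu
TokenStream ts = new EnglishAnalyzer().tokenStream(null, new StringReader("running runs argued"));
ts.reset();
while (ts.incrementToken()) {
    System.out.println(ts.getAttribute(CharTermAttribute.class).toString());
}
```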

3.7 Outliers

Intuitively, outliers can be described as elements that are unusual observations within a data set. Depending on the use case, outliers can be the objects of interest (e.g., detecting credit card fraud among transactions). In other scenarios, outliers are treated as noise that gets removed from a data set in a preprocessing phase (e.g., measurement errors or experimental errors). Outliers are hard – if not impossible – to define mathematically, because we cannot expect them to follow a model or distribution known beforehand. Instead, most attempts at defining outliers focus on them being rare observations, markedly different from the remainder of the data, such as the well-known definitions by Barnett and Lewis and by Grubbs:

“an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data”
— Vic Barnett and Toby Lewis [32]

“An outlying observation, or ‘outlier,’ is one that appears to deviate markedly from other members of the sample in which it occurs.”
— Frank E. Grubbs [75]

Yet, in each definition, the notions of “inconsistency” and “remainder” remain vague. Various algorithms have been proposed that try to detect outliers in a way consistent with our intuition. Notable Outlier Detection algorithms include DB-Outlier [94], which reports the objects of lowest density as outliers, and the local outlier factor (LOF) [39], which uses a local neighborhood as the “remainder” of the data set, instead of comparing to the complete data set every time. The basic idea is to assign a local density estimate (local reachability density, lrd) to each object of the database. Then, LOF considers ratios between the lrd of an object and the lrds of its neighboring objects. Thus, the resulting outlier score is based on a local comparison rather than on a global comparison: the local density of a point gets compared with the densities of its neighbors, so that an outlying data point “X” shows a much lower density than its neighbors “A”, “B” and “C”.

Schubert et al. [170] propose a generalized framework for Outlier Detection that is applicable beyond the domain of vector spaces, and they show its applicability to graph and video data. In Figure 3.4 you can see the result of their Outlier Detection method applied to unemployment rates in Germany: cities are detected as outliers, as they usually show significantly higher unemployment rates than their sparsely populated neighborhoods. In Chapter 9 we will discuss how the field of Outlier Detection is closely related to the field of event detection, as both try to find unusual, rare and abnormal observations.

Figure 3.4: Results for LOF on unemployment rates of Bavaria (Germany). Cities show up as outliers, as they usually have a significantly higher unemployment rate than the sparsely populated surrounding areas. (Figure obtained from [170])
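To make the lrd-ratio idea concrete, here is a naive O(n²) sketch of LOF for one-dimensional data — a simplified textbook version of the algorithm by Breunig et al. [39], not the implementation used in [170]:

```java
import java.util.Arrays;
import java.util.Comparator;

class NaiveLOF {
    // LOF scores for 1-d data: ~1 for inliers, substantially > 1 for outliers.
    static double[] lof(double[] x, int k) {
        int n = x.length;
        int[][] nn = new int[n][k];     // k nearest neighbors of each point
        double[] kdist = new double[n]; // distance to the k-th neighbor
        for (int i = 0; i < n; i++) {
            final int p = i;
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            Arrays.sort(order, Comparator.comparingDouble(j -> Math.abs(x[j] - x[p])));
            for (int j = 0; j < k; j++) nn[i][j] = order[j + 1]; // order[0] is the point itself
            kdist[i] = Math.abs(x[nn[i][k - 1]] - x[i]);
        }
        double[] lrd = new double[n]; // local reachability density
        for (int i = 0; i < n; i++) {
            double sum = 0;
            for (int o : nn[i]) // reachability distance from i to neighbor o
                sum += Math.max(kdist[o], Math.abs(x[i] - x[o]));
            lrd[i] = k / (sum + 1e-12);
        }
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double s = 0;
            for (int o : nn[i]) s += lrd[o];
            scores[i] = s / (k * lrd[i]); // mean neighbor lrd divided by own lrd
        }
        return scores;
    }
}
```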

3.8 Trends, Events and Emerging Topics

In the scope of this thesis we will use the terms “trend”, “event” and “emerging topic” interchangeably. Colloquially, the word “trend” often describes an increase of interest over a longer time period. This ranges from fashion trends to computer science trends such as big data and distributed computing. Typically, no specific point in time is associated with those; interest may start small and then grow continuously. Identifying the source that mentioned a trend first is often non-trivial. Big data, for example, got much attention after the report “Big data: The next frontier for innovation, competition, and productivity” [130] by McKinsey in 2011. However, Big Data was described earlier, in a 2001 research report by Doug Laney, an analyst of the META Group (now Gartner). In his article “3D data management: Controlling data volume, velocity and variety” [107] he introduced the 3 V's (volume, velocity and variety). But the history of Big Data goes back even further: the Forbes article “A Very Short History Of Big Data” [150] describes that the first mentions regarding the growth rate and the volume of data range back to the year 1944, when the librarian Fremont Rider published his report “The Scholar and the Future of the Research Library” [152], in which he estimated that “American university libraries were doubling in size every sixteen years”. And likewise the International Conference on Very Large Databases (VLDB): since 1975, researchers have published work there targeting large data sets and data management.

Due to such ambiguity, this thesis does not focus on detecting “the first” source and instead refers to the research field of First Story Detection (FSD) [146, 19]. Further, we measure interest quantitatively by the number of mentions from a data source such as Twitter or a news agency. This means that we do not determine qualitative factors such as the sentiment of an event, but design our system to enable such analysis at a later step (as shown in Figure 1.2).

Closely related to trends and events is the term “hype”, as it is also used to describe topics of increasing interest. One famous source for technology hypes is the Hype Cycle, a report that is released regularly by Gartner, Inc. As you can see in Figure 3.5, the term “Big Data” is placed on the 3rd slice, called “Trough of Disillusionment”, which means that the interest has reached its peak and is now declining.

Figure 3.5: Hype Cycle for Emerging Technologies, 2014. “Big Data” is placed on the 3rd slice “Trough of Disillusionment” which means that the interest has reached its peak and is now declining. Source: Gartner, Inc.

3.9 Gold Standard

A lot of data sets for anomaly detection in time series exist. In the case of textual events, they are often not suited as a baseline, as they have specific characteristics and definitions of outlierness. Consider the case of a heart rate analysis by Mateo et al. [131] on electrocardiography (ECG) data. If we were to use our Event Detection algorithm – which measures how many standard deviations the current signal differs from its moving average – on the ECG data shown in Figure 3.6, we would detect the first peak of each series as an event.

Figure 3.6: ECG anomaly detection from Mateo et al. [131].

As both the standard deviation and the moving average are now increased, we would suppress the subsequent peaks, and no event alert would be reported. While this is comprehensible for textual data, our algorithm would perform badly for the special purpose of ECG data analysis, where small irregularities in the frequency pattern can have a strong impact on the patient's health. Additionally, when time series data is produced by high-precision sensors, the detection of events can be much more sensitive and react immediately to minimal deviations. Again, text behaves differently: spelling errors and ambiguous words exist. But most importantly, users of social media do not behave like high-precision sensors⁶: a single user may write a lot of consecutive messages just because he or she is waiting for something or has found a fast Wi-Fi network while on holiday. In the case of (large) social networks we observe millions of users simultaneously. Thus, we need to compensate for such random behaviour by aggregating observations into statistically stable chunks. If we instead naively adjusted our Event Detection parameters to report minimal deviations, we would produce an infeasible amount of event reports, most of which would correspond to random behaviour and not to real trends. The problem that arises when multiple statistical inferences are considered simultaneously is called the “multiple testing problem” [155]. Further, as discussed in Section 3.11.2, normalization can distort data sets: temporary peaks can get extinguished by consecutive slices with low frequency. The question whether or not it is correct to classify a significant frequency increase within a small time slice (say, 1 minute) as an event, even if it is not significant when data is aggregated into larger slices (say, 1 hour), clearly depends on the user's perspective.

In conclusion, textual data from social networks behaves differently than high-precision sensors, and because we observe millions of users at the same time, we have a high chance of finding random patterns when looking at an aggregation over a too-small time slice. Thus, we provide parameters such as the epoch size and ageing to control the aggregation size as well as the amount of data that gets wiped out, to enable recurrent events and to give the user control over the sensitivity of the Event Detection system. Further, this thesis focuses on

detecting events with positive interest peaks and does not consider the absence of an expected peak as an event. This means that we do not detect irregularities regarding patterns, such as “Monday” being mentioned less than expected for a typical Monday, or users having stopped talking about “Michael Jackson”.

⁶ Even though there are existing efforts from the research community, e.g., Sakaki et al. [157], to treat Twitter users as sensors.

3.10 Twitter

Twitter is one of the most famous online social networking services. Users can publish a short fragment of text (up to 140 characters), some annotated entities, and links to other users as well as to web sites. Additionally, users can add location information on where the message originated. In many cases Twitter is used as the primary source due to its free and easy-to-access streaming API. For most of the experiments in this thesis we use Twitter as the main source of textual as well as location data. The statuses/sample endpoint of Twitter includes 1% of all tweets (4–5 million per day, along with 1–2 million deletions; about 15 GB of uncompressed raw data per day). A surprisingly small percentage of 1.5% of these tweets includes geographic coordinates: retweets never contain coordinates, and often users only check in with an approximate location such as a point of interest. Assuming the sample is random, the estimated volume is 7.5 million tweets per day with coordinates (1.5% of the roughly 500 million daily tweets extrapolated from the 1% sample).

3.11 Time Series Data

In this section we discuss event detection on one-dimensional time series data. We use data from publicly available sources to show how frequency bursts can be leveled out by increasing the aggregation window duration. We also discuss the necessity of normalization, so as not to draw wrong conclusions from current observations.

3.11.1 The Importance of Aggregation

Without aggregation, analysing time series data would yield a lot of unstable frequencies. Additionally, most real-world data sets have an implicit aggregation interval due to technical limitations such as sensor interval rates. In the case of Twitter data, for example, frequencies at a resolution of less than 1 second barely make sense due to network latencies and server-side micro-accumulation.

To demonstrate the impact of aggregation we use data provided by the Numenta Anomaly Benchmark [111], available at https://github.com/numenta/NAB. In Figure 3.7 we can see the number of mentions of the stock symbol IBM on Twitter at different aggregation levels. Figure 3.7a shows the finest aggregation with a 5-minute window. The ground truth for points in time where anomalies occur is also given within this data set; the bucket which contains the anomaly is marked in green. Points in time with frequencies that are detected as “events” by the z-score of SigniTrend (detailed theory in Section 5.4) are highlighted with red dots. When the aggregation interval is increased from 5 minutes to 1 hour (Figure 3.7b), the curve becomes smoother and a lot of the frequency bursts are now flattened, as can be seen in the reduced number of red dots. When the aggregation reaches window sizes of 12 hours (Figure 3.7c) and 24 hours (Figure 3.7d), only two significant detected events remain. Note that the bucket at March 23rd, containing the manually labeled anomaly, disappeared and is no longer detected as a trend.

This effect is of course present in a lot of data sets and comes at a variety of scales: users talk daily about eating “breakfast” in the morning or wish each other a “good night” in the evening. Every week, tips for spending the “weekend” are discussed. Such patterns can also be observed with annually recurring events such as Valentine's Day or April Fools' Day. These days show significant peaks on the day itself, but those would vanish if data were aggregated into epoch sizes of one complete year. Whether or not such recurring events are “real” trends depends on the user's perspective. As we will later show in Chapter 5, our trend detection is capable of handling different aggregation levels with an exact-counting mini-batch approach and gives the user the choice to handle recurring events with two simple parameters.
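The event reports shown as red dots in Figure 3.7 follow this scheme. The following minimal sketch shows such a per-term detector, simplified from the SigniTrend model detailed in Section 5.4 (as an assumption, the bias term used there is reduced to a small constant here):

```java
// EWMA/EWMVar-based z-score detector for one term's aggregated counts.
class EwmaDetector {
    private double ewma, ewmvar;
    private boolean initialized;
    private final double alpha; // ageing factor: weight of the newest epoch

    EwmaDetector(double alpha) { this.alpha = alpha; }

    // Returns the significance (z-score) of frequency x, then updates the model.
    double observe(double x) {
        if (!initialized) { ewma = x; initialized = true; return 0.0; }
        double z = (x - ewma) / (Math.sqrt(ewmvar) + 1e-9);
        double delta = x - ewma;
        ewma += alpha * delta;                                    // EWMA update
        ewmvar = (1 - alpha) * (ewmvar + alpha * delta * delta);  // EWMVar update
        return z; // e.g., report an event if z >= 3 (cf. the red dots in Figure 3.7)
    }
}
```

With a large α old data ages out quickly and the detector reacts to short bursts; a small α yields a stable long-term baseline that suppresses them.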

3.11.2 Normalization

Over time, the volume of observations can vary heavily. Especially data sources that involve the actions of users show high fluctuation in their overall volume due to day and night cycles. In Figure 3.8 we visualize the total amount of observed tweets and those that contain text classified as English by Twitter itself. This fluctuation can cause uninteresting words to show an increase in their frequency even though their occurrence probability remains stable. Furthermore, the absolute popularity is subject to seasonal trends. Depending on the data source, working hours and weekdays cause “seasonal” patterns that need to be accounted for. Relative popularity – normalized by the total number of documents for the day – proved much more robust in our experiments.

[Figure 3.7 panels: the Twitter_volume_IBM 2015 time series at (a) 5 minutes aggregation, (b) 1 hour aggregation, (c) 12 hour aggregation, and (d) 1 day aggregation; each panel plots the raw frequency, the EWMA, bands at 1×/2×/3× the exponentially weighted moving standard deviation, and the detected trends.]

Figure 3.7: Aggregation of a one-dimensional time series. For each of the different aggregation intervals, ranging from 5 minutes to 1 day, we show all points in time where our z-score based event detection algorithm would have reported an event. The time slice which contains the “gold standard” anomaly (defined by the Numenta Anomaly Benchmark [111]) is highlighted in green. Note that for the 12 hour and 1 day aggregations, the peak frequency for this target anomaly disappeared almost completely: the aggregated adjacent frequencies now correspond to the same time frame and thus smooth out small peaks.

Figure 3.8: Twitter usage varies due to the day-night cycles of Twitter users. Absolute popularity is subject to seasonal trends such as working hours and weekdays. Normalization is thus needed to not “detect” anomalies that only reflect those seasonal patterns.

Chapter 4

Incorporated Publications and Co-authorship

Publications included in this thesis have been developed mainly with researchers from the Database Systems Group of the Ludwig-Maximilians-Universität München (LMU). The doctoral adviser Prof. Dr. Hans-Peter Kriegel supervised all of these publications. The publications discussed in Chapters 5, 6 and 9 have been joint work with Dr. Erich Schubert. Additional co-authors include PD Dr. Arthur Zimek (Chapter 9) as well as Dr. Andreas Züfle and Dr. Tobias Emrich (Chapter 13). Due to this co-authorship, the contribution of this thesis’ author is outlined in this chapter. Additional publications by the author that are not included in this thesis are Robust Segmentation of Relevant Regions in Low Depth of Field Images [73], Robust Image Segmentation in Low Depth Of Field Images [74] and Video route [67].

4.1 SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

This publication [163] was a cooperation between Erich Schubert, Michael Weiler and Hans-Peter Kriegel. The author of this thesis contributed the initial idea to use probabilistic counting with multiple hash functions and a specific bucket update strategy that minimizes the approximation error. This idea was implemented by this thesis’ author as an initial proof-of-concept prototype that uses probabilistic counting similar to Count-min sketches [58]. The other authors contributed an optimized implementation and design, experienced writing, and guidance throughout the development process. The author of this thesis was also responsible for the data and the majority of the experiments (including additional experiments that did not make it into the final publication) and presented this work at the 20th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), New York, in 2014.

4.2 SpotHot: Scalable Detection of Geo-spatial Events in Large Textual Streams

This publication [166] was a cooperation between Erich Schubert, Michael Weiler and Hans-Peter Kriegel. The author of this thesis evaluated different ideas and strategies for integrating geographic information into SigniTrend: the idea to map coordinates to their textual hierarchical administrative boundary counterpart was developed as a proof of concept by this thesis’ author. The other authors contributed an enhanced high-resolution geographic lookup and the integration into the previous codebase, and provided feedback, guidance, and experience with the publication process. The author of this thesis is also responsible for the evaluation of the method, implementing comparison methods and joining the results with other data sources for comparison. The author of this thesis contributed large parts of the related work and the majority of the experiments, such as the runtime and scalability experiments, for which the author of this thesis implemented several related algorithms, particularly GeoScope. Further, the experiments on detecting earthquakes, along with the necessary data preparation, were done by this thesis’ author, as were the implementation and clustering for the experiment on the WikiTimes data set [184]. This work was presented by the author of this thesis in 2016 at the 28th International Conference on Scientific and Statistical Database Management (SSDBM) in Budapest, Hungary.

4.3 Scalable Detection of Emerging Topics and Geo-spatial Events in Large Textual Streams

This publication [165] was submitted as a short paper summarizing SigniTrend [163] and SpotHot [166] at the LWDA 2016 conference in Potsdam, Germany. The authors and their respective contributions remain the same as in the individual publications.

4.4 Outlier Detection and Trend Detection: Two Sides of the Same Coin

This publication [167] was a cooperation between Erich Schubert, Michael Weiler and Arthur Zimek. The author of this thesis mainly provided the input regarding Event Detection and the motivation for the relationship of events and outliers. The Outlier Detection perspective was provided in particular by Erich Schubert and Arthur Zimek. This publication was presented by the author of this thesis at the 1st International Workshop on Event Analytics using Social Media Data (EASM) in 2015 in Atlantic City, United States of America, in conjunction with the IEEE International Conference on Data Mining series (ICDM).

4.5 Geo-social Co-location Mining

This publication [193] was a cooperation between Michael Weiler, Klaus Arthur Schmid, Matthias Renz and Nikos Mamoulis. The author of this thesis contributed large parts of the related work and the conceptual idea of this publication, as well as the implementation of the grid-based spatial index for efficient nearest neighbour queries. Additional contributions include the implementation of the occurrence probability estimation and the calculations of the possible world semantics with generating functions.

4.6 Socio Textual Mapping

This publication [194] was a cooperation between Michael Weiler, Andreas Züfle, Felix Borutta and Tobias Emrich. The author of this thesis provided the conceptual idea of this publication as well as the motivation. The complete implementation was provided by the author of this thesis, as well as the experimental results and the clustering and term scoring formulas. Additional contributions include the formalisation of the scoring functions based on the occurrence probabilities.

4.7 TrendTracker: Modelling the Motion of Trends in Space and Time

This publication [161] was a cooperation between Klaus Arthur Schmid, Christian Frey, Fengchao Peng, Michael Weiler, Andreas Züfle, Lei Chen and Matthias Renz. The author of this thesis contributed large parts of the conceptual idea and provided methods from the event detection algorithm publications discussed above, as well as related work regarding event detection.

Part II

Trend Mining and Event Detection

Chapter 5

Emerging Topic Detection

5.1 Introduction

Traditional text mining techniques such as clustering and topic modelling cannot be used trivially when processing a live stream of data. They can provide meaningful insight to refine e.g. a search query or to analyze a static text corpus, but are not applicable to fast-flowing, ever-changing data streams. To cope with this flow of data, emerging topic detection is a useful tool, yet itself an emerging area. The goal of emerging topic detection is to identify new events early, to notify the user of the evolving story. Intuitively, the goal is an automated “breaking news” detection. If the user is able to customize and prioritize the data sources, this will ultimately allow personalized breaking news and event detection. Corporations could use the additional knowledge to quickly react to future market needs and thus increase their profit. But the detection of trends in such fast-flowing data remains challenging, because most documents contain raw unstructured natural language text, often abbreviated such that the interpretation is difficult for both humans and computers. Ambiguity, homonyms and synonyms further add to these challenges. Even if some semantic annotations like tags are available, these are not necessarily helpful for the problem at hand, as they may be ambiguous as well, and not all of them have an actual meaning [108]. Due to these problems, the resulting analysis quality often does not meet expectations in practice. Scalability of textual analysis continues to be a problem. Many approaches are based on simple preprocessing such as stop word removal and stemming, and then use distributed algorithms to simply count and track word occurrences over time. Very few methods seem to be able to extract meaning from sentences, correlate words and model complex information in near real-time on high volume data [49]. This may explain why text analytics seems to take off very slowly.1 In this chapter, we propose a statistics-based score for evaluating trends as well as scalability improvements that allow tracking arbitrary word pairs. The detected trends are then aggregated into clusters to be further analyzed by the user.

1 http://www.kdnuggets.com/2014/02/poll-results-text-analytics-use-shows-no-significant-change.html

5.2 Challenges

The analysis of social media data poses several challenges: first, the data sets are very large; second, they change constantly; and third, they are heterogeneous, consisting of text, images, geographic locations and social connections. Social media such as Twitter produce a fast-flowing data stream with thousands of new documents every minute (see Section 3.10). Many traditional analysis methods do not work well on such data: the informal, often abbreviated language with many special technical terms, emoji icons and constantly changing portmanteau words prevents many advanced linguistic techniques from understanding the contents. Furthermore, many advanced methods in topic modelling need to visit the data multiple times to optimize the inferred structure, which is infeasible when thousands of new documents arrive every minute. Thus, algorithms for such data sets are usually designed around simple counting, or require the data set to be reduced via predefined keywords and location filters. Such algorithms based on simple counting of the most frequent words and phrases often yield unsatisfactory results, because the output is dominated too much by quantity: the most frequent terms usually contain only well-known information such as the popularity of Justin Bieber on Twitter, “good morning” in the morning, “dinner” at night, and the TV series currently on air. As a result, more interesting events are hard to discover amid the quantity of everyday chat. Using absolute counting for event detection only works if the events have already been reported on TV and other mass media and we observe the global echo in social media. Instead, we need approaches that can identify significant patterns without reporting the obvious over and over. Due to spam and noise, not every single text is interesting, but identifying the interesting subset is hard.

5.3 Related Work

The field of emerging topic detection receives much attention, as the daily flood of information available in social media and news agencies is impossible to overlook – in fact, the media are reporting on the increasing volume of data every day. To know what topics are currently “hot” could be an enormous advantage both for private use and for businesses. To meet these demands, commercial systems like Dataminr, Sysomos, Brandwatch, MediaMiser and Topsy have been established. There are also a number of non-commercial systems like the CMU system [201], the UMass system [20], the Meme-Detection System [116] or Blogscope [31]. Most of the published research focuses on Twitter as primary source of information, inter alia due to its free public streaming API2 for accessing random samples of all public statuses and the contained hashtags, which help to reduce the dimensionality of the incoming text and can provide some additional semantic information. The UMass system [20] represents documents in a vector space, such as a term frequency vector. Incoming documents are first compared to the database of previous documents using a nearest-neighbor search and then added to the corpus. Documents that differ enough from the most similar earlier document are considered to be first stories. This approach, however, does not scale well to large data sets such as Twitter due to the need to maintain the database corpus [146]. To scale this nearest-neighbor approach to Twitter streams, the use of locality sensitive hashing (LSH [84]) was proposed [146]. To limit database size, buckets are limited in size and the oldest document is removed from the bucket when the bucket would overflow. Kleinberg [93] uses an infinite state automaton to model frequency bursts of incoming documents such as emails. Different states of the automaton correspond to different message frequencies, and a hierarchy is extracted from the state transitions. In Blogosphere [148], Kleinberg’s approach was applied to the titles of blog posts. To improve performance, they had to omit keywords co-occurring with other keywords in the same title. In a post-processing stage they enrich bursty keywords with semantically correlated terms by looking at the five nearest neighbors using a Euclidean-based distance metric on the automatons’ state series. Kleinberg’s burst modelling was also used by Takahashi et al. [178], where it was applied to topics estimated by a dynamic topic model (DTM). Cataldi et al. [46] proposed a method that extracts terms from tweets to model a life cycle inspired by biological processes. They define a term as emerging if it is now frequent but was rare in the past. To improve results, they also consider social relationships (like the follower count) to determine the user authority. Although they perform a post-processing step to enrich detected trending terms with semantically related keywords, they cannot detect when two keywords that have already been frequent before are frequently co-mentioned, such as when Obama and Merkel meet. TwitterMonitor [132] first identifies bursty keywords that have a higher absolute frequency than usual and uses them as a seed to explore further. Keywords that co-occur in the same tweets (within a small history window) are grouped together. For each such keyword set, a singular value decomposition (SVD, [63]) is applied to all tweets containing them. The extracted terms are used to build a better description for each bursty keyword set.
EDCoW [196] applies a wavelet analysis on words to model their frequencies. Trivial words are identified by their cross-correlation value. In Hip and Trendy [141], the focus lies on building a taxonomy from trends for a specific geographic area and on categorising them, rather than on finding co-occurrence trends.

2 https://dev.twitter.com/docs/streaming-apis

Similarly, CiteSpace II [51] has a strong focus on visualizing emerging trends. Density estimation on data streams as performed in stream clustering approaches such as CluStream [12] is also related to our approach. However, these algorithms assume a stream of coordinates and want to estimate the resulting coordinate-based density. Such vector data arises for example in sensor networks, where the need to compress historical information [64] can be satisfied using wavelets, discrete cosine transforms, or histograms. Temporal velocity profiles [10] are then applicable to such data and can be used to predict the positions of dense regions of moving objects. In our approach, we will be estimating a density on the temporal domain only, as if we split the input stream into separate streams for each word co-occurrence and processed them separately. These techniques therefore do not apply here. The problem of top-k monitoring [30] is related, but it does not take relative significance into account. These approaches, often aimed at detecting denial-of-service (DoS) attacks in network flows, are interested in the most frequently occurring patterns only. In the context of monitoring textual streams, these approaches would only monitor popular terms, which are most of the time uninteresting. Exponential histograms [62] have been proposed for probabilistic counting as well as for weakly additive functions. This was then extended to continuous monitoring over data streams [29]. As our method also relies on variance, this work is closely related. However, we use a simpler yet effective approach: their techniques are designed with the requirement to take exactly the previous N observations into account. Our approach uses exponential weighting, which will never fully forget data, but aggregates historical values into a single figure, in which old results will eventually have a weight of effectively zero. A rather similar approach to our system is enBlogue [21]: like our approach, they are also interested in the co-occurrence of words, instead of relying on single terms. The central measure of their approach is the Jaccard similarity [85] of two words in relation to their overall popularity. Like most other approaches processing Twitter data, they first preprocess the data in order to extract hashtags and named entities as tokens. In order to find co-occurrences, they start with a reduced seed list of frequent tokens, then track all token combinations that contain at least one of these seeds. For each token pair, a short history of ρ recent occurrence counts and popularities is kept; token pairs not seen during the last ρ epochs are discarded. Similar to our method, exponential smoothing is applied to predict future values from history. A key limitation of enBlogue is the need for controlling the memory usage. A number of mechanisms are employed to reduce the data volume: only hashtags and named entities are used for the analysis. Then seed tags are chosen based on a minimum occurrence threshold as well as a top-k filter. However, the experiments indicate that the accuracy drops linearly with the seed tags parameter, so these thresholds do not appear to be beneficial. Since enBlogue keeps a history of 2ρ historical values for any pair of tags tracked, it scales badly with respect to memory usage.

In our experiments, we observed as many as 700 million words and word pairs (see Table 7.2). While not all of them will be kept in memory at the same time, this requires a substantial amount of memory and update cost. The method we propose improves over enBlogue in multiple ways. Instead of keeping a history, we only need two floating point values per record. Secondly, by storing this data only approximately in a hash table, we are not as strongly affected by the large number of potential word pairs. As we exploit hash collisions, we only need the hash table to be large enough to store all frequent word pairs, which due to the long-tailedness of the distribution is a substantially smaller number (as seen from the quantiles in Table 7.3). Petrović et al. [146] proposed a method for first story detection based on locality sensitive hashing. While our approach borrows ideas from LSH and MinHash, we do not use this for nearest neighbor search, and do not need to store the individual documents for similarity search.

5.4 Detection of Significant Topics

In information retrieval and text mining, a popular similarity measure for text is the cosine similarity on the term frequency (tf) vectors, weighted by the inverse document frequency (idf). This model is commonly referred to as the tf-idf vector model (see Section 3.5 for a brief introduction). Depending on the definitions used, term frequency can either be the absolute count of each term, or the relative frequency. In the context of trend detection, such similarity measures are difficult to use: they are designed to quantify the similarity of two documents, but trending topics may span many documents, and documents will often cover more than one topic. In particular, topics may have subtopics; and within a larger topic (such as pop music) a subtopic (such as a particular artist) may be trending, even when the larger topic is not. A naive approach to perform topic detection would be the use of cluster analysis. However, cluster analysis on streaming data is far from a solved problem. Often, these clustering algorithms are unable to recognize hierarchies of clusters, and may need parameter fine-tuning. Because of these limitations, the algorithms are mainly useful to cluster the result set of an information retrieval task for user presentation. Another similarly naive approach maintains a database of recent documents, and searches for the nearest neighbor (i.e. the most similar earlier document) of each new document. As shown in [151], nearest-neighbor distances are a popular measure of outlierness. While this works reasonably well in a controlled corpus, it will fail on many noisy real data sets. While a low nearest-neighbor distance indicates that the document is a near duplicate, the converse does not hold: not every document with a high nearest-neighbor distance is the start of an interesting emerging topic; it may also be just noise. When detecting emerging topics in data streams, we cannot assume that we already have a cluster for the topic. In fact, it is desired to detect the topic as soon as possible. Yet at the same time, we are only interested in topics that both have a minimum size and show an “unusual” growth rate. When looking for trending subtopics, this becomes even more challenging – here, we may not be interested in the main topic, but only in the restriction to a subdomain. In the following, we will discuss some important aspects of emerging trend detection. First of all, we discuss how to measure the significance of trends, as opposed to just some deviation score. Secondly, we will elaborate briefly on the relationship to Outlier Detection. Then we will discuss the importance of word co-occurrences for detecting trends that may otherwise be masked, and additional challenges for early detection when using the suggested statistics. Finally, we will discuss hashing approaches for scalability to monitor all pairwise co-occurrences.

5.4.1 Emerging and Trending Topics

When users see “trending topics”, they usually assume that these are simply the most popular terms. However, users do not only want popularity, they also expect novelty. Therefore, we must not simply look at the most popular tags, but must put this popularity into its historical context. As noted in Section 3.11.2, we normalize observed frequencies by dividing the absolute frequency by the total number of documents for the day. This avoids “seasonal” patterns that arise, depending on the data source, at working hours or on weekdays. Using these normalized frequencies proved much more robust in our experiments. To evaluate the significance of a trend, we suggest making use of statistical best practices such as the z-score (formally, the z-score assumes a normal distribution; while this will likely not hold, it can nevertheless serve as a reasonable heuristic for our purposes):

$$z(x) := \frac{x - \mu}{\sigma} \tag{5.1}$$

where x is the frequency of a currently observed term, µ its corresponding expected mean frequency, and σ its standard deviation. To use this on data streams, we need a moving average and moving standard deviation such as the exponentially weighted moving average (EWMA) and the associated standard deviation or variance (EWMVar):

$$\operatorname{sig}(x) := \frac{x - \text{EWMA}}{\sqrt{\text{EWMVar}}} \tag{5.2}$$

In order to increase stability (the variance could be 0) as well as to account for non-interesting fluctuations of rare terms, we ensure a minimum average and minimum standard deviation of β, which we call the bias term:

$$\operatorname{sig}_\beta(x) := \frac{x - \max\{\text{EWMA}, \beta\}}{\sqrt{\text{EWMVar}} + \beta} \tag{5.3}$$

This bias term β not only avoids a division by 0, but also serves as a noise filter. For low-volume data with EWMA < β and a small value of β, we cannot statistically argue about trends. Intuitively, we should set β to the expected background noise level, i.e. how often rare terms occur naturally in the data stream without trending. As previously shown in Section 3.4, typical term frequencies are long-tail distributed, which means that we will encounter many terms that should be handled as noise. The bias term also takes into account that our input data is discrete – there are no “half occurrences” of words. In order to estimate the average EWMA and the variance EWMVar for a data stream and a learning rate α, we can rely on earlier work by Welford [195] and West [197] on incremental mean and variance. The update equations given by Finch [68] for the exponentially weighted variants allow these values to be efficiently maintained on a data stream:

$$\Delta \leftarrow x - \text{EWMA}$$
$$\text{EWMA} \leftarrow \text{EWMA} + \alpha \cdot \Delta \tag{5.4}$$
$$\text{EWMVar} \leftarrow (1 - \alpha) \cdot (\text{EWMVar} + \alpha \cdot \Delta^2) \tag{5.5}$$

The learning rate α can be set via the half-life time $t_{1/2}$, the time after which the weight of a data point has decayed to half its initial weight. With this parameter, a domain expert can easily make a choice based on experience and needs:

$$\alpha_{\text{half-life}} = 1 - \exp\left(\log\left(\tfrac{1}{2}\right) / t_{1/2}\right) \tag{5.6}$$

This update equation is best used with a fixed update cycle. While we could adjust the update cycle dynamically by adjusting α or $t_{1/2}$ accordingly, there are good reasons not to update too often: first of all, a fixed recomputation interval gives better performance guarantees, and secondly we have more control over the statistical validity of our estimates. We cannot use the above equations to update the statistics on every arriving record, but must perform some aggregation to estimate the popularity x reliably. When using too high an update rate, we will have more variance in our estimation of x, which will in turn harm the estimation of EWMA and EWMVar. When using a data source with a natural cycle, such as a news ticker with a daily pattern, we can expect the best results by aligning our updates with this cycle. Note that we avoid the popular equation $E(X^2) - E(X)^2$ for estimating the variance, because it is prone to numerical instability due to catastrophic cancellation with floating point arithmetic. This instability may be the reason why variance on data streams seems to be rarely used yet. Equation 5.5 is numerically stable [68], and thus preferable.
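The following minimal sketch (in Python; class and parameter names are illustrative, not taken from our implementation) combines Equations 5.3 to 5.6 for a single tracked term:

```python
import math

class TermStatistics:
    """EWMA/EWMVar statistics for one term (Equations 5.3-5.6)."""

    def __init__(self, half_life, bias):
        # Equation 5.6: learning rate derived from the half-life time.
        self.alpha = 1.0 - math.exp(math.log(0.5) / half_life)
        self.beta = bias      # bias term / noise floor
        self.ewma = 0.0
        self.ewmvar = 0.0

    def significance(self, x):
        # Equation 5.3: z-score with the bias term as noise filter.
        return (x - max(self.ewma, self.beta)) / (math.sqrt(self.ewmvar) + self.beta)

    def update(self, x):
        # Equations 5.4 and 5.5 (the numerically stable form of Finch [68]).
        delta = x - self.ewma
        self.ewma += self.alpha * delta
        self.ewmvar = (1.0 - self.alpha) * (self.ewmvar + self.alpha * delta ** 2)

# Score each epoch's relative frequency first, then fold it into the statistics.
stats = TermStatistics(half_life=7, bias=1e-4)
for freq in [2e-4, 1e-4, 3e-4, 2e-4, 1.5e-2]:
    print(round(stats.significance(freq), 1))
    stats.update(freq)
```

In this toy sequence, only the final burst produces a large significance score; the ordinary fluctuations before it stay close to zero.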

5.4.2 Emerging Topics are Outliers

Emerging topics, when defined as “a set of documents which grows faster than expected from comparable other document sets”, essentially are a specific kind of outliers.

[Figure 5.1, four panels for the terms Boston, Marathon, Explosion, Suspect and their combinations: (a) raw absolute occurrences, (b) EWMA averages, (c) standard deviations (√EWMVar), (d) significance scores.]
Figure 5.1: Visualization of selected word occurrences for the Boston Marathon bombing on news data

A similar interpretation as for Outlier Detection is possible for our approach: for each topic (approximated as a word or word pair), we compute a reference model consisting of an average frequency and variance based on an exponentially weighted temporal neighborhood set. We then compare the current frequency to this reference, and measure the trend by the deviation from this model. The relationship to Outlier Detection not only exists on a formal level, but we are also seeing many problems typical of this domain [210]: we do not know beforehand what data we are searching for, and we do not have labeled training data, so popular machine learning approaches such as random forests cannot be trained for this problem. Instead, we have to fight the problems of masking and swamping [32, 77], where for example a continuously popular word such as obama may prevent the detection of related trends, or a single strong trend may mask other trends in the data set. In Chapter 9 we will discuss the relationship of Event Detection and Outlier Detection in more detail.

5.4.3 Trend Detection on Co-occurrences

Equation 5.4 and Equation 5.5 can be applied to any numerical variable x. In order to apply this approach to trend detection in textual data, we need to represent the incoming data stream appropriately. We will not only build one variable for each word, but also one for each word co-occurrence. The reason is that we may be unable to reliably detect or analyze some trends on single words only. In Figure 5.1 we visualize the Boston Marathon bombing on news data. In Figure 5.1a the raw word occurrences are presented. The word boston shows a comparable peak three days before the Marathon. Figure 5.1b visualizes the EWMA moving average. In this figure, it becomes evident that boston is (unsurprisingly) frequently mentioned, but even explosion (as in e.g. “the explosion of mobile traffic”, “stock explosion” and “Dreamliner battery explosion”) occurs quite often. However, the exact combination of these three words first occurs substantially on April 15. The proposed significance measure, as visualized in Figure 5.1d, shows a 48σ event for this combination. For other related words, this event of global media interest also shows a peak, but not quite as strong a one. On April 19, we observe a similar peak for the combination boston, suspect; on this day the suspects were identified and the manhunt began. This example demonstrates the benefits of tracking word co-occurrences instead of single words.
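As a small illustration (in Python; assuming tokens have already been stop-word filtered), the tracked items per document are the unique words plus all unordered word pairs:

```python
from itertools import combinations

def tracked_tokens(words):
    unique = sorted(set(words))
    pairs = [a + " " + b for a, b in combinations(unique, 2)]
    return unique + pairs

print(tracked_tokens(["boston", "marathon", "explosion"]))
# ['boston', 'explosion', 'marathon',
#  'boston explosion', 'boston marathon', 'explosion marathon']
```

Sorting the words before pairing ensures that each pair has a canonical form, so that “marathon boston” and “boston marathon” update the same statistics.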

5.4.4 Early Detection of Trends

Equation 5.4 and Equation 5.5 are not designed for continuous updating of the EWMA and EWMVar estimates; we first need to have a robust estimate of the value of x. While we can easily adjust α_half-life to dynamic time windows, the value of x will become noisy and exhibit too high a variance to be useful for trend detection. A simple but reliable approach to estimating x is to use temporal slicing on the natural cycles and volume of the data source (which may be a day for a news feed, or an hour for Twitter). Even when not updating the EWMA continuously, we can still perform online detection of trends. In order to evaluate the significance of a trend, it is even desirable to compare the current estimate of x with estimations based on delayed data only, i.e. we want to compare the observed value x with the predicted value EWMA based on the previous day, not yet including the latest data in the prediction. By solving Equation 5.2 for x, we can obtain an alerting threshold τ if we fix a desired significance level s:

$$x > \text{EWMA} + s \cdot \sqrt{\text{EWMVar}} =: \tau \tag{5.7}$$

Intuitively, when x > τ, we are observing an s·σ significant increasing trend in the data. We do however need to acknowledge that this is not based on proper statistical hypothesis testing, so we cannot assume that 3σ events are rare: the 68-95-99.7% rule (the three sigma rule) does not apply here, as we will be monitoring thousands of trends in parallel and have some margin of error. As introduced in Section 3.9 of Part I, this is often called the “multiple testing problem” [155]. On the other hand, we are interested in seeing multiple events every day, and not just the most significant events of the year. This approach can be seen as a variant of Bollinger Bands as used in stock market analysis, except that our data sources do not exhibit the fast negative feedback loop driving the stock market.3

3 It is therefore unlikely that the proposed method is beneficial for stock market analysis.

The main difficulty with this approach is getting a robust estimate of x while the epoch is not yet complete. On the news data set, for example, it is common not to see any news posted before 6 am. The first news item each day will naturally produce terms with a naive document frequency of 1; yet they do not constitute a reliable trend so far. However, we also do not know how many news items will be posted during the day in total. Therefore, in order to estimate x, we need to learn to predict the number of documents to be posted during the epoch; we can then compare the absolute number of occurrences, the number of documents seen so far, and the number of documents expected, to obtain a better estimate of x.
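A minimal sketch of this early alerting (in Python; the volume prediction is simplified to a given constant, and the bias-term variant of the threshold is used):

```python
import math

def alert_threshold(ewma, ewmvar, beta, s):
    # Equation 5.7 with the bias term of Equation 5.3 folded in:
    # the smallest x that would count as an s-sigma trend.
    return max(ewma, beta) + s * (math.sqrt(ewmvar) + beta)

def early_alert(count_so_far, predicted_docs_in_epoch, ewma, ewmvar,
                beta=1e-4, s=3.0):
    # Estimate x against the *predicted* epoch volume, so the first
    # few documents of a day do not trigger spurious alerts.
    x = count_so_far / predicted_docs_in_epoch
    return x > alert_threshold(ewma, ewmvar, beta, s)

# Example: 120 mentions so far, 500,000 documents expected today.
print(early_alert(120, 500_000, ewma=1e-4, ewmvar=1e-9))  # False
```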

5.5 Scalability by Hashing

Although Equation 5.4 and Equation 5.5 are efficient to compute, we would still need to compute them for every word pair in our data stream. Computing the statistics for every pair of tokens would quickly exhaust memory resources and is, as seen in Table 7.2, not scalable using naive methods. Instead of restricting the set of candidates to monitor beforehand, we propose to use a probabilistic approach based on hash tables. The popularity of words follows a long-tailed distribution: the majority of words have a low occurrence rate in the data. This property can be exploited using hashing techniques. Our approach is related to heavy-hitter algorithms such as the Bloom filter [38], MinHash [40] and Count-min sketches. As introduced in Section 3.3, Count-min sketches are data structures that represent a lossy and thus usually smaller approximation of the data. We make use of a modified Count-min sketch. The usage of these sketches allows us to speed up computations so that we can track the statistics of the large number of words and word pairs. The sketches used in this work are lossy in the same way that a Count-min sketch can overestimate the count of an object: the majority of word pairs on Twitter are unique combinations that will never constitute a significant event. By the update procedure discussed below, the more frequent word pair wins – this way, we are less likely to make errors on frequent terms, and we overestimate the frequency of rare combinations. But since rare combinations cannot be statistically significant, this error does not affect the output of the algorithm much. Because of such low-frequency collisions, we tend to overestimate the variance slightly, and thus report lower significances than we would obtain by expensive exact counting of all pairs. Sketches could also be used to generate candidates for exact analysis. However, the quality of the data obtained using our approach was so close to the exact results that there was no need to further pursue this research direction. As this data structure is only an approximation of the exact frequency, we still need to design it in a way that reduces the overall error. Our modification to achieve this goal is described as follows. Each sketch consists of a table of size $2^b$, where b ≥ 20 produced excellent results, as shown in Section 8.3. The hash functions $H_1(x), \dots, H_h(x)$ used throughout the pipeline must be the same, so that buckets are aligned with each other. The number of hash functions h is a small integer; the value h = 3 was experimentally shown to be large enough and is used throughout our experiments. The following sketches and update procedures are used:

1. Simplified Count-min [58] sketch (using a single table, instead of a separate table for each hash function). To update, read buckets $H_1(x), \dots, H_h(x)$, compute the minimum, then increment buckets $H_1(x), \dots, H_h(x)$ only if their current value was the minimum (see the code sketch after this list).

2. Count-min sketches that additionally store the last seen word or word pair of the bucket. If we increment the counter in the count-min sketch, we also update the word pair attached to the bucket.

3. SigniTrend sketch, which stores the exponentially weighted moving aver- age and variance (see Equation 5.5 for details).

4. Threshold sketch, which stores the standard deviation and last reported significance. At the end of each epoch, the last reported significance is heuristically reduced by multiplication with 0.75, to allow events to recur after a pause. The standard deviation is simply obtained by computing the square root of the variance stored in the SigniTrend sketch, to avoid repeated calculations.
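The first variant can be sketched as follows (in Python; the SHA-256-based bucket derivation is an illustrative choice, not the hash functions used in our implementation):

```python
import hashlib

class SimplifiedCountMin:
    """Single-table count-min sketch with conservative updates."""

    def __init__(self, bits=20, h=3):
        self.size = 1 << bits
        self.h = h
        self.table = [0] * self.size

    def _buckets(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size
                for i in range(self.h)]

    def increment(self, item):
        buckets = self._buckets(item)
        minimum = min(self.table[c] for c in buckets)
        # Incrementing only the buckets at the current minimum keeps the
        # estimate an upper bound while reducing the overestimation error.
        for c in buckets:
            if self.table[c] == minimum:
                self.table[c] += 1

    def estimate(self, item):
        return min(self.table[c] for c in self._buckets(item))
```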

Operations on the sketches are either confined to the affected hash buckets $H_1(x), \dots, H_h(x)$, or affect all buckets the same way, and can be efficiently executed with SIMD4 operations. Instead of probabilistic set membership testing, our hashing scheme is designed to provide an upper bound of the EWMA and EWMVar values. We use $2^\ell$ buckets (each storing an EWMA and EWMVar value) and k hash functions, so that each word is mapped to at most k buckets. When updating the EWMA hash tables – when transitioning from one epoch to the next – we first hash each word (with a frequency larger than β) into its k buckets.

4 Single Instruction, Multiple Data

For each bucket, we track the maximum popularity x observed in the current epoch. Next, we update the EWMA and EWMVar values in the hash table with the maximum x observed for all words in this bucket only. When computing the alerting threshold τ for an observed word w, we again inspect the k buckets given by the hash functions. The threshold is then obtained by taking the minimum alerting threshold of all buckets (Equation 5.7). Assuming there exists at least one hash bucket where the candidate word w has the maximum popularity x_w of all words in this bucket, the EWMA value stored in this bucket will not be overwritten by a different word w′; and at the same time, none of the k buckets was last updated with a lower popularity than x_w. Otherwise, i.e. if in each of the k buckets there was a hash collision with a more popular word w′, we will overestimate the EWMA and EWMVar values, and we may miss an emerging trend. To avoid this, we need to choose ℓ large enough, such that the majority of bins are not filled with frequent keywords.5 Using this hashing strategy, we can control very well the amount of memory and computation needed for maintaining the trend statistics. In fact, this allows us to scale our approach beyond tracking single word (or hashtag) occurrences to monitoring even word-pair occurrences in large data streams. Word-pair occurrences prove very effective in our experiments at detecting subtopics. For example, when Edward Snowden traveled from Hong Kong to Moscow, none of the individual words {edward, snowden, hong, kong, moscow} were trending themselves (events surrounding Edward Snowden had continuously been in the media at that time), but various combinations of these words, such as snowden and moscow, exhibit a significant peak. The use of hashing yields a number of benefits. First of all, updating the statistics becomes a vectorized bulk operation. Secondly, the memory usage is constant and the data can easily be serialized, both for transmission to a different system and for checkpointing and crash recovery. After loading the last checkpoint, either a replay of the latest data can be used to restore the exact state, or the system can even decide to only process new data and rely on the statistics to recover (it may then be desirable to disable alerting for this epoch).
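A condensed sketch of this scheme (in Python; reusing the illustrative bucket derivation from the previous sketch, with made-up parameter values):

```python
import hashlib, math

class HashedTrendTable:
    """2**l shared buckets of (EWMA, EWMVar), k hash functions per token."""

    def __init__(self, l=22, k=3, alpha=0.1, beta=1e-4):
        self.size, self.k = 1 << l, k
        self.alpha, self.beta = alpha, beta
        self.ewma = [0.0] * self.size
        self.ewmvar = [0.0] * self.size

    def _buckets(self, token):
        digest = hashlib.sha256(token.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size
                for i in range(self.k)]

    def threshold(self, token, s):
        # Minimum alerting threshold over the k buckets (Equation 5.7).
        return min(max(self.ewma[c], self.beta)
                   + s * (math.sqrt(self.ewmvar[c]) + self.beta)
                   for c in self._buckets(token))

    def end_of_epoch(self, frequencies):
        # Per bucket, keep only the maximum frequency seen this epoch ...
        update = {}
        for token, x in frequencies.items():
            for c in self._buckets(token):
                update[c] = max(update.get(c, 0.0), x)
        # ... then apply the EWMA/EWMVar update (Equations 5.4 and 5.5).
        for c, x in update.items():
            delta = x - self.ewma[c]
            self.ewma[c] += self.alpha * delta
            self.ewmvar[c] = (1 - self.alpha) * (self.ewmvar[c]
                                                 + self.alpha * delta ** 2)
```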

5.5.1 Trend Redundancy and Refinement

A secondary challenge is the redundancy of trends. If a word such as snowden is trending, we will likely see other related words such as edward also trend at the same time. But we may also see uninteresting combinations such as {snowden, the} trend significantly compared to the previous usage of this word combination. Probably the most important step here is to perform stop word removal: by not including known stop words in the trend analysis, we both have to analyze far fewer pairs and also remove a large share of uninteresting co-trends.

5 This statistic can easily be monitored; in our experiments we chose ℓ = 22, which yields about 4 million bins but keeps the hash table size at a few megabytes of memory.

Algorithm 1: Document Processing
Data: epoch (epoch identifier), EWMA[c] (averages hash table), EWMVar[c] (variance hash table), h_i (hash functions)
Input: s (threshold for refinement / alerting)

open index[epoch] shard for writing
initialize frequency map
initialize stats map
// Index a new document
foreach doc in current epoch do
    foreach unique word and word-pair in doc do
        // Update refinement index
        add doc.id to index[epoch][word]
        // Update current word counts
        increment frequency[word]
        // Get word statistics from hash table
        if not stats[word] then
            (µ, σ) ← (∞, ∞)
            foreach hash function h_i do
                c ← h_i(word)
                if EWMA[c] < µ then
                    µ ← EWMA[c]
                    σ ← √EWMVar[c]
                end
            end
            stats[word] ← (µ, σ)
        end
        (µ, σ) ← stats[word]
        // Test for significance threshold
        x ← estimate frequency of word
        if (x − max(β, µ)) / (σ + β) > s then
            send word to refinement for early alerting
        end
    end
end

Algorithm 2: Perform end-of-day analysis (continuation of Algorithm 1)
Data: epoch (epoch identifier), EWMA[c] (averages hash table), EWMVar[c] (variance hash table), h_i (hash functions)
Input: s (threshold for refinement / alerting)

// Perform end-of-day analysis
initialize trending topics list
foreach unique word and word-pair in frequency do
    (µ, σ) ← stats[word]
    // Test for significance threshold
    x ← frequency[word] / |documents|
    if (x − max(β, µ)) / (σ + β) > s then
        add word to trending topics
    end
end
close index[epoch] shard for writing
Refine trending topics
Cluster trending topics
Produce end-of-day report
// Update the statistics table for the next epoch:
// aggregate into the maximum for each bucket
initialize update-table
foreach word and word-pair in frequency do
    f ← frequency of word
    foreach hash function h_i do
        c ← h_i(word)
        if f > update-table[c] then
            update-table[c] ← f
        end
    end
end
// Update statistics table
foreach hash code c do
    freq ← update-table[c] / number of documents
    ∆ ← freq − EWMA[c]
    EWMA[c] ← EWMA[c] + α · ∆
    EWMVar[c] ← (1 − α) · (EWMVar[c] + α · ∆²)
end

In addition to a language-specific set of stop words, it is also beneficial to include domain-specific stop words, such as follow, tweet, retweet (rt), lol and twitter for Twitter. In the refinement phase, we can use an inverted index both to verify observed trends (as the hash table may have had a collision) and to compute the overlap between any two words involved in the detected trends. For this we employed the clustering toolkit ELKI [8], and used hierarchical clustering with Ward linkage. Similarity is measured by how significantly the two words trend together. We cannot assume that related words form a clique: consider for example fiscal, cliff, barack and obama. The last two will usually not qualify as a trending topic, but the names are in fact among the most mentioned words in the news on any day; however, all four together form a known topic. In Algorithm 1 and Algorithm 2 we give pseudocode for the overall detection process. Alternatives at this stage – which we plan to investigate in future work – include finding maximum-weight cliques (as used in [109]) and topic modelling techniques such as pLSI and LDA. However, topic modelling is not guaranteed to produce a more meaningful output either [49].

Chapter 6

Geo-spatial Event Detection

In this chapter, we focus on detecting events consisting of text and location information, and introduce an analysis method that is scalable both with respect to volume and velocity. We also address the problems arising from differences in the adoption of social media across cultures, languages, and countries, and incorporate novelty into our event detection by efficient normalization. We introduce an algorithm capable of processing vast amounts of data using a scalable online approach based on the event detection algorithm discussed in Chapter 5, which is able to identify unusual geo-textual patterns in the data stream without requiring the user to specify any constraints in advance, such as hashtags to track. In contrast to earlier work, we are able to monitor every word at every location with just a fixed amount of memory, compare the values to statistics from earlier data, and immediately report significant deviations with minimal delay. Thus, this algorithm is capable of reporting “Breaking News” in real-time as it happens in social media. This is achieved with an architecture that – as discussed in Section 5.5 – is optimized for data throughput with an efficient hashing strategy that tracks only the significant parts of the data. Location is modeled using unsupervised geometric discretization and supervised administrative hierarchies, which permits detecting events at the city, regional, and global levels at the same time. The usefulness of the approach is demonstrated in several real-world example use cases using Twitter data. We show reports of significant earthquakes and relate observed events to Wikipedia articles.

6.1 Motivation

As discussed in Chapter 5, we can efficiently detect unexpectedly frequent events and thus answer the following questions:

What is the event: This is represented by the tokens and their combinations within a detected cluster.

When the event occurred: The first significant occurrence, i.e. the point at which we observed an unexpectedly large increase that exceeds our thresholds.

But it cannot answer where an event took place. Because “concerts” or “earthquakes” are always happening around the globe, we may miss some events, as they will not be significant enough. As we will later see in our experiments in Section 8.3, we can incorporate location data to detect a locally significant concert or earthquake. Furthermore, when users are not evenly distributed, such “mainstream” behavior tends to obscure interesting developments in other parts of the world. When the attack on Charlie Hebdo occurred at around 11:30, it still took 45 minutes to yield a significant number of mentions of the hashtag #CharlieHebdo, all of them originating in Paris. By 13:00 the news had spread to the UK, and by 13:30 most European countries were discussing the attacks on social media. By 14:00, the new hashtag #JeSuisCharlie had emerged. These quite substantial delays, even for a major incident, indicate that the promise of real-time detection of events on Twitter may be overly optimistic and that TV continues to be of critical importance to news distribution. These observations demonstrate how geographic locality may help in determining the importance of an event and in understanding it. They also demonstrate the usefulness of a tool capable of analyzing such developments with little delay, without having to rerun the analysis or specify the topics of interest in advance. While Charlie Hebdo yields an obvious portmanteau keyword to use as a hashtag on Twitter, this is not always the case, and we need to monitor words that have not been manually chosen as a keyword by the user. Thus, we are interested in a system capable of tracking unusual occurrences of arbitrary words at arbitrary locations all over the world in real-time, without having to specify the terms of interest in advance. Absolute counts, and the absolute increase, need to be adjusted for differences in usage.

6.2 Requirements

An algorithm and system that tracks all possible words at all possible locations needs to be carefully designed for scalability because of the high volume and velocity of social media data sources.

Geographical Proximity: An important relevance factor is geographical proximity. People tend to have a stronger interest in things related to their own location; thus newspapers offer various sections ranging from the local city level to global worldwide news. Motivated by editorially curated news sections, we want to create a system that is capable of including geographic metadata in the event detection process.

Track all Locations: An important requirement for our system is that the user does not have to specify desired locations beforehand, as we want to be able to monitor every possible location at a reasonably fine-grained resolution such as urban districts. Likewise, we will not restrict the vocabulary of our monitored terms to hashtags or top-k-frequent terms: every observed term should be considered in the scoring process. To enable such limitation-free tracking, we focus on high throughput from the very first time we read the data from our stream. As shown in Section 8.3, even at the scale of Twitter, terms and locations that receive an unusually strong frequency burst are rare.

Real World Application: It should be possible to apply the algorithm in a client-server architecture where the server does the statistical tracking. Each client can then easily filter out the information that he or she is interested in. As we will discuss in Section 6.5, systems that limit the locations beforehand are only usable for a single client (or a group of clients within the same regions of interest). The remaining candidates can easily be filtered by users on the client side, in contrast to approaches that discover evolutionary theme patterns [134] or repeating patterns in time series [136].

6.3 Challenges

Velocity and Volume of the Data: When processing a Twitter feed, the obvious challenges are the velocity and volume of the data. We use a different feed which has both a larger volume and a higher rate of tweets with coordinates (at the cost of having no retweets), so that we have to process over 5 million geo-tagged tweets per day, between 3000 and 5000 tweets per minute. Not all of these are usable – our spam and duplicate filters reject roughly a third of all tweets. The average rate to process at the analysis stage is then about 2500 tweets per minute (sometimes exceeding 4000 tweets/minute).

Spam: Twitter contains various kinds of spam in large quantities. There are advertisements for products, services, and Twitter followers, and some spammers try to manipulate the trending topics. But there are also teenagers sending mass love messages to their idols, bots announcing trending topics and weather forecasts, and “follow sprees” where users mass-follow each other to raise their follower count and appear more popular. We found the analysis results to become much more interesting when ignoring tweets that include either “follow” or “trend”, and we utilize a basic streaming near-duplicate filter to drop most of such spam and thus remove many false positives; we do not further focus on this challenge in this chapter.

Variation of Tweet Density: Twitter adoption varies heavily from country to country: Twitter is popular e.g. in the USA, Brazil, Argentina, Indonesia, and Turkey. On the other hand, users from China, India, and Germany (where apparently Twitter usage and location sharing raise privacy concerns) are underrepresented in this data set.1 If we want to obtain meaningful results for all areas of the world (including e.g. Berlin), we need to design an algorithm that is capable of adapting to local variations in tweet density. Methods based solely on counting the most popular terms will be biased towards countries with many users and unable to detect events in less populated areas.

1 The distribution of countries and locations within our data set can be found in Section 7.3.1.

Streaming Analysis: The number of tweets is not the sole parameter affecting scalability. The complexity of the analysis also influences scalability considerably. Performing pairwise comparisons or computing a distance matrix is impossible if the data is too large to visit multiple times. Designing an algorithm that processes every record only once and that has to perform under substantial memory and time constraints (a streaming or “on-line” algorithm) yields a different scalability challenge. Many methods require the specification of a set of candidate words to track, or only track hashtags to reduce the data size by filtering. However, the usage of Twitter changes over time, and hashtags only emerge after an event has occurred. Events are even more interesting if they are significant before people have agreed on which hashtag to use. At the same time, single words may not be meaningful enough to identify an event. Thus, we need to design a system capable of tracking all words and word co-occurrences at the same time if we want to produce the best possible results.

6.4 Problem Definition

Given a high-throughput textual data stream (too large to be effectively stored and queried) with geographic metadata, the goal is to identify terms that receive an unusually strong increase in their frequency, as well as the locations and time associated with this unusual increase. The resulting events should not include spurious events that do not exhibit the above characteristics, so that they can be used as unsupervised input for a “breaking news” detection system. The user must not be required to specify a limited set of candidate terms or locations; instead, the system must be able to learn the relevant vocabulary and locations from the data. Therefore, statistical significance scores need to be maintained for every single term as well as every term-location pair. Thereby, heavy hitters (items with high frequency) must not be underestimated. The frequencies of all observed terms and pairs need to be reviewed periodically to determine a specific threshold that defines the point from which a single additional mention causes the corresponding item to be treated as an event.

For each observed single term $t_i \in T$, as well as each term-location pair $(t_i, l_j) \in T \times L$, the number of mentions within a defined time window (say, 1 hour) must be monitored. Because the word tokens extracted from our textual data are long-tail distributed (recall Zipf’s law from Section 3.4), we can use such an approximation without losing too much accuracy (see the saturation experiment in Section 8.2.4).
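A brief sketch of the monitored items per message (in Python; the location tokens are assumed to come from a lookup of the administrative hierarchy for the message’s coordinates, cf. Section 6.6):

```python
from itertools import product

def monitored_items(terms, location_tokens):
    singles = [(t, None) for t in set(terms)]
    pairs = list(product(set(terms), location_tokens))
    return singles + pairs

terms = ["earthquake", "shaking"]
locations = ["Napa", "California", "United States"]  # city, state, country level
for item in monitored_items(terms, locations):
    print(item)
```

Each of these items then receives its own frequency count and significance statistics within the time window.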

6.5 Related Work

In the scientific literature one may find a broad variety of applications tailored to specific tasks like the detection of earthquake events [157, 65] and the prediction of trends [106, 6, 27]. As we previously discussed in Section 5.3, most related work on event and emerging topic detection does not use geographic metadata. In this section we want to highlight those publications that focus on geographical and location-based data coming from textual streams. Twitter itself publishes trending topics for a manually curated list of about 500 locations, with a limited choice of cities in each country. The exact algorithm used is not published, but it involves a “velocity” and does not include topics like #justinbieber which are always frequent.2 We assume it is based on a relative increase in frequency at a predefined location, i.e. the geographical information is used as a filter to split the data, but not in the actual analysis. Sakaki et al. [157] monitor Twitter users as sensors to detect earthquake and typhoon events. Signal processing algorithms such as Kalman filtering and particle filtering are used to handle the noisiness and to estimate the location of the earthquake. However, this approach requires that the user specifies query terms to select the topic of interest (e.g. Q = {earthquake, shaking}). Crowd behavior is used to detect geo-social events in [115]. They partition the world into smaller regions of interest (RoI). For each RoI, they monitor the number of tweets, the number of users, and crowd movement. When unusual features are measured, tweets belonging to the corresponding RoI are reported as an emerging geo-social event. For performance reasons they avoid splitting regions into suburban areas, as “suburban areas will consume considerable unnecessary monitoring costs”. Furthermore, their approach does not track the individual statistics of single words (or even pairs) and thus may miss events that do not match their 6-hour granularity. Points of interest (POIs) are analyzed in [123] to identify knowledgeable expert users for predefined topics and POIs. This approach assumes that topics and POIs are reliably identified beforehand. Another approach [187] uses machine learning (e.g. Naive Bayes) to detect geo-spatial events. To keep the overall system efficient, they reduce the monitored locations by pre-selecting “a set of tweets based on their geographical and temporal proximity”. The motivation of Cataldi et al. [46] is similar to ours. They “believe that the flow of information directly rises in the geographical origin of the event and expands its influence proportionally to its global importance”. However, they only consider the temporal aspect of tweets from communities and do not explicitly aggregate the individual locations. In addition, they cannot detect when two keywords that have already been frequent before are now frequently co-mentioned. Kim et al. [92] use predefined topics such as weather, TV and sport to build a state-level correlation matrix.

2 Source: https://support.twitter.com/entries/101125

Kim et al. [92] use predefined topics such as weather, TV, and sport to build a state-level correlation matrix. To determine “hot topics”, only a very small set of 9 social topics was analyzed. The article is not very clear on how these topics were determined “semi-automatically” by looking at a term’s frequency ratio $r_t^w = (f_t^w - f_{t-1}^w)(f_t^w + f_{t-1}^w)^{-1}$, where $f_t^w$ denotes the term frequency of w at time t. For each topic, they build an adjacency matrix using Pearson’s correlation coefficient on the U.S. state level.

Wang et al. [191] identify local frequent keyword co-occurrence patterns, but do not determine emerging or trending ones. A further drawback is the use of the Z-curve, which leads to boundary effects, as discussed and compensated for in Section 6.6.

For a spatially limited subset it can still be feasible to count the number of users for every word and every location. Such approaches were used in [3, 4], but they do not scale to large data sets. EvenTweet [4] first identifies keywords by exact counting and comparison to historical information, then analyzes the spatial distribution of every keyword. For this, it needs to store large amounts of data and cannot operate on a live stream. Other systems like Jasmine [192] solely focus on detecting local events and would thus miss trends of global importance like the Ebola virus outbreak, the Super Bowl, or the acquisition of WhatsApp by Facebook.

Our approach is motivated by GeoScope [41], which tries to solve a similar problem (detecting events in geo-tagged tweets), but our algorithm is built upon an efficient significance model (discussed in Chapter 5) to overcome the limitations of GeoScope. GeoScope also uses Count-min sketch data structures [58] for approximate counting, but uses such tables independently to identify the most frequent locations and the most frequent hashtags (the most frequent words would be stop words, hence the need to restrict their approach to hashtags; this is not a problem for our significance-based approach). An event in GeoScope is required to both have an unusually frequent term at a single location and an unusually frequent single location for this term. This works well for extreme events that are observed in a single location only; however, events of global scale like the Super Bowl, which is watched all over the world, will not be recognized as events. When the thresholds are decreased enough to capture less frequent topics and locations, the number of reported results grows exponentially. In our approach, we evaluate significance instead of frequency, and thus do not observe such spurious combinations. In their system, events are detected within a time window that must be specified by the user either quantitatively (e.g. 10^6 pairs) or temporally (e.g. 1 hour). Within this user-specified time window, the frequencies of all observed location-topic pairs are tracked. They consider a location-topic pair (l_i, t_x) as “significantly correlated” if at least a fraction φ of all mentions from this location l_i are about the same topic t_x, and if conversely at least a fraction ψ of all mentions about topic t_x are from location l_i. As a result of this definition, topics that are discussed in many locations around the world within the same time window will not be considered events. Hence, events of global importance will neither be detected nor reported to the user.
We will discuss this effect later in Section 8.3 when we evaluate our results. When implementing their approach, further limitations of GeoScope were revealed. Firstly, they approximate coordinates from geo-tagged tweets solely at a city level. Similar to the Z-curve approach of Wang et al. [191], this causes boundary effects when events are not tied precisely to a single city, but spread over multiple cities or happen just outside of a city (e.g. at a stadium or airport). In Section 6.6 we discuss in detail how our system handles such conditions. Secondly, they monitor only cities that are at least “θ-frequent”. Because they use θ = 0.005 in their experiments, the maximum number of tracked cities is ⌈1/θ⌉ = ⌈1/0.005⌉ = 200. For each of these up to 200 frequent cities, they track the top “ψ-frequent” terms (up to ⌈1/ψ⌉ per city; ψ = 0.05, thus up to 20 topics per city). Similarly, they limit the number of topics tracked to ⌈1/φ⌉, which with the suggested parameter φ = 0.05 limits the tracking to 20 topics. Thirdly, they only consider hashtags as topic candidates; otherwise the top list would be filled completely with domain-specific stop words such as “follow” and “tweet”. Our approach does not have this restriction, and does not suffer much from stop words in the data (although it is more efficient to remove known stop words as early as possible). For maintenance, they need to remove expired tweets from their counters, and thus need to keep a buffer of all tweets within the desired window size, which means additional memory cost. Again, our approach does not need this, as exponential aging handles the expiry of old data more efficiently.

6.6 Symbolic Representation of Location

We employ two different methods at the same time to generate a symbolic representation of the tweet location: grid-based token generation (Section 6.6.1) and token generation based on administrative boundaries (Section 6.6.2).

6.6.1 Grid-based Token Generation

The first token generator is based on coordinate grids spaced at every even degree of latitude and longitude, which is an empirically chosen tradeoff between precision (the grid must not be too coarse) and abstraction: a too fine resolution reduces the chance of seeing a statistically significant number of mentions in the same location, while a too coarse resolution results in too many false positives. In order to avoid boundary effects (where a grid boundary would split a city into two separate grid cells), we use three overlapping grids, offset by one third and two thirds of the grid width. By the pigeonhole principle (as proven in [47]), we can guarantee that for every point there exists at least one grid where it is not close to the cell border. Unless too close to the poles, we can thus guarantee that other points within a radius of about 20 miles are in the same cell, while events farther than 200 miles apart are in different cells.

Figure 6.1: Grid tokens for a location in Washington, DC. Background © OpenStreetMap contributors, tiles by Stamen Design.

In between these distances, we still have a high chance that at least one grid cell detects the event. Figure 6.1 visualizes the geographic grid tokens produced for a location in Washington, DC. This approach is an improvement over both the Z-curve approach used by Wang et al. [191] and the grid-based approach of Abdelhaq et al. [3]; these approaches would cut, e.g., Greenwich into two halves. Our approach does not suffer from such boundary effects. By the pigeonhole principle, we can guarantee that for every point there exists a grid cell such that the location is at least 1/6 of the grid width away from the cell boundary. If a point is close to two grid boundaries (as visualized in Figure 6.2b), it must be farther away from the boundary in the third grid. In most densely populated areas of the earth (this does not hold at the poles), 1/6 of 2° corresponds to 18 to 37 km: the earth has a circumference of 40,007–40,075 km, so 1° ≈ 111 km, and two thirds of an equatorial degree are therefore about 37 km. Most populated areas of the earth are within ±60° latitude (the largest city outside this interval is Helsinki, Finland); at ±60°, this distance has shrunk to 18 km. The maximum distance within a cell of 2° is 314.5 km; the average grid cell size is about 30 to 50 miles. In other words, we can guarantee that if an event happens within an area of 20 miles in diameter, there is at least one grid cell that contains all of its positions; events farther than 200 miles apart are guaranteed to be in different grid cells; and in between these distances, we still have a high chance that at least one grid cell detects the event.
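To make the construction concrete, the following is a minimal Java sketch of such a three-grid token generator, assuming a grid width of 2 degrees and the token format shown in Figure 6.4; class and method names are illustrative, not the actual implementation:

    /** Minimal sketch of the three overlapping grids (grid width 2 degrees). */
    public final class GridTokenizer {
        private static final double W = 2.0; // grid width in degrees

        public static String[] gridTokens(double lon, double lat) {
            String[] tokens = new String[3];
            for (int i = 0; i < 3; i++) {
                double off = i * W / 3.0; // grids shifted by 0, 1/3 and 2/3 of the width
                long x = (long) (W * Math.floor((lon + off) / W));
                long y = (long) (W * Math.floor((lat + off) / W));
                tokens[i] = "geo" + i + "!" + x + "!" + y;
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Location from Figure 6.4 (Miami):
            // expect geo0!-82!24, geo1!-80!26, geo2!-80!26
            for (String t : gridTokens(-80.1890, 25.7902)) {
                System.out.println(t);
            }
        }
    }

For the example coordinate from Figure 6.4, this reproduces the three tokens shown there: the point is close to a cell border in grid 0, but well inside the cells of grids 1 and 2.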

6.6.2 Tokens based on Administrative Boundaries

For aggregation at coarser resolutions, we employ a different strategy: we use a fast reverse geocoding library.3 We extracted polygons from OpenStreetMap

3 https://github.com/kno10/reversegeocode

Figure 6.2: Pigeonhole principle with 2-dimensional grids. (a) Three overlapping grids; (b) worst-case areas. Any point that is close to an intersection of the green (solid) and blue (dashed) lines lies in the shaded areas, more than 1/6th of the grid width away from the red (dotted) borders. A point can only be close to two of the three grid borders at a time, since they are evenly spaced.

and built a fast lookup table with 0.01 degree resolution, containing about 60,000 regions while using just 30 MB of RAM and allowing constant-time lookups at a rate of 1.5 million operations per second on a modern CPU. Countries often include their territorial waters, so even tweets from boats close to the shore will be assigned to the appropriate region. An overview of the resulting regions is visualized in Figure 6.3. We reverse-geocode each coordinate to a hierarchy of regions, such as Miami, Florida, USA, as seen in Figure 6.4. We use each region returned by the lookup as a geographic token for our process; this hierarchy allows us to detect events at different levels such as cities, counties, states, and countries.4 Table 7.1 lists the most popular regions in this hierarchy to demonstrate the geographic inequality of the data set.

Figure 6.3: Reverse Geocoding with OpenStreetMap Data

4 OpenStreetMap currently does not provide polygons for continents and oceans.

Text: “Presenting a novel event detection method at #SDM2016 in Miami :-)”
Word tokens: present (stemmed), novel, event_detection (entity), method, #sdm2016 (normalized), Miami (entity), :) (normalized); stop words removed.

Location: -80.1890, 25.7902
Geo tokens: geo0!-82!24, geo1!-80!26, geo2!-80!26 (overlapping grid cells); geo!United_States_of_America, geo!Florida, geo!Miami (hierarchical semantic location information)

Figure 6.4: Tweet tokenization and generation of geo tokens for an example tweet

6.7 Incorporate Geographical Information

In order to integrate textual data and geographic information, we map both into a sequence of tokens. Textual data is processed by Porter stemming, stop word removal, unification of emoji and hashtags, and a simple entity extraction approach trained on Wikipedia.5 This provides us with a scalable method that avoids boundary effects and retains enough data to achieve statistical significance. Figure 6.4 shows the tokenization process for an example tweet. After tokenization, documents are logically represented as a set of tokens, where each token either corresponds to a word token w or a location token l.
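As a rough illustration of the text side of this pipeline, the following sketch uses Lucene’s EnglishAnalyzer (which combines standard tokenization, stop word removal, and Porter stemming) as a stand-in; the emoji/hashtag unification and the Wikipedia-based entity extraction described above are omitted, so this approximates, but is not identical to, our actual preprocessing:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public final class TweetTokenizer {
        /** Stem and stop-word-filter the tweet text into word tokens. */
        public static List<String> wordTokens(String text) throws IOException {
            List<String> tokens = new ArrayList<>();
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream stream = analyzer.tokenStream("text", text)) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    tokens.add(term.toString());
                }
                stream.end();
            }
            return tokens;
        }
    }

The geo tokens of Section 6.6 are then simply appended to this token list to form the document representation.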

6.8 Significance of Location-Events

Given a word token w and a location token l, we use our statistical model introduced in Section 5.4.1 to measure the significance of the event. Computing these exact statistics naively for every pair (w, l) is infeasible due to the large number of different words and locations occurring in a fast-flowing social media stream. As discussed in Section 5.4, we developed an approach derived from Count-min sketches to efficiently maintain these statistics for huge data sets via hashing. An important benefit of our approach over prior work such as GeoScope is the normalization with respect to demographic and geographic differences: the observed counts are standardized by subtracting the expected rate and normalized by the (exponentially weighted) standard deviation. The resulting z-score is a scale-free factor measuring how unusual the observed frequency is. This allows the statistic to work for highly populated areas as well as for much less populated areas: in New York the expected rate will be much higher than in Twitter-agnostic Germany. As seen in Table 7.1, cities like Istanbul, New York City, Tokyo, and London have many more tweets than all of Germany. An approach that does not take local tweet density into account would be unable to detect an event in a city in Germany. This is a major improvement over, e.g., GeoScope, which effectively only monitors the globally most popular cities, and over the approach used by Twitter for their trending topics.

5 A sequence of words that is frequently used to link to the same article on Wikipedia is considered an entity. We use the data available at https://github.com/kno10/WikipediaEntities

6.9 Updating the Moving Averages

To compute the exponentially weighted moving averages for each pair (w, l), we need the previous value (stored in a SigniTrend sketch), the constant aging factor α, and the current frequency of each pair. To obtain a reliable frequency estimate, we need to aggregate the data over a large enough time window to avoid random fluctuations, which would cause false alerts and a too high variance. Experimentally, 15–60 minutes work well, containing about 100,000 usable (non-spam, non-duplicate, geotagged) documents. The statistics are slow-moving averages, so this update rate is sufficient. A delay of one hour, however, is not acceptable for detecting “breaking news”. Therefore, we split the process into two parts: a batch interval (1 minute) for event detection and an “epoch” interval (15–60 minutes) for statistics updates.

In the main batch interval of 1 minute, tweets are tokenized and stemmed, and spam and duplicates are removed. We can afford to count all frequencies in this batch due to its small size. This data is then continuously fed into a count-min aggregation sketch; at the end of the epoch, the table with the moving averages is updated, and we start a new sketch for the new epoch. The old sketch is then used as detection sketch during the next epoch. We continue to update its counts as in the aggregation sketch, but since it is pre-filled with the counts of the previous epoch, we obtain a less noisy estimate of the average frequency, because it contains both the new data and that of the previous epoch. In particular during the first minutes of an epoch, we obtain much more reliable frequency estimates this way and reduce the number of false positive alerts substantially. At the end of the next epoch, this sketch is discarded and replaced with the next aggregation sketch. The overall process is summarized in Algorithm 3.

By using moving averages, we also do not need to remove individual tweets from our statistics; the data “disappears” due to exponential weighting. Other approaches such as GeoScope [41] have to maintain a buffer storing the most recent tweets to be able to remove them from their counts when they have expired. In our approach, data expiry is solved by the exponential weighting of the averages.

In addition to the tables for counting, moving average, and moving standard deviation employed by our event detection algorithm from Chapter 5, we also maintain a threshold table which contains the alerting thresholds and the last reported values ν. The 1-minute batch interval is used for counting, and the resulting frequencies (aggregated over the active epoch) are compared to these thresholds. If the observed count exceeds the previously reported value by a significance threshold τ, the event is immediately reported. Every time it exceeds the previously reported value by another τ standard deviations, it is reported again. At the end of each epoch, the moving average table is updated (the longer update interval yields more reliable frequency estimates), and the last reported values ν are exponentially decreased to allow recurrent events.

Algorithm 3: Event detection algorithm

    aggregation sketch ← new count-min sketch
    detection sketch ← new count-min sketch
    threshold table ← new threshold sketch
    statistics table ← new SigniTrend sketch

    while new batch D do
        foreach document d in batch D do
            Tokenize text of document d
            Add geo-tokens for document d
            foreach word w or pair (w, l) do
                Compute h hash codes
                Update aggregation sketch
                Update detection sketch
            end
        end
        Compare detection sketch and threshold table   /* see Figure 6.5b */
        Report detected events
        if end of epoch interval then
            Update statistics table using aggregation sketch   /* see Figure 6.5c */
            detection sketch ← aggregation sketch
            aggregation sketch ← new count-min sketch
            Expire old events from threshold table
        end
    end
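As a hedged companion to Algorithm 3, the following Java fragment sketches the sketch rotation at the end of an epoch; plain count arrays stand in for the count-min tables, and all names are illustrative rather than the actual implementation:

    /** Sketch of the aggregation/detection rotation described above. */
    final class EpochRotation {
        long[] aggregation = new long[1 << 20]; // counts of the running epoch
        long[] detection = new long[1 << 20];   // pre-filled with the previous epoch

        void add(String token) {
            int b = bucket(token);
            aggregation[b]++; // clean statistics for the epoch update
            detection[b]++;   // less noisy estimate: previous + current epoch
        }

        void endOfEpoch() {
            // the finished epoch's counts become the new detection baseline
            detection = aggregation;
            aggregation = new long[1 << 20];
        }

        private int bucket(String token) {
            return token.hashCode() & ((1 << 20) - 1);
        }
    }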

6.10 Significance Computation

Given a word token w and a location token l, we use a classic model from statistics to measure the significance: let f_t(w, l) be the relative frequency of this pair of tokens within the documents D_t = {d_1, ..., d_n} at time t, i.e.

$$f_t(w, l) := \frac{|\{\, d \in D_t \mid w \in d \wedge l \in d \,\}|}{|D_t|}$$

Then we can use the series of previous values f_1, ..., f_{t-1} to compute an estimated value and a standard deviation. To facilitate aging of the data and to avoid having to store all previous values, we employ the exponentially weighted moving average EWMA[f(w, l)] and moving standard deviation EWMVar[f(w, l)]. With these estimates, we can compute the z-score of the frequency:

$$z_t(w, l) := \frac{f_t(w, l) - \mathrm{EWMA}[f(w, l)]}{\sqrt{\mathrm{EWMVar}[f(w, l)]}}$$

To avoid instability if a pair (w, l) was not seen often before, we employ a bias term β as introduced in [163]:

$$z_t(w, l) := \frac{f_t(w, l) - \max\{\mathrm{EWMA}[f(w, l)], \beta\}}{\sqrt{\mathrm{EWMVar}[f(w, l)] + \beta}}$$

The term β is a Laplace-style smoothing term, motivated by the assumption that there might have been β·|D| documents that contained the term but have not been observed due to incomplete data. For Twitter, the suggested value for this term is β = 10/|D|: intuitively, we consider 10 occurrences to be an observation by chance.
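For illustration, here is a minimal Java sketch of this biased z-score, together with the end-of-epoch update of the moving statistics shown later in Figure 6.5c; the class is a simplified stand-in, not the actual SigniTrend implementation:

    /** Simplified sketch of the biased significance model of Section 6.10. */
    final class SignificanceModel {
        final double alpha; // exponential aging factor
        final double beta;  // bias term, e.g. 10.0 / |D| for Twitter

        SignificanceModel(double alpha, double beta) {
            this.alpha = alpha;
            this.beta = beta;
        }

        /** Biased z-score of the current relative frequency f. */
        double zScore(double f, double ewma, double ewmvar) {
            return (f - Math.max(ewma, beta)) / Math.sqrt(ewmvar + beta);
        }

        /** End-of-epoch update of the moving statistics (cf. Figure 6.5c). */
        void epochUpdate(double[] ewma, double[] ewmvar, long[] counts, long numDocs) {
            for (int i = 0; i < ewma.length; i++) {
                double delta = counts[i] / (double) numDocs - ewma[i];
                ewma[i] += alpha * delta;
                ewmvar[i] = (1 - alpha) * (ewmvar[i] + alpha * delta * delta);
            }
        }
    }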

6.11 Hash Table Update Process

In this section we show how we extend our hash table update process (as introduced in Section 5.4) to handle location data as well as text data. For every document, each word w and word-location pair (w, l) is hashed using h different hash functions into a (simplified: 1-dimensional) Count-min sketch. The minimum value of the hash buckets yields the current approximate count, which is then incremented and written back to every bucket that had the minimum value (a bucket that had a higher value has a collision with a more frequent term; this is the same as writing only if the new value is larger than the previous value). This procedure is visualized in Figure 6.5a and can be distributed over multiple hosts if necessary. As this step needs to be performed for every document, it needs to be very fast. If we update a value, we also cache the last word or word-location pair (w, l) for this bucket for reporting. Since this is at most one per bucket, we have an upper limit on the number of candidates.

We use our batch interval of 1 minute to check for alerts, a timing interval present throughout our processing toolchain for efficiency reasons. We store cached values of √EWMVar and the last reported threshold ν_i for each hash bucket i. If the current z-score exceeds the last reported value ν_i by the reporting threshold τ, we report an event using the cached word (if it was reported before, it will thus be re-reported at approximately 2τ, 3τ, ... to track further growth in popularity).

At the end of each epoch, the main statistics table is updated. The observed counts are normalized by the number of documents |D| in the current epoch, and the moving average and moving variance are updated. The last reported values ν_i are decreased to slowly forget earlier trends. These operations are designed so that they can be vectorized to efficiently update the statistics table, as seen in Figure 6.5c. The main reason to perform these updates in epochs is to have stable aggregate statistics: f_t(w, l) will only be reliable once we have seen enough documents |D| since the last epoch.
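The conservative update rule just described can be made concrete with a small, hedged Java sketch; class, field, and hash-function choices are ours, not the dissertation's actual implementation:

    /** Sketch of the conservative count-min update with per-bucket token cache. */
    final class CountMinSketch {
        private final int[] counts;
        private final String[] lastSeen; // cached token per bucket, for reporting
        private final int h;             // number of hash functions
        private final int mask;          // table size is a power of two

        CountMinSketch(int bits, int h) {
            this.counts = new int[1 << bits];
            this.lastSeen = new String[1 << bits];
            this.h = h;
            this.mask = (1 << bits) - 1;
        }

        private int bucket(String token, int i) {
            // simple double hashing; any family of h hash functions works here
            int a = token.hashCode();
            int b = (Integer.rotateLeft(a, 16) * 0x9E3779B9) | 1; // odd step
            return (a + i * b) & mask;
        }

        int increment(String token) {
            int min = Integer.MAX_VALUE;
            for (int i = 0; i < h; i++) {
                min = Math.min(min, counts[bucket(token, i)]);
            }
            for (int i = 0; i < h; i++) {
                int b = bucket(token, i);
                if (counts[b] == min) {   // write only where the minimum was
                    counts[b] = min + 1;
                    lastSeen[b] = token;  // cache the token for later reporting
                }
            }
            return min + 1; // current approximate count
        }
    }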

Figure 6.5 illustrates the hash table maintenance on a small example with ten buckets:

(a) Count-min sketch update (new token n): the minimum of the token's buckets is read, incremented, and written back to every bucket that held the minimum, and the cached "last seen" token of those buckets is set to n. For example, old counts 3 2 2 5 7 7 3 5 7 5 become 4 2 2 5 7 7 4 5 7 5 when a token with minimum count 3 is inserted.

(b) Check thresholds for new events: read the minimum count, compute f_i = x_i / |D|, and report an event if

    (f_i − max{EWMA_i, β}) / √(EWMVar_i + β) > τ + ν_i

(c) Vectorized statistics table update:

    Δ ← x/|D| − EWMA
    EWMA ← EWMA + α·Δ
    EWMVar ← (1 − α)·(EWMVar + α·Δ²)

Figure 6.5: Hash table maintenance

Chapter 7

Data Sets

For our experiments we focused on three data sets that exhibit very different characteristics, as can be seen from Table 7.2 and Table 7.3. The first data set, consisting of news articles, contains far fewer documents, but the documents are obviously much longer than those of the Twitter data set. The longer paragraphs of the news articles then yield even more word pairs than the Twitter data. But there is also a more subtle difference beyond size: while Twitter contained 25 million unique tokens, 99% of these occurred fewer than 14 times in the data set. In the news data, the vocabulary was used much more evenly, with the top 1% of words occurring 3476 times or more.

7.1 News Articles

We used a web crawler to index news articles from popular international news agencies such as Reuters and Bloomberg as our first data source. We limited the indexing process to the year 2013 plus a window of 10 days. As this news content has restrictive copyright requirements, we must not store the full documents; we only use the data to perform trend detection and to construct a search index for refinement, but we are able to direct the user to the full articles on the publisher's website as desired. Nevertheless, the news article corpus is very interesting, because it is editorially managed and of a different nature than Twitter: while Twitter users tend to re-share the exact same text to their followers, news agencies try to avoid duplication. For some trending topics, we do however see "update" documents in this data source. For the use case of a corporate user interested, e.g., in financial trends, news agencies may be the more interesting data source than Twitter. Methods such as enBlogue [21] cannot be applied to this data set without modifications, as they rely on hashtags to seed their search process. Our method does not have such restrictions, but monitors all frequent word pairs.

7.2 Stack Overflow

Stack Overflow is a Q&A website, which publishes its data under a Creative Commons license.1 We analyzed the contents of the main website for the years 2010 to 2013, totaling 5.9 million questions. There were few significant trends in this data set; they correspond to important software releases (OS X Mavericks, OS X Mountain Lion, iOS 6, iOS 7, TypeScript), the facebook.stackoverflow.com launch, and holidays (New Year's, Christmas), so we omit details for brevity (they are included in the online demonstration).

7.3 Twitter

Twitter data is even noisier and larger in volume, but it is also prone to bias due to its non-representative user base: in particular, teen idols make up a large share of Twitter traffic. We used Twitter's public streaming API (approximately 1% of tweets) to process about 279 million tweets over a period of 114 days. For analysis purposes we only used English-language tweets, removed all retweets (about 10%), and used a simple near-duplicate detector to remove obvious spam and bulk messages (about 1.5%), which yields 94 million tweets to analyze. When including #hashtags and @usermentions as tokens, the results were dominated by teen idols, who are massively "spammed" by their fans. But as these tweets include few other words (except stop words such as ilysim), they become essentially empty when skipping these tags. Without #hashtags and @usermentions, the analysis becomes more focused on the textual content, and the results thus became much more interesting (as seen in Table 8.2).

7.3.1 Geographic Distribution

For our geo-spatial event detection experiments we additionally collected tweets from Twitter's streaming/track endpoint with the locations parameter. This data set contains 5–6 million tweets per day (slightly more than the popular statuses/sample endpoint; no retweets, and all tweets are geo-tagged). It contains a much higher rate of geo-tagged tweets, and by interpolation from the sample we estimate that we might be receiving up to 1/3 of all geo-tagged tweets. The data period is September 10, 2014 to February 19, 2015. Due to the Twitter terms of service, we are not allowed to make this data set available for download. Table 7.1 gives the most frequent locations in the data set, both in millions of tweets and as the relative share of the total data set. Regions such as Germany and Berlin have an unusually low Twitter adoption rate, while other countries such as Turkey are overrepresented. This demonstrates the need to look for a relative increase in volume, and to use the geographic location for normalization.

1 https://archive.org/details/stackexchange

Table 7.1: Geographic Distribution of Twitter Data (September 10, 2014 to February 19, 2015)

Region               Mil.   Share
U. S. A.             287.7  25.4%
Brazil               165.6  14.6%
Argentina             73.6   6.5%
Indonesia             72.0   6.4%
Turkey                59.3   5.2%
Japan                 52.4   4.6%
United Kingdom        49.3   4.4%
São Paulo             40.6   3.6%
England               40.3   3.6%
California            38.6   3.4%
Rio de Janeiro        35.9   3.2%
Spain                 34.8   3.1%
Buenos Aires          33.9   3.0%
Philippines           32.1   2.8%
France                31.8   2.8%
Malaysia              31.3   2.8%
Texas                 31.2   2.8%
Marmara Region        30.1   2.7%
Istanbul              20.3   1.8%
Rio Grande do Sul     20.0   1.8%
Thailand              19.4   1.7%
Russian Fed.          18.2   1.6%
Mexico                17.6   1.6%
Florida               17.4   1.5%
New York              17.3   1.5%
Kanto (Japan)         17.2   1.5%
West Java (Ind.)      16.9   1.5%
Saudi Arabia          15.4   1.4%
Colombia              14.1   1.2%
Selangor (Mal.)       14.1   1.2%
Ohio                  13.6   1.2%
Kinki (Japan)         13.5   1.2%
Santa Catarina        13.1   1.2%
Los Angeles           12.8   1.1%
Ile-de-France         12.5   1.1%
Minas Gerais (B)      12.3   1.1%
Porto Alegre (B)      12.2   1.1%
Manila (Phil.)        11.9   1.1%
Jakarta (Indon.)      11.6   1.0%
Pennsylvania          11.3   1.0%
Aegean Region         10.6   0.9%
Italy                 10.1   0.9%

Selected further regions < 1%:

Region               Mil.   Share
Canada                9.4   0.83%
Portugal              9.3   0.83%
Uruguay               9.0   0.80%
London                7.6   0.67%
New York City         7.5   0.66%
Tokyo                 7.4   0.66%
Chile                 7.0   0.62%
Bangkok               6.8   0.60%
Los Angeles (City)    6.3   0.6%
Scotland              5.5   0.48%
The Netherlands       4.8   0.43%
India                 4.8   0.43%
Egypt                 4.2   0.37%
Australia             4.0   0.35%
Ukraine               3.8   0.33%
Germany               3.5   0.31%
Nigeria               2.6   0.23%
Pakistan              1.6   0.14%
China                 1.2   0.11%
Berlin                0.5   0.05%
Vietnam               0.2   0.02%
Iran                  0.2   0.01%
Bangladesh            0.1   0.01%

[Table 7.2: Data set statistics after stop word removal, and Table 7.3: Vocabulary statistics after stop word removal, for the News (424,704 documents), Stack Overflow (5,932,320 documents), and Twitter (94,127,149 documents) data sets. Among the clearly recoverable figures: Twitter has 25,581,022 unique words, of which 99% occur fewer than 14 times, while in the news data the top 1% of words occur 3,476 times or more.]

Chapter 8

Experiments

In this chapter we demonstrate the scalability of our system as well as its ability to detect statistically significant trends in real data sets, without the need to reduce the candidate set. Further, we show how the incorporation of location data can improve results by detecting additional location-specific events that would not have been found otherwise. For all data sets, we employed best practices for text mining such as tokenization and language-specific stemming, as introduced in Section 3.6. We also removed a standard set of English stop words as well as domain-specific stop words (e.g. retweet for Twitter) for efficiency. The implementation of the main analysis (not including the web crawler and preprocessing) was done in Java using a single thread and Apache Lucene1 as backing index (for details see Section 8.2). To save disk space and comply with copyright requirements, we did not store complete documents for the news data set, but only inverted lists, the URL, and the headline. For Twitter data, we used the original tweet as headline.

8.1 Manual Analysis of Real World Trends

As previously discussed in Section 3.9 of Part I, there exists no ground truth with labeled events. Thus we start our experiments chapter with a manual analysis: Table 8.1 examines the top 50 most significant trends in the news data set for the year 2013. For brevity, we omitted economic and sport news outside the top 10. As we can clearly see, finance and sports dominate the results due to their extensive coverage in these data sources. In particular, "game days" of sports leagues trend every week due to new word pair combinations. This may require future work to automatically classify the detected trends into categories such as sports. Nevertheless, a number of events also made it into the top 40, such as the Boston Marathon bombing discussed before, the Algeria gas plant attack, and Syria's use of chemical weapons. In Table 8.2 we discuss the top 40 trends detected in the Twitter data set.

1 Open source, https://lucene.apache.org/

[Figure 8.1, panels (a)–(d), February 8–15: (a) absolute occurrences of "Justin", "Bieber", and "Justin + Bieber"; (b) exponentially weighted moving average of the relative occurrences; (c) exponentially weighted moving variance; (d) significance scores.]

Figure 8.1: Even Justin Bieber can trend on Twitter – referring to a Bieber look-alike kissing a man.

Table 8.1: Excerpt of top 50 trends on news data set 2013 (dominated by economy, sports and politics). Significance scores range from σ = 58 down to σ = 6.6; the keyword excerpts of the printed table are omitted here. Detected trends include:
- a Reuters artifact (233 research alerts)
- Obama nominates Yellen as U.S. Fed Chief
- Berkshire Hathaway and 3G Capital buy Heinz
- the yen rises because of turmoil in Egypt and Portugal
- Boston Marathon suspect manhunt
- Boston Marathon bombing
- four-month high oil, strong durable goods
- baseball end-of-season reports
- Bernanke defends bond-buying; Italian stalemate
- Bernanke reassures euro bond markets
- Boston Marathon attack details
- EU Cyprus bailout
- Algerian gas plant attack
- Microsoft buys Nokia's handset business
- Cameron promises referendum to leave EU
- potash sector takeover
- football game days
- ECB announces early repayments
- breakthrough on Iran nuclear activity
- Kerry comments on Syria over chemical weapons
- Moore, Oklahoma hit by deadly tornados
- German election success for Angela Merkel
- U.S. government partial shutdown
- escalations in Syria
- Boston Marathon suspects captured
- surge in iPhone sales
- Nelson Mandela died
- G20 finance ministers in Moscow
- Margaret Thatcher died
- Boeing Dreamliner catches fire

Table 8.2: Top 40 trends on Twitter data (dominated by celebrities and teen idols); keyword excerpts omitted here.

Date  | σ   | Explanation
03-06 | 174 | Rapper Lil Boosie released from jail early
05-28 | 154 | Civil rights activist Dr. Maya Angelou died
05-12 | 127 | Solange, Jay Z and Beyonce elevator incident
03-03 |  98 | Ellen's Oscar selfie and pizza
02-10 |  91 | The Walking Dead TV series midseason premiere
05-22 |  76 | Band 5SOS changed its name on Twitter to "Ewok Village"
03-21 |  76 | Mercer surprise win over Duke in March Madness
05-24 |  73 | Champions League final
02-19 |  64 | The 2014 British Phonographic Industry's annual pop music awards (#BRITs2014)
04-07 |  63 | Peaches Geldof died of heroin
04-15 |  61 | Blood moon (lunar eclipse)
05-05 |  60 | Snow in May and the "shovel girl fight" viral video
05-24 |  51 | Champions League final
04-14 |  51 | Zac Efron shirtless at the MTV movie awards
03-17 |  50 | St. Patrick's Day
05-06 |  50 | Mimi Faust and Nikko announce a private video
03-17 |  49 | Lizzie in the series The Walking Dead
02-19 |  48 | Talking Angela hoax
04-29 |  47 | Donald Sterling banned for life from the NBA
05-15 |  46 | Rumor of Kris (of Exo-M) parting
02-14 |  44 | Valentine's Day greetings
05-10 |  43 | Austria wins the Eurovision song contest
03-03 |  43 | Oscar selfie: "If only Bradley's arm was longer. Best photo ever"
02-24 |  43 | Actor Harold Ramis died
04-10 |  42 | National Siblings Day
03-09 |  42 | Arsenal vs. Wigan Athletic in the FA cup
04-13 |  41 | Matthew Espinosa follow spree (#mattTo1Mil)
03-17 |  41 | 4.4 earthquake strongly felt in Los Angeles
05-09 |  41 | Apple buys Dr. Dre's Beats
05-10 |  40 | Eurovision Song Contest
05-06 |  38 | Love & Hip Hop Atlanta premiere with Bambi
03-19 |  38 | Ezra gets shot in the series "Pretty Little Liars"
04-16 |  38 | Winning goal in the Copa del Rey final
03-03 |  38 | No Oscar for Leonardo DiCaprio, again
05-06 |  37 | Love & Hip Hop Atlanta premiere with Scrappy
05-09 |  36 | Johnny Manziel drafted in the NFL
04-30 |  35 | Actor Bob Hoskins died
04-26 |  35 | Donald Sterling accused of racism (see above!)
02-08 |  35 | Liverpool vs. Arsenal 5-1 football match at Anfield
03-18 |  35 | Allison in the TV series "Teen Wolf"
04-02 |  34 | Chile 8.2 magnitude earthquake
05-21 |  34 | Cleveland Cavaliers win the NBA draft lottery again
05-06 |  34 | Love & Hip Hop Atlanta premiere with Joseline Hernandez
03-14 |  34 | π day (3.14)
02-12 |  32 | Justin Bieber commenting on a look-alike kissing a man (see Figure 8.1)
02-15 |  30 | iPhone and Android game "Flappy Bird" clones hit the app stores
02-20 |  28 | Sochi: Canada wins women's hockey gold
02-18 |  21 | Luke Hemmings (nickname: Luke5SOS) teen idol follow spree
02-17 |  17 | Kyrie Irving gets the MVP (Most Valuable Player) Award in the NBA
02-15 |  16 | Earthquake of magnitude 4.1 in the Pacific Ocean near Canada
02-16 |  15 | Kanye West defends Drake against a Rolling Stone Magazine rant
02-10 |  14 | Start of Season 4 of the TV series "The Walking Dead"
02-19 |  13 | Facebook buys WhatsApp for $19 billion

All of these events are easily verifiable with an internet search. Detected events consist of single words, up to almost complete sentences by Twitter standards. Despite not using retweets, many topics such as the Oscar selfie (the most retweeted tweet so far) still surfaced, because a substantial number of users add a reply text to their tweet. We were surprised to find Justin Bieber trending, but detailed analysis (see Figure 8.1) showed that our method was right: apparently an old image resurfaced of someone looking somewhat like Justin Bieber kissing a man. This led Justin Bieber to comment on "some of the rumors out there", and thousands of his fans replied. The wider and smaller peak in Figure 8.1a on February 13 is the media coverage of this Twitter fad. To ensure that real-world trends are found, we manually evaluate our results against the top 10 lists from Google Trends2. Also, we include the "Most Searched News Stories" from Microsoft Bing3, as seen in Table 8.3.

Figure 8.2: Google Trends chart for some of the top 10 events in 2013: Typhoon Haiyan, Wimbledon, Government Shutdown, and Chinese New Year (source: http://www.google.com/trends)

To demonstrate that our event detection algorithm is also capable of revealing details and sub-events of major events, we analyzed July 8, 2014, when the famous FIFA World Cup football match of Germany versus Brazil took place. Table 8.4 shows all terms related to the names of football players of both teams and their first significant occurrence with σ ≥ 3, and Table 8.5 shows the top-scoring events on that day.

2 http://www.google.com/trends/topcharts
3 http://www.bing.com/blogs/site_blogs/b/search/archive/2013/12/01/eoy.aspx

Table 8.3: Top 10 "Most Searched News Stories" from Microsoft Bing in 2013

Rank | Description
1    | Royal Baby Born
2    | Boston Marathon Bombing
3    | Cleveland Kidnapping
4    | George Zimmerman Trial
5    | Gun Rights
6    | Tesla
7    | Syria
8    | Anthony Weiner
9    | Oil Prices
10   | Fiscal Cliff

Table 8.4: Terms related to the names of football players during the FIFA World Cup football match Germany vs. Brazil in 2014. Each row represents a snapshot of the first observation that caused a score σ ≥ 3.

Event Terms         | UTC Time | Expected | StdDev | Have  | Score
victor              | 10:40:05 | 18.6     | 0.7    | 31.0  | 3.2
mertesacker         | 11:34:14 | 8.3      | 0.6    | 20.0  | 3.2
fred, fred's        | 13:58:45 | 25.9     | 1.1    | 39.0  | 3.0
lahm                | 18:52:31 | 9.8      | 0.6    | 21.0  | 3.0
schweinsteiger      | 18:52:37 | 7.8      | 0.4    | 19.0  | 3.2
boateng             | 18:52:37 | 6.7      | 0.5    | 18.0  | 3.2
khedira             | 18:53:03 | 9.6      | 0.5    | 21.0  | 3.2
hummels             | 18:53:53 | 26.9     | 1.4    | 41.0  | 3.0
kroos, tonikroos    | 18:56:05 | 24.8     | 1.8    | 40.0  | 3.0
klose               | 18:56:28 | 28.4     | 1.3    | 43.0  | 3.2
neuer, manuel_neuer | 18:57:59 | 34.3     | 1.7    | 50.0  | 3.1
müller              | 18:58:53 | 16.0     | 0.9    | 29.0  | 3.2
bernard             | 18:59:34 | 9.0      | 0.6    | 21.0  | 3.3
özil                | 19:12:18 | 31.0     | 1.5    | 46.0  | 3.1
willian             | 19:14:00 | 10.8     | 0.5    | 22.0  | 3.1
hulk                | 19:43:47 | 28.0     | 0.9    | 41.0  | 3.1
luiz, luiz's        | 19:58:06 | 39.0     | 1.6    | 54.0  | 3.0
marcelo             | 20:18:10 | 25.8     | 1.1    | 39.0  | 3.0
alves               | 20:23:05 | 6.0      | 0.3    | 17.0  | 3.3
david               | 20:28:06 | 183.4    | 7.3    | 220.0 | 3.0
dante               | 20:30:10 | 19.0     | 0.7    | 31.0  | 3.1
oscar, oscar11      | 20:43:38 | 67.8     | 2.1    | 86.0  | 3.1
fernandinho         | 20:50:11 | 4.0      | 0.2    | 14.0  | 3.1
gustavo             | 20:50:17 | 5.0      | 0.2    | 15.0  | 3.0
paulinho            | 21:15:01 | 13.0     | 0.5    | 24.0  | 3.0
schürrle            | 21:25:12 | 4.0      | 0.2    | 14.0  | 3.1
draxler             | 21:31:53 | 7.0      | 0.3    | 18.0  | 3.2
podolski10          | 22:23:44 | 31.7     | 2.0    | 48.0  | 3.0

Table 8.5: Top Trends during the FIFA World Cup Football match Germany vs. Brazil in 2014

Term            | Expected (EWMA) | StdDev     | Have    | Score
alfredo         | 18.62           | 0.7967671  | 1201.0  | 296.8591
brazil          | 1322.0          | 50.22544   | 18784.0 | 262.8021
wale            | 19.567549       | 0.9052597  | 829.0   | 197.37753
connor          | 32.0            | 0.9856072  | 874.0   | 195.55894
brazilvsgermany | 131.0           | 4.9894543  | 1943.0  | 194.85014
germans         | 62.0            | 2.1735983  | 1150.0  | 187.79349
penalties       | 83.0            | 2.8322902  | 1315.0  | 184.9214
bravsger        | 84.0            | 3.2925277  | 1400.0  | 184.50682
goal            | 232.0           | 4.2230525  | 1980.0  | 183.1699
thumbs          | 390.04          | 20.97164   | 5480.0  | 182.61885
cam             | 206.0           | 7.3655663  | 2437.0  | 179.54916
brazilian       | 76.0            | 2.719195   | 1229.0  | 177.95421
germany         | 1010.0          | 47.88509   | 11626.0 | 174.07533
halftime        | 4.804388        | 0.2438697  | 459.0   | 137.97313
granger         | 9.0             | 0.46838966 | 481.0   | 132.64427
tammy           | 38.22           | 1.203601   | 646.0   | 132.53519
waka            | 24.40586        | 1.105165   | 590.0   | 130.04485
redeemer        | 21.0            | 0.64333415 | 520.0   | 129.49825
brazil's        | 28.0            | 1.0445048  | 588.0   | 129.4946
tylerjblackburn | 14.7            | 0.5626085  | 487.0   | 127.318016
silva           | 77.0            | 3.3581972  | 978.0   | 126.39942
penalty         | 92.32923        | 3.9420476  | 1085.0  | 126.20825
meek            | 51.04578        | 2.3394494  | 779.0   | 124.43859
luiz            | 39.0            | 1.606864   | 658.0   | 123.8777
score           | 97.0            | 3.0099914  | 959.0   | 123.49586
schurrle        | 6.0             | 0.24659236 | 414.0   | 123.38987
half            | 345.0           | 4.5833983  | 1705.0  | 123.262115
brazilians      | 40.3368         | 1.7761872  | 678.0   | 123.11157
scoring         | 17.0            | 0.6683261  | 488.0   | 122.70974
embarrassing    | 45.177216       | 0.9336876  | 578.0   | 121.49759
german          | 118.0           | 6.0781302  | 1357.0  | 120.78224
worldcup2014    | 126.0           | 5.7381234  | 1312.0  | 118.62226
neymar          | 355.0           | 14.307218  | 2814.0  | 117.89684
brazils         | 10.0            | 0.40031707 | 338.0   | 93.705795
schürrle        | 4.0             | 0.19001406 | 305.0   | 93.188446
germany's       | 22.0            | 0.9485045  | 410.0   | 93.07895
shootout        | 7.0             | 0.35850522 | 320.0   | 91.29343
klose           | 187.0           | 6.790881   | 1218.0  | 88.415276
anthem          | 22.878042       | 0.7083457  | 366.0   | 87.15036
cia             | 21.0            | 0.9055899  | 375.0   | 86.014404
prayforgaza     | 22.0            | 1.0513406  | 386.0   | 85.21915
footballers     | 18.982338       | 0.90885305 | 368.0   | 85.15375
brasil          | 111.0           | 4.0090485  | 801.0   | 84.98533
ronaldinho      | 54.0            | 1.9335893  | 518.0   | 84.7707
statue          | 32.282887       | 0.984969   | 397.0   | 84.6644
goalkeeper      | 44.236027       | 2.5766847  | 548.0   | 83.695
drawing         | 109.0           | 3.7439468  | 761.0   | 83.22752
romero          | 29.0            | 0.7479012  | 365.0   | 83.21155

8.2 Scalability

In this section we demonstrate the efficiency and scalability of our algorithm. Due to the use of hash tables, we were able to run the experiments on a single core, without distributed computation or multiple threads. Our implementation is in Java, and we used a desktop PC with an Intel Core i7-3770 CPU (2.4 GHz, 8 GB RAM) running Debian Linux and OpenJDK 1.8 with the fastutil and ELKI [162] libraries. Large parts of the implementation use low-level data structures to achieve high throughput. Even when taking location data into account, we are able to process the Twitter stream 400× faster than we would receive the data in a live setting. This approach is therefore scalable to the full Twitter "firehose" stream, where we would still obtain a performance of ≈ 400/100 = 4× real-time, because our computation time grows only linearly with the number of observations (we only need to calculate hashes and update the corresponding hash buckets, all of which is done in constant time).

8.2.1 Using only Text Data

For the news data set, the preprocessed data volume was 1.1 GB (2.6 MB per day on average). Analyzing a year of data took 18 minutes, i.e., 2.5–3 seconds on average per day. The refinement index only stores the headlines, for which we need only 745 KB per day, and 305 MB in total for a whole year. The raw 1% Twitter input data was 228 GB compressed; after filtering retweets and non-English tweets, and preprocessing, 6.7 GB remained. Processing Twitter took 1.5 hours, i.e., 46 seconds on average per day. This yields a performance ≈ 1,878× faster than needed for real-time in a live environment. The actual analysis ran in 48 minutes, i.e., 25 seconds per day. The inverted index for Twitter occupied 18 GB of disk space, storing about 160 MB per day of Twitter data. Using our effective hashing scheme on the news data set, we were even able to use a Raspberry Pi (700 MHz ARM CPU, 512 MB RAM), achieving a processing speed of 104.1 seconds on average per day (≈ 830× real-time). Even when including location data, we are able to run the full analysis of one hour of data in 9 seconds on a single core. We could thus process 400× as much data as we get from the Twitter sample API. This performance is achieved, again, without upgrading the CPU, adding additional cores, or incurring the overhead of a distributed implementation.

8.2.2 Incorporating Location Data

As a comparison to our algorithm with the inclusion of location data, we implemented GeoScope [41]. We obtained results similar to Budak et al. using their suggested parameter values: topic dominance θ = 0.002, with location support Ψ and location popularity φ both set to φ = Ψ = 0.05. Using hashtags only (as in [41]), we obtained a performance of around 480× real-time, so that processing one day (1,440 minutes) of archived tweets took 3 minutes. Because concepts like hashtags are likely to be absent in data sources other than Twitter, we also experimented without this restriction and instead processed the complete stemmed text of each tweet. Our algorithm was designed to handle all tokens; their individual importance and interestingness is determined by calculating significance scores as discussed in the previous sections. Thus, we examined the real-time performance as well as the total number of reported location-topic instances of GeoScope while dropping the hashtag restriction and processing the complete stemmed text of each tweet (so that the input term vector is exactly the same as for our algorithm). Figure 8.3a summarizes the real-time performance as well as the total number of reported location-topic instances for a window size of one hour, as we decrease GeoScope's threshold parameters from θ = 0.002 and φ = Ψ = 0.02 in steps of 0.0001 and 0.001, respectively. Thereby, the total number of reported instances increases exponentially from 261 to 15,599, whereas the runtime performance drops substantially from 136× to 32× real-time, because the thresholds are less selective.

If a user wants to discover local events tied to geo-locations with low Twitter activity (see Table 7.1 for examples), GeoScope's threshold parameters have to be set to low values to enable tracking of less frequent locations. For example, on the day of the Guam earthquake (cf. Table 8.10), the location Guam County ranked only 834th most frequent; thus GeoScope's φ parameter needs to be at least φ ≤ 1/834 ≈ 0.001 for it to be included. With such low threshold values, the number of reported instances increases to thousands of reported location-topic combinations (cf. Figure 8.3a). This makes it difficult for the user to discover important topics, because GeoScope does not provide a measure of unusual behavior (e.g., trivial topics like #ElPaso, #Milano, and #NYC are reported for El Paso County, Milan, and New York City, respectively).

Our algorithm, in contrast, provides the intuitive sigma threshold that captures unexpectedness and allows a more useful ranking of the detected events. At the same time, it provides a stable and higher performance of around 260× real-time. There are far fewer statistically significant events, and the performance of this approach does not degrade substantially for reasonable values of the significance threshold σ, as seen in Figure 8.3b.

8.2.3 Distributed Computing

If additional scalability is needed, our algorithm could be implemented in a cluster environment to distribute the workload across the nodes of a computing engine like Hadoop, Storm, S4, Samza, Flink, or Spark. This can be achieved because our algorithmic core, hash-based counting, is easy to distribute: it is "embarrassingly parallel".4 Incoming tweets can be processed by multiple nodes, and the aggregation tables can be split into multiple slices and distributed onto multiple computers this way. Often, data can additionally be aggregated locally (as with a "combiner" in MapReduce), so that only these aggregates are transferred to the aggregation nodes. Additionally, there are a number of easy ways to distribute the load: for example, building the backing index, maintaining the hash tables, and refining the detected events can easily be distributed onto separate systems. As the backing indexes are already sharded, the shards can also be distributed onto different machines trivially.

(a) GeoScope performance for varying location popularity thresholds θ ∈ {0.002, ..., 0.0004}. Dominance and support are set one order of magnitude larger, φ = Ψ = 10 × θ (as in GeoScope's own experiments). Left Y-axis: real-time performance (how much faster an archived day of Twitter can be analyzed), compared against a SpotHot baseline with fixed significance threshold σ = 10. Right Y-axis: number of reported location-topic instances by GeoScope.

(b) Our performance for increasing significance threshold σ. Left Y-axis: real-time performance. Right Y-axis: Number of reported events per hour.

Figure 8.3: Scalability of GeoScope and our method (SpotHot).

4 "Embarrassingly parallel" is also called "perfectly parallel".

Figure 8.4: Performance with varying hash table size ℓ, k = 4 (recall of injected trends for α = 0.01, ..., 0.15 over table sizes of 8–26 bits)

8.2.4 Hash Table Size

For the hash table sizes, we expect a saturation effect. Hash tables that are too small will lead to masking and swamping [32, 77]; but once the hash table is big enough to avoid collisions, the EWMA estimates are expected to be good. To verify this, we used semi-synthetic data sets derived from the news data set, into which we randomly injected artificial words following a narrow Poisson distribution with λ = 2, ..., 9, weakened by a constant factor α. Each document was modified with a probability of:

$$\mathrm{pmf}_{\alpha\text{Poisson}}(\lambda, k) = \alpha \cdot e^{-\lambda} \cdot \frac{\lambda^k}{k!}$$
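For illustration, the injection probability can be computed with a few lines of Java (a small sketch; the method names are ours):

    public final class TrendInjection {
        /** Probability of injecting the artificial keyword into a document. */
        static double injectionProbability(double alpha, double lambda, int k) {
            // alpha-weakened Poisson pmf, computed in log space for stability
            double logPmf = -lambda + k * Math.log(lambda) - logFactorial(k);
            return alpha * Math.exp(logPmf);
        }

        static double logFactorial(int k) {
            double sum = 0.0;
            for (int i = 2; i <= k; i++) sum += Math.log(i);
            return sum;
        }
    }

For λ = 2 and k = 2, the Poisson pmf reaches its maximum of about 0.27; with α = 0.15 this yields the roughly 4% modification rate mentioned below.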

Due to the randomization of λ, some trends will be easier to spot, others will be harder. Furthermore, both the existing natural trends in the data and other injected trends may mask the injected trends. The maximum value of the probability mass function is 0.27, so for α = 0.15 about 4% of documents were modified to include the artificial keyword. For α = 0.01, the chance of trend injection drops to 0.27%. Figure 8.4 summarizes the results on the injected-trends data using k = 4 hash functions. As can be seen, for α > 0.05 the artificial trends are detected reliably, and the hash tables show a typical saturation effect, where performance no longer increases once a reasonable size is reached. A hash table of 20 bits requires just 2^25 bytes of memory (32 megabytes).

Figure 8.5: Recall and precision compared to exact counting (precision, recall, and F1 over table sizes of 12–22 bits)

8.3 Comparison with Exact Frequencies

To evaluate the approximation quality of our algorithm, we compare to an implementation that uses expensive exact counting of all pairs (old pairs are forgotten if they have not been observed for several hours to reduce memory consumption, but we nevertheless need 16 GB of RAM to keep all candidates in memory). In Table 8.6 we give recall and precision comparing our approach to this ground truth, using all events with σ ≥ 20, and consider a match positive if the event is also found by the other method with at least σ ≥ 10 and with at most 1 hour difference. Recall measures how many events found by exact counting are also found by the approximate method. Precision measures how many of the events found using the approximate method can be confirmed using exact counting.

Table 8.6: Recall and precision compared to exact counting.

Table Bits | Recall | Precision | F1-Measure | σ ≥ 20
12         | 0.52%  | 100%      | 1.04%      | 5
13         | 4.47%  | 89%       | 8.51%      | 47
14         | 33.56% | 90%       | 48.84%     | 338
15         | 66.96% | 93%       | 78.01%     | 1717
16         | 87.52% | 96%       | 91.57%     | 4604
17         | 97.54% | 99%       | 98.16%     | 7493
18         | 99.16% | 100%      | 99.54%     | 9330
19         | 99.22% | 100%      | 99.61%     | 9791
20         | 99.23% | 100%      | 99.61%     | 9929

Figure 8.5 visualizes these results. For small table sizes, precision remains high (i.e., few incorrect events are reported), but only very few events are detected, and recall is thus very low. The ongoing increase in σ ≥ 20 events consists mostly of redundant reportings (the same event reported multiple times with increasing σ). We can observe a saturation effect on Twitter data at a table size of 18 bits.

Table 8.7: The most significant events, each in its most significant location

σ      | Time             | Word                 | Location                | Explanation
2001.8 | 2014-10-29 00:59 | #voteluantvz         | Brazil                  | Brazilian Music Award 2014
727.8  | 2014-09-23 02:21 | allahımsenbüyüksün   | Denizli (Turkey)        | Portmanteau used in spam wave
550.1  | 2015-02-02 01:32 | Missy_Elliott        | United States of America| Super Bowl Halftime Show
413.5  | 2014-09-18 21:29 | #gala1gh15           | Spain                   | Spanish Big Brother launch
412.2  | 2014-11-11 19:29 | #murrayftw           | Italy                   | Teen idol triggered follow spree
293.8  | 2014-10-21 12:05 | #tarıkgüneşttyapıyor | Marmara Region          | Hashtag used in spam wave
271.2  | 2015-02-02 02:28 | #masterchefgranfinal | Chile                   | MasterChef Chile final
268.1  | 2015-01-30 19:28 | #سباركيز             | Saudi Arabia            | Amusement park "Sparky's"
257.7  | 2014-11-16 21:44 | gemma                | United Kingdom          | Gemma Collins at jungle camp opening
249.1  | 2014-10-08 02:56 | rosmeri              | Argentina               | Rosmery González joined Bailando 2014
223.1  | 2015-01-21 18:51 | otortfv              | Central Anatolia Region | Keyword used in spam wave
212.7  | 2014-09-11 18:58 | #catalansvote9n      | Catalonia               | Catalan referendum requests
208.4  | 2014-12-02 20:00 | #cengizhangençtürk   | Northern Borders Region | Hashtag used in spam wave
205.3  | 2015-01-04 15:56 | hairul               | Malaysia                | Hairul Azreen, Fear Factor Malaysia
198.7  | 2014-12-31 15:49 | あけましておめでとうございます | Japan             | New Year in Japan
198.5  | 2015-01-10 20:19 | -                    | Russian Federation      | "Russian Facebook" VK unavailable
179.7  | 2014-10-04 16:28 | #hormonestheseries2  | Thailand                | Hormones: The Series Season 2
174.7  | 2014-11-28 21:29 | chespirito           | Mexico                  | Comedian "Chespirito" died
160.9  | 2014-09-21 21:27 | #ss5                 | Portugal                | Secret Story 5 Portugal launch
157.3  | 2014-09-24 01:57 | maluma               | Colombia                | Maluma on The Voice Kids Colombia

8.4 Most Significant Regional Events

In this experiment, we examine the most significant local events detected, as this demonstrates both the ability to detect trends and the ability to locate them geographically. For presentation, we selected only the most significant occurrence of each keyword, and for each location only its most significant keyword. The top 20 most significant events are shown in Table 8.7. Four events we attribute to a highly active Twitter spammer in Turkey,5 but all others can be attributed to major media events or celebrities. Most of the events in Table 8.7 are detected at the country level, which makes sense as they are based on virtual events on TV and did not happen at the users' locations. Overall, we observe that Twitter users mostly comment on what they see on TV and on the internet, and much less on the physical world around them.

8.5 New Year’s Eve

We analyzed trending topics on New Year's Eve, and were able to identify New Year greetings in several languages. Interesting patterns emerge if we plot longitude versus time, as seen in Figure 8.6. In the first panel, the x-axis is geographic, but the y-axis is temporal, so we can see the arrival of the New Year in different time zones. All events with σ ≥ 3 were included if we were able to identify them as New Year-related. To remove visual clutter, we did not include other events or festive emoji. Several cultural patterns arise in this figure. Chinese New Year is on a different date, and India also does not show much activity. Some vocabulary such as "Silvester" refers to the whole day before New Year. Italians started tweeting "capodanno" in the afternoon, but wish "buon anno" after midnight. Despite the fact that Twitter is not used much in Germany (see the statistics in Table 7.1), our variance-based approach can account for this and was still able to detect some German New Year's wishes, although the English greetings were more popular in Germany on Twitter. Sydney has its first fireworks at 9 pm already (for the kids), and we can indeed observe an event "fireworks" in Australia at 10:04 UTC. Throughout the evening, we see New Year mentions, and the event at midnight is rather small compared to other countries. In Québec, we observe more French wishes around GMT than at midnight, but closer inspection revealed that most originate from the same user. In Russia, we can see Yakutsk, Ulan-Ude, Irkutsk, and Novosibirsk celebrate in different time zones prior to the more densely populated areas, beginning with Yekaterinburg.

5 Apparently, this spammer controls thousands of hacked clients and by varying the text contents successfully bypassed both Twitter's and our spam filters. Spam filters require further research.

Figure 8.6: New Year around the world at σ ≥ 3. (Top panel: hours from New Year (UTC) versus longitude, annotated with time zones from HST to AEDT; bottom panel: latitude versus longitude. Detected greetings are colored by language: English, French, Turkish, Iberian, Indomalayan, Thai, Japanese, Russian, Italian.)

8.6 WikiTimes Events

We use the WikiTimes data set [184] to validate events found by our detection system. Table 8.9 lists events validated by comparing the resulting word cluster (using hierarchical clustering) with Wikipedia headlines for the same day. Our algorithm identified several important events of contemporary history and produced both meaningful geographical information and related keywords. Only one of these events was detected with a hashtag, indicating that restricting the input data to hashtags (as required for GeoScope) may obscure important events. Named entity recognition, however, helped detection, as it normalizes different spellings and abbreviations of names.

8.7 Earthquakes

Natural disasters, such as earthquakes, are of broad interest to the public. It has been proposed to detect them using Twitter [157], so we evaluate this scenario. The relative frequency of the term "earthquake", regardless of its geographical origin, is shown in Figure 8.7a. This yields a noisy signal with high background activity and many low-significance events. Note that the frequency shown is normalized by the total number of tweets on each corresponding day to avoid seasonal patterns (e.g., more tweets on weekends than on workdays). Once we narrow the scope down to a specific geographic region, we get a much clearer signal. Figure 8.7b shows the frequency only near Guam (in the western Pacific Ocean) within longitude 144±1 and latitude 13±1, where we can observe two events. To validate our observations, we use metadata from the United States Geological Survey's (USGS) Earthquake Hazards Program6 as our validation data source for earthquake events. The first peak in Figure 8.7b on September 17, 2014 refers to a strong earthquake with a magnitude of 6.7, located 47 km northwest of Hagatna (Guam). This earthquake was strong enough that the USGS included it in their Significant Earthquake Archive.7 The second, smaller peak on October 29, 2014 refers to a small earthquake 36 km southeast of Hagatna with a lower magnitude of 4.7. In Figure 8.7d we plot the frequency of earthquake mentions around Dallas, Texas, within longitude −97±1 and latitude 33±1.

The total number of significant earthquakes reported by the USGS during the time of our Twitter crawl is 30 (ranging from the first earthquake reported on September 17, 2014 to February 13, 2015). By exactly counting the tweets corresponding to these events, we found that 19 of them received fewer than 10 mentions, since they happened far away from cities. As this frequency is too low to be statistically significant, we excluded them from the evaluation.

6 http://www.usgs.gov
7 http://earthquake.usgs.gov/earthquakes/
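To make the normalization described above concrete, the following Python sketch computes such a regionally restricted, daily normalized term frequency. The tweet tuple layout and function name are assumptions for illustration, not the thesis implementation:

```python
from collections import Counter

def daily_relative_frequency(tweets, term, lon_range=None, lat_range=None):
    """Daily frequency of `term`, normalized by the total tweet count per day
    to remove seasonal patterns; optionally restricted to a lon/lat box."""
    totals, matches = Counter(), Counter()
    for day, lon, lat, tokens in tweets:  # assumed tuple layout
        if lon_range and not (lon_range[0] <= lon <= lon_range[1]):
            continue
        if lat_range and not (lat_range[0] <= lat <= lat_range[1]):
            continue
        totals[day] += 1
        matches[day] += term in tokens
    return {day: matches[day] / totals[day] for day in totals}

# Guam region as in Figure 8.7b: longitude 144±1, latitude 13±1
# freq = daily_relative_frequency(stream, "earthquake", (143, 145), (12, 14))
```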

Table 8.8: Events that have been detected at the same time as the reported Earthquakes

Date | σ | Event terms | Description
2014-09-17T02:50 | 247.8 | #selfiesfornash, nash, #selfiefornash, #sefiesfornash | Teenager star “Nash Grier” asked his fans for selfies
2014-12-01T02:56 | 223.0 | United_States_of_America, #soultrainawards | Soul Train Music Awards
2014-12-01T02:00 | 167.2 | #thewalkingdead, beth, di, cry, #ripbeth, #asknorman, daryl | TV series “The Walking Dead”
2015-01-07T22:00 | 135.1 | hopkins, kati, England, #cbb, United_Kingdom, London, Greater_London, North_West_England | Katie Hopkins passes judgement in TV show “Celebrity Big Brother”
2014-11-12T23:00 | 122.2 | #followmehayes, hay | Fans send “follow me please” wishes to teenager star “Hayes Grier”
2014-09-17T20:00 | 113.1 | England, yaya, #gbbo, United_Kingdom, Greater_London, London, North_West_England, live | Famous TV show “The Great British Bake Off”
2014-11-20T04:05 | 102.3 | United_States_of_America, #ahsfreakshow, New_York, Massachusetts | Fans of TV show “American Horror Story” talking about new release date
2014-09-17T20:51 | 98.0 | boateng, goal | Jerome Boateng (Bayern Munich) scored the winning goal against Manchester City
2014-10-10T21:00 | 83.8 | #5sosgoodgirls, #5sosgoodgirlsmusicvideo, 5so | Famous teenager band releases new video
2014-10-10T10:42 | 70.3 | kailash, satyarthi, Nobel_Peace_Prize, malala, Malala_Yousafzai | Malala and Kailash Satyarthi win Nobel Peace Prize
2014-11-12T04:38 | 66.1 | #soa, #soafx, #finalride, abel | TV series “Sons of Anarchy”
2015-01-28T21:00 | 59.8 | Lionel_Messi, Neymar, suarez | Famous soccer players in Atlético Madrid against Barcelona
2015-01-28T21:53 | 55.3 | eriksen, Greater_London, London | Christian Eriksen scored a soccer goal

(a) Mentions of term “earthquake” regardless of location.

(b) Mentions of term “earthquake” in Guam at longitude 144±1 and latitude 13±1

(c) Mentions of term “earthquake” in Harper, Kansas at latitude 41±1 and longitude 113±1

(d) Mentions of term “earthquake” in Dallas, Texas at longitude −97±1 and latitude 33±1

Figure 8.7: Frequency of the term “earthquake” globally vs. locally. The dashed line corresponds to the exponentially weighted moving average (EWMA) of the term frequency, whereas the background fill shows the moving standard deviation √EWMVar.
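The EWMA and EWMVar shown in this figure can be maintained incrementally with just two floating-point values per tracked term. A minimal sketch of the standard incremental form follows; the exact update and significance formula used by our method are given in Part II, and the bias term `beta` is an assumption here, serving to suppress rare-term noise:

```python
class IncrementalStats:
    """Exponentially weighted moving average and variance (EWMA / EWMVar),
    maintained with two floats; standard incremental formulation."""
    def __init__(self, alpha):
        self.alpha = alpha      # smoothing weight of new observations
        self.ewma = 0.0
        self.ewmvar = 0.0

    def update(self, x):
        delta = x - self.ewma
        self.ewma += self.alpha * delta
        # numerically stabler than the difference-of-squares formulation
        self.ewmvar = (1.0 - self.alpha) * (self.ewmvar + self.alpha * delta * delta)

    def significance(self, x, beta=1e-6):
        # z-score-like significance; beta is an assumed bias term against
        # rare-term noise (see Section 5.4 for the exact form used)
        return (x - self.ewma) / ((self.ewmvar ** 0.5) + beta)
```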

Many earthquakes would have been detected by our Event Detection algorithm even without using geo-locations, since people and bots mention their location in the text: only the 1st (Oklahoma) and 9th (Texas) earthquake would have been missed otherwise. By including the geo information and tracking statistics for the term-location pairs (earthquake, Oklahoma) and (earthquake, Texas), respectively, they were identified. As shown in Table 8.10, we are able to detect 9 out of the 11 earthquakes that were classified as significant by the USGS and present in our data set. GeoScope only detected a single #earthquake hashtag, on January 7, 2015. The columns Time and M report the earthquake date and magnitude as reported by the USGS.

Table 8.9: Events Validated via WikiTimes [184] (event descriptions © Wikipedia, The Free Encyclopedia)

Date | σ | Event Terms | Event description
09-17 | 11.4 | college, North_West_England, England, city, UK | At least 13 people are dead and 34 injured after Boko Haram gunmen attack a government college in the northern Nigerian city of Kano.
09-17 | 7.5 | Scotland, U_K, referendum, poll, result, people | Voters in Scotland go to the polls to vote on the referendum on independence.
09-17 | 5.6 | England, house, rebel, arm, Syria, approv | The United States Senate passes a budget measure authorizing President Barack Obama to equip and train moderate rebels to fight ISIL in Syria.
09-18 | 12.4 | England, Scotland, Greater_London, U_K, referendum, London | Voter turnout in the referendum held in the United Kingdom hits 84.5%, a record high for any election since the introduction of universal suffrage.
09-18 | 25.6 | Scotland, U_K, England, Greater_London, uk, David_Cameron | Prime Minister David Cameron announces plans to devolve further powers to Scotland, as well as to the UK's other constituent countries.
09-18 | 15.0 | England, London, Greater_London, Alex_Salmond, Scotland, resign, salmond, Glasgow_City | Alex Salmond announces his resignation as leader of the Scottish National Party and First Minister of Scotland following the referendum.
09-18 | 9.5 | Philippines, Metro_Manila, UK, cancel, flood | Manila is inundated by massive flooding causing flights to the international airport to be cancelled and businesses to shut down.
09-22 | 40.1 | Isis, U_S_A, bomb, Syria, target, islamic_state, u_s, airstrikes, strike, air | The United States and its allies commence air strikes against Islamic State in Syria with reports of at least 120 deaths.

09-23 | 17.7 | Syria, strike, air, Isis | The al-Nusra Front claims its leader Abu Yousef al-Turki was killed in air strikes.
09-24 | 3.1 | Algeria, french, behead | Algerian jihadist group Jund al-Khilafah release a video showing French tourist Hervé Gourdel being killed.
09-26 | 3.5 | Iraq, Isis, air, strike | The Parliament of the United Kingdom approves air strikes against ISIS in Iraq by 524 votes to 43.
10-04 | 6.2 | England, North_West_England, Texas | The Kurdish city of Kobane in the Raqqa governorate is on the brink of catastrophe as ISIS fighters besiege the city.
10-06 | 3.1 | people, city, Los_Angeles_County, dead | At least four people in central Japan are dead as Typhoon Phanfone makes landfall.
10-08 | 60.5 | U_S_A, di, patient, California, thoma, dallas, duncan, hospital, eric, diagnos, UK, texas | The first person who was diagnosed with Ebola in the United States, Thomas Eric Duncan, a Liberian man, dies in Dallas, Texas.
10-10 | 44.7 | kailash, satyarthi, Nobel_Peace_Prize, India, Malala_Yousafzai, #nobelpeaceprize, congratul, indian, malala, pakistani, peace | Pakistani child education activist Malala Yousafzai and Indian children's rights advocate Kailash Satyarthi share the 2014 Nobel Peace Prize.
10-12 | 11.0 | texas, health_care, worker, supplement, posit, tests | A Texas nurse tests positive for Ebola. The health care worker is the first person to contract the disease in America.
10-14 | 34.4 | Republic_of_Ireland, U_K, Germany, ireland, John_O’Shea, County_Dublin, Scotland, England, Leinster | Ireland stuns world champion Germany in Gelsenkirchen, drawing the match 1–1 when John O’Shea scores in stoppage time.

10-14 | 30.6 | Albania, Serbia, U_K, London, England, match, drone, flag | The game between Albania and Serbia is abandoned after a drone carrying a flag promoting the concept of Greater Albania descends onto the pitch in Belgrade, sparking riots, mass brawling and an explosion.
10-15 | 17.8 | posit, Ebola_virus_disease, health, worker, texas, tests | A second health worker tests positive for the Ebola virus in Dallas, Texas.
10-17 | 13.1 | czar, Barack_Obama, klain, ron, Ebola_virus_disease, U_S_A, Travis_County | Barack Obama names lawyer and former political operative Ron Klain as “ebola czar” to coordinate the US response to the Ebola outbreak.
10-22 | 26.4 | soldier, Canada, Ottawa, Ontario, shoot, insid, canada, parliament | A gunman shoots a Canadian Forces soldier outside the Canadian National War Memorial.

Table 8.10: Significant Earthquakes that exhibit a minimum of 10 tweets within 1 hour of the actual time

Time | Lng | Lat | Nearby City (distance) | M | ∆ | σ | |T_t,q,g|/|T_t,q| | Event Keywords | GeoScope [41] | Our
14-09-17 | 144.4 | 13.8 | Hagatna, Guam (47km) | 6.7 | 2.3 | 3.0 | 13/51 | guam, {guam, earthquake}, {guam_county, earthquake} | - | ✓
14-10-02 | -98.0 | 37.2 | Wichita, KA (73km) | 4.3 | - | - | 41/54 | - | - | -
14-10-10 | -96.8 | 35.9 | Oklahoma, OK (86km) | 4.2 | 2.0 | 3.1 | 13/19 | oklahoma, {oklahoma, earthquake} | - | ✓
14-11-12 | -97.6 | 37.3 | Wichita, KA (53km) | 4.9 | 0.2 | 8.0 | 37/51 | kansas, {earthquake, felt}, {kansas, earthquake} | - | ✓
14-11-20 | -121.5 | 36.8 | Santa Cruz, CA (65km) | 4.2 | 3.4 | 3.0 | 25/37 | california, {monterey_county, earthquake}, {california, earthquake} | - | ✓
14-11-21 | 127.1 | 2.3 | Bitung, Indon. (228km) | 6.5 | - | - | 93/114 | - | - | -
14-12-01 | -111.7 | 35.0 | Phoenix, AZ (179km) | 4.7 | 0.7 | 21.7 | 138/322 | arizona, {arizona, earthquake}, {earthquake, flagstaff} | - | ✓
14-12-30 | -118.3 | 33.6 | Long Beach, CA (21km) | 3.9 | 3.4 | 18.0 | 134/148 | california, {los_angeles, earthquake}, {california, earthquake} | - | ✓
15-01-07 | -96.9 | 32.8 | Irving, TX (6km) | 3.6 | 53.7 | 18.7 | 119/148 | {denton_county, earthquake} | ✓ | ✓
15-01-20 | -121.0 | 36.4 | Santa Cruz, CA (80km) | 4.4 | 22.9 | 22.8 | 213/245 | california, {monterey_county, earthquake}, {california, earthquake} | - | ✓
15-01-28 | -124.6 | 40.3 | Sacramento, CA (330km) | 5.7 | 51.1 | 22.5 | 269/334 | california, {california, earthquake}, earthquake | - | ✓

Note: data and method were not optimized for earthquake detection, but all detected earthquakes were significant at a significance level of σ ≥ 3 in the full Twitter feed.

∆ contains the delay in minutes from the earthquake to our first detected event that surpassed the threshold of 3σ. T_{t,q} denotes the set of all tweets within a 1-hour window of the event matching the query term q = “earthquake”. T_{t,q,g} ⊆ T_{t,q} denotes the subset of tweets which also match the geo token g of the earthquake's location. Event Keywords are the terms and pairs that were significant. In the last columns we indicate whether GeoScope and our method were able to identify the event. In contrast to specialized systems such as that of Sakaki et al. [157], we did not optimize our system for this purpose by specifying query terms beforehand. We first detected all events, and then filtered the earthquake-related events from the log file for evaluation. To emphasize this fact, we extracted all reported trends that occurred on the same day as these earthquakes. For a more meaningful presentation, we clustered all detected event pairs with a hierarchical agglomerative clustering approach using average linkage to aggregate co-occurrences. The results can be reviewed in Table 8.8.

While this experiment demonstrates that our method is able to detect real-world events, it also reveals the weaknesses of the overall idea of an earthquake warning based on Twitter data: as discussed above, 19 out of 30 significant earthquakes were not significantly discussed on Twitter at all. Most of these happen off-shore, such as the last earthquake in this list (listed with Sacramento, CA as the nearest city), which happened 40 miles off-shore south-west of Eureka, CA; such quakes are mostly reported by Twitter bots and do not cause much Twitter activity. Furthermore, when a substantial earthquake happens, we must expect the network to become unavailable. While this is a popular use case for event detection on Twitter, it cannot be used as a replacement for seismic sensors.
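A minimal sketch of the aggregation step described above, using SciPy's average-linkage implementation; the term distance function is an assumption, and any dissimilarity derived from co-occurrence counts could be plugged in:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_event_terms(terms, distance, threshold=0.7):
    """Group co-occurring event terms by average-linkage agglomerative clustering."""
    if len(terms) < 2:
        return [list(terms)]
    # condensed pairwise distance vector, as expected by scipy's linkage()
    condensed = np.array([distance(a, b)
                          for i, a in enumerate(terms)
                          for b in terms[i + 1:]])
    labels = fcluster(linkage(condensed, method="average"),
                      t=threshold, criterion="distance")
    clusters = {}
    for term, label in zip(terms, labels):
        clusters.setdefault(label, []).append(term)
    return list(clusters.values())
```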

8.8 Online Demonstration

The results of our trend detection system are available at http://signi-trend.appspot.com/ for exploration. Example screenshots are shown in Figure 8.8 and Figure 8.9.

Figure 8.8: Web app overview of year summary

Figure 8.9: Web app detail for the Malaysia airplane crash

Chapter 9

Relationship of Trends and Outliers

As previously discussed in Part II, trends have a strong relationship to outliers, as both are some kind of unusual observation in our data set. The following chapter is intended as a call to action. Outlier detection research has been very much focused on Euclidean space, and the community has become detached from the actual data problems we want to solve. This manifests itself in a lack of good evaluation data, and in often incremental variations of the general theme [44].

Data diversity is a leading theme of data science: much of the data at hand cannot be squeezed into the rigid structure of a finite R^d vector space. If we want to obtain meaningful descriptions of outliers, we first need to work on meaningful data. This can be achieved by working closer to the original data, and less on an abstract vectorized representation. For this, flexible and abstract methods are needed that can be adapted to the different actual problems to solve.

Outlier detection, while meant to be an unsupervised task, is not, and cannot be, entirely free from assumptions on the characteristics of outliers. Every existing method embodies some implicit concept of outlierness. And even if methods do not use dedicated training data, they are still "trained" by the method designer's intuition of what constitutes normal or abnormal. We suggest formalizing this notion, making it explicit, and designing methods that allow customized notions of outlierness. Eventually, this will also lead to better explanation and description of outliers.

In this chapter, we focus on the peculiarities of textual data, which often comes streaming and in high volume. We interpret existing trend detection methods as simple outlier detectors over time series. Many of the abstract problems become visible in this context, such as the difference between instances (e.g., messages) and outliers (i.e., trends or events) in this domain. Reflecting on traditional Outlier Detection, we observe a similar pattern in density-based Outlier Detection: as much as we are seeking outlier instances, we are also seeking regions of low density.

In summary, we suggest the following research issues:

• Improved statistical models need to be developed and used, to obtain meaningful conclusions and provide robustness against spam.

• Low latency is required for practical use of trend detection, which collides with the desire to use larger time windows to obtain more reliable statistics. This may be resolved by using an appropriate combination of models [207].

• Aggregation of results is important, e.g., merging overlapping trending topics, to avoid overloading the user with noisy and redundant results.1

• Scalability to a large number of instances, to a large number of aggregations, and to a fast data stream is required. This will usually require the use of approximation and indexing techniques, and will limit the complexity of models usable [171].

• Artifacts and domain-specific anomalies are omnipresent in real data, and it should be possible to customize and modify methods to handle these. On text data, stop words and spam constitute such artifacts. On census data, areas with few inhabitants cause such anomalies. In traffic data, accidents attributed to the nearest milepost may mislead an algorithm.

• Bridge the gap! Outlier models should be developed that are applicable to Euclidean vector spaces, time series, text, and other data types in the same way. This will make it easier for one research domain to benefit from advances in the other.

1 In many areas of data mining, redundancy of data mining results has been observed for early approaches and has been addressed later on in more mature approaches, well-known examples being subspace clustering [174] and frequent pattern mining [211].

9.1 Introduction

As introduced in Section 3.7, outliers are hard to define mathematically because we cannot expect them to follow a model or distribution known beforehand. In this chapter, we want to expand the generalized framework for Outlier Detection proposed by Schubert et al. [170] to also cover event detection and emerging topic detection. Outlier detection is commonly defined as the process of finding unusual, rare observations in a large data set, without prior knowledge of which objects to look for. Trend detection is the task of finding some unexpected change in some quantity, such as the occurrence of certain topics in a textual data stream. Here, the task is to detect changes in the distribution of a data stream that indicate the beginning of an event. The words used are slightly different, e.g., trend, emerging topic, bursty keyword, but essentially they refer to unusual, extreme topics: outliers in text streams.

Many established Outlier Detection methods are designed to search for low-density objects in a static data set of vectors in Euclidean space. For trend detection, high-volume events are of interest and the data set is constantly changing. These two problems appear to be very different at first. However, they also have obvious similarities. For example, trends and outliers likewise are supposed to be rare occurrences. In this chapter, we discuss the close relationship of these tasks. We call to action to investigate this further, and to carry over insights, ideas, and algorithms from one domain to the other.

The remainder of this chapter is organized as follows. In Section 9.2, we survey traditional Outlier Detection methods that are abstract in the sense that they are designed for Euclidean data spaces. Although there are specializations to particular data types, in many cases the wheel is reinvented for each such data type. A recent discussion of generalized approaches eventually allows for the transfer of insights from traditional methods to specialized applications like trend detection. In Section 9.3, we point out why the research questions guiding the design of Outlier Detection methods might be misleading in some cases, and why traditional Outlier Detection might also benefit from insights from trend detection. In Section 9.4, we discuss some methods for trend detection and their relationship to a generalized view of traditional Outlier Detection. Finally, we summarize and identify challenges for future research in these areas.

9.2 Traditional Outlier Detection

9.2.1 Outlier Detection in Euclidean Space

Knorr and Ng [94] proposed a distance-based notion of outliers. This model is motivated by the intuition of statistical parametric approaches. It aims, however, not at a refinement of the statistical modelling of outliers, but at designing efficient database-oriented approaches. This algorithm triggered the data mining community to develop many different approaches that follow a less statistical and more spatially oriented notion of outliers. The k-NN-outlier model [151] ranks the objects according to their distances to their k-th nearest neighbor. As a variant, the k-NN-weight model [24] uses the sum of distances to all objects within the set of k nearest neighbors (called the weight) as an outlier degree. While these models actually only use distances, the intuition is typically discussed with Euclidean data space in mind. Many approaches were primarily interested in algorithmically improving efficiency, for example based on approximations or improved pruning techniques for mining the top-n outliers [33, 95, 186, 22]. Several efficient or approximate algorithms for mining distance-based outliers have been studied by [145]. They identify common algorithmic techniques but do not discuss model properties, restricting themselves to the distance-based models [94, 151, 24].
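As a compact illustration of these two distance-based models (using scikit-learn as a stand-in; not the original implementations of [151, 24]):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """k-NN outlier model [151]: score = distance to the k-th nearest neighbor.
    k-NN-weight model [24]: score = sum of distances to the k nearest neighbors."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist = dist[:, 1:]                  # drop the self-distance of 0
    return dist[:, -1], dist.sum(axis=1)

X = np.random.rand(500, 3)              # placeholder numeric data
knn_score, knn_weight = knn_outlier_scores(X, k=5)
```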

In these distance-based approaches, for each object, a property (outlier model) [170] is learned based on a local neighborhood (radius ε, k nearest neighbors). However, the objects are eventually ranked according to this property ("outlier score") in a global way. For example, the object with the largest k-NN distance overall would be the most prominent outlier. Thus, these methods are best suited to identify global outliers. Recent global approaches base the decision not on Euclidean distances but on angle variance (such as ABOD [104] and an efficient variant using random projections [147]), an intuition that is clearly also connected to a Euclidean data space. Identifying local outliers (i.e., comparing local models with a local reference set [170]) started with the method LOF (local outlier factor) [39], as introduced in Section 3.7. Again, this notion of outlierness is a natural intuition for the Euclidean data space.

Several extensions and refinements of the basic LOF model have been proposed, e.g., a connectivity-based outlier factor (COF) [180], or using the concept of micro-clusters to efficiently mine the top-n density-based local outliers in large databases (i.e., those n objects having the highest LOF value) [89]. A similar algorithm, named INFLO [90], extends the LOF model by additionally using the reverse nearest neighbors and considering a symmetric relationship between both values as a measure of outlierness. The local distance-based Outlier Detection (LDOF) approach [203] merges the notion of local outlierness with the distance-based notion of outliers. LoOP [100] uses a density estimation based on the distance distribution of all nearest neighbors and formulates the local outlier score as a probability. COP [103] aims at detecting outliers in the presence of local correlations in the data set by measuring the deviation from the local model. More or less explicitly, all these methods basically aim at providing rather simple approximations of statistical density estimates around data points in Euclidean space. Consequently, a recent evaluation study [44], discussing several of these methods, also focuses on numeric data.
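Off-the-shelf implementations of the LOF model discussed above exist, e.g., in scikit-learn (shown here on placeholder data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(500, 3)                  # placeholder numeric data
lof = LocalOutlierFactor(n_neighbors=20)    # compares local density to neighbors'
lof.fit(X)
scores = -lof.negative_outlier_factor_      # larger score = more outlying
```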

9.2.2 Specialized Outlier Detection

Some approaches designed for high-dimensional data try to account for local feature relevance and search for outliers in subspaces of the data space [138, 101, 139, 140, 143, 91, 137, 135, 60]; see the survey of Zimek et al. [210]. In the area of spatial data mining [154], the topic of spatial outliers has triggered several specialized methods [25, 173, 126, 96, 176, 50, 125, 52]. These approaches discern between spatial attributes (relevant for defining a neighborhood) and other attributes (usually only one additional attribute) in which outliers deviate considerably from the corresponding attribute value of their spatial neighbors. How to derive the spatial neighborhood and how to define "considerable deviation", however, differs from approach to approach. Other specialized approaches tackle, for example, outliers in time series [179, 86], outliers in graphs (e.g., in social networks or in DBLP) [70, 16, 15, 17], outlying trajectories [114], outliers in categorical or ordinal data [18, 202], or outliers in uncertain data [13]. Aggarwal [11] provides more examples. Again, although they are not always operating in Euclidean space, all these methods eventually aim at some approximate descriptors of outlierness for the objects that ultimately should relate to statistical density estimates.

9.2.3 Outlier Detection in Data Streams

Recently, Sadik and Gruenwald [156] gave an overview of research issues for Outlier Detection in data streams. They follow the categorization of the well-known survey on Outlier Detection by Chandola et al. [48], where type II outliers, as opposed to type I outliers, are outliers with respect to some context, such as time or location. In database research, the concept of outliers w.r.t. some particular context is also known as a "local" outlier [39, 170] and is not restricted to special data types, although the concept of "locality" might be of paramount interest in data types with special requirements for the outlier model [170, 169]. Consequently, Sadik and Gruenwald [156] are interested in streaming data and time series data without distinction, while we see time series as a special data type and consider the scenario of streaming data as the more general one, which has been tackled in many studies with Euclidean data space in mind [199, 175, 149, 23, 28, 99, 72, 105].

There are two main categories of approaches to dynamic data. First, the dynamic aspect of the data can be tackled using an incremental approach, i.e., old data remain available while new data are coming in, and the preliminary models are refined over time. The second possibility is to truly address the potential infinity of the data, i.e., the fact that the complete data stream might not fit into the available memory or might actually never be completely available. In this case, the typical approach is a sliding "time window" that is oblivious of old data; the adapted approach builds a model based only on the data within the time frame of the window.
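The window-based strategy can be sketched in a few lines; old observations simply fall out of the model. This is a toy illustration of the principle, not a complete stream outlier detector:

```python
from collections import deque

class SlidingWindowModel:
    """Stream model that is oblivious of old data: only the last w items count."""
    def __init__(self, w):
        self.window = deque(maxlen=w)   # appending beyond w evicts the oldest item

    def update(self, x):
        self.window.append(x)

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```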

9.2.4 Generalization of Outlier Detection

In a certain sense, the combination of different outlier detectors into an ensemble [113, 71, 142, 102, 168, 124, 209, 207, 208] can be seen as a generalization because, under certain conditions [102, 168], it becomes meaningful to combine even different methods that follow different intuitions about outlierness. But still, such combinations can only combine the available methods, which have typically been designed for Euclidean data space or for some particular use case. There is a recent line of reasoning, though, on truly generalizing the classic, abstract outlier methods to new use cases and data scenarios. Schubert et al. [170] modularized many existing Outlier Detection methods, demonstrating that there is a large conceptual overlap between these methods. Based on this modularized structure, they demonstrate how to modify existing methods to work on other data types such as geostatistical data, video streams, and graph data.

In follow-up work [169], they introduced and demonstrated improved density estimation modules with adaptations to spatio-temporal data. These generalization studies constitute important prior work to what we will discuss next, but they focused on identifying common patterns and applying the discussed methods to other data types. Here, we will discuss a further generalization which also integrates data preprocessing aspects into this broader view, as the generalized pattern cannot otherwise easily encompass trend detection, despite the strong similarities between these two domains.

9.3 Limitations of Traditional Outlier Detection

Most published Outlier Detection methods were designed with the intuition of low-density outliers in mind. While this is applicable in some scenarios, it often does not fit a given problem. The difficulties of putting Outlier Detection methods from theory into practical use (or even of generating test data sets [66]) may be attributed to overusing this intuition.

In the following examples, we discuss some scenarios where the data do not adhere to the low-density intuition, and traditional Outlier Detection methods then do not work reliably. As a consequence, they would not produce satisfactory results. We also discuss what should have been analyzed instead.

9.3.1 Example: KDD Cup ’99

On the popular KDD Cup ’99 data set, one may argue that outliers are not at all rare instances. Depending on the exact version of this data set, 80%–94% of the instances are attacks. As such, the legitimate connections may be considered the anomalies here. While this data set has been repeatedly used for evaluating Outlier Detection methods [112, 199, 113, 5, 200, 142, 102, 168, 124] (usually by considering only attacks of types U2R and R2L as anomalies), the results of such analyses should be taken with a grain of salt.2 The data set contains many (≈75–78% [181]) duplicates, and many established methods are not prepared for handling that many duplicates. Thus, unless the outlier methods are carefully implemented and parametrized, outlier scores may become undefined, and the evaluation may be biased. For example, methods such as isolation forests [124] that work on random samples of the data set might then appear to perform better because they are less susceptible to the problem of duplicates. For the NSL-KDD version of this data set [181], attempts have been made to alleviate some of these deficiencies. Nevertheless, the attacks do not resemble modern network traffic or attacks (for example, common modern attacks such as SQL injections are not captured by this data), it contains 50% malicious activity, almost completely of bulk type, and the data set is not suitable for evaluating intrusion detection capabilities.

Because it originates from a simulated network, we may, however, treat it as a synthetic test set for benchmarking purposes.

Because the data set contains categorical attributes, binary attributes, and integer-valued attributes (including, e.g., num_compromised), it is highly sensitive to preprocessing such as feature selection and data normalization. Furthermore, any intuition of density and distance based on Euclidean space is probably inappropriate for this data set.

2 See the discussion by McHugh [133] and by Tavallaee et al. [181], as well as KDnuggets n18, 2007, "KDD Cup ’99 dataset (Network Intrusion) considered harmful": http://www.kdnuggets.com/news/2007/n18/4i.html

9.3.2 Example: United States Census Data

In spatial Outlier Detection [7], observations consist of two kinds of data: a geographical location, which may be a point (a position) or an area (a polygon), as well as a univariate or multivariate measurement. The US census data, for example, include statistics such as household size and population demographics at different spatial resolutions such as census counting districts and county level.3 For some districts, sparsity of population causes artifacts: census counting districts include areas such as airports, graveyards, and ghost towns with a low population. Popular attributes such as relative ethnicity may be undefined for uninhabited districts or show unusually extreme values for tiny populations. Therefore, popular Outlier Detection algorithms such as LOF cannot be meaningfully used on such data without modifications [170]. However, the methods can easily be generalized to use a spatial context for determining the neighborhood and the non-spatial attributes for analysis, and then yield results competitive with those of existing geostatistics [170]. To make full use of this data set, the methods should be further customized to take uncertainty into account, in particular the uncertainty arising from a small population, which makes numbers such as ethnicity averages incomparable.

9.3.3 Example: Traffic Accidents

Schubert et al. [169] analyzed the density of traffic accidents in the UK, based on open government data.4 Again, results obtained by traditional Outlier Detection methods are not helpful: they will report accidents in sparsely populated areas such as northern Scotland as low-density outliers. The data set contains 19% duplicated coordinates, probably due to measurement precision and reoccurring accident sites. The available information may include involvement of pedestrians, visibility conditions, severity, casualties, road numbers, authority IDs, etc., which may also be missing or estimated. For their analysis, they only used the coordinates, and customized their approach for this data set by searching for areas with a higher traffic accident density than expected, in order to find accident hotspots.

3 Available at the United States Census Bureau: http://www.census.gov/
4 Publicly available at http://data.gov.uk/

9.3.4 Observations

The above examples demonstrate how we might have been asking the wrong question in Outlier Detection research. By working primarily with data consisting of vectors in Euclidean space, and with the intuition of low-density outliers, we designed our algorithms for this particular use case. This can be seen as a kind of "overfitting" at algorithm design time. Similar observations have been made in other fields of data mining, e.g., clustering [69] and pattern mining [212].

On the other hand, we have also been using our existing Outlier Detection tools the wrong way. Both in the KDD Cup ’99 and in the traffic accidents example, our entities of interest are not the individual data samples. Instead, our potential outliers are some aggregation of the data: we need to aggregate the KDD Cup ’99 data to hosts instead of processing individual connections if we want to detect attackers; Schubert et al. [169] implicitly used local maxima in density as aggregations of traffic accidents (which yields black-spot crossroads, not "outlier car accidents"). On the U.S. Census data, the data were already aggregated by the Census Bureau. When analyzing trends in text, we are interested in topics, not messages. This also holds for first story detection (FSD), where the output is not the topic, but the first message of each topic. This is related to what has been termed "type III outliers" [48], with the distinction that we claim that the particular type of aggregation is typically not given a priori with the data, but is a matter of an adequate interpretation of the plain data.

One may argue that this can be solved in the preprocessing phase, by improving feature extraction: simply convert the textual data to appropriate numerical vectors, aggregate and transform the numeric data to the desired level (i.e., topics instead of messages), and then analyze the aggregated and transformed data with some (Euclidean space) Outlier Detection method. In practice, however, as we will see next, this does not work that easily. On data streams, preprocessing and transformation cannot be completely decoupled when too many assumptions are made in advance. If we want to be able to explain the resulting outliers, we need to be able to return to the original data representation. Last but not least, because of efficiency considerations, it may be necessary to integrate outlier/trend detection much earlier in the analysis process, instead of first transforming all data objects. For this reason, we designed our Event Detection approach discussed earlier to work on all input data, without selecting only a fraction of the data (such as hashtags).

9.3.5 Bridge the Gap

If we want to bridge the gap between Outlier Detection research and trend detection (as well as other domains), we need to go beyond the idea that a continuous R^d vector space without duplicates, where low-density regions contain the outliers, is a useful representation. We should bring ideas from Outlier Detection to the data types and representations that are used in practice.

Figure 9.1: “Pizza” seems to be particularly popular on weekends. Screenshot from http://signi-trend.appspot.com/

9.4 Relationship between Trends and Outliers

Looking at all the studies about Event Detection discussed in the previous sections, a common important component is the definition of properties (like frequency or density) together with their expected normal range, used to determine outliers within the document space. Where trend detection is interested in outliers in textual streams, these outliers are not individual instances (messages, news items, tweets) but rather topics. If we attempt to run traditional Outlier Detection methods on such a stream, we will get plenty of uninteresting outliers due to misspellings and rare words. By a classic notion of outlierness, such instances will quite correctly appear as outliers. However, in most cases we probably do not want our algorithms to degenerate to counting the number of rare words per message and finding the text with the most unusual vocabulary in the corpus.

Most trend detection methods, such as TwitterMonitor [132], Burst Detection [76], and enBlogue [21], as well as the Event Detection algorithm proposed in this thesis, perform some kind of aggregation. Often, a sliding window approach is used for the aggregation of individual instances. With this technique, each window corresponds to a point in time, and the aggregation yields one time series for every word (or n-gram) in the data set. Outliers are those time series that show an unusual change in activity in the current time window. When interpreting these example algorithms for trend detection on the time series of a single term, we can roughly summarize their models as follows: TwitterMonitor uses the increase in term activity compared to the previous time window; Burst Detection uses the second derivative (the increase in the current time window, compared to the previous increase); enBlogue uses the relative increase in frequency over a moving average. Our algorithm uses the most complex model, consisting of an exponentially weighted moving average and standard deviation, and it can thus also capture variance. All of these are fairly simple statistical models and in themselves not spectacular. Much of the challenge in trend detection comes from scale: these values must be tracked and analyzed for every word (or n-gram) in the data set simultaneously, for millions of words. The main contributions of the above articles are on scalability, not the statistical models used: enBlogue tracks only those word pairs where at least one word is considered a seed tag. We are using a hashing-based approach, which is lossy on rare terms but accurate with high probability on frequent terms, and can thus afford to use only a constant amount of memory.
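The per-term models summarized above can each be stated in a few lines. The following toy sketch is a simplified reading of these models applied to a series of per-window term frequencies; the original systems differ in their details:

```python
def trend_scores(freq):
    """Toy per-term trend scores on a list `freq` of per-window frequencies
    (requires len(freq) >= 3); all computed for the most recent window."""
    increase = freq[-1] - freq[-2]                          # TwitterMonitor-style
    burst = (freq[-1] - freq[-2]) - (freq[-2] - freq[-3])   # second derivative
    avg = sum(freq[:-1]) / (len(freq) - 1)                  # simple moving average
    relative = freq[-1] / avg if avg > 0 else float("inf")  # enBlogue-style
    return increase, burst, relative
```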

Figure 9.2: Statistical model used by SigniTrend: frequency (thick line), mean (thin line) and standard deviations (shaded areas) for the keyword “Selfie” on Twitter. The yellow bar highlights the outlier event. Screenshot from http://signi-trend.appspot.com/

All these methods can be seen as first-order Outlier Detection methods on time series. The earlier frequencies of a term are used to construct a model (frequency, moving average, moving average with standard deviation), and the new frequency is compared to the previous value. Figure 9.2 recalls our Event Detection model: from the observed frequency, a moving average as well as a moving standard deviation are computed. The famous "Oscar Selfie" achieved 11 standard deviations over the previous average. We can also see that, after this selfie, the average volume substantially increased (note that the axis does not start at 0), but the model adapted quickly to this higher average volume. If we consider the standard deviation as used by our method to be a value derived from the mean, then this method can be seen as second-order. We have not yet, however, incorporated "locality" [170] in the sense of LOF [39]: each time series object is evaluated on its own, and the significance is not compared to other time series. The time series of "pizza" (see Figure 9.1) shows a typical weekend pattern, except for Oscar night. Comparing this series with the series of related terms may make such outliers easier to detect. This calls for future work, as it will eventually allow detecting trends in smaller communities that are otherwise masked by globally popular trends (the so-called "Justin Bieber effect"). On the other hand, several related terms (e.g., boston, marathon, explosion) may trend together, and the comparison of the series may help to identify the most explanatory term combination. Judging the significance of a trend based on the scores of related terms can thus be expected to yield better results. This way, lessons learned for traditional Outlier Detection, reconsidering the notion of "locality" [170], could be transferred and boost research progress also in trend or event detection.

Chapter 10

Conclusion

In Part II of this thesis, we showed how exponentially weighted moving averages and variances can be used to score a trending topic with respect to its recent occurrences in the data stream. The proposed statistic needs little memory (two floating-point values only) and can be efficiently updated using an incremental equation. The numerical properties of this equation are well understood, and it was shown to be more stable than the naive equation of the variance involving a difference of squares. This improved scoring function is able to capture the background noise as well as general trends in the data, and by measuring variability it can produce a much more meaningful significance score than, e.g., the nearest-neighbor distances that were used in prior state-of-the-art work to measure emerging topics.

Secondly, we showed how to scale our statistical approach to track every word and word pair of a data stream with limited memory, by using hashing techniques and exploiting the fact that the majority of words and word pairs occur only rarely in the stream. By resolving hash collisions in favor of the more common word or word pair, we will usually have an exact statistic for these words and only incur information loss on rarely seen words, which by definition are not trending yet. The scalability of this approach was demonstrated by analyzing state-of-the-art data sets faster than real-time on a single CPU. Third, we suggested and demonstrated the use of clustering techniques to aggregate the observed trends (which usually only affect a small subset of the vocabulary, for which the results can be refined from an inverted index) into larger trends. This has become more important than in previous work, as we monitor a much larger set of candidates; in particular, co-occurrences tend to overlap and form clusters.

We then showed how we can extend our Event Detection algorithm with the ability to use geographic information to detect events in fast and very large data streams, and showed that:

1. Mapping coordinates to a token representation allows efficient integration of geographic information in an analysis pipeline for textual data.

2. Co-occurrence of terms and locations yields insight into events happening around the globe.

3. Variance-based normalization using incremental statistics yields more interesting topics and events by adjusting for differences in user density and activity, and removes the need to expire old data.

4. Decoupling data aggregation and statistics tracking permits the use of short timeframes for counting and alerting, while at the same time we can use larger timeframes for more robust statistics.

5. Probabilistic counting using parallelization-friendly hash tables allows scaling this approach to very large data sets, well capable of performing such an analysis in real-time on thousands of tweets per minute, and enables us to track any word-location combination.

6. Due to the scalability improvements, the a priori specification of interesting topics and regions of interest required by earlier approaches is no longer necessary.

To the best of our knowledge, we developed the first system that can monitor all single words, word-word pairs, or word-location pairs to build a co-occurrence model on a large data stream without the need for parallelization, in comparison to previous work that can only monitor a filtered set of terms or word pairs based on a seed set. Such a restriction is no longer necessary with the memory reduction obtained via hashing and our effective incremental moving averages, which make sliding window management unnecessary and gain additional performance.

We further showed how trends and events relate to classic Outlier Detection. We interpret existing trend detection methods as simple outlier detectors over time series. Many of the abstract problems become visible in this context, such as the difference between instances (e.g., messages) and outliers (i.e., trends or events) in this domain. Reflecting on traditional Outlier Detection, we observe a similar pattern in density-based Outlier Detection: as much as we are seeking outlier instances, we are also seeking regions of low density. In summary, we suggest the following research issues:

1. Improved statistical models need to be developed and used, to obtain meaningful conclusions and provide robustness against spam.

2. Low latency is required for practical use of trend detection, which collides with the desire to use larger time windows to obtain more reliable statistics. This may be resolved by using an appropriate combination of models [207].

3. Aggregation of results is important, e.g., merging overlapping trending topics, to avoid overloading the user with noisy and redundant results.1

1 In many areas of data mining, redundancy of data mining results has been observed for early approaches and has been addressed later on in more mature approaches, well-known examples being subspace clustering [174] and frequent pattern mining [211].

4. Scalability to a large number of instances, to a large number of aggregations, and to a fast data stream is required. This will usually require the use of approximation and indexing techniques, and will limit the complexity of models usable [171].

5. Artifacts and domain-specific anomalies are omnipresent in real data, and it should be possible to customize and modify methods to handle these. On text data, stop words and spam constitute such artifacts. On census data, areas with few inhabitants cause such anomalies. In traffic data, accidents attributed to the nearest milepost may mislead an algorithm.

6. Bridge the gap! Outlier models should be developed that are applicable to Euclidean vector spaces, time series, text, and other data types in the same way. This will make it easier for one research domain to benefit from advances in the other.

Part III

Geo-social Co-location Mining of Events

Chapter 11

Trend Patterns and Dissemination Prediction

The work in this chapter gives an example use case of an algorithm that benefits from a preprocessing pipeline with an event detection stage. In Part II, we also discussed how location data can be used to improve the detection performance for events. In this chapter, we outline an additional use of the location data of the event candidates filtered from a social media data stream. This geo-textual data allows us to immediately detect and react to new and emerging trends and events. As previously discussed, we use the terms event and trend interchangeably. As this chapter focuses on the dissemination pattern of events over larger time periods, we will use the term trend to describe the observed topics. A trend is represented as a set of keywords associated with a time interval in which the frequency of these keywords is significantly increased.

In the following chapter, we investigate the dissemination of trends over space and time. For this purpose, we employ a four-step framework. In the first step, we employ existing solutions to mine a large number of trends. Second, for each trend we create a spatio-temporal dissemination model, which describes the motion of this trend over space and time. To model this dissemination, we employ a (flow-source, flow-destination, time, trend) tensor. In the third step, we cluster these trend tensors to identify groups of archetypes. For each archetype, we aggregate all tensors of the same archetype and employ a tensor factorization approach to describe this archetype by its latent features. As the fourth step, we propose an algorithm which can classify the archetype of a newly discovered trend, in order to predict its future dissemination.

In our experiments, we are able to show that the event space exhibits clusters, each corresponding to a trend archetype such as politics, disasters, and celebrity topics.

(a) July 17th 2014

(b) July 19th 2014

Figure 11.1: Distribution of trend “MH17”

11.1 Introduction

In this chapter, we discuss how to predict the flow of existing trends around the globe. For instance, trends related to fashion might often arise in France, then move to the rest of Europe within a few days, then start to affect North America within weeks, and finally flow to Australia within weeks and months. In contrast, technological trends might often be initiated in Japan and South Korea, then flow to North America, and only then flow to Europe.

As an example of such trend behaviour, Figure 11.1 shows the location of tweets issued in July 2014 corresponding to the crashed Malaysia Airlines flight “MH17”. The trend shows initial strong bursts in Malaysia as well as in the Netherlands, from where the flight originated, as seen in Figure 11.1a. From there, the trend quickly spread all across the world: two days later, the rest of Europe as well as North America are just as involved in the trend. This can be seen in Figure 11.1b.

A more recent trend development can be seen in Figure 11.2, where the location of tweets containing the string “Pokémon” is shown for several days. Beginning with the first of July, 2016, Figure 11.2a exhibits a globally low interest in this topic, indicating no trend at that time.

(a) July 1st 2016 (b) July 6th 2016

(c) July 13th 2016 (d) July 22nd 2016

Figure 11.2: Distribution of trend “PokémonGo!”

When the free-to-play game “PokémonGo!” was released for cell phones in the United States, Figure 11.2b shows a highly significant burst of tweets on this topic on July 6th, originating in the US alone. One week later, on July 13th, the trend had moved to Europe as the game was released in several countries there. This can be seen in Figure 11.2c. Asia follows, mainly with the Japan release on July 22nd, with high activity on the topic, as shown in Figure 11.2d.

Intuitively, different types of trends are expected to show different distributions. While a few trends spread to a global scale within hours due to dissemination through news networks, other trends may be more local, may spread slower, might originate from specific regions, or might disseminate to specific regions only. In this work we model such dissemination of trends over space and time. That is, we observe the flow of trends, specified by source and target regions, over time.

Figure 11.3 exemplifies such flows for the two examples given before, namely “MH17” and “PokémonGo!”. The arrows on the map indicate a flow in activity from source (red) to target (blue). For the sake of readability, the representation has been kept coarse and omits certain regional interdependencies. Geographical regions are referenced by their position in our index (drawn in black outlines), and the thickness of the arrows indicates the strength of the dependence. Figures 11.3a and 11.3b exhibit the trend dissemination of the trend “MH17” in a full world view and in a view of the south-east Asian region alone, respectively. As can be seen, the trend originates from Malaysia and spreads over the world from there, partially using other regions as intermediate hops. In contrast, Figure 11.3c uses the same representation for the trend “PokémonGo!” on a world-wide scale; while there is a general main direction from the US east coast, several rules in the opposite direction indicate a more diverse dissemination pattern.

(a) “MH17” - world map

(b) “MH17” - detail map - south-east Asian

(c) “PokémonGo!” - world map

Figure 11.3: A spatio-temporal trend dissemination example.

Curiously, once again, south-east Asia is a strong hub for this trend, resulting from a local burst on this topic from Indonesia.

Rather than looking at a few hand-selected trends as shown in these figures, we use our trend mining algorithm introduced in Part II to automatically extract the disseminations of a large number of past trends. Each trend yields a spatio-temporal trend tensor, containing for each discrete time interval and each spatial region the number of corresponding tweets. As our first contribution, we postulate and verify the hypothesis that trends follow different archetypes, which differ strongly in terms of their dissemination patterns. Using a clustering approach, we identify these archetype trends. For new trends, this result can be used to quickly classify a new trend as an archetype trend, to more effectively predict its future dissemination, thus allowing us to predict where a trend will move in the near future.

To model the dissemination of trends in space and time, this chapter is organized as follows. Section 11.4 gives an overview of the state of the art in modelling trends in space and time. Section 11.2 formally defines a trend and introduces our notation and data structures to describe the spatio-temporal motion of a trend. Section 11.3.3 presents our technical concept for modeling the dissemination of a trend. This concept is experimentally evaluated in Section 13.5 and concluded in Section 11.6.

11.2 Preliminaries

This section defines the terms and notations used throughout this work, and formally defines the problems tackled in the following. In this chapter we consider spatio-temporal text data, that is, text data annotated with a geo-location and a timestamp, such as obtained from Twitter.

Definition 1 (Spatio-Temporal Text Database). A spatio-temporal text database DB is a collection of triples (s, t, c), where s is a point in space, t is a point in time, and c is a textual content.

A concept that we adopt from the literature is the concept of a trend as introduced in [164].

Definition 2 (Trend). A trend τ_{K,t} is a set of keywords K that appear significantly more often starting at a time t.

For a more formal definition, which introduces the requirements for a set of terms to be considered significant, please refer to Section 5.4 in Part II of this thesis. The set of spatio-temporal text objects which support a trend τ_{K,T} is denoted as DB_{τ_{K,T}} = {(s, t, c) ∈ DB | c ∈ K ∧ t ∈ T}.

Definition 3 (Spatio-Temporal Occurrence). Let τ_{K,T} be a trend. Let S = {S_1, ..., S_{|S|}} be a partitioning of space into spatial regions, and let 𝒯 be a partitioning of time into equi-sized time intervals denoted as epochs. Further, let T := t ∩ 𝒯 = {T_1, ..., T_{|T|}} be the set of epochs overlapping the trending time t. Then

$$Occ_{\tau_{K,T}, S} = |\{(s, t, c) \in \mathcal{DB} \mid s \in S \wedge t \in T \wedge c \in K\}|$$

is the number of occurrences of trend $\tau_{K,T}$ at region $S$.

The aim of this work is to find the dissemination of trends, that is, pairs of spatial regions $(S_1, S_2)$ such that any trend that appears in region $S_1$ is significantly more likely to appear in $S_2$ in the next epoch. To describe the motion of a trend $\tau_{K,t}$ in space and time, each trend is described by a time-space matrix, containing for each spatial region and each epoch $t \in T$ the number of tweets of the trend.

Definition 4 (Trend Count Matrix). The trend count matrix $D(\tau_{K,T}) \in \mathbb{R}^{|T| \times |\mathcal{S}|}$ contains all occurrences of trend $\tau_{K,T}$ over space and time, and is defined as follows:

$$D(\tau_{K,T})_{i,j} = Occ(\tau_{K,T_i}, S_j)$$

In this work, the main task is to analyze and mine multiple trend count matrices as defined in Definition 4, in order to identify groups of similar trends, groups of similar spatial regions, and to find common spatio-temporal dissemination of trends. These problems are formally defined as follows.

Definition 5 (Trend Clusters). Let $\mathcal{DB}$ be a spatio-temporal text database, let $\mathcal{DB}_\tau$ be a set of trends mined from $\mathcal{DB}$, and let $D(\tau)$, $\tau \in \mathcal{DB}_\tau$, denote the trend count matrix of each trend. A trend cluster $C \subseteq \mathcal{DB}_\tau$ is a set of trends that exhibit mutually similar trend count matrices.

Given a set of trends, the main challenge is to find association rules of the form “any trend observed in region A today is likely to appear in region B tomorrow”. This kind of spatio-temporal trend dissemination is defined as follows.

Definition 6 (Spatio-Temporal Trend Dissemination Rule). Let $\mathcal{DB}_\tau$ be a set of trends with their corresponding trend count matrices $D(\tau)$, $\tau \in \mathcal{DB}_\tau$. For two spatial regions $S_s$ and $S_t$, a spatio-temporal trend dissemination rule $S_s \rightarrow S_t$ implies that a large trend count at source region $S_s$ at any time $i$ indicates a large trend count at target region $S_t$ at time $i + 1$, formally:

$$(S_s \rightarrow S_t) \leftrightarrow \forall i, \forall \tau \in \mathcal{DB}_\tau : D(\tau)_{i,s} \rightarrow D(\tau)_{i+1,t},$$

where $D(\tau)_{i,s} \rightarrow D(\tau)_{i+1,t}$ denotes that a large value in $D(\tau)_{i,s}$ implies a large value in $D(\tau)_{i+1,t}$.

Finally, Definition 6 allows us to define the problem of spatio-temporal trend dissemination rule mining.

Definition 7. Let $\mathcal{DB}_\tau$ be a set of trends with their corresponding trend count matrices $D(\tau)$, $\tau \in \mathcal{DB}_\tau$. The problem of spatio-temporal trend dissemination rule mining is to find all pairs of spatial regions $(S_s, S_t)$ such that $(S_s \rightarrow S_t)$ holds.

11.3 Spatio-Temporal Trend Dissemination Rule Mining

This section describes our approach to mining spatio-temporal trend dissemination rules. As a first step, we need to acquire past trends to mine dissemination rules from. For this purpose, we apply our event detection algorithm introduced in Part II. This allows us to reduce the otherwise infeasible amount of textual data to a relatively small set of remaining terms that correspond to significant trends and topics. On these remaining candidates and their corresponding tweets, we employ a space decomposition scheme in Section 11.3.1, using a k-d tree to ensure a similar number of tweets in each spatial region. As a third step, we model the flow of trends over space and time in Section 11.3.2. To this end, we transform a trend count matrix, as defined in Definition 4, into a trend flow tensor, which describes the flow from any source region to any target region at any point in time for any trend. Consequently, constructing a trend flow tensor for each trend that we have detected in the first step yields a four-mode space × space × time × trends tensor, which is fed into our fourth step, the mining step. In the mining step proposed in Section 11.3.3, we employ a tensor factorization approach to discover latent features of trends, latent features of trend source regions and latent features of trend target regions. These latent features allow us to cluster trends into sets of trends which disseminate similarly over space and time. Then, for each cluster of similar trends, we obtain trend flows from the reconstructed trend flow tensor.

11.3.1 Space Decomposition Scheme

To fit a flow model between spatial regions, we need to minimize the bias that results from the non-uniform distribution of tweets on earth. We solve this problem by partitioning the geo-space in a way that minimizes the difference in the number of tweets between spatial regions. For this purpose, we insert the geo-locations of all tweets in our database into a k-d tree with a maximum node capacity of 1,000. Thus, every leaf node of this k-d tree is guaranteed to hold between 500 and 1,000 two-dimensional points. Each of these leaf nodes is then used as a spatial region in the remainder of this work. The decomposition obtained this way is exemplarily shown in Figure 11.4. Note that this tree is constructed upon a typical, yet static, set of tweets.
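For illustration, such a capacity-based decomposition can be sketched as follows. This is a minimal sketch under our own assumptions (median splits with alternating axes); the thesis only specifies the node capacity, not the splitting rule:

```python
import numpy as np

def kd_decompose(points, capacity=1000, axis=0):
    """Recursively split the (longitude, latitude) points at the median
    of the current axis until each leaf holds at most `capacity` points.
    Every leaf is then used as one spatial region."""
    if len(points) <= capacity:
        return [points]
    median = np.median(points[:, axis])
    left = points[points[:, axis] <= median]
    right = points[points[:, axis] > median]
    if len(left) == 0 or len(right) == 0:  # degenerate split due to ties
        return [points]
    # alternate the split axis (longitude/latitude) on every level
    return (kd_decompose(left, capacity, 1 - axis) +
            kd_decompose(right, capacity, 1 - axis))

# regions = kd_decompose(np.asarray(tweet_coordinates))
```

Because each median split halves a node, every leaf produced this way holds between capacity/2 and capacity points, matching the 500 to 1,000 points per region stated above.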

11.3.2 Trend Flow Modeling

In this section we describe our approach for obtaining a trend flow from raw trend counts. For a given trend, we consider all $N$ occurrences of this trend at some time $t$ and all $M$ occurrences at the next time $t + 1$. All regions having the trend at time $t$ can be considered as sources of the trend, and all regions having the trend at time $t + 1$ can be considered as targets of the trend.

Figure 11.4: A k-d tree based space decomposition

Yet, we do not know specifically which source region has affected which target region and to what degree, since we do not know through which channels and media the trend was disseminated. Thus, in the absence of better knowledge, we assume that all sources affect all targets uniformly. This flow model is formalized as follows.

Definition 8 (Spatio-Temporal Trend Flow Model). Let $\tau_{K,T}$ be a trend having a set of keywords $K$ and a time interval $T = \{T_1, \ldots, T_{|T|}\}$ which covers $|T|$ epochs. Let $\mathcal{S} = \{S_1, \ldots, S_{|\mathcal{S}|}\}$ be a space decomposition into spatial regions. Furthermore, let $D(\tau_{K,T})_{i,j} = Occ(\tau_{K,T_i}, S_j)$ be the trend count matrix of $\tau_{K,T}$. We define the trend flow model $F(\tau_{K,T})$ of trend $\tau_{K,T}$ as an $\mathcal{S} \times \mathcal{S} \times \{T_1, \ldots, T_{|T|-1}\}$ tensor, such that

$$F(\tau_{K,T})_{i,j,k} = \frac{Occ(\tau_{K,T_k}, S_i) \cdot Occ(\tau_{K,T_{k+1}}, S_j)}{\sum_{S_n \in \mathcal{S}} Occ(\tau_{K,T_{k+1}}, S_n)}$$

Intuitively, an entry $F(\tau_{K,T})_{i,j,k}$ of tensor $F(\tau_{K,T})$ corresponds to the absolute flow of occurrences from region $S_i$ to region $S_j$ from time $T_k$ to time $T_{k+1}$.

$$(2, 0, 0, 1)^T \xrightarrow{T_i \to T_{i+1}} (3, 1, 0, 5)^T \quad\Rightarrow\quad F(\tau_{K,T})_{\cdot,\cdot,i} = \begin{pmatrix} \frac{6}{9} & \frac{2}{9} & 0 & \frac{10}{9} \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \frac{3}{9} & \frac{1}{9} & 0 & \frac{5}{9} \end{pmatrix}$$

$$(3, 1, 0, 5)^T \xrightarrow{T_{i+1} \to T_{i+2}} (2, 1, 3, 4)^T \quad\Rightarrow\quad F(\tau_{K,T})_{\cdot,\cdot,i+1} = \begin{pmatrix} \frac{6}{10} & \frac{3}{10} & \frac{9}{10} & \frac{12}{10} \\ \frac{2}{10} & \frac{1}{10} & \frac{3}{10} & \frac{4}{10} \\ 0 & 0 & 0 & 0 \\ \frac{10}{10} & \frac{5}{10} & \frac{15}{10} & \frac{20}{10} \end{pmatrix}$$

Figure 11.5: Trend flow modelling

Example 1. To illustrate the construction of tensor $F(\tau_{K,T})$, consider the example depicted in Figure 11.5. Here, the occurrence matrix of a trend is shown for four spatial regions. At the first point of time $T_i$, the trend appears twice in the first region and once in the fourth region, yielding the vector $(2, 0, 0, 1)^T$. At the second point of time $T_{i+1}$ and the third point of time $T_{i+2}$, the distribution of occurrences is $(3, 1, 0, 5)^T$ and $(2, 1, 3, 4)^T$, respectively, yielding the trend matrix shown in Figure 11.5. Transitioning from the first epoch $T_i$ to the second epoch $T_{i+1}$, the occurrences change from $(2, 0, 0, 1)^T$ to $(3, 1, 0, 5)^T$. The first spatial region $S_1$, having an initial value of two tweets, is thus a source of the trend. Since we cannot observe the latent means of dissemination of a trend (through the internet, via TV, radio, word-of-mouth, etc.), we estimate that region $S_1$ disseminates its trend to all other regions having this trend. Since a fraction $\frac{3}{9}$ of all tweets at time $T_{i+1}$ is observed in region $S_1$, we estimate a trend flow of $\frac{2 \cdot 3}{9}$ from region $S_1$ to itself. In contrast, only one trending tweet is observed at region $S_2$ at time $T_{i+1}$, to which we attribute a flow of $\frac{2 \cdot 1}{9}$ from $S_1$. Similarly, a flow of $\frac{1 \cdot 5}{9}$ is contributed from $S_4$ to $S_4$.

It is notable that each time-slice of tensor $F(\tau_{K,T})$ is a rank-1 matrix, as all rows are multiples of each other. This redundancy is desirable, as it evenly distributes the flow from all source regions to all target regions, and it will be removed in a later tensor factorization step. For each trend $\tau_{K,T}$ we obtain a three-mode tensor as described in Definition 8. Concatenating these tensors for each trend $\tau \in \mathcal{DB}_\tau$ yields a four-mode tensor $\mathcal{F}(\mathcal{DB})$, which is passed into the trend flow mining step described in the following.
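The construction of $F(\tau_{K,T})$ from a trend count matrix reduces to one normalized outer product per epoch transition. A minimal numpy sketch of the formula in Definition 8 (our own illustration, not the thesis implementation), reproducing Example 1:

```python
import numpy as np

def trend_flow_tensor(D):
    """D: |T| x |S| trend count matrix (epochs x regions), cf. Definition 4.
    Returns F with F[i, j, k] = D[k, i] * D[k+1, j] / sum_n D[k+1, n],
    the flow of occurrences from region S_i to region S_j between epochs
    T_k and T_{k+1} (Definition 8)."""
    n_epochs, n_regions = D.shape
    F = np.zeros((n_regions, n_regions, n_epochs - 1))
    for k in range(n_epochs - 1):
        total = D[k + 1].sum()
        if total > 0:
            # outer product: each slice is the rank-1 matrix noted above
            F[:, :, k] = np.outer(D[k], D[k + 1]) / total
    return F

# Example 1: epochs (2,0,0,1), (3,1,0,5), (2,1,3,4) over four regions
D = np.array([[2, 0, 0, 1], [3, 1, 0, 5], [2, 1, 3, 4]])
F = trend_flow_tensor(D)
assert abs(F[0, 0, 0] - 2 * 3 / 9) < 1e-12  # flow from S_1 to itself
```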

11.3.3 Trend Flow Mining

We propose to decompose the tensor $\mathcal{F}(\mathcal{DB}) \in \mathbb{R}^{I_1 \times \ldots \times I_N}$ using a CANDECOMP/PARAFAC (CP) tensor decomposition [45, 79] with $k$ latent features, where $k$ is a parameter of our algorithm.

Figure 11.6: Trend Flow Modelling - Tensor Decomposition

A CP factorization decomposes a tensor into a sum of component rank-one tensors, i.e.

$$\mathcal{F}(\mathcal{DB}) \approx \sum_{r=1}^{k} u_r^1 \circ \cdots \circ u_r^N,$$

where $u_r^n \in \mathbb{R}^{I_n}$ for $n = 1, \ldots, N$. Hence, as illustrated in Figure 11.6, this factorization decomposes our four-mode $\mathcal{S} \times \mathcal{S} \times T \times \mathcal{DB}_\tau$ tensor into four sets of vectors:

• a set of $k$ vectors of latent features of length $|\mathcal{S}|$ describing each source spatial region,

• a set of $k$ vectors of latent features of length $|\mathcal{S}|$ describing each target spatial region,

• a set of $k$ vectors of latent features of length $|T|$ describing each time epoch, and

• a set of $k$ vectors of latent features of length $|\mathcal{DB}_\tau|$ describing each trend.

These $k$-dimensional feature vectors can be used to identify mutually similar source spatial regions, mutually similar target spatial regions, mutually similar points in time, and mutually similar trends.
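Section 11.5.4 states that the implementation uses the sktensor package; a minimal sketch along those lines follows. The variable names, the rank $k = 4$ (chosen in Section 11.5.3), and the random initialization are assumptions on our part:

```python
from sktensor import dtensor, cp_als

# F4: four-mode numpy array of shape (|S|, |S|, |T|, |DB_tau|),
# obtained by stacking the per-trend flow tensors F(tau)
k = 4  # number of latent features (cf. Section 11.5.3)
P, fit, itr, exectimes = cp_als(dtensor(F4), k, init='random')

# P.U holds one factor matrix per mode; each row is a k-dimensional
# latent feature vector:
source_features = P.U[0]  # |S|      x k, source spatial regions
target_features = P.U[1]  # |S|      x k, target spatial regions
epoch_features  = P.U[2]  # |T|      x k, time epochs
trend_features  = P.U[3]  # |DB_tau| x k, trends, i.e. feat(tau)
```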

11.3.4 Trend Archetype Clustering

In our first mining step, we identify clusters of mutually similar trends, i.e., trends which have a similar feature vector after the factorization and thus, since the tensor $\mathcal{F}(\mathcal{DB})$ describes the flow of trends over time, exhibit a similar dissemination over space and time. Each of the resulting clusters is called a trend archetype. This approach allows us to classify future trends among all archetypes, and to predict the future dissemination of a new trend by using the dissemination model of its archetype.

Definition 9. Let $\mathcal{DB}_\tau$ be a set of trends, and for each trend $\tau \in \mathcal{DB}_\tau$ let $feat(\tau)$ be a set of features describing $\tau$. Further, let $\mathcal{C}(\mathcal{DB}_\tau) = \{C_1, \ldots, C_n\}$ be a clustering of all trends in $\mathcal{DB}_\tau$ into $n$ clusters. Then we denote each cluster $C \in \mathcal{C}$ as an archetype, and all trends $\tau \in C$ are said to belong to the same archetype.
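For instance, the clustering $\mathcal{C}(\mathcal{DB}_\tau)$ could be obtained by running k-means on the latent trend features from the factorization. The thesis does not name the clustering algorithm used, so k-means and the cluster count (12, matching Table 11.1) are assumptions of this sketch:

```python
from sklearn.cluster import KMeans

# trend_features: |DB_tau| x k matrix of latent feature vectors feat(tau)
kmeans = KMeans(n_clusters=12, random_state=0).fit(trend_features)

# archetypes[c] lists the indices of all trends belonging to archetype c
archetypes = {c: [i for i, label in enumerate(kmeans.labels_) if label == c]
              for c in range(kmeans.n_clusters)}
```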

11.3.5 Trend Archetype Flow Modelling

After the trend clustering step of Section 11.3.4, we can identify sets of trends which belong to the same dissemination archetype. Therefore, we return to the full tensor $\mathcal{F}(\mathcal{DB})$, and for each archetype $C$, we select only the trends $\tau \in C$, thus yielding an $\mathcal{S} \times \mathcal{S} \times T \times C$ tensor $\mathcal{F}(\mathcal{DB}, C)$ for each archetype $C$. Using $\mathcal{F}(\mathcal{DB}, C)$, we perform a projection on the two modes $\mathcal{S} \times \mathcal{S}$ by averaging over all trends $\tau \in C$ and all epochs $T_i \in T$ to obtain the flow model of archetype $C$.
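In the array layout sketched above, this projection is a single averaging operation (again only an illustration of the step described in the text):

```python
# F4: (|S|, |S|, |T|, |DB_tau|) flow tensor; archetypes[c]: trend indices of C
flow_model_C = F4[:, :, :, archetypes[c]].mean(axis=(2, 3))  # |S| x |S|
```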

11.4 Related Work and Discussion

In contrast to the traditional related work for Event Detection, as discussed in Part II, we will exploit the spatio-temporal characteristics of an event. Unankard et al. [185] extracted user locations and event locations from geo-tagged posts. They defined a location correlation score between user and event locations and used it to identify hotspot events. Zhou et al. [206] extended Latent Dirichlet Allocation (LDA) to incorporate the location information of social messages, and proposed a novel location-time constrained topic model. They then detected events by conducting similarity joins on streams of social messages. Sakaki et al. [158] conducted semantic analysis of user posts to detect natural disasters. They used an exponential distribution to study the temporal characteristics of disasters, and used a Kalman filter and a particle filter to predict the spatial trajectories of disasters. From the perspective of query processing, Lappas et al. [110] defined two types of spatio-temporal burstiness patterns, aiming at finding terms which have unusually high frequencies in a spatial region within a particular time interval. Sankaranarayanan et al. [160] developed a news system based on Twitter streams. They used a Naive Bayes classifier to distinguish valuable news from junk posts and used an algorithm called leader-follower clustering to cluster news into topics. Appice et al. [26] proposed a technique where trend clusters are used to summarize sensor readings. However, such clusters consist of the sensor entities themselves, as opposed to trends.

Table 11.1: Trend Archetypes of 2014

#    Size   Example 1 Keywords             Example 2 Keywords
1    8      mh17 malaysia_crash            ferguson michael_brown riot
2    3      ellen degeneres selfie         robin williams suicide
3    5      whatsapp facebook takeover     supreme_court obergefell hodges
4    10     germany fifa14 brazil          germany fifa14 argentina
5    4      brazil world_cup               ebola
6    12     eu_sanction eu_russia          putin peskov conference
7    1      chile iquique earthquake       -
8    10     flappy_bird removed_appstore   how_I_met_your_mother_finale
9    18     mh370 malaysia_missing         qz8501 air_asia missing
10   14     scotland independence_poll     india bharatiya janata election
11   14     sydney siege hostage           ottawa gunman parliament
12   1      merry chistmas                 -

11.5 Experimental Evaluation

11.5.1 Parameters and Data Set

We evaluated our proposed workflow on the same Twitter data set that we introduced in the previous parts of this thesis. Tweets were aggregated over one-day periods by their UTC timestamp. The number of tweets per day ranged from around 50,000 to 150,000. For each trend candidate extracted by our event detection algorithm, we extracted the corresponding original tweets from one day before to five days after the respective associated date to cover the entire trend dissemination pattern. Unless otherwise specified, each day was subdivided into epochs of six hours to allow for time shifts between different hemispheres. For the majority of our experiments, we used the top-100 trends of the year 2014.

11.5.2 Evaluation of Trend Archetypes

Table 11.1 depicts some exemplary trend archetypes resulting from data covering the year 2014. Keywords for the top-100 trends were extracted with our Event Detection algorithm from Part II and used to filter geo-tagged tweets occurring within a 5-day timeframe around the trend date. Underscores "_" between words denote a boolean conjunction, requiring all connected words to occur in any possible order within one tweet.

(a) MH370 - March 8th 2014 (b) Ferguson - Nov 25th 2014

(c) MH370 - March 10th 2014 (d) Ferguson - Nov 27th 2014

(e) MH370 - March 12th 2014 (f) Ferguson - Nov 29th 2014

Figure 11.7: Dissemination of trends “MH370” and “Ferguson”

Spaces between keywords or conjunctions of keywords denote a boolean disjunction. The keywords listed are not exhaustive. Each line of the table corresponds to a resulting archetype of trends with similar dissemination, resulting from a clustering of the latent feature vectors $feat(\tau)$. While column “Size” refers to the true cardinality of each cluster, (up to) two examples are given to illustrate the nature of each archetype. Each example lists some keywords of one trend grouped into this archetype. Some rather interesting results emerge by comparing the keywords to their respective historical events. While archetype #9 contains two trends referring to airplanes going missing without a trace (MH370 in March and QZ8501 in December), another lost airplane is grouped together with riots in the aftermath of a police shooting in the US in archetype #1. Looking at the respective tweet heatmaps in Figure 11.7, a similarity in pattern emerges: a first main event occurs (“plane crashes in Ukraine” vs. “riots after jury decision not to indict shooter”), causing an initial burst mainly in the affected areas (Figures 11.7a for MH370 and 11.7b for the shooting). After the initial burst, new information sheds different light on the events, making them stand out and causing a more steady flow of messages internationally (“plane downed by missile” vs. “several people killed as riots spread”).

(a) Fit for number of latent features. (b) Fit for length of trend epochs.

Figure 11.8: Approximation fit of factorized tensor.

This more steady output can be seen in Figures 11.7c and 11.7e for MH370 and Figures 11.7d and 11.7f for the shooting. Bear in mind that the grouping occurred solely based on the numerical features of the respective trends’ spatial dissemination, regardless of their content. Trend archetype #2 groups some strong international trends related to society, containing Ellen DeGeneres’ selfie picture taken at the Oscar ceremony as well as Robin Williams’ sudden suicide. Archetype #3 contains trends with more specialised contents, such as financial (“Facebook buys WhatsApp”) or judicial (“Obergefell vs. Hodges, Supreme Court deciding on same-sex marriage”) topics. Another distinction is made between archetypes #4 and #5, both containing trends regarding the FIFA World Cup 2014 in Brazil: while #4 represents game results and surprising or strong wins, #5 contains the more steady general discussion about the event, as well as other longer-term themes sparking much discussion. Among those is also the repeated outbreak of the Ebola virus in West Africa. Despite the entirely different nature of those topics, both represent a great public interest that dominated news media for longer periods of time.

11.5.3 Evaluation of Approximation Quality

The tensor decomposition employed in our flow modelling process exhibits a high quality even for low numbers of $k$, i.e., a small number of latent features per feature vector. This indicates large eigenvalues of the first $k$ latent features, and thus that these features are able to accurately describe the whole tensor with little loss of information. However, some information is still lost compared to an undecomposed tensor. We evaluate the quality of our decomposition by summing up the least-squares error between a reconstruction of the original tensor from its $k$ feature vectors and the original tensor itself. We call the inverse of this error “fit”, ranging from 1.0 for an exact match to 0.0 for no correlation. Figure 11.8a shows that for $k = 4$, the reconstructed tensor matches its original with a fit of 0.6, which is why we chose to set $k = 4$ in all subsequent experiments unless otherwise specified.

Figure 11.9: Fit over tree cells for varying latent features.

As we can see, the gain in fit slows down with additional latent features. Figure 11.8b displays the fit for different lengths of trend epochs, the granularity of our analysis in the temporal dimension, ranging from 2 hours to 24 hours. The number of days considered per trend remained the same, so a longer epoch results in a smaller number of epochs overall, reducing the size of $\mathcal{F}(\mathcal{DB})$ in the $T$ dimension. Intuitively, a smaller tensor $\mathcal{F}(\mathcal{DB})$ is easier to reconstruct, increasing the fit for longer epochs. However, this does not hold for epochs of 24 hours. We believe this to be due to a counter-effect of more diversity in tree cell population as epochs get longer and thus more tweets are grouped in the same epoch. In the other experiments, we set the epoch length to 6 hours unless otherwise specified – although it is not the peak for fit, we found it to best approximate trends from different global regions, hence being able to compare trends in different hemispheres where peaks happen at different hours of the day. The effect of varying spatial resolution can be seen in Figure 11.9 for four alternative settings of $k$. Although the underlying k-d tree is built on the global tweet distribution to assure that tweets in the same region from different trends are matched to the same cell, varying its node capacity upon indexing results in a higher- or lower-resolved spatial grid, hence lowering or increasing the size of $\mathcal{F}(\mathcal{DB})$ in both spatial dimensions. Naturally, a smaller grid is easier to approximate with the same number of latent features, yet the experiments show that the number of features has a much higher impact on approximation quality than changing the spatial resolution. As we can see, fit values do not deteriorate much for higher numbers of grid cells.

The impact of different numbers of trends $\tau_{K,T}$ is stronger, particularly for smaller $k$. Figure 11.10 displays fit values for four alternative settings of $k$ and the number of trends ranging from 20 to 100.

Figure 11.10: Fit over trends for varying latent features.

Figure 11.11: Runtime over tree cells for varying latent features.

As in previous experiments, the fit decreases as the size of $\mathcal{F}(\mathcal{DB})$ increases. However, for higher $k$ the effect is drastically smaller, maintaining a good approximation quality at the cost of a higher complexity.

11.5.4 Evaluation of Algorithmic Runtime

The following experiments evaluate the runtime of the tensor generation, decomposition and projection on the two modes $\mathcal{S} \times \mathcal{S}$. The filtering of tweets is not included in this evaluation since it depends heavily on the actual keyword settings as well as the size of the underlying data set. All experiments were performed on Arch Linux on an Intel i7 notebook with 16 GB of memory, implemented in the Python language using numpy, pandas, and the sktensor package for tensor decomposition.

Figure 11.12: Runtime over trends for varying latent features.

Figure 11.11 examines the runtime in seconds over the spatial resolution, for four different settings of $k$. Since an increase in the number of tree cells causes a quadratic increase in the size of $\mathcal{F}(\mathcal{DB})$, the runtime scales superlinearly for higher spatial resolutions. The effect of different numbers of trends $\tau_{K,T}$ on the runtime is shown in Figure 11.12. Runtimes show only a slight superlinear increase for higher numbers of trends, as the size of $\mathcal{F}(\mathcal{DB})$ increases linearly with the number of trends.

11.6 Conclusions

In this chapter, we studied the dissemination of trends in space and time. For each trend obtained from our Event Detection algorithm of Part II, we proposed to construct a spatio-temporal trend dissemination model to describe the flow of a trend through space and time. By applying a tensor factorization approach, we extracted latent features of trends, to which we applied a clustering approach to obtain sets of trends having a similar dissemination archetype. Our qualitative evaluation of these trend archetypes on Twitter trends shows meaningful dissemination archetypes, such as political trends, celebrity trends, and disaster trends. Our quantitative analysis shows that our tensor factorization yields a high approximation quality for a low number of latent features. This result implies that the small number of latent features we derive from the flow of each trend is able to discriminate trends with high precision. The algorithm outlined in this chapter can be included in a system to classify trend archetypes immediately after they have been discovered by our Event Detection algorithm.

Chapter 12

Event Co-location Mining

In this chapter we will show how to make use of the underlying social network of users within our Event Detection pipeline. We introduce geo-social co-location mining, the problem of finding social groups that are frequently found at the same location. This problem has applications in social sciences, allowing researchers to study interactions between social groups and permitting social-link prediction. It can be divided into two sub-problems. The first sub-problem of finding spatial co-location instances requires properly addressing the inherent uncertainty in geo-social network data, which is a consequence of generally very sparse check-in data, and thus very sparse trajectory information. For this purpose, we propose a probabilistic model to estimate the probability of a user being located at a given location at a given time, creating the notion of probabilistic co-locations. The second sub-problem of mining the resulting probabilistic co-location instances requires efficient methods for large databases having a high degree of uncertainty. Our approach solves this problem by extending solutions for probabilistic frequent itemset mining. Our experimental evaluation performed on real (but anonymized) geo-social network data shows the high efficiency of our approach, and its ability to find new social interactions.

12.1 Introduction

Spatial features describe the presence or absence of geographic object types at different locations. Examples of spatial features include plant species, animal species, road types, cancers, crimes, and business types, or features of individuals, such as personal preferences, or simply their ID. A spatial co-location pattern represents a subset of spatial features whose instances are frequently located in a spatial neighborhood. For example, “botanists may have found that there are orchids in 80% of the area where the middle-wetness green-broad-leaf forest grows” (example taken from [190]). Spatial co-location patterns may yield important insights for many applications. For example, a mobile service provider may be interested in services frequently requested by geographical neighbors, and thus gain sales promotion data. Other application domains include Earth science, public health, biology, transportation and geo-social networks. Traditional solutions for the problem of frequent co-location mining [190] consider classical spatial data, where each data record has a spatial location. In this chapter we take the problem of spatial co-location mining into a new context, by considering spatio-temporal data, i.e., trajectory data of individuals. Thus, the problem now is to find groups of users which frequently co-locate in geo-space over time, creating the notion of geo-social co-location mining. There is already an abundance of public data sets that can be mined, including data sets from geo-social networks [54] and from social networks using geo-tags, such as Twitter. Frequent co-location mining on such data may yield interesting patterns, such as “Members of LMU and HKU are frequently to be found at the same location, while members of some other university are often found in solitude or among themselves”. In such an application, each instance of a co-location corresponds to a $(l, t, S)$ triple, where $S$ denotes the set of individuals that have been at the same location $l$ at the same time $t$. The problem of geo-social co-location mining introduces two major new challenges which have not been sufficiently covered in existing work on traditional co-location mining. Firstly, the temporal dimension leads to very large sets of co-location instances, since every location and time pair leads to a possibly non-empty co-location instance. Secondly, existing solutions do not consider the uncertainty which is inherent in spatial data: spatial data may be imprecise (e.g., due to measurement errors), data can be obsolete (e.g., when the most recent position update is already minutes old), data may originate from unreliable sources (such as crowd-sourcing), or it may be blurred to prevent privacy threats and to protect user anonymity [55]. For example, the oval regions in Figure 12.1 may correspond to individual persons, while the color of each person may represent the individual’s affiliations. Here, the location of each person is a conservative approximation based on the user’s GPS history. It is important to note that we are considering historic data. Thus, for a given point of time $t$, both past and future GPS positions of a user may be available.1 Given these approximations, it becomes possible to estimate which point of interest each user is currently visiting, yielding the probabilities shown in the table in Figure 12.1 for the depicted point of time (22:00) and for a point of time one hour later.
Given such data, we can immediately envision a number of useful applications:

• Find groups of people often co-locating. In the setting described above, two individuals being located at the same location may not actually be there together, but if the two individuals co-locate very often, it becomes highly unlikely that their co-locations are independent random events.

• Find groups of people visiting the same types of points of interest, even if not at the same time. This allows clustering users by their points of interest, thus allowing to predict new locations that a user might find interesting, based on other users in the same cluster.

1 A probabilistic model to estimate the position of a mobile user given past and future observations can be found in [144].

Figure 12.1: Spatial co-location mining in uncertain spatio-temporal data (Source: http://maps.google.de).

• Classify points of interest by the people that visit them. For example, if an unlabeled point of interest is being visited by a significant fraction of users which (individually or even together) visit Italian restaurants, then it might be possible to predict the label of this point of interest.

• When visiting a new city, such as Melbourne, new people that are similar to people you hang out with at home, and new points of interest that are similar to locations you visit at home, can be recommended to you and to the people that you now hang out with in Melbourne.

12.2 Problem Definition

In traditional co-location mining, the location of an object is known for certain. Under this assumption, a lot of work has been published in the last decade [82, 83, 172, 81, 205]. A survey on the field of co-location mining on certain spatial data can be found in [128, 129]. However, in many real applications such as plant disease diagnosis, environmental surveillance and geo-social networks, the location of objects is uncertain. In the following, the problem of probabilistic spatial co-location mining on uncertain spatial data is defined. To formally define the problem of spatial co-location mining, we first have to define the concept of a spatial co-location:

Definition 10 (Spatial Co-location Instance). Given a reflexive and symmetric neighbor relation $\mathcal{R}$ over a spatial database $\mathcal{DB}$, a spatial co-location instance is a set $I \subseteq \mathcal{DB}$ of spatial objects that form a clique [34] under the relation $\mathcal{R}$, i.e., $\forall o_1, o_2 \in I : (o_1, o_2) \in \mathcal{R}$.

In this work, a neighbor relation with particular importance in social network applications will be used. This relation uses a set of interesting spatial locations, such as bars, restaurants and football stadiums. Two individuals are co-located if they are sufficiently close to the same location, formally:

Definition 11. Let $\mathcal{L}$ be a set of spatial locations, and let $\mathcal{DB}$ be a database of spatial objects. The neighbor relation $\mathcal{R}$ is defined as follows:

$$(o_i, o_j) \in \mathcal{R} \Leftrightarrow \exists l \in \mathcal{L} : dist(o_i, l) \leq \epsilon \wedge dist(o_j, l) \leq \epsilon, \quad o_i, o_j \in \mathcal{DB}$$

An example of the problem of frequent co-location mining in uncertain spatial data using the neighborhood relation of Definition 11 is given in the following.

Example 2. Consider uncertain positions of individuals in a geo-social network application. The task is to find groups of people that commonly spend time at the same locations, in order to predict missing links in the underlying social network, or to offer special deals to such groups. Figure 12.1 exemplarily shows the positions of individuals $A, \ldots, G$, and three locations: a café, a restaurant and a bar. For simplicity, each of these locations is represented by an oval region, but in practice, these uncertainty regions can have arbitrary shapes [144]. It is not possible to tell for certain whether user $A$ is located inside the café, or just barely outside of it. In contrast, user $D$ is certainly inside the restaurant, while user $C$ is certainly outside all three places. The probability $P(U \text{ in } l)$ that a user $U$ is located inside a location $l$ can be computed using techniques for range queries on uncertain data [37].2 Exemplary probabilities $P(U \text{ in } l)$ for all users $U$ and all locations $l$ are shown in the table of Figure 12.1. At time 22:00, the users $E$, $F$ and $G$ are co-located at the bar with a high probability. However, at time 23:00, user $G$ is likely no longer located with users $E$ and $F$.

Clearly, the number of co-locations may be extremely large, since in an application like this, there may be one non-empty co-location for each combination of time stamp and location. Since the users’ location readings are discrete in time, a global partitioning of the data set into separate time slots (e.g., 1 hour) makes sense. The task of probabilistic co-location mining is to find groups of users (objects) having a significantly high probability of having spent time at the same location a sufficiently large number of times. Formally, the problem of probabilistic frequent spatial co-location mining is defined as follows.

2 For the case that proximity to a location is not modelled by a circle, an adaptation of the techniques in Section 12.4 can be made easily, by replacing the distance calculation by intersection tests between points and polygons.

Definition 12. Given a set $\mathcal{F} = \{f_1, \ldots, f_k\}$ of $k$ spatial features, given a database $\mathcal{DB} = \{o_1, \ldots, o_N\}$ of $N$ uncertain spatial objects, each having a set $f(o_i \in \mathcal{DB}) \subseteq \mathcal{F}$ of spatial features, and given a positive integer $minSup$ and a probability threshold $\tau$, a probabilistic frequent spatial co-location mining algorithm returns all sets $S \subseteq \mathcal{F}$ of features such that the probability that there exist at least $minSup$ spatial co-location instances $I$ with $S \subseteq \bigcup_{o \in I} f(o)$ is at least $\tau$.

Example 3. Returning to Example 2, assume that $minSup = 2$ and $\tau = 0.5$ and consider the spatial features $F = \{red, green, purple\}$, depicted by the corresponding colors in Figure 12.1, which may, e.g., correspond to the affiliations of mobile users. In this example, we have two possible co-location instances of features red and green, in the café and in the bar, at times 22:00 and 23:00. Assuming independence between uncertain objects3, the probability of a co-location of green and red at the café at time 22:00 can be computed as the product of marginal probabilities $P(A \wedge B) = P(A) \cdot P(B) = 0.4 \cdot 0.2 = 0.08$. At the bar at time 22:00, the probability of a co-location between red and green can be computed as $P(E \wedge (F \vee G)) = P(E) \cdot (1 - P(\neg F) \cdot P(\neg G)) = 0.6 \cdot (1 - 0.3 \cdot 0.2) = 0.564$. At time 23:00, we obtain co-location probabilities of red and green at the café and the bar of 0.9 and 0.588, respectively. Given these probabilities, we can compute the probability that at least $minSup = 2$ co-location instances exist by applying the generating functions technique of [118, 119], yielding a probability of 0.778, which is greater than $\tau = 0.5$. Thus the set of spatial features red and green will be returned as a probabilistic frequent co-location.

In the following section, we propose solutions to compute the probabilities of probabilistic frequent co-locations efficiently.

12.3 Related Work

Traditional co-location mining on (certain) spatial data has been studied in the past [198, 205, 81]. These works define a spatial neighborhood relation on pairs of objects that do not exceed a given distance threshold. Due to the assumption of certain objects, these works can solve the problem of frequent co-location mining by applying traditional frequent pattern mining solutions such as the Apriori algorithm [14], combining the discovery of spatial neighborhoods with the mining process.

3 We argue that in many applications, this assumption holds true. Note that the positions of mobile objects can be strongly correlated, as for example friends are more likely to travel together. However, the assumption that measurement errors are mutually independent does often hold. Thus, we assume GPS errors between different devices to be independent, and uncertainty regions that are added deliberately for privacy preservation should be independent as well. Nevertheless, this assumption of independent random variables of spatial locations can be a basis for discussion.

The problem of probabilistic co-location mining in uncertain spatial data is related to the problem of frequent itemset mining in uncertain transaction databases. Existing solutions for this problem transform uncertain items into certain ones by thresholding the probabilities – for example, by treating all uncertain items with a probability value higher than 0.5 as being present, and all others as being absent in a transaction. Such an approach loses useful information and leads to inaccuracies. Existing approaches in the literature are based on expected support ([56, 57, 9]): itemsets are considered frequent if the expected support exceeds $minSup$. Effectively, this approach returns an estimate of whether an object is frequent or not with no indication of how good this estimate is. Since uncertain transaction databases yield uncertainty w.r.t. the support of an itemset, the probability distribution of the support and, thus, information about the confidence of the support of an itemset is very important. This information, while present in the database, is lost using the expected support approach. There is a large amount of research on Frequent Itemset Mining (FIM), but very little work addresses FIM in uncertain databases [56, 57, 117]. The approach proposed by Chui et al. [57] computes the expected support of itemsets by summing all itemset probabilities in their U-Apriori algorithm. Later, in [56], they additionally proposed a probabilistic filter in order to prune candidates early. In [117], the UF-growth algorithm is proposed. Like U-Apriori, UF-growth computes frequent itemsets by means of the expected support, but it uses the FP-tree [78] approach in order to avoid expensive candidate generation. In contrast to our probabilistic approach, itemsets are considered frequent if the expected support exceeds $minSup$. The main drawback of this estimator is that information about the uncertainty of the expected support is lost; [56, 57, 117] ignore the number of possible worlds in which an itemset is frequent. [204] proposes exact and sampling-based algorithms to find likely frequent items in streaming probabilistic data. However, they do not consider itemsets with more than one item. To the best of our knowledge, our approach in [36] was the first able to find frequent itemsets in an uncertain transaction database in a probabilistic way. This publication has stimulated research on the field of probabilistic mining of frequent itemsets in uncertain transaction data, creating a large number of follow-up publications. A detailed survey can be found in [183]. In [188, 189], an approach is presented to approximate the support PDF of an itemset using a Poisson distribution. This approach yields a very small error if the database is sufficiently large.
This approximation furthermore allows computing the support PDF of an item much faster than the exact approach presented in [36]. An approach to accelerate the computation of our approach in [36] was presented by [98], using massive parallelization exploiting GPGPU (general-purpose computation on GPUs). Furthermore, the related problem of mining frequent subgraphs over uncertain graphs [120, 214, 213] has gained a lot of research interest in the last years. Finally, an approach for probabilistic frequent itemset mining on uncertain data avoiding the multiple database scans incurred by the candidate generation step of [36] has been proposed in [35].

Figure 12.2: Workflow of probabilistic spatial co-location mining. (From uncertain spatial data, probabilistic similarity search derives probabilistic co-locations, which are mined by probabilistic frequent itemset mining into probabilistic frequent co-locations such as {B, C, D} and {E, F}.)

Only recently has the research community tackled the challenge of spatial co-location mining in uncertain data. Recent work from [190] considers existential uncertainty in spatial data. In this model, each object has a probability to be present in the database. The solution of [190] has a run-time polynomial in the number of possible worlds, and thus exponential in the number of uncertain objects. The reason for this high complexity is that the neighborhood relation $R(\cdot, \cdot)$ used in [190] is arbitrary, i.e., this approach can be applied to any neighborhood relation. This fact makes efficient co-location mining hard: for three uncertain objects $A$, $B$ and $C$, the predicates $R(A, B)$ and $R(B, C)$ are stochastically dependent, despite the assumption of independence between objects.

12.4 Probabilistic Frequent Co-Location Mining

In a nutshell, the problem of probabilistic co-location mining requires two subtasks to be solved, as illustrated in Figure 12.2:

• First, for each location $l$ and each time interval $t$, probabilistic co-location instances have to be computed and derived. This requires computing the probabilities of all objects to be close to location $l$ at time $t$. This task requires utilizing probabilistic similarity search methods on uncertain spatial data to derive the probability that a given object is a member of a co-location instance. For the neighbor relation given in Definition 11, this step requires performing probabilistic range queries, using

the locations $\mathcal{L}$ as query points. As a result of the first step, an uncertain spatial database is transformed into a probabilistic co-location database such as the one depicted in Figure 12.4.

• Second, all probabilistic co-location instances need to be mined in order to detect subsets of spatial features having a statistically significantly high probability to be co-located frequently in the database. For this subtask, we can assume that a database $\mathcal{DB}$ of probabilistic co-locations such as featured in Figure 12.4 is given as a result of solving the first subtask. Given such a database, the task of finding probabilistic frequent co-locations is equivalent to the problem of probabilistic frequent itemset mining [36] in uncertain transaction data. Both problems, the probabilistic mining of spatial co-locations in uncertain spatial data as well as the problem of probabilistic frequent itemset mining in uncertain transaction data, are formally defined in the following.

12.4.1 Occurrence Probability Estimation

At each time interval $t$ we estimate the probability of a user $u$ being at a certain location $l$ based on the geographical distance $d_{u,l}$, using either the Euclidean or the Haversine distance, the latter being more precise. The probability $P_{u,l}$ that a user $u$ occurs at location $l$ at time interval $t$ is then given by

$$P_{u,l} = \frac{\rho(d_{u,l})}{\sum_{l_t \in L_{u,t}} \rho(d_{u,l_t})}, \qquad \rho(d_{u,l}) = \begin{cases} 0 & d_{u,l} \geq \tau_d \\ \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{d_{u,l}^2}{2\sigma^2}} & d_{u,l} < \tau_d \end{cases}$$

where $\rho$ is the density (PDF) of a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with $\mu = 0$ and $\sigma = \frac{\tau_d}{3}$. Thereby $\tau_d$ denotes a distance threshold parameter (e.g., 100 meters) to cut the long tail of the probability distribution to 0 for each location with a distance $d_{u,l} \geq \tau_d$. We utilize this threshold in our implementation for a simple yet effective spatial index based on grid cells of size $\tau_d$. For a latitude $\varphi$ and longitude $\lambda$ of a user or location, we determine the corresponding $(x, y)$ grid cell as $\left(\left\lfloor \frac{\varphi}{\tau_d} \right\rfloor, \left\lfloor \frac{\lambda}{\tau_d} \right\rfloor\right)$. An example is shown in Figure 12.3: for the user $U$, only locations within the $\pm 1$ neighbor cells around the user’s cell are candidates. Thereby locations that share the same cell with the user (e.g., location $C$) are always candidates. Locations like $A$ which are at least one complete cell away need not be considered, as their distance to the user must be greater than $\tau_d$.
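A compact sketch of this estimation follows. The haversine implementation and the object attributes are our own assumptions, and the grid lookup that yields the candidate locations $L_{u,t}$ is assumed to have been performed beforehand:

```python
import math

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2.0 * r * math.asin(math.sqrt(a))

def rho(d, tau_d=100.0):
    """Gaussian density with mu = 0 and sigma = tau_d / 3, cut off at tau_d."""
    if d >= tau_d:
        return 0.0
    sigma = tau_d / 3.0
    return math.exp(-(d * d) / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def occurrence_probabilities(user, candidates, tau_d=100.0):
    """P_{u,l}: normalize the densities over the candidate locations L_{u,t}
    retrieved from the +-1 grid-cell neighborhood of the user's cell."""
    dens = {loc: rho(haversine(user.lat, user.lon, loc.lat, loc.lon), tau_d)
            for loc in candidates}
    total = sum(dens.values())
    if total == 0.0:
        return {}
    return {loc: d / total for loc, d in dens.items()}
```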

Figure 12.3: Grid-based spatial index for efficient nearest-neighbor queries ($\tau_d$ = 100 m; candidate locations around user $U$: $A$ at 120 m, $B$ at 95 m, $C$ at 60 m).

12.4.2 Transformation to Probabilistic Frequent Itemset Mining

The definition of an uncertain co-location database can be mapped to the definition of an uncertain transaction database defined in [36].

Definition 13 (Uncertain Transaction Database). Let $I$ be a set of items. An uncertain transaction database is a set of probabilistic transactions. Each transaction $T = \{(i, P(i)) \mid i \in I\}$ contains a set of items, each associated with a probability. For each pair $(i \in I, P(i)) \in T$, the probability $P(i)$ describes the likelihood that $i$ is present in the probabilistic transaction $T$.

This equivalence between Definition 12 and Definition 13 allows interpreting the problem of probabilistic frequent co-location mining in uncertain spatial data as the problem of probabilistic frequent itemset mining in uncertain transaction data. This can be done by interpreting a spatial feature as an item and a probabilistic co-location instance as a transaction. Thus, solutions for the problem of probabilistic frequent itemset mining can now be applied. In fact, a large body of efficient algorithms (e.g., [188, 189, 98, 43]) has been proposed for the problem definition of [36]. Yet, a common problem of these works is the lack of a real-world application for the problem of probabilistic frequent itemset mining. We argue that probabilistic frequent itemset mining and probabilistic spatial co-location mining can bridge this gap, thus providing spatial applications. In the following subsections, we will briefly outline a mapping of existing solutions to the problem of probabilistic co-location mining. Firstly, as a baseline, a naive solution is presented, omitting the uncertainty information. Then, the exact solution of [36] is reviewed and mapped to co-location mining. Finally, the same is done for the approximate solution of [188]. For these solutions, an initial experimental run-time evaluation is presented in Section 12.5, using a real-world data set consisting of geo-tagged tweets.

Naive Probabilistic Co-Location Mining

One naive approach is to transform an uncertain database into a non-uncertain database by setting the item probabilities to 0 or 1 and then applying a traditional frequent itemset detection method. For example, probabilities less than 0.5 could be mapped to 0 and probabilities above 0.5 could be mapped to 1.

ID   Co-location
t1   (A, 0.8); (B, 0.2); (D, 0.5); (F, 1.0)
t2   (B, 0.1); (C, 0.7); (D, 1.0); (E, 1.0); (G, 0.1)
t3   (A, 0.5); (D, 0.2); (F, 0.5); (G, 1.0)
t4   (D, 0.8); (E, 0.2); (G, 0.9)
t5   (C, 1.0); (D, 0.5); (F, 0.8); (G, 1.0)
t6   (A, 1.0); (B, 0.2); (C, 0.1)

Figure 12.4: Example of an uncertain co-location database where users are co-located at a certain location (e.g., a restaurant)

However, such a transformation obviously involves a loss of information and accuracy. Furthermore, we would have no idea how confident we could be in the results. In particular, itemsets that are often associated with probabilities close to 0.5 yield a very large error in the result. Another approach is to use the probabilities associated with the itemsets in order to compute the expected support of an itemset. To avoid incurring a biased result, previous work was based on the expected support [56, 57, 117], i.e., the expected number of spatial co-locations of a group of spatial features.

Definition 14. Given a set $\mathcal{F}$ of spatial features and a database $I$ of co-location instances, the expected support $E(X)$ of a set of spatial features $X \subseteq \mathcal{F}$ is defined as $E(X) = \sum_{i \in I} P(X \subseteq i)$.

The expected support of a set of spatial features $X$ can be efficiently computed by a single scan over all co-location instances. An itemset is considered frequent if its expected support is above $minSup$. However, the latter step has the major drawback that the uncertainty information is forfeited when using the expected support approach. Thus, information is lost about the likelihood that $X$ is frequent.

Example 4. As an example, consider the database depicted in Figure 12.4, containing a set of uncertain co-location instances. Treating each co-location instance as a transaction, the expected support of the itemset $\{D\}$ is $E(\{D\}) = 3.0$. The fact that $\{D\}$ occurs for certain in one transaction, namely in $t_2$, and that there is at least one possible world where $\{D\}$ occurs in five transactions, are totally ignored when using the expected support to evaluate the frequency of an itemset. Indeed, suppose $minSup = 3$; do we call $\{D\}$ frequent? And if so, how certain can we even be that $\{D\}$ is frequent? By comparison, consider itemset $\{G\}$. This also has an expected support of 3, but its presence or absence in transactions is more certain. It turns out that the probability that $\{D\}$ is frequent is 0.7 and the probability that $\{G\}$ is frequent is 0.91. While both have the same expected support, we can be quite confident that $\{G\}$ is frequent, in contrast to $\{D\}$. An expected-support-based technique does not differentiate between the two. Concepts to evaluate the co-location instances in a probabilistic way are presented in the following.

Exact Probabilistic Support

A co-location is a frequent co-location if it occurs in at least $minSup$ co-location instances, where $minSup$ is a user-specified parameter. The number of instances of a co-location $S$ is denoted as the support $supp(S)$ of $S$. In uncertain co-location databases however, the support of a co-location is uncertain; it is defined by a discrete probability distribution function (PDF).

Definition 15 (Probabilistic Support). Let $\mathcal{DB}$ be an uncertain co-location database and let $X \subseteq \mathcal{F}$ be a set of spatial features. The support of $X$ is a probability distribution function

$$supp(X) : \mathbb{N}_0 \rightarrow [0, 1], \quad n \mapsto P(supp(X) = n)$$

that maps each non-negative integer $n$ to the probability that the support of the features $X$ equals $n$. Therefore, each set of spatial features has a frequentness probability – the probability that it is frequent. The number of possible worlds $|W|$ that need to be considered for the computation of the support probability $P_i(X) = P(supp(X) = i)$ is extremely large. In fact, we have $O(2^{|T| \cdot |I|})$ possible worlds, where $|I|$ denotes the total number of items. In the following, we show how to compute $P_i(X)$ without materializing all possible worlds [36].

Lemma 1. For an uncertain transaction database $T$ with mutually independent transactions and any $0 \leq i \leq |T|$, the support probability $P_i(X)$ can be computed as follows:

$$P_i(X) = \sum_{S \subseteq T, |S| = i} \left( \prod_{t \in S} P(X \subseteq t) \cdot \prod_{t \in T \setminus S} (1 - P(X \subseteq t)) \right) \qquad (12.1)$$

Note that the transaction subset $S \subseteq T$ contains exactly $i$ transactions.

Proof. The transaction subset $S \subseteq T$ contains $i$ transactions. The probability of a world $w_j$ where all transactions in $S$ contain $X$ and the remaining $|T \setminus S|$ transactions do not contain $X$ is $P(w_j) = \prod_{t \in S} P(X \subseteq t) \cdot \prod_{t \in T \setminus S} (1 - P(X \subseteq t))$. The sum of the probabilities of all possible worlds fulfilling the above conditions corresponds to the equation given in Definition 15.
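Instead of enumerating the exponentially many subsets $S$ in Equation (12.1), the generating functions technique of [118, 119] mentioned in Example 3 computes the entire support PDF with one polynomial multiplication per transaction, i.e. in $O(|T|^2)$ time. A minimal sketch of this idea, assuming the containment probabilities $P(X \subseteq t)$ have already been computed:

```python
def support_pdf(containment_probs):
    """P(supp(X) = n) for all n, via the generating function
    prod_t ((1 - p_t) + p_t * z); coeffs[n] is the coefficient of z^n."""
    coeffs = [1.0]
    for p in containment_probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for n, c in enumerate(coeffs):
            nxt[n] += c * (1.0 - p)   # transaction does not contain X
            nxt[n + 1] += c * p       # transaction contains X
        coeffs = nxt
    return coeffs

def frequentness_probability(containment_probs, min_sup):
    """P(supp(X) >= minSup)."""
    return sum(support_pdf(containment_probs)[min_sup:])

# Example 4: {G} appears with probabilities 0.1, 1.0, 0.9, 1.0
print(frequentness_probability([0.1, 1.0, 0.9, 1.0], 3))  # 0.91
```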

Support Probability Estimation

An approximation of the probabilistic support has been proposed in [188]. Here, the idea is to approximate the probabilistic support (cf. Definition 15) by a Poisson distribution. For each set of spatial features $X$, the single parameter $\lambda$ of the Poisson distribution $Po(\lambda)$ used to approximate the support distribution of $X$ corresponds to the expected support of $X$, which can be computed analogously to the solutions using expected support (cf. Subsection 12.4.2). Then, the probability that the support of $X$ exceeds $minSup$ can be computed by evaluating the cumulative distribution function of $Po(\lambda)$:

$$P(Po(\lambda) \geq minSup) = 1 - P(Po(\lambda) < minSup) = 1 - \sum_{i=0}^{minSup - 1} \frac{\lambda^i e^{-\lambda}}{i!}.$$
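A direct evaluation of this approximation (a sketch; $\lambda$ is the expected support from Definition 14):

```python
import math

def poisson_frequentness(lam, min_sup):
    """Approximate P(supp(X) >= minSup) via the Poisson CDF."""
    cdf = sum(lam ** i * math.exp(-lam) / math.factorial(i)
              for i in range(min_sup))
    return 1.0 - cdf
```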

12.5 Experiments

We performed our experiments on a subset of the data set discussed in the previous parts of this thesis, using 8 million geo-tagged tweets. Our data set ranges from September 2014 to March 2015 (approx. 1,954 tweets per hour) and contains tweets geo-tagged within the county of Los Angeles, USA. We decided to use Los Angeles since the tweet density is fairly high there. We discretized time into slots of one hour. Smaller timeslots would have resulted in fewer co-locations, whereas larger values would yield less interesting results, e.g., users $a$ and $b$ patronized the same restaurant within the same day. We cross-referenced this data with points of interest from OpenStreetMap4, out of which around 16 thousand were within the investigated region and of a fitting type (we excluded points like traffic lights or garbage bins). We paired each of these points of interest with all observations (tweets) within their $\tau_d$-meter neighborhood and selected those pairs of PoIs $p$ and timeslots $t$ that contained at least two distinct observations. Each of these 184,452 $(p, t)$-pairs also specifies a list of the observed users with their respective sojourn probability at $p$. For the evaluation, we implemented an algorithm based on Apriori [14] to estimate support probabilities. Figure 12.5 shows a performance evaluation against a simple enumeration of user combinations, which becomes practically unusable after surpassing only a few observations. For inputs of between 10 and 250 distinct observations, we recorded the runtime to calculate the support. As the graph shows, a full enumeration exhibits super-exponential growth after about 25 observations, while the estimation approach terminates in interactive time.

4 http://www.openstreetmap.org/

Figure 12.5: Calculation runtime comparison between full enumeration and estimation of support probability (runtime in seconds over the number of observations).

12.6 Conclusions

In this chapter we developed an efficient solution for finding probabilistic co-location patterns under uncertain locations, e.g., inaccurate observations derived from social media data. Our solution is mainly based on techniques used for probabilistic frequent itemset mining. In our experiments we showed that the proposed methods enable co-location mining on data sets significantly larger than possible using straightforward methods.

Chapter 13

Socio Textual Mapping of Events (Vision)

As we have discussed in this thesis so far, we are able to extract events from social media, form larger topics out of co-occurring trend terms, use location information to detect local events, and predict their spreading based on dissemination archetypes. In the following chapter we will extend our Event Detection process with a vision to describe a spatial region by the thoughts, ideas and emotions frequently and recently expressed by people in that region. For this purpose, we envision extracting features from geo-textual data which capture not only the vocabulary, but also current topics and current general interests. We formally define the problem of drawing a socio textual map using geo-textual data and identify the necessary steps towards this vision: We represent each region as a stream of text messages such as tweets. In each region, we maintain a feature representation of the text messages. We define a dissimilarity measure between such collections to assess the similarity between two regions. Using this measure, we utilize a metric clustering approach to obtain a social map of similar regions. We present a proof of concept by implementing the aforementioned steps with initial solutions. This proof of concept shows that an initial solution, which clusters the feature representations of regions, also yields clusters having regions that are spatially close. We theoretically explain this observation by Tobler’s first law of geography.

13.1 Introduction

Traditionally, a spatio-temporal database consists of triples (objectID, time, location), mapping objects (e.g., users) and time to a position in geo-space where the object was, is, or will be located. In recent applications, this geo-information is further enriched by textual information: for example, in geo-social networks users can check in at their current location, such as a restaurant, and publish a textual description of their experience at this location. Another example is Twitter, where many tweets contain a geographical tag corresponding to the geo-spatial position of the user. Loosely speaking, the textual content of a tweet contains information about what’s on the mind of a user: for example, a tweet may describe an experience that a user wants to share, a restaurant that a user wants to recommend, an achievement that the user wants to boast about, or simply anything the user wants to say. In this chapter, we want to generalize this concept by making the assumption that the collection of recent tweets of a region reflects what’s on the mind of that region. As an example, consider the topics “Justin Bieber” and “Greek Bankruptcy” and consider two geo-spatial regions, such as Ontario, Canada and Germany. It may turn out that in Ontario, one percent of all tweets contain the keyword “Justin”, and five percent of all tweets contain the word “Greece”. In contrast, Twitter users in Germany may use the keyword “Justin” in only 0.1 percent of their tweets, but use the keyword “Greece” in 10 percent of their tweets. Clearly, these two distributions of keywords are different. Thus, people in Ontario and people in Germany have different things that they tweet about - different things that are on their mind. We want to automatically extract a feature representation of what’s on the mind of people.

In the past, such a vision of describing a region by the text messages published in that region was entirely infeasible. Even in the example above, if we only have a few hundred tweets per day in Germany, then making significant statements about the frequency of the topic “Justin Bieber” is hard. Making conclusions about the frequency of rare topics such as “Databases” was harder still. Drawing conclusions for smaller spatial regions, such as cities or parts of cities, was completely impossible. But now, both the current trends in technology, such as smart phones, general mobile devices, stationary sensors and satellites, as well as a new user mentality of utilizing this technology to voluntarily share information, produce a huge flood of geo-textual data. Today, with 500 million tweets per day1 in addition to other sources of geo-textual data such as travel blogs and social networks, we are suddenly able to make significant conclusions about the frequency of rare terms even in small spatial regions. It’s time to use this data.

In [53], Cheng et al. proposed a framework to predict a Twitter user’s location at city level based solely on the words in the corresponding tweets. We generalize this idea to not only determine words that classify cities, but to find the latent concepts and topics that describe a generic region. Thus we extend the relationship to areas such as districts, cities, states or even artificial regions not tied to political borders (e.g., a music festival at an off-site location). Our vision is to describe geo-spatial regions by a representation of their thoughts.
Using this representation, we want to hierarchically cluster the world in terms of what's on the mind of its people.

¹https://about.twitter.com/company

We call the result a socio textual map, which we envision to be useful in a large variety of application fields:

• Research in sociology has focused on the problem of ghettos and social tensions in modern cities [87, 59, 88]. This kind of research relies on census data with "up to six race/ethnicity groups (white, black, Hispanic, Indian, Asian and other)" [88]. We claim that, especially in the 21st century, the race/ethnicity distribution of regions is not the sole source of social tension. Social tension may be caused simply by having different opinions and beliefs. With our solution, we can find spatial regions, on a city scale, whose people have significantly different interests. This may or may not be a result of ethnic differences. Our proposed approach captures many more facets of people by directly mining the interests of the crowd.

• Our research may improve the process of geocoding geo-textual data. Given a user who specified "London" as his location, the probability distribution might be shifted towards the city of London, Ontario, Canada, if the vocabulary, topics and keywords of his tweets are more similar to regions within that area. This can be done by describing the user who is to be geocoded by the set of his own tweets, obtaining a proper feature representation, and comparing this representation to candidate geo-locations.

• For targeted marketing, it may be much more interesting for a company to direct its advertisements to an area of people having a similar mindset, even if this area covers multiple political regions. For example, an upper-class car manufacturer may be looking to direct an advertisement at a wealthy city district. However, parts of the administrative city district may not actually be wealthy, or the actually wealthy population may reach beyond the city district. With our approach, the car manufacturer can target its advertisement at the mental cluster that is rooted in the wealthy city district.

13.2 Overview

In Section 13.3 we formalize our vision of a socio textual map and identify the research challenges that need to be solved in this field. In Section 13.5, we implement a first solution by solving each of the research challenges in an initial way. We show that our vision is feasible: if the necessary research steps are all solved thoroughly, then a large-scale solution to map the minds of people is a vision that may become reality.

[Figure 13.1 is shown here: text messages per region (e.g., "Hello, World!", "Good night", "I like ham.") are turned into binary feature vectors by feature selection and transformation; a regions × regions dissimilarity matrix is computed with a distance measure; and the matrix is fed into hierarchical metric clustering.]

Figure 13.1: Workflow for drawing a socio textual map.

13.3 Socio Textual Maps

In this section, we formally define our vision of a socio textual map. We present a theoretical foundation to argue why the concept of a socio textual map is feasible, and we discuss problems, open research questions and challenges. The first step is to obtain a feature representation of a potentially large and dynamic set of textual documents.

Definition 16 (Feature Selection). Let $\mathcal{S} \in \mathrm{String}^*$ be a set of text documents. A function $f: \mathrm{String}^* \mapsto \mathbb{R}^d$ is called a $d$-dimensional feature representation of $\mathcal{S}$.

The choice of the function $f$ is one of the main challenges. This function should be chosen such that two sets of text documents $\mathcal{S}_1$ and $\mathcal{S}_2$ are similar in terms of the topics, interests and experiences of these texts, if and only if $f(\mathcal{S}_1)$ is similar to $f(\mathcal{S}_2)$. Thus, a proper feature selection method should discard terms without informative content, i.e., words that appear very frequently. A common approach is referred to as term frequency–inverse document frequency (TF-IDF), as introduced in Section 3.5 of Part I of this thesis. However, TF-IDF does not provide information about the importance of a keyword in terms of describing the mental topic of the user generating the text. This is the challenge of feature extraction for socio textual mapping.

In [61] a concept is presented that uses an entropy measure to select those features that carry the most information. Taking this concept to feature selection for geo-textual data, the idea is to select terms which are highly frequent in only a few regions, and extremely rare in others. Such local trends may be extremely useful to distinguish regions at a local scale, but may become useless for other scales and other areas. However, it seems intuitive that a proper approach should also include global trends that most of the world has (to different degrees) on its mind. Next, we apply the function $f$ to geo-spatial regions.

Definition 17 (Feature Transformation). Let $\mathcal{W}$ denote a hierarchical partitioning of the geo-spatial region representing the surface of the earth. Each level of $\mathcal{W}$ corresponds to a geographic scale, i.e., continental level, country level, state level and city level. For each region $w \in \mathcal{W}$, the function $\mathrm{text}(w): \mathcal{W} \mapsto \mathrm{String}^*$ returns the set of text documents that are associated with $w$.

The first step of our workflow in Figure 13.1 illustrates this step. In the top-left of Figure 13.1, we consider a spatial region $w$ corresponding to Bavaria, Germany. We take the set of text messages text(Bavaria) and apply a (in this example binary) feature transformation. The same feature transformation is applied to all other German states. In the next step, we need to assess the pair-wise dissimilarity between these regions, using standard vector distance functions. Using the resulting dissimilarity matrix, exemplarily depicted in the lower-left of Figure 13.1, we can apply a metric clustering approach to find groups of similar regions. The choice and the parameterization of this clustering approach are another challenging step. For instance, the clustering approach needs to account for different geographic scales: regions on country level should be allowed much more freedom to be considered similar than regions on city level.
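To make Definitions 16 and 17 concrete, the following minimal sketch walks through the workflow of Figure 13.1 up to the dissimilarity matrix. Python is used here purely for illustration; the regions, documents, binary feature function, and Hamming distance are invented stand-ins under simplifying assumptions, not the prescribed method.

```python
from itertools import combinations

# Hypothetical regions with their associated text documents (Definition 17):
# text(w) returns the set of documents for region w.
regions = {
    "Bavaria": ["Hello, World!", "Good night"],
    "Saxony":  ["Good night", "I like ham."],
    "Hesse":   ["I like ham.", "Hello, World!"],
}

def feature(documents, vocabulary):
    """A d-dimensional binary feature representation f (Definition 16):
    component i is 1 iff the i-th vocabulary term occurs in the documents."""
    words = {w.lower().strip(".,!?") for doc in documents for w in doc.split()}
    return [1 if term in words else 0 for term in vocabulary]

# A global vocabulary defines the dimensionality d of the feature space.
vocabulary = sorted({w.lower().strip(".,!?")
                     for docs in regions.values()
                     for doc in docs
                     for w in doc.split()})

vectors = {r: feature(docs, vocabulary) for r, docs in regions.items()}

def hamming(u, v):
    """Dissimilarity between two binary feature vectors (Hamming distance)."""
    return sum(a != b for a, b in zip(u, v))

# The regions x regions dissimilarity matrix of Figure 13.1.
for a, b in combinations(regions, 2):
    print(a, b, hamming(vectors[a], vectors[b]))
```

The resulting matrix is exactly the input that the metric clustering step of Section 13.5.2 expects.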

13.4 Theoretical Foundation

There are two main theoretical reasons why finding a socio textual map is viable and feasible. The first is the law of large numbers, and the second is Tobler's first law of geography. The Law of Large Numbers states that, for a random variable, the empirical probability approaches the actual probability as more trials are performed. Applied to our problem, we can treat the topic of a tweet as a random variable. For a sufficiently large number of tweets drawn from a region, the law of large numbers states that the fraction of tweets having a specific topic converges to the true fraction of people having this topic on their mind. Two implicit assumptions are made here. First, we assume that the content of a tweet correlates with what is on the mind of the tweeter; second, we assume that sample tweets are drawn independently without any bias. The second assumption is critical, as some people are "more vocal" than others, thus tweeting more often than others and thus overrepresenting the topics on their minds. However, it was shown in [177] that the law of large numbers still holds as long as there is no user who dominates all other users in terms of tweet frequency, and as long as the error is unbiased, i.e., the frequency of tweets of a user is independent of the topics on his mind. The flood of daily tweets and other geo-textual data sets allows us to exploit the law of large numbers to obtain a representative sample of the minds of people in a region. Tobler's First Law of Geography states that "everything is related to everything else, but near things are more related than distant things" [182] and is one of the key reasons why "spatial is special" [121]. It is the reason why we expect that a clustering of the minds of regions results in a clustering that is also spatially correlated. And it is the reason why we envision that we can obtain a socio textual map that captures more than lingual vocabulary, but also captures topics and trends that people think about.
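As a small, hedged illustration of the law-of-large-numbers argument (the topic distribution and sample sizes below are invented for demonstration): we repeatedly draw tweets from a fixed topic distribution and watch the empirical topic fraction converge to the assumed true fraction.

```python
import random

random.seed(42)  # reproducible demonstration

# Assumed ground truth: the fraction of people in a region with each topic
# "on their mind"; these numbers are purely illustrative.
true_dist = {"Justin": 0.01, "Greece": 0.05, "other": 0.94}
topics, weights = list(true_dist), list(true_dist.values())

for n in (100, 10_000, 1_000_000):
    sample = random.choices(topics, weights=weights, k=n)  # n simulated tweets
    empirical = sample.count("Greece") / n
    print(f"n={n:>9,}: empirical P(Greece) = {empirical:.4f} "
          f"(true: {true_dist['Greece']})")
```

With 100 tweets the estimate fluctuates heavily; with a million tweets it is stable, which is exactly why the flood of geo-textual data makes the vision feasible.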

13.5 Proof of Concept

For a proof of concept we use our Twitter data set as described previously in Section 3.10. On each tweet's text we apply a standard tokenizer to extract word tokens (see Section 3.6 for more details on the process of tokenization). For each geo-coordinate we use the administrative boundary lookup library that we previously introduced in Section 6.6 of Part II of this thesis to obtain the corresponding city. In Section 13.5.1 we present an initial solution to derive features based on the relevance of terms to a city. In Section 13.5.2 we present different clustering results which show that nearby cities are generally more related than cities farther away from each other.
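A minimal sketch of this preprocessing step, assuming a deliberately simplistic regular-expression tokenizer and a hypothetical placeholder city_of(lat, lon) in place of the real administrative boundary lookup of Section 6.6:

```python
import re

def tokenize(text):
    """Simplistic stand-in for the standard tokenizer of Section 3.6."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def city_of(lat, lon):
    """Hypothetical placeholder: a real implementation would resolve the
    coordinate to a city via the boundary lookup library of Section 6.6."""
    return "London" if 51.3 < lat < 51.7 and -0.5 < lon < 0.3 else "unknown"

# Each raw tweet becomes the tuple (c_t, W_t) used in Section 13.5.1.
tweet = {"text": "Good night from Camden!", "lat": 51.54, "lon": -0.14}
c_t, W_t = city_of(tweet["lat"], tweet["lon"]), tokenize(tweet["text"])
print(c_t, W_t)  # prints: London and the token set (set order varies)
```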

13.5.1 Feature Selection and Transformation

Each tweet $t \in S$ is represented as a tuple $(c_t, W_t)$, where $c_t$ denotes the city corresponding to the tweet's location and $W_t = \{w_{t,1}, \ldots, w_{t,n}\}$ denotes the set of extracted word tokens of the text of tweet $t$. After each epoch (say, one hour), we want to obtain a set of the $k$ most "representative" features for each city $c$. Using only the top-$k$ most frequent terms would result in a mere clustering of languages, rather than a clustering of people's thoughts, because regions are usually dominated by the English language and thus stop words like the, and, or, etc. would cause regions of actually different mind sets to be indistinguishable from each other. Let $\mathrm{tf}_w$ and $\mathrm{tf}_{w,c}$ be the frequency of word token $w$ overall and the frequency of mentions of $w$ within a city $c$, respectively.

We then define a score of a word token w as follows:

$$\mathrm{score}(w) = \frac{P(w \mid c) - P(w)}{\sqrt{\mathrm{tf}_w}} \cdot \mathrm{tf}_{w,c}$$

where $P(w)$ denotes the probability of obtaining the word token $w$ regardless of its location, and $P(w \mid c)$ the probability of obtaining word token $w$ given a city $c$, respectively.
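A small sketch of this scoring, assuming the probabilities are estimated by simple relative frequencies, $P(w) = \mathrm{tf}_w / N$ and $P(w \mid c) = \mathrm{tf}_{w,c} / N_c$; the counts below are invented for illustration.

```python
from math import sqrt

def score(tf_w, tf_w_c, n_total, n_city):
    """score(w) = (P(w|c) - P(w)) / sqrt(tf_w) * tf_{w,c}, with the
    probabilities estimated by relative frequencies (an assumption here)."""
    p_w = tf_w / n_total     # P(w): global probability of token w
    p_w_c = tf_w_c / n_city  # P(w|c): probability of w within city c
    return (p_w_c - p_w) / sqrt(tf_w) * tf_w_c

# Invented counts: a globally common stop word vs. a locally popular term.
print(score(tf_w=50_000, tf_w_c=480, n_total=1_000_000, n_city=10_000))
print(score(tf_w=600,    tf_w_c=350, n_total=1_000_000, n_city=10_000))
```

The stop word scores near zero while the locally popular but globally rare term scores high, which is the desired behavior for selecting the top-k representative features of a city.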

13.5.2 Clustering

To respect the hierarchical nature of our data we utilize an agglomerative clustering approach. We first construct a similarity matrix of size $n \times n$, where $n$ is the number of all cities. The cell $c_{i,j}$ for the $i$-th and $j$-th city is set to the Jaccard similarity coefficient of the corresponding feature sets $F_i$ and $F_j$, i.e., $c_{i,j} = \frac{|F_i \cap F_j|}{|F_i \cup F_j|}$. As shown in Figure 13.2a, we are able to determine political country boundaries. Each circle represents a city having at least 150 tweets in a time frame of 60 minutes. On a finer geographical scale, we obtain more detailed micro-clusters within countries in Europe, such as Great Britain in Figure 13.2c, and France and Spain in Figure 13.2d. On an even more detailed scale, we can find clusters within a city, such as the city of London (England) in Figure 13.2b. These different clusterings are obtained by varying the distance threshold parameter of the clustering algorithm. It is important to note that the selected features do not consider spatial distances. Nevertheless, the clusters that we obtain, in all settings of Figure 13.2, are, more or less, grouped in geo-space. We attribute this result to Tobler's first law of geography, as presumed in Section 13.4. The main conclusion drawn from this evaluation is that it is possible to invert Tobler's law: we show the counter-direction, that regions having similar tweets are also more likely co-located in geo-space. For instance, in the experiment of Figure 13.2e we only consider tweets that are flagged to be of Spanish language. We see that cities in Spain form a cluster that, except for a few outliers, represents all of Spain and Spain only, and cities in Mexico form a cluster that, for the most part, represents all of Mexico and Mexico only. Note that this experiment also shows some cities in the United States, as these may also have more than 150 tweets in Spanish language over the considered time period. This observation is a proof of concept for the vision of socio textual mapping using user data to describe the mind of the people of a region.
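A hedged sketch of this clustering step using SciPy's stock single-link implementation; the per-city feature sets are invented, and the cut threshold t is an arbitrary example value standing in for the manually chosen scale parameter.

```python
from itertools import combinations
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Invented top-k feature sets per city (in practice: the best-scored tokens).
features = {
    "London":  {"tube", "rain", "brexit", "pub"},
    "Leeds":   {"rain", "pub", "football", "tube"},
    "Madrid":  {"sol", "tapas", "futbol", "siesta"},
    "Sevilla": {"sol", "tapas", "feria", "siesta"},
}
cities = list(features)

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|: the complement of the Jaccard coefficient."""
    return 1.0 - len(a & b) / len(a | b)

# Symmetric n x n distance matrix over all city pairs.
n = len(cities)
dist = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    dist[i, j] = dist[j, i] = jaccard_distance(features[cities[i]],
                                               features[cities[j]])

# Agglomerative single-link clustering; cutting the dendrogram at different
# distance thresholds yields the different geographic scales of Figure 13.2.
dendrogram = linkage(squareform(dist), method="single")
labels = fcluster(dendrogram, t=0.7, criterion="distance")
print(dict(zip(cities, labels)))  # the UK cities and the Spanish cities split
```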

13.6 Challenges

Now that the concept is defined, theoretically founded, and an empirical proof of concept is provided, we identify some of the challenges that need to be addressed by the research community for this vision to become a success.

[Figure 13.2 with five panels: (a) clustering at a high level yields country boundaries; (b) a language-independent micro-cluster at city level (e.g., London); (c) language-independent cluster results at country level; (d) clusters within Spain and France; (e) separation of the same language (Spanish) into distinct clusters.]

Figure 13.2: Visual clustering results.

Feature Representation: We need to find a feature representation of the tweets of a region that captures the mind of the people of that region, rather than their vocabulary. In our proof of concept, we used the most frequently used terms in a region. Clearly, any region in the US or any English speaking country will frequently use stop words such as "the" and "and". While our proof of concept in Section 13.5 removes these stop words, it is still an open question whether the remaining keywords are representative of the interests and thoughts of users. For a more desirable approach, which better reflects the mentality of a user rather than his vocabulary, one could look into solutions for temporal-textual trend mining as presented in [42, 164]. This way we could directly count the mentions of topics that are, globally, on the mind of people.

Distance Measure: Given the occurrences of an appropriate set of keywords, or any other means of representing the mind of a region, we need adequate solutions to measure the similarity between regions. In our experiments, we simply used the Jaccard index to measure the similarity between the selected feature sets of two regions (as discussed in Section 13.5.1). We need a distance measure which also takes into account the geographic scale of the considered regions: two small parts of a city should penalize minor differences much more drastically. A good distance measure may also consider the spatial distance of the regions, thus incorporating Tobler's first law of geography in the distance function.

Metric Clustering: We use an agglomerative single-link clustering, i.e., a single-link clustering that returns a dendrogram of different clusterings for different single-link-distance parameter values. This allowed us to manually pick single-link distance thresholds for different geographic scales. For example, we manually picked a proper single-link-distance value to obtain a good clustering on country level, and a different value for a good clustering on city level. Our proof of concept showed that this parameter choice led to solid results which were highly correlated with political (and not necessarily lingual) borders. Yet, an approach is needed that automatically adapts its parameters for different scale levels. Also, it would be great to compare different levels, such as large cities and small countries.

Other Data Types: For our socio textual maps, we exclusively used geo-textual data to describe the mentality of a spatial region. We made this choice since large sets of geo-tagged and user-tagged data are available publicly. But this is not the end of the vision. Other data types can be used to estimate the mindset of a region, for instance time series of user activity, the content of published multimedia data, or attributes of the users of a region (e.g., age, nationality, etc.). We envision a feature representation that captures all available social information, in order to plug this feature representation into the framework of Figure 13.1.

Independence of Languages: We see that lingual borders also imply mental borders. Ultimately, we envision a solution that is completely independent of languages. Thus, we want a region in Japan to be able to cluster with a region in Spain, if and only if both regions have the same topics in mind, but use different vocabulary (and letters) to describe the thoughts on their mind.
Thus, we envision feature representations that are not merely based on vocabulary, but more precisely capture the mentality of people.

Part IV

Conclusion


In this thesis, we have discussed several solutions to handle the ever increasing amount of live textual data that arises from social media, blogs and daily news. The core of these solutions is our efficient and scalable Event Detection discussed in Part II (basic concepts were introduced in Part I). User-generated textual social media data, and in particular tweets, are complex and unstructured. They contain spelling errors, irony, misunderstandings, abbreviations, locations, images and hyperlinks. In addition, computer-generated content, produced by advertising campaigns, teen idol mass followings or spammers, makes this data even more difficult to overlook and analyse. Thus, a universally accepted gold standard of all recent trends and events does not exist yet; even humans often disagree on whether or not a recent happening is an event. Due to this fuzziness, we define an event based on quantitative data and evaluate it with well understood statistics: we call an observation an event as soon as its actual occurrence frequency X exceeds its expected frequency E[X] by τ standard deviations σ, such that X > τ · σ(X) + E[X]. Thereby, we have modeled the expected frequency E[X] with an exponentially weighted moving average (EWMA). This model allows intuitive adjustments of the EWMA learning rate (how important historical data is for the prediction of the expected frequency) as well as adjustments of the sensitivity threshold τ (how significant a deviation needs to be). This control over the sensitivity and the amount of reported event instances enables professional corporate use cases (e.g., trend and technology scouts) as well as casual use cases such as simply consuming breaking news. In our Event Detection we go beyond tracking single words from data sets with heavy restrictions, such as only hashtags or only small fractions of the data that include user-predefined keywords: our Event Detection does not have such restrictions and tracks everything: single words, word co-occurrences (word pairs that were mentioned together in the same tweet or article), location data (overlapping grids and hierarchical administrative boundaries), as well as location-word combinations to discover locally significant events. This allows us to report significant local events as well as significant global events at the same time, while accounting for differences in the adoption of social media across cultures (e.g., London produces more tweets than all of Germany). To produce a meaningful report, we aggregate event keywords into larger topics by hierarchical clustering of our event co-occurrences. All this needs to be done in real time with minimal delay, to be able to report events as they happen. Because we do not restrict the data and also track location as well as co-occurrences, naive statistical tracking becomes infeasible. We therefore proposed an efficient hashing-based probabilistic approach that is easily able to handle the full Twitter stream on a standard computer without special parallelization. If still more performance is needed, our algorithm is ready to be implemented on a distributed cluster setup. Our approach takes advantage of the long-tail distribution of textual data, which means that the majority of words are rare. As we are looking for significant events, we are not interested in rare observations.
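A minimal sketch of this significance test, assuming a fixed learning rate alpha, a short warm-up period, and invented hourly term counts; the full algorithm of Part II additionally uses hashed counters and bias terms for rare observations.

```python
class EwmaSignificance:
    """Flags an observation as an event when X > tau * sigma(X) + E[X];
    E[X] and the variance are tracked with exponentially weighted moving
    averages (cf. the incremental EWMA/EWMVar update of Finch [68])."""

    def __init__(self, alpha=0.1, tau=3.0, warmup=3):
        self.alpha, self.tau, self.warmup = alpha, tau, warmup
        self.ewma, self.ewmvar, self.seen = 0.0, 0.0, 0

    def observe(self, x):
        # Test against the current estimates before updating them.
        sigma = self.ewmvar ** 0.5
        is_event = self.seen >= self.warmup and x > self.tau * sigma + self.ewma
        # Incremental update of moving mean and moving variance.
        delta = x - self.ewma
        self.ewma += self.alpha * delta
        self.ewmvar = (1.0 - self.alpha) * (self.ewmvar + self.alpha * delta**2)
        self.seen += 1
        return is_event

detector = EwmaSignificance()
counts = [5, 6, 5, 7, 6, 5, 6, 40, 6]  # invented hourly counts of one term
print([detector.observe(x) for x in counts])  # only the burst of 40 is flagged
```

A small learning rate makes the detector remember a long history (stable expectations, slower adaptation), while a larger tau reports only the most drastic deviations; these are exactly the two knobs discussed above.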

In the case of Twitter, we could observe that ≈99% of the words typically appear less than 5 times within 1-hour windows. As we neither discard words nor restrict our data set beforehand, we enable a wide variety of algorithms that run after events have been efficiently detected and that would otherwise not be feasible on the raw text data. We discussed examples of such algorithms in Part III of this thesis: we showed how we are able to identify trend archetypes based on their corresponding dissemination patterns, use geo-social co-location mining to find social groups that are frequently found at the same location, and use events to obtain a social map of regions that share similar "thoughts" of users expressed by their tweets.

Acknowledgements

This thesis would not have been possible without the support of my colleagues and co-authors of the publications included in this work. First of all, I would like to thank my thesis supervisor, Prof. Dr. Hans-Peter Kriegel, for his great support and his engagement throughout my employment as a scientific assistant at Ludwig-Maximilians-Universität München. I would like to thank him for organizing the industry funding and at the same time enabling an unconstrained working atmosphere, so that I could freely work on the research topics that interested me the most. Secondly, I would like to thank Prof. Dr. Michael Gertz, who agreed to review my thesis despite his high workload. Next, I want to thank Dr. Erich Schubert for his outstanding support and guidance. The brainstorming sessions with him inspired me and pushed my research to the limits. My first publication with him was accepted at KDD'14 and paved my way towards the Event Detection research topic. Without him, this publication would most likely not have been accepted there. Furthermore, I would like to thank Dr. Arthur Zimek for his great work regarding Outlier Detection, which was one important part of the final Event Detection algorithm. Additionally, I want to thank Dr. Andreas Züfle and Dr. Tobias Emrich. Both provided great insights and guidance regarding the research community. I had a lot of fun with them developing great ideas and publications until late in the night, just before the deadline. I also want to thank Dr. Johannes Niedermayer, Prof. Dr. Matthias Schubert and Klaus Arthur Schmid for the valuable discussions we had. I want to thank my fellow PhD candidate Markus Mauder in particular, as he motivated me towards the end of my research career and always had time when I needed to talk about my professional career and job opportunities and rejections. I want to thank Franz Krojer and Susanne Grienberger for making university life possible with technical and administrative support. Further, I want to thank all co-authors of my publications for the great collaboration. Last but not least, I would like to thank Janina Rothfuss for supporting me particularly in the sad moments of rejected publications. Thank you all! I had a great time!

Bibliography

[1] Data never sleeps. https://www.domo.com/blog/2016/06/data-never-sleeps-4-0. Accessed: 2016-10-09.

[2] What happens in one internet minute. http://www.excelacom.com/resources/blog/2016-update-what-happens-in-one-internet-minute. Accessed: 2016-10-09.

[3] Abdelhaq, H., Gertz, M., and Sengstock, C. Spatio-temporal characteristics of bursty words in Twitter streams. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS), Orlando, FL (2013), pp. 194–203.

[4] Abdelhaq, H., Sengstock, C., and Gertz, M. EvenTweet: Online localized event detection from Twitter. Proceedings of the VLDB Endowment 6, 12 (2013), 1326–1329.

[5] Abe, N., Zadrozny, B., and Langford, J. Outlier detection by active learning. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA (2006), pp. 504–509.

[6] Achrekar, H., Gandhe, A., Lazarus, R., Yu, S.-H., and Liu, B. Predicting flu trends using Twitter data. In Proc. INFOCOMW (2011).

[7] Achtert, E., Hettab, A., Kriegel, H.-P., Schubert, E., and Zimek, A. Spatial outlier detection: Data, algorithms, visualizations. In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN (2011), pp. 512–516.

[8] Achtert, E., Kriegel, H.-P., Schubert, E., and Zimek, A. Interactive data mining with 3D-Parallel-Coordinate-Trees. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York City, NY (2013), pp. 1009–1012.

[9] Aggarwal, C., Li, Y., Wang, J., and Wang, J. Frequent pattern mining with uncertain data. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France (2009), pp. 29–38.

[10] Aggarwal, C. C. A framework for diagnosing changes in evolving data streams. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), San Diego, CA (2003), pp. 575–586.

[11] Aggarwal, C. C. Outlier Analysis. Springer, 2013.

[12] Aggarwal, C. C., Han, J., Wang, J., and Yu, P. S. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), Berlin, Germany (2003), pp. 81–92.

[13] Aggarwal, C. C., and Yu, P. S. Outlier detection with uncertain data. In Proceedings of the 8th SIAM International Conference on Data Mining (SDM), Atlanta, GA (2008), pp. 483–493.

[14] Agrawal, R., Imieliński, T., and Swami, A. Mining association rules between sets of items in large databases. In ACM SIGMOD Record (1993), vol. 22, ACM, pp. 207–216.

[15] Akoglu, L., and Faloutsos, C. Anomaly, event, and fraud detection in large network datasets. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM), Rome, Italy (2013), pp. 773–774.

[16] Akoglu, L., McGlohon, M., and Faloutsos, C. OddBall: spotting anomalies in weighted graphs. In Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hyderabad, India (2010), pp. 410–421.

[17] Akoglu, L., Tong, H., and Koutra, D. Graph-based anomaly detection and description: A survey. Data Mining and Knowledge Discovery 29, 3 (2015), 626–688.

[18] Akoglu, L., Tong, H., Vreeken, J., and Faloutsos, C. Fast and reliable anomaly detection in categorical data. In Proceedings of the 21st ACM Conference on Information and Knowledge Management (CIKM), Maui, HI (2012), pp. 415–424.

[19] Allan, J., Lavrenko, V., and Jin, H. First story detection in TDT is hard. In Proceedings of the Ninth International Conference on Information and Knowledge Management (2000), ACM, pp. 374–381.

[20] Allan, J., Lavrenko, V., Malin, D., and Swan, R. Detections, bounds, and timelines: UMass and TDT-3. In Proceedings of Topic Detection and Tracking (TDT-3) (2000), pp. 167–174.

[21] Alvanaki, F., Michel, S., Ramamritham, K., and Weikum, G. See what's enBlogue: real-time emergent topic identification in social media. In Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin, Germany (2012), pp. 336–347.

[22] Angiulli, F., and Fassetti, F. DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 3, 1 (2009), 4:1–57.

[23] Angiulli, F., and Fassetti, F. Distance-based outlier queries in data streams: the novel task and algorithms. Data Mining and Knowledge Discovery 20, 2 (2010), 290–324.

[24] Angiulli, F., and Pizzuti, C. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland (2002), pp. 15–26.

[25] Anselin, L. Local indicators of spatial association–LISA. Geographical Analysis 27, 2 (1995), 93–115.

[26] Appice, A., Ciampi, A., and Malerba, D. Summarizing numeric spatial data streams by trend cluster discovery. Data Mining and Knowledge Discovery 29, 1 (2015), 84–136.

[27] Aramaki, E., Maskawa, S., and Morita, M. Twitter catches the flu: detecting influenza epidemics using Twitter. In Proc. EMNLP (2011).

[28] Assent, I., Kranen, P., Baldauf, C., and Seidl, T. AnyOut: Anytime outlier detection on streaming data. In Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA), Busan, South Korea (2012), pp. 228–242.

[29] Babcock, B., Datar, M., Motwani, R., and O'Callaghan, L. Maintaining variance and k-medians over data stream windows. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, CA (2003), pp. 234–243.

[30] Babcock, B., and Olston, C. Distributed top-k monitoring. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), San Diego, CA (2003), pp. 28–39.

[31] Bansal, N., and Koudas, N. Blogscope: a system for online analysis of high volume text streams. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria (2007), pp. 1410–1413.

[32] Barnett, V., and Lewis, T. Outliers in Statistical Data, 3rd ed. John Wiley & Sons, 1994.

[33] Bay, S. D., and Schwabacher, M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC (2003), pp. 29–38.

[34] Berge, C. Graphs and hypergraphs, vol. 6. Elsevier, 1976.

[35] Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., and Züfle, A. Probabilistic frequent pattern growth for itemset mining in uncertain databases. In Scientific and Statistical Database Management. 2012, pp. 38–55.

[36] Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., and Züfle, A. Probabilistic frequent itemset mining in uncertain databases. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France (2009), pp. 119– 128.

[37] Bernecker, T., Kriegel, H.-P., Renz, M., and Züfle, A. Hot item detection in uncertain data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand (2009), pp. 673–680.

[38] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422–426.

[39] Breunig, M. M., Kriegel, H.-P., Ng, R., and Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX (2000), pp. 93–104.

[40] Broder, A. Z. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997 (1997), pp. 21–29.

[41] Budak, C., Georgiou, T., Agrawal, D., and El Abbadi, A. GeoScope: Online detection of geo-correlated information trends in social networks. Proceedings of the VLDB Endowment 7, 4 (2013), 229–240.

[42] Budak, C., Georgiou, T., Agrawal, D., and El Abbadi, A. GeoScope: Online detection of geo-correlated information trends in social networks. Proceedings of the VLDB Endowment 7, 4 (2013).

[43] Calders, T., Garboni, C., and Goethals, B. Approximation of frequentness probability of itemsets in uncertain data. In Data Mining (ICDM), 2010 IEEE 10th International Conference on (2010), IEEE, pp. 749–754.

[44] Campos, G. O., Zimek, A., Sander, J., Campello, R. J. G. B., Micenková, B., Schubert, E., Assent, I., and Houle, M. E. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Mining and Knowledge Discovery (2015).

[45] Carroll, J. D., and Chang, J.-J. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283–319.

[46] Cataldi, M., Caro, L. D., and Schifanella, C. Emerging topic detection on Twitter based on temporal and social terms evaluation. In Proceedings of the 10th International Workshop on Multimedia Data Mining (MDM/KDD), Las Vegas, NV (2010), p. 4.

[47] Chan, T. M. Approximate nearest neighbor queries revisited. Discrete & Computational Geometry 20, 3 (1998), 359–373.

[48] Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys 41, 3 (2009), Article 15, 1–58.

[49] Chang, J., Boyd-Graber, J. L., Gerrish, S., Wang, C., and Blei, D. M. Reading tea leaves: How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), Vancouver, BC (2009), vol. 22, pp. 288–296.

[50] Chawla, S., and Sun, P. SLOM: A new measure for local spatial outliers. Knowledge and Information Systems (KAIS) 9, 4 (2006), 412–429.

[51] Chen, C. CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology 57, 3 (2006), 359–377.

[52] Chen, F., Lu, C.-T., and Boedihardjo, A. P. GLS-SOD: A generalized local statistical approach for spatial outlier detection. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC (2010), pp. 1069–1078.

[53] Cheng, Z., Caverlee, J., and Lee, K. You are where you tweet: a content-based approach to geo-locating twitter users. In CIKM (2010), ACM, pp. 759–768.

[54] Cho, E., Myers, S. A., and Leskovec, J. Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (2011), ACM, pp. 1082–1090.

[55] Chow, C.-Y., Mokbel, M. F., and Aref, W. G. Casper*: Query processing for location services without compromising privacy. ACM TODS 34, 4 (2009), 24.

[56] Chui, C.-K., and Kao, B. A decremental approach for mining frequent itemsets from uncertain data. In Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan (2008), pp. 64–75.

[57] Chui, C.-K., Kao, B., and Hung, E. Mining frequent itemsets from uncertain data. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Nanjing, China (2007), pp. 47–58.

[58] Cormode, G., and Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 1 (2005), 58–75.

[59] Cutler, D. M., and Glaeser, E. L. Are ghettos good or bad? Tech. rep., National Bureau of Economic Research, 1995.

[60] Dang, X. H., Assent, I., Ng, R. T., Zimek, A., and Schubert, E. Discriminative features for identifying and interpreting outliers. In Proceedings of the 30th International Conference on Data Engineering (ICDE), Chicago, IL (2014), pp. 88–99.

[61] Dash, M., and Koot, P. W. Feature selection for clustering. In Encyclopedia of database systems. 2009, pp. 1119–1125.

[62] Datar, M., Gionis, A., Indyk, P., and Motwani, R. Maintaining stream statistics over sliding windows. SIAM Journal on Computing 31, 6 (2002), 1794–1813.

[63] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.

[64] Deligiannakis, A., Kotidis, Y., and Roussopoulos, N. Compressing historical information in sensor networks. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Paris, France (2004), pp. 527–538.

[65] Earle, P. S., Bowden, D. C., and Guy, M. Twitter earthquake detection: earthquake monitoring in a social world. Annals of Geophysics 54, 6 (2012).

[66] Emmott, A. F., Das, S., Dietterich, T., Fern, A., and Wong, W.-K. Systematic construction of anomaly detection benchmarks from real data. In Workshop on Outlier Detection and Description, held in conjunction with the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL (2013), pp. 16–21.

[67] Emrich, T., Hofer, O., Kolb, A., Niedermayer, J., Seiler, N., and Weiler, M. Video route. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (2015), ACM, p. 95.

[68] Finch, T. Incremental calculation of weighted mean and variance. Tech. rep., University of Cambridge, 2009.

[69] Färber, I., Günnemann, S., Kriegel, H.-P., Kröger, P., Müller, E., Schubert, E., Seidl, T., and Zimek, A. On using class-labels in evaluation of clusterings. In MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC (2010).

[70] Gao, J., Liang, F., Fan, W., Wang, C., Sun, Y., and Han, J. On community outliers and their efficient detection in information networks. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC (2010), pp. 813–822.

[71] Gao, J., and Tan, P.-N. Converting output scores from outlier detection algorithms into probability estimates. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China (2006), pp. 212–221.

[72] Georgiadis, D., Kontaki, M., Gounaris, A., Papadopoulos, A. N., Tsichlas, K., and Manolopoulos, Y. Continuous outlier detection in data streams: an extensible framework and state-of-the-art algorithms. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), New York City, NY (2013), pp. 1061–1064.

[73] Graf, F., Kriegel, H.-P., and Weiler, M. Robust segmentation of relevant regions in low depth of field images. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Brussels, Belgium (Sept. 2011).

[74] Graf, F., Kriegel, H.-P., and Weiler, M. Robust image segmentation in low depth of field images. Computing Research Repository (2013).

[75] Grubbs, F. E. Procedures for detecting outlying observations in samples. Technometrics 11, 1 (1969), 1–21.

[76] Guzman, J., and Poblete, B. On-line relevant anomaly detection in the Twitter stream: an efficient bursty keyword detection model. In Workshop on Outlier Detection and Description, held in conjunction with the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL (2013), pp. 31–39.

[77] Hadi, A. S., Rahmatullah Imon, A. H. M., and Werner, M. Detection of outliers. Wiley Interdisciplinary Reviews: Computational Statistics 1, 1 (2009), 57–70.

[78] Han, J., Pei, J., and Yin, Y. Mining frequent patterns without candidate generation. Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX 29 (2000), 1–12.

[79] Harshman, R. A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics 16, 1 (1970), 84.

[80] Hermida, A., Lewis, S. C., and Zamith, R. Sourcing the Arab Spring: a case study of Andy Carvin's sources on Twitter during the Tunisian and Egyptian revolutions. Journal of Computer-Mediated Communication 19, 3 (2014), 479–499.

[81] Huang, Y., Pei, J., and Xiong, H. Mining co-location patterns with rare events from spatial data sets. In Geoinformatica (2006), vol. 10, pp. 239–260.

[82] Huang, Y., Shekhar, S., and Xiong, H. Discovering co-location patterns from spatial data sets: A general approach. In IEEE Transactions on Knowledge and Data Engineering (2004), vol. 16, pp. 1472–1485.

[83] Huang, Y., Zhang, L., and Zhang, P. Finding sequential patterns from a massive number of spatio-temporal events. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM), Bethesda, MD (2006).

[84] Indyk, P., and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th annual ACM symposium on Theory of computing (STOC), Dallas, TX (1998), pp. 604–613.

[85] Jaccard, P. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles 37 (1901), 241–272.

[86] Jagadish, H. V., Koudas, N., and Muthukrishnan, S. Mining deviants in a time series database. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), Edinburgh, Scotland (1999), pp. 102–113.

[87] Jargowsky, P. A. Ghetto poverty among blacks in the 1980s. Journal of Policy Analysis and Management 13, 2 (1994), 288–310.

[88] Jargowsky, P. A. Poverty and place: Ghettos, barrios, and the American city. Russell Sage Foundation, 1997.

[89] Jin, W., Tung, A., and Han, J. Mining top-n local outliers in large databases. In Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Francisco, CA (2001), pp. 293–298.

[90] Jin, W., Tung, A. K. H., Han, J., and Wang, W. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore (2006), pp. 577–593.

[91] Keller, F., Müller, E., and Böhm, K. HiCS: high contrast subspaces for density-based outlier ranking. In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC (2012), pp. 1037–1048.

[92] Kim, H.-G., Lee, S., and Kyeong, S. Discovering hot topics using Twitter streaming data social topic detection and geographic clustering. In Proc. ASONAM (2013).

[93] Kleinberg, J. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7, 4 (2003), 373–397.

[94] Knorr, E. M., and Ng, R. T. A unified notion of outliers: Properties and computation. In Proceedings of the 3rd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, CA (1997), pp. 219–222.

[95] Kollios, G., Gunopulos, D., Koudas, N., and Berchthold, S. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE Transactions on Knowledge and Data Engineering 15, 5 (2003), 1170–1187.

[96] Kou, Y., Lu, C.-T., and Chen, D. Spatial weighted outlier detection. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM), Bethesda, MD (2006).

[97] Kouloumpis, E., Wilson, T., and Moore, J. D. Twitter sentiment analysis: The good the bad and the OMG! ICWSM 11 (2011), 538–541.

[98] Kozawa, Y., Amagasa, T., and Kitagawa, H. GPU acceleration of probabilistic frequent itemset mining from uncertain databases. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (2012), pp. 892–901.

[99] Kranen, P., Kremer, H., Jansen, T., Seidl, T., Bifet, A., Holmes, G., Pfahringer, B., and Read, J. Stream data mining using the MOA framework. In Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA), Busan, South Korea (2012), pp. 309–313.

[100] Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. LoOP: local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China (2009), pp. 1649–1652.

[101] Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. Outlier detection in axis-parallel subspaces of high dimensional data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand (2009), pp. 831–838.

[102] Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ (2011), pp. 13–24.

[103] Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. Outlier detection in arbitrarily oriented subspaces. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium (2012), pp. 379–388.

[104] Kriegel, H.-P., Schubert, M., and Zimek, A. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV (2008), pp. 444–452.

[105] Kuncheva, L. I. Change detection in streaming multivariate data using likelihood detectors. IEEE Transactions on Knowledge and Data Engineering 25, 5 (2013), 1175–1180.

[106] Lampos, V., De Bie, T., and Cristianini, N. Flu detector-tracking epidemics on Twitter. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Barcelona, Spain. 2010, pp. 599–602.

[107] Laney, D. 3d data management: Controlling data volume, velocity and variety. META Group Research Note 6 (2001), 70.

[108] Laniado, D., and Mika, P. Making sense of Twitter. In The Semantic Web–ISWC 2010. Springer, 2010, pp. 470–485.

[109] Lappas, T., Vieira, M. R., Gunopulos, D., and Tsotras, V. J. On the spatiotemporal burstiness of terms. Proceedings of the VLDB Endowment 5, 9 (2012), 836–847.

[110] Lappas, T., Vieira, M. R., Gunopulos, D., and Tsotras, V. J. On the spatiotemporal burstiness of terms. Proceedings of the VLDB Endowment 5, 9 (2012), 836–847.

[111] Lavin, A., and Ahmad, S. Evaluating real-time anomaly detection algorithms–the Numenta anomaly benchmark. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) (2015), IEEE, pp. 38–44.

[112] Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A., and Srivastava, J. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), San Francisco, CA (2003).

[113] Lazarevic, A., and Kumar, V. Feature bagging for outlier detection. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL (2005), pp. 157–166.

[114] Lee, J.-G., Han, J., and Li, X. Trajectory outlier detection: A partition-and-detect framework. In Proceedings of the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico (2008), pp. 140–149.

[115] Lee, R., and Sumiya, K. Measuring geographical regularities of crowd behaviors for Twitter-based geo-social event detection. In Proc. LBSN (2010).

[116] Leskovec, J., Backstrom, L., and Kleinberg, J. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France (2009), pp. 497–506.

[117] Leung, C. K.-S., Carmichael, C. L., and Hao, B. Efficient mining of frequent patterns from uncertain data. In ICDMW ’07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (2007), pp. 489–494.

[118] Li, J., Saha, B., and Deshpande, A. A unified approach to ranking in probabilistic databases. Proceedings of the VLDB Endowment 2, 1 (2009), 502–513.

[119] Li, J., Saha, B., and Deshpande, A. A unified approach to ranking in probabilistic databases. VLDB Journal 20, 2 (2011), 249–275.

[120] Li, J., Zou, Z., and Gao, H. Mining frequent subgraphs over uncertain graph databases under probabilistic semantics. The VLDB Journal 21 (2012), 753–777.

[121] Li, T. J.-J., Sen, S., and Hecht, B. Leveraging advances in natural language processing to better understand Tobler's first law of geography. In ACM SIGSPATIAL (2014), ACM, pp. 513–516.

[122] Li, W. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory 38, 6 (1992), 1842–1845.

[123] Li, W., Eickhoff, C., and de Vries, A. P. Geo-spatial domain expertise in microblogs. In Advances in Information Retrieval - Proceedings of the 36th European Conference on IR Research (ECIR), Amsterdam, Netherlands (2014), pp. 487–492.

[124] Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6, 1 (2012), 3:1–39.

[125] Liu, X., Lu, C.-T., and Chen, F. Spatial outlier detection: Random walk based approaches. In Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS), San Jose, CA (2010), pp. 370–379.

[126] Lu, C.-T., Chen, D., and Kou, Y. Algorithms for spatial outlier detection. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), Melbourne, FL (2003), pp. 597–600.

[127] Luhn, H. P. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1, 4 (1957), 309–317.

[128] Mamoulis, N. Co-location pattern. In Encyclopedia of GIS. 2008, p. 98.

[129] Mamoulis, N. Co-location patterns, algorithms. In Encyclopedia of GIS. 2008, pp. 103–107.

[130] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. H. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.

[131] Mateo, J., and Laguna, P. Analysis of heart rate variability in the presence of ectopic beats using the heart timing signal. IEEE Transactions on Biomedical Engineering 50, 3 (2003), 334–343.

[132] Mathioudakis, M., and Koudas, N. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN (2010), pp. 1155–1158.

[133] McHugh, J. Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM Transactions on Information and System Security 3, 4 (2000), 262–294.

[134] Mei, Q., and Zhai, C. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (2005), ACM, pp. 198–207.

[135] Micenková, B., Ng, R. T., Dang, X. H., and Assent, I. Explaining outliers by subspace separability. In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM), Dallas, TX (2013), pp. 518–527.

[136] Mueen, A., and Keogh, E. Online discovery and maintenance of time series motifs. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (2010), ACM, pp. 1089–1098.

[137] Müller, E., Assent, I., Iglesias, P., Mülle, Y., and Böhm, K. Outlier ranking via subspace analysis in multiple views of the data. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium (2012), pp. 529–538.

[138] Müller, E., Assent, I., Steinhausen, U., and Seidl, T. OutRank: ranking outliers in high dimensional data. In Proceedings of the 24th International Conference on Data Engineering (ICDE) Workshop on Ranking in Databases (DBRank), Cancun, Mexico (2008), pp. 600–603.

[139] Müller, E., Schiffer, M., and Seidl, T. Adaptive outlierness for subspace outlier ranking. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, ON, Canada (2010), pp. 1629–1632.

[140] Müller, E., Schiffer, M., and Seidl, T. Statistical selection of relevant subspace projections for outlier ranking. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany (2011), pp. 434–445.

[141] Naaman, M., Becker, H., and Gravano, L. Hip and trendy: Characterizing emerging trends on Twitter. Journal of the American Society for Information Science and Technology 62, 5 (2011), 902–918.

[142] Nguyen, H. V., Ang, H. H., and Gopalkrishnan, V. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA), Tsukuba, Japan (2010), pp. 368–383.

[143] Nguyen, H. V., Gopalkrishnan, V., and Assent, I. An unbiased distance-based outlier detection approach for high-dimensional data. In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA), Hong Kong, China (2011), pp. 138–152.

[144] Niedermayer, J., Züfle, A., Emrich, T., Renz, M., Mamoulis, N., Chen, L., and Kriegel, H. Probabilistic nearest neighbor queries on uncertain moving object trajectories. PVLDB 7, 3 (2013), 205–216.

[145] Orair, G. H., Teixeira, C., Wang, Y., Meira Jr., W., and Parthasarathy, S. Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment 3, 2 (2010), 1469–1480.

[146] Petrović, S., Osborne, M., and Lavrenko, V. Streaming first story detection with application to Twitter. In Human Language Technologies (2010), pp. 181–189.

[147] Pham, N., and Pagh, R. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China (2012), pp. 877–885.

[148] Platakis, M., Kotsakos, D., and Gunopulos, D. Searching for events in the blogosphere. In Proceedings of the 18th International Conference on World Wide Web (WWW), Madrid, Spain (2009), pp. 1225–1226.

[149] Pokrajac, D., Lazarevic, A., and Latecki, L. J. Incremental local outlier detection for data streams. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Honolulu, HI (2007), pp. 504–515.

[150] Press, G. A very short history of big data. Forbes Tech Magazine, May 9 (2013).

[151] Ramaswamy, S., Rastogi, R., and Shim, K. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX (2000), pp. 427–438.

[152] Rider, F. The scholar and the future of the research library. Hadham Press, 1944.

[153] Robertson, S. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation 60, 5 (2004), 503–520.

[154] Roddick, J. F., and Spiliopoulou, M. A bibliography of temporal, spatial and spatio-temporal data mining research. ACM SIGKDD Explorations 1, 1 (1999), 34–38.

[155] Miller Jr., R. G. Simultaneous statistical inference. Springer Science & Business Media, 2012.

[156] Sadik, M. S., and Gruenwald, L. Research issues in outlier detection for data streams. ACM SIGKDD Explorations 15, 1 (2013), 33–40.

[157] Sakaki, T., Okazaki, M., and Matsuo, Y. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW), Raleigh, NC (2010), pp. 851–860.

[158] Sakaki, T., Okazaki, M., and Matsuo, Y. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web (WWW), Raleigh, NC (2010).

[159] Salton, G., Wong, A., and Yang, C.-S. A vector space model for automatic indexing. Communications of the ACM 18, 11 (1975), 613–620.

[160] Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., and Sperling, J. Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL international conference on advances in geographic information systems (2009), ACM, pp. 42–51.

[161] Schmid, K. A., Frey, C., Peng, F., Weiler, M., Züfle, A., Chen, L., and Renz, M. Trendtracker: Modelling the motion of trends in space and time. In SSTDM16 (2016).

[162] Schubert, E., Koos, A., Emrich, T., Züfle, A., Schmid, K. A., and Zimek, A. A framework for clustering uncertain data. Proceedings of the VLDB Endowment 8, 12 (2015), 1976–1979.

[163] Schubert, E., Weiler, M., and Kriegel, H.-P. SigniTrend: Scalable detection of emerging topics in textual streams by hashed significance thresholds. In Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), New York, NY (2014), pp. 871–880.

[164] Schubert, E., Weiler, M., and Kriegel, H.-P. Signitrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. In KDD (2014), ACM, pp. 871–880.

[165] Schubert, E., Weiler, M., and Kriegel, H.-P. Scalable detection of emerging topics and geo-spatial events in large textual streams.

[166] Schubert, E., Weiler, M., and Kriegel, H.-P. SPOTHOT: Scalable detection of geo-spatial events in large textual streams. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management (SSDBM), Budapest, Hungary (2016).

[167] Schubert, E., Weiler, M., and Zimek, A. Outlier detection and trend detection: Two sides of the same coin. In 1st International Workshop on Event Analytics using Social Media Data at the 15th IEEE International Conference on Data Mining (ICDM) (2015).

[168] Schubert, E., Wojdanowski, R., Zimek, A., and Kriegel, H.-P. On evaluation of outlier rankings and outlier scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA (2012), pp. 1047–1058.

[169] Schubert, E., Zimek, A., and Kriegel, H.-P. Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA (2014), pp. 542–550.

[170] Schubert, E., Zimek, A., and Kriegel, H.-P. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery 28, 1 (2014), 190–237.

[171] Schubert, E., Zimek, A., and Kriegel, H.-P. Fast and scalable outlier detection with approximate nearest neighbor ensembles. In Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA), Hanoi, Vietnam (2015), pp. 19–36.

[172] Shekhar, S., and Huang, Y. Co-location rules mining: A summary of results. In Proceedings of the 7th International Symposium on Spatial and Temporal Databases (SSTD), Redondo Beach, CA (2001), pp. 236–256.

[173] Shekhar, S., Lu, C.-T., and Zhang, P. A unified approach to detecting spatial outliers. GeoInformatica 7, 2 (2003), 139–166.

[174] Sim, K., Gopalkrishnan, V., Zimek, A., and Cong, G. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery 26, 2 (2013), 332–397.

[175] Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., and Gunopulos, D. Online outlier detection in sensor data using non-parametric models. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea (2006), pp. 187–198.

[176] Sun, P., and Chawla, S. On local spatial outliers. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Brighton, UK (2004), pp. 209–216.

[177] Sveshnikov, A. A., and Gelbaum, B. R. Problems in probability theory, mathematical statistics and theory of random functions. Courier Corporation, 1968.

[178] Takahashi, Y., Utsuro, T., Yoshioka, M., Kando, N., Fukuhara, T., Nakagawa, H., and Kiyota, Y. Applying a burst model to detect bursty topics in a topic model. In Advances in Natural Language Processing – Proceedings of the 8th International Conference on NLP (JapTAL), Kanazawa, Japan (2012), pp. 239–249.

[179] Takeuchi, J., and Yamanishi, K. A unifying framework for detecting outliers and change points from time series. IEEE Transactions on Knowledge and Data Engineering 18, 4 (2006), 482–492.

[180] Tang, J., Chen, Z., Fu, A. W.-C., and Cheung, D. W. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Taipei, Taiwan (2002), pp. 535–548.

[181] Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications (CISDA) (2009).

[182] Tobler, W. R. A computer movie simulating urban growth in the Detroit region. Economic Geography 46 (1970), 234–240.

[183] Tong, Y., Chen, L., Cheng, Y., and Yu, P. S. Mining frequent itemsets over uncertain databases. Proceedings of the VLDB Endowment 5, 11 (2012), 1650–1661.

[184] Tran, G. B., and Alrifai, M. Indexing and analyzing Wikipedia’s current events portal, the daily news summaries by the crowd. In Proceedings of the 23rd International Conference on World Wide Web (WWW), Seoul, Korea (2014), pp. 511–516.

[185] Unankard, S., Li, X., and Sharaf, M. A. Emerging event detection in social networks with location sensitivity. World Wide Web 18, 5 (2015), 1393–1417.

[186] Vu, N. H., and Gopalkrishnan, V. Efficient pruning schemes for distance-based outlier detection. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Bled, Slovenia (2009), pp. 160–175.

[187] Walther, M., and Kaisser, M. Geo-spatial event detection in the Twitter stream. In Advances in Information Retrieval – Proceedings of the 35th European Conference on IR Research (ECIR), Moscow, Russia (2013).

[188] Wang, L., Cheng, R., Lee, S. D., and Cheung, D. Accelerating probabilistic frequent itemset mining: a model-based approach. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, ON, Canada (2010), pp. 429–438.

[189] Wang, L., Cheung, D.-L., Cheng, R., Lee, S.-D., and Yang, X. Efficient mining of frequent item sets on large uncertain databases. IEEE Transactions on Knowledge and Data Engineering 24, 12 (2012), 2170–2183.

[190] Wang, L., Wu, P., and Chen, H. Finding probabilistic prevalent colocations in spatially uncertain data sets. IEEE Transactions on Knowledge and Data Engineering 25, 4 (2013), 790–804.

[191] Wang, X., Zhang, Y., Zhang, W., and Lin, X. Efficiently identify local frequent keyword co-occurrence patterns in geo-tagged Twitter stream. In Proceedings of the 37th International Conference on Research and Development in Information Retrieval (SIGIR), Gold Coast, QLD, Australia (2014), pp. 1215–1218.

[192] Watanabe, K., Ochi, M., Okabe, M., and Onai, R. Jasmine: a real-time local-event detection system based on geolocation information propagated to microblogs. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM), Glasgow, UK (2011), pp. 2541–2544.

[193] Weiler, M., Schmid, K. A., Renz, M., and Mamoulis, N. Geo-social co-location mining. In GeoRich: 2nd International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data Held in Conjunction with SIGMOD 2015, Melbourne, Australia (2015).

[194] Weiler, M., Züfle, A., Borutta, F., and Emrich, T. Socio textual mapping. In Proceedings of the 8th ACM SIGSPATIAL International Workshop on Location-Based Social Networks (2015), ACM, p. 6.

[195] Welford, B. P. Note on a method for calculating corrected sums of squares and products. Technometrics 4, 3 (1962), 419–420.

[196] Weng, J., and Lee, B.-S. Event detection in Twitter. In Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain (2011).

[197] West, D. H. D. Updating mean and variance estimates: an improved method. Communications of the ACM 22, 9 (1979), 532–535.

[198] Xiong, H., Shekhar, S., Huang, Y., Kumar, V., Ma, X., and Yoo, J. S. A framework for discovering co-location patterns in data sets with extended spatial objects. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Lake Buena Vista, FL (2004), pp. 78–89.

[199] Yamanishi, K., Takeuchi, J.-I., Williams, G., and Milne, P. On-line unsupervised outlier detection using finite mixture with discounting learning algorithms. Data Mining and Knowledge Discovery 8 (2004), 275–300.

[200] Yang, J., Zhong, N., Yao, Y., and Wang, J. Local peculiarity factor and its application in outlier detection. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV (2008), pp. 776–784.

[201] Yang, Y., Pierce, T., and Carbonell, J. A study of retrospective and on-line event detection. In Proceedings of the 21st International Conference on Research and Development in Information Retrieval (SIGIR), Melbourne, Australia (1998), pp. 28–36.

[202] Yu, J. X., Qian, W., Lu, H., and Zhou, A. Finding centric local outliers in categorical/numerical spaces. Knowledge and Information Systems (KAIS) 9, 3 (2006), 309–338.

[203] Zhang, K., Hutter, M., and Jin, H. A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand (2009), pp. 813–822.

[204] Zhang, Q., Li, F., and Yi, K. Finding frequent items in probabilistic data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Vancouver, BC, Canada (2008), pp. 819–832.

[205] Zhang, X., Mamoulis, N., Cheung, D. W., and Shou, Y. Fast mining of spatial collocations. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA (2004), pp. 384–393.

[206] Zhou, X., and Chen, L. Event detection over Twitter social media streams. The VLDB Journal 23, 3 (2014), 381–400.

[207] Zimek, A., Campello, R. J. G. B., and Sander, J. Ensembles for unsupervised outlier detection: Challenges and research questions. ACM SIGKDD Explorations 15, 1 (2013), 11–22.

[208] Zimek, A., Campello, R. J. G. B., and Sander, J. Data perturbation for outlier detection ensembles. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management (SSDBM), Aalborg, Denmark (2014), pp. 13:1–13:12.

[209] Zimek, A., Gaudet, M., Campello, R. J. G. B., and Sander, J. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL (2013), pp. 428–436.

[210] Zimek, A., Schubert, E., and Kriegel, H.-P. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining 5, 5 (2012), 363–387.

[211] Zimek, A., and Vreeken, J. The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning 98, 1–2 (2015), 121–155.

[212] Zimmermann, A. The data problem in data mining. ACM SIGKDD Explorations 16, 2 (2014), 38–45.

[213] Zou, Z., Gao, H., and Li, J. Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC (2010), pp. 633–642.

[214] Zou, Z., Li, J., Gao, H., and Zhang, S. Mining frequent subgraph patterns from uncertain graph data. IEEE Transactions on Knowledge and Data Engineering 22, 9 (2010), 1203–1218.