Analysis and Forecasting of Trending Topics in Online Media © April 2013
Total Page:16
File Type:pdf, Size:1020Kb
ANALYSIS AND FORECASTING OF TRENDINGTOPICSINONLINEMEDIA christopher tim althoff Computer Science Department University of Kaiserslautern Germany April 2013 Christopher Tim Althoff: Analysis and Forecasting of Trending Topics in Online Media © April 2013 Master’s Thesis Department of Computer Science University of Kaiserslautern P.O. Box 3049 67653 Kaiserslautern Germany supervisors: Prof. Dr. Prof. h.c. Andreas Dengel Damian Borth, M.Sc. DECLARATION I declare that this document has been composed by myself, and describes my own work, unless otherwise acknowledged in the text. It has not been accepted in any previous application for a degree. All verbatim extracts have been distin- guished by quotation marks, and all sources of information have been specifically acknowledged. Kaiserslautern, April 29, 2013 Christopher Tim Althoff ABSTRACT Among the vast information available on the web, social media streams capture what people currently pay attention to and how they feel about certain topics. Awareness of such trending topics plays a crucial role in many application domains such as economics, health monitoring, journalism, finance, marketing, and social multimedia systems. However, optimal use of trending topics in these application domains requires a better understanding of their various characteristics in different social media channels. To this end, we present the first comprehensive study across three major online and social media channels, Twitter, Google, and Wikipedia, covering thousands of trending topics over an observation period of an entire year. Our results indicate that depending on one’s requirements one does not necessarily have to turn to Twitter for information about current events and that some media channels strongly emphasize content of specific categories. As our second key contribution we further present a novel approach for the challenging task of forecasting the life cycle of trending topics in the moment they emerge. Our fully automated approach is based on a nearest neighbor forecasting technique exploiting the observation that semantically similar topics exhibit similar behavior. We further demonstrate on a large-scale dataset of Wikipedia page view statis- tics that forecasts by the proposed approach are about 9-48k views closer to the actual viewing statistics compared to baseline methods and achieve a mean average percentage error (MAPE) of 45-19% for time periods of up to 14 days. v ZUSAMMENFASSUNG Soziale Medienkanäle im Web erfassen welchen Themen Menschen derzeit Aufmerksamkeit schenken und spiegeln das momentane Stimmungsbild wider. Das Verständnis des Verhaltens von sogenannten Trending Topics, Themen von besonderer aktueller Relevanz, spielt eine entscheidende Rolle in Anwendungs- domänen wie Wirtschaftswissenschaften, Gesundheitsüberwachung, Journalis- mus, Finanzwesen, Marketing, und sozialen Multimedia-Systemen. Eine optimale Nutzung von Trending Topics in diesen Domänen setzt ein gutes Verständnis von deren Eigenschaften und Verhaltensmerkmalen in ver- schiedenen sozialen Medienkanälen voraus. Zu diesen Zweck präsentieren wir die erste umfassende Studie von tausenden von Trending Topics in drei be- deutenden Online-Medienkanälen, Twitter, Google und Wikipedia, über einen Beobachtungszeitraum von einem Jahr. Unsere Ergebnisse zeigen, dass man je nach Anforderungen nicht notwendigerweise Twitter als Informationsquelle für aktuelle Geschehnisse heranziehen muss und dass manche soziale Medienkanäle bestimmte Inhaltskategorien besonders betonen. Der zweite Hauptbeitrag dieser Arbeit ist ein neuartiges Prognoseverfahren, um den komplexen Lebenszyklus von Trending Topics zum Zeitpunkt deren Entstehung vorherzusagen. Der vollau- tomatische Ansatz basiert auf einem Nearest-Neighbor-Prognoseverfahren und nutzt aus, dass wie in dieser Arbeit beobachtet semantisch verwandte Themen ein ähnliches Verhalten aufweisen. Auf einem umfangreichen Datensatz von Anzeigestatistiken von Wikipedia- Artikeln zeigen wir empirisch, dass der vorgeschlagene Ansatz 9-48k Betrachtun- gen näher an der Wirklichkeit liegt als vergleichbare Baselines und autoregressive Modelle, und dass dieser Ansatz Vorhersagen für bis zu 14 Tage mit einem Mean Average Percentage Error (MAPE) von 45-19% ermöglicht. vi Computers are good at following instructions, but not at reading your mind. — Donald E. Knuth [53] ACKNOWLEDGMENTS This thesis originates from my work as a student research assistant at the German Research Center for Artificial Intelligence (DFKI), where I have been a member of the Multimedia Analysis and Data Mining research group from May 2009 until May 2011 and after my time abroad again since January 2013. I would like to thank Prof. Andreas Dengel for the many opportunities I’ve been given at DFKI and constant support during my studies and academic career. I’m heartily thankful to my supervisor, Damian Borth, Master Mind, who from nightly discussions about trending topics on the streets of Nara, Japan to the last draft of my thesis provided me with opportunities, motivation, instruction and insight. I am grateful for the freedom in researching various aspects of trending topics and forecasting, a great time in Japan, and countless discussions on soccer, hair transplants, and other things. I offer my regards to all of those who supported me in any respect during the completion of the project. Particularly, I feel very grateful to Jörn Hees for many helpful discussions as well as paper writing, reviewing, coding, and microgolf sessions, to Andreas Hoepner for insighful discussions on time series analysis, and to Wolfgang Schlauch and Klaus-Dieter Althoff for proof-reading the manuscript. I am further thankful for the support through the PiCloud Academic Research Program Grant that enabled me to conduct this research on their computing cluster free of charge. Lastly, my deepest gratitude goes to my family, and especially my parents, for all the opportunities they have created for me and for giving me the best possible support, careful correction, and enabling freedom. Christopher Tim Althoff vii CONTENTS 1introduction 1 1.1 Trending Topics . 2 1.2 Application Domains . 5 1.3 Goals and Outline . 10 2 background 13 2.1 Online and Social Media . 13 2.1.1 Reasons for Participating in Social Media . 14 2.1.2 Wikipedia . 17 2.1.3 Twitter . 19 2.1.4 Google . 20 2.2 Time Series Forecasting . 22 2.2.1 Time Series . 22 2.2.2 Forecasting . 23 2.2.3 Basic Forecasting Models . 24 2.2.4 Autoregressive Models . 24 2.2.5 Nearest Neighbor Forecasting Models . 27 2.2.6 Forecasting in the Presence of Structural Breaks . 28 3 related work 31 3.1 Topic and Event Detection . 31 3.2 Multi-Channel Analyses . 32 3.3 Forecasting Behavioral Dynamics . 33 3.4 Social Multimedia Applications . 35 4 multi-channel analysis 37 4.1 Methodology . 37 4.1.1 Raw Trending Topic Sources . 38 4.1.2 Unification and Clustering of Trends . 38 4.1.3 Ranking Trending Topics . 40 4.2 Correspondence to Real-World Events . 40 4.2.1 Expert Descriptions . 41 4.2.2 Automatically Generating Descriptions . 43 ix x contents 4.3 Lifetime Analysis across Channels . 47 4.4 Delay Analysis across Channels . 49 4.5 Topic Category Analysis across Channels . 54 5 forecastingoftrendingtopics 59 5.1 Characteristics of Behavioral Signals . 59 5.2 System Overview . 63 5.3 Discovering Semantically Similar Topics . 65 5.4 Nearest Neighbor Sequence Matching . 66 5.5 Forecasting . 68 5.6 Generalizability of Proposed Approach . 70 6 evaluation 73 6.1 Dataset Description . 73 6.2 Performance Metrics . 74 6.3 Experiments . 75 6.3.1 Discovering Semantically Similar Topics . 76 6.3.2 Nearest Neighbor Sequence Matching . 79 6.3.3 Forecasting . 81 6.3.4 Qualitative Forecasting Results . 84 7 conclusion 87 7.1 Summary . 87 7.2 Future Work . 89 bibliography 91 a appendix 101 LISTOFFIGURES Figure 1 Top 10 trending topics . 2 Figure 2 Word cloud of individual topics of “Olympics 2012”.... 4 Figure 3 The trending topic “Olympics 2012” during summer 2012 . 5 Figure 4 Predicting flu infections . 7 Figure 5 Predicting unemployment claims . 7 Figure 6 Wikipedia page views for Whitney Houston . 10 Figure 7 Maslow’s hierarchy of needs . 15 Figure 8 Example tweet . 19 Figure 9 Google Trends example . 21 Figure 10 Global temperature difference time series . 22 Figure 11 Google stock price time series . 23 Figure 12 Example forecasts of basic and AR models . 25 Figure 13 Structural break in Volkswagen stock price time series . 28 Figure 14 Trending topics generation process . 38 Figure 15 Raw Trending Topics, Clustering, and Ranking . 39 Figure 16 Spikes in attention of trending topic “earthquake” . 42 Figure 17 Spikes in attention of trending topic “Champions League” . 43 Figure 18 Concept overview to automatically generate descriptions . 44 Figure 19 Trending topics sequences example . 48 Figure 20 Lifetime for top 200 trending topics across channels . 50 Figure 21 Definition of delay between two channels . 51 Figure 22 Histogram of delays between different media channels . 53 Figure 23 Topic category distribution of individual channels . 54 Figure 24 Trending topic corresponding to Neil Armstrong’s death . 55 Figure 25 Examples signals in different channels . 60 Figure 26 Classes of behavioral signals . 61 Figure 27 System overview of proposed forecasting approach . 64 xi xii List of Figures Figure 28 Example properties for 2012 Summer Olympics and seman- tically similar topics . 65 Figure 29 Example Wikipedia page view statistics . 74 Figure 30 Qualitative results indicating the necessity of Nearest Neigh-