Tracking Online Trend Locations Using a Geo-Aware Topic Model
Total Page:16
File Type:pdf, Size:1020Kb
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2016 Tracking Online Trend Locations using a Geo-Aware Topic Model JONAH SCHREIBER KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION Tracking Online Trend Locations using a Geo-Aware Topic Model JONAH SCHREIBER [email protected] Master’s Thesis at CSC Supervisor: Joel Brynielsson Examiner: Anders Lansner Stockholm, Sweden 2016 Abstract In automatically categorizing massive corpora of text, various topic models have been applied with good success. Much work has been done on applying machine learning and NLP methods on Internet media, such as Twitter, to survey online discussion. However, less focus has been placed on studying how geographical locations discussed in on- line fora evolve over time, and even less on associating such location trends with topics. Can online discussions be geographically tracked over time? This thesis attempts to answer this question by evaluating a geo-aware Streaming Latent Dirichlet Allocation (SLDA) implementa- tion which can recognize location terms in text. We show how the model can predict time-dependent locations of the 2016 American primaries by automatic discovery of election topics in various Twitter corpora, and deduce locations over time. Referat Följning av online-trender och deras lokalisering med en geo-medveten ämnesmodell För att automatiskt kategorisera massiva textkorpusar har olika äm- nesmodeller applicerats framgångsrikt. Mycket arbete har lagts på att använda maskininlärnings- och NLP-metoder på internetmedia, såsom Twitter, för att överblicka diskussion på nätet. Dock har mindre fokus lagts på studier av hur geografiska platser som diskuteras i nätets fora evolverar med tiden, och mindre då på associering av sådana platstren- der med ämnen eller teman. Kan diskussioner på nätet spåras geogra- fiskt över tiden? Denna uppsats försöker svara på denna fråga genom att evaluera en geo-medveten Strömmande Latent Dirichlet Allokering (SLDA)-implementation som kan identifiera platsnamn i text. Vi visar hur modellen kan förutspå tidsberoende valplatser i det amerikanska primärvalet år 2016 genom att automatiskt upptäcka valrelaterade äm- nen i olika Twitterkorpusar, och härleda platser över tiden. Contents 1 Introduction 1 1.1 Problem Statement . 2 2 Background 3 2.1 Thesis Client . 3 2.1.1 EUROSKY . 3 2.1.2 Interest in Social Media . 4 2.2 History of Topic Modeling . 4 2.3 Related Work . 5 3 Theory 8 3.1 Topic Models . 8 3.1.1 Introduction to Latent Dirichlet Allocation (LDA) . 8 3.1.2 LDA’s Generative Process . 10 3.1.3 Inference with LDA . 13 3.1.4 Online LDA . 15 3.2 Geographic Location Extraction . 16 4 Method 17 4.1 Overview . 17 4.2 Tweet Collection . 18 4.3 Preprocessing . 18 4.4 SLDA . 20 4.5 Evaluation . 22 4.5.1 Perplexity . 22 4.5.2 Evaluation Types . 22 5 Results 24 5.1 Illustrations . 24 5.2 Perplexity . 28 5.3 Location Prediction . 29 5.4 Time Prediction . 34 6 Discussion 40 6.1 Qualitative Results . 40 6.2 Quantitative Results . 42 6.3 Implementation Discussion . 43 7 Conclusion 45 7.1 Future Work . 46 Appendices 46 A Corpus Information 47 Bibliography 52 Acknowledgements I would like to sincerely thank Marianela Lozano and Joel Brynielsson at FOI for providing me with invaluable assistance with all the steps in this project, including specification, implementation, documentation and technical details. In addition, thanks to Magnus Rosell for providing very helpful feedback and assistance with LDA. Moreover, I want to express deep gratitude to FOI for hosting and helping with this project. I would further like to express my gratitude to my examiner, Anders Lansner, for his role in this project, and my coworker Mattias Harryson, for his helpful guidance. Finally, I want to thank my family and friends for their support during this project. Chapter 1 Introduction The rapid growth of social media has enabled millions of people to express them- selves and reach wide audiences. As the quantity of openly available text on the Internet is increasing ever faster, so is the interest in categorizing such data and capturing overviews of ongoing discussions. Such summaries of public opinion have broad uses in industry, commerce and security. For example, disaster relief supplies can be better distributed using knowledge compiled by systems that analyze Twit- ter trends for areas of concern. Online ads can be better proportioned in real time by tracking Internet forum discussions. However, such summaries can be infeasible to discover by hand. In view of this accelerating development, many researchers have turned to machine learning methods, which can offer automated processes that discover what texts are about. Topic modeling is an emerging field in machine learning that aims to arrange large corpora of text into representative topics. These topics describe thematic patterns and relations between texts. Most topic models offer unsupervised learning; they discover topical structures by statistical methods and do not require manual per-document annotations of the “correct” classification. They are typically trained on a training corpus before being applied to new and unexplored documents, but the training itself can also reveal interesting morphologies of the corpus. The use of topic models on social media is still ongoing research, and many of their applications are yet unknown. Topic models that can track trends in a real- time, online fashion have recently been proposed and evaluated on popular social media platforms, such as Twitter. The term trend is defined here to mean distinct temporal changes that a topic experiences. It does not necessarily indicate popu- larity, but rather topic evolution. Trend detection and tracking is a very valuable capacity to decision-makers, because it can suggest how to best allocate limited resources. A key feature of online trends is that many have distinctly evolving geographical locations. For example, touring musicians visit different cities over the duration of their tour, which tends to be temporally reflected in relevant online discussions. However, automatic extraction of locations from such discussions is difficult. Some 1 CHAPTER 1. INTRODUCTION media let users report their location, called geo-tagging, but such data is very often unavailable. To solve this, some previous work has focused on making topic models geo-aware, as in Figure 1.1. The term geo-aware is defined here to mean that a model recognizes that text content may include explicit or implicit geographical locations. Previous work explores mostly if the location of a document’s author can be identified. Less work has been done to identify how geographical locations correlate with topics, and even less on how topical locations change over time. 6 US UK 4 Russia 2 Word frequency 0 01-27 02-06 02-16 02-26 03-07 03-17 03-27 04-06 04-16 04-26 Date Figure 1.1. A simple word-frequency visualization of a geographical trend. The proportions of the countries Russia, UK and US are shown over time in a news corpus filtered by the keyword “Syria.” While such visualizations are helpful, they are limited in that they are unaware to the topical context, they are chosen manually, and it is unclear how to classify unseen documents with this prior knowledge. This thesis will address this latter problem. We seek not only a compounded model of topics and locations, but a more integrated solution. It should not only correlate topics with locations, but also recognize that topics and locations might not be independently associable. Some topics, like international relations, are inherently geographical while others, like mathematics, are not. We need a model that is sensitive to such knowledge and incorporates it in a sophisticated way as it detects emerging geographical trends. 1.1 Problem Statement Geographical locations are ambiguous and therefore inherently contextual. Many places share names with common English words and also amongst themselves. It is believed that IBM’s computer Jeopardy! player Watson lost a game at its final challenge, because it confounded the major city Toronto in Canada with a minor city also called Toronto in Illinois. Context is very difficult to model because texts can have complex semantics. The challenge is therefore not only to recognize places, but also disambiguate them whilst in the presence of words related to a complex configuration of topics. 2 Chapter 2 Background This chapter covers the history behind the commission of this thesis, the history of the scientific fields involved as well as their related work. It provides an important context to why this thesis is relevant. 2.1 Thesis Client The Swedish Defence Research Agency (FOI) is this paper’s thesis client. FOI is a leading assignment-based agency that researches defense and security solutions. It was created in 2001 through a merge of the Swedish Defence Research Establishment (FOA) with the Swedish Institute for Aeronautic Research (FFA). FOI provides its clients from the defense sector with advanced research and systems development, which are designed to aid decision-makers, assess various types of threats, analyze and manage crises, and study various other fields. FOI cooperates internationally and many of its assignments are sourced from foreign authorities, particularly within the EU. It is an authority based on the Swedish Ministry of Defense, but most of its activities are commenced on behalf of various clients, particularly the Swedish Armed Forces [2]. 2.1.1 EUROSKY FOI is a major partner in an EU project called EUROSKY, whose objective is to improve air cargo security. The project is part of an international initiative to secure air freight supply chains. EUROSKY is particularly interested in systemic solutions that can assist detection of illegal and dangerous cargo. FOI commissioned this thesis as part of its role in this project, by providing research and development of information systems capable of aiding decision-making in the allocation of security resources.