UC Irvine Electronic Theses and Dissertations

Title Analysis-Aware Approach to Improving Social Data Quality

Permalink https://escholarship.org/uc/item/6k08w8js

Author Sadri, Mehdi

Publication Date 2017

Peer reviewed|Thesis/dissertation

UNIVERSITY OF CALIFORNIA, IRVINE

Analysis-Aware Approach to Improving Social Data Quality

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Computer Science

by

Mehdi Sadri

Dissertation Committee: Professor Sharad Mehrotra, Chair Professor Chen Li Professor Nalini Venkatasubramanian Professor Yaming Yu

2017

© 2017 Mehdi Sadri

DEDICATION

To my beloved parents, Monir and Mohammad.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGMENTS

CURRICULUM VITAE

ABSTRACT OF THE DISSERTATION

1 Introduction

2 Preliminaries and Related Work
  2.1 Data Quality
    2.1.1 Social Data Quality
  2.2 Data Acquisition
    2.2.1 Social Data Acquisition
  2.3 Data Cleaning
    2.3.1 Social Data Cleaning
  2.4 Analysis-Aware Approach

3 Social Data Acquisition
  3.1 Introduction
  3.2 Motivating Example
  3.3 Notation and Problem Definition
  3.4 Query Generation
    3.4.1 Probabilistic Query Coverage
    3.4.2 Query Generation
    3.4.3 Statistics Maintenance
    3.4.4 Combinatorial MAB Framework
    3.4.5 Greedy Approximation Bound
    3.4.6 Greedy Algorithm
  3.5 Relevance Check
    3.5.1 Phrase Based Relevance (Rt)
    3.5.2 Clue Relevance (Rc)
    3.5.3 User History (Ru)
  3.6 Topic Maintenance
  3.7 Experimental Evaluation
    3.7.1 Experimental Setup
    3.7.2 Evaluation Criteria
    3.7.3 Experimental Results
  3.8 TAPP (Follow-up Application)
    3.8.1 System Overview
  3.9 Summary

4 Social Entity Linking
  4.1 Introduction
  4.2 Motivating Example
  4.3 Preliminaries
    4.3.1 Window-based Stream
    4.3.2 Data Cleaning Functionalities
    4.3.3 Entity Blocks
    4.3.4 Mention Probabilities
    4.3.5 Continuous Top-k Query
  4.4 Deterministic Top-k
  4.5 Probabilistic Top-k
    4.5.1 Factor Graph
    4.5.2 Entity Probabilistic Model
    4.5.3 Entity Dominance Graph (EDG)
    4.5.4 Selection Criteria
    4.5.5 Stopping Criteria
    4.5.6 Finding Top-K
    4.5.7 Scalability of EDG
  4.6 Architecture of TkET
    4.6.1 Sliding Window Stream Processing
  4.7 Experimental Evaluation
    4.7.1 Experimental Setup
    4.7.2 Knowledge Base
    4.7.3 Factor Graphs and Dimple
    4.7.4 Synthetic Dataset Generation
    4.7.5 Experimental Results
    4.7.6 Real Tweet Dataset Experimental Results
    4.7.7 Discussion
  4.8 Related Work
    4.8.1 Social Entity Linking
    4.8.2 Top-k Query Answering
  4.9 Summary

5 SoDAS: Social Data Analytics System
  5.1 System Overview

6 Conclusions and Future Work

Bibliography

LIST OF FIGURES

2.1 Common Steps in Data Processing Pipelines

3.1 TAS Architecture
3.2 Phrase Weight
3.3 TAS Iterations
3.4 Phrase Maintenance
3.5 Approximate Relative Recall
3.6 TAS vs. BaseM: Number of Tweets
3.7 TAS over Simulation: Number of Tweets
3.8 TAS vs. BaseM: Approximate Relative Recall
3.9 TAS with Different Phrase Budgets
3.10 TAS with Different Inner Iteration Sizes
3.11 Topic Maintenance Module On vs. Off: Number of Tweets
3.12 TAS: Number of Phrases
3.13 TAS vs. ATM: Topic 75
3.14 TAS vs. ATM: Topic 85
3.15 TAPP Application

4.1 NER and NEL Black Box Interfaces
4.2 Factor Graph for the “Catfish” Entity Block
4.3 Entity Dominance Graph Example
4.4 In-Degree vs. Out-Degree based Stopping Criteria
4.5 Out-Degree based Stopping Criteria Example
4.6 In-Degree based Stopping Criteria Example
4.7 EDG 1-2 steps of TkET top-2 algorithm on Motivating Example
4.8 EDG 3-5 steps of TkET top-2 algorithm on Motivating Example
4.9 EDG 6-7 steps of TkET top-2 algorithm on Motivating Example
4.10 Transitivity of Pairwise Dominance
4.11 Overview of TkET
4.12 Architecture of TkET
4.13 Sliding Window, Stream Processing
4.14 Motivating Example’s Identified Entity Blocks
4.15 Selected Synthetic Datasets Block Size Distribution
4.16 SDS4: Latency vs. Parameters (k, th)
4.17 SDS4: Accuracy vs. Parameters (k, th)

5.1 SoDAS General Architecture

LIST OF TABLES

3.1 Example Phrases of an Interest
3.2 Fixed Corpus Topics of Interest
3.3 ARR Calculation, Sample Sizes
3.4 Streaming Topics of Interest
3.5 Streaming Experiment Summary

4.1 Example Raw Tweets
4.2 Selected Dataset Parameters
4.3 Efficiency for Out-Degree based Stopping Criteria
4.4 Efficiency for In-Degree based Stopping Criteria
4.5 Efficiency over the Real Tweet Dataset

ACKNOWLEDGMENTS

I would like first to express my deepest, sincere gratitude to my advisor, Prof. Sharad Mehrotra, for his unwavering guidance, support, and encouragement. Prof. Sharad has patiently taught me how to identify important new research problems, how to solve them in a principled way, and how to write research papers. I am glad to have had the opportunity to work with him, and for that I am very grateful. Special gratitude is also due to Prof. Yaming Yu and Prof. Charless Fowlkes for their insightful support and suggestions throughout this research, especially on the third and fourth chapters of this thesis. The time and effort they spent with me were instrumental in my progress. I would also like to extend my appreciation to the members of my doctoral committee, Prof. Chen Li, Prof. Nalini Venkatasubramanian, and Prof. Yaming Yu, for their useful feedback and for finding the time to serve on my committee.

I would like to thank everyone in the ISG group, especially my colleagues in the Data Quality and Privacy Group at UCI: Yasser Altowim, Hotham Altwaijry, Stylianos Doudalis, Kerim Oktay, Jie Xu, Liyan Zhang, Abdulrahman Alsaudi, and Jamshid Esmaelnezhad. The work reported in this thesis was also supported in part by NSF grants CNS-1527536, CNS-1545071, CNS-1450768, CNS-1059436, and CNS-1118114.

Foremost, I would like to thank my parents and my brother, who always had faith in me; their support helped me focus on my studies. I would also like to thank my friends, Mohammad Shokoohi-Yekta, Hamid Mirebrahim, Mahdi Abbaspour-Tehrani, Sky Faber, Gabriel Faber, Mohammad Asghari, Sara Nasseri, Maral Amir, Laleh Aghababaie, Forough Arabshahi, Jonathan Shoemaker, Kristin Roher, Kevin Bache, Kyle Benson, Patricia Sponaugle, and Ahmad Malekitabar, who have supported and encouraged me to move forward.

CURRICULUM VITAE

Mehdi Sadri

EDUCATION

Doctor of Philosophy in Computer Science, 2017
University of California, Irvine (Irvine, CA)

Master of Science in Computer Science, 2013
University of California, Irvine (Irvine, CA)

Master of Science in Computer Engineering, 2011
Sharif University of Technology (Tehran, Iran)

Bachelor of Science in Computer Engineering, 2009
Shahid Beheshti University (Tehran, Iran)

Publications

TkET: Efficient Top-k Entities Query Processing over Stream of Tweets, 2017
Under Submission

TAS: Online Adaptive Topic Focused Tweet Acquisition System, 2017
TKDE Journal, Under Review

Adaptive Topic Follow-up on Twitter, 2017
ICDE

Online Adaptive Topic Focused Tweet Acquisition, 2016
CIKM

Selected Honors and Awards

Full Scholarship for Three Years of Support, 2011-2014
University of California, Irvine

Master Thesis Grant, 2010
ERICT (Education and Research Institute for ICT), Tehran, Iran

Ranked 8th, 2009
14th Iran Student Olympiads of Computer Engineering

ABSTRACT OF THE DISSERTATION

Analysis-Aware Approach to Improving Social Data Quality

By

Mehdi Sadri

Doctor of Philosophy in Computer Science

University of California, Irvine, 2017

Professor Sharad Mehrotra, Chair

In the era of real-time stream processing, the past decade has witnessed the emergence of a wide variety of applications that tap into live social data feeds to gain awareness of events, opinions, and sentiments in the community. Social data has the potential to bring new functionality and improvements in a wide variety of domains, from Emergency Response to Political Analysis. Thus, there is an increasing demand for executing up-to-the-minute analysis tasks on top of these dynamic data sources by modern applications. Such new requirements have created new challenges for traditional data processing techniques. In this thesis, we respond to some of these challenges.

First, we explore the problem of online adaptive topic-focused tweet acquisition. Specifically, we propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track according to a reinforcement learning algorithm. The selection follows an explore-exploit policy to approximate the effectiveness of different phrases in retrieving relevant tweets based on Bayesian inference. We also develop a tweet relevance model, which enables checking the relevance of collected tweets to the topic of interest based on multiple criteria. The objective of TAS is to improve the recall of collected relevant tweets. Our experimental studies show significant improvements over the state of the art; furthermore, the performance gap increases when the topics are more specific.

Subsequently, efficient processing of top-k mentioned-entities queries posed on a stream of tweets has become a key part of a broad class of real-time applications, ranging from content search to marketing. Given that words are often ambiguous, entity linking becomes an important step towards answering such queries. Furthermore, the continuous and fast generation of tweets makes it crucial for such applications to process those queries at an equally fast pace. To address these requirements, we propose TkET (pronounced “ticket”), an analysis-aware entity linking framework for efficiently answering top-k entities queries over the Twitter stream in a sliding-window fashion. A comprehensive empirical evaluation of the proposed solution demonstrates its significant efficiency advantage over traditional techniques for the given problem settings.

Chapter 1

Introduction

Data-driven technologies such as decision support, analysis, and scientific discovery tools have become a critical component of many organizations and businesses. The effectiveness of such technologies, however, is closely tied to the quality of the data on which they are applied. It is well recognized that the outcome of an analysis is only as good as the quality of the data on which the analysis is performed. That is why today’s organizations spend a substantial percentage of their budgets on cleaning tasks, such as filtering irrelevant data, removing duplicates, and various types of data enrichment such as named entity identification and linking, to improve data quality before pushing data through the analysis pipeline.

Recent years have seen a transformation in the type of data available on the web, and many internet-based applications now generate big data streams. Such applications include IoT-based monitoring systems, global flight monitoring systems, social media, and so on. Processing such fast big data streams in an online adaptive manner requires a paradigm shift in how we approach the data acquisition and cleaning problem. In particular, social media (e.g., micro-blogs such as Twitter) has the potential to bring new functionality and improvements in a wide variety of domains, from Emergency Response, Business Intelligence, and Political Analysis to Social Sciences. For instance, the authors of [14] propose an approach to predict the stock market based on the Twitter mood. (As another real use case of Twitter data, in 2015 Twitter announced an update to its partnership with Bloomberg to bring more information to Bloomberg’s financial terminal product.)

As in any other data analysis process, data acquisition and data cleaning are vital steps of the social data analysis pipeline. Due to the enormous amount of data produced on social media, there is an increasing demand for executing up-to-the-minute analysis tasks on top of these dynamic and/or heterogeneous data sources by modern applications. Such new requirements, as well as the unstructured nature of social data, have created challenging new problems for the traditional data preparation (acquisition/cleaning) tasks that improve the quality of data. In online adaptive social data processing, an analysis-aware approach is motivated by several key perspectives.

Our approach considers the semantics of the given analysis task and thus can process certain types of analysis much more efficiently. An analysis-aware approach is also useful when a small organization is in possession of a very large dataset (e.g., has access to the Twitter firehose) but typically needs to analyze only small portions of it to answer some analytical queries quickly. In this case, it would be counterproductive for the organization to spend its limited computational resources on cleaning all the data, especially given that most of it is going to be unnecessary.

In this thesis, we address key challenges in building an analysis-aware approach to improving social data quality by exploring two aspects of the social data analysis process. In particular, we first explore the problem of adaptive topic-focused data acquisition in the context of Twitter. Specifically, we propose a Tweet Acquisition System (TAS), which iteratively selects phrases to track according to a reinforcement learning algorithm. The selection algorithm follows an explore-exploit policy to approximate the effectiveness of different phrases in retrieving relevant tweets based on Bayesian inference. We also develop a tweet relevance model, which enables checking the relevance of collected tweets to the topic of interest based on multiple criteria in near real time. The objective of TAS is to improve the recall of collected relevant tweets. Our experimental studies show significant improvements over the state of the art; furthermore, the performance gap increases when the topics are more specific.
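The explore-exploit phrase selection described above can be illustrated with a small Beta-Bernoulli Thompson-sampling sketch. The class, phrase names, and counts below are illustrative simplifications, not TAS's actual model, which Chapter 3 develops in full:

```python
import random

class PhraseSelector:
    """Explore-exploit phrase selection via Beta-Bernoulli Thompson sampling.

    Each phrase keeps a Beta(successes+1, failures+1) posterior over its
    probability of retrieving a relevant tweet; at each iteration we sample
    from every posterior and track the highest-scoring phrases.
    """

    def __init__(self, phrases, budget):
        self.budget = budget                       # phrases the API lets us track
        self.stats = {p: [1, 1] for p in phrases}  # Beta(alpha, beta) per phrase

    def select(self):
        # Sampling (rather than taking the posterior mean) is what gives
        # under-explored phrases a chance to be tracked.
        draws = {p: random.betavariate(a, b) for p, (a, b) in self.stats.items()}
        return sorted(draws, key=draws.get, reverse=True)[:self.budget]

    def update(self, phrase, relevant, total):
        # Feed back how many of the tweets a phrase retrieved were relevant.
        a, b = self.stats[phrase]
        self.stats[phrase] = [a + relevant, b + (total - relevant)]

selector = PhraseSelector(["GOP Nominee", "Republican Nominee", "debate"], budget=2)
tracked = selector.select()
# After collecting tweets for one iteration, feed back relevance counts:
selector.update(tracked[0], relevant=40, total=50)
```

Over many iterations, the posterior of a consistently productive phrase concentrates near its true relevance rate, so it keeps being tracked, while weak phrases are sampled out.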

In addition, efficient processing of top-k mentioned-entities queries posed on a stream of tweets has become a key part of a broad class of real-time applications, ranging from content search to marketing. Given that words are often ambiguous, entity linking becomes an important step towards answering such queries. Furthermore, the continuous and fast generation of tweets makes it crucial for such applications to process those queries at an equally fast pace. To address these requirements, in this thesis we propose TkET (pronounced “ticket”), an analysis-aware entity linking framework for efficiently answering top-k entities queries over the Twitter stream. A comprehensive empirical evaluation on real and synthetic datasets demonstrates the significant efficiency advantage of the proposed solution over traditional techniques for the given problem settings.
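To make the query itself concrete, a top-k mentioned-entities query over a sliding window reduces to windowed counting once every mention has been linked. The sketch below shows only that counting step and assumes an upstream linker has already mapped each tweet to entity IDs (the hard part, deciding which ambiguous mentions are worth disambiguating, is exactly what TkET addresses in Chapter 4); the entity names are illustrative:

```python
from collections import Counter, deque

class SlidingTopK:
    """Maintain the k most-mentioned entities over the last `window` tweets."""

    def __init__(self, window, k):
        self.window, self.k = window, k
        self.buffer = deque()      # per-tweet entity lists, oldest first
        self.counts = Counter()

    def add(self, entities):
        self.buffer.append(entities)
        self.counts.update(entities)
        if len(self.buffer) > self.window:   # slide: expire the oldest tweet
            for e in self.buffer.popleft():
                self.counts[e] -= 1

    def top_k(self):
        return [e for e, _ in self.counts.most_common(self.k)]

tk = SlidingTopK(window=3, k=2)
for tweet_entities in [["Catfish"], ["Obama"], ["Obama"], ["Trump"], ["Trump"]]:
    tk.add(tweet_entities)
print(tk.top_k())  # → ['Trump', 'Obama'] (the "Catfish" tweet has expired)
```

The counting is cheap; the cost in practice is producing the `entities` lists, which is why an analysis-aware system disambiguates only the mentions that can change this top-k answer.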

The rest of this thesis is organized as follows. Chapter 2 covers the preliminaries and related work. Our Tweet Acquisition System (TAS) is described in Chapter 3. Subsequently, Chapter 4 presents our TkET (pronounced “ticket”) framework. Chapter 5 presents SoDAS, our end-to-end social data analytics system. Finally, we conclude this thesis and discuss future extensions in Chapter 6.

Chapter 2

Preliminaries and Related Work

In this chapter, after defining data quality, we first explain the standard definition of some common steps in data processing pipelines, the special characteristics of social data, and what makes social data processing different from other domains. Figure 2.1 shows the common steps in data processing pipelines. Data acquisition and data cleaning (specifically, named entity identification and linking) are two very common steps in any data processing pipeline, including social data processing, and have been addressed in many different domains. The concepts we study in this thesis are solutions for the Data Acquisition (Chapter 3) and Data Cleaning (Chapter 4) steps in a real-time social data processing pipeline, where we face a big, fast stream of noisy social data. Then, we define the problem of improving social data quality both for data acquisition and for top-k entities query processing on top of a given stream of tweets. Finally, we cover the basic related work in improving social data quality. (We cover more related work in Chapters 3 and 4.)

Figure 2.1: Common Steps in Data Processing Pipelines (Data Acquisition, Data Integration, Data Cleaning, Data Analysis, Visualizing Results)

2.1 Data Quality

Data quality refers to the condition of a set of values of qualitative or quantitative variables. There are many definitions of data quality, but data is generally considered high quality if it is “fit for [its] intended uses in operations, decision making and planning” [74]. Alternatively, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency (e.g., semantics or relevance to the topic) within data becomes significant, regardless of fitness for use for any particular external purpose. People’s views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose.

2.1.1 Social Data Quality

In this thesis, we focus on social media, and particularly on Twitter, as the main source of data. We refer to data quality as the effect of the cleanness of the data to be used on the quality of the final analysis result. We observed in Twitter-based, topic-centered applications that collecting as many tweets relevant to the client’s topic of interest as possible has a significant effect on improving the quality and consistency of the final analysis result. Also, given the high speed of social media data streams, dealing with less irrelevant data helps the analysis process better keep up with the speed of the underlying data stream. Furthermore, data enrichments, including named entity identification and linking, often help in targeting the right set of tweets for the purpose of the final analysis. One common analysis task in recent Twitter-based applications is answering top-k entity queries on top of the live stream of tweets; starting with high-quality identified entity mentions in tweets results in higher quality of the top-k results. We approach the concept of data quality for the purpose of answering top-k entity queries by introducing a novel probabilistic approach that improves the underlying data quality and makes the final top-k result more accurate.

2.2 Data Acquisition

Problem

Every data-centered application needs to acquire data to work on. Hence, every data processing pipeline usually starts with data acquisition. Data acquisition ranges from simply reading records from a relational database all the way to querying many different APIs and data sources in order to collect the required data for further analysis. In many current data-centered applications, the final analysis to be performed on the data is known to the system a priori. Given that, data acquisition can be done in a much smarter way so as not to acquire unnecessary data. This is very beneficial because it removes the need for further filtering of the acquired data. Moreover, since resources are limited, the data provider may limit data access, and the user might not have enough resources (such as storage or processing units) to push as much data as they wish into the processing pipeline.

6 Related Work

Data acquisition has been extensively studied in the sensor networks domain. Early approaches use in-network aggregation to reduce the transmitted data, with later approaches addressing missing values, outliers, and intermittent connections [92, 38, 30]. Information Driven Sensor Querying by Chu et al. [20] uses probabilistic models to estimate target position in a tracking application. Model-driven data acquisition has also been extensively studied; in [32], the authors propose a model-driven data acquisition scheme that allows applications to obtain the desired data by computing models instead of receiving sensed values in wireless sensor network deployments.

2.2.1 Social Data Acquisition

Problem

In the social data context, the main types of accessible data are text and images, alongside a lot of useful metadata. We believe that in order to analyze social media data at the speed of the stream, one needs to filter the social data stream based on one’s specific topical needs. For example, if the client is interested in acquiring data for the purpose of analyzing public opinion about a brand, there is no need to acquire/store, and later analyze, social data about personal events or social content about other topics.

Social data has many specific characteristics that make proper topical data acquisition difficult. First, the high speed of social data (e.g., Twitter sees more than 500 million tweets per day [44]) requires very fast filtering algorithms in order to filter the stream based on the given topic on the fly. Also, people on social media do not follow traditional language conventions, and they use unusual/fuzzy ways to refer to well-known concepts or entities. For example, Twitter users refer to the President of the United States of America as “POTUS”, but a client who is interested in tweets related to the United States might not be able to come up with that as a relevant pattern. A social data acquisition system should therefore be adaptive and able to learn new concepts and the ways people refer to concepts relevant to the interest of the client. Finally, the social media APIs that we can use to acquire data, as well as the infrastructure the client might use to collect/store/analyze such data, are limited [70].
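As a baseline for the speed requirement, even the simplest topical filter must make a single fast pass over each tweet. A toy phrase matcher (the phrase list is purely illustrative, and a real system such as TAS replaces this with a learned, multi-criteria relevance model) might look like:

```python
import re

def make_filter(phrases):
    """Compile a single case-insensitive matcher for the current phrase set.

    One compiled alternation means one regex scan per tweet, which is what
    makes keeping up with the stream feasible. The phrase set would be
    updated as new ways of referring to the topic (e.g. "POTUS") emerge.
    """
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    return lambda tweet: bool(pattern.search(tweet))

is_relevant = make_filter(["United States", "POTUS", "White House"])
print(is_relevant("POTUS just landed in Ohio"))  # → True
print(is_relevant("my cat is adorable"))         # → False
```

The limitation the text points out is visible here: a static phrase list would never contain “POTUS” unless the system learns it from the stream.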

Related Work

Social media monitoring has been an important task since the emergence of applications based on social media data. While most Twitter monitoring applications [78, 14, 88] acquire tweets based on a static, manually selected set of keywords, some adaptive approaches have been proposed. Boanjak et al. [13] propose a distributed focused crawler for Twitter, which heuristically selects the Twitter users most strongly connected to known existing users and crawls topic-related tweets from them. Chang et al. [57] propose an automatic topic-focused monitor for the Twitter stream; they assume a classifier for the topic of interest is given to the system.

Research on extracting high-quality information from social media [86, 58] and on summarizing or otherwise presenting Twitter event content [33, 63, 80] has gathered recent attention. Agichtein et al. [86] examine properties of text and authors to find quality content in Yahoo! Answers, an effort related to ours but over fundamentally different data. In event content presentation, Diakopoulos et al. [33] and Shamma et al. [80] analyzed Twitter messages corresponding to large-scale media events to improve event reasoning, visualization, and analytics. Becker et al. [10] present centrality-based approaches to extract high-quality, relevant, and useful Twitter messages from a given set of messages related to an event.

Our Approach

In Chapter 3 of this thesis, we address the tweet acquisition challenge: enhancing the monitoring of tweets based on the client’s/application’s needs in an online adaptive manner such that the quality and quantity of the results improve over time. We propose a Tweet Acquisition System (TAS) [77] that iteratively selects phrases to track based on an explore-exploit strategy and dynamically adapts the representation of the topic of interest. Our experimental studies show that TAS significantly improves the recall of relevant tweets, and the performance further improves when the topics are more specific.

2.3 Data Cleaning

Problem

Data cleaning is a vital part of any data analysis pipeline. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality [68]. Data cleaning is an integral part of the data warehousing and master data management technologies that have been commoditized in a variety of products such as IBM InfoSphere, SAS DataFlux, and SAP Hana [73], to name a few. It is often estimated that cleaning and preparation of data accounts for up to 80% of the time enterprises spend on data analytics, with the remaining time spent on interactive analysis and mining of the warehouse.

Related Work

There is a large body of research on data cleaning in general, and in particular on the different kinds of data cleaning and what data cleaning means in different domains. In [89], the authors present an overview of the history of data cleaning and list related challenges and different definitions. Also, [67] gives a holistic view of the definition of data cleaning and the reasons behind the need for it. One of the important tasks in the data cleaning process is often Named Entity Identification and Linking [61]. For example, given the sentence “Paris is the capital of France”, the idea is to determine that “Paris” refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as “Paris”. Named Entity Linking differs from Named Entity Identification in that the latter identifies the occurrence or mention of a named entity in text but does not determine which specific entity it is. By comparison, entity linking is a more recent task. The earliest work on the task, by Bunescu and Pasca [17] and Cucerzan [29], aims to link entity mentions to their corresponding topic pages in Wikipedia.
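The identification/linking distinction can be made concrete with a toy sketch around the “Paris” example. The knowledge-base entries, prior probabilities, and scoring rule below are all illustrative inventions, not any published system’s:

```python
# A toy knowledge base mapping a surface form to candidate entities with
# prior mention probabilities. All entries and numbers are illustrative.
KB = {
    "paris": [("Paris,_France", 0.82), ("Paris_Hilton", 0.13), ("Paris,_Texas", 0.05)],
}

def identify(text):
    """Named entity *identification*: find mention spans.

    Toy rule: treat capitalized words as mentions."""
    return [w.strip(".,") for w in text.split() if w.istitle()]

def link(mention, context):
    """Named entity *linking*: pick the candidate best supported by context.

    Minimal sketch: start from the prior and boost candidates whose
    entity name shares a non-trivial token with the surrounding text."""
    candidates = KB.get(mention.lower(), [])
    if not candidates:
        return None
    tokens = {t.lower().strip(".,") for t in context.split() if len(t) > 3}
    def score(cand):
        entity, prior = cand
        return prior + (0.1 if any(t in entity.lower() for t in tokens) else 0.0)
    return max(candidates, key=score)[0]

print(identify("Paris is the capital of France."))  # → ['Paris', 'France']
print(link("Paris", "is the capital of France"))    # → Paris,_France
```

Identification only finds the spans; linking then resolves each span against the candidate set, which is where the ambiguity (and most of the cost) lives.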

2.3.1 Social Data Cleaning

Problem

Data quality is an important challenge in working with social data, due to the fast streaming nature of the data as well as the fact that most user-provided data is not necessarily in a standard format usable by higher-level data analyzers. Improving tweet quality usually starts with collecting the relevant tweets, or filtering the Twitter stream. One important aspect of social data quality is the identification and linking of mentions of real-life entities in tweet texts. With respect to linking tweets to the concepts they refer to, an important challenge is the low quality of the ambiguous text of Twitter posts, which strongly affects linking performance. Named entity recognition methods typically have 85-90% accuracy on longer document texts, but only 30-50% on tweets [59].

Related Work

There is a large body of research on social data enrichment and tweet annotation. In [76], the authors present an experimental survey of data enrichment techniques for tweets. Information extraction over microblogs has only recently become an active research topic [18, 36, 9], following early experiments in [15]. Knowledge-base-based techniques are the main approaches to candidate entity generation and are leveraged by many entity linking systems [43, 72, 37, 16, 51, 39, 81, 82]. Subsequent research has focused on microblog-specific information extraction algorithms (e.g., named entity recognition for Twitter using CRFs [75] or hybrid methods [90]).

2.4 Analysis-Aware Approach

In the era of online big data streams, in addition to large local repositories and data warehouses, today’s enterprises have access to a very large number of diverse data sources, including web data repositories, continuously generated sensory data, social media posts, clickstream data from web portals, audio/video data captures, and so on. As a result, there is an increasing demand for executing up-to-the-minute analysis tasks on top of these dynamic and/or heterogeneous data sources by modern applications. Such new requirements have created challenging new problems for traditional entity resolution, and data cleaning techniques in general. With the increasing importance of interactive (and near real-time) analytics, recent research has begun to consider how analysis-aware approaches can be incorporated into data cleaning, and more specifically into entity resolution [4, 6, 7, 5].

Information about the final analysis can be helpful in reducing the work in almost all the usual steps of the data processing pipeline. In this thesis, we focus on data acquisition and cleaning. In the case of data acquisition, information about the final analysis enables a more focused acquisition that fulfills the requirements of the client. This potentially results in fewer irrelevant, unnecessary data items being pushed into the analysis pipeline, and the reduced amount of data to be processed makes the whole pipeline more efficient. In the case of data cleaning, knowledge of the final analysis results in more focused cleaning of the portions of data that are actually going to be used by that analysis.

In the case of social data cleaning, efficient processing of top-k mentioned-entities queries posed on a stream of tweets, a common analysis task, has become a key part of a broad class of real-time applications, ranging from content search to marketing. Given that words are often ambiguous, entity linking becomes an important step towards answering such queries. The continuous and fast generation of tweets makes it crucial for such applications to process those queries at an equally fast pace. To address these requirements, in Chapter 4 we propose TkET (pronounced “ticket”) as an analysis-aware entity linking framework for efficiently answering top-k entities queries over the Twitter stream. TkET follows a probabilistic approach to focus the disambiguation process on the potential entity mentions that could change the result of the target top-k entities query.

Chapter 3

Social Data Acquisition

The past decade has witnessed the emergence of a wide variety of technologies/applications that tap into the live Twitter feed to gain awareness of events, opinions, and sentiments in the community. Such innovations have the potential to bring new functionality and improvements in a wide variety of domains, from Emergency Response, Business Intelligence, and Political Analysis to Social Sciences. For instance, Sakaki et al. [78] constructed an earthquake reporting system to track earthquakes instantly through Twitter as an emergency management application. Other examples include predicting the stock market based on the Twitter mood [14] and predicting elections [88].

A vital task, common across social media applications, is that of acquiring social media content relevant to the topic of interest to the application. Such data may then be subjected to diverse types of application-specific analysis. Typically, applications require the acquired data to be relevant as well as to provide good coverage of the topic of interest on the social media (i.e., support high recall). Such a data acquisition task over social media has proved to be difficult.

Figure 3.1: TAS Architecture

3.1 Introduction

Twitter, in addition to supporting access to sample data, supports interfaces to query the stream of tweets using a limited number of textual patterns. Current systems [78, 91] use a fixed manual strategy for data acquisition in which the client/analyst selects a set of phrases to query Twitter that they believe would retrieve tweets relevant to the topic of interest. For instance, a user may choose the pattern "Democrat Nominee" to retrieve tweets relevant to "US Presidential Debate". Further, the set of textual patterns can be generated using external knowledge bases and/or based on other topical content (e.g., news articles).

A fixed strategy has several drawbacks. First, it is difficult to generate and choose all the relevant phrases that, when used as a query, would simultaneously provide good coverage and maintain high relevance amongst the retrieved tweets. While a set of potential phrases could be generated using ontologies and/or knowledge bases, one further needs a mechanism to select a subset of them to query Twitter.

For instance, given "US Presidential Debate" as the topic of interest, phrases such as "GOP Nominee" and "Republican Nominee" could be generated as potentially important phrases. However, it is not obvious which of the two would provide better coverage if one must be chosen under the limits imposed by the Twitter interface. Another issue complicating the acquisition challenge is that social media such as Twitter are dynamic; hence, different phrases may be more beneficial at different points in time, and as a result any phrase selection must be adaptive.

Furthermore, given that tweets are contributed by the public with no limits on vocabulary and ways of expressing concepts, new phrases (that may provide very good coverage and relevance) may dynamically emerge that no knowledge base could realistically generate a priori. For instance, a phrase "Cruz oops"¹ might emerge as a trending phrase which, if used as a query, may retrieve many tweets relevant to "US Presidential Debate" as the topic of interest. Efficacy of acquisition requires that such emergent phrases be added dynamically (and likewise phased out when no longer beneficial).

In order to overcome the disadvantages of the fixed approach, prior research has explored adaptive strategies for Twitter data acquisition [77, 13, 57, 46]. Boanjak et al. [13] propose a focused crawler, which crawls topic-related tweets from heuristically selected users (i.e., the user who has the most connections to the existing ones). Joseph et al. [46] propose an approach based on natural language aspects (e.g., alternative spellings) to uncover relevant keywords in tracking tweets contributing to situational awareness. In [71, 60, 69, 87] the authors propose approaches to detect events in order to collect more tweets. Chang et al. [57] propose an iterative process to solve the acquisition problem. In each iteration, the approach first takes a sample of the Twitter stream to extract patterns relevant to the topic of interest. It then chooses from the relevant patterns to probe Twitter and collect more relevant tweets.

¹Ted Cruz, a GOP 2016 presidential nominee, had an "oops" moment during a GOP presidential debate.

While the experiments in [57] show that the adaptive approach performs better compared to fixed approaches, it still has several disadvantages. First, as we illustrate in our experiments in Section 3.7, while the approach of [57] works well for popular topics, if the topic of interest is not of broad interest (e.g., a news article in a local newspaper or an event in a small town), the initial random sample may not contain enough relevant tweets for the approach to learn the set of phrases important for retrieving the relevant tweets, resulting in poor performance.

Second, it is limited to situations where the acquisition system can be provided with a classifier that, given a tweet, can determine whether the tweet is relevant to the topic or not. While such classifiers can be built for certain topics and events (e.g., a classifier for traffic situations or emergencies), it is not obvious how one can build and provide such a classifier in an online setting where the user may dynamically specify his/her information need.

For instance, a user may wish to monitor tweets relevant to a news story by pointing to the HTML page that lists out the story. It seems infeasible to train a sophisticated classifier that can accurately classify tweets as relevant or irrelevant in such a dynamic setting. Third, since the approach of [57] does not support a feedback procedure, there is no way to detect that a phrase does not result in capturing relevant tweets and avoid it in subsequent iterations. Finally, since the approach processes a large number of sample tweets, its computational overhead is high.

In this chapter, we design an adaptive Tweet Acquisition System (TAS), where the system starts with an initial set of phrases/terms relevant to the topic of interest. Such phrases could either be specified by the user and/or learnt through available ontologies, knowledge bases, or topic-focused content from other sources (e.g., news stories). At any instance, TAS maintains, for each phrase, an estimate of its relative degree of effectiveness in retrieving tweets relevant to the topic of interest. TAS uses this model to dynamically ascertain the relevance of the incoming tweets to the topic of interest. The relevance measure also incorporates features about the tweeter (i.e., the person who posted the tweet), as well as the statistics of pattern co-occurrences.

TAS follows an iterative process with the objective of maximizing recall by automatically changing the query and updating the set of phrases representing the topic of interest. The basic intuition is that the patterns currently trending in relevant tweets can help retrieve more relevant tweets in the near future, and furthermore, repetitive co-occurrence of terms can help identify additional patterns that may empower the acquisition system. TAS uses a reinforcement learning algorithm to adaptively select a set of phrases that strikes a balance between exploring the importance of different phrases and exploiting the phrases it has determined to be of high retrieval quality.

In contrast to Chang et al. [57], TAS does not depend upon an initial random sample to learn the representation of the topic of interest. It provides good performance (as we show experimentally in Section 3.7) even when the topic of interest does not have broad coverage. Also, even though TAS relies on the initial representation of the topic of interest provided by the client/application, it dynamically updates the representation by leveraging the relevant patterns in previously acquired tweets, making it robust to an incomplete and/or suboptimal specification of the initial query.

Figure 3.2: Phrase Weight

This chapter is based on our preliminary work in [77], where we developed a reinforcement learning based technique to select, from a set of potential phrases relevant to the topic of interest, a subset of phrases to form a query. The technique balances selecting phrases in order to update estimates of their effectiveness against selecting phrases already known to be effective in maximizing recall. This helps prevent the system from falling into local traps. Furthermore, in this chapter we build on [77] to develop an end-to-end Tweet Acquisition System. In particular, TAS makes the following additional contributions:

3.2 Motivating Example

A social media analyst interested in social media content about data science and data management in general would like to collect tweets in real time, from the stream of tweets provided by Twitter, related to his topic of interest, "Data Science". In order to use TAS, the analyst comes up with a set of phrases defining his topic of interest; for example, "Data Science", "Databases", and "Data Management". It is very hard for the analyst, even if he is an expert, to come up with all the possible phrases defining his interest. Also, phrases such as "VLDB" or "deep learning" may suddenly become very relevant, and the effectiveness of different phrases in collecting relevant tweets changes over time.

Table 3.1: Example Phrases of an Interest

Terms (k)               Weight (w)   Location (l)
"presidential debate"   0.8          —
"republicans debate"    0.7          —
"democrats debate"      0.6          —
"clinton drought"       0.9          33.71,-117.89,33.57,-117.78

Figure 3.3: TAS Iterations

3.3 Notation and Problem Definition

The overall architecture of TAS is shown in Figure 3.1. TAS is designed as a multi-client system where a set of clients may concurrently specify their topics of interest. While concurrency adds another layer of complexity, we focus our discussion on the design of TAS components and related techniques in the context of a single client with a single topic of interest. Since in the multi-client/multi-interest setting the phrase budget imposed by Twitter is split among all clients and their interests, we limit the phrase budget in our experiments to a portion of the limit imposed by Twitter.

In TAS, the client defines the topic of interest I through the Interest Manager module by providing a cohesive set of phrases I.P = {p_1, p_2, ..., p_{|I.P|}} with a subtext of the topic of interest. The topic of interest may also be extracted from a knowledge base and/or external data sources (e.g., Wikipedia, news articles). A phrase p_i in TAS is a triple (k, w, l), where p_i.k is a set of terms, p_i.l is a nullable location bounding box, and p_i.w is a weight that gives a comparative measure of how distinctive p_i is for the corresponding topic of interest. Figure 3.2 illustrates the meaning of the phrase weight in TAS: the phrase weight is an approximation of the ratio of the number of tweets that match the phrase and are relevant to the client's topic of interest to the total number of tweets matching the phrase.

Table 3.1 illustrates an example of the phrases a client with an interest in the "US Presidential Debate" might specify. In this example, the client defined his interest using four initial phrases. The last phrase, for instance, states that a tweet posted from California having both keywords "clinton" and "drought" in its text has a 90% chance of being relevant to the client's topic of interest. Note that these initial estimates of the weights need not be accurate; they are used to seed the system, which will learn these statistics adaptively based on the collected tweets.

The system starts collecting tweets iteratively and updates its statistics based on the collected tweets and their relevance to the topic of interest. Each collected tweet includes some metadata about the tweet itself and the tweeter who posted it. Based on the output of the Twitter API, we define a tweet t as a tuple (k, u, τ, l), where t.k is the textual vector of t, t.u is the embedded information of the Twitter user who posted the tweet, t.τ is the timestamp of t, and t.l is the nullable location information indicating where the tweet was posted from.

Match: A tweet t matches a phrase p_i if all the terms in p_i.k are present in the tweet's text t.k, regardless of order and ignoring case. For a tweet to match a phrase containing a location bounding box p_i.l, the tweet must have a geo-tag (t.l) located in the bounding box and contain all the phrase terms. A tweet t matches a set of phrases {p_1, ..., p_j} if t matches all of the phrases {p_1, ..., p_j}; it matches the interest I of the client if there is at least one phrase p_i ∈ I.P that t matches. Note that the concept of a tweet matching an interest as used above is different from a tweet being relevant to an interest, discussed next.
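The matching rules above can be sketched in code. The following is a minimal illustration, not part of TAS itself: the names, the whitespace tokenizer, and the bounding-box corner order are all assumptions made for this sketch (Twitter's own track matching is more nuanced).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed order: (lat1, lon1, lat2, lon2)

@dataclass
class Phrase:
    terms: frozenset                 # p.k: every term must appear in the tweet
    weight: float                    # p.w: comparative distinctiveness weight
    bbox: Optional[BBox] = None      # p.l: nullable location bounding box

def matches(tweet_text: str, tweet_geo: Optional[Tuple[float, float]], p: Phrase) -> bool:
    """A tweet matches p iff every term in p.k occurs in its text
    (case-insensitive, order-free) and, if p.l is set, the tweet's
    geo-tag lies inside the bounding box."""
    words = set(tweet_text.lower().split())
    if not all(t.lower() in words for t in p.terms):
        return False
    if p.bbox is not None:
        if tweet_geo is None:        # no geo-tag: cannot match a located phrase
            return False
        lat, lon = tweet_geo
        lat1, lon1, lat2, lon2 = p.bbox
        if not (min(lat1, lat2) <= lat <= max(lat1, lat2) and
                min(lon1, lon2) <= lon <= max(lon1, lon2)):
            return False
    return True

def matches_interest(tweet_text, tweet_geo, phrases) -> bool:
    # a tweet matches the interest I if at least one phrase in I.P matches
    return any(matches(tweet_text, tweet_geo, p) for p in phrases)
```

Using the last row of Table 3.1, a tweet containing "clinton" and "drought" with a geo-tag inside the stated box matches; the same text without a geo-tag does not.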

Relevance: Not every tweet matching the topic of interest is relevant to it. TAS uses a dynamic relevance model, based on the representation of the topic of interest, phrase-level correlated patterns, and the tweeter's history of posting relevant tweets, to determine the relevance of a tweet to the user interest. Relevance checking (line 9) is critical for determining the statistics that drive the selection of phrases to query Twitter, and also for filtering the tweets presented to the client. We discuss the relevance model in detail in Section 3.5.

TAS Iterations: TAS follows an iterative process with two loops nested inside each other. Each inner iteration δ_i lasts as long as it takes TAS to capture |δ| relevant tweets. TAS selects a set of phrases (from the set of possible phrases associated with the topic of interest) to query via the Twitter API at the beginning of each δ_i. An outer iteration ∆_j consists of |∆| inner iterations. The set of tweets captured in ∆_j is used by TAS to dynamically adapt the representation of the topic of interest before starting ∆_{j+1}. Due to Twitter rate limitations, there is a minimum duration of an inner iteration given to the system. Figure 3.3 illustrates the iterative process of TAS in an example where |∆| = 3.

ALGORITHM 1: High-Level Flow of TAS
input : an interest I, the budget Bp
output: a set of collected tweets
 1  if |I.P| ≤ Bp then
 2      return;
 3  j = 0  // index of the outer iteration
 4  while Acquisition.IsRunning do
 5      for i = 1 to |∆| do
 6          δ_i ← Acquisition.StartListening(q);
 7          for t ∈ δ_i do
 8              Store(δ_i);
 9          ∆_j ← ∆_j + δ_i;
10      j = j + 1

The overall flow of TAS is shown in Algorithm 1. The system initiates by generating a query (line 6) and collecting tweets through the filter endpoint of Twitter's streaming API. TAS then receives tweets matching the query (line 7). The filter endpoint allows a limited number of track phrases and location bounding boxes for monitoring the live stream of tweets [47].

Query Generation: At the beginning of each inner iteration δ_i, the Query Generator module generates a query q_j by selecting a set of phrases q_j.P ⊆ I.P such that |q_j.P| ≤ B_p, where B_p is the phrase budget. Intuitively, the goal of the Query Generator is to choose the set of phrases from I.P that maximizes the number of collected tweets relevant to the topic of interest. We formally state the query generation problem and our approach to solving it in Section 3.4.

Topic Maintenance: At the end of each outer iteration, TAS dynamically modifies the representation of the topic of interest to adapt it to the tweets captured during the previous iterations. Topic maintenance may dynamically add new phrases with new patterns after logging repeated co-occurrences of patterns in relevant tweets over the coarser (outer) iteration of the system, and/or drop/modify existing phrases in I.P (line 13). Such changes then impact the tweets captured during the subsequent inner iterations. As an example, if the client is interested in the "US Presidential Debate" topic, the repeated emergence of "Cruz oops" can signal that it is a new pattern covering the client's topic of interest. We discuss topic maintenance in further detail in Section 3.6.

Updating Statistics: In order to choose the set of phrases during query generation and/or topic maintenance, TAS maintains statistics about the tweets it captures during its iterations. In particular, for each phrase p_i ∈ I.P, two pieces of information are maintained:

• Tweet Rate: f(p_i) represents the rate of tweets matching p_i, in tweets/sec.

• Probability of Relevance: p(R|p_i) represents the probability that a tweet matching p_i is relevant to the topic of interest.

In addition to the above, TAS maintains similar statistics for sets of phrases {p_1, ..., p_j}. That is, f(p_1, ..., p_j) represents the rate of tweets matching the set of phrases {p_1, ..., p_j} in tweets/sec. Likewise, p(R|p_1, ..., p_j) represents the probability that a tweet matching {p_1, ..., p_j} is relevant to the topic of interest. TAS statistics are updated after every iteration (line 11). We discuss how such statistics are determined, both initially and as they are subsequently updated during each iteration, in Section 3.4.3.

3.4 Query Generation

At the beginning of each inner iteration, the representation of the topic of interest I contains a set of phrases I.P. There may also be statistics about past tweets of some tweeters available in the system. Since Twitter allows only a very limited number of phrases in a query via the Twitter API, TAS selects a subset from the set of potential phrases. We refer to this selection process as Query Generation.

3.4.1 Probabilistic Query Coverage

The goal of the query generation is to select phrases to maximize the number of relevant tweets collected. We formalize the goal by defining the concept of query coverage.

Query Coverage: Assume that we are given a phrase set I.P = {p_1, p_2, ..., p_{|I.P|}}, and let q_i be a query such that q_i.P ⊆ I.P. The coverage of a query q_i, denoted Cov(q_i.P, I.P) and defined in Eq. 3.1, reflects the proportion of the target tweets covered by query q_i.

\[
Cov(q_i.P, I.P) = \frac{\left|\bigcup_{p_j \in q_i.P} R(T(p_j), I)\right|}{\left|\bigcup_{p_k \in I.P} R(T(p_k), I)\right|} \tag{3.1}
\]

In Eq. 3.1, T(p_j) represents the set of tweets matching phrase p_j over the next iteration, and R(T(p_j), I) represents the subset of tweets in T(p_j) relevant to the topic of interest I. Intuitively, the numerator of Eq. 3.1 reflects the number of relevant tweets covered by q_i, while the denominator reflects the total number of relevant tweets covered by I.P.

Since we do not know R(T(p_j), I) for all p_j ∈ I.P a priori, we cannot directly compute the coverage of all possible query selections and choose the one that maximizes coverage. We therefore define a probabilistic version of the query coverage, which we can estimate using the phrase-level statistics gathered by the system so far.

Probabilistic Query Coverage: We estimate the coverage of a query Q using the inclusion-exclusion principle, given estimates of the effectiveness of the phrases in I.P:

\[
\hat{Cov}(Q, I.P) =
\frac{\displaystyle\sum_{k=1}^{|Q|} (-1)^{k-1}
      \sum_{\substack{S \subseteq Q,\ |S| = k \\ S = \{p_{i_1}, \ldots, p_{i_k}\}}}
      f(p_{i_1}, \ldots, p_{i_k})\, p(R \mid p_{i_1}, \ldots, p_{i_k})}
     {\displaystyle\sum_{l=1}^{|I.P|} (-1)^{l-1}
      \sum_{\substack{S' \subseteq I.P,\ |S'| = l \\ S' = \{p_{j_1}, \ldots, p_{j_l}\}}}
      f(p_{j_1}, \ldots, p_{j_l})\, p(R \mid p_{j_1}, \ldots, p_{j_l})} \tag{3.2}
\]

where the denominator in Eq. 3.2 is a constant across all possible queries at each instance of query generation. Without loss of generality, we drop the constant denominator from Eq. 3.2 and refer to \hat{Cov}(Q, I.P) as Cov(Q, I.P) in the remainder of this chapter.
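To make the estimator concrete, here is a small sketch (the function name and the dictionary-based layout of the statistics are our assumptions, not the dissertation's) of the unnormalized coverage: it iterates over all non-empty subsets S ⊆ Q and applies inclusion-exclusion over the maintained statistics f(S) and p(R|S). Phrase combinations absent from the statistics are treated as never co-occurring (rate 0).

```python
from itertools import combinations

def estimated_coverage(Q, stats):
    """Unnormalized probabilistic coverage of query Q (numerator of Eq. 3.2).

    `stats` maps a frozenset of phrases to a (tweet rate, probability of
    relevance) pair; missing combinations default to rate 0.
    """
    total = 0.0
    for k in range(1, len(Q) + 1):
        for S in combinations(sorted(Q), k):          # all subsets of size k
            f, p_rel = stats.get(frozenset(S), (0.0, 0.0))
            total += (-1) ** (k - 1) * f * p_rel      # inclusion-exclusion sign
    return total
```

With only singleton and pairwise statistics maintained (as TAS does, under the assumption that no tweet matches more than two phrases), every term for |S| > 2 simply contributes 0.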

3.4.2 Query Generation

Having defined the concept of query coverage, we can now define the goal of query generation more formally.

Definition 1. (Query Generation). Let I.P = {p_1, p_2, ..., p_{|I.P|}} be the set of phrases for the topic of interest I. Given a positive number B_p ∈ N⁺, where B_p ≤ |I.P|, the goal of query generation is to find a set Q ⊆ I.P of size |Q| ≤ B_p such that Cov(Q, I.P) is maximized.

In other words, the goal is to find the optimal query Q* such that:

\[
Q^* = \arg\max_{Q \subseteq I.P,\ |Q| \le B_p} Cov(Q, I.P) \tag{3.3}
\]

The query generation problem leads to the following three challenges:

(a) How should we estimate the values of the rate and probability of relevance for a (set of) phrases? Initially, when TAS starts, such a probability/rate can be determined a priori based on the weights provided by the user (or determined automatically through extraction code), and/or by exploiting a knowledge base to generate a model of the importance of different phrases. However, as the iterations proceed, we obtain more statistics about the phrases that have been selected so far. We can, therefore, update our initial belief to come up with a more accurate assessment of these parameters.

(b) A query selection mechanism can easily fall into a local trap. The probabilities and rates we determine are only estimates, and their accuracy depends upon the size of the samples we have seen. If a phrase p_i appears better than p_j and we always select it, the system may improve the estimate for p_i, but it will never update the probability and rate for p_j. This is a further problem since Twitter is dynamic, and a phrase's value may change as a function of time. We therefore need an adaptive mechanism that allows us to learn the statistics of all phrases, not just those that currently seem best, so that we do not lose an important phrase.

(c) Query coverage, just like most maximum coverage problems studied in the literature [83, 2], is NP-hard. We need efficient ways to solve the query coverage problem.

We address challenges (a) and (b) in the next two subsections. For (a), we use incremental updates based on Bayesian analysis, and for (b), we use a reinforcement learning approach, mapping our problem to a Multi-armed Bandit (MAB) problem. We then present a greedy algorithm to solve the query coverage problem in Section 3.4.6.

3.4.3 Statistics Maintenance

In order to choose the set of phrases during query generation and to modify the representation of the topic of interest during topic maintenance, TAS maintains statistics based on the tweets it captures during its iterations. In particular, for each phrase p_i ∈ I.P, it maintains f(p_i), the rate of tweets matching p_i in tweets/sec, and p(R|p_i), the probability that a tweet matching p_i is relevant to the topic of interest.

In addition to the above, TAS maintains similar statistics f(p_i, p_j) and p(R|p_i, p_j) for pairs of phrases {p_i, p_j}.² We assume that no tweet contains more than two phrases from I.P, that is, f(p_1, ..., p_n) = 0 for n > 2. As a result, we do not need to maintain p(R|p_1, ..., p_n) for n > 2.

Initialization: When TAS starts, it initializes the probability of relevance of each single phrase with its initial weight and sets all single-phrase tweet rates to 1 (∀p_i ∈ I.P: p(R|p_i) = p_i.w, f(p_i) = 1). For pairs of phrases, ∀p_i, p_j ∈ I.P with p_i ≠ p_j, TAS initializes its statistics as p(R|p_i, p_j) = max(p_i.w, p_j.w) and f(p_i, p_j) = 1.

Updating: We adopt a Bayesian approach to update the statistics in TAS as more information becomes available. In our setting, we iteratively obtain new evidence about the effectiveness of each phrase in retrieving tweets relevant to the topic of interest and about the rate of matching tweets. During each inner iteration δ_i, suppose TAS collects n_{ij} tweets matching phrase p_j ∈ I.P, of which Y_{ij} ≤ n_{ij} are relevant to the topic of interest I. Let p_{ij} and f_{ij} denote the underlying probability of relevance and the rate of tweets matching p_j, respectively. Then we have the binomial model Y_{ij} | (n_{ij}, p_{ij}) ∼ Bin(n_{ij}, p_{ij}) and the Poisson model n_{ij} | f_{ij} ∼ Poisson(δ_i.τ f_{ij}), where δ_i.τ represents the duration of δ_i in seconds.

Given the computational constraint of online updating, we perform the following approximation to derive closed-form formulas for estimating p_{ij} and f_{ij}:

\[
\hat{p}_{ij} = \hat{\tau}^2_{ij} \left( \frac{\hat{p}_{i-1,j}}{\tau^2 + \hat{\tau}^2_{i-1,j}} + \frac{Y_{ij}/n_{ij}}{\sigma^2_{ij}} \right), \qquad
\hat{\tau}^2_{ij} = \frac{1}{1/(\tau^2 + \hat{\tau}^2_{i-1,j}) + 1/\sigma^2_{ij}}
\]

²Note that the assumption that no more than two phrases may occur in a tweet is based on (a) the limited length of a tweet (< 140 characters) and (b) phrases not being contained in one another. For instance, if the phrases are p_1 = "University California Irvine", p_2 = "Irvine", and p_3 = "UCI", then co-occurrence of p_1, p_3 would imply co-occurrence of p_1, p_2, p_3. We deal with containment among phrases separately by never selecting two phrases where one contains the other.

\[
\hat{f}_{ij} = \hat{\eta}^2_{ij} \left( \frac{\hat{f}_{i-1,j}}{\eta^2 + \hat{\eta}^2_{i-1,j}} + \frac{n_{ij}\, \delta_i.\tau}{\tilde{n}_{ij}} \right), \qquad
\hat{\eta}^2_{ij} = \frac{1}{1/(\eta^2 + \hat{\eta}^2_{i-1,j}) + (\delta_i.\tau)^2/\tilde{n}_{ij}}
\]

where τ² and η² are positive parameters that control how much variation is allowed between adjacent iterations. We assume τ² and η² are known to simplify the presentation, but our method extends readily to the situation where they are unknown and have to be estimated.

The posterior means \hat{p}_{ij} and \hat{f}_{ij} are the estimates we use, but we need to keep track of \hat{\tau}^2_{ij} and \hat{\eta}^2_{ij} to facilitate this computation. In our setting, \hat{p}_{nj} and \hat{f}_{nj} after δ_n give the most recent values for p(R|p_j) and f(p_j), respectively. We follow a similar procedure to estimate the statistics for pairs of phrases.

The natural estimator for p_{ij} based on Y_{ij} alone is Y_{ij}/n_{ij}, and for f_{ij} it is n_{ij}/δ_i.τ. If we impose a uniform prior on p_{ij}, then the conditional distribution of p_{ij} given Y_{ij} becomes Beta(Y_{ij} + 1, n_{ij} − Y_{ij} + 1), whose mean value (the posterior mean), \tilde{p}_{ij} ≡ (Y_{ij} + 1)/(n_{ij} + 2), can serve as an alternative point estimator for p_{ij}. Denote \tilde{n}_{ij} = n_{ij} + 1, an alternative estimator for δ_i.τ f_{ij}. The estimator \tilde{p}_{ij} has the advantage that it is always strictly between 0 and 1, never touching the boundaries. To capture the dependence of p_{ij} and f_{ij} across iterations δ_i, we build the following random-walk models for the two sequences separately:

\[
p_{nj} \mid (p_{1j}, p_{2j}, \ldots, p_{n-1,j}) \sim N(p_{n-1,j}, \tau^2)
\]
\[
f_{nj} \mid (f_{1j}, f_{2j}, \ldots, f_{n-1,j}) \sim N(f_{n-1,j}, \eta^2)
\]

for n = 2, 3, .... Note that we assume a normal distribution even though the p_{ij} are bounded quantities. This is not a serious issue as long as the p_{ij} are not too close to the boundaries. As long as n_{ij} is large and p_{ij} is not too close to zero or one, we can use the normal approximations to the Binomial and the Poisson to obtain the following:

\[
Y_{ij}/n_{ij} \mid p_{ij} \overset{\text{approx.}}{\sim} N(p_{ij}, \sigma^2_{ij})
\]
\[
n_{ij}/\delta_i.\tau \mid f_{ij} \overset{\text{approx.}}{\sim} N(f_{ij}, \tilde{n}_{ij}/(\delta_i.\tau)^2)
\]

where σ²_{ij} ≡ \tilde{p}_{ij}(1 − \tilde{p}_{ij})/n_{ij} is an estimate of the variance of Y_{ij}/n_{ij}. Combined with the random-walk model on p_{ij}, which specifies its distribution before observing Y_{ij}, this allows us to compute the posterior distribution of p_{ij} recursively. This is an example of Kalman filtering. Suppose, based on the approximation above, the posterior distribution of p_{i−1,j} has been computed as:

\[
p_{i-1,j} \mid (Y_{1j}, Y_{2j}, \ldots, Y_{i-1,j}) \sim N(\hat{p}_{i-1,j}, \hat{\tau}^2_{i-1,j}).
\]

Then one may compute the distribution of p_{ij} after observing Y_{i−1,j} but before Y_{ij} as:

\[
p_{ij} \mid (Y_{1j}, Y_{2j}, \ldots, Y_{i-1,j}) \sim N(\hat{p}_{i-1,j}, \tau^2 + \hat{\tau}^2_{i-1,j})
\]

which, when combined with the newly observed Y_{ij}, yields the posterior distribution

\[
p_{ij} \mid (Y_{1j}, Y_{2j}, \ldots, Y_{i-1,j}, Y_{ij}) \sim N(\hat{p}_{ij}, \hat{\tau}^2_{ij})
\]

Similarly, suppose

\[
f_{i-1,j} \mid (n_{1j}, n_{2j}, \ldots, n_{i-1,j}) \sim N(\hat{f}_{i-1,j}, \hat{\eta}^2_{i-1,j}).
\]

Then we obtain the closed-form formulas stated above.

3.4.4 Combinatorial MAB Framework

The statistics we maintain are only estimates. Since we do not know them a priori, and due to the dynamic nature of social media, we need an adaptive mechanism that allows TAS to learn the effectiveness of all the phrases over time (and not just of those that may have been selected initially based on erroneous statistics). TAS achieves adaptive acquisition by using an explore-exploit strategy for query generation. In particular, it uses a Multi-armed Bandit (MAB) approach.

In the simplest form of MAB [8, 93], each arm of a slot machine is associated with a reward and a probability distribution (unknown to the gambler) of whether selecting the arm will result in a reward. The goal is to design a strategy that maximizes the overall expected reward. The query generation problem can be viewed as an instance of Combinatorial MAB (CMAB) [1], in which the gambler is allowed to select a subset of k arms (a.k.a. a super arm). In query generation, each phrase in I.P can be viewed as a simple arm associated with a set of i.i.d. random outcomes with an unknown expectation µ_i. Thus, µ = (µ_1, µ_2, ..., µ_{|I.P|}) is the vector of expectations of all phrases. We view the queries Q ∈ 2^{[|I.P|]}, where |Q| ≤ B_p, as super arms. We denote the expected reward for Q as r_µ(Q) = E[R_t(Q)], where R_t(Q) is a non-negative random variable denoting the reward of round t when super arm Q is played.

The most common performance measure for bandit algorithms is the total expected regret, defined as the gap between the expected reward achieved by the algorithm and that achieved by an algorithm that always selects the best arm. In CMAB, the bound on the rate of growth of the regret is O(log n) if the reward function is either linear or, in the case of a non-linear function, monotonic and smooth [1].

The reward function r_µ(Q) = Cov(Q, I.P) in our query generation is non-linear. However, it is both monotonic and smooth; as a result, we can use the CMAB framework developed in [1] to choose the super arm (i.e., the set of phrases) while ensuring that the overall regret remains bounded. To see that r_µ(Q) is monotonic, note that for all Q ⊆ I.P, if µ_i ≤ µ'_i for all i ∈ [|I.P|], then r_µ(Q) ≤ r_{µ'}(Q). To see that it is smooth, consider the function f(x) = |I.P| × x. With f(x) as the smoothness function, for any two vectors µ and µ', we have |r_µ(Q) − r_{µ'}(Q)| ≤ f(Λ), where max_{i∈Q} |µ_i − µ'_i| ≤ Λ.

Algorithm 2 shows the Combinatorial UCB (CUCB) algorithm for the CMAB problem. After each inner iteration, based on previous information, CUCB maintains an empirical mean µ̂_i for each phrase p_i. More precisely, the value of µ̂_i after updating the phrase statistics is Cov({p_i}, I.P). The actual expectation vector µ̄ given to the oracle contains an adjustment term for each µ̂_i, which depends on the round number t and the number of times phrase p_i has been queried (stored in variable T_i). CUCB then plays the super arm (query) returned by the oracle and updates the variables T_i and µ̂_i accordingly. Note that in our model all phrases have bounded support on [0, 1], but with the adjustment, µ̄_i may exceed 1. If such

ALGORITHM 2: Combinatorial UCB
input : an interest I, the budget Bp
output: a set of selected phrases for the topic of interest I as Q
1  For each phrase p_i, maintain: (1) variable T_i, the total number of times p_i has been queried so far, initially 0; (2) variable µ̂_i, the latest outcome Cov({p_i}, I.P) of phrase p_i, initially 1.
2  t ← 0
3  while Acquisition.IsRunning do
4      t ← t + 1
5      for each phrase p_i ∈ I.P do
6          µ̄_i = min(µ̂_i + √(3 ln t / (2 T_i)), 1)
7      Q = Oracle(µ̄_1, µ̄_2, ..., µ̄_{|I.P|})
8      Query Q, observe the outcomes of the phrases in Q, and update all T_i and µ̂_i.
9  return Q

a µ̄_i is illegal to the oracle, we simply replace it with 1. Since replacing any value larger than 1 with 1 violates neither the monotonicity nor the bounded smoothness of the reward function, we directly use the original µ̄_i, and the bound on the regret is not affected.
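Line 6 of Algorithm 2, the only CUCB-specific computation, can be sketched as follows. This is an illustrative helper whose name is ours; following a common CUCB convention (an assumption, not stated explicitly above), an arm never played yet (T_i = 0) receives the maximal value 1 so that it gets explored first.

```python
import math

def adjusted_means(mu_hat, T, t):
    """Compute the optimistic means fed to the oracle: each empirical mean
    is inflated by sqrt(3 ln t / (2 T_i)) and capped at 1."""
    mu_bar = []
    for m, Ti in zip(mu_hat, T):
        if Ti == 0:
            mu_bar.append(1.0)  # unplayed arm: force exploration
        else:
            mu_bar.append(min(m + math.sqrt(3.0 * math.log(t) / (2.0 * Ti)), 1.0))
    return mu_bar
```

A rarely played phrase gets a larger bonus than a frequently played one, which is exactly the explore-exploit balance the adjustment term is meant to provide.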

Oracle: The oracle in Algorithm 2 is a mechanism to efficiently solve the optimization problem of the probabilistic coverage model in Eq. 3.3. We propose an algorithm for efficiently computing the probabilistic maximum coverage model and, to make it even more efficient, heuristics to optimize the greedy algorithm.

Identifying a subset of phrases that maximizes the coverage defined in Eq. 3.3 requires TAS to enumerate all subsets of phrases in I.P, which is exponential.³ In this chapter, we propose a greedy algorithm with an approximation bound for computing probabilistic coverage that achieves the same goal but can be computed efficiently.

³It can be proven, by a reduction from the Maximum Coverage problem, that maximizing Cov(Q, I.P) is NP-hard [83].

32 3.4.5 Greedy Approximation Bound

To develop the greedy algorithm with an approximation bound, we first show the sub-modularity property of the proposed probabilistic coverage model.

Definition 2. (Sub-modularity). Given a finite set I.P, a set function f : 2^{I.P} → R is sub-modular iff for any two sets Q ⊆ Q′ ⊆ I.P and p ∈ I.P \ Q′, it holds that f(Q ∪ {p}) − f(Q) ≥ f(Q′ ∪ {p}) − f(Q′).

That is, for a sub-modular f, adding p to a smaller subset of a set has a larger incremental impact on f than adding p to the set itself. The connection between sub-modularity and the coverage of a query is very intuitive: adding a phrase to a small query has a larger incremental impact than adding the same phrase to that query after many more phrases have been added.

Lemma 1. Cov(·, I.P) is a non-decreasing sub-modular set function.

Proof. It is trivial to show that Cov(·, I.P) is a non-decreasing set function. Let us now prove the sub-modularity of Cov(·, I.P). It is straightforward to derive that:

Cov(Q ∪ {p}, I.P) − Cov(Q, I.P)
    = f(p)·p(R|p) − f(p, p1)·p(R|p, p1) − f(p, p2)·p(R|p, p2) − ... + f(p, p1, p2)·p(R|p, p1, p2) + ...
    = Σ_{n=0}^{|Q|} (−1)^n Σ_{1 ≤ i1 < ... < in ≤ |Q|} f(p, p_{i1}, ..., p_{in}) · p(R|p, p_{i1}, ..., p_{in})

where the n = 0 term is f(p)·p(R|p).

In other words, this difference is equal to the rate of relevant tweets collected by phrase p, f(p)·p(R|p), without counting the results shared with the other phrases in Q. As a result, the marginal gain Cov(Q ∪ {p}, I.P) − Cov(Q, I.P) is non-increasing in Q. Therefore, for any two sets Q, Q′ ⊆ I.P such that Q ⊆ Q′ and p ∉ Q′, Cov(Q ∪ {p}, I.P) − Cov(Q, I.P) ≥ Cov(Q′ ∪ {p}, I.P) − Cov(Q′, I.P). We can now conclude that Cov(·, I.P) is a non-decreasing sub-modular function.

For a non-decreasing sub-modular set function f with f(∅) = 0, it has been proven in [64] that the greedy algorithm that always selects the object whose addition maximizes f produces a solution with approximation ratio (1 − 1/e), where e is the base of the natural logarithm. That is, the greedy algorithm yields a solution that achieves at least a (1 − 1/e) fraction of the optimal value. Therefore, the following lemma holds:

Lemma 2. The greedy algorithm for the problem of maximizing Cov(Q, I.P ) has approxi- mation ratio (1 − 1/e) [64].

3.4.6 Greedy Algorithm

Let us now describe the greedy algorithm in detail and explore how to implement it efficiently. The outline of the algorithm is shown in Algorithm 3. It starts with an empty query Q. At every step of the greedy process, the algorithm traverses the remaining phrases in I.P \ Q and adds into Q the phrase p∗ whose addition will lead to maximum coverage of I.P by the resulting set Q ∪ {p∗}.

ALGORITHM 3: Greedy Phrase Selection
input : an interest I, the budget Bp
output: a set of selected phrases for the topic of interest I as Q
1  Q ← ∅
2  while |Q| < Bp do
3      p* ← arg max_{p ∈ I.P \ Q} Cov(Q ∪ {p}, I.P)
4      Q ← Q ∪ {p*}
5  return Q

The largest contribution to the computational complexity of the greedy algorithm comes from the while loop (steps 2-4). In each iteration it needs to compute the coverage Cov(Q ∪ {p}, I.P) that results from adding each phrase p ∈ I.P \ Q into Q (step 3). Each coverage computation takes constant time, and the algorithm computes the coverage for |I.P \ Q| candidates per iteration. Since there are Bp iterations in total, the overall complexity of the greedy algorithm is O(|I.P|·Bp). To make the computation of Cov(Q ∪ {p*}, I.P) even more efficient, one approach is to incrementally update Cov(Q ∪ {p*}, I.P) based on the value of Cov(Q, I.P) as follows:

Cov(Q ∪ {p*}, I.P) = Cov(Q, I.P) + f(p*)·p(R|p*) − Σ_{i=1}^{|Q|} f(p*, pi)·p(R|p*, pi)    (3.4)
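A minimal sketch of Algorithm 3 using the incremental update of Eq. (3.4). The dictionaries `rate` (standing for f(p)·p(R|p)) and `joint` (standing for f(p, q)·p(R|p, q)) are assumed to hold precomputed statistics; their names are illustrative, not part of TAS.

```python
def greedy_phrase_selection(phrases, rate, joint, budget):
    """Greedily pick up to `budget` phrases maximizing probabilistic coverage.

    The marginal gain of p given Q is computed incrementally per Eq. (3.4):
    f(p)·p(R|p) minus the pairwise overlap terms with phrases already in Q.
    """
    Q, cov = [], 0.0
    while len(Q) < budget and len(Q) < len(phrases):
        best_p, best_gain = None, float("-inf")
        for p in phrases:
            if p in Q:
                continue
            gain = rate[p] - sum(joint.get(frozenset({p, q}), 0.0) for q in Q)
            if gain > best_gain:
                best_p, best_gain = p, gain
        Q.append(best_p)           # p* = argmax of the marginal gain
        cov += best_gain           # incremental update of Cov(Q, I.P)
    return Q, cov
```

With rates {a: 0.5, b: 0.4, c: 0.3} and a single overlap term joint(a, b) = 0.3, the greedy pass picks a first and then c, since b's marginal gain drops to 0.1 once a is selected.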

3.5 Relevance Check

In TAS, we measure relevance of the incoming tweets to the topic of interest in order to select phrases during query generation (Sec. 3.4) as well as to adapt the topic of interest (Sec. 3.6). Difficulty in determining relevance arises due to:

• The short and ill-structured nature of text in tweets that may not contain enough keywords to enable classifying it as relevant (or not).

• The dynamic nature of tweets, where the keywords used to refer to the topic shift quickly, which means an approach based on a fixed, pre-trained classifier will not remain effective.

TAS maintains a dynamic model of the topic of interest, which it adapts over time based on frequent patterns in tweets collected in previous iterations that were deemed relevant. The relevance of a tweet t to the topic I, denoted by R(t, I), is measured using a combination of three factors:

• Static relevance of t to the initial patterns of the topic of interest defined by the client (Rt).

• Relevance based on the proximity of the textual vector of t to the clusters of relevant tweets already acquired for the interest I (Rc).

• The history of the tweeter who posted t in posting tweets relevant to the topic of interest in the past (Ru).

The overall relevance is computed as a combination of the three factors: R(t, I) = wt·Rt + wc·Rc + wu·Ru, where wt, wc and wu are weights specifying the importance of each factor in determining the relevance of the incoming tweet t to the interest I. These weights, assigned by the client, can differ based on the characteristics of the topic of interest. For instance, when the topic of interest has a dedicated tweeter community, such as "UC Irvine", the set of Twitter users who tweet about it is limited, so the user relevance (Ru) is an important factor compared to topics of general interest such as "Vice Presidential Debate". We now discuss how each individual relevance factor is computed; the weights used in the experiments are specified in Sec. 3.7.1.
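The weighted combination is a one-liner; the default weights below are the values used in the experiments (Sec. 3.7.1).

```python
def overall_relevance(r_t, r_c, r_u, w_t=0.5, w_c=0.3, w_u=0.2):
    """R(t, I) = w_t*R_t + w_c*R_c + w_u*R_u, with client-assigned weights."""
    return w_t * r_t + w_c * r_c + w_u * r_u
```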

3.5.1 Phrase Based Relevance (Rt)

Given a tweet t containing keywords t.k, we define the phrase-based relevance of the tweet t to the interest I with a set of phrases I.P = {p1, ..., p|I.P|}, denoted Rt(t, I), by first defining the relevance of a tweet to a given phrase pi, denoted Rp(t, pi), as follows:

Rp(t, pi) = p(R|pi) · |pi.k ∩ t.k| / |pi.k|                                  if pi.l = null
Rp(t, pi) = p(R|pi) · (α · |pi.k ∩ t.k| / |pi.k| + (1 − α) · LP(t, pi))      otherwise

The function LP(t, pi) measures the proximity of t to pi.l as a number in [0, 1], where LP(t, pi) = 1 means the location matches with strong evidence (e.g., the geo-tag point of t lies in the location bounding box of pi). Essentially, Rp(t, pi) computes the relevance of a given phrase as the product of the probability that the phrase is relevant to the interest and the degree of match between the tweet and the phrase.

If t.k contains all terms in pi.k, the degree of match is 1, and the relevance of the tweet based on the phrase corresponds to the probability of a tweet being relevant given that it contains phrase pi. We compute the phrase-based relevance Rt(t, I) based on all the phrases in I as follows:

Rt(t, I) = max_{pi∈I.P} Rp(t, pi) + [(Σ_{pi∈I.P} Rp(t, pi) − max_{pi∈I.P} Rp(t, pi)) / (|I.P| − 1)] · (1 − max_{pi∈I.P} Rp(t, pi))    (3.5)

In the computation above, Rt(t, I) is computed as the relevance of the best-matching phrase plus a partial contribution from the relevance of the other phrases in I.P.
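The two definitions above can be sketched as follows. Phrases are represented as (keyword_set, p(R|p_i)) pairs, an assumed in-memory stand-in for I.P; the location branch is folded into an optional `loc_proximity` argument standing for LP(t, p_i).

```python
def phrase_relevance(tweet_kw, phrase_kw, p_rel, loc_proximity=None, alpha=0.7):
    """R_p(t, p_i): keyword-overlap degree of match, optionally blended
    with location proximity LP(t, p_i) when the phrase has a location."""
    overlap = len(tweet_kw & phrase_kw) / len(phrase_kw)
    if loc_proximity is None:          # the p_i.l = null case
        return p_rel * overlap
    return p_rel * (alpha * overlap + (1 - alpha) * loc_proximity)

def topic_relevance(tweet_kw, phrases):
    """R_t(t, I) per Eq. (3.5): the best-matching phrase's relevance plus
    a partial contribution from the remaining phrases."""
    scores = [phrase_relevance(tweet_kw, kw, p) for kw, p in phrases]
    best = max(scores)
    if len(scores) == 1:
        return best
    rest = (sum(scores) - best) / (len(scores) - 1)
    return best + rest * (1 - best)
```

For example, a tweet {"mets", "win"} against phrases ({"mets"}, 0.8) and ({"mets", "nlcs"}, 0.6) scores 0.8 on the best phrase and 0.3 on the other, giving Rt = 0.8 + 0.3·(1 − 0.8) = 0.86.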

3.5.2 Clue Relevance (Rc)

In addition to the phrase relevance, we also compute relevance based on frequent patterns observed in relevant tweets that match other phrases in I.P. For each phrase pi ∈ I.P, TAS maintains a set of frequent patterns FP(pi) (extracted using the Apriori algorithm [3]) such that the patterns have a given level of support. For a tweet t retrieved due to a match with phrase pi, we check whether it contains any of the frequent patterns associated with the other phrases. If the number of such patterns contained in t exceeds a threshold γ, we set the clue-relevance measure Rc to 1. That is, Rc = 1 iff:

| ∪_{pj ∈ I.P − {pi}} {s ∈ FP(pj) | s ⊂ t.k ∧ s ∉ I.P} | > γ    (3.6)
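Eq. (3.6) can be sketched as a containment count over the frequent patterns of the other phrases. Here `freq_patterns` maps each phrase to its set FP(p) of frozenset patterns, and `topic_phrases` holds the phrases of I.P as frozensets; both representations are assumptions for illustration.

```python
def clue_relevance(tweet_kw, matched_phrase, freq_patterns, topic_phrases, gamma=2):
    """R_c per Eq. (3.6): 1 if the tweet contains more than `gamma` frequent
    patterns mined from phrases of the topic other than the matched one."""
    hits = set()
    for p, patterns in freq_patterns.items():
        if p == matched_phrase:        # only patterns of *other* phrases count
            continue
        for s in patterns:
            if s <= tweet_kw and s not in topic_phrases:
                hits.add(s)
    return 1 if len(hits) > gamma else 0
```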

3.5.3 User History (Ru)

A tweet t has a higher chance of being relevant to a topic of interest if the Twitter user t.u who posted t has posted relevant tweets in the past. The user relevance Ru(t) is equal to the maximum of the average relevance of all past tweets by t.u and the average relevance of the tweets posted by t.u in the latest outer iteration.
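As a sketch, with per-tweet relevance scores kept per user (an assumed bookkeeping structure):

```python
def user_history_relevance(all_past_scores, latest_iter_scores):
    """R_u: the maximum of the user's average relevance over all past tweets
    and over the latest outer iteration; 0 for an unseen user."""
    def avg(scores):
        return sum(scores) / len(scores) if scores else 0.0
    return max(avg(all_past_scores), avg(latest_iter_scores))
```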

3.6 Topic Maintenance

In order to adapt to new trends related to the topic of interest, TAS adapts the representation of the topic of interest based on patterns observed in the tweets seen in the past. Figure 3.4 illustrates the different coverage situations of a phrase in TAS. Monitoring Twitter with a phrase may result in acquiring a set of tweets, some of which are irrelevant to the topic of interest. Based on the proportion of relevant tweets among all the tweets a phrase acquires, TAS either keeps the phrase unchanged or takes one of the following actions:

Figure 3.4: Phrase Maintenance

• Phrase Generalization: When a phrase pi has resulted in acquiring a significant number of relevant tweets and none or very few irrelevant tweets, it is considered too specific. Such a phrase is generalized by dropping some terms from pi. To do so, TAS determines the frequency of each term in the irrelevant tweets collected in the past, as well as the commonness of the term in the language. The higher the commonness and the higher the frequency, the lower the term's chance of being important for the topic. To generalize a phrase, TAS computes the importance of each term and drops the term with the minimum importance if that importance is less than or equal to the importance threshold given to the system. TAS approximates the commonness of an English term based on its Term Frequency-Inverse Document Frequency (TF-IDF) [21] in the English Wikipedia.

• Phrase Specialization: TAS specializes phrases that, while having resulted in acquiring some relevant tweets, have also resulted in acquiring a significant number of irrelevant tweets. TAS specializes such phrases by adding new terms to reduce the number of irrelevant tweets retrieved. To specialize pi, TAS finds all frequent patterns in the relevant tweets retrieved due to phrase pi, and chooses the most frequent pattern from which to add new terms to pi.

• Phrase Addition: It is unrealistic to assume that the initial definition of the topic of interest is complete. To dynamically enrich the representation of the topic of interest, TAS detects newly emerging relevant phrases, empowering the system to retrieve relevant tweets in subsequent iterations. Candidate new phrases in TAS are the frequent patterns of past relevant tweets. TAS computes frequent patterns with the Apriori algorithm, where the support (or occurrence frequency) of a pattern pi is computed as the number of tweets containing pi divided by the number of all tweets in an iteration. A pattern pi with support no less than the pre-specified minimum support threshold is considered a frequent pattern.

• Phrase Deletion: If a phrase continuously results in acquiring no or very few relevant tweets and a significant number of irrelevant tweets, TAS removes it from the interest.
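The per-phrase actions above can be sketched as a decision rule over the counts of relevant and irrelevant tweets a phrase has acquired. The thresholds `hi`, `lo`, and `min_total` are illustrative assumptions; the dissertation does not specify numeric values for them.

```python
def maintenance_action(n_relevant, n_irrelevant, hi=0.8, lo=0.2, min_total=20):
    """Pick a maintenance action for a phrase from its acquisition counts.

    Thresholds are hypothetical: mostly-relevant phrases are generalized,
    almost-never-relevant ones deleted, mixed ones specialized.
    """
    total = n_relevant + n_irrelevant
    if total < min_total:
        return "keep"                  # not enough evidence yet
    ratio = n_relevant / total
    if ratio >= hi:
        return "generalize"            # almost all relevant -> too specific
    if ratio <= lo:
        return "delete"                # almost none relevant -> drop phrase
    if ratio < 0.5:
        return "specialize"            # some relevant, many irrelevant
    return "keep"
```

Phrase addition is driven by frequent patterns mined from relevant tweets rather than by a per-phrase ratio, so it is not part of this per-phrase rule.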

3.7 Experimental Evaluation

In this chapter we empirically evaluate the effectiveness of TAS⁴ by conducting experiments on real Twitter data. We evaluate the performance of TAS compared to static solutions and the state-of-the-art adaptive algorithm [57].

3.7.1 Experimental Setup

We study the performance of TAS in the following settings:

• Fixed Corpus

• Online Twitter Stream

The fixed corpus consists of a pre-crawled corpus of tweets, collected during the months of October and November 2015, that allows us to fully evaluate TAS. The tweets are a subset of more than 42 million English tweets, drawn from billions of tweets collected with the sample API of Twitter.

⁴The source code and datasets are available at: http://uci311.ics.uci.edu/sodas

We use the fixed corpus for two reasons. First, with a fixed corpus to which we have complete access, we can evaluate TAS under many different configurations (e.g., different iteration sizes). Second, using a fixed corpus, we can isolate the dynamics of the Twitter stream and compare experiments executed at different times. In addition to the fixed corpus, we also conduct experiments on the live stream of tweets; these experiments show the effectiveness of TAS in a real-world setting. In the Twitter stream setting, we run TAS and the baselines simultaneously on different machines, all starting at the same time, assuming the returned results all come from the same sample of the Twitter stream.

Tweet Normalization: In our experiments we focus on English tweets. For each incoming tweet, we first classify whether it is in English using a simple algorithm based on a given English dictionary. We also remove stop words (e.g., "the") and common special Twitter terms (e.g., "rt") from the text of the incoming tweet.

Topic of Interest: To show that TAS works for any topic of interest, in the fixed corpus setting we evaluated TAS with 6 different topics based on news articles that were being discussed during the fixed corpus's time period. We chose news articles from a wide range of topics with different approximate popularities on Twitter (broadness) to show the effectiveness of TAS on broad as well as more specific topics from different categories. Since a topic of interest is difficult to characterize from a short news article, following a news article on Twitter as a topic of interest is tricky. We obtain the initial representation for each news article by parsing the HTML content of the corresponding page and extracting the most important phrases from it.

To identify the most important phrases of an HTML page and sort them by their approximate importance in identifying relevant tweets related to the news article, we consider the structural information of the web page (e.g., title, headings, formatting) as well as the textual content itself. We believe that the more prominent a phrase is in a news article, the more distinguishing that phrase is for identifying relevant content elsewhere. Some of the system parameters, such as the iteration size, are related to the popularity of the topic of interest; we tune those parameters manually for each topic of interest.

We approximate the comparative broadness of a topic of interest by considering the number of tweets mentioning the topic of the corresponding news article. We categorize topics into three levels of broadness (narrow, semi-broad and broad) and then select from the different levels. To cover a general set of topics, we choose news articles from the Crisis, Entertainment, Politics, News and Sports categories, based on their posted categories on the original websites (e.g., CNN).

Baseline Methods: We compared the performance of TAS to baselines that have been used in previous work on social media monitoring [57].

• BaseM captures target tweets for a topic of interest using a static manually selected list of keywords. This is the most commonly used approach for focused capturing of tweets.

• ATM is the state of the art in focused Twitter monitoring. It enables monitoring target tweets of a topic of interest based on a given classifier.

For the fixed corpus setting we report only the results compared to BaseM, since there is no access to the random sample API in the simulation mode.

TAS Parameters: TAS uses a relatively large number of parameters in its various components/sub-algorithms (e.g., English language classifier, stop-word remover, relevance model, iteration size). While it would be interesting to explore which of these parameters, if any, could be learnt automatically, such an exploration is beyond the scope of the current chapter. In TAS, the parameters have been set manually based on performance over a large number of example queries, to keep the evaluation unbiased. In our experiments, the following values are used for the system parameters: |∆| = 5, τ = 0.5, η = 1, wt = 0.5, wc = 0.3, wu = 0.2, α = 0.7 in Rp(t, pi), and γ = 2 in the clue relevance.

Our Configurations: To evaluate the effectiveness of TAS, we test it with different configurations. First, to show that TAS works under different phrase budget constraints (Bp), we evaluate TAS with Bp values of 20 and 50 phrases. This is especially important when multiple clients, each with multiple interests, need results simultaneously; in that case the phrase budget per interest is more restricted. Second, to study the benefit of adapting the representation of the topic of interest, we evaluate the performance of TAS with the topic maintenance module enabled vs. disabled. Finally, to show the advantages of our iterative acquisition method, we evaluate TAS with different iteration sizes (|δ|) and compare TAS with a static approach.

3.7.2 Evaluation Criteria

The ultimate goal of TAS is to maximize the number of collected target tweets (viz. "recall" in IR). We note that, because of the sampled nature of the Twitter API's output, to maximize the recall the system should improve precision as well.

Since there is no gold standard available for measuring recall, we report the Approximate Relative Recall (ARR) of TAS compared to a baseline. The approximate relative recall of method A compared to method B, given the output result set of each method as shown in Figure 3.5, is calculated as follows:

Figure 3.5: Approximate Relative Recall

ARR(A, B) = (|A ∩ B| + p_SA × |A − B|) / (p_SA × |A − B| + p_SB × |B − A| + |A ∩ B|)    (3.7)

In Eq. 3.7, A and B are the output sets of the corresponding approaches, and p_SA is the probability that a sampled tweet from A is relevant. Since |A| and |B| are large, it is infeasible to manually check the relevance of all tweets in A and B. Therefore, we label random samples SA and SB to approximate the number of relevant results for each method. We choose the sample sizes (e.g., |SA|) for a confidence level of 95% and a confidence interval of 5% for each result set. We also report the total number of collected tweets to show the efficiency of TAS in processing a lower number of irrelevant tweets.
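Eq. (3.7) translates directly into code; the set sizes and sampled relevance rates are passed in as plain numbers.

```python
def approximate_relative_recall(n_inter, n_a_only, n_b_only, p_sa, p_sb):
    """ARR(A, B) per Eq. (3.7).

    n_inter = |A ∩ B|, n_a_only = |A − B|, n_b_only = |B − A|;
    p_sa, p_sb are the relevance rates estimated from samples S_A, S_B.
    """
    num = n_inter + p_sa * n_a_only
    den = p_sa * n_a_only + p_sb * n_b_only + n_inter
    return num / den
```

For example, if both samples were fully relevant (p_SA = p_SB = 1) with |A ∩ B| = 100 and 50 exclusive tweets on each side, ARR(A, B) = 150/200 = 0.75.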

3.7.3 Experimental Results

We first report performance results on the fixed corpus, showing more than a twofold improvement in the number of acquired relevant tweets compared to the static manual approach. Then we empirically confirm that TAS outperforms the other methods in acquiring tweets relevant to the client's topic of interest on the stream of Twitter.

Figure 3.6: TAS vs. BaseM: Number of Tweets (thousands) per topic of interest

Fixed Corpus

In the fixed corpus setting, we study TAS for 6 different news articles, published in the first days of the fixed-corpus data acquisition time frame, as topics of interest. The topics are from a broad range of categories and popularities; the list of chosen topics is shown in Table 3.2. We compare TAS to BaseM in terms of both the number of collected relevant tweets and the approximate relative recall. We also study the impact of different phrase budgets (Bp), different inner iteration sizes (δ) and the topic maintenance module. Since in the fixed corpus setting we do not have access to an unbiased sample of tweets, we do not compare TAS to ATM in this setting.

Table 3.2: Fixed Corpus Topics of Interest

ID  Category       Topic                        Source        Broadness
15  Entertainment  Star Wars                    CNN [22]      Broad
25  Politics       Justin Trudeau               CNN [24]      Semibroad
35  Crime          Oscar Pistorius              CNN [25]      Narrow
45  Sports         Baseball Mets                NYTimes [66]  Broad
55  Local          Border Patrol at UC Irvine   LATimes [54]  Narrow
65  Crisis         Hurricane Patricia           CNN [23]      Semibroad

Figure 3.7: TAS over Simulation: Number of Retrieved Relevant Tweets (thousands) vs. Number of Replayed Tweets (millions), per topic

In this setting, we conduct the following experiments. First, we compare TAS with the static baseline to show that TAS outperforms it across topics. Figure 3.6 illustrates the effectiveness of TAS in capturing relevant tweets for all the topics compared to the static manual BaseM. In this experiment, the phrase budget is 50 and the inner iteration size is 500 for all the topics except topic 55, where |δ| is 10. As expected, for all the topics TAS retrieves more relevant tweets (on average 40% more than BaseM).

Second, to illustrate the iterative effectiveness of TAS in retrieving relevant tweets for all the selected topics, Figure 3.7 shows the number of retrieved relevant tweets for the different topics over the simulation time. The acquisition trend differs across topics and plateaus when the topic is no longer being discussed on Twitter. The beginning of the trend for topics 15, 45 and 65 shows how effective TAS is at initially enriching the corresponding representation of the topic of interest.

Figure 3.8: TAS vs. BaseM: Approximate Relative Recall (%) per topic of interest

To fully investigate the effectiveness of TAS in improving recall, we also calculated the approximate relative recall of TAS vs. BaseM. To compute the ARR, we sampled the results of TAS and BaseM separately, as shown in Table 3.3 (SS stands for Sample Size), and crowdsourced the determination of whether a tweet from the sample was relevant. Specifically, we put the list of sample tweets, together with the content of the news article corresponding to each topic of interest, on Amazon Mechanical Turk and asked for human judgement on relevance. Figure 3.8 shows the effectiveness of TAS in improving the relative recall for all the topics of interest compared to the static BaseM. The average approximate relative recall of TAS is 95%, versus 49% for BaseM, which shows up to 90% improvement.

Table 3.3: ARR Calculation, Sample Sizes

Topic  TAS    TAS-SS  BaseM  BaseM-SS  Intersection
15     10779  364     4343   218       3841
25     3395   336     935    137       722
35     704    197     330    29        301
45     13134  363     7981   302       6566
55     15     4       11     0         11
65     6941   348     333    80        3230

Figure 3.9: TAS with Different Phrase Budgets (Bp ∈ {20, 50, 400}): Number of Tweets (thousands) per topic of interest

We evaluated TAS with different configurations, including different phrase budgets, different inner iteration sizes, and with/without the topic maintenance module. These experiments show that TAS can handle various constraints, take advantage of iterations and adapt to a changing world.

Figure 3.9 illustrates the result of running TAS on the fixed corpus with phrase budgets of 20, 50 and 400 (400 phrases is the limit of Twitter's filter endpoint; we use it to reduce the impact of the budget constraint on TAS). The phrase budget is a limitation imposed by the API, but in a multi-client scenario it is also necessary for TAS to use the per-interest/client phrase budget optimally. As expected and shown in Figure 3.9, performance generally improves with a higher value of Bp.

Figure 3.10 illustrates the result of running TAS on the fixed corpus with different inner iteration sizes for the different topics of interest. The best choice of |δ| depends on the popularity of the topic on Twitter. For example, as Figure 3.10 shows, the best choice of |δ| for topic 15 is 500 and for topic 35 is 50, since topic 15 is more trendy over the simulation data.

Figure 3.10: TAS with Different Inner Iteration Sizes (|δ| ∈ {10, 50, 100, 500, 1000}): Number of Tweets (thousands) per topic of interest

The most important feature of TAS compared to BaseM is its higher degree of adaptability. Figure 3.11 illustrates the number of tweets collected with the topic maintenance module enabled (WI45) compared to it being disabled (WO45). In this experiment, Bp is 50, |δ| is 500, and the topic of interest is 45. Figure 3.11 shows 3 times more retrieved tweets when the topic maintenance module is active.

Figure 3.11: Topic Maintenance Module On (WI45) vs. Off (WO45): Number of Retrieved Tweets (thousands) vs. Number of Replayed Tweets (millions)

Figure 3.12: TAS: Number of Phrases vs. Number of Replayed Tweets (millions); labeled phrase changes include (−) cubs sweep, (g) game mets to mets, (s) mets to mets watch, (+) mets nlcs, (+) lgm

The topic maintenance module is responsible for maintaining the representation of the topic of interest. It dynamically maintains I by adding, removing, specializing or generalizing phrases. Figure 3.12 illustrates the number of active phrases in the representation of the topic of interest for topic 45 (Baseball Mets) over the simulation time, with a few example phrase changes labeled. The addition of "lgm" (Let's Go Mets) and "mets nlcs" as new relevant phrases, the generalization of "game mets" to "mets", the specialization of "mets" to "mets watch" in later iterations, and the removal of "cubs sweep" are a few of the topic maintenance actions taken by TAS. As expected, the number of phrases plateaus, which indicates that the representation of the topic of interest has become stable.

Twitter Stream

In this setting, we conduct experiments comparing TAS with each baseline separately on collecting relevant tweets for 20 selected news articles. Due to limited space, we report detailed results for a few of the topics, shown in Table 3.4, and report the overall results for all the topics.

Table 3.4: Streaming Topics of Interest

ID  Category       Topic                  Source        Broadness
75  Crisis         Porter Ranch Gas Leak  LATimes [53]  Narrow
85  Entertainment  Star Wars              CNN [26]      Broad

Each experiment is based on a concurrent execution of TAS and the baselines. Since Twitter places restrictions on the number of concurrent connections from an IP address, for our experiments on the live Twitter stream we use separate machines with different registered application keys. We use the same configuration (e.g., Bp) on all the systems.

Classifiers: Since ATM is based on a given classifier, and the performance of the classifier affects the performance of ATM, to compare TAS with ATM we evaluate it with classifiers for two topics, as discussed in [57]. We obtain a classifier f for a topic in the following steps. First, we define different types of features, including word features and additional features (e.g., whether a tweet is from a news agent). Then, we label a set of tweets for training (selected from the initial output of the BaseM results) and train classifiers with different classification models (NB: Naive Bayes, and SVM: Support Vector Machine). Finally, we select the best one to use.

We used training sets of 200 and 1000 manually labeled tweets for topics 75 and 85, respectively. Here, we emphasize that our focus is not designing classifiers; we train classifiers only for the purpose of fairly comparing TAS with ATM.

Figure 3.13: TAS vs. ATM: Topic 75 (number of retrieved relevant tweets, thousands, vs. time in seconds)

Figure 3.14: TAS vs. ATM: Topic 85 (number of retrieved relevant tweets, thousands, vs. time in seconds)

Figures 3.13 and 3.14 show the tweet acquisition trend over time for topics 75 and 85, respectively. As expected, Figure 3.13 clearly shows the effectiveness of TAS in capturing more relevant tweets than ATM for the finer topic of interest 75. Figure 3.14 illustrates the comparable effectiveness of TAS in acquiring tweets relevant to coarser topics such as 85. Moreover, ATM captured a large number of irrelevant tweets, which shows the need for a relatively heavy post-processing step to reduce the number of irrelevant tweets.

Table 3.5: Streaming Experiment Summary

Method   Topic 75         Topic 85         Topic 95         Topic 105        Topic 115        Overall
         P    RR   F1     P    RR   F1     P    RR   F1     P    RR   F1     P    RR   F1     P    RR   F1
TAS      0.48 0.95 0.64   0.58 0.66 0.61   0.33 0.84 0.47   0.60 0.75 0.67   0.37 0.81 0.51   0.43 0.82 0.57
ATM      0.06 0.24 0.09   0.17 0.73 0.28   0.08 0.50 0.14   0.19 0.33 0.24   0.22 0.83 0.35   0.30 0.49 0.35
BaseM    0.81 0.10 0.17   0.88 0.12 0.22   0.72 0.21 0.31   0.93 0.24 0.38   0.79 0.17 0.28   0.81 0.14 0.28

Since we use classifiers in the streaming setting to be fair in the comparison with ATM, we can measure the precision and the Relative Recall (RR) of TAS and the baselines. We also report the F1-score of each method, computed as F1 = 2 × (P × RR) / (P + RR), where P is the precision and RR is the relative recall. The F1-scores of TAS compared to the other baselines indicate its better overall performance. Table 3.5 clearly shows significant improvements in the overall performance (based on the F1 measure) of TAS over ATM and BaseM. The performance gap increases when the topics are more specific (e.g., topic 75).
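The F1 computation used for Table 3.5 is simply the harmonic mean of precision and relative recall:

```python
def f1_score(precision, relative_recall):
    """F1 = 2 * P * RR / (P + RR), as reported in Table 3.5."""
    return 2 * precision * relative_recall / (precision + relative_recall)
```

For TAS on topic 75 (P = 0.48, RR = 0.95) this gives 0.64, matching the table.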

3.8 TAPP (Twitter Follow-up Application)

Social media applications have recently grabbed the attention of millions of users around the globe. Twitter, a microblogging service, plays a major role in the interactivity of this modern era. With the vast growth of the data interchanged daily on Twitter, the need to search and explore its content in an engaging way is rising. Given the limitations of Twitter, a simple search using pre-defined keywords fails to provide a good understanding of the topic in mind. Hence, we introduce our system TAPP (Twitter Follow-up Application), which creates an environment enabling users to query Twitter and search for tweets related to any topic of interest in an interactive and dynamic manner.

After reading a news article discussing a certain topic, the user can use TAPP to further explore the topic by viewing other people's opinions and concerns on Twitter. To this end, TAPP first extracts a set of query phrases from the article, queries the Twitter API using those phrases to retrieve relevant tweets, and then returns a summarized, ranked list of relevant tweets to the user.

The key concept differentiating our system from other approaches is its adaptive searching mechanism. TAPP iteratively refines the sets of tweets returned from the Twitter API. By initiating a simple refresh request, the user can view a more effective set of tweets generated from the latest updated set of query phrases. This feature helps our application cover phrases that are related to the topic of the article only at the time the article is being read. For example, consider an article discussing the latest news on "climate change". A new phrase, "climate change presidential election", may not be present in the article but could eventually be generated by our system to give better coverage of "climate change" with respect to up-to-date events. By querying the Twitter API with this new phrase, TAPP can retrieve tweets related not only to that specific article, but also to the broad topic of the article in general.

3.8.1 System Overview

TAAS has been fully developed in the Java programming language and is hosted on a Tomcat server. As illustrated in Figure 3.15, our framework consists of different components responsible for performing different functions. TAAS's core responsibility is to adaptively search and query Twitter using a component called the Twitter Acquisition System (TAS) [xxx] and to return summarized results using a different component, the Twitter Analysis Engine (TAE).

Figure 3.15: TAAP Application

In order for TAS to start acquiring related tweets, it takes a set of phrases as input and generates the query phrases based on them. TAS is a dynamically adaptive acquisition system that deals with multiple topics of interest at the same time; the set of query phrases of each article corresponds to one topic of interest. The system runs in iterations to dynamically enrich the query phrases of the topics of interest using an explore-exploit strategy. At the beginning of each iteration, TAS first generates a new set of phrases per topic, where each phrase is associated with a weight that represents its relevance to the topic, and then queries the Twitter API using the generated sets of phrases to obtain relevant tweets. The newly generated phrases are derived from the related tweets collected in the past iteration based on a reinforcement learning approach.

The retrieved tweets then go through a relevance checker that determines the topic of each tweet. The main factors in determining the relevance of a tweet to a certain topic are the tweet's relevance to the topic, obtained by checking every term in the tweet against the phrases of that topic, and heuristics about the tweeter, checking whether he/she has posted relevant tweets in the past. The relevant tweets are then stored in the tweet database. After a set of related tweets is acquired, TAE retrieves them from the database and executes several algorithms to summarize and rank the relevant tweets based on factors including the influence of the tweeter (computed using the number of his/her followers), the time and location of the tweet, and how many times it has been retweeted. It also illustrates the spatial distribution of related tweets for a certain topic in order to show its reach around the world. TFTA works on top of TAAS: it crawls a web page and extracts key phrases, which are sent to TAAS to be processed.

Upon reading an online article, the user can send a request to TFTA to fetch a set of tweets related to the topic of the article. The first request made by the user will be in the form of the URL of the web page containing the article. If the URL has already been submitted, then the user request will be directly sent to the TAE component in TAAS. Otherwise, the request is handled in three steps: extracting key phrases from the article, dynamically retrieving relevant tweets from Twitter, and generating a ranked list of related tweets. In order to extract key phrases from the article, TFTA retrieves the HTML page of the input URL and performs various operations to determine the set of key phrases that best represent the corresponding article.

A pool of phrases that appear in the article document is collected, and a score is assigned to each phrase based on three factors. First, the distance between the words of the phrase in the article: if the words are adjacent to each other, a higher score is assigned to the phrase, and vice versa. Second, the location of the phrase in the HTML page of the article: phrases that are present in the title or the headlines are given a higher score. Third, using an entity recognition tool, a phrase containing entities or people is assigned a higher score. The overall score of the phrase is determined using a weighted summation model over these individual scores.
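A minimal sketch of the weighted summation model over the three factors; the factor encodings (inverse word gap, binary title/headline flag, binary entity flag) and the weight values are assumptions for illustration:

```python
def score_phrase(word_gap, in_title_or_headline, has_entity,
                 w_gap=0.4, w_loc=0.3, w_ent=0.3):
    """Weighted-sum phrase score from the three described factors:
    word adjacency, position in the HTML page, and presence of a
    recognized entity or person in the phrase."""
    adjacency = 1.0 / (1.0 + word_gap)        # adjacent words -> gap 0 -> 1.0
    location = 1.0 if in_title_or_headline else 0.0
    entity = 1.0 if has_entity else 0.0
    return w_gap * adjacency + w_loc * location + w_ent * entity
```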

After viewing the results, the user has the ability to send a refresh request to TFTA to retrieve more recent tweets about the same article. The refresh request is directly sent to TAE, which in turn fetches more recent tweets from the database and ranks them as explained earlier.

3.9 Summary

In this chapter, we study the task of adaptively collecting target tweets for a topic of interest with the Twitter APIs. We make the following contributions: First, we propose TAS, which enables collecting target tweets in an online adaptive way. Second, we develop a tweet relevance model, which enables checking the relevance of collected tweets to the topic of interest based on a multi-criteria analysis. Third, we develop a phrase selection reinforcement learning algorithm, which iteratively finds a set of phrases with near-optimal coverage, and we adopt a Bayesian approach to update the statistics in TAS. Finally, we conduct extensive experiments to evaluate TAS and demonstrate that it greatly improves on the baseline methods.

As a demo, we built TAAP, a browser extension to follow/analyze tweets related to a web page (e.g., a news story) in real time. Looking forward, we hope to incorporate named entity identification and entity disambiguation to improve the performance of TAS. Other future work includes incorporating a more general relevance model that combines information from various dimensions, such as the spatial and temporal domains, to enable deeper relevance analysis.

Chapter 4

Social Entity Linking

Efficient processing of top-k mentioned entities queries over a stream of tweets is a key part of a broad class of real-time applications, ranging from content search to marketing. For instance, imagine a scenario where the CEO of the technology company “Apple” is interested in getting a report showing the overall sentiment of tweets about the company every day. Given that words are often ambiguous, entity linking becomes an important step towards answering such queries. For example, “Apple” in one tweet can refer to the fruit named apple and not the company. Distinguishing between references to the target entity of the query (e.g., Apple Inc.) and other possible entities (e.g., a fruit) has proved to be a challenging task over a stream of tweets. Furthermore, the continuous and fast generation of tweets makes it crucial for such applications to process top-k mentioned entities queries at an equally fast pace. However, to the best of our knowledge, no study has yet addressed the efficiency of approaches to answer such queries.

In order to address these requirements, in this chapter we propose TkET (pronounced “ticket”) as an analysis-aware entity linking framework for efficiently answering top-k entities queries over a stream of tweets. A comprehensive empirical evaluation of the proposed solution demonstrates its significant advantage in terms of efficiency over traditional techniques for the given problem settings.

4.1 Introduction

Social media services such as Twitter, which emerged over the past decade, have changed the way we communicate. Twitter is a massive social networking site tuned towards fast communication. Twitter’s speed and ease of publication have made it an important communication medium for people from all walks of life. More than 200 million active users publish over 500 million 140-character “Tweets” about diverse topics of interest every day. Such data has now become pervasive and has motivated numerous applications in e-commerce, entertainment, government, health care, and e-science, among others. As a result, the last decade has witnessed a large body of academic research as well as several industrial investigations on different aspects of using the big data stream of Twitter in various applications. Examples are predicting the stock market [14] and the result of elections [88] based on the Twitter mood.

A vital task, common across many social data applications, is that of performing Named Entity Recognition (a.k.a. Named Entity Identification or Named Entity Extraction) and Named Entity Linking over social data. For example, given a tweet such as “Congrats to Asghar Farhadi for A SEPARATION’s Foreign Language win at last night’s Oscars!”, a Named Entity Recognizer determines that the string “Asghar Farhadi” is a person name and that “A SEPARATION” is a movie or, for simplicity, that these two strings refer to real-world entities. Named Entity Linking goes one step further, inferring that “A SEPARATION” actually refers to a unique real-world entity with a representation in the knowledge base. Wikipedia1-based knowledge bases are often used for the purpose of named entity analysis. For the rest of this chapter, we refer to a knowledge base as an instance of a Wikipedia-based knowledge base.

1http://www.wikipedia.org/

For example, the Named Entity Linker infers that the identified entity mention “A SEPARATION”, as the output of the Named Entity Recognizer, refers to the entity at en.wikipedia.org/wiki/A_Separation, and similarly for “Asghar Farhadi”. A named entity mention is a string that refers to a named entity [79]; for example, “Asghar Farhadi” is a mention of the entity of the director with the same name at en.wikipedia.org/wiki/Asghar_Farhadi.

The Named Entity Linking task over social media data is challenging due to name variations, entity ambiguity, and the noisy unstructured nature of social data. A named entity may have multiple representation forms, such as its full name, partial names, nicknames, aliases, abbreviations, and alternate spellings. For example, the named entity “Asghar Farhadi” could be referred to by only his last name as “Farhadi” or his first name as “Asghar”, and the named entity “Barack Obama” at en.wikipedia.org/wiki/Barack_Obama has several Twitter user-defined nicknames. Barack Obama has been mentioned on Twitter by Twitter users (tweeters) using “Obama”, “POTUS”, “Obam”, “O’Bomber”, “Obummer”, “Obonehead”, “Bam”, etc. An entity linking system has to identify the correct mapping entities for mentions of various representation forms. While entity linking in the context of social media data has only recently been studied [31], entity linking has been extensively studied in the past, particularly in the context of dynamic web data [62].

On the other hand, a potential entity mention could denote different named entities. For instance, the entity mention “Magnolia” may refer to the large flowering plant at en.wikipedia.org/wiki/Magnolia, a 1999 drama movie at en.wikipedia.org/wiki/Magnolia_(film), multiple different places including cities in different states (e.g., a city in southwestern Montgomery County, Texas, United States at en.wikipedia.org/wiki/Magnolia,_Texas), a food and beverage brand owned by San Miguel Corporation (SMC) at en.wikipedia.org/wiki/Magnolia_(brand), or many other entities which can be referred to as “Magnolia”. An entity linking system has to disambiguate the entity mention in the tweet text and identify the correct mapping entity for each entity mention. Often, the Named Entity Linker needs to exploit further contextual information about the given mention to disambiguate it correctly to the entity in the knowledge base it is indeed referring to [31]. For example, given a real tweet at twitter.com/normalhuman99/status/818641122499010560 with the text “I think Magnolia will always be my favorite Paul Thomas Anderson film” and an identified potential mention of a movie as “Magnolia”, passing contextual key-phrases from the text of the tweet such as “anderson” or “film” makes it clear to the Entity Linker that the given mention is indeed referring to the 1999 movie Magnolia at en.wikipedia.org/wiki/Magnolia_(film).

In many current online big social data stream applications [11], social data analysts often wish to answer different types of top-k queries over the online social data stream. Processing top-k queries efficiently is a crucial requirement in many interactive environments over big data. One major such query is the top-k mentioned entities (organizations, people, locations, companies, hashtags, users, etc.) on Twitter. Often these queries are further limited to a given category of entities, such as Locations or Movies. For example, a requested query could be “What are the top-3 Movies getting discussed positively (sentiment wise) in tweets?”. In the given query, the specified category of entities to be considered is “Movies”, as a category of entities in the knowledge base. The client can define the query in a way that the results get reported periodically in a sliding window fashion. In such a scenario, the ordering criterion used to select the top-k identified entities is the number of identified mentions indeed referring to the target entity. A target entity is an identified entity that belongs to the target category of the query (e.g., a movie).

While we will develop our framework for the stream setting, let us first consider how it can be done for a given collection of tweets. We will then generalize our solution to the stream setting in Section 4.6.1. In order to answer a top-k entities query over a collection of tweets, one typically needs to follow these steps in the given order:

1. Tweet Preprocessing: Tweet Text Normalization (e.g., Stop Word Removal), Tweet Time Zone Normalization, etc.

2. Tweet Segmentation: After splitting a tweet text into a sequence of consecutive n-grams (n ≥ 1), each of which is called a segment, the latent topics of the tweet can be better captured. A segment can be a named entity (e.g., a movie title “magnolia”), a semantically meaningful information unit (e.g., “officially released”), or any other type of phrase which appears “more than chance” [56].

3. Named Entity Recognition: Given a named entity category in the knowledge base (e.g., Movies), along with the generated segments from the previous step, a Named Entity Recognizer (NER) identifies segments potentially mentioning an entity from the given category in the knowledge base. For example, given the category “Movies” and a set of segments [“officially released”, “paul thomas anderson”, “magnolia”], NER identifies “magnolia” as a potential reference to a target entity at en.wikipedia.org/wiki/Magnolia_(film) and computes an approximate initial probability of the identified mention referring to the corresponding target entity. The probability is often low due to the ambiguity of the identified mention. In the last example, the mention “Magnolia” could be referring to many different entities other than the target entity, such as a plant or a city with the same name.

4. Named Entity Linking: Given the output of NER, the Named Entity Linker (NEL) performs the disambiguation procedure. For example, given the identified potential mention of the target entity in the last step, “magnolia”, along with the identified potential target entity in the knowledge base and a set of other contextual segments extracted from the text of the tweet, NEL outputs ’1’, showing that the mention “magnolia” in the given context indeed refers to the specified target entity at en.wikipedia.org/wiki/Magnolia_(film).

5. Top-k Query Processing: At this stage, there is a unique interpretation of the top-k mentioned movies. After the previous step, we have a list of entities, each mentioned several times in the given set of tweets, and there is no ambiguity among them. One can sort the identified entities based on the number of times each entity is mentioned in the given set of tweets to report the final top-k most mentioned entities. For example, assume we identified five different movies mentioned in the given set of tweets: Catfish with 3 mentions indeed referring to the movie Catfish, Magnolia with 1, Bad Moms with 2, Frozen with 1, and Gummo with 1 mention indeed referring to its target movie entity in the knowledge base. Then, the final answer to a query such as “What are the top-2 mentioned Movies in the given set of tweets?” is 1-Catfish and 2-Bad Moms.
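Once every mention has been disambiguated, the final step reduces to counting and sorting, for example:

```python
from collections import Counter

def topk_entities(linked_mentions, k):
    """Given fully disambiguated mentions (one entity name per mention
    confirmed to refer to its target entity), count them and return the
    k most frequently mentioned entities."""
    counts = Counter(linked_mentions)
    return [entity for entity, _ in counts.most_common(k)]
```

Applied to the worked example above (Catfish 3, Bad Moms 2, Magnolia, Frozen, Gummo 1 each), `topk_entities(mentions, 2)` returns Catfish followed by Bad Moms.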

The problem in the above approach is that it incurs significant overhead, since the disambiguation procedure, i.e., Named Entity Linking, can be very complex and time-consuming. For instance, in our minimal implementation of NER and NEL based on Wikipedia as the underlying knowledge base, the average time over more than 1000 different queries for a call to the NER module is 22 ms, while a call to the NEL module takes around 35 times longer, close to one second. Therefore, performing Named Entity Recognition/Linking at the speed of Twitter’s firehose stream (6000 tweets per second) becomes computationally impractical. One possible solution is to limit the number of calls to the expensive Named Entity Linking procedure to only those identified mentions whose disambiguation would change the final top-k result or its accuracy.

In this chapter we design TkET (pronounced “ticket”), the goal of which is to sustain the top-k answer over an online stream of tweets in real time. TkET does two things:

(a) Minimizes the number of disambiguations needed to get the top-k answer.

(b) Exploits stream semantics to evaluate incrementally.

We focus on (a) in this chapter and leave (b) as a possible future work direction for TkET, based on an approach that exploits the predictability property of stream semantics to further reduce the number of calls to the expensive NEL and to evaluate results incrementally. To address (a), we developed two algorithms: a deterministic algorithm called TkET-D and a probabilistic version called TkET-P. The deterministic algorithm (TkET-D) gives the same answer to the specified top-k entities query posed over tweets as the answer we would get if we resolved everything following the steps above, while reducing the number of calls to the NEL module. Such a deterministic strategy, as we will see, can still incur a significant disambiguation cost, so we also design a probabilistic algorithm. The probabilistic algorithm does not provide the same guarantee as the deterministic one. Instead, it bounds the probability that an entity not in the reported top-k answer actually belongs in the top-k answer to an arbitrarily small amount. We study two variants of the probabilistic algorithm that offer different levels of guarantees to avoid extra cost in answering top-k entity queries.

TkET, in general, prefers to select the entity with the highest chance of being in the final top-k result, and selects the mention to disambiguate based on the potential effect of disambiguating that mention on reducing the uncertainty of the corresponding entity being in the top-k subset of identified entities. We stop the disambiguation as soon as 1) the probability that an entity which is not in the top-k could have a larger number of mentions than an entity from the current top-k result is bounded and below a threshold; this is our In-Degree based stopping criterion. Alternatively, TkET can be set to stop the disambiguation as soon as 2) the probability that an entity which is in the top-k has a larger number of mentions than any entity not in the current top-k result is bounded and above a threshold; this is the Out-Degree based stopping criterion. We formally and empirically show the effect of configuring TkET with different stopping criteria in Section 4.7.

In the next section, we describe the main idea behind our query-driven solution to answering the top-k entities query using an illustrative motivating example. We use the same example to explain different parts of our proposed framework in the rest of this chapter.

4.2 Motivating Example

We describe the main idea behind our query-driven solution to answering the top-k entities query using an illustrative example.

Table 4.1: Example Raw Tweets

ID  | Tweeter | Text                                                                                         | Time | Location
t1  | u1      | and now I watch magnolia                                                                     | t1   | l1
t2  | u2      | watching catfish just breaks my heart                                                        | t2   | l2
t3  | u3      | so bad moms grocery scene tears from laughter                                                | t3   | l3
t4  | u4      | just watched the catfish movie for the first time very interesting                           | t4   | l4
t5  | u5      | 6lbs catfish while bass fishing today Its frozen in the fridge now                           | t5   | l5
t6  | u6      | These types of white people are way more scarier then any ghetto black people watching gummo | t6   | l6
t7  | u7      | enjoying break time with my new favorite magnolia ice cream flavor avocado macchiato         | t7   | l7
t8  | u3      | bad moms was by far one of the best funniest moviesI’ve seen!                                | t8   | l8
t9  | u8      | nev looked so cute in the catfish movie when he was wearing his retainer goodby              | t2   | l9
t10 | u9      | i hated taking care of them but my favorite flower is the magnolia                           | t9   | l10
t11 | u10     | two years ago today tim baseball was the catfish of magnolia                                 | t10  | l10
t12 | u11     | I’m never going back The past is in the past let it go elsa (frozen)                         | t11  | l11

Social media analysts are responsible for analyzing a company’s presence on social media sites, which include Facebook, Twitter, Flickr, Instagram, Snapchat, and Youtube, as well as any company blogs. Consider a social media analyst who is working for a film and cinema magazine. As part of her daily work, she needs to continuously monitor Twitter for the list of most discussed movies. Assuming she has access to a stream of tweets relevant to her interest, she needs the list of top most mentioned movies to be reported periodically every hour. The goal is to identify a list of top-k entities from a specified category of entities in the knowledge base (movies in this example) in a sliding-window streaming fashion. The first step is to identify potential mentions of target named entities (e.g., movies). Consider the set of tweets collected during the last hour of the tweet collection process, shown in Table 4.1. The potential mentions of identified target named entities are shown in bold in the text of the tweets in Table 4.1. The mentions “Magnolia”, “Catfish”, “Bad Moms”, “Frozen” and “Gummo” are the identified potential references to movies as entities in the real world. Each of these entities corresponds to a target entry in the knowledge base. In order to answer the top-k entities query on top of such data, the brute force approach counts the number of potential mentions for each movie and returns the top-k entities based on the counts. This, however, may not result in the correct determination of the top-k result.

For example, the top-2 most frequently mentioned movies in the set of tweets shown in Table 4.1 are Catfish, with 5 identified potential mentions, and Magnolia, with 4 mentions. Simply returning Magnolia and Catfish may not, however, be correct, since we cannot be sure that the Catfish and Magnolia mentions indeed refer to the corresponding target movie entities. For instance, the mention “catfish” in tweet t5 refers not to the movie, but to the fish named catfish. In order to correctly count mentions of the target entity (from the movie category), potential mentions need to be disambiguated.

The quality of the disambiguation function depends on the sophistication of its algorithm. For example, to disambiguate mentions to real world entities in tweets, there is a large body of research on proper entity linking. In particular, mainly due to the short nature of tweets, there have been several recent research contributions on how to use a variety of contextual information, such as the social connections between users [43], the theme of the tweeter’s past tweets to get a sense of the tweeter’s general interests [43], or the spatial/temporal aspects of tweets in entity linking [35]. The more advanced the disambiguation function, generally, the higher the cost of disambiguation; therefore, in order to do disambiguation at the speed of the stream, one should minimize the number of calls to the disambiguation function. The worst case scenario is to disambiguate all the potential mentions of all identified entities, then sort the entities based on the number of identified mentions referring to them, and return the list of top-k from the sorted list of entities. Knowing the value of k (i.e., knowing the details of the final analysis task), as we will see, we can avoid a considerable part of the disambiguation overhead and output a bounded probabilistic answer.

Before developing our idea further, let us, in the next section, introduce some notation and formally define the related concepts and the problem setting.

4.3 Preliminaries

In this section, we present the preliminaries needed to understand our framework. Our implementation of TkET follows a count-based window model to query the continuous Twitter stream, but the concepts and algorithms in this chapter can easily be adapted to the time-based window scenario as well.

4.3.1 Window-based Stream

Let S be a stream of tweets with either a count- or time-based query window. The window has a fixed size w that represents the maximum number of tweets that can be processed in a window (in count-based windows) or a time interval (in time-based windows). The window periodically slides on S from window Wi to window Wi+1 by a query slide s. In count-based windows, this slide causes the new window Wi+1 to include s new tweets from S and causes the oldest s tweets from Wi to expire and be removed from Wi+1. In time-based windows, the new window Wi+1 contains only the tweets with timestamps greater than or equal to the current time minus w.
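The count-based window model described above can be sketched as a small container that admits new tweets and reports the ones it expires; this is a minimal illustration, not the actual TkET window implementation:

```python
from collections import deque

class CountWindow:
    """Count-based sliding window holding at most w tweets.
    slide(new_tweets) adds the incoming tweets and returns the
    oldest tweets expired to keep the window at size w."""

    def __init__(self, w):
        self.w = w
        self.tweets = deque()

    def slide(self, new_tweets):
        expired = []
        overflow = len(self.tweets) + len(new_tweets) - self.w
        for _ in range(max(0, overflow)):
            expired.append(self.tweets.popleft())  # oldest tweets expire
        self.tweets.extend(new_tweets)
        return expired
```

A time-based window would instead evict by timestamp rather than by count, following the same slide pattern.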

Figure 4.1: NER and NEL Black Box Interfaces. [Diagram: NER takes a potential mention string (e.g., “catfish”) and a KB category (e.g., Movies), and outputs the potential target KB entity (e.g., the movie Catfish) together with the probability of the mention referring to a target entity from the given category in the KB. NEL takes the potential mention, the potential target KB entity, and contextual key-phrases (e.g., “film”, “actor”, “director”), and outputs whether the mention indeed refers to the target entity or not (0/1).]

4.3.2 Data Cleaning Functionalities

In TkET, we assume a few core data cleaning functionalities are given to TkET through “black box” modules. The main black-box functions are the Tweet Segmenter, the NER or Named Entity Recognizer, and the NEL or Named Entity Linker.

Tweet Segmenter

Recently, segment-based tweet representation has demonstrated effectiveness in named entity recognition (NER) and event detection from tweet streams [56]. To be able to identify potential target entity mentions in tweets, TkET (using the Tweet Segmenter) generates a list of n-grams potentially referring to a target entity to be passed to the NER. In our experimental implementation of the Tweet Segmenter, we output 1-, 2-, and 3-grams after preprocessing the tweet’s text. In order to be more efficient, we approximate the commonness of each term in the tweet through Wikipedia, and to avoid processing very common phrases, each output n-gram must contain at least one term whose commonness is below a threshold. Tweet text segmentation is a dynamic area of research. In [55], the authors present a novel framework, HybridSeg, which aggregates both local context knowledge and global knowledge bases in the process of tweet segmentation. Employing more sophisticated tweet segmentation methods in TkET would improve its efficiency and the accuracy of its final answer. After a tweet is segmented into potential entity mentions as the output of the Tweet Segmenter, TkET passes the generated segments to the NER module alongside the category in the knowledge base specified by the continuous top-k entities query.
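A toy version of the segment-generation step described above might look like the following; `commonness` is a hypothetical mapping from a term to its Wikipedia-estimated commonness, standing in for the actual statistics the system computes:

```python
def candidate_segments(text, commonness, threshold=0.9, max_n=3):
    """Generate 1-, 2-, and 3-gram segments from a (preprocessed) tweet
    text, keeping only the n-grams that contain at least one term whose
    commonness is below the threshold."""
    words = text.lower().split()
    segments = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # unknown terms default to fully common and are filtered out
            if any(commonness.get(w, 1.0) < threshold for w in gram):
                segments.append(" ".join(gram))
    return segments
```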

Named Entity Recognizer (NER)

As Figure 4.1 illustrates, the inputs to the NER module are the string of a potential mention of a target entity from a given category, as well as the category in the knowledge base itself. For example, passing “catfish” as a potential mention of a “Movie” to the NER module, NER outputs a probability value for the chance of “catfish” referring to a target entity from the specified category of “Movies”. There is a large body of research, from natural language processing to machine learning, on Named Entity Recognition/Identification, and there are very sophisticated methods to perform NER.

For the purpose of conducting our experiments, we have implemented a simple lookup function based on the English Wikipedia (as the underlying knowledge base) to look for articles in Wikipedia (from the given category) with a title matching the given potential mention. In our simple lookup function, we approximate the initial probability of the given mention referring to a target entity from the given category by incorporating a) the textual similarity between the given mention and the title of the article corresponding to the potential target entity in Wikipedia, and b) the number of other possible entities in the knowledge base that could be referred to in the same way. Note that the contribution of TkET is not in the broad topic of named entity recognition, and our implementations of the black-box functionalities are minimal and serve the purpose of comparative experimental evaluation. We explain the implementation of the NER module in TkET further in Section 4.7.
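A toy lookup in the spirit of the simple NER just described could be sketched as follows; `kb` (category to article titles) and `alias_counts` (surface form to number of entities sharing it) are hypothetical structures, and the similarity/ambiguity formula is an illustrative assumption, not the dissertation's actual implementation:

```python
def ner_lookup(mention, category, kb, alias_counts):
    """Title-lookup NER sketch: match the mention against KB article
    titles of the given category, and approximate the initial probability
    from (a) title/mention textual similarity and (b) how many other KB
    entities share the same surface form."""
    mention_l = mention.lower()
    for title in kb.get(category, []):
        if mention_l in title.lower():
            sim = len(mention_l) / len(title)           # crude similarity
            ambiguity = alias_counts.get(mention_l, 1)  # entities sharing the name
            return title, sim / ambiguity               # (entity, initial probability)
    return None, 0.0
```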

Named Entity Linker (NEL)

The task of linking the potential named entity mentions detected in tweets with the corresponding real world entities in the knowledge base is called tweet entity linking. This task is of practical importance and can facilitate many different applications, such as personalized recommendation and user interest discovery. The tweet entity linking task is challenging due to the noisy, short, and informal nature of tweets. Previous methods focus on linking entities in Web documents, and largely rely on the context around the entity mention and the topical coherence between entities in the document. However, these methods cannot be effectively applied to the tweet entity linking task due to the insufficient context information contained in a tweet.

The other black-box function used by TkET is NEL, the Named Entity Linking module. As Figure 4.1 illustrates, the inputs to the NEL module are the string of a potential mention of a given target entity in the knowledge base and a list of key-phrases extracted from the terms surrounding the mention in the text of the corresponding tweet. The set of key-phrases could also be extracted from the historical tweets of the tweeter who posted the tweet, by modeling his topical interests. For example, given a tweet such as “Catfish (2010) Dir. Henry Joost and Ariel Schulman”, passing “catfish” as a mention of the potential movie entity “Catfish” in the knowledge base (https://en.wikipedia.org/wiki/Catfish_(film)) alongside a list of key-phrases such as “2010, schulman, joost” to the NEL module, it outputs ’1’. The ’1’, as the output of the NEL module, means the mention “Catfish” in the tweet above indeed refers to the corresponding target entity’s entry in the knowledge base. There is a large body of research, from natural language processing to machine learning, addressing Named Entity Linking. In particular, Twitter has recently received more and more attention from the research community in terms of knowledge base oriented entity linking, and there are very sophisticated methods to perform NEL.

There is a large body of research on how to use a variety of contextual information, such as the social connections between users [43], the theme of the tweeter’s past tweets to get a sense of the tweeter’s interests [43], or the spatial and temporal aspects of tweets in entity linking [35]. Since disambiguation functions are often expensive, calling NEL is in fact the bottleneck of any query evaluation approach. We note that our algorithms/approaches are meant for the situation where NEL is not computationally cheap and calling it is in fact the bottleneck of a query evaluation approach. One possible direction for future work on TkET is to employ more sophisticated implementations (or commercial products) of the NER and NEL functions. For instance, in [82], Shen et al. propose KAURI, a framework which can effectively link named entity mentions detected in tweets with a knowledge base via user interest modeling. Incorporating KAURI into TkET would improve the accuracy and the semantics of TkET’s top-k result.

Let us assume that the disambiguation function, or Named Entity Linker, NEL(ti, ej, ck), is a black-box function that, given a tweet ti containing a mention potentially referring to an entity ej, and a set of other key-phrases extracted from the text of ti as ck, determines whether the mention indeed refers to ej or not. In general, such disambiguation functions can be expensive, requiring compute-intensive algorithms, consulting external data sources, and/or seeking human input. We will assume that a NEL module may return a classification, a binary answer, or a numeric similarity value (confidence). For the purpose of embedding such a function in our framework, the outcome of NEL is mapped into either Yes/’1’, indicating that the mention refers to the target entity, or No/’0’ otherwise. In our implementation of TkET, we use the knowledge base directly to implement named entity identification and disambiguation as knowledge base lookup functions. We explain the implementation of the NEL module in TkET further in Section 4.7.
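The mapping of a NEL backend's heterogeneous output (boolean, confidence, or class label) to the binary ’1’/’0’ used by the framework could be wrapped as follows; `raw_nel` is a hypothetical stand-in for any concrete disambiguation backend, and the threshold is an illustrative choice:

```python
def nel(tweet, entity, keyphrases, raw_nel, threshold=0.5):
    """Wrap an arbitrary disambiguation backend as the black-box
    NEL(t_i, e_j, c_k). The backend may return a boolean, a numeric
    confidence, or a class label; the result is mapped to 1 (mention
    refers to the target entity) or 0 (it does not)."""
    out = raw_nel(tweet, entity, keyphrases)
    if isinstance(out, bool):                 # binary answer
        return int(out)
    if isinstance(out, (int, float)):         # numeric confidence
        return int(out >= threshold)
    return int(str(out).strip().lower() in {"yes", "1", "true"})  # label
```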

4.3.3 Entity Blocks

At the beginning of each window, TkET updates the content of the window based on the query slide and then divides the set of tweets of the window into a set of entity blocks corresponding to each identified potential target entity in E. For notational simplicity, we refer to the entity block corresponding to the real entity ei in the knowledge base as ei as well. Hence, we use E = {e1, e2, e3, . . . , e|E|} to refer interchangeably to the set of identified entities or the set of their corresponding entity blocks within this chapter. An entity block ei consists of all tweets containing mentions potentially referring to ei. Figure 4.14 shows the set of identified entity blocks corresponding to the collection of tweets in Table 4.1. Entity block b1, for example, corresponds to one target entity, the movie Catfish, and its tweets contain five Catfish mentions; only three of them indeed refer to the movie Catfish.
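The grouping of a window's tweets into entity blocks can be sketched as follows; `find_mentions` is a hypothetical stand-in for the segmentation/NER steps and returns the potential target entities mentioned in a tweet:

```python
from collections import defaultdict

def build_entity_blocks(tweets, find_mentions):
    """Group a window's tweets into entity blocks: block e_i holds every
    tweet containing a mention potentially referring to entity e_i."""
    blocks = defaultdict(list)
    for tweet in tweets:
        for entity in find_mentions(tweet):
            blocks[entity].append(tweet)  # one tweet may join several blocks
    return dict(blocks)
```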

4.3.4 Mention Probabilities

For each newly identified mention, TkET then computes an initial probability value representing the probability that the mention indeed refers to the target entity of the corresponding entity block. In order to simplify the notation, we assume that in each tweet ti there can be at most one mention potentially referring to a target entity corresponding to an entity block. We thus do not introduce new notation to denote mentions in tweets, and use the tweet-id tj in an entity block ei to refer to the mention of the target entity ei in the text of tweet tj. The probability that the mention in tj refers to the target entity is denoted as p(tj). TkET initially approximates this value based on the degree of textual similarity between the mention and the definition of the target entity in the knowledge base. TkET assumes there is a given named entity identification function NER(tj, ci), provided as a black box, to compute a probability p(tj) for the identified mention in tj referring to a target entity from the knowledge base category ci specified by the query (e.g., Movies). In our implementation of TkET, we implemented a simple lookup function based on Wikipedia article titles to perform the named entity identification.

Later, TkET updates these probabilities to incorporate the contextual positive correlations between tweets belonging to the same entity block. For instance, as a possible contextual positive correlation, consider the presence of the common term “watch” in tweets t2 and t4 in the motivating example in Table 4.1. Tweets can be positively correlated based on such common terms; hence, if the function NEL indicates that the mention in t2 refers to the movie Catfish, this provides more evidence that the Catfish mention in t4 also refers to the movie Catfish. TkET models the positive correlation among the tweets of the same block based on several tweet dependencies. In our experiments, we consider Location Proximity (closeness of tweet geo-tags) and Same Tweeter User as potential positive correlation indicators. Such correlations help compute the probability values of mentions belonging to an entity block more accurately, which in turn helps determine which mention to disambiguate next.
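The two correlation indicators named above can be sketched as simple boolean factor tests. This is only an illustration: the `Tweet` fields, the haversine helper, and the 10 km threshold are assumptions for the sketch, not TkET's actual implementation.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Tweet:
    tweet_id: str
    user_id: str
    lat: float = None   # geo-tag; None when the tweet is not geo-tagged
    lon: float = None

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two geo-tags in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def location_proximity_factor(t1, t2, threshold_km=10.0):
    """Positive correlation indicator: both tweets geo-tagged and close by."""
    if None in (t1.lat, t1.lon, t2.lat, t2.lon):
        return False
    return haversine_km(t1.lat, t1.lon, t2.lat, t2.lon) <= threshold_km

def same_tweeter_factor(t1, t2):
    """Positive correlation indicator: both tweets posted by the same user."""
    return t1.user_id == t2.user_id
```

In the factor graph, a pair of tweets for which one of these tests holds would be connected by a pairwise factor encouraging their mention variables to agree.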

4.3.5 Continuous Top-k Query

Imagine a person watching the Oscars on television who wishes to continuously monitor the Twitter stream for the top movies discussed in tweets over the last 5 minutes, refreshed every minute. Such a task can be expressed using the following continuous top-k query:

q1: “Retrieve Top-2 Movies over Windows of 5 minutes Sliding every 1 minute.”

or more formally as q1 = (2, Movies, 5, 1). A continuous top-k query q = (k, c, w, s), where k is the number of entities to return, c is the knowledge base category in which to look for potential target entity mentions in tweets, w is the window size, and s is the slide size, when issued on a stream of tweets S, returns the k entities that are mentioned the most within each window of the stream. In TkET, the information about available entity categories and their hierarchical structure is provided by the knowledge base.
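The query four-tuple can be captured in a small container. The class and field names here are hypothetical, chosen only to mirror q = (k, c, w, s):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopKQuery:
    """Continuous top-k query q = (k, c, w, s)."""
    k: int          # number of entities to return
    category: str   # knowledge-base category, e.g. "Movies"
    window: int     # window size (minutes)
    slide: int      # slide size (minutes)

# q1: "Retrieve Top-2 Movies over Windows of 5 minutes Sliding every 1 minute."
q1 = TopKQuery(k=2, category="Movies", window=5, slide=1)
```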

When the window is full, the real size of an entity block is defined as the number of mentions that indeed refer to the target entity corresponding to the block. Assume a continuous sliding-window top-k query qi and a set of identified potential entity mentions for the identified entity set E = {e1, e2, ..., e|E|} in window Wj. Each identified potential target entity ej corresponds to a set of identified mentions (tweets containing mentions) together with their probabilities of referring to the target entity. For example, e1 = {(t11, p11), ..., (t1|e1|, p1|e1|)} shows the mentions and their probabilities of referring to the target entity e1. Given this probabilistic representation, a top-k entities query answer result X, containing the k most-mentioned entities, is as follows:

∀ey ∈ X , ∀ez ∈ (E − X ): |ey| ≥ |ez| (4.1)

Equation 4.1 states that every entity in the top-k result X has a number of mentions indeed referring to its target entity that is larger than or equal to that of any entity not in the top-k list. We call a top-k result set Xd that satisfies Equation 4.1 a deterministic top-k result.

Also, in order to define a probabilistic top-k result, let us define p(e1 ≥ e2) as the probability of the size of entity block e1 being at least the size of entity block e2; in other words, p(e1 ≥ e2) is the probability that the number of mentions referring to e1 is greater than or equal to the number of mentions referring to e2. Then there exists a deterministic top-k entities solution X that satisfies Equation 4.2.

∀ey ∈ X , ∀ez ∈ (E − X ): p(ey ≥ ez) = 1 (4.2)

In the rest of this chapter, we first show, in Section 4.4, the standard deterministic solution for answering such a top-k entities query. Then, in Section 4.5, we explain how we use probabilistic graphical models to model and compute intra-entity-block mention probabilities and the size probability of entity blocks in TkET, and we introduce the main TkET algorithm in Section 4.5.6. We discuss the experimental setup and the empirical results in Section 4.7. Finally, after covering related work in Section 4.8, we conclude our discussion of TkET in Section 4.9.

4.4 Deterministic Top-k

Standard Deterministic Solution

The standard solution for generating a correct answer to a top-k query on a window Wi calls the disambiguation (NEL) function on each and every identified mention in the window and then returns the k target entities that are mentioned the most in the window. For instance, assuming that W1 contains the tweets in Table 4.1, the top-2 result of the standard solution is {Catfish, Bad Moms}.
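A minimal sketch of this standard solution, assuming a black-box `disambiguate` oracle (standing in for the NEL module) and the mentions given as (tweet-id, candidate-entity) pairs; the function and parameter names are illustrative:

```python
from collections import Counter

def standard_topk(mentions, k, disambiguate):
    """Standard deterministic solution: disambiguate every identified
    mention in the window, then return the k most-mentioned entities.

    `mentions` is a list of (tweet_id, candidate_entity) pairs;
    `disambiguate(tweet_id, entity)` returns True iff the mention in the
    tweet indeed refers to the candidate entity.
    """
    counts = Counter()
    for tweet_id, entity in mentions:
        # One NEL call per mention: exactly the cost TkET tries to reduce.
        if disambiguate(tweet_id, entity):
            counts[entity] += 1
    return [entity for entity, _ in counts.most_common(k)]
```

The number of oracle calls is always equal to the number of mentions in the window, regardless of how skewed the mention counts are.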

Optimized Deterministic Solution

Before we describe our TkET approach, which provides an approximate answer, we first discuss a more optimized solution that provides the true deterministic answer under the answer semantics discussed above, but with, on average, fewer calls to the disambiguation function than the standard deterministic solution.

In Algorithm 4, we leave open several subroutines that decide which block, and within a chosen block which mention, to evaluate next. A full algorithm needs to fix those functions, and the goal is to choose the functions that minimize the number of disambiguations performed before the top-k answers are retrieved. Note that in Algorithm 4, while the number of disambiguations may vary before the algorithm terminates, the answer is deterministic. The algorithm continues disambiguating until it is certain that no other unresolved block could be in the top-k result.

ALGORITHM 4: TkET-D: Optimized Deterministic Solution
input : k, a set of identified Entity Blocks E
output: TopK ⊆ E
 1  for all Entity Block ei ∈ E do
 2      ei.status = Unknown;
 3      ei.certainYES = ei.certainNO = ei.uncertain = 0;
 4      for all tj ∈ ei do
 5          if p(tj) == 1 then
 6              ei.certainYES++;
 7          if p(tj) == 0 then
 8              ei.certainNO++;
 9      ei.uncertain = ei.count - ei.certainYES - ei.certainNO;
10  done = false;
11  if |E| < k then
12      done = true;
13      return E;
14  TopK = {};
15  while not done do
16      eselected = SelectBlock(E) ∈ E;
17      tselected = SelectMention(eselected) ∈ eselected;
18      if Disambiguate(tselected) == 1 then
19          eselected.certainYES++;
20          eselected.uncertain--;
21          if ∃D ⊆ E, |D| ≥ |E| − k, ∀ex ∈ D: eselected.certainYES > ex.certainYES + ex.uncertain then
22              TopK = TopK ∪ {eselected};
23              if |TopK| ≥ k then
24                  done = true;
25                  return TopK;
26      else
27          eselected.certainNO++;
28          eselected.uncertain--;
29  return Null;

In Algorithm 4, lines 1-10 initialize, for each entity block ei, the count of certain-Yes mentions (mentions known to refer to the target entity ei), and similarly the certain-No and uncertain counts. Line 16 invokes the inter-entity-block selection method SelectBlock, which, given the set of all identified entity blocks E, selects a block for disambiguation with the objective of minimizing the total number of calls to the disambiguation function needed to answer the top-k mentioned entities query. In line 17, the algorithm calls an external method to select a mention (tweet) to disambiguate; again, the objective of mention selection is to minimize the total number of disambiguation calls. Lines 21-25 check the stopping criterion: the algorithm stops once k entity blocks each certainly dominate at least |E| − k other entity blocks.

4.5 Probabilistic Top-k

We use factor graphs to model the dependencies between tweets with potential entity mentions and to efficiently compute the probability distribution of the size of each identified entity block.

Let p(e1 ≥ e2) be the probability of the size of entity block e1 being at least the size of entity block e2; in other words, p(e1 ≥ e2) is the probability that the number of mentions referring to e1 is greater than or equal to the number of mentions referring to e2. Our goal is, given a probability threshold τ ∈ [0, 1], to develop an approximate top-k entities solution X with one of two possible probabilistic guarantees, such that the number of calls to the disambiguation function (the NEL module) issued by TkET to generate X is minimized. Under the first probabilistic guarantee, every entity in the top-k result dominates every entity not in the top-k result with probability greater than the threshold τ. We call this the Out-Going based definition. Following definition one, X satisfies Equation 4.3.

∀ey ∈ X , ∀ez ∈ (E − X ): p(ey ≥ ez) > τ (4.3)

Alternatively, under the second definition, for every entity not in the top-k result, the probability of it entering the top-k by dominating any entity in the top-k result is bounded and small. We call this the In-Coming based definition. Following the second definition, X satisfies Equation 4.4.

∀ey ∈ X , ∀ez ∈ (E − X ): p(ez ≥ ey) < 1 − τ (4.4)

4.5.1 Factor Graph

A factor graph is a bipartite graph representing the factorization of a function. In probability theory and its applications, factor graphs are used to represent the factorization of a probability distribution function, enabling efficient computations such as computing marginal distributions through the sum-product algorithm. One of the important success stories of factor graphs and the sum-product algorithm is the decoding of capacity-approaching error-correcting codes, such as LDPC and turbo codes [12]. A factor graph is a type of probabilistic graphical model with two types of nodes, (random) variables V and factors F, with edges between variables and the factors in which they appear [49]. In our setting, a random variable corresponds to a tweet in an entity block containing a mention potentially referring to the target entity of the block.

The random variable takes the value 1 if the tweet's mention indeed refers to the target entity, and 0 otherwise. A factor is a function of random variables and is used to evaluate the correlations among them. In TkET, we define factors as binary positive correlations between a pair of tweets. For example, when two tweets were posted from geo-locations closer than a threshold, this location proximity indicates a degree of positive correlation between the mentions in those tweets referring to the target entity. The set of factors we consider in our experiments is defined in Section 4.7. The probability of a possible world is then defined to be proportional to a weighted combination of the factor functions [49]. Factor graphs can be combined with message passing algorithms to efficiently compute certain characteristics of the random variables, such as their marginal distributions.

Message Passing on Factor Graphs

A popular message passing algorithm on factor graphs is the sum-product algorithm [49], which efficiently computes all the marginals of the individual random variables. In particular, the marginal of variable Xk is defined as:

gk(Xk) = Σ_{X̄k} g(X1, X2, . . . , Xn)    (4.5)

where the notation X̄k means that the summation goes over all the variables except Xk. The messages of the sum-product algorithm are conceptually computed in the vertices and passed along the edges. A message from/to a variable vertex is always a function of that particular variable. For instance, when a variable is binary, the messages over the edges incident to the corresponding vertex can be represented as vectors of length two: the first entry is the message evaluated in 0, the second entry is the message evaluated in 1. Other probabilistic models such as Bayesian networks and Markov networks can be represented as factor graphs; the latter representation is frequently used when performing inference over such networks using belief propagation. On the other hand, Bayesian networks are more naturally suited for generative models, as they can directly represent the causalities of the model [52].

Figure 4.2: Factor Graph for the “Catfish” Entity Block

4.5.2 Entity Probabilistic Model

We keep a factor graph per entity block, updated during the lifetime of TkET. We visualize a small example factor graph in Figure 4.2 for the entity block of “Catfish” in Table 4.1. Note that this represents a probabilistic factor graph with random variables, not an entity-relationship diagram. Specifically, the factor graph includes random variables capturing the uncertain target entity of each mention, and random variables capturing the uncertain count of references to the target entity for each entity block.

Entity block ei has a factor graph model with a node for each binary variable tj corresponding to a candidate mention in tweet tj in entity block ei, with a prior probability provided by the knowledge base. We assign pairwise factors between positively correlated tweets in the block, and we add an additional variable Ci to the model representing the count of tweets whose mentions indeed refer to the target entity, i.e., the real size of the entity block. The naive way to model this is to add a factor connected to all binary variables tj ∈ ei and the variable Ci that enforces consistency between the variables (so that the joint distribution P(Ci, t1, t2, . . . , t|ei|) has zero probability when Ci is not the sum of the tj).

However, the sum-product algorithm [50] that we use to evaluate marginals has to sum over all possible configurations (i.e., the message from all the tj variables to the Ci variable involves summing over 2^|ei| configurations), which is not feasible given the high speed of the Twitter stream. We address this performance problem with a divide-and-conquer reshaping of the factor graph model: we build a tree of intermediate sums by adding additional intermediate variables that hold the intermediate counts. Assuming ei = {t0, . . . , t|ei|}, at the lowest level we split the set of tweets in ei into two subsets of roughly equal size, ei1 = {t0, . . . , t⌊|ei|/2⌋} and ei2 = {t⌊|ei|/2⌋+1, . . . , t|ei|}. We then continue to split recursively until we reach sets of random variables of size two. At level l of the binary tree, we have |ei|/2^l variables, each of which can take on one of 2^(l−1) + 1 states. The factors in this tree connect just two nodes to one node in the layer above, so inference at the l-th layer involves summing over O(2^(3l)) states for each of the |ei|/2^l factors at that layer. Since there are only log(|ei|) layers, this is significantly more efficient than the naive approach when resolving a large number of mentions. In the example factor graph in Figure 4.2, the random variable EC shows the real-size distribution of the corresponding entity block, Catfish, at each instant of time. The values in the size probability table of each entity block guide TkET in selecting further mentions for disambiguation in order to reduce the remaining uncertainty.
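Under an independence assumption (i.e., ignoring the pairwise correlation factors), the block-size distribution computed by the summation tree reduces to a divide-and-conquer convolution of Bernoulli distributions, which the following sketch illustrates; the function names are hypothetical and this is a simplification of the full factor-graph computation:

```python
def bernoulli_pmf(p):
    """Distribution over {0, 1} for a single mention with probability p."""
    return [1.0 - p, p]

def convolve(a, b):
    """Distribution of the sum of two independent integer-valued counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def block_size_pmf(mention_probs):
    """Divide-and-conquer summation tree: recursively split the mentions,
    compute each half's count distribution, and merge the halves with one
    convolution, mirroring the intermediate-size variables of the tree."""
    if len(mention_probs) == 1:
        return bernoulli_pmf(mention_probs[0])
    mid = len(mention_probs) // 2
    return convolve(block_size_pmf(mention_probs[:mid]),
                    block_size_pmf(mention_probs[mid:]))
```

For three mentions with probability 0.5 each, this yields the Binomial(3, 0.5) distribution over block sizes 0 through 3.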

Pairwise Degree of Dominance

We calculate the pairwise dominance degree from ei to ej, where ei, ej ∈ E, i.e., the probability of ei dominating ej in the size order of entity blocks, using the following formula:

d(ei > ej) = Σ_{c=0}^{max(|ei|)} p(ei = c) × p(ej ≤ c)    (4.6)

where c takes size values from 0 to the maximum possible size of ei. Since we are working with a probabilistic representation, to increase the efficiency of our top-k entities query answering algorithm, we introduce a client-specified threshold τ on the pairwise degree of dominance. When d(ei > ej) > τ, we assume there is no further need to increase the degree of this dominance and take ei to be the dominant entity block between ei and ej. Similarly, when d(ei > ej) < 1 − τ, we assume ei does not dominate ej and there is no further need to increase the confidence.

Based on the definition above, the dominance degree between two nodes ei and ej includes the probability of equality on both sides. As a result, d(ei > ej) + d(ej > ei) ≥ 1. Hence, for a pair ei and ej, the probabilities of their mentions could follow distributions such that d(ei > ej) > τ and d(ej > ei) > τ, meaning the definition above is not a dominance ordering criterion. In our implementation, when there is such a tie, we select as dominant the entity block with the greater dominance degree. The intuition behind this choice is that such a situation happens when the expected target sizes of the entities are very similar; in that case, either of them would satisfy the top-k conditions.

A more conservative definition of the dominance degree does not include the equality property on both sides and results in slightly worse efficiency in our experiments. We define the conservative version of the dominance degree as follows:

d(ei > ej) = Σ_{c=0}^{max(|ei|)} p(ei = c) × p(ej < c)    (4.7)
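Both variants of the dominance degree can be computed directly from the two size distributions. This sketch assumes the distributions are given as plain probability lists indexed by size; the function name and `strict` flag are illustrative:

```python
def dominance_degree(pmf_i, pmf_j, strict=False):
    """Pairwise degree of dominance d(e_i > e_j) from the two size
    distributions (Equation 4.6); `strict=True` gives the conservative
    variant of Equation 4.7 that excludes the equality case."""
    d = 0.0
    for c, p_i in enumerate(pmf_i):
        # Cumulative P(|e_j| <= c), or P(|e_j| < c) for the strict variant.
        bound = c if strict else c + 1
        d += p_i * sum(pmf_j[:bound])
    return d
```

Note that for two identical distributions the non-strict variant exceeds 0.5 in both directions, illustrating why d(ei > ej) + d(ej > ei) ≥ 1 and why the definition is not by itself an ordering criterion.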

4.5.3 Entity Dominance Graph (EDG)

Next, TkET builds an Entity Dominance Graph (EDG), where nodes represent the entities in E and edges represent the dominance degrees between them. An edge from ei to ej with weight d(ei > ej) shows the dominance degree of ei over ej. For the purpose of representation, as Figure 4.3 shows, we draw edges with degree greater than the threshold τ, or less than 1 − τ, as solid edges: TkET assumes there is no further need to disambiguate their values.

In order to model the ordering between identified entity blocks, TkET generates a directed weighted graph, the Entity Dominance Graph (EDG), with |E| nodes corresponding to the currently identified entity blocks in the system. In the EDG, there is an edge from entity block ei to ej ≠ ei in E = {e1, ..., e|E|} with weight dij showing the dominance degree of ei over ej. Note that the EDG may simultaneously contain an edge from ei to ej and an edge from ej to ei.

Approximate Degrees

Definition 3. For every node ei in EDG, we define the sum of weights on the ingoing adjacent edges of ei as its Approximate In-Degree AID(ei), and the sum of weights on its adjacent outgoing edges as its Approximate Out-Degree AOD(ei).

Figure 4.3: Entity Dominance Graph Example

For example, in Figure 4.3, AID(magnolia) = 3.08 and AOD(magnolia) = 2.43. Since the size probabilities of each entity block are already computed and available from the factor graph, the computational complexity of calculating d(ei > ej) is O(|ei| × |ej|); assuming an equal size l for all entity blocks, this becomes O(l²), which is quadratic, and the total cost of computing all pairwise probabilities of dominance is on the order of O(|M|² × l²). Figure 4.3 shows the initial entity dominance graph for our driving example in Sec. 4.2. The solid edges are the edges with a probability of dominance greater than the threshold τ = 0.9 or less than 1 − τ. For example, based on the dominance probabilities in Figure 4.3, with an 85% chance entity “Bad Moms” will end up having an equal or greater number of mentions to its corresponding target entity compared to entity “Magnolia”.
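A minimal sketch of Definition 3 over a toy edge map; representing the EDG as a dict of weighted edges is an assumption for illustration, not TkET's internal data structure:

```python
def approximate_degrees(edges, node):
    """Approximate In-Degree and Out-Degree of an EDG node (Definition 3):
    the sums of the dominance-degree weights on its incoming and outgoing
    edges. `edges` maps (src, dst) pairs to d(src > dst)."""
    aid = sum(w for (src, dst), w in edges.items() if dst == node)
    aod = sum(w for (src, dst), w in edges.items() if src == node)
    return aid, aod
```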

4.5.4 Selection Criteria

In order to find the top-k mentioned entities, we start by selecting entity blocks and disambiguating mentions inside them to update the probabilistic distributions of entity block sizes. Selection is performed at two levels: inter-entity-block selection, and intra-block selection of the mention to disambiguate within the selected block. After each disambiguation step, the probability distribution of the size of the selected entity block is updated: after every call to the disambiguation function, the success probability of the corresponding random variable in the factor graph of the selected entity block is updated, the marginals are recomputed on the updated factor graph, and selection in the following iterations is based on the updated size probabilities.

Inter Entity Block Selection

In order to disambiguate the minimum number of mentions from entity blocks, we define a selection criterion to choose entity blocks for disambiguation in an iterative manner. The criterion is based on the intuition that the greater the approximate out-degree of a node in the EDG, and the smaller its approximate in-degree, the higher the chance of that node being among the top-k entity blocks. Therefore, in Definition 4, we define the Dominance Degree as a metric that linearly orders entity blocks by an approximate comparative measure of their chance to be in the final list of top-k mentioned entities.

Definition 4. Dominance Degree: The dominance degree of a node ei is defined as the difference between its approximate out-degree and approximate in-degree: DD(ei) = AOD(ei) − AID(ei).

The dominance degree gives the algorithm a relative criterion for choosing the most in-doubt but promising entity blocks for the list of top-k. Initially, based on the numbers in Figure 4.14 for the motivating example, the maximum dominance degree belongs to Magnolia with DD(magnolia) = 2.345, so it is selected as a candidate entity block for further disambiguation. Upon selection of an entity block, TkET selects a mention from the selected entity block to disambiguate. It then passes the selected mention, within its tweet tj, to the black-box disambiguation lookup function alongside the potential target entity ei. The function D(tj, ei) returns 1 if the mention indeed refers to ei, and 0 otherwise.

Intra Entity Block Mention Selection

In TkET, we select the mention in the selected entity block whose disambiguation results in the maximum expected improvement of the DD of the selected entity block. In other words, if ei is the selected entity block, the intra mention selection problem is a maximization problem whose goal is to find the optimal mention such that:

t* = arg max_{tj ∈ ei} [ p(tj) · DD(ei^{+tj}) + (1 − p(tj)) · DD(ei^{−tj}) ]    (4.8)

where DD(ei^{+tj}) is the DD after a positive disambiguation of mention tj and, similarly, DD(ei^{−tj}) after a negative disambiguation of tj. We choose the mention to disambiguate from the selected entity block that satisfies Equation 4.8.

4.5.5 Stopping Criteria

TkET iteratively selects entity blocks and mentions from them in order to determine whether those mentions indeed refer to the target entity or not. However, the critical question for an optimized cleaning process that answers the query is: when to stop? We define the stopping criteria through the following theorems based on the EDG.

Figure 4.4: In-Degree vs. Out-Degree based Stopping Criteria

Stopping Criteria (Out-Degree Based)

For each node in EDG corresponding to an entity block ei, we define its Hard Out-Degree HOD(ei, X), where the set X contains the entity blocks already selected to be in the final top-k set when TkET is considering ei. HOD(ei, X) is the number of outgoing edges with a pairwise dominance degree above the threshold τ, i.e., the number of solid outgoing edges. Figure 4.4 illustrates the difference between the Out-Degree based and In-Degree based stopping criteria.

Theorem 1. In order to find the top-k entity blocks, it is enough to find k nodes in the EDG with Hard Out-Degree of at least |M| − k, where M is the set of all identified entity blocks. The algorithm stops when k such nodes are identified, i.e., when |TopK| = k.

In other words, the computed top-k result after updating ei (which can happen after a call to the disambiguation function), based on the Out-Degree based stopping criterion, is X+, defined as follows:

X+ = {ei | HOD(ei, X) ≥ |M| − k}    (4.9)



Figure 4.5: Out-Degree based Stopping Criteria Example

where TkET stops when |X+| is equal to k. For example, assume there are two nodes in the EDG corresponding to target entities a and b, each entity block containing three potential mentions to its target entity. The initial probability of referring to the target entity is 0.5 for each mention, and the ground-truth sizes are |a| = 2 and |b| = 3, so the deterministic result for the top-1 query is b.

For the top-1 analysis, using the first (Out-Degree based) stopping criterion defined above, we get the steps shown in Figure 4.5, leading to the top-1 result “a”, which differs from the deterministic answer “b”. In contrast, using the second, In-Degree based criterion, shown in Figure 4.6, the final top-1 answer is “b”, which is the correct answer based on the ground truth.
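The Out-Degree based check of Equation 4.9 can be sketched as follows, again over a toy dict-of-edges EDG (an illustrative representation, not TkET's internal one):

```python
def hard_out_degree(edges, node, tau):
    """HOD: number of outgoing EDG edges whose dominance degree already
    exceeds the threshold tau (the 'solid' outgoing edges)."""
    return sum(1 for (src, dst), w in edges.items() if src == node and w > tau)

def out_degree_stop(edges, nodes, k, tau):
    """Out-Degree based stopping check (Equation 4.9): return the top-k
    set once k blocks each dominate at least |M| - k other blocks with
    high confidence; otherwise return None (keep disambiguating)."""
    x_plus = {n for n in nodes if hard_out_degree(edges, n, tau) >= len(nodes) - k}
    return x_plus if len(x_plus) >= k else None
```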

Stopping Criteria (In-Degree Based)

Based on our experiments in Section 4.7, the above stopping criterion lets TkET answer the top-k query very quickly, but the resulting top-k list is not necessarily identical to the ground truth.



Figure 4.6: In-Degree based Stopping Criteria Example

Figure 4.4 illustrates the difference between the In-Degree based and Out-Degree based stopping criteria. Hence, we define a more conservative stopping criterion as follows:

Theorem 2. In order to find the top-k entity blocks, it is enough to find k nodes in the EDG all of whose incoming edges are solid. That is, for every entity block in E but not in the top-k list, the probability of that entity dominating any entity in the top-k list is bounded by 1 − τ.

4.5.6 Finding Top-K

Many social data applications must process the fast stream of social media posts (e.g., a stream that emits 6000 tweets per second) in real time. Hence, an important requirement for our solution is that it scales to the firehose stream (i.e., it can process tweets as fast as they come in). Computing the pairwise probability of dominance for all identified entity blocks in the active window can, depending on the size of the EDG, be computationally very expensive, which can leave TkET unable to process the query in time relative to the speed of the stream. Before addressing the scalability of TkET, the steps it takes in answering the motivating example analysis task are as follows:

Steps of TkET on Motivating Example

Figures 4.7 (the initial EDG and the EDGs after the first 2 steps), 4.8 (the EDGs after the middle 3 steps), and 4.9 (the EDGs after the last 2 steps) show the steps TkET takes in cleaning the sliding window of the motivating example shown in Table 4.1. In this example, TkET is configured with τ = 0.9 and generates the top-2 result [(1) Catfish, (2) Bad Moms] after 7 disambiguations out of all 14 mentions in the window, a 50% saving. In this example, the result computed by TkET matches the deterministic answer to the same query both in the set of entity blocks and in their ranking.

(a) Step 0: Initial EDG

(b) Step 1: after one Positive Disambiguation on Catfish

(c) Step 2: after another Positive Disambiguation on Catfish

Figure 4.7: EDG 1-2 steps of TkET top-2 algorithm on Motivating Example

(a) Step 3: after one Negative Disambiguation on Catfish

(b) Step 4: after another Positive Disambiguation on Catfish

(c) Step 5: after another Negative Disambiguation on Catfish

Figure 4.8: EDG 3-5 steps of TkET top-2 algorithm on Motivating Example

(a) Step 6: after one Positive Disambiguation on Magnolia

(b) Step 7: after one Positive Disambiguation on Bad Moms, the Stopping Criterion is met and the top-2 results are Catfish and Bad Moms.

Figure 4.9: EDG 6-7 steps of TkET top-2 algorithm on Motivating Example

4.5.7 Scalability of EDG

In order to make the EDG computations fast enough to keep up with the stream, we put a cap on the size of the EDG, defined by the number of entity blocks involved in it. If the size of the current window's EDG reaches the cap, we follow a divide-and-conquer sort-merge algorithm to model the EDG.

Sort-Merge Algorithm: We partition the list of identified entity blocks, sorted by the title of their corresponding entry in the knowledge base, into h disjoint subsets, keeping the starting and ending title string patterns of each partition, such that each subset contains fewer than the cap-size entity blocks. We then generate an EDG per partition and, when selecting tweets for disambiguation, select from the candidates of all partitions the one with the highest DD among all entity blocks.

Having the list of top-k mentioned entities per partition, we merge entity blocks based on their HOD values and compute their pairwise dominance degrees. If there are h such partitions, each with its local top-k list, then there are h × k entity blocks among which the global top-k must lie. We then generate a meta EDG over these h × k entity blocks; the list of top-k entities in the meta EDG is the final answer to the query. Disambiguation continues iteratively until TkET reaches the stopping criterion in the meta EDG.
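The partition and merge steps can be sketched as follows; `topk_over` is a hypothetical stand-in for running the top-k procedure on a (local or meta) EDG:

```python
def partition_blocks(title_sorted_blocks, cap):
    """Split the title-sorted entity blocks into h disjoint partitions of
    at most `cap` blocks each, so every partition's EDG stays small."""
    return [title_sorted_blocks[i:i + cap]
            for i in range(0, len(title_sorted_blocks), cap)]

def merge_local_topk(local_topk_lists, k, topk_over):
    """Sort-merge step: collect the h x k local top-k candidates and run
    the top-k procedure once more on the resulting meta EDG."""
    candidates = [b for local in local_topk_lists for b in local[:k]]
    return topk_over(candidates, k)
```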

Transitivity Property of the Dominance Degree

In order to ensure that it is enough to find the global top-k entities among the sets of local top-k entities, the definition of dominance must be transitive. Figure 4.10 illustrates the reason behind the transitivity requirement. Assume two partitions of entity blocks, {A, B, C} and {D, E}, and a top-1 query. If A is the top-1 entity in the first partition and D is the top-1 in the second partition, then in order to identify the global top-1 result by comparing the local top-1 results, we need to assume that if p(A > D) > τ and p(D > E) > τ, then p(A > E) > τ. A more conservative condition that fulfills the transitivity requirement is to ensure p(A > D > E) > τ. Assuming independence, p(A > D > E) = p(A > D) × p(D > E) > τ². Therefore, in order to run the sort-merge algorithm properly, we assign the pairwise dominance threshold √τ to both the local entity dominance graphs and the meta entity dominance graph.

Figure 4.10: Transitivity of Pairwise Dominance

TKET(S, qi, τ)
 1  G ← {}   // entity dominance graph (empty initially)
 2  while S.isNotEmpty() do
 3      Update-Window(S, Wj)
 4      Update-Blocks-And-Probabilities(Wj)
 5      Update-Graph(G)
 6      while isStoppingCriteriaNotMet(G, qi) do
 7          ey ← Select-Entity-Block(G, qi)
 8          tz ← Select-Mention(G, ey, qi)
 9          Disambiguate(tz, ey)
10          Update-Probabilities(Wj)
11          Update-Graph(G)
12      Produce-Answer(G, qi)

Figure 4.11: Overview of TkET

Since the standard solution requires calling the NEL module on each mention in the window, it can be inefficient in practice, especially for the Twitter stream, where the generation of tweets is fast and continuous. However, since the goal behind a top-k query is often to identify a few relevant and frequently mentioned entities (k << w), instead of disambiguating all mentions, a preferred, cost-effective solution is to generate an approximate answer to the query using a minimal number of calls to the NEL module.

Figure 4.12: Architecture of TkET

4.6 Architecture of TkET

The general flow in Figure 4.11 presents a high-level overview of TkET, which is explained in more details next. TkET continually apply the query on each window of the stream and generates an approximate answer for it. Each while loop (lines 2-12) can be viewed as a single window. In each window, TkET builds a probability model that can then guide it in choosing which entity block and then which mention inside the selected entity block to disambiguate next in the flow of the algorithm. TkET continues to disambiguate potential mentions to the target entity, until it reaches the stopping condition where it return the approximate set of identified entity blocks.

Figure 4.12 shows an overview of the architecture of TkET and its dependencies. TkET depends on some complex functionalities that must be provided before it can perform the top-k query answering procedure. As discussed earlier in this chapter, the two main functionalities to be provided to TkET are NER and NEL, as illustrated in Figure 4.1. Our implementation is based on knowledge base lookup for NER and NEL, but TkET is general enough to be used with different NER and NEL methods, as long as they satisfy the interface requirements shown in Figure 4.1. The other black-box functionality used by TkET is the Tweet Segmenter.

After a tweet gets segmented into potential entity mentions as the output of the Tweet Segmenter, TkET passes them to the NER module alongside the specified Category in the knowledge base. Upon receiving the potential target knowledge base entity that the mention might refer to and the probability assigned by NER, TkET identifies the correct entity block, or initiates a new one, and adds the new mention alongside its assigned probability to it. It then updates the factor graph corresponding to the identified/initiated entity block. Furthermore, it recomputes the marginals of the affected factor graph and updates the corresponding EDG containing the identified entity block. During its lifecycle, TkET iteratively selects entity blocks, and mentions inside them, to disambiguate until it reaches the specified stopping criteria. TkET disambiguates mentions by passing them to the NEL module, receiving a deterministic answer, and updating the corresponding factor graph marginals and the corresponding EDG.

4.6.1 Sliding Window Stream Processing

The challenge of answering sliding-window top-k queries on uncertain data streams stems from the strict space and time requirements of processing both arriving and expiring tweets in high-speed Twitter streams. In the TkET setting, we keep everything updated for the current sliding window. Hence, we keep one instance of the EDG updated for the whole duration of TkET's lifetime.

Figure 4.13: Sliding Window Stream Processing

However, we need to take care of arriving and expiring tweets: arriving tweets with mentions of already existing entity blocks need to be added to their corresponding entity blocks and incorporated into their factor graphs. In the case of arriving tweets with mentions referring to newly identified entities that the current set of factor graphs does not cover, TkET initiates new entity blocks and adds a corresponding node to the global EDG, followed by computing all possible edges between the new node and the other existing nodes.

As Figure 4.13 illustrates, in TkET, tweets enter the window and expire as time moves on. In order to take care of expiring tweets, TkET keeps an updated priority queue of all tweets that have entered the system. Knowing the identified mentions in an expiring tweet, TkET can follow the corresponding entity blocks, remove the tweet from those entity blocks using its tweet id, and recompute the corresponding factor graphs. TkET then recomputes all possible edges from and to all updated nodes in the EDG.
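The expiry bookkeeping described above can be sketched as follows (a minimal illustration; the class and helper names are ours, not TkET's actual implementation). Tweets enter a min-heap keyed by arrival time; when the window slides, expired tweets are popped and removed from their entity blocks, and the affected blocks are returned so the caller can recompute their factor graphs and EDG edges.

```python
import heapq

class Window:
    def __init__(self, width):
        self.width = width
        self.heap = []       # min-heap of (arrival_time, tweet_id, block_ids)
        self.blocks = {}     # block_id -> set of tweet_ids currently in the window

    def add(self, time, tweet_id, block_ids):
        heapq.heappush(self.heap, (time, tweet_id, tuple(block_ids)))
        for b in block_ids:
            self.blocks.setdefault(b, set()).add(tweet_id)

    def slide_to(self, now):
        """Expire tweets older than the window; return the affected blocks so
        the caller can recompute their factor graphs and EDG edges."""
        affected = set()
        while self.heap and self.heap[0][0] <= now - self.width:
            _, tweet_id, block_ids = heapq.heappop(self.heap)
            for b in block_ids:
                self.blocks[b].discard(tweet_id)
                affected.add(b)
        return affected
```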

4.7 Experimental Evaluation

In this section, we evaluate the performance of our analysis-aware top-k entities query answering approach using a real Twitter dataset and synthetic datasets with different configurations.

4.7.1 Experimental Setup

As discussed before, in TkET we use the whole English Wikipedia as our entity lookup knowledge base.

Stopping Criteria

To be more general, in our experimental results we report results both for the first, Out-Degree based definition of the stopping criteria and for the second, In-Degree based version.

Performance Criteria

We measure the performance of TkET in answering top-k entities queries over the Twitter stream by the proportion of mentions (tweets) that were not disambiguated to the total number of mentions in all entity blocks, reported as Saving. We also report the time to answer the top-k entities query given the current window as Latency. Finally, we report the correctness of the final top-k result, measured by the distance of the returned top-k result from the ground-truth top-k, as the Accuracy of TkET.

Comparing the top-k elements between two or more ranked results is a common task in many contexts and settings. A few measures with attractive mathematical properties have been proposed to compare top-k lists [85, 48]. This topic has received much attention over the past decade, mainly in the context of information retrieval. Among the most cited work on this topic is that of Fagin and colleagues [34], who proposed an easy-to-compute metric based primarily on Spearman's footrule [85]. Formally, if π1 and π2 are two permutations from the symmetric group Sn of all permutations of n elements, Spearman's footrule gives the L1 distance between the ranks of corresponding elements in the two permutations:

    L1(π1, π2) = Σ_{i=1}^{n} |π1(i) − π2(i)|,

where π1(i) (resp. π2(i)) is the position (rank) of the ith element in the permutation, given some total ordering of the n elements. Fagin et al. extend this metric to compare two top-k lists in the presence of non-overlapping elements (i.e., elements that are in one list but not in the other). This is achieved by fixing the contribution to the distance of each non-overlapping element at a value greater than k, typically (k + 1).

Formally, the extended metric for two top-k lists γ1 and γ2 is defined as shown in Equation 4.10:

    L1(γ1, γ2) = 2(k − |γ1 ∩ γ2|)(k + 1) + Σ_{i∈γ1∩γ2} |γ1(i) − γ2(i)| − Σ_{i∈γ1−γ2} γ1(i) − Σ_{i∈γ2−γ1} γ2(i)    (4.10)

where γ1 ∩ γ2 is the set of elements that overlap between the two lists, |γ1 ∩ γ2| denotes the number of overlapping elements, γ1 − γ2 gives the non-overlapping elements in γ1, and γ2 − γ1 gives those in γ2. The more similar the two top-k lists are, the closer the value of L1(γ1, γ2) is to 0.

Since in our setting we compare the final top-k result set of TkET with the ground-truth list, and we assume we know the rank of each entity block from the TkET result in the ground-truth list, we slightly modify the formula in Equation 4.10 to obtain Equation 4.11 as a distance measure between the top-k result of TkET and its ranking in the ground-truth list. We call the new measure the L8 distance, and we report the results of both L1 and L8 in our experiments.

    L8(γ, γgt) = 2(k − |γ ∩ γgt|)(k + 1) + Σ_{i∈γ∩γgt} |γ(i) − γgt(i)| − Σ_{i∈γ−γgt} |γ(i) − γgt(i)|    (4.11)
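To make the metric concrete, here is a small sketch (our own code, not from the thesis) of the Fagin-style top-k footrule distance of Equation 4.10, with 1-based ranks and the (k + 1) penalty for non-overlapping elements:

```python
def topk_footrule(gamma1, gamma2, k):
    """Fagin-style footrule distance between two top-k lists (Eq. 4.10).

    gamma1, gamma2: lists of entity ids, most dominant first (rank 1).
    """
    r1 = {e: i + 1 for i, e in enumerate(gamma1)}
    r2 = {e: i + 1 for i, e in enumerate(gamma2)}
    overlap = set(r1) & set(r2)
    # Overlapping elements contribute their rank difference ...
    d = sum(abs(r1[e] - r2[e]) for e in overlap)
    # ... non-overlapping ones a fixed (k + 1) penalty, minus their own rank.
    d += 2 * (k - len(overlap)) * (k + 1)
    d -= sum(r1[e] for e in set(r1) - overlap)
    d -= sum(r2[e] for e in set(r2) - overlap)
    return d
```

For example, two identical top-3 lists are at distance 0, while two disjoint top-3 lists are at distance k(k + 1) = 12.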

Tweet Preprocessing

Before entering a window, a tweet ti undergoes a preprocessing step that removes stop words and Twitter special terms (e.g., “rt”) from it, and then uses a named entity extraction tool to extract the mentions that might refer to a target entity. The named entity extraction step outputs a prior probability identifying the initial probability of the mention referring to the target entity of the corresponding entity block.
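A minimal illustration of this preprocessing step (the stop word and special-term lists below are toy examples of ours, not the actual lists used):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of"}   # toy subset
TWITTER_TERMS = {"rt", "via"}                       # e.g., retweet markers

def preprocess(tweet_text):
    # lowercase, tokenize on word characters, drop stop words / special terms
    tokens = re.findall(r"\w+", tweet_text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS and t not in TWITTER_TERMS]

print(preprocess("RT the catfish movie is a documentary"))
# -> ['catfish', 'movie', 'documentary']
```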

Table 4.1 shows an example collection of preprocessed tweets that belong to a single window over the Twitter stream. The crossed-out words are the stop words and Twitter special terms that were removed, whereas the bold words correspond to mentions that might refer to target entities as the result of named entity recognition/extraction/identification. A target entity is an entity that belongs to the target category of the query. For example, the target category of query q1 is Movies, and thus the movie Catfish is a potential target entity. The set of all identified potential target entities is denoted as E = {e1, e2, e3, . . . , e|E|}. For the tweets in Table 4.1, E = {“Magnolia”, “Catfish”, “Bad Moms”, “Frozen”, “Gummo”}. For example, there are five Catfish mentions, from t2, t4, t9, t5, and t11, in the Catfish entity block extracted from the tweets shown in Figure 4.14, and of these, t2, t4, and t9 indeed refer to the target entity, the movie Catfish. In our implementation of TkET, named entity recognition/extraction/identification happens through a simple lookup in the knowledge base.

4.7.2 Knowledge Base

We created inverted indexes both on the titles of articles and on their content, after performing tf-idf analysis to identify specific terms and their importance for each entity. Also, based on a tf-idf analysis of all the articles in Wikipedia, we implemented a function approximating term commonness for each given English term. We exploit the commonness analysis in order to avoid checking very common words in the simple lookup function. For example, if there is a name such as “Adam” in the tweet's text, since “Adam” is a very common term in English, TkET does not check it alone for the purpose of named entity identification.

Figure 4.14: Motivating Example's Identified Entity Blocks

    Entity Title:    Catfish     Magnolia    Bad Moms    Frozen      Gummo
    Tweet (Prob.):   t2 (0.75)   t1 (0.80)   t3 (0.55)   t5 (0.05)   t6 (0.65)
                     t4 (0.95)   t7 (0.05)   t8 (0.95)   t12 (0.95)
                     t9 (0.85)   t10 (0.15)
                     t5 (0.05)   t11 (0.15)
                     t11 (0.15)
    (Each block also lists the potential knowledge-base entities its title may
    refer to, e.g., a movie, a TV show, a music band, a place, a song, or a plant.)

Knowledge Base Lookup

A knowledge base is a fundamental component of the entity linking task. Knowledge bases provide contextual information about the world's entities (e.g., the entities Pulp Fiction and Django Unchained), their semantic categories (e.g., Movies), and the mutual relationships between entities (e.g., both were directed by Quentin Tarantino). In our experiments, the knowledge base is created from the whole English Wikipedia, a free online encyclopedia. At present, Wikipedia has become the largest and most popular internet encyclopedia in the world and is also very dynamic. The basic entry in Wikipedia is an article, which defines and describes an entity or a topic; each article is uniquely referenced by an identifier. Currently, English Wikipedia contains over 4.4 million articles.

Wikipedia has high coverage of named entities and contains massive knowledge about notable ones. Besides, the structure of Wikipedia provides a set of useful features for entity linking, such as entity pages, article categories, the number of page visits for each article (approximating the popularity of different entities), disambiguation pages, and hyperlinks in Wikipedia articles. In order to create a working knowledge base for our experiments, we downloaded and indexed the whole dump of English Wikipedia.

We implemented two methods to access the knowledge base: a relatively cheap Simple Lookup to identify potential target entity mentions for the purpose of named entity identification, and a much more expensive Disambiguation Lookup that, given the contextual information in the tweet, decides whether an identified potential mention indeed refers to the target entity.

• Simple Lookup SL(t, c): In order to implement the simple lookup functionality on top of Wikipedia, we created inverted indexes over the important terms in the titles of articles. Given a potential mention m in tweet t, we first compute the commonness of each term mi in m and, choosing the least common term, look it up in the inverted index of the specified category c to get the list of potentially matching entities; if we see an entity with the exact title m, we return it without any further investigation. The probability of the identified mention is then approximated from the edit distance between the mention and the article title, divided by the number of characters in the article title plus the number of entities in the knowledge base with the same title. Otherwise, we look up the other terms and return the intersection of the returned potential entity sets. In addition to the potential target entity in the knowledge base, the simple lookup also returns an approximate prior probability p(t) of the mention referring to the target entity. In TkET, we compute that probability based on a similarity analysis between the text of the article corresponding to the potential target entity and the text of the identified mention.

• Disambiguation Lookup DL(t, e): Using the contextual information (e.g., key phrases in the rest of the tweet's text) and the content of Wikipedia articles, we return the most probable entity corresponding to the identified potential mention of e in t, given the contextual information in tweet t. In order to compute DL, we created inverted indexes from important terms (identified using tf-idf analysis) in the content of Wikipedia articles to the articles corresponding to entities. DL(t, e) returns the entity whose article title is most similar to the mention m and whose article contains most of the contextual key phrases in the text of t.
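The SL(t, c) flow can be sketched as follows (the index contents, commonness values, and the exact prior formula here are our simplified illustrations, not the actual implementation): pick the mention's least common term, probe the category's title index, and return an exact-title match together with an edit-distance-based prior.

```python
# Toy title index and commonness scores (ours, for illustration only).
TITLE_INDEX = {"movies": {"catfish": ["Catfish"], "magnolia": ["Magnolia"]}}
COMMONNESS = {"the": 0.9, "catfish": 0.01, "magnolia": 0.02}

def edit_distance(a, b):
    # classic Levenshtein dynamic program
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def simple_lookup(mention, category):
    terms = mention.lower().split()
    # probe the index with the least common term of the mention
    probe = min(terms, key=lambda t: COMMONNESS.get(t, 0.0))
    candidates = TITLE_INDEX.get(category, {}).get(probe, [])
    for title in candidates:
        if title.lower() == mention.lower():
            d = edit_distance(mention.lower(), title.lower())
            # prior shrinks with edit distance and with title ambiguity
            prior = 1.0 - d / (len(title) + len(candidates))
            return title, prior
    return None, 0.0
```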

4.7.3 Factor Graphs and Dimple

In order to implement a factor graph per identified entity block in TkET, we use Dimple [40]. Dimple is an open-source API for probabilistic modeling and inference. Dimple allows the user to specify probabilistic models in the form of graphical models (factor graphs, Bayesian networks, or Markov networks), and performs inference on the model using a variety of supported algorithms. Probabilistic graphical models unify a great number of models from machine learning, statistical text processing, vision, bioinformatics, and many other fields concerned with the analysis and understanding of noisy, incomplete, or inconsistent data. Graphical models reduce the complexity inherent in complex statistical models by dividing them into a series of logically (and statistically) independent components.

By factoring the problem into subproblems with known and simple interdependencies, and by adopting a common language to describe each subproblem, one can considerably simplify the task of creating complex probabilistic models. An important attribute of Dimple is that it allows the user to construct probabilistic models in a form that is largely independent of the algorithm used to perform inference on the model. This modular architecture benefits those who create probabilistic models by freeing them from the complexities of the inference algorithms, and it benefits those who develop new inference algorithms by allowing these algorithms to be implemented independently from any particular model or application. In TkET, we use the implementation of the Sum-Product algorithm in Dimple for the purpose of marginal inference.
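As a self-contained stand-in for what Dimple computes per entity block (the variable and factor choices here are ours): each boolean variable encodes whether a tweet's mention refers to the block's target entity, unary factors carry the NER priors, and a pairwise factor boosts configurations where two positively correlated tweets agree. For a block this small, exact enumeration yields the same marginals Sum-Product would.

```python
from itertools import product

def marginals(priors, corr):
    """Exact marginals of a tiny factor graph by enumeration.

    priors: prior probability per boolean variable (unary factors).
    corr:   {(i, j): boost} pairwise factors multiplying the weight of
            configurations where variables i and j agree.
    """
    n = len(priors)
    weight = [0.0] * n
    total = 0.0
    for assignment in product([0, 1], repeat=n):
        w = 1.0
        for i, x in enumerate(assignment):
            w *= priors[i] if x else 1.0 - priors[i]
        for (i, j), boost in corr.items():
            if assignment[i] == assignment[j]:
                w *= boost
        total += w
        for i, x in enumerate(assignment):
            if x:
                weight[i] += w
    return [w / total for w in weight]

# e.g., the three confident Catfish mentions, with t2 and t4 positively correlated
print(marginals([0.75, 0.95, 0.85], {(0, 1): 2.0}))
```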

4.7.4 Synthetic Dataset Generation

In order to fully investigate the effect of different parameter assignments on the performance of TkET, and its performance on different data distributions, we implemented a synthetic dataset generator for the purpose of evaluating TkET. Synthetic data generation takes place given a set of parameters [n, m, β, s, p], where n is the number of tweets (mentions) in the dataset, m is the number of identified entity blocks in the system, β is the maximum value of the exponent in the entity block random real-size assignment formula, and s is the exponent of the Zipfian rank-size distribution [65] from which we generate the sizes of the entity blocks. Also, p is the probability that two tweets in an entity block have a positive correlation with each other. We use the value of p as the probability of heads to flip a coin for each pair of tweets in an entity block, deciding whether to define a positive correlation between them.

In order to generate a dataset, we generate m random numbers from the Zipf distribution with exponent s, each corresponding to a potential entity block in the dataset. Then, we distribute the n tweets among those m entity blocks based on the ratio of each block's Zipfian random number to the sum of all the Zipfian random numbers. At this stage, we have m entity blocks with random sizes drawn from a Zipfian distribution with exponent s, containing n tweets in total. For each identified synthetic entity block, we generate a random real size based on the actual size assigned in the last step, using the formula |ei|^(1−r), where 0 ≤ r ≤ β is a random number deciding the real size of entity block ei. Note that the real size of a given entity block ei, depending on the value of r, can range from |ei|^(1−β) to |ei|. After deciding the real size of each entity block, we randomly assign prior probabilities to the tweets in each entity block.
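The generation procedure above can be sketched as follows (a simplified sketch; function and field names are ours, and we use explicit 1/r^s rank-size weights rather than a library Zipf sampler):

```python
import random

def generate_dataset(n, m, beta, s, p, seed=0):
    """Sketch of the synthetic generator: m blocks with Zipfian sizes,
    a real size |e_i|^(1-r) with r in [0, beta], random priors, and
    p-biased coin flips for pairwise positive correlations."""
    rng = random.Random(seed)
    # Zipfian rank-size weights: the block at rank r gets weight 1 / r^s.
    weights = [1.0 / (rank ** s) for rank in range(1, m + 1)]
    total = sum(weights)
    blocks = []
    for w in weights:
        size = max(1, round(n * w / total))            # distribute the n tweets
        r = rng.uniform(0, beta)
        real_size = max(1, round(size ** (1.0 - r)))   # |e_i|^(1 - r)
        priors = [rng.random() for _ in range(size)]
        # flip a p-biased coin per tweet pair to add a positive correlation
        corr = [(i, j) for i in range(size) for j in range(i + 1, size)
                if rng.random() < p]
        blocks.append({"size": size, "real_size": real_size,
                       "priors": priors, "corr": corr})
    return blocks
```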

4.7.5 Experimental Results

Synthetic Dataset Experimental Results

We evaluate TkET through extensive experiments on synthetic datasets. The experimental results demonstrate the effectiveness of our methods. For a more detailed analysis of the parameters, we selected five datasets: SDS1, SDS2, SDS3, SDS4, and SDS5.

Replication: In order to make the results based on random synthetic datasets statistically valid [28], we averaged the results for the selected datasets over five different replicas per selected configuration. Therefore, for each selected dataset we generated 5 replicas and averaged the accuracy, latency, and saving results over the executions of TkET on each of the 5 replicas.

Figure 4.15 shows, as examples, the block size distributions of two of the generated Synthetic DataSets (SDSs), with parameters SDS4: n = 10000, m = 3000, β = 0.3, s = 1.0 and SDS5: n = 100000, m = 4000, β = 0.7, s = 4.0. Table 4.2 shows a summary of the synthetic dataset parameters.

Figure 4.15: Selected Synthetic Datasets Block Size Distribution. (a) SDS4 (n = 10000, m = 3000, β = 0.3, s = 1.0); (b) SDS5 (n = 100000, m = 4000, β = 0.7, s = 4.0). Each panel plots, per block, the number of mentions in the block against the number of mentions that indeed refer to the target entity, shaded by the real-size/size ratio.

Next, we analyze the effect of the different TkET parameters: the value of k, the dominance degree threshold τ, and the choice of stopping criteria. We statically set the cap size to 100; based on the preliminary results of our experiments with different values for the cap size, 100 resulted in the most robust results. Also, in our experiments, we do not consider and do not report cases where k > |E|/2; in those cases, the problem TkET faces is more a sorting problem than top-k query answering.

Table 4.2: Selected Dataset Parameters

    Dataset    n        m      β     s
    SDS1       1000     200    0.4   2
    SDS2       2000     400    0.4   2
    SDS3       3000     600    0.4   2
    SDS4       10000    3000   0.3   1
    SDS5       100000   4000   0.7   4

Efficiency

We evaluate the efficiency of the proposed algorithms in this section. Figure 4.16 shows the latency of TkET for different values of the parameters th = τ and k. As expected, Figure 4.16a shows that with higher values of k it takes more time to compute the top-k result, and that TkET with the second stopping criteria, which is based on In-Degree, incurs considerably more computation cost.

Table 4.3: Efficiency for Out-Degree based Stopping Criteria

    Dataset   avg. saving   avg. latency   avg. L1   avg. L8
    SDS1      0.99          56 (s)         0.7       0.84
    SDS2      0.98          121 (s)        1.28      1.72
    SDS3      0.93          159 (s)        1.4       3.53
    SDS4      0.87          349 (s)        3.1       5.48
    SDS5      0.96          1493 (s)       1.82      3.9
Figure 4.16: SDS4: Latency vs. Parameters (k, th). (a) Latency vs. k; (b) Latency vs. th; each panel compares stopping criteria 1 (Out-Degree based) and 2 (In-Degree based).
Figure 4.17: SDS4: Accuracy vs. Parameters (k, th). (a) L1 vs. k; (b) L8 vs. k; (c) L1 vs. th; (d) L8 vs. th; each panel compares stopping criteria 1 (Out-Degree based) and 2 (In-Degree based).

Table 4.4: Efficiency for In-Degree based Stopping Criteria

    Dataset   avg. saving   avg. latency   avg. L1   avg. L8
    SDS1      0.98          71 (s)         0.11      0.41
    SDS2      0.94          139 (s)        0.14      0.38
    SDS3      0.9           198 (s)        0.09      0.17
    SDS4      0.76          903 (s)        0.012     0.023
    SDS5      0.71          5711 (s)       0.03      0.13

Accuracy

We measure the accuracy of the top-k result by its similarity to the ground-truth top-k list, both in terms of the members of the top-k and their order, through the two relative top-k distances L1 and L8. The closer the distance value is to zero, the more similar the TkET top-k result set is to the ground-truth top-k list. For simplicity of comparison, we report all observed L1 and L8 values normalized by the number of entity blocks in the selected dataset, m. Figure 4.17 shows the accuracy of TkET results for dataset SDS4 and how it changes for different values of the TkET parameters th = τ and k. As expected, across all reports, the second stopping criteria (In-Degree based) results in a more accurate top-k result.

Scalability

We tested the scalability of TkET with the synthetic dataset SDS5. Our current implementation of TkET does not parallelize the processing. The computations related to the factor graphs and the EDG can easily be distributed over multiple machines, which would reduce TkET's latency overhead. We do not consider parallelism in this thesis and leave it as possible future work on the scalability of TkET. TkET with the Out-Degree based stopping criteria scales better than the version with the more conservative In-Degree based stopping criteria.

Overall Synthetic Performance

We tested TkET on more than 8000 different configurations of the parameters, with n ∈ [200, 5000], m ∈ [10, 2500], maxf ∈ [0.1, 1.0], zipfs ∈ [1, 4], k ∈ [2, 50], τ ∈ [0.55, 1], and both versions of the stopping criteria. The overall saving has a mean of 0.9842, a median of 0.99, and a minimum of 0.09. The overall L1 has a mean of 1.393, a median of 0.336, and a maximum of 28.267. Similarly, the overall L8 has a mean of 3.777, a median of 1.980, and a maximum of 36.335.

4.7.6 Real Tweet Dataset Experimental Results

We also evaluate TkET through experiments on a real tweet dataset to show its effectiveness. We collected a set of sample tweets using Twitter's sample API during the third week of November 2015, based on the keyword “trump”, the surname of the then president-elect of the United States of America. We removed the non-English tweets and retweets to obtain a dataset of 500K tweets matching the phrase “trump”. In order to compare TkET to the deterministic solution in terms of Latency, Accuracy, and Saving, we ran TkET with a fixed dominance degree threshold of τ = 0.8 (reported as TkET) and with the same set of queries but τ = 1, which yields the deterministic ground-truth top-k (reported as GT).

Table 4.5: Efficiency over the Real Tweet Dataset

    Category      #mentions   #entities   avg. saving   avg. latency   avg. L1   avg. L8
    Companies     12K         3430        0.83          14 (m)         2.29      4.6
    Politicians   197K        1611        0.7           211 (m)        5.2       6.19
    Countries     85K         218         0.59          205 (m)        1.3       0.47

We tested the performance of TkET over the specified dataset with multiple query categories, Companies, Politicians, and Countries, excluding identified potential entity mentions containing the term “trump”. We also fixed k = 10 in our real dataset experiments; thus, the queries are, for example, “What are the top-10 mentioned Companies?” or “What are the top-10 mentioned Politicians?”. Table 4.5 shows the experimental results on the real Twitter dataset. The savings clearly show the effectiveness of TkET in improving efficiency. Exploiting more sophisticated and accurate named entity identification and linking methods would significantly improve the semantics of the final top-k result in real Twitter data scenarios.

4.7.7 Discussion

Experimental results on synthetic and real datasets show the overall efficiency of TkET in answering top-k entities queries over tweets, compared to cleaning the whole dataset before finding the top-k. Furthermore, extensive experiments in the synthetic setting show that using the In-Degree based stopping criteria yields more accurate results in general, but takes more time to finish, resulting in higher latency. Hence, for smaller datasets and smaller window sizes, using the In-Degree stopping criteria is acceptable in order to improve the quality of the top-k result without killing efficiency. For larger datasets, stopping based on the first, Out-Degree criteria incurs less computation overhead and thus yields faster results, at the price of lower accuracy. The best choice of parameter assignments depends on the implementation (multi-threading, factor graph implementation, etc.). In our implementation, the latency grows very fast for datasets larger than 10K tweets when using the In-Degree stopping criteria.

4.8 Related Work

The most important research topics related to our work are Social Entity Linking and Top-k Query Answering.

4.8.1 Social Entity Linking

The large number of potential applications bridging web data with knowledge bases has led to increased entity linking research. Entity linking is the task of linking entity mentions in text to their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity.

Name dictionary based techniques are the main approach to candidate entity generation and are leveraged by many entity linking systems [43, 72, 37, 16, 51, 39, 81, 82]. The structure of Wikipedia provides a set of useful features for generating candidate entities, such as entity pages, redirect pages, disambiguation pages, bold phrases from first paragraphs, and hyperlinks in Wikipedia articles. These entity linking systems leverage different combinations of these features to build an offline name dictionary between various names and their possible mapping entities, and exploit this constructed name dictionary to generate candidate entities.

4.8.2 Top-k Query Answering

The problem of top-k query processing has been well studied for static databases. The well-known top-k algorithms on a single relation, such as Onion [19] and Prefer [41], are preprocessing-based techniques. They focus on preparing meta-data through a pre-analysis of the data to facilitate subsequent run-time query processing. For example, Onion [19] computes and stores the convex hulls of the data, and later evaluates top-k queries based on these convex hulls.

Probabilistic top-k queries have been extensively studied under various semantics, including U-Topk [84], U-kRanks [84], PT-k [42], Global-Topk [94], and expected rank queries [27]. Efficient query processing algorithms have been developed to evaluate probabilistic top-k queries. However, the existing studies only return the most probable top-k result, although such results may contain high uncertainty.

4.9 Summary

Efficient processing of top-k mentioned entities queries posed on a stream of tweets has become a key part of a broad class of real-time applications, ranging from content search to marketing. Given that words are often ambiguous, entity linking becomes an important step towards answering such queries. The continuous and fast generation of tweets makes it crucial for such applications to process those queries at an equally fast pace. To address these requirements, we propose TkET (pronounced ticket) as an analysis-aware entity linking framework for efficiently answering top-k entities queries over the Twitter stream. The comprehensive empirical evaluation of the proposed solution demonstrates its significant advantage in terms of efficiency over traditional techniques for the given problem settings.

Possible future work for TkET includes proving the transitivity property of the dominance degree for τ greater than some constant. We believe there should be a constant minimum for τ above which the transitivity property of the dominance degree can be guaranteed for any given triple of entity blocks. Investigating that possibility is a direction of future work that would improve the performance of the Entity Dominance Graph calculations. Another possible direction is incorporating different solver algorithms for the entity factor graphs and analyzing the effect of solver selection on the final top-k result.

Chapter 5

SoDAS: Social Data Analytics System

SoDAS is a generic plug-n-play framework intended to facilitate rapid prototyping of social data analysis applications over dynamically collected social media content (e.g., stream of tweets). One of its key components is a multi-user data acquisition framework that allows end-users and/or applications to specify their data acquisition needs in the form of queries consisting of a set of phrases.

It uses external knowledge bases (e.g., domain specific ontologies, Wikipedia, Wordnet) to expand the input phrases to create an internal query representation from which phrases are selected for data capture using Twitter API. The system automatically and dynamically modifies queries to social media based on already collected data in order to optimize the quality/relevance of data captured to the user interest.

A distributed database component of SoDAS indexes and stores the dynamic data captured, and also maintains associations between the data and the user queries that resulted in its capture. The analysis component of SoDAS is a plug-n-play engine that allows user defined functions to be dynamically added to the system. SoDAS supports an analysis specification language with which users can specify complex analysis workflows, wherein each element of the workflow consists of applying a user specified function (or a function from a set of libraries included in SoDAS) to an input data set. The input dataset may either be the result of a previous analysis or created using a query over the data sets collected by the acquisition module. SoDAS supports ad-hoc as well as periodic analysis over both collected and streaming data.

5.1 System Overview

SoDAS is composed of several different modules with specific functions. The main modules in SoDAS are:

• Tweets Acquisition: It interacts with Twitter to retrieve interesting tweets.

• Analysis: It performs the concrete analysis on these tweets.

Then there are other modules with auxiliary functions. Four modules handle the interaction with the users, as follows:

• Interests Specification: It provides the user with the ability to specify the queries for the system.

• Analyses Specification: It provides the user with the ability to specify the analyses the system must perform.

• Config Specification: It provides the user with the ability to specify several configura- tion parameters needed by the system.

• Visualization: It shows the results to the user.

117 Figure 5.1: SoDAS General Architecture

Five other modules take care of the interaction with the database on which the SoDAS system is based:

• Interests Storer: stores and reads the interests the users expressed.

• Analyses Storer: stores and reads the analyses the system must perform.

• Tweets Storer: stores and reads the tweets retrieved through Twitter API.

• Results Storer: stores and reads the results of the analysis made by the system.

• Config Storer: stores and reads the configuration parameters.
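The storer modules above share a common store/read pattern, sketched minimally below (the class and method names are ours, not SoDAS's actual API; a real deployment would back these with the distributed database component):

```python
class Storer:
    """Common store/read interface shared by the auxiliary modules."""
    def __init__(self):
        self._rows = []          # in-memory stand-in for a database collection

    def store(self, item):
        self._rows.append(item)

    def read_all(self):
        return list(self._rows)

# Each concrete storer wraps one kind of data behind the same interface.
class InterestsStorer(Storer): pass   # user interests (queries)
class TweetsStorer(Storer): pass      # tweets retrieved via the Twitter API
class ResultsStorer(Storer): pass     # analysis results
class ConfigStorer(Storer): pass      # configuration parameters

tweets = TweetsStorer()
tweets.store({"id": 1, "text": "Go Anteaters! #UCI"})
```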

Chapter 6

Conclusions and Future Work

The last decade witnessed the emergence of social data applications in many different contexts, from brand analysis and advertisement to predicting elections and the stock market. For the final analysis results on top of social data to be accurate, the quality of user-provided social data needs to be improved. The low quality of social data stems from the fact that people use less formal language on social media and invent new ways to refer to real-world concepts. The short nature of social media posts in general, and especially of micro-blogs such as Twitter, adds to the complexity of improving social data quality, since the contextual information is very limited. Moreover, social data is generated very fast, so one needs to analyze data at the speed of the stream in order to report the analysis findings in a timely, useful manner for the client.

We believe improving data quality in general, and social data quality in particular, starts with acquiring the data relevant to the final interest of the client. At today's speed of data generation, one cannot collect, store, and then analyze all available data in order to get the required analysis task done. For example, users post more than 500 million tweets per day on Twitter, which is roughly 6,000 tweets per second.

If we assume the client is a social media analyst working for the University of California, Irvine, monitoring UCI's public presence on Twitter, he is interested in only a microscopic portion of the tweets posted on Twitter every day. So, we need to focus on the client's needs to improve the quality of the collected data, starting at data acquisition time.
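The per-second rate quoted above follows from simple arithmetic:

```python
# Back-of-the-envelope check of the quoted tweet rate: 500 million
# tweets per day works out to just under 6,000 tweets per second.
tweets_per_day = 500_000_000
seconds_per_day = 24 * 60 * 60            # 86,400 seconds in a day
rate = tweets_per_day / seconds_per_day   # about 5,787 tweets/second
print(round(rate))
```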

In order to fulfill social data acquisition needs, we presented TAS, an online adaptive tweet acquisition system that works based on the traditional use case of collecting social data by providing a set of textual patterns. In TAS, the client starts acquiring tweets by defining his topic of interest through a topically cohesive set of phrases. A phrase in TAS is a set of terms whose joint appearance in a tweet, regardless of order, makes that tweet a qualified candidate to be relevant to the client's topic of interest. We then follow a reinforcement-learning-based selection algorithm to iteratively generate queries consisting of a selected subset of all relevant phrases listed in the client's interest.
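The phrase-matching rule described above can be sketched as follows; this is a minimal illustration with hypothetical function names, not the TAS implementation itself.

```python
def matches_phrase(tweet, phrase_terms):
    """True if every term of the phrase occurs in the tweet,
    regardless of the order in which the terms appear."""
    words = set(tweet.lower().split())
    return all(term.lower() in words for term in phrase_terms)

def is_candidate(tweet, topic_phrases):
    """A tweet is a qualified candidate for the topic if it matches
    at least one of the topic's phrases."""
    return any(matches_phrase(tweet, p) for p in topic_phrases)

topic = [["uc", "irvine"], ["uci", "anteaters"]]
print(is_candidate("Irvine is home to UC campuses", topic))  # True
```

Note that the phrase ["uc", "irvine"] matches even though the terms appear in reverse order in the tweet.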

TAS also iteratively updates the representation of the client's topic of interest based on the frequent patterns in relevant and irrelevant collected tweets. Although TAS can work with any relevance check module to approximate the degree of relevance of incoming tweets, we implemented a minimal relevance check module based on multiple criteria in order to test TAS's performance. Our experimental studies show that TAS significantly improves the recall of relevant tweets, and that the performance improves further when the topics are more specific.

Furthermore, the collected social data (e.g., tweets) may then be subjected to diverse types of application-specific analysis. Recently, efficient processing of top-k mentioned-entities queries posed on a stream of tweets has become a key part of a broad class of real-time applications, ranging from content search to marketing. Given that words are often ambiguous, entity linking becomes an important step towards answering such queries. The continuous and fast generation of tweets makes it crucial for such applications to process those queries at an equally fast pace. To address these requirements, we proposed TkET (pronounced "ticket"), an analysis-aware entity linking framework for efficiently answering top-k entities queries over the Twitter stream. A comprehensive empirical evaluation of the proposed solution demonstrates its significant advantage in terms of efficiency over traditional techniques for the given problem settings.

One possible future work on tweet acquisition would be making TAS serve multiple clients requesting tweets to be collected based on their interests at the same time, in real time. The challenge would be finding the optimal scheduling among the different clients and their different interests. Another follow-up project on TAS would include designing, implementing, and using a more sophisticated relevance check module, and performing entity extraction and entity prioritization at query generation time. Finally, since the data collected by TAS is used by other modules in the final application, the higher-level modules consuming TAS's data in real time can give TAS feedback on both the relevance and the importance of different phrases in retrieving relevant tweets. TAS would then incorporate the feedback provided by these data consumers to enhance the search criteria and the query generation process, maximizing recall from another perspective.

One possible future direction for TkET is proving the transitivity property of the dominance degree for τ greater than some constant. We believe there should be a constant minimum τ above which the transitivity property of the dominance degree can be guaranteed for any given triple of entity blocks. Investigating that possibility would improve the performance of the Entity Dominance Graph calculations. Another possible future direction for TkET is incorporating different solver algorithms for the entity factor graphs and analyzing the effect of solver selection on the final top-k results. Finally, TkET can exploit the predictability of the sliding windows [45] in order to predict the potential top-k results for each sliding window based on the past windows, and incorporate that information to answer the top-k query more efficiently.

Bibliography

[1] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, volume 28 of JMLR Proceedings, pages 151–159. JMLR.org, 2013.

[2] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In WSDM '09, pages 5–14, New York, NY, USA, 2009. ACM.

[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

[4] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. Proceedings of the VLDB Endowment, 7(11):999–1010, 2014.

[5] H. Altwaijry, D. Kalashnikov, and S. Mehrotra. QDA: A query-driven approach to entity resolution. IEEE Transactions on Knowledge and Data Engineering, 2016.

[6] H. Altwaijry, D. V. Kalashnikov, and S. Mehrotra. Query-driven approach to entity resolution. Proceedings of the VLDB Endowment, 6(14):1846–1857, 2013.

[7] H. Altwaijry, S. Mehrotra, and D. V. Kalashnikov. QuERy: A framework for integrating entity resolution with query processing. Proceedings of the VLDB Endowment, 9(3):120–131, 2015.

[8] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002.

[9] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su. Optimizing web search using social annotations. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 501–510, New York, NY, USA, 2007. ACM.

[10] H. Becker, M. Naaman, and L. Gravano. Selecting quality twitter content for events. In ICWSM, volume 11, 2011.

[11] G. Bello-Orgaz, J. J. Jung, and D. Camacho. Social big data: Recent achievements and new challenges. Information Fusion, 28:45–59, 2016.

[12] C. Berrou. The ten-year-old turbo codes are entering into service. IEEE Communications Magazine, 41(8):110–116, Aug. 2003.

[13] M. Bošnjak, E. Oliveira, J. Martins, E. Mendes Rodrigues, and L. Sarmento. TwitterEcho: A distributed focused crawler to support open research with twitter data. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '12 Companion, pages 1233–1240, New York, NY, USA, 2012. ACM.

[14] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

[15] K. Bontcheva, L. Derczynski, A. Funk, M. A. Greenwood, D. Maynard, and N. Aswani. TwitIE: An open-source information extraction pipeline for microblog text. In RANLP, pages 83–90, 2013.

[16] R. Bunescu. Using encyclopedic knowledge for named entity disambiguation. In EACL, pages 9–16, 2006.

[17] R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In EACL, volume 6, pages 9–16, 2006.

[18] A. E. Cano Basave, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making sense of microposts (#MSM2013) concept extraction challenge. 2013.

[19] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and J. R. Smith. The onion technique: Indexing for linear optimization queries. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 391–402, New York, NY, USA, 2000. ACM.

[20] M. Chu, H. Haussecker, and F. Zhao. Scalable information-driven sensor querying and routing for ad hoc heterogeneous sensor networks. International Journal of High Performance Computing Applications, 16(3):293–313, 2002.

[21] J. Chuang, C. D. Manning, and J. Heer. Without the clutter of unimportant words: Descriptive keyphrases for text visualization. TOCHI, 19(3):19, 2012.

[22] Final 'Star Wars' trailer is here: Fans react as tickets go fast. http://www.cnn.com/2015/10/19/entertainment/star-wars-force-awakens-final-trailer-feat/index.html.

[23] Hurricane Patricia 'potentially catastrophic' as it heads to Mexico. http://www.cnn.com/2015/10/22/americas/hurricane-patricia/index.html.

[24] Justin Trudeau, Liberals win clear majority in Canada elections. http://www.cnn.com/2015/10/19/world/canadian-election/index.html.

[25] Oscar Pistorius released from prison, under house arrest. http://www.cnn.com/2015/10/19/africa/south-africa-oscar-pistorius-released-house-arrest/index.html.

[26] 'Star Wars: The Force Awakens' has biggest box office opening ever. http://money.cnn.com/2015/12/20/media/star-wars-the-force-awakens-opening-weekend-box-office/.

[27] G. Cormode, F. Li, and K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In 2009 IEEE 25th International Conference on Data Engineering, pages 305–316, March 2009.

[28] C. Croarkin, P. Tobias, and C. Zey. Engineering statistics handbook. NIST iTL, 2002.

[29] S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data. In EMNLP-CoNLL, volume 7, pages 708–716, 2007.

[30] A. Deligiannakis, Y. Kotidis, V. Vassalos, V. Stoumpos, and A. Delis. Another outlier bites the dust: Computing meaningful aggregates in sensor networks. In 2009 IEEE 25th International Conference on Data Engineering, pages 988–999. IEEE, 2009.

[31] L. Derczynski, D. Maynard, G. Rizzo, M. van Erp, G. Gorrell, R. Troncy, J. Petrak, and K. Bontcheva. Analysis of named entity recognition and linking for tweets. CoRR, abs/1410.7182, 2014.

[32] A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. Model- driven data acquisition in sensor networks. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB ’04, pages 588–599. VLDB Endowment, 2004.

[33] N. Diakopoulos, M. Naaman, and F. Kivran-Swaine. Diamonds in the rough: Social media visual analytics for journalistic inquiry. In Visual Analytics Science and Technology (VAST), 2010 IEEE Symposium on, pages 115–122. IEEE, 2010.

[34] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. SIAM Journal on Discrete Mathematics, 17(1):134–160, 2003.

[35] Y. Fang and M.-W. Chang. Entity linking on microblogs with spatial and temporal signals. Transactions of ACL (TACL), 2(Oct):259–272, 2014.

[36] T. Finin, W. Murnane, A. Karandikar, N. Keller, J. Martineau, and M. Dredze. Annotating named entities in twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, pages 80–88, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[37] A. Gattani, D. S. Lamba, N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. Entity extraction, linking, classification, and tagging for social media: A wikipedia-based approach. Proc. VLDB Endow., 6(11):1126–1137, Aug. 2013.

[38] L. Gruenwald, M. S. Sadik, R. Shukla, and H. Yang. DEMS: a data mining based technique to handle missing data in mobile sensor network applications. In Proceedings of the Seventh International Workshop on Data Management for Sensor Networks, pages 26–32. ACM, 2010.

[39] X. Han, L. Sun, and J. Zhao. Collective entity linking in web text: A graph-based method. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 765–774, New York, NY, USA, 2011. ACM.

[40] S. Hershey, J. Bernstein, B. Bradley, A. Schweitzer, N. Stein, T. Weber, and B. Vigoda. Accelerating inference: towards a full language, compiler and hardware stack. CoRR, abs/1212.2991, 2012.

[41] V. Hristidis, N. Koudas, and Y. Papakonstantinou. Prefer: A system for the efficient execution of multi-parametric ranked queries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, SIGMOD ’01, pages 259–270, New York, NY, USA, 2001. ACM.

[42] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 673–686, New York, NY, USA, 2008. ACM.

[43] W. Hua, K. Zheng, and X. Zhou. Microblog entity linking with social temporal context. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1761–1775, New York, NY, USA, 2015. ACM.

[44] Twitter, Inc. United States Securities and Exchange Commission filing, October 2013.

[45] C. Jin, K. Yi, L. Chen, J. X. Yu, and X. Lin. Sliding-window top-k queries on uncertain streams. Proceedings of the VLDB Endowment, 1(1):301–312, 2008.

[46] K. Joseph, P. L. Landwehr, and K. M. Carley. An approach to selecting keywords to track on twitter during a disaster. In Proceedings of the 11th International Conference on Information Systems for Crisis Response and Management, State College, PA, 2014.

[47] K. Joseph, P. M. Landwehr, and K. M. Carley. Two 1%s don't make a whole: Comparing simultaneous samples from Twitter's streaming API. In Social Computing, Behavioral-Cultural Modeling and Prediction, pages 75–83. Springer, 2014.

[48] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[49] F. R. Kschischang, B. J. Frey, and H. A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, Feb 2001.

[50] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory, 47(2):498–519, 2001.

[51] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 457–466, New York, NY, USA, 2009. ACM.

[52] H. Langseth. The Hammersley-Clifford theorem and its impact on modern statistics. 2002.

[53] Anxious Porter Ranch residents weigh legal response to massive gas leak. http://www.latimes.com/local/california/la-me-1219-gas-leak-20151219-story.html.

[54] No Border Patrol at UC Irvine career fair, and that's a shame. http://www.latimes.com/opinion/editorials/la-ed-uci-border-patrol-20151022-story.html.

[55] C. Li, A. Sun, J. Weng, and Q. He. Exploiting hybrid contexts for tweet segmentation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 523–532, New York, NY, USA, 2013. ACM.

[56] C. Li, A. Sun, J. Weng, and Q. He. Tweet segmentation and its application to named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 27(2):558–570, 2015.

[57] R. Li, S. Wang, and K. C. Chang. Towards social data platform: Automatic topic-focused monitor for twitter stream. PVLDB, 6(14):1966–1977, 2013.

[58] L. Liu, L. Sun, Y. Rui, Y. Shi, and S. Yang. Web video topic discovery and tracking via bipartite graph reinforcement model. In Proceedings of the 17th international conference on World Wide Web, pages 1009–1018. ACM, 2008.

[59] X. Liu, M. Zhou, F. Wei, Z. Fu, and X. Zhou. Joint inference of named entity recognition and normalization for tweets. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 526–535. Association for Computational Linguistics, 2012.

[60] W. Magdy and T. Elsayed. Adaptive method for following dynamic topics on twitter. In ICWSM, 2014.

[61] D. Nadeau. Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision. 2007.

[62] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

[63] M. Nagarajan, K. Gomadam, A. P. Sheth, A. Ranabahu, R. Mutharaju, and A. Jadhav. Spatio-temporal-thematic analysis of citizen sensor data: Challenges and experiences. In International Conference on Web Information Systems Engineering, pages 539–553. Springer, 2009.

[64] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions–i. Mathematical Programming, 14(1):265–294, 1978.

[65] M. E. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.

[66] Mets, team of big shoulders, sweep Cubs to reach World Series. http://www.nytimes.com/2015/10/22/sports/baseball/new-york-mets-beat-chicago-cubs-nlcs.html?smid=tw-nytimes&smtyp=cur.

[67] J. W. Osborne. Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage, 2012.

[68] J. W. Osborne and A. Overbay. Best practices in data cleaning. Best practices in quantitative methods, pages 205–213, 2008.

[69] M. Osborne, S. Moran, R. McCreadie, A. von Lunen, M. Sykora, E. Cano, N. Ireson, C. Macdonald, I. Ounis, Y. He, T. Jackson, F. Ciravegna, and A. O’Brien. Real-time detection, tracking, and monitoring of automatically discovered events in social media. 2014.

[70] S. Papadopoulos and Y. Kompatsiaris. Social multimedia crawling for mining and search. Computer, 47(5):84–87, 2014.

[71] S. Petrović, M. Osborne, and V. Lavrenko. Streaming first story detection with application to twitter. In HLT, pages 181–189, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[72] K. Q. Pu, O. Hassanzadeh, R. Drake, and R. J. Miller. Online annotation of text streams with structured entities. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, pages 29–38, New York, NY, USA, 2010. ACM.

[73] J. Radcliffe. Magic quadrant for master data management of customer data. G00206031, Gartner, Inc., Stamford, USA, 2009.

[74] T. Redman. Data Driven: Profiting from Your Most Important Business Asset. General management. Harvard Business Press, 2008.

[75] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimen- tal study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics, 2011.

[76] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1524–1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[77] M. Sadri, S. Mehrotra, and Y. Yu. Online adaptive topic focused tweet acquisition. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, pages 2353–2358, New York, NY, USA, 2016. ACM.

[78] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 851–860, New York, NY, USA, 2010. ACM.

[79] S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.

[80] D. A. Shamma, L. Kennedy, and E. Churchill. Statler: Summarizing media through short-message services. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW '10), 2010.

[81] W. Shen, J. Wang, P. Luo, and M. Wang. Linden: Linking named entities with knowl- edge base via semantic knowledge. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 449–458, New York, NY, USA, 2012. ACM.

[82] W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 68–76, New York, NY, USA, 2013. ACM.

[83] Proof of probabilistic maximum coverage being NP-hard. http://www.ics.uci.edu/~dvk/pub/sigmod14ext.pdf.

[84] M. A. Soliman, I. F. Ilyas, and K. C. C. Chang. Top-k query processing in uncertain databases. In 2007 IEEE 23rd International Conference on Data Engineering, pages 896–905, April 2007.

[85] C. Spearman. The proof and measurement of association between two things. The American journal of psychology, 15(1):72–101, 1904.

[86] M. A. Todwal and M. Wanjari. Finding high-quality content in social media.

[87] Real-time filtering task. https://github.com/lintool/twitter-tools/wiki/ TREC-2015-Track-Guidelines.

[88] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In ICWSM, pages 178–185, 2010.

[89] J. Van den Broeck, S. A. Cunningham, R. Eeckels, and K. Herbst. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med, 2(10):e267, 2005.

[90] M. Van Erp, G. Rizzo, and R. Troncy. Learning with the web: Spotting named entities on the intersection of nerd and machine learning. In # MSM, pages 27–30, 2013.

[91] X. Wang, F. Wei, X. Liu, M. Zhou, and M. Zhang. Topic sentiment analysis in twitter: A graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 1031–1040, New York, NY, USA, 2011. ACM.

[92] W. Wu, H. B. Lim, and K.-L. Tan. Query-driven data collection and data forwarding in intermittently connected mobile sensor networks. In Proceedings of the Seventh International Workshop on Data Management for Sensor Networks, pages 20–25. ACM, 2010.

[93] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1201–1208, New York, NY, USA, 2009. ACM.

[94] X. Zhang and J. Chomicki. On the semantics and evaluation of top-k queries in prob- abilistic databases. In Data Engineering Workshop, 2008. ICDEW 2008. IEEE 24th International Conference on, pages 556–563, April 2008.
