UPTEC IT 13 003, Degree project 30 credits, February 2013

Tweet Collect: short text message collection using automatic query expansion and classification

Erik Ward

Abstract

The growing number of Twitter users creates large amounts of messages that contain valuable information for market research. These messages, called tweets, are short, contain Twitter-specific writing styles and are often idiosyncratic, which gives rise to a vocabulary mismatch between the keywords typically chosen for tweet collection and the words actually used to describe television shows.

A method is presented that uses a new form of query expansion that generates pairs of search terms and takes into consideration the language usage of Twitter in order to access user data that would otherwise be missed. Supervised classification, without manually annotated data, is used to maintain precision by comparing collected tweets with external sources. The method is implemented, as the Tweet Collect system, in Java, utilizing many processing steps to improve performance.

The evaluation was carried out by collecting tweets about five different television shows during their time of airing. It indicates, on average, a 66.5% increase in the number of relevant tweets compared with using the title of the show as the search terms, at 68.0% total precision. Classification gives a slightly lower average increase of 55.2% in the number of tweets, but a greatly increased total precision of 82.0%. The utility of an automatic system for tracking topics that can find additional keywords is demonstrated. Implementation considerations and possible improvements that can lead to better performance are discussed.

Supervisor: Kazushi Ikeda. Subject reviewer: Tore Risch. Examiner: Lars-Åke Nordén. ISSN: 1401-5749, UPTEC IT 13 003. Sponsor: KDDI R&D Laboratories, Göran Holmquist Foundation. Printed by: Reprocentralen ITC.

Sammanfattning

Social media such as Twitter are growing in popularity and large numbers of messages, tweets, are written every day. These messages contain valuable information that can be used for market research, but they are very short, 140 characters, and in many cases exhibit an idiosyncratic manner of expression. To reach as many tweets as possible about a certain product, for example a TV program, the right search terms must be available; one Twitter user does not necessarily use the same words to describe the same thing as another. Different groups thus use different language and jargon. In the text of Twitter messages this is obvious: we can see how some use certain so-called hashtags to express themselves, along with other linguistic variations. This leads to what is usually called the vocabulary mismatch problem.

To collect as many Twitter messages as possible about different products, a system that can generate new search terms has been developed, here called Tweet Collect. By analyzing which words carry the most information, generating pairs of words that describe different things, and taking into account the language used on Twitter, new search terms are created from the original search terms, so-called query expansion. In addition to collecting the tweets that match the new search terms, a machine learning algorithm decides whether these tweets are relevant or not, in order to increase precision.

After collecting tweets for five TV programs, the system was evaluated by carrying out a sample survey of the newly collected tweets. This survey shows that, on average, the number of relevant tweets increases by 66.5% compared with using only the title of the TV program. Of all collected tweets, only 68.0% are actually about the TV program, but by using machine learning this can be increased to 82.0%, at the cost of only a 55.2% increase in new, relevant tweets.

This report demonstrates the usefulness of an automatic system that can find new search terms and thereby counteract the vocabulary mismatch problem. By reaching tweets written with a different language use, it is argued that the systematic error in tweet collection is reduced. The system's implementation in the Java programming language is discussed and improvements that can lead to increased efficiency are proposed.

This thesis is dedicated to the wonderful country of Japan and all who come to experience her.

This thesis expands upon:

Erik Ward, Kazushi Ikeda [1], Maike Erdmann [2], Masami Nakazawa [2], Gen Hattori [2], and Chihiro Ono [2]. Automatic Query Expansion and Classification for Television Related Tweet Collection. Proceedings of Information Processing Society of Japan (IPSJ) SIG Technical Reports, vol. 2012, no. 10, pp. 1-8, 2012.

Acknowledgment

I wish to thank the Göran Holmquist Foundation and the Sweden Japan Foundation for travel funding.

[1] Supervisor
[2] Proofreading

Glossary

AQE: Automatic Query Expansion; blind relevance feedback.
Corpus: A set of documents, typically in one domain.
Relevance feedback: Update a query based on documents that are known to be relevant for this query.

Table of Notations

Ω         The vocabulary: the set of all known terms.
t         Term: a word without spacing characters.
q         Query: a set of terms. q ∈ Q ⊂ D.
C         Corpus: a set of documents.
d         Document: a set of terms. d ∈ D, where D is the set of all possible documents.
tf(t, d)  Term frequency: an integer-valued function that gives the frequency of occurrence of t in d.
df(t)     Document frequency: the number of documents in a corpus that contain t.
idf(t)    lg(1/df(t)).
R         Set of related documents; used for automatic query expansion.

Contents

1 Introduction 1

2 Background 3
  2.1 Twitter ...... 3
    2.1.1 Structure of a tweet ...... 3
    2.1.2 Accessing twitter data: Controlling sampling ...... 4
    2.1.3 Stratification of tweet users and resulting language use ...... 5
  2.2 Information retrieval ...... 6
    2.2.1 Text data: Sparse vectors ...... 6
    2.2.2 Terms weights based on statistical methods ...... 9
    2.2.3 The vocabulary mismatch problem ...... 9
    2.2.4 Automatic query expansion ...... 9
    2.2.5 Measuring performance ...... 11
    2.2.6 Software systems for information retrieval ...... 12
  2.3 Topic classification ...... 12
  2.4 External data sources ...... 13

3 Related work 15
  3.1 Relevant works in information retrieval ...... 15
    3.1.1 Query expansion and pseudo relevance feedback ...... 16
    3.1.2 An alternative representation using Wikipedia ...... 17
  3.2 Classification ...... 17
    3.2.1 Television ratings by classification ...... 18
    3.2.2 Ambiguous tweet about television shows ...... 18
    3.2.3 Other topics than television ...... 20
  3.3 Tweet collection methodology ...... 21
  3.4 Summary ...... 22

4 Automatic query expansion and classification using auxiliary data 25
  4.1 Problem description and design goals ...... 25
  4.2 New search terms from query expansion ...... 26
    4.2.1 Co-occurrence heuristic ...... 27
    4.2.2 Hashtag heuristic ...... 28
    4.2.3 Algorithms ...... 29
    4.2.4 Auxiliary Data and Pre-processing ...... 30
    4.2.5 Twitter data quality issues ...... 30
    4.2.6 Collection of new tweets for evaluation ...... 32
  4.3 A classifier to improve precision ...... 33
    4.3.1 Unsupervised system ...... 34
    4.3.2 Data extraction ...... 34
    4.3.3 Web scraping ...... 34
    4.3.4 Classification of sparse vectors ...... 34
    4.3.5 Features ...... 35
    4.3.6 Classification ...... 35
  4.4 Combined approach ...... 36

5 Tweet Collect: Java implementation using No-SQL database 37
  5.1 System overview ...... 37
  5.2 Components ...... 38
    5.2.1 Statistics database ...... 39
    5.2.2 Implementation of algorithms ...... 41
    5.2.3 Twitter access ...... 43
    5.2.4 Web scraping ...... 43
    5.2.5 Classification ...... 43
    5.2.6 Result storage and visualization ...... 44
  5.3 Limitations ...... 44
  5.4 Development methodology ...... 45

6 Performance evaluation 47
  6.1 Collecting tweets about television programs ...... 47
    6.1.1 Auxiliary data ...... 48
    6.1.2 Experiment parameters ...... 49
    6.1.3 Evaluation ...... 49
  6.2 Results ...... 50
    6.2.1 Ambiguity ...... 51
    6.2.2 Classification ...... 51
    6.2.3 System results ...... 53

7 Analysis 55
  7.1 System results ...... 55
  7.2 Generalizing the results ...... 56
  7.3 Evaluation measures ...... 57
  7.4 New search terms ...... 57
  7.5 Classifier performance ...... 57

8 Conclusions and future work 61
  8.1 Applicability ...... 61
  8.2 Scalability ...... 62
  8.3 Future work ...... 62
    8.3.1 Other types of products and topics ...... 62
    8.3.2 Parameter tuning ...... 62
    8.3.3 Temporal aspects ...... 63
    8.3.4 Understanding names ...... 63
    8.3.5 Improved classification ...... 63
    8.3.6 Ontology ...... 64
    8.3.7 Improved scalability and performance profiling ...... 64

Bibliography 65

Appendices 68

A Hashtag splitting 69
B SVM performance 71

List of Figures

2.1 The C4.5 classifier ...... 13

3.1 Approach to classifying tweets, here for television shows, but the same approach applies for other proper nouns...... 19

4.1 Visualization of fraction of tweets by keywords for the show Saturday night live, here different celebrities that have been on the show dominate the resulting twitter feed...... 33

4.2 Conceptual view of collection and classification of new tweets...... 36

5.1 Conceptual view of collection and classification of new tweets...... 38

5.2 How the different components are used to evaluate system performance. This does not represent the intended use case, where collection, pre-processing and classification are an ongoing process ...... 39

6.1 Results of filtering auxiliary data to improve data quality. Note that the first filtering step is not included here and these tweets represent strings containing either the title of a show or the title words formed into a hashtag...... 48

6.2 Fraction of tweets by search terms for How I met your mother...... 52

6.3 Fraction of tweets by search terms for The X factor...... 52

7.1 Ways to generate training data from auxiliary data. Here we have two data sets, A and B, that correspond to searching for the titles A and B, respectively. Either we can have a classifier for each title, the left case, or we can have just one classifier that is trained on the Cartesian product of data sets and titles, the right case. Regrettably tests show that we cannot use the right case unless we include training data of the type R_{title,title} for all shows ...... 59

List of Tables

2.1 Examples of two vector representations of the same document. In this example the vocabulary is severely limited and readers should imagine a vocabulary of several thousands of words and the resulting, sparse, vector representations. Note that capitalization is ignored which is very common in practice ...... 7
2.2 Inverted index structure. We can look up which documents that contain certain words by grouping document numbers by the words that are included in the document. If we wish to support frequency counts of the words we store not only document numbers but instead tuples of (Number, frequency) ...... 8

3.1 Methods that I use from related works in my combined approach of query expansion and classification...... 22

4.1 Expansion terms for the show “How I met your mother” using equation 2.7 and resulting search terms from the hashtag, mention and co-occurrence heuristics. Note that a space means conjunction and a comma means disjunction. This used data where tweets mentioning other shows have been removed ...... 30
4.2 Search terms generated for the television show The vampire diaries using a moderately sized data-set ...... 31

5.1 List of dependencies organized by (sub) component...... 40

6.1 TV shows used for collecting tweets with new search terms. Shows marked with “*” are aired as reruns multiple times every day ...... 49
6.2 Text sources used for comparing with tweets ...... 50
6.3 Number of tweets collected for the different TV shows during 23h30min ...... 51
6.4 Percentage of tweets containing the title that are related to the television show ...... 51
6.5 Classification results when using manually labeled test data as training data with 10-fold cross validation ...... 52
6.6 Classification results when using training data generated from the same external sources, training examples are from all five shows ...... 52
6.7 Class distribution of annotated data after classification by baseline, left, and C4.5 classifiers, right. The baseline classifier is the naive classifier: c_baseline(tweet) = related ...... 53
6.8 System performance using automatic query expansion, before and after classification. The subscript c denotes results after classification ...... 54

7.1 95% confidence interval for accuracy with training data generated from the same external sources, training examples are from all five shows ...... 56
7.2 First 13 term pairs for AQE using top 40 terms to form pairs and virtual documents of size 5. Also visible is a bug where I do not remove hashtags from consideration when forming pairs ...... 58

B.1 Results of classification of annotated test data with linear support vector machines. Text data is treated as sparse vectors...... 71

List of Algorithms

1 Algorithm, Top(K,R), produces an array of single search terms ...... 29
2 Algorithm, Pairs, produces the pairs of search terms used ...... 29

Chapter 1

Introduction

“Ultimately, brands need to have a role in society. The best way to have a role in society is to understand how people are talking about things in real time.” – Jean-Philippe Maheu, Chief Digital Officer, Ogilvy [19]

Adoption of social media has increased dramatically in recent years and millions of users use social media services every day. There are, for example, 806 million Facebook users [11] and 140 million twitter users [5]. Since the creation of material is decentralized and requires no permission, enormous quantities of unstructured, uncategorized information are created by users every minute; 340 million twitter messages are authored every day [5].

This development has co-occurred with the explosion of data generation in general and presents IT practitioners and computer scientists with new unsolved problems but also with opportunities for new business. Industry has quickly realized that there is value in all this unstructured data, coining the equivocal term big data. The market for managing big data has grown faster than the IT sector in general and showed a growth of 10% annually, to $100 billion in 2010 [3].

The multitude of data available yields unprecedented opportunities to gain insight into what people are thinking and what they want; to conduct automated market research. This knowledge is extremely valuable for public relations and for advertising. One maturing technology that attempts to analyze users' opinions in text is sentiment analysis; another is estimating ratings of television programs using the number of twitter messages written [39][35].

But for these technologies to be truly useful, text about the topics of interest needs to be collected in a reliable and representative fashion. Classification and information retrieval techniques can be used to improve the quality and reach of twitter message collection. However, human text, especially in a social media setting, is often very vague and one problem is to find the many messages that do not explicitly mention the topic that one wishes to analyze: the vocabulary mismatch problem [12]. Mitigating these difficulties is the focus of this thesis.


A crucial part of the process of conducting market research on a topic, such as determining sentiment towards a certain product or estimating ratings, is to get a good sample of messages. When gathering messages in social media, keywords determined by an analyst are often used, such as in [38], [39] and [35]. I argue that this method ignores a large fraction of the messages relating to certain topics and thus detrimentally affects the validity of later analysis. The idiosyncratic and novel language use on twitter, driven by the short message length, results in a vocabulary mismatch that can be mitigated by a systematic method for finding the messages not covered by using the title, or other manually selected terms, as search terms.

At its core, the research problem addressed by this thesis work is: get as many tweets as possible about a specified product.

This goal is to be met using methods suitable for a running and scalable system. I use the term product instead of topic here since this is closer to most business goals of tweet analysis and it reflects the experiments that I have carried out. In essence, I wish to optimize tweet collection.

To improve tweet collection I present and test the use of streaming retrieval with additional keywords determined using relevance feedback techniques and automatic query expansion (AQE), as seen in information retrieval, particularly in ad-hoc retrieval. By comparing term distributions in sets of messages about different topics I determine descriptive terms for each topic that yield improved recall when included as search terms. By also classifying the retrieved tweets as either relevant or irrelevant to the topic, higher precision can be achieved. Classification also, in part, deals with the issue of ambiguity [40].

The proposed method is evaluated by collecting tweets about television shows using streaming retrieval for popularity estimation, but the method is not limited to this domain. This thesis consists of the following chapters:

Chapter 2 An overview of general techniques.

Chapter 3 Related work in the area of tweet retrieval and classification.

Chapter 4 Methods that I have employed.

Chapter 5 Prototype system that I developed.

Chapter 6 Experiments and data.

Chapter 7 Analysis of results.

Chapter 8 Conclusions and future work.

Chapter 2

Background

This chapter presents the concepts and technologies involved, in particular: twitter, information retrieval (IR) and classification. The problem of conducting market research is first presented in terms of how to get data from twitter, then in terms of how to find which of this data to use by means of IR and classification. This chapter is intended to introduce the most common techniques used when finding relevant information, but these techniques are not used in the standard way in my proposed method. Instead they serve as the inspiration and implementation building blocks of the method. A holistic, conceptual view of the proposed method is introduced in chapter 4 and it could be useful to read that first if the reader is already familiar with the concepts presented here.

Subsection 2.1.2 is technical and subject to the changing implementation of the twitter API. However, it is necessary to analyze the API to understand the limitations present when working with twitter data, since this represents the access point used by researchers and other third parties.

2.1 Twitter

Twitter is a growing social media site where users can share short text messages of 140 characters: tweets. A user base of 140 million users [5] makes it a very interesting source of data. The data that I am interested in collecting is the messages where users write about certain TV programs, to be used for market research.

2.1.1 Structure of a tweet

Below is a hypothetical twitter message highlighting different features. The format of the message is very similar to what one can find in actual tweets and it is evident that much of the information has been shortened to fit in 140 characters.

1https://dev.twitter.com/


User:       Erik Ward
User tag:   erikuppsala
Text:       I am writing a master thesis http://bit.ly/KdMep5 @kddird #KDDI
Time-stamp: 17:22 PM - 12 Oct 2012 via web

The very short text message format has given rise to several conventions adopted by the community:

Retweet The letters “RT” at the start of a message indicate that it is a copy of another message.

User tag A unique string associated with each twitter account

Reply and mentions The @ sign indicates that the message is directed towards a specific user with user tag or refers to that user.

Hashtags A ’#’ sign followed by a keyword denotes the user selected category of the message (one category for each unique keyword string). Hashtags are unorganized and work by gentleman’s agreement.

Short URLs Several services provide a way to shorten URLs such as transforming http://www.it.uu.se/edu/exjobb/helalistan to http://bit.ly/KdMep5 by redirecting through their site.
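To make these conventions concrete, the following is a small illustrative sketch of how such surface features could be extracted from the raw text of a tweet with regular expressions. The class and method names are hypothetical and this is not the extraction code used in Tweet Collect.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative feature extraction: retweet marker, mentions, hashtags and URLs
    // are pulled out of the raw tweet text with simple regular expressions.
    public class TweetFeatures {
        private static final Pattern MENTION = Pattern.compile("@(\\w+)");
        private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
        private static final Pattern URL = Pattern.compile("(https?://\\S+)");

        public static boolean isRetweet(String text) {
            return text.startsWith("RT ");   // "RT" at the start marks a copied message
        }

        public static List<String> extract(Pattern pattern, String text) {
            List<String> found = new ArrayList<>();
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                found.add(m.group(1));
            }
            return found;
        }

        public static void main(String[] args) {
            String tweet = "I am writing a master thesis http://bit.ly/KdMep5 @kddird #KDDI";
            System.out.println("retweet:  " + isRetweet(tweet));
            System.out.println("mentions: " + extract(MENTION, tweet));
            System.out.println("hashtags: " + extract(HASHTAG, tweet));
            System.out.println("urls:     " + extract(URL, tweet));
        }
    }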

2.1.2 Accessing twitter data: Controlling sampling

In essence, accessing twitter data is done by collecting tweets that contain certain keywords or are written by specific users. What the keywords are for finding tweets about television shows, and how they are obtained, is described in chapter 4. But even if these keywords are known, access to the data is limited because of the underlying medium. The basic approach is sampling at different times using standard HTTP GET requests, the so-called REST approach. Each sample has an upper limit on how many tweets are retrieved and a user is allowed only a certain number of calls per hour. Conceptually, the Twitter company maintains a buffer of tweets of a fixed size that is indexed by a full text index for Boolean search. This FIFO cache is replaced with new tweets at different rates depending on the rate at which tweets are produced. Users are allowed to query this very large cache of tweets and thus gain access only to the fraction of results that was produced in a fairly recent time period. Furthermore, not all tweets that are produced are available through this method and the complexity of a query is limited.

2REpresentational State Transfer


“Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface." – https://dev.twitter.com/docs/api/1.1/get/search/tweets

Besides the fact that not all tweets are accessible through the REST approach, there are further complications. These limitations have to do both with the fact that the number of tweets per request is limited and that the number of requests is limited. If tweets are produced faster than requests are issued, the surplus tweets are dropped without warning; this can happen if one wishes to track many different keyword sets. If they are produced more slowly, then each request will return many previously seen tweets (wasting bandwidth). The following is also stated in the API documentation, making long queries harder to use in this setting:

“Limit your searches to 10 keywords and operators.” – https://dev.twitter.com/docs/api/1.1/get/search/tweets

Twitter data can also be accessed in a streaming fashion in two ways:

1. Access all incoming tweets or a sample of all incoming tweets. Accessing a random sample of all tweets is not attractive for our application and obtaining all tweets is a very data intensive streaming service requiring a contract with retailers.

2. Access all messages that match a Boolean query, e.g. “My friend has a dog” and “My father drives a Volvo” will both match q = (My ∧ dog) ∨ (My ∧ Volvo). This sample is limited to at most 1% of all tweets but represents the most exhaustive way of collecting tweets containing certain keywords; a small sketch of this kind of Boolean matching is given below.
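This matching can be illustrated with the following simplified sketch, which assumes terms are separated by whitespace; it is not a description of Twitter's actual implementation.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Simplified sketch of Boolean keyword matching: a query is a disjunction of
    // conjunctions, e.g. (my AND dog) OR (my AND volvo).
    public class BooleanMatch {

        // The query matches if any conjunction has all of its terms in the tweet.
        public static boolean matches(List<List<String>> query, String tweet) {
            Set<String> terms = new HashSet<>(
                    Arrays.asList(tweet.toLowerCase().split("\\s+")));
            for (List<String> conjunction : query) {
                if (terms.containsAll(conjunction)) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            List<List<String>> q = Arrays.asList(
                    Arrays.asList("my", "dog"),
                    Arrays.asList("my", "volvo"));
            System.out.println(matches(q, "My friend has a dog"));      // true
            System.out.println(matches(q, "My father drives a Volvo")); // true
            System.out.println(matches(q, "I like cats"));              // false
        }
    }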

When researchers evaluate their twitter-related research it is common to use a static data set composed of messages collected for a certain query over a period of time [10][26]. One important such data set is the TREC microblog corpus. I will revisit various sampling issues in chapter 3. In my project a combination of methods is used: the REST search method to acquire a large sample of tweets for many different topics over a long period of time, and the streaming method to track a specific topic in an exhaustive way.
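In Java, the streaming track method is usually reached through a client library. The following sketch assumes the Twitter4J library, with OAuth credentials configured in a twitter4j.properties file; the tracked phrases are only examples and this is not a verbatim excerpt of the Tweet Collect code.

    import twitter4j.FilterQuery;
    import twitter4j.Status;
    import twitter4j.StatusAdapter;
    import twitter4j.TwitterStream;
    import twitter4j.TwitterStreamFactory;

    // Sketch of tracking keywords with the streaming API via Twitter4J.
    // Error handling and persistence are omitted; the stream runs on its own
    // thread until shutdown() is called.
    public class KeywordTracker {
        public static void main(String[] args) {
            TwitterStream stream = new TwitterStreamFactory().getInstance();
            stream.addListener(new StatusAdapter() {
                @Override
                public void onStatus(Status status) {
                    // Each status matched one of the tracked phrases.
                    System.out.println(status.getUser().getScreenName()
                            + ": " + status.getText());
                }
            });
            // Within one track phrase, space-separated words are AND:ed together;
            // separate phrases are OR:ed by the API.
            stream.filter(new FilterQuery().track("how i met your mother", "himym"));
        }
    }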

2.1.3 Stratification of tweet users and resulting language use

It is safe to assume that different groups of twitter users use different language to describe their thoughts. Certain trends in e.g. hashtag use spread to different groups of users depending on their position in the social network and other factors such as what their interests are and so on.

3 Accessed Oct. 16 2012
4 https://sites.google.com/site/microblogtrack/2012-guidelines

There is support for this assumption in work done at KDDI R&D, where feature extraction was used to extract terms used by different demographic groups, showing that the terms used differ [23]. If we expand our assumption slightly we can also assume that an analyst who selects keywords to use for tweet collection is not necessarily aware of the language use of different strata. It is therefore possible to achieve an improvement in recall if we can catch other types of language use. In the proposed method we start with the title of a television show as the basis for our analysis, see chapter 4, but it is not hard to imagine that the jargon of users is not an exact specification and that they will sometimes use the title combined with words that are more specific to their writing style, demographic and social context. These words could include slang expressions and hashtags.

2.2 Information retrieval

This section is based upon the book Introduction to Information Retrieval by Manning, Raghavan and Schütze [25] and summarizes the key concepts of information retrieval that are used in this thesis. The task of finding the correct content out of a large collection of documents is often called information retrieval (IR). Most work in IR focuses on finding the text document that, according to the models employed, the user wants, although there are several applications in which IR is extended to other content such as images, video or audio recordings. A typical task for a commercial IR system is ad hoc retrieval: find the best documents related to a set of user supplied search terms, a query. This thesis is not concerned with ad hoc retrieval; the topics for which I want to retrieve documents are automatically generated or known beforehand. Nevertheless, a great deal of overlap exists between the more traditional techniques of IR and my proposed method, described in chapter 4. Specifically, my method uses queries, represents documents in a similar way and builds upon an existing IR system intended for ad-hoc retrieval.

2.2.1 Text data: Sparse vectors

Textual data is composed of strings of characters where one can choose different scopes of analysis; common strategies are to regard a document as one scope or to consider parts of documents as scopes on their own, e.g. 100 word segments or the different structural parts of the documents such as titles, captions and main text. Delving further into the text, one can look at words as the smallest building block or look at sub-strings of length n, often called n-grams. The concept of n-grams is also used in the literature for a sequence of n words, but it should be clear from the context which is meant. In this thesis we will look at words, often called terms, separated by spacing characters, as the atomic unit of strings and denote these by t.
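As a small illustration of these units of analysis (hypothetical helper code, not taken from the thesis implementation), terms and word n-grams can be produced as follows:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Illustration: split a text into terms on spacing characters and form word n-grams.
    public class NGrams {
        public static List<String> wordNGrams(String text, int n) {
            String[] terms = text.toLowerCase().split("\\s+");
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= terms.length; i++) {
                grams.add(String.join(" ", Arrays.copyOfRange(terms, i, i + n)));
            }
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(wordNGrams("The quick brown fox", 2));
            // [the quick, quick brown, brown fox]
        }
    }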


Table 2.1: Examples of two vector representations of the same document. In this example the vocabulary is severely limited and readers should imagine a vocabulary of several thousands of words and the resulting, sparse, vector representations. Note that capitalization is ignored, which is very common in practice.

Vocabulary:          'a', 'brown', 'dog', 'fox', 'i', 'is', 'jumped', 'lazy', 'over', 'quick', 'the', 'this'
Document:            "The quick brown fox jumped over the lazy dog"
Words (lex. order):  'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the', 'the'
Boolean vector:      (0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0)
Frequency vector:    (0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 0)

A well-known concept is lexicographic ordering, and this definition of how strings are ordered can be used to transform text into a vector representation in a concise way. Assume that one knows all words that can appear in a language; call this set the vocabulary, Ω. Using the vocabulary one can transform any document into a vector representation by counting the number of occurrences of each word and outputting these counts in lexicographical order: create a feature vector for the string. Table 2.1 shows an example of the process. In practice we do not know all words that can appear in texts, but we can ignore words that have not been seen before or add them to the vocabulary, since the vectors are never formed explicitly.

Conceptually, the Boolean model, used by the twitter search API, creates Boolean vectors as in table 2.1 for all documents and all queries. A bit-wise matching is then performed and the documents with one or more matches are eligible results of a search. Since the data is sparse, many optimizations in storage and computation are possible, but they are omitted here. For twitter messages the Boolean model makes a lot of sense: tweets are very short and each word can thus be assumed to be very important to the message. Furthermore, it is the method that requires the least computation and storage, so it can be implemented effectively with an inverted index, see table 2.2.

If we study table 2.2 further we can see that the more common a word is, the more documents will be stored in the index entry for that word, possibly slowing down retrieval. For common data sets of documents, often called corpora and denoted by C, an important phenomenon called Zipf's law has been observed: the frequency of a word is roughly inversely proportional to its frequency rank. So roughly: the second most common word is about half as common as the most common word, and so on. This empirical law means that a very small number of words are present in most documents. To reduce the space and time needed for look-up in an inverted index these words are commonly ignored; they are often called stop words.

Table 2.2: Inverted index structure. We can look up which documents contain certain words by grouping document numbers by the words that are included in the document. If we wish to support frequency counts of the words we store not only document numbers but instead tuples of (Number, frequency).

Documents
Num.  Text
1     "a dog"
2     "a brown fox"

Index
Key      Value
'a'      1, 2
'brown'  2
'dog'    1
'fox'    2

If we are not content with regarding all documents that match, in the Boolean matching sense, as equally relevant and want to rank documents in some way (typically the k highest-ranked documents, those with the highest scores, are assumed to be the most important), there are many extensions and interpretations one can use. The basic building blocks for ranking are the following two assumptions:

• The more common a word is inside a document, the more relevant the document is to a query containing this word.
• The more documents that contain a word, the less important the word is.

This leads us to the very common representation of texts as tf · idf vectors (term frequency, inverse document frequency vectors). For each word that we keep track of in the vocabulary we also keep track of the number of documents that the term appears in and calculate idf(t) = lg(1/|{d | t ∈ d}|), d ∈ C. When a query q = {t1, t2, ...} is issued we sum up, for each document d, the tf · idf elements that match that document:

    Score(q, d) = \sum_{t \in q \wedge t \in d} w(t, d) = \sum_{t \in q \wedge t \in d} tf(t, d) \cdot idf(t)    (2.1)

One can also compare two documents, or a query and a document, in the form of tf · idf vectors using any distance metric, most notably the cosine distance:

    \cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\lVert \vec{d}_i \rVert \, \lVert \vec{d}_j \rVert}    (2.2)

A keen reader will note that by using frequency vectors to represent text we have lost a lot of information, namely the ordering information of words. This model may appear simplistic but has been shown to work well in practice. Because order is not accounted for, the model is often called the BOW (Bag Of Words) model, where bag is synonymous with multiset.
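To make equations 2.1 and 2.2 concrete, the following is a minimal, self-contained sketch of tf · idf weighting and cosine similarity over bag-of-words maps. It is an illustration only, not the Terrier-based implementation used later in the thesis, and it uses the common idf variant lg(|C|/df(t)).

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal illustration of tf-idf weighting and cosine similarity over a
    // bag-of-words representation.
    public class TfIdf {

        static Map<String, Integer> termFrequencies(String doc) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc.toLowerCase().split("\\s+")) {
                tf.merge(term, 1, Integer::sum);
            }
            return tf;
        }

        // idf(t) = lg(|C| / df(t)), a common variant of the definition in the text.
        static double idf(String term, List<Map<String, Integer>> corpus) {
            long df = corpus.stream().filter(d -> d.containsKey(term)).count();
            return Math.log((double) corpus.size() / Math.max(df, 1)) / Math.log(2);
        }

        static Map<String, Double> tfIdfVector(Map<String, Integer> tf,
                                               List<Map<String, Integer>> corpus) {
            Map<String, Double> vec = new HashMap<>();
            tf.forEach((t, f) -> vec.put(t, f * idf(t, corpus)));
            return vec;
        }

        static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, na = 0, nb = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                na += e.getValue() * e.getValue();
            }
            for (double v : b.values()) {
                nb += v * v;
            }
            return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            List<Map<String, Integer>> corpus = Arrays.asList(
                    termFrequencies("the quick brown fox jumped over the lazy dog"),
                    termFrequencies("my friend has a brown dog"),
                    termFrequencies("my father drives a volvo"));
            Map<String, Double> d0 = tfIdfVector(corpus.get(0), corpus);
            Map<String, Double> d1 = tfIdfVector(corpus.get(1), corpus);
            System.out.println("cosine(d0, d1) = " + cosine(d0, d1));
        }
    }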

2.2.2 Terms weights based on statistical methods

One can formalize the two basic assumptions used in the tf · idf vector representation and instead use statistical methods. The intuition is the same but the models are more familiar, since this allows IR systems to benefit from the vast knowledge in probability theory and statistics. For instance, it is possible to weigh terms in documents according to some function, f, defined on the two probabilities p(t|d): the probability of term t given a document d, and p(t|C): the probability of a term in the whole corpus.

    w(t, d) \sim f\big(p(t|d),\, p(t|C)\big)    (2.3)

    p(t|d) = \frac{tf}{TF}    (2.4)

    p(t|C) = \frac{tf_C}{|C|}    (2.5)

Where tf is the number of times we see t in document d, TF is the number of terms in d, tf_C is the number of times we see t in the whole collection and |C| is the number of terms in the collection. Note that we are essentially estimating the probability distribution of terms in different sets using the assumption that terms are distributed according to the mean (maximum likelihood estimation). These methods are usually similar to hypothesis testing in that we choose terms for which we reject the null hypothesis that p(t|d) has a good fit with p(t|C). Another way is to consider the information of seeing a term in a document, such as using the Kullback-Leibler divergence. It is also possible to use an urn model that explicitly considers the size of documents, TF, instead of just p(t|d) = tf/TF, such as the divergence from randomness framework [7].

2.2.3 The vocabulary mismatch problem

When a user is searching for a set of relevant documents it is a typical case that the user and the authors of those documents use a different vocabulary (not to be confused with the vocabulary of all seen terms, Ω). This means that not all relevant documents are found or that the ranking of documents is not in line with what the user finds important. This problem can also be seen when one tries to conduct market research using twitter data. It is not hard to imagine that authors of tweets use a different vocabulary and jargon than an analyst who selects keywords to search for.

2.2.4 Automatic query expansion

Query expansion is a method where the original query is used to generate new queries to provide greater effectiveness of the IR task. There are many different ways one can perform query expansion automatically; they can be very broadly

categorized into local and global methods. Local methods use the results obtained from the first query to find new search terms while global methods often use global semantic information such as Wordnet or an auxiliary corpus. An excellent review of query expansion is available in the survey by Carpineto and Romano [12]. In the local case query expansion is often pseudo relevance feedback, related to the concept of relevance feedback. In relevance feedback, an ad-hoc IR technique, users are asked to grade results and a new search is carried out taking into account the best-rated results of the first query. If instead the top k results of ranked ad-hoc retrieval are assumed to be relevant, here denoted as R, the pseudo relevant set, one can use this set to get new query terms. A common technique is to use Rocchio's algorithm, where a BOW query vector is moved towards relevant documents according to different weights.

    \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|R|} \sum_{\vec{d}_j \in R} \vec{d}_j    (2.6)

We see that the original query vector q_0 of tf · idf elements is multiplied with the scalar α and added to a summation of the tf · idf vectors of the documents in R, multiplied with another scalar β. Note that we need to use a similarity-based score for a new ranking of results, as in equation 2.2, rather than a summation of all w(t), t ∈ q_m as in equation 2.1, to have use of query expansion with re-weighting. For our purpose the re-weighting is not very interesting since we are interested only in obtaining new search terms. If we consider the vector (1/|R|) \sum_{\vec{d}_j \in R} \vec{d}_j as a list instead and sort its elements in descending order of weight, we can use the first L elements as additional terms. The statistical method in section 2.2.2 can easily be extended to query expansion where instead of p(t|d) one considers p(t|R), where R is a set of known "relevant" documents. Using the χ2 statistic we can perform AQE by ranking expansion terms according to highest score and using the top K ones:

    score(t) = \chi^2(t) = \frac{(p(t|R) - p(t|C))^2}{p(t|C)}    (2.7)

    p(t|R) = \frac{tf_R}{|R|} \qquad p(t|C) = \frac{TF}{|C|}
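A minimal sketch of this χ² ranking of expansion terms is given below. It is an illustration of equation 2.7 only, not the Tweet Collect implementation; the example documents are invented and simple add-one smoothing is assumed for terms unseen in the corpus counts.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Illustration of equation 2.7: rank candidate expansion terms by how strongly
    // their frequency in the pseudo-relevant set R deviates from the whole corpus C.
    public class ChiSquareExpansion {

        static Map<String, Integer> termCounts(List<String> docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String doc : docs) {
                for (String term : doc.toLowerCase().split("\\s+")) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
            return counts;
        }

        public static List<String> topTerms(List<String> related, List<String> corpus, int k) {
            Map<String, Integer> rCounts = termCounts(related);
            Map<String, Integer> cCounts = termCounts(corpus);
            double rTotal = rCounts.values().stream().mapToInt(Integer::intValue).sum();
            double cTotal = cCounts.values().stream().mapToInt(Integer::intValue).sum();
            return rCounts.keySet().stream()
                    .sorted(Comparator.comparingDouble((String t) -> {
                        double pR = rCounts.get(t) / rTotal;
                        // add-one smoothing for terms unseen in the corpus counts
                        double pC = cCounts.getOrDefault(t, 1) / cTotal;
                        return -((pR - pC) * (pR - pC) / pC);   // negate so highest score sorts first
                    }))
                    .limit(k)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<String> related = Arrays.asList("himym barney legendary",
                                                 "watching himym tonight");
            List<String> corpus = Arrays.asList("the weather is nice today",
                                                "watching the news tonight",
                                                "himym barney legendary",
                                                "watching himym tonight");
            System.out.println(topTerms(related, corpus, 3));  // e.g. [himym, barney, legendary]
        }
    }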

Query expansion in this thesis is used only to acquire additional search terms and no re-weighting of the search terms is done. In this sense AQE is very similar to association mining, especially the concept of confidence [12, p. 14].

In my proposed method I calculate a metric similar in spirit to confidence for the rule: keyword → group of records. Instead of χ2, there are other ways of hypothesis testing that can be used to find good descriptors of sets, such as the AIC (Akaike information criterion), which compares different models, in this case rules of the form above [6].

2.2.5 Measuring performance

In ad hoc retrieval, performance is perhaps best measured by what users think of the results returned. This is a very time consuming and expensive process, so most IR systems are tested on annotated test collections such as the TREC [4] collections. Here a set of queries is supplied along with a list of relevant documents for each; the IR system is then tested on how well it can retrieve the predetermined relevant documents. Often only the first couple of results are measured, but for the problem I am concerned with, maximum recall, this is not an option: I will instead sample the results to determine overall performance, see chapter 6. For each query q and underlying information need, each document d is in one of the four categories:

True positive A document that is relevant and was retrieved by the IR system in response to q. Denoted by tp.

True negative A document that is not relevant and was not retrieved, tn.

False positive A non-relevant document wrongly assumed to be relevant and thus retrieved, fp.

False negative A document that is relevant and was not retrieved, fn.

From these simple definitions several metrics have been developed. The two most common measures are:

    \mathrm{precision} = \frac{tp}{tp + fp}    (2.8)

    \mathrm{recall} = \frac{tp}{tp + fn}    (2.9)

Where precision reflects how many of the retrieved results are relevant and recall how many of the relevant results are made available. It is desirable to maximize both metrics but they are naturally opposed goals: if one returns all documents in the collection then recall is maximized (1.0), and if almost no documents are returned then precision can be maximized. In actual systems, increasing one often decreases the other. Therefore the harmonic mean of the two is often used to measure the IR system, called the F-measure:

    F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}    (2.10)
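As a small worked example with invented numbers: if a query retrieves 50 tweets of which 40 are relevant, while 20 further relevant tweets are missed, then tp = 40, fp = 10 and fn = 20, giving

    \mathrm{precision} = \frac{40}{40 + 10} = 0.80, \qquad \mathrm{recall} = \frac{40}{40 + 20} \approx 0.67, \qquad F_1 = \frac{2 \cdot 0.80 \cdot 0.67}{0.80 + 0.67} \approx 0.73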


These metrics can also be used in other cases when an asymmetric importance is assigned to two different classes: that related results are important and unrelated unimportant. A typical case is classification where one wants to filter out all unrelated records but keep all related ones. The difference between filtering and IR is blurred when there are temporal relevance demands on search results. In the extreme case, used in this thesis, where no results are stored for a standing query on streaming data, they become inseparable.

2.2.6 Software systems for information retrieval

There are many software systems that are perfectly suited for information retrieval, employing different versions of the inverted index idea presented in section 2.2.1. Relational database management systems or other systems optimized for e.g. fast indexing instead of consistency can be used. There are many specialized information retrieval systems and they can be called No-SQL databases because they do not adhere to SQL specifications. I have worked with one such specialized information retrieval system (No-SQL), Terrier 3.5 [31]. It is a relatively mature research system dedicated to information retrieval with open source code, good documentation and community support. Its design is focused on experimentation and configuration and it is written entirely in Java, giving me a good trade-off between performance, scalability, stability, ease of implementation and experimentation. One drawback is that there is no query language, and if one wishes to do other things than document search this has to be done in source code. This is still preferable since custom search operations are exactly the idea behind this software: Terrier is designed for easy modification of the open source code and easy configuration. In contrast, many SQL systems would require foreign functions or possibly even recompilation to do what I wanted to do.

2.3 Topic classification

The act of retrieving the top ranked documents for a query is itself a form of classification of all the documents in the corpus. But in the case of twitter no ranking is done of the results of streaming retrieval, so we can explicitly introduce a classification step here to improve precision. To clarify:

1. Obtain a query vector q

2. If desired, create a new query vector q∗ with AQE

3. Rank all documents according to q∗ by sorting their scores. For example scores obtained by using equation 2.2.

4. The k highest ranked documents are considered related.

7www.terrier.org


In an ad-hoc IR system the users can themselves decide what value of k is acceptable and in that way allow for maximum recall by looking at more and more results. In practice this will result in very low precision. By instead classifying all results rather than ranking them (giving them a score of either 1 or 0) I suggest that one can achieve more favorable results in terms of overall F-measure. A supervised classifier is a function or program that has a training step to modify its behavior. In this step it is fed data that is similar to the data it will later be asked to classify and can generalize various properties of the data in order to make an informed decision later [18]. I will assume that the reader is familiar with supervised classification and instead focus on concepts that are important for my proposed method.

The sparse vector formats listed in table 2.1 are not necessarily optimal for classification, and if we can introduce some form of processing to include background knowledge in the features used there is a possibility of improving the results. In chapters 3 and 4 I will elaborate on this idea further, but the basic scheme used is to focus as much as possible on transforming a sparse text representation into a more concise representation based on comparing it with other texts. Our classifier will need to make a decision for each tweet whether or not it is relevant to our information need. The information need in IR is usually expressed as a query, but in supervised classification we describe it in the form of training examples and, in the proposed method, also as external sources.

Figure 2.1: The C4.5 classifier

C4.5 is a commercial decision tree classifier. It is an extension of a basic decision tree induction algorithm [32] and thus creates partitioning rules for data into classes based on purity measures. In C4.5, and in the open source Java implementation J48 that I used, additional measures are taken to reduce over-fitting, to handle missing values and to make other improvements [34].
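As a sketch of how such a classifier can be trained and evaluated with the Weka toolkit that provides J48, consider the following; the ARFF file name and the pruning option are placeholders and the snippet is not the actual Tweet Collect training code.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch: train a J48 decision tree on tweet feature vectors stored in an ARFF
    // file and estimate its accuracy with 10-fold cross validation.
    public class TrainJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("tweet-features.arff");  // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);             // last attribute = class label

            J48 tree = new J48();
            tree.setConfidenceFactor(0.25f);                          // default pruning confidence
            tree.buildClassifier(data);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(tree);                                 // prints the induced tree
        }
    }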

2.4 External data sources

External data sources can be used both for query expansion by looking for context of the original search terms [12] and for classification by comparing with our data. The key issue is that we can find representative information about the information need. Many researchers have focused on link structure of external resources but due to time constraints I have not considered this angle of approach and instead focused on using the text of the source itself.

8An excellent book is [32], where most of the techniques used in this thesis are covered


Wikipedia The on-line encyclopedia Wikipedia is the largest such resource of its kind. It has been used by many researchers in text mining and I will use it to provide context for my classifier.

EPG EPGs (Electronic program guides) contain the airtime of a show, the most prominent actors of the show and a short synopsis. Several companies provide API access to EPG data, such as Rovi Corporation.

Web pages If we have web pages that are relevant to our information need we can use them as additional background information.

9 wikipedia.org
10 rovicorp.com

Chapter 3

Related work

In this chapter I will investigate related work regarding the collection of tweets with certain topics. One topic that is of special interest is television and thus much of the related work presented will be about the problem of finding and identifying television related tweets. Perhaps the most straightforward interpretation of the problem of identifying television related tweets is as a classification task. The solution taken by most researchers is that of supervised classification of tweets by topic. A training set of labeled data is used to train a classifier such as a support vector machine and the approach is tested on an unused part of the training set, typically using k-fold cross validation. But in reality a running system must deal not only with classifying tweets but also with retrieving them from a large database such as the twitter API, and therefore formulating the problem as an IR task is also attractive. I will also go through the approaches taken by different authors for collecting tweets. My proposed method, see chapter 4, is a combination of query expansion in tweet retrieval and classification of tweets, and thus these two research areas are directly related to my work. However, I have not found any directly comparable results.

3.1 Relevant works in information retrieval

I believe that one does not have to look at the problem strictly as a classification problem and that many techniques used in ad hoc retrieval can be used for the vertical search task. These techniques could be viewed as pre-processing methods for the classification task. The status of information retrieval of tweets is reviewed in [16], where several interesting analogues to more typical document (much longer than tweets) search techniques are assessed, among them a version of page rank for twitter, which does not yield great results. From IR in general, the perhaps most interesting technique to improve recall is query expansion [12].


3.1.1 Query expansion and pseudo relevance feedback

One promising idea is to account for the change of language use over time. Massoudi et al. [26] use a retrieval model where a language modeling approach is used and query expansion terms are generated by looking at the recency of the tweets that they appear in and how many tweets they appear in. Another variant of temporal pseudo relevance feedback used for analyzing twitter messages is to build a fully connected graph of initial search results where edges are weighted by their terms' temporal correlation, similar to the approach above. Page rank is then used on this graph to find the most relevant documents. The temporal profiles were built with four-hour intervals on a fairly small corpus of twitter messages, and page rank is not suited for working with this kind of graph, so it is not surprising that this TREC submission [2] was unsuccessful.

A very interesting use of Wikipedia is in one AQE approach where anchor texts are used as expansion terms. In [8] Wikipedia is indexed and searched for the same query terms as in an original query for a blog collection; the top Wikipedia documents returned are analyzed to find popular anchor texts that link to the highest ranked Wikipedia pages. These anchor texts are then used as expansion terms, resulting in an improvement over a baseline.

In [15] Efron uses hashtags to improve the effectiveness of twitter ad hoc retrieval. By analyzing a corpus of twitter messages he creates subsets where one hashtag is present and fits a probabilistic language model for each such subset. A language model is also fitted to each query and the models that correspond the best to the query model (according to Kullback-Leibler divergence) will have their hashtags added as additional query terms. This approach provides modest improvement, but I think that creating a language model from just the query terms is risky since there is so little evidence present in a query of a few words.

Papers submitted for the TREC micro-blog track of 2011 represent the use of different IR techniques for twitter search including topic modeling, different forms of query expansion, extensive use of hashtags and many other approaches. Many of the papers are not published in peer-reviewed journals but nevertheless represent the latest research in this area. The main evaluation measure was precision at 30 results, averaged over the different queries, and the best results were in the 40% range [2].

The approach taken by Bhattacharya et al. [2] is particularly interesting since they report one of the best unofficial test scores of more than 60% P@30 and use an IR methodology perhaps best suited to structured (XML) retrieval. They use Indri, which employs a combination of language modeling and inference in Bayesian networks. They create different regions from the tweet and external sources that can be treated using different language models and combine the similarity of a query with each of these regions in a Bayesian network.

1 Qatar Computing Research Institute submission to the TREC 2011 microblog track
2 Bhattacharya et al. University of Iowa (UIowaS) at TREC 2011 microblog track.
3 http://www.lemurproject.org/indri/

The external sources are web pages from URLs in the tweets and definitions of hashtags from a community-based site, i.e. they expand tweets to include externally referenced information.

3.1.2 An alternative representation using Wikipedia

Since my goal is to achieve good recall while maintaining precision I have looked at the work of Gabrilovich et al. with much interest. They use an alternative representation they call explicit semantic analysis, ESA, contrasting it with LSA (latent semantic analysis), where Wikipedia serves as the basis for the representation of documents. In [20] the basic idea of ESA is presented. Each word in a text is associated with a set of Wikipedia pages. They create this representation by building an inverse index of Wikipedia and, instead of using it for look-up, they use the index structure itself as the representation, i.e. each word is represented by a sparse vector with all Wikipedia pages as elements. A word such as Sweden might appear in thousands of articles, but using a tf · idf scheme the word might have the greatest association with articles about Sweden. To make the system feasible only the top k articles and the weights corresponding to the term frequency in those articles are kept. For a text, the vectors of the words in it are summed to create a document vector. This approach works best for short texts so it should be ideal for use with twitter messages.

The alternative representation is used to build an ad-hoc IR system where queries are also transformed into Wikipedia-space and compared, with e.g. cosine similarity, to the texts we have in a collection. Using only ESA results in poor performance but very impressive abilities to associate queries and texts that do not share a single word with each other, which highlights the possibility for greatly increased recall. In combination with a BOW IR system and automatic feature selection using the information measure, the method yields good results [17]. But this method can definitely cause a loss in precision for some queries because unrelated Wikipedia pages may contain the same word.
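The core of the representation can be sketched as follows. This is a toy illustration of the idea with invented concept weights; the real system in [20] builds the word-to-concept vectors from an inverted index over all of Wikipedia.

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration of ESA: each word maps to a sparse vector of weighted Wikipedia
    // concepts, and a text is represented by the sum of its words' concept vectors.
    public class EsaSketch {
        // word -> (Wikipedia article -> tf-idf-like weight); values are invented.
        static final Map<String, Map<String, Double>> WORD_CONCEPTS = new HashMap<>();
        static {
            WORD_CONCEPTS.put("sweden", Map.of("Sweden", 3.2, "Stockholm", 1.1));
            WORD_CONCEPTS.put("capital", Map.of("Capital city", 2.5, "Stockholm", 0.9));
        }

        public static Map<String, Double> represent(String text) {
            Map<String, Double> doc = new HashMap<>();
            for (String word : text.toLowerCase().split("\\s+")) {
                WORD_CONCEPTS.getOrDefault(word, Map.of())
                             .forEach((concept, w) -> doc.merge(concept, w, Double::sum));
            }
            return doc;
        }

        public static void main(String[] args) {
            System.out.println(represent("Sweden capital"));
            // e.g. {Stockholm=2.0, Sweden=3.2, Capital city=2.5}
        }
    }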

3.2 Classification

In this section I list some works that only look at the subset of the problem where a set of possibly related tweets has been acquired and we want to classify them, skipping the issue of recall when retrieving tweets from twitter. In general the related works mentioned in this section tackle the problem of classifying tweets containing ambiguous words, such as the company name Apple. Even though I have used classification to improve the precision of additional tweets gathered using other search terms, the justification for using classification is the same: it is needed to try to filter out tweets that superficially seem related but are not.


3.2.1 Television ratings by classification

Arguing that conventional TV ratings, the so-called Nielsen ratings, are outdated, Wakamiya et al. employ an alternative method for estimating the number of viewers [39]. In their paper they present a method that uses tweets to calculate the ratings, and they use a large data set from the Twitter Open API. The data was geotagged, filtered by keywords such as TV and watching, and later filtered further.

Here the key problem of identifying which messages are related to a particular TV show is addressed. As seen in other works, additional information about the television programs is used, here in the form of an electronic program guide (EPG). Textual similarity is then computed between the set of collected tweets and EPG entries. As far as I know, Wakamiya and her colleagues are unique in also incorporating both temporal and spatial information to make the decision. The textual similarity is based on the Jaccard similarity coefficient and a morphological analysis function is used to only compare nouns, possibly due to the way the EPG is structured. In contrast, one could for instance imagine that verbs such as watching could be useful, an observation made in other related works. To facilitate the large number of text comparisons required, an inverted index was employed.

The use of spatial relevance is motivated by the need to determine which TV station the author of a tweet was watching. Therefore it might be unnecessary in the general problem of determining whether or not a tweet is related to a particular TV program. Hypothesizing that users write about TV shows in close temporal proximity to the broadcast, a temporal relevance score is used in the final relevance measure, a quotient of the three similarity scores, which is then used to match a tweet to the highest rated EPG entry, corresponding to one television broadcast. The sought-after popularity measure of how many people watched the show can then be calculated.

Experimental results indicate high precision for the proposed method but possibly low recall. Regrettably no discussion about the statistical significance of the ratings acquired was present.

3.2.2 Ambiguous tweet about television shows

In a series of papers [35][13][14], a group of researchers from AT&T Labs and Lehigh University, including Bernard Renger, Junlan Feng and Ovidiu Dan, present a method for classifying tweets and an application of their method, Voice enabled social TV. Their approach achieves the best performance in terms of F-measure on a labeled test set that I have seen in the literature. But since there are no standardized test sets, caution should be taken before accepting this approach as the unequivocally best one. The basic scheme of classification considered is shown in figure 3.1.

The key concept of their work is the use of a two-stage classifier approach. First a classifier is trained using a set of manually labeled data; secondly, another classifier is trained on the small data set but with features derived from the large data set labeled by the first classifier, which is also used to extract additional information such as new search terms for the twitter streaming API.


Figure 3.1: Approach to classifying tweets, here for television shows, but the same approach applies for other proper nouns.

The second classifier is used to make a decision about a tweet m and a show i having the title s_i:

    f(i, m) = \begin{cases} 1 & \text{if } m \text{ is a reference to show } s_i \\ 0 & \text{otherwise} \end{cases}

The first classifier is a binary classifier that also models the function above; it does, however, use fewer features than the second classifier, as seen below. The training and testing data is generated using twitter's streaming API, where one searches for keywords and gets a statistically significant sample back. The search terms used include not only the name of the show but also alternative titles found at IMDB.com and TV.com. A set of labeled data was manually created for eight shows and this data set is then used for training and validation of the first classifier. The features used for the first classifier differ from the previous approach listed since they are not directly related to textual similarity where one uses a bag of words model. Instead a combination of features is used:

• Manually selected terms and language patterns of interest.

– Television terms such as watching, episode.
– Network terms such as cnn, bbc.
– Regular expressions capturing episode and season information, e.g. S[0-9]+E[0-9]+.

• Automatically captured language patterns.

– From a large data set (10 million tweets), replace titles and hashtags with placeholders and extract sequences of three words that include the placeholder. These sequences then become rules that, if seen in unlabeled messages, indicate that the message is TV related.


– Use s_i to check for the presence of the uppercase title string.
– Check if there is more than one title that is not an ambiguous word (according to WordNet).

• Textual comparison with external sources using cosine similarity measure and the bag of words model.

– Characters of the show
– Actors of the show
– Words from the Wikipedia page

Most features are treated as binary values, 1 if a positive match was found and 0 otherwise; the rest are scaled to the unit interval. After training on a few thousand Twitter messages, the first classifier is used to classify the large unlabeled data set, which yields labels for each message. This data is then used as training data, together with the original data set, for the second classifier, which uses all of the features of the first classifier and three additional feature types. Interestingly, new, more refined rules are captured from this newly labeled data set, as well as new search terms. Using the features listed above, different classifier models were tested for the two classifiers; support vector machines and rotation forests [36] were deemed the best. An F-measure of 85.6% was the best result achieved by the latter classifier in 10-fold cross validation of the initial labeled data set. To summarize: several interesting features are combined with the textual similarity measures often used in information retrieval; the two-classifier approach slightly increases the F-measure and also generalizes quite well to unseen shows. Parts of this approach can certainly be applied to classifying new tweets that are retrieved using query expansion, and in my method I use a similar approach with slightly different features, see chapter 4.

3.2.3 Other topics than television

Other authors tackle the related and very similar problem of identifying which tweets are about a certain company and which are not. Company names can be ambiguous in much the same way as television shows and programs. As stated in the National University of Distributed Learning's WePS-3 workshop task definition: “Nowadays, the ambiguity of names is an important bottleneck for these experts” [1], referring to experts in on-line reputation management. The task outlined in the workshop included data to be analyzed. As a submission to the WePS-3 workshop, Yerva et al. devised a classification method for the problem using support vector machines (SVM) [41]. Here we also see the use of external data as a basis for comparison with tweets. For each company, a set of profiles is created, each a bag of pre-processed unigrams and trigrams. The different profiles capture e.g. the company website, Wordnet disambiguation and

manually recorded related words. The features used were co-occurrence counts of words in the tweet with the different profiles. Experimental results were positive and indicate the need for high quality external information. One idea to improve recall in classification is to cluster messages; however, this method typically suffers from poor results if applied to plain term occurrence vectors. Perez et al. find terms from the corpus of Twitter messages they are working on to help the clustering [33]. They call their method the self-term expansion methodology and achieve improvements in recall and precision by finding a set of additional terms for ambiguous company names. Words that co-occur with company names in tweets labeled true in the training set are added to each tweet containing the company name in the test set. Unfortunately the paper is very vague in its method description, and using unsupervised clustering with k-means with k = 2 does not seem like a promising idea for classification; however, the method could be used as a query expansion technique.

3.3 Tweet collection methodology

In chapters 1 and 2 I argued that the common approach of accessing Twitter data for various research projects is lacking in reach, or recall, of the data that is considered for sampling. To see this we can consider the methodology used by some of the related work presented in this chapter. Regarding ad-hoc information retrieval of tweets, such as [37], [16]; [10], which use the TREC microblog data set4; and [26], which employed query expansion, it is not clear whether we can compare effectiveness of tweet collection. In relation to market research, it is an open question whether results achieved on a small data set, sampled for a shorter period of time and annotated with a modest number of query–relevance judgment pairs, are applicable to the problem of obtaining as many related tweets as possible. We are most interested in evaluations done with the constraints of up-to-date, inclusive tweet collection in place. Nevertheless, many of the techniques used are certainly interesting. In [28], Mitchell et al. evaluate a system they have set up for on-line television in which social media is integrated. Twitter is used to present tweets about the currently viewed program. Here the Twitter API is used and a simple search for the program's title is employed to retrieve relevant messages. Their work represents the basic use of Twitter for retrieving TV related tweets; unfortunately, recall and precision are not evaluated. The work done on classifying TV-program related tweets [39], [35] and works about classifying other ambiguous topics such as [40] use test sets collected using simple rules, such as using the title of the topic, or manually selected keywords. A limited form of query expansion is used in [9] to generate the data set: all hashtags found in the data set retrieved by searching for “#worldcup” are recursively used to

4https://sites.google.com/site/microblogtrack/

search for new tweets. In [30] the streaming API is used and messages are classified in a streaming fashion; however, the search terms used are manually selected. Wakamiya et al., who employ an alternative method for estimating the number of viewers by counting certain tweets [39], do not use titles of TV programs directly. Instead, a large data set collected from the Twitter API during one month was used, where all available geotagged5 data of Japanese origin was filtered for manually selected Japanese keywords equivalent to words such as TV and watching. Experimental results indicate high precision for the proposed method but possibly low recall. Regrettably, no discussion about the statistical significance of the ratings acquired was present. Dan et al. [35] [13] [14] use an approach that achieves an F-measure of 89%. However, their results are only valid as a measure of an overall system if all the relevant tweets can be found using the title of the show as a search term.

3.4 Summary

Ad-hoc search of Twitter messages typically uses text indexing, either the common tf∗idf scheme or a language model approach. Even though there are many differences between ad-hoc search of microblogs and web documents [37], techniques learned from established IR methods can certainly be applied. Text classification typically represents texts as tf vectors and either does supervised training directly on the sparse vectors or extracts features to use for training. When dealing with short documents such as tweets, external sources are often used, and the best results [14] come from including hand-crafted features and mining very simple rules from a bootstrapped sample. Unsupervised approaches are less successful when dealing with short text messages, as described in [33], and a clustering can never be maintained for the incoming stream of data in a live application.

Table 3.1: Methods that I use from related works in my combined approach of query expansion and classification.

Method | Reference
Use hashtags as expansion terms | [15]
Look up URL contents in tweets | [2]
Use Wikipedia as a way to compare tweets | [20], [13]
Use EPGs as a way to compare tweets | [39]
Co-occurrence with name to get additional terms | [33]
Multiclass supervised classification of tweets using external sources | [13], [41]

Since we are treading in somewhat unknown territory, the streaming retrieval of television related tweets, I have perhaps included some works that can be considered

5 Some users enable geotagging so that the coordinates of the user at the time of posting are publicly available

of peripheral importance. However, I will explore some concepts from each of these works in my own system, aiming to exploit the strengths of their approaches and avoid the weak points. In table 3.1 I have listed the different ideas from related work that I have used in my proposed method, see chapter 4.


Chapter 4

Automatic query expansion and classification using auxiliary data

This chapter contains a conceptual overview of the methods used. I start by elaborating on the problem statement in chapter 1 and then continue to describe the two parts of my approach: AQE (automatic query expansion) and supervised classification. Some initial experiments are described since they guided my development of various features of the method that was later used for larger experiments.

4.1 Problem description and design goals

To clarify the intuition behind which methods are used, I repeat and elaborate on the research problem definition from chapter 1:

Get as many tweets as possible about a specified product.

With the following constraints:

Tweet availability The twitter API determines how tweets can be accessed.

Precision We do not want false positives.

Scalability The resources required in terms of CPU time, memory consumption and disk space should increase sub-linearly with the number of tweets we collect, disregarding the storage of the collected tweets themselves.

Recency Results need to be made available in a timely fashion. We need to access the tweets about a product as soon as possible, ideally in real time.

Resolution We need to be able to tell when a specific tweet was created.

We can also formulate the problem in terms of a search problem:

Retrieve tweets about products by searching (Boolean match) for keywords about these products.

The problem here is that we need to know the keywords to use (or suffer from low recall). Often these keywords are chosen by an analyst, as seen in chapter 3.

We can prove by example that this method does not retrieve all related tweets. Furthermore, many keywords that are good descriptors for products are also good descriptors for other classes of tweets: there is a problem of ambiguity. Another interpretation is in terms of a supervised classification problem:

Classify all tweets as related to the product or not.

Superficially this looks like an attractive approach; we have now gained a good method to weed out ambiguous tweets. But there are several problems:

Multiple-class problem If we need to track more than one product it becomes a multiple class problem. To solve this we need to have access to labeled data for each product.

Unbalanced classes The ratio of tweets that are about a specific product is vanishingly small; let us assume this ratio is 0.1% and that the false positive rate is 0.1%. With perfect recall the results will be almost tied between 50.02% true positives and 49.97% false positives (the arithmetic is spelled out below). This classifier performance is of course completely unrealistic, which makes the real problem even harder.
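A short sanity check of this arithmetic, under the stated assumptions of a 0.1% prevalence and a 0.1% false positive rate:

\frac{tp}{tp + fp} = \frac{0.001}{0.001 + 0.999 \cdot 0.001} \approx 0.5002, \qquad \frac{fp}{tp + fp} = \frac{0.999 \cdot 0.001}{0.001 + 0.999 \cdot 0.001} \approx 0.4997

In other words, even a classifier with 99.9% specificity barely beats a coin flip in terms of the precision of the collected set.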

The class distribution problem is not insurmountable: we must give up recall to achieve reasonable precision. This is directly opposed to our goal but is unavoidable; we must instead focus on giving up as little recall as possible. Compared to treating the problem as a search problem, we are still ahead in achieving our goal. There is an important caveat here: if we have high recall of the related class and consistent false positive rates we can still track the change in the number of tweets. The worst problem is acquiring training data; it is not financially viable to manually annotate data for a given product unless that product is very important. When tracking television programs this manual labor is staggering: there are hundreds of large shows and programs, of very different genres, in the US alone.

4.2 New search terms from query expansion

The main idea of my approach to the problem described in section 4.1 is to combine the strengths of both search and classification by removing the analyst that selects search terms and replacing them with an automatic method. As I will show in the following sections, the end goal of query expansion in this thesis is to find a disjunction of new search terms that describe the product we are searching for. If we retrieve all tweets that match this logical expression we can hopefully find more relevant tweets. In other words, these search terms can be used to find additional messages. As an example, consider a search for “hello”:


Original search expression: “hello”
Expansion search expression: “hello” ∨ “hi” ∨ “greetings” ∨ ...

Here we assume that some query expansion method generates additional terms and that we can retrieve the tweets that match one or more of these terms. As described in section 4.2.1, I will refine this approach somewhat and search not only for a disjunction of single search terms but for a disjunction of pairs of search terms, where both terms in a pair must be present in the retrieved messages (logical AND). If we revisit the example above it could look something like this:

Original search expression: “hello”
Expansion search expression: “hello” ∨ (“hi” ∧ “greetings”) ∨ ...

Since I want the Tweet Collect system to only need a list of product names to work, the original search expression will be a conjunction of the words that make up the product name. For the television show “How I met your mother” the original search expression will be: “how” ∧ “I” ∧ “met” ∧ “your” ∧ “mother”. Taking inspiration from the automatic query expansion techniques listed in chapter 2, we can treat a set of tweets R that contain the exact title of a product as pseudo relevant tweets; they are generated by an original query. Given a larger population of tweets C, where R ⊂ C, we can calculate many different statistics about the terms present in R. From these statistics we can generate well chosen additional search terms. This method is very simple and highly effective, given that our initial assumptions hold: the tweets in R are actually relevant and we can approximate the true distributions by the statistics in our corpus. I have performed some small scale tests with different query expansion methods that use different ranking criteria for new search terms and found that χ², equation 2.7, performed the best. Furthermore, it is not very important which method is used:

“However, several experiments suggest that the choice of the ranking function does not have a great impact on the overall system performance as long as it is used just to determine a set of terms to be used in the expanded query [Salton and Buckley 1990; Harman 1992; Carpineto et al. 2001]”[12]

I also did some small scale tests where I tried the query expansion method from [26] directly, but this resulted in very poor expansion terms; the terms could be dismissed upon inspection and by test searches on Twitter.
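To make the term-selection step concrete, here is a minimal sketch of χ²-style scoring over the pseudo relevant set. Since equation 2.7 is not repeated here, the formula below (a common χ² variant used for query expansion) is an assumption, and the term-count maps stand in for statistics that the real system reads from a Terrier index.

import java.util.*;

/**
 * Minimal sketch of chi-square style term scoring over a pseudo-relevant set R
 * against the whole collection C. The scoring formula is assumed, not quoted
 * from equation 2.7.
 */
public class ChiSquareScorer {

    /** Relative frequency of a term in a bag of term counts. */
    private static double relFreq(Map<String, Integer> counts, String term, long total) {
        return counts.getOrDefault(term, 0) / (double) total;
    }

    /** Score every term seen in the pseudo-relevant set R. */
    public static List<Map.Entry<String, Double>> score(Map<String, Integer> tfR, long totalR,
                                                        Map<String, Integer> tfC, long totalC) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (String term : tfR.keySet()) {
            double pR = relFreq(tfR, term, totalR);
            double pC = relFreq(tfC, term, totalC);
            if (pR > pC && pC > 0) {                       // only terms over-represented in R, as in algorithm 1
                double chi2 = (pR - pC) * (pR - pC) / pC;  // assumed form of equation 2.7
                scored.add(new AbstractMap.SimpleEntry<>(term, chi2));
            }
        }
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return scored;                                     // take the top K entries as expansion terms
    }
}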

4.2.1 Co-occurrence heuristic

Instead of producing single expansion terms I produce conjunctions of two terms as follows: given a list of k terms, check the pairwise co-occurrence of these

terms in virtual documents consisting of V tweets: a tweet, the ⌊V/2⌋ tweets collected just before it, and the ⌊V/2⌋ tweets collected just after the tweet containing the first term of the conjunction pair. Rank the pairs according to their modified Dice coefficient:

\tilde{D} = \frac{2 \cdot \widetilde{df}_{u \wedge v}}{df_u + df_v} \qquad (4.1)

where \widetilde{df} denotes the document frequency over the virtual documents in the pseudo relevant set and df the document frequency in the collection as a whole.
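As a concrete illustration of equation 4.1, a minimal sketch in Java; the document-frequency counts are assumed to have been gathered already and the names are hypothetical.

/**
 * Minimal sketch of scoring a term pair with the modified Dice coefficient (equation 4.1).
 * dfU and dfV are collection-wide document frequencies of the two terms; dfPairVirtual is
 * how many virtual documents in the pseudo-relevant set contain both terms.
 */
public final class PairRanker {

    public static double modifiedDice(int dfPairVirtual, int dfU, int dfV) {
        return (2.0 * dfPairVirtual) / (dfU + dfV);
    }

    public static void main(String[] args) {
        // Toy numbers: the pair co-occurs in 120 virtual documents, and the two terms
        // appear in 900 and 1100 documents of the whole collection respectively.
        System.out.println(modifiedDice(120, 900, 1100)); // 0.12
    }
}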

4.2.2 Hashtag heuristic

Given a list of k terms, all terms that are hashtags or mentions, i.e. start with # or @ respectively, are considered related if the term without its initial symbol is not found in a standard English dictionary.
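A minimal sketch of this heuristic, assuming a plain word set stands in for the English dictionary used in the real system:

import java.util.Set;

/** Minimal sketch of the hashtag/mention heuristic: keep #foo or @foo as a search term
 *  only if the bare token is not an ordinary dictionary word. */
public final class HashtagHeuristic {

    private final Set<String> dictionary; // assumed to hold lower-cased English words

    public HashtagHeuristic(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    public boolean isRelatedTerm(String term) {
        if (!term.startsWith("#") && !term.startsWith("@")) {
            return false;                       // only hashtags and mentions qualify
        }
        String bare = term.substring(1).toLowerCase();
        return !dictionary.contains(bare);      // e.g. "#himym" passes, "#tv" does not
    }
}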


Algorithm 1 Algorithm Top(K, R), produces an array of single search terms.
1: R is an array of relevant tweets tw_l, 1 ≤ l ≤ N.
2: for all terms t ∈ ∪_i tw_i do
3:   if p_R(t) > p_C(t) then
4:     Use equation 2.7 to calculate score(t) and add ⟨t, score(t)⟩ to list l
5:   end if
6: end for
7: Sort l in order of score(t).
8: Let top[K] be an array of terms t_i.
9: top ← the K terms with the largest score(t).
10: return top

Algorithm 2 Algorithm Pairs, produces the pairs of search terms used.
1: Let R be an array of relevant tweets tw_l, 1 ≤ l ≤ N.
2: top ← Top(K, R)
3: Let pairs[K · (K − 1)/2] be an array of ⟨String, String, Integer⟩.
4: for all terms t_i in top do
5:   T_u ← {tweets tw | t_i ∈ tw}
6:   for all terms t_j ∈ top | j > i do
7:     T_v ← {tweets tw | t_j ∈ tw}
8:     for all tw_l ∈ T_v do
9:       vd ← tw_{l−2} @ tw_{l−1} @ ... @ tw_{l+2}
10:      if t_i ∈ vd then
11:        ⟨t_i, t_j, count⟩ ← pairs[index(i, j)]
12:        pairs[index(i, j)] ← ⟨t_i, t_j, count + 1⟩
13:      end if
14:    end for
15:  end for
16: end for

4.2.3 Algorithms

Algorithm 2 describes how to get pairs of terms and their counts. Note that the nested loops on lines 6-14 correspond to a join between the tweets that contain t_i and the virtual documents, formed around the tweets containing t_j, that also contain t_i. This can be implemented as a hash-join of search results, see listing 5.1. The final step of sorting the term pairs according to their modified Dice coefficient using equation 4.1 is omitted. The function index(i, j) returns the correct index at which to store the term pair in pairs.
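The hash-join itself is conceptually simple. A minimal sketch, independent of Terrier and with hypothetical names, that counts co-occurrences between two posting lists of document ids, exploiting the fact that document ids are assigned in indexing order:

import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of a hash-join between two posting lists of document ids, counting how
 *  often documents containing term u fall inside the virtual documents (a small window of
 *  consecutive ids) built around documents containing term v. */
public final class PostingHashJoin {

    /** Count co-occurrences of terms u and v using a window of +-halfWindow document ids. */
    public static int countCoOccurrences(int[] docidsU, int[] docidsV, int halfWindow) {
        // Build phase: hash every document id that contains term u.
        Set<Integer> containsU = new HashSet<>(docidsU.length);
        for (int d : docidsU) {
            containsU.add(d);
        }
        // Probe phase: for each document containing term v, probe the window around it.
        int count = 0;
        for (int d : docidsV) {
            for (int w = d - halfWindow; w <= d + halfWindow; w++) {
                if (containsU.contains(w)) {
                    count++;
                    break; // one hit per virtual document is enough
                }
            }
        }
        return count;
    }
}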


Top 20 expansion terms for the show “How I met your mother” met,mother,@realwizkhalifa,@himymcbs,barney,movie,ted,thats,assisti, #orangotag,#himym,tv,wiz,netflix,show,@fancite,fancite,lily,watch,robin

The final keywords: @realwizkhalifa, @himymcbs, #orangotag, #himym, @fancite, @fancite fancite, barney #himym, mosby ted, barney ted, barney fancite, barney @fancite, fancite ted, @fancite ted, rt @realwizkhalifa ted

Table 4.1: Expansion terms for the show “How I met your mother” using equation 2.7 and the resulting search terms produced by the hashtag, mention and co-occurrence heuristics. Note that a space means conjunction and a comma means disjunction. This used data where tweets mentioning other shows have been removed.

4.2.4 Auxiliary Data and Pre-processing

To get accurate statistics about the term distributions of tweets about certain products, I use auxiliary data about television programs collected by KDDI R&D over several months. The data consists of tweets containing titles of television programs: one set R_j for each television program. This collection is carried out using the REST GET method described in chapter 2. After an initial test of my approach I discovered that data quality is lacking in many tweets in the sets R_j. To mitigate this problem I included extensive pre-processing methods to obtain good data to work with for algorithms 1 and 2; a sketch of these filtering steps follows the list below.

• Because it is so common to mention several shows in one tweet, I chose to define such tweets as non-related and thus filtered out each tweet containing, besides its own title, any other show title longer than one word.

• As mentioned in chapter 2, re-tweets are a way in which users share tweets that they think are important. From my own collecting I have discovered that about 25% of tweets are re-tweets. These contain almost no new information to increase recall, so all messages containing the sub-string “RT” or “rt” as a separate word are filtered out.

• Many tweets are either automatically generated or re-posted so that an identical duplicate is found; I use a hash table to remove these.

• All tweets are filtered for language using a Bayesian classifier; non-English tweets are removed.
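A minimal sketch of these filtering steps, with the language detector and the set of other show titles left as stand-ins for the real components:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Minimal sketch of the pre-processing filters applied to the auxiliary tweet sets R_j.
 *  LanguageDetector is a stand-in for the language-detection library used in the real system. */
public final class TweetPreprocessor {

    public interface LanguageDetector { String detect(String text); }

    private final Set<String> otherShowTitles;   // multi-word titles of all other tracked shows
    private final LanguageDetector detector;
    private final Set<Integer> seenHashes = new HashSet<>();

    public TweetPreprocessor(Set<String> otherShowTitles, LanguageDetector detector) {
        this.otherShowTitles = otherShowTitles;
        this.detector = detector;
    }

    public boolean keep(String tweet) {
        String lower = tweet.toLowerCase();
        // 1. Drop tweets that also mention another (multi-word) show title.
        for (String title : otherShowTitles) {
            if (lower.contains(title.toLowerCase())) return false;
        }
        // 2. Drop re-tweets: "RT"/"rt" as a separate word.
        if (lower.matches(".*\\brt\\b.*")) return false;
        // 3. Drop exact duplicates via a hash table.
        if (!seenHashes.add(lower.hashCode())) return false;
        // 4. Drop tweets not detected as English.
        return "en".equals(detector.detect(tweet));
    }

    public List<String> filter(List<String> tweets) {
        return tweets.stream().filter(this::keep).collect(Collectors.toList());
    }
}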

4.2.5 Twitter data quality issues

As a preliminary experiment I took up to 20 thousand tweets for each of the approximately 1000 shows that are tracked as my different pseudo relevant sets R_j.


Performing the pre-processing steps above and applying the proposed method yields the search terms shown in table 4.1. Using these terms as search terms for the Twitter streaming API, tweets were collected for four hours or until 1,000 tweets had been collected. These initial results are not very impressive, other than that they do contain some related tweets that do not contain the title. If we take a look at table 4.1 we can see that there are many terms that seem completely irrelevant. This made me investigate the phenomenon further by looking at the expansion terms of three TV shows: “The Vampire Diaries”, “How I Met Your Mother” and “The Secret Circle”.

Table 4.2: Search terms generated for the television shows The Vampire Diaries and The Secret Circle using a moderately sized data set.

Show | New terms
The Vampire Diaries | vampire, diaries, #peopleschoice, @peopleschoice, voted, vote, retweet, #networktvdrama, #scififantasyshow, ordinary, #thevampirediaries, just, homecoming, @iansomerhalder, damon, #tvd, s3ep8, people, #orangotag
The Secret Circle | circle, secret, #thesecretcircle, balcoin, #newtvdrama, #peopleschoice, @peopleschoice, voted, retweet, vote, assisti, #orangotag, s01e09, vou, s1ep8, assistir, cassie, 1x09, @chriszylka, ver

The terms that I consider good from tables 4.1 and 4.2 are: #thevampirediaries, @iansomerhalder, damon, @himymcbs, barney, ted, #himym, robin, lily, #thesecretcircle, cassie, @chriszylka, balcoin. They include relevant hashtags, actor names, character names and related concepts. The rest of the terms are most likely not relevant. The analytics company Pear Analytics was cited in [27] as determining that, out of a sample of two thousand Twitter messages, 40% were “pointless babble”. Some of the unattractive keywords found correspond to tweets in the following categories:

Non-English ver, vou and assistir are non-English words from tweets that slipped past the language detector. It is a very hard problem to detect language in just 140 characters; one improvement is to exclude the original search terms from the analysis and to make sure that the search terms are in English to begin with.

Chance Some terms just happen to be much more frequent in some pseudo relevant sets. By increasing the corpus size one can hope to obtain a better sample of the distribution of terms, i.e. improve the df estimates.


Automatically generated tweets Many websites produce tweets on the user's behalf; these messages are often a standard message where the user's name or a URL to a user page is the only distinguishing characteristic. Since these messages are also very frequent, they shift the language model away from human writing patterns.

Compound hashtags By examining the data I have discovered that it is very common to use hashtags composed of several standard words concatenated together, such as #networktvdrama seen in table 4.2. These seem to be used both for describing topics and sometimes for emphasis.

The most effective way to lessen these problems is to increase the corpus size. This will limit the effect of many of the problems listed above. Furthermore, one can devise many strategies to deal with different undesirable tweets. One could for instance use only tweets by verified users1, that is, users whose identity the Twitter company has verified. This is very problematic, however, since verification is a privilege granted only to users with a high public profile, and these users represent only a minute proportion of Twitter users. Other strategies to consider are to avoid comparing things that are used to distinguish automatically generated tweets, user names and URLs for example. Many spam tweets are very similar to other spam tweets; in fact, hashing tweets removes about 20.4% of collected tweets, as we can see in chapter 6 (due to spam but also other reasons). A very high word concurrence rate could perhaps also be used to remove spam tweets, but this is a far more expensive operation; in the worst case we have to compare each new tweet with all old ones, resulting in |C|(|C|+1)/2 = O(|C|²) comparisons. Perhaps one could devise a special hash function for this problem, but this is not explored in this thesis. A developer can, if they choose to do so, indicate a source value for each tweet, such as web (default) or my_iphone_app.com. But this is not a good approach for removing automatically generated tweet content, since developers do not reliably include a distinguishing source in all cases, and many users write tweets on a device which generates a source tag other than the default, such as iphone. Compound hashtags are a phenomenon that I explored when building a classifier and are treated in section 4.3.2.

4.2.6 Collection of new tweets for evaluation

Since there is no annotated corpus of tweets for this task to use for empirical evaluation, I decided that the best approach is to evaluate the method as it would be used in a deployed system and to evaluate the results manually. For each television show that is tracked, a sample of tweets will be annotated with the labels related and unrelated. By means of the streaming API, tweets are collected for a fixed time using the additional keywords found and the actual title. Firstly, the set of tweets containing

1http://support.twitter.com/articles/119135-faqs-about-verified-accounts

the exact title is determined. These tweets represent the baseline and will not be considered for manual labeling. Secondly, the remaining tweets are filtered for English. By comparing the cardinalities of the two sets we get the ratio of tweets collected by my method to the baseline, an indication of the increase in recall. To determine the precision, the tweets that are in English and do not contain the exact title are sampled uniformly without replacement and manually inspected to see if they are related or not. Results of the experiments are shown in chapter 6.

4.3 A classifier to improve precision

Initial experiments were partly successful and I decided to create a data visualization tool to inspect how many tweets were collected for each of the generated search terms or search term pairs. I did further experiments where data was collected for 24 hours for the following television programs:

• Make it or break it

• Buffy the vampire slayer

• Saturday night live

Figure 4.1: Visualization of the fraction of tweets by keyword for the show Saturday night live; here different celebrities that have been on the show dominate the resulting Twitter feed.

The results for Saturday night live are shown in figure 4.1. These results are discouraging: firstly, the increase in recall does not seem to be that great, but more importantly, the resulting tweets are dominated by those found by searching for different celebrities. This illustrates the need for maintaining precision with a classifier that is able to prune these large numbers of only partially related tweets.


4.3.1 Unsupervised system

One can deploy a supervised algorithm relying on annotated training data if one finds a way to gather data considered relevant and non-relevant. In this thesis I will use the sets R_j yet again as training data. This means that the data used for training is different in nature from the data we wish to classify, and I will use different techniques to bridge this gap.

4.3.2 Data extraction

I want to include the information contained in external resources referenced by the tweets, as well as in the hashtags and user mentions.

Hashtags I split hashtags into their constituent parts with an algorithm that checks all sub-strings, either “i” or of length greater than two, that make up the hashtag. If a sub-string is not found in a dictionary then it is not considered. Then the sub-strings which, when concatenated, make up the original string are selected according to the product of their frequencies in a corpus. See appendix A (a small sketch follows at the end of this subsection).

Mentions I search for the user on Twitter and use the description and name if available. These are included as additional text in the tweet.

URLs I scrape the website and use text extracted from it. This text, without markup, replaces the URL in the tweet, making the tweet a longer document.

Initially I wanted to resolve abbreviations but I found no reliable way of doing this. There are publicly available abbreviation databases, such as the STANDS4 API2, but the problem is that there are many different possibilities for abbreviations and it is a very hard problem to determine what is an abbreviation in the first place.
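A minimal sketch of the hashtag-splitting idea, assuming a word-frequency map stands in for the corpus statistics; the actual algorithm used is given in appendix A.

import java.util.*;

/** Minimal sketch of splitting a compound hashtag (e.g. "networktvdrama") into dictionary
 *  words, scoring a candidate split by the product of the words' corpus frequencies. */
public final class HashtagSplitter {

    private final Map<String, Double> wordFreq; // relative word frequencies from some corpus

    public HashtagSplitter(Map<String, Double> wordFreq) {
        this.wordFreq = wordFreq;
    }

    /** Returns the best-scoring split, or an empty list if no split covers the whole tag. */
    public List<String> split(String tag) {
        List<String> result = best(tag.toLowerCase(), new HashMap<>());
        return result == null ? Collections.emptyList() : result;
    }

    private List<String> best(String rest, Map<String, List<String>> memo) {
        if (rest.isEmpty()) return new ArrayList<>();
        if (memo.containsKey(rest)) return memo.get(rest);
        List<String> bestSplit = null;
        double bestScore = -1.0;
        for (int end = 1; end <= rest.length(); end++) {
            String prefix = rest.substring(0, end);
            // Only "i" or dictionary words longer than two characters are considered.
            boolean ok = prefix.equals("i") || (prefix.length() > 2 && wordFreq.containsKey(prefix));
            if (!ok) continue;
            List<String> tail = best(rest.substring(end), memo);
            if (tail == null) continue;
            List<String> candidate = new ArrayList<>();
            candidate.add(prefix);
            candidate.addAll(tail);
            double score = score(candidate);
            if (score > bestScore) { bestScore = score; bestSplit = candidate; }
        }
        memo.put(rest, bestSplit);
        return bestSplit;
    }

    private double score(List<String> words) {
        double product = 1.0;
        for (String w : words) product *= wordFreq.getOrDefault(w, 1e-9);
        return product;
    }
}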

4.3.3 Web scraping

A lot of the content on web pages is not relevant to the main focus of the page. This content could for instance be advertisements or a side menu that offers navigation of the web site, and so on. If this non-relevant text were included, either as an external source or as additional tweet text found by looking up URLs in tweets, the proposed method would likely be much less effective. Therefore we have chosen to use the Boilerpipe supervised learning method, which has high accuracy when determining the informative text sections of web sites [24].

4.3.4 Classification of sparse vectors

To increase precision I have used different supervised classification schemes to classify tweets retrieved using the additional keywords. When classifying sparse data directly, such as text data, we need to use classifier models that are adapted to

2http://www.abbreviations.com/api.php

thousands of dimensions. One popular such model is a linear SVM (support vector machine) [32]. A linear SVM has a non-linear, convex objective function with linear constraints, which is optimized to find a maximally separating hyperplane between two sets of points in Euclidean space, related points and unrelated points. If such a hyperplane exists then the problem is convex and a global optimum can be found using gradient based methods [32]. If points from the two sets overlap, so that it is impossible to fit a hyperplane between them, then slack variables are introduced for mis-labeled training examples. The dual of the problem is still convex, however, so a global optimum can still be found. The SVM problem is a quadratic program, since the Euclidean distance is used, and there exist many efficient algorithms for finding the optimizer. I tested using a linear SVM for my classification problem and, not surprisingly, it performs very well when cross-validating a training set generated by searching for the titles of shows. However, the model performs poorly on tweets that come from a search for expansion terms; see appendix B for more details. This behavior can be explained by the fact that Twitter messages are very short: there are simply not many words in common between the very sparse vectors.

4.3.5 Features

Since working directly with the sparse tweet vectors is not fruitful, I take my inspiration from related works on tweet classification [13][40] and compare external sources with tweets. The supervised classifier, f, can be seen as a function of two input arguments, a tweet and a show title. If we use K different external sources:

c : \mathbb{R}^K \rightarrow \{\text{true}, \text{false}\}

f(tweet, title) = c(g(pp(tweet), ws(title)))

Here c denotes a supervised, binary, vector based classifier, pp the pre-processing operations listed in section 4.3.2, ws the web scraping of external resources as described in section 4.3.3, and g the cosine distance of tf∗idf vectors. The features used by c correspond to the different external sources processed by ws. Each source corresponds to one feature in the feature vector that represents the tweet during classification. The feature value is calculated as the cosine distance between the tf·idf vectors of the tweet and the text source.
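A minimal sketch of how one such feature value can be computed, assuming the tf·idf weights have already been looked up (in the real system they come from a Terrier index) into sparse term-to-weight maps:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Minimal sketch of one feature value: the cosine similarity between the tf*idf vector
 *  of a tweet and the tf*idf vector of one external text source. */
public final class CosineFeature {

    public static double cosine(Map<String, Double> tweetVec, Map<String, Double> sourceVec) {
        double dot = 0.0;
        Set<String> shared = new HashSet<>(tweetVec.keySet());
        shared.retainAll(sourceVec.keySet());
        for (String term : shared) {
            dot += tweetVec.get(term) * sourceVec.get(term);
        }
        double normA = norm(tweetVec), normB = norm(sourceVec);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> vec) {
        double sum = 0.0;
        for (double w : vec.values()) sum += w * w;
        return Math.sqrt(sum);
    }
}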

4.3.6 Classification

Once I have a non-sparse vector for each tweet that is reasonably good at discriminating between the two classes, my choice of classifier models is much greater than when dealing with sparse vectors. This is because many classifier algorithms are not


adapted to this problem in terms of run time and the resulting complexity. By empirical evaluation, the C4.5 decision tree classifier turned out to give the best results and to have a reasonable run time. Since the training data is of a different nature than the data that we intend to classify, I decided to add some noise to the training data by changing the class labels of 5% of the training examples. This turned out to give slightly improved results.

Figure 4.2: Conceptual view of collection and classification of new tweets.
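The training step just described can be sketched with Weka roughly as follows; this is a minimal sketch rather than the actual Tweet Collect code, and the ARFF file name is hypothetical.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Minimal sketch: train a C4.5/J48 tree on the pseudo-relevant training data
 *  after flipping roughly 5% of the class labels as noise. */
public class NoisyJ48Training {

    public static void main(String[] args) throws Exception {
        // "training.arff" is a hypothetical file holding the cosine-distance feature vectors.
        Instances train = DataSource.read("training.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Flip the class label of roughly 5% of the training examples.
        Random rnd = new Random(1);
        for (int i = 0; i < train.numInstances(); i++) {
            if (rnd.nextDouble() < 0.05) {
                double current = train.instance(i).classValue();
                train.instance(i).setClassValue(current == 0.0 ? 1.0 : 0.0);
            }
        }

        J48 tree = new J48();              // Weka's C4.5 implementation
        tree.buildClassifier(train);

        // 10-fold cross validation on the (noisy) training data, as a rough sanity check.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new J48(), train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}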

4.4 Combined approach

Figure 4.2 shows a conceptual view of a system that uses a combined approach of AQE to find new terms and supervised classification to improve precision.

Chapter 5

Tweet Collect: Java implementation using No-SQL database

"If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton Letter from Isaac Newton to Robert Hooke, 5 February 1676 [29, p. 416]

In this chapter I describe a component view of the implemented system and the many external libraries on which it depends. For each component a short description of the functionality is provided and issues regarding it are addressed. The Java implementation of algorithm 2 is listed and analyzed.

5.1 System overview

The algorithms described in chapter 4 are not very complicated for smaller data sets; the main problem any implementation needs to solve is how to cope with large amounts of data, so that the statistics gathered become accurate. In section 5.2.2 I describe a scalable implementation using various data structures for efficiency. My goal was to build a prototype system, and thus minimizing development time was crucial. By dividing the system into two main components, query expansion and classification, I could focus on one part at a time during implementation. Future versions of the software could perhaps move towards a more unified architecture. Since development time was crucial, but also the ability to process large amounts of data, Java was a natural choice, offering a good trade-off between performance and safety. A JVM acts as the necessary glue for many different third party libraries and gives a developer a lot of safety because of garbage collection, array bounds checking and exception handling. Some components of the system are not so tightly integrated, however, and instead communicate using files and are run by bash scripts. Figure 5.1 describes the two main components and their most important sub-components. The external systems, the Twitter API and the Internet, are not accessed directly but instead through robust Java libraries.


Figure 5.1: Conceptual view of collection and classification of new tweets.

The typical usage of the system is first computation of expansion terms as a batch job, then data collection from Twitter as an ongoing process. Classification is not done as tweets arrive in this prototype version, although that is the intended use case; for now I only wanted to evaluate the feasibility of classification, so it is done as a batch job. Figure 5.2 describes how the components are used at this stage of prototype development and evaluation.

5.2 Components

Many of the components of the system are built as wrapper classes to encapsulate the needed functionality of external libraries. These external libraries, or dependencies, do most of the work, with the exception of two components that were custom built:

Get pairs of terms The implementation of algorithm 2 Pairs().

Pre-processing String processing and web scraping to transform tweets and webpages to a similar language.

The dependencies used are listed in table 5.1.


Figure 5.2: How the different components are used to evaluate system performance. This does not represent the intended use case where collection, pre-processing and classification is an ongoing process.

5.2.1 Statistics database

This is the most important component for using algorithms 1 and 2 from chapter 4 efficiently. What is required is very fast look-up of statistics for terms in different subsets of the total collection, as well as inspection of the terms in an individual document. Furthermore, the system should be easily configurable and extensible so that one can e.g. decide which terms to index, how to treat different characters, and customize query expansion. Almost all RDBMSs support full-text indexes, i.e. inverted indexes as described in chapter 2, but fall short in many other respects. Dedicated IR systems such as Lucene1, Indri and Terrier2 [31] have retrieval performance that is much faster for some operations, with much less hassle than in many RDBMS systems. Furthermore, I need to be able to go in and modify low-level behavior without too much development time. After investigating Apache Lucene, Indri and Terrier (all frequently used in related works) I decided that Terrier was the best documented system with the clearest source code and most supportive community. The basis of my project is therefore built around Terrier 3.5. I had many doubts throughout the project about not using an RDBMS, but the key issue that made me decide to use a dedicated text retrieval system was development time. Once I got up to speed on how Terrier worked I could modify any internal behavior relatively easily. Changing low-level behavior in many RDBMS systems is

1 http://lucene.apache.org/core/
2 http://terrier.org/


Table 5.1: List of dependencies organized by (sub) component.

Software component | Dependency | Role
All | Logger4j | Variable levels of debug and logging output.
Term statistics | Terrier 3.5 | Fast access of term statistics, efficient storage of the inverted index structure, easy to customize indexing and retrieval behavior.
Term statistics | language-detection | Remove as many tweets that are not in English as possible so as not to skew statistics.
Query expansion | Terrier 3.5 | Generate single expansion terms using algorithm 1.
Twitter API access | Twitter4j | Reliable access to the Twitter API, convert data into Java objects.
Web scraping | ExecutorService | Reliable implementation of a thread pool for I/O intensive tasks.
Web scraping | Boilerpipe [24] | Remove boilerplate content from web pages.
Cosine distance | Terrier 3.5 | Term statistics about collected tweets and external sources are used to form tf·idf vectors.
Classify | Weka | Stable and correct implementations of various classification algorithms.
Results storage and visualization | JFreeChart | Visualize results.
Results storage and visualization | JDBC | Database connectivity.
Results storage and visualization | MySQL | Store results and select subsets of tweets for visualization.

a tedious and dangerous task. Most of the work happens when we index documents and build an inverted index structure. Terrier does not fully support incremental indexing, but one can merge indexes without too much overhead. Nevertheless, for a prototype system it is enough to build an index once and then use it. Perhaps the most time-consuming process is the pre-processing before indexing, where we wish to remove all tweets that contain two different show titles. Many optimizations are possible, but I chose to do a naive implementation that compares every show name with every tweet and left it running overnight.


5.2.2 Implementation of algorithms

Algorithm 1, which computes the most informative expansion terms from a set of pseudo relevant documents, was already implemented in Terrier 3.5 as a query expansion class. I made some modifications to the source code to be able to specify large amounts of documents as relevant instead of issuing an original query. Using the query expansion class directly is not a typical use case of Terrier, but the code base was well organized, so it was easy to use this as an entry point instead of the more typical use case of ad-hoc querying. Listing 5.1 shows the implementation of algorithm 2. I perform searches of an inverted index for a particular term to create result sets of documents RS_k corresponding to term t_k. I do TopX·(TopX − 1)/2 hash-joins between the result sets RS_i, i = 0..TopX − 1 and RS_j, j = i..TopX − 1 to see in how many documents the two terms t_j and t_i co-occur. I also exploit the fact that documents are assigned document ids in order of indexing to create virtual documents d_{j−2}@d_{j−1}@d_j@d_{j+1}@d_{j+2}. The complexity is (skipping some minor operations, such as sorting the O(TopX²) size array on line 55):

• Algorithm 1 takes O(|PRS|) time.

• The outer loop is done TopX times

• Create and populate a hash table in O(|RS_i|), where RS_i is the set of documents that contain the term t_i.

• The inner loop is done TopX-i times

• Do a hash-join with the results of a query for t_j. This corresponds to 5 look-ups (one per tweet in the virtual document) in O(1), done |RS_j| times.

The total complexity is (some abuse of notation):

O(|PRS|) + O\left(TopX \cdot \left[\, |RS_i| + (TopX - i) \cdot |RS_j| \,\right]\right) = O(|PRS|) + O(TopX \cdot |RS_i|) + O(TopX^2 \cdot |RS_j|)

Now the complexity analysis becomes a bit tricky if we want to express it in terms of the total number of documents in the collection |C|. The number of documents |RS| is potentially |C|, but this is almost guaranteed not to happen since we are searching for the most descriptive terms of a particular relevant set. To guarantee a maximum run-time we can limit the search results RS to e.g. the first 1000 documents that contain the term and in most cases not affect the results. Searching an inverted index for a specific term is assumed to take O(1) in this analysis; this is the case if we can find the pointer to the list of documents that contain that term in constant time, which is exactly what an inverted index is designed to do.


Listing 5.1: Java code to perform Algorithm 2
1  public String[][] getCoOcTerms(PRF type, List<Term> tl, int topX, String topicId) {
2      int queryids = 10000;
3      int vDocSize = Settings.getInt("vDocSize", 5);
4      logger.info("Virtual document size is " + vDocSize);
5      // get the topX terms
6      List<Term> KL;
7      if (tl == null)
8          KL = getTopTerms(type, topX, topicId); // type is Chi Squared in experiments
9      else
10         KL = tl;
11     // term pair
12     class TermPair {
13         Term termA, termB; int count = 0; double dice = 0.0;
14     }
15     // create term-pair array that we later sort
16     TermPair[] arr = new TermPair[(topX * (topX - 1)) / 2];
17     for (int i = 0; i < arr.length; i++)
       ...
20     HashMap<Integer, Integer> ht;
21     int idx = 0;
22     double c = 1.0;
23     // Go through the top terms one by one
24     for (int i = 0; i < topX; i++) {
       ...
30         ht = new HashMap<Integer, Integer>(docids_u.length);
31         for (int j = 0; j < docids_u.length; j++)
       ...


5.2.3 Twitter access

Access to Twitter data is based on the HTTP protocol and uses the JSON format. Managing connections, authentication, conversion to Java objects and asynchronous handling of new messages as they arrive is done conveniently using the Twitter4j library.
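As an illustration (a minimal sketch, not the exact Tweet Collect code), tracking a set of expansion keywords with Twitter4j's streaming support could look roughly like this:

import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

/** Minimal sketch of collecting tweets for a set of search terms via the streaming API. */
public class KeywordStreamCollector {

    public static void main(String[] args) {
        // Credentials are assumed to be configured in twitter4j.properties.
        TwitterStream stream = new TwitterStreamFactory().getInstance();

        stream.addListener(new StatusListener() {
            @Override public void onStatus(Status status) {
                // In the real system the tweet would be stored in MySQL here.
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }
            @Override public void onDeletionNotice(StatusDeletionNotice notice) { }
            @Override public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }
            @Override public void onScrubGeo(long userId, long upToStatusId) { }
            @Override public void onStallWarning(StallWarning warning) { }
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });

        // Expansion terms from chapter 4; a space inside one string means conjunction.
        FilterQuery query = new FilterQuery().track(new String[]{"#himym", "barney ted"});
        stream.filter(query);
    }
}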

5.2.4 Web scraping

This component's job description is very simple: fetch the HTML content of a web page linked from a tweet and pass it on to the Boilerpipe library. Yet I spent major effort optimizing this component. Why? The problem lies in the response times of HTTP requests, with Java typically anywhere between 0.5 to 5 seconds. Perhaps one in three tweets contains a URL, and with an average look-up time of 1.5 seconds per web page, just fetching the contents for 10,000 tweets would take in excess of 80 minutes. Combine this with the fact that the Boilerpipe algorithm is not very fast either, perhaps 0.5 seconds to process one web page, and we have a serious problem. To solve this problem of waiting for I/O I used a thread pool implementation and the producer-consumer pattern. The many threads fetching web pages, sleeping while waiting for replies, make up the producer, and the Boilerpipe library is the consumer. Using the java.util.concurrent.ExecutorService interface I could with relative ease create an implementation capable of fetching and processing 1-2 MB/s of web content.
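A minimal sketch of this producer-consumer arrangement, where extractText() is a hypothetical stand-in for the Boilerpipe call:

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/** Minimal sketch of producer-consumer web scraping: a thread pool fetches pages
 *  (I/O bound producers) while a separate thread runs boilerplate removal (consumer). */
public class ScrapingPipeline {

    private static final String POISON = "";                 // signals end of stream

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of("http://example.com");   // URLs extracted from tweets
        BlockingQueue<String> html = new LinkedBlockingQueue<>(100);

        ExecutorService fetchers = Executors.newFixedThreadPool(32);
        for (String url : urls) {
            fetchers.execute(() -> {
                try (InputStream in = new URL(url).openStream()) {
                    html.put(new String(in.readAllBytes(), StandardCharsets.UTF_8));
                } catch (Exception e) {
                    // skip unreachable pages
                }
            });
        }
        fetchers.shutdown();

        new Thread(() -> {                                    // consumer: boilerplate removal
            try {
                String page;
                while (!(page = html.take()).equals(POISON)) {
                    System.out.println(extractText(page));    // hypothetical Boilerpipe wrapper
                }
            } catch (InterruptedException ignored) { }
        }).start();

        fetchers.awaitTermination(1, java.util.concurrent.TimeUnit.HOURS);
        html.put(POISON);
    }

    private static String extractText(String page) {
        return page;                                          // placeholder for the Boilerpipe call
    }
}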

5.2.5 Classification

The actual classification algorithms used are implemented in the open source Weka project. However, these implementations need data in a specific input format. To improve performance I also pre-processed the data before classification, and this is where I did most of my implementation work. Since I wanted to use tf∗idf weighting of the terms in tweets, I again used a Terrier index for the tweets that I classified and for the external documents that I want to compare them with. After the index is created it is easy to do the necessary conversions. Here we run into the first obstacle with the prototype system: it does not support real-time indexing with the present software components. One could however update the index on a regular basis, e.g. once per hour, and still meet reasonable time demands. After pre-processing I use either the Weka GUI or a bash script to perform training and classification; for this initial prototype I saw no need to integrate this component further. This is not hard, however, since Weka is a Java library.


5.2.6 Result storage and visualization

After data is fetched from the Twitter streaming API it is stored in a MySQL database so that it can be visualized. This is essential for understanding the performance of the method: we must de-multiplex the different keywords to know which ones were effective.

5.3 Limitations

At the present state the system is very much a prototype: it works well enough to evaluate a snapshot of the world, but not for continuous operation. To see this, I will elaborate on what happens if we wish to classify tweets in real time.

1. Build (or update) indexA of the current state of collected tweets containing titles (pseudo relevant tweets).

2. Use indexA to get expansion terms using algorithm 2.

3. Search twitter using the streaming API using the keywords found.

a) A tweet arrives.
b) Retrieve (or update) external sources so they are up to date.
c) Pre-process the tweet.

d) Build (or update) indexB with the external sources and the new tweet.
e) Convert the tweet to the cosine distances from the relevant external sources using the tf·idf vectors found in indexB.
f) Classify the tweet using a classifier trained previously.

The big issue with the current implementation and the limitations of Terrier 3.5 is that we have no really efficient way to update indexA and indexB. One must index new tweets and external sources (as an index of a single document in the extreme case) and merge with the old version of the index at reasonable intervals, meaning that we cannot get a real-time view of the world. This is not the state of the implementation described here. Instead, indexA is first computed, and we search Twitter using keywords generated from the statistics we can retrieve from this structure and store these tweets as-is in a MySQL database. Then we fetch external sources and create indexB from the retrieved tweets and the pseudo relevant tweets that will be used to train a classifier. Before training the classifier, pre-processing is done by retrieving statistics from indexB so that we finally can train and test it.


5.4 Development methodology

By using an agile approach I was able to quickly realize a working prototype that is still realistic in that it is scalable and works with real quantities of data (millions of tweets). This was only possible by using lightweight designs and heavy reliance on external libraries. I used two major sprints:

1. Develop the AQE component and do small scale tests

2. Develop the classification component and do large scale tests

After the first sprint I evaluated the progress and realized the need for extensive processing of tweets (remove spam, resolve URLs and hashtags) to obtain good results. It is near impossible to produce an architecture before prototyping that will hold up for anything more than a short amount of time; the human mind is simply not capable of imagining the complex interplay of different external libraries and standards before one has experience working with them. I actually lost perhaps 3-4 days by doing an initial architecture and design of the system. It helped me somewhat in knowing the different problems ahead, but it was far too low level and did not play well with the way external libraries and other software components were designed. The system is not robust, not optimized for performance, nor does it represent how a production system would look. But the goal of investigating the proposed method in chapter 4, within the time frame available and with a zero budget, was achieved.


Chapter 6

Performance evaluation

In this chapter the results of experiments testing the performance of Tweet Collect are described. The auxiliary data and parameters used to collect tweets about television programs are also described. As shown in chapter 4, a large corpus of pseudo-related tweets is used as a basis for query expansion, and since I am interested in collecting tweets about television programs, I used a corpus of television related tweets. To remove spam and other unwanted tweets, pre-processing operations are performed on this corpus. After these operations I index the data with a full-text index using Terrier 3.5. We are then ready to get expansion terms, which are used to query the Twitter streaming API. The resulting tweets are stored in a MySQL database so that I can visualize the results. To improve precision, the tweets are transformed into a vector representation by comparison to external sources and classified. As seen in chapter 5, Tweet Collect is essentially a two stage system consisting of a query expansion stage and a classification stage. Each of these two stages is evaluated by comparing to a separate baseline. For query expansion I compare the number of collected tweets using AQE with the number of collected tweets not using query expansion. Furthermore, I look at the precision of these collected tweets. For classification I sample and label 500 tweets from each show and compare classifier performance with the naive baseline: assume all collected tweets are related. Finally, I extrapolate AQE and classifier performance to estimate the system performance in terms of the additional number of tweets collected and the overall precision. This system performance is compared with the baseline of using my AQE approach without classification.

6.1 Collecting tweets about television programs

One possible application of the proposed method is to improve popularity estimation of television programs by increasing the recall of the collection of tweets about them. The method is evaluated with this application in mind, but should work for other types of product tracking as well.


Figure 6.1: Results of filtering the auxiliary data to improve data quality. Note that the first filtering step is not included here; these tweets are strings containing either the title of a show or the title words formed into a hashtag.

6.1.1 Auxiliary data

A large corpus of tweets is essential. This means that we need ongoing collection of tweets that include titles of TV programs over a longer period of time. A TV related tweet corpus made up of 133 million tweets was collected by KDDI R&D over several months. These tweets have been collected by polling the REST API at the maximum allowed rate every day. The keywords used to poll the REST API were the titles of 1478 different American TV shows and the most common hashtags found in these tweets, always grouping by the title. To improve data quality, strict filtering was employed:

1. Only keep tweets that contain the title words or a concatenated string of the title words prefixed with #.

2. Keep only alphanumeric characters and #, @. Remove URLs from consideration.

3. Remove all tweets containing any capitalization of RT as a stand-alone term.

4. Remove all tweets matching the exact same content as another previously seen tweet.

5. Remove all tweets that contain more than one show title. This second title must be longer than one word and comes from a list of known shows.

6. Remove all tweets that are determined not to be English by a naive Bayes classifier.

Figure 6.1 lists the results of this filtering, where we only keep the roughly 25% of tweets that do not match any of the filter criteria.


Table 6.1: TV shows used for collecting tweets with new search terms. Shows marked with “*” are aired as reruns multiple times every day.

TV show | Genre | Air times (UTC)
How I met your mother | Drama, Comedy | 9/22/*
The big bang theory | Drama, Comedy | 9/23/*
The vampire diaries | Drama, Science fiction | 9/21/00:00
The X factor | Talent show | 9/27/00:00
Wheel of fortune | Game show | 9/18/23:30

6.1.2 Experiment parameters

To evaluate the proposed method we collected data for 5 TV shows of different genres using AQE. Due to the limitations of a free Twitter API account we could only search for one of these shows at a time, and did so for 23h30min starting 6h before the airing of the show, see table 6.1. To obtain search terms for the Twitter streaming API, Top(K, R) was used with K = 25. Then the hashtag heuristic was applied to get hashtags and mentions as search terms. Finally, the 40 highest ranked term pairs according to equation 4.1, out of the possible 300 generated by Pairs(), were used. For comparison, we also search for the actual title so that we can later filter out all tweets that contain the title to see the increase in the number of tweets. As described in section 4.3.5, I use the distances from a tweet's tf·idf vector to the external documents' tf·idf vectors as the features for classification. In table 6.2 the external documents that are used are listed. For each show the external documents can be generated automatically by looking up web content. The TV words source is not gathered from the web but instead created manually and consists of the words episode, premiere, season, watch, watching and patterns of the form eX, e0X, sX, s0X, sXeX and s0Xe0X with X = 1..10. More accurate document frequencies are estimated using government documents from the American National Corpus [22].

6.1.3 Evaluation

After obtaining the AQE collection results for the different shows, a sample of 500 tweets per show that do not contain the title was labeled. This allows us to see how well the system works without the classification step, see table 6.7; to evaluate a classifier for the problem, see table 6.6; and to estimate complete system performance in terms of the increased number of tweets and precision, see table 6.8. Judging the topic of a message is something most humans are very good at; however, this problem is far from trivial. The decision is based upon the experience and knowledge of the interpreter about the subject matter itself and the jargon used to talk about it. Consider the following two hypothetical messages:


Table 6.2: Text sources used for comparing with tweets.

Text source | Description
EPG | Description of show
TV.com | Description of show, character names
Wikipedia page | Main Wikipedia page, use of boilerplate algorithm
Top10 Google | The top 10 pages of a Google search, use of boilerplate algorithm, then concatenated
Collected tweets | Concatenation of originally collected tweets containing the title of the TV show
TV words | Television related terms

“When actorX and actorY kiss I get tears in my eyes every time”
“omg #MN is so good, @actorX is the best”
where #MN is a hypothetical hashtag used to denote Movie Name. For a person that has seen the movie in question it is obvious that the first message refers to a specific movie. If that person is also an avid Twitter user she will understand the second message to be strongly related to the same movie. Much of Twitter consists of even more idiosyncratic messages, but with the proper knowledge these can be understood and classified. A strong definition of related to is not possible; however, we can at least conclude that a message that contains a title that is unique (or almost unequivocally used for one topic) is related. If this title has alternatives in the form of hashtags, messages containing these are also related. Furthermore, we can collect messages containing other strongly related meta data terms and leave it up to an evaluator to determine if they are related. Tweets that are not written in English are manually replaced until we have 500 labeled English tweets for each show.

6.2 Results

The proposed method gives us a number of additional tweets; the results of the experiments when using AQE only are listed in table 6.3. We can observe that the TV show The X factor has an abnormal number of additional tweets compared to the number of tweets containing the title. Figures 6.2-6.3 show a breakdown of how many tweets per keyword, or keyword pair, were found for the shows How I met your mother and The X factor respectively. A keyword must account for at least 0.1% to be included in the chart. These charts show what kind of keywords are generated and how large a fraction of the retrieved results they account for. Most of the keyword pairs do not give many new tweets but a few do. The most important new keywords are arguably the different hashtags and mentions.


Table 6.3: Number of tweets collected for the different TV shows during 23h30min.

TV show                  Containing title   Extra tweets
How I met your mother    6,271              11,002
The big bang theory      10,222             3,907
The vampire diaries      13,118             23,598
The X factor             62,539             253,376
Wheel of fortune         1,253              912

Table 6.4: Percentage of tweets containing the title that are related to the television show.

TV show                  Fraction related
How I met your mother    100%
The big bang theory      99%
The vampire diaries      100%
The X factor             100%
Wheel of fortune         81%

Here we see the reason why The X factor has a disproportionate number of additional tweets: the popularity of the celebrity hosts overtakes that of the show itself.

6.2.1 Ambiguity

The issue of ambiguous titles is investigated in [40], [14] and other works. Here we have focused on titles that consist of at least three words and assumed that any tweet that contains all of these words is actually about the TV show. To test this assumption we sampled 100 tweets from each show and assigned labels. We can see in table 6.4 that the assumption is not completely accurate, but good enough for our purposes except in the case of Wheel of fortune.

6.2.2 Classification

To increase precision we wish to remove as many of the unrelated additional tweets as possible. We also want to keep as many as possible of the related ones in order to achieve our goal of increasing recall. We do this by supervised classification; the chosen algorithm was the J48 implementation of the C4.5 decision tree algorithm in the machine learning toolkit Weka [21]. Best case classification results are listed in table 6.5, where one model is built for each show and the manually labeled data is used with 10-fold cross validation.
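For reference, the fragment below shows how such a per-show evaluation can be reproduced with the Weka API; the file name and the position of the class attribute are assumptions for the sake of the example, not the actual experiment scripts.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of a per-show 10-fold cross validation with Weka's J48 (C4.5).
// "himym_labeled.arff" is a hypothetical file holding the distance features
// and the related/unrelated label for one show.
public class CrossValidateShow {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("himym_labeled.arff");
        data.setClassIndex(data.numAttributes() - 1); // label assumed to be the last attribute

        J48 tree = new J48();                  // Weka's C4.5 implementation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        int related = 0;                       // index of the "related" class, assumed to be 0
        System.out.printf("Acc. %.3f  P1 %.3f  R1 %.3f  F1 %.3f%n",
                eval.pctCorrect() / 100.0,
                eval.precision(related),
                eval.recall(related),
                eval.fMeasure(related));
    }
}
```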


Figure 6.2: Fraction of tweets by search terms for How I met your mother. Figure 6.3: Fraction of tweets by search terms for The X factor.

Table 6.5: Classification results when using manually labeled test data as training data with 10-fold cross validation.

TV show                  Acc.    P1      R1      F1
How I met your mother    0.892   0.846   0.856   0.851
The big bang theory      0.894   0.924   0.916   0.92
The vampire diaries      0.784   0.726   0.898   0.803
The X factor             0.876   0.822   0.731   0.774
Wheel of fortune         0.938   0.929   0.954   0.941
Average                  0.877   0.850   0.871   0.858

Table 6.6: Classification results when using training data generated from the same external sources, training examples are from all five shows.

TV show                  Acc.    P1      R1      F1
How I met your mother    0.874   0.820   0.833   0.826
The big bang theory      0.886   0.918   0.910   0.914
The vampire diaries      0.746   0.748   0.727   0.737
The X factor             0.508   0.356   0.862   0.504
Wheel of fortune         0.834   0.797   0.916   0.852
Average                  0.770   0.728   0.850   0.767

The following abbreviations are used: Acc. denotes the accuracy, P1 the precision, R1 the recall and F1 the F-measure; P1, R1 and F1 are calculated for the related class. These metrics are defined as follows, where tp denotes true positives, tn true negatives, fp false positives and fn false negatives:

Acc. = (tp + tn)/(tp + tn + fp + fn)

P1 = tp/(tp + fp)

R1 = tp/(tp + fn)

F1 = 2 · P1 · R1/(P1 + R1)


Table 6.7: Class distribution of the annotated data after classification by the baseline classifier (left) and the C4.5 classifier (right). The baseline is the naive classifier c_baseline(tweet) = related.

TV show                  tp     fp     acc. (new)   tp_C   fp_C   acc._C (new)
How I met your mother    180    320    36.0%        150    33     82.0%
The big bang theory      334    166    66.8%        304    27     91.8%
The vampire diaries      245    255    49.0%        178    60     74.8%
The X factor             145    355    29.0%        125    226    48.3%
Wheel of fortune         261    239    52.2%        239    61     79.7%



A feasible system, however, cannot rely on manually labeled data, and table 6.6 shows the results when we build one model using assumed labels. The training data is made up of up to 10,000 tweets containing the title for each show, randomly sampled from a database of collected tweets. These tweets are used both as related and unrelated training examples depending on which of the five sets of external sources they were compared against; the test set is composed of the annotated data. Table 6.7 shows the class distribution of the labeled sample of 500 tweets that do not contain the title for each show. The table also shows the classification results for this sample, indicated by the subscript C. Our classifier is compared to a baseline classifier that assumes all tweets are relevant. In a live system one uses all tweets that are determined to be relevant by the classifier, and these correspond to the two categories tp_C and fp_C.
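A rough sketch of this bootstrapping step is given below. It is only meant to illustrate the idea: tweets containing the title of show A yield positive examples when compared against A's external sources and negative examples when compared against another show's sources. The helper distanceFeatures stands in for the distance computation of section 4.3.5 and all names are hypothetical, not the actual Tweet Collect code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of building training data with assumed labels (no manual annotation).
public class AssumedLabelTrainingData {

    static class Instance {
        final double[] features;
        final boolean related;
        Instance(double[] features, boolean related) {
            this.features = features;
            this.related = related;
        }
    }

    // Stand-in for the distance features of section 4.3.5 (one value per external source).
    static double[] distanceFeatures(String tweetText, List<String> externalSources) {
        return new double[externalSources.size()];
    }

    static List<Instance> build(Map<String, List<String>> titleTweetsByShow,
                                Map<String, List<String>> externalSourcesByShow) {
        List<Instance> training = new ArrayList<>();
        for (String tweetShow : titleTweetsByShow.keySet()) {
            for (String tweet : titleTweetsByShow.get(tweetShow)) {
                for (String sourceShow : externalSourcesByShow.keySet()) {
                    double[] f = distanceFeatures(tweet, externalSourcesByShow.get(sourceShow));
                    // Same show: assumed related; different show: assumed unrelated.
                    training.add(new Instance(f, sourceShow.equals(tweetShow)));
                }
            }
        }
        return training;
    }
}
```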

6.2.3 System results

After classification we can estimate the performance of the complete system if we assume that the ratio of related to unrelated tweets is the same for all new tweets that are collected and that the classifier performance is also the same. The maximum likelihood estimates of the precision and of the increase in the number of tweets are calculated with:

TP̂ = |t_title| + TP_rate · P_rate · |t_extra|

FP̂ = FP_rate · N_rate · |t_extra|

∆tweets = TP̂/|t_title| − 1

prec = TP̂/(TP̂ + FP̂)


Table 6.8: System performance using automatic query expansion, before and after classification. The subscript c denotes results after classification.

TV show                  ∆tweets   prec    ∆tweets_C   prec_C
How I met your mother    63.2%     59.2%   52.6%       88.5%
The big bang theory      25.5%     90.8%   23.2%       97.7%
The vampire diaries      88.1%     67.2%   64.1%       83.2%
The X factor             117.5%    43.1%   101.2%      48.3%
Wheel of fortune         38.0%     79.9%   34.8%       91.4%
Average                  66.5%     68.0%   55.2%       82.0%

Here t_title denotes the set of tweets containing the title and t_extra the set of additional tweets that are retrieved using AQE. The rates P_rate and N_rate are the estimated rates of positive and negative tweets in t_extra according to the assigned labels. From classification of the labeled data we estimate the classifier performance for all retrieved tweets with the true positive rate TP_rate and the false positive rate FP_rate. The results can be seen in table 6.8, where ∆tweets_C and prec_C are the collection results after classification. Note that for the show Wheel of fortune the increase in tweets is actually greater and the total precision lower, since not all tweets containing the title are related, see table 6.4.
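As a small worked example of these estimates, the snippet below plugs illustrative numbers into the formulas above. The tweet counts are loosely based on tables 6.3 and 6.7 for How I met your mother, while the classifier rates are assumed values chosen only for the sake of the example.

```java
// Worked example of the system-level estimates defined above.
// The counts are illustrative; the TP/FP rates here are assumed values.
public class SystemEstimateExample {
    public static void main(String[] args) {
        double titleTweets = 6271;    // |t_title|
        double extraTweets = 11002;   // |t_extra|
        double pRate = 0.36;          // estimated fraction of related tweets in t_extra
        double nRate = 1.0 - pRate;   // estimated fraction of unrelated tweets
        double tpRate = 0.83;         // assumed classifier true-positive rate
        double fpRate = 0.10;         // assumed classifier false-positive rate

        double tpHat = titleTweets + tpRate * pRate * extraTweets;
        double fpHat = fpRate * nRate * extraTweets;

        double deltaTweets = tpHat / titleTweets - 1.0;
        double precision = tpHat / (tpHat + fpHat);

        System.out.printf("delta tweets = %.1f%%, prec = %.1f%%%n",
                100 * deltaTweets, 100 * precision);
    }
}
```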

Chapter 7

Analysis

In this chapter I analyze the results of the experiments described in chapter 6 with regard to system performance, automatic query expansion performance and classifier performance. The proposed method has some weak points, and these are discussed.

7.1 System results

In chapter 6 the results of AQE and classification are listed. For four out of the five shows tested the improvement in recall, that is the increase in the number of related tweets, is reasonable at around 30% and the system precision is agreeable. With around 80% accuracy for the new tweets we see that the method is feasible and that it is worthwhile to try to improve the results further. For one show, The X factor, the results are not satisfactory, with an accuracy of classification for the new tweets that do not contain the title of only 48.3%. This false positive rate comes from the fact that many users talk about the celebrities in the tweets that we base AQE on, so that we get query drift. Regrettably the external sources also feature a lot of celebrity names, causing the spurious tweets to get a good distance value from them. Query drift is the primary risk with query expansion, as noted in [12]. This is the reason for using external sources to try to filter out the new spurious results. An example of the problem: the system associates the mentions @TheXfactorUSA and @ddlovato with the show The X factor, and tweets containing them are likely to be classified as related. When resolving these user ids using twitter we get a description that includes the sub-string “The X Factor”. What is needed is for the system to understand that the mention @TheXfactorUSA is not about a person but the twitter account associated with the show, whilst @ddlovato refers (mostly) to the host of this show as the users' idol. The most effective operational characteristic of the system is the understanding of twitter language use with the help of heuristic methods. Splitting hashtags into their constituent words, looking up web content, resolving user tags used as a substitute for the title, and assuming that some abbreviations stand for the show's name allow classification to be accurate. The tweets where this is applicable also correspond to the majority of the related additional tweets. A second, much smaller, group of related tweets is not easy to classify correctly; they often refer to events in the shows or voice opinions about how characters or TV personalities behave in the TV program.


Table 7.1: 95% confidence interval for accuracy with training data generated from the same external sources, training examples are from all five shows.

TV show                  Acc.    lower   upper
How I met your mother    0.874   0.842   0.902
The big bang theory      0.886   0.855   0.913
The vampire diaries      0.746   0.705   0.784
The X factor             0.508   0.463   0.553
Wheel of fortune         0.834   0.798   0.866
Average                  0.770

7.2 Generalizing the results

In table 6.7 I have assumed the following:

• The rate of related tweets to unrelated tweets is the maximum likelihood estimation from the annotated sample of 500 tweets for each show.

• The classification performance is the same for all tweets for a specific show. That is, I generalize the true-positive and false-positive rate and apply them to the whole population.

Let us investigate these assumptions and their implications. We can start by looking at the accuracy and assume that classification represents a series of Bernoulli trials, where success is assigning the correct label to a datum and failure is assigning the wrong label. The confidence interval is calculated using the Clopper-Pearson (exact) method and the results are listed in table 7.1. One can observe that the confidence intervals are rather tight, suggesting that a sample of 500 tweets is enough to gauge the accuracy of the classifier. The big issue is that the accuracy differs depending on which show we consider; this tells us that the classifier model cannot represent what an annotator thinks is not related, especially in the case of The X factor. This is because we compare to external sources before classification, so what we are really classifying is whether the data is similar to the external sources in the same way as the training data, and for one show this does not give results similar to the annotator's.
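As an illustration, the snippet below computes such an exact (Clopper-Pearson) interval with Apache Commons Math; fed an observed accuracy out of a sample of 500 labeled tweets it should roughly reproduce the intervals in table 7.1. The choice of Commons Math is my own for the example, not necessarily what was used to produce the table.

```java
import org.apache.commons.math3.distribution.BetaDistribution;

// Exact (Clopper-Pearson) confidence interval for a binomial proportion,
// e.g. 437 correctly classified tweets out of a sample of 500.
public class ClopperPearson {

    static double[] interval(int successes, int n, double alpha) {
        double lower = successes == 0 ? 0.0
                : new BetaDistribution(successes, n - successes + 1)
                      .inverseCumulativeProbability(alpha / 2.0);
        double upper = successes == n ? 1.0
                : new BetaDistribution(successes + 1, n - successes)
                      .inverseCumulativeProbability(1.0 - alpha / 2.0);
        return new double[] { lower, upper };
    }

    public static void main(String[] args) {
        double[] ci = interval(437, 500, 0.05); // accuracy 0.874 for one show
        System.out.printf("95%% CI: [%.3f, %.3f]%n", ci[0], ci[1]);
    }
}
```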


7.3 Evaluation measures

The goal of the system is to collect as many tweets as possible about a certain product with a low false positive rate. It would be natural to measure recall to see how well this goal was achieved. However, this is not possible unless we have a representative test set where all the data has been annotated with labels. Since one cannot know the total number of related tweets in any feasible way, I have chosen to estimate the increase in recall: the fraction of new related tweets.

7.4 New search terms

The search term pairs generally match mostly related tweets, but often quite few of them. Many of the term pairs account for a very small fraction of the collected tweets. An exception is when the search term pair is the name of an actor or television personality; then we usually get a large number of tweets about this person. Here we see a clear example of query drift. Worth mentioning is that the reason for using term pairs is twofold. Firstly, one needs to reduce the burden on the classifier by narrowing down the initial set of tweets to classify and the latent classes in this data set. Secondly, the twitter API does not allow an unlimited number of tweets to be retrieved for one query, at least without commercial access, so it is desirable to perform this pruning before actually retrieving tweets. I employ a heuristic to select hashtags and mentions and this is in general a good idea. Many of the generated hashtags result in new tweets that are related, but some do not. An example is the hashtag #suitup, which originally comes from a catch phrase of the show How I met your mother but has since become very popular to use on many different occasions. The mention heuristic is perhaps the most dangerous: the exact same problem of query drift towards celebrities exhibits itself, and also drift towards avid twitter users that often discuss the topic. It is a very blunt weapon for trying to get some highly desirable mentions such as “@thexfactorusa”. For many of the tested television shows mostly names of people are returned as term pairs. This is especially the case for the show The X factor. Some examples are shown in table 7.2. This is most likely due to the fact that the two word combination given-name, surname is a very common writing pattern, casting some doubt on the type of search terms obtainable with the co-occurrence heuristic described in section 4.2.1. However, many of these names are names of characters in the shows, and in this case almost all of the tweets one gets are related to the show.

7.5 Classifier performance

In [14] a random forest classifier is used that is reported to generalize well to unseen shows. This classifier is trained on manually labeled data about 7 of the 8 shows and tested on the last.


Table 7.2: First 13 term pairs for AQE using top 40 terms to form pairs and virtual documents of size 5. Also visible is a bug where I do not remove hashtags from consideration when forming pairs.

The big bang theory          The X factor
caley cuoco                  yousxo inspirationthe
rothman disintegration       amaro melanie
ornithophobia diffusion      xfactor #xfactor
#bazinga bazinga             demi lovato
recombination hypothesis     canty marcus
initiation beta              amaro krajcik
hawking excitation           amaro rene
initiation siri              melanie rene
sheldon #sheldon             krajcik canty
kitty pur                    britney spears
kitty purr                   melanie krajcik
hawking stephen              astro crow
siri beta                    amaro cowell

I tested this approach but it did not give good results. I believe that, in contrast to their approach, which centers very much around language patterns including the title, I have attempted a very different problem with new tweets that in many cases do not contain the same words at all. Even though the classification scheme is similar in that comparisons are made to external sources, in my approach it is necessary to have training data that is about the same show; we can however benefit from using a single classifier that is trained on the Cartesian product of relevant sets Rj and titles, see figure 7.1. This could appear to be a major limitation; however, since we never rely on manually labeled data it is only a question of increased training time. Why did I choose the C4.5 algorithm? This question has two answers: scalability and empirical performance. Out of all the classifiers attempted (linear SVM, SVM with Gaussian kernel, neural network, C4.5, Random Tree, Random Forest, Naive Bayes, nearest neighbor), only rotation forest gives comparable results in terms of the F1 measure. But rotation forest is much more memory and computationally intensive, since it creates many decision trees and performs principal component analysis [36], so my machine ran out of memory unless I cut the training set down dramatically. Other classifier models such as neural networks are very expensive to train, especially considering optimizing the structure, and also give poor results, so these have not been tested to any large extent. Not only the training data and the classifier algorithm used are important. Perhaps most important are the features used to describe tweets. I have used distances to external documents in terms of the distance between BOW representations. This ignores the ordering of terms and assigns a lot of importance to high frequency terms in external documents. Using this approach, certain patterns or language structures that cannot be described with this simple representation cannot be expressed. Consider an exact quote from the script of a television show: a tweet containing this is most certainly related to the television show but has very little chance of being classified correctly using the present approach. This is one of the reasons why we do not see recall values that are higher than what is observed for classification.


Figure 7.1: Ways to generate training data from auxiliary data. Here we have two data sets, A and B, that correspond to searching for the titles A and B, respectively. Either we can have a classifier for each title (the left case), or we can have just one classifier that is trained on the Cartesian product of data sets and titles (the right case). Regrettably, tests show that we cannot use the right case unless we include training data of the type R_title,title for all shows.


Chapter 8

Conclusions and future work

The proposed method in chapter 4 and its prototype implementation described in chapter 5 give an indication that the system indeed partially fulfills the research goal of increasing tweet collection about products, in this case tweets about television programs. Performing the combined approach of query expansion and classification for tweet collection is a new way of improving market research, and as such much additional testing is required before the true potential can be gauged accurately. The initial tests done using the prototype should be seen as a proof of concept more than anything else, and they indicate potential for increased effectiveness in tweet collection.

8.1 Applicability

One intended use case is market research, but the methods used can be employed as long as it is possible to collect a corpus of auxiliary data for an original set of queries. Further testing is required to see the performance of product tracking for areas such as reputation management of companies. Improving the sampling for market research is clearly a desirable goal. The implemented system allows an analyst to see not only what the users that use the same keywords as him think about a product, but also what the users that use other, in particular twitter-specific, keywords think about it. The error rate is slightly too high for really accurate aggregate analysis, but removing bias should be a desirable goal even if the precision of the survey is lowered slightly. It is definitely possible to use the system to access individual tweets instead of just aggregate statistics such as the number of tweets during specific times. With this system users can access a broad sample of tweets about their favorite topic that they might not have known about before. Most likely further ranking methods could be applied, perhaps based on the social graph of twitter, to show a user a previously unseen selection of tweets and users that are also about their favorite topic.


8.2 Scalability

One big issue with the system is the need to collect a large number of tweets about specific products. But in the intended use case of market research this is already being done, and my method has the potential to improve the results. In terms of implementation a major rework is required that collects all data under the same roof with a common access point. The current implementation, using many files, is merely a prototype and was limited by strict time constraints. The real problem is the need to recompute or update indexes. Even if this can be done rather quickly, even for large amounts of data, it means a latency for classification of tweets. There are database management systems that support real time indexing, relational databases for example, that can be used to avoid this problem. But preferably one of the parallel No-SQL databases available, such as MongoDB1, should be used to support large amounts of inserts.
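As an illustration of this direction, a minimal sketch of storing incoming tweets in MongoDB with the Java driver is shown below; the database name, collection name and document fields are assumptions, not part of the current prototype.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

// Sketch: inserting collected tweets into MongoDB so that they are indexed
// as they arrive. Database, collection and field names are illustrative.
public class TweetStore {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> tweets =
                    client.getDatabase("tweetcollect").getCollection("tweets");

            // Create a text index once; subsequent inserts become searchable immediately.
            tweets.createIndex(Indexes.text("text"));

            tweets.insertOne(new Document("tweetId", 123456789L)
                    .append("show", "How I met your mother")
                    .append("text", "omg #himym is so good")
                    .append("collectedAt", System.currentTimeMillis()));
        }
    }
}
```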

8.3 Future work

The proposed method has some issues that need improvement, but there is also untapped potential in using the method in settings other than collecting TV related tweets, which can be evaluated.

8.3.1 Other types of products and topics

Only five different television shows were tested, because of the expensive process of labeling data for evaluation. Because of this limited possibility to evaluate the method I decided that it was best to stick to one type of product instead of evaluating other types. Future work should evaluate the applicability of the proposed method for other products and topics. Perhaps company related tweets, which can be used for reputation management, are the most interesting.

8.3.2 Parameter tuning

Due to the fact that annotation of results was done after collection, it is not possible to optimize parameters using e.g. a grid search where we find the best results on a training corpus. On the other hand, the tests done actually test the real-world usefulness of the system. A more thorough theoretical analysis of the system could give us models for e.g. how many search terms to use, or a threshold for the weights, the χ2 scores, of hashtags and mentions that we choose to include as search terms. Providing a theoretical model of tweets and the occurrence of different terms could be of great interest. There are two routes for parameter tuning by local search, where we compare system results, e.g. the F-measure, to find the parameters that give the best results. If one has

1http://www.mongodb.org/

sufficiently good theoretical models, then it is possible to simulate tweet generation, perhaps supported by real data. A more realistic, but expensive, approach is to create a product tracking corpus by collecting a sample of all generated tweets in which all interesting tweets are labeled.

8.3.3 Temporal aspects

In chapter 3 I mentioned some work in twitter information retrieval that uses temporal profiles to weigh terms. This has a strong intuitive appeal; it is safe to assume that language use changes over time and is temporary. On twitter this manifests itself in changing hashtags and other buzz-words. However, it is unclear how one should treat temporality in AQE; further study into how language use changes over time is needed before this information can be used for generating new search terms. When it comes to classification it is certainly interesting to look into time weighting of tweets, e.g. that tweets closer to the broadcasting time of a show are more likely to be related. This is a very dangerous approach, however, since it is very close to circular reasoning:

Tweets about television that are authored close to the broadcast time of a show indicate ratings for the show ↔ Tweets that are authored close to the broadcast time of a show indicate that they are about the television show.

8.3.4 Understanding names

The role of names in language is that of identifying entities, not just people but products or even abstract concepts. In my research names represent the most common proper nouns used to access data, but understanding and exploiting the presence of named entities is crucial for effectiveness. A lot of the problems of the proposed approach have to do with how to deal with names of persons associated with television shows, either fictional characters, actors, hosts or just twitter authorities on the television show. Here I suggest future research into the field of entity disambiguation: e.g. if we can associate a name with a Wikipedia page we could possibly conclude what relationship this name has to the original query and act accordingly.

8.3.5 Improved classification

Regarding hard to classify tweets, such as tweets quoting a television show or describing a scene of a television show: using a BOW based approach for features is not enough to express this type of information. However, matching word two-grams or three-grams from tweets against a full transcript of the show would most likely capture the hard to classify tweets about events that happened in the TV show, but this requires access to more accurate external data as well as radically more

memory and computing resources. The bag of words assumption made where we compare tf·idf scores is not enough to handle these types of tweets. The main future work in improving tweet classification lies in finding computationally efficient ways to associate tweets with concepts. Concretely what these concepts are is another question. I have used external sources such as Wikipedia but also a collection of tweets that we have defined to be about the concept, the auxiliary data itself. The focus of this thesis has been on methods that are computationally feasible, and even though there are many techniques from computational linguistics, for instance techniques that consider sentence structure, e.g. non-deterministic grammars, these are often computationally expensive.
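A minimal sketch of such an n-gram feature is given below: it counts how many word bigrams of a tweet also occur in a show transcript. Whether a full transcript is available, and how the count would be combined with the existing distance features, are open assumptions rather than parts of the current system.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: count how many word bigrams of a tweet also occur in a transcript.
// One possible extra feature for catching quotes from the show.
public class BigramOverlap {

    static Set<String> bigrams(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (!tokens[i].isEmpty() && !tokens[i + 1].isEmpty()) {
                grams.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return grams;
    }

    static int overlap(String tweet, String transcript) {
        Set<String> transcriptGrams = bigrams(transcript);
        int count = 0;
        for (String g : bigrams(tweet)) {
            if (transcriptGrams.contains(g)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String transcript = "it is going to be legendary wait for it";
        System.out.println(overlap("this episode was legendary wait for it", transcript));
    }
}
```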

8.3.6 Ontology

In general, the external sources used leave much to be desired. Access to more specific descriptions of the products we wish to track would be desirable. One could also imagine using links between external sources to provide richer context. In related work on web search this approach is often taken, such as assigning specific importance to anchor texts. One could also build ontologies with statistical techniques such as latent semantic indexing. I briefly looked into link oriented data mining of Wikipedia during the initial, literature review, stages of this thesis but found the technologies involved to be immature for practical deployment. Another ontology based approach that is technically mature but lacks proven efficiency is the use of man-made ontologies such as WordNet or RDF descriptors. Finding exactly what kind of link structure is usable for either improving AQE or classification is a daunting task but philosophically very interesting. Tweets are very short and thus any type of correctly inferred link between a tweet and a known concept has the potential to be very helpful.

8.3.7 Improved scalability and performance profiling

The underlying system, Terrier 3.5, works well for a research application and as a proof of concept, but for a commercial application it is very likely that other systems would work better, especially with regard to real time indexing and horizontal scalability. As a first step, a thorough profile of the application, obtained by running a suitable evaluation suite, is required. This should reveal key characteristics such as what operations need to be fast.

Bibliography

[1] Task definition WePS-3 workshop. http://nlp.uned.es/weps/weps-3/call-for-participation. Accessed 2012-11-19.

[2] The twentieth text REtrieval conference (TREC 2011) proceedings. http://trec.nist.gov/pubs/trec20/t20.proceedings.html.

[3] Data, data everywhere. The Economist, February 2010.

[4] TREC data. http://trec.nist.gov/data.html, 2012. Accessed 2012-11-19.

[5] Twitter turns six. http://blog.twitter.com/2012/03/twitter-turns-six.html, March 2012. Accessed 2012-11-19.

[6] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, December 1974.

[7] Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4):357–389, October 2002.

[8] Jaime Arguello, Jonathan L. Elsas, Jamie Callan, and Jaime G. Carbonell. Document representation and query expansion models for blog recommendation. Technical report, CiteSeerX, 2009.

[9] Kyoko Ariyasu, Hiroshi Fujisawa, and Yasuaki Kanatsugu. Message analysis algorithms and their application to social tv. page 1. ACM Press, 2011.

[10] S. Bhattacharya, C.G. Harris, Y. Mejova, C. Yang, P. Srinivasan, and T.M. Track. The university of iowa at trec 2011: Microblogs, medical records and crowdsourcing.

[11] Berit Block. Facebook: Around the world in 800 days. http://blog.comscore.com/, May 2012.

[12] Claudio Carpineto and Giovanni Romano. A survey of automatic query expansion in information retrieval. ACM Comput. Surv., 44(1):1:1–1:50, January 2012.


[13] Ovidiu Dan, Junlan Feng, and Brian Davison. Filtering microblogging messages for social tv. In Proceedings of the 20th international conference companion on World wide web, WWW ’11, pages 197–200, New York, NY, USA, 2011. ACM.

[14] Ovidiu Dan, Junlan Feng, and Brian D. Davison. A bootstrapping approach to identifying relevant tweets for social TV. In ICWSM, Barcelona, 2011.

[15] Miles Efron. Hashtag retrieval in a microblogging environment. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, pages 787–788, New York, NY, USA, 2010. ACM.

[16] Miles Efron. Information search and retrieval in microblogs. Journal of the American Society for Information Science and Technology, 62(6):996–1008, 2011.

[17] Ofer Egozi, Shaul Markovitch, and Evgeniy Gabrilovich. Concept-based information retrieval using explicit semantic analysis. ACM Trans. Inf. Syst., 29(2):8:1–8:34, April 2011.

[18] A.P. Engelbrecht. Computational intelligence: an introduction. Wiley, 2007.

[19] Barbara Farfan. Funny and inspiring quotable quotations about twitter, social media & business. http://retailindustry.about.com/od/retailleaderquotes/a/Funny_inspiring_quotable_quotations_about_twitter_social_media_business.htm. Accessed 2012-11-19.

[20] Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, 2007.

[21] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009.

[22] N. Ide and K. Suderman. Integrating linguistic resources: The American National Corpus model. In Proceedings of the 6th International Conference on Language Resources and Evaluation, 2006.

[23] K. Ikeda, G. Hattori, K. Matsumoto, C. Ono, and Y. Takishima. Social media visualization for tv, 2011.

[24] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, pages 441–450, New York, NY, USA, 2010. ACM.

[25] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

[26] Kamran Massoudi, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp. Incorporating query expansion and quality indicators in searching microblog posts. In Paul Clough, Colum Foley, Cathal Gurrin, Gareth Jones, Wessel Kraaij, Hyowon Lee, and Vanessa Mudoch, editors, Advances in Information Retrieval, volume 6611 of Lecture Notes in Computer Science, pages 362–367. Springer Berlin / Heidelberg, 2011.

[27] Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. Adding semantics to microblog posts. In Proceedings of the fifth ACM international conference on Web search and data mining, WSDM ’12, pages 563–572, New York, NY, USA, 2012. ACM.

[28] Keith Mitchell, Andrew Jones, Johnathan Ishmael, and Nicholas J.P. Race. Social TV: toward content navigation using social awareness. In Proceedings of the 8th international interactive conference on Interactive TV&Video, EuroITV ’10, pages 283–292, New York, NY, USA, 2010. ACM.

[29] S.I. Newton. The Correspondence of Isaac Newton, volume 1. Published for the Royal Society at the University Press, 1959.

[30] Kyosuke Nishida, Ryohei Banno, Ko Fujimura, and Takahide Hoshide. Tweet classification by data compression. In Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web, DETECT ’11, pages 29–34, New York, NY, USA, 2011. ACM.

[31] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of ACM SIGIR’06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.

[32] T. Pang-Ning, M. Steinbach, and V. Kumar. Introduction to data mining. 2006.

[33] Fernando Perez-Tellez, David Pinto, John Cardiff, and Paolo Rosso. On the difficulty of clustering microblog texts for online reputation management. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, WASSA ’11, pages 146–152, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[34] J.R. Quinlan. C4.5: programs for machine learning, volume 1. Morgan Kaufmann, 1993.

[35] Bernard Renger, Junlan Feng, Ovidiu Dan, Harry Chang, and Luciano Barbosa. VoiSTV: voice-enabled social TV. In Proceedings of the 20th international conference companion on World wide web, WWW ’11, pages 253–256, New York, NY, USA, 2011. ACM.


[36] J.J. Rodriguez, L.I. Kuncheva, and C.J. Alonso. Rotation forest: A new classifier ensemble method. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(10):1619–1630, October 2006.

[37] Jaime Teevan, Daniel Ramage, and Meredith Ringel Morris. #TwitterSearch: a comparison of microblog search and web search. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 35–44, New York, NY, USA, 2011. ACM.

[38] A. Tumasjan, T.O. Sprenger, P.G. Sandner, and I.M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. 2010.

[39] Shoko Wakamiya, Ryong Lee, and Kazutoshi Sumiya. Towards better TV viewing rates: exploiting crowd’s media life logs over twitter for TV rating. In Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’11, pages 39:1–39:10, New York, NY, USA, 2011. ACM.

[40] S.R. Yerva, Z. Miklós, and K. Aberer. It was easy, when apples and blackberries were only fruits. In Third Web People Search Evaluation Forum (WePS-3), CLEF, 2010.

[41] Surender Reddy Yerva, Zoltán Miklós, and Karl Aberer. It was easy, when apples and blackberries were only fruits. 2010.

Appendix A

Hashtag splitting

As I realized that twitter users use hashtags in many different ways, I decided that I wanted to process one common type of them, long compound hashtags, and convert them to machine understandable text. Examples of compound hashtags are: #bestshowever, #Ilovesports. I hypothesize that hashtags are often used not for topic marking but for emphasis, and in this case compound hashtags are very common. Since twitter does not allow users to format their text, e.g. as bold or italic, users have come up with this way to express themselves. The gist of this problem is to convert a concatenated string into its constituent parts, where each part is found in a dictionary. However, where to split is non-trivial. Often several splits can form a valid solution; consider “mynameis”. This string can be split as “my name is” or as “myna me is”1, which contains only words from a common dictionary. This problem requires more than a greedy algorithm or brute force search, since even moderately short strings such as “mynameis” will create an exponential number of alternatives to compare with the original string unless we are careful. My approach is as follows: start by generating all the in-dictionary words that start with a letter at a certain position in the original word and are a substring of the original string, as seen here:

Pos.   Possible words
m      my, myna
y      (none)
n      na, name
a      am
m      me
e      ei
i      i, is
s      (none)

1Mynas are a family of birds originating from southern and eastern Asia.


After generating the possible substrings, a search over all combinations is performed to find those that form a valid split of the original string. I do a depth-first search starting from the left of the string, applying two constraints:

1. The solution's length must be the same as that of the original string, with bounds consistency.

2. Any chosen word must, together with the previously chosen words, make up the first letters of the original string.

The second constraint subsumes the first but is more expensive to check; therefore I apply the first constraint first. To select a single solution among those found, I choose the one with the largest product of term frequencies of the constituent words. The term frequencies come from a Project Gutenberg corpus2. Regrettably this means that the input data is slightly dated.

2Project Gutenberg frequency lists are available at http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
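The sketch below implements the search described above in simplified form: candidate words are tried position by position, a depth-first search builds splits that exactly cover the string, and the split with the largest product of term frequencies wins. The dictionary and frequency data are assumed inputs; this is not the exact prototype code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the hashtag splitting described above. The frequency
// map stands in for the Project Gutenberg word frequencies.
public class HashtagSplitter {

    private final Map<String, Long> frequency;

    HashtagSplitter(Map<String, Long> frequency) {
        this.frequency = frequency;
    }

    // Returns the best split of the tag, or null if no full split exists.
    List<String> split(String tag) {
        return search(tag.toLowerCase(), 0, new ArrayList<>(), null);
    }

    private List<String> search(String tag, int pos, List<String> chosen, List<String> best) {
        if (pos == tag.length()) {                        // full cover: a candidate solution
            if (best == null || score(chosen) > score(best)) {
                best = new ArrayList<>(chosen);
            }
            return best;
        }
        // Depth-first: try every dictionary word matching at this position, which
        // enforces that the chosen words always form a prefix of the tag.
        for (int end = pos + 1; end <= tag.length(); end++) {
            String word = tag.substring(pos, end);
            if (frequency.containsKey(word)) {
                chosen.add(word);
                best = search(tag, end, chosen, best);
                chosen.remove(chosen.size() - 1);
            }
        }
        return best;
    }

    // Product of term frequencies, used to pick among alternative splits.
    private double score(List<String> words) {
        double s = 1.0;
        for (String w : words) {
            s *= frequency.get(w);
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Long> freq = new HashMap<>();
        freq.put("my", 1000L); freq.put("myna", 2L); freq.put("name", 500L);
        freq.put("is", 2000L); freq.put("me", 900L); freq.put("na", 5L);
        System.out.println(new HashtagSplitter(freq).split("mynameis")); // [my, name, is]
    }
}
```

In the toy example both “my name is” and “myna me is” are found, and the frequency product correctly prefers the former.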

Appendix B

SVM performance

Table B.1: Results of classification of annotated test data with linear support vector machines. Text data is treated as sparse vectors.

TV show                  Accuracy
How I Met Your Mother    57% (285/500)
The Big Bang Theory      30.6% (153/500)
The Vampire Diaries      53.6% (268/500)
The X factor             69.8% (349/500)
Wheel of Fortune         74% (370/500)
