Tweet Collect: Short Text Message Collection Using Automatic Query Expansion and Classification
UPTEC IT 13 003
Examensarbete 30 hp (degree project, 30 credits)
February 2013
Erik Ward

Abstract

The growing number of Twitter users creates large amounts of messages that contain valuable information for market research. These messages, called tweets, are short, contain Twitter-specific writing styles and are often idiosyncratic, which gives rise to a vocabulary mismatch between the keywords typically chosen for tweet collection and the words users actually use to describe television shows. A method is presented that uses a new form of query expansion that generates pairs of search terms and takes the language usage of Twitter into consideration to access user data that would otherwise be missed. Supervised classification, without manually annotated data, is used to maintain precision by comparing collected tweets with external sources. The method is implemented, as the Tweet Collect system, in Java, utilizing many processing steps to improve performance.

The evaluation was carried out by collecting tweets about five different television shows during their time of airing. It indicates, on average, a 66.5% increase in the number of relevant tweets compared with using the title of the show as the search term, at 68.0% total precision. Classification gives a slightly lower average increase of 55.2% in the number of tweets but a greatly increased total precision of 82.0%. The utility of an automatic system for tracking topics that can find additional keywords is demonstrated. Implementation considerations and possible improvements that can lead to better performance are discussed.

Supervisor (Handledare): Kazushi Ikeda
Reviewer (Ämnesgranskare): Tore Risch
Examiner (Examinator): Lars-Åke Nordén
ISSN: 1401-5749, UPTEC IT 13 003
Sponsor: KDDI R&D Laboratories, Göran Holmquist Foundation
Printed by (Tryckt av): Reprocentralen ITC

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web: http://www.teknat.uu.se/student

Sammanfattning (summary in Swedish, translated)

Social media such as Twitter are growing in popularity, and large numbers of messages, tweets, are written every day. These messages contain valuable information that can be used for market research, but they are very short, 140 characters, and in many cases exhibit an idiosyncratic manner of expression. To reach as many tweets as possible about a particular product, for example a TV show, the right search terms must be available; one Twitter user does not necessarily use the same words as another to describe the same thing. Different groups thus use different language and jargon. In the text of Twitter messages this is evident; we can see how some users express themselves with certain so-called hashtags and other language conventions. This leads to what is commonly called the vocabulary mismatch problem. To collect as many Twitter messages as possible about different products, a system that can generate new search terms has been developed, here called Tweet Collect. By analysing which words carry the most information, generating pairs of words that describe different things, and taking into account the language used on Twitter, new search terms are created from the original search terms, so-called query expansion.
In addition to collecting tweets that match the new search terms, a machine learning algorithm decides whether these tweets are relevant or not, in order to increase precision. After tweets had been collected for five TV shows, the system was evaluated through a spot-check survey of the newly collected tweets. This survey shows that, on average, the number of relevant tweets increases by 66.5% compared with using only the title of the TV show. Of all collected tweets, only 68.0% are actually about the TV show, but by using machine learning this can be increased to 82.0%, at the cost of reducing the increase in new, relevant tweets to 55.2%. This report demonstrates the usefulness of an automatic system that can find new search terms and thereby counteract the vocabulary mismatch problem. By reaching tweets written with a different language usage, it is argued that the systematic error in tweet collection is reduced. The realization of the system in the Java programming language is discussed, and improvements that can lead to increased efficiency are proposed.

This thesis is dedicated to the wonderful country of Japan and all who come to experience her.

This thesis expands upon: Erik Ward, Kazushi Ikeda¹, Maike Erdmann², Masami Nakazawa², Gen Hattori², and Chihiro Ono². Automatic Query Expansion and Classification for Television Related Tweet Collection. Proceedings of Information Processing Society of Japan (IPSJ) SIG Technical Reports, vol. 2012, no. 10, pp. 1-8, 2012.
¹ Supervisor. ² Proofreading.

Acknowledgment

I wish to thank the Göran Holmquist Foundation and the Sweden Japan Foundation for travel funding.

Glossary

AQE: Automatic Query Expansion; blind relevance feedback.
Corpus: A set of documents, typically in one domain.
Relevance feedback: Updating a query based on documents that are known to be relevant for this query.

Table of Notations

Ω: The vocabulary: the set of all known terms.
t: Term: a word without spacing characters.
q: Query: a set of terms. q ∈ Q ⊂ D.
C: Corpus: a set of documents.
d: Document: a set of terms. d ∈ D, where D is the set of all possible documents.
tf(t, d): Term frequency: an integer-valued function that gives the frequency of occurrence of t in d.
df(t): Document frequency: the number of documents in a corpus that contain t.
idf(t): Inverse document frequency: lg(1/df(t)).
R: Set of related documents; used for automatic query expansion.
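To make the notation above concrete, the following is a minimal sketch in Java (the implementation language used for Tweet Collect) of how tf(t, d), df(t) and idf(t) could be computed over a toy corpus of tweets. The class, method names and example tweets are illustrative assumptions only and are not part of the actual system; idf follows the definition lg(1/df(t)) given in the table above.

```java
import java.util.*;

/** Illustrative sketch of the tf, df and idf notation over a toy corpus. */
public class TermStatistics {

    /** tf(t, d): number of occurrences of term t in document d. */
    static int tf(String t, List<String> d) {
        int count = 0;
        for (String term : d) {
            if (term.equals(t)) count++;
        }
        return count;
    }

    /** df(t): number of documents in the corpus that contain t. */
    static int df(String t, List<List<String>> corpus) {
        int count = 0;
        for (List<String> d : corpus) {
            if (d.contains(t)) count++;
        }
        return count;
    }

    /** idf(t) = lg(1 / df(t)), as defined in the Table of Notations. */
    static double idf(String t, List<List<String>> corpus) {
        int documentFrequency = df(t, corpus);
        return Math.log(1.0 / documentFrequency) / Math.log(2.0); // lg = log base 2
    }

    public static void main(String[] args) {
        // Toy corpus: each tweet is represented as a list of already tokenized terms.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("watching", "the", "x", "factor", "tonight"),
                Arrays.asList("the", "x", "factor", "was", "great"),
                Arrays.asList("how", "i", "met", "your", "mother", "marathon"));

        List<String> d = corpus.get(0);
        System.out.println("tf(factor, d) = " + tf("factor", d));        // 1
        System.out.println("df(factor)    = " + df("factor", corpus));   // 2
        System.out.println("idf(factor)   = " + idf("factor", corpus));  // lg(1/2) = -1.0
    }
}
```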
Contents

1 Introduction 1
2 Background 3
  2.1 Twitter 3
    2.1.1 Structure of a tweet 3
    2.1.2 Accessing twitter data: Controlling sampling 4
    2.1.3 Stratification of tweet users and resulting language use 5
  2.2 Information retrieval 6
    2.2.1 Text data: Sparse vectors 6
    2.2.2 Term weights based on statistical methods 9
    2.2.3 The vocabulary mismatch problem 9
    2.2.4 Automatic query expansion 9
    2.2.5 Measuring performance 11
    2.2.6 Software systems for information retrieval 12
  2.3 Topic classification 12
  2.4 External data sources 13
3 Related work 15
  3.1 Relevant works in information retrieval 15
    3.1.1 Query expansion and pseudo relevance feedback 16
    3.1.2 An alternative representation using Wikipedia 17
  3.2 Classification 17
    3.2.1 Television ratings by classification 18
    3.2.2 Ambiguous tweets about television shows 18
    3.2.3 Other topics than television 20
  3.3 Tweet collection methodology 21
  3.4 Summary 22
4 Automatic query expansion and classification using auxiliary data 25
  4.1 Problem description and design goals 25
  4.2 New search terms from query expansion 26
    4.2.1 Co-occurrence heuristic 27
    4.2.2 Hashtag heuristic 28
    4.2.3 Algorithms 29
    4.2.4 Auxiliary data and pre-processing 30
    4.2.5 Twitter data quality issues 30
    4.2.6 Collection of new tweets for evaluation 32
  4.3 A classifier to improve precision 33
    4.3.1 Unsupervised system 34
    4.3.2 Data extraction 34
    4.3.3 Web scraping 34
    4.3.4 Classification of sparse vectors 34
    4.3.5 Features 35
    4.3.6 Classification 35
  4.4 Combined approach 36
5 Tweet Collect: Java implementation using No-SQL database 37
  5.1 System overview 37
  5.2 Components 38
    5.2.1 Statistics database 39
    5.2.2 Implementation of algorithms 41
    5.2.3 Twitter access 43
    5.2.4 Web scraping 43
    5.2.5 Classification 43
    5.2.6 Result storage and visualization 44
  5.3 Limitations 44
  5.4 Development methodology 45
6 Performance evaluation 47
  6.1 Collecting tweets about television programs 47
    6.1.1 Auxiliary data 48
    6.1.2 Experiment parameters 49
    6.1.3 Evaluation 49
  6.2 Results 50
    6.2.1 Ambiguity 51
    6.2.2 Classification 51
    6.2.3 System results 53
7 Analysis 55
  7.1 System results 55
  7.2 Generalizing the results 56
  7.3 Evaluation measures 57
  7.4 New search terms 57
  7.5 Classifier performance 57
8 Conclusions and future work 61
  8.1 Applicability 61
  8.2 Scalability 62
  8.3 Future work 62
    8.3.1 Other types of products and topics 62
    8.3.2 Parameter tuning 62
    8.3.3 Temporal aspects 63
    8.3.4 Understanding names 63
    8.3.5 Improved classification 63
    8.3.6 Ontology 64
    8.3.7 Improved scalability and performance profiling 64
Bibliography 65
Appendices 68
A Hashtag splitting 69
B SVM performance 71

List of Figures

2.1 The C4.5 classifier. 13
3.1 Approach to classifying tweets, here for television shows, but the same approach applies to other proper nouns. 19
4.1 Visualization of the fraction of tweets per keyword for the show Saturday night live; here, different celebrities that have been on the show dominate the resulting Twitter feed. 33
4.2 Conceptual view of collection and classification of new tweets. 36
5.1 Conceptual view of collection and classification of new tweets. 38
5.2 How the different components are used to evaluate system performance. This does not represent the intended use case, in which collection, pre-processing and classification are an ongoing process. 39
6.1 Results of filtering auxiliary data to improve data quality. Note that the first filtering step is not included here, and these tweets represent strings containing either the title of a show or the title words formed into a hashtag. 48
6.2 Fraction of tweets by search terms for How I met your mother. 52
6.3 Fraction of tweets by search terms for The X factor.