Master's Thesis
Eindhoven University of Technology
MASTER
Named entity recognition and disambiguation in tweets
Patra, S.R.
Award date: 2014

Department of Mathematics and Computer Science
Web Engineering Group

Named Entity Recognition and Disambiguation in Tweets

Master Thesis

Soumya Ranjan Patra

Supervisors:
dr. Mykola Pechenizkiy
Erik Tromp MSc

Committee Members:
dr. Mykola Pechenizkiy
prof. dr. Paul De Bra
dr. Alexander Serebrenik
Erik Tromp MSc

Eindhoven, August 2014

Abstract

Social media has grown exponentially over the past few years. Users are generating far more unstructured content than ever before. Successful companies are also very active in social media, analysing this data for their marketing campaigns. However, the informal and noisy nature of such data makes it quite difficult to extract meaningful information from it.

In this thesis, we investigate the problem of named entity recognition and disambiguation in tweets.
Given a tweet written in English, the task focuses on extracting all the named entities it contains and assigning each entity a reference link. It has potential applications in online product monitoring, sentiment analysis, electoral predictions and other social media analytics.

We present a five-step approach to this problem consisting of tokenization, part-of-speech tagging, normalization, mention extraction and disambiguation. The first three steps serve as preprocessing for mention extraction and disambiguation, which are the core components of our approach. For tokenization and part-of-speech tagging, we use the regular expression tokenizer and the tagger developed by Owoputi et al. In the third step, we propose a normalization algorithm that uses Brown clusters and the Microsoft Web N-gram language model to normalize the tweets. In the next step, our mention extractor extracts candidate entities with the help of POS tags and n-gram matching against Freebase. Finally, using the intra-tweet context and the relationships among entities, our disambiguation algorithm assigns appropriate reference links to the extracted entities.

To evaluate our approach, we carry out several experiments and benchmark our complete process against different state-of-the-art extraction and disambiguation systems. We also evaluate the importance of each individual step by measuring the overall performance degradation when that step is left out of the pipeline. Finally, we present a case study demonstrating how our approach can be applied in a real-world scenario.

Acknowledgements

I would like to express my sincere gratitude to dr. Mykola Pechenizkiy for supervising and guiding me throughout my thesis. Particular thanks go to my daily supervisor, Mr. Erik Tromp, for his constant support and motivation. His guidance and insightful comments have helped me greatly during this period. I would also like to thank the rest of my thesis committee, prof. dr. Paul De Bra and dr.
Alexander Serebrenik, for their help in assessing my thesis. Last but not least, I am thankful to my family and friends for their priceless encouragement and support.

Contents

1 Introduction
  1.1 Motivation
  1.2 Practical Applications
  1.3 Problem Formulation
  1.4 Our Approach and Main Results
  1.5 Organization of The Thesis
2 Approach
  2.1 Tokenization
    2.1.1 Problem Formulation
    2.1.2 Related Work
    2.1.3 Owoputi et al. Regular Expression Tokenizer
  2.2 Part-Of-Speech Tagging
    2.2.1 Problem Formulation
    2.2.2 Related Work
    2.2.3 Owoputi et al. Tagger with Unsupervised Features
  2.3 Normalization
    2.3.1 Problem Formulation
    2.3.2 Related Work
    2.3.3 Our Approach
  2.4 Mention Extraction
    2.4.1 Problem Formulation
    2.4.2 Related Work
    2.4.3 Our Approach
  2.5 Disambiguation
    2.5.1 Problem Formulation
    2.5.2 Related Work
    2.5.3 Our Approach
3 Experimental Evaluation
  3.1 Motivation
  3.2 Datasets
  3.3 Complete Process Evaluation
    3.3.1 Entity Extraction Experiments
    3.3.2 Entity Disambiguation Experiments
  3.4 Evaluation of The Importance of Each Step
    3.4.1 Without POS Tagging
    3.4.2 Without Normalization
    3.4.3 Without Disambiguation
  3.5 Evaluation of Normalization
4 Case Study: Filtering Targeted Tweets with Disambiguation
5 Conclusion and Future Work
Appendices
A AIDA Disambiguation Errors
  A.1 AIDA Disambiguation Error: Capitalization
  A.2 AIDA Disambiguation Error: Grammatical Error
  A.3 AIDA Disambiguation Error: In Complete Setting

List of Tables

1.1 Sample tweets showing the challenges in performing information extraction tasks on tweets
3.1 Ritter et al. test dataset statistics
3.2 Comparison of performances with other approaches on Ritter et al. (2011) test dataset
3.3 Habib et al. dataset statistics
3.4 Disambiguation performances of our approach benchmarked against TAGME and different settings of AIDA
3.5 Impact of removing individual steps on the overall performance (T: Tokenization, P: POS tagging, N: Normalization, E: Mention Extraction, D: Disambiguation)
3.6 Examples of some errors introduced due to the removal of POS tagging
3.7 Normalization experiments
4.1 Performance of our system in correctly filtering a targeted entity out of ambiguous entities
4.2 Some representative examples from the case study

List of Figures

1.1 A simple visualization of the expected goal of the overall system
2.1 Pipeline for named entity recognition and disambiguation
2.2 An example showing the normalization process
2.3 Examples showing most similar entities from the entity vector model
2.4 Sorted list of candidate similarities for mention tuple (Rick Ross, Drake)
2.5 F-measure for different similarity thresholds on Ritter et al. sample dataset
2.6 Sorted list of candidate similarities for mentions 'Catherine Hardwicke', 'Maze', 'James Dashner'
2.7 Sorted list of candidate similarities for mention tuple ('Richard Pryor', 'Hamlet')
3.1 Plot showing the performance comparison of our approach with other baselines and Ritter et al.
3.2 Plot showing the performance comparison of our approach with different settings of AIDA
3.3 Plot showing the impact of removing individual steps on the overall performance
3.4 Plot showing the effect of introducing spelling errors in our approach and AIDA complete system
A.1 Screenshots from AIDA online tool showing how lowercase affects disambiguation
A.2 Screenshots from AIDA online tool showing how a grammatical error affects disambiguation
A.3 Screenshots from AIDA online tool showing disambiguation error with the complete setting

Chapter 1
Introduction

Named entity recognition (NER) and named entity disambiguation (NED) are two important information extraction tasks in the domain of natural language processing. A named entity is a phrase in a text that belongs to a predefined category such as person, location, organization, movie name or product name. NER is the task of identifying such named entities in a given text. NED, on the other hand, focuses on determining the actual identity of each extracted entity. For example, given the input text "Philips hired 500 employees this quarter but ASML only hired 200", a recognizer component is expected to extract the phrases "Philips" and "ASML" as named entities. For the