Identifying Tweets with Implicit Entity Mentions

Adarsh Koruthu Alex, Wright State University, 2016

Repository Citation: Alex, Adarsh Koruthu, "Identifying Tweets with Implicit Entity Mentions" (2016). Browse all Theses and Dissertations. 1577. https://corescholar.libraries.wright.edu/etd_all/1577

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science by Adarsh Alex, B.E., University of Mumbai, 2013. 2016, Wright State University.

WRIGHT STATE UNIVERSITY GRADUATE SCHOOL

Aug 28, 2016

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Adarsh Alex ENTITLED Identifying Tweets with Implicit Entity Mentions BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science.

Amit P. Sheth, Thesis Director
Mateen Rizki, Chair, Department of Computer Science and Engineering

Committee on Final Examination: Amit P. Sheth, Ph.D.; Krishnaprasad Thirunarayan, Ph.D.; Tanvi Banerjee, Ph.D.

Robert E. Fyffe, Vice President for Research and Dean of the Graduate School

ABSTRACT

ALEX, ADARSH. M.S., Department of Computer Science and Engineering, Wright State University, 2016. Identifying Tweets with Implicit Entity Mentions.

Social networking sites like Twitter and Facebook have become a significant source of user-generated content in the past decade.
Mining this user-generated content has proved beneficial for a broad range of applications such as event extraction, document retrieval, and sentiment analysis. Identifying entities is one of the major tasks that supplies important information to these applications. Identification of entities is typically performed in two steps: Named Entity Recognition (NER) and Entity Linking. State-of-the-art NER solutions focus on recognizing entities that are mentioned explicitly in social media posts. However, entities are frequently mentioned implicitly as well. For example, the tweet ‘Didn’t know that its the same actress in Fault in our stars and Divergent.’ contains explicit references to the movies Fault in our stars and Divergent, while it implicitly refers to the actress Shailene Woodley. Spotting and classifying tweets with such implicit entity mentions (i.e., recognizing that the above tweet has an implicit entity of type ACTRESS) is the initial step towards identifying the implicit mention of Shailene Woodley in this tweet. In this thesis, we propose a two-step, semantics-driven approach to spotting and typing implicit entity mentions in text. Specifically, we answer two research questions:

1. How can we find tweets that have implicit entity mentions of a given type?
2. What features help to distinguish tweets with implicit entity mentions from tweets with explicit entity mentions and tweets with no entity mentions at all?

We answer the first question by developing a technique to find semantic cues that indicate the presence of implicit entity mentions in tweets. We answer the second question by exploiting the syntactic features of the tweets, along with semantic features extracted from crowd-sourced knowledge bases like Wikipedia and DBpedia, to determine whether a tweet has an implicit entity mention or not. We evaluate our approach by creating a gold standard dataset for two domains, namely movies and books.
Contents

1 Introduction
2 Related Work
  2.1 Named Entity Recognition on Organized Text
  2.2 Named Entity Recognition in Unorganized Text
  2.3 Role of Background Knowledge in Text Analysis
    2.3.1 Wikipedia as Background Knowledge
3 Background
  3.1 Wikipedia
  3.2 DBpedia
  3.3 Semantic Similarity
  3.4 Word2Vec
  3.5 Random Forest
4 Approach
  4.1 Identifying tweets with potential implicit entity mentions
    4.1.1 Formal Semantic Cues
      4.1.1.1 Head Nouns
    4.1.2 Twitter Specific Semantic Cues
  4.2 Classifying Tweets as Implicit, Explicit and Null
    4.2.1 Domain Relevant Entities
      4.2.1.1 Domain Relevant Entities Using DBpedia
      4.2.1.2 Commonness
      4.2.1.3 Domain Relevant Entities Using Wikipedia
    4.2.2 Window Based Bigrams
    4.2.3 Explicit Entity Mentions
    4.2.4 Part-Of-Speech Tags
  4.3 Classifying Tweets into Predefined Type
5 Evaluation
  5.1 Semantic Cue Evaluation
  5.2 Classifying Tweets as Implicit, Explicit and Null
    5.2.1 Dataset
    5.2.2 Evaluation Metrics
    5.2.3 Results of Classification
      5.2.3.1 Error Analysis
  5.3 Discussion
    5.3.1 Impact of Relationships on the Classification Step
    5.3.2 Impact of Knowledge on the Classification Step
6 Conclusion and Future Work

List of Figures

3.1 Sample Wikipedia Page
4.1 Approach Overview
4.2 Categories of Furious 7
4.3 Parse Tree
4.4 Semantic Cues for MOVIE
4.5 Semantic Cues for BOOK
4.6 DBpedia Subgraph
4.7 Drawbacks of DBpedia
4.8 Wikipedia Hyperlink Graph
5.1 Semantic Cue Evaluation for Movies
5.2 Semantic Cue Evaluation for Books
5.3 Semantic Cue Below Similarity of 0.5 for Movies
5.4 Semantic Cue Below Similarity of 0.5 for Books
5.5 Impact of Relationships on Classification - Movie
5.6 Impact of Relationships on Classification - Book
5.7 Impact of Wikipedia Knowledge on Books
5.8 Impact of Wikipedia Knowledge on Movies
5.9 Impact of DBpedia Knowledge on Books
5.10 Impact of DBpedia Knowledge on Movies

List of Tables

1.1 Example Tweets Filtered by Semantic Cues of the Entity Type MOVIE
4.1 Tweets with Semantic Cues of Entity Type MOVIE
5.1 Dataset 1 Statistics
5.2 Dataset 2 Statistics
5.3 Classification Results for Books on First Dataset
5.4 Classification Results for Movies on First Dataset
5.5 Classification Results for Movies on Second Dataset
5.6 Classification Results for Books on Second Dataset

ACKNOWLEDGEMENTS

My journey through graduate school has been an extraordinary and fulfilling experience. I would like to take this opportunity to thank everyone who has helped me along the way. First and foremost, I want to express my sincere gratitude to my advisor, Dr. Amit P. Sheth, for his continuous guidance. I am thankful to him for providing me with this amazing opportunity to pursue my research interests. His dedication and enthusiasm continue to inspire me even today. I would like to thank Dr. T. K. Prasad for all his insightful comments. I would also like to thank Dr. Tanvi Banerjee for her patience and feedback on this work. I would like to thank Sujan Perera for guiding me through every stage of my research. It would have been impossible to complete my thesis without his guidance. I would also like to thank the entire Kno.e.sis team. This acknowledgement would be incomplete without thanking my family. I am thankful to my parents, Alexander and Nancy, for all the sacrifices they have made and their continued faith in me. Last but not least, I would love to thank Kamni for her continuous support and motivation.

1 Introduction

The advent of the World Wide Web has been one of the biggest breakthroughs in technological history. Since its inception in the early 1990s, the web has experienced exponential growth in both its user base and its content.
Over the past decade, the web itself has been revolutionized by the introduction of social media sites, blogs, and other platforms that allow human interaction on the web. The rapid growth of these platforms, especially social media sites like Twitter and Facebook, has given users a common space in which they can communicate and express their opinions. Social media has had a tremendous impact on our day-to-day life. Twitter, a microblogging website, is one of the social networking giants. The latest Twitter statistics show that there are 500 million tweets per day, with an active user base of 320 million users per month. These numbers are increasing at an exponential rate. The discussion topics of tweets range from what people have done during the day and opinions about a new movie to climate change and presidential elections in the USA. Twitter has been extensively used for a variety of applications such as event extraction [Ritter et al. 2012], opinion analysis [Pang and Lee 2008], and earthquake detection [Sakaki et al. 2010]. Identifying entities mentioned in tweets is a critical component of all the applications mentioned above. Identification of entities is typically performed in two steps. The first step, termed named entity recognition, spots rigid designators in text and classifies them into predefined types (e.g., PERSON, LOCATION, ORGANISATION) [Nadeau and Sekine 2007]. The second step, termed entity linking, assigns unique identities to the entities spotted by named entity recognition with respect to a knowledge base [Rao et al. 2013]. The literature on identifying entities in tweets has focused on explicitly mentioned entities. However, it is observed that, more often than not, entities are mentioned implicitly in tweets.
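
To make the distinction between explicit and implicit mentions concrete, the following minimal Python sketch runs the example tweet from the abstract through a toy version of the two-step pipeline. It is only an illustration under stated assumptions, not the method developed in this thesis: the gazetteer, the `kb:` identifiers, the cue-word table, and the function names `find_explicit` and `find_implicit_types` are all hypothetical, and simple keyword lookup stands in for real named entity recognition, entity linking, and semantic cue discovery.

```python
import re

# Hypothetical gazetteer: surface form -> (entity type, placeholder KB id).
# A real system would use an NER model and a knowledge base such as DBpedia.
GAZETTEER = {
    "fault in our stars": ("MOVIE", "kb:fault_in_our_stars"),
    "divergent": ("MOVIE", "kb:divergent"),
}

# Hypothetical cue words that hint at an implicit entity of a given type.
CUES = {"actress": "ACTRESS", "actor": "ACTOR", "director": "DIRECTOR"}


def find_explicit(tweet):
    """Spot explicit mentions by substring lookup and 'link' them to KB ids."""
    text = tweet.lower()
    return [(name, *GAZETTEER[name]) for name in GAZETTEER if name in text]


def find_implicit_types(tweet):
    """Return the entity types whose cue words occur in the tweet."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return sorted({CUES[t] for t in tokens if t in CUES})


tweet = "Didn't know that its the same actress in Fault in our stars and Divergent."
print(find_explicit(tweet))        # two explicit MOVIE mentions
print(find_implicit_types(tweet))  # the cue word 'actress' suggests type ACTRESS
```

The explicit lookup finds both movies but has no way to surface Shailene Woodley, because her name never appears in the tweet; only the cue word signals that an entity of type ACTRESS is being referred to implicitly.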
