Towards Deep Semantic Analysis of Hashtags
Thesis submitted in partial fulfillment of the requirements for the degree of Masters of Science (by Research) in Computer Science and Engineering

by

Piyush Bansal
201102022

International Institute of Information Technology
Hyderabad - 500 032, INDIA
July 2016

Copyright © Piyush Bansal, 2016
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Towards Deep Semantic Analysis of Hashtags” by Piyush Bansal, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Adviser: Prof. Vasudeva Varma

To Spontaneity

Acknowledgments

Thank you, Prof. Vasudeva Varma, for being my advisor and for motivating me to pursue research passionately. Your suggestions have shaped my thoughts and given me direction when I most needed it. Our discussions have inspired me to explore the wonderful world of search and information extraction.

I thank Prof. Manish Gupta for all his constructive feedback and guidance. Your lectures have been instrumental in inculcating a spirit of curiosity in me.

I thank all the members of the Search and Information Extraction Lab for being there whenever I needed advice. Your support has been most valuable in this work. Special thanks to Arpit Sood for being my mentor in the formative period of this research. My heartfelt thanks to Romil Bansal for the elaborate discussions and experiments we carried out together, which have shaped my understanding. Thanks to Priya Radhakrishnan, Dharmesh Kakadia and Ajay Dubey for their suggestions at various points during my research.

I would also like to extend my gratitude to the Special Interest Group on Information Retrieval (SIGIR) for granting me the ACM Donald B. Crouch Grant to attend and present at SIGIR 2014, Gold Coast, Australia. This experience was very rewarding.
I would also like to thank the kind folks at Microsoft Research India for supporting my trip to Vienna, Austria, to present at the 37th European Conference on Information Retrieval (ECIR 2015).

I thank all my friends who have shared my curiosity about, and fascination with, intelligent machines. I would like to thank all the amazing people who proofread the drafts and made many useful suggestions: Sakshi Gupta, Ayushi Dalmia, and Satarupa Guha.

From the bottom of my heart, I thank my mom and dad for their love, support and advice. Thank you for teaching me to be inquisitive, and to be amazed at the magnitude of this universe.

Abstract

Microblogging services like Twitter enable communication at a massive scale. Twitter recently reported[1] that 248 million users access the microblogging platform every month and create around 500 million posts every day in more than 35 languages. This tremendous growth gives an opportunity to mine useful information that is being shared across social media. However, Twitter limits posts (also known as “tweets”) to 140 characters. This results in heavy use of emoticons, abbreviations and misspellings, and has led to various linguistic “innovations” that render traditional text analysis techniques less effective.

Another interesting aspect of such tweets is the usage of semantico-syntactical constructs called “hashtags”. Hashtags are “#”-prefixed keywords used by people to organise the meaning of their tweets. Hashtags also enable classification of tweets, since posts using the same or similar hashtags are expected to be semantically related to each other. The challenge posed by hashtags is that most of them are not simply “#”-prefixed keywords, but a “#” symbol followed by a concatenation of several words or phrases that are not space-delimited. For example, consider the hashtag “#NSAvsSnowden”.
We observe that this hashtag is essentially “NSA vs Snowden”, which is not a single keyword but a concatenation of several words. In this thesis, we discuss and compare various approaches to “segment” a hashtag into meaningful words. Our task also extends beyond the segmentation of hashtags: we present a unified framework that additionally performs “entity linking” on the constituent entities of a hashtag.

Entity linking is an established IR task, where the goal is to extract latent semantics from plain text by linking the text to a knowledge base (KB) such as Wikipedia. Consider, for example, the following text: “Snowden reveals classified information from NSA”. We first need to identify the entities in this piece of text, then “disambiguate” them, and finally establish a link between those entities and a knowledge base so that additional contextual information about each entity becomes available. This approach has been found to be instrumental in teaching machines the meaning of text that is otherwise meant for human consumption.

[1] https://about.twitter.com/company

Hence, after performing entity linking on the segmented hashtag “NSA vs Snowden”, we would have enriched the text with additional semantic information by establishing links between “Snowden” and the corresponding Wikipedia page “Edward Snowden”, and between “NSA” and “National Security Agency”. “NSA” could, in principle, also refer to “National Sports Academy” or “National Security Act”, and this is exactly where “disambiguation” becomes important for mention resolution, which employs contextual information to perform this task.

Since hashtags are human-curated labels associated with tweets, our premise is that segmenting and linking the entities present within hashtags could help in better understanding and extraction of the information shared across social media.
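To make the idea of segmentation concrete, the following is a minimal sketch and not the system described in this thesis. It assumes a simple unigram language model and uses dynamic programming to pick the most probable split of “#NSAvsSnowden”. The names `UNIGRAM_COUNTS`, `MAX_WORD_LEN`, `word_logprob` and `segment` are hypothetical, and the word counts are invented purely for illustration.

```python
import math

# Toy unigram counts. These values are invented for illustration only;
# a real system would estimate them from a large corpus.
UNIGRAM_COUNTS = {
    "nsa": 50, "vs": 80, "snowden": 40,
    "snow": 60, "den": 30, "n": 5, "sa": 5,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = 12  # longest candidate word considered (assumption)


def word_logprob(word):
    """Log-probability of a word; unknown words get a length-based penalty."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(hashtag):
    """Return the most probable word split of a hashtag under the unigram model."""
    # Lowercasing discards capitalisation, which could otherwise be a useful cue.
    text = hashtag.lstrip("#").lower()
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score, words = best[start]
            piece = text[start:end]
            candidates.append((score + word_logprob(piece), words + [piece]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


print(segment("#NSAvsSnowden"))  # ['nsa', 'vs', 'snowden']
```

With these toy counts, “snowden” as a single word scores higher than the split “snow den”, which illustrates why a segmenter has to rank competing candidate segmentations rather than accept the first dictionary match.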
Traditionally, most IR tasks have either treated hashtags as single words or ignored them for all practical purposes. We demonstrate how the extraction of semantics from tweets improves when our system makes additional semantic information available by segmenting and entity-linking hashtags. We demonstrate this through experiments on the NEEL Challenge Dataset and on a human-annotated subset of the Stanford Sentiment Analysis Dataset, which has also been made public to ease future research in this area. We achieve a P@1 score of 0.914 on the NEEL Dataset and 0.873 on the manually annotated Stanford Sentiment Analysis Dataset for hashtag segmentation and linking.

We also showcase how our approach leads to improvements in the tasks of “Semantic Microblog retrieval” and “Semantic Hashtag retrieval”. Microblog retrieval refers to the retrieval of a ranked list of microposts given a query Q. Hashtag retrieval, on the other hand, is a relatively newer IR task: retrieving a ranked list of the top-k hashtags relevant to a user’s query Q. For a user interested in, for instance, “Rock concerts”, it would be very helpful to be offered a list of hashtags that are commonly used in relation to “Rock concerts”. By tracking these hashtags, the user can gain information about rock concerts via the posted tweets. However, it is not possible for the user to manually figure out all the hashtags used across Twitter that are relevant to their interest. In this thesis, we also address this problem.

To solve these two retrieval problems, we propose and discuss a virtual document structure, which we refer to as the Semantically Enriched Microblog Document (SEMD), and experiment with various pseudo-blind relevance feedback mechanisms. We tested our approach on the publicly available Stanford sentiment analysis tweet corpus.
We observed an improvement of more than 10% in NDCG for the microblog retrieval task, and of around 11% in mean average precision for the hashtag retrieval task. We also experiment with some interesting aspects of search engine evaluation in the context of our tasks.

Our work presents an elaborate discussion of various aspects of hashtag analysis. In this manner, we hope that it drives fruitful future research in various tasks of microblog IR by drawing attention towards hashtags.

Contents

1 Introduction
  1.1 The Dream of Semantic Web
  1.2 Motivation
  1.3 Discussion on Problem Statement
  1.4 Overview of our Approaches
  1.5 Our Contributions
  1.6 Organization of thesis
2 Related Work
  2.1 Overview of Microblog IR
  2.2 Hashtag Segmentation
  2.3 Entity Linking and Disambiguation
  2.4 Microblog and Hashtag Retrieval
  2.5 Summary
3 Hashtag Segmentation and Extraction of Semantics
  3.1 Introduction
  3.2 Problem Motivation
  3.3 System Architecture
    3.3.1 Hashtag Segmentations Seeder
    3.3.2 Feature Extraction and Entity Linking
      3.3.2.1 Unigram Score
      3.3.2.2 Bigram Score
      3.3.2.3 Context Score
      3.3.2.4 Capitalisation Score
      3.3.2.5 Relatedness Score
    3.3.3 Segmentation Ranker
  3.4 Training Procedure
  3.5 Experiments and Results
    3.5.1 Evaluation Metrics and Datasets
      3.5.1.1 Evaluation Metrics
      3.5.1.2 Datasets
      3.5.1.3 Results
  3.6 Remarks and further analysis
  3.7 Conclusions
  3.8 Future Work
4 Towards Semantic Hashtag and Microblog Retrieval
  4.1 Problem Motivation
  4.2 Introduction
  4.3 Semantic Enrichment
    4.3.1 System Description
    4.3.2 Semantically Enriched Microblog Document
  4.4 Retrieval Procedure
  4.5 Dataset Description and Evaluation Procedure
  4.6 Results and Discussion
  4.7 Conclusions and Future Work
5 Future Work
6 Conclusions
Bibliography

List of Figures

1.1 Freebase Entry for “Project Gutenberg”
1.2 Linking Open Data (LOD) Project Cloud Diagram