SEMANTIC SENTIMENT ANALYSIS OF MICROBLOGS

hassan saif

A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy (PhD)

Knowledge Media Institute
Open University

January 21, 2015

Hassan Saif: Semantic Sentiment Analysis of Microblogs, © January 21, 2015

supervisors: Dr. Harith Alani, Dr. Miriam Fernandez, Dr. Yulan He

location: Milton Keynes, United Kingdom

Dedicated to my beloved mother Nebal Saif.

ABSTRACT

Microblogs and social media platforms are now considered among the most popular forms of online communication. Through a platform like Twitter, much information reflecting people’s opinions and attitudes is published and shared among users on a daily basis. This has recently brought great opportunities to companies interested in tracking and monitoring the reputation of their brands and businesses, and to policy makers and politicians to support their assessment of public opinions about their policies or political issues.

A wide range of approaches to sentiment analysis on Twitter, and other similar microblogging platforms, have been built recently. Most of these approaches rely mainly on the presence of affect words or syntactic structures that explicitly and unambiguously reflect sentiment (e.g., “great”, “terrible”). However, these approaches are semantically weak, that is, they do not account for the semantics of words when detecting their sentiment in text. This is problematic since the sentiment of words, in many cases, is associated with their semantics, either through the context they occur within (e.g., “great” is negative in the context “pain”) or through the conceptual meaning associated with the words (e.g., “Ebola” is negative when its associated semantic concept is “Virus”).

This thesis investigates the role of words’ semantics in sentiment analysis of microblogs, aiming mainly at addressing the above problem. In particular, Twitter is used as a case study of microblogging platforms to investigate whether capturing the sentiment of words with respect to their semantics leads to more accurate sentiment analysis models on Twitter. To this end, several approaches are proposed in this thesis for extracting and incorporating two types of word semantics for sentiment analysis: contextual semantics (i.e., semantics captured from words’ co-occurrences) and conceptual semantics (i.e., semantics extracted from external knowledge sources).

Experiments are conducted with both types of semantics by assessing their impact in three popular sentiment analysis tasks on Twitter: entity-level sentiment analysis, tweet-level sentiment analysis and context-sensitive sentiment lexicon adaptation. Evaluation under each sentiment analysis task includes several sentiment lexicons and up to nine Twitter datasets of different characteristics, as well as comparisons against several state-of-the-art sentiment analysis approaches widely used in the literature.

The findings from this body of work demonstrate the value of using semantics in sentiment analysis on Twitter. The proposed approaches, which consider words’ semantics for sentiment analysis at both entity and tweet levels, surpass non-semantic approaches in most datasets.

ACKNOWLEDGEMENTS

When I think of my PhD experience over the past three and a half years, two things come to my mind. Firstly, my PhD was an exciting journey that reshaped my life entirely. Secondly, I have been amazingly fortunate to meet people who have walked with me through this journey step by step and inch by inch until the end.

My deepest gratitude goes to my supervisor, Dr. Harith Alani. His deep faith in me, along with his expert guidance and continuous support, were undeniably the main reasons behind all the success and achievements I had out of my PhD work. I am also very thankful to him for teaching me the art of strategic decision making and “bulletproof” academic writing. I could not be more grateful and fortunate to have had Dr. Alani as my PhD supervisor.

My sincere thanks also go to my second supervisor Dr. Miriam Fernandez, who did make herself available for advising and supporting me technically and emotionally during the toughest times of my PhD. The countless brainstorming sessions, during which she patiently spent hours discussing my research ideas and plans, have undoubtedly had a huge impact on the quality of my research.

I am also grateful to my external supervisor Dr. Yulan He, with whom I had the honour to work under direct supervision during the first year of my PhD and her distant supervision afterwards. Dr. Yulan He is one of the most intelligent researchers I have ever worked with. Her extraordinary expertise and techniques in mathematics, machine learning and natural language processing inspired my work on this dissertation.

I would like to extend my gratitude to the members of my dissertation committee, Prof. Markus Strohmaier and Dr. Trevor Collins, for their insightful comments and feedback, which helped improve this dissertation. My sincere thanks also go to the committee chairman, Dr. Paul Mulholland.

Gratitude is also extended to my colleagues in the Knowledge Media Institute (KMi). Their professional support, along with their unique positive spirit and attitude, makes KMi one of the best and most unique places to work.

The past few years gave me a chance to meet unique people from all over the globe, who instantly touched my soul with their amazing personalities and became an important part of my life. I am extremely grateful to my dearest friends, Giuseppe Scavo, Drahomira Herrmannova, Ilaria Tiddi, Lara Piccolo, Lucas Anastasiou, Lukas Zilka, Magdalena Malysz and Maria Maleshkova. Your friendship helped me stay sane throughout these difficult years.

Finally, special recognition goes out to my parents, Riad and Nebal Saif, my brothers Aamer and Rafat, and my sister Sara for supporting me spiritually throughout writing this dissertation and my life in general.

CONTENTS

1 introduction 3
  1.1 Motivation 5
    1.1.1 Sentiment Analysis of Twitter: Gaps and Challenges 6
    1.1.2 From Affect Words to Words’ Semantics 7
  1.2 Research Questions, Hypotheses and Contributions 9
  1.3 Thesis Methodology and Outline 13
  1.4 Publications 16

i background 19
2 literature review 21
  2.1 Background 21
    2.1.1 Fundamentals 22
    2.1.2 A Note on Terminology 27
  2.2 Sentiment Analysis of Twitter 28
    2.2.1 Traditional Sentiment Analysis Approaches 29
      2.2.1.1 The Machine Learning Approach 29
      2.2.1.2 The Lexicon-based Approach 39
      2.2.1.3 The Hybrid Approach 45
      2.2.1.4 Discussion 46
  2.3 Semantic Sentiment Analysis 49
    2.3.1 Contextual Semantics 49
    2.3.2 Conceptual Semantics 52
  2.4 Summary and Discussion 58
    2.4.1 Discussion 61

ii semantic sentiment analysis of twitter 65
3 contextual semantics for sentiment analysis of twitter 67
  3.1 Introduction 67
  3.2 The SentiCircle Representation of Words’ Semantics 69
    3.2.1 Overview 69
    3.2.2 SentiCircle Construction Pipeline 70


      3.2.2.1 Term Indexing 71
      3.2.2.2 Context Vector Generation 72
      3.2.2.3 SentiCircle Generation 73
      3.2.2.4 Senti-Median: The Overall Contextual Sentiment Value 76
  3.3 SentiCircles for Sentiment Analysis 77
    3.3.1 Entity-level Sentiment Detection 77
    3.3.2 Tweet-level Sentiment Detection 77
      3.3.2.1 The Median Method 78
      3.3.2.2 The Pivot Method 78
      3.3.2.3 The Pivot-Hybrid Method 79
    3.3.3 Evaluation Setup 79
      3.3.3.1 Datasets 80
      3.3.3.2 Sentiment Lexicons 83
      3.3.3.3 Baselines 83
      3.3.3.4 Thresholds and Parameters Tuning 84
    3.3.4 Evaluation Results 85
      3.3.4.1 Entity-Level Sentiment Detection 85
      3.3.4.2 Tweet-Level Sentiment Detection 87
      3.3.4.3 Impact on Words’ Sentiment 89
  3.4 SentiCircles for Adapting Sentiment Lexicons 90
    3.4.1 Evaluating SentiStrength on the Adapted Thelwall-Lexicon 93
  3.5 Runtime Analysis 95
  3.6 Discussion 96
  3.7 Summary 98
4 conceptual semantics for sentiment analysis of twitter 99
  4.1 Introduction 99
  4.2 Conceptual Semantics for Supervised Sentiment Analysis 101
    4.2.1 Extracting Conceptual Semantics 102
    4.2.2 Conceptual Semantics Incorporation 103
    4.2.3 Evaluation Setup 105
      4.2.3.1 Datasets 106
      4.2.3.2 Semantic Concepts Extraction 106
      4.2.3.3 Baselines 106
    4.2.4 Evaluation Results 109

      4.2.4.1 Results on Incorporating Semantic Features 111
      4.2.4.2 Comparison of Results 112
  4.3 Conceptual Semantics for Lexicon-based Sentiment Analysis 115
    4.3.1 Enriching SentiCircles with Conceptual Semantics 115
    4.3.2 Evaluation Results 116
  4.4 Discussion 117
  4.5 Summary 119
5 semantic patterns for sentiment analysis of twitter 121
  5.1 Introduction 121
  5.2 Related Work 123
  5.3 Semantic Sentiment Patterns of Words 123
    5.3.1 Syntactical Preprocessing 124
    5.3.2 Capturing Contextual Semantics and Sentiment of Words 124
    5.3.3 Extracting Patterns from SentiCircles 125
  5.4 Evaluation Setup 126
    5.4.1 Tweet-Level Evaluation Setup 127
    5.4.2 Entity-Level Evaluation Setup 127
    5.4.3 Evaluation Baselines 129
    5.4.4 Number of SS-Patterns in Data 131
  5.5 Evaluation Results 132
    5.5.1 Results of Tweet-Level Sentiment Classification 132
    5.5.2 Results of Entity-Level Sentiment Classification 134
  5.6 Within-Pattern Sentiment Consistency 135
    5.6.1 Sentiment Consistency vs. Sentiment Dispersion 136
  5.7 Discussion 137
  5.8 Summary 139

iii analysis study 141
6 stopword removal for twitter sentiment analysis 143
  6.1 Introduction 143
  6.2 Stopword Analysis Set-Up 145
    6.2.1 Datasets 145
    6.2.2 Stopword removal methods 146
      6.2.2.1 The Classic Method 146
      6.2.2.2 Methods based on Zipf’s Law (Z-Methods) 146

      6.2.2.3 Term Based Random Sampling (TBRS) 147
      6.2.2.4 The Mutual Information Method (MI) 148
    6.2.3 Twitter Sentiment Classifiers 149
  6.3 Evaluation Results 149
    6.3.1 Classification Performance 150
    6.3.2 Feature Space 152
    6.3.3 Data Sparsity 153
    6.3.4 The Ideal Stoplist 154
  6.4 Discussion 156
  6.5 Summary 157

iv conclusion 159
7 discussion and future work 161
  7.1 Discussion 162
    7.1.1 Extracting Words’ Semantics 162
      7.1.1.1 Extracting Contextual Semantics 162
      7.1.1.2 Extracting Conceptual Semantics 163
      7.1.1.3 Extracting Stopwords 164
    7.1.2 Incorporating Words’ Semantics in Sentiment Analysis 164
      7.1.2.1 Incorporating Semantics into Lexicon-based Approaches 164
      7.1.2.2 Incorporating Semantics into Machine Learning Approaches 165
      7.1.2.3 Words’ Semantics for Adapting Sentiment Lexicons 166
    7.1.3 Assessment and Results 167
  7.2 Future Work 169
8 conclusion 173
  8.1 Contextual Semantics for Sentiment Analysis of Twitter 174
  8.2 Conceptual Semantics for Sentiment Analysis of Twitter 175
  8.3 Semantic Patterns for Sentiment Analysis of Twitter 176
  8.4 Analysis on Stopword Removal Methods for Sentiment Analysis of Twitter 177

v appendix 179
a evaluation datasets for twitter sentiment analysis 181
b annotation booklet for the sts-gold dataset 187

bibliography 189

LIST OF TABLES

Table 1  Main and secondary properties of the four types of approaches to sentiment analysis on Twitter. ML: Machine Learning. 61
Table 2  Twitter datasets used for the evaluation. 80
Table 3  28 Entities, with their semantic concepts, used to build STS-Gold. 82
Table 4  Number of tweets and entities under each class in the STS-Gold dataset. 83
Table 5  Neutral region boundaries for Y-axis. 84
Table 6  Entity-level sentiment analysis results. 86
Table 7  Average percentage of words in three datasets whose sentiment orientation or strength were updated by their SentiCircles. 89
Table 8  Adaptation rules for Thelwall-Lexicon, where prior: prior sentiment value, StrongQuadrant: very negative/positive quadrant in the SentiCircle, Add: add the term to Thelwall-Lexicon. 92
Table 9  Average percentage of words in the three datasets that had their sentiment orientation or strength updated by our adaptation approach. 94
Table 10 Cross comparison results of original and the adapted lexicons. 94
Table 11 Runtime analysis of the SentiCircle model on the STS-Gold dataset. 96
Table 12 Evaluation results of AlchemyAPI, Zemanta and OpenCalais. 103
Table 13 Statistics of the three Twitter datasets used in this paper. 106
Table 14 Entity/concept extraction statistics of STS-Expand, OMD and HCR using AlchemyAPI. 108
Table 15 Total number of unigram features extracted from each dataset. 108
Table 16 Extracted sentiment-topic words by the sentiment-topic model (JST) [87]. 110
Table 17 Average sentiment classification performance (%) using different methods for incorporating the semantic features. Performance here is the average harmonic mean (F measure) obtained from identifying positive and negative sentiment. 112


Table 18 Cross comparison results of all the four features. 113
Table 19 Averages of Precision, Recall, and F measures across all three datasets. 114
Table 20 Twitter datasets used for tweet-level sentiment analysis evaluation. 127
Table 21 Numbers of negative, positive and neutral entities in the STS-Gold Entity dataset along with examples of 5 entities under each sentiment class. 128
Table 22 Numbers of the semantic concepts extracted from all datasets. 130
Table 23 Numbers of SS-Patterns extracted from all datasets. 132
Table 24 Average classification accuracy (acc) and average harmonic mean measure (F1) obtained from identifying positive and negative sentiment using unigram features. 133
Table 25 Win/Loss in Accuracy and F1 of using different features for sentiment classification on all nine datasets. 134
Table 26 Accuracy and averages of Precision, Recall, and F1 measures of entity-level sentiment classification using different features. 134
Table 27 Example of three strongly consistent SS-Patterns (Patterns 3, 11, and 12) and one inconsistent SS-Pattern (Pattern 5), extracted from the STS-Gold Entity dataset. 136
Table 28 Statistics of the six datasets used in evaluation. 146
Table 29 Average reduction rate on zero elements (Zero-Elm) and all elements (All-Elm) of the six stoplist methods. 155
Table 30 Average accuracy, F1, reduction rate on feature space (R-Rate) and data sparsity of the six stoplist methods. Positive sparsity values refer to an increase in the sparsity degree while negative values refer to a decrease in the sparsity degree. Human Supervision refers to the type of supervision required by the stoplist method. 156
Table 31 Total number of tweets and the tweet sentiment distribution in all datasets. 181

LIST OF FIGURES

Figure 1  General methodology for extracting, incorporating and assessing the effectiveness of words’ semantics in Twitter sentiment analysis. 14
Figure 2  Pipeline of our work under each chapter, and thesis contribution outline. Arrows crossing two different chapters mean that a component developed in one chapter is used in the other one. 15
Figure 3  Four dimensions of the sentiment analysis research problem. 22
Figure 4  Pipeline of the machine learning approach to sentiment analysis on Twitter. 29
Figure 5  Three main factors that affect the sentiment classification performance of the machine learning approach. 30
Figure 6  JST Model [87]. 52
Figure 7  Concise paradigm of Sentic Computing. The detailed paradigm can be found in [25]. 53
Figure 8  The RDF entry of the concept minor injury in the SenticNet lexicon. 56
Figure 9  Example SentiCircle for the word “ISIS”. Terms positioned in the upper half of the circle have positive sentiment while terms in lower half have negative sentiment. 70
Figure 10 The systematic pipeline of the SentiCircle approach for contextual sentiment analysis. 70
Figure 11 SentiCircle of a term m. 73
Figure 12 Example SentiCircles for “iPod” and “Taylor Swift”. We have removed points near the origin for easy visualisation. Dots in the upper half of the circle (triangles) represent terms bearing a positive sentiment while dots in the lower half (squares) are terms bearing a negative sentiment. 75


Figure 13 The Pivot Method. The figure shows a SentiCircle of a pivot term p_j. The sentiment strength S_j1 of word w_1 with respect to the pivot term p_j is the radius r_1 in the SentiCircle, and likewise for S_j2. 79
Figure 14 Density geometric distribution of terms on the OMD dataset. 85
Figure 15 Tweet-level sentiment detection results (Accuracy and F-measure), where ML: MPQA-Method, SL: SentiWordNet-Method, SS: SentiStrength, Mdn: SentiCircle with Median method, Pvt: SentiCircle with Pivot method, Hbd: SentiCircle with Pivot-Hybrid. 88
Figure 16 The systematic workflow of adapting sentiment lexicons with SentiCircles. 91
Figure 17 Measuring correlation of semantic concepts with negative/positive sentiment. These semantic concepts are then incorporated in sentiment classification. 101
Figure 18 Systematic workflow for exploiting conceptual semantics in supervised sentiment classification on Twitter. 102
Figure 19 Top 10 frequent concepts extracted with the number of entities associated with them. 107
Figure 20 Sensitivity test of the interpolation coefficient for semantic interpolation. 112
Figure 21 Mapping semantic concepts to detect sentiment. 116
Figure 22 Average Win/Loss in Accuracy and F-measure of incorporating conceptual semantics into SentiCircles using the Median, Pivot and Pivot-Hybrid methods on all datasets. 117
Figure 23 The systematic workflow of capturing semantic sentiment patterns from Twitter data. 124
Figure 24 Positive to negative sentiment ratio for each of the 58 entities in the STS-Gold dataset. 129
Figure 25 Within-cluster sum of squares for different numbers of clusters (SS-Patterns) in the GASP dataset. 131
Figure 26 Within-Cluster sentiment consistencies in the STS-Gold Entity dataset. 136
Figure 27 Number of times that entities in Patterns 2, 5 and 6 receive negative, positive and neutral sentiment in the STS-Gold dataset. 137

Figure 28 Rank-Frequency distribution of the top 500 terms in the GASP dataset. We removed all other terms from the plot to ease visualisation. 147
Figure 29 Frequency-Rank distribution of terms in all the datasets in a log-log scale. 148
Figure 30 The baseline classification performance in Accuracy and F1 of MaxEnt and NB classifiers across all datasets. 150
Figure 31 Average Accuracy and F-measure of MaxEnt and NB classifiers using different stoplists. 151
Figure 32 Reduction rate on the feature space of the various stoplists. 152
Figure 33 The number of singleton words to the number of non-singleton words in each dataset. 153
Figure 34 Stoplist impact on the sparsity degree of all datasets. 154

1

INTRODUCTION

The emergence of microblogging services and social media platforms has given web users a venue for expressing and sharing their thoughts and opinions on all kinds of topics and events. Twitter,1 with nearly 650 million users and over 500 million messages per day,2 has quickly become a gold mine for organisations to monitor their reputation and brands by extracting and analysing the sentiment of the tweets posted by the public about them, their markets, and competitors. This new phenomenon has been a motivation for more research efforts, where natural language processing, information retrieval and text processing techniques are utilised to build models and approaches to analyse and track sentiment on Twitter and other similar microblogging services (e.g., Tumblr,3 Plurk,4 Twister5).

Most existing approaches to sentiment analysis on microblogs proved effective when sentiment is explicitly and unambiguously reflected in the text through affect (i.e., opinionated) words, such as “great” as in “I got my new iPhone 6, such a great device!” or “sad” as in “So sad, now four Sierra Leonean doctors lost to #Ebola”. However, merely relying on affect words is often insufficient, and in many cases does not lead to satisfactory sentiment detection results [33]. Common examples of such cases occur when the sentiment of words differs according to (i) the context in which those words occur (e.g., “great” is negative in the context “pain” and positive in the context “smile”), or (ii) the conceptual meaning associated with the words (e.g., “Ebola” is likely to be negative when its associated concept is “Virus” and likely to be neutral when its associated concept is “River”). Therefore, ignoring the semantics of words when calculating their sentiment, in either case, may lead to inaccuracies, which is a problem that most existing approaches to sentiment analysis on microblogs commonly face.

In this thesis, we investigate the role of words’ semantics in sentiment analysis of microblogs, aiming mainly at addressing the above problem. In particular, we research several methods for extracting and employing two types of semantics for sentiment analysis: contextual semantics, i.e., semantics inferred from the co-occurrence of words, and conceptual semantics, i.e., semantics extracted from background ontologies and knowledge bases.

1 https://twitter.com
2 http://www.statisticbrain.com/twitter-statistics
3 https://www.tumblr.com/
4 http://www.plurk.com/
5 http://twister.net.co/


We use Twitter as a representative case study of microblogging services in the experimental work conducted in this thesis. Specifically, we study the impact of both types of semantics in multiple sentiment analysis tasks on Twitter. This includes: entity-level sentiment analysis, i.e., detecting the sentiment of individual named-entities (e.g., “Katy Perry”, “Obama”, etc.), and tweet-level sentiment analysis, i.e., detecting the overall sentiment of a given tweet. Also, we investigate the use of semantics for context-aware sentiment lexicon adaptation, i.e., amending the prior sentiment orientation of words in general-purpose sentiment lexicons with regard to the words’ semantics in the context they occur in.

The research work and contributions introduced by this thesis are fivefold:

• We propose a new semantic representation of words, called SentiCircle, that automatically captures the contextual semantics of words in tweets and calculates their sentiment orientation accordingly. We investigate several methods for using the proposed representation in entity- and tweet-level sentiment analysis tasks as well as in adapting general-purpose sentiment lexicons.

• We extract and use the conceptual semantics of words as features for training sentiment classifiers and study the impact of this approach on the sentiment classification performance at tweet level.

• We investigate using the contextual and conceptual semantics of words together for sentiment analysis. In particular, we propose enriching the SentiCircle representation with the conceptual semantics of words and study what impact this has on sentiment analysis at tweet level.

• We propose an approach that finds patterns of words that share similar contextual semantics and sentiment and uses these patterns as features for training sentiment classifiers for tweet- and entity-level sentiment analysis.

• We investigate the impact of removing words which have low discrimination power (aka stopwords) from tweets on sentiment analysis performance.

To our knowledge, the work in this thesis is one of the first to introduce and investigate the use of the semantics of words in Twitter sentiment analysis. The reason behind focusing our work and evaluation on tweets data is due to (i) the global popularity of Twitter in comparison to other microblogging platforms, and (ii) the common use of Twitter in the literature on microblogging sentiment analysis.

In the following subsections, we explain the motivation behind our thesis, and detail our research questions, hypotheses and contributions.

1.1 motivation

The advent and the rapid growth of social media and microblogging services have dramatically changed the amount of user-generated content on the Web, with more and more people sharing their thoughts, expressing opinions, and seeking support on such social environments.

Monitoring and analysing sentiment on microblogs, especially Twitter, provides enormous opportunities for both public and private sectors. For private sectors, for example, it has been observed that the reputation of a certain capital stock, product or company is highly affected by the rumours and sentiment published and shared among users on microblogging platforms and social networks [20, 124]. Understanding this observation, companies realised that monitoring and detecting the public’s opinion from microblogs leads to building better relationships with their customers, obtaining a better understanding of their customers’ needs and providing better responses to the dramatic changes in their markets.

For public sectors, recent studies [108, 164, 74, 14, 75, 147] suggest a connection between the sentiment expressed on microblogs towards certain public events, and the development and outcome of these events. For example, aggregated Twitter sentiment has been shown to correlate with political polls [108, 164, 74], and can also be used to understand how people react to crises such as the Westgate Mall Terror Attack in Kenya [147].

The clear and palpable potential that sentiment analysis of microblogs has introduced to either sector, as shown above, has attracted increasing research interest in this area in recent years. A large number of approaches and tools have been developed for extracting sentiment from microblogging data, focusing mostly on Twitter data. Each of these approaches comes with certain strengths and weaknesses when applied to tweets, as will be explained in the subsequent sections.

1.1.1 Sentiment Analysis of Twitter: Gaps and Challenges

Sentiment analysis is usually defined as the task of identifying positive and negative opinions, attitudes and emotions towards some subject or the overall polarity of a document [114]. Sentiment analysis is a relatively old research problem [90], with much research work focused mostly on extracting sentiment from conventional text, i.e., formal text found on traditional online media platforms, such as personal blogs and news platforms. Existing approaches to extracting sentiment from conventional text can be categorised as supervised approaches, which use a wide range of features and labelled data for training sentiment classifiers, and lexicon-based approaches, which make use of pre-built lexicons of words weighted with their sentiment orientations (i.e., positive, negative, neutral) to determine the overall sentiment of a given document (see Chapter 2).

Much progress in terms of sentiment classification performance has been achieved by each approach when applied to conventional text of formal language and well known domains, where labelled data is available for training, or when the analysed text is well covered by the used sentiment lexicon [89]. Nevertheless, extracting sentiment from microblogging data, and specifically Twitter data, using these approaches introduces new challenges, due to several characteristics possessed by Twitter data. For example, supervised approaches are domain-dependent and require re-training with the arrival of new data [6]. Obtaining manually labelled tweets from each new domain is labour-intensive. This is because machine learning algorithms often require a large number of labelled instances for training in order to produce classifiers that are able to generalise from training data to unseen instances. Given the great variety of topics and domains that constantly emerge on Twitter, the aforementioned constraints affect the applicability of such approaches.

Lexicon-based approaches do not require training from labelled data. Instead, they rely on sentiment lexicons, such as SentiWordNet [8] and the MPQA subjectivity lexicon [176], for sentiment detection. However, traditional lexicon-based methods tend to be ill-suited for Twitter data. Firstly, because sentiment lexicons are composed of a generally static set of words that do not cover the wide variety of new terms, malformed words and colloquial expressions that constantly emerge on Twitter (e.g., “looov”, “luv”, “gr8”). Secondly, these methods usually make use of the lexical structure of a sentence to determine its sentiment, which becomes problematic on Twitter, where ungrammatical structures are very common due to the 140-character length limit [160].

From the above, one can conclude that traditional sentiment analysis approaches need adaptation before they can be efficiently applied to Twitter data and other similar microblogging data sources. This has led to the development of sentiment analysis approaches that take into account the characteristics of Twitter in order to tackle the aforementioned limitations. For example, to overcome the lack of manually annotated training data, supervised methods on Twitter often rely on the distant supervision approach [60], which makes use of automatically generated training data, where emoticons, such as “:-)” and “:(”, are typically used to label the tweets as positive or negative.
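To make the distant supervision idea concrete, the following Python sketch shows a minimal, illustrative labelling routine in the spirit of [60]. The emoticon lists, the rule of discarding tweets with mixed or missing emoticons, and the function names are assumptions made for illustration rather than the exact procedure of the cited work.

# Minimal sketch of emoticon-based distant supervision (assumed emoticon lists).
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}

def label_tweet(tweet):
    """Return 'positive', 'negative', or None when the label is unreliable."""
    tokens = tweet.split()
    has_pos = any(tok in POSITIVE_EMOTICONS for tok in tokens)
    has_neg = any(tok in NEGATIVE_EMOTICONS for tok in tokens)
    if has_pos == has_neg:          # both kinds or no emoticon at all
        return None
    return "positive" if has_pos else "negative"

def build_training_set(tweets):
    """Keep only unambiguously labelled tweets and strip the emoticons so a
    classifier cannot simply memorise them as features."""
    labelled = []
    for tweet in tweets:
        label = label_tweet(tweet)
        if label is None:
            continue
        text = " ".join(t for t in tweet.split()
                        if t not in POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS)
        labelled.append((text, label))
    return labelled

if __name__ == "__main__":
    sample = ["I got my new iPhone 6, such a great device! :)",
              "So sad, now four Sierra Leonean doctors lost to #Ebola :(",
              "Meeting at 10am"]
    print(build_training_set(sample))

The appeal of this route is that the labels come for free from the Twitter stream itself, at the price of some label noise, which is one reason such classifiers remain semantically weak.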
As for lexicon-based sentiment analysis, recent methods [106, 160] are designed to overcome the problems of informal language and noisy, ill-structured sentences in tweets by relying on lexical rules, such as the existence of emoticons, intensifiers, negation and booster words (e.g., “Very”, “Extremely”), as well as using pre-defined lists of abbreviations and short-forms that are often found in tweets (e.g., “gr8”, “luv”).

Overall, most existing sentiment analysis approaches on Twitter adapt fairly well to some of tweets’ special characteristics, and therefore produce relatively higher performance than those tuned to work on conventional text. Nevertheless, these approaches are semantically weak, that is, they generally do not account for the semantics of words when calculating their sentiment or the sentiment of the tweets they occur within [33, 55]. Semantics is generally concerned with what a word or an expression is supposed to mean in a given piece of text [42]. In sentiment analysis, understanding the meaning or the semantics of a word will likely strengthen our understanding of the type of sentiment or emotion that this word represents in the text it occurs in, as will be explained subsequently.

1.1.2 From Affect Words to Words’ Semantics

In the previous section we showed that existing approaches to Twitter sentiment analysis (machine learning and lexicon-based) usually address some of the challenges and limitations introduced by Twitter data. However, most of these approaches face a common problem, that is: they are fully dependent on the presence of affect words or syntactical features that explicitly reflect sentiment. Nonetheless, sentiment in practice is usually conveyed through the latent semantics or meaning of words in texts [139, 24].

For example, the underlying semantics of the word “great” in “I had a great pain in my lower back this morning :(” does not reflect a positive sentiment. Also, the words “Ebola” and “ISIS” (Islamic State in Iraq and Syria) in “Ebola is spreading in Africa and ISIS in Middle East!” implicitly denote a negative sentiment. However, both supervised and lexicon-based approaches on Twitter are likely to fail in spotting the correct sentiment for these words in such examples, and consequently the overall sentiment of both sentences. This happens mainly because:

• In the first sentence, neither approach accounts for the semantics of the word “great” in the context it occurs in (e.g., “pain”). As a result, the existence of “great” is considered positive even though it describes the negative word “pain”.

This type of semantics is usually known as contextual semantics, since it is typically inferred from the co-occurrence patterns of words in a given context [177, 167].

• In the second sentence, neither approach considers the explicit semantics of the words “Ebola” (i.e., “Virus/Disease”) and “ISIS” (i.e., Islamist Militant Group), which commonly denote a negative sentiment. As a result, the sentence is wrongly assigned a neutral sentiment.

This type of semantics is known as conceptual semantics as it refers to the semantic concepts of words obtained from background knowledge sources such as ontologies and semantic networks [136].

Thus, considering both contextual and conceptual semantics is rather important when detecting the sentiment that words express in tweets. This is especially crucial for words that often change their contextual semantics (e.g., “great pain”, “great smile”) or their conceptual ones (e.g., “Ebola” -> “Virus”, “Ebola” -> “River”) as explained above.

The above limitation of traditional or non-semantic approaches has recently brought an increasing interest in the use of semantics in sentiment analysis (aka semantic sentiment analysis). In particular, several approaches that use the contextual or the conceptual semantics of words have been proposed. These approaches have shown success in capturing the precise sentiment that words indicate in texts, therefore producing better performance than traditional or non-semantic approaches when applied to conventional text [166, 97, 87, 24, 55]. Nevertheless, semantic approaches (contextual and conceptual) are generally not tailored to Twitter and the like. First, they are limited by their underlying lexical or semantic resources (i.e., sentiment lexicons, semantic ontologies), which is especially problematic when processing general Twitter streams, with their rapid semiotic evolution and language deformations [139]. Secondly, these approaches are designed to mainly function on conventional text of formal language and well-structured sentences that Twitter data generally lack, as described earlier.

The work presented in this thesis addresses the problem of building semantic sentiment analysis models on Twitter. Our models are tailored to Twitter data and employ both contextual and conceptual semantics in their workflow, aiming to capture the sentiment of words with regard to their semantics, and consequently improve the overall performance of sentiment analysis.

In the rest of this chapter we state the research questions that we address in this thesis, describe our research hypotheses, present our contributions to the state of the art, and provide the outline of the thesis.

1.2 research questions, hypotheses and contributions

The main research question investigated in this thesis is:

Could the semantics of words boost sentiment analysis performance on Twitter?

The main focus, as noted, is on improving the performance of Twitter sentiment classifiers by developing solutions that consider the semantics of words in their sentiment detection workflow. Given the dimensions of the sentiment analysis problem discussed earlier (i.e., the type of the sentiment analysis approach and the type of semantics used), we aspire to achieve improvements in sentiment detection performance by addressing the following research sub-questions:

RQ1 Could the contextual semantics of words enhance lexicon-based sentiment analysis performance?

Typical lexicon-based methods on Twitter assign sentiment to words regardless of their context, i.e., regardless of their contextual semantics (e.g., “great” is assigned a positive sentiment in both the contexts “pain” and “smile”). One assumption to address this limitation is that lexicon-based methods that consider the contextual semantics of words when calculating their sentiment perform better than those that do not.

Our hypothesis for extracting and using contextual semantics in sentiment analysis is:

H1 Context-driven semantics of words can enhance lexicon-based sentiment analysis performance.

Thus, to test the above hypothesis we propose a dynamic representation of words that captures their semantics from their context and updates their sentiment orientations accordingly. The proposed representation first derives the contextual semantics of a word from its co-occurrences with other words, as inspired by the distributional hypothesis: “Words that occur in similar contexts tend to have similar meanings” [65, 52, 177]. Next, the contextual semantics of the word is used to calculate its overall sentiment.
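To illustrate the intuition only (not the actual SentiCircle model, which is detailed in Chapter 3), the sketch below derives a crude contextual sentiment score for a word from its co-occurrences: the prior scores of its context words, drawn from a toy lexicon whose values are assumptions, are averaged with co-occurrence frequencies as weights.

from collections import Counter, defaultdict

# Toy prior lexicon with assumed values in [-1, 1]; a real run would load a
# resource such as MPQA or SentiWordNet instead.
PRIOR = {"pain": -0.8, "smile": 0.7, "love": 0.9, "terrible": -0.9}

def context_vectors(tweets):
    """For every term, count how often each other term co-occurs with it in a tweet."""
    vectors = defaultdict(Counter)
    for tweet in tweets:
        tokens = set(tweet.lower().split())
        for term in tokens:
            for ctx in tokens - {term}:
                vectors[term][ctx] += 1
    return vectors

def contextual_sentiment(term, vectors):
    """Average the prior sentiment of the term's context words, weighted by
    co-occurrence frequency; context words absent from the lexicon are ignored."""
    ctx = vectors.get(term, Counter())
    weighted = sum(PRIOR[w] * f for w, f in ctx.items() if w in PRIOR)
    total = sum(f for w, f in ctx.items() if w in PRIOR)
    return weighted / total if total else 0.0

if __name__ == "__main__":
    tweets = ["I had a great pain in my lower back this morning",
              "what a great smile she has",
              "great smile , I love it"]
    vecs = context_vectors(tweets)
    print(round(contextual_sentiment("great", vecs), 2))  # pulled up or down by its contexts

The point of the example is that the same word receives different scores under different corpora: had “great” co-occurred mostly with “pain”, the returned value would be negative, which is precisely the behaviour a static lexicon cannot provide.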

We assess the effectiveness of the proposed representation for lexicon-based sentiment analysis on Twitter in three tasks: entity-level sentiment analysis, tweet-level sentiment analysis and context-aware sentiment lexicon adaptation. We design several methods that use the proposed representation under each of these tasks.

Hence, the contribution of our work under this research question can be summarised as follows:

– Propose a semantic representation model, called SentiCircles, that dynamically captures the contextual semantics and sentiment of words from their context.

– Introduce several lexicon-based and rule-based methods, based on SentiCircles, for entity- and tweet-level sentiment analysis, and sentiment lexicon adaptation.

– Build and release a new gold-standard dataset for evaluating our proposed methods in each sentiment analysis task.

RQ2 Could the conceptual semantics of words enhance sentiment analysis performance?

As mentioned earlier, sentiment is often not directly detectable via affect words, but instead associated with the words’ conceptual semantics in tweets.

Our proposition for addressing the above question is to detect the sentiment of terms based on their conceptual semantic similarities to other terms whose sentiment is known.

For example, the entities “iPad”, “iPod” and “Mac Book Pro” frequently appear with positive sentiment in a given Twitter corpus and they are all mappable to the semantic concept “Apple Product”. As a result, an entity such as “iPhone”, whose sentiment might not be known, is more likely to have a positive sentiment since it is also mappable to the concept “Apple Product”.

Thus, our hypothesis for this question is the following:

H2 Semantic concepts tend to correlate with positive or negative sentiment, and hence can be used to determine the sentiment of terms under those concepts.

Therefore, to address this research question, we propose to extract and incorporate the semantic concepts (e.g., “Person”, “Company”, “City”) of named entities (e.g., “Steve Jobs”, “Vodafone”, “London”) found in tweets into both supervised machine learning and lexicon-based methods to enhance their performance.

For the supervised machine learning approach, we propose adding semantic concepts as features into the training set of sentiment classifiers in order to measure the correlation of the representative concepts with negative/positive sentiment. For the lexicon-based approach, we consider integrating the concepts into our contextual semantic representation, SentiCircles, and study the impact on the overall sentiment detection performance.
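As an illustration of the feature-augmentation idea, the sketch below appends the semantic concept of each recognised entity to a tweet's unigram features before training an off-the-shelf classifier. The tiny entity-to-concept map is a hypothetical stand-in for output that would normally come from an entity-extraction service such as AlchemyAPI, and scikit-learn is used purely for convenience; Chapter 4 describes the incorporation methods actually evaluated.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumed entity -> concept map; in practice these mappings come from an
# entity-extraction tool rather than being hard-coded.
ENTITY_CONCEPTS = {"ipad": "apple_product", "ipod": "apple_product",
                   "iphone": "apple_product", "ebola": "virus"}

def augment(tweet):
    """Append the concept of every recognised entity as an extra pseudo-token."""
    tokens = tweet.lower().split()
    concepts = [ENTITY_CONCEPTS[t] for t in tokens if t in ENTITY_CONCEPTS]
    return " ".join(tokens + ["CONCEPT_" + c for c in concepts])

if __name__ == "__main__":
    train = ["my ipad is wonderful", "love my ipod", "ebola outbreak is scary"]
    labels = ["positive", "positive", "negative"]
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit([augment(t) for t in train], labels)
    # "iphone" never appears in training, but it shares the apple_product concept feature.
    print(model.predict([augment("just got an iphone")]))

The unseen entity benefits from the sentiment statistics accumulated by its concept, which is the correlation that hypothesis H2 relies on.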

Our contributions under this line of work are:

– Introduce and investigate the use of conceptual semantics in supervised and lexicon-based sentiment analysis on Twitter.

– Implement three methods for using the conceptual semantics of words as features to train supervised sentiment classifiers.

– Propose a new method, based on SentiCircles, for incorporating the conceptual semantics of words into lexicon-based sentiment analysis on Twitter.

RQ3 Could semantic sentiment patterns boost sentiment analysis performance?

People tend to convey sentiment in certain contextual semantic structures or patterns rather than in individual words. For example, it is likely that similar word distributions and word co-occurrences (patterns) are used to express sentiment towards “Beatles” and “Katy Perry”, which are likely to differ from the patterns for “iPod” and “McDonald’s”. Knowing and extracting these patterns from text might help in boosting the performance of sentiment classification methods on Twitter.

Our hypothesis here is:

H3 Patterns capturing words with similar contextual semantics and sentiment can be used to enhance sentiment analysis performance.

To address the above question, we propose building a new approach for extracting patterns of words of similar contextual semantics and sentiment from tweets data and using these patterns as classification features for supervised sentiment classifier training. Compared to other works on pattern-based sentiment analysis (see Chapter 5), our approach does not rely on the syntactic structure of tweets, nor does it require pre-defined sets of syntactic or semantic templates for pattern extraction. Evaluation of the proposed patterns covers testing their effectiveness in tweet- and entity-level sentiment analysis.
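The listing below is a minimal, hypothetical illustration of the underlying idea: words are described by small vectors (here a contextual sentiment score and a relative frequency), clustered into groups, and the resulting cluster identifiers are then usable as classification features. The two-dimensional word vectors, their values, and the use of k-means are simplifying assumptions; Chapter 5 describes how the actual SS-Patterns are derived from SentiCircles.

import numpy as np
from sklearn.cluster import KMeans

# Toy word vectors: (contextual sentiment score, relative corpus frequency).
# Illustrative assumptions, not output of the SentiCircle model.
WORD_VECTORS = {
    "beatles":   (0.8, 0.30), "katy_perry": (0.7, 0.25),
    "ipod":      (0.6, 0.40), "mcdonalds":  (-0.4, 0.35),
    "ebola":     (-0.9, 0.20), "isis":      (-0.8, 0.15),
}

def extract_patterns(word_vectors, n_patterns=2, seed=0):
    """Cluster words with similar contextual semantics and sentiment into
    patterns and return a word -> pattern-id mapping."""
    words = list(word_vectors)
    X = np.array([word_vectors[w] for w in words])
    labels = KMeans(n_clusters=n_patterns, n_init=10, random_state=seed).fit_predict(X)
    return dict(zip(words, labels))

def pattern_features(tweet, patterns):
    """Represent a tweet by the pattern ids of the words it contains; these can
    be appended to its unigram features before classifier training."""
    return sorted({f"PATTERN_{patterns[t]}" for t in tweet.lower().split() if t in patterns})

if __name__ == "__main__":
    patterns = extract_patterns(WORD_VECTORS)
    print(patterns)
    print(pattern_features("ebola and isis dominate the news", patterns))

Because the pattern id abstracts over individual words, two tweets mentioning different entities from the same pattern still share a feature, which is what gives the classifier extra generalisation power.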

Contributions of this line of work are:

– Propose a novel approach that automatically extracts patterns from the contextual semantic and sentiment similarities of words in tweets.

– Use patterns as features in tweet- and entity-level sentiment classification tasks.

– Conduct quantitative and qualitative analysis on a sample of our extracted semantic sentiment patterns and show the potential of our approach for finding patterns of entities of controversial sentiment in tweets.

RQ4 What effect does stopword removal have on sentiment classification performance?

Words are usually characterised by their discrimination power, which often determines their contribution towards making correct classification decisions. In sentiment analysis, words of low discrimination power in a certain context are considered stopwords and are removed from the analysis accordingly.

Several stopword extraction and removal methods have been used for sentiment analysis of conventional text [163, 7, 148]. However, the impact of these methods on Twitter sentiment analysis is not well explored.

Our hypothesis for this question is:

H4 Different stopword removal methods could have different impact on sentiment analysis performance on Twitter.

Thus, to address the above question, and as an additional contribution of this thesis, we analyse the effectiveness of several stopword removal methods for sentiment

classification of tweets and whether removing stopwords influences the performance of Twitter sentiment classifiers. To this end, we apply six different stopword removal methods to Twitter data from six different datasets and observe the impact of removing stopwords on supervised sentiment classification methods. We assess such impact by observing fluctuations in: (i) the level of data sparsity, (ii) the size of the classifier’s feature space, and (iii) the classifier’s performance in terms of accuracy and F-measure.
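To make these measurements concrete, the snippet below shows one straightforward way to compute the data sparsity degree of a term–document matrix (the proportion of zero entries) and the feature-space size before and after applying a stoplist. The singleton-based stoplist used here is only an example of the kind of method evaluated, not a reproduction of the six methods studied in Chapter 6.

from collections import Counter

def build_matrix(tweets, stoplist=frozenset()):
    """Return the vocabulary and per-tweet term counts, excluding stoplist terms."""
    docs = [Counter(t for t in tweet.lower().split() if t not in stoplist)
            for tweet in tweets]
    vocab = sorted({t for doc in docs for t in doc})
    return vocab, docs

def sparsity_degree(vocab, docs):
    """Fraction of zero cells in the (tweets x vocabulary) count matrix."""
    total_cells = len(vocab) * len(docs)
    nonzero = sum(len(doc) for doc in docs)
    return 1 - nonzero / total_cells if total_cells else 0.0

def singleton_stoplist(tweets):
    """Example stoplist: terms that occur only once in the whole corpus."""
    counts = Counter(t for tweet in tweets for t in tweet.lower().split())
    return frozenset(t for t, c in counts.items() if c == 1)

if __name__ == "__main__":
    tweets = ["obama speech was great", "great speech tonight", "ebola news again"]
    for stop in (frozenset(), singleton_stoplist(tweets)):
        vocab, docs = build_matrix(tweets, stop)
        print(len(vocab), round(sparsity_degree(vocab, docs), 3))

Running the toy example shows both the vocabulary (feature space) and the sparsity degree dropping once singleton terms are removed, which is the kind of fluctuation the analysis tracks across real datasets.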

Our contribution under this line of work is summarised as follows:

– Conduct an analysis study on the impact of several stopword removal methods in Twitter sentiment analysis using different Twitter datasets and sentiment classifiers.

– Measure the fluctuations caused by removing stopwords in the level of data sparsity, the size of the classifier’s feature space and its classification performance.

– Show that, despite the popular use of pre-compiled stoplists in Twitter sentiment analysis, they can have a negative impact on the sentiment classification performance.

– Show that, in practical applications for Twitter sentiment analysis, the ideal stopword generation method is the one that removes those infrequent terms appearing only once in the Twitter corpus.

1.3 thesis methodology and outline

The main purpose behind our work in this thesis is to improve the performance of Twitter sentiment analysis by using both the contextual and conceptual semantics of words. To this end, we design a general methodology that we use in the different parts of this thesis. Our methodology consists of three main steps: extraction, incorporation and assessment, as depicted in Figure 1 and detailed below:

1. Extraction: design methods for capturing the contextual and conceptual semantics of words from a given collection of tweets.

2. Incorporation: investigate several methods and models for incorporating and using both types of semantics in sentiment analysis.

3. Assessment: measure the performance of our semantic methods in multiple sentiment analysis tasks on Twitter: entity-level sentiment analysis, tweet-level sentiment analysis, and sentiment lexicon adaptation.

[Figure 1 diagram: (1) Extraction (contextual semantics extraction, conceptual semantics extraction) → (2) Incorporation (semantic sentiment analysis methods) → (3) Assessment (entity-level sentiment analysis, tweet-level sentiment analysis, sentiment lexicon adaptation).]

Figure 1: General methodology for extracting, incorporating and assessing the effectiveness of words’ semantics in Twitter sentiment analysis.

Note that we experiment with our proposed methods in each of the three steps using different settings, i.e., several Twitter datasets and different sentiment lexicons. As for evaluation and assessment, we compare against several state-of-the-art Twitter sentiment analysis methods. Figure 2 shows the general pipeline of our work with regard to the core contribution chapters of this thesis. As can be noted, the three-step methodology described above is used in each of these chapters. The material of this thesis is distributed in individual chapters as follows:

Part I: Background and Literature Review

In Chapter 2 we provide fundamental background knowledge of the sentiment analysis task, its subtasks, levels and popular approaches. After that, we present a comprehensive overview of the existing work in the area of sentiment analysis on Twitter. Throughout our review, we reveal the strengths and weaknesses of the reviewed work and map them to our research questions and hypotheses.

Part II: Semantic Sentiment Analysis of Twitter

In Chapter 3 we describe our work on using the contextual semantics of words for improving the performance of lexicon-based sentiment analysis methods on Twitter. We also explore the use of contextual semantics for context-aware adaptation of general-purpose sentiment lexicons. The work in this chapter addresses our first research question.

[Figure 2 diagram: per-chapter pipelines — Chapter 3: Tweets → Extracting Contextual Semantics → SentiCircles → Lexicon-based Methods / Sentiment Lexicon Adaptation; Chapter 4: Extracting Conceptual Semantics → Semantic Concepts → Conceptual Semantics Incorporation → Supervised Classifier → Tweet-Level Sentiment Analysis; Chapter 5: Extracting Semantic Sentiment Patterns → SS-Patterns → SS-Patterns Incorporation → Supervised Classifier → Tweet- and Entity-Level Sentiment Analysis; Chapter 6: Extracting Stopwords → Stopword Removal & Classifier Training → Supervised Classifier.]

Figure 2: Pipeline of our work under each chapter, and thesis contribution outline. Arrows crossing two different chapters mean that a component developed in one chapter is used in the other one.

In Chapter 4 we present our work on using the conceptual semantics of words to improve the performance of both lexicon-based and supervised sentiment analysis approaches on Twitter, thereby addressing the second research question of this thesis. In Chapter 5 we present our approach for extracting semantic sentiment patterns of words and using these patterns for boosting supervised sentiment classification performance on Twitter. In this chapter we address our third research question.

Part III: Analysis Studies

In Chapter 6 we present our analysis on the impact of removing stopwords on the performance of Twitter sentiment classifiers. Our fourth research question is addressed in this chapter.

Part IV: Discussion and Conclusion

In Chapter 7 we discuss the work presented in this thesis, highlight our contributions and point out future work. In Chapter 8 we present our main conclusions.

1.4 publications

Chapters of this thesis are based on the following publications:

• Chapter 3

– Hassan Saif, Yulan He, Miriam Fernandez and Harith Alani (2015). Contextual Semantics for Sentiment Analysis of Twitter. Information Processing and Management Journal (IPM).

– Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani (2014). SentiCircles for Contextual and Conceptual Semantic Sentiment Analysis of Twitter. In Proceedings of the 11th Extended Semantic Web Conference (ESWC), Crete, Greece.

– Hassan Saif, Yulan He, Miriam Fernandez, and Harith Alani (2014). Adapting Sentiment Lexicons using Contextual Semantics for Sentiment Analysis of Twitter. In Proceedings of the 1st Semantic Sentiment Analysis Workshop (SSA2014), Crete, Greece. (Best Paper Award)

– Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani (2013). Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold. In Proceedings of the 1st Workshop on Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM) at the AI*IA Conference, Turin, Italy. (Best Paper Award Nominee)

• Chapter 4

– Hassan Saif, Yulan He, and Harith Alani (2012). Semantic Sentiment Analysis of Twitter. In Proceedings of the 11th International Semantic Web Conference (ISWC), Boston, US.

– Hassan Saif, Yulan He, and Harith Alani (2012). Alleviating Data Sparsity for Twitter Sentiment Analysis. In Proceedings of the 2nd Workshop on Making Sense of Microposts (#MSM2012): Big things come in small packages, in conjunction with WWW, Lyon, France. (Best Paper Award)

– Hassan Saif, Yulan He, and Harith Alani (2011). Semantic Smoothing for Twitter Sen- timent Analysis. Poster at the 10th International Semantic Web Conference (ISWC), Bonn, Germany. (Best Poster Award Nominee)

• Chapter 5

– Hassan Saif, Yulan He, Miriam Fernandez, and Harith Alani (2014). Semantic Patterns for Sentiment Analysis of Twitter. In Proceedings of the 13th International Semantic Web Conference (ISWC), Trentino, Italy.

• Chapter 6

– Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani (2014). On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter. In Proceedings of The 9th International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland.

– Hassan Saif, Yulan He, Miriam Fernandez, and Harith Alani (2014). Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter. Poster at the 13th International Semantic Web Conference (ISWC), Trentino, Italy.

Part I

BACKGROUND

The good opinion of mankind, like the lever of Archimedes, with the given fulcrum, moves the world.

Thomas Jefferson

2

LITERATURE REVIEW

Sentiment analysis has established itself in the past few years as a solid research topic, providing organisations and businesses with solutions to monitoring and analysing the public’s opinion towards their products, brands and services. To this end, a vast amount of research has been done, mostly focusing on sentiment detection of conventional text such as review data, online blogs and discussion forums. However, the explosion of social networks and microblogging services has dramatically shifted the research interests towards the analysis of sentiment of microblogging data, and particularly tweets data. Twitter1 has become one of the most popular microblogging services, producing content reflecting opinions about topics, products and life-events [76, 110, 47].

Broadly speaking, existing approaches to sentiment analysis can be categorised into two main categories: traditional approaches and semantic approaches. Traditional approaches rely on the presence of words or syntactical features that explicitly reflect sentiment. Semantic approaches, on the other hand, exploit the latent semantics of words in text to capture their context and update their sentiment orientations accordingly.

In this chapter, we provide an overview of the sentiment analysis literature, targeting mainly those works on Twitter, since analysing sentiment from tweets is the main focus of this thesis. In particular, the main elements and dimensions of the sentiment analysis problem are introduced in Section 2.1. A literature review of the sentiment analysis research on Twitter is provided in Section 2.2. Key existing works on semantic sentiment analysis are reviewed in Section 2.3. Discussion on the main strengths, limitations and gaps in the state-of-the-art of Twitter sentiment analysis is provided in Section 2.4.

2.1 background

In this section we provide the reader with the fundamental background knowledge about sentiment analysis, which constitutes the basis of the research presented in this thesis.

1 https://twitter.com


2.1.1 Fundamentals

Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text. It is a multidisciplinary task, which exploits techniques from computational linguistics, machine learning, and natural language processing, to perform various detection tasks at different text-granularity levels.

Sentiment analysis is a very rich research problem, with a large amount of work being published every year targeting the various aspects and dimensions of the problem. Broadly speaking, the problem of sentiment analysis can be divided into four main dimensions, as depicted in Figure 3 and summarised below.

1. The sentiment analysis task which varies from polarity detection to more fine-grained tasks, such as emotion detection and sentiment strength detection, as will be explained in the next subsection.

2. The sentiment analysis level which is determined based on the granularity of text used for the analysis (e.g., document-level, sentence-level, phrase-level, etc).

3. The sentiment analysis approach which covers the type of sentiment analysis approach used (e.g., supervised approaches, lexicon-based approaches, hybrid approaches).

4. The data type: sentiment can be extracted from conventional text (e.g., news articles, product reviews) as well as microblogging data (tweet messages, status updates, SMS, etc).

[Figure 3 diagram: Sentiment Analysis — Tasks (subjectivity analysis, polarity analysis, sentiment strength, emotion detection); Levels (word-level, phrase-level, sentence-level, document-level); Approaches (machine learning, lexicon-based, hybrid); Data Type (conventional data, microblogging data).]

Figure 3: Four dimensions of the sentiment analysis research problem.

The rest of this section describes each of these dimensions in depth, providing references to external resources when necessary. For clarity, we are going to use the following review about the iPhone 5 as an example2 to illustrate the different dimensions of sentiment analysis (each sentence in the example is coupled with a number for easy reference):

(1) I have recently upgraded to iPhone 5. (2) I am not happy with the screen size. (3) It is just too small :(. (4) My best friend got Galaxy Note 3. (5) He got a much larger screen than mine!!! (6) Even if his hardware outperforms mine, I easily outrun him with all the apps I can get :)

1. Sentiment Analysis Subtasks

The basic subtask in sentiment analysis is polarity detection, that is, deciding whether the sentiment of a given text is positive or negative. In our example, sentences (2) and (3) have negative sentiment while sentence (6) has a positive sentiment.3 Another popular subtask is subjectivity detection, which aims to identify whether the text is subjective (i.e., has a positive or negative sentiment) or objective (i.e., has a neutral sentiment). For example, sentences (1) and (4) are objective sentences, i.e., they are factual or neutral. On the other hand, sentences (2), (3), (5) and (6) are all subjective.

Subjectivity and polarity detection are, to a large extent, the most popular subtasks in sentiment analysis [89], but not the only ones. Other relevant subtasks include: (i) emotion detection, which aims to identify the human emotions and feelings expressed in text, such as “happiness”, “sadness”, “anger”, etc. In our example, sentence (3) shows “sadness” emotions while sentence (6) shows “happiness” emotions. (ii) Sentiment strength detection, which aims at measuring the strength or the intensity of the sentiment in text. For example, this task tries to detect how strong the positive sentiment is in sentence (6) and how strong the negative sentiment is in sentence (3).

2. Sentiment Analysis Levels

Broadly speaking, the aforementioned tasks have been extensively researched in the literature, aiming at analysing the sentiment at four different text granularity levels. Each one of these levels differs from the others in the level of granularity of the analysed text, as follows:

2 The review is obtained from http://reviews.cnet.com/. The reviewer’s name is anonymized for privacy purposes.
3 Polarity detection in some work also involves the detection of a third sentiment class, neutral sentiment.

• Word-level sentiment analysis: given a word w in a sentence s, decide whether this word is opinionated (i.e., expresses sentiment) or not. If so, detect what sentiment the word has. Note that when the word is a named entity (i.e., it represents an instance of a predefined category, such as Person, Organisation, Place, Product, etc.), the task is called entity-level sentiment detection. In our example, the entity "iPhone" receives a negative sentiment on average while the entity "Galaxy Note 3" receives a positive one.

• Phrase-level sentiment analysis: also known as expression-level sentiment analysis. Given a multi-word expression e in a sentence s, the task is to detect the sentiment orientation of e. For example, the expression “I am not happy” in sentence (2) has a negative sentiment.

• Sentence-level sentiment analysis: given a sentence s of multiple words and phrases, decide on the sentiment orientation of s. For example, sentence (2) has a negative sentiment while sentence (6) has a positive sentiment.

• Document-level sentiment analysis: given a document d, decide on the overall sentiment of d. This is usually done by averaging the sentiment orientations of all sentences in d.

It is worth noting that when a word or a phrase describes a specific aspect or feature of an object (e.g., a product) in text, the sentiment analysis task is called aspect-level sentiment analysis. Aspect-level sentiment analysis is defined as: given a document d, an object o and a set of aspects A, detect the sentiment expressed in d towards each aspect ai ∈ A of the object o. For example, in sentence (2) the aspect "screen size" of "iPhone" has a negative sentiment while the same aspect of "Galaxy Note 3" has a positive sentiment.

3. Sentiment Analysis Approaches

Early research efforts in sentiment analysis go back to the late seventies with the work of Jaime Carbonell on belief models and systems [35]. However, the major take-off of the research came in the early 2000s with the explosion of Web 2.0, which helped increase awareness of the importance of the field and its potential for business, economics, politics and online marketing. Extensive research efforts in sentiment analysis have been conducted during the last two decades, resulting in a large number of models and methods being proposed.

These methods have generally followed two main approaches: the machine learning approach and the lexicon-based approach:

• The machine learning approach: this approach considers the problem as a typical text classification problem. A machine learning classifier, such as Naive Bayes (NB) or Maximum Entropy (MaxEnt) [18], is trained from a corpus of pre-annotated data (i.e., a collection of textual elements, such as documents or sentences, annotated with their sentiment labels). Once trained, the classifier can be applied to infer the sentiment of unseen documents (i.e., documents with unknown sentiment labels). Pioneering work in this vein was published in 2002 by Pang and Lee [115], performing polarity classification (positive vs. negative) on a movie review dataset.4 To this end, the authors used three supervised classifiers: SVM [66], Naive Bayes and Maximum Entropy (MaxEnt). Many variations of this approach have since been proposed [113, 44, 173, 125], following the same training/inference principle.

• The lexicon-based approach: this approach, on the other hand, assumes that the sentiment orientation of a given document is the average of the sentiment orientations of its words and phrases [165]. It relies on sentiment lexicons (i.e., pre-built dictionaries of words with associated sentiment labels) to extract the opinionated words in a document and uses them to calculate its overall sentiment. To this end, several lexicons have been built, such as SentiWordNet [8], the MPQA subjectivity lexicon [176] and the LIWC lexicon [118]. A minimal sketch of this averaging idea is shown below.
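To make the averaging idea concrete, the following minimal Python sketch scores a piece of text as the mean prior polarity of the lexicon words it contains; the toy lexicon, the lexicon_sentiment function and its scores are illustrative stand-ins for a real resource such as MPQA, not part of any of the cited systems.

# A minimal sketch of the lexicon-based idea: the text score is the average
# prior polarity of the opinionated words it contains.
# TOY_LEXICON is illustrative, not a real resource such as MPQA.
TOY_LEXICON = {"great": 1.0, "happy": 1.0, "terrible": -1.0, "small": -0.5}

def lexicon_sentiment(text: str) -> str:
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scores:                      # no opinionated words found
        return "neutral"
    avg = sum(scores) / len(scores)     # average word-level orientation
    return "positive" if avg > 0 else "negative" if avg < 0 else "neutral"

print(lexicon_sentiment("I am not happy with the screen size"))  # naive: ignores negation

Note that this naive matcher misclassifies the negated example above, a limitation of simple keyword matching that is discussed later in Section 2.2.1.2.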

4. Sentiment Analysis Data Types

Over the last two decades of sentiment analysis, the proposed methods have been applied to several types of textual data coming from different web sources, such as review websites, online blogs, news portals, discussion forums, social networks (Facebook,5 MySpace,6 etc.) and microblogging services (e.g., Twitter, Tumblr7). Most of the data published by these sources can be divided into two main categories with respect to their characteristics: (i) conventional data and (ii) microblogging data:

• The Conventional Type of Data:

4 http://www.cs.cornell.edu/people/pabo/movie-review-data/ 5 https://www.facebook.com/ 6 https://myspace.com/ 7 https://www.tumblr.com/

Early work on sentiment analysis focused on detecting the sentiment of text found in review data (e.g., product reviews), online blogs, discussion forums and news articles. Several resources were built to this end, including the Cornell movie review dataset [113], the customer review dataset [71], the MPQA corpus [174] and the NTCIR multilingual corpus [143]. Two main characteristics are commonly attributed to these resources. First, they are all composed of formal, well-written and well-structured documents. Secondly, they are all manually annotated for sentiment by human coders. We refer to this type of data as conventional data or conventional text. We also refer to sentiment analysis on this type of data as sentiment analysis over conventional text or conventional sentiment analysis.

• The Microblogging Type of Data:

The evolution of microblogging services and social networks in the last decade,8 such as Twitter and Facebook, has created a new phenomenon. The amount of user-generated content on the web has increased exponentially, with literally millions of tweets and status updates being published every hour. This extraordinary phenomenon has had a big impact on the sentiment analysis research community, where the interest has drifted away from analysing conventional data towards the analysis of microblogging data. This new research direction is known as Sentiment Analysis of Microblogs.

Unlike conventional text, microblogging text is usually very short (e.g., tweet messages are limited to 140 characters), lacks complete and correct sentence structures, and contains many abbreviations, ill-formed words and irregular phrases.

Despite the wide spread of microblogging services on the web, most of the existing work on microblogging sentiment analysis focuses on the sentiment analysis of Twitter data. This research area is known as Sentiment Analysis of Twitter [60, 11, 16, 82, 110, 1, 123, 45, 22, 5, 61, 116, 10, 4]. In addition to the works focused on analysing Twitter data, a number of studies exist on sentiment analysis over other microblogging services and social media platforms such as MySpace [159, 111], Facebook [83, 84] and Sina Weibo [69].9

8 The real take-off of microblogging services was marked by the official launch of the Facebook Status Updates service in March 2005. 9 http://weibo.com/

2.1.2 A Note on Terminology

This chapter introduces a large number of research terms, phrases and acronyms that are often used in sentiment analysis. Some of these terms might be known to the reader while others might not. This document tries to cover each of these terms with an inline, footnote or side-note description, according to its importance and complexity. The reader might notice that some of these terms are used interchangeably in different contexts, yet they refer to the same thing. Below are the most popular cases:

• Sentiment analysis, sentiment classification and sentiment detection refer in this document to the same task. However, sentiment analysis usually refers to the main research area while sentiment classification and sentiment detection are often used in supervised machine learning and lexicon-based approaches respectively.

• Sentiment class, sentiment label, sentiment orientation and sentiment score refer to the sentiment value of a given text. While sentiment class, label and orientation are literal values (e.g., positive, negative, neutral), sentiment score is usually a numerical value (e.g., -1 denotes negative sentiment and +1 denotes positive sentiment).

In the rest of this chapter, we provide a thorough review of existing work in the Twitter sentiment analysis literature, followed by a deep discussion on the advantages and limitations of these works. The identified gaps and limitations motivate the research work presented in this thesis.

2.2 sentiment analysis of twitter

Twitter is an online microblogging service created in March 2006. It enables users to send and read text-based posts, known as tweets, with a 140-character limit for compatibility with SMS messaging. As of July 2014, Twitter had about 270 million active users generating 500 million tweets per day.10 Tweet messages usually convey sentiment [110]. However, unlike conventional data, tweets have certain characteristics that make analysing sentiment over Twitter data harder than analysing conventional data. Tweet messages are very short, encouraging the use of abbreviations, irregular expressions, poor grammar and misspellings. Such characteristics make it hard to identify the opinionated content in tweets using sentiment analysis approaches designed for conventional text [137]. To target the above problem, research work has been conducted, especially in the past four years, focusing on the particular problem of sentiment analysis over Twitter data. Similar to conventional sentiment analysis, existing approaches to Twitter sentiment analysis can be divided into machine learning and lexicon-based approaches. In the second part of this chapter we describe existing works on Twitter sentiment analysis under the machine learning approach (Section 2.2.1.1) and the lexicon-based approach (Section 2.2.1.2) separately, exploring their strengths and weaknesses. Moreover, since the main goal of this thesis is semantic sentiment analysis, we further divide the literature into traditional approaches, i.e., those that do not consider semantics (Section 2.2.1), and semantic approaches, i.e., those using semantics (Section 2.3). Our literature review of semantic sentiment analysis approaches in Section 2.3 covers not only Twitter-specific works, but also approaches focused on conventional text. This is because (i) the underlying concept of using semantics is much the same in both settings, and (ii) there is a lack of research work on semantic sentiment analysis on Twitter.

10 https://investor.twitterinc.com/releasedetail.cfm?ReleaseID=862505

2.2.1 Traditional Sentiment Analysis Approaches

2.2.1.1 The Machine Learning Approach

The machine learning approach to sentiment analysis is characterised by the use of a machine learning algorithm and a training corpus. The approach usually functions in two phases, a training phase and an inference phase, as illustrated in Figure 4.11 In the training phase a machine learning algorithm is used to train the classification model using a set of features extracted from the tweets of the training corpus. In the inference phase the trained model is used to infer the sentiment label of unseen tweets (i.e., tweets of unknown sentiment classes).

Figure 4: Pipeline of the machine learning approach to sentiment analysis on Twitter. In the training phase, a feature extractor turns labelled tweets into features from which a machine learning algorithm builds a classification model; in the inference phase, the model assigns a sentiment label to the features extracted from unlabelled tweets.
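As an illustration of the two-phase pipeline in Figure 4, the following minimal sketch chains a simple unigram feature extractor with a Naive Bayes classifier using scikit-learn; the example tweets, the labels and the choice of scikit-learn are assumptions made purely for illustration.

# A minimal sketch of the training/inference pipeline of Figure 4,
# assuming scikit-learn is available; data and labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_tweets = ["I love this phone :)", "worst service ever", "such a great day"]
train_labels = ["positive", "negative", "positive"]

model = Pipeline([("features", CountVectorizer()),   # feature extraction
                  ("classifier", MultinomialNB())])  # learning algorithm

model.fit(train_tweets, train_labels)                # training phase
print(model.predict(["what a great phone"]))         # inference phase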

Machine learning models are often divided into three main categories based on the supervision required during the training (learning) phase:

1. Supervised classifiers require training from labelled samples (e.g., tweets labelled with their sentiment class). Naïve Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machine (SVM) are well-known examples under this category [81].

2. Unsupervised classifiers work with unlabelled samples. Popular unsupervised algo- rithms include clustering algorithms (e.g., k-means, k-medoids, hierarchical cluster- ing), hidden Markov models and some unsupervised neural network models such as self-organising maps [18].

11 Creation of this figure is inspired by a previous figure from the NLTK toolkit (http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html).

3. Semi-supervised classifiers require both labelled and unlabelled samples. Algorithms like label propagation and graph-based models are in this category [187].

The performance of the machine learning approach is generally affected by three main factors, as shown in Figure 5. These factors are: (i) the training methodology, (ii) the type of features used to describe/characterise the data, and (iii) the choice of the machine learning classifier (algorithm). Therefore, we review research work in this section with respect to these three factors.

Figure 5: Three main factors that affect the sentiment classification performance of the machine learning approach: (1) the training methodology, (2) the types of features and (3) the types of classifiers.

1. The Training Methodology

Most existing works in Twitter sentiment analysis under the machine learning approach use supervised classifiers and, therefore, require labelled corpora for classifier training. In Section 2.1.1 we saw that previous works on conventional sentiment analysis [115, 113, 97, 173, 183] tend to train their models from manually labelled corpora, such as the movie review dataset,12 constructed by Pang and Lee [115, 113], and the Customer Review dataset,13 introduced by Hu and Liu [71]. As with conventional text, generating manually labelled Twitter data is a labour-intensive task. In addition, the use of abbreviations, colloquial terms and ill-formed words may make the annotation of Twitter data even more difficult than the annotation of conventional text [123]. To overcome this problem, several studies on Twitter sentiment analysis [60, 11, 16, 110, 45, 82, 123, 156] have relied on the distant supervision [125] approach to automatically construct large corpora of annotated tweets for sentiment classifier training. The distant supervision approach assumes that emoticons,14

12 http://www.cs.cornell.edu/people/pabo/movie-review-data/ 13 www.cs.uic.edu/~liub/FBS/Reviews-9-products.rar 14 http://en.wikipedia.org/wiki/Emoticon

hashtags,15 emoji,16 etc., are good indicators of the sentiment in tweets and can be used to automatically assign sentiment labels. Go et al. [60], for example, used emoticons such as ":), :-), :D" and ":(, :-(, : (" to construct a training corpus of 1.6 million positive and negative tweets. The corpus was made public and used by the authors to train supervised classifiers for polarity sentiment classification. Pak and Paroubek [110] generated a corpus of tweets of three sentiment classes (positive, negative and neutral). As in [60], positive and negative tweets were annotated using emoticons, while neutral tweets were obtained by crawling several newspaper accounts on Twitter (e.g., the New York Times and the Washington Post). In addition to emoticons, other works proposed using hashtags as additional indicators of tweets' sentiment [45, 82]. Kouloumpis et al. [82] utilised three sets of positive, negative and neutral hashtags (e.g., #bestfeeling, #worst, #news) to label tweets extracted from the Edinburgh Twitter corpus,17 where each tweet is annotated based on the polarity of these hashtags. Evaluation results showed that an AdaBoost.MH classifier [141] trained from tweets labelled using hashtags alone gives lower performance than one trained from tweets labelled using a combination of hashtags and emoticons. In a similar vein, Davidov et al. [45] used 50 different hashtags such as "#sucks", "#obama" and "#sarcasm" as direct labels for tweets extracted from the O'Connor Twitter corpus [108]. Each hashtag was used to annotate 1,000 tweets, resulting in a training dataset of 50K tweets of 50 different sentiment classes (each class corresponds to a single hashtag representing a particular sentiment emotion). The annotated corpus was used to perform a 50-class classification task, as well as a binary-hashtag classification task (i.e., checking whether the tweet expresses any of the fifty selected emotions or not). In addition to polarity and subjectivity classification, the distant supervision approach was used to build training corpora for emotion detection on Twitter [123, 156]. Emotion detection is usually a multi-class classification task (see Section 2.1.1) and requires training corpora containing tweets of multiple emotion classes (e.g., happy, sad, anger, fear, joy). Purver and Battersby [123] exploited the use of emoticons and hashtags as noisy labels to annotate tweets with the six basic emotion classes defined by Ekman [51] (happy, sad, anger, fear, surprise, disgust). To this end, the authors used a collection of tweet messages, in which each tweet contains at least one emoticon or hashtag that corresponds to one

15 http://en.wikipedia.org/wiki/Hashtag 16 http://en.wikipedia.org/wiki/Emoji 17 http://demeter.inf.ed.ac.uk

of the six emotion classes. For example, the emotion class "happy" was assigned to tweets containing either the hashtags #happy or #happiness, or an emoticon such as ":-)" or ";-)". The authors performed a cross-validation test on the produced labels. In particular, several classifiers were trained from tweets labelled with emoticons and tested on tweets labelled with hashtags, and vice versa. The goal was to check whether hashtags and emoticons assigned to the same emotion class are good indicators of the actual emotional state conveyed in tweets. Evaluation results came back with a positive answer: both emoticons and hashtags are indeed good indicators of the emotional states in tweets. However, the results also showed that the classification performance was higher (in the 60-65% range) for the emotion classes happiness, sadness and anger than for the classes fear, surprise and disgust. The authors argued that the low performance of the latter classes was because these classes are generally negative and often share similar emoticon and hashtag indicators. Recently, Suttles and Ide [156] have exploited the use of emoji18 as noisy labels in addition to emoticons and hashtags. Similar to [123], the task was to build an annotated Twitter corpus for emotion classification. However, the authors proposed classifying emotions into eight different, yet bipolar, classes identified by Plutchik [120] (joy vs. sadness, anger vs. fear, trust vs. disgust, and surprise vs. anticipation). For data annotation, the authors used a combination of 69 emoticons, 70 emoji and 56 hashtags. A cross-validation test on these labels showed similar results to [123]: (i) using all three types of labels for annotation resulted in reliable performance, and (ii) the performance tended to be higher for the joy and sadness classes than for the other classes. Barbosa and Feng [11] followed a slightly different approach. Instead of using hashtags or emoticons as noisy labels, they collected and annotated their training data using three third-party Twitter sentiment detection services: Twendz,19 TweetFeel20 and Sentiment140.21 While Twendz and TweetFeel use pre-built sentiment lexicons to label each tweet as positive or negative, Sentiment140 uses a MaxEnt classifier trained from the same Twitter corpus used in [60]. The annotation agreement between each pair of sentiment detection services was between 40% and 60%. The authors also found that the best strategy to enhance the labelling results is to combine the labels from the three services. Evaluation results showed that such a strategy gives higher classification performance than using each service on its own.
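The distant-supervision labelling schemes reviewed above can be illustrated with a minimal sketch that assigns noisy labels from the emoticons and hashtags a tweet contains; the indicator sets and the noisy_label function are hypothetical and far smaller than those used in the cited works.

# A minimal sketch of distant supervision: assign noisy sentiment labels to
# tweets from the emoticons/hashtags they contain. Indicator lists are
# illustrative, not the exact sets used by Go et al. or Kouloumpis et al.
POSITIVE = {":)", ":-)", ":D", "#bestfeeling"}
NEGATIVE = {":(", ":-(", "#worst"}

def noisy_label(tweet: str):
    tokens = set(tweet.lower().split())
    pos, neg = tokens & POSITIVE, tokens & NEGATIVE
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # no indicator, or contradictory indicators: discard the tweet

corpus = ["just got the new phone :D", "traffic again :(", "ok :) but also :("]
labelled = [(t, noisy_label(t)) for t in corpus if noisy_label(t)]
print(labelled)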

18 emoji are smileys that were first used in Japanese online media and text messages in 1998. 19 http://twendz.waggeneredstrom.com 20 http://www.tweetfeel.com 21 http://www.sentiment140.com/

From the above review, it can be observed that the distant supervision approach has, to a large extent, been successfully used for the automatic sentiment annotation of tweets. Classifiers trained from automatically labelled corpora achieved performances as high as 83% in accuracy [60]. However, the distant supervision approach is based on the assumption that the sentiment of a tweet corresponds to the sentiment of the hashtags or emoticons within it. This assumption is susceptible to errors when dealing with tweets that either (i) do not contain any sentiment indicators, (ii) contain multiple, yet contradictory, sentiment indicators, or (iii) contain an emoticon whose sentiment differs from that of the tweet it belongs to.

Due to the above limitations, several works argued that the distant supervision approach is often inaccurate and may harm the performance of sentiment classifiers [152, 91]. Instead, they proposed approaches that either do not use noisy sentiment labels (i.e., emoticons, hashtags, etc.) at all, or lower the dependency on them when training sentiment classifiers.

Liu et al. [91] trained their classifiers from a combination of noisy and manually labelled tweets. In particular, the authors first trained a Bayesian classifier from a small set of manually labelled tweets. After that, they interpolated the language model of this classifier with another model trained from noisy (emoticon-based) labels as follows:

P_co(W|C) = α P_a(W|C) + (1 − α) P_u(W|C)    (1)

where P_co(W|C) is the interpolated language model, P_a(W|C) is the language model trained from manually labelled tweets, and P_u(W|C) is the language model trained from tweets with noisy labels. The authors trained and evaluated their model using the Sanders Twitter corpus [138]. Evaluation results showed that the smoothed model consistently outperforms baseline models that do not use interpolation. Speriosu et al. [152] decided not to use noisy labels at all. Instead, they proposed a label propagation method on Twitter follower graphs. First, a graph was constructed with users, tweets, word unigrams, word bigrams, hashtags and emoticons as nodes, connected according to the links that exist among them (e.g., users are connected to the tweets they created; tweets are connected to the word unigrams they contain, etc.). A label propagation method was then applied, in which sentiment labels were propagated from a small set of manually labelled nodes throughout the graph. They claimed that their label propagation method outperforms MaxEnt trained from noisy labels.
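Returning to Equation (1), the following minimal sketch computes the interpolated unigram model for one sentiment class; the word counts and the value of α are invented for illustration and do not correspond to the models trained in [91].

# A minimal sketch of the interpolation in Eq. (1): the word likelihood from the
# manually supervised model is smoothed with one trained from noisy labels.
def unigram_model(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

P_a = unigram_model({"great": 8, "awful": 2})   # from manually labelled tweets
P_u = unigram_model({"great": 5, "awful": 5})   # from noisily labelled tweets
alpha = 0.7                                     # illustrative interpolation weight

vocab = set(P_a) | set(P_u)
P_co = {w: alpha * P_a.get(w, 0.0) + (1 - alpha) * P_u.get(w, 0.0) for w in vocab}
print(P_co)  # interpolated P(W|C) for one sentiment class C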

2. Types of Features

The second factor that affects the performance of sentiment classifiers is the choice of the features used for classifier training. Many types of features have been used in Twitter sentiment analysis, including (i) word n-gram features,22 (ii) lexicon features, (iii) Part-Of-Speech (POS) tag features, and (iv) microblogging features [60, 116, 110, 11, 13, 45, 1, 82, 123]. Go et al. [60] used word unigrams and bigrams in conjunction with POS tags in the training of NB, MaxEnt and SVM classifiers. They found that a MaxEnt classifier trained from a combination of unigrams and bigrams outperforms other models trained from other combinations of features by nearly 3%. However, a contrary finding was reported by Bermingham and Smeaton [13], who found that the best performance, on a different dataset to the one used in [60], was achieved by a NB classifier trained from word unigrams only. In a similar vein, Pak and Paroubek [110] investigated POS tag features in addition to three different variations of word n-grams: unigrams, bigrams and trigrams. However, they did not include the POS tag features in the feature space directly. Instead, they used the distribution of these features within the tweet messages of a given sentiment class (i.e., P(POS|C)) to calculate the posterior probability of a NB classifier trained from word n-grams. Evaluation results showed that using bigram features for training gives the highest performance compared to using unigrams or trigrams. The authors also argued that their NB model, augmented with the POS distributions in tweets, outperforms SVM and CRF classifiers trained from word n-grams only. Despite the popular use of word n-grams in previous work, these features have been argued to hinder the classification performance because of the large number of infrequent words in Twitter, which makes the data very sparse [4, 11, 5]. Instead, features like character n-grams and microblogging features have been investigated. Aisopos et al. [4] compared word n-gram and character (letter) n-gram features and argued that the latter have an advantage over the former; specifically, they found that character n-grams are more tolerant to noise than word n-grams. The authors also proposed using an n-gram graph representation [58] of tweets rather than the simple vector representation. Nodes in this graph correspond to the character n-grams in a given tweet, while the edges encode the adjacency information between pairs of nodes (i.e., the distance between two n-grams in a given tweet).

22 In the tweet "I like my iPhone! :)", any single word is a word unigram (e.g., iPhone), any 2 adjacent words form a word bigram (e.g., my iPhone), any 3 adjacent words form a word trigram (e.g., like my iPhone), and so on.

The n-gram graph representations, as well as the ordinary vector representation of word n-grams, were evaluated using NB and C4.5 classifiers on a collection of 475 million tweets from Yang and Leskovec [181]. The evaluation results, however, were not conclusive: using n-gram graphs with the C4.5 classifier gives a higher performance than the vector model, but a lower performance when the NB classifier is used. Asiaee et al. [5] claimed that high classification performance of tweets' sentiment is still achievable using word n-gram features. Their trick, however, was to convert the high-dimensional bag-of-words vector representation (v) into a sparse vector representation (x) of much lower dimensionality. To this end, they used a random reconstructible projection (RP) technique [17], where tweets are projected through a random matrix (P) in such a way that the lower-dimensional representation of a given tweet (t) is computed as x_t = P v_t.
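A minimal sketch of the projection step x_t = P v_t is given below; the vocabulary size, target dimensionality and Gaussian projection matrix are illustrative choices rather than the exact construction used in [5].

# A minimal sketch of random projection: a high-dimensional bag-of-words
# vector v_t is mapped to a lower-dimensional x_t = P v_t.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, target_dim = 10_000, 100
P = rng.standard_normal((target_dim, vocab_size)) / np.sqrt(target_dim)

v_t = np.zeros(vocab_size)
v_t[[3, 42, 977]] = 1.0          # indices of the words occurring in the tweet

x_t = P @ v_t                    # compressed representation used for training
print(x_t.shape)                 # (100,)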

Evaluation results showed that training supervised classifiers such as SVM, NB and KNN from the new vector representation maintains a high classification performance compared to using the typical vector representation. Barbosa and Feng [11] proposed using microblogging features such as retweets, hashtags, replies, punctuation and emoticons. They found that using these features to train SVMs enhances the sentiment classification accuracy by 2.2% compared to SVMs trained from unigrams only. A similar finding was reported by Kouloumpis et al. [82], who explored several types of microblogging features including emoticons, abbreviations and the presence of intensifiers such as all-caps (i.e., words with all characters in upper case) and character repetitions (e.g., looove) for Twitter sentiment classification. Their results show that the best performance comes from using n-grams together with microblogging features and lexicon features, where words are tagged with their prior polarity (positive, negative, neutral).23 On the other hand, using POS features produced a drop in performance. Agarwal et al. [1] also explored POS features, lexicon features and microblogging features. Apart from simply combining various features, they designed a tree representation of tweets to combine many categories of features in one succinct representation. A partial tree kernel [103] was used to calculate the similarity between two trees. They found that the most important features are those that combine the prior polarity of words with their POS tags.

23 Lexicon features refer to word n-grams (usually unigrams) coupled with their prior sentiment. The prior sentiment of words is usually extracted using sentiment lexicons such as MPQA and SentiWordNet.

All other features play only a marginal role. Furthermore, they also showed that combining unigrams with the best set of features outperforms the tree kernel-based model and gives about a 4% absolute gain over a unigram baseline. In addition to the work in [82, 1], Bravo-Marquez et al. [22] have investigated in depth the usefulness of lexicon features for sentiment analysis of tweets. In particular, they proposed incorporating three types of lexicon features into classifier training: (i) words' prior polarities (positive, negative), (ii) emotions expressed in tweets (e.g., sadness, joy, surprise), and (iii) sentiment strength, a numerical value that refers to the intensity of a tweet's sentiment (see Section 2.1.1). These features were extracted using different lexicon resources and third-party sentiment analysis tools, including the OpinionFinder lexicon [176], the AFINN lexicon [106], SentiWordNet, the NRC lexicon [100], the SentiStrength algorithm [160] and the Sentiment140 service [60]. The authors also analysed the discriminative power of all the extracted features for both subjectivity and polarity classification. They found that SentiWordNet-, OpinionFinder- and AFINN-based features are the most informative features for both classification tasks. They also showed that, while the prior polarity features are highly discriminative, the emotion features have almost no discriminative value for subjectivity classification. The evaluation was done using CART, J48, NB, Logistic Regression (LR) and SVM classifiers. Results showed that SVMs trained from a combination of the three types of features outperform all other combinations by more than 5% in accuracy and F-measure. From the above, one can notice that although a wide range of features has been exploited for classifier training, the question remains of which set of features, or which combination of them, is optimal for sentiment analysis on Twitter.

All Types of Features!

In June 2013, the Semantic Evaluation workshop (SemEval) organised a sentiment analysis task on Twitter [104].24 The participants were provided with a manually labelled Twitter dataset, which allows for sentiment classification at the tweet and expression levels. Following the machine learning approach, several supervised models were proposed and trained from several sets of features [101, 96, 127, 175, 79, 105, 122, 64, 94]. However, what was interesting about the challenge is that the winning approach [101], which produced the highest performance among the other 44 approaches, was simply based on a supervised classifier trained from a combination of a very large number of different types of features.

24 http://www.cs.york.ac.uk/semeval-2013/task2/

Specifically, Mohammad et al. [101] used an SVM classifier trained from a combination of word n-grams (1-, 2-, 3- and 4-grams), character n-grams (3, 4 and 5 characters), all-caps, POS tags, lexicon features, negation, hashtags, punctuation, clusters (the word's cluster) and emoticons. The lexicon features were extracted from three pre-built lexicons, NRC, MPQA and Bing Liu [71], as well as from two lexicons constructed by the authors specifically for this task. The proposed model was evaluated on a set of 3,813 tweets for subjectivity classification, achieving the highest performance of 69.02% in macro F-measure when trained from all types of features. The authors also studied the contribution of each feature set to the classification performance by training the model from all the feature types but excluding one type at a time. Several interesting findings were reported. For example, unlike Barbosa and Feng [11] and Kouloumpis et al. [82], the authors found that word and character n-grams are very important features for the classification performance, while microblogging features such as emoticons and hashtags have no actual impact on the performance. It is also worth noting that other approaches that achieved good performance in the SemEval task followed a similar recipe to [101], i.e., choosing a sentiment classifier and training it from all or subsets of the above features. Gunther and Furrer [175] proposed training a linear model from a collection of normalised tokens, stemmed words, word clusters and lexicon features (based on SentiWordNet) using stochastic gradient descent (SGD). Proisl et al. [122] used SVM and NB classifiers trained from lexicon features, slang and emoticons. Rodríguez-Penagos et al. [133] trained SVM and CRF classifiers from lexicon features, negation and quantifiers. Rather than using a single classifier, Kökciyan et al. [79] employed multiple classification models, where a NB classifier is used for negative sentiment classification and a MaxEnt classifier is used for positive sentiment classification. A rule-based system was then used to decide on the overall sentiment of a given tweet based on the negative and positive probabilistic scores obtained from each classifier. Features such as microblogging features, repetitions (i.e., words with repeated letters), negation and lexicon features were used for classifier training. The best performance of 63.53% was obtained using a combination of slang, negation, sentiment words and ends-with-exclamation features.
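In the spirit of these feature-rich SemEval systems, the following sketch concatenates word n-gram and character n-gram features into one feature space before training a single classifier; the scikit-learn FeatureUnion setup, the parameter ranges and the toy data are assumptions for illustration, not a reconstruction of [101].

# A minimal sketch of combining several feature types before training one
# classifier; parameter choices and data are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("word_ngrams", CountVectorizer(ngram_range=(1, 2))),
    ("char_ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(["loved it :)", "hated it", "best day ever"], ["pos", "neg", "pos"])
print(model.predict(["love this"]))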

3. Types of Machine Learning Classifiers

The third key factor in the effectiveness of the machine learning approach is the choice of the classification algorithm, or in other words the type of sentiment classifier. In the previous subsections, we have seen that most previous works used supervised classifiers including NB, SVM, MaxEnt, KNN, CRF, C4.5, CART, J48, LR, etc. [60, 116, 110, 11, 13, 45, 1, 82, 123, 101, 96, 127, 175, 79, 105, 122, 64, 94]. While some works used standard versions of these classifiers [60, 11, 22, 101], others altered them in various ways to better suit the nature of tweet data [91, 110]. This has produced an even greater variety of models of different performances; however, the lowest performance has always been much higher than that of random classifiers. With such great variation, there is usually no ultimate or best-choice classifier, because the performance of a classifier is highly dependent on the training methodology, the type of features and the type of sentiment analysis task. Having said that, one can fairly argue that SVM, MaxEnt and NB are among the most popular classification methods used for sentiment analysis of tweets.

Summary

Based on the above review, we notice that the supervised machine learning approach has proven successful in sentiment classification on Twitter. Nonetheless, the social nature of Twitter, along with the special characteristics of tweets, poses several limitations on this approach. Firstly, it relies on large annotated corpora of tweets for classifier training. Sentiment annotation of textual data is usually expensive [89], especially for continuously changing and evolving subject domains as on Twitter. Although some of the aforementioned studies relied on the distant supervision approach for sentiment annotation of tweets, this approach has been criticised for harming the sentiment classification performance due to its relatively low annotation quality [152]. Secondly, supervised approaches are usually domain dependent, i.e., classifiers trained on data from one specific domain (e.g., tweets on political events) could produce low performance when applied to data from a different domain (e.g., tweets on social events) [6]. Thirdly, tweet messages tend to be very sparse due to the frequent use of abbreviations, ill-formed words and irregular expressions. The sparsity problem usually hinders the classification performance since many terms in the training data do not appear in the test data [137]. Some attempts have been made to overcome the above limitations, as discussed earlier. For example, semi-supervised classifiers were used to reduce reliance on automatically labelled data or distant supervision [152], and different feature engineering processes and dimensionality reduction techniques were adopted to reduce the sparseness of tweet data [4, 5]. However, a common limitation that still faces the above reviewed works is that they are semantically weak, since they rely heavily on features extracted from the syntactic, linguistic or lexical representation of words for classifier training. In the absence of these features, which usually denote sentiment, the trained classifiers tend to fail in extracting the correct sentiment in tweets, as will be explained in Section 2.2.1.4. Our work in this thesis targets the above limitation of machine learning approaches by extracting and using the latent semantics of words, which implicitly reflect sentiment in tweets, as features to train sentiment classifiers, as will be discussed in Chapters 4 and 5.

2.2.1.2 The Lexicon-based Approach

The lexicon-based approach to sentiment analysis presumes that the sentiment orientation of a given document can be inferred from the sentiment orientation of its words and phrases. Unlike the machine learning approach, the lexicon-based approach does not require feature engineering or classifier training. Instead, it uses sentiment lexicons to assign sentiment to the opinionated words in the document. The efficiency of a lexicon-based sentiment analysis method is usually determined by the type of sentiment lexicon and the sentiment detection algorithm, i.e., the algorithm used to detect the opinionated words in the text and calculate the overall sentiment. In Twitter sentiment analysis, several lexicon-based methods have recently been proposed [164, 9, 20, 23, 160, 72]. They can be categorised into (i) Conventional Keyword Matching Methods and (ii) Twitter-based Methods.

1. Conventional Keyword Matching Methods

Conventional keyword matching methods on Twitter (a.k.a. conventional lexicon-based methods) use a pre-built sentiment lexicon along with a simple keyword matching algorithm. Lexicons such as SentiWordNet, MPQA and LIWC are commonly used in this vein. For example, Tumasjan et al. [164] used the LIWC (Linguistic Inquiry and Word Count) software [117] to extract positive and negative emotions from a collection of 104,003 tweets published in the weeks leading up to the German federal election, in order to predict the election results.

Instead of handling each tweet individually, tweets published over the relevant time frame were concatenated into one text sample and mapped into 12 emotional dimensions (e.g., anger, sadness, anxiety, etc.). Bae and Lee [9] also used LIWC, but this time for measuring the popular and influential users on Twitter based on the sentiment of their tweets as well as the sentiment of the tweets posted by their followers. O'Connor et al. [108] calculated the sentiment polarity of a tweet by simply searching for polarity-bearing words (i.e., positive and negative words) within it. The MPQA subjectivity lexicon was used to this end. A tweet is considered positive if it contains any positive word, negative if it contains any negative word, and mixed if it contains both. Tweets were acquired from two different corpora: the Obama job approval corpus25 and the Obama-McCain Presidential Elections 2008 corpus.26 In addition to extracting the sentiment of individual tweets, the authors also calculated the average sentiment of tweets published on a specific day (i.e., the per-day sentiment) by taking the ratio of positive to negative words found in the aggregated tweets. The sentiment detection performance was checked manually, and the evaluation results showed very low recall with many falsely classified samples. This was explained by the fact that tweets in both corpora contain a large number of infrequent and ill-formed, yet opinionated, words, which were not covered by the MPQA lexicon. Bollen et al. [20] adopted a similar word-counting approach using the MPQA lexicon for sentiment detection of tweets for stock market prediction. However, rather than using all the words in the lexicon, the authors exploited only those words of "weak" and "strong" sentiment (e.g., weak positive, strong negative). In addition to binary sentiment detection, the authors also proposed a method to capture more refined emotional states in tweets such as "Calm", "Alert", "Sure", etc. To this end, they built a new lexicon of 964 words coupled with their emotion labels. The new lexicon was built upon the POMS lexicon [107] and expanded by adding 892 new words to the 72 words in the original lexicon. The results showed that the inclusion of emotions increased the performance of the prediction model compared to using MPQA alone.
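The word-counting scheme used in [108] can be sketched as follows: a day's tweets are aggregated and the ratio of positive to negative lexicon words is taken as the per-day sentiment; the word lists and the day_sentiment function below are illustrative stand-ins for the MPQA lexicon.

# A minimal sketch of per-day sentiment as a positive/negative word ratio.
POS_WORDS = {"good", "great", "win"}
NEG_WORDS = {"bad", "fail", "worst"}

def day_sentiment(tweets):
    words = " ".join(tweets).lower().split()
    pos = sum(w in POS_WORDS for w in words)
    neg = sum(w in NEG_WORDS for w in words)
    return pos / neg if neg else float("inf")   # per-day positive/negative ratio

print(day_sentiment(["great jobs report", "markets fail again", "good recovery"]))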

From the above, conventional lexicon-based methods have the advantage of simplicity, which becomes a requirement when analysing large corpora of tens of millions of tweets as in [108, 20]. However, these methods often face two main limitations:

25 http://www.gallup.com/poll/113980/Gallup-Daily-Obama-Job-Approval.aspx 26 http://www.pollster.com/polls/us/08-us-pres-ge-mvo.php

1. Conventional methods are typically poor in detecting complex forms of sentiment in tweets [160]. This includes, for example, negated sentiment (e.g., “For some reason I never liked that song”) and boosted sentiment (e.g., “I’m very happy for you man!” or “I feel extremely tired today”).

2. Traditional lexicons such as MPQA and SentiWordNet are designed to work with formal and well-written English text. In Twitter, however, the use of malformed words and irregular expressions is rather popular due to the 140-character limit. This consequently leads to low retrieval of irregular opinionated words in tweets, since such words are not usually covered by traditional lexicons, as demonstrated in [108, 23].

2. Twitter-based Methods

Twitter- or microblog-based methods for lexicon-based sentiment analysis are more tailored to tweet data (and the like) than conventional lexicon-based methods and, therefore, tend to obtain better sentiment detection performance. This is because Twitter-based methods try to overcome the aforementioned limitations by (i) bootstrapping existing lexicons, or creating new lexicons designed to work specifically on tweet data [159, 106], and (ii) building algorithms that take into consideration the special characteristics of tweets when identifying and extracting their sentiment [23, 160, 72]. Pioneering work in this vein is by Thelwall et al. [159], who built SentiStrength, a lexicon-based method for sentiment strength detection on microblogging data. In the approach of Thelwall and colleagues, the sentiment strength of a given post is a numerical value representing the intensity of the post's sentiment. Specifically, the negative sentiment strength is taken to be a number between -1 (not negative) and -5 (extremely negative). Similarly, the positive sentiment strength is a number between 1 (not positive) and 5 (extremely positive). SentiStrength uses a human-coded lexicon of words and phrases specifically built to work with social data. We refer to this lexicon in this thesis as the Thelwall-Lexicon. The lexicon initially consisted of 298 positive terms and 465 negative terms extracted from a collection of MySpace comments and manually annotated with sentiment strength values between 2 and 5. A training algorithm was proposed subsequently to optimise the terms' sentiment strengths. In particular, the algorithm iteratively selects a term, updates its sentiment strength by -1 or +1, and checks whether that improves the classification performance on a corpus of human-annotated texts. To overcome the problem of ill-formed language, SentiStrength applies several lexical rules, such as the existence of emoticons, intensifiers, negation and booster words (e.g., absolutely, extremely).

To this end, the authors built three lists of booster words, emoticons and negation words. The assessment of SentiStrength was done by comparing its performance against random and majority-class classifiers, as well as several machine learning classifiers including NB, SVM and J48. To this end, the authors used a test set of 1,041 MySpace27 comments, manually annotated with their sentiment strengths by three human coders. Evaluation results showed that SentiStrength significantly outperformed the best machine learning classifier by over 2% for positive sentiment strength detection. The authors attributed the gain in performance to the ability of SentiStrength to deal with irregular English forms in text and use them to boost the overall detection performance. In a subsequent work [160] by the same authors, several improvements were added to the original algorithm. These mainly included extending the Thelwall-Lexicon to 2,310 terms by adding the negative terms of the General Inquirer lexicon [153]. The authors also updated the negation rule: negating a negative term in the new version of SentiStrength turns it neutral rather than positive. The new SentiStrength was evaluated on 6 test datasets from different social media services, including Twitter, BBC Forum, Digg.com, MySpace, Runners World Forum and YouTube. The results showed that SentiStrength significantly outperformed the baseline method (majority classification) over the six datasets. However, the machine learning methods were shown to perform slightly better than SentiStrength on most of the datasets. The same authors [111] proposed another approach based on SentiStrength, where the focus was on polarity and subjectivity detection rather than sentiment strength detection. In particular, the authors used sentiment rules similar to those used in SentiStrength to calculate the positive and negative sentiment strength of a given document. The strength scores were subsequently used to compute the document's overall sentiment: the document is considered positive if its positive strength is larger than its negative one, and vice versa. Moreover, the document is considered neutral (objective) if the absolute values of its positive and negative strengths are both equal to 1. The proposed algorithm was evaluated on three human-annotated datasets collected from Twitter [110], Digg [112] and MySpace [158], and its performance was compared against SVM, NB and MaxEnt classifiers trained from word unigrams. The evaluation results showed a superior performance of the algorithm over all the other methods for both polarity and subjectivity classification.
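A minimal sketch of SentiStrength-style lexical rules is shown below: a term's strength is amplified by a preceding booster word and neutralised by nearby negation. The word lists, strength values and rule details are illustrative simplifications, not the actual Thelwall-Lexicon or SentiStrength rules.

# A minimal sketch of booster and negation rules over term strengths.
STRENGTHS = {"happy": 3, "tired": -3}
BOOSTERS = {"very": 1, "extremely": 2}
NEGATIONS = {"not", "never"}

def term_strength(tokens, i):
    s = STRENGTHS.get(tokens[i], 0)
    if i > 0 and tokens[i - 1] in BOOSTERS and s:
        s += BOOSTERS[tokens[i - 1]] if s > 0 else -BOOSTERS[tokens[i - 1]]
    if any(t in NEGATIONS for t in tokens[max(0, i - 2):i]):
        s = 0  # negated terms are neutralised (a simplification of the updated rule)
    return s

print([term_strength("i feel extremely tired today".split(), i) for i in range(5)])
print([term_strength("i am not happy".split(), i) for i in range(4)])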

27 https://myspace.com/

From the above review, one can notice that adjusting lexicon-based algorithms to Twitter characteristics, as well as using microblogging-based lexicons, helps reduce the negative impact of language informality in tweets and consequently improves the detection performance, as also argued by [23]. Nonetheless, building sentiment lexicons is usually a time-consuming and labour-intensive task [89]. Moreover, similar to the traditional lexicons, the coverage of microblogging-based lexicons is also limited by the number of words within them. The difficulty of building sentiment lexicons for microblogs, as explained above, attracted several research works that mainly focused on investigating (i) new types of opinionated words and expressions found in microblogging texts, and (ii) which of these word types are more useful to consider when constructing sentiment lexicons [106, 72, 23]. For example, Nielsen [106] studied the influence of including ill-formed expressions (e.g., lol, gr8) and obscene terms [140] that convey sentiment when building new sentiment lexicons. In particular, the author built a new lexicon of 2,477 words coupled with their sentiment strength scores (between -5 and +5). The words were collected from different sources, including a list of obscene words [21], the Original Balanced Affective Word List,28 a list of slang words from the Urban Dictionary,29 etc.30 Using simple keyword matching, the new lexicon was compared to the ANEW lexicon [21], General Inquirer, OpinionFinder and SentiStrength on a dataset of 1,000 tweets manually labelled with Amazon Mechanical Turk (AMT) [15]. The evaluation results showed that the new lexicon had a higher correlation with the sentiment labels obtained from AMT than ANEW, General Inquirer and OpinionFinder, but a lower correlation than SentiStrength. This is because SentiStrength, as described earlier, applies several syntactic rules besides covering slang and other ill-formed words and irregular expressions. One way to overcome the limited coverage of words is to lower the reliance on sentiment lexicons, and more specifically on the prior sentiment of words, and to focus more on other sentiment indicators in tweets. For example, Hu et al. [72] exploited emotional signals, i.e., those parts of the text that convey sentiment or are correlated with other parts that convey sentiment, for sentiment detection of tweets in an unsupervised manner. As defined by the authors, these signals are: (i) the words' prior sentiment, (ii) sentiment indicators in tweets (e.g., emoticons), (iii) word-word sentiment correlation (i.e., the correlation between opinionated and ordinary words in a given tweet), and (iv) post-post sentiment correlation, which refers to the correlation between opinionated and ordinary posts.

28 http://www.sci.sdsu.edu/CAL/wordlist/origwordlist.html 29 http://www.urbandictionary.com 30 The complete list can be found in the author's work in [106].

The authors proposed a matrix factorisation-based framework [48] in which the emotional signals are used as prior knowledge to constrain the factorisation process. Evaluation results on two Twitter datasets showed the effectiveness of the proposed approach in comparison with other conventional keyword matching methods. In a similar vein, Brody and Diakopoulos [23] explored including lengthened words (i.e., words with repeated letters) as additional opinionated words in the lexicon. In particular, the authors argued that words such as "realllly" and "niiiice" tend to have a strong association with sentiment, but are usually not covered by typical lexicons. Using MPQA, they found that 7% of the words in the lexicon appeared lengthened in 68% of the tweets in a corpus of 500K tweets. Based on this finding, the authors proposed to expand the MPQA lexicon with a subset of 720 lengthened words found in tweets. To this end, they ran a label propagation method on a graph of the lengthened words to calculate their sentiment. The annotation performance of the method was compared to a human annotation on two sets of 50 positive and 50 negative words. The overall results showed that word lengthening is not random and often does denote sentiment in the text.
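A minimal sketch of detecting and normalising lengthened words with a regular expression is shown below; the three-repetition threshold and the collapse-to-one-letter normalisation are heuristic assumptions rather than the method of [23].

# A minimal sketch: any letter repeated three or more times is taken as a
# lengthening signal, and the run is collapsed to a single letter.
import re

LENGTHENING = re.compile(r"(\w)\1{2,}")

def is_lengthened(word: str) -> bool:
    return bool(LENGTHENING.search(word))

def normalise(word: str) -> str:
    return LENGTHENING.sub(r"\1", word)

for w in ["realllly", "niiiice", "nice"]:
    print(w, is_lengthened(w), normalise(w))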

Summary

From the above review, lexicon-based methods seem like a better fit for Twitter than machine learning methods, since the former can be directly applied to tweets with no need for feature engineering and classifier training. Nonetheless, lexicon-based methods are limited by their lexicons. Firstly, they rely on sentiment lexicons with a fixed set of words; words that do not appear in the lexicon are often not considered when analysing sentiment. Moreover, lexicon-based methods, and their underlying lexicons, offer fixed, context-independent word-sentiment orientations and strengths. For example, typical lexicon-based methods assign the same sentiment strength to the word "good" in "It is a very good phone indeed!" and in "I will leave you for good this time!". Although training algorithms have been built to optimise the terms' sentiment scores in the lexicon, as in [159], they require frequent retraining from human-coded data, which is labour-intensive and domain dependent. The work in this thesis aims to address the above limitations by building a lexicon-based approach that considers the context of words, by means of their co-occurrence patterns in tweets, and calculates their sentiment orientation and sentiment strength accordingly. In doing so, our approach is able to find new opinionated words in tweets and uses them to extend general-purpose sentiment lexicons that do not normally cover such words, as will be discussed in Chapter 3.

2.2.1.3 The Hybrid Approach

The hybrid approach to sentiment analysis combines the machine learning and lexicon-based approaches in a way that exploits the strengths of both but avoids some of their weaknesses. Broadly speaking, in the hybrid approach one of the two approaches is used to boost the performance of the other. For example, supervised classifier models make use of lexicon-based approaches to lower the dependency on manually annotated training corpora [184]. On the other hand, machine learning approaches can be used to bootstrap the sentiment lexicons used in lexicon-based approaches [159]. A very popular form of hybridisation between the two approaches is to incorporate the prior sentiment information of words as additional features into supervised classifier training, as in [82, 1, 22, 101] (see Section 2.2.1.1). Such incorporation has been shown to improve the overall performance of sentiment classifiers [22]. Other existing works developed more sophisticated forms of hybridisation [184, 62, 128, 133]. For example, Zhang et al. [184] devised a three-step incremental learning approach. In the first step, a lexicon-based method is applied to a collection of tweets to identify their sentiment labels (positive and negative). In the second step, the labelled tweets are used to identify the opinionated terms that were not detected in the first step, based on their co-occurrence with positive and negative tweets. The terms extracted in this step are subsequently added to the lexicon used in the first step and used to find more opinionated tweets. In a further step, an SVM classifier is trained from the tweets annotated in the first step and applied to classify the sentiment orientation of the named entities in the tweets detected in the second step. Rodríguez-Penagos et al. [133] built an ensemble approach that uses the outputs of a lexicon-based method and two supervised classifiers to perform sentiment detection at the expression level. In particular, the authors trained SVM and CRF classifiers from several sets of lexicon features (e.g., prior polarity, polarity strength, subjectivity clues, etc.). They also built a lexicon-based method that uses some pre-defined heuristic rules (e.g., if [negation (word)] then [flip sentiment]) along with several polarity lexicons. Majority voting and ensemble voting (i.e., choosing the output that maximises the number of correctly identified samples in the development corpus) strategies were used to calculate the total sentiment from the three outputs.

Revathy and Sathiyabhama [128] proposed using three supervised models trained from three different sets of features. The first model is a random forest classifier trained based on compositional semantic rules [39].31 The second model is an SVM classifier trained from the senses of words extracted from WordNet [98]. The third model uses our previously proposed semantic features (e.g., "Person", "Company") (see Chapter 4) to train a Naive Bayes classifier [136]. The final sentiment of the tweet corresponds to the majority vote of the three models. The evaluation results showed that the proposed approach outperforms other supervised classifiers trained from several combinations of word n-grams (unigrams, bigrams) and POS tags. It also outperforms models trained from the semantic features alone. Gonçalves et al. [62] proposed an approach that combines several lexicon-based methods: SentiStrength; Emoticons (the presence of emoticons in tweets as an indicator of their sentiment); SenticNet (keyword matching against the SenticNet lexicon) [28]; PANAS-t [63] (a lexicon-based method built on the psychological positive affect and negative affect scale (PANAS) [172]); SentiWordNet (keyword matching against the SentiWordNet lexicon); and Happiness Index (which uses the ANEW lexicon to measure the happiness index, from 1 to 9, in text) [50]; as well as SASA [170], a supervised method that uses a Naive Bayes classifier trained from 17K manually annotated tweets for subjectivity classification. The sentiment produced by the combined method is calculated as the harmonic mean of the precision and recall of all the methods. The combined method, along with each of the aforementioned methods, was evaluated using the 6 social datasets from [160]. Evaluation results showed that the combined method outperforms all the other methods in terms of coverage (i.e., the proportion of tweets for which the method is able to detect a sentiment) and agreement (i.e., the proportion of tweets on which the method and the ground truth agree).
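A minimal sketch of the majority-voting step shared by several of the ensembles above is given below; the three component outputs are placeholders for, e.g., a lexicon-based method and two supervised classifiers.

# A minimal sketch of majority voting over the outputs of several detectors.
from collections import Counter

def majority_vote(predictions):
    label, _ = Counter(predictions).most_common(1)[0]
    return label

outputs = ["positive", "negative", "positive"]   # e.g., lexicon, SVM, CRF
print(majority_vote(outputs))                    # -> "positive"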

2.2.1.4 Discussion

In the preceding sections we provided an overview of two popular approaches to sentiment analysis on Twitter: the machine learning approach and the lexicon-based approach.

31 A set of linguistic rules used to determine the sentiment of a tweet as a composition of the sentiment of the expressions within the tweet.

Both approaches have quite distinct characteristics that lead to several advantages and disadvantages when applied to Twitter data. In this section we briefly summarise and discuss the strengths and weaknesses of both approaches as follows:

Supervised Machine Learning Approaches
Traditional machine learning approaches are domain-dependent, which can be a strength and a weakness at the same time. On the one hand, classifiers trained on a specific domain tend to produce high performance when applied to data from the same domain. This is because the training process ensures that the trained classifiers are well-adapted to the domain, topic and context of the training data. On the other hand, classifier training requires a large amount of annotated data, which is not always available, especially on Twitter. Although the distant supervision approach to building training corpora has been adopted in several works, its efficiency is still inconsistent, as discussed in Section 2.2.1.1. Moreover, domain-dependent classifiers usually fail to produce a satisfactory performance when applied to new domains; classifier retraining is often needed in such cases. Therefore, the efficiency of the supervised machine learning approach is usually subject to two conditions: one is the availability of annotated corpora; the other is classifier retraining when moving to corpora of different domains.
Another common characteristic of traditional machine learning approaches on Twitter is that they merely rely on syntactic or linguistic features for classifier training (e.g., n-grams, POS tags, dictionary glosses, etc.). This is sometimes problematic on Twitter because relying on these types of features leads to sparse feature vectors, since tweets are naturally sparse, as discussed earlier. Using sparse vectors to train sentiment classifiers often lowers their performance [137] due to their inability to generalise to new terms found in the test set, on which the classifiers were not trained (or trained via their associated feature vectors).

Lexicon-Based Approaches
Lexicon-based approaches often rely on sentiment lexicons for sentiment analysis and, therefore, do not require training from labelled instances or retraining when applied to Twitter corpora of different domains, which is generally counted as a strength for some sentiment analysis applications on Twitter (e.g., analysing sentiment on Twitter streams). However, these approaches, as reviewed previously, are limited by the sentiment lexicons they use. First, because sentiment lexicons are composed of a generally

static set of words that cannot cover the wide variety of new terms that constantly emerge on Twitter. Secondly, because words in the lexicons have fixed prior sentiment orientations, i.e., each term always has the same associated sentiment orientation regardless of the context in which the term is used.

The Lack of Semantics
A common limitation of both traditional machine learning and lexicon-based approaches to Twitter sentiment analysis is their full dependence on the presence of affect words or syntactic features that are explicitly associated with sentiment. However, it is very likely that the sentiment in a tweet is implicitly expressed via the semantics of its words or the semantic relations between them [24, 55, 126]. As previously explained in Chapter 1, the sentiment of a word is often associated with its semantics in a given context. Changing the context may alter the word’s semantics and, consequently, its sentiment. For example, the word “great” should be negative in the context of a “problem”, and positive in the context of a “smile”. Moreover, the sentiment might also be associated with the explicit semantic concepts of words. For example, “Trojan Horse” is likely to have a negative sentiment when its associated semantic concept is “Computer/Virus”, and a neutral sentiment when its associated semantic concept is “Greek Mythology”. Understanding the semantics of words, therefore, becomes crucial for sentiment analysis of both words and tweets.
The role of semantics in sentiment analysis has been increasingly investigated in different works in the last three years. We conventionally refer to this line of work as Semantic Sentiment Analysis. In the next section we provide a review of the existing approaches in this line of work.

2.3 semantic sentiment analysis

Semantic sentiment analysis aims at extracting and using the underlying semantics of words in identifying their sentiment orientation with regard to their context in the text. Existing works on semantic sentiment analysis can be divided into two main approaches: (i) the contextual semantic approach (aka the statistical semantic approach) and (ii) the conceptual semantic approach. In the following subsections, we review key existing works under each semantic approach, highlighting their properties, strengths and weaknesses.

2.3.1 Contextual Semantics

Contextual semantics (aka statistical or distributional semantics) refer to the type of semantics inferred from the co-occurrence patterns of words [177]. Contextual semantics have traditionally been used in diverse areas of computer science, including Natural Language Processing and Information Retrieval [167]. The main principle behind the notion of contextual semantics comes from the dictum “You shall know a word by the company it keeps!” [52]. This principle in Linguistics is also known as the Distributional Hypothesis, which states that words that occur in the same contexts tend to have similar meanings [65, 167, 134]; this suggests that words that co-occur in a given context tend to have a certain relation or semantic influence on each other.
In the sentiment analysis area, the above principle has been adopted in several approaches in order to assign sentiment to words based on their co-occurrence patterns in text [67, 165, 166, 157, 171, 87, 88, 179]. Similar to traditional sentiment analysis approaches, contextual semantics approaches can be divided into two main categories: lexicon-based approaches [165, 166, 157, 87, 88] and supervised approaches [171, 179]. Sentiment Orientation by Association (SOA)32 [165, 166, 157] is a popular approach under the lexicon-based category.33 SOA assigns sentiment orientation to words based on the sentiment orientation of their co-occurring words in a text [165, 166], in dictionary glosses [157], or in a synonymy graph [78]. Pioneering work in this vein is by Turney and Littman [165, 166]. In their work, the authors exploited the statistical correlation between a given word and a balanced set of 14 positive and negative paradigm words (e.g., good,

32 Also called Semantic Orientation by Association. 33 SOA is considered an unsupervised approach in some studies.

nice, nasty, poor). The word has a positive orientation if it has a stronger degree of association to positive words than to negative ones, and vice versa. Two measures of word statistical correlation were used: pointwise mutual information (PMI) [40] and Latent Semantic Analysis (LSA) [86].34 The evaluation was conducted on 3596 words obtained from General Inquirer. Although this work does not require large lexical input knowledge, its identification speed is very limited [179] because it uses web search engines in order to retrieve the relative co-occurrence frequencies of words.
Other forms of semantic associations can be realised at more abstract levels than the corpus level. For example, Takamura et al. [157] measured the sentiment orientation of words based on their association to seed words in dictionary glosses. In particular, they built a network of terms where two terms are connected if one appears in the gloss set of the other. After that, a spin model [37] was applied to decide on the total sentiment of terms in the network. All word glosses were obtained from WordNet. In a similar vein, Kamps et al. [78] represented the semantic association between words as a graph of WordNet synonymy relations. The sentiment of a word in their approach corresponds to the shorter of two possible paths, one from the word to the seed word “good”, and another to the seed word “bad”. Although using lexical inputs from other resources (e.g., glosses and synonyms from WordNet) is more time-efficient than using the Web, the aforementioned approaches are limited by their lexical resources. For example, the proposed approach in [78] assigns sentiment only to those words that appear in WordNet.
SOA approaches, in general, are unable to assign sentiment to words with domain- and context-specific orientations [49] due to (i) their limited choices of paradigm words, which are usually general and out-of-context, and (ii) their use of external lexical data sources (Web, WordNet, etc.) that, most of the time, do not reflect the contextual semantics of the words in the studied corpus. For example, SOA approaches are generally unable to distinguish between “Heavy” as a negative word when describing a mobile phone and as a positive word when describing a wooden dining table.
Wang and Wan [171] proposed a supervised approach to contextual semantics for sentiment analysis of online news documents. Specifically, the authors used LSA to lower the dimensionality of the passage-word semantic space and extract the statistical correlation

34 LSA is a statistical semantic technique that uses Singular Value Decomposition (SVD) to extract the latent co-occurrence patterns of words in text. It can be thought of as a data smoothing technique that aims at reducing the dimensionality of the data in order to decrease the distances between the vectors of semantically similar words or documents in the semantic space.

between words in different passages (in other words, the latent semantics of the passage). After that, a sentiment lexicon was used to assign prior polarities to the words in the smoothed space, and the sentiment of documents was consequently calculated as the average of the sentiment of their opinionated words. In a further step, the authors trained an SVM classifier from the labelled documents. The authors claimed that their model outperformed another model that followed similar steps but without using SVM classifiers.
Latent Dirichlet Allocation (LDA) [19] is a probabilistic generative model, which presumes that a document is a mixture of probabilities of topics and each topic is a mixture of probabilities of words that are more likely to co-occur together under the topic. For example, the topic “Flu” is more likely to contain words like “headache” and “fever”, while the topic “iPhone” is more likely to contain words like “display” and “battery”. Hence, one can interpret the topics generated by LDA as the contextual semantic identities of words, since the words under each topic tend to have some sort of semantic relation between them.
Several extensions to the original LDA model have been proposed to perform different sentiment analysis tasks, including subjectivity detection [87, 88], aspect-level sentiment detection [186, 162], topic-based sentiment summarization and time-series sentiment analysis [70].
For example, Lin and He [87] proposed the joint sentiment-topic (JST) model, a four-layer generative model which allows the detection of both sentiment and topic simultaneously from text. The generative procedure under JST boils down to three stages. First, one chooses a sentiment label l from the per-document sentiment distribution π_d. Following that, one chooses a topic z from the topic distribution θ_{d,l}, where θ_{d,l} is conditioned on the sampled sentiment label l. Finally, one draws a word w_i from the per-corpus word distribution φ_{l,z} conditioned on both topic z and sentiment label l. The graphical model of JST is presented in Figure 6.35 The JST model does not require labelled documents for training. The only supervision is word prior polarity information, which can be obtained from publicly available sentiment lexicons such as the MPQA subjectivity lexicon.
Unlike SOA approaches, LDA-based approaches derive the contextual semantics of words from their statistical correlations with other words that co-occur with them in the corpus being analysed. Therefore, LDA-based models better capture the words’ specific context and consequently their contextual sentiment orientations. Having said that, LDA-based models face several challenges. First, they tend to have a high computational

35 The figure is taken from the authors’ work in [87] with permission.

Figure 6: JST Model [87].

complexity [151] because the MAP (maximum a posteriori) inference (i.e., P(z|w)) in these models is an NP-hard problem. Although several approximate inference algorithms (e.g., Gibbs sampling, Markov Chain Monte Carlo) have been proposed to improve the time-efficiency, the performance of an LDA-based model is highly dependent on the number of topics per document, which is usually chosen heuristically. Secondly, LDA-based models often rely on the bag-of-word representation of documents, where the syntactical structure of sentences/phrases is usually ignored. Hence, sentiment prediction in such models tends to fail in simple cases such as the presence of negation.
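As an illustration of the three-stage generative story of JST described above, the following sketch samples words from toy distributions; the vocabulary, the number of labels and topics, and all probability values are hypothetical and are not taken from [87].

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["great", "battery", "terrible", "display"]

    pi_d = np.array([0.6, 0.4])                    # P(sentiment label | document d)
    theta_dl = np.array([[0.7, 0.3],               # P(topic | document d, label l)
                         [0.2, 0.8]])
    phi_lz = np.array([                            # P(word | label l, topic z)
        [[0.5, 0.4, 0.0, 0.1], [0.1, 0.1, 0.0, 0.8]],
        [[0.1, 0.2, 0.6, 0.1], [0.0, 0.3, 0.4, 0.3]],
    ])

    def generate_word():
        l = rng.choice(2, p=pi_d)          # 1. choose a sentiment label l
        z = rng.choice(2, p=theta_dl[l])   # 2. choose a topic z conditioned on l
        w = rng.choice(4, p=phi_lz[l, z])  # 3. draw a word w conditioned on l and z
        return l, z, vocab[w]

    print([generate_word() for _ in range(3)])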

2.3.2 Conceptual Semantics

The conceptual semantic approach to sentiment analysis relies on external semantic knowledge bases (e.g., ontologies and semantic networks) together with NLP techniques to capture the conceptual representations of words that implicitly convey sentiment. Unlike contextual semantic approaches, which are mainly based on the bag-of-word paradigm, the conceptual semantic approach relies on the bag-of-concept paradigm, that is, it represents the affective (sentiment) knowledge in text as a set of concepts coupled with their sentiment orientations [26].
Conceptual semantic approaches are relatively new in sentiment analysis and less popular than contextual semantic approaches. Pioneering work in this vein is by Cambria and Hussain [25], who introduced Sentic Computing,36 a new paradigm for concept-level sentiment analysis. Sentic computing employs two knowledge-base sources of

36 The term is derived from the Latin sentire, which is also the root of words like sentiment and sentience.

common sense concepts and affective information, ConceptNet37 [68] and WordNet-Affect38 (WNA) [155], along with common sense reasoning techniques. These resources and techniques are used to extract the common-sense concepts (e.g., “nice day”, “simple life”) found in text together with their associated semantics and sentics (i.e., four affective categories derived from the Hourglass of Emotions model [32]: Pleasantness, Attention, Sensitivity and Aptitude). In particular, sentiment detection in the new paradigm boils down to three main phases, as shown in Figure 7. First, a semantic parser39 is used to extract common sense concepts from a given text. Secondly, the extracted concepts are passed to the semantic extraction module. This module extracts the semantic classes of the extracted concepts based on their relatedness to a predefined set of seed concepts in the common sense semantic network IsaCore [34].40 Thirdly, the extracted concepts are also passed to the sentic extraction module. In this module each concept is assigned a vector of four sentic components (pleasantness, attention, sensitivity, aptitude) based on its position in the AffectiveSpace, a multi-dimensional vector space model in which the concepts in ConceptNet (common sense concepts) are represented and clustered by means of their affect information in WordNet-Affect.

Figure 7: Concise paradigm of Sentic Computing. The detailed paradigm can be found in [25].

37 ConceptNet is a semantic graph representation of the common sense information in the Open Mind corpus. Nodes in the graph represent concepts and edges represent the assertions of common sense that interconnect concepts. 38 WordNet-Affect is a linguistic resource of affective knowledge built upon WordNet. 39 http://sentic.net/parser.zip 40 IsaCore is a semantic network representation of common and common-sense knowledge, built by blending Probase [178] and ConceptNet. IsaCore can be downloaded from http://sentic.net/isacore.zip

Note that AffectiveSpace is built using the blending technique over ConceptNet and WordNet-Affect. In particular, the common sense knowledge in ConceptNet and the affective lexical information in WordNet-Affect are represented as two sparse matrices. Blending exploits the shared information between both matrices in order to merge them into a single matrix. After that, a truncated singular value decomposition is applied in order to lower the dimensionality of the blended matrix and capture the latent correlations between concepts. As a final result, common sense concepts such as “happy birthday” and “birthday present” are jointly represented with their affective information as neighbouring vectors in the AffectiveSpace.
Several data types were used to evaluate the different sentic computing components. For example, AffectNet was evaluated for emotion detection on a set of 5000 blog posts from LiveJournal [154],41 while the ensemble of IsaCore and AffectNet was evaluated for polarity detection on a set of 2000 tagged comments obtained from PatientOpinion,42 an online feedback service for the National Health Service (NHS) in the UK. As for Twitter, only the semantic component was evaluated. In particular, a corpus of 3000 tweets from [150] was used to evaluate IsaCore. The hashtags in this dataset were used to tag categories and topics in tweets such as companies (e.g., Apple, Microsoft, Google), cities, countries, etc. The evaluation showed that IsaCore significantly outperforms both WordNet and ConceptNet. It also outperformed Probase [178] in some categories (e.g., electronics, cars). However, the tweets in the evaluation corpus, as explained in [25], contain formal, structured and well-written English text. Additionally, the common sense concepts are well defined and represented in the tweets, with at least one concept in each tweet. Hence, the evaluation corpus does not reflect the typical characteristics of Twitter data (e.g., malformed words and unstructured sentences), and therefore, the performance of IsaCore on Twitter data is yet to be measured using larger and more realistic Twitter corpora.
It is worth noting that sentic computing as a paradigm was applied effectively in several subsequent works by Cambria and his colleagues [27, 38, 29, 30] to build sentiment analysis applications for different fields such as social media marketing [30] and health care services [29]. In this line of work, sentic computing has proven its effectiveness for sentiment analysis at the concept, aspect, and document levels. Nevertheless, sentic computing approaches face quite a few limitations. Firstly, the main component in sentic

41 LiveJournal is a social network that allows users to publish and share blogs, journals and diaries: http://www.livejournal.com/ 42 https://www.patientopinion.org.uk/

computing, the AffectiveSpace, is built using the SVD reduction technique on a very large and sparse blended matrix. Such a process is computationally expensive and requires large storage. This might become very problematic on Twitter, where the knowledge sources in sentic computing are supposed to keep extending with the continuous evolution of new concepts in tweets. Secondly, current sentic computing approaches rely on static common sense knowledge sources. Therefore, the performance of such approaches is affected, to a large extent, by the richness of these sources (i.e., the number and the diversity of the common and common-sense concepts that the knowledge source covers).
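To give a feel for the cost of this step, the sketch below runs a truncated SVD over a hypothetical large sparse matrix; it illustrates only the core linear-algebra operation, not the actual construction of the AffectiveSpace from ConceptNet and WordNet-Affect.

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Hypothetical sparse "blended" matrix (concepts x features).
    blended = sparse_random(10000, 5000, density=0.001, format="csr", random_state=0)

    # Truncated SVD: keep the 100 largest singular values/vectors.
    U, s, Vt = svds(blended, k=100)
    concept_vectors = U * s            # low-dimensional concept representations
    print(concept_vectors.shape)       # (10000, 100)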

SenticNet

One way to lower the complexity of the sentic computing paradigm is to generate the AffectiveSpace once and store its common sense concepts, along with their sentics, in a static lexicon. This makes it possible to extract the sentic information of a given common sense concept without the need to regenerate the AffectiveSpace. To this end, Cambria et al. [28] built SenticNet,43 a concept-based lexicon for sentiment analysis. To build SenticNet, the authors first created the AffectiveSpace from ConceptNet and WordNet-Affect as described earlier. After that, a subset of 5700 concepts was selected from those which have positive or negative polarity only. The sentiment polarity of a given concept c was calculated from the scores of its four sentic classes as:

    polarity(c) = ( Pleasantness(c) + |Attention(c)| − |Sensitivity(c)| + Aptitude(c) ) / 9    (2)
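A minimal sketch of Equation (2) is shown below; the sentic scores used are hypothetical and are not actual SenticNet values.

    def polarity(sentics):
        # Equation (2): polarity from the four sentic scores of a concept.
        return (sentics["pleasantness"] + abs(sentics["attention"])
                - abs(sentics["sensitivity"]) + sentics["aptitude"]) / 9.0

    # Hypothetical sentic scores for the concept "minor injury".
    minor_injury = {"pleasantness": -0.3, "attention": 0.1,
                    "sensitivity": 0.5, "aptitude": -0.2}
    print(round(polarity(minor_injury), 3))   # -> -0.1 (a negative concept)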

SenticNet was further extended in [31] by adding all the polar concepts from the Open Mind corpus, resulting in 14k concepts coupled with their affective information.44 Both versions of SenticNet were published in a Semantic Web format (RDF/XML), where the semantics and the sentics of a concept like “minor injury” are represented in SenticNet as illustrated in Figure 8. To evaluate SenticNet, the authors compared its performance against the traditional SentiWordNet lexicon using the PatientOpinion dataset. The results showed that SenticNet outperformed SentiWordNet with a significant gain of 18% in F-measure. SenticNet has recently been used in a number of studies to perform different sentiment analysis tasks, including aspect-level sentiment analysis of products [56], as well as sentence-level and opinion-holder sentiment detection [55].

43 http://sentic.net/senticnet-1.0.zip 44 http://sentic.net/senticnet-2.0.zip

Figure 8: The RDF entry of the concept minor injury in the SenticNet lexicon.

For example, Gangemi et al. [55, 126] proposed Sentilo, a frame-based semantic model for sentence-level sentiment detection. The proposed model is able to (i) detect topics and opinion holders (i.e., entities that hold an opinion) in sentences and (ii) determine the sentiment expressed by an opinion holder towards a given topic. To this end, the model performs extensive syntactical, lexical and semantic analyses of sentences. Concisely, sentiment detection in Sentilo consists of two steps. In the first step, a sentence like “Anna says the weather will become beautiful” is semantically represented as a sequence of interconnected events (e.g., the weather will become beautiful) and objects/agents (e.g., Anna, Weather). To this end, the authors used Fred,45 a tool for semantic knowledge extraction and representation of sentences [121]. In the second step, the semantic representation of the sentence is used along with the SenticNet and SentiWordNet lexicons to detect the opinionated topics, subtopics, opinion holders and their sentiment. Sentilo was evaluated on 50 manually annotated sentences from the Europarl corpus,46 with results showing F-measures of 95%, 68% and 78% for opinion holder, topic and subtopic detection respectively.

Conceptual Semantics on Twitter

Based on the above review, one may notice that most existing conceptual semantic approaches to sentiment analysis have mainly been applied and tested on conventional text. As for Twitter, however, very few attempts at exploiting conceptual semantics for sentiment detection of tweets can be found [57, 62, 80, 102].
Gezici et al. [57] used SenticNet as an ordinary sentiment lexicon, along with SentiWordNet, to extract the prior sentiment of words and used it to train a supervised regression classifier for sentiment classification of tweets. However, neither concept detection nor semantic inference was performed to this end.

45 http://stlab.istc.cnr.it/stlab/FRED 46 European Parliament Proceedings Parallel Corpus: http://www.statmt.org/europarl/

Kontopoulos et al. [80] proposed an ontology-based approach to sentiment detection of product aspects on Twitter. To this end, the authors built a domain-specific ontology for smart phones by first retrieving tweets that contain specific concepts (e.g., “#smartphone”) from a predefined list. An ontology engineer was asked, afterwards, to manually extract the objects (e.g., “Apple iPhone”, “Samsung Galaxy”) and their attributes (e.g., “Display”, “Battery”) from the tweets. The evaluation was done by querying Twitter with the object-attribute pairs in the ontology. The sentiment of tweets with regard to the attributes (aspects) in them was calculated using OpenDover, an online commercial tool for sentiment analysis.47 Although the proposed approach showed a relatively high recall, its main limitation comes from the high level of intervention by human experts when building or validating ontologies. In practical applications on Twitter, building an ontology for each product is unscalable and too expensive.
Another recent work in this vein is by Montejo-Ráez et al. [102], who conceptualised words in tweets by means of their synsets48 in WordNet. In particular, word sense disambiguation was performed on words in tweets using the UKB method [2] in order to assign a unique synset to each word. After that, the authors applied a random walk process over WordNet to expand the original list of synsets. Finally, the polarity of a tweet is calculated as an average of the polarities of all of its synsets; SentiWordNet was used to this end. Evaluation was conducted on 359 tweets from the STS corpus [60]. Results showed that weighting the prior sentiment of words in SentiWordNet using their extracted synsets improved the overall performance by up to 23.12% in F1 over using SentiWordNet with the most-common-sense weighting scheme. However, compared to other supervised sentiment analysis approaches, including [60, 16, 152, 137], the performance of the proposed approach was 19% lower in F1, on average.
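The aggregation step in [102] reduces to averaging synset polarities; the sketch below illustrates this, assuming hypothetical SentiWordNet-style scores and leaving out the word sense disambiguation and random walk steps.

    def tweet_polarity(synset_scores):
        # Average the prior polarities of a tweet's disambiguated synsets.
        return sum(synset_scores) / len(synset_scores) if synset_scores else 0.0

    # Hypothetical SentiWordNet-style scores, one per disambiguated synset.
    scores = [0.625, -0.125, 0.0, 0.25]
    print(tweet_polarity(scores))      # -> 0.1875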

47 http://opendover.nl/ 48 Sets of cognitive synonyms.

2.4 summary and discussion

Sentiment analysis of Twitter is increasingly moving into the focus of several research and commercial parties. Twitter, however, poses several challenges to conventional sentiment analysis approaches due to its distinct characteristics. These characteristics can be categorised with respect to two aspects, data and domain, as follows:

• Data: Tweet messages are very noisy in the sense that they often contain a large number of abbreviations, ill-formed words and irregular expressions. Moreover, tweets are usually written in informal English and composed of poorly-structured sentences. Such noisy characteristics often affect the performance of traditional sentiment analysis approaches.

• Domain: Twitter is an open social environment, where there are no restrictions on what users can tweet about. Approaches to analysing sentiment from a given Twitter corpus are therefore required to adapt to the domain and topic of the tweets in that corpus. As discussed earlier, this is because the sentiment of words often changes with respect to their context in tweets.

The above challenges introduce the necessity of developing sentiment analysis approaches that are more tolerant to the noisy and sparse nature of tweets and more flexible towards the dynamic and frequent emergence of new domains and topics on Twitter. In this chapter, we reviewed existing approaches to sentiment analysis on Twitter, highlighting their strengths and weaknesses with respect to the aforementioned challenges. We showed that these approaches can be categorised as (i) traditional (non-semantic) approaches (Section 2.2.1), and (ii) semantic approaches (Section 2.3).
Traditional approaches are those which rely on the syntactical parts of tweet texts that explicitly express sentiment (i.e., opinionated words, emoticons, intensifiers, etc.). Several machine learning and lexicon-based methods that follow this principle have been proposed. Machine learning methods (Section 2.2.1.1) are usually supervised and, therefore, require training from large annotated corpora. They are also domain-dependent, i.e., they require retraining for each new domain. Moreover, the data sparseness of tweets is a challenging problem for supervised machine learning methods and its impact on performance should be considered. However, the literature shows that very few works have tried to address this issue, by devising sophisticated dimensionality reduction techniques or heavy feature engineering work (Section 2.2.1.1), whose applicability in the real world is yet to be investigated.
Lexicon-based methods (Section 2.2.1.2), on the other hand, rely on sentiment lexicons for sentiment assignment of the opinionated content in tweets. Therefore, these methods seem to be a better fit for Twitter than machine learning methods, since they do not need training from labelled corpora. Nevertheless, we saw that the sentiment given by most existing lexicon-based methods is often too general and does not reflect the context of words in tweets. Also, sentiment lexicons are often composed of a fixed number of words that do not cover the wide range of new words and expressions that constantly emerge on Twitter. Hence, lexicon-based methods on Twitter should consider the context of words when detecting their sentiment in tweets rather than merely relying on the prior sentiment provided by general-purpose sentiment lexicons. In Section 2.2.1.2 we saw that a reasonable solution to the aforementioned limitation is to adapt sentiment lexicons to the domain or general context of the analysed tweets. However, only one of the reviewed methods considers updating its sentiment lexicon, and the updating method requires training from a human-coded corpus [159].
In addition to the aforementioned limitations, traditional approaches (both machine learning and lexicon-based) are fully dependent on the presence of words or syntactical features that explicitly reflect sentiment. However, the sentiment of a word, as discussed earlier, is usually implicitly associated with its semantics in context. Words in general have different semantics in different contexts and, therefore, different sentiment [33]. This limitation has led to the emergence of various sentiment analysis models that extract the semantics of words based on their context and use them for sentiment analysis, creating a sub-field of sentiment analysis conventionally called semantic sentiment analysis (Section 2.3). Two types of approaches to semantic sentiment analysis have been described in this chapter: contextual and conceptual approaches.
Contextual semantic approaches (Section 2.3.1) derive the semantics of words from their co-occurrence patterns in text. They are based on the assumption that words that co-occur in similar contexts tend to have similar semantics. Extracting the semantics of words, therefore, often involves studying their co-occurrence patterns and consequently detecting their contextual sentiment. Three statistical techniques have been used in this line of work: Semantic Orientation from Association (SOA), Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). SOA-based methods are domain- and

context-insensitive because they rely on the statistical correlations between words in very general and global data sources (e.g., the web). On the other hand, LSA- and LDA-based methods are corpus-based, that is, they extract the latent semantics of words from the studied corpus. Thus, they are usually better at capturing the sentiment of words at more specific context levels (i.e., corpus level). Nevertheless, in addition to their computational runtime complexity, these methods usually neglect the syntactical structure of sentences and the order of words in them. Therefore, they are likely to fail in detecting the sentiment in simple cases, such as the presence of negation.
Conceptual semantic approaches (Section 2.3.2), on the other hand, make use of external knowledge sources (ontologies and semantic networks) along with semantic reasoning techniques to extract the latent concepts in text with their associated sentiment [33, 55]. Unlike contextual semantic approaches, conceptual semantic approaches require a deep understanding of the syntactical structure of the sentence together with accurate detection of, and reasoning about, the words’ syntactical functions within it. On the one hand, such extensive analysis leads to a noticeable gain in sentiment detection performance over the traditional and the contextual semantic approaches. It also allows for more fine-grained sentiment analysis tasks, such as sentence-, phrase- and entity-level sentiment analysis, due to the ability of these approaches to detect the context and semantics of words at these levels.
On the other hand, most conceptual semantic approaches are designed to work on formal text, where well-written and well-structured sentences are usually a prerequisite for proper functioning. Tweet messages, however, lack such formality and sentence structure, as described earlier. Moreover, conceptual semantic approaches rely on heavy text processing, dimensionality reduction, and semantic parsing and reasoning techniques. Last but not least, these kinds of approaches are limited by the external knowledge base used, which often provides a limited number of concepts or concept-associated sentiments. In Section 2.3.2, we have seen that the amount of work on conceptual sentiment analysis on Twitter is still limited. Therefore, without a comprehensive evaluation, the accuracy and time-efficiency of conceptual semantic approaches on Twitter are yet to be measured.
Overall, based on our review in this chapter, a common and serious limitation of both contextual and conceptual semantic approaches to sentiment analysis is that they are not tailored to Twitter and the like. Together with the limitations of traditional or non-semantic approaches, these constitute the motivation behind the research work in this thesis.

2.4.1 Discussion

In this chapter we presented an overview of existing works on sentiment analysis on Twitter, highlighting their properties and pointing to their strengths and weaknesses. Based on our review and Twitter’s characteristics, one could conclude that, whether machine learning or lexicon-based, a successful Twitter sentiment analysis approach should possess the following two main properties:

• Semantic sentiment analysis, a prerequisite for efficiently detecting the sentiment in tweets, whether it is associated with the context of words (contextual sentiment) or conveyed within the latent semantic concepts or semantic relations of words (conceptual sentiment).

• Tailored to Twitter, which denotes the ability to cope with the sparse nature of tweets and the informal language of their words, expressions and sentences.

Table 1 categorises all types of approaches reviewed and analysed in this chapter with respect to the above properties.

                              Traditional Approaches             Semantic Approaches
Property                      Supervised-ML    Lexicon-based     Contextual          Conceptual
Semantic Sentiment Analysis   No               No                Yes                 Yes
Tailored to Twitter           No               Partly            No                  No
Supervision                   Supervised       Unsupervised      Unsupervised        Unsupervised
Domain Adaptation             Not Adaptable    Not Adaptable     Partly Adaptable    Adaptable
Runtime Complexity            Varies           Low               High                High

Table 1: Main and secondary properties of the four types of approaches to sentiment analysis on Twitter. ML: Machine Learning.

While the above two properties are desirable due to their impact on sentiment analysis performance, there are other, secondary properties that sentiment analysis approaches on Twitter may or may not have, as shown in Table 1. These include (i) the type of sentiment classifier (unsupervised, supervised), (ii) the ability to adapt to new domains or topics and (iii) the degree of complexity. These properties are secondary in the

sense that their absence in an approach may impact its performance positively or negatively based on the dataset analysed and the sentiment analysis application. For example, supervised machine learning methods are recommended over lexicon-based methods for polarity classification on a fixed and relatively small Twitter corpus of a specific domain, where annotated tweet samples are provided. This is because supervised methods can quickly adapt to the domain, topic, or context of the corpus by training them on the annotated samples, and therefore provide more accurate sentiment analysis than the lexicon-based methods. On the other hand, on a stream of tweets of diverse topics and domains, unsupervised or lexicon-based methods are preferable to supervised approaches. Based on Table 1 we can summarise the strengths and weaknesses of each approach reviewed in this chapter as follows:

1. Traditional or non-semantic approaches do not consider the semantics of words for sentiment analysis, but merely rely on the presence of words or syntactical features that explicitly reflect sentiment in tweets. These approaches can be divided into machine learning and lexicon-based approaches.

a) Supervised machine learning approaches are domain-dependent, require training from annotated corpora and often face the data sparseness problem on Twitter.

b) Lexicon-based approaches use general-purpose sentiment lexicons that, most of the time, need to be extended and adapted to the context of the words under analysis. A few existing lexicon-based methods handle Twitter’s noisy nature, and these tend to be more effective on Twitter than other approaches.

2. Semantic sentiment analysis approaches rely on the latent semantics of words that convey sentiment in text. They are generally designed to function on conventional text, which makes them less tailored to Twitter. These approaches can be divided into contextual and conceptual semantic approaches.

c) Contextual semantic approaches are normally based on the bag-of-word paradigm, extract the contextual semantics from words’ co-occurrence patterns, and often ignore cases like negation and mixed sentiment in tweets.

d) Conceptual semantic approaches are based on the bag-of-concept paradigm, and require external knowledge bases along with extensive text processing and semantic reasoning for semantic and sentiment extraction.

Lastly, the review and discussion conducted in this chapter suggest that sentiment analysis on Twitter still lacks the use of semantics, both contextual and conceptual. This lack of research constitutes the main research problem and motivation behind the work in this thesis. It also forms the main research question of this thesis, as explained earlier in Chapter 1:

Could the semantics of words boost sentiment analysis performance on Twitter?

In the following chapters we introduce new approaches and solutions to address the above question, along with the experimental work and evaluation conducted to measure the efficiency of our proposed approaches.

Part II

SEMANTIC SENTIMENT ANALYSIS OF TWITTER

All our work, our whole life is a matter of semantics, because words are the tools with which we work, the material out of which laws are made, out of which the Constitution was written. Everything depends on our understanding of them.

Felix Frankfurter

3

CONTEXTUAL SEMANTICS FOR SENTIMENT ANALYSIS OF TWITTER

In this chapter we explore the use of contextual semantics in lexicon-based sentiment analysis on Twitter. In essence, we propose a new approach for capturing the contextual semantics and sentiment of words from tweets, and evaluate the effectiveness of the proposed approach in three sentiment analysis tasks on Twitter. Our conclusion is that lexicon-based methods that consider contextual semantics produce, in most cases, higher performance than those that merely rely on affect words for sentiment analysis.

3.1 introduction

Traditional lexicon-based approaches to sentiment analysis on Twitter have two main limitations, as discussed in Chapter 2. Firstly, the number of words in sentiment lexicons is finite. This may constitute a problem when extracting sentiment from very dynamic environments such as Twitter, where new terms, abbreviations and malformed words constantly emerge. Secondly, and more importantly, lexicon-based approaches assign static sentiment values to words, regardless of the contexts in which these words are used and the semantics they convey. In many cases, however, the sentiment of a word is implicitly associated with its semantics in a given context (aka contextual semantics) [167, 33, 139]. For example, the word “great” should be negative in the context of a “problem”, and positive in the context of a “smile”.
In this chapter, we investigate extracting and using the contextual semantics of words in sentiment analysis on Twitter, aiming mainly at addressing the above limitations of traditional lexicon-based approaches and consequently improving their performance. The research question we aim to address in this chapter is:

RQ1 Could the contextual semantics of words enhance lexicon-based sentiment analysis performance?


To this end, we propose a semantic representation model called SentiCircle, which captures the contextual semantics of words from their co-occurrence patterns in tweets and calculates their sentiment orientation accordingly. Recall that the main principle behind the notion of contextual semantics comes from the dictum “You shall know a word by the company it keeps!” [52]. This suggests that words that co-occur in a given context tend to have a certain relation or semantic influence [177, 167], which we try to capture with the SentiCircle model.
To assess whether the use of contextual semantics helps enhance the performance of lexicon-based approaches, we evaluate the SentiCircle model in three sentiment analysis tasks on Twitter: entity-level sentiment analysis (i.e., detect the sentiment of individual named entities on Twitter), tweet-level sentiment analysis (i.e., detect the overall sentiment of a tweet) and context-aware sentiment lexicon adaptation (i.e., amend the prior sentiment of words in a given sentiment lexicon with respect to their contextual semantics in the dataset under analysis). To this end, we design several lexicon-based and rule-based methods on top of SentiCircles for each sentiment analysis task.
Evaluation under each sentiment analysis task includes using multiple Twitter datasets and sentiment lexicons. Results show that our proposed methods, based on SentiCircles, overcome the above two limitations that non-semantic lexicon-based approaches often face on Twitter. First, for entity-level sentiment analysis our methods outperform all other baselines by 30-40% for subjectivity detection, and by 2-15% for polarity detection, in average F-measure. Secondly, for tweet-level sentiment analysis, results show that SentiCircle outperforms the state-of-the-art method, SentiStrength, in accuracy by 4-5% on two datasets, and in F-measure by 1% on one dataset. Lastly, sentiment lexicons adapted by our proposed methods, when used for polarity detection, improve accuracy by 1-2%, and F-measure by 1-4%, on two datasets over using general lexicons with no semantic adaptation.
The rest of this chapter is organised as follows: Section 3.2 presents our proposed SentiCircle representation and its use for capturing the contextual sentiment of words. Section 3.3.2 demonstrates how to use SentiCircles for tweet- and entity-level sentiment analysis. Evaluation setup and results are covered in Sections 3.3.3 and 3.3.4. Adapting sentiment lexicons with SentiCircles is presented along with evaluation results and findings in Section 3.4. Runtime analysis of SentiCircles is introduced in Section 3.5. Our findings are discussed in Section 3.6. Finally, we summarise our work in this chapter in Section 3.7.

3.2 senticircles: capturing and representing contextual semantics for sentiment analysis

In this section we introduce our SentiCircle representation and its use for capturing the contextual semantics and sentiment of words.

3.2.1 Overview

SentiCircle aims to represent the sentiment orientation of words with respect to their contextual semantics. The main notion behind this is that the sentiment of a term is not static, as in traditional lexicon-based approaches, but rather depends on the context in which the term is used, i.e., it depends on its contextual semantics. For example, most existing sentiment analysis methods fail to detect the sentiment of the tweet “#Syria. Execution continues with smile! :( #ISIS”, since they consider the presence of the word “smile” positive, even though it is used within the context of the negative word “Execution”.
To capture the contextual semantics of a term, we follow the distributional semantic hypothesis that words that occur in similar contexts tend to have similar meaning [177, 167]. Therefore, the contextual semantics of a term in our approach is computed from its co-occurrence patterns with other terms. Note that we define context as a textual corpus or a set of tweets. Thus, in order to understand the semantics and the sentiment of a target word like “ISIS” (Islamic State in Iraq and Syria), our method relies on the words that co-occur with the target word in a given collection of tweets. These co-occurrences are mathematically represented as a 2D geometric circle, where the target word (“ISIS”) is at the centre of the circle and each point in the circle represents a context word that co-occurs with “ISIS” in the tweet collection (see Figure 9). The position of each context term (e.g., “Kill”) in the circle determines its importance and sentiment towards the target term. Such a position is defined by an angle (representing the prior sentiment of the context term) and a radius (representing the correlation between the context term and the target term), as will be explained in the subsequent section.
The rationale behind using this circular representation, which will become clearer later, is to benefit from the trigonometric properties it offers for estimating the sentiment orientation, and strength, of terms. It also enables us to calculate the impact of context

Figure 9: Example SentiCircle for the word “ISIS”. Terms positioned in the upper half of the circle have positive sentiment while terms in the lower half have negative sentiment.

words on the sentiment orientation (i.e., positive, negative, neutral) and on the sentiment strength (e.g., weak positive, strong negative, etc.) of a target word separately, which is difficult to do with traditional vector representations.
In the rest of this section we describe our proposed pipeline for constructing the SentiCircle of a given term and how to use it to measure the term’s contextual sentiment orientation and strength.

3.2.2 SentiCircle Construction Pipeline

Figure 10 shows the pipeline of extracting and using SentiCircles to calculate the contextual semantics and sentiment of terms, which can be summarised in the following steps:

Figure 10: The systematic pipeline of the SentiCircle approach for contextual sentiment analysis (Tweets → Term Indexing → Context Vector Generation → SentiCircle Generation → Contextual Sentiment Identification, using an external Sentiment Lexicon).

• Term Indexing: This step creates an index of terms (term-index) from a collection of tweet messages. Several text processing procedures are applied during the process, such as filtering non-English terms,1 Part-Of-Speech tagging and negation handling (Section 3.2.2.1).

• Context Vector Generation: This step represents each term m as a vector of all its context terms (i.e., terms that occur with m in the same context) in the tweets. For each context term, we calculate (i) its degree of correlation to the term m and (ii) a prior (initial) sentiment score using an external sentiment lexicon. (Section 3.2.2.2).

• SentiCircle Generation: This step converts the context vector of m into a 2D geometric circle, which is composed of points denoting the context terms of m. Each context term is located in the circle based on its angle (defined by its prior sentiment) and its radius (defined by its degree of correlation with the term m, measured as explained below) (Section 3.2.2.3).

• Contextual Sentiment Identification: Here, we calculate the contextual sentiment of the term m using the trigonometric properties of its SentiCircle. (Section 3.2.2.4).

Word Context: As mentioned before, the SentiCircle model aims to capture the sentiment of a word based on its context in the tweet data. The context of a word can usually be defined at different levels, including: (i) the sentence level, where the word’s context is extracted from the sentence in the tweet where the word occurs; (ii) the tweet level, where the context of the word is extracted from the tweet where the word occurs (given that a tweet may consist of multiple sentences and/or phrases); and (iii) the corpus level, where the context is extracted from a collection of tweets in which the word occurs. In this work we choose to extract the context of a word in the SentiCircle model at the corpus level, as will be explained in Section 3.2.2.2. This is because our aim here is to extract the collective, or aggregated, sentiment of the word in a given Twitter corpus.

3.2.2.1 Term Indexing

The first step in our pipeline is to index the unique terms in a given collection of raw tweets. To this end, we apply the following processing steps:

• Remove all non-ASCII and non-English characters.

• Extract the part-of-speech tag (POS tag) of all terms. To this end we use the TweetNLP POS tagger,2 which is built to work specifically on tweets data.

1 By removing non-ASCII characters. 2 http://www.ark.cs.cmu.edu/TweetNLP/

• Find and index terms which appear in the tweet within the vicinity of a negation as negated terms. The purpose of this step, as will be shown shortly, is to invert the sentiment of such terms when constructing SentiCircles. For example, in the tweet “iPad is not amazing!”, the term “amazing” is preceded by a negation. Therefore, we invert its original sentiment from positive to negative. The negation words (e.g., not, don’t, can’t, etc.) are collected from the General Inquirer under the NOTLW category.3

The output of this step is an index of unique and processed terms which we refer to as the Term-Index.
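A simplified sketch of this indexing step is given below; the tokenisation, the negation list (a small subset of the General Inquirer NOTLW category) and the placeholder POS field are illustrative assumptions rather than the exact implementation, which uses the TweetNLP tagger.

    import re

    NEGATIONS = {"not", "no", "never", "don't", "can't", "isn't"}  # illustrative subset

    def index_tweet(tweet, term_index):
        """Add the tweet's terms to the index, marking terms preceded by a negation."""
        tokens = re.findall(r"[A-Za-z']+", tweet)   # crude ASCII-only tokenisation
        negate_next = False
        for token in tokens:
            term = token.lower()
            if term in NEGATIONS:
                negate_next = True
                continue
            # "pos" is a placeholder for the POS tag produced by the TweetNLP tagger.
            term_index.setdefault(term, {"pos": None, "negated": negate_next})
            negate_next = False
        return term_index

    print(index_tweet("iPad is not amazing!", {}))
    # -> 'amazing' is indexed with negated=True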

3.2.2.2 Context Vector Generation

The SentiCircle representation is based on co-occurrences between terms in tweets and the influence these terms have on each other. Therefore, this step aims to represent each term as a vector of its co-occurrences with all other terms in the tweet collection. We refer to this vector as the Context Vector, and it can be defined and constructed as below.

Definition (Context Vector) Given a set of tweet messages T and a Term-Index D, the context vector of a term m ∈ D is a vector c = (c_1, c_2, ..., c_n) of terms that occur with m in any tweet in T.

The contextual semantics of m is determined by its semantic relation to each context term c_i ∈ c. We compute the individual semantic relation between m and a context term c_i by assigning the following two main features to c_i:

• Prior Sentiment Score: Each context term c_i is assigned a prior sentiment score based on its POS tag(s) by using one of the three external sentiment lexicons used in this work (Section 4.2.3.3). If the term c_i appears in the vicinity of a negation, its prior sentiment score is negated.

• Term Degree of Correlation (TDOC): This feature represents the degree of correlation between a term m and its context term c_i ∈ c (i.e., how important c_i is to m). Inspired by the TF-IDF weighting scheme, we compute the value of this feature as:

    TDOC(m, c_i) = f(c_i, m) × log( N / N_{c_i} )    (3)

where f(c_i, m) is the number of times c_i occurs with m in tweets, N is the total number of terms, and N_{c_i} is the total number of terms that occur with c_i.
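The following is a minimal sketch of Equation (3), assuming co-occurrence counts have already been collected from the tweet collection; all counts shown are hypothetical.

    import math

    def tdoc(m, c, cooc, n_terms, n_with):
        # Equation (3): co-occurrence count weighted by an IDF-like factor.
        return cooc[m][c] * math.log(n_terms / n_with[c])

    # Hypothetical counts gathered from a tweet collection.
    cooc = {"ISIS": {"kill": 12, "mercy": 2}}
    n_with = {"kill": 150, "mercy": 40}
    print(round(tdoc("ISIS", "kill", cooc, n_terms=10000, n_with=n_with), 2))  # ~ 50.4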

3 http://www.wjh.harvard.edu/~inquirer/NotLw.html

3.2.2.3 SentiCircle Generation

Now we have, for each term m, a vector of its context terms c along with the two mutual semantic features between m and each c_i ∈ c. From this information, we represent the contextual semantics of the term m as a geometric circle, the SentiCircle, where the term is situated at the centre of the circle and each point around it represents a context term c_i. The position of c_i is defined jointly by its prior sentiment and its term degree of correlation (TDOC). Formally, a SentiCircle in a polar coordinate system can be represented by the following equation:

    r^2 − 2 r r_0 cos(θ − φ) + r_0^2 = a^2    (4)

where a is the radius of the circle, (r_0, φ) is the polar coordinate of the centre of the circle, and (r, θ) is the polar coordinate of a context term on the circle. For simplicity, we assume that our SentiCircles are centred at the origin (i.e., r_0 = 0). Hence, to build a SentiCircle for a term m, we only need to calculate, for each context term c_i in the circle, a radius r_i and an angle θ_i. To do that, we use the prior sentiment score and the TDOC value of the term c_i as:

    r_i = TDOC(m, c_i)
    θ_i = Prior_Sentiment(c_i) × π    (5)

We normalise the radii of all the terms in a SentiCircle to a scale between 0 and 1. Hence, the radius a of any SentiCircle is equal to 1. Also, all angle values are in radians.

Figure 11: SentiCircle of a term m, divided into the “Very Positive”, “Positive”, “Negative” and “Very Negative” quadrants, with a small neutral region close to the positive X-axis.

The SentiCircle in the polar coordinate system can be divided into four sentiment quadrants, as shown in Figure 11. Terms in the two upper quadrants have a positive sentiment (sin θ > 0), with the upper left quadrant representing stronger positive sentiment since it contains larger angle values than the upper right quadrant. Similarly, terms in the two lower quadrants have negative sentiment values (sin θ < 0). Although the radius of the SentiCircle of any term m equals 1, the points representing the context terms of m in the circle have different radii (0 ≤ r_i ≤ 1), which reflect how important each context term is to m. The larger the radius, the more important the context term is to m. We can move from the polar coordinate system to the Cartesian coordinate system by simply using the trigonometric functions sine and cosine as:

    x_i = r_i cos θ_i,    y_i = r_i sin θ_i    (6)

Moving to the Cartesian coordinate system allows us to use the trigonometric properties of the circle to encode the contextual semantics of a term in the circle as sentiment orientation and sentiment strength. The Y-axis in the Cartesian coordinate system defines the sentiment of the term, i.e., a positive y value denotes a positive sentiment and vice versa. The X-axis defines the sentiment strength of the term: the smaller the x value, the stronger the sentiment.4 Moreover, a small region called the “Neutral Region” can be defined. This region, as shown in Figure 11, is located very close to the X-axis in the “Positive” and the “Negative” quadrants only; terms lying in this region have very weak sentiment (i.e., |θ| ≈ 0). The “Neutral Region” has a crucial role in measuring the overall sentiment of a given SentiCircle, as will be shown in the subsequent sections.
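The sketch below illustrates how a single context term is placed on the SentiCircle using Equations (5) and (6), assuming a normalised TDOC value and a prior sentiment score in [-1, 1]; the values are hypothetical.

    import math

    def senticircle_point(tdoc_norm, prior_sentiment):
        r = tdoc_norm                      # radius: normalised TDOC value (Eq. 5)
        theta = prior_sentiment * math.pi  # angle: prior sentiment scaled to radians
        x = r * math.cos(theta)            # sentiment strength axis (Eq. 6)
        y = r * math.sin(theta)            # sentiment orientation axis (Eq. 6)
        return x, y

    # "kill" as a context term of "ISIS": strong correlation, negative prior sentiment.
    x, y = senticircle_point(0.8, -0.6)
    print(round(x, 3), round(y, 3))        # y < 0, i.e. a negative contribution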

Note that in the extreme case where r_i = 1 and θ_i = π, we position the context term c_i in the “Very Positive” or the “Very Negative” quadrant based on the sign of its prior sentiment score.
Figure 12 shows the SentiCircles of the entities “iPod” and “Taylor Swift”. Terms (i.e., points) inside each circle are positioned in a way that represents their sentiment scores and their importance (degree of correlation) to the entity. For example, “Awesome” in the SentiCircle of “Taylor Swift” has a positive sentiment and a high importance score; hence it is positioned in the “Very Positive” quadrant (see Figure 12(b)). The word “Pretty”, in the same circle, also has a positive sentiment, but a lower importance score than the word “Awesome”; hence it is positioned in the “Positive” quadrant. We also notice that there are some words that appear in both circles, but in different positions. For example, the word “Love” has a stronger positive sentiment strength with “Taylor

4 This is because cos θ < 0 for large angles.

Figure 12: Example SentiCircles for “iPod” (a) and “Taylor Swift” (b). We have removed points near the origin for easy visualisation. Dots in the upper half of the circle (triangles) represent terms bearing a positive sentiment, while dots in the lower half (squares) are terms bearing a negative sentiment.

As described earlier, the contributions of both quantities (prior sentiment and degree of correlation) are calculated and represented separately in the SentiCircle, by means of the projection of the context term along the X-axis (sentiment strength) and the Y-axis (sentiment orientation). Such a level of granularity is crucial when we need, for example, to filter out those context words that make a low contribution towards the sentiment orientation or strength of the target word.

3.2.2.4 Senti-Median: The Overall Contextual Sentiment Value

The previous examples in Section 3.2.2.3 show that, although we use external lexicons to assign initial sentiment scores to terms, our SentiCircle representation is able to amend these scores according to the context in which each term is used. The last step of our pipeline is to compute the overall contextual sentiment of the term based on its SentiCircle. To this end, we use the Senti-Median metric.

Senti-Median. We now have the SentiCircle of a term m, which is composed of the set of (x, y) Cartesian coordinates of all the context terms of m, where the y value represents the sentiment and the x value represents the sentiment strength. An effective way to approximate the overall sentiment of a given SentiCircle is to calculate the geometric median of all its points. Formally, for a given set of n points (p1, p2, ..., pn) in a SentiCircle Ω, the 2D geometric median g is defined as:

g = arg min_{g ∈ R²} Σ_{i=1}^{n} ||pi − g||2    (7)

where the geometric median is the point g = (xg, yg) whose sum of Euclidean distances to all the points pi is minimal. We call the geometric median g the Senti-Median, as it captures the sentiment (y-coordinate) and the sentiment strength (x-coordinate) of the SentiCircle of a given term m. Following the representation provided in Figure 11, the sentiment of the term m depends on whether the Senti-Median g lies inside the neutral region, the positive quadrants, or the negative quadrants. Formally, given a Senti-Median gm of a term m, the term-sentiment function L works as:

L(gm) = { negative  if yg < −λ
          positive  if yg > +λ
          neutral   if |yg| ≤ λ and xg > 0 }    (8)

where λ is the threshold that defines the Y-axis boundary of the neutral region. Section 3.3.3.4 illustrates how this threshold is computed.
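The thesis does not specify which solver is used for Equation 7; a common choice for the 2D geometric median is Weiszfeld's iterative algorithm, sketched below together with the term-sentiment function of Equation 8. The fallback for the case not covered by Equation 8 (|y| ≤ λ with x ≤ 0) is our assumption.

```python
import math

def geometric_median(points, max_iter=100, eps=1e-6):
    """Approximate the 2D geometric median of a list of (x, y) points (Equation 7)
    using Weiszfeld's algorithm, starting from the centroid."""
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < eps:                      # the median coincides with a data point
                return (px, py)
            num_x += px / d
            num_y += py / d
            denom += 1.0 / d
        new_x, new_y = num_x / denom, num_y / denom
        if math.hypot(new_x - x, new_y - y) < eps:
            return (new_x, new_y)
        x, y = new_x, new_y
    return (x, y)

def term_sentiment(senti_median, lam):
    """Term-sentiment function L of Equation 8; lam is the neutral-region boundary."""
    x, y = senti_median
    if y < -lam:
        return "negative"
    if y > lam:
        return "positive"
    if x > 0:                                # |y| <= lambda and x > 0: neutral region
        return "neutral"
    # |y| <= lambda with x <= 0 is not covered by Equation 8;
    # falling back to the sign of y here is our assumption.
    return "positive" if y > 0 else "negative"
```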

3.3 senticircles for sentiment analysis

So far in this chapter, we have introduced our SentiCircle approach and described how it can be used to capture the contextual semantics and sentiment of words on Twitter. In this section we show how to use SentiCircles in two different sentiment analysis tasks: entity- and tweet-level sentiment detection.

3.3.1 Entity-level Sentiment Detection

In this section we explain how to use the SentiCircle representation for sentiment analysis at the entity level, i.e., identifying the sentiment of individual named-entities in a given tweet collection.

Given an entity, ei ∈ E, and its corresponding SentiCircle representation, the sentiment of the entity is given by the position of the Senti-Median g of the entity’s SentiCircle (see Function 8). If the Senti-Median g lies inside the “Neutral Region”, the entity will have a neutral sentiment. If g lies in one of the positive quadrants, the entity will have a positive sentiment, and if g lies in one of the negative quadrants, the entity will have a negative sentiment.

3.3.2 Tweet-level Sentiment Detection

There are several ways in which the SentiCircle representations of the terms that compose a tweet can be used to determine the tweet’s overall sentiment. For example, the tweet “iPhone and iPad are amazing” contains five terms, each with an associated SentiCircle representation. These five SentiCircles can be combined in different ways in order to extract the sentiment associated with the tweet. In this section we propose three different methods that use the SentiCircle representation for tweet-level sentiment detection.

3.3.2.1 The Median Method

This method works by representing each tweet message ti ∈ T as a vector of Senti-Medians g = (g1, g2, ..., gn) of size n, where n is the number of terms that compose the tweet and gj is the Senti-Median of the SentiCircle associated with term mj. Equation 7 is then used to calculate the median point q of g, which we use to determine the overall sentiment of tweet ti using Function 8.
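Reusing geometric_median and term_sentiment from the sketches above, the Median method reduces to a few lines; how terms without a SentiCircle are handled is our assumption.

```python
def tweet_sentiment_median(tweet_terms, senti_medians, lam):
    """Median method: classify a tweet from the geometric median of the
    Senti-Medians of its terms (Equations 7 and 8).

    senti_medians: dict mapping a term to its Senti-Median (x, y).
    """
    points = [senti_medians[t] for t in tweet_terms if t in senti_medians]
    if not points:
        return "neutral"                     # assumption: no evidence -> neutral
    q = geometric_median(points)             # median point of the Senti-Median vector
    return term_sentiment(q, lam)
```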

3.3.2.2 The Pivot Method

This method favours some terms in a tweet over others, based on the assumption that sentiment is often expressed towards one or more specific targets, which we refer to as “Pivot” terms. In the tweet example above, there are two pivot terms, “iPhone” and “iPad”, since the sentiment word “amazing” is used to describe both of them. Hence, the method works by (1) extracting all pivot terms in a tweet and (2) accumulating, for each sentiment label, the sentiment impact that each pivot term receives from the other terms. The overall sentiment of a tweet corresponds to the sentiment label with the highest sentiment impact. Opinion target identification is a challenging task and is beyond the scope of our work in this thesis. For simplicity, we assume that the pivot terms are those having the POS tags {Common Noun, Proper Noun, Pronoun} in a tweet. For each candidate pivot term, we build a SentiCircle from which the sentiment impact that the pivot term receives from all the other terms in the tweet can be computed. Formally, the Pivot Method seeks to find the sentiment ŝ that receives the maximum sentiment impact within a tweet as:

ŝ = arg max_{s ∈ S} Hs(p) = arg max_{s ∈ S} Σ_{i=1}^{Np} Σ_{j=1}^{Nw} Hs(pi, wj)    (9)

where s ∈ S = {Positive, Negative, Neutral} is the sentiment label, p is the vector of all pivot terms in a tweet, Np and Nw are the numbers of pivot terms and of remaining terms in the tweet respectively, and Hs(pi, wj) is the sentiment impact function, which returns the sentiment impact of a term wj in the SentiCircle of a pivot term pi. The sentiment impact of a term within the SentiCircle of a pivot term is the term’s Euclidean distance from the origin (i.e., the term’s radius), as shown in Figure 13. Note that the impact value is doubled for all terms located in either the “Very Positive” or the “Very Negative” quadrant.

Figure 13: The Pivot Method. The figure shows a SentiCircle of a pivot term pj. The sentiment strength Sj1 of word w1 with respect to the pivot term pj is the radius r1 in the SentiCircle, and likewise for Sj2.
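A minimal sketch of the Pivot Method follows, reusing the SentiCircle points built earlier. The universal POS tag set used to pick pivot candidates and the use of the neutral-region boundary λ to decide which label a contributing term votes for are illustrative assumptions; the quadrant test (x < 0 for the strong quadrants) follows Figure 11.

```python
import math

PIVOT_POS_TAGS = {"NOUN", "PROPN", "PRON"}   # common/proper nouns and pronouns (assumed tag set)

def pivot_method(tweet_terms, pos_tags, senticircles, lam):
    """Pivot Method (Equation 9): accumulate, per sentiment label, the impact each
    pivot term receives from the other terms in the tweet; return the winning
    label, or None if the tweet contains no pivot terms.

    senticircles: dict mapping a pivot term to its SentiCircle points {term: (x, y)}.
    """
    pivots = [t for t, tag in zip(tweet_terms, pos_tags) if tag in PIVOT_POS_TAGS]
    if not pivots:
        return None
    impact = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    for p in pivots:
        circle = senticircles.get(p, {})
        for w in tweet_terms:
            if w == p or w not in circle:
                continue
            x, y = circle[w]
            strength = math.hypot(x, y)       # impact = distance from the origin (radius)
            if x < 0:                         # "Very Positive"/"Very Negative" quadrants
                strength *= 2.0               # impact value is doubled
            label = "positive" if y > lam else "negative" if y < -lam else "neutral"
            impact[label] += strength
    return max(impact, key=impact.get)
```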

3.3.2.3 The Pivot-Hybrid Method

The Pivot Method, as described in the previous section, relies on both the syntactic structure of a tweet and the sentiment relations among its terms. As such, it may suffer from a lack of pivot terms when the tweet message is too short or contains many ill-formed words. In such cases, we resort to the Median method, and call this the Pivot-Hybrid method.
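Combining the two sketches above, the Pivot-Hybrid fallback can be expressed as follows.

```python
def pivot_hybrid(tweet_terms, pos_tags, senticircles, senti_medians, lam):
    """Pivot-Hybrid Method: use the Pivot Method when pivot terms exist,
    otherwise fall back to the Median Method."""
    label = pivot_method(tweet_terms, pos_tags, senticircles, lam)
    if label is not None:
        return label
    return tweet_sentiment_median(tweet_terms, senti_medians, lam)
```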

3.3.3 Evaluation Setup

As mentioned in Section 3.3, the contextual semantics captured by the SentiCircle representation are based on term co-occurrences in the corpus and an initial set of sentiment weights from a sentiment lexicon. We propose an evaluation setup that uses three different corpora (collections of tweets) and three different generic sentiment lexicons. This enables us to assess the influence of different corpora and lexicons on the performance of our SentiCircle approach.

3.3.3.1 Datasets

In this section, we present the three datasets used for the evaluation: OMD, HCR and STS-Gold. We use the OMD [47] and HCR [152] datasets5 to assess the performance of our approach at the tweet level only, since they provide human annotations for tweets but not for entities (i.e., each tweet is assigned a positive, negative or neutral sentiment label).6 Due to the lack of gold-standard datasets for evaluating entity-level sentiment analysis models, we have generated an additional dataset (STS-Gold) [138].7 This dataset contains both tweet and entity sentiment ratings and we therefore use it to assess the performance of SentiCircles at both the entity and the tweet levels. The numbers of positive and negative tweets within the three datasets are summarised in Table 2; the datasets are further described below:

Dataset          Tweets   Positive   Negative
OMD [47]         1081     393        688
HCR [152]        1354     397        957
STS-Gold [138]   2034     632        1402

Table 2: Twitter datasets used for the evaluation.

Obama-McCain Debate Dataset (OMD)
This dataset was constructed from 3,238 tweets crawled during the first U.S. presidential TV debate in September 2008 [47]. Sentiment ratings of these tweets were acquired using Amazon Mechanical Turk, where each tweet was rated by one or more voters as either positive, negative, mixed, or other. We only keep those tweets rated by at least three voters, with two-thirds of the votes being either positive or negative, to ensure their sentiment polarity. This resulted in a set of 1,081 tweets, 393 positive and 688 negative.

Health Care Reform Dataset (HCR)
The HCR dataset was built by crawling tweets containing the hashtag “#hcr” (health care reform) in March 2010 [14]. A subset of this corpus was manually annotated with three polarity labels (positive, negative, neutral) and split into training and test sets. In this work we focus on identifying positive and negative tweets and therefore we exclude neutral tweets from this dataset. This resulted in a set of 1,354 tweets, 397 positive and 957 negative.

5 https://bitbucket.org/speriosu/updown
6 We refer the reader to Appendix A for details about the construction and the annotation of the HCR and OMD datasets.
7 http://tweenator.com

Stanford Sentiment Gold Standard Dataset (STS-Gold)
We constructed this dataset as a subset of the Stanford Twitter Sentiment Corpus (STS) [60]. It contains 2,034 tweets (632 positive and 1,402 negative) and 58 entities (13 negative, 27 positive and 18 neutral entities), manually annotated by three different human evaluators. To avoid noisy or misleading data, the entities and tweets selected for this dataset are those for which the three human evaluators agreed on the same sentiment label. To our knowledge, the STS-Gold dataset is the first publicly available evaluation dataset that contains both tweet and entity sentiment ratings. In the following we describe the construction and the annotation processes of STS-Gold.

1. Data Acquisition
To construct this dataset, we first extracted all named entities from a collection of 180K tweets randomly selected from the original Stanford Twitter corpus. To this end, we used AlchemyAPI,8 an online service that allows for the extraction of entities from text along with their associated semantic concept class (e.g., Person, Company, City). After that, we identified the most frequent semantic concepts and selected, under each of them, the top 2 most frequent and 2 mid-frequent entities. For example, for the semantic concept Person we selected the two most frequent entities (Taylor Swift and Obama) as well as two mid-frequent entities (Oprah and Lebron). This resulted in 28 different entities along with their 7 associated concepts, as shown in Table 3. The next step was to construct and prepare a collection of tweets for sentiment annotation, ensuring that each tweet in the collection contains one or more of the 28 entities listed in Table 3. To this aim, we randomly selected 100 tweets from the remaining part of the STS corpus for each of the 28 entities, i.e., a total of 2,800 tweets. We further added another 200 tweets without specific reference to any entities, for a total of 3,000 tweets. Afterwards, we applied AlchemyAPI to the selected 3,000 tweets. Apart from the initial 28 entities, the extraction tool returned 119 additional entities, providing a total of 147 entities for the 3,000 selected tweets.

8 www.alchemyapi.com

Concept           Top 2 Entities         Mid 2 Entities
Person            Taylor Swift, Obama    Oprah, Lebron
Company           Facebook, Youtube      Starbucks, McDonalds
City              London, Vegas          Sydney, Seattle
Country           England, US            Brazil, Scotland
Organisation      Lakers, Cavs           Nasa, UN
Technology        iPhone, iPod           Xbox, PSP
HealthCondition   Headache, Flu          Cancer, Fever

Table 3: The 28 entities, with their semantic concepts, used to build STS-Gold.

2. Data Annotation
We asked three graduate students to manually label each of the 3,000 tweets with one of five classes: Negative, Positive, Neutral, Mixed and Other. The “Mixed” label was assigned to tweets containing mixed sentiment and “Other” to those for which it was difficult to decide on a proper label. The students were also asked to annotate each entity contained in a tweet with the same five sentiment classes. The students were provided with a booklet explaining both the tweet-level and the entity-level annotation tasks. The booklet also contains a list of key instructions, as shown in Appendix B. It is worth noting that the annotation was done using Tweenator,9 an online tool that we previously built to annotate tweet messages [137]. We measured the inter-annotator agreement using Krippendorff’s alpha metric [85], obtaining an agreement of αt = 0.765 for the tweet-level annotation task. For the entity-level annotation task, if we measured the sentiment of an entity in each individual tweet, we only obtained αe = 0.416, which is relatively low for the annotated data to be used. However, if we measured the aggregated sentiment for each entity, we got a very high inter-annotator agreement of αe = 0.964. To construct the final STS-Gold dataset we selected those tweets and entities for which our three annotators agreed on the sentiment labels, discarding any possible noisy data from the constructed dataset. As shown in Table 4, the STS-Gold dataset contains 13 negative, 27 positive and 18 neutral entities, as well as 1,402 negative, 632 positive and 77 neutral tweets. The STS-Gold dataset contains independent sentiment labels for tweets and entities, supporting the evaluation of tweet-based as well as entity-based Twitter sentiment analysis models.

9 http://tweenator.com

Class             Negative   Positive   Neutral   Mixed   Other
No. of Entities   13         27         18        -       -
No. of Tweets     1402       632        77        90      4

Table 4: Number of tweets and entities under each class in the STS-Gold dataset.

3.3.3.2 Sentiment Lexicons

As described in Section 3.2.2.3, the initial sentiments of terms in a SentiCircle are extracted from a sentiment lexicon (prior sentiment). We evaluate our approach using three external sentiment lexicons in order to study how the different prior sentiment scores of terms influence the performance of the SentiCircle representation for sentiment analysis. The aim is to investigate the ability of SentiCircles to update these context-free prior sentiment scores based on the contextual semantics extracted from different tweet corpora. We selected three state-of-the-art lexicons for this study: (i) the SentiWordNet lexicon [8], (ii) the MPQA subjectivity lexicon [176] and (iii) Thelwall-Lexicon [159, 160].

3.3.3.3 Baselines

We compare the performance of our proposed SentiCircle representation, when used for tweet- and entity-level sentiment analysis, against the following baselines:

Lexicon Labelling Method: This method uses the MPQA and SentiWordNet lexicons to extract the sentiment of a given text. If a tweet contains more positive words than negative ones, it is labelled as positive, and vice versa. For entity-level sentiment detection, the sentiment label of an entity is assigned based on the number of positive and negative words that co-occur with the entity in its associated tweets. In our evaluation, we refer to the method that uses the MPQA lexicon as MPQA-Method and to the method that uses the SentiWordNet lexicon as SentiWordNet-Method.

SentiStrength: SentiStrength [159, 160] is a state-of-the-art lexicon-based sentiment detection approach. It assigns to each tweet two sentiment strengths: a negative strength from -1 (not negative) to -5 (extremely negative) and a positive strength from +1 (not positive) to +5 (extremely positive). To use SentiStrength for tweet-level sentiment detection, a tweet is considered positive if its positive sentiment strength is 1.5 times higher than the negative one, and negative otherwise.10 For entity-level sentiment detection, the sentiment of an entity is assigned based on the total number of positive and negative tweets in which the entity occurs. The entity is positive if it occurs in more positive tweets than negative ones, and vice versa. The entity is neutral if it occurs in a similar number of positive and negative tweets. It is worth noticing that SentiStrength requires manually defined lexical rules, such as the existence of emoticons, intensifiers, negation and booster words (e.g., absolutely, extremely), to compute the average sentiment strength of a tweet.
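For clarity, the baseline decision rules described above can be written as follows; whether “a similar number” means strict equality, and whether the 1.5 ratio test is strict, are our assumptions.

```python
def sentistrength_tweet_label(pos_strength, neg_strength):
    """Tweet rule: positive if the positive strength exceeds 1.5 times the
    (absolute) negative strength, negative otherwise.
    pos_strength is in [+1, +5], neg_strength in [-5, -1]."""
    return "positive" if pos_strength > 1.5 * abs(neg_strength) else "negative"

def sentistrength_entity_label(n_positive_tweets, n_negative_tweets):
    """Entity rule: label derived from the counts of positive/negative tweets
    in which the entity occurs."""
    if n_positive_tweets > n_negative_tweets:
        return "positive"
    if n_negative_tweets > n_positive_tweets:
        return "negative"
    return "neutral"          # similar number of positive and negative tweets
```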

3.3.3.4 Thresholds and Parameters Tuning

When computing the sentiment of a point within a SentiCircle (Function 8), it is necessary to determine beforehand the geometric boundaries of the neutral region (the region defining the neutral terms) within the SentiCircle. While the boundaries of the neutral region are fixed for the X-axis [0, 1] (see Section 3.2.2.3), the boundaries along the Y-axis need to be determined. We have observed that the neutral area of a SentiCircle is characterised by a high density of terms, since the number of neutral terms in the SentiCircle is usually larger than the number of positive and negative terms. The limits of the neutral region vary from one SentiCircle to another. For simplicity, we assume the same neutral region boundary for all SentiCircles emerging from the same corpus and sentiment lexicon. To compute these thresholds, we first build the SentiCircle of the complete corpus by merging the SentiCircles of all individual terms, and then plot the density distribution of the terms within the constructed SentiCircle. The boundaries of the neutral area are delimited by an increase/decrease in the density of terms.

            SentiWordNet     MPQA             Thelwall-Lexicon
OMD         [-0.01, 0.01]    [-0.01, 0.01]    [-0.01, 0.01]
HCR         [-0.1, 0.1]      [-0.05, 0.05]    [-0.05, 0.05]
STS-Gold    [-0.1, 0.1]      [-0.05, 0.05]    [-0.001, 0.001]

Table 5: Neutral region boundaries for the Y-axis.

10 http://sentistrength.wlv.ac.uk/documentation/SentiStrengthJavaManual.doc

Figure 14: Density geometric distribution of terms on the OMD dataset.

Figure 14 shows the three density distribution plots for the OMD dataset with the SentiWordNet, MPQA and Thelwall lexicons. The boundaries of the neutral area are delimited by the density increase, falling in the [−0.01, 0.01] range. Note that the generated SentiCircles vary depending on the corpus and sentiment lexicon. For the evaluation, we have therefore computed nine different neutral regions, one for each combination of corpus and sentiment lexicon, as shown in Table 5.
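The thesis reads the neutral-region boundary off density plots such as Figure 14. The sketch below automates that reading under a simple heuristic (walking outwards from y = 0 until the bin density drops sharply relative to the central bin); the bin width and drop factor are our assumptions, not parameters used in the thesis.

```python
import numpy as np

def estimate_neutral_boundary(y_values, bins=2001, drop_factor=5.0):
    """Heuristically estimate the Y-axis boundary lambda of the neutral region
    from the y-coordinates of all points in the corpus-level SentiCircle."""
    hist, edges = np.histogram(y_values, bins=bins, range=(-1.0, 1.0))
    centre = bins // 2                        # bin containing y = 0
    threshold = hist[centre] / drop_factor    # density level marking the boundary
    i = centre
    while i + 1 < bins and hist[i + 1] >= threshold:
        i += 1
    return float(edges[i + 1])                # upper edge of the last dense bin
```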

3.3.4 Evaluation Results

We report the performance of our proposed approaches in comparison with the baselines on both the entity- and tweet-level sentiment detection tasks. For entity-level sentiment detection, we conduct experiments on the STS-Gold dataset, while for tweet-level sentiment detection, we use the OMD, HCR and STS-Gold datasets.

3.3.4.1 Entity-Level Sentiment Detection

For entity-level sentiment detection, we employ our proposed Senti-Median method (see Section 3.2.2.4 and Section 3.3.1) with the SentiWordNet, MPQA and Thelwall lexicons to identify the overall sentiment of the SentiCircle of a given entity. We report the results in accuracy, precision, recall and F-measure on two identification tasks. The first task is subjectivity detection, which identifies whether a given entity is subjective (positive or negative) or objective (neutral). The second task is polarity detection, which identifies whether the entity has positive or negative sentiment. Both identification tasks are applied to the 58 different entities described in Section 3.3.3.1.

Subjectivity Classification (Subjective vs. Objective)

                                              Subjective             Objective              Average
Methods                          Accuracy     P      R      F1       P      R      F1       P      R      F1
MPQA-Method                      63.79        67.27  92.50  77.89    0      0      0        33.64  46.25  38.95
SentiWordNet-Method              63.79        67.27  92.50  77.89    0      0      0        33.64  46.25  38.95
SentiStrength                    62.07        64.15  91.89  75.56    40.00  9.52   15.38    52.08  50.71  51.38
Senti-Median (SentiWordNet)      81.03        90.91  78.95  84.51    68.00  85.00  75.56    79.45  81.97  80.03
Senti-Median (MPQA)              77.59        90.00  72.97  80.60    64.29  85.71  73.47    77.14  79.34  77.03
Senti-Median (Thelwall-Lexicon)  79.31        84.85  80.00  82.35    72.00  78.26  75.00    78.42  79.13  78.68

Polarity Classification (Positive vs. Negative)

                                              Positive               Negative               Average
Methods                          Accuracy     P      R      F1       P      R      F1       P      R      F1
MPQA-Method                      72.50        80.00  92.31  85.71    71.43  45.45  55.56    75.71  68.88  70.63
SentiWordNet-Method              77.50        88.00  88.00  88.00    75.00  75.00  75.00    81.50  81.50  81.50
SentiStrength                    85.00        95.65  81.48  88.00    70.59  92.31  80.00    83.12  86.89  84.00
Senti-Median (SentiWordNet)      87.50        89.29  92.59  90.91    83.33  76.92  80.00    86.31  84.76  85.45
Senti-Median (MPQA)              85.00        86.21  92.59  89.29    81.82  69.23  75.00    84.01  80.91  82.14
Senti-Median (Thelwall-Lexicon)  82.50        85.71  88.89  87.27    75.00  69.23  72.00    80.36  79.06  79.64

Table 6: Entity-level sentiment analysis results.

It can be observed from the upper panel of Table 6 that, for subjectivity identification, our proposed Senti-Median method outperforms all the baselines by a large margin. In particular, merely using MPQA or SentiWordNet for sentiment labelling fails to detect any objective (neutral) entities. SentiStrength only achieves an F-measure of 15% for objective entity detection. On the contrary, our proposed Senti-Median method gives relatively balanced results on both subjective and objective identification. Senti-Median with the word prior sentiment obtained from SentiWordNet attains the best overall result, with 81% in accuracy and 80% in F-measure, which outperforms the baselines by nearly 20% in accuracy and 30-40% in F-measure.

The lower panel of Table 6 shows the results of binary polarity identification (positive vs. negative) at the entity level. SentiStrength performs better than MPQA-Method and SentiWordNet-Method, with 85% in accuracy and 84% in F-measure. Although our Senti-Median method with word prior sentiment obtained from either MPQA or Thelwall-Lexicon does not seem to bring any improvement over SentiStrength, Senti-Median based on SentiWordNet outperforms SentiStrength by 2.5% in accuracy and 1.5% in F-measure.

3.3.4.2 Tweet-Level Sentiment Detection

For tweet-level sentiment detection, we report the evaluation results using the Median method, the Pivot method and the Pivot-Hybrid method with the SentiWordNet, MPQA and Thelwall lexicons on the OMD, HCR and STS-Gold datasets. We also compare these results with those obtained from the baselines described in Section 3.3.3.3. Figure 15 shows the results in accuracy (left column) and average F-measure (right column) of all the methods across all the datasets. It can be observed that all three of our methods outperform the MPQA-Method and SentiWordNet-Method baselines in both accuracy and average F-measure on all the datasets.

On the OMD dataset, we observe a trend that Median < Pivot < Pivot-Hybrid. Both our Pivot and Pivot-Hybrid methods give an average performance gain of 3.17% in accuracy and 4.3% in F-measure over SentiStrength with word prior sentiments obtained from either MPQA or Thelwall-Lexicon. The Median method does not bring any performance gains over SentiStrength. Overall, the best result is achieved by our Pivot-Hybrid method with word prior sentiments obtained from MPQA. It outperforms SentiStrength by nearly 5% in accuracy and 6% in F-measure.

On the HCR dataset, all three of our methods gave higher accuracy than SentiStrength with word prior sentiments obtained from either MPQA or Thelwall-Lexicon. In particular, the Median method coupled with MPQA outperforms SentiStrength by nearly 4% in accuracy. In terms of F-measure, the Median method based on MPQA gives a similar result to SentiStrength.

While the best sentiment classification accuracy on the OMD and HCR datasets is only about 70%, we achieve 80% sentiment classification accuracy on the STS-Gold dataset. We observe that, on STS-Gold, the Pivot-Hybrid method outperforms both the Median and the Pivot methods regardless of where the word prior sentiments are obtained from. Nevertheless, the Pivot-Hybrid method using Thelwall-Lexicon gives 1% lower accuracy and F-measure than SentiStrength, which uses the same lexicon.

Figure 15: Tweet-level sentiment detection results (accuracy, left; average F-measure, right) on (a) OMD, (b) HCR and (c) STS-Gold, where ML: MPQA-Method, SL: SentiWordNet-Method, SS: SentiStrength, Mdn: SentiCircle with Median method, Pvt: SentiCircle with Pivot method, Hbd: SentiCircle with Pivot-Hybrid method.

The above results show a close competition between our three SentiCircle methods and the SentiStrength method. The average accuracy of SentiCircle and SentiStrength across all three datasets is 72.39% and 71.7% respectively, and the average F-measure is 65.98% and 66.52%. Also, the average precision and recall for SentiCircle are 66.82% and 66.12%, and for SentiStrength they are 67.07% and 66.56% respectively. Although the potential is evident, there is clearly a need for more research to determine the specific conditions under which SentiCircle performs better or worse. One likely factor that influences the performance of SentiCircle is the balance of positive to negative tweets in the dataset. For example, we notice that SentiCircle produces, on average, 2.5% lower recall than SentiStrength on positive tweet detection. This is perhaps not surprising, since our tweet data contain more negative tweets than positive ones, with the number of the former more than double the number of the latter (see Table 2).

3.3.4.3 Impact on Words’ Sentiment

The motivation behind SentiCircle is that the sentiment of words may vary with context. By capturing the contextual semantics of these words, using the SentiCircle representation, we aim to adapt the strength and polarity of words. We show here the average percentage of words in our three datasets for which SentiCircle changed their prior sentiment orientation or strength.

                                            SentiWordNet   MPQA    Thelwall-Lexicon   Average
Words found in the lexicon                  54.86          16.81   9.61               27.10
Hidden words                                45.14          83.19   90.39              72.90
Words flipped their sentiment orientation   65.35          61.29   53.05              59.90
Words changed their sentiment strength      29.30          36.03   46.95              37.43
New opinionated words                       49.03          32.89   34.88              38.93

Table 7: Average percentage of words in the three datasets whose sentiment orientation or strength was updated by their SentiCircles.

Table 7 shows that, on average, 27.1% of the unique words in our datasets were covered by the sentiment lexicons and were assigned prior sentiments accordingly. Using the SentiCircle representation, however, resulted in 59.9% of these words flipping their sentiment orientation (e.g., from positive to negative, or to neutral) and 37.43% changing their sentiment strength while keeping their prior sentiment orientation. Hence only 2.67% of the words were left with their prior sentiment orientation and strength unchanged. It is also worth noting that our model was able to assign sentiment to 38.93% of the hidden words that were not covered by the sentiment lexicons. Our evaluation results showed that our SentiCircle representation coupled with the MPQA or Thelwall lexicons gives the highest performance amongst the three lexicons. However, Table 7 shows that only 9.61% of the words in the three datasets were covered by Thelwall-Lexicon, and 16.81% by MPQA. Nevertheless, SentiCircle performed best with these two lexicons, which suggests that it was able to cope with this low coverage by assigning sentiment to a large proportion of the hidden words.

3.4 senticircles for adapting sentiment lexicons

In the previous section, we demonstrated the effectiveness of SentiCircles in capturing the words’ contextual sentiment. Unlike SentiStrength, which uses fixed sentiment scores of words obtained from Thelwall-Lexicon, SentiCircles updates the prior sentiment (orientation and/or strength) of almost 50% of the words covered by Thelwall-Lexicon across the three datasets. It also assigns contextual sentiment to 21.37% of the words that are not covered by Thelwall-Lexicon. Therefore, our SentiCircle approach can be thought of as a lexicon-adaptation approach that takes as input a Twitter dataset and a sentiment lexicon (e.g., Thelwall-Lexicon), captures the words’ context in tweets, and updates their sentiment orientation and strength in the lexicon accordingly.

In this section we continue evaluating our SentiCircle approach against SentiStrength. However, instead of applying SentiCircles to directly detect the sentiment of entities or tweets, we propose using SentiCircles to amend the prior sentiment of words in Thelwall-Lexicon and then study the impact of doing so on the sentiment detection performance of SentiStrength. Remember that SentiStrength is a rule-based method that is specifically built to work with Thelwall-Lexicon. Therefore, measuring the performance of SentiStrength when using the adapted Thelwall-Lexicon allows us to further investigate and understand the role of the contextual sentiment of individual words in the overall sentiment detection performance.

Figure 16 depicts the general two-step workflow for adapting sentiment lexicons with SentiCircles. First, we capture the contextual semantics and sentiment of the words in the given sentiment lexicon using SentiCircles, as previously explained in Section 3.2.2.3. Secondly, a rule-based algorithm is applied to amend the prior sentiment of terms in the lexicon based on their corresponding contextual sentiment. We designed our rules with respect to the characteristics of Thelwall-Lexicon, as will be subsequently explained.

Figure 16: The systematic workflow of adapting sentiment lexicons with SentiCircles.

Thelwall-Lexicon consists of 2546 terms coupled with integer values between -5 (very negative) and +5 (very positive). Based on the terms’ prior sentiment orientations and strengths (SOS), we group them into three subsets of 1919 negative terms (SOS ∈ [-5, -2]), 398 positive terms (SOS ∈ [2, 5]) and 229 neutral terms (SOS ∈ {-1, 1}). Our adaptation method uses a set of antecedent-consequent rules that decide how the prior sentiment of the terms in Thelwall-Lexicon should be updated according to the positions of their SentiMedians (i.e., their contextual sentiment). In particular, for a term m, the method checks (i) its prior SOS value in Thelwall-Lexicon and (ii) the SentiCircle quadrant in which the SentiMedian of m resides. The method subsequently chooses the best-matching rule to update the term’s prior sentiment and/or strength.

Table 8 shows the complete list of adaptation rules we use. As noted, these rules are divided into updating rules, i.e., rules for updating the existing terms in Thelwall-Lexicon, and expanding rules, i.e., rules for expanding the lexicon with new terms. The updating rules are further divided into rules that deal with terms that have similar prior and contextual sentiment orientations (i.e., both positive or both negative), and rules that deal with terms that have different prior and contextual sentiment orientations (i.e., negative prior and positive contextual sentiment, and vice versa). Although they look complicated, the notion behind the proposed rules is rather simple: check how strong the contextual sentiment is and how weak the prior sentiment is → update the sentiment orientation and strength accordingly. The strength of the contextual sentiment can be determined based on the sentiment quadrant of the SentiMedian of m, i.e., the contextual sentiment is strong if the SentiMedian resides in the “Very Positive” or “Very Negative” quadrants (see Figure 11). On the other hand, the prior sentiment of m (i.e., priorm) in Thelwall-Lexicon is weak if |priorm| ≤ 3 and strong otherwise.

Updating Rules (Similar Sentiment Orientations)

Id   Antecedents                                            Consequent
1    (|prior| ≤ 3) ∧ (SentiMedian ∉ StrongQuadrant)         |prior| = |prior| + 1
2    (|prior| ≤ 3) ∧ (SentiMedian ∈ StrongQuadrant)         |prior| = |prior| + 2
3    (|prior| > 3) ∧ (SentiMedian ∉ StrongQuadrant)         |prior| = |prior| + 1
4    (|prior| > 3) ∧ (SentiMedian ∈ StrongQuadrant)         |prior| = |prior| + 1

Updating Rules (Different Sentiment Orientations)

5    (|prior| ≤ 3) ∧ (SentiMedian ∉ StrongQuadrant)         |prior| = 1
6    (|prior| ≤ 3) ∧ (SentiMedian ∈ StrongQuadrant)         prior = −prior
7    (|prior| > 3) ∧ (SentiMedian ∉ StrongQuadrant)         |prior| = |prior| − 1
8    (|prior| > 3) ∧ (SentiMedian ∈ StrongQuadrant)         prior = −prior
9    (|prior| > 3) ∧ (SentiMedian ∈ NeutralRegion)          |prior| = |prior| − 1
10   (|prior| ≤ 3) ∧ (SentiMedian ∈ NeutralRegion)          |prior| = 1

Expanding Rules

11   SentiMedian ∈ NeutralRegion                            (|contextual| = 1) ∧ AddTerm
12   SentiMedian ∉ StrongQuadrant                           (|contextual| = 3) ∧ AddTerm
13   SentiMedian ∈ StrongQuadrant                           (|contextual| = 5) ∧ AddTerm

Table 8: Adaptation rules for Thelwall-Lexicon, where prior: prior sentiment value, StrongQuadrant: very negative/positive quadrant in the SentiCircle, AddTerm: add the term to Thelwall-Lexicon.

For example, the word “revolution” in Thelwall-Lexicon has a weak negative sentiment (prior = -2), while it has a neutral contextual sentiment since its SentiMedian resides in the neutral region (SentiMedian ∈ NeutralRegion). Therefore, rule number 10 is applied and the term’s prior sentiment in Thelwall-Lexicon is updated to neutral (|prior| = 1). In another example, the words “Obama” and “Independence” are not covered by Thelwall-Lexicon and therefore have no prior sentiment. However, their SentiMedians reside in the “Positive” quadrant of their SentiCircles, and therefore rule number 12 is applied: both terms are assigned a positive sentiment strength of 3 and subsequently added to the lexicon.
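A minimal sketch of the adaptation rules in Table 8 is given below. It follows the sign conventions of Figure 11 (upper half = positive, x < 0 = strong quadrant); clamping the updated strength to [−5, +5] and the handling of neutral priors (±1) are our assumptions.

```python
def adapt_thelwall_prior(prior, median_x, median_y, lam):
    """Update a term's prior SOS value in Thelwall-Lexicon from the position of
    its Senti-Median, following Table 8. `prior` is an integer in [-5, +5]
    (with +/-1 meaning neutral), or None for a term not yet in the lexicon."""
    in_neutral = abs(median_y) <= lam and median_x > 0
    strong = median_x < 0                      # "Very Positive"/"Very Negative" quadrant
    ctx_sign = 1 if median_y >= 0 else -1      # contextual sentiment orientation

    if prior is None:                          # expanding rules 11-13
        strength = 1 if in_neutral else (5 if strong else 3)
        return ctx_sign * strength

    sign = 1 if prior > 0 else -1
    mag = abs(prior)
    if in_neutral:                             # rules 9-10
        return sign * (mag - 1 if mag > 3 else 1)
    if sign == ctx_sign:                       # rules 1-4: similar orientations
        mag += 2 if (mag <= 3 and strong) else 1
        return sign * min(mag, 5)              # clamping to 5 is our assumption
    if strong:                                 # rules 6 and 8: flip the orientation
        return -prior
    return sign * (mag - 1) if mag > 3 else sign   # rules 7 and 5
```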

3.4.1 Evaluating SentiStrength on the Adapted Thelwall-Lexicon

As mentioned earlier, our goal behind adapting Thelwall-Lexicon is to demonstrate the importance of the context captured by our SentiCircle model for the sentiment orientation and strength of words, and consequently for the overall sentiment detection performance of SentiStrength when it uses the adapted lexicon. Our evaluation starts with adapting Thelwall-Lexicon under three different settings: (i) the update setting, where we update the prior sentiment of existing terms in the lexicon, (ii) the expand setting, where we expand Thelwall-Lexicon with new opinionated terms, and (iii) the update+expand setting, where we apply both settings together. After that, we use SentiStrength with each of the adapted lexicons to perform tweet-level binary polarity classification (positive vs. negative) on the OMD, HCR and STS-Gold datasets (see Table 2).

Applying our adaptation approach to Thelwall-Lexicon results in substantial changes to the lexicon. Table 9 shows the percentage of words in the three datasets that were found in Thelwall-Lexicon and whose sentiment changed after adaptation. One can notice that, on average, 9.61% of the words in our datasets were found in the lexicon. However, updating the lexicon with the contextual sentiment of words resulted in 33.82% of these words flipping their sentiment orientation and 62.94% changing their sentiment strength while keeping their prior sentiment orientation. Only 3.24% of the words in Thelwall-Lexicon remained untouched. Moreover, 21.37% of the words previously unseen in the lexicon were assigned a contextual sentiment by our approach and subsequently added to Thelwall-Lexicon.

Note that the numbers in Table 7 that reflect the impact on Thelwall-Lexicon are slightly different from the average impact numbers we report in Table 9. This is because our adaptation rules in this study consider both the prior and the contextual sentiment of words when adapting their sentiment in the lexicon, as explained earlier. According to our rules, a word’s prior sentiment does not always need to be overridden by its contextual sentiment. In contrast, in our study in Section 3.3.1 we only consider the contextual sentiment of words, i.e., we always override the prior sentiment of words in the lexicon with their contextual one.

                                            OMD     HCR     STS-Gold   Average
Words found in the lexicon                  12.43   8.33    8.09       9.61
Hidden words                                87.57   91.67   91.91      90.39
Words flipped their sentiment orientation   35.02   35.61   30.83      33.82
Words changed their sentiment strength      61.83   61.95   65.05      62.94
Words remained unchanged                    3.15    2.44    4.13       3.24
New opinionated words                       23.94   14.30   25.87      21.37

Table 9: Average percentage of words in the three datasets that had their sentiment orientation or strength updated by our adaptation approach.

Table 10 shows the average results of binary sentiment classification performed by SentiStrength on our datasets using (i) the original Thelwall-Lexicon (Original), (ii) Thelwall-Lexicon adapted under the update setting (Updated), and (iii) Thelwall-Lexicon adapted under the update+expand setting.11 The table reports the results in accuracy and three sets of precision (P), recall (R) and F-measure (F1): one for positive sentiment detection, one for negative, and one for the average of the two. From the results in Table 10, we notice that the best classification performance in accuracy and F1 is obtained on the STS-Gold dataset regardless of the lexicon being used. We also observe that the negative sentiment detection performance is always higher than the positive detection performance for all datasets and lexicons.

                                              Positive Sentiment      Negative Sentiment      Average
Datasets    Lexicons            Accuracy      P      R      F1        P      R      F1        P      R      F1
OMD         Original            66.79         55.99  40.46  46.97     70.64  81.83  75.82     63.31  61.14  61.40
            Updated             69.29         58.89  51.40  54.89     74.12  79.51  76.72     66.51  65.45  65.80
            Updated+Expanded    69.20         58.38  53.18  55.66     74.55  78.34  76.40     66.47  65.76  66.03
HCR         Original            66.99         43.39  41.31  42.32     76.13  77.64  76.88     59.76  59.47  59.60
            Updated             67.21         42.90  35.77  39.01     75.07  80.25  77.58     58.99  58.01  58.29
            Updated+Expanded    66.99         42.56  36.02  39.02     75.05  79.83  77.37     58.80  57.93  58.19
STS-Gold    Original            81.32         68.75  73.10  70.86     87.52  85.02  86.25     78.13  79.06  78.56
            Updated             81.71         69.46  73.42  71.38     87.70  85.45  86.56     78.58  79.43  78.97
            Updated+Expanded    82.30         70.48  74.05  72.22     88.03  86.02  87.01     79.26  80.04  79.62

Table 10: Cross-comparison results of the original and the adapted lexicons.

As for the different lexicons, we notice that on OMD and STS-Gold the adapted lexicons outperform the original lexicon in both accuracy and F-measure. For example, on OMD the adapted lexicon shows an average improvement of 2.46% and 4.51% in accuracy and F1 respectively over the original lexicon. On STS-Gold the performance improvement is less pronounced than on OMD, but we still observe a 1% improvement in accuracy and F1 compared to using the original lexicon. As for the HCR dataset, the adapted lexicon gives on average similar accuracy, but 1.36% lower F-measure. This performance drop can be attributed to the poor detection performance on positive tweets. Specifically, we notice from Table 10 a major loss in recall on positive tweet detection using both adapted lexicons. One possible reason is the sentiment class distribution in our datasets; in particular, one may notice that HCR is the most imbalanced of the three datasets. Moreover, by examining the numbers in Table 9, we can see that HCR presents the lowest number of new opinionated words among the three datasets (i.e., 10.61% lower than the average), which could be another potential reason for not observing any performance improvement.

11 Note that in this work we do not report the results obtained under the expand setting, since no improvement was observed compared to the other two settings.

3.5 runtime analysis

In this section, we perform a runtime analysis of the various methods and algorithms that constitute our SentiCircle model. To this end, we apply our model to the STS-Gold dataset using a computer with a 2.3 GHz Core i7 CPU and 8 GB of memory. According to the results reported in Table 11, building a term-index out of 2,034 tweets (5,035 unique terms) takes 13.8 seconds. This is not surprising given that this task involves several subtasks (tokenisation, part-of-speech tagging, negation handling, non-English text filtering). We also construct context-term vectors and extract the terms’ contextual features during this task, as they all happen together. Although 13.8 seconds sounds relatively slow, building a term-index for a tweet collection is a one-time task, i.e., once it is built, the term-index can be used to perform various tasks in a relatively short time. For example, composing a SentiCircle from the term-index takes only 47.18 milliseconds. Moreover, the term-index can be updated with new tweets on the fly, where it takes 6.6 milliseconds to add a new tweet to the index. We also estimate the runtime of our sentiment detection methods at the entity and tweet levels. For the entity level, calculating the Senti-Median takes 10 milliseconds per entity. For the tweet level, the Median method takes 16.9 milliseconds per tweet, while the Pivot and Pivot-Hybrid methods take 0.01 and 4.53 milliseconds respectively.

Task                      Run Time
Term-Index (Generate)     13.8 seconds
Term-Index (Update)       6.6 milliseconds
SentiCircle               47.18 milliseconds
Senti-Median Method       10 milliseconds
Median Method             16.9 milliseconds
Pivot Method              0.01 milliseconds
Pivot-Hybrid Method       4.53 milliseconds

Table 11: Runtime analysis of the SentiCircle model on the STS-Gold dataset.

3.6 discussion

We demonstrated the value of using contextual semantics in sentiment analysis on Twitter. In particular, we proposed a semantic representation model called SentiCircle for capturing words’ contextual semantics in tweet data. We studied the effectiveness of SentiCircle in (i) lexicon-based sentiment analysis at both the entity and tweet levels and (ii) sentiment lexicon adaptation.

SentiCircle for Lexicon-based Sentiment Analysis: We proposed using SentiCircle to perform lexicon-based sentiment detection at the tweet and entity levels. At the entity level, the evaluation was done on a single dataset (STS-Gold) that we constructed ourselves, due to the sheer lack of gold-standard datasets for evaluating entity-level sentiment. We have seen that merely using MPQA or SentiWordNet for sentiment labelling fails to detect any neutral entities. This is expected, since words in both lexicons are oriented with positive and negative scores, but not with neutral ones. SentiCircle, on the other hand, was able to amend the sentiment scores of words in both lexicons based on context, hence achieving a much higher performance in detecting neutral entities than all the baselines. The next step in this direction would be to compare our approach against more sophisticated methods, including machine learning ones such as those based on the probabilistic distribution of words’ sentiment in a given Twitter corpus [12], or SVM classifiers [77]. Another future direction is to experiment with our approach using new datasets of entities with different topic foci, which would allow for a better understanding of the settings and conditions under which our approach produces better or worse performance.

At the tweet level, the evaluation was performed on three Twitter datasets using three different sentiment lexicons. The results showed that our SentiCircle approach significantly outperforms MPQA-Method and SentiWordNet-Method. Compared to SentiStrength, the results were not as conclusive, since SentiStrength slightly outperformed SentiCircles on the STS-Gold dataset, and also yielded marginally better F-measure for the HCR dataset. This might be due to the different topic distributions in the datasets. The STS-Gold dataset contains random tweets, with no particular topic focus, whereas OMD and HCR consist of tweets that discuss specific topics, and thus the contextual semantics extracted by SentiCircle are presumably more representative in these datasets than in STS-Gold. Other important characteristics could be the degree of sparseness of the data and the distribution of positive and negative tweets. Investigating these issues and their influence on the performance of our approach creates room for future work.

We extracted opinion targets (Pivot terms) in the Pivot Method by looking at their POS tags, assuming that all pivot terms in a given tweet receive a similar sentiment. Another direction for future work is to refine the pivot extraction method to consider cases where the tweet contains several pivot terms of different sentiment orientations. We proposed and tested methods that assign positive, negative or neutral sentiment to terms and tweets based on their corresponding SentiCircle representations. However, there could be a need to consider cases where terms with “Mixed” sentiment emerge, when their SentiCircle representations consist of positive and negative terms only.

SentiCircles for Adapting Sentiment Lexicons: In the second part of this chapter, we showed the potential of using the contextual semantics of words for adapting general-purpose sentiment lexicons. As a case study, we used SentiCircles to adapt the prior sentiment of words in Thelwall-Lexicon with respect to their contextual sentiment. The adapted Thelwall-Lexicon outperformed the original lexicon on two out of three datasets. However, a performance drop was observed on the HCR dataset using our adapted lexicons. Our initial observations suggest that the quality of our approach might depend on the sentiment class distribution in the dataset; therefore, a deeper investigation in this direction is required. Our adaptation rules are specific to Thelwall-Lexicon. These rules, however, can be generalised to other lexicons, which constitutes another future direction of this work.

3.7 summary

Contextual semantics refer to the type of semantics that is extracted from the co-occurrences of words in texts. In this chapter, we presented our work on using the contextual semantics of words in sentiment analysis on Twitter, aiming at addressing the first research question of this thesis: Could the contextual semantics of words enhance lexicon-based sentiment analysis performance?

To this end, we first proposed a novel contextual semantic sentiment representation of words, called SentiCircle, which is able to assign sentiment to words with respect to their contextual semantics in tweets. After that, we tested the effectiveness of our proposed representation in three sentiment analysis tasks on Twitter: entity-level sentiment analysis, tweet-level sentiment analysis, and sentiment lexicon adaptation. For the entity- and tweet-level sentiment analysis tasks, we proposed several lexicon-based methods, based on SentiCircles, to detect the sentiment of individual entities as well as the overall sentiment of the tweets these entities occur within. We evaluated and tested our methods under different settings (three different sentiment lexicons and three different datasets) and compared their performance against various lexicon baseline methods. For entity-level sentiment detection, our results showed that our proposed method based on SentiCircles outperforms all the other methods by a large margin in both accuracy and F-measure. For tweet-level sentiment detection, our methods overtake the state-of-the-art SentiStrength method in accuracy, but fall marginally behind in F-measure.

As for the sentiment lexicon adaptation task, we proposed a rule-based approach that employs SentiCircles to amend the prior sentiment of words in a given general-purpose sentiment lexicon with respect to their contextual sentiment. The evaluation was done on Thelwall-Lexicon using three Twitter datasets. Results showed that the lexicons adapted by our approach improved the sentiment classification performance over the original Thelwall-Lexicon, in both accuracy and F1, on two out of three Twitter datasets.

4  CONCEPTUAL SEMANTICS FOR SENTIMENT ANALYSIS OF TWITTER

After demonstrating in the previous chapter the effectiveness of contextual semantics for computing words’ and tweets’ sentiment, in this chapter we introduce our work on conceptual semantics and their use in sentiment analysis on Twitter. In essence, we propose several methods for incorporating the semantic concepts of named entities extracted from tweets into both supervised and lexicon-based sentiment analysis approaches. Our findings suggest that conceptual semantics, when used as features, increase the performance of tweet-level sentiment analysis in comparison with features extracted from the syntactic, linguistic, or statistical representation of words.

4.1 introduction

As discussed earlier in this thesis, traditional sentiment analysis approaches on Twitter are semantically weak, since they do not account for the latent semantics of words when calculating their sentiment in tweets [33, 139]. In Chapter 3 we showed that semantics extracted based on the context of words (contextual semantics) can be effectively used to capture words’ sentiment and consequently improve the performance of lexicon-based sentiment analysis approaches. In this chapter, we experiment with the use of conceptual semantics for sentiment analysis. Conceptual semantics are often extracted from external knowledge sources, such as ontologies and semantic networks [33]. The research question we aim to address in this chapter is:

RQ2 Could the conceptual semantics of words enhance sentiment analysis performance?

To address the above question, we propose extracting and incorporating the conceptual semantics of words into both supervised and lexicon-based approaches to enhance their performance in tweet-level sentiment analysis. For the supervised machine learning approach, we investigate three methods of incorporating conceptual semantics as features to train supervised sentiment classifiers. For the lexicon-based approach, we consider incorporating conceptual semantics into our previously proposed model, SentiCircle (see Chapter 3), and study the impact of such incorporation on the overall sentiment detection performance.

Our evaluation is conducted by analysing the performance of various sentiment analysis methods in tweet-level sentiment analysis after incorporating conceptual semantics. To this end, we use three different Twitter datasets and compare against several state-of-the-art baselines. Results show that, when incorporating conceptual semantics in machine learning approaches, the performance improves by up to 4% in F1-measure on average over models trained from syntactic features alone. For lexicon-based sentiment analysis, enriching SentiCircle with conceptual semantics improves the performance of our lexicon-based approach by 0.39% in average F1-measure, but gives 0.19% lower accuracy, in comparison with using SentiCircle with no semantic enrichment. Lastly, our results show that the advantage of using conceptual semantics over other techniques is mostly restricted to negative sentiment identification in large, topically diverse datasets.

The rest of this chapter is organised as follows. Section 4.2 introduces our approach to extracting and using conceptual semantics with supervised sentiment analysis approaches. Section 4.3 demonstrates how to enrich SentiCircles with conceptual semantics to perform lexicon-based sentiment analysis. A discussion is delivered in Section 4.4. Finally, we summarise our work and findings in this chapter in Section 4.5.

4.2 conceptual semantics for supervised sentiment analysis

In this section, we describe our approach to extracting and exploiting conceptual semantics as features for sentiment classifier training, in order to improve the overall sentiment classification performance. We have seen earlier in Chapter 2 that the sparsity of Twitter data affects, to a large extent, the performance of supervised sentiment classifiers [137]. Due to sparsity, many terms in the test set of a classifier do not occur in its training set, therefore preventing the classifier from detecting their sentiment. To address this problem, the semantic concepts of entities extracted from tweets can be used to measure the overall correlation of a group of entities (e.g., all Apple products) with a given sentiment polarity. Hence, adding semantic concepts as features to the analysis could help identify the sentiment of tweets that contain any of the entities that such concepts represent, even if those entities never appeared in the training set (e.g., a new gadget from Apple).1

Figure 17: Measuring correlation of semantic concepts with negative/positive sentiment. These semantic concepts are then incorporated in sentiment classification.

An example showing the usefulness of using semantic concepts as features for sentiment classifier training is depicted in Figure 17, where the left box lists entities appearing in the training set, together with their occurrence probabilities in positive and negative tweets. For example, the entities “iPad”, “iPod” and “Mac Book Pro” appear more often in tweets of positive polarity and they are all mapped to the semantic concept Product/Apple. As a result, the tweet from the test set “Finally, I got my iPhone. What a product!” is more likely to have a positive polarity, because it contains the entity “iPhone”, which does not occur in the training set but is mapped to the concept Product/Apple.

1 Assuming of course that the entity extractor successfully identifies the new entities as sub-types of concepts already correlated with negative or positive sentiment.

Based on the above description, our proposed pipeline of adding semantic concepts into supervised sentiment classification methods breaks down into two main steps, as depicted in Figure 18:

• Extraction: this step aims to detect named-entities in a given tweet corpus and to extract their associated semantic concepts using third-party entity extraction tools (Section 4.2.1).

• Incorporation: the extracted semantic concepts are incorporated as features into the feature space of a given supervised machine learning classifier. To this end, we propose three different incorporation methods: by replacement, by augmentation, and by interpolation (Section 4.2.2). The output of this step is a semantic sentiment classifier, which can be used to infer the sentiment of a given collection of unlabelled tweets (Section 4.2.4).


Figure 18: Systematic workflow for exploiting conceptual semantics in supervised sentiment classification on Twitter.

4.2.1 Extracting Conceptual Semantics

There are several open APIs that provide entity extraction services for online textual data. Rizzo and Troncy [132] evaluated the use of five popular entity extraction tools on a dataset of news articles, including AlchemyAPI,2 DBPedia Spotlight,3 Extractiv,4 Zemanta,5 and OpenCalais.6 Their experimental results showed that AlchemyAPI performs best for entity extraction and semantic concept mapping. Our datasets consist of informal tweets, and hence are intrinsically different from those used in [132]. Therefore we conducted our own evaluation: we randomly selected 500 tweets from the STS corpus [60] and asked 3 evaluators to assess the semantic concept extraction outputs generated by AlchemyAPI, OpenCalais and Zemanta.

2 www.alchemyapi.com
3 http://dbpedia.org/spotlight/
4 http://wiki.extractiv.com/w/page/29179775/Entity-Extraction

Extraction Tool | No. of Concepts Extracted | Entity-Concept Mapping Accuracy (%): Evaluator 1 | Evaluator 2 | Evaluator 3
AlchemyAPI | 108 | 73.97 | 73.8 | 72.8
Zemanta | 70 | 71 | 71.8 | 70.4
OpenCalais | 65 | 68 | 69.1 | 68.7

Table 12: Evaluation results of AlchemyAPI, Zemanta and OpenCalais.

The assessment of the outputs was based on (1) the correctness of the extracted entities (i.e., is the retrieved entity a real-world entity or not?); and (2) the correctness of the entity-concept mappings (i.e., is the extracted entity correctly mapped to its right concept or not?). The evaluation results presented in Table 12 show that AlchemyAPI extracted the largest number of concepts and also has the highest entity-concept mapping accuracy compared to OpenCalais and Zemanta. As such, we chose AlchemyAPI to extract the semantic concepts from the Twitter datasets we use in our experiments.

4.2.2 Conceptual Semantics Incorporation

In this section we describe how to use conceptual semantic features to train supervised machine learning classifiers. In particular, we propose three different methods to incorporate semantic features into Naive Bayes (NB) classifier training. We start with an overview of NB, followed by our proposed incorporation methods. Our choice of Naive Bayes as a case study in this work is due to its common and effective use as a sentiment classifier in the literature on Twitter sentiment analysis (see Chapter 2). Moreover, Naive Bayes provides a simple, yet efficient, language model that is easy to train, extend and implement [18].

5 www.zemanta.com
6 www.opencalais.com

NB is a probabilistic classifier, where the assignment of a sentiment class c (positive or negative) to a given tweet w can be computed as:

ĉ = arg max_{c∈C} P(c|w) = arg max_{c∈C} P(c) ∏_{1 ≤ i ≤ N_w} P(w_i|c)    (10)

where N_w is the total number of words in tweet w, P(c) is the prior probability of a tweet appearing in class c, and P(w_i|c) is the conditional probability of word w_i occurring in a tweet of class c.

In multinomial NB, P(c) can be estimated by P(c) = N_c / N, where N_c is the number of tweets in class c and N is the total number of tweets. P(w_i|c) can be estimated using maximum likelihood [142] with Laplace smoothing [95]:

P(w|c) = (N(w,c) + 1) / (∑_{w′∈V} N(w′,c) + |V|)    (11)

where N(w,c) is the occurrence frequency of word w in all training tweets of class c and |V| is the number of words in the vocabulary. To incorporate semantic concepts into NB learning, we propose three different methods, as described below.
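To make the estimation above concrete, the following is a minimal sketch in Python of Equations 10 and 11 (not the exact implementation used in this work), assuming tweets are already tokenised into lists of words:

```python
import math
from collections import Counter, defaultdict

def train_nb(tweets, labels):
    """Collect the counts needed for P(c) and P(w|c) with Laplace smoothing (Eqs. 10-11)."""
    class_counts = Counter(labels)               # N_c per class
    word_counts = defaultdict(Counter)           # N(w, c)
    vocab = set()
    for words, c in zip(tweets, labels):
        word_counts[c].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, len(tweets)

def classify_nb(words, class_counts, word_counts, vocab, n_tweets):
    """Return the class maximising log P(c) + sum_i log P(w_i|c)."""
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        total_c = sum(word_counts[c].values())   # sum over w' in V of N(w', c)
        score = math.log(n_c / n_tweets)
        for w in words:
            score += math.log((word_counts[c][w] + 1) / (total_c + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy example
tweets = [["great", "movie"], ["terrible", "pain"], ["love", "it"]]
labels = ["positive", "negative", "positive"]
model = train_nb(tweets, labels)
print(classify_nb(["great", "pain"], *model))
```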

Semantic Replacement: In this method, we replace all entities in tweets with their corresponding semantic concepts (e.g., “Headache” will be replaced in all tweets by its associated concept “Health Condition”). This leads to a reduction of the vocabulary size, where the new size is determined by:

|V′| = |V| − |W_entity| + |S|    (12)

where |V′| is the new vocabulary size, |V| is the original vocabulary size, |W_entity| is the total number of unique entity words that have been replaced by the semantic concepts, and |S| is the total number of semantic concepts associated with the previously identified entities.

Semantic Augmentation: This method augments the original feature space with the semantic concepts as additional features for the classifier training (e.g., “Headache” and its concept “Health Condition” will appear together in the feature space). The size of the vocabulary in this case is enlarged by the semantic concepts introduced:

|V′| = |V| + |S|    (13)
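As an illustration of these two methods, the following sketch shows how tokenised tweets can be rewritten before classifier training; the entity-to-concept mapping here is a hypothetical example of the kind produced by the entity extraction step of Section 4.2.1:

```python
def semantic_replacement(words, entity_concepts):
    """Replace each entity token by its semantic concept (shrinks the vocabulary, Eq. 12)."""
    return [entity_concepts.get(w, w) for w in words]

def semantic_augmentation(words, entity_concepts):
    """Append the concept of each entity token as an additional feature (enlarges |V|, Eq. 13)."""
    concepts = [entity_concepts[w] for w in words if w in entity_concepts]
    return words + concepts

# Hypothetical mapping; real mappings come from an extractor such as AlchemyAPI
entity_concepts = {"iphone": "Product/Apple", "headache": "Health_Condition"}

tweet = ["finally", "got", "my", "iphone"]
print(semantic_replacement(tweet, entity_concepts))   # ['finally', 'got', 'my', 'Product/Apple']
print(semantic_augmentation(tweet, entity_concepts))  # ['finally', 'got', 'my', 'iphone', 'Product/Apple']
```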

Semantic Interpolation: In this method, we interpolate the unigram language model in NB with the generative model of words given semantic concepts. We propose a general interpolation method below, which is able to interpolate arbitrary types of features such as semantic concepts, POS sequences, sentiment-topics, etc. Thus, the new language model with interpolation has the following formula:

P_f(W|C) = α P_u(W|C) + ∑_i β_i P(W, F_i, C)    (14)

where P_f(W|C) is the new language model with interpolation, P_u(W|C) is the original unigram class model, which can be calculated using maximum likelihood estimation, P(W, F_i, C) is the interpolation component, and F_i is a feature vector of type i (e.g., semantic concepts). The coefficients α and β_i are used to control the influence of the interpolated features in the new language model, where:

α + ∑_i β_i = 1

By setting α to 1, the class model becomes a unigram language model without any feature interpolation. On the other hand, setting α to 0 reduces the class model to a feature mapping model. In this work, the values of these coefficients have been set by conducting a sensitivity test on three Twitter corpora, as will be discussed in Section 4.2.4.1. The interpolation component in Equation 14 can be decomposed as follows:

P(W, F_i, C) = ∑_j P(W|f_ij) P(f_ij|C)    (15)

where f_ij is the j-th feature of type i, P(f_ij|C) is the distribution of features f_ij in the training data given the class C, and P(W|f_ij) is the distribution of words in the training data given the feature f_ij. Both distributions can be computed via maximum likelihood estimation.
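The sketch below illustrates how Equations 14 and 15 can be evaluated for a single word, assuming the unigram model and the two interpolation distributions have already been estimated via MLE; the dictionaries shown are hypothetical toy values rather than values from our datasets:

```python
def interpolated_word_prob(word, c, p_u, p_w_given_f, p_f_given_c, alpha, betas):
    """P_f(w|c) = alpha * P_u(w|c) + sum_i beta_i * sum_j P(w|f_ij) * P(f_ij|c)  (Eqs. 14-15)."""
    prob = alpha * p_u[c].get(word, 0.0)
    for i, beta_i in enumerate(betas):
        component = sum(p_w_given_f[i].get(f, {}).get(word, 0.0) * p_fc
                        for f, p_fc in p_f_given_c[i][c].items())
        prob += beta_i * component
    return prob

# Toy distributions for a single feature type (semantic concepts); alpha + sum(betas) = 1
p_u = {"positive": {"iphone": 0.01}, "negative": {"iphone": 0.002}}
p_w_given_f = [{"Product/Apple": {"iphone": 0.20}}]            # P(w | f_ij)
p_f_given_c = [{"positive": {"Product/Apple": 0.30},
                "negative": {"Product/Apple": 0.05}}]          # P(f_ij | c)
print(interpolated_word_prob("iphone", "positive", p_u, p_w_given_f, p_f_given_c,
                             alpha=0.6, betas=[0.4]))
```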

4.2.3 Evaluation Setup

Our proposed approach, as shown in the previous sections, extracts conceptual semantics from Twitter data and uses them as features to train NB classifiers using three different methods. We assess the performance of our approach for tweet-level sentiment analysis using three different Twitter datasets and compare against models trained from three other types of features, as explained subsequently.

4.2.3.1 Datasets

We assess the use of conceptual semantics in supervised sentiment classification using three different Twitter datasets: STS-Expand, HCR and OMD. The statistics of the datasets are shown in Table 13. Note that the OMD and the HCR datasets contain tweets about specific topics: the US Health Care Reform bill in the HCR dataset, and the Obama-McCain debate in the OMD dataset. On the other hand, STS-Expand consists of general tweets with no topic focus. We refer the reader to Appendix A for a full description of the construction and annotation of these datasets.

Dataset | Type | No. of Tweets | Positive | Negative
STS-Expand [137] | Train | 60K | 30K | 30K
STS-Expand [137] | Test | 1,000 | 470 | 530
HCR [152] | Train | 655 | 234 | 421
HCR [152] | Test | 699 | 163 | 536
OMD [47] | n-fold cross validation | 1,081 | 393 | 688

Table 13: Statistics of the three Twitter datasets used in this paper.

4.2.3.2 Semantic Concepts Extraction

As mentioned in Section 4.2.1, we use AlchemyAPI to extract the semantic concepts from our three datasets. Figure 19 shows the top ten high-level concepts extracted from the three datasets, with the number of entities associated with each concept. It can be observed that the most frequent semantic concept is Person across all three corpora. The next two most frequent concepts are Company and City for STS-Expand, Organisation and Country for HCR, and Country and Company for OMD. The level of specificity of these concepts (i.e., high-level vs. low-level concepts) is determined by AlchemyAPI. Table 14 lists the total number of entities extracted and the number of semantic concepts mapped against them for each dataset.

4.2.3.3 Baselines

We compare the performance of our semantic sentiment analysis approach against the baselines described below.

Figure 19: Top 10 frequent concepts extracted from each dataset ((a) STS-Expand, (b) HCR, (c) OMD), with the number of entities associated with them.

 | STS-Expand | HCR | OMD
No. of Entities | 15139 | 1392 | 1194
No. of Concepts | 29 | 19 | 14

Table 14: Entity/concept extraction statistics of STS-Expand, OMD and HCR using AlchemyAPI.

Unigrams Features: Word unigrams are among the simplest types of features used for sentiment analysis of Twitter data. Models trained from word unigrams were shown to outperform random classifiers by a decent margin of 20% [1]. In this work we use NB classifiers trained from word unigrams as our first baseline model. Table 15 lists, for each dataset, the total number of extracted unigram features used for classifier training.

Dataset | No. of Unigrams
STS-Expand | 37054
HCR | 2060
OMD | 2364

Table 15: Total number of unigram features extracted from each dataset.

Part-of-Speech Features: POS features are common features that have been widely used in the literature for the task of Twitter sentiment analysis. In this work, we build various NB classifiers trained using a combination of word unigrams and POS features and use them as baseline models. We extract the POS features using the TweetNLP POS tagger [109],7 which is trained specifically from tweets. This differs from the previous work [110, 1], which relies on POS taggers trained from treebanks in the newswire domain for POS tagging. It was shown that the TweetNLP tagger outperforms the Stanford tagger8 with a relative error reduction of 25% when evaluated on 500 manually annotated tweets [59]. Moreover, the tagger offers additional recognition capabilities for abbreviated phrases, emoticons and interjections (e.g. “lol", “omg").

Sentiment-Topic Features: The sentiment-topic features (JST features) denote the latent topics (aka groups or patterns) and the topic-associated sentiment in texts. They are

extracted from tweets using the weakly-supervised joint sentiment-topic (JST) model [87]. JST is a four-layer generative model, which allows the detection of both sentiment and topic simultaneously from text, based on the statistical similarities (contextual semantic similarities) between words (see Chapter 2). To extract JST features, we trained the JST model on the training set with tweet sentiment labels discarded. The resulting model assigns each word in tweets a sentiment label and a topic label. Hence JST essentially groups different words that share similar sentiment and topic. We list some of the topic words extracted by this model from the STS-Expand and OMD corpora in Table 16. Words in each cell are grouped under one topic. The upper half of the table shows topic words bearing positive sentiment, while the lower half shows topic words bearing negative sentiment. For example, Topic 2 under positive sentiment is about the movie “Twilight”, while Topic 5 under negative sentiment is about a complaint of feeling sick, possibly due to cold and headache. The rationale behind this model is that grouping words under the same topic and bearing similar sentiment could reduce data sparseness in Twitter sentiment classification and improve accuracy.

Algorithm 1 shows how to perform NB training with sentiment-topics extracted from JST. The training set consists of labeled tweets, D^train = {(w_n; c_n) ∈ W × C : 1 ≤ n ≤ N^train}, where W is the input space and C is a finite set of class labels. The test set contains tweets without labels, D^test = {w_n ∈ W : 1 ≤ n ≤ N^test}. A JST model is first learned from the training set and then a topic and a sentiment are inferred for each tweet in the test set. The original tweets are augmented with those sentiment-topics, as shown in Step 4 of Algorithm 1, where l_i_z_i denotes the combination of sentiment label l_i and topic z_i for word w_i in a tweet w_n. Finally, an optional feature selection step can be performed according to Information Gain (IG), and a classifier is then trained from the training set with the new feature representation.

7 http://www.ark.cs.cmu.edu/TweetNLP/
8 http://nlp.stanford.edu/software/tagger.shtml

4.2.4 Evaluation Results

In this section we evaluate the use of the sentiment features discussed in Section 4.2 and present the sentiment identification results on the STS-Expand, HCR and OMD datasets. We then compare these results with those obtained from using the baseline features described in Section 4.2.3.3.

          | Topic 1  | Topic 2  | Topic 3    | Topic 4  | Topic 5
Positive  | win      | twilight | make       | tomorrow | today
          | final    | movie    | mccain     | weekend  | nice
          | watch    | award    | debate     | school   | Sunday
          | game     | moon     | good       | start    | enjoy
          | luck     | tonight  | point      | plan     | weather
          | today    | watch    | interest   | fun      | love
          | week     | mtv      | right      | yai      | walk
          | hope     | excited  | answer     | wait     | sunny
Negative  | dog      | obama    | miss       | feel     |
          | internet | sad      | question   | far      | sick
          | download | death    | understand | travel   | bad
          | apple    | accident | doesn't    | mum      | hurt
          | store    | today    | answer     | away     | pain
          | slow     | car      | comment    | dad      | flu
          | issue    | awful    | back       | love     | sore
          | crash    | cry      | debate     | country  | horrible

Table 16: Extracted sentiment-topic words by the sentiment-topic model (JST) [87].

Algorithm 1 NB training with sentiment-topics extracted from JST.
Input: The training set D^train and test set D^test
Output: NB sentiment classifier
1: Train a JST model on D^train with the tweet sentiment labels discarded
2: Infer sentiment-topics for D^test
3: for each tweet w_n = (w_1, w_2, ..., w_m) ∈ {D^train, D^test} do
4:     Augment the tweet with sentiment-topics generated from JST: w′_n = (w_1, w_2, ..., w_m, l_1_z_1, l_2_z_2, ..., l_m_z_m)
5: end for
6: Create a new training set D^train′ = {(w′_n; c_n) : 1 ≤ n ≤ N^train}
7: Create a new test set D^test′ = {w′_n : 1 ≤ n ≤ N^test}
8: Perform feature selection using IG on D^train′
9: Return NB trained on D^train′
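As an illustration of Steps 3-4 of Algorithm 1, the sketch below augments tokenised tweets with combined “label_topic” tokens; the per-word (sentiment, topic) assignments are assumed to come from a trained JST model, which is not shown here:

```python
def augment_with_sentiment_topics(tweets, jst_assignments):
    """Append a combined 'sentimentLabel_topicId' token per word (Step 4 of Algorithm 1).

    jst_assignments[n] is assumed to hold one (sentiment_label, topic_id) pair per
    word of tweet n, as inferred by a trained JST model (hypothetical input here).
    """
    augmented = []
    for words, assignments in zip(tweets, jst_assignments):
        extra = [f"{label}_{topic}" for label, topic in assignments]
        augmented.append(words + extra)
    return augmented

# Hypothetical JST output for two tweets
tweets = [["twilight", "movie", "good"], ["flu", "pain"]]
assignments = [[("pos", 2), ("pos", 2), ("pos", 3)], [("neg", 5), ("neg", 5)]]
print(augment_with_sentiment_topics(tweets, assignments))
```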

We use NB trained from word unigrams as the starting-point baseline model. The features are incorporated into NB either by the interpolation approach described in Section 4.2.2, or by simply augmenting the original bag-of-words feature space. For evaluation on STS-Expand and HCR, we use the training and testing sets shown in Table 13. For OMD, we perform 5-fold cross validation and report the results averaged over 10 runs; this is because, unlike STS-Expand and HCR, the OMD dataset is published as one unpartitioned collection of tweets. The raw tweet data can be very noisy, and hence some pre-processing was necessary, such as replacing all hyperlinks with “URL”, converting words with an apostrophe, such as “hate’n”, to their complete form “hating”, removing repeated letters (e.g. “loovee” becomes “love”), and removing words that have fewer than 3 letters.

4.2.4.1 Results on Incorporating Semantic Features

Semantic features can be incorporated into NB training in three different ways: replacement, augmentation, and interpolation (Section 4.2.2). Table 17 shows the F measures produced when using each of these feature incorporation methods. With semantic replacement, where all entities in tweets are replaced with their corresponding semantic concepts, the feature space shrank substantially by nearly 15-20%, and produced an average F measure of 68.9%. However, this is 3.5% and 10.2% lower than when using semantic augmentation and interpolation respectively. The performance degradation is due to the information loss caused by this term replacement, which subsequently reduces NB performance. Augmenting the original feature space with semantic concepts (semantic augmentation) performs slightly better than semantic replacement, though it still performs 6.5% worse than interpolation. With semantic interpolation, semantic concepts are incorporated into NB training taking into account the generative probability of words given concepts. This method produces the highest performance amongst all three incorporation methods, with an average F1-measure of 75.95%. The contribution of semantic features in the interpolation model is controlled by the interpolation coefficients in Equation 14. We conducted a sensitivity test to evaluate the impact of the interpolation coefficients on sentiment classification accuracy by varying β between 0 and 1. Figure 20 shows that accuracy reaches its peak with β set between 0.3 and 0.5. In our evaluation, we used β = 0.4 for the STS-Expand dataset, and β = 0.3 for the other two.

Method | STS-Expand | HCR | OMD | Average
Semantic replacement | 74.10 | 61.35 | 71.25 | 68.90
Semantic augmentation | 77.65 | 63.65 | 72.70 | 71.33
Semantic interpolation | 83.90 | 66.10 | 77.85 | 75.95

Table 17: Average sentiment classification performance (%) using different methods for incorporating the semantic features. Performance here is the average harmonic mean (F measure) obtained from identifying positive and negative sentiment.

Figure 20: Sensitivity test of the interpolation coefficient for semantic interpolation.

4.2.4.2 Comparison of Results

In this section we compare the Precision, Recall, and F measure of our conceptual semantic sentiment analysis against the baselines described in Section 4.2.3.3. We report the sentiment classification results for identifying positive and negative sentiment separately to allow for a deeper analysis of results. This is especially important given that some analysis methods perform better in one sentiment polarity than in the other. Table 18 shows the results of our sentiment classification using Unigrams, POS, Sentiment-Topic, and Conceptual Semantic features, applied over the STS-Expand, HCR, and OMD datasets. The table reports three sets of P, R, and F1: one for positive sentiment identification, one for negative sentiment identification, and a third showing the averages of the two. According to the results in Table 18, the Semantic approach outperforms the Unigrams and POS baselines in all categories and for all three datasets. However, for the HCR and OMD datasets, the sentiment-topic analysis approach seems to outperform the semantic approach by a small margin.

Dataset | Feature | Positive Sentiment (P / R / F1) | Negative Sentiment (P / R / F1) | Average (P / R / F1)
STS-Expand | Unigrams | 82.20 / 75.20 / 78.50 | 79.30 / 85.30 / 82.20 | 80.75 / 80.25 / 80.35
STS-Expand | POS | 83.70 / 75.00 / 79.10 | 79.50 / 86.90 / 83.00 | 81.60 / 80.95 / 81.05
STS-Expand | Sentiment-Topic | 80.70 / 82.20 / 81.40 | 83.70 / 82.30 / 83.00 | 82.20 / 82.25 / 82.20
STS-Expand | Conceptual Semantics | 85.80 / 79.40 / 82.50 | 82.70 / 88.20 / 85.30 | 84.25 / 83.80 / 83.90
HCR | Unigrams | 39.00 / 36.10 / 37.50 | 81.00 / 82.80 / 81.90 | 60.00 / 59.45 / 59.70
HCR | POS | 56.20 / 22.00 / 31.70 | 80.00 / 94.70 / 86.70 | 68.10 / 58.35 / 59.20
HCR | Sentiment-Topic | 53.80 / 47.20 / 50.30 | 84.50 / 87.60 / 86.00 | 69.15 / 67.40 / 68.15
HCR | Conceptual Semantics | 53.60 / 40.40 / 46.10 | 83.10 / 89.30 / 86.10 | 68.35 / 64.85 / 66.10
OMD | Unigrams | 64.20 / 70.90 / 67.10 | 83.30 / 78.60 / 80.80 | 73.75 / 74.75 / 73.95
OMD | POS | 69.50 / 68.30 / 68.70 | 83.10 / 83.90 / 83.40 | 76.30 / 76.10 / 76.05
OMD | Sentiment-Topic | 68.20 / 75.60 / 71.70 | 87.10 / 82.40 / 84.70 | 77.65 / 79.00 / 78.20
OMD | Conceptual Semantics | 75.00 / 66.60 / 70.30 | 82.90 / 88.10 / 85.40 | 78.95 / 77.35 / 77.85

Table 18: Cross comparison results of all four features.

For example, the semantic approach produced higher P, R, and F1 for the STS-Expand dataset, with F1 4.4% higher than Unigrams, 3.5% higher than POS, and 2.1% higher than the sentiment-topic features. In HCR, F1 from the semantic features was 8.9% and 11.7% higher than Unigrams and POS, but 3% lower than F1 from the sentiment-topic features. For OMD, semantic features also outperformed the Unigrams and POS baselines, with 5.2% and 2.4% higher F1 respectively. However, in the OMD dataset, F1 from semantic features was 0.4% lower than from the topic model, although Precision was actually higher by 1.7%.

As mentioned in Section 4.2.3.1 and detailed in Table 13, the STS-Expand dataset consists of a large collection of general tweets with no particular topic focus. Unlike STS-Expand, the other two datasets are much smaller in size and their tweets discuss very specific topics: the US Health Care Reform bill in the HCR dataset, and the Obama-McCain debate in the OMD dataset. Using conceptual semantic features seems to perform best in the large and general dataset, whereas the sentiment-topic features seem to take the lead in small, topic-focused datasets. The likely reason is that the sentiment-topic features group words into a number of topics. In our experiments, we found that for the STS-Expand dataset, increasing the number of topics leads to an increase in classification performance, with the peak value of 82.2% in F1-measure reached at 50 topics. Further increasing the number of topics degrades the classifier performance. However, for HCR and OMD, the best performance was obtained with only one topic

(F1 = 68.15% for HCR and F1 = 78.20% for OMD). The classification performance drops significantly with any further increase. This could be explained by the nature of these three datasets. HCR was collected using the hashtag “#hcr” (health care reform), while OMD consists of tweets about the Obama-McCain debate; hence these two datasets are topic-specific. On the contrary, STS was collected using more general queries and thus contains a potentially large number of topics. Hence the benefits of using the sentiment-topic features seem to be reduced in comparison to semantic features when the training set is of general content, as in the STS tweets dataset.

The average results across all three datasets are shown in Table 19. Here we can see that conceptual semantic features do better than sentiment-topic features and the other baselines when identifying negative sentiment. However, sentiment-topic features seem to perform better for positive sentiment. For positive sentiment, using the semantic approach produces precision levels that are better than the ones obtained with Unigrams, POS, and sentiment-topic by 15.6%, 2.4%, and 5.8% respectively. However, the Recall produced by the semantic approach when identifying positive sentiment is 2.3% and 12.8% higher than with Unigrams and POS, but 9% lower than Recall from the sentiment-topic approach. Overall, F1 for positive sentiment from semantic features is 2.2% lower than when using sentiment-topic features. It is worth emphasising that the average Precision from identifying both positive and negative sentiment is the highest, at 77.18%, when using semantic features. When analysing large amounts of continuously flowing data, as with social media resources, Precision could well be regarded as much more important than Recall.

Features | Positive Sentiment (P / R / F1) | Negative Sentiment (P / R / F1) | Average (P / R / F1)
Unigrams | 61.80 / 60.73 / 61.03 | 81.20 / 82.23 / 81.63 | 71.50 / 71.48 / 71.33
POS | 69.80 / 55.10 / 59.83 | 80.87 / 88.50 / 84.37 | 75.53 / 72.23 / 72.48
Sentiment-Topic | 67.57 / 68.33 / 67.80 | 85.10 / 84.10 / 84.57 | 77.02 / 76.73 / 76.75
Semantics | 71.47 / 62.13 / 66.30 | 82.90 / 88.53 / 85.60 | 77.18 / 75.33 / 75.95

Table 19: Averages of Precision, Recall, and F measures across all three datasets.

4.3 conceptual semantics for lexicon-based sentiment analysis

So far in this chapter, we have presented our approach for extracting and exploiting conceptual semantics as features for sentiment analysis using a supervised machine learning approach. In this section, we continue investigating conceptual semantics by using them for lexicon-based sentiment analysis. To this end, we propose incorporating our conceptual semantic features into our previously proposed model, SentiCircles (see Chapter 3). Remember that SentiCircles is a lexicon-based sentiment detection model that uses the contextual semantics of words in order to capture their sentiment in tweets. Therefore, by enriching SentiCircles with conceptual semantics, we aim to: (i) study the usefulness of conceptual semantics when used for lexicon-based sentiment analysis, and (ii) investigate the impact of combining the contextual and conceptual semantics of words on the overall sentiment detection performance of the SentiCircle model.

4.3.1 Enriching SentiCircles with Conceptual Semantics

We add conceptual semantics into the SentiCircle representation using the Semantic Augmentation method (Section 4.2.2), i.e., we add the semantic concepts of named-entities to the original tweet before extracting our SentiCircle representation (e.g., “headache” and its concept “Health Condition” will appear together in the SentiCircle of “headache”, since by applying semantic augmentation, these terms now co-occur within the augmented tweets). Also note that each extracted concept is represented by a SentiCircle in order to compute its overall sentiment. The rationale behind augmenting semantic concepts into SentiCircles is that certain entities and concepts tend to have a more consistent correlation to terms of positive or negative sentiment. This can help determine the sentiment of semantically relevant or similar entities which do not explicitly express sentiment. In the example in Figure 21, “Wind” and “Humidity” have negative SentiCircles as they tend to appear with negative terms in tweets. Hence their concept “Weather Condition” will have a stronger negative sentiment. The tweet “Cycling under a heavy rain.. What a #luck!” is likely to have a negative sentiment due to the presence of the word “rain”, which is mapped to the negative concept “Weather Condition”. Moreover, the word “heavy” in this context is

more likely to have a negative sentiment due to its correlation with “rain” and “Weather Condition”.

Figure 21: Mapping semantic concepts to detect sentiment.

4.3.2 Evaluation Results

Here, we report the results of enriching the SentiCircle representation with conceptual semantics using the augmentation method. In particular, we perform binary sentiment detection at the tweet level using the three SentiCircle detection methods, Median, Pivot and Pivot-Hybrid, on the OMD, HCR and STS-Gold datasets (see Section 3.3.2). The number of entities and concepts extracted for HCR and OMD is the one reported in Table 14. For the STS-Gold dataset, 2735 entities and 23 distinct concepts were extracted using AlchemyAPI. We perform 10-fold cross validation on all datasets and report the win/loss in accuracy and F-measure when adding the conceptual semantics to the SentiCircle model, with respect to the original results reported in Chapter 3 (Figure 15) when applying SentiCircles without semantic augmentation. Note that here we used Thelwall-Lexicon [160] to obtain the word prior sentiments in our three sentiment detection methods. As mentioned in Chapters 2 and 3, Thelwall-Lexicon is designed specifically to work on social data. The results, as depicted in Figure 22, show that the impact of conceptual semantics on the performance of SentiCircles varies across datasets. On the HCR dataset, a 1.21% gain in F-measure is obtained using the Median method. On the STS-Gold dataset, the Median and the Pivot methods improve the performance by 0.49% and 0.12% in F-measure, respectively. On the other hand, a small drop in both accuracy and F-measure is observed on the OMD dataset when using conceptual semantics with any of the methods.


Figure 22: Average Win/Loss in Accuracy and F-measure of incorporating conceptual semantics into SentiCircles using the Median, Pivot and Pivot-Hybrid methods on all datasets.

It is also worth noting that the Median method is more affected by conceptual semantic incorporation than the Pivot method, with a much clearer shift in performance observed across datasets. This can be explained by the fact that the Median method considers all the incorporated concepts in the SentiCircle, whereas the Pivot method focuses more on concepts that are associated with target terms in tweets (Section 3.3.2). Overall, the average results across the three datasets show that using conceptual semantics with the Median method improves the sentiment detection performance by 0.39% on average. On the other hand, both the Pivot and Pivot-Hybrid methods lower the overall performance, by 0.09% and 0.28% in F-measure respectively. Also, it seems that conceptual semantics with the Pivot-Hybrid method does not bring much improvement to the performance on any dataset. Lastly, the number of entities extracted for the STS-Gold dataset is almost twice that for HCR or OMD. Nonetheless, the average results show that semantic incorporation seems to have a better impact on the STS-Gold dataset than on the OMD and HCR datasets. This might be due to the topical focus of each dataset. As discussed earlier, the OMD and HCR datasets are both composed of a smaller number of tweets about specific topics (the US Health Care Reform bill and the Obama-McCain debate), whereas the STS-Gold dataset contains a larger number of tweets with no particular topical focus.

4.4 discussion

In this chapter we demonstrated the value of using conceptual semantics as features for the classification of positive and negative sentiment in Tweets. We aimed at addressing

the question of whether incorporating conceptual semantics into supervised and lexicon-based methods increases their sentiment classification performance or not. Our methodology of using conceptual semantics for sentiment analysis consists mainly of two steps: conceptual semantics extraction and conceptual semantics incorporation.

Conceptual Semantic Extraction: We tested several off-the-shelf semantic entity extractors and decided on using AlchemyAPI due to its better performance in terms of coverage and accuracy. However, in our evaluation of AlchemyAPI, Zemanta, and OpenCalais, we observed that some of them perform better than others for specific types of entities. For example, Zemanta produced more accurate and specific concepts to describe entities related to music tracks and bands. It might be possible to implement a more selective approach, where certain semantic extractors and concept identifiers are used, or trusted more, for certain types of entities.

Conceptual Semantic Incorporation: We proposed adding semantic concepts as features into: (i) the training phase of the NB supervised classifier, and (ii) the semantic representation phase of our SentiCircle lexicon-based model. For both models, all semantic concepts extracted from the tweets were added to the analysis. However, it might be the case that semantic features improve sentiment analysis accuracy for some types of concepts (e.g. cities, music) but reduce accuracy for other concept types (e.g. people, companies). Therefore, a potential future direction of this work is to investigate the impact of each group of concepts on our analysis accuracy, to determine their individual contribution and impact on sentiment analysis. We could also assign weights to each concept type to represent its correlation with positive or negative sentiment. We experimented with multiple datasets of varying sizes and topical focus. Our results showed that accuracy can be sensitive to the size of the datasets and their topical focus. For example, our evaluation, in both supervised and lexicon-based settings, showed that the semantic approach excels when the dataset is large and of diverse topic coverage. An area for future work is to apply these approaches on larger datasets to examine the consistency of their performance patterns. Another possible direction is to explore various feature selection strategies to improve the sentiment classification performance.

4.5 summary

Conceptual semantics refer to the semantically hidden concepts of named entities extracted from tweets. In this chapter, we proposed the use of conceptual semantics in Twitter sentiment analysis, aiming to address the research question: Could the conceptual semantics of words enhance sentiment analysis performance? To this end, we proposed using semantic concepts as features in two widely used sentiment analysis approaches in the Twitter sentiment analysis literature: the supervised machine learning approach and the lexicon-based approach. For supervised sentiment classification, we explored three different methods for incorporating conceptual semantics into the analysis: replacement, augmentation, and interpolation. We found that the best results were achieved when interpolating the generative model of words given semantic concepts into the unigram language model of the NB classifier. We experimented with three Twitter datasets and compared the semantic features with the Unigrams and POS features, as well as with the sentiment-topic features. Our results show that the semantic feature model outperforms the Unigram and POS baselines for identifying both negative and positive sentiment. We demonstrated that adding semantic features produces higher Recall and F1 score, but lower Precision, than sentiment-topic features when classifying negative sentiment. We also showed that using semantic features outperforms the sentiment-topic features for positive sentiment classification in terms of Precision, but not in terms of Recall and F1. On average, the semantic features appeared to be the most precise amongst the four feature sets we experimented with. For lexicon-based sentiment analysis, we enriched our SentiCircle representation with conceptual semantics using the augmentation method. Results on three Twitter datasets showed that adding concepts to SentiCircles has good potential over using SentiCircles alone. Overall, our results suggest that the semantic approach is more appropriate when the datasets being analysed are large and cover a wide range of topics, whereas the sentiment-topic approach is most suitable for relatively small datasets with specific topical foci.

5

SEMANTIC PATTERNS FOR SENTIMENT ANALYSIS OF TWITTER

In previous chapters we explored the value of the contextual and conceptual semantics of words in sentiment analysis on Twitter. In this chapter, we move a step further; instead of using semantics associated with individual words as indicators of sentiment, we propose extracting patterns from the contextual semantic and sentiment similarities between words in tweets. After that, we investigate how the extracted patterns can be used to enhance sentiment classification performance on Twitter. Our findings suggest that contextual semantic and sentiment similarities of words do exist in tweets, by means of certain patterns. These patterns, when used as features to train sentiment classifiers, consistently outperform other types of features extracted from the syntactic, contextual or conceptual semantic representations of words.

5.1 introduction

Chapters 3 and 4 showed that merely relying on individual words does not always lead to satisfactory sentiment detection results, since sentiment is often context-dependent, and in those chapters we proposed extracting and using the semantics of individual words to detect their sentiment as well as the overall sentiment of the tweet they occur within. In doing this research, we observed that some words had similar contextual semantics in different tweets (i.e., similar co-occurrence patterns and contextual term distributions). For example, similar contextual semantics are likely to be used when expressing positive sentiment towards “Beatles” and “Katy Perry”. On the other hand, the contextual semantics for “iPod” and “Taylor Swift” are likely to be different even though they both receive positive sentiment, as discussed in Chapter 3. Such semantic similarities between words can be used to form clusters that represent different patterns of semantics and sentiment. These patterns, when identified and extracted from tweets, can be used to enhance the overall sentiment classification performance of tweets as well as named entities. Knowing the semantic pattern that is used to express a certain sentiment enables


the identification of the sentiment of other words or entities following the same pattern. We refer to these patterns as semantic sentiment patterns, or SS-Patterns for short. The research question we aim to address in this chapter is:

RQ3 Could semantic sentiment patterns boost sentiment analysis performance?

To address the above question, we propose a new approach for automatically extracting patterns based on the contextual semantic and sentiment similarities between words in a given Twitter corpus. We apply our approach to 9 different Twitter datasets, and validate the extracted patterns by using them as classification features in entity- and tweet-level sentiment analysis tasks. To this end, we train several supervised classifiers from SS-Patterns and compare their sentiment classification performance against models trained from 6 state-of-the-art sets of features derived from both the syntactic and semantic representations of words. Our results show that SS-Patterns consistently outperform all the baseline feature sets, on all 9 datasets, in both tweet-level and entity-level sentiment classification tasks. At the tweet level, SS-Patterns improve the classification performance by 1.94% in accuracy and 2.19% in F-measure on average. Also, at the entity level, our patterns produce 2.88% and 2.64% higher accuracy and F-measure, respectively, than all other features. We also conduct quantitative and qualitative analyses on a sample of the patterns extracted by our approach and show that the effectiveness of using SS-Patterns as additional features for classifier training is attributed to their ability to capture words with similar contextual semantics and sentiment. We also show that our extraction approach is able to detect patterns of controversial sentiment (strong opposing sentiment) expressed by people towards certain entities. The rest of this chapter is organised as follows: Background on using linguistic and semantic patterns in sentiment analysis is presented in Section 5.2. The proposed approach for extracting semantic sentiment patterns is presented in Section 5.3. Evaluation setup and results are presented in Sections 5.4 and 5.5 respectively. Our pattern analysis study is described in Section 5.6. Discussion is covered in Section 5.7. Finally, we summarise our work and findings in this chapter in Section 5.8.

5.2 related work

Sentiment in text is often conveyed through relations and dependencies between words [130]. These relations are usually compiled as a set of syntactic patterns (i.e., Part-of-Speech patterns) [165, 130, 161] or semantic and common sense concepts [55, 25]. For example, the adjective “mean”, when preceded by a verb, constitutes a pattern of negative sentiment, as in: “she said mean things”. Also, the word “destroy” formulates a positive pattern when it occurs with the concept “invading germs”. However, both syntactic and semantic approaches to extracting sentiment patterns are usually not tailored to Twitter due to its noisy nature, and both depend on pre-defined external resources. Most syntactic approaches rely on fixed and pre-defined sets of syntactic templates for pattern extraction. Semantic approaches, on the other hand, rely on external ontologies and common sense knowledge bases. Such resources, although useful, tend to be focused on particular domains, and therefore provide limited coverage, which is especially problematic when processing general Twitter streams, with their rapid semiotic evolution and language deformations, as discussed in Chapter 4. In order to overcome the aforementioned limitations and to address the third research question of this thesis, we design our sentiment pattern extraction approach in a way that captures patterns based on the contextual semantic and sentiment similarities between words in a Twitter corpus. Our approach does not rely on the syntactic structures in tweets, nor does it require pre-defined syntactic template sets or external semantic knowledge sources.

5.3 semantic sentiment patterns of words

We define semantic sentiment patterns as clusters of words which have similar contextual semantics and sentiment in text. Based on this definition, the problem of capturing these patterns in tweet data breaks down into three phases, as illustrated in Figure 23. In the first phase, tweets in a given data collection are syntactically processed in order to reduce the amount of noise and language informality in them (Section 5.3.1). In the second phase, we apply our SentiCircle representation model (see Chapter 3) on the processed tweets to capture the contextual semantics and sentiment of words (Section 5.3.2). In the third

phase the semantic sentiment patterns are formed by clustering words that share similar semantics and sentiment (i.e., similar SentiCircles) (Section 5.3.3).


Figure 23: The systematic workflow of capturing semantic sentiment patterns from Twitter data.

In the subsequent sections we describe each of the aforementioned phases in more detail:

5.3.1 Syntactical Preprocessing

Tweets are usually composed of incomplete, noisy and poorly structured sentences, due to the frequent presence of abbreviations, irregular expressions, ill-formed words and non-dictionary terms. This noisy nature of tweets has been shown to indirectly affect sentiment classification performance [137]. Therefore, this phase applies the following pre-processing steps to reduce the amount of noise in the tweets and consequently reduce its possible negative impact on the extraction of SS-Patterns:

• Replace all URL links in the Twitter corpus with the term “URL”.

• Remove all non-ASCII and non-English characters.

• Revert words that contain repeated letters to their original English form. For exam- ple, the word “maaadddd” will be converted to “mad” after processing.

• Process contraction and possessive forms. For example, change “he’s” and “friend’s” to “he” and “friend”.

Note that we do not remove stopwords from the data since they tend to carry sentiment information as will be shown in Chapter 6.
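A minimal sketch of these pre-processing steps is given below; the contraction list and the repeated-letter rule are simplified illustrations rather than the exact normalisation used in our pipeline:

```python
import re

# Illustrative (non-exhaustive) contraction/possessive handling
CONTRACTIONS = {"he's": "he", "she's": "she", "friend's": "friend"}

def preprocess(tweet):
    """Apply the syntactic pre-processing steps of Section 5.3.1 to one tweet."""
    tweet = re.sub(r"http\S+|www\.\S+", "URL", tweet)   # replace links with "URL"
    tweet = re.sub(r"[^\x00-\x7F]+", " ", tweet)        # drop non-ASCII characters
    tweet = re.sub(r"(.)\1{2,}", r"\1", tweet)          # naive repeated-letter reduction: "maaadddd" -> "mad"
    words = [CONTRACTIONS.get(w.lower(), w) for w in tweet.split()]
    return " ".join(words)

print(preprocess("Cycling under a heaaaavy rain.. he's maaadddd http://t.co/xyz"))
```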

5.3.2 Capturing Contextual Semantics and Sentiment of Words

SS-Patterns are formed from the contextual semantic similarities among words. Therefore, a key step in our pipeline is to capture the words' contextual semantics in tweets. To this end, we use our semantic representation model, SentiCircle (Chapter 3).

Remember that the SentiCircle model extracts the contextual semantics of a word from (i) its co-occurrences with other words in a given tweet corpus and (ii) its prior sentiment obtained from an external sentiment lexicon. Therefore, this phase takes as input a collection of preprocessed tweets and a sentiment lexicon, and returns a bag of SentiCircles that represent each unique word in the tweet collection. Also, remember that the overall contextual semantics and sentiment of a given word can be computed by extracting the SentiMedian of the word's SentiCircle, as described in Section 3.2.2.4.

5.3.3 Extracting Patterns from SentiCircles

At this stage, all the unique words in the tweet collection have their contextual semantics and sentiment extracted and represented by means of their SentiCircles. It is very likely that some words in the text share similar contextual semantics and sentiment, in other words, that they have similar SentiCircles. Therefore, this phase seeks to find such potential semantic similarities in tweets by building clusters of similar SentiCircles. The output of this phase is a set of clusters of words, which we refer to as the semantic sentiment patterns of words (SS-Patterns).

SentiCircles Clustering: We can capture patterns that emerge from the similarity of words' sentiment and contextual semantics by clustering the SentiCircles of those words. In particular, we perform a clustering task fed by three dimensions provided by SentiCircles: density, dispersion, and geometry. Density and dispersion usually characterise terms and entities that receive controversial sentiment in tweets, as will be further explained and validated in Section 5.6. Geometry, on the other hand, preserves the contextual sentiment orientation and strength of terms. Once we extract the vectors that represent these three dimensions for all the terms' SentiCircles, we feed them into a common clustering method, k-means. In the following, we describe the three dimensions we extract from each term's SentiCircle Ω along with the components they consist of:

• Geometry: includes the X- and Y-components of the SentiMedian g(x_g, y_g) ∈ Ω.

• Density: includes the total density of points in the SentiCircle Ω, computed as density(Ω) = N/M, where N is the total number of points in the SentiCircle and M is the total number of points in the SentiCircles of all terms.

We also compute five density components, representing the density of each sentiment quadrant in the SentiCircle (i.e., the positive, very positive, negative and very negative quadrants) along with the density of its neutral region. Each of these components is computed as density(Q) = P/N, where P is the total number of points in the sentiment quadrant Q.

• Dispersion: the total dispersion of a SentiCircle refers to how scattered, or condensed, the points (context terms) in the circle are. To calculate the value of this component we use the median absolute deviation measure (MAD) [169], which computes the dispersion of Ω as the median of the absolute deviations from the SentiCircle's median point (i.e., the SentiMedian g_m):

mad(Ω) = (∑_{i=1}^{N} |p_i − g_m|) / N    (16)

where p_i represents the coordinates of point i within the SentiCircle, and N is the total number of points within the SentiCircle.

Similarly, using the above equation, we calculate the dispersion of each sentiment quadrant and of the neutral region in the SentiCircle. We also calculate the dispersion of the active region of the SentiCircle (i.e., the SentiCircle after excluding points in the neutral region).

The last step in our pipeline is to apply k-means on all SentiCircles (represented by their associated vectors capturing the three dimensions described above). This results in a set of clusters K = {k_1, k_2, ..., k_c}, where each cluster consists of words that have similar contextual semantics and sentiment. We call K the set of clusters (or patterns) and k_i ∈ K a semantic sentiment pattern. In the subsequent section, we describe how to determine the number of patterns (clusters) in the data and how to validate the extracted patterns by using them as features in two sentiment classification tasks.

5.4 evaluation setup

Our proposed approach, as shown in the previous section, extracts patterns of words with similar contextual semantics and sentiment. We evaluate the extracted SS-Patterns by using them as classification features to train supervised classifiers for two sentiment analysis tasks: tweet- and entity-level sentiment classification. To this end, we use 9 publicly

available and widely used datasets from the Twitter sentiment analysis literature [138]. All nine are used for tweet-level evaluation, and one of them is additionally used for entity-level evaluation. As for evaluation baselines, we use 6 types of classification features and compare the performance of classifiers trained from our SS-Patterns against those trained from these baseline features.

5.4.1 Tweet-Level Evaluation Setup

The first validation test we conduct on our SS-Patterns is to measure their effectiveness as features for binary sentiment analysis of tweets, i.e., classifying individual tweets as positive or negative. To this end, we use SS-Patterns extracted from a given Twitter dataset to train two supervised classifiers popularly used for tweet-level sentiment analysis, Maximum Entropy (MaxEnt) and Naïve Bayes (NB), from Mallet.1 We use 9 different Twitter datasets in our validation in order to avoid any bias that a single dataset can introduce. The numbers of positive and negative tweets within these datasets are summarised in Table 20, and detailed in Appendix A.

Dataset | #Tweets | #Negative | #Positive | #Unigrams
Stanford Twitter Test Set (STS-Test) [60] | 359 | 177 | 182 | 1562
Sanders Dataset (Sanders) [138] | 1224 | 654 | 570 | 3201
Obama McCain Debate (OMD) [47] | 1906 | 1196 | 710 | 3964
Health Care Reform (HCR) [152] | 1922 | 1381 | 541 | 5140
Stanford Gold Standard (STS-Gold) [138] | 2034 | 632 | 1402 | 4694
Sentiment Strength Twitter Dataset (SSTD) [160] | 2289 | 1037 | 1252 | 6849
The Dialogue Earth Weather Dataset (WAB) [5] | 5495 | 2580 | 2915 | 7485
The Dialogue Earth Gas Prices Dataset (GASP) [5] | 6285 | 5235 | 1050 | 8128
Semeval Dataset (Semeval) [104] | 7535 | 2186 | 5349 | 15851

Table 20: Twitter datasets used for tweet-level sentiment analysis evaluation.

5.4.2 Entity-Level Evaluation Setup

In the second validation test, we evaluate the usefulness of SS-Patterns as features for entity-level sentiment analysis, i.e., detecting sentiment towards a particular entity. To this end, we perform a 3-way sentiment classification (negative, positive, neutral) on a dataset of 58 named entities extracted from the STS-Gold dataset [138] and manually

1 http://mallet.cs.umass.edu/

labelled with their sentiment class. Numbers of negative, positive and neutral entities in this dataset are listed in Table 21 along with five examples of entities under each sentiment class. Details of the extraction and the annotation of these entities can be found in Chapter 3.

 | Negative Entities | Positive Entities | Neutral Entities
Total Number | 13 | 29 | 16
Examples | Cancer | Lakers | Obama
 | Lebron James | Katy Perry | Sydney
 | Flu | Omaha | iPhone
 | Wii | Taylor Swift | Youtube
 | Dominique Wilkins | Jasmine Tea | Vegas

Table 21: Numbers of negative, positive and neutral entities in the STS-Gold Entity dataset, along with examples of 5 entities under each sentiment class.

The entity sentiment classifier we use in our evaluation is based on maximum likelihood estimation (MLE). Specifically, we use tweets in the STS-Gold dataset to estimate the conditional probability P(c|e) of an entity e assigned with a sentiment class c ∈ {Positive, Negative} as:

P(c|e) = N(e, c)/N(e) (17)

where N(e, c) is the frequency of an entity e in tweets assigned with a sentiment class c and N(e) is the frequency of the entity e in the whole corpus. We incorporate our SS-Pattern features and other baseline features (Section 5.4.3) into the sentiment class estimation of e by using the following back-off strategy:

P̂(c|e) = { P(c|e)  if N(e,c) ≠ 0
         { P(c|f)  if N(e,c) = 0        (18)

where f is the incorporated feature (e.g., the SS-Pattern of e) and P(c|f) is the conditional probability of the feature f assigned with a sentiment class c, which can also be estimated using MLE. The rationale behind the above back-off strategy is that some entities might not occur in tweets of a certain sentiment class, therefore leading to zero probabilities. In such cases we resort to the sentiment of the latent features associated with these entities in the dataset.

To extract the sentiment of e we first calculate the following ratio:

R_e = P(c = Positive|e) / P(c = Negative|e)    (19)

In particular, the sentiment is neutral if R_e is close enough to 1, i.e., R_e ∈ [1 − λ, 1 + δ]; otherwise the sentiment is negative if R_e < 1 − λ, or positive if R_e > 1 + δ. We determine the values of λ and δ by plotting the ratio R_e for all 58 entities and checking the upper and lower cut-offs between which the plot converges. In our case, λ ≈ 0.7 and δ ≈ 0.3, as shown in Figure 24.


Figure 24: Positive to negative sentiment ratio for each of the 58 entities in the STS-Gold dataset.

Note that in the case where P(c = Positive|e) or P(c = Negative|e) is zero when computing R_e (Equation 19), we resort to using the conditional probability of the entity-associated feature with the given sentiment class c (i.e., P(c = Positive|f) or P(c = Negative|f)).
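The sketch below puts Equations 17-19 and the back-off strategy together for a single entity; the count dictionaries and feature name are hypothetical toy values, and λ and δ are set to the cut-offs read from Figure 24.

```python
def entity_sentiment(entity, feature, entity_counts, feature_counts, lam=0.7, delta=0.3):
    """3-way entity sentiment via the ratio R_e (Eq. 19), backing off to the entity's
    associated feature (e.g., its SS-Pattern) whenever an entity count is zero (Eq. 18).

    entity_counts[entity] = {"Positive": n_pos, "Negative": n_neg}; likewise feature_counts.
    """
    def mle(counts, key, c):
        total = sum(counts[key].values())
        return counts[key][c] / total if total else 0.0      # Eq. 17

    p_pos = mle(entity_counts, entity, "Positive") or mle(feature_counts, feature, "Positive")
    p_neg = mle(entity_counts, entity, "Negative") or mle(feature_counts, feature, "Negative")
    if p_neg == 0.0:
        return "Positive"
    ratio = p_pos / p_neg                                     # R_e
    if 1 - lam <= ratio <= 1 + delta:
        return "Neutral"
    return "Negative" if ratio < 1 - lam else "Positive"

# Hypothetical counts: the entity never occurs in negative tweets, so that side backs off
entity_counts = {"iPhone": {"Positive": 12, "Negative": 0}}
feature_counts = {"SS-Pattern-7": {"Positive": 40, "Negative": 25}}
print(entity_sentiment("iPhone", "SS-Pattern-7", entity_counts, feature_counts))
```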

5.4.3 Evaluation Baselines

The baseline model in our evaluation is a sentiment classifier trained from word unigram features. Table 20 shows the number of unique unigram features extracted from our datasets. In addition to unigrams, we compare our SS-Pattern features against the five state-of-the-art types of features described below. Amongst them, two sets of features are derived from the syntactical characteristics of words in tweets (POS features and Twitter features), one is based on the prior sentiment orientation of words (Lexicon features), and two are obtained from the semantic representation of words in tweets (Semantic Concept features and LDA-Topic features):

Twitter Features: refer to tokens and characters that are popularly used in tweet messages, such as hashtags (e.g., “#smartphone”), user mentions (e.g., “@obama”), the retweet token (“RT”) and emoticons (e.g., “:) :D <3 o_O”).

Part-of-Speech Features: refer to the part-of-speech tags of words in tweets (e.g., verbs, adjectives, adverbs, etc). We extract these features using the TweetNLP POS tagger.2

Lexicon Features: these features are formed from the opinionated words in tweets along with their prior sentiment labels (e.g., “excellent_positive”, “bad_negative”, “good_positive”, etc.). We assign words with their prior sentiments using both Thelwall [160] and MPQA [176] sentiment lexicons.

Semantic Concept Features: refer to our conceptual semantic features presented in Chapter 4, i.e., the semantic concepts (e.g., “person”, “company”, “city”) that represent entities (e.g., “Obama”, “Motorola”, “Vegas”) appearing in tweets. To extract the entities and their associated concepts in our datasets we use AlchemyAPI.3 An assessment of the semantic extraction performance of AlchemyAPI on Twitter data is presented in Chapter 4. The number of extracted concepts in each dataset is listed in Table 22.

Dataset | No. of Semantic Concepts
STS-Test | 299
Sanders | 1407
OMD | 2191
HCR | 1626
STS-Gold | 1490
SSTD | 699
WAB | 1497
GASP | 3614
Semeval | 2875

Table 22: Numbers of the semantic concepts extracted from all datasets.

LDA-Topic Features: these features denote the latent topics extracted from tweets using the probabilistic generative model LDA [19]. LDA assumes that a document is a mixture of topics and that each topic is a mixture of probabilities of words that are more likely to co-occur together under the topic (see Chapter 2). For example, the topic “iPhone” is likely to be associated with words like “display” and “battery”. Therefore, LDA-Topics represent groups of words that are semantically related. To extract these latent topics from our datasets we use an implementation of LDA provided by Mallet. LDA requires defining the number of topics to extract before applying it to the data. To this end, we ran LDA with different numbers of topics (e.g., 1 topic, 10 topics, 20 topics, 30 topics, etc.). Among all choices, 10 topics was the number that gave the highest sentiment classification performance when the topics were incorporated into the feature space.

Note that all the above sets of features are combined with the original unigram features when training the baseline sentiment classifiers at both entity and tweet levels.
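The topics themselves are extracted with Mallet in our experiments; purely as an illustration of how such LDA-Topic features could be derived and appended to the unigram feature space, the sketch below uses scikit-learn's LatentDirichletAllocation. The toy corpus and the one-hot encoding of the dominant topic are assumptions made for the example only.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    tweets = ["my iphone battery died again", "loved the display on the new phone"]

    vectorizer = CountVectorizer()
    X_counts = vectorizer.fit_transform(tweets)            # unigram features

    # Fit LDA with a fixed number of topics (10 gave the best results in our experiments)
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    doc_topics = lda.fit_transform(X_counts)               # per-tweet topic distribution

    # One possible encoding of LDA-Topic features: a one-hot vector of the dominant
    # topic of each tweet, appended to the original unigram features
    dominant = doc_topics.argmax(axis=1)
    topic_features = np.eye(10)[dominant]
    X_augmented = np.hstack([X_counts.toarray(), topic_features])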

5.4.4 Number of SS-Patterns in Data

As described earlier, extracting SS-Patterns is a clustering problem that requires determining beforehand the number of clusters (patterns) to extract. To this end, we run k-means multiple times with k varying between 1 and 100. We then plot the within-cluster sum of squares for all the outputs generated by k-means. The optimum number of clusters is found where an “elbow” appears in the plot [99]. For example, Figure 25 shows that the optimum number of clusters for the GASP dataset is 17, which, in other words, represents the number of SS-Pattern features that our sentiment classifiers should be trained from. Table 23 shows the number of SS-Patterns extracted by our model for each dataset.
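A minimal sketch of this selection step is given below, assuming each term is already represented by some numeric vector (for instance, coordinates derived from its SentiCircle; the toy vectors here are random placeholders). The within-cluster sum of squares is scikit-learn's inertia_ attribute.

    import numpy as np
    from sklearn.cluster import KMeans

    def within_cluster_sums(term_vectors, k_values):
        """Run k-means for each k and record the within-cluster sum of squares."""
        sums = {}
        for k in k_values:
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(term_vectors)
            sums[k] = km.inertia_
        return sums

    rng = np.random.default_rng(0)
    term_vectors = rng.normal(size=(200, 2))      # 200 terms in a toy 2-d semantic space
    wss = within_cluster_sums(term_vectors, range(1, 101))
    # Plot wss against k and pick the k at which the curve flattens (the "elbow").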


Figure 25: Within-cluster sum of squares for different numbers of clusters (SS-Patterns) in the GASP dataset.

Dataset No. of SS-Patterns

STS-Test 18

Sanders 20

OMD 23

HCR 22

STS-Gold 26

SSTD 24

WAB 17

GASP 17

Semeval 19

Table 23: Numbers of SS-Patterns extracted from all datasets.

5.5 evaluation results

In this section, we report the results from using our proposed SS-Patterns as features for tweet- and entity-level sentiment classification tasks and compare them against the baselines described in Section 5.4.3. All experiments in both evaluation tasks are done using 10-fold cross validation.

5.5.1 Results of Tweet-Level Sentiment Classification

The first task in our evaluation aims to assess the usefulness of SS-Patterns as features for binary sentiment classification of tweets (positive vs. negative).4 We use NB and MaxEnt classifiers trained from word unigrams as the starting baseline models (unigram models). We then compare the performance of classifiers trained from other types of features against these unigram models. Table 24 shows the results in accuracy and average F1 measure of both unigram models across all datasets. The highest accuracy, 90.49%, is achieved on the GASP dataset using MaxEnt, while the highest average F-measure, 84.08%, is obtained on the WAB dataset. On the other hand, the lowest performance in accuracy, 72.36%, is obtained using NB on the SSTD dataset.

Also, NB produces the lowest F1, 66.69%, on the HCR dataset. On average, MaxEnt outperforms NB by 1.04% and 1.35% in accuracy and F1 respectively. Hence, we use MaxEnt only to continue our evaluation in this task.

4 Unlike entity-level, we do not perform 3-way classification (positive, negative, neutral) in this task since not all of the 9 datasets contain tweets of neutral sentiment.

Dataset        STS-Test  Sanders  OMD    HCR    STS-Gold  SSTD   WAB    GASP   SemEval  Average

MaxEnt   Acc   77.82     83.62    82.90  77.02  86.02     72.84  84.12  90.49  82.11    81.88
MaxEnt   F1    77.94     83.58    81.34  69.10  83.10     72.27  84.08  81.81  77.03    78.91

NB       Acc   81.06     82.66    81.57  74.27  84.22     72.36  82.79  88.16  80.44    80.84
NB       F1    81.07     82.52    79.93  66.69  80.46     72.20  82.74  78.15  74.35    77.57

Table 24: Average classification accuracy (acc) and average harmonic mean measure (F1) obtained from identifying positive and negative sentiment using unigram features.

Table 25 shows the results of MaxEnt classifiers trained from the 5 baseline sets of features (see Section 5.4.3), as well as MaxEnt trained from our proposed SS-Patterns, applied over all datasets. The table reports the minimum, maximum, and average win/loss in accuracy and F-measure relative to the results of the unigram model in Table 24. For simplicity, we refer to MaxEnt classifiers trained from any syntactic feature set as syntactic models and to those trained from any semantic feature set as semantic models.

It can be observed from the results in Table 25 that all syntactic and semantic models outperform, on average, the unigram model in both accuracy and F-measure. However, MaxEnt trained from our SS-Patterns significantly outperforms the models trained from any other set of features. In particular, our SS-Patterns produce on average 3.05% and 3.76% higher accuracy and F1 than the unigram model. This is 2% higher than the average performance gain of all syntactic and semantic models. Moreover, we get a maximum improvement in accuracy and F-measure of 9.87% and 9.78% respectively over the unigram model when using our SS-Patterns for training. This is at least 3.54% and 3.61% higher than any other model. It is also worth noting that on the GASP dataset, where the minimum performance gain is obtained, MaxEnt trained from SS-Patterns gives a minimum improvement of 0.70%, while all other models suffer an average performance loss of -0.45%.

Finally, we notice that syntactic features, and more specifically the lexicon ones, are highly competitive in comparison to the semantic features. For example, lexicon features slightly outperform concept and LDA-Topic features. However, from the average performance in Table 25 of both syntactic and semantic types of features, we can see that semantic models still surpass syntactic models in both accuracy and F-measure, by 0.71% and 0.75% on average respectively.

MaxEnt Classifier

Features                     Accuracy                        F1
                      Minimum  Maximum  Average      Minimum  Maximum  Average

Syntactic
  Twitter Features      -0.23     3.91     1.24        -0.25     4.53     1.62
  POS                   -0.89     2.92     0.79        -0.91     5.67     1.25
  Lexicon               -0.44     4.23     1.30        -0.38     5.81     1.83
  Average-Syntactic     -0.52     3.69     1.11        -0.52     5.33     1.57

Semantic
  Concepts              -0.22     2.76     1.20        -0.40     4.80     1.51
  LDA-Topics            -0.47     3.37     1.20        -0.68     6.05     1.68
  SS-Patterns            0.70     9.87     3.05         1.23     9.78     3.76
  Average-Semantic       0.00     5.33     1.82         0.05     6.88     2.32

Table 25: Win/Loss in Accuracy and F1 of using different features for sentiment classification on all nine datasets.


5.5.2 Results of Entity-Level Sentiment Classification

In this section, we report the evaluation results of using our SS-Patterns for entity-level sentiment classification on the STS-Gold Entity dataset, using the entity sentiment classifier described in Section 5.4.2. Note that STS-Gold is the only one of the 9 datasets that provides named entities manually annotated with their sentiment labels (positive, negative, neutral). Therefore, our evaluation in this task is done using the STS-Gold dataset only.

Features            Accuracy   Positive Sentiment     Negative Sentiment     Neutral Sentiment      Average
                               P      R      F1       P      R      F1       P      R      F1       P      R      F1

Unigrams            74.14      88.46  79.31  83.64    63.16  92.31  75       61.54  50     55.17    71.05  73.87  71.27
LDA-Topics          74.14      100    72.41  84       66.67  76.92  71.43    54.55  75     63.16    73.74  74.78  72.86
Semantic Concepts   75.86      80     82.76  81.36    75     69.23  72       68.75  68.75  68.75    74.58  73.58  74.04
SS-Patterns         77.59      100    79.31  88.46    61.11  84.62  70.97    64.71  68.75  66.67    75.27  77.56  75.37

Table 26: Accuracy and averages of Precision, Recall, and F1 measures of entity-level sentiment classification using different features.

Table 26 reports the results in accuracy, precision (P), recall (R) and F1 of the positive, negative and neutral sentiment classification performance obtained from using unigram, semantic concept, LDA-Topic and SS-Pattern features. Generally, our SS-Patterns outperform all other features, including word unigrams, in all measures on average. In particular, merely using word unigrams for classification gives the lowest performance, 74.14% in accuracy and 71.27% in average F1. However, augmenting the feature space with SS-Patterns improves the performance by 3.45% in accuracy and 4.1% in average F1. Our SS-Patterns also outperform the LDA-Topic and semantic concept features by at least 1.73% and 1.33% in accuracy and average F1 respectively.

As for per-class sentiment classification, we observe that all features produce higher performance on detecting positive entities than on detecting negative or neutral ones. For example, the average F1 for detecting negative entities is 72.35%, which is 12% lower than the average F1 for detecting positive entities. Also, we notice that the performance for detecting neutral entities is the lowest, with 63.43% in F1, which is 8.9% and 20.9% lower than the F1 for detecting negative and positive entities respectively. Such varying performance might be due to the uneven sentiment class distribution in the entity dataset. As can be noted from Table 21, positive entities constitute 50% of the total number of entities, while the neutral and negative entities together form the other 50%.

5.6 within-pattern sentiment consistency

Our approach, by definition, seeks to find SS-Patterns of terms of similar contextual semantics and sentiment. Therefore, SS-Patterns are more informative when they are consistent with the sentiment of their terms, that is, when they consist mostly of terms of similar contextual sentiment orientations. In this section, we further study the sentiment consistency of our patterns on a set of 14 SS-Patterns extracted from the 58 annotated entities in the STS-Gold dataset. The number of patterns was determined based on the elbow method, as explained in Section 5.4.4. Table 27 shows four of the extracted patterns along with their top 5 associated entities and the true sentiment of these entities, available within the STS-Gold dataset. Patterns 3, 12 and 11 are strongly consistent since all entities within them have the same sentiment. On the other hand, Pattern 5 has low sentiment consistency as it contains entities of mixed sentiment orientations.

We systematically calculate the sentiment consistency of a given SS-Pattern k_i as:

\mathrm{consistency}(k_i) = \max_{s \in S} \frac{E_s}{E} \qquad (20)

where s ∈ S = {Positive, Negative, Neutral} is a sentiment label, E_s is the number of entities of sentiment s, and E is the total number of entities within k_i.
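In code, this measure is simply the relative frequency of the majority sentiment label among a pattern's entities. A small sketch, using the top-5 entity labels of Patterns 3 and 5 from Table 27:

    from collections import Counter

    def consistency(entity_sentiments):
        """Fraction of entities in a pattern that carry the pattern's majority sentiment."""
        counts = Counter(entity_sentiments)
        return max(counts.values()) / len(entity_sentiments)

    print(consistency(["Neutral"] * 5))                                             # Pattern 3 -> 1.0
    print(consistency(["Negative", "Negative", "Negative", "Neutral", "Neutral"]))  # top-5 of Pattern 5 -> 0.6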


Figure 26: Within-Cluster sentiment consistencies in the STS-Gold Entity dataset.

Figure 26 depicts the sentiment consistency of the 14 SS-Patterns. Nine of the 14 patterns are perfectly consistent with the sentiment of their entities, while two patterns have a consistency higher than 77%. Only patterns 2, 5 and 6 have a consistency lower than 70%. Overall, the average consistency value across the 14 patterns reaches 88%.

Pattern 3 (Neutral)         Pattern 12 (Positive)       Pattern 5 (Mixed)           Pattern 11 (Positive)
Entity     True Sentiment   Entity      True Sentiment  Entity     True Sentiment   Entity       True Sentiment

Brazil     Neutral          Kardashian  Positive        Cancer     Negative         Amy Adams    Positive
Facebook   Neutral          Katy Perry  Positive        Fever      Negative         Dallas       Positive
Oprah      Neutral          Beatles     Positive        Headache   Negative         Riyadh       Positive
Sydney     Neutral          Usher       Positive        McDonald   Neutral          Sam          Positive
Seattle    Neutral          Pandora     Positive        Xbox       Neutral          Miley Cyrus  Positive

Table 27: Example of three strongly consistent SS-Patterns (Patterns 3, 11 and 12) and one inconsistent SS-Pattern (Pattern 5), extracted from the STS-Gold Entity dataset.

5.6.1 Sentiment Consistency vs. Sentiment Dispersion

From the above, we observed that patterns 2, 5 and 6 have low sentiment consistency. Looking at the characteristics of the entities in these patterns, we notice that the dispersion of their SentiCircles is 0.18 on average. This is twice as high as the dispersion of the entities within the other 11 strongly consistent patterns. Overall, we found a negative correlation of -0.42 between the sentiment consistency of SS-Patterns and the dispersion of their entities’ SentiCircles. This indicates that SS-Patterns that contain entities of highly dispersed SentiCircles are more likely to have low sentiment consistency. Based on the SentiCircle model (Section 5.3), these highly dispersed entities either occur very infrequently or occur in different contexts of different sentiment in the tweet corpus.


Figure 27: Number of times that entities in Patterns 2, 5 and 6 receive negative, positive and neutral sentiment in the STS-Gold dataset.

To validate our above observation, we analyse the human sentiment votes on the 58 entities in the STS-Gold dataset (the human votes on each entity are available to download with the STS-Gold dataset at http://tweenator.com). Figure 27 shows the entities under patterns 2, 5 and 6 along with the number of times they receive negative, positive and neutral sentiment in tweets according to the three human annotators. We observe that entities in patterns 2 and 6 occur very infrequently in tweets, yet with consistent sentiment. On the other hand, most entities in Pattern 5 occur more frequently in tweets. However, they receive controversial (i.e., opposing) sentiment. For example, the entity “McDonald’s” occurs 3, 4 and 8 times with negative, positive and neutral sentiment respectively. The above analysis shows the potential of our approach for generating patterns of entities that indicate sentiment disagreement or controversy in tweets.

5.7 discussion

In this chapter we showed the value of our proposed approach in extracting semantic sentiment patterns of words and using them for sentiment analysis on Twitter. We aimed to address the question of whether incorporating SS-Patterns as features into supervised classifier training improves the sentiment classification performance at tweet and entity levels.


Our patterns, by definition, are based on words’ similarities in a given context in tweets, which makes them relevant to that specific context. This means that they might need updating more frequently than context-independent patterns, i.e., patterns derived from pre-defined syntactic templates [130] or common-sense knowledge bases [25]. Hence, a potential gain in performance may be obtained by combining context-independent patterns with our patterns, which introduces a future direction for this work.

For tweet-level sentiment classification, SS-Patterns were evaluated on 9 Twitter datasets with different results. For example, our SS-Patterns produced the highest performance improvement on the STS-Test dataset (+9.78% over the baseline) while the lowest improvement was obtained on the GASP dataset (+1.23%). Different factors might be behind such variance. For example, our datasets differ in their sizes, sparsity degrees and sentiment class distributions. A future direction of this work is to study the influence of these factors on (i) the quality of the extracted patterns and (ii) their usefulness in sentiment analysis. Analysing such influence potentially allows for adapting our extraction approach to the characteristics of the Twitter dataset analysed, enabling better and more accurate pattern extraction.

For entity-level sentiment classification, the evaluation was performed on one dataset and with a single classifier. We noticed that detecting positive entities was easier than detecting neutral or negative entities. This might be due to: (i) the choice of the classifier we used, or (ii) the imbalanced sentiment class distribution in the dataset. As mentioned earlier, the number of positive entities is twice as large as the number of negative or neutral ones in the evaluation dataset. Also, our classifier relies on the back-off strategy, which makes the incorporation of our patterns in sentiment classification limited to those cases where entities do not occur in tweets of a certain sentiment class. Additional experimental work may help assess the effect of using balanced entity datasets and different types of entity sentiment classifiers on the performance of our patterns.

We showed that our approach was able to discover patterns of entities that could indicate sentiment disagreement or controversy in tweets. Those patterns have shown low consistency with the sentiment of the entities within them. Thus, one may expect terms under these patterns to negatively impact sentiment classifiers, and therefore, removing them from the feature space for sentiment classification might help improve the classification performance. This creates room for more research in this direction.

5.8 summary

Sentiment is often expressed via latent semantic relations, patterns and dependencies among words in tweets. In this chapter we aimed to address the third research question of this thesis: Could semantic sentiment patterns boost sentiment analysis performance? To this end, we proposed a novel approach that automatically captures patterns of words of similar contextual semantics and sentiment in tweets. Unlike previous work on sentiment pattern extraction, our proposed approach does not rely on external and fixed sets of syntactic templates/patterns, nor does it require deep analysis of the syntactic structure of sentences in tweets.

We assessed the performance of our approach by first extracting SS-Patterns from several Twitter datasets. After that, we validated the extracted patterns by using them as classification features in both tweet- and entity-level sentiment analysis tasks. For tweet-level sentiment analysis we trained MaxEnt and NB classifiers from SS-Patterns using 9 different Twitter datasets. For entity-level sentiment analysis we designed our own sentiment classifier based on maximum likelihood estimation. The entity sentiment classifier was evaluated on a single dataset of 58 manually annotated named entities.

Evaluation results showed that classifiers trained from our SS-Patterns in both sentiment analysis tasks and on all datasets produce a consistent and superior performance over classifiers trained from 6 state-of-the-art sets of features derived from both the syntactic and semantic representations of words.

We conducted an analysis of our SS-Patterns and showed that the efficiency of our patterns in sentiment analysis is due to their strong consistency with the sentiment of the terms within them. Also, the analysis showed that our approach was able to derive patterns of entities of controversial sentiment in tweets.

Part III

ANALYSIS STUDY

“We must think things not words, or at least we must constantly translate our words into the facts for which they stand, if we are to keep to the real and the true.”

Oliver Wendell Holmes Jr

6

STOPWORD REMOVAL FOR TWITTER SENTIMENT ANALYSIS

In previous chapters, we chose not to remove stopwords in our preprocessing of Twitter data, since stopwords tend to carry sentiment information, as suggested by several works on Twitter sentiment analysis. However, the impact of removing stopwords on the performance of Twitter sentiment classifiers remains debatable. Some works argue that stopword removal reduces the textual noise in tweets and consequently improves the sentiment classification performance.

In this chapter we study the impact of removing stopwords in the sentiment analysis task on Twitter. In particular, we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier’s feature space and its classification performance.

Our conclusion is that stopwords, in most cases, do carry sentiment information, and removing them from tweets has a negative impact on the sentiment classification performance.

6.1 introduction

The discussion brought in Chapters 1 and 2 showed that one of the key challenges that Twitter sentiment analysis methods have to confront is the noisy nature of Twitter-generated data. Twitter allows only 140 characters in each post, which encourages the use of abbreviations, irregular expressions and infrequent words. This phenomenon increases the level of data sparsity, affecting the performance of Twitter sentiment classifiers [137]. A well known method to reduce the noise of textual data is the removal of stopwords. This method is based on the idea that discarding non-discriminative words reduces the feature space of the classifiers and helps them to produce more accurate results [145].


This pre-processing method, widely used in the literature of document classification and retrieval, has been applied to Twitter in the context of sentiment analysis with contradictory results. While some works support the removal of stopwords [10, 110, 185, 152, 61, 82, 5], others claim that stopwords indeed carry sentiment information and that removing them harms the performance of Twitter sentiment classifiers [136, 73, 96, 72].

In addition, most of the works that have applied stopword removal for Twitter sentiment classification use general stopword lists, such as the Van stoplist [129], the Brown stoplist [54], etc. However, these stoplists have been criticised for: (i) being outdated [92, 149] (a phenomenon that may especially affect Twitter data, where new information and terms are continuously emerging) and (ii) not accounting for the specificities of the domain under analysis [7, 182], since non-discriminative words in one domain or corpus may have discriminative power in a different domain.

Aiming to solve the aforementioned limitations, several approaches have emerged in the areas of document retrieval and classification that aim to dynamically build stopword lists from the corpus under analysis. These approaches measure the discriminative power of terms using different methods, including: the analysis of term frequencies [163, 92], the term entropy measure [149, 148], the Kullback-Leibler (KL) divergence measure [92], and Maximum Likelihood Estimation [7]. While these techniques have been widely applied in the areas of text classification and retrieval, their impact on Twitter sentiment analysis has not been deeply investigated.

In this chapter we aim to fill the above gap by investigating the impact of stopword removal on the sentiment analysis task on Twitter and whether removing stopwords affects the performance of Twitter sentiment classifiers. The research question we aim to address in this chapter is:

RQ4 What effect does stopword removal have on sentiment classification performance?

To address the above question, we apply six stopword removal methods over Twitter data from six different datasets (obtained from the literature of Twitter sentiment classification) and observe how removing stopwords affects polarity classification of tweets (positive vs. negative). To this end, we use two well-known supervised sentiment classification methods, Maximum Entropy (MaxEnt) and Naive Bayes (NB). We assess the impact of removing stopwords by observing fluctuations in: (i) the level of data sparsity, (ii) the size of the classifier’s feature space and (iii) the classifier’s performance in terms of accuracy and F-measure.

Our results show that general stopword lists (classic stoplists) indeed hamper the performance of Twitter sentiment classifiers. Regarding the use of dynamic methods, stoplists generated by mutual information produce the highest increase in the classifier’s performance compared to not removing stopwords (1.78% and 2.54% average increase in accuracy and F-measure respectively), but only a moderate reduction of the feature space and no impact on the data sparsity. On the other hand, removing singleton words (those words appearing only once in the corpus) maintains a high classification performance while shrinking the feature space by 65% and reducing the dataset sparsity by 0.37% on average. Our results also show that while the different stopword removal methods affect sentiment classifiers similarly, Naive Bayes classifiers are more sensitive to stopword removal than the Maximum Entropy ones.

The rest of the chapter is organised as follows. Our analysis set-up is presented in Section 6.2. Results obtained from our analysis are presented in Section 6.3. Discussion is covered in Section 6.4. Finally, we summarise our work in this chapter in Section 6.5.

6.2 stopword analysis set-up

As mentioned in the previous section, our aim is to assess how different stopword removal methods affect the performance of Twitter sentiment classifiers. To this end, we assess the influence of six stopword removal methods over six Twitter corpora, using two sentiment classifiers. These components are detailed in the following subsections.

6.2.1 Datasets

Stopwords may have a different impact in different contexts. Words that do not provide any discriminative power in one context may carry some semantic information in another context. In this chapter we study the effect of stopword removal in six different Twitter datasets obtained from the literature of Twitter sentiment classification:

• The Obama-McCain Debate dataset (OMD) [144].

• The Health Care Reform data (HCR) [152].

• The STS-Gold dataset [138].

• Two datasets from the Dialogue Earth project (GASP, WAB) [5].

• The SemEval dataset [104].

Table 28 shows the total number of tweets and the vocabulary size (i.e., number of unique word unigrams) within each dataset. Note that we only consider the subsets of positive and negative tweets from these datasets, since we perform binary sentiment classification (positive vs. negative) in our analysis; details about the construction and annotation of these datasets are provided in Appendix A.

Dataset No. of Tweets Vocabulary Size

OMD 1,081 3,040

HCR 1,354 5,631

STS-Gold 2,034 5,780

SemEval 5,387 16,501

WAB 5,495 10,920

GASP 6,285 12,828

Table 28: Statistics of the six datasets used in evaluation.

6.2.2 Stopword removal methods

The Baseline method for this analysis is the non removal of stopwords. In the following subsections we introduce the stopword removal methods we use in our study.

6.2.2.1 The Classic Method

This method is based on removing stopwords obtained from general lists. Multiple lists exist in the literature [129, 54], but for the purpose of this work we have selected the classic Van stoplist [129].

6.2.2.2 Methods based on Zipf’s Law (Z-Methods)

In addition to the classic stoplist, we use three stopword generation methods inspired by Zipf’s law [188], which states that in a data collection the frequency of a given word is inversely proportional to its rank: removing the most frequent words (TF-High) [92, 163] and removing words that occur only once, i.e., singleton words (TF1) [93]. We also consider removing words with a low inverse document frequency (IDF) [53, 92].

To choose the number of words in the stoplists generated by the aforementioned methods, we first rank the terms in each dataset based on their frequencies (or their inverse document frequencies in the IDF method). Secondly, we plot the rank-frequency distribution of the ranked terms. The size of the stoplist corresponds to where an “elbow” appears in the plot. For example, Figure 28 shows the rank-frequency distribution of terms in the GASP dataset with the upper and lower cut-offs of the elbow in the distribution plot. From this figure, the TF-High stoplist is supposed to contain all the terms above the upper cut-off (approximately 50 terms). On the other hand, the TF1 stoplist should contain all the terms below the lower cut-off, which occur only once in tweets.
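A sketch of how the TF1 and TF-High stoplists can be built from raw term frequencies is given below; the tf_high_cutoff value stands in for the upper elbow cut-off, which in practice is read off the rank-frequency plot rather than fixed in code.

    from collections import Counter

    def zipf_stoplists(tokenised_tweets, tf_high_cutoff=50):
        """Build the TF1 (singletons) and TF-High (most frequent terms) stoplists."""
        freqs = Counter(tok for tweet in tokenised_tweets for tok in tweet)
        tf1 = {t for t, f in freqs.items() if f == 1}       # words occurring only once
        ranked = [t for t, _ in freqs.most_common()]         # terms by descending frequency
        tf_high = set(ranked[:tf_high_cutoff])               # terms above the upper cut-off
        return tf1, tf_high

    tweets = [["obama", "wins", "the", "debate"], ["the", "debate", "was", "boring"]]
    tf1, tf_high = zipf_stoplists(tweets, tf_high_cutoff=2)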


Figure 28: Rank-Frequency distribution of the top 500 terms in the GASP dataset. We removed all other terms from the plot to ease visualisation.

Figure 29 shows the rank-frequency distribution of terms for all datasets on a log-log scale. Although our datasets differ in the number of terms they contain, one can notice that the rank-frequency distribution in all six datasets fits the Zipf distribution well.

6.2.2.3 Term Based Random Sampling (TBRS)

This method was first proposed by Lo et al. [92] to automatically detect stopwords from web documents. The method works by iterating over separate, randomly selected chunks of data. It then ranks the terms in each chunk based on their informativeness values using the Kullback-Leibler divergence measure [41], as shown in Equation 21.


Figure 29: Frequency-Rank distribution of terms in all the datasets in a log-log scale.

d_x(t) = P_x(t) \cdot \log_2 \frac{P_x(t)}{P(t)} \qquad (21)

where P_x(t) is the normalised term frequency of a term t within a chunk x, and P(t) is the normalised term frequency of t in the whole collection. The final stoplist is then constructed by taking the least informative terms from all chunks, removing any duplicates.
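A simplified sketch of this procedure under the above definitions is shown below; the number of chunks and the number of terms kept per chunk are assumptions made for the example, since they are tuning choices in the original method.

    import math
    import random
    from collections import Counter

    def tbrs_stoplist(tokenised_tweets, n_chunks=10, per_chunk=20, seed=0):
        """Term Based Random Sampling: rank terms in random chunks by d_x(t)
        (Equation 21) and keep the least informative ones, merging duplicates."""
        random.seed(seed)
        tweets = list(tokenised_tweets)
        random.shuffle(tweets)
        chunk_size = max(1, len(tweets) // n_chunks)

        all_tokens = [tok for tw in tokenised_tweets for tok in tw]
        corpus_freq = Counter(all_tokens)
        corpus_total = len(all_tokens)

        stoplist = set()
        for i in range(0, len(tweets), chunk_size):
            chunk_tokens = [tok for tw in tweets[i:i + chunk_size] for tok in tw]
            if not chunk_tokens:
                continue
            chunk_freq = Counter(chunk_tokens)
            scores = {}
            for t, f in chunk_freq.items():
                p_x = f / len(chunk_tokens)            # P_x(t)
                p = corpus_freq[t] / corpus_total      # P(t)
                scores[t] = p_x * math.log2(p_x / p)   # d_x(t)
            stoplist.update(sorted(scores, key=scores.get)[:per_chunk])
        return stoplist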

6.2.2.4 The Mutual Information Method (MI)

Stopword removal can be thought of as a feature selection routine, in which features that do not contribute toward making correct classification decisions are considered stopwords and are removed from the feature space. The mutual information method (MI) [41] is a supervised method that computes the mutual information between a given term and a document class (e.g., positive, negative), providing an indication of how much information the term provides about a given class. Low mutual information suggests that the term has low discrimination power and hence should be removed. Formally, the mutual information between two random variables representing a term t and a class c is calculated as [180]:

I(T; C) = \sum_{t \in T} \sum_{c \in C} p(t, c) \log \frac{p(t, c)}{p(t)\, p(c)} \qquad (22)

where I(T; C) denotes the mutual information between T and C. Here T = {0, 1} indicates whether a term t occurs (T = 1) or does not occur (T = 0) in a given document, and C = {0, 1} represents the class of the document: if the document belongs to class c then C = 1, otherwise C = 0. Note that the size of the stoplists generated by both the MI and the TBRS methods is determined using the elbow approach, as in the case of the Z-Methods, i.e., ordering terms with respect to their informativeness values and searching for where the elbow appears in the rank-informativeness plot.
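A direct implementation of Equation 22 for a single term over a binary-labelled collection is sketched below (an illustration only; in practice the score would be computed for every term and the lowest-scoring terms would form the stoplist).

    import math
    from collections import Counter

    def mutual_information(term, tokenised_tweets, labels):
        """I(T; C) of Equation 22, with T (term present) and C (class) both binary."""
        n = len(tokenised_tweets)
        joint = Counter()
        for tweet, c in zip(tokenised_tweets, labels):
            joint[(1 if term in tweet else 0, c)] += 1
        mi = 0.0
        for t in (0, 1):
            p_t = sum(joint[(t, c)] for c in (0, 1)) / n
            for c in (0, 1):
                p_c = sum(joint[(tt, c)] for tt in (0, 1)) / n
                p_tc = joint[(t, c)] / n
                if p_tc > 0:                   # the 0 * log 0 terms are taken as 0
                    mi += p_tc * math.log(p_tc / (p_t * p_c))
        return mi

    tweets = [["good", "film"], ["bad", "film"], ["good", "day"], ["awful", "day"]]
    labels = [1, 0, 1, 0]                      # 1 = positive, 0 = negative
    print(mutual_information("good", tweets, labels))   # high: "good" discriminates the classes
    print(mutual_information("film", tweets, labels))   # 0.0: "film" tells us nothing about the class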

6.2.3 Twitter Sentiment Classifiers

To assess the effect of stopwords on sentiment classification we use two of the most popular supervised classifiers in the literature of sentiment analysis, Maximum Entropy (MaxEnt) and Naive Bayes (NB), from Mallet (http://mallet.cs.umass.edu/). We report the performance of both classifiers in accuracy and average F-measure using 10-fold cross validation. Also, note that we use unigram features to train both classifiers in our experiments.
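Our experiments use the Mallet implementations; an equivalent set-up can be sketched with scikit-learn, taking LogisticRegression as a MaxEnt classifier and MultinomialNB as NB. The toy tweets and labels below are placeholders, and the folds are reduced to 2 only so that the snippet runs on four examples (the experiments use 10-fold cross validation).

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    tweets = ["love this phone", "worst service ever", "great game tonight", "so boring"]
    labels = [1, 0, 1, 0]                                  # 1 = positive, 0 = negative

    X = CountVectorizer().fit_transform(tweets)            # word unigram features

    for name, clf in [("MaxEnt", LogisticRegression(max_iter=1000)),
                      ("NB", MultinomialNB())]:
        acc = cross_val_score(clf, X, labels, cv=2, scoring="accuracy")
        f1 = cross_val_score(clf, X, labels, cv=2, scoring="f1_macro")
        print(name, acc.mean(), f1.mean())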

6.3 evaluation results

To study the effect of stopword removal on Twitter sentiment classification we apply the previously described stopword removal methods and assess how they affect sentiment polarity classification (positive / negative classification of tweets). We assess the impact of removing stopwords by observing fluctuations (increases and decreases) on three different aspects of the sentiment classification task: the classification performance, measured in terms of accuracy and F1-measure (F1), the size of the classifier’s feature space and the level of data sparsity. Our baseline for comparison is not removing stopwords. Figure 30 shows the baseline classification performance in accuracy (a) and F-measure (b) for the MaxEnt and NB classifiers across all the datasets. As we can see, when no stopwords are removed, the MaxEnt classifier always outperforms the NB classifier in accuracy and F1 on all datasets.


[Bar charts omitted: (a) Average Accuracy and (b) Average F-measure of MaxEnt and NB on OMD, HCR, STS-Gold, SemEval, WAB and GASP.]

Figure 30: The baseline classification performance in Accuracy and F1 of MaxEnt and NB classifiers across all datasets.

6.3.1 Classification Performance

The first aspect that we study is how removing stopwords affects the classification performance. Figure 31 shows the average performance across all datasets in accuracy (Figure 31:a) and F1 (Figure 31:b) obtained from the MaxEnt and NB classifiers when using the previously described stopword removal methods. A similar performance trend can be observed for both classifiers. For example, a significant loss in accuracy and in F1 is encountered when using the IDF stoplist, while the highest performance is always obtained when using the MI stoplist. It is also worth noting that using the classic stoplist gives lower performance than the baseline, with an average loss of 1.04% and 1.24% in accuracy and F-measure respectively. Removing singleton words (the TF1 stoplist) improves accuracy by 1.15% and F1 by 2.65% compared to the classic stoplist. However, we notice that the TF1 stoplist gives 1.41% and 1.39% lower accuracy and F1 than the MI stoplist respectively. Nonetheless, generating TF1 stoplists is a lower-cost process than generating MI stoplists, since no training from labelled data is required.

[Bar charts omitted: (a) Average Accuracy and (b) Average F-measure of MaxEnt and NB for the Baseline, Classic, TF1, TF-High, IDF, TBRS and MI stoplists.]

Figure 31: Average Accuracy and F-measure of MaxEnt and NB classifiers using different stoplists.

It can also be seen that removing the most frequent words (TF-High) hinders the average performance of both classifiers by 1.83% in accuracy and 2.12% in F-measure compared to the baseline. The TBRS stoplist seems to outperform the classic stoplist, but it only gives a performance similar to the baseline.

Finally, it seems that NB is more sensitive to removing stopwords than MaxEnt. NB faces more dramatic changes in accuracy than MaxEnt across the different stoplists. For example, compared with the baseline, the drop in accuracy in NB is 9.8% higher than in MaxEnt when using the IDF stoplist.

6.3.2 Feature Space

The second aspect we study is the average reduction rate on the classifier’s feature space caused by each of the studied stopword removal methods. Note that the size of the classifier’s feature space is equivalent to the vocabulary size for the purpose of this study. As shown in Figure 32, removing singleton words reduces the feature space substantially by 65.24%. MI comes next with a reduction rate of 19.34%. On the other hand, removing the most frequent words (TF-High) has no actual effect on the feature space (a marginal 0.82%). All other stoplists reduce the number of features by less than 12%.

[Bar chart omitted: reduction rate (%) on the feature space - Baseline 0.00, Classic 5.50, TF1 65.24, TF-High 0.82, IDF 11.22, TBRS 6.06, MI 19.34.]

Figure 32: Reduction rate on the feature space of the various stoplists.

Two-To-One Ratio: As we have observed, removing singleton words reduces the feature space by up to 65% on average. To understand what causes such a high reduction we analysed the number of singleton words in each dataset individually. As we can see in Figure 33, singleton words constitute two-thirds of the vocabulary of all datasets. In other words, the ratio of singleton words to non-singleton words is two to one for all datasets. This two-to-one ratio explains the large reduction rate in the feature space when removing singleton words.


Figure 33: The proportion of singleton words to non-singleton words in each dataset.

6.3.3 Data Sparsity

Dataset sparsity is an important factor that affects the overall performance of a typical machine learning classifier [119]. In our previous work [137], we showed that Twitter data are sparser than other types of data (e.g., movie review data) due to the large number of infrequent words present within tweets. Therefore, an important effect of a stoplist for Twitter sentiment analysis is to help reduce the sparsity degree of the data.

To calculate the sparsity degree of a given dataset, we first construct the term-tweet matrix G ∈ R^{m×n}, where m and n are the numbers of unique terms (i.e., the vocabulary size) and tweets in the dataset respectively. The value of an element e_{i,j} ∈ G can be either 0 (i.e., the term i does not occur in tweet j) or 1 (i.e., the term i occurs in tweet j). Given the sparse nature of tweet data, the matrix G will be mostly populated by zero elements. The sparsity degree of G corresponds to the ratio between the number of zero elements and the total number of elements [93], as follows:

S_d = \frac{\sum_{j=1}^{n} N_j}{n \times m} \qquad (23)

where N_j is the number of zero elements in column j (i.e., tweet j). Here S_d ∈ [0, 1], where high S_d values refer to a high sparsity degree.
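A direct implementation of Equation 23 is sketched below; it computes the ratio without materialising the matrix G, on a toy tokenised corpus (for our real Twitter datasets the value is around 0.997, as reported below).

    def sparsity_degree(tokenised_tweets):
        """Sparsity degree S_d of Equation 23: the fraction of zero entries in the
        term-tweet matrix G, computed without building G explicitly."""
        vocabulary = {term for tweet in tokenised_tweets for term in tweet}
        m, n = len(vocabulary), len(tokenised_tweets)
        non_zero = sum(len(set(tweet)) for tweet in tokenised_tweets)  # 1-entries of G
        return 1.0 - non_zero / (m * n)

    tweets = [["great", "match", "today"], ["awful", "referee"], ["great", "goal"]]
    print(round(sparsity_degree(tweets), 3))   # 0.611 on this toy corpus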


Figure 34: Stoplist impact on the sparsity degree of all datasets.

Figure 34 illustrates the impact of the various stopword removal methods on the sparsity degree across the six datasets. We notice that our Twitter datasets are indeed very sparse, with an average baseline sparsity degree of 0.997. Compared to the baseline, using the TF1 method lowers the sparsity degree on all datasets by 0.37% on average. On the other hand, the effect of the TBRS stoplists is barely noticeable (less than 0.01% of reduction). It is also worth highlighting that all other stopword removal methods, including the classic, TF-High, IDF and MI stoplists, increase the sparsity degree to different extents.

From the above we notice that the reduction in data sparsity caused by the TF1 method is moderate, even though its reduction rate on the feature space is 65.24%, as shown in the previous section. This is because removing singleton words reduces the number of zero elements in G as well as the total number of elements (i.e., m × n) at very similar rates of 66.47% and 66.38% respectively, as shown in Table 29. Therefore, the ratio in Equation 23 improves only marginally, producing a small decrement in the sparsity degree.

6.3.4 The Ideal Stoplist

The ideal stopword removal method is one that helps maintain a high classification performance, shrinks the classifier’s feature space and effectively reduces the data sparseness. Moreover, since Twitter operates in a streaming fashion (i.e., millions of tweets are generated, sent and discarded instantly), the ideal stoplist method is required to have low runtime and storage complexity and to cope with the continuous shift in the sentiment class distribution in tweets. Lastly and most importantly, the human supervision factor (e.g., threshold setup, data annotation, manual validation, etc.) in the method’s workflow should be minimal.

Method Reduction(Zero-Elm) Reduction(All-Elm)

Classic 3.398 3.452

TF1 66.469 66.382

TF-High 0.280 0.334

IDF 0.116 0.117

TBRS 3.427 3.424

MI 27.810 27.825

Table 29: Average reduction rate on zero elements (Zero-Elm) and all elements (All-Elm) of the six stoplist methods.

Table 30 sums up the evaluation results reported in the three previous subsections. In particular, it lists the average performance of the evaluated stoplist methods in terms of: (i) the sentiment classification accuracy and F1-measure, (ii) the reduction of the feature space and the data sparseness, and (iii) the type of human supervision required. According to these results, the MI and the TF1 methods show very competitive performance compared to the other methods; the MI method comes first in accuracy and F1, while the TF1 method outperforms all other methods in terms of feature space and data sparseness reduction.

Recalling Twitter’s special streaming nature and looking at the human supervision factor, the TF1 method seems a simpler and more effective choice than the MI method. First, the notion behind TF1 is rather simple - “stopwords are those which occur once in tweets” - and hence the computational complexity of generating TF1 stoplists is generally low. Secondly, the TF1 method is fully unsupervised, while the MI method needs human supervision, including: (i) deciding on the size of the generated stoplists, which is usually done empirically (see Section 6.2.2.4), and (ii) manually annotating tweet messages with their sentiment class labels in order to calculate the informativeness values of terms, as described in Equation 22.

Hence, in practical sentiment analysis applications on Twitter, where a large number of tweets is required to be classified at high speed, a marginal loss in classification performance could be traded for simplicity, efficiency and low computational complexity.

Stoplist Accuracy F1 R-Rate Data Sparsity Human Supervision

Classic 83.09 78.46 5.50 0.08 Unsupervised

TF1 84.50 80.85 65.24 -0.37 Unsupervised

TF-High 82.31 77.58 0.82 0.1 Threshold Setup

IDF 67.85 57.07 11.22 0.182 Threshold Setup

TBRS 84.02 79.60 6.06 -0.017 Threshold Setup

MI 85.91 82.23 19.34 0.037 Threshold Setup & Data Annotation

Table 30: Average accuracy, F1, reduction rate on feature space (R-Rate) and data sparsity of the six stoplist methods. Positive sparsity values refer to an increase in the sparsity degree while negative values refer to a decrease in the sparsity degree. Human Supervision refers to the type of supervision required by the stoplist method.

Therefore, the TF1 method is recommended for this type of application. On the other hand, in applications where the data corpus consists of a small number of tweets in a specific domain or topic, any gain in the classification performance may be favoured over the other factors.

6.4 discussion

We evaluated the effectiveness of using several off-the-shelf stopword removal methods for polarity classification of tweet data. Our evaluation involved 6 Twitter datasets and investigated the impact the stopword removal methods have on the level of data sparsity of these datasets, the size of the classifier’s feature space, and the sentiment classification performance.

One factor that may affect our results is the choice of the sentiment classifier and the features used for classifier training. In this study we used MaxEnt and NB classifiers trained from word unigrams. Therefore, the use of other machine learning classifiers (e.g., support vector machines and regression classifiers), as well as new sets of features (e.g., word n-grams, part-of-speech tags, microblogging features), creates room for further investigation.

Our study revealed that the ratio of singleton words to non-singleton words is two to one for all six datasets. As described earlier, this ratio explains the high reduction rate on the feature space when removing singleton words. However, further analysis on whether the two-to-one ratio can be generalised to any Twitter dataset is yet to be conducted. Such analysis can be done by randomly sampling a large number of tweets over different periods of time and studying the consistency of our ratio across all data samples.

We showed the value of the TF1 method over other stopword removal methods. Nevertheless, this method, as well as the classic method and other frequency-based methods, ignores the context of words when determining their informative value in tweets. However, the informative value of words, in many cases, relies on their semantics and context; words with low informative value in some context or corpus may have discrimination power in a different context. For example, the word “like”, generally considered a stopword, has an important sentiment discrimination power in the sentence “I like you”. Thus, one may expect that methods which consider words’ semantics when generating stopword lists are likely to outperform those that rely only on frequency measures. Investigating the impact of semantics on stopword extraction and removal for sentiment analysis on Twitter therefore introduces a direction for future research.

6.5 summary

In this chapter we investigated the issue of removing stopwords in Twitter sentiment analysis, aiming to address the fourth question of this thesis: What effect does stopword removal have on sentiment classification performance? To this end, we studied how six different stopword removal methods affect sentiment polarity classification on Twitter.

Our results suggested that, despite its popular use in Twitter sentiment analysis, the use of general (classic) stoplists has a negative impact on the classification performance. We also observed that, although the MI stopword generation method obtains the best classification performance, it has a low impact on both the size of the feature space and the dataset sparsity degree.

A relevant conclusion of this study is that the TF1 stopword removal method is the one that obtains the best trade-off, reducing the feature space by nearly 65%, decreasing the data sparsity degree by up to 0.37%, and maintaining a high classification performance. In practical applications of Twitter sentiment analysis, removing singleton words is the simplest, yet most effective, practice, keeping a good trade-off between good performance and low processing time.

Finally, results showed that while the different stopword removal methods affect sentiment classifiers similarly, Naive Bayes classifiers are more sensitive to stopword removal than the Maximum Entropy ones.

Part IV

CONCLUSION

It is difficult to be rigorous about whether a machine really ‘knows’, ‘thinks’, etc., because we are hard put to define these things. We understand human mental processes only slightly better than a fish understands swimming.

John McCarthy

7

DISCUSSION AND FUTURE WORK

In this thesis we investigated the extraction and use of words’ semantics to enhance sentiment analysis performance on Twitter. The research methodology that we followed in our research investigation, as discussed in Chapter 1, consisted of the following three steps:

1. Semantic Extraction: In this step, we extracted the conceptual and contextual semantics of words from tweets. While we used existing tools to extract words’ conceptual semantics (Section 4.2.1), we proposed a new semantic representation model for capturing words’ contextual semantics (Section 3.2).

2. Semantic Incorporation: Here, we proposed methods and models for incorporating and using both the contextual and conceptual semantics of words in sentiment analysis (Section 3.3 and Section 4.2.2). We also proposed an approach for generating semantic patterns of words and using them to train sentiment classifiers (Section 5.3).

3. Assessment: In this step we tested the performance of our proposed methods in several sentiment analysis tasks on Twitter, using multiple evaluation datasets, and compared our models against several state-of-the-art sentiment analysis baselines (Sections 3.3.4, 4.2.4 and 5.5).

Results in most of the evaluation scenarios we experimented with demonstrated the effectiveness of using our proposed semantic approaches for sentiment analysis, in comparison with other non-semantic or traditional approaches. Nevertheless, semantic extraction and incorporation was not trivial; we witnessed several issues and challenges to our proposed approaches that may have had some impact on their performance. In this chapter we discuss the main challenges that our approaches faced under each of the above steps and present future work directions to tackle them.


7.1 discussion

7.1.1 Extracting Words’ Semantics

The first task of our work in this thesis was extracting and capturing the semantics of words from Twitter data. To this end, we followed two research directions. The first one assumes that the semantics of words can be derived from words’ co-occurrence patterns in tweets (aka contextual semantics). The other direction suggests that the meaning of words is directly associated with their explicit semantic concepts (aka conceptual semantics), which can be obtained from external knowledge sources, such as ontologies and semantic networks. We proposed several approaches for extracting both types of semantics from tweets. Each comes with several opportunities for improvement, as will be discussed subsequently.

7.1.1.1 Extracting Contextual Semantics

Motivated by the distributional hypothesis, that words that occur in the same context tend to have similar meanings, we proposed an approach called SentiCircles that captures the contextual (or distributional) semantics of a word from its co-occurrence patterns in a given context (Chapter 3).

One thing that may affect the performance of our approach is the granularity or level of the context in which the semantics are extracted. In our case, the SentiCircle approach defines the context as a Twitter corpus or a collection of tweets. This means that our approach aims to extract the semantics of a word at the corpus level (e.g., Trojan Horse refers to a computer virus in one corpus and to Greek mythology in another corpus), but not at the tweet or sentence level. However, there are cases where the semantics of a word changes from one tweet to another, even within the same Twitter corpus. These cases are more likely to happen when the corpus under analysis is too generic and reflects multiple topics. Extracting fine-grained contextual semantics of a word in such corpora requires capturing the context with SentiCircles at the tweet and even the sentence level. One solution for this is to apply the SentiCircle approach separately to each tweet where the word occurs. This means building a SentiCircle for each occurrence of the word in each tweet. This solution may allow (i) distinguishing between the different contexts the word has in different tweets, and (ii) calculating the semantics of the word with respect to each of these contexts independently.

However, building multiple SentiCircles for each word in the corpus may increase the runtime complexity of our approach. Moreover, some of these SentiCircles may not carry enough semantic information due to the lack of context in some tweets or due to their sparseness. Including these SentiCircles in the analysis may affect the performance of our approach. A fix to this issue could be to enrich words’ SentiCircles with features extracted from the syntactic, linguistic or semantic representation of the terms within them (e.g., POS tags, Twitter features, LDA topics, etc.). Exploring these solutions and their impact on our approach creates room for future work.

7.1.1.2 Extracting Conceptual Semantics

The second type of semantics we investigated in this thesis is conceptual semantics, i.e., the semantic concepts (e.g., Virus/Disease) of named entities (e.g., Ebola) found in tweets (Chapter 4). To extract this type of semantics we used several off-the-shelf semantic entity extractors, including Zemanta,1 OpenCalais2 and AlchemyAPI3 (see Section 4.2.1). Unlike SentiCircle, which functions at the corpus level, we ran these extractors on each tweet in our Twitter corpora individually, in order to extract the semantic concepts in each given tweet.

However, one thing that may impact the effectiveness of using this approach is the abstraction level of the semantic concepts retrieved. In many cases, these concepts were too abstract (e.g., Person) and were equally used for mentions of ordinary people as well as for famous musicians or politicians (see Section 4.2.3.2). For the tweet “i wish i could go to france and meet president Obama haha”, AlchemyAPI provided the concept Person to represent “president Obama”, whereas Zemanta identified him with the concept /government/politician, which is more specific. Increasing the specificity of such concepts, perhaps with the aid of DBpedia, or using multiple entity extractors, might lead to better semantic incorporation and consequently to increasing the sentiment analysis performance.

1 http://www.zemanta.com/
2 http://www.opencalais.com/
3 http://www.alchemyapi.com/

7.1.1.3 Extracting Stopwords

Another aspect of improving the performance of sentiment classifiers on Twitter is to find words that have a low informativeness value, also known as stopwords, and remove them from the analysis in order to reduce the noise of tweets and consequently improve sentiment analysis performance. In Chapter 6, we investigated the problem of extracting and removing stopwords from tweets and its impact on sentiment analysis performance. Results showed that stopword removal based on general stoplists tends to hinder the performance of Twitter sentiment classifiers (see Section 6.3.1). Therefore, in this thesis we chose not to remove stopwords from our Twitter datasets. However, our analysis in Chapter 6 showed that sentiment analysis may benefit from removing stopwords under specific conditions. For example, removing words with low mutual information (Section 6.2.2.4) or with low KL divergence values (Section 6.2.2.3) from tweets improved the classification performance by up to 2.8% over using general stopword lists. Investigating the impact of these stopword removal methods on the performance of our semantic approaches introduces a direction for future research.

7.1.2 Incorporating Words’ Semantics in Sentiment Analysis

In the second task of our work, we proposed several methods for using both the contextual and conceptual semantics of words with lexicon-based and supervised machine learning sentiment analysis approaches in order to enhance their performance (see Sections 3.3, 4.2.2 and 5.4), as will be discussed subsequently.

7.1.2.1 Incorporating Semantics into Lexicon-based Approaches

1. Incorporating Contextual Semantics: We proposed several lexicon-based methods, based on SentiCircle, for entity- and tweet-level sentiment analysis, as described in Section 3.3. Since SentiCircle functions at the corpus level, the sentiment assigned by the Median method (see Section 3.3.1) to an entity represents its aggregated sentiment score in the whole corpus. Hence, our method is not tuned to identifying the sentiment of an entity within a specific tweet in the corpus analysed. As discussed earlier in Section 7.1.1.1, a resolution for this issue is to consider applying the SentiCircle approach to the specific tweet where the entity occurs. This in turn may allow our proposed methods to capture the sentiment of entities at the tweet level rather than at the corpus level.

To calculate the overall sentiment of a tweet, our approach updates the sentiment orientation and strength of all words in the tweet (Section 3.3.2). However, it might be the case that the sentiment of some terms is stable across different tweets and does not need to be changed by a changing context. For example, both occurrences of the word “excellent” in the tweet “excellent overview: The State of #JavaScript in 2015..” and the tweet “good friends. excellent food - High End fashion show...” have positive sentiment regardless of the different contexts of this word in both tweets. A future work could be studying which types of terms change their sentiment and which ones are more stable. This might help enhance the runtime performance of our approach by filtering out stable terms from the analysis. One way in which we may be able to identify a stable term is, for example, by looking at its sense(s) in WordNet, since the term’s associated sentiment might be related to its sense in a certain text. For example, the word “excellent” has only one sense in WordNet, and as such, its associated sentiment should be stable. On the other hand, the word “great” has 7 senses (e.g., “large”, “outstanding”, “heavy”), which often results in different sentiments due to these different senses.

2. Incorporating Conceptual Semantics: To incorporate conceptual semantics into lexicon-based sentiment analysis, we investigated adding conceptual semantics to the SentiCircle representation and studied their impact on the overall sentiment detection performance (Section 4.3.1). In general, detection performance in F-measure increased over using the raw SentiCircle representation. However, a marginal loss in accuracy was observed in two out of the three datasets we experimented with. This might be due to the generality of some of the extracted concepts (e.g., “Person”, “Company”), which were applied to many terms of opposite sentiment. These concepts were treated as ordinary terms in tweets and had their own SentiCircles, which might have had a negative impact on sentiment extraction. A future direction might be to extract more specific concepts using other concept extraction methods, as described in Section 7.1.1.2.

7.1.2.2 Incorporating Semantics into Machine Learning Approaches

We incorporated both the contextual and conceptual semantics of words by using them as features to train different supervised classifiers. Incorporating conceptual semantics was realised by means of the semantic concepts of named entities in tweets (see Chapter 4). On the other hand, we incorporated the contextual semantics of words by using the patterns (semantic patterns) they belong to as training features (see Chapter 5).

The choice of the sentiment classifier is one factor that affects our analysis. In this thesis, we chose to incorporate our features into two machine learning classifiers: Naive Bayes and Maximum Entropy. Our choice was based on the efficiency, usability and popularity of these classifiers in previous work on sentiment analysis. However, machine learning classifiers are not limited to these two, as shown in Chapter 2. For example, sentiment classification based on support vector machines (SVM) has been shown to outperform both NB and MaxEnt classifiers in some previous works (Section 2.2.1.1). Testing our features with these classifiers creates room for future research.

Another factor that impacts our analysis is the incorporation method used. In Section 4.2.2 we proposed incorporating conceptual semantics into the NB classifier using three methods: replacement, augmentation and interpolation. The interpolation method proved more effective than all other methods. However, in Chapter 5 we used the augmentation method instead of the interpolation method for incorporating contextual semantic patterns. This is because the interpolation method is designed to work with Bayesian models, and in our experiments in Chapter 5 we used the MaxEnt classifier, since it was shown to outperform the NB classifier when trained from contextual semantic features.

In all incorporation methods, all extracted semantic features were added to the analysis. However, it is likely that each feature contributes differently towards predicting the correct sentiment of a tweet or a word. As such, identifying and filtering out features of low contribution (by using the information gain criterion, for example) might help shrink the feature space of the classifiers as well as improve the sentiment classification performance.
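The following sketch illustrates one possible realisation of such feature filtering, using the chi-squared statistic as a stand-in for the information gain criterion within a standard scikit-learn pipeline; the toy data, the value of k and the choice of chi-squared are illustrative assumptions, not the configuration used in our experiments.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

tweets = ["love this movie", "hate the ending", "great acting", "boring plot"]
labels = ["positive", "negative", "positive", "negative"]

model = Pipeline([
    ("unigrams", CountVectorizer()),      # word features (semantic features would be appended here)
    ("select", SelectKBest(chi2, k=4)),   # keep only the k most discriminative features
    ("nb", MultinomialNB()),              # Naive Bayes, as used in Chapter 4
])
model.fit(tweets, labels)
print(model.predict(["great plot"]))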

7.1.2.3 Words’ Semantics for Adapting Sentiment Lexicons

In Chapter 3 we investigated the use of contextual semantics to adapt general-purpose sentiment lexicons. To this end, we proposed a rule-based method, based on SentiCircle, to amend the prior sentiment of words in Thelwall-Lexicon with their contextual sentiment, as explained in Section 3.4. The evaluation of our adaptation method was done by comparing the performance of the adapted lexicon against the original one in tweet-level polarity classification (positive vs. negative), as explained in Section 3.4.1. However, our proposed method can also assign neutral sentiment to words. Therefore, a future direction might be to evaluate the adapted lexicon in subjectivity classification (subjective vs. neutral). We performed a quantitative analysis on the adapted lexicon and showed that 96% of the words in Thelwall-Lexicon were affected by our adaptation method, i.e., these words were assigned new sentiment orientations and/or new sentiment strengths (see Section 3.4.1, Table 9). A direction for future research is to conduct a qualitative analysis in order to determine what proportion of these words were correctly assigned contextual sentiment and what proportion were not.

7.1.3 Assessment and Results

In the experiments conducted throughout this thesis, we tested our proposed approaches on three popular sentiment analysis tasks on Twitter, used several Twitter datasets and sentiment lexicons, and compared against several state-of-the-art methods. Below we discuss the main issues we faced in our assessment.

Evaluation Datasets: Evaluation at the tweet level was done on multiple datasets of varying sizes and topical focus. Overall, results showed that our semantic approaches improved sentiment analysis performance over the other baseline methods we experimented with. However, the results in some parts were not as conclusive (see Sections 3.3.4 and 4.2.4). This is probably due to the characteristics of the dataset used, combined with the type of semantics extracted. For example, our evaluation in Chapter 3 suggested that extracting and using contextual semantics in sentiment analysis is more efficient and accurate when the dataset analysed has a specific topic focus. This is not surprising since our extraction approach, as mentioned earlier, looks at the general context (topic) that the dataset represents when extracting words' contextual semantics and sentiment. On the other hand, our conceptual semantic approach (Chapter 4) seems to perform better on datasets of larger size and general, diverse topic coverage. These observations are yet to be empirically confirmed by, for example, applying our approaches to more datasets and examining the consistency of their performance patterns against the datasets' properties.

The sentiment class distribution of the dataset also seems to impact the performance of our approaches, especially when extracting the sentiment of individual words and named entities. Specifically, results in Chapters 3, 4 and 5 showed that sentiment classification performance tends to be better for instances (i.e., tweets and entities) whose sentiment labels belong to the dominant sentiment class in the dataset analysed. This is not surprising given that the sentiment class distribution in all our datasets is imbalanced. However, we believe that the evaluation of our approaches on the datasets used in this thesis gives a good impression of the performance we would expect when using our approaches in real-life scenarios. This is because in a microblogging platform like Twitter, sentiment analysis approaches are likely to face chunks of tweets with skewed sentiment class distributions rather than balanced ones. Moreover, the baselines in all our experiments were also evaluated on the same imbalanced datasets used to evaluate our approaches. In most cases, our approaches were found to outperform these baselines, which also demonstrates the effectiveness of our approaches on such datasets. Room for future work is to evaluate the approaches proposed in this thesis on more balanced datasets in order to assess their performance.

Analysis of Neutral Sentiment: In Chapter 3, our analysis at the entity level focused on polarity (i.e., positive vs. negative) and subjectivity (polar vs. neutral) detection. The sentiment of an entity in our work is considered neutral if the entity occurs either with no polar words in tweets, or with approximately an equal number of positive and negative words. Although the notion of neutral sentiment is typically defined based on these two cases, a few applications and research works in the literature use the latter case as an indication of words/entities that reflect sentiment disagreement or controversy in tweets. Therefore, room for future work is to assess the performance of our approach on each case separately.

Baseline Methods: Cross-comparison in our evaluation was mostly done against non-semantic sentiment analysis baselines that are widely used on Twitter. Evaluation against semantic baselines was limited to those that rely on contextual semantics (e.g., LDA baselines). We did not compare against existing conceptual semantic baseline methods (e.g., Sentilo [126], Sentic Computing [25]) since these methods are not tailored to Twitter data (see Chapter 2), and therefore, evaluation results would have been biased toward our semantic approaches.

7.2 future work

Based on the discussion presented in the preceding section, we summarise our plan for future work by introducing the following research questions:

• Does using more fine-grained semantics enhance sentiment analysis performance further?

As mentioned before, our semantic extractors in this thesis work at a very generic level, i.e., contextual semantics were extracted at the corpus level, and conceptual semantics were too abstract (person, company, etc.). Hence, the first question to be addressed beyond the work of this thesis is whether extracting and incorporating more specific and fine-grained semantics into our sentiment analysis approaches leads to further improvements in their performance.

To answer this question, one would first improve the performance of the semantic extractors used in this thesis. For example, a future version of SentiCircle should allow contextual semantics to be extracted at the tweet and sentence levels. As explained in Section 7.1.1.1, this can be done by applying the SentiCircle approach to individual tweets rather than to the whole corpus. Also, enriching SentiCircle with words' syntactic or semantic features (e.g., POS tags, word synonyms, topic features, etc.), considering word order in tweets, and looking at social context (e.g., the demographic and location information of the tweet's poster, the number of the poster's followers/followees, etc.), may all help capture the specific semantics and sentiment of words in a given tweet or sentence.

Also, new tools for extracting specific semantic concepts should be used or even developed, rather than relying on third-party commercial tools that extract only generic and abstract concepts. Named-entity recognition (NER) on Twitter, based on topic modelling, such as [131], or on machine learning classifiers, such as [168], can be used to this end. As for named-entity linking (i.e., mapping entities to their proper semantic concepts in a given knowledge base), recent state-of-the-art approaches, such as YODIE [43],4 can be used for this task.
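For illustration only, the snippet below extracts named entities and their coarse concept labels with spaCy, one possible off-the-shelf NER tool; spaCy and its small English model are assumptions made for this example and are not among the extractors used in this thesis or the Twitter-specific tools cited above.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

tweet = "Steve Jobs unveiled the new iPhone at the Apple event in San Francisco"
for ent in nlp(tweet).ents:
    # e.g. "Steve Jobs -> PERSON", "Apple -> ORG", "San Francisco -> GPE"
    print(ent.text, "->", ent.label_)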

• How to use semantics in more fine-grained sentiment analysis subtasks and levels?

4 https://gate.ac.uk/applications/yodie.html

Sentiment analysis is not limited to polarity and subjectivity detection, nor does it function at the tweet and entity levels only; it also breaks down into more fine-grained subtasks and levels, such as detecting emotional states in text (e.g., “anger”, “joy”, “excitement”, etc.) and detecting the sentiment of aspects (i.e., the sentiment of products' aspects) (see Section 2.1). However, most existing approaches to these problems, as discussed in Chapter 2, are still non-semantic.

Room for future work is to investigate new techniques for extending our semantic approaches to emotion detection and aspect-level sentiment analysis tasks. For emotion detection, external emotion lexicons, such as [100], can be used to detect the type and intensity of emotions expressed via words in tweets. This information can be used in our approaches in different ways. For example, the prior sentiment of words in our SentiCircle representation (Section 3.2.2.2) can be replaced with the words' prior emotion scores. Also, for our supervised sentiment classification approaches (Section 4.2.2), emotion information can be used as features together with semantic concepts to train sentiment classifiers.
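A minimal sketch of this idea is shown below: it reads an EmoLex-style emotion lexicon [100] and turns the emotions evoked by a tweet's words into count features. The file path and the tab-separated word/emotion/flag format are assumptions about the distributed lexicon file, and the feature scheme is illustrative rather than the incorporation method used in this thesis.

from collections import defaultdict

def load_emotion_lexicon(path):
    lexicon = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, emotion, flag = line.rstrip("\n").split("\t")
            if flag == "1":
                lexicon[word].add(emotion)
    return lexicon

def emotion_features(tweet, lexicon):
    # Count how often each emotion is evoked by the words of a tweet.
    counts = defaultdict(int)
    for token in tweet.lower().split():
        for emotion in lexicon.get(token, ()):
            counts[emotion] += 1
    return dict(counts)

# lexicon = load_emotion_lexicon("NRC-Emotion-Lexicon-Wordlevel-v0.92.txt")  # placeholder path
# print(emotion_features("so happy and excited about the concert", lexicon))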

As for aspect-level sentiment detection, detecting the sentiment of an aspect in a tweet requires applying several text processing, semantic representation and reasoning techniques (e.g., representing the tweet using semantic networks or conceptual graphs) in order to determine the syntactic and semantic roles of the words in that tweet (e.g., who is the opinion holder? what is the targeted object or aspect? what is the sentiment expressed towards that aspect?).

• Do our semantic sentiment approaches generalise to other microblogging platforms?

In this thesis, we experimented with Twitter data as a case study of microblogging data. Our approaches should, in theory, be adaptable to data from other microblogging platforms and social networks (e.g., Facebook, Tumblr, etc.), since they all share similar characteristics (e.g., the heavy use of abbreviations, malformed words and emoticons, lack of sentence structure, etc.). However, proper experimental work is yet to be conducted in order to empirically check the applicability and performance of our approaches when using data from these platforms in comparison to Twitter.

A prerequisite for such an investigation is the availability of the gold-standard microblogging corpora required for evaluation. Currently, the number of such corpora in the literature is quite limited. This is because the sentiment analysis problem on non-Twitter microblogging platforms (e.g., Facebook) is far less popular than the problem of sentiment analysis on Twitter. Building and annotating such gold standards, although challenging, might therefore be required.

Posts generated on different microblogging platforms often have different character limits. For example, unlike the 140-character limit on Twitter, Facebook allows status updates of up to 63,206 characters.5 Long status updates may contain multiple paragraphs, sentences or phrases of potentially different sentiment orientations. Therefore, our approaches might need updating before being applied to such cases. One way to extract the overall sentiment of a long post is to divide it into a set of sentences, detect the sentiment of each sentence independently, and then average the sentiment of these sentences.
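A minimal sketch of this divide-and-average strategy is given below; the sentence_sentiment function is a placeholder for any sentence-level classifier and is not part of the approaches proposed in this thesis.

from nltk.tokenize import sent_tokenize  # requires a one-off nltk.download('punkt')

def sentence_sentiment(sentence):
    # Placeholder scorer; replace with any real sentence-level classifier.
    positives = {"good", "great", "excellent", "love"}
    negatives = {"bad", "terrible", "awful", "hate"}
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    return sum((t in positives) - (t in negatives) for t in tokens)

def post_sentiment(post):
    scores = [sentence_sentiment(s) for s in sent_tokenize(post)]
    return sum(scores) / len(scores) if scores else 0.0

print(post_sentiment("The food was great. The service, however, was terrible."))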

• Could our semantic approaches be used for stream-based sentiment analysis?

Our approaches were designed and tested to work in off-line mode, that is, by extracting sentiment from Twitter corpora of fixed size and a limited number of tweets. In the real world, however, Twitter works in a streaming fashion, where tweets are generated, posted and discarded in real time. This poses a challenge for our approaches, since they need to be constantly fed with new tweets to keep track of potential sentiment drifts in Twitter streams and provide up-to-date sentiment analysis. Hence, another potential future direction is to adapt our approaches to work in streaming mode. This requires certain alterations and optimisations in order to reduce their time complexity and improve their scalability. For our SentiCircle approach, the size of the Term-Index used for building SentiCircles (see Section 3.2.2) should be kept small. This can be done by keeping only trending terms in the stream and removing fading ones. Also, words' SentiCircles need to be updated incrementally with new terms rather than re-computed from scratch.
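The snippet below sketches one way of keeping such a Term-Index bounded: term weights decay with every new tweet and faded terms are pruned. The decay factor, pruning threshold and maximum index size are illustrative values, not parameters taken from our approach.

class StreamingTermIndex:
    def __init__(self, decay=0.95, min_weight=0.5, max_terms=10000):
        self.weights = {}
        self.decay = decay
        self.min_weight = min_weight
        self.max_terms = max_terms

    def update(self, tweet_tokens):
        # Decay all existing terms, then reinforce the ones seen in this tweet.
        self.weights = {t: w * self.decay for t, w in self.weights.items()}
        for token in tweet_tokens:
            self.weights[token] = self.weights.get(token, 0.0) + 1.0
        self._prune()

    def _prune(self):
        # Drop fading terms and cap the index at the top max_terms trending entries.
        self.weights = {t: w for t, w in self.weights.items() if w >= self.min_weight}
        if len(self.weights) > self.max_terms:
            top = sorted(self.weights.items(), key=lambda kv: kv[1], reverse=True)
            self.weights = dict(top[: self.max_terms])

index = StreamingTermIndex()
index.update(["ebola", "outbreak", "news"])
index.update(["ebola", "vaccine"])
print(sorted(index.weights.items(), key=lambda kv: -kv[1]))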

For our machine learning models (Chapters 4 and 5), classifier learning should be done incrementally, meaning that our classifiers must be constantly retrained as new data arrives. In order to lower the reliance on large sets of labelled tweets for classifier retraining, we can start with a small training dataset and then continuously augment it with newly labelled tweets. New labelled tweets, in turn, can be obtained in different ways, such as using association rules [3, 146] or distant supervision [60] (see Chapter 2).
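As an illustration, the sketch below trains a classifier incrementally with scikit-learn's partial_fit, using a hashing vectorizer so that the feature space stays fixed as new tweets arrive; the classifier choice, the toy batches and the hyperparameters are assumptions made for the example and differ from the NB and MaxEnt classifiers used in Chapters 4 and 5.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A hashing vectorizer keeps the feature space fixed, so no vocabulary re-fit is needed.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
classifier = SGDClassifier(random_state=0)
classes = ["negative", "positive"]

def train_on_batch(tweets, labels):
    classifier.partial_fit(vectorizer.transform(tweets), labels, classes=classes)

# Each call simulates a new chunk of labelled tweets arriving from the stream.
train_on_batch(["love the new update", "this app keeps crashing"], ["positive", "negative"])
train_on_batch(["great support team", "worst release ever"], ["positive", "negative"])

print(classifier.predict(vectorizer.transform(["the update is great"])))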

5 http://mashable.com/2011/11/30/facebook-status-63206-characters/


• How to improve evaluation of our approaches?

Several issues, as discussed in the previous section, may impact our evaluation, such as the choice of evaluation datasets, the type of sentiment analysis approach and the evaluation baselines. Although a rigorous evaluation was performed in this thesis, the assessment of our approaches could be improved by considering the following evaluation settings:

– use new datasets of large size and balanced sentiment class distribution.

– design and experiment with our semantics using hybrid or ensemble classifiers that use multiple learning algorithms for sentiment analysis.

– compare against more semantic sentiment analysis baselines.

8

CONCLUSION

The main goal of this thesis was to enhance sentiment analysis performance on microblogs. To achieve this goal we chose Twitter as a representative case study of microblogging platforms and investigated the role of words' semantics in the performance of multiple approaches to sentiment analysis on Twitter. To this end, we investigated the following four research questions:

RQ1 Could the contextual semantics of words enhance lexicon-based sentiment analysis performance?

RQ2 Could the conceptual semantics of words enhance sentiment analysis performance?

RQ3 Could semantic sentiment patterns boost sentiment analysis performance?

RQ4 What effect does stopword removal have on sentiment classification performance?

Our main motivation behind the research work in this thesis is that traditional approaches to sentiment analysis on Twitter are semantically weak, since they do not account for the semantics of words when calculating their sentiment. This usually limits sentiment analysis performance, given that the sentiment of words is often associated with their semantics within the context in which they occur, as described in Chapter 2. In the first part of this thesis (Chapters 3, 4, 5) we investigated different approaches for extracting and incorporating contextual and conceptual semantics into both machine learning and lexicon-based sentiment analysis approaches. We experimented with our approaches on three sentiment analysis tasks on Twitter: tweet-level sentiment analysis, entity-level sentiment analysis and context-aware sentiment lexicon adaptation. To this end, we used multiple Twitter datasets (see Appendix A) and sentiment lexicons, and compared the performance of our approaches against several state-of-the-art baselines. In the second part (Chapter 6) we turned our attention to studying the effectiveness of several stopword removal methods for sentiment analysis of tweets, and explored whether removing words of weak semantics or low informativeness value (stopwords) in a given context influences the performance of Twitter sentiment classifiers.


Our main conclusion in this thesis is that the semantics of words should be considered when calculating their sentiment or the sentiment of the tweets they belong to. Approaches that extract and use words' semantics for sentiment analysis surpass those that merely rely on affect words, or on syntactic or linguistic structures that unambiguously reflect sentiment in tweets. In this chapter we summarise our main conclusions, contributions and findings with regard to our work under each of the above four research questions.

8.1 contextual semantics for sentiment analysis of twitter

In Chapter 3, we explored extracting and using the contextual semantics of words in lexicon-based sentiment analysis on Twitter. We aimed to answer the first research question of this thesis:

RQ1 Could the contextual semantics of words enhance lexicon-based sentiment analysis performance?

To this end, we proposed SentiCircle, a semantic representation model that automatically captures the contextual semantics of a word from its co-occurrence patterns in a given Twitter corpus, and updates its sentiment orientation and strength accordingly. Based on SentiCircles, we built a lexicon-based approach to detect sentiment at both the entity and tweet levels. We evaluated our approach using 3 Twitter datasets and 3 sentiment lexicons, and compared its performance against three baseline methods on both tasks. Results showed that our approach at the entity level outperformed all other baselines by 30-40% for subjectivity detection, and by 2-15% for polarity detection, in average F-measure. For tweet-level sentiment detection, results were competitive but inconclusive when comparing to the state-of-the-art SentiStrength, and varied from one dataset to another. SentiCircle outperformed SentiStrength in accuracy by 4-5% in two datasets, and in F-measure by 1% in one dataset only. We also proposed a rule-based approach, based on SentiCircles, to amend the prior sentiment orientations and strengths of words in general-purpose sentiment lexicons with respect to the words' contextual sentiment. Evaluation was done on a single sentiment lexicon using three Twitter datasets. Results showed that the adapted lexicons, when used for polarity classification of tweets, improved accuracy by 1-2% and F-measure by 1-4% in two datasets over using the original lexicon. In the third dataset, our adapted lexicons gave similar accuracy, but 1.36% lower F-measure than the original lexicon. Our main conclusion from this line of work is that the contextual semantics of words, when extracted from tweets and incorporated into lexicon-based approaches, tend to improve sentiment analysis performance over traditional, non-semantic approaches on Twitter.

8.2 conceptual semantics for sentiment analysis of twitter

In Chapter 4 we investigated the second research question of this thesis:

RQ2 Could the conceptual semantics of words enhance sentiment analysis performance?

To this end, we proposed several methods for extracting and using the semantic concepts (e.g., person, company, city) that represent the entities (e.g., Steve Jobs, Vodafone, London) extracted from tweets, as features in both supervised and lexicon-based approaches, in order to increase their performance in tweet-level sentiment analysis.

For supervised sentiment classification, we used Naive Bayes as a case study of supervised classifiers and investigated three methods of incorporating our proposed semantic features into this classifier: by replacement, by augmentation, and by interpolation. We experimented with three Twitter datasets and compared the performance of the semantic features against two types of features extracted from the syntactic representation of words. We also compared against sentiment-topic features. Results showed that the semantic features outperformed, on average, both types of syntactic features in positive and negative classification of tweets by 5-6% and 1-4% in F-measure, respectively. Compared to sentiment-topic features, our semantic features improved performance on negative sentiment classification by around 1%, but gave 1.5% lower performance on positive sentiment classification.

For lexicon-based sentiment analysis, we enriched the SentiCircle representation with conceptual semantics using the augmentation method and performed sentiment analysis at the tweet level using the lexicon-based approach built on top of SentiCircle. Our average evaluation results across three Twitter datasets showed a performance improvement of 0.39% in F-measure, compared to using SentiCircle with no semantic enrichment.

Overall, our findings in this line of work demonstrate the high potential of incorporating conceptual semantics as features in Twitter sentiment analysis. Our experimental results suggest that conceptual semantic features outperform other types of features when the datasets being analysed are large and cover a wide range of topics.

8.3 semantic patterns for sentiment analysis of twitter

While in Chapters 3 and 4 we investigated using the contextual and conceptual semantics of individual words in sentiment analysis, in Chapter 5 we proposed an approach for extracting patterns of words with similar contextual semantics and sentiment in tweets (SS-Patterns) and using these patterns to improve sentiment analysis performance. Our aim behind this work was to address the third research question of this thesis:

RQ3 Could semantic sentiment patterns boost sentiment analysis performance?

We experimented with our approach on 9 Twitter datasets and validated the extracted patterns by using them as classification features in entity- and tweet-level sentiment analysis tasks. To this end, we trained several supervised classifiers from our patterns and compared their performance against models trained from 6 state-of-the-art syntactic and semantic types of features. Evaluation results showed that our patterns consistently outperformed all other sets of features in both entity- and tweet-level sentiment analysis tasks. At the tweet level, SS-Patterns improved the classification performance by 1.94% in accuracy and 2.19% in F-measure on average. Also, at the entity level, our patterns produced 2.88% and 2.64% higher accuracy and F-measure, respectively, than all other features. We also conducted a quantitative and qualitative analysis on a sample of our extracted patterns, and showed that our patterns are strongly consistent with the sentiment of the words within them. Our analysis also showed that our approach was able to find patterns of entities indicating sentiment controversy or disagreement in tweets. Our findings in Chapter 5 suggest that contextual semantic and sentiment similarities of words tend to exist in tweets, in the form of certain patterns. These patterns, when used as features to train sentiment classifiers, outperform other types of features extracted from the syntactic or semantic representations of words.

8.4 analysis on stopword removal methods for sentiment analysis of twitter

In Chapter 6 we investigated the fourth and last research question of this thesis:

RQ4 What effect does stopword removal have on sentiment classification performance?

Our goal in addressing the above question was to explore how words of low informativeness value or weak semantics (aka stopwords) in a given Twitter corpus impact sentiment analysis performance. To achieve this goal, we applied six stopword removal methods to tweets from six Twitter datasets and analysed how removing stopwords affects two supervised sentiment classification methods, Maximum Entropy (MaxEnt) and Naive Bayes (NB). We assessed the impact of removing stopwords by observing fluctuations in: (i) the level of data sparsity, (ii) the size of the classifier's feature space and (iii) the classifier's performance in terms of accuracy and F-measure. Our results indicated that removing stopwords based on general stopword lists lowered the sentiment classification performance by 1.04% in accuracy and 1.24% in F-measure. Nevertheless, our literature survey showed that this removal method was widely used in previous works (see Section 6.1). On the other hand, our results showed that identifying and removing stopwords based on supervised learning methods produced the highest sentiment classification performance, but gave only a moderate reduction in the feature space and had no impact on data sparsity. Lastly, our results showed that removing singleton words is the method that achieves the best trade-off, reducing the feature space and the degree of data sparsity substantially, while maintaining a high classification performance.

Part V

APPENDIX

A

EVALUATION DATASETS FOR TWITTER SENTIMENT ANALYSIS

We present the 9 Twitter datasets that we have used in different parts of this thesis. We chose these specific datasets for our analysis because they are: (i) publicly available to the research community, (ii) manually annotated, providing a reliable set of judgements over the tweets and, (iii) used to evaluate several sentiment analysis models. Tweets in these datasets have been annotated with different sentiment labels including: Negative, Neutral, Positive, Mixed, Other and Irrelevant. Table 31 displays the distribution of tweets in the nine selected datasets according to these sentiment labels. Variations among the evaluation datasets are due to the particularities of the different sentiment analysis tasks. Sentiment analysis on Twitter, as discussed in Chapter 2, spans multiple tasks, such as polarity detection (positive vs. negative), subjectivity detection (polar vs. neutral) or sentiment strength detection. These tasks can also be performed either at the tweet level or at the target (entity) level. In the following subsections, we provide an overview of the available evaluation datasets and the different sentiment tasks for which they are used.

Dataset      No. of Tweets   #Negative   #Neutral   #Positive   #Mixed   #Other   #Irrelevant
STS-Test     498             177         139        182         -        -        -
STS-Expand   1000            527         -          473         -        -        -
HCR          2,516           1,381       470        541         -        45       79
OMD          3,238           1,196       -          710         245      1,087    -
SS-Twitter   4,242           1,037       1,953      1,252       -        -        -
Sanders      5,513           654         2,503      570         -        -        1,786
GASP         12,771          5,235       6,268      1,050       -        218      -
WAB          13,340          2,580       3,707      2,915       -        420      3,718
SemEval      13,975          2,186       6,440      5,349       -        -        -

Table 31: Total number of tweets and the tweet sentiment distribution in all datasets.


Stanford Twitter Sentiment Test Set (STS-Test)

The Stanford Twitter sentiment corpus (http://help.sentiment140.com/), introduced by Go et al. [60], consists of two different sets, training and test. The training set contains 1.6 million tweets automatically labelled as positive or negative based on emoticons. For example, a tweet is labelled as positive if it contains :), :-), : ), :D, or =), and is labelled as negative if it contains :(, :-(, or : (. The test set (STS-Test), on the other hand, is manually annotated and contains 177 negative, 182 positive and 139 neutral tweets. These tweets were collected by searching the Twitter API with specific queries including names of products, companies and people. Although the STS-Test dataset is relatively small, it has been widely used in the literature for different evaluation tasks. For example, Go et al. [60], Saif et al. [135, 137], Speriosu et al. [152], and Bakliwal et al. [10] use it to evaluate their models for polarity classification (positive vs. negative). In addition to polarity classification, Marquez et al. [22] use this dataset for evaluating subjectivity classification (neutral vs. polar).
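The emoticon-based labelling of the training set can be sketched as follows; this is a simplified reading of the distant supervision heuristic of Go et al. [60] (for instance, tweets containing both positive and negative emoticons receive no special treatment here).

POSITIVE_EMOTICONS = [":)", ":-)", ": )", ":D", "=)"]
NEGATIVE_EMOTICONS = [":(", ":-(", ": ("]

def emoticon_label(tweet):
    if any(e in tweet for e in POSITIVE_EMOTICONS):
        return "positive"
    if any(e in tweet for e in NEGATIVE_EMOTICONS):
        return "negative"
    return None  # tweet carries no emoticon signal and is not used for training

print(emoticon_label("finally finished my thesis :)"))  # positive
print(emoticon_label("my flight got cancelled :("))      # negative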

Stanford Extended Dataset (STS-Expand)

We built and used this dataset in our evaluation in Chapter 4. It consists of training and test sets. The training set contains 60,000 tweets randomly selected from the original Stanford Twitter Sentiment corpus (STS) [60]. Half of the tweets in this dataset contain positive emoticons, such as :), :-), : ), :D, and =), and the other half contain negative emoticons such as :(, :-(, or : (. Due to the relatively small size of the STS-Test dataset, as shown in the previous section, we proposed extending it by adding 641 tweets randomly selected from the original STS corpus and annotated manually by 12 users (researchers in our lab), where each tweet was annotated by one user. The final STS-Expand test set, as shown in Table 31, consists of 527 negatively and 473 positively annotated tweets.

Health Care Reform (HCR)

The Health Care Reform (HCR) dataset was built by crawling tweets containing the hashtag “#hcr” (health care reform) in March 2010 [152]. A subset of this corpus was manually annotated by the authors with 5 labels (positive, negative, neutral, irrelevant, unsure (other)) and split into training (839 tweets), development (838 tweets) and test (839 tweets) sets. The authors also assigned sentiment labels to 8 different targets extracted from all three sets (Health Care Reform, Obama, Democrats, Republicans, Tea Party, Conservatives, Liberals, and Stupak). However, both the tweet and the targets within it were assigned the same sentiment label, as can be seen in the published version of this dataset (https://bitbucket.org/speriosu/updown). Altogether, the three subsets (training, development and test), as shown in Table 31, consist of 2,516 tweets including 1,381 negative, 470 neutral and 541 positive tweets. The HCR dataset has been used to evaluate polarity classification [152, 136], but can also be used to evaluate subjectivity classification since it identifies neutral tweets.

Obama-McCain Debate (OMD)

The Obama-McCain Debate (OMD) dataset was constructed from 3,238 tweets crawled during the first U.S. presidential TV debate in September 2008 [144]. Sentiment labels were acquired for these tweets using Amazon Mechanical Turk, where each tweet was rated by at least three annotators as either positive, negative, mixed, or other. The authors in [47] reported an inter-annotator agreement of 0.655, which indicates a relatively good agreement between annotators. The dataset is provided at https://bitbucket.org/speriosu/updown along with the annotators' votes on each tweet. We considered the sentiment label agreed on by more than half of the annotators as the final label of each tweet. This resulted in a set of 1,196 negative, 710 positive and 245 mixed tweets. The OMD dataset is a popular dataset, which has been used to evaluate various supervised learning methods [73, 152, 136], as well as unsupervised methods [72], for polarity classification of tweets. Tweet sentiments in this dataset were also used to characterise the Obama-McCain debate event in 2008 [47].
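The majority-vote rule we applied can be sketched as follows; the vote lists are illustrative.

from collections import Counter

def majority_label(votes):
    # A tweet keeps a label only if more than half of its annotators agree on it.
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None  # None = no majority, discard

print(majority_label(["negative", "negative", "positive"]))  # negative
print(majority_label(["mixed", "positive", "negative"]))     # None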

Sentiment Strength Twitter Dataset (SS-Tweet)

This dataset consists of 4,242 tweets manually labelled with their positive and negative sentiment strengths, i.e., a negative strength is a number between -1 (not negative) and -5 (extremely negative). Similarly, a positive strength is a number between 1 (not positive) and 5 (extremely positive). The dataset was constructed by [160] to evaluate SentiStrength (http://sentistrength.wlv.ac.uk/), a lexicon-based method for sentiment strength detection. In our work in Chapter 5, we re-annotated the tweets in this dataset with sentiment labels (negative, positive, neutral) rather than sentiment strengths, which allows this dataset to be used for subjectivity classification. To this end, we assigned a single sentiment label to each tweet based on the following two rules, inspired by the way SentiStrength works:1 (i) a tweet is considered neutral if the absolute value of the tweet's negative to positive strength ratio is equal to 1; (ii) a tweet is positive if its positive sentiment strength is 1.5 times higher than the negative one, and negative otherwise. The final dataset, as shown in Table 31, consists of 1,037 negative, 1,953 neutral and 1,252 positive tweets. The original dataset is publicly available at http://sentistrength.wlv.ac.uk/documentation/ along with 5 other datasets from different social media platforms including MySpace, Digg, the BBC forum, the Runners World forum, and YouTube.
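The two rules can be sketched as follows; treating the 1.5 threshold as inclusive is our own interpretation of rule (ii).

def strength_to_label(positive, negative):
    # positive is a strength in [1, 5]; negative is a strength in [-5, -1].
    if abs(negative) / positive == 1:      # rule (i): equal strengths -> neutral
        return "neutral"
    if positive >= 1.5 * abs(negative):    # rule (ii): clearly stronger positive
        return "positive"
    return "negative"

print(strength_to_label(3, -1))  # positive
print(strength_to_label(2, -2))  # neutral
print(strength_to_label(1, -4))  # negative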

Sanders Twitter Dataset

The Sanders dataset consists of 5,513 tweets on four different topics (Apple, Google, Microsoft, Twitter). Each tweet was manually labelled by one annotator as either positive, negative, neutral, or irrelevant with respect to the topic. The annotation process resulted in 654 negative, 2,503 neutral, 570 positive and 1,786 irrelevant tweets. The dataset has been used in [22, 91, 46] for polarity and subjectivity classification of tweets. The Sanders dataset is available at http://www.sananalytics.com/lab

The Dialogue Earth Twitter Corpus

The Dialogue Earth Twitter corpus consists of three subsets of tweets. The first two sets (WA, WB) contain 4,490 and 8,850 tweets, respectively, about the weather, while the third set (GASP) contains 12,771 tweets about gas prices. These datasets were constructed as a part of the Dialogue Earth Project2 (www.dialogueearth.org) and were hand-labelled by several annotators with five labels: positive, negative, neutral, not related and can't tell (other). In this thesis we merged the two sets about the weather into one dataset (WAB) for our analysis study in Chapter 5. This resulted in 13,340 tweets with 2,580 negative, 3,707 neutral, and 2,915 positive tweets. The GASP dataset, on the other hand, consists of 5,235 negative, 6,268 neutral and 1,050 positive tweets. The WAB and GASP datasets have been used to evaluate several machine learning classifiers (e.g., Naive Bayes, SVM, KNN) for polarity classification of tweets [5].

1 http://sentistrength.wlv.ac.uk/documentation/SentiStrengthJavaManual.doc
2 Dialogue Earth is a former program of the Institute on the Environment at the University of Minnesota.

SemEval-2013 Dataset (SemEval)

This dataset was constructed for the Twitter sentiment analysis task (Task 2) [104] in the Semantic Evaluation of Systems challenge (SemEval-2013).3 The original SemEval dataset consists of 20K tweets split into training, development and test sets. All the tweets were manually annotated by 5 Amazon Mechanical Turk workers with negative, positive and neutral labels. The turkers were also asked to annotate expressions within the tweets as subjective or objective. Using a list of the dataset's tweet ids provided by [104], we managed to retrieve 13,975 tweets with 2,186 negative, 6,440 neutral and 5,349 positive tweets. Participants in SemEval-2013 Task 2 used this dataset to evaluate their systems for expression-level subjectivity detection [101, 36], as well as tweet-level sentiment detection [96, 127].

3 http://www.cs.york.ac.uk/semeval-2013/task2/

B

ANNOTATION BOOKLET FOR THE STS-GOLD DATASET

We need to manually annotate 3000 tweets with their sentiment label (Negative, Positive, Neutral, Mixed) using the online annotation tool “Tweenator.com”. The task consists of two subtasks:

Task A. Tweet-Level Sentiment Annotation Given a tweet message, decide whether it has a positive, negative, neutral or mixed sentiment.

Task B. Entity-Level Sentiment Annotation Given a tweet message and a named entity, decide whether the entity receives a negative, positive or neutral sentiment. The named entities to annotate are highlighted in yellow within the tweets. Please note that:

• A Tweet could have a different sentiment from an entity within it. For example, the tweet “iPhone 5 is very nice phone, but I can’t upgrade :(” has a negative sentiment. However, the entity “iPhone 5” receives a positive sentiment.

• The “Mixed” label refers to a tweet that has mixed sentiment. For example, the tweet “Kobe is the best in the world not Lebron” has a mixed sentiment.

• Some tweets might have emoticons such as :), :-), :(, or :-(. Please give less attention to the emoticons and focus more on the content of the tweets. Emoticons can be very misleading indicators sometimes.

• Try to be objective with your judgement and feel free to take a break whenever you feel tired or bored.


BIBLIOGRAPHY

[1] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. Sentiment analysis of twitter data. In Proceedings of the Workshop on Languages in Social Media, pages 30–38, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.

[2] Eneko Agirre and Aitor Soroa. Personalizing pagerank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 33–41, Athens, Greece, 2009.

[3] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22(2):207–216, June 1993. ISSN 0163-5808. doi: 10.1145/170036.170072. URL http://doi.acm.org/10.1145/170036.170072.

[4] Fotis Aisopos, George Papadakis, and Theodora Varvarigou. Sentiment analysis of social media content using n-gram graphs. In Proceedings of the 3rd ACM SIGMM international workshop on Social media, pages 9–14, Scottsdale, Arizona, USA, 2011. ACM.

[5] Amir Asiaee T, Mariano Tepper, Arindam Banerjee, and Guillermo Sapiro. If you are happy and you know it... tweet. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1602–1606, Maui, Hawaii, USA, 2012. ACM.

[6] Anthony Aue and Michael Gamon. Customizing sentiment classifiers to new domains: A case study. In Proceedings of recent advances in natural language processing (RANLP), Borovets, Bulgaria, 2005.

[7] Hakan Ayral and Sırma Yavuz. An automated domain specific stop word generation method for natural language text classification. In Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on, pages 500–503, Istanbul, Turkey, 2011. IEEE.

[8] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204, Valletta, Malta, 2010.

[9] Younggue Bae and Hongchul Lee. Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology, 63(12):2521–2535, 2012.

[10] Akshat Bakliwal, Piyush Arora, Senthil Madhappan, Nikhil Kapre, Mukesh Singh, and Vasudeva Varma. Mining sentiments from tweets. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA ’12, pages 11–18, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=2392963.2392970.

[11] L. Barbosa and J. Feng. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of COLING, Beijing, China, 2010.

[12] Siddharth Batra and Deepak Rao. Entity based sentiment analysis on twitter. Science, 9(4):1–12, 2010.

[13] Adam Bermingham and Alan F Smeaton. Classifying sentiment in microblogs: is brevity an advantage? In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1833–1836, Toronto, ON, Canada, 2010. ACM.

[14] S.I. Bhuiyan. Social media and its effectiveness in the political reform movement in egypt. Middle East Media Educator, 1(1):14–20, 2011.

[15] Celeste Biever. Twitter mood maps reveal emotional states of america. New Scientist, 207(2771):14, 2010.

[16] A. Bifet and E. Frank. Sentiment knowledge discovery in twitter streaming data. In Discovery Science, Canberra, Australia, 2010.

[17] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250, San Francisco, California, 2001. ACM.

[18] Christopher M Bishop et al. Pattern recognition and machine learning, volume 1. Springer New York, 2006.

[19] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.

[20] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

[21] Margaret M Bradley and Peter J Lang. Affective norms for english words (anew): Instruction manual and affective ratings. Technical report, Technical Report C-1, The Center for Research in Psychophysiology, University of Florida, 1999.

[22] Felipe Bravo-Marquez, Marcelo Mendoza, and Barbara Poblete. Combining strengths, emotions and polarities for boosting twitter sentiment analysis. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, page 2, Chicago, Illinois, USA, 2013. ACM.

[23] Samuel Brody and Nicholas Diakopoulos. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 562–570, Edinburgh, United Kingdom, 2011. Association for Computational Linguistics.

[24] Erik Cambria. An introduction to concept-level sentiment analysis. In Advances in Soft Computing and Its Applications, pages 478–483. Springer, 2013.

[25] Erik Cambria and Amir Hussain. Sentic computing: Techniques, tools, and applications, volume 2. Springer, 2012.

[26] Erik Cambria and Bebo White. Jumping nlp curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2):48–57, 2014.

[27] Erik Cambria, Praphul Chandra, Avinash Sharma, and Amir Hussain. Do not feel the trolls. Proceedings of the 9th ISWC, 2010.

[28] Erik Cambria, Robert Speer, Catherine Havasi, and Amir Hussain. Senticnet: A publicly available semantic resource for opinion mining. In AAAI fall symposium: commonsense knowledge, volume 10, page 02, 2010.

[29] Erik Cambria, Tim Benson, Chris Eckl, and Amir Hussain. Sentic proms: Application of sentic computing to the development of a novel unified framework for measuring health-care quality. Expert Systems with Applications, 39(12):10533–10543, 2012.

[30] Erik Cambria, Marco Grassi, Amir Hussain, and Catherine Havasi. Sentic computing for social media marketing. Multimedia tools and applications, 59(2):557–577, 2012.

[31] Erik Cambria, Catherine Havasi, and Amir Hussain. Senticnet 2: A semantic and affective resource for opinion mining and sentiment analysis. In FLAIRS Conference, pages 202–207, Marco Island, Florida, USA, 2012.

[32] Erik Cambria, Andrew Livingstone, and Amir Hussain. The hourglass of emotions. In Cognitive Behavioural Systems, pages 144–157. Springer, 2012.

[33] Erik Cambria, Bjorn Schuller, Yunqing Xia, and Catherine Havasi. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, page 1, 2013.

[34] Erik Cambria, Yangqiu Song, Haixun Wang, and Newton Howard. Semantic multi-dimensional scaling for open-domain sentiment analysis. 2013.

[35] Jaime Guillermo Carbonell. Subjective understanding: Computer models of belief systems. Technical report, DTIC Document, 1979.

[36] Tawunrat Chalothorn and Jeremy Ellman. Tjp: Using twitter to analyze the polarity of contexts. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[37] David Chandler. Introduction to Modern Statistical Mechanics. Oxford University Press, 1987. ISBN-13: 9780195042771.

[38] Praphul Chandra, Erik Cambria, and Alvin Pradeep. Enriching social communication through semantics and sentics. Sentiment Analysis where AI meets Psychology (SAAIP), page 68, 2011.

[39] Yejin Choi and Claire Cardie. Learning with compositional semantics as structural inference for subsentential sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 793–801, Honolulu, Hawaii, 2008. Association for Computational Linguistics.

[40] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational linguistics, 16(1):22–29, 1990.

[41] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[42] D. A. Cruse. Meaning in Language: An Introduction to Semantics and Pragmatics. Oxford University Press, 2004.

[43] Danica Damljanovic and Kalina Bontcheva. Named entity disambiguation using linked data. In Proceedings of the 9th Extended Semantic Web Conference, Creet, Greece, 2012.

[44] Kushal Dave, Steve Lawrence, and David M Pennock. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web, pages 519–528, Budapest, Hungary, 2003. ACM.

[45] Dmitry Davidov, Oren Tsur, and Ari Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 241–249, Beijing, China, 2010. Association for Computational Linguistics.

[46] William Deitrick and Wei Hu. Mutually enhancing community detection and sentiment analysis on twitter networks. Journal of Data Analysis and Information Processing, 1:19–29, 2013.

[47] N.A. Diakopoulos and D.A. Shamma. Characterizing debate performance via aggregated twitter sentiment. In Proc. 28th Int. Conf. on Human factors in computing systems, Atlanta, Georgia, USA, 2010. ACM.

[48] Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 126–135, Philadelphia, PA, USA, 2006. ACM.

[49] Xiaowen Ding, Bing Liu, and Philip S Yu. A holistic lexicon-based approach to opinion mining. In Proceedings of the international conference on Web search and web data mining, pages 231–240, Palo Alto, California, USA, 2008. ACM.

[50] Peter Sheridan Dodds and Christopher M Danforth. Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies, 11(4):441–456, 2010.

[51] Paul Ekman. Universals and cultural differences in facial expressions of emotion. In Nebraska symposium on motivation. University of Nebraska Press, 1971.

[52] John R. Firth. A synopsis of linguistic theory. Studies in Linguistic Analysis, 1930–1955.

[53] George Forman. An extensive empirical study of feature selection metrics for text classification. The Journal of machine learning research, 3:1289–1305, 2003.

[54] Christopher Fox. Information retrieval data structures and algorithms. Lexical Analysis and Stoplists, pages 102–130, 1992.

[55] Aldo Gangemi, Valentina Presutti, and D Reforgiato Recupero. Frame-based detection of opinion holders and topics: A model and a tool. Computational Intelligence Magazine, IEEE, 9(1):20–30, 2014.

[56] Lisette Garcia-Moya, Henry Anaya-Sanchez, and Rafael Berlanga-Llavori. Retrieving product features and opinions from customer reviews. IEEE Intelligent Systems, 28(3):19–27, 2013. ISSN 1541-1672.

[57] Gizem Gezici, Rahim Dehkharghani, Berrin Yanikoglu, Dilek Tapucu, and Yucel Saygin. Su-sentilab: A classification system for sentiment analysis in twitter. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[58] George Giannakopoulos, Vangelis Karkaletsis, George Vouros, and Panagiotis Stamatopoulos. Summarization system evaluation revisited: N-gram graphs. ACM Transactions on Speech and Language Processing (TSLP), 5(3):5, 2008.

[59] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N.A. Smith. Part-of-speech tagging for twitter: Annotation, features, and experiments. Technical report, DTIC Document, 2010.

[60] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 2009.

[61] Balakrishnan Gokulakrishnan, Pavalanathan Priyanthan, Thiruchittampalam Ragavan, Nadarajah Prasath, and A. Shehan Perera. Opinion mining and sentiment analysis on a twitter data stream. In Advances in ICT for Emerging Regions (ICTer), 2012 International Conference on, pages 182–188, Colombo, Sri Lanka, 2012. IEEE.

[62] Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. Comparing and combining sentiment analysis methods. In Proceedings of the first ACM conference on Online social networks, pages 27–38, Boston, Massachusetts, USA, 2013. ACM.

[63] Pollyanna Gonçalves, Fabrício Benevenuto, and Meeyoung Cha. Panas-t: A pychometric scale for measuring sentiments on twitter. arXiv preprint arXiv:1308.1857, 2013.

[64] Viktor Hangya, Gábor Berend, and Richárd Farkas. Szte-nlp: Sentiment detection on twitter messages. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[65] Zellig S Harris. Distributional structure. Word, 1954.

[66] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.

[67] Vasileios Hatzivassiloglou and Kathleen R McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 174–181, Madrid, Spain, 1997. Association for Computational Linguistics.

[68] Catherine Havasi, Robert Speer, and Jason Alonso. Conceptnet 3: a flexible, multilingual semantic network for common sense knowledge. In Recent Advances in Natural Language Processing, pages 27–29, Borovets, Bulgaria, 2007.

[69] Huang He. Sentiment analysis of sina weibo based on semantic sentiment space model. In Management Science and Engineering (ICMSE), 2013 International Conference on, pages 206–211, Harbin, China, 2013. IEEE.

[70] Yulan He, Chenghua Lin, Wei Gao, and Kam-Fai Wong. Tracking sentiment and topic dynamics from social media. In Sixth International AAAI Conference on Weblogs and Social Media, Dublin, Ireland, 2012.

[71] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177, Seattle, WA, USA, 2004. ACM.

[72] Xia Hu, Jiliang Tang, Huiji Gao, and Huan Liu. Unsupervised sentiment analysis with emotional signals. In Proceedings of the 22nd international conference on World Wide Web, pages 607–618, Rio de Janeiro, Brazil, 2013. International World Wide Web Conferences Steering Committee.

[73] Xia Hu, Lei Tang, Jiliang Tang, and Huan Liu. Exploiting social relations for sentiment analysis in microblogging. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 537–546, Rome, Italy, 2013. ACM.

[74] Yuheng Hu, Fei Wang, and Subbarao Kambhampati. Listening to the crowd: automated analysis of events via aggregated twitter sentiment. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages 2640–2646, Beijing, China, 2013. AAAI Press.

[75] M.M. Hussain and P.N. Howard. The role of digital media. Journal of Democracy, 22(3):35–48, 2011.

[76] Bernard J Jansen, Mimi Zhang, Kate Sobel, and Abdur Chowdury. Micro-blogging as online word of mouth branding. In CHI’09 Extended Abstracts on Human Factors in Computing Systems, pages 3859–3864, Boston, MA, USA, 2009. ACM.

[77] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 151–160, Portland, Oregon, 2011. Association for Computational Linguistics.

[78] Jaap Kamps, MJ Marx, Robert J Mokken, and Maarten De Rijke. Using wordnet to measure semantic orientations of adjectives. 2004.

[79] Nadin Kökciyan, Arda Celebi, Arzucan Ozgür, and Suzan Usküdarlı. Bounce: Sentiment classification in twitter using rich feature sets. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[80] Efstratios Kontopoulos, Christos Berberidis, Theologos Dergiades, and Nick Bassiliades. Ontology-based sentiment analysis of twitter posts. Expert Systems with Applications, 2013.

[81] Sotiris B Kotsiantis, ID Zaharakis, and PE Pintelas. Supervised machine learning: A review of classification techniques. 2007.

[82] E. Kouloumpis, T. Wilson, and J. Moore. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the ICWSM, Barcelona, Spain, 2011.

[83] Adam DI Kramer. An unobtrusive behavioral model of gross national happiness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 287–290. ACM, 2010.

[84] Adam DI Kramer. The spread of emotion via facebook. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 767–770, Austin, Texas, USA, 2012. ACM.

[85] K. Krippendorff. Content analysis: An introduction to its methodology, 1980.

[86] Thomas K Landauer and Susan T Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211, 1997.

[87] Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 375–384, Hong Kong, China, 2009. ACM.

[88] Chenghua Lin, Yulan He, Richard Everson, and Stefan Rüger. Weakly supervised joint sentiment-topic detection from text. IEEE Transactions on Knowledge and Data Engineering, 24(6):1134–1145, 2012.

[89] Bing Liu. Sentiment analysis and subjectivity. Handbook of natural language processing, 2:568, 2010.

[90] Bing Liu. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167, 2012.

[91] Kun-Lin Liu, Wu-Jun Li, and Minyi Guo. Emoticon smoothed language models for twitter sentiment analysis. In AAAI, Toronto, Ontario, Canada, 2012.

[92] Rachel Tsz-Wai Lo, Ben He, and Iadh Ounis. Automatically building a stopword list for an information retrieval system. In Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR), 2005.

[93] Masoud Makrehchi and Mohamed S Kamel. Automatic extraction of domain-specific stopwords from labeled documents. In Advances in information retrieval, pages 222–233. Springer, 2008.

[94] Nikolaos Malandrakis, Abe Kazemzadeh, Alexandros Potamianos, and Shrikanth Narayanan. Sail: A hybrid approach to sentiment analysis. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[95] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

[96] E Martínez-Cámara, A Montejo-Ráez, MT Martín-Valdivia, and LA Ureña-López. Sinai: Machine learning and emotion of the crowd for sentiment analysis in microblogs. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[97] Shotaro Matsumoto, Hiroya Takamura, and Manabu Okumura. Sentiment classification using word sub-sequences and dependency sub-trees. In Advances in Knowledge Discovery and Data Mining, pages 301–311. Springer, 2005.

[98] George A Miller. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[99] Glenn W Milligan and Martha C Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179, 1985.

[100] Saif M Mohammad and Peter D Turney. Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 2012.

[101] Saif M Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. Nrc-canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[102] Arturo Montejo-Ráez, Eugenio Martínez-Cámara, M Teresa Martín-Valdivia, and L Alfonso Ureña-López. A knowledge-based approach for polarity classification in twitter. Journal of the Association for Information Science and Technology, 65(2):414–425, 2014.

[103] A. Moschitti. Efficient convolution kernels for dependency and constituent syntactic trees. In Proceedings of the European Conference on Machine Learning, pages 318–329, Berlin, Germany, 2006.

[104] Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. Semeval-2013 task 2: Sentiment analysis in twitter. In Proceedings of the 7th International Workshop on Semantic Evaluation, Atlanta, Georgia, USA, 2013. Association for Computational Linguistics.

[105] Sapna Negi and Mike Rosner. Uom: Using explicit semantic analysis for classifying sentiments. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[106] Finn Årup Nielsen. A new anew: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903, 2011.

[107] John C Norcross, Edward Guadagnoli, and James O Prochaska. Factor structure of the profile of mood states (poms): two partial replications. Journal of Clinical Psychology, 40(5):1270–1277, 1984.

[108] Brendan O’Connor, Ramnath Balasubramanyan, Bryan R Routledge, and Noah A Smith. From tweets to polls: Linking text sentiment to public opinion time series. volume 11, pages 122–129, Washington, DC, US, 2010.

[109] Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A Smith. Improved part-of-speech tagging for online conversational text with word clusters. In HLT-NAACL, pages 380–390, 2013.

[110] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC 2010, Valletta, Malta, 2010.

[111] Georgios Paltoglou and Mike Thelwall. Twitter, myspace, digg: Unsupervised senti- ment analysis in social media. ACM Transactions on Intelligent Systems and Technology (TIST), 3(4):66, 2012.

[112] Georgios Paltoglou, Mike Thelwall, and Kevan Buckley. Online textual communications annotated with grades of emotion strength. In Proceedings of the 3rd International Workshop of Emotion: Corpora for research on Emotion and Affect, pages 25–31, Valletta, Malta, 2010.

[113] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271, Barcelona, Spain, 2004. Association for Computational Linguistics.

[114] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135, 2008.

[115] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86, Philadelphia, USA, 2002. Association for Computational Linguistics.

[116] Ravi Parikh and Matin Movassate. Sentiment analysis of user-generated twitter updates using various classification techniques. CS224N Final Report, 2009.

[117] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: Liwc 2001. Mahwah: Lawrence Erlbaum Associates, page 71, 2001.

[118] James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. Psychological aspects of natural language use: Our words, our selves. Annual review of psychology, 54(1):547–577, 2003.

[119] Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on World Wide Web, pages 91–100, Beijing, China, 2008. ACM.

[120] R. Plutchik. Emotion: Theory, Research, and Experience. Number v. 4. Acad. Press, 1989. URL http://books.google.co.uk/books?id=qK1ZewAACAAJ.

[121] Valentina Presutti, Francesco Draicchio, and Aldo Gangemi. Knowledge extraction based on discourse representation theory and linguistic frames. In Knowledge Engineering and Knowledge Management, pages 114–129. Springer, 2012.

[122] Thomas Proisl, Paul Greiner, Stefan Evert, and Besim Kabashi. Klue: Simple and robust methods for polarity classification. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[123] Matthew Purver and Stuart Battersby. Experimenting with distant supervision for emotion classification. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–491, Avignon, France, 2012. Association for Computational Linguistics.

[124] Tushar Rao and Saket Srivastava. Twitter sentiment analysis: How to hedge your bets in the stock markets. In State of the Art Applications of Social Network Analysis, pages 227–247. Springer, 2014.

[125] Jonathon Read. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, pages 43–48, Ann Arbor, Michigan, 2005. Association for Computational Linguistics.

[126] Diego Reforgiato Recupero, Valentina Presutti, Sergio Consoli, Aldo Gangemi, and Andrea Giovanni Nuzzolese. Sentilo: frame-based sentiment analysis. Cognitive Computation, pages 1–15, 2014.

[127] Robert Remus. Asvuniofleipzig: Sentiment analysis in twitter using data-driven machine learning techniques. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[128] K Revathy and B Sathiyabhama. A hybrid approach for supervised twitter sentiment classification. International Journal of Computer Science and Business Informatics, 7, 2013.

[129] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979. ISBN 0408709294.

[130] Ellen Riloff and Janyce Wiebe. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 conference on Empirical methods in natural language processing, Sapporo, Japan, 2003.

[131] Alan Ritter, Sam Clark, Oren Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, Edinburgh, United Kingdom, 2011. Association for Computational Linguistics.

[132] G. Rizzo and R. Troncy. Nerd: Evaluating named entity recognition tools in the web of data. In Workshop on Web Scale Knowledge Extraction (WEKEX’11), volume 21, Bonn, Germany, 2011.

[133] Carlos Rodríguez-Penagos, Jordi Atserias, Joan Codina-Filba, David García-Narbona, Jens Grivolla, Patrik Lambert, and Roser Saurí. Fbm: Combining lexicon-based ml and heuristics for social media polarities. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[134] Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1): 33–54, 2008.

[135] H. Saif, Y. He, and H. Alani. Semantic Smoothing for Twitter Sentiment Analysis. In Proceedings of the 10th International Semantic Web Conference (ISWC), Bonn, Germany, 2011.

[136] Hassan Saif, Yulan He, and Harith Alani. Semantic sentiment analysis of twitter. In Proceedings of the 11th international conference on The Semantic Web, Boston, MA, USA, 2012.

[137] Hassan Saif, Yulan He, and Harith Alani. Alleviating data sparsity for twitter sentiment analysis. In Proceedings, 2nd Workshop on Making Sense of Microposts (#MSM2012) in conjunction with WWW 2012, Lyon, France, 2012.

[138] Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani. Evaluation datasets for twitter sentiment analysis: a survey and a new dataset, the sts-gold. In Proceedings, 1st Workshop on Emotion and Sentiment in Social and Expressive Media (ESSEM) in conjunction with AI*IA Conference, Turin, Italy, 2013.

[139] Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani. Senticircles for contextual and conceptual semantic sentiment analysis of twitter. In Proceedings of the 11th Extended Semantic Web Conference (ESWC), Crete, Greece, 2014.

[140] Barry S Sapolsky, Daniel M Shafer, and Barbara K Kaye. Rating offensive words in three television program contexts. Mass Communication and Society, 14(1):45–70, 2010.

[141] Robert E Schapire and Yoram Singer. Boostexter: A boosting-based system for text categorization. Machine learning, 39(2-3):135–168, 2000.

[142] FW Scholz. Maximum likelihood estimation. Encyclopedia of statistical sciences, 1985.

[143] Yohei Seki, David Kirk Evans, Lun-Wei Ku, Hsin-Hsi Chen, Noriko Kando, and Chin-Yew Lin. Overview of opinion analysis pilot task at ntcir-6. In Proceedings of NTCIR-6 Workshop Meeting, pages 265–278, Tokyo, Japan, 2007.

[144] D.A. Shamma, L. Kennedy, and E.F. Churchill. Tweet the debates: understanding community annotation of uncollected sources. In Proceedings of the first SIGMM workshop on Social media, pages 3–10, Beijing, China, 2009. ACM.

[145] Catarina Silva and Bernardete Ribeiro. The importance of stop word removal on recall values in text categorization. In Proceedings of the International Joint Conference on Neural Networks, volume 3, pages 1661–1666, Portland, Oregon, USA, 2003. IEEE.

[146] Ismael Santana Silva, Janaína Gomide, Adriano Veloso, Wagner Meira Jr, and Renato Ferreira. Effective sentiment stream analysis with self-augmenting training and demand-driven projection. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 475–484, Beijing, China, 2011. ACM.

[147] Tomer Simon, Avishay Goldberg, Limor Aharonson-Daniel, Dmitry Leykin, and Bruria Adini. Twitter in the cross fire: the use of social media in the westgate mall terror attack in kenya. PloS one, 9(8):e104136, 2014.

[148] Mark P. Sinka and David W. Corne. Design and application of hybrid intelligent systems. pages 1015–1023, Amsterdam, The Netherlands, 2003. IOS Press. ISBN 1-58603-394-8. URL http://dl.acm.org/citation.cfm?id=998038.998149.

[149] Mark P Sinka and David W Corne. Towards modernised and web-specific stoplists for web document analysis. In Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, Beijing, China, 2003. IEEE.

[150] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. Short text conceptualization using a probabilistic knowledgebase. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence-Volume Volume Three, pages 2330–2336, Barcelona, Catalonia, Spain, 2011. AAAI Press.

[151] David Sontag and Dan Roy. Complexity of inference in latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1008–1016, Granada, Spain, 2011.

[152] M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the EMNLP First workshop on Unsupervised Learning in NLP, Edinburgh, Scotland, 2011.

[153] Philip J Stone, Dexter C Dunphy, and Marshall S Smith. The general inquirer: A computer approach to content analysis. 1966.

[154] Carlo Strapparava and Rada Mihalcea. Learning to identify emotions in text. In Proceedings of the 2008 ACM symposium on Applied computing, pages 1556–1560, Fortaleza, Ceara, Brazil, 2008. ACM.

[155] Carlo Strapparava and Alessandro Valitutti. Wordnet affect: an affective extension of wordnet. In LREC, volume 4, pages 1083–1086, Lisbon, Portugal, 2004.

[156] Jared Suttles and Nancy Ide. Distant supervision for emotion classification with discrete binary values. In Computational Linguistics and Intelligent Text Processing, pages 121–136. Springer, 2013.

[157] Hiroya Takamura, Takashi Inui, and Manabu Okumura. Extracting semantic orientations of words using spin model. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 133–140, Ann Arbor, Michigan, USA, 2005. Association for Computational Linguistics.

[158] Mike Thelwall and David Wilkinson. Public dialogs in social network sites: What is their purpose? Journal of the American Society for Information Science and Technology, 61(2):392–404, 2010.

[159] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, 2010.

[160] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology, 63(1):163–173, 2012.

[161] Tun Thura Thet, Jin-Cheon Na, Christopher SG Khoo, and Subbaraj Shakthikumar. Sentiment analysis of movie reviews on discussion boards using a linguistic approach. In Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion, Hong Kong, China, 2009.

[162] Ivan Titov and Ryan McDonald. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of ACL-08: HLT, Columbus, Ohio, USA, 2008. Association for Computational Linguistics.

[163] Cherie Courseault Trumbach and Dinah Payne. Identifying synonymous concepts in preparation for technology mining. Journal of Information Science, 33(6):660–677, 2007.

[164] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G Sandner, and Isabell M Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. volume 10, pages 178–185, Washington, DC, US, 2010.

[165] P. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, Pennsylvania, 2002.

[166] Peter Turney and Michael Littman. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21: 315–346, 2003.

[167] Peter D Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141–188, 2010.

[168] Marieke Van Erp, Giuseppe Rizzo, and Raphaël Troncy. Learning with the web: Spotting named entities on the intersection of nerd and machine learning. In The 3rd International Workshop on Making Sense of Microposts (#MSM’13), Concept Extraction Challenge, pages 27–30, Rio de Janeiro, Brazil, 2013. Citeseer.

[169] William N Venables and Brian D Ripley. Modern applied statistics with S. Springer, 2002.

[170] Hao Wang, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan. A system for real-time twitter sentiment analysis of 2012 us presidential election cycle. In Proceedings of the ACL 2012 System Demonstrations, pages 115–120, Jeju Island, Korea, 2012. Association for Computational Linguistics.

[171] Lan Wang and Yuan Wan. Sentiment classification of documents based on latent semantic analysis. In Advanced Research on Computer Education, Simulation and Modeling, pages 356–361. Springer, 2011.

[172] David Watson, Lee A Clark, and Auke Tellegen. Development and validation of brief measures of positive and negative affect: the panas scales. Journal of personality and social psychology, 54(6):1063, 1988.

[173] Casey Whitelaw, Navendu Garg, and Shlomo Argamon. Using appraisal groups for sentiment analysis. In Proceedings of the 14th ACM international conference on Information and knowledge management, pages 625–631, Bremen, Germany, 2005. ACM.

[174] Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 39(2-3):165–210, 2005.

[175] Tobias Günther and Lenz Furrer. Gu-mlt-lt: Sentiment analysis of short messages using linguistic features and stochastic gradient descent. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, 2013.

[176] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354, Vancouver, British Columbia, Canada, 2005. Association for Computational Linguistics.

[177] Ludwig Wittgenstein. Philosophical Investigations. Blackwell, London, UK, 1953, 2001.

[178] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 481–492. ACM, 2012.

[179] Tao Xu, Qinke Peng, and Yinzhao Cheng. Identifying the semantic orientation of terms using s-hal for sentiment analysis. Knowledge-Based Systems, 35:279–289, 2012.

[180] Yan Xu, Gareth JF Jones, JinTao Li, Bin Wang, and ChunMing Sun. A study on mutual information-based feature selection for text categorization. Journal of Computational Information Systems, 3(3):1007–1012, 2007.

[181] Jaewon Yang and Jure Leskovec. Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 177–186, Hong Kong, China, 2011. ACM.

[182] Yiming Yang. Noise reduction in a statistical approach to text categorization. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 256–263, Seattle, Washington, USA, 1995. ACM.

[183] Omar Zaidan, Jason Eisner, and Christine D Piatko. Using "annotator rationales" to improve machine learning for text categorization. In HLT-NAACL, pages 260–267, Rochester, NY, USA, 2007.

[184] Ley Zhang, Riddhiman Ghosh, Mohamed Dekhil, Meichun Hsu, and Bing Liu. Combining lexicon-based and learning-based methods for twitter sentiment analysis. HP Laboratories, Technical Report HPL-2011-89, 2011.

[185] Lumin Zhang, Yan Jia, Bin Zhou, and Yi Han. Microblogging sentiment analysis using emotional vector. In Second International Conference on Cloud and Green Computing (CGC), pages 430–433, Xiangtan, China, 2012. IEEE.

[186] Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. Jointly modeling aspects and opinions with a maxent-lda hybrid. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 56–65. Association for Computational Linguistics, 2010.

[187] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

[188] George Kingsley Zipf. Human behavior and the principle of least effort. 1949.