Faculty of Electrical Engineering, Mathematics and Computer Science Network Architectures and Services

Estimating Popularity by Sentiment and Polarization Classification on Social Media

K. Charalampidou (4040422)

Committee members:
Supervisor: Dr. C. Doerr
Mentor: N. Blenn
Member: Dr. Ir. F.A. Kuipers
Member: Dr. A.C.C. Lo

July 1, 2012
M.Sc. Thesis No: PVM 2012-75

Estimating Popularity by Sentiment and Polarization Classification on Social Media

Master’s Thesis in Computer Science

Network Architectures and Services Group Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

Kassandra Charalampidou

July 1, 2012

Author
Kassandra Charalampidou

Title Estimating popularity by sentiment and polarization classification on social media

MSc presentation June 28, 2012

Graduation Committee
Dr. C. Doerr, Delft University of Technology
N. Blenn, Delft University of Technology
Dr. Ir. F.A. Kuipers, Delft University of Technology
Dr. A.C.C. Lo, Delft University of Technology

Abstract

Mass processing of social media posts has attracted scientists' attention during the last decade. The massive growth of online social networks, like Twitter and Facebook, has created a need for determining people's opinions and moods through these means. This thesis constitutes research on measuring users' sentiment on a particular subject by analyzing their posts. An efficient sentiment measurement technique can be used to estimate the popularity of products or persons. For separating subjective from objective posts, a hybrid classifier based on the syntax analysis of texts is proposed, performing clearly better than existing classification tools. Moreover, a new sentiment evaluation technique for measuring the polarity and magnitude of posts' sentiment is described and tested over different social media. Results are compared to various real ratings and show that this approach can achieve promising accuracy in establishing the sentiment of online posts.

Contents

Preface 7

1 Introduction 9

2 Related work 13
2.1 Sentiment polarization estimation ...... 13
2.2 Sentiment magnitude establishment ...... 15
2.3 Uses of social media sentiment ...... 16
2.4 Outline ...... 17

3 Tracing subjectivity in social media posts 19
3.1 Twitter ...... 20
3.2 Natural Language Processing ...... 21
3.2.1 Stanford NLP ...... 22
3.2.2 Stanford NLP parser ...... 22
3.2.3 Stanford NLP POS Tagger ...... 23
3.3 Classification Methodology ...... 26
3.3.1 Limitations ...... 28
3.4 Sentiment classification performance ...... 30
3.4.1 State-of-the-art classifiers ...... 30
3.4.2 Results and Evaluation ...... 31

4 Adjectives sentiment establishment 35
4.1 Methodology ...... 35
4.2 Corpus ...... 37
4.3 Adjectives selection ...... 38
4.4 Ground truth ...... 39
4.5 Adjectives sentiment estimation from social media posts ...... 40


5 Measuring movies popularity 45
5.1 Methodology ...... 45
5.2 Movies sentiment estimation ...... 49
5.2.1 Evaluation of IMDb critics ...... 49
5.2.2 Evaluation of critics ...... 50
5.2.3 Evaluation of Twitter posts ...... 52
5.2.4 Correlation with Box office ...... 55
5.3 Results analysis and Evaluation ...... 56

6 Conclusion and Future work 59
6.1 Summary ...... 59
6.2 Future Work ...... 60
6.2.1 Efficient tweets retrieval ...... 60
6.2.2 Tweets content classification ...... 60
6.2.3 Tweets sentiment estimation ...... 61

Bibliography i

Preface

This thesis is the final result of a graduation project and completes the master's degree programme in Computer Science of the Faculty of Electrical Engineering, Mathematics and Computer Science at Delft University of Technology. Even though my specialization is in Parallel and Distributed Systems, I was always interested in the area of social network analysis, and thus I immediately sought involvement in this field. In the Network Architectures and Services (NAS) Group, I found plenty of challenging projects, and a group of enthusiastic and helpful people to work with. I would like to especially thank Christian Doerr and Norbert Blenn for their valuable guidance and advice, and for their positivity, which kept my motivation high throughout this thesis.

Delft, The Netherlands July 1, 2012


Chapter 1

Introduction

Online social networks have enjoyed significant growth over the past years. Online communities such as Twitter and Facebook have been swamped with active users, and much attention has been given to analyzing the social behavior and opinions of those users. The wide-spread popularity of online social networks and the resulting availability of data has enabled the investigation of new research questions, such as the analysis and estimation of public opinion on various subjects. People's sentiment towards a particular matter, as expressed online, can be very useful in many cases, and its classification and estimation is becoming a crucial subject.

Cha, Haddadi et al. [14] have shown that the volume of discussion about products in weblogs (like Twitter) can be correlated with the product's financial performance. It is also known that social network users represent the aggregate voice of millions of potential consumers, especially for products designed for the target group of young technology users. This reveals a brand new aspect that companies should pay close attention to, since this free and high-scale feedback gives them the opportunity to understand consumers' needs and take proper action.

Additionally, a lot of effort has been devoted to social media analysis during the latest years, regarding its power at predicting real-world outcomes. Some of the research already done has shown that the information gained from social networks can indeed be used to make quantitative predictions in some specific domains. Apart from that, aggregating the opinion of a collective population may be useful for gaining insights about its behavior, as well as for predicting future trends in technology, arts and other domains. This can also be helpful for companies when they design marketing and advertising campaigns.

A lot of studies have been done on implementing automatic techniques

that classify social media posts according to their sentiment. However, most of these approaches are based on machine learning algorithms (Naive Bayes, Maximum Entropy, SVM) using lists of common positive and negative words. To implement such an algorithm, a human must provide a list of training words or sentences labeled with their sentiment, in order to train the selected model. The algorithm is then able to predict the sentiment of any given text by assigning it to the category of either positive or negative meaning. This means that a whole sentence can be evaluated as positive or negative only because of the appearance of a specific word with strong sentiment. Without examining the relation of this word to others and determining the actual meaning of the phrase, this technique often leads to wrong classification. Additionally, a word's sentiment can change from text to text, depending on the text's subject, writing style and purpose, and therefore a word with positive meaning in one kind of text can carry negative sentiment in another, something which is not taken into account by such approaches. Apart from that, additional effort and time is needed for constructing lists of positive and negative words, as well as for training machine learning classifiers.

Even though there are already plenty of implementations for polarity determination of texts, very few of them introduce the concept of syntax analysis in their techniques. It is, therefore, very challenging not only to use natural language processing tools in a brand new approach to this matter, but also to establish particular syntax patterns that apply to texts on any topic. This way, it is possible to build a polarity classifier that is applicable to texts of any style, length and language complexity, as long as they maintain a correct syntax structure.
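As an illustration of the word-list approach criticized above, the following sketch classifies text purely by keyword counts; the word lists and example sentences are invented for illustration, not taken from the thesis. Note how ignoring syntax lets a negated sentence be classified as positive:

```python
# Minimal word-list polarity classifier: counts occurrences of
# hand-picked positive and negative words (illustrative lists).
POSITIVE = {"good", "great", "nice", "amazing", "love"}
NEGATIVE = {"bad", "awful", "boring", "hate", "terrible"}

def word_list_polarity(text: str) -> str:
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(word_list_polarity("The movie was amazing!"))             # positive
print(word_list_polarity("The movie was not amazing at all."))  # also "positive": negation is ignored
```

The second call demonstrates exactly the failure mode described in the text: the word "amazing" alone decides the label, while the negation that reverses the meaning is invisible to the classifier.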
This thesis describes a hybrid classifier implementing the mentioned approach, which was presented at the Networking 2012 conference [12], and compares it to existing methods. It is based on the syntactic analysis of each given post, and seeks specific grammatical patterns that denote the existence of sentiment towards a subject. Therefore, this approach does not rely on the appearance of specific words separately, but rather on particular syntax structures that reveal the actual relation between these words. This way we obtain a clearer perception of the actual meaning of the given text, while the classifier can also be used on any content with no need to be trained on human-made lists, unlike other common approaches. This algorithm contributes a 40% gain in accuracy over the existing classifiers, reaching about 85% correctly classified posts.

Apart from determining whether a text is objective/subjective or positive/negative, it is also extremely useful to know how positive or how negative it is - in other words, what its sentiment magnitude is. Applying the hybrid classifier ensures that subjective posts are separated from objective ones, and therefore applying a sentiment estimation technique on them, instead of on the initial text collection, leads to more accurate results. Our approach to that aspect is based on adjectives derived by the hybrid classifier applied to all posts beforehand. These adjectives are initially assigned a sentiment magnitude metric according to the number of times they occur in positive and negative labeled training texts. Then we use those results to assign a sentiment magnitude value to whole texts on a specific matter, and accordingly we finally extract people's sentiment on this particular subject.
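The adjective-based magnitude idea can be sketched as follows; the scoring formula here (relative frequency in positive versus negative labeled training texts, yielding a score in [-1, 1]) is a simplified assumption for illustration, not necessarily the exact metric developed in chapter 4:

```python
from collections import Counter

def adjective_sentiment(pos_texts, neg_texts, adjectives):
    """Score each adjective by how often it occurs in positive
    versus negative labeled training texts (a simplified sketch)."""
    pos_counts = Counter(w for t in pos_texts for w in t.lower().split())
    neg_counts = Counter(w for t in neg_texts for w in t.lower().split())
    scores = {}
    for adj in adjectives:
        p, n = pos_counts[adj], neg_counts[adj]
        if p + n:  # skip adjectives never seen in the training texts
            scores[adj] = (p - n) / (p + n)
    return scores

scores = adjective_sentiment(
    ["great movie", "great acting", "dull but great ending"],  # positive-labeled
    ["dull plot", "dull and slow"],                            # negative-labeled
    ["great", "dull"])
# "great": 3 positive vs 0 negative occurrences -> 1.0
# "dull":  1 positive vs 2 negative occurrences -> -1/3
```

The scores of the individual adjectives can then be aggregated over a whole text to obtain its overall sentiment magnitude, as the paragraph above describes.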


Chapter 2

Related work

2.1 Sentiment polarization estimation

Sentiment analysis, i.e. the extraction of an opinion's overall polarization and strength towards a particular subject matter, is a recent research direction, typically approached from a statistical or machine-learning angle. Plenty of techniques have been proposed during the latest years regarding this subject, and almost all of them were tested on social media posts. A number of published works either perform machine learning on a provided corpus of human-categorized positive and negative texts such as product reviews [39, 11, 36], or use a set of keywords with positive or negative connotations to classify input [16, 26] according to its sentiment. Pang et al. [26] and Cilibrasi and Vitanyi [16] present a variety of machine learning techniques (Naive Bayes classification, maximum entropy classification and support vector machines) in order to classify input texts as positive or negative. To implement these approaches, a human expert has to provide a list of positive and negative training words, according to which the trained classifier will be able to predict the sentiment of any input text. However, all these classifiers work in a target-independent way, which means that sentiment is decided for each input text regardless of its subject. That may lead to incorrect sentiment predictions in many cases, since no attention is given to the specificity of the target discussed. This denotes the need for topic-based classification, which is approached in this thesis by the hybrid classifier presented later. Apart from that, recent publications [21, 13] introduce the use of Natural Language Processing (NLP) modules, through which the syntactical analysis of any text can be performed. Having the syntax structure of a post makes it possible to extract the meaning of the processed text and eventually derive

sentiment out of it. However, most of the existing NLP-related approaches focus on classifying texts as positive, negative or neutral according to their syntax analysis. The proposed approach, on the other hand, works in two stages: first it makes a distinction between objective and subjective posts, regardless of whether they contain positive/negative words or syntax patterns, and then it performs a classification based on the sentiment polarity of the texts. This way, the case of attaching any kind of sentiment to objective, factual texts is avoided.

Wiebe and Riloff [31, 37] focus on subjective and objective sentence classifiers, with attention to the extraction of subjective patterns. A large collection of subjective sentences is established as a training set for an extraction pattern learning algorithm, which is then able to provide patterns of opinionated expressions. These patterns denote the grammatical structures most often used for the formulation of opinions, and the relations between these structures. Such expressions are used to identify more subjective sentences, while the absence of such a pattern in a text denotes its objective content. This concept is similar to the one proposed in this thesis for subjective/objective text classification; it does, however, not require the use of a pattern learning algorithm, and therefore no training needs to be done.

Barbosa and Feng [11] propose a 2-step sentiment analysis classification method, which first classifies messages as subjective or objective, and further distinguishes the subjective ones as positive or negative. For the first classification, it uses a machine learning approach based on the appearance of certain part-of-speech tags (adjectives, verbs etc.), punctuation (exclamation marks) and emoticons which denote the existence of subjectivity in texts.
For the positive/negative categorization, a number of polarity labels are used, derived from a number of training texts, similar to other existing machine-learning approaches. This study introduces a distinction between subjective/objective and sentiment classification of texts, which is also followed in this thesis. It however lacks high accuracy, due to the disadvantages of the machine learning approach in general, which were pointed out earlier. Yu and Hatzivassiloglou [39], besides common machine-learning approaches (SVM, Naive Bayes, maximum entropy), also propose a technique of measuring sentence similarity between given input data and texts of specific polarity. This approach explores the hypothesis that opinion sentences will be more similar to other opinion sentences than to factual ones. Using a sentence similarity tool, such as SimFinder [19], it is possible to measure the overall similarity of a sentence to both opinion and factual documents. Based on these results, a separation of subjective posts from the overall testing set can be achieved. Again, this method is target-independent and it is

efficient mostly when the training and testing sets have the same writing style, length, context, etc.

2.2 Sentiment magnitude establishment

Even though there is no existing research work on establishing a metric for the sentiment magnitude of posts derived from social media, similar to the work presented in this thesis, there are several techniques that focus on the semantic similarity between words. DISCO [23], a tool that retrieves the similarity between two words, is based on the detection of word co-occurrences in various corpora, such as Wikipedia and the British National Corpus. This tool traces pairs of co-occurring words within a window of three words, which gradually results in the establishment of a distributional similarity among words. DISCO offers the possibility of retrieving the semantically most similar words for an input word, as well as the value of the semantic similarity between two input words. Another common approach to sentiment similarity establishment ([27, 29, 20]) is based on measuring the path between two words in a semantic taxonomy derived from lexical networks of English words (such as WordNet [9]). The main idea of this technique is that the length of the shortest path between two input words, as computed by the WordNet Similarity tool [28], denotes their semantic similarity. This way, all words can be related to each other by a measure of how similar their meanings are. Similar sentiment similarity estimation methods were tried for the purposes of this work, in order to establish a sentiment magnitude measure. However, the results were not always successful, mainly because of the difficulty of determining a word (or words) with optimal positive or negative meaning to serve as a basis for comparisons with all other words. For this reason, the proposed algorithm is based on word occurrences in positive or negative texts, rather than on co-occurrences or semantic distances with other words.
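The shortest-path idea behind the WordNet-based measures can be illustrated on a toy hand-made taxonomy; the graph below and the 1 / (1 + path length) scoring are illustrative stand-ins for the actual WordNet Similarity tool, not its implementation:

```python
from collections import deque

def path_similarity(graph, a, b):
    """Semantic similarity as 1 / (1 + shortest path length) between
    two words in a small adjacency-list taxonomy (BFS over the graph)."""
    if a == b:
        return 1.0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return 1.0 / (2 + dist)  # path length is dist + 1 edges
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0  # no path: words are unrelated in this taxonomy

# Toy undirected taxonomy (hypothetical, for illustration only).
taxonomy = {
    "feeling": ["joy", "sadness"],
    "joy": ["feeling", "delight"],
    "sadness": ["feeling"],
    "delight": ["joy"],
}
print(path_similarity(taxonomy, "joy", "sadness"))      # 2 edges apart -> 1/3
print(path_similarity(taxonomy, "delight", "sadness"))  # 3 edges apart -> 0.25
```

The closer two words sit in the taxonomy, the higher their score, which is the intuition the path-based measures above rely on.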
This thesis presents an unsupervised, target-dependent approach for estimating movies' popularity, following a 2-step technique for evaluating social media posts. First, a classification into subjective and objective posts is performed with the use of NLP modules that extract the syntactical structure of texts and determine their actual meaning. The classified subjective posts are then used for the estimation of the overall sentiment of people towards particular subjects, without the need for any training or human checking, unlike common machine-learning techniques. This approach can be used for any

text type and for any discussed subject, while it also provides a popularity estimation of high accuracy. The overall work flow of the technique presented in this thesis is briefly described by figure 2.1. After a collection of social media posts on a subject is completed, the hybrid classifier, based on the posts' syntactical analysis, is used to distinguish subjective from objective posts. The subjective posts, which actually contain people's opinions, are evaluated by the proposed sentiment estimation technique, and this way a measurement of the popularity of entities related to this specific subject can be established.

Figure 2.1 – Work scheme of popularity estimation via social media data analysis
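The workflow of figure 2.1 can be summarized as a small pipeline; the two stand-in functions below are naive keyword heuristics that serve only as placeholders for the syntax-based hybrid classifier and the adjective-based sentiment estimator described in later chapters:

```python
def is_subjective(post: str) -> bool:
    # Placeholder for the hybrid classifier: a crude first-person
    # opinion-marker check (illustrative assumption, not the thesis's method).
    markers = ("i love", "i hate", "i liked", "i think")
    return any(m in post.lower() for m in markers)

def sentiment_score(post: str) -> float:
    # Placeholder for the sentiment magnitude estimator.
    return 1.0 if ("love" in post.lower() or "liked" in post.lower()) else -1.0

def estimate_popularity(posts):
    """Mean sentiment over the subjective posts only, as in figure 2.1."""
    subjective = [p for p in posts if is_subjective(p)]
    if not subjective:
        return None
    return sum(sentiment_score(p) for p in subjective) / len(subjective)

posts = ["I loved this film", "The film opens on Friday",
         "I hate this film", "I liked this film"]
print(estimate_popularity(posts))  # mean of +1, -1, +1 over the 3 subjective posts
```

The objective post ("The film opens on Friday") is filtered out before any sentiment is computed, which is the key point of the two-stage design.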

2.3 Uses of social media sentiment

Research work [26, 10, 25] focused on the film industry domain aims at the evaluation of data collected from different social media, like Twitter, blogs, etc., using one of the techniques described above. By observing and plotting how buzz and attention are created for different movies and how they change over time, one can witness whether this reflects the real success and popularity of each movie. Developing techniques in sentiment analysis makes it possible to quantify the aggregate level of positive and negative mention, while a close look at how different opinions are propagated and influence people's views can also be attempted. It is found that prices in the movie industry have a strong correlation with observed outcome frequencies, and are therefore considered a good indicator of future outcomes. Wong, Sen and Chiang [38] work specifically on the evaluation of critical

posts about movies, collected from Twitter, IMDb and Rotten Tomatoes. Using SVM machine learning, they identify positive, negative and neutral posts, and based on their numbers, an evaluation of the movies' popularity is established. This proportion of positive and negative posts is then compared to the IMDb rating and to the box office gross of each movie. In the current thesis, however, the tweet evaluation is based not only on the number of positive/negative posts, but also on their sentiment magnitude, which, as is shown later, results in a more accurate evaluation of the overall movies' popularity.

Another rapidly rising research area is predicting outcomes on political matters, especially before important events such as elections. Conover, Ratkiewicz, Francisco et al. present extensive research [17] on how social media shape the networked public sphere and facilitate communication between communities with different political orientations. Observing how information is propagated and discussed between network clusters that correspond to different political parties can be a good indicator of how people judge and respond to different political matters according to their beliefs, which is invaluable information for politicians and analysts.

Social media data analysis was also performed in [15] for tracing the cholera outbreak in Haiti after the earthquake of January 2010. The Haitian Ministry of Public Health (Ministère de la Santé Publique et de la Population, MSPP) has published data facilitating studies examining the evolution of the epidemic. Data collected through such health institutions and official reporting structures may not be available for weeks, hindering early epidemiological assessment, whereas data from informal media (such as Twitter) are typically available in near real-time and could provide earlier estimates of epidemic dynamics.
During this study, it was found that trends in the volume of informal sources were significantly correlated in time with the official case data, and were available up to two weeks earlier.

2.4 Outline

The rest of this thesis is structured as follows: In chapter 3, after a brief presentation of Twitter's functionality, a hybrid classifier of objective/subjective tweets is presented, with emphasis on the NLP tools used for its implementation. This classifier was tested on a long list of tweets and its estimations were compared against those of other relevant classifiers. The construction of a sentiment magnitude estimation tool can be divided into two distinct procedures: first, the sentiment establishment of adjectives, presented in

chapter 4, and second, the sentiment estimation of whole texts and finally of certain films, which is discussed in chapter 5.

Chapter 3

Tracing subjectivity in social media posts

Measuring the popularity of subjects discussed online and estimating public opinion on various matters can be useful in many ways. People's preferences are a major concern for companies' marketing departments, while users' feedback on products can be an important factor in making those products better. Measuring public opinion can also be crucial in predicting the outcome of elections and other events important to society.

However, merely counting the amount of discussion on a particular subject is not a safe approach for reaching conclusions on what people approve of or oppose. People's actual sentiment on a specific matter is expressed only through subjective posts, while objective ones contribute nothing to estimating their preferences. Besides, a lot of advertising posts can lead to incorrect results and should be excluded from this procedure. Consequently, there is a need to analyze social media posts at a deeper level and generate safe estimations of people's opinions on a particular subject.

Over the last decade, more and more people have used social networks (for example Facebook [2], Google+ [3] and Twitter [8]) to express their opinions, preferences, oppositions and thoughts on a variety of matters. Social networks can therefore be seen as a perfect source of data to be analyzed for estimating the popularity of products or persons. Twitter is one of the most widely used tools, as it focuses on simple text posting on any issue.


3.1 Twitter

Launched in July 2006, Twitter [8] rapidly gained worldwide popularity, with over 300 million users as of 2011, and became one of the highest-ranking social networking sites in January 2009, according to the Alexa rank [1]. Twitter, which enables its users to send and read text-based posts of up to 140 characters, known as "tweets", sees its usage rise especially during prominent events. As a social network, Twitter revolves around the principle of followers. When you choose to follow another Twitter user, that user's tweets appear in reverse chronological order on your main Twitter page. Personal updates, interesting links, music recommendations and various reviews are usually shared among followers. Users can group posts together by topic or type by using hashtags - words or phrases prefixed with a "#" sign. Similarly, the "@" sign followed by a username is used for mentioning or replying to other users. To repost a message from another Twitter user and share it with one's own followers, the retweet function is used, symbolized by "RT" in the message.

One can say that a person who follows the discussions made on Twitter can be fully informed of the hot subjects at any specific time. Twitter is now a place of flowing news, and users can be updated in real time on events in progress, such as sport events, elections, competitions etc. In the past, exceptionally high traffic of tweets has been noted during important sport games: for example, 2,940 tweets per second were posted during the match between Japan and Cameroon at the 2010 FIFA World Cup, while 3,085 tweets per second were posted after the Los Angeles Lakers' win in the 2010 NBA Finals. The extraordinary number of 7,196 tweets per second was reached during the FIFA Women's World Cup Final between Japan and the United States, two of the countries where Twitter is most used [24].
Apart from sport games, Twitter also sees unusually high traffic during celebrity events: a major example is Michael Jackson's death, which caused a rate of 100,000 tweets per hour. Twitter has also been used to help people cooperate and organize protests, riots and reactions of this kind, the so-called "Twitter Revolutions". In Moldova, a big demonstration was organized through Twitter [33] on 7 April 2009, claiming that the results of the 2009 Moldovan parliamentary election were fraudulent. The demonstration turned into a violent riot of 10,000 people in the town of Chisinau, which caused numerous reactions from the European Union and all adjacent countries. In Tunisia, an intensive campaign was supported through Twitter (and


Facebook) [30] against the government, which was held responsible for high unemployment, inflation and general corruption. This campaign led to an extended series of street demonstrations, while in January 2011 the Prime Minister resigned and free, democratic elections were announced within 60 days. Protests were also held almost all around the world by Iranian protesters against the disputed victory of the Iranian President elected in June 2009. These protests were organized and spread through Twitter [34], which helped the development of the "Iranian Green Movement". The above cases show the importance of online social networking in modern societies and give one more reason for the need for tweet analysis and automatic estimation of their sentiment.
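The tweet conventions described in this section (hashtags, mentions and the retweet marker) are simple enough to extract with regular expressions, as the following sketch shows; the function name and patterns are illustrative, not part of any Twitter API:

```python
import re

def tweet_features(tweet: str) -> dict:
    """Extract the Twitter conventions of section 3.1: hashtags
    (#topic), mentions (@user) and the retweet marker (RT)."""
    return {
        "hashtags": re.findall(r"#(\w+)", tweet),
        "mentions": re.findall(r"@(\w+)", tweet),
        "is_retweet": tweet.startswith("RT "),
    }

print(tweet_features("RT @alice: loved the match! #WorldCup #FIFA"))
# {'hashtags': ['WorldCup', 'FIFA'], 'mentions': ['alice'], 'is_retweet': True}
```

Stripping or separating these markers is a typical preprocessing step before tweets are handed to a syntax parser, since they are not part of the grammatical sentence.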

3.2 Natural Language Processing

Twitter data is frequently used by most existing approaches for measuring the popularity of movies and products. However, these approaches are mostly based on machine learning or statistical techniques using words that typically appear when expressing opinions. This means that the evaluation of a post is decided only by the appearance of a specific word with strong sentiment. Without examining the relation of this word to others and determining the actual meaning of the phrase, this technique often leads to wrong classification. For this reason, analyzing the syntactic structure of texts appears as a way of further exploiting their essence, and can therefore lead to more accurate classifications. The main idea of such an approach lies in the identification of structures that denote the existence of sentiment towards the discussed matter. If the author expresses some form of judgment, the input can be considered a subjective statement; otherwise the data is classified as an objective claim. Example cases distinguished by such a test are "The King's Speech was a really nice movie." versus "I watched The King's Speech", respectively. Once the existence of sentiment has been established, a classification step is typically performed to determine whether the speaker is expressing a positive or negative opinion on a particular subject matter. The analysis of the input posts requires the use of a natural language processing tool, which is able to determine the structure of the text and display the relations between words. Using the grammatical patterns found in a text, it is possible to trace any subjective writing included, and therefore to denote the existence of opinion within this text.


3.2.1 Stanford NLP

The Natural Language Processing (NLP) Group at Stanford University, comprising people from both the linguistics department and the computer science field, has worked intensively on algorithms that process and analyse human languages during the last years. Among other research areas covered by the NLP group, a complete software package for probabilistic sentence parsing and tagging has been implemented, and is used for the purposes of this classifier.

3.2.2 Stanford NLP parser

The Stanford NLP parser [22] is a natural language parser that performs grammatical analysis of documents, providing as output not only grammatical structure trees, but also dependencies between words. Using this kind of information we can get deeper into the meaning of sentences and determine the essence of what each text is presenting. The Stanford NLP parser uses a probabilistic algorithm that applies the language knowledge gained from hand-parsed sentences (the training set) and produces the most likely analysis of given sentences. The package is written in Java and implements a factored product model, with separate PCFG phrase structure (simple syntax trees) and lexical dependency models, whose preferences are combined using an A* algorithm. This efficient algorithm considers the potential combinations of syntax models for a given text and returns the one with the highest likelihood according to the training set. It has been calculated that this approach has a precision of 75.3% - 83.7%, and a recall of 70.2% - 82.1%. The parser outputs various representations of the grammatical analysis performed on a given piece of text. One of them is the grammatical tree, which represents the syntax being used, as well as the relation of each word to the others. An example for the tweet "I liked the movie" is given below:

[I, liked, the, movie] (ROOT (S (NP (PRP I)) (VP (VBD liked) (NP (DT the) (NN movie)))))


The above result shows that the sentence (S) can be divided into a noun phrase (NP) and a verb phrase (VP). The verb phrase can be further divided and contains a noun phrase which includes a determiner (DT) and a noun (NN). This information provides a complete syntactic decomposition of the sentence, and makes it possible to determine its structure. Another very important output, which is mostly used for the implementation of the described hybrid classifier, are the word dependencies, which consist of tuples of words (expressing a governor and dependent relation) along with an identifier of the type of their relationship. An example for the tweet "I liked the movie" is given in the following:

nsubj(liked-2, I-1)
det(movie-4, the-3)
dobj(liked-2, movie-4)

Here, the parser is reasoning that "I" is the nominal subject (nsubj) of "liked", and "movie" is the direct object (dobj) of the verb "liked". Additionally, "the" is a determiner (det) of the word "movie". Using the above tuples, we can usually determine which object is being discussed, what kind of verb is used for it and how it is being characterized. This can lead us to a safe conclusion about the meaning and the sentiment of the given sentence. Additionally, the NLP parser returns an extensive variety of modifiers (such as the adjectival modifier, adverbial modifier and prepositional modifier), which help us extract the essence of any descriptive grammatical components found.
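The dependency tuples shown above follow a simple textual format that can be parsed mechanically. The following sketch (an assumption about the plain-text output shown above, not part of the Stanford toolkit itself) turns such lines into (relation, governor, dependent) tuples, which is the representation a pattern-matching classifier would then inspect:

```python
import re

# Matches lines such as "nsubj(liked-2, I-1)": a relation name, then a
# governor word with its index, then a dependent word with its index.
DEP_RE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dependencies(parser_output: str):
    """Convert typed-dependency text into (relation, governor, dependent)
    tuples, dropping the word-position indices."""
    return [(rel, gov, dep)
            for rel, gov, _, dep, _ in DEP_RE.findall(parser_output)]

output = "nsubj(liked-2, I-1)\ndet(movie-4, the-3)\ndobj(liked-2, movie-4)"
print(parse_dependencies(output))
# [('nsubj', 'liked', 'I'), ('det', 'movie', 'the'), ('dobj', 'liked', 'movie')]
```

With the tuples in this form, a rule such as "a first-person nsubj attached to an opinion verb" becomes a simple membership test over the list.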

3.2.3 Stanford NLP POS Tagger

The Stanford Part-Of-Speech Tagger (POS Tagger) [35] is another tool that can be used to determine the role of each word in a text and trace the appearance of sentiment in it. It reads text in and assigns a part of speech to each word, such as verb, noun, adjective, etc. It deploys maximum entropy methods and returns the most likely POS tag for each word, according to the probabilities calculated from the given training set. It presents a maximum tagging accuracy of 97.24%, while it can also trace whether a noun is in singular or plural form. As an example, the output for the tweet "I liked the movie" is the following:

I/PRP liked/VBD the/DT movie/NN

The part-of-speech tagger tells us that "I" is a personal pronoun, "liked" a verb in past tense, "the" a determiner and "movie" a noun in singular. The above approach can be used for tracking down adjectives or verbs in a given text, since these kinds of words can be very useful for determining the polarity and sentiment of the text.
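Extracting these word classes from the tagger's slash-delimited output is straightforward. The sketch below is illustrative (the function name is our own); it assumes the Penn Treebank convention that adjective tags start with JJ and verb tags with VB:

```python
# Split slash-tagged output such as "I/PRP liked/VBD the/DT movie/NN" and pick
# out the word classes relevant for sentiment: adjectives (JJ, JJR, JJS) and
# verbs (VB, VBD, VBG, VBN, VBP, VBZ).
def sentiment_candidates(tagged_text):
    adjectives, verbs = [], []
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        if tag.startswith("JJ"):
            adjectives.append(word)
        elif tag.startswith("VB"):
            verbs.append(word)
    return adjectives, verbs

adjs, verbs = sentiment_candidates("I/PRP liked/VBD the/DT movie/NN")
# adjs == [], verbs == ['liked']
```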

Natural language processing in subjectivity tracing

Using the output of the tools discussed previously, it is possible to analyze the way a text is structured and derive some conclusions regarding its meaning. Aiming to distinguish subjective from objective texts, it is necessary to examine the common ways people usually express their opinions. Subjectivity signifies the existence of some kind of judgment on a particular subject, which reveals either the approval or the opposition of the author toward this subject. This is usually expressed by the use of words of strong sentiment, which are mainly adjectives and verbs. For example, giving a specific characterization to a subject, as in the phrase "This album is amazing", is a very common way to declare an opinion on this specific matter. Verbs can also play this role: for example, "I love this film" is a definite statement of approval, and constitutes a case of subjective writing. Therefore, we can generate two rough rules based on the most common patterns traced in subjective texts:

1. if an adjective is referring to the subject → subjective

2. if a verb expressing emotion is referring to the subject → subjective

This means that a text can be classified as subjective if it contains one of the above patterns. Otherwise it is classified as objective, since there is no visible indicator that it contains any kind of opinion. Tracing such patterns is easy with the use of a natural language processing tool, and is the main idea of the hybrid classifier presented in this thesis. In this study, we extracted data from the movie industry domain, firstly because a wide range of movies are discussed every day in all social media and secondly, because many viewers, after watching a film, are willing to give their opinion of it through the Internet. That gave us an excellent opportunity to collect a sufficient number of texts that not only express viewers' opinions, but also give some kind of 'goodness' levels regarding the discussed films. Twitter is the social medium mainly used in this work, since it provides an enormous number of short-length texts that are meant to present opinions in a clear and straightforward way. However, apart from Twitter posts, data from IMDb and Rotten Tomatoes are also used in the establishment and estimation of sentiment magnitude, as described in the following pages.


3.3 Classification Methodology

The main idea behind this hybrid classifier (yielding a categorization into objective and subjective texts) is that the actual meaning of a statement can be found if we derive its grammatical structure and examine its components one by one. This leads to safer results, since this method does not depend on any training corpus, does not need any human supervision and can be applied to any type of text (tweets, e-mails, wiki documents). As noted before, people usually express their feelings using statements of specific patterns. For example, verbs like 'love/like/hate' directly express an emotion and are therefore used in subjective sentences which include a person's opinion on a matter rather than some general information. Additionally, polarity is also very often expressed with the use of an adjective that is related to the topic of the sentence (or any other subject that we examine). For this approach we use the Stanford NLP parser that was discussed previously, and especially the generated word dependencies that denote the relations between words and their types. For example, for the sentence "Black Swan is a very beautiful film", the NLP parser generates the following dependencies:

nn(swan-2, Black-1)
nsubj(film-7, swan-2)
cop(film-7, is-3)
det(film-7, a-4)
advmod(beautiful-6, very-5)
amod(film-7, beautiful-6)
root(ROOT-0, film-7)

The penultimate dependency "amod(film-7, beautiful-6)" reveals that there is a connection between the words "film" and "beautiful" and that the latter word ("beautiful") plays the role of an adjectival modifier (amod) to the word "film". This fact is sufficient to conclude that the subject "film" is characterized by an adjective (or adjectival term) and therefore the phrase includes some kind of judgment. Consequently, having analyzed the syntactic structure of the text, we can safely claim that it can be classified as subjective towards the subject "film". Another common example has the following form: "I really loved Inception", which generates the following dependencies:

nsubj(loved-3, I-1)
advmod(loved-3, really-2)
root(ROOT-0, loved-3)
dobj(loved-3, Inception-4)

The last dependency, "dobj(loved-3, Inception-4)", denotes that "Inception" is the direct object (dobj) of the verb "loved". Being able to identify the verb "love" as one of strong sentiment, we can safely conclude that the subject "Inception" is directly linked to a verb of judgment. This leads to the claim that the above text expresses an opinion on the subject "Inception" and can therefore safely be classified as subjective. However, the subject of the text can be expressed with multiple words. For example, the above sentence can also be written in a pattern similar to "Inception: I really loved this movie", which will generate the relation:

dobj(loved-3, movie-4)

In order to be able to catch all possible expressions of this statement, a list of nouns that could refer to the subject in focus is given as input. In the case of films, such nouns can be: movie, film, picture, the movie title, etc. That way, not only can every form of such a statement be traced, but this list also enables the expansion of the tool to different domains and makes it a useful method for a variety of subjects. A similar list of verbs that can be used to show some sentiment towards the discussed subject is also given as input. Some examples of such verbs are: like, love, hate, prefer, etc. Adding both present and past forms of such verbs to the list is sufficient to trace the majority of subjective writings of this pattern. In order to trace the adjectives being used to characterize the subject, we use the Stanford NLP Tagger, in patterns where an adjectival modifier is linked to the discussed subject. That way, it is possible to avoid incorrect classifications caused by misleading syntactic structures; moreover, collecting all adjectives is helpful for the sentiment estimation of tweets, described in the next chapters. However, there are many cases of slang terms used instead of normal adjectival clauses, which the tagger fails to recognize. For this reason a list of additional adjectives that might be used is also given as input. As an example, the phrase "Inception, what a top film!" generates the relation:

amod(film-6, top-5)

The Stanford NLP Tagger classifies the word 'top' as a noun, even though it is very often used as an adjective, mostly in informal language. It has been observed that after including such potential terms, a much higher precision on subjective tweet classification can be reached, something totally expected if we consider the writing style of most of the tweets posted. Taking into consideration the above cases, one can easily end up with a list of syntax patterns that reflect all (or at least most of) the forms of subjective writing. Taking a closer look at such syntax patterns, we built a list of simple rules to detect whether a message is a subjective or objective statement, which can be described as follows:

1. an adjectival modifier, either traced by the Stanford NLP Tagger or included in the 'adjectives list', is related to a word included in the 'subjects list' → subjective

2. a verb included in the ’sentiment verbs list’ is related to a word included in the ’subjects list’ → subjective

The whole procedure of the hybrid classifier is simple: for each of the input tweets, grammatical analysis is performed using the tools described in a previous chapter (built by the Stanford NLP Group). As a result, a list of the word dependencies found is maintained. Then, having these syntax relations between words, we check whether any of the above rules applies to the specific tweet. In that case, we categorize the tweet as subjective (with respect to a particular subject), or as objective otherwise. The overall work flow of this technique is briefly described in Figure 3.1.
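The two rules can be sketched in a few lines of Python, assuming the dependencies have already been extracted as (relation, governor, dependent) triples. The word lists below are small illustrative samples of the 'subjects', 'sentiment verbs' and 'adjectives' lists described above, not the actual lists used in the thesis:

```python
# Illustrative samples of the three input lists.
SUBJECTS = {"movie", "film", "picture", "inception"}
SENTIMENT_VERBS = {"like", "liked", "love", "loved", "hate", "hated", "prefer"}
EXTRA_ADJECTIVES = {"top", "epic"}  # slang terms the POS tagger misses

def is_subjective(deps, tagged_adjectives):
    """Apply the two subjectivity rules to dependency triples."""
    adjectives = {a.lower() for a in tagged_adjectives} | EXTRA_ADJECTIVES
    for rel, governor, dependent in deps:
        gov, dep = governor.lower(), dependent.lower()
        # Rule 1: an adjectival modifier is attached to a subject word.
        if rel == "amod" and gov in SUBJECTS and dep in adjectives:
            return True
        # Rule 2: a sentiment verb takes a subject word as its direct object.
        if rel == "dobj" and gov in SENTIMENT_VERBS and dep in SUBJECTS:
            return True
    return False

deps = [("nsubj", "loved", "I"), ("dobj", "loved", "Inception")]
# is_subjective(deps, []) → True (rule 2 fires on "loved Inception")
```

A tweet for which neither rule fires falls through to the objective class, exactly as described above.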

3.3.1 Limitations

Figure 3.1 – Work scheme of adjectives sentiment establishment

Even though Twitter is flooded with simple and short messages that usually do not contain sophisticated syntax structures or complicated meanings, these messages often contain slang terms, internet writing style, acronyms or even internet jokes and commonly used web phrases. This may of course lead to a wrong syntax analysis of texts (since the NLP parser is trained on standard English writing), which may cause the hybrid classifier to miss a subjective pattern or even proceed to a wrong classification of a tweet. On the other hand, this factor can be significantly reduced if we use the adjectives list that was described before. Another problem that is often faced is the wrong spelling of words or wrong syntax of sentences. Even though this factor cannot be completely corrected by any algorithm (or even by any human), the hybrid classifier does not need entirely correct sentences, since it only examines a few key words from the entire tweet. For example, if a (correctly spelled) adjective can be associated with the particular subject that we examine, the tweet will be categorized as subjective even if it contains a small typo (which is pretty often the case in internet texts). As one can easily imagine, this kind of classifying algorithm cannot predict cases of irony (i.e. statements that mean the exact opposite of what they are saying). Such statements require prior knowledge of the subject in order to trace some kind of exaggeration that would imply that the text has an ironic meaning. This kind of research is out of the scope of this thesis, and is left as future work for the additional improvement of natural language classifiers.


3.4 Sentiment classification performance

To evaluate the performance of established sentiment classifiers and create a benchmark for our developed solution, we randomly sampled a set of 1,000 messages from the microblogging platform Twitter. This corpus was tested against existing classifiers and the proposed hybrid classifier, and their results were then compared. For our evaluation, we collected a data set of more than 1,000 randomly chosen tweets related to the five most popular films of the 83rd Academy Awards. We used the language detection library of Cybozu Labs [32] in order to eliminate the tweets written in any language other than English, while we also tried to remove advertising tweets from the set. Multiple retweets of the same text were also removed to prevent performance over- or underestimation, as well as unnecessary tokens like link URLs, "@" tags for mentioning a user, 'RT' tags, etc. Each tweet of this test set was classified by hand, before the beginning of the evaluation, as an objective or subjective statement.
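The pre-processing steps above (stripping retweet markers, URLs and mentions, then dropping duplicate texts) can be sketched as follows; the function names and regular expressions are our own illustrations:

```python
import re

# Illustrative pre-processing for the evaluation corpus.
def clean_tweet(text):
    text = re.sub(r"\bRT\b", "", text)        # retweet tag
    text = re.sub(r"https?://\S+", "", text)  # link URLs
    text = re.sub(r"@\w+:?", "", text)        # user mentions
    return " ".join(text.split())             # collapse whitespace

def deduplicate(tweets):
    """Clean all tweets and keep one copy of each distinct text."""
    seen, unique = set(), []
    for t in map(clean_tweet, tweets):
        if t and t not in seen:
            seen.add(t)
            unique.append(t)
    return unique

tweets = ["RT @bob: Loved Black Swan http://t.co/x", "Loved Black Swan"]
# deduplicate(tweets) → ['Loved Black Swan']; both collapse to the same text
```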

3.4.1 State-of-the-art classifiers

For our analysis, we focus on those approaches for which the original authors made a reference implementation available to us; specifically, we compare the classification accuracy with the following classifiers:

Twitter Sentiment We used the bulk classification service available on the Twitter Sentiment website [18] in order to classify our test set. This tool attaches a polarity value to each tweet: 0 for negative, 4 for positive and 2 for neutral; we therefore consider the first two to describe subjective tweets, while neutral stands for objective tweets. The main idea behind the Twitter Sentiment approach is the use of emoticons as noisy labels for the training data, which has been shown to increase the accuracy of different machine learning algorithms (Naive Bayes, Maximum Entropy, and SVM). It is noted that the web service of Twitter Sentiment uses a Maximum Entropy classifier.

Tweet Sentiments Our test set was also run through the API of TweetSentiments [7], a well-known tool for analysing Twitter data and providing sentiment analysis on tweets. TweetSentiments is based on Support Vector Machines (SVM) and uses the LIBSVM library developed at National Taiwan University. It classifies tweets as positive, negative or neutral, and these values are treated as stated previously.


LingPipe We also used the Sentiment Analysis tool of the LingPipe [5] package, which focuses on subjective/objective (as well as positive/negative) sentence categorisation, especially in the movie-review domain. This approach uses the usual machine learning algorithms (Naive Bayes, Maximum Entropy, SVM) and a Java API of the classifier is available online. Even though it comes with its own training set, we used half of our hand-classified set to train the classifier in order to get better results. The other half was used as the test set and the results were compared to the corresponding hand-classified tweets.
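Once each tool has labeled the test set, comparing it against the hand classification reduces to computing overall and per-class accuracy. A minimal sketch (the function name is our own):

```python
# Given gold (hand) labels and a tool's predictions, compute overall accuracy
# plus per-class accuracy, as used in the comparison of the classifiers.
def accuracies(gold, predicted):
    assert len(gold) == len(predicted)
    overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    per_class = {}
    for label in set(gold):
        pairs = [(g, p) for g, p in zip(gold, predicted) if g == label]
        per_class[label] = sum(g == p for g, p in pairs) / len(pairs)
    return overall, per_class

gold = ["subj", "subj", "obj", "obj"]
pred = ["subj", "obj", "obj", "obj"]
overall, per_class = accuracies(gold, pred)
# overall == 0.75, per_class == {'subj': 0.5, 'obj': 1.0}
```

Per-class accuracy is what exposes the asymmetry noted below, where some tools handle objective tweets far better than subjective ones.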

3.4.2 Results and Evaluation

Comparing the output against the previous human classification, the overall accuracy of the automatic classifiers in distinguishing subjective from objective statements was measured, as shown in figure 3.2(a). Figure 3.2(b) shows the overall performance in correctly and incorrectly classified statements. As can be seen in the figure, the classification accuracy of all statistical sentiment analyzers is between 55 and 60%, whereas the proposed statistical-grammatical hybrid approach yields a correct classification accuracy of about 85%, a 40% gain over previous work. Note also that the accuracy of the existing systems varies significantly with the type of input data: Twitter Sentiment [18], for example, is much stronger in identifying objective statements than subjective ones, while Tweet Sentiment [7] shows exactly the opposite behaviour. The proposed hybrid solution, on the other hand, does not show any significant difference. As noted before, the reason for these classification failures of the existing systems is mainly the fact that they evaluate words independently and not as parts of a phrase. That leads to an incorrect estimation of the tweet meaning and finally an incorrect classification of its sentiment. Some examples of the above claim follow:

’Now who will I see Black Swan with? I was sick when all my friends went.’

The above tweet was classified incorrectly as subjective by LingPipe and Tweet Sentiment, because the word 'sick' has a negative meaning and the tools therefore assumed that the whole tweet carries a negative sentiment. As a result, the above tweet was wrongly considered subjective, while on the other hand the hybrid classifier found no relation between the adjective 'sick' and the object 'Black Swan' and therefore classified the tweet


[Bar charts comparing Twitter Sentiment, Tweet Sentiment, LingPipe and the hybrid classification: (a) accuracy on objective and subjective tweets separately; (b) fractions of correctly and incorrectly classified tweets.]

Figure 3.2 – Classification accuracy of different sentiment analyzation meth- ods.

correctly as objective. There is a large number of such cases that mislead the existing classifying tools into incorrect estimations, whereas the syntax analysis performed by the discussed hybrid classifier limits or even eliminates these cases.

'What an awesome day with my lovely wife! We had lunch at Olive Garden & saw the matinee of king's speech'

The above tweet was also classified incorrectly as subjective by most of the classifiers used, because 'awesome' and 'lovely' are clearly positive words that give the whole text a positive sentiment. The tweet is therefore classified as subjective, even though, in fact, there is no connection between these words and the film mentioned in the text. This clearly shows that the syntax analysis performed by the proposed approach contributes a lot to a correct classification of tweets into objective and subjective, and provides a major explanation of the accuracy difference observed in the above measurements.


Chapter 4

Adjectives sentiment establishment

After having developed an efficient way to separate subjective from objective texts, it is now necessary to extract the sentiment polarity and its magnitude. In other words, after collecting the opinion of each user on a matter, expressed by a short post, there should be a way to determine whether this post is positive or negative, and additionally how positive or how negative it is. The sentiment polarity and magnitude towards a particular subject is usually expressed by the use of adjectives. This also holds for the field of film reviews, on which this analysis is focused. For example, people usually express their opinion concerning a film they watched by characterizing it as amazing, terrible or just good. If we ask a group of people to give an adjective for a particular movie, we can then get an idea of how positive or how negative public opinion is on this film. Therefore, one essential step to be taken is to establish a metric of how positive or how negative an adjective is, in other words the magnitude of its sentiment. For instance, a movie characterized as 'brilliant' is much more appreciated than a movie characterized as just 'good'. The purpose of this chapter is to quantify the 'goodness' of adjectives so that comparison and sorting among them becomes possible.

4.1 Methodology

Since it is necessary to establish an automatic approach for assigning sentiment to adjectives (and words in general), we used an unsupervised approach based on word correlations. This approach is inspired by the way a person learns to judge which words have a positive or negative meaning, which is essentially the result of a lot of exposure to speech and written text, from which the learner infers which words appear in a positive or negative context. Specifically, positive words (adjectives in our case) usually occur in sentences of positive sentiment, while on the contrary negative words are commonly found in texts of negative sentiment. That way, words (and adjectives in particular) obtain a sentiment notion which would be very useful to extract. The same basic principle, inferring which words appear together with a positive or negative context, can easily be implemented in a simple algorithm. Counting how often a particular adjective has been encountered with a positive meaning, compared to the frequency with which it has been observed with a negative connotation, is all that is needed. More precisely, the desired value is calculated by taking the total number of occurrences of a particular adjective in positive contexts divided by the number of positive texts processed, and subtracting the corresponding frequency for negative texts:

\mathrm{sent}_{adj} = \frac{\sum_{positive} adj}{|positive|} - \frac{\sum_{negative} adj}{|negative|}

Afterwards, this value is normalized by the total number of words in the corpus: it is multiplied by the total number of occurrences of all words divided by the number of occurrences of the particular adjective. That way, uncommon adjectives like "brilliant" get a higher ratio than common ones like "good":

\mathrm{sent}_{adj} = \mathrm{sent}_{adj} \times \frac{\mathrm{occur}_{allwords}}{\mathrm{occur}_{adj}}

To begin such an automatic classification, some notion of what is deemed positive or negative is necessary. In order to proceed to a sentiment estimation of adjectives, a list of texts with an indication of their sentiment should be provided. Having a collection of texts with positive and negative context makes it possible to measure word occurrences in contexts of different polarity, and therefore to establish a sentiment metric for these words (fig. 4.1). Additionally, a list of adjectives to be evaluated is also provided. The hybrid classifier presented in the previous section not only can be used to distinguish subjective texts from objective ones, but also collects the adjectives used in the opinion posts that refer to the examined subject. Therefore, a list of adjectives used to describe this particular matter is already available.
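The scoring formula above can be sketched directly in Python. Whitespace tokenization is an assumption made here for brevity; the function name is our own:

```python
# Relative frequency in positive minus negative texts, then normalized by the
# adjective's overall rarity in the corpus (rare strong words score higher).
def adjective_sentiment(adj, positive_texts, negative_texts):
    pos_count = sum(t.lower().split().count(adj) for t in positive_texts)
    neg_count = sum(t.lower().split().count(adj) for t in negative_texts)
    raw = pos_count / len(positive_texts) - neg_count / len(negative_texts)

    all_texts = positive_texts + negative_texts
    occur_all = sum(len(t.split()) for t in all_texts)
    occur_adj = pos_count + neg_count
    if occur_adj == 0:
        return 0.0                      # adjective never seen in the corpus
    return raw * occur_all / occur_adj

pos = ["brilliant film", "good film"]
neg = ["bad film", "good plot bad film"]
# adjective_sentiment("good", pos, neg) → 0.0 (equally frequent in both sides)
# adjective_sentiment("brilliant", pos, neg) → 5.0 (rare and purely positive)
```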


Figure 4.1 – Work scheme of adjectives sentiment establishment algorithm

4.2 Corpus

In our approach we chose movie-related posts that are already characterized as positive or negative with respect to their content. The Internet Movie Database (IMDb) [4] and Rotten Tomatoes [6], two Internet websites focused on film reviews, provide a large collection of movie reviews marked according to their polarity. In IMDb each review is accompanied by a number of stars denoting the writer's opinion of the movie, while Rotten Tomatoes follows an even clearer policy: every review is characterized either as 'fresh' if it is positive, or 'rotten' if it is negative, while additionally the reviews are divided into those written by professionals and those written by regular users. These websites also include the following ratings for each movie:

• IMDb rating [4]: This is the average score of all ratings that users submit for this movie. All visitors are able to rate films on a scale of one to ten, even if they are not registered IMDb members. Launched in 1990, IMDb is one of the most popular online entertainment sites, with over 100 million unique users each month.

• Rotten Tomatoes rating [6]: Rotten Tomatoes distinguishes between reviews written by professionals (journalists, website editors, etc.) and those written by regular users. Accordingly, the website publishes two different ratings for each movie: the percentage of approved professional critics that gave a positive review, which will be called professional rating in our analysis, and the percentage of users that rated the particular movie with more than 3.5 stars, which we call audience rating.

We collected 122,656 audience and 22,920 professional reviews from Rotten Tomatoes and 27,030 reviews from IMDb, for almost 140 movies released in the years 2011 and 2012, and measured the frequency with which adjectives were encountered in a positive and in a negative context. A number of Twitter posts was also used for the adjectives' sentiment establishment. However, on Twitter there is no clear indicator of which tweet has positive content and which has negative content. For this reason, the polarity indicator in this case is the use of smilies. We collected 1,014,414 tweets containing smilies; we consider positive those having a positive smilie such as :-), :), =), and accordingly we did the same for negative smilies such as :-(, :(, =(. A representation of the positive/negative post distribution of this corpus can be found in figure 4.2. Using the techniques discussed above, we dissect all individual statements and count the number of occurrences in posts with positive or negative content. This results in a relative assessment of a particular word appearing in a positive or a negative context, where context is defined either by the indicators of each website or by the appearance of smilies.
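Labeling a tweet by its smilies can be sketched as follows; the smilie sets are the ones listed above, and discarding ambiguous tweets (both or neither kind of smilie) is an assumption we make for the sketch:

```python
import re

# Label tweets by emoticons: a tweet with only a positive smilie counts as
# positive, one with only a negative smilie as negative; others are discarded.
POS_SMILEY = re.compile(r"(:-\)|:\)|=\))")
NEG_SMILEY = re.compile(r"(:-\(|:\(|=\()")

def label_by_smiley(tweet):
    pos = bool(POS_SMILEY.search(tweet))
    neg = bool(NEG_SMILEY.search(tweet))
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # ambiguous or unlabeled

# label_by_smiley("Loved the film :)") → 'positive'
# label_by_smiley("so boring :(")      → 'negative'
```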

[Bar chart of the number of positive and negative posts collected from Twitter, IMDb, Rotten Tomatoes professionals and Rotten Tomatoes audience, ranging up to about 600,000 posts.]

Figure 4.2 – Number of posts from each website used as training corpus

4.3 Adjectives selection

In this approach, we use the list of adjectives that the hybrid classifier found in the input tweets related to movie reviews. As described in the previous section, these adjectives have a direct relation to the subject of the posts, and are therefore used by users to attach a sentiment to the particular targets.


However, having a small training corpus may require a more careful selection of adjectives. For example, if the adjectives 'western' or 'dark' were included in the final list, the results might be misleading if the training set happened to contain really bad or really good reviews for a western or a dark film. However, these cases are eliminated when a long list of positive and negative texts is available. Apart from that, different contexts require different adjectives for their evaluation. Therefore, the adjectives list used for sentiment evaluation here should be derived from the subjectivity classification procedure of texts related to the same subject. If, for example, this study were focused on peoples' opinions about American presidential candidates, the adjective 'entertaining', which is chosen for the film case, would not be the most appropriate. Instead, we would definitely look for the words 'capable' or 'trustful', which apparently make no sense for movies. Again, an automatic way to establish a list of adjectives appropriate for any matter is the use of the proposed hybrid classifier. That would only require a number of texts of any length or writing style, related to this particular subject.

4.4 Ground truth

In order to trace peoples' sentiment towards particular words, we asked 50 students to assign a grade from 1 to 10 to some of the adjectives most commonly used for describing movies, according to how positive their meaning is to them. This way, we can make comparisons to the adjective sentiment estimations derived from the texts of different websites, and thereby appraise the results of the proposed estimation algorithm. Figure 4.3 shows the average of the students' answers for each given adjective, shifted by five points so that the baseline is zero, and normalized to sum to one. They are sorted according to their calculated sentiment magnitude. The collected answers were very similar to each other, and all students agreed on the polarity of the adjectives. For instance, all answers gave the word 'awful' a much more negative notion than the word 'good'. Another interesting finding is that students also agreed on the sentiment magnitude of adjectives. The word 'good', for example, received much smaller grades than the word 'amazing' from the majority of the students asked. However, the five highest-ranked adjectives ('amazing', 'incredible', 'brilliant', etc.) received slightly different grades in the students' answers, and some minor differences in the sorting of these adjectives were observed.
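One plausible reading of the shift-and-normalize step (the original leaves the normalization slightly underspecified) is the following sketch, in which the absolute values of the shifted grades are scaled to sum to one so that the curves of different sources become comparable:

```python
# Survey post-processing sketch: shift 1-10 grades so the midpoint (5) is
# zero, then scale so the absolute values sum to one. This interpretation of
# "normalized to sum to one" is an assumption, not taken from the thesis.
def normalize_survey(avg_grades):
    shifted = {adj: g - 5 for adj, g in avg_grades.items()}
    total = sum(abs(v) for v in shifted.values())
    return {adj: v / total for adj, v in shifted.items()}

scores = normalize_survey({"awful": 1.0, "good": 6.0, "amazing": 9.0})
# shifted grades: -4, 1, 4; total |mass| 9 → awful -4/9, good 1/9, amazing 4/9
```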


[Bar chart of the survey rates, ranging from about −0.10 to 0.15, for the adjectives sorted from 'terrible', 'awful', 'disturbing', 'terrifying', 'sick', 'bad', 'weird', 'okay' and 'good' up to 'perfect', 'brilliant', 'incredible' and 'amazing'.]

Figure 4.3 – Survey results on adjectives sentiment

4.5 Adjectives sentiment estimation from social media posts

Figure 4.4 shows the adjective sentiment estimation derived from evaluating IMDb reviews, figures 4.5 and 4.6 the estimations for Rotten Tomatoes reviews, and figure 4.7 the estimation for Twitter posts. The adjectives are sorted according to the resulting ordering of the ground truth, in order to make comparisons easier.

[Bar chart of the adjective sentiment values derived from IMDb critics, ranging from about −0.40 to 0.30, over the same adjective ordering from 'terrible' to 'amazing'.]

Figure 4.4 – Adjectives sentiment derived from IMDb critics

Even though there are differences in the values of particular adjectives between different social media and the ground truth, this type of adjective estimation seems to give reasonable results for the positive/negative meaning of words, as well as their ordering in most cases. According to these results, the estimation algorithm is able to trace the differences in sentiment polarity. For example, 'terrible' and 'awful' appear to be words of definite negative meaning in all cases, while adjectives with positive sentiment, such as 'brilliant', 'wonderful' and 'amazing', are given a positive score. The algorithm also succeeded in tracing differences in sentiment magnitude. For instance, 'excellent' is found to have a much more positive meaning than 'good', which ensures that the proposed algorithm is able to attach accurate sentiment estimations to given words.

[Bar chart of the adjective sentiment values derived from Rotten Tomatoes professional critics, ranging from about −0.3 to 0.2, over the same adjective ordering.]

Figure 4.5 – Adjectives sentiment derived from Rotten Tomatoes professional critics

[Bar chart of the adjective sentiment values derived from Rotten Tomatoes audience critics, ranging from about −0.600 to 0.300, over the same adjective ordering.]

Figure 4.6 – Adjectives sentiment derived from Rotten Tomatoes audience critics

[Bar chart of the smilie co-occurrence rates derived from Twitter posts, ranging from about −0.2 to 0.2, over the same adjective ordering.]

Figure 4.7 – Adjectives sentiment derived from Twitter posts

On the other hand, one can notice that in the IMDb and Rotten Tomatoes evaluations, the words 'disturbing' and 'terrifying' surprisingly have a positive sentiment, even though most people that answered the survey gave both words a definitely bad meaning. This is mainly because a sufficiently large number of positive IMDb and Rotten Tomatoes reviews use those words to describe movies that receive positive rating scores. This is not unexpected, as for some kinds of movies (e.g. thriller or horror films), 'terrifying' is in fact quite a positive characterization to make. For example, the following posts from Rotten Tomatoes are marked as positive ('fresh'), yet they contain the above seemingly negative adjectives:

• "While the all-star cast and the story, about a deadly virus, sounds similar to 1995’s Outbreak, ’Contagion’ is more insightful, moving, and disturbing with first rate entertainment"

• "A non-stop, over-the-top, intentionally ridiculous smorgasbord of violence and bloodshed that ceases being disturbing and just becomes pure, 100% fun"

• "Perverse, terrifying, hilarious in exactly the right way; smart enough, emotional enough, and at the end uniquely satisfying in any number of hard-to-quantify ways"

• "If you take Gibson’s performance on its own merits, it’s one of the finest of his career; touching, terrifying and admirably understated throughout"

This demonstrates the semantic differences one word can have across subjects and contexts, which can in some cases be so intense that they turn a negative meaning into a positive one, as happens here. This denotes the need for separate sentiment evaluations of words in different contexts, or even in different writing styles: the meaning of a word is not universal but adapts to different domains and subjects.


Using the Earth Mover's Distance (EMD), we measured the distance of each one of the above distributions to the ground truth. The EMD is a measure of similarity between two probability distributions: the two distributions are represented by signatures, and the EMD is defined as the minimum amount of work needed to change one signature into the other. Therefore, the smaller the value, the more similar the distributions. The results were similar for all of the estimations and are displayed in the table below:

Source of training sets                  EMD to ground truth
IMDb                                     2,668
Rotten Tomatoes (professional critics)   2,546
Rotten Tomatoes (audience critics)       2,379
Twitter                                  2,445
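As an aside, when the two signatures are one-dimensional histograms over the same equally spaced bins, the EMD reduces to the accumulated absolute difference of the two cumulative distributions. A minimal sketch of this special case follows (the general EMD requires solving a transportation problem; the example distributions are hypothetical):

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D distributions of equal
    total mass, defined over the same equally spaced bins.

    For this special case the EMD equals the summed absolute
    difference of the cumulative distributions."""
    assert len(p) == len(q)
    work, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi      # surplus mass carried to the next bin
        work += abs(carry)    # cost of moving that surplus one bin over
    return work

# Moving one unit of mass a single bin costs 1.0:
print(emd_1d([0, 1, 0], [0, 0, 1]))  # 1.0
```

Identical distributions give a distance of 0, matching the intuition that smaller values mean more similar distributions.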

The distribution derived from the audience critics of Rotten Tomatoes has a slightly smaller distance to the ground truth than the others; however, the differences appear to be insignificant. Since we already have some trustworthy estimations of the adjectives' sentiment, it is reasonable to explore the possibility of using them for an overall sentiment estimation of texts. The question that arises here is which one of the above adjective estimations is the most suitable for calculating the sentiment of specific targets. The answer to this question is analysed in the following chapter, where all the above adjective sentiments are applied and evaluated on test sets from different sources.


Chapter 5

Measuring movies popularity

An efficient way to measure the sentiment magnitude of words was presented in the previous chapter. Being able to determine how good or how bad each word is, we can evaluate whole sentences according to their sentiment by simply looking at the words used. As already discussed, people's opinions in subjective texts are expressed mostly through adjectives, and therefore using an estimation of their sentiment leads us to an assessment of people's overall feelings towards a particular subject. This way, the sentiment values of adjectives calculated in the previous chapter make it possible to establish a measurement of the public opinion regarding specific subjects, and consequently give an indicator of their popularity. For calculating the popularity of each movie, we used critics from both IMDb and Rotten Tomatoes (on the same movies), as well as Twitter posts referring to the particular movies' titles. Critics are usually texts written by users who watched the movie (either professional critics or regular users), who express their opinion through adjectives and other descriptive phrases. With more than 100 critics per movie, we can obtain an indicator of the popularity of each movie, exactly as we would in everyday life.

5.1 Methodology

It is clear that just counting the number of positive and negative critics could give us a rough idea of which movie is in general really bad or acceptably good. In figures 5.1 and 5.2 one can see the relation of the number of IMDb critics for each movie to their rating given by IMDb website users and to the Box Office gross, respectively. Similar results hold for Rotten Tomatoes and Twitter as well. Clearly, there is no strong correlation (the correlation coefficient is calculated at around 40% to 42%) between the number of critics posted on IMDb and any of these ratings, which shows that the number of posts is not a reliable indication of the exact sentiment related to each one of the movies. Besides, it is reasonable that calling a movie 'exceptional' carries a different weight than just calling it 'good', even though both adjectives have a definite positive sentiment. The establishment of a more accurate way of evaluating critics, and social media posts in general, is therefore needed.


Figure 5.1 – Number of IMDb critics to IMDb ratings


Figure 5.2 – Number of IMDb critics to IMDb Box Office gross

The technique used here is very simple: we go through every critic and collect all adjectives contained in the texts. Afterwards, we add together the sentiment rates of all these adjectives (depending on which adjectives' sentiment estimation is being used) and then divide this value by the total number of adjectives found:

sent_{film} = \frac{\sum_{adj \in film} sent_{adj}}{|adj \in film|}

The overall work flow of this technique is visualized in figure 5.3. The algorithm uses an adjectives sentiment estimation generated by the technique described in chapter 4, and a list of subjective posts on specific movies. The hybrid classifier presented in chapter 3 can classify a list of given social media posts according to their subjectivity, and therefore such a list of subjective posts is immediately available.

Figure 5.3 – Work scheme of movies popularity estimation algorithm

Following this procedure, we can establish a classification of the 'goodness' of any movie, only by evaluating the adjectives that are used to describe it. For instance, a popularity estimation of the 140 movies of our corpus is presented in figure 5.4. Movies are sorted from the one that received the highest sentiment estimation score to the one with the lowest.


[Plot: horizontal bar chart of the corpus movies ranked by estimated sentiment, from Jane Eyre, Warrior and Hugo at the top down to The Roommate and Jack and Jill at the bottom, on a scale from −0.06 to 0.1]

Figure 5.4 – Adjectives sentiment derived from Rotten Tomatoes critics


5.2 Movies sentiment estimation

The most important aspect that needs to be covered is whether the results of the described evaluation for movies actually correspond to reality. In order to test the described approach, we used reviews for 140 randomly selected movies of the years 2011-2012, collected from the IMDb and Rotten Tomatoes websites, as well as Twitter posts that included the titles of these movies. Using an estimation of adjectives sentiment as presented in chapter 4, we evaluated the adjectives that describe the corresponding movies. This way, an estimation of their popularity was established. Additionally, we also used the different adjectives sentiment estimations calculated in the previous chapter. These estimations were derived from sentiment-tagged texts from IMDb and Rotten Tomatoes, as well as tweets with smilies. This way, an evaluation of the different adjectives sentiment estimations is attempted. Below we present the relation of the estimated popularity of each movie to the ratings that these movies received on the IMDb and Rotten Tomatoes websites, as well as to their Box Office gross. In order to measure the degree of correlation we used the Pearson correlation coefficient, which measures the linear dependence between two variables. A correlation between 30% and 50% is characterized as medium, while a correlation above 50% is considered strong.
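For reference, the Pearson coefficient used throughout this chapter can be computed directly from its definition; this is a generic sketch, not tied to the thesis data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: the covariance of the two
    samples normalized by the product of their standard deviations,
    giving a value in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear samples correlate at (approximately) 1.0:
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0
```

A value near 1 indicates that higher estimated sentiment systematically goes with higher ratings, which is the behaviour the plots below are testing for.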

5.2.1 Evaluation of IMDb critics

Figure 5.5 presents the correlation of movie popularity estimated using adjectives sentiment trained on IMDb reviews, to IMDb ratings. Accordingly, in figure 5.6 an adjectives estimation trained on the ground truth is used, while in figures 5.7 and 5.8 Rotten Tomatoes professional and audience critics are used as the training set, respectively. From these plots it is clear that there is a strong correlation between the sentiment estimation of these movies and the actual IMDb rating, which is considered to be one of the most accurate online film ratings. Additionally, the presented results show that the correlation of estimated movie popularity to the IMDb ratings remains the same regardless of the chosen adjectives sentiment estimation. Specifically, the difference between the resulting correlation coefficients is at most 4%, which shows that the slight differences observed in sentiment estimations of adjectives trained on different texts are almost eliminated when used for the evaluation of a set of texts.



Figure 5.5 – Trained on IMDb critics compared to IMDb ratings - R2=82%
Figure 5.6 – Trained on ground truth compared to IMDb ratings - R2=79%


Figure 5.7 – Trained on Rotten Tomatoes professional critics compared to IMDb ratings - R2=78%
Figure 5.8 – Trained on Rotten Tomatoes audience critics compared to IMDb ratings - R2=81%

The above estimated movie popularity was also compared to Rotten Tomatoes ratings. The table below shows the correlation coefficients derived from comparisons to both IMDb and Rotten Tomatoes audience ratings:

Trained on        Compared to        Correlation coefficient
IMDb              IMDb               82%
IMDb              Rotten Tomatoes    81%
Rotten Tomatoes   IMDb               82%
Rotten Tomatoes   Rotten Tomatoes    78%

Again, the differences in correlations are very small, and the above results denote that movie popularity is highly correlated with both ratings, regardless of the training set of the adjectives sentiment.

5.2.2 Evaluation of Rotten Tomatoes critics

Rotten Tomatoes reviews (written by both professionals and audience) were also used to determine movie popularity. In this case, the adjectives' sentiment estimation trained on Rotten Tomatoes was used. In figure 5.9 the correlation between the sentiment estimation of professional critics and the professional Rotten Tomatoes ratings is shown, along with the correlation to the IMDb ratings (figure 5.10). Additionally, figures 5.11 and 5.12 present the sentiment estimation of audience critics compared to the audience Rotten Tomatoes ratings and to the IMDb ratings, respectively.


Figure 5.9 – Rotten Tomatoes professional critics compared to RT professional ratings - R2=67%
Figure 5.10 – Rotten Tomatoes professional critics compared to IMDb ratings - R2=58%


Figure 5.11 – Rotten Tomatoes audience critics compared to RT audience ratings - R2=66%
Figure 5.12 – Rotten Tomatoes audience critics compared to IMDb ratings - R2=58%

According to the above results, there is again a high correlation between the sentiment estimation of Rotten Tomatoes reviews and the above mentioned ratings. However, it is shown that the presented estimations are correlated to a higher degree with the Rotten Tomatoes ratings than with the IMDb ones. Since the reviews on Rotten Tomatoes are a lot shorter than the reviews on IMDb, they are expected to contain fewer adjectives. Therefore, the sentiment estimation towards a movie is based on less input data than with IMDb critics, which might reduce accuracy in evaluating people's opinion of the particular movie. On the other hand, IMDb critics, which are usually 3-4 paragraphs long, contain plenty of adjectives describing any movie, and therefore these texts generate a more accurate estimation towards both ratings.

5.2.3 Evaluation of Twitter posts

Below we present our results after analyzing collected tweets that contain the titles of the movies we are working on. Since Twitter imposes no specific structure on its posts, unlike IMDb and Rotten Tomatoes, it is crucial to make sure that the posts are indeed referring to the specific movies, and moreover that the adjectives used are also related to them. For this reason we removed all duplicates created by retweets and captured all adjectives connected to film critics with the use of the hybrid algorithm we described. The retrieval of posts concerning particular movies on Twitter is possible only by searching for tweets containing their titles, contrary to movie review websites like IMDb and Rotten Tomatoes. In some cases, however, this search returns plenty of irrelevant tweets, which, even though they contain the movie title, have nothing to do with the actual movie. This occurs most often with titles that contain common words, which are usually used in other contexts apart from the movie itself. In order to determine such titles, we measured the number of results (in millions) that the Google search engine returned when searching for a movie title, and when searching for the same movie title plus the word 'movie'. That way, we aim to specify what percentage of the results returned by the first query are actually referring to the corresponding movie. The results of this measurement are shown in the table below, where the first column contains the number of results when querying the movie title and the second column represents these numbers when adding the word 'movie' to the query:

Movie title                      Results (millions)   Results + 'movie' (millions)   Referring to the movie
King Arthur                      140                  21,3                           15%
Hugo                             458                  83,6                           18%
Anonymous                        851                  197                            23%
Rise of the Planet of the Apes   27,2                 26,5                           97%
The Green Hornet                 12                   11,3                           94%
Scream 4                         32,9                 27,1                           84%

The first three titles are actually names that do not characterize specifically the movie, but are widely used for irrelevant subjects. This fact fully justifies the observation that the number of results returned by the first search is surprisingly higher than by the second one. On the other hand, the last three titles are more unique and are usually used when referring particularly to the corresponding movies, thus the resulting percentage is much higher. These Google searches follow much the same procedure as Twitter searches, and therefore some of the movies with ambiguous titles had to be excluded from this testing. Following the above, 72 movies with appropriate titles were carefully selected. The tweets on these selected movies were evaluated by using the adjectives sentiment derived from Twitter (fig. 5.13), IMDb (fig. 5.14) and the ground truth (fig. 5.15), and were compared to the IMDb rating.
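The ambiguity check described above boils down to a single ratio; as an illustration, the example counts below are the Google result numbers (in millions) reported in the table:

```python
def title_specificity(hits_title, hits_title_plus_movie):
    """Fraction of search results for a bare title that plausibly refer
    to the movie itself, approximated by the ratio of hits for
    '<title> movie' to hits for '<title>' alone."""
    return hits_title_plus_movie / hits_title

# 'Hugo' is ambiguous, 'The Green Hornet' is not (counts in millions):
print(round(title_specificity(458, 83.6), 2))  # 0.18
print(round(title_specificity(12, 11.3), 2))   # 0.94
```

Titles scoring below some threshold on this ratio can then be excluded from the Twitter test set, as was done for the ambiguous titles here.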


Figure 5.13 – Trained on Twitter compared to IMDb ratings - R2=54%


Figure 5.14 – Trained on IMDb compared to IMDb ratings - R2=56%



Figure 5.15 – Trained on ground truth compared to IMDb ratings - R2=55%

It can be seen that again there is a strong correlation in all plots, even though Twitter does not aim to provide actual critiques of movies, and prevents the posting of descriptive and detailed posts due to its limitation of 140 characters per tweet. This means that short Twitter posts contain fewer adjectives compared to IMDb and Rotten Tomatoes reviews and provide a much shorter description of the discussed movie. Apparently, this can reduce the accuracy of the popularity estimation performed on such kinds of short text.

Additionally, the retrieval of relevant tweets can also be tricky and create problems. As already mentioned, it is possible that the test set may contain posts irrelevant to the subject, even though they include the title of a movie. Additionally, some movies' titles can be written in various forms and abbreviations. For example, the movie "Pirates of the Caribbean: The Curse of the Black Pearl" can be written as "Pirates of the Caribbean", "Pirates of Caribbean" or just "Pirates", and can even be misspelled. This makes it hard to ensure that all tweets referring to a particular movie are included in the set, which can lead to less accurate estimations.
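One way to mitigate the variant-title problem would be to match tweets against a hand-listed set of title forms. The sketch below is a hypothetical illustration with the variants mentioned above; a real retrieval step would also need misspelling tolerance, which simple patterns do not provide:

```python
import re

# Hypothetical hand-listed variants of one movie title.
VARIANTS = ["pirates of the caribbean: the curse of the black pearl",
            "pirates of the caribbean",
            "pirates of caribbean"]

# One pattern matching any variant as a literal phrase, longest first,
# so the fullest title form wins where variants overlap.
PATTERN = re.compile("|".join(re.escape(v) for v in
                              sorted(VARIANTS, key=len, reverse=True)))

def mentions_movie(tweet):
    """True if the (case-insensitively) lowered tweet contains any
    known variant of the title."""
    return PATTERN.search(tweet.lower()) is not None

print(mentions_movie("Loved Pirates of Caribbean last night!"))  # True
print(mentions_movie("Real pirates spotted off Somalia"))        # False
```

Note that the bare word "Pirates" is deliberately not listed as a variant, since, as discussed above, such common words attract far too many irrelevant posts.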

Despite all these obstacles, the correlation rates derived by estimating movies' popularity from Twitter posts against the IMDb ratings are still sufficient, and show that this evaluation offers a good approximation of the actual popularity of movies. It is therefore clear that this technique is much more trustworthy than simply counting Twitter posts, and it would have led to even better results if a way of efficiently retrieving relevant Twitter posts had been established.


5.2.4 Correlation with Box office

Finally, the correlation between all generated movie sentiment estimations and the box office gross of the corresponding films was also examined. Our clear conclusion is that there is no relation between adjectives evaluation and box office revenues, which is fully justified by a quick look at figures 5.16, 5.19 and 5.18, representing the correlation of the IMDb, Rotten Tomatoes and Twitter posts' estimations to the worldwide box office gross. Apart from that, we also compared the box office success of each movie with ratings obtained from the IMDb website (figure 5.17), which is further proof of the lack of correlation between critics' ratings and commercial success. This point was also made by a recent study by Wong, Sen and Chiang [38] on the difficulty of predicting Box Office revenues through Twitter.

Figure 5.16 – IMDb critics compared to worldwide Box Office gross - R2=11%
Figure 5.17 – IMDb ratings compared to Box Office gross - R2=17%

Figure 5.18 – Twitter posts compared to worldwide Box Office gross - R2=5%
Figure 5.19 – Rotten Tomatoes critics compared to worldwide Box Office gross - R2=9%


5.3 Results analysis and Evaluation

From the above plots, it is shown that the choice of an adjectives' sentiment estimation trained on either IMDb, Rotten Tomatoes or Twitter does not noticeably affect the correlation of the movie popularity to a rating. In the previous chapter it was found that the above mentioned sources generate some slight differences in the calculated adjectives sentiment, mostly because of the different writing styles of their posts. However, these differences are eliminated when the estimated adjectives' sentiments are used for the evaluation of a large number of texts. This denotes the flexibility of the proposed algorithm, since the training set can be any text related to the discussed subject, regardless of its length, writing style or vocabulary. Additionally, the above measurements show that the evaluation of IMDb posts leads to a higher correlation with both IMDb and Rotten Tomatoes ratings, compared to the evaluation results of Rotten Tomatoes and Twitter posts. This is mainly because of the different style of reviews available on the IMDb website, which consist of much longer and more detailed texts. On the other hand, Rotten Tomatoes usually includes texts of one or two sentences, while Twitter consists of posts at most 140 characters long. Figure 5.20 shows the number of positive and negative adjectives found in texts of IMDb, and figures 5.21 and 5.22 do so for Rotten Tomatoes and Twitter posts, respectively. The IMDb posts clearly contain more adjectives than the Rotten Tomatoes and Twitter ones, mainly due to their length and writing style. Therefore, one can say that IMDb offers a more precise indication of people's opinion on a movie, since it includes more descriptive, detailed texts. For this reason, the movie sentiment estimation derived from IMDb critics reaches a higher precision level when compared with real film ratings than the estimations derived from the two other sources.


Figure 5.20 – Number of positive/negative adjectives found in IMDb critics
Figure 5.21 – Number of positive/negative adjectives found in Rotten Tomatoes critics


On the other hand, the Rotten Tomatoes and Twitter results also had a strong correlation with official ratings. Twitter specifically faces an additional obstacle compared to Rotten Tomatoes and IMDb, due to the difficulty of retrieving all relevant posts and eliminating irrelevant ones. However, it still reaches a high correlation to all ratings, and proves that the proposed popularity estimation technique is much more efficient than naive post counting. It is therefore beyond doubt that a sentiment estimation technique, like the one just presented, is necessary for the efficient evaluation of critics, and of subjective texts in general.


Figure 5.22 – Number of positive/negative adjectives found in Twitter posts


Chapter 6

Conclusion and Future work

6.1 Summary

In this thesis, a new classification technique based on NLP analysis of text is proposed. This new hybrid classifier uses the relations between words as derived from syntax analysis, and aims to trace particular subjective patterns that ensure the existence of a user's opinion on a subject. This proposed method has the advantage of getting deeper into the actual meaning of a text, whereas the usual classification tools, which are mostly based on machine learning algorithms (Naive Bayes, Maximum Entropy, SVM), look for the existence of specific words with strong sentiment in a post. However, that approach is proven to be much less accurate than the proposed one, mostly because the use of such words is often completely unrelated to the subject of the post, or the words are even used in an entirely different sense. Having a careful look at the structure of the text, though, gives us the opportunity to reach a safer conclusion about its subjectivity, and as shown, it contributes a 40% gain in accuracy, with about 85% of posts classified correctly. Apart from detecting the presence of subjective or objective patterns in a text, the establishment of the polarity, and especially of the magnitude, of the sentiment found in subjective texts is also a crucial matter discussed in this work. An adjective-based approach is explained, which extracts people's sentiment on a particular subject by estimating the overall sentiment of their subjective posts on it. Initially, a sentiment magnitude metric is assigned to a number of adjectives, according to the number of times they occur in positively and negatively labeled texts; afterwards these values are used to evaluate entire posts and, consequently, whole subjects.


Comparing the sentiment estimations to various ratings derived from real people's votes, a strong correlation was observed between them in most cases, which denotes that this approach can be highly successful.

6.2 Future Work

Even though the proposed algorithms appear to have high accuracy, there are some points of improvement which can lead to even more precise estimations. Below are some points of improvement that are not sufficiently resolved in the present thesis, and which would contribute to the overall accuracy of the proposed methods.

6.2.1 Efficient tweets retrieval

A common problem of Twitter data analysis in general is the selection and retrieval of posts that are indeed relevant to the search terms. Since Twitter implements a concept of totally free writing without a specific subject or rules (contrary to IMDb or Rotten Tomatoes, which were also used in this study), it is definitely hard to make sure that the tweets selected for a particular subject contain no irrelevant, needless posts. The use of hashtags can categorize posts to some degree, but again not everyone uses them, and moreover, because of their loose nature, there is no safe way to determine all words or phrases used as hashtags on a specific matter. Improving the way of collecting Twitter posts can improve the quality of information one can get out of them, and thus their overall evaluation in terms of content or sentiment would be improved. Apart from that, Twitter contains a lot of advertising and spamming posts, which can distort the real results of a popularity estimation. Tracing such posts would be another important step towards improving the classification and estimation success of both proposed algorithms.

6.2.2 Tweets content classification

As for the hybrid classifier, which was presented in a previous chapter, the major reason for incorrect classifications was undoubtedly the spelling and syntax mistakes that exist in many Twitter posts. Establishing a way to correct, or at least point out, texts containing such errors would definitely result in more accurate classifications, since the NLP analysis would then be done correctly and with fewer ambiguities. Apart from that, tweets often contain slang terms and common phrases, which usually created errors in the generation of the syntax dependencies on which the classifier relies. Recognizing such common phrases and other terms used widely on the internet (such as acronyms, smilies, slang etc.) would help in having a clearer perception of their role and meaning in a sentence, and consequently we would be able to eliminate such problematic cases.

6.2.3 Tweets sentiment estimation

The sentiment estimation approach also has plenty of open subjects which can be examined for the improvement of this technique. First of all, the sentiment evaluation of more adjectives should definitely result in a more accurate sentiment magnitude evaluation of posts, and therefore a more precise sentiment estimation of a particular subject. The selection of adjectives should, however, be careful and related to the specific domain being examined. It is possible that a specific adjective carries much heavier sentiment in one case than in another, and therefore such a selection should necessarily contain these important adjectives. This meaning differentiation among different subjects, types of writing or even author ages is also another interesting field of study, which can give us a clearer idea of how sentiment magnitude scales between all these different cases and would help us produce more precise estimations. Besides taking into account as many adjectives as possible when establishing a sentiment estimation of posts, considering other grammatical terms, such as verbs, would also contribute to a more precise text analysis. Verbs like 'love', 'hate' or 'adore' undoubtedly carry some kind of emotion that gives us a clearer idea of the author's opinion on a particular matter, and should therefore be included in an overall evaluation of its sentiment. However, the selection of verbs to be considered should again be careful, in order to avoid the case where irrelevant verbs (or other terms), unrelated to the subject, distort the final estimations.


Bibliography

[1] Alexa: The web information company. http://www.alexa.com/. visited on 23-06-2012.

[2] Facebook. http://www.facebook.com. visited on 23-06-2012.

[3] Google +. http://plus.google.com. visited on 23-06-2012.

[4] Imdb: Internet movie database. http://www.imdb.com. visited on 23- 06-2012.

[5] Lingpipe. http://alias-i.com/lingpipe. visited on 23-06-2012.

[6] Rotten tomatoes. http://www.rottentomatoes.com. visited on 23-06- 2012.

[7] Tweet sentiments. http://tweetsentiments.com. visited on 23-06-2012.

[8] Twitter. http://www.twitter.com. visited on 23-06-2012.

[9] Wordnet: A lexical database for english. http://wordnet.princeton. edu/. visited on 23-06-2012.

[10] S. Asur and B. Huberman. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 492– 499. IEEE, 2010.

[11] L. Barbosa and J. Feng. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 36–44. Association for Computational Linguistics, 2010.

[12] N. Blenn, K. Charalampidou, and C. Doerr. Context-sensitive sentiment classification of short colloquial text. In R. Bestak, L. Kencl, L. E. Li, J. Widmer, and H. Yin, editors, Networking (1), volume 7289 of Lecture Notes in Computer Science, pages 97–108. Springer, 2012.


[13] E. Cambria, A. Hussain, C. Havasi, and C. Eckl. Sentic computing: Exploitation of common sense for the development of emotion-sensitive systems. Development of Multimodal Interfaces: Active Listening and Synchrony, pages 148–156, 2010.

[14] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi. Measuring user influence in twitter: The million follower fallacy. In 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.

[15] R. Chunara, J. Andrews, and J. Brownstein. Social and news media enable estimation of epidemiological patterns early in the 2010 haitian cholera outbreak. The American Journal of Tropical Medicine and Hygiene, 86(1):39–45, 2012.

[16] R. Cilibrasi and P. Vitanyi. Automatic meaning discovery using google. Manuscript, CWI, 2004.

[17] M. Conover, J. Ratkiewicz, M. Francisco, B. Goncalves, A. Flammini, and F. Menczer. Political polarization on twitter. In Proc. 5th Intl. Conference on Weblogs and Social Media, 2011.

[18] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–12, 2009.

[19] V. Hatzivassiloglou, J. L. Klavans, M. L. Holcombe, R. Barzilay, M.-Y. Kan, and K. R. McKeown. Simfinder: A flexible clustering tool for summarization. pages 41–49, 2001.

[20] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. CoRR, cmp-lg/9709008, 1997.

[21] L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-dependent twitter sentiment classification. Proc. 49th ACL: HLT, 1:151–160, 2011.

[22] D. Klein and C. D. Manning. Fast exact inference with a factored model for natural language parsing. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 3–10. MIT Press, 2002.

[23] P. Kolb. Disco: A multilingual database of distributionally similar words. 2008.

[24] C. Miller. Sports fans break records on twitter, 2010. http://bits.blogs.nytimes.com/2010/06/18/sports-fans-break-records-on-twitter/, visited on 16-06-2012.


[25] G. Mishne and N. Glance. Predicting movie sales from blogger sentiment. In AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs (AAAI-CAAW 2006), 2006.

[26] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 79–86. Association for Computational Linguistics, 2002.

[27] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In D. L. McGuinness and G. Ferguson, editors, AAAI, pages 1024–1025. AAAI Press / The MIT Press, 2004.

[28] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In D. L. McGuinness and G. Ferguson, editors, AAAI, pages 1024–1025. AAAI Press / The MIT Press, 2004.

[29] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453. Morgan Kaufmann, 1995.

[30] World Politics Review. Tunisia: Twitter revolution vs. twitter impeachment, 2011. http://www.worldpoliticsreview.com/trend-lines/7584/tunisia-twitter-revolution-vs-twitter-impeachment, visited on 16-06-2012.

[31] E. Riloff and J. Wiebe. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 105–112. Association for Computational Linguistics, 2003.

[32] N. Shuyo. Language detection library. http://code.google.com/p/language-detection, 2010. visited on 16-06-2012.

[33] Spiegel. 'Twitter revolution': Fearing uprising, Russia backs Moldova's communists, 2009. http://www.spiegel.de/international/europe/twitter-revolution-fearing-uprising-russia-backs-moldova-s-communists.html, visited on 16-06-2012.

[34] The Washington Times. Editorial: Iran's twitter revolution, 2009. http://www.washingtontimes.com/news/2009/jun/16/irans-twitter-revolution, visited on 16-06-2012.


[35] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL, 2003.

[36] P. Turney. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 417–424. Association for Computational Linguistics, 2002.

[37] J. Wiebe and E. Riloff. Creating subjective and objective sentence classifiers from unannotated texts. Computational Linguistics and Intelligent Text Processing, pages 486–497, 2005.

[38] F. M. F. Wong, S. Sen, and M. Chiang. Why watching movie tweets won’t tell the whole story? CoRR, abs/1203.4642, 2012.

[39] H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 129–136. Association for Computational Linguistics, 2003.
