California State University, Northridge: A Comparison of Lexicographical and Machine Learning Approaches to Sentiment Analysis
California State University, Northridge

A Comparison of Lexicographical and Machine Learning Approaches to Sentiment Analysis

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By Jeffrey Yoshida

May 2019

The thesis of Jeffrey Yoshida is approved:

Robert McIlhenny, Ph.D.
Kyle Dewey, Ph.D.
George Wang, Ph.D., Chair

Acknowledgments

I would like to thank Dr. George Wang for being my committee chair and for all his help throughout the entire thesis process. Thank you also to Dr. Robert McIlhenny for making time for me despite being on countless other thesis committees and Dr. Kyle Dewey for giving me tremendous feedback on my paper. Finally, I would like to thank my parents for the support they have given me all these years. I would not be where I am today without your constant support.

Table of Contents

SIGNATURES
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
1. INTRODUCTION
    1.1 Background
    1.2 Approaches to Sentiment Analysis
2. RELATED WORKS
3. TECHNICAL APPROACH
    3.1 Data Exploration
    3.2 General Workflow
    3.3 Hardware
    3.4 Software
4. DATA PREPROCESSING
5. WORD EMBEDDINGS
    5.1 Count-Based Methods
    5.2 Prediction-Based Methods
    5.3 Word Embedding Implementations
6. MODELS
    6.1 SentiWordNet
    6.2 AFINN
    6.3 Logistic Regression
    6.4 Support Vector Machine
    6.5 Random Forest
    6.6 Naïve Bayes Classifier
    6.7 Multilayer Perceptron
    6.8 Convolutional Neural Network
7. RESULTS
    7.1 Metrics
    7.2 Model Results
8. CONCLUSION
REFERENCES

List of Figures

Figure 1 – Histogram of Review Scores in the Amazon Fine Foods Dataset
Figure 2 – Histogram of Positive and Negative Reviews in the Amazon Fine Foods Dataset
Figure 3 – Workflow for Project
Figure 4 – Example of Discrete Representation of Words
Figure 5 – Word2vec Example
Figure 6 – SVM with Linearly Separable Data
Figure 7 – Example of Linearly Inseparable Data
Figure 8 – Perceptron Diagram
Figure 9 – Multilayer Perceptron Diagram
Figure 10 – True Positives, False Positives, False Negatives, and True Negatives

List of Tables

Table 1 – Sample Count Vectorization
Table 2 – 10 Most Similar Words to “tasty” in the Amazon Fine Foods Reviews Dataset Word2vec Model
Table 3 – All Predictive Model and Word Embedding Combinations Used
Table 4 – Overall Results
Table 5 – Single-Class Classification Results
Table 6 – Domain-Specific Words Similar to Good
Table 7 – Pre-trained Model Words Similar to Good

Abstract

A Comparison of Lexicographical and Machine Learning Approaches to Sentiment Analysis

By Jeffrey Yoshida

Master of Science in Computer Science

Sentiment analysis is an area of computer science research that deals with extracting subjective information such as opinions, attitudes and emotions from text data. Although sentiment analysis is a technically challenging task, the potential benefits it can yield are great. An industry particularly interested in sentiment analysis is the e-commerce industry, where major companies often receive far more product reviews than can be handled manually. These reviews, if analyzed accurately, can become an invaluable source of consumer insights that go beyond numerical review scores. The goal of this thesis is to analyze and compare the performance of different methods of sentiment analysis on a dataset of Amazon product reviews.

1. Introduction

1.1 Background

Due to rapid improvements in internet technology, the amount of data being produced, consumed and stored has skyrocketed. In 2013, it was estimated that 90% of the world's stored data had been generated in only the two years prior [1]. What is more astounding is that an estimated 95% of this data is unstructured data such as text, videos and audio [2]. Because of this abundance of unstructured data, there has never been a greater need for methods of extracting valuable insights from data. This increased need to generate insights from unstructured data has fueled interest in research areas such as sentiment analysis.

Sentiment analysis is generally defined as an area of computer science research that deals with extracting subjective information such as opinions, attitudes and emotions from text data [3]. For the purposes of this paper, sentiment analysis is the task of computationally determining whether a body of text has a positive or negative tone.
Whether it be understanding the social sentiment regarding a clothing brand or political opinions on a specific contentious topic, sentiment analysis can provide crucial insights that are impractical to obtain by manual inspection of data.

With the growth of e-commerce, the need for in-depth consumer insights has never been greater. A study of online consumer-generated reviews by Comscore, a large media analytics company, found that consumers were willing to pay at least 20% more for products and services that received a 5-star rating than for the same service with a 4-star rating [4]. In food, legal and hotel services, consumers were willing to pay between 40% and 99% more [4]. The influence of product reviews is further supported by a 2018 study done by a site called BrightLocal, which found that 95% of people between the ages of 18 and 34 read reviews of local businesses [5]. More importantly, this survey also found that 57% of consumers will only use a business if it has 4 or more stars. Because of this, understanding product sentiment is valuable.

1.2 Approaches to Sentiment Analysis

There are two main approaches to sentiment analysis: the lexicographical approach and the machine learning approach [6]. In the lexicographical approach to sentiment analysis, the overall attitude of a body of text is determined by analyzing individual words or phrases. The polarity of each individual word is determined using a sentiment dictionary, a specialized dictionary that assigns each word a sentiment polarity score. The sentiment of the entire body of text is computed as the sum of the polarity scores of the individual words or phrases in the text. There are many different sentiment dictionaries available, such as SentiWordNet, AFINN and Opinion Lexicon [7, 8, 9, 10, 11].

The other common method of performing sentiment analysis is machine learning.
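
As a concrete illustration of the lexicographical approach described in this section, the following minimal Python sketch sums per-word polarity scores and labels the text by the sign of the total. The lexicon, its scores, the tokenizer and the function names are all hypothetical and chosen only for illustration; they are not taken from SentiWordNet, AFINN or the implementation used in this thesis. Treating a total of exactly zero as negative is likewise an assumption of the sketch.

```python
import re

# Hypothetical toy lexicon for illustration only. Real sentiment dictionaries
# such as AFINN or SentiWordNet are far larger and use their own scales.
TOY_LEXICON = {
    "good": 3,
    "great": 3,
    "tasty": 2,
    "bad": -3,
    "stale": -2,
    "awful": -3,
}


def tokenize(text):
    """Lowercase the text and return its runs of the letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def polarity_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity scores of the tokens; unknown words contribute 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def classify(text, lexicon=TOY_LEXICON):
    """Return "positive" if the summed score is above zero, else "negative".

    Mapping a score of exactly zero to "negative" is an arbitrary choice
    made for this sketch.
    """
    return "positive" if polarity_score(text, lexicon) > 0 else "negative"


if __name__ == "__main__":
    for review in ("The cookies were tasty and great!", "Stale chips, awful taste."):
        print(review, "->", polarity_score(review), classify(review))
```

A sketch like this ignores negation ("not good"), intensity modifiers and word sense, which is one reason the choice of dictionary and preprocessing matters for this family of methods.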
Like many other research fields, the field of sentiment analysis has been influenced by the rapid growth of machine learning [12, 13]. By creating a training dataset consisting of pieces of text labeled as positive or negative, a model can