Graduate Theses, Dissertations, and Problem Reports
2010
Proximity-based sentiment analysis
S. M. Shamimul Hasan, West Virginia University
Recommended Citation Hasan, S. M. Shamimul, "Proximity-based sentiment analysis" (2010). Graduate Theses, Dissertations, and Problem Reports. 4603. https://researchrepository.wvu.edu/etd/4603
PROXIMITY-BASED SENTIMENT ANALYSIS
by S. M. Shamimul Hasan
Thesis submitted to the College of Engineering and Mineral Resources at West Virginia University in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science
Approved by
Dr. Donald A. Adjeroh, Committee Chairperson Dr. Arun Abraham Ross Dr. James D. Mooney
Lane Department of Computer Science and Electrical Engineering
Morgantown, West Virginia 2010
Keywords: Sentiment Analysis, Text Mining, Proximity Pattern, Machine Learning, Natural Language Processing.
Copyright 2010 S. M. Shamimul Hasan
Abstract
PROXIMITY-BASED SENTIMENT ANALYSIS
by S. M. Shamimul Hasan
Sentiment analysis is an emerging field concerned with the detection of human emotions in textual data. It seeks to characterize opinionated or evaluative aspects of natural language text, thus helping people to discover valuable information in large amounts of unstructured data. Sentiment analysis can be used for grouping search engine results, analyzing news content, and analyzing reviews of books, movies, and sports, as well as blogs, web forums, etc. Sentiment (i.e., good or bad opinion) expressed in texts has been studied widely, and at three different levels: word, sentence, and document level. Several methods have been proposed for sentiment analysis, mostly based on common machine learning techniques such as Support Vector Machine (SVM), Naive Bayes (NB), and Maximum Entropy (ME).
In this thesis we explore a new methodology for sentiment analysis called proximity-based sentiment analysis. We take a different approach, by considering a new set of features based on word proximities in a written text. We focus on three different word-proximity-based features, namely, proximity distribution, mutual information between proximity types, and proximity patterns. We applied this approach to the analysis of the movie review domain and performed empirical research to demonstrate the performance of the proposed approach. The experimental results show that proximity-based sentiment analysis is able to extract sentiments from a specific domain, with performance comparable to the state-of-the-art. To the best of our knowledge, this is the first attempt at focusing on proximity-based features as the primary features in sentiment analysis.
DEDICATION
First of all, I want to dedicate this thesis to Almighty “Allah” for giving me the opportunity to prove myself in this world. I also want to dedicate this thesis to my lovely and caring parents, “Hasina Sardar” and “Shahjahan Sardar”.
ACKNOWLEDGMENTS
I would like to express my gratitude and acknowledgements to all the people who have supported me, and to put in words how fortunate I am to have all these people around. I would like to thank all my professors, friends, and family members, without whom this difficult task wouldn't have been completed.
My deepest gratitude and appreciation goes to Dr. Donald A. Adjeroh for his guidance, patience, support and encouragement throughout my study at West Virginia University, which led to this thesis. I would like to thank my committee members, Dr. Arun Abraham Ross and Dr. James D. Mooney for reviewing my thesis and providing knowledgeable comments and suggestions.
I especially thank the Lane Department of Computer Science and Electrical Engineering at West Virginia University for providing me the opportunity to study here. My thanks to all faculty members from whom I have taken courses. Their interesting classes have inspired me and kept me loving computer science.
I would like to thank my parents and my brothers for their undying prayers, love, encouragement and moral support. ‘Thank you’, Mom and Dad for standing behind me and encouraging me always to take a step forward, you are the greatest people in the world. Last but not least, I want to thank all my friends and colleagues both in Bangladesh and in West Virginia University who stayed by me throughout this period of time constantly encouraging me to work hard and at the same time who made my stay and work at the West Virginia University a very pleasurable one.
TABLE OF CONTENTS
Abstract ........................................................................ ii
Dedication ...................................................................... iii
Acknowledgements ................................................................ iv
Table of Contents ............................................................... v
List of Figures ................................................................. vi
List of Tables .................................................................. vii
Chapter 1: Introduction ......................................................... 1
    1.1 Introduction ............................................................ 1
    1.2 Problem and Motivation .................................................. 2
    1.3 Prior Work .............................................................. 3
    1.4 General Approach and Thesis Contribution ................................ 3
    1.5 Thesis Organization ..................................................... 4
Chapter 2: Background and Literature Review ..................................... 5
Chapter 3: Proximity-Based Analysis ............................................. 10
    3.1 Data Cleaning ........................................................... 11
    3.2 Measuring Distances ..................................................... 11
        3.2.1 Polarity Words .................................................... 12
        3.2.2 Polarity & Non-Polarity Words ..................................... 13
    3.3 Proximity Models ........................................................ 14
    3.4 Proximity-Based Features ................................................ 15
Chapter 4: Sentiment Classification Techniques .................................. 19
    4.1 Unsupervised Approach ................................................... 19
        4.1.1 Feature Weighting ................................................. 20
    4.2 Mean & Median Approach .................................................. 20
    4.3 Machine Learning Approach ............................................... 21
        4.3.1 Naïve Bayes (NB) Classification ................................... 21
        4.3.2 Support Vector Machines (SVM) ..................................... 22
        4.3.3 K-Nearest Neighbours (K-NN) ....................................... 23
        4.3.4 J-48 Algorithm .................................................... 23
        4.3.5 Multilayer Perceptron (MLP) ....................................... 24
Chapter 5: Results & Discussion ................................................. 25
    5.1 Dataset ................................................................. 25
    5.2 Packages Used ........................................................... 25
    5.3 Results ................................................................. 25
        5.3.1 Polarity Word Count ............................................... 26
        5.3.2 Unsupervised Approach ............................................. 26
        5.3.3 Mean-Median Approach .............................................. 29
        5.3.4 Machine Learning Approach ......................................... 33
        5.3.5 Different Dataset ................................................. 40
Chapter 6: Conclusion & Future Work ............................................. 41
References ...................................................................... 42
LIST OF FIGURES
Figure 1: Assumption on human writing style ..................................... 10
Figure 2: General schematic diagram of the approach ............................. 11
Figure 3: Segmentation of input text ............................................ 12
Figure 4: Two proximity models .................................................. 15
Figure 5: Illustration of an SVM classifier ..................................... 22
Figure 6: Three-layer multilayer perceptron ..................................... 24
Figure 7: Correct classification of a positive review ........................... 27
Figure 8: Correct classification of a negative review ........................... 27
Figure 9: An undefined situation ................................................ 27
Figure 10: Impact of feature weighting on overall classification accuracy ....... 28
Figure 11: Mean of the training set ............................................. 29
Figure 12: Median of the training set ........................................... 29
Figure 13: Classification accuracy, mean (positive) ............................. 31
Figure 14: Classification accuracy, mean (negative) ............................. 31
Figure 15: Classification accuracy, median (positive) ........................... 32
Figure 16: Classification accuracy, median (negative) ........................... 32
Figure 17: Mutual information between proximity types ........................... 36
LIST OF TABLES
Table 1: A confusion table ...................................................... 6
Table 2: Existing work on sentiment analysis .................................... 8
Table 3: Distance notation with polarity words .................................. 13
Table 4: Distance notation with polarity and non-polarity words ................. 14
Table 5: Polarity pattern example ............................................... 18
Table 6: Polarity and non-polarity pattern example .............................. 18
Table 7: Performance of polarity word count ..................................... 26
Table 8: Performance of unsupervised approach (both positive and negative reviews) 26
Table 9: Classification accuracy by considering mean of the training set (Scheme 1) 30
Table 10: Classification accuracy by considering median of the training set (Scheme 1) 30
Table 11: Overall classification accuracy by considering mean (Scheme 2) ........ 32
Table 12: Overall classification accuracy by considering median (Scheme 2) ...... 32
Table 13: Overall classification accuracy by considering mean (Scheme 3) ........ 33
Table 14: Proximity distribution, 1 bin ......................................... 34
Table 15: Proximity distribution, 9 bins ........................................ 34
Table 16: Proximity distribution, 10 bins ....................................... 35
Table 17: Proximity distribution, 100 bins ...................................... 35
Table 18: Best performance ...................................................... 35
Table 19: Mutual information between proximity types ............................ 36
Table 20: Proximity patterns (N=2) .............................................. 37
Table 21: Proximity patterns (N=3) .............................................. 37
Table 22: Proximity patterns (N=5) .............................................. 37
Table 23: Best accuracy rate of the above three ................................. 38
Table 24: Best 3 fusion results after combining 3 algorithms .................... 38
Table 25: Best 3 fusion results after combining 5 algorithms .................... 38
Table 26: Comparative performance with state-of-the-art algorithms .............. 39
Table 27: Drug review ........................................................... 40
Table 28: Music review .......................................................... 40
Chapter 1: Introduction
1.1 Introduction
Enormous volumes of textual data are available online. The amount of data accumulated each day by various business, scientific, and governmental organizations around the world is daunting. It has become impossible for human analysts to cope with such vast amounts of data: without special tools, they can no longer make sense of the enormous volumes of data that require processing in order to make decisions.
For textual data analysis, statistical Natural Language Processing (NLP) and machine learning based NLP have gained great popularity since the late 1980s. This trend has been driven by the rapid growth of the Internet and by progress in computer technology, including the huge increase in CPU speed and the improved capacity-to-cost ratio of storage devices. These developments have allowed us to obtain a large number of natural language corpora, and to apply complex but powerful machine learning methods to large-scale document collections. This success in the development of statistical NLP has led to improvements in fundamental text analysis tasks such as part-of-speech (POS) tagging, phrase chunking, dependency analysis, and parsing. Using these components as fundamental building blocks, many NLP researchers have become interested in analyzing text “semantically” or “contextually”. For example, named entity tagging, semantic role tagging, and discourse parsing are being investigated in the NLP field. This move towards taking contextual or semantic information into account has occurred in application areas of NLP such as text classification, text summarization, information retrieval, and question answering. That is, the research interest in application areas of NLP has moved on to more challenging tasks. Even in text classification, one of the traditional NLP tasks, the target classes have recently been diversified from topics such as ‘sports’ and ‘economics’ to properties of the texts themselves, such as ‘polarity’ or ‘subjectivity’; this is called sentiment classification or sentiment analysis. Sentiment classification is defined as the task of classifying texts according to the sentiment they contain, such as like vs. dislike, recommend vs. not recommend, and subjective vs. objective (Suzuki, 2005).
Sentiment analysis methodology, concerned with the determination of human emotions from textual data, can be used to better understand users' emotions and preferences. This is an emerging field. Positive sentiment indicates praise and negative sentiment indicates criticism. Sentiment analysis can be used for grouping search engine results, analyzing news content, and analyzing reviews of books, movies, and sports, as well as different types of blogs, social networks, and web forums. Today, it is usual for business organizations to receive large volumes of opinion data from their customers. Human analysis of these types of data is, however, time consuming, expensive, and can be highly subjective. Sentiment analysis technology can be used by business organizations to better understand their customers. Sentiment analysis does not completely eliminate human participation in the task; rather, it significantly simplifies the job and allows an analyst to manage the process of extracting sentiment from text. It automates the process of finding relationships and patterns in textual data and delivers results that can be either utilized in an automated decision support system or assessed by a human analyst.
1.2 Problem and Motivation
It is hard for a computer to automatically extract the tone and meaning of a document, because people express things in many different ways. In this thesis, we propose a novel approach to sentiment analysis, which we call proximity-based sentiment analysis. Our basic motivation for this approach starts with answering the question “how do people write?”. We believe that most people follow a basic structure in writing, even without thinking about it, and that the sentiment of a document can be identified from its structural characteristics and the distribution of distances between the words in the document.
We applied our proximity-based sentiment analysis technique to the problem of movie review analysis. Accurate automatic prediction of how much an audience will enjoy a movie is still a challenging research question. Different classification techniques can be used, and the choice of classifier plays a big role for this type of dataset. In this thesis we systematically study the accuracy of different classifiers in the movie review domain, where the inputs to the classifiers are proximity-based features.
1.3 Prior Work

An early approach to sentiment analysis was to scan a document and count the occurrences of certain keywords, based on which the document is categorized as positive or negative. That method, however, was error-prone and fails to capture the subtleties that bring human language to life. Different approaches to sentiment analysis have recently been proposed. Most of these are based on common machine learning techniques such as Support Vector Machine (SVM), Naïve Bayes (NB), and Maximum Entropy (ME). Researchers doing sentiment analysis focus on specific tasks, such as finding the sentiments of words (Hatzivassiloglou & McKeown, 1997), subjective expressions (Wilson, Wiebe, & Hoffmann, 2005), subjective sentences (Pang & Lee, 2004), and topics (Hiroshi, Tetsuya, & Hideo, 2004). Pang and Lee (2004) proposed a machine-learning method that can be used to improve the effectiveness of sentiment classification. Dasgupta and Ng (2009) proposed a text clustering algorithm based on the author's mood. Prabowo and Thelwall (2009) proposed a combined approach for sentiment analysis.
1.4 General Approach and Thesis Contribution

In this thesis, our contribution is proximity-based sentiment analysis. Our basic motivation for this research starts with a psychological question: “how do people write?”. We believe that when a person starts writing positively about a topic or subject, they continue with this positive trend for a period of time. Later they will use inflection words like “however” and then start writing negatively about the topic. We assumed that in a paragraph people don't repeatedly write one positive and one negative word together. Therefore we used proximity as a feature in our sentiment analysis work.
Although previous machine learning methods have shown improved performance in sentiment analysis, perhaps a more fundamental issue is how we populate the feature space upon which these machine learning methods rely. In this work, we take a different approach, by considering a new set of features based on word proximities in a written text. To our knowledge, this is the first attempt at focusing on proximity-based features as the primary features in sentiment analysis. Combining this new feature for sentiment analysis with other standard approaches, such as part-of-speech (POS) tagging, bag-of-words (BOW), phrases, or higher-order n-grams will no doubt lead to further improvements.
1.5 Thesis Organization
Chapter 2 provides general background and existing work in the area of sentiment analysis. A survey of the previous work on sentiment analysis is presented.
Chapter 3 introduces the concept of proximity based sentiment analysis which is the novel idea that we applied in our research. Proximity based analysis is discussed in detail including our proximity model. A description of three different feature sets that we applied in our research is presented.
Chapter 4 explains the detail of our research design and the classification techniques that we applied. We describe three different approaches and the machine learning algorithms that we used for our research in detail.
In Chapter 5, we discuss the results that we obtained. Experiments are undertaken on the sentiment analysis task to evaluate the effect of the proposed method. Here we investigate the results of our three approaches: the unsupervised approach, the mean-median approach, and the machine learning approach. Results for variations of the proposed scheme with different parameters are also included.
Finally, in Chapter 6 we conclude by summarizing the thesis and providing an outlook for future research.
Chapter 2: Background and Literature Review
As mentioned in Chapter 1, the topic of this thesis is an evaluation of the effect of proximity based sentiment analysis. Therefore, before explaining the proposed methods, this chapter provides a survey of previous work on sentiment analysis.
Sentiment analysis can help business organizations, acting like a virtual currency by providing a better understanding of customer opinions. Researchers are using sentiment analysis techniques to analyze data from blogs, social networks, bulletin boards, web forums, product reviews, movie reviews, hate groups, recommender systems, etc. Early research on sentiment analysis (Hearst, M., 1992; Sack, W., 1994) used models inspired by cognitive linguistics or manual or semi-manual construction of discriminant-word lexicons (Huettner and Subasic, 2000). Kessler et al. (1997) did research into classifying texts based on their sentiment or genre. As discussed in (Prabowo and Thelwall, 2009), researchers doing sentiment analysis focus on specific tasks, such as finding the sentiments of words (Hatzivassiloglou & McKeown, 1997), subjective expressions (Wilson, Wiebe, & Hoffmann, 2005), subjective sentences (Pang & Lee, 2004), and topics (Hiroshi, Tetsuya, & Hideo, 2004), or extracting the sources of opinions (Choi, Cardie, Riloff, and Patwardhan, 2005). Pang and Lee (2004) proposed a machine-learning method that can be used to improve the effectiveness of sentiment classification. Dasgupta and Ng (2009) proposed a text clustering algorithm based on the author's mood.
A number of researchers have investigated hate groups using sentiment analysis techniques. For instance, Glaser, Dixit, and Green (2002) studied hate crime on the internet, while Crilley (2001) and Schafer (2002) studied the use of the Internet by terrorists, extremists, and activists. Zhou et al. (2005) and Abbasi and Chen (2005) did a thorough investigation of US hate group web sites. Burris, Smith, and Strahm (2000) discussed the importance of analyzing web forum and chat-room content, and used social network analysis to examine the inter-organizational structure of the white supremacist movement. Gamon (2004) used a customer feedback dataset to assign sentiment to documents using a 4-point scale. Similar work was proposed by Pang and Lee (2005) for movie review analysis. Earlier, Pang et al. (2002) presented sentiment analysis of movie reviews, while Dave,
Lawrence, and Pennock (2003) used a number of machine learning techniques to analyze product review data.
For automatic sentiment analysis, the following approaches have been proposed: Natural Language Processing (NLP) and pattern-based methods (Hiroshi et al., 2004); machine learning algorithms such as Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machine (SVM) (Joachims, 1998); and unsupervised learning (Turney, 2002). (Rudi & Paul, 2007) proposed a method for determining sentiment orientation. More recently, Prabowo and Thelwall (2009) discussed a combined approach (rule-based classification, supervised learning, and machine learning) for sentiment analysis.
In machine learning based classification, two sets of documents are required: a training set and a test set. A training set (Tr) is used by an automatic classifier to learn the differentiating characteristics of documents, and a test set (Te) is used to validate the performance of the automatic classifier. Performance is commonly measured with a confusion table, which makes it easy to see whether the system is confusing two classes.

Table 1: A confusion table.

                    Machine says yes    Machine says no
    Human says yes  tp                  fn
    Human says no   fp                  tn
Existing work on sentiment analysis is listed in Table 2, which shows the different types of objectives along with the associated models used and the experimental results produced. In an ideal scenario, all the experimental results are measured based on the micro-averaged and macro-averaged precision, recall, and F1, as explained below:

    Precision (P) = tp / (tp + fp);    Recall (R) = tp / (tp + fn);
    Accuracy (A) = (tp + tn) / (tp + tn + fp + fn);    F1 = 2PR / (P + R)

1. Micro-averaging: Given a set of confusion tables, a new two-by-two contingency table is generated. Each cell in the new table represents the sum of the corresponding cells from within the set of tables. Given the new table, the average performance of an automatic classifier, in terms of its precision and recall, is measured.
2. Macro-averaging: Given a set of confusion tables, a set of values is generated. Each value represents the precision or recall of an automatic classifier for each category. Given these values, the average performance of an automatic classifier, in terms of its precision and recall, is measured (Prabowo & Thelwall, 2009).
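The measures above can be sketched in a few lines of code. The confusion-table counts below are hypothetical, purely for illustration of the arithmetic.

```python
# A minimal sketch of precision, recall, accuracy, F1, and the two
# averaging schemes. Each table is a (tp, fp, fn, tn) tuple of counts.

def scores(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 for one confusion table."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    a = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, a, f1

def micro_average(tables):
    """Sum the cells of all confusion tables, then score the summed table."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    tn = sum(t[3] for t in tables)
    return scores(tp, fp, fn, tn)

def macro_average(tables):
    """Score each confusion table separately, then average per measure."""
    per_table = [scores(*t) for t in tables]
    n = len(per_table)
    return tuple(sum(s[i] for s in per_table) / n for i in range(4))

# Two hypothetical per-class confusion tables:
tables = [(80, 20, 10, 90), (40, 10, 30, 120)]
print(micro_average(tables))
print(macro_average(tables))
```

Micro-averaging weights every document equally, so large classes dominate; macro-averaging weights every class equally regardless of size.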
In our research, we applied our scheme to movie review classification. Kimitaka et al. proposed a method to classify movie review documents into positive or negative opinions, using multiple classifiers for the review document classification task. The method consists of three classifiers based on SVMs, ME, and score calculation. They applied two voting methods, as well as SVMs, to the process of integrating the single classifiers. The integrated methods improved the accuracy compared with the three single classifiers. The method identified the class (positive or negative) of a document on the basis of the distances measured from the hyperplane of each classifier, and obtained better accuracy than the single classifiers. This method, however, had a problem in determining the final output (positive or negative): the classifiers' outputs had to be normalized manually because the scale of the scoring method differed from that of the SVMs. To address this, they applied a third machine learning method (Maximum Entropy) to the method based on scoring and SVMs. The authors also compared two voting processes, naive voting and weighted voting, as well as integration with SVMs. Their methods consisting of three classifiers outperformed the single classifiers; even naive voting produced higher accuracy. In their experiments, SVMs in the integration process produced the best performance, and when the weights for the voting method were optimally tuned, it obtained the best accuracy.
Although machine learning methods have shown improved performance in sentiment analysis, perhaps a more fundamental issue is how we populate the feature space upon which these machine learning methods rely. In this work, we take a different approach, by considering a new set of features based on word proximities in a written text. To the best of our knowledge, this is the first attempt at focusing on proximity-based features as the primary features in sentiment analysis. Combining this new feature for sentiment analysis with other standard approaches, such as part-of-speech (POS) tagging, bag-of-words (BOW), phrases, or higher-order n-grams will no doubt lead to further improvements.
Table 2: Existing work on sentiment analysis, adapted from (Prabowo & Thelwall, 2009). For each study: objective (no. of classes), n-gram, model, data source, evaluation method, data set sizes (Tr/Te), and results (accuracy, precision, recall, F1), where reported.

- Gamon (2004). Documents; SVM; customer feedback; ten-fold cross validation; Tr 36,796, Te 4,084. Accuracy 77.5.
- Pang and Lee (2005). Documents; SVM, regression, metric labeling; movie reviews (5,006); ten-fold cross validation. Accuracy 66.3.
- Choi et al. (2005). Sources of sentiments; CRF and AutoSlog; MPQA corpus; ten-fold cross validation; Tr 135, Te 400. Precision 70.2–82.4, recall 41.9–60.6, F1 59.2–69.4.
- Wilson et al. (2005). Expressions; BoosTexter; MPQA corpus (13,183 expressions); ten-fold cross validation. Accuracy 73.6–75.9, precision 68.6–72.2 / 74.0–77.7, recall 45.3–56.8 / 85.7–89.9, F1 55.7–63.4 / 80.7–82.1.
- König and Brill (2006). Documents; pattern-based, SVM, hybrid; movie reviews (1,000(+)) and customer feedback; five-fold cross validation. Accuracy >91.
- Hatzivassiloglou and McKeown (1997). Adjectives; nonhierarchical clustering; WSJ corpus; 657 adj(+), 679 adj(−). Accuracy 78.1–92.4.
- Pang, Lee, and Vaithyanathan (2002). Documents; unigrams and bigrams; NB, ME, SVM; movie reviews (700(+), 700(−)); three-fold cross validation. Accuracy 77–82.9.
- Turney (2002). Documents; PMI-IR; automobile, bank, movie, and travel reviews (240(+), 170(−)). Accuracy 65.8–84.
- Yi et al. (2003). Topics-sentiments; NLP, pattern-based; digital camera and music reviews (735): accuracy 85.6, precision 87, recall 56; petroleum and pharmaceutical web pages (4,227): accuracy 90–93, precision 86–91.
- Nasukawa and Yi (2003). Topics-sentiments; NLP, pattern-based; web pages (118(+), 58(−)): accuracy 94.3, recall 28.6; camera reviews (255): accuracy 94.5, recall 24.
- Dave, Lawrence, and Pennock (2003). Documents; unigrams, bigrams, trigrams; scoring, smoothing, NB, ME, SVM; product reviews; macro-averaged. Tr 13,832(+)/4,389(−), Te 25,910(+)/5,664(−): accuracy 88.9. Tr 2,016(+)/2,016(−), Te 224(+)/224(−): accuracy 85.8.
- Hiroshi et al. (2004). Topics-sentiments; NLP, pattern-based; camera reviews (200). Accuracy 89–100, recall 43.
- Pang and Lee (2004). Documents; unigrams; NB, SVM; movie reviews (1,000(+), 1,000(−)); ten-fold cross validation. Accuracy 86.4–87.2.
- Kim and Hovy (2004). Expressions; probabilistic based; DUC corpus (231 adjectives, 251 verbs, 100 sentences); ten-fold cross validation. Accuracy 79.1–81.2 (adjectives and verbs), 81 (sentences); recall 93.2.
Chapter 3: Proximity-Based Analysis
Sentiment analysis methods can be categorized into two broad groups: (i) symbolic techniques and (ii) machine learning techniques. The symbolic approach uses manually crafted rules and lexicons, whereas the machine learning approach uses unsupervised, weakly supervised, or fully supervised learning to construct a model from a large training corpus. Symbolic techniques include lexicon-based methods such as the bag-of-words approach, using web search to calculate the semantic orientation of a tuple, using WordNet to determine the orientation and emotional content of a word, and combinations of lexical and grammatical approaches. Machine learning approaches use unigrams, n-grams, lemmas, negation, opinion words, and adjectives for feature selection, together with classification methods such as support vector machines, naïve Bayes multinomial, and maximum entropy.
In this thesis, our contribution is proximity-based sentiment analysis. Our basic motivation for this research starts with a psychological question: “how do people write?”. We believe that when a person starts writing positively about a topic or subject, they continue with this positive trend for a period of time. Later they will use inflection words like “however” and then start writing negatively about the topic. We assumed that in a paragraph people don't repeatedly write one positive and one negative word together. This model of writing is captured schematically in Figure 1.
POSITIVE POSITIVE POSITIVE NEGATIVE NEGATIVE POSITIVE ….. NEGATIVE NEGATIVE NEGATIVE POSITIVE POSITIVE NEGATIVE …..
POSITIVE NEGATIVE POSITIVE NEGATIVE POSITIVE NEGATIVE …..
Figure 1: Assumption on human writing style.
Typically segments of a written text (e.g. paragraphs or sentences) capture a concept or trend of thought over a short period of time. Such trends could fluctuate as we move along the written document. Within a segment, words used can give ideas on whether there is a positive, negative, or no sentiment being expressed in the segment. From the foregoing,
we have the following hypothesis: “The average distance between positive-oriented (negative-oriented) words is expected to be small for segments bearing positive (negative) sentiments.” Consequently, “the average distance between positive-oriented (negative-oriented) words is expected to be relatively large for segments bearing negative (positive) sentiments.”
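The hypothesis can be checked on a toy segment: when positive words cluster, their average pairwise distance is small. The word lists and the segment below are made up purely for illustration.

```python
# Toy illustration of the proximity hypothesis. POS_WORDS / NEG_WORDS
# are a hypothetical mini-lexicon, not the one used in the thesis.

POS_WORDS = {"great", "fun", "superb"}
NEG_WORDS = {"dull", "weak"}

def average_distance(tokens, lexicon):
    """Average distance over all pairs of lexicon words in the segment."""
    pos = [i for i, t in enumerate(tokens) if t in lexicon]
    pairs = [(i, j) for k, i in enumerate(pos) for j in pos[k + 1:]]
    return sum(j - i for i, j in pairs) / len(pairs)

# A positive-leaning segment: the positive words sit close together.
segment = "great fun superb cast dull acting story ending weak plot".split()
print(average_distance(segment, POS_WORDS))   # small: positives cluster
print(average_distance(segment, NEG_WORDS))   # larger: negatives are spread out
```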
Figure 2 shows a schematic diagram of the proposed approach. Below we provide further details on the proposed proximity-based sentiment analysis.
Figure 2: General schematic diagram of the approach.
3.1 Data Cleaning

In order to capture the proximity between words, we first automatically remove punctuation and stop words from the input text. After that, we extract all the unique words of the given text. We did not perform any stemming in this phase.
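The cleaning step can be sketched as follows. The stop-word list here is a tiny illustrative stand-in for whatever list was actually used.

```python
# A minimal sketch of the cleaning phase: strip punctuation, drop stop
# words, and collect the unique words (no stemming is applied).
import re

STOP_WORDS = {"the", "of", "and", "a", "an", "is", "in", "to", "it"}

def clean(text):
    """Return the cleaned token stream: no punctuation, no stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def unique_words(tokens):
    """Unique words in order of first occurrence."""
    seen = []
    for t in tokens:
        if t not in seen:
            seen.append(t)
    return seen

tokens = clean("The movie was funny, and the ending was funny too.")
print(tokens)
print(unique_words(tokens))
```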
3.2 Measuring Distances

We identify the polarity of each word using the Stone dictionary (Stone et al., 1966). Here, polarity means positive or negative. For distance measurement we first considered only polarity words, that is, only positive and negative words. After that, we also considered polarity and non-polarity words together, where non-polarity words are those that do not appear in the Stone dictionary.
3.2.1 Polarity Words

The cleaned text and unique word list built in the previous phase allow us to perform proximity-based analysis. We divide the input text into a number of segments or windows, each containing 100 words. Figure 3 visualizes this segmentation process for a sample movie review, with each 100-word segment marked on the text.

Figure 3: Segmentation of input text.
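The segmentation into fixed-size windows can be sketched as:

```python
# Split a cleaned token stream into consecutive 100-word segments,
# as illustrated in Figure 3. The window size is a parameter.

def segment(tokens, size=100):
    """Split tokens into consecutive segments of `size` words each."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

words = [f"w{i}" for i in range(250)]   # a toy 250-word document
parts = segment(words)
print([len(p) for p in parts])          # [100, 100, 50]
```

The last window may be shorter than 100 words when the document length is not a multiple of the window size.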
Now we are interested in computing the distances between word pairs. We measure the distances for the following three types of pairs.
• POSITIVE-POSITIVE words
• NEGATIVE-NEGATIVE words
• POSITIVE-NEGATIVE words
From the unique word list we select a pair of words. Then, using the Stone Dictionary (Stone et al., 1966), we identify the polarity of each word in the pair; we consider only two polarity classes, positive and negative. We then measure the distances between the pair of words and count the number of occurrences of that pair at each particular distance within a segment of the given text. Our distance range is 1-100, since we used segments of 100 words each. We repeated the above procedure in each
segment for all possible combinations of words in the unique word list. Consider the following example text segment.
Example: the funniest horror movie ever made and wile coyote evil dead dawn is the result of an unstable fusion.
The word “funniest” is a positive word and “horror” is a negative word. There are no words between “funniest” and “horror”, so we record a positive-negative pair at distance 1. In a similar manner, the distance between the negative words “horror” and “dead” is 7, while that between “dead” and “unstable” is 3 (ignoring stop words such as “of”, “and”, “the”, etc.).
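The distance measurement in this example can be sketched as follows. The stop-word list here is a small illustrative stand-in for the cleaning step described earlier; the function name is ours.

```python
# Hypothetical minimal stop-word list for illustration only.
STOPWORDS = {"the", "and", "of", "an", "a", "is"}

def word_distance(segment, w1, w2):
    """Distance between the first occurrences of w1 and w2,
    counted over content (non-stop) words only."""
    words = [w for w in segment.lower().split() if w not in STOPWORDS]
    return abs(words.index(w1) - words.index(w2))

text = ("the funniest horror movie ever made and wile coyote "
        "evil dead dawn is the result of an unstable fusion")
print(word_distance(text, "funniest", "horror"))  # 1
print(word_distance(text, "horror", "dead"))      # 7
print(word_distance(text, "dead", "unstable"))    # 3
```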
Thus, for each positive and each negative review we calculated all of the above-mentioned distances. We used the following notation (Table 3) for the reviews and their distances.
Table 3: Distance notation with polarity words.

True Class   Distance                        Notation
POSITIVE     POSITIVE-POSITIVE WORDS (++)    P++
POSITIVE     NEGATIVE-NEGATIVE WORDS (--)    P--
POSITIVE     POSITIVE-NEGATIVE WORDS (+-)    P+-
NEGATIVE     POSITIVE-POSITIVE WORDS (++)    N++
NEGATIVE     NEGATIVE-NEGATIVE WORDS (--)    N--
NEGATIVE     POSITIVE-NEGATIVE WORDS (+-)    N+-
3.2.2 Polarity & Non-polarity Words
Here we perform distance measurements similar to those in Section 3.2.1, but we also consider non-polarity words. We measure the distances for the following six types of pairs.
• POSITIVE-POSITIVE words
• NEGATIVE-NEGATIVE words
• POSITIVE-NEGATIVE words
• POSITIVE-NON-POLARITY words
• NEGATIVE-NON-POLARITY words
• NON-POLARITY-NON-POLARITY words
For each positive and each negative review we calculated all of the above-mentioned distances. We used the following notation (Table 4) for the reviews and their distances.

Table 4: Distance notation with polarity and non-polarity words.

True Class   Distance                              Notation
POSITIVE     POSITIVE-POSITIVE WORDS (++)          P++
POSITIVE     NEGATIVE-NEGATIVE WORDS (--)          P--
POSITIVE     POSITIVE-NEGATIVE WORDS (+-)          P+-
POSITIVE     POSITIVE-NON-POLARITY WORDS (+0)      P+0
POSITIVE     NEGATIVE-NON-POLARITY WORDS (-0)      P-0
POSITIVE     NON-POLARITY-NON-POLARITY WORDS (00)  P00
NEGATIVE     POSITIVE-POSITIVE WORDS (++)          N++
NEGATIVE     NEGATIVE-NEGATIVE WORDS (--)          N--
NEGATIVE     POSITIVE-NEGATIVE WORDS (+-)          N+-
NEGATIVE     POSITIVE-NON-POLARITY WORDS (+0)      N+0
NEGATIVE     NEGATIVE-NON-POLARITY WORDS (-0)      N-0
NEGATIVE     NON-POLARITY-NON-POLARITY WORDS (00)  N00
3.3 Proximity Models
As mentioned in Chapter 1, the main topic of this thesis is an evaluation of the effect of exploiting proximity information within text for sentiment analysis. In this section we formalize the proposed methods of this thesis.
We consider two models of proximity between words in a text. One is an all-against-all model of proximity and the other is the sequential model. The two models are shown in Figure 4. Thus, for a given document, we can compute the proximity types, and their distances, based on the chosen proximity model. We can observe that the all-against-all
model contains much more information than the sequential model. However, it requires much more time, typically O(n²), where n is the number of words. The sequential model is more succinct, contains less redundant information, and requires less time to compute, typically O(n). Our experiments are based on the all-against-all model.
Figure 4: Two Proximity Models; (a) all-against-all model (b) sequential model
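The two models can be contrasted in code. This is an illustrative sketch on a polarity sequence ('+' positive, '-' negative); the function names are ours, and the sequential model is assumed to pair only consecutive polarity words.

```python
from itertools import combinations

def all_against_all(polarities):
    """Every ordered pair (i < j) with its distance: O(n^2) pairs."""
    return [(polarities[i], polarities[j], j - i)
            for i, j in combinations(range(len(polarities)), 2)]

def sequential(polarities):
    """Only consecutive polarity words: O(n) pairs."""
    return [(a, b) for a, b in zip(polarities, polarities[1:])]

seq = ['+', '-', '-', '+']
print(len(all_against_all(seq)))  # 6 pairs for 4 words
print(len(sequential(seq)))       # 3 pairs for 4 words
```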
3.4 Proximity-Based Features
Using the proximity models, we determine the word proximities for given pairs of words in the text being analyzed. The question then becomes how to extract features from the pairwise distances of the three proximity types for the sentiment analysis problem. We consider three proximity-based features:
Proximity Distributions: The distribution of pairwise distances returned by the proximity models, for distances from 1 to 100. We considered different numbers of bins, from 1 bin (cumulative distance over all
the bins), to 10 bins (equi-spaced), 9 bins (non-uniform spacing), and 100 bins (one distance per bin). The following are the three distance ranges used in our research. Range 1 consists of 10 equal bins. For each review with polarity words, we take the sum of counts for the first 10 distances and divide it by the total count of ++, -- and +- pairs in that review for normalization, and we do the same for the second 10 distances, the third 10, and so on. We repeat this procedure for polarity and non-polarity words, with six distance pairs. Range 2 has 9 bins; the bins are unequal, but the normalization follows the same procedure as for the 10 bins. The first three dimensions of Range 3 are the binned distributions themselves, the next three are the distances between them, and the last three are correlations. For the first three dimensions of Range 3 we used 10 bins (Range 1), 9 bins (Range 2), and 1 bin (cumulative over distances 1-100).
Range 1: 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100.
Range 2: 1-3, 4-8, 9-15, 16-25, 26-40, 41-55, 56-70, 71-85, 86-100.
Range 3: [P(++, --, +-)  P(F1(++,+-), F2(++,--), F3(+-,--))  P(C1(++,+-), C2(++,--), C3(+-,--))]
         [N(++, --, +-)  N(F1(++,+-), F2(++,--), F3(+-,--))  N(C1(++,+-), C2(++,--), C3(+-,--))]
F(++,+-) = Distance(++,+-); C(++,+-) = Correlation(++,+-); P = positive review; N = negative review.
Here, Distance means the Manhattan distance:

Distance(X, Y) = Σ_{i=1..n} |x_i − y_i|
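The Range 1 normalization and the Manhattan distance used for Range 3 can be sketched as follows. The distance lists are made-up sample data, and the function names are ours.

```python
def binned_histogram(distances, n_bins=10, max_dist=100):
    """Normalized counts of pairwise distances per equal-width bin (Range 1)."""
    width = max_dist // n_bins
    counts = [0] * n_bins
    for d in distances:
        counts[min((d - 1) // width, n_bins - 1)] += 1
    total = sum(counts) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]

def manhattan(x, y):
    """Distance(X, Y) = sum_i |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y))

h1 = binned_histogram([1, 5, 12, 55, 99])
h2 = binned_histogram([2, 7, 15, 60, 95])
print(manhattan(h1, h2))  # 0.0 -- both samples fall in the same bins
```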
For the correlation calculation we used the sample Pearson correlation coefficient:

r = [ n Σ x_i y_i − (Σ x_i)(Σ y_i) ] / sqrt( [ n Σ x_i² − (Σ x_i)² ] [ n Σ y_i² − (Σ y_i)² ] )
Here, given a series of n measurements of X and Y, written as x_i and y_i (i = 1, 2, ..., n), the sample correlation coefficient r can be used to estimate the population Pearson correlation between X and Y.
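The sample Pearson correlation can be computed directly from the formula above. A minimal sketch (the sample vectors are made up for illustration):

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfect positive correlation)
```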
Mutual Information between Proximity Types: Given the three proximity types (++, +-, --), we expect that, for a given document, we can use the relationship between the types to gain some idea of the document's polarity. For example, for a document with a positive review, we expect the sequence of proximity values of type (++) to contain more information about the (+-) proximity type than the (--) sequence does. Thus, for such positive documents, we should expect the mutual information between the (++) and (+-) sequences to be larger than that between the (--) and (+-) sequences. To capture this intuition, we use the information-theoretic quantities of entropy for each sequence, and mutual information between pairs of proximity types. The entropy of a sequence S is defined as:
H(S) = − Σ_{i=1..|Σ|} p(s_i) log p(s_i) ………………..(1)
Here p(s_i) is the probability of the i-th symbol in the alphabet Σ. Let P_ij(k) be the joint probability of finding symbols i and j separated (in that order) by k positions in the sequence, and let p_i and p_j be the probabilities of symbols i and j, respectively, in the sequence. The mutual information is defined as follows (Cover & Thomas, 1991):

I(k) = Σ_{i∈Σ} Σ_{j∈Σ} P_ij(k) log [ P_ij(k) / (p_i p_j) ] ………………..(2)

Weiss and Herzel (1998) approximated the mutual information as:

I(k) = (1 / (2 ln 2)) Σ_{i∈Σ} Σ_{j∈Σ} [ D_ij²(k) / (p_i p_j) ] + O(D_ij³(k)) ………….(3)

Here D_ij(k) = P_ij(k) − p_i p_j.
Based on this, they defined correlation functions C_ab(k) as a linear combination of the distances above:

C_ab(k) = Σ_{i∈Σ} Σ_{j∈Σ} a_i D_ij(k) b_j ………………………(4)

The weights a_i and b_j are binary digits (0 or 1). When a = b, we have the autocorrelation
function C_aa(k). For a given separation parameter k, we computed the mutual information for all proximity values 1, 2, …, k. For polarity and non-polarity words we applied the mutual information formula to the (++) and (+0) pairs and to the (--) and (-0) pairs.
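The entropy and mutual-information quantities of Eqs. (1) and (2) can be sketched over a polarity sequence. This is an illustrative implementation with our own function names; the separation parameter k follows the definition in the text.

```python
from math import log2
from collections import Counter

def entropy(seq):
    """H(S) = -sum_i p(s_i) log2 p(s_i), estimated from symbol frequencies."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

def mutual_information(seq, k):
    """I(k) = sum_ij P_ij(k) log2( P_ij(k) / (p_i p_j) ),
    where P_ij(k) is estimated from symbol pairs k positions apart."""
    n = len(seq)
    p = {s: c / n for s, c in Counter(seq).items()}
    pairs = list(zip(seq, seq[k:]))
    pij = {pair: c / len(pairs) for pair, c in Counter(pairs).items()}
    return sum(v * log2(v / (p[i] * p[j])) for (i, j), v in pij.items())

seq = ['+', '-'] * 50            # perfectly alternating polarity sequence
print(entropy(seq))              # 1.0 (two equiprobable symbols)
print(mutual_information(seq, 1))
```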
Proximity Patterns: In some sense, the above features consider summary or global information about the word proximities in a given document. They do not pay much attention to the context of these pairwise proximities, and hence may not capture local variability in the sentiments expressed by an author. Yet the sentiment of the overall document could depend critically on how the word polarities change within a small segment of the text, for example, the concluding part; a pattern-based feature could therefore be powerful on segmented text. To capture this observation, we define proximity patterns. First we consider only words with polarity, which yields a binary sequence describing the polarity of the words used in the document. We then consider words with and without polarity, which yields a ternary sequence. Polarity is defined by the Stone Dictionary. We then consider n-length blocks of these proximity patterns, and use them for classification. In this work we considered n = 2, 3, and 5. The number of possible patterns follows powers of 2 for polarity words and powers of 3 for polarity and non-polarity words: for polarity words, n = 2 gives 4 patterns, n = 3 gives 8 patterns, and n = 5 gives 32 patterns; for polarity and non-polarity words, n = 2 gives 9 patterns, n = 3 gives 27 patterns, and n = 5 gives 243 patterns. Examples are shown in Table 5 (polarity words) and Table 6 (polarity and non-polarity words).

Table 5: Polarity pattern example.

N=2             N=3              N=5
Pattern 1: ++   Pattern 1: +++   Pattern 1: +++++
Pattern 2: +-   Pattern 2: ++-   Pattern 2: ++++-
Pattern 3: -+   Pattern 3: +-+   Pattern 3: +++-+
Pattern 4: --   Pattern 4: +--   Pattern 4: +++--
                ...              ...
                Pattern 8: ---   Pattern 32: -----
Table 6: Polarity and non-polarity pattern example.

N=2             N=3              N=5
Pattern 1: ++   Pattern 1: +++   Pattern 1: +++++
Pattern 2: +-   Pattern 2: ++-   Pattern 2: ++++-
Pattern 3: +0   Pattern 3: ++0   Pattern 3: ++++0
Pattern 4: -+   Pattern 4: +-+   Pattern 4: +++-+
                ...              ...
Pattern 9: 00   Pattern 27: 000  Pattern 243: 00000
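Counting n-length blocks of a polarity sequence can be sketched as follows. This is an illustration with our own function name; '+', '-', and '0' denote positive, negative, and non-polarity words.

```python
from collections import Counter

def pattern_counts(polarity_seq, n):
    """Counts of every length-n sliding block in the polarity sequence."""
    blocks = [''.join(polarity_seq[i:i + n])
              for i in range(len(polarity_seq) - n + 1)]
    return Counter(blocks)

# A short ternary sequence; with symbol '0' included there are 3^n
# possible patterns, versus 2^n for polarity-only sequences.
seq = ['+', '-', '0', '+', '+', '-']
print(pattern_counts(seq, 2))  # e.g. the block '+-' occurs twice
```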
Chapter 4: Sentiment Classification Techniques
After extracting the features, the next phase is sentiment classification using the generated features. In this chapter we present the sentiment classification techniques that we employed, providing the background needed to understand the thesis. We performed our empirical study using three approaches: an unsupervised method, a mean & median method, and supervised methods using machine learning.
4.1 Unsupervised Approach
From the results obtained in the previous phase, for polarity words only, we sum all occurrences of the Positive-Positive, Negative-Negative, and Positive-Negative word pairs. As mentioned earlier, our distance range is 1-100, so we then compute the cumulative probability over the range 1-100 for each of the three pair types. We then perform classification into three classes: positive, negative, and undefined. We calculate the difference of the cumulative probabilities of the Positive-Positive and Negative-Negative pairs and classify based on the result, using three classification thresholds: 0, 0.025, and 0.05. The classification scheme is as follows:

Cumulative Probability = CP
Classification threshold τ = 0, 0.025, 0.05
∆ = CP(Positive-Positive) − CP(Negative-Negative)
Positive, if ∆ >= τ
Negative, if −∆ >= τ
Undefined, if |∆| < τ

For classification we also used a ratio instead of the subtraction. The scheme is:

∆ = CP(Positive-Positive) : CP(Negative-Negative)
Positive, if ∆ > 0.02
Negative, if ∆ < −0.02
Undefined, if |∆| in the range 0-0.02

We also calculated precision and recall based on the following formula.
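The difference-based thresholding scheme can be sketched as follows. This is a minimal illustration; the function name and sample cumulative probabilities are ours.

```python
def classify(cp_pos_pos, cp_neg_neg, tau=0.025):
    """Classify a review as 'positive', 'negative', or 'undefined' from
    the difference of its (++) and (--) cumulative probabilities."""
    delta = cp_pos_pos - cp_neg_neg
    if delta >= tau:
        return "positive"
    if -delta >= tau:
        return "negative"
    return "undefined"

print(classify(0.60, 0.40))  # positive
print(classify(0.40, 0.60))  # negative
print(classify(0.51, 0.50))  # undefined (|delta| < 0.025)
```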