Technological University Dublin ARROW@TU Dublin

Dissertations School of Computer Sciences

2009-03-01

Opinion mining with the SentiWordNet lexical resource

Bruno Ohana Technological University Dublin, [email protected]

Follow this and additional works at: https://arrow.tudublin.ie/scschcomdis

Part of the Computer Engineering Commons

Recommended Citation: Ohana, Bruno, "Opinion Mining with the SentiWordNet Lexical Resource" (2009). Dissertations. 25. https://arrow.tudublin.ie/scschcomdis/25

This Dissertation is brought to you for free and open access by the School of Computer Sciences at ARROW@TU Dublin. It has been accepted for inclusion in Dissertations by an authorized administrator of ARROW@TU Dublin. For more information, please contact [email protected], [email protected].

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License

Opinion Mining with the SentiWordNet Lexical Resource

Bruno Ohana

A dissertation submitted in partial fulfilment of the requirements of Dublin Institute of Technology for the degree of M.Sc. in Computing (Knowledge Management)

March 2009

I certify that this dissertation which I now submit for examination for the award of M.Sc. in Computing (Knowledge Management) is entirely my own work and has not been taken from the work of others, save and to the extent that such work has been cited and acknowledged within the text of my work.

This dissertation was prepared according to the regulations for postgraduate study of the Dublin Institute of Technology and has not been submitted in whole or part for an award in any other Institute or University.

The work reported on in this dissertation conforms to the principles and requirements of the Institute’s guidelines for ethics in research.

Signed: ______

Date: 31 March 2009

1. ABSTRACT

Sentiment classification concerns the application of automatic methods for predicting the orientation of sentiment present in text documents. It is an important subject in opinion mining research, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval.

SentiWordNet is a lexical resource of sentiment information for terms in the English language designed to assist in opinion mining tasks, where each term is associated with numerical scores for positive and negative sentiment information. A resource that makes term level sentiment information readily available could be of use in building more effective sentiment classification methods.

This research presents the results of an experiment that applied the SentiWordNet lexical resource to the problem of automatic sentiment classification of film reviews. First, a data set of relevant features extracted from text documents using SentiWordNet was designed and implemented. The resulting feature set was then used as input for training a support vector machine classifier to predict the sentiment orientation of the underlying film review. Several scenarios were executed, exploring variations in the parameters used to generate the data set, outlier removal and feature selection.

The results obtained are compared to other methods documented in the literature. It was found that they are in line with other experiments that propose similar approaches and use the same data set of film reviews, indicating SentiWordNet could become an important resource for the task of sentiment classification. Considerations on future improvements are also presented based on a detailed analysis of classification results.

Key words: opinion mining, sentiment classification, lexical resources, data mining, SentiWordNet.

To my Parents

ACKNOWLEDGEMENTS

I would like to express my sincere thanks to my supervisor, Brendan Tierney, for all the help and guidance provided on all stages of this dissertation and for the valuable insights and discussions that contributed to the results of this project.

I also wish to thank Andrea Esuli and Fabrizio Sebastiani, from the Italian Institute of Information Science and Technology, for promptly making the SentiWordNet lexical resource available for use in this research.

TABLE OF CONTENTS

1. ABSTRACT ...... II

TABLE OF FIGURES ...... IX

TABLE OF TABLES ...... X

1. INTRODUCTION ...... 12

1.1. KNOWLEDGE MANAGEMENT, DISCOVERY AND OPINIONS IN TEXT ...... 12

1.2. BACKGROUND ...... 14

1.3. RESEARCH PROBLEM...... 16

1.4. THE INTELLECTUAL CHALLENGE ...... 16

1.5. RESEARCH OBJECTIVES...... 17

1.6. METHODOLOGY...... 18

1.7. RESOURCES ...... 19

1.8. SCOPE AND LIMITATIONS ...... 20

1.9. ORGANISATION OF THIS DISSERTATION ...... 21

2. KNOWLEDGE CREATION AND DISCOVERY...... 23

2.1. KNOWLEDGE ORGANISATIONS ...... 23

2.2. KNOWLEDGE MANAGEMENT ...... 25
2.2.1. Characterising Knowledge ...... 27

2.3. KNOWLEDGE CREATION ...... 31
2.3.1. Nonaka’s Spiral of Knowledge (Nonaka, 1994) ...... 31
2.3.2. The Conditions for Knowledge Creation ...... 33

2.4. INFORMATION TECHNOLOGY AND KNOWLEDGE MANAGEMENT...... 35

2.5. KNOWLEDGE DISCOVERY ...... 38
2.5.1. Knowledge Discovery and Data Mining ...... 39
2.5.2. Implications to Knowledge Management ...... 40

2.6. CONCLUSION ...... 41

3. KNOWLEDGE DISCOVERY AND DATA MINING ...... 43

3.1. KNOWLEDGE DISCOVERY PROCESSES ...... 43
3.1.1. Analysis of Knowledge Discovery Processes ...... 44

3.1.2. Considerations on Knowledge Discovery Processes ...... 48

3.2. DATA MINING TECHNIQUES ...... 52
3.2.1. Goal Based Categorisation of Data Mining Techniques ...... 52
3.2.2. Summary and Considerations ...... 57
3.2.3. Considerations on the Data Set ...... 58

3.3. DATA MINING ALGORITHMS FOR CLASSIFICATION ...... 61
3.3.1. Introduction ...... 61
3.3.2. Supervised Learning Algorithms ...... 62
3.3.3. Nearest Neighbour Methods ...... 63
3.3.4. Tree Based Methods ...... 65
3.3.5. Naïve Bayes ...... 66
3.3.6. Large Margin Classifiers: Support Vector Machines ...... 67
3.3.7. Considerations on Classifier Techniques ...... 69
3.3.8. Evaluating Classifier Performance ...... 72
3.3.9. Challenges to Classification in Data Mining ...... 75

3.4. DATA MINING TOOLS ...... 79
3.4.1. Open Source Tools and RapidMiner ...... 80

3.5. CONCLUSION ...... 81

4. TEXT MINING AND OPINION MINING ...... 83

4.1. TEXT MINING ...... 83
4.1.1. Applications of Text Mining ...... 86
4.1.2. Representation of Text Data ...... 88
4.1.3. Techniques ...... 93

4.2. OPINION MINING ...... 95
4.2.1. Introduction: Opinions in Text ...... 95
4.2.2. Key Problems Addressed by Opinion Mining Research ...... 98
4.2.3. Subjectivity Detection ...... 100
4.2.4. Sentiment Classification ...... 101
4.2.5. Lexical Resources for Opinion Mining and SentiWordNet ...... 106

4.3. CONCLUSION ...... 111

5. DESIGNING FEATURES WITH SENTIWORDNET...... 113

5.1. INTRODUCTION...... 113

5.2. THE SENTIWORDNET ...... 113
5.2.1. Database Structure ...... 114
5.2.2. Statistics on Part of Speech Scoring ...... 115

5.3. CONSIDERATIONS ON SENTIWORDNET DATA ...... 116
5.3.1. Automatic Part of Speech Tagging ...... 116
5.3.2. Sense Disambiguation ...... 118

5.4. THE POLARITY DATA SET ...... 119
5.4.1. Document Structure ...... 120
5.4.2. Considerations on Writing Style ...... 121

5.5. PROPOSED MODEL...... 124

5.6. CONCLUSION ...... 129

6. SENTIMENT CLASSIFICATION EXPERIMENT...... 131

6.1. INTRODUCTION...... 131

6.2. OBJECTIVES AND SCOPE ...... 132
6.2.1. Out of Scope ...... 134

6.3. EXPERIMENT PROCESS ...... 134
6.3.1. Baseline Classifier ...... 135
6.3.2. Generate SentiWordNet Features ...... 136
6.3.3. Sentiment Classification Using SentiWordNet ...... 136

6.4. CRITERIA FOR COMPARISONS ...... 138
6.4.1. Classification Performance ...... 139
6.4.2. Training Data Set Size ...... 139
6.4.3. Dimensionality and Runtime ...... 139

6.5. EXPERIMENT SETUP ...... 140
6.5.1. Technical Resources ...... 140
6.5.2. Baseline Classifier ...... 141
6.5.3. SentiWordNet Test Approach ...... 142

6.6. CONCLUSION ...... 147

7. EXPERIMENT RESULTS...... 148

7.1. INTRODUCTION...... 148

7.2. SENTIMENT CLASSIFICATION RESULTS ...... 148
7.2.1. Baseline Results ...... 149

7.2.2. SentiWordNet Parameter Tests ...... 150
7.2.3. Outlier and Feature Selection ...... 154
7.2.4. Training Data Set Size and Execution Time ...... 157

7.3. RESULT ANALYSIS AND CONSIDERATIONS ...... 158
7.3.1. Accuracy Results ...... 159
7.3.2. Baseline Comparisons ...... 161
7.3.3. Analysis of Misclassifications ...... 166

7.4. CONCLUSION ...... 170

8. CONCLUSION ...... 174

8.1. INTRODUCTION...... 174

8.2. RESEARCH OVERVIEW AND OBJECTIVES ...... 175

8.3. EXPERIMENT RESULTS ...... 176

8.4. ADDITIONS TO THE BODY OF KNOWLEDGE ...... 178

8.5. FUTURE WORK & RESEARCH ...... 179
8.5.1. SentiWordNet Features ...... 179
8.5.2. Classification Results ...... 180
8.5.3. Knowledge Management Research ...... 181

8.6. CONCLUSIONS ...... 181
8.6.1. Final Remarks ...... 182

REFERENCES ...... 183

APPENDIX A – LANGUAGE RESOURCES ...... 203

A.1 PENN TAGSET...... 203

A.1 LIST...... 205

APPENDIX B – PYTHON CODE ...... 206

B.1 NEGATION ALGORITHM ...... 206

TABLE OF FIGURES

FIGURE 1-ORGANISATION OF DISSERTATION CHAPTERS...... 22

FIGURE 2-THE KNOWLEDGE PYRAMID (AWAD ET AL, 2004) ...... 30

FIGURE 3-STAGES OF THE KDD PROCESS (FAYYAD ET AL, 1996) ...... 44

FIGURE 4-REVISED KDD PROCESS BY (COLLIER ET AL, 1998) ...... 46

FIGURE 5-STAGES IN THE CRISP-DM METHODOLOGY (CHAPMAN ET AL, 2000)...... 48

FIGURE 6 - CYCLES OF KNOWLEDGE DEVELOPMENT THROUGH DATA MINING (WANG ET AL, 2008) ...... 50

FIGURE 7-EXAMPLE CLASSIFICATION OF LOAN APPLICATIONS ...... 54

FIGURE 8-HYPERPLANE SEPARATING TWO CLASSES ...... 67

FIGURE 9-RAPIDMINER WORKBENCH ...... 81

FIGURE 10–EXAMPLE WORD VECTOR ...... 92

FIGURE 11 - SENTIWORDNET SAMPLE SCORE (HTTP://SENTIWORDNET.ISTI.CNR.IT)... 108

FIGURE 12 - EXPERIMENT EXECUTION STAGES ...... 135

FIGURE 13 - SENTIWORDNET SENTIMENT CLASSIFICATION EXPERIMENT ...... 143

FIGURE 14 - ACCURACY COMPARISONS FOR VARIOUS CROSS-VALIDATION FOLDS .... 162

FIGURE 15 - 3-FOLD ACCURACIES FOR DIFFERENT TRAINING SET SIZES...... 163

TABLE OF TABLES

TABLE 1 - KM LIFE CYCLE (AWAD ET AL, 2004) ...... 27

TABLE 2-MODES OF KNOWLEDGE CREATION (NONAKA, 1994)...... 32

TABLE 3-INFORMATION SYSTEMS AND KNOWLEDGE CONVERSIONS (MARWICK, 2001) ...... 35

TABLE 4-DATA MINING TECHNIQUES CATEGORISED BY GOAL ...... 58

TABLE 5-EXAMPLE DATA SET FOR LOAN APPLICATIONS...... 59

TABLE 6-SURVEY OF CLASSIFICATION METHODS ...... 70

TABLE 7-CONFUSION MATRIX FOR 2-CLASS CLASSIFICATION PROBLEM ...... 73

TABLE 8-KEY ISSUES IN TEXT MINING (STAVRIANOU, 2007) ...... 89

TABLE 9- SENTIWORDNET DATABASE RECORD STRUCTURE...... 115

TABLE 10 - SAMPLE SENTIWORDNET DATA ...... 115

TABLE 11 - SCORING STATISTICS PER PART OF SPEECH (ESULI ET AL, 2006)...... 116

TABLE 12 - PENN TREEBANK TAGS (MARCUS ET AL, 1993) FOR PARTS OF SPEECH PRESENT IN SENTIWORDNET ...... 117

TABLE 13 - EXAMPLE OF MULTIPLE SCORES FOR THE SAME TERM IN SENTIWORDNET 118

TABLE 14- POLARITY DATA SET DOCUMENT STATISTICS ...... 121

TABLE 15 - PROCESS DIAGRAM FOR GENERATING SENTIWORDNET FEATURES ...... 125

TABLE 16 - SENTIWORDNET FEATURE DESCRIPTION...... 128

TABLE 17 - PARAMETERS FOR FEATURE GENERATION ...... 129

TABLE 18 - SOFTWARE AND HARDWARE RESOURCES ...... 141

TABLE 19 -CLASSIFIERS AND PARAMETERS TESTED IN EXPERIMENT ...... 144

TABLE 20 - DEFAULT SENTIWORDNET PARAMETERS ...... 144

TABLE 21 - PARAMETER VALUES TESTED BY EXPERIMENT...... 145

TABLE 22 - BASELINE RESULTS AND WORD VECTOR TYPES ...... 149

TABLE 23 - RESULTS FOR BASELINE CLASSIFIER USING BINARY PRESENCE WORD VECTOR ...... 149

TABLE 24 - RESULTS FOR SENTIWORDNET FEATURES USING 3 CLASSIFICATION ALGORITHMS ...... 150

TABLE 25 - ACCURACY RESULTS FOR VARYING SCORING FUNCTIONS ...... 151

TABLE 26 - ACCURACY RESULTS FOR VARYING SCORING THRESHOLD VALUES ...... 152

TABLE 27 - ACCURACY RESULTS FOR NEGATION ALGORITHM WITH VARYING WINDOW SIZES ...... 153

TABLE 28 - ACCURACY RESULTS FOR VARYING NUMBER OF SEGMENTS ...... 153

TABLE 29 - BEST SENTIWORDNET PARAMETERS OBTAINED WITH EXPERIMENT...... 154

TABLE 30 - ACCURACY RESULTS WITH OUTLIER REMOVAL...... 155

TABLE 31 - 20 SENTIWORDNET FEATURES WITH LOWEST CORRELATION TO LABEL USING CHI-SQUARED TEST ...... 156

TABLE 32 - ACCURACY RESULTS WITH FEATURE REMOVAL ...... 156

TABLE 33 -ACCURACY AND EXPERIMENT TIMINGS WITH OUTLIER DETECTION...... 157

TABLE 34 - ACCURACY AND EXPERIMENT TIMINGS WITHOUT OUTLIER DETECTION .. 158

TABLE 35 - ACCURACIES FOR VARIOUS TRAINING SET SIZES...... 158

TABLE 36 - SENTIWORDNET SENTIMENT CLASSIFICATION RESULTS ...... 159

TABLE 37 - SENTIWORDNET AND BASELINE CLASSIFICATION ACCURACY...... 162

TABLE 38 - EXECUTION TIMES FOR BASELINE AND SENTIWORDNET CLASSIFIERS ..... 164

TABLE 39 - TRAINING TIMES FOR BASELINE AND SENTIWORDNET CLASSIFIERS...... 165

TABLE 40 - ACCURACY COMPARISON WITH PUBLISHED RESEARCH ...... 176

1. INTRODUCTION

“Where is the wisdom that we have lost in knowledge? Where is the knowledge that we have lost in information?” T.S. Eliot.

1.1. Knowledge Management, Discovery and Opinions in Text

In recent decades, the business world has witnessed a number of changes that have affected the competitive landscape of companies across all industries. Impressive advances in technology, deregulation trends and the lowering of trade barriers resulted in accelerated competition and the need for companies to constantly innovate their products and services. Those changes have influenced how a company should be positioned, and what strategies companies should pursue to achieve or preserve their competitive advantage. From a transactional view of the company, whose raison d’etre was to process a certain input into a value added output, emerged the notion of a knowledge creating company, capable of constantly innovating and improving its products, business processes and services. Naturally, one key element of the innovation processes behind this dynamic nature is the knowledge that exists within the organisation in the form of people’s skills and experiences, or stored in documents and other repositories. Ensuring this knowledge is efficiently managed, and effectively employed to maximise the success of organisations, is the realm of the discipline of Knowledge Management.

At the same time, huge advances in information technology, coupled with the reduction in cost of the technology infrastructure, have caused a true explosion in the volumes of data available in information systems. Today, most aspects of an organisation’s business processes are dependent on computer systems such as online collaboration, email, transactional databases and data warehouses. The volumes of information are indeed much larger than an individual’s ability to process them, causing a phenomenon whereby too much information leads to inefficiencies in decision making: information overload (Farhoomand et al, 2002). From a knowledge management standpoint, the inability to tap into the vast information resources stored on computer systems also affects a company’s capacity to reuse existing knowledge already created, and to create new knowledge from yet undiscovered patterns and relationships present in data. Providing means for creating new knowledge is an important consideration when shaping the innovation strategy in organisations, and knowledge discovery from data repositories can be an important factor in such strategies (Wang et al, 2008).

The ability to discover new knowledge from databases using automated methods thus became relevant to the success of organisations, and a requirement for the detailed analysis of any very large set of data. This need has fuelled research in the area of knowledge discovery in databases, or data mining, and the development of methodologies, techniques and systems to execute this type of project. Meanwhile a strong industry has also developed, dedicated to assisting organisations in their knowledge discovery initiatives, and applications of data mining techniques have become relevant in a number of real world scenarios such as recommendation systems, spam filtering, customer trend analysis, and fraud detection.

One important type of information available in computer systems today is textual data. This is by far the most widely used method for storing information in explicit form, and some estimates suggest that as much as 85% of the information found in organisations is in text format (McKnight, 2005). It is also the most common source of information on the internet, being the natural way of presenting information in human readable form. It is in this context that performing data mining on text, or Text Mining, gains importance. Text mining applies knowledge discovery methods to unstructured textual data, leveraging other research areas such as natural language processing, artificial intelligence and machine learning to tackle the complexities of extracting information from unstructured textual format. Text mining techniques have been applied in a number of knowledge discovery scenarios, such as automatic categorisation of documents, trend analysis and spam detection.

An important branch of research within knowledge discovery in text concerns the ability to detect and extract opinions, or sentiment information. Detecting the sentiment of customers towards a new product based on feedback available in text format could be an important element affecting decision making and the product’s future direction. Opinion Mining is the research area dealing with automated methods for detecting and extracting this information from textual data, and has a number of potential applications in building more efficient recommender systems, financial analysis, product engineering and market research. One approach for detecting sentiment in text present in the literature proposes the use of sentiment-oriented lexical resources such as a lexicon of opinionated terms. SentiWordNet is one such lexical resource, containing opinion bias information on terms extracted from the WordNet database, and was made publicly available for research purposes (Esuli et al, 2006). Lexical resources have the advantage of being created a priori and the potential of being applied to a number of different categories of text. Thus, an interesting question is to assess how effective lexical resources are at detecting sentiment in comparison to other methods, and what potential advantages could be obtained from this approach.

1.2. Background

Within opinion mining research, sentiment classification concerns the application of automatic methods for making predictions about the orientation of sentiment present in text documents. These predictions are given according to pre-defined values for sentiment polarity. For instance, the sentiment of film reviews could be classified as positive (“thumbs-up”) or negative (“thumbs-down”); author sentiment in articles on a given subject, such as a proposed tax bill, could be subject to similar types of queries, and ranked on a numeric scale representing sentiment strength and orientation.

Sentiment classification is a valuable technology in a number of fields, such as recommender systems that can retrieve information based on sentiment orientation; as an aid to the correct placement of online advertising by evaluating the sentiment of a page’s content; or in online collaboration systems, where sentiment detection can assist the detection of inappropriate user behaviour, or flaming. The problem of evaluating sentiment orientation for the purposes of classification has received considerable research attention, and several approaches are surveyed in (Pang et al, 2008). One of the seminal experiments published in the literature is reported in (Pang et al, 2002), where well known bag-of-words machine learning methods used in text classification were applied to sentiment classification using a data set of film reviews. The data set used for the experiments is known as the polarity data set, comprising 2000 film reviews extracted from discussion groups on the internet movie database, and was also made available for further research in (Pang et al, 2004).
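To make the bag-of-words representation concrete, the sketch below builds binary presence word vectors of the kind used in such experiments. It is an illustrative Python fragment only: the tokenisation, vocabulary cut-off and toy documents are assumptions made for demonstration, not the exact setup used by Pang et al (2002).

# Minimal sketch of binary presence bag-of-words features (illustrative only).
from collections import Counter

def tokenize(text):
    # Lower-case the text and keep simple alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]

def build_vocabulary(documents, min_count=1):
    # Keep tokens appearing at least min_count times across the corpus.
    counts = Counter(t for doc in documents for t in tokenize(doc))
    return sorted(t for t, c in counts.items() if c >= min_count)

def binary_presence_vector(document, vocabulary):
    # 1 if the vocabulary term occurs in the document, 0 otherwise.
    tokens = set(tokenize(document))
    return [1 if term in tokens else 0 for term in vocabulary]

docs = ["a truly excellent film with excellent acting",
        "a dull and boring film"]
vocab = build_vocabulary(docs)
print(vocab)
print(binary_presence_vector(docs[0], vocab))

Vectors of this form, one per review, are then fed to a classifier such as a support vector machine.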

It was observed that the results obtained using text classification methods based on the bag-of-words approach seen in (Pang et al, 2002) remained below those of traditional topic based text classification, suggesting that extracting patterns which capture sentiment information in text requires additional linguistic analysis; this has fuelled interest in this field of research. In (Cui et al, 2006) an experiment indicates that the use of higher order n-grams based on pairs of words and three word combinations can yield better classification results, provided the training data set is sufficiently large. Another approach, suggesting the use of linguistic part of speech information as features, is seen in (Salvetti et al, 2004) and (Wiebe et al, 2003). The relationship between sentiment orientation and the detection of subjective and objective sentences within a document is explored in (Pang et al, 2004), with considerable improvements over the baseline bag-of-words method.

Finally, it can be argued that sentiment information exists at term level, through words and expressions known to carry a given sentiment polarity. Intuitively, a product review that contains words such as “excellent” and “good” can be expected to be more likely a positive than a negative review. Several methods explore the existence of such words and perform sentiment classification by calculating scores from the terms present in a document, using pre-defined lists of positive and negative terms. Examples of opinion mining experiments implementing techniques based on term sentiment can be seen in (Salvetti et al, 2004), (Pang et al, 2002) and (Kennedy et al, 2006).
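As an illustration of this family of term-counting approaches, the short Python sketch below classifies a document by comparing counts of terms drawn from small, hand-picked positive and negative word lists. The word lists here are placeholders invented for the example; the cited experiments rely on much larger, manually or automatically derived lexicons.

# Minimal sketch of term-counting sentiment classification (illustrative only).
POSITIVE_TERMS = {"excellent", "good", "great", "enjoyable"}
NEGATIVE_TERMS = {"bad", "poor", "boring", "awful"}

def term_counting_polarity(text):
    # Count occurrences of known positive and negative terms and
    # predict the class with the higher count (ties default to positive).
    tokens = text.lower().split()
    positives = sum(1 for t in tokens if t in POSITIVE_TERMS)
    negatives = sum(1 for t in tokens if t in NEGATIVE_TERMS)
    return "positive" if positives >= negatives else "negative"

print(term_counting_polarity("an excellent script and good acting"))  # positive
print(term_counting_polarity("a boring plot with awful dialogue"))    # negative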

SentiWordNet is a lexical resource of sentiment information for terms in the English language, introduced in (Esuli et al, 2006) and designed to assist in opinion mining tasks. Each term in SentiWordNet is associated with numerical scores for positive and negative sentiment information. The database is built upon a subset of paradigmatic terms assumed a priori to carry positive or negative sentiment, such as the words “good” and “bad”, and extended by an iterative process that generates scores based on term relationships extracted from the WordNet database (Miller et al, 1990). Investigating the potential benefits of using the SentiWordNet database for performing sentiment classification is the key purpose of this dissertation’s research, and is explored further in the next section.
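The fragment below illustrates what such term-level scores look like in practice by querying SentiWordNet for the adjective senses of “good”. It uses the SentiWordNet copy bundled with recent NLTK releases purely for illustration; this is not the distribution format used in this research, and exact scores vary between SentiWordNet versions.

# Illustrative SentiWordNet lookup via NLTK (assumes a recent NLTK release).
import nltk
nltk.download("wordnet", quiet=True)        # WordNet synsets
nltk.download("sentiwordnet", quiet=True)   # SentiWordNet scores
from nltk.corpus import sentiwordnet as swn

# Each WordNet sense of "good" (part of speech "a" = adjective) carries a
# positive, a negative and an objectivity score that sum to 1.
for sense in list(swn.senti_synsets("good", "a"))[:3]:
    print(sense.synset.name(), sense.pos_score(), sense.neg_score(), sense.obj_score())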

1.3. Research Problem

SentiWordNet could be a valuable resource for performing opinion mining tasks since it provides a readily available database of term sentiment information for the English language. This means SentiWordNet can be a replacement for the process of manually deriving lists of terms containing sentiment information for opinion mining tasks. It can also be noted that SentiWordNet is built from a semi-automated process that derives opinion information from the WordNet database, and has the potential to be applied to documents on different domains. The semi-automated approach also indicates the process can be replicated for other languages, where lexical databases similar to WordNet are available.

Thus, SentiWordNet offers potential benefits to opinion mining and to the task of sentiment classification in particular. Assessing the viability and performance of SentiWordNet as a tool for performing sentiment classification on textual documents is the key research problem of this dissertation, and the results can provide useful insights into its application to opinion mining tasks, and into further research directions for this type of lexical resource.

1.4. The Intellectual Challenge

To tackle the research problem proposed above, the main challenges of this dissertation are related firstly to the design of a set of features extracted in conjunction with SentiWordNet that capture as much sentiment information as possible from text documents. The feature design was based on a detailed evaluation of the SentiWordNet database, the data set chosen for the experiment, a study of other approaches proposed previously in opinion mining research, and finally in identifying and understanding the limitations of sentiment classification based on term information.

Secondly, designing and executing an experiment that leverages data mining techniques to perform sentiment classification with SentiWordNet required an understanding of the state of the art in data mining algorithms for classification, and how they relate to previous efforts in sentiment classification reported in the literature. The ability to implement the proposed experiment in a data mining package, and to build the necessary technical components to extract SentiWordNet information, were also important challenges this research faced.

1.5. Research Objectives

Having in mind the research problem and intellectual challenges posed in the previous sections, the objectives of this dissertation’s research can be outlined as follows:

• Investigate the fields of knowledge management, knowledge discovery and data mining, and the relevance of data mining for the creation of new knowledge and organisation competitiveness.

• Review research in data mining processes, the state of the art in algorithms for classification, challenges and limitations to classifier performance; investigate the state of the art in data mining tools.

• Investigate the areas of text mining and opinion mining, outline approaches proposed in the literature for performing sentiment classification; review literature on lexical resources used in opinion mining and SentiWordNet.

• Design a data set with features extracted with the help of SentiWordNet, to be used for sentiment classification of text documents.

• Implement an algorithm that extracts the proposed features using SentiWordNet, and using the polarity data set (Pang et al, 2002) as the source for text documents.

• Design and train a baseline classifier for sentiment classification similar to the one presented in (Pang et al, 2002), to be used for comparisons.

• Design and train a classifier based on SentiWordNet features for sentiment classification.

• Evaluate the effects of changing generation parameters of SentiWordNet features to the overall performance of sentiment classification.

• Evaluate the effects of outlier removal and feature selection to overall performance of sentiment classification using SentiWordNet.

• Analyse results obtained, investigate source of classification errors, and compare results with other research in the literature using the same data set.

1.6. Methodology

As part of this dissertation, both primary and secondary research were conducted. Secondary research consisted of a review of the research literature in the fields of knowledge management, knowledge discovery, data mining, text mining and opinion mining. The following resources were used to perform secondary research:

• Research journals and periodicals (ACM, IEEE, Harvard Business Review, etc.).
• Published books in the relevant areas.
• Conference proceedings.
• Websites and discussion groups associated with relevant research.
• Product white papers.
• Company websites.

The primary research is an experiment in sentiment classification that uses data mining techniques and SentiWordNet. The methodology is based on best practices found in knowledge discovery methodologies from both industry and academic circles, involving an iterative process of:

• Data selection and data pre-processing tasks originating from a data set of film reviews in raw text format.

• Implementation of sentiment classification experiments using machine learning algorithms, applying to this end a data mining application that allows for the rapid prototyping of tasks.

• Results evaluation and comparisons.

The experiment setup and presentation of results will closely follow published work in the area, to ensure result comparisons are possible, and that the experiment follows sound research practices.

1.7. Resources

To successfully achieve the goals of this dissertation, the following resources were identified as key requirements.

Human Resources

• Access to supervisor, for review and guidance throughout the preparation of the dissertation.

• Access to other members of DIT research staff as needed, for addressing more technical questions and sharing ideas.

Technical Resources

• Personal Computer system or laptop of recent specification for setting up and executing experiment.

• Access to library resources for research in books and periodicals.
  o Due to the nature of a part-time degree, online and remote access to content should be used whenever possible.

• Data Mining Application.

  o The RapidMiner open source tool will be employed (Mierswa et al, 2006).

• Programming language.
  o The Python programming language and libraries will be used for required programming tasks during experiment execution.
  o The Python NLTK library will also be employed for specific linguistic tasks (Loper et al, 2002).

• Labelled text mining data sets: the polarity data set (Pang et al, 2004), comprising text corpora of film reviews, will be used for the experiment.

• SentiWordNet Database (Esuli et al, 2006).
  o The lexical resource is required in a format that can easily be integrated into the experiment.

1.8. Scope and Limitations

The experiment conducted in this research uses the well known polarity data set for execution and presentation of results. This is useful for comparisons to other research in opinion mining; however, applying this research to other data sets could yield different results and new insights. It is acknowledged that testing on a single data set is a limitation of this research.

The main focus of this research is evaluating sentiment classification using SentiWordNet and comparing it to other approaches in the literature. To this end, choosing classification algorithms and algorithm parameters is a pre-requisite step for the data mining aspect of the experiment. Whereas potentially better results are possible using different choices of parameters than the ones presented here, it is not the objective of this research to investigate this aspect of data mining. Instead, to stay within the focus of the experiment, the choice of parameters and algorithms will be based on previous results in the literature, and a limited evaluation by experimentation.

1.9. Organisation of this dissertation

The remaining chapters of this dissertation are organised into the review of relevant research, experiment design and execution, and results presentation and conclusion, according to the chapters described below.

Chapter 2 presents a review of research literature in knowledge management and knowledge discovery, stressing the importance of knowledge creation as a knowledge management initiative beneficial to the success of organisations, and the relationship between knowledge creation and the discovery of new knowledge stored in databases across various information systems.

Chapter 3 reviews in more detail the fields of knowledge discovery and data mining, evaluating methodologies for implementing knowledge discovery projects and the important success factors. The main data mining activities are reviewed, presenting a number of potential uses of data mining techniques, followed by a detailed analysis of pattern classification, algorithms and challenges. The chapter concludes with a review of data mining tools available today for commercial and academic use.

Chapter 4 introduces the area of text mining and opinion mining, exploring their additional challenges and how they relate to the general area of data mining. Opinion Mining is the main subject of this dissertation and its research literature is reviewed in depth. The SentiWordNet lexical resource is presented and potential applications of such a resource are discussed.

In Chapter 5 the first part of this dissertation’s experiment is presented: an approach for extracting sentiment information from text documents as features for sentiment classification using SentiWordNet is proposed, based on a detailed evaluation of this lexical resource, and considerations on the challenges for extracting information from textual data.

Chapter 6 describes the opinion mining experiment: the experiment’s scope and objectives are laid out, and the setup of the experiment, success criteria and limitations are discussed.

Chapter 7 presents the experiment results. The results for each experiment activity proposed in Chapter 6 are presented as collected, in accordance with the proposed metrics. The obtained results are examined in more detail and discussed in light of other research in the literature.

Finally, Chapter 8 concludes this dissertation. It reviews the dissertation’s key objectives, the research approach and the results obtained. The key contributions to the body of knowledge resulting from this research are presented, along with opportunities for future research. The chapter concludes with final remarks on the overall dissertation project.

The diagram below illustrates the division of chapters according to their key objectives.

Figure 1 - Organisation of Dissertation Chapters

2. KNOWLEDGE CREATION AND DISCOVERY

In this chapter, research literature in the fields of knowledge management and knowledge discovery is presented and discussed. The discussion focuses on the importance of knowledge management, and knowledge creation in particular, as strategic tools for promoting organisational competitiveness, and on how the vast amounts of data stored in companies’ information systems can be used as a source for creating new knowledge via the process of knowledge discovery.

2.1. Knowledge Organisations

The term knowledge organisation derives from an understanding of the concept of a company from a new perspective that gained popularity in the strategic management literature through the 1990s (Cole, 1998; Alavi and Leidner, 2001; Nonaka, 1995). This view evolved from the traditional notion of the company as an information processing unit aimed at employing its resources to build a product or solve a specific problem. Knowledge resources have always been employed by organisations in the development and production of tangible goods, and in the management of enterprises. Knowledge has been available, for example, in the form of books, technical manuals, training and company communications. However, the “transactional” view of the company did not take into account the fact that knowledge is not only applied but also dynamically created, exchanged and refined within the boundaries of the company, and that such phenomena influence the company’s ability to grow, innovate and remain competitive (Nonaka, 1994).

The knowledge based view of the company gained momentum amid changes in the competitive pressures faced by organisations. These changes precipitated the need for new perspectives in the frameworks of how a company should be interpreted and analysed to ensure sustainable profitable positions in their industries:

• Globalization and Deregulation have in recent decades removed old trade barriers and expanded markets. Bigger markets also meant a wider number of competitors, and the lowering of trade barriers would over time lessen the impact of geographical advantages a company might have obtained from its location.

• Intellectual Property became increasingly regulated, thus protecting companies’ investments in knowledge assets by means of copyright laws, trademarks and patents (Teece, 1998). This provided a stable framework that favoured the creation and growth of companies whose business is to commercialise those assets.

• In addition, the idea of the Law of Increasing Returns has influenced the economic understanding of companies, stating that knowledge based activities are not subject to the traditional model of diminishing returns, where companies’ returns on investment tend to reduce over time in the face of increased competition and marginal cost increases. Instead, knowledge based activities are more difficult to imitate, and tend to create a positive feedback loop that amplifies returns as more knowledge gets created from already existing knowledge. This, aggregated with factors such as customers’ technology lock-ins and appropriate market timing, suggests increasing potential returns of knowledge assets over time (Teece, 1998; Arthur, 1996).

These factors have strengthened the idea of knowledge as the source of sustained creation of new products and services, and the ultimate resource for increasing company value and its competitiveness (Cole, 1998; Nonaka, 1991). In the face of increased competition in a global market, and promising prospects that can be achieved from developing knowledge assets, it became more suitable to view the firm as a collection of capabilities and knowledge skills that can be quickly reconfigured and applied in other realms, a notion later crystallised in the term knowledge organisation (Davenport, 2001; Nonaka, 1994; Drucker, 1995; Teece et al, 2002). It follows from this view of the company that ensuring organisational knowledge is effectively created, is accessible, retained and improved upon are fundamental activities for companies in order to retain competitiveness. These activities are the core building blocks of what emerged as the discipline of Knowledge Management (von Krogh, 1998; Quintas et al, 1997; Tiwana, 2000).

The benefits of implementing activities focusing on the knowledge of the organisation are well documented in a number of examples in the literature. In (Nonaka, 1991), initiatives taken by several Japanese companies in re-inventing their product lines in the face of increasing competition and lowering profit returns are presented, one example being the successful migration of Canon from its camera business to office automation products such as copying machines, by leveraging expertise already present in its product development and manufacturing divisions. Another successful example can be seen at Pixar Studios (Catmull, 2008), where the fostering of a knowledge sharing culture, in which employees are provided forums to participate and exchange ideas and suggestions, supports the creative process of new feature films. Further evidence of the success of this approach is presented in the Most Admired Knowledge Enterprises award (MAKE, 2007), where global companies are chosen for their excellence in delivering knowledge processes and achieving tangible benefits in competitiveness and financial returns. In this survey it has been observed that companies present in the winning list are usually recognised leaders in their industries, and delivered twice as much return on investment as the average of Fortune 500 companies over the past decade.

2.2. Knowledge Management

To support the view that knowledge is imperative to an organisation’s success, a coherent theoretical foundation and set of practices is necessary to ensure knowledge resources are effectively used, and Knowledge Management as a discipline emerged from this need. The ultimate objective of knowledge management is closely linked to the success of organisations; however, the broadness of the scope and the interdisciplinary nature of the topic gave rise to a number of overlapping definitions of the term, varying in specificity, the aspects of the discipline being stressed, author preference and target audience. To illustrate this effect, and provide a better indication of the scope of knowledge management in the literature, several definitions are listed below:

• “Knowledge management refers to identifying and leveraging all aspects of an organisation’s knowledge to help the organisation compete” (von Krogh, 1998).

• “Knowledge management is a newly emerging interdisciplinary business model that has knowledge within the framework of an organisation as its focus” (Awad et al, 2004).

• “Knowledge management refers to processes and practices through which organisations generate value from knowledge” (Grant, 2008).

To achieve its broad goals, knowledge management leverages the company’s organisational processes, people and technology in implementing a set of tasks that map to relevant knowledge activities. To facilitate the identification of what knowledge activities are in fact needed, and what the potential benefits obtained are, a number of frameworks have been proposed in the literature. The view of knowledge activities as a workflow of coordinated stages has led to the formation of frameworks based on knowledge life cycles, as seen in (Awad et al, 2004), and similarly presented in (Alavi et al, 2001). The knowledge management life cycle proposed in (Awad et al, 2004) entails four knowledge activities that can be executed iteratively throughout the organisation: first, organisational knowledge is captured from various sources; once captured, knowledge can be organised in more appropriate and useful ways; in the next step knowledge is refined and aggregated for different uses; finally, knowledge is transferred across the organisation. Each stage comprises several sub-activities highlighting the typical concerns of knowledge management initiatives, as illustrated in Table 1. In (Alavi et al, 2001) a similar framework of high level intertwined knowledge systems is presented, and activities are divided into knowledge creation, storage and retrieval, and transfer.

Stage        Activities
Capturing    Data entry, Scanning, Brainstorming, Interviewing, Voice and Video input.
Organising   Cataloguing, Indexing, Filtering, Encoding.
Refining     Collaborating, Contextualising, Compacting, Mining.
Transfer     Sharing, Push / Alerting, Publishing.

Table 1 - KM Life Cycle (Awad et al, 2004)

Other knowledge management frameworks aim at categorising knowledge activities based on a hierarchical taxonomy, such as the one presented in (Grant, 2008), where the key categories are knowledge generation and exploitation, with various sub-activities indicating more granular tasks. Several other frameworks have been proposed both in research and by practitioners. In (Holsapple et al, 2002) a proposed framework of knowledge management episodes based on previous research was evaluated and reviewed by a panel of practitioners. Frameworks have also been surveyed in the literature in (Holsapple et al, 1999) and (Tiwana, 2000), with varying approaches targeting specific problems or emphasising a specific view of the discipline.

2.2.1. Characterising Knowledge

Knowledge is a broad concept that has occupied the minds of philosophers and researchers for many centuries. From its roots in philosophy, the nature of knowledge has been studied in the cognitive sciences, linguistics and biology (Allix, 2003). Within the somewhat narrower scope of knowledge management, the epistemological debate around the nature of knowledge is normally considered out of scope (Alavi et al, 2001; Allix, 2003; Davenport et al, 1998), and instead a more pragmatic approach is taken by investigating only perspectives that contribute to building a theory of organisational knowledge (Alavi et al, 2001; Davenport et al, 1998). Notwithstanding this, the term has received distinct definitions, highlighting the different perspectives it can take in the context of knowledge management:

• Knowledge can be seen as the accumulation of facts, procedural rules and heuristics and defined as “understanding gained from experience and study” (Awad et al, 2004).

• Knowledge is defined by Nonaka as “a justified belief that increases an entity’s capacity for effective action” (Nonaka, 1994).

• In (Davenport et al, 1998) knowledge is defined as a “mixture of framed experiences, values, contextual information and expert insight for evaluating and incorporating new experiences and information”.

Furthermore, as noted in (Alavi et al, 2001), knowledge can be defined according to different perspectives: as a state of mind achieved via knowledge acquisition, as a process where knowledge leads to a particular action, as an object that can be manipulated, as a condition of having access to knowledge, or as an organisational capability. Focusing on a particular perspective will affect which knowledge management strategy and supporting systems will be employed. For example, an object view of knowledge may privilege strategies that emphasize the creation of knowledge stocks for the accumulation of knowledge objects.

We now discuss in more detail the perspectives on knowledge that provide a better understanding of key attributes relevant to achieving the goals of knowledge management, and that of creating new knowledge in particular.

Tacit Knowledge and Explicit Knowledge

An important dimension of knowledge is the distinction between Tacit and Explicit knowledge (Nonaka, 1994). Explicit knowledge is knowledge encoded into a formal language; it can be expressed numerically or in symbols, and can be easily stored and transferred. This is, for example, the knowledge that exists in documentation, electronic mail, presentations and reports. Tacit knowledge can be seen as knowledge that exists within people’s minds and is closely related to the individual’s actions, experiences, commitment and involvement. Tacit knowledge comprises a technical dimension that indicates the level of know-how, as well as a cognitive dimension related to the beliefs, ideals, values and mental models of an individual (Nonaka, 1998), and because tacit knowledge is already internalised it is ready to be applied and is sometimes referred to as actionable knowledge (Marwick, 2001).

Data, Information and Knowledge

Attempts to establish the nature of knowledge often involve the discussion of the concepts of information and data and how they relate in the framework of organisational knowledge. Investigating how these concepts relate in more detail has led to a hierarchical view with an implicit scale of values commonly accepted in the knowledge management literature (Rowley, 2007). This view also implies there are processes involved in the transformation of knowledge from a lower to a higher level in the hierarchy. In this scale the concepts of data, information and knowledge are commonly discussed (Davenport et al, 1998; Rowley, 2007).

Data can be seen as a set of discrete, unorganised facts related to an event (Davenport et al, 1998), and is widely available in organisations today in the form of recorded transactions in databases, for example, as bank withdrawal records from ATMs, or records of items sold in an on-line shop. Most companies are now capable of generating large volumes of data on all aspects of their operations; however, on its own data lacks context and relevance, and data accumulation per se will not necessarily bring positive benefits to a company (Awad et al, 2004).

Information can be seen as a data message intended for a receiver, aimed at improving the receiver’s judgement or behaviour (Davenport et al, 1998). To the receiver, information always has meaning and belongs to a context. In this view, data is not necessarily information: a long list of bank account transaction records may be irrelevant if out of context or not used by the right person. However we can add value to data, for instance by placing it in a summarised transaction report, so that in the right context it gains meaning to a receiver and therefore becomes information.

Information adds value to data by giving it context and meaning; however, information on its own will assist, but does not generate, better decisions or new processes and products. First, it must be put to use. Davenport defined knowledge as the “mixture of framed experiences, values, contextual information and expert insight”. It is also stressed that “it originates and is applied in the minds of knowers” (Davenport et al, 1998), thus giving it a very human perspective. This definition also suggests information is one of many components of knowledge.

Knowledge is derived from information once information is assimilated into the mind of a knower for a relevant goal; Davenport proposes that information becomes knowledge through enhancement processes requiring the knower’s involvement, such as comparison, analysis of consequences, and making connections. The progressive scale from data to knowledge has also been noted in (Zeleny, 1987), with a similar observation stating that whereas data and information may exist per se, knowledge can only exist after context, opinion and judgement have been added by a human.

To illustrate the hierarchy arising from the analysis of the relations between data, information and knowledge, the knowledge pyramid diagram is commonly referred to, and is displayed below. In some cases, the pyramid is extended with a top layer indicating wisdom as the accumulation of knowledge that encompasses vision, foresight, critical thinking and the transferring of knowledge to different contexts (Rowley, 2007; Awad et al, 2004). This concept of wisdom, however, is not as widely discussed in the literature (Rowley, 2007), and some authors prefer a simplified approach whereby wisdom attributes are embedded in the concept of knowledge (Davenport et al, 1998). The hierarchy of the knowledge pyramid also suggests data is a more tractable entity for the purposes of encoding and programming than knowledge would be, again suggesting the higher levels of the pyramid require increasing human interaction.

Figure 2 - The Knowledge Pyramid (Awad et al, 2004)

2.3. Knowledge Creation

Creating new organisational knowledge is at the heart of corporate innovation. Companies must be capable of acquiring and generating new knowledge, and of recycling existing knowledge into novel, more relevant ideas (Nonaka, 1995). However generating new knowledge is not a straightforward process, and it has been observed to be one of the least systematic and most difficult to measure processes of a knowledge management initiative (Davenport et al, 1998). In this section, Nonaka’s model for knowledge creation is presented in more detail, followed by a discussion on factors affecting the creation of new knowledge within the company environment.

2.3.1. Nonaka’s Spiral of Knowledge (Nonaka, 1994)

A theory of organisational knowledge creation was postulated by Nonaka (Nonaka, 1994) as the interplay of two dimensions of knowledge: first, there is the tacit-explicit dimension, where knowledge moves between its encoded explicit form and its actionable tacit form; second, there is the ontological dimension: knowledge can only be created by individuals and must be amplified to an organisational level in order to be effective. Permeating these two dimensions are the concepts of intention, autonomy and fluctuation as behavioural drivers, which act as a medium for knowledge movement along the tacit-explicit and ontological dimensions. Intention can be seen as an individual’s willingness to act upon knowledge, driven by personal behaviour and motivations. Autonomy relates to an individual’s or a group’s degree of freedom within the organisation, and increases the possibility of unexpected opportunities occurring; finally, fluctuation indicates a degree of uncertainty or noise within the organisation’s environment, which tends to increase the chances of unusual and novel patterns occurring.

The Spiral of Knowledge

Nonaka introduces the idea of creating knowledge through the conversion between tacit and explicit forms (Nonaka, 1994), and postulates four modes of knowledge conversion in the context of an organisation:

• Socialization (tacit to tacit): In this mode knowledge is transferred between individuals by shared experiences, without the need for an explicit representation. This is analogous to the work of an apprentice learning from his master by simple observation and imitation. Naturally this mode requires an individual as a source of knowledge, and is normally very difficult to replicate to larger groups.

• Internalization (explicit to tacit) and Externalization (tacit to explicit): In these modes, knowledge is transferred to and from an explicit form. Internalization can be compared to the idea of learning, whereas externalization transforms knowledge into an encoded form that can be transmitted and replicated to other individuals with relative ease, as would be the case with training material, books and technical diagrams.

• Combination (explicit to explicit): Existing resources in explicit format can be recombined to produce new knowledge through activities such as sorting, categorization and review in a different context. With the wide availability of explicit knowledge in existing computer systems, these resources may not always be used to their full potential (Quintas et al, 1997), and it has been noted that this reprocessing of knowledge can aid in reducing knowledge overload and improving knowledge based decisions (Holsapple, 2002).

The table below summarizes the four modes of knowledge transformation in Nonaka’s framework:

From / To    Tacit              Explicit
Tacit        Socialization      Externalization
Explicit     Internalization    Combination

Table 2 - Modes of Knowledge Creation (Nonaka, 1994)

If knowledge can be created upon carrying out one of the above transformations, then a dynamic process that encourages and manages such transformations at an organisational level should be aimed for. To maximize the practical benefits of such a process, knowledge transfers should be encouraged and continually amplified to a wider audience, thus increasing their relevance and reach, in what Nonaka conceptualised as the spiral of knowledge.

2.3.2. The Conditions for Knowledge Creation

As seen in Nonaka’s model of knowledge creation, behavioural drivers play a key role in tacit and explicit knowledge transformations, and in amplifying new knowledge throughout the organisation. The right attitude is required from individuals so that knowledge is shared instead of hoarded (intention); individuals need to act within an organisational culture that values and fosters knowledge creation and sharing (autonomy); and finally a substantial degree of uncertainty is required so that new knowledge can flourish out of experimentation (fluctuation). The individual and organisational conditions for knowledge creation are also noted in (Awad et al, 2004), where personal factors such as personality and attitude, and also vocational drivers provided by the organisation such as compensation, work environment, moral values, job security and employee recognition, play a key role in knowledge sharing and creation. The requirement for a level of uncertainty in the knowledge creation process is also acknowledged in (Davenport et al, 1998), and illustrated by case studies where companies intentionally build teams with people from diverse backgrounds and personalities: different backgrounds often imply the use of different vocabularies to describe similar situations, thus forcing the exchange of concepts, and different personalities bring with them different ways of approaching a problem; knowledge exchange is thus maximised.

In order to foster knowledge creation, time and space resources must be made available so that such activities can take place. As noted in (Davenport et al, 1998), companies tend to dedicate such resources to knowledge creating activities through the creation of research centres and research and development departments. However, as illustrated by the Xerox PARC case study, care must be taken to ensure that knowledge is not only created, but is being disseminated internally in the company. In this instance, Xerox missed the opportunity to capitalise on graphical user interfaces already developed in their research centre, with Apple taking the lead in the field: an example of knowledge being created in one section, but not “spiralled” through other areas of the company.

It is also worth noting that space for knowledge creation may not always denote physical space in the form of laboratories. Knowledge sharing spaces where people can congregate, meet and share experiences are equally as valuable (Davenport et al, 1998), as in the example illustrated by Pixar’s office design that maximises opportunities for people to meet (Catmull, 2008). Space for knowledge creation may also occur electronically, facilitated by information technology tools (Marwick, 2001).

In (Nonaka et al, 1998), the need for having appropriate knowledge creation spaces is linked to the four stages of knowledge transformation in the spiral of knowledge, and formalised in a framework guided by the concept of Ba – a Japanese word that roughly translates to “place”. Ba refers to a shared space or platform where knowledge creation can take place. This can take the form of physical spaces (meeting rooms and laboratories) as well as virtual spaces (typically embedded in an information technology system), or simply mental spaces (shared concepts, ideas and vocabulary). Organisations should foster the creation of ba in the workplace to act as facilitators of the four stages of the spiral of knowledge, and to enable the transformation of information into knowledge. For each stage in the spiral of knowledge, one type of ba acts as the strongest influencer. For the tacit-to-tacit conversion, the originating ba reflects shared ideas, mental models and behavioural patterns that enable face-to-face knowledge conversions; the interacting ba supports tacit-to-explicit knowledge conversions, where space for dialogue in which mental models can be explained and analysed is necessary; the exercising ba supports the explicit-to-tacit conversion: it is where internalisation occurs, and thus where learning, simulated activities and active participation activities take place.

Of particular interest to this dissertation is the cyber ba, which supports the explicit to explicit knowledge transformation. Because it involves manipulation of codified knowledge, the cyber ba is where the role of information technology as an enabler of knowledge creation is more pronounced, providing the ability to assist the reorganisation of already stored knowledge into other forms with potentially novel uses (Nonaka et al, 1998; Alavi et al, 2001). In the next section, the link between technology and knowledge creation is examined further.

2.4. Information Technology and Knowledge Management

The importance of information technology to knowledge management has been widely documented in the literature, and it is seen as a crucial component for the success of knowledge management projects by (Holsapple, 2002; Alavi et al, 2001; Marwick, 2001; Awad et al, 2004). It is also pointed out in (Davenport et al, 1998) that the availability of certain technologies, such as the internet and online collaboration tools like Lotus Notes, has been a catalyst for the knowledge management movement. In (Alavi et al, 1999) a survey amongst companies with existing knowledge management systems across a broad range of industries revealed perceived positive benefits in employee communication, process efficiencies such as shorter problem solving times, and financial benefits related to higher profitability and shorter sales and support cycles.

Information technology comes in support of knowledge management in different ways. To map knowledge activities to their supporting knowledge management systems, a framework is proposed in (Marwick, 2001) based on the knowledge conversion scheme seen in (Nonaka, 1994), illustrating the supporting role of information systems across the spectrum of knowledge activities, as shown in the table below.

Tacit to Tacit:
• e-Meetings.
• Online synchronous collaboration (chat, video conferencing).

Tacit to Explicit:
• Annotation.

Explicit to Tacit:
• Visualisation.
• Video and Audio repositories.

Explicit to Explicit:
• Text search.
• Document categorisation.

Table 3 - Information Systems and Knowledge Conversions (Marwick, 2001)

The systems supporting the tacit-tacit dimension typically make shared experiences between individuals possible across geographical and temporal boundaries (Awad et al, 2004). In this dimension, information technology acts solely as a conduit for the exchange of knowledge through an electronic medium, but the system has little involvement in providing or enhancing knowledge itself. Supporting systems such as

online chat, video conferencing and groupware fall into this category, and may bring considerable efficiencies in how tacit knowledge is employed inside a company. One example is documented in (Davenport et al, 1998), where video conferencing technology facilitated the remote troubleshooting of a problem in a large oil company by technicians based in different offices around the world. Another class of systems with similar concerns are expertise locators (McDonald et al, 1998), designed to assist in finding people with a required set of skills or similar interests inside large organisations.

In the tacit-explicit, or externalisation dimension, the transformation aims at forming a shared mental model of tacit knowledge. Collaboration systems, systems supporting brainstorming and descriptions of mental models such as mind maps, and online discussion databases fall into this category. These systems allow the externalisation of knowledge through discussions and the sharing of points of view. A similar objective is attained by expert systems that leverage elicited tacit knowledge into explicit decision rules encoded into a system.

In the explicit-tacit knowledge transformation, systems are concerned with supporting the creation of tacit knowledge from explicit knowledge repositories, mainly by augmenting or facilitating the understanding of available data. These aims are closely related to mitigating information overload, through strategies such as visualisation, summarisation and filtering techniques. Computer based learning systems are also considered by (Marwick, 2001) as enablers of explicit-tacit knowledge exchanges.

Finally, the explicit-explicit knowledge transformation is perhaps where the role of information technology is most pronounced. The increase of explicit knowledge available inside information systems also generates opportunities for supporting systems to provide knowledge combination approaches. These would include automatic document classification, summarisation and search capabilities. Other authors place the discovery of knowledge from large amounts of data within this dimension of knowledge transformation (Awad et al, 2004; Nemati et al, 2002). The discovery of new knowledge from data will be investigated in depth in the next section.

The above framework suggests a broad range of possibilities for the use of information systems in knowledge management. It is however accepted that knowledge management initiatives entail several non-technological aspects involving organisational strategy and how knowledge issues will be addressed (Hansen et al, 1999). It has been seen in the previous sections that knowledge by its very nature is ultimately personal and belongs to a human entity. Reservations are noted in the literature about approaching knowledge management initiatives strictly from the perspective of an information system implementation, where such concerns may not receive the attention they deserve (von Krogh, 1998; Fahey 1998; McDermott, 1999; Prusak, 2001).

One example can be seen in (McDonald et al, 1998), which observed how patterns of behaviour affect the usability of expertise location systems: users may seek expertise not by directly reaching experts found in an expertise location system, but by using escalation procedures across the hierarchy of the organisation, where political help is easier to find. Human resource factors such as employee motivation and communication have also been noted in (Hahn et al, 2000) to affect the success of system implementations of knowledge management initiatives. Another example in (Davenport et al, 1998) illustrates how organisational changes led to the abandonment of a knowledge management system once executive support for the costly task of capturing expert knowledge in explicit format was no longer present.

Despite this caveat, information technology can be a successful enabler of knowledge management initiatives when applied with a clear perspective on the organisation’s knowledge challenges. In (Hansen et al, 1999) several successful initiatives were investigated and a framework was devised dividing their use of information technology into codification and personalisation strategies. Codification focuses on achieving economies of scale by making knowledge explicit and widely available within the company, whereas personalisation focuses its efforts on facilitating the sharing of difficult-to-encode tacit knowledge, through tools such as online collaboration systems, video and voice conferencing. Each of these strategies must be evaluated in light of the company’s approach to clients, the economics of the industry, and staff profile. One example from the study presents a high-profile management consulting

firm based on small teams with flat hierarchies, very distinct projects and where customer relationships generally occur at executive level. The nature of its business determined that this company should favour knowledge systems for personalisation, facilitating the tacit communication between team members for solving clients’ problems, as it has little to gain from economies of scale and the reuse of explicit knowledge from repositories. Another study extended this model, indicating that initiatives should also take into account the volatility of knowledge within the industry, or how frequently knowledge changes and becomes obsolete, to assist in determining how to best employ codification or personalisation strategies (Kankanhalli et al, 2003).

2.5. Knowledge Discovery

With the popularisation of information technology, organisations are now capable of storing very large amounts of data in digital format. Nearly all aspects of a company’s business processes are undertaken with the assistance of information systems and recorded into data repositories, and sources of information ranging from documents, emails, company memos and customer transactions to collaboration systems exist in explicit format throughout the organisation. As seen in (Nonaka, 1994), the explicit-to-explicit, or combination mode of knowledge transfer is an integral step in the spiral of knowledge and contributes to knowledge creation inside the organisation. Thus, it could be argued that such vast data repositories could yield novel and useful knowledge if analysed and provided to the right people (Awad et al, 2004).

However, one side effect of the widespread use of information systems is the generation of very large amounts of data over time, which can be retained for long periods at little expense. Due to the sheer volume of data available, explicit knowledge combination work such as categorization, summarization and data analysis becomes a time consuming effort, sometimes impossible to execute manually. Not being able to extract information from the available data repositories may lead to inefficiencies in productivity, cause poor decision making by not availing of the most accurate and most recent information, and may ultimately affect employee motivation (Cody et al, 2002; Farhoomand et al, 2002). Thus, to fully take advantage of the large scale data repositories available today, automatic methods that employ the computing

power of today’s information systems are required. These are challenges addressed by the area of research of knowledge discovery in databases, or data mining.

2.5.1. Knowledge Discovery and Data Mining

Knowledge discovery is the product of scientific research in mathematics, computer science, statistics and engineering, coupled with advances in information systems which resulted in high volumes of data being easily stored and accessed. It also emerged from the need of businesses to be more competitive and to accelerate the process of creating new knowledge and innovating (Awad et al, 2004).

Using knowledge discovery methods, data can be analysed using descriptive techniques to obtain more understandable or summarized representations of the data, which may highlight yet unseen patterns, rules and relationships (Fayyad, et al, 1996; Hand et al, 2001). Prediction is also a common task in knowledge discovery, where algorithms learn patterns embedded in large data sets and are applied to predict future occurrences, such as anticipating fraudulent bank transactions. Data can be further analyzed using interactive exploratory methods that enable knowledge analysts to establish relationships visually that would otherwise not be possible on extremely large volumes of data (Hand et al, 2001).

Knowledge discovery started attracting significant interest in the research community during the early to mid 1990s, amid the rise in importance of knowledge discovery activities within organisations, and growing availability of collected raw data from a variety of information systems. During this time the first definitions of knowledge discovery in databases appeared in the literature. The term itself emphasises knowledge as the end product of the process. In (Fayyad, et al, 1996), the term is defined as follows:

"The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".

In the knowledge management literature, similar definitions for this process occur under different names. The terms “knowledge discovery”, “data mining” and “business

intelligence” are sometimes used interchangeably, with knowledge discovery being more popular in academic research, whereas data mining is a more widely used term amongst practitioners in the industry (Piatetsky-Shapiro, 2007). In (Awad et al, 2004) a definition of data mining in the business context is given as follows:

“Data mining is producing knowledge and discovering new patterns to describe data. DM is also predicting future values and business behaviour”.

However, according to (Fayyad, et al, 1996), at the heart of the knowledge discovery process is the application of automated methods that can extract useful patterns from large volumes of data, referred to as data mining algorithms. In this context, data mining refers to the step in the overall discovery process whereby algorithms are applied to data. For consistency, the term will be used in accordance with the above definition throughout this dissertation.

2.5.2. Implications to Knowledge Management

The discovery of new knowledge from explicit repositories employing automatic methods is regarded by some authors as a key knowledge management activity of the modern organisation, and a crucial one for knowledge creation. In (Herschel et al, 2005), business intelligence initiatives that apply automated methods to support data analysis are seen as a crucial component in decision making and the creation of corporate knowledge; a framework that encompasses discovery activities within knowledge management is proposed in (Wang et al, 2008). The knowledge creating aspects of knowledge discovery and their relationship to knowledge management are also highlighted in (Awad et al, 2004) and (Gargano et al, 2008).

In the context of knowledge management, it has also been noted that knowledge discovery can assist not only in the knowledge creation process but also in support of other knowledge management activities, such as organising knowledge (Wei et al, 2002) and performing knowledge elicitation tasks automatically (Awad et al, 2004). One example of such use is the eClassifier system for document exploration, categorization and taxonomy construction presented in (Cody et al, 2002).

2.6. Conclusion

This chapter has investigated how changes in the competitive landscape have influenced the knowledge oriented view of the company, and the subsequent changes in organisation strategy. In particular, increasing competition and the need for constant innovation have heightened the perception of knowledge as a strategic asset for a company’s success.

Knowledge management concerns the management of all aspects of an organisation related to creating, retaining, renewing and applying knowledge to a company’s benefit. To achieve its goals, several frameworks capturing relevant knowledge activities were proposed and surveyed. Also, perspectives on knowledge that fit the needs of knowledge management were presented. The tacit-explicit dimension and the hierarchical view of data, information and knowledge are the most notable contributors to the view of knowledge within the knowledge management literature.

The knowledge creation aspects of knowledge management were further investigated in this chapter. Creating new knowledge within the scope of an organisation requires the correct conditions to be present: amongst others, the ability to perform knowledge conversions along the tacit-explicit dimension, and creative conditions inside the company such as opportunities for knowledge sharing and physical and virtual knowledge creation spaces.

Knowledge discovery provides a process and methodologies for unearthing useful information from large sets of data, a task that would not otherwise be feasible within reasonable time frames. The motivation for performing knowledge discovery comes from the potential for creating new knowledge by extracting novel patterns from existing information already stored in explicit format and now widely available on information systems across the organisation.

Knowledge discovery is an important component to creating new knowledge in the organisation by means of transforming explicit knowledge into new explicit knowledge, as described in the combination knowledge conversion mode on Nonaka's spiral of knowledge (Nonaka, 1994). Applying knowledge discovery is thus beneficial

to the overall creation of knowledge inside an organisation, and is seen as a crucial component of knowledge management initiatives (Wang et al, 2008; Awad et al, 2004). It is worth highlighting that this conclusion also applies to the particular case of this dissertation’s experiment, where methods for automatic discovery of opinion information from text are investigated, and can form an important component in the decision making process of certain organisations.

In the following chapter, processes for knowledge discovery are investigated in more detail. The key challenges facing discovery projects are discussed, and a detailed survey of discovery activities is presented, illustrating what types of new knowledge can emerge from data by applying data mining techniques. Predictive methods are investigated in more detail, and a survey of the state of the art in applications for performing knowledge discovery is presented.

3. KNOWLEDGE DISCOVERY AND DATA MINING

This chapter explores related research on the methodologies, goals, key challenges and available tools for the execution of a knowledge discovery project. Key knowledge discovery processes developed by research and industry are discussed, along with the important success factors of a project. Data mining techniques presented in the literature are reviewed, highlighting the objectives, potentials and challenges of each technique. The chapter continues with a more detailed review of classification in data mining and the algorithms used for classification, and discusses important aspects of classification relevant to the opinion mining experiment performed as part of this dissertation. To conclude the chapter, a discussion on applications for executing data mining projects is presented.

3.1. Knowledge Discovery Processes

Knowledge discovery is a complex activity involving multiple steps and requiring diverse abilities, such as skills coming from individuals with business understanding, analysts possessing familiarity with the data, information technology professionals and data miners. As with any complex undertaking, a systematic approach to performing all required tasks is crucial to ensure projects are successful, and that successful projects are repeatable. This need has not gone unnoticed in both research circles and in the industry, with six different methods having been identified and surveyed in (Hofmann, 2003). The KDD process and its revised version (Fayyad et al, 1996; Collier et al, 1998) are an important methodology for knowledge discovery coming from academic research. In the industry, the need for a common framework for performing data mining that embedded best practices from companies and practitioners resulted in the CRISP-DM methodology (Chapman et al, 2000); in addition, core knowledge discovery processes are also embedded in software offerings by industry providers, as is the case with the SAS SEMMA data mining approach of Sample, Explore, Modify, Model and Assess (SEMMA, 2008). In this section the KDD and CRISP-DM knowledge discovery processes are discussed in more detail, the approaches are compared, and further considerations are given on the important aspects of a knowledge discovery project. These considerations will be of importance to this dissertation’s

experiment, which will perform a data mining task and will benefit from the best practices embedded in these methodologies.

3.1.1. Analysis of Knowledge Discovery Processes

To further understand the activities involved in performing a knowledge discovery project, this section discusses two processes originating from academic research and from industry: the KDD process for knowledge discovery in databases proposed in (Fayyad et al, 1996), and the CRISP-DM process (Chapman et al, 2000), which originated from a consortium of companies involved in data mining.

KDD Process (Fayyad et al, 1996)

The KDD Process is a series of interactive steps to achieve the goal of finding useful knowledge from large amounts of raw data. The process is designed to be iterative: any sequence of steps may be refined and re-executed several times. The diagram below illustrates the main stages of the KDD process.

Figure 3 - Stages of the KDD Process (Fayyad et al, 1996)

The process begins with identifying the problem domain and business goals, where business users and data analysts discuss requirements, scope and how to approach the problem from a data mining perspective. Next, work begins on the creation of a target data set to be analyzed, and data from all relevant sources is identified and sourced.

Once a data set is produced, data cleansing and pre-processing occurs: here, important pre-requisite steps are performed such as removing noisy data, handling missing values and outliers, and correcting data types, all of which may impact the performance and quality of the final result. The next step in the process is data reduction, where a subset of the overall data is selected based on its relevance to the data mining task. Data reduction is of critical importance in speeding up mining algorithms to acceptable performance levels, especially in cases where data may contain a large number of attributes, or the data set is in the order of several million records.

After data cleansing and reduction, a data set is produced and ready to be mined, and work can begin on choosing a data mining task, based on discussions with business users and project goals. Also, exploratory analysis of the data set can take place; this provides further insights on the nature of the data, helps in determining the data mining tasks and provides early feedback on collected data to business users. Any corrective action can take place by re-executing earlier stages of the process. Finally, the data mining stage is executed, where data mining algorithms are applied to the data set based on criteria determined in previous stages. Then, results are evaluated and interpreted, and it is likely that this will lead to several iterations of the mining stage, so that algorithms can be fine tuned and hypotheses can be confirmed. Lastly, the results of the data mining exercise are reviewed with business users, so they can be used as new knowledge and acted upon in their relevant business context.

Revision of the Original KDD Process (Collier et al, 1998)

It was observed in (Collier et al, 1998) that the original KDD process touched only briefly on two important aspects of knowledge discovery. The first is the framing of data mining questions: arising from business requirements but specifically targeted at directing the data mining modelling work, these should be part of the initial stages of knowledge discovery. The second aspect is actionable results: it is not enough to provide discovered patterns as the output of a discovery exercise. Instead, directions on how the discovered knowledge will be put to practice in the business context should be included for a more transparent assessment of its benefits. The iterative nature of the process is also a strong aspect of knowledge discovery, and should be made more evident. A new diagram is proposed illustrating these remarks:

Figure 4 - Revised KDD Process by (Collier et al, 1998)

CRISP-DM (Chapman et al, 2000)

The CRISP-DM methodology (Cross-Industry Standard Process for Data Mining) grew out of the need for a common platform-independent process model for implementing data mining projects in an industry that was experiencing an explosion in demand. It started in 1996 as a consortium formed initially by pioneer companies heavily involved in the field of data mining – Daimler-Benz, NCR and SPSS – that later transformed into a special interest group comprising several hundred representatives from the industry with a stake in data mining technology. Two and a half years later the initial draft version of the CRISP-DM 1.0 methodology was published (CRISP-DM, 2000b).

CRISP-DM is a process model describing data mining activities in four hierarchical levels of abstraction. At the top level are the project stages, which describe a data mining project in general terms, can guide implementations, and can easily be transported between industries and data mining scenarios. Any stage can be drilled down into

generic tasks, detailed tasks and process instances, which describe in detail what each activity entails and what its inputs and expected outputs are. In this regard CRISP-DM is a more comprehensive methodology than the high level steps outlined in KDD. CRISP-DM is both a reference model and a user guide, embedding practitioners’ knowledge and best practices learned from past projects. The CRISP-DM project life-cycle is divided into iterative stages, but with no specific execution sequence. Each stage can be repeated depending on the outcomes obtained from previous stages, the nature of the data and the data mining objectives. These are outlined below along with a diagram illustrating key interactions between stages:

• Business Understanding: the initial stage where business requirements are understood and agreed upon, a definition of the problem is devised between data miners and business analysts and planning can take place. This step will also highlight whether the data mining approach is the best or the only viable alternative for addressing the underlying business problem (Shearer, 2000).

• Data Understanding: this stage comprises tasks to obtain an initial data collection and familiarization with the data. An exploratory analysis of data is also part of this stage, which will generate initial findings and insights to be further developed on future stages of the project.

• Data Preparation: With a better understanding of the problem being tackled and nature of the data available, data can then be selected, cleaned and pre-processed in a variety of ways so that a final data set ready to be applied to a data mining technique can be created.

• Modelling: In this stage a data mining technique will be chosen, applied to the final data set and the results assessed. Choosing a data mining technique will depend on the nature of the data and specific requirements of the project.

• Evaluation: The evaluation stage provides a checkpoint to ensure the work produced so far is indeed of relevance to the project’s business objectives. A more

careful and comprehensive evaluation of the results is carried out, and the next steps in the exercise are agreed upon.

• Deployment: Finally, with the correct data and models ready, it is possible to deploy the data mining project to the wider business audience and monitor its benefits. The entire project is also reviewed with significant insights, pitfalls and lessons learned discussed for use in future projects.

Figure 5 - Stages in the CRISP-DM Methodology (Chapman et al, 2000)

3.1.2. Considerations on Knowledge Discovery Processes

From observing the models analysed in the previous section, certain common characteristics and key concerns of performing a knowledge discovery exercise become evident. Firstly, we identify the iterative nature of the process, stressed in both the KDD and CRISP-DM methods, suggesting that a great level of flexibility is required in such projects, since the scope of the exercise is likely to be modified,

expanded or contracted depending on what is initially discovered from the available data. This can also be seen as a challenge, since the number of iterations a project will require is not known a priori, nor when the outcomes will provide a good enough answer to the originating business needs. Such decisions need to be weighed against the project’s cost and timeframe constraints.

Another important point highlighted in both processes is the existence of two groups with stakes in a knowledge discovery process and distinct skill sets: data miners who perform the data extraction, modelling and execution activities, and business community users who understand the business problem and wish to apply the results of the process to their advantage. To be effective, the project must ensure a constant dialogue between data miners and business users; losing sight of this interaction could cause a misalignment between what the business expects and what patterns data miners unearth from data. This could become a key factor in failures of knowledge discovery projects (Pyle, 2004). Whereas both KDD and CRISP-DM acknowledge and provide stages for discussions with business users within their processes, some authors argue that this point deserves greater attention at organisational level, and that it requires the development of competencies that blend analytical skills with business acumen (Kolyshkina et al, 2007). This issue was also noted in (Wang et al, 2008), where it is framed in a knowledge management context: the traditional knowledge discovery process cycle – typically executed by data miners – is enhanced with a second knowledge development cycle executed by the business community. This new cycle starts with knowledge sharing of results from the data mining exercise, and comprises learning from its results, acting on and internalising the acquired knowledge and providing feedback for future mining tasks. The next diagram illustrates the two cycles and how they interact.

Figure 6 - Cycles of Knowledge Development through Data Mining (Wang et al, 2008)

The diagram also illustrates how knowledge discovery is in fact a knowledge creation activity in the context of the knowledge management concepts investigated in Chapter 2: the explicit knowledge acquired from raw data via data mining can be seen as a knowledge combination transformation in terms of Nonaka’s spiral of knowledge (Nonaka, 1994), thus enabling further knowledge creation activities to occur. These are present in the knowledge development cycle pertinent to business users: new knowledge is shared, learned, acted upon and finally unlearned or reinforced depending on the outcome obtained from the new knowledge. Once new knowledge has gone through the entire cycle it can be used as feedback for a new data mining cycle.

One important stage of a knowledge discovery task, present in both processes but often overlooked, is data preparation. Collecting, cleaning and reformatting data so it can be fit for use in a data mining algorithm is a time consuming and complex task, and it is estimated that between 50% and 70% of the time in a knowledge discovery project can be spent on these activities (Shearer, 2000). For this very reason it can be tempting to reduce project costs and scope by neglecting or oversimplifying data preparation issues, with negative consequences to the quality of the final results (Pyle, 2004). An example in (Kolyshkina et al, 2007) illustrates how poor data

preparation can damage the outcome of a project: it was discovered, after reviewing poor results from the data mining exercise, that an incorrect assumption about the source data caused a salary field to be filled with a zero value in a significant number of cases, thus incorrectly skewing the final results and invalidating the analysis.

Finally, choosing a data mining technique for a specific problem is also a difficult task that needs thorough understanding of the project objectives and the universe of tools available (Wang et al, 2008). Choosing a model best suited to a project’s desired outcome has to be done taking into account factors such as data availability and quality, how much understanding of the model is required by business users, time constraints, familiarity with a technique and the business goals of the project. In (Chapman et al, 2000) the CRISP-DM methodology states that initial results from a model should be assessed and tuned iteratively. CRISP-DM also provides a categorization of techniques in terms of mining tasks, in order to assist in choosing the right technique for a specific problem, and in (Pyle, 2004) it is suggested that data mining goals should be extracted from business goals via the construction of a goal hierarchy, starting with the business problem being addressed and adding layers of detail until a business question can be framed in data mining terms.

The experiment performed as part of this dissertation comprises the execution of a data mining algorithm to identify opinion bias in text documents. It is useful to approach this task with a systematic methodology that will guide the execution of the experiment, embedding previously acquired best practice knowledge on common pitfalls and areas of concern. This can be achieved by leveraging the methodologies discussed in this chapter, since the experiment will face similar issues of data preparation and choice of data mining techniques, will be iterative in nature, and should never lose sight of the higher level goals an opinion mining project hopes to achieve. In the next section the stage where data mining algorithms are applied to data is discussed in more detail, with particular consideration to predictive data mining techniques for data classification – the technique used in this dissertation’s experiment.

3.2. Data Mining Techniques

The overall goal of data mining is to extract knowledge from large volumes of raw data that satisfies the criteria of novelty, usefulness and intelligibility (Fayyad, et al, 1996), and as discussed in the previous section, a systematic process can be applied to find, extract and prepare data so it is ready to be explored.

Data mining is a field rich in techniques available to the analyst, each with a varying number of parameters to choose from, and deciding on the most suitable one for a given task can be particularly daunting. It is also noted in (Chapman et al, 2000) that the data miner needs to choose from the universe of available tools, the ones that fit both business constraints and political requirements of the project, such as the delivery time and intelligibility of results. Deciding on what data mining techniques to apply to the data based on the overall goals of a knowledge discovery project is the goal of this section.

3.2.1. Goal Based Categorisation of Data Mining Techniques

To help in choosing which data mining technique to apply, it is useful and common in the literature to categorise them according to the overall goals of the knowledge discovery project. In (Fayyad et al, 1996), data mining methods are grouped into prediction and description. In (Hand et al, 2001), exploratory data analysis, information retrieval and rule discovery are also added to the categorisation. A similar categorisation of data mining techniques according to types of problem is also present in the CRISP-DM methodology (Chapman et al, 2000). We explore the data mining goals and their associated techniques in more detail in the remainder of this section.

Predictive Methods

Predictive methods attempt to determine future values of a variable of interest by learning from data available on a given data set. This variable can be, for instance, the suitability of a loan application given an applicant’s financial data, or the future share price of a given company. When using predictive methods, there are two important assumptions to consider. It is assumed that a data set containing instances with values for the variable we are trying to predict is available for training. Learning in predictive

methods works solely by observing past occurrences and attempting to find patterns in them, and has no ability to reason or derive conclusions from basic principles (Weiss et al, 2005); thus it is also assumed that inherent to the data set are patterns which will fit future occurrences of the information being predicted (Alpoydin, 2004). If future occurrences are completely different from the ones used in training, or the data does not capture important aspects that determine future behaviour, then prediction results may not be relevant. With this in mind, data mining algorithms can be applied to “learn” such patterns and make predictions on a given variable for yet unseen occurrences not present in the original data set. Algorithms performing predictive methods that take into account known instances of the target variable for training are categorised in the machine learning literature as supervised learning methods (Nilsson, 1996; Alpoydin, 2004).

Another factor to be considered is the type of variable being predicted: the target variable could be a real valued number of interest to the problem, such as the predicted price of items in an auction; or the target variable can determine whether an instance belongs to a class pre-determined a priori, with no numeric relevance, such as predicting whether a transaction is fraudulent, or whether an email is considered spam. Classification methods attempt to learn a function that maps a data item into one of a set of predefined classes. There are numerous examples of the application of classification methods in the knowledge discovery literature, such as the case studies surveyed in (Wei et al, 2002) for predicting credit application adequacy, customer profiling, and disease screening; and the classification of astronomical objects seen in (Fayyad et al, 1993). Other popular classification examples involve spam detection (Drucker et al, 1997), and fraud detection in the financial services and telecommunications industries (Awad et al, 2004). To illustrate the goal of classification, the graphic below represents a two-attribute data set of bank loan applications labelled “yes” and “no”. The grey line separates the examples into a classification boundary, and represents how future instances might be categorised:

Figure 7 - Example Classification of Loan Applications
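To make the classification task concrete, the sketch below trains a linear classifier on a handful of illustrative loan application records, using age and salary as the two attributes. The use of the scikit-learn library and its SVC class is an assumption of this illustration rather than part of the cited literature; a support vector machine is chosen here only because that is the classifier applied later in this dissertation’s experiment.

```python
# Minimal sketch of a classification task: predicting loan approval ("Y"/"N")
# from two numeric attributes (age, salary). The records and the use of
# scikit-learn's SVC are illustrative assumptions, not taken from the text.
from sklearn.svm import SVC

# Training data: each row is [age, salary]; labels are the known outcomes.
X_train = [[44, 35000], [32, 28000], [27, 25000], [53, 44000], [33, 35000]]
y_train = ["Y", "N", "Y", "N", "Y"]

# Fit a linear support vector machine; its decision boundary plays the role
# of the grey separating line in Figure 7.
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)

# Predict the class of a new, unseen application.
print(classifier.predict([[40, 30000]]))
```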

In regression, a function that maps a data item to a real valued prediction variable is learned from an initial data set. The key difference between regression and classification is that the predicted value does have numerical significance (Hand et al, 2001), such as predicting future property values, the fuel consumption of an automobile, or stock prices; whereas in classification the predicted class is solely a class identifier determined a priori with no numeric value.
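As a contrast with classification, the short sketch below fits a regression function to a few invented (floor area, sale price) pairs and uses it to predict a numeric value for a new property. NumPy and its polyfit function are assumptions of the illustration; the fit is a simple least-squares line.

```python
# Minimal regression sketch: learn a function mapping floor area (m^2) to a
# real-valued sale price. The data points are invented for illustration.
import numpy as np

areas = np.array([50.0, 70.0, 90.0, 120.0, 150.0])
prices = np.array([150000.0, 195000.0, 240000.0, 310000.0, 380000.0])

# Fit a least-squares straight line: price ~= slope * area + intercept.
slope, intercept = np.polyfit(areas, prices, deg=1)

# Predict the (numeric) price of an unseen 100 m^2 property.
print(slope * 100.0 + intercept)
```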

Descriptive Methods

The primary objective of descriptive methods is to describe important aspects of the data set in a human understandable format. Examples of descriptive techniques include clustering, where data is partitioned into groups according to a similarity criterion; probability estimation, where the probability distribution of the data is modelled; dependency modelling, where models attempt to find dependency relationships between data items; and change detection, where models are built to find the most significant changes on data from previous measurements. Additionally, summarization of data in a more concise format is also viewed as a descriptive method in (Fayyad et al, 1996).

Descriptive clustering techniques have been employed to better understand customer transactions on the web (Yang et al, 2005) and for detecting user

personalisation preferences (Mobasher et al, 2002). Hierarchical clustering applied to categorising concepts in text data sets is seen in (Fry et al, 2008).

Exploratory Data Analysis

Exploratory data analysis is the application of data mining techniques to obtain further insights on the data, without a specific goal in sight. It is seen in (Fayyad et al, 1996; Chapman et al, 2000) as a preliminary stage in the knowledge discovery process, aiming at acquiring familiarity with the data and directing further discovery activity, and has been noted to be a crucial, often overlooked step in the success of knowledge discovery projects (Pyle, 2004; Kolyshkina et al, 2007). Data exploration in itself can also be seen as a genuine data mining goal, since its output may produce sufficient information to enable the discovery of knowledge and improved decision making (Hand et al, 2001).

Exploratory data analysis techniques tend to be interactive and visual in nature, highlighting the strong level of human involvement in such activities. These techniques are also named visual analytics in the literature, and suggest a more prominent role for human cognitive and perceptual processes in discovering patterns by interacting with data, not only by using well known plotting and summarization methods such as charts and scatter plots, but also by means of innovative visualization techniques (Keim et al, 2007). To this end, more sophisticated interactive computer-aided interfaces can be employed to enrich the data exploration process by supporting typical tasks of visual analytics such as presenting a data overview, zooming, filtering and presenting details on demand (Keim et al, 2007; Schneiderman, 1996; Fry, 2008). Visual analytics applied to large data sets is an active area of research, with results being applied to the analysis of social networks (Kang et al, 2007), genetic pattern identification (Hochheiser et al, 2003) and text analysis (Zheleva et al, 2007).

Rule and Pattern Discovery

This class of algorithms is concerned with detecting patterns in data, such as regions or events that differ significantly from the rest of the data set being analyzed. Examples of applications of pattern detection include detecting fraudulent behaviour in credit card transactions, or detecting astronomical objects with unusual characteristics. Rule learning methods not only detect patterns, but often have the ability to present intelligible results that can be easily analysed (Hand et al, 2001).

One popular technique for finding rules in data sets originates from the problem of market basket analysis, where the objective is to identify, from a data set of shopping transactions, which items are commonly purchased together. This problem can be generalised to other areas where the objective is to identify common associations between items in a data set, and became known as association rule mining (Hipp et al, 2000). A large number of algorithms have been developed to address this type of problem, the Apriori algorithm being one of the most popular (Agrawal et al, 1994), with several improved variations and alternative techniques appearing since its inception, as surveyed in (Hipp et al, 2000). Association rule mining has been put to practice on databases from several different industries, such as retail organisations looking for cross-marketing and product placement opportunities (Brijs et al, 2000), profiling students in educational institutions (Ma et al, 2000), and predicting the occurrence of heart disease given certain diagnostic conditions (Ordonez, 2006).
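The core quantities behind association rule mining – the support of an itemset and the confidence of a rule – can be computed directly from a list of transactions, as in the sketch below. The shopping transactions are invented for illustration, and a full Apriori implementation would additionally prune candidate itemsets by a minimum support threshold.

```python
# Sketch of the basic measures used in association rule mining.
# Transactions are invented market-basket examples.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Evaluate the candidate rule {bread} -> {butter}.
print(support({"bread", "butter"}, transactions))      # 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
```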

Rule induction methods describe another set of techniques aimed at representing discovered patterns in data according to a rule framework. In this context, a rule can be described in terms of a first order logic proposition, as per the example:

If Salary is higher than 30000 AND HomeOwner then Approve Loan.

The core idea of rule induction is to perform a search for potential rules on the existing data, and rank these rules according to a fitness function, such as rule probability or degree of generalisation (Hand et al, 2001). To this end, decision tree methods such as the algorithms described in (Quinlan, 1986) can be employed to find useful rules: a

branch is considered a candidate rule that can then be scored according to the desired fitness method. Other methods employed include searching the result space with heuristics, as used by the CN2 (Clark et al, 1989) and HYDRA (Ali et al, 1993) algorithms. Examples of applying rule induction methods have been documented in the literature, for instance in systems assisting in genetic-based diagnostics (Livingston et al, 2003), and in building stock portfolios for financial engineering (John et al, 1996).
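To illustrate how a candidate rule can be scored against data, the following sketch evaluates the example rule from the text (Salary higher than 30000 AND HomeOwner implies Approve Loan) over a few invented loan records, measuring its coverage and accuracy. These two measures merely stand in for the fitness functions mentioned above; the records are not taken from any of the cited studies.

```python
# Hedged sketch: scoring a candidate rule by coverage and accuracy.
# The loan records below are invented for illustration only.
records = [
    {"salary": 35000, "home_owner": True,  "approve": True},
    {"salary": 28000, "home_owner": False, "approve": False},
    {"salary": 25000, "home_owner": False, "approve": True},
    {"salary": 44000, "home_owner": True,  "approve": False},
    {"salary": 35000, "home_owner": True,  "approve": True},
]

def rule(record):
    """Candidate rule: Salary > 30000 AND HomeOwner => Approve Loan."""
    return record["salary"] > 30000 and record["home_owner"]

covered = [r for r in records if rule(r)]       # records the rule fires on
correct = [r for r in covered if r["approve"]]  # of those, actually approved

print("coverage:", len(covered) / len(records))                       # 0.6
print("accuracy:", len(correct) / len(covered) if covered else 0.0)   # ~0.67
```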

3.2.2. Summary and Considerations

The table below summarizes the data mining methods according to their goals:

Goals | Data Mining Techniques | Application Examples

Prediction (Fayyad et al, 1996; Hand et al, 2001)
• Techniques: Classification; Regression.
• Examples: Credit application, customer profiling and disease screening (Wei et al, 2002); categorization of astronomical observations (Fayyad et al, 1993); spam detection (Drucker et al, 1997); fraud detection (Awad et al, 2004).

Description (Fayyad et al, 1996; Hand et al, 2001)
• Techniques: Clustering; Summarization; Dependency Modelling; Change Detection; Probability Estimation.
• Examples: Summarisation of documents (Weiss et al, 2004); analysis of web transactions (Yang et al, 2005); hierarchical clustering (Fry et al, 2008); web personalisation (Mobasher et al, 2002).

Exploratory Data Analysis (Hand et al, 2001)
• Techniques: Visual Analytics; Summarization.
• Examples: Social network analysis (Kang et al, 2007); genetic pattern identification (Hochheiser et al, 2003); text analysis (Zheleva et al, 2007).

Rule and Pattern Discovery (Hand et al, 2001)
• Techniques: Association rule mining; Rule induction.
• Examples: Retail (Brijs et al, 2000); student profiling (Ma et al, 2000); heart disease diagnostic (Ordonez, 2006); generating investment portfolios (John et al, 1996); genetic-based diagnosis (Livingston et al, 2003).

Table 4 - Data Mining Techniques Categorised by Goal

The goal oriented classification of data mining techniques gives a good overview of what data mining is capable of achieving, which, coupled with an understanding of business objectives and domain knowledge, is a useful guide in determining what methods to apply to a specific task. It can also be noted that data mining techniques need not necessarily be used independently, and that some techniques are well suited to more than one goal. We see, for instance, how similar pattern discovery methods can aid in a prediction objective (John et al, 1996) and also in data description (Brijs et al, 2000). The goals of data mining may also overlap, as rule discovery is closely related to the goal of prediction, and indeed certain predictive methods do provide explanatory capabilities suitable for a rule discovery exercise, as will be further detailed in the survey of classification algorithms in the following section.

The goal of this dissertation’s experiment is the prediction of sentiment orientation in text as being positive or negative, which involves the use of supervised learning methods. In the following sections two key elements for performing a classification task in data mining are defined and discussed in more detail: the aspects of the data set and the classification algorithms.

3.2.3. Considerations on the Data Set

One important aspect of data mining is that it relies on available data for the application of techniques and the extraction of meaningful conclusions. Data sets from the real world, however, will rarely be in the correct format for data mining, or free from a variety of errors and noise. Ensuring data represents as closely as possible the domain in which knowledge needs to be extracted, and that the quality of data has been verified, is

therefore crucial to obtaining better results. Before discussing data mining algorithms in more detail, it is useful to qualify more formally what is meant by the term data set, and to discuss data quality issues commonly found in them.

At a general level, a data set comprises “a set of measurements taken from some environment or process” (Hand et al, 2002). The term is formalised as a set of n objects, for which p measurements on different events or attributes were made, thus forming an n x p matrix of data points. Each group of measurements for a given object in this matrix is commonly called a case, entity, record, or vector, while the individual measurements for a given object are typically called variables, attributes, features or fields (Hand et al, 2002; Nilsson, 1996). The table below is an example of a simple data set of bank loan applications in the form of a matrix of 5 records and 9 attributes:

ID   Age  Employment Status  Years Employed  Salary  Dependents  Home Owner  Home Value  Approve Loan
100  44   Employed           15              35000   2           Y           250000      Y
101  32   Self-Employed      8               28000   2           N                       N
102  27   Employed           5               25000   0           N                       Y
103  53   Unemployed         21              44000   3           Y           300000      N
104  33   Self-Employed      9               35000   1           Y           20000       Y

Table 5 - Example data set for loan applications

From the above example a few key observations can be made on the nature of data sets. Firstly, for the purposes of data analysis, data can be distinguished into numerical, where attributes possess real values, and categorical, where attributes can only take a predetermined set of discrete non-numeric values. In this example, “employment status” and “home owner” are categorical, as they only allow values from a specific set. The presence of numerical or categorical data will determine which algorithms are more suited to the data mining task.
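A data set such as the one in Table 5 can be represented programmatically as a list of records, with numerical attributes stored as numbers, categorical attributes restricted to labels from a fixed set, and missing values marked explicitly. The Python encoding below is only one possible representation of the example, using None for the missing “home value” entries.

```python
# One possible in-memory representation of the Table 5 loan data set.
# Numerical attributes are numbers, categorical ones are labels from a fixed
# set, and missing measurements are recorded as None.
loan_applications = [
    {"id": 100, "age": 44, "employment": "Employed",      "years": 15, "salary": 35000,
     "dependents": 2, "home_owner": "Y", "home_value": 250000, "approve": "Y"},
    {"id": 101, "age": 32, "employment": "Self-Employed", "years": 8,  "salary": 28000,
     "dependents": 2, "home_owner": "N", "home_value": None,   "approve": "N"},
    {"id": 102, "age": 27, "employment": "Employed",      "years": 5,  "salary": 25000,
     "dependents": 0, "home_owner": "N", "home_value": None,   "approve": "Y"},
    {"id": 103, "age": 53, "employment": "Unemployed",    "years": 21, "salary": 44000,
     "dependents": 3, "home_owner": "Y", "home_value": 300000, "approve": "N"},
    {"id": 104, "age": 33, "employment": "Self-Employed", "years": 9,  "salary": 35000,
     "dependents": 1, "home_owner": "Y", "home_value": 20000,  "approve": "Y"},
]

# Categorical attributes only admit values from predetermined sets.
EMPLOYMENT_STATUSES = {"Employed", "Self-Employed", "Unemployed"}
```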

Another aspect of real world data sets is that, due to measurement problems or the nature of the data, attribute values may be missing or unknown, as is the case in the field “home value” in the above example. The existence of missing values can deteriorate the quality of the data mining results, and several approaches have been developed to

handle the issue, ranging from ad-hoc methods aided by domain knowledge to more sophisticated approaches. As a first approach, removing data with missing values can be attempted: rows where missing values are found are simply removed from the data set before performing any data mining activity. However, removing a large number of entries due to missing values may skew the data set and affect the end result of the data mining task.

When removing missing values is not possible, data imputation methods can be applied. With knowledge of the data set and the domain being modelled, missing values can be “filled in” with an appropriate value, such as a fixed value: in the example data set from Table 5, “home value” may indicate the amount of collateral being offered for the loan application, and could be replaced with a constant value where it is known that the applicant is not a home owner; other similar rules can be composed with sufficient domain knowledge and knowledge of the data collection process. Another approach, for numeric attributes, is to replace missing values with the mean value of that attribute over all instances in the data set, or over instances whose other attributes contain similar values. The use of such methods should however be cautious: it is advocated in (Weiss et al, 1998) that it is preferable to perform classification without any missing values or, rather than attempting the above pre-processing tasks, to apply a classification algorithm capable of handling a missing value simply as another value in the data. Other more sophisticated statistical techniques, such as expectation maximization and predictive mean matching, have also been attempted and demonstrated superior empirical results in classification tasks on instances where missing data occurs (Su et al, 2008).
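A minimal sketch of the simplest imputation strategy discussed above – replacing a missing numeric attribute with the mean of the observed values – is shown below, applied to the “home value” column of the Table 5 example. More elaborate techniques such as predictive mean matching are outside the scope of this sketch.

```python
# Mean imputation sketch for the missing "home value" entries of Table 5.
# None marks a missing measurement.
home_values = [250000, None, None, 300000, 20000]

observed = [v for v in home_values if v is not None]
mean_value = sum(observed) / len(observed)          # 190000.0

imputed = [v if v is not None else mean_value for v in home_values]
print(imputed)  # [250000, 190000.0, 190000.0, 300000, 20000]
```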

Data mining relies solely on the data set being analysed; thus other aspects of data quality, such as data precision and accuracy, also need to be taken into account when preparing a data set. Data precision relates to the amount of variation observed in the measurements being collected. Measurement variability can be caused by environmental factors or instrument quality. Data accuracy reflects how close the measurements are to their expected “true” value. Inaccurate measures can be caused by faulty measurement devices, or by bias caused by external factors. In (Hand et al, 2001), the notions of reliability and validity are also discussed: these relate closely to the concepts of precision and accuracy, but the terms are applicable to the social and

behavioural sciences and the issues encountered in data collection in this area. A reliable measurement means results for a given question are repeatable for the same circumstances and persons; validity relates to how relevant the data, or the questions being asked, are to the phenomenon being measured.

Outliers

Outliers are instances that deviate considerably from other instances in the data set. They can be caused by errors in measurement, and may distort the model being built by a data mining algorithm when they are unrepresentative of the domain being measured. On the other hand, outliers can sometimes be the sought-after elements in the data, as is the case in fraud detection and spam detection systems, where it is assumed that the majority of occurrences will follow a given pattern, and unusual patterns are triggered by fraudulent behaviour or spam activity. Identifying outliers is typically carried out by comparison with the remainder of the data set via a similarity measurement. A method involving clustering techniques is seen in (Ramaswamy et al, 2000), with other methods surveyed in (Hand et al, 2001; Knorr et al, 1998).
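One simple way of flagging outliers in a numeric attribute, not drawn from the works cited above, is to measure how many standard deviations each value lies from the mean and flag values beyond a chosen threshold; the sketch below applies this to an invented list of transaction amounts.

```python
# Simple statistical outlier check: flag values more than 2 standard
# deviations away from the mean. Amounts are invented for illustration.
import statistics

amounts = [120.0, 95.0, 130.0, 110.0, 105.0, 4800.0, 115.0, 98.0]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

outliers = [a for a in amounts if abs(a - mean) / stdev > 2]
print(outliers)  # [4800.0]
```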

3.3. Data Mining Algorithms for Classification

The main objective of this section is to introduce and discuss predictive algorithms that perform the data mining task of classification, also known as classifiers. The section reviews the relevant research topics in the literature that relate to the choice and evaluation of classification algorithms. A formalization of a classification algorithm is presented, along with historical background on various classification methods. Algorithms for classification are briefly surveyed, with emphasis on techniques more commonly applied in the literature and with greater relevance to text classification. Finally, success criteria metrics and common challenges to classification are discussed.

3.3.1. Introduction

The problem of data classification can be seen as an instance of the more generic problem of fitting a model to observed data (Alpoydin, 2004), a concern that exists not only in computer science, but in many other fields of research. Hence, the algorithms used as classifiers have their origins in diverse fields, from induction in statistics,

pattern recognition and signal processing in engineering, machine learning, information theory and artificial intelligence, to the more recent biology inspired methods such as neural computing and evolutionary methods (Alpoydin, 2004; Kulkarni et al, 1998). Supervised learning algorithms are an active field of research, with new techniques and improvements to existing algorithms being developed constantly. A comprehensive survey of all available classification techniques in the literature would be beyond the scope of this dissertation. Indeed, the topic being so vast, the difficulties in compiling an exhaustive survey were noticed as early as (Ho et al, 1968). Nonetheless, it is relevant to present and discuss the development of well known classes of algorithms and state-of-the-art methods, and their relative strengths and weaknesses. This assessment will allow for a more careful selection of the classifier to be employed in this dissertation’s experiment, and will allow us to further illustrate the implications of choosing a particular algorithm over another. Well known classes of algorithms are commonly discussed in recent data mining textbooks, such as (Hand et al, 2001; Alpoydin, 2004; Weiss et al, 1998) and also in more specific data mining literature such as text mining (Weiss et al, 2005). A survey of algorithms focusing on two class pattern classification, including state-of-the-art methods, can be found in (Kulkarni et al, 1998). Variations on specific classes of classification algorithms such as neural networks and tree-based methods are surveyed in (Zhang, 2000) and (Lim et al, 2000).

3.3.2. Supervised Learning Algorithms

As discussed in section 3.2.1, a supervised learning algorithm attempts to predict future values of a given variable based on information contained in an existing data set used for training. The data set contains instances of the variable we wish to predict, and it is assumed that future values retain a certain similarity to already observed values, which can be “learned” by a supervised learning algorithm. This dependency on the available data being representative for predictions is worth stressing: if future values do not retain any similarity to already seen data, prediction results will not be reliable (Alpoydin, 2004; Weiss et al, 2005). Thus, the design of good supervised learning algorithms depends on the data available for training.

Judging the fitness of a classifier depends in part on the data mining objectives, and on how technical factors such as training time are likely to affect the end result of the task. In (Weiss et al, 1998) the evaluation of a classifier is based on its data pre-processing requirements, solution complexity, timing and explanatory capabilities. Data pre-processing requirements relate to issues such as the handling of missing features and the ability to handle both numerical and categorical features; solution complexity indicates the level of parameterisation that the underlying algorithm model allows; timing covers training time and algorithm runtime, which when considering large volumes of data are important factors in project completion; the explanatory capability of an algorithm indicates whether the explanation for a given classification decision can be easily understood from the classifier output. In (Alpoydin, 2004), it is argued that good classifiers should be able to predict the correct output for new instances after learning from the training set. In other words, they should have the ability to generalise well from the training data. Another desirable characteristic of a good classifier is its robustness: the ability to generate good results despite anomalies in the data caused by incorrect labels, imprecision in data collection and measurement errors. In the next section, a survey of classification algorithms is presented, discussing them in the context of their strengths and weaknesses in practical applications, and in text mining in particular.

3.3.3. Nearest Neighbour Methods

Nearest neighbour methods are considered one of the simplest yet most effective classes of classification algorithms in use. Their principle is based on the assumption that, for a given set of instances in a training set, the class of a new, yet unseen occurrence is likely to be that of the majority of its closest "neighbour" instances from the training set. Thus the k-Nearest Neighbour algorithm works by inspecting the k instances in the data set closest to a new occurrence that needs to be classified, and making a prediction based on the class to which the majority of the k neighbours belong. The notion of closeness is formally given by a distance function between two points in the attribute space, specified a priori as a parameter to the algorithm. A typical example of a distance function is the standard Euclidean distance between two points in an n-dimensional space, where n is the number of attributes in the data set.
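
To make the mechanics concrete, the following is a minimal sketch of the k-Nearest Neighbour prediction step, written here in Python purely as an illustration (the function names are hypothetical and do not refer to any package used in this dissertation):

    import math
    from collections import Counter

    def euclidean(a, b):
        # Standard Euclidean distance between two points in n-dimensional space.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def predict_knn(training_set, new_point, k=3):
        # training_set is a list of (attribute_vector, class_label) pairs.
        # Rank all training instances by their distance to the new occurrence.
        neighbours = sorted(training_set, key=lambda item: euclidean(item[0], new_point))
        # Predict the class held by the majority of the k closest neighbours.
        votes = Counter(label for _, label in neighbours[:k])
        return votes.most_common(1)[0][0]

    # Example with two numerical attributes and two classes.
    data = [((1.0, 1.0), "positive"), ((1.2, 0.9), "positive"), ((5.0, 5.1), "negative")]
    print(predict_knn(data, (1.1, 1.0), k=3))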

When using nearest neighbour methods, the key decisions to be made by the data miner relate to the choice of the value of k and of the distance function that optimise results for a given training set. Choosing too small a value for k, for instance k=1, may cause the algorithm to become much too sensitive to the training data and thus unstable when new occurrences are inspected. On the other hand, large values of k may cause the algorithm to include much too distant points, which are not necessarily close to the instance being inspected, and the algorithm may lose some of its predictive power. There are theoretical results indicating that k should grow as the number of instances available for training grows (Kulkarni et al, 1998); other results, however, indicate that for sufficiently large training sets there is little benefit in increasing k beyond very small values (Cover et al, 1967). With respect to distance functions, the best choice will depend on the data being used. Euclidean distance is a popular choice with numerical data, and is often extended to incorporate weights representing a measure of importance of each attribute (Hand et al, 2001). In text mining, document similarity by word co-occurrence is commonly used as a distance function (Weiss et al, 2005).

Nearest neighbour methods have the advantage of being simple to understand and implement, and there is a sound theoretical foundation for the convergence of the nearest neighbour method to the best possible solution as the training set increases (Cover et al, 1968; Kulkarni, 1998). Other important factors are its ability to easily handle missing values (by selectively eliminating from the distance calculation the dimensions where values are missing) and the ability to easily implement a reject option when the confidence of a prediction is not acceptable (Hand et al, 2001). Classification based on distance functions often translates to a certain measure of similarity between data points, which facilitates the interpretation of results by simple comparison to the closest neighbours. The method however has certain drawbacks: numerical values are required for the calculation of distance metrics between data points, and thus the algorithm does not readily work with categorical data; also, because it is based on a distance metric, it can be sensitive to dimensions with comparatively larger values, a problem that can be mitigated by introducing weights to dimensions in distance calculations or by adjusting values as a data pre-processing step. Also, nearest neighbour methods tend to perform poorly on high dimensional attribute spaces with limited training data, as the distance values used for prediction decisions get exacerbated with the growth in the number of dimensions in the data set (Hand et al, 2001). In fact it can be shown that for nearest

neighbour methods, the size of the training set required to achieve acceptable levels of performance increases exponentially with the dimension of the data set (Kulkarni, 1998). The method also has certain scalability issues, caused by the fact that it needs to store all data points from the training set in order to calculate distances and compute a prediction.

3.3.4. Tree Based Methods

Tree based methods attempt to solve the classification problem by recursively finding partitions of the solution space, based on data attributes, which optimally separate the classes being predicted. This procedure is performed recursively on the training data set, so that each stage introduces a decision point related to an attribute. In the data set example from table 4, for instance, a decision rule could be created on the attribute "Employment Status", partitioning the data into "Unemployed" and "Employed, Self-Employed" when classifying on the target variable "Loan Approved". From this rule, further rules could be recursively devised for other attributes such as "Home Owner", "Salary", etc. The end result is thus a tree where each tree node creates a decision rule for classifying the data based on a given attribute. The process of creating new nodes (i.e. new branches in the tree) stops when a given threshold, such as tree size or the number of instances found for a given node, is reached.

The key issue to consider when designing a tree-based algorithm is how to choose the data attribute to be used in a tree node such that the solution space is optimally divided according to the class being predicted. To this end, attributes can be ranked according to a scoring function that represents the greatest possible improvement in the tree's classification. One approach would be to simply apply the loss function, or classification error, described in 3.3.2 (Hand et al, 2001). However, the development of tree-based algorithms has seen other scoring methods proposed that provide more effective results (Kulkarni et al, 1998). For instance, the ID3 algorithm (Quinlan, 1986) uses a scoring function based on the concept of information gain, and attribute selection is based on minimising a measure of entropy on the attributes being investigated.
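
As an illustration of this scoring step, the sketch below computes the entropy-based information gain that ID3-style algorithms use to rank candidate attributes; it is a simplified outline of the scoring function only, not the full tree-building algorithm, and the names used are purely illustrative:

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy (in bits) of a list of class labels.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, labels, attribute_index):
        # Reduction in entropy obtained by partitioning the data on one attribute.
        before = entropy(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attribute_index], []).append(label)
        after = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
        return before - after

    # Example: the attribute in column 0 separates the two classes perfectly.
    rows = [("Employed", "Yes"), ("Unemployed", "No"), ("Employed", "No"), ("Unemployed", "Yes")]
    labels = ["approved", "rejected", "approved", "rejected"]
    print(information_gain(rows, labels, 0))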

Decision tree methods have the advantage of providing an output model that is relatively easy to interpret and explain: to understand how the classifier arrived at a particular decision, one must simply retrace its steps from the root of the tree, following the decisions taken at each node. This class of algorithms also has the ability to handle both numerical and categorical data easily, and achieves relatively quick execution times: once the tree model is built, finding a classification decision for a new instance requires simply traversing the decision tree. Decision trees however have the potential downside of long training times, since data needs to be partitioned on a given attribute at each node, with the number of nodes potentially doubling each time the decision tree grows by one level, although algorithmic improvements are available to mitigate this type of problem. Another consideration is the fact that decision trees are a monothetic class of algorithm: because each node is split according to one attribute, only one attribute is considered at a time. This has a potential performance downside on problems where data is better described (and partitioned) as a combination of more than one attribute, such as in a linear combination of attributes (Hand et al, 2001).

3.3.5. Naïve Bayes

The Naïve Bayes classifier uses a probabilistic approach for predicting the class of a given data point. The starting point is the Bayes theorem for conditional probability, stating that, for a given data point x and class C:
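
    P(C | x) = P(x | C) P(C) / P(x)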

Furthermore, by making the assumption that for a data point x = {x1,x2,…xj}, the probability of each of its attributes occurring in a given class is independent, we can estimate the probability of x as follows (Hand et al, 2001):
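
    P(x | C) = P(x1 | C) × P(x2 | C) × … × P(xj | C)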

Training a Naïve Bayes classifier therefore requires calculating the conditional probabilities of each attribute occurring in the predicted classes, which can be estimated from the training data set. Naïve Bayes classifiers often provide good results, and benefit from the easy probabilistic interpretation of results. However, the model's

main weakness lies in the assumption of independence of occurrence of attributes. This may not always hold in real data sets, where attributes may be strongly correlated. Consider the data set presented in Table 5, where the "salary" and "home value" attributes, for instance, can be expected to be correlated.

3.3.6. Large Margin Classifiers: Support Vector Machines

Support Vector Machines are a class of classification algorithms belonging to the parametric methods: they find an adequate function that partitions the solution space so as to separate the training data points according to the class labels being predicted, under the assumption that future predictions follow the same pattern. In the simple case where a linear function divides the two classes, the resulting hyperplane partitions the solution space. The following graph illustrates dividing hyperplanes for a sample of points belonging to two classes:

Figure 8 - Hyperplane Separating Two Classes

In the above example, there is a potentially unlimited number of separating hyperplanes dividing the two classes. In choosing the best possible one, an intuitive idea would be to choose a hyperplane that has the largest distance between any points from either class, thus creating the widest possible margin between points from the

classes. The intuition behind this method is that a hyperplane with a large margin would be a "safer" classification boundary, less likely to make prediction errors by being too close to the boundaries of one of the classes. Finding such a hyperplane is the objective of the Support Vector Machine algorithm, first presented in (Boser et al, 1992). To achieve this, the problem can be formalised as finding the vector w, such that:

for classes C1 and C2, and training data X = { (x^t, r^t) }, where:

    r^t = +1 if x^t ∈ C1, and
    r^t = -1 if x^t ∈ C2,

w and the constant w0 satisfy:

    w • x^t + w0 ≥ +1 if x^t ∈ C1, and
    w • x^t + w0 ≤ -1 if x^t ∈ C2,

and the norm ||w|| is as small as possible, which is equivalent to making the margin 2/||w|| between the two classes as large as possible (Alpoydin, 2004).

The above equations state that points belonging to classes C1 and C2 lie on separate sides of the hyperplane whose orientation is defined by the vector w, and by minimising the norm ||w|| (and thus maximising the margin 2/||w||), we obtain a dividing hyperplane with maximum distance between points from either class. It is demonstrated in (Boser et al, 1992) that finding this optimal hyperplane translates to a quadratic optimisation problem whose complexity depends on the number of training vectors N, but not on the dimensionality of the data set. This is an interesting feature of this method, as it has the potential to address the requirements of high dimensionality data sets, and as seen in (Alpoydin, 2004), the time complexity of this method has an upper bound of N^3. Another interesting result is that the model obtained from training support vector machines takes into account only the data points close to the dividing hyperplane for predictions: these are called the support vectors, and are expected to be far fewer in number than the entire data set, thus providing an algorithm with good performance during execution time.
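
The prediction step that results from training is then straightforward: a new point is classified according to the sign of the decision function. The fragment below is a minimal numerical sketch of that step; the weight values are invented for illustration and do not come from any trained model:

    # Hypothetical weights for a linear SVM trained on two attributes.
    w = [0.8, -0.5]
    w0 = 0.1

    def decision(x):
        # The sign of w . x + w0 determines the predicted class.
        score = sum(wi * xi for wi, xi in zip(w, x)) + w0
        return "C1" if score >= 0 else "C2"

    # The geometric margin of the separating hyperplane is 2 / ||w||.
    margin = 2 / (sum(wi ** 2 for wi in w) ** 0.5)

    print(decision([1.0, 0.2]), round(margin, 3))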

It cannot always be assumed that the classes can be suitably divided by a linear hyperplane, and in some data sets this assumption may be violated by noisy data, or by the nature of the data itself. In those cases, the method provides the ability to introduce an error term during the learning process, where a penalty weighted by a constant C is applied to points that fall too close to, or on the wrong side of, the dividing hyperplane, thus allowing for some flexibility on misclassified points. Another important feature of Support Vector Machines in dealing with cases that are not linearly separable is the ability to map the problem space into another, possibly more convenient space by means of a kernel function, where the points allow for better separation. Several kernel functions have been employed in support vector machine classification, and are surveyed in (Alpoydin, 2004).

As noted in (Burges, 1998), the theory of Support Vector Machines does not guarantee high performance of the method in all cases; however, one interesting result noted in (Alpoydin, 2004; Burges, 1998) is that the expected error of a Support Vector Machine classifier can be shown to be a function of the number of support vectors, and not of the data set dimensionality. This upper bound could translate to good performance, especially on high dimensional data sets. Also, some results from statistical learning theory suggest a relationship between the large margins obtained by the method and reduced upper bounds on classification error (Kulkarni et al, 1998). In any case, the method has reportedly performed very well in empirical experiments, ranging from image recognition (Boser et al, 1992) and text classification (Joachims, 1998) to opinion mining (Pang et al, 2002; Kennedy et al, 2006; Pang et al, 2004).

3.3.7. Considerations on Classifier Techniques

In sections 3.3.2 to 3.3.6 of this chapter, common classes of classification algorithms were surveyed, with emphasis on their underlying motivation, applicability and positive aspects. It can be seen from the algorithms inspected that each implements a specific heuristic to address the lack of information regarding the unknown real distribution of the data. Therefore, each method makes assumptions on how the predicted classes can be separated: for the Naïve Bayes algorithm, this corresponds to its reliance on the probabilistic independence of attribute occurrence, and for Nearest Neighbour methods, to the assumption that data points belonging to the same class are close under a certain similarity measure. These assumptions reflect the inductive bias of a certain algorithm,

and should be understood so that performance results can be correctly interpreted (Alpoydin, 2004).

To summarise the findings of this survey, the assessment of each method is presented in the table below.

Algorithm                 Positive Aspects                        Negative Aspects
k-Nearest Neighbour       Simple to understand and implement.     Potentially slow as training data increases.
                          Easy interpretation.                    Categorical or missing values need to be pre-processed.
Decision Trees            Model is easy to interpret.             Monothetic, potentially leading to sub-optimal solutions.
                          Handles both numerical and              High training times.
                          categorical data.
Naïve Bayes               Model is easy to interpret.             Assumption of attributes being independent not necessarily valid.
                          Efficient computation.
Support Vector Machines   Very good performance on                Categorical or missing values need to be pre-processed.
                          experimental results.                   Difficult interpretation of resulting model.
                          Low dependency on data set
                          dimensionality.

Table 6 - Survey of Classification Methods

As noted earlier in this chapter, the above survey aims at presenting popular classification techniques and their approaches to data-driven prediction, and represents only a fraction of the available classifier techniques, with many more constantly being developed. Other methods based on the same principle present in Support Vector Machines, that of finding a dividing hyperplane, can be seen in the linear discriminant class of methods presented in (Hand et al, 2001; Weiss et al, 2005). A closely related method that has received much attention in pattern recognition problems is neural networks, presented in (Kulkarni et al, 1998; Geman et al, 1992). A variety of

methods applied to the classification of textual data is surveyed in (Sebastiani et al, 2002).

Choosing a Classifier

The analysis of some of the available classification methods provided in this section highlighted key characteristics to observe when choosing a classifier for a particular task. The nature of the data set should be taken into account, in terms of dimensionality, size and data characteristics. Most methods presented handle only numerical data, and categorical or missing values need to be addressed before training. Training and runtime characteristics of each algorithm must also be taken into account, as time requirements are a determining factor in project success, cost and ultimately in the usefulness of the data mining task. In some cases, the high dimensionality or sheer size of the data set may preclude the application of slow performing methods. The explanatory capabilities of the method also need to be taken into account, and may constitute a key factor in the choice of data mining algorithm, depending on the outcomes of the data mining exercise expected by the end users.

Finally, it would be ideal to choose the classifier with the best performance in terms of minimisation of classification error; in other words, a classifier that makes as few mistakes as possible in its predictions. Determining which classifier will perform best depends on a number of factors related to the availability of data, such as the distribution of the class label in the total population - an unknown fact in principle - and how closely that distribution is represented in the data attributes available for training. It is desirable that the distribution present in the training set closely reflects that of the entire population, but this cannot always be guaranteed. The size of the training set is also relevant, since larger training sets tend to approximate the performance of a classifier to the best obtainable performance in its class, as theoretical results for various methods demonstrate (Kulkarni et al, 1998). In addition, inductive bias has a part to play in choosing an algorithm, since a particular heuristic present in one technique may discriminate classes better than other methods for a particular problem.

In general, however, the available data is the single most important source of information and intuition when deciding on a classification method, and the

recommended approach in the literature for choosing the best performing classifier for a given problem is not based on algorithm principles or theoretical results; instead, it advocates proceeding in a data driven fashion, making the best use of the available training set and experimenting with different methods and parameters (Hand et al, 2001; Kulkarni et al, 1998; Weiss et al, 1998).

3.3.8. Evaluating Classifier Performance

Determining how well a classifier will make predictions on unseen data is one of the most crucial aspects of any supervised learning task. A series of methods for evaluating classification performance are investigated in this section.

Testing on Unseen Data

If the performance of a classifier is tested against the data used for the training process, it can be expected that the predictions will be optimistically biased, since these are data points already "seen" by the classifier (Hand et al, 2001). A better option is therefore to test the classification results on a separate data set not used during training. The data set can be divided into two sections: the training set, a subset of data used to train the classifier algorithm, and the validation set, used to evaluate predictions from the trained algorithm and measure classification results. This strategy allows the classifier to be tested on data points not used during training, and therefore yet unseen by the classifier.

A further extension of this approach takes into account the fact that when testing only on a particular subset of data, there is a chance of the algorithm performing unusually well or badly simply by chance, as a result of the data points selected for each of the subsets. To mitigate this problem, the training and testing cycles can be repeated using different subsets of the data set, in a process called cross-validation. The idea of cross-validation is to subdivide the data set into several subsets, or folds, each of which is used in turn as the test set while the remainder of the data set is used for algorithm training. A 10-fold cross-validation will generate 10 training and testing cycles, thus evaluating the performance of the algorithm when different sets of data are used for training. Upon each cycle, algorithm performance can be measured using adequate metrics, and inspected individually or averaged over all folds.
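
A minimal sketch of this procedure in Python is shown below; no data mining library is assumed, and the train_and_score argument is a placeholder for whatever training and evaluation routine is being used:

    def cross_validate(data, train_and_score, n_folds=10):
        # Split the data set into n folds; each fold is held out once as the test set.
        fold_size = len(data) // n_folds
        scores = []
        for i in range(n_folds):
            test = data[i * fold_size:(i + 1) * fold_size]
            train = data[:i * fold_size] + data[(i + 1) * fold_size:]
            # Train on the remaining folds and record the performance on the held-out fold.
            scores.append(train_and_score(train, test))
        # Individual fold scores could also be reported; here they are averaged.
        return sum(scores) / len(scores)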

It is suggested in (Hand et al, 2001) that while cross-validation is a robust and often used method, the validation set, though not used in training, becomes part of the design process of the classifier. To avoid further bias from the validation set, a third test set, or "hold out" set, is advocated to be used purely for results confirmation in cases where enough training data is available. The use of data sets from different time periods is suggested in (Weiss et al, 1998), in order to avoid any temporal bias on data sets where time is a factor.

Performance Metrics

As formalised in Section 3.3.2, classification performance is usually measured by a loss or error function over prediction results. The choice of error function will also drive the iterative improvement of classification parameters for a certain algorithm, and should be consistent with the requirements of the data mining task. The classification error rate, or misclassification rate, amounts to the proportion of classifier predictions that are incorrect, and is given by the following formula for a data set of total size N:
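
    error rate = (number of incorrect predictions) / N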

In many cases, a classifier may show low error rates but still display undesirable classification behaviour. If, for instance, a loan applications data set with a very high percentage of negative cases is provided, the classifier may simply make negative predictions for every new case seen, which would still generate low error rates, but its results would be of little practical use. To better illustrate the possible types of classification error, results are often displayed in terms of correct and incorrect classifications for each class, in a confusion matrix (Weiss et al, 1998), as shown below for a classification problem with two classes (positive and negative).

                              Real Value
Predicted Value     Positive              Negative
Positive            True Positive         False Positive
Negative            False Negative        True Negative

Table 7 - Confusion Matrix for 2-Class Classification Problem

To ensure the classifier is in fact detecting the correct classes, and covering a suitable number of cases in each class, the notions of precision and recall can be used. These are given by the formulas below, as presented in (Weiss et al, 2005):
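
    precision = (correct predictions made for a class) / (total predictions made for that class)

    recall = (correct predictions made for a class) / (total cases of that class in the data set)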

Precision indicates the rate at which a classifier's predictions are correct, or the percentage of its predictions that are correct. A high-precision classifier on the positive class would have high true positives and low false positives. Recall relates to how many of the total available cases of a given class are correctly identified by the classifier. A high recall classifier would have high true positives and low false negatives, thus covering most of the entries labelled "positive". The above formulas can be rewritten using the confusion matrix above, and are shown below for the "positive" class:
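
    precision (positive class) = TP / (TP + FP)

    recall (positive class) = TP / (TP + FN)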

There is an inherent trade-off between precision and recall: by increasing the precision of a classifier, it is made more specific and more "conservative" in making predictions, thus lowering recall. On the other hand, a high recall classifier might be tuned to make predictions more "generously", at the expense of precision. Using the loan application data set as an example, a high-precision, low-recall loan applications classifier would make few loan approvals, but its decisions would be correct most of

the time. A low-precision, high-recall loan applications classifier would make incorrect predictions more often, but is more likely to detect a positive loan application.

Accuracy and F-Measure

Aggregated metrics incorporating precision and recall information are sometimes used to report research results. Accuracy refers to the overall rate of correct classification across all classes, and is given by the formula:
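
    accuracy = (TP + TN) / (TP + TN + FP + FN)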

In other words, accuracy is the rate of correct predictions over all predictions, and equals 1 when no classification errors are made by the classifier. Accuracy, however, suffers from the same issues seen with misclassification rates, where a data set in which one class contains many more occurrences than another can generate biased results. To mitigate this problem, the harmonic mean of precision and recall, or F-measure, is often used:
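
    F-measure = 2 × (precision × recall) / (precision + recall)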

The choice of performance metric should take into account the data set and the nature of the prediction problem being investigated: classification precision might be of more importance than recall; for instance, diagnostic systems with a high cost of misclassified occurrences might prefer high-precision classification at the expense of recall. In the literature, F-measure is a common metric for reporting classification performance results, as seen in (Sebastiani, 2002), with accuracy often reported in cases where the data set has a balanced number of entries for positive and negative classes, as seen in (Pang et al, 2002; Pang et al, 2004).

3.3.9. Challenges to Classification in Data Mining

In this section some of the challenges and limitations inherent to the problem of supervised learning are explored in more detail.

Bias / Variance Trade-off

Classification error can be seen as composed of two factors: one comes from the model complexity of a classifier, which dictates how much information a classifier can capture, and hence how closely it can represent the nuances of the data set. The other indicates how much classification results vary, in terms of classification error, when yet unseen data samples are presented for classification. These two aspects of a classifier are commonly referred to as bias and variance. The overall error for a classifier can be interpreted as the sum presented in the formula (Weiss et al, 1998):
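
    expected classification error = optimal error + bias + variance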

Here, the optimal error is the best possible classification error an ideal classifier can obtain given the data set in question, which can be non-zero depending on the nature of the data and how closely it represents the events being captured. To further reduce the error of a classifier, one could attempt to build models of increasing complexity, capable of capturing more information from the data. However, in doing so, a more complex model is also more likely to represent a particular training set very well, but not the underlying patterns of the data in general: the classifier's ability to generalise is reduced by overfitting the model to the training data. This state of affairs is what is commonly known as the bias/variance trade-off: attempting to reduce the bias error by building a more complex model generates classifiers less able to generalise well, thus increasing the variance component of the error (Hand et al, 2001; Kulkarni et al, 1998).

Dimensionality

Intuitively, it would appear sensible to progressively add features representing different aspects of the data to train a classification algorithm, in the hope that this new information will assist the algorithm's predictions. However, adding features comes at the expense of a requirement for more training data, and of performance penalties in algorithm training and execution. This drawback is what is commonly called the curse of dimensionality. Adding an extra feature to the training data set of a classifier can be seen as adding another dimension to the search space the classifier now needs to inspect. This implies a larger effort required by training algorithms when searching

through the solution space for patterns. This can be seen, for instance, in decision tree algorithms, where a decision tree needs to grow one level to accommodate a new feature, potentially doubling the size of a tree which includes all attributes. Similarly, a nearest neighbour algorithm now requires the calculation of distance functions over an additional dimension across all of its training data set in order to make a prediction, thus affecting the runtime of the algorithm.

A larger search space not only affects algorithm performance: by increasing the size of the solution space, data points become more "distant" from one another. This can be noted in the example where a data set increases from a 2-dimensional plane to 3-dimensional space. The calculation of the distance between points now has an additional dimension added, as seen in the formula for Euclidean distance for both cases:
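
    d(a, b) = √( (a1 - b1)² + (a2 - b2)² )                      (two dimensions)

    d(a, b) = √( (a1 - b1)² + (a2 - b2)² + (a3 - b3)² )         (three dimensions)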

The distance between data points can be expected to increase, or at the very least stay the same, as more dimensions are added, which could mean patterns become less pronounced for the same amount of training data. This sparseness effect on data points translates to a requirement for more training data to be available for a training algorithm to achieve similar performance. This is demonstrated in (Kulkarni et al, 1998) for various classes of algorithms, where the convergence to an optimal result as a function of the training data available happens at a slower pace as the number of dimensions increases.

The concerns associated with the curse of dimensionality have created the need for strategies that can reduce the number of features while improving or preserving classification performance as much as possible. Thus, a number of feature selection and reduction techniques have been proposed in the literature. The more common approaches attempt to identify attributes with high correlation with the predicted

classes using statistical tests, or a measure of attribute information gain (Hand et al, 2001; Yang et al, 1997). Another approach, known as principal component analysis, involves merging several attributes together using a linear function, where each attribute receives a weight according to its relative importance to prediction (Weiss et al, 1998). Feature selection methods have been shown to perform similarly to full-feature classifiers while using a smaller subset of attributes on high-dimensionality text data sets, as seen in (Yang et al, 1997; Rogati et al, 2002; Forman et al, 2002).

This same concern has motivated the study of methods that achieve better results on high dimensional data sets, such as textual data sets, which, as will be shown in the next chapter, have this characteristic, usually containing several thousand attributes. Support Vector Machines are a method whose training time is dependent on the size of the training data set, but not on its dimensionality (Alpoydin, 2004), making them a good candidate for this type of data set. Another compelling feature of Support Vector Machines is that the method has an expected classification error dependent on the number of support vectors but not on the data set dimension, thus making it a suitable technique for addressing certain dimensionality issues (Burges, 1998). Indeed, Support Vector Machines have shown very good empirical results when applied to the problem of high dimensional text classification (Joachims, 1998).

No Free Lunch

It was noted in the previous section that a wide choice of classification methods is available to the data miner for performing a prediction task, with new approaches constantly being developed. It would be desirable to know whether there is a classification algorithm that can consistently provide better classification performance than the others, or whether such an algorithm could exist in theory. In the mid 1990s, however, a result presented in (Wolpert, 1996) demonstrated that this is not the case: the No Free Lunch theorem for classification states that, assuming no prior knowledge of the classification problem, all classes of classification algorithms will obtain the same classification performance when averaged over all possible learning scenarios, and this performance will be no better than random guessing, on average. In other words, there is little hope of devising a superior generic classifier algorithm that can consistently outperform all the others in all scenarios.

In practical terms, what this impressive result suggests is that good results obtained by a specific classifier on a specific data set reflect the "fitness" of the algorithm's heuristics to one particular case, and are not an indication that the algorithm will perform any better in a different scenario. This result reinforces the view that choosing which classifier to apply to a specific problem should be done in a data driven fashion, validating performance results against other models where possible, as discussed in Section 3.3.7 and seen in (Hand et al, 2001; Kulkarni et al, 1998; Weiss et al, 1998).

The No Free Lunch theorem also has implications for cross-validation techniques, stating that cases where an algorithm obtains better results in cross-validation are no indication that the method will perform equally well on unseen examples. However, as seen in (Wolpert et al, 2001), this result does not imply that cross-validation is not a valid and workable technique; rather, it shows that the reliability of this approach cannot be formally justified in all cases, and that it should be used with caution and within certain assumptions. Despite this caveat, cross-validation is widely regarded in the literature as a good method for testing and reporting classification performance.

3.4. Data Mining Tools

This section discusses software applications for performing data mining tasks, investigating commercial and open source offerings, from off-the-shelf packages to industry specific products. To illustrate the typical capabilities of a data mining package, the open source RapidMiner package is presented in more detail.

To illustrate the evolution of commercial data mining applications, three stages of development were proposed in (Piatetsky-Shapiro, 1999): the first generation of data mining applications appeared in the 1980s and comprised mostly tools originating from research and dedicated to a single task or single algorithm, such as performing decision tree classification or clustering. This class of tools required technically savvy users and a thorough understanding of data mining techniques. Integration between tools, and between tools and other data applications, was non-existent. Later, as data mining found its way to wider commercial use, software vendors in the mid 1990s started offering the second generation of data mining suites, which would cover not only a larger set of data mining techniques, but would also support other important activities

of the knowledge discovery process, such as data pre-processing and visualisation. By this time a survey conducted in (Goebel et al, 1999) found as many as 37 products aimed at performing one or more data mining activities that had achieved a sufficient level of maturity to warrant their use as commercial applications. The task of data mining however remained highly technical, and far too complex to be applied directly by business users. Hence, the third generation of data mining tools incorporates domain specific vertical knowledge for a given industry, integrates with a company's legacy systems and provides useful, intuitive interfaces for business users aiming at solving a specific business problem such as email spam detection, fraud detection, customer analysis and telephony network analysis.

3.4.1. Open Source Tools and RapidMiner

In the open source area, applications originating from research have gained popularity with practitioners. The wealth of options for open source data mining can be seen in the list of available packages in (KDNUGGETS, 2009) and in the Machine Learning Open Source Software portal and conferences, which provide a centralised forum for open source data mining (Sonnenburg et al, 2009). One important contribution to the open source data mining community is the Weka Toolkit (Witten et al, 1999), developed at New Zealand's University of Waikato and available under an open source license: it is a comprehensive suite of Java class libraries for performing a number of data mining tasks. Weka algorithms can be accessed programmatically from another application, or directly via its user interface, and the toolkit is now maintained by the open source community.

RapidMiner

RapidMiner is an open source data mining suite also available under a commercial license. It emerged from the YALE data mining environment (Mierswa et al, 2006), originally designed as a rapid prototyping system where data mining implementations could undergo a proof-of-concept using a tool that can easily build, execute and validate data mining models, before a more complex solution needs to be developed. RapidMiner has evolved into an offering with commercial strength features such as:
• Ability to quickly prototype data mining tasks on a graphical user interface.

• Developed in Java and platform-independent.
• Support for a wide range of tasks on data analysis, prediction, clustering and visualisation.
• Additional functionality specific to text mining.
• Integration with algorithms implemented for the Weka toolkit, making them accessible from inside RapidMiner.
• Integration with relational database systems via JDBC interfaces.

Figure 9 - RapidMiner WorkBench

3.5. Conclusion

This chapter presented a discussion on how to perform data mining, what the key objectives of a data mining task are, and how to go about executing them in a systematic way. The area of knowledge discovery processes was introduced, highlighting key methodologies available from both industry and research. Knowledge discovery processes embed knowledge and expertise obtained from practitioners and lessons learned from previous projects that can be reused in future projects, such as the data mining experiment performed as part of this dissertation. The KDD Process and CRISP-DM methodology were presented in more detail, and important aspects

present in these methodologies were discussed, such as their iterative nature, the importance of data preparation and the choice of data mining tasks to be performed. Furthermore, the link between knowledge discovery processes and knowledge creation, and its implications for knowledge management, were also investigated.

The key activities of the data mining process were surveyed, presenting their objectives and examples where each particular activity had a successful application, leading to a more focused study on supervised learning and classification – the scope of this dissertation. This chapter surveyed different classes of classification algorithms, and discussed aspects to be taken into account when selecting a classifier for a prediction task, such as data characteristics, algorithm performance and explanatory capabilities. Common metrics for measuring classification performance used in the literature were surveyed, and the challenges and limitations of supervised learning were discussed: the curse of dimensionality and how it affects a model's ability to produce good results in a timely fashion with limited training data; how the bias/variance trade-off imposes limitations on the complexity of an algorithm at the expense of classification performance; and how the No Free Lunch theorem demonstrates the impossibility of obtaining a classifier that consistently outperforms all other methods on average, and how this affects the choice of a classifier for a specific problem. Finally, the chapter concluded with a discussion of available data mining tools and features, to be taken into consideration for the implementation aspects of this dissertation's experiment.

The main outcomes of the review of the data mining literature presented in this chapter were to survey and identify important aspects of knowledge discovery processes, how they can be used in this dissertation's experiment, and their relationship to knowledge creation and knowledge management; to further understand data mining tasks, and survey the state of the art in classification algorithms, performance measurements and the limitations inherent to the problem of data classification; and finally to evaluate the state of the art in data mining applications, identifying potential candidates to be used as part of a data mining experiment. In the next chapter the areas of text mining and opinion mining are presented, and the SentiWordNet lexical resource is discussed, concluding the review of the literature in preparation for the experiment's setup, execution and results evaluation.

4. TEXT MINING AND OPINION MINING

This chapter reviews research literature in the fields of text mining and opinion mining. It discusses the motivations for performing knowledge discovery on text data sources, and illustrates how the field of text mining is closely related to that of data mining in general, but with its own additional research concerns arising from the need to understand and process the complexities and nuances of unstructured text data. The relevance of text mining technology to the field of knowledge management is also explored.

Next, the field of opinion mining in text is discussed. Opinion mining is a relatively new and challenging field dedicated to detecting subjective content in text documents, with a variety of uses in real world applications. It is also the main subject of this dissertation's experiment; thus, a thorough survey of the state of the art in approaches to performing opinion mining tasks is presented, and placed in the context of the objectives of this research.

4.1. Text Mining

The key driver for exploring text data from a knowledge discovery perspective, as is the case with knowledge discovery in databases, is the abundance of available textual data in digitised format. Text is a rich and natural means of storing and transferring information, with the Internet being one of the most notable examples: it has been estimated that over 3 billion documents in textual format have been indexed by the most popular Internet search engines (Sullivan, 2005). The situation is similar inside organisations, where a large variety of textual data within emails, memos, wikis, portal pages and corporate documents is now fully authored and made available in digital format, with some estimates indicating that up to 85% of corporate data is stored in the form of unstructured text documents (McKnight, 2005). The availability of information in text format suggests an opportunity for improving corporate decision making by tapping into text data sources, and the large volumes are thus likely targets for automated methods of discovering new knowledge. This opportunity triggered the development of the emergent area of knowledge discovery in texts, or text data mining (Feldman et al, 1995).

The fundamental implication of using text data sources for performing knowledge discovery is that the data is unstructured in nature: as postulated in (Weiss et al, 2005), there are no special requirements for including a text document in a text mining task, as there would be for a typical data set where the data is already organised with clear semantic meaning, using pre-defined data types, labels and ranges of values. Text documents, on the other hand, are far more flexible and richer in their expressive power, but such benefits are constrained by the added complexity inherent to the vagueness, uncertainty and fuzziness present in any natural language (Hotho et al, 2005). For this reason, the discipline of text mining, or knowledge discovery in text, leverages contributions from different research areas in computer science, such as computational linguistics, artificial intelligence, information retrieval and machine learning (Herschel et al, 2005). The definition of what is considered within the realm of text mining varies and sometimes overlaps with that of other disciplines that are also concerned with the computational treatment of text data, such as information retrieval and natural language processing (Kroeze et al, 2003).

In (Weiss et al, 2005) a task oriented view of text mining is proposed, encompassing the following aspects:

• Information Retrieval, or obtaining a subset of documents from a document corpus based on user specified search criteria, as observed on internet search engines and in the document searching capabilities of text knowledge repositories.

• Information Extraction, which deals with extracting specific information from text documents, such as extracting the date and time an event occurred from news documents, or numeric values for a given attribute – the price of a given asset for example. Text summarisation techniques that aim at providing a condensed representation of the information contained in a document would also fall in this category.

• Text Data Mining, the application of data mining techniques on text data sources, such as classification, clustering and exploratory data analysis for the purposes of extracting new and useful information.

The above suggests that text mining involves a wide range of techniques for the treatment of text, whose objectives are not strictly restricted to the discovery of knowledge. A similar definition to the above is observed in (Hotho et al, 2005). However, in (Hearst, 1999) a clear distinction is made between information extraction and retrieval techniques and text data mining, arguing that for the purposes of information retrieval the information contained in documents is already known (at least by the authors), and therefore would not fall within the scope of the discovery of new knowledge. According to Hearst, text data mining concerns the use of "text metadata to tell us something about the world, outside the text collection itself", and involves the exploratory analysis of text documents for deriving new knowledge. In (Feldman et al, 1995) Knowledge Discovery in Text is described as the application of knowledge discovery methods to textual data, closely resembling the notion of text data mining seen above.

From the above discussion, the definitions of text mining in the literature can be broadly classified into two types: first, text mining can be defined as all activities involving the treatment of text for analytical purposes, including extraction and retrieval techniques; second, text mining can be seen exclusively as text data mining, in line with the objectives of the definition of knowledge discovery stated in (Fayyad et al, 1996), thus leveraging text as the source of data for the discovery of new, yet unknown knowledge. It is worth noting, however, that in either case text data mining is closely linked to other research fields involving the computational treatment of text, and it is not uncommon to see the application of text mining techniques to related areas and vice versa: discovering new patterns in text may assist information retrieval, and information retrieval and knowledge extraction are useful techniques for text data mining, as will be further illustrated in the following sections discussing text mining applications and techniques.

4.1.1. Applications of Text Mining

Exploratory Text Analysis

The exploration of large text data sets is a useful approach for obtaining insights from data not usually possible by manual inspection. The value of this type of investigation for creating new knowledge can be seen in the breakthrough analysis of (Swanson et al, 1997), where mining relationships between research documents produced in distinct areas led to the corroboration of new hypotheses in the medical domain. Many other examples of successful systems and prototypes incorporating data mining methods for data description and visualisation are reported in the literature. In (Nasukawa et al, 2001) a system for the interactive analysis of patterns in text is presented, with a case study on support tickets where documents can be analysed by their correlation to categories, urgency and client feedback. In (Dorre et al, 1999) descriptive text mining techniques are applied to improve customer relationship management. Another innovative use of text data sets is finding trends in document creation according to topic, timeline or keywords. A prototype trend analysis system based on phrase similarity measures is demonstrated in (Lent et al, 1997), applied to the investigation of patents; another approach, using measures of word co-occurrence and the assistance of a pre-defined taxonomy, is investigated in (Feldman et al, 1998) for investigating trends and similarities amongst different news data sets.

Visualisation techniques for exploring documents clustered into topics, and graphs representing relationships between entities such as companies and executives, are presented in (Feldman et al, 1998-b). An approach for organising documents into clusters using clustering techniques, based on a legal documents data set, is presented in (Conrad et al, 2005). Other approaches to extracting information from document collections based on visualisation techniques are surveyed in (Hotho et al, 2005).

Information Extraction

Information extraction concerns the identification of relevant information present in text documents, which can be extracted into a more structured database for further use, or used as additional metadata in the exploration of text sources. Automated methods could be employed, for example, to extract company, industry and executive names from news sources to build a searchable database of company details. Systems that perform information extraction have been applied in law enforcement to help in the analysis of seized documents, where entities such as names, addresses and bank accounts were extracted into a database for the purposes of visual analysis and reporting (Weiss et al, 2004). In (Ghani et al, 2006), product attributes were extracted from text sources to enrich the content of a decision support system based on transactional data, with further uses in competitive intelligence and recommendation systems. In (Dorre et al, 1999) another example of attribute extraction from financial news is presented, for the purposes of document exploration.

Automatic Text Classification

Text classification involves the application of classification techniques to text data for the prediction of a class for a given document. One common use of text classification is automatic text categorisation according to topics, as seen in the categorisation of news sources in (Joachims, 1996; Joachims, 1998); a similar example can be seen in (Forman et al, 2006) for the automatic categorisation of incoming technical support requests per product type, reducing manual intervention and accelerating the routing of calls to the correct support teams. Supervised text classification techniques are also at the core of many approaches for filtering unsolicited content such as email spam (Provost, 1999; Meyer et al, 2004; Kolcz et al, 2001). The classification of text for forensic purposes such as author identification has also been studied in (Corney et al, 2002). Text classification techniques will be employed as part of this dissertation's experiment, and are discussed in more detail in Section 4.1.3 of this chapter.

Text Mining in Knowledge Management

As shown in the above examples, text mining technology opens opportunities for the creation of new knowledge from text, enriching data sources with data extracted from textual documents and optimising information retrieval from repositories for decision support. The large collections of text data in digital format available in today's companies suggest these techniques are useful tools for knowledge management systems. Indeed, the application of data mining techniques to text, or text data mining, is aimed at creating new knowledge and could therefore yield the same benefits to knowledge management, innovation and competitiveness studied in Chapter 2.

Applications of text mining technology as an assistive technology for knowledge management initiatives are outlined in (Marwick, 2001), such as automated classification of documents into categories, the organisation of knowledge repositories, and text summarisation techniques to mitigate information overload. An example is the eClassifier application presented in (Cody et al, 2002) aiming at using text mining technologies for building taxonomies in explicit corporate knowledge repositories. Another approach is seen in (Kao et al, 2003) where a knowledge management system was developed for automatic organisation of knowledge hierarchies to facilitate the retrieval of relevant knowledge.

4.1.2. Representation of Text Data

To realise the benefits of text mining applications, strategies are needed to address the complexities and ambiguities of natural language. In addition, a structured representation of text that captures relevant information from documents is a necessary requirement for many text mining tasks such as classification and clustering. In this section the techniques for the treatment of natural language and approaches for representing text are surveyed.

Natural Language

Due to its lack of structure, text data normally undergoes a preparation stage that attempts to capture the key components of natural language that will be employed in the text mining task. The treatment applied to the source data will dictate the model's characteristics and what information can be extracted from it. It is therefore important to match the preparation steps with the overall objectives of the exercise. The table below, extracted from (Stavrianou, 2007), summarises the key concerns commonly found in processing natural language for data mining, with text preparation encompassing all but the last task.

Issue                         Objectives
Stop lists                    Removal of terms occurring with high frequency and potentially of little relevance.
Stemming or lemmatisation     Reducing words to a normalised form, or stem.
Noisy data                    Correction of mistakes, word shortenings and alternative forms.
Tagging                       Adding syntactic categories to terms.
Word Sense Disambiguation     Determining the meaning of ambiguous terms that best applies to the context of the text.
Collocations                  Identifying terms represented by multiple words.
Tokenisation                  Determining the policy for grouping units of textual information.
Text Representation           Conversion of a textual document into a model that best captures relevant features for text mining.
Automated Learning            Determining the text mining approach.
(Text Data Mining)            Determining similarity measures.

Table 8 - Key issues in Text Mining (Stavrianou, 2007)

One of the first concerns in capturing relevant information from texts is the creation of stop lists. These lists indicate which terms from the document collection are highly likely to appear and carry little information when attempting to detect patterns. Common words in the English language such as "the", "and" and "of" are usual candidates for stop lists. Care must be taken, however, to build a list with the specific

objectives of the data mining task in mind, since stop words may become relevant in different scenarios.

The objective of lemmatisation and stemming is to reduce the number of variations of a term by transforming similar occurrences into a canonical form, or lemma, or by reducing words to their inflection root, or stem. This will reduce the number of attributes to be analysed in the text collection, reducing noisy signals and the dimension of the data set. An example of stemming is the conversion of singular/plural and present tense/past tense forms into a single form. This pre-processing step is dependent on the objectives of text mining, and the loss of relevant information from stemming should be evaluated (Weiss et al, 2004). One widely used approach is the rule based method described in Porter's algorithm (Porter et al, 1980), whose implementation is now in the public domain.
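
As an illustration, the NLTK toolkit mentioned later in this section includes an implementation of the Porter stemmer; a minimal usage sketch, assuming the nltk package is installed, is:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Inflected variants are reduced to a common stem.
    print([stemmer.stem(w) for w in ["running", "runs", "connection", "connections", "connected"]])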

As with any data collected in uncontrolled environments, data clean up tasks to eliminate noisy data need to be taken into account. In text, data clean up takes the form of correcting spelling errors and inconsistencies, resolving term shortenings and abbreviations, stripping markup language tags, and converting text to lowercase or uppercase where appropriate.

There are instances in natural language where the same term may yield different meanings, depending on its use within the sentence or the domain the document belongs to. Consider for example the two very distinct meanings of the word "book" in the following sentences:

• “This is a great book.” • “You can book your flights from this website.”

Discovering the meaning a specific term refers to in a sentence is known as word sense disambiguation, and is an active research topic in natural language processing. This problem has received much attention in the field of machine translation from very early stages, where its intrinsic difficulties were noticed (Nirenburg, 1997): in broad terms, to obtain reasonable results in word sense disambiguation a large amount of information is required about the context in which the word is

90 being used – such as its role in the sentence and discourse aspects - and on the availability of external knowledge sources where sources of meaning can be queried. Several approaches for addressing word sense disambiguation are surveyed in (Ide et al, 1998), where it can be noted the use of external resources progressively evolved from manually building small disambiguation resources of restricted scope to the more recent data driven methods that leverage available knowledge resources in electronic format such as online ontologies, large annotated corpus and large scale manually derived lexical resources such as the WordNet database of term relationships (Miller et al, 1990).

Tokenisation refers to the process of segmenting a text input into its atomic components. The approach taken to tokenise a document depends on the mining objectives; one common approach is to use individual words as tokens, and spaces and punctuation marks as separators. However, punctuation marks can often be part of the analysis, as seen in (Corney et al, 2002), and might instead be used as tokens. Word collocations are terms described by more than one word, and should be treated as a single unit for analysis. Collocations can be found by statistical similarity when examining text sets, or derived via information extraction techniques.

Finally, depending on data mining objectives, it is important to determine the grammatical class a term belongs to. This can be done by attaching tags indicating the part of speech being used by a word in the text sentence. A part of speech tagger is an application that performs this task. Taggers are usually built by statistical analysis of patterns from large corpora of documents with annotated parts of speech, with the Penn Treebank (Marcus et al, 1993) and the Brown Corpus (Garside, 1987) being popular examples. The Brill part of speech tagger (Brill, 1992) is one commonly used algorithm based on building tagging rules from annotated documents. Other approaches to part of speech tagging have been proposed using maximum entropy techniques (Toutanova et al, 2000) and statistical Markov models (Brants, 2000). Several implementations of part of speech taggers can be found in the free NLTK toolkit for natural language processing (Loper et al, 2002).
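As an illustration of automatic tagging, the sketch below uses the default tagger shipped with the NLTK toolkit mentioned above; it assumes NLTK and its tokeniser and tagger models have been downloaded, and the example sentence is illustrative only.

    # A minimal sketch of part of speech tagging with NLTK (Loper et al, 2002).
    import nltk

    sentence = "the variety of music is easy to become immersed in"
    tokens = nltk.word_tokenize(sentence)
    # The default NLTK tagger returns (token, tag) pairs using Penn Treebank tags.
    print(nltk.pos_tag(tokens))
    # e.g. [('the', 'DT'), ('variety', 'NN'), ('of', 'IN'), ('music', 'NN'), ...]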

The above discussion presented the main aspects of natural language processing that can be part of a text mining implementation. In the next section, text representations for mining text documents are discussed and techniques for text classification are surveyed in more detail.

Text Representation

Before executing machine learning techniques on textual data, a structured document representation needs to be devised, capturing as much statistical information about the document as possible. The most common representation formats are variations of the word vector concept originally proposed in (Salton et al, 1975), based on the assumption that a document can be characterised by the tokens, or words, it contains. Considering a document tokenisation based on words, at its simplest form a word vector data set similar to the model described in Chapter 3 can be built, where each column represents information for a given word and each line represents a document in the collection. When a word is present in a given document, the corresponding entry takes a non-zero value representing term presence. This can be, for instance, a binary value indicating the term has occurred in that document. A partial word vector data set is represented in the figure below.

Figure 10 – Example Word Vector

A common extension to this idea uses frequency information about a term, instead of a binary presence indicator. Attribute values are numeric values indicating how often a term appears in a given document. The frequency information can be refined by balancing it with a measurement of the importance of a given term in the overall text collection, as high frequency terms that appear in most documents are less likely to be statistically significant for pattern detection. In this case, a popular measure is the TF-IDF metric, or term frequency - inverse document frequency (Salton et al, 1987), given by the formula:

tfidf(t, d) = tf(t, d) × log(N / df(t))

The first term is the frequency of word t in document d, and the second is the inverse document frequency, a measure of how rare the term is across a document collection of size N, with df(t) representing the document frequency, that is, the number of documents in the collection in which the word appears.
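As a minimal sketch of the weighting scheme above, the function below computes tf-idf for a term in one document of a tiny illustrative collection; the variable names follow the formula rather than any particular library.

    # A minimal tf-idf sketch over a toy, pre-tokenised document collection.
    import math

    documents = [
        ["great", "plot", "great", "acting"],
        ["poor", "plot", "dull", "acting"],
        ["great", "film"],
    ]
    N = len(documents)

    def tfidf(term, document, documents):
        tf = document.count(term)                      # term frequency in this document
        df = sum(1 for d in documents if term in d)    # documents containing the term
        return tf * math.log(N / df)

    # "plot" appears in two of three documents, so its weight is discounted
    # relative to a rarer term such as "dull".
    print(tfidf("plot", documents[0], documents), tfidf("dull", documents[1], documents))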

The number of columns in a word vector is a function of the number of distinct tokens in the document collection, or its dictionary. This number can grow quite quickly with larger and richer documents, and it is not uncommon to see very high dimensional word vector spaces with several thousand attributes. To mitigate the negative effects of high dimensionality, stop word lists and stemming are often used to reduce the final number of terms (Weiss et al, 2004; Sebastiani et al, 2001).

One natural extension of the above model is the use of more than one word to represent a column in the attribute vector. This approach can be useful to detect important collocations based on more than one term, sometimes referred to as bigrams (two word collocations), or n-grams in the generic case. Combinations of single term unigrams with multi-term n-grams are also possible, and could be effective depending on the domain and type of mining problem.

4.1.3. Document Classification Techniques

Of particular concern to this dissertation is the subject of text classification within text mining. This section investigates strategies and issues encountered in topic based text classification, and its relationship to data mining algorithms.

Initial text classification methods suggested in research were based on knowledge engineering approaches, where expert knowledge was elicited and encoded into sets of rules to classify documents. One example can be seen in the Construe system (Hayes et al, 1990) for classification of news stories. However, systems based on manually built rule bases suffer from the high maintenance cost of updating the repository with expert knowledge, and slow implementation times due to manually deriving rules from experts. In the 1990s, with advances in machine learning and the availability of computing resources, this method gave way to more automatic supervised learning techniques that depend on training data, but are able to perform classification without the expensive knowledge elicitation step while achieving similar performance. In present day research, data driven machine learning methods prevail as the main paradigm for performing text categorisation (Sebastiani et al, 2001).

The most common representation of documents for text classification is based on the word vector representation, also referred to as bag of words, seen in the previous section. Early work on evaluating text classification with word vectors can be seen in the rule induction method proposed in (Apte et al, 1994); in (McCallum et al, 1998) the Naïve Bayes classifier is applied to text categorisation of topics in news sources; in (Joachims, 1996) a comparison of various supervised learning methods with TF-IDF word vectors is presented; a more recent example using binary presence word vectors and support vector machines applied to spam filtering is presented in (Kolcz et al, 2001). Other applications of this representation to text classification have been surveyed in the literature and can be found in (Joachims, 1998; Weiss et al, 2004; Sebastiani et al, 2001).

Extensions to the bag of words representation are also seen in the literature, aiming at capturing other useful non-textual information from documents for classification. In (Corney et al, 2002), stylometric measures such as the number of paragraphs, paragraph length and types of punctuation are added to the model for gender detection of emails. In (Moschitti et al, 2004), linguistic information from parts of speech and proper nouns extracted from text is added as features to text categorisation of news sources.

As mentioned earlier, the word vector representation of documents generates data sets with a very large number of attributes, which are liable to the issues related to the curse of dimensionality seen in Chapter 3. One approach to mitigate this problem is to use linguistic pre-processing such as stop word lists and stemming to reduce the number of terms before initiating text classification. Another approach is to employ feature selection mechanisms seen in data mining to automatically remove less relevant features while minimally affecting classification performance. Studies seen in (Rogati et al, 2002; Forman et al, 2002) report good results using statistical feature selection methods for reducing the size of the feature space in text classification problems.

It is also worth noting that the high dimensionality of text data sets makes them a good candidate for the application of support vector machines. Theoretical results seen in Chapter 3 indicate that the algorithm's training efficiency and error rate are a function of the number of support vectors, and not of the number of dimensions, suggesting potentially better results could be obtained for this domain. In (Joachims, 1998) it is argued that in the high dimensional feature space of textual data, most features are significant and should be present in the model. The results of a feature selection experiment are shown where aggressive pruning of features led to poorer results. In the same study, superior results were obtained with support vector machines in comparison to other classification algorithms on text data sets, thus eliminating the need for the often expensive feature selection step.
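As a rough sketch of the word vector and support vector machine combination discussed above, the example below uses the scikit-learn library (an assumption for illustration, not a tool prescribed by this dissertation) to train a linear SVM on TF-IDF word vectors for a toy topic classification task.

    # A minimal sketch of topic classification with word vectors and an SVM,
    # assuming scikit-learn is available; the tiny corpus is illustrative only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_docs = ["stocks fell sharply today", "the striker scored twice",
                  "markets rallied on earnings", "the home side won the match"]
    train_labels = ["finance", "sport", "finance", "sport"]

    vectoriser = TfidfVectorizer()            # builds the word vector representation
    X = vectoriser.fit_transform(train_docs)  # documents become rows of a sparse matrix
    classifier = LinearSVC().fit(X, train_labels)

    print(classifier.predict(vectoriser.transform(["shares dropped after the report"])))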

4.2. Opinion Mining

Opinion Mining is a new and exciting field of research concerned with extracting opinion related information from textual data sources. It has the potential for a number of interesting applications in both commercial and academic areas, and poses novel intellectual challenges, which continue to attract considerable research interest. In this section the research field of opinion mining is introduced, and its motivations, key tasks and challenges are discussed in more detail. Then, the SentiWordNet lexical resource for opinion mining is presented, and its potential advantages, applications and limitations are discussed.

4.2.1. Introduction: Opinions in Text

Information concerning people's opinions can be a very important component for more accurate decision making in a number of domains. Companies, for instance, have a keen interest in finding out what their customers' opinions are of a new product launched in a marketing campaign. Consumers, on the other hand, would benefit from accessing other people's opinions and reviews of a given product they are intending to purchase, as recommendations from other users tend to play a part in influencing purchasing decisions. Knowledge of other people's opinions is also important in the political realm, where, for instance, one could find out the sentiment towards a new piece of legislation, or an individual such as a politician or activist.

In recent years, the internet has enabled access to opinions in the form of written text from a variety of sources and on a much larger scale. It has also made it easier for people to express their opinions on virtually any subject by means of specialised product review websites, discussion forums and blogs. This is in fact a growing trend, as pointed out in the research performed by (Horrigan, 2008) with 2000 American adults, which states that as of 2007, 60% of the American population, or 81% of the country's internet users, had used the internet to perform research on a product they intended to purchase. The research also shows that over 30% of American internet users have at one time posted a comment or review online about a product or service they have purchased, suggesting an ever growing availability of opinion related information on the web available to consumers. It is worth highlighting that consumer goods are not the only target of opinion related content: specialised websites that gather and provide opinion information on companies, politicians and education resources are also available. Opinion sources are not restricted to specialised review sites, and are also found in users' blog posts, discussion forums and embedded in online social networks.

The internet is clearly a vast repository of publicly available user generated content dedicated to expressing opinions on any topic of interest. However, despite the clear benefit of having such information available, it was also pointed out in (Horrigan, 2008) that 58% of internet users reported finding product information online either confusing, difficult to find, or overwhelming in volume. These results indicate that the problem of information overload, discussed in (Cody et al, 2002; Farhoomand et al, 2002), also exists in the realm of product reviews, where vast information resources are difficult to leverage and are poorly utilised as a result. Automated methods for efficiently extracting knowledge from these resources appear an attractive proposition both to individuals, who would be able to make better decisions, and to companies, who could quickly gauge opinions on their products and services, adding knowledge to their product development processes. This is in fact precisely the realm of knowledge discovery and data mining proposed in (Fayyad et al, 1996) and discussed in Chapter 3. In addition, opinions are generally expressed in textual form, making them rich ground for the application of text mining and techniques to analyse natural language. Thus, the motivating need to analyse large volumes of opinion information, coupled with advances in natural language processing and machine learning methods, gave rise to research in the emerging field of Opinion Mining.

Opinion Mining is concerned with applying computational methods for the detection and measurement of opinion, sentiment and subjectivity in text (Pang et al, 2008). A text document can be seen as a collection of objective and subjective statements, where objective statements refer to factual information present in text, and subjectivity relates to the expression of opinions, evaluations and speculations (Wiebe, 1990). To further illustrate the motivations for performing opinion mining, we now survey the potential applications of computing systems that apply techniques that detect and extract the subjective aspects of text.

Search Engines

The most direct application of opinion mining techniques would be the searching of opinions within documents. Finding subjective statements related to a topic, and their bias, can augment traditional search engines into recommendation engines by retrieving results on a given topic containing only positive or negative sentiment (Pang et al, 2002), for example when searching for products that received good reviews in a particular area, such as a user query for digital cameras with good feedback on battery life. On the other hand, information retrieval systems that need to provide factual information on a given subject can detect and discard opinion information to increase the relevance of results (Wiebe et al, 2004).

Inappropriate Content

In a collaborative environment such as a discussion group or email list, opinion mining could be applied to classify subjective statements containing overly heated or inappropriate remarks, also called flaming behaviour (Kaufer, 2000). Similar techniques could assist more efficient online advertisement strategies by avoiding ad placements next to content that is related to the ad campaign but carries unfavourable opinions towards a certain product or brand (Jin et al, 2007).

Customer Relationship Management

Systems that manage customer interactions can become more responsive by using sentiment detection as a tool to automatically predict the level of satisfaction of client feedback. One example is the automatic classification of customer feedback replies by email containing positive or negative sentiment (König et al, 2006), which could then be used for automatic routing of messages to the appropriate teams for corrective action when necessary.

Business Intelligence

Opinion mining has the ability to add analysis of subjective components of text to discover new knowledge from data. This may take the form of aggregated sentiment bias information from user feedback, which can be used to drive marketing campaigns and improve product design. In the financial industry, sentiment information present in financial news has been studied to assess its impact on the performance of securities (Devitt et al, 2007).

Benefits to Knowledge Management

From the examples of opinion mining applications presented above, it can be seen that this field of research has the potential to add value to knowledge management efforts in companies across a range of knowledge based activities. Knowledge based systems that store explicit content can become more efficient by extending their query interfaces to include opinion information for more relevant results, or by excluding subjective documents when more factual results are needed. Knowledge sharing systems that provision collaborative environments for the exchange of explicit or tacit knowledge can become more fluid and require less administration effort by employing sentiment detection to avoid flaming and other unwanted user behaviour. Finally, knowledge discovery systems can leverage opinion information to help knowledge creation in the organisation, and improve decision making where user feedback is relevant.

4.2.2. Key Problems Addressed by Opinion Mining Research

In an attempt to map the activities of the emerging field of opinion mining, the research survey of (Pang et al, 2008) categorises the area into two broad fields of classification and extraction. Classification entails research related to detecting, in the first instance, whether a piece of text can be categorised as subjective or objective and, in case it is subjective, being able to correctly predict the text's sentiment orientation or polarity. The extraction aspect of opinion mining shares the concerns of information retrieval, and attempts to identify within a text document the key attributes of an opinion, such as the holder or the entity to which it refers, with a view to building summaries based on opinion information. A similar formulation appears in (Esuli et al, 2006), where the primary objectives of opinion mining are categorised into 1) determining the degree to which a given text is objective or subjective; 2) determining whether it expresses a positive or negative bias, if the text is indeed subjective; and 3) determining the strength of the polarity of a given subjective text.

In (Pang et al, 2008) other categories that may fall into the realm of author opinion or sentiment are mentioned. For instance, mining "pros and cons" expressions reflects points of view that form part of an argument on a given topic; these are close to, but not necessarily the same as, author sentiment, and indeed a formulation of opinion proposed in (Kim et al, 2004) does take this into account. Agreement detection is a similar problem, where differing or agreeing opinions between two distinct documents are sought.

One field of research that shares some of the concerns of opinion mining is that of affective computing, aiming at the development of computational approaches for detecting human emotions such as anger, fear and humour. Affective computing has applications in human computer interaction, but is closely linked to the problem of detecting subjective text, since both relate to the expression of human emotions. In (Mihalcea et al, 2005) a method for detecting humour in text is proposed with good empirical results, and in (Strapparava et al, 2004) a lexical resource for assisting the detection of emotions is proposed. Further in this chapter, it will be shown how this same resource is used as a starting point for establishing the opinion strength of terms in the SentiWordNet lexical resource (Esuli et al, 2006).

For the purposes of this research, the focus of this review is on the predictive aspects of opinion mining related to the tasks of subjectivity detection and sentiment classification of texts. As noted in (Pang et al, 2008), opinion extraction research can often be placed more naturally within the realm of information extraction rather than predictive data mining. It is acknowledged, however, that opinion extraction is a relevant part of this field and one that often goes hand in hand with opinion detection and classification methods, as illustrated in (Dave et al, 2003); thus any relevant research on opinion extraction will be mentioned where appropriate to the discussion.

4.2.3. Subjectivity Detection

In order to detect subjectivity in text automatically, a computational model requires a formalisation of what is understood by the concept. In (Wiebe et al, 2004), the subjectivity of a sentence is defined based upon previous work in linguistics and literary theory. First, there are subjective elements: the linguistic expressions that characterise private states of mind. Characterising subjective elements is not a trivial task; they may appear in text as single words, expressions or entire sentences, may depend on the context, and may also be evident in text style. A subjective element expresses the opinions, thoughts and speculations of a source, that is, the document author or someone mentioned in the text. Finally, a subjective element has a target, or the object being referred to. A similar abstraction of subjectivity is presented in (Kim et al, 2004), where an opinion is expressed as a quadruple of the form [Topic, Holder, Claim, Sentiment], in which a Holder believes a Claim on a given Topic, with a given Sentiment associated with it. Subjectivity detection is generally concerned with finding the subjective elements or sentiment in text, but other aspects of the above characterisation can also be of relevance for query systems and summarisation tasks, for instance when tracking the opinions of a given person: handling queries such as "what does the Taoiseach think of the new EU referendum?" would involve knowledge of the subjective element and also of the source and target components of subjectivity.

There have been several approaches proposed in the literature to detect elements of subjectivity in text. In (Wiebe et al, 2004) an approach is proposed based on exploring word relationships learned from an annotated corpus of subjective expressions. Subjectivity is annotated manually at expression, sentence and document levels, and used to train detection methods based on term presence and term collocations, or their position in the text relative to each other. This is based on the hypothesis that the subjectivity of an expression is a function of how subjective its surrounding elements in the text are. The method has the advantage of relying on automatically extracting knowledge from a corpus. Word collocation analysis also assists in term disambiguation in cases where words can alternate between objective and subjective meaning depending on context, such as the word "heart" in the examples:

• “the gory scenes in this film are not for the faint of heart.” • “open heart surgery will be available from January.”

Analysis of the corpus has also shown that unique or rare terms are often associated with subjective expressions, indicating a certain measure of author "creativity" when expressing opinions.

Another approach to detecting subjectivity is proposed in (Pang et al, 2004), where machine learning classification algorithms are trained to predict objective or subjective sentences based on a training set of documents extracted from the internet. The subjective data set is comprised of 5000 text extracts from film reviews, whereas the objective set is built from 5000 extracts from film plot summaries. A similar approach is presented in (Yu et al, 2003), where a Naïve Bayes classifier is trained to detect subjective documents based on a data set of news sources known a priori to carry objective (news and business sections) and subjective (editorials and letters to the editor) content, with good results. The method is extended to sentence-level opinion detection by including parts of speech, sentence similarity measures and counting the presence of semantically oriented terms from a subset of manually labelled seed words. Results from (Wiebe et al, 1999) also show positive results for Naïve Bayes classifiers trained on a data set of subjective and objective documents, using features derived from part of speech, punctuation and syntax elements.

4.2.4. Sentiment Classification

Sentiment classification is concerned with determining what, if any, is the sentiment orientation of the opinions contained within a given document. It is assumed in general that the document being inspected is known to represent opinion, such as a product review, and that the document's opinion refers to a single entity (Pang et al, 2008). Opinion orientation can be classified as belonging to opposing positive or negative polarities - positive or negative feedback about a product, favourable or unfavourable opinions on a topic - or ranked according to a spectrum of possible opinions, as is the case with film reviews with feedback ranging from zero to five stars (Pang et al, 2005). Sentiment classification is at the centre of this dissertation's experiment, and in this section the approaches to sentiment classification studied in the literature and their challenges are surveyed in more detail.

Word Vectors

One natural approach to performing sentiment classification is to take the traditional text mining representation of documents as word vectors, where each entry maps to a term found in the corpus of documents, and the value of a given entry corresponds to a measure of term presence or relative term frequency from the field of text mining and information retrieval. In (Pang et al, 2002) a series of experiments using various classes of word vectors for sentiment classification of film reviews generated positive results for single term word vectors - or unigrams - using binary presence values for each term. Binary presence performed better than frequency-based word vectors, suggesting that term existence, rather than frequency, is more significant to opinion identification. This distinction is observed in (Pang et al, 2008), with the suggestion that traditional topic-based document classification relies more strongly on repeated occurrences of the same terms throughout the text, whereas this may not be the case for opinions.

The experiment in (Pang et al, 2002) achieves best results when using term unigrams rather than larger n-gram features, even though bigrams could capture sentiment encoded in the form of two-term expressions such as "really good" or "much preferred". The poorer classification results could be attributed to the increase in the volume of training data needed for all relevant term n-grams to be captured. Indeed, work from (Cui et al, 2006) reports good results for higher order n-grams where a significantly larger training data set, comprised of over 320,000 product reviews, is available. Another similar experiment, based on word vectors and product reviews as the data set, reporting good results for trigrams is seen in (Dave et al, 2003).

Taking the traditional text mining approach of training a classifier on word vectors for opinion mining generates good classification performance, but as observed in (Pang et al, 2008), these results stay well below those obtained for topic-based document classification using the same techniques. Empirical performance metrics for topic-based text categorisation using Support Vector Machines seen in (Joachims, 1997) and surveyed in (Sebastiani, 2002) show how high precision, high recall topic based classification can be achieved, based on results using well known experiment data sets. This observation, coupled with further analysis of opinion bearing documents, suggests that sentiment information needs to be captured by other means. One point highlighted in (Pang et al, 2002) is the issue of thwarted expectations, as seen in the extract below:

“This film should be brilliant. It sounds like a great plot, the actors are first grade… However, it can’t hold up”

In the above case the passage contains a high number of positive statements, building up the expectation of a positive review, but the overall sentiment of the review is still negative. This affects prediction decisions based on term presence alone, and suggests that the order in which opinions are presented is of importance to overall sentiment (Pang et al, 2008).

Word Sense Disambiguation

Issues stemming from ambiguity in word sense, surveyed in Section 4.1.2 of this chapter, also arise in opinion mining problems. In (Wiebe et al, 2006), subjectivity detection is improved by adding a subjective feature to detect terms in need of disambiguation, and the authors speculate that sentiment classification tasks could be improved with the assistance of term disambiguation techniques. This need has also been highlighted in (Dave et al, 2003), when inspecting the results of a sentiment classification experiment based on term opinion information.

Parts of Speech

Classifying terms from a textual document into their grammatical roles, or parts of speech, within a sentence has also been explored in opinion mining. A motivating factor behind this approach is that detecting parts of speech can be considered a form of word disambiguation for the cases where word senses are associated with grammatical use, such as noun, verb, etc. (Wiebe et al, 1998). Another factor is the finding that adjectives are considered good indicators of opinion information and have been seen to provide good correlation to sentiment orientation, as reported by (Turney, 2002). In (Pang et al, 2002), a study reports good results using only adjectives as features to perform sentiment classification with a machine learning method, although with poorer results than using full word vectors as features. The use of parts of speech as a pre-processing step for deriving features for opinion mining has also been seen in a number of other sentiment classification experiments: in (Yu et al, 2003) it is used as part of a feature set for performing sentiment classification on a data set of newswire articles, with similar approaches attempted in (Pang et al, 2002; Salvetti et al, 2004; Gamon, 2004) on various data sets; in (Turney, 2002) a method that detects and scores patterns of parts of speech is applied to derive features for sentiment classification, with a similar idea applied to opinion extraction for product features seen in (Yi et al, 2003).

Document Style and Document Structure

The subjective components of a document have also been shown to have relationships to the document structure and writing style, as in the example of thwarted expectations previously discussed in this section. One consideration is term position within the document. It can be argued that the location of a specific opinion bearing term within a document can have greater or lesser influence on overall sentiment classification: if, for instance, this term is placed towards the end of the document, it may have a greater relation to the author's opinion, as the end of the document is generally where concluding remarks are present. This is one aspect that has been explored in the experiments presented in (Pang et al, 2002), where it was seen to influence overall classification, albeit on a small scale. Another attempt to model discourse structure can be seen in (Devitt et al, 2007), where a graph-based representation of text relationships is proposed, based on linguistic models of lexical cohesion and other metrics extracted from the document.

Detecting the existence of expressions that can increase, decrease or invert the sentiment orientation of text - also called valence shifters - is of importance to the sentiment classification of documents. Consider a comment present in a film review, such as the one below:

“This film is not great, not funny and not interesting.”

Predicting the correct sentiment of the above review cannot rely on term orientation alone, since each positive-oriented term has been negated and is expressing its exact opposite. In (Pang et al, 2002) negation detection is modelled by adding a modifier prefix to negated terms, such as converting "great" into "NOT_great". The resulting modified text is then used as input for a word vector classifier. Several approaches have been studied for the detection of negation in the context of extracting information from medical records (Chapman et al, 2001; Huang et al, 2007; Mutalik et al, 2001). Other valence shifting modifiers, such as "very", "just" or "extremely", have also been shown to influence sentiment classification of the overall document (Kennedy et al, 2006).
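A minimal sketch of this kind of negation marking is shown below; the choice of negation words and the rule of resetting at the next punctuation mark are simplifying assumptions rather than the exact procedure of (Pang et al, 2002).

    # A minimal sketch: prefix terms following a negation word with "NOT_"
    # until the next punctuation mark is reached.
    NEGATIONS = {"not", "no", "never", "n't"}
    PUNCTUATION = {".", ",", ";", "!", "?"}

    def mark_negation(tokens):
        negated, output = False, []
        for token in tokens:
            if token in PUNCTUATION:
                negated = False
                output.append(token)
            elif token.lower() in NEGATIONS:
                negated = True
                output.append(token)
            else:
                output.append("NOT_" + token if negated else token)
        return output

    print(mark_negation("this film is not great , not funny and not interesting .".split()))
    # ['this', 'film', 'is', 'not', 'NOT_great', ',', 'not', 'NOT_funny', 'NOT_and', ...]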

Humoristic features such as sarcasm and irony also play a part in expressing author sentiment. These can be relatively more complex to identify, usually not depending on term sentiment alone, but relying on word play, contrasts and domain knowledge. Other affective expressions such as anger, joy and fear can also be closely related to author sentiment, and therefore opinion. The role of affective computing has been highlighted in (Strapparava et al, 2006). Supervised approaches to humour detection have been investigated in (Mihalcea et al, 2005) for a limited aspect of written humour, but with some success when evaluated on a test data set.

Finally, it is to be expected that opinionated documents may also contain objective sections. In product reviews this may amount to sections describing the product features, as opposed to expressing an opinion on them; in film reviews the author may choose to present details of the plot, or the background of a certain actor, to further back up an argument. It can be speculated that the objective sections of an opinionated document will in general carry less opinion bias than the subjective ones, and may cause a decrease in performance on overall document classification. As an example, consider the case where actor dialogue is quoted by the author in a film review, containing terms in opposition to the author's opinion. Similar issues have been noted in (Pang et al, 2002; Kennedy et al, 2006), and the beneficial aspect of subjectivity detection and filtering for sentiment classification has been noted in (Pang et al, 2008). In (Pang et al, 2004) an approach to filter out objective sentences as a pre-processing step to document classification is proposed, based on training on a data set of subjective sentences, with considerable improvements over a baseline machine learning classifier.

Combining Approaches

Taking the view that different methods for performing sentiment classification capture different types of sentiment related information from a document, it is worth noting the contributions in the literature on combining results from more than one classifier in order to obtain better results. This can be done not only to address induction bias from a specific classifier algorithm, as seen in Section 3.3.7, but also to make better decisions from a pool of classification techniques, each leveraging different types of data. Applications of this idea to sentiment classification can be seen in (Kennedy et al, 2006), where a combination of classifiers using word vectors and scores from a word list generates improved results over a baseline. A similar approach is seen in (Mullen et al, 2004), with the combination of proximity metrics and term relationship information.

In the next section we turn our attention towards a distinct class of techniques for performing sentiment classification, based on building lexicons containing sentiment information that can be employed in the detection of opinion in documents.

4.2.5. Lexical Resources for Opinion Mining and SentiWordNet

One common approach to performing both subjectivity detection and sentiment classification involves the use of key words that are assumed to be indicative of either positive or negative bias, and therefore also of overall subjectivity. This idea is based on the hypothesis that words can be considered a unit of opinion information, and several methods based on this assumption have been proposed with considerable success: (Turney et al, 2003) proposes a subjectivity detection method that extends a list of seed words based on a proximity measure to other common terms in text; in (Pang et al, 2002) an experiment with a list of manually created positive and negative words yields accuracies of 69% in the task of sentiment classification of film reviews; in (Kennedy et al, 2006), a lexicon of positive, negative and valence shifter terms is built from various sources to perform document-level sentiment classification.

One interesting aspect of approaches based on word lists is that they do not necessarily require training data for making predictions, since they rely only on a pre-defined sentiment lexicon, thus being applicable to cases where no training data is present. For this reason these methods are often labelled as unsupervised learning approaches (Pang et al, 2008).
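The sketch below illustrates the unsupervised word list idea in its simplest form: a document is labelled by counting matches against positive and negative term lists. The word lists here are illustrative assumptions, not a published lexicon.

    # A minimal, unsupervised word list classifier: count lexicon matches.
    POSITIVE = {"great", "excellent", "good", "enjoyable"}
    NEGATIVE = {"bad", "poor", "dull", "boring"}

    def classify(tokens):
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        return "positive" if score > 0 else "negative" if score < 0 else "undecided"

    print(classify("a dull plot but excellent acting and a great soundtrack".split()))
    # positive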

Creating word lists manually, however, is time consuming, and approaches have been proposed in the literature for automatically creating resources that contain opinion information on words based on readily available lexicons, often termed lexical induction (Pang et al, 2008). In (Kennedy et al, 2006) a lexicon of positive, negative and valence shifting terms is built from various data sources for the purposes of sentiment classification. Another common approach is to derive opinion information from the freely available WordNet database of terms and relationships (Miller et al, 1990), typically by examining term relationships to a subset of core terms assumed a priori to carry opinion information, such as "good", "excellent", "bad" and "poor". This approach to lexical induction can be seen in the subjectivity detection research conducted in (Yu et al, 2003) and in sentiment classification (Dave et al, 2003; Kim et al, 2003; Salvetti et al, 2004). A similar example of a WordNet-based lexicon has been proposed for the purposes of affective computing, namely the WordNet-Affect resource (Strapparava et al, 2004).

SentiWordNet

One example of a lexical resource conceived to assist in opinion mining tasks is SentiWordNet (Esuli et al, 2006). SentiWordNet aims at providing term level information on opinion polarity by deriving this information from the WordNet database of English terms and relations (Miller et al, 1990) in a semi-automatic fashion.

For each term in WordNet, a positive and a negative score ranging from 0 to 1 are present in SentiWordNet, indicating its polarity, with higher scores indicating terms that carry strong opinion bias and lower scores indicating terms that are less subjective. The figure below illustrates the score for the term "interesting", extracted from SentiWordNet's web interface.

Term: interesting
  Positive: 0.325
  Negative: 0.0

Figure 11 - SentiWordNet Sample Score (http://sentiwordnet.isti.cnr.it)

WordNet

WordNet is a lexical database for the English language where terms are organised according to their semantic relations (Miller et al, 1990). It has been widely applied to problems in natural language processing, with a comprehensive list of work in the literature available in (Csomai et al, 2008). Several opinion classification methods in the literature are based upon it (Kim et al, 2004; Dave et al, 2003; Salvetti et al, 2004). Before describing how SentiWordNet is built, a brief discussion of the database that originated it will help in understanding the underlying motivations and how the data is organised.

The WordNet lexicon is the result of research efforts in linguistics and psychology at Princeton University aimed at better understanding the nature of semantic relations of terms in the English language, and at providing a complete lexicon of the English language where terms can be retrieved and explored according to concepts and their semantic relationships. Now at its third version, WordNet is available as a database, searchable via a web interface or via a variety of software APIs, providing a comprehensive collection of over 150,000 unique terms organised into more than 117,000 different meanings (WORDNET, 2006). WordNet has also grown with extensions of its structure applied to a number of other languages (WORDNET, 2009).

Key Term Relationships

The key relation between terms in WordNet is similarity of meaning, or synonymy. Terms are grouped together into sets of synonyms, called synsets. The general criterion for grouping terms together into a synset is whether a term used within a sentence in a specific context can be replaced by another term from the same synset without modifying the sentence's understanding. One direct implication of this structure is that terms must also be differentiated by syntactic category, since nouns, adjectives, verbs and adverbs are not interchangeable within a sentence. Synsets also contain a short descriptive text defining their terms - or gloss - to assist in specifying their meaning. This is particularly useful for synsets with only a single term, or synsets with a small number of relations.

Another important term relationship present in WordNet is antonymy, or whether terms are conceptually opposite. In the special case of adjectives, there is a distinction between direct and indirect antonyms, depending on whether terms are direct opposites or are opposites only indirectly, via another conceptual relationship (Fellbaum et al, 1990). The words "wet/dry" are qualified as direct antonyms; "heavy/weightless", however, are conceptually opposite and thus indirect antonyms, since they belong to synsets where a direct antonym exists between the terms ("heavy/light") but are not directly correlated.

Hyponymy is another class of relationship present in WordNet, and indicates a hierarchical "is-a" type of relationship between terms, as is the case with "oak/plant" and "car/vehicle", while meronymy relationships indicate "part-of" types of relationship between terms. For the special case of adjectives, an attribute type of relationship exists, indicating of what generic attribute the adjective is a modifier; for example, the adjectives "heavy" and "light" are modifiers of the attribute "weight". WordNet then links the noun representing the attribute to the adjectives that modify it with this type of relationship.

Building SentiWordNet

Building on the strengths of WordNet's semantic relationships, SentiWordNet derives opinion scores for synsets using a semi-supervised method where only a small portion of synset terms - called the paradigmatic terms - are manually labelled, with the remaining database derived using an automated method. The complete process is described in (Esuli et al, 2006) and summarised below:

1. Manually label paradigmatic terms extracted from the WordNet-Affect lexical resource (Strapparava et al, 2004) into positive or negative labels, according to opinion polarity.

2. Iteratively expand each label by adding terms from WordNet that are connected to already labelled terms by a relationship considered to reliably preserve term orientation. The following relationships are used to extend the labels:
   a. Direct antonym
   b. Attribute
   c. Hyponymy (pertains-to and derive-from)
   d. Also-see
   e. Similarity

3. From the newly added terms, add to the opposite label those terms containing directly opposite opinion orientation, according to the direct antonym relationship.

4. Repeat steps 2 and 3 for a fixed number of iterations K.

Upon completion of steps 1-4, a subset of WordNet synsets is now labelled either positive or negative. To complete the score assignment for all terms, a set of classifiers is trained on the synset glosses, or textual definitions of each synset meaning, available in WordNet. The process continues by classifying new entries according to this training data, and generating an aggregated score, as detailed below:

5. For each labelled synset from steps 1-4, produce a word vector representation, along with a positive/negative label. This data set is used to train a committee of classifiers built as follows:
   a. Train a pair of classifiers to make the following predictions: positive/non-positive, and negative/non-negative.
      i. Synsets that belong to both the positive and negative labels are excluded from the training set and assigned to the "objective" class, with zero-valued positive and negative scores.
   b. Repeat the process for different sizes of training sets. These are obtained by varying K in the previous stage: 0, 2, 4 and 6.
   c. For each training set, use the Rocchio and Support Vector Machine classification algorithms.

6. When applying the set of classifiers to new terms, each resulting classifier returns a prediction score as a result. These are summed together and normalised to 1.0 to produce the final positive and negative scores for a term.
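A minimal sketch of the score aggregation in step 6 is shown below, assuming each committee member returns a single vote for one of three classes; the vote representation is an assumption and not the exact output format used in (Esuli et al, 2006).

    # A minimal sketch of aggregating committee votes into normalised scores.
    def aggregate(votes):
        total = len(votes)
        pos = sum(v == "positive" for v in votes) / total
        neg = sum(v == "negative" for v in votes) / total
        return pos, neg, 1.0 - (pos + neg)   # (Pos, Neg, Obj) sum to 1

    # Eight classifiers (two algorithms x four training set sizes) voting on one synset:
    print(aggregate(["positive"] * 5 + ["objective"] * 2 + ["negative"]))
    # (0.625, 0.125, 0.25)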

The process for building SentiWordNet illustrated above highlights the reliance of term scores on two distinct factors. First, the choice of paradigmatic words that generate the initial set of positive and negative scores must be carefully considered, since the extension of scores to the remainder of WordNet terms relies on this core set of terms for making scoring decisions. Secondly, the process relies on synsets' textual descriptions, or glosses, in the machine learning stage of the process, to derive a new term's similarity to positive or negative terms.

Applying SentiWordNet

Earlier in this section the advantages of lexicon-based approaches to opinion mining were observed, and results from experiments on both subjectivity detection and sentiment classification were investigated. The use of SentiWordNet as a lexical resource for opinion mining could be advantageous in various instances. The approach of using individual terms as a unit of sentiment information has received considerable research attention in opinion mining, and SentiWordNet could be applied as a replacement for manually building sentiment lexicons from WordNet, often done on an ad-hoc basis for specific opinion mining research, as found in (Salvetti et al, 2004; Dave et al, 2003; Kim et al, 2004). Validating automated methods for building term orientation information, such as SentiWordNet, can be useful for the scalability and automation of these approaches to opinion mining.

4.3. Conclusion

In this chapter the research areas of text mining and opinion mining were surveyed. Text mining concerns the computational treatment of text for extraction of novel information, and leverages techniques from machine learning, natural language processing, information retrieval and computational linguistics. Applications of text mining to knowledge discovery were surveyed, based on exploratory analysis and other traditional data mining techniques.

The representation of documents for performing text mining was studied in more detail, with the word vector, or bag of words, method being one of the most popular ways of representing text for machine learning, demonstrating very effective empirical results.

The research area of opinion mining was introduced. Opinion mining is a new field of research leveraging components from data mining, text mining and natural language processing, and a wide range of applications of extracting opinion from documents is possible, as discussed in this chapter. These range from improving business intelligence in organisations to information retrieval systems, recommender systems and more efficient online advertising and spam detection. It was seen that opinion mining can be beneficial to knowledge management initiatives either directly, by improving the quality of knowledge repositories through opinion-aware features, or by adding to the knowledge that can be extracted from textual data sources, thus indirectly creating more opportunities for knowledge creation within the company.

Finally, the WordNet and SentiWordNet lexical resources were introduced, with a presentation of their building blocks and potential uses. SentiWordNet is an extension of the popular WordNet database of terms and relationships, and is a readily available lexical resource of term sentiment information, which could be used in opinion mining research where a number of similar approaches have been devised in an ad-hoc fashion. SentiWordNet is also one key component of this dissertation's research. In the next chapter, the capabilities and structure of this resource are explored in detail, having in mind the challenges to opinion mining surveyed in this chapter. The end result is the design of a set of features that leverage sentiment information extracted from SentiWordNet and can be applied to sentiment classification problems.

5. DESIGNING FEATURES WITH SENTIWORDNET

5.1. Introduction

As outlined in the introductory discussion on the research problem and objectives in Chapter 1, this dissertation's experiment comprises two distinct parts. First, in order to use SentiWordNet as a tool for performing sentiment classification, a set of features that captures as much sentiment information as possible from textual documents needs to be devised. Then, once a feature set is generated from text documents with SentiWordNet, it can be used as input to a classifier algorithm, and results on classification performance and execution speed can be analysed. This chapter begins the experiment discussion by presenting the first of these two aspects.

In this chapter the structure of the SentiWordNet database is analysed in detail and the Polarity data set of film reviews is presented. Considerations on writing style are presented, and the implications for the type of information SentiWordNet can extract for the purposes of opinion mining are discussed. These considerations will drive the requirements for data preparation and cleanup of the source text needed for an effective data mining exercise, and as noted in (Shearer, 2000) and surveyed in Chapter 3, these are important contributing factors for the success of knowledge discovery activities.

The outcome of this chapter is a specification for a set of features that takes a film review in plain text as the starting point and captures the sentiment information present in its terms using SentiWordNet. This set of features can be used as input to train supervised learning methods in performing sentiment classification.

5.2. The SentiWordNet Database

As detailed in Section 4.2.5 of the previous chapter, SentiWordNet is a database containing opinion scores for terms derived from the WordNet database version 2.0. It is built using a semi-supervised method that obtains opinion polarity scores from a subset of seed terms known to carry opinion polarity. Each set of terms sharing the same meaning, or synset, is associated with three numerical scores ranging from 0 to 1, indicating the synset's objectiveness, positive and negative bias respectively. One important characteristic of SentiWordNet is that positive and negative scoring is graded for any given term, and it is possible for a term to have non-zero values for both positive and negative scores, according to the following rule:

For a synset s, we define:
• Pos(s) → Positive score for synset s.
• Neg(s) → Negative score for synset s.
• Obj(s) → Objectiveness score for synset s.

Then the following scoring rule applies:
Pos(s) + Neg(s) + Obj(s) = 1

The positive and negative scores are always given, and objectiveness can be implied by the relation:
Obj(s) = 1 - (Pos(s) + Neg(s))
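A small sketch of this implied relation follows; the function name is chosen purely for illustration.

    # A minimal sketch: the objectiveness score implied by the SentiWordNet rule.
    def objectivity(pos_score, neg_score):
        return 1.0 - (pos_score + neg_score)

    print(objectivity(0.375, 0.125))  # 0.5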

5.2.1. Database Structure

The SentiWordNet database is provided as a text file where term scores are grouped by synset and the relevant part of speech. The table below describes the columns for one entry in the database reflecting opinion information of a synset.

Field: Description

POS: Part of speech associated with the synset. This can take four possible values:
  a = adjective, n = noun, v = verb, r = adverb.

Offset: Numerical ID which, associated with the part of speech, uniquely identifies a synset in the database.

PosScore: Positive score for this synset. This is a numerical value ranging from 0 to 1.

NegScore: Negative score for this synset. This is a numerical value ranging from 0 to 1.

SynsetTerms: List of all terms included in this synset.

Table 9 - SentiWordNet Database Record Structure

To illustrate how opinion information appears in SentiWordNet, the table below presents sample rows extracted from the raw database file.

POS   Offset     PosScore   NegScore   SynsetTerms
a     1001456    0.375      0.125      casual everyday
n     13488485   0.0        0.125      pull twist wrench
v     1248670    0.125      0.0        truss tie_up bind tie_down
r     326136     0.375      0.25       dreamily dreamfully moonily

Table 10 - Sample SentiWordNet Data
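A minimal sketch of loading such a file into a term lookup structure is shown below; it assumes a tab separated layout with the columns of Table 9 and comment lines starting with '#', and the file name is a placeholder. The exact layout should be checked against the distributed version of the resource.

    # A minimal sketch of parsing the SentiWordNet text file into a lookup table,
    # keyed by (term, part of speech); a term may map to several synsets.
    def load_sentiwordnet(path):
        scores = {}
        with open(path) as f:
            for line in f:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank and comment lines (assumed format)
                pos, offset, pos_score, neg_score, terms = line.split("\t")[:5]
                for term in terms.split():
                    scores.setdefault((term, pos), []).append(
                        (float(pos_score), float(neg_score)))
        return scores

    # Usage (path is a placeholder): swn = load_sentiwordnet("SentiWordNet.txt")
    # swn[("casual", "a")] would then return [(0.375, 0.125)] for the sample row above.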

5.2.2. Statistics on Part of Speech Scoring

As seen in the previous section, SentiWordNet terms are categorised by the role played in a given sentence, or the part of speech the term is used as. To further understand how opinion scores are affected by part of speech, the table below reproduces the analysis presented in (Esuli et al, 2006):

Part of Speech   % Synsets with Objectiveness = 1   Average Objective Score   Average Pos. Score   Average Neg. Score
Noun             83.50%                             0.944                     0.022                0.034
Verb             81.05%                             0.940                     0.026                0.034
Adverb           32.97%                             0.698                     0.235                0.067
Adjective        44.71%                             0.743                     0.106                0.151

Table 11 - Scoring statistics per part of speech (Esuli et al, 2006)

It can be seen from the above data that nouns and verbs are predominantly objective in nature, and carry little positive or negative bias. Given the process used to build SentiWordNet, this indicates that nouns and verbs have weaker relations to other WordNet terms known to have either positive or negative bias, while adverbs and adjectives are the parts of speech with the highest percentage of terms carrying a non-zero subjective score. As observed in (Esuli et al, 2006), this is an indication that the use of modifiers (adjectives or adverbs) is more frequent when expressing subjective opinion than parts of speech such as verbs and nouns, which are more commonly used to denote entities of an objective nature. Another important observation is that while adverbs do carry considerable polarity weight (only 32.97% of adverb synsets contain no subjective bias), the average scoring tends to be overwhelmingly positive.

5.3. Considerations on SentiWordNet Data

After analysing the database structure of SentiWordNet, this section explores key aspects that need to be taken into consideration when designing features to be used in sentiment classification.

5.3.1. Automatic Part of Speech Tagging

Data in SentiWordNet is categorised according to part of speech, and indeed, as seen in Table 11, there are considerable differences in the level of objectiveness a synset might carry depending on its grammatical role. Information on part of speech in the source documents being classified will need to be extracted, so that SentiWordNet scores can be accurately applied. To achieve this, a part-of-speech tagging algorithm can be employed to automatically classify words from the source documents into part of speech categories. Part-of-speech taggers and their use within opinion mining research were discussed in Section 4.2.4 of the review of the opinion mining research literature.

A part-of-speech tagger receives as input a plain text document, and returns as output a document where every word and punctuation mark is associated with a tag that indicates the part of speech the term is used as. For example, the input sentence:

“the variety of music , as well as the beautifully shot performances , are easy to become immersed in .”

generates the following output from a part-of-speech tagger:

“ the/DT variety/NN of/IN music/NN ,/, as/IN well/RB as/IN the/DT beautifully/RB shot/VBN performances/NNS ,/, are/VBP easy/JJ to/TO become/VB immersed/VBN in/IN ./. ”

Each term has been associated with a tag indicating its role in the sentence, such as verb, noun, adjective, etc. Several standards exist for tag formats, of which the most popular are related to the Penn Treebank annotated corpus (Marcus et al, 1993) and the various instances of the CLAWS tag sets, derived from the original tag set for the Brown Corpus (Garside, 1987). To illustrate the above example, the table below highlights key tags from the Penn Treebank tag set relevant to SentiWordNet, with the complete set of tags available in the appendix section.

Part of Speech   Penn Treebank Tags
Adjective        JJ, JJR (Comparative), JJS (Superlative)
Verb             VB, VBD (Past tense), VBP (Present tense), VBZ (Present tense 3rd person), VBG (Gerund), VBN (Past participle)
Adverb           RB, RBR (Comparative), RBS (Superlative)
Noun             NN, NNP (Proper noun), NNPS (Proper noun, plural), NNS (Plural)

Table 12 - Penn Treebank Tags (Marcus et al, 1993) for parts of speech present in SentiWordNet

After tagging plain text documents, this information needs to be parsed so that it can be used with the SentiWordNet database. This process requires the development of an application that reads a tagged document and correctly matches each term and its part-of-speech tag to a SentiWordNet score.
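As an illustration of this matching step, the sketch below maps Penn Treebank tags to the SentiWordNet part-of-speech categories of Table 12 and looks up term scores. The "word/TAG" input format and the swn_scores dictionary, assumed to hold the list of (positive, negative) synset scores for each (term, part of speech) pair parsed from the SentiWordNet database, are assumptions made for this example only, not the actual implementation used in the experiment.

# Penn Treebank tag prefixes mapped to SentiWordNet part-of-speech categories.
TAG_TO_POS = {
    "JJ": "a",  # adjectives: JJ, JJR, JJS
    "VB": "v",  # verbs: VB, VBD, VBP, VBZ, VBG, VBN
    "RB": "r",  # adverbs: RB, RBR, RBS
    "NN": "n",  # nouns: NN, NNP, NNPS, NNS
}

def lookup_scores(tagged_text, swn_scores):
    """Yield (term, pos, synset_scores) for each tagged token found in SentiWordNet."""
    for token in tagged_text.split():
        if "/" not in token:
            continue
        term, tag = token.rsplit("/", 1)
        pos = TAG_TO_POS.get(tag[:2])
        if pos is None:
            continue  # punctuation and other tags carry no SentiWordNet score
        scores = swn_scores.get((term.lower(), pos))
        if scores:
            yield term, pos, scores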

5.3.2. Word Sense Disambiguation

When evaluating scores for a given term using SentiWordNet, an issue arises in determining which specific WordNet synset the term belongs to, and therefore which score to take into account. Consider the example of the term "mad", which has four synsets in WordNet.

Synset: huffy, mad, sore (roused to anger)
  SentiWordNet Score (Pos, Neg): (0.0, 0.125)
  Examples: "she gets mad when you wake her up so early"; "mad at his friend"; "sore over a remark"

Synset: brainsick, crazy, demented, disturbed, mad, sick, unbalanced, unhinged (affected with madness or insanity)
  SentiWordNet Score (Pos, Neg): (0.0, 0.5)
  Examples: "a man who had gone mad"

Synset: delirious, excited, frantic, mad, unrestrained (marked by uncontrolled excitement or emotion)
  SentiWordNet Score (Pos, Neg): (0.375, 0.125)
  Examples: "a crowd of delirious baseball fans"; "something frantic in their gaiety"; "a mad whirl of pleasure"

Synset: harebrained, insane, mad (very foolish)
  SentiWordNet Score (Pos, Neg): (0.0, 0.25)
  Examples: "harebrained ideas"; "took insane risks behind the wheel"; "a completely mad scheme to build a bridge between two mountains"

Table 13 - Example of multiple scores for the same term in SentiWordNet

In the above example, there are four possible choices of meaning for the adjective "mad", one of which refers to positive states of emotion and carries a positive score in SentiWordNet, raising the question of which one to apply when scoring this term inside a given document. Determining which synset applies in a specific context is analogous to the problem of word sense disambiguation. In section 4.2.4 of the opinion mining literature review, this area of research and its connection to opinion mining were explored in more detail. For the purposes of this dissertation's experiment, no sophisticated techniques of word sense disambiguation are being considered, due to time constraints and the impact on the complexity of the data set being modelled. In the first instance, some level of disambiguation can be obtained from part-of-speech information, as noted by (Wilks et al, 1998), and part-of-speech tagging is already executed as a requirement for extracting SentiWordNet scores. If, however, the process is faced with the task of scoring a term with multiple senses and the same part of speech, a simpler approach is taken, based on the following rules:

• Evaluate scores for each synset for a given term;

• If there are conflicting scores – e.g. positive and negative scores exist for the same term – calculate the average of all positive scores and all negative scores, and

• Return the averaged SentiWordNet score with higher value only if the positive and negative scores differ by more than a given threshold.

This approach assumes that ambiguous synsets with a majority score in a given orientation are likely to appear more frequently in the document, and that orientation is therefore chosen. If the aggregated positive and negative scores for an ambiguous synset differ by less than a given threshold, it is assumed that a decision cannot be made on term orientation and the score is discarded. Because word sense is not evaluated in depth, this approach may limit the amount of information gathered from SentiWordNet at the expense of some discarded scores, and, as seen in the above example, may not always guarantee the correct scoring is applied. It is hoped these effects will not be significant to the overall performance of the method; however, future developments of the SentiWordNet model taking into account more sophisticated techniques of word sense disambiguation could yield positive results.
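The following minimal sketch illustrates the averaging and threshold rule above. It assumes synset_scores is the list of (positive, negative) score pairs found for an ambiguous term with a single part of speech, and that threshold corresponds to the feature generation parameter discussed later in this chapter.

def resolve_ambiguous_score(synset_scores, threshold=0.1):
    """Return ('pos', score), ('neg', score) or None when no decision can be made."""
    avg_pos = sum(p for p, _ in synset_scores) / len(synset_scores)
    avg_neg = sum(n for _, n in synset_scores) / len(synset_scores)
    if abs(avg_pos - avg_neg) <= threshold:
        return None  # averaged scores too close together: discard the term
    if avg_pos > avg_neg:
        return ("pos", avg_pos)
    return ("neg", avg_neg)

# For the four synsets of "mad" in Table 13 the averages are roughly 0.094 (positive)
# and 0.25 (negative); with the default threshold of 0.1 the difference is large
# enough for a decision, and the term is scored as negative.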

5.4. The Polarity Data Set

The polarity data set is a set of film review documents available for research in sentiment analysis and opinion mining. It was first introduced as a research data set along with Bo Pang and Lillian Lee's initial results on machine learning methods for sentiment classification, presented in (Pang et al, 2002). The most recent available data set is version 2.0, and it is the one being used for this dissertation's experiment. It comprises 1000 positively labelled and 1000 negatively labelled film reviews extracted from the Internet Movie Database Archive (Pang et al, 2004). In this section, the polarity data set is further evaluated, with considerations on how SentiWordNet can be used to extract opinion bias information from the documents contained in it.

5.4.1. Document Structure

A film review from the polarity data set has already undergone several pre-processing tasks aimed at standardising the text (Pang et al, 2004):

• All text is converted to lowercase.
• Each line in a document corresponds to a single sentence.
• All HTML tags are stripped from the document – e.g. documents are plain text.
• Ratings information is removed from text: labels are derived from rating information explicitly mentioned in the document. This information is removed from the data set since author bias should be indirectly implied from the text, and not from the rating scale given.

Labelling a document as positive or negative is derived from ratings information explicitly stated in the review. Because the formatting of ratings in text is inconsistent, a set of ad-hoc rules was derived for deciding on the correct label. The rules are documented in the readme file for version 2.0 of the data set, and consist of:

• Ratings must be explicitly mentioned as a numerical or "star" scale. Valid examples are: 8/10, nine out of ten, three stars (out of five), etc.
• In a five-star system, positive labels are assigned to ratings of "three-and-half stars" or higher, whereas negative labels are assigned to ratings of "two stars" or lower.
• In a four-star system, positive labels are assigned to ratings of "three stars" or higher, whereas negative labels are assigned to ratings of "one-and-half stars" or lower.
• In a letter grade system, "B" or above is considered a positive review, whereas "C-" or lower is considered a negative review.

Document Statistics

The table below presents document statistics to assist in understanding the document structure of a typical review, for each document class. Average Terms/Doc counts all of the terms in a document and averages the results over all documents in a class; Average Sentences/Doc calculates a similar metric for sentences; Average Unique Terms/Doc counts each term in a document only once. Finally, the term-to-sentence ratio averages the number of terms in a sentence for a given document, then averages the result over all documents in each class. Further considerations on the statistics below will be made when applying SentiWordNet features and evaluating the results obtained.

Class      Average Terms/Doc   Average Sentences/Doc   Average Unique Terms/Doc   Average Term-to-Sentence Ratio
Positive   685.526             35.941                  351.973                    19.208
Negative   611.903             35.419                  326.641                    17.968

Table 14 - Polarity data set document statistics

5.4.2. Considerations on Writing Style

To extract scores from a review using SentiWordNet, a document needs to be scanned, and each term receives a score based on SentiWordNet data and part-of-speech information. Further investigation of writing style reveals that other parameters are also relevant in determining how to score terms appropriately. Based on the challenges to opinion mining previously discussed in section 4.2.4, this section explores these considerations and how to address the potential issues affecting sentiment classification.

Negation
The use of negating terms such as "not" and "no" plays a part in determining the orientation of a term. Consider the following simple examples:

• "This film is good."
• "This film is not good."

Clearly, both contain the term "good", which carries a positive connotation and a positive SentiWordNet score. The second sentence, however, has a negative meaning. Therefore a scoring method that simply adds scores for terms as they appear in the text can lead to poor results.

In the English language, negation can occur in a variety of often subtle ways. In general, it involves a negation signal, a set of negated concepts over which the signal has scope, and in some cases a supporting pattern or expression that commonly appears with a type of negation (Huang et al, 2007). In the above examples, "not" is the negation signal. It is also worth noting that negation can modify the strength of a sentence, and can modify the scope of previous concepts in a sentence, as in the examples:

• “The film is not one bit good.” • “Production quality and good acting were absent in this film.”

In the first case, the pattern "one bit" increases the strength of the statement (e.g. "the film is not at all good"), while in the second case, the negation signal "absent" modifies the concepts preceding it.

Short of performing rich linguistic analysis on text, it is difficult to predict and correctly determine all negation cases in a text (Mutalik et al, 2001). However, effective approaches have been suggested based on simpler techniques that use text rules coded as finite state regular expressions, as seen in the NegEx algorithm (Chapman et al, 2001) and the NegFinder algorithm (Mutalik et al, 2001), while other hybrid approaches that also include detecting patterns from the parsing tree of a sentence were proposed in (Huang et al, 2007). These approaches have provided good results in predicting common negation cases on objective document corpora from the medical domain.

A similar approach for dealing with negation statements in film reviews can be applied to the Polarity data set. An algorithm that detects negation through regular expressions, based on the NegEx algorithm (Chapman et al, 2001), is proposed, with source Python code available in Appendix B.1. NegEx works by identifying three classes of expressions: pseudo-negating terms, where a negation expression is found but does not alter the orientation of terms; expressions that negate previous terms in a sentence; and expressions that negate the terms that follow. If a negating expression is found, the sentiment polarity is inverted for all terms within a specific window, or until a punctuation mark or negation-altering term is found. The negating window is a numeric parameter that indicates the scope of a negating term within a sentence. A simplified sketch of this idea is shown below.
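The sketch below is a much simplified illustration, not the full algorithm of Appendix B.1: the signal and pseudo-negation expressions are small examples chosen for this illustration, and every term within a fixed window after a signal is marked for score inversion, stopping at punctuation.

import re

# Example expression lists only; the full lists used by the algorithm are in Appendix B.1.
PSEUDO_NEGATIONS = re.compile(r"not only|no doubt")
NEGATION_SIGNALS = re.compile(r"not|no|never|n't|without|lacks?")
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def mark_negated_terms(tokens, window=5):
    """Return a boolean flag per token telling whether its scores should be inverted."""
    negated = [False] * len(tokens)
    remaining = 0
    for i, token in enumerate(tokens):
        if token in PUNCTUATION:
            remaining = 0          # punctuation closes the scope of the negation
            continue
        if remaining > 0:
            negated[i] = True
            remaining -= 1
        bigram = " ".join(tokens[i:i + 2])
        if NEGATION_SIGNALS.fullmatch(token) and not PSEUDO_NEGATIONS.fullmatch(bigram):
            remaining = window     # open a negation window after the signal
    return negated

tokens = "this film is not one bit good .".split()
print(mark_negated_terms(tokens, window=5))
# the terms following "not", including "good", are flagged for score inversion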

An interesting by-product of a negation algorithm is the ability to determine, for a given text, how many terms are being negated and how often negating expressions are used as a narrative device. This information could be used as features for detecting sentiment orientation on the basis of writing style, and can be of assistance in the sentiment classification task.

Objective Sentences
Another aspect of a typical review is the presence of both opinion-related comments and sections containing more objective text, such as a plot description or remarks on an actor's or director's career. An example can be seen in the sample extract below, from one of the reviews of the Polarity data set:

“when the aliens begin to menace society , it may take the scientists' combined efforts to stop them before they terminate the evolution of another life form : humanity .”

In this case, even though there are terms that may carry opinion information, such as "menace" and "terminate", the sentence is very objective and does not provide evidence of author opinion. The problem of adequate scoring becomes more complex since author opinion can appear in the middle of a descriptive sentence. For the purposes of this experiment, no formal detection of subjectivity in sentences is performed. To mitigate the effect of potentially objective sentences, an approach based on determining the areas of a document more likely to be subjective is proposed below.

Document Segmentation
As observed in section 4.2.4, the strength of sentence opinions can be related to their position in a document, suggesting a relationship between document structure and author sentiment. In order to better detect those sentiment hot areas within the document, a text document can be separated into individual segments, and term scores calculated separately for each of these segments. This approach can also help in detecting areas of the document which tend to carry generally objective content, and are thus of little relevance to sentiment classification. In addition, to further test the idea of term importance as a function of document position, scores can be adjusted according to a function of term position, indicating which areas of the document are more relevant. A minimal sketch of the segmentation step is shown below.
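The sketch assumes the document has already been tokenised; it only illustrates how a document can be cut into N segments so that scores are accumulated per segment, with the segment index then available to a position-based scoring function.

def split_into_segments(tokens, n_segments):
    """Return a list of n_segments token lists covering the document in order."""
    size = max(1, len(tokens) // n_segments)
    segments = [tokens[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(tokens[(n_segments - 1) * size:])  # last segment absorbs the remainder
    return segments

tokens = "an example review with twelve tokens to be split into three segments".split()
for index, segment in enumerate(split_into_segments(tokens, 3)):
    print(index, segment)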

5.5. Proposed Model

In the previous sections of this chapter, the structure of the SentiWordNet database and the polarity data set were assessed in detail, and considerations were made on the challenges and limitations of the opinion information that can be gathered. With those in mind, a model can be proposed for creating a set of features for opinion classification using SentiWordNet. It is worth highlighting here that, as noted in (Kennedy et al, 2006; Pang et al, 2008), lexical resources such as SentiWordNet are built independently of the data set being analysed, and could be used in an unsupervised fashion, thus discarding the need for training data. The approach proposed in this section, however, starts from the principle that the features obtained through SentiWordNet capture diverse aspects of document sentiment, and are best suited to the creation of a data set that can be used to train a classifier algorithm, like other machine learning methods proposed in opinion mining. The high level approach for obtaining these features is presented in the diagram below.

Document segmentation is used to distinguish document areas more likely to represent overall opinion content. Finally, a negation detection algorithm should be in place to enable the inversion of scores for a term when appropriate. The feature design is divided by feature type, and each type is explained below.

Overall Scoring per Part of Speech
Intuitively, the overall positive and negative scores extracted from SentiWordNet for all terms in a document can be taken as a measure of opinion polarity. A similar approach is seen in (Kennedy et al, 2006). In addition, for each part of speech, the overall sum of SentiWordNet scores in a given document can also be calculated. The scoring of terms is adjusted according to a function of term position within the document, as described in section 5.4.2.

Score Strength Measures per Part of Speech
Overall scoring alone may be assisted by a measure of opinion strength, which captures how strong, on average, the positive and negative scores found in the document are, for each part of speech. The calculation is done by dividing the total positive/negative score by the total number of positive/negative terms found, for each part of speech.

Positive and Negative Ratios
For each part of speech, this metric calculates the percentage of positive and negative occurrences out of the total terms found, to give an indication of positive and negative term usage within the document.

Scores per Document Segment
To evaluate the contribution of individual document areas to overall sentiment, each document is divided into N segments of equal size, and the total positive and negative scores are calculated for each segment, per part of speech. For the case of adjectives, the other metrics, such as strength and ratios, are also calculated for each segment.

Negations
By applying a negation detection algorithm, the following measurements related to the use of negation expressions can be extracted:
• Percentage of document terms affected by a negating expression.
• Total negating terms for a given document segment, assuming a number of segments N defined a priori.

The parts of speech being considered for SentiWordNet scores are adjectives, adverbs and verbs. According to the information in Table 11 - Scoring statistics per part of speech (Esuli et al, 2006), nouns are mostly objective and carry little opinion bias, and will be left out of the final set of features. Further evidence supporting this can be seen in the classification performance results obtained in (Yu et al, 2003), where the best feature set taking parts of speech into account disregards the use of nouns. From Table 11 it can also be seen that most adverbs carry positive bias, hence an assumption will be made that positive adverbs are widely present in most texts and are redundant for the purposes of opinion classification; therefore only negative adverbs will be included in the feature set. Adjectives, on the other hand, have the potential to carry considerable opinion information, and therefore the feature set will be extended for this particular part of speech.

The table below details individual features to be extracted from a film review using SentiWordNet. The total number of features generated varies according to the number N of segments the document is divided into.

Feature Type: Overall Scores per part of speech
  Sum of all positive scores for adjectives.
  Sum of all negative scores for adjectives.
  Sum of all positive scores for verbs.
  Sum of all negative scores for verbs.
  Sum of all negative scores for adverbs.

Feature Type: Score Strength per part of speech
  Negative and positive strengths for adjectives.
  Negative and positive strengths for verbs.
  Negative strength for adverbs.

Feature Type: Ratios per part of speech
  Negative and positive ratio for adjectives.
  Negative and positive ratio for verbs.
  Negative ratio for adverbs.

Feature Type: Scores per document segment
  For an N-segmentation of the document:
  N positive scores for adjectives.
  N negative scores for adjectives.
  N positive scores for verbs.
  N negative scores for verbs.
  N negative scores for adverbs.
  N sums of positive adjectives.
  N sums of negative adjectives.
  N positive scores for adjective strength.
  N negative scores for adjective strength.

Feature Type: Negation
  Percentage of negated terms in the document.
  Negated terms per document segment (for N segments).

Table 16 - SentiWordNet Feature Description
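To make the relationship between these feature types concrete, the sketch below assembles the overall sums, strengths and ratios for a single document. It assumes scored_terms is a list of (part of speech, positive score, negative score) tuples produced by the SentiWordNet lookup, an intermediate structure introduced here for illustration only; per-segment features are computed in the same way on each segment, and in the final feature set only the negative features are kept for adverbs, as described above.

def document_features(scored_terms):
    """Overall sums, strengths and ratios for adjectives (a), verbs (v) and adverbs (r)."""
    features = {}
    for pos in ("a", "v", "r"):
        terms = [(p, n) for tag, p, n in scored_terms if tag == pos]
        pos_scores = [p for p, n in terms if p > 0]
        neg_scores = [n for p, n in terms if n > 0]
        features[pos + "_pos_sum"] = sum(pos_scores)
        features[pos + "_neg_sum"] = sum(neg_scores)
        # strength: average score over the terms that actually carry that polarity
        features[pos + "_pos_strength"] = sum(pos_scores) / len(pos_scores) if pos_scores else 0.0
        features[pos + "_neg_strength"] = sum(neg_scores) / len(neg_scores) if neg_scores else 0.0
        # ratio: share of positive / negative terms among all scored terms of this part of speech
        features[pos + "_pos_ratio"] = len(pos_scores) / float(len(terms)) if terms else 0.0
        features[pos + "_neg_ratio"] = len(neg_scores) / float(len(terms)) if terms else 0.0
    return features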

Parameters for Feature Generation
The end result of the above set of features is dependent upon how the extraction algorithm is configured. In this section, the key parameters affecting feature values extracted from SentiWordNet are detailed. Tuning of SentiWordNet feature generation parameters may affect the performance of the opinion classification task, and is included as an activity in the experiment described in the next chapters. The table below details the identified parameters.

Scoring function: Determines the score value as a function of term location within the document. Example scoring functions are linear growth with document position and polynomial growth with document position.
Negation detection: The negation detection algorithm can be enabled or disabled. When enabled, a window size for terms to be negated can be specified.
Number of segments: Choice of the value of N when partitioning the document into N separate segments.
Threshold for score consideration: Threshold value in positive/negative score differences when considering ambiguous synsets (see discussion in Section 5.3.2).

Table 17 - Parameters for Feature Generation

5.6. Conclusion

In this chapter the structure of the SentiWordNet lexical database of term opinion scores (Esuli et al, 2006) and the Polarity data set of film reviews (Pang et al, 2004) were analysed in more detail, with the objective of determining how best to use SentiWordNet to build a model that represents opinion information from text documents. The analysis highlighted the need to avail of natural language processing techniques such as part-of-speech tagging to enrich the model, as well as potential limitations of using lexical resources for determining opinion information.

Lexical resources such as SentiWordNet contain opinion bias scores for individual terms, and when building a model based on this type of information there are certain challenges stemming from the nature of natural language to be considered, as surveyed in Chapter 4. Word sense disambiguation becomes relevant, since terms with potentially multiple meanings may carry different opinion bias depending on context and their use within a sentence. Problems with a sentence's level of objectiveness also arise when scoring terms that do carry opinion bias but are not strictly related to author opinion, such as text extracts that objectively describe a film plot. Another challenge arises when negation sentences occur, potentially inverting the meaning and associated scores of a term. Domain-specific terms are also an issue, since they may indicate a different bias than that of their more commonly seen uses. The above issues naturally impose limitations on the effectiveness of sentiment classification using SentiWordNet. These were addressed with mitigating strategies where feasible, or acknowledged as relevant topics of research for further improving the model.

The outcome of this chapter is the specification of a data model that reflects opinion information derived from SentiWordNet, and a proposed process for obtaining the features having the original Polarity data set as a starting point. From this specification, the process can be implemented with the assistance of third party tools for part-of-speech tagging, and scripting code for the generation of SentiWordNet features.

In the next chapters, the experiment on opinion mining will be described in detail, and results will be presented and discussed. The experiment uses a data mining classification technique for determining the opinion orientation of documents from the polarity data set based on SentiWordNet information, and the input to the classification task will be the set of features described by the specifications from this chapter. Additionally, the analysis of results of the classification experiment will take into consideration findings from the assessment of SentiWordNet capabilities and limitations conducted during the process of devising the data model.

6. SENTIMENT CLASSIFICATION EXPERIMENT

6.1. Introduction

The use of lexical resources for performing opinion mining tasks has received considerable attention in the research literature and several approaches have been proposed, as surveyed previously in section 4.2.5. The underlying motivation for these techniques is the assumption that individual terms in a document carry opinion bias of their own, which could in principle be used as a measure of the opinion of the text document they belong to (Kim et al, 2004). Lexical resources that relate words to sentiment can be constructed manually by eliciting positive and negative words, or via induction methods on existing lexicons such as WordNet. SentiWordNet is a lexical resource built upon the latter approach via the generation of opinion scores on WordNet synsets from a core of paradigmatic words, using a semi-supervised machine learning classification process (Esuli et al, 2006). One interesting problem in opinion mining where SentiWordNet can be of assistance is sentiment classification: determining positive or negative opinion for a given document on a specific topic.

In Chapter 5, the first part of this dissertation's experiment was discussed: the SentiWordNet database was studied in detail, and a set of features that extract sentiment information from documents using SentiWordNet was proposed. Using the polarity data set of film reviews, comprising 2000 documents equally categorised into positive "thumbs-up" and negative "thumbs-down" classes, this chapter presents a classification experiment whose key motivation is to determine how well the SentiWordNet set of features can perform sentiment classification, and how it compares with other methods in the literature.

In this chapter the experiment in sentiment classification with SentiWordNet is described in detail. The experiment's objective, scope and approach are presented, followed by a detailed description of the experiment setup, key evaluation metrics and execution steps.

6.2. Objectives and Scope

The experiment described in this section aims at assessing the viability of using SentiWordNet as an approach for performing sentiment classification on text. To further detail the line of investigation, the experiment objectives can be described as follows.

• Determine what classification performance can be achieved by using SentiWordNet features from Chapter 5, and how it compares to other results published in the literature.

• In Chapter 5, a data set with features reflecting several aspects of the opinion information that can be extracted with SentiWordNet was developed. This data set relies on a number of parameters that affect how features are computed. The experiment results should determine the effect of each individual parameter on overall classification performance, thus indicating how relevant each parameter is to sentiment classification.

• Evaluate the effect of increasing the number of training examples on classification performance, in comparison to a baseline classification method.

• Assess the training time and runtime characteristics of a classifier based on SentiWordNet, and compare it to a baseline classification method.

With the above experiments, it will be possible to evaluate how accurately, how fast and how robustly the SentiWordNet approach can perform document-level sentiment classification. These indicators will give further insights into SentiWordNet's applicability to opinion mining problems in practical applications, and identify areas in sentiment lexicons and document-based classification with potential for better results. To achieve these goals, a labelled data set based on SentiWordNet information will be generated using the specifications derived in Chapter 5, and used as input to train a classifier that performs sentiment classification.

The tasks that need to be performed for the completion of the experiment are outlined in the milestones below.

1. Implement algorithm for extracting features from text documents based on SentiWordNet, as described in Chapter 5, using the polarity data set of film reviews as input (Pang et al, 2002).

2. Train a baseline classifier for sentiment classification based on a unigram bag-of-words method similar to the one described in (Pang et al, 2002).

3. Evaluate results on classification performance using SentiWordNet feature set when applied to three standard classification algorithms: Naïve Bayes, Support Vector Machines and k-Nearest Neighbours.

4. Evaluate the effects of changes to SentiWordNet feature parameters (as described in Section 5.5) on classification performance.

5. Evaluate the effects of feature selection and outlier removal on classification performance.

6. Compare results obtained using the SentiWordNet approach for opinion classification to the baseline classifier, according to a set of evaluation criteria.

The above milestones reflect the major activities required by the experiment. First, an algorithm that extracts SentiWordNet features from plain text is needed to generate the experiment data set. Another requirement is the implementation of the baseline classifier against which SentiWordNet results can be compared. Sentiment classification tests with SentiWordNet are then performed by steps 3 to 6. In the following section, these milestones are logically grouped into execution stages, to be concluded in the order outlined above.

6.2.1. Out of Scope

To perform sentiment classification, this experiment leverages supervised learning algorithms from the data mining literature. There is a vast array of classification methods available, with new methods constantly being developed, as discussed in section 3.3.7. The experiment is primarily interested in evaluating SentiWordNet, and whereas potentially better results could be obtained by testing a wider range of algorithms, it is not the objective of this dissertation to perform this comparison. Instead, a classification algorithm that provides good results, chosen from a comparison between three methods commonly applied to text mining, will be selected and applied to the sentiment classification tests using SentiWordNet.

As seen in section 3.3.2, supervised learning algorithms also rely on parameters that determine how classification models are built, often with a large number of possible combinations. Finding the algorithm parameters that best suit the particular experiment being performed here is not considered within the scope of this investigation, as it focuses instead on the effects of SentiWordNet on sentiment classification. It is acknowledged, however, that gains in classification performance metrics could be obtained by fine tuning the algorithm; further considerations on this topic will be made in the experiment conclusions in Chapter 8.

6.3. Experiment Process

In order to achieve the desired objectives for the opinion mining experiment, and to ensure the experiment is consistently repeatable, an execution process is presented below. The process is logically subdivided into stages performing distinct, self-contained tasks of the experiment process. The diagram below illustrates the relationships between each stage, and activities within each stage are detailed in the next sub-sections.

Figure 12 - Experiment Execution Stages

6.3.1. Baseline Classifier

For measuring the effectiveness of opinion mining with SentiWordNet, it is useful to compare it with the results of other techniques in the literature. In particular, a baseline method can be chosen using as criteria the fact that it has been widely applied in text classification research problems, is closely related to seminal research in the area of opinion mining, and has published results for the data set being used in this experiment. Implementing a baseline classifier provides the ability to obtain results that closely represent the ones available in the literature, and to execute different tests when required by the experiment, for example when comparing results using different cross-validation settings and varying training data set sizes.

With this in mind, the chosen baseline method for comparisons is analogous to the one described in (Pang et al, 2002), comprising a classification task on the polarity data set according to its opinion label, using word vectors as the input data set. This method is commonly referenced in the literature when comparing opinion classification experiments using the polarity data set. Building the baseline classifier comprises the following activities:

1) Create word vector feature set from Polarity data set plain text film reviews;
2) Train and execute classifier algorithm.

The parameters chosen to be used as the baseline in this dissertation's experiment are based on the best results found in the sentiment classification experiments described in (Pang et al, 2002).

6.3.2. Generate SentiWordNet Features

This stage is responsible for generating a data set with features derived from SentiWordNet, taking the documents from the Polarity data set as input, as described in Chapter 5. The outcome of this stage is a data set ready to be used by a classifier algorithm. As shown in Section 5.5 of the previous chapter, there are several parameters to be tuned in relation to how documents are scored with SentiWordNet, and we wish to assess how each parameter may affect the end result of the experiment. Thus, the feature generation stage is iterative, requiring several executions to generate different data sets with the required combinations of parameters. This approach is in line with the knowledge discovery methodologies discussed in Chapter 3, where iterating through stages – in our case data preparation and data mining – is recommended for results refinement, and with the views on performing data mining in a data oriented manner advocated in (Hand et al, 2001; Kulkarni et al, 1998; Weiss et al, 1998).

6.3.3. Sentiment Classification Using SentiWordNet

A classifier algorithm can now be trained and executed, taking as input the data set of SentiWordNet features generated from the Polarity data set in the previous stage. This stage will be executed iteratively as parameters for SentiWordNet features are refined. Upon completion the following tests will have been executed:

1) Select the classification algorithm that provides the best performance using SentiWordNet features.

• To this end, three classifier algorithms will be evaluated with standard settings: the Naïve Bayes, k-Nearest Neighbour and Support Vector Machines algorithms will be tested.

• The best method will be chosen based on classification accuracy results.

2) With the classification algorithm chosen from the previous stage, evaluate the impact of changes in the SentiWordNet parameters described in Section 5.5 on classification performance.

• Evaluate the relevance of each parameter to sentiment classification according to performance criteria established for the experiment.

• SentiWordNet parameters being tested are number of document segments, negation detection, choice of scoring function and scoring threshold value for ambiguous synsets.

3) Using the best combination of parameters found in step 2, test and document the impact of feature selection and outlier detection on classification performance.

4) Using results from steps 2) and 3), evaluate the effect of different training set sizes on classification performance, in comparison to the baseline classifier.

As the experiment progresses through each step, it obtains classification performance results and selects the best method as the starting point for the next step. As discussed in Section 6.2.1, step 1 selects the classification algorithm based on standard parameters and a choice of three well known methods. For step 2, as illustrated in Table 17 - Parameters for Feature Generation, there is a series of parameters involved in the generation of the SentiWordNet data set that may affect classification performance. In practical terms, it would be prohibitive to attempt to evaluate all possible combinations of parameter values, together with variations in classification algorithm and training set sizes. Instead, the proposed approach iterates through each individual parameter, finding the best possible classification results out of a subset of pre-determined options, before selecting another parameter for evaluation, a method similar to the simple greedy heuristic search described in (Hand et al, 2001). At this point it is worth mentioning that other parameter search approaches have been proposed in the literature, addressing the issue of finding sub-optimal solutions, a case that may occur in the greedy search method, although the greedy method has been acknowledged to yield good results in practice with a simple implementation framework. A survey of other search optimisation techniques can be found in (Hand et al, 2001).

After the experiment execution, all results obtained in this stage will be documented and analysed in the next chapter, and used as the basis for this dissertation's conclusions. Based on the experiment described above, the next section introduces the criteria for comparing results, and discusses the metrics chosen for evaluation.

6.4. Criteria for Comparisons

The final objective of any classification method is to perform predictions on new instances of data, learned from training data, as accurately as possible. To assess the feasibility of the method it is then natural to find out whether the resulting classifiers can obtain acceptable classification error rates, according to a certain pre-determined error metric. In addition, evaluating training and execution speeds is also important for any classifier used in practical applications where computing time is a constraint, and is of primary importance in environments where predictions are required in near real time. Finally, comparing how results behave with limited training data is also important, due to the cost involved in obtaining and labelling training data sets. The approach to performing sentiment classification using SentiWordNet uses distinct pre-processing steps and feature generation, and measuring factors such as speed, classifier performance and sensitivity to training data provides useful information for assessing the overall usefulness of the method.

To evaluate the SentiWordNet approach to sentiment classification, the obtained results will be compared with the ones obtained from the baseline classifier, which represents as closely as possible the results available in the literature for the same data set employed in this experiment. The metrics chosen for comparison, based on the factors discussed above, are presented in more detail in the following sections.

6.4.1. Classification Performance

The key measurement for classification performance will be classification accuracy, introduced in Section 3.3.8. The polarity data set used for this experiment contains an equal number of documents for positive and negative classes, and there is no a priori distinction in importance between the two classes for the purposes of classification precision and recall. Accuracy is thus a suitable and easily understandable single value metric for this type of data set.

In addition, most results reported in the literature on sentiment classification using the polarity data set use classification accuracy as a metric, thus it is reasonable to present results using the same metric so comparisons can be made. For cross-validation experiments, the result is typically presented as average accuracy across all folds.

6.4.2. Training Data Set Size

Finding labelled training sets for classification tasks is a non-trivial and often expensive undertaking. It is therefore important for a method to achieve the best possible results from as little training data as possible. For that reason, classifier sensitivity to training data will also be evaluated in comparison to the baseline classifier. This will be measured by comparing accuracy results using different sizes of training data. Two tests will be performed:

• Evaluate classification accuracy using 3-fold cross-validation for fractions of training data available: 10%, 25%, 50%, 75% and 100%.

• Evaluate classification accuracy for various cross-validation folds: 3, 5, 10 and 100.

6.4.3. Dimensionality and Runtime

Finally, the dimensionality of a data set can be a determining factor in the training and execution time of a classification algorithm. High dimensional models can be prohibitive when applied to certain classification methods, and may hinder the usefulness of a given model in applications where execution performance is paramount. Whereas the accuracy of any classifier is important, one could think of situations where the results of automatic classifiers are needed in a near real-time fashion (e.g. to process results obtained from a search engine, or to process live incoming streams of text from a news feed). For this reason, some consideration will be given to the training and execution times of each method.

6.5. Experiment Setup

In this section the execution environment of the experiment is fully described. The main objective is to provide detailed documentation of the experiment setup, tools, relevant parameter settings and execution order for future reference, and to ensure the process can be replicated as accurately as possible.

6.5.1. Technical Resources

In this section, all the technical resources used as part of the experiment, including software and physical hardware, are listed. The chosen package for executing data mining classification algorithms was RapidMiner (Mierswa et al, 2006). RapidMiner is an open source data mining package with an intuitive user interface aimed at the rapid prototyping of data mining processes, an approach that suits the needs of a data mining experiment. It implements a wide range of algorithms, including the Naïve Bayes, Nearest Neighbour and Support Vector Machines algorithms applied in this experiment, and also implements functionality for performing text mining tasks directly from source text documents, facilitating the process of obtaining results from plain text data sets such as the Polarity data set of film reviews.

Additional scripting and integration with the SentiWordNet database were written in the Python language, and the part-of-speech tagger used is the freely available Stanford POS Tagger described in (Toutanova et al, 2000).

Data Mining Package: RapidMiner Community Edition v4.1 – www.rapidminer.com (Mierswa et al, 2006)
Scripting and Text Processing Language: Python v2.5 – www.python.org (van Rossum et al, 2003)
Part of Speech Tagging: The Stanford POS Tagger – (Toutanova et al, 2000)
Operating System: Windows Vista SP1 32-bit edition – www.microsoft.com/windows
Hardware: Intel x86 T2390 1.86GHz (dual-core), 4Gb RAM – computer manufacturer and model is Dell Inspiron 1525
Data Set: Polarity Data Set v2.0 – (Pang et al, 2004) and (Pang et al, 2002); available from the Cornell University Natural Language Processing Research Labs (1)

(1) http://www.cs.cornell.edu/People/pabo/movie-review-data/

Table 18 - Software and Hardware Resources

6.5.2. Baseline Classifier

The baseline classifier performs sentiment classification on the Polarity data set using a machine learning method and word vector features, as described in (Pang et al, 2002). In this section this stage of the experiment is described in detail.

Word Vector Generation
The first step in implementing the baseline is to convert the textual documents from the Polarity data set into word vectors according to a word presence metric, such as the ones described previously. The following three methods were tested for the baseline implementation:

• TF-IDF: Term frequency-inverse document frequency measure.
• Binary: Accounts only for term presence in the document; possible values for any given term are present (1) or absent (0).
• Normalised term frequency within the document.

Next, stop word removal and stemming are applied to reduce the total number of features. Stop words are commonly used terms, expected to be present in nearly all documents, and therefore of little value in detecting differences between them. The list of stop words used for the baseline classifier is detailed in Appendix A.1.

Stemming is the process of reducing a term to its root form. Applying stemming to documents as a pre-processing step tends to reduce the final number of features by converting variations of terms to a single representation. In this experiment the Porter stemming algorithm implemented by RapidMiner was employed.

Classifier Algorithm
The classifier algorithm trained on the word vector features is a Support Vector Machine with a linear kernel. Results of the classification task will be measured using classification accuracy with 3-fold cross-validation, as per the original experiment in (Pang et al, 2002). To ensure results are repeatable across all experiments, the sampling for generating each fold uses a fixed random seed. All the above term presence indicators will be tested using the classifier algorithm, and the one presenting the best results will be chosen as the baseline classifier.
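The baseline itself was built in RapidMiner. Purely as an illustration of the same pipeline (binary term presence, stop word removal, a linear-kernel SVM and 3-fold cross-validation), the sketch below uses scikit-learn; stemming is omitted for brevity, and the variables documents and labels are assumed to hold the review texts and their classes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# documents: list of plain-text film reviews; labels: 1 for positive, 0 for negative
vectorizer = CountVectorizer(binary=True, stop_words="english")
X = vectorizer.fit_transform(documents)

# linear-kernel SVM with C = 1.0, evaluated with 3-fold cross-validation
scores = cross_val_score(LinearSVC(C=1.0), X, labels, cv=3)
print("average accuracy: %.4f" % scores.mean())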

6.5.3. SentiWordNet Test Approach

The test approach for SentiWordNet will first select a classification algorithm based on results obtained with three commonly used methods, then iterate through different values for each parameter described in Table 17 - Parameters for Feature Generation. In practical terms, it would be prohibitive to attempt to evaluate all possible combinations of parameter values, together with variations in classification algorithm and training set sizes. Instead, the proposed approach iterates through each individual parameter, finding the best possible classification results out of a subset of pre-determined options, before selecting another parameter for evaluation, a method similar to the simple greedy heuristic search described in (Hand et al, 2001). The stages of the experiment's evaluation process are summarised in the diagram below and explained in the remainder of this section. At all stages, results will be evaluated using average accuracy over 3-fold cross-validation.

Figure 13 - SentiWordNet Sentiment Classification Experiment

1. Choosing Classification Algorithm

Using default parameter settings for the generation of SentiWordNet scores, the classification performance of three classification algorithms will be evaluated. The objective of this stage is to select the algorithm to be used for the remainder of the experiment, where SentiWordNet parameter changes will be evaluated. The table below details the parameters used for each classification method.

Naïve Bayes: Gaussian distribution used for estimation of probabilities for real valued attributes.
k-Nearest Neighbours: Neighbours k = 10; Distance function: Euclidean distance.
Support Vector Machine: Kernel type: Linear; Error coefficient C: 1.0.

Table 19 - Classifiers and Parameters Tested in Experiment

The above settings are fixed throughout the experiment, and, as stated in section 6.2.1, tuning parameters for specific classification algorithms will not be attempted as part of this experiment. All results are evaluated using average classification accuracy over 3-fold cross-validation, and in order to obtain repeatable results, a fixed random seed is used in all experiments.

Initial Parameters for SentiWordNet Features
The default parameter settings for generating SentiWordNet features are described in the table below.

Parameter                     Default Value
Scoring Function              None
Scoring Threshold             0.1
Negation Detection            Disabled
Number of Document Segments   5

Table 20 - Default SentiWordNet Parameters

The parameters above thus reflect an initial setup where no negation detection is being performed, and a scoring function is not being applied. This starting point was chosen so that the effects of testing such features in the next stages become more evident.

2. SentiWordNet Data Set

Once the classification algorithm has been selected from the previous stage, each individual parameter affecting the generation of the SentiWordNet data set is tested. The approach for testing the effects of each parameter reflects the data-driven approaches to data mining widely prescribed in the literature (Hand et al, 2001; Kulkarni et al, 1998; Weiss et al, 1998), through iterative evaluation of the effects of parameter changes on classification performance. The process starts using the default values presented in Table 20 applied to the algorithm with the best classification results, selected from the previous stage, and progresses through each test, using the best results obtained up to that point as input to the next test. The table below details the values being tested for each parameter associated with SentiWordNet feature generation:

Test   Evaluation                          Parameters Tested
1      Scoring Functions                   Linear Increase, Linear Decrease, Polynomial, None
2      Scoring Threshold                   0.5, 0.1, 0.05, 0.01, 0.001, 0
3      Negation Detection (Window Size)    None (disabled), 1, 5, 10, 20
4      Number of Document Segments         5, 8, 10, 15

Table 21 - Parameter Values Tested by Experiment

In the initial test, the experiment evaluates various heuristics that adjust the strength of a given SentiWordNet term score as a function of its position in the document. This attempts to map stronger scores to areas of the document more likely to represent author opinion. The scores are calculated with linear ascending and descending functions, a polynomial increasing function of degree 2, and with no function, thus only adding scores as they are found in the text. A sketch of plausible forms for these functions is shown below.
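The weighting functions below are plausible forms of the three heuristics, given for illustration only; the exact functions used in the experiment are not reproduced here. Position is taken as the term's offset in the document normalised to the range [0, 1].

def linear_increasing(position):
    return position             # terms towards the end of the document weigh more

def linear_decreasing(position):
    return 1.0 - position       # terms towards the beginning weigh more

def polynomial(position):
    return position ** 2        # weight grows faster towards the end of the document

def weighted_score(raw_score, index, doc_length, weight_fn):
    """Adjust a SentiWordNet score by the term's normalised position in the document."""
    position = index / float(doc_length - 1) if doc_length > 1 else 0.0
    return raw_score * weight_fn(position)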

The second test evaluates the method for isolating SentiWordNet scores likely to be ambiguous. This is done by taking the scores for a given term into account only when its positive and negative SentiWordNet scores differ by more than a given threshold, thus assuming that large score differences tend to indicate which opinion polarity is more likely to occur in general.

The third test evaluates the use of the detection algorithm for negated expressions, presented in section 5.4.2. Results are tested with the algorithm disabled and for a series of values of the negation "window", the number of terms to the left or right whose opinion scores will be inverted whenever a negating expression is found. We examine results for window sizes varying from a single term to 20, the average size of a sentence in the polarity data set. Finally, varying sizes for document segmentation are tested, from a starting number of 5 segments up to 15, which, according to Table 14 - Polarity data set document statistics, would represent an average of just over 2 sentences per segment.

3. Feature Selection and Outlier Removal

It is possible that the process that generates features for the SentiWordNet data set creates redundant features that carry little information and do not assist in the separation of positive and negative film reviews. Also, since this data set originates from real-world data, it is very likely that noisy data, potentially misleading to a classification model, is present too. In other words, outliers may be present in the data set. Outlier detection and removal will be performed using the nearest neighbour method described in (Ramaswamy et al, 2000) and implemented in RapidMiner. The method works by finding the data points with the largest distance to the k-nearest neighbourhood they belong to, using the assumption that objects with a sparser neighbourhood than the majority of objects are likely to be outliers. The distance function can be adjusted, and Euclidean distance is used for this experiment.

This stage of the experiment also performs an analysis of feature relevance to classification, and feature selection based on the Chi-Squared correlation metric between attributes and the predicted class. The objectives of this task are to attempt to improve classification performance by eliminating uncorrelated, potentially noisy features from the data set, and to provide insight into which features extracted from SentiWordNet provide little information for sentiment classification on this data set. A sketch of both refinements is given below.
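The sketch below illustrates both refinements under the assumption that X is the (non-negative) SentiWordNet feature matrix as a dense NumPy array and y holds the document labels; the actual experiment used the RapidMiner implementations. The outlier score follows the idea of (Ramaswamy et al, 2000): the Euclidean distance of each point to its k-th nearest neighbour, with the largest distances treated as outliers; feature selection keeps the features best ranked by the Chi-Squared metric.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

def kth_neighbour_distances(X, k=10):
    # full pairwise Euclidean distance matrix (adequate for a 2000-document data set)
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    dists.sort(axis=1)
    return dists[:, k]  # column 0 is each point's distance to itself

def remove_outliers(X, y, k=10, n_outliers=20):
    scores = kth_neighbour_distances(X, k)
    keep = np.argsort(scores)[:-n_outliers]  # drop the n_outliers sparsest points
    return X[keep], y[keep]

def select_features(X, y, n_features=20):
    # Chi-Squared ranking requires non-negative features, which holds for these scores
    return SelectKBest(score_func=chi2, k=n_features).fit_transform(X, y)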

6.6. Conclusion

This chapter provided a detailed description of the opinion mining experiment using the SentiWordNet database. The main objective of this experiment is to assess the use of SentiWordNet as a tool for document-level sentiment classification, with the polarity data set of film reviews as the source of subjective documents. This objective translates into a series of milestones, and subsequent tasks that compose the structure of the experiment. The experiment is proposed within a framework that ensures objectives are measured via key metrics, repeatable results are achieved through detailed documentation of execution steps, and results are comparable with other research in the literature.

Limitations of the experiment’s approach were also outlined: since the key focus is on assessing SentiWordNet as a resource for sentiment classification, there is limited scope for fine tuning the parameters of classifier algorithms; for the same reason, the range of classification algorithms being evaluated by the experiment is not extensive.

The following chapter presents results obtained for all stages of the experiment, according to the metrics chosen for evaluation, discusses the method’s strength and weaknesses, and proposes opportunities for further development arising from the information obtained from the experiment.

7. EXPERIMENT RESULTS

The previous chapter described an experiment that assesses SentiWordNet as a tool for performing sentiment classification according to performance measurements using the criteria of classification accuracy, training set sizes and runtime. This chapter presents the results obtained from the experiment execution and discusses them in the context of the experiment’s objectives and key performance metrics.

7.1. Introduction

This chapter presents results for the sentiment classification experiment using SentiWordNet, according to the experiment objectives and setup described in Chapter 6. The focus of this chapter is on the presentation and analysis of results obtained for the key metrics established in section 6.4. The next section presents results for the baseline classifier and compares them with the ones obtained in the original experiment described in (Pang et al, 2002). Next, results for the SentiWordNet classification task are presented for each intermediate step of the experiment, as outlined in the discussion of the SentiWordNet parameter test approach in section 6.5.3. Results are then presented for training and testing times. Lastly, the key results obtained from the experiment are summarised and discussed in light of other research in the area, examples of inaccurately classified documents are explored, and the chapter concludes with final remarks on the experiment and results.

7.2. Sentiment Classification Results

In accordance with the experiment stages and key performance metrics discussed in section 6.4, results for the baseline classifier are presented first, followed by results for the SentiWordNet classification experiment for each intermediate step, with considerations on any improvements obtained. Finally, the best combination of parameters obtained with SentiWordNet is used for testing classification results with varying training set sizes, and results are presented in comparison to the baseline classifier.

7.2.1. Baseline Results

The baseline classifier is described in section 6.5.2. It comprises a support vector machine trained to perform sentiment classification using word vectors as features. Initially the process was tested with three distinct word vector representations: TF-IDF, binary presence and word frequency. The classifier was trained separately using each representation as input, with 3-fold cross-validation. Results for average accuracy with each type of word vector are presented below.

Word vector type   Accuracy (%)
TF-IDF             82.45
Binary Presence    83.90
Word Frequency     82.71

Table 22 - Baseline Results and Word Vector Types

The best accuracy result (83.90%) was obtained in the case where the word vector records only term presence. This result is in accordance with observations in (Pang et al, 2002), where binary presence also obtained the best classification results. The table below details accuracy, training time and runtime results obtained for varying cross-validation folds using binary presence as the word vector, and will be used for comparisons later in this chapter.

Folds   Accuracy   Training Set Size   Validation Set Size   Train Time   Execution Time
3       83.90      1333                667                   31s          29s
5       83.65      1600                400                   40s          12s
10      84.95      1800                200                   48s          10s
100     84.45      1980                20                    102s         3s

Table 23 - Results for Baseline Classifier using Binary Presence Word Vector

Training and execution times correspond to the average time recorded for RapidMiner to perform one fold. It should be noted that, as the number of folds increases, so does the training set size, whereas the validation set on which predictions are made diminishes in size. The results for 3-fold cross-validation are comparable to the 82.9% obtained in (Pang et al, 2002), which is indicative of the classification performance achieved by the method. The difference could be attributed to the choice of cross-validation folds, algorithm implementation and parameters, and the choice of features. We also note that (Pang et al, 2002) uses all document unigrams as input, resulting in a word vector with 16165 features, whereas the baseline classifier uses a total of 2012 features. The reduction is obtained by performing word stemming as a pre-processing step, as described in section 6.5.2.

7.2.2. SentiWordNet Parameter Tests

This section presents the results for sentiment classification performed using SentiWordNet features, as detailed in section 5.5. The results are detailed for each individual experiment step, as described in section 6.3.3. These are:
1. Select classification algorithm.
2. SentiWordNet feature parameter testing.
3. Outlier removal and feature selection.

1. Select Classification Algorithm The first step is to train and execute three distinct algorithms that perform sentiment classification using features derived from SentiWordNet, generated using the standard parameters described in Table 20 - Default SentiWordNet Parameters. The classifier with best results is then chosen for the following experiments on testing SentiWordNet parameters, outlier removal and feature selection. The table below presents results for average accuracy using 3-fold cross-validation for each algorithm.

Algorithm                                          Average Accuracy (%)
k-Nearest Neighbours (k=10; Euclidean Distance)    60.20
Naïve Bayes                                        63.05
Support Vector Machine (Linear Kernel, C=1.0)      67.40

Table 24 - Results for SentiWordNet Features using 3 Classification Algorithms

The above results indicate that the Support Vector Machine performs best with this feature set. On this basis, it will be chosen as the classification algorithm for the further refinements performed in the remainder of the experiment. It is worth remembering that the focus of the experiment lies on testing SentiWordNet features as an approach to sentiment classification, and tuning of algorithm parameters or testing of individual algorithms is outside the experiment scope, as outlined in section 6.2.1. Improvements in accuracy results could be obtained with further investigation into these areas.
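The comparison above was performed in RapidMiner. As an illustration only, a minimal Python sketch of the same kind of comparison over a precomputed SentiWordNet feature matrix is shown below; the variable names are assumptions, and only the k=10 Euclidean k-nearest neighbours and linear kernel, C=1.0 settings are taken from the table above.

# Illustrative sketch: compare three classifiers on a SentiWordNet feature
# matrix X (documents x features) with labels y, using 3-fold cross-validation.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

candidates = {
    "k-Nearest Neighbours": KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(kernel="linear", C=1.0),
}

def compare_classifiers(X, y):
    for name, classifier in candidates.items():
        accuracy = cross_val_score(classifier, X, y, cv=3).mean()
        print(name, round(accuracy, 4))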

2. Parameter Testing for SentiWordNet Features
With the support vector machine classification method chosen in the previous step, a series of tests on SentiWordNet parameters was performed. These tests aim at assessing the impact of groups of features representing a certain aspect of the data set built from SentiWordNet, as illustrated in section 5.5. The parameter tests were discussed during the experiment description in section 6.5.3, and are executed in the following order:

1. Scoring function
2. SentiWordNet scoring threshold
3. Negation algorithm and negation window size
4. Number of document segments

Test 1 - Scoring Function
In this first test, four types of scoring functions were used to adjust individual term scores as a function of their position in the document. The remaining SentiWordNet parameters are set as per the initial configuration described in section 6.5.3. The table below details results obtained for average accuracy using 3-fold cross-validation for each scoring function.

Scoring Function    Average Accuracy (%)
None                67.40
Linear Increasing   68.00
Linear Decreasing   67.25
Polynomial          67.50

Table 25 - Accuracy Results for Varying Scoring Functions

The results show that adjusting scores according to a linearly increasing function of term position in the document gives the best results on this data set. It could indicate that the correlation between author sentiment and term position is stronger towards the end of the document, where concluding remarks about a film are more likely to occur.
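As an illustration of how such position weighting could be applied, the sketch below adjusts each term score by a weight derived from its relative position in the document. The exact weighting formulas used in the experiment are those described in Chapter 5; the functional forms below are simple assumed examples for the linear and polynomial cases.

# Illustrative position-weighting sketch (assumed functional forms).
def position_weight(index, total_terms, scheme="linear_increasing"):
    """Return a weight in [0, 1] based on the term's relative position."""
    position = (index + 1) / total_terms           # 0 < position <= 1
    if scheme == "linear_increasing":
        return position                            # later terms weigh more
    if scheme == "linear_decreasing":
        return 1.0 - position                      # earlier terms weigh more
    if scheme == "polynomial":
        return position ** 2                       # weight rises sharply near the end
    return 1.0                                     # "none": no adjustment

def weighted_scores(term_scores, scheme="linear_increasing"):
    """term_scores: list of (positive, negative) pairs in document order."""
    n = len(term_scores)
    return [(p * position_weight(i, n, scheme), q * position_weight(i, n, scheme))
            for i, (p, q) in enumerate(term_scores)]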

Test 2 – SentiWordNet Scoring Threshold
As a measure of the certainty of the SentiWordNet score for a given term, the scoring threshold determines the minimum difference between the positive and negative scores of a given term in order for the term score to be taken into account. In Chapter 5, we have seen that a given term in SentiWordNet may have both positive and negative scores, and that a polysemous term (i.e. one with more than one meaning) will also have more than one SentiWordNet score. Terms where scores differ by a large amount are assumed to be heavily biased towards positive or negative, and therefore likely to be less ambiguous. The table below presents results obtained for various threshold values, using the best results from the previous test as a starting point. Again, results are for average classification accuracy using 3-fold cross-validation.

Threshold Value   Average Accuracy (%)
0.5               61.60
0.1               68.00
0.05              67.70
0.01              68.10
0.001             68.05
0.0               68.25

Table 26 - Accuracy Results for Varying Scoring Threshold Values

The above results indicate there is no benefit in ignoring term scores according to differences between positive and negative scores as a heuristic for term disambiguation: the best results were obtained using a threshold value of 0, effectively using all found terms in the calculation.
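A minimal sketch of the thresholding heuristic described above is shown below; the term-level score structure is an assumption made for illustration.

# Illustrative sketch: keep a term's SentiWordNet score only when the
# positive and negative scores differ by at least `threshold`.
def filter_by_threshold(term_scores, threshold=0.0):
    """term_scores: list of (positive, negative) pairs for terms found in SentiWordNet."""
    kept = []
    for positive, negative in term_scores:
        if abs(positive - negative) >= threshold:
            kept.append((positive, negative))
    return kept

With the default threshold of 0.0 every term passes the filter, which matches the best setting found in the test above.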

Test 3 – Negation Detection and Window Size
The negation detection algorithm is based on the work of (Chapman et al, 2001) and detailed in section 5.5. Its objective is to detect when term orientation is being affected by a negating expression. When a negating expression is found, the algorithm inverts term scores for terms ahead of or before the negating expression, up to a maximum number of terms specified by the window size. The table below presents average accuracy results for varying sizes of negating window and for the case where the negation algorithm was not used.

Negation Window   Accuracy (%)
Off               68.25
1                 67.55
5                 68.50
10                67.65
20                67.50

Table 27 - Accuracy Results for Negation Algorithm with Varying Window Sizes

In the above, a minor improvement was achieved by enabling the negation algorithm with a window size of 5 terms, bringing the average accuracy result to 68.5%.
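The sketch below illustrates one possible forward-looking version of this idea: when a negation cue is found, the polarity of scored terms within the following window is inverted. The cue list and data structures are assumptions for illustration and do not reproduce the simplified NegEx implementation used in the experiment.

# Illustrative negation-handling sketch (forward window only).
NEGATION_CUES = {"not", "no", "never", "n't", "without"}   # assumed cue list

def apply_negation(tokens, scores, window=5):
    """tokens: document tokens in order; scores: {token index: (pos, neg)}.
    Returns a copy of scores with polarity inverted inside each negation window."""
    adjusted = dict(scores)
    for i, token in enumerate(tokens):
        if token.lower() in NEGATION_CUES:
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                if j in adjusted:
                    positive, negative = adjusted[j]
                    adjusted[j] = (negative, positive)     # swap orientation
    return adjusted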

Test 4 - Document Segmentation
Finally, document segmentation was tested for a varying number of segments, with the remaining parameters fixed at the best results obtained in each previous test. The table below presents results for average accuracy over 3-fold cross-validation.

Number of Segments   Accuracy (%)
5                    68.50
8                    67.95
10                   68.55
15                   65.25

Table 28 - Accuracy Results for Varying Number of Segments

The best result, obtained using 10 segments, offers only a minimal improvement over previous accuracy results.
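For illustration, the sketch below shows one way of splitting a document into a fixed number of roughly equal segments and aggregating term scores per segment, which is the general idea behind the per-segment features described in Chapter 5; the aggregation shown (summing positive and negative scores) is an assumption.

# Illustrative document-segmentation sketch.
def segment_scores(term_scores, n_segments=10):
    """term_scores: list of (positive, negative) pairs in document order.
    Returns one (sum_pos, sum_neg) pair per segment."""
    n = len(term_scores)
    size = max(1, -(-n // n_segments))             # ceiling division: terms per segment
    segments = []
    for start in range(0, n_segments * size, size):
        chunk = term_scores[start:start + size]
        segments.append((sum(p for p, _ in chunk), sum(q for _, q in chunk)))
    return segments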

Final Parameter Settings
After completing the parameter tests using SentiWordNet presented above, the final parameter settings yielding accuracy results of 68.55% are detailed below.

Parameter                     Best Value
Scoring Function              Linear Increasing
Scoring Threshold             0
Negation Detection            Enabled, Window = 5
Number of Document Segments   10

Table 29 - Best SentiWordNet Parameters Obtained with Experiment

7.2.3. Outlier and Feature Selection

The final stage of the experiment refines the results obtained thus far by performing outlier removal and feature selection, using the parameters described in Table 29, presented in the previous section.

Outlier Removal
Outlier removal is performed using a k-nearest neighbour algorithm that identifies outliers based on their relative distance to other data points, as described in (Ramaswamy et al, 2000). Euclidean distance was used as the algorithm’s distance metric. The table below presents average accuracy results obtained when removing a varying number of outliers, for two possible values of k neighbours.

Number of Outliers   Accuracy (%), k=5 neighbours   Accuracy (%), k=10 neighbours
No Outlier Removal   68.55                          68.55
5                    68.35                          68.25
10                   67.75                          68.10
25                   67.80                          67.85
50                   68.20                          67.95
100                  68.95                          68.80
150                  68.40                          68.90
200                  68.45                          68.30

Table 30 - Accuracy Results with Outlier Removal

For 3-fold cross-validation, the best results were obtained using a value of 5 nearest neighbours for inspection and 100 outliers removed, which corresponds to approximately 7.5% of the total number of documents available in the training set.
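The outlier removal step follows the distance-based scheme of (Ramaswamy et al, 2000): points are ranked by their distance to the k-th nearest neighbour and the top n are discarded. A minimal Python sketch of that idea is shown below; parameter names are assumptions, and the experiment itself used the corresponding RapidMiner operator.

# Illustrative sketch of distance-based outlier removal (Ramaswamy et al, 2000).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_outliers(X, y, k=5, n_outliers=100):
    """Drop the n_outliers points with the largest distance to their k-th neighbour."""
    neighbours = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    distances, _ = neighbours.kneighbors(X)        # column 0 is the point itself
    kth_distance = distances[:, k]                 # distance to the k-th true neighbour
    outlier_idx = np.argsort(kth_distance)[-n_outliers:]
    keep = np.setdiff1d(np.arange(len(X)), outlier_idx)
    return X[keep], y[keep]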

Feature Selection
Feature selection was performed by progressively removing features that have the weakest correlation to the positive or negative label being trained. In essence, these are features that, when observed individually, are less likely to separate the positive and negative document classes. To detect which features to remove, chi-squared correlation weights were extracted from the data set as a preparation step using RapidMiner’s chi-square weighting operator. The results are presented below for the bottom 20 features generated from SentiWordNet using the best parameters obtained during the experiment, detailed in Table 29. Scores represent the relative feature weights from the least to the most correlated features.

Position   Attribute Name   Description                                                                   Weight (Chi-Squared Correlation)
1          negbin8          Number of negations in segment No. 8.                                         0
2          advnegbin8       Total score for adverbs with negative scores in document segment No. 8.       0.002417466
3          negbin9          Number of negations in document segment No. 9.                                0.007077381
4          anbin5           Number of adjectives with negative scores in document segment No. 5.          0.007351313
5          posvpct          Percentage of positive verbs out of total verbs found.                        0.010475046
6          advnegbin5       Total score for adverbs with negative scores in document segment No. 5.       0.013643333
7          apbin10          Number of adjectives with positive scores in document segment No. 10.         0.017331827
8          advnegbin9       Total score for adverbs with negative scores in document segment No. 9.       0.018631441
9          negbin2          Number of negations in segment No. 2.                                         0.023946131
10         nbin6            Total score for adjectives with negative scores in document segment No. 6.    0.024514132
11         advnegbin7       Total score for adverbs with negative scores in document segment No. 7.       0.025177349
12         nbin3            Total score for adjectives with negative scores in document segment No. 3.    0.028163759
13         advnstre         Percentage of negative adverbs from total adverbs found.                      0.03081977
14         advnegbin4       Total score for adverbs with negative scores in document segment No. 4.       0.030881908
15         nbin2            Total score for adjectives with negative scores in document segment No. 2.    0.031037958
16         nbin8            Total score for adjectives with negative scores in document segment No. 8.    0.032052898
17         advnegbin2       Total score for adverbs with negative scores in document segment No. 2.       0.032631923
18         advnegbin3       Total score for adverbs with negative scores in document segment No. 3.       0.033570132
19         anbin10          Number of adjectives with negative scores in document segment No. 10.         0.033686192
20         anbin3           Number of adjectives with negative scores in document segment No. 3.          0.033822422

Table 31 - 20 SentiWordNet Features with Lowest Correlation to Label Using Chi-Squared Test

We can now test the effect on accuracy of removing features with low correlation to the document class. The table below presents results for average accuracy using 3-fold cross-validation when progressively removing features from the data set, according to the list in the above table.

N Least Correlated Features Removed   Accuracy (%)
3                                     68.65
5                                     69.10
10                                    68.15
15                                    68.15
20                                    67.95

Table 32 - Accuracy Results with Feature Removal

Using this method, optimal results are obtained when the 5 least correlated features are removed from the training process, resulting in an average accuracy of 69.10%. Further removals cause loss of information in the model, affecting accuracy results.
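The chi-squared weights themselves were produced by RapidMiner's weighting operator. As an illustration of the removal step only, the sketch below drops the n columns with the lowest precomputed weights from a feature matrix; the weight vector is assumed to come from such a chi-squared weighting step.

# Illustrative sketch: remove the n least correlated features, given
# per-feature chi-squared weights computed beforehand.
import numpy as np

def drop_weakest_features(X, weights, n_remove=5):
    """X: documents x features matrix; weights: one chi-squared weight per feature."""
    order = np.argsort(weights)                    # ascending: weakest features first
    keep = np.sort(order[n_remove:])               # indices of retained features
    return X[:, keep], keep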

7.2.4. Training Data Set Size and Execution Time

With the previous experiments completed, we now turn our attention to sensitivity to training data set size and to runtime. The table below presents results for the best combination of SentiWordNet parameters, outlier removal and feature removal from the previous steps, for various numbers of cross-validation folds.

Folds   Average Accuracy   Training Set Size   Validation Set Size   Train Time   Execution Time
3       69.10              1333                667                   65s          1.6s
5       67.75              1600                400                   92s          < 1s
10      69.30              1800                200                   135s         < 1s
100     69.05              1980                20                    156s         < 1s

Table 33 - Accuracy and Experiment Timings with Outlier Detection

Training time refers to the time taken in seconds to train the predictive model using a Support Vector Machine classifier on the given training set size. Execution time is the time in seconds taken to perform all predictions on the validation set. From the above results, a clear improvement can already be noted in the execution time of the algorithm, due to the reduced size of the feature set. In early tests it was noticed that the outlier detection step made a large contribution to overall training time. To make more accurate training time comparisons with the baseline classifier, which does not include this step, the next table presents results for SentiWordNet for various cross-validation folds without the outlier detection and removal step.

Folds   Average Accuracy   Training Set Size   Validation Set Size   Train Time   Execution Time
3       68.60              1333                667                   35s          < 1s
5       68.65              1600                400                   44s          < 1s
10      68.90              1800                200                   59s          < 1s
100     69.00              1980                20                    86s          < 1s

Table 34 - Accuracy and Experiment Timings without Outlier Detection

Reduced Training Size
To verify the effect of reduced training set sizes on overall classification accuracy, the same training process was repeated, this time using only a fraction of the original training set sizes for both the baseline method and SentiWordNet, using the optimal parameters for feature generation reported in Table 29 along with the outlier removal and removed features obtained in Section 7.2.3. The number of outliers removed was adjusted proportionally to the percentage of original data available for training. The results are presented in the table below using 3-fold cross-validation.

% Of Original Training Set   Average Accuracy (%) - SentiWordNet   Average Accuracy (%) - Baseline
10                           55.75                                 71.89
25                           61.41                                 79.92
50                           63.59                                 82.04
75                           65.77                                 82.82
100                          69.10                                 83.90

Table 35 - Accuracies for Various Training Set Sizes
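As an illustration of how this reduced-training-size comparison could be reproduced, the sketch below subsamples a fraction of the labelled data before running cross-validation; the random subsampling strategy and the classifier settings are assumptions, and the proportional adjustment of the number of outliers removed is omitted for brevity.

# Illustrative sketch: 3-fold accuracy for decreasing fractions of the data set.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def accuracy_by_fraction(X, y, fractions=(0.10, 0.25, 0.50, 0.75, 1.00), seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for fraction in fractions:
        size = int(len(y) * fraction)
        idx = rng.choice(len(y), size=size, replace=False)   # random subsample
        classifier = SVC(kernel="linear", C=1.0)
        results[fraction] = cross_val_score(classifier, X[idx], y[idx], cv=3).mean()
    return results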

7.3. Result Analysis and Considerations

This section presents key results obtained from the SentiWordNet classification experiment, with comparisons to a well-known baseline sentiment classification method described in (Pang et al, 2002) and implemented as part of this experiment. A discussion of misclassified entries under the SentiWordNet approach is also presented, with a view to highlighting further possible improvements to the method.

7.3.1. Accuracy Results

In this section the accuracy results for the three stages of the experiment are analysed in more detail. The table below details the improvements in classification accuracy obtained at each step of testing SentiWordNet feature parameters, when running the experiment with 3-fold cross-validation.

Stage                         Best Result                                      Accuracy (%)
Choose Classifier             Initial Support Vector Machine Classifier        67.40
SentiWordNet Parameters       Linear Scoring Function                          68.00
                              Scoring Threshold = 0                            68.25
                              Negation Window = 5                              68.50
                              Segment Size = 10                                68.55
Outlier and Feature Removal   Outlier Removal                                  68.95
                              Feature Selection using Chi-Square Correlation   69.10

Table 36 - SentiWordNet Sentiment Classification Results

Choice of Classifier
Initially, it can be observed from the above that the Support Vector Machine classification technique yielded the best classification accuracy amongst the three methods compared. The positive results of Support Vector Machines reflect similar outcomes obtained in classifier comparisons in the literature for text mining (Joachims, 1998) and opinion mining (Pang et al, 2002).

SentiWordNet Features
For SentiWordNet parameter testing, the best performing scoring function weighted term scores increasing linearly as a function of their position in the document, improving accuracy to 68.00%. As noted in (Pang et al, 2002) and discussed previously in section 5.4.2, this suggests intuitively that the way a typical review is structured may position opinion information more heavily towards the end of the document, where an author’s final concluding remarks are commonly placed. It is interesting, however, to note also that scoring the final terms of a review more heavily, as when using the polynomial function, did not yield better results than the linear case, suggesting opinion information also exists in other parts of the document and should be taken into account as well.

The next parameter evaluated was the scoring threshold used to discard terms with potentially ambiguous scoring in SentiWordNet. The experiment revealed that there was no gain in predictive accuracy when discarding terms whose positive and negative scores differ by less than a specified threshold value. In other words, this heuristic caused the loss of opinion information for some terms, reflected in the accuracy results presented in Table 26, and the best approach was to simply include scores for all found terms, regardless of their potentially ambiguous bias, yielding an average accuracy result of 68.25%.

The negation algorithm employed for detecting negating expressions and adjusting SentiWordNet scores was tested with several sizes of negation windows, and only a minor improvement was obtained for the case where a negating window size of 5 terms was used, bringing average accuracy to 68.50%. The use of negation is, however, closely linked to writing style, and this small improvement suggests further work on extracting opinion expressed through negations could be possible.

Finally, to further investigate the effect of individual parts of a review on the overall review sentiment, documents were divided into segments, and SentiWordNet scores were calculated for terms belonging to each segment separately. The best results for this step of the experiment were obtained by dividing documents into 10 segments, but only marginally differed from the accuracy obtained using 5 segments. Using results from Table 14 - Polarity data set document statistics, each of the 10 segments would contain on average 3.5 sentences, and 61 to 68 terms.

Outliers
Removing outliers from the training data set by employing a k-Nearest Neighbour technique further improved accuracy results for 3-fold cross-validation to 68.95%, suggesting the existence of reviews in the Polarity data set that damage the accuracy of the SentiWordNet predictive model when used in training. As shown in the next section, however, this result is less relevant as the number of training samples increases, suggesting the classifier algorithm is able to adjust to a better model with the increase in training data.

Feature Removal
The Chi-Squared test for correlation applied to the features extracted from SentiWordNet measures the relationship between each individual feature and the positive or negative classes of reviews. Lower correlation values indicate that a given feature value is likely to occur independently of the positive or negative classes of reviews, and therefore unlikely to assist in class prediction. Removing the 5 features with lowest correlation scores from the data set improved accuracy results to 69.10%.

Amongst the least correlated features shown in Table 31, low correlation between the document class and features that measure the number of negated terms in a document segment (negbin features) can be seen, suggesting the use of negating expressions in the narrative of a review is equally likely on both positive and negative instances. Low correlation values also appear for several features measuring negative SentiWordNet scores for adverbs in a given document segment (advnegbin features), as well as for the feature measuring the percentage of verbs with positive scores out of total verbs found in the document (posvpct). Indeed, this last feature has a value of zero for most documents, indicating, as pointed out in Table 11 - Scoring statistics per part of speech (Esuli et al, 2006), that SentiWordNet scores tend not to carry opinion polarity for the majority of verbs.

7.3.2. Baseline Comparisons

This section compares results obtained from the experiment using SentiWordNet with the baseline classifier for the key metrics outlined in Chapter 6. The results presented here will also help this dissertation’s concluding discussion on the SentiWordNet approach for sentiment classification, and further research directions.

Classification Accuracy
We begin by presenting classification accuracy across different numbers of cross-validation folds for the SentiWordNet and baseline methods.

Folds   Baseline   SentiWordNet with Outlier   SentiWordNet without Outlier
3       83.90      69.10                       68.60
5       83.65      67.75                       68.65
10      84.95      69.30                       68.90
100     84.45      69.05                       69.00

Table 37 - SentiWordNet and Baseline Classification Accuracy

The graph below illustrates the three results.

[Chart: Sentiment Classification - Accuracy Comparison. Accuracy (%) plotted against cross-validation folds (3, 5, 10, 100) for the Baseline, SWN + Outlier and SWN no Outlier methods.]

Figure 14 - Accuracy Comparisons for various Cross-Validation Folds

All accuracy results for the SentiWordNet experiments remained well below the baseline classifier using unigrams. In the experiments using a higher number of folds, classification accuracy for SentiWordNet without the outlier removal step improves and the results are comparable to the ones obtained when this step is used, indicating a more robust model as the training data set increases, minimising the need for outlier removal, or that outlier detection parameters should be tuned according to training set size for better results.

The results obtained are, however, close to the 69% obtained in (Pang et al, 2002) using a classifier based on a list of positive and negative words purpose-built for the movie domain, together with document statistics. This may indicate SentiWordNet is capable of achieving similar accuracy results to this class of technique, with the difference that it is not tailored towards a specific domain, relying instead on a generic set of term scores derived from SentiWordNet.

Training Set Sizes
To complement the analysis of accuracy results with various sizes of data sets, the graph below illustrates results obtained with different fractions of the original data set available for training. All results were tested using 3-fold cross-validation.

[Chart: Accuracy and Data Set Sizes. Average accuracy (%) plotted against the percentage of the data set used for training (10, 25, 50, 75, 100) for the SWN + Outlier and Baseline methods.]

Figure 15 - 3-Fold Accuracies for Different Training Set Sizes

As before, SentiWordNet accuracies remain below the baseline classifier, but as the two linear trends suggest, the SentiWordNet method benefits more from additional training data, slightly narrowing the difference between the two methods as the training data set size increases.

Training Time and Execution Time
The execution time of the SentiWordNet classifiers stayed below 2 seconds in all experiments, whereas the baseline classifier required higher execution times, as outlined in the table below.

Folds   Validation Set Size   Baseline   SentiWordNet with Outlier   SentiWordNet without Outlier
3       667                   29s        1.6s                        < 1s
5       400                   12s        < 1s                        < 1s
10      200                   10s        < 1s                        < 1s
100     20                    3s         < 1s                        < 1s

Table 38 - Execution Times for Baseline and SentiWordNet Classifiers

Results are calculated by averaging execution time measured over the experiment folds. As stated earlier, a higher number of cross-validation folds increases the training set size while reducing the validation set size, hence the decreasing execution times for higher folds. We can calculate from the above results that, with the computing resources and algorithm implementation used in this experiment, the baseline classifier performs a prediction on average in 0.04s for the case with 3-fold cross-validation. Due to the timing precision of the measurements it is not possible to determine a similar result for the SentiWordNet classifier, but an upper limit to individual prediction time would be 0.003s, when execution time is exactly 1 second. The difference between the baseline classifier and SentiWordNet can be attributed to the low dimensionality of the SentiWordNet model: SentiWordNet performs predictions based on 80 features, whereas the baseline classifier needs 2012 features. As seen in Chapter 3, the performance of support vector machines does not necessarily correlate with the number of dimensions, but with the number of support vectors. Thus the SentiWordNet model requires a much smaller number of support vectors to perform the classification task than the baseline.

For training times, the table and graph below summarise times in seconds for the baseline and SentiWordNet classifiers using different cross-validation folds.

Folds   Training Set Size   Baseline   SentiWordNet with Outlier   SentiWordNet without Outlier
3       1333                31s        65s                         35s
5       1600                40s        92s                         44s
10      1800                48s        135s                        59s
100     1980                102s       156s                        86s

Table 39 - Training Times for Baseline and SentiWordNet Classifiers

[Chart: Training Time Comparison. Training time in seconds plotted against training set size (1333, 1600, 1800, 1980) for the Baseline, SWN + Outlier and SWN no Outlier methods.]

In absolute terms, it is natural to expect training and execution times for any classification algorithm to depend on external factors such as processor speed, available memory, programming language and implementation details. However, by comparing results obtained from the SentiWordNet and baseline classifiers using the same algorithm implementation and data mining tool, it can be seen that the low-dimensionality nature of the SentiWordNet model does not strongly affect training times when compared to a model with a high number of features, such as the unigram bag-of-words method of the baseline classifier. This is in line with the expected theoretical behaviour of support vector machine training seen in Chapter 3, which states that training time is a function of the size of the training set, but not of dimensionality. Interestingly, the training times for the baseline classifier outperform SentiWordNet on nearly all training set sizes, even for the case with no outlier detection. It can be speculated that implementation details of Support Vector Machines in RapidMiner, and the nature of features in the bag-of-words method, which only admit values of 0 or 1, are likely causes for the speedup in the training process.

7.3.3. Analysis of Misclassifications

This section presents a closer examination of film reviews incorrectly classified by the SentiWordNet approach. This will assist in evaluating the method’s weaknesses, drawing some conclusions and proposing possible improvements to the method. The following are text extracts from two film reviews with an overall negative sentiment where the SentiWordNet classifier made incorrect predictions. The extracts were taken from the concluding remarks at the end of each document:

“wild wild west's bright spots , such as the cool opening credits sequence , bai ling's all-too-brief appearance as a femme fatale , or the brilliant " his master's voice " joke , are all part of the film's first half , which is more clever and enjoyable , at least , than its second .”

“summer of sam has some superficial elements of a good film : it looks great , it has a few notable performances and i suppose it's pretty well directed , in a purely technical way .”

The first aspect to be noticed relates to the order of opinions presented, also described as the phenomenon of thwarted expectations, already noted in (Pang et al, 2008) and highlighted as a limiting factor for bag-of-words classifiers such as the baseline implementation of this experiment (Pang et al, 2002). In such cases the author builds up an expectation by describing positive aspects of the movie, only to later frustrate it by presenting another negative aspect. From the above, it can be seen that both extracts are rich in words denoting positive bias (e.g. “looks great”, “notable performances”, etc.) that would be heavily weighted in the SentiWordNet model, since they appear at the end of the document, but which nevertheless do not contribute to the overall film sentiment. It would appear that methods that rely solely on term polarity would not perform well for this type of narrative, and richer methods for analysing discourse would be necessary.

Another aspect to be noted is that relying only on term orientation includes irrelevant words, such as film titles (e.g. “wild wild west”), actor names and place names, whose terms may carry positive or negative orientations and affect document scoring. Cases like these could be improved by introducing named entity recognition as a pre-processing step to detect and ignore such terms where appropriate. One example of this step applied to sentiment classification can be seen in (Dave et al, 2003).

The two reviews above present another similar characteristic: they are relatively long texts, richly describing film features, scenes and plots. The longer the review, the more likely it is to also include off-topic sections, such as character dialogues or descriptions of a different film for comparison. Under the SentiWordNet model, terms in purely descriptive sections of the document would also be accounted for, if SentiWordNet scores exist for them, thus potentially causing problems for overall scores and classifier predictions. One interesting path to address this issue is to implement a subjectivity detector in an attempt to filter out sections with predominantly descriptive content. Subjectivity detection was discussed in section 4.2.3 of Chapter 4, and an experiment using the polarity data set is described in (Pang et al, 2004). It is worth highlighting, however, that this task is not a trivial one, and even descriptive sections are likely to contain important but subtle clues about author opinion, as in the example below.

“… the mad inventor who is plotting to divvy up the united states and sell it back to britain and spain . how will loveless accomplish this ? well , by hulking around the desert in an enormous , mechanical tarantula , of course .”

Here, the author describes the plot but also introduces ironical remarks by way of choice of words and style, which could be lost if the sentence were removed by a subjectivity detection step.

As noticed in (Pang et al, 2008), reviews are likely to differ considerably in style and choice of vocabulary. Consider the extract below from a misclassified example with negative overall sentiment:

“ the film is just a reel to show off a bunch of snazzy fx shots . “

The style of the above review is more colloquial, uses shortened words - the term “fx” for special effects - and expressions like “show off”, which are harder to detect: in the experiment’s case, only the term “show” appears as a verb after part-of-speech tagging. This suggests that some opinion information may be lost in cases like the above, which could be retrieved by introducing text pre-processing steps to facilitate the detection of expressions, and an extended lexicon to improve understanding of colloquial language.

From the same document, another example of an incorrect part-of-speech tag can be found in the use of the word “asteroid”. The tagger classified the term as an adjective, for which a SentiWordNet score exists: according to WordNet, the term “asteroid” not only describes a celestial object, but is also an adjective meaning “star-shaped”. The tagger should in fact have tagged the term as a noun, as it was used in the plot description. The adjective scores were then incorrectly taken into account, illustrating the importance of accurate tagging for classification results.

SentiWordNet Scoring
A closer examination of scores extracted from the SentiWordNet database also reveals potential issues. The term “ludicrous” was seen in a misclassified film review, and has two possible synsets in WordNet. The table below details their synset glosses and SentiWordNet scores:

Term: Ludicrous (adj)
Gloss: farcical, ludicrous, ridiculous (broadly or extravagantly humorous; resembling farce) "the wild farcical exuberance of a clown"; "ludicrous green hair"
SentiWordNet Score (Pos, Neg): (0.5, 0.125)

Term: Ludicrous (adj)
Gloss: absurd, cockeyed, derisory, idiotic, laughable, ludicrous, nonsensical, preposterous, ridiculous (incongruous; inviting ridicule) "the absurd excuse that the dog ate his homework"; "that's a cockeyed idea"; "ask a nonsensical question and get a nonsensical answer"; "a contribution so small as to be laughable"; "it is ludicrous to call a cottage a mansion"; "a preposterous attempt to turn back the pages of history"; "her conceited assumption of universal interest in her rather dull children was ridiculous"
SentiWordNet Score (Pos, Neg): (0.625, 0)

Both synsets contain generally positive SentiWordNet scores; however, consider this term’s use in the extract below:

“the action in armageddon are so over the top , nonstop , and too ludicrous for words.”

It can be argued that the second synset should carry a negative orientation, given its association with synonym terms such as “farcical” and “idiotic”. However, it appears SentiWordNet assigned positive scores to this term on the basis that the text of the synset gloss is more likely to be associated with a positively oriented term than a negative one. Recalling from section 4.2.5, SentiWordNet scores are expanded to all WordNet terms by applying a classification algorithm based on terms extracted from synset glosses; therefore terms such as “exuberance” and “clown”, and the somewhat ambiguous “laughable”, could be influencing the construction method in assigning incorrect scores. The dependence of SentiWordNet scores on term glosses could be a limiting factor in the accuracy of term scores, and in the overall classification accuracy of the experiment.
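For readers who wish to inspect such scores directly, the sketch below looks up the synsets of a term through NLTK's SentiWordNet corpus reader. This is only an illustration: NLTK distributes a later SentiWordNet release than the one used in this experiment, so the scores returned may differ from those shown above.

# Illustrative sketch: inspecting SentiWordNet scores for a polysemous adjective.
# Requires the NLTK data packages 'sentiwordnet' and 'wordnet' to be installed.
from nltk.corpus import sentiwordnet as swn

def show_scores(word, pos="a"):
    for senti_synset in swn.senti_synsets(word, pos):
        print(senti_synset.synset.name(),
              "pos:", senti_synset.pos_score(),
              "neg:", senti_synset.neg_score(),
              "-", senti_synset.synset.definition())

show_scores("ludicrous")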

7.4. Conclusion

This chapter presented and discussed the results of the sentiment classification experiment using SentiWordNet, described in Chapter 6. Initially, a baseline classifier method was implemented using bag-of-words features, similar to the one described in (Pang et al, 2002). The SentiWordNet classifier was built using the features described in Chapter 5, and an iterative process tested various combinations of feature generation parameters. Next, outlier detection and removal and feature selection were applied to the SentiWordNet classification. Finally, a discussion of the results obtained and an examination of misclassifications was presented. Results and findings were presented for the three key metrics proposed for assessing the experiment: classification accuracy, training set size, and training and execution times.

For classification accuracy, the SentiWordNet classifier reported a best result of 69.10% average accuracy using 3-fold cross-validation, in contrast to a baseline result of 83.90%. The SentiWordNet results are close to the ones reported in (Pang et al, 2002) for a classifier based on simple document statistics and a word list of positive and negative terms manually built for the data set. The SentiWordNet results, however, are built upon a lexicon generated by a semi-automatic method that is not dependent on a specific domain; thus it could be speculated that the SentiWordNet approach is potentially more generic, more automated and less domain-dependent than manually built word lists.

The parameters affecting SentiWordNet features were also individually tested during the experiment and their effects on overall classification were presented. The best combination found during the experiment calculated SentiWordNet scores according to a linearly increasing function of term position in the document, suggesting that terms placed towards the end of a document are more likely to represent author opinion. Intuitively this would correspond to an author’s concluding remarks about the film.

During the discussion of SentiWordNet features in Chapter 5, the idea of a threshold value that would triage which term scores were to be used was introduced, in order to address the problem of polysemous terms with several meanings and potentially contradictory SentiWordNet polarity scores. A similar concern is seen in (Dave et al, 2003) on the use of similar term-based lexical resources for opinion mining. In this experiment a threshold value was used in the cases where multiple SentiWordNet scores are found: if the positive and negative scores differed by more than a specified threshold, it could be argued their polarity is less likely to be ambiguous. The experiment was executed against several threshold values and no benefit was found in discarding terms according to score difference. In other words, a threshold of zero yielded the best results during the experiment, and this approach for distinguishing potentially ambiguous terms did not have a positive effect on classification accuracy. The problem of ambiguous terms still needs to be addressed and deserves further investigation, and other approaches could have a better effect on the SentiWordNet model.

A negation detection algorithm was also implemented to adjust SentiWordNet scores accordingly for negated terms. The algorithm applied is based on the NegEx method (Chapman et al, 2001) and yielded a minor improvement in accuracy. The effects of negation detection have been extensively studied in the area of medical records (Chapman et al, 2001; Huang et al, 2007; Mutalik et al, 2001), and further exploration in other domains such as film reviews could yield better results for SentiWordNet scores.

Finally, the best choice of document segmentation found was to divide the document equally into 10 segments and calculate scores for each segment individually. Similar to the linear scoring function detailed above, this approach attempts to detect and highlight parts of the document with stronger opinion.

The final step of the experiment was to refine the results obtained by SentiWordNet by performing outlier removal and feature selection. These steps improved the 3-fold cross-validation accuracy to the final result of 69.10%. The feature removal method also illustrated a weak correlation (using the chi-squared test) between features that count negated terms, or sum negative scores for adverbs, and the document’s opinion polarity. This finding may suggest negation expressions as a stylistic resource are used equally in both positively and negatively biased reviews, whereas it has been noted that other features related to adverbs and verbs do not have enough scores in the SentiWordNet database and therefore add little information to the model.

For different training set sizes, the SentiWordNet approach did not yield substantial differences in accuracy, and results remained at or below 69.10%. The 100-fold cross-validation experiment, however, showed similar results with and without outlier detection and removal from the training process, suggesting there is less need for this step as more training data becomes available. Alternatively, the outlier detection step could be tuned accordingly for larger training data sets.

Execution time for the SentiWordNet approach remained substantially smaller than for the baseline classifier, due to the much smaller dimensionality of the feature set built from SentiWordNet in comparison to a word vector based on unigrams, suggesting this approach may be better suited to cases where making a prediction as quickly as possible is of paramount importance. Training times, however, did not differ considerably between the baseline and SentiWordNet cases, in line with the expected theoretical behaviour of support vector machines, and indeed the baseline method outperformed SentiWordNet on training times in nearly all cases.

A review of documents that were incorrectly classified by the SentiWordNet classifier revealed several challenges and opportunities for improvement of this method. The order of opinion-related terms in a sentence is important, as is the case with the phenomenon of thwarted expectations, where the author builds up expectations on positive or negative aspects of the film, only to be frustrated by an opinion with the opposite polarity. This effect suggests that scoring term polarity alone, with no regard to term placement in the narrative, can be misleading when assessing overall document sentiment. The use of colloquial expressions from spoken English, acronyms and word shortenings not present in SentiWordNet also influences scores, and pre-processing techniques might be able to help in detecting such cases. In addition, the accuracy of part-of-speech tagging was noted to influence overall accuracy, and some scores for SentiWordNet terms were found to be misleading, suggesting the automated method for building the SentiWordNet database has an over-reliance on WordNet glosses, and can be improved for the purposes of detecting term polarity.

The results obtained from the SentiWordNet experiment highlighted a number of challenges and opportunities for improving the method. The next chapter concludes this dissertation by summarising the findings from the literature review, experiment design and execution, and discusses future research possibilities.

8. CONCLUSION

This chapter concludes this dissertation’s research. The introductory section reviews the motivation, potential benefits and key challenges of opinion mining and sentiment classification, followed by a discussion on the research objectives and achievements. Results on the experiment using SentiWordNet for sentiment classification are reviewed, with concluding remarks on the obtained results. Opportunities for future research work are presented, and the chapter concludes with final remarks on the work performed.

8.1. Introduction

This dissertation presents research in the field of opinion mining that evaluates the application of the SentiWordNet lexical resource to sentiment classification of documents. Driven by the increasing availability of subjective information in digital format, on resources publicly available on the internet and in corporate information systems, the field of opinion mining entails the use of automated methods for detecting subjective content within text resources, and has applications in a variety of fields, from online advertising systems to search engines and market research. In many instances, sentiment information forms a key component in building the knowledge required for effective decision making, and is therefore a subject of concern to the field of knowledge management.

To achieve the aims of opinion mining and to perform its task effectively, the field draws on resources from knowledge discovery, data mining and text mining. In addition, because of the complexity and nuances of subjective language, opinion mining is a rich field for the application of natural language processing techniques. To further understand the challenges of opinion mining, a review of the state of the art of research in the above fields was conducted as part of this dissertation’s research, as was a review of its relationship to the objectives of the broader field of knowledge management.

The experiment performed as part of this research produced not only results that can be used for future reference in the field, but also revealed several challenges where future research might be of interest. In the next sections, these are presented in more detail.

8.2. Research Overview and Objectives

The research performed as part of this project aimed at reviewing the state of the art in the areas of knowledge management, knowledge discovery and opinion mining. This review was then employed in the design and implementation of an experiment to assess the effectiveness of the SentiWordNet lexical resource for the purposes of sentiment classification. During this research, the following objectives were achieved.

• Review of the literature in the area of knowledge management, in particular exploring the importance of knowledge creation and knowledge discovery to the success of modern organisations.

• Review of research in the fields of knowledge discovery and data mining, including techniques, challenges and applications to knowledge discovery.

• Review of research literature on text mining and opinion mining, exploring applications of opinion mining to computer systems, and their relationships to the goals of knowledge management. Exploration of opinion mining techniques for detecting sentiment orientation in documents, and the use of lexical resources for opinion mining.

• Evaluation of the SentiWordNet lexical resource, and design of a model that extracts features from text documents using SentiWordNet for the purposes of sentiment classification.

• Design and execution of a sentiment classification experiment that tests the effectiveness of SentiWordNet according to the criteria of classification accuracy, runtime and sensitivity to training data, and compares results to a well-known baseline classifier documented in the literature.

• Comparisons of results obtained and exploration of challenges and limitations of the approach proposed by the experiment.

8.3. Experiment Results

As outlined in Chapter 6, the results for the sentiment classification experiment were measured according to three key factors: classification accuracy, training set sizes and timings for training and execution. For classification accuracy, the best result obtained by the SentiWordNet classifier using 3-fold cross-validation was 69.10%. This compares to the 83.90% obtained using the baseline classifier based on bag-of-words features. The SentiWordNet results are similar to the classifier based on simple document statistics and a list of positive and negative words presented in (Pang et al, 2002). The table below illustrates how SentiWordNet compares to other published results in the area tested with the same polarity data set used by this experiment. Where possible, the results reported reflect an experiment using 3-fold cross-validation.

Method                                                                  Accuracy   Source
Support Vector Machines and Bigrams word vector                         77.10%     (Pang et al, 2002)
Naïve Bayes + Parts of Speech                                           77.50%     (Salvetti et al, 2004)
Word vector with Adjectives Only                                        77.70%     (Pang et al, 2002)
Support Vector Machines and Unigrams word vector                        82.90%     (Pang et al, 2002)
Unigrams + Subjectivity Detection                                       87.15%     (Pang et al, 2004)
Positive/Negative Word Lists                                            69.00%     (Pang et al, 2002)
Term counting from General Inquirer Dictionary + Linguistic Features    67.80%     (Kennedy et al, 2006)
SentiWordNet                                                            69.10%

Table 40 - Accuracy Comparison with Published Research

Clearly, results obtained using the SentiWordNet approach remain below the state-of-the-art techniques, but within close range of other similar approaches. At first sight this may suggest an upper bound for sentiment classification techniques that rely solely on term polarity information. Indeed, further investigation of misclassified reviews revealed some limitations of this approach, but also revealed opportunities for improving the method. Initially, it was noticed that some of the SentiWordNet term scores did not carry the expected polarity. This can be attributed to the method by which SentiWordNet is built, which relies on WordNet glosses for making a classification prediction, and can lead to inaccuracies.

It was also noticed that the current method used for extracting information was taking into account irrelevant terms, such as film and actor names, commonly referenced in text but of little relevance in detecting author opinion. In addition, further refinements in capturing English expressions and acronyms would assist in detecting opinion in the more colloquially written reviews.

The issue of word sense disambiguation was not addressed as part of this experiment, and improvements in this area could certainly assist in selecting the right SentiWordNet score where terms have more than one meaning. Finally, the issue of thwarted expectations – where a positive expectation is built up using words with positive orientation, only to be frustrated by a negative opinion – was present in many examples, and impairs the correct prediction of reviews.

The results obtained for different training data set sizes showed the same pattern of improvement for both methods as more training data became available, indicating the need to increase the predictive power of the model, possibly by adding more features and improving the quality of existing features.

There was no substantial gain in training times when comparing SentiWordNet with the baseline classifier; however, the execution time required to make a prediction on a new document was much smaller for SentiWordNet. This can be attributed to the considerably lower number of features used in the SentiWordNet model, which makes this model a good candidate for systems where near real-time predictions are necessary.

As previously discussed in Chapter 3, knowledge discovery is an iterative process, where findings may lead to new insights and different lines of investigation. The findings obtained from the SentiWordNet experiment highlighted a number of challenges and potential improvements in data preparation, data mining and model improvement. The next sections outline key findings and additions to the body of knowledge in opinion mining research, and propose further research into some of the above aspects, along with alternative approaches for applying SentiWordNet information to opinion mining.

8.4. Additions to the Body of Knowledge

As outcomes of this dissertation’s research and experiment results, the following findings can be highlighted as contributions to the body of knowledge in the area of opinion mining.

The key part of this project’s experiment was to perform sentiment classification using features built from the SentiWordNet database of term polarity scores. This project presents a proposed design for a set of features, and classification results obtained by experimentation using the polarity data set. These results can be used as a reference for future research on this class of sentiment classification technique.

The experiment has also shown that weighting the polarity score of terms as a function of their location in the document has an effect on overall accuracy, as final remarks in a review tend to carry heavier sentiment content, further validating comments from (Pang et al, 2008; Pang et al, 2002) that document structure is a meaningful aspect for measuring a document’s sentiment orientation.

The research has also demonstrated the connection between negation expressions in text and the accuracy of final results when performing sentiment classification with SentiWordNet, with minor improvements obtained from a simple negation detection algorithm based on the work of (Chapman et al, 2001), suggesting further improvements in this area may lead to better sentiment classification results.

One potential limiting factor of the SentiWordNet approach unearthed during this research was the reliance on WordNet glosses, used by SentiWordNet for automatically building term polarity scores. This dependency could lead to inaccuracies in term polarity scores depending on the gloss’s contents, and could affect the overall accuracy of sentiment classification when using this resource.

8.5. Future Work & Research

The research performed on knowledge management, data mining, opinion mining and SentiWordNet, along with the outcomes obtained through experimentation and the analysis of results have uncovered opportunities for future research which could lead to interesting results, and are explored in more details in this section.

8.5.1. SentiWordNet Features

The SentiWordNet set of features proposed in Chapter 5 and used during this dissertation’s experiment can be further refined by introducing word sense disambiguation in conjunction with SentiWordNet to score terms found in text more accurately. As discussed in the experiment results, the simple threshold-based approach used in this research did not lead to any gains in accuracy, suggesting there is further room for improvement on this topic.

Also, incorporating the detection of colloquial expressions and detection of entities such as actors, locations and film names are relatively straightforward approaches that may improve results further by removing potentially noisy scores from the data set.

Another important aspect found during experimentation was the inclusion of predominantly descriptive sections of text within the final SentiWordNet scores, which may not carry strong opinion content but could lead to inaccuracies in the final classification results. Including detection of subjective text as a pre-processing step ahead of building the SentiWordNet data set may further improve results using this approach.

It was also found during the experiment that inaccuracies stemming from incorrectly tagging parts of speech could lead to poor classification results, since scoring terms using SentiWordNet relies heavily on this information. Experimenting with different part-of-speech tagging techniques could also lead to improved results.

The negation algorithm implemented as part of the experiment is a simplified version of the NegEx method presented in (Chapman et al, 2001), and further improvements that enable the detection of more subtle negating and opinion-changing expressions in text could lead to improved results.

8.5.2. Classification Results

It has been observed in (Read, 2005; Pang et al, 2008) that machine learning techniques trained for sentiment classification tend to be specific to a domain such as film reviews or editorials, and to specific topics like films or digital cameras. These are not immediately applicable across other domains or topics, or, when applied, generate poorer classification results. One aspect of applying SentiWordNet is that it is based on WordNet synsets, and therefore likely to be more generic and more applicable to other domains, and could serve as a baseline classifier when no domain- or topic-specific classifier exists. It would be interesting to assess classification performance where the training and validation data sets belong to different domains, since development in this area may lead to more widely applicable classifiers.

By the same principle, it could be argued that whereas classification accuracies obtained with SentiWordNet are inferior to those of other techniques, the SentiWordNet model may capture important information for the detection of document sentiment that is missed by other methods. Thus, combining the results of multiple classifiers might lead to improvements in overall performance and better methods. This is the subject of research in multiple classifier systems (MCS), with some techniques surveyed in (Kittler et al, 1998; Provost et al, 2001) and examples of such techniques implemented in the opinion mining domain reported in (Mullen et al, 2004) and (Kennedy et al, 2006).

Another interesting aspect of SentiWordNet is its application to the area of bootstrapping classifiers when little training data is available. SentiWordNet could be applied as a high-precision domain-independent classifier used as the initial stage for the creation of larger training sets for sentiment classification.

Finally, the results of this experiment were obtained by carrying out a limited search over parameter combinations, and no extensive classifier tuning was performed. Potentially better results could be obtained with a more extensive investigation of possible parameter combinations, either by exhaustive search – an approach potentially prohibitive in time and resources – or by applying more sophisticated parameter searching approaches, as described in (Hand et al, 2001).

8.5.3. Knowledge Management Research

Another topic that deserves further exploration relates to the less technical aspects of opinion mining, their application in knowledge discovery, and their suitability for knowledge management initiatives. For instance, due to the widespread availability of text information in digital format, text mining applications have been surveyed in the literature and closely linked to knowledge management objectives (Marwick et al, 2001; Feldman et al, 1998). Assessing the usage of an opinion mining component as a requirement in such applications would be beneficial not only to further encourage opinion mining research, but also to bring additional functionality to knowledge management systems.

8.6. Conclusions

We conclude from the results obtained by the experiment that, whereas accuracy results for the SentiWordNet approach were below current state-of-the-art methods, the experiment also highlighted aspects where further research can generate potential improvements. This, coupled with the lower dimensionality of the SentiWordNet data set and its relatively lower dependence on domain information, could lead to more attractive models for real-world applications.

Knowledge discovery methodologies will no doubt play an important part in such developments, as attention to all aspects of discovery are needed for effective results, and the iterative nature of knowledge discovery will render future opportunities for development as more results become available. Likewise, the development of data mining algorithms and advances in natural language processing also have a part to play in the improvement of opinion mining techniques, as this is a research area that combines efforts from all the above fields.

In the context of knowledge management, it is clear that opinion information adds a new dimension to what can be extracted from textual data, and has the potential to improve an organisation’s ability to create new knowledge through knowledge discovery approaches. As reviewed in Chapter 2, these are crucial aspects for effective decision making processes and for ensuring companies remain creative and competitive.

8.6.1. Final Remarks

The field of opinion mining is an exciting new area of research with the potential for a number of real world applications where discovering opinion information is relevant to making better decisions. The development of techniques for document sentiment classification is an important component of this area, and further developments will certainly affect the quality and speed at which knowledge derived from opinion information can be created, with implications for companies’ ability to compete and respond to customer demands.

The research presented in this dissertation has assessed the viability of performing document sentiment classification using SentiWordNet, an automatically built lexical resource of term opinion polarity. The research also highlighted that further developments in this field are possible, giving the approach the potential to be attractive for a number of real world applications.

REFERENCES

Alavi M, Leidner D. E, (2001) “Knowledge Management and Knowledge Management Systems: Conceptual Foundations and Research Issues”, MIS Quarterly, Vol. 25 No. 1, 107-136, March 2001.

Alavi M, Leidner D.E. (1999) "Knowledge management systems: issues, challenges, and benefits", Communications of the AIS, 1999.

Ali, K. M. Pazzani, M. J. (1993) “HYDRA: A Noise-tolerant Relational Concept Learning Algorithm”, International Joint Conference in Artificial Intelligence, Vol. 13, Number 2, 1993.

Allix N. (2003) “Epistemology and Knowledge Management Concepts and Practices”, Journal of Knowledge Management Practice, April 2003.

Alpaydin E. (2004) “Introduction to Machine Learning”, MIT Press, 2004.

Apte C., Damerau, F. J., Weiss, S. M. (1994) “Automated learning of decision rules for text categorization”, ACM Transactions on Information Systems, No. 12, Vol. 3, 233–251, 1994.

Arthur B. (1996) “Increasing Returns and the New World of Business”, Harvard Business Review, 101-109, July-August 1996.

Awad E, Ghaziri H, (2004) “Knowledge Management”, Pearson – Prentice Hall, 2004.

Boiy E, Hens P, Deschacht K, Moens M, (2007) "Automatic Sentiment Analysis in On-line Text", Proceedings ELPUB 2007 Conference in Electronic Publishing - Vienna, Austria - June 2007.

Boser B, Guyon I, Vapnik V. (1992) “A Training Algorithm for Optimal Margin Classifiers”, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152, March, 1992.

Brachman R, Khabaza T, Kloesgen W, Piatetsky-Shapiro G, Simoudis E, (1996) “Mining Business Databases”, Communications of the ACM, Vol. 39, No 11, 42-48, November, 1996.

Brants T. (2000) "TnT--a statistical part-of-speech tagger", Proceedings of the sixth conference on Applied natural language processing, 224-231, 2000.

Brijs T, Swinnen G, Vanhoof K, Wets G. (1999) “Using Association Rules for Product Placement Decisions: A Case Study”, Proceedings of the fifth ACM SIGKDD International Conference, 1999.

Burges C. (1998) “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998.

Catmull E. (2008) “How Pixar Fosters Collective Creativity”, Harvard Business Review, 65-72, September 2008.

Chapman P, Clinton J, Kerber R et al, (2000) “CRISP-DM 1.0 – Step by Step Data Mining Guide”, The CRISP Consortium, from http://www.crisp-dm.org/Process/index.htm.

Chapman W, Bridewell W, Hanbury P, Cooper G, Buchanan B. (2001) “Evaluation of Negation Phrases in Narrative Clinical Report”, Proceedings of 2001 AMIA Symposium, pp 105-109, 2001.

Clark P, Niblett T. (1989) “The CN2 Induction Algorithm”, Machine Learning, Springer, 1989.

Cody W, Kreulen J. T, Krishna V, Spangler S W, (2002) “The Integration of Business Intelligence and Knowledge Management”, IBM Systems Journal, Vol. 41, No. 4, pp. 697-713, 2002.

Cole R, (1998) “Introduction”, California Management Review, Vol. 40, No. 3, 15-21, Spring 1998.

Conrad J, Al-Kofahi K. (2005) “Effective document clustering for large heterogeneous law firm collections”, Proceedings of the 10th International Conference on Artificial Intelligence and Law, pp. 177-187, 2005.

Corney M. de Vel, O., Anderson A, Mohay G. (2002) "Gender-preferential text mining of e-mail discourse", 18th Annual Computer Security Applications Conference, 2002.

Csomai A, Rosenzweig J, Mihalcea R. (2007) “WordNet Bibliography”, WordNet Portal, accessed online Jan/2009 from [http://lit.csci.unt.edu/~wordnet/].

Cui H, Mittal V, Datar M. (2006) “Comparative Experiments on Sentiment Classification for Online Product Reviews”, Proceedings of the National Conference on Artificial Intelligence, AAAI Press, Vol. 21; pp. 1265-1270, 2006.

Dave K, Lawrence S, Pennock D. (2003) “Mining the Peanut Gallery: Opinion Extraction and Semantic Classification in Product Reviews”, Proceedings of the 12th International conference on the World Wide Web - ACM WWW2003, May 20-24, Budapest, Hungary, 2003.

Davenport T, Prusak L, (1998) “Working Knowledge – How Organisations Manage What They Know”, Harvard Business School Press, 1998.

Devitt A, Ahmad K, (2007) “Sentiment Polarity Identification in Financial News: A Cohesion Based Approach”. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 984–991,Prague, Czech Republic, June 2007.

Don A, Zheleva E, Gregory M, Tarkan S, Auvil L, Clement T, Shneiderman B, Plaisant, C. (2007) “Discovering interesting usage patterns in text collections: integrating text mining with visualization”, University of Maryland Human-Computer Interaction Labs, Technical report 2007-08, May 2007. [http://www.cs.umd.edu/hcil/bioinfovis/]

Dorre J, Gerstl P, Seiffert R. (1999) “Text mining: finding nuggets in mountains of textual data”, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, 1999.

Drucker H, Wu D, Vapnik V. (1999) “Support Vector Machines for SPAM Categorization”, IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp 1048- 1054, September 1999.

Drucker P. (1995) “The Information Executives Truly Need”, Harvard Business Review, January-February 1995, pp. 55-62, 1995.

Esuli A, Sebastiani F, (2006) "SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining", Proceedings from International Conference on Language Resources and Evaluation (LREC), Genoa 2006.

Fahey L, Prusak L, (1998) “The Eleven Deadliest Sins of Knowledge Management”, California Management Review, Vol. 40, No 3, pp. 265-276, Spring 1998.

Farhoomand A, Drury D H, (2002) “Managerial Information Overload”, Communications of the ACM, Vol. 45, No. 10, pp. 127-131, October 2002.

Fayyad, U, Haussler, D, and Stolorz, P. (1996) “Mining scientific data.” Communications of the ACM, No. 39, Vol. 1, November 1996, 51-57.

Fayyad, U, Piatetsky-Shapiro, G, Smyth, P. (1996) "The KDD process for extracting useful knowledge from volumes of data". Communications of the ACM, No. 39, Vol. 11, November 1996, 27-34.

Fayyad, U. M., Weir, N., and Djorgovski, S. (1993) “Automated cataloguing and analysis of sky survey image databases: the SKICAT system.” Proceedings of the Second International Conference on Information and Knowledge Management (CIKM ’93), Washington, D.C., United States, November 01-05, 1993.

Fayyad, U. Piatetsky-Shapiro, G. Smyth, P, (1996) "From Data Mining to Knowledge Discovery in Databases". AI Magazine, Vol. 17; No. 3, (1996) pages 37-54.

Feldman R, Dagan, I. (1995) “Knowledge discovery in textual databases (KDT)”, Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), pp. 112-117, 1995.

Feldman R, Fresko M, Hirsh H, Aumann Y, Liphstat O, Schler Y, Rajman M. (1998) “Knowledge Management: A Text Mining Approach”, Proceedings of the 2nd international conference on Practical Aspects of Knowledge Management – PAKM’98, Basel, Switzerland, Oct. 1998.

Feldman R, Dagan I, Hirsh I. (1998-b), “Mining Text Using Keyword Distributions”, Journal of Intelligent Information Systems, Vol. 10, pp. 281-300, 1998.

Fellbaum C, Gross D, Miller K. J, (1990) "Adjectives in WordNet", International Journal of Lexicography, vol. 3, no. 4, pp. 235-244, January 1990.

Forman G, Kirshenbaum E, Suermondt J, (2006) “Pragmatic Text Mining: Minimizing Human Effort to Quantify Many Issues in Call Logs”, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06-Philadelphia), pp. 852-861, August 2006.

Forman G. (2002), “An extensive empirical study of feature selection metrics for text classification”, The Journal of Machine Learning Research, Vol. 3, pp. 1289-1305, 2003.

Fry B, (2008) “Visualising Data”, O’Reilly Press, 2008.

Gamon, M. (2004) “Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis.” Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, Association for Computational Linguistics, p. 841, 2004.

Gargano M L, Raggad B G, (1999) “Data Mining – A Powerful Information Creating Tool”, OCLC Systems and Services, Vol 15, No 2, 81-90, MBC University Press, 1999.

Garside, R. (1987). “The CLAWS Word-tagging System.”, The Computational Analysis of English: A Corpus-based Approach. London: Longman.

Ghani R, Probst K, Liu Y, Krema M, Fano, A. (2006) “Text mining for product attribute extraction”, ACM SIGKDD Explorations Newsletter, Vol. 8, No. 1, pp. 41-48, 2006.

Goebel M, Gruenwald L. (1999) “A Survey of Data Mining and Knowledge Discovery Software Tools”, ACM SIGKDD Explorations, pp. 20-33, Vol. 1, Issue 1, June, 1999.

Hahn J, Subramani M. (2000) “A Framework of Knowledge Management Systems: Issues and Challenges for Theory and Practice”, Proceedings of the Twenty-first International Conference on Information Systems, Brisbane, Australia 2000.

Hand D, Mannila H, Smyth P, 2001, “Principles of Data Mining”, The MIT Press, Cambridge Massachusetts, 2001.

Hansen M, Nohria N, Tierney T, (1999) “What’s Your Strategy for Managing Knowledge?” , Harvard Business Review, March-April, 106-116, 1999.

Hayes P, Weinstein S.P. (1990) "Construe-TIS: A system for Content-based indexing of database news stories", Proceedings of the conference in Innovative Applications of Artificial Intelligence - IAAI90, 1990.

Hearst, M. A. (1999) “Untangling text data mining.” In Proceedings of the 37th Annual Meeting of the Association For Computational Linguistics on Computational Linguistics, College Park, Maryland, June 20 - 26, 1999.

Herschel R, Jones N E, (2005) “Knowledge Management and Business Intelligence: The Importance of Integration”, Journal of Knowledge Management, Vol. 9, No. 4, 45-55, 2005.

Hipp J, Guntzer U, Nakhaeizadeh G. (2000) “Algorithms for Association Rule Mining: A General Survey and Comparison”, SIGKDD Explorations, Vol. 2, Issue 1, 58-64, July 2000.

Ho Y, Agrawala A. (1968) “On Pattern Classification Algorithms – Introduction and Survey”, Proceedings of the IEEE, Vol. 56, No. 12, December 1968.

Hochheiser, H, Baehrecke, E.H, Mount, S.M, Shneiderman, B. (2003) “Dynamic Querying for Pattern Identification in Microarray and Genomic Data” Proceedings 2003 IEEE International Conference on Multimedia and Exposition, 2003.

Hoffman, P., Grinstein, G., Marx, K., Grosse, I., and Stanley, E. 1997. “DNA visual and analytic data mining”. Proceedings of the 8th Conference on Visualization '97 (Phoenix, Arizona, United States, October 18 - 24, 1997).

Hofmann M, (2003) “The Development of a Generic Data Mining Life Cycle (DMLC)”, M.Sc Dissertation, Dublin Institute of Technology, 2003.

Holsapple C, (2002) “Knowledge and Its Attributes”, Handbook on Knowledge Management, 165-188, Springer, 2002.

Holsapple C, Joshi K, (2002) “Knowledge Manipulation Activities: Results of a Delphi Study”, Information and Management, No 39, 477-490, Elsevier Science, 2002.

Holsapple C.W, Joshi K.D, (1999) “Description and Analysis of Existing Knowledge Management Frameworks”, Proceedings of the 32nd Hawaii International Conference on System Sciences, IEEE, 1999.

Horrigan J. (2008) “Online Shopping”, Pew Internet and American Life Project – Research Report, February, 2008.

Hotho A, Nurnberger A, Paaß G. (2005) “A Brief Survey of Text Mining”, LDV Forum-GLDV Journal for Computational Linguistics and Language Technology, Vol. 20, No. 1, pp. 19-62, 2005.

Huang Y, Lowe H. (2007) “A Novel Hybrid Approach to Automated Negation Detection in Clinical Radiology Reports”, Journal of the American Medical Informatics Association, Vol. 14, No. 3, May/June 2007.

Ide N., Véronis J. (1998) “Introduction to the special issue on word sense disambiguation: the state of the art”. Computational Linguistics. No. 24, Vol. 1, March 1998, pp. 2-40.

Jin, X., Li, Y., Mah, T., and Tong, J. (2007) “Sensitive webpage classification for content advertising.” Proceedings of the 1st international Workshop on Data Mining and Audience intelligence For Advertising ADKDD '07, San Jose, California, ACM, New York, NY, pp. 28-33, August 2007.

Joachims, T. (1998) “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”. Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.

Joachims, T. (1996) "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization.", Carnegie Mellon University, Dept. of Computer Science, Research Report, 1996.

John G.H, Miller P. (1996) “Building Long/Short Portfolios Using Rule Induction”, Proceedings of the IEEE International Conference on Computational Intelligence for Financial Engineering, New York City, March, 1996.

Kankanhalli A, Tanudidjaja F, Sutanto J, Tan B, (2003) “The Role of IT in Successful Knowledge Management Initiatives”, Communications of the ACM, September 2003, Vol. 46, No. 9, 69-73, 2003.

Kao, A., Quach, L., Poteet, S., Woods, S. (2003) "User assisted text classification and knowledge management", Proceedings of the twelfth international conference on Information and knowledge management, 524-527, 2003.

Kolcz A, Alspector J. (2001) "SVM-based filtering of e-mail spam with content-specific misclassification costs", Proceedings of the Workshop on Text Mining (TextDM’2001), 2001.

KDNUGGETS (2006) “Poll: Data Mining Software”, KDNuggets Portal, Accessed Feb/2009: [http://www.kdnuggets.com/polls/2006/data_mining_analytic_tools.htm], 2006.

KDNUGGETS (2007) “Poll: Data Mining Software”, KDNuggets Portal, Accessed Feb/2009 : [http://www.kdnuggets.com/polls/2007/data_mining_software_tools.htm] , 2007.

KDNUGGETS (2008) “Poll: Data Mining Software”, KDNuggets Portal, Accessed Feb/2009 from [http://www.kdnuggets.com/polls/2008/data-mining-software-tools-used.htm], 2008.

KDNUGGETS (2009) “Software for Data Mining, Analytics and Software Discovery”, KDNuggets Portal, Accessed online on Feb/2009 [http://www.kdnuggets.com/software/index.html], 2009

Keim D, Schneiderwind J. (2007) “Introduction to the Special Issue in Visual Analytics”, SIGKDD Explorations, Vol. 9, Issue 2, 2007.

Kim S, Hovy E. (2004) “Determining the Sentiment of Opinions.” In Proceedings of Conference on Computational Linguistics (COLING-04). pp. 1367-1373. Geneva, Switzerland, 2004.

Kittler J, Hatef M, Duin R, Matas J. (1998) “On Combining Classifiers”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, no. 3, pp. 226- 239, March 1998.

Knox E.M, Ng R.T. (1998) “Algorithms for mining distance-based outliers in large datasets”, Proceedings of the International Conference on Very Large Databases, Vol. 1, pp. 392-403, 1998.

Kolyshkina I, Simoff S. (2007) “Customer Analytics Projects: Addressing Existing Problems with a Process that Leads to Success”, Proc. 6th Australian Data Mining Conference (AusDM’07), Australia, 2007.

König, A. C. and Brill, E. (2006). “Reducing the human overhead in text categorization.” Proceedings of the 12th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining - KDD '06, Philadelphia, PA, USA, ACM, New York, NY, pp. 598-603, August, 2006.

Kroeze, J. H., Matthee, M. C., and Bothma, T. J. (2003) “Differentiating data- and text-mining terminology”. Proceedings of the 2003 Annual Research Conference of the South African institute of Computer Scientists and information Technologists on Enablement Through Technology, September 17 - 19, 2003.

Kulkarni S, Lugosi G, Venkatesh S. (1998) “Learning Pattern Classification – A Survey”, IEEE Transactions on Information Theory, Vol. 44, No. 6, October 1998.

Lent B, Agrawal R, Srikant R. (1997) “Discovering trends in text databases”, Proc. of the 3rd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, 1997.

Lim T, Lo W, Shih Y. (2000) “A Comparison of Prediction Accuracy, Complexity and Training Time of Thirty-Three Old and New Classification Algorithms”, Machine Learning, No. 40, 203-229, 2000.

Liu B, Chang K.C. (2004) “Editorial: Special Issue on Web Content Mining”. ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, pp. 1-4, 2004.

Liu H, Yu L, (2005) “Towards Integrating Feature Selection Algorithms for Classification and Clustering”, IEEE Transactions on Knowledge and Data Engineering, Vol 17, No 4, 491-502 April 2005.

Livingston, G, Li X, Li G, Hao L, Zhou J. (2003) “Using Rule Induction Methods to Analyze Gene Expression Data” , Proceedings of the 2003 IEEE Bioinformatics Conference, August, 2003.

Loper E, Bird S. (2002) "NLTK: The Natural Language Toolkit", Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 62-69, 2002.

Ma Y, Liu B, Wong C.K. (2000) “Web for Data Mining: Organizing and Interpreting the Discovered Rules Using the Web”, SIGKDD Explorations, Vol. 2, Issue 1, 16-23, July 2000.

MAKE. (2007) “2007 Global Most Admired Knowledge Enterprises (MAKE) Report – Executive Summary”, The KNOW Network, Accessed Oct/2008 from [http://www.knowledgebusiness.com].

Marcus, M. P. Marcinkiewicz, M. A. and Santorini, B. (1993). “Building a large annotated corpus of English: the Penn Treebank.” Computational Linguistics. No. 19, Vol. 2 (June 1993), 313-330.

Marwick A D. (2001) “Knowledge Management Technology”, IBM Systems Journal, Vol 40, No 4, pp. 814-830, 2001.

McCallum, A., Nigam, K. (1998) "A comparison of event models for naive bayes text classification", AAAI-98 workshop on learning for text categorization, Vol. 7, March 1998.

McDermott R. (1999) “Why Information Technology Inspired but Cannot Deliver Knowledge Management”, California Management Review, Vol. 41, No. 4, pp. 103- 117, Summer 1999.

McDonald, D.W, Ackerman M.S. (1998) "Just talk to me: a field study of expertise location", Proceedings of the 1998 ACM conference on Computer supported cooperative work, pp. 315-324, 1998.

McKnight W. (2005) “Text Data Mining in Business Intelligence”, Information Management Magazine, January 1st, 2005. Accessed February 2009 from [http://www.information-management.com/issues/20050101/1016487-1.html].

Meyer, T.A. and Whateley, B. (2004) " SpamBayes: Effective open-source, Bayesian based, email classification system", Proceedings of the First Conference on Email and Anti-Spam (CEAS), 2004.

Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T, (2006) “YALE: Rapid Prototyping for Complex Data Mining Tasks”, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.

Mihalcea R, Strapparava C. (2005) “Making Computers Laugh: Investigations in Automatic Humour Recognition”, Joint Conference on Human Language Technology/Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.

Miller G. A., Beckwith R., Fellbaum C, Gross D, Miller K. J, (1990) "Introduction to Wordnet: An on-line lexical database" International Journal of Lexicography, vol. 3, no. 4, pp. 235-244, January 1990.

Mobasher B, Dai H, Luo T,Nakagawa M. (2002) “Discovery and evaluation of aggregate usage profiles for web personalization”, Data Mining and Knowledge Discovery, No. 1, Vol. 6, pp. 61-82, 2002.

Moschitti A, Basili R. (2004) "Complex linguistic features for text classification: A comprehensive study." Lecture notes in computer science, Vol. 2997, 181-196, 2004.

Mullen T, Collier N. (2004) “Sentiment Analysis using Support Vector Machines with diverse Information Sources”, Proceedings of EMNLP, 2004.

Mutalik P, Deshpande A, Nadkarni P. (2001) “Use of General-Purpose Negation Detection to Augment Concept Indexing of Medical Documents”, Journal of the American Medical Informatics Association, Vol. 8, No. 6, November/December 2001.

Nasukawa T, Nagano T. (2001) “Text Analysis and Knowledge Mining System”, IBM Systems Journal, Vol. 40, No. 4, 2001.

Nasukawa, T, Yi, J. (2003) “Sentiment analysis: capturing favorability using natural language processing.” Proceedings of the 2nd international Conference on Knowledge Capture K-CAP '03, ACM, New York, NY, pp.70-77, 2003.

Nilsson N. (1996) “Introduction to Machine Learning”, Draft of Proposed Textbook – Dept. of Computer Science, Stanford University, 1996 [http://robotics.stanford.edu/~nilsson/MLDraftBook/MLBOOK.pdf].

Nirenburg, S. (1997) “Bar Hillel on Machine Translation: Then and Now”, In Memoriam Yehoshua Bar Hillel M. Caspi and E. Shamir (eds.), Computing Research Laboratory. New Mexico State University. Las Cruces, USA.

Nonaka I, Konno N, (1998) “The concept of ‘Ba’: building a Foundation for Knowledge Creation”, California Management Review, Vol. 40, No. 3, 40-54, Spring 1998.

Nonaka I. (1991) “The Knowledge Creating Company”, Harvard Business Review, pp. 96-104, November-December 1991.

Nonaka I. (1994) “A Dynamic Theory of Organisational Knowledge Creation”, Organisation Science, 14-37, February 1994.

Nonaka I. (1995) “The Knowledge-creating Company: How Japanese Companies Create the Dynamics of Innovation”, Oxford University Press, 1995.

ORACLE (2007) “Oracle Data Mining with 11g: Know More, Do Less, Spend Less”, Oracle White Paper, [http://www.oracle.com/technology/products/bi/odm], Accessed February 2009, 2007.

Ordonez C. (2006) “Association Rule Discovery with the Train and Test Approach for Heart Disease Prediction”, IEEE Transactions on Information Technology in Biomedicine, Vol. 10, No. 2, April 2006.

Pang B, Lee L. (2002) “Thumbs up? Sentiment Classification using Machine Learning Techniques”, Proceedings of EMNLP, 2002.

Pang B, Lee L. (2005) “Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales”, Proceedings of the 43rd Meeting of the ACL, pp. 115-124, June 2005.

Pang B, Lee L. (2008) “Opinion Mining and Sentiment Analysis”, Foundations and Trends in Information Retrieval, Vol. 2, Nos. 1-2, pp. 1-135, 2008.

Pang B, Lee L. (2004) "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts", Proceedings of the ACL, 2004.

Piatetsky-Shapiro G. (1999) “The Data Mining Industry coming of Age”, IEEE Intelligent Systems, pp. 32-34, 1999.

Piatetsky-Shapiro, G. (1991) “Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop”, AI Magazine 11(5): 68–70, 1991.

Piatetsky-Shapiro, G. (2000) “Knowledge Discovery in Databases: 10 Years After”, SIGKDD Explorations, Vol. 1, No. 2, February 2000.

Piatetsky-Shapiro, G. (2007) “Data Mining and Knowledge Discovery 1996 to 2005: Overcoming the Hype and Moving from ‘University’ to ‘Business’ and ‘Analytics’”, Data Mining and Knowledge Discovery, Vol. 15, No. 1, 99-107, 2007.

Porter M.F, van Rijsbergen C.J., Robertson S.E. (1980) “New models in probabilistic information retrieval.” London: British Library. (British Library Research and Development Report, no. 5587), 1980.

Provost F, Fawcett T. (2001) “Robust Classification for Imprecise Environments”, Machine Learning, Springer, Vol. 42, pp. 203-232, 2001.

Provost, J. (1999), "Naive-Bayes vs. Rule-Learning in Classification of Email", Technical Report 99 - University of Texas, 1999.

Prusak L, (2001) “Where did Knowledge Management Come From?”. IBM Systems Journal, Vol. 40, No 4, pp. 1002-1007, 2001.

Pyle D. (2004) “This Way Failure Lies”, DB2 Magazine, Issue 1, February 2004 [from http://www.ibmdatabasemag.com/story/showArticle.jhtml?articleID=17602328 ].

Quintas P, Lefrere P, Jones G. (1997) “Knowledge Management: A Strategic Agenda”, Long Range Planning, Vol. 30, No. 3, 385-391, 1997.

Ramaswamy, S. and Rastogi, R. and Shim, K. (2000) “Efficient algorithms for mining outliers from large data sets”, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp. 427-438, 2000.

Read J. (2005) “Using Emoticons to reduce Dependency in Machine Learning Techniques for Sentiment Classification”, Proceedings of the ACL Student Research Workshop, 2005.

Rogati M, Tang Y. (2002) “High-performing feature selection for text classification”, Proceedings of the ACM 11th international conference on Information and knowledge management, pp. 659-661, 2002.

Rowley J. (2007) “The Wisdom Hierarchy: Representations of the DIKW hierarchy”, Journal of Information Science, Vol. 2, No. 33, 163-180, 2007.

Salton, G., Wong, A., Yang, C. S. (1975) "A vector space model for automatic indexing." Communications of the ACM, No. 18, Vol. 11 (Nov. 1975), 613-620, 1975.

Salton, G. and Buckley, C. (1987) "Term weighting approaches in automatic text retrieval", Cornell University Technical Report, 1987.

Salvetti F, Lewis S, Reichenbach C. (2004) “Automatic Opinion Polarity Classification of Movie Reviews”. Colorado Research in Linguistics, June 2004, Volume 17, Issue 1. Boulder: University of Colorado., 2004.

Shneiderman B, (1996) “The Eyes Have it: A Task by Data Type Taxonomy for Information Visualization”, IEEE Visual Languages, 336-343, 1996.

Sebastiani F. (2002) “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, pp. 1-47, 2002.

SEMMA, (2008) “SAS Enterprise Miner – Predictive Analytics / Data Mining” SAS Institute Inc , Accessed Nov/2008 from: [http://www.sas.com/technologies/analytics/datamining/miner/semma.html]

Shearer C. (2000) “The CRISP-DM Model: A New Blueprint for Data Mining”, Journal of Data Warehousing, Vol. 5, Number 4, Fall 2000.

Stavrianou, A., Andritsos, P., and Nicoloyannis, N. (2007). “Overview and semantic issues of text mining”. SIGMOD Record, Vol. 36, Sep. 2007, 23-34, 2007.

Strapparava C, Valitutti A, Stock O. (2006) “The affective weight of lexicon”, Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2006.

Strapparava C, Valitutti A. (2004) “WordNet-Affect: an Affective Extension of Wordnet”, Proceedings of the 4th International Conference on Language Resources and Evaluation, 2004.

Sullivan S. (2005) “Search Engine Sizes”, Search Engine Watch, January 28th 2005, Accessed February 2009 from [http://searchenginewatch.com/showPage.html?page=2156481].

Swanson, D.R. and Smalheiser, N.R., (1997) “An interactive system for finding complementary literatures: a stimulus to scientific discovery”, Artificial intelligence, Vol. 91, No. 2, pp. 183-203, 1997.

Teece D, Pisano G, (2002) “The Dynamic Capabilities of Firms”, Handbook on Knowledge Management, 195-209, Springer, 2002.

Teece D. (1998) “Capturing Value from Knowledge Assets”, California Management Review, Vol. 40, No. 3, Spring 1998.

Tiwana A. (2002) “The Knowledge Management Toolkit – Practical Techniques for Building Knowledge Management Systems”, Prentice Hall, 2000.

Toutanova K, Manning C. (2000). “Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger”. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Turney P. (2002) “Thumbs up or Thumbs down? Sentiment Orientation Applied to Unsupervised Classification of Reviews”, Proceedings of the 40th Annual Meeting of the Association of Computational Linguistics – ACL, 2002.

van Rossum G, Drake F. (2003) “Python Language Reference Manual”, Network Theory Ltd., Bristol, UK, 2003.

von Krogh G, (1998) “Care in Knowledge Creation”, California Management Review, Vol 40, No 3, 133-153, Spring 1998.

Wang H, Wang S, (2008) “A Knowledge Management Approach to Data Mining for Business Intelligence”, Industrial Management and Data Systems, Vol. 108, No. 5, pp. 622-634, 2008.

Wei C, Piramuthu S, Shaw M. (2002) “Knowledge Discovery and Data Mining”, Handbook on Knowledge Management, 157-189, Springer, 2002.

Weiss S, Indurkhya N, Zhang T, Damerau F, (2005) “Text Mining – Predictive Methods for Analyzing Unstructured Information”. Springer, 2005.

Weiss S, Indurkhya N. (1998) “Predictive Data Mining – A Practical Guide”, Morgan Kaufmann Publishers, San Francisco, California, 1998.

Wiebe J, Bruce R, Martin M, Wilson T, Bell M. (2004) “Learning Subjective Language”, Computational Linguistics, Vol. 30, No. 3, pp. 277-308, January 2004.

Wiebe J, Bruce R, O’Hara T. (1999) “Development and Use of Gold-Standard Data Set for Subjectivity Classifications”, Proceedings of the 37th Annual Meeting of the Association of Computational Linguistics – ACL-99, pp. 246-253.

Wiebe J, Mihalcea R. (2006) “Word Sense and Subjectivity”, Proceedings of the 21st International ACL Conference on Computational Linguistics, pp 1065-1072, July, 2006.

Wiebe J. (1990) “Identifying Subjective Characters in Narrative”, Proceedings of the 13th conference on Computational linguistics – Vol. 2, Helsinki, Finland, 1990.

Wilks Y, Stevenson M. (1998) “Word Sense Disambiguation using Optimised Combinations of Knowledge Sources”, Proceedings of the Annual Meeting of the Association of Computational Linguistics ACL, Vol. 36, pp. 1398-1402. 1998.

Witten, I, Frank E. (1999) “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”, Morgan Kaufmann, San Francisco, 1999.

Wolpert D. (1996) “The Lack of a-priori Distinctions between Learning Algorithms”, Neural Computation, MIT Press, 1391-1421, 1996.

Wolpert D. (2001) “The Supervised Learning No-Free-Lunch Theorems”, Proceedings of the 6th World Conference in Soft Computing, 2001.

WORDNET (2006) “WordNet 3.0 Statistics”, The WordNet Portal, online at [http://wordnet.princeton.edu/man/wnstats.7WN] , accessed January 2009.

WORDNET (2009) “Wordnets in the world”, WordNet Portal, online at [http://www.globalwordnet.org/gwa/wordnet_table.htm], accessed January 2009.

XLMINER (2006) “XLMiner Capabilities”, XLMiner Website, Accessed Feb/09 from [http://www.resample.com/xlminer/capabilities.shtml], 2006.

Yang Y, Padmanabhan B. (2005), “GHIC: A hierarchical pattern-based clustering algorithm for grouping Web transactions” IEEE Transactions on Knowledge and Data Engineering, Vol.17, No. 9, pp. 1300-1304, 2005.

Yang Y, Pedersen J. O. (1997) “A Comparative Study on Feature Selection in Text Categorization”, Proceedings of the 14th International Conference on Machine Learning (ICML-97), pp. 412-420, 1997.

Yu H, Hatzivassiloglou V. (2003) “Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying Polarity in Sentences”, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 129- 136, 2003.

Zeleny M, (1987) “Management Support Systems: Towards Integrated Knowledge Management”, Human Systems Management, No. 7, 59-70, 1987.

Zhang, G.P. (2000) “Neural Networks for Classification: a Survey”, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Volume 30, Issue 4, 451-462, November 2000.

APPENDIX A – LANGUAGE RESOURCES

A.1 Penn TreeBank TagSet

The Penn Treebank Tagset (Marcus et al, 1993). Source: http://www.computing.dcu.ie/~acahill/tagset.html

CC    Coordinating conjunction (e.g. and, but, or)
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal (e.g. can, could, might, may)
NN    Noun, singular or mass
NNP   Proper noun, singular
NNPS  Proper noun, plural
NNS   Noun, plural
PDT   Predeterminer (e.g. all, both, when they precede an article)
POS   Possessive ending (e.g. nouns ending in 's)
PRP   Personal pronoun (e.g. I, me, you, he)
PRP$  Possessive pronoun (e.g. my, your, mine, yours)
RB    Adverb (most words that end in -ly, as well as degree words like quite, too and very)
RBR   Adverb, comparative (adverbs with the comparative ending -er, with a strictly comparative meaning)
RBS   Adverb, superlative
RP    Particle
SYM   Symbol (used for mathematical, scientific or technical symbols)
TO    to
UH    Interjection (e.g. uh, well, yes, my)
VB    Verb, base form (subsumes imperatives, infinitives and subjunctives)
VBD   Verb, past tense (includes the conditional form of the verb to be)
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner (e.g. which, and that when used as a relative pronoun)
WP    Wh-pronoun (e.g. what, who, whom)
WP$   Possessive wh-pronoun
WRB   Wh-adverb (e.g. how, where, why)
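The sketch below shows one way of producing these tags with the NLTK toolkit (Loper and Bird, 2002) and converting the output to the word/TAG format consumed by the negation routine in Appendix B; it assumes the NLTK tokeniser and tagger data are installed, and it is an illustration rather than the exact tagging setup used in the experiment.

import nltk

sentence = "The film was not particularly good."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # e.g. [('The', 'DT'), ('film', 'NN'), ...]

# Convert to the word/TAG representation used by the code in Appendix B.
doc = ['%s/%s' % (word, tag) for word, tag in tagged]
print(doc)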

A.2 Stop Word List

a, of, about, on, an, or, are, that, as, the, at, this, be, to, by, was, for, what, from, when, how, where, in, who, is, will, it, with, la, the
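A minimal sketch of filtering a tokenised document against this list is shown below; the function name and the example input are illustrative only.

STOP_WORDS = set(['a', 'about', 'an', 'are', 'as', 'at', 'be', 'by', 'for',
                  'from', 'how', 'in', 'is', 'it', 'la', 'of', 'on', 'or',
                  'that', 'the', 'this', 'to', 'was', 'what', 'when', 'where',
                  'who', 'will', 'with'])

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(['the', 'film', 'was', 'excellent']))   # ['film', 'excellent']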

APPENDIX B – PYTHON CODE

B.1 Negation Algorithm

#
# populates array of negated terms based on document terms
# negation[i] indicates if term in doc[i] is negated
#
def getNegationArray(doc, windowsize):

    PSEUDO = ('no increase', 'no wonder', 'no change', 'not cause',
              'not only', 'not necessarily')

    PRENEGATION = ('not', 'no', 'n\'t', 'cannot', 'declined', 'denied',
                   'denies', 'free of', 'fails to', 'no evidence', 'no new',
                   'no sign', 'no suspicious', 'no suggestion', 'rather than',
                   'with no', 'unremarkable', 'without', 'rules out',
                   'ruled out', 'rule out')

    POSNEGATION = ('unlikely', 'free', 'ruled out')

    ENDOFWINDOW = ('.', ':', ',', 'but', 'however', 'nevertheless', 'yet',
                   'though', 'although', 'still', 'aside from', 'except',
                   'apart from')

    # Initialise array: one negation flag per term in the document
    vNEG = [0 for t in range(len(doc))]

    # Initialise window counters
    winstart = 0
    winend = min(windowsize, len(doc) - 1)
    docsize = len(doc)

    found_neg_fwd = 0
    found_neg_bck = 0
    inwindow = 0

    for i in range(docsize):

        # pseudo-negation detection is evaluated per term
        found_pseudo = 0

        #
        # build 1-term and 2-term strings (terms are in word/TAG format)
        #
        unigram = doc[i].split('/')[0]
        if i < (docsize - 1):
            bigram = unigram + ' ' + doc[i + 1].split('/')[0]
        else:
            bigram = unigram

        #
        # Search for pseudo negations
        #
        for negterm in PSEUDO:
            if bigram == negterm:
                found_pseudo = 1

        if found_pseudo == 0:
            #
            # Look for pre negations (negate terms that follow)
            #
            for negterm in PRENEGATION:
                if unigram == negterm or bigram == negterm:
                    found_neg_fwd = 1

            #
            # Look for post negations (negate terms that precede)
            #
            for negterm in POSNEGATION:
                if unigram == negterm or bigram == negterm:
                    found_neg_bck = 1

        #
        # If a forward negation was found, negate terms forward up to the window size
        #
        if found_neg_fwd == 1:
            if inwindow < windowsize:
                vNEG[i] = 1
                inwindow += 1
            else:
                # out of window space
                found_neg_fwd = 0
                inwindow = 0

        #
        # Backward negation: negate terms back until the window start
        #
        if found_neg_bck == 1:
            for counter in range(max(winstart, i - windowsize), i):
                vNEG[counter] = 1
            # done with backwards negation
            found_neg_bck = 0

        #
        # Move the window when an end-of-window token is found
        #
        for negterm in ENDOFWINDOW:
            if unigram == negterm or bigram == negterm:
                # found end of negation scope, reset window state
                inwindow = 0
                found_neg_fwd = 0
                winstart = i
                winend = min(windowsize + i, len(doc) - 1)

    return vNEG
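A small illustrative example of calling the routine above is given below; the document terms and window size are hypothetical, and the expected output follows from the code as written.

# Terms are in the word/TAG format produced by a part-of-speech tagger (Appendix A).
doc = ['this/DT', 'film/NN', 'is/VBZ', 'not/RB', 'good/JJ', 'at/IN', 'all/DT']
print(getNegationArray(doc, 5))   # [0, 0, 0, 1, 1, 1, 1]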
