Topic-Oriented Sentiment Analysis on Blogs and Microblogs a Thesis
Total Page:16
File Type:pdf, Size:1020Kb
Topic-oriented Sentiment Analysis on Blogs and Microblogs A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy Zhixin Zhou M.Sc School of Science RMIT University School of Science College of Science Engineering and Health RMIT University Jun, 2016 Declaration I certify that except where due acknowledgement has been made, the work is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; any editorial work, paid or unpaid, carried out by a third party is acknowledged; and, ethics procedures and guidelines have been followed. Zhixin Zhou Jun 20, 2016 ii Acknowledgments Words cannot express my gratitude for my supervisors, Dr. Xiuzhen (Jenny) Zhang, Prof. Mark Sanderson, Dr Phil Vines and Assoc. Prof. Xiaodong Li. The completion of the thesis would never have been possible without their continuous help and guidance throughout my candidature. I am grateful to the School of Computer Science and Information Technology (now School of Science) at RMIT University for both the finantial support and such a great academic environment. My sincere thanks to all my fellow PhD students. You accompany made my life joyful. Finally I wish to thank my family for their patience and encouragement over the past four years. In particular, I wish to thank my partner, Xiaoyan Liu. At my current page of life, your love is a miracle that shines up my way. iii Credits Portions of the material in this thesis have previously appeared in the following publications: • Positive, negative, or mixed? Mining blogs for opinions [Zhang et al., 2009] • Seeing the forest from trees: Blog retrieval by aggregating post similarity scores [Zhou et al., 2010b] • Sentiment classification of blog posts using topical extracts [Zhou et al., 2012] • Sentiment Analysis on Twitter through Topic-based Lexicon Expansion[Zhou et al., 2014] Contents Abstract 1 1 Introduction 3 1.1 Sentiment Analysis in the Context of Social Media . .4 1.2 Statement of Problems . .7 1.3 Contributions . .9 1.4 Thesis Structure . 11 2 Background 12 2.1 Sentiment Analysis in General . 12 2.1.1 What is Sentiment Analysis? . 13 2.1.2 General Approaches to Sentiment Classification . 15 2.1.3 Supervised Classification Models . 17 Machine learning models . 18 2.1.4 Lexicon-based Classification Models . 19 2.2 Sentiment Analysis on Social Media . 20 iv CONTENTS v 2.2.1 Sentiment Analysis on Blogosphere . 20 2.2.2 Sentiment Analysis on Microblog . 21 2.3 Topic Dependency in Sentiment Classification . 23 2.4 Noise Reduction to Improve Sentiment Classification . 25 2.4.1 Text Summarization in a Glimpse . 26 2.4.2 Joint Topic-sentiment Models . 26 2.4.3 Classification on Extracts . 29 2.5 Blog Sentiment Retrieval . 30 3 Text Corpora 32 3.1 TREC Blog Track Collections . 32 3.2 Twitter Collection . 37 3.2.1 TREC Microblog Track 2011 . 37 3.2.2 Stanford Sentiment140 Collection . 38 3.3 Movie Review Dataset . 39 4 Topic Dependency in Sentiment Classification 40 4.1 Introduction . 40 4.2 Classification Model . 41 4.2.1 A General Sentiment Lexicon . 42 4.2.2 Support Vector Machine . 43 4.2.3 Opinion Word Extraction . 44 4.2.4 Opinion Word Vectors . 45 CONTENTS vi 4.3 Evaluation . 45 4.3.1 Topic-independent versus Topic-dependent Classification . 45 4.3.2 Experiment Set-up . 46 4.3.3 Topic-independent Opinion Classification . 47 4.3.4 Topic-dependent Opinion Classification . 47 4.3.5 Blind Test of the Classification Model . 48 4.4 Summary . 49 5 Addressing Topic Drift in Blogpost Sentiment Classification 50 5.1 Introduction . 50 5.2 Pilot Study . 53 5.2.1 Manual Annotation . 55 topicality . 55 Sentiment Orientation . 57 The Labeling Procedure . 59 Stats of the Annotated Dataset . 61 5.2.2 Classification . 65 The Supervised Word-based Approach . 66 The Voting Model . 67 The Logistic Regression Model . 68 5.2.3 Experiments and Results . 69 5.3 Automatically Extraction of Multi-facet Snippets . 73 5.3.1 Classification Models . 74 CONTENTS vii Supervised Naive Bayes Model . 75 Unsupervised Lexicon-based Models . 75 5.3.2 Automatic Extraction Methods . 76 Lexicons . 77 Extracts . 79 5.3.3 Experiment and Discussions . 82 5.3.4 Results and Discussions . 84 Extracts vs. Full text by Topics . 85 Extracts vs. Full text by Topic Categories . 90 5.3.5 Single-facet vs. Multi-facet Extracts . 94 5.3.6 Benchmarking on the Movie Review dataset . 97 5.4 Summary . 99 6 Topic-dependent Twitter Sentiment Classification 102 6.1 Introduction . 103 6.2 Emoticon-based Sentiment Lexicon Expansion . 106 6.3 Classification . 110 6.4 Experiment . 112 6.4.1 General Lexicon vs. Global Expansion . 113 6.4.2 Domain-specific Expansion vs. Global Expansion . 115 6.4.3 Representative Opinion Expressions . 122 6.5 Summary . 123 CONTENTS viii 7 Aggregating Topicality and Subjectivity for Blog Sentiment Retrieval 126 7.1 Faceted Blog Retrieval as a Re-ranking Task . 127 7.2 Motivation and Problem Definition . 128 7.3 The Blog Retrieval Framework . 131 7.4 Depth and Width of Topical Relevance . 135 7.5 The Probablistic Model . 136 7.6 Linear Pooling: A Holistic Approach . 138 7.7 Experiment . 139 7.7.1 Evaluation . 139 7.7.2 Results and Discussion . 141 7.8 Summary . 147 8 Conclusion and future work 150 8.1 Summary of Thesis . 150 8.2 Future Work . 152 A 150-topic Collection for the Blog Posts 154 Bibliography 164 List of Figures 3.1 The blog searching behavior . 34 4.1 Classification accuracy: Topic-independent vs. topic-independent . 48 5.1 The Labeling Interface . 59 5.2 Negative Documents in Topic 869 . 63 5.3 Positive Documents in Topic 869 . 63 5.4 Negative Documents in Topic 903 . 63 5.5 Positive Documents in Topic 903 . 63 5.6 Positions of Subjective Sentences in Topic 869 . 64 5.7 Positions of Subjective Sentences in Topic 903 . 64 5.8 Classification on Topic 869 . 69 5.9 Classification on Topic 903 . 69 5.10 Evaluation Framework . 83 5.11 Classification with Naive Bayes Multinomial - 150-topic Collection . 87 5.12 Classification with Naive Bayes Multinomial - Balanced Collection . 88 ix LIST OF FIGURES x 5.13 Classification with SOCAL - Balanced Collection . 91 5.14 Five Most Improved Topic Categories with SOCAL on Extracts . 93 5.15 Improved Topic Categories with Naive Bayes Multinomial on Extracts . 96 5.16 Classification on the Movie Review Dataset . 99 6.1 Similarity between the expanded terms from different topics . 117 6.2 Class distribution and Classification Performance . 118 7.1 Precision of Faceted Distillation . 129 7.2 The Blog Retrieval Framework . 132 7.3 Overview . ..