Content Modeling for Social Media Text Author
Total Page:16
File Type:pdf, Size:1020Kb
Content Modeling for Social Media Text by Christina Sauper Submitted to the Department of Electrical Engineering and Computer Science ARCHIVES in partial fulfillment of the requirements for the degree of MAss TTUE Doctor of Philosophy JUL 01 2012 at the '-lLiBRRIES MASSACHUSETTS INSTITUTE OF TECHNOLOGY - June 2012 © Massachusetts Institute of Technology 2012. All rights reserved. Author ................ ................... Department of 4 lectrical Engineering and Computer Science May 14, 2012 Certified by........... ...... .... Rg.. Bar...i.. Regina Barzilay Associate Professor, Electrical Engineering and Computer Science Thesis Supervisor Accepted by A.... cce .. ted. y . ............. .... J - Leslie A. Kolodziejski Chairman, Department Committee on Graduate Theses Content Modeling for Social Media Text by Christina Sauper Submitted to the Department of Electrical Engineering and Computer Science on May 14, 2012, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract This thesis focuses on machine learning methods for extracting information from user- generated content. Instances of this data such as product and restaurant reviews have become increasingly valuable and influential in daily decision making. In this work, I consider a range of extraction tasks such as sentiment analysis and aspect-based review aggregation. These tasks have been well studied in the context of newswire documents, but the informal and colloquial nature of social media poses significant new challenges. The key idea behind our approach is to automatically induce the content structure of individual documents given a large, noisy collection of user-generated content. This structure enables us to model the connection between individual documents and effectively aggregate their content. The models I propose demonstrate that content structure can be utilized at both document and phrase level to aid in standard text analysis tasks. At the document level, I capture this idea by joining the original task features with global contextual information. The coupling of the content model and the task-specific model allows the two components to mutually influence each other during learning. At the phrase level, I utilize a generative Bayesian topic model where a set of properties and corresponding attribute tendencies are represented as hidden variables. The model explains how the observed text arises from the latent variables, thereby connecting text fragments with corresponding properties and attributes. Thesis Supervisor: Regina Barzilay Title: Associate Professor, Electrical Engineering and Computer Science 2 Acknowledgments I have been extremely fortunate throughout this journey to have the support of many wonderful people. It is only with their help and encouragement that this thesis has come into existence. First and foremost, I would like to thank my advisor, Regina Barzilay. She has been an incredible source of knowledge and inspiration; she has a sharp insight about interesting directions of research, and she is a strong role model in a field in which women are still a minority. Her high standards and excellent guidance have shaped my work into something I can be truly proud of. My thesis committee members, Peter Szolovits and James Glass, have also provided invaluable discussion as to the considerations of this work applied to their respective fields, analysis of medical data and speech processing. Working with my coauthor, Aria Haghighi, was an intensely focused and amazing time; he has been a great friend and collaborator to me, and I will certainly miss the late-night coding with Tosci's ice cream. I am very grateful to Alan Woolf, Michelle Zeager, Bill Long, and the other members of the Fairwitness group for helping to make my introduction to the medical world as smooth as possible. My fellow students at MIT are a diverse and fantastic group of people: my group- mates from the Natural Language Processing group, including Ted Benson, Branavan, Harr Chen, Jacob Eisenstein, Yoong Keok Lee, Tahira Naseem, and Ben Snyder; my officemates, including David Sontag and Paresh Malalur; and my friends from GSB and other areas of CSAIL. They have always been there for good company and stim- ulating discussions, both related to research and not. I am thankful to all, and I wish them the best for their future endeavors. Finally, I dedicate this thesis to my family. They have stood by me through good times and bad, and it is through their loving support and their neverending faith in me that I have made it this far. 3 Contents 1 Introduction 12 1.1 Modeling content structure for text analysis tasks . ... .. ... .. 18 1.2 Modeling relation structure for informative aggregation 20 1.3 Contributions .. .. ... .. ... .. ... .. .. .. 22 1.4 O utline .. ... .... ... ... .... ... .... 23 2 Modeling content structure for text analysis tasks 25 2.1 Introduction . .... ..... .... ..... ... .. .. .. .. 26 2.2 Related work ... ... .... ... .... ... .. .. .. 28 2.2.1 Task-specific models for relevance . .. ... .. .. .. .. 28 2.2.2 Topic modeling ... .. ... ... .. ... .. .. 29 2.2.3 Discourse models of document structure .. .. .. 32 2.3 M odel .. ... .... ... .... ... .... .. .. .. .. 33 2.3.1 Multi-aspect phrase extraction . ... ... .. .. .. 34 2.3.1.1 Problem Formulation . .. ... .. .. 34 2.3.1.2 Model overview .. .. .. .. .. 36 2.3.1.3 Learning .. ..... .. .. ... .. 38 2.3.1.4 Inference . ..... ... .. .. .. 40 4 2.3.1.5 Leveraging unannotated data . 41 2.3.2 Multi-aspect sentiment analysis .. .. .. .. ... .. .. .. 4 2 2.3.2.1 Problem Formulation . ... .. .. .. .. .. .. 4 2 2.3.2.2 M odel .. .. .. .. .. .. .. .. .. .. .. 4 3 2.3.2.3 Learning .. .. .. .. .. .. .. .. .. .. 4 3 2.3.2.4 Inference .. .. .. .. .. .. .. .. ... .. 4 4 2.4 Data sets .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. 4 4 2.4.1 Multi-aspect sentiment analysis . .. .. .. .. .. .. 4 5 2.4.2 Multi-aspect summarization .. .. .. .. .. .. .. 4 7 2.5 Experiments .. .. .. .. .. .. .. .. .. .. .. .. .. 4 g 2.5.1 Multi-aspect phrase extraction . .. .. .. .. .. 5 0 2.5.1.1 Baselines .. ... .. .. .. .. .. .. .. 5 0 2.5.1.2 Evaluation metrics .. .. .. .. .. .. .. 5 1 2.5.1.3 Results .. .. .. .. .. .. .. .. .. 5 2 2.5.2 Multi-aspect sentiment analysis . .. .. .. .. .. 5 5 2.5.2.1 Baselines . .. .. .. .. .. .. .. .. 5 6 2.5.2.2 Evaluation metrics . .. .. .. .. .. 5 6 2.5.2.3 Results ... .... .. .. .. .. .. 56 2.6 Conclusion .. .. ... ... ... .. .. .. .. .. 57 3 Modeling relation structure for informative aggregation 59 3.1 Introduction .. .. .. .. .. .. .. .. .. .. .. .. .. 60 3.2 Related work . .. .. .. .. .. .. .. .. .. .. .. .. 65 3.2.1 Single-aspect sentiment analysis . .. .. .. .. .. .. 65 3.2.2 Aspect-based sentiment analysis .. ... .. .. .. .. .. 66 3.2.2.1 Data-mining and fixed-aspect techniques for sentiment analysis .. .. .. .. .. .. .. .. .. .. .. 66 3.2.2.2 Multi-document summarization and its application to sentiment analysis . .. .. ... .. .. ... .. .. 68 3.2.2.3 Probabilistic topic modeling for sentiment analysis . 70 5 3.3 Problem formulation .. .. ... .. ... .. ... .. 73 3.3.1 Model components ... .. .. .. ... .. .. .. 73 3.3.2 Problem setup. ... ... ... .. ... .. .. .. .. .. 75 3.4 M odel ... .. ... .. ... .. ... .. ... .. .. .. 76 3.4.1 General formulation . .. .. .. ... .. .. .. .. .. 76 3.4.2 Model extensions ... .. .. .. ... .. .. .. .. .. 82 3.5 Inference. .. ... .. ... .. ... .. ... .. .. .. .. 84 3.5.1 Inference for model extensions . .. ... .. .. .. .. .. 90 3.6 Experiments . ... ... ... ... ... ... ... .. .. .. .. 92 3.6.1 Joint identification of aspect and sentiment . .. .. .. .. 93 3.6.1.1 Data set . ... .. .. .. ... .. .. .. .. .. 93 3.6.1.2 Domain challenges and modeling techniques ..... 94 3.6.1.3 Cluster prediction ... ... ... .. .. .. .. 95 3.6.1.4 Sentiment analysis .. ... .. .. .. .. .. .. 97 3.6.1.5 Per-word labeling accuracy .. .. .. .. 100 3.6.2 Aspect identification with shared aspects .. .. .. .. 103 3.6.2.1 D ata set . .. .. ... .. .. .. .. .. .. 104 3.6.2.2 Domain challenges and modeling te chniques ... .. 104 3.6.2.3 Cluster prediction .. .. .. .. .. .. 105 3.7 Conclusion .. .. .. .. ... .. .. .. ... .. .. .. ... .. 107 4 Conclusion 109 4.1 Future Work. .. ... .. .. .. ... .. .. .. ... .. .. .. 110 A Condensr 113 A.1 System implementation. .. .. ... .. .. .. ... .. .. .. .. 113 A.2 Examples . .. ... .. .. .. ... .. .. .. ... .. .. .. .. 120 6 List of Figures 1-1 A comparison of potential content models for summarization and sen- tim ent analysis ... .. ... .. ... ... .. ... .. ... .. .. 14 (a) A content model for summarization focused on identifying topics in text. ....... ............................... 14 (b) A content model for sentiment analysis focused on identifying sentiment-bearing sentences in text. .. ... .. ... ... .. 14 1-2 Three reviews about the same restaurant discussing different aspects and opinions. .. .. .. .. .. .. .. ... .. .. .. .. .. .. 16 1-3 An excerpt from a DVD review demonstrating ambiguity which can be resolved through knowledge of structure. .. .. .. .. .. ... .. 17 1-4 Sample text analysis tasks to which document structure is introduced. 19 (a) Example system input and output on the multi-aspect sentiment analysis task ... .... ... .... .... ... .... ... 19 (b) Example system input and output on the multi-aspect phrase extraction dom ain .. ... .... ... .... .... ... 19 1-5 An example of the desired input and output of our system in the restau- rant domain, both at corpus-level and at snippet-level. ..