People-Centric Natural Language Processing
Total Page:16
File Type:pdf, Size:1020Kb
People-Centric Natural Language Processing David Bamman CMU-LTI-15-007 Language Technologies Institute School of Computer Science Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213 www.lti.cs.cmu.edu Thesis committee Noah Smith (chair), Carnegie Mellon University Justine Cassell, Carnegie Mellon University Tom Mitchell, Carnegie Mellon University Jacob Eisenstein, Georgia Institute of Technology Ted Underwood, University of Illinois Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy In Language and Information Technologies © David Bamman, 2015 Contents 1 Introduction 3 1.1 Structure of this thesis . .5 1.1.1 Variation in content . .5 1.1.2 Variation in author and audience . .6 1.2 Evaluation . .6 1.3 Thesis statement . .8 2 Methods 9 2.1 Probabilistic graphical models . .9 2.2 Linguistic structure . 12 2.3 Conditioning on metadata . 13 2.4 Notation in this thesis . 15 I Variation in content 16 3 Learning personas in movies 18 3.1 Introduction . 18 3.2 Data . 19 3.2.1 Text . 19 3.2.2 Metadata . 20 3.3 Personas . 21 3.4 Models . 22 3.4.1 Dirichlet Persona Model . 22 3.4.2 Persona Regression . 24 3.5 Evaluation . 25 3.5.1 Character Names . 25 i CONTENTS ii 3.5.2 TV Tropes . 25 3.5.3 Variation of Information . 26 3.5.4 Purity . 27 3.6 Exploratory Data Analysis . 28 3.7 Conclusion and Future Work . 29 4 Learning personas in books 33 4.1 Introduction . 33 4.2 Literary Background . 34 4.3 Data . 35 4.3.1 Character Clustering . 36 4.3.2 Pronominal Coreference Resolution . 37 4.3.3 Dimensionality Reduction . 38 4.4 Model . 38 4.4.1 Hierarchical Softmax . 39 4.4.2 Inference . 42 4.5 Evaluation . 42 4.6 Experiments . 44 4.7 Analysis . 45 4.8 Conclusion . 47 5 Learning events through people 49 5.1 Introduction . 49 5.2 Data . 50 5.3 Model . 52 5.4 Evaluation . 56 5.5 Analysis . 58 5.6 Additional Analyses . 61 5.6.1 Correlations among events . 62 5.6.2 Historical distribution of events . 62 5.7 Related Work . 63 5.8 Conclusion . 65 CONTENTS iii II Variation in the author and audience 66 6 Learning ideal points of propositions through people 70 6.1 Introduction . 70 6.2 Task and Data . 71 6.2.1 Data . 72 6.2.2 Extracting Propositions . 73 6.2.3 Human Benchmark . 74 6.3 Models . 75 6.3.1 Additive Model . 75 6.3.2 Single Membership Model . 77 6.4 Comparison . 79 6.4.1 Principal Component Analysis . 79 6.4.2 `2-Regularized Logistic Regression . 79 6.4.3 Co-Training . 80 6.5 Evaluation . 80 6.6 Convergent Validity . 83 6.6.1 Correlation with Self-declarations . 83 6.6.2 Estimating Media Audience . 84 6.7 Conclusion . 84 7 Learning word representations through people 86 7.1 Introduction . 86 7.2 Model . 87 7.3 Evaluation . 90 7.3.1 Qualitative Evaluation . 90 7.3.2 Quantitative Evaluation . 91 7.4 Conclusion . 94 8 Improving sarcasm detection with situated features 95 8.1 Introduction . 95 8.2 Data . 97 8.3 Experimental Setup . 97 8.4 Features . 98 8.4.1 Tweet Features . 98 8.4.2 Author Features . 99 CONTENTS 1 8.4.3 Audience Features . 100 8.4.4 Environment Features . 101 8.5 Evaluation . 101 8.6 Analysis . 101 8.7 Conclusion . 102 9 Conclusion 104 9.1 Summary of contributions . 104 9.2 Future directions . 105 9.2.1 Richer models . 105 9.2.2 Supervision . 106 9.2.3 Fine-grained opinion mining . 106 9.2.4 Stereotyping . 107 9.2.5 Richer context . 107 10 Bibliography 108 Acknowledgments The work presented in this thesis addresses the complexity of the interaction between peo- ple and text. The words on these pages naturally hold a similarly complex relationship; while I am their author, I count my words lucky to be influenced by many people: most im- mediately, my advisor, Noah Smith, and my collaborators Justine Cassell, Chris Dyer, Jacob Eisenstein, Tom Mitchell, Brendan O’Connor, Tyler Schnoebelen and Ted Underwood. In my time at Carnegie Mellon, you have each shaped my thinking profoundly; the work in this thesis has benefited not only from the clarity and depth of your thoughts on it, but also in the interests and ideals you have each inspired in me. There are many others whose influence I am grateful to acknowledge: my colleagues Waleed Ammar,.