People-Centric Natural Language Processing
David Bamman
CMU-LTI-15-007
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu
Thesis committee
Noah Smith (chair), Carnegie Mellon University
Justine Cassell, Carnegie Mellon University
Tom Mitchell, Carnegie Mellon University
Jacob Eisenstein, Georgia Institute of Technology
Ted Underwood, University of Illinois
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

© David Bamman, 2015

Contents
1 Introduction
  1.1 Structure of this thesis
    1.1.1 Variation in content
    1.1.2 Variation in author and audience
  1.2 Evaluation
  1.3 Thesis statement

2 Methods
  2.1 Probabilistic graphical models
  2.2 Linguistic structure
  2.3 Conditioning on metadata
  2.4 Notation in this thesis

I Variation in content

3 Learning personas in movies
  3.1 Introduction
  3.2 Data
    3.2.1 Text
    3.2.2 Metadata
  3.3 Personas
  3.4 Models
    3.4.1 Dirichlet Persona Model
    3.4.2 Persona Regression
  3.5 Evaluation
    3.5.1 Character Names
    3.5.2 TV Tropes
    3.5.3 Variation of Information
    3.5.4 Purity
  3.6 Exploratory Data Analysis
  3.7 Conclusion and Future Work

4 Learning personas in books
  4.1 Introduction
  4.2 Literary Background
  4.3 Data
    4.3.1 Character Clustering
    4.3.2 Pronominal Coreference Resolution
    4.3.3 Dimensionality Reduction
  4.4 Model
    4.4.1 Hierarchical Softmax
    4.4.2 Inference
  4.5 Evaluation
  4.6 Experiments
  4.7 Analysis
  4.8 Conclusion

5 Learning events through people
  5.1 Introduction
  5.2 Data
  5.3 Model
  5.4 Evaluation
  5.5 Analysis
  5.6 Additional Analyses
    5.6.1 Correlations among events
    5.6.2 Historical distribution of events
  5.7 Related Work
  5.8 Conclusion

II Variation in the author and audience

6 Learning ideal points of propositions through people
  6.1 Introduction
  6.2 Task and Data
    6.2.1 Data
    6.2.2 Extracting Propositions
    6.2.3 Human Benchmark
  6.3 Models
    6.3.1 Additive Model
    6.3.2 Single Membership Model
  6.4 Comparison
    6.4.1 Principal Component Analysis
    6.4.2 ℓ2-Regularized Logistic Regression
    6.4.3 Co-Training
  6.5 Evaluation
  6.6 Convergent Validity
    6.6.1 Correlation with Self-declarations
    6.6.2 Estimating Media Audience
  6.7 Conclusion

7 Learning word representations through people
  7.1 Introduction
  7.2 Model
  7.3 Evaluation
    7.3.1 Qualitative Evaluation
    7.3.2 Quantitative Evaluation
  7.4 Conclusion

8 Improving sarcasm detection with situated features
  8.1 Introduction
  8.2 Data
  8.3 Experimental Setup
  8.4 Features
    8.4.1 Tweet Features
    8.4.2 Author Features
    8.4.3 Audience Features
    8.4.4 Environment Features
  8.5 Evaluation
  8.6 Analysis
  8.7 Conclusion

9 Conclusion
  9.1 Summary of contributions
  9.2 Future directions
    9.2.1 Richer models
    9.2.2 Supervision
    9.2.3 Fine-grained opinion mining
    9.2.4 Stereotyping
    9.2.5 Richer context

10 Bibliography

Acknowledgments
The work presented in this thesis addresses the complexity of the interaction between people and text. The words on these pages naturally hold a similarly complex relationship; while I am their author, I count my words lucky to be influenced by many people: most immediately, my advisor, Noah Smith, and my collaborators Justine Cassell, Chris Dyer, Jacob Eisenstein, Tom Mitchell, Brendan O'Connor, Tyler Schnoebelen and Ted Underwood. In my time at Carnegie Mellon, you have each shaped my thinking profoundly; the work in this thesis has benefited not only from the clarity and depth of your thoughts on it, but also from the interests and ideals you have each inspired in me. There are many others whose influence I am grateful to acknowledge: my colleagues Waleed Ammar, Adam Anderson, Dallas Card, Dipanjan Das, Jesse Dodge, Jeff Flanigan, Alona Fyshe, Kevin Gimpel, Dirk Hovy, Dan Jurafsky, David Kaufer, Charles Kemp, Lingpeng Kong, Fei Liu, Rohan Ramanath, Alan Ritter, Carolyn Rosé, Bryan Routledge, Nathan Schneider, Cosma Shalizi, Yanchuan Sim, Swabha Swayamdipta, Sam Thomson, Chris Warren, Tae Yano and Dani Yogatama; and those who have set me on this path and sustained me there—Greg Crane, David Smith and David Mimno especially (who have not only blazed the trail from Classics to computer science that I've followed, but continue to inspire by pushing it further beyond). Above all, I am fortunate to be influenced by my family—Leo, you've only witnessed the last few months of this process, but I have written many of these words with you sleeping on my chest. And most of all, Kate: you have been a constant source of support throughout these years, and I am lucky to have you by my side. This thesis would simply not exist without you.
The material in this thesis was supported by U.S. National Science Foundation grants IIS-0915187, IIS-1251131, IIS-1211277 and CAREER IIS-1054319, Google's support of the Reading is Believing project at CMU, the ARCS Foundation, and IARPA via DoI/NBC contract number D12PC00337. This work was made possible through the use of computing resources made available by the Pittsburgh Supercomputing Center and the Open Science Data Cloud (OSDC), an Open Cloud Consortium (OCC)-sponsored project.
Chapter 1
Introduction
The written text that we interact with on an everyday basis—news articles, emails, social media, books—is the product of a profoundly social phenomenon with people at its core. With few exceptions, all of the text we see is written by people, and others constitute its audience. A vast amount of the content itself is centered on people: news (including classic NLP corpora such as the Wall Street Journal) details the roles of actors in current events, social media (including Twitter and Facebook) documents the actions and attitudes of friends, and books chronicle the stories of fictional characters and real people alike. Robust text analysis methods provide one way to understand or synthesize this volume of text without reading all of it; commercial and popular successes like IBM's Watson and Apple's Siri hinge on robust computational models of naturally occurring data. Computational models for linguistic analysis to date have largely focused on events as the organizing concept for representing text meaning. This is evident in many of the major trends in computational semantic analysis: frame semantics and semantic role labeling (Gildea and Jurafsky, 2002; Palmer et al., 2005; Das et al., 2010); information extraction into structured databases (Hobbs et al., 1993; Banko et al., 2007; Mitchell et al., 2015); and semantic parsing models based on truth-conditional semantics (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005). In such methods, what happens (or is true) is central, and who is involved is represented by a string, perhaps with a type,[1] and in a few cases by an identifier linking into a database. Considerable work has led to advances in resolving coreference of those strings (Bagga and Baldwin, 1998; Haghighi and Klein, 2009; Raghunathan et al., 2010, among others), and into resolving them to catalogs of real-world entities such as Freebase or Wikipedia
[1] For example, Person, Organization, Location, and Miscellaneous comprise a nominal taxonomy widely used in named entity recognition; knowledge bases also implicitly perform fine-grained typing when asserting IS-A relationships among people.
(Bunescu and Pasca, 2006; Cucerzan, 2007). People, however, are not entries in relational databases. Their attributes, motivations, intentions, etc., cannot be stated with perfect objectivity, and so meaningful descriptions of who a person is must be qualified by the source of the description: who authors the description, as well as their attributes, motivations and intentions. This thesis explores a new approach to modeling and processing natural language that shifts the primitives of linguistic analysis from events to people, in anticipation of more "socially aware" language technologies. In this thesis, I build NLP around people instead of events, developing methods that consider the interaction of author, audience and content in text analysis. In this work, I adopt a statistical point of view and consider the text we observe to be a random variable whose realization depends on several, often interacting, factors. When we read a sentence in the New York Times such as Obama signed the legislation, the action denoted by signing legislation is heavily dependent on the fact that the syntactic subject (and semantic agent) is Obama, a POLITICIAN. If Obama were a FIREFIGHTER instead, we would be much less likely to observe the phrase signed legislation with him, and more likely to observe put out the fire. These are not categorical decisions (we cannot rule out a firefighter signing legislation, or a president extinguishing a fire) but rather statistical regularities in the actions, attributes, and—at the most fundamental level—words associated with people. One source of variation we observe in text can in part be explained by knowing something of the category to which an individual described belongs. A second source of variation in this random variable of text is due to the author.
While the actions associated with Obama become more predictable once we know he is a POLITICIAN, the choice that individuals make in declaring that Obama is a great president as opposed to Obama is a dictator is of course dependent on their own political beliefs. If we know those beliefs, or can estimate them (for example, by observing that a sentence originates from Fox News or MSNBC), then we can, in part, offer an explanation for that variability. A third source of variation in the text we see is due to its audience. Politicians are not the only individuals who change their message and even their regional accents (NPR, 2015) in response to a changing audience. While formal publications like the New York Times and the Wall Street Journal write with their readership in mind, so too do everyday users on Twitter, Facebook and other social media sites, often tailoring the content of their text to a specific audience (Goffman, 1959; Bell, 1984). Knowing something about who that audience is, and how they change, can help explain the variation we see. Seeing text as a random variable gives us the machinery we need to reason about the mechanisms that govern it. While we may not directly observe the fact that Obama is a POLITICIAN or be able to measure the political beliefs of authors, we can reason about them by virtue of the statistical regularities they admit. These regularities between the text we observe and the people in each of their roles as content, author and audience constitute the subject of this thesis.
1.1 Structure of this thesis
This thesis is structured around each of the axes on which people interact with text. Each chapter generally follows the same structure: I propose how a corpus of textual data can be used to tell us something about people in one of these roles; I build a new probabilistic model of that data that specifies the relationships between the observed evidence and the latent structure we are trying to learn; and I evaluate the fitness of that model in comparison to others. This work broadly falls into two main classes: exploring variation in how people are represented as content within text, and exploring variation in text as a function of variation in the authors and audience outside of it.
1.1.1 Variation in content
The first source of variation comes in how people are depicted in text. Chapters 3 and 4 consider representations of people through the lens of fine-grained entity types or personas, such as HERO, VILLAIN, VEGETARIAN, MUSICIAN and FIREFIGHTER. Modeling personas has the potential to tap into humans' natural tendencies to abstract and generalize about each other and our relationships and also—perhaps more importantly—to help bring those tendencies to light, supporting both literary studies and social-scientific research that uses text as data. In these chapters, the categories of variation are latent, and inferred through statistical regularities in the kinds of actions that people are described as performing in text; the latent structure of a persona is a generative variable that helps explain the data we see—bundles of actions bound together by people. Chapter 5 leverages people as they are depicted in text as an organizing principle for a downstream application: learning event classes such as BIRTH, GRADUATING HIGH SCHOOL and MARRIAGE from biographies on Wikipedia, using similar statistical machinery as for persona inference above. This work also occasions social insight: though it is known that women are greatly underrepresented on Wikipedia—not only as editors (Wikipedia, 2011) but also as subjects of articles (Reagle and Rhue, 2011)—I find that there is a bias in their characterization as well, with biographies of women containing significantly more emphasis on events of marriage and divorce than biographies of men.
1.1.2 Variation in author and audience
The second source of variation we see reflected in text is due to authorship: different authors, with different attributes—such as gender, age, political preference—shape their text in ways that are predictable as a function of this variation. Indeed, this predictability has inspired a rich literature on inferring latent user attributes from the text they write (Herring and Paolillo, 2006; Koppel et al., 2006; Argamon et al., 2007; Mukherjee and Liu, 2010; Rosenthal and McKeown, 2011; Rao et al., 2010; Golbeck et al., 2011; Burger et al., 2011; Pennacchiotti and Popescu, 2011; Conover et al., 2011; Volkova et al., 2014). In chapters 6 and 7, I leverage this variation for two downstream tasks: a.) estimating the political orientation of fine-grained opinions like OBAMA IS A SOCIALIST; and b.) improving word embeddings by making them sensitive to local geographical meanings of words (such as the different senses of wicked in Kansas and Massachusetts). In this work, I consider variation that is observed (such as the geographical location of authors), and variation that is unobserved (the latent political leanings of authors) but still inferable through the statistical regularities of the data we see—bundles of propositions bound together by the people who assert them. The third and last source of variation we see reflected in text is due to that of the audience: the same author can adapt their message in different ways as a function of who they're addressing—whether through the mechanism of self-presentation (Goffman, 1959), audience design (Bell, 1984) or the indexicality of linguistic variation (Eckert, 2008; Johnstone and Kiesling, 2008).
In chapter 8, I consider this kind of audience-based variation (in addition to authorial variation) through a case study of sarcasm: while most approaches to detecting this richly contextual phenomenon rely on lexical indicators (such as interjections and intensifiers) and other linguistic markers (such as nonveridicality and hyperbole), I consider information from a variety of sources, including not simply the content of the message, but properties of the interaction between an author and audience, leveraging Kreuz’s (1996) “principle of inferability”—that speakers only use sarcasm if they can be sure it will be understood by the audience. I find that adding any kind of contextual information—information about the author, the recipient of the message, or their interaction—can help in predictive performance for this complex task.
1.2 Evaluation
Many tasks in natural language processing have well-established standard metrics for evaluation: part-of-speech tagging and phrase-structure syntactic parsing have a standard human-created reference corpus in the Penn Treebank (Marcus et al., 1993); other tasks like coreference resolution have set evaluation corpora such as OntoNotes (Hovy et al., 2006) but multiple competing evaluation metrics, such as B³ (Bagga and Baldwin, 1998), MUC (Vilain et al., 1995) and CEAF (Luo, 2005); still others—such as machine translation or summarization—have standard metrics but no fixed reference corpus. In this thesis, I am often proposing new lines of research into questions that do not have pre-existing benchmarks for evaluation. Many of the sections below present models, often unsupervised, for exploratory data analysis. This requires constant assessment of how those models should be evaluated. While perplexity (a measure of how likely a sample of data is under a model) has been used for task-agnostic evaluation of unsupervised models in the past, recent work has shown that it is often uncorrelated with the ultimate goal of a model (or worse, at odds with it), such as the human interpretability of topics in topic models (Chang et al., 2009) or out-of-sample word error rate in speech recognition (Iyer et al., 1997; Chen et al., 1998). Part of the contribution this thesis makes is in giving careful assessment to what constitutes a good evaluation for each approach. For each section in this thesis, I evaluate the models directly on real predictive tasks. In all cases, these evaluations judge the fitness of a model on its performance against a human standard or in a predictive task with gold standard data.
• Chapter 3 compares the performance of models on the task of measuring the overlap between automatically inferred clusters of fictional characters and human-created clusters.
• Chapter 4 compares different model predictions of a priori hypotheses about the similarity between fictional characters to human expert opinions.
• Chapter 5 compares models’ performance at predicting the time in a person’s life when an event described in text takes place (leveraging a model’s belief of its latent event class to do so).
• Chapter 6 compares the model predictions of the political import of propositions like OBAMA IS A SOCIALIST (learned from estimates of political forum users' latent political positions) to human judgments of the same.
• Chapter 7 compares the performance of different models on the predictive task of geographically-influenced term similarity.
• Chapter 8 compares the performance of different models on the predictive task of sarcasm detection in social media text.
In designing evaluations specific to each proposed task, one goal is to allow a range of other methods to be more directly comparable to each other. While assessments of perplexity are limited to generative models that assign probabilities to observations, the evaluations proposed throughout this thesis can equally be applied to discriminative models and to models that are not inherently probabilistic at all. Where applicable, I also release evaluation data for others to compare to.
1.3 Thesis statement
In this thesis, I advocate for a model of text analysis that focuses on people, leveraging ideas from machine learning, the humanities and the social sciences. People intersect with text in multiple ways: they are its authors, its audience, and often the subjects of its content. I argue that developing computational models that capture the complexity of their interaction will yield deeper, socio-culturally relevant descriptions of these actors, and that these deeper representations will open the door to new NLP and machine learning applications that have a more useful understanding of the world. I explore this perspective by designing, implementing and evaluating computational models of three kinds: a.) unsupervised models of personas, which capture patterns of identity and behavior in the description of people as the content of text; b.) unsupervised models of author variation, which capture patterns in how latent and observed qualities of the author influence the text we see; and c.) models of audience variation, which capture patterns in how variation in the audience can influence the text we see. Each of these research fronts captures one dimension of how people interact with each other as mediated through text. Together, these three axes define a coordinate system for investigating written language in its socially embedded context. At a large scale, this thesis illustrates how organizing data around people and reasoning about the subtleties of their interaction with text can both generate new social insight and improve performance on practical tasks.

Chapter 2
Methods
2.1 Probabilistic graphical models
In viewing text as a random variable whose realization is dependent on many factors, the work presented in this thesis largely exploits the machinery of probabilistic graphical models, often in an unsupervised setting where our quantity of interest is never observed. Graphical models provide a powerful computational framework; by clearly delineating the exact relationships between all of the variables we consider (which include both observed data and presumed hidden structure), we clearly articulate our statistical assumptions and have access to a wide range of established inference techniques, including variational methods (Jordan et al., 1999) and Markov chain Monte Carlo (MCMC) techniques like Gibbs sampling (Geman and Geman, 1984; Casella and George, 1992; Griffiths and Steyvers, 2004) and Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970). For a broad overview of graphical models, see Koller and Friedman (2009). Text analysis using probabilistic generative models dates at least to Mosteller and Wallace (1963), who model document counts in the Federalist Papers as draws from a Poisson or negative binomial distribution in order to determine authorship. More recently, unsupervised models have become popular; topic models (Blei et al., 2003) treat documents as multinomial mixtures of latent topics (one per word), and have spawned a thriving industry: direct descendants of this work include models that add correlation among topics (Blei and Lafferty, 2006a), network structure (Chang and Blei, 2009), supervision (Mcauliffe and Blei, 2008; Ramage et al., 2010), time (Wang and McCallum, 2006) and sequence (Griffiths et al., 2005; Gruber et al., 2007). Topic models have been useful for organizing large digital libraries (Mimno and McCallum, 2007), tracing the history of a discipline through its journals (Goldstone and Underwood, 2014), identifying themes in 19th-century literature (Jockers
and Mimno, 2013) and characterizing the agendas of politicians through their press releases (Grimmer, 2010). The core idea behind graphical models is very simple: they articulate the relationships among variables, and specify, in their structure, the conditions under which one variable is able to exert influence on another. The art of designing models includes defining the variables of interest—both observational data (such as text and metadata) and latent variables (entity types, event classes, political beliefs)—and positing their conditional dependencies in such a way as to allow interesting structure to be discovered and testable hypotheses to emerge. As an example, figure 2.1 illustrates one of the simplest models described in this thesis: a Bayesian model for learning a set of latent event classes from words and timestamps divided among people—in this particular case, biographical subjects on Wikipedia (see §5 for more detail). Throughout this thesis, I use the standard convention of representing variables as circles (observed variables are shaded, latent variables are clear, and those that are collapsed out during inference are dotted) and plates to denote multiple variables. Arrows indicate direct statistical dependence. In this particular model, the observed data we see are sets of "events"—defined as a set of terms w (such as the individual words in Barack Obama became president at the age of) and an observed timestamp t (48). What we don't see are the things we care about: a latent event class indicator e for each event (e.g., BECOMING PRESIDENT), a distribution η over these latent event classes for a particular person (for Barack Obama, this might include BECOMING PRESIDENT, SERVING AS SENATOR, BIRTH, etc.), along with a characterization φ of each event class in terms of the words that are the most likely to appear in describing it.
In this case, we mediate the relationship between all of the words in an event and the timestamp that appears with them through the latent structure of an event class. While we never directly see that Barack Obama became president at the age of 48 is an instance of a BECOMING PRESIDENT event (which also appears in the biographies of Ronald Reagan, Theodore Roosevelt, François Hollande and others), knowing that it denotes such an event makes all of the observed data we see much more likely. What we seek in unsupervised models are the values of the latent variables we don't see that maximize the likelihood of the data we do. Throughout this work, I rely primarily on the inference technique of Gibbs sampling (Geman and Geman, 1984), a Markov chain Monte Carlo method for approximating the joint distribution of a set of variables (both observed and latent) by sequentially sampling them. Gibbs sampling is an iterative technique in which the value of all unobserved variables
[Plate diagram omitted: latent variables e and collapsed variables η, φ, with observed t and w; priors α, γ and parameters µ, σ²; plates over W words, E events and P people.]

Var    Distribution   Description
e      ~ Cat          event class
t      ~ Norm         timestamp of event
w      ~ Cat          term in event
η      ~ Dir          subject's distribution over event classes
φ      ~ Dir          event class distribution over terms
µ      parameter      event class mean
σ      parameter      event class standard deviation
α, γ   parameter      Dirichlet concentration
W      observed       number of words for event e
E      observed       number of events for person p
P      observed       number of people

Figure 2.1: Top: Example model. Bottom: Definition of variables.
in a model are sampled in sequence conditioned on the current values of all the others. Due to the conditional independencies that are inherent in the structure of the model, the conditional probability for a variable X is only dependent on the values of variables in its Markov blanket (Pearl, 1988)—the set of variables in the neighborhood of X that, when conditioned upon, make X independent of all other variables in the model; in the example above, a sample for a certain event class e is only dependent on the current samples of its associated timestamp t, the words associated with it w, the event distribution η and the event class characterization φ. All other words and timestamps associated with any other event description are independent of this variable when the others are conditioned upon (and thereby have no effect on the probability of any outcome). Throughout this work, I use
collapsed Gibbs sampling (Griffiths and Steyvers, 2004), where sets of variables are integrated out of the model completely (as with φ and η above), yielding a different Markov blanket but the same principle: in practice, where Y is the set of variables that define the Markov blanket of X, this involves calculating the conditional probability of X from the full joint probability P(X, Y), and then drawing a sample from that conditional distribution. In the case of uncollapsed Gibbs sampling for the model above, this reduces to an application of Bayes' rule:
P(e \mid \eta, t, \mathbf{w}, \phi, \mu, \sigma^2) = \frac{P(e \mid \eta)\, P(t \mid e, \mu, \sigma^2)\, \prod_{w \in \mathbf{w}} P(w \mid e, \phi)}{\sum_{e'} P(e' \mid \eta)\, P(t \mid e', \mu, \sigma^2)\, \prod_{w \in \mathbf{w}} P(w \mid e', \phi)} \qquad (2.1)
For collapsed sampling, the Markov blanket for a particular event class includes many other variables in the model (since integrating out φ and η induces dependencies on all other variables through α and γ). In this case, the full joint probability becomes more complicated (see chapter 5), but the fundamental principle remains the same: we can use this joint probability to calculate the conditional probability for a target variable and draw a sample from that conditional distribution. For more information about the details (and derivation) of Gibbs sampling in probabilistic models, see Carpenter (2010) and Resnik and Hardisty (2010).
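To make the uncollapsed update in equation 2.1 concrete, the following is a minimal sketch of one Gibbs step for a single event's class variable e. All names (sample_event_class, eta, phi) and the toy dimensions are hypothetical, invented for illustration; this is not the thesis's actual implementation.

```python
import math
import random

def sample_event_class(words, t, eta, phi, mu, sigma2, rng=random):
    """One uncollapsed Gibbs draw for a latent event class e (eq. 2.1):
    score each class by its prior P(e | eta), the Gaussian likelihood of
    the timestamp t, and the categorical likelihood of each word, then
    sample from the normalized conditional distribution."""
    log_probs = []
    for e in range(len(eta)):
        lp = math.log(eta[e])
        # Gaussian log-density of the timestamp under class e
        lp += -0.5 * math.log(2 * math.pi * sigma2[e])
        lp += -((t - mu[e]) ** 2) / (2 * sigma2[e])
        # categorical log-likelihood of the words under class e
        for w in words:
            lp += math.log(phi[e][w])
        log_probs.append(lp)
    # log-sum-exp normalization, then an inverse-CDF draw
    m = max(log_probs)
    weights = [math.exp(lp - m) for lp in log_probs]
    total = sum(weights)
    probs = [wt / total for wt in weights]
    r, acc = rng.random(), 0.0
    for e, p in enumerate(probs):
        acc += p
        if r < acc:
            return e, probs
    return len(probs) - 1, probs

# toy example: two event classes, words indexed into a 3-term vocabulary
e, probs = sample_event_class(
    words=[0, 1], t=48.0,
    eta=[0.5, 0.5],                        # person's class distribution
    phi=[[0.9, 0.05, 0.05],                # class 0 term distribution
         [0.1, 0.8, 0.1]],                 # class 1 term distribution
    mu=[48.0, 20.0], sigma2=[25.0, 25.0])  # timestamp means/variances
```

Here the timestamp likelihood dominates: a timestamp of 48 under a class with mean 48 makes that class far more probable than one with mean 20, just as knowing an event's position in a life helps identify its class.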
2.2 Linguistic structure
Much work in using probabilistic models of text for exploratory data analysis uses a bag-of-words assumption in the representation of documents, treating words as simple and independent tokens; from a generative standpoint, this is often a well-motivated decision since words are the only thing we can directly observe. But text does have structure, even if it is unobserved. At a most shallow level, words cluster together into non-compositional multiword expressions (Sag et al., 2002), so that a term like white house can denote something more significant than the sum of its parts. At a deeper, and unobserved, level, phrases bear syntactic relationships to each other (Chomsky, 1965), so that a sentence can be decomposed into a product of subjects, direct objects, attributes, and so on; under a dependency grammar (Sgall et al., 1986; Mel'čuk, 1988), these structural relationships hold between individual words. Beyond syntax, we can also see higher-order structural relations between words and phrases at the level of semantics and discourse.

Figure 2.2: Sample syntactic dependency graph with part-of-speech tags for Luke fights Vader: Luke/NNP fights/VBZ Vader/NNP, with arcs nsubj(fights, Luke) and dobj(fights, Vader).

The broad field of natural language processing deals with uncovering this kind of hidden structure in text, and has developed methods for part-of-speech tagging (Ratnaparkhi et al., 1996; Toutanova et al., 2003; Søgaard, 2011), syntactic parsing (Collins, 2003; McDonald et al., 2005; McClosky et al., 2006; Petrov et al., 2006; Nivre et al., 2007b; Socher et al., 2013a), semantic parsing (Zettlemoyer and Collins, 2005; Das et al., 2014; Flanigan et al., 2014), and discourse parsing (Marcu, 2000; Baldridge and Lascarides, 2005; Sagae, 2009; Ji and Eisenstein, 2014) among many others. For a general overview, see Jurafsky and Martin (2009). The work presented here draws on this tradition, representing documents using linguistic structure. At its simplest, this amounts to uncovering multiword expressions (chapters 5 and 7) in the linear sequence of text; for the models presented in chapters 3 and 4, the observations we see are semantic dependencies (predicates along with their semantic agents and patients) associated with characters in movies and books (as illustrated in figure 2.2); this association of characters with their syntactic paths draws on rich techniques for clustering mentions of people into a single character and resolving coreference between them (e.g., resolving Tom, Mr. Sawyer, and him to the character known as TOM SAWYER). For the models presented in chapter 6, the observations we see are sentences that have been decomposed into propositional subjects and predicates (such as Obama is a Socialist → ⟨Obama, is a Socialist⟩). In the framework of probabilistic models, we can think of this linguistic structure as amounting to meaningful prior information, accumulated through years of theoretical work and practical application. We know that words are not independent, and there exists a rich structure that binds them.
By leveraging our best prediction as to what this structure is, we can reason over a finer and more nuanced representation of our data, and allow a kind of analysis that would not be possible if we treated all words as simple independent strings. By representing documents through their linguistic abstraction, we also gain statistical strength, requiring less data for accurate learning.
2.3 Conditioning on metadata
While much early work in the probabilistic modeling of text draws words as categorical variables from a flat multinomial distribution, one motif throughout this thesis is that language is profoundly situated—it does not arrive to us ex nihilo but is rather spoken or written at a particular time and place and by a particular person—and we can often condition on metadata associated with that situated context in order to influence the probabilities in a generative language model. This context can be either latent or observed. As one example, in §6, I parameterize the probability of a particular subject (like gun rights) used by an individual u as the exponentiated sum of a background log frequency of that subject in
the corpus overall (m_sbj) and K real-valued additive effects, normalized over the space of S possible subjects (given a parameter matrix β ∈ R^{K×S}):

P(sbj | u, η, β, m) = exp(m_sbj + ∑_{k=1}^{K} η_{u,k} β_{k,sbj}) / ∑_{sbj′} exp(m_{sbj′} + ∑_{k=1}^{K} η_{u,k} β_{k,sbj′})   (2.2)
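As a concrete sketch of equation 2.2, the conditional distribution over subjects is a softmax over the background log frequencies shifted by the user's K additive effects. The function name and toy values below are illustrative, not code from the thesis:

```python
import numpy as np

def subject_probs(m, eta_u, beta):
    """P(sbj | u, eta, beta, m): softmax of background log frequencies m (S,)
    plus the user's K additive effects eta_u (K,) pushed through beta (K, S)."""
    logits = m + eta_u @ beta
    logits -= logits.max()      # numerical stability; cancels in the ratio
    p = np.exp(logits)
    return p / p.sum()

# Toy setting: K = 2 latent dimensions, S = 3 subjects.
m = np.log(np.array([0.5, 0.3, 0.2]))   # background subject frequencies
beta = np.array([[1.0, -1.0, 0.0],      # effect of eta_u[0] on each subject
                 [0.0,  0.5, -0.5]])    # effect of eta_u[1] on each subject
eta_u = np.array([2.0, 0.0])            # a user far along the first latent axis
probs = subject_probs(m, eta_u, beta)   # boosts subject 0, dampens subject 1
```

Note that with η = 0 the distribution reduces to the normalized background frequencies, so η only encodes a user's deviation from corpus-wide usage.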
In this, η ∈ R^K is a K-dimensional real-valued representation of the political preferences of a specific user—the situated context in which the text is embedded. These preferences are latent (and so must be inferred), but our current estimate of their values influences the probability of observing the subjects we see; if η for a particular user falls into a “Republican” region of R^K, the probability of that user saying the phrase gun rights in a proposition will likely go up. In conditioning on metadata in this way, I draw most directly on the work on sparse additive generative models (SAGE) of Eisenstein et al. (2011a), and also on a line of work that originates in the trigger n-gram language models of Rosenfeld (1996), which allowed the incorporation of long-distance information, such as previously mentioned words, into maximum-entropy n-gram models. This work has since been extended to a Bayesian setting by applying both a Gaussian prior (Chen and Rosenfeld, 2000), which dampens the impact of any individual feature, and sparsity-inducing priors (Kazama and Tsujii, 2003; Goodman, 2004), which can drive many feature weights to 0. Other more recent work in this space includes the structural topic model (Roberts et al., 2014), the inverse regression topic model (Rabinovich and Blei, 2014) and multinomial inverse text regression (Taddy, 2013). Much recent work in neural language modeling (Bengio et al., 2003; Mnih and Hinton, 2009; Mikolov et al., 2013) also falls in this class, by virtue of conditioning on words in context to influence language model probabilities. As in the example above, the models in §6 condition on a user's latent political position to influence the words we see; the models in §7 condition on a user's geographical location to shape a language model by influencing low-dimensional word representations; and the models in §4 condition on author identity in order to learn a set of character types that are able to discount the specific stylistic influence of the author. While the goals for including such metadata are different in each case, the principle is the same throughout.
2.4 Notation in this thesis
Throughout this thesis, I use the following notation consistently for all probabilistic graphical models:
• w represents an observed word in text (chapters 3, 4, 5 and 6)
• r represents an observed semantic role (chapters 3 and 4)
• z is a latent word-level topic (chapter 3)
• p represents a latent, single-membership entity type for an individual (chapters 3, 4 and 6)
• θ, φ and ψ represent latent multinomial distributions (chapters 3, 4, 5 and 6)
• β and ξ represent the parameters in a log-linear distribution (chapters 3, 4 and 6)
• µ and σ represent the mean and variance of Normal distributions (chapters 3, 5 and 6)
• α, γ and ν are Dirichlet hyperparameters (chapters 3, 4, 5 and 6)
While the precise meaning of these variables for each model is defined in more detail in the chapters below, these are consistent guidelines that govern their use throughout this work.

Part I
Variation in content
Overview
The first section of this thesis covers the depiction of people as content in text, and the statistical regularities in that depiction that allow us to infer higher-order information about them. Statistical regularities in how people are presented in text have been helpful in the past for two standard NLP tasks: named entity recognition and coreference resolution. In the former, statistical models are able to leverage informative contextual features in order to classify entities into one of a range of possible pre-existing categories; at the coarsest granularity, this includes the standard four-way CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) classification of PERSON, LOCATION, ORGANIZATION and MISC; finer-grained systems often draw supervision from Wikipedia and classify entities into an ontology containing anywhere from eight fine-grained types (Fleischman and Hovy, 2002) up to several thousand (Bunescu and Pasca, 2006). At its most extreme, fine-grained NER reaches its limit in named entity linking (Cucerzan, 2007), where mentions of individuals in text are resolved to the Wikipedia pages (or other gazetteer entries) of the individuals they refer to. State-of-the-art coreference systems incorporate gender information in order to link noun phrases and pronominal mentions together, either in the form of explicit rules (Raghunathan et al., 2010) or by learning surface contextual features that encode it (Durrett and Klein, 2013). Entity-centric approaches to coreference (Rahman and Ng, 2009; Haghighi and Klein, 2010; Durrett et al., 2013) also reason about document-level information about each entity being resolved, allowing the incorporation of regularities in fine-grained types. This kind of global, entity-centric modeling has been useful for other tasks as well, such as probabilistic frame induction (Chambers, 2013). The work presented here illustrates the value of pushing this exploration further.
In chapter 3, I introduce the problem of unsupervised persona inference: the task of learning a set of latent entity types (or personas) from the actions and attributes associated with people (in this particular case, characters in plot descriptions of movies on Wikipedia). In chapter 4, I extend these models to enable persona inference in the presence of stylistic heterogeneity within a collection of 15,000 literary novels, giving the flexibility to draw on different theoretical assumptions that privilege learning different classes of entity types. In chapter 5, I show that modeling the variation that exists in descriptions of people online (in this case, Wikipedia biographies) both provides a means of learning fine-grained events (where we can exploit the structure that exists between events and the people who experience them) and gives insight into the biases that exist in the characterization of people in this socially constructed space. Each of these three chapters is only possible by privileging people as the central organizing principle of the data.

Chapter 3
Learning personas in movies
Work described in this chapter was undertaken in collaboration with Brendan O’Connor and Noah Smith, and published at ACL 2013 (Bamman et al., 2013)
3.1 Introduction
In the first two chapters of this part of the thesis, I consider variation in the actions and attributes associated with fictional characters in movies (this chapter) and in books (chapter 4). Much computational work involving fictional texts of this kind has focused on narrative (Mani, 2012), attempting to learn the sequence of events by which a story is defined; in this tradition we might situate seminal work on learning procedural scripts (Schank and Abelson, 1977; Regneri et al., 2010), narrative chains (Chambers and Jurafsky, 2008), and plot structures (Finlayson, 2011; Elsner, 2012; McIntyre and Lapata, 2010; Goyal et al., 2010). In this chapter, I introduce a complementary, people-centric perspective that addresses the importance of character in defining a story, leveraging the variation that emerges in descriptions of people in text to infer latent character types that they embody. The testbed for this chapter is film. This fictional domain provides an intuitive starting point for thinking about abstract character types. The notion that a fixed set of character types recurs throughout narratives is common to structuralist theories of sociology, anthropology and literature (Campbell, 1949; Jung, 1981; Propp, 1968; Frye, 1957; Greimas, 1984); in this view, the role of a character in a story is less a depiction of an imagined person (with a real personality) and more a narrative function. A Campbellian view of narrative may see a protagonist depicted as THE HERO, and other characters whose sole function is to offer guidance and training for them (the MENTOR) or to employ cunning as a way of advancing the plot (the TRICKSTER). A Proppian view of narrative likewise sees grand types such as THE HERO
and THE VILLAIN, and also many more specialized characters requisite in Russian folktales (such as THE DONOR, whose sole narrative function is to give THE HERO a magical object).1 While such structuralist theories of narrative have tended to be eclipsed over the past fifty years by materialist theories that concentrate on the historical context in which a narrative is produced, they remain a living means by which contemporary audiences organize their perception across a variety of different media: the BYRONIC HERO, for example, represents a specialized type of the brooding, mysterious loner, and has been used to describe Heathcliff in Wuthering Heights, Edward Cullen in the Twilight books and movies, and Angel in the television series of the same name (Stein, 2004). Under this perspective, a character's latent internal nature helps drive the action we observe. This leads to a natural generative story: we first decide that we're going to make a particular kind of movie (e.g., a romantic comedy), then decide on a set of character types, or personas, we want to see involved (the PROTAGONIST, the LOVE INTEREST, the BEST FRIEND). After picking this set, we fill out each of these roles with specific attributes (female, 28 years old, klutzy); with this cast of characters, we then sketch out the set of events by which they interact with the world and with each other (runs but just misses the train, spills coffee on their boss) – through which they reveal to the viewer those inherent qualities about themselves. This work is inspired by past approaches that infer typed semantic arguments along with narrative schemas (Chambers and Jurafsky, 2009; Regneri et al., 2011), but seeks a more holistic view of character, one that learns from stereotypical attributes in addition to plot events.
This work also naturally draws on earlier work on the unsupervised learning of verbal arguments and semantic roles (Pereira et al., 1993; Grenager and Manning, 2006; Titov and Klementiev, 2012) and unsupervised relation discovery (Yao et al., 2011). This people-centric perspective leads to two natural questions. First, can we learn what those standard personas are by how individual characters (who instantiate those types) are portrayed? Second, can we learn the set of attributes and actions by which we recognize those common types? How do we, as viewers, recognize a VILLAIN?
3.2 Data

3.2.1 Text
Our primary source of data to answer these questions in this chapter comes from 42,306 movie plot summaries extracted from the November 2, 2012 dump of English-language
1Woloch (2003) provides one vivid example of characters defined by their structural relations in the “twelve young men” Achilles murders in The Iliad as revenge for his companion Patroklus' death (Il. 21.97–113); we know nothing of these characters outside of their function as THOSE KILLED IN REVENGE.
Wikipedia.2 These summaries, which have a median length of approximately 176 words,3 contain a concise synopsis of the movie's events, along with implicit descriptions of the characters (e.g., “rebel leader Princess Leia,” “evil lord Darth Vader”). To extract structure from this data, we use the Stanford CoreNLP library4 to tag and syntactically parse the text, extract entities, and resolve coreference within the document. With this structured representation, we extract linguistic features for each character, looking at immediate verb governors and attribute syntactic dependencies to all of the entity's mention headwords, extracted from the typed dependency tuples produced by the parser; we refer to the “CCprocessed” syntactic relations described in de Marneffe and Manning (2008):
• Agent verbs. Verbs for which the entity is an agent argument (nsubj or agent).
• Patient verbs. Verbs for which the entity is the patient, theme or other argument (dobj, nsubjpass, iobj, or any prepositional argument prep_*).
• Attributes. Adjectives and common noun words that relate to the mention as adjectival modifiers, noun-noun compounds, appositives, or copulas (nsubj or appos governors, or nsubj, appos, amod, nn dependents of an entity mention).
These three roles capture three different ways in which character personas are revealed: the actions they take on others, the actions done to them, and the attributes by which they are described. For every character we thus extract a bag of (r, w) tuples, where w is the word lemma and r is one of {agent verb, patient verb, attribute} as identified by the above rules.
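A minimal sketch of these extraction rules, operating on already-parsed typed dependency tuples rather than on CoreNLP's actual API; the relation names follow the scheme cited above, but the tuple format, function name, and toy parse are illustrative, and the rule set is a simplified subset of the full specification:

```python
AGENT_RELS = {"nsubj", "agent"}
PATIENT_RELS = {"dobj", "nsubjpass", "iobj"}

def tuples_for_mention(mention, deps):
    """Return (role, word) tuples for one entity mention headword.
    deps: (governor, relation, dependent, governor_POS, dependent_POS)."""
    out = []
    for gov, rel, dep, gov_pos, dep_pos in deps:
        if dep == mention:
            if gov_pos.startswith("VB") and rel in AGENT_RELS:
                out.append(("agent verb", gov))
            elif gov_pos.startswith("VB") and (rel in PATIENT_RELS
                                               or rel.startswith("prep_")):
                out.append(("patient verb", gov))
            elif rel in {"nsubj", "appos"} and gov_pos in {"JJ", "NN"}:
                out.append(("attribute", gov))   # copular: "Vader is evil"
        elif gov == mention and rel in {"amod", "nn", "appos", "nsubj"}:
            out.append(("attribute", dep))       # "evil lord Darth Vader"
    return out

# Toy parse of "Evil lord Vader strangles Obi-Wan."
deps = [
    ("strangles", "nsubj", "Vader", "VBZ", "NNP"),
    ("strangles", "dobj", "Obi-Wan", "VBZ", "NNP"),
    ("Vader", "amod", "evil", "NNP", "JJ"),
    ("Vader", "nn", "lord", "NNP", "NN"),
]
```

Here `tuples_for_mention("Vader", deps)` yields an agent-verb tuple for strangles and attribute tuples for evil and lord, while Obi-Wan receives only a patient-verb tuple.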
3.2.2 Metadata
Our second source of information consists of character and movie metadata drawn from the November 4, 2012 dump of Freebase.5 At the movie level, this includes data on the language, country, release date and detailed genre (365 non-mutually exclusive categories, including “Epic Western,” “Revenge,” and “Hip Hop Movies”). Many of the characters in movies are also associated with the actors who play them; since many actors also have detailed biographical information, we can ground the characters in what we know of those real people – including their gender and estimated age at the time of the movie’s release (the difference between the release date of the movie and the actor’s date of birth).
2http://dumps.wikimedia.org/enwiki/
3More popular movies naturally attract more attention on Wikipedia and hence more detail: the top 1,000 movies by box office revenue have a median length of 715 words.
4http://nlp.stanford.edu/software/corenlp.shtml
5http://download.freebase.com/datadumps/
Across all 42,306 movies, entities average 3.4 agent events, 2.0 patient events, and 2.1 attributes. For all experiments described below, we restrict our dataset to only those events that are among the 1,000 most frequent overall, and only characters with at least 3 events. 120,345 characters meet this criterion; of these, 33,559 can be matched to Freebase actors with a specified gender, and 29,802 can be matched to actors with a given date of birth. Of all actors in the Freebase data whose age is given, the average age at the time of a movie's release is 37.9 (standard deviation 14.1); of all actors whose gender is known, 66.7% are male.6 The age distribution is strongly bimodal when conditioning on gender: the average age of a female actor at the time of a movie's release is 33.0 (s.d. 13.4), while that of a male actor is 40.5 (s.d. 13.7).
Figure 3.1: A persona is a set of three distributions over latent topics. In this toy example, the ZOMBIE persona is primarily characterized by being the agent of words from the eat and kill topics, the patient of kill words, and is attributively modified by words from the dead topic.
3.3 Personas
One way we recognize a character’s latent type is by observing the stereotypical actions they perform (e.g., VILLAINS strangle), the actions done to them (e.g., VILLAINS are foiled and arrested) and the words by which they are described (VILLAINS are evil). To capture this intuition, we define a persona as a set of three typed distributions: one for the words for which the character is the agent, one for which it is the patient, and one for words by which the character is attributively modified. Each distribution ranges over a fixed set of latent word classes, or topics. Figure 3.1 illustrates this definition for a toy example: a ZOMBIE persona may be characterized as being the agent of primarily eating and killing actions, the
6Whether this extreme 2:1 male/female ratio reflects an inherent bias in film or a bias in attention on Freebase (or Wikipedia, on which it draws) is an interesting research question in itself.
patient of killing actions, and the object of dead attributes. The topic labeled eat may include words like eat, drink, and devour.
3.4 Models
Both models that we present here simultaneously learn three things: 1.) a soft clustering over words to topics (e.g., the verb “strangle” is mostly a type of Assault word); 2.) a soft clustering over topics to personas (e.g., VILLAINS perform a lot of Assault actions); and 3.) a hard clustering over characters to personas (e.g., Darth Vader is a VILLAIN). They each use different evidence: since our data includes not only textual features (in the form of actions and attributes of the characters) but also non-textual information (such as movie genre, age and gender), we design a model that exploits this additional source of information in discriminating between character types; since this extra-linguistic information may not always be available, we also design a model that learns only from the text itself. We present the text-only model first for simplicity. Throughout, V is the word vocabulary size, P is the number of personas, and K is the number of topics.
3.4.1 Dirichlet Persona Model
In the most basic model, we only use information from the structured text, which comes as a bag of (r, w) tuples for each character in a movie, where w is the word lemma and r is the relation of the word with respect to the character (one of agent verb, patient verb or attribute, as outlined in §3.2.1 above). The generative story runs as follows. First, let there be K latent word topics; as in LDA (Blei et al., 2003), these are words that will be soft-clustered together by virtue of appearing in similar contexts. Each latent word cluster φk ∼ Dir(γ) is a multinomial over the V words in the vocabulary, drawn from a Dirichlet parameterized by γ. Next, let a persona p be defined as a set of three multinomials ψp over these K topics, one for each typed role r, each drawn from a Dirichlet with a role-specific hyperparameter
(νr). Every document (a movie plot summary) contains a set of characters, each of which is associated with a single latent persona p; for every observed (r, w) tuple associated with the character, we sample a latent topic k from the role-specific ψp,r. Conditioned on this topic assignment, the observed word is drawn from φk. The distribution of these personas for a given document is determined by a document-specific multinomial θ, drawn from a Dirichlet parameterized by α. Figure 3.2 (above left) illustrates the form of the model. To simplify inference, we collapse out the persona-topic distributions ψ, the topic-word distributions φ and the persona
(a) DIRICHLET PERSONA MODEL. (b) PERSONA REGRESSION.
P: number of personas (hyperparameter)
K: number of word topics (hyperparameter)
D: number of movie plot summaries
E: number of characters in movie d
W: number of (role, word) tuples used by character e
φk: topic k's distribution over V words
r: tuple role (agent verb, patient verb, attribute)
ψp,r: distribution over topics for persona p in role r
θd: movie d's distribution over personas
pe: character e's persona (an integer, p ∈ {1..P})
j: a specific (r, w) tuple in the data
zj: word topic for tuple j
wj: word for tuple j
α: concentration hyperparameter for the Dirichlet model
β: feature weights for the regression model
µ, σ2: Gaussian mean and variance (for regularizing β)
md: movie features (from movie metadata)
me: entity features (from movie actor metadata)
νr, γ: Dirichlet concentration hyperparameters
Figure 3.2: Models and definition of variables.

distribution θ for each document. Inference on the remaining latent variables – the persona p for each character and the topic z for each word associated with that character – is
conducted via collapsed Gibbs sampling (Griffiths and Steyvers, 2004). We optimize the values of the Dirichlet hyperparameters α, ν and γ using slice sampling with a uniform prior every 20 iterations for the first 500 iterations, and every 100 iterations thereafter. After a burn-in phase of 10,000 iterations, we collect samples every 10 iterations (to lessen autocorrelation) until a total of 100 have been collected.
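To make the collapsed update concrete, the conditional for a single tuple's topic multiplies a persona-role-topic term by a topic-word term, both computed from counts that exclude the tuple being resampled. The sketch below shows only this z step (the update for a character's persona p is analogous but scores all of the character's tuples at once); the array names and toy counts are illustrative, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(w, r, p, n_prk, n_kw, n_k, nu, gamma, V):
    """One collapsed Gibbs draw of a tuple's topic, given its word w, role r,
    and its character's persona p. Count arrays exclude the current tuple:
    n_prk[p,r,k] = tuples of persona p in role r assigned topic k;
    n_kw[k,w] = tokens of word w in topic k; n_k[k] = total tokens in topic k."""
    K = n_kw.shape[0]
    persona_term = (n_prk[p, r] + nu[r]) / (n_prk[p, r].sum() + K * nu[r])
    word_term = (n_kw[:, w] + gamma) / (n_k + V * gamma)
    probs = persona_term * word_term
    probs /= probs.sum()
    return rng.choice(K, p=probs)

# Toy counts: 2 personas, 3 roles, K=4 topics, V=5 words.
K, V = 4, 5
n_prk = np.zeros((2, 3, K)); n_prk[0, 0, 2] = 10.0  # persona 0 as agent favors topic 2
n_kw = np.ones((K, V)); n_kw[2, 1] = 20.0           # topic 2 favors word 1
n_k = n_kw.sum(axis=1)
nu, gamma = np.full(3, 0.1), 0.1
draws = [sample_z(w=1, r=0, p=0, n_prk=n_prk, n_kw=n_kw, n_k=n_k,
                  nu=nu, gamma=gamma, V=V) for _ in range(200)]
```

With these counts the sampler almost always returns topic 2, since both the persona-role and word evidence point there; in a full sampler the counts would be decremented and re-incremented around each draw.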
3.4.2 Persona Regression
To incorporate observed metadata in the form of movie genre, character age and character gender, we adopt an “upstream” modeling approach (Mimno and McCallum, 2008), letting those observed features influence the conditional probability with which a given character is expected to assume a particular persona, prior to observing any of their actions. This captures the increased likelihood, for example, that a 25-year-old male actor in an action movie will play an ACTION HERO rather than a VALLEY GIRL. To capture these effects, each character's latent persona is no longer drawn from a document-specific Dirichlet; instead, the P-dimensional simplex is the output of a multiclass logistic regression, where the document genre metadata md and the character age and gender metadata me together form a feature vector that combines with persona-specific feature weights to form the following log-linear distribution over personas, with the probability for persona k being:
P(p = k | md, me, β) = exp([md; me]⊤ βk) / (1 + ∑_{j=1}^{P−1} exp([md; me]⊤ βj))   (3.1)
The persona-specific β coefficients are learned through stochastic EM (Wei and Tanner, 1990), in which we alternate between the following:
1. Given current values for β, for all characters e in all plot summaries, sample values of
pe and zj for all associated tuples.
2. Given input metadata features m and the associated sampled values of p, find the values of β that maximize the standard multiclass logistic regression log likelihood,
subject to squared ℓ2 regularization.
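Under equation 3.1, the persona prior is a multiclass logistic regression with persona P as the reference class (its logit fixed at 0, which is where the 1 + Σ in the denominator comes from). A small illustrative sketch, with hypothetical feature values:

```python
import numpy as np

def persona_prior(m, beta):
    """Eq. 3.1: m is the concatenated feature vector [m_d; m_e] of length F;
    beta is (F, P-1); persona P serves as the reference class with logit 0."""
    logits = np.concatenate([m @ beta, [0.0]])
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy example: F = 2 features (say, "action genre" and "male"), P = 3 personas.
m = np.array([1.0, 1.0])
beta = np.array([[2.0, -1.0],    # feature 0 pushes strongly toward persona 0
                 [1.0,  0.0]])   # feature 1 also favors persona 0
prior = persona_prior(m, beta)   # persona 0 dominates for this feature vector
```

In the stochastic EM loop, sampling p and z with this prior in place of the document-specific θ alternates with refitting β by ℓ2-regularized multiclass logistic regression on the sampled persona labels.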
Figure 3.2 (above right) illustrates this model. As with the Dirichlet persona model, inference on both p and z is conducted with collapsed Gibbs sampling. We optimize β every 1,000 iterations, until a burn-in phase of 10,000 iterations has been reached; at this point we follow the same sampling regime as for the Dirichlet persona model.
3.5 Evaluation
We evaluate our methods in two quantitative ways by measuring the degree to which we recover two different sets of gold-standard clusterings. This evaluation also helps offer guidance for model selection (in choosing the number of latent topics and personas) by measuring performance on an objective task.
3.5.1 Character Names
First, we consider all character names that occur in at least two separate movies, generally as a consequence of remakes or sequels; this includes proper names such as “Rocky Balboa,” “Oliver Twist,” and “Indiana Jones,” as well as generic type names such as “Gang Member” and “The Thief”; to minimize ambiguity, we only consider character names consisting of at least two tokens. Each of these names is used by at least two different characters; for example, a character named “Jason Bourne” is portrayed in The Bourne Identity, The Bourne Supremacy, and The Bourne Ultimatum. While these characters are certainly free to assume different roles in different movies, we believe that, in the aggregate, they should tend to embody the same character type and thus prove to be a natural clustering to recover. 970 character names occur at least twice in our data, and 2,666 individual characters use one of those names. Let those 970 character names define 970 unique gold clusters whose members include the individual characters who use that name.
3.5.2 TV Tropes
As a second external measure of validation, we consider a manually created clustering presented at the website TV Tropes,7 a wiki that collects user-submitted examples of common tropes (narrative, character and plot devices) found in television, film, and fiction, among other media. While TV Tropes contains a wide range of such conventions, we manually identified a set of 72 tropes that could reasonably be labeled character types, including THE CORRUPT CORPORATE EXECUTIVE, THE HARDBOILED DETECTIVE, THE JERK JOCK, THE KLUTZ and THE SURFER DUDE. We manually aligned user-submitted examples of characters embodying these 72 character types with the canonical references in Freebase to create a test set of 501 individual characters. While the 72 character tropes represented here are a more subjective measure, we expect to be able to at least partially recover this clustering.
7http://tvtropes.org CHAPTER 3. LEARNING PERSONAS IN MOVIES 26
                          Character Names §3.5.1      TV Tropes §3.5.2
K    Model                P=25    P=50    P=100       P=25    P=50    P=100
25   Persona regression   7.73    7.32    6.79        6.26    6.13    5.74
     Dirichlet persona    7.83    7.11    6.44        6.29    6.01    5.57
50   Persona regression   7.59    7.08    6.46        6.30    5.99    5.65
     Dirichlet persona    7.57    7.04    6.35        6.23    5.88    5.60
100  Persona regression   7.58    6.95    6.32        6.11    6.05    5.49
     Dirichlet persona    7.64    6.95    6.25        6.24    5.91    5.42

Table 3.1: Variation of information between learned personas and gold clusters for different numbers of topics K and personas P. Lower values are better. All values are reported in bits.
3.5.3 Variation of Information
To measure the similarity between the two clusterings of movie characters, gold clusters G and induced latent persona clusters C, we calculate the variation of information (Meilă, 2007):
VI(G, C) = H(G) + H(C) − 2I(G, C)   (3.2)
         = H(G|C) + H(C|G)          (3.3)
VI measures the information-theoretic distance between the two clusterings: a lower value means greater similarity, and VI = 0 if they are identical. Low VI indicates that (induced) clusters and (gold) clusters tend to overlap; i.e., knowing a character's (induced) cluster usually tells us their (gold) cluster, and vice versa. Variation of information is a metric (it is symmetric and obeys the triangle inequality), and has a number of other desirable properties. Table 3.1 presents the VI between the learned persona clusters and gold clusters, for varying numbers of personas (P = {25, 50, 100}) and topics (K = {25, 50, 100}). To determine significance with respect to a random baseline, we conduct a permutation test (Fisher, 1935; Pitman, 1937) in which we randomly shuffle the labels of the learned persona clusters and count the number of times in 1,000 such trials that the VI of the observed persona labels is lower than the VI of the permuted labels; this defines a nonparametric p-value. All results presented are significant at p < 0.001 (i.e., the observed VI is never higher than the simulated VI). Over all tests in comparison to both gold clusterings, we see VI improve as both P and, to a lesser extent, K increase. While this may be expected as the number of personas increases to match the number of distinct types in the gold clusters (970 and 72, respectively), the fact that VI improves as the number of latent topics increases suggests that more fine-grained
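The VI computation itself is only a few lines; the sketch below (illustrative, not the thesis code) computes equation 3.2 in bits from two parallel label lists:

```python
import math
from collections import Counter

def variation_of_information(gold, pred):
    """VI(G, C) = H(G) + H(C) - 2 I(G; C), in bits, for parallel label lists."""
    n = len(gold)
    pg, pc = Counter(gold), Counter(pred)
    joint = Counter(zip(gold, pred))

    def entropy(counts):
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # I(G; C) = sum over joint cells of p(g,c) * log2( p(g,c) / (p(g) p(c)) )
    mi = sum(c / n * math.log2((c * n) / (pg[g] * pc[k]))
             for (g, k), c in joint.items())
    return entropy(pg) + entropy(pc) - 2 * mi

# Identical clusterings are at distance 0; splitting each of two equal-size
# gold clusters into two singletons costs exactly 1 bit:
# variation_of_information([0, 0, 1, 1], [0, 1, 2, 3]) == 1.0
```

The permutation test described above can then be run by shuffling `pred` with `random.shuffle` and recomputing VI in each trial.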
Character Names §3.5.1
K    Model                P=25          P=50          P=100
25   Persona regression   62.8 (↑41%)   59.5 (↑40%)   53.7 (↑33%)
     Dirichlet persona    54.7 (↑27%)   50.5 (↑26%)   45.4 (↑17%)
50   Persona regression   63.1 (↑42%)   59.8 (↑42%)   53.6 (↑34%)
     Dirichlet persona    57.2 (↑34%)   49.0 (↑23%)   44.7 (↑16%)
100  Persona regression   63.1 (↑42%)   57.7 (↑39%)   53.0 (↑34%)
     Dirichlet persona    55.3 (↑30%)   49.5 (↑24%)   45.2 (↑18%)

TV Tropes §3.5.2
K    Model                P=25          P=50          P=100
25   Persona regression   42.3 (↑31%)   38.5 (↑24%)   33.1 (↑25%)
     Dirichlet persona    39.5 (↑20%)   31.7 (↑28%)   25.1 (↑21%)
50   Persona regression   42.9 (↑30%)   39.1 (↑33%)   31.3 (↑20%)
     Dirichlet persona    39.7 (↑30%)   31.5 (↑32%)   24.6 (↑22%)
100  Persona regression   43.5 (↑33%)   32.1 (↑28%)   26.5 (↑22%)
     Dirichlet persona    39.7 (↑34%)   29.9 (↑24%)   23.6 (↑19%)

Table 3.2: Purity scores of recovering gold clusters. Higher values are better. Each absolute purity score is paired with its improvement over a controlled baseline of permuting the learned labels while keeping the cluster proportions the same.
topics are helpful for capturing nuanced character types.8 The difference between the persona regression model and the Dirichlet persona model here is not significant; while VI allows us to compare models with different numbers of latent clusters, its requirement that clusterings be mutually informative places a high overhead on models that are fundamentally unidirectional (in Table 3.1, for example, the room for improvement between two models of the same P and K is naturally smaller than the bigger difference between different P or K). While we would naturally prefer a text-only model to be as expressive as a model that requires potentially hard-to-acquire metadata, we tease apart whether a distinction actually does exist by evaluating the purity of the gold clusters with respect to the labels assigned to them.
3.5.4 Purity
For gold clusters G = {g1 ... gK} and inferred clusters C = {c1 ... cJ} we calculate purity as:

Purity = (1/N) ∑_k max_j |g_k ∩ c_j|   (3.4)
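Equation 3.4 in code (again an illustrative sketch, not the thesis implementation): tally gold/inferred co-occurrence counts, then keep the best-matching inferred cluster for each gold cluster:

```python
from collections import Counter, defaultdict

def purity(gold, pred):
    """(1/N) * sum over gold clusters k of max over inferred clusters j
    of |g_k ∩ c_j|, computed from parallel label lists."""
    n = len(gold)
    overlap = Counter(zip(gold, pred))   # |g_k ∩ c_j| for each observed pair
    best = defaultdict(int)
    for (g, c), count in overlap.items():
        best[g] = max(best[g], count)
    return sum(best.values()) / n
```

Note that a single all-encompassing inferred cluster attains purity 1 under this definition (every gold cluster overlaps it completely), which is exactly the degenerate case the size-controlled baseline below is designed to correct for.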
8This trend is robust to the choice of cluster metric: here VI and F-score have a correlation of −0.87; as more latent topics and personas are added, clustering improves (causing the F-score to go up and the VI distance to go down).
While purity cannot be used to compare models of different persona size P, it can help us distinguish between models of the same size. A model can attain perfect purity, however, by placing all characters into a single cluster; to control for this, we present a controlled baseline in which each character is assigned a latent character type label proportional to the size of the latent clusters we have learned (so that, for example, if one latent persona cluster contains 3.2% of the total characters, the probability of selecting that persona at random is 3.2%). Table 3.2 presents each model's absolute purity score paired with its improvement over its controlled permutation (e.g., ↑41%). Within each fixed-size partition, the use of metadata yields a substantial improvement over the Dirichlet model, both in terms of absolute purity and in its relative improvement over its size-controlled baseline. In practice, we find that while the Dirichlet model distinguishes between character personas in different movies, the persona regression model helps distinguish between different personas within the same movie.
3.6 Exploratory Data Analysis
As with other generative approaches, latent persona models enable exploratory data analysis. To illustrate this, we present results from the persona regression model learned above, with 50 latent lexical classes and 100 latent personas. Figure 3.3 visualizes this data by focusing on a single movie, The Dark Knight (2008); the movie's protagonist, Batman, belongs to the same latent persona as Detective Jim Gordon, as well as other action movie protagonists Jason Bourne and Tony Stark (Iron Man). The movie's antagonist, The Joker, belongs to the same latent persona as Dracula from Van Helsing and Colin Sullivan from The Departed, illustrating the ability of personas to be informed by, but still cut across, different genres.

Table 3.3 (at the end of this chapter) presents an exhaustive list of all 50 topics, along with an assigned label that consists of the single word with the highest PMI for that class. Of note are topics relating to romance (unite, marry, woo, elope, court), commercial transactions (purchase, sign, sell, owe, buy), and the classic criminal schema from Chambers (2011) (sentence, arrest, assign, convict, promote).

Table 3.4 presents the 14 most frequent personas in our dataset, illustrated with characters from the 500 highest-grossing movies. Each learned persona comprises three separate mixtures over the 50 latent topics (one for agent relations, one for patient relations, and one for attributes), as illustrated in figure 3.1 above. Rather than presenting a 3 × 50 histogram for each persona, we illustrate each by listing the most characteristic topics, movie characters, and metadata features associated with it. Characteristic actions and features are defined as those having the highest smoothed pointwise mutual information with that class; exemplary characters are those with the highest posterior probability of being drawn from that class.

Figure 3.3: Dramatis personae of The Dark Knight (2008), illustrating 3 of the 100 character types learned by the persona regression model, along with links from other characters in those latent classes to other movies. Each character type is listed with the top three latent topics with which it is associated.

Among the personas learned are canonical male action heroes (exemplified by the protagonists of The Bourne Supremacy, Speed, and Taken), superheroes (Hulk, Batman and Robin, Hector of Troy) and several romantic comedy types, largely characterized by words drawn from the FLIRT topic, including flirt, reconcile, date, dance and forgive.
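The "most characteristic" criterion above can be sketched as a smoothed PMI computation over persona-topic co-occurrence counts. This is an illustrative reimplementation under our own simplified data layout, not the thesis's code:

```python
import math
from collections import Counter

def characteristic_topics(assignments, smoothing=1.0, top_k=3):
    """assignments: list of (persona, topic) pairs observed in the data.
    Returns, for each persona, the topics with the highest smoothed PMI:
    PMI = log [ p(persona, topic) / (p(persona) * p(topic)) ],
    with add-k smoothing on the joint counts to damp rare-event spikes."""
    joint = Counter(assignments)
    personas = sorted({p for p, _ in assignments})
    topics = sorted({t for _, t in assignments})
    n = len(assignments) + smoothing * len(personas) * len(topics)
    p_count = Counter(p for p, _ in assignments)
    t_count = Counter(t for _, t in assignments)
    out = {}
    for p in personas:
        scored = []
        for t in topics:
            pj = (joint[(p, t)] + smoothing) / n
            pp = (p_count[p] + smoothing * len(topics)) / n
            pt = (t_count[t] + smoothing * len(personas)) / n
            scored.append((math.log(pj / (pp * pt)), t))
        out[p] = [t for _, t in sorted(scored, reverse=True)[:top_k]]
    return out
```

The same scoring applies to metadata features; only the co-occurrence counts differ.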
3.7 Conclusion and Future Work
This chapter introduces a method for automatically inferring latent character personas from text (and metadata, when available). In leveraging the statistical regularities in the actions and attributes with which characters are described in plot summaries, we are able to learn a set of coherent types through which those regularities can be explained.

While the goal of this work has been to induce a set of latent character classes and partition all characters among them, there are several interesting questions that remain as directions for future work. One is in looking at how a specific character's actions may informatively be at odds with their inferred persona, given the choice of that persona as the single best fit to explain the actions we observe. By examining how any individual character deviates from the behavior indicative of their type, we might be able to paint a more nuanced picture of how a character can embody a specific persona while resisting it at the same time. Second, this work makes several strong modeling assumptions: we assign a single persona
to a character that is valid throughout the entirety of a movie; a more realistic assumption may see personas as temporary masks that can be taken on and off, so that a single character may transition through multiple roles (and there may be correlations among those personas, and common pathways through them). Another modeling assumption we make in this work is that movies have multinomial distributions over latent character types, which allows multiple characters to embody the same persona; it may be more realistic to model personas through other distributions that strongly disprefer multiple HEROES, for instance. Rather than having personas drawn from a flat multinomial—where each persona is independent of the others—we may want to infer a hierarchy of personas, such as a single top-level PROTAGONIST class from which other, more fine-grained types inherit. Third, while we have adopted a people-centric perspective here in privileging individuals as an organizing principle for data, we only consider individuals in isolation from each other; a more realistic assumption may see personas as partly defined by dyadic (or higher-order) relations among pairs (or sets) of people—so that a VILLAIN, for example, is only seen as such through the lens of a PROTAGONIST. Fourth, while the testbed for this work is film, the textual data we draw on comes from a stylistically homogeneous corpus (summaries of plots written on Wikipedia), where we can reasonably attribute the variation we see to latent factors other than individual stylistic variation; how we can extend this model to corpora that are stylistically heterogeneous is the subject of the next chapter.
Table 3.3: Latent topics learned over 50 latent lexical classes and 100 latent personas. The label for each class is its single word with the highest PMI, and the words shown for each class are those with the highest smoothed PMI.

SWITCH: switch confirm escort report instruct
INFATUATE: infatuate obsess acquaint revolve concern
ALIEN: alien child governor bandit priest
CAPTURE: capture corner transport imprison trap
MAYA: maya monster monk goon dragon
INHERIT: inherit live imagine experience share
TESTIFY: testify rebuff confess admit deny
APPLY: apply struggle earn graduate develop
EXPEL: expel inspire humiliate bully grant
DIG: dig take welcome sink revolve
COMMAND: command abduct invade seize surrender
RELENT: relent refuse agree insist hope
EMBARK: embark befriend enlist recall meet
MANIPULATE: manipulate conclude investigate conduct
ELOPE: elope forget succumb pretend like
FLEE: flee escape swim hide manage
BABY: baby sheriff vampire knight spirit
BIND: bind select belong refer represent
REJOIN: rejoin fly recruit include disguise
DARK: dark major henchman warrior sergeant
SENTENCE: sentence arrest assign convict promote
DISTURB: disturb frighten confuse tease scare
RIP: rip vanish crawl drive smash
INFILTRATE: infiltrate deduce leap evade obtain
SCREAM: scream faint wake clean hear
UNITE: unite marry woo elope court
PURCHASE: purchase sign sell owe buy
SHOOT: shoot aim overpower interrogate kill
EXPLORE: explore investigate uncover deduce
WOMAN: woman friend wife sister husband
WITCH: witch villager kid boy mom
INVADE: invade sail travel land explore
DEFEAT: defeat destroy transform battle inject
CHASE: chase scare hit punch eat
TALK: talk tell reassure assure calm
POP: pop lift crawl laugh shake
SING: sing perform cast produce dance
APPROVE: approve die suffer forbid collapse
WEREWOLF: werewolf mother parent killer father
DINER: diner grandfather brother terrorist
DECAPITATE: decapitate bite impale strangle stalk
REPLY: reply say mention answer shout
DEMON: demon narrator mayor duck crime
CONGRATULATE: congratulate cheer thank recommend
INTRODUCE: introduce bring mock read hatch
HATCH: hatch don exist vow undergo
FLIRT: flirt reconcile date dance forgive
ADOPT: adopt raise bear punish feed
FAIRY: fairy kidnapper soul slave president
BUG: bug zombie warden king princess
Freq | Actions | Characters | Features

0.109 | DARKm, SHOOTa, SHOOTp | Jason Bourne (The Bourne Supremacy), Jack Traven (Speed), Jean-Claude (Taken) | Action, Male, War film
0.079 | CAPTUREp, INFILTRATEa, FLEEa | Aang (The Last Airbender), Carly (Transformers: Dark of the Moon), Susan Murphy/Ginormica (Monsters vs. Aliens) | Female, Action, Adventure
0.067 | DEFEATa, DEFEATp, INFILTRATEa | Glenn Talbot (Hulk), Batman (Batman and Robin), Hector (Troy) | Action, Animation, Adventure
0.060 | COMMANDa, DEFEATp, CAPTUREp | Zoe Neville (I Am Legend), Ursula (The Little Mermaid), Joker (Batman) | Action, Adventure, Male
0.046 | INFILTRATEa, EXPLOREa, EMBARKa | Peter Parker (Spider-Man 3), Ethan Hunt (Mission: Impossible), Jason Bourne (The Bourne Ultimatum) | Male, Action, Age 34-36
0.036 | FLIRTa, FLIRTp, TESTIFYa | Mark Darcy (Bridget Jones: The Edge of Reason), Jerry Maguire (Jerry Maguire), Donna (Mamma Mia!) | Female, Romance Film, Comedy
0.033 | EMBARKa, INFILTRATEa, INVADEa | Perseus (Wrath of the Titans), Maximus Decimus Meridius (Gladiator), Julius (Twins) | Male, Chinese Movies, Spy
0.027 | CONGRATULATEa, CONGRATULATEp, SWITCHa | Professor Albus Dumbledore (Harry Potter and the Philosopher's Stone), Magic Mirror (Shrek), Josephine Anwhistle (Lemony Snicket's A Series of Unfortunate Events) | Age 58+, Family Film, Age 51-57
0.025 | SWITCHa, SWITCHp, MANIPULATEa | Clarice Starling (The Silence of the Lambs), Hannibal Lecter (The Silence of the Lambs), Colonel Bagley (The Last Samurai) | Age 58+, Male, Age 45-50
0.022 | REPLYa, TALKp, FLIRTp | Graham (The Holiday), Abby Richter (The Ugly Truth), Anna Scott (Notting Hill) | Female, Comedy, Romance Film
0.020 | EXPLOREa, EMBARKa, CAPTUREp | Harry Potter (Harry Potter and the Philosopher's Stone), Harry Potter (Harry Potter and the Chamber of Secrets), Captain Leo Davidson (Planet of the Apes) | Adventure, Family Film, Horror
0.018 | FAIRYm, COMMANDa, CAPTUREp | Captain Jack Sparrow (Pirates of the Caribbean: At World's End), Shrek (Shrek), Shrek (Shrek Forever After) | Action, Family Film, Animation
0.018 | DECAPITATEa, DECAPITATEp, RIPa | Jericho Cane (End of Days), Martin Riggs (Lethal Weapon 2), Gabriel Van Helsing (Van Helsing) | Horror, Slasher, Teen
0.017 | APPLYa, EXPELp, PURCHASEp | Oscar (Shark Tale), Elizabeth Halsey (Bad Teacher), Dre Parker (The Karate Kid) | Female, Teen, Under Age 22

Table 3.4: Of 100 latent personas learned, we present the top 14 by frequency. Actions index the latent topic classes presented in table 3.3; subscripts denote whether the character is predominantly the agent (a), patient (p) or is modified by an attribute (m).

Chapter 4
Learning personas in books
Work described in this chapter was undertaken in collaboration with Ted Underwood and Noah Smith, and published at ACL 2014 (Bamman et al., 2014d).
4.1 Introduction
In chapter 3, we learn a set of entity types for characters in movies based on statistical regularities in the actions and attributes with which they are associated in Wikipedia plot summaries. While Wikipedia is, of course, written by a large community of contributors, it has the advantage of a relatively uniform, encyclopedic style governed by strict conventions.1 This stylistic homogeneity can help learning—the variation we see in how a character is described can, in part, reasonably be attributed to the latent entity they embody and not to simple variation in word choice, for example, among the authors.

As we generalize this people-centric technique to a broader set of domains, one additional complexity crops up: in stylistically heterogeneous corpora, in which individual authors have their own unique styles that are often very different from each other, how can we learn statistical regularities in the entity types of characters in the presence of that overwhelming stylistic variation? Furthermore, would we want to?

In the work presented in chapter 3, the text we observe associated with an entity in a document is directly dependent on the class of entity—and only that class. This relationship between entity and text is a theoretical assumption, with important consequences for learning: entity types learned in this way will be increasingly similar the more similar the domain, author, and other extra-linguistic effects are between them. Many entities in Early Modern English texts, for example, may be judged to be more similar to each other than to
1http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style.
entities from later texts simply by virtue of using hath and other archaic verb forms. While in many cases the topically similar types learned under this assumption may be desirable, in this chapter we explore the alternative, in which entity types are learned in a way that controls for such effects. In introducing a model based on different assumptions, we provide a method that complements the persona model introduced in chapter 3 and provides researchers with more flexible tools to infer different kinds of character types.

In this chapter, we focus on the literary domain, exploring a large collection of 15,099 English novels published in the 18th and 19th centuries and significantly expanding on the models presented in chapter 3. By accounting for the influence of individual authors while inferring latent character types, we are able to learn personas that cut across different authors more effectively than if we learned types conditioned on the text alone. Modeling the language used to describe a character as the joint result of that character's latent type and of other formal variables allows us to test multiple models of character and assess their value for different interpretive problems. As a test case, we focus on separating character from authorial diction, but this approach can readily be generalized to produce models that provisionally distinguish character from other factors (such as period, genre, or point of view) as well.
4.2 Literary Background
Inferring character in novels is challenging from a literary perspective partly because scholars have not reached consensus about the meaning of the term. It may seem obvious that a "character" is a representation of a (real or imagined) person, and many humanists do use the term that way. But there is an equally strong critical tradition that treats character as a formal dimension of narrative. To describe a character as a "blocking figure" or "first-person narrator," for instance, is a statement less about the attributes of an imagined person than about a narrative function (Keen, 2003). Characters are in one sense collections of psychological or moral attributes, but in another sense "word-masses" (Forster, 1927). This tension between "referential" and "formalist" models of character has been a centrally "divisive question in . . . literary theory" (Woloch, 2003).

Considering primary source texts (as distinct from plot summaries) forces us to confront new theoretical questions about character. In the Wikipedia plot summaries used in chapter 3, a human reader may already have used implicit models of character to extract high-level features. To infer character types from raw narrative text, researchers need to explicitly model the relationship of character to narrative form. This is not a solved problem, even for human readers.
For instance, it has frequently been remarked that the characters of Charles Dickens share certain similarities—including a reliance on tag phrases and recurring tics. A referential model of character might try to distinguish this common stylistic element from underlying "personalities." A strictly formalist model might refuse to separate authorial diction from character at all. In practice, human readers can adopt either perspective: we recognize that characters have a "Dickensian" quality but also recognize that a Dickens villain is (in one sense) more like villains in other authors than like a Dickensian philanthropist. Our goal is to show that computational methods can support the same range of perspectives—allowing a provisional, flexible separation between the referential and formal dimensions of narrative.
4.3 Data
The dataset for this work consists of 15,099 distinct narratives drawn from the HathiTrust Digital Library.2 From an initial collection of 469,200 volumes written in English and published between 1700 and 1899 (including poetry, drama, and nonfiction as well as prose narrative), we extract 32,209 volumes of prose fiction, remove duplicates, and fuse multi-volume works to create the final dataset. Since the original texts were produced by scanning and running OCR on physical books, we automatically correct common OCR errors and trim front and back matter from the volumes using the page-level classifiers and HMM of Underwood et al. (2013).

Many aspects of this process would be simpler if we used manually corrected texts, such as those drawn from Project Gutenberg. But we hope to produce research that has historical as well as computational significance, and doing so depends on the provenance of a collection. Gutenberg's decentralized selection process tends to produce exceptionally good coverage of currently popular genres like science fiction, whereas HathiTrust aggregates university libraries. Library collections are not guaranteed to represent the past perfectly, but they are larger, and less strongly shaped by contemporary preferences.

The goal of this work is to provide a method to infer a set of character types in an unsupervised fashion from the data. As in chapter 3, we define this target, a character persona, as a distribution over several categories of typed dependency relations:3
1. agent: the actions of which a character is the agent (i.e., verbs for which the character holds an nsubj or agent relation).
2 http://www.hathitrust.org
3 All categories are described using the Stanford typed dependencies (de Marneffe and Manning, 2008), but any syntactic formalism is equally applicable.
2. patient: the actions of which a character is the patient (i.e., verbs for which the character holds a dobj or nsubjpass relation).
3. possessive: the objects that a character possesses (i.e., all words for which the character holds a poss relation).
4. predicative: attributes predicated of a character (i.e., adjectives or nouns holding an nsubj relation to the character, with an inflection of be as a child).
This set captures the constellation of what a character does and has done to them, what they possess, and what they are described as being.

While the work described in chapter 3 uses the Stanford CoreNLP toolkit to identify characters and extract typed dependencies for them, we found this approach to be too slow for the scale of our data (a total of 1.8 billion tokens); in particular, syntactic parsing, with cubic complexity in sentence length, and out-of-the-box coreference resolution (with thousands of potential antecedents) prove to be the biggest bottlenecks. Before addressing character inference, we present here a prerequisite NLP pipeline that scales well to book-length documents.4 This pipeline uses the Stanford POS tagger (Toutanova et al., 2003), the linear-time MaltParser (Nivre et al., 2007a) for dependency parsing (trained on Stanford typed dependencies), and the Stanford named entity recognizer (Finkel et al., 2005). It includes the following components for clustering character name mentions, resolving pronominal coreference, and reducing vocabulary dimensionality.
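The four role categories can be sketched as a simple mapping over typed-dependency triples. This is a simplified illustration, not the actual pipeline: the `(relation, head, dependent)` triple format and the copula check are our own assumptions about how parsed output might be represented.

```python
# Map Stanford typed-dependency triples to the four persona role categories.
# Each triple is (relation, head, dependent).
AGENT_RELS = {"nsubj", "agent"}
PATIENT_RELS = {"dobj", "nsubjpass"}

def extract_roles(triples, character, be_verbs=("is", "was", "be", "are", "were")):
    roles = {"agent": [], "patient": [], "possessive": [], "predicative": []}
    # Heads that have an inflection of "be" as a copula child, so that
    # "Tom is brave" yields a predicative rather than an agent relation.
    has_be_child = {head for rel, head, dep in triples
                    if rel == "cop" and dep in be_verbs}
    for rel, head, dep in triples:
        if rel in AGENT_RELS and dep == character:
            if head in has_be_child:
                roles["predicative"].append(head)   # "Tom is brave"
            else:
                roles["agent"].append(head)         # "Tom runs"
        elif rel in PATIENT_RELS and dep == character:
            roles["patient"].append(head)           # "hit Tom" / "Tom was hit"
        elif rel == "poss" and dep == character:
            roles["possessive"].append(head)        # "Tom's hat"
    return roles
```

In the real pipeline these tuples are aggregated per character entity (after the clustering and coreference steps below) rather than per mention.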
4.3.1 Character Clustering
First, let us terminologically distinguish between a character mention in a text (e.g., the token Tom on page 141 of The Adventures of Tom Sawyer) and a character entity (e.g., TOM SAWYER the character, to which that token refers). To resolve the former to the latter, we largely follow Davis et al. (2003) and Elson et al. (2010): we define a set of initial characters corresponding to each unique character name that is not a subset of another (e.g., Mr. Tom Sawyer) and deterministically create a set of allowable variants for each one (Mr. Tom Sawyer → Tom, Sawyer, Tom Sawyer, Mr. Sawyer, and Mr. Tom); then, from the beginning of the book to the end, we greedily assign each mention to the most recently linked entity for whom it is a variant. The result constitutes our set of characters, with all mentions partitioned among them.

4 All code is available at http://www.ark.cs.cmu.edu/literaryCharacter
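The variant-generation and greedy linking rules described above might be sketched as follows; the honorific list and the fallback for a mention's first occurrence are our own simplifications of the deterministic procedure.

```python
def name_variants(full_name):
    """Deterministically generate allowable variants of a canonical name,
    e.g. "Mr. Tom Sawyer" -> {"Tom", "Sawyer", "Tom Sawyer", "Mr. Sawyer",
    "Mr. Tom", "Mr. Tom Sawyer"}. The honorific set is a simplification."""
    parts = full_name.split()
    honorifics = {"Mr.", "Mrs.", "Miss", "Dr."}
    title = parts[0] if parts[0] in honorifics else None
    names = parts[1:] if title else parts
    variants = {full_name, " ".join(names)}
    for n in names:
        variants.add(n)
        if title:
            variants.add(f"{title} {n}")
    return variants

def cluster_mentions(canonical_names, mentions):
    """Greedily assign each mention (in narrative order) to the most
    recently linked entity for which it is an allowable variant."""
    tables = {name: name_variants(name) for name in canonical_names}
    recency = []                      # entities, most recently mentioned first
    assignments = []
    for m in mentions:
        entity = next((e for e in recency if m in tables[e]), None)
        if entity is None:            # first mention: fall back to list order
            entity = next((e for e in canonical_names if m in tables[e]), None)
        assignments.append(entity)
        if entity is not None:
            if entity in recency:
                recency.remove(entity)
            recency.insert(0, entity)
    return assignments
```

The recency-first search is what makes the assignment greedy: an ambiguous token like "Tom" attaches to whichever matching entity was linked most recently.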
4.3.2 Pronominal Coreference Resolution
While the character clustering stage is essentially performing proper noun coreference resolution, approximately 74% of references to characters in books come in the form of pronouns.5 To resolve this more difficult class at the scale of an entire book, we train a log-linear discriminative classifier only on the task of resolving pronominal anaphora (i.e., ignoring generic noun phrases such as the paint or the rascal). For this task, we annotated a set of 832 coreference links in 3 books (Pride and Prejudice, The Turn of the Screw, and Heart of Darkness) and featurized coreference/antecedent pairs with:
1. The syntactic dependency path from a pronoun to its potential antecedent (e.g., dobj↑pred→↓pred↓nsubj, where → denotes movement across sentence boundaries).
2. The salience of the antecedent character (defined as the count of that character’s named mentions in the previous 500 words).
3. The antecedent part of speech.
4. Whether or not the pronoun and antecedent appear in the same quotation scope (false if one appears in a quotation and one outside).
5. Whether or not the two agree for gender.
6. The syntactic tree distance between the two.
7. The linear (word) distance between the two.
With this featurization and training data, we train a binary logistic regression classifier with ℓ1 regularization (where negative examples comprise all character entities in the previous 100 words not labeled as the true antecedent). In 10-fold cross-validation on predicting the true nearest antecedent for a pronominal anaphor, this method achieves an average accuracy of 82.7%. With this trained model, we then select the highest-scoring antecedent within 100 words for each pronominal anaphor in our data.
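A minimal sketch of the featurization and antecedent selection follows. The feature keys and dictionary layout are our own stand-ins for the pipeline's representations, and the real system learns its weights with ℓ1-regularized logistic regression rather than taking them as given.

```python
def coref_features(pronoun, antecedent):
    """Featurize a (pronoun, candidate antecedent) pair with a subset of the
    cues listed above. Both arguments are dicts; the keys used here (gender,
    position, in_quote, salience, dep_path, pos) are simplified stand-ins."""
    return {
        "dep_path=" + antecedent["dep_path"]: 1.0,
        "antecedent_pos=" + antecedent["pos"]: 1.0,
        "salience": float(antecedent["salience"]),
        "same_quote_scope": float(pronoun["in_quote"] == antecedent["in_quote"]),
        "gender_agree": float(pronoun["gender"] == antecedent["gender"]),
        "word_distance": float(pronoun["position"] - antecedent["position"]),
    }

def resolve(pronoun, candidates, weights):
    """Pick the highest-scoring candidate antecedent under a linear model:
    score = w . f(pronoun, candidate)."""
    def score(c):
        f = coref_features(pronoun, c)
        return sum(weights.get(k, 0.0) * v for k, v in f.items())
    return max(candidates, key=score)
```

At test time, `candidates` would be the character entities mentioned in the previous 100 words.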
5 Over all 15,099 narratives, the average number of character proper name mentions is 1,673; the average number of gendered singular pronouns (he, she, him, his, her) is 4,641.
01110011 (subtree root)
  0111001100: dressed costume uniform clad clothed
  0111001101: dress clothes wore worn wear
  0111001110: hat coat cap cloak handkerchief
  0111001111: pair boots shoes gloves leather

Figure 4.1: Bitstring representations of neural agglomerative clusters, illustrating the leaf nodes in a binary tree rooted in the prefix 01110011. Bitstring encodings of intermediate nodes and terminal leaves result by following the left (0) and right (1) branches of the merge tree created through agglomerative clustering.
4.3.3 Dimensionality Reduction
To manage the degrees of freedom in the model described in §4.4, we perform dimensionality reduction on the vocabulary by learning word embeddings with a log-linear continuous skip-gram language model (Mikolov et al., 2013) on the entire collection of 15,099 books. This method learns a low-dimensional real-valued vector representation of each word to predict all of the words in a window around it; empirically, we find that with a sufficient window size (we use n = 10), these word embeddings capture semantic similarity (placing topically similar words near each other in vector space).6 We learn a 100-dimensional embedding for each of the 512,344 words in our vocabulary.

To create a partition over the vocabulary, we use hard K-means clustering (with Euclidean distance) to group the 512,344 word types into 1,000 clusters. We then agglomeratively cluster those 1,000 groups to assign bitstring representations to each one, forming a balanced binary tree by only merging existing clusters at equal levels in the hierarchy. We use Euclidean distance as a fundamental metric and a group-average similarity function for calculating the distance between groups. Fig. 4.1 illustrates four of the 1,000 learned clusters.
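The bitstring assignment step can be illustrated with a small pure-Python sketch of agglomerative clustering under group-average Euclidean distance. For clarity it omits the balanced-merge constraint used in the thesis, and it assumes the input centroids have already been produced by K-means over skip-gram embeddings.

```python
import math

def agglomerate_bitstrings(centroids):
    """Assign bitstring codes to clusters by agglomerative merging with
    group-average Euclidean distance. `centroids` maps cluster id -> vector;
    the returned dict maps cluster id -> its bitstring code."""
    def dist(a, b):
        # group-average distance between two groups of vectors
        return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))

    # Each active node is (member_vectors, payload); a leaf's payload is its
    # cluster id, an internal node's payload is its (left, right) children.
    nodes = [([v], cid) for cid, v in centroids.items()]
    while len(nodes) > 1:
        # find the closest pair of current nodes and merge them
        i, j = min(((i, j) for i in range(len(nodes))
                    for j in range(i + 1, len(nodes))),
                   key=lambda ij: dist(nodes[ij[0]][0], nodes[ij[1]][0]))
        left, right = nodes[i], nodes[j]
        merged = (left[0] + right[0], (left, right))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]

    codes = {}
    def walk(node, prefix):
        members, payload = node
        if isinstance(payload, tuple):          # internal node: recurse
            walk(payload[0], prefix + "0")
            walk(payload[1], prefix + "1")
        else:                                   # leaf: a K-means cluster id
            codes[payload] = prefix
    walk(nodes[0], "")
    return codes
```

Following left (0) and right (1) branches from the root yields exactly the kind of prefix codes shown in Fig. 4.1, with shared prefixes marking semantically related subsets of the vocabulary.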
4.4 Model
In order to separate out the effects that a character’s persona has on the words that are associated with them (as opposed to other factors, such as time period, genre, or author), we adopt a hierarchical Bayesian approach in which the words we observe are generated conditional on a combination of different effects captured in a log-linear (or “maximum entropy”) distribution. As noted in the Methods section above (§2.3), this parameterization of a language model that conditions on external metadata is a theme throughout this thesis.
6In comparison, Brown et al. (1992) clusters learned from the same data capture syntactic similarity (placing functionally similar words in the same cluster). CHAPTER 4. LEARNING PERSONAS IN BOOKS 39
In contrast to chapter 3, where the probability of a word linked to a character is dependent entirely on the character's latent persona, in our model, we see the probability of a word as dependent on: (i) the background likelihood of the word, (ii) the author, so that a word becomes more probable if a particular author tends to use it more, and (iii) the character's persona, so that a word is more probable if appearing with a particular persona. Intuitively, if the author Jane Austen is associated with a high weight for the word manners, and all personas have little effect for this word, then manners will have little impact on deciding which persona a particular Austen character embodies, since its presence is explained largely by Austen having penned the word. While we address only the author as an observed effect, this model is easily extended to other features as well, including period, genre, point of view, and others.

The generative story runs as follows (Figure 4.2 depicts the full graphical model): Let there be M unique authors in the data, P latent personas (a hyperparameter to be set), and V words in the vocabulary (in the general setting these may be word types; in our data the vocabulary is the set of 1,000 unique cluster IDs). Each role type r ∈ {agent, patient, possessive, predicative} and vocabulary word v (here, a cluster ID) is associated with a real-valued vector βr,v = [βr,v^meta, βr,v^pers, βr,v^0] of length M + P + 1. The first M + P elements are drawn from a Laplace prior with mean µ = 0 and scale λ = 1; the last element βr,v^0 is an unregularized bias term accounting for the background. Each element in this vector captures the log-additive effect of each author, persona, and the background distribution on the word's probability (Eq. 4.1, below).

Much like latent Dirichlet allocation (Blei et al., 2003), each document d in our dataset
draws a multinomial distribution θd over personas from a shared Dirichlet prior α, which captures the proportion of each character type in that particular document. Every character c in the document draws its persona p from this document-specific multinomial. Given document metadata m (here, one of a set of M authors) and persona p, each tuple of a role r with word w is assumed to be drawn from Eq. 4.1 in Fig. 4.3. This model can be understood as a sparse additive generative (SAGE) model (Eisenstein et al., 2011a) with three kinds of features (metadata, persona, and background bias).
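The log-linear parameterization can be sketched as follows, with explicit (and slow) normalization over the vocabulary; the dictionary layout for β is our own illustration of the additive author, persona, and bias effects.

```python
import math

def word_distribution(role, author, persona, beta):
    """Sketch of the log-linear (SAGE-style) word distribution: the score of
    each vocabulary item for a given role is the sum of an author effect,
    a persona effect, and a background bias; a softmax normalizes. `beta`
    maps (role, word) -> {"meta": {author: wt}, "pers": {persona: wt},
    "bias": wt}, mirroring the [meta, pers, bias] vector in the text."""
    scores = {}
    for (r, w), eff in beta.items():
        if r != role:
            continue
        scores[w] = (eff["meta"].get(author, 0.0)
                     + eff["pers"].get(persona, 0.0)
                     + eff["bias"])
    z = sum(math.exp(s) for s in scores.values())      # partition function
    return {w: math.exp(s) / z for w, s in scores.items()}
```

The explicit sum over the vocabulary in the partition function is exactly the cost that the hierarchical softmax of §4.4.1 is designed to avoid.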
4.4.1 Hierarchical Softmax
The partition function in Eq. 4.1 can lead to slow inference for any reasonably sized vocabulary. To address this, we reparameterize the model by exploiting the structure of the agglomerative clustering in §4.3.3 to perform a hierarchical softmax, following Goodman (2001), Morin and Bengio (2005), and Mikolov et al. (2013).
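The resulting computation can be sketched as a product of binary decisions along a word's bitstring path; the per-node scoring is a stand-in for the model's learned node parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hier_word_prob(bitstring, node_scores):
    """Hierarchical softmax sketch: the probability of a word is the product
    of binary left/right decisions along its bitstring path, so computing a
    word's probability costs O(log V) instead of a sum over the vocabulary.
    `node_scores` maps each bitstring prefix (internal node) to a real score;
    missing nodes default to 0, i.e. an even left/right split."""
    prob = 1.0
    for i, bit in enumerate(bitstring):
        p_right = sigmoid(node_scores.get(bitstring[:i], 0.0))
        prob *= p_right if bit == "1" else (1.0 - p_right)
    return prob
```

Because each internal node's two branch probabilities sum to one, the probabilities of all leaves sum to one by construction, with no explicit partition function.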
[Graphical model diagram: plates over tuples (W), characters (C), and documents (D); variables α, θ, p, w, r, m, β, µ, λ as defined below.]
P: Number of personas (hyperparameter)
D: Number of documents
Cd: Number of characters in document d
Wd,c: Number of (cluster, role) tuples for character c
md: Metadata for document d (ranges over M authors)
θd: Document d's distribution over personas
pd,c: Character c's persona (integer, p ∈ {1, . . . , P})
j: An index for a ⟨r, w⟩ tuple in the data
wj: Word cluster ID for tuple j (integer, w ∈ {1, . . . , V})
rj: Role for tuple j ∈ {agent, patient, mod, poss}
β: Coefficients for the log-linear language model
µ, λ: Laplace mean and scale (for regularizing β)
α: Dirichlet concentration hyperparameter
Figure 4.2: Top: Probabilistic graphical model. Observed variables are shaded, latent variables are clear, and collapsed variables are dotted. Bottom: Definition of variables.
The bitstring representations by which we encode each word in the vocabulary serve as natural, and inherently meaningful, intermediate classes that correspond to semantically related subsets of the vocabulary, with each bitstring prefix denoting one such class. Longer bitstrings correspond to more fine-grained classes. In the example shown in Figure 4.1, 011100111 is one such intermediate class, containing the union of pair, boots, shoes, gloves