Accelerating Text-as-Data Research in Computational Social Science

Dallas Card

August 2019
CMU-ML-19-109

Machine Learning Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee:
Noah A. Smith, Chair
Artur Dubrawski
Geoff Gordon
Dan Jurafsky (Stanford University)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2019 Dallas Card

This work was supported by a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship, NSF grant IIS-1211277, an REU supplement to NSF grant IIS-1562364, a Bloomberg Data Science Research Grant, a University of Washington Innovation Award, and computing resources provided by XSEDE.

Keywords: machine learning, natural language processing, computational social science, graphical models, interpretability, calibration, conformal methods

Abstract

Natural language corpora are phenomenally rich resources for learning about people and society, and have long been used as such by disciplines such as history and political science. Recent advances in machine learning and natural language processing are creating remarkable new possibilities for how scholars might analyze such corpora, but working with textual data brings its own unique challenges, and much of the research in computer science may not align with the desiderata of social scientists. In this thesis, I present a line of work on developing methods for computational social science, focused primarily on observational research using natural language text. Throughout, I take seriously the concerns and priorities of the social sciences, leading to a focus on aspects of machine learning which are otherwise sometimes secondary, including calibration, interpretability, and transparency.
Two ideas which unify this work are the problems of exploration and measurement, and as a running example I consider the problem of analyzing how news sources frame contemporary political issues. Following the introduction, I devote one chapter to providing the necessary background on computational social science, framing, and the “text as data” paradigm. Subsequent chapters each focus on a particular model or method that strives to address some aspect of research which may be of particular interest to social scientists. Chapters 3 and 4 focus on the unsupervised setting, with the former presenting a model for learning archetypal character representations, and the latter presenting a framework for neural document models which can flexibly incorporate metadata. Chapters 5 and 6 focus on the supervised setting and present, respectively, a method for measuring label proportions in text in the presence of domain shift, and a variation on deep learning classifiers which produces more transparent and robust predictions. The final chapter concludes with implications for computational social science and possible directions for future work.

Contents

1 Introduction
  1.1 Thesis statement
  1.2 Structure of this thesis
2 Background
  2.1 Computational social science and “text as data”
  2.2 Exploration and measurement
  2.3 Desiderata in computational social science methods
  2.4 Running example: Framing in the media
3 Inferring character and story types as an aspect of framing
  3.1 Introduction
  3.2 Model description
  3.3 Clustering stories
  3.4 Dataset
  3.5 Identifying entities
  3.6 Exploratory analysis
  3.7 Experiments: Personas and framing
    3.7.1 Experiment 1: Direct comparison
    3.7.2 Experiment 2: Automatic evaluation
  3.8 Qualitative evaluation
  3.9 Related work
  3.10 Summary
4 Modeling documents with metadata using neural variational inference
  4.1 Introduction
  4.2 Background and motivation
  4.3 Scholar: A neural topic model with covariates, supervision, and sparsity
    4.3.1 Generative story
  4.4 Learning and inference
    4.4.1 Prediction on held-out data
    4.4.2 Additional prior information
    4.4.3 Additional details
  4.5 Experiments and results
    4.5.1 Unsupervised evaluation
    4.5.2 Text classification
    4.5.3 Exploratory study
  4.6 Additional related work
  4.7 Summary
5 Estimating label proportions from annotations
  5.1 Introduction
  5.2 Problem definition
  5.3 Methods
    5.3.1 Proposed method: Calibrated probabilistic classify and count (PCCcal)
    5.3.2 Existing methods appropriate for extrinsic labels
    5.3.3 Existing methods appropriate for intrinsic labels
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Results
  5.5 Discussion
  5.6 Summary
6 Transparent and credible predictions using deep neural networks
  6.1 Introduction
  6.2 Background
    6.2.1 Scope and notation
    6.2.2 Nonparametric kernel regression
    6.2.3 Conformal methods
  6.3 Deep weighted averaging classifiers
    6.3.1 Model details
    6.3.2 Training
    6.3.3 Prediction and explanations
    6.3.4 Confidence and credibility
  6.4 Experiments
    6.4.1 Datasets
    6.4.2 Models and training
  6.5 Results
    6.5.1 Classification performance
    6.5.2 Interpretability and explanations
    6.5.3 Approximate explanations
    6.5.4 Confidence and credibility
  6.6 Discussion and future work
  6.7 Summary
7 Conclusion
  7.1 Summary of contributions
  7.2 Recurring themes
  7.3 Implications for computational social science
  7.4 Directions for future work
Bibliography

Acknowledgements

Writing an acknowledgements section is a powerful reminder that scholarship is a collective endeavour, and that we all depend on, and benefit so much from, the contributions of countless others, both personally and professionally. I certainly could not have completed this work without the help of innumerable colleagues, friends, and family members.
First and foremost, I would like to thank my advisor, Noah Smith, for his unbounded support, encouragement, and insight, and for creating an exceptional learning environment in which so many are able to thrive while continuing to challenge both themselves and each other. In addition, I wish to express my sincere thanks to all members of my thesis committee, my teachers, collaborators, and co-authors who have taught me so much, all current and former members of Noah’s ARK for their camaraderie and mentorship, the provocative interlocutors at the Tech Policy Lab for their stimulating conversation, all of my friends who have kept me sane and helped me to grow, the incredible support staff of the Machine Learning Department at Carnegie Mellon University, as well as everyone at the Paul G. Allen School at the University of Washington for providing me with a temporary second home.

While far from exhaustive, I would like to express a special gratitude to Alnur Ali, Waleed Ammar, David Bamman, Antoine Bosselut, Amber Boydstun, Ben Cowley, Tam Dang, Jesse Dodge, Artur Dubrawski, Chris Dyer, Nicholas FitzGerald, Alona Fyshe, Emily Kalah Gade, Geoff Gordon, Justin Gross, Suchin Gururangan, Ari Holtzman, George Ibrahim, Dan Jurafsky, Lingpeng Kong, Will Lowe, Bill McDowell, Darren McKee, Tom Mitchell, Jared Moore, Brendan O’Connor, Anthony Platanios, Aaditya Ramdas, Philip Resnik, Maarten Sap, Nathan Schneider, Dan Schwartz, Roy Schwartz, Swabha Swayamdipta, Yanchuan Sim, Diane Stidle, Chenhao Tan, Margot Taylor, Sam Thompson, Ryan Tibshirani, Leron Vandsburger, Hanna Wallach, Jing Xiang, Mark Yatskar, Dani Yogatama, and Michael Zhang.

Above all, I would like to thank my family, especially my parents, for their love and support, and for giving me the confidence and curiosity to continue exploring and follow this path, wherever it might lead.

Chapter 1

Introduction

Over the past decade, the field of