Statistical Text Analysis for Social Science
Total Page:16
File Type:pdf, Size:1020Kb
Statistical Text Analysis for Social Science Brendan T. O’Connor August 2014 CMU-ML-14-101 Statistical Text Analysis for Social Science Brendan T. O’Connor August 2014 CMU-ML-14-101 Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, PA, USA Thesis Committee: Noah A. Smith, chair Tom Mitchell Cosma Shalizi Gary King, Harvard University Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy This work was supported by NSF CAREER IIS0644225, NSF CAREER IIS1054319, NSF grant IIS1211277, NSF grant IIS1251131, DARPA grant N10AP20042, IARPA contract D12PC00347, an Alfred P. Sloan grant, Mi- crosoft’s support for Machine Learning graduate students, Google’s support of the Worldly Knowledge project at Carnegie Mellon University, the Berkman Faculty Development Fund at Carnegie Mellon University, the Center for Applied Research in Technology at the Tepper School of Business, computing resources from the Open Source Data Cloud (Grossman et al. (2012), sponsored by the Open Cloud Consortium, the Gordon and Betty Moore Foundation, the NSF, and the University of Chicago) and computing resources from the Pittsburgh Supercomputing Center. Keywords: computational social science, natural language processing, text mining, quantitative text analysis, machine learning, probabilistic graphical models, Bayesian statistics, exploratory data analysis, social media, opinion polling, sociolinguistics, event data, international relations. 2 Abstract What can text corpora tell us about society? How can automatic text analysis algorithms efficiently and reliably analyze the social processes revealed in language production? This work develops statistical text analyses of dynamic social and news media datasets to ex- tract indicators of underlying social phenomena, and to reveal how social factors guide linguistic production. This is illustrated through three case studies: first, examining whether sentiment expressed in social media can track opinion polls on economic and political topics (Chapter 3); second, analyzing how novel online slang terms can be very specific to geographic and demo- graphic communities, and how these social factors affect their transmission over time (Chapters 4 and 5); and third, automatically extracting political events from news articles, to assist analyses of the interactions of international actors over time (Chapter 6). We demonstrate a variety of computational, linguistic, and statistical tools that are employed for these analyses, and also contribute MiTextExplorer, an interactive system for exploratory anal- ysis of text data against document covariates, whose design was informed by the experience of researching these and other similar works (Chapter 2). These case studies illustrate recurring themes toward developing text analysis as a social science methodology: computational and sta- tistical complexity, and domain knowledge and linguistic assumptions. 3 Acknowledgments This work has been made possible only through rich collaborations and conversations with a wide variety of colleagues at both Carnegie Mellon (including the Noah’s ARK lab and the Machine Learning Department) and many other institutions. Thanks to everyone for all the support and learning that influenced this work, from people including but not limited to: Yisong Yue, Dani Yogatama, Tae Yano, Min Xu, Eric Xing, Matt Wytock, Hanna Wallach, Johan Ugander, Dustin Tingley, Ryan Tibshirani, Sam Thomson, Andrew Thomas, Eric Sun, Brandon Stewart, Rion Snow, Alex Smola, Noah Smith, Yanchuan Sim, Cosma Shalizi, Phil Schrodt, Nathan Schneider, Aaron Schein, Bryan Routledge, Molly Roberts, Philip Resnik, Joseph Reisinger, Dan Ramage, Patrick Perry, Olutobi Owoputi, John Myles White, Burt Monroe, Tom Mitchell, Brendan Meeder, Andre Martins, Cameron Marlow, Yucheng Low, Michael Love, Jayant Krishnamurthy, Michel Krieger, Lingpeng Kong, Gary King, Rob Kass, Venky Iyer, Muhammed Idris, Qirong Ho, Michael Heil- man, Alex Hanna, Aria Haghighi, Forest Gregg, Joey Gonzalez, Kevin Gimpel, Matt Gardner, Alona Fyshe, Seth Flaxman, Jeffrey Flanigan, Jenny Finkel, Khalid El-Arini, Jacob Eisenstein, Chris Dyer, Jesse Dodge, Dipanjan Das, Justin Cranshaw, Fred Conrad, William Cohen, Shay Cohen, Jonathan Chang, Andrew Carlson, Dallas Card, Lukas Biewald, Justin Betteridge, John Beieler, David Bamman, Ramnath Balasubramanyan, Erin Baggott, Lars Backstrom, Waleed Ammar, and David Ahn. My apologies for any omissions. Thanks most of all to my family and friends for all their support and encouragement. 4 Contents Contents 5 1 Text analysis for the social sciences 8 1.1 Introduction . 8 1.2 Text analysis methods . 11 1.3 Thesis statement . 14 2 MiTextExplorer: Interactive exploratory text and covariate analysis 15 2.1 Introduction . 15 2.2 Motivation: Can we “just look” at statistical text data? . 16 2.3 MITEXTEXPLORER: linked brushing for text and covariate correlations . 18 2.3.1 Covariate-driven queries . 19 2.3.2 Term selection and KWIC views . 20 2.3.3 Term-association queries . 21 2.3.4 Pinned terms . 22 2.4 Statistical term association measures . 23 2.5 Related work: Exploratory text analysis . 24 2.6 Future work . 26 3 Text sentiment and opinion polling 27 3.1 Introduction . 27 3.2 Data . 27 3.2.1 Twitter Corpus . 28 3.2.2 Public Opinion Polls . 28 3.3 Text Analysis . 29 3.3.1 Message Retrieval . 30 3.3.2 Opinion Estimation . 31 3.3.3 Comparison to Related Work . 32 3.4 Moving Average Aggregate Sentiment . 32 3.5 Correlation Analysis: Is text sentiment a leading indicator of polls? . 33 3.5.1 Obama 2009 Job Approval and 2008 Elections . 37 3.6 Measurement bias and inference goals . 38 3.7 Conclusion . 40 5 4 Mixed-membership analysis of geographic and demographic lexical variation in social media 41 4.1 Introduction . 41 4.2 Geographic topic model and lexical variation . 42 4.2.1 Synopsis . 42 4.2.2 Data . 43 4.2.3 Model . 44 4.2.4 Inference . 46 4.2.5 Implementation . 47 4.2.6 Evaluation . 48 4.2.7 Analysis . 51 4.2.8 Related Work . 53 4.3 Demographic lexical variation through U.S. Census data . 54 4.3.1 Introduction . 54 4.3.2 Data . 54 4.3.3 Model . 55 4.3.4 Analysis . 56 4.4 Conclusion and reflections on textual social data analysis . 58 5 Social determinants of linguistic diffusion in social media 60 5.1 Synopsis . 60 5.2 Introduction . 60 5.3 Materials and methods . 62 5.3.1 Modeling spatiotemporal lexical dynamics in social media data . 63 5.3.2 Constructing a network of linguistic diffusion . 70 5.3.3 Geographic and demographic correlates of linguistic diffusion . 71 5.4 Results . 73 5.5 Discussion . 75 5.6 Appendix: Data processing . 76 5.6.1 Messages . 77 5.6.2 Words . 78 6 International Events from the News 85 6.1 Synopsis . 85 6.2 Introduction . 85 6.3 Data . 86 6.4 Model . 87 6.4.1 Smoothing Frames Across Time . 89 6.5 Inference . 89 6.6 Experiments . 90 6.6.1 Lexical Scale Impurity . 90 6.6.2 Conflict Detection . 91 6.6.3 Results . 93 6.6.4 Case study . 94 6.7 Related Work . 95 6.7.1 Events Data in Political Science . 95 6.7.2 Events in Natural Language Processing . 96 6.8 Conclusion . 96 6 6.9 Appendix: Evaluation . 97 6.9.1 Type-level calculation of lexical scale impurity . 97 6.9.2 TABARI lexicon matching . 97 6.10 Appendix: Details on Inference . 98 6.10.1 Language Submodel [z j η] ............................. 99 6.10.2 Context Submodel [α; β j η] ............................. 99 6.10.3 Logistic Normal [η j z; η¯] ...............................101 6.10.4 Learning concentrations and variances . 102 6.10.5 Softmax bug . 103 7 Conclusion 105.