Statistical Text Analysis for Social Science
Statistical Text Analysis for Social Science Brendan T. O’Connor August 2014 CMU-ML-14-101 Statistical Text Analysis for Social Science Brendan T. O’Connor August 2014 CMU-ML-14-101 Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, PA, USA Thesis Committee: Noah A. Smith, chair Tom Mitchell Cosma Shalizi Gary King, Harvard University Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy This work was supported by NSF CAREER IIS0644225, NSF CAREER IIS1054319, NSF grant IIS1211277, NSF grant IIS1251131, DARPA grant N10AP20042, IARPA contract D12PC00347, an Alfred P. Sloan grant, Mi- crosoft’s support for Machine Learning graduate students, Google’s support of the Worldly Knowledge project at Carnegie Mellon University, the Berkman Faculty Development Fund at Carnegie Mellon University, the Center for Applied Research in Technology at the Tepper School of Business, computing resources from the Open Source Data Cloud (Grossman et al. (2012), sponsored by the Open Cloud Consortium, the Gordon and Betty Moore Foundation, the NSF, and the University of Chicago) and computing resources from the Pittsburgh Supercomputing Center. Keywords: computational social science, natural language processing, text mining, quantitative text analysis, machine learning, probabilistic graphical models, Bayesian statistics, exploratory data analysis, social media, opinion polling, sociolinguistics, event data, international relations. 2 Abstract What can text corpora tell us about society? How can automatic text analysis algorithms efficiently and reliably analyze the social processes revealed in language production? This work develops statistical text analyses of dynamic social and news media datasets to ex- tract indicators of underlying social phenomena, and to reveal how social factors guide linguistic production.
[Show full text]