ABSTRACT Guided Probabilistic Topic Models for Agenda-Setting
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Title of dissertation: Guided Probabilistic Topic Models for Agenda-setting and Framing Viet-An Nguyen, Doctor of Philosophy, 2015 Dissertation directed by: Professor Philip Resnik Department of Linguistics and Institute for Advanced Computer Studies Professor Jordan Boyd-Graber College of Information Studies Institute for Advanced Computer Studies Probabilistic topic models are powerful methods to uncover hidden thematic structures in text by projecting each document into a low dimensional space spanned by a set of topics. Given observed text data, topic models infer these hidden struc- tures and use them for data summarization, exploratory analysis, and predictions, which have been applied to a broad range of disciplines. Politics and political conflicts are often captured in text. Traditional ap- proaches to analyze text in political science and other related fields often require close reading and manual labeling, which is labor-intensive and hinders the use of large-scale collections of text. Recent work, both in computer science and political science, has used automated content analysis methods, especially topic models to substantially reduce the cost of analyzing text at large scale. In this thesis, we follow this approach and develop a series of new probabilistic topic models, guided by additional information associated with the text, to discover and analyze agenda- setting (i.e., what topics people talk about) and framing (i.e., how people talk about those topics), a central research problem in political science, communication, public policy and other related fields. We first focus on study agendas and agenda control behavior in political de- bates and other conversations. The model we introduce, Speaker Identity for Topic Segmentation (SITS), is able to discover what topics that are talked about during the debates, when these topics change, and a speaker-specific measure of agenda control. To make the analysis process more effective, we build Argviz, an interac- tive visualization which leverages SITS's outputs to allow users to quickly grasp the conversational topic dynamics, discover when the topic changes and by whom, and interactively visualize the conversation's details on demand. We then analyze policy agendas in a more general setting of political text. We present the Label to Hier- archy (L2H) model to learn a hierarchy of topics from multi-labeled data, in which each document is tagged with multiple labels. The model captures the dependencies among labels using an interpretable tree-structured hierarchy, which helps provide insights about the political attentions that policymakers focus on, and how these policy issues relate to each other. We then go beyond just agenda-setting and expand our focus to framing|the study of how agenda issues are talked about, which can be viewed as second-level agenda-setting. To capture this hierarchical views of agendas and frames, we in- troduce the Supervised Hierarchical Latent Dirichlet Allocation (SHLDA) model, which jointly captures a collection of documents, each is associated with a contin- uous response variable such as the ideological position of the document's author on a liberal-conservative spectrum. In the topic hierarchy discovered by SHLDA, higher-level nodes map to more general agenda issues while lower-level nodes map to issue-specific frames. Although qualitative analysis shows that the topic hier- archies learned by SHLDA indeed capture the hierarchical view of agenda-setting and framing motivating the work, interpreting the discovered hierarchy still incurs moderately high cost due to the complex and abstract nature of framing. Mo- tivated by improving the hierarchy, we introduce Hierarchical Ideal Point Topic Model (HIPTM) which jointly models a collection of votes (e.g., congressional roll call votes) and both the text associated with the voters (e.g., members of Congress) and the items (e.g., congressional bills). Customized specifically for capturing the two-level view of agendas and frames, HIPTM learns a two-level hierarchy of topics, in which first-level nodes map to an interpretable policy issue and second-level nodes map to issue-specific frames. In addition, instead of using pre-computed response variable, HIPTM also jointly estimates the ideological positions of voters on multiple interpretable dimensions. Guided Probabilistic Topic Models for Agenda-setting and Framing by Viet-An Nguyen Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2015 Advisory Committee: Professor Philip Resnik, Chair/Co-Advisor Professor Jordan Boyd-Graber, Co-Advisor Professor H´ectorCorrada Bravo Professor Hal Daum´eIII Professor Wayne McIntosh Professor Hanna Wallach c Copyright by Viet-An Nguyen 2015 Acknowledgments I am eternally indebted to the many people, whose help and support have made this dissertation possible. I can only hope that the following words can express just how grateful I am to have such amazing people in my life. First and foremost, I wholeheartedly thank my co-advisors, Philip Resnik and Jordan Boyd-Graber, for their continuous guidance and advice throughout my grad- uate career. They have been an ideal team, providing as much support as I needed while giving me the freedom to explore plenty of sidetracks. In particular, I would like to thank Jordan for guiding me patiently through many technical challenges and for helping me tirelessly in preparing and presenting our work at various con- ferences. I am grateful to Philip for pushing me hard to make sure this thesis is coherent and well-written. I also very much appreciate his delicious dinners during his sabbatical. Many thanks also go to the other members of my committee, H´ectorCorrada Bravo, Hal Daum´eIII, Wayne McIntosh, and Hanna Wallach, for their insightful questions and valuable feedbacks, which provide new perspectives and help establish new connections to improve this thesis. I also like to thank Hal for his helpful com- ments on my various practice talks and for his excellent Computational Linguistics I course which introduced me to and got me interested in NLP. I also thank Hanna for her great dissertation, from which I have learned so much about hierarchical topic models and slice sampling. I have been very fortunate to have the opportunities to work with wonderful ii coauthors and collaborators. I thank Deborah Cai, Jennifer Midberry, and Yuanxin Wang for annotating excellent sets of data; Stephen Altschul for his great lectures and encouragements which allowed me to explore my way in an alien land called biology; and Kristina Miler for tirelessly providing helpful feedbacks and insights, without which Chapter6 would not have been possible. Various ideas in this thesis were shaped and influenced through the many con- versations with my colleagues in the CLIP lab. Yuening Hu, Ke Zhai and Mohit Iyyer deserve a special mention for being such great friends and collaborators. I cherished my discussions with Amit Goyal, Vlad Eidelman, Ke Wu, Taesun Moon, Alvin Grissom II, Leonardo Claudino, Naho Orita, Thang Nguyen, Andy Garron, and Peter Enns. I enjoyed interacting with Eric Hardisty, Chang Hu, Kristy Holling- shead, Earl Wagner, Greg Sanders, John Morgan, Junhui Li, Jiarong Jiang, He He, Sudha Rao, Snigdha Chaturvedi, Jagadeesh Jagarlamudi, Ferhan Ture, and Ning Gao. I also want to especially thank Joe Webster, Janice Perrone, and other UMI- ACS staffs for their helps and supports. During my PhD journey, I had the chance to spend two wonderful summers interning at Facebook. I am extremely grateful to all the people who helped make my time at Facebook such a blast including Cameron Marlow, Danny Ferrante, Sofus Macskassy, and other members of the Core Data Science team. My special thanks go to Jonathan Chang and Carlos Diuk for their guidance and mentorship and Nicole Gurries for opening up the opportunity for me in the first place. Of course, my past few years would have been much more difficult and boring without my dear friends: Bao Nguyen, Chanh Kieu, My Le, Ha Nguyen, Tho Ho, iii Toan Ngo, Hien Tran, Hien Nguyen, and Dzung Ta. Thank you all. Finally, I want to thank my family, although I know there aren't enough words to describe my gratitude to them. I thank my parents for their unconditional love; for always believing in me and teaching me to believe in myself; for having sacrificed greatly to provide for me an amazing education and let me pursue my dreams. I thank my sister for her endless support, advice and encouragement. And most importantly, I thank my beloved wife Yen, who has always been there with me, sharing every moment, through all the highs and the lows in this adventure. I know, together, we could overcome any obstacles and accomplish anything. iv Table of Contents List of Tables ix List of Figures xii 1 Introduction1 1.1 The Needs for Automated Methods for Analyzing Political Text in the Big Data Era.............................1 1.2 The Importance of Agendas and Frames in Political Science Research4 1.3 Analyzing Agendas and Frames: Methods and Costs.........5 1.3.1 Human Coding..........................8 1.3.2 Supervised Learning.......................9 1.3.3 Topic Modeling.......................... 11 1.4 Automated Content Analysis at Lower Cost.............. 12 1.4.1 Analyzing Agendas and Agenda Control in Political Debates. 15 1.4.2 Learning Agenda Hierarchy from Multi-labeled Legislative Text 16 1.4.3 Discovering Agendas and Frames Discovery in Ideologically Polarized Text........................... 18 1.4.4 Discovering Agendas and Frames from Roll Call Votes and Congressional Floor Debates................... 20 1.5 Main Technical Contributions...................... 22 2 Probabilistic Topic Modeling Foundations 23 2.1 Latent Dirichlet Allocation: The Basic Topic Model.......... 24 2.2 Beyond LDA: Topic Modeling Extensions................ 27 2.2.1 Using Bayesian Nonparametrics................. 28 2.2.2 Incorporating Metadata..................... 33 2.2.3 Adding Hierarchical Structure.................