
Computer-Assisted Topic Classification for Mixed-Methods Social Science Research

Dustin Hillard, Stephen Purpura, and John Wilkerson

ABSTRACT. Social scientists interested in mixed-methods research have traditionally turned to human annotators to classify the documents or events used in their analyses. The rapid growth of digitized government documents in recent years presents new opportunities for research but also new challenges. With more and more data coming online, relying on human annotators becomes prohibitively expensive for many tasks. For researchers interested in saving time and money while maintaining confidence in their results, we show how a particular supervised learning system can provide estimates of the class of each document (or event). This system maintains high classification accuracy and provides accurate estimates of document proportions, while achieving reliability levels associated with human efforts. We estimate that it lowers the costs of classifying large numbers of complex documents by 80% or more.

KEYWORDS. Topic classification, data mining, machine learning, content analysis, information retrieval, text annotation, Congress, legislation

Dustin Hillard is a Ph.D. candidate in the Department of Electrical Engineering, University of Washington. Stephen Purpura is a Ph.D. student in Information Science, Cornell University. John Wilkerson is Associate Professor of Political Science at the University of Washington. This project was made possible with support of NSF grants SES-0429452, SES-00880066, SES-0111443, and SES-00880061. An earlier version of the paper was presented at the Coding Across the Disciplines Workshop (NSF grant SES-0620673). The views expressed are those of the authors and not the National Science Foundation. We thank Micah Altman, Frank Baumgartner, Matthew Baum, Jamie Callan, Claire Cardie, Kevin Esterling, Eduard Hovy, Aleks Jakulin, Thorsten Joachims, Bryan Jones, David King, David Lazer, Lillian Lee, Michael Neblo, James Purpura, Julianna Rigg, Jesse Shapiro, and Richard Zeckhauser for their helpful comments. Address correspondence to: John Wilkerson, Box 353530, University of Washington, Seattle, WA 98195 (E-mail: [email protected]).

Journal of Information Technology & Politics, Vol. 4(4), 2007. Available online at http://jitp.haworthpress.com. © 2007 by The Haworth Press. All rights reserved. doi:10.1080/19331680801975367

Technological advances are making vast amounts of data on government activity newly available, but often in formats that are of limited value to researchers as well as citizens. In this article, we investigate one approach to transforming these data into useful information. "Topic classification" refers to the process of assigning individual documents (or parts of documents) to a limited set of categories. It is widely used to facilitate search as well as in the study of patterns and trends. To pick an example of interest to political scientists, a user of the Library of Congress' THOMAS Web site (http://thomas.loc.gov) can use its Legislative Indexing Vocabulary (LIV) to search for congressional legislation on a given topic. Similarly, a user of a commercial Internet service turns to a topic classification system when searching, for example, Yahoo! Flickr for photos of cars or Yahoo! Personals for postings by men seeking women.

Topic classification is valued for its ability to limit search results to documents that closely match the user's interests, compared to less selective keyword-based approaches. However, a central drawback of these systems is their high cost. Humans, who must be trained and supervised, traditionally do the labeling. Although human annotators become somewhat more efficient with time and experience, the marginal cost of coding each document does not really decline as the scope of the project expands. This has led many researchers to question the value of such labor-intensive approaches, especially given the availability of computational approaches that require much less human intervention.

Yet there are also good reasons to cling to a proven approach. For the task of topic classification, computational approaches are useful only to the extent that they "see" the patterns that interest humans. A computer can quickly detect patterns in data, such as the number of E's in a record, and can then very quickly organize a dataset according to those patterns. But computers do not necessarily detect the patterns that interest researchers. If those patterns are easy to objectify (e.g., any document that mentions George W. Bush), then machines will work well. The problem, of course, is that many of the phenomena that interest people defy simple definitions. "Bad" can mean good, or bad, depending on the context in which it is used. Humans are simply better at recognizing such distinctions, although computerized methods are closing the gap.

Technology becomes increasingly attractive as the size and complexity of a classification task increase. But what do we give up in terms of accuracy and reliability when we adopt a particular automated approach? In this article, we begin to investigate this accuracy/efficiency tradeoff in a particular context. We begin by describing the ideal topic classification system where the needs of social science researchers are concerned. We then review existing applications of computer-assisted methods in political science before turning to a method that has received limited attention in the discipline to date: supervised learning systems.

The Congressional Bills Project (http://www.congressionalbills.org) currently includes approximately 379,000 congressional bill titles that trained human researchers have assigned to one of 20 major topic and 226 subtopic categories, with high levels of inter-annotator reliability.1 We draw on this corpus to test several supervised learning algorithms that use case-based2 or "learning by example" methods to replicate the work of human annotators. We find that some algorithms perform our particular task better than others. However, combining results from individual machine learning methods increases accuracy beyond that of any single method, and provides key signals of confidence regarding the assigned topic for each document. We then show how this simple confidence estimate can be employed to achieve additional classification accuracy more efficiently than would otherwise be possible.
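The article details its specific system later; as a rough illustration of the general recipe only, the sketch below trains several off-the-shelf "learning by example" classifiers on labeled titles and treats their level of agreement as the confidence signal. The models, features, toy data, and threshold are our own illustrative assumptions, not the authors' actual configuration.

    # Sketch of the general recipe only: several "learning by example"
    # classifiers vote on a topic, and their agreement serves as a
    # confidence signal. Models, features, data, and threshold are
    # illustrative assumptions, not the configuration used in the paper.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Toy stand-ins for human-annotated bill titles and major topic codes.
    train_titles = [
        "A bill to amend the Clean Air Act",
        "A bill to increase the minimum wage",
        "A bill to authorize appropriations for the Coast Guard",
    ]
    train_labels = [7, 5, 16]  # hypothetical major topic codes

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(train_titles)

    models = [LogisticRegression(max_iter=1000), MultinomialNB(), LinearSVC()]
    for m in models:
        m.fit(X_train, train_labels)

    def classify_with_confidence(title):
        """Return (majority-vote label, share of models that agree with it)."""
        x = vectorizer.transform([title])
        votes = [m.predict(x)[0] for m in models]
        best = max(set(votes), key=votes.count)
        return best, votes.count(best) / len(models)

    label, conf = classify_with_confidence("A bill to reduce air pollution")
    if conf < 1.0:  # illustrative threshold: any disagreement goes to a human
        print(f"topic {label} (confidence {conf:.2f}): route to human annotator")
    else:
        print(f"topic {label} assigned automatically")

Routing only the low-agreement documents to human annotators is what makes large cost savings plausible: human effort concentrates on the hard cases, while unanimous predictions are accepted automatically.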
TOPIC CLASSIFICATION FOR SOCIAL SCIENCE DOCUMENT RETRIEVAL

Social scientists are interested in topic classification for two related reasons: retrieving individual documents and tracing patterns and trends in issue-related activity. Mixed-method studies that combine pattern analyses with case-level investigations are becoming standard, and linked examples are often critical to persuading readers to accept statistical findings (King, Keohane, & Verba, 1994). In Soft News Goes to War, for example, Baum (2003) draws on diverse corpora to analyze media coverage of war (e.g., transcripts of "Entertainment Tonight," the jokes of Jon Stewart's "The Daily Show," and network news programs).

Keyword searches are fast and may be effective for the right applications, but effective keyword searches can also be difficult to construct without knowing what is actually in the data. A search that is too narrow in scope (e.g., "renewable energy") will omit relevant documents, while one that is too broad (e.g., "solar") will generate unwanted false positives. In fact, most modern search engines, such as Google, explicitly reject comprehensive topic coverage as a design criterion.3
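A toy filter over invented bill titles makes this tradeoff concrete; the titles and query terms below are our own examples, not drawn from the corpus.

    # Sketch: a narrow keyword filter misses relevant documents, while a
    # broad one admits irrelevant ones. Titles and queries are invented.
    titles = [
        "A bill to extend tax credits for renewable energy production",
        "A bill to promote wind and solar electricity generation",
        "A bill to regulate solar radiation exposure in tanning salons",
    ]

    def matches(title, phrase):
        return phrase in title.lower()

    narrow = [t for t in titles if matches(t, "renewable energy")]
    broad = [t for t in titles if matches(t, "solar")]

    print(len(narrow))  # 1: the wind-and-solar bill is a false negative
    print(len(broad))   # 2: the tanning-salon bill is a false positive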
Many political scientists rely on existing databases where humans have classified events (decisions, votes, media attention, legislation) according to a predetermined topic system (e.g., Jones & Baumgartner, 2005; Poole & Rosenthal, 1997; Rohde, 2004; Segal & Spaeth, 2002). In addition to enabling scholars to study trends and compare patterns of activity, reliable topic classification can save considerable research time. For example, Adler and Wilkerson (2008) wanted to use the Congressional Bills Project database to study the impact of congressional reforms. To do this, they needed to trace how alterations in congressional committee jurisdictions affected bill referrals. The fact that every bill during the years of interest had already been annotated for topic allowed them to reduce the number of bills that had to be individually examined. Consistent annotation also ensures that researchers do not confuse changes in legislative attention with shifts in coding protocol (Baumgartner, Jones, & Wilkerson, 2002).

So, what type of topic classification system best serves the needs of social scientists? If the goals are to facilitate trend tracing and document search, an ideal system possesses the following characteristics. First, it should be discriminating. By this we mean that the topic categories are mutually exclusive and span the entire agenda of topics. Search requires that the system indicate what each document is primarily about, while trend tracing is made more difficult if the same document is assigned to multiple categories. Second, it should be accurate. The assigned topic should reflect the document's content, and there should be a systematic way of assessing accuracy. Third, the ideal system should be reliable. Pattern and trend tracing require that similar documents be classified consistently over time.
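The reliability criterion is typically quantified as chance-corrected inter-annotator agreement. The excerpt above does not specify which statistic the project uses; Cohen's kappa, sketched below on invented topic codes, is one standard choice.

    # Sketch: Cohen's kappa, a chance-corrected measure of agreement
    # between two annotators. Topic codes below are invented.
    from collections import Counter

    def cohens_kappa(a, b):
        """kappa = (p_observed - p_expected) / (1 - p_expected)."""
        n = len(a)
        p_obs = sum(x == y for x, y in zip(a, b)) / n
        counts_a, counts_b = Counter(a), Counter(b)
        p_exp = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
        return (p_obs - p_exp) / (1 - p_exp)

    annotator_1 = [7, 7, 5, 16, 5, 7, 16, 5]
    annotator_2 = [7, 7, 5, 16, 7, 7, 16, 5]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")  # 0.81

Values near 1 indicate agreement well beyond chance. A supervised system can be judged against the same yardstick, by asking whether machine-human agreement approaches human-human agreement.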