Computer-Assisted Topic Classification for Mixed-Methods Social Science

Dustin Hillard, Stephen Purpura, and John Wilkerson

ABSTRACT. Social scientists interested in mixed-methods research have traditionally turned to human annotators to classify the documents or events used in their analyses. The rapid growth of digitized government documents in recent years presents new opportunities for research but also new challenges. With more and more data coming online, relying on human annotators becomes prohibitively expensive for many tasks. For researchers interested in saving time and money while maintaining confidence in their results, we show how a particular supervised learning system can provide estimates of the class of each document (or event). This system maintains high classification accuracy and provides accurate estimates of document proportions, while achieving reliability levels associated with human efforts. We estimate that it lowers the costs of classifying large numbers of complex documents by 80% or more.

KEYWORDS. Topic classification, data mining, machine learning, content analysis, information retrieval, text annotation, Congress, legislation

Dustin Hillard is a Ph.D. candidate in the Department of Electrical Engineering, University of Washington. Stephen Purpura is a Ph.D. student at Cornell University. John Wilkerson is Associate Professor of Political Science at the University of Washington. This project was made possible with support of NSF grants SES-0429452, SES-00880066, SES-0111443, and SES-00880061. An earlier version of the paper was presented at the Coding Across the Disciplines Workshop (NSF grant SES-0620673). The views expressed are those of the authors and not the National Science Foundation. We thank Micah Altman, Frank Baumgartner, Matthew Baum, Jamie Callan, Claire Cardie, Kevin Esterling, Eduard Hovy, Aleks Jakulin, Thorsten Joachims, Bryan Jones, David King, David Lazer, Lillian Lee, Michael Neblo, James Purpura, Julianna Rigg, Jesse Shapiro, and Richard Zeckhauser for their helpful comments. Address correspondence to: John Wilkerson, Box 353530, University of Washington, Seattle, WA 98195 (E-mail: [email protected]).

Journal of Information Technology & Politics, Vol. 4(4), 2007. Available online at http://jitp.haworthpress.com. doi:10.1080/19331680801975367

Technological advances are making vast amounts of data on government activity newly available, but often in formats that are of limited value to researchers as well as citizens. In this article, we investigate one approach to transforming these data into useful information. "Topic classification" refers to the process of assigning individual documents (or parts of documents) to a limited set of categories. It is widely used to facilitate search as well as in the study of patterns and trends. To pick an example of interest to political scientists, a user of the Library of Congress' THOMAS Web site (http://thomas.loc.gov) can use its Legislative Indexing Vocabulary (LIV) to search for congressional legislation on a given topic. Similarly, a user of a commercial Internet service turns to a topic classification system when searching, for example, Yahoo! Flickr for photos of cars or Yahoo! Personals for postings by men seeking women. Topic classification is valued for its ability to limit search results to documents that closely match the user's interests, when compared to less selective keyword-based approaches. However, a central drawback of these systems is their high costs. Humans—who must be trained and supervised—traditionally do the labeling. Although human annotators become somewhat more efficient with time and experience, the marginal cost of coding each document does not really decline as the scope of the project expands. This has led many researchers to question the value of such labor-intensive approaches, especially given the availability of computational approaches that require much less human intervention.

Yet there are also good reasons to cling to a proven approach. For the task of topic classification, computational approaches are useful only to the extent that they "see" the patterns that interest humans. A computer can quickly detect patterns in data, such as the number of E's in a record. It can then very quickly organize a dataset according to those patterns. But computers do not necessarily detect the patterns that interest researchers. If those patterns are easy to objectify (e.g., any document that mentions George W. Bush), then machines will work well. The problem, of course, is that many of the phenomena that interest people defy simple definitions. "Bad" can mean good—or bad—depending on the context in which it is used. Humans are simply better at recognizing such distinctions, although computerized methods are closing the gap.

Technology becomes increasingly attractive as the size and complexity of a classification task increase. But what do we give up in terms of accuracy and reliability when we adopt a particular automated approach? In this article, we begin to investigate this accuracy/efficiency tradeoff in a particular context. We begin by describing the ideal topic classification system where the needs of social science researchers are concerned. We then review existing applications of computer-assisted methods in political science before turning our attention to a method that has generated limited attention within political science to date: supervised learning systems.

The Congressional Bills Project (http://www.congressionalbills.org) currently includes approximately 379,000 congressional bill titles that trained human researchers have assigned to one of 20 major topic and 226 subtopic categories, with high levels of interannotator reliability.1 We draw on this corpus to test several supervised learning algorithms that use case-based2 or "learning by example" methods to replicate the work of human annotators. We find that some algorithms perform our particular task better than others. However, combining results from individual machine learning methods increases accuracy beyond that of any single method, and provides key signals of confidence regarding the assigned topic for each document. We then show how this simple confidence estimate can be employed to achieve additional classification accuracy more efficiently than would otherwise be possible.

TOPIC CLASSIFICATION FOR SOCIAL SCIENCE DOCUMENT RETRIEVAL

Social scientists are interested in topic classification for two related reasons: retrieving individual documents and tracing patterns and trends in issue-related activity. Mixed-method studies that combine pattern analyses with case-level investigations are becoming standard, and linked examples are often critical to persuading readers to accept statistical findings (King, Keohane, & Verba, 1994). In Soft News Goes to War, for example, Baum (2003) draws on diverse corpora to analyze media coverage of war (e.g., transcripts of "Entertainment Tonight," the jokes of Jon Stewart's "The Daily Show," and network news programs).

Keyword searches are fast and may be effective for the right applications, but effective keyword searches can also be difficult to construct without knowing what is actually in the data. A search that is too narrow in scope (e.g., "renewable energy") will omit relevant documents, while one that is too broad (e.g., "solar") will generate unwanted false positives. In fact, most modern search engines, such as Google, consciously reject producing a reasonably comprehensive list of results related to a topic as a design criterion.3

Many political scientists rely on existing databases where humans have classified events (decisions, votes, media attention, legislation) according to a predetermined topic system (e.g., Jones & Baumgartner, 2005; Poole & Rosenthal, 1997; Rohde, 2004; Segal & Spaeth, 2002). In addition to enabling scholars to study trends and compare patterns of activity, reliable topic classification can save considerable research time. For example, Adler and Wilkerson (2008) wanted to use the Congressional Bills Project database to study the impact of congressional reforms. To do this, they needed to trace how alterations in congressional committee jurisdictions affected bill referrals. The fact that every bill during the years of interest had already been annotated for topic allowed them to reduce the number of bills that had to be individually inspected from about 100,000 to "just" 8,000.

Topic classification systems are also widely used in the private sector and in government. However, a topic classification system created for one purpose is not necessarily suitable for another. Well-known document retrieval systems such as the Legislative Indexing Vocabulary of the Library of Congress' THOMAS Web site allow researchers to search for documents using preconstructed topics (http://thomas.loc.gov/liv/livtoc.html), but the THOMAS Legislative Indexing Vocabulary is primarily designed to help users (congressional staff, lobbyists, lawyers) track down contemporary legislation. This contemporary focus creates the potential for "topic drift," whereby similar documents are classified differently over time as users' conceptions of what they are looking for change.4 For example, "women's rights" did not exist as a category in the THOMAS system until sometime after 1994. The new category likely was created to serve current users better, but earlier legislation related to women's rights was not reclassified to ensure intertemporal comparability. Topic drift may be of little concern where contemporary search is concerned, but it is a problem for researchers hoping to compare legislative activity or attention across time. If the topic categories are changing, researchers risk confusing shifts in the substance of legislative attention with shifts in coding protocol (Baumgartner, Jones, & Wilkerson, 2002).

So, what type of topic classification system best serves the needs of social scientists? If the goals are to facilitate trend tracing and document search, an ideal system possesses the following characteristics. First, it should be discriminating. By this we mean that the topic categories are mutually exclusive and span the entire agenda of topics. Search requires that the system indicate what each document is primarily about, while trend tracing is made more difficult if the same document is assigned to multiple categories. Second, it should be accurate. The assigned topic should reflect the document's content, and there should be a systematic way of assessing accuracy. Third, the ideal system should be reliable. Pattern and trend tracing require that similar documents be classified similarly from one period to the next, even if the terminology used to describe those documents is changing. For example, civil rights issues have been framed very differently from one decade to the next. If the goal is to compare civil rights attention over time, then the classification system must accurately capture attention despite these changing frames. Fourth, it should be probabilistic. In addition to discriminating a document's primary topic, a valuable topic system for search should also identify those documents that address the topic even though they are not primarily about that topic. Finally, it should be efficient. The less costly the system is to implement, the greater its value.

Human-centered approaches are attractive because they meet most of these standards. Humans can be trained to discriminate the main purpose of a document, and their performance can be monitored until acceptable levels of accuracy and reliability are achieved. However, human annotation is also costly. In this article, we ask whether supervised machine learning methods can achieve similar levels of accuracy and reliability while improving efficiency.

We begin by contrasting our approach to several computer-assisted categorization methods currently used in political science research. Only supervised learning systems have the potential to address the five goals of topic classification described above.

COMPUTER ASSISTED CONTENT ANALYSIS IN POLITICAL SCIENCE

Content-analysis methods center on extracting meaning from documents. Applications of computer-assisted content analysis methods have developed slowly in political science over the past four decades, with each innovation adding a layer of complexity to the information gleaned from the method. Here we focus on a selected set of noteworthy projects that serve as examples of some of these important developments (see Table 1).

Data comparison, or keyword matching, was the first search method ever employed on digital data. Keyword searches identify documents that contain specific words or word sequences. Within political science, one of the most sophisticated is KEDS/TABARI (Schrodt, Davis, & Weddle, 1994; Schrodt & Gerner, 1994). TABARI turns to humans to create a set of computational rules for isolating text and associating it with a particular event category. Researchers use the resulting system to analyze changing attention in the international media or other venues.

Systems based on keyword searching can meet the requirements for a solid topic classification system. Keyword search systems such as TABARI can be highly accurate and reliable because the system simply replicates coding decisions originally made by humans.5 If the system encounters text for which it has not been trained, it does not classify that text. Keyword search systems can also be discriminating, because only documents that include the search terms are labeled. The system can also be probabilistic, by using rules to establish which documents are related to a topic area.

However, for nonbinary classification tasks, achieving the ability to be both discriminating and probabilistic can be expensive, because the system requires explicit rules of discrimination for the many situations where the text touches on more than one topic. For example, the topic "elderly health care issue" includes subjects that are of concern to non-seniors (e.g., health insurance, prevention). Effective keyword searches must account for these contextual factors, and at some point other methods may prove to be more efficient for the same level of accuracy.
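As an illustration only (and not the TABARI rule syntax), a toy keyword-rule classifier of this kind might look as follows in Python; the topic labels, patterns, and thresholds are hypothetical assumptions made for the sketch:

import re

# Illustrative only: a toy keyword-rule classifier in the spirit of the
# dictionary/rule systems discussed above. Topics and patterns are invented.
RULES = {
    "energy": [r"\brenewable energy\b", r"\bsolar power\b", r"\bwind energy\b"],
    "health": [r"\bhealth insurance\b", r"\bmedicare\b", r"\bprescription drug\b"],
}

def match_topics(text):
    """Probabilistic use: return every topic whose rules fire, with hit counts."""
    text = text.lower()
    hits = {}
    for topic, patterns in RULES.items():
        count = sum(len(re.findall(p, text)) for p in patterns)
        if count:
            hits[topic] = count
    return hits

def primary_topic(text):
    """Discriminating use: pick the single best-supported topic, or None
    when no rule fires (the system declines to classify unseen text)."""
    hits = match_topics(text)
    return max(hits, key=hits.get) if hits else None

print(primary_topic("A bill to expand health insurance and medicare coverage."))

The sketch also makes the cost problem visible: covering multi-topic texts requires ever more hand-written rules.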

TABLE 1. Criteria for Topic Classification and the Appropriateness of Different Computer-Assisted Content Analysis Methods

Methods compared: Unsupervised Learning (without human intervention); KEDS/TABARI; Wordscores; Hopkins and King (2007); Supervised Learning.

System for topic classification: Unsupervised = Partial; KEDS/TABARI = Yes; Wordscores = No; Hopkins and King = Partial; Supervised Learning = Yes.

Discriminates the primary subject of a document: Unsupervised = Yes; KEDS/TABARI = No; Wordscores = No; Hopkins and King = No; Supervised Learning = Yes.

Document-level accuracy is assessed: Unsupervised = No; KEDS/TABARI = No; Wordscores = Yes; Hopkins and King = No; Supervised Learning = Yes.

Document-level reliability is assessed: Unsupervised = No; KEDS/TABARI = No; Wordscores = Yes; Hopkins and King = No; Supervised Learning = Yes.

Indicates secondary topics: Unsupervised = No; KEDS/TABARI = No; Wordscores = No; Hopkins and King = No; Supervised Learning = Yes.

Efficient to implement: Unsupervised = Yes, although integrating document-level accuracy and reliability checks makes the process similar to supervised learning; KEDS/TABARI = Yes, but costs rise with the scope of the task; Wordscores = Yes, costs decline with the scope of the task; Hopkins and King = Yes, costs decline with the scope of the task; Supervised Learning = Yes, costs decline with the scope of the task.

Unsupervised approaches, such as factor analysis or agglomerative clustering, have been used for decades as an alternative to keyword searching. They are often used as a first step to uncovering patterns in data, including document content (Hand, Mannila, & Smyth, 2001). In a recent political science application, Quinn, Monroe, Colaresi, Crespin, and Radev (2006) have used this approach to cluster rhetorical arguments in congressional floor speeches.

Unsupervised approaches are efficient because typically they do not require human guidance, in contrast to data comparison or keyword methods. They also can be discriminating and/or probabilistic, because they can produce mutually exclusive and/or ranked observations. Consider the simplest case of unsupervised learning using agglomerative clustering—near-exact duplicate detection. As a researcher, if you know that 30% of the documents in a data set are near-exact duplicates (99.8% of text content is equivalent) and each has the same topic assigned to it, it would be inefficient to ask humans to label all of these documents. Instead, the researcher would use an unsupervised approach to find all of the clusters of duplicates, label just one document in the cluster, and then trust the labeling approach to assign labels to the near-exact duplicate documents.6

But, to assess the accuracy and reliability of unsupervised methods on more complex content analysis questions, humans must examine the data and decide relevance. And once researchers begin to leverage information from human experts to improve accuracy and reliability (achieve a higher match with human-scored relevance) in the data generation process, the method essentially evolves into a hybrid of the supervised learning method we focus on here.

Another semi-automated method, similar to the method proposed in this article, is the supervised use of word frequencies in Wordscores (www.wordscores.com). With Wordscores, researchers select model training "cases" that are deemed to be representative of opposite ends of a spectrum (Laver, Benoit, & Garry, 2003). The software then orders other documents along this spectrum based on their similarities and differences to the word frequencies of the end-point model documents. Wordscores has been used to locate party manifestos and other political documents of multiple nations along the same ideological continuum.

This method can be efficient because it requires only the human intervention required to select training documents and conduct validation. Wordscores is also probabilistic, because it can produce ranked observations. And the method has been shown to be accurate and reliable. However, its accuracy and reliability are application dependent, in that the ranks Wordscores assigns to documents will make sense only if the training documents closely approximate the user's information retrieval goal. Its small number of training documents limits the expression of the user's information needs. That is, Wordscores was not designed to place events in discrete categories.

Application-independent methods for conducting algorithmic content analysis do not exist. The goal of such a system would be to generate discriminative and reliable results efficiently and accurately for any content analysis question that might be posed by a researcher. There is a very active community of computer scientists interested in this problem, but, to date, humans must still select the proper method for their application. Many Natural Language Processing (NLP) researchers believe that an application-independent method will never be developed (Kleinberg, 2002).7

As a part of this search for a more general method, Hopkins and King recently have developed a supervised learning method that gives, as output, "approximately unbiased and statistically consistent estimates of the proportion of all documents in each category" (Hopkins & King, 2007, p. 2). They note that "accurate estimates of these document category proportions have not been a goal of most work in the classification literature, which has focused instead on increasing the accuracy of individual document classification" (ibid). For example, a classifier that correctly estimates 90% of the documents belonging to a class must estimate incorrectly that 10% of those documents belong to other classes. These errors can bias estimates of class proportions (e.g., the proportion of all media coverage devoted to different candidates), depending on how they are distributed.

Like previous work (Purpura & Hillard, the corpus. Examples might include a binary 2006), the method developed by Hopkins and indicator variable, which specifies whether King begins with documents labeled by the document contains a picture, a vector humans, and then statistically analyzes word containing “term weights” for each word in features to generate an efficient, discrimina- the document, or a real number in the interval tive, multiclass classification. However, their (i.e., 0, infinity) that represents the cost of pro- approach of estimating proportions is not ducing the document. Typically, a critical appropriate for the researchers interested in selection criterion is empirical system perfor- mixed-methods research requiring the ability mance. If a human can separate all of the doc- to analyze documents within a class. Despite uments in a corpus perfectly by asking, for this limitation, mixed-methods researchers example, whether a key e-mail address may still want to use the Hopkins and King appears, then a useful document representation method to validate estimates from alternative would be a binary indicator variable specify- supervised learning systems. Because it is the ing whether the e-mail address appears in each only other method (in the political science lit- document. However, for classification tasks erature) of those mentioned that relies on that are more complex, simplicity, calculation human-labeled training samples, it does offer cost, and theoretical justifications are also rel- a unique opportunity to compare the predic- evant selection criteria. tion accuracy of our supervised learning Our document representation consists of a approach in our problem domain to another vector of term weights, also known as feature approach (though the comparison must be representation, as documented in Joachims restricted to proportions). (1998). For the term weights, we use both tf*idf (term frequency multiplied by inverse docu- ment frequency) and a mutual information SUPERVISED LEARNING weight (Purpura & Hillard, 2006). The most APPROACHES typical feature representation first applies Por- ter stemming to reduce word variants to a com- Supervised learning (or machine learning mon form (Porter, 1980), before computing classification) systems, have been the focus of term frequency in a sample divided by the more than 1,000 scholarly publications in the inverse document frequency (to capture how computational linguistics literature in recent often a word occurs across all documents in the years (Breiman, 2001; Hand, 2006; Mann, corpus) (Papineni, 2001).8 A list of common Mimno, & McCallum, 2006; Mitchell, 1997; words (stop words) also may be omitted from Yang & Liu, 1999). These systems have been each text sample. used for many different text annotation pur- Feature representation is an important poses but have been rarely used for this purpose research topic in itself, because different in political science. approaches yield different results depending In this discussion, we focus on supervised on the task at hand. Stemming can be learning systems that statistically analyze terms replaced by a myriad of methods that perform within documents of a corpus to create rules for a similar task—capturing the signal in the classifying those documents into classes. To transformation of raw text to numeric repre- uncover the relevant statistical patterns, annota- sentation—but with differing results. 
In tors mark a subset of the documents in the corpus future research, we hope to demonstrate how as being members of a class. The researcher then alternative methods of preprocessing and fea- develops a “document representation” that draws ture generation can improve the performance on this “training set” to accurately machine anno- of our system. tate previously unseen documents in the corpus For topic classification, a relatively compre- referred to as the “test set.” hensive analysis (Yang & Liu, 1999) finds that Practically, a document representation can support vector machines (SVMs) are usually be any numerical summary of a document in the best performing model. Purpura and Hillard Hillard, Purpura, and Wilkerson 37
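To make the feature representation described above concrete, the following Python sketch computes tf*idf weights after stemming and stop-word removal. It is illustrative only: the crude strip_suffix() stands in for a true Porter stemmer, the stop list is a placeholder, and the mutual information weight we also use is not shown.

import math
from collections import Counter

STOP_WORDS = {"a", "an", "and", "of", "the", "to", "for"}

def strip_suffix(word):
    # Naive stand-in for Porter stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return [strip_suffix(w) for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(documents):
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))          # document frequency per term
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])   # tf * idf
            for term, count in tf.items()
        })
    return vectors

bills = ["A bill to amend the Clean Air Act",
         "A bill to provide funding for renewable energy research"]
print(tfidf_vectors(bills)[0])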

Purpura and Hillard (2006) applied a SVM model to the corpus studied here with high fidelity results. We are particularly interested in whether combining the decisions of multiple supervised learning systems can improve results. This combined approach is known as ensemble learning (Brill & Wu, 1998; Curran, 2002; Dietterich, 2000). Research indicates that ensemble approaches yield the greatest improvements over a single classifier when the individual classifiers perform with similar accuracy but make different types of mistakes.

Algorithms

We will test the performance of four alternatives: Naïve Bayes, SVM, Boostexter, and MaxEnt.

Naïve Bayes

Our Naïve Bayes classifier uses a decision rule and a Bayes probability model with strong assumptions of independence of the features (tf*idf). Our decision rule is based on MAP (maximum a posteriori), and it attempts to select a label for each document by selecting the class that is most probable. Our implementation of the Naïve Bayes comes from the rainbow toolkit (McCallum, 1996).

The SVM Model

The SVM system builds on binary pairwise classifiers between each pair of categories, and chooses the one that is selected most often as the final category (Joachims, 1998). Other approaches are also common (such as a committee of classifiers that test one vs. the rest), but we have found that the initial approach is more time efficient with equal or greater performance. We use a linear kernel, Porter stemming, and a feature value (mutual information) that is slightly more detailed than the typical inverse document frequency feature. In addition, we prune those words in each bill that occur less often than the corpus average. Further details and results of the system are described in Purpura and Hillard (2006).

Boostexter Model

The Boostexter tool allows for features of a similar form to the SVM, where a word can be associated with a score for each particular text example (Schapire & Singer, 2000). We use the same feature computation as for the SVM model, and likewise remove those words that occur less often than the corpus average. Under this scenario, the weak learner for each iteration of AdaBoost training consists of a simple question that asks whether the score for a particular word is above or below a certain threshold. The Boostexter model can accommodate multicategory tasks easily, so only one model need be learned.

The MaxEnt Model

The MaxEnt classifier assigns a document to a class by converging toward a model that is as uniform as possible around the feature set. In our case, the model is most uniform when it has maximal entropy. We use the rainbow toolkit (McCallum, 1996). This toolkit provides a cross validation feature that allows us to select the optimal number of iterations. We provide just the raw words to rainbow, and let it run word stemming and compute the feature values.

Figure 1 summarizes how we apply this system to the task of classifying congressional bills based on the word features of their titles. The task consists of two stages. In the first, we employ the ensemble approach developed here to predict each bill's major topic class. Elsewhere, we have demonstrated that the probability of correctly predicting the subtopic of a bill, given a correct prediction of its major topic, exceeds .90 (Hillard, Purpura, & Wilkerson, 2007; Purpura & Hillard, 2006). We leverage this valuable information about likely subtopic class in the second stage by developing unique subtopic document representations (using the three algorithms) for each major topic.9

Performance Assessment

We assess the performance of our automated methods against trained human annotators.
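The pairwise ("one vs. one") voting scheme described in the SVM subsection above can be sketched as follows. This is a generic illustration rather than our SVM-light configuration: train_binary() is a hypothetical stand-in for any binary learner that returns a predict(features) -> label function.

from collections import Counter
from itertools import combinations

def train_pairwise(train_features, train_labels, train_binary):
    # One binary classifier per pair of categories.
    classes = sorted(set(train_labels))
    models = {}
    for a, b in combinations(classes, 2):
        subset = [(x, y) for x, y in zip(train_features, train_labels) if y in (a, b)]
        xs, ys = zip(*subset)
        models[(a, b)] = train_binary(xs, ys)
    return models

def predict_pairwise(models, features):
    # Each pairwise model votes; the category selected most often wins.
    votes = Counter(model(features) for model in models.values())
    return votes.most_common(1)[0][0]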

FIGURE 1. Topic classifying Congressional bill titles.
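A minimal sketch of the stage-one ensemble vote summarized in Figure 1: each bill title receives a major-topic prediction from the SVM, MaxEnt, and Boostexter systems, the majority label wins, and the level of agreement is retained as the confidence signal used later in the article. The tie-break rule (defer to the SVM on three-way disagreement) is an assumption made for the sketch.

from collections import Counter

def ensemble_vote(svm_label, maxent_label, boostexter_label):
    votes = Counter([svm_label, maxent_label, boostexter_label])
    label, count = votes.most_common(1)[0]
    if count == 1:              # three-way disagreement
        label = svm_label       # assumed tie-break: defer to the best single classifier
    return label, count         # count in {1, 2, 3} = how many methods agree

label, agreement = ensemble_vote("Health", "Health", "Macroeconomics")
print(label, agreement)         # -> Health 2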

Although we report raw agreement between human and machine for simplicity, we also discount this agreement for confusion, or the probability that the human and machine might agree by chance. Chance agreement is of little concern when the number of topics is large. However, in other contexts, chance agreement may be a more relevant concern.

Cohen's Kappa statistic is a standard metric used to assess interannotator reliability between two sets of results while controlling for chance agreement (Cohen, 1968). Usually, this technique assesses agreement between two human annotators, but the computational linguistics field also uses it to assess agreement between human and machine annotators. Cohen's Kappa statistic is defined as:

\kappa = \frac{p(A) - p(E)}{1 - p(E)}

In the equation, p(A) is the probability of the observed agreement between the two assessments:

p(A) = \frac{1}{N} \sum_{n=1}^{N} I(\mathrm{Human}_n = \mathrm{Computer}_n)

where N is the number of examples, and I() is an indicator function that is equal to 1 when the two annotations (human and computer) agree on a particular example. p(E) is the probability of the agreement expected by chance:

p(E) = \frac{1}{N^2} \sum_{c=1}^{C} \mathrm{HumanTotal}_c \times \mathrm{ComputerTotal}_c

where N is again the total number of examples and the argument of the sum is a multiplication of the marginal totals for each category. For example, for category 3—health—the argument would be the total number of bills a human annotator marked as category 3 multiplied by the total number of bills the computer system marked as category 3. This multiplication is computed for each category, summed, and then normalized by N^2.

Due to bias under certain constraint conditions, computational linguists also use another standard metric, the AC1 statistic, to assess interannotator reliability (Gwet, 2002). The AC1 statistic corrects for the bias of Cohen's Kappa by calculating the agreement by chance in a different manner. It has a similar form:

\mathrm{AC1} = \frac{p(A) - p(E)}{1 - p(E)}

but the p(E) component is calculated differently:

p(E) = \frac{1}{C - 1} \sum_{c=1}^{C} \pi_c (1 - \pi_c)

where C is the number of categories, and \pi_c is the approximate chance that a bill is classified as category c:

\pi_c = \frac{(\mathrm{HumanTotal}_c + \mathrm{ComputerTotal}_c)/2}{N}

In this study we report just AC1 because there is no meaningful difference between Kappa and AC1.10 For annotation tasks of this level of complexity, a Cohen's Kappa or AC1 statistic of 0.70 or higher is considered to be very good agreement between annotators (Carletta, 1996).

Corpus: The Congressional Bills Project

The Congressional Bills Project (http://www.congressionalbills.org) archives information about federal public and private bills introduced since 1947. Currently the database includes approximately 379,000 bills. Researchers use this database to study legislative trends over time as well as to explore finer questions such as the substance of environmental bills introduced in 1968, or the characteristics of the sponsors of environmental legislation.

Human annotators have labeled each bill's title (1973–98) or short description (1947–72) as primarily about one of 226 subtopics originally developed for the Policy Agendas Project (http://www.policyagendas.org). These subtopics are further aggregated into 20 major topics (see Table 2). For example, the major topic of environment includes 12 subtopics corresponding to longstanding environmental issues, including species and forest protection, recycling, and drinking water safety, among others. Additional details can be found online at http://www.policyagendas.org/codebooks/topicindex.html.
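The agreement statistics defined above translate directly into code. The following Python functions compute raw agreement, Cohen's Kappa, and Gwet's AC1 from two parallel lists of labels (human and machine); the example labels are invented.

from collections import Counter

def observed_agreement(human, computer):
    n = len(human)
    return sum(h == c for h, c in zip(human, computer)) / n   # p(A)

def cohens_kappa(human, computer):
    n = len(human)
    p_a = observed_agreement(human, computer)
    human_totals, computer_totals = Counter(human), Counter(computer)
    categories = set(human) | set(computer)
    # p(E) from the product of the marginal totals, normalized by N^2
    p_e = sum(human_totals[c] * computer_totals[c] for c in categories) / n ** 2
    return (p_a - p_e) / (1 - p_e)

def gwet_ac1(human, computer):
    n = len(human)
    p_a = observed_agreement(human, computer)
    human_totals, computer_totals = Counter(human), Counter(computer)
    categories = set(human) | set(computer)
    pi = {c: (human_totals[c] + computer_totals[c]) / 2 / n for c in categories}
    p_e = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)
    return (p_a - p_e) / (1 - p_e)

human = ["Health", "Health", "Energy", "Defense", "Health"]
machine = ["Health", "Energy", "Energy", "Defense", "Health"]
print(observed_agreement(human, machine), cohens_kappa(human, machine), gwet_ac1(human, machine))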

TABLE 2. Major Topics of the Congressional Bills Project

1 Macroeconomics
2 Civil Rights, Minority Issues, Civil Liberties
3 Health
4 Agriculture
5 Labor, Employment, and Immigration
6 Education
7 Environment
8 Energy
10 Transportation
12 Law, Crime, and Family Issues
13 Social Welfare
14 Community Development and Housing Issues
15 Banking, Finance, Domestic Commerce
16 Defense
17 Space, Science, Technology, Communications
18 Foreign Trade
19 International Affairs and Foreign Aid
20 Government Operations
21 Public Lands and Water Management
99 Private Legislation

The students (graduate and undergraduate) who do the annotation train for approximately three months as part of a year-long commitment. Typically, each student annotates 200 bills per week during the training process. To maintain quality, interannotator agreement statistics are regularly calculated. Annotators do not begin annotation in earnest until interannotator reliability (including a master annotator) approaches 90% at the major topic level and 80% at the subtopic level.11 Most bills are annotated by just one person, so the dataset undoubtedly includes annotation errors.

However, it is important to recognize that interannotator disagreements are usually legitimate differences of opinion about what a bill is primarily about. For example, one annotator might place a bill to prohibit the use of live rabbits in dog racing in the sports and gambling regulation category (1526), while another might legitimately conclude that it is primarily about species and forest protection (709). The fact that interannotator reliability is generally high, despite the large number of topic categories, suggests that the annotators typically agree on where a bill should be assigned. In a review of a small sample, we found that the distribution between legitimate disagreements and actual annotation errors was about 50/50.

EXPERIMENTS AND FINDINGS

The main purpose of automated text classification is to replicate the performance of human labelers. In this case, the classification task consists of either 20 or 226 topic categories. We exploit the natural hierarchy of the categories by first building a classification system to determine the major category, and then building a child system for each of the major categories that decides among the subcategories within that major class, as advocated by Koller and Sahami (1997).

We begin by performing a simple random split on the entire corpus and use the first subset for training and the second for testing. Thus, one set of about 190,000 labeled samples is used to predict labels on about 190,000 other cases.

Table 3 shows the results produced when using our text preprocessing methods and four off-the-shelf computer algorithms.
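A sketch of the hierarchical (parent/child) design described above, using scikit-learn as a convenient stand-in for our actual toolchain (SVM-light, rainbow, Boostexter); the specific estimators and settings are assumptions made for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def train_hierarchy(titles, major_labels, sub_labels):
    # Parent classifier predicts the major topic from bill titles.
    parent = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(titles, major_labels)
    # One child classifier per major topic decides among its subtopics.
    children = {}
    for major in set(major_labels):
        idx = [i for i, m in enumerate(major_labels) if m == major]
        subs = [sub_labels[i] for i in idx]
        if len(set(subs)) > 1:
            children[major] = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(
                [titles[i] for i in idx], subs)
        else:
            children[major] = subs[0]   # only one subtopic observed for this major topic
    return parent, children

def predict_hierarchy(parent, children, title):
    major = parent.predict([title])[0]
    child = children[major]
    sub = child.predict([title])[0] if hasattr(child, "predict") else child
    return major, sub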

TABLE 3. Bill Title Interannotator Agreement for Five Model Types

SVM MaxEnt Boostexter Naïve Bayes Ensemble

Major topic (N = 20):   88.7% (.881)   86.5% (.859)   85.6% (.849)   81.4% (.805)   89.0% (.884)
Subtopic (N = 226):     81.0% (.800)   78.3% (.771)   73.6% (.722)   71.9% (.705)   81.0% (.800)

Note. Results are based on using approximately 187,000 human-labeled cases to train the classifier to predict approximately 187,000 other cases (that were also labeled by humans but not used for training). Agreement is computed by comparing the machine's prediction to the human-assigned labels. (AC1 measure presented in parentheses.)

With 20 major topics and 226 subtopics, a random assignment of bills to topics and subtopics can be expected to yield very low levels of accuracy. It is therefore very encouraging to find high levels of prediction accuracy across the different algorithms. This is indicative of a feature representation—the mapping of text to numbers for analysis by the machine—that reasonably matches the application.

The ensemble learning voting algorithm, which combines the best three of the four (SVM, MaxEnt, and Boostexter), marginally improves interannotator agreement (compared to SVM alone) by 0.3% (508 bills). However, combining information from three algorithms yields important additional information that can be exploited to lower the costs of improving accuracy. When the three algorithms predict the same major topic for a bill, the prediction of the machine matches the human-assigned category 94% of the time (see Table 4). When the three algorithms disagree by predicting different major topics, collectively the machine predictions match the human annotation team only 61% of the time. The AC1 measure closely tracks the simple accuracy measure, so for brevity we present only accuracy results in the remaining experiments.

TABLE 4. Prediction Success for 20 Topic Categories When Machine Learning Ensemble Agrees and Disagrees

              Methods Agree   Methods Disagree
Correct       94%             61%
Incorrect     6%              39%
Cases         85%             15%
(N of Bills)  (158,762)       (28,997)

Note. Based on using 50% of the sample to train the systems to predict to the other 50%.

PREDICTING TO THE FUTURE: WHEN AND WHERE SHOULD HUMANS INTERVENE?

A central goal of the Congressional Bills Project (as well as many other projects) is to turn to automated systems to lower the costs of labeling new bills (or other events) as opposed to labeling events of the distant past. The previous experiments shed limited light on the value of the method for this task. How different are our results if we train on annotated bills from previous Congresses to predict the topics of bills of future Congresses?

From past research we know that topic drift across years can be a significant problem. Although we want to minimize the amount of time that the annotation team devotes to examining bills, we also need a system that approaches 90% accuracy. To address these concerns, we adopt two key design modifications. First, we implement a partial memory learning system. For example, to predict class labels for the bills of the 100th Congress (1987–1988), we only use information from the 99th Congress, plus whatever data the human annotation team has generated for the 100th as part of an active learning process. We find that this approach yields results equal to, or better than, what can be achieved using all available previous training data.

The second key design decision is that we only want to accept machine-generated class labels for bills when the system has high confidence in the prediction. In other cases, we wish to have humans annotate the bills, because we have found that this catches cases of topic drift and minimizes mistakes. One implication of Table 4 is that the annotation team may be able to trust the algorithms' prediction when all three methods agree and limit its attention to the cases where they disagree. But we need to confirm that the results are comparable when we use a partial memory learning system.

For the purposes of these experiments, we will focus on predicting the topics of bills from the 100th Congress to the 105th Congress using only the bill topics from the previous Congress as training data. This is the best approximation of the "real world" case that we are able to construct, because (a) these congressional sessions have the lowest computer/human agreement of all of the sessions in the data set, (b) the 105th Congress is the last human-annotated session, and (c) the first production experiment with live data will use the 105th Congress' data to predict the class labels for the bills of the 106th Congress. The results reported in Table 5 are at the major topic level only. As mentioned, the probability of correctly predicting the subtopic of a bill, given a correct prediction of major topic class, exceeds 0.90 (Hillard et al., 2007; Purpura & Hillard, 2006).

Several results in Table 5 stand out. Overall, we find that when we train on a previous Congress to predict the class labels of the next Congress, the system correctly predicts the major topic about 78.5% of the time without any sort of human intervention. This is approximately 12% below what we would like to see, but we have not spent any money on human annotation yet.

How might we strategically invest human annotation efforts to best improve the performance of the system? To investigate this question we will begin by using the major topic class labels of bills in the 99th Congress to predict the major topic class labels of the bills in the 100th Congress. Table 6 reports the percentage of cases that agree between the machine and the human team in three situations: when the three algorithms agree, when two of them agree, and when none of them agree. When all three agree, only 10.3% of their predictions differ from those assigned by the human annotators. But when only two agree, 39.8% of the predictions are wrong by this standard, and most (58.5%) are wrong when the three algorithms disagree.

Of particular note is how this ensemble approach can guide future efforts to improve overall accuracy. Suppose that only a small amount of resources were available to pay human annotators to review the automated system's predictions for the purpose of improving overall accuracy. (Remember that in an applied situation, we would not know which assignments were correct and which were wrong.) With an expected overall accuracy rate of about 78%, 78% of the annotator's efforts would be wasted on inspecting correctly labeled cases.
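The workflow just described—partial memory training plus confidence-gated acceptance—can be sketched as a simple triage loop. This is an illustration under our own assumptions: train_ensemble() and predict_with_agreement() are hypothetical stand-ins for the three classifiers and the Figure 1 vote, and ask_human() represents the annotation team.

def label_new_congress(prev_bills, prev_labels, new_bills,
                       train_ensemble, predict_with_agreement, ask_human):
    # Partial memory: train only on the immediately preceding Congress.
    ensemble = train_ensemble(prev_bills, prev_labels)
    accepted, review_queue = {}, []
    for bill_id, title in new_bills.items():
        label, n_agree = predict_with_agreement(ensemble, title)
        if n_agree == 3:                 # high confidence: accept the machine label
            accepted[bill_id] = label
        else:                            # low confidence: route to human annotators
            review_queue.append(bill_id)
    for bill_id in review_queue:
        accepted[bill_id] = ask_human(new_bills[bill_id])
    return accepted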

TABLE 5. Prediction Success When the Ensemble Agrees and Disagrees

Columns: Congress (Train, Test) | Bills in Test Set (N) | Ensemble Methods Agree (%) | Correctly Predicts Major Topic (%): When 3 Methods Agree | When Methods Disagree | Combined Agree and Disagree | Best Individual Classifier

Train    Test     N       Agree (%)   3 Agree   Disagree   Combined   Best Individual
99th     100th    8508    61.5        89.7      59.3       78.0       78.3
100th    101st    9248    62.1        93.0      61.5       81.1       80.8
101st    102nd    9602    62.4        90.3      61.1       79.3       79.3
102nd    103rd    7879    64.8        90.1      60.2       79.6       79.5
103rd    104th    6543    62.4        89.0      57.5       77.1       76.6
104th    105th    7529    60.0        87.4      58.9       76.0       75.6
Mean              8218    62.2        89.9      59.7       78.5       78.4

Note. The “best individual classifier” is usually the SVM system.

TABLE 6. Prediction Success When the Ensemble Agrees and Disagrees

3 Methods Agree 2 Methods Agree No Agreement Overall

Correct                    89.7%    64.2%    41.5%    78.0%
Incorrect                  10.3%    36.8%    58.5%    22.0%
Share of incorrect cases   28.8%    50.8%    20.2%    -----
All cases                  61.5%    30.8%     7.7%    100.0%
(N of Bills)               (5233)   (2617)   (655)    (8508)

Note. Training on bills of the 99th Congress to predict bills of the 100th Congress. Hillard, Purpura, and Wilkerson 43

A review of just the 655 bills where the three methods disagree (i.e., less than 8% of the sample) can be expected to reduce overall annotation errors by 20%. In contrast, inspecting the same number of cases when the three methods agree would reduce overall annotation errors by just 3.5%. If there are resources to classify twice as many bills (just 1,310 bills, or about 15% of the cases), overall error can be reduced by 32%, bumping overall accuracy from 78% to 85%. Coding 20% of all bills according to this strategy increases overall accuracy to 87%.

In the political science literature, the most appropriate alternative approach for validating the methods presented here is the one recently advocated by Hopkins and King (2007). Their method, discussed earlier, does not predict to the case level and is therefore inadequate for the goals we have established in this work. We can, however, compare estimates of proportions by applying our software and the ReadMe software made available by Hopkins and King (http://gking.harvard.edu/readme/) to the same dataset. We trained the ReadMe algorithm and the best performing algorithm of our ensemble (SVM) on the human-assigned topics of bills of the 104th Congress (1995–96), and then predicted the proportion of bills falling into each of the 20 major topics of the 105th Congress.

In Figure 2, an estimate that lies along the diagonal perfectly predicts the actual proportion of bills falling into that topic category. The further the estimate strays from the diagonal, the less accurate the prediction. Thus, Figure 2 indicates that the SVM algorithm—which labels cases in addition to predicting proportions—is performing as well as and sometimes much better than the ReadMe algorithm. These findings buttress our belief that estimation bias is just one of the considerations that should affect method selection. In this case, the greater document-level classification accuracy of the SVM discriminative approach translated into greater accuracy of the estimated proportions. In practice, mixed-methods researchers using our approach gain the ability to inspect individual documents in a class (because they know the classification of each document) while still having confidence that the estimates of the proportions of the documents assigned to each class are reasonable. The technique for bias reduction proposed by Hopkins and King could then be used as a postprocessing step—potentially further improving proportion predictions.
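As a rough back-of-envelope check of the review-budget arithmetic at the start of this section, the Table 6 counts (training on the 99th Congress to predict the 100th) approximately reproduce the 20% and 3.5% figures; the numbers are approximate because the table percentages are rounded.

total_bills = 8508
total_errors = 0.220 * total_bills             # overall error rate of 22% (Table 6)

no_agreement = 655                             # cases where all three methods differ
errors_caught_disagree = 0.585 * no_agreement  # 58.5% of those are machine errors
print(errors_caught_disagree / total_errors)   # ~0.20: reviewing <8% of bills removes ~20% of errors

three_agree_sample = 655                       # same review budget spent where all three agree
errors_caught_agree = 0.103 * three_agree_sample
print(errors_caught_agree / total_errors)      # ~0.036: only ~3.5% of errors removed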

FIGURE 2. Predicting document proportions via two methods.

CONCLUSIONS

Topic classification is a central component of many social science data collection projects. Scholars rely on such systems to isolate relevant events, to study patterns and trends, and to validate statistically derived conclusions. Advances in information technology are creating new research databases that require more efficient methods for topic classifying large numbers of documents. We investigated one of these databases to find that a supervised learning system can accurately estimate a document's class as well as document proportions, while achieving the high inter-annotator reliability levels associated with human efforts. Moreover, compared to human annotation alone, this system lowers costs by 80% or more.

We have also found that combining information from multiple algorithms (ensemble learning) increases accuracy beyond that of any single algorithm, and also provides key signals of confidence regarding the assigned topic for each document. We then showed how this simple confidence estimate could be employed to achieve additional classification accuracy.

Although the ensemble learning method alone offers a viable strategy for improving classification accuracy, additional gains can be achieved through other active learning interventions. In Hillard et al. (2007), we show how confusion tables that report classification errors by topic can be used to target follow-up interventions more efficiently. One of the conclusions of that research was that stratified sampling approaches can be more efficient than random sampling, especially where smaller training samples are concerned. In addition, much of the computational linguistics literature focuses on feature representations and demonstrates that experimentation in this area is also likely to lead to improvements. The Congressional Bills Project is in the public domain (http://www.congressionalbills.org). We hope that this work inspires others to improve upon it.

We appreciate the attraction of less costly approaches such as keyword searches, clustering methodologies, and reliance on existing indexing systems designed for other purposes. Supervised learning systems require high-quality training samples and active human intervention to mitigate concerns such as topic drift as they are applied to new domains (e.g., new time periods), but it is also important to appreciate where other methods fall short as far as the goals of social science research are concerned. For accurately, reliably, and efficiently classifying large numbers of complex individual events, supervised learning systems are currently the best option.

NOTES

1. http://www.congressionalbills.org/Bills_Reliability.pdf
2. We use case-based to mean that cases (examples marked with a class) are used to train the system. This is conceptually similar to the way that reference cases are used to train law school students.
3. Google specifically rejects use of recall as a design criterion in their design documents, available at http://www-db.stanford.edu/~backrub/google.html
4. Researchers at the Congressional Research Service are very aware of this limitation of their system, which now includes more than 5,000 subject terms. However, we have been reminded on more than one occasion that THOMAS's primary customer (and the entity that pays the bills) is the U.S. Congress.
5. TABARI does more than classify, but at its heart it is an event classification system just as Google (at its heart) is an information retrieval system.
6. If we were starting from scratch, we would employ this method. Instead, we use it to check whether near-exact duplicates are labeled identically by both humans and software. Unfortunately, humans make the mistake of mislabeling near-exact duplicates more times than we care to dwell upon, and we are glad that we now have computerized methods to check them.
7. Many modern popular algorithmic NLP text classification approaches convert a document into a mathematical representation using the bag of words method. This method reduces the contextual information available to the machine. Different corpus domains and applications require more contextual information to increase effectiveness. Variation in the document pre-processing (including morphological transformation) is one of the key methods for increasing effectiveness. See Manning and Schütze (1999) for a helpful introduction to this subject.
8. Although the use of features similar to tf*idf (term frequency multiplied by inverse document frequency) dates back to the 1970s, we cite Papineni's literature review of the area.
9. Figure 1 depicts the final voting system used to predict the major and subtopics of each Congressional bill. The SVM system, as the best performing classifier, is used alone for the subtopic prediction system. However, when results are reported for individual classifier types (SVM, Boostexter, MaxEnt, and Naïve Bayes), the same classifier system is used to predict both major and subtopics.
10. Cohen's Kappa, AC1, Krippendorf's Alpha, and simple percentage comparisons of accuracy are all reasonable approximations for the performance of our system because the number of data points and the number of categories are large.
11. See http://www.congressionalbills.org/BillsReliability.pdf

REFERENCES

Adler, E. S., & Wilkerson, J. (2008). Intended consequences? Committee reform and jurisdictional change in the House of Representatives. Legislative Studies Quarterly, 33(1), 85–112.
Baum, M. (2003). Soft news goes to war: Public opinion and American foreign policy in the new media age. Princeton, NJ: Princeton University Press.
Baumgartner, F., Jones, B. D., & Wilkerson, J. (2002). Studying policy dynamics. In F. Baumgartner & B. D. Jones (Eds.), Policy dynamics (pp. 37–56). Chicago: University of Chicago Press.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231.
Brill, E., & Wu, J. (1998). Classifier combination for improved lexical disambiguation. In Proceedings of COLING-ACL '98 (pp. 191–195). Philadelphia, PA: Association for Computational Linguistics.
Carletta, J. (1996). Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics, 22(2), 249–254.
Cohen, J. (1968). Weighted Kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.
Curran, J. (2002). Ensemble methods for automatic thesaurus extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 222–229). Morristown, NJ: Association for Computational Linguistics.
Dietterich, T. (2000). Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857, 1–15.
Gwet, K. (2002). Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-rater Reliability Assessment, 1(April), 1–6.
Hand, D. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–14.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining (adaptive computation and machine learning). Cambridge, MA: MIT Press.
Hillard, D., Purpura, S., & Wilkerson, J. (2007). An active learning framework for classifying political text. Presented at the annual meetings of the Midwest Political Science Association, Chicago, April.
Hopkins, D., & King, G. (2007). Extracting systematic social science meaning from text. Unpublished manuscript (Sept. 15, 2007), Center for Basic Research.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (pp. 137–142). Chemnitz, Germany: Springer.
Jones, B. D., & Baumgartner, F. B. (2005). The politics of attention: How government prioritizes problems. Chicago: University of Chicago Press.
King, G., Keohane, R. O., & Verba, S. (1994). Designing social inquiry: Scientific inference in qualitative research. Princeton, NJ: Princeton University Press.
Kleinberg, J. (2002). An impossibility theorem for clustering. Proceedings of the 2002 Conference on Advances in Neural Information Processing Systems, 15, 463–470.
Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 170–178). San Francisco, CA: Morgan Kaufmann.
Laver, M., Benoit, K., & Garry, J. (2003). Estimating the policy positions of political actors using words as data. American Political Science Review, 97(2), 311–331.
Mann, G., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures and leveraging topic analysis. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 65–74). New York: Association for Computing Machinery.
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
McCallum, A. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at http://www.cs.cmu.edu/~mccallum/bow
Mitchell, T. (1997). Machine learning. New York: McGraw-Hill.
Papineni, K. (2001). Why inverse document frequency? In Proceedings of the North American Chapter of the Association for Computational Linguistics (pp. 1–8). Morristown, NJ: Association for Computational Linguistics.
Poole, K., & Rosenthal, H. (1997). Congress: A political-economic history of roll call voting. New York: Oxford University Press.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 16(3), 130–137.
Purpura, S., & Hillard, D. (2006). Automated classification of congressional legislation. In Proceedings of the International Conference on Digital Government Research (pp. 219–225). New York: Association for Computing Machinery.
Quinn, K., Monroe, B., Colaresi, M., Crespin, M., & Radev, D. (2006). An automated method of topic-coding legislative speech over time with application to the 105th–108th U.S. Senate. Presented at the Annual Meetings of the Society for Political Methodology, Seattle, WA.
Rohde, D. W. (2004). Roll call voting data for the House of Representatives, 1953–2004. Compiled by the Political Institutions and Public Choice Program, Michigan State University, East Lansing, MI. Available at http://crespin.myweb.uga.edu/pipcdata.htm
Schapire, R. E., & Singer, Y. (2000). Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 135–168.
Schrodt, P., Davis, S., & Weddle, J. (1994). Political science: KEDS—a program for the machine coding of event data. Social Science Computer Review, 12(4), 561–587.
Schrodt, P., & Gerner, D. (1994). Validity assessment of a machine-coded event data set for the Middle East, 1982–92. American Journal of Political Science, 38(3), 825–844.
Segal, J., & Spaeth, H. (2002). The Supreme Court and the attitudinal model revisited. Cambridge, England: Cambridge University Press.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. New York: Association for Computing Machinery.