Measuring Democracy: From Texts to Data

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Thiago Veiga Marzagão, M.A.

Graduate Program in Political Science

The Ohio State University

2014

Dissertation committee:

Sarah Brooks, Advisor

Irfan Nooruddin

Marcus Kurtz

Janet Box-Steffensmeier

Copyright by

Thiago Veiga Marzagão

2014

Abstract

In this dissertation I use the text-as-data approach to create a new democracy index, which I call Automated Democracy Scores (ADS). Unlike other indices, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders’ ideological biases.

Dedication

Dedicated to the taxpayers who paid for this research.

Acknowledgements

Sarah Brooks, Irfan Nooruddin, Marcus Kurtz, and Janet Box-Steffensmeier provided invaluable advice and mentorship. They pushed me to be methodologically rigorous and to fully explore the substantive implications of my research. Also, they allowed me to take a big risk: creating an automated democracy index has never been attempted before and a different committee might have found the idea too ambitious for dissertation research. I thank my committee members for their trust and open-mindedness. I am also indebted to those who took the time to read and comment on earlier drafts and/or discuss my research idea: Philipp Rehm, Paul DeBell, Margaret Hanson, Carolyn Morgan, Peter Tunkis, Vittorio Merola, Raphael Cunha, and Marina Duque. I am also grateful to the institutions and people that provided material assistance. The Fulbright (grantee ID 15101786) and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES (BEX 2821/09-5) paid my tuition, university fees, airline tickets, and part of my health insurance. The Ministério do Planejamento, Orçamento e Gestão - MPOG (process no. 03080.000769/2010-53) granted me four years of paid leave. The Ohio Supercomputer Center allocated computing time. Courtney Sanders patiently answered my endless questions about graduation, forms, and procedures. Finally, I am indebted to my loved ones, all of whom supported me even from a great distance. All errors are mine.

Vita

July 2003 ...... B.S. International Relations, University of Brasília
June 2007 ...... M.A. International Relations, University of Brasília
December 2012 ... M.A. Political Science, Ohio State University

Publications

A dimensão geográfica das eleições brasileiras ("The spatial dimension of Brazilian elections"). Opinião Pública (Public Opinion), 19(2), 270-290, 2013.

Lobby e protecionismo no Brasil contemporâneo ("Lobby and protectionism in Brazil"). Revista Brasileira de Economia (Brazilian Review of Economics), 62(3), 263-178, 2008.

Fields of Study

Major Field: Political Science

Table of Contents

Abstract
Dedication
Acknowledgements
Vita
List of Tables
List of Figures
Introduction
Paper 1: Ideological Bias in Democracy Measures
Paper 2: Automated Democracy Scores
Paper 3: Measuring Democracy From Texts: Can We Do Better Than Wordscores?
Conclusion
References
Appendix A: SEM Estimation
Appendix B: Replication
Appendix C: HMT
Appendix D: HBB
Appendix E: Multiple Decision Trees

List of Tables

Table 1. Bollen and Paxton's regressions for 1980
Table 2. Replication of Bollen and Paxton's regressions for 1980
Table 3. Simulation results for Marxism-Leninism and Catholicism
Table 4. Simulation results for Protestantism and monarchy
Table 5. ADS summary statistics, by year
Table 6. Correlation between ADS and other indices, by year
Table 7. Largest discrepancies between ADS and UDS
Table 8. Overlaps for the year 2008
Table 9. Correlations with UDS (using 50 topics)
Table 10. Correlations with UDS (using 100 topics)
Table 11. Correlations with UDS (using 150 topics)
Table 12. Correlations with UDS (using 200 topics)
Table 13. Correlations with UDS (using 300 topics)
Table 14. Top 5 topics extracted with LSA
Table 15. First 5 topics extracted with LDA

List of Figures

Figure 1. Bollen and Paxton's model
Figure 2. Bollen and Paxton's fit statistics
Figure 3. Fit statistics from my replication of Bollen and Paxton
Figure 4. Automated Democracy Scores, 2012
Figure 5. Automated Democracy Scores, 1993-2012
Figure 6. ADS range and press coverage
Figure 7. Example of wordsXtopics table generated with LSA
Figure 8. Example of topicsXdocuments table generated with LSA
Figure 9. Example of decision tree

Introduction

In this dissertation I investigate the flaws of current democracy indices and propose a new, improved one. This dissertation consists of three papers. In the first paper I show that, unlike what previous research has led us to believe, we cannot make any claims about the nature of the ideological biases that contaminate existing democracy measures. For instance, I show that the Freedom House data, often believed to have a conservative bias, may actually have a liberal bias instead. I do that by replicating previous research on the subject (Bollen and Paxton 2000) but replacing real-world data by simulated data in which I manipulate democracy levels and the ideological biases of hypothetical raters. The results of these Monte Carlos show that even though we can confidently assert the existence of bias in some democracy measures we cannot say anything about which measures are biased or in what ways. That means we currently have no way to circumvent the circularity problem: if we find that democracy is associated with some variable X, is that a genuine association or an artifact of our democracy measure being biased toward X? In the second paper I use automated text analysis to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). I produce the ADS using the well-known "Wordscores" algorithm and 42 million news articles from 6,043 different sources. The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases; and a simple (though computationally demanding) extension of the method would yield daily data and real-time data.

I create a website where anyone can replicate and tweak the data-generating process by changing the parameters of the underlying model (no coding required): www.democracy-scores.org. In the third paper I explore other ways to create an automated democracy index from news articles. More specifically, I use the same news articles used in the second paper but I replace Wordscores with other algorithms - namely, a combination of topic extraction methods (Latent Semantic Analysis and Latent Dirichlet Allocation) and decision trees. The goal is to address the issue of construct validity more directly.

Ideological Bias in Democracy Measures

Abstract

In this paper I show that, unlike what previous research has led us to believe, we cannot make any claims about the nature of the ideological biases that contaminate existing democracy measures. For instance, I show that the Freedom House data, often believed to have a conservative bias, may actually have a liberal bias instead. I do that by replicating previous research on the subject (Bollen and Paxton 2000) but replacing real-world data by simulated data in which I manipulate democracy levels and the ideological biases of hypothetical raters. The results of these Monte Carlos show that even though we can confidently assert the existence of bias in some democracy measures we cannot say anything about which measures are biased or in what ways. That means we currently have no way to circumvent the circularity problem: if we find that democracy is associated with some variable X is that a genuine association or an artifact of our democracy measure being biased toward X?

1. Introduction

What do we know about ideological bias in democracy measures? Bollen and Paxton (2000), using structural equation modeling, find that several indicators from the Freedom House dataset (Sussman 1982; Gastil 1988) and from Arthur Banks’ Cross-National Time Series Archive - CNTS (Banks [1971], updated through 1988) are compromised by ideological bias: the coding is sensitive to a number of variables that (conceptually) have nothing to do with democracy, such as economic policy (whether the polity is Marxist-Leninist), religion (whether the polity is predominantly Roman Catholic or predominantly Protestant), and form of government (whether the polity is a monarchy or a republic).

Bollen and Paxton's article has been highly influential. Fourteen years after its publication it is still the only comprehensive, systematic attempt to uncover ideological bias in democracy measures. As such, it appears in nearly every discussion of democracy measurement (e.g.: Munck and Verkuilen [2002], Treier and Jackman [2008], Pemstein, Meserve and Melton [2010]). And it has influenced researchers' choices - most notably, it has become commonplace to avoid the Freedom House democracy data on the grounds that Bollen and Paxton have found them to have a conservative bias. Although influential, Bollen and Paxton's findings have never been subjected to scrutiny; they are usually taken at face value. In this paper I perform the first reassessment of Bollen and Paxton's findings. I do that with simulated data in which I manipulate the countries' levels of democracy and the measures' ideological biases. I find that Bollen and Paxton's method: a) yields incorrect results about which democracy measures are biased; b) yields incorrect results about the nature of those biases; and c) fails to find bias when different measures are biased in similar ways. In sum, I show that for the past fourteen years political scientists have allowed flawed results to influence their choices of democracy data. Those choices, in turn, may have affected what we think we know today about democracy. For instance, are democracy and economic freedom associated, as some economists argue,1 or is that apparent association an artifact of our democracy data being responsive to economic policy in the first place? Bollen and Paxton's results suggest that the Freedom House data have a conservative bias and that the CNTS data do not. Hence we might feel inclined to address the bias problem by avoiding the Freedom House data and using the CNTS data instead. But in this paper I show that Bollen and Paxton's results do not really tell us anything about which indices are biased, or in what ways.

1. See for instance Lawson and Clark (2010).

In other words, whichever democracy measure we choose, we cannot discard the possibility that our empirical tests are circular. Section 2 details Bollen and Paxton's work. Section 3 explains the Monte Carlos and presents the results. Section 4 concludes.

2. Bollen and Paxton’s analysis

In this section I explain in detail Bollen and Paxton's methodology and results. The gist of it is that Bollen and Paxton treat ideological bias as a latent variable and use structural equation modeling (SEM) to extract it from Freedom House indicators and from CNTS indicators. Bollen and Paxton then regress the extracted biases on a number of polity characteristics (economic, social, and political variables) and use the estimated coefficients (signs and statistical significance) to draw conclusions about how exactly the Freedom House and the CNTS are biased. Bollen and Paxton's analysis is based on eight indicators, four from the Freedom House dataset and four from the CNTS dataset. The Freedom House indicators are: "freedom of broadcast media", "freedom of print media", "civil liberties", and "political rights". The CNTS indicators are: "freedom of group opposition", "competitiveness of the nomination process", "chief executive elected", and "effectiveness of the legislative body". Bollen and Paxton start by using SEM to extract five latent variables from those eight indicators. Two latent variables are assumed to be democracy features ("political liberties" and "democratic rule") and three are assumed to be coder-specific ideological biases (Raymond Gastil's and Leonard Sussman's, who were Freedom House coders, and Arthur Banks', who was the CNTS coder)2.

2. Raymond Gastil was responsible for the "civil liberties" and "political rights" indicators. Leonard Sussman was responsible for the "freedom of broadcast media" and "freedom of print media" indicators. Arthur Banks was responsible for the "freedom of group opposition", "competitiveness of the nomination process", "chief executive elected", and "effectiveness of the legislative body" indicators.

Each of the eight indicators is modeled as being determined by a traits factor, a methods factor, and random measurement error. For instance, "freedom of broadcast media" is modeled as being determined by the traits factor "political liberties", by the methods factor "Leonard Sussman's bias" (since Leonard Sussman was the researcher responsible for the "freedom of broadcast media" indicator), and by random measurement error. More generally, each indicator is modeled as

indicator_kp = λ_t × trait_tp + λ_m × method_mp + δ_kp

where indicator k for polity p is a linear combination of traits factor t for polity p, methods factor m for polity p, and indicator k’s random measurement error for polity p. The complete picture of which indicators load on which factors is provided in Figure 1 below, extracted from Bollen and Paxton (65).3

3. I thank Prof. Bollen for helping me understand some aspects of the model specification.

Figure 1. Bollen and Paxton's model. Source: Bollen and Paxton (2000, p. 65).

Figure 1 follows the standard SEM notation, with square boxes representing indicators (i.e., observed variables) and circles representing factors (i.e., latent variables). The factor-to-indicator arrows show which indicators load on which factors.4 The factor-to-factor arrows show which factors correlate.5 The "E" arrows show which indicators have random measurement error.6 Bollen and Paxton estimate, for each indicator, the factor loadings (i.e., the lambdas) and the random measurement error for each year in the 1972-1988 interval (see Appendix A). They find the fit statistics shown in Figure 2 below, extracted from their article (67).

4. Based on previous work (Bollen 1993), Bollen and Paxton model the "freedom of group opposition" as being free from Banks' ideological bias.
5. The "political liberties" and "democratic rule" factors correlate because they are close concepts. "Sussman" and "Gastil" correlate because they both worked at the Freedom House.
6. Based on previous work (Bollen 1993), Bollen and Paxton model the "political rights" and "competitiveness of the nomination process" indicators as having no random measurement error, i.e., δ = 0.


Figure 2. Bollen and Paxton's fit statistics. Source: Bollen and Paxton (2000, p. 67). IFI stands for Incremental Fit Index; 1-RMSEA stands for 1 minus the Root Mean Square Error of Approximation.

Figure 2 compares two fit statistics - the Incremental Fit Index (IFI) and 1 minus the Root Mean Square Error of Approximation (1-RMSEA) - for each year from 1972 to 1988. In both cases (IFI and 1-RMSEA) the larger the statistic, the better the model fit. As we see, the model fit improves considerably when we include both traits and methods factors, as compared to when we include only traits factors.7 The improved fit shows that each of the eight indicators is the product not only of the underlying trait ("political liberties" or "democratic rule", according to the case) and random measurement error, but also of a systematic component - the rater's ideological bias.
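For reference, the standard textbook definitions of these two statistics (my notation, not formulas reproduced from Bollen and Paxton) are, in LaTeX:

\mathrm{IFI} = \frac{\chi^2_{b} - \chi^2_{m}}{\chi^2_{b} - df_{m}},
\qquad
\mathrm{RMSEA} = \sqrt{\frac{\max\left(\chi^2_{m} - df_{m},\; 0\right)}{df_{m}\,(N-1)}}

where the subscript b refers to the baseline (independence) model, the subscript m to the fitted model, and N is the sample size. A better-fitting model pushes χ²_m toward df_m, driving IFI toward 1 and RMSEA toward 0 - hence the use of 1-RMSEA so that, for both curves, higher means better fit.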

7. Not all factors are included in every year: Gastil's and Banks' indicators are available for the entire 1972-1988 interval, but Sussman's are only available for the 1979-1981 and 1983-1987 intervals. Hence for 1972-1978, 1982, and 1988 the estimated model is actually a restricted version of the model depicted in Figure 1 above: everything is the same except that the Sussman factor and the corresponding indicators are not included.

I replicated Bollen and Paxton's analysis, just to make sure I was following the same procedures, and obtained almost exactly the same fit statistics:

Figure 3. Fit statistics from my replication of Bollen and Paxton. Source: my own estimations. These estimates are essentially identical to those in Bollen and Paxton (2000, p. 67). IFI stands for Incremental Fit Index; 1-RMSEA stands for 1 minus the Root Mean Square Error of Approximation.

Bollen and Paxton then use those estimates to produce three sets of factor scores, one for each rater (see Appendix A). Bollen and Paxton regress these factor scores on a number of country-level variables, for two sets of years: 1972, 1975, 1980, 1984, and 1988 (Gastil's factor scores and Banks' factor scores); and 1980 and 1984 (Sussman's factor scores). Table 1 below reproduces the estimates they obtained using 1980 data.

Table 1. Bollen and Paxton’s regressions for 1980a

                                        Gastil      Sussman     Banks
Marxist-Leninist                        -0.976***   -0.504*      1.366**
                                        (0.294)     (0.3)       (0.53)
Protestant                               0.357       0.459       0.670**
                                        (0.324)     (0.295)     (0.264)
Roman Catholic                           0.835***    1.239****   0.676**
                                        (0.312)     (0.331)     (0.286)
monarchy                                 0.343       0.131      -1.441****
                                        (0.263)     (0.282)     (0.268)
ln(energy per capita)                   -0.122      -0.063       0.124
                                        (0.126)     (0.123)     (0.177)
ln(years since independence)             0.152       0.084      -0.515
                                        (0.153)     (0.153)     (0.16)
coups                                    0.349       0.380      -0.615***
                                        (0.268)     (0.287)     (0.202)
internal or interstate war in 1980?     -0.360       0.022      -0.234
                                        (0.302)     (0.328)     (0.438)
ln(protests)                             0.098       0.068      -0.174
                                        (0.121)     (0.139)     (0.109)
ln(political strikes)                   -0.074      -0.080       0.211
                                        (0.137)     (0.158)     (0.135)
ln(riots)                               -0.200      -0.023       0.075
                                        (0.142)     (0.148)     (0.148)
media coverage                           0.019       0.002      -0.034
                                        (0.036)     (0.037)     (0.032)
ln(population)                           0.155       0.029       0.279**
                                        (0.117)     (0.139)     (0.122)
ln(area in km2)                         -0.050      -0.025      -0.024
                                        (0.058)     (0.061)     (0.076)
ln(radio sets + TV sets per capita)      0.067       0.022      -0.105
                                        (0.075)     (0.067)     (0.102)
intercept                               -0.817      -0.494       1.922***
                                        (0.646)     (0.614)     (0.632)
adjusted R2                              0.24        0.26        0.29
N                                        81          81          81

a OLS estimates. Heteroskedastic-consistent standard errors in parentheses. * p <0.10; ** p <0.05; *** p <0.01; **** p <0.001. Data sources: The New York Times, CBS News Index, Facts on File, The World Almanac and Encyclopedia, United Nations Statistical Yearbook, others.

As we observe, Leonard Sussman and Raymond Gastil seem to be biased against Marxist-Leninist countries and in favor of Roman Catholic countries, whereas Arthur Banks seems to be biased in favor of Marxist-Leninist countries, Protestant countries, and Roman Catholic countries, and against monarchic countries. In regressions using data from other years, Bollen and Paxton also find positive, statistically significant coefficients for the Protestant variable in the Gastil and Sussman regressions (Bollen and Paxton 2000, 76). As the next section will show, none of these conclusions is warranted. As before, here too I replicated Bollen and Paxton, just to make sure I was following the same procedures. Here my replication was less successful, with several discrepancies (see Appendix B for details):

Table 2. Replication of Bollen and Paxton's regressions for 1980a

                                        Gastil      Sussman     Banks
Marxist-Leninist                        -0.874***   -0.846**     0.587**
                                        (0.257)     (0.358)     (0.293)
Protestant                              -0.103      -0.081       0.147
                                        (0.280)     (0.390)     (0.320)
Roman Catholic                           0.494**     0.872**     0.372
                                        (0.238)     (0.332)     (0.272)
monarchy                                 0.024      -0.473      -0.166
                                        (0.276)     (0.385)     (0.315)
ln(energy per capita)                   -0.170      -0.088       0.368**
                                        (0.136)     (0.190)     (0.156)
ln(years since independence)             0.294*      0.279      -0.314**
                                        (0.129)     (0.179)     (0.147)
coups 1976-1980                         -0.018       0.054      -0.620**
                                        (0.159)     (0.222)     (0.182)
internal or interstate war in 1980?     -0.677**    -0.617*      0.352
                                        (0.263)     (0.366)     (0.300)
ln(protests in 1975-1980)                0.066       0.118*      0.004
                                        (0.043)     (0.060)     (0.049)
ln(strikes in 1975-1980)                -0.027       0.011       0.031
                                        (0.038)     (0.053)     (0.043)
ln(riots in 1975-1980)                  -0.029      -0.020       0.017
                                        (0.041)     (0.058)     (0.047)
ln(media coverage)                       0.120       0.069      -0.355***
                                        (0.117)     (0.163)     (0.133)
ln(population)                          -0.096      -0.240*      0.296**
                                        (0.101)     (0.141)     (0.116)
ln(area in km2)                          0.031       0.082      -0.076
                                        (0.064)     (0.089)     (0.073)
ln(radio sets + TV sets per capita)      0.094      -0.014      -0.188
                                        (0.137)     (0.192)     (0.157)
intercept                               -0.986       0.429       1.004
                                        (1.132)     (1.578)     (1.292)
N                                        112         112         112
F                                        3.98***     3.27***     3.29***
adjusted R-squared                       0.2871      0.2344      0.2364

a OLS estimates. Heteroskedastic-consistent standard errors in parentheses. * p <0.10; ** p <0.05; *** p <0.01.

As in Bollen and Paxton, here too Marxism-Leninism has a negative, statistically significant coefficient in the Gastil and Sussman regressions and a positive, statistically significant coefficient in the Banks regression. Also as in Bollen and Paxton, we find here that Roman Catholic has a positive, statistically significant coefficient in the Gastil and Sussman regressions. The similarities stop there. In Bollen and Paxton Protestant, Roman Catholic, and monarchy all turn out statistically significant in the Banks regression, but in the replication they do not. (The other variables have little to do with ideological bias so they are of no interest here.) Because the point of this paper is the simulations, not the replication, I leave the details for Appendix B.

3. Monte Carlos

3.1 Basic idea

How solid are the results obtained in the previous section? In this section I show that they are indeterminate; they tell us nothing about which democracy measures are biased or in what ways. The estimates from my replication of Bollen and Paxton suggest, for instance, that Gastil and Sussman are biased against Marxism-Leninism and that Banks is biased in favor of Marxist-Leninist countries. But what if all three raters are biased in the same direction, only to different degrees? If all three raters are biased in favor of Marxism-Leninism but Banks more so than Gastil and Sussman, couldn't that produce the opposite coefficient signs we observe? Or, alternatively, if all three are biased against Marxist-Leninist countries but Gastil and Sussman much more so than Banks, couldn't that produce opposite coefficient signs as well? The same applies to the other three variables of interest - Protestant, Roman Catholic, and monarchy.

Table 2 suggests that none of the raters are biased against or in favor of Protestant or monarchic countries. But maybe they all are, and to similar degrees - so the bias becomes "invisible" and the SEM estimates simply cannot capture it. Table 2 also suggests that Gastil and Sussman are biased in favor of Roman Catholic countries whereas Banks is not. But what if Banks is biased in favor of Roman Catholic countries as well, just less so than Gastil and Sussman? How can we verify all that? We cannot observe a country's "true" level of democracy or a rater's ideological bias - these are latent variables. But we can simulate them. In other words, we can make up some democracy levels and some raters' ideological biases. We can then redo Bollen and Paxton's analysis, but using the simulated democracy data rather than the actual democracy data. Because we will know the "true" (i.e., simulated) democracy levels and ideological biases, we will be able to know how reliable Bollen and Paxton's results are. I start by producing simulated data in which I fix the level of democracy and the direction and magnitude of each rater's ideological bias. I then make these simulated factors load on a number of simulated indicators, estimate the structural model, and extract the factor scores. I then regress the extracted factor scores on the same country-level variables Bollen and Paxton used (Marxism-Leninism, Protestantism, etc.) and check whether the coefficients are "telling the truth" - for instance, whether the coefficient of Marxism-Leninism is negative and statistically significant when the simulated rater is biased against Marxism-Leninism. I repeat the process thousands of times, each time drawing a new batch of simulated factors, and count how often we obtain misleading coefficients (for instance, how often Marxism-Leninism is not negative and significant even though the simulated rater is biased against Marxism-Leninism). That should give us an idea of how reliable the findings in Table 2 - and, by extension, those in Bollen and Paxton - are.


3.2 Model specification

I begin by simulating three factors: each country's level of democracy; the idiosyncrasies (i.e., the systematic measurement error) of a hypothetical rater we are going to call Rater #1; and the idiosyncrasies of a hypothetical rater we are going to call Rater #2. The level of democracy is generated as a uniform random variable ranging from 0 to 20.8 Rater #1's factor is generated as a normal random variable with mean 5 and standard deviation 5. And Rater #2's factor is generated as a normal random variable with mean 5 and standard deviation 15.9 For each factor I generate 112 observations (the number of countries in the dataset). The second step is to introduce ideological bias into the factors. I do that by making the raters' factors alternately respond to Marxism-Leninism, Protestantism, Roman Catholicism, or monarchy. The nature of the simulated bias is different across these four variables. In the case of Marxism-Leninism Table 2 suggests opposite biases. So I test whether the same result might be obtained even if Rater #1 and Rater #2 were biased in the same direction, but to different degrees. Hence for Marxist-Leninist countries I boost Rater #1's factor by p1 points and Rater #2's factor by p2 points, with p2 always fixed at 0.025 and p1 taking the following values: 0.5, 1, 3, 5, 7, 10, 15, and 20. In the case of Protestantism Table 2 would have us believe that none of the raters are biased. But what if all raters are biased in the same direction and to similar degrees?

8. That seems to be the distribution of actual measures of democracy (e.g., the "political rights" index of the Freedom House).
9. The normal distribution is chosen because structural equation models rely on the assumption that the factors follow a multivariate normal distribution (thus we could not have all three factors follow a uniform distribution).

Could that not make the bias become "invisible" in the estimation? To check that, for Protestant countries I boost Rater #1's factor by c1 percent and Rater #2's factor by c2 percent, with the c1-c2 pairs being: 30%-35%, 50%-55%, 70%-75%, 130%-135%, 150%-155%, 170%-175%, 190%-195%, and 230%-235%.10 In the case of Roman Catholicism Table 2 suggests that Gastil and Sussman are positively biased and that Banks is not biased in any direction. We want to know whether that result might be obtained even if all three raters were positively biased, only with Gastil and Sussman more so than Banks. So here I do the same as in the Marxism-Leninism case: for Roman Catholic countries I boost Rater #2's factor by p2 = 0.025 points and Rater #1's factor by p1 points, with p1 taking the following values: 0.5, 1, 3, 5, 7, 10, 15, and 20. Finally, in the case of monarchy Table 2 suggests no one is biased, but - as in the case of Protestantism - perhaps Gastil, Sussman, and Banks are all biased in the same direction and to similar degrees, which could make the bias "disappear" in the SEM estimations. So for monarchies, as for Protestant countries, I boost Rater #1's factor by c1 percent and Rater #2's factor by c2 percent, with the c1-c2 pairs being, again, 30%-35%, 50%-55%, 70%-75%, 130%-135%, 150%-155%, 170%-175%, 190%-195%, and 230%-235%. The third step is to use those simulated factors to generate simulated indicators (i.e., the variables we do observe in SEM estimation). I model them as follows:

indicator1 = 14.12 × rater1 + 69.87 × democracy + δ1/m

indicator2 = 06.71 × rater1 + 68.57 × democracy + δ2/m

indicator3 = 18.31 × rater1 + 53.45 × democracy + δ3/m

indicator4 = 31.81 × rater1 + 28.21 × democracy + δ4/m

10. Thus here the bias is multiplicative - unlike in the Marxism-Leninism case, where the bias is additive.

indicator5 = 95.69 × rater1 + 40.63 × democracy + δ5/m

indicator6 = 38.13 × rater1 + 97.69 × democracy + δ6/m

indicator7 = 70.70 × rater2 + 21.17 × democracy + δ7/m

indicator8 = 31.51 × rater2 + 51.63 × democracy + δ8/m

indicator9 = 90.83 × rater2 + 26.09 × democracy + δ9/m

indicator10 = 12.99 × rater2 + 55.01 × democracy + δ10/m

indicator11 = 53.06 × rater2 + 63.13 × democracy + δ11/m

indicator12 = 52.19 × rater2 + 15.67 × democracy + δ12/m

As we see, democracy loads on all twelve indicators; Rater #1 loads on the first six indicators; and Rater #2 loads on the last six indicators. There is also a random measurement error, δ, specific to each indicator. In one third of the simulations the m parameter (that divides δ) is simply 1, so the error term does not suffer any transformation. In another third of the simulations the m parameter is 0.001, so we can see what happens to the estimates when the random errors are magnified. And in another third of the simulations the m parameter is 1,000, so we can see what happens to the estimates when the random errors shrink. Each δ is a combination of a normal random variable and a beta random variable, as follows:

δ1 = N(µ = 0, σ = 827) + Beta(α = 0.68, β = 0.78)

δ2 = N(µ = 0, σ = 4188) + Beta(α = 0.10, β = 0.72)

δ3 = N(µ = 0, σ = 228) + Beta(α = 0.58, β = 0.67)

δ4 = N(µ = 0, σ = 3237) + Beta(α = 0.50, β = 0.60)

δ5 = N(µ = 0, σ = 1965) + Beta(α = 0.06, β = 0.83)

δ6 = N(µ = 0, σ = 734) + Beta(α = 0.15, β = 0.86)

δ7 = N(µ = 0, σ = 1439) + Beta(α = 0.51, β = 0.46)

δ8 = N(µ = 0, σ = 2983) + Beta(α = 0.23, β = 0.40)

δ9 = N(µ = 0, σ = 1190) + Beta(α = 0.73, β = 0.21)

δ10 = N(µ = 0, σ = 112) + Beta(α = 0.26, β = 0.63)

δ11 = N(µ = 0, σ = 806) + Beta(α = 0.54, β = 0.37)

δ12 = N(µ = 0, σ = 4299) + Beta(α = 0.13, β = 0.46)
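Concretely, one draw of the simulated data can be produced in a few lines of R. This is only a minimal sketch under stated assumptions, not the replication code itself: the marxist dummy is a placeholder for the real-world country-level variable, and only the first two of the twelve indicators are written out.

set.seed(42)
n <- 112                              # number of countries
m <- 1                                # error scaling: 1, 0.001, or 1,000

democracy <- runif(n, 0, 20)          # "true" democracy levels
rater1    <- rnorm(n, 5, 5)           # Rater #1's idiosyncrasies
rater2    <- rnorm(n, 5, 15)          # Rater #2's idiosyncrasies

marxist <- rbinom(n, 1, 0.15)         # placeholder Marxist-Leninist dummy
rater1  <- rater1 + 5     * marxist   # additive bias: p1 = 5 points
rater2  <- rater2 + 0.025 * marxist   # additive bias: p2 = 0.025 points
# (for Protestantism and monarchy the bias is multiplicative instead, e.g.
#  rater1[prot == 1] <- rater1[prot == 1] * 1.30, where prot is the Protestant dummy)

delta1 <- rnorm(n, 0, 827)  + rbeta(n, 0.68, 0.78)   # normal + beta error
delta2 <- rnorm(n, 0, 4188) + rbeta(n, 0.10, 0.72)

indicator1 <- 14.12 * rater1 + 69.87 * democracy + delta1 / m
indicator2 <-  6.71 * rater1 + 68.57 * democracy + delta2 / m
# ... indicators 3-6 load on rater1 and indicators 7-12 on rater2,
#     with the loadings and error parameters listed above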

These modeling choices need justification. The number of factors - three - is the minimum we need to be able to fix each country's level of democracy and to evaluate Bollen and Paxton's assertions about bias direction. The number of indicators (twelve) is somewhat arbitrary; it could have been eight or fourteen, for instance. What matters is that for each factor there are at least three or four indicators, so that there are enough data to estimate the model. The loading coefficients (14.12, 69.87, etc.) are entirely arbitrary, except that they are always positive; otherwise the direction of the bias would change between the factors and the indicators.11,12 The parameters shown above - the distributional parameters of the factors, the factor loadings of each indicator, and the distributional parameters of the error terms - remain the same across all simulations. But at each simulation the factors and the random errors are redrawn, so the indicators (which are functions of both) change as well. Also, the m parameter, as explained above, assumes three different values (1, 0.001, and 1,000). The simulations are done separately for each of the four variables of interest, i.e., in any given simulation the hypothetical raters are biased toward only one of the four variables.

11. E.g., if Rater #1 is biased in favor of Marxism-Leninism, a negative factor loading would make the corresponding indicator be biased against Marxism-Leninism.
12. Initially I generated the random errors (the δs) as purely normal variables. But that resulted in excessive correlations between the errors, even when varying the standard deviations. That resulted in highly correlated indicators, which results in non-invertible matrices and makes SEM estimation impossible. That is why I add the beta component.

The basic procedure is: I generate the three factors (democracy, Rater #1, Rater #2), bias Rater #1 and Rater #2, generate the twelve random errors, generate the twelve indicators, estimate the structural equation model, save the two sets of factor scores assumed to represent ideological bias (Rater #1's and Rater #2's), regress each set on country-level variables (the same ones used to produce Table 2)13, and check whether the outcome of interest (i.e., the outcome analogous to that of Table 2) obtains. I repeat this process 1,000 times for each of the four variables of interest and for each of the p1-p2 pairs and c1-c2 pairs discussed above. I also repeat the process 1,000 times using the original, unbiased simulated factors, just to have a baseline. Finally, I repeat the whole process for each of the three values of m discussed before (1, 0.001, and 1,000). Thus in total there are 108 different specifications with 1,000 repetitions each.14 In SEM estimation identification is usually achieved by fixing some of the parameters. I do that by fixing the variances of the errors and the variances and covariances of the factors. Thus what changes from one simulation to the next are the estimated factor loadings, and consequently the factor scores and the coefficients obtained from regressing these factor scores on country-level variables. All 108,000 estimations converge, so I do not discard any of them (Paxton et al. 2001, 301-302).15

13. These country-level variables are real-world data, not simulated data.
14. (4 variables of interest) × (1 baseline + 8 values of p or c) × (3 values of m) × (1,000 repetitions) = 108,000 simulations.
15. The estimations take about five hours to run on a CPU with 2.4 GHz and 4 GB of memory, using the 'sem' package in R. On multi-core machines that time can be drastically reduced by parallelizing the simulations across the multiple cores (though that requires rewriting the code).
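The following R sketch shows one cell of this design (Marxism-Leninism, p1 = 5, p2 = 0.025, m = 1). To keep it self-contained and quick to run, the SEM step is replaced by a crude stand-in - the mean of each rater's six standardized indicators - and the error terms are collapsed into a single normal term; the actual simulations instead estimate the full structural equation model with the 'sem' package and use its factor scores. The loop structure, the bias injection, and the final regressions are the same; the marxist dummy is again a placeholder.

set.seed(1)
n <- 112; reps <- 1000; misleading <- 0
marxist <- rbinom(n, 1, 0.15)                       # placeholder country-level dummy
l1 <- c(14.12, 6.71, 18.31, 31.81, 95.69, 38.13)    # Rater #1 loadings
l2 <- c(70.70, 31.51, 90.83, 12.99, 53.06, 52.19)   # Rater #2 loadings
ld <- c(69.87, 68.57, 53.45, 28.21, 40.63, 97.69,
        21.17, 51.63, 26.09, 55.01, 63.13, 15.67)   # democracy loadings

for (r in 1:reps) {
  democracy <- runif(n, 0, 20)
  rater1 <- rnorm(n, 5, 5)  + 5     * marxist       # p1 = 5
  rater2 <- rnorm(n, 5, 15) + 0.025 * marxist       # p2 = 0.025
  X1 <- sapply(1:6, function(k) l1[k] * rater1 + ld[k]     * democracy + rnorm(n, 0, 500))
  X2 <- sapply(1:6, function(k) l2[k] * rater2 + ld[k + 6] * democracy + rnorm(n, 0, 500))
  f1 <- rowMeans(scale(X1))                         # stand-in for Rater #1's factor scores
  f2 <- rowMeans(scale(X2))                         # stand-in for Rater #2's factor scores
  b1 <- summary(lm(f1 ~ marxist))$coefficients["marxist", ]
  b2 <- summary(lm(f2 ~ marxist))$coefficients["marxist", ]
  # misleading outcome: significant coefficients with opposite signs (cf. Table 2)
  if (b1[4] < 0.10 && b2[4] < 0.10 && sign(b1[1]) != sign(b2[1]))
    misleading <- misleading + 1
}
misleading / reps                                   # share of misleading repetitions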

3.3 Results

The results are summarized in Tables 3 and 4 below.


Table 3. Simulation results for Marxism-Leninism and Catholicism - frequency of misleading resultsa

                        Marxism-Leninism             Roman Catholicism
                        m=1    m=0.001  m=1,000      m=1    m=0.001  m=1,000
p1=20;  p2=0.025        189    212      216          816    821      833
p1=15;  p2=0.025        156    167      167          796    786      786
p1=10;  p2=0.025         94     86       98          605    569      573
p1=7;   p2=0.025         50     47       67          355    341      357
p1=5;   p2=0.025         15     31       31          207    211      250
p1=3;   p2=0.025         15     18       11          122    115      127
p1=1;   p2=0.025          9      3        7           68     76       73
p1=0.5; p2=0.025          6      4        2           57     47       68
no bias                   5      3        2           41     48       46

a For Marxism-Leninism the misleading result is a positive, statistically significant coefficient for Rater #1 combined with a negative, statistically significant coefficient for Rater #2. For Catholicism the misleading result is a positive, statistically significant coefficient for Rater #1 combined with a non-significant coefficient for Rater #2. Statistical significance is defined based on a p-value lower than 0.10.

Table 4. Simulation results for Protestantism and monarchy - frequency of misleading results (except for the "no bias" row, which shows correct results)a

                          Protestantism                monarchy
                          m=1    m=0.001  m=1,000      m=1    m=0.001  m=1,000
no bias                   791    816      816          794    795      791
c1=30%;  c2=35%           737    697      706          661    655      635
c1=50%;  c2=55%           617    587      589          523    503      532
c1=70%;  c2=75%           542    499      502          400    354      390
c1=130%; c2=135%          265    279      288          135    110      135
c1=150%; c2=155%          214    207      205          103     94       79
c1=170%; c2=175%          164    177      138           51     55       39
c1=190%; c2=195%          119    122      114           40     35       38
c1=230%; c2=235%           71     82       74           22     19        9

a For both variables the misleading result is a combination of non-significant coefficients for both Rater #1 and Rater #2. Statistical significance is defined based on a p-value lower than 0.10. Unlike the other rows, the "no bias" row does not show misleading results: it simply shows how often we obtain no evidence of bias when there is indeed no bias.

The results corroborate the suspicions raised before. For Roman Catholicism and Marxism-Leninism, if the two raters are biased in the same direction but to different degrees we often obtain the same results we saw in Table 2. If Rater #1's factor gets a 3-point boost when it comes to Roman Catholic countries and Rater #2's factor gets only a 0.025-point boost, we obtain misleading results 12.2% of the time (122 simulations out of 1,000). As the difference becomes larger, so does the frequency of misleading results: 20.7% if Rater #1's bonus is 5 points, and 81.6% if Rater #1's bonus is 20 points. Granted, a bonus of 20 points is unrealistic. Rater #1's factor is generated as a normal distribution with mean 5 and standard deviation 5, which means that about 95% of it lies in the [-4.8, 14.8] interval. A bonus of 20 would thus imply a rather passionate rater - one that is willing to rate a North Korea as a Sweden merely on the grounds of that (hypothetical) North Korea being Roman Catholic. But a bonus of 3 or 5 points is perfectly imaginable - and it is enough to yield misleading results too often (more than 10% and more than 20% of the time, respectively).

For Marxism-Leninism the difference in the magnitude of the bias must be somewhat extreme: a "bonus" of at least 15 points from Rater #1 and of only 0.025 points from Rater #2. Below 15 points we obtain misleading results less than 10% of the time. A bonus of 15 points sounds unrealistic given that Rater #1's factor follows a normal distribution with mean 5 and standard deviation 5. For Protestantism we find that when the two hypothetical raters are biased in the same direction and to similar degrees, we are bound to find no bias whatsoever in our estimations. When the bias is in the vicinity of 30% we find non-significant coefficients 73.7% of the time - which would mislead the researcher into thinking that neither rater is biased. Even when the bias is as extreme as 170% we still obtain non-significant coefficients 16.4% of the time. For monarchy the bias only becomes "visible" when it reaches 170% or more. In other words, the bias must be 170% or higher for us to obtain misleading results less than 10% of the time. It is clear, on the other hand, that we do obtain the correct result when none of the hypothetical raters are biased. Here the misleading result would be evidence of bias when there is in fact no bias. For Marxism-Leninism that happens less than 1% of the time. For Roman Catholicism that happens less than 5% of the time. For Protestantism that happens less than 10% of the time. And for monarchy that happens less than 3% of the time. That provides little solace though - with real-world data we cannot know whether the lack of statistical significance means unbiasedness or whether it means that all raters are biased in similar ways.

3.4 Summary

What do all these results tell us about Bollen and Paxton's findings? They tell us two things. First, when Bollen and Paxton find bias, all we can assert is that at least one of the raters is biased, but we cannot know which one(s) or in which direction(s). Consider Protestantism, for instance. Bollen and Paxton claim that Banks is biased in favor of Protestant countries while Gastil and Sussman are unbiased (with 1980 data). But it may be the case that Gastil and Sussman are biased against Protestant countries while Banks is unbiased. Or perhaps all three are biased in favor of Protestant countries, only Banks more so than Gastil and Sussman. Or, still, perhaps all three are biased against Protestant countries, only Banks less so than Gastil and Sussman. Our simulations show that in any of these scenarios we might obtain the same results that Bollen and Paxton did. Second, when Bollen and Paxton do not find bias, there is a good chance that there is bias. We know that because when our simulated raters are biased in the same direction and to similar degrees, our results often suggest no bias; depending on the magnitude of the biases, we get wrong results over 80% of the time. In other words, when all raters are biased in a similar way the bias becomes "invisible". These same warnings apply to other studies that claim to have uncovered bias in existing measures of democracy. Consider Steiner (2012), for instance. He regresses Freedom House data on other democracy measures and then checks for correlations between the residuals and a number of foreign policy indicators (voting behavior in the UN, alliances, rivalries, foreign assistance, and trade). He finds the expected correlations and concludes that the Freedom House "rates countries that have closer political ties and affinities with the U.S. [...] as more democratic" (4).

But what if the Freedom House is unbiased but all other measures are biased against US-friendly countries? Or what if all measures of democracy are biased against US-friendly countries, only the Freedom House less so than the others? In all these scenarios Steiner might observe exactly the same result, so his conclusions are completely unwarranted.16 Unless we have an unbiased measure of democracy, any statistical attempt to uncover the direction of ideological biases is futile.

4. Conclusion

We cannot uncover ideological bias a posteriori. All we can assert today is that at least some of our democracy measures are contaminated by ideological bias. We cannot say which ones and we cannot say anything about the direction or magnitude of the biases (these are "known unknowns", to use Donald Rumsfeld's famous expression); and it is possible that other biases exist that we have not uncovered yet ("unknown unknowns"). In sum, we are in the dark. Because we are in the dark many democracy-related arguments rest on shaky ground. Are economic freedom and democracy associated or is that apparent association an artifact of our democracy measures being biased in favor of economic freedom? Are parliamentary democracies more stable than presidential ones or do our democracy measures favor parliamentary systems? If we do not know which democracy measures are biased, or in what ways, how can we address these questions? The implication is that we need better democracy measures. In particular, we need a democracy measure whose data-generating process is transparent and replicable, so that at the very least we can know something about how exactly the measure is biased.

16. It is perfectly plausible that the Freedom House has an anti-market (and, as a consequence, perhaps an anti-US) bias, since it includes "socioeconomic rights" and "freedom from gross socioeconomic inequalities" among its subcomponents (Munck and Verkuilen 2002, 9). Alternatively, it is perfectly plausible that Steiner's results simply reveal that countries with closer ties to the US are more democratic, for whatever reasons.

The crux of the matter is that all existing indices we have today - be it the Freedom House, the CNTS, or the Polity (Marshall, Jaggers, and Gurr 2013) - rely on country experts checking boxes on questionnaires. We do not observe what boxes those experts check, or why. The process is opaque, which makes it easy for country experts to boost the scores of countries that adopt the "correct" policies. Coding rules help, but still leave too much open for interpretation. Consider this excerpt from the Polity IV handbook: "If the regime bans all major rival parties but allows minor political parties to operate, it is coded here. However, these parties must have some degree of autonomy from the ruling party/faction and must represent a moderate ideological/philosophical, although not political, challenge to the incumbent regime." (p. 73). How do we measure autonomy? Can we always observe it? What is "moderate"? Clearly it is not that hard to smuggle ideological contraband into democracy scores. This is not to say that there have not been innovations in the field of democracy measurement. The Varieties of Democracy project, begun in 2010, is a group effort that seeks to build a new, fine-grained measure consisting of 33 subcomponents, each subdivided into dozens of more specific indicators.17 Pemstein, Meserve and Melton (2010), in turn, treat democracy as a latent variable and use a multirater ordinal probit model to extract that latent variable from twelve different measures (including the Polity and the Freedom House); they call the resulting measure the Unified Democracy Scores (UDS). These are interesting developments, but both fall short of addressing the issue of bias. The Varieties of Democracy project may give us fine-grained democracy indicators but just like existing measures these indicators will be produced by country experts opaquely checking boxes in a questionnaire. The UDS, in turn, does a great job at mitigating random error, but as the authors themselves acknowledge, the UDS cannot fix systematic error.

17. See Coppedge et al. (2001) and https://v-dem.net/

Hence neither initiative addresses the problem of bias. One possible solution would be to embrace the text-as-data approach (see Grimmer and Stewart [2013] for an overview) - for instance, by using some automated algorithm to extract regime-related information from news articles. That would make the data-generating process transparent and replicable (and also much cheaper, as we would be dispensing with country experts).

Automated Democracy Scores

Abstract

In this paper I use automated text analysis to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). I produce the ADS using the well-known Wordscores algorithm (created by Laver, Benoit, and Garry [2003]) and 42 million news articles from 6,043 different sources. The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases; and a simple (though computationally demanding) extension of the method would yield daily data and real-time data. I create a website where anyone can replicate and tweak the data-generating process by changing the parameters of the underlying model (no coding required): www.democracy-scores.org.

1. Introduction

In this paper I use automated text analysis to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The basic idea behind the ADS is simple. News articles on, say, North Korea or Cuba contain words like "censorship" and "repression" more often than news articles on Belgium or Australia. Hence news articles contain regime-related information (even if we disregard word order and treat each article as a "bag of words"). We can quantify that information to build a democracy index. I produce the ADS using the Wordscores algorithm, developed in Laver, Benoit, and Garry (2003), and 42 million news articles from 6,043 different sources.

The ADS cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases; and a simple (though computationally demanding) extension of the method would yield daily data and real-time data. The next section explains why we need a new democracy index in the first place. The remaining sections explain the method in detail; show the results and how they compare to existing democracy data; and discuss some future extensions.

2. Why do we need yet another democracy index?

There are at least twelve democracy indices today (Pemstein, Meserve, and Melton 2010). They all draw to some extent from Dahl's (1972) conceptualization: democracy as a mixture of competition and participation. But they differ markedly in how they operationalize the concept - i.e., they differ in what empirical phenomena they pick as democracy manifestations; in how they aggregate these different empirical phenomena to produce a democracy scale; and in whether they model democracy as a categorical or continuous variable (Munck and Verkuilen 2002). In light of such diversity, do we really need yet another democracy index? I argue that we do, for three reasons. First, because the democracy indices we have today do not provide adequate measures of uncertainty. Without a good uncertainty measure we cannot know whether two countries are equally democratic or not, or whether a given country has become more (or less) democratic over time. That is, we cannot do descriptive inference. Moreover, without a good uncertainty measure we cannot do causal inference when democracy is one of the regressors. As Treier and Jackman (2008) warn, "whenever democracy appears as an explanatory variable in empirical work, there is an (almost always ignored) errors-in-variables problem, potentially invalidating the substantive conclusions of these studies" (203).

Yet the two most popular indices - the Polity (Marshall, Gurr, and Jaggers 2013) and the Freedom House (Freedom House 2013) - only give us point estimates, without any measure of uncertainty. That prevents us from knowing, say, whether Uruguay (Polity score = 10) is really more democratic than Argentina (Polity score = 8) or whether the uncertainty of the measurement process is sufficient to make them statistically indistinguishable. Only two indices come with uncertainty measures: Treier and Jackman's (2008) and Pemstein, Meserve, and Melton's (2010). Treier and Jackman (2008) treat democracy as a latent variable and use an item-response model to extract it from Polity indicators. Treier and Jackman (2008) provide both the point estimates (the means of the marginal posterior distributions of the latent democracy variable) and confidence intervals (quantiles of the marginal posterior distributions).18 Pemstein, Meserve, and Melton (2010) also treat democracy as a latent variable. They use a multirater ordinal probit model to extract that latent variable from twelve different measures (including the Polity and the Freedom House). And, like Treier and Jackman (2008), they also provide both point estimates (posterior means) and confidence intervals (posterior quantiles). They call their index Unified Democracy Scores (UDS). Both indices are big improvements over the Polity and Freedom House.

18. Armstrong (2011) does something similar, but using Freedom House indicators instead. His goal is different though - he extracts latent variables not to create a new democracy index but to investigate some properties of the Freedom House indicators. Also, he only reports the results for the 50 most populous countries and the full set of results is not available online.

It is hard to understand, in particular, why the UDS have not become the default democracy index in political science: the UDS summarize almost all pre-existing indices, have broad time and country coverage, and are freely available online.19 Even if one is not interested in standard errors, why arbitrarily pick this or that individual democracy index when we can rely on the collective wisdom of all indices, condensed in the UDS? That political scientists continue to use the Polity and Freedom House is probably due to inertia.20 If we do want standard errors though (and we should, for the sake of good descriptive and causal inference), we have a problem: both Treier and Jackman's (2008) index and the UDS offer standard errors that are too large to be useful. In Treier and Jackman's (2008) data 70 of the 153 countries are statistically indistinguishable from the United States (in the year 2000 - the only year they report). In the UDS data 70% of the countries are all statistically indistinguishable from each other (in the year 2008 - the last year in the UDS dataset); pairs as diverse (regime-wise) as Denmark and Suriname, Poland and Mali, or New Zealand and Mexico have overlapping confidence intervals. In short, existing democracy indices either have no standard errors or they have standard errors too large to be useful. That prevents us from doing descriptive inference and from knowing the effect of democracy on other variables. The second reason why we need a new democracy index is bias. Using structural equation modeling, Bollen and Paxton (2000) evaluate two popular sources of democracy data - the Freedom House and Arthur Banks' Cross-National Time Series Data Archive (Banks [1971], updated through 1988) - and show that they are contaminated by ideological bias. Raters have policy preferences and boost the democracy scores of countries that adopt the "correct" policies.

19. http://www.unified-democracy-scores.org/
20. A less noble possibility is cherry-picking: perhaps researchers try different indices and pick the one that yields the "correct" results.

Hence we cannot reliably use existing democracy measures to estimate the impact of democracy on policy (or vice-versa). For instance, how can we assess the impact of democracy on welfare spending when our measure of democracy is partly based on welfare spending? The democracy measures we have today make our empirical tests circular. When we regress welfare spending on democracy we usually assume that we are regressing y on x, but in reality we may be regressing y on z = f(x, y). Bollen and Paxton (2000) only evaluate two democracy datasets, but all democracy data rely on country experts and for any given country there are only so many experts. As Munck and Verkuilen (2002) warn, "for all the differences that go into the construction of these indices, they have relied, in some cases quite heavily, on the same sources and even the same precoded data." (29). Hence for all we know the Polity data are no less biased than the Freedom House data or the CNTS data. True, these democracy measures correlate highly, but that may merely indicate that they are all biased in similar ways (Munck and Verkuilen 2002). Unfortunately, whatever biases exist in those indices carry over to Treier and Jackman's (2008) and to the UDS. The latent-variable approach mitigates coders' random errors, but not coders' systematic errors. As Pemstein, Meserve, and Melton (2010) put it, the UDS rely on the assumption that "raters perceive democracy levels in a noisy but unbiased fashion" (10). The third reason why we need a new democracy index is replicability. Human-coded indices like the Polity and the Freedom House (and indices based on them, like Treier and Jackman's [2008] and the UDS) rely on country experts checking boxes on questionnaires. We cannot see what boxes they are checking, or why; all we observe are the final scores. The process is opaque and at odds with the increasingly demanding standards of openness and replicability of the field. Clearly, any of these three reasons alone justifies the creation of a new democracy index.


3. News articles

Our first task is to select the news articles to be used. Picking this or that news source - say, The New York Times or The Wall Street Journal - would not do. The reason is that there would not be enough text. Countries like the United States and Russia are on the news all the time, but countries like Uruguay and Cambodia only get occasional coverage and countries like Tuvalu and Kiribati are almost never mentioned. A single newspaper or magazine, or even a handful thereof, would not provide the amount of text we need to produce reliable democracy scores for all 196 independent countries in the world. Hence I use a total of 6,043 news sources. These are all the news sources in English available on LexisNexis Academic, which is an online repository of journalistic content. The list includes American newspapers like The New York Times, USA Today, and The Washington Post; foreign newspapers like The Guardian and The Daily Telegraph; news agencies like Reuters, Agence France Presse (English edition), and Associated Press; and online sources like blogs and TV stations' websites. I use LexisNexis's internal taxonomy to identify and select articles that contain regime-related news. In particular, I choose all articles with one or more of the following tags: "human rights violations" (a subtag of "crime, law enforcement and corrections"); "elections and politics" (a subtag of "government and public administration"); "human rights" (a subtag of "international relations and national security"); "human rights and civil liberties law" (a subtag of "law and legal system"); and "censorship" (a subtag of "society, social assistance and lifestyle").21

21It would be interesting to know how the results change if we select different topic tags, but unfortunately that is no longer possible: on 12/23/2013 LexisNexis changed its user interface and the dozens of political tags and subtags that existed before are now collapsed into a single "Government & Politics" tag, which is too broad for our purposes here.

LexisNexis' news database covers the period 1980-present,22 so in theory the ADS could cover that period as well. LexisNexis provides search codes for all countries that exist today - e.g., #GC508# for Afghanistan and #GC342# for Mexico. That way we can search for news articles on a specific country and be sure that all results will turn up - even the ones that do not mention the name of the country (for instance, many articles use the name of the country's capital when they mean the country's government - as in "Moscow retaliated by canceling the summit"). Unfortunately, however, LexisNexis provides no search codes for countries that no longer exist. To search for articles on the Soviet Union, for instance, we would need to search for the name(s) of the country (Soviet Union, USSR), its derivatives (Soviet), the name of the capital (Moscow), etc - anything that might tell us that the article refers to the Soviet Union. Clearly that would not work. What if the article does not mention any of those terms? And what if the country has a name that is also a proper noun (like Turkey)? We would have unreliable results. Thus we can only reliably search the 1992-2012 period. Other than Yugoslavia, no country has ceased to exist since 1992, so we have search codes for basically everything. (Naturally, many countries were created in that period, but that is not a problem - the dataset will simply start in 2008 for Kosovo, in 2002 for East Timor, and so on). That selection - i.e., regime-related news, all countries that exist today, 1992-2012 - results in a total of about 42 million articles (around 4 billion words total), which I then organize by country-year.23 To help reduce spurious associations I remove proper nouns24 (in a probabilistic way)25.

22Actual coverage varies by news source. 23A small proportion of the articles (about 0.05%) is left out. When a search produces more than 3,000 results LexisNexis only returns the first 1,000. Whenever possible I overcome that problem by searching for smaller periods of time. But in a few cases even searches for a single day produce more than 3,000 results. I could not figure out what criteria LexisNexis uses to select the 1,000 results it returns (I asked them by email but they never replied). Thus to avoid any selection biases I just leave all results out in those cases. (The cases are: Pakistan 5/2/2011; Pakistan 5/3/2011; Afghanistan 10/8/2001; Afghanistan 10/9/2001; United Kingdom 7/8/2005; and United States, several dates between 2003 and 2012.) I realize that doing this may introduce selection biases of its own, but at least I am creating the selection biases myself, whereas I have no idea how LexisNexis selects those 1,000 results. In any case, it is doubtful that excluding 0.05% of the news articles will have any noticeable impact on the results.

For each country-year I merge all the corresponding news articles into a single document and transform the document into a term-frequency vector - i.e., a vector that contains the absolute frequency of each word. I then merge all term-frequency vectors into one big term-frequency matrix. Rows represent words and columns represent country-years, so each cell gives us the absolute frequency of a given word in the news articles corresponding to a given country-year.
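To make the construction of the term-frequency matrix concrete, here is a minimal Python sketch. It assumes the articles for each country-year have already been merged into a single lowercased string; the two toy documents are invented for illustration (the real pipeline processes hundreds of gigabytes of text in chunks, as described in the footnotes).

```python
from collections import Counter

# Illustrative input: one merged (lowercased, proper-noun-free) document per country-year.
documents = {
    ("Belgium", 2012): "the election was free and the press reported freely",
    ("North Korea", 2012): "the censorship continued and repression of the press increased",
}

# Absolute word frequencies for each country-year (the term-frequency vectors).
counts = {key: Counter(text.split()) for key, text in documents.items()}

# Term-frequency matrix: rows are words, columns are country-years.
columns = sorted(counts, key=str)
vocabulary = sorted({word for c in counts.values() for word in c})
matrix = [[counts[col][word] for col in columns] for word in vocabulary]

for word, row in zip(vocabulary, matrix):
    print(word, row)
```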

4. Algorithm

There are several automated ways to extract data from text (see Grimmer and Stewart [2013] for an overview). The particular method I use is the Wordscores algorithm, created by Laver, Benoit, and Garry (2003) - henceforth LBG -, from which this section draws heavily. The next paragraphs explain the algorithm in detail, but here is the gist of it: we manually score some documents - called "reference" documents or "training" documents (Manning, Raghavan, and Schütze 2008); the algorithm then "learns" from the reference documents and uses that knowledge to score all other documents - called "virgin" documents.

24As we will see later the scoring algorithm works by associating certain words with certain qualities. But we want those words to refer to general phenomena like torture, repression, and censorship, not to specific people or places. We do not want, for instance, "Washington" being associated with high levels of democracy just because the word appears frequently on news stories featuring a highly democratic country. Removing proper nouns helps avoid that. 25I cannot possibly read 42 million articles. And we cannot simply remove all capitalized words, as that would eliminate the first word of every sentence, even if it is not a proper noun. Hence I apply the following rule: if all occurrences of the word are capitalized then that is probably a proper noun and therefore it is removed. (For each country-year I merge all the corresponding news articles into a single document and process each document in chunks of 10MB - to reduce memory usage -, so the check is restricted to the same 10MB chunk.)
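The capitalization rule described in footnote 25 can be sketched as follows; this simplified illustration skips the 10MB chunking and uses a deliberately naive tokenizer.

```python
import re

def drop_probable_proper_nouns(text):
    """Remove words whose every occurrence in the text is capitalized."""
    tokens = re.findall(r"[A-Za-z]+", text)
    seen_lowercase = {t.lower() for t in tokens if t[0].islower()}
    kept = [t for t in tokens if t[0].islower() or t.lower() in seen_lowercase]
    return " ".join(kept)

sample = "Censorship in Washington is rare, but the censorship laws changed."
print(drop_probable_proper_nouns(sample))
# "Washington" never appears uncapitalized, so it is dropped; "Censorship"
# survives because a lowercase occurrence ("censorship") exists elsewhere.
```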

So far Wordscores has only been used to measure party ideology (from party manifestos and legislative speeches). To the best of my knowledge, this is the first time Wordscores - or any other method of automated text analysis - is used to measure democracy. The first step is to select the reference cases. In other words, we need to pick some of the 4,067 country-years we have here to serve as the baseline from which the machine will "learn". Ideally the reference set must span the entire regime scale. If we only feed the algorithm, say, highly democratic cases, then the machine will not learn what words are associated with middle-of-the-road cases or authoritarian cases. To ensure that the reference set is broad enough I pick all country-years from 1992, the first year for which we have news articles (see previous section). Thus the ADS only cover the 1993-2012 period even though we have news articles from 1992 as well. The vast majority of multivariate analyses that use some measure of democracy (Polity, Freedom House, etc) use pretty recent data, rarely going farther back in time than the 1970s, so the ADS should serve most applied research well. The second step is to give each reference case a score. For us that means assigning a democracy score to each country-year from 1992. I follow LBG and extract these reference scores from an existing index.26 In particular, I choose Pemstein, Meserve, and Melton's (2010) UDS, which I mentioned before. The UDS have data on 184 countries for the year 1992. Hence we have 184 reference documents and 3,883 (4,067 - 184) virgin documents.

The third step is to compute the word scores. Let $F_{wr}$ be the relative frequency of word $w$ in reference document $r$. The probability that we are reading document $r$ given that we see word $w$ is then $P(r|w) = F_{wr} / \sum_r F_{wr}$. We let $A_r$ be the a priori position of reference document $r$ and compute each word score as $S_w = \sum_r P(r|w) \cdot A_r$.

26Just to be clear, LBG were measuring party ideology, not democracy, so obviously the indices they use have nothing to do with the one I use here.

The fourth step is to use the word scores to compute the scores of the remaining documents - the "virgin" documents. Let $F_{wv}$ be the relative frequency of word $w$ in virgin document $v$. The score of virgin document $v$ is then $S_v = \sum_w F_{wv} \cdot S_w$. Intuitively, the algorithm uses the training documents to learn how word usage differs across the reference cases - for instance, it learns that the word "censorship" is more frequent the lower the democracy score of the document. The algorithm then uses that information to produce word scores (hence the name of the method), and later uses the word scores to score the virgin documents. A concrete example may help. Suppose that we choose North Korea 2012 and Belgium 2012 as our reference cases and assign them democracy scores of 0 and 10 respectively. We merge all news articles on North Korea in 2012 into a single document and merge all news articles on Belgium in 2012 into another document. Suppose now that the word "censorship" accounts for 15% of all the words in the North Korea document and for 1% of the words in the Belgium document. If we see the word "censorship" the probability that we are reading the North Korea document is 0.15/(0.15 + 0.01) = 0.9375 and the probability that we are reading the Belgium document is 0.01/(0.15 + 0.01) = 0.0625. The score of the word "censorship" is thus (0.9375 · 0) + (0.0625 · 10) = 0.625. To score a virgin document we simply multiply each word score by its relative frequency and sum across.
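The third and fourth steps can be written down in a few lines of Python. This is a minimal sketch of the LBG formulas, not the implementation used to produce the ADS; the reference frequencies and scores are the ones from the North Korea/Belgium example above, plus an invented "election" frequency.

```python
# Relative word frequencies in each reference document (illustrative numbers only).
ref_freqs = {
    "north_korea_2012": {"censorship": 0.15, "election": 0.01},
    "belgium_2012": {"censorship": 0.01, "election": 0.10},
}
ref_scores = {"north_korea_2012": 0.0, "belgium_2012": 10.0}  # a priori positions A_r

# Step 3: word scores S_w = sum over r of P(r|w) * A_r, with P(r|w) = F_wr / sum_r F_wr.
vocabulary = {w for freqs in ref_freqs.values() for w in freqs}
word_scores = {}
for w in vocabulary:
    total = sum(freqs.get(w, 0.0) for freqs in ref_freqs.values())
    word_scores[w] = sum((freqs.get(w, 0.0) / total) * ref_scores[r]
                         for r, freqs in ref_freqs.items())

# Step 4: virgin score S_v = sum over w of F_wv * S_w (unscored words are skipped).
def score_virgin(virgin_freqs):
    return sum(f * word_scores[w] for w, f in virgin_freqs.items() if w in word_scores)

print(word_scores)                                     # "censorship" scores 0.625
print(score_virgin({"censorship": 0.05, "election": 0.06}))
```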

The fifth step is the computation of uncertainty measures for the point estimates. LBG propose the following measure of uncertainty: $\sqrt{V_v} / \sqrt{N^v}$, where $V_v = \sum_w F_{wv}(S_w - S_v)^2$ and $N^v$ is the total number of virgin words. The $V_v$ term captures the dispersion of the word scores around the score of the document. Its square root divided by the square root of $N^v$ gives us a standard error, which we can use to assess whether two cases are statistically different from each other.
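A sketch of the fifth step, reusing the word scores from the previous snippet; the value of n_virgin_words (the total number of scored words in the virgin document) is invented here.

```python
import math

def virgin_standard_error(virgin_freqs, word_scores, n_virgin_words):
    """LBG uncertainty: sqrt(V_v) / sqrt(N_v), where V_v is the frequency-weighted
    dispersion of the word scores around the document score."""
    s_v = sum(f * word_scores[w] for w, f in virgin_freqs.items() if w in word_scores)
    v_v = sum(f * (word_scores[w] - s_v) ** 2
              for w, f in virgin_freqs.items() if w in word_scores)
    return math.sqrt(v_v) / math.sqrt(n_virgin_words)

word_scores = {"censorship": 0.625, "election": 9.09}  # carried over from the sketch above
print(virgin_standard_error({"censorship": 0.05, "election": 0.06}, word_scores, 5000))
```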

The sixth and final step is the re-scaling of the virgin scores. In any given text the most frequent words are "the", "of", "and", etc, which are usually of no interest. Because these words have similar relative frequencies across all reference texts they will have centrist scores. For instance, if "the" accounts for 10% of our (hypothetical) North Korea document (whose manually assigned score is 0) and for 10% of the (also hypothetical) Belgium document (whose manually assigned score is 10), the score of "the" will be 5, exactly in the middle of the scale. That makes the scores of the virgin documents "bunch" together around the middle of the scale; their dispersion is just not in the same metric as that of the reference texts. In LBG's estimations of party ideology in Britain, the scores of the reference documents range from 8.21 to 17.21, but the scores of the virgin documents range from 10.21 to 10.73. That is not a problem per se, as the scores of the virgin documents are perfectly comparable to each other. But they are not comparable to the scores of the reference documents, whose dispersion is higher, and that may be a problem depending on the intended goals.27

To correct for the "bunching" of virgin scores, LBG propose re-scaling these as follows: $S_v^* = (S_v - \bar{S}_v)(\sigma_r / \sigma_v) + \bar{S}_v$, where $S_v$ is the raw score of virgin document $v$, $\bar{S}_v$ is the average raw score of all virgin texts, $\sigma_r$ is the standard deviation of the reference scores, and $\sigma_v$ is the standard deviation of the virgin scores. This transformation expands the raw virgin scores by making them have the same standard deviation as the reference scores.
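And a sketch of the sixth step; the virgin and reference scores below are invented, loosely echoing the British example mentioned above.

```python
import statistics

def rescale(virgin_scores, reference_scores):
    """LBG re-scaling: S_v* = (S_v - mean(S_v)) * (sd_ref / sd_virgin) + mean(S_v)."""
    mean_v = statistics.mean(virgin_scores)
    ratio = statistics.pstdev(reference_scores) / statistics.pstdev(virgin_scores)
    return [(s - mean_v) * ratio + mean_v for s in virgin_scores]

# Bunched virgin scores are stretched to the dispersion of the reference scores.
print(rescale([10.21, 10.40, 10.73], [8.21, 12.50, 17.21]))
```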

27We could of course remove irrelevant words, but identifying relevant and irrelevant words is not always so clear-cut. For instance, as I mention later, Monroe, Colaresi and Quinn (2008) find several non-obvious partisan words - like “baby” (Republican) and “bankruptcy” (Democrat) - in their analysis of legislative speeches in the US Senate. Thus if we exclude words a priori we risk throwing away important information. Moreover, removing any words would require knowledge of the language in which the text is written. That would defeat one of the biggest advantages of the method: the fact that it is language-blind (all we need to know are the positions of the reference documents).

Martin and Vanberg (2008) propose an alternative re-scaling formula, but Benoit and Laver (2008) show that the original formula is more appropriate when there are many virgin cases and few reference cases, which is the case here. The final output is a dataset comprising all independent countries from 1993 to 2012, which makes for a total of 3,883 country-years. For each country-year three statistics are provided: the ADS point estimate, the ADS 95% lower bound, and the ADS 95% upper bound. Initially I considered having not only a democracy scale but also subcomponents, à la Polity. But of the 805 JSTOR-indexed articles that cite the Polity data over the last ten years, only a handful mention (and even fewer use) the Polity subcomponents (Pemstein, Meserve, and Melton [2010] also note this point). Hence there is simply not enough demand to justify breaking down the ADS into more specific items.28 And, rich in regime-related information as our news articles may be, they nonetheless become progressively less informative as we move from "democracy" down to, say, "turnover percentage in the legislature". The more specific we get, the higher the noise-to-signal ratio. Wordscores is the best-known text-scaling method in political science. It has been subject to extensive scrutiny over the years and generally found to perform well, as long as the texts are not too short29 and share enough vocabulary30. Klemmensen, Hobolt, and Hansen (2007), for instance, use Wordscores to measure party ideology, with Danish manifestos and speeches, and find that the method yields scores that correlate highly with those produced independently by human coders.

28Thus the ADS are bound to displease, for instance, Coppedge et al. (2011), who call for “thicker” measures of democracy. 29If the texts are too short then there is simply not enough data to produce meaningful results. How short is too short is unclear though: all else equal more is better, but a 5,000-word text may contain more informative words than a 10,000-word text. 30In the extreme case where the vocabulary of the reference texts and the vocabulary of the virgin texts are disjoint, we cannot even produce any estimates.

Beauchamp (2010) applies Wordscores to US Senate speeches and, like Klemmensen, Hobolt, and Hansen (2007), finds that the estimates correlate highly with human-coded ones. Lowe (2008) notes that Wordscores lacks an explicit model for the data-generating process of the word frequencies. But he argues that, as long as the word frequencies follow an ideal point structure,31 Wordscores should produce good estimates - and he notes that "The empirical success of the method suggests that these assumptions may be reasonable." (370).

5. Advantages over existing measures

The ADS are intended to address the three issues discussed earlier: standard errors, ideological bias, and replicability.

Small standard errors

As shown above, with Wordscores the total number of virgin words goes in the denominator of the formula for the standard errors. Hence the more texts we have, the smaller the standard errors will be. Here we have 42 million news articles, so we should have standard errors small enough to distinguish even between very similar cases - say, between Sweden and Norway. As we will see later, that is indeed what happens. The ADS are the first democracy index whose uncertainty measure captures such fine-grained distinctions.

31Lowe (2008) proposes that we interpret Wordscores as an approximation to correspondence analysis - which relies on the assumption of ideal point structure.


Less ideological bias

The ADS are not immune to contamination by ideological bias. First, the journalists and editors behind news articles have their own policy preferences. And second, at least in the case of supervised learning algorithms (like Wordscores), someone must choose and score the reference cases. But the scope for manipulation is more restricted in the ADS. Journalists and editors have their policy preferences but there is a lot more ideological diversity among journalists (contrast The New York Times and The Wall Street Journal, for instance) than among political scientists, the vast majority of whom are somewhere on the left of the ideology spectrum (Klein and Stern 2005; Maranto, Hess, and Redding 2009; Maranto and Woessner 2012). Combining 6,043 different news sources, as we do here, surely goes a long way toward mitigating ideological bias. The reference scores do offer a backdoor for manipulation but, unlike the anonymous country experts who fill out the Polity and Freedom House questionnaires, the researcher who assigns reference scores does so in the open and thus bears reputational costs in case of mischief. The transparency of the process creates an incentive structure that rewards honesty. As Schedler (2012) puts it, "The key to accountable expert measurement [...] is publicity. Rather than treating experts the same way as we treat survey subjects, whom we grant full anonymity, experts need to assume public responsibility for their measurement decisions." True, we are using the UDS for the reference scores, and the UDS themselves must be contaminated by ideological bias, as discussed before. But the ADS do not inherit that bias. We are using regime-related news, so the vast majority of policy and economic discussions is left out.

Hence the biases of the UDS become, by and large, random noise in the ADS. Intuitively, imagine that the UDS are biased in favor of countries with generous welfare, like Sweden. The UDS of these countries will be "boosted" somewhat. But to the extent that the news articles we selected are focused on political regime and not on welfare policy, the algorithm will not associate those boosted scores with welfare-related words and hence the word scores will not be biased. They will be less efficient, as (ideally) no particular words will be associated with those boosted scores, but that is it.

Replicability

The process behind the ADS is fully transparent. All the choices (reference cases and scores) are visible to the public and every part of the process can be replicated exactly. Anyone with access to LexisNexis can download the same articles, apply the same algorithm, and verify the results. There are practical obstacles though: downloading 42 million articles is time-consuming and the computations require powerful machines and non-trivial programming.32 Therefore I created a website that facilitates the process: www.democracy-scores.org. No coding is required: there is a table with empty cells corresponding to each country-year between 1992 and 2012 and you simply enter the scores for the reference cases you choose. The results are sent by email. That way anyone can change the reference set and produce their own ADS, regardless of computational resources or programming skills.33

32Wordscores has long been implemented in Stata and R, but these implementations load all the data into memory at once. That would not work here, as there are 200GB of data, so I had to write my own implementation of Wordscores (in Python). That implementation splits the data into chunks and processes each chunk individually, which reduces memory requirements (though not to the point where the script could be run on personal computers - there is a trade off between memory requirements and speed).
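Footnote 32 mentions splitting the data into chunks to keep memory usage manageable. A minimal sketch of that idea (the file name and chunk size are illustrative, and this is not the actual implementation):

```python
from collections import Counter

def count_words_in_chunks(path, chunk_size=10 * 1024 * 1024):
    """Accumulate word counts without loading the whole file into memory.
    (Words split across chunk boundaries are ignored here for simplicity.)"""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            counts.update(chunk.split())
    return counts

# counts = count_words_in_chunks("united_states_2012.txt")  # hypothetical file name
```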


6. Justifying some choices

Why not use unsupervised learning instead?

Wordscores is a type of supervised learning algorithm, by which I mean that the machine "learns" from an initial human input (the reference cases and their scores). But there are also unsupervised learning algorithms, which do not require an initial input. In these, the machine learns by itself not only how to measure but also what to measure. (See Manning, Raghavan, and Schütze [2008] for an introduction to both supervised and unsupervised learning in the context of text analysis.) In political science, a concrete example of an unsupervised learning algorithm is the one developed by Slapin and Proksch (2008), popularly known as Wordfish. Like Wordscores, Wordfish is most commonly used to measure party ideology, using party manifestos or legislative speeches. The Wordfish method does not require the user to specify or score any reference texts. It will create a scale based on whatever underlying dimension has the most impact on word frequencies.34 If we are talking about party manifestos, that dimension may be, say, the left-right dimension.

33The operation uses Amazon Web Services and to keep costs down for now I need to pre-authorize the user's email address. I intend to obtain funding and lift that restriction in the future. 34Slapin and Proksch explicitly model the data-generating process (DGP) behind word frequency. That DGP is assumed to follow a Poisson distribution (hence the name of the method): $y_{ijt} \sim \mathrm{Poisson}(\lambda_{ijt})$, where $y_{ijt}$ is the frequency of word $j$ in document $i$ at time $t$. The parameter $\lambda_{ijt}$ is modeled as $\lambda_{ijt} = \exp(\alpha_{it} + \psi_j + \beta_j \cdot \omega_{it})$, where $\alpha_{it}$ is the fixed-effect of document $i$ at time $t$, $\psi_j$ is the fixed-effect of word $j$, $\beta_j$ captures the relevance of word $j$ in capturing the underlying concept (say, party ideology), and $\omega_{it}$ is the estimated position of party $i$ at time $t$. The model is estimated using an expectation-maximization (EM) algorithm (see McLachlan and Krishnan [2007] for details on EM).
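To make the Wordfish data-generating process in footnote 34 concrete, here is a small simulation of word counts under that model, dropping the time index and using invented parameter values; it is only a sketch of the DGP, not of the EM estimation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words = 4, 6

alpha = rng.normal(0.0, 0.5, n_docs)    # document fixed effects
psi = rng.normal(0.0, 1.0, n_words)     # word fixed effects
beta = rng.normal(0.0, 1.0, n_words)    # word discrimination parameters
omega = np.linspace(-1.0, 1.0, n_docs)  # latent document positions

# lambda_ij = exp(alpha_i + psi_j + beta_j * omega_i); y_ij ~ Poisson(lambda_ij)
lam = np.exp(alpha[:, None] + psi[None, :] + beta[None, :] * omega[:, None])
word_counts = rng.poisson(lam)
print(word_counts)
```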

But it may not. And if the scores turn out to be capturing something else there is no way to fix that; it may be hard to even know what is being captured. Supervised learning, on the other hand, allows us to calibrate the scale by explicitly showing the machine what a democratic country looks like or what a left-wing party looks like (depending on what we are trying to measure). That way we can have greater confidence in the construct validity of the resulting measure.35

Why not use event data instead?

An alternative approach would be to machine-code democracy based not on words but on events. Applications like Knowledge Manager36 and TABARI (Schrodt 2001) can use dictionaries of actors and verbs to extract meaning from sentences. For instance, TABARI can correctly classify the sentence "North Korean state media have called on the United States to forge 'ties of confidence' with Pyongyang" into the category "Appeal for diplomatic cooperation" (category #022 of the Conflict and Mediation Event Observations Codebook). There are voluminous event data available for free37 and King and Lowe (2003) show that in some cases automated event coding can be as accurate as human coding. So why not use event data to produce the ADS? The reason is that although the coding itself is automated, it relies on dictionaries of actors and verbs that are produced manually, entry by entry. In other words, we must know the relevant actors and verbs a priori. With Wordscores, however, we let the data speak. As Hopkins and King (2007) note, automated text analysis allows us to discover relevant features a posteriori.

35That said, Wordscores and Wordfish are not antithetical methods. Lowe (2008) and Benoit and Nulty (2013) argue that Wordscores is also model-based in a sense, only the model is implicit. 36http://vranet.com/ 37Most notably the Global Data on Events, Location, and Tone (GDELT), which contains over 200 million geolocated events from 1979 to 2012. See Leetaru and Schrodt (2013).

For instance, Monroe, Colaresi and Quinn (2008) find several non-obvious "partisan" words, like "baby" and "bankruptcy", which a hand-coded dictionary might have missed. As a consequence, event data can be of limited usefulness. Consider, for instance, the latest version of the World Handbook of Politics (WHP), machine-coded by the Knowledge Manager application.38 It reports three recent coups in Canada (one in 1996, one in 1998, and one in 1999), 15 recent coups in the US (three of which took place in 1994 alone), and none in 2002 Venezuela (even though there was one).39 Similarly nonsensical statistics are reported for other political indicators, such as censorship measures, curfews, and political arrests. That is not a very promising output, especially given the time and effort put in the creation of event data dictionaries (around 4,000 hours each)40. Hence I chose not to work with event data, at least for now.

7. Overview of results

The full 1993-2012 dataset is available for download.41 Figure 4 below gives an idea of the ADS distribution in 2012.

38The WHP can be downloaded from https://sociology.osu.edu/worldhandbook 39I checked the WHP definition of coup, to make sure it is not peculiar, but that does not seem to explain the nonsensical results (the WHP defines a coup as an “Irregular seizure of executive power, and rebellion by armed forces”). 40http://eventdata.psu.edu/faq.html 41https://s3.amazonaws.com/thiagomarzagao/ADS.csv

Figure 4. Automated Democracy Scores, 2012. Note: range limits are Jenks natural breaks.

As expected, democracy is highest in Western Europe and in the developed portion of the English-speaking world, and lowest in Africa and in the Middle East.

Figure 5 below shows that the ADS follow a normal distribution.

Figure 5. Automated Democracy Scores, 1993-2012 (with normal distribution)

Table 5 below shows the ADS summary statistics by year.

Table 5. ADS summary statistics, by year

N mean std. dev. min. max. 1993 193 0.0061666 1.40437 -3.20916 3.81217 1994 193 0.0939503 1.36697 -2.6985 4.14979 1995 193 0.1005004 1.073329 -2.99738 2.36396 1996 193 -0.1076104 1.128553 -3.22593 2.26484 1997 193 -0.0159435 1.25768 -2.93822 3.03361 1998 193 0.0088406 1.150099 -2.54625 2.7043 1999 193 -0.0999732 1.134464 -2.9453 2.63257 2000 193 0.2312175 0.7445582 -1.31987 2.66054 2001 193 0.2222522 0.7182253 -1.29777 1.92263 2002 194 0.2400814 0.735135 -1.18534 2.33285 2003 194 0.2121506 0.7185639 -1.3477 2.50623 2004 194 0.2213473 0.645 -1.69878 2.03608 2005 194 0.3315942 0.6461306 -1.08297 2.19639 2006 195 0.2869473 0.6760403 -1.28804 2.18348 2007 195 0.3678394 0.7192703 -1.11441 2.4193 2008 196 0.3860345 0.7002583 -1.11659 2.58216 2009 196 0.3212706 0.6923328 -1.487 2.34994 2010 196 0.4233154 0.6748002 -1.08075 2.29522 2011 196 0.4015369 0.7163083 -1.15564 2.38172 2012 196 0.4958635 0.7909505 -1.16859 2.38636 all 3883 0.2073097 0.9338698 -3.22593 4.14979

As expected, the average ADS increases over time, from 0.006 in 1993 to 0.495 in 2012. That reflects the several democratization processes that happened over that period. We observe the same change in other democracy indices as well (between 1993 and 2012 the average Polity score42 increased from 2.24 to 4.06 and the average Freedom House score43 decreased from 7.46 to 6.63;44 the average UDS score increased from 0.21 to 0.41 between 1993 and 2008, the last year in the UDS dataset).

42polity2 43civil liberties + political rights 44Freedom House scores decrease with democracy.

Also as expected, the standard errors decrease with press coverage. The larger the document with the country-year's news articles, the narrower the corresponding confidence interval. As Figure 6 shows, that relationship is not linear though: after 500KB or so the confidence intervals shrink dramatically and do not change much afterwards, not even when the document has 15MB or more.

Figure 6. ADS range and press coverage. Note: ADS range = 95% upper bound minus 95% lower bound.

8. The ADS vs other indices - point estimates

The ADS point estimates correlate 0.7439 with the UDS’ (posterior means), 0.6693 with the Polity’s (polity2), and -0.7380 with the Freedom House’s (civil liberties + political rights).45 Table 6 below breaks down these correlations by year.

Table 6. Correlation between ADS and other indices, by year

        UDS     Polity(a)   FH(b)            UDS     Polity(a)   FH(b)
1993    0.8021  0.7279     -0.7677    2003   0.7470  0.6610     -0.7445
1994    0.7921  0.6947     -0.7574    2004   0.7493  0.6635     -0.7553
1995    0.7797  0.7221     -0.7650    2005   0.7702  0.6833     -0.7632
1996    0.7783  0.7457     -0.7812    2006   0.7140  0.6458     -0.7596
1997    0.8059  0.7647     -0.8001    2007   0.6982  0.6207     -0.7413
1998    0.8052  0.7355     -0.7864    2008   0.7377  0.6363     -0.7506
1999    0.7729  0.7260     -0.7714    2009   n/a(c)  0.6353     -0.7627
2000    0.7491  0.6794     -0.7579    2010   n/a(c)  0.6467     -0.7791
2001    0.7641  0.6881     -0.7948    2011   n/a(c)  0.6472     -0.7661
2002    0.7668  0.6793     -0.7875    2012   n/a(c)  0.6155     -0.7603
(a) polity2 (see Marshall, Gurr, and Jaggers 2013, p. 17)
(b) civil liberties + political rights (see Freedom House 2013)
(c) The UDS do not cover the 2009-2012 period.

As we see, the correlations do not vary much over time. This is a good sign: it means that the ADS are not overly influenced by the idiosyncrasies of the year 1992, from which we extract the reference cases. Otherwise we would see the correlations decline sharply after 1993. The correlations do not vary much across indices either, other than being somewhat weaker for the Polity data. This is also a good sign: it means that the ADS are not overly influenced by the idiosyncrasies of the UDS, from which we extract the reference scores.46 45Pearson correlation. 46Though we must remember that the UDS are partly based on the Polity and the Freedom House, so by extension the ADS also are.

I also ran the algorithm using other years (rather than 1992) for the reference set, using the UDS as well. I also ran the algorithm using multiple years (up to all years but one) for the reference set, again using the UDS. Finally, I also ran the algorithm using not the UDS but the Polity and Freedom House indices for the reference set. In all these scenarios the correlations remained in the vicinity of 0.70.47 This corroborates Klemmensen, Hobolt, and Hansen's (2007) finding that Wordscores' results are robust to the choice of reference texts. Country-wise, what are the most notable differences between the ADS and the UDS? Table 7 below shows the largest discrepancies.

Table 7. Largest discrepancies between ADS and UDS

largest positive differences                 largest negative differences
                     ADS    UDS    ∆                          ADS    UDS    ∆
Swaziland 2007       1.53  -1.13   2.66      Israel 1994     -1.71   0.97  -2.69
Liechtenstein 1994   4.14   1.56   2.58      Israel 1993     -1.70   0.97  -2.67
Liechtenstein 1993   3.81   1.57   2.23      Israel 1999     -1.20   1.44  -2.65
Ireland 1994         3.08   1.17   1.90      Israel 1997     -1.36   1.07  -2.44
Andorra 1993         2.41   0.60   1.80      Benin 1993      -1.79   0.49  -2.28
Luxembourg 1994      3.29   1.51   1.77      Israel 1998     -1.17   1.08  -2.26
Bhutan 1996         -0.21  -1.97   1.75      Yemen 1993      -2.60  -0.40  -2.20
Ireland 1993         2.90   1.16   1.73      Israel 1996     -1.08   1.08  -2.16
Finland 1994         3.67   2.00   1.67      Tunisia 1993    -2.71  -0.55  -2.15
China 2008           0.68  -0.97   1.65      Oman 1996       -3.18  -1.12  -2.05

The largest positive differences - i.e., the cases where the ADS are higher than the UDS - are mostly found in small countries with little press coverage. That is as expected: the less press attention, the fewer news articles we have to go by, and the harder it is to pinpoint the country's "true" democracy level.

47The cases we use for the reference set cannot be used for the virgin set. For instance, in one scenario I used every other year for the reference set, starting with 1992. In that scenario the reference set was thus [1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012] and the virgin set was [1993 1995 1997 1999 2001 2003 2005 2007 2009 2011]. To compute the correlations with other indices I only used the virgin set.

The largest negative differences, however, tell a different story. It seems as if either the ADS repeatedly underestimate Israel's democracy score or the UDS repeatedly overestimate it (and not only for the years shown in Table 7). We do not observe a country's "true" level of democracy, so we cannot know for sure whether the ADS or the UDS are biased,48 but as discussed before the ADS should be unbiased to the extent that we managed to filter out news articles not related to political regime. And whatever biases exist in the UDS should become, by and large, random noise in the ADS. The UDS, on the other hand, rely on the assumption that "raters perceive democracy levels in a noisy but unbiased fashion" (Pemstein, Meserve, and Melton 2010, 10), which as Bollen and Paxton (2000) have shown is simply not true. Hence whatever biases exist in the Polity, Freedom House, etc, wind up in the UDS as well. The data-generating process behind the UDS does not mitigate bias in any way. In other words, it seems more likely that the UDS are overestimating Israel's democracy scores than that the ADS are underestimating them. This pro-Israel bias is interesting in itself, but it also raises the more general question of whether the UDS might have an overall conservative bias. To investigate that possibility I performed a difference-of-means test, splitting the data in two groups: country-years with left-wing governments and country-years with right-wing governments (I used Keefer's [2012] Dataset of Political Institutions for data on government ideological orientation.)49 The test rejected the null hypothesis that the mean ADS-UDS difference is the same for the two groups: the mean ADS-UDS difference for left-wing country-years (-0.127, std. error = 0.024, n = 802) is statistically smaller than the mean ADS-UDS difference for right-wing country-years (-0.328, std. error = 0.025, n = 603), with p < 0.00001.

48Though of course these two possibilities are not mutually exclusive. 49I used the EXECLRC variable.

As both means are negative, it seems that the UDS tend to reward right-wing governments. I also checked whether the UDS may be biased toward economic policy specifically. I split the country-years in the Index of Economic Freedom (Heritage Foundation 2014) dataset into two groups: statist (IEF score below the median) and non-statist (IEF score above the median). The difference-of-means test shows that the mean ADS-UDS difference for statists (-0.132, std. error = 0.0196, n = 1057) is statistically lower than that of non-statists (-0.215, std. error = 0.015, n = 1977), with p < 0.0006. Both means are negative here as well, so it seems that the UDS somehow reward free market policies. These findings are surprising. Political scientists are overwhelmingly on the left side of the political spectrum (Klein and Stern 2005; Maranto, Hess, and Redding 2009; Maranto and Woessner 2012), so if anything we would expect their democracy measures to be biased in favor of left-wing governments and policies, not against them. Perhaps the country experts who code the democracy indices behind the UDS are not political scientists for the most part. It is hard to know for sure, as the codebooks usually do not mention the coders' backgrounds.50 We cannot conclusively indict the UDS or its constituent indices though. Perhaps democracy and right-wing government are positively associated and the ADS are somehow less efficient at capturing that association. This is consistent with the Hayek-Friedman hypothesis that left-wing governments are detrimental to democracy because economic activism expands the state's coercive resources (Hayek 1944; Friedman 1962).

50Marshall, Gurr, and Jaggers (2013), for instance, only say that “at least four coders” (6) coded each Polity case, without giving any further information.

As we do not observe a country's true level of democracy, it is hard to know for sure what is going on here. At least until we know whether the UDS are biased or the ADS are inefficient, the ADS are the conservative choice. Say we regress economic policy on the UDS and find that more democratic countries tend to have less regulation. Is that relationship genuine or is it an artifact of the UDS being biased in favor of free market policies? With biased measures our tests become circular: we cannot know the effect of x on y when our measure of x is partly based on y. Inefficiency, on the other hand, merely makes our tests more conservative.
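For readers who want to see the mechanics of the difference-of-means tests reported above, here is a minimal sketch using simulated ADS-UDS gaps; the group sizes and means echo the reported figures, but the draws are invented, and Welch's two-sample t-test stands in for whatever exact test variant was used (the real grouping uses the EXECLRC variable from the Dataset of Political Institutions).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# ADS-UDS gaps for the two groups (simulated draws, not the real data).
diff_left = rng.normal(-0.127, 0.6, 802)   # left-wing country-years
diff_right = rng.normal(-0.328, 0.6, 603)  # right-wing country-years

# Welch's two-sample t-test on the mean ADS-UDS gap.
t_stat, p_value = ttest_ind(diff_left, diff_right, equal_var=False)
print(round(t_stat, 2), p_value)
```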

9. The ADS vs other indices - standard errors

As mentioned before the UDS data and also Treier and Jackman’s (2008) data have not only point estimates but also standard errors. That is a big improvement over data like the Polity and the Freedom House, which provide point estimates only. But the UDS and Treier and Jackman’s confidence intervals are too wide to be useful. Too many cases are statistically indistinguishable. The ADS, on the other hand, have smaller confidence intervals. These confidence intervals tend to be larger the less press coverage the country gets, but in all cases they are smaller than the corresponding UDS ones. Table 8 below shows, for each country in 2008 (the last year for which there are UDS data), how many other countries have overlapping confidence intervals in each dataset (UDS and ADS).
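The overlap counts reported in Table 8 can be reproduced mechanically from each country's lower and upper bounds; here is a minimal sketch with invented intervals.

```python
def count_overlaps(intervals):
    """For each country, count how many other countries' confidence intervals overlap its own."""
    overlaps = {}
    for country, (lo, hi) in intervals.items():
        overlaps[country] = sum(1 for other, (lo2, hi2) in intervals.items()
                                if other != country and lo <= hi2 and lo2 <= hi)
    return overlaps

intervals = {"Sweden": (1.8, 2.1), "Norway": (1.9, 2.2), "North Korea": (-3.1, -2.6)}
print(count_overlaps(intervals))  # Sweden and Norway overlap; North Korea overlaps neither
```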

Table 8. Overlaps for the year 2008

UDS ADS UDS ADS Afghanistan 107 4 Libya 51 2 Albania 119 6 Liechtenstein 112 3 Algeria 100 2 Lithuania 88 3 Andorra 130 4 Luxembourg 106 0 Angola 76 4 Macedonia 120 4 Antigua & Barbuda 130 5 118 8 Argentina 119 7 Malawi 119 4 Armenia 118 1 Malaysia 117 3 Australia 90 1 Maldives 133 2 Austria 63 2 Mali 121 8 Azerbaijan 70 1 Malta 68 1 Bahamas 118 7 Mauritania 70 2 Bahrain 65 1 Mauritius 118 11 Bangladesh 102 6 Mexico 119 5 Barbados 120 0 Micronesia 88 10 Belarus 65 3 Moldova 126 6 Belgium 99 2 Mongolia 127 4 Belize 130 10 Montenegro 118 9 Benin 126 3 Morocco 76 4 Bhutan 111 10 Mozambique 113 4 Bolivia 125 6 Myanmar 41 2 Bosnia-Herzegovina 129 5 Namibia 111 1 117 3 Nauru 116 11 Brazil 118 0 Nepal 119 3 Brunei 73 11 Netherlands 66 3 Bulgaria 124 6 New Zealand 80 0 Burkina Faso 104 8 Nicaragua 120 7 Burundi 117 2 Niger 115 2 Cambodia 103 2 Nigeria 115 1 Cameroon 75 3 North Korea 31 4 Canada 90 2 Norway 66 8 Cape Verde 125 5 Oman 59 0 Central African Rep. 100 2 Pakistan 117 5 Chad 79 0 Palau 116 7 Chile 110 0 Panama 124 6 China 58 5 Papua New Guinea 123 4 Colombia 122 0 Paraguay 124 8 Comoros 122 10 Peru 119 8



Congo Brazzaville 71 5 Philippines 127 2 Congo Kinshasa 103 1 Poland 99 6 Costa Rica 109 11 Portugal 90 4 Croatia 127 7 Qatar 43 0 Cuba 58 2 Romania 123 3 Cyprus 66 4 Russia 103 1 Czech Rep. 110 25 Rwanda 85 5 Denmark 63 3 St. Kitts & Nevis 120 10 Djibouti 98 1 St. Lucia 124 13 Dominica 126 11 St. Vin. & the Gren. 125 12 Dominican Rep. 119 8 Samoa 126 12 East Timor 121 3 San Marino 113 18 Ecuador 121 3 S. Tome & Principe 135 9 Egypt 72 0 Saudi Arabia 29 5 El Salvador 126 8 Senegal 127 8 Equatorial Guinea 59 4 Serbia 126 6 Eritrea 58 0 Seychelles 132 4 Estonia 110 4 Sierra Leone 120 1 Ethiopia 104 1 Singapore 102 6 Fiji 76 5 Slovakia 100 1 Finland 66 4 Slovenia 88 2 France 112 4 Solomon Is. 128 13 Gabon 89 4 Somalia 62 1 Gambia 80 2 South Africa 117 0 Georgia 119 2 South Korea 123 2 Germany 66 4 Spain 80 4 Ghana 119 3 Sri Lanka 121 1 Greece 82 7 Sudan 59 1 Grenada 118 9 Suriname 130 13 Guatemala 122 9 Swaziland 50 3 Guinea 72 1 Sweden 63 3 Guinea-Bissau 117 4 Switzerland 63 8 Guyana 110 12 Syria 62 0 Haiti 111 4 Taiwan 110 5 Honduras 123 10 Tajikistan 93 3 Hungary 89 1 Tanzania 105 5 Iceland 67 1 Thailand 114 3 India 123 2 Togo 102 5 Indonesia 119 2 Tonga 109 8



Iran 62 1 Trinidad & Tobago 110 4 Iraq 62 1 Tunisia 73 2 Ireland 90 0 Turkey 124 1 Israel 97 1 Turkmenistan 52 7 Italy 97 2 Tuvalu 116 16 Ivory Coast 80 4 Uganda 104 5 Jamaica 122 9 Ukraine 119 4 Japan 118 9 United Arab Emirates 59 3 Jordan 80 0 United Kingdom 90 1 Kazakhstan 65 6 United States 80 2 Kenya 119 2 Uruguay 80 1 Kiribati 131 13 Uzbekistan 52 5 Kuwait 73 1 Vanuatu 124 6 Kyrgyzstan 111 6 Venezuela 116 5 Laos 58 4 Vietnam 65 1 Latvia 126 1 Yemen 98 1 Lebanon 117 0 Zambia 117 4 Lesotho 113 4 69 5 Liberia 121 2

As we see, there is much less overlapping in the ADS than in the UDS. For instance, in the UDS the United States is statistically indistinguishable from 80 other countries, whereas in the ADS the United States is statistically indistinguishable from only one other country (Solomon Islands, which rarely appears in the news and thus has a wide confidence interval). The country with most overlaps in the UDS data is Sao Tome and Principe, which is statistically indistinguishable from 135 other countries. That makes 70% of the UDS scores (for 2008) statistically the same. The worst case in the ADS is Czech Republic, which overlaps with 25 other countries (in the UDS Czech Republic overlaps with 110 other countries).51

51Unfortunately we cannot compare the ADS and Treier and Jackman's data exactly, as these are not available online, but the plot in their article shows that the confidence intervals are even wider than the UDS ones.

The reason why the ADS standard errors are much smaller than the UDS ones is the sheer size of the data. We have 42 million news articles, which give us about 4 billion words in total. Because the total number of virgin words goes in the denominator of the formula for the standard errors, those 4 billion words shrink the confidence intervals dramatically. The ADS standard errors also tell us something about the nature of democracy. The large standard errors of the UDS and of Treier and Jackman's data might lead us to believe that democracy is better modeled as a categorical variable, like the one in Alvarez et al (1996). Gugiu and Centellas (2013) claim that that is indeed the case: they use hierarchical cluster analysis to extract the latent democracy variable behind five existing indices (among which the Polity and Freedom House indices) and find that that latent variable is categorical, not continuous. That conclusion is unwarranted though. If the constituent measures (Polity, Freedom House, etc) are too coarse to capture fine-grained regime differences then it is not surprising that their latent variable will also be too coarse to capture fine-grained regime differences. But just because a given measure fails to capture subtle distinctions does not mean that these distinctions do not exist. As the ADS standard errors suggest, these subtle distinctions do seem to exist.

10. A word on conceptualization

For expositional convenience, so far I have neglected the issue of conceptualization, which normally should precede any measurement efforts (Sartori 1970). The concept behind any data produced with supervised learning methods is the concept behind the reference scores. Here the reference scores are taken from the UDS, so the concept of democracy behind the ADS and the UDS is the same.

The UDS, in turn, is a latent variable extracted from twelve different democracy indices. Thus the democracy concept underlying the UDS is the common, shared portion of those twelve democracy concepts. And what is that shared portion? As Munck and Verkuilen (2002) note, "the decision to draw, if to different degrees, on Dahl's (1972, 4-6) influential insight that democracy consists of two attributes - contestation or competition and participation or inclusion - has done much to ensure that these measures of democracy are squarely focused on theoretically relevant issues" (9). Hence the UDS - and by extension the ADS - reflect the core, Dahlsian view of democracy shared by the twelve democracy indices used in their construction.52 Granted, it is hard to map ADS (or UDS) variation to particular democracy levels. For instance, what exactly does a score of -1.5 mean? If the score changes from -1.5 to -1.0, what does that tell us about the institutional changes that took place? Unlike human-coded measures, machine-coded ones tend to be more data-driven, so we often need to interpret the results a posteriori, by examining concrete cases. But mapping score variation to concrete phenomena is no easier with the human-coded measures of democracy we have today. These are based on subcomponents that, as Munck and Verkuilen (2002) note, are often arbitrarily aggregated. In the Polity data two countries may change from, say, 2 to 3 for completely different reasons (Gleditsch and Ward 1997). It is important to keep these issues in mind when evaluating machine-coded data. These should be compared to actual alternatives, not to imaginary ones.

52As noted before, I also ran the algorithm replacing the UDS with Polity and Freedom House scores, to check how the correlations changed. Hence in these cases the underlying concept behind the ADS is the Polity one or the Freedom House one, according to the case.

11. Conclusion and future extensions

The ADS address important limitations of the democracy indices we have today. The ADS are replicable, have standard errors narrow enough to distinguish cases, and mitigate contamination by human coders' ideological biases. The ADS are also cost-effective: all we need are texts and reference scores, both of which already exist; there is no need to hire dozens of country experts and spend months collecting and reviewing their work. This paper is intended merely as a starting point though. Wordscores is the best-known text-scaling method in political science, so it allows me to introduce a new idea (automating democracy measurement) while using a familiar algorithm. But the goal here is simply to initiate a discussion in that direction, hopefully encouraging others to make their own contributions. There are many interesting alternatives that we should try next. First, we could try adjusting the algorithm to take into account "disproportionate" press reactions. Minor setbacks in highly democratic countries often attract a great deal of press coverage (e.g., the recent blocking of explicit internet content in Britain). Similarly, minor regime liberalizations in highly authoritarian countries often attract a great deal of press coverage as well (e.g., the recent decision of the Cuban government to allow Yoani Sánchez - a local journalist opposed to the Castro regime - to travel abroad for the first time). Newspapers and magazines often use words like "repression" and "democratization" liberally when referring to such events and that may cause unwarranted score fluctuations.

To correct for that perhaps we could use the scores for year t0 as priors in the estimation of the scores for year t1 (instead of starting anew every year, as I do here). That might make the scores more robust to minor fluctuations and to the idiosyncrasies of press coverage.

We would need to re-think the math and possibly drop Wordscores altogether in favor of a more explicit Bayesian approach, but that might prevent nonsensical results - of which there are a few in the current dataset. Second, we could address conceptualization more directly. Here we are using all 6.3 million unique words in the dataset, so the impact of each individual word on the virgin scores is negligible. That makes it hard to know exactly what concrete phenomena are driving our estimates: is it elections? Censorship? Power alternation? We know, from the UDS and their constituent indices, that the underlying concept is the core Dahlsian view of democracy - contestation and participation. But contestation and participation are abstractions, with multiple concrete manifestations each; which of these manifestations impact the ADS most? One way to find the answer would be to drop Wordscores and use instead a combination of Latent Semantic Analysis (LSA) and regression. LSA is a method that allows us to extract the latent components behind all the words in the dataset.53 We can then pick a set of reference cases and scores, regress these scores on the LSA-extracted components, and use the estimated coefficients to compute the scores of the virgin cases. That way rather than assessing the impact of (say) the word "election" on the estimates we assess the impact of the latent component "election" on the estimates. That latent component should encompass not only the word "election" but all election-related words, which makes it easier for us to see what concrete phenomena are behind the virgin scores. Third, we could ensure construct validity in more rigorous ways. Here we used topic tags to select regime-related news, but that is necessarily imperfect. Some contamination certainly exists. In fact, there are some oddities in the dataset that cannot be explained otherwise.

53See Landauer, Foltz, and Laham (1998) for an introduction to LSA.

For instance, the score of Equatorial Guinea jumps from -0.26 in 2011 to 1.16 in 2012 and becomes statistically indistinguishable from Norway's, even though there was no regime liberalization whatsoever. That jump is probably related to Equatorial Guinea hosting the 2012 Africa Cup of Nations. We could of course browse the full list of words and remove the ones that are not regime-related, but that is hard to do with 6.3 million words. With LSA, however, we reduce those 6.3 million words to a few hundred latent components, which are of course much easier to inspect (and drop, when necessary). In addition to that we could also, as a preliminary step, use a topic classification algorithm - like the one in Quinn et al (2010) - to discard news articles that are unrelated to political regime. With this two-step approach - discarding extraneous articles and discarding extraneous latent components - we should eliminate most, if not all, of the oddities in the dataset. Fourth, once we have settled on a particular algorithm it would be interesting to replicate existing (substantive) work on democracy but using the ADS instead, to see how the results change. As the ADS come with standard errors, we could incorporate these in the regressions, perhaps using errors-in-variables models (Fuller 1987). Fifth, we could produce a daily or real-time democracy index. Existing indices are year-based and outdated by 1-12 months, so we do not know how democratic a country is today or how democratic it was, say, on 11/16/2006. Automated text analysis can help us overcome those limitations. We cannot score the news articles from only one or two days, as there would not be enough data to produce meaningful results, but we can pick, say, the 12-month period immediately preceding a certain date - for instance, 11/17/2005-11/16/2006 if we want democracy scores for 11/16/2006. In other words, we can apply the method to any arbitrary time interval, not just Jan/1-Dec/31. The idea is similar to moving averages, except that here we would have "moving scores" instead.

We would use the interval 11/17/2005-11/16/2006 to produce scores for 11/16/2006, the interval 11/18/2005-11/17/2006 to produce scores for 11/17/2006, and so on. That would give us a daily measure of democracy, which to the best of my knowledge does not exist yet. If we continuously update the dataset then we have not only daily data, but also real-time data. Applying the method to arbitrary time intervals could be a big improvement, for two reasons. First, because daily democracy data might open the door for interesting research avenues. For instance, do governments become more authoritarian in the face of bank runs and capital flight? Do riots and demonstrations elicit repression or liberalization? Year-based democracy measures miss short-term fluctuations and thus preclude us from answering those types of questions. Second, because we would be able to measure regime change before and after arbitrary events. Say, the 12 months before a given presidential election and the 12 months after it. As both scores would have standard errors we would be able to tell whether the polity became more or less democratic after the new president took over.54

54There are technical and legal difficulties to be overcome though. To produce daily data for the 1993-2012 period we would need to run the algorithm 7,305 times (365 days times 20 years plus extra day in leap years). That would take over 20,000 CPU-hours to run, which far exceeds the computational resources currently at my disposal. To produce real-time data, in turn, we would need to retrieve articles from LexisNexis on a daily basis. But that requires performing 196 different searches every day (one search for each country) and downloading the corresponding material. It would be too time-consuming to do manually, so we would need to automate the process. But LexisNexis’ license terms explicitly prohibit any sort of automated use (see item 2.2 in http://www.lexisnexis.com/terms/general.aspx). These are logistical problems though and can probably be overcome in the future.

Measuring Democracy From News Articles: Can We Do Better Than Wordscores?

Abstract

In this paper I explore different ways to measure democracy from news articles. In an earlier paper I used the popular Wordscores algorithm (created by Laver, Benoit, and Garry [2003]) for that. Here I explore some alternatives - namely, a combination of topic extraction methods (Latent Semantic Analysis and Latent Dirichlet Allocation) and decision trees - and compare the results to the ones obtained using Wordscores.

1. Introduction

In this paper I explore different ways to measure democracy from news articles. The basic idea is simple. News articles on, say, North Korea or Cuba contain words like "censorship" and "repression" more often than news articles on Belgium or Australia. And if news articles contain regime-related information we can quantify that information to build a democracy index. I adopt a supervised learning approach. In supervised learning we feed the machine a number of pre-scored cases - the reference set. The machine then "learns" from the reference set. In text analysis that means learning how the frequency of each word or topic varies according to the document scores. For instance, the algorithm may learn that the word "censorship" is more frequent the lower the democracy score of the country-year. Finally, the algorithm uses that knowledge to assign scores to all other cases - i.e., to the virgin set.

There are several algorithms for supervised learning. In an earlier paper55 I used Wordscores (Laver, Benoit, and Garry 2003). The outcome was a machine-coded democracy index, which I called Automated Democracy Scores (ADS) and which covers all country-years in the 1993-2012 period. Unlike other democracy indices the ADS are replicable, have standard errors small enough to actually distinguish between cases, and avoid contamination by human coders' ideological biases.56 Wordscores has a few limitations though. First and foremost, Wordscores makes it difficult to address construct validity. Wordscores "learns" how word frequency changes with the variable of interest (be it democracy, party ideology, or anything else) and then uses that information to score new documents (see Laver, Benoit, and Garry [2003] for details). That works well, but there is an inefficiency: Wordscores uses the frequency of each individual word in isolation, ignoring co-occurrence. That certain words - say, "censorship" and "torture" - tend to co-occur could suggest that they belong to the same topic - say, "repression". More formally, the words "censorship" and "torture" may be observed manifestations of the unobserved (latent) variable "repression". Wordscores completely ignores that sort of information. Without knowing what topics drive our results it is hard to assess construct validity. We know that the ADS are capturing the two Dahlsian pillars of polyarchy - contestation and participation (see earlier paper). But contestation and participation are abstract concepts, with many possible concrete manifestations. With Wordscores it is hard to know what those concrete manifestations are. We could in principle assess the influence of each individual word.57 But here we have 6.3 million unique words, so the impact of any individual word is negligible.

55Available from http://ssrn.com/abstract=2412325 Currently under journal review. 56The ADS can be downloaded from https://s3.amazonaws.com/thiagomarzagao/ADS.csv There is also a website where anyone can replicate and tweak the data-generating process (no coding required): http://democracy-scores.org/ 57For instance, by computing each word’s TF-IDF (more on TF-IDF later).

And in any case the results would be based on a method that ignores co-occurrence and thus discards useful information. What we need is a way to extract the topics behind our texts and then use these topics - rather than individual words - to produce our democracy scores. That way we should be able to know exactly what concrete phenomena are driving our democracy scores. And we should also be able to drop extraneous topics if needed (to reduce noise or bias). That is the goal of this paper. To extract the topics I try two competing algorithms: Latent Semantic Analysis and Latent Dirichlet Allocation. Both are explained in the following sections, but the gist of it is that they reduce millions of words to a few hundred topics. It is unclear ex ante which algorithm is superior (Anaya 2011), so I try both and then compare the results. Once I have the topics I use decision trees to produce democracy scores.58 Then I should be able to inspect which topics are most influential, thereby learning what concrete manifestations of democracy are driving our democracy scores. Also, I should be able to discard the topics that are both influential and extraneous, and then produce a new, improved batch of scores. There are costs in moving from Wordscores to these other methods. First, we lose the simplicity of Wordscores. Second, we move from a single algorithm (Wordscores does everything itself) to a combination of algorithms (one algorithm extracts the topics and another algorithm produces the democracy scores). Third, computing time increases from 2-3 hours to 24-168 hours for each batch of democracy scores (depending on the exact algorithms used). But hopefully the benefits will outweigh the costs. 58As explained later I use decision trees rather than OLS because I have more variables (topics) than observations (country-years), which does not leave us any degrees of freedom to run OLS.

The remainder of this paper provides the details. Section 2 explains how the news articles were selected and processed. Section 3 explains Latent Semantic Analysis. Section 4 explains Latent Dirichlet Allocation. Section 5 explains decision trees. Section 6 presents the results. Section 7 concludes.

2. News articles

In this paper I use the same texts I used in my earlier paper: 42 million regime-related news articles from 6,043 different sources, all published between 1992 and 2012, and which I organize by country-year. These are all the news sources in English available on LexisNexis Academic, which is an online repository of journalistic content. I used LexisNexis' internal taxonomy to identify regime-related content.59 I start in 1992 because LexisNexis has no search codes for countries that have ceased to exist, which prevents us from reliably retrieving news articles on, say, the Soviet Union or East Germany.60 Further details on how the articles were selected and processed are provided in my earlier paper. The period 1992-2012 gives us a total of 4,067 country-years. As in my earlier paper, I choose the year 1992 for the reference set and extract the corresponding scores from the Unified Democracy Scores - UDS (Pemstein, Meserve, and Melton 2010). The UDS have data on 184 countries for the year 1992. Hence we have 184 reference cases and 3,883 (4,067 - 184) virgin cases. I select the year 1992 simply

59I chose all articles with one or more of the following tags: "human rights violations" (a subtag of "crime, law enforcement and corrections"); "elections and politics" (a subtag of "government and public administration"); "human rights" (a subtag of "international relations and national security"); "human rights and civil liberties law" (a subtag of "law and legal system"); and "censorship" (a subtag of "society, social assistance and lifestyle"). It would be interesting to know how the results change if we select different topic tags, but unfortunately that is no longer possible: on 12/23/2013 LexisNexis changed its user interface and the dozens of political tags and subtags that existed before are now collapsed into a single "Government & Politics" tag, which is too broad for our purposes here.
60Other than Yugoslavia, no country has ceased to exist since 1992.

because it is the first year in our dataset. I select the UDS mainly because they are an amalgamation of several other democracy scores, which reduces measurement noise (though not measurement bias). Further details on why I selected the UDS are provided in my earlier paper. I try three variations of the same collection of news articles. The first version - which I call corpora A - is exactly the same as the one used in my earlier paper. It contains about 6.3 million unique words after proper nouns are removed probabilistically (see earlier paper for details on this point). The other two versions - corpora B and corpora C - are reduced versions of corpora A. To produce corpora B and corpora C I started with corpora A and then dropped the 100 most frequent words in the English language.61 To produce corpora B I also dropped any word that does not appear more than once in any of the documents.62 The outcome is that corpora B has only about 2.3 million unique words - some 4 million fewer than corpora A. To produce corpora C I went one step further and, from each document, I dropped any words that only appeared once. Hence corpora C ⊂ corpora B ⊂ corpora A.
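The filtering just described is simple enough to sketch in code. The snippet below is a minimal illustration only, assuming a hypothetical doc_counts dictionary that maps each country-year to a {word: count} dictionary and a hypothetical top100 set holding the 100 most frequent English words; it is not the exact code used to build the corpora.

def build_corpora(doc_counts, top100):
    # corpora B: drop the 100 most frequent English words and any word
    # whose count never exceeds 1 in any single document
    max_count = {}
    for counts in doc_counts.values():
        for word, count in counts.items():
            max_count[word] = max(max_count.get(word, 0), count)
    keep = {w for w, c in max_count.items() if c > 1 and w not in top100}
    corpora_b = {doc: {w: c for w, c in counts.items() if w in keep}
                 for doc, counts in doc_counts.items()}
    # corpora C: additionally drop, from each document, the words that
    # appear only once in that document
    corpora_c = {doc: {w: c for w, c in counts.items() if c > 1}
                 for doc, counts in corpora_b.items()}
    return corpora_b, corpora_c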

3. Latent Semantic Analysis

In this section I explain what Latent Semantic Analysis is (3.1) and how the math behind it works (3.2).

61As measured by the Oxford English Corpus (which covers all regional varieties of English - American, British, etc). These are mostly words like "the", "of", "have", etc. Combined these stopwords account for about 25% of English texts (http://www.oxforddictionaries.com/words/the-oec-facts-about-the-language) but do not contain any regime-related information, so dropping them should reduce noise.
62This removed extremely rare words, which is desirable because sometimes rare words can be unduly influential. See Manning, Raghavan, and Schütze (2008).

3.1 Intuition

Latent Semantic Analysis (LSA) is a method for extracting topics from texts (Landauer, Foltz, and Laham 1998). More concretely, LSA tells us two things: a) which words "matter" more for each topic; and b) which topics appear more in each document. In this subsection I explain the intuition behind LSA and what outputs it produces, but I leave the math for subsection 3.2. Let us begin with the words. Say that we have a total of m unique words in our corpora and that we want to extract the k most prevalent topics from that corpora. LSA will produce an m × k matrix, which we will call Ũ,63 where each entry ũij gives us the "salience" of word i for topic j. To give a concrete (though fictional) example, suppose that we have run LSA on a corpora of medical articles, extracted the top three topics, and sorted each column of Ũ in decreasing order. The outcome might look somewhat like this:

Figure 7. Example of wordsXtopics table generated with LSA

As we observe in this example, the largest word weights (in absolute values) help

63The reason for the tilde will be explained in subsection 3.2.

us see what topics underlie the set of texts - in this case, diabetes (topic #1), heart diseases (topic #2), and cancer (topic #3). Importantly, each topic contains weights for all words that appear in the entire corpora. For instance, topic #3 contains not only cancer-related words but also all other words: insulin, mellitus, heart, etc. Hence all topics have exactly the same length (m). What changes is the weight each topic assigns to each word. E.g., in topic #3 cancer-related words have the largest weights - which is why we label topic #3 "cancer". In real life applications the topics are usually not so clear-cut. Innocuous words like "the", "of", "did", etc (usually called stopwords) often have large weights in at least some of the topics. Also, it is common for the top 20 or 50 words to be very similar across two or more topics. Finally, in real life applications with large corpora we usually extract a few hundred topics, not just three. (More on these points later.) LSA also tells us which topics are more "salient" in each document. That happens by means of dimensionality reduction. Say that we have a total of n documents in the corpora. LSA will reduce the huge, m × n, wordsXdocuments matrix to a more manageable, k × n, topicsXdocuments matrix. Let us call this topicsXdocuments matrix S̃. Each entry s̃ij gives us the "salience" of topic i for document j. In our fictional example above, if we sort the columns of S̃ in descending order the outcome might look like this:

Figure 8. Example of topicsXdocuments table generated with LSA

Here I expect that LSA will generate regime-related topics - say, "elections", "repression", and so on - and also extraneous topics. I will then use decision trees to create democracy scores, using all topics. Afterwards I should be able to inspect which topics are influencing the scores the most, drop the rows of S̃ corresponding to topics that are both extraneous and influential, and generate a new, improved set of democracy scores.

3.2 Math

Before we run LSA we need to transform our texts into data. We begin by transforming each document into a vector of word counts and then merging all vectors. The outcome is a term-frequency matrix, where rows represent terms, columns represent documents, and each entry is the frequency of term i on document j (i.e., the term-frequency, TFij). Next we apply the TF-IDF transformation. The TF-IDF of each entry is given by its term-frequency (TFij) multiplied by ln(n/dfi), where n is the total number

of documents and dfi is the number of documents in which word i appears (i.e.,

the word’s document frequency – DF; the ln(n/dfi) ratio thus gives us the inverse document frequency – IDF). What the TF-IDF transformation does is increase the importance of the word the

more it appears in the document but the less it appears in the whole corpora. Hence it helps us reduce the weights of inane words like "the", "of", etc and increase the weights of discriminant words (i.e., words that appear a lot but only in a few documents). For more details on TF-IDF, see Manning, Raghavan, and Schütze (2008). The next step is normalization. Here we have documents of widely different sizes, ranging from a few kilobytes (documents corresponding to small countries, like Andorra or San Marino, which rarely appear in the news) to 15 megabytes (documents corresponding to the United States, Russia, etc - countries that appear in the news all the time). Longer documents contain more unique words and have larger TF values, which may skew the results (Manning, Raghavan, and Schütze 2008). To avoid that we normalize the columns of the TF-IDF matrix, transforming them into unit vectors. We will call our normalized TF-IDF matrix A. Its dimensions are m × n (m is the number of unique words in all documents and n is the number of documents). Now we are finally ready to run LSA. In broad strokes, LSA is the use of a particular matrix factorization algorithm - singular value decomposition (SVD) - to decompose a term-frequency matrix or some transformation thereof (like TF-IDF or normalized TF-IDF) and extract the word weights and topic scores. Let us break down LSA into each step. We start by deciding how many topics we want to extract - call it k. There is no principled way to choose k, but for large corpora the rule of thumb is something between 100 and 300 (Martin and Berry 2011). To extract the desired k topics we use SVD to decompose A, as follows:

\[
\underset{m \times n}{A} \;=\; \underset{m \times m}{U}\; \underset{m \times n}{\Sigma}\; \underset{n \times n}{V^{*}}
\]

where U is an orthogonal matrix (i.e., U′U = I) whose columns are the left-singular vectors of A; Σ is a diagonal matrix whose non-zero entries are the singular

values of A; and V* is an orthogonal matrix whose columns are the right-singular vectors of A (Martin and Berry 2011). (Notation: throughout this paper for any matrix M, M′ is its transpose and M* is its conjugate transpose). There are several algorithms for computing SVD and the one I use here is the one created by Halko, Martinsson, and Tropp (2011) - henceforth HMT. I choose HMT because it is specifically designed to handle large matrices, like the ones we have here.64 See Appendix C for details on HMT. The next step in LSA is to truncate U, Σ, and V*. We do that by keeping only the first k columns of U, the first k rows and first k columns of Σ, and the first k rows of V*. Let us call these truncated matrices Ũ, Σ̃, and Ṽ*.65 The truncated matrices give us what we want. Ũ maps words into topics: each entry ũij gives us the weight of word i on topic j. The product Σ̃Ṽ* = S̃, in turn, maps topics into documents: each entry s̃ij gives us the weight of topic i on document j. Importantly, the topics extracted with LSA are ordered: the first topic - i.e., the first column of S̃ - captures more variation than the second column, the second column captures more variation than the third column, and so on. Thus if we run LSA with (say) k = 200 and then with k = 300, the first 200 topics will be the same in both S̃ matrices. In other words, each topic is independent of all topics extracted after it. I set k = 300 or k = 150, depending on the corpora (with corpora A we cannot set k = 300 because the resulting wordsXtopics matrix has dimensions 6.3 million

64Depending on the specific corpora we are using (see section 2) we have up to 6.3 million words, which multiplied by 4,067 documents yields 25 billion entries.
65If we multiply out ŨΣ̃Ṽ* the outcome, Ã, is the best k-rank approximation to A. I.e., Ã is the k-rank matrix B that minimizes the Frobenius distance \( \lVert A - B \rVert_{F} \), where \( \lVert X \rVert_{F} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}^{2}} \) (see Stewart [1993] for proof). A and Ã both have dimensions m × n (since Ũ (m × k) times Σ̃ (k × k) times Ṽ* (k × n) gives Ã (m × n)), but Ã is "smaller" in the sense that it has a lower rank: Ã only contains the top k latent components (topics) of A - i.e., the k latent components that best summarize A; all other latent components are discarded.

by 300, which is too large even for high-memory computers; see section 2 above for details on each corpora). That is all there is to LSA: we use SVD to decompose A, truncate the resulting matrices, and extract from them the word weights and topic weights.66
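A minimal sketch of the whole LSA stage is shown below, assuming the term-frequency matrix is already available as an m × n scipy sparse matrix (rows are words, columns are country-year documents). The TF-IDF and normalization steps follow the formulas above, and the truncated SVD uses scikit-learn's randomized_svd, an implementation of the Halko, Martinsson, and Tropp algorithm (with oversampling p and power iterations q as described in Appendix C). This is an illustration of the procedure, not the exact code used here.

import numpy as np
from scipy import sparse
from sklearn.utils.extmath import randomized_svd

def lsa(counts, k, p=100, q=2):
    """counts: m x n sparse term-frequency matrix (words x documents)."""
    counts = sparse.csc_matrix(counts, dtype=float)
    n = counts.shape[1]
    df = np.asarray((counts > 0).sum(axis=1)).ravel()      # document frequency of each word
    tfidf = counts.multiply(np.log(n / df)[:, None])       # TF * ln(n/df)
    norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=0))).ravel()
    norms[norms == 0] = 1.0                                # guard against empty documents
    A = sparse.csc_matrix(tfidf.multiply(1.0 / norms))     # unit-length document columns
    U, sigma, Vt = randomized_svd(A, n_components=k, n_oversamples=p, n_iter=q)
    S = np.diag(sigma) @ Vt    # k x n topicsXdocuments matrix (Sigma~ V~*)
    return U, S                # U: m x k wordsXtopics matrix (U~)

In this sketch the first return value plays the role of Ũ and the second plays the role of S̃ = Σ̃Ṽ*, matching the notation above.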

4. Latent Dirichlet Allocation

In this section I explain what Latent Dirichlet Allocation is (4.1) and how the math behind it works (4.2).

4.1 Intuition

LSA, discussed in the previous section, is a flexible technique: it does not assume anything about how the words are generated. LSA is at bottom a data reduction technique; its core math - truncated SVD - works as well for mapping documents onto topics as it does for compressing image files. But that flexibility comes at the cost of interpretability. The word weights and topic weights we get from LSA do not have a natural interpretation. We know that the word weights represent the “salience” of each word for each topic, and that the topic weights represent the “salience” of each topic on each document, but beyond that we cannot say much. “Salience” does not have a natural interpretation in LSA. That is the motivation for Latent Dirichlet Allocation (LDA), which was created by Blei, Ng, and Jordan (2003). Whereas LSA is model-free, LDA models every aspect of the data-generating process of the texts. We lose generality (the results are only

66LSA is a popular algorithm also in the field of information retrieval (in which it is known as Latent Semantic Indexing - LSI). Search engines often use LSA to reduce search terms to the underlying topics. The search engine then looks for documents that score high on those topics - whether or not they contain the search terms the user entered. That is how search engines can return pertinent results even though the specific search terms we used may be missing in those results. See Manning, Raghavan, and Schütze (2008) for details on how LSA is used in information retrieval.

as good as the assumed model), but we gain interpretability: with LDA we also get word weights and topic weights, but they have a clear meaning (more on this later). It is unclear under what conditions LSA or LDA tends to produce superior results (Anaya 2011). The topics extracted by LDA are generally believed to be more clear-cut than those extracted by LSA, but on the other hand they are also believed to be broader (Crain et al 2012). Hence in this paper I try both LSA and LDA and compare the results.

4.2 Math

Just as we did in LSA, here too we start by transforming our texts into data, i.e., into a term-frequency matrix, where each entry is the frequency of term i on document j. Unlike what we did in LSA here we will use the term frequencies directly, without any TF-IDF or normalization (LDA models the data-generating process of term frequencies, not of TF-IDF values or any other transformations). LDA assumes the following data-generating process for each document.67 We begin by choosing the number of words in the document, N. We draw N from a Poisson distribution: N ∼ Poisson(ξ). Next we create a k-dimensional vector, θ, that contains the topic proportions in the document. For instance, if θ = [0.3, 0.2, 0.5] then 30% of the words will be assigned to the first topic, 20% to the second topic, and 50% to the third topic (to continue the example in the previous section these topics may be, say, "diabetes", "heart diseases", and "cancer"). We draw θ from a Dirichlet distribution: θ ∼ Dir(α). I set k alternately to 50, 100, 150, 200, and 300.68

67Here I draw heavily from Blei, Ng, and Jordan (2003). 68Like LSA, here too we cannot set k > 150 when using corpora A, for memory reasons.

are not ordered. If we run LDA with, say, k = 200 and k = 300, the first 200 topics will not be the same. Hence we cannot simply run LDA once, with k = 300, and then just drop columns later as desired. We need to run LDA again for every k. Now that we have the number of words (N) and the topic distribution (θ) we are ready to choose each word, w. First we draw its topic, z, from the k topics in θ, as follows: z ∼ Multinomial(θ). We then draw w from p(w|z, β), where β is a k × m

matrix whose entry βij is the probability of word j being selected if we randomly draw a word from topic i (m is the total number of unique words in the corpora). For instance, the word "insulin" may have probability 0.05 of being selected from the topic "diabetes" and 0.001 of being selected from the topic "heart diseases". There are three levels in the model: α and β are the same for all documents, θ is specific to each document, and z and w are specific to each word in each document.69 That is all there is to the data-generating model behind LDA. To estimate the model we need to find the α and β that maximize the probability of observing the ws:70

\[
p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_{i} \alpha_{i}\right)}{\prod_{i} \Gamma(\alpha_{i})} \int \left( \prod_{i=1}^{k} \theta_{i}^{\alpha_{i}-1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_{i}\beta_{ij})^{w_{n}^{j}} \right) d\theta
\]

That function is intractable, which precludes exact inference. We need to use some approximative algorithm. I use the one created by Hoffman, Blei, and Bach (2010) - henceforth HBB. I choose HBB because it handles large matrices and runs relatively fast. See Appendix D for details. Like LSA, LDA also yields an m × k matrix of wordsXtopics weights and a k × n matrix of topicsXdocuments weights. Unlike LSA though, here these estimates have a natural interpretation. Each word weight is the probability of word i being selected

69We disregard ξ because, as Blei, Ng, and Jordan (2003) note, N is an ancillary variable, independent of everything else in the model, so we can ignore its randomness.
70V is the total number of unique words (i.e., the size of the vocabulary) and the other terms are defined as before.

if we randomly draw a word from topic j. And each topic weight is the proportion of words in document j drawn from topic i.
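The online variational Bayes algorithm of Hoffman, Blei, and Bach is implemented in, among other places, the gensim library. The sketch below shows how the LDA stage could be run with gensim; the tokenized_docs list is only a placeholder so the snippet runs, and the parameter values mirror the choices described in Appendix D. It is an illustration, not the exact code used here.

from gensim import corpora, models

# placeholder documents; the real input is one tokenized text per country-year
tokenized_docs = [["election", "vote", "opposition"],
                  ["censorship", "torture", "opposition"]]

dictionary = corpora.Dictionary(tokenized_docs)              # word <-> id mapping
bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]    # term-frequency vectors

lda = models.LdaModel(bow,
                      num_topics=2,         # 50-300 in the real application
                      id2word=dictionary,
                      alpha='asymmetric',   # fixed normalized asymmetric prior
                      chunksize=20,         # documents per online-VB update
                      iterations=1000)      # cap on per-document inference iterations

topic_weights = lda.get_document_topics(bow[0])   # topic proportions for one document
word_weights = lda.get_topics()                   # k x m matrix of word probabilities (beta)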

5. Decision trees

In this section I motivate the use of decision trees (5.1) and explain the underlying math (5.2).

5.1 Motivation

LSA and LDA give us word weights, which map words onto topics, and topic weights, which map topics onto documents. But how do we go from that to democracy scores? In principle we could use OLS. As explained before, our reference set is the 184 independent countries in the year 1992, whose reference scores we extract from the UDS, and our virgin set is the 3,883 country-years in the 1993-2012 period. Thus we could, in principle: a) regress the UDS scores of the 184 reference cases on their respective topic scores; and b) use the estimated coefficients to compute ("predict") the democracy scores of the 3,883 virgin cases. But that would not work here. We have between 50 and 300 topics and only 184 reference cases, so with OLS we would quickly run out of degrees of freedom. Even with only 50 topics we would still be violating the "10 observations per variable" rule of thumb. And there may be all sorts of interactions and other non-linearities and to model these we would need additional terms, which would require even more degrees of freedom. Thus I use decision trees instead. With decision trees we split the observations recursively, until they are all allocated into homogeneous "leaves". Each split - and thus

the path to each leaf - is based on certain values of the regressors (if x1 > 10 then follow this branch, if x1 ≤ 10 then follow this other branch, etc). Decision trees are non-parametric: we are not estimating any parameters, we are simply trying to find the best splitting points (i.e., the splitting points that make the leaves as homogeneous as possible). Because decision trees are non-parametric we can use them even if we have more variables than observations (Grömping 2009). Moreover, decision trees handle non-linearities well: the inter-relations between variables are captured in the very hierarchical structure of the tree; we do not need to specify a priori what variables interact or in what ways. Hence my choice of tree-based regression techniques over alternatives like LASSO, Ridge, or forward stepwise regression, all of which can handle a large number of variables but at the cost of ignoring non-linearities. The next subsection explains the math behind decision trees. In what follows I draw heavily from Hastie, Tibshirani, and Friedman (2008).

5.2 Math

Decision trees were first introduced by Breiman et al (1984). Say that we have a dependent variable y, k independent variables, and n observations, and that all variables are continuous. We want to split the observations into two subsets (let us call them r1 and r2) based on some independent variable j and some value s. More specifically, we want to put all observations for which variable j takes a value ≤ s in one subset and all observations for which variable j takes a value > s in another subset. But we do not want to choose j and s arbitrarily: we want to choose the j and s that make each subset as homogeneous as possible when it comes to y. To do that we find the j and s that minimize the sum of the mean squared errors of the subsets:

\[
\min_{j,s} \left[ \frac{\sum_{y_i \in r_1} (y_i - \bar{y}_{r_1})^2}{n_{r_1}} + \frac{\sum_{y_i \in r_2} (y_i - \bar{y}_{r_2})^2}{n_{r_2}} \right]
\]

We find j and s iteratively. We then repeat the operation for the resulting subsets, partitioning each into two, and we keep doing so recursively until the subsets have fewer than l observations. The lower the l the better the model fits the reference set but the worse it generalizes to the virgin set. There is no rigorous way to choose l. Here I just follow two popular choices of l (l = 2 and l = 5) and compare the results. The outcome is a decision tree that relates the k independent variables to y. For instance, if we tried to predict individual income based on socioeconomic variables our decision tree might look like this:

years of schooling <= 8
    parental income <= $50k
        years of schooling <= 4: $24.7k | years of schooling > 4: $37.2k
    parental income > $50k
        age <= 25: $41.8k | age > 25: $49.9k
years of schooling > 8
    years of schooling <= 12
        height <= 5ft: $68.4k | height > 5ft: $72.6k
    years of schooling > 12
        age <= 25: $77.6k | age > 25: $85.9k

Figure 9. Example of decision tree

This is of course an extremely contrived example,71 but it gives us a concrete idea of what a decision tree looks like. Each node splits the observations in two groups according to some independent variable j and some splitting point s, chosen so as to

71In real life applications the trees are usually much bigger, with hundreds or thousands of nodes and leaves.

minimize the sum of the mean squared errors of the two subsets immediately below. We stop growing the tree when the subsets become small enough - i.e., when the subsets contain fewer than l observations. The very last subsets are the leaves of the tree.72 The average y of each leaf gives us the predicted y for new observations. Here, for instance, someone with 5 years of schooling, parental income over $50k, and aged 30 would have a predicted income of $49.9k. We can see how the tree captures non-linearities. Parental income only matters when the individual has eight years of schooling or less. Height only matters when the individual has more than eight but twelve or fewer years of schooling. Age only matters for two groups of people: those with eight years of schooling or less and parental income over $50k; and those with more than twelve years of schooling. And so on. With conventional regression we would need to model all such non-linearities explicitly, positing a priori what depends on what. And we would need several interactive and higher-order terms, which would take up degrees of freedom. With decision trees, however, the non-linearities are learned from the data, no matter how many or how complex they are (assuming of course that we have enough observations)73. Instead of creating a single decision tree we can create different decision trees and average out their predictions. The three most popular ways to do that are called random forests, extreme random forests, and AdaBoost. The conditions under which a single decision tree or multiple decision trees should be chosen are not clear. The conditions under which random forests, extreme random forests, or AdaBoost should be chosen are not clear either. Therefore I try all four techniques and compare their

72The process can be tweaked in a number of ways (we can replace the mean squared error by other criteria, we can split the subsets in more than two, we can “prune” the tree, etc), but here I stick to the basics. 73One difficulty with non-parametric algorithms is that “enough observations” is harder to define than with parametric algorithms.

results. I leave the explanation of random forests, extreme random forests, and AdaBoost to Appendix E.
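To make the prediction step concrete, the sketch below fits a single tree and a random forest with scikit-learn. The arrays X_ref (topic weights of the 1992 reference cases), y_ref (their UDS scores), and X_virgin (topic weights of the virgin cases) are filled with random placeholder values here, and the hyperparameter values shown (including the number of trees in the forest) are illustrative rather than the exact specification used in this paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# placeholder data so the snippet runs; the real X matrices come from LSA/LDA
rng = np.random.default_rng(0)
k = 50
X_ref, y_ref = rng.random((184, k)), rng.random(184)
X_virgin = rng.random((3883, k))

tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_ref, y_ref)   # l = 5
forest = RandomForestRegressor(n_estimators=500,
                               max_features='sqrt',                  # c = sqrt(k)
                               min_samples_leaf=5).fit(X_ref, y_ref)

scores_tree = tree.predict(X_virgin)       # machine-coded democracy scores
scores_forest = forest.predict(X_virgin)

# which topics drive the predictions the most
most_influential = np.argsort(forest.feature_importances_)[::-1][:10]

The last line is what makes the tree-based approach attractive for construct validity: the fitted ensemble reports how much each topic contributes to the predictions, which is exactly the kind of inspection described above.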

6. Results

I produced a total of 234 batches of democracy scores, each covering the same country-years, i.e., all 3,883 country-years in the 1993-2012 period (same as in my earlier paper). The batches vary by corpora (A, B, or C), topic-extraction method (LSA or LDA), prior α if the topic-extraction method is LDA (symmetric α or asymmetric normalized α), number of topics (50, 100, 150, 200, or 300), tree method (decision trees, random forests, extreme random forests, or AdaBoost), size of the subset of the k variables considered when splitting the tree nodes (c = k/3, c = √k, or c = k),74 and minimum node size (l = 2 or l = 5). Tables 9-13 below show the correlation between each batch and the UDS (there is one table for each number of topics - 50, 100, 150, 200, and 300)75. If the correlations are too low that means the batches are measuring democracy poorly. The benchmark is the correlation obtained in my earlier paper (i.e., using Wordscores): 0.74. The computations took a total of 1,512 CPU-hours, using high-performance servers.

74Only applicable for random forests, extreme random forests, and AdaBoost. See Appendix E.
75Tables 12 and 13 do not have corpora A results because corpora A has 6.3 million unique words, which multiplied by 200 or 300 topics yields a matrix too large to handle, even using high-memory computers.

Table 9. Correlations with UDS (using 50 topics)

Columns, left to right: corpora A (LSA, LDA sym. α, LDA asym. α), corpora B (LSA, LDA sym. α, LDA asym. α), corpora C (LSA, LDA sym. α, LDA asym. α).

c = k/3; l = 5
tree: -0.08  0.15  -0.07  -0.13  -0.10  -0.09  -0.05  -0.07  0.11
random forest: -0.27  -0.04  -0.20  -0.25  -0.25  -0.24  -0.21  -0.17  -0.05
ext. ran. for.: -0.24  -0.11  -0.01  -0.23  -0.20  -0.28  -0.18  -0.09  0.06
AdaBoost: -0.24  -0.06  -0.15  -0.24  0.17  -0.24  -0.21  -0.09  -0.003

c = √k; l = 5
tree: 0.02  -0.08  -0.05  -0.14  0.03  -0.009  -0.13  -0.01  -0.12
random forest: -0.28  -0.04  -0.20  -0.24  -0.26  -0.27  -0.25  -0.18  -0.10
ext. ran. for.: -0.24  -0.11  -0.02  -0.24  -0.22  -0.27  -0.06  -0.12  0.01
AdaBoost: -0.24  -0.04  -0.16  -0.23  -0.20  -0.27  -0.24  -0.11  -0.05

c = k; l = 5
tree: -0.10  0.09  -0.19  -0.14  -0.06  -0.04  -0.01  0.006  -0.04
random forest: -0.27  -0.05  -0.19  -0.24  -0.23  -0.21  -0.21  -0.15  -0.003
ext. ran. for.: -0.25  -0.09  0.03  -0.24  -0.19  -0.27  -0.16  -0.08  0.08
AdaBoost: -0.24  -0.06  -0.16  -0.23  -0.14  -0.23  -0.21  -0.08  0.03

c = k/3; l = 2
tree: -0.05  0.11  -0.001  -0.07  0.01  -0.04  0.02  0.12  0.07
random forest: -0.27  -0.03  -0.18  -0.25  -0.24  -0.26  -0.21  -0.16  -0.04
ext. ran. for.: -0.23  -0.09  0.005  -0.21  -0.19  -0.25  -0.17  -0.08  0.07
AdaBoost: -0.24  -0.05  -0.15  -0.23  -0.19  -0.26  -0.21  -0.10  -0.03

c = √k; l = 2
tree: -0.09  0.01  -0.03  -0.03  -0.14  -0.11  -0.03  0.05  -0.006
random forest: -0.27  -0.04  -0.18  -0.25  -0.25  -0.26  -0.22  -0.18  -0.08
ext. ran. for.: -0.22  -0.10  -0.004  -0.21  -0.19  -0.24  -0.17  -0.11  0.02
AdaBoost: -0.24  -0.05  -0.15  -0.23  -0.19  -0.26  -0.21  -0.11  -0.08

c = k; l = 2
tree: -0.09  0.07  -0.20  -0.15  -0.02  -0.02  -0.05  0.03  -0.002
random forest: -0.26  -0.06  -0.18  -0.24  -0.23  -0.21  -0.21  -0.14  0.001
ext. ran. for.: -0.24  -0.08  0.03  -0.23  -0.17  -0.24  -0.16  -0.07  0.09
AdaBoost: -0.24  -0.07  -0.07  -0.23  -0.14  -0.23  -0.22  -0.08  0.02

Table 10. Correlations with UDS (using 100 topics)

Columns, left to right: corpora A (LSA, LDA sym. α, LDA asym. α), corpora B (LSA, LDA sym. α, LDA asym. α), corpora C (LSA, LDA sym. α, LDA asym. α).

c = k/3; l = 5
tree: -0.06  0.14  0.06  -0.0009  0.12  -0.02  -0.10  -0.15  0.02
random forest: -0.23  -0.24  -0.04  -0.18  -0.13  -0.31  -0.18  -0.20  -0.18
ext. ran. for.: -0.19  -0.09  0.02  -0.18  -0.05  -0.19  -0.15  -0.09  -0.16
AdaBoost: -0.19  -0.18  -0.002  -0.15  -0.09  -0.20  -0.15  -0.13  -0.11

c = √k; l = 5
tree: -0.02  -0.11  -0.01  0.001  -0.05  -0.07  -0.03  0.001  0.003
random forest: -0.23  -0.24  -0.06  -0.20  -0.16  -0.29  -0.20  -0.22  -0.20
ext. ran. for.: -0.18  -0.09  0.008  -0.18  -0.09  -0.21  -0.16  -0.17  -0.17
AdaBoost: -0.19  -0.19  -0.01  -0.16  -0.09  -0.20  -0.16  -0.16  -0.13

c = k; l = 5
tree: -0.11  -0.07  0.09  -0.03  0.05  -0.11  0.10  0.05  -0.17
random forest: -0.22  -0.23  -0.02  -0.17  -0.12  -0.32  -0.16  -0.20  -0.16
ext. ran. for.: -0.19  -0.10  0.03  -0.17  -0.02  -0.18  -0.14  -0.03  -0.15
AdaBoost: -0.20  -0.17  0.02  -0.14  -0.06  -0.21  -0.13  -0.13  -0.10

c = k/3; l = 2
tree: -0.12  -0.02  -0.07  -0.17  0.05  0.15  -0.07  0.003  0.02
random forest: -0.23  -0.23  -0.04  -0.19  -0.13  -0.29  -0.18  -0.20  -0.16
ext. ran. for.: -0.18  -0.09  0.02  -0.18  -0.08  -0.20  -0.15  -0.07  -0.15
AdaBoost: -0.19  -0.17  -0.01  -0.15  -0.11  -0.19  -0.14  -0.13  -0.12

c = √k; l = 2
tree: -0.06  -0.18  0.06  -0.06  -0.10  -0.15  0.01  -0.01  -0.01
random forest: -0.22  -0.25  -0.03  -0.19  -0.14  -0.28  -0.19  -0.21  -0.18
ext. ran. for.: -0.17  -0.09  0.01  -0.17  -0.08  -0.20  -0.15  -0.14  -0.17
AdaBoost: -0.18  -0.18  -0.02  -0.15  -0.10  -0.18  -0.15  -0.15  -0.14

c = k; l = 2
tree: -0.10  -0.08  0.09  -0.04  0.05  -0.10  0.12  -0.03  -0.15
random forest: -0.22  -0.23  -0.01  -0.17  -0.12  -0.31  -0.16  -0.19  -0.16
ext. ran. for.: -0.19  -0.10  0.04  -0.17  -0.03  -0.17  -0.14  -0.02  -0.15
AdaBoost: -0.19  -0.17  0.007  -0.14  -0.07  -0.20  -0.13  -0.14  -0.11

Table 11. Correlations with UDS (using 150 topics)

Columns, left to right: corpora A (LSA, LDA sym. α, LDA asym. α), corpora B (LSA, LDA sym. α, LDA asym. α), corpora C (LSA, LDA sym. α, LDA asym. α).

c = k/3; l = 5
tree: 0.002  -0.13  0.07  -0.07  -0.14  -0.09  0.008  -0.14  0.03
random forest: -0.20  -0.20  -0.10  -0.21  -0.25  -0.27  -0.20  -0.19  -0.24
ext. ran. for.: -0.17  -0.11  -0.07  -0.16  -0.14  -0.16  -0.15  -0.04  -0.17
AdaBoost: -0.16  -0.16  -0.03  -0.17  -0.27  -0.16  -0.17  -0.16  -0.18

c = √k; l = 5
tree: -0.04  -0.08  -0.06  -0.12  -0.05  0.04  -0.09  -0.02  -0.09
random forest: -0.20  -0.19  -0.11  -0.21  -0.28  -0.31  -0.21  -0.21  -0.26
ext. ran. for.: -0.16  -0.12  -0.08  -0.17  -0.16  -0.14  -0.16  -0.06  -0.21
AdaBoost: -0.16  -0.17  -0.04  -0.17  -0.29  -0.21  -0.17  -0.19  -0.20

c = k; l = 5
tree: -0.02  -0.16  0.04  -0.02  -0.06  0.04  -0.03  0.03  0.05
random forest: -0.19  -0.19  -0.07  -0.20  -0.25  -0.24  -0.19  -0.18  -0.24
ext. ran. for.: -0.18  -0.11  -0.07  -0.16  -0.15  -0.18  -0.14  -0.04  -0.15
AdaBoost: -0.15  -0.12  -0.01  -0.17  -0.24  -0.12  -0.17  -0.14  -0.18

c = k/3; l = 2
tree: -0.01  -0.11  -0.05  -0.04  -0.0003  -0.10  -0.08  -0.03  -0.13
random forest: -0.20  -0.20  -0.10  -0.21  -0.27  -0.30  -0.20  -0.18  -0.24
ext. ran. for.: -0.17  -0.11  -0.05  -0.16  -0.15  -0.15  -0.15  -0.03  -0.17
AdaBoost: -0.15  -0.15  -0.02  -0.17  -0.29  -0.19  -0.17  -0.16  -0.19

c = √k; l = 2
tree: 0.03  -0.21  0.03  -0.02  -0.01  -0.02  -0.05  -0.08  0.06
random forest: -0.20  -0.19  -0.10  -0.21  -0.27  -0.30  -0.21  -0.20  -0.24
ext. ran. for.: -0.16  -0.11  -0.05  -0.16  -0.15  -0.14  -0.15  -0.06  -0.20
AdaBoost: -0.15  -0.14  -0.03  -0.16  -0.29  -0.19  -0.17  -0.20  -0.19

c = k; l = 2
tree: -0.01  -0.18  0.04  -0.01  -0.04  -0.07  -0.02  0.06  0.08
random forest: -0.19  -0.18  -0.08  -0.19  -0.24  -0.22  -0.19  -0.18  -0.23
ext. ran. for.: -0.17  -0.11  -0.07  -0.16  -0.13  -0.18  -0.14  -0.03  -0.15
AdaBoost: -0.15  -0.11  -0.01  -0.16  -0.23  -0.13  -0.17  -0.14  -0.18

Table 12. Correlations with UDS (using 200 topics)

Columns, left to right: corpora B (LSA, LDA sym. α, LDA asym. α), corpora C (LSA, LDA sym. α, LDA asym. α).

c = k/3; l = 5
tree: 0.01  -0.04  0.02  -0.02  0.04  0.26
random forest: -0.14  -0.09  -0.24  -0.14  -0.10  -0.20
ext. ran. for.: -0.14  0.05  -0.19  -0.13  -0.05  -0.24
AdaBoost: -0.13  0.007  -0.15  -0.12  -0.11  -0.19

c = √k; l = 5
tree: -0.09  0.008  0.08  -0.12  -0.02  -0.05
random forest: -0.17  -0.12  -0.25  -0.15  -0.13  -0.22
ext. ran. for.: -0.14  0.03  -0.20  -0.13  -0.11  -0.22
AdaBoost: -0.13  -0.01  -0.18  -0.13  -0.13  -0.21

c = k; l = 5
tree: -0.08  0.03  0.009  -0.02  -0.08  -0.01
random forest: -0.13  -0.06  -0.22  -0.13  -0.09  -0.18
ext. ran. for.: -0.15  0.06  -0.14  -0.14  -0.02  -0.21
AdaBoost: -0.12  0.03  -0.11  -0.10  -0.10  -0.17

c = k/3; l = 2
tree: -0.04  0.06  -0.20  -0.10  -0.08  -0.05
random forest: -0.16  -0.11  -0.25  -0.14  -0.10  -0.19
ext. ran. for.: -0.14  0.03  -0.20  -0.13  -0.03  -0.23
AdaBoost: -0.13  -0.01  -0.18  -0.11  -0.11  -0.20

c = √k; l = 2
tree: -0.04  0.07  -0.02  -0.08  0.04  -0.05
random forest: -0.16  -0.11  -0.25  -0.16  -0.13  -0.22
ext. ran. for.: -0.14  0.03  -0.20  -0.14  -0.08  -0.21
AdaBoost: -0.13  -0.01  -0.18  -0.12  -0.14  -0.23

c = k; l = 2
tree: -0.08  0.05  0.004  -0.02  -0.08  -0.03
random forest: -0.14  -0.06  -0.22  -0.13  -0.09  -0.17
ext. ran. for.: -0.14  0.06  -0.14  -0.14  -0.009  -0.20
AdaBoost: -0.12  0.02  -0.13  -0.11  0.10  -0.18

Table 13. Correlations with UDS (using 300 topics)

Columns, left to right: corpora B (LSA, LDA sym. α, LDA asym. α), corpora C (LSA, LDA sym. α, LDA asym. α).

c = k/3; l = 5
tree: -0.02  0.007  0.06  -0.05  0.02  -0.07
random forest: -0.11  -0.21  -0.15  -0.11  -0.17  -0.24
ext. ran. for.: -0.10  -0.12  0.02  -0.12  -0.13  -0.02
AdaBoost: -0.09  -0.20  -0.17  -0.10  -0.16  -0.23

c = √k; l = 5
tree: -0.01  -0.07  0.01  -0.01  -0.008  -0.13
random forest: -0.13  -0.27  -0.17  -0.13  -0.19  -0.25
ext. ran. for.: -0.11  -0.18  -0.01  -0.11  -0.18  -0.06
AdaBoost: -0.11  -0.25  -0.19  -0.11  -0.18  -0.24

c = k; l = 5
tree: -0.01  0.03  -0.10  -0.0002  -0.08  0.04
random forest: -0.10  -0.17  -0.13  -0.10  -0.16  -0.24
ext. ran. for.: -0.10  -0.09  0.04  -0.12  -0.07  -0.004
AdaBoost: -0.10  -0.17  -0.17  -0.10  -0.13  -0.22

c = k/3; l = 2
tree: -0.04  -0.006  -0.07  0.01  -0.06  -0.05
random forest: -0.11  -0.25  -0.15  -0.11  -0.17  -0.24
ext. ran. for.: -0.10  -0.16  0.02  -0.12  -0.11  -0.02
AdaBoost: -0.10  -0.24  -0.19  -0.10  -0.17  -0.23

c = √k; l = 2
tree: -0.01  -0.02  0.06  -0.00009  -0.13  0.02
random forest: -0.13  -0.25  -0.17  -0.13  -0.20  -0.24
ext. ran. for.: -0.11  -0.16  -0.01  -0.11  -0.16  -0.07
AdaBoost: -0.11  -0.24  -0.21  -0.11  -0.18  -0.24

c = k; l = 2
tree: -0.03  0.005  -0.07  0.0008  -0.09  0.02
random forest: -0.10  -0.17  -0.13  -0.10  -0.15  -0.23
ext. ran. for.: -0.10  -0.08  0.04  -0.12  -0.07  -0.006
AdaBoost: -0.09  -0.18  -0.16  -0.10  -0.15  -0.22

As we observe, the results are disappointing. All batches perform poorly, with the correlations (in absolute value) usually in the 0.10-0.20 range - way below the correlation of 0.74 obtained with Wordscores. The results are the same regardless of corpora, topic extraction method (LSA or LDA), α, c, or l. The only parameters that seem to make some small difference are the number of topics (with LSA, the fewer the topics the better) and the prediction method (random forests outperform AdaBoost, which outperforms extreme random forests, which outperform decision trees). But even the highest correlations are in the 0.30 vicinity, which still implies an unacceptably high noise-to-signal ratio.76 One possibility is that the poor results are due to extraneous topics. Let us inspect the top twenty words of the top five topics extracted with LSA (using corpora A):

76Nearly all the correlations are negative, which implies that what little regime-related information is being captured has to do with autocracy.

Table 14. Top 5 topics extracted with LSA

topic #1 | topic #2 | topic #3 | topic #4 | topic #5
0.415 copyright | -0.251 blogs | -0.248 blogs | 0.227 african | 0.226 blogs
0.388 u | 0.215 arab | -0.238 obama | -0.19 kosovo | 0.226 kosovo
0.241 i | 0.211 israeli | -0.181 arab | 0.187 kabila | -0.219 iraq
0.151 mr | 0.209 palestinian | 0.181 kosovo | 0.175 rwanda | -0.202 boucher
0.15 blogs | 0.196 israel | -0.18 re-distributors | 0.159 hutu | 0.174 obama
0.137 iraq | 0.189 arafat | -0.156 israel | 0.154 liberia | 0.165 re-distributors
0.12 pg | -0.184 mln | -0.152 palestinian | 0.149 africa | -0.157 bush
0.11 obama | -0.181 re-distributors | -0.142 israeli | -0.144 soviet | 0.157 serb
0.109 newswire | -0.18 obama | -0.138 newstex | 0.14 zaire | 0.153 mln
0.106 re-distributors | -0.144 newswire | 0.13 milosevic | -0.13 milosevic | 0.151 newstex
0.102 bush | 0.142 rabin | 0.129 serb | -0.128 serb | -0.135 copyright
0.091 european | 0.14 palestinians | 0.121 european | 0.125 rwandan | 0.133 milosevic
0.086 mln | -0.138 revs | -0.114 syria | 0.124 leone | 0.131 bosnian
0.084 boucher | 0.138 iraq | 0.112 yeltsin | 0.121 zimbabwe | 0.13 revs
0.078 israel | -0.138 newstext | 0.111 soviet | 0.12 angola | -0.126 i
0.075 arab | 0.126 islamic | 0.108 bosnian | 0.12 sierra | 0.124 serbs
0.075 eu | 0.118 saudi | -0.104 mln | -0.117 yeltsin | -0.12 fleischer
0.073 reserved | 0.117 netanyahu | -0.102 iraq | 0.115 congo | 0.107 serbia
0.071 soviet | 0.109 lebanon | 0.1 serbs | -0.113 european | 0.099 yugoslav
0.071 q | -0.108 copyright | -0.099 saudi | 0.107 uganda | -0.098 ereli
0.068 palestinian | 0.107 syria | 0.096 hke | 0.101 burundi | 0.096 european
...

As we see, topic #1 is incoherent, topic #2 is "Middle East", topic #3 is incoherent, topic #4 is "Africa & former Yugoslavia", and topic #5 is "former Yugoslavia". These topics are not useful here. What we need are narrow, regime-related topics like "censorship" or "torture", not place-related topics like "Middle East" or "former Yugoslavia". Also, place-specific topics tell us that LSA gives a lot of weight to words that survived our probabilistic removal of proper nouns (see earlier paper for details on this point). Let us now inspect some topics extracted with LDA. Here are the first five topics obtained using LDA, corpora B, 100 topics, and asymmetric α, which is the LDA model that performed the best (correlation of 0.32 with the UDS when using random forests, c = k, and l = 5):77

77LDA topics are not ordered, so the first five topics are not necessarily the top five topics.

Table 15. First 5 topics extracted with LDA (using 100 topics, corpora B, and asymmetric α)

topic #1 | topic #2 | topic #3 | topic #4 | topic #5
0.012 s | 0.049 russia | 0.015 is | 0.021 is | 0.039 vietnam
0.009 is | 0.034 ukraine | 0.013 s | 0.14 was | 0.018 s
0.008 ncube | 0.018 russian | 0.010 was | 0.010 party | 0.016 thailand
0.006 said | 0.014 soviet | 0.007 has | 0.009 s | 0.014 thai
0.005 has | 0.013 s | 0.007 eu | 0.008 are | 0.011 said
0.004 was | 0.013 uzbekistan | 0.007 are | 0.008 said | 0.008 is
0.004 are | 0.012 moscow | 0.007 uk | 0.008 has | 0.008 bangkok
0.003 reserved | 0.012 is | 0.006 said | 0.007 government | 0.007 was
0.003 rights | 0.011 president | 0.005 minister | 0.007 political | 0.007 vietnamese
0.003 burges | 0.009 said | 0.004 been | 0.005 had | 0.006 asian
0.003 gono | 0.007 former | 0.004 london | 0.005 election | 0.006 asean
0.002 were | 0.007 has | 0.004 union | 0.005 been | 0.006 news
0.002 words | 0.006 yeltsin | 0.004 were | 0.005 elections | 0.006 indonesia
0.002 been | 0.006 are | 0.004 had | 0.004 were | 0.005 has
0.002 milkis | 0.006 was | 0.004 labour | 0.004 rights | 0.005 singapore
0.002 feha | 0.006 georgia | 0.004 government | 0.004 should | 0.005 are
0.002 completecampaigns | 0.005 presidential | 0.003 english | 0.004 words | 0.004 cambodia
0.002 had | 0.005 news | 0.003 more | 0.003 english | 0.004 foreign
0.002 news | 0.004 cis | 0.003 words | 0.003 documents | 0.004 minister
0.002 length | 0.004 central | 0.003 countries | 0.003 load-date | 0.004 hanoi
0.002 times | 0.004 interfax | 0.003 prime | 0.003 language | 0.004 words
...

The result is very similar to what we found with LSA. Topic #1 is all over the place, topic #2 is "former USSR republics", topic #3 is "UK", topic #4 is "politics", and topic #5 is "East/Southeast Asia". Again I obtained place-specific topics (or excessively broad topics like "politics") and not the narrow, regime-related topics I expected to find. I inspected every topic of every specification and they are all equally disappointing: either the words are disparate and do not form a coherent topic or the topic is too broad or not regime-related. My initial idea was to inspect the most influential topics so I could know exactly what aspects of democracy are driving the results. But that is just not feasible: the topics do not correspond to aspects of democracy. In a sense, all topics are extraneous. Hence the democracy scores produced here are little more than noise, which is why they correlate so weakly with the UDS.

7. Conclusion

The goal of this paper was to improve on the democracy index produced in the earlier paper, which was based on the Wordscores algorithm. More specifically, I expected to produce a democracy index that could let us easily know what concrete phenomena - i.e., what democracy manifestations - are being captured. I tried to achieve that by dropping Wordscores and replacing it by a combination of topic-extraction and decision trees. I failed to achieve the desired goal. Instead I produced democracy indices that do not even capture democracy in the first place; they are only slightly better than random guesses. Wordscores, in all its simplicity (it is based on very simple math - high school algebra and Bayes' theorem), outperforms the more sophisticated methods used here. This echoes works like Klemmensen, Hobolt, and Hansen (2007) and Beauchamp

(2010), who also find that their Wordscores-based measures correlate highly with other indices. As Lowe (2008) notes, despite Wordscores' limitations - it is atheoretical and its assumptions are not made explicit - "The empirical success of the method suggests that these assumptions may be reasonable." (370). The results suggest that the road ahead does not pass through fancier algorithms, but perhaps through minor changes in Wordscores itself. For instance, we could try adjusting Wordscores to take into account "disproportionate" press reactions. Minor setbacks in highly democratic countries often attract a great deal of press coverage (e.g., the recent blocking of explicit internet content in Britain). Similarly, minor regime liberalizations in highly authoritarian countries often attract a great deal of press coverage as well (e.g., the recent decision of the Cuban government to allow Yoani Sánchez - a local journalist opposed to the Castro regime - to travel abroad for the first time). Newspapers and magazines often use words like "repression" and "democratization" liberally when referring to such events and that may cause unwarranted score fluctuations. To correct for that perhaps we could use the scores for year t0 as priors in the estimation of the scores for year t1 (instead of starting anew every year, as I do here). That might make the scores more robust to minor fluctuations and to the idiosyncrasies of press coverage.

Conclusion

In this dissertation I investigated the flaws of current democracy indices and proposed a new, improved one, which I called Automated Democracy Scores (ADS). This new index has limitations of its own, but it is replicable, it has standard errors small enough to actually distinguish between cases, and it avoids contamination by human coders' ideological biases. It is also cost-effective: indices like the Polity and the Freedom House require a legion of human coders to update the country scores every year, while the ADS can be updated by a single person; all that is needed is access to news articles and some familiarity with webscraping. The next step is to put the ADS to use. It would be interesting to re-assess what we know about democracy and natural resources, or about democracy and social policy, by replicating existing works while substituting the ADS for whatever other indices were used originally. I also plan to break down the time series into daily scores and, perhaps, real-time scores. We have increasingly more social data available in real time - most prominently, data from social networks - but our democracy indices are still yearly, which precludes us from asking many interesting questions. Finally, I intend to extend the ADS to the pre-1993 period; that should be possible in a few years, as more and more newspapers digitize their archives.

All data and code used in the three papers are available from thiagomarzagao.com

References

Alvarez, Mike, Jos´eCheibub, Fernando Limongi, and Adam Przeworski. 1996. “Classifying Political Regimes.” Studies in Comparative International Development, 31 (2): 3-36.

Anaya, Leticia. 2011. Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers. PhD dissertation, Department of Information and Decision Sciences, University of North Texas.

Banks, Arthur. 1971. Cross-polity time-series data (electronic dataset updated to 1988). Cambridge, MA: MIT Press.

Beauchamp, Nick. 2010. Text-Based Scaling of Legislatures: a Comparison of Methods with Applications to the US Senate and UK House of Commons. Unpublished manuscript.

Benoit, Kenneth, and Michael Laver. 2008. “Compared to What? A Comment on ‘A Robust Transformation Procedure for Interpreting Political Text’ by Martin and Vanberg.” Political Analysis, 16 (1): 101-111.

Blei, David, Andrew Ng, and Michael Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3: 993-1022.

Bollen, Kenneth. 1989. Structural equations with latent variables. New York, NY: Wiley.

Bollen, Kenneth. 1993. "Liberal Democracy: Validity and Methods Factors in Cross-National Measures." American Journal of Political Science, 37 (4): 1207-1230.

Bollen, Kenneth and Pamela Paxton. 2000. "Subjective Measures of Liberal Democracy." Comparative Political Studies, 33 (1): 58-86.

Breiman, Leo. 2001. “Random Forests.” Machine Learning , 45 (1): 5-32.

Breiman, Leo, Jerome Friedman, Charles Stone, and R. A. Olshen. 1984. Classification and regression trees, Wadsworth.

Coppedge, Michael, John Gerring, David Altman, Michael Bernhard, Steven Fish, Allen Hicken, Matthew Kroenig, Staffan I. Lindberg, Kelly McMann, Pamela Paxton, Holli Semetko, Svend-Erik Skaaning, Jeffrey Staton, Jan Teorell. 2011. "Conceptualizing and Measuring Democracy: a New Approach." Perspectives on Politics, 9 (2): 247-267.

Crain, Steven, Ke Zhou, Shuang-Hong Yang, and Hongyuan Zha. 2012. "Dimensionality Reduction and Topic Modeling: from Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond." In Mining Text Data, eds. Charu Aggarwal and ChengXiang Zhai, 129-161, Springer.

Dahl, Robert. 1972. Polyarchy. Yale University Press.

Drucker, Harris. 1997. “Improving Regressors Using Boosting Techniques.” ICML, 97: 107-115.

Freedom House. 2013. Freedom In the World. Freedom House.

Freund, Yoav, and Robert Schapire. 1997. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences, 55 (1): 119-139.

Friedman, Milton. 1962. Capitalism and Freedom. University of Chicago Press.

Fuller, Wayne. 1987. Measurement Error Models. John Wiley & Sons.

Gastil, Raymond D. 1988. Freedom in the World: Political Rights and Civil Liberties 1987-1988. Washington, DC: Freedom House.

Gleditsch, Kristian, and Michael Ward. 1997. “Double Take: A Reexamination of Democracy and Autocracy in Modern Polities.” Journal of Conflict Resolution, 41 (3): 361-383.

Grimmer, Justin, and Brandon Stewart. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis, 21 (3): 267-297.

Grömping, Ulrike. 2009. "Variable Importance Assessment in Regression: Linear Regression versus Random Forest." The American Statistician, 63 (4): 309-319.

Gu, Ming, James Demmel, and Inderjit Dhillon. 1994. Efficient Computation of the Singular Value Decomposition with Applications to Least Squares Problems. Technical Report CS-94-257, Department of Computer Science, University of Tennessee.

Gugiu, Mihaiela, and Miguel Centellas. 2013. "The Democracy Cluster Classification Index." Political Analysis, 21 (3): 334-349.

Halko, Nathan, Per-Gunnar Martinsson, and Joel Tropp. 2011. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” SIAM Review, 53 (2): 217-288.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2008. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Hayek, Friedrich von. 1944. The Road to Serfdom. University of Chicago Press.

Heritage Foundation. 2014. Index of Economic Freedom. Available at http://www.heritage.org/index/

Hoffman, Matthew, David Blei, and Francis Bach. 2010. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2 (3): 1-9.

Hopkins, Daniel, and Gary King. 2007. Extracting Systematic Social Science Meaning from Text. Unpublished manuscript.

Keefer, Philip. 2002. DPI2000 Database of Political Institutions: Changes and Variable Definitions. Development Research Group, The World Bank.

King, Gary, and Will Lowe. 2003. “An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design.” International Organization, 57 (3): 617-642.

Klein, Daniel B., and Charlotta Stern. 2005. “Professors and Their Politics: The Policy Views of Social Scientists.” Critical Review, 17 (3-4): 257-303.

Klemmensen, Robert, Sara Binzer Hobolt, and Martin Ejnar Hansen. 2007. "Estimating Policy Positions Using Political Texts: An Evaluation of the Wordscores Approach." Electoral Studies, 26 (4): 746-755.

Kolenikov, Stas. 2009. "Confirmatory Factor Analysis Using 'confa'." Stata Journal, 9 (3): 329-373.

Landauer, Thomas, Peter Foltz, and Darrell Laham. 1998. “An Introduction to Latent Semantic Analysis.” Discourse Processes, 25 (2-3): 259-284.

Laver, Michael, Kenneth Benoit, and John Garry. 2003. "Extracting Policy Positions from Political Texts Using Words as Data." American Political Science Review, 97 (2): 311-331.

Lawson, Robert, and J.R. Clark. 2010. Examining the Hayek-Friedman hypothesis on economic and political freedom. Journal of Economic Behavior and Organization, 74: 230-239.

Leetaru, Kalev, and Philip A. Schrodt. 2013. “GDELT: Global Data on Events, Location, and Tone, 1979-2012.” Paper presented at the ISA Annual Convention.

Lowe, Will. 2008. “Understanding Wordscores.” Political Analysis, 16 (4): 356-371.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. 1st ed. Cambridge University Press.

Maranto, Robert, Fredrick Hess, and Richard Redding. 2009. The Politically Correct University: Problems, Scope, and Reforms. AEI Press.

Maranto, Robert, and Matthew Woessner. 2012. “Diversifying the Academy: How Conservative Academics Can Thrive in Liberal Academia.” PS: Political Science & Politics, 45 (3): 469-474.

Marshall, Monty, Ted Gurr, and Keith Jaggers. 2013. Polity IV Project: Political Regime Characteristics and Transitions, 1800-2012, Dataset Users' Manual. Vienna, VA: Center for Systemic Peace.

Martin, Dian, and Michael Berry. 2011. "Mathematical Foundations Behind Latent Semantic Analysis." In Handbook of Latent Semantic Analysis, eds. Thomas Landauer, Danielle McNamara, Simon Dennis, and Walter Kintsch, 35-55, Routledge.

Martin, Lanny W., and Georg Vanberg. 2008. "A Robust Transformation Procedure for Interpreting Political Text." Political Analysis, 16 (1): 93-100.

McLachlan, Geoffrey, and Thriyambakam Krishnan. 2007. The EM Algorithm and Extensions. John Wiley & Sons.

Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict." Political Analysis, 16 (4): 372-403.

Munck, Gerardo L., and Jay Verkuilen. 2002. "Conceptualizing and Measuring Democracy: Evaluating Alternative Indices." Comparative Political Studies, 35 (1): 5-34.

Paxton, Pamela et al. 2001. "Monte Carlo Experiments: Design and Implementation." Structural Equation Modeling, 8 (2): 287-312.

Pemstein, Daniel, Stephen A. Meserve, and James Melton. 2010. “Democratic Compromise: A Latent Variable Analysis of Ten Measures of Regime Type.” Political Analysis, 18 (4): 426-449.

Polikar, Robi. 2006. “Ensemble-based systems in decision-making.” Circuits and systems magazine, 6 (3): 21-45.

Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. "How to Analyze Political Attention with Minimal Assumptions and Costs." American Journal of Political Science, 54 (1): 209-228.

Řehůřek, Radim. 2010. Fast and Faster: a Comparison of Two Streamed Matrix Decomposition Algorithms. Unpublished manuscript.

Sartori, Giovanni. 1970. "Concept Misformation in Comparative Politics." American Political Science Review, 64 (4): 1033-1053.

Schedler, Andreas. 2012. "Judgement and Measurement in Political Science." Perspectives on Politics, 10 (1): 21-36.

Schrodt, Philip. 2001. Automated Coding of International Event Data Using Sparse Parsing Techniques. Unpublished manuscript.

Slapin, Jonathan B., and Sven-Oliver Proksch. 2008. "A Scaling Model for Estimating Time-series Party Positions from Texts." American Journal of Political Science, 52 (3): 705-722.

Steiner, Nils. 2012. "Testing for a Political Bias in Freedom House Democracy Scores: Are US Friendly States Judged to Be More Democratic?" Available at SSRN 1919870.

Stewart, Gilbert. 1993. “On the early history of the singular value decomposition.” SIAM Review, 35 (4): 551-566.

Sussman, Leonard R. 1982. "The Continuing Struggle for Freedom of Information." In Freedom in the World, edited by Raymond D. Gastil, 101-119. Santa Barbara, CA: Greenwood.

Taylor, John. 1997. An Introduction to Error Analysis: the Study of Uncertainties in Physical Measurements. 2nd ed. University Science Books.

Treier, Shawn, and Simon Jackman. 2008. “Democracy as a Latent Variable.” American Journal of Political Science, 52 (1): 201-217.

Appendix A: SEM Estimation

In SEM estimation we must find the estimates that minimize the difference between the observed variance-covariance matrix (S, in SEM notation) and the implied variance-covariance matrix (Σ, in SEM notation). The S is simply the observed variance-covariance matrix of the indicators. The Σ is the theoretical variance-covariance matrix derived from the model, i.e.

\[
\Sigma = E[xx'] = E[(\Lambda_x \xi + \delta)(\xi' \Lambda_x' + \delta')] = \Lambda_x E(\xi\xi') \Lambda_x' + \Theta_\delta = \Lambda_x \Phi \Lambda_x' + \Theta_\delta
\]

where x is the matrix of indicators, Λx is the matrix of factor loadings, ξ is the matrix of factors, Φ is the variance-covariance matrix of factors, and Θδ is the variance-covariance matrix of random measurement errors (Bollen 1989, 236-237). The estimates that minimize the difference between S and Σ are found by minimizing the following function via maximum likelihood:

\[
F = \log|\Sigma| + \mathrm{trace}[S\Sigma^{-1}] - \log|S| - k
\]

where log|Σ| is the natural logarithm of the determinant of Σ, trace[SΣ⁻¹] is the trace of the product of S and the inverse of Σ, log|S| is the natural logarithm of the determinant of S, and k is the number of indicators (Bollen 1989, 254). The factor scores, in turn, are produced using the formula L̂ = Σ̂LL Λ̂′ Σ̂xx⁻¹ x, where L̂ is the matrix of estimated factor scores, Σ̂LL is the estimated variance-covariance matrix of factors, Λ̂ is the matrix of estimated factor loadings, Σ̂xx is the estimated variance-covariance matrix of indicators, and x is the mean-centered matrix of observed indicators. In Bollen and Paxton the factor scores are based on the average estimated SEM parameters of 1979 and 1980.78

78Bollen and Paxton also check whether the parameters are stable over time.
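For concreteness, the fitting function F can be evaluated in a few lines of numpy. The sketch below assumes S and Σ are supplied as numpy arrays; it only computes the discrepancy for given matrices, not the full SEM estimation.

import numpy as np

def ml_discrepancy(S, Sigma):
    """F = log|Sigma| + trace(S Sigma^-1) - log|S| - k."""
    k = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - k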

Appendix B: Replication

I collected the four Freedom House measures and the four CNTS measures and modeled the latent factors exactly as in Bollen and Paxton. I estimated the model for the same years Bollen and Paxton did (1972-1988) and obtained the same fit statistics they did (see Figure 3 in Section 3, above).79 The second part of the replication - i.e., regressing factor scores on country-level variables - was less successful, with several discrepancies in terms of coefficient signs and statistical significance (see Table 2 in Section 3). It is hard to know why. Prof. Bollen and Prof. Paxton no longer have the data, so I had to tabulate everything from scratch, using the same sources. Many of these were printed sources, so typing errors are a possibility. Some of the data were also available from secondary, electronic sources. I tried these as well, but still could not replicate the original estimates exactly. I also tried coding the media coverage variable in different ways. Bollen and Paxton report that they built that variable based on how often the country appeared on the New York Times, on the CBS News Index, and on the Facts on File almanac, but they do not report how exactly: was it a mere sum of each country's mentions in each of those three sources? Was it a weighted sum? Was it logged? Were those three sources combined into a single principal components value? I tried each of these possibilities, but still could not obtain the exact same results Bollen and Paxton did. Finally, I tried dropping different combinations of countries. Bollen and Paxton's 1980 dataset has 81 countries whereas my own 1980 dataset has 112 countries. Bollen and Paxton do not specify what those 81 countries are, so I tried dropping several combinations of 31 countries (112 - 81 = 31) to see if I could obtain the same results, but that did not work either. Table 2 (see Section 3, above) reports the regression estimates I obtained for the year 1980, using electronic sources and all the 112 countries, and coding media coverage as the natural logarithm of how often the country was mentioned on the New York Times, CBS News Index, or Facts on File in that year.

79As is often the case with SEM estimation, making the models converge takes some doing. I follow Kolenikov's (2009) three-step approach: first I estimate the "traits-only" model, save the estimates, and produce residuals of the eight indicators; second I use those residuals to estimate the "methods-only" model and save the estimates; third I combine both sets of estimates and use them as starting values for the full model (i.e., the model as depicted in Figure 1, with both traits and methods factors). In other words, I use estimates of restricted models as informative starting values when estimating the full model. However, even with informative starting values the models often do not converge (after hundreds of iterations). Thus some of the fit statistics in Figure 3 are based on models that did not converge. But all fit statistics are virtually identical to the fit statistics reported by Bollen and Paxton themselves.

Appendix C: HMT

With HMT we do not decompose A directly. First we use random sampling to find a matrix that captures most of the action in A. Then we use that matrix to build a compressed version of A and use QR factorization to decompose the compressed version into U, Σ, and V∗. The reason for this indirect approach is the sheer size of A, which makes its direct decomposition unwieldy. The details of HMT are beyond the scope of this paper, but for reproducibility purposes I will briefly describe each step.

We start by generating a Gaussian random matrix Ω with dimensions n × (k + p), where p is an integer parameter that affects the probability of finding a solution.80 Here I set p = 100, based on the results in Řehůřek (2010).81 Next we form Y = (AA∗)^q AΩ, where q is a power that affects accuracy.82 Here I set q = 2, again based on Řehůřek (2010). We now perform the QR factorization Y = QR to find Q, which captures most of the action in A.83 HMT do not propose a particular QR algorithm. The one I use is in Gu, Demmel, and Dhillon (1994); it is a "divide-and-conquer" approach suitable for large matrices (it splits the matrix recursively into smaller and smaller chunks; see article for details). We then form the matrix B = Q∗A, which is an upper bidiagonal matrix and a compressed version of A. Finally, we use QR factorization (again) to decompose B into ÛΣV∗ and we form the matrix U = QÛ. That is how we obtain U, Σ, and V∗.

80 The parameter p is the number of extra random samples (beyond k) we want to draw (thus p is called the oversampling parameter). It can be zero (in which case Ω will have dimensions n × k), but a positive p helps ensure that we capture the required amount of information from A to find a solution. See HMT for details. The higher the p the better, but there is a computational cost.
81 Řehůřek (2010) investigates the informational gain vs. computational cost of tweaking each HMT parameter.
82 See HMT for details. The higher the q the higher the accuracy but, as with p, there is a computational cost.
83 More formally, Q has orthonormal columns and A ≈ QQ∗A.
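To make these steps concrete, here is a minimal NumPy sketch of the procedure. The function and variable names are my own illustration: the small factorization of B uses a standard SVD routine rather than the Gu, Demmel, and Dhillon algorithm, A is assumed to be real (so A∗ = Aᵀ), and the re-orthonormalization that HMT recommend between power iterations is omitted for brevity.

import numpy as np

def randomized_svd(A, k, p=100, q=2, seed=0):
    # Approximate the top-k singular triplets of A following the HMT steps.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + p))   # Gaussian random matrix, n x (k + p)
    Y = A @ Omega
    for _ in range(q):                        # power iterations: Y = (A A')^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                    # Q captures most of the action in A
    B = Q.T @ A                               # compressed version of A
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat                             # map back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

# Example call on a small placeholder matrix:
U, s, Vt = randomized_svd(np.random.default_rng(1).random((500, 300)), k=100)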

Appendix D: HBB

HBB is a variational Bayes (VB) algorithm. It factorizes the intractable function (over different sets of parameters) and then uses variational calculus to approximate the posterior distribution of the unobserved parameters. I choose HBB because, unlike other VB algorithms, it processes the data in chunks, which makes it suitable for handling large matrices. HBB loads one chunk of data, updates the model, loads another chunk, updates the model, and so on, until all chunks have been processed. For that reason HBB is called an online VB algorithm (online here has nothing to do with the internet or connectivity, but with the streamed way in which the data are processed). That is in contrast to batch VB algorithms, which load the whole dataset, update the model, load the whole dataset again, update the model again, and so on. Batch VB algorithms require much more computer memory (since the dataset is fully loaded) and take much longer to run (updating x times in one pass over the data is faster than updating once after each of x passes).84

The details of VB and of HBB in particular are beyond the scope of this paper, but for reproducibility purposes I note that: a) I use chunks of 20 documents each; b) I alternately initialize α to two different priors: a symmetric 1/k prior85 and a fixed normalized asymmetric prior86 (α controls the sparsity of the words×topics and topics×documents distributions); c) I cap the number of iterations at 1,000 (i.e., if convergence is not achieved in 1,000 iterations we move on to the next chunk anyway); and d) I set the convergence threshold at 0.001. HBB is much more intricate than HMT (the algorithm I use for LSA); describing each step would take pages, so instead I refer the reader to Hoffman, Blei, and Bach (2010).

84 Speed is important here because we are talking about days, not minutes or seconds.
85 E.g., if k = 3 then α = [1/3, 1/3, 1/3].
86 E.g., if k = 3 then α = [1/(0 + √3), 1/(1 + √3), 1/(2 + √3)].
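For concreteness, here is a minimal sketch of how settings (a)-(d) might be passed to gensim, Řehůřek's library, whose LdaModel implements Hoffman, Blei, and Bach's online VB. Assuming gensim here is my own illustration (this appendix does not name the implementation), and the toy documents are placeholders.

from gensim import corpora, models

# Placeholder tokenized documents standing in for the constitution texts.
documents = [["freedom", "press", "election"], ["party", "single", "rule"]]
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

lda = models.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=100,          # k; the paper tries 50, 100, 150, 200, and 300 topics
    chunksize=20,            # (a) chunks of 20 documents each
    alpha="symmetric",       # (b) symmetric 1/k prior; "asymmetric" for the other run
    iterations=1000,         # (c) cap on iterations per chunk
    gamma_threshold=0.001,   # (d) convergence threshold
)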

Appendix E: Multiple Decision Trees

Random forests

Random forests were first introduced in Breiman (2001). As the name suggests, random forests are an extension of decision trees. The idea is simple. We treat the reference set as a population, draw multiple bootstrap samples from it (with each sample having the same size as the population), and use each sample to grow a decision tree. To predict y for new observations we simply average out the predicted ys from the different trees.

The idea of random forests is to reduce noise. With a conventional decision tree small perturbations of the data can drastically impact the choice of j and s. By averaging out the predictions of multiple bootstrapped trees we reduce that noise. Each bootstrapped tree yields poor predictions - only slightly better than random guesses - but their average predictions often outperform those of a conventional tree.

There is no rigorous way to choose the number of bootstrapped trees, N. Here I set N = 10,000. (I tried N = 1,000 but the results were less stable: with the same data and parameters, two different sets of random forests with N = 1,000 each would produce somewhat different results. With N = 10,000 the results remain the same.)

The more the bootstrapped trees differ from each other, the more they reduce noise in the end. Thus it is common to grow the trees in a slightly different

way: instead of picking j from all k independent variables, we pick j from a random subset of the k variables, with size c ≤ k (common choices are c = k/3, c = √k, and c = log k). The smaller the c, the more different the trees will be, and the more they will reduce noise in the end. But if the subset is too small (say, 1) each tree will perform so poorly that their combined performance will also be poor. Choosing the subset size is a matter of trial and error. I try c = k/3, c = √k, and also simply c = k, and compare the results. I try l = 2 and l = 5 for each individual tree, just as before.
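For concreteness, here is a minimal sketch of this setup using scikit-learn's RandomForestRegressor. The library choice, the placeholder data, and the variable names are my own illustration, not the dissertation's actual pipeline.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for the real inputs: rows are country-years,
# columns are the k topic proportions, y holds the reference democracy scores.
rng = np.random.default_rng(0)
X_reference, y_reference = rng.random((200, 100)), rng.random(200)
X_new = rng.random((10, 100))

forest = RandomForestRegressor(
    n_estimators=10000,   # N = 10,000 bootstrapped trees
    max_features="sqrt",  # c = sqrt(k); use 1/3 for c = k/3 or None for c = k
    min_samples_leaf=2,   # l = 2 (l = 5 is also tried)
    n_jobs=-1,
)
forest.fit(X_reference, y_reference)
y_new = forest.predict(X_new)  # average of the individual trees' predictions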

Extreme random forests

This is essentially the same as random forests, except that here we also randomize the choice of s. Instead of finding the best s for each j we draw a random s for each j and then pick the best (j, s) combination. This usually reduces noise a bit more (at the cost of degrading the performance of each individual tree). As with random forests, N = 10,000; c ∈ (k/3, √k, k); and l ∈ (2, 5).
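A corresponding sketch, again assuming scikit-learn, whose ExtraTreesRegressor randomizes the split points in the way just described; the placeholder inputs from the random forest sketch above are reused.

from sklearn.ensemble import ExtraTreesRegressor

# bootstrap=True mirrors the bootstrapped samples described in the text
# (the library's default is to grow each tree on the full sample).
extra = ExtraTreesRegressor(
    n_estimators=10000, max_features="sqrt", min_samples_leaf=2,
    bootstrap=True, n_jobs=-1,
).fit(X_reference, y_reference)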

AdaBoost

AdaBoost (short for adaptive boosting) was first proposed by Freund and Schapire (1997). There are several variations thereof and the one I use here is Drucker's (1997), popularly known as AdaBoost.R2, which I choose for being suitable for continuous outcomes (most AdaBoost variations are designed for categorical outcomes).

Say that we have n observations. We start by choosing the number of models - in our case, trees - we want to create.87 We grow the first tree, t = 1, using all observations88 and weighting each observation by w_i^t = 1/n, and compute each absolute error |y_i − ŷ_i|. We then find the largest error, D_t = max_{j=1,...,n} |y_j − ŷ_j|, and use it to compute the adjusted error of every observation, e_i^t = |y_i − ŷ_i|/D_t. Next we calculate the adjusted error of the entire tree, ε_t = Σ_{i=1}^{n} e_i^t w_i^t. If ε_t ≥ 0.5 we discard the tree and stop the process (if this happens with the very first tree that means AdaBoost failed). Otherwise we compute β_t = ε_t/(1 − ε_t), update the observation weights w_i^(t+1) = w_i^t β_t^(1−e_i^t)/Z_t (Z_t is a normalizing constant), and grow the next tree, t = 2, using the updated weights. We repeat the process until all desired trees are grown or until ε_t ≥ 0.5. To predict y for new observations we collect the individual predictions and take their weighted median, using ln(1/β_t) as weights.

Intuitively, after each tree we increase the weights of the observations with the largest errors, then use the updated weights to grow the next tree. The goal is to force the learning process to concentrate on the hardest cases. Just as in random forests, here too we end up with multiple trees and we make predictions by aggregating the predictions of all trees. Unlike in random forests, however, here the trees are not independent and we aggregate their predictions by taking a weighted median rather than a simple mean.

As with random forests, N = 10,000 (less when ε_t ≥ 0.5 for one or more of the trees); c ∈ (k/3, √k, k); and l ∈ (2, 5).

87 Here I apply AdaBoost to trees but AdaBoost can be applied to any type of model - including parametric ones like OLS, MLE, etc.
88 This is in contrast with random forests, where we use bootstrapped samples.
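A corresponding sketch, again assuming scikit-learn, whose AdaBoostRegressor with loss="linear" follows Drucker's AdaBoost.R2; the placeholder inputs from the random forest sketch above are reused.

from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# loss="linear" uses the adjusted error e_i = |y_i - yhat_i| / D_t described
# above; boosting stops early if a tree's adjusted error reaches 0.5.
base_tree = DecisionTreeRegressor(max_features="sqrt", min_samples_leaf=2)
boosted = AdaBoostRegressor(base_tree, n_estimators=10000, loss="linear")
boosted.fit(X_reference, y_reference)
y_new = boosted.predict(X_new)  # weighted median of the individual predictions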
