Hidden in Plain Sight? Detecting Electoral Irregularities Using Statutory Results∗

Zach Warner† J. Andrew Harris‡ Michelle Brown§ Christian Arnold¶ July 7, 2020

Abstract Confidence in election results is a central pillar of democracy, but in many developing countries, elections are plagued by a number of irregularities. Such problems can include incredible vote margins, sky-high turnout, and statutory forms that indicate manual edits. Recent scholarship has sought to identify these problems and use them to quantify the magnitude of election fraud in countries around the world. In this paper, we argue that this literature suffers from its reliance on an ideal election as a baseline case, and delineate an alternative data-generating process for the irregularities often seen in developing democracies: benign human error. Using computer vision and deep learning tools, we identify statutory irregularities for each of 30,000 polling stations in Kenya’s 2013 presidential election. We show that these irregularities are uncorrelated with election outcomes and do not reflect systematic fraud. Our findings suggest that scholars of electoral integrity should take care to ensure that their methods are sensitive to context and account for the substantial challenges of administering elections in developing democracies.

∗We thank the National Democratic Institute for generously sharing their data. We also thank our research assistants, Henri Böschen, Karunakar Reddy Mannem, and Jeff Morgan, for their contributions. Authors are listed in reverse-alphabetical order. †Postdoctoral Research Fellow, Cardiff University. email: [email protected], web: http://www.zachwarner.net. ‡Assistant Professor, New York University, Abu Dhabi. email: [email protected], web: http://www.jandrewharris.org/. §Senior Advisor, Elections and Political Processes, National Democratic Institute. ¶Lecturer, Cardiff University. email: [email protected], web: http://christianarnold.org/.

The median turnout among all polling stations in Kenya’s 2013 presidential election was 88%. The average margin between the top two candidates was 83%. When such figures coincided with credible reports of electoral maladministration, many observers inferred that these results were evidence of fraud instead of what President Uhuru Kenyatta referred to as just “one or two clerical errors” (Gettleman 2013). Sky-high turnout, extreme vote margins, and allegations of malfeasance are common in developing democracies like Kenya, leading many scholars to develop methods for detecting and quantifying election fraud (Alvarez, Hall, and Hyde 2008; Asunka et al. 2019; Beber and Scacco 2012; Cantú 2019; Enikolopov et al. 2013; Klimek et al. 2012; Myagkov, Ordeshook, and Shakin 2009; Rozenas 2017). Typically, these scholars’ goal is to distinguish reliable election returns from fraudulent ones by analyzing anomalous results. But as Hanlon’s Razor dictates,1 there is a third possibility: results which are unreliable not because of fraud, but because of simple human error.

In this paper, we consider electoral irregularities, a broader—and more common—problem than electoral fraud. We define an irregularity as a deviation from a procedural rule or norm during an election. While electoral fraud is defined by an inferred intent to affect an election outcome, irregularities are the unavoidable result of humans working under pressure to produce a precise vote count.2 They may arise for many reasons, including a poor work environment, insufficient training, or inadequate or missing election materials. They may or may not influence election results. And they are increasingly important. While motivated political agents frequently present irregularities as evidence suggestive of fraud, recent legal judgments have suggested that the incidence of irregularities alone may be sufficient grounds on which to nullify an election (Weinberg 2008). If common electoral irregularities can make or break democratic elections, characterizing these irregularities and examining their relationship to election outcomes takes on particular importance.

1. Often cited as “do not attribute to malice what can be explained by incompetence,” the idea was originally formulated by von Goethe (1774 [1957], 14) as “misunderstandings and lethargy perhaps produce more wrong in the world than deceit and malice do.” 2. Strictly speaking, fraud is a type of electoral irregularity that arises from an attempt to manipulate election results. However, for clarity, we use “irregularities” to refer to those irregularities that do not arise from fraud throughout the paper.

We study these irregularities in the context of Kenya’s 2013 presidential election. Six years following an election marred by violence, the 2013 election saw 86% turnout among the electorate of over 14 million voters. Uhuru Kenyatta won 50.15% of the vote, just enough to clear the 50% threshold required to avoid a runoff. While the election was consistent with a parallel vote tabulation conducted by the Elections Observation Group (2013), opposition leaders alleged fraud. As a close election in a developing democracy with mixed success in delivering reliable elections, Kenya’s 2013 contest is an ideal case to probe the distribution of election irregularities more broadly.

We begin by gathering official election results as collated on paper at each polling station. Known as “Form 34A” in Kenya, these documents provide information such as vote counts by candidate, rejected votes, and the names and signatures of agents observing the count on behalf of the candidates. The presiding officer at each polling station completes each form manually before scanning it and digitally transmitting it to a national tallying center. These statutory forms provide the most granular data available on electoral process and outcomes across Kenya’s approximately 30,000 polling stations.

These forms comprise a corpus that is too large for human coding, so we turn to computer vision and machine learning tools to identify irregularities on these forms. Specifically, we draw a stratified sample of 3,000 forms (nationally representative at the county level), which research assistants independently coded for irregularities. Building on Cantú (2019), we train deep neural networks to recognize irregularities among this training sample, which we then use to identify irregularities among the remaining forms.

We examine the relationship between these irregularities and polling station-level outcomes such as turnout and vote share. We show that irregularities among these statutory forms do not relate to election outcomes in a way that might reflect fraud. Irregularities are not concentrated in areas where one party or ethnic group is dominant. Nor is there any meaningful relationship between problems on statutory forms and turnout, the vote margin, or ballots which are rejected, disputed, or spoiled. We also examine the geographic and socioeconomic correlates of irregularities,

finding that they occur in more urban, ethnically fractionalized constituencies, in contrast with expectations about fraud being conducted in ethnic enclaves or party strongholds (Gutiérrez-Romero 2014). Last, we demonstrate that the typical manual edit of a statutory form does not significantly impact final vote tallies. Taken together, our evidence paints a clear picture of an election rife with irregularities—but ones which appear to be unimportant for the eventual outcome. Though our design does not allow us to rule out fraud entirely, our findings indicate that election irregularities do not in themselves amount to fraud.

This paper contributes to the study of electoral integrity in four main ways. First, we demonstrate the widespread occurrence of irregularities in statutory forms as a phenomenon unto itself, and not as merely an artifact of systematic fraud. We conceptualize a series of irregularities as indicators merely of divergence from established norms and procedures, divorced from an underlying assumption of intentionality. While the gap between widespread irregularities and systematic fraud might seem slight, conceptual slippage between the two can have enormous consequences. For example, after Evo Morales narrowly won re-election in Bolivia’s October 2019 election, the Organization of American States (2019, 9) issued a report that concluded that “a series of willful actions were taken to alter the results expressed at the poll,” leading to Morales’s resignation and exile. Recent research has determined that a substantial portion of the evidence for this determination resulted from a flawed statistical analysis of anomalous patterns in the election returns (Idrobo, Kronick, and Rodríguez 2020; Williams and Curiel 2020). In this case, mistaking irregularities for malfeasance may have removed a duly-elected head of state. Rather than protecting Bolivians’ democratic rights, the search for evidence of electoral fraud may have undermined them.

Second, we broaden scholarly attention from a few key irregularities—abnormal patterns in vote tallies, vote shares, turnout, and edited results—to a number of other pieces of information that can be gleaned from statutory forms. We show how the presence or absence of features such as candidate agents’ signatures or an official stamp can inform an analysis of an election’s reliability.

Our third contribution is methodological: we build on Cantú (2019) by using deep learning to detect irregularities in statutory forms, introducing at least three notable improvements over the current state-of-the-art. To avoid overfitting, our models use random data augmentation in lieu of a pre-balanced sample, which is costly to obtain. Further, we implement transfer learning, a process in which we start with a publicly-available model which has been pre-trained on a comprehensive corpus of images before re-training it for our purposes (Caruana 1994; Bengio 2012; Razavian et al. 2014). Last, we tune our models’ hyperparameters. All three innovations enhance out-of-sample predictive power. As a result, we are able to significantly improve our ability to detect irregularities, cutting the error rate reported in Cantú (2019) by more than one-third.

Finally, we show that, despite their prevalence, edited forms appear to have far less impact on election outcomes than previously assumed. This finding contrasts with assumptions discussed in the Organization of American States (2019) report on Bolivia, Kenyan opposition leaders’ legal argument following their loss in the 2013 election (Supreme Court of Kenya 2013), and the evidence in Cantú (2019). While it is clear that edited results can reflect electoral malfeasance, our results indicate that they can also just be indicative of overtaxed and undersupported electoral administrators. Scholars should therefore take care when citing

edited returns—even when they are widespread—as prima facie evidence of systemic fraud.

Separating fraud from irregularities

Electoral integrity is a growing concern globally (Norris 2014). Unsurprisingly, the majority of the literature on electoral integrity in comparative politics focuses on “bad behavior”—election fraud, voter intimidation, and the like.3 Improving electoral integrity depends on our ability to measure this bad behavior, so scholars have developed a range of methods for quantifying electoral fraud. Most prominent are tests of the distributions of vote shares, vote counts, and turnout rates to infer whether anomalous patterns amount to credible evidence of fraud (Alvarez,

3. See Birch (2011) for a comprehensive introduction.

Hall, and Hyde 2008; Beber and Scacco 2012; Cantú 2014; Cantú and Saiegh 2011; Ferrari, McAlister, and Mebane 2018; Klimek et al. 2012; Mebane 2006; Mebane and Sekhon 2004; Medzihorsky 2015; Montgomery et al. 2015; Myagkov, Ordeshook, and Shakin 2009; Rozenas 2017). More recently, Cantú (2019) examines statutory forms directly, shifting focus from identifying fraud across a distribution of results to identifying it in the form of edited results at the polling station level.

However, there is a clear conceptual distinction between intentional, systematic fraud and unintentional, non-systematic fumbles.4 The expertise, training, logistics, and coordination required to successfully implement elections—particularly national elections—are anything but trivial. As a result, weaknesses at any point in the chain may lead to irregularities that are neither malicious nor substantively important with respect to the outcome. When fraud does occur, it does so against a backdrop of these inevitable irregularities. Of course, differentiating between fraud and relatively innocuous irregularities is difficult. This challenge is reflected in the lack of research seeking to describe and explain the prevalence and distribution of irregularities within which fraud might be nested.5

This blurring between error and fraud is pernicious for several reasons. First, without greater knowledge of when, where, and in what forms irregularities occur, our understanding of fraud is inherently limited. If we do not know what irregularities look like, then everything irregular will look like fraud, and we will conflate the two. Second, because irregularities are a function of the quality of election administration, and election administration is typically conducted by subnational governments, irregularities are likely to be geographically concentrated. For instance, states, counties, and other local governments are usually responsible for supplying election day presiding officers.6 Variation in the quality of officials recruited, as well as the training provided to them, may shape when and where irregularities occur. This geographic concentration makes irregularities even easier to misjudge as evidence of fraud.

4. Hyde (2008) provides an excellent discussion of the relationship between irregularities, fraud, and intent. 5. One exception is Ansolabehere et al. (2018), which examines disparities between election night tallies and later recounts in two Wisconsin elections. 6. In Kenya, parliamentary constituencies are responsible for providing these officers.

Greater attention to the basic contours of irregularities can help overcome these

problems, improving our understanding of irregularities per se as well as of fraud.

Drawing a clear conceptual distinction between irregularities and fraud comes with another important benefit: it creates space for realistic electoral processes to play out. In particular, decoupling irregularities from fraud blunts the ability of politically motivated parties to cast doubt on legitimate election results by focusing on irregularities which may or may not indicate manipulation. Such claims have been increasingly powerful in shaping countries’ democratic trajectories, particularly in contexts with low trust and easy dissemination of evidence of irregularities via social media. Like in the Bolivian case discussed above, Kenya’s 2017 election was annulled because it suffered from widespread irregularities, with only tenuous evidence to link these problems to systematic fraud (Cheeseman et al. 2019). More recently, Malawi’s 2019 presidential election was annulled on the grounds that some statutory results had been edited using Tipp-Ex (a brand of white-out), lacked the signature of a presiding officer, or had small mathematical errors (Jegwa 2019). Given how common these problems are even when fraud is absent, treating them as fraud leaves little scope for building trust in democratic processes in most of the developing world.

Finally, because allegations of fraud correlate with increased post-election violence (Daxecker 2012), scholars have noted the need to tune their methods for studying fraud to be sensitive to false positives (Rozenas 2017). Yet incorrectly associating a series of irregularities with systematic fraud can similarly contribute to voters’ insecurity. Shifting scholarly attention to the many processes through which benign irregularities can arise may help temper this risk of generating false positives that exacerbate electoral violence.

What do irregularities look like?

Enumerating every potential irregularity is far too broad a task for any one paper. Given the complexity of electoral administration, there are a great many mechanisms by which

irregularities might arise. Further, because they are (by definition) unintentional, irregularities are at least to some extent a function of randomness. And because electoral processes vary so much across countries, irregularities which may be common in one context may be impossible in another and vice versa.

We confine ourselves to a much narrower task here: we examine the irregularities which can be observed on the basic statutory forms that constitute legal election results. Although our data are still limited to Kenya’s 2013 presidential election, we identify problems which can be observed in many countries that follow Kenya’s basic electoral process of producing election results.7

In the Kenyan model, an official form is used to report the final results at a polling station. During counting, certain individuals—usually some mix of observers, candidates’ agents, security personnel, and the media—witness the process. A presiding officer records the results using a nationally-prescribed methodology, then stamps the document, signs it, and presents it to the agents present for verification and signature. The form is then transmitted to a national tallying center where results are aggregated and announced. Figure 1 provides a complete example of one such form in our data.

We consider four categories of irregularities that can be observed on these statutory forms.8 First are document problems, which relate to the representation of the physical paper record. Document problems can occur when the form is not properly scanned and transmitted for aggregation or when the document lacks security features such as a unique Quick Response (QR) code. These irregularities may arise when presiding officers solve election-day difficulties. For instance, if an official form has not been delivered in time, they may use loose-leaf paper without a QR code to report results. These irregularities may also reflect fraud, such as an attempt to replace an official form with a counterfeit.

Second are procedural problems, which indicate that basic tasks prescribed by the electoral commission may not have been completed. We focus on two such tasks due to their commonness among developing democracies: whether the presiding officer signed the form, and whether the form was stamped with an official electoral commission stamp.

7. For instance, Bolivia, Malawi, and the Mexican election studied in Cantú (2019) all follow this broad outline. 8. See the online appendix for more details and examples of these elements.

Figure 1: The first page of an example statutory result.

Figure 1 (continued): The second page of an example statutory result.

Completing these tasks indicates some degree of competence on the presiding officer’s part, and, indirectly, the quality of the constituency electoral officers’ training on how to fill out the form. Problems with either procedure may suggest inattentiveness on the part of the presiding officer, bad training, or even fraud, as the actual form may have been replaced with a substitute that omits this important procedural information.

Third are agent-related problems. Ideally, individuals representing political candidates attend each polling station, observe the vote count, and sign off on the results. These agents’ presence or absence can have an enormous impact on vote counting. By watching the count, an agent can ensure that ballots for her candidate are correctly counted, and that problematic ballots for other candidates are excluded from the final tally. On the other hand, when only one agent is present, that agent may be able to induce the presiding officer to adjust totals to favor a given candidate; when no agents are present, a presiding officer may be more prone to adjust vote totals to benefit his preferred candidate. We examine five different variables related to agents: whether agents are present at the polling station, as indicated by any names listed; whether all agents listed signed the form; whether any agents refused to sign the form, as indicated in the space reserved for explaining refusals; whether any agent at all signed, capturing the minimal situation of at least one agent present; and whether the agent signatures appear different from one another, guarding against a presiding officer falsifying agents’ presence.

The last type of irregularity we study refers to the integrity of the vote tally, focusing on whether these tallies have been manually edited. Such irregularities can indicate simple arithmetic mistakes that arise during the counting and recording process. On the other hand, they can also represent direct evidence of aggregation fraud (Callen and Long 2015; Cantú 2019), particularly when the edits make significant changes to candidates’ vote shares or are concentrated in particular types of polling stations.

Our conceptual goal in defining these types of irregularities is to more clearly distinguish the presence of an irregularity from the presence of fraud. This permits us to characterize the distribution of form-based irregularities independent from numeric election results. Separating irregularities in this way allows us to understand whether irregularities align with results in a way that suggests an intent to defraud.

Kenya’s presidential election of 2013

The 2013 Kenyan presidential election provides a compelling case in which to study these irregularities. Among the eight candidates, opposition leader Raila Odinga and heir-apparent Uhuru Kenyatta went into election day in a virtual tie in the polls. To avoid a second-round runoff, one candidate would need an outright majority of votes nationally, as well as 25% of the vote in at least half of Kenya’s 47 counties. After a relatively calm election day, the Independent Electoral and Boundaries Commission (IEBC) announced that Kenyatta had satisfied the majority requirement with 50.51% of the vote—less than ten thousand votes over the threshold, out of over twelve million cast. The announcement of such a narrow victory led to an immediate legal battle to annul the result, with opposition leaders arguing that significant irregularities had vitiated the entire election. Kenya’s Supreme Court rejected this argument, and a month later Kenyatta assumed office.

The arguments and evidence presented in the case focused on a number of issues, including voter registration, election materials procurement, transmission of results, and vote tally aggregation. But central to both sides’ arguments were the extent and meaning of widespread irregularities among the polling station forms. For instance, the affidavit of Janet Ong’era, executive director of the opposition party, provides a list of forms that “fail the test of integrity of electoral documents,” citing reasons such as missing agent signatures or the lack of an official IEBC stamp (Republic of Kenya 2013, para. 38). Kenyatta’s team replied with the argument that while examples of such problems could be found, “there were no

constitutional or statutory violations” (Supreme Court of Kenya 2013, para. 161). For both sides, the fundamental focus was the legal import of irregularities with Form 34A—the focus of our study.

While the issues and arguments in this case may seem focused on byzantine regulations regarding obscure electoral procedures in Kenya, they speak to a question of much broader interest: whether and when an irregularity on the most basic record of election results indicates fraud. Such questions follow elections in many developing democracies. As cited above, elections in Kenya in 2017, Bolivia in 2019, and Malawi in 2019 all saw similar disputes over polling station returns, as did Zimbabwe’s 2018 election. As the legal strategies of political aspirants place increasing scrutiny on the basic materials of elections, more such disputes will likely arise.

This case is also ideal because it allows us to compare our results against estimates of fraud using existing distributional methods. For instance, using ward-level data,9 Ferrari, McAlister, and Mebane (2018) estimate that 29% of returns in this election were the result of “incremental fraud” wherein vote tallies are adjusted (as opposed to being wholly fabricated). Similarly, Mebane (2016) estimates that 35% of wards experienced incremental fraud, translating to approximately 750,000 tainted votes (or around 6% of the electorate). To develop a more complete picture of scholarly expectations around fraud in this election, we generate resampled kernel density, Bayesian finite mixture model, Benford’s second-digit, and last-digit estimates of fraud using our polling station-level data (Beber and Scacco 2012; Ferrari, McAlister, and Mebane 2018; Mebane 2006; Rozenas 2017). Table 1 presents these results. Taken as a whole, these distribution-based approaches conclude that Kenya’s 2013 election results were significantly fraudulent.

9. There are 290 single-member district parliamentary constituencies in Kenya, grouped into 47 counties, which are devolved political units, each with a governor. Within each constituency are a number of county assembly wards, each electing a county-level representative. See the online appendix for a visualization of Kenya’s internal political boundaries.

Table 1: Estimates of fraud in Kenya’s 2013 presidential election

Method                            Quantity      Statistic            Result
Resampled Kernel Density (RKD)    Vote shares   F                    0.27∗
Bayesian finite mixture model     Vote counts   F (“incremental”)    0.05∗
                                  Vote counts   F (“extreme”)        0.14∗
Benford second-digit test         Vote counts   χ2                   5,069.70∗
                                  Turnout       χ2                   185.36∗
Last-digit test                   Vote counts   χ2                   414.66∗
                                  Turnout       χ2                   6.44

∗p < .05. The “F” statistic refers to the proportion of polling station returns identified as fraudulent. All methods are implemented with respect to votes for Uhuru Kenyatta, the candidate for the incumbent party. Resampled kernel density estimation is implemented as described in Rozenas (2017), and produces a 95% credible interval of [0.13, 0.45]. The Bayesian finite mixture model is implemented in R using the eforensics package, as described in Ferrari, McAlister, and Mebane (2018) and Mebane (2016). This approach produces credible intervals of [0.00, 0.10] and [0.08, 0.22] for “incremental” and “extreme” fraud, respectively. For details on the second-digit and last-digit tests, see Mebane (2006) and Beber and Scacco (2012).
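The two digit tests in Table 1 can be computed directly from reported vote counts. The sketch below is a simplified illustration of their logic, following Beber and Scacco (2012) and Mebane (2006); it is not our replication code, and the votes argument stands in for a vector of polling station-level vote counts.

```python
import numpy as np
from scipy import stats

def last_digit_test(votes):
    """Chi-square test of last-digit uniformity (Beber and Scacco 2012).

    Under clean counting, the final digits of vote counts should be roughly
    uniform on 0-9; a large chi-square statistic flags deviations.
    """
    votes = np.asarray(votes)
    observed = np.bincount(votes % 10, minlength=10)
    expected = np.full(10, votes.size / 10)
    return stats.chisquare(observed, expected)          # 9 degrees of freedom

def second_digit_benford_test(votes):
    """Chi-square test against the Benford second-digit distribution (Mebane 2006)."""
    votes = np.asarray(votes)
    votes = votes[votes >= 10]                           # need at least two digits
    second = np.array([int(str(v)[1]) for v in votes])
    observed = np.bincount(second, minlength=10)
    # Benford second-digit probabilities: P(d) = sum over k=1..9 of log10(1 + 1/(10k + d))
    probs = np.array([sum(np.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
                      for d in range(10)])
    return stats.chisquare(observed, probs * second.size)

# Hypothetical usage with a vector of polling station-level vote counts:
# chi2, p = last_digit_test(votes)
```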

Identifying irregularities using deep learning

To study statutory forms directly, we must identify which forms contain irregularities. With some 30,000 forms, manual coding is prohibitively costly, so we instead classify these documents using computer vision and machine learning tools. Supervised machine learning techniques require a “training set” of data in which the irregularities are manually coded. We built this training set using a stratified random sample of 3,000 documents. Two research assistants independently coded each form in the training

set for each of the ten irregularities, marking them as 1 if an irregularity was present, 0 if no irregularity was present, and NA if the information could not be discerned—for instance, if the form was scanned incorrectly and it was impossible to identify whether agents refused to sign.10 We then hired a more experienced research assistant to conduct random spot-checks to ensure data quality, as well as to resolve all coding disputes.11

Like Cantú (2019), our modeling strategy relies on deep neural networks. These models

10. See the online appendix for the instructions provided to coders and examples of irregularities. 11. Irregularities were well-defined and identifiable, with coders disagreeing on just 3% of all coded values.

use information in each image of a statutory form to classify it as either containing the irregularity of interest or not. Standard machine learning models in political science typically rely on classifiers in which the input features (i.e., covariates) are known ex ante (Cohen and Warner, forthcoming; Hill and Jones 2014; Hindman 2015). These variables are usually chosen from theory or contextual knowledge—for example, a series of geopolitical features are used to predict international conflict in Beck, King, and Zeng (2000). In contrast, deep neural networks attempt to classify observations using input features which are not known, but rather are learned by the model itself. This property makes deep neural networks well-suited to tasks where the model must be insensitive to irrelevant variation (LeCun, Bengio, and Hinton 2015): by stacking multiple layers of learning, such models can be finely tuned to identify which features are important for classification, and which are just noise.

In particular, we estimate convolutional neural networks (CNNs). CNNs are designed for data that are structured as multiple arrays bound together, as our data are, since each image of a statutory form is comprised of three color channels. In the first step, features such as lines and curves are extracted from the raw image by passing the data through convolutional filters, each of size 3 × 3 pixels, and then through pooling filters (of size 2 × 2 pixels) to merge similar features into one. This combination of a convolutional layer and a pooling layer forms the main building block of a CNN through which abstract features are learned, and can be repeated many times to increase these features’ complexity. In the second step, the output of the final pooling filter is flattened into a one-dimensional vector (not unlike a vector of covariates) and fed into a fully-connected neural network consisting of multiple layers of nodes. This step uses the features extracted in the first step to predict whether the image contains the irregularity of interest. As each image is fed through the model, weights are updated to improve predictive performance through the feedback mechanism known as backpropagation (Rumelhart, Hinton, and Williams 1986). Because of the complexity of the model, images are loaded in batches; one complete iteration over all images in all batches is referred to as a learning epoch. CNNs’ flexibility and performance have made them the gold

standard in supervised classification of visual data and especially document analysis (LeCun et al. 1998; Razavian et al. 2014).

We introduce three improvements to this general framework used in Cantú (2019). First, we do not pre-select our training sample to achieve balance between irregularities and non-irregularities. This approach is prohibitively costly not just due to the relative rareness of some irregularities (see Table 2 below), but also because a different training sample would have to be obtained for each of our 10 irregularities. Instead, we build data augmentation into our models. As each batch is loaded, each image is randomly permuted via rotation, cropping, flipping, and other small adjustments.12 This process synthetically increases our training sample size, as no image is exactly repeated across epochs, in turn decreasing the propensity for the model to overfit the training data—resulting in better overall prediction (Krizhevsky, Sutskever, and Hinton 2012; Wong et al. 2016).

Second, we introduce transfer learning. Increasing the complexity of a CNN by adding more convolutional layer-pooling layer combinations can yield better predictions by allowing it to learn more complex features in an image. However, this complexity makes models extremely computationally costly to train, effectively setting an upper bound on how many layers we can include in our model.13 Transfer learning overcomes this problem by allowing us to start from a baseline of a model that has been pre-trained on a popular, publicly-available corpus of images (Tan et al. 2018). We need only then feed our images through this pre-trained model, allowing it to update to fit our data. We select the InceptionV3 model architecture (Szegedy et al. 2016), trained on the ImageNet database (Deng et al. 2009), due to its performance with these statutory forms.

Our last improvement over the Cantú (2019) approach is to tune our hyperparameters. CNNs’ performance can vary widely depending on a number of factors, including the extent of image preprocessing, the number of images in a batch, the number of convolutional and

12. See the online appendix for a complete description of the data augmentation process. 13. The models we study take approximately 6-48 hours to estimate on a cluster with 50Gb of allocated memory; as discussed below, we estimate over 600 of these models.

pooling layers, and the size of the fully-connected layers in the prediction step, to name a few (Li et al. 2020). To maximize predictive accuracy and ensure we are properly identifying irregularities, we tune over five model architectures, three data augmentation settings, two image sizes, and two batch sizes.14 For each of our 10 irregularities, we tune over the 60 models in this {5 × 3 × 2 × 2} grid, resulting in 600 fitted models. Each model is fit on only 2,000 images from the training set, while a validation set of 500 images is used at the end of each epoch to minimize cross-entropy loss (i.e., maximize predictive accuracy). After each model is done fitting, we compute performance metrics on a true test sample composed of the remaining 500 images which were entirely held out from model training. The hyperparameter values that maximize predictive performance on this test set for a given irregularity are then saved and used to identify that irregularity among

the remaining 27,000 forms. All models are estimated using the keras module in Python via slurm arrays on Cardiff University’s high-performance computing cluster (Chollet 2015).
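To fix ideas, the sketch below shows how random data augmentation and InceptionV3-based transfer learning fit together in Keras. It is a simplified illustration rather than our estimation code: the directory layout, augmentation ranges, classification head, and training schedule are placeholders standing in for the tuned configurations described above and in the online appendix.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (299, 299)   # placeholder; we tune over two image sizes
BATCH_SIZE = 32         # placeholder; we tune over two batch sizes

# Random augmentation applied as each batch is loaded, so that no image is
# repeated exactly across epochs; the ranges here are illustrative.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=5,
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.05,
).flow_from_directory("forms/train", target_size=IMG_SIZE,
                      batch_size=BATCH_SIZE, class_mode="binary")

val_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "forms/validation", target_size=IMG_SIZE,
    batch_size=BATCH_SIZE, class_mode="binary")

# Transfer learning: start from InceptionV3 pre-trained on ImageNet and add a
# small fully-connected head that predicts whether the irregularity is present.
base = keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=IMG_SIZE + (3,))

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Validation cross-entropy loss is monitored at the end of each epoch.
model.fit(train_gen, validation_data=val_gen, epochs=20)
```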

Predicting irregularities

Table 2 summarizes the performance of the tuned models. Because each of the irregularities we study is rare—ranging from 31% of observations to less than 1%—common metrics such as classification accuracy are inappropriate for these data. For instance, since 99.6% of statutory forms scanned correctly, we could achieve accuracy of 99.6% by simply predicting

that every form scanned correctly. We instead report F1 scores, which represent the harmonic mean between precision and recall. Precision is defined as the proportion of cases predicted to have an irregularity that actually did, while recall is the proportion of irregularities that were

correctly identified.15 F1 scores range between zero and one, with higher values reflecting better performance.

14. The five model architectures vary the number of convolutional layer-pooling layer blocks and the number of filters within them, as well as the size and number of fully connected layers in the prediction step. See the online appendix for details. 15. In other words, precision is the proportion true positives / (true positives + false positives), while recall is given by true positives / (true positives + false negatives).

Table 2: Image classifier performance

Irregularity                           Prop. minority   F1
Document problems
  QR code missing                      0.006            0.999
  Poor scan quality                    0.004            1.000
Procedure problems
  Form not stamped                     0.030            0.988
  Presiding officer did not sign       0.008            1.000
Agent problems
  No agents listed                     0.048            0.996
  Any agent did not sign               0.103            0.978
  No agents signed                     0.052            0.997
  Agent signatures appear identical    0.052            0.996
  Agent refusal to sign listed         0.026            0.992
Results edited                         0.308            0.957

Prop. minority indicates the proportion of observations in the held-out sample belonging to the less-common class (e.g., 31% of results were edited while 69% were not). F1 is the harmonic mean of precision and recall among a common true test sample of 500 polling stations, ranging over [0, 1] where larger values indicate better performance.


As the Table indicates, our classification models perform exceptionally well. Two types of irregularities, scan quality and presiding officer signature, are classified perfectly, with every single observation among the test sample correctly predicted.16 Five more variables

achieve F1 scores of over 0.99, with just 1-7 observations among the held-out sample of 500 misclassified. Further, as we report in the online appendix, the errors are relatively balanced, with no model routinely over- or under-predicting irregularities.
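To make the metric concrete, the snippet below computes precision, recall, and F1 from the cells of a binary confusion matrix; the counts used in the example are hypothetical and are not drawn from our test sample.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a binary classifier."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion matrix for a rare irregularity in a 500-form test set:
# 24 irregular forms correctly flagged, 2 false alarms, 1 irregular form missed.
print(f1_score(true_positives=24, false_positives=2, false_negatives=1))  # ~0.94
```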

Even the worst-performing model predicts extremely well, with an F1 of 0.957 for edited results. By way of comparison, the model for edited results studied in Cantú (2019) achieves

16. The 495 forms that scanned correctly were predicted to be good scans, while the two forms that did not scan correctly were predicted to be bad scans; the 494 forms with presiding officer signatures were all predicted to have one, while the 4 without signatures were predicted to not have one. (Each sample is slightly smaller than 500 observations due to missingness among the statutory forms.)

an F1 of approximately 0.885.17 Our results correspond to reducing the number of errors from 11 to just 7 out of every 100 observations—a 36% reduction in the raw error rate. Across all irregularities, we expect an error rate of just 1.58 cases per 100, or 475 misclassified cases (per irregularity) in our sample of approximately 30,000 polling stations.

To make the analysis more tractable, we group these irregularities into three sets of problems relating to the conceptual groups defined above: the production and transmission of the statutory form itself; basic procedural requirements of certifying the form; and observation of the electoral process by candidates or their agents. Given its uniqueness and theoretical relevance, we also directly study whether results were edited. These groups are constructed as indicated in Table 2, with our three aggregate variables coded as a problem if any variable in that group is predicted to be irregular.18
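A minimal sketch of this grouping step, assuming a data frame with one row per polling station and one 0/1 column per predicted irregularity (the column names and groups listed here are illustrative and abbreviated):

```python
import pandas as pd

# Hypothetical classifier output: one row per polling station.
preds = pd.DataFrame({
    "qr_code_missing":      [0, 0, 1],
    "poor_scan_quality":    [0, 0, 0],
    "form_not_stamped":     [0, 1, 0],
    "officer_did_not_sign": [0, 0, 0],
    "no_agents_listed":     [0, 0, 1],
    "results_edited":       [1, 0, 0],
})

groups = {
    "document_problem":  ["qr_code_missing", "poor_scan_quality"],
    "procedure_problem": ["form_not_stamped", "officer_did_not_sign"],
    "agent_problem":     ["no_agents_listed"],  # plus the remaining agent variables
}

# A station is coded as having a group-level problem if any irregularity in that group is predicted.
for name, cols in groups.items():
    preds[name] = preds[cols].any(axis=1).astype(int)
```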

Irregularities in the 2013 presidential election

Figure 2 summarizes the average occurrence of each irregularity category by constituency, grouped by county. Each dot indicates the mean proportion of polling stations within a constituency exhibiting a given problem, with lines for 95% confidence intervals, colored by county.19 Two notable patterns emerge. First, agent and editing irregularities are much more frequent than either document- or procedure-related problems. Second, both agent and editing problems appear nation-wide, and are not concentrated in any particular stronghold. For instance, neither agent problems nor edited results are more likely in incumbent ethnic strongholds, defined as polling stations for which the electorate is more than 90% Kikuyu or

17. While Cantú (2019) does not report these values, they can be approximately inferred from the confusion matrices provided. Details can be found in our online appendix. 18. Automatic clustering and dimension-reduction techniques (e.g., multiple correspondence analysis) did not reveal any systematic patterns in how these variables correlate, suggesting they are uncovering different types of statutory irregularities. 19. These estimates are generated by estimating a linear probability model with constituency fixed effects, clustering standard errors by constituency.

[Figure 2 comprises four panels (document problems, procedure problems, agent problems, and results edited), plotting the constituency mean for each constituency, grouped by county.]

Figure 2: Proportion of polling stations with statutory irregularities. Dots indicate constituency fixed effects estimated using a linear probability model, with lines for 95% confidence intervals. Colors correspond to counties.

Kalenjin.20

In addition, Figure 2 indicates that most document problems are concentrated in two constituencies (one in each of Kitui and Vihiga counties), wherein nearly all polling stations exhibited irregularities. Since election administration is organized at the constituency level, the most plausible interpretation of this result is that administrators in these two constituencies may have failed to provide adequate training to presiding officers on how to scan results. In Vihiga, Emuhaya constituency shows 55 forms with document problems, with only one unproblematic form. Upon closer inspection, the reason for this finding is indeed a training problem: forms were accidentally scanned in landscape mode, cutting off a significant part of the page.21 Procedural problems are similarly rare.

Greater variation is apparent in agent problems and edited results, but again no clear patterns emerge. Agent problems appear to vary within counties instead of across them. Such problems occur less frequently in pro-Kenyatta counties such as Bomet and Nandi, but more frequently in Meru and Murang’a, both of which were also government strongholds. Similarly, agent problems were less frequent in Siaya and Kakamega but more frequent in Kilifi and Kwale, all opposition counties. Nor do these results suggest more irregularities in hotly contested counties. For example, Nairobi City county is one of the places with the most edited results and ended up supporting Odinga by a margin of 53% to Kenyatta’s 47%. However, the next-most competitive county, Narok, experienced an average number of edited results; Samburu county, the next-most competitive after Narok, had fewer than average edited results. Overall, Figure 2 suggests little to no relationship between partisanship or polarization and statutory irregularities.

Thus far, we have examined election irregularities as an outcome. In Table 3, we examine how these irregularities relate to election results. We regress polling station-level voting outcomes on each of these problems, using linear probability models with constituency fixed effects and standard errors clustered by constituency.

20. See the online appendix for regression results. 21. The outlier in Kitui is the result of Kitui West constituency only having one polling station form available, which happened to have a document irregularity.

21 Table 3: Irregularities and electoral outcomes

Turnout Abs. margin Margin Document prob. −0.01 −0.02 −0.02 0.01 0.06 0.05 (0.02) (0.02) (0.03) (0.02) (0.06) (0.04) Procedure prob. −0.01 −0.01 −0.00 −0.00 0.00 0.01 (0.01) (0.01) (0.01) (0.01) (0.02) (0.02) Agent prob. −0.00 −0.00 0.00 0.00 −0.00 0.00 (0.00) (0.00) (0.00) (0.00) (0.01) (0.01) Results edited −0.01∗ −0.00∗ −0.01∗ −0.00 0.00 0.00 (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

Bal. rejected Bal. disputed Bal. spoiled Document prob. 0.45 0.40 −0.00† −0.00† 0.08 −0.15 (0.53) (0.70) (0.00) (0.00) (0.32) (0.24) Procedure prob. 0.44 0.47 0.01 0.01 0.34∗ 0.34∗ (0.33) (0.35) (0.01) (0.01) (0.15) (0.16) Agent prob. 0.20∗ 0.16† −0.00 −0.00 0.08 0.08 (0.09) (0.10) (0.00) (0.00) (0.05) (0.05) Results edited 0.62∗ 0.55∗ 0.00 0.00 0.37∗ 0.36∗ (0.10) (0.10) (0.00) (0.00) (0.03) (0.03) Controls !!! Constituency FE !!!!!! ∗p < .05,† p < .10. Estimates are from linear models with standard errors in parentheses. Controls are as described in the text. All models include constituency fixed effects and standard errors clustered by constituency. Full results are reported in the online appendix.

effects and standard errors clustered by constituency. For each set of results, we estimate simple models where the only independent variable is the presence of an irregularity, as well as models with a series of controls. These controls include population density, ruggedness of terrain, ethnic fractionalization, geographic isolation, poverty, literacy rate, and night-time lights (a proxy for economic activity).22 In the top half of the Table, we study three dependent variables: turnout, the absolute vote margin, and Kenyatta’s vote margin, all expressed as proportions (the first two ranging over [0, 1] and the latter over [−1, 1]). Only problems relating to edited results have any

22. See the online appendix for data sources, coding rules, and full regression results. Our estimates are robust to using logistic regression, and to using ward fixed effects, county fixed effects, or no fixed effects.

22 relationship to turnout or the absolute vote margin. In both cases, however, the relationship runs in the opposite direction from what an expectation of fraud would predict, as polling stations with statutory forms showing edited results have lower turnout and a smaller absolute vote margin. Moreover, the miniscule effect sizes suggest that vote tallies are largely unchanged even when results are edited, and the edits do not appear to favor either Kenyatta or Odinga, as there is no relationship to the vote margin at all. Taken together, the most plausible interpretation of these results is that edited results reflect not deliberate manipulation of results but rather more mundane corrections to human error in counting ballots. One alternative explanation is that these results reflect electoral manipulation not by ballot-stuffing, but rather by ballot-shaving. In this scenario, partisans systematically discard valid votes to favor particular candidates. The negative relationship between edited results and turnout suggests this is a possible mechanism, though the null result for Kenyatta margin suggests it would have to be carried out by partisans for both sides. At the same time, the negative coefficient for absolute margin indicates that partisans would havetobe shaving votes in polling stations their opponents won, i.e., shaving votes for the winner—an implausible mechanism given the stark polarization at the polling station level.23 Nevertheless, the bottom half of Table 3 investigates this alternative explanation by studying similar models where the dependent variables are instead the number of ballots that were rejected, disputed, or spoiled at each polling station. Broadly, these estimates suggest that statutory irregularities correspond to more of these ballots. Yet they also indicate that the vote-shaving hypothesis is implausible: the presence of any such irregularity has such a small effect size that, at most, only one ballot per polling station is likely to beaffected. Aggregated, this suggests that perhaps as many as 30,000 votes could have been shaved—not enough to sway the outcome of the election, but perhaps enough to get Kenyatta past the 50% runoff threshold. While we cannot definitively rule this possibility out, it seems unlikely that the incumbent party would embark on a nationwide plan to violate electoral law by

23. Only 4.18% of polling stations had an absolute margin of less than 10 percentage points, and the average absolute margin was 71.30 percentage points.

23 having its officers shave one vote here and there, rather than invest in altering tallies ina few key constituencies.24 A much more plausible interpretation is just that the presence of problematic ballots makes the process of counting and aggregating ballots more difficult, leading to more errors by electoral administrators, resulting in more edited results. If irregularities in statutory forms do not reflect electoral manipulation, but rather simple errors, then we would expect to observe them relatively evenly throughout Kenya, as hinted in Figure 2. To examine this question more closely, we regress each of the four problem categories on the geographic and socioeconomic variables used as controls in the analysis above. Figure 3 plots the coefficients from these four models.25 None of our geographic or socioeconomic variables is meaningfully related to document or procedural problems (in orange and green, respectively). Agent problems (blue) and edited results (purple), however, appear to be driven by at least three separate mechanisms. First, the strongest correlate of agent problems and edited results is polling station isolation. This is an intuitive result: more isolated polling stations are much more likely to be costly for candidate agents to travel to, resulting in irregularities that arise from agents not being present to observe the process (and verify that they have done so by signing the statutory form). And without agents present during vote tabulation, presiding officers at isolated polling stations are less likely to catch mathematical errors, resulting in fewer edits to results. A second mechanism explains the link between more urban and diverse areas and higher rates of statutory irregularities. We think it is likely that this result tells a story about contestation—that such polling stations are more hotly contested, leading to more disputed ballots, more edits to results, and more refusals to sign by candidates’ agents. Third, the small negative effect of literacy on the probability of edited results may reflecta competence mechanism: better-educated electoral administrators may be less likely to make

24. Indeed, the theory that rigging occurred in the tallying of results in Nairobi, rather than at individual polling stations, featured centrally in the legal battle over the election results. 25. See the online appendix for complete regression results. Note that unlike the foregoing analysis, these models do not include constituency fixed effects because much of the variation we are interested indoesnot vary significantly by constituency, so including them would be an “easy test” of our theory. Nevertheless, the results are effectively unchanged if such fixed effects are included.

[Figure 3 plots, for each of the four irregularity types, the predicted change in the probability of an irregularity associated with population density, terrain ruggedness, ethnic fractionalization, polling station isolation, poverty rate, literacy rate, and night lights.]

Figure 3: Predicted relationship between polling station characteristics and the probability of statutory irregularities. Dots indicate estimates from linear probability models, with lines for 95% confidence intervals. Hollow (filled) dots indicate estimates that cannot (can) be statistically distinguished from zero at p < .05. Colors refer to irregularity types, with orange for document problems, green for procedure problems, blue for agent problems, and purple for edited results.

errors counting ballots, requiring fewer changes to the statutory form.

Figure 4: An example of an edited result. Here the presiding officer has incorrectly written the number of votes cast on the registered voter line before fixing the mistake.

Finally, to study more closely the types of edits that are made, we sampled 500 of the statutory forms manually coded (during the construction of our training sample) as having been edited. We then had research assistants determine whether the original entries could be discerned, and where they could be, compared the original and final entries to understand the magnitude of these changes. The results are telling: the median changes to registered voters, spoiled ballots, ballots cast, rejected ballots, disputed ballots, objections to rejected ballots, and valid votes cast are all exactly zero (with a global mean of 0.60). We then manually inspected every form where the change in any of these tallies was greater than 10, of which there were only 26. In almost every case, it was clear that the error was one of simply writing the entry on the wrong line. Figure 4 provides a characteristic example wherein the presiding officer wrote the number of votes cast in the registered voters line, crossed it out, and then filled in the information correctly. Such edits bear a much stronger resemblance to benign mistakes than they do to systematic fraud.

Taken together, these results present a clear picture. In Kenya’s 2013 presidential election, statutory irregularities are not concentrated in particular constituencies or counties, do not correlate with electoral outcomes, do not produce meaningful changes in vote tallies, appear to be produced not by partisan processes but rather logistical and competency-based

mechanisms, and simply look like fixes to human mistakes. While we cannot definitively rule out electoral manipulation, we find very little evidence in support of systematic fraud. Our analysis suggests instead that irregularities are just minor deviations from an ideal election, entirely consistent with electoral administration in a developing democracy.

Conclusion

Administering elections is costly and difficult even for rich democracies with long histories of voting. For developing democracies like Kenya, the challenges of conducting an election are even greater, resulting in a number of common irregularities that often look like fraud. In this paper, we demonstrate that these irregularities do not always indicate an intent to manipulate electoral results. Statutory forms may frequently show document or procedural problems, may indicate no observers were present to verify the count, or may reflect edited tallies. But when analyzed closely using cutting-edge deep learning tools and extremely granular data from Kenya’s 2013 presidential election, these issues appear to be the result not of malfeasance but of simple human error.

Our work has several limitations. Most importantly, we study only irregularities related to the production of statutory results. Yet the absence of clear patterns linking irregularities to fraud in this case does not prove that Kenya’s electoral process was entirely free from fraud—only that form-based irregularities do not reflect significant changes to the election results. Malfeasance can occur at many other points in an electoral cycle (Norris 2014), and there are many other points of entry for fraudulent behavior. For instance, fraud can occur during the aggregation stages of an election, when polling station results are transferred from one form to another, or from a statutory form to a centralized database of results (Callen and Long 2015; Cantú 2019). The approach we explore here should therefore be seen as a complement to, and not a substitute for, existing methods for examining electoral processes.

A more pessimistic critique of our approach is that our data are produced by Kenya’s

27 Independent Electoral and Boundaries Commission, which may itself might have been party to any fraud in the production of these forms. While we acknowledge the fraught electoral history in Kenya, we think this theory implausible. If the forms had been manufactured wholesale, we would expect the organization of the original files to be more uniform, rather

than haphazardly bundled in ad hoc concatenated PDFs by constituency. We would expect little variation in scan or document quality. We would expect more uniformity in terms of writing implements, ink, signatures, and other idiosyncratic details arising on election day—such as the random commentary and tea stains marking the normal passage of business on election day. And most importantly, we would have expected to see many fewer normal irregularities related to agents and editing. Nevertheless, this critique does raise an important point regarding the provenance and chain of custody for election data. Future efforts might seek to standardize the production, storage, and preservation of election materials of the type we study here.26 (Future work to be written...)

26. See https://www.openelectiondata.net/ for a first step in that direction.

References

Alvarez, R. Michael, Thad E. Hall, and Susan D. Hyde, eds. 2008. Election Fraud: Detecting and Deterring Electoral Manipulation. Washington, DC: Brookings Institution Press.
Ansolabehere, Stephen, Barry C. Burden, Kenneth R. Mayer, and Charles Stewart III. 2018. “Learning from Recounts.” Election Law Journal 17 (2): 100–116.
Asunka, Joseph, Sarah Brierley, Miriam Golden, Eric Kramon, and George Ofosu. 2019. “Electoral Fraud or Violence: The Effect of Observers on Party Manipulation Strategies.” British Journal of Political Science 49 (1): 129–151.
Beber, Bernd, and Alexandra Scacco. 2012. “What the Numbers Say: A Digit-Based Test for Election Fraud.” Political Analysis 20 (2): 211–234.
Beck, Nathaniel, Gary King, and Langche Zeng. 2000. “Improving Quantitative Studies of International Conflict: A Conjecture.” American Political Science Review 94 (1): 21–35.
Bengio, Yoshua. 2012. “Deep Learning of Representations For Unsupervised and Transfer Learning.” In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 17–36.
Birch, Sarah. 2011. Electoral Malpractice. Oxford: Oxford University Press.
Callen, Michael, and James D. Long. 2015. “Institutional Corruption and Election Fraud: Evidence from a Field Experiment in Afghanistan.” American Economic Review 105 (1): 354–381.
Cantú, Francisco. 2014. “Identifying Irregularities in Mexican Local Elections.” American Journal of Political Science 58 (4): 936–951.
———. 2019. “The Fingerprints of Fraud: Evidence from Mexico’s 1988 Presidential Election.” American Political Science Review 113 (3): 710–726.
Cantú, Francisco, and Sebastián Saiegh. 2011. “Fraudulent Democracy? An Analysis of Argentina’s Infamous Decade Using Supervised Machine Learning.” Political Analysis 19 (4): 409–433.
Caruana, Rich. 1994. “Learning Many Related Tasks at the Same Time With Backpropagation.” In Proceedings of the 7th International Conference on Neural Information Processing Systems, 657–664.
Cheeseman, Nic, Karuti Kanyinga, Gabrielle Lynch, Mutuma Ruteere, and Justin Willis. 2019. “Kenya’s 2017 Elections: Winner-Takes-All Politics as Usual?” Journal of Eastern African Studies 13 (2): 215–234.
Chollet, François. 2015. Keras. https://keras.io.
Cohen, Mollie J., and Zach Warner. Forthcoming. “How to Get Better Survey Data More Efficiently.” Political Analysis.

Daxecker, Ursula E. 2012. “The Cost of Exposing Cheating: International Election Monitoring, Fraud, and Post-Election Violence in Africa.” Journal of Peace Research 49 (4): 503–516.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255.
Elections Observation Group. 2013. The Official Results are Consistent with Elections Observation Group’s Parallel Vote Tabulation (PVT). Available at https://www.ndi.org/.
Enikolopov, Ruben, Vasily Korovkin, Maria Petrova, Konstantin Sonin, and Alexei Zakharov. 2013. “Field Experiment Estimate of Electoral Fraud in Russian Parliamentary Elections.” Proceedings of the National Academy of Sciences 110 (2): 448–452.
Ferrari, Diogo, Kevin McAlister, and Walter R. Mebane Jr. 2018. Developments in Positive Empirical Models of Election Frauds: Dimensions and Decisions. Paper presented at the 2018 Annual Meeting of the Society for Political Methodology, Provo, UT, July 16-18, 2018.
Gettleman, Jeffrey. 2013. Voting Irregularities in Kenya Election Are Confirmed, Adding Fuel to Dispute. New York Times, 29 March 2013. Available at https://www.nytimes.com.
Gutiérrez-Romero, Roxana. 2014. “An Inquiry into the Use of Illegal Electoral Practices and Effects of Political Violence and Vote-Buying.” Journal of Conflict Resolution 58 (8): 1500–1527.
Hill, Daniel W., Jr., and Zachary M. Jones. 2014. “An Empirical Evaluation of Explanations for State Repression.” American Political Science Review 108 (3): 661–687.
Hindman, Matthew. 2015. “Building Better Models: Prediction, Replication, and Machine Learning in the Social Sciences.” The Annals of the American Academy of Political and Social Science 659 (1): 48–62.
Hyde, Susan D. 2008. “How International Election Observers Detect and Deter Fraud.” In Election Fraud: Detecting and Deterring Electoral Manipulation, edited by R. Michael Alvarez, Thad E. Hall, and Susan D. Hyde, 201–215. Washington, DC: Brookings Institution Press.
Idrobo, Nicolás, Dorothy Kronick, and Francisco Rodríguez. 2020. Do Shifts in Late-Counted Votes Signal Fraud? Evidence from Bolivia. Working Paper.
Jegwa, Peter. 2019. Malawi Election: Court Orders New Vote after May 2019 Result Annulled. BBC News, 3 February 2020. Available at https://www.bbc.co.uk/.
Klimek, Peter, Yuri Yegorov, Rudolf Hanel, and Stefan Thurner. 2012. “Statistical Detection of Systematic Election Irregularities.” Proceedings of the National Academy of Sciences 109 (41): 16469–16473.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25 (2).
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–444.
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE 86 (11): 2278–2324.
Li, Liam, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. 2020. A System for Massively Parallel Hyperparameter Tuning. Conference on Machine Learning and Systems 2020.
Mebane, Walter R., Jr. 2006. Election Forensics: The Second-Digit Benford’s Law Test and Recent American Presidential Elections. Paper presented at the Election Fraud Conference, Salt Lake City, Utah, September 29-30, 2006.
———. 2016. Election Forensics: Frauds Tests and Observation-level Frauds Probabilities. Paper presented at the 2016 Annual Meeting of the Midwest Political Science Association, Chicago, IL, April 7-10, 2016.
Mebane, Walter R., Jr., and Jasjeet S. Sekhon. 2004. “Robust Estimation and Outlier Detection for Overdispersed Multinomial Models of Count Data.” American Journal of Political Science 48 (2): 392–411.
Medzihorsky, Juraj. 2015. “Election Fraud: A Latent Class Framework for Digit-Based Tests.” Political Analysis 23 (4): 506–517.
Montgomery, Jacob M., Santiago Olivella, Joshua D. Potter, and Brian F. Crisp. 2015. “An Informed Forensics Approach to Detecting Vote Irregularities.” Political Analysis 23 (4): 488–505.
Myagkov, Mikhail, Peter C. Ordeshook, and Dimitri Shakin. 2009. The Forensics of Election Fraud: Russia and Ukraine. New York: Cambridge University Press.
Norris, Pippa. 2014. Why Electoral Integrity Matters. Cambridge: Cambridge University Press.
Organization of American States. 2019. Electoral Integrity Analysis: General Elections in the Plurinational State of Bolivia Final Report.
Razavian, Ali Sharif, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. “CNN Features Off the Shelf: An Astounding Baseline for Recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 806–813.
Republic of Kenya. 2013. Deposition of Janet Ongera.
Rozenas, Arturas. 2017. “Detecting Election Fraud from Irregularities in Vote-Share Distributions.” Political Analysis 25 (1): 41–56.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323:533–536.
Supreme Court of Kenya. 2013. 2013 Presidential Election Judgment.
Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. “Rethinking the Inception Architecture for Computer Vision.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
Tan, Chuanqi, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. 2018. A Survey on Deep Transfer Learning. Proceedings of the International Conference on Artificial Neural Networks.
von Goethe, Johann Wolfgang. 1774 [1957]. The Sufferings of Young Werther. Translated by Bayard Quincy Morgan. New York: Frederick Ungar.
Weinberg, Barry H. 2008. The Resolution of Election Disputes: Legal Principles That Control Election Challenges. Washington, DC: International Foundation for Electoral Systems.
Williams, Jack R., and John Curiel. 2020. Analysis of the 2019 Bolivia Election. Technical Report.
Wong, Sebastien C., Adam Gatt, Victor Stamatescu, and Mark D. McDonnell. 2016. Understanding Data Augmentation for Classification: When to Warp? Proceedings of the 2016 IEEE International Conference on Digital Image Computing: Techniques and Applications.

Online Appendix to: Hidden in Plain Sight? Detecting Electoral Fraud Using Statutory Results∗

Zach Warner† J. Andrew Harris‡ Michelle Brown§ Christian Arnold¶ July 13, 2020

∗We thank the National Democratic Institute for generously sharing their data. We also thank our research assistants, Henri Böschen, Karunakar Reddy Mannem, and Jeff Morgan, for their contributions. Authors are listed in reverse-alphabetical order. †Postdoctoral Research Fellow, Cardiff University. email: [email protected], web: http://www. zachwarner.net. ‡Assistant Professor, New York University, Abu Dhabi. email [email protected], web: http://www. jandrewharris.org/. §Senior Advisor, Elections and Political Processes, National Democratic Institute. ¶Lecturer, Cardiff University. email: [email protected], web: http://christianarnold.org/.

Contents

A.1 Kenya’s internal political boundaries
A.2 Details about hyperparameter tuning
    A.2.1 Model architectures
    A.2.2 Data augmentations
    A.2.3 Tuning procedures
A.3 Diagnostic plots, classification model training
A.4 Confusion matrices, classification model training
A.5 Computation of F1 scores for Cantú (2019)
A.6 Regression results, government stronghold analysis
A.7 Covariate data sources
A.8 Full results, Table 3 (main text)
A.9 Robustness checks, Table 3 (main text)
A.10 Regression results, Figure 3 (main text)

A.1 Kenya’s internal political boundaries

Kenya’s 2010 Constitution reorganized the country’s internal political boundaries. With the former provinces eliminated, 47 counties became the new top-level administrative units. Nested within these counties are 290 parliamentary constituencies, which are further subdivided into county assembly wards. Figure A1 visualizes these subdivisions.


(a) Counties (b) Constituencies (c) Wards

Figure A1: Kenyan territorial divisions. Polling centers—which can contain multiple polling stations (or “streams”)—are indicated with dots. From left to right, Nakuru county, , and Waseges ward are highlighted.

A.2 Details about hyperparameter tuning

A.2.1 Model architectures

We tune across five model architectures: {inception, average, wide, deep, hard}.1 Note that throughout, we refer to a “block” as the combination of a 3-pixel empty border padding layer (which helps prevent the edges of the image from being ignored during training), a convolutional layer, a batch normalization, an exponential linear unit layer, and a pooling layer (in that order). The details of each model are as follows.

The inception model’s base model is the InceptionV3 model (Szegedy et al. 2016), trained on the ImageNet database (Deng et al. 2009). We take the top layer of this model (its output), add a pooling layer, a batch normalization layer, a dense layer with 1,024 exponential linear units (ELUs), and then a 50% dropout layer before the activation layer using a sigmoid function. We train the model such that all layers are trainable, including those in the underlying InceptionV3 model.2

The “average” model’s feature extraction step consists of three layers: a block in which the convolutional step has 32 filters, another block with 64 filters, and a third block with 128 filters. The prediction step is also grouped into three layers. In the first fully-connected layer we have a 10% dropout step, a flattening step, a dense step with 1,024 ELUs, a batch normalization, and an activation step. In the second fully-connected layer we have another 10% dropout step followed by a dense step with 512 ELUs, batch normalization, and then activation. The final layer is another 10% dropout and a dense step reducing to the final predictions, which are then batch normalized and activated using the sigmoid function.

The “wide” model builds on the average model. The first difference is that the feature extraction step includes a fourth block where the convolutional step has 256 filters. The second difference is that in the prediction step, the number of ELUs in the first fully-connected layer is 4,096 (still reducing down to 512).

The “deep” model differs from the average model in that the feature extraction step has 7 blocks, wherein the convolutional layers have 16, 16, 16, 16, 32, 32, and 64 filters. The prediction step has three fully-connected layers instead of the average model’s two, with 2,048, 1,024, and 256 ELUs, respectively.

Finally, the “hard” model differs from the average model in that the feature extraction step has 5 blocks, wherein the convolutional layers have 32, 64, 64, 128, and 256 filters. The prediction step has three fully-connected layers instead of the average model’s two, with 4,096, 2,048, and 512 ELUs, respectively.

1. The inception model is named for the base model we use; average because it is a reasonable starting point for a classification task of average difficulty; wide because the model appears wide if visualized vertically (due to the large number of ELUs in the first fully-connected layer); deep because the model appears tall if visualized vertically (due to the number of blocks in the feature extraction step); and hard because it is a complex model for difficult classification tasks.
2. Usually only the top few layers of a transfer learning model need to be retrained, with the rest kept “frozen.” However, since our application is somewhat different than typical object-detection tasks—e.g., we are looking for blank spaces within a table instead of identifying faces—our models perform better when all layers are kept trainable.
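To make the block structure concrete, the sketch below builds the “average” architecture with the keras functional API. It is an illustration rather than our exact training code: the kernel size, pooling operation, optimizer, and tensorflow.keras import path are assumptions, since the text above specifies only the ordering of layers and the filter and unit counts.

from tensorflow.keras import Input, Model, layers

def block(x, filters):
    # One "block": 3-pixel border padding, convolution, batch norm, ELU, pooling.
    x = layers.ZeroPadding2D(padding=3)(x)
    x = layers.Conv2D(filters, kernel_size=(3, 3))(x)  # kernel size assumed
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    return layers.MaxPooling2D(pool_size=(2, 2))(x)    # pooling operation assumed

inputs = Input(shape=(256, 256, 3))  # image size is a tuning parameter (256 or 512 pixels)
x = block(inputs, 32)
x = block(x, 64)
x = block(x, 128)

x = layers.Dropout(0.10)(x)          # first fully-connected layer
x = layers.Flatten()(x)
x = layers.Dense(1024)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("elu")(x)

x = layers.Dropout(0.10)(x)          # second fully-connected layer
x = layers.Dense(512)(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("elu")(x)

x = layers.Dropout(0.10)(x)          # final layer: one binary irregularity label
x = layers.Dense(1)(x)
x = layers.BatchNormalization()(x)
outputs = layers.Activation("sigmoid")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy", "mse"])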

A.2.2 Data augmentations

All images, irrespective of data augmentation, undergo the following preprocessing steps. First, images are cropped to the region of the form where the relevant feature is consistently found. For instance, the QR code is always in the top-left corner of the form, so we automatically zoom in on this region for the QR code irregularity. Second, the cropped images are resized to be square, so that the pixel width and pixel height are equal (the precise size is determined by the tuning parameters discussed below). This step increases computational stability and is standard in deep learning models. Third, all images are denoised using a median filter. This technique reduces noise while preserving edges better than other filters, such as a Gaussian filter, do. Finally, image pixel values are rescaled from RGB coordinates ranging over [0, 255] so that they are expressed as a proportion between 0 and 1.

After these preprocessing steps, the images are then augmented. The Python module we use to fit our deep learning models, keras (Chollet 2015), provides a number of built-in functions for augmentation through its ImageDataGenerator class. The augmentations relevant to our data and model are as follows:

1. samplewise_center: Set the global mean across all pixels to 0.

2. samplewise_std_normalization: Divide all pixels by the standard deviation across all pixels.

3. horizontal_flip: Flip the image horizontally with 50% probability.

4. vertical_flip: Flip the image vertically with 50% probability.

5. brightness_range: Shift the brightness of all pixels by a random value within the range given.

6. zoom_range: Zoom in or out of the image by a random value up to the value given.

7. channel_shift_range: Shift red, blue, and green channels of all pixels by a random value up to the value given.

8. rotation_range: Rotate the image in either direction by a random value up to the value given.

9. width_shift_range: Shift the image horizontally by a random value up to the value given.

10. height_shift_range: Shift the image vertically by a random value up to the value given.

11. shear_range: Shear the image (i.e., fix one axis and rotate the other axis) by a random value up to the value given.

For all augmentations that reveal pixels outside the existing boundaries (e.g., by shifting or rotating the image), empty pixel values are filled using the value of the nearest neighbor, the default setting in keras.

For the purposes of tuning our models, we use three settings: {none, low, high}. In the none condition, none of these data augmentations are applied. In the low setting, we turn on samplewise centering, samplewise scaling, horizontal flipping, and vertical flipping; set the brightness range and zoom range to [0.95, 1.05]; set the channel shift range to 5; and set the rotation range, width shift range, height shift range, and shear range to [−0.05, 0.05]. In the high setting, we turn on samplewise centering, samplewise scaling, horizontal flipping, and vertical flipping; set the brightness range and zoom range to [0.70, 1.30]; set the channel shift range to 30; and set the rotation range, width shift range, height shift range, and shear range to [−0.30, 0.30].

We note here that keras’s data augmentation is not replication-safe because it does not respect random seeds set at the beginning of a script call. We therefore implement a custom ImageDataGenerator class and a custom Generator sequence to call the class, bypassing this problem. This code can be found in our replication archive.
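For concreteness, the following sketch shows how the built-in generator could be configured under the low setting. It is illustrative only: the analysis itself uses the custom, seed-respecting subclass noted above, and the mapping of the listed ranges onto ImageDataGenerator arguments (e.g., degrees for rotation) is our reading of the keras interface.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative "low" augmentation setting (not the replication-archive class).
low_augmenter = ImageDataGenerator(
    samplewise_center=True,             # set each image's mean pixel value to 0
    samplewise_std_normalization=True,  # divide each image by its own standard deviation
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=(0.95, 1.05),
    zoom_range=(0.95, 1.05),
    channel_shift_range=5,
    rotation_range=0.05,                # drawn from [-0.05, 0.05] degrees
    width_shift_range=0.05,
    height_shift_range=0.05,
    shear_range=0.05,
    fill_mode="nearest",                # keras default: fill revealed pixels with the nearest neighbor
)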

A.2.3 Tuning procedures

For each of our 10 irregularities, we tune over the 60 models in a {5 × 3 × 2 × 2} grid. This grid consists of:

• Model architecture: inception, average, wide, deep, hard;

• Data augmentation: none, low, high;

• Image size: 256 × 256 pixels, 512 × 512 pixels; and

• Batch size: 16, 32.

These 600 models (60 for each of the 10 irregularities) were run using slurm arrays on Cardiff’s high-performance computing cluster, Supercomputing Wales, the national supercomputing research facility for Wales. Each model was allowed to run until one of three conditions was met: its out-of-sample F1 reached 0.99, it completed 300 epochs, or it timed out after 48 hours (the maximum run-time allowed on the cluster).
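The grid itself is easy to enumerate; the sketch below shows one way a slurm array index could be mapped to a configuration. Variable names are illustrative and do not correspond to our actual job scripts.

import itertools
import os

architectures = ["inception", "average", "wide", "deep", "hard"]
augmentations = ["none", "low", "high"]
image_sizes = [256, 512]
batch_sizes = [16, 32]

# Full tuning grid: 5 x 3 x 2 x 2 = 60 configurations per irregularity.
grid = list(itertools.product(architectures, augmentations, image_sizes, batch_sizes))
assert len(grid) == 60

# Each slurm array task (indexed 0-59) trains one configuration.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
architecture, augmentation, image_size, batch_size = grid[task_id]
print(architecture, augmentation, image_size, batch_size)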

A.3 Diagnostic plots, classification model training

Below are training diagnostic plots for each of the ten (tuned) classification models we estimate. Each plot provides the accuracy, loss, and mean squared error (MSE) for the training and validation data. Higher accuracy values and lower loss and MSE values indicate better performance.
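For readers who wish to reproduce plots of this kind, the sketch below draws the same three panels from a keras training history. The numbers are placeholders, not values from our models, and the metric names assume the model was compiled with metrics=["accuracy", "mse"].

import matplotlib.pyplot as plt

# Placeholder dictionary standing in for History.history as returned by model.fit().
history = {
    "accuracy": [0.95, 0.98, 0.99], "val_accuracy": [0.94, 0.97, 0.99],
    "loss": [0.20, 0.08, 0.03], "val_loss": [0.22, 0.10, 0.05],
    "mse": [0.05, 0.02, 0.01], "val_mse": [0.06, 0.03, 0.01],
}

fig, axes = plt.subplots(3, 1, figsize=(6, 9))
for ax, metric in zip(axes, ["accuracy", "loss", "mse"]):
    epochs = range(1, len(history[metric]) + 1)
    ax.plot(epochs, history[metric], label="Train")
    ax.plot(epochs, history["val_" + metric], label="Validation")
    ax.set_title(metric.upper() if metric == "mse" else metric.capitalize())
    ax.set_xlabel("Epoch")
    ax.legend()
fig.tight_layout()
fig.savefig("training_diagnostics.png")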

Figure A2: Training diagnostic plots, QR code missing

Figure A3: Training diagnostic plots, poor scan quality

Figure A4: Training diagnostic plots, form not stamped

Figure A5: Training diagnostic plots, presiding officer did not sign

Figure A6: Training diagnostic plots, no agents listed

Figure A7: Training diagnostic plots, any agent did not sign

Figure A8: Training diagnostic plots, no agents signed

Figure A9: Training diagnostic plots, agent signatures appear identical

Figure A10: Training diagnostic plots, agent refusal to sign listed

Figure A11: Training diagnostic plots, results edited

A.4 Confusion matrices, classification model training

Below are confusion matrices from our image classification models. Each is constructed using the true test sample of approximately 500 images left out from model training. Note that some samples are slightly smaller than 500 due to data missingness (i.e., missing, partially missing, corrupted, or otherwise incomplete statutory forms).

Table A1: Confusion matrix, QR code missing

                             True non-irregularity   True irregularity
Predicted non-irregularity                      496                   0
Predicted irregularity                            1                   3

Table A2: Confusion matrix, poor scan quality

                             True non-irregularity   True irregularity
Predicted non-irregularity                      495                   0
Predicted irregularity                            0                   2

Table A3: Confusion matrix, form not stamped

                             True non-irregularity   True irregularity
Predicted non-irregularity                      479                   8
Predicted irregularity                            4                   7

Table A4: Confusion matrix, presiding officer did not sign

                             True non-irregularity   True irregularity
Predicted non-irregularity                      494                   0
Predicted irregularity                            0                   4

Table A5: Confusion matrix, no agents listed

                             True non-irregularity   True irregularity
Predicted non-irregularity                      471                   2
Predicted irregularity                            2                  22

Table A6: Confusion matrix, any agent did not sign

                             True non-irregularity   True irregularity
Predicted non-irregularity                      440                  18
Predicted irregularity                            6                  33

Table A7: Confusion matrix, no agents signed

                             True non-irregularity   True irregularity
Predicted non-irregularity                      468                   0
Predicted irregularity                            3                  26

Table A8: Confusion matrix, agent signatures appear identical

                             True non-irregularity   True irregularity
Predicted non-irregularity                      469                   1
Predicted irregularity                            2                  25

Table A9: Confusion matrix, agent refusal to sign listed

                             True non-irregularity   True irregularity
Predicted non-irregularity                      482                   4
Predicted irregularity                            2                   9

Table A10: Confusion matrix, results edited

                             True non-irregularity   True irregularity
Predicted non-irregularity                      336                  25
Predicted irregularity                           10                 129

A.5 Computation of F1 scores for Cantú (2019)

Table 1 in Cantú (2019) provides a confusion matrix with mean values across 20 samples, each of 150 images. We can approximate the F1 score across this sample by multiplying each cell in the table by 3,000 (20 × 150). This gives 2,790 true negatives (top-left), 210 false positives (top-right), 450 false negatives (bottom-left), and 2,550 true positives (bottom-right). Applying the formula supplied in the main text (our paper), the approximate precision is 0.924, while the approximate recall is 0.850. F1 is then given by the harmonic mean:

F1 (Cantú) = 2 × (0.924 × 0.850) / (0.924 + 0.850) ≈ 0.885.    (A1)
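The same arithmetic in a few lines of Python, using the implied counts:

# Precision, recall, and F1 from the counts implied by Table 1 in Cantú (2019).
tn, fp, fn, tp = 2790, 210, 450, 2550

precision = tp / (tp + fp)   # 2550 / 2760 ≈ 0.924
recall = tp / (tp + fn)      # 2550 / 3000 = 0.850
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.924 0.85 0.885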

A.6 Regression results, government stronghold analysis

Table A11: Irregularities are unrelated to government party strength

                            Agent problems                  Edited results
                            (1)            (2)              (3)             (4)
Gov. stronghold             0.00 (0.01)    0.01 (0.01)      −0.00 (0.00)    0.02 (0.02)
Population density                         −0.00 (0.00)                     0.02∗ (0.01)
Terrain ruggedness                         −0.00 (0.00)                     0.00 (0.00)
Ethnic fractionalization                   0.02 (0.01)                      0.07∗ (0.02)
Poll. station isolation                    0.11∗ (0.05)                     −0.18∗ (0.07)
Poverty rate                               −0.01 (0.01)                     −0.01 (0.01)
Literacy rate                              0.01 (0.00)                      0.01 (0.01)
Night lights                               −0.00 (0.01)                     −0.02 (0.01)
Constituency FE             ✓              ✓                ✓               ✓
∗p < .05. As indicated in the main text, polling stations are coded as government strongholds if the proportion of registered voters who are Kikuyu or Kalenjin is greater than 90%. All models include constituency fixed effects and cluster standard errors by constituency.

A.7 Covariate data sources

Population density data come from Linard et al. (2012) and are computed as the log of one plus raw population density. Data on terrain ruggedness are calculated from a digital elevation model and data provided by the United States Geological Survey (2004). Ethnic fractionalization is computed as one minus the sum of squared ethnic group sizes, where each group’s size is expressed as a proportion of registered voters; these proportions are estimated from the 2013 voter register using the method described in Harris (2015). Polling station isolation is computed as the minimum distance to the nearest polling center, weighted by the number of registered voters, using a method described in Doogan et al. (2018). Data on the poverty rate are described in Linard et al. (2012), night-time lights data are from National Oceanic and Atmospheric Administration (2013), and literacy rate data are given in Bosco et al. (2017).
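As a small illustration of the fractionalization measure (the group counts below are made up and do not come from the voter register):

# Ethnic fractionalization: one minus the sum of squared group shares.
def ethnic_fractionalization(group_counts):
    total = sum(group_counts)
    shares = [count / total for count in group_counts]
    return 1 - sum(share ** 2 for share in shares)

# A hypothetical polling station dominated by one group yields a low value.
print(round(ethnic_fractionalization([900, 50, 50]), 3))  # 0.185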

A.8 Full results, Table 3 (main text)

Table A12: Turnout models, no controls

Document problem      −0.01 (0.02)
Procedure problem     −0.01 (0.01)
Agent problem         −0.00 (0.00)
Edited results        −0.01∗ (0.00)
Constituency FE       ✓ ✓ ✓ ✓
∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A13: Turnout models, with controls

Document problem −0.02 (0.02) Procedure problem −0.01 (0.01) Agent problem −0.00 (0.00) Edited results −0.00∗ (0.00) Population density −0.01∗ −0.01∗ −0.01∗ −0.01∗ (0.00) (0.00) (0.00) (0.00) Terrain ruggedness 0.00∗ 0.00∗ 0.00∗ 0.00∗ (0.00) (0.00) (0.00) (0.00) Ethnic fractionalization −0.04∗ −0.04∗ −0.04∗ −0.04∗ (0.01) (0.01) (0.01) (0.01) Poll. station isolation −0.10 −0.10 −0.10 −0.10 (0.06) (0.06) (0.06) (0.06) Poverty rate 0.00 0.00 0.00 0.00 (0.00) (0.00) (0.00) (0.00) Night lights −0.01∗ −0.01∗ −0.01∗ −0.01∗ (0.00) (0.00) (0.00) (0.00) Literacy rate 0.00 0.00 0.00 0.00 (0.00) (0.00) (0.00) (0.00) Constituency FE !!!! ∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A14: Absolute margin models, no controls

Document problem      −0.02 (0.03)
Procedure problem     −0.00 (0.01)
Agent problem         0.00 (0.00)
Edited results        −0.01∗ (0.00)
Constituency FE       ✓ ✓ ✓ ✓
∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A15: Absolute margin models, with controls

Document problem 0.01 (0.02) Procedure problem −0.00 (0.01) Agent problem 0.00 (0.00) Edited results −0.00 (0.00) Population density −0.01 −0.01 −0.01 −0.01 (0.01) (0.01) (0.01) (0.01) Terrain ruggedness 0.01∗ 0.01∗ 0.01∗ 0.01∗ (0.00) (0.00) (0.00) (0.00) Ethnic fractionalization −0.38∗ −0.38∗ −0.38∗ −0.38∗ (0.03) (0.03) (0.03) (0.03) Poll. station isolation 0.20 0.20 0.20 0.20 (0.18) (0.18) (0.18) (0.18) Poverty rate −0.01 −0.01 −0.01 −0.01 (0.01) (0.01) (0.01) (0.01) Night lights −0.02∗ −0.02∗ −0.02∗ −0.02∗ (0.01) (0.01) (0.01) (0.01) Literacy rate −0.03∗ −0.03∗ −0.03∗ −0.03∗ (0.01) (0.01) (0.01) (0.01) Constituency FE !!!! ∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A16: Kenyatta margin models, no controls

Document problem      0.06 (0.06)
Procedure problem     0.00 (0.02)
Agent problem         −0.00 (0.01)
Edited results        0.00 (0.00)
Constituency FE       ✓ ✓ ✓ ✓
∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A17: Kenyatta margin models, with controls

Document problem 0.05 (0.04) Procedure problem 0.01 (0.02) Agent problem 0.00 (0.01) Edited results 0.00 (0.00) Population density −0.01 −0.01 −0.01 −0.01 (0.02) (0.02) (0.02) (0.02) Terrain ruggedness 0.04∗ 0.04∗ 0.04∗ 0.04∗ (0.01) (0.01) (0.01) (0.01) Ethnic fractionalization −0.23∗ −0.23∗ −0.23∗ −0.23∗ (0.07) (0.07) (0.07) (0.07) Poll. station isolation 0.19 0.19 0.19 0.19 (0.33) (0.33) (0.33) (0.33) Poverty rate −0.05 −0.05 −0.05 −0.05 (0.03) (0.03) (0.03) (0.03) Night lights 0.00 0.00 0.00 0.00 (0.01) (0.01) (0.01) (0.01) Literacy rate −0.03 −0.03 −0.03 −0.03 (0.02) (0.02) (0.02) (0.02) Constituency FE !!!! ∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A18: Rejected vote models, no controls

Document problem      0.46 (0.53)
Procedure problem     0.44 (0.33)
Agent problem         0.20∗ (0.09)
Edited results        0.62∗ (0.10)
Constituency FE       ✓ ✓ ✓ ✓
∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A19: Rejected vote models, with controls

Document problem 0.40 (0.70) Procedure problem 0.47 (0.35) Agent problem 0.16 (0.10) Edited results 0.55∗ (0.10) Population density 0.47∗ 0.47∗ 0.47∗ 0.46∗ (0.11) (0.11) (0.11) (0.10) Terrain ruggedness −0.03 0.03 −0.03 −0.03 (0.05) (0.05) (0.05) (0.05) Ethnic fractionalization 1.24∗ 1.24∗ 1.24∗ 1.21∗ (0.27) (0.27) (0.27) (0.27) Poll. station isolation 1.68 1.63 1.67 1.63 (1.62) (1.62) (1.62) (1.61) Poverty rate −0.48∗ −0.47∗ −0.48∗ −0.46∗ (0.11) (0.11) (0.11) (0.11) Night lights 0.21∗ 0.21∗ 0.21∗ 0.21∗ (0.10) (0.10) (0.10) (0.10) Literacy rate −0.35∗ −0.35∗ −0.35∗ −0.33∗ (0.16) (0.16) (0.16) (0.16) Constituency FE !!!! ∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A20: Disputed vote models, no controls

Document problem      −0.00 (0.00)
Procedure problem     0.01 (0.01)
Agent problem         −0.00 (0.00)
Edited results        0.00 (0.00)
Constituency FE       ✓ ✓ ✓ ✓
∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A21: Disputed vote models, with controls

Document problem −0.00 (0.00) Procedure problem 0.01 (0.01) Agent problem −0.00 (0.10) Edited results 0.00 (0.00) Population density 0.00 0.00 0.00 0.00 (0.00) (0.00) (0.00) (0.00) Terrain ruggedness −0.00 −0.00 −0.00 −0.00 (0.00) (0.00) (0.00) (0.00) Ethnic fractionalization −0.01 −0.01 −0.01 −0.01 (0.00) (0.00) (0.00) (0.00) Poll. station isolation −0.01 −0.01 −0.01 −0.01 (0.04) (0.04) (0.04) (1.04) Poverty rate −0.00 −0.00 −0.00 −0.00 (0.00) (0.00) (0.00) (0.00) Night lights −0.00 −0.00 −0.00 −0.00 (0.00) (0.00) (0.00) (0.00) Literacy rate 0.00 0.00 0.00 0.00 (0.00) (0.00) (0.00) (0.00) Constituency FE !!!! ∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A22: Spoiled vote models, no controls

Document problem      0.08 (0.32)
Procedure problem     0.34∗ (0.16)
Agent problem         0.08 (0.05)
Edited results        0.37∗ (0.03)
Constituency FE       ✓ ✓ ✓ ✓
∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

Table A23: Spoiled vote models, with controls

Document problem −0.15 (0.24) Procedure problem 0.34∗ (0.16) Agent problem 0.08 (0.05) Edited results 0.36∗ (0.03) Population density 0.24∗ 0.23∗ 0.24∗ 0.22∗ (0.04) (0.04) (0.04) (0.04) Terrain ruggedness −0.05∗ −0.06∗ −0.05∗ −0.06∗ (0.02) (0.02) (0.02) (0.02) Ethnic fractionalization 0.18∗ 0.19∗ 0.18∗ 0.17∗ (0.08) (0.08) (0.08) (0.07) Poll. station isolation 1.38∗ 1.34 1.37∗ 1.35 (0.69) (0.69) (0.69) (0.69) Poverty rate −0.08∗ −0.09∗ −0.08∗ −0.08∗ (0.04) (0.04) (0.04) (0.04) Night lights 0.02 0.02 0.02 0.01 (0.04) (0.04) (0.04) (0.04) Literacy rate −0.13∗ −0.13∗ 0.13∗ −0.12∗ (0.04) (0.04) (0.04) (0.04) Constituency FE !!!! ∗p < .05. All models include constituency fixed effects and cluster standard errors by constituency.

A.9 Robustness checks, Table 3 (main text)

Table A24: Ward fixed effects

Turnout Abs. margin Margin Document prob. −0.01 −0.02 0.02 0.05 0.04 0.06 (0.02) (0.03) (0.03) (0.03) (0.03) (0.03) Procedure prob. −0.01 −0.01 −0.01 −0.00 −0.03 −0.03 (0.01) (0.01) (0.01) (0.01) (0.02) (0.02) Agent prob. −0.00 −0.00 0.00 0.00 −0.01 −0.00 (0.00) (0.00) (0.00) (0.00) (0.01) (0.01) Results edited −0.00∗ −0.00∗ −0.01∗ −0.00 0.00 0.00 (0.00) (0.00) (0.00) (0.00) (0.00) (0.00)

Bal. rejected Bal. disputed Bal. spoiled Document prob. 0.26 −0.06 −0.00 −0.00 0.24 0.03 (0.58) (0.69) (0.00) (0.00) (0.31) (0.25) Procedure prob. 0.50 0.50 0.01 0.01 0.32∗ 0.31 (0.36) (0.38) (0.01) (0.01) (0.16) (0.16) Agent prob. 0.20∗ 0.16 −0.00 −0.00 0.10∗ 0.10∗ (0.09) (0.10) (0.00) (0.00) (0.05) (0.05) Results edited 0.61∗ 0.56∗ 0.00 0.00 0.36∗ 0.36∗ (0.10) (0.10) (0.00) (0.00) (0.03) (0.03) Controls !!! Ward FE !!!!!! ∗p < .05. Estimates are from linear models with standard errors in parentheses. Controls are as described in Section A.7. All models include ward fixed effects and standard errors clustered by ward.

Table A25: County fixed effects

Turnout Abs. margin Margin Document prob. −0.01 −0.02 0.07 0.01 −0.04 0.11 (0.01) (0.02) (0.05) (0.03) (0.06) (0.07) Procedure prob. −0.00 −0.00 0.00 0.00 −0.02 −0.02 (0.01) (0.01) (0.01) (0.01) (0.02) (0.02) Agent prob. −0.00 −0.00 0.00 0.00 −0.01 −0.01 (0.00) (0.00) (0.00) (0.00) (0.01) (0.01) Results edited −0.01∗ −0.00∗ −0.01∗ −0.00 −0.00 −0.00 (0.00) (0.00) (0.00) (0.00) (0.01) (0.01)

Bal. rejected Bal. disputed Bal. spoiled Document prob. −1.08 0.44 −0.00 −0.00 0.04 −0.19 (0.91) (0.76) (0.00) (0.00) (0.16) (0.19) Procedure prob. 0.43 0.49∗ 0.01 0.01 0.31∗ 0.32∗ (0.21) (0.22) (0.01) (0.01) (0.14) (0.14) Agent prob. 0.15 0.15 −0.00 −0.00 0.06 0.06 (0.10) (0.10) (0.00) (0.00) (0.05) (0.05) Results edited 0.70∗ 0.59∗ 0.00 0.00 0.37∗ 0.36∗ (0.09) (0.10) (0.00) (0.00) (0.03) (0.03) Controls !!! County FE !!!!!! ∗p < .05. Estimates are from linear models with standard errors in parentheses. Controls are as described in Section A.7. All models include county fixed effects and standard errors clustered by county.

Table A26: No fixed effects

Turnout Abs. margin Margin Document prob. −0.04∗ −0.04∗ −0.06 0.01 −0.46∗ −0.11 (0.01) (0.01) (0.03) (0.05) (0.08) (0.14) Procedure prob. −0.01 −0.01 −0.01 −0.00 −0.09 −0.05 (0.01) (0.01) (0.02) (0.02) (0.06) (0.06) Agent prob. −0.00 −0.00 −0.02∗ −0.01∗ −0.01 0.00 (0.00) (0.00) (0.01) (0.01) (0.02) (0.02) Results edited −0.01∗ −0.01∗ −0.02∗ −0.01∗ 0.06∗ 0.05∗ (0.00) (0.00) (0.00) (0.00) (0.01) (0.01)

Bal. rejected Bal. disputed Bal. spoiled Document prob. 0.68 0.34 −0.00 −0.00 0.07 −0.27 (0.54) (0.99) (0.01) (0.01) (0.20) (0.36) Procedure prob. 0.34 0.50 0.01 0.01 0.33∗ 0.36∗ (0.40) (0.39) (0.01) (0.01) (0.14) (0.14) Agent prob. 0.14 0.16 −0.00 −0.00 0.08∗ 0.09∗ (0.11) (0.11) (0.00) (0.00) (0.04) (0.04) Results edited 0.91∗ 0.62∗ 0.00 0.00 0.42∗ 0.36∗ (0.07) (0.07) (0.00) (0.00) (0.03) (0.03) Controls !!! ∗p < .05. Estimates are from linear models with standard errors in parentheses. Controls are as described in Section A.7.

A.10 Regression results, Figure 3 (main text)

Table A27: Predictors of electoral irregularities

Irregularity                Document        Procedure       Agent           Edited results
Population density          0.00 (0.00)     0.00 (0.00)     −0.00 (0.00)    0.00 (0.00)
Terrain ruggedness          0.00∗ (0.00)    −0.00 (0.00)    −0.00 (0.00)    0.00 (0.00)
Ethnic fractionalization    0.00∗ (0.00)    0.00 (0.00)     0.03∗ (0.01)    0.05∗ (0.01)
Poll. station isolation     0.00 (0.00)     −0.01 (0.01)    0.07∗ (0.03)    −0.15∗ (0.04)
Poverty rate                0.00 (0.00)     0.00 (0.00)     0.00 (0.00)     −0.02∗ (0.00)
Night lights                0.00 (0.00)     −0.00 (0.00)    0.01∗ (0.00)    0.01 (0.01)
Literacy rate               0.00 (0.00)     −0.00 (0.00)    −0.00 (0.00)    −0.01 (0.00)
∗p < .05. Data sources are described in Section A.7 above.

References

Bosco, Claudio, Victor Alegana, Tomas Bird, Carla Pezzulo, Linus Bengtsson, Alessandro Sorichetta, Jessica Steele, et al. 2017. “Exploring the High-Resolution Mapping of Gender-Disaggregated Development Indicators.” Journal of the Royal Society Interface 14. Cantú, Francisco. 2019. “The Fingerprints of Fraud: Evidence from Mexico’s 1988 Presidential Election.” American Political Science Review 113 (3): 710–726. Chollet, François. 2015. Keras. https://keras.io. Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. Doogan, Nathan J., Megan E. Roberts, Mary Ellen Wewers, Erin R. Tanenbaum, Elizabeth A. Mumford, and Frances A. Stillman. 2018. “Validation of a New Continuous Geographic Isolation Scale: A Tool for Rural Health Disparities Research.” Social Science & Medicine 215:123–132. Harris, J. Andrew. 2015. “What’s in a Name? A Method for Extracting Information about Ethnicity from Names.” Political Analysis 23 (2): 212–224. Linard, Catherine, Marius Gilbert, Robert W. Snow, Abdisalan M. Noor, and Andrew J. Tatem. 2012. “Population Distribution, Settlement Patterns, and Accessibility across Africa in 2010.” PLoS ONE 7 (2). National Oceanic and Atmospheric Administration. 2013. DMSP-OLS Nighttime Lights Time Series Version 4, Stable Lights Product. Image and data processing by NOAA’s National Geophysical Data Center. DMSP data collected by US Air Force Weather Agency. Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. “Rethinking the Inception Architecture for Computer Vision.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826. United States Geological Survey. 2004. USGS SRTM 30M Digital Elevation Model. Available at https://www.usgs.gov. Last accessed 11 July 2020.
