
Journal of Economic Literature 48 (June 2010): 281–355
http://www.aeaweb.org/articles.php?doi=10.1257/jel.48.2.281

Regression Discontinuity Designs in Economics

David S. Lee and Thomas Lemieux*

This paper provides an introduction and “user guide” to Regression Discontinuity (RD) designs for empirical researchers. It presents the basic theory behind the research design, details when RD is likely to be valid or invalid given economic incentives, explains why it is considered a “quasi-experimental” design, and summarizes different ways (with their advantages and disadvantages) of estimating RD designs and the limitations of interpreting these estimates. Concepts are discussed using examples drawn from the growing body of empirical research using RD. (JEL C21, C31)

1. Introduction

Regression Discontinuity (RD) designs were first introduced by Donald L. Thistlethwaite and Donald T. Campbell (1960) as a way of estimating treatment effects in a nonexperimental setting where treatment is determined by whether an observed “assignment” variable (also referred to in the literature as the “forcing” variable or the “running” variable) exceeds a known cutoff point. In their initial application of RD designs, Thistlethwaite and Campbell (1960) analyzed the impact of merit awards on future academic outcomes, using the fact that the allocation of these awards was based on an observed test score. The main idea behind the research design was that individuals with scores just below the cutoff (who did not receive the award) were good comparisons to those just above the cutoff (who did receive the award). Although this evaluation strategy has been around for almost fifty years, it did not attract much attention in economics until relatively recently.

* Lee: Princeton University and NBER. Lemieux: University of British Columbia and NBER. We thank David Autor, David Card, John DiNardo, Guido Imbens, and Justin McCrary for suggestions for this article, as well as for numerous illuminating discussions on the various topics we cover in this review. We also thank two anonymous referees for their helpful suggestions and comments, and Damon Clark, Mike Geruso, Andrew Marder, and Zhuan Pei for their careful reading of earlier drafts. Diane Alexander, Emily Buchsbaum, Elizabeth Debraggio, Enkeleda Gjeci, Ashley Hodgson, Yan Lau, Pauline Leung, and Xiaotong Niu provided excellent research assistance.

Since the late 1990s, a growing number of studies have relied on RD designs to estimate program effects in a wide variety of economic contexts. Like Thistlethwaite and Campbell (1960), early studies by Wilbert van der Klaauw (2002) and Joshua D. Angrist and Victor Lavy (1999) exploited threshold rules often used by educational institutions to estimate the effect of financial aid and class size, respectively, on educational outcomes. Sandra E. Black (1999) exploited the presence of discontinuities at the geographical level (school district boundaries) to estimate the willingness to pay for good schools. Following these early papers in the area of education, the past five years have seen a rapidly growing literature using RD designs to examine a range of questions. Examples include the labor supply effect of welfare, unemployment insurance, and disability programs; the effects of Medicaid on health outcomes; the effect of remedial education programs on educational achievement; the empirical relevance of median voter models; and the effects of unionization on wages and employment.

One important impetus behind this recent flurry of research is a recognition, formalized by Jinyong Hahn, Petra Todd, and van der Klaauw (2001), that RD designs require seemingly mild assumptions compared to those needed for other nonexperimental approaches. Another reason for the recent wave of research is the belief that the RD design is not “just another” evaluation strategy, and that causal inferences from RD designs are potentially more credible than those from typical “natural experiment” strategies (e.g., difference-in-differences or instrumental variables), which have been heavily employed in applied research in recent decades. This notion has a theoretical justification: David S. Lee (2008) formally shows that one need not assume the RD design isolates treatment variation that is “as good as randomized”; instead, such randomized variation is a consequence of agents’ inability to precisely control the assignment variable near the known cutoff.

So while the RD approach was initially thought to be “just another” program evaluation method with relatively little general applicability outside of a few specific problems, recent work in economics has shown quite the opposite.1 In addition to providing a highly credible and transparent way of estimating program effects, RD designs can be used in a wide variety of contexts covering a large number of important economic questions. These two facts likely explain why the RD approach is rapidly becoming a major element in the toolkit of empirical economists.

1 See Thomas D. Cook (2008) for an interesting history of the RD design in education research, psychology, statistics, and economics. Cook argues the resurgence of the RD design in economics is unique as it is still rarely used in other disciplines.

Despite the growing importance of RD designs in economics, there is no single comprehensive summary of what is understood about RD designs—when they succeed, when they fail, and their strengths and weaknesses.2 Furthermore, the “nuts and bolts” of implementing RD designs in practice are not (yet) covered in standard econometrics texts, making it difficult for researchers interested in applying the approach to do so. Broadly speaking, the main goal of this paper is to fill these gaps by providing an up-to-date overview of RD designs in economics and creating a guide for researchers interested in applying the method.

2 See, however, two recent overview papers by van der Klaauw (2008b) and Guido W. Imbens and Thomas Lemieux (2008) that have begun bridging this gap.

A reading of the most recent research reveals a certain body of “folk wisdom” regarding the applicability, interpretation, and recommendations of practically implementing RD designs. This article represents our attempt at summarizing what we believe to be the most important pieces of this wisdom, while also dispelling misconceptions that could potentially (and understandably) arise for those new to the RD approach.

We will now briefly summarize the most important points about RD designs to set the stage for the rest of the paper, where we systematically discuss identification, interpretation, and estimation issues. Here, and throughout the paper, we refer to the assignment variable as X. Treatment is, thus,
assigned to individuals (or “units”) with a value of X greater than or equal to a cutoff value c.

• RD designs can be invalid if individuals can precisely manipulate the “assignment variable.”

When there is a payoff or benefit to receiving a treatment, it is natural for an economist to consider how an individual may behave to obtain such benefits. For example, if students could effectively “choose” their test score X through effort, those who chose a score c (and hence received the merit award) could be somewhat different from those who chose scores just below c. The important lesson here is that the existence of a treatment being a discontinuous function of an assignment variable is not sufficient to justify the validity of an RD design. Indeed, if anything, discontinuous rules may generate incentives, causing behavior that would invalidate the RD approach.

• If individuals—even while having some influence—are unable to precisely manipulate the assignment variable, a consequence of this is that the variation in treatment near the threshold is randomized as though from a randomized experiment.

This is a crucial feature of the RD design, since it is the reason RD designs are often so compelling. Intuitively, when individuals have imprecise control over the assignment variable, even if some are especially likely to have values of X near the cutoff, every individual will have approximately the same probability of having an X that is just above (receiving the treatment) or just below (being denied the treatment) the cutoff—similar to a coin-flip experiment. This result clearly differentiates the RD and instrumental variables (IV) approaches. When using IV for causal inference, one must assume the instrument is exogenously generated as if by a coin-flip. Such an assumption is often difficult to justify (except when an actual lottery was run, as in Angrist (1990), or if there were some biological process, e.g., gender determination of a baby, mimicking a coin-flip). By contrast, the variation that RD designs isolate is randomized as a consequence of the assumption that individuals have imprecise control over the assignment variable.

• RD designs can be analyzed—and tested—like randomized experiments.

This is the key implication of the local randomization result. If variation in the treatment near the threshold is approximately randomized, then it follows that all “baseline characteristics”—all those variables determined prior to the realization of the assignment variable—should have the same distribution just above and just below the cutoff. If there is a discontinuity in these baseline covariates, then at a minimum, the underlying identifying assumption of individuals’ inability to precisely manipulate the assignment variable is unwarranted. Thus, the baseline covariates are used to test the validity of the RD design. By contrast, when employing an IV or a matching/regression-control strategy, assumptions typically need to be made about the relationship of these other covariates to the treatment and outcome variables.3

3 Typically, one assumes that, conditional on the covariates, the treatment (or instrument) is essentially “as good as” randomly assigned.

• Graphical presentation of an RD design is helpful and informative, but the visual presentation should not be

tilted toward either finding an effect or finding no effect.

It has become standard to summarize RD analyses with a simple graph showing the relationship between the outcome and assignment variables. This has several advantages. The presentation of the “raw data” enhances the transparency of the research design. A graph can also give the reader a sense of whether the “jump” in the outcome variable at the cutoff is unusually large compared to the bumps in the regression curve away from the cutoff. Also, a graphical analysis can help identify why different functional forms give different answers, and can help identify outliers, which can be a problem in any empirical analysis. The problem with graphical presentations, however, is that there is some room for the researcher to construct graphs making it seem as though there are effects when there are none, or hiding effects that truly exist. We suggest later in the paper a number of methods to minimize such biases in presentation.

• Nonparametric estimation does not represent a “solution” to functional form issues raised by RD designs. It is therefore helpful to view it as a complement to—rather than a substitute for—parametric estimation.

When the analyst chooses a parametric functional form (say, a low-order polynomial) that is incorrect, the resulting estimator will, in general, be biased. When the analyst uses a nonparametric procedure such as local linear regression—essentially running a regression using only data points “close” to the cutoff—there will also be bias.4 With a finite sample, it is impossible to know which case has a smaller bias without knowing something about the true function. There will be some functions where a low-order polynomial is a very good approximation and produces little or no bias, and therefore it is efficient to use all data points—both “close to” and “far away” from the threshold. In other situations, a polynomial may be a bad approximation, and smaller biases will occur with a local linear regression. In practice, parametric and nonparametric approaches lead to the computation of the exact same statistic.5 For example, the procedure of regressing the outcome Y on X and a treatment dummy D can be viewed as a parametric regression (as discussed above), or as a local linear regression with a very large bandwidth. Similarly, if one wanted to exclude the influence of data points in the tails of the X distribution, one could call the exact same procedure “parametric” after trimming the tails, or “nonparametric” by viewing the restriction in the range of X as a result of using a smaller bandwidth.6

4 Unless the underlying function is exactly linear in the area being examined.

5 See section 1.2 of James L. Powell (1994), where it is argued that it is more helpful to view models rather than particular statistics as “parametric” or “nonparametric.” It is shown there how the same least squares estimator can simultaneously be viewed as a solution to parametric, semiparametric, and nonparametric problems.

6 The main difference, then, between a parametric and nonparametric approach is not in the actual estimation but rather in the discussion of the asymptotic behavior of the estimator as sample sizes tend to infinity. For example, standard nonparametric asymptotics considers what would happen if the bandwidth h—the width of the “window” of observations used for the regression—were allowed to shrink as the number of observations N tended to infinity. It turns out that if h → 0 and Nh → ∞ as N → ∞, the bias will tend to zero. By contrast, with a parametric approach, when one is not allowed to make the model more flexible with more data points, the bias would generally remain—even with infinite samples.

Our main suggestion in estimation is to not rely on one particular method or specification. In any empirical analysis, results that are stable across alternative
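The point that the same regression can be read as either parametric or local linear can be made concrete with a small simulation. The data generating process below (a sine-shaped conditional expectation with a unit jump at the cutoff) is our own illustrative choice, not an example from the literature; it is a case where the global linear fit is a bad approximation and a narrow bandwidth reduces bias.

```python
import numpy as np

# The *same* regression of Y on (1, D, X - c) is "parametric" when run on
# all the data and a "local linear" estimator when run only on observations
# within a bandwidth h of the cutoff c. True discontinuity is tau = 1.
rng = np.random.default_rng(1)
n, c, tau = 20000, 0.0, 1.0

X = rng.uniform(-1, 1, n)
D = (X >= c).astype(float)
Y = tau * D + np.sin(2.5 * X) + rng.normal(0, 0.3, n)  # nonlinear CEF + jump

def rd_estimate(h):
    """Coefficient on D from a linear regression using only |X - c| < h."""
    keep = np.abs(X - c) < h
    Z = np.column_stack([np.ones(keep.sum()), D[keep], (X - c)[keep]])
    coef, *_ = np.linalg.lstsq(Z, Y[keep], rcond=None)
    return coef[1]

tau_global = rd_estimate(np.inf)  # all data: misspecified, biased here
tau_local = rd_estimate(0.1)      # narrow window: close to the true jump
print(f"global linear: {tau_global:.3f}, local (h = 0.1): {tau_local:.3f}")
```

Had the true function been exactly linear, the ranking would reverse: the global fit would be unbiased and more precise, which is exactly the trade-off described above.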

and equally plausible specifications are generally viewed as more reliable than those that are sensitive to minor changes in specification. RD is no exception in this regard.

• Goodness-of-fit and other statistical tests can help rule out overly restrictive specifications.

Often the consequence of trying many different specifications is that it may result in a wide range of estimates. Although there is no simple formula that works in all situations and contexts for weeding out inappropriate specifications, it seems reasonable, at a minimum, not to rely on an estimate resulting from a specification that can be rejected by the data when tested against a strictly more flexible specification. For example, it seems wise to place less confidence in results from a low-order polynomial model when it is rejected in favor of a less restrictive model (e.g., separate means for each discrete value of X). Similarly, there seems little reason to prefer a specification that uses all the data if using the same specification, but restricting to observations closer to the threshold, gives a substantially (and statistically) different answer.

Although we (and the applied literature) sometimes refer to the RD “method” or “approach,” the RD design should perhaps be viewed as more of a description of a particular data generating process. All other things (topic, question, and population of interest) equal, we as researchers might prefer data from a randomized experiment or from an RD design. But in reality, like the randomized experiment—which is also more appropriately viewed as a particular data generating process rather than a “method” of analysis—an RD design will simply not exist to answer a great number of questions. That said, as we show below, there has been an explosion of discoveries of RD designs that cover a wide range of interesting economic topics and questions.

The rest of the paper is organized as follows. In section 2, we discuss the origins of the RD design and show how it has recently been formalized in economics using the potential outcome framework. We also introduce an important theme that we stress throughout the paper, namely that RD designs are particularly compelling because they are close cousins of randomized experiments. This theme is more formally explored in section 3, where we discuss the conditions under which RD designs are “as good as a randomized experiment,” how RD estimates should be interpreted, and how they compare with other commonly used approaches in the program evaluation literature. Section 4 goes through the main “nuts and bolts” involved in implementing RD designs and provides a “guide to practice” for researchers interested in using the design. A summary “checklist” highlighting our key recommendations is provided at the end of this section. Implementation issues in several specific situations (discrete assignment variable, panel data, etc.) are covered in section 5. Based on a survey of the recent literature, section 6 shows that RD designs have turned out to be much more broadly applicable in economics than was originally thought. We conclude in section 7 by discussing recent progress and future prospects in using and interpreting RD designs in economics.

2. Origins and Background

In this section, we set the stage for the rest of the paper by discussing the origins and the basic structure of the RD design, beginning with the classic work of Thistlethwaite and Campbell (1960) and moving to the recent interpretation of the design using modern tools of program evaluation in economics (the potential outcomes framework). One of the main virtues of the RD approach is that it can be naturally presented using simple graphs, which greatly enhances its credibility and transparency. In light of this, the majority of concepts introduced in this section are represented in graphical terms to help capture the intuition behind the RD design.

2.1 Origins

The RD design was first introduced by Thistlethwaite and Campbell (1960) in their study of the impact of merit awards on the future academic outcomes (career aspirations, enrollment in postgraduate programs, etc.) of students. Their study exploited the fact that these awards were allocated on the basis of an observed test score. Students with test scores X greater than or equal to a cutoff value c received the award, while those with scores below the cutoff were denied the award. This generated a sharp discontinuity in the “treatment” (receiving the award) as a function of the test score. Let the receipt of treatment be denoted by the dummy variable D ∈ {0, 1}, so that we have D = 1 if X ≥ c and D = 0 if X < c.

At the same time, there appears to be no reason, other than the merit award, for future academic outcomes, Y, to be a discontinuous function of the test score. This simple reasoning suggests attributing the discontinuous jump in Y at c to the causal effect of the merit award. Assuming that the relationship between Y and X is otherwise linear, a simple way of estimating the treatment effect τ is by fitting the linear regression

(1) Y = α + Dτ + Xβ + ε,

where ε is the usual error term that can be viewed as a purely random error generating variation in the value of Y around the regression line α + Dτ + Xβ. This case is depicted in figure 1, which shows both the true underlying function and numerous realizations of ε.

Thistlethwaite and Campbell (1960) provide some graphical intuition for why the coefficient τ could be viewed as an estimate of the causal effect of the award. We illustrate their basic argument in figure 1. Consider an individual whose score X is exactly c. To get the causal effect for a person scoring c, we need guesses for what her Y would be with and without receiving the treatment. If it is “reasonable” to assume that all factors (other than the award) are evolving “smoothly” with respect to X, then B′ would be a reasonable guess for the value of Y of an individual scoring c (and hence receiving the treatment). Similarly, A″ would be a reasonable guess for that same individual in the counterfactual state of not having received the treatment. It follows that B′ − A″ would be the causal estimate. This illustrates the intuition that the RD estimates should use observations “close” to the cutoff (e.g., in this case at points c′ and c″).

There is, however, a limitation to the intuition that “the closer to c you examine, the better.” In practice, one cannot “only” use data close to the cutoff. The narrower the area that is examined, the less data there are. In this example, examining data any closer than c′ and c″ will yield no observations at all! Thus, in order to produce a reasonable guess for the treated and untreated states at X = c with finite data, one has no choice but to use data away from the discontinuity.7 Indeed, if the underlying function is truly linear, we know that the best linear unbiased estimator of τ is the coefficient on D from OLS estimation (using all of the observations) of equation (1).

7 Interestingly, the very first application of the RD design by Thistlethwaite and Campbell (1960) was based on discrete data (interval data for test scores). As a result, their paper clearly points out that the RD design is fundamentally based on an extrapolation approach.

This simple heuristic presentation illustrates two important features of the RD

[Figure 1 plots the outcome variable (Y) against the assignment variable (X), showing the discontinuity gap τ between B′ and A″ at the cutoff c, with comparison points c″ and c′ on either side.]

Figure 1. Simple Linear RD Setup

design. First, in order for this approach to work, “all other factors” determining Y must be evolving “smoothly” with respect to X. If the other variables also jump at c, then the gap τ will potentially be biased for the treatment effect of interest. Second, since an RD estimate requires data away from the cutoff, the estimate will be dependent on the chosen functional form. In this example, if the slope β were (erroneously) restricted to equal zero, it is clear the resulting OLS coefficient on D would be a biased estimate of the true discontinuity gap.

2.2 RD Designs and the Potential Outcomes Framework

While the RD design was being imported into applied economic research by studies such as van der Klaauw (2002), Black (1999), and Angrist and Lavy (1999), the identification issues discussed above were formalized in the theoretical work of Hahn, Todd, and van der Klaauw (2001), who described the RD evaluation strategy using the language of the treatment effects literature. Hahn, Todd, and van der Klaauw (2001) noted the key assumption of a valid RD design was that “all other factors” were “continuous” with respect to X, and suggested a nonparametric procedure for estimating τ that did not assume underlying linearity, as we have in the simple example above.

The necessity of the continuity assumption is seen more formally using the “potential outcomes framework” of the treatment effects literature with the aid of a graph. It is typically imagined that, for each individual i, there exists a pair of “potential” outcomes: Yi(1) for what would occur if the unit were exposed to the treatment and Yi(0) if not exposed. The causal effect of the treatment is represented by the difference Yi(1) − Yi(0).
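The misspecification point above (erroneously restricting the slope β to equal zero) is easy to verify numerically. The simulation below is our own hypothetical construction, with illustrative parameter values.

```python
import numpy as np

# When the true slope beta is nonzero but the regression omits X (i.e.,
# restricts beta to zero), the coefficient on D is badly biased: it picks
# up the difference in mean X across the two sides of the cutoff.
rng = np.random.default_rng(3)
n, c = 100000, 50.0
alpha, tau, beta = 1.0, 2.0, 0.05

X = rng.uniform(0, 100, n)
D = (X >= c).astype(float)
Y = alpha + tau * D + beta * X + rng.normal(0, 1.0, n)

def d_coefficient(include_x):
    cols = [np.ones(n), D] + ([X] if include_x else [])
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), Y, rcond=None)
    return coef[1]

tau_ok = d_coefficient(True)    # controls for X: consistent for tau = 2
tau_bad = d_coefficient(False)  # restricts beta = 0: biased
print(f"with X: {tau_ok:.3f}, without X: {tau_bad:.3f}")
```

Here E[X | D = 1] − E[X | D = 0] = 50, so the restricted regression converges to τ + 50β = 4.5 rather than τ = 2.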

[Figure 2 plots the curves E[Y(1)|X] and E[Y(0)|X] against the assignment variable (X); only the segment of each curve on its own side of the cutoff is observed, with points A, A′, B, B′, D, E, F, and Xd marking values discussed in the text.]

Figure 2. Nonlinear RD

The fundamental problem of causal inference is that we cannot observe the pair Yi(0) and Yi(1) simultaneously. We therefore typically focus on average effects of the treatment, that is, averages of Yi(1) − Yi(0) over (sub-)populations, rather than on unit-level effects.

In the RD setting, we can imagine there are two underlying relationships between average outcomes and X, represented by E[Yi(1) | X] and E[Yi(0) | X], as in figure 2. But by definition of the RD design, all individuals to the right of the cutoff (c = 2 in this example) are exposed to treatment and all those to the left are denied treatment. Therefore, we only observe E[Yi(1) | X] to the right of the cutoff and E[Yi(0) | X] to the left of the cutoff, as indicated in the figure.

It is easy to see that with what is observable, we could try to estimate the quantity

B − A = lim_{ε↓0} E[Yi | Xi = c + ε] − lim_{ε↑0} E[Yi | Xi = c + ε],

which would equal E[Yi(1) − Yi(0) | X = c]. This is the “average treatment effect” at the cutoff c.

This inference is possible because of the continuity of the underlying functions E[Yi(1) | X] and E[Yi(0) | X].8 In essence, this continuity condition enables us to use the average outcome of those right below the cutoff (who are denied the treatment) as a valid counterfactual for those right above the cutoff (who received the treatment).

8 The continuity of both functions is not the minimum that is required, as pointed out in Hahn, Todd, and van der Klaauw (2001). For example, identification is still possible even if only E[Yi(0) | X] is continuous, and only continuous at c. Nevertheless, it may seem more natural to assume that the conditional expectations are continuous for all values of X, since cases where continuity holds at the cutoff point but not at other values of X seem peculiar.

Although the potential outcome framework is very useful for understanding how RD designs work in a framework applied economists are used to dealing with, it also introduces some difficulties in terms of interpretation. First, while the continuity assumption sounds generally plausible, it is not completely clear what it means from an economic point of view. The problem is that since continuity is not required in the more traditional applications used in economics (e.g., matching on observables), it is not obvious what assumptions about the behavior of economic agents are required to get continuity.

Second, RD designs are a fairly peculiar application of a “selection on observables” model. Indeed, the view in James J. Heckman, Robert J. Lalonde, and Jeffrey A. Smith (1999) was that “[r]egression discontinuity estimators constitute a special case of selection on observables,” and that the RD estimator is “a limit form of matching at one point.” In general, we need two crucial conditions for a matching/selection on observables approach to work. First, treatment must be randomly assigned conditional on observables (the ignorability or unconfoundedness assumption). In practice, this is typically viewed as a strong, and not particularly credible, assumption. For instance, in a standard regression framework this amounts to assuming that all relevant factors are controlled for, and that no omitted variables are correlated with the treatment dummy. In an RD design, however, this crucial assumption is trivially satisfied. When X ≥ c, the treatment dummy D is always equal to 1. When X < c, D is always equal to 0. Conditional on X, there is no variation left in D, so it cannot, therefore, be correlated with any other factor.9

9 In technical terms, the treatment dummy D follows a degenerate (concentrated at D = 0 or D = 1), but nonetheless random, distribution conditional on X. Ignorability is thus trivially satisfied.

At the same time, the other standard assumption of overlap is violated since, strictly speaking, it is not possible to observe units with either D = 0 or D = 1 for a given value of the assignment variable X. This is the reason the continuity assumption is required—to compensate for the failure of the overlap condition. So while we cannot observe treatment and nontreatment for the same value of X, we can observe the two outcomes for values of X around the cutoff point that are arbitrarily close to each other.

2.3 RD Design as a Local Randomized Experiment

When looking at RD designs in this way, one could get the impression that they require some assumptions to be satisfied, while other methods such as matching on observables and IV methods simply require other assumptions.10 From this point of view, it would seem that the assumptions for the RD design are just as arbitrary as those used for other methods. As we discuss throughout the paper, however, we do not believe this way of looking at RD designs does justice to their important advantages over most other existing methods. This point becomes much clearer once we compare the RD design to the “gold standard” of program evaluation methods, randomized experiments. We will show that the RD design is a much closer cousin of randomized experiments than other competing methods.

10 For instance, in the survey of Angrist and Alan B. Krueger (1999), RD is viewed as an IV estimator, thus having essentially the same potential drawbacks and pitfalls.
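As a purely illustrative sketch of the identification argument in section 2.2, the following simulation (a hypothetical data generating process of our own) shows the gap between average outcomes in shrinking windows on either side of c converging to the average treatment effect at the cutoff.

```python
import numpy as np

# The difference in local average outcomes approximates B - A, and hence
# E[Y(1) - Y(0) | X = c] = tau, ever more closely as the window shrinks.
rng = np.random.default_rng(4)
n, c, tau = 400000, 0.0, 1.0

X = rng.uniform(-1, 1, n)
D = (X >= c).astype(float)
Y = tau * D + X**2 + 0.5 * X + rng.normal(0, 0.2, n)  # smooth CEFs + jump

estimates = {}
for h in [0.5, 0.1, 0.02]:
    above = Y[(X >= c) & (X < c + h)].mean()   # approximates B
    below = Y[(X >= c - h) & (X < c)].mean()   # approximates A
    estimates[h] = above - below
    print(f"h = {h}: estimated B - A = {estimates[h]:.3f}")
```

With this construction the expected gap is τ + 0.5h (the quadratic term is symmetric and cancels), so the estimate tightens toward the true τ = 1 as h shrinks, at the cost of using fewer observations.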

[Figure 3 plots flat curves E[Y(1)|X] (observed for the treatment group) and E[Y(0)|X] (observed for the control group) against the assignment variable (a random number, X) over the range 0 to 4.]

Figure 3. Randomized Experiment as a RD Design

In a randomized experiment, units are typically divided into treatment and control groups on the basis of a randomly generated number, ν. For example, if ν follows a uniform distribution over the range [0, 4], units with ν ≥ 2 are given the treatment while units with ν < 2 are denied treatment. So the randomized experiment can be thought of as an RD design where the assignment variable is X = ν and the cutoff is c = 2. Figure 3 shows this special case in the potential outcomes framework, just as in the more general RD design case of figure 2. The difference is that because the assignment variable X is now completely random, it is independent of the potential outcomes Yi(0) and Yi(1), and the curves E[Yi(1) | X] and E[Yi(0) | X] are flat. Since the curves are flat, it trivially follows that they are also continuous at the cutoff point X = c. In other words, continuity is a direct consequence of randomization.

The fact that the curves E[Yi(1) | X] and E[Yi(0) | X] are flat in a randomized experiment implies that, as is well known, the average treatment effect can be computed as the difference in the mean value of Y on the right and left hand side of the cutoff. One could also use an RD approach by running regressions of Y on X, but this would be less efficient since we know that if randomization were successful, then X is an irrelevant variable in this regression.

But now imagine that, for ethical reasons, people are compensated for having received a “bad draw” by getting a monetary compensation inversely proportional to the random number X. For example, the treatment could be job search assistance for the unemployed, and the outcome whether one found a job within a month of receiving the treatment. If people with a larger monetary compensation can afford to take more time looking for a job, the potential outcome curves will no longer be flat and will slope upward. The reason is that having a higher random number, i.e., a lower monetary compensation, increases the probability of finding a job. So in this “smoothly contaminated” randomized experiment, the potential outcome curves will instead look like the classical RD design case depicted in figure 2.

Unlike a classical randomized experiment, in this contaminated experiment a simple comparison of means no longer yields a consistent estimate of the treatment effect. By focusing right around the threshold, however, an RD approach would still yield a consistent estimate of the treatment effect associated with job search assistance. The reason is that since people just above or below the cutoff receive (essentially) the same monetary compensation, we still have locally a randomized experiment around the cutoff point. Furthermore, as in a randomized experiment, it is possible to test whether randomization “worked” by comparing the local values of baseline covariates on the two sides of the cutoff value.

Of course, this particular example is highly artificial. Since we know the monetary compensation is a continuous function of X, we also know the continuity assumption required for the RD estimates of the treatment effect to be consistent is also satisfied. The important result, due to Lee (2008), that we will show in the next section is that the conditions under which we locally have a randomized experiment (and continuity) right around the cutoff point are remarkably weak. Furthermore, in addition to being weak, the conditions for local randomization are testable in the same way global randomization is testable in a randomized experiment, by looking at whether baseline covariates are balanced. It is in this sense that the RD design is more closely related to randomized experiments than to other popular program evaluation methods such as matching on observables, difference-in-differences, and IV.

3. Identification and Interpretation

This section discusses a number of issues of identification and interpretation that arise when considering an RD design. Specifically, the applied researcher may be interested in knowing the answers to the following questions:

1. How do I know whether an RD design is appropriate for my context? When are the identification assumptions plausible or implausible?

2. Is there any way I can test those assumptions?

3. To what extent are results from RD designs generalizable?

On the surface, the answers to these questions seem straightforward: (1) “An RD design will be appropriate if it is plausible that all other unobservable factors are ‘continuously’ related to the assignment variable,” (2) “No, the continuity assumption is necessary, so there are no tests for the validity of the design,” and (3) “The RD estimate of the treatment effect is only applicable to the subpopulation of individuals at the discontinuity threshold, and uninformative about the effect anywhere else.” These answers suggest that the RD design is no more compelling than, say, an instrumental variables approach, for which the analogous answers would be (1) “The instrument must be uncorrelated with the error in the outcome equation,” (2) “The identification assumption is ultimately untestable,” and (3) “The estimated treatment effect is applicable to the subpopulation whose treatment was affected by the instrument.” After all, who’s to say whether one untestable design is more “compelling” or “credible” than another untestable design? And it would seem that having a treatment effect for a vanishingly small subpopulation (those at the threshold, in the limit) is hardly more (and probably much less) useful than that for a population “affected by the instrument.”

As we describe below, however, a closer examination of the RD design reveals quite different answers to the above three questions:

1. “When there is a continuously distributed stochastic error component to the assignment variable—which can occur when optimizing agents do not have precise control over the assignment variable—then the variation in the treatment will be as good as randomized in a neighborhood around the discontinuity threshold.”

2. “Yes. As in a randomized experiment, the distribution of observed baseline covariates should not change discontinuously at the threshold.”

3.

3.1 Valid or Invalid RD?

Are individuals able to influence the assignment variable, and if so, what is the nature of this control? This is probably the most important question to ask when assessing whether a particular application should be analyzed as an RD design. If individuals have a great deal of control over the assignment variable and if there is a perceived benefit to a treatment, one would certainly expect individuals on one side of the threshold to be systematically different from those on the other side.

Consider the test-taking RD example. Suppose there are two types of students: A and B. Suppose type A students are more able than B types, and that A types are also keenly aware that passing the relevant threshold (50 percent) will give them a scholarship benefit, while B types are completely ignorant of the scholarship and the rule. Now suppose that 50 percent of the questions are trivial to answer correctly but, due to random chance, students will sometimes make careless errors when they initially answer the test questions, but would certainly correct the errors if they checked their work. In this scenario, only type A students will make sure
“The RD estimand can be interpreted to check their answers before turning in the as a weighted average treatment effect, exam, thereby assuring themselves of a pass- where the weights are the relative ex ing score. Thus, while we would expect those ante probability that the value of an who barely passed the exam to be a mixture individual’s assignment variable will be of type A and type B students, those who in the neighborhood of the threshold.” barely failed would exclusively be type B students. In this example, it is clear that the Thus, in many contexts, the RD design marginal failing students do not represent a may have more in common with random- valid counterfactual for the marginal passing ized experiments (or circumstances when an students. Analyzing this scenario within an instrument is truly randomized)—in terms RD framework would be inappropriate. of their “internal validity” and how to imple- On the other hand, consider the same sce- ment them in practice—than with regression nario, except assume that questions on the control or matching methods, instrumental exam are not trivial; there are no guaran- variables, or panel data approaches. We will teed passes, no matter how many times the return to this point after first discussing the students check their answers before turn- above three issues in greater detail. ing in the exam. In this case, it seems more Lee and Lemieux: Regression Discontinuity Designs in Economics 293 plausible that, among those scoring near the 3.1.1 Randomized Experiments from threshold, it is a matter of “luck” as to which Nonrandom Selection side of the threshold they land. Type A stu- dents can exert more effort—because they To see how the inability to precisely con- know a scholarship is at stake—but they do trol the assignment variable leads to a source not know the exact score they will obtain. 
In of randomized variation in the treatment, this scenario, it would be reasonable to argue consider a simplified formulation of the RD that those who marginally failed and passed design:11 would be otherwise comparable, and that an RD analysis would be appropriate and would (2) Y D W U = τ + δ1 + yield credible estimates of the impact of the scholarship. D 1[X c] = ≥ These two examples make it clear that one must have some knowledge about the mech- X W V, = δ2 + anism generating the assignment variable beyond knowing that, if it crosses the thresh- where Y is the outcome of interest, D is the old, the treatment is “turned on.” It is “folk binary treatment indicator, and W is the wisdom” in the literature to judge whether vector of all predetermined and observable the RD is appropriate based on whether characteristics of the individual that might individuals could manipulate the assignment impact the outcome and/or the assignment variable and precisely “sort” around the dis- variable X. continuity threshold. The key word here is This model looks like a standard endog- “precise” rather than “manipulate.” After enous dummy variable set-up, except that all, in both examples above, individuals do we observe the assignment variable, X. This exert some control over the test score. And allows us to relax most of the other assump- indeed, in virtually every known application tions usually made in this type of model. of the RD design, it is easy to tell a plausi- First, we allow W to be endogenously deter- ble story that the assignment variable is to mined as long as it is determined prior to some degree influenced by someone. But V. Second, we take no stance as to whether individuals will not always be able to have some elements of or are zero (exclusion δ1 δ2 precise control over the assignment variable. restrictions). 
Third, we make no assump- It should perhaps seem obvious that it is nec- tions about the correlations between W, U, essary to rule out precise sorting to justify and V.12 the use of an RD design. After all, individ- In this model, individual heterogeneity in ual self-selection into treatment or control the outcome is completely described by the regimes is exactly why simple comparison of pair of random variables (W, U); anyone with means is unlikely to yield valid causal infer- the same values of (W, U) will have one of ences. Precise sorting around the threshold two values for the outcome, depending on is self-selection. whether they receive ­treatment. Note that, What is not obvious, however, is that, when one formalizes the notion of having 11 We use a simple linear endogenous dummy variable imprecise control over the assignment vari- setup to describe the results in this section, but all of the able, there is a striking consequence: the results could be stated within the standard potential out- variation in the treatment in a neighborhood comes framework, as in Lee (2008). 12 This is much less restrictive than textbook descrip- of the threshold is “as good as randomized.” tions of endogenous dummy variable systems. It is typically We explain this below. assumed that (U, V ) is independent of W. 294 Journal of Economic Literature, Vol. XLVIII (June 2010)
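The two test-taking scenarios can be illustrated with a small simulation. The sketch below is purely illustrative (the sample size, score distributions, and type shares are all invented for the example): under precise sorting, the barely-failing group contains no type A students at all, while under imprecise control the type mix is nearly identical on both sides of the cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
cutoff = 50.0

# Two student types: A (aware of the scholarship rule) and B (ignorant).
type_a = rng.random(n) < 0.5
# Latent score before any checking; type A students are somewhat more able.
base = rng.normal(np.where(type_a, 52.0, 48.0), 5.0, size=n)

# Scenario 1: precise sorting. Type A students re-check their answers and
# guarantee themselves at least a passing score.
score1 = np.where(type_a, np.maximum(base, cutoff), base)

# Scenario 2: imprecise control. Type A students exert extra effort (a
# higher mean) but cannot control which side of the cutoff they land on.
score2 = base + np.where(type_a, 2.0, 0.0)

def share_type_a(score, lo, hi):
    """Share of type A students among those scoring in [lo, hi)."""
    window = (score >= lo) & (score < hi)
    return type_a[window].mean()

below1 = share_type_a(score1, cutoff - 0.5, cutoff)
above1 = share_type_a(score1, cutoff, cutoff + 0.5)
below2 = share_type_a(score2, cutoff - 0.5, cutoff)
above2 = share_type_a(score2, cutoff, cutoff + 0.5)

print(f"precise sorting:   {below1:.2f} below vs {above1:.2f} above")
print(f"imprecise control: {below2:.2f} below vs {above2:.2f} above")
```

Under precise sorting, the two sides of the cutoff are not comparable (the barely-failing students are all type B), which is exactly the invalid-RD case in the text; under imprecise control, the type composition moves continuously through the cutoff.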

[Figure 4. Density of Assignment Variable Conditional on W = w, U = u. The figure plots density against x and shows three cases: “complete control” (a degenerate spike), “precise control” (a distribution truncated just below the cutoff), and “imprecise control” (a smooth, untruncated distribution).]

Note that, since RD designs are implemented by running regressions of Y on X, equation (2) looks peculiar since X is not included with W and U on the right hand side of the equation. We could add a function of X to the outcome equation, but this would not make a difference since we have not made any assumptions about the joint distribution of W, U, and V. For example, our setup allows for the case where U = δ3X + U′, which yields the outcome equation Y = Dτ + Wδ1 + δ3X + U′. For the sake of simplicity, we work with the simple case where X is not included on the right hand side of the equation.[13]

[13] When RD designs are implemented in practice, the estimated effect of X on Y can either reflect a true causal effect of X on Y or a spurious correlation between X and the unobservable term U. Since it is not possible to distinguish between these two effects in practice, we simplify the setup by implicitly assuming that X only comes into equation (2) indirectly, through its (spurious) correlation with U.

Now consider the distribution of X, conditional on a particular pair of values W = w, U = u. It is equivalent (up to a translational shift) to the distribution of V conditional on W = w, U = u. If an individual has complete and exact control over X, we would model it as having a degenerate distribution, conditional on W = w, U = u. That is, in repeated trials, this individual would choose the same score. This is depicted in figure 4 as the thick line.

If there is some room for error, but individuals can nevertheless have precise control about whether they will fail to receive the treatment, then we would expect the density of X to be zero just below the threshold, but positive just above the threshold, as depicted in figure 4 as the truncated distribution. This density would be one way to model the first example described above for the type A students. Since type A students know about the scholarship, they will double-check their answers and make sure they answer the easy questions, which comprise 50 percent of the test. How high they score above the passing threshold will be determined by some randomness.

Finally, if there is stochastic error in the assignment variable and individuals do not have precise control over the assignment variable, we would expect the density of X (and hence V), conditional on W = w, U = u, to be continuous at the discontinuity threshold, as shown in figure 4 as the untruncated distribution.[14] It is important to emphasize that, in this final scenario, the individual still has control over X: through her efforts, she can choose to shift the distribution to the right. This is the density for someone with W = w, U = u, but it may well be different—with a different mean, variance, or shape of the density—for other individuals, with different levels of ability, who make different choices. We are assuming, however, that all individuals are unable to precisely control the score just around the threshold.

[14] For example, this would be plausible when X is a test score modeled as a sum of Bernoulli random variables, which is approximately normal by the central limit theorem.

Definition: We say individuals have imprecise control over X when, conditional on W = w and U = u, the density of V (and hence X) is continuous.

When individuals have imprecise control over X, this leads to the striking implication that variation in treatment status will be randomized in a neighborhood of the threshold. To see this, note that by Bayes’ Rule, we have

(3)  Pr[W = w, U = u | X = x] = (f(x | W = w, U = u) Pr[W = w, U = u]) / f(x),

where f(∙) and f(∙|∙) are the marginal and conditional densities of X. So when f(x | W = w, U = u) is continuous in x, the right hand side will be continuous in x, which therefore means that the distribution of W, U conditional on X will be continuous in x.[15] That is, all observed and unobserved predetermined characteristics will have identical distributions on either side of x = c, in the limit, as we examine smaller and smaller neighborhoods of the threshold.

[15] Since the potential outcomes Y(0) and Y(1) are functions of W and U, it follows that the distribution of Y(0) and Y(1) conditional on X is also continuous in x when individuals have imprecise control over X. This implies that the conditions usually invoked for consistently estimating the treatment effect (the conditional means E[Y(0) | X = x] and E[Y(1) | X = x] being continuous in x) are also satisfied. See Lee (2008) for more detail.

In sum,

Local Randomization: If individuals have imprecise control over X as defined above, then Pr[W = w, U = u | X = x] is continuous in x: the treatment is “as good as” randomly assigned around the cutoff.

In other words, the behavioral assumption that individuals do not precisely manipulate X around the threshold has the prediction that treatment is locally randomized.

This is perhaps why RD designs can be so compelling. A deeper investigation into the real-world details of how X (and hence D) is determined can help assess whether it is plausible that individuals have precise or imprecise control over X. By contrast, with most nonexperimental evaluation contexts, learning about how the treatment variable is determined will rarely lead one to conclude that it is “as good as” randomly assigned.

3.2 Consequences of Local Random Assignment

There are three practical implications of the above local random assignment result.

3.2.1 Identification of the Treatment Effect

First and foremost, it means that the discontinuity gap at the cutoff identifies the treatment effect of interest. Specifically, we have

lim_{ε↓0} E[Y | X = c + ε] − lim_{ε↑0} E[Y | X = c + ε]
    = τ + lim_{ε↓0} Σ_{w,u} (wδ1 + u) Pr[W = w, U = u | X = c + ε]
        − lim_{ε↑0} Σ_{w,u} (wδ1 + u) Pr[W = w, U = u | X = c + ε]
    = τ,

where the last line follows from the continuity of Pr[W = w, U = u | X = x].

As we mentioned earlier, nothing changes if we augment the model by adding a direct impact of X itself in the outcome equation, as long as the effect of X on Y does not jump at the cutoff. For example, in the example of Thistlethwaite and Campbell (1960), we can allow higher test scores to improve future academic outcomes (perhaps by raising the probability of admission to higher quality schools) as long as that probability does not jump at precisely the same cutoff used to award scholarships.

3.2.2 Testing the Validity of the RD Design

An almost equally important implication of the above local random assignment result is that it makes it possible to empirically assess the prediction that Pr[W = w, U = u | X = x] is continuous in x. Although it is impossible to test this directly—since U is unobserved—it is nevertheless possible to assess whether Pr[W = w | X = x] is continuous in x at the threshold. A discontinuity would indicate a failure of the identifying assumption.

This is akin to the tests performed to empirically assess whether the randomization was carried out properly in randomized experiments. It is standard in these analyses to demonstrate that treatment and control groups are similar in their observed baseline covariates. It is similarly impossible to test whether unobserved characteristics are balanced in the experimental context, so the most favorable statement that can be made about the experiment is that the data “failed to reject” the assumption of randomization.

Performing this kind of test is arguably more important in the RD design than in the experimental context. After all, the true nature of individuals’ control over the assignment variable—and whether it is precise or imprecise—may well be somewhat debatable, even after a great deal of investigation into the exact treatment-assignment mechanism (which itself is always advisable to do). Imprecision of control will often be nothing more than a conjecture, but thankfully it has testable predictions.

There is a complementary, and arguably more direct and intuitive, test of the imprecision of control over the assignment variable: examination of the density of X itself, as suggested in Justin McCrary (2008). If the density of X for each individual is continuous, then the marginal density of X over the population should be continuous as well. A jump in the density at the threshold is probably the most direct evidence of some degree of sorting around the threshold, and should provoke serious skepticism about the appropriateness of the RD design.[16] Furthermore, one advantage of the test is that it can always be performed in an RD setting, while testing whether the covariates W are balanced at the threshold depends on the availability of data on these covariates.

[16] Another possible source of discontinuity in the density of the assignment variable X is selective attrition. For example, John DiNardo and Lee (2004) look at the effect of unionization on wages several years after a union representation vote was taken. In principle, if firms that were unionized because of a majority vote are more likely to close down, then conditional on firm survival at a later date, there will be a discontinuity in X (the vote share) that could threaten the validity of the RD design for estimating the effect of unionization on wages (conditional on survival). In that setting, testing for a discontinuity in the density (conditional on survival) is similar to testing for selective attrition (linked to treatment status) in a standard randomized experiment.

This test is also a partial one. Whether each individual’s ex ante density of X is continuous is fundamentally untestable since, for each individual, we only observe one realization of X. Thus, in principle, at the threshold some individuals’ densities may jump up while others may sharply fall, so that in the aggregate, positives and negatives offset each other, making the density appear continuous. In recent applications of RD, such occurrences seem far-fetched. Even if this were the case, one would certainly expect to see, after stratifying by different values of the observable characteristics, some discontinuities in the density of X. These discontinuities could be detected by performing the local randomization test described above.

3.2.3 Irrelevance of Including Baseline Covariates

A consequence of a randomized experiment is that the assignment to treatment is, by construction, independent of the baseline covariates. As such, it is not necessary to include them to obtain consistent estimates of the treatment effect. In practice, however, researchers will include them in regressions, because doing so can reduce the sampling variability in the estimator. Arguably the greatest potential for this occurs when one of the baseline covariates is a pre-random-assignment observation on the dependent variable, which may likely be highly correlated with the post-assignment outcome variable of interest.

The local random assignment result allows us to apply these ideas to the RD context. For example, if the lagged value of the dependent variable was determined prior to the realization of X, then the local randomization result will imply that that lagged dependent variable will have a continuous relationship with X. Thus, performing an RD analysis on Y minus its lagged value should also yield the treatment effect of interest. The hope, however, is that the differenced outcome measure will have a sufficiently lower variance than the level of the outcome, so as to lower the variance in the RD estimator. More formally, we have

lim_{ε↓0} E[Y − Wπ | X = c + ε] − lim_{ε↑0} E[Y − Wπ | X = c + ε]
    = τ + lim_{ε↓0} Σ_{w,u} (w(δ1 − π) + u) Pr[W = w, U = u | X = c + ε]
        − lim_{ε↑0} Σ_{w,u} (w(δ1 − π) + u) Pr[W = w, U = u | X = c + ε]
    = τ,

where Wπ is any linear function of W, and W can include a lagged dependent variable, for example. We return to how to implement this in practice in section 4.4.
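The identification result in section 3.2.1 can be checked by simulating the simplified model in equation (2) and comparing a naive comparison of means with a local comparison around the cutoff. A minimal sketch (all parameter values are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
tau, delta1, delta2, c = 2.0, 1.0, 1.0, 0.0

# Model (2): W affects both the outcome and the assignment variable,
# and U is allowed to be correlated with W.
w = rng.normal(size=n)
u = 0.5 * w + rng.normal(size=n)
v = rng.normal(size=n)          # imprecise control: V has a continuous density
x = delta2 * w + v
d = (x >= c).astype(float)
y = tau * d + delta1 * w + u

# Naive comparison of means: biased, because treated units have
# systematically higher W (and hence higher U).
naive = y[d == 1].mean() - y[d == 0].mean()

# RD: comparison of means in a small window around the cutoff.
h = 0.05
rd_gap = (y[(x >= c) & (x < c + h)].mean()
          - y[(x < c) & (x >= c - h)].mean())

print(f"true tau = {tau}, naive = {naive:.2f}, RD gap = {rd_gap:.2f}")
```

Shrinking the window width h trades residual bias against sampling noise; the local comparison of means is used here only to make the identification argument concrete, not as a recommended estimator.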

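The density test of section 3.2.2 is, at heart, a comparison of the histogram just below and just above the cutoff. A stylized sketch (the manipulation mechanism and all numbers are hypothetical, and this bin comparison is only a crude stand-in for McCrary's formal procedure):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
c, h = 50.0, 0.5

# No manipulation: a smooth, continuous density of the assignment variable.
score = rng.normal(50.0, 5.0, size=n)

# Manipulation: half of those who would barely fail push themselves
# just over the cutoff, carving a gap out of the density below c.
manip = score.copy()
push = ((manip >= c - h) & (manip < c)) & (rng.random(n) < 0.5)
manip[push] = c + h * rng.random(push.sum())

def density_ratio(x):
    """Count in the bin just above the cutoff over the bin just below."""
    above = ((x >= c) & (x < c + h)).sum()
    below = ((x >= c - h) & (x < c)).sum()
    return above / below

print(f"no sorting: {density_ratio(score):.2f}")
print(f"sorting:    {density_ratio(manip):.2f}")
```

With no sorting the ratio is close to one; with sorting the bin just below the cutoff is hollowed out and the bin just above is inflated, producing exactly the kind of density jump that should provoke skepticism about the design.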
3.3 Generalizability: The RD Gap as a Weighted Average Treatment Effect

In the presence of heterogeneous treatment effects, the discontinuity gap in an RD design can be interpreted as a weighted average treatment effect across all individuals. This is somewhat contrary to the temptation to conclude that the RD design only delivers a credible treatment effect for the subpopulation of individuals at the threshold, and says nothing about the treatment effect “away from the threshold.” Depending on the context, this may be an overly simplistic and pessimistic assessment.

Consider the scholarship test example again, and define the “treatment” as “receiving a scholarship by scoring 50 percent or greater on the scholarship exam.” Recall that the pair W, U characterizes individual heterogeneity. We now let τ(w, u) denote the treatment effect for an individual with W = w and U = u, so that the outcome equation in (2) is instead given by

Y = Dτ(W, U) + Wδ1 + U.

This is essentially a model of completely unrestricted heterogeneity in the treatment effect. Following the same line of argument as above, we obtain

(5)  lim_{ε↓0} E[Y | X = c + ε] − lim_{ε↑0} E[Y | X = c + ε]
     = Σ_{w,u} τ(w, u) Pr[W = w, U = u | X = c]
     = Σ_{w,u} τ(w, u) (f(c | W = w, U = u) / f(c)) Pr[W = w, U = u],

where the second line follows from equation (3).

The discontinuity gap, then, is a particular kind of average treatment effect across all individuals. If not for the term f(c | W = w, U = u)/f(c), it would be the average treatment effect for the entire population. The presence of this ratio implies the discontinuity is instead a weighted average treatment effect, where the weights are directly proportional to the ex ante likelihood that an individual’s realization of X will be close to the threshold. All individuals could get some weight, and the similarity of the weights across individuals is ultimately untestable, since again we only observe one realization of X per person and do not know anything about the ex ante probability distribution of X for any one individual. The weights may be relatively similar across individuals, in which case the RD gap would be closer to the overall average treatment effect; but if the weights are highly varied and also related to the magnitude of the treatment effect, then the RD gap would be very different from the overall average treatment effect. While it is not possible to know how close the RD gap is to the overall average treatment effect, it remains the case that the treatment effect estimated using an RD design is averaged over a larger population than one would have anticipated from a purely “cutoff” interpretation.

Of course, we do not observe the density of the assignment variable at the individual level, so we do not know the weight for each individual. Indeed, if the signal-to-noise ratio of the test is extremely high, someone who scores a 90 percent may have almost a zero chance of scoring near the threshold, implying that the RD gap is almost entirely dominated by those who score near 50 percent. But if the reliability is lower, then the RD gap applies to a relatively broader subpopulation. It remains to be seen whether or not, and how, information on the reliability, or a second test measurement, or other covariates that can predict the assignment could be used in conjunction with the RD gap to learn about average treatment effects for the overall population. The understanding of the RD gap as a weighted average treatment effect serves to highlight that RD causal evidence is not somehow fundamentally disconnected from the average treatment effect that is often of interest to researchers.

It is important to emphasize that the RD gap is not informative about the treatment if it were defined as “receipt of a scholarship that is awarded by scoring 90 percent or higher on the scholarship exam.” This is not so much a “drawback” of the RD design as a limitation shared with even a carefully controlled randomized experiment. For example, if we randomly assigned financial aid awards to low-achieving students, whatever treatment effect we estimate may not be informative about the effect of financial aid for high-achieving students.

In some contexts, the treatment effect “away from the discontinuity threshold” may not make much practical sense. Consider the RD analysis of incumbency in congressional elections of Lee (2008). When the treatment is “being the incumbent party,” it is implicitly understood that incumbency entails winning the previous election by obtaining at least 50 percent of the vote.[17] In the election context, the treatment “being the incumbent party by virtue of winning an election, whereby 90 percent of the vote is required to win” simply does not apply to any real-life situation. Thus, in this context, it is awkward to interpret the RD gap as “the effect of incumbency that exists at the 50 percent vote-share threshold” (as if there were an effect at a 90 percent threshold). Instead, it is more natural to interpret the RD gap as estimating a weighted average treatment effect of incumbency across all districts, where more weight is given to those districts in which a close election race was expected.

[17] For this example, consider the simplified case of a two-party system.
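Equation (5) can be checked numerically: with two groups that have different treatment effects and different ex ante chances of landing near the cutoff, the RD gap matches the density-weighted average rather than the unweighted average treatment effect. A stylized sketch (group structure and all parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
c, h = 0.0, 0.05

# Two groups with different treatment effects; group 2 rarely scores
# near the cutoff because its X is centered two standard deviations away.
group2 = rng.random(n) < 0.5
tau_i = np.where(group2, 3.0, 1.0)
x = rng.normal(np.where(group2, 2.0, 0.0), 1.0, size=n)
d = (x >= c).astype(float)
y = tau_i * d + rng.normal(size=n)

rd_gap = (y[(x >= c) & (x < c + h)].mean()
          - y[(x < c) & (x >= c - h)].mean())

# Equation (5): weights proportional to each group's density of X at c
# (normal densities up to a common constant, which cancels in the ratio).
f1, f2 = np.exp(-0.5 * 0.0**2), np.exp(-0.5 * 2.0**2)
weighted = (1.0 * f1 + 3.0 * f2) / (f1 + f2)
ate = tau_i.mean()

print(f"RD gap = {rd_gap:.2f}, weighted average = {weighted:.2f}, ATE = {ate:.2f}")
```

All individuals receive some weight, but the group that is unlikely to score near the cutoff is heavily down-weighted, so the RD gap sits well below the unweighted average treatment effect of 2 here.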
3.4 Variations on the Regression Discontinuity Design

To this point, we have focused exclusively on the “classic” RD design introduced by Thistlethwaite and Campbell (1960), whereby there is a single binary treatment and the assignment variable perfectly predicts treatment receipt. We now discuss two variants of this base case: (1) when there is so-called “imperfect compliance” with the rule and (2) when the treatment of interest is a continuous variable.

In both cases, the notion that the RD design generates local variation in treatment that is “as good as randomly assigned” is helpful because we can apply known results for randomized instruments to the RD design, as we do below. The notion is also helpful for addressing other data problems, such as differential attrition or sample selection, whereby the treatment affects whether or not you observe the outcome of interest. The local random assignment result means that, in principle, one could extend the ideas of Joel L. Horowitz and Charles F. Manski (2000) or Lee (2009), for example, to provide bounds on the treatment effect, accounting for possible sample selection bias.

3.4.1 Imperfect Compliance: The “Fuzzy” RD

In many settings of economic interest, treatment is determined partly by whether the assignment variable crosses a cutoff point. This situation is very important in practice for a variety of reasons, including cases of imperfect take-up by program participants or when factors other than the threshold rule affect the probability of program participation. Starting with William M. K. Trochim (1984), this setting has been referred to as a “fuzzy” RD design. In the case we have discussed so far—the “sharp” RD design—the probability of treatment jumps from 0 to 1 when X crosses the threshold c. The fuzzy RD design allows for a smaller jump in the probability of assignment to the treatment at the threshold and only requires

lim_{ε↓0} Pr(D = 1 | X = c + ε) ≠ lim_{ε↑0} Pr(D = 1 | X = c + ε).

Since the probability of treatment jumps by less than one at the threshold, the jump in the relationship between Y and X can no longer be interpreted as an average treatment effect. As in an instrumental variable setting, however, the treatment effect can be recovered by dividing the jump in the relationship between Y and X at c by the fraction induced to take up the treatment at the threshold—in other words, by the discontinuity jump in the relation between D and X. In this setting, the treatment effect can be written as

τF = (lim_{ε↓0} E[Y | X = c + ε] − lim_{ε↑0} E[Y | X = c + ε]) / (lim_{ε↓0} E[D | X = c + ε] − lim_{ε↑0} E[D | X = c + ε]),

where the subscript “F” refers to the fuzzy RD design.

There is a close analogy between how the treatment effect is defined in the fuzzy RD design and in the well-known “Wald” formulation of the treatment effect in an instrumental variables setting. Hahn, Todd, and van der Klaauw (2001) were the first to show this important connection and to suggest estimating the treatment effect using two-stage least squares (TSLS) in this setting. We discuss estimation of fuzzy RD designs in greater detail in section 4.3.3.

Hahn, Todd, and van der Klaauw (2001) furthermore pointed out that the interpretation of this ratio as a causal effect requires the same assumptions as in Imbens and Angrist (1994). That is, one must assume “monotonicity” (i.e., X crossing the cutoff cannot simultaneously cause some units to take up and others to reject the treatment) and “excludability” (i.e., X crossing the cutoff cannot impact Y except through impacting receipt of treatment). When these assumptions are made, it follows that[18]

τF = E[Y(1) − Y(0) | unit is complier, X = c],

where “compliers” are units that receive the treatment when they satisfy the cutoff rule (Xi ≥ c), but would not otherwise receive it.

[18] See Imbens and Lemieux (2008) for a more formal exposition.

In summary, if there is local random assignment (e.g., due to the plausibility of individuals’ imprecise control over X), then we can simply apply all of what is known about the assumptions and interpretability of instrumental variables. The difference between the “sharp” and “fuzzy” RD designs is exactly parallel to the difference between the randomized experiment with perfect compliance and the case of imperfect compliance, when only the “intent to treat” is randomized.

For example, in the case of imperfect compliance, even if a proposed binary instrument Z is randomized, it is necessary to rule out the possibility that Z affects the outcome outside of its influence through treatment receipt, D. Only then will the instrumental variables estimand—the ratio of the reduced-form effects of Z on Y and of Z on D—be properly interpreted as a causal effect of D on Y. Similarly, supposing that individuals do not have precise control over X, it is necessary to assume that whether X crosses the threshold c (the instrument) has no impact on Y except by influencing D. Only then will the ratio of the two RD gaps in Y and D be properly interpreted as a causal effect of D on Y.

In the same way that it is important to verify a strong first-stage relationship in an IV design, it is equally important to verify that a discontinuity exists in the relationship between D and X in a fuzzy RD design.

Furthermore, in this binary-treatment–binary-instrument context with unrestricted heterogeneity in treatment effects, the IV estimand is interpreted as the average treatment effect “for the subpopulation affected by the instrument” (or LATE). Analogously, the ratio of the RD gaps in Y and D (the “fuzzy design” estimand) can be interpreted as a weighted LATE, where the weights reflect the ex ante likelihood that the individual’s X is near the threshold. In both cases, the exclusion restriction and monotonicity condition must hold.

3.4.2 Continuous Endogenous Regressor

In a context where the “treatment” is a continuous variable—call it T—and there is a randomized binary instrument Z (that can additionally be excluded from the outcome equation), an IV approach is an obvious way of obtaining an estimate of the impact of T on Y. The IV estimand is the reduced-form impact of Z on Y divided by the first-stage impact of Z on T.

The same is true for an RD design when the regressor of interest is continuous. Again, the causal impact of interest will still be the ratio of the two RD gaps (i.e., the discontinuities in Y and T). To see this more formally, consider the model

(6)  Y = Tγ + Wδ1 + U1
     T = Dϕ + Wγ1 + U2
     D = 1[X ≥ c]
     X = Wδ2 + V,

which is the same set-up as before, except with the added second equation, allowing for imperfect compliance or other factors (observables W or unobservables U2) to impact the continuous regressor of interest T. If γ1 = 0 and U2 = 0, then the model collapses to a “sharp” RD design (with a continuous regressor).

Note that we make no additional assumptions about U2 (in terms of its correlation with W or V). We do continue to assume imprecise control over X (conditional on W and U1, the density of X is continuous).[19]

[19] Although it would be unnecessary to do so for the identification of γ, it would probably be more accurate to describe the situation of imprecise control with the continuity of the density of X conditional on the three variables (W, U1, U2). This is because U2 is now another variable characterizing heterogeneity in individuals.

Given the discussion so far, it is easy to show that

(7)  lim_{ε↓0} E[Y | X = c + ε] − lim_{ε↑0} E[Y | X = c + ε]
     = γ (lim_{ε↓0} E[T | X = c + ε] − lim_{ε↑0} E[T | X = c + ε]).

The left hand side is simply the “reduced-form” discontinuity in the relation between Y and X. The term multiplying γ on the right hand side is the “first-stage” discontinuity in the relation between T and X, which is also estimable from the data. Thus, analogous to the exactly identified instrumental variable case, the ratio of the two discontinuities yields γ: the effect of T on Y. Again, because of the added notion of imperfect compliance, it is important to assume that D (X crossing the threshold) does not directly enter the outcome equation.

In some situations, more might be known about the rule determining T. For example, in Angrist and Lavy (1999) and Miguel Urquiola and Eric A. Verhoogen (2009), class size is an increasing function of total school enrollment, except for discontinuities at various enrollment thresholds. But

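To make the fuzzy estimand concrete, the sketch below (our own illustration, not code from the paper; the data generating process, bandwidth, and all names are invented) simulates a design in which the probability of treatment jumps from 0.3 to 0.7 at the cutoff, and then estimates τ_F by dividing the jump in the sample mean of Y by the jump in the sample mean of D in narrow bins on either side of c. Section 4.3.3 discusses the more refined local linear and TSLS implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, tau = 200_000, 0.0, 2.0          # sample size, cutoff, true effect (all invented)

x = rng.uniform(-1.0, 1.0, n)          # assignment variable
p = np.where(x >= c, 0.7, 0.3)         # fuzzy design: Pr(D = 1 | X) jumps by 0.4 at c
d = (rng.uniform(size=n) < p).astype(float)
y = tau * d + 0.5 * x + rng.normal(0.0, 1.0, n)   # outcome, smooth in x apart from D

h = 0.05                               # narrow bins on either side of the cutoff
left = (x >= c - h) & (x < c)
right = (x >= c) & (x < c + h)

jump_y = y[right].mean() - y[left].mean()   # discontinuity in E[Y | X] at c
jump_d = d[right].mean() - d[left].mean()   # discontinuity in Pr(D = 1 | X) at c
tau_f = jump_y / jump_d                     # fuzzy RD (Wald-type) ratio
print(round(tau_f, 2))                      # close to the true tau = 2, up to noise
```

Because plain bin means ignore the slope of E[Y | X] within the window, the estimate carries a bias of order h; the local linear regressions discussed in section 4 remove this first-order bias.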
3.4.2 Continuous Endogenous Regressor

In a context where the "treatment" is a continuous variable, call it T, and there is a randomized binary instrument (that can additionally be excluded from the outcome equation), an IV approach is an obvious way of obtaining an estimate of the impact of T on Y. The IV estimand is the reduced-form impact of Z on Y divided by the first-stage impact of Z on T.

The same is true for an RD design when the regressor of interest is continuous. Again, the causal impact of interest will still be the ratio of the two RD gaps (i.e., the discontinuities in Y and T).

To see this more formally, consider the model

(6)  Y = γT + Wδ₁ + U₁
     T = ϕD + Wγ₂ + U₂
     D = 1[X ≥ c]
     X = Wδ₂ + V,

which is the same setup as before, except with the added second equation, allowing for imperfect compliance or other factors (observables W or unobservables U₂) to impact the continuous regressor of interest T. If γ₂ = 0 and U₂ = 0, then the model collapses to a "sharp" RD design (with a continuous regressor).

Note that we make no additional assumptions about U₂ (in terms of its correlation with W or V). We do continue to assume imprecise control over X (conditional on W and U₁, the density of X is continuous).¹⁹

[Footnote 19: Although it would be unnecessary to do so for the identification of γ, it would probably be more accurate to describe the situation of imprecise control with the continuity of the density of X conditional on the three variables (W, U₁, U₂). This is because U₂ is now another variable characterizing heterogeneity in individuals.]

Given the discussion so far, it is easy to show that

(7)  lim_{ε↓0} E[Y | X = c + ε] − lim_{ε↑0} E[Y | X = c + ε]
       = (lim_{ε↓0} E[T | X = c + ε] − lim_{ε↑0} E[T | X = c + ε]) γ.

The left-hand side is simply the "reduced-form" discontinuity in the relation between Y and X. The term preceding γ on the right-hand side is the "first-stage" discontinuity in the relation between T and X, which is also estimable from the data. Thus, analogous to the exactly identified instrumental variable case, the ratio of the two discontinuities yields γ: the effect of T on Y. Again, because of the added notion of imperfect compliance, it is important to assume that D (X crossing the threshold) does not directly enter the outcome equation.

In some situations, more might be known about the rule determining T. For example, in Angrist and Lavy (1999) and Miguel Urquiola and Eric A. Verhoogen (2009), class size is an increasing function of total school enrollment, except for discontinuities at various enrollment thresholds. But additional information about characteristics such as the slope and intercept of the underlying function (apart from the magnitude of the discontinuity) generally adds nothing to the identification strategy.

To see this, change the second equation in (6) to T = ϕD + g(X), where g(∙) is any continuous function in the assignment variable. Equation (7) will remain the same and, thus, knowledge of the function g(∙) is irrelevant for identification.²⁰

[Footnote 20: As discussed in section 3.2.1, the inclusion of a direct effect of X in the outcome equation will not change identification of τ.]

There is also no need for additional theoretical results in the case when there is individual-level heterogeneity in the causal effect of the continuous regressor T. The local random assignment result allows us to borrow from the existing IV literature and interpret the ratio of the RD gaps as in Angrist and Krueger (1999), except that we need to add the note that all averages are weighted by the ex ante relative likelihood that the individual's X will land near the threshold.

3.5 Summary: A Comparison of RD and Other Evaluation Strategies

We conclude this section by comparing the RD design with other evaluation approaches. We believe it is helpful to view the RD design as a distinct approach rather than as a special case of either IV or matching/regression-control. Indeed, in important ways the RD design is more similar to a randomized experiment, which we illustrate below.

Consider a randomized experiment where subjects are assigned a random number X and are given the treatment if X ≥ c. By construction, X is independent of, and not systematically related to, any observable or unobservable characteristic determined prior to the randomization. This situation is illustrated in panel A of figure 5. The first column shows the relationship between the treatment variable D and X: a step function, going from 0 to 1 at the X = c threshold. The second column shows the relationship between the observables W and X. This is flat because X is completely randomized. The same is true for the unobservable variable U, depicted in the third column. These three graphs capture the appeal of the randomized experiment: treatment varies while all other factors are kept constant (on average). And even though we cannot directly test whether there are no treatment-control differences in U, we can test whether there are such differences in the observable W.

Now consider an RD design (panel B of figure 5) where individuals have imprecise control over X. Both W and U may be systematically related to X, perhaps due to the actions taken by units to increase their probability of receiving treatment. Whatever the shape of the relation, as long as individuals have imprecise control over X, the relationship will be continuous. And therefore, as we examine Y near the X = c cutoff, we can be assured that, like an experiment, treatment varies (the first column) while other factors are kept constant (the second and third columns). And, like an experiment, we can test this prediction by assessing whether observables truly are continuous with respect to X (the second column).²¹

[Footnote 21: We thank an anonymous referee for suggesting these illustrative graphs.]

We now consider two other commonly used nonexperimental approaches, referring to the model (2):

  Y = τD + Wδ₁ + U
  D = 1[X ≥ c]
  X = Wδ₂ + V.

[Figure 5. Treatment, Observables, and Unobservables in Four Research Designs. Panel A, Randomized Experiment: E[D|X], E[W|X], and E[U|X] plotted against X. Panel B, Regression Discontinuity Design: E[D|X], E[W|X], and E[U|X] plotted against X. Panel C, Matching on Observables: E[D|W] plotted against W, and E[U|W, D = 0] = E[U|W, D = 1] plotted against W. Panel D, Instrumental Variables: E[D|Z], E[W|Z], and E[U|Z] plotted against Z.]
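The testable prediction illustrated in panels A and B, that a predetermined characteristic W should be continuous in X at the cutoff when control over X is imprecise, can be checked in a small simulation (our own sketch; the data generating process and all names are invented). Even though W partly determines X, so that E[W | X] is upward sloping, its local mean moves smoothly through c while treatment jumps:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 400_000, 0.0

w = rng.normal(0.0, 1.0, n)     # predetermined observable characteristic
v = rng.normal(0.0, 1.0, n)     # idiosyncratic error: imprecise control over X
x = 0.8 * w + v                 # assignment variable depends on W (self-selection)
d = (x >= c).astype(float)      # sharp treatment rule

h = 0.05                        # narrow window around the cutoff
left = (x >= c - h) & (x < c)
right = (x >= c) & (x < c + h)

jump_w = w[right].mean() - w[left].mean()   # approximately 0: E[W | X] has no jump at c
jump_d = d[right].mean() - d[left].mean()   # exactly 1: treatment jumps at c
print(round(jump_w, 3), jump_d)
```

This is the covariate "balance" check described in section 4.4: a significant jump in W at c would cast doubt on the validity of the design.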

3.5.1 Selection on Observables: Matching/Regression Control

The basic idea of the "selection on observables" approach is to adjust for differences in the W's between treated and control individuals. It is usually motivated by the fact that it seems "implausible" that the unconditional mean Y for the control group represents a valid counterfactual for the treatment group. So it is argued that, conditional on W, treatment-control contrasts may identify the (W-specific) treatment effect.

The underlying assumption is that conditional on W, U and V are independent. From this it is clear that

  E[Y | D = 1, W = w] − E[Y | D = 0, W = w]
    = τ + E[U | W = w, V ≥ c − wδ₂] − E[U | W = w, V < c − wδ₂]
    = τ.

Two issues arise when implementing this approach. The first is one of functional form: how exactly to control for the W's? When the W's take on discrete values, one possibility is to compute treatment effects for each distinct value of W, and then average these effects across the constructed "cells." This will not work, however, when W has continuous elements, in which case it is necessary to implement multivariate matching, propensity score, reweighting procedures, or nonparametric regressions.²²

[Footnote 22: See Hahn (1998) on including covariates directly with nonparametric regression.]

Regardless of the functional form issue, there is arguably a more fundamental question of which W's to use in the analysis. While it is tempting to answer "all of them" and hope that more W's will lead to less biased estimates, this is obviously not necessarily the case. For example, consider estimating the economic returns to graduating high school (versus dropping out). It seems natural to include variables like parents' socioeconomic status, family income, year, and place of birth in the regression. Including more and more family-level W's will ultimately lead to a "within-family" sibling analysis; extending it even further by including date of birth leads to a "within-twin-pair" analysis. And researchers have been critical, justifiably so, of this source of variation in education. The same reasons causing discomfort about the twin analyses should also cause skepticism about "kitchen sink" multivariate matching/propensity score/regression control analyses.²³

[Footnote 23: Researchers question the twin analyses on the grounds that it is not clear why one twin ends up having more education than the other, and that the assumption that education differences among twins are purely random (as ignorability would imply) is viewed as far-fetched. We thank David Card for pointing out this connection between twin analyses and matching approaches.]

It is also tempting to believe that, if the W's do a "good job" in predicting D, the selection on observables approach will "work better." But the opposite is true: in the extreme case when the W's perfectly predict X (and hence D), it is impossible to construct a treatment-control contrast for virtually all observations. For each value of W, the individuals will either all be treated or all control. In other words, there will be literally no overlap in the support of the propensity score for the treated and control observations. The propensity score would take the values of either 1 or 0.

The "selection on observables" approach is illustrated in panel C of figure 5. Observables W can help predict the probability of treatment (first column), but ultimately one must assume that unobservable factors U must be the same for treated and control units for every value of W. That is, the crucial assumption is that the two lines in the third column be on top of each other. Importantly, there is no comparable graph in the second column, because there is no way to test the design since all the W's are used for estimation.

3.5.2 Selection on Unobservables: Instrumental Variables and "Heckit"

A less restrictive modeling assumption is to allow U and V to be correlated, conditional on W. But because of the arguably "more realistic," flexible data generating process, another assumption is needed to identify τ. One such assumption is that some elements of W (call them Z) enter the selection equation, but not the outcome equation, and are also uncorrelated with U. An instrumental variables approach utilizes the fact that

  E[Y | W* = w*, Z = z] = τ E[D | W* = w*, Z = z] + w*γ + E[U | W* = w*, Z = z]
                        = τ E[D | W* = w*, Z = z] + w*γ + E[U | W* = w*],

where W has been split up into W* and Z, and γ is the corresponding coefficient for w*. Conditional on W* = w*, Y only varies with Z because of how D varies with Z. Thus, one identifies τ by "dividing" the reduced-form quantity E[Y | W* = w*, Z = z] (which can be obtained by examining the expectation of Y conditional on Z for a particular value of W*) by E[D | W* = w*, Z = z], which is also provided by the observed data. It is common to model the latter quantity as a linear function in Z, in which case the IV estimator is (conditional on W*) the ratio of coefficients from regressions of Y on Z and D on Z. When Z is binary, this appears to be the only way to identify τ without imposing further assumptions.

When Z is continuous, there is an additional approach to identifying τ. The "Heckit" approach uses the fact that

  E[Y | W* = w*, Z = z, D = 1] = τ + w*γ + E[U | W* = w*, Z = z, V ≥ c − wδ₂]

  E[Y | W* = w*, Z = z, D = 0] = w*γ + E[U | W* = w*, Z = z, V < c − wδ₂].

If we further assume a functional form for the joint distribution of U, V, conditional on W* and Z, then the "control function" terms E[U | W = w, V ≥ c − wδ₂] and E[U | W = w, V < c − wδ₂] are functions of observed variables, with the parameters then estimable from the data. It is then possible, for any value of W = w, to identify τ as

(8)  τ = (E[Y | W* = w*, Z = z, D = 1] − E[Y | W* = w*, Z = z, D = 0])
         − (E[U | W* = w*, Z = z, V ≥ c − wδ₂] − E[U | W* = w*, Z = z, V < c − wδ₂]).

Even if the joint distribution of U, V is unknown, in principle it is still possible to identify τ, if it were possible to choose two different values of Z such that c − wδ₂ approaches −∞ and ∞. If so, the last two terms in (8) approach E[U | W* = w*] and, hence, cancel one another. This is known as "identification at infinity."

Perhaps the most important assumption that any of these approaches require is the existence of a variable Z that is (conditional on W*) independent of U.²⁴ There does not seem to be any way of testing the validity of this assumption. Different, but equally "plausible," Z's may lead to different answers, in the same way that including different sets of W's may lead to different answers in the selection on observables approach.

[Footnote 24: For IV, violation of this assumption essentially means that Z varies with Y for reasons other than its influence on D. For the textbook "Heckit" approach, it is typically assumed that U, V have the same distribution for any value of Z. It is also clear that the "identification at infinity" approach will only work if Z is uncorrelated with U; otherwise, the last two terms in equation (8) would not cancel. See also the framework of Heckman and Edward Vytlacil (2005), which maintains the assumption of the independence of the error terms and Z, conditional on W*.]

Even when there is a mechanism that justifies an instrument Z as "plausible," it is often unclear which covariates W* to include in the analysis. Again, when different sets of W* lead to different answers, the question becomes which is more plausible: Z is independent of U conditional on W*, or Z is independent of U conditional on a subset of the variables in W*? While there may be some situations where knowledge of the mechanism dictates which variables to include, in other contexts, it may not be obvious.

The situation is illustrated in panel D of figure 5. It is necessary that the instrument Z is related to the treatment (as in the first column). The crucial assumption is regarding the relation between Z and the unobservables U (the third column). In order for an IV or a "Heckit" approach to work, the function in the third column needs to be flat. Of course, we cannot observe whether this is true. Furthermore, in most cases, it is unclear how to interpret the relation between W and Z (second column). Some might argue the observed relation between W and Z should be flat if Z is truly exogenous, and that if Z is highly correlated with W, then it casts doubt on Z being uncorrelated with U. Others will argue that using the second graph as a test is only appropriate when Z is truly randomized, and that the assumption invoked is that Z is uncorrelated with U, conditional on W. In this latter case, the design seems fundamentally untestable, since all the remaining observable variables (the W's) are being "used up" for identifying the treatment effect.

3.5.3 RD as "Design" not "Method"

RD designs can be valid under the more general "selection on unobservables" environment, allowing an arbitrary correlation among U, V, and W, but at the same time not requiring an instrument. As discussed above, all that is needed is that, conditional on W and U, the density of V is continuous, and the local randomization result follows.

How is an RD design able to achieve this, given these weaker assumptions? The answer lies in what is absolutely necessary in an RD design: observability of the latent index X. Intuitively, given that both the "selection on observables" and "selection on unobservables" approaches rely heavily on modeling X and its components (e.g., which W's to include, and the properties of the unobservable error V and its relation to other variables, such as an instrument Z), actually knowing the value of X ought to help.

In contrast to the "selection on observables" and "selection on unobservables" modeling approaches, with the RD design the researcher can avoid taking any strong stance about what W's to include in the analysis, since the design predicts that the W's are irrelevant and unnecessary for identification. Having data on W's is, of course, of some use, as they allow testing of the underlying assumption (described in section 4.4).

For this reason, it may be more helpful to consider RD designs as a description of a particular data generating process, rather than a "method" or even an "approach."

In virtually any context with an outcome variable Y, treatment status D, and other observable variables W, in principle a researcher can construct a regression-control or instrumental variables (after designating one of the W variables a valid instrument) estimator, and state that the identification assumptions needed are satisfied.

This is not so with an RD design. Either the situation is such that X is observed, or it is not. If not, then the RD design simply does not apply.²⁵ If X is observed, then one has little choice but to attempt to estimate the expectation of Y conditional on X on either side of the cutoff. In this sense, the RD design forces the researcher to analyze it in a particular way, and there is little room for researcher discretion, at least from an identification standpoint. The design also predicts that the inclusion of W's in the analysis should be irrelevant. Thus it naturally leads the researcher to examine the density of X, or the distribution of W's conditional on X, for discontinuities as a test for validity.

[Footnote 25: Of course, sometimes it may seem at first that an RD design does not apply, but a closer inspection may reveal that it does. For example, see Per Pettersson (2000), which eventually became the RD analysis in Pettersson-Lidbom (2008b).]

The analogy of the truly randomized experiment is again helpful. Once the researcher is faced with what she thinks is a properly carried out randomized controlled trial, the analysis is quite straightforward. Even before running the experiment, most researchers agree it would be helpful to display the treatment-control contrasts in the W's to test whether the randomization was carried out properly, then to show the simple mean comparisons, and finally to verify that the inclusion of the W's makes little difference in the analysis, even if they might reduce sampling variability in the estimates.

4. Presentation, Estimation, and Inference

In this section, we systematically discuss the nuts and bolts of implementing RD designs in practice. An important virtue of RD designs is that they provide a very transparent way of graphically showing how the treatment effect is identified. We thus begin the section by discussing how to graph the data in an informative way. We then move to arguably the most important issue in implementing an RD design: the choice of the regression model. We address this by presenting the various possible specifications, discussing how to choose among them, and showing how to compute the standard errors.

Next, we discuss a number of other practical issues that often arise in RD designs. Examples of questions discussed include whether we should control for other covariates and what to do when the assignment variable is discrete. We discuss a number of tests to assess the validity of the RD design, which examine whether covariates are "balanced" on the two sides of the threshold, and whether the density of the assignment variable is continuous at the threshold. Finally, we summarize our recommendations for implementing the RD design.

Throughout this section, we illustrate the various concepts using an empirical example from Lee (2008), who uses an RD design to estimate the causal effect of incumbency in U.S. House elections. We use a sample of 6,558 elections over the 1946–98 period (see Lee 2008 for more detail). The assignment variable in this setting is the fraction of votes awarded to Democrats in the previous election. When the fraction exceeds 50 percent, a Democrat is elected and the party becomes the incumbent party in the next election. Both the share of votes and the probability of winning the next election are considered as outcome variables.

4.1 Graphical Presentation

A major advantage of the RD design over competing methods is its transparency, which can be illustrated using graphical methods. A standard way of graphing the data is to divide the assignment variable into a number of bins, making sure there are two separate
bins on each side of the cutoff point (to avoid having treated and untreated observations mixed together in the same bin). Then, the average value of the outcome variable can be computed for each bin and graphed against the midpoints of the bins.

More formally, for some bandwidth h, and for some number of bins K₀ and K₁ to the left and right of the cutoff value, respectively, the idea is to construct bins (b_k, b_{k+1}], for k = 1, . . . , K = K₀ + K₁, where

  b_k = c − (K₀ − k + 1)h.

The average value of the outcome variable in the bin is

  Ȳ_k = (1/N_k) ∑_{i=1}^{N} Y_i · 1{b_k < X_i ≤ b_{k+1}}.

It is also useful to calculate the number of observations in each bin,

  N_k = ∑_{i=1}^{N} 1{b_k < X_i ≤ b_{k+1}},

to detect a possible discontinuity in the density of the assignment variable at the threshold, which would suggest manipulation.

There are several important advantages in graphing the data this way before starting to run regressions to estimate the treatment effect. First, the graph provides a simple way of visualizing what the functional form of the regression function looks like on either side of the cutoff point. Since the mean of Y in a bin is the nonparametric kernel regression estimate evaluated at the bin midpoint using a rectangular kernel, the set of bin means literally represents nonparametric estimates of the regression function. Seeing what the nonparametric regression looks like can then provide useful guidance in choosing the functional form of the regression models.

A second advantage is that comparing the mean outcomes just to the left and right of the cutoff point provides an indication of the magnitude of the jump in the regression function at this point, i.e., of the treatment effect. Since an RD design is "as good as a randomized experiment" right around the cutoff point, the treatment effect could be computed by comparing the average outcomes in "small" bins just to the left and right of the cutoff point. If there is no visual evidence of a discontinuity in a simple graph, it is unlikely the formal regression methods discussed below will yield a significant treatment effect.

A third advantage is that the graph also shows whether there are unexpected comparable jumps at other points. If such evidence is clearly visible in the graph and cannot be explained on substantive grounds, this calls into question the interpretation of the jump at the cutoff point as the causal effect of the treatment. We discuss below several ways of testing explicitly for the existence of jumps at points other than the cutoff.

Note that the visual impact of the graph is typically enhanced by also plotting a relatively flexible regression model, such as a polynomial model, which is a simple way of smoothing the graph. The advantage of showing both the flexible regression line and the unrestricted bin means is that the regression line better illustrates the shape of the regression function and the size of the jump at the cutoff point, and laying this over the unrestricted means gives a sense of the underlying noise in the data.

Of course, if bins are too narrow, the estimates will be highly imprecise. If they are too wide, the estimates may be biased, as they fail to account for the slope in the regression line (negligible for very narrow bins). More importantly, wide bins make the comparisons on both sides of the cutoff less credible, as we are no longer comparing observations just to the left and right of the cutoff point.

This raises the question of how to choose the bandwidth (the width of the bin). In practice, this is typically done informally by trying to pick a bandwidth that makes the graphs look informative, in the sense that bins are wide enough to reduce the amount of noise, but narrow enough to compare observations "close enough" on both sides of the cutoff point. While it is certainly advisable to experiment with different bandwidths and see how the corresponding graphs look, it is also useful to have some formal guidance in the selection process.

One approach to bandwidth choice is based on the fact that, as discussed above, the mean outcomes by bin correspond to kernel regression estimates with a rectangular kernel. Since the standard kernel regression is a special case of a local linear regression where the slope term is equal to zero, the cross-validation procedure described in more detail in section 4.3.1 can also be used here by constraining the slope term to equal zero.²⁶ For reasons we discuss below, however, one should not solely rely on this approach to select the bandwidth, since other reasonable subjective goals should be considered when choosing how to plot the data. Furthermore, a range of bandwidths often yields similar values of the cross-validation function in practical applications (see below). A researcher may, therefore, want to use some discretion in choosing a bandwidth that provides a particularly compelling illustration of the RD design.

An alternative approach is to choose a bandwidth based on a more heuristic visual inspection of the data, and then perform some tests to make sure this informal choice is not clearly rejected. We suggest two such tests. Consider the case where one has decided to use K′ bins based on a visual inspection of the data. The first test is a standard F-test comparing the fit of a regression model with K′ bin dummies to one where we further divide each bin into two equal-sized smaller bins, i.e., increase the number of bins to 2K′ (reduce the bandwidth from h′ to h′/2). Since the model with K′ bins is nested in the one with 2K′ bins, a standard F-test with K′ degrees of freedom can be used. If the null hypothesis is not rejected, this provides some evidence that we are not oversmoothing the data by using only K′ bins.

Another test is based on the idea that if the bins are "narrow enough," then there should not be a systematic relationship between Y and X, which we capture using a simple regression of Y on X, within each bin. Otherwise, this suggests the bin is too wide and that the mean value of Y over the whole bin is not representative of the mean value of Y at the boundaries of the bin. In particular, when this happens in the two bins next to the cutoff point, a simple comparison of the two bin means yields a biased estimate of the treatment effect. A simple test for this consists of adding a set of interactions between the bin dummies and X to a base regression of Y on the set of bin dummies, and testing whether the interactions are jointly significant. The test statistic once again follows an F distribution with K′ degrees of freedom.

Figures 6–11 show the graphs for the share of the Democrat vote in the next election and the probability of Democrats winning the next election, respectively. Three sets of graphs with different bandwidths are reported: a bandwidth of 0.02 in figures 6 and 9, 0.01 in figures 7 and 10, and 0.005 in figures 8 and 11.

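The first specification test described above, an F-test comparing K′ bins with 2K′ bins, can be sketched as follows (our own illustration with an invented data generating process; regressing on a full set of bin dummies is equivalent to fitting the bin mean within each bin):

```python
import numpy as np

def bin_rss(x, y, edges):
    """Residual sum of squares from regressing y on a full set of bin dummies,
    i.e., fitting the bin mean within each bin (b_k, b_{k+1}]."""
    idx = np.digitize(x, edges, right=True)
    return sum(((y[idx == k] - y[idx == k].mean()) ** 2).sum()
               for k in np.unique(idx))

rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(-0.5, 0.5, n)
y = 0.5 + x + 0.25 * (x >= 0.0) + rng.normal(0.0, 0.1, n)   # invented example

kp = 20                                                      # K' bins chosen by eye
rss_r = bin_rss(x, y, np.linspace(-0.5, 0.5, kp + 1))        # restricted: K' bins
rss_u = bin_rss(x, y, np.linspace(-0.5, 0.5, 2 * kp + 1))    # unrestricted: 2K' bins

# F statistic for the K' restrictions (K' bins are nested in 2K' bins);
# compare with a critical value from the F(K', n - 2K') distribution
f_stat = ((rss_r - rss_u) / kp) / (rss_u / (n - 2 * kp))
print(round(f_stat, 2))
```

A large f_stat relative to the F(K′, N − 2K′) critical value indicates that halving the bins significantly improves the fit, i.e., that the original K′ bins oversmooth the data.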
[Footnote 26: In section 4.3.1, we consider the cross-validation function CV_Y(h) = (1/N) ∑_{i=1}^{N} (Y_i − Ŷ(X_i))², where Ŷ(X_i) is the predicted value of Y_i based on a regression using observations within a bin of width h on either the left (for observations on the left of the cutoff) or the right (for observations on the right of the cutoff) of observation i, but not including observation i itself. In the context of the graph discussed here, the only modification to the cross-validation function is that the predicted value Ŷ(X_i) is based only on a regression with a constant term, which means Ŷ(X_i) is the average value of Y among all observations in the bin (excluding observation i). Note that this is slightly different from the standard cross-validation procedure in kernel regressions, where the left-out observation is in the middle instead of at the edge of the bin (see, for example, Richard Blundell and Alan Duncan 1998). Our suggested procedure is arguably better suited to the RD context since estimation of the treatment effect takes place at boundary points.]

[Figure 6. Share of Vote in Next Election, Bandwidth of 0.02 (50 bins). Bin means of the Democratic vote share in the next election (vertical axis, 0.10 to 0.80) plotted against the Democratic winning margin in the previous election (horizontal axis, −0.5 to 0.5).]

[Figure 7. Share of Vote in Next Election, Bandwidth of 0.01 (100 bins).]

[Figure 8. Share of Vote in Next Election, Bandwidth of 0.005 (200 bins).]

[Figure 9. Winning the Next Election, Bandwidth of 0.02 (50 bins). Bin means of the probability of a Democratic win in the next election (vertical axis, 0.00 to 1.00) plotted against the Democratic winning margin in the previous election (horizontal axis, −0.5 to 0.5).]

[Figure 10. Winning the Next Election, Bandwidth of 0.01 (100 bins).]

Figure 11. Winning the Next Election, Bandwidth of 0.005 (200 bins)

in figures 8 and 11. In all cases, we also show the fitted values from a quartic regression model estimated separately on each side of the cutoff point. Note that the assignment variable is normalized as the difference between the share of vote to Democrats and Republicans in the previous election. This means that a Democrat is the incumbent when the assignment variable exceeds zero. We also limit the range of the graphs to winning margins of 50 percent or less (in absolute terms) as data become relatively sparse for larger winning (or losing) margins.

All graphs show clear evidence of a discontinuity at the cutoff point. While the graphs are all quite informative, the ones with the smallest bandwidth (0.005, figures 8 and 11) are more noisy and likely provide too many data points (200) for optimal visual impact.

The results of the bandwidth selection procedures are presented in table 1. Panel A shows that the cross-validation procedure always suggests using a bandwidth of 0.02 or more, which corresponds to similar or wider bins than those used in figures 6 and 9 (those with the largest bins). This is true irrespective of whether we pick a separate bandwidth on each side of the cutoff (first two rows of the panel), or pick the bandwidth that minimizes the cross-validation function for the entire data range on both the left and right sides of the cutoff. In the case where the outcome variable is winning the next election, the cross-validation procedure for the data to the right of the cutoff point and for the entire range suggests using a very wide bin (0.049) that would only yield about ten bins on each side of the cutoff.

As it turns out, the cross-validation function for the entire data range has two local minima at 0.021 and 0.049 that correspond to the optimal bandwidths on the left and right hand side of the cutoff. This is illustrated in figure 12, which plots the cross-validation function as a function of the bandwidth. By contrast, the cross-validation function is better behaved and shows a global minimum around 0.020 when the outcome variable is the vote share (figure 13). For both outcome variables, the value of the cross-validation function grows quickly for bandwidths smaller than 0.02, suggesting that the graphs with narrower bins (figures 7, 8, 10, and 11) are too noisy.

Panel B of table 1 shows the results of our two suggested specification tests. The tests based on doubling the number of bins and on running regressions within each bin yield remarkably similar results. Generally speaking, the results indicate that only fairly wide bins are rejected. Looking at both outcome variables, the tests systematically reject models with bandwidths of 0.05 or more (twenty bins over the –0.5 to 0.5 range). The models are never rejected for either outcome variable once we hit bandwidths of 0.02 (fifty bins) or less. In practice, the testing procedure rules out bins that are larger than those reported in figures 6–11.

At first glance, the results in the two panels of table 1 appear to be contradictory. The cross-validation procedure suggests bandwidths ranging from 0.02 to 0.05, while the bin and regression tests suggest that almost all bandwidths of less than 0.05 are acceptable. The reason for this discrepancy is that while the cross-validation procedure tries to balance precision and bias, the bin and regression tests only deal with the "bias" part of the equation by checking whether the value of Y is more or less constant within a given bin. Models with small bins easily pass this kind of test, although they may yield a very noisy graph. One alternative approach is to choose the largest possible bandwidth that passes the bin and the regression test, which turns out to be 0.033 in table 1, a bandwidth that is within the range of those suggested by the cross-validation procedure.

From a practical point of view, it seems to be the case that formal procedures, and in particular cross-validation, suggest bandwidths that are wider than those one would likely choose based on a simple visual examination

Table 1 Choice of Bandwidth in Graph for Voting Example

A. Optimal bandwidth selected by cross-validation

Side of cutoff    Share of vote    Win next election

Left     0.021    0.049
Right    0.026    0.021
Both     0.021    0.049

B. P-values of tests for the numbers of bins in RD graph

                            Share of vote              Win next election

No. of bins Bandwidth Bin test Regr. test Bin test Regr. test

 10    0.100    0.000    0.000    0.001    0.000
 20    0.050    0.000    0.000    0.026    0.049
 30    0.033    0.163    0.390    0.670    0.129
 40    0.025    0.157    0.296    0.024    0.020
 50    0.020    0.957    0.721    0.477    0.552
 60    0.017    0.159    0.367    0.247    0.131
 70    0.014    0.596    0.130    0.630    0.743
 80    0.013    0.526    0.740    0.516    0.222
 90    0.011    0.815    0.503    0.806    0.803
100    0.010    0.787    0.976    0.752    0.883

Notes: Estimated over the range of the forcing variable (Democrat to Republican difference in the share of vote in the previous election) ranging between –0.5 and 0.5. The “bin test” is computed by comparing the fit of a model with the number of bins indicated in the table to an alternative where each bin is split in 2. The “regression test” is a joint test of significance of bin-specific regression estimates of the outcome variable on the share of vote in the previous election.
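The "bin test" described in these notes can be sketched in a few lines. The implementation below is our hedged illustration (the function name and the use of simulated data are ours, not the paper's): it compares the fit of K bin means to the fit obtained when each bin is split in two, using a standard F-test.

```python
import numpy as np
from scipy import stats

def bin_test_pvalue(x, y, n_bins, lo=-0.5, hi=0.5):
    """'Bin test' sketch: F-test comparing a step function with n_bins
    bins to one where each bin is split in two (2*n_bins bins).
    A small p-value indicates that the wider bins mask systematic
    variation in E[Y|X] within bins."""
    def rss(k):
        edges = np.linspace(lo, hi, k + 1)
        idx = np.clip(np.digitize(x, edges) - 1, 0, k - 1)
        counts = np.bincount(idx, minlength=k)
        sums = np.bincount(idx, weights=y, minlength=k)
        means = sums / np.maximum(counts, 1)  # guard against empty bins
        return np.sum((y - means[idx]) ** 2)

    rss_r, rss_u = rss(n_bins), rss(2 * n_bins)       # restricted vs. split bins
    df1, df2 = n_bins, len(y) - 2 * n_bins            # extra params; residual df
    f = ((rss_r - rss_u) / df1) / (rss_u / df2)
    return stats.f.sf(f, df1, df2)
```

With data that have a strong slope, wide bins are rejected (small p-value), mirroring the pattern in panel B of table 1; the "regression test" variant would instead jointly test bin-specific slopes.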

of the data. In particular, both figures 7 and 10 (bandwidth of 0.01) look visually acceptable but are clearly not recommended on the basis of the cross-validation procedure. This likely reflects the fact that one important goal of the graph is to show how the raw data look, and too much smoothing would defy the purpose of such a data illustration exercise. Furthermore, the regression estimates of the treatment effect accompanying the graphical results are a formal way of smoothing the data to get precise estimates. This suggests that there is probably little harm in undersmoothing (relative to what formal bandwidth selection procedures would suggest) to better illustrate the variation in the raw data when graphically illustrating an RD design.

4.2 Regression Methods

4.2.1 Parametric or Nonparametric Regressions?

When we introduced the RD design in section 2, we followed Thistlethwaite and Campbell (1960) in assuming that the


Figure 12. Cross-Validation Function for Choosing the Bandwidth in a RD Graph: Winning the Next Election


Figure 13. Cross-Validation Function for Choosing Bandwidth in a RD Graph: Share of Vote at Next Election

underlying regression model was linear in the assignment variable X:

Y = α + Dτ + Xβ + ε.

In general, as in any other setting, there is no particular reason to believe that the true model is linear. The consequences of using an incorrect functional form are more serious in the case of RD designs, however, since misspecification of the functional form typically generates a bias in the treatment effect, τ.27 This explains why, starting with Hahn, Todd, and van der Klaauw (2001), the estimation of RD designs has generally been viewed as a nonparametric estimation problem.

This being said, applied papers using the RD design often just report estimates from parametric models. Does this mean that these estimates are incorrect? Should all studies use nonparametric methods instead? As we pointed out in the introduction, we think that the distinction between parametric and nonparametric methods has sometimes been a source of confusion to practitioners. Before covering in detail the practical issues involved in the estimation of RD designs, we thus provide some background to help clarify the insights provided by nonparametric analysis, while also explaining why, in practice, RD designs can still be implemented using "parametric" methods.

Going beyond simple parametric linear regressions when the true functional form is unknown is a well-studied problem in econometrics and statistics. A number of nonparametric methods have been suggested to provide flexible estimates of the regression function. As it turns out, however, the RD setting poses a particular problem because we need to estimate regressions at the cutoff point. This results in a "boundary problem" that causes some complications for nonparametric methods.

From an applied perspective, a simple way of relaxing the linearity assumption is to include polynomial functions of X in the regression model. This corresponds to the series estimation approach often used in nonparametric analysis. A possible disadvantage of the approach, however, is that it provides global estimates of the regression function over all values of X, while the RD design depends instead on local estimates of the regression function at the cutoff point. The fact that polynomial regression models use data far away from the cutoff point to predict the value of Y at the cutoff point is not intuitively appealing. That said, trying more flexible specifications by adding polynomials in X as regressors is an important and useful way of assessing the robustness of the RD estimates of the treatment effect.

The other leading nonparametric approach is kernel regressions. Unlike series (polynomial) estimators, the kernel regression is fundamentally a local method well suited for estimating the regression function at a particular point. Unfortunately, this property does not help very much in the RD setting because the cutoff represents a boundary point where kernel regressions perform poorly.

These issues are illustrated in figure 2, which shows a situation where the relationship between Y and X (under treatment or control) is nonlinear. First, consider the point D located away from the cutoff point. The kernel estimate of the regression of Y on X at X = X_d is simply a local mean of Y for values of X close to X_d. The kernel function provides a way of computing this local average by putting more weight on observations with values of X close to X_d than on observations with values of X far away from X_d. Following Imbens and Lemieux (2008), we focus on the convenient case of the rectangular kernel. In this setting, computing kernel regressions simply amounts to computing the average value of Y in the bin illustrated in figure 2. The resulting local average is depicted as the horizontal line EF, which is very close to the true value of Y evaluated at X = X_d on the regression line.

27 By contrast, when one runs a linear regression in a model where the true functional form is nonlinear, the estimated model can still be interpreted as a linear predictor that minimizes specification errors. But since specification errors are only minimized globally, we can still have large specification errors at specific points, including the cutoff point, and, therefore, a large bias in RD estimates of the treatment effect.

Applying this local averaging approach is problematic, however, for the RD design. Consider estimating the value of the regression function just on the right of the cutoff point. Clearly, only observations on the right of the cutoff point that receive the treatment should be used to compute mean outcomes on the right hand side. Similarly, only observations on the left of the cutoff point that do not receive the treatment should be used to compute mean outcomes on the left hand side. Otherwise, regression estimates would mix observations with and without the treatment, which would invalidate the RD approach.

In this setting, the best thing is to compute the average value of Y in the bin just to the right and just to the left of the cutoff point. These two bins are shown in figure 2. The RD estimate based on kernel regressions is then equal to B′ − A′. In this example where the regression lines are upward sloping, it is clear, however, that the estimate B′ − A′ overstates the true treatment effect represented as the difference B − A at the cutoff point. In other words, there is a systematic bias in kernel regression estimates of the treatment effect. Hahn, Todd, and van der Klaauw (2001) provide a more formal derivation of the bias (see also Imbens and Lemieux 2008 for a simpler exposition when the kernel is rectangular). In practical terms, the problem is that in finite samples the bandwidth has to be large enough to encompass enough observations to get a reasonable amount of precision in the estimated average values of Y. Otherwise, attempts to reduce the bias by shrinking the bandwidth will result in extremely noisy estimates of the treatment effect.28

As a solution to this problem, Hahn, Todd, and van der Klaauw (2001) suggest running local linear regressions to reduce the importance of the bias. In our setup with a rectangular kernel, this suggestion simply amounts to running standard linear regressions within the bins on both sides of the cutoff point to better predict the value of the regression function right at the cutoff point. In this example, the regression lines within the bins around the cutoff point are close to linear. It follows that the predicted values of the local linear regressions at the cutoff point are very close to the true values of A and B. Intuitively, this means that running local linear regressions instead of just computing averages within the bins reduces the bias by an order of magnitude. Indeed, Hahn, Todd, and van der Klaauw (2001) show that the remaining bias is of an order of magnitude lower, and is comparable to the usual bias in kernel estimation at interior points like D (the small difference between the horizontal line EF and the true value of the regression line evaluated at D).

28 The trade-off between bias and precision is a fundamental feature of kernel regressions. A larger bandwidth yields more precise, but potentially biased, estimates of the regression. At an interior point like D, however, we see that the bias is of an order of magnitude lower than at the cutoff (boundary) point. In more technical terms, it can be shown (see Hahn, Todd, and van der Klaauw 2001 or Imbens and Lemieux 2008) that the usual bias is of order h^2 at interior points, but of order h at boundary points, where h is the bandwidth. In other words, the bias dies off much more quickly when h goes to zero when we are at interior, as opposed to boundary, points.

In the literature on nonparametric estimation at boundary points, local linear regressions have been introduced as a means of reducing the bias in standard kernel regression methods.29 One of the several contributions of Hahn, Todd, and van der Klaauw (2001) is to show how the same bias-reducing procedure should also be applied to the RD design. We have shown here that, in practice, this simply amounts to applying the original insight of Thistlethwaite and Campbell (1960) to a narrower window of observations around the cutoff point. When one is concerned that the regression function is not linear over the whole range of X, a highly sensible procedure is, thus, to restrict the estimation range to values closer to the cutoff point where the linear approximation of the regression line is less likely to result in large biases in the RD estimates. In practice, many applied papers present RD estimates with varying window widths to illustrate the robustness (or lack thereof) of the RD estimates to specification issues. It is comforting to know that this common empirical practice can be justified on more formal econometric grounds like those presented by Hahn, Todd, and van der Klaauw (2001). The main conclusion we draw from this discussion of nonparametric methods is that it is essential to explore how RD estimates are robust to the inclusion of higher order polynomial terms (the series or polynomial estimation approach) and to changes in the window width around the cutoff point (the local linear regression approach).

29 See Jianqing Fan and Irene Gijbels (1996).

4.3 Estimating the Regression

A simple way of implementing RD designs in practice is to estimate two separate regressions on each side of the cutoff point. In terms of computations, it is convenient to subtract the cutoff value from the covariate, i.e., transform X to X − c, so the intercepts of the two regressions yield the value of the regression functions at the cutoff point.

The regression model on the left hand side of the cutoff point (X < c) is

Y = α_l + f_l(X − c) + ε,

while the regression model on the right hand side of the cutoff point (X ≥ c) is

Y = α_r + f_r(X − c) + ε,

where f_l(∙) and f_r(∙) are functional forms that we discuss later. The treatment effect can then be computed as the difference between the two regression intercepts, α_r and α_l, on the two sides of the cutoff point. A more direct way of estimating the treatment effect is to run a pooled regression on both sides of the cutoff point:

Y = α_l + τD + f(X − c) + ε,

where τ = α_r − α_l and f(X − c) = f_l(X − c) + D[f_r(X − c) − f_l(X − c)]. One advantage of the pooled approach is that it directly yields estimates and standard errors of the treatment effect τ. Note, however, that it is recommended to let the regression function differ on both sides of the cutoff point by including interaction terms between D and X. For example, in the linear case where f_l(X − c) = β_l(X − c) and f_r(X − c) = β_r(X − c), the pooled regression would be

Y = α_l + τD + β_l(X − c) + (β_r − β_l)D(X − c) + ε.

The problem with constraining the slope of the regression lines to be the same on both sides of the cutoff (β_r = β_l) is best illustrated by going back to the separate regressions above. If we were to constrain the slope to be identical on both sides of the cutoff, this would amount to using data on the right hand side of the cutoff to estimate α_l, and vice versa. Remember from section 2 that in an RD design, the treatment effect is obtained by comparing conditional expectations of Y when approaching from the left (α_l = lim_{x↑c} E[Y_i | X_i = x]) and from the right (α_r = lim_{x↓c} E[Y_i | X_i = x]) of the cutoff. Constraining the slope to be the same would thus be inconsistent with the spirit of the RD design, as data from the right of the cutoff would be used to estimate α_l, which is defined as a limit when approaching from the left of the cutoff, and vice versa.

In practice, however, estimates where the regression slope or, more generally, the regression function f(X − c) are constrained to be the same on both sides of the cutoff point are often reported. One possible justification for doing so is that if the functional form is indeed the same on both sides of the cutoff, then more efficient estimates of the treatment effect τ are obtained by imposing that constraint. Such a constrained specification should only be viewed, however, as an additional estimate to be reported for the sake of completeness. It should not form the core basis of the empirical approach.

4.3.1 Local Linear Regressions and Bandwidth Choice

As discussed above, local linear regressions provide a nonparametric way of consistently estimating the treatment effect in an RD design (Hahn, Todd, and van der Klaauw (2001), Jack Porter (2003)). Following Imbens and Lemieux (2008), we focus on the case of a rectangular kernel, which amounts to estimating a standard regression over a window of width h on both sides of the cutoff point. While other kernels (triangular, Epanechnikov, etc.) could also be used, the choice of kernel typically has little impact in practice. As a result, the convenience of working with a rectangular kernel compensates for efficiency gains that could be achieved using more sophisticated kernels.30

The regression model on the left hand side of the cutoff point is

Y = α_l + β_l(X − c) + ε, where c − h ≤ X < c,

while the regression model on the right hand side of the cutoff point is

Y = α_r + β_r(X − c) + ε, where c ≤ X ≤ c + h.

As before, it is also convenient to estimate the pooled regression

Y = α_l + τD + β_l(X − c) + (β_r − β_l)D(X − c) + ε, where c − h ≤ X ≤ c + h,

since the standard error of the estimated treatment effect can be directly obtained from the regression.

While it is straightforward to estimate the linear regressions within a given window of width h around the cutoff point, a more difficult question is how to choose this bandwidth. In general, choosing a bandwidth in nonparametric estimation involves finding an optimal balance between precision and bias. On the one hand, using a larger bandwidth yields more precise estimates as more observations are available to estimate the regression. On the other hand, the linear specification is less likely to be accurate
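The pooled specification with interacted slopes can be estimated by ordinary least squares within a window of width h (the rectangular kernel). The sketch below is ours, not code from the paper: the function name and simulated data are illustrative, and only conventional (homoskedastic) standard errors are computed.

```python
import numpy as np

def rd_local_linear(y, x, c=0.0, h=0.25):
    """Pooled local linear RD regression with a rectangular kernel:
    regress Y on a constant, D, (X - c), and D*(X - c) using only
    observations with |X - c| <= h.  The coefficient on D is the RD
    estimate tau = alpha_r - alpha_l; the interaction term lets the
    slope differ on each side of the cutoff."""
    keep = np.abs(x - c) <= h
    xc = x[keep] - c
    d = (x[keep] >= c).astype(float)
    yy = y[keep]
    Z = np.column_stack([np.ones(len(yy)), d, xc, d * xc])
    beta, *_ = np.linalg.lstsq(Z, yy, rcond=None)
    resid = yy - Z @ beta
    sigma2 = resid @ resid / (len(yy) - Z.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return beta[1], se[1]  # tau and its conventional standard error
```

Estimating the two separate regressions and taking α_r − α_l gives the identical point estimate; the pooled form is simply more convenient for reading off the standard error of τ.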

when a larger bandwidth is used, which can bias the estimate of the treatment effect. If the underlying conditional expectation is not linear, the linear specification will provide a close approximation over a limited range of values of X (small bandwidth), but an increasingly bad approximation over a larger range of values of X (larger bandwidth).

As the number of observations available increases, it becomes possible to use an increasingly small bandwidth since linear regressions can be estimated relatively precisely over even a small range of values of X. As it turns out, Hahn, Todd, and van der Klaauw (2001) show the optimal bandwidth is proportional to N^{−1/5}, which corresponds to a fairly slow rate of convergence to zero. For example, this suggests that the bandwidth should only be cut in half when the sample size increases by a factor of 32 (2^5). For technical reasons, however, it would be preferable to undersmooth by shrinking the bandwidth at a faster rate, requiring that h ∝ N^{−δ} with 1/5 < δ < 2/5, in order to eliminate an asymptotic bias that would remain when δ = 1/5. In the presence of this bias, the usual formula for the variance of a standard least squares estimator would be invalid.31

In practice, however, knowing at what rate the bandwidth should shrink in the limit does not really help since only one actual sample with a given number of observations is available. The importance of undersmoothing only has to do with a thought experiment of how much the bandwidth should shrink if the sample size were larger so that one obtains asymptotically correct standard errors, and does not help one choose a particular bandwidth in a particular sample.32

In the econometrics and statistics literature, two procedures are generally considered for choosing bandwidths. The first procedure consists of characterizing the optimal bandwidth in terms of the unknown joint distribution of all variables. The relevant components of this distribution can then be estimated and plugged into the optimal bandwidth function.33 In the context of local linear regressions, Fan and Gijbels (1996) show this involves estimating a number of parameters including the curvature of the regression function. In practice, this can be done in two steps. In step one, a rule-of-thumb (ROT) bandwidth is estimated over the whole relevant data range. In step two, the ROT bandwidth is used to estimate the optimal bandwidth right at the cutoff point. For the rectangular kernel, the ROT bandwidth is given by:

h_ROT = 2.702 [ σ̃^2 R / Σ_{i=1}^N m̃″(x_i)^2 ]^{1/5},

30 It has been shown in the statistics literature (Fan and Gijbels 1996) that a triangular kernel is optimal for estimating local linear regressions at the boundary. As it turns out, the only difference between regressions using a rectangular or a triangular kernel is that the latter puts more weight (in a linear way) on observations closer to the cutoff point. It thus involves estimating a weighted, as opposed to an unweighted, regression within a bin of width h. An arguably more transparent way of putting more weight on observations close to the cutoff is simply to reestimate a model with a rectangular kernel using a smaller bandwidth. In practice, it is therefore simpler and more transparent to just estimate standard linear regressions (rectangular kernel) with a variety of bandwidths, instead of trying out different kernels corresponding to particular weighted regressions that are more difficult to interpret.

31 See Hahn, Todd, and van der Klaauw (2001) and Imbens and Lemieux (2008) for more details.

32 The main purpose of asymptotic theory is to use the large sample properties of estimators to approximate the distribution of an estimator in the real sample being considered. The issue is a little more delicate in a nonparametric setting where one also has to think about how fast the bandwidth should shrink when the sample size approaches infinity. The point about undersmoothing is simply that one unpleasant property of the optimal bandwidth is that it does not yield the convenient least squares variance formula. But this can be fixed by shrinking the bandwidth a little faster as the sample size goes to infinity. Strictly speaking, this is only a technical issue with how to perform the thought experiment (what happens when the sample size goes to infinity?) required for using asymptotics to approximate the variance of the RD estimator in the actual sample. This does not say anything about what bandwidth should be chosen in the actual sample available for implementing the RD design.

33 A well known example of this procedure is the "rule-of-thumb" bandwidth selection formula in kernel density estimation, where an estimate of the dispersion in the variable (standard deviation or the interquartile range), σ̂, is plugged into the formula 0.9 ∙ σ̂ ∙ N^{−1/5}. Bernard W. Silverman (1986) shows that this formula is the closed form solution for the optimal bandwidth choice problem when both the actual density and the kernel are Gaussian. See also Imbens and Karthik Kalyanaraman (2009), who derive an optimal bandwidth for this RD setting, and propose a data-dependent method for choosing the bandwidth.
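The ROT formula can be computed in a few lines. The sketch below is our hedged illustration (function name and simulated data are ours), for data on one side of the cutoff, using a quartic pilot regression for the curvature m̃″(∙) as in footnote 34; σ̃ is the pilot regression's standard error and R the range of the forcing variable.

```python
import numpy as np

def rot_bandwidth(y, x):
    """Rule-of-thumb bandwidth for a local linear regression with a
    rectangular kernel, following the formula in the text:
    h_ROT = 2.702 * (sigma^2 * R / sum_i m''(x_i)^2)^(1/5),
    with m''(.) taken from a pilot quartic regression of Y on X.
    Sketch for observations on one side of the cutoff only."""
    b = np.polyfit(x, y, 4)                    # quartic pilot regression
    resid = y - np.polyval(b, x)
    sigma2 = resid @ resid / (len(y) - 5)      # pilot regression variance
    m2 = np.polyval(np.polyder(b, 2), x)       # m''(x) = 2b2 + 6b3*x + 12b4*x^2
    R = x.max() - x.min()                      # range of the forcing variable
    return 2.702 * (sigma2 * R / np.sum(m2 ** 2)) ** 0.2
```

In the paper's application the quartic pilot regressions are estimated separately on each side of the cutoff, so this function would be called once per side.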

where m̃″(∙) is the second derivative (curvature) of an estimated regression of Y on X, σ̃ is the estimated standard error of the regression, R is the range of the assignment variable over which the regression is estimated, and the constant 2.702 is a number specific to the rectangular kernel. A similar formula can be used for the optimal bandwidth, except both the regression standard error and the average curvature of the regression function are estimated locally around the cutoff point. For the sake of simplicity, we only compute the ROT bandwidth in our empirical example. Following the common practice in studies using these bandwidth selection methods, we also use a quartic specification for the regression function.34

The second approach is based on a cross-validation procedure. In the case considered here, Jens Ludwig and Douglas Miller (2007) and Imbens and Lemieux (2008) have proposed a "leave one out" procedure aimed specifically at estimating the regression function at the boundary. The basic idea behind this procedure is the following. Consider an observation i. To see how well a linear regression with a bandwidth h fits the data, we run a regression with observation i left out and use the estimates to predict the value of Y at X = X_i. In order to mimic the fact that RD estimates are based on regression estimates at the boundary, the regression is estimated using only observations with values of X on the left of X_i (X_i − h ≤ X < X_i) for observations on the left of the cutoff point (X_i < c). For observations on the right of the cutoff point (X_i ≥ c), the regression is estimated using only observations with values of X on the right of X_i (X_i < X ≤ X_i + h).

Repeating the exercise for each and every observation, we get a whole set of predicted values of Y that can be compared to the actual values of Y. The optimal bandwidth can be picked by choosing the value of h that minimizes the mean square of the difference between the predicted and actual value of Y. More formally, let Ŷ(X_i) represent the predicted value of Y obtained using the regressions described above. The cross-validation criterion is defined as

(9) CV_Y(h) = (1/N) Σ_{i=1}^N (Y_i − Ŷ(X_i))^2

with the corresponding cross-validation choice for the bandwidth

h_CV^opt = arg min_h CV_Y(h).

Imbens and Lemieux (2008) discuss this procedure in more detail and point out that since we are primarily interested in what happens around the cutoff, it may be advisable to only compute CV_Y(h) for a subset of observations with values of X close enough to the cutoff point. For instance, only observations with values of X between the median value of X to the left and right of the cutoff could be used to perform the cross-validation.

34 See McCrary and Heather Royer (2003) for an example where the bandwidth is selected using the ROT procedure (with a triangular kernel), and Stephen L. DesJardins and Brian P. McCall (2008) for an example where the second step optimal bandwidth is computed (for the Epanechnikov kernel). Both papers use a quartic regression function m(x) = β_0 + β_1 x + … + β_4 x^4, which means that m″(x) = 2β_2 + 6β_3 x + 12β_4 x^2. Note that the quartic regressions are estimated separately on both sides of the cutoff.

The second rows of tables 2 and 3 show the local linear regression estimates of the treatment effect for the two outcome variables (share of vote and winning the next election). We show the estimates for a wide range of bandwidths going from the entire data range (bandwidth of 1 on each side of the cutoff) to a very small bandwidth of 0.01 (winning margins of one percent or less). As expected, the precision of the estimates declines quickly as we approach smaller and smaller bandwidths.
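The "leave one out" criterion of equation (9) can be sketched as follows. This is our simplified illustration, not the authors' code: one-sided windows mimic estimation at the boundary, each observation is excluded from its own prediction window, and in practice the criterion would be evaluated on a grid of bandwidths and, as suggested above, on a subsample near the cutoff.

```python
import numpy as np

def cv_criterion(y, x, h, c=0.0):
    """CV_Y(h) of equation (9): for each i, predict Y(X_i) from a
    linear regression using only the h-window of data strictly to the
    left of X_i (for X_i < c) or strictly to the right (for X_i >= c),
    then average the squared prediction errors.  O(N^2) as written."""
    sq_err = []
    for i in range(len(y)):
        if x[i] < c:
            use = (x >= x[i] - h) & (x < x[i])   # boundary from the left
        else:
            use = (x > x[i]) & (x <= x[i] + h)   # boundary from the right
        if use.sum() < 2:                        # need two points for a line
            continue
        slope, intercept = np.polyfit(x[use], y[use], 1)
        sq_err.append((y[i] - (intercept + slope * x[i])) ** 2)
    return np.mean(sq_err)
```

The bandwidth choice is then simply the grid minimizer, e.g. `min(h_grid, key=lambda h: cv_criterion(y, x, h))`.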
Notice also that estimates based

Table 2 RD Estimates of the Effect of Winning the Previous Election on the Share of Votes in the Next Election

Bandwidth: 1.00 0.50 0.25 0.15 0.10 0.05 0.04 0.03 0.02 0.01

Polynomial of order:
Zero   0.347 0.257 0.179 0.143 0.125 0.096 0.080 0.073 0.077 0.088
       (0.003) (0.004) (0.004) (0.005) (0.006) (0.009) (0.011) (0.012) (0.014) (0.015)
       [0.000] [0.000] [0.000] [0.000] [0.003] [0.047] [0.778] [0.821] [0.687]
One    0.118 0.090 0.082 0.077 0.061 0.049 0.067 0.079 0.098 0.096
       (0.006) (0.007) (0.008) (0.011) (0.013) (0.019) (0.022) (0.026) (0.029) (0.028)
       [0.000] [0.332] [0.423] [0.216] [0.543] [0.168] [0.436] [0.254] [0.935]
Two    0.052 0.082 0.069 0.050 0.057 0.100 0.101 0.119 0.088 0.098
       (0.008) (0.010) (0.013) (0.016) (0.020) (0.029) (0.033) (0.038) (0.044) (0.045)
       [0.000] [0.335] [0.371] [0.385] [0.458] [0.650] [0.682] [0.272] [0.943]
Three  0.111 0.068 0.057 0.061 0.072 0.112 0.119 0.092 0.108 0.082
       (0.011) (0.013) (0.017) (0.022) (0.028) (0.037) (0.043) (0.052) (0.062) (0.063)
       [0.001] [0.335] [0.524] [0.421] [0.354] [0.603] [0.453] [0.324] [0.915]
Four   0.077 0.066 0.048 0.074 0.103 0.106 0.088 0.049 0.055 0.077
       (0.013) (0.017) (0.022) (0.027) (0.033) (0.048) (0.056) (0.067) (0.079) (0.063)
       [0.014] [0.325] [0.385] [0.425] [0.327] [0.560] [0.497] [0.044] [0.947]

Optimal order of the polynomial:  6  3  1  2  1  2  0  0  0  0
Observations:  6,558  4,900  2,763  1,765  1,209  610  483  355  231  106

Notes: Standard errors in parentheses. P-values from the goodness-of-fit test in square brackets. The goodness-of-fit test is obtained by jointly testing the significance of a set of bin dummies included as additional regressors in the model. The bin width used to construct the bin dummies is 0.01. The optimal order of the polynomial is chosen using Akaike’s criterion (penalized cross-validation).
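The goodness-of-fit test described in these notes can be sketched as follows. This is our hedged illustration (function name, simulated data, and the restriction to one side of the cutoff are ours): augment the polynomial specification with bin dummies of width 0.01 and jointly test their significance with an F-test.

```python
import numpy as np
from scipy import stats

def gof_test(y, x, order, bin_width=0.01):
    """Goodness-of-fit p-value: F-test of the joint significance of
    bin dummies (bin width 0.01) added to a polynomial of the given
    order.  A small p-value indicates the polynomial misses systematic
    variation captured by the bin means."""
    def rss(Z):
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return np.sum((y - Z @ beta) ** 2), Z.shape[1]

    poly = np.column_stack([x ** p for p in range(order + 1)])
    bins = np.floor((x - x.min()) / bin_width).astype(int)
    dummies = (bins[:, None] == np.unique(bins)[None, :]).astype(float)
    rss_r, k_r = rss(poly)                                   # restricted
    rss_u, k_u = rss(np.column_stack([poly, dummies[:, 1:]]))  # drop one dummy
    n = len(y)
    f = ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))
    return stats.f.sf(f, k_u - k_r, n - k_u)
```

One dummy is dropped to avoid collinearity with the constant in the polynomial; the paper runs the analogous test on the full two-sided specification.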

on very wide bandwidths (0.5 or 1) are systematically larger than those for the smaller bandwidths (in the 0.05 to 0.25 range) that are still large enough for the estimates to be reasonably precise. A closer examination of figures 6–11 also suggests that the estimates for very wide bandwidths are larger than what the graphical evidence would suggest.35 This is consistent with a substantial bias for these estimates linked to the fact that the linear approximation does not hold over a wide data range. This is particularly clear in the case of winning the next election where figures 9–11 show some clear curvature in the regression function.

Table 4 shows the optimal bandwidth obtained using the ROT and cross-validation procedure. Consistent with the above

35 In the case of the vote share, the quartic regression shown in figures 6–8 implies a treatment effect of 0.066, which is substantially smaller than the local linear regression estimates with a bandwidth of 0.5 (0.090) or 1 (0.118). Similarly, the quartic regression shown in figures 9–11 for winning the next election implies a treatment effect of 0.375, which is again smaller than the local linear regression estimates with a bandwidth of 0.5 (0.566) or 1 (0.689).

Table 3 RD Estimates of the Effect of Winning the Previous Election on Probability of Winning the Next Election

Bandwidth: 1.00 0.50 0.25 0.15 0.10 0.05 0.04 0.03 0.02 0.01

Polynomial of order:
Zero   0.814 0.777 0.687 0.604 0.550 0.479 0.428 0.423 0.459 0.533
       (0.007) (0.009) (0.013) (0.018) (0.023) (0.035) (0.040) (0.047) (0.058) (0.082)
       [0.000] [0.000] [0.000] [0.000] [0.011] [0.201] [0.852] [0.640] [0.479]
One    0.689 0.566 0.457 0.409 0.378 0.378 0.472 0.524 0.567 0.453
       (0.011) (0.016) (0.026) (0.036) (0.047) (0.073) (0.083) (0.099) (0.116) (0.157)
       [0.000] [0.000] [0.126] [0.269] [0.336] [0.155] [0.400] [0.243] [0.125]
Two    0.526 0.440 0.375 0.391 0.450 0.607 0.586 0.589 0.440 0.225
       (0.016) (0.023) (0.039) (0.055) (0.072) (0.110) (0.124) (0.144) (0.177) (0.246)
       [0.075] [0.145] [0.253] [0.192] [0.245] [0.485] [0.367] [0.191] [0.134]
Three  0.452 0.370 0.408 0.435 0.472 0.566 0.547 0.412 0.266 0.172
       (0.021) (0.031) (0.052) (0.075) (0.096) (0.143) (0.166) (0.198) (0.247) (0.349)
       [0.818] [0.277] [0.295] [0.115] [0.138] [0.536] [0.401] [0.234] [0.304]
Four   0.385 0.375 0.424 0.529 0.604 0.453 0.331 0.134 0.050 0.168
       (0.026) (0.039) (0.066) (0.093) (0.119) (0.183) (0.214) (0.254) (0.316) (0.351)
       [0.965] [0.200] [0.200] [0.173] [0.292] [0.593] [0.507] [0.150] [0.244]

Optimal order of the polynomial:  4  3  2  1  1  2  0  0  0  1
Observations:  6,558  4,900  2,763  1,765  1,209  610  483  355  231  106

Notes: Standard errors in parentheses. P-values from the goodness-of-fit test in square brackets. The goodness-of-fit test is obtained by jointly testing the significance of a set of bin dummies included as additional regressors in the model. The bin width used to construct the bin dummies is 0.01. The optimal order of the polynomial is chosen using Akaike’s criterion (penalized cross-validation).
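The two model-selection devices described in the notes (the bin-dummy goodness-of-fit test and Akaike's criterion for the order of the polynomial) can be sketched on simulated data along the following lines. This is only an illustrative sketch: the data-generating process, the 0.05 bin width, and the two-sided parameter count in the AIC penalty are assumptions made for the example, not the paper's actual data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sharp RD data: quadratic control function, jump of 0.1 at c = 0
n = 2000
x = rng.uniform(-0.5, 0.5, n)
d = (x >= 0).astype(float)
y = 0.4 + 0.1 * d + 0.5 * x - 0.8 * x**2 + rng.normal(0, 0.1, n)

def design(x, d, p):
    """Pooled RD regression of order p: separate polynomial coefficients
    on each side of the cutoff, plus a jump dummy."""
    cols = [np.ones_like(x), d]
    for j in range(1, p + 1):
        cols += [x**j, d * x**j]
    return np.column_stack(cols)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# Goodness-of-fit test: add bin dummies to a linear specification and
# test them jointly (here via the F statistic on the added dummies)
K = 20                                            # 0.05-wide bins on [-0.5, 0.5]
b = np.minimum(((x + 0.5) // 0.05).astype(int), K - 1)
B = (b[:, None] == np.arange(K)).astype(float)
keep = [k for k in range(K) if k not in (9, 10)]  # drop the 2 bins next to c
X_r = design(x, d, 1)                             # restricted: linear spline + jump
X_u = np.column_stack([X_r, B[:, keep]])          # unrestricted: add bin dummies
q = len(keep)
F = ((rss(X_r, y) - rss(X_u, y)) / q) / (rss(X_u, y) / (n - X_u.shape[1]))

# AIC = N ln(sigma2_hat) + 2p, with p the number of estimated parameters
# (2 per polynomial order plus intercept and jump in this two-sided model)
aic = {p: n * np.log(rss(design(x, d, p), y) / n) + 2 * (2 * p + 2)
       for p in range(5)}
best_p = min(aic, key=aic.get)
print(F, best_p)
```

A large F statistic on the bin dummies signals that the chosen polynomial order is too restrictive, in which case one would raise the order and retest; the AIC is simply recomputed for each candidate order and minimized.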

discussion, the suggested bandwidth ranges from 0.14 to 0.28, which is large enough to get precise estimates, but narrow enough to minimize the bias. Two interesting patterns can be observed in table 4. First, the bandwidth chosen by cross-validation tends to be a bit larger than the one based on the rule-of-thumb. Second, the bandwidth is generally smaller for winning the next election (second column) than for the vote share (first column). This is particularly clear when the optimal bandwidth is constrained to be the same on both sides of the cutoff point. This is consistent with the graphical evidence showing more curvature for winning the next election than the vote share, which calls for a smaller bandwidth to reduce the estimation bias linked to the linear approximation.

Figures 14 and 15 plot the value of the cross-validation function over a wide range of bandwidths. In the case of the vote share, where the linearity assumption appears more accurate (figures 6–8), the cross-validation function is fairly flat over a sizable range of values for the bandwidth (from about 0.16 to 0.29). This range includes the optimal bandwidth suggested by cross-validation (0.282) at the upper end, and the ROT

Table 4 Optimal Bandwidth for Local Linear Regressions, Voting Example

                                                 Share of vote    Win next election
A. Rule-of-thumb bandwidth
  Left                                               0.162             0.164
  Right                                              0.208             0.130
  Both                                               0.180             0.141
B. Optimal bandwidth selected by cross-validation
  Left                                               0.192             0.247
  Right                                              0.282             0.141
  Both                                               0.282             0.172

Notes: Estimated over the range of the forcing variable (Democrat to Republican difference in the share of vote in the previous election) ranging between –0.5 and 0.5. See the text for a description of the rule-of-thumb and cross-validation procedures for choosing the optimal bandwidth.

bandwidth (0.180) at the lower end. In the case of winning the next election (figure 15), the cross-validation procedure yields a sharper suggestion of optimal bandwidth around 0.15, which is quite close to both the optimal cross-validation bandwidth (0.172) and the ROT bandwidth (0.141).

The main difference between the two outcome variables is that larger bandwidths start getting penalized more quickly in the case of winning the election (figure 15) than in the case of the vote share (figure 14). This is consistent with the graphical evidence in figures 6–11. Since the regression function looks fairly linear for the vote share, using larger bandwidths does not get penalized as much since they improve efficiency without generating much of a bias. But in the case of winning the election, where the regression function exhibits quite a bit of curvature, larger bandwidths are quickly penalized for introducing an estimation bias. Since there is a real tradeoff between precision and bias, the cross-validation procedure is quite informative. By contrast, there is not much of a tradeoff when the regression function is more or less linear, which explains why the optimal bandwidth is larger in the case of the vote share.

This example also illustrates the importance of first graphing the data before running regressions and trying to choose the optimal bandwidth. When the graph shows a more or less linear relationship, it is natural to expect different bandwidths to yield similar results and the bandwidth selection procedure not to be terribly informative. But when the graph shows substantial curvature, it is natural to expect the results to be more sensitive to the choice of bandwidth and that bandwidth selection procedures will play a more important role in selecting an appropriate empirical specification.

4.3.2 Order of Polynomial in Local Polynomial Modeling

In the case of polynomial regressions, the equivalent to bandwidth choice is the choice of the order of the polynomial regressions. As in the case of local linear regressions, it is advisable to try and report a number of specifications to see to what extent the results are sensitive to the order of the polynomial. For the same reason

[Figure: the cross-validation function (vertical axis, 77.2 to 78.2) plotted against the bandwidth (horizontal axis, 0 to 0.45).]

Figure 14. Cross-Validation Function for Local Linear Regression: Share of Vote at Next Election
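Cross-validation functions like the ones plotted in figures 14 and 15 can be approximated with a short simulation. The sketch below is a simplified version of the leave-out, one-sided procedure the text describes: each observation is predicted from a local linear fit using only data on its own side of the point, mimicking estimation at a boundary. The data-generating process, the candidate bandwidth grid, and the minimum-window rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sharp RD data with curvature in the regression function
n = 1000
x = rng.uniform(-0.5, 0.5, n)
d = (x >= 0).astype(float)
y = 0.4 + 0.1 * d + 0.5 * x - 0.8 * x**2 + rng.normal(0, 0.1, n)

def cv_criterion(x, y, h):
    """Mean squared leave-out prediction error for bandwidth h: each y_i is
    predicted from a one-sided local linear fit on its own side of the
    cutoff, mimicking estimation at a boundary point."""
    sq_err = []
    for i in range(len(x)):
        if x[i] < 0:                 # left of cutoff: fit on [x_i - h, x_i)
            mask = (x >= x[i] - h) & (x < x[i])
        else:                        # right of cutoff: fit on (x_i, x_i + h]
            mask = (x > x[i]) & (x <= x[i] + h)
        if mask.sum() < 5:           # skip windows with too few points
            continue
        X = np.column_stack([np.ones(mask.sum()), x[mask] - x[i]])
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        sq_err.append((y[i] - beta[0]) ** 2)
    return float(np.mean(sq_err))

bandwidths = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40]
cv = {h: cv_criterion(x, y, h) for h in bandwidths}
h_star = min(cv, key=cv.get)
print(h_star)
```

Plotting `cv` against the bandwidth grid reproduces the shape of figures 14–15: flat when the underlying function is close to linear, and sharply increasing at large bandwidths when it is not.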

[Figure: the cross-validation function (vertical axis, 428 to 435) plotted against the bandwidth (horizontal axis, 0.00 to 0.45).]

Figure 15. Cross-Validation Function for Local Linear Regression: Winning the Next Election

mentioned earlier, it is also preferable to estimate separate regressions on the two sides of the cutoff point.

The simplest way of implementing polynomial regressions and computing standard errors is to run a pooled regression. For example, in the case of a third order polynomial regression, we would have

Y = αl + τD + βl1(X − c) + βl2(X − c)² + βl3(X − c)³ + (βr1 − βl1)D(X − c) + (βr2 − βl2)D(X − c)² + (βr3 − βl3)D(X − c)³ + ε.

While it is important to report a number of specifications to illustrate the robustness of the results, it is often useful to have some more formal guidance on the choice of the order of the polynomial. Starting with van der Klaauw (2002), one approach has been to use a generalized cross-validation procedure suggested in the literature on nonparametric series estimators.36 One special case of generalized cross-validation (used by Dan A. Black, Jose Galdo, and Smith (2007a), for example), which we also use in our empirical example, is the well known Akaike information criterion (AIC) of model selection. In a regression context, the AIC is given by

AIC = N ln(σ̂²) + 2p,

where σ̂² is the mean squared error of the regression, and p is the number of parameters in the regression model (order of the polynomial plus one for the intercept).

One drawback of this approach is that it does not provide a very good sense of how a particular parametric model (say a cubic model) compares relative to a more general nonparametric alternative. In the context of the RD design, a natural nonparametric alternative is the set of unrestricted means of the outcome variable by bin used to graphically depict the data in section 4.1. Since one virtue of polynomial regressions is that they provide a smoothed version of the graph, it is natural to ask how well the polynomial model fits the unrestricted graph. A simple way of implementing the test is to add the set of bin dummies to the polynomial regression and jointly test the significance of the bin dummies. For example, in a first order polynomial model (linear regression), the test can be computed by including K − 2 bin dummies Bk, for k = 2 to K − 1, in the model

Y = αl + τD + βl1(X − c) + (βr1 − βl1)D(X − c) + Σ_{k=2}^{K−1} φk Bk + ε

and testing the null hypothesis that φ2 = φ3 = … = φK−1 = 0. Note that two of the dummies are excluded because of collinearity with the constant and the treatment dummy, D.37 In terms of specification choice procedure, the idea is to add a higher order term to the polynomial until the bin dummies are no longer jointly significant.

Another major advantage of this procedure is that testing whether the bin dummies are significant turns out to be a test for the presence of discontinuities in the regression function at points other than the cutoff point. In that sense, it provides a falsification test of the RD design by examining whether there are other unexpected discontinuities in the regression function at randomly chosen points (the bin thresholds). To see this, rewrite Σ_{k=1}^{K} φk Bk as

Σ_{k=1}^{K} φk Bk = φ1 + Σ_{k=2}^{K} (φk − φ_{k−1}) B⁺_k,

where B⁺_k = Σ_{j=k}^{K} Bj is a dummy variable indicating that the observation is in bin k or above, i.e., that the assignment variable X is above the bin cutoff bk. Testing whether all the φk − φ_{k−1} are equal to zero is equivalent to testing that all the φk are the same (the above test), which amounts to testing that the regression line does not jump at the bin thresholds bk.

Tables 2 and 3 show the estimates of the treatment effect for the voting example. For the sake of completeness, a wide range of bandwidths and specifications are presented, along with the corresponding p-values for the goodness-of-fit test discussed above (a bandwidth of 0.01 is used for the bins used to construct the test). We also indicate at the bottom of the tables the order of the polynomial selected for each bandwidth using the AIC. Note that the estimates of the treatment effect for the "order zero" polynomials are just comparisons of means on the two sides of the cutoff point, while the estimates for the "order one" polynomials are based on (local) linear regressions.

Broadly speaking, the goodness-of-fit tests do a very good job ruling out clearly misspecified models, like the zero order polynomials with large bandwidths that yield upward biased estimates of the treatment effect. Estimates from models that pass the goodness-of-fit test mostly fall in the 0.05–0.10 range for the vote share (table 2) and 0.37–0.57 for the probability of winning (table 3). One set of models the goodness-of-fit test does not rule out, however, is higher order polynomial models with small bandwidths that tend to be imprecisely estimated as they "overfit" the data.

Looking informally at both the fit of the model (goodness-of-fit test) and the precision of the estimates (standard errors) suggests the following strategy: use higher order polynomials for large bandwidths of 0.50 and more, lower order polynomials for bandwidths between 0.05 and 0.50, and zero order polynomials (comparisons of means) for bandwidths of less than 0.05, since the latter specification passes the goodness-of-fit test for these very small bandwidths. Interestingly, this informal approach more or less corresponds to what is suggested by the AIC. In this specific example, it seems that given a specific bandwidth, the AIC provides reasonable suggestions on which order of the polynomial to use.

36 See Blundell and Duncan (1998) for a more general discussion of series estimators.

37 While excluding dummies for the two bins next to the cutoff point yields more interpretable results (τ remains the treatment effect), the test is invariant to the excluded bin dummies, provided that one excluded dummy is on the left of the cutoff point and the other one is on the right (something standard regression packages will automatically do if all K dummies are included in the regression).

4.3.3 Estimation in the Fuzzy RD Design

As discussed earlier, in both the "sharp" and the "fuzzy" RD designs, the probability of treatment jumps discontinuously at the cutoff point. Unlike the case of the sharp RD, where the probability of treatment jumps from 0 to 1 at the cutoff, in the fuzzy RD case the probability jumps by less than one. In other words, treatment is not solely determined by the strict cutoff rule in the fuzzy RD design. For example, even if eligibility for a treatment solely depends on a cutoff rule, not all the eligibles may get the treatment because of imperfect compliance. Similarly, program eligibility may be extended in some cases even when the cutoff rule is not satisfied. For example, while Medicare eligibility is mostly determined by a cutoff rule (age 65 or older), some disabled individuals under the age of 65 are also eligible.

Since we have already discussed the interpretation of estimates of the treatment effect
in a fuzzy RD design in section 3.4.1, here we just focus on estimation and implementation issues. The key message to remember from the earlier discussion is that, as in a standard IV framework, the estimated treatment effect can be interpreted as a local average treatment effect, provided monotonicity holds.

In the fuzzy RD design, we can write the probability of treatment as

Pr(D = 1 | X = x) = γ + δT + g(x − c),

where T = 1[X ≥ c] indicates whether the assignment variable exceeds the eligibility threshold c.38 Note that the sharp RD is a special case where γ = 0, g(·) = 0, and δ = 1. It is advisable to draw a graph for the treatment dummy D as a function of the assignment variable X using the same procedure discussed in section 4.1. This provides an informal way of seeing how large the jump δ in the treatment probability is at the cutoff point, and what the functional form g(·) looks like.

Since D = Pr(D = 1 | X = x) + ν, where ν is an error term independent of X, the fuzzy RD design can be described by the two equation system:

(10) Y = α + τD + f(X − c) + ε,

(11) D = γ + δT + g(X − c) + ν.

Looking at these equations suggests estimating the treatment effect τ by instrumenting the treatment dummy D with T. Note also that substituting the treatment determining equation into the outcome equation yields the reduced form

(12) Y = αr + τr T + fr(X − c) + εr,

where τr = τδ. In this setting, τr can be interpreted as an "intent-to-treat" effect.

Estimation in the fuzzy RD design can be performed using either the local linear regression approach or polynomial regressions. Since the model is exactly identified, 2SLS estimates are numerically identical to the ratio of reduced form coefficients τr/δ, provided that the same bandwidth is used for equations (11) and (12) in the local linear regression case, and that the same order of polynomial is used for g(·) and f(·) in the polynomial regression case.

In the case of the local linear regression, Imbens and Lemieux (2008) recommend using the same bandwidth in the treatment and outcome regression. When we are close to a sharp RD design, the function g(·) is expected to be very flat and the optimal bandwidth to be very wide. In contrast, there is no particular reason to expect the function f(·) in the outcome equation to be flat or linear, which suggests the optimal bandwidth would likely be less than the one for the treatment equation. As a result, Imbens and Lemieux (2008) suggest focusing on the outcome equation for selecting bandwidth, and then using the same bandwidth for the treatment equation.

While using a wider bandwidth for the treatment equation may be advisable on efficiency grounds, there are two practical reasons that suggest not doing so. First, using different bandwidths complicates the computation of standard errors since the outcome and treatment samples used for the estimation are no longer the same, meaning the usual 2SLS standard errors are no longer valid. Second, since it is advisable to explore the sensitivity of results to changes in the bandwidth, "trying out" separate bandwidths for each of the two equations would lead to a large and difficult-to-interpret number of specifications.

38 Although the probability of treatment is modeled as a linear probability model here, this does not impose any restrictions on the probability model since g(x − c) is unrestricted on both sides of the cutoff c, while T is a dummy variable. So there is no need to write the model using a probit or logit formulation.

The same broad arguments can be used in the case of local polynomial regressions. In principle, a lower order of polynomial could be used for the treatment equation (11) than for the outcome equation (12). In practice, however, it is simpler to use the same order of polynomial and just run 2SLS (and use 2SLS standard errors).

4.3.4 How to Compute Standard Errors?

As discussed above, for inference in the sharp RD case we can use standard least squares methods. As usual, it is recommended to use heteroskedasticity-robust standard errors (Halbert White 1980) instead of standard least squares standard errors. One additional reason for doing so in the RD case is to ensure the standard error of the treatment effect is the same when either a pooled regression or two separate regressions on each side of the cutoff are used to compute the standard errors. As we just discussed, it is also straightforward to compute standard errors in the fuzzy RD case using 2SLS methods, although robust standard errors should also be used in this case. Imbens and Lemieux (2008) propose an alternative way of computing standard errors in the fuzzy RD case, but nonetheless suggest using 2SLS standard errors readily available in econometric software packages.

One small complication that arises in the nonparametric case of local linear regressions is that the usual (robust) standard errors from least squares are only valid provided that h ∝ N^(−δ) with 1/5 < δ < 2/5. As we mentioned earlier, this is not a very important point in practice, and the usual standard errors can be used with local linear regressions.

4.4 Implementing Empirical Tests of RD Validity and Using Covariates

In this part of the section, we describe how to implement tests of the validity of the RD design and how to incorporate covariates in the analysis.

4.4.1 Inspection of the Histogram of the Assignment Variable

Recall that the underlying assumption that generates the local random assignment result is that each individual has imprecise control over the assignment variable, as defined in section 3.1.1. We cannot test this directly (since we will only observe one observation on the assignment variable per individual at a given point in time), but an intuitive test of this assumption is whether the aggregate distribution of the assignment variable is discontinuous, since a mixture of individual-level continuous densities is itself a continuous density.

McCrary (2008) proposes a simple two-step procedure for testing whether there is a discontinuity in the density of the assignment variable. In the first step, the assignment variable is partitioned into equally spaced bins and frequencies are computed within those bins. The second step treats the frequency counts as a dependent variable in a local linear regression. See McCrary (2008), who adopts the nonparametric framework for asymptotics, for details on this procedure for inference.

As McCrary (2008) points out, this test can fail to detect a violation of the RD identification condition if for some individuals there is a "jump" up in the density, offset by jumps "down" for others, making the aggregate density continuous at the threshold. McCrary (2008) also notes it is possible the RD estimate could remain unbiased, even when there is important manipulation of the assignment variable causing a jump in the density. It should be noted, however, that in order to rely upon the RD estimate as unbiased, one needs to invoke other identifying assumptions and cannot rely upon the mild conditions we focus on in this article.39

39 McCrary (2008) discusses an example where students who barely fail a test are given extra points so that they barely pass. The RD estimator can remain unbiased if one assumes that those who are given extra points were chosen randomly from those who barely failed.

[Figure: the density estimate (vertical axis, 0.00 to 2.50) plotted against the forcing variable (horizontal axis, −0.5 to 0.5).]

Figure 16. Density of the Forcing Variable (Vote Share in Previous Election)
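A stripped-down version of McCrary's two-step check (bin the assignment variable, then run a local linear regression of the frequency counts on each side of the cutoff) might look like the following sketch. The simulated uniform data and the choices of bin width and bandwidth are illustrative assumptions, and the proper asymptotics for inference are developed in McCrary (2008):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assignment variable with NO manipulation (uniform density)
xs = rng.uniform(-0.5, 0.5, 20000)

# Step 1: partition the assignment variable into equally spaced bins
binw = 0.01
edges = np.arange(-0.5, 0.5 + binw / 2, binw)
counts, _ = np.histogram(xs, bins=edges)
mids = (edges[:-1] + edges[1:]) / 2

# Step 2: local linear regression of the frequency counts on the bin
# midpoints, separately on each side, evaluated at the cutoff
def density_at_cutoff(side, h=0.1):
    m = (mids < 0) if side == "left" else (mids >= 0)
    m = m & (np.abs(mids) <= h)
    X = np.column_stack([np.ones(m.sum()), mids[m]])
    beta, *_ = np.linalg.lstsq(X, counts[m], rcond=None)
    return beta[0]                         # fitted frequency at the cutoff

gap = density_at_cutoff("right") - density_at_cutoff("left")
print(round(gap, 1))                       # small relative to ~200 counts per bin
```

With manipulation (say, mass shifted from just below to just above the cutoff), `gap` would instead be large relative to its sampling variability.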

One of the examples McCrary uses for his test is the voting model of Lee (2008) that we used in the earlier empirical examples. Figure 16 shows a graph of the raw densities computed over bins with a bandwidth of 0.005 (200 bins in the graph), along with a smooth second order polynomial model. Consistent with McCrary (2008), the graph shows no evidence of discontinuity at the cutoff. McCrary also shows that a formal test fails to reject the null hypothesis of no discontinuity in the density at the cutoff.

4.4.2 Inspecting Baseline Covariates

An alternative approach for testing the validity of the RD design is to examine whether the observed baseline covariates are "locally" balanced on either side of the threshold, which should be the case if the treatment indicator is locally randomized. A natural thing to do is conduct both a graphical RD analysis and a formal estimation, replacing the dependent variable with each of the observed baseline covariates in W. A discontinuity would indicate a violation in the underlying assumption that predicts local random assignment. Intuitively, if the RD design is valid, we know that the treatment variable cannot influence variables determined prior to the realization of the assignment variable and treatment assignment; if we observe it does, something is wrong in the design.

If there are many covariates in W, even abstracting from the possibility of misspecification of the functional form, some discontinuities will be statistically significant by random chance. It is thus useful to combine the multiple tests into a single test statistic to see if the data are consistent with no discontinuities for any of the observed covariates. A simple way to do this is with a Seemingly Unrelated Regression (SUR) where each equation represents a different baseline covariate, and then perform a χ² test for the discontinuity gaps in all equations being zero. For example, supposing the underlying functional form is linear, one would estimate the system

w1 = α1 + β1 D + γ1 X + ε1
…
wK = αK + βK D + γK X + εK

and test the hypothesis that β1, …, βK are jointly equal to zero, where we allow the ε's to be correlated across the K equations. Alternatively, one can simply use the OLS estimates of β1, …, βK obtained from a "stacked" regression where all the equations for each covariate are pooled together, while D and X are fully interacted with a set of K dummy variables (one for each covariate wk). Correlation in the error terms can then be captured by clustering the standard errors on individual observations (which appear in the stacked dataset K times). Under the null hypothesis of no discontinuities, the Wald test statistic N β̂′ V̂⁻¹ β̂ (where β̂ is the vector of estimates of β1, …, βK, and V̂ is the cluster-and-heteroskedasticity consistent estimate of the asymptotic variance of β̂) converges in distribution to a χ² with K degrees of freedom.

Of course, the importance of functional form for RD analysis means a rejection of the null hypothesis tells us either that the underlying assumptions for the RD design are invalid, or that at least some of the equations are sufficiently misspecified and too restrictive, so that nonzero discontinuities are being estimated, even though they do not exist in the population. One could use the parametric specification tests discussed earlier for each of the individual equations to see if misspecification of the functional form is an important problem. Alternatively, the test could be performed only for observations within a narrower window around the cutoff point, such as the one suggested by the bandwidth selection procedures discussed in section 4.3.1.

Figure 17 shows the RD graph for a baseline covariate, the Democratic vote share in the election prior to the one used for the assignment variable (four years prior to the current election). Consistent with Lee (2008), there is no indication of a discontinuity at the cutoff. The actual RD estimate using a quartic model is –0.004 with a standard error of 0.014. Very similar results are obtained using winning the election as the outcome variable instead (RD estimate of –0.003 with a standard error of 0.017).

4.5 Incorporating Covariates in Estimation

If the RD design is valid, the other use for the baseline covariates is to reduce the sampling variability in the RD estimates. We discuss two simple ways to do this. First, one can "residualize" the dependent variable—subtract from Y a prediction of Y based on the baseline covariates W—and then conduct an RD analysis on the residuals. Intuitively, this procedure nets out the portion of the variation in Y we could have predicted using the predetermined characteristics, making the question whether the treatment variable can explain the remaining residual variation in Y. The important thing to keep in mind is that if the RD design is valid, this procedure provides a consistent estimate of the same RD parameter of interest. Indeed, any combination of covariates can be used, and abstracting from functional form issues, the estimator will be consistent for the same parameter, as discussed above in equation (4). Importantly, this two-step approach also allows one to perform a graphical analysis of the residual.

To see this more formally in the parametric case, suppose one is willing to assume that the expectation of Y as a function of X is a polynomial, and the expectation of each

[Figure: the baseline covariate (vertical axis, 0.20 to 0.80) plotted against the forcing variable (horizontal axis, −0.5 to 0.5).]

Figure 17. Discontinuity in Baseline Covariate (Share of Vote in Prior Election)
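A simplified, covariate-by-covariate version of this balance check (estimating the discontinuity in each baseline covariate and its t-statistic, rather than the joint SUR/stacked χ² test described in the text) can be sketched as follows; the two simulated covariates and the linear specification are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative data: two baseline covariates that vary smoothly with X
# but, by construction, have NO discontinuity at the cutoff c = 0
n = 4000
x = rng.uniform(-0.5, 0.5, n)
d = (x >= 0).astype(float)
W = np.column_stack([0.3 + 0.2 * x + rng.normal(0, 0.1, n),
                     -0.1 + 0.5 * x**2 + rng.normal(0, 0.1, n)])

def jump_and_t(wk):
    """Estimated discontinuity in a baseline covariate at c = 0 (linear in X,
    separate slopes on each side) and its homoskedastic t-statistic."""
    X = np.column_stack([np.ones_like(x), d, x, d * x])
    beta, *_ = np.linalg.lstsq(X, wk, rcond=None)
    resid = wk - X @ beta
    sigma2 = resid @ resid / (len(wk) - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], beta[1] / se

tstats = [jump_and_t(W[:, k])[1] for k in range(W.shape[1])]
print([round(t, 2) for t in tstats])       # no |t| should be large
```

With many covariates, the text's joint test guards against the multiple-comparisons problem that looking at these t-statistics one at a time would create.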

element of W is also a polynomial function of X. This implies

(13) Y = τD + X̃γ + ε
     W = X̃δ + u,

where X̃ is a vector of polynomial terms in X, δ and u are of conformable dimension, and ε and u are by construction orthogonal to D and X̃. It follows that

(14) Y − Wπ = τD + X̃γ − Wπ + ε
            = τD + X̃(γ − δπ) − uπ + ε
            = τD + X̃γ1 − uπ + ε,

where γ1 ≡ γ − δπ. This makes clear that a regression of Y − Wπ on D and X̃ will give consistent estimates of τ and γ1. This is true no matter the value of π. Furthermore, as long as the specification in equation (13) is correct, in computing estimated standard errors in the second step, one can ignore the fact that the first step was estimated.40

The second approach—which uses the same assumptions implicit in equation (13)—is to simply add W to the regression. While this may seem to impose linearity in how W affects Y, it can be shown that the inclusion of these regressors will not affect the consistency of the estimator for τ.41 The advantage of this second approach is that under these functional form assumptions and with homoskedasticity, the estimator for τ is guaranteed to have a lower asymptotic variance.42 By contrast, the "residualizing" approach can in some cases raise standard errors.43

The disadvantage of solely relying upon this second approach, however, is that it does not help distinguish between an inappropriate functional form and discontinuities in W, as both could potentially cause the estimates of τ to change significantly when W is included.44 On the other hand, the "residualizing" approach allows one to examine how well the residuals fit the assumed order of polynomial (using, for example, the methods described in subsection 4.3.2). If it does not fit well, then it suggests that the use of that order of polynomial with the second approach is not justified. Overall, one sensible approach is to directly enter the covariates, but then to use the "residualizing" approach as an additional diagnostic check on whether the assumed order of the polynomial is justified.

As discussed earlier, an alternative approach to estimating the discontinuity involves limiting the estimation to a window of data around the threshold and using a linear specification within that window.45 We note that as the neighborhood shrinks, the true expectation of W conditional on X will become closer to being linear, and so equation (13) (with X̃ containing only the linear term) will become a better approximation.

For the voting example used throughout this paper, Lee (2008) shows that adding a set of covariates essentially has no impact on the RD estimates in the model where the outcome variable is winning the next election. Doing so does not have a large impact on the standard errors either, at least up to the third decimal. Using the procedure based on residuals instead actually slightly increases the second step standard errors—a possibility mentioned above. Therefore in this particular example, the main advantage of using baseline covariates is to help establish the validity of the RD design, as opposed to improving the efficiency of the estimators.

40 The two-step procedure solves the sample analogue to the following set of moment equations:

E[(D X̃)′(Y − Wπ0 − τD − X̃γ)] = 0
E[W′(Y − Wπ0)] = 0.

As noted above, the second-step estimator for τ is consistent for any value of π. Letting θ ≡ (τ γ)′, and using the notation of Whitney K. Newey and Daniel L. McFadden (1994), this means that the first row of ∇π θ(π) = −G_θ⁻¹ G_π is a row of zeros. It follows from their theorem 6.1, with the 1,1 element of V being the asymptotic variance of the estimator for τ, that the 1,1 element of V is equal to the 1,1 element of G_θ⁻¹ E[g(z)g(z)′] G_θ⁻¹′, which is the asymptotic covariance matrix of the second stage estimator ignoring estimation in the first step.

41 To see this, rewrite equation (13) as Y = τD + X̃γ + ε = aD + X̃b + Wc + μ, where a, b, c, and μ are linear projection coefficients and the residual from a population regression on D, X̃, and W. If a = τ, then adding W will not affect the coefficient on D. This will be true—applying the Frisch–Waugh theorem—when the covariance between ε and D − X̃d − We (where d and e are coefficients from projecting D on X̃ and W) is zero. This will be true when e = 0, because ε is by assumption orthogonal to both D and X̃. Applying the Frisch–Waugh theorem again, e is the coefficient obtained by regressing D on W − X̃δ ≡ u; by assumption u and D are uncorrelated, so e = 0.

42 The asymptotic variance for the least squares estimator (without including W) of τ is given by the ratio V(ε)/V(D̃), where D̃ is the residual from the population regression of D on X̃. If W is included, then the least squares estimator has asymptotic variance of σ²/V(D − X̃d − We), where σ² is the variance of the error when W is included, and d and e are coefficients from projecting D on X̃ and W. σ² cannot exceed V(ε), and as shown in the footnote above, e = 0, and thus D − X̃d = D̃, implying that the denominator in the ratio does not change when W is included.

43 From equation (14), the regression error variance will increase if V(ε − uπ) > V(ε) ⇔ π′V(u)π − 2C(ε, uπ) > 0, which will hold when, for example, ε is orthogonal to u and π is nonzero.

44 If the true equation for W contains more polynomial terms than X̃, then e, as defined in the preceding footnotes (the coefficient obtained by regressing D on the residual from projecting W on X̃), will not be zero. This implies that including W will generally lead to inconsistent estimates of τ, and may cause the asymptotic variance to increase (since V(D − X̃d − We) ≤ V(D̃)).

45 And we have noted that one can justify this by assuming that in that specified neighborhood, the underlying function is in fact linear, and make standard parametric inferences. Or one can conduct a nonparametric inference approach by making assumptions about the rate at which the bandwidth shrinks as the sample size grows.
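The two approaches of section 4.5 (residualizing Y on W before running the RD regression versus simply adding W to the regression) can be compared on simulated data. The linear data-generating process below is an illustrative assumption under which both recover the same jump, as equations (13)–(14) imply:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative linear DGP: true jump tau = 0.1; W affects Y and varies with X
n = 4000
x = rng.uniform(-0.5, 0.5, n)
d = (x >= 0).astype(float)
w = 0.3 + 0.2 * x + rng.normal(0, 0.2, n)          # baseline covariate
y = 0.4 + 0.1 * d + 0.5 * x + 1.0 * w + rng.normal(0, 0.1, n)

def rd_jump(dep, extra=None):
    """RD regression, linear in X with separate slopes; returns the jump."""
    cols = [np.ones_like(x), d, x, d * x]
    if extra is not None:
        cols.append(extra)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, dep, rcond=None)
    return beta[1]

# Approach 1: "residualize" Y with a first-step regression of Y on W,
# then run the RD analysis on the residual
Z = np.column_stack([np.ones_like(w), w])
pi, *_ = np.linalg.lstsq(Z, y, rcond=None)
tau_resid = rd_jump(y - Z @ pi)

# Approach 2: simply add W to the RD regression
tau_direct = rd_jump(y, extra=w)

print(round(tau_resid, 3), round(tau_direct, 3))   # both near the true 0.1
```

Consistent with the text, the two estimates target the same parameter; they differ only in precision, and the residualizing version additionally lets one graph and specification-test the residual.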

1. To assess the possibility of manipulation of the assignment variable, show its distribution. The most straightforward thing to do is to present a histogram of the assignment variable, using a fixed number of bins. The bin widths should be as small as possible, without compromising the ability to see the overall shape of the distribution. For an example, see figure 16. The bin-to-bin jumps in the frequencies can provide a sense of whether any jump at the threshold is "unusual." For this reason, we recommend against plotting a smooth function composed of kernel density estimates. A more formal test of a discontinuity in the density can be found in McCrary (2008).

2. Present the main RD graph using binned local averages. As with the histogram, we recommend using a fixed number of nonoverlapping bins, as described in subsection 4.1. For examples, see figures 6–11. The nonoverlapping nature of the bins for the local averages is important; we recommend against simply presenting a continuum of nonparametric estimates (with a single break at the threshold), as this will naturally tend to give the impression of a discontinuity even if there does not exist one in the population. We recommend reporting bandwidths implied by cross-validation, as well as the range of widths that are not statistically rejected in favor of strictly less restrictive alternatives (for an example, see table 1). We recommend generally "undersmoothing," while at the same time avoiding "too narrow" bins that produce a scatter of data points from which it is difficult to see the shape of the underlying function. Indeed, we recommend against simply plotting the raw data without a minimal amount of local averaging.

3. Graph a benchmark polynomial specification. Superimpose onto the graph the predicted values from a low-order polynomial specification (see figures 6–11). By comparing the two functions, one can often informally assess whether a simple polynomial specification is an adequate summary of the data. If the local averages represent the most flexible "nonparametric" representation of the function, the polynomial represents a "best case" scenario in terms of the variance of the RD estimate, since if the polynomial specification is correct, the least squares estimator is, under certain conditions, efficient.

4. Explore the sensitivity of the results to a range of bandwidths and a range of polynomial orders. For an example, see tables 2 and 3. The tables should be supplemented with information on the implied rule-of-thumb bandwidth and cross-validation bandwidths for local linear regression (as in table 4), as well as the AIC-implied optimal order of the polynomial. The specification tests that involve adding bin dummies to the polynomial specifications can help rule out overly restrictive specifications. Drawing on the specifications that are not rejected by the bin-dummy tests, the polynomial orders recommended by the AIC, and the estimates given by both rule-of-thumb and CV bandwidths, report a "typical" point estimate and a range of point estimates. A useful graphical device for illustrating the sensitivity of the results to bandwidths is to plot the local linear discontinuity estimate against a continuum of bandwidths (within a range of bandwidths that are not ruled out by the above specification tests). For an example

0.18

0.16 95 percent 0.14 condence bands

0.12

0.10

0.08

0.06 Estimated treatment effect 0.04 LLR estimate of Quadratic t the treatment 0.02 effect

0.00 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Bandwidth

Figure 18. Local Linear Regression with Varying Bandwidth: Share of Vote at Next Election
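A plot like figure 18 can be produced along the following lines (a sketch only: the data-generating process and the rectangular kernel are invented for illustration, and numpy is assumed). For each bandwidth, a local linear regression with separate slopes on each side of the cutoff is fit to observations within the bandwidth, and the coefficient on the treatment dummy is recorded:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50000
x = rng.uniform(-0.5, 0.5, n)    # assignment variable, centered at the cutoff
d = (x >= 0).astype(float)
y = 0.1 * d + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=n)  # true jump = 0.1

def llr_estimate(h):
    """Local linear regression with a rectangular kernel and bandwidth h."""
    m = np.abs(x) < h
    X = np.column_stack([d[m], x[m], d[m] * x[m], np.ones(m.sum())])
    beta, *_ = np.linalg.lstsq(X, y[m], rcond=None)
    return beta[0]   # coefficient on the treatment dummy = estimated discontinuity

bandwidths = np.arange(0.05, 0.51, 0.05)
estimates = [llr_estimate(h) for h in bandwidths]
for h, t in zip(bandwidths, estimates):
    print(f"h = {h:.2f}: tau_hat = {t:.3f}")
```

Plotting `estimates` against `bandwidths` (with pointwise confidence bands) gives the kind of sensitivity graph shown in the figure.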

of such a presentation, see the online appendix to Card, Carlos Dobkin, and Nicole Maestas (2009), and figure 18.

5. Conduct a parallel RD analysis on the baseline covariates. As discussed earlier, if the assumption that there is no precise manipulation or sorting of the assignment variable is valid, then there should be no discontinuities in variables that are determined prior to the assignment. See figure 17, for example.

6. Explore the sensitivity of the results to the inclusion of baseline covariates. As discussed above, the inclusion of baseline covariates—no matter how highly correlated they are with the outcome—should not affect the estimated discontinuity if the no-manipulation assumption holds. If the estimates do change in an important way, it may indicate a potential sorting of the assignment variable that may be reflected in a discontinuity in one or more of the baseline covariates. In terms of implementation, in subsection 4.5 we suggest simply including the covariates directly, after choosing a suitable order of polynomial. Significant changes in the estimated effect or increases in the standard errors may be an indication of a misspecified functional form. Another check is to perform the "residualizing" procedure suggested there, to see if that same order of polynomial provides a good fit for the residuals, using the specification tests from point 4.

We recognize that, due to space limitations, researchers may be unable to present every permutation of presentation (e.g., points 2–4 for every one of 20 baseline covariates) within a published article. Nevertheless, we do believe that documenting the sensitivity of the results to this array of tests and alternative specifications—even if they only appear in unpublished, online appendices—is an important component of a thorough RD analysis.

5. Special Cases

In this section, we discuss how the RD design can be implemented in a number of specific cases beyond the one considered up to this point (that of a single cross-section with a continuous assignment variable).

5.1 Discrete Assignment Variable and Specification Errors

Up until now, we have assumed the assignment variable was continuous. In practice, however, X is often discrete. For example, age or date of birth are often only available at a monthly, quarterly, or annual frequency. Studies relying on an age-based cutoff thus typically rely on discrete values of the age variable when implementing an RD design.

Lee and Card (2008) study this case in detail and make a number of important points. First, with a discrete assignment variable, it is not possible to compare outcomes in very narrow bins just to the right and left of the cutoff point. Consequently, one must use regressions to estimate the conditional expectation of the outcome variable at the cutoff point by extrapolation. As discussed in section 4, however, in practice we always extrapolate to some extent, even in the case of a continuous assignment variable. So the fact that we must do so in the case of a discrete assignment variable does not introduce particular complications from an econometric point of view, provided the discrete variable is not too coarsely distributed.

Additionally, the various estimation and graphing techniques discussed in section 4 can readily be used in the case of a discrete assignment variable. For instance, as with a continuous assignment variable, either local linear regressions or polynomial regressions can be used to estimate the jump in the regression function at the cutoff point. Furthermore, the discreteness of the assignment variable simplifies the problem of bandwidth choice when graphing the data since, in most cases, one can simply compute and graph the mean of the outcome variable for each value of the discrete assignment variable. The fact that the variable is discrete also provides a natural way of testing whether the regression model is well specified by comparing the fitted model to the raw dispersion in mean outcomes at each value of the assignment variable. Lee and Card (2008) show that, when errors are homoskedastic, the model specification can be tested using the standard goodness-of-fit statistic

G ≡ [(ESS_R − ESS_UR)/(J − K)] / [ESS_UR/(N − J)],

where ESS_R is the estimated sum of squares of the restricted model (e.g., a low-order polynomial), while ESS_UR is the estimated sum of squares of the unrestricted model, in which a full set of dummies (one for each value of the assignment variable) is included. In this unrestricted model, the fitted regression corresponds to the mean outcome in each cell. G follows an F(J − K, N − J) distribution, where J is the number of values taken by the assignment variable and K is the number of parameters of the restricted model.

This test is similar to the test in section 4, where we suggested including a full set of bin dummies in the regression model and testing whether the bin dummies were jointly significant. The procedure is even simpler here, as bin dummies are replaced by dummies for each value of the discrete assignment variable. In the presence of heteroskedasticity, the goodness-of-fit test can be computed by estimating the model and testing whether a set of dummies for each value of the discrete assignment variable is jointly significant. In that setting, the test statistic follows a chi-square distribution with J − K degrees of freedom.

In Lee and Card (2008), the difference between the true conditional expectation E[Y | X = x] and the estimated regression function forming the basis of the goodness-of-fit test is interpreted as a random specification error that introduces a group structure in the standard errors. One way of correcting the standard errors for group structure is to run the model on cell means.46 Another way is to "cluster" the standard errors. Note that in this setting, the goodness-of-fit test can also be interpreted as a test of whether standard errors should be adjusted for the group structure. In practice, it is nonetheless advisable to either group the data or cluster the standard errors in micro-data models irrespective of the results of the goodness-of-fit test. The main purpose of the test should be to help choose a reasonably accurate regression model.

Lee and Card (2008) also discuss a number of issues, including what to do when specification errors under treatment and control are correlated, and how to possibly adjust the RD estimates in the presence of specification errors. Since these issues are beyond the scope of this paper, interested readers should consult Lee and Card (2008) for more detail.

5.2 Panel Data and Fixed Effects

In some situations, the RD design will be embedded in a panel context, whereby period by period, the treatment variable is determined according to the realization of the assignment variable X. Again, it seems natural to propose the model

Y_it = D_it τ + f(X_it; γ) + a_i + ε_it

(where i and t denote individuals and time, respectively), and simply estimate a fixed effects regression by including individual dummy variables to capture the unit-specific error component, a_i. It is important to note, however, that including fixed effects is unnecessary for identification in an RD design. This sharply contrasts with a more traditional panel data setting where the error component a_i is allowed to be correlated with the observed covariates, including the treatment variable D_it, in which case including fixed effects is essential for consistently estimating the treatment effect τ.

An alternative is to simply conduct the RD analysis for the entire pooled cross-section dataset, taking care to account for within-individual correlation of the errors over time using clustered standard errors. The source of identification is a comparison between those just below and above the threshold, and can be carried out with a single cross-section. Therefore, imposing a specific dynamic structure introduces more restrictions without any gain in identification.

Time dummies can also be treated like any other baseline covariate. This is apparent by applying the main RD identification result: conditional on what period it is, we are assuming the density of X is continuous at the threshold and, hence, conditional on X, the probability of an individual observation coming from a particular period is also continuous.

46 When the discrete assignment variable—and the "treatment" dummy solely dependent on this variable—is the only variable used in the regression model, standard OLS estimates will be numerically equivalent to those obtained by running a weighted regression on the cell means, where the weights are the number of observations (or the sum of individual weights) in each cell.

We note that it becomes a little bit more awkward to use the justification proposed in subsection 4.5 for directly including dummies for individuals and time periods on the right-hand side of the regression. This is because the assumption would have to be that the probability that an observation belonged to each individual (or the probability that an observation belonged to each time period) is a polynomial function in X and, strictly speaking, nontrivial polynomials are not bounded between 0 and 1.

A more practical concern is that inclusion of individual dummy variables may lead to an increase in the variance of the RD estimator for another reason. If there is little "within-unit" variability in treatment status, then the variation in the main variable of interest (treatment after partialling out the individual heterogeneity) may be quite small. Indeed, seeing standard errors rise when including fixed effects may be an indication of a misspecified functional form.47

Overall, since the RD design is still valid ignoring individual or time effects, the only rationale for including them is to reduce sampling variance. But there are other ways to reduce sampling variance by exploiting the structure of panel data. For instance, we can treat the lagged dependent variable Y_it−1 as simply another baseline covariate in period t. In cases where Y_it is highly persistent over time, Y_it−1 may well be a very good predictor and has a very good chance of reducing the sampling error. As we have also discussed earlier, looking at possible discontinuities in baseline covariates is an important test of the validity of the RD design. In this particular case, since Y_it can be highly correlated with Y_it−1, finding a discontinuity in Y_it but not in Y_it−1 would be a strong piece of evidence supporting the validity of the RD design.

In summary, one can utilize the panel nature of the data by conducting an RD analysis on the entire dataset, using lagged variables as baseline covariates for inclusion as described in subsection 4.5. The primary caution in doing this is to ensure that, for each period, the included covariates are the variables determined prior to the present period's realization of X_it.

47 See the discussion in section 4.5.

6. Applications of RD Designs in Economics

In what areas has the RD design been applied in economic research? Where do discontinuous rules come from and where might we expect to find them? In this section, we provide some answers to these questions by providing a survey of the areas of applied economic research that have employed the RD design. Furthermore, we highlight some examples from the literature that illustrate what we believe to be the most important elements of a compelling, "state-of-the-art" implementation of RD.

6.1 Areas of Research Using RD

As we suggested in the introduction, the notion that the RD design has limited applicability to a few specific topics is inconsistent with our reading of existing applied research in economics. Table 5 summarizes our survey of empirical studies on economic topics that have utilized the RD design. In compiling this list, we searched economics journals as well as listings of working papers from economists, and chose any study that recognized the potential use of an RD design in its given setting. We also included some papers from non-economists when the research was closely related to economic work.

Even with our undoubtedly incomplete compilation of over sixty studies, table 5 illustrates that RD designs have been applied in many different contexts. Table 5 summarizes the context of the study, the outcome
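The lagged-outcome suggestion can be sketched with a small simulation (the data-generating process is invented for illustration, and numpy is assumed). When the outcome is persistent, including Y_it−1 as a baseline covariate leaves the estimated discontinuity essentially unchanged while tightening its standard error:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
a = rng.normal(scale=2.0, size=n)          # persistent individual component
y_lag = a + rng.normal(scale=0.5, size=n)  # outcome in the previous period
x = rng.uniform(-1, 1, n)                  # assignment variable in period t
d = (x >= 0).astype(float)
y = 0.4 * d + x + a + rng.normal(scale=0.5, size=n)  # true effect tau = 0.4

def ols(y, X):
    """OLS coefficients and conventional standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

X0 = np.column_stack([d, x, d * x, np.ones(n)])
X1 = np.column_stack([d, x, d * x, np.ones(n), y_lag])  # lagged outcome as covariate
(b0, s0), (b1, s1) = ols(y, X0), ols(y, X1)
print(f"tau without lag: {b0[0]:.3f} (se {s0[0]:.4f})")
print(f"tau with lag:    {b1[0]:.3f} (se {s1[0]:.4f})")
```

Because Y_it−1 absorbs much of the persistent component a_i without being correlated with treatment status near the cutoff, the standard error falls sharply while the point estimate is essentially unaffected.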

Table 5 Regression Discontinuity Applications in Economics

Study | Context | Outcome(s) | Treatment(s) | Assignment variable(s)

Education
Angrist and Lavy (1999) | Public schools (Grades 3–5), Israel | Test scores | Class size | Student enrollment
Asadullah (2005) | Secondary schools, Bangladesh | Examination pass rate | Class size | Student enrollment
Bayer, Ferreira, and McMillan (2007) | Valuation of schools and neighborhoods, Northern California | Housing prices, school test scores, demographic characteristics | Inclusion in school attendance region | Geographic location
Black (1999) | Valuation of school quality, Massachusetts | Housing prices | Inclusion in school attendance region | Geographic location
Canton and Blom (2004) | Higher education, Mexico | University enrollment, GPA, part-time employment, career choice | Student loan receipt | Economic need index
Cascio and Lewis (2006) | Teenagers, United States | AFQT test scores | Age at school entry | Birthdate
Chay, McEwan, and Urquiola (2005) | Elementary schools, Chile | Test scores | Improved infrastructure, more resources | School averages of test scores
Chiang (2009) | School accountability, Florida | Test scores, education quality | Threat of sanctions | School's assessment score
Clark (2009) | High schools, U.K. | Examination pass rates | "Grant maintained" school status | Vote share
Ding and Lehrer (2007) | Secondary school students, China | Academic achievement (test scores) | School assignment | Entrance examination scores
Figlio and Kenny (2009) | Elementary and middle schools, Florida | Private donations to school | D or F grade in school performance measure | Grading points
Goodman (2008) | College enrollment, Massachusetts | School choice | Scholarship offer | Test scores
Goolsbee and Guryan (2006) | Public schools, California | Internet access in classrooms, test scores | E-Rate subsidy amount | Proportion of students eligible for lunch program
Guryan (2001) | State-level equalization: elementary and middle schools, Massachusetts | Spending on schools, test scores | State education aid | Relative average property values
Hoxby (2000) | Elementary schools, Connecticut | Test scores | Class size | Student enrollment
Kane (2003) | Higher education, California | College attendance | Financial aid receipt | Income, assets, GPA
Lavy (2002) | Secondary schools, Israel | Test scores, dropout rates | Performance-based incentives for teachers | Frequency of school type in community
Lavy (2004) | Secondary schools, Israel | Test scores | Pay-for-performance incentives | School matriculation rates
Lavy (2006) | Secondary schools, Tel Aviv | Dropout rates, test scores | School choice | Geographic location
Jacob and Lefgren (2004a) | Elementary schools, Chicago | Test scores | Teacher training | School averages on test scores


Jacob and Lefgren (2004b) | Elementary schools, Chicago | Test scores | Summer school attendance, grade retention | Standardized test scores
Leuven, Lindahl, Oosterbeek, and Webbink (2007) | Primary schools, Netherlands | Test scores | Extra funding | Percent disadvantaged minority pupils
Matsudaira (2008) | Elementary schools, Northeastern United States | Test scores | Summer school, grade promotion | Test scores
Urquiola (2006) | Elementary schools, Bolivia | Test scores | Class size | Student enrollment
Urquiola and Verhoogen (2009) | Class size sorting (RD violations), Chile | Test scores | Class size | Student enrollment
Van der Klaauw (2002, 1997) | College enrollment, East Coast college | Enrollment | Financial aid offer | SAT scores, GPA
Van der Klaauw (2008a) | Elementary/middle schools, New York City | Test scores, student attendance | Title I federal funding | Poverty rates

Labor Market
Battistin and Rettore (2002) | Job training, Italy | Employment rates | Training program (computer skills) | Attitudinal test score
Behaghel, Crepon, and Sedillot (2008) | Labor laws, France | Hiring among age groups | Tax exemption for hiring firm | Age of worker
Black, Smith, Berger, and Noel (2003); Black, Galdo, and Smith (2007b) | UI claimants, Kentucky | Earnings, benefit receipt/duration | Mandatory reemployment services (job search assistance) | Profiling score (expected benefit duration)
Card, Chetty, and Weber (2007) | Unemployment benefits, Austria | Unemployment duration | Lump-sum severance pay, extended UI benefits | Months employed, job tenure
Chen and van der Klaauw (2008) | Disability insurance beneficiaries, United States | Labor force participation | Disability insurance benefits | Age at disability decision
De Giorgi (2005) | Welfare-to-work program, United Kingdom | Re-employment probability | Job search assistance, training, education | Age at end of unemployment spell
DiNardo and Lee (2004) | Unionization, United States | Wages, employment, output | Union victory in NLRB election | Vote share
Dobkin and Ferreira (2009) | Individuals, California and Texas | Educational attainment, wages | Age at school entry | Birthdate
Edmonds (2004) | Child labor supply and school attendance, South Africa | Child labor supply, school attendance | Pension receipt of oldest family member | Age
Hahn, Todd, and van der Klaauw (1999) | Discrimination, United States | Minority employment | Coverage of federal antidiscrimination law | Number of employees at firm
Lalive (2008) | Unemployment benefits, Austria | Unemployment duration | Maximum benefit duration | Age at start of unemployment spell, geographic location


Lalive (2007) | Unemployment, Austria | Unemployment duration, duration of job search, quality of post-unemployment jobs | Benefits duration | Age at start of unemployment spell
Lalive, Van Ours, and Zweimüller (2006) | Unemployment, Austria | Unemployment duration | Benefit replacement rate, potential benefit duration | Pre-unemployment income, age
Leuven and Oosterbeek (2004) | Employers, Netherlands | Training, wages | Business tax deduction, training | Age of employee
Lemieux and Milligan (2008) | Welfare, Canada | Employment, marital status, living arrangements | Cash benefit | Age
Oreopoulos (2006) | Returns to education, U.K. | Earnings | Coverage of compulsory schooling law | Birth year

Political Economy
Albouy (2009) | Congress, United States | Federal expenditures | Party control of seat | Vote share in election
Albouy (2008) | Senate, United States | Roll call votes | Incumbency | Initial vote share
Ferreira and Gyourko (2009) | Mayoral elections, United States | Local expenditures | Incumbency | Initial vote share
Lee (2008, 2001) | Congressional elections, United States | Vote share in next election | Incumbency | Initial vote share
Lee, Moretti, and Butler (2004) | House of Representatives, United States | Roll call votes | Incumbency | Initial vote share
McCrary (2008) | House of Representatives, United States | N/A | Passing of resolution | Share of roll call vote "Yeay"
Pettersson-Lidbom (2006) | Local governments, Sweden and Finland | Expenditures, tax revenues | Number of council seats | Population
Pettersson-Lidbom (2008) | Local governments, Sweden | Expenditures, tax revenues | Left-, right-wing bloc | Left-wing parties' share

Health
Card and Shore-Sheppard (2004) | Medicaid, United States | Overall insurance coverage | Medicaid eligibility | Birthdate
Card, Dobkin, and Maestas (2008) | Medicare, United States | Health care utilization | Coverage under Medicare | Age
Card, Dobkin, and Maestas (2009) | Medicare, California | Insurance coverage, health services, mortality | Medicare coverage | Age
Carpenter and Dobkin (2009) | Alcohol and mortality, United States | Mortality | Attaining minimum legal drinking age | Age
Ludwig and Miller (2007) | Head Start, United States | Child mortality, educational attainment | Head Start funding | County poverty rates
McCrary and Royer (2003) | Maternal education, United States (California and Texas) | Infant health, fertility timing | Age of school entry | Birthdate
Snyder and Evans (2006) | Social Security recipients, United States | Mortality | Social security payments ($) | Birthdate


Crime
Berk and DeLeeuw (1999) | Prisoner behavior in California | Inmate misconduct | Prison security levels | Classification score
Berk and Rauma (1983) | Ex-prisoners recidivism, California | Arrest, parole violation | Unemployment insurance benefit | Reported hours of work
Chen and Shapiro (2004) | Ex-prisoners recidivism, United States | Arrest rates | Prison security levels | Classification score
Lee and McCrary (2005) | Criminal offenders, Florida | Arrest rates | Severity of sanctions | Age at arrest
Hjalmarsson (2009) | Juvenile offenders, Washington State | Recidivism | Sentence length | Criminal history score

Environment
Chay and Greenstone (2003) | Health effects of pollution, United States | Infant mortality | Regulatory status | Pollution levels
Chay and Greenstone (2005) | Valuation of air quality, United States | Housing prices | Regulatory status | Pollution levels
Davis (2008) | Restricted driving policy, Mexico | Hourly air pollutant measures | Restricted automobile use | Time
Greenstone and Gallagher (2008) | Hazardous waste, United States | Housing prices | Superfund clean-up status | Ranking of level of hazard

Other
Battistin and Rettore (2008) | Mexican anti-poverty program (PROGRESA) | School attendance | Cash grants | Pre-assigned probability of being poor
Baum-Snow and Marion (2009) | Housing subsidies, United States | Residents' characteristics, new housing construction | Increased subsidies | Percentage of eligible households in area
Buddelmeyer and Skoufias (2004) | Mexican anti-poverty program (PROGRESA) | Child labor and school attendance | Cash grants | Pre-assigned probability of being poor
Buettner (2006) | Fiscal equalization across municipalities, Germany | Business tax rate | Implicit marginal tax rate on grants to localities | Tax base
Card, Mas, and Rothstein (2008) | Racial segregation, United States | Changes in census tract racial composition | Minority share exceeding "tipping" point | Initial minority share
Cole (2009) | Bank nationalization, India | Share of credit granted by public banks | Nationalization of private banks | Size of bank
Edmonds, Mammen, and Household structure, Household composition Pension receipt of Age Miller (2005) South Africa oldest family member Ferreira (2007) Residential mobility, Household mobility Coverage of tax benefit Age California Pence (2006) Mortgage credit, Size of loan State mortgage credit Geographic location United States laws Pitt and Khandker (1998) Poor households, Labor supply, children Group-based credit Acreage of land Bangladesh school enrollment program Pitt, Khandker, McKernan, Poor households, Contraceptive use, Group-based credit Acreage of land and Latif (1999) Bangladesh Childbirth program Lee and Lemieux: Regression Discontinuity Designs in Economics 343 variable, the treatment of interest, and the esting “treatment” to examine since it seems assignment variable employed. nothing more than an arbitrary label. On the While the categorization of the various other hand, if failing the exam meant not studies into broad areas is rough and some- being able to advance to the next grade in what arbitrary, it does appear that a large share school, the actual experience of treated and come from the area of education, where the control individuals is observably different, no outcome of interest is often an achievement matter how large or small the impact on the test score and the assignment variable is also outcome. a test score, either at the individual or group As another example, in the U.S. Congress, (school) level. The second clearly identifiable a Democrat obtaining the most votes in group are studies that deal with labor market an election means something real—the issues and outcomes. This probably reflects Democratic candidate becomes a represen- that, within economics, the RD design has so tative in Congress; otherwise, the Democrat far primarily been used by labor economists, has no official role in the government. 
But and that the use of quasi-experiments and in a three-way electoral race, the treatment program evaluation methods in documenting of the Democrat receiving the second-most causal relationships is more prevalent in labor number of votes (versus receiving the low- economics research. est number) is not likely a treatment of inter- There is, of course, nothing in the struc- est: only the first-place candidate is given ture of the RD design tying it specifically to any legislative authority. In principle, stories labor economics applications. Indeed, as the could be concocted about the psychological rest of the table shows, the remaining half effect of placing second rather than third of the studies are in the areas of political in an election, but this would be an exam- economy, health, crime, environment, and ple where the salience of the treatment is other areas. more speculative than when treatment is a concrete and observable event (e.g., a can- 6.2 Sources of Discontinuous Rules didate becoming the sole representative of a Where do discontinuous rules come from, constituency). and in what situations would we expect to encounter them? As table 5 shows, there is 6.2.1 Necessary Discretization a wide variety of contexts where discontinu- ous rules determine treatments of interest. Many discontinuous rules come about There are, nevertheless, some patterns that because resources cannot, for all practical emerge. We organize the various discontinu- purposes, be provided in a continuous man- ous rules below. ner. For example, a school can only have a Before doing so, we emphasize that a good whole number of classes per grade. For RD analysis—as with any other approach a fixed level of enrollment, the moment a to program evaluation—is careful in clearly school adds a single class, the average class spelling out exactly what the treatment is, size drops. 
As long as the number of classes and whether it is of any real salience, inde- is an increasing function of enrollment, pendent of whatever effect it might have on there will be discontinuities at enrollments the outcome. For example, when a pretest where a teacher is added. If there is a man- score is the assignment variable, we could dated maximum for the student to teacher always define a “treatment” as being “having ratio, this means that these discontinuities passed the exam” (with a test score of 50 per- will be expected at enrollments that are cent or higher), but this is not a very inter- exact multiples of the maximum. This is the 344 Journal of Economic Literature, Vol. XLVIII (June 2010) essence of the discontinuous rules used in ­identified three broad motivations behind the analyses of Angrist and Lavy (1999), M. the use of these discontinuous rules. Niaz Asadullah (2005), Caroline M. Hoxby First, a number of rules seem driven (2000), Urquiola (2006), and Urquiola and by a compensatory or equalizing motive. Verhoogen (2009). For example, in Kenneth Y. Chay, Patrick Another example of necessary discretiza- J. McEwan and Urquiola (2005), Edwin tion arises when children begin their school- Leuven et al. (2007), and van der Klaauw ing years. Although there are certainly (2008a), extra resources for schools were allo- exceptions, school districts typically follow a cated to the neediest communities, either on guideline that aims to group children together the basis of school-average test scores, dis- by age, leading to a grouping of children advantaged minority proportions, or poverty born in year-long intervals, determined by rates. Similarly, Ludwig and Miller (2007), a single calendar date (e.g., September 1). 
Another example of necessary discretization arises when children begin their schooling years. Although there are certainly exceptions, school districts typically follow a guideline that aims to group children together by age, leading to a grouping of children born in year-long intervals, determined by a single calendar date (e.g., September 1). This means children who are essentially of the same age (e.g., those born on August 31 and September 1) start school one year apart. This allocation of students to grade cohorts is used in Elizabeth U. Cascio and Ethan G. Lewis (2006), Dobkin and Fernando Ferreira (2009), and McCrary and Royer (2003).

Choosing a single representative by way of an election is yet another example. When the law or constitution calls for a single representative of some constituency and there are many competing candidates, the choice can be made via a "first-past-the-post" or "winner-take-all" election. This is the typical system for electing government officials at the local, state, and federal level in the United States. The resulting discontinuous relationship between win/loss status and the vote share is used in the context of the U.S. Congress in Lee (2001, 2008), Lee, Enrico Moretti, and Matthew J. Butler (2004), David Albouy (2009), Albouy (2008), and in the context of mayoral elections in Ferreira and Joseph Gyourko (2009). The same idea is used in examining the impacts of union recognition, which is also decided by a secret ballot election (DiNardo and Lee 2004).

6.2.2 Intentional Discretization

Sometimes resources could potentially be allocated on a continuous scale but, in practice, are instead allocated in discrete levels. Among the studies we surveyed, we identified three broad motivations behind the use of these discontinuous rules.

First, a number of rules seem driven by a compensatory or equalizing motive. For example, in Kenneth Y. Chay, Patrick J. McEwan, and Urquiola (2005), Edwin Leuven et al. (2007), and van der Klaauw (2008a), extra resources for schools were allocated to the neediest communities, either on the basis of school-average test scores, disadvantaged minority proportions, or poverty rates. Similarly, Ludwig and Miller (2007), Erich Battistin and Enrico Rettore (2008), and Hielke Buddelmeyer and Emmanuel Skoufias (2004) study programs designed to help poor communities, where the eligibility of a community is based on poverty rates. In each of these cases, one could imagine providing the most resources to the neediest and gradually phasing them out as the need index declines, but in practice this is not done, perhaps because it was impractical to provide very small levels of the treatment, given the fixed costs in administering the program.

A second motivation for having a discontinuous rule is to allocate treatments on the basis of some measure of merit. This was the motivation behind the merit award in the analysis of Thistlethwaite and Campbell (1960), as well as recent studies of the effect of financial aid awards on college enrollment, where the assignment variable is some measure of student achievement or test score, as in Thomas J. Kane (2003) and van der Klaauw (2002).

Finally, we have observed that a number of discontinuous rules are motivated by the need to most effectively target the treatment. For example, environmental regulations or clean-up efforts naturally will focus on the most polluted areas, as in Chay and Michael Greenstone (2003), Chay and Greenstone (2005), and Greenstone and Justin Gallagher (2008). In the context of criminal behavior, prison security levels are often assigned based on an underlying score that quantifies potential security risks, and such rules were used in Richard A. Berk and Jan de Leeuw (1999) and M. Keith Chen and Jesse M. Shapiro (2004).
6.3 Nonrandomized Discontinuity Designs

Throughout this article, we have focused on regression discontinuity designs that follow a certain structure and timing in the assignment of treatment. First, individuals or communities—potentially in anticipation of the assignment of treatment—make decisions and act, potentially altering their probability of receiving treatment. Second, there is a stochastic shock due to "nature," reflecting that the units have incomplete control over the assignment variable. And finally, the treatment (or the intention to treat) is assigned on the basis of the assignment variable.

We have focused on this structure because in practice most RD analyses can be viewed along these lines, and also because of the similarity to the structure of a randomized experiment. That is, subjects of a randomized experiment may or may not make decisions in anticipation of participating in a randomized controlled trial (although their actions will ultimately have no influence on the probability of receiving treatment). Then the stochastic shock is realized (the randomization). Finally, the treatment is administered to one of the groups.

A number of the studies we surveyed, though, did not seem to fit the spirit or essence of a randomized experiment. Since it is difficult to think of the treatment as being locally randomized in these cases, we will refer to the two research designs we identified in this category as "nonrandomized" discontinuity designs.

6.3.1 Discontinuities in Age with Inevitable Treatment

Sometimes program status is turned on when an individual reaches a certain age. Receipt of pension benefits is typically tied to reaching a particular age (see Eric V. Edmonds 2004; Edmonds, Kristin Mammen, and Miller 2005) and, in the United States, eligibility for the Medicare program begins at age 65 (see Card, Dobkin, and Maestas 2008) and young adults reach the legal drinking age at 21 (see Christopher Carpenter and Dobkin 2009). Similarly, one is subject to the less punitive juvenile justice system until the age of majority (typically 18) (see Lee and McCrary 2005).

These cases stand apart from the typical RD designs discussed above because here assignment to treatment is essentially inevitable, as all subjects will eventually age into the program (or, conversely, age out of the program). One cannot, therefore, draw any parallels with a randomized experiment, which necessarily involves some ex ante uncertainty about whether a unit ultimately receives treatment (or the intent to treat).

Another important difference is that the tests of smoothness in baseline characteristics will generally be uninformative. Indeed, if one follows a single cohort over time, all characteristics determined prior to reaching the relevant age threshold are by construction identical just before and after the cutoff.48 Note that in this case, time is the assignment variable, and therefore cannot be manipulated.

48 There are exceptions to this. There could be attrition over time, so that in principle, the number of observations could discontinuously drop at the threshold, changing the composition of the remaining observations. Alternatively, when examining a cross-section of different birth cohorts at a given point in time, it is possible to have sharp changes in the characteristics of individuals with respect to birthdate.

This design and the standard RD share the necessity of interpreting the discontinuity as the combined effect of all factors that switch on at the threshold. In the example of Thistlethwaite and Campbell (1960), if passing a scholarship exam provides the symbolic honor of passing the exam as well as a monetary award, the true treatment is a package of the two components, and one cannot attribute any effect to only one of the two. Similarly, when considering an age-activated treatment, one must consider the possibility that the age of interest is causing eligibility for potentially many other programs, which could affect the outcome.

There are at least two new issues that are irrelevant for the standard RD but are important for the analysis of age discontinuities. First, even if there is truly an effect on the outcome, if the effect is not immediate, it generally will not generate a discontinuity in the outcome. For example, suppose the receipt of Social Security benefits has no immediate impact but does have a long-run impact on labor force participation. Examining labor force behavior as a function of age will not yield a discontinuity at age 67 (the full retirement age for those born after 1960), even though there may be a long-run effect. It is infeasible to estimate long-run effects because, by the time we examine outcomes five years after receiving the treatment, for example, those individuals who were initially just below and just above age 67 will be exposed to essentially the same length of time of treatment (e.g., five years).49

49 By contrast, there is no such limitation with the standard RD design. One can examine outcomes defined at an arbitrarily long time period after the assignment to treatment.

The second important issue is that, because treatment is inevitable with the passage of time, individuals may fully anticipate the change in the regime and, therefore, may behave in certain ways prior to the time when treatment is turned on. Optimizing behavior in anticipation of a sharp regime change may either accentuate or mute observed effects. For example, simple life-cycle theories, assuming no liquidity constraints, suggest that the path of consumption will exhibit no discontinuity at age 67, when Social Security benefits commence payment. On the other hand, some medical procedures are too expensive for an under-65-year-old but would be covered under Medicare upon turning 65. In this case, individuals' greater awareness of such a predicament will tend to increase the size of the discontinuity in utilization of medical procedures with respect to age (e.g., see Card, Dobkin, and Maestas 2008).

At this time we are unable to provide any more specific guidelines for analyzing these age/time discontinuities, since it seems that how one models expectations, information, and behavior in anticipation of sharp changes in regimes will be highly context-dependent. But it does seem important to recognize these designs as being distinct from the standard RD design.

We conclude by emphasizing that, when distinguishing between age-triggered treatments and a standard RD design, the involvement of age as an assignment variable is not as important as whether the receipt of treatment—or analogously, entering the control state—is inevitable. For example, on the surface, the analysis of the Medicaid expansions in Card and Lara D. Shore-Sheppard (2004) appears to be an age-based discontinuity since, effective July 1991, U.S. law requires states to cover children born after September 30, 1983, implying a discontinuous relationship between coverage and age, where the discontinuity in July 1991 was around 8 years of age. This design, however, actually fits quite easily into the standard RD framework we have discussed throughout this paper. First, note that treatment receipt is not inevitable for those individuals born near the September 30, 1983, threshold: those born strictly after that date were covered from July 1991 until their 18th birthday, while those born on or before the date received no such coverage. Second, the data generating process does follow the structure discussed above. Parents do have some influence regarding when their children are born, but with only imprecise control over the exact date (and at any rate, it seems implausible that parents would have anticipated that such a Medicaid expansion would have occurred eight years in the future, with the particular birthdate cutoff chosen).
Thus the treatment is assigned based on the assignment variable, which is the birthdate in this context. Examples of other age-based discontinuities—where neither the treatment nor the control state is guaranteed with the passage of time—that can also be viewed within the standard RD framework include studies by Cascio and Lewis (2006), McCrary and Royer (2003), Dobkin and Ferreira (2009), and Phillip Oreopoulos (2006).

6.3.2 Discontinuities in Geography

Another "nonrandomized" RD design is one involving the location of residences, where the discontinuity threshold is a boundary that demarcates regions. Black (1999) and Patrick Bayer, Ferreira, and Robert McMillan (2007) examine housing prices on either side of school attendance boundaries to estimate the implicit valuation of different schools. Lavy (2006) examines adjacent neighborhoods that are in different cities, and therefore subject to different rules regarding student busing. Rafael Lalive (2008) compares unemployment duration in regions in Austria receiving extended benefits to adjacent control regions. Karen M. Pence (2006) examines census tracts along state borders to examine the impact of more borrower-friendly laws on mortgage loan sizes.

In each of these cases, it is awkward to view either houses or families as locally randomly assigned. Indeed, this is a case where economic agents have quite precise control over where to place a house or where to live. The location of houses will be planned in response to geographic features (rivers, lakes, hills) and in conjunction with the planning of streets, parks, commercial development, etc. In order for this to resemble a more standard RD design, one would have to imagine the relevant boundaries being set in a "random" way, so that it would be simply luck determining whether a house ended up on either side of the boundary. The concern over the endogeneity of boundaries is clearly recognized by Black (1999), who ". . . [b]ecause of concerns about neighborhood differences on opposite sides of an attendance district boundary, . . . was careful to omit boundaries from [her] sample if the two attendance districts were divided in ways that seemed to clearly divide neighborhoods; attendance districts divided by large rivers, parks, golf courses, or any large stretch of land were excluded." As one could imagine, the selection of which boundaries to include could quickly turn into more of an art than a science.

We have no uniform advice on how to analyze geographic discontinuities because it seems that the best approach would be particularly context-specific. It does, however, seem prudent for the analyst, in assessing the internal validity of the research design, to carefully consider three sets of questions. First, what is the process that led to the location of the boundaries? Which came first: the houses or the boundaries? Were the boundaries a response to some preexisting geographical or political constraint? Second, how might sorting of families or the endogenous location of houses affect the analysis? And third, what are all the things differing between the two regions other than the treatment of interest? An exemplary analysis and discussion of these latter two issues in the context of school attendance zones is found in Bayer, Ferreira, and McMillan (2007).

7. Concluding Remarks on RD Designs in Economics: Progress and Prospects

Our reading of the existing and active literature is that—after being largely ignored by economists for almost forty years—there have been significant inroads made in understanding the properties, limitations, interpretability, and perhaps most importantly, in the useful application of RD designs to a wide variety of empirical questions in economics. These developments have, for the most part, occurred within a short period of time, beginning in the late 1990s.
Here we highlight what we believe are the most significant recent contributions of the economics literature to the understanding and application of RD designs. We believe these are helpful developments in guiding applied researchers who seek to implement RD designs, and we also illustrate them with a few examples from the literature.

• Sorting and Manipulation of the Assignment Variable: Economists consider how self-interested individuals or optimizing organizations may behave in response to rules that allocate resources. It is therefore unsurprising that the discussion of how endogenous sorting around the discontinuity threshold can invalidate the RD design has been found (to our knowledge, exclusively) in the economics literature. By contrast, textbook treatments of RD outside economics do not discuss this sorting or manipulation, and give the impression that knowledge of the assignment rule is sufficient for the validity of the RD.50

We believe a "state-of-the-art" RD analysis today will consider carefully the possibility of endogenous sorting. A recent analysis that illustrates this standard is that of Urquiola and Verhoogen (2009), who examine the class size cap RD design pioneered by Angrist and Lavy (1999) in the context of Chile's highly liberalized market for primary schools. In a certain segment of the private market, schools receive a fixed payment per student from the government. However, each school faces a very high marginal cost (hiring one extra teacher) for crossing a multiple of the class size cap. Perhaps unsurprisingly, they find striking discontinuities in the histogram of the assignment variable (total enrollment in the grade), with an undeniable "stacking" of schools at the relevant class size cap cutoffs. They also provide evidence that those families in schools just to the left and right of the thresholds are systematically different in family income, suggesting some degree of sorting. For this reason, they conclude that an RD analysis in this particular context is most likely inappropriate.51 This study, as well as the analysis of Bayer, Ferreira, and McMillan (2007), reflects a heightened awareness of a sorting issue recognized since the beginning of the recent wave of RD applications in economics.52 From a practitioner's perspective, an important recent development

50 For example, Trochim (1984) characterizes the three central assumptions of the RD design as: (1) perfect adherence to the cutoff rule, (2) having the correct functional form, and (3) no other factors (other than the program of interest) cause the discontinuity. More recently, William R. Shadish, Cook, and Campbell (2002) claim on page 243 that the proof of the unbiasedness of RD primarily follows from the fact that treatment is known perfectly once the assignment variable is known. They go on to argue that this deterministic rule implies omitted variables will not pose a problem. But Hahn, Todd, and van der Klaauw (2001) make it clear that the existence of a deterministic rule for the assignment of treatment is not sufficient for unbiasedness, and that it is necessary to assume that the influence of all other factors (omitted variables) is the same on either side of the discontinuity threshold (i.e., their continuity assumption).

51 Urquiola and Verhoogen (2009) emphasize that the sorting issues may well be specific to the liberalized nature of the Chilean primary school market, and that they may or may not be present in other countries.

52 See, for example, footnote 23 in van der Klaauw (1997) and page 549 in Angrist and Lavy (1999).

is the notion that we can empirically examine the degree of sorting, and one way of doing so is suggested in McCrary (2008).

• RD Designs as Locally Randomized Experiments: Economists are hesitant to apply methods that have not been rigorously formalized within an econometric framework, and where crucial identifying assumptions have not been clearly specified. This is perhaps one of the reasons why RD designs were underutilized by economists for so long, since it is only relatively recently that the underlying assumptions needed for the RD were formalized.53 In the recent literature, RD designs were initially viewed as a special case of matching (Heckman, Lalonde, and Smith 1999), or alternatively as a special case of IV (Angrist and Krueger 1999), and these perspectives may have provided empirical researchers a familiar econometric framework within which identifying assumptions could be more carefully discussed.

Today, RD is increasingly recognized in applied research as a distinct design that is a close relative to a randomized experiment. As formally shown in Lee (2008), even when individuals have some control over the assignment variable, as long as this control is imprecise—that is, the ex ante density of the assignment variable is continuous—the consequence will be local randomization of the treatment. So in a number of nonexperimental contexts where resources are allocated based on a sharp cutoff rule, there may indeed be a hidden randomized experiment to utilize. And furthermore, as in a randomized experiment, this implies that all observable baseline covariates will locally have the same distribution on either side of the discontinuity threshold—an empirically testable proposition.

We view the testing of the continuity of the baseline covariates as an important part of assessing the validity of any RD design—particularly in light of the incentives that can potentially generate sorting—and as something that truly sets RD apart from other evaluation strategies. Examples of this kind of testing of the RD design include Jordan D. Matsudaira (2008), Card, Raj Chetty, and Andrea Weber (2007), DiNardo and Lee (2004), Lee, Moretti, and Butler (2004), McCrary and Royer (2003), Greenstone and Gallagher (2008), and Urquiola and Verhoogen (2009).
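As an illustration of how this testable proposition might be checked in practice, the sketch below estimates the jump in a predetermined covariate at the cutoff using separate linear fits on each side. The function name, the polyfit-based implementation, and the bandwidth choice are our own illustrative assumptions, not a procedure prescribed by the papers cited above:

```python
import numpy as np

def discontinuity_gap(x, y, cutoff, bandwidth):
    """Estimate the jump in E[y | x] at the cutoff by fitting separate
    linear regressions on each side within the bandwidth and comparing
    their predicted values at the cutoff."""
    def boundary_fit(mask):
        # With x recentered at the cutoff, the intercept is the fitted
        # value of y exactly at the threshold.
        slope, intercept = np.polyfit(x[mask] - cutoff, y[mask], 1)
        return intercept
    left = boundary_fit((x >= cutoff - bandwidth) & (x < cutoff))
    right = boundary_fit((x >= cutoff) & (x <= cutoff + bandwidth))
    return right - left
```

Applied to a baseline covariate, an estimated gap close to zero (relative to its sampling error) is consistent with the local randomization argument; a sizable gap is a warning sign of sorting.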

53 An example of how economists'/econometricians' notion of a proof differs from that in other disciplines is found in Cook (2008), who views the discussion in Arthur S. Goldberger (1972a) and Goldberger (1972b) as the first "proof of the basic design," quoting the following passage in Goldberger (1972a) (brackets from Cook 2008): "The explanation for this serendipitous result [no bias when selection is on an observed pretest score] is not hard to locate. Recall that z [a binary variable representing the treatment contrast at the cutoff] is completely determined by pretest score x [an obtained ability score]. It cannot contain any information about x* [true ability] that is not contained within x. Consequently, when we control on x as in the multiple regression, z has no explanatory power with respect to y [the outcome measured with error]. More formally, the partial correlation of y and z controlling on x vanishes although the simple correlation of y and z is nonzero" (p. 647). After reading the article, an econometrician will recognize the discussion above not as a proof of the validity of the RD, but rather as a restatement of the consequence of z being an indicator variable determined by an observed variable x, in a specific parameterized example. Today we know the existence of such a rule is not sufficient for a valid RD design, and a crucial necessary assumption is the continuity of the influence of all other factors, as shown in Hahn, Todd, and van der Klaauw (2001). In Goldberger (1972a), the role of the continuity of omitted factors was not mentioned (although it is implicitly assumed in the stylized model of test scores involving normally distributed and independent errors). Indeed, apparently Goldberger himself later clarified that he did not set out to propose the RD design, and was instead interested in the issues related to selection on observables and unobservables (Cook 2008).
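A crude first pass at the kind of density diagnostic mentioned above (the histogram "stacking" found by Urquiola and Verhoogen (2009), formalized as a test in McCrary (2008)) is simply to compare the number of observations just on either side of the cutoff. The function name and bin width are illustrative; this is not the formal McCrary test, which smooths the histogram and provides a standard error:

```python
import numpy as np

def density_jump_ratio(x, cutoff, bin_width):
    """Compare the count of observations in the first bin at or above
    the cutoff to the count in the last bin below it. A ratio far from
    1 is a red flag for stacking of the assignment variable."""
    below = np.sum((x >= cutoff - bin_width) & (x < cutoff))
    above = np.sum((x >= cutoff) & (x < cutoff + bin_width))
    return above / below
```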

• Graphical Analysis and Presentation: The graphical presentation of an RD analysis is not a contribution of economists,54 but it is safe to say that the body of work produced by economists has led to a kind of "industry standard" that the transparent identification strategy of the RD be accompanied by an equally transparent graph showing the empirical relation between the outcome and the assignment variable. Graphical presentations of RD are so prevalent in applied research that it is tempting to guess that studies not including the graphical evidence are ones where the graphs are not compelling or well-behaved.

In an RD analysis, the graph is indispensable because it can summarize a great deal of information in one picture. It can give a rough sense of the range of both the assignment variable and the outcome variable, as well as the overall shape of the relationship between the two, thus indicating what functional forms are likely to make sense. It can also alert the researcher to potential outliers in both the assignment and outcome variables. A graph of the raw means—in nonoverlapping intervals, as discussed in section 4.1—also gives a rough sense of the likely sampling variability of the RD gap estimate itself, since one can compare the size of the jump at the discontinuity to the natural "bumpiness" in the graph away from the discontinuity. Our reading of the literature is that the most informative graphs are ones that simultaneously allow the raw data "to speak for themselves" in revealing a discontinuity if there is one, yet at the same time treat data near the threshold the same as data away from the threshold.55 There are many examples that follow this general principle; recent ones include Matsudaira (2008), Card, Chetty, and Weber (2007), Card, Dobkin, and Maestas (2009), McCrary and Royer (2003), Lee (2008), and Ferreira and Gyourko (2009).

• Applicability: Soon after the introduction of RD, in a chapter in a book on research methods, Campbell and Julian C. Stanley (1963) wrote that the RD design was "very limited in range of possible applications." The emerging body of research produced by economists in recent years has proven quite the opposite. Our survey of the literature suggests that there are many kinds of discontinuous rules that can help answer important questions in economics and related areas. Indeed, one may go so far as to guess that whenever a scarce resource is rationed for individual entities, if the political climate demands a transparent way of distributing that resource, it is a good bet there is an RD design lurking in the background. In addition, it seems that the approach of using changes in laws that disqualify older birth cohorts based on their date of birth (as in Card and Shore-Sheppard (2004) or Oreopoulos (2006)) may well have much wider applicability.
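The graph of raw means in nonoverlapping intervals described above can be sketched as follows. The helper name and the convention of anchoring bin edges at the cutoff (so that no bin straddles the discontinuity) are our own illustrative choices:

```python
import numpy as np

def binned_means(x, y, cutoff, bin_width):
    """Means of y in nonoverlapping bins of x, with bin edges anchored
    at the cutoff so that no bin straddles the discontinuity."""
    # Build edges extending outward from the cutoff in both directions,
    # far enough to cover the full range of x.
    n_left = int(np.floor((cutoff - x.min()) / bin_width)) + 1
    n_right = int(np.floor((x.max() - cutoff) / bin_width)) + 1
    edges = cutoff + bin_width * np.arange(-n_left, n_right + 1)
    centers, means = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        if mask.any():
            centers.append((lo + hi) / 2)
            means.append(y[mask].mean())
    return np.array(centers), np.array(means)
```

Plotting the bin means against the bin centers, with a vertical line at the cutoff, produces the standard RD figure; the jump at the cutoff can then be judged against the bin-to-bin bumpiness away from it.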

54 Indeed, the original article of Thistlethwaite and Campbell (1960) included a graphical analysis of the data.

55 For example, graphing a smooth conditional expectation function everywhere except at the discontinuity threshold violates this principle.

One way to understand both the applicability and limitations of the RD design is to recognize its relation to a standard econometric policy evaluation framework, where the main variable of interest is a potentially endogenous binary treatment variable (as considered in Heckman 1978 or more recently discussed in Heckman and Vytlacil 2005). This selection model applies to a great deal of economic problems. As we pointed out in section 3, the RD design describes a situation where you are able to observe the latent variable that determines treatment. As long as the density of that variable is continuous for each individual, the benefit of observing the latent index is that one neither needs to make exclusion restrictions nor assume any variable (i.e., an instrument) is independent of errors in the outcome equation.

From this perspective, for the class of problems that fit into the standard treatment evaluation problem, RD designs can be seen as a subset, since there is an institutional, index-based rule playing a role in determining treatment. Among this subset, the binding constraint of RD lies in obtaining the necessary data: readily available public-use household survey data, for example, will often only contain variables that are correlated with the true assignment variable (e.g., reported income in a survey, as opposed to the income used for allocation of benefits), or are measured too coarsely (e.g., years rather than months or weeks) to detect a discontinuity in the presence of a regression function with significant curvature. This is where there can be a significant payoff to investing in securing high quality data, which is evident in most of the studies listed in table 5.

7.1 Extensions

We conclude by discussing two natural directions in which the RD approach can be extended. First, we have discussed the "fuzzy" RD design as an important departure from the "classic" RD design where treatment is a deterministic function of the assignment variable, but there are other departures that could be practically relevant but are not as well understood. For example, even if there is perfect compliance with the discontinuous rule, it may be that the researcher does not directly observe the assignment variable, but instead possesses a slightly noisy measure of the variable. Understanding the effects of this kind of measurement error could further expand the applicability of RD. In addition, there may be situations where the researcher both suspects and statistically detects some degree of precise sorting around the threshold, but the sorting may appear to be relatively minor, even if statistically significant (based on observing discontinuities in baseline characteristics). The challenge, then, is to specify under what conditions one can correct for small amounts of this kind of contamination.

Second, so far we have discussed the sorting or manipulation issue as a potential problem or nuisance for the general program evaluation problem. But there is another way of viewing this sorting issue. The observed sorting may well be evidence of economic agents responding to incentives, and may help identify economically interesting phenomena. That is, economic behavior may be what is driving discontinuities in the frequency distribution of grade enrollment (as in Urquiola and Verhoogen 2009), or in the distribution of roll call votes (as in McCrary 2008), or in the distribution of age at offense (as in Lee and McCrary 2005), and those behavioral responses may be of interest. These cases, as well as the age/time and boundary discontinuities discussed above, do not fit into the "standard" RD framework, but nevertheless can tell us something important about behavior, and further expand the kinds of questions that can be addressed by exploiting discontinuous rules to identify meaningful economic parameters of interest.

References

Albouy, David. 2008. "Do Voters Affect or Elect Policies? A New Perspective with Evidence from the U.S. Senate." Unpublished.

Albouy, David. 2009. "Partisan Representation in Congress and the Geographic Distribution of Federal Funds." National Bureau of Economic Research Working Paper 15224.
Angrist, Joshua D. 1990. "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records." American Economic Review, 80(3): 313–36.
Angrist, Joshua D., and Alan B. Krueger. 1999. "Empirical Strategies in Labor Economics." In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1277–1366. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Angrist, Joshua D., and Victor Lavy. 1999. "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." Quarterly Journal of Economics, 114(2): 533–75.
Asadullah, M. Niaz. 2005. "The Effect of Class Size on Student Achievement: Evidence from Bangladesh." Applied Economics Letters, 12(4): 217–21.
Battistin, Erich, and Enrico Rettore. 2002. "Testing for Programme Effects in a Regression Discontinuity Design with Imperfect Compliance." Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(1): 39–57.
Battistin, Erich, and Enrico Rettore. 2008. "Ineligibles and Eligible Non-participants as a Double Comparison Group in Regression-Discontinuity Designs." Journal of Econometrics, 142(2): 715–30.
Baum-Snow, Nathaniel, and Justin Marion. 2009. "The Effects of Low Income Housing Tax Credit Developments on Neighborhoods." Journal of Public Economics, 93(5–6): 654–66.
Bayer, Patrick, Fernando Ferreira, and Robert McMillan. 2007. "A Unified Framework for Measuring Preferences for Schools and Neighborhoods." Journal of Political Economy, 115(4): 588–638.
Behaghel, Luc, Bruno Crépon, and Béatrice Sédillot. 2008. "The Perverse Effects of Partial Employment Protection Reform: The Case of French Older Workers." Journal of Public Economics, 92(3–4): 696–721.
Berk, Richard A., and Jan de Leeuw. 1999. "An Evaluation of California's Inmate Classification System Using a Generalized Regression Discontinuity Design." Journal of the American Statistical Association, 94(448): 1045–52.
Berk, Richard A., and David Rauma. 1983. "Capitalizing on Nonrandom Assignment to Treatments: A Regression-Discontinuity Evaluation of a Crime-Control Program." Journal of the American Statistical Association, 78(381): 21–27.
Black, Dan A., Jose Galdo, and Jeffrey A. Smith. 2007a. "Evaluating the Regression Discontinuity Design Using Experimental Data." Unpublished.
Black, Dan A., Jose Galdo, and Jeffrey A. Smith. 2007b. "Evaluating the Worker Profiling and Reemployment Services System Using a Regression Discontinuity Approach." American Economic Review, 97(2): 104–07.
Black, Dan A., Jeffrey A. Smith, Mark C. Berger, and Brett J. Noel. 2003. "Is the Threat of Reemployment Services More Effective Than the Services Themselves? Evidence from Random Assignment in the UI System." American Economic Review, 93(4): 1313–27.
Black, Sandra E. 1999. "Do Better Schools Matter? Parental Valuation of Elementary Education." Quarterly Journal of Economics, 114(2): 577–99.
Blundell, Richard, and Alan Duncan. 1998. "Kernel Regression in Empirical Microeconomics." Journal of Human Resources, 33(1): 62–87.
Buddelmeyer, Hielke, and Emmanuel Skoufias. 2004. "An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA." World Bank Policy Research Working Paper 3386.
Buettner, Thiess. 2006. "The Incentive Effect of Fiscal Equalization Transfers on Tax Policy." Journal of Public Economics, 90(3): 477–97.
Campbell, Donald T., and Julian C. Stanley. 1963. "Experimental and Quasi-experimental Designs for Research on Teaching." In Handbook of Research on Teaching, ed. N. L. Gage, 171–246. Chicago: Rand McNally.
Canton, Erik, and Andreas Blom. 2004. "Can Student Loans Improve Accessibility to Higher Education and Student Performance? An Impact Study of the Case of SOFES, Mexico." World Bank Policy Research Working Paper 3425.
Card, David, Raj Chetty, and Andrea Weber. 2007. "Cash-on-Hand and Competing Models of Intertemporal Behavior: New Evidence from the Labor Market." Quarterly Journal of Economics, 122(4): 1511–60.
Card, David, Carlos Dobkin, and Nicole Maestas. 2008. "The Impact of Nearly Universal Insurance Coverage on Health Care Utilization: Evidence from Medicare." American Economic Review, 98(5): 2242–58.
Card, David, Carlos Dobkin, and Nicole Maestas. 2009. "Does Medicare Save Lives?" Quarterly Journal of Economics, 124(2): 597–636.
Card, David, Alexandre Mas, and Jesse Rothstein. 2008. "Tipping and the Dynamics of Segregation." Quarterly Journal of Economics, 123(1): 177–218.
Card, David, and Lara D. Shore-Sheppard. 2004. "Using Discontinuous Eligibility Rules to Identify the Effects of the Federal Medicaid Expansions on Low-Income Children." Review of Economics and Statistics, 86(3): 752–66.
Carpenter, Christopher, and Carlos Dobkin. 2009. "The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age." American Economic Journal: Applied Economics, 1(1): 164–82.
Cascio, Elizabeth U., and Ethan G. Lewis. 2006. "Schooling and the Armed Forces Qualifying Test: Evidence from School-Entry Laws." Journal of Human Resources, 41(2): 294–318.
Chay, Kenneth Y., and Michael Greenstone. 2003. "Air Quality, Infant Mortality, and the Clean Air Act of 1970." National Bureau of Economic Research Working Paper 10053.
Chay, Kenneth Y., and Michael Greenstone. 2005.

“Does Air Quality Matter? Evidence from the Hous- ­Support and Elderly Living Arrangements in a Low- ing Market.” Journal of Political Economy, 113(2): Income Country.” Journal of Human Resources, 376–424. 40(1): 186–207. Chay, Kenneth Y., Patrick J. McEwan, and Miguel Fan, Jianqing, and Irene Gijbels. 1996. Local Polyno- Urquiola. 2005. “The Central Role of Noise in mial Modelling and Its Applications. London; New Evaluating Interventions That Use Test Scores to York and Melbourne: Chapman and Hall. Rank Schools.” American Economic Review, 95(4): Ferreira, Fernando. Forthcoming. “You Can Take It 1237–58. With You: Proposition 13 Tax Benefits, Residential Chen, M. Keith, and Jesse M. Shapiro. 2004. “Does Mobility, and Willingness to Pay for Housing Ameni- Prison Harden Inmates? A Discontinuity-Based ties.” Journal of Public Economics. Approach.” Yale University Cowles Foundation Dis- Ferreira, Fernando, and Joseph Gyourko. 2009. “Do cussion Paper 1450. Political Parties Matter? Evidence from U.S. Cities.” Chen, Susan, and Wilbert van der Klaauw. 2008. “The Quarterly Journal of Economics, 124(1): 399–422. Work Disincentive Effects of the Disability Insur- Figlio, David N., and Lawrence W. Kenny. 2009. ance Program in the 1990s.” Journal of Economet- “Public Sector Performance Measurement and rics, 142(2): 757–84. Stakeholder Support.” Journal of Public Economics, Chiang, Hanley. 2009. “How Accountability Pressure 93(9–10): 1069–77. on Failing Schools Affects Student Achievement.” Goldberger, Arthur S. 1972a. “Selection Bias in Evalu- Journal of Public Economics, 93(9–10): 1045–57. ating Treatment Effects: Some Formal Illustrations.” Clark, Damon. 2009. “The Performance and Competi- Unpublished. tive Effects of School Autonomy.” Journal of Political Goldberger, Arthur S. 1972b. Selection Bias in Evalu- Economy, 117(4): 745–83. ating Treatment Effects: The Case of Interaction. Cole, Shawn. 2009. “Financial Development, Bank Unpublished. 
Ownership, and Growth: Or, Does Quantity Imply Goodman, Joshua. 2008. “Who Merits Financial Aid?: Quality?” Review of Economics and Statistics, 91(1): Massachusetts’ Adams Scholarship.” Journal of Pub- 33–51. lic Economics, 92(10–11): 2121–31. Cook, Thomas D. 2008. “‘Waiting for Life to Arrive’: Goolsbee, Austan, and Jonathan Guryan. 2006. “The A History of the Regression-Discontinuity Design Impact of Internet Subsidies in Public Schools.” in Psychology, Statistics and Economics.” Journal of Review of Economics and Statistics, 88(2): 336–47. Econometrics, 142(2): 636–54. Greenstone, Michael, and Justin Gallagher. 2008. Davis, Lucas W. 2008. “The Effect of Driving Restric- “Does Hazardous Waste Matter? Evidence from tions on Air Quality in Mexico City.” Journal of Politi- the Housing Market and the Superfund Program.” cal Economy, 116(1): 38–81. Quarterly Journal of Economics, 123(3): 951–1003. De Giorgi, Giacomo. 2005. “Long-Term Effects of a Guryan, Jonathan. 2001. “Does Money Matter? Mandatory Multistage Program: The New Deal for Regression-Discontinuity Estimates from Education Young People in the UK.” Institute for Fiscal Studies Finance Reform in Massachusetts.” National Bureau Working Paper 05/08. of Economic Research Working Paper 8269. DesJardins, Stephen L., and Brian P. McCall. 2008. Hahn, Jinyong. 1998. “On the Role of the Propensity “The Impact of the Gates Millennium Scholars Pro- Score in Efficient Semiparametric Estimation of gram on the Retention, College Finance- and Work- Average Treatment Effects.” Econometrica, 66(2): Related Choices, and Future Educational Aspirations 315–31. of Low-Income Minority Students.” Unpublished. Hahn, Jinyong, Petra Todd, and Wilbert van der DiNardo, John, and David S. Lee. 2004. “Economic Klaauw. 1999. 
“Evaluating the Effect of an Antidis- Impacts of New Unionization on Private Sector crimination Law Using a Regression-Discontinuity Employers: 1984–2001.” Quarterly Journal of Eco- Design.” National Bureau of Economic Research nomics, 119(4): 1383–1441. Working Paper 7131. Ding, Weili, and Steven F. Lehrer. 2007. “Do Peers Hahn, Jinyong, Petra Todd, and Wilbert van der Affect Student Achievement in China’s Secondary Klaauw. 2001. “Identification and Estimation of Schools?” Review of Economics and Statistics, 89(2): Treatment Effects with a Regression-Discontinuity 300–312. Design.” Econometrica, 69(1): 201–09. Dobkin, Carlos, and Fernando Ferreira. 2009. “Do Heckman, James J. 1978. “Dummy Endogenous Vari- School Entry Laws Affect Educational Attainment ables in a Simultaneous Equation System.” Econo- and Labor Market Outcomes?” National Bureau of metrica, 46(4): 931–59. Economic Research Working Paper 14945. Heckman, James J., Robert J. Lalonde, and Jeffrey A. Edmonds, Eric V. 2004. “Does Illiquidity Alter Child Smith. 1999. “The Economics and Econometrics of Labor and Schooling Decisions? Evidence from Active Labor Market Programs.” In Handbook of Household Responses to Anticipated Cash Trans- Labor Economics, Volume 3A, ed. Orley Ashenfelter fers in South Africa.” National Bureau of Economic and David Card, 1865–2097. Amsterdam; New York Research Working Paper 10265. and Oxford: Elsevier Science, North-Holland. Edmonds, Eric V., Kristin Mammen, and Douglas L. Heckman, James J., and Edward Vytlacil. 2005. “Struc- Miller. 2005. “Rearranging the Family? Income tural Equations, Treatment Effects, and ­Econometric 354 Journal of Economic Literature, Vol. XLVIII (June 2010)

Policy Evaluation.” Econometrica, 73(3): 669–738. Experience: A Regression Discontinuity Analysis of Hjalmarsson, Randi. 2009. “Juvenile Jails: A Path to the Close Elections.” University of California Berkeley Straight and Narrow or to Hardened Criminality?” Center for Labor Economics Working Paper 31. Journal of Law and Economics, 52(4): 779–809. Lee, David S. 2008. “Randomized Experiments from Horowitz, Joel L., and Charles F. Manski. 2000. “Non- Non-random Selection in U.S. House Elections.” parametric Analysis of Randomized Experiments Journal of Econometrics, 142(2): 675–97. with Missing Covariate and Outcome Data.” Jour- Lee, David S. 2009. “Training, Wages, and Sample nal of the American Statistical Association, 95(449): Selection: Estimating Sharp Bounds on Treat- 77–84. ment Effects.” Review of Economic Studies, 76(3): Hoxby, Caroline M. 2000. “The Effects of Class Size 1071–1102. on Student Achievement: New Evidence from Popu- Lee, David S., and David Card. 2008. “Regression Dis- lation Variation.” Quarterly Journal of Economics, continuity Inference with Specification Error.” Jour- 115(4): 1239–85. nal of Econometrics, 142(2): 655–74. Imbens, Guido W., and Joshua D. Angrist. 1994. “Iden- Lee, David S., and Justin McCrary. 2005. “Crime, Pun- tification and Estimation of Local Average Treatment ishment, and Myopia.” National Bureau of Economic Effects.” Econometrica, 62(2): 467–75. Research Working Paper 11491. Imbens, Guido W., and Karthik Kalyanaraman. 2009. Lee, David S., Enrico Moretti, and Matthew J. Butler. “Optimal Bandwidth Choice for the Regression Dis- 2004. “Do Voters Affect or Elect Policies? Evidence continuity Estimator.” National Bureau of Economic from the U.S. House.” Quarterly Journal of Econom- Research Working Paper 14726. ics, 119(3): 807–59. Imbens, Guido W., and Thomas Lemieux. 2008. Lemieux, Thomas, and Kevin Milligan. 2008. 
“Incen- “Regression Discontinuity Designs: A Guide to Prac- tive Effects of Social Assistance: A Regression Dis- tice.” Journal of Econometrics, 142(2): 615–35. continuity Approach.” Journal of Econometrics, Jacob, Brian A., and Lars Lefgren. 2004a. “The Impact 142(2): 807–28. of Teacher Training on Student Achievement: Quasi- Leuven, Edwin, Mikael Lindahl, Hessel Oosterbeek, experimental Evidence from School Reform Efforts and Dinand Webbink. 2007. “The Effect of Extra in Chicago.” Journal of Human Resources, 39(1): Funding for Disadvantaged Pupils on Achievement.” 50–79. Review of Economics and Statistics, 89(4): 721–36. Jacob, Brian A., and Lars Lefgren. 2004b. “Remedial Leuven, Edwin, and Hessel Oosterbeek. 2004. “Evalu- Education and Student Achievement: A Regression- ating the Effect of Tax Deductions on Training.” Discontinuity Analysis.” Review of Economics and Journal of Labor Economics, 22(2): 461–88. Statistics, 86(1): 226–44. Ludwig, Jens, and Douglas L. Miller. 2007. “Does Head Kane, Thomas J. 2003. “A Quasi-experimental Estimate Start Improve Children’s Life Chances? Evidence of the Impact of Financial Aid on College-Going.” from a Regression Discontinuity Design.” Quarterly National Bureau of Economic Research Working Journal of Economics, 122(1): 159–208. Paper 9703. Matsudaira, Jordan D. 2008. “Mandatory Summer Lalive, Rafael. 2007. “Unemployment Benefits, Unem- School and Student Achievement.” Journal of Econo- ployment Duration, and Post-unemployment Jobs: A metrics, 142(2): 829–50. Regression Discontinuity Approach.” American Eco- McCrary, Justin. 2008. “Manipulation of the Running nomic Review, 97(2): 108–12. Variable in the Regression Discontinuity Design: Lalive, Rafael. 2008. “How Do Extended Benefits A Density Test.” Journal of Econometrics, 142(2): Affect Unemployment Duration? A Regression 698–714. Discontinuity Approach.” Journal of Econometrics, McCrary, Justin, and Heather Royer. 2003. “Does 142(2): 785–806. 
Maternal Education Affect Infant Health? A Regres- Lalive, Rafael, Jan C. van Ours, and Josef Zweimüller. sion Discontinuity Approach Based on School Age 2006. “How Changes in Financial Incentives Affect Entry Laws.” Unpublished. the Duration of Unemployment.” Review of Eco- Newey, Whitney K., and Daniel L. McFadden. 1994. nomic Studies, 73(4): 1009–38. “Large Sample Estimation and Hypothesis Test- Lavy, Victor. 2002. “Evaluating the Effect of Teachers’ ing.” In Handbook of Econometrics, Volume 4, ed. Group Performance Incentives on Pupil Achieve- Robert F. Engle and Daniel L. McFadden, 2111– ment.” Journal of Political Economy, 110(6): 2245. Amsterdam; London and New York: Elsevier, 1286–1317. North-Holland. Lavy, Victor. 2004. “Performance Pay and Teachers’ Oreopoulos, Philip. 2006. “Estimating Average and Effort, Productivity and Grading Ethics.” National Local Average Treatment Effects of Education Bureau of Economic Research Working Paper 10622. When Compulsory Schooling Laws Really Matter.” Lavy, Victor. 2006. “From Forced Busing to Free American Economic Review, 96(1): 152–75. Choice in Public Schools: Quasi-Experimental Evi- Pence, Karen M. 2006. “Foreclosing on Opportunity: dence of Individual and General Effects.” National State Laws and Mortgage Credit.” Review of Eco- Bureau of Economic Research Working Paper 11969. nomics and Statistics, 88(1): 177–82. Lee, David S. 2001. “The Electoral Advantage to Pettersson, Per. 2000. “Do Parties Matter for Fis- Incumbency and Voters’ Valuation of Politicians’ cal Policy Choices?” Econometric Society World Lee and Lemieux: Regression Discontinuity Designs in Economics 355

­Congress 2000 Contributed Paper 1373. Social Security Notch.” Review of Economics and Pettersson-Lidbom, Per. 2008a. “Does the Size of the Statistics, 88(3): 482–95. Legislature Affect the Size of Government? Evidence Thistlethwaite, Donald L., and Donald T. Campbell. from Two Natural Experiments.” Unpublished. 1960. “Regression-Discontinuity Analysis: An Alter- Pettersson-Lidbom, Per. 2008b. “Do Parties Matter for native to the Ex Post Facto Experiment.” Journal of Economic Outcomes? A Regression-Discontinuity Educational Psychology, 51(6): 309–17. Approach.” Journal of the European Economic Asso- Trochim, William M. K. 1984. Research Design for ciation, 6(5): 1037–56. Program Evaluation: The Regression-Discontinuity Pitt, Mark M., and Shahidur R. Khandker. 1998. “The Approach. Beverly Hills: Sage Publications. Impact of Group-Based Credit Programs on Poor Urquiola, Miguel. 2006. “Identifying Class Size Effects in Households in Bangladesh: Does the Gender of Developing Countries: Evidence from Rural Bolivia.” Participants Matter?” Journal of Political Economy, Review of Economics and Statistics, 88(1): 171–77. 106(5): 958–96. Urquiola, Miguel, and Eric A. Verhoogen. 2009. “Class- Pitt, Mark M., Shahidur R. Khandker, Signe-Mary Size Caps, Sorting, and the Regression-Discontinu- McKernan, and M. Abdul Latif. 1999. “Credit Pro- ity Design.” American Economic Review, 99(1): grams for the Poor and Reproductive Behavior in 179–215. Low-Income Countries: Are the Reported Causal van der Klaauw, Wilbert. 1997. “A Regression-Dis- Relationships the Result of Heterogeneity Bias?” continuity Evaluation of the Effect of Financial Aid Demography, 36(1): 1–21. Offers on College Enrollment.” New York University Porter, Jack. 2003. “Estimation in the Regression Dis- C.V. Starr Center for Applied Economics Working continuity Model.” Unpublished. Paper 10. Powell, James L. 1994. “Estimation of Semiparametric van der Klaauw, Wilbert. 2002. 
“Estimating the Effect Models.” In Handbook of Econometrics, Volume 4, of Financial Aid Offers on College Enrollment: A ed. Robert F. Engle and Daniel L. McFadden, 2443– Regression-Discontinuity Approach.” International 2521. Amsterdam; London and New York: Elsevier, Economic Review, 43(4): 1249–87. North-Holland. van der Klaauw, Wilbert. 2008a. “Breaking the Link Shadish, William R., Thomas D. Cook, and Donald T. between Poverty and Low Student Achievement: Campbell. 2002. Experimental and Quasi-Experi- An Evaluation of Title I.” Journal of Econometrics, mental Designs for Generalized Causal Inference. 142(2): 731–56. Boston: Houghton Mifflin. van der Klaauw, Wilbert. 2008b. “Regression-Disconti- Silverman, Bernard W. 1986. Density Estimation for nuity Analysis: A Survey of Recent Developments in Statistics and Data Analysis. London and New York: Economics.” Labour, 22(2): 219–45. Chapman and Hall. White, Halbert. 1980. “A Heteroskedasticity-Consistent Snyder, Stephen E., and William N. Evans. 2006. “The Covariance Matrix Estimator and a Direct Test for Effect of Income on Mortality: Evidence from the Heteroskedasticity.” Econometrica, 48(4): 817–38. This article has been cited by:

1. Guido W. Imbens. 2010. Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009)Better LATE Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature 48:2, 399-423. [Abstract] [View PDF article] [PDF with links]