
Regression Discontinuity Design on Model Schools’ Value-Added Effects: Empirical Evidence from Rural Beijing

Kai Hong CentER Graduate School, Tilburg University

April 2010

Abstract In this study we examine the value-added effects of model schools on students’ achievement. We apply a regression discontinuity design to data from Daxin District in rural Beijing. Both parametric and nonparametric approaches are adopted, and the estimated effects are heterogeneous. For science students, we find significant positive effects ranging from 20 to 80. For art students, we find little evidence of positive effects. Three robustness checks, covering additional covariates, school-specific cutoffs and peer effects, further confirm these results. Two policy-related issues are also discussed, namely compliers versus noncompliers and the partially fuzzy design, from which we find smaller effects for the full population and almost the same effects for eligible participants.

Keywords: regression discontinuity design, value-added effect, model school, economics of education

1 Introduction

How to allocate limited educational resources to obtain maximum achievement has long been a controversial issue. On the one hand, education is crucial for further development. For example, in economics, education plays an important role in both economic growth and the reduction of inequality$^{1}$. On the other hand, educational resources are usually limited. For example, public spending on education in most countries accounts for less than 5% of GDP. In China in 2005 this share was only 2.82%. How to balance these two considerations and guarantee that limited educational resources are used efficiently is the central problem in the economics of education. In China, establishing key schools or model schools is regarded as an effective way to solve this problem. Policies regarding model schools have been implemented since 1994. Now, 15 years later, these model schools perform very well in almost all fields, especially in students’ achievement. However, because enrollment in a model school largely depends on a student’s previous performance, students in model schools usually have excellent achievement even before they enter their current schools and would be likely to obtain similar achievement in normal schools. Given the generous educational input these schools receive, it therefore remains questionable whether their value-added effects on students’ achievement are large enough. In this paper we examine the effects of model high schools on students’ achievement in rural Beijing using the regression discontinuity design (RD design for short). The effects of teachers or schools have drawn attention from researchers for several decades. Such effects are usually known as “value-added” effects and are interpreted from either a descriptive or a causal perspective, the latter as treatment effects of certain policies concerning teachers or schools. 
Many specific topics, from theory to practice, are involved in this field, such as the realization of treatments, how to obtain reliable causal estimates of these treatment effects and how to deal with certain specific econometric techniques$^{2}$. The main question is usually stated as “what are the effects on students of being in school A on their subsequent test scores”, or “how much value has a particular school or teacher added to their students’ test scores”. To answer such questions we usually need to compare post-test scores with test scores before the treatment assignment and establish to what extent the difference reflects a causal effect. Before going deeper into this field, it is necessary to review the methods used to identify and estimate causal effects, with an emphasis on one in particular, the regression discontinuity design. The rest of the paper is organized as follows. The second section is an intensive review of the RD design in economics of education. It covers randomized experiments, which are recognized as the standard method for estimating causal effects; the development of the RD design, which is commonly regarded as a quasi-experimental design; a few recent applications of this design to value-added analysis; and a brief conclusion with several relevant prospects. The third section introduces the empirical background, including the relevant exams in Beijing, the model school policies and a description of the data. The fourth section presents evidence of validity: discontinuities in the variables are analyzed graphically and the density of the treatment-determining variable is tested. The fifth section contains the empirical analysis. After an introduction to the RD design framework, both parametric and nonparametric estimation are performed. Robustness checks, including additional covariates, multiple cutoffs and peer effects, are also discussed. 
In the sixth section policy extensions dealing with compliers, noncompliers and eligible participants are intensively discussed. The seventh section concludes.

1 For the relation between education and economic growth, see Stevens and Weale (2003). For the relation between education and eliminating inequality, see Mickelson (1987).
2 See Donald B. Rubin et al. (2004) for a short review of value-added assessment in education.

2 RD Design in Economics of Education

2.1 Basic Settings

Usually the RD design begins with a population of objects N, N = (1, 2, ..., I). An object in it is denoted by i. Such objects can be individuals, households, schools and so on.

For each i several attributes are observed. One is the outcome Yi. We want to know why it varies across objects.

Another is the treatment Ti. If we assume that there are only two levels of treatment for simplicity, we have Ti = 1 for objects in the treatment group and Ti = 0 for those in the control group. Other characteristics can be denoted by Xi. The treatment effect is measured by the difference between outcomes of the same object in the treatment group and the control group, with other characteristics Xi unchanged. For a given individual, we have the following formulation:

$$Y_i = \alpha + \beta_i T_i + \epsilon_i,$$

where $\alpha$ is a constant term and $\epsilon_i$ is the error term. If $T_i = 0$, we have $Y_i = \alpha + \epsilon_i$. So with the assumption that $E(\epsilon_i) = 0$, we have $\alpha = E(Y_i^0)$, where $Y_i^0$ is the outcome without the treatment, and $E(Y_i^1) = \alpha + E(\beta_i)$, where $Y_i^1$ is the outcome with the treatment. Then we have $E(Y_i^1) - E(Y_i^0) = E(\beta_i)$, which is the average treatment effect of the treatment $T$.
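As a hedged illustration of the identity above, the following sketch simulates the model $Y_i = \alpha + \beta_i T_i + \epsilon_i$ and checks numerically that $E(Y_i^1) - E(Y_i^0)$ is close to $E(\beta_i)$. All numbers (the constant 50, the mean effect 5, the noise levels) and the coin-flip assignment of $T_i$ are assumptions made only for illustration and are not taken from the data used in this paper.

```python
import random
from statistics import mean


def simulate(n=20000, alpha=50.0, mean_effect=5.0, seed=0):
    """Simulate Y_i = alpha + beta_i * T_i + eps_i with coin-flip treatment."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        eps = rng.gauss(0.0, 10.0)            # error term with E(eps) = 0
        beta_i = rng.gauss(mean_effect, 2.0)  # individual effect, E(beta_i) = mean_effect
        t = 1 if rng.random() < 0.5 else 0    # random assignment (illustrative assumption)
        y0 = alpha + eps                      # potential outcome without treatment
        y1 = alpha + beta_i + eps             # potential outcome with treatment
        units.append({"y0": y0, "y1": y1, "t": t, "y": y1 if t else y0})
    return units


units = simulate()
# Infeasible in practice: uses both potential outcomes of every unit.
ate_from_potential = mean(u["y1"] for u in units) - mean(u["y0"] for u in units)
# Feasible: uses only the outcome actually observed for each unit.
ate_from_observed = (mean(u["y"] for u in units if u["t"] == 1)
                     - mean(u["y"] for u in units if u["t"] == 0))
print(round(ate_from_potential, 2), round(ate_from_observed, 2))
```

Both quantities should be close to the assumed mean effect; the second one relies on the treatment being assigned independently of the potential outcomes.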

2.2 Random Experiment

The central idea of causal inference about treatment effects goes back to Rubin (1974), where the notions of randomized experiments and potential outcomes were introduced. As pointed out there, the basic expression of the effect of a treatment, say $T$, on individual $i$ can be written as $Y_i^1 - Y_i^0$, where $Y_i^1$ and $Y_i^0$ are the outcomes with and without the treatment respectively. However, it is usually impossible to observe both outcomes for a given individual simultaneously. Holland (1986) summarizes two potential ways to deal with this problem: the scientific solution and the statistical solution. Which one is useful depends on the validity of the assumptions$^{3}$. In the scientific solution some special assumptions, such as the untestable unit homogeneity assumption, are proposed. For example, we assume that the outcomes both with and without treatment are the same for two objects, that is, $Y_1^0 = Y_2^0$ and $Y_1^1 = Y_2^1$. If object 1 is in the treatment group and object 2 is in the control group, we observe $Y_1^1$ and $Y_2^0$, and the treatment effect can be measured by the observed difference $Y_1^1 - Y_2^0$. In the statistical solution the average treatment effect is identified under certain conditions, such as the well-known randomization or independence assumption$^{4}$. In more detail, individuals are assigned randomly so that the treatment is independent of all other variables, such as backgrounds or the outcomes themselves. Mathematically, the independence assumption is $(Y^0, Y^1) \perp T$. In this case we have $E(Y^1) = E(Y^1|T = 1)$ and $E(Y^0) = E(Y^0|T = 0)$, so the average treatment effect can be expressed as $E(Y^1|T = 1) - E(Y^0|T = 0)$. Such pure random assignment is usually difficult to realize and calls for well-designed experiments. If these experiments are not available, selection bias in the estimated average treatment effect may arise. One common case occurs when the assignment to treatment is determined by a predictor$^{5}$, which can be denoted by $S_i$. There

3 The former is commonly used in the science laboratory and the latter is usually preferred in social experiments.
4 Of course, there are also some additional assumptions needed to make causal inference simpler, such as the Stable Unit Treatment Value Assumption (SUTVA) argued by Rubin (1986). The SUTVA contains two components: one is that all objects in a certain group, such as the treatment group or control group, should receive the same treatment; the other is that the potential outcomes of a certain object should not be affected by the treatment status of another object.
5 There are also many other methods to solve the problem of non-random experiments. Sometimes researchers can replicate a random experiment by matching methods on observables, or IV strategies on unobservables. For details see Heckman, Ichimura and Todd (1998) and Imbens and Angrist (1994) respectively.

is a cutoff point $S^0$ of this covariate$^{6}$, and units are treated if the value of the covariate is on one side of this point and not treated if it is on the other side. So we have $T_i = 1$ if $S_i \ge S^0$ and $T_i = 0$ if $S_i < S^0$. This idea leads to another analytical framework, the regression discontinuity design. The following sections review the development of this design, with a focus on applications to economics of education.

2.3 Origin

The concept of the RD design was first introduced by Thistlethwaite and Campbell (1960), who analyze the effect of recognition (a Certificate of Merit) on several outcomes relating to high school students$^{7}$. The award of the Certificate of Merit is mainly decided on the basis of qualifying scores. The paper presents its results briefly in graphical form. The problem of significance testing is also discussed and the t-test from Mood (1950)$^{8}$ is adopted. The main purpose of the paper is to compare RD analysis with the ex post facto experiment$^{9}$ in both methods and results. One advantage of RD analysis is emphasized: it does not rely upon matching to equate experimental and control groups$^{10}$. The crucial idea behind the RD design is that the assignment to treatment is determined, completely or partially, by other observed variables according to certain administrative decisions. If the relationship between the assignment to treatment and the observed variables is completely deterministic, as in Thistlethwaite and Campbell (1960), the design is called a sharp RD design. If the relationship is not deterministic$^{11}$, the design is called a fuzzy RD design, which is introduced by Campbell (1969). Mathematically, in the former case we have $T_i = \mathbf{1}\{S_i \ge S^0\}$ and

$$\lim_{s \to S^{0+}} \Pr(T_i = 1|S_i = s) - \lim_{s \to S^{0-}} \Pr(T_i = 1|S_i = s) = 1,$$

while in the latter case the indicator $\mathbf{1}\{S_i \ge S^0\}$ no longer determines $T_i$ completely and we only have

$$\lim_{s \to S^{0+}} \Pr(T_i = 1|S_i = s) - \lim_{s \to S^{0-}} \Pr(T_i = 1|S_i = s) > 0$$

(see footnote 12), which means that the jump at the cutoff point is smaller than one. Though introduced as early as the 1960s, the RD design saw few theoretical or practical developments until the mid-1990s, apart from several applications in psychology and education by the pioneers themselves, such as Cook and Campbell (1979). In the mid-1990s, however, research on the design started to boom after a 30-year silence$^{13}$.
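The difference between the two designs can be sketched in a few lines of Python. The cutoff at 0, the treatment probabilities 0.2 and 0.7 in the fuzzy case and the narrow window used to approximate the one-sided limits are hypothetical choices, not values from any of the cited studies.

```python
import random

CUTOFF = 0.0  # hypothetical cutoff S^0


def assign_sharp(s):
    """Sharp design: treatment is a deterministic function of the score."""
    return 1 if s >= CUTOFF else 0


def assign_fuzzy(s, rng, p_below=0.2, p_above=0.7):
    """Fuzzy design: crossing the cutoff only shifts the probability of treatment."""
    p = p_above if s >= CUTOFF else p_below
    return 1 if rng.random() < p else 0


def jump_in_treatment_share(scores, treatments, window=0.1):
    """Treated share just above the cutoff minus treated share just below it."""
    above = [t for s, t in zip(scores, treatments) if CUTOFF <= s < CUTOFF + window]
    below = [t for s, t in zip(scores, treatments) if CUTOFF - window <= s < CUTOFF]
    return sum(above) / len(above) - sum(below) / len(below)


rng = random.Random(1)
scores = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
sharp_jump = jump_in_treatment_share(scores, [assign_sharp(s) for s in scores])
fuzzy_jump = jump_in_treatment_share(scores, [assign_fuzzy(s, rng) for s in scores])
print(sharp_jump, round(fuzzy_jump, 2))
```

In the sharp case the jump equals one by construction, while in the fuzzy case it is positive but smaller than one.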

2.4 Early Applications

There are two possible reasons for the popularity of the RD design since the middle 1990s. One is that more and more programs, not only limited to education, assign treatment in this way. The other is that several advantages of the RD design, such as mild assumptions, are realized by more and more researchers. Van der Klaauw (1997), where the effect of financial aid on college enrollment is studied, reveals the relation between fuzzy RD design and methods of instrumental variables. In the fuzzy RD design there are also unobserved variables which affect the assignment to treatment. If such unknown factors are not independent assignment

6 This covariate can be a single variable or a combination of several variables. It is called the assignment, treatment-determining, selection, running, forcing or rating variable in various strands of the literature. In the remaining sections we use the name treatment-determining variable.
7 These factors include attitudes toward intellectualism, the number of students planning to seek the MD or PhD degree, the number planning to become college teachers or scientific researchers and the number who succeed in obtaining scholarships from other granting agencies.
8 It is the test of the significance of the deviation of the first experimental value beyond the cutoff from a value predicted from a linear fit of the control groups.
9 In an ex post facto experiment, the treatment and control groups are not selected before the experiment, so the treatment cannot be manipulated. The researcher studies the treatment effects after the naturally occurring treatment.
10 Usually these matching methods, such as the propensity score method, are not applicable in the RD design setting because of the violation of the strong ignorability condition. For details see Rosenbaum and Rubin (1983).
11 However, the author argues that under this setting the framework of regression discontinuity design does not work. Though this argument has been revised in several subsequent papers, the essence of the fuzzy design remained unclear until recently.
12 Of course the expression on the left is negative if the rule of assignment to treatment is conducted in the opposite way.
13 See Cook (2008) for a review of the history of the RD design in psychology, education, and economics.

errors, then the simple regression of outcomes on the treatment will give biased estimates$^{14}$. To solve this problem the treatment is replaced by the propensity score, which is estimated first. This two-stage procedure leads to a consistent estimate of the treatment effect. A semi-parametric estimation method is adopted, and a sensitivity analysis checks how the estimates respond to different specifications. Angrist and Lavy (1999) also analyze a fuzzy RD design within the framework of instrumental variables, studying the effect of class size on academic achievement, measured by pupils’ scores. Here Maimonides’ rule$^{15}$ is used to divide students into classes of equal size, which makes class size correlated with enrollment. So instead of the propensity score, class size itself serves as the dependent variable in the first stage. At the time of these early applications instrumental variables (IV) was the dominant method, so many early studies of the RD design also formulate the causal analysis in this framework, like Imbens and Angrist (1994), though they already consider cutoffs on treatment-determining variables. To understand this, it is necessary to briefly discuss the relation between the RD design and alternative methods, such as IV methods and matching, before going deeper into specific issues.

2.5 Matching, IV and RD Design

2.5.1 Matching Methods and (Sharp) RD Design

Generally speaking, we have four potential outcomes: $Y_{1i}^1$, $Y_{1i}^0$, $Y_{0i}^1$ and $Y_{0i}^0$. The first two are outcomes of objects in the treatment group: $Y_{1i}^1$ is the observed outcome while $Y_{1i}^0$ is the unobserved counterfactual outcome. The last two are outcomes of those in the control group: $Y_{0i}^1$ is the unobserved counterfactual outcome while $Y_{0i}^0$ is the observed outcome. Then the average observed difference in outcomes between the two groups can be expressed as:

$$E(Y_{1i}^1) - E(Y_{0i}^0) = E(Y_{1i}^1) - E(Y_{1i}^0) + [E(Y_{1i}^0) - E(Y_{0i}^0)],$$

where $E(Y_{1i}^1) - E(Y_{1i}^0)$ is the average treatment effect on the treated (ATET for short) and $E(Y_{1i}^0) - E(Y_{0i}^0)$ is the selection bias.

If treatment is randomly assigned, we have $E(Y_{1i}^0) - E(Y_{0i}^0) = 0$ and the ATET just equals the average observed difference in outcomes between the two groups. If the treatment is assigned on observables only, matching estimators of treatment effects become useful. To apply them we need additional assumptions. One is the conditional independence assumption $(Y^0, Y^1) \perp T | X$, which is equivalent to $(Y^0, Y^1) \perp T | P(X)$, where $P(X) = \Pr(T = 1|X)$. Another is the overlap condition, which means that for every value of the characteristics $X$ observed in the treatment group there should also be objects in the control group. So under this condition we have $0 < \Pr(T = 1|X = x) < 1$ for all $x$ in the treatment group.

There are two kinds of matching methods: exact matching and inexact matching. In exact matching, we simply match objects in the treatment group and the control group on their observable characteristics $X$. This procedure requires that the characteristics $X$ are discrete and that for each value there are many objects in both groups. If these conditions are not satisfied, exact matching becomes impractical and we have to turn to inexact matching. One of the most popular methods of inexact matching is the method of propensity scores. The propensity score is the conditional probability of receiving treatment given $X$. This method matches on the propensity score and compares objects in both groups whose propensity scores are closest. To estimate the propensity score we can use parametric models such as logit or nonparametric methods. The general formula of the matching ATET is as follows:

$$ATET^{M} = \frac{1}{N_T} \sum_{i \in \{T = 1\}} \Big[ Y_i^1 - \sum_{j} \omega(i, j) Y_j^0 \Big],$$

14 This idea goes back to Barnow et al. (1980).
15 As interpreted by Maimonides, the rule can be stated as follows: Twenty-five children may be assigned to one teacher. If there are more than forty children, two teachers must be appointed. If the number of children is between twenty-five and forty, an additional assistant is needed.

where $N_T$ is the number of objects in the treatment group and $j$ indexes the objects in the comparison group of object $i$ in the treatment group. This comparison group is expressed as $C_j(X) = \{j | X_j \in c(X_i)\}$, where $c(X_i)$ is a neighborhood of the characteristics $X_i$. $\omega(i, j)$ is the weight, with $0 < \omega(i, j) \le 1$, so by choosing different weights we obtain different estimators. Both matching methods and the sharp RD design are cases of assignment on observables only. But we cannot apply matching methods such as propensity score methods to the sharp RD design, because its setting essentially violates the overlap condition. In the sharp RD design we have $\Pr(T = 1|X) = 0$ for all $X < X^0$ and $\Pr(T = 1|X) = 1$ otherwise, so there is no region of common support, as would be required for matching. Formally, the treatment effect of the sharp RD design can be derived as follows:

$$\beta_{sharp} = \lim_{s \to S^{0+}} E(Y_i|S_i = s) - \lim_{s \to S^{0-}} E(Y_i|S_i = s)$$
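A minimal sketch of how this difference of one-sided limits might be estimated is given below. It fits a separate straight line on each side of the cutoff within a bandwidth and compares the two fitted values at the cutoff. The simulated data, the true jump of 20, the bandwidth 0.3 and the helper names are all assumptions for illustration; they are not the estimator or the data used later in this paper.

```python
import random


def fitted_value_at(xs, ys, x0):
    """Fit a straight line to (xs, ys) by OLS and return its value at x0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y + slope * (x0 - mean_x)


def sharp_rd(scores, outcomes, cutoff, bandwidth):
    """Estimated right limit minus left limit of E(Y | S = s) at the cutoff."""
    right = [(s, y) for s, y in zip(scores, outcomes) if cutoff <= s <= cutoff + bandwidth]
    left = [(s, y) for s, y in zip(scores, outcomes) if cutoff - bandwidth <= s < cutoff]
    limit_right = fitted_value_at([s for s, _ in right], [y for _, y in right], cutoff)
    limit_left = fitted_value_at([s for s, _ in left], [y for _, y in left], cutoff)
    return limit_right - limit_left


rng = random.Random(3)
TRUE_EFFECT = 20.0  # hypothetical jump in the outcome at the cutoff
scores = [rng.uniform(-1.0, 1.0) for _ in range(10000)]
outcomes = [400.0 + 30.0 * s + TRUE_EFFECT * (s >= 0.0) + rng.gauss(0.0, 10.0)
            for s in scores]
estimate = sharp_rd(scores, outcomes, cutoff=0.0, bandwidth=0.3)
print(round(estimate, 2))
```

Because every unit above the cutoff is treated and every unit below is not, no unit on one side has a comparable counterpart on the other; the estimate relies entirely on extrapolating each fitted line to the cutoff.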

2.5.2 Instrumental Variables and (Fuzzy) RD design

In a fuzzy RD design, the jump in the probability of treatment at the cutoff point is smaller than one, which implies that the relation between the treatment-determining variables and the treatment is not deterministic. There are several reasons for such non-deterministic assignment. Sometimes we have $T_i(X)$: for individual $i$ the treatment assignment is a deterministic function of the treatment-determining variable, but this function may differ across individuals. Sometimes there is a unique deterministic function that assigns treatment for all individuals, but one or more treatment-determining variables are not observable. In that case we obtain a fuzzy RD design, though essentially it should be a sharp RD design.

In the first case, where we have various deterministic functions, the setting is related to the instrumental variable methods introduced by Imbens and Angrist (1994). They discuss in depth the identification and estimation of a special average treatment effect, called the local average treatment effect (LATE), where there can be no subpopulation with zero probability of treatment$^{16}$.

Now again we have potential outcomes $Y_i^1$ and $Y_i^0$. Moreover, we have an instrumental variable $Z_i$ that is independent of the potential outcomes and related to the treatment. So we have $T_i(z) = 1$ if individual $i$ would be treated with $Z = z$ and $T_i(z) = 0$ if he or she would not be treated with $Z = z$. Then a latent index model is introduced, where the treatment is related to a latent treatment index and this index is determined by the instrumental variables. Mathematically we have

$$T_i^* = \gamma_0 + \gamma_1 Z_i + \tau_i, \qquad T_i = \begin{cases} 1, & T_i^* > 0 \\ 0, & T_i^* \le 0 \end{cases}$$

Then to identify the treatment effect, we need some mild assumptions. One is the assumption of the existence of instruments: for an instrument $Z$ and for all $z$, $(Y_i^0, Y_i^1, T_i(z))$ are jointly independent of $Z_i$, and $P(z) = E(T_i|Z_i = z)$ is a nontrivial function of $z$$^{17}$. Another is the monotonicity assumption: for all $z$ and $\omega$, we have either $T_i(z) \ge T_i(\omega)$ for all individuals $i$ or $T_i(z) \le T_i(\omega)$ for all individuals $i$, where both $T_i(z)$ and $T_i(\omega)$ can be zero or one. Then the LATE can be identified and estimated through the IV approach: $LATE_{z,\omega} = E[Y_i^1 - Y_i^0 | T_i(z) \ne T_i(\omega)]$. This is the average effect for individuals who change their treatment status when the instrument changes.

In a fuzzy RD design with various deterministic assignment functions, the indicator $\mathbf{1}\{S_i \ge S^0\}$, as well as polynomial functions of the treatment-determining variables and other covariates, if any, serve as the instrumental variables. Then under similar assumptions, we can also identify the LATE at the cutoff point. That is the treatment effect for individuals whose treatment status changes discontinuously, say from non-treatment to treatment, when the value of the treatment-determining variable crosses the cutoff point.
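To make the interpretation of the LATE concrete, the sketch below simulates a binary instrument under monotonicity with three hypothetical groups (always-takers, compliers and never-takers) whose shares and effects are invented for illustration. The ratio of the difference in mean outcomes to the difference in treatment rates across the two instrument values should then be close to the effect for compliers only.

```python
import random
from statistics import mean

# Hypothetical population: shares and effects by compliance type (assumptions).
SHARES = [("always", 0.2), ("complier", 0.5), ("never", 0.3)]
EFFECTS = {"always": 15.0, "complier": 8.0, "never": 2.0}


def draw_type(rng):
    """Draw a compliance type according to SHARES."""
    u, cumulative = rng.random(), 0.0
    for kind, share in SHARES:
        cumulative += share
        if u < cumulative:
            return kind
    return SHARES[-1][0]


def simulate(n=40000, seed=2):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0   # randomly assigned binary instrument
        kind = draw_type(rng)
        # Monotonicity: nobody leaves treatment when z switches from 0 to 1.
        t = 1 if kind == "always" or (kind == "complier" and z == 1) else 0
        y = 50.0 + rng.gauss(0.0, 5.0) + EFFECTS[kind] * t
        rows.append((z, t, y))
    return rows


def wald(rows):
    """Ratio of the outcome difference to the treatment-rate difference across z."""
    dy = mean(y for z, _, y in rows if z == 1) - mean(y for z, _, y in rows if z == 0)
    dt = mean(t for z, t, _ in rows if z == 1) - mean(t for z, t, _ in rows if z == 0)
    return dy / dt, dt


late_estimate, first_stage = wald(simulate())
print(round(late_estimate, 2), round(first_stage, 2))
```

The estimate reflects the assumed complier effect rather than the average effect over all three groups, which is the sense in which the effect is local.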

16 In fact, the LATE and the IV estimate are equivalent only with a binary instrument. With multiple instruments they are different in general. For details, see Angrist, Imbens and Rubin (1996).
17 Here both $T_i(z)$ and $Z_i$ are random variables, e.g., $Z_i$ is randomly assigned and does not determine $T_i(z)$ deterministically.

In more detail, we can estimate the treatment effect of the fuzzy RD design with a two-stage procedure. In the first stage, the propensity score function is estimated:

$$E(T_i|S_i) = f(S_i) + \lambda \mathbf{1}\{S_i \ge S^0\},$$

where the polynomial $f$ can be estimated parametrically, semi-parametrically or non-parametrically. In the second stage the control function-augmented outcome equation is estimated:

$$Y_i = \alpha + \beta E(T_i|S_i) + l(S_i) + \epsilon_i,$$

where the estimated propensity score from the first stage replaces the treatment variable. Both $f$ and $l$ are polynomials of $S_i$ and can be chosen separately, so they can be different or the same. Finally we obtain the treatment effect of the fuzzy RD design as follows:

$$\beta_{fuzzy} = \frac{\lim_{s \to S^{0+}} E(Y_i|S_i = s) - \lim_{s \to S^{0-}} E(Y_i|S_i = s)}{\lim_{s \to S^{0+}} E(T_i|S_i = s) - \lim_{s \to S^{0-}} E(T_i|S_i = s)}$$
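The two-stage procedure can be sketched as follows, under the simplifying assumption that both $f$ and $l$ are linear. The data-generating process, including an unobserved factor `u` that affects both treatment take-up and the outcome, is hypothetical, and `ols` is a small helper written for this example rather than a library routine.

```python
import random


def ols(columns, y):
    """OLS coefficients of y on the given regressor columns, via the normal equations."""
    k = len(columns)
    aug = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(k)]
           + [sum(a * b for a, b in zip(columns[i], y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


rng = random.Random(4)
CUTOFF, TRUE_EFFECT = 0.0, 20.0  # hypothetical cutoff and constant treatment effect
s = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
above = [1.0 if si >= CUTOFF else 0.0 for si in s]
u = [rng.gauss(0.0, 1.0) for _ in s]  # unobserved factor (assumption)
# Fuzzy assignment: crossing the cutoff raises the treatment probability by about 0.5.
t = [1.0 if rng.random() < 0.25 + 0.05 * si + 0.5 * d + 0.1 * ui else 0.0
     for si, d, ui in zip(s, above, u)]
y = [400.0 + 30.0 * si + TRUE_EFFECT * ti + 10.0 * ui + rng.gauss(0.0, 10.0)
     for si, ti, ui in zip(s, t, u)]
ones = [1.0] * len(s)

# First stage: E(T|S) = f(S) + lambda * 1{S >= S0}, with f linear.
c0, c1, lam = ols([ones, s, above], t)
t_hat = [c0 + c1 * si + lam * d for si, d in zip(s, above)]

# Second stage: Y = alpha + beta * E(T|S) + l(S) + eps, with l linear.
alpha, beta, l1 = ols([ones, t_hat, s], y)

# For contrast: regressing Y directly on the observed treatment ignores u.
_, naive, _ = ols([ones, t, s], y)
print(round(lam, 2), round(beta, 2), round(naive, 2))
```

With linear $f$ and $l$ the second-stage coefficient coincides with an IV estimate that uses the cutoff indicator as the instrument, so `beta` should be close to the assumed effect, whereas the direct regression is contaminated by the unobserved factor.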

2.6 Identification and Estimation

At almost the same time as the early studies mentioned above, the identification and estimation of the treatment effect in the RD design were also developed by several researchers. Two important questions remained unclear to early researchers. One is what the sources of identification are, and the other is how to estimate the treatment effect under minimal restrictions or assumptions on the functions involved. In this section we briefly discuss the relevant issues and leave details to the following formal analytical sections.

Hahn, Todd and Van der Klaauw (2001) briefly discuss these two questions about the RD design. They show that under several weak functional continuity assumptions, the treatment effect can be identified nonparametrically by comparing persons arbitrarily close to the point at which the probability of receiving treatment changes discontinuously$^{18}$. For estimation, they show that the kernel estimator is numerically equivalent to a standard local Wald estimator$^{19}$ under certain conditions, such as a particular choice of kernel and subsample. But inference based on the two differs because the kernel estimator is asymptotically biased due to its poor boundary behavior$^{20}$. To avoid this problem local linear regression is introduced, whose bias is smaller than that of the standard kernel estimator and does not depend on the density of the data$^{21}$.

Porter (2003) follows the discussion of Hahn et al. (2001) and focuses on bias problems in the estimation of the RD design. To overcome such problems, several estimators are investigated with a view to attaining the optimal convergence rate under various smoothness conditions, including the Nadaraya-Watson estimator, the partially linear estimator and a local polynomial estimator$^{22}$. The last two are advocated because of their bias-reduction property. 
The local polynomial estimator is even more robust because this property also holds under the smoothness conditions of the partially linear estimator. Whichever of the two is adopted, its asymptotic properties rely on the smoothness conditions of the control functions involved in the regression equations. If the degree of smoothness is unknown, these methods may inflate the bias rather than decrease it. To solve this potential problem Sun (2005) introduces an adaptive estimator, which first estimates the degree of smoothness before applying the estimators mentioned above.
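The boundary behavior that motivates local linear regression can be checked numerically. The sketch below is an illustration under assumptions not taken from the cited papers: a noiseless hypothetical regression function, a uniform kernel and an evaluation point at the left boundary of the support. It compares the bias of the local constant (kernel) estimate with that of the local linear estimate for two bandwidths.

```python
def m(s):
    """Hypothetical smooth regression function on [0, 1]; its boundary value is m(0) = 1."""
    return 1.0 + 2.0 * s + 3.0 * s ** 2


def local_constant(xs, ys, x0, h):
    """Kernel (Nadaraya-Watson) estimate at x0 with a uniform kernel of bandwidth h."""
    window = [y for x, y in zip(xs, ys) if abs(x - x0) <= h]
    return sum(window) / len(window)


def local_linear(xs, ys, x0, h):
    """Local linear estimate at x0: OLS line on the same window, evaluated at x0."""
    pts = [(x, y) for x, y in zip(xs, ys) if abs(x - x0) <= h]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    return mean_y + slope * (x0 - mean_x)


grid = [i / 1000 for i in range(1001)]  # noiseless design points on [0, 1]
values = [m(s) for s in grid]
# bias[h] = (bias of local constant, bias of local linear) at the boundary point 0
bias = {h: (local_constant(grid, values, 0.0, h) - m(0.0),
            local_linear(grid, values, 0.0, h) - m(0.0)) for h in (0.2, 0.1)}
for h, (b_const, b_lin) in bias.items():
    print(h, round(b_const, 4), round(b_lin, 4))
```

In this setup halving the bandwidth roughly halves the bias of the local constant fit and cuts the bias of the local linear fit to about a quarter, in line with the $O(h)$ and $O(h^2)$ orders.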

18 In fact, without the constant effect assumption, the treatment effect can only be identified at the point of discontinuity.
19 The Wald estimator was first introduced in Wald (1940). In the binary IV setting, it can be described as follows: $\beta_{Wald} = [E(Y|Z = 1) - E(Y|Z = 0)]/[E(X|Z = 1) - E(X|Z = 0)]$.
20 At the boundary points, the bias of the standard kernel estimator is of order $O(h)$, while it is of order $O(h^2)$ at interior points. So the convergence rate at the boundary points is slower and the bias will be substantial in finite samples.
21 In local linear regression, a local straight line is used to fit the underlying function, while in standard kernel estimation the straight line is simplified to a constant.
22 Here the Nadaraya-Watson estimator is just the local Wald estimator and the local polynomial estimator is just the local linear estimator, both of which are discussed in Hahn et al. (2001).

All the estimators discussed so far are non-parametric or semi-parametric, an approach adopted by a large body of research because of its weak assumptions and mild mis-specification bias$^{23}$. There are also several parametric approaches, such as the control function approach. However, the validity of parametric estimation relies on the correct specification of the control functions and on the stronger global continuity assumption. Although this assumption lets us use data far from the cutoff, the potentially large bias may greatly reduce the accuracy of the estimates.

Following the non-parametric and semi-parametric estimators, there are many further discussions. Some focus on the choice of bandwidth in kernel estimation, such as Imbens and Lemieux (2008) and Ludwig and Miller (2005). A special case is introduced by Black, Galdo and Smith (2007), who mainly discuss the order of the polynomial in local polynomial estimation.

As a sub-conclusion, we summarize the basic assumptions and results of identification. If we assume constant treatment effects and an error term that is continuous at the cutoff, the treatment effects of the sharp and fuzzy RD designs can be derived as follows:

$$\beta_{sharp} = \lim_{s \to S^{0+}} E(Y_i|S_i = s) - \lim_{s \to S^{0-}} E(Y_i|S_i = s),$$

$$\beta_{fuzzy} = \frac{\lim_{s \to S^{0+}} E(Y_i|S_i = s) - \lim_{s \to S^{0-}} E(Y_i|S_i = s)}{\lim_{s \to S^{0+}} E(T_i|S_i = s) - \lim_{s \to S^{0-}} E(T_i|S_i = s)}.$$

2.7 Special Subjects and Relevant Tests

Since its revival, research on the RD design has developed in two directions: theoretical improvements and empirical applications$^{24}$. In this section we mainly focus on the former and give relevant applications for each theoretical development. We also give more weight to those developments that are closely related to the value-added problem in our application.

2.7.1 Covariates Other than Treatment-determining Variables

One natural starting point is to consider the roles of covariates other than the treatment-determining variables. Then the regression model for the observed outcomes becomes:

$$Y_i = \alpha + \beta_i T_i + \gamma_i X_i + \epsilon_i$$

According to Imbens and Lemieux (2008), these covariates are mainly used for three purposes: examining the validity of RD design; eliminating small sample biases and improving the precision. Firstly, if the treatment is locally randomized, then the observed covariates should be locally balanced on either side of the cutoff. Lee and Lemieux (2009) introduces Seemingly Unrelated Regression (SUR) to check whether predetermined covariates are influenced by treatment-determining variables, say whether they are discontinuous at the cutoff. Mathematically, we have several covariates and relevant regression equations:

Xj = αj + βjT + τjS + µj,

where $j = 1, ..., J$, so we have $J$ covariates. Then we just need to perform a $\chi^2$ test of whether the $\beta_j$ are jointly equal to zero. An empirical application can be found in Lee, Moretti and Butler (2004). Secondly, in practice we sometimes include observations whose treatment-determining variables are not close to the cutoff; with additional covariates some of the bias from these observations may be eliminated and the estimator will still be consistent$^{25}$. Given that some observed covariates are controlled, the identifying assumptions can still

23 See van der Klaauw (2008b) for a detailed comparison between non-parametric and parametric estimation.
24 This is a rough classification because few studies focus on only one aspect. Most pay attention mainly to one aspect while also giving consideration to the other.
25 For these observations, the assignment of treatment may no longer be independent of covariates.

hold even if observations somewhat far from the cutoff are included. For example, suppose there is a covariate related to the potential outcome. If it is not controlled for and observations far from the cutoff are included, both the potential outcome and the error term are likely to jump at the cutoff. It is then difficult to distinguish the treatment effect from the effect of this covariate, and a spurious effect will be induced. Thirdly, even if the RD design is valid without additional covariates, incorporating them may still improve efficiency$^{26}$. Generally the variance of the estimator with additional covariates is smaller than that with only the treatment-determining variable. Furthermore, the former decreases with the number of additional covariates$^{27}$.
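The balance check described in this subsection can be sketched as follows. For brevity the sketch estimates each covariate equation separately by OLS and sums the squared t-statistics of the $\beta_j$, which is only a rough stand-in for the joint $\chi^2$ test from a SUR system because it ignores correlation across equations. The simulated covariates and the function names are hypothetical.

```python
import random


def residualize(v, s):
    """Residuals of v after an OLS regression on a constant and s."""
    n = len(v)
    mean_s, mean_v = sum(s) / n, sum(v) / n
    slope = (sum((a - mean_s) * (b - mean_v) for a, b in zip(s, v))
             / sum((a - mean_s) ** 2 for a in s))
    return [b - mean_v - slope * (a - mean_s) for a, b in zip(s, v)]


def jump_t_stat(x, t, s):
    """t-statistic of beta_j in X_j = alpha_j + beta_j*T + tau_j*S + mu_j."""
    # Frisch-Waugh: partial the constant and S out of both X_j and T.
    x_res, t_res = residualize(x, s), residualize(t, s)
    ss_t = sum(a * a for a in t_res)
    beta = sum(a * b for a, b in zip(t_res, x_res)) / ss_t
    rss = sum((b - beta * a) ** 2 for a, b in zip(t_res, x_res))
    se = (rss / (len(x) - 3) / ss_t) ** 0.5
    return beta / se


rng = random.Random(5)
s = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
t = [1.0 if si >= 0.0 else 0.0 for si in s]              # sharp assignment at cutoff 0
x1 = [0.5 * si + rng.gauss(0.0, 1.0) for si in s]        # smooth in S: balanced
x2 = [rng.gauss(0.0, 1.0) for _ in s]                    # unrelated to S: balanced
x3 = [0.5 * si + 1.0 * ti + rng.gauss(0.0, 1.0) for si, ti in zip(s, t)]  # jumps at cutoff

balanced_stat = jump_t_stat(x1, t, s) ** 2 + jump_t_stat(x2, t, s) ** 2
unbalanced_stat = jump_t_stat(x3, t, s) ** 2
print(round(balanced_stat, 2), round(unbalanced_stat, 2))
```

Under the stated independence simplification, the first statistic would be compared with a $\chi^2$ critical value with two degrees of freedom (5.99 at the 5% level), while a covariate that jumps at the cutoff produces a far larger value.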

2.7.2 Discrete Treatment-Determining Variables

Another significant improvement concerns discrete treatment-determining covariates. Card and Shore-Sheppard (2004) is an early investigation of this topic, examining the effects of two large expansions that offered Medicaid coverage to low-income children in certain age ranges. Several variables, such as Medicaid enrollment, are compared on either side of the age cutoff. The model involves a discrete treatment-determining covariate, namely a dummy variable indicating whether age is above the cutoff. A low-order polynomial function of age and income is also included in the model to smooth the change in outcome variables.

Lee and Card (2008) discuss this problem in depth from a purely theoretical viewpoint. They show that if the treatment-determining variable is discrete, the conditions for non-parametric or semi-parametric methods do not hold and consequently the treatment effect is not non-parametrically identified$^{28}$. Identification can then be achieved by introducing an underlying function in parametric form to approximate the relation between the treatment-determining variable and the expected outcome, as is done in Card and Shore-Sheppard (2004). The intuition can be illustrated by the following example. Suppose that we have a discrete treatment-determining variable $X = (x_1, ..., x_J)$. The regression function can be expressed as $E(Y|X = x_j) = T_j \beta_0 + h(x_j)$, which is equivalent to the micro-data model $Y_{ij} = T_j \beta_0 + h(x_j) + \epsilon_{ij}$, where $h(x_j)$ is a continuous function that can be approximated by a certain form, such as a polynomial, and $\epsilon_{ij}$ is the error term, defined as $\epsilon_{ij} = Y_{ij} - E(Y_{ij}|X = x_j)$. However, unlike previous studies, the model does not assume that the form of the underlying regression function is correct; it allows the expected value of the outcome to deviate from the value predicted by the function. Such a deviation is called the random specification error. 
Under the polynomial function assumption, the regression can be written as: Yij = α0 + Tjβ0 + Xjγ0 + αj + ij. Here αj is just the random specification error and defined as αj = h(xj) − Xjγ0. Then under this framework the standard errors of estimation are intensively discussed, say in what conditions heteroskedasticity-consistent standard errors, cluster- consistent standard errors and further adjustment errors are properly used. Finally how to obtain more efficient estimators and the relation between such estimator and Bayesian estimation are also discussed.
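As a minimal sketch of this setup, the Python fragment below simulates a discrete treatment-determining variable, collapses the micro data to the J cell means, and fits a linear specification in x to those cell means. Everything in it is an assumption made for illustration and is not taken from Lee and Card (2008): the support of x, the true function h, the value of β0, the cell size and the helper name `ols`. The true h is quadratic while the fitted specification is linear, so the residuals of the cell means mix sampling noise with the specification error αj.

```python
import random

def ols(rows, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(0)
cutoff, beta0 = 0, 3.0                  # hypothetical cutoff and treatment effect
support = list(range(-5, 5))            # hypothetical discrete values x_1, ..., x_J

def h(x):
    # Hypothetical true smooth function; unknown to the researcher in practice.
    return 2.0 + 0.5 * x + 0.05 * x * x

def treat(x):
    return 1.0 if x >= cutoff else 0.0

# Micro data Y_ij = T_j * beta0 + h(x_j) + eps_ij, collapsed to cell means.
cell_means = []
for x in support:
    ys = [treat(x) * beta0 + h(x) + random.gauss(0, 1) for _ in range(200)]
    cell_means.append(sum(ys) / len(ys))

# Fit the (misspecified) linear form alpha0 + T_j * beta0 + x_j * gamma0.
rows = [[1.0, treat(x), float(x)] for x in support]
alpha_hat, beta_hat, gamma_hat = ols(rows, cell_means)

# Deviations of the cell means from the fitted line: noise plus specification error.
deviations = [m - (alpha_hat + beta_hat * r[1] + gamma_hat * r[2])
              for m, r in zip(cell_means, rows)]
print(round(beta_hat, 3), [round(d, 3) for d in deviations])
```

The sketch stops at point estimates. The choice among heteroskedasticity-consistent, cluster-consistent and further adjusted standard errors discussed above is not implemented here.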

2.7.3 Continuity of Density

As shown in Hahn et al. (2001), the RD design can be related to treatment effects and can be as good as a randomized experiment as long as the expectations of the potential outcomes under the treatment and control states are both continuous in the treatment-determining variable, especially around the cutoff. The validity of the RD design can be tested by examining whether the density of the treatment-determining variable is continuous at the cutoff. However, as several studies show, this assumption does not always hold. Martorell (2004) is one of the early studies to mention this problem; it examines the effects of high school graduation exams on several outcomes, such as graduation and earnings. Here whether unobservable determinants of the student outcome

26 However, including additional covariates does not necessarily improve the efficiency of estimation. See Lee (2008) for a counterexample.
27 For details about the RD design with additional covariates, see Frolich (2007).
28 That is because there are no observations in an arbitrarily small neighborhood of the cutoff even with infinite data. The kernel estimator in the limit will then put all weight on the "empty" neighborhood extremely close to the cutoff.

exhibit discontinuous behavior at the passing cutoff serves as the crucial condition under which the causal effect is valid. Lee (2008) finds that if individuals are able to perfectly manipulate the treatment-determining variable, then the density of this variable is likely to be discontinuous at the cutoff. The continuity condition is formalized as follows: the cdf of the treatment-determining variable $s$ conditional on the error term $\varepsilon$, $F(s \mid \varepsilon)$, is continuously differentiable and satisfies $F(s \mid \varepsilon) \in (0, 1)$ at the cutoff for each $\varepsilon$. It is also shown that the treatment-determining variable can contain two components. One is systematic and can be affected by the actions of individuals; the other is an exogenous random chance component. With the second component, though endogenous sorting still exists to some extent, local random assignment can occur because manipulation is imprecise, and unbiased impact estimates can still be obtained. An example regarding U.S. House elections is analyzed to illustrate the main points. Though the test of the continuity of the expectation of pre-determined characteristics in the treatment-determining variable is a powerful procedure, sometimes these characteristics are unobserved or unavailable. As discussed above, in that case the test about the discontinuity of density function of the treatment-determining variable is not suitable. McCrary (2008) introduces a special density test for this issue, which is based on an estimator of the discontinuity in the density function of the treatment-determining variable at the cutoff. The test is a Wald test with the null hypothesis that the discontinuity is zero. It is implemented in two steps: first a finely gridded histogram is obtained, and then local linear regression is used to smooth the histogram on either side of the cutoff point.
Two additional conditions for the validity of this test are also discussed. One is the monotonicity of manipulation, that is, all individuals should manipulate the treatment-determining variable in the same direction. The other is that the identification actually fails because of such manipulation. A continuous density of the treatment-determining variable is neither necessary nor sufficient for identification29.
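The two steps of the density test can be sketched as follows. This is only an illustration under stated assumptions, not McCrary's (2008) implementation: the bin width, the bandwidth, the triangle kernel weights and the function names `wls_at_zero` and `density_log_jump` are choices made here, the simulated scores are hypothetical, and the standard error needed for the Wald statistic is not computed. The function returns the estimated log difference in the height of the density just right and just left of the cutoff.

```python
import math
import random

def wls_at_zero(xs, ys, ws):
    """Weighted linear fit y = a + b*x; returns a, the fitted value at x = 0."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my - (sxy / sxx) * mx

def density_log_jump(values, cutoff, binwidth, bandwidth):
    n = len(values)
    # Step 1: finely gridded histogram whose bins never straddle the cutoff.
    counts = {}
    for v in values:
        k = math.floor((v - cutoff) / binwidth)
        counts[k] = counts.get(k, 0) + 1
    # Step 2: local linear smoothing of the bin heights on each side.
    nbins = int(bandwidth / binwidth)
    fitted = {}
    for side, ks in (("left", range(-nbins, 0)), ("right", range(0, nbins))):
        mids = [(k + 0.5) * binwidth for k in ks]          # relative to the cutoff
        heights = [counts.get(k, 0) / (n * binwidth) for k in ks]
        weights = [1 - abs(m) / bandwidth for m in mids]   # triangle kernel
        fitted[side] = wls_at_zero(mids, heights, weights)
    # Assumes both fitted heights are positive.
    return math.log(fitted["right"]) - math.log(fitted["left"])

random.seed(1)
# Hypothetical scores without manipulation: smooth density around the cutoff.
clean = [random.uniform(400, 550) for _ in range(60000)]
# Hypothetical manipulation: half of those just below the cutoff move above it.
manipulated = [v + 5 if 469 <= v < 474 and random.random() < 0.5 else v
               for v in clean]

jump_clean = density_log_jump(clean, 474, binwidth=1.0, bandwidth=20.0)
jump_manip = density_log_jump(manipulated, 474, binwidth=1.0, bandwidth=20.0)
print(round(jump_clean, 3), round(jump_manip, 3))
```

In the simulated example all manipulation moves scores upward, which corresponds to the monotonicity condition mentioned above; the estimated jump should be close to zero for the clean scores and clearly positive for the manipulated ones.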

2.8 Fuzzy RD Design

In the potential settings of our value-added problem, as in many other quasi-experimental settings in education, the treatment is rarely completely determined by the treatment-determining variables. A separate discussion of the fuzzy RD design is therefore necessary. A fuzzy RD design means that the size of the discontinuity in the probability of treatment is less than one. The treatment effect can be identified by a process similar to that of the sharp RD design, but interpreting it requires more and stronger conditions. For example, it can be defined as the mean effect on the subpopulation of compliers in a neighborhood of the cutoff30. We will discuss the identification and interpretation of the fuzzy RD design in greater detail in other relevant sections. Chay, McEwan and Urquiola (2005) is an early and typical empirical study on this topic. It analyzes the effects of Chile’s 900 Schools Program, which allocated resources based on cutoffs in schools’ mean test scores. Because noise and mean reversion can bias the estimates from conventional strategies, an RD design is used to solve this problem. Meanwhile, because the assignment does not rely exclusively on the test scores, a fuzzy RD design is preferred. Some other aspects, like unobserved exact cutoffs, are also discussed to make the estimation more precise. Battistin and Rettore (2008) analyze the partially fuzzy design, in which eligible individuals participate based on self-selection. Using information on three groups (ineligibles, eligible non-participants and participants), they show that in this case the identification of the mean effect on participants who are marginally eligible in a right-neighborhood of the cutoff requires the same conditions as those in a sharp design. A specification test of the validity of non-experimental estimators through these local identifications is also discussed.
Such a test allows us to test the ignorability condition, $Y_0 \perp T \mid S, X$, directly by checking whether the selection bias equals zero,

29 The manipulation will lead to a discontinuous density. However, if such manipulation is randomly performed, the treatment effect can still be identified.
30 The concept of compliers and the relevant settings are similar to those used in the IV approach. For details see Angrist, Imbens and Rubin (1996).

say $\lim_{s \to S_0^+} \{E(Y_0 \mid T = 1, s, x) - E(Y_0 \mid T = 0, s, x)\} = 0$. This test is informative only at the cutoff of eligibility, but if it rejects the non-experimental estimators locally then it is enough to reject them altogether. An application to the PROGRESA program, which aims at encouraging investment in education, health and nutrition through large monetary transfers in rural Mexico, is used to illustrate the crucial points of the paper. Yang (2009) generalizes the problem by pointing out the dual nature of the RD design: a borderline experiment near the cutoff and a strong valid exclusion restriction in the selection equation. Focusing on the fuzzy RD design, the paper proposes two estimators for the average treatment effect in the presence of multiple selection biases31: the RD robust estimator and the correction function estimator. The former is used for the population near the cutoff, where selection is based on observables, while the latter is used for a population away from the cutoff, where selection is based on unobservables. Which estimator is appropriate depends on the research question. The choice between them is essentially a balance between internal and external validity32. The empirical analysis by Chay et al. (2005) is reexamined to show the improvements brought by these two new estimators.
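To make the idea of a fuzzy design concrete, the following Python sketch estimates the local effect as the jump in the mean outcome at the cutoff divided by the jump in the probability of treatment. It is a simplified illustration rather than any of the estimators proposed in the papers above: the unweighted linear fits on each side, the bandwidth, the function names and all simulated numbers (including the effect of 40 and the treatment probabilities of 0.2 and 0.6) are assumptions made here.

```python
import random

def fitted_value(xs, ys, x0):
    """Fit an OLS line through (xs, ys) and return its value at x0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

def fuzzy_rd(scores, treated, outcomes, cutoff, bandwidth):
    """Ratio of the outcome jump to the treatment-probability jump at the cutoff."""
    left = [i for i, s in enumerate(scores) if cutoff - bandwidth <= s < cutoff]
    right = [i for i, s in enumerate(scores) if cutoff <= s <= cutoff + bandwidth]

    def jump(values):
        below = fitted_value([scores[i] for i in left],
                             [values[i] for i in left], cutoff)
        above = fitted_value([scores[i] for i in right],
                             [values[i] for i in right], cutoff)
        return above - below

    return jump(outcomes) / jump(treated)

random.seed(2)
cutoff, true_effect = 474, 40.0          # hypothetical values for the simulation
scores, treated, outcomes = [], [], []
for _ in range(20000):
    s = random.uniform(400, 550)
    p = 0.6 if s >= cutoff else 0.2      # treatment probability jumps by 0.4
    t = 1.0 if random.random() < p else 0.0
    scores.append(s)
    treated.append(t)
    outcomes.append(100 + 0.8 * s + true_effect * t + random.gauss(0, 10))

estimate = fuzzy_rd(scores, treated, outcomes, cutoff, bandwidth=30)
print(round(estimate, 2))
```

Because the simulated effect is the same for every unit, the ratio recovers it directly. With heterogeneous effects the same ratio would be interpreted as the mean effect on compliers near the cutoff, as discussed above.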

2.9 Applications of RD Design to Value-Added Problem

In this section we introduce several recent applications of the RD design to value-added problems, which we hope helps to form a complete picture of the design. For clarity, these applications are organized in three groups with different treatments: school based, class based and teacher based33. In all of these applications the potential outcome is a test score or another similar variable, such as the graduation rate. The treatment-determining variables are test scores (usually for school based and teacher based cases) or student enrollment (usually for class based cases).

2.9.1 School Based Applications

In school based applications, the treatment is usually the assignment to a certain special kind of school or to policies imposed on characteristics of schools. One typical kind of study in this field focuses on the effect of summer schools, such as Matsudaira (2008), which examines the effect of mandatory summer school on students’ achievement34. The treatment-determining variables are the math and reading scores in the year 2000; the potential outcomes are the math and reading scores in the year 2002; the treatment is attending summer school. Here students in Chicago who fail to meet any of several criteria may be mandated to attend the summer school, so in general the problem falls within a fuzzy RD design framework. Following Porter (2003), a third-order polynomial function of the test score is included in the regression equations and the effect is estimated parametrically. The empirical results show that the average effects are much smaller than those found in other studies. Furthermore, they are heterogeneous across grades.

2.9.2 Teacher Based Applications

Most of the teacher based applications focus on effects of incentives for teachers on students’ performance35. Lavy (2004) evaluates such an effect in Israel with two identification strategies: propensity score matching and RD design. Here the potential outcomes are the scores in several subjects; the treatment-determining variable is the

31 The RD robust estimator removes selection bias on observables and controls for the heterogeneous bias from the interaction between observables and the treatment. The correction function estimator deals with omitted-variable bias and controls for the sorting bias.
32 The RD robust estimator is only valid near the cutoff but imposes few functional-form restrictions and allows flexible parameterization, while the correction function estimator can deal with biases on unobservables but has a more restrictive parameterization.
33 There is also another kind of popular application concerning the effect of financial matters. We will not discuss it because it is not related to the value-added problem very closely. Readers interested in this topic can turn to Guryan (2001), Canton and Blom (2004), Leuven and Oosterbeek (2007) and Van der Klaauw (2008).
34 Jacob and Lefgren (2004a) is another example that also discusses the effects of summer school on students’ test scores in the RD design framework. For other intensive studies, see Roderick, Engel and Nagaoka (2003).
35 For effects of teacher training programs, see Jacob and Lefgren (2004b), which analyzes an example from elementary schools in Chicago with school averages of test scores as treatment-determining variables.

1999 school matriculation rate; the treatment is the assignment of cash bonuses to teachers. Propensity score matching is feasible because of the very rich and unique data available on all schools and students. In the RD design two variations are considered. One exploits the random measurement error in the treatment-determining variable: conditional on the true value, the treatment assignment is random36, so the treatment effect can be identified by non-parametrically matching schools on the basis of the true value. The other is a sharp RD design with a bandwidth of about 10 percent. Both estimations involve a panel data structure with fixed school-level effects. Several additional effects of the incentive program are also discussed, such as spillover effects on other non-treated subjects and effects on teachers’ pedagogy, effort and grading ethics. The empirical results show that such incentives actually increase student achievement through improvements in effort and pedagogy rather than through artificial inflation of test scores.

2.9.3 Class Based Applications

The relation between class size and students’ achievements is the crucial question in this kind of application. One recent study on this topic is Urquiola (2006), which analyzes the effects of class size on test scores in rural Bolivia. The potential outcome is the test score, the treatment-determining variable is enrollment at the school level, and the treatment is class size. Two identification strategies are presented. One focuses only on schools with fewer than 30 students, which helps to eliminate manipulation through schools’ and parents’ choices. The other is similar to that in Angrist et al. (1999) and generates a fuzzy discontinuity in the relation between class size and enrollment. This relation can be described as $C_{jk} = E_k / n_k$, where $C_{jk}$ is the size of class $j$ in school $k$, $E_k$ is the enrollment of school $k$, and $n_k$ is the number of classes in school $k$. Empirical results from both strategies confirm the negative relation between class size and test scores. However, the effect is larger for the RD strategy and smaller for the first strategy with smaller class sizes.
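The discontinuity generated by $C_{jk} = E_k / n_k$ can be seen in a short sketch. The text above does not state how $n_k$ is determined, so the rule used here, that a new class is opened whenever enrollment would push a class above a cap, and the cap of 30 itself are assumptions for illustration only; the function name is also hypothetical.

```python
def predicted_class_size(enrollment, cap=30):
    """Class size E_k / n_k when n_k is the smallest number of classes
    that keeps every class at or below the (hypothetical) cap."""
    n_classes = (enrollment - 1) // cap + 1
    return enrollment / n_classes

# Predicted class size drops sharply when enrollment passes a multiple of the cap.
for e in (29, 30, 31, 60, 61):
    print(e, round(predicted_class_size(e), 2))
```

Actual class sizes need not follow the predicted rule exactly, which is why the resulting design is fuzzy rather than sharp.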

2.10 Conclusion and Prospects

In the previous sections we mainly reviewed the origins and development of the RD design along technical lines, with a concentration on applications to the economics of education. Several recent applications to value-added problems were also summarized. Considering the research question at hand, we can develop the work in several potential directions, though not all of them will appear in the remaining sections. Firstly, we could go deeper into the problem of sorting or manipulation of the treatment-determining variables. In our problem, it seems that the assignment to model schools is partially determined by scores in the high school entrance exam, so a fuzzy RD design is feasible. But though the exact cutoff remains unknown before the exam, the rules are public knowledge and information from previous years is also available. In that case endogenous sorting might invalidate the RD design and lead to biased estimation37. Secondly, we could also exploit the partially fuzzy RD design in our problem. It seems that the cutoff in the high school entrance exam is just an eligibility threshold rather than an actual treatment threshold: students with higher scores can give up the place for their own specific reasons. Thirdly, we could compare empirical results from the RD design with those from other methods, say matching or IV. Several relevant aspects can also be considered, such as testing the randomization property near the cutoff and the actual treatment-receiving mechanism for the fuzzy RD design. Finally, combining empirical results with the educational policies actually implemented in China, we could offer some policy suggestions. At least in China there are few studies on value-added problems which use the RD design

36 Limited to the sample of 97 schools that were eligible for the treatment, the correlation between the correct matriculation rate and the measurement error is very low. Finally a sample of 29 schools is used. In it, 17 schools with a correct matriculation rate higher than the cutoff were chosen for the treatment erroneously, while the other 12 schools have similar correct matriculation rates but were not treated because their measurement errors, if any, were not negative enough.
37 See Urquiola and Verhoogen (2007) for a recent study on the effects of class size focusing on the endogenous sorting problem.

framework. So if some interesting and unique results are obtained, they may also be very meaningful in a policy sense.

3 Empirical Backgrounds

3.1 Test Scores and Model School Program

3.1.1 Entrance Exam of High Schools in Beijing

The entrance exam of high schools is held once per summer (usually in late June). Its score serves as the only criterion for enrollment in high schools in most cases38. Students in middle schools can participate in the entrance exam if they satisfy one of the following criteria. First, the student has citizenship of Beijing and is either a third-grade student or a graduate younger than 18 years old. Second, there are several exceptions for those without citizenship of Beijing, such as children of post-doctoral researchers. In our sample from Daxin District few students fall under these exceptions, so we will not show them in detail39. The questions or items in the exam are the same for all students. Six subjects are involved: Chinese, Mathematics, English40, Physics, Chemistry and PE. The full scores are 120, 120, 120, 100, 80 and 30 respectively, so the total full score is 570.

3.1.2 Admission to High Schools

Students should submit their choices, which include at most eight schools in order of preference, before the exam. High schools admit students according to their scores and choices. Here is an example to illustrate the process. Suppose that School A plans to admit 10 students. It will rank the students who list it as their first choice by score and admit the first 10. Students who are not admitted by School A return to the pool and wait to be considered by the remaining schools on their lists. If the number of students who list School A as their first choice is less than 10, say 8, then School A will admit all of them and rank by score the students who remain in the pool and list it as their second choice. The first 2 will be admitted. If there are still vacancies, the same process is followed for students who list School A as their third choice, and so on. If School A still has vacancies after considering students who list it as their eighth choice, it can contact those who remain in the pool but did not list it. The ranking of students is based on scores that already include the extra points awarded in the special cases. If two students have the same score, several priorities are used to distinguish them41. If they are still tied, they are ranked by their scores in Mathematics, Chinese and English sequentially. If the scores in all of these subjects are the same for two students, the student with the smaller pre-assigned random serial number will be admitted.
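The admission process described above can be sketched in a few lines of Python. The sketch assumes that all schools process first choices together, then second choices, and so on, which the description implies but does not state explicitly. It omits the priority categories used to break ties and the final step in which a school contacts students who did not list it. The student records, field names and capacities are hypothetical.

```python
def rank_key(student):
    # Higher total score first, then Mathematics, Chinese and English in turn,
    # and finally the smaller pre-assigned serial number.
    return (-student["total"], -student["math"], -student["chinese"],
            -student["english"], student["serial"])

def admit(students, capacity, max_choices=8):
    """Return a dict mapping admitted student ids to schools."""
    seats = dict(capacity)
    assigned = {}
    for rnd in range(max_choices):
        # Students still in the pool apply to the school at position rnd on their list.
        applicants = {}
        for sid, s in students.items():
            if sid not in assigned and rnd < len(s["choices"]):
                applicants.setdefault(s["choices"][rnd], []).append(sid)
        # Each school fills its remaining seats from this round's applicants.
        for school, sids in applicants.items():
            free = seats.get(school, 0)
            ranked = sorted(sids, key=lambda sid: rank_key(students[sid]))
            for sid in ranked[:free]:
                assigned[sid] = school
            seats[school] = free - len(ranked[:free])
    return assigned

def student(total, math, serial, choices):
    # Chinese and English scores are set equal here to keep the example short.
    return {"total": total, "math": math, "chinese": 100, "english": 100,
            "serial": serial, "choices": choices}

students = {
    "s1": student(500, 110, 1, ["A", "B"]),
    "s2": student(480, 105, 2, ["A", "B"]),   # tied with s3 on total score
    "s3": student(480, 100, 3, ["A", "B"]),
    "s4": student(460, 100, 4, ["B", "A"]),
}
result = admit(students, {"A": 2, "B": 1})
print(result)
```

In this example s3 loses the tie with s2 on the Mathematics score and then finds School B already filled by s4, who listed it first despite a lower score. This illustrates why the order of choices matters in addition to the score.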

38 There are two kinds of exceptions. One is that students with excellent awards, such as the "Jin Fan" and "Yin Fan" awards, can enter high schools assigned in advance. The other is that students satisfying several conditions, such as belonging to an ethnic minority or being children of martyrs, can obtain additional points.
39 All of the observations involved in this study are eligible for the entrance exam. We do not have exact data concerning the second criterion. However, we can explore it indirectly. For example, in our sample the proportions of parents who hold a graduate degree are only approximately 0.5% and 0.6% for fathers and mothers respectively. For post-doctoral researchers the proportions are much smaller than those. Furthermore, in our sample only 3.6% of students do not study in their assigned middle schools. Students without citizenship of Beijing belong to that group, but it is not the sole reason for being in it, so the proportion of these students will be even less than 3.6%.
40 Very few students in special schools choose other foreign languages, such as French or Spanish.
41 These priorities include children of servicemen and diplomats.

3.1.3 Choice between Science and Art

In middle schools, there is no distinction between science and art subjects. All students take the same courses and, as introduced above, the subjects covered in the entrance exam of high schools are also the same for all students. After entering high school, students are still not divided at the beginning. At the end of the first year in high school, students submit applications to indicate their preferred subjects. Before the second year, they are allocated to science or art classes entirely according to their stated preferences. After that the courses and exams in the two kinds of classes become somewhat different.

3.1.4 Entrance Exam of Colleges in Beijing

The entrance exam of colleges in Beijing is held once per summer, usually in early June. As with the exam discussed above, its score serves as the only criterion for enrollment in colleges in most cases, but we will not discuss this further because it has little to do with the theme of this paper. Students satisfying either of the following sets of criteria can take the exam in Beijing. First, the student complies with the constitution and other laws of the PRC; holds a high school diploma or an equivalent certificate; has no mental or physical problems that can affect study significantly; and has citizenship of Beijing. Second, resident foreign students who hold a high school diploma or an equivalent certificate and have no mental or physical problems that can affect study significantly can also take the exam, provided that they hold certificates authorized by the relevant organizations. The subjects involved in the exam are different for students majoring in Art (Wen Ke) and Science (Li Ke). For art students, the exam includes Chinese, Mathematics for Art, Foreign Language and Integrated Art. For science students, the exam includes Chinese, Mathematics for Science, Foreign Language and Integrated Science42. The full scores of Chinese, Mathematics for Art or Science and Foreign Language are all 150. The full scores of the two integrated subjects are both 300. So the full score of the exam is 750. Students satisfying one of several conditions, such as belonging to an ethnic minority, can also obtain additional points in the admission phase. However, this does not affect the original scores in our case, so we will not focus on these extra conditions. There is also a complicated admission process for enrollment in colleges according to the exam score as well as other criteria. However, this process is not important for our research on the effects of model high schools, so we will not go further into it.

3.1.5 Model High School in Beijing

There are now 68 model high schools in Beijing. This kind of high school can be traced back to the key high schools started in the 1950s, to which the government first assigned excellent resources to guarantee educational quality. Model high schools were established to replace key high schools in the 1990s. The essentials of the policy of model high schools can be summarized as follows: honoring high schools with excellent educational conditions and results; treating them favorably in policy, for example in public finance, while imposing necessary social responsibilities on them. Such model high schools should reflect a value that balances efficiency and equality. Here we will mainly discuss the difference between these model high schools and other, normal, high schools, especially the effects on students’ achievements. First, model high schools are favored in public educational finance. They can obtain more financial support from the government. Moreover, they can collect much more money from external funding because of their excellent reputation. These schools can therefore afford better learning conditions, which should lead students there to higher achievements. In turn, model high schools have much appeal for students in middle schools, so many excellent students choose them. To some extent, this leads to relatively better student achievement, say higher scores in the entrance exam of colleges. More excellent students then choose them because of such high scores. Similarly, these model high schools are also attractive for excellent

42 Both kinds of students can choose one of English, Russian, Japanese, German, French and Spanish as their foreign language. Integrated Art includes History, Politics and Geography, while Integrated Science includes Physics, Chemistry and Biology.

teachers because of their advanced educational conditions and salaries. Opportunities to become famous are also much greater in model high schools for young teachers with great potential. In conclusion, according to the points argued above, the difference or inequality between model high schools and normal high schools is becoming more and more serious. The phenomenon called the Matthew Effect43 is forming.

3.2 Data

The data come from a complete survey of 11 high schools and 3867 students in Daxin District in rural Beijing. Characteristics which may influence students’ achievements are collected at several levels, such as school, teacher, student and score. These students took the entrance exam of high schools in 2005. According to the admission rules introduced above, they were assigned to these 11 high schools. Three years later, in 2008, they took the entrance exam of colleges44. These scores are collected as well as several other characteristics regarding the parental backgrounds of these students. The descriptive statistics of these schools and students are shown in Table 1. From the descriptive statistics we can find some facts relating to our topic. For both science and art students, students in model schools have on average higher final scores as well as higher original scores. But the difference in final scores between students in model schools and normal schools is much larger than that in original scores. For art students it is 35.0 vs. 14.3 while for science students it is 20.5 vs. 9.9. So although model schools cannot explain their relatively higher final scores completely, they appear to account for a large part of them. At the school level, it is obvious that model schools have better students. Teachers in these schools are also more advanced and experienced, which is reflected in a higher ratio of advanced teachers and a lower ratio of teachers younger than 35. Furthermore, all of the model schools are located in urban areas; this is likely to affect school choice but unlikely to have a substantial direct effect on students’ performance. At the student level, it seems that students in model schools and normal schools have similar individual characteristics. However, there are significant differences in parental backgrounds between them, especially in unemployment status.
So in our sample it is possible that some background covariates other than the treatment also jump at the cutoff.

4 Evidence of Validity

4.1 Discontinuity of Variables

As advocated in much of the literature, graphic analysis has become part of the standard process of studies with RD design. It gives an overall picture of the identification strategy and the credibility of the RD design. In this section three kinds of graphs will be shown: outcomes by treatment-determining variable, probability of treatment by treatment-determining variable, and covariates by treatment-determining variable. The density of the treatment-determining variable will be analyzed in the next section.

4.1.1 Graphic Analysis of Outcomes by Treatment-Determining Variable

In this kind of graph we plot the average outcome for different values of the treatment-determining variable, which is the original score in this study. First a binwidth h is chosen, and then we have the numbers of bins to the left

43 This terminology comes from the Biblical Gospel of Matthew and was first coined by Merton (1968).
44 For unobservable reasons, some of the high school students did not take the entrance exam in 2008. But the proportion is not large: it is 1.2% for model schools and 10.2% for normal schools. In this study we only focus on the effect of model schools, so the relatively large proportion for normal schools will not lead to serious problems, though the decision to drop out is very likely to be endogenous. In the following analysis we simply drop them from the sample.

Table 1: Descriptive Statistics (a)

Students
              Science Student                              Art Student
              Model Schools  Normal Schools Total          Model Schools  Normal Schools Total
              N     Mean     N     Mean     N     Mean     N     Mean     N     Mean     N     Mean
Achievement
  OScore      278   488.9    1015  443.0    1293  452.9    782   495.4    1495  453.7    2277  468.0
                    (17.16)        (34.76)        (36.97)        (15.68)        (27.10)        (30.97)
  FScore      278   482.9    1015  387.5    1293  408.0    782   474.4    1495  372.6    2277  407.6
                    (73.65)        (86.20)        (92.37)        (70.43)        (85.50)        (94.00)
Demographics
  Male        278   0.658    1014  0.679    1292  0.675    782   0.504    1495  0.474    2277  0.484
  Normal      278   0.971    986   0.961    1264  0.964    781   0.971    1456  0.967    2237  0.968
  Age         278   18.58    1014  18.76    1292  18.72    782   18.56    1495  18.76    2277  18.69
                    (0.668)        (0.698)        (0.696)        (0.660)        (0.724)        (0.709)
Parents
  CollegeF    278   0.273    1014  0.189    1292  0.207    782   0.185    1495  0.183    2277  0.184
  CollegeM    276   0.268    897   0.215    1173  0.228    776   0.186    1433  0.191    2209  0.189
  UnemF       276   0.000    779   0.207    1055  0.153    777   0.004    1146  0.264    1923  0.159
  UnemM       275   0.000    635   0.261    910   0.182    769   0.003    1063  0.323    1832  0.188
  FarmerF     276   0.399    779   0.481    1055  0.460    777   0.503    1146  0.469    1923  0.483
  FarmerM     275   0.498    635   0.452    910   0.451    769   0.502    1063  0.467    1832  0.481

Schools
              Model Schools  Normal Schools Total
              N     Mean     N     Mean     N     Mean
  Urban       2     1.000    9     0.556    11    0.636
  Students N  2     1894     9     1085     11    1232
                    (554.4)        (410.3)        (522.0)
  Teachers N  2     154.5    9     94.78    11    105.6
                    (2.121)        (47.07)        (48.55)
  R.AdTea     2     0.421    9     0.233    11    0.267
                    (0.013)        (0.142)        (0.148)
  R.Tea35     2     0.431    9     0.643    11    0.604
                    (0.029)        (0.144)        (0.155)
  Min ES      2     482.0    9     424.1    11    434.6
                    (11.31)        (36.99)        (40.69)

(a) Notes:
1. The number in parentheses is the standard deviation; it is not shown for dummies.
2. OScore is the score in the entrance exam of high school (OScore or original score for short). FScore is the score in the entrance exam of college (FScore or final score for short). Normal equals one if the student is enrolled through the normal process. College equals one if the highest educational level of the father (F) or mother (M) is at least college. Unem equals one if the father (F) or mother (M) is unemployed. R.AdTea is the rate of advanced teachers. R.Tea35 is the rate of teachers younger than 35. Min ES is the minimum entrance score required by the school.

and right of the cutoff point $c$, which are denoted by $K_l$ and $K_r$. Then we have $b_k = c - (K_l - k + 1)h$ and construct bins $(b_k, b_{k+1}]$.

Assume that there are $N_k$ observations whose treatment-determining variables fall in the bin $(b_k, b_{k+1}]$. The average outcome of bin $(b_k, b_{k+1}]$ is $\bar{Y}_k = \sum_{i=1}^{N_k} Y_i / N_k$. We also have the middle point of this bin, $\bar{b}_k = (b_k + b_{k+1})/2$. Then the graph is just the plot of $\bar{Y}_k$ against $\bar{b}_k$.
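The construction of the bins and their averages can be sketched as below. The function name and the five data points are hypothetical; only the formulas for the bin edges, the bin averages and the bin midpoints follow the text. Note that with bins closed on the right, an observation exactly at the cutoff falls in the last bin to the left of the cutoff.

```python
import math

def binned_means(scores, outcomes, cutoff, h, k_left, k_right):
    """Return (midpoint, mean outcome, count) for each non-empty bin (b_k, b_{k+1}]."""
    n_bins = k_left + k_right
    b1 = cutoff - k_left * h                  # b_1 = c - K_l * h
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, outcomes):
        k = math.ceil((s - b1) / h)           # s lies in (b_k, b_{k+1}]
        if 1 <= k <= n_bins:
            sums[k - 1] += y
            counts[k - 1] += 1
    return [(b1 + (k - 0.5) * h, sums[k - 1] / counts[k - 1], counts[k - 1])
            for k in range(1, n_bins + 1) if counts[k - 1] > 0]

# Hypothetical original scores and final scores, for illustration only.
points = binned_means([415, 420, 421, 474, 475], [10, 20, 30, 40, 50],
                      cutoff=474, h=6, k_left=10, k_right=10)
for midpoint, mean_outcome, n_obs in points:
    print(midpoint, mean_outcome, n_obs)
```

Replacing the outcome by the treatment indicator gives the binned probability of treatment, and the counts alone give the histogram of the treatment-determining variable.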

For this study we already have $c = 474$, and assume that $h = 6$ and $K_l = K_r = 10$, so $b_k = 474 - 6(11 - k)$. Results of graphic analysis for outcomes, treatment and density of treatment-determining variable are presented in Figure 1, while results for covariates are left in the appendix.

[Figure 1: Graphic Analysis 1. Science students for the left column. Panels plot Final Score, Treatment and Obs against OriginalScore.]

For both kinds of students the final score shows an overall continuously increasing trend with the original score. Both final scores and original scores are determined by some common factors, such as ability, effort, family support, etc., in the same direction. According to the graphs shown above, we cannot find any strong evidence of discontinuity in the conditional mean of final scores at the cutoff point. According to Imbens and Lemieux (2008), if the graphic analysis shows little evidence of any discontinuity, the chance that a formal analysis will lead to robust and significant estimates is relatively small.

We can investigate this aspect a bit more with formulae. Now we only focus on the bins closest to the cutoff on both sides. Suppose that for observations whose original score is lower than the cutoff, the probability of entering a model high school is $P_1$, while this probability for observations whose original score is higher than the cutoff is $P_2$. The relations between the original score and the final score in model schools and normal schools are $F^m(O_i)$ and $F^n(O_i)$ respectively. Then for an observation whose original score is slightly lower or higher than the cutoff, the observed final score can be approximately expressed as $P_1 F^m(O^*) + (1 - P_1) F^n(O^*)$ and $P_2 F^m(O^*) + (1 - P_2) F^n(O^*)$ respectively. In our graphic study we intuitively find that they are the same, so the following equation holds: $P_1 F^m(O^*) + (1 - P_1) F^n(O^*) = P_2 F^m(O^*) + (1 - P_2) F^n(O^*)$. Furthermore, it can be written as $(P_1 - P_2) F^m(O^*) = (P_1 - P_2) F^n(O^*)$. To guarantee this equation, at least one of the two following facts should be true. One is $P_1 = P_2$, which implies that the cutoff is fake and there is not even a discontinuity in the probability of receiving the treatment. The other is $F^m(O^*) = F^n(O^*)$, which implies that for observations very close to the cutoff, whether they are in a model school will not change their final scores. In that case there is no significant value-added effect of the model school. We can clarify the two facts in the following section.

4.1.2 Graphic Analysis of Probability of Treatment by Treatment-Determining Variable

In this section we plot the probability of treatment against the treatment-determining variable. The methodology is similar to that used in the last section, with the average outcome replaced by the average of the treatment variable. These graphs show that there is indeed a discontinuity in the probability of treatment at the cutoff point. They also show that the probability of treatment is not constant on either side of the cutoff, so the jump is less than 1, approximately 40%, and the fuzzy RD design is clearly visible: some students with original scores lower than the cutoff enter model schools, while some students with original scores higher than the cutoff choose normal schools. However, it remains unclear whether the official cutoff point (474) is the one that best describes this discontinuity. According to the graphs, there also seems to be more than one jump in the probability of treatment. The point around 490, which is the minimum entrance score of No. 1 School, is another potential cutoff.

Then we return to the informal mathematical analysis of the last section. We have now found that $P_1 \neq P_2$ (indeed $P_1 < P_2$). So the only explanation for the continuity of the final score at the cutoff is $F^m(O^*) = F^n(O^*)$, which implies that it may be difficult to find robust and significant value-added effects of the model school, at least around the current cutoff.

4.1.3 Covariates by Treatment-Determining Variable

By graphic analysis of other covariates on the treatment-determining variable we are able to detect potential mis-specifications. If some covariates other than the outcome or the probability of treatment also jump at the cutoff, these covariates are likely to be affected by the treatment, and this violates the basic assumptions of RD design. There are nine individual covariates. The graphs of them by treatment-determining variable can be found in Appendix 3. According to these graphs, we can find that though all of these covariates are predetermined, some of them actually present discontinuity at the cutoff to some extent, such as education levels of parents. As discussed in the previous section, the discontinuity of covariates will also account for the (dis)continuity of the final score, as long as there are actually significant relations between these covariates and final scores. The observed ambiguous value-added effect of the model school may be a combination where the true effect of the model school is offset by some covariates that affect the final score in an opposite direction.

4.2 Density of the Treatment-Determining Variable

By plotting the density of the treatment-determining variable we are able to check roughly whether there is a discontinuity in the distribution of this variable at the cutoff. If so, it is possible that the value of the treatment-determining variable is manipulated by the individuals. In that case the settings of the RD design are also violated.

To draw this graph, we plot the number of observations in each bin, $N_k$, against the middle point of the bin. According to these graphs, we are not able to find sufficient evidence of a discontinuity in the distribution of the treatment-determining variable at the cutoff. Next we perform a formal density test from McCrary (2008). As discussed in the previous sections, if the treatment-determining variable can be manipulated, the conditional expectation of the counterfactual outcomes given the treatment-determining variable may not be continuous at the cutoff. In that case the identification assumptions of the RD design will be invalid. In our study, though the exact cutoff is unknown before the exam, students can still obtain much information about it in many indirect ways, such as the cutoffs of previous years and their relative rankings in the class. Manipulation of the original scores is therefore possible, so besides the informal graphic analysis, a formal density test is still needed to investigate that possibility. We test whether the treatment-determining variable, which is the original score in our case, is manipulated by testing whether the density function of the original score is continuous at the cutoff.⁴⁵

Here we adopt the McCrary density test, introduced in McCrary (2008). Generally speaking, there are two steps to perform this test. First, the first-step histogram is plotted by $(X_j, Y_j)$, where $X_j$ is the midpoint of histogram bin $j$ and $Y_j$ is the normalized count of observations falling into bin $j$. Second, the histogram is smoothed by local linear regression. We need to find $(\hat{\phi}_1, \hat{\phi}_2)$ to minimize the following objective function:

$$L = \sum_{j=1}^{J} \left[Y_j - \phi_1 - \phi_2 (X_j - s)\right]^2 \kappa\left(\frac{X_j - s}{h}\right) \left[1(X_j > c)1(s \geq c) + 1(X_j < c)1(s < c)\right],$$

where $s$ is a realization of the treatment-determining variable, $\kappa(\cdot)$ is a kernel function, $h$ is the bandwidth, $c$ is the cutoff and $1(\cdot)$ is the indicator function. Then the estimated density at $s$ is given by $\hat{f}(s) = \hat{\phi}_1$. Formally, we are interested in the limits of this estimated density at the cutoff, so we obtain the following expressions:

$$\lim_{s \to c^+} \hat{f}(s) = \sum_{X_j > c} \kappa\left(\frac{X_j - c}{h}\right) \frac{S^+_{n,2} - S^+_{n,1}(X_j - c)}{S^+_{n,2} S^+_{n,0} - (S^+_{n,1})^2} Y_j;$$

$$\lim_{s \to c^-} \hat{f}(s) = \sum_{X_j < c} \kappa\left(\frac{X_j - c}{h}\right) \frac{S^-_{n,2} - S^-_{n,1}(X_j - c)}{S^-_{n,2} S^-_{n,0} - (S^-_{n,1})^2} Y_j,$$

where $S^+_{n,k} = \sum_{X_j > c} \kappa\left(\frac{X_j - c}{h}\right)(X_j - c)^k$ and $S^-_{n,k} = \sum_{X_j < c} \kappa\left(\frac{X_j - c}{h}\right)(X_j - c)^k$. The estimated log discontinuity in the density at the cutoff is

$$\hat{\theta} = \ln \lim_{s \to c^+} \hat{f}(s) - \ln \lim_{s \to c^-} \hat{f}(s).$$

Furthermore, an approximate standard error of $\hat{\theta}$ is given in the following form:

$$\hat{\sigma}_\theta = \sqrt{\frac{24}{5nh}\left(\frac{1}{\lim_{s \to c^+} \hat{f}(s)} + \frac{1}{\lim_{s \to c^-} \hat{f}(s)}\right)}.$$
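The two steps of the density test can be sketched as follows. This is a minimal illustration rather than the implementation used for Table 2: the function name `mccrary_sketch` is hypothetical, the bin size `b` and bandwidth `h` are taken as given instead of being chosen by the two-step procedure of McCrary (2008), and the triangular kernel is an assumption of the sketch.

```python
import numpy as np

def mccrary_sketch(x, c, b, h):
    """Return (f_plus, f_minus, theta, se) for the density of x at cutoff c.

    Hypothetical helper; bin size b and bandwidth h are supplied by the user.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Step 1: histogram with bin size b whose bins never straddle the cutoff.
    j = np.floor((x - c) / b).astype(int)            # bin index relative to c
    counts = np.bincount(j - j.min())                # empty bins kept as zeros
    X = c + (np.arange(j.min(), j.max() + 1) + 0.5) * b   # bin midpoints X_j
    Y = counts / (n * b)                             # normalized counts Y_j

    # Step 2: local linear smoothing of (X_j, Y_j), separately on each side.
    def side_limit(mask):
        d = X[mask] - c
        w = np.maximum(0.0, 1.0 - np.abs(d) / h)     # triangular kernel (assumed)
        s0, s1, s2 = w.sum(), (w * d).sum(), (w * d ** 2).sum()
        return np.sum(w * (s2 - s1 * d) * Y[mask]) / (s2 * s0 - s1 ** 2)

    f_plus = side_limit(X > c)
    f_minus = side_limit(X < c)
    theta = np.log(f_plus) - np.log(f_minus)         # log difference in height
    se = np.sqrt(24.0 / (5.0 * n * h) * (1.0 / f_plus + 1.0 / f_minus))
    return f_plus, f_minus, theta, se
```

A value of `theta` that is small relative to `se` is what one would expect when the density of the original score is continuous at the cutoff.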

The results of the test, involving three cutoffs, are shown in Table 2.⁴⁶ According to the table, none of the estimated discontinuities at the meaningful cutoffs is significant. These results coincide with the graphic presentation at the beginning. In fact, manipulation leads to serious problems only when the treatment-determining variable can be manipulated completely. Partial manipulation, which means there are still idiosyncratic terms that cannot be controlled by the agent, does not lead to identification problems. Exam scores, including those in our study, belong to the category of partial manipulation because nobody can predict the exact score before the exam.⁴⁷ A similar example can be found in van der Klaauw (2002). Also, manipulation from other sources, like teachers or administrators, is illegal and forbidden under the rules of the exams in this study.

45 According to McCrary (2008), only complete manipulation will lead to identification problems, which may not be the case in our study. Though students can control their original scores to some extent, it is very difficult for them to control their scores completely. During the exam there are always many uncertainties that are difficult to anticipate or control.
46 The graphs of density can be found in the appendix.
47 In fact, precise prediction after the exam is also very difficult because scores of open questions depend heavily on the subjective judgment of the teacher.

Table 2: McCrary Density Test, Three Cutoffsᵃ

| Cutoff | Sample | Discontinuity (s.e.) | Bin Size | Bandwidth |
|---|---|---|---|---|
| 449 | Whole Sample | 0.793 (0.076) | 1.100 | 47.21 |
| 449 | Science | 0.765 (0.123) | 2.056 | 52.65 |
| 449 | Art | 0.769 (0.106) | 1.298 | 44.43 |
| 474 | Whole Sample | 0.048 (0.067) | 1.100 | 37.70 |
| 474 | Science | -0.067 (0.108) | 2.056 | 44.03 |
| 474 | Art | 0.099 (0.085) | 1.298 | 38.32 |
| 490 | Whole Sample | -0.046 (0.081) | 1.100 | 39.88 |
| 490 | Science | -0.065 (0.150) | 2.056 | 38.95 |
| 490 | Art | -0.093 (0.088) | 1.298 | 46.76 |

a The bandwidth used is also obtained by a two-step procedure. Given a fixed bandwidth, the performance of the discontinuity estimation does not require a careful choice of the bin size. For details of the choice of bandwidth and bin size, see McCrary (2008).

5 Empirical Analysis

5.1 RD Design Framework

5.1.1 Constant Treatment Effect

Basically, the relation between test scores and model school assignment can be presented by the following constant treatment effect model:

$$S_i^f = \alpha + \beta K_i + \epsilon_i,$$

where $S_i^f$ is the score of the entrance exam to college (final score for short) of student $i$, $\alpha$ is a constant, $K_i$ is an indicator that equals one if student $i$ attends a model school, $\beta$ is the constant treatment effect, and $\epsilon_i$ is a heterogeneous error term including all other factors affecting final scores. Meanwhile, $S_i^o$ is the score of the entrance exam to high school (original score for short) of student $i$. We also have another indicator $D_i$ that is equal to one if the original score is higher than a cutoff $S^{o*}$: $D_i = 1\{S_i^o > S^{o*}\}$.

The difference and relation between the two indicators are as follows. $K_i$ is the actual treatment indicator: it is equal to one if student $i$ really attends a model school. $D_i$ is only the eligibility indicator: it is equal to one if student $i$ is eligible for a model school according to his original score. So they are not necessarily the same for a given student.⁴⁸

48 Normally, if the assignment rules are followed strictly, the eligibility indicator must equal one for those students whose treatment indicator equals one. But sometimes ineligible students are still able to attend a model school for other, unknown reasons.

Our main goal is to get a consistent estimate of the causal treatment effect $\beta$. Nevertheless, the basic model presented above suffers from endogeneity: there are many other variables in the error term that are correlated with treatment and can also affect the final score. For example, generally speaking, a student who attends a model school has a higher original score, which is caused by several unobserved characteristics, such as individual intelligence and family status. Such characteristics will also affect the final score and positively bias the estimate of the treatment effect.⁴⁹ To solve this problem, we need to exploit the features of our case and find proper instruments. We can start by adding the original score into the model:

$$S_i^f = \alpha + \gamma S_i^o + \beta K_i + \epsilon_i.$$

First we consider the case in which there is no discretion in the assignment to treatment. Then the treatment indicator is equivalent to the eligibility indicator and $D_i = K_i$ for all students. This is just the sharp RD design framework. To identify the treatment effect we need the following assumption:

A1: $E[\epsilon_i | S_i^o = S^o]$ is continuous in $S^o$ at the cutoff $S^{o*}$.

Assumption A1 means that the conditional expectations of all other uncontrolled characteristics that affect final scores are continuous at the cutoff. This assumption guarantees that the difference in average final scores between students whose original scores are marginally higher than the cutoff and those whose original scores are marginally lower can be attributed to the assignment to model schools only. All their other characteristics are the same on average. Following the identification results of Hahn et al. (2001), we obtain the following identified treatment effect of model schools on final test scores for students whose original scores are just below and above the cutoff:⁵⁰

$$\beta^{sharp} = \lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o),$$

as long as both $\lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o)$ and $\lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o)$ exist.

Then we focus on the fuzzy RD design framework, under which the treatment indicator differs from the eligibility indicator for some students. Some eligible students give up the opportunity of model schools while some ineligible students attend model schools for special reasons. In that case the constant treatment effect can be identified as:

$$\beta^{fuzzy} = \frac{\lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o)}{\lim_{S^o \to S^{o*+}} E(K_i | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(K_i | S_i^o = S^o)}.$$

The denominator is the fraction of students who attend model schools when their original scores are above the cutoff but will not attend if their original scores are lower than the cutoff. It measures the discontinuity of the probability of receiving treatment at the cutoff. Under the current setting it is always finite and non-zero, which implies a discontinuous probability at the cutoff, so the identified treatment effect is meaningful.
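The step from the outcome equation to the ratio can be sketched in one line of algebra. The following is a short illustration, assuming the model with the original score included and assumption A1; the paper's own derivation is in the appendix and may differ in detail.

```latex
\begin{align*}
E(S_i^f \mid S_i^o = S^o)
  &= \alpha + \gamma S^o + \beta\, E(K_i \mid S_i^o = S^o) + E(\epsilon_i \mid S_i^o = S^o), \\
\lim_{S^o \to S^{o*+}} E(S_i^f \mid S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f \mid S_i^o = S^o)
  &= \beta \Big[ \lim_{S^o \to S^{o*+}} E(K_i \mid S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(K_i \mid S_i^o = S^o) \Big].
\end{align*}
```

The terms $\alpha + \gamma S^o$ and, by A1, $E(\epsilon_i \mid S_i^o = S^o)$ are continuous at the cutoff and cancel when the one-sided limits are differenced. Dividing by the bracket, which is non-zero in the fuzzy design and equal to one in the sharp design, gives the two identified effects.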

5.1.2 Heterogeneous Treatment Effect

If we assume that the treatment effect varies across students, the identification is similar but needs additional assumptions. Now the basic model becomes $S_i^f = \alpha + \beta_i K_i + \epsilon_i$. For the sharp RD design, we make the following assumptions:

A2: $E(\beta_i | S_i^o = S^o)$ is continuous in $S^o$ at the cutoff $S^{o*}$.

A3: Conditional on original scores $S_i^o$ near the cutoff $S^{o*}$, $\beta_i$ and $K_i$ are independent.

Assumption A2 means that the conditional expected treatment effect, which can be regarded as a function of the original score, should not jump at the cutoff. Assumption A3 guarantees that we are able to separate the jump in treatment from its effect, at least near the cutoff. Then we can identify the average treatment effect at the cutoff:⁵¹

$$E[\beta_i^{sharp} | S_i^o = S^{o*}] = \lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o).$$

For the fuzzy RD design, we need further assumptions:

A4: $\beta_i$ and $K_i(S_i^o)$ are jointly independent of $S_i^o$ near the cutoff $S^{o*}$.

A5: For all $s > 0$, we have $K_i(S_i^o + s) \geq K_i(S_i^o - s)$ for all $i$.

Assumption A4 is just the fuzzy version of assumption A3. Assumption A5 is also called the monotonicity assumption. It excludes, for all students, the case in which he (or she) would attend a model school if his (or her) original score were lower but would attend a normal school with a higher original score. Then we can obtain the average treatment effect in the following form:

$$\lim_{s \to 0^+} E[\beta_i^{fuzzy} | K_i(S_i^o + s) - K_i(S_i^o - s) = 1] = \frac{\lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o)}{\lim_{S^o \to S^{o*+}} E(K_i | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(K_i | S_i^o = S^o)}.$$

This effect is a causal effect of the "LATE" type, the local average treatment effect. It measures the average treatment effect of assignment to model schools for students who attend model schools if and only if their original scores are higher than the cutoff.

49 Generally speaking, the treatment-determining variable does not necessarily have effects on the outcome. It can be completely exogenous. One example of an exogenous treatment-determining variable is a random number series.
50 For derivations of the identification of treatment effects in this section, see the appendix.
51 For the derivation of the heterogeneous treatment effect, see the appendix.

5.2 Estimation

Generally there are two ways to estimate the treatment effect in an RD design: parametric estimation and non-parametric estimation. In the identification section we have already shown that to estimate the treatment effect we need to estimate the limits of conditional expectation functions from both sides of the cutoff. With parametric methods, we are able to use more observations, even the whole sample, at the cost of additional assumptions about the specification of the regression function. Consequently, parametric estimation will be biased if the regression function is misspecified. With non-parametric methods, we can only use observations in a small neighborhood of the cutoff but do not need to impose additional assumptions on regression functions. However, bias may still exist because of the boundary problem. In the following sections we perform the estimation in both ways and compare the results. Also, as shown previously, whether the treatment effect is regarded as homogeneous or heterogeneous only affects the interpretation. The expressions to be estimated are exactly the same in both cases, and we will not distinguish them in the estimation part.

5.2.1 Parametric Estimation

The basic regression function was already obtained in previous sections as follows:

$$S_i^f = \alpha + \beta K_i + \epsilon_i.$$

To perform the parametric estimation, we need to impose an additional assumption on this basic regression function. In a sharp RD design it can be written as

$$S_i^f = f(S_i^o) + \beta K_i + \epsilon_i,$$

where $f(S_i^o)$ is a function of the treatment-determining variable $S_i^o$ and continuous at the cutoff $S^{o*}$. It can be a global or piecewise polynomial. Such a regression function is called a control-function-augmented outcome equation. In a fuzzy RD design, we are not able to get an unbiased estimate of the treatment effect from this equation because of the potential correlation between the treatment $K_i$ and the error term $\epsilon_i$. To solve this problem, we need to revise the equation to the following form:

$$S_i^f = f(S_i^o) + \beta E(K_i | S_i^o) + \epsilon_i,$$

where under the fuzzy RD design settings, $f(S_i^o)$ is continuous at the cutoff while $E(K_i | S_i^o)$ is discontinuous at the cutoff.

Table 3: Parametric Estimation, with and without Covariates, OLS and Linear RDᵃ

| | OLS: Simple | OLS: OS only | OLS: All Covariates | RD: Simple | RD: OS only | RD: All Covariates |
|---|---|---|---|---|---|---|
| Science | 95.40*** | 19.85*** | 19.98*** | 195.9*** | 68.36*** | 81.07*** |
| | (5.176) | (5.524) | (7.081) | (10.57) | (11.49) | (15.87) |
| | [1293] | [1293] | [870] | [1293] | [1293] | [870] |
| Art | 101.8*** | 20.16*** | 6.846 | 156.1*** | 27.61** | -11.96 |
| | (3.351) | (4.582) | (5.858) | (5.201) | (11.23) | (17.30) |
| | [2277] | [2277] | [1785] | [2277] | [2277] | [1785] |

a 1. *** and ** imply significance at the 1% and 5% level respectively. 2. The complete estimation results, including the number of observations and R-squared, can be found in the appendix. 3. The standard errors reported in () are heteroskedasticity consistent and the numbers of observations are reported in [].

It is not difficult to relate the discussion of the fuzzy RD design above to a two-stage estimation procedure. In the first stage we estimate the following equation:

$$K_i = l(S_i^o) + \theta 1\{S_i^o > S^{o*}\} + \mu_i,$$

where $l(S_i^o)$ is also a function of $S_i^o$ and continuous at the cutoff. Imposing specification assumptions on this function, we can obtain the estimate of $\theta$ and the fitted propensity score $E(K_i | S_i^o) = \Pr(K_i = 1 | S_i^o)$. In the second stage we just replace the treatment by this estimated propensity score and estimate the following equation:

$$S_i^f = f(S_i^o) + \beta E(K_i | S_i^o) + \epsilon_i.$$
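The two stages can be sketched with plain least squares. The code below is an illustration only, assuming linear $l(\cdot)$ and $f(\cdot)$ and NumPy; the function names `ols` and `fuzzy_rd_two_stage` are hypothetical, and the sketch returns a point estimate without standard errors (second-stage OLS standard errors would not be valid without correction).

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fuzzy_rd_two_stage(s_final, s_orig, k, cutoff):
    """Two-stage estimate of beta with linear l(.) and f(.) (hypothetical helper)."""
    s_final = np.asarray(s_final, dtype=float)
    s_orig = np.asarray(s_orig, dtype=float)
    k = np.asarray(k, dtype=float)
    ones = np.ones_like(s_orig)
    centred = s_orig - cutoff                      # original score relative to cutoff
    above = (s_orig > cutoff).astype(float)        # 1{S_i^o > cutoff}
    # First stage: K_i = l(S_i^o) + theta * 1{S_i^o > cutoff} + mu_i
    Z = np.column_stack([ones, centred, above])
    k_hat = Z @ ols(Z, k)                          # fitted propensity score
    # Second stage: S_i^f = f(S_i^o) + beta * E(K_i | S_i^o) + eps_i
    X_hat = np.column_stack([ones, centred, k_hat])
    return ols(X_hat, s_final)[2]
```

With one excluded instrument and the same linear control in both stages, this estimate coincides with the ratio of the jump in the final score to the jump in the treatment probability, each estimated by regressing on a constant, the centred original score and the indicator.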

If we assume the same functional form for $l(S_i^o)$ and $f(S_i^o)$, then this two-stage estimation procedure is equivalent to two-stage least squares for IV methods. Here the instruments are just the eligibility indicator $1\{S_i^o > S^{o*}\}$ and the variables in $f(S_i^o)$. The relevant estimation results are summarized in Table 3.

In the OLS estimations we focus on the basic regression function mentioned above, $S_i^f = \alpha + \beta K_i + \epsilon_i$. For simple OLS the regression function is just in that form. Then original scores and other covariates are added into the function sequentially. In the RD designs we focus on the two-stage estimation procedure discussed above. Both $l(S_i^o)$ and $f(S_i^o)$ are linear. Similarly, the original score and other covariates are added sequentially into the simple RD model.

In each kind of estimation, when we do not include any regressors other than a constant and the treatment, the treatment effects are substantially overestimated compared with those from estimations with additional covariates. It is obvious that besides the treatment (key school), the original score also contains information that determines the final score. Unsurprisingly, for science students including other covariates does not change the significant estimates much. However, these additional covariates change the estimation results considerably for art students: neither estimate is significant, and the estimated effect from the RD design even becomes negative. These results imply that for art students additional covariates can explain a substantial part of the final score and may behave discontinuously around the cutoff together with the probability of treatment.

Comparing corresponding results from OLS and the RD design, we find that all of the significant estimated treatment effects from OLS are smaller than their counterparts from the RD design. So it seems that in the former model the omitted variables are negatively correlated with the final score.
The last comparison is between results for science students and art students. In the OLS estimations, except for specifications with all covariates, estimated effects for science students are slightly smaller than those for art students. In the RD design, by contrast, in the first two specifications the estimated effects for science students are remarkably larger than those for art students. Considering the endogeneity problem of the OLS estimations, it is reasonable to say that science students gain more than art students from model schools.

Next we explore the robustness of the parametric estimation. The following seven functional specifications, without additional covariates other than the original score, are examined: constant (C), linear (L), quadratic (Q), cubic (Cu), piecewise linear (PL), piecewise quadratic (PQ) and piecewise cubic (PC). The estimated treatment effects are presented in Table 4 and Table 5, while expressions of these seven specifications can be found in the appendix.

Table 4: Parametric Estimation, RD Design, Robust Test, Scienceᵃ

Rows give the specification of $f(S_i^o)$; columns give the specification of $l(S_i^o)$.

| $f(S_i^o)$ | C | L | Q | Cu | PL | PQ | PC |
|---|---|---|---|---|---|---|---|
| C | 195.9*** (10.57) | 215.7*** (6.762) | 215.8*** (6.337) | 212.6*** (5.944) | 211.1*** (6.123) | 211.4*** (6.034) | 206.0*** (6.162) |
| L | 58.13*** (9.773) | 68.36*** (11.49) | 80.08*** (10.67) | 78.18*** (10.99) | 72.87*** (10.89) | 75.66*** (10.80) | 69.06*** (10.71) |
| Q | 20.43** (10.35) | 24.03** (12.18) | 30.11** (15.26) | 30.64** (14.23) | 24.73* (13.39) | 24.85* (14.77) | 15.11 (14.02) |
| Cu | 17.94* (10.39) | 21.10* (12.22) | 26.43* (15.31) | 29.50* (17.09) | 21.68 (15.68) | 20.43 (16.88) | 7.261 (15.66) |
| PL | 38.17*** (9.918) | 44.89*** (11.66) | 64.60*** (12.70) | 67.89*** (14.56) | 56.88*** (14.78) | 64.22*** (15.09) | 50.33*** (14.48) |
| PQ | 27.16** (11.96) | 31.95** (14.07) | 58.28*** (15.24) | 58.39*** (16.73) | 40.48** (17.82) | 51.80*** (17.01) | 35.64** (15.86) |
| PC | 10.41 (13.47) | 12.24 (15.84) | 46.64*** (17.61) | 44.09*** (18.92) | 15.51 (20.07) | 34.36* (19.46) | 28.26* (15.62) |

a 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The specifications of these regression functions can be found in the appendix. 3. The standard errors reported are heteroskedasticity consistent. 4. The number of observations is N = 1293 for all specifications.

Table 5: Parametric Estimation, RD Design, Robust Test, Artᵃ

Rows give the specification of $f(S_i^o)$; columns give the specification of $l(S_i^o)$.

| $f(S_i^o)$ | C | L | Q | Cu | PL | PQ | PC |
|---|---|---|---|---|---|---|---|
| C | 156.1*** (5.201) | 172.4*** (4.345) | 172.5*** (4.071) | 170.2*** (3.880) | 167.5*** (4.002) | 166.7*** (4.015) | 165.8*** (3.936) |
| L | 21.28** (8.659) | 27.61** (11.23) | 49.02*** (11.94) | 46.18*** (11.38) | 39.10*** (10.80) | 39.19*** (11.22) | 39.10*** (10.98) |
| Q | -6.037 (6.687) | -7.831 (8.673) | -9.831 (10.89) | -10.53 (10.98) | -17.40 (10.74) | -20.43* (10.61) | -17.89* (10.19) |
| Cu | -6.016 (6.698) | -7.803 (8.688) | -9.796 (10.91) | -9.816 (10.93) | -17.86* (10.72) | -20.48** (10.34) | -17.95* (9.794) |
| PL | 9.949 (7.304) | 12.90 (9.474) | 34.15*** (12.64) | 29.66** (12.20) | 15.58 (11.44) | 16.72 (12.30) | 17.53 (11.69) |
| PQ | 23.37*** (8.292) | 30.31*** (10.76) | 55.93*** (13.67) | 47.89*** (13.22) | 36.60*** (12.99) | 55.17*** (16.74) | 53.48*** (15.71) |
| PC | 22.15** (9.291) | 28.73** (12.05) | 62.02*** (15.58) | 51.80*** (15.21) | 34.70** (14.55) | 58.11*** (19.23) | 50.78*** (15.74) |

a 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The specifications of these regression functions can be found in the appendix. 3. The standard errors reported are heteroskedasticity consistent. 4. The number of observations is N = 2277 for all specifications.

From these results, we get some interesting findings. First, the estimated treatment effect is much more sensitive to the specification of $f(S_i^o)$ than to the specification of $l(S_i^o)$. When a certain specification of $l(S_i^o)$ is fixed, the estimate changes remarkably with the specification of $f(S_i^o)$, but the change with the specification of $l(S_i^o)$ is much slighter. For example, for science students, if we set $l(S_i^o)$ as a constant, the estimated treatment effect changes from 195.9 to 10.41 as the specification of $f(S_i^o)$ changes from a constant to a piecewise cubic function. But the estimate only stays between 195.9 and 215.8 when the constant specification of $f(S_i^o)$ is fixed and the specification of $l(S_i^o)$ varies. This finding is reasonable because the range over which the fitted probability of treatment varies is much smaller than that of the original score in the outcome equation. So even if the fitted treatment is also very sensitive to the specification of $l(S_i^o)$, little of this sensitivity is transferred to the final score.

Second, most of the significant estimated treatment effects for science students range from 50 to 80, which accounts for 6.7% to 10.7% of the total score, while those for art students range from 30 to 60, which accounts for 4.0% to 8.0% of the total score. The effects do not seem very large in proportion, but considering the competitiveness of the exam and admission, such effects can play crucial roles. The main exception is the estimates under the constant specification of $f(S_i^o)$, in which case the estimated treatment effects are much larger than the rest. This implies that factors implicitly involved in the original score can account for a large part of the final score directly. That is in line with what we found in the previous linear cases.

For science students, all of the estimated effects are positive and most of them are significant at least at the 10% level. For art students, more estimated effects are insignificant and many become negative, some of them significantly so. The smaller overall positive effects and the significant negative effects for art students cast doubt on the value-added effects of model schools for that group of students.

5.2.2 Nonparametric Kernel Estimation

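The homogeneous treatment effect of model schools under the fuzzy RD design can be identified as

$$\frac{\lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o)}{\lim_{S^o \to S^{o*+}} E(K_i | S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(K_i | S_i^o = S^o)}.$$

So generally we need to estimate four limits. Two of them are about outcomes and the other two are about treatments. In the following we focus on the former two to illustrate the estimation process.

In standard nonparametric kernel estimation, we have a kernel $\kappa(\phi)$, which satisfies $\int \kappa(\phi) d\phi = 1$. Furthermore, let $L_+^{S^f}(S^o) = \lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o)$ and $L_-^{S^f}(S^o) = \lim_{S^o \to S^{o*-}} E(S_i^f | S_i^o = S^o)$, and let $h$ be the bandwidth. Then we can estimate the limits as

$$\hat{L}_+^{S^f} = \frac{\sum_{i | S_i^o \geq S^{o*}} S_i^f \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}{\sum_{i | S_i^o \geq S^{o*}} \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}, \qquad \hat{L}_-^{S^f} = \frac{\sum_{i | S_i^o < S^{o*}} S_i^f \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}{\sum_{i | S_i^o < S^{o*}} \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)},$$

and, by analogy, the two limits for the treatment as

$$\hat{L}_+^{K} = \frac{\sum_{i | S_i^o \geq S^{o*}} K_i \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}{\sum_{i | S_i^o \geq S^{o*}} \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}, \qquad \hat{L}_-^{K} = \frac{\sum_{i | S_i^o < S^{o*}} K_i \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}{\sum_{i | S_i^o < S^{o*}} \kappa\left(\frac{S_i^o - S^{o*}}{h}\right)}.$$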

The estimator from nonparametric kernel estimation is biased under the settings of the RD design, though it remains consistent. The main reason is the poor boundary behavior of the estimator: its convergence rate is slower at boundary points than at interior points. For example, suppose that under a sharp RD design the final score is increasing in the original score in a sufficiently large neighborhood around the cutoff. Then with the rectangular kernel, the average final score within the bandwidth to the left of the cutoff underestimates the final score just to the left of the cutoff, while the average within the bandwidth to the right overestimates the final score just to the right of the cutoff. This leads to an overestimated treatment effect of model high schools. To deal with this bias problem, several other methods, such as local linear regression or series regression, have been introduced. The latter just adds higher-order polynomial terms to the setting of the former. We will focus on the former and the relevant issues concerning the optimal kernel and bandwidth in the following two sections.
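The boundary bias described above is easy to reproduce numerically. The sketch below, with hypothetical function names and NumPy assumed, implements one-sided kernel-weighted averages and the resulting fuzzy RD ratio, using the rectangular kernel as the default.

```python
import numpy as np

def rectangular(t):
    """Rectangular (uniform) kernel weights, up to a constant factor."""
    return (np.abs(t) <= 1.0).astype(float)

def one_sided_kernel_means(y, x, cutoff, h, kernel=rectangular):
    """Kernel-weighted averages of y just right and just left of the cutoff."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    w = kernel((x - cutoff) / h)
    right = x >= cutoff                 # observations at or above the cutoff
    left = ~right
    l_plus = np.sum(y[right] * w[right]) / np.sum(w[right])
    l_minus = np.sum(y[left] * w[left]) / np.sum(w[left])
    return l_plus, l_minus

def fuzzy_rd_kernel(s_final, k, s_orig, cutoff, h, kernel=rectangular):
    """Ratio of the outcome gap to the treatment gap (hypothetical helper)."""
    y_plus, y_minus = one_sided_kernel_means(s_final, s_orig, cutoff, h, kernel)
    k_plus, k_minus = one_sided_kernel_means(k, s_orig, cutoff, h, kernel)
    return (y_plus - y_minus) / (k_plus - k_minus)
```

For a final score that rises linearly with slope 2 and has no jump at all, a bandwidth of 20 produces a gap of about 40 between the two averages, so even a sharp design with no true effect would show a spurious positive estimate.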

5.2.3 Local Linear Regression

In the local linear regression, we just run linear regressions within a bandwidth on each side of the cutoff to predict the value of the regression function at the cutoff. In principle the following parameters are estimated:

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \left[S_i^f - \alpha - \beta(S_i^o - S^{o*})\right]^2 \kappa\left(\frac{S_i^o - S^{o*}}{h}\right) 1(S_i^o > S^{o*}).$$

Then one component of the estimated treatment effect is obtained: $\hat{L}_+^{S^f} = \hat{\alpha}$, the estimate of $\lim_{S^o \to S^{o*+}} E(S_i^f | S_i^o = S^o)$. With the same method the other components (one for the sharp RD design and three for the fuzzy RD design) can be obtained. It is not difficult to see that with the rectangular kernel the procedure discussed above is just a standard linear regression within a bandwidth on each side of the cutoff.⁵²

The results of the local linear regression are shown in Table 6. The significant estimated treatment effects from local linear regression for science students range from 30 to 55, which is a bit smaller than but still in line with the findings from the global parametric estimation. However, these results differ considerably across bandwidths. It seems that the estimated treatment effects increase with the chosen bandwidth for both kinds of students. That is reasonable because with the expansion of the bandwidth, more students with scores further away from the average of the respective side are included. Such sensitivity to the choice of bandwidth is also reflected in the fact that none of the estimated treatment effects for art students, for any bandwidth, is significant. Different kernels can also lead to different results, but the impact of the choice of kernel seems to be somewhat smaller. To check this further, we investigate more bandwidths for art students. The estimated results are shown in the appendix and are only significant for two kernels with a bandwidth of 5. When the bandwidth varies from 5 to 400, though the results change considerably, most of them are insignificant.

Focusing on the estimated effects for science students, we find that significance is enhanced by a larger bandwidth. All four estimated effects are significant at the 1% level when bandwidth 200 is chosen, while only two are significant at the 5% level for bandwidth 50. But bandwidth 200 is actually too large and accounts for 35.1% of the original score. With a linear specification such a large bandwidth may lead to severely biased results. To obtain precise and unbiased estimates we should investigate the choice of optimal bandwidths and kernels. We discuss this aspect in the section on optimal bandwidth choice.

Compared with those from standard kernel estimation, estimators from local linear regression have better boundary behavior. Now the asymptotic bias of the estimator is of order $O(h^{p+1})$, where $p = 1$ for the linear regression and $p > 1$ for higher-order polynomials. The bias of order $O(h^2)$ is then comparable with that at interior points.⁵³
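A local linear fuzzy RD estimate can be sketched as a kernel-weighted least-squares fit on each side of the cutoff. This is an illustration under stated assumptions: the function names are hypothetical, the kernel formulas are the usual textbook forms and may differ in normalization from those in the appendix, and observations exactly at the cutoff are left out, mirroring the strict inequality in the display above.

```python
import numpy as np

# Textbook kernel forms (assumed; the appendix versions may be scaled differently).
KERNELS = {
    "rectangular": lambda t: 0.5 * (np.abs(t) <= 1.0),
    "triangular": lambda t: np.maximum(0.0, 1.0 - np.abs(t)),
    "epanechnikov": lambda t: 0.75 * np.maximum(0.0, 1.0 - t ** 2),
    "gaussian": lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi),
}

def boundary_intercept(y, x, cutoff, h, side, kernel="rectangular"):
    """Local linear estimate of E[y | x] at the cutoff from one side.

    side = +1 uses observations above the cutoff, side = -1 those below.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(x, dtype=float) - cutoff
    w = KERNELS[kernel](d / h) * ((d > 0) if side > 0 else (d < 0))
    X = np.column_stack([np.ones_like(d), d])
    sw = np.sqrt(w)                                   # weighted least squares
    coef = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return coef[0]                                    # intercept = value at cutoff

def fuzzy_rd_local_linear(s_final, k, s_orig, cutoff, h, kernel="rectangular"):
    """Ratio of local linear jumps in outcome and treatment (hypothetical helper)."""
    jump_y = (boundary_intercept(s_final, s_orig, cutoff, h, +1, kernel)
              - boundary_intercept(s_final, s_orig, cutoff, h, -1, kernel))
    jump_k = (boundary_intercept(k, s_orig, cutoff, h, +1, kernel)
              - boundary_intercept(k, s_orig, cutoff, h, -1, kernel))
    return jump_y / jump_k
```

Because the fit includes a slope on each side, a final score that is linear in the original score no longer creates a spurious gap at the cutoff, which is the practical content of the improved boundary behavior.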

52For details of the local linear regression with a rectangular kernel, see the appendix.
53For details, see Fan (1992) and Porter (2003).

Table 6: Local Linear Regression, Arbitrary Bandwidths and Kernels

Kernel              Rectangular   Triangular   Gaussian    Epanechnikov
Science
  Standard (100)    42.37**       30.17*       46.43***    49.36***
                    (20.99)       (15.92)      (17.84)     (15.04)
  Half (50)         39.22         6.852        36.20**     37.62**
                    (25.30)       (19.57)      (18.50)     (15.61)
  Double (200)      53.95***      43.06***     53.15***    55.03***
                    (19.22)       (15.76)      (17.72)     (14.56)
Art
  Standard (100)    2.235         -5.346       4.149       6.220
                    (11.13)       (13.65)      (10.99)     (9.700)
  Half (50)         -14.11        -12.42       -3.219      -2.374
                    (14.01)       (17.90)      (11.64)     (9.538)
  Double (200)      11.62         2.606        11.18       13.39
                    (10.45)       (12.64)      (17.42)     (10.40)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported (in parentheses) are obtained through the bootstrap method. 3. For science students, the number of observations is N = 1293. For art students, the number of observations is N = 2277.
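The table notes state that the standard errors come from the bootstrap, without saying which variant. A minimal sketch, assuming a plain nonparametric bootstrap that resamples students with replacement, is given below. Both function names are hypothetical, and `mean_gap` is only a toy estimator standing in for the local linear estimator.

```python
import random
import statistics

def bootstrap_se(rows, estimator, reps=500, seed=0):
    """Standard deviation of the estimator across resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(reps):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(sample))
    return statistics.stdev(draws)

def mean_gap(rows):
    """Toy estimator: mean outcome at or above a cutoff of 0 minus the mean below it.
    Each row is (centered_score, outcome); both sides must be non-empty."""
    above = [y for x, y in rows if x >= 0]
    below = [y for x, y in rows if x < 0]
    return statistics.mean(above) - statistics.mean(below)
```

In practice `estimator` would be the RD estimator applied to each resample, so that the reported standard error reflects the sampling variation of the whole procedure.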

5.2.4 Optimal Bandwidth Choice

As shown in much of the literature, the choice of kernel has little impact in practice compared with the choice of bandwidth54. In this article we focus on four kernel functions: the rectangular (uniform) kernel, the triangular kernel, the Epanechnikov kernel and the Gaussian kernel. The expressions of these kernel functions are presented in the appendix. The choice of the optimal bandwidth is more important for our application. On the one hand, if the chosen bandwidth is too small, too few observations fall within it and the precision of the estimation suffers considerably. On the other hand, if the chosen bandwidth is too large, the specified function may not fit the data well and the estimate will be biased. Choosing the optimal bandwidth thus means finding a balance between precision and unbiasedness. In practice there are three methods for choosing the optimal bandwidth. For ease of comparison, we calculate the optimal bandwidth for local linear regressions. The first method is a two-step procedure that involves the unknown joint distribution of all variables. In the first step, a rule-of-thumb (ROT) bandwidth is estimated using the whole data set. Mathematically, the ROT bandwidth can be expressed as follows:

 1/5    σ˜2  hROT = Cκ n ,  P 00 2   [m ˜ (xi)]  i=1 where Cκ is a kernel specified constant,m ˜ (x) is the fitted value of a polynomial function which is estimated globally. It is shown that odd order polynomial functions are better and in practice a fourth order one is widely used. Here ˆ ˆ ˆ 2 ˆ 3 ˆ 4 we follow these discussions and assume thatm ˜ (x) = β0 + β1x + β2x + β3x + β4x . Furthermore, we have 00 ˆ ˆ ˆ 2 m˜ (x) = 2β2 + 6β3x + 12β4x .σ ˜ is the standard error of the regression of the polynomial function.

54See Lee and Lemieux (2009) for a brief discussion.

Then in the second step, the ROT bandwidth from the first step is plugged into the following formula to compute the optimal bandwidth:

$$h_{OPT}(x_0) = C_{\kappa} \left[ \frac{\hat{\sigma}^{2}(x_0)}{\sum_{i=1}^{n} [\hat{m}''(x_i)\, \kappa_{h_{ROT}}(x_i - x_0)]^{2}} \right]^{1/5}.$$

To obtain the optimal bandwidth to the left of the cutoff, we use only the data to the left of the cutoff, and vice versa for the optimal bandwidth to the right of the cutoff. The second method is based on a cross-validation procedure. First, for each observation $i$, we estimate a linear regression of the outcome on the treatment-determining variable using the remaining observations on the same side of the cutoff as $i$. The fitted value is $\hat{Y}(x_i)$. We then define the cross-validation criterion as $CV_Y(h) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}(x_i))^{2}$, and the optimal bandwidth is chosen to satisfy $h_{CV}^{OPT} = \arg\min_{h} CV_Y(h)$. However, according to Imbens and Kalyanaraman (2009), both of the previous methods choose a bandwidth that is optimal over the whole support, which is not actually optimal in RD settings. A third method, called the IK approach, is therefore introduced. Generally speaking, the IK approach starts from an adjustment to the cross-validation method, in which the optimal bandwidth is chosen to minimize the following approximation to the mean integrated squared error (MISE) criterion:

$$MISE(h) = E\left[ \int_{x} [\hat{m}_h(x) - m(x)]^{2} f(x)\, dx \right].$$

In the IK approach the proposed criterion becomes:

$$MSE(h) = E[(\hat{\beta} - \beta)^{2}] = E\left\{ \left[ \left( \lim_{x \to c+} \hat{m}(x) - \lim_{x \to c+} m(x) \right) - \left( \lim_{x \to c-} \hat{m}(x) - \lim_{x \to c-} m(x) \right) \right]^{2} \right\}.$$

This criterion is not practically feasible and an asymptotic mean squared error (AMSE) is used to replace it:

$$AMSE(h) = C_1 h^{4} [m''_{+}(c) - m''_{-}(c)]^{2} + \frac{C_2}{Nh} \left[ \frac{\sigma_{+}^{2}(c)}{f_{+}(c)} + \frac{\sigma_{-}^{2}(c)}{f_{-}(c)} \right],$$

where $C_1$ and $C_2$ are kernel-specific constants, $m''_{+}(c)$ and $m''_{-}(c)$ are the second derivatives of the regression function at the right and left respectively, and $\sigma_{+}^{2}(c)$ and $\sigma_{-}^{2}(c)$ are the limits of the conditional variance at the right and left respectively. The optimal bandwidth is chosen to minimize this criterion:

$$h_{OPT}^{IK} = \arg\min_{h} AMSE(h)$$

In practice, three modifications are made to the previous approach to reduce the variance of the estimated bandwidth. The final estimated bandwidth can be expressed as follows:

$$\hat{h}_{OPT}^{IK} = C_{\kappa} \left( \frac{2\hat{\sigma}^{2} / \hat{f}(c)}{[\hat{m}''_{+}(c) - \hat{m}''_{-}(c)]^{2} + (\hat{r}_{+} + \hat{r}_{-})} \right)^{1/5} N^{-1/5},$$

where $\hat{r}_{+}$ and $\hat{r}_{-}$ are regularization terms that guard against the occasional choice of an unrealistically large bandwidth. The methods for choosing the optimal bandwidth discussed above are applicable to the sharp RD design. For the fuzzy RD design adopted in this article, we need to modify them slightly. Under the fuzzy RD design, we have two pairs of regression functions: one for the outcome and the other for the treatment. So in principle there are two optimal bandwidths. As recommended in the recent literature, it is better to use the optimal bandwidth from the outcome function for both of them55. As shown in the previous sections, we also have several additional covariates. They will not affect the optimal bandwidth much as long as they are continuous at the cutoff and do not have great explanatory power.
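For comparison with the IK formula, the cross-validation method described earlier in this section can be sketched as follows. This is an illustration only. It assumes a rectangular window, so that the leave-one-out fit for observation i uses the other observations on the same side of the cutoff within distance h of its score; the text leaves this windowing detail open. Function names are hypothetical.

```python
def linear_fit_predict(pts, x0):
    """Fit y = a + b x to pts by OLS and return the prediction at x0."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0.0:
        return my
    slope = sum((x - mx) * (y - my) for x, y in pts) / sxx
    return my + slope * (x0 - mx)

def cv_criterion(xs, ys, cutoff, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    total, used = 0.0, 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        same_side = [(x, y) for j, (x, y) in enumerate(zip(xs, ys))
                     if j != i and (x >= cutoff) == (xi >= cutoff)
                     and abs(x - xi) <= h]
        if len(same_side) < 2:
            continue  # not enough neighbours to fit a line
        total += (yi - linear_fit_predict(same_side, xi)) ** 2
        used += 1
    return total / used if used else float("inf")

def cv_bandwidth(xs, ys, cutoff, grid):
    """Bandwidth in the candidate grid that minimizes the CV criterion."""
    return min(grid, key=lambda h: cv_criterion(xs, ys, cutoff, h))
```

Because the criterion averages over every observation, it rewards a good fit across the whole support rather than at the cutoff, which is the weakness that motivates the IK approach.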

55For details see Imbens and Lemieux (2008) or Lee and Lemieux (2009).

Table 7: Optimal Bandwidth, Various Kernels

Optimal bandwidth by kernel:

            Rectangular   Triangular   Gaussian   Epanechnikov
Science     23.99         15.27        4.865      14.22
Art         21.95         13.97        4.451      13.01

Estimated components:

            $\hat{\sigma}^{2}(c)$   $\hat{f}(c)$   $\hat{m}''_{+}(c)$   $\hat{m}''_{-}(c)$   $\hat{r}_{+}$   $\hat{r}_{-}$
Science     3792                    0.012          -0.137               -0.214               0.112           0.168
Art         4539                    0.014          0.010                0.068                0.116           0.129

Note: The kernel-specific constants are approximately 5.4, 3.4375, 1.0951 and 3.1999 respectively.
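As a rough consistency check, the final IK formula can be evaluated at the components reported in Table 7. The sketch below is illustrative only: the function name is hypothetical, and N is assumed to be the full-sample size given in the table notes (1293 for science, 2277 for art), which the text does not state explicitly for this calculation.

```python
def ik_bandwidth(c_kappa, sigma2, density, curv_right, curv_left,
                 reg_right, reg_left, n):
    """Plug-in bandwidth: C * (2*sigma2/f / (curvature gap^2 + r+ + r-))^(1/5) * N^(-1/5)."""
    ratio = (2.0 * sigma2 / density) / (
        (curv_right - curv_left) ** 2 + reg_right + reg_left)
    return c_kappa * ratio ** 0.2 * n ** -0.2

# Rectangular kernel (constant 5.4), components as printed in Table 7.
science = ik_bandwidth(5.4, 3792, 0.012, -0.137, -0.214, 0.112, 0.168, 1293)
art = ik_bandwidth(5.4, 4539, 0.014, 0.010, 0.068, 0.116, 0.129, 2277)
print(round(science, 2), round(art, 2))
```

Because the tabulated components are rounded, the values obtained this way agree with the reported bandwidths only approximately. The bandwidths for the other kernels follow by swapping in the corresponding kernel constant.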

Table 8: Local Linear Regression, Optimal Bandwidth

Kernel              Rectangular   Triangular   Gaussian    Epanechnikov
Science
  Optimal           -38.05*       -50.54       -44.46      -30.45
                    (21.24)       (34.01)      (33.71)     (20.31)
  Half Optimal      -72.82*       -14.59       -2.746      -52.21*
                    (43.47)       (31.43)      (35.36)     (30.29)
  Double Optimal    42.79**       -38.20*      -46.25**    30.89
                    (20.14)       (22.36)      (23.25)     (20.32)
Art
  Optimal           -22.84        -34.04       -19.00      -7.338
                    (18.82)       (43.51)      (43.39)     (17.80)
  Half Optimal      -70.74        54.12        72.10*      -39.04
                    (46.30)       (51.47)      (39.88)     (35.21)
  Double Optimal    -17.79        -12.00       21.87       -12.32
                    (13.60)       (18.21)      (23.02)     (12.20)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported (in parentheses) are obtained through the bootstrap method. 3. For science students, the number of observations is N = 1293. For art students, the number of observations is N = 2277.

We only need to include these covariates in the conditional variance terms to make the estimator more precise. The earlier graphical analysis found no evidence of discontinuity in the covariates at the cutoff, so it is not necessary to add them to the model here, and we focus only on the original score and the treatment. The optimal bandwidths with the rectangular, triangular, Gaussian and Epanechnikov kernels are shown in Table 7. With these optimal bandwidths, we can estimate the treatment effect of model schools again with the various kernels. The estimation results are presented in Table 8. With the optimal bandwidth, most of the estimated effects are insignificant. The only significant one is the effect for science students with the rectangular kernel, and it is negative. As before, to test the sensitivity we introduce half and double the optimal bandwidth. For science students, there are more significant estimates, but both positive and negative effects are found and they differ widely. For art students, only one estimated effect is significant, at the 10% level; it is positive and to some extent consistent with the findings from the parametric estimation. The overall picture of the estimated effects is now somewhat clearer. In the parametric estimation, we find positive and more homogeneous effects of model schools. In the nonparametric estimation with arbitrary, larger bandwidths, some similar significant positive effects can still be found, especially for science students. If we adopt the relatively smaller optimal bandwidths, fewer effects remain significant, and the marked differences across bandwidths make the exact effect ambiguous.

The potential explanations for these findings can be summarized as follows. In the parametric estimation, the whole sample is used to produce a global estimate. The significant and relatively large positive effect therefore reflects the difference between model schools and normal schools in a more “average” way and depends heavily on the specifications of the control functions $f(S_i^{o})$ and $l(S_i^{o})$. Both the large number of observations far from the cutoff and improper specifications will lead to biased predictions at the cutoff. In our case the parametric estimation will overestimate the effect of the model school. In the nonparametric estimation, as is generally the case, the estimated effects are very sensitive to the bandwidth choice. With a larger bandwidth, more observations are included, and the local linear estimation yields estimates that are more precise but more biased. Similarly, in our sample a larger bandwidth tends to overestimate the effect. With the IK method introduced by Imbens and Kalyanaraman, we then calculate the optimal bandwidths, which are much smaller than those we tried arbitrarily before. In theory these optimal bandwidths minimize the expected squared estimation error and should balance precision and unbiasedness properly. It is usually difficult to say which estimation leads to less biased results. So the conclusion so far may be that for science students there is more solid evidence of positive value-added effects, while for art students it is difficult to find sufficient evidence to support value-added effects. Furthermore, even if there are some effects, they may, surprisingly, be negative. However, it is still too early to conclude that the model school has no value-added effects at all. According to the previous analysis, at least three issues call for attention. The first concerns the role of additional covariates.
Though there is little evidence of discontinuity in these covariates at the cutoff, that analysis is rough, and it is still necessary to add them to the model as a robustness check. The second concerns multiple cutoffs. According to the graphical analysis, there seem to be at least two cutoffs, so it is reasonable to conjecture heterogeneity among normal schools and model schools. For example, final scores of students in one model school are much higher than those in the other, while some normal schools have higher final scores than usual, even similar to those of the lower-scoring model school. The last explores peer effects, to see whether the true value-added effects of model schools are concealed by them. We will deal with these issues in the following section.

5.3 Robustness Checks

In this section we check the reliability of the results from three aspects. Two robustness tests have already been implemented: robustness to functional specifications in the parametric estimation, and robustness to kernels and bandwidths in the nonparametric estimation. Here three additional issues closely related to our study are discussed in detail: the effect of additional covariates, the test for jumps at non-discontinuity points and the role of peer effects.

5.3.1 Covariates

As discussed above, additional covariates can help us mainly in two ways: testing the validity of the RD design and reducing the sampling variation. In this section we estimate the value-added effect of the model school with additional covariates and compare it with what we obtained in the previous analysis. To detect the relations between the covariates and the treatment, we run a series of regressions of the following form: $X_j = \alpha_j + \beta_j K + \gamma_j S^{o}$, where $j = 1, 2, \ldots$ and $X_j$ is the covariate. The estimated results are shown in Table 9. From these regressions, we find that many covariates are highly correlated with the treatment, and some results are quite interesting. The relations between the treatment and the covariates are very similar for science and art students. Students in model schools are younger and more likely to be girls. Surprisingly, parents of students in model schools are more likely to be farmers and have relatively lower levels of education. These findings confirm that for elementary education, factors like personal ability, incentives and effort are much more important than family background and support.

Table 9: Relations between Covariates and Treatment

            Science                                     Art
            Treatment (βj)   OS (γj)      N            Treatment (βj)   OS (γj)      N
Age         -0.139***        -0.001       1292         -0.129***        -0.002**     2277
            (0.053)          (0.001)                   (0.040)          (0.001)
Male        -0.244**         0.004***     1292         -0.078           0.004***     2277
            (0.103)          (0.001)                   (0.072)          (0.001)
Normal      -0.151           0.006***     1264         -0.358**         0.010***     2237
            (0.200)          (0.002)                   (0.161)          (0.002)
CollegeF    -0.274**         0.014***     1293         -0.235***        0.006***     2277
            (0.113)          (0.002)                   (0.078)          (0.001)
CollegeM    -0.381***        0.013***     1293         -0.072           0.008***     2277
            (0.119)          (0.002)                   (0.080)          (0.001)
UnemF       n.a.             0.032***     779          -3.089***        0.026***     1923
                             (0.003)                   (0.271)          (0.004)
UnemM       n.a.             0.033***     635          -3.290***        0.025***     1832
                             (0.004)                   (0.295)          (0.004)
FarmerF     0.709***         -0.023***    1055         0.723***         -0.017***    1923
            (0.120)          (0.002)                   (0.086)          (0.002)
FarmerM     0.799***         -0.018***    910          0.687***         -0.017***    1832
            (0.121)          (0.002)                   (0.089)          (0.002)

Notes: 1. *** and ** imply significance at the 1% and 5% level respectively. 2. The standard errors reported (in parentheses) are heteroskedasticity consistent. 3. Besides Age, all other covariates are treated with a probit model. 4. The two missing results (n.a.) imply a deterministic relation between treatment and covariates.

It seems very likely that these covariates also behave discontinuously at the cutoff to some extent, so it is necessary to consider them when we analyze the value-added effect with the RD design. We check the effect of additional covariates in both a parametric and a nonparametric way. In the parametric estimation, the two-stage procedure adopted in the previous section can be modified as follows. In the first stage, we estimate the following equation:

$$K_i = l(S_i^{o}) + \theta \cdot 1\{S_i^{o} \ge S^{o*}\} + \gamma X_i + \mu_i.$$

In the second stage, these covariates still enter the equation linearly, so the equation to be estimated in this stage is

$$S_i^{f} = f(S_i^{o}) + \beta E(K_i \mid S_i^{o}, X_i) + \gamma X_i + \epsilon_i.$$

To simplify the problem, we introduce the covariates only in linear form. The case in which both $l(S_i^{o})$ and $f(S_i^{o})$ are linear is already shown in Table 3. Here we show results from constant and cubic specifications of both. As before, we explore the parametric estimation with covariates through a robustness test with the same seven specifications. The results are presented in the appendix. Compared with the estimates without additional covariates, the changes differ across groups. For science students, although several estimated effects do change, the overall results are not substantially affected: the value-added effects for science students are relatively robust to additional covariates. For art students, the situation is more complicated. Most of the estimated results become insignificant. Furthermore, except for cases with constant specifications in the second stage, which suffer from a severe endogeneity problem, all of the other significant effects are negative. This finding casts further doubt on the existence of value-added effects for art students.
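The two-stage procedure with linear covariates can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the paper: both control functions are linear in the centered score (the paper reports constant and cubic specifications), there is a single covariate, and the function names are hypothetical. It returns the coefficient on the fitted treatment probability.

```python
def solve_least_squares(X, y):
    """OLS coefficients from the normal equations via Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def two_stage_rd(scores, covs, treated, outcomes, cutoff):
    """First stage: treatment on score, eligibility indicator and covariate.
    Second stage: outcome on score, fitted treatment and covariate."""
    first_X = [[1.0, s - cutoff, 1.0 if s >= cutoff else 0.0, x]
               for s, x in zip(scores, covs)]
    a = solve_least_squares(first_X, treated)
    k_hat = [sum(aj * xj for aj, xj in zip(a, row)) for row in first_X]
    second_X = [[1.0, s - cutoff, kh, x]
                for s, kh, x in zip(scores, k_hat, covs)]
    b = solve_least_squares(second_X, outcomes)
    return b[2]  # coefficient on the fitted treatment
```

Note that standard errors from a manual second stage are not valid as they stand, so this sketch should be read as a description of the point estimate only.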

Table 10: Parametric and Local Linear Regression, Linear Covariates

            Parametric Estimation     Local Linear Regression
            C-C         Cu-Cu         Rectangular   Triangular   Gaussian   Epanechnikov   N
Science     106.5***    35.57*        -55.90***     -44.56       -25.44     -41.10**       870
            (6.967)     (19.63)       (17.97)       (35.72)      (32.26)    (20.46)
Art         107.2***    -47.46***     -32.68        -48.61       -25.63     -16.05         1785
            (4.604)     (15.60)       (22.39)       (43.56)      (52.41)    (21.02)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The specifications of these regression functions can be found in the appendix. 3. For parametric estimation, the standard errors reported are heteroskedasticity consistent. 4. The kernel-specific constants are approximately 5.4000, 3.4375, 1.0951 and 3.1999 respectively. 5. For local linear regression, the standard errors reported are obtained through the bootstrap method.

For the standard errors, we find no evidence that additional covariates help to reduce the sampling variability. Almost all of the standard errors increase, with few exceptions. Moreover, the standard errors for art students seem to increase even more. In the nonparametric estimation, we can also consider the effect of additional covariates. To simplify the estimation process, we adopt a two-stage residualized approach. In the first stage, we estimate the following regression: $S_i^{f} = \gamma X_i + \mu_i$. In the second stage, the outcome variable is replaced by the fitted residual and the value-added effect is estimated with local linear regressions. Compared with the method that includes additional covariates directly, this approach leads to a biased estimator because of the potential correlation between the treatment (or the treatment-determining variable) and the additional covariates. However, in the local linear regressions used in the RD design, the chosen bandwidth is usually not large, so observations in a close neighborhood of the cutoff play a crucial role. For these observations the correlation between the treatment and the additional covariates is generally not a big problem, while the estimation process is much simpler. So with our settings, it still seems safe to adopt the two-stage procedure discussed above. The estimated results from the first-stage regression can be found in the appendix. The fitted residuals are then obtained and serve as the outcome variable in the second stage. As before, we calculate the optimal bandwidth with the same IK approach and then check the robustness with various bandwidths and kernels. The results for the optimal bandwidth can be found in Table 10 and the others are left to the appendix. With additional covariates, the estimated value-added effects do change somewhat, especially the significant ones.
For science students, though both positive and negative effects exist, the positive ones become more dominant. The number of significant negative effects decreases markedly from 5 to 1, while the number of significant positive effects doubles. For art students, the situation may be the opposite: with three significant negative effects, the negative value-added effects of the model school for art students seem to be further confirmed.

5.3.2 School Specific Effects

Until now we have analyzed the value-added effects of the model school with the official cutoff of 474, which is the minimum requirement of the second model school. We do not find strong evidence to support the value-added effects, especially for art students. However, we can review the descriptive statistics in detail to see to what extent these two kinds of schools differ. First we disaggregate these two groups and present characteristics by school separately. The results are presented in the appendix.

Table 11: Multiple Cutoffs, Parametric Estimation, Linear Covariates

              Without Additional Covariates      With Additional Covariates
              C-C         Cu-Cu       N          C-C         Cu-Cu        N
Cutoff 490
  Science     388.4***    -77.11*     1293       139.2***    64.19        870
              (31.56)     (44.67)                (9.390)     (74.77)
  Art         263.2***    32.51       2277       126.0***    -53.63***    1785
              (10.90)     (36.19)                (5.190)     (20.34)
Cutoff 449
  Science     178.9***    61.58*      1293       104.7***    -296.0***    870
              (6.615)     (35.68)                (7.881)     (75.80)
  Art         182.9***    -20.70      2277       128.3***    56.38**      1785
              (5.768)     (23.04)                (6.302)     (25.87)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The specifications of these regression functions can be found in the appendix. 3. The standard errors reported (in parentheses) are heteroskedasticity consistent.

From these descriptions, it is clear that the first model school (No. 1) has much higher original and final scores than the others. At the same time, the difference in these scores between the second model school (Xinhua) and the first normal school (No. 2), which may be the best normal school, is not so marked. So in this section we focus on the value-added effect of the No. 1 School, for which the cutoff becomes 490. Another cutoff is also considered: the minimum entrance score of No. 8 School (449), where the treatment is studying in one of the six schools with minimum entrance scores higher than 449. This cutoff has no special policy implications and only helps us to explore the data fully and to understand the whole procedure better. If we find similar effects at these two cutoffs, it is reasonable to doubt the value-added effects at the cutoff of 474. In this section we only estimate the effects parametrically, because too few observations are available for nonparametric analysis and the standard errors would become extremely large. Table 11 presents parametric results from constant and cubic specifications with and without additional covariates. The complete results can be found in the appendix. For the 490 cutoff, we find different results for science students. With parametric estimation, the range becomes larger. Several large, significant negative effects are found, while most significant positive effects also increase markedly. For art students, except for several results with extremely large standard errors, negative effects are confirmed again. For the 449 cutoff, an important finding concerns the effects for art students, most of which become significantly positive. Looking at the results from the three cutoffs, we find that for science students the effects at the 490 cutoff may be more marked.
At the same time, there seems to be no gap between the effects at the 474 cutoff and those at the 449 cutoff. For art students, the positive effects are even more significant at the 449 cutoff, which has no noticeable policy implications. All of this evidence makes us doubt the value-added effects of model schools. Even if there are any, they are not as marked as we had expected.

5.3.3 Peer Effects

In this section we go a bit further by bringing in the relative level of a student to detect its impact on student achievement. Such effects are usually called peer effects and have been studied extensively in the literature on the economics of education. For example, Carman and Zhang (2008) estimate peer effects on the achievement of students from a middle school in China and find that students in the middle tend to benefit from better peers while those at the end do not. Ding and Steven (2007) study China’s high schools and find strong positive and nonlinear peer effects on student achievement. Under our RD settings we focus on a specific peer effect, namely the change in relative position when a student moves from middle school to high school, while most previous research estimates peer effects in an absolute way.

Table 12: Parametric and Local Linear Regression, Peer Effects

            Parametric Estimation     Local Linear Regression
            C-C         Cu-Cu         Rectangular   Triangular   Gaussian   Epanechnikov   N
Science     101.7***    38.75*        -47.59        -22.34       -15.67     -59.80**       870
            (6.045)     (23.10)       (30.57)       (32.32)      (29.79)    (26.72)
Art         115.6***    -62.84***     -53.53*       -26.54       -11.37     -60.04*        1785
            (4.280)     (18.30)       (27.74)       (43.72)      (42.47)    (30.97)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The specifications of these regression functions can be found in the appendix. 3. For parametric estimation, the standard errors reported are heteroskedasticity consistent. 4. The kernel-specific constants are approximately 5.4000, 3.4375, 1.0951 and 3.1999 respectively. 5. For local linear regression, the standard errors reported are obtained through the bootstrap method.

For target students whose original scores are just near the cutoff, their relative rankings or positions in the score distribution change markedly after they enter high school. In our study the cutoff is in the upper part of the score distribution, so the change is more substantial for those who enter model schools: their relative positions move from the upper part to the end. For students who choose normal schools, the change is more moderate, from the upper part to the top. So in our study, if peer effects do exist, the change in relative positions in the score distribution will affect the comparison of achievements, and ignoring it may lead to a biased estimate of the value-added effects of model schools. According to Carman and Zhang (2008), empirical studies of peer effects face three challenges: the proper definition of a peer group; self-selection and teacher effects; and the reflection problem, in which student and peer achievement are determined simultaneously. In our study none of them has a crucial impact on the empirical results.
We define the peer group at the grade level, as Ding and Steven (2007) do, because we do not need to address peer effects at the class level. Self-selection and the related omitted-variable bias are naturally not a problem in an RD analysis. Teacher effects, as well as other school-specific effects, are exactly what we want to estimate and can be separated from peer effects by including the change in relative positions. The reflection problem is not severe either, because we only consider final scores that come out after three years of study, and there is enough time for peer interactions. Here we define the relative position of student i in middle school as the gap between his or her original score and the average original score of all students. In contrast, the relative position of the same student in high school is defined as the gap between his or her original score and the average original score of students in the same high school. The change in relative position is the difference between these two gaps. The empirical results are shown in Table 12. The results that account for peer effects seem to further confirm the negative effects, especially for art students. Furthermore, our findings indicate negative peer effects through the negative coefficient on the change in gaps. Of course, in our study the value-added effects do not include peer effects created by the more competitive atmosphere in model schools. If such peer effects are also regarded as a kind of value-added effect, the identification strategy and empirical results should be modified accordingly. The findings in this section also reveal the heterogeneous nature of the value-added effects of model schools, which arises not only from student-specific characteristics but also from peers. Together with school (teacher) specific characteristics, these three aspects can almost completely determine student achievement.
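The change in relative position defined above is simple to compute. The sketch below is illustrative: the function name is hypothetical and the sign convention (high school gap minus middle school gap) is an assumption, since the text only speaks of "the difference between these two gaps".

```python
from statistics import mean

def position_changes(original_scores, schools):
    """Change in relative position for each student.

    Middle school gap: own original score minus the average over all students.
    High school gap:   own original score minus the average over students
                       in the same high school.
    Returned value:    high school gap minus middle school gap (assumed sign).
    """
    overall = mean(original_scores)
    by_school = {}
    for score, school in zip(original_scores, schools):
        by_school.setdefault(school, []).append(score)
    school_mean = {school: mean(v) for school, v in by_school.items()}
    return [(score - school_mean[school]) - (score - overall)
            for score, school in zip(original_scores, schools)]
```

Under this definition the student's own score cancels, so the change equals the overall average minus the school average and is identical for all students in the same high school.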

Table 13: OLS, Subsample without Observed Always-takers and Never-takers

            Treatment Only   Treatment and OS   All Covariates
Science     120.2***         37.84***           43.76***
            (4.695)          (6.153)            (8.246)
            [1107]           [1107]             [701]
Art         119.3***         25.39***           9.150
            (3.430)          (6.496)            (9.274)
            [1949]           [1949]             [1470]

Notes: 1. *** implies significance at the 1% level. 2. The standard errors reported in () are heteroskedasticity consistent and the numbers of observations are presented in [].

6 Policy Extensions

6.1 Compliers and Noncompliers

We have now estimated the value-added effects of model schools on the achievements of art and science students respectively through the fuzzy RD design. The positive effects for science students are more evident, while the negative effects for art students are disappointing. However, it is too early to be pessimistic because, as mentioned in the previous section, what we have estimated are LATE estimates, which apply only to a subgroup of the whole population. In our study, whichever method is adopted, parametric or nonparametric, the estimated effects are average treatment effects only for students who are compliers56 and whose original scores are exactly $S_i^{o} = S^{o*} = 474$. This is a very small fraction of the whole population, and the external validity of our study is heavily restricted by the nature of the RD design. Nevertheless, according to Imbens and Wooldridge (2008), the fuzzy feature makes it possible to extend our study to a larger population through standard analysis based on unconfoundedness. The whole procedure can be described as follows. The simple model $S_i^{f} = \alpha + \gamma S_i^{o} + \beta K_i + \epsilon_i$ suffers from endogeneity because of potential omitted variables in the error term and their relations with the treatment. In our study, this means that, besides the original score and the cutoff, students decide whether to attend model schools according to their own considerations, based on many other unobservable factors. Two kinds of students show such behavior most explicitly: observed always-takers, who attend model schools with original scores lower than the cutoff, and observed never-takers, who attend normal schools with original scores higher than the cutoff. Their descriptive statistics can be found in the appendix. Eliminating them from the sample helps to reduce the endogeneity problem57.
Then, ignoring the discontinuity in the probability of treatment and assuming that unconfoundedness holds, which means that the treatment and the error term are independent, we can obtain the average treatment effect (ATE). The results of the simple model estimated by OLS on the reduced subsample are presented in Table 13. We are now in the same situation as Fan et al. (2010): without observed always-takers and never-takers, the estimated effects are larger than those with the full sample. Because the weight of compliers is larger in the reduced sample, in the full sample the effects for compliers, which were estimated in the previous sections, are larger than the ATEs over all groups for both science and art students. So for science students, model schools have value-added effects for those at the margin, but the effects for the whole population are smaller. For art students, it is difficult to find substantial value-added effects of model schools around the cutoff, and, worse, such effects decrease further for the full sample.

56There are four groups: never-takers, who will not attend model schools in spite of their eligibility; compliers, who always comply with their eligibility; defiers, who always defy their eligibility; and always-takers, who attend model schools regardless of their eligibility. In our study assumption A5 already excludes defiers.
57Other sources of endogeneity include unobserved always-takers and never-takers, but we are not able to distinguish them from compliers.
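The subsample used for Table 13 can be sketched as a simple filter. The sketch is illustrative, with hypothetical function names and a hypothetical row layout of (original score, model school indicator, final score). For the "Treatment Only" specification, OLS on a single binary regressor equals the difference in mean final scores, which is what the second function computes.

```python
from statistics import mean

def drop_observed_noncompliers(rows, cutoff):
    """Keep students whose attendance matches their eligibility.

    Dropped: observed always-takers (score below the cutoff, attends a model
    school) and observed never-takers (score at or above the cutoff, attends
    a normal school)."""
    return [r for r in rows if (r[0] >= cutoff) == bool(r[1])]

def treatment_only_effect(rows):
    """OLS of the final score on the treatment dummy alone: difference in means."""
    treated = [r[2] for r in rows if r[1]]
    control = [r[2] for r in rows if not r[1]]
    return mean(treated) - mean(control)
```

In the reduced subsample the treatment is a deterministic function of the original score, which is why the jump in the treatment probability at the cutoff becomes exactly one.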

6.2 Partially Fuzzy Design

So far we have estimated the value-added effects of model schools for compliers and compared them with ATEs. Two problems remain. One is that, under our fuzzy settings, the assumptions required to identify the effects for compliers are rather strong and some, like A5, are difficult to test. The other is that we still do not know exactly any effects other than the LATE. In this section we deal with these two problems through a partially fuzzy design introduced by Battistin and Rettore (2008) and focus on the effects for participants, which also have strong policy implications. In the last section we removed the observed always-takers and never-takers from the full sample; the jump in the probability of treatment then becomes exactly one, and with assumptions A1-A3 the average effect at the cutoff can be identified. In this section the observed never-takers are added back into the subsample, while the observed always-takers are still dropped because there are very few of them and we are not much concerned with the effects on them. In our new subsample there are three groups of students: ineligibles, whose original scores are below the cutoff; eligible nonparticipants, who have original scores above the cutoff but attend normal schools; and participants, who attend model schools. According to Battistin and Rettore (2008), we can identify the effects for participants around the cutoff and test the validity of the effects for participants away from the cutoff under the simple assumptions of the sharp RD design. The average effect for students attending model schools can be identified as:

\[
\lim_{s \to 0^+} E(\beta_i^{pf} \mid K_i = 1) = \frac{\lim_{S^o \to S^{o*+}} E(S_i^f \mid S_i^o = S^o) - \lim_{S^o \to S^{o*-}} E(S_i^f \mid S_i^o = S^o)}{\lim_{S^o \to S^{o*+}} E(K_i \mid S_i^o = S^o)}
\]

This effect holds only for participants around the cutoff and cannot be extended to all participants directly because of selection bias. The availability of eligible nonparticipants, however, makes it possible to identify this selection bias. Formally, the average effect for participants away from the cutoff is $E(\beta_i \mid K_i = 1) = E(S_i^{f1} \mid K_i = 1) - E(S_i^{f0} \mid K_i = 1)$ for all $S_i^o \ge S^{o*}$, and the selection bias can be expressed as $sb(S^o) = E(S_i^{f0} \mid K_i = 1) - E(S_i^{f0} \mid K_i = 0)$ for all $S_i^o \ge S^{o*}$. At the cutoff the first term can be obtained from

\[
\lim_{S^o \to S^{o*+}} E(S_i^{f0} \mid K_i = 1) = \frac{1}{\phi} \lim_{S^o \to S^{o*-}} E(S_i^{f0}) - \frac{1 - \phi}{\phi} \lim_{S^o \to S^{o*+}} E(S_i^{f0} \mid K_i = 0),
\]

where $\phi = \lim_{S^o \to S^{o*+}} E(K_i)$ is the probability of attending a model school at the cutoff. Finally, we can write the selection bias at the cutoff as follows:

\[
\lim_{S^o \to S^{o*+}} sb(S^o) = \frac{1}{\phi} \left[ \lim_{S^o \to S^{o*-}} E(S_i^f) - \lim_{S^o \to S^{o*+}} E(S_i^f \mid K_i = 0) \right]
\]

Estimation results are shown in Table 14, including the effects for participants as well as the selection bias at the cutoff. Without additional covariates, the estimates from parametric and nonparametric methods do not change much, for both science and art students. When additional covariates are controlled for, the parametric estimates remain much the same, while the nonparametric estimates change substantially: although none of them is significant, more positive effects and smaller negative effects are found. We conclude that, around the cutoff, the value-added effects of model schools for compliers are almost the same as those for participants. Most of the selection bias estimates at the cutoff are positive, though their magnitude varies considerably with the bandwidth. So, at least in our case, selection bias prevents us from extending the estimates to participants far away from the cutoff.
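The two identification formulas of this section can be illustrated with a minimal sketch. It replaces the one-sided limits with simple local means inside a window around the cutoff, which is a simplification and not the estimation method used for Table 14. The function name, the window half-width `h` and the toy data are all hypothetical, and the sketch assumes, as in the text, that nobody below the cutoff attends a model school.

```python
from statistics import mean


def partially_fuzzy_at_cutoff(final, original, attended, cutoff, h):
    """Participant effect and selection bias at the cutoff from local means.

    final    : final scores (S^f)
    original : original scores (S^o)
    attended : 1 if the student attends a model school (K), else 0
    h        : hypothetical half-width of the window around the cutoff
    """
    rows = list(zip(final, original, attended))
    below = [f for f, o, _ in rows if cutoff - h <= o < cutoff]
    above = [(f, k) for f, o, k in rows if cutoff <= o <= cutoff + h]

    mean_below = mean(below)                            # E(S^f) just below the cutoff
    mean_above = mean(f for f, _ in above)              # E(S^f) just above the cutoff
    phi = mean(k for _, k in above)                     # participation rate above the cutoff
    mean_nonpart = mean(f for f, k in above if k == 0)  # eligible nonparticipants

    effect = (mean_above - mean_below) / phi            # effect for participants
    selection_bias = (mean_below - mean_nonpart) / phi  # sb at the cutoff
    return effect, selection_bias, phi


# Made-up toy data: cutoff 480, window half-width 5.
final = [400, 410, 460, 470, 450, 390]
original = [478, 479, 480, 481, 482, 483]
attended = [0, 0, 1, 1, 1, 0]
effect, selection_bias, phi = partially_fuzzy_at_cutoff(final, original, attended, 480, 5)
print(effect, selection_bias, phi)
```

In the toy data the implied untreated mean for participants is the nonparticipant mean plus the selection bias, so the participant effect equals the participants' mean final score minus that implied counterfactual.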

Table 14: Partially Fuzzy Design, Effects and Selection Bias

Effects for participants

                  Parametric Estimation     Local Linear Regression
                  C-C         Cu-Cu         Rectangular  Triangular  Gaussian   Epanechnikov   N
Without Additional Covariates
  Science         186.3***    30.25**       -38.05*      -50.65      -42.43*    -33.49         1256
                  (9.468)     (12.96)       (21.33)      (47.10)     (24.74)    (28.10)
  Art             154.4***    3.500         -22.84       -33.93      -15.14     -8.737         2239
                  (4.972)     (9.186)       (20.00)      (41.52)     (41.64)    (23.04)
With Additional Covariates
  Science         105.8***    34.28**       -10.49       -0.435      6.960      -5.752         833
                  (6.590)     (14.10)       (15.91)      (23.32)     (23.21)    (16.46)
  Art             107.6***    -24.12*       -2.092       -7.041      1.507      5.833          1748
                  (4.537)     (12.57)       (13.09)      (26.38)     (27.79)    (15.02)

Optimal bandwidth and selection bias (local linear regression)

                                 Rectangular  Triangular  Gaussian   Epanechnikov
Without Additional Covariates
  Science   Optimal Bandwidth    23.06        14.68       4.676      13.66
            Selection Bias       11.91        20.85       19.41      20.93
  Art       Optimal Bandwidth    21.12        13.44       4.283      12.52
            Selection Bias       5.072        7.311       -54.68     6.105
With Additional Covariates
  Science   Optimal Bandwidth    19.96        12.71       4.048      11.83
            Selection Bias       41.99        16.94       7.411      27.63
  Art       Optimal Bandwidth    20.29        12.92       4.115      12.02
            Selection Bias       27.38        33.31       -37.25     33.31

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The specifications of these regression functions can be found in the appendix. 3. For parametric estimation, the standard errors reported in () are heteroskedasticity consistent. 4. The kernel-specific constants are approximately 5.4000, 3.4375, 1.0951 and 3.1999 respectively. 5. For local linear regression, the standard errors reported are obtained through the bootstrap method.

7 Conclusion

We use a regression discontinuity design to analyze the value-added effects of model schools in China. The data come from students in 11 high schools in Daxin District in rural Beijing. Introduced by Thistlethwaite and Campbell (1960), the RD design can identify treatment effects under relatively weak assumptions but requires definite treatment rules. Given the incompleteness of our dataset and the explicit rules of admission into high schools, the RD design is well suited to our study. In most specifications we estimate effects by both parametric and nonparametric methods.

First we deal with simple cases with only the outcome variable (final score), the treatment (model school) and the treatment-determining variable (original score). For science students, we find significant positive effects ranging approximately from 20 to 80 in parametric estimation. Local linear regression with a large bandwidth also yields significant effects from 30 to 50. However, local linear regression with the smaller optimal bandwidth provides more evidence of negative effects. For art students, positive effects become ambiguous even in parametric estimation: both positive effects from 20 to 60 and negative effects around -20 are supported by certain specifications. In local linear regression, whatever bandwidth is chosen, few estimated effects are significant. Given the relatively large standard errors in local linear regression with the smaller optimal bandwidth, the results from parametric estimation seem more reliable.

Three robustness checks are also considered. With additional covariates, the positive effects for both science and art students show a decreasing trend, while the negative effects for art students are further confirmed. Effects at two spurious cutoffs are also analyzed for comparison. For both science and art students, positive effects at the real cutoff are not dominant, which casts more doubt on the value-added effects of model schools. We go further by checking peer effects, which are believed to play an important role in student achievement. With peer effects controlled for, the effects for science students do not change much, while those for art students are further confirmed to be negative. It seems that it is the more competitive atmosphere created by model schools, rather than the schools themselves, that has positive effects on art students' achievement.

So far it seems safe to conclude as follows. For science students, model schools have positive value-added effects on achievement. For art students, it is still difficult to say whether such effects are positive or negative, but it is clear that the more competitive atmosphere helps their achievement. If this atmosphere is regarded as a by-product of model schools, positive value-added effects can be concluded at least in this respect.

However, understanding what we have estimated is crucial for policy makers. In the main parts of the paper, the effects obtained are for compliers around the cutoff, who follow the rules exactly: if their original scores are higher than the cutoff they attend model schools, otherwise they go to normal schools. Comparing these estimates with standard results based on unconfoundedness, we find that the effects for the full population are even smaller. We also check the effects for eligible participants and find that they are almost the same as those for compliers.

Finally, further investigation is still called for. As discussed above, what we have estimated are average effects for compliers, who change their school choice when their original scores cross the cutoff. We are not able to obtain effects for all students. Given the cutoff, the school choices of these compliers depend only on original scores, and there must be some unobserved characteristics that lead to such behavior. Further research in this area will help us to understand the school choice mechanism and to make policies better targeted and more effective. Analysis of selection bias around the cutoff is also helpful for testing the fundamental assumptions of the framework.

References

[1] Angrist J. D., G. W. Imbens, D. B. Rubin (1996). Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91, 444-455.

[2] Angrist J. D., V. Lavy (1999). Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement. Quarterly Journal of Economics, 114(2), 533-575.

[3] Barnow B. S., G. G. Cain, A. S. Goldberger (1980). “Issues in the Analysis of Selectivity Bias” in Stromsdorfer E. and G. Farkas (eds.). Evaluation Studies Review Annual, 5, Sage: Beverly Hills, 43-59.

[4] Battistin E., E. Rettore (2008). Ineligibles and Eligible Non-Participants as a Double Comparison Group in Regression-Discontinuity Designs. Journal of Econometrics, 142, 715-730.

[5] Black D., J. Galdo, J. Smith (2007). Evaluating the Worker Profiling and Reemployment Services System Using a Regression Discontinuity Design. American Economic Review Papers and Proceedings, 97(2), 104-107.

[6] Campbell D. T. (1969). Reforms as Experiments. American Psychologist, 24, 409-429.

[7] Canton E., A. Blom (2004). Can Student Loans Improve Accessibility to Higher Education and Student Performance? An Impact Study of the Case of SOFES, Mexico. The World Bank Working Paper 3425.

[8] Card D., L. D. Shore-Sheppard (2004). Using Discontinuous Eligibility Rules to Identify the Effects of the Federal Medicaid Expansions on Low-Income Children. Review of Economics and Statistics, 86(3), 752-766.

[9] Carman K., L. Zhang (2008). Classroom Peer Effects and Academic Achievement: Evidence from a Chinese Middle School. SCID Working Paper 336.

[10] Chay K. Y., P. J. McEwan, M. Urquiola (2005). The Central Role of Noise in Evaluating Interventions That Use Test Scores to Rank Schools. American Economic Review, 95(4), 1237-1258.

[11] Cook T. D. (2008). “Waiting for Life to Arrive”: A History of the Regression-Discontinuity Design in Psychology, Statistics and Economics. Journal of Econometrics, 142, 636-654.

[12] Cook T. D., D. T. Campbell (1979). Quasi-Experimentation: Design and Analysis for Field Settings. Chicago: Rand McNally.

[13] DesJardins S. L., B. P. McCall (2008). The Impact of the Gates Millennium Scholars Program on the Retention, College Finance- and Work-Related Choices, and Future Educational Aspirations of Low-Income Minority Students. University of Michigan, Working Paper.

[14] Ding W., S. F. Lehrer (2007). Do Peers Affect Student Achievement in China’s Secondary Schools? The Review of Economics and Statistics, 89(2), 300-312.

[15] Fan J. (1992). Design-adaptive Nonparametric Regression. Journal of the American Statistical Association, 87(420), 998-1004.

[16] Fan E., X. Meng, Z. Wei, G. Zhao (2010). Rates of Return to University Education: The Regression Discontinuity Design. IZA DP No. 4749.

[17] Frölich M. (2007). Regression Discontinuity Design with Covariates. Department of Economics, University of St. Gallen, Discussion Paper No. 2007-32.

[18] Fuji D., G. Imbens, K. Kalyanaraman (2009). Notes for Matlab and Stata Regression Discontinuity Software. Department of Economics, Harvard University, mimeo.

[19] Guryan J. (2001). Does Money Matter? Regression-Discontinuity Estimates from Education Finance Reform in Massachusetts. NBER Technical Report.

[20] Hahn J., P. Todd, W. van der Klaauw (2001). Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design. Econometrica, 69(1), 201-209.

[21] Heckman J. J., H. Ichimura, P. Todd (1998). Matching as an Econometric Evaluation Estimator. Review of Economic Studies, 65, 261-294.

[22] Holland P. W. (1986). Statistics and Causal Inference. Journal of the American Statistical Association, 81(396), 945-960.

[23] Imbens G. W., J. D. Angrist (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica, 62(2), 467-475.

[24] Imbens G. W., K. Kalyanaraman (2009). Optimal Bandwidth Choice for the Regression Discontinuity Estimator. NBER Working Paper No. 14726.

[25] Imbens G. W., T. Lemieux (2008). Regression Discontinuity Designs: A Guide to Practice. Journal of Econometrics, 142, 615-635.

[26] Jacob B. A., L. Lefgren (2004a). Remedial Education and Student Achievement: A Regression-Discontinuity Analysis. Review of Economics and Statistics, 86(1), 226-244.

[27] Jacob B. A., L. Lefgren (2004b). The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago. Journal of Human Resources, 39(1), 50-79.

[28] Lavy V. (2004). Performance Pay and Teachers’ Effort, Productivity and Grading Ethics. NBER Working Paper No. 10622.

[29] Lee D. S. (2008). Randomized Experiments from Non-Random Selection in U.S. House Elections. Journal of Econometrics, 142, 675-697.

[30] Lee D. S., D. Card (2008). Regression Discontinuity Inference with Specification Error. Journal of Econometrics, 142, 655-674.

[31] Lee D. S., T. Lemieux (2009). Regression Discontinuity Designs in Economics. NBER Working Paper No. 14723.

[32] Lee D. S., E. Moretti, M. Butler (2004). Do Voters Affect or Elect Policies? Evidence from the U.S. House. Quarterly Journal of Economics, 119(3), 807-859.

[33] Leuven E., M. Lindahl, H. Oosterbeek, D. Webbink (2007). The Effect of Extra Funding for Disadvantaged Pupils on Achievement. Review of Economics and Statistics, 89(4), 721-736.

[34] Ludwig J., D. Miller (2005). Does Head Start Improve Children’s Life Chances? Evidence from a Regression Discontinuity Design. NBER Working Paper No. 11702.

[35] Martorell F. (2004). Do High School Graduation Exams Matter? A Regression Discontinuity Approach. University of California, Berkeley, Job Market Paper.

[36] Matsudaira J. D. (2008). Mandatory Summer School and Student Achievement. Journal of Econometrics, 142, 829-850.

[37] McCrary J. (2008). Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test. Journal of Econometrics, 142, 698-714.

[38] Merton R. K. (1968). The Matthew Effect in Science. Science, 159(3810), 56-63.

[39] Mickelson R. A. (1987). Education and the Struggle against Race, Class and Gender Inequality. Humanity and Society, 11(4), 440-464.

[40] Mood A. M. (1950). Introduction to the Theory of Statistics. New York: McGraw-Hill.

[41] Porter J. (2003). Estimation in the Regression Discontinuity Model. Department of Economics, Harvard University, mimeo.

[42] Rosenbaum P., D. B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1), 41-55.

[43] Roderick M., M. Engel, J. Nagaoka (2003). Ending Social Promotion: Results from Summer Bridge. Chicago: Consortium on Chicago School Research.

[44] Rubin D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Non-randomized Studies. Journal of Educational Psychology, 66(5), 688-701.

[45] Rubin D. B. (1986). Which Ifs Have Causal Answers? Discussion of Holland’s “Statistics and Causal Inference”. Journal of the American Statistical Association, 81(396), 961-962.

[46] Rubin D. B., E. A. Stuart, E. L. Zanutto (2004). A Potential Outcomes View of Value-Added Assessment in Education. Journal of Educational and Behavioral Statistics, 29(1), 103-116.

[47] Stevens P., M. Weale (2003). Education and Economic Growth. National Institute of Economic and Social Research, mimeo.

[48] Sun Y. (2005). Adaptive Estimation of the Regression Discontinuity Model. Department of Economics, University of California, San Diego, mimeo.

[49] Thistlethwaite D. L., D. T. Campbell (1960). Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment. Journal of Educational Psychology, 51(6), 309-317.

[50] Urquiola M. (2006). Identifying Class Effects in Developing Countries: Evidence from Rural Bolivia. Review of Economics and Statistics, 88(1), 171-177.

[51] Urquiola M., E. Verhoogen (2007). Class Size and Sorting in Market Equilibrium: Theory and Evidence. NBER Working Paper No. 13303.

[52] van der Klaauw W. (1997). A Regression-Discontinuity Evaluation of the Effect of Financial Aid Offers on College Enrollment. C.V. Starr Center for Applied Economics, New York University, Working Paper 97-10.

[53] van der Klaauw W. (2008a). Breaking the Link between Poverty and Low Student Achievement: An Evaluation of Title I. Journal of Econometrics, 142, 731-756.

[54] van der Klaauw W. (2008b). Regression-Discontinuity Analysis: A Survey of Recent Developments in Economics. Labour, 22(2), 219-245.

[55] Wald A. (1940). The Fitting of Straight Lines if Both Variables are Subject to Error. The Annals of Mathematical Statistics, 11(3), 284-300.

[56] Yang M. (2009). Regression Discontinuity Design: Identification and Estimation of Treatment Effects with Multiple Selection Biases. Department of Economics, Lehigh University, mimeo.

A Identification of Constant Treatment Effects

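In the RD design, the average final scores of students whose original scores lie just above and just below the cutoff are $E[S_i^f \mid S_i^o = S^{o*} + s]$ and $E[S_i^f \mid S_i^o = S^{o*} - s]$ respectively. The difference between them can be written as follows:

\[
\begin{aligned}
& E[S_i^f \mid S_i^o = S^{o*} + s] - E[S_i^f \mid S_i^o = S^{o*} - s] \\
&= E[\alpha + \beta K_i + \varepsilon_i \mid S_i^o = S^{o*} + s] - E[\alpha + \beta K_i + \varepsilon_i \mid S_i^o = S^{o*} - s] \\
&= \beta \{ E[K_i \mid S_i^o = S^{o*} + s] - E[K_i \mid S_i^o = S^{o*} - s] \} + \{ E[\varepsilon_i \mid S_i^o = S^{o*} + s] - E[\varepsilon_i \mid S_i^o = S^{o*} - s] \}
\end{aligned}
\]

According to A1, we have

\[
\lim_{s \to 0^+} E[\varepsilon_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[\varepsilon_i \mid S_i^o = S^{o*} - s] = 0,
\]

so

\[
\lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s] = \beta \left\{ \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] \right\}.
\]

According to the settings of the sharp RD design, we have

\[
\lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] = 1 \quad \text{while} \quad \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] = 0.
\]

So $\lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] = 1$, and then

\[
\beta = \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s].
\]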

For the fuzzy RD design, the jump in the probability of treatment at the cutoff is less than one, so we have

\[
\lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] < 1 \quad \text{while} \quad \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] > 0
\]

and

\[
\lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] < 1.
\]

As in the derivation for the sharp RD design, we have already obtained

\[
\lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s] = \beta \left\{ \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] \right\},
\]

so

\[
\beta = \frac{\lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s]}{\lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s]}.
\]

B Identification of Heterogeneous Treatment Effects

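Following the derivation in Appendix A, but with a student-specific effect $\beta_i$, we have

\[
\begin{aligned}
& E[S_i^f \mid S_i^o = S^{o*} + s] - E[S_i^f \mid S_i^o = S^{o*} - s] \\
&= E[\alpha + \beta_i K_i + \varepsilon_i \mid S_i^o = S^{o*} + s] - E[\alpha + \beta_i K_i + \varepsilon_i \mid S_i^o = S^{o*} - s] \\
&= \{ E[\beta_i K_i \mid S_i^o = S^{o*} + s] - E[\beta_i K_i \mid S_i^o = S^{o*} - s] \} + \{ E[\varepsilon_i \mid S_i^o = S^{o*} + s] - E[\varepsilon_i \mid S_i^o = S^{o*} - s] \}
\end{aligned}
\]

Because of A1, we have

\[
\lim_{s \to 0^+} E[\varepsilon_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[\varepsilon_i \mid S_i^o = S^{o*} - s] = 0.
\]

Because of A3, we have

\[
E[\beta_i K_i \mid S_i^o = S^{o*} \pm s] = E[\beta_i \mid S_i^o = S^{o*} \pm s] \, E[K_i \mid S_i^o = S^{o*} \pm s],
\]

so

\[
\begin{aligned}
& \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s] \\
&= \lim_{s \to 0^+} E[\beta_i K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[\beta_i K_i \mid S_i^o = S^{o*} - s] \\
&= \lim_{s \to 0^+} \{ E[\beta_i \mid S_i^o = S^{o*} + s] \, E[K_i \mid S_i^o = S^{o*} + s] \} - \lim_{s \to 0^+} \{ E[\beta_i \mid S_i^o = S^{o*} - s] \, E[K_i \mid S_i^o = S^{o*} - s] \} \\
&= E[\beta_i \mid S_i^o = S^{o*}] \left\{ \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s] \right\}
\end{aligned}
\]

Since the jump in the probability of treatment equals one in the sharp design, we then have

\[
E[\beta_i \mid S_i^o = S^{o*}] = \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s].
\]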

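For the fuzzy RD design, according to the derivation above, we also have

\[
\begin{aligned}
& E[S_i^f \mid S_i^o = S^{o*} + s] - E[S_i^f \mid S_i^o = S^{o*} - s] \\
&= E[\alpha + \beta_i K_i + \varepsilon_i \mid S_i^o = S^{o*} + s] - E[\alpha + \beta_i K_i + \varepsilon_i \mid S_i^o = S^{o*} - s] \\
&= \{ E[\beta_i K_i \mid S_i^o = S^{o*} + s] - E[\beta_i K_i \mid S_i^o = S^{o*} - s] \} + \{ E[\varepsilon_i \mid S_i^o = S^{o*} + s] - E[\varepsilon_i \mid S_i^o = S^{o*} - s] \}
\end{aligned}
\]

Again, because of A1, we have

\[
\lim_{s \to 0^+} E[\varepsilon_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[\varepsilon_i \mid S_i^o = S^{o*} - s] = 0.
\]

Then, for $E[\beta_i K_i \mid S_i^o = S^{o*} + s] - E[\beta_i K_i \mid S_i^o = S^{o*} - s]$, because of A4 we can obtain

\[
\begin{aligned}
& E[\beta_i K_i \mid S_i^o = S^{o*} + s] - E[\beta_i K_i \mid S_i^o = S^{o*} - s] \\
&= E[\beta_i K_i(S^{o*} + s)] - E[\beta_i K_i(S^{o*} - s)] \\
&= E\{ \beta_i [K_i(S^{o*} + s) - K_i(S^{o*} - s)] \} \\
&= \Pr[K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1] \, E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1 \} \\
&\quad - \Pr[K_i(S^{o*} + s) - K_i(S^{o*} - s) = -1] \, E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = -1 \}
\end{aligned}
\]

Because of A5, we have $\Pr[K_i(S^{o*} + s) - K_i(S^{o*} - s) = -1] = 0$, so

\[
\begin{aligned}
& E[\beta_i K_i \mid S_i^o = S^{o*} + s] - E[\beta_i K_i \mid S_i^o = S^{o*} - s] \\
&= \Pr[K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1] \, E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1 \} \\
&= \Pr[K_i(S^{o*} + s) = 1, K_i(S^{o*} - s) = 0] \, E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1 \} \\
&= \{ \Pr[K_i(S^{o*} + s) = 1] - \Pr[K_i(S^{o*} - s) = 1] \} \, E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1 \} \\
&= [E(K_i \mid S_i^o = S^{o*} + s) - E(K_i \mid S_i^o = S^{o*} - s)] \, E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1 \}
\end{aligned}
\]

Then we have

\[
\begin{aligned}
& \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s] \\
&= \lim_{s \to 0^+} [E(K_i \mid S_i^o = S^{o*} + s) - E(K_i \mid S_i^o = S^{o*} - s)] \, \lim_{s \to 0^+} E\{ \beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1 \}
\end{aligned}
\]

so finally we can obtain

\[
\lim_{s \to 0^+} E[\beta_i \mid K_i(S^{o*} + s) - K_i(S^{o*} - s) = 1] = \frac{\lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[S_i^f \mid S_i^o = S^{o*} - s]}{\lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} + s] - \lim_{s \to 0^+} E[K_i \mid S_i^o = S^{o*} - s]}.
\]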
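The result of this appendix, that the ratio of the two jumps recovers the average effect for compliers rather than for all students, can be checked on a small made-up population. The sketch below assumes the same mix of student types on both sides of the cutoff and no defiers (as under A5); the type counts, the effects and the common baseline score are all hypothetical.

```python
from statistics import mean

# Hypothetical student types near the cutoff:
# (label, count, K just below, K just above, individual effect beta_i)
TYPES = [
    ("complier", 5, 0, 1, 60.0),
    ("always-taker", 3, 1, 1, 10.0),
    ("never-taker", 2, 0, 0, 100.0),
]
BASE = 400.0  # common final score without treatment (assumption)


def side(types, above):
    """Return (final scores, treatment indicators) for one side of the cutoff."""
    scores, treated = [], []
    for _, count, k_below, k_above, beta in types:
        k = k_above if above else k_below
        scores += [BASE + beta * k] * count
        treated += [k] * count
    return scores, treated


s_above, k_above = side(TYPES, above=True)
s_below, k_below = side(TYPES, above=False)

# Ratio of the jump in final scores to the jump in treatment probability.
wald = (mean(s_above) - mean(s_below)) / (mean(k_above) - mean(k_below))

# Average effect over all types, for comparison.
average_effect = sum(c * b for _, c, _, _, b in TYPES) / sum(c for _, c, _, _, _ in TYPES)
print(wald, average_effect)
```

In this toy population the ratio equals the complier effect of 60 (up to floating-point rounding), while the average effect over all types is 53, which illustrates why the fuzzy RD estimate should not be read as an effect for the full population.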

C Covariates by Treatment-Determining Variable

[Figure panels: Age, Male and Normal plotted against OriginalScore]

Figure 2: Graphic Analysis 2: science students for the left column

[Figure panels: UnemM, FarmM, CollegeM, UnemF, FarmF and CollegeF plotted against OriginalScore]

Figure 3: Graphic Analysis 3: science students for the left column

Figure 4: Graphic Analysis 4: science students for the left column

D Full Results of Parametric Estimation

Table 15: Parametric Estimation, Full Results

Science Student

            OLS         OLS         OLS         RD          RD          RD
                        with OS     with Cov.               with OS     with Cov.
Model       95.40***    19.85***    19.98***    195.9***    68.37***    81.07***
            (5.176)     (5.524)     (7.081)     (10.57)     (11.49)     (15.87)
OS                      1.647***    1.432***                1.372***    0.879***
                        (0.075)     (0.126)                 (0.100)     (0.187)
Male                                17.08***                            20.74***
                                    (4.832)                             (4.897)
Age                                 -10.80***                           -9.824***
                                    (2.988)                             (2.975)
Normal                              -22.48*                             -16.00
                                    (11.83)                             (11.77)
CollegeF                            12.96**                             11.70*
                                    (6.437)                             (6.770)
CollegeM                            -9.021                              -8.070
                                    (6.799)                             (6.983)
UnemF                               19.93**                             29.03***
                                    (8.289)                             (8.418)
FarmerF                             -8.579                              -4.626
                                    (7.033)                             (7.290)
UnemM                               -4.500                              15.11
                                    (8.058)                             (9.263)
FarmerM                             7.559                               0.678
                                    (7.093)                             (7.507)
Constant    387.5***    -342.1***   -28.05      365.9***    -228.0***   180.1*
            (2.707)     (33.62)     (80.02)     (3.130)     (43.38)     (95.45)
R2          0.180       0.502       0.422       0.316       0.511       0.436
N           1293        1293        870         1293        1293        870

Art Student

            OLS         OLS         OLS         RD          RD          RD
                        with OS     with Cov.               with OS     with Cov.
Model       101.8***    20.16***    6.846       156.1***    27.61**     -11.96
            (3.351)     (4.582)     (5.858)     (5.201)     (11.23)     (17.30)
OS                      1.956***    2.159***                1.883***    2.389***
                        (0.092)     (0.123)                 (0.153)     (0.254)
Male                                0.448                               0.184
                                    (3.023)                             (3.063)
Age                                 -13.75***                           -13.86***
                                    (2.274)                             (2.282)
Normal                              -66.02***                           -69.62***
                                    (13.62)                             (13.70)
CollegeF                            3.523                               3.155
                                    (5.110)                             (5.115)
CollegeM                            -0.844                              -1.939
                                    (4.899)                             (4.946)
UnemF                               -3.170**                            -6.847
                                    (7.068)                             (7.870)
FarmerF                             3.656                               3.507
                                    (4.202)                             (4.198)
UnemM                               3.704                               -2.915
                                    (7.042)                             (9.971)
FarmerM                             -1.172                              -1.943
                                    (4.388)                             (4.396)
Constant    372.6***    -514.6***   -284.6      354.0***    -483.0***   -377.7***
            (2.211)     (42.14)     (73.32)     (2.499)     (68.28)     (116.6)
R2          0.264       0.510       0.472       0.326       0.506       0.472
N           2277        2277        1785        2277        2277        1785

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported are heteroskedasticity consistent.

E Specifications of Regression Functions and Kernel Functions

Constant: $f(S_i^o) = \rho_0$

Linear Function: $f(S_i^o) = \sum_{j=0}^{1} \rho_j (S_i^o)^j$

Quadratic Function: $f(S_i^o) = \sum_{j=0}^{2} \rho_j (S_i^o)^j$

Cubic Function: $f(S_i^o) = \sum_{j=0}^{3} \rho_j (S_i^o)^j$

Piecewise Linear Function: $f(S_i^o) = \sum_{j=0}^{1} \rho_j (S_i^o)^j + \phi_1 (S_i^o - S^{o*}) \, 1\{S_i^o \ge S^{o*}\}$

Piecewise Quadratic Function: $f(S_i^o) = \sum_{j=0}^{2} \rho_j (S_i^o)^j + \sum_{k=1}^{2} \phi_k (S_i^o - S^{o*})^k \, 1\{S_i^o \ge S^{o*}\}$

Piecewise Cubic Function: $f(S_i^o) = \sum_{j=0}^{3} \rho_j (S_i^o)^j + \sum_{k=1}^{3} \phi_k (S_i^o - S^{o*})^k \, 1\{S_i^o \ge S^{o*}\}$

Rectangular Kernel: $\kappa(\phi) = 0.5$ if $\phi \in [-1, 1]$ and $\kappa(\phi) = 0$ otherwise.

Triangular Kernel: $\kappa(\phi) = 1 - |\phi|$ if $\phi \in [-1, 1]$ and $\kappa(\phi) = 0$ otherwise.

Gaussian Kernel: $\kappa(\phi) = e^{-\phi^2/2} / \sqrt{2\pi}$.

Epanechnikov Kernel: $\kappa(\phi) = 3(1 - \phi^2)/4$ if $\phi \in [-1, 1]$ and $\kappa(\phi) = 0$ otherwise.
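The four kernel functions listed above translate directly into code. The following is a minimal sketch; the helper names `kernel_weight` and `integral`, and the convention that the kernel argument is the distance to the cutoff divided by the bandwidth, are assumptions made for illustration.

```python
import math


def rectangular(u):
    return 0.5 if -1.0 <= u <= 1.0 else 0.0


def triangular(u):
    return 1.0 - abs(u) if -1.0 <= u <= 1.0 else 0.0


def gaussian(u):
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)


def epanechnikov(u):
    return 3.0 * (1.0 - u * u) / 4.0 if -1.0 <= u <= 1.0 else 0.0


def kernel_weight(kernel, original_score, cutoff, bandwidth):
    """Weight of one observation: kernel at the scaled distance to the cutoff."""
    return kernel((original_score - cutoff) / bandwidth)


def integral(kernel, lo=-8.0, hi=8.0, n=16000):
    """Midpoint-rule check that a kernel integrates to about one."""
    step = (hi - lo) / n
    return sum(kernel(lo + (j + 0.5) * step) for j in range(n)) * step


for k in (rectangular, triangular, gaussian, epanechnikov):
    print(k.__name__, kernel_weight(k, 485, 480, 10), round(integral(k), 4))
```

The rectangular kernel gives every observation inside the bandwidth the same weight, while the other three give more weight to observations closer to the cutoff.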

F Local Linear Regression for Rectangular Kernel

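Now, in our fuzzy RD design, we assume a rectangular kernel and a fixed bandwidth $h$. On the left-hand side of the cutoff, we focus on the following regression models:

\[
S^f = \alpha_{Sl} + \beta_{Sl} (S^o - S^{o*}) + \varepsilon^{Sl}
\]

\[
K = \alpha_{Kl} + \beta_{Kl} (S^o - S^{o*}) + \varepsilon^{Kl}
\]

and obtain:

\[
(\hat{\alpha}_{Sl}, \hat{\beta}_{Sl}) \equiv \arg\min_{\alpha_{Sl}, \beta_{Sl}} \sum_{i: S_i^o \in (S^{o*} - h, S^{o*})} \left[ S_i^f - \alpha_{Sl} - \beta_{Sl} (S_i^o - S^{o*}) \right]^2
\]

\[
(\hat{\alpha}_{Kl}, \hat{\beta}_{Kl}) \equiv \arg\min_{\alpha_{Kl}, \beta_{Kl}} \sum_{i: S_i^o \in (S^{o*} - h, S^{o*})} \left[ K_i - \alpha_{Kl} - \beta_{Kl} (S_i^o - S^{o*}) \right]^2
\]

Similarly, for the observations on the right-hand side of the cutoff, we focus on the following regression models:

\[
S^f = \alpha_{Sr} + \beta_{Sr} (S^o - S^{o*}) + \varepsilon^{Sr}
\]

\[
K = \alpha_{Kr} + \beta_{Kr} (S^o - S^{o*}) + \varepsilon^{Kr}
\]

and obtain the following estimated parameters:

\[
(\hat{\alpha}_{Sr}, \hat{\beta}_{Sr}) \equiv \arg\min_{\alpha_{Sr}, \beta_{Sr}} \sum_{i: S_i^o \in (S^{o*}, S^{o*} + h)} \left[ S_i^f - \alpha_{Sr} - \beta_{Sr} (S_i^o - S^{o*}) \right]^2
\]

\[
(\hat{\alpha}_{Kr}, \hat{\beta}_{Kr}) \equiv \arg\min_{\alpha_{Kr}, \beta_{Kr}} \sum_{i: S_i^o \in (S^{o*}, S^{o*} + h)} \left[ K_i - \alpha_{Kr} - \beta_{Kr} (S_i^o - S^{o*}) \right]^2
\]

Then we can estimate the homogeneous treatment effect of model high schools in our fuzzy RD design as follows:

\[
\hat{\beta} = \frac{\hat{\alpha}_{Sr} - \hat{\alpha}_{Sl}}{\hat{\alpha}_{Kr} - \hat{\alpha}_{Kl}}
\]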
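The estimator described in this appendix can be sketched in a few lines: with a rectangular kernel, local linear regression reduces to ordinary least squares on the observations inside the bandwidth, fitted separately on each side of the cutoff. The function names and the toy data below are hypothetical; the open windows follow the summation ranges given above.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def fuzzy_rd_rectangular(final, attended, original, cutoff, h):
    """Ratio of intercept jumps from one-sided linear fits within bandwidth h."""
    rows = list(zip(original, final, attended))
    left = [(o - cutoff, f, k) for o, f, k in rows if cutoff - h < o < cutoff]
    right = [(o - cutoff, f, k) for o, f, k in rows if cutoff < o < cutoff + h]

    def intercepts(side):
        x = [r[0] for r in side]
        a_s, _ = ols_line(x, [r[1] for r in side])  # final score at the cutoff
        a_k, _ = ols_line(x, [r[2] for r in side])  # treatment probability at the cutoff
        return a_s, a_k

    a_sl, a_kl = intercepts(left)
    a_sr, a_kr = intercepts(right)
    return (a_sr - a_sl) / (a_kr - a_kl)


# Made-up data generated as final = 400 + 2 * (original - 480) + 40 * attended.
original = [476, 477, 478, 479, 481, 482, 483, 484]
attended = [0, 0, 0, 0, 1, 0, 0, 1]
final = [392, 394, 396, 398, 442, 404, 406, 448]
beta_hat = fuzzy_rd_rectangular(final, attended, original, cutoff=480, h=5)
print(beta_hat)
```

Because the toy data are generated with a constant effect of 40 and no noise, the ratio of intercept jumps reproduces that value; with real data the estimate would of course carry sampling error, which the paper addresses through bootstrap standard errors.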

G Parametric Estimation with Linear Covariates

Table 16: Parametric Estimation, Robustness Test, Linear Covariates, Science

Rows: specification of f(S_i^o). Columns: specification of l(S_i^o).

f \ l    C           L           Q           Cu          PL          PQ          PC
C        106.5***    153.8***    156.8***    156.3***    155.4***    155.5***    152.8***
         (6.967)     (7.142)     (6.779)     (6.767)     (7.011)     (6.809)     (6.861)
L        58.54***    81.07***    98.53***    97.49***    88.77***    95.45***    89.44***
         (11.46)     (15.87)     (15.88)     (15.87)     (15.98)     (15.87)     (15.63)
Q        20.39*      28.25*      34.83*      33.96*      29.98*      32.03*      26.75
         (10.87)     (15.05)     (18.55)     (17.81)     (16.32)     (17.34)     (16.50)
Cu       20.43*      28.29*      34.88*      35.57*      29.90*      34.61*      27.61
         (11.27)     (15.61)     (19.25)     (19.63)     (16.48)     (19.61)     (17.99)
PL       37.67***    52.17***    69.15***    69.04***    55.30***    68.25***    62.45***
         (10.56)     (14.62)     (16.76)     (16.25)     (15.50)     (15.85)     (15.38)
PQ       26.62**     36.87**     57.53***    59.34***    39.09**     58.89***    52.43***
         (12.31)     (17.05)     (19.17)     (18.99)     (18.07)     (17.73)     (17.05)
PC       12.51       17.33       42.98**     46.56**     18.37       48.24**     45.20***
         (13.66)     (18.92)     (21.38)     (21.09)     (20.06)     (19.69)     (17.10)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported are heteroskedasticity consistent. 3. Numbers of observations are N = 870 for all specifications.

Table 17: Parametric Estimation, Robustness Test, Linear Covariates, Art

Rows: specification of f(S_i^o). Columns: specification of l(S_i^o).

f \ l    C           L           Q           Cu          PL          PQ          PC
C        107.2***    157.0***    158.0***    158.3***    157.2***    156.5***    154.0***
         (4.604)     (5.105)     (4.940)     (4.884)     (5.043)     (4.891)     (4.861)
L        -7.309      -11.96      6.188       15.59       -5.863      11.40       5.130
         (10.57)     (17.30)     (23.55)     (29.97)     (19.01)     (28.61)     (22.76)
Q        -14.92*     -24.40*     -25.68*     -12.36      -29.71**    -15.73      -21.20*
         (7.772)     (12.71)     (13.38)     (16.09)     (12.87)     (14.92)     (12.01)
Cu       -25.13***   -41.11***   -43.26***   -47.46***   -43.27***   -56.64***   -53.36***
         (8.262)     (13.51)     (14.22)     (15.60)     (13.61)     (15.23)     (13.44)
PL       -8.610      -14.08      -3.636      9.326       -14.23      5.024       -1.115
         (9.697)     (15.86)     (20.05)     (26.80)     (16.03)     (25.38)     (19.59)
PQ       3.519       5.756       17.53       38.03       5.817       44.87       30.47
         (10.63)     (17.38)     (21.93)     (30.13)     (17.56)     (32.20)     (24.91)
PC       0.479       0.784       16.57       42.17       0.793       50.10       29.66
         (11.75)     (19.22)     (25.23)     (34.68)     (19.42)     (36.83)     (25.09)

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported are heteroskedasticity consistent. 3. Numbers of observations are N = 1785 for all specifications.

H Nonparametric Estimation with Linear Covariates

Table 18: OLS Regression, Linear Covariates

            Whole Sample   Science Student   Art Student
Age         -19.61***      -17.16***         -21.88***
            (2.343)        (3.887)           (2.897)
Male        10.02***       21.81***          2.125
            (3.251)        (5.997)           (4.022)
Normal      4.702          29.60**           -15.02
            (9.564)        (12.42)           (13.63)
CollegeF    14.06***       24.47***          6.566
            (5.396)        (8.036)           (7.049)
CollegeM    7.377          -2.465            9.971
            (5.476)        (8.993)           (6.793)
UnemF       5.087          17.22*            -1.483
            (5.902)        (9.880)           (7.409)
UnemM       -19.12***      -9.079            -23.97***
            (5.905)        (9.314)           (7.604)
FarmerF     -10.20*        -28.42***         -3.741
            (5.468)        (9.908)           (6.489)
FarmerM     -13.44**       5.697             -20.68***
            (5.582)        (9.785)           (6.721)
Cons        797.2***       718.3***          863.8***
            (45.72)        (74.79)           (57.17)
R2          0.067          0.107             0.068
No. Obs     2655           870               1785

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported are heteroskedasticity consistent.

Table 19: Residuals, Optimal Bandwidth, Various Kernels

                     Science Student                            Art Student
Kernel               Rec.       Tri.      Gau.      Epa.        Rec.       Tri.      Gau.      Epa.
Optimal Bandwidth    21.11      13.44     4.281     12.51       21.21      13.50     4.301     12.57
Optimal              -40.86*    -22.78    -5.747    -25.45      -36.69**   -57.55    -36.29    -19.86
                     (20.92)    (37.83)   (32.47)   (21.72)     (17.64)    (40.32)   (54.47)   (17.54)
Half Optimal         -37.68     22.04     26.45     -30.72      -90.26*    27.48     39.74     -67.17*
                     (34.68)    (24.69)   (30.38)   (44.84)     (47.76)    (47.62)   (59.13)   (35.14)
Double Optimal       43.35**    -29.38    -31.58    39.54**     -9.370     -29.03    -42.75    -9.894
                     (18.72)    (24.93)   (26.16)   (17.38)     (14.95)    (22.51)   (26.79)   (15.40)
N                    870                                        1785

Notes: 1. ***, **, * imply significance at the 1%, 5% and 10% level respectively. 2. The kernel-specific constants are approximately 5.4000, 3.4375, 1.0951 and 3.1999 respectively. 3. The standard errors reported are obtained through the bootstrap method.
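A short arithmetic check may help in reading the bandwidth row of Table 19 together with the kernel-specific constants in its notes. Dividing each reported optimal bandwidth by the constant of its kernel gives nearly the same number within each sample, which suggests that the bandwidth is the constant multiplied by a sample-specific factor common to all kernels. This reading is inferred from the reported numbers and is not stated in the text; the variable and function names below are hypothetical.

```python
# Kernel-specific constants as reported in the table notes.
CONSTANTS = {"Rec.": 5.4000, "Tri.": 3.4375, "Gau.": 1.0951, "Epa.": 3.1999}

# Optimal bandwidths as reported in Table 19.
SCIENCE = {"Rec.": 21.11, "Tri.": 13.44, "Gau.": 4.281, "Epa.": 12.51}
ART = {"Rec.": 21.21, "Tri.": 13.50, "Gau.": 4.301, "Epa.": 12.57}


def common_factors(bandwidths):
    """Bandwidth divided by the kernel constant, kernel by kernel."""
    return {name: bandwidths[name] / CONSTANTS[name] for name in CONSTANTS}


science_factors = common_factors(SCIENCE)
art_factors = common_factors(ART)
print({k: round(v, 3) for k, v in science_factors.items()})
print({k: round(v, 3) for k, v in art_factors.items()})
```

The small remaining differences across kernels are consistent with rounding in the reported bandwidths and constants.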

I School Specific Effects

Table 20: Descriptive Statistics by Schools, Science

School     OS       FS       Male    Age      Normal  ColF    ColM    UneF    UneM    FarF    FarM
No.1       497.1    509.6    0.724   18.43    0.952   0.524   0.359   0.000   0.000   0.153   0.343
           (17.62)  (59.69)          (0.599)
           [145]    [145]    [145]   [145]    [145]   [145]   [145]   [144]   [143]   [144]   [143]
Xinhua     479.9    453.8    0.586   18.74    0.992   0.000   0.008   0.000   0.000   0.667   0.667
           (11.14)  (76.56)          (0.706)
           [133]    [133]    [133]   [133]    [133]   [133]   [133]   [132]   [132]   [132]   [132]
No.2       475.6    460.5    0.721   18.73    0.958   0.335   0.288   0.626   0.688   0.038   0.036
           (14.74)  (58.64)          (0.650)
           [215]    [215]    [215]   [215]    [215]   [215]   [215]   [211]   [192]   [211]   [192]
No.3       456.4    388.0    0.768   18.91    0.870   0.174   0.130   0.116   0.101   0.319   0.377
           (16.75)  (73.12)          (0.742)
           [69]     [69]     [69]    [69]     [69]    [69]    [69]    [69]    [69]    [69]    [69]
No.5       460.4    440.6    0.768   18.72    0.957   0.145   0.101   0.294   0.377   0.176   0.203
           (14.13)  (47.40)          (0.662)
           [69]     [69]     [69]    [69]     [69]    [69]    [69]    [68]    [69]    [68]    [69]
No.8       458.5    432.3    0.647   18.74    0.917   0.082   0.047   0.000   0.000   0.353   0.424
           (28.79)  (65.35)          (0.639)
           [85]     [85]     [85]    [85]     [84]    [85]    [85]    [85]    [85]    [85]    [85]
Xinda      451.6    376.9    0.579   18.72    0.989   0.478   0.360   0.000   0.000   0.703   0.758
           (20.10)  (81.80)          (0.744)
           [178]    [178]    [178]   [178]    [178]   [178]   [178]   [74]    [66]    [74]    [66]
Yuzhong    428.3    357.1    0.690   18.69    1.000   0.000   0.000   0.000   0.000   0.958   0.986
           (13.31)  (54.21)          (0.785)
           [71]     [71]     [71]    [71]     [44]    [71]    [71]    [71]    [71]    [71]    [71]
Jiugong    425.9    383.0    0.705   18.85    0.955   0.011   0.023   0.018   0.019   0.679   0.759
           (14.20)  (59.00)          (0.736)
           [88]     [88]     [88]    [88]     [88]    [88]    [88]    [56]    [54]    [56]    [54]
Wenzhong   411.6    316.5    0.660   18.69    1.000   0.053   0.053
           (32.24)  (75.45)          (0.672)
           [94]     [94]     [94]    [94]     [94]    [94]    [94]
Caizhong   397.9    304.4    0.669   18.82    0.972
           (32.47)  (65.96)          (0.684)
           [145]    [145]    [145]   [145]    [145]

Notes: 1. The number in () is the standard deviation and it is not shown for dummies. The number of observations is shown in []. 2. Results not shown are missing.

Table 21: Descriptive Statistics by Schools, Art

School     OS       FS       Male    Age      Normal  ColF    ColM    UneF    UneM    FarF    FarM
No.1       504.3    497.6    0.525   18.46    0.963   0.312   0.357   0.007   0.004   0.369   0.364
           (13.58)  (67.92)          (0.613)
           [459]    [459]    [459]   [459]    [459]   [459]   [459]   [458]   [451]   [458]   [451]
Xinhua     482.8    441.4    0.474   18.70    0.981   0.006   0.009   0.000   0.000   0.696   0.698
           (7.892)  (60.05)          (0.699)
           [323]    [323]    [323]   [323]    [322]   [323]   [323]   [319]   [318]   [319]   [318]
No.2       477.7    432.0    0.478   18.77    0.943   0.124   0.131   0.618   0.708   0.121   0.115
           (15.49)  (64.47)          (0.753)
           [314]    [314]    [314]   [314]    [314]   [314]   [314]   [314]   [312]   [314]   [312]
No.3       461.9    379.4    0.390   18.65    0.935   0.106   0.065   0.098   0.131   0.541   0.598
           (12.56)  (80.64)          (0.713)
           [123]    [123]    [123]   [123]    [123]   [123]   [123]   [122]   [122]   [122]   [122]
No.5       464.5    399.8    0.453   18.70    0.987   0.050   0.057   0.604   0.667   0.182   0.226
           (8.869)  (54.11)          (0.718)
           [159]    [159]    [159]   [159]    [159]   [159]   [159]   [159]   [159]   [159]   [159]
No.8       468.1    430.5    0.521   18.66    0.951   0.021   0.021   0.000   0.000   0.535   0.576
           (20.85)  (60.12)          (0.681)
           [144]    [144]    [144]   [144]    [142]   [144]   [144]   [144]   [144]   [144]   [144]
Xinda      454.5    370.2    0.545   18.85    0.973   0.569   0.405   0.000   0.000   0.664   0.752
           (13.48)  (78.83)          (0.749)
           [343]    [343]    [343]   [343]    [338]   [343]   [343]   [146]   [125]   [146]   [125]
Yuzhong    432.6    304.5    0.424   18.82    0.983   0.000   0.000   0.000   0.000   0.957   0.956
           (13.87)  (62.20)          (0.710)
           [92]     [92]     [92]    [92]     [60]    [92]    [92]    [92]    [90]    [92]    [90]
Jiugong    425.3    303.2    0.517   18.91    0.983   0.076   0.059   0.000   0.000   0.722   0.768
           (20.40)  (49.47)          (0.679)
           [118]    [118]    [118]   [118]    [118]   [118]   [118]   [97]    [99]    [97]    [99]
Wenzhong   420.7    292.2    0.400   18.69    1.000   0.046   0.054
           (28.50)  (67.20)          (0.703)
           [130]    [130]    [130]   [130]    [130]   [130]   [130]
Caizhong   411.9    283.8    0.347   18.63    0.986
           (27.86)  (73.27)          (0.638)
           [72]     [72]     [72]    [72]     [72]

Notes: 1. The number in () is the standard deviation and it is not shown for dummies. The number of observations is shown in []. 2. Results not shown are missing.

Table 22: Descriptive Statistics by Schools, School

School    Urban  No.Stu.  No.Tea.  R.AdTea  R.Tea35  Min ES
No.1      Yes    2286     153      0.412    0.451    490
Xinhua    Yes    1502     156      0.429    0.410    474
No.2      Yes    1877     171      0.287    0.626    468
No.3      Yes    817      63       0.317    0.508    454
No.5      Yes    1023     85       0.329    0.518    457
No.8      Yes    950      83       0.494    0.434    449
Xinda     Yes    1704     182      0.258    0.648    436
Yuzhong   No     804      71       0.155    0.789    398
Jiugong   No     823      68       0.118    0.618    359
Wenzhong  No     966      63       0.063    0.841    397
Caizhong  No     801      67       0.075    0.806    399

Table 23: Parametric Estimation, Cutoff 490, Robust Test, Linear Covariates, Science^a

           l(S_i^o)
f(S_i^o)   C           L           Q           Cu          PL          PQ          PC
C          139.2***    232.6***    227.1***    213.4***    191.0***    169.7***    168.8***
           (9.390)     (10.32)     (9.691)     (10.20)     (10.11)     (9.110)     (9.045)
L          79.18***    101.9***    124.9***    108.0***    68.53***    60.31***    61.47***
           (11.46)     (15.87)     (15.88)     (15.87)     (15.98)     (15.87)     (15.63)
Q          9.598       12.35       23.72       1.092       −35.07*     −36.83***   −28.97*
           (15.68)     (20.17)     (38.75)     (32.24)     (19.95)     (15.51)     (15.53)
Cu         13.47       17.33       33.29       64.19       −111.0***   −47.58***   −38.70***
           (15.70)     (20.19)     (38.78)     (74.77)     (35.88)     (16.58)     (16.29)
PL         58.53***    75.29***    222.1***    279.7***    −140.3***   16.36       24.20
           (23.56)     (30.31)     (39.90)     (46.02)     (56.47)     (20.69)     (20.04)
PQ         161.6***    207.9***    232.1***    303.5***    −387.4***   16.60       24.42
           (61.17)     (78.69)     (43.61)     (47.90)     (146.6)     (21.91)     (20.10)
PC         80.15       103.1       228.1***    298.8***    −192.1      223.2***    118.0***
           (178.0)     (229.0)     (44.28)     (48.39)     (426.6)     (55.38)     (35.10)

^a 1. ***, **, * denote significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported (in parentheses) are heteroskedasticity consistent. 3. The number of observations is N = 870 for all specifications.

Table 24: Parametric Estimation, Cutoff 490, Robust Test, Linear Covariates, Art^a

           l(S_i^o)
f(S_i^o)   C           L           Q           Cu          PL          PQ          PC
C          126.0***    176.3***    172.8***    169.1***    162.8***    140.8***    135.2***
           (5.190)     (5.913)     (5.339)     (5.118)     (4.993)     (4.977)     (5.066)
L          4.617       6.498       32.11       25.67*      22.92*      11.64       7.022
           (10.71)     (14.90)     (20.43)     (15.55)     (12.24)     (11.29)     (10.03)
Q          −26.53***   −36.90***   −57.46***   −67.42***   −37.72**    −26.99***   −30.60***
           (9.821)     (13.66)     (21.27)     (22.39)     (19.05)     (10.26)     (10.20)
Cu         −23.23***   −32.32***   −50.32***   −53.63***   −17.34      −21.07**    −25.02***
           (8.813)     (12.26)     (19.09)     (20.34)     (18.09)     (8.349)     (8.648)
PL         −36.44***   −50.69***   59.99       38.35       975.0***    −12.65      −21.22*
           (12.18)     (16.95)     (60.85)     (61.12)     (326.0)     (13.35)     (11.10)
PQ         24.80       34.49       125.0*      80.33       −663.5      1.433       −10.90
           (30.79)     (42.83)     (55.37)     (63.41)     (823.9)     (14.61)     (11.76)
PC         138.9*      193.2*      136.9**     94.76       −3716*      19.45       −16.67
           (79.61)     (110.7)     (56.31)     (74.10)     (2130)      (59.98)     (14.66)

^a 1. ***, **, * denote significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported (in parentheses) are heteroskedasticity consistent. 3. The number of observations is N = 1785 for all specifications.

Table 25: Parametric Estimation, Cutoff 449, Robust Test, Linear Covariates, Science^a

           l(S_i^o)
f(S_i^o)   C           L           Q           Cu          PL          PQ          PC
C          104.7***    174.1***    177.3***    174.0***    178.7***    174.7***    174.0***
           (7.881)     (10.24)     (10.03)     (9.465)     (9.988)     (9.278)     (9.322)
L          −48.38***   −86.65***   −57.19*     −7.788      −42.60      10.29       9.470
           (16.91)     (30.28)     (32.45)     (29.22)     (32.42)     (25.54)     (25.30)
Q          −45.18***   −80.91***   −80.60***   −23.87      −76.98***   −4.376      −4.953
           (15.49)     (27.75)     (27.65)     (22.34)     (27.35)     (20.81)     (20.81)
Cu         −78.17***   −140.0***   −139.5***   −296.0***   −131.4***   −50.16      −47.26
           (20.02)     (35.86)     (35.72)     (75.80)     (35.35)     (53.97)     (48.69)
PL         51.16**     91.62**     82.99**     98.80***    78.59**     93.00***    90.41***
           (22.21)     (39.78)     (36.04)     (29.68)     (34.12)     (25.38)     (25.38)
PQ         59.33***    106.2***    99.10***    92.98***    91.14***    91.30***    88.19***
           (22.33)     (40.00)     (36.40)     (31.53)     (34.31)     (30.16)     (29.76)
PC         59.94***    107.3***    100.2***    95.52***    92.08***    92.42***    88.06***
           (22.70)     (40.65)     (36.99)     (32.48)     (34.87)     (30.63)     (29.77)

^a 1. ***, **, * denote significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported (in parentheses) are heteroskedasticity consistent. 3. The number of observations is N = 870 for all specifications.

Table 26: Parametric Estimation, Cutoff 449, Robust Test, Linear Covariates, Art^a

           l(S_i^o)
f(S_i^o)   C           L           Q           Cu          PL          PQ          PC
C          128.3***    195.3***    195.1***    190.3***    195.9***    188.9***    188.9***
           (6.302)     (6.820)     (6.878)     (6.177)     (6.685)     (6.162)     (6.192)
L          5.766       9.003       7.775       12.60       13.07       9.734       10.91
           (9.422)     (14.71)     (14.49)     (22.33)     (15.85)     (20.10)     (20.42)
Q          33.38***    52.12***    52.53***    50.69***    48.21**     46.23***    47.37***
           (12.44)     (19.43)     (19.58)     (16.22)     (18.88)     (17.39)     (17.09)
Cu         26.22**     40.94**     41.26**     56.38**     36.72**     43.83*      46.39*
           (12.03)     (18.78)     (18.93)     (25.87)     (18.37)     (26.39)     (25.20)
PL         59.49*      92.89*      92.78*      65.62       82.10*      51.15       53.11
           (34.10)     (53.24)     (54.36)     (48.12)     (47.05)     (45.73)     (45.02)
PQ         55.78*      87.09*      87.81       107.8***    76.97*      101.0***    102.5***
           (33.63)     (52.51)     (53.44)     (33.75)     (46.41)     (37.40)     (35.94)
PC         57.81*      90.25*      91.01**     115.4***    79.77*      104.1***    102.3***
           (33.74)     (52.68)     (53.60)     (32.84)     (46.56)     (37.47)     (36.52)

^a 1. ***, **, * denote significance at the 1%, 5% and 10% level respectively. 2. The standard errors reported (in parentheses) are heteroskedasticity consistent. 3. The number of observations is N = 1785 for all specifications.
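The notes to the parametric estimation tables state that the reported standard errors are heteroskedasticity consistent, without naming the variant. As a minimal sketch, assuming the basic White (HC0) sandwich form, such standard errors could be computed as follows. The function name `ols_hc0`, the synthetic data, and the simple linear specification with a treatment dummy at cutoff 490 are illustrative assumptions, not the specifications estimated in the tables.

```python
import numpy as np


def ols_hc0(X, y):
    """OLS coefficients with White (HC0) heteroskedasticity-consistent SEs.

    Hypothetical helper: HC0 is assumed here; the tables do not say which
    heteroskedasticity-consistent variant was used.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(X.T @ X)           # (X'X)^{-1}
    beta = xtx_inv @ X.T @ y                   # OLS estimate
    resid = y - X @ beta                       # residuals e_i
    meat = X.T @ (X * resid[:, None] ** 2)     # sum_i e_i^2 x_i x_i'
    cov = xtx_inv @ meat @ xtx_inv             # sandwich estimator
    return beta, np.sqrt(np.diag(cov))


# Synthetic illustration only: these are not the paper's data.
rng = np.random.default_rng(0)
score = rng.uniform(400.0, 520.0, size=500)    # made-up original scores
treated = (score >= 490.0).astype(float)       # dummy at the 490 cutoff
centred = score - 490.0
noise = rng.normal(size=500) * (20.0 + 0.2 * np.abs(centred))
final = 450.0 + 30.0 * treated + 0.8 * centred + noise
design = np.column_stack([np.ones_like(score), treated, centred])
coef, se = ols_hc0(design, final)
print("jump estimate:", coef[1], "HC0 standard error:", se[1])
```

With a design matrix holding only a constant and outcomes 0 and 2, the function returns a coefficient of 1 and a standard error of the square root of 0.5, which can be verified by hand.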

J Descriptive Statistics about Policy Extensions

Table 27: Descriptive Statistics, Policy Extensions^a

          Science Student           Art Student
          Model        Normal       Model        Normal
ROSB      28.08        −17.79       34.64        −7.089
          (17.16)      (34.76)      (15.68)      (27.10)
ROS       0.010        0.011        5.440        4.602
          (14.85)      (22.19)      (11.76)      (17.85)
DROS      −28.07       17.86        −29.20       11.69
          (8.607)      (26.73)      (8.474)      (22.21)
No.Obs    [278]        [1014]       [782]        [1495]

          Science Student                                     Art Student
          Model                     Normal                    Model                     Normal
          Eligible     Noneligible  Eligible     Noneligible  Eligible     Noneligible  Eligible     Noneligible
OS        492.9        462.5        486.1        435.6        496.9        466.7        485.5        446.1
          (14.37)      (7.862)      (9.810)      (32.01)      (14.55)      (6.317)      (9.900)      (24.21)
          [241]        [37]         [149]        [866]        [744]        [38]         [290]        [1205]
FS        492.0        423.2        478.6        371.8        474.5        472.2        445.2        355.2
          (59.29)      (118.7)      (56.49)      (80.54)      (68.45)      (103.0)      (61.65)      (81.15)
          [241]        [37]         [149]        [866]        [744]        [38]         [290]        [1205]
Male      0.676        0.541        0.743        0.669        0.515        0.289        0.541        0.458
          [241]        [37]         [148]        [866]        [744]        [38]         [290]        [1205]
Normal    0.988        0.865        0.973        0.959        0.989        0.605        0.965        0.967
          [241]        [37]         [148]        [838]        [743]        [38]         [288]        [1168]
Age       18.56        18.73        18.72        18.76        18.56        18.58        18.69        18.78
          (0.688)      (0.508)      (0.637)      (0.708)      (0.657)      (0.722)      (0.655)      (0.739)
          [241]        [37]         [148]        [866]        [744]        [38]         [290]        [1205]
CollegeF  0.249        0.432        0.282        0.173        0.180        0.289        0.090        0.205
          [241]        [37]         [149]        [866]        [744]        [38]         [290]        [1205]
CollegeM  0.187        0.216        0.235        0.136        0.211        0.263        0.090        0.156
          [241]        [37]         [149]        [866]        [744]        [38]         [290]        [1205]
UnemF     0.000        0.000        0.534        0.131        0.004        0.000        0.491        0.188
          [239]        [37]         [146]        [633]        [739]        [38]         [285]        [861]
UnemM     0.000        0.000        0.594        0.173        0.003        0.000        0.543        0.244
          [238]        [37]         [133]        [502]        [732]        [37]         [280]        [783]
FarmerF   0.427        0.216        0.130        0.562        0.521        0.158        0.263        0.537
          [239]        [37]         [146]        [633]        [739]        [38]         [285]        [861]
FarmerM   0.508        0.432        0.165        0.500        0.512        0.297        0.279        0.534
          [238]        [37]         [133]        [502]        [732]        [37]         [280]        [783]

^a 1. Standard deviations are shown in parentheses and are not reported for dummy variables. Numbers of observations are shown in brackets. 2. ROSB is the relative original score compared with peers in middle school, while ROS is the relative original score compared with peers in high school. DROS is the difference between them. Eligible indicates whether the original score is higher than the cutoff.
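The notes define DROS as the difference between ROS and ROSB without stating the sign convention. The reported means appear consistent with DROS = ROS − ROSB, which the short check below illustrates using the group means copied from the upper panel of Table 27. The names `TABLE_27_MEANS` and `dros_gap` are introduced here for illustration only. The small remaining gaps are presumably due to rounding or to observations missing one of the two scores, which is an assumption.

```python
# Group means copied from the upper panel of Table 27: (ROSB, ROS, DROS).
TABLE_27_MEANS = {
    "Science, Model": (28.08, 0.010, -28.07),
    "Science, Normal": (-17.79, 0.011, 17.86),
    "Art, Model": (34.64, 5.440, -29.20),
    "Art, Normal": (-7.089, 4.602, 11.69),
}


def dros_gap(rosb, ros, dros):
    """Gap between (ROS - ROSB) and the reported DROS mean.

    Assumes the sign convention DROS = ROS - ROSB, which is inferred from
    the reported means rather than stated in the table notes.
    """
    return (ros - rosb) - dros


gaps = {group: dros_gap(*vals) for group, vals in TABLE_27_MEANS.items()}
for group, gap in gaps.items():
    print(f"{group}: gap = {gap:+.3f}")
```

Under the opposite convention (ROSB − ROS) every gap would be roughly twice the size of the reported DROS mean, so the means discriminate clearly between the two readings.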
