
Empirical Methods in Applied Economics

Jörn-Steffen Pischke
LSE

October 2005

1 The Evaluation Problem: Introduction and Randomized Experiments

1.1 The Nature of the Problem

Many questions in economics and the social sciences are evaluation questions of the form: what is the effect of a treatment (D) on an outcome (y)? Consider a simple question of this type: do hospitals make people healthier? The answer seems to be a pretty obvious yes, but how do we really know this? One approach to answering the question would be to get some data and run a regression. For example, y could be self-reported health status, and D could be an indicator (dummy variable) for whether the individual went to a hospital in the previous year. Obviously, there are many issues with the measures, for example, whether self-reported health status is reliable and comparable across individuals. These questions are important, but I will not discuss them here. We can then run a regression:

$$y = \alpha + \beta D + \varepsilon \qquad (1)$$

I have run this regression with data for 12,640 individuals from Germany for 1994. The health measure ranges from 1 to 5, with 5 being the best health. The regression yields the following result:

$$\text{health} = \underset{(0.09)}{2.52} \; - \; \underset{(0.03)}{0.58}\,\text{hospital} + \varepsilon$$

with standard errors in parentheses.

Taken at face value, the regression result suggests that going to the hospital makes people sicker, and the effect is strongly significant. Of course, it is easy to see why this interpretation of the regression result is silly: people who

go to the hospital are less healthy to begin with, and being treated may well improve their health. However, even after the hospitalization they are not as healthy as those individuals who never get hospitalized. Regression only uncovers correlations but does not prove that the causality runs from the right hand side variable (hospital) to the left hand side variable (health).

The inherent problem here is referred to as selection bias, and it is useful to look at it a bit more formally. Think about our treatment variable D as binary, $D \in \{0, 1\}$. This is the variable of interest in the regression. Extending the framework to multivalued or continuous treatments is possible but the basic insights are the same. We will concentrate on the binary case in most of these notes. Think of two potential outcomes $y_0$ and $y_1$, which are defined as

$$\text{outcome} = \begin{cases} y_1 & \text{if } D = 1 \\ y_0 & \text{if } D = 0 \end{cases}$$

These are potential or counterfactual outcomes in the sense that I can think of both outcomes for each individual. For example, $y_0$ is the health status of an individual if the individual had not gone to the hospital, irrespective of whether the individual actually went. However, each individual will only realize one of these outcomes, depending on whether the treatment is chosen ($D = 1$) or not ($D = 0$).

Consider an individual who gets hospitalized. I will only ever observe $y_1$ for this individual. However, I would like to know $y_0$ for this individual (what would the health of this individual have been like had he or she not gone to the hospital) since I am interested in the treatment effect $y_1 - y_0$. Since I do not observe $y_0$ in this case, the only available strategy is to substitute the health of someone who was not hospitalized instead. Let us return to the realized outcome y, which we observe:

$$y = \begin{cases} y_1 & \text{if } D = 1 \\ y_0 & \text{if } D = 0 \end{cases} \;=\; y_0 + (y_1 - y_0)\,D \qquad (2)$$

This notation is useful, because $y_1 - y_0$ is the treatment effect for an individual. For most of these lectures we will concentrate on models with homogeneous (and linear) treatment effects, so $y_1 - y_0$ is the same for everybody: a constant parameter to be estimated. In principle, there could be a distribution of both $y_1$ and $y_0$ as well as $y_1 - y_0$ in the population, and we will return to this in the last lecture. In the constant treatment effects case we can rewrite (2) in the more

familiar form

$$y = \underbrace{\alpha}_{E(y_0)} + \underbrace{\beta}_{(y_1 - y_0)}\,D + \underbrace{\varepsilon}_{y_0 - E(y_0)}$$

It is now straightforward to characterize what is happening in the simple regression (1). Take expectations conditional on D:

$$E(y \mid D = 1) = \alpha + \beta + E(\varepsilon \mid D = 1)$$
$$E(y \mid D = 0) = \alpha + E(\varepsilon \mid D = 0)$$

The OLS estimate of $\beta$ is

$$E(y \mid D = 1) - E(y \mid D = 0) = \underbrace{\beta}_{\text{treatment effect}} + \underbrace{E(\varepsilon \mid D = 1) - E(\varepsilon \mid D = 0)}_{\text{selection bias}} \qquad (3)$$

Notice that the selection bias can be rewritten in terms of $y_0$:

E(" D = 1) E(" D = 0) = E(y0 D = 1) E(y0 D = 0) (4) j j j j In terms of the hospital example, the selection bias is the average di¤erence in the counterfactual health status if individuals did not go into the hospital between the group who does and does not have hospitalizations. If the sick go to the hospital, the selection bias will be negative. The goal of evaluation research is to overcome the selection bias. These lectures will deal with various methods which are popular in the literature.

1.2 How Randomized Experiments Overcome the Selection Bias: The Tennessee STAR Experiment

It is easy to see from (4) that statistical independence of D and $(y_0, y_1)$ (or $\varepsilon$), i.e. $D \perp (y_0, y_1)$, solves the selection problem. In fact, independence is stronger than what is required; mean independence is enough as long as we are only interested in means. Clearly, random assignment of D ensures independence. To get at the result another way, consider the simple comparison of means estimator:

$$\begin{aligned} E(y \mid D = 1) - E(y \mid D = 0) &= E(y_1 \mid D = 1) - E(y_0 \mid D = 0) \\ &= E(y_1) - E(y_0) = \beta \end{aligned}$$

where the second line is a consequence of statistical independence of treatment and the counterfactual outcomes.
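Continuing the simulation sketch above (reusing rng, n, y0, and y1 from that block): if D is assigned by a coin flip instead of by latent health, the same comparison of means recovers the treatment effect.

```python
# Randomized assignment: D is independent of (y0, y1) by construction
D_rand = rng.integers(0, 2, n)
y_rand = y0 + (y1 - y0) * D_rand

randomized = y_rand[D_rand == 1].mean() - y_rand[D_rand == 0].mean()
print(f"randomized estimate: {randomized:.2f}")  # close to beta = 0.5
```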

Randomized experiments are considered the gold standard of evaluation research. This does not mean that random assignment experiments are without problems, but in principle they perfectly solve the evaluation problem. Experiments in the social sciences are not as common as in the natural sciences but they have been around for many decades, and they are becoming more and more prevalent. Even in situations where we do not have experimental data, the idea of a randomization experiment is a useful conceptual benchmark, and it often helps us structure a successful empirical approach.

As an example of an experiment, consider a large scale experiment on class size called the Tennessee STAR experiment. We would like to know what the effect of smaller classes is on student achievement. Class size is typically not randomly assigned but chosen by school administrators. For example, in many countries classes for remedial students tend to be smaller and classes for advanced students tend to be bigger. This assignment would clearly result in selection bias in observational data. In addition, school funding is mostly local in the US, so better funded schools with smaller classes are often in richer neighborhoods. The children in these schools may do better for reasons that have nothing to do with class size.

So in order to answer this question, the legislature of the US state of Tennessee funded a randomized experiment on small classes and teacher aides by appropriating 12 million US$. It was implemented for a cohort of kindergartners in 1985/86 and lasted for four years, i.e. until the kindergartners were in grade 3. Starting from grade 4, everybody was in regular sized classes again. The experiment involved about 11,600 children. The average class size in regular Tennessee classes in 1985/86 was about 22.3. The idea of the experiment was to assign students to one of three treatments: small classes with 13-17 children, regular classes with 22-25 children, or regular classes with a full time teacher aide. Other regular size classes also had a part-time aide.

Schools with at least three classes could choose to participate in the experiment. The 1985/86 entering kindergarten cohort was randomly assigned to one of the class types within these schools. In addition, teachers were randomly assigned to these classes. Pupils who repeated or skipped a grade left the experiment. Students who entered an experimental school and grade later were added to the experiment and randomly assigned to one of the classes. One, somewhat unfortunate, aspect of the experiment is that students in the regular and regular/aide classes were reassigned after the kindergarten year, possibly due to protests of the parents with children in the regular classrooms. There was also some switching of children after the kindergarten year. Despite these problems, the STAR

experiment seems to have been an extremely well implemented randomized trial.

The outcome measures collected on the students in the STAR experiment are test scores on the Stanford Achievement Test and the Tennessee Basic Skills First Test. These tests were given in March of each school year. The results I will present are from Krueger (1999) and come from kindergarten through 3rd grade. Krueger uses the percentile rank on each test, using the regular and regular/aide classes as the basis for scaling. The measure he looks at is the average over three subject tests.

The first question to ask about a randomized experiment is whether the randomization was carried out properly. In order to do this, it is typical to look at the pre-treatment outcome. Since the experiment should have no effect on test scores prior to the experimental intervention, the pre-treatment outcomes should be balanced, i.e. the averages should be the same in the treatment and control group. In this case, a pre-treatment outcome would be a student test score before the child entered kindergarten. Unfortunately, such tests were not conducted. Hence, it is only possible to look at immutable characteristics of the children, which should also be balanced across treatment and control groups.

Table 1 in Krueger (1999) presents these means. The three student characteristics are free lunch, race, and age of the child. Free lunch is a measure of family income, since only poor children qualify for a free school lunch. The means of these variables are similar across all three class types, and any differences are never statistically significant. Based on these characteristics there is no evidence that the random assignment was compromised.

The table also presents some information on the treatment and outcomes: the average class size, the attrition rate, and the test scores. The attrition rate was lower in small kindergarten classrooms, and the difference is significant. Attrition rates are similar in the higher grades. However, the difference in attrition rates in kindergarten means that the sample of students observed in the higher grades may no longer be randomly distributed across class types. Of course, this could even be true if the attrition rates were the same across class types, as long as the attrition process differs. The kindergarten results are therefore the most reliable. We will return to the attrition issue again below.

Class sizes are significantly lower in the small classrooms, which means that the experiment was actually successful in creating the desired variation. While small classes have on average 7 fewer students, a look at table 3 shows that there is a distribution of class sizes within class type. The biggest small classes are actually as big as or bigger than the smallest regular classes. Finally,

table 1 shows that test scores are higher in small classrooms, which gives a preview of the findings.

Since a randomized experiment overcomes the selection bias, the difference of the mean outcomes in the treatment and control group is an estimate of the impact of the treatment. The difference in means is conveniently calculated by a regression of the outcome on a constant (which will reflect the control group mean) and a dummy for the treatment group (which will reflect the difference in outcomes), as we saw above in eq. (3). In the Tennessee STAR case, there are actually two treatments, small classes and regular/aide classes. Regression is again an easy tool to combine the estimates of both treatment effects. These results are shown in column (1) of table 5. The small class effect is about 5 to 6 percentile points and significant, while the regular/aide effect is small and insignificant.
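To illustrate the equivalence of regression and comparisons of means, here is a sketch with simulated data (not the actual STAR data; group sizes and effect sizes are invented): regressing the outcome on a constant and dummies for the two treatment arms reproduces the differences in group means.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 6_000

# Simulated class type: 0 = regular, 1 = small, 2 = regular/aide
ctype = rng.integers(0, 3, n)
score = 50 + 5.0 * (ctype == 1) + 0.3 * (ctype == 2) + rng.normal(0, 20, n)

# Regression on a constant and two treatment dummies
X = sm.add_constant(np.column_stack([ctype == 1, ctype == 2]).astype(float))
fit = sm.OLS(score, X).fit()
print(fit.params)  # [control mean, small-class effect, regular/aide effect]

# The coefficients equal the simple differences in group means
print(score[ctype == 1].mean() - score[ctype == 0].mean())
print(score[ctype == 2].mean() - score[ctype == 0].mean())
```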

1.3 Conditional Random Assignment versus Uncorrelated Controls

The remainder of table 5 shows different regression specifications, which include additional covariates. Why would it be necessary or advisable to use additional regressors? There are two different and distinct reasons for including these covariates.

The first reason is conditional randomization. Recall that in the STAR experiment, students were not assigned to class types centrally across all schools participating in the study. Rather, random assignment was done within each school. In order to see why this might matter, think of there being two school types: rural schools with enough children in a grade to form three classes (one of each type) and urban schools, which are big enough to form four classes. The urban schools would form two small classes, one regular class, and one regular/aide class. Hence, the probability of being assigned to a small class is larger in urban schools. If students in urban schools perform better than those in rural schools irrespective of class size, then student performance will be systematically related to the probability of being assigned to a small class. Comparing all students across Tennessee, as in column (1) of table 5, might therefore simply reflect these differences between urban and rural schools.

However, the probability of being assigned to a small class is the same for all students within a school. Hence, we want to compare students in small classes to those in regular classes in the same school. This could either be done school by school, and the effects then averaged. An alternative is to run a regression with a full set of school fixed effects, which allow each school to have a different mean performance. The school fixed effects

are conditioning variables, and the experiment is valid conditional on these controls. The regression including school fixed effects is shown in column (2) of table 5. It turns out that the results are in fact hardly affected by this inclusion.

The second type of covariate we could use is an uncorrelated control. Column (3) includes the student characteristics race, age, and free lunch. We saw before that these characteristics are not systematically related to the class assignment of the student. If these controls, call them x, are uncorrelated with the treatment D, then they will not affect the estimate of $\beta$. Running the long regression

$$y = \alpha + \beta D + \gamma x + \varepsilon \qquad (5)$$

will give the same results as running the short regression in (1). Nevertheless, the variables x are useful. Notice that the standard error of the estimated treatment effects in column (3) is smaller than the corresponding standard error in column (2). Although the controls x are uncorrelated with D, they have important explanatory power for y. Including these controls lowers the remaining residual variance, which in turn lowers the estimated standard error.

It is easy to see this formally. Consider the long regression (5) with one variable x. Collect the regressors in the matrix

$$Z = \begin{bmatrix} D & x \end{bmatrix}$$

Then we have

$$Z'Z = \begin{bmatrix} D'D & 0 \\ 0 & x'x \end{bmatrix} = \begin{bmatrix} N_T & 0 \\ 0 & x'x \end{bmatrix}$$

where $N_T$ is the number of treated observations. The off-diagonal terms are approximately zero in the sample because $E(D'x) = 0$ (because of sampling variation, these terms will not be exactly zero). So

$$\operatorname{var}\!\begin{pmatrix} \hat{\beta} \\ \hat{\gamma} \end{pmatrix} = \sigma^2_\varepsilon \left( Z'Z \right)^{-1} = \sigma^2_\varepsilon \begin{bmatrix} \frac{1}{N_T} & 0 \\ 0 & \frac{1}{x'x} \end{bmatrix}$$

Because the off-diagonal terms are (approximately) zero, the sampling variance of $\hat{\beta}$ is given by $\sigma^2_\varepsilon / N_T$ for both the short and the long regression. However, because x is correlated with y, $\sigma^2_\varepsilon$ will be smaller in the long regression, and, hence, the standard errors will be smaller.
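A quick simulation illustrates the point (a sketch, assuming a single covariate that is independent of the treatment but has substantial explanatory power for y):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10_000

D = rng.integers(0, 2, n).astype(float)  # randomized treatment
x = rng.normal(0, 1, n)                  # covariate, independent of D
y = 1.0 + 0.5 * D + 2.0 * x + rng.normal(0, 1, n)

short = sm.OLS(y, sm.add_constant(D)).fit()
long_ = sm.OLS(y, sm.add_constant(np.column_stack([D, x]))).fit()

# The treatment coefficient barely moves; its standard error shrinks a lot
print(short.params[1], short.bse[1])  # ~0.5, large SE (residual includes 2x)
print(long_.params[1], long_.bse[1])  # ~0.5, much smaller SE
```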

We saw before in table 5 that the inclusion of school fixed effects did not change the treatment effects much. This suggests that school fixed effects and treatment are roughly uncorrelated. Hence, although the school fixed effects in principle act as conditioning variables, in this case, in practice they actually also play the role of uncorrelated controls. In fact, the standard errors are much more affected by including the school fixed effects than the student characteristics, because the school fixed effects explain an important part of the variance in student performance. The last column adds teacher characteristics. Teacher characteristics are also uncorrelated with the treatment because teachers are randomly assigned. The results in column (4) are therefore again very similar.

1.4 Multiple Outcomes and Data Mining in Experiments

The nature of experiments typically suggests straightforward research designs and ways to analyze the resulting data. Hence, using experimental data provides some insurance against the worst offenses that can be committed with non-experimental data. That does not mean that experimental data are always straightforward to analyze and do not present any potential pitfalls. One such pitfall arises in the analysis of experiments with many outcomes.

A good example of this situation (and of avoiding the pitfalls) is the paper by Kling et al. (2004) on the Moving to Opportunity (MTO) experiment. In the MTO experiment, inhabitants of public housing were given vouchers to move to private rental apartments. One treatment group received vouchers with the condition attached that they had to use the voucher in a low poverty neighborhood, a second treatment group was given vouchers with no strings attached (this group is called the Section 8 group in the paper), and a third group received no vouchers. Kling et al. (2004) analyze a wide variety of outcomes of the adult in the family five years after random assignment, like self-sufficiency, health, etc.

Table 3 shows the results for self-sufficiency outcomes, and table 5 for health outcomes. The column labeled CM displays the control group mean, and the column labeled ITT displays the difference in the outcome between the treatment group and the control group. ITT stands for intention to treat estimate. This refers to the issue that not all voucher recipients actually move, even though moving is the treatment of interest. Instead, the experiment only manipulates voucher receipt, which is not actually of much economic interest in itself. We will return to this issue in section 4 on instrumental variables. For the time being, we will concentrate on the ITT estimates.

The results for the self-sufficiency outcomes in table 3 are all insignificant, while some of the health outcomes are significant. For example, one out of five physical health outcomes is significant: voucher recipients seem to be

less likely to be obese. The issue in looking at these results is whether we should conclude that the vouchers had no effect on health, or whether we should conclude that the vouchers had an effect on obesity but nothing else.

If you approach the results with the question in mind: did MTO vouchers affect obesity, then looking at the standard significance levels is indeed the right thing to do. The obesity effect is significant at the 5 percent level. But if you are initially agnostic, and the question is whether MTO had an effect on some physical health outcome, then the answer is not the same. Obviously, if you look at 20 independent outcomes, and none was affected by the treatment, you would still expect to reject once at the 5 percent level in 20 independent tests. Here we are looking at 5 tests, and the probability of rejecting at least once at the 5 percent level is about 23 percent under the null hypothesis. A Bonferroni adjustment deals with this: to keep the probability of any false rejection across the 5 tests below 5 percent, each individual test is conducted at the 1 percent level (5 percent divided by the number of tests).

The danger of data mining, an easy temptation with observational data, is also present with experimental data. Suppose your experiment does not show any effects or no large effects, like the MTO example. It is tempting to slice the data various ways in the search for some effects, for example by looking at subgroups. Invariably, in many different specifications you might find some significant effects. It will be impossible to tell whether these effects arose by chance, or reflect true effects. Hence, it is best to approach experimental data (just like any other data) with a set of prespecified questions and a research design to answer these questions. Of course, once you look at the data, there may always be results emerging which you did not suspect originally, and which ask for further exploration. It is fine to do this exploration, but you should keep in mind that any new hypothesis which you develop in the course of looking at the results should really be tested using a fresh data set. Kling et al. (2004) are very careful in their discussion of these issues.
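The multiple-testing arithmetic can be checked directly (a small sketch reproducing the 23 percent figure and the Bonferroni rule mentioned above):

```python
# Probability of at least one rejection at the 5 percent level across
# m independent tests, when all null hypotheses are true
for m in (5, 20):
    print(m, 1 - 0.95 ** m)  # 5 -> ~0.226, 20 -> ~0.642

# Bonferroni: test each outcome at alpha/m to bound the familywise error rate
alpha, m = 0.05, 5
print("per-test level:", alpha / m)                   # 0.01
print("familywise bound:", 1 - (1 - alpha / m) ** m)  # ~0.049, below alpha
```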

1.5 Externalities and General Equilibrium Effects

Recall that all evaluation research is based on comparisons of a treatment and control group. Hence, it is important that untreated individuals are not affected by the treatment. Many experiments involve some type of treatment even for the control group. For example, the treatment under study might be a job training program. However, in order to induce control group members to participate and stay in the experiment, they may be offered some counseling services on job search etc., although they receive no training. This would obviously affect the interpretation of the results.

However, more important are spillovers from the treatment of some individuals onto others. Such externalities would be obvious in public health treatments. For example, Miguel and Kremer (2004) study an experiment which involves deworming of school children in Kenya. They are interested in whether the health treatment leads to better health, more school attendance, and better schooling outcomes for the treated children. However, since the worms might infect others, treating one child reduces the health risks for children around her. These other children may be part of the treatment or the control group. Miguel and Kremer are primarily interested in these externalities.

In order to study the effects of treatments with externalities, you would have to randomize the entire unit with which the treated individual interacts. In the deworming example, this might be an entire village. In fact, in order to establish the effect both on the treated individual, and on other non-treated individuals when some neighbors are treated, you need a two stage randomization design, where some villages are treated, and others are not. Furthermore, within the treated villages, some individuals would be randomly assigned to receive the treatment, and others not (a small sketch of such an assignment follows below). The experiment analyzed by Miguel and Kremer does not have exactly this design. Instead, the randomization occurred at the level of schools. Hence, it is well suited to study the effect of the treatment, including externalities within schools. However, the authors go on and also make use of non-experimental aspects of the data collection. For example, they have information on siblings of treated children, who may attend a different (untreated) school. This allows them to study various externalities.

Externalities from treatments might arise in many guises. One case is where the experiment itself does not generate certain externalities, even though they are an important part of the process to be studied. An important example in economic settings is general equilibrium effects. Return to the Tennessee class size experiment. Reducing class size necessitates hiring more teachers. Even though the STAR experiment was a fairly large scale experiment, it was still small compared to the size of the teacher labor market. Hence, it might be relatively easy to obtain additional teachers to teach in the experiment. Implementing a small class policy more broadly, say in all of Tennessee or nationwide, might result in much more strain in the teacher labor market. The new teachers to be hired might be less experienced and therefore poorer teachers, they might be less interested in teaching etc. Small scale experiments are often not well suited to assess these general equilibrium impacts of a policy. Non-experimental variation is often much better suited to study these issues.
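As an illustration, here is a minimal sketch of the two stage randomization design described above (the village and household counts are hypothetical; this is not the Miguel and Kremer design, which randomized at the school level):

```python
import numpy as np

rng = np.random.default_rng(3)
n_villages, hh_per_village = 50, 20

# Stage 1: randomize which villages are treated at all
village_treated = rng.permutation(n_villages) < n_villages // 2

# Stage 2: within treated villages, randomize which households are treated
village = np.repeat(np.arange(n_villages), hh_per_village)
in_treated_village = village_treated[village]
D = in_treated_village & (rng.random(village.size) < 0.5)

# Three groups allow separating direct effects from spillovers:
# treated households, untreated households in treated villages
# (spillover group), and households in untreated villages (pure controls)
print("treated:", D.sum())
print("spillover group:", (in_treated_village & ~D).sum())
print("pure controls:", (~in_treated_village).sum())
```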

The no-interference condition, which is necessary for experimental results to recover the correct treatment effects, is often called the stable unit treatment value assumption (SUTVA) in the literature. It means that the treatment effect for individual i, $y_{1i} - y_{0i}$, is not affected by whether some other individual j is treated. That means exactly that externalities and general equilibrium effects are absent.

1.6 Attrition

Many social experiments involve treatments which last a long period of time. In the 1970s, various “Negative Income Tax” (NIT) experiments were performed in the US. Households were often subject to the experiment, or followed up as control group members, for a period of three to five years. As in any panel survey, survey members may move or stop responding for a variety of reasons. Such attrition not only reduces the sample size; if attrition is non-random, it might bias the results.

Ashenfelter and Plant (1990) study attrition in one of the NIT experiments, the Seattle and Denver Income Maintenance Experiments (SIME-DIME). In these experiments, treated households were offered means tested cash transfers, consisting of a specific grant, and a withdrawal or tax rate at which the grant was reduced for each dollar of earnings. The experiments consisted of 11 different treatments, i.e. combinations of grant amounts and tax rates. The authors parameterize the generosity of a grant/tax rate combination by the break-even level, which is the level of earnings above which an individual receives no more transfer. Table 1 summarizes the parameters of the various treatments.

Table 5 in the Ashenfelter and Plant paper demonstrates that attrition in the experiments is fairly high: above 20 percent in the least generous treatment. More importantly, attrition rates are systematically related to treatment generosity. The higher the break-even level, the lower the attrition rate. This makes perfect economic sense. If an experiment transfers more resources to a participant, the participant has more of an incentive to stick around and collect the money. Of course, not everybody in a treatment group is affected by this in the same way. Some individuals will have high wages or want to work a lot of hours, and hence will be ineligible even for the more generous transfers. These individuals are not affected by the generosity of the program, and their attrition is probably fairly insensitive to the program parameters. The individuals most likely not to attrit in the generous treatments are those who actually receive benefits because they have low wages or would like to work only few hours. But this means that

program generosity affects attrition differentially for different individuals, and this is likely to lead to biases in the estimated program effects. If attrition is non-random by treatment status, treatment is no longer randomly assigned in the remaining samples, and simple comparisons do not recover the true treatment effect.

Let A denote whether an individual attrits from the experiment. If A = 1, the individual leaves the experiment and we have no outcome data, and if A = 0, the individual remains in the sample. The treatment effect would be recovered by

$$E(y \mid D = 1) - E(y \mid D = 0)$$

Instead we estimate the effect on the sample subject to attrition, which is

$$E(y \mid D = 1, A = 0) - E(y \mid D = 0, A = 0) = E(y_1 \mid D = 1, A = 0) - E(y_0 \mid D = 0, A = 0)$$

However, if A is correlated with both D and $(y_0, y_1)$, then

$$E(y_1 \mid D = 1, A = 0) - E(y_0 \mid D = 0, A = 0) \neq E(y_1 \mid A = 0) - E(y_0 \mid A = 0)$$

because $D \perp (y_0, y_1)$ does not imply $D \perp (y_0, y_1)$ conditional on $A = 0$, so the estimate on the basis of the sample with attrition no longer recovers the treatment effect.

In order to see in which direction we would expect the bias from attrition to go in the Ashenfelter and Plant case, consider a simple example. For simplicity, consider only one treatment and a single control group, so we are in the $D \in \{0, 1\}$ case. Also, let there be two types of individuals, those with high wages and those with low wages. Low wage individuals work $h_l$ hours, and high wage individuals work $h_h > h_l$ hours, so the labor supply elasticity is positive. Any individual receiving the cash transfer will work $h_i - \Delta$ hours, so $\Delta$ is the treatment effect we are interested in. We do not observe the type of the individual (i.e. the wage) directly, we only observe hours. This may sound like a strong assumption, but the individual type could equally well be tied to other characteristics, like preferences, which are very difficult to observe.

Assume there are four individuals of each type, and two of each type receive the treatment. If we had all eight data points, we could compare mean hours of the treated and untreated, and uncover $\Delta$. Now assume there is attrition among all the high wage individuals, irrespective of treatment, and among the low wage individuals when they do not receive the treatment. This may be reasonable because the treated low wage individuals receive

the largest transfers, and hence have the most to gain from staying in the program. One of the individuals in each of the type/treatment group cells attrits, except for the low wage/D = 1 combination, where both stay. Using the five remaining data points:

$$E(y \mid D = 1, A = 0) - E(y \mid D = 0, A = 0) = \left( \frac{h_h + 2 h_l}{3} - \Delta \right) - \frac{h_h + h_l}{2} = -\Delta - \frac{h_h - h_l}{6}$$

Hence, the estimate using the sample which is subject to attrition overestimates the labor supply response to the program. This is because more low wage, and hence low hours, individuals remain in the treated sample than in the control sample.
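A few lines of code confirm the toy calculation (the values for $h_h$, $h_l$, and $\Delta$ are arbitrary illustrative choices):

```python
import numpy as np

h_h, h_l, delta = 40.0, 20.0, 5.0  # arbitrary illustrative values

# Remaining individuals after attrition: one high wage and two low wage
# treated (all working delta fewer hours), one of each type in the control
treated_remaining = np.array([h_h - delta, h_l - delta, h_l - delta])
control_remaining = np.array([h_h, h_l])

estimate = treated_remaining.mean() - control_remaining.mean()
print("estimate:", estimate)                        # -delta - (h_h - h_l)/6
print("true effect:", -delta)
print("bias:", estimate + delta, -(h_h - h_l) / 6)  # both -3.33...
```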

Typical analyses of the NIT experiments have looked at a variable like earnings (e.g. Ashenfelter, 1978). It is impossible to assess the impact of attrition on earnings since we do not have data on the earnings of those who leave the sample. Hence, Ashenfelter and Plant (1990) present results for transfer payments. This is not in itself a terribly interesting outcome variable, but it allows them to demonstrate and measure the effect of attrition. They assume that those who leave the sample receive zero transfer payments. This makes sense since there is actually administrative information on transfers. In table 3, Ashenfelter and Plant present estimates on the sample excluding those who attrit, similar to the previous literature. Table 6 computes the same estimates but includes missing individuals with a transfer payment of zero. It is fairly obvious that this will have to produce lower estimates of the program effects. A comparison of the table 3 and table 6 results reveals that the year 3 payments including the missing individuals are up to 50 percent lower than those estimates which exclude these individuals. However, for some treatment groups the differences are small. Nevertheless, this suggests that using the samples subject to attrition might yield estimates of the labor supply elasticity which are substantially overstated. Attrition seems to have a non-negligible impact on the results.

It is therefore useful in designing experiments to think about the attrition problem from the beginning. The NIT studies would probably have benefited from linking participants in the experiment directly to administrative records on earnings, for example those kept by the unemployment insurance system. In this case, individuals can be tracked even if they otherwise choose not to participate in the experiment any longer. Such a link has in fact been performed for some experiments ex post. On the other hand, administrative data may create other problems. For example, the unemployment insurance system may not cover the self-employed, and hence introduce other biases.

1.7 Hawthorne Effects

Most of the issues discussed above are not specific to experimental evaluation studies; they may arise similarly with non-experimental data. However, there is one issue that is very specific to experiments. That is the possibility that the individuals in the experiment may react to the fact that they are participating in an experiment, and not just to the treatment. This problem was first discovered by Elton Mayo in a study on the productivity of workers in the Hawthorne plant of the Western Electric Company in Cicero, Illinois, in the late 1920s. Mayo and his colleagues manipulated the environmental influences on workers (light, humidity, etc.) and found that productivity improved among all workers, not just those subject to the treatments. They concluded that this may be simply due to the fact that the workers were being studied. Participation in the study, maybe the idea of being “watched” by someone, seemed to be a much more important motivator than the treatments under study. This type of behavior of experimental subjects has since been dubbed the Hawthorne effect, after the name of the plant.

Hawthorne effects would be most important when they differentially affect treated and control individuals. Consider the Tennessee STAR experiment again. The teachers who were part of the study knew, of course, that the experiment was going on. Teachers have a strong interest in class size reduction. Smaller classes mean a higher demand for teachers, and hence higher wages in equilibrium. Smaller classes may also be more pleasant to teach and involve a smaller workload outside the classroom, for example less grading. Hence teachers had a stake in actually trying to influence the results of the experiment. The teachers assigned to small classes might have done a particularly good job for the duration of the experiment, or at least made sure that their pupils performed well on the tests. Teachers in large classes, on the other hand, might have withheld effort so that the experiment would produce a large class size effect. It is very difficult to rule out these behavioral responses, and they limit the usefulness of the results. Studies relying on non-experimental data are very unlikely to be subject to the same effect.

One way to overcome the possibility of Hawthorne effects in experiments

is to conduct them in a way that the participants do not recognize that they are participating in a research study. This is sometimes possible in field experiments. Karlan and Zinman (2005) conducted an experiment in conjunction with a bank offering loans to individuals at different interest rates. The assignment of customers to interest rates was randomized, but customers would not know that they were participating in an experiment. Obviously this is much easier in small scale experiments than in large scale social experiments, like Tennessee STAR.
