Harrison and List 2004 JEL.Pdf

dec04_Article 1 12/14/04 2:25 PM Page 1009

Journal of Economic Literature Vol. XLII (December 2004) pp. 1009–1055

Field Experiments

1 GLENN W. H ARRISON and JOHN A. LIST

1. Introduction experimental environment. We do not see the notion of a “sterile environment” as a n some sense every empirical researcher is negative, provided one recognizes its role in reporting the results of an experiment. I the research discovery process. In one Every researcher who behaves as if an exoge- sense, that sterility allows us to see in crisp nous variable varies independently of an error relief the effects of exogenous treatments on term effectively views their data as coming behavior. However, lab experiments in isola- from an experiment. In some cases this belief tion are necessarily limited in relevance for is a matter of a priori judgement; in some predicting field behavior, unless one wants cases it is based on auxiliary evidence and to insist a priori that those aspects of eco- inference; and in some cases it is built into the nomic behavior under study are perfectly design of the data collection process. But the general in a sense that we will explain. distinction is not always as bright and clear. Rather, we see the beauty of lab experi- Testing that assumption is a recurring difficul- ments within a broader context—when they ty for applied econometricians, and the search are combined with field data, they permit always continues for variables that might bet- sharper and more convincing inference.2 ter qualify as truly exogenous to the process In search of greater relevance, experi- under study. Similarly, the growing popularity mental economists are recruiting subjects in of explicit experimental methods arises in the field rather than in the classroom, using large part from the potential for constructing field goods rather than induced valuations, the proper counterfactual. and using field context rather than abstract Field experiments provide a meeting ground between these two broad approaches to empirical economic science. By examining the 2 When we talk about combining lab and field data, we nature of field experiments, we seek to make it do not just mean a summation of conclusions. Instead, we have in mind the two complementing each other in some a common ground between researchers. functional way, much as one might conduct several lab We approach field experiments from the experiments in order to tease apart potential confounds. perspective of the sterility of the laboratory For example, James Cox (2004) demonstrates nicely how “trust” and “reciprocity” are often confounded with “other regarding preferences,” and can be better identified sep- 1 Harrison: Department of Economics, College of arately if one undertakes several types of experiments Business Administration, University of Central Florida; with the same population. Similarly, Alvin Roth and List: Department of Agricultural and Resource Economics Michael Malouf (1979) demonstrate how the use of dollar and Department of Economics, University of Maryland, payoffs can confound tests of cooperative game theory and NBER. We are grateful to Stephen Burks, Colin with less information of one kind (knowledge of the utili- Camerer, Jeffrey Carpenter, Shelby Gerking, R. Mark ty function of the other player), and more information of Isaac, Alan Krueger, John McMillan, Andreas Ortmann, another kind (the ability to make interpersonal compar- Charles Plott, David Reiley, E. Elisabet Rutström, isons of monetary gain), than is usually assumed in the Nathaniel Wilcox, and the referees for generous comments. leading theoretical prediction. 1009 dec04_Article 1 12/14/04 2:25 PM Page 1010

1010 Journal of Economic Literature, Vol. XLII (December 2004)

terminology in instructions.3 We argue that Our second point is that many of the char- there is something methodologically funda- acteristics of field experiments can be found mental behind this trend. Field experiments in varying, correlated degrees in lab experi- differ from laboratory experiments in many ments. Thus, many of the characteristics that ways. Although it is tempting to view field people identify with field experiments are experiments as simply less controlled variants not only found in field experiments, and of laboratory experiments, we argue that to do should not be used to differentiate them so would be to seriously mischaracterize them. from lab experiments. What passes for “control” in laboratory exper- Our third point, following from the first iments might in fact be precisely the opposite two, is that there is much to learn from field if it is artificial to the subject or context of the experiments when returning to the lab. The task. In the end, we see field experiments as unexpected behaviors that occur when one being methodologically complementary to loosens control in the field are often indica- traditional laboratory experiments.4 tors of key features of the economic transac- Our primary point is that dissecting the tion that have been neglected in the lab. Thus, characteristics of field experiments helps field experiments can help one design better define what might be better called an ideal lab experiments, and have a methodological experiment, in the sense that one is able to role quite apart from their complementarity observe a subject in a controlled setting but at a substantive level. where the subject does not perceive any of In section 2 we offer a typology of field the controls as being unnatural and there is no experiments in the literature, identifying the deception being practiced. At first blush, the key characteristics defining the species. We idea that one can observe subjects in a natural suggest some terminology to better identify setting and yet have controls might seem different types of field experiments, or more contradictory, but we will argue that it is not.5 accurately to identify different characteristics of field experiments. We do not propose 3 We explain this jargon from experimental economics a bright line to define some experiments as below. field experiments and others as something 4 This view is hardly novel: for example, in decision research, Robert Winkler and Allan Murphy (1973) pro- else, but a set of criteria that one would vide an excellent account of the difficulties of reconciling expect to see in varying degrees in a field suboptimal probability assessments in artefactual laborato- experiment. We propose six factors that can ry settings with field counterparts, as well as the limitations of applying inferences from laboratory data to the field. be used to determine the field context of an 5 Imagine a classroom setting in which the class breaks experiment: the nature of the subject pool, up into smaller tutorial groups. In some groups a video cov- the nature of the information that the sub- ering certain material is presented, in another group a free discussion is allowed, and in another group there is a more jects bring to the task, the nature of the com- traditional lecture. Then the scores of the students in each modity, the nature of the task or trading rules group are examined after they have taken a common exam. applied, the nature of the stakes, and the Assuming that all of the other features of the experiment are controlled, such as which student gets assigned to which environment in which the subjects operate. group, this experiment would not seem unnatural to the sub- Having identified what defines a field exper- jects. They are all students doing what comes naturally to iment, in section 3 we put experiments in students, and these three teaching alternatives are each stan- dardly employed. Along similar lines in economics, albeit general into methodological perspective, as with simpler technology and less control than one might like, one of the ways that economists can identify see Edward Duddy (1924). For recent novel examples in the treatment effects. This serves to remind us economics literature, see Colin Camerer (1998) and David Lucking-Reiley (1999). Camerer (1998) places bets at a race why we want control and internal validity in track to examine if asset markets can be manipulated, while all such analyses, whether or not they consti- Lucking-Reiley (1999) uses internet-based auctions in a pre- tute field experiments. In sections 4 through existing market with an unknown number of participating bidders to test the theory of revenue equivalence between 6 we describe strengths and weaknesses of four major single-unit auction formats. the broad types of field experiments. Our dec04_Article 1 12/14/04 2:25 PM Page 1011

Harrison and List: Field Experiments 1011

literature review is necessarily selective, material, language, animal, etc., and not in although List (2004d) offers a more complete the laboratory, study, or office.” This orients bibliography. us to think of the natural environment of the In sections 7 and 8 we review two types of different components of an experiment.6 experiments that may be contrasted with It is important to identify what factors ideal field experiments. One is called a social make up a field experiment so that we can experiment, in the sense that it is a deliber- functionally identify what factors drive ate part of social policy by the government. results in different experiments. To provide Social experiments involve deliberate, ran- a direct example of the type of problem that domized changes in the manner in which motivated us, when List (2001) obtains some government program is implemented. results in a field experiment that differ from They have become popular in certain areas, the counterpart lab experiments of Ronald such as employment schemes and the detec- Cummings, Glenn Harrison, and Laura tion of discrimination. Their disadvantages Osborne (1995) and Cummings and Laura have been well documented, given their Taylor (1999), what explains the difference? political popularity, and there are several Is it the use of data from a particular market important methodological lessons from those whose participants have selected into the debates for the design of field experiments. market instead of student subjects; the use The other is called a “natural experiment.” of subjects with experience in related tasks; The idea is to recognize that some event that the use of private sports-cards as the under- naturally occurs in the field happens to have lying commodity instead of an environmen- some of the characteristics of a field experi- tal public good; the use of streamlined ment. These can be attractive sources of data instructions, the less-intrusive experimental on large-scale economic transactions, but methods, mundane experimenter effects, or usually at some cost due to the lack of con- is it some combination of these and similar trol, forcing the researcher to make certain identification assumptions. 6 If we are to examine the role of “controls” in different Finally, in section 9 we briefly examine experimental settings, it is appropriate that this word also be defined carefully. The OED (2nd ed.) defines the verb related types of experiments of the mind. In “control” in the following manner: “To exercise restraint or one case these are the “thought experi- direction upon the free action of; to hold sway over, exer- ments” of theorists and statisticians, and in cise power or authority over; to dominate, command.” So the word means something more active and intervention- the other they are the “neuro-economics ist than is suggested by its colloquial clinical usage. experiments” provided by technology. The Control can include such mundane things as ensuring ster- objective is simply to identify how they differ ile equipment in a chemistry lab, to restrain the free flow of germs and unwanted particles that might contaminate from other types of experiments we consider, some test. But when controls are applied to human behav- and where they fit in. ior, we are reminded that someone’s behavior is being restrained to be something other than it would otherwise 2. Defining Field Experiments be if the person were free to act. Thus we are immediate- ly on alert to be sensitive, when studying responses from a There are several ways to define words. controlled experiment, to the possibility that behavior is One is to ascertain the formal definition by unusual in some respect. The reason is that the very control that defines the experiment may be putting the sub- looking it up in the dictionary. Another is to ject on an artificial margin. Even if behavior on that identify what it is that you want the word- margin is not different than it would be without the con- label to differentiate. trol, there is the possibility that constraints on one margin may induce effects on behavior on unconstrained margins. The Oxford English Dictionary (Second This point is exactly the same as the one made in the “the- Edition) defines the word “field” in the fol- ory of the second best” in public policy. If there is some lowing manner: “Used attributively to immutable constraint on one of the margins defining an optimum, it does not automatically follow that removing a denote an investigation, study, etc., carried constraint on another margin will move the system closer out in the natural environment of a given to the optimum. dec04_Article 1 12/14/04 2:25 PM Page 1012

1012 Journal of Economic Literature, Vol. XLII (December 2004)

differences? We believe field experiments (McKinley Blackburn, Harrison, and have matured to the point that some frame- Rutström 1994). Alternatively, the subject work for addressing such differences in a pool can be designed to represent a target systematic manner is necessary. population of the economy (e.g., traders at the Chicago Board of Trade in Michael 2.1 Criteria that Define Field Experiments Haigh and John List 2004) or the general Running the risk of oversimplifying what population (e.g., the Danish population in is inherently a multidimensional issue, we Harrison, Morton Igel Lau, and Melonie propose six factors that can be used to deter- Williams 2002). mine the field context of an experiment: In addition, nonstandard subject pools • the nature of the subject pool, might bring experience with the commodity • the nature of the information that the or the task to the experiment, quite apart subjects bring to the task, from their wider array of demographic char- • the nature of the commodity, acteristics. In the field, subjects bring cer- • the nature of the task or trading rules tain information to their trading activities in applied, addition to their knowledge of the trading • the nature of the stakes, and institution. In abstract settings the impor- • the nature of the environment that the tance of this information is diminished, by subject operates in. design, and that can lead to behavioral We recognize at the outset that these changes. For example, absent such informa- characteristics will often be correlated to tion, risk aversion can lead to subjects varying degrees. Nonetheless, they can be requiring a risk premium when bidding for used to propose a taxonomy of field experi- objects with uncertain characteristics. ments that will, we believe, be valuable as The commodity itself can be an impor- comparisons between lab and field experi- tant part of the field. Recent years have mental results become more common. seen a growth of experiments concerned Student subjects can be viewed as the with eliciting valuations over actual goods, standard subject pool used by experi- rather than using induced valuations over menters, simply because they are a conven- virtual goods. The distinction here is ience sample for academics. Thus when one between physical goods or actual services goes “outdoors” and uses field subjects, they and abstractly defined goods. The latter should be viewed as nonstandard in this have been the staple of experimental eco- sense. But we argue that the use of nonstan- nomics since Edward Chamberlin (1948) dard subjects should not automatically qual- and Vernon Smith (1962), but imposes an ify the experiment as a field experiment. The artificiality that could be a factor influenc- experiments of Cummings, Harrison, and E. ing behavior.7 Such influences are actually Elizabet Rutström (1995), for example, used of great interest, or should be. If the nature individuals recruited from churches in order of the commodity itself affects behavior in to obtain a wider range of demographic a way that is not accounted for by the the- characteristics than one would obtain in the ory being applied, then the theory has at standard college setting. The importance of best a limited domain of applicability that a nonstandard subject pool varies from we should be aware of, and at worse is sim- experiment to experiment: in this case it simply false. In either case, one can better ply provided a less concentrated set of socio- demographic characteristics with respect to 7 It is worth noting that neither Chamberlin (1948) nor age and education level, which turned out to Smith (1962) used real payoffs to motivate subjects in their market experiments, although Smith (1962) does explain be important when developing statistical how that could be done and reports one experiment (fn 9., models to adjust for hypothetical bias p. 121) in which monetary payoffs were employed. dec04_Article 1 12/14/04 2:25 PM Page 1013

Harrison and List: Field Experiments 1013

understand the limitations of the generality this is an important component of the inter- of theory only via empirical testing.8 play between the lab and field. Early illus- Again, however, just having one field char- trations of the value of this approach include acteristic, in this case a physical good, does David Grether, R. Mark Isaac, and Charles not constitute a field experiment in any fun- Plott [1981, 1989], Grether and Plott [1984], damental sense. Rutström (1998) sold lots and James Hong and Plott [1982]. and lots of chocolate truffles in a laboratory The nature of the stakes can also affect study of different auction institutions field responses. Stakes in the laboratory designed to elicit values truthfully, but hers might be very different than those encoun- was very much a lab experiment despite the tered in the field, and hence have an effect tastiness of the commodity. Similarly, Ian on behavior. If valuations are taken seriously Bateman et al. (1997) elicited valuations over when they are in the tens of dollars, or in the pizza and dessert vouchers for a local restau- hundreds, but are made indifferently when rant. While these commodities were not the price is less than one dollar, laboratory or actual pizza or dessert themselves, but field experiments with stakes below one dol- vouchers entitling the subject to obtain lar could easily engender imprecise bids. Of them, they are not abstract. There are many course, people buy inexpensive goods in the other examples in the experimental literature field as well, but the valuation process they of designs involving physical commodities.9 use might be keyed to different stake levels. The nature of the task that the subject is Alternatively, field experiments in relatively being asked to undertake is an important poor countries offer the opportunity to eval- component of a field experiment, since one uate the effects of substantial stakes within a would expect that field experience could given budget. play a major role in helping individuals The environment of the experiment can develop heuristics for specific tasks. The lab also influence behavior. The environment experiments of John Kagel and Dan Levin can provide context to suggest strategies and (1999) illustrate this point, with “super-expe- heuristics that a lab setting might not. Lab rienced” subjects behaving differently than experimenters have always wondered inexperienced subjects in terms of their whether the use of classrooms might engen- propensity to fall prey to the winners’ curse. der role-playing behavior, and indeed this is An important question is whether the suc- one of the reasons experimental economists cessful heuristics that evolve in certain field are generally suspicious of experiments settings “travel” to the other field and lab without salient monetary rewards. Even settings (Harrison and List 2003). Another with salient rewards, however, environmen- aspect of the task is the specific parameteri- tal effects could remain. Rather than view zation that is adopted in the experiment. them as uncontrolled effects, we see them as One can conduct a lab experiment with worthy of controlled study. parameter values estimated from the field 2.2 A Proposed Taxonomy data, so as to study lab behavior in a “field- relevant” domain. Since theory is often Any taxonomy of field experiments runs domain-specific, and behavior can always be, the risk of missing important combinations of the factors that differentiate field experiments from conventional lab experiments. 8 To use the example of Chamberlin (1948) again, List (2004e) takes the natural next step by exploring the pre- There is some value, however, in having dictive power of neoclassical theory in decentralized, nat- broad terms to differentiate what we see as urally occurring field markets. the key differences. We propose the following 9 We would exclude experiments in which the commodity was a gamble, since very few of those gambles take terminology: the form of naturally occurring lotteries. •a conventional lab experiment is one dec04_Article 1 12/14/04 2:25 PM Page 1014

1014 Journal of Economic Literature, Vol. XLII (December 2004)

that employs a standard subject pool of extent of discrimination in the sports-card students, an abstract framing, and an marketplace. imposed10 set of rules; • an artefactual field experiment is the 3. Methodological Importance of Field same as a conventional lab experiment Experiments 11 but with a nonstandard subject pool; Field experiments are methodologically •a framed field experiment is the same as important because they mechanically force an artefactual field experiment but with the rest of us to pay attention to issues that field context in either the commodity, great researchers seem to intuitively task, or information set that the subjects address. These issues cannot be comfortably 12 can use; forgotten in the field, but they are of more •a natural field experiment is the same as general importance. a framed field experiment but where The goal of any evaluation method for the environment is one where the sub- “treatment effects” is to construct the prop- jects naturally undertake these tasks er counterfactual, and economists have and where the subjects do not know spent years examining approaches to this 13 that they are in an experiment. problem. Consider five alternative methods We recognize that any such taxonomy of constructing the counterfactual: con- leaves gaps, and that certain studies may not trolled experiments, natural experiments, fall neatly into our classification scheme. propensity score matching (PSM), instru- Moreover, it is often appropriate to con- mental variables (IV) estimation, and struc- duct several types of experiments in order to tural approaches. Define y1 as the outcome identify the issue of interest. For example, with treatment, y0 as the outcome without Harrison and List (2003) conducted artefac- treatment, and let T1 when treated and tual field experiments and framed field T0 when not treated.14 The treatment experiments with the same subject pool, pre- effect for unit i can then be measured as cisely to identify how well the heuristics that τ i yi1 yi0. The major problem, however, is might apply naturally in the latter setting τ one of a missing counterfactual: i is “travel” to less context-ridden environments unknown. If we could observe the outcome found in the former setting. And List (2004b) for an untreated observation had it been conducted artefactual, framed, and natural treated, then there is no evaluation problem. experiments to investigate the nature and “Controlled” experiments, which include laboratory experiments and field experi- 10 The fact that the rules are imposed does not imply ments, represent the most convincing that the subjects would reject them, individually or social- method of creating the counterfactual, since ly, if allowed. 11 they directly construct a control group via To offer an early and a recent example, consider the 15 risk-aversion experiments conducted by Hans Binswanger randomization. In this case, the population (1980, 1981) in India, and Harrison, Lau, and Williams (2002), who took the lab experimental design of Maribeth Coller and Melonie Williams (1999) into the field with a 14 We simplify by considering a binary treatment, but representative sample of the Danish population. the logic generalizes easily to multiple treatment levels and 12 For example, the experiments of Peter Bohm continuous treatments. Obvious examples from outside (1984b) to elicit valuations for public goods that occurred economics include dosage levels or stress levels. In eco- naturally in the environment of subjects, albeit with nomics, one might have some measure of risk aversion or unconventional valuation methods; or the Vickrey auctions “other regarding preferences” as a continuous treatment. and “cheap talk” scripts that List (2001) conducted with 15 Experiments are often run in which the control is pro- sport-card collectors, using sports cards as the commodity vided by theory, and the objective is to assess how well the- and at a show where they trade such commodities. ory matches behavior. This would seem to rule out a role 13 For example, the manipulation of betting markets by for randomization, until one recognizes that some implicit Camerer (1998) or the solicitation of charitable contribu- or explicit error structure is required in order to test theo- tions by List and Lucking-Reiley (2002). ries meaningfully. We return to this issue in section 8. dec04_Article 1 12/14/04 2:25 PM Page 1015

Harrison and List: Field Experiments 1015

average treatment effect is given by individuals with the same value for these fac- τ ∗ ∗ ∗ ∗ y 1 y 0, where y 1 and y 0 are the treat- tors will display homogenous responses to ed and nontreated average outcomes after the treatment, then the treatment effect can the treatment. We have much more to say be measured without bias. In effect, one can about controlled experiments, in particular use statistical methods to identify which two field experiments, below. individuals are “more homogeneous lab rats” “Natural experiments” consider the treat- for the purposes of measuring the treatment ment itself as an experiment and find a natu- effect. More formally, the solution advocated rally occurring comparison group to mimic is to find a vector of covariates, Z, such that τ ⊥ ∈ the control group: is measured by compar- y1,y0 T | Z and pr(T 1 | Z) (0,1), where ing the difference in outcomes before and ⊥ denotes independence.16 after for the treated group with the before Another alternative to the DID model is and after outcomes for the nontreated group. the use of instrumental variables (IV), which Estimation of the treatment effect takes the approaches the structural econometric τ η form Yit Xit Tit it, where i indexes the method in the sense that it relies on exclusion unit of observation, t indexes years, Yit is the restrictions (Joshua D. Angrist, Guido W. outcome in cross-section i at time t, Xit is a Imbens, and Donald B. Rubin 1996; and vector of controls, Tit is a binary variable, Joshua D. Angrist and Alan B. Krueger 2001). η α λ ε it i t it, and t is the difference-in-dif- The IV method, which essentially assumes ferences (DID) average treatment effect. If that some components of the non-experi- we assume that data exists for two periods, mental data are random, is perhaps the most τ t∗ t∗ u∗ u∗ then (y t1 y t0) (y t1 y t0) where, widely utilized approach to measuring treat- t∗ for example, y t1 is the mean outcome for ment effects (Mark Rosenzweig and Kenneth the treated group. Wolpin 2000). The crux of the IV approach is A major identifying assumption in DID to find a variable that is excluded from the estimation is that there are no time-varying, outcome equation, but which is related to unit-specific shocks to the outcome variable treatment status and has no direct association that are correlated with treatment status, with the outcome. The weakness of the IV and that selection into treatment is inde- approach is that such variables do not often pendent of temporary individual-specific exist, or that unpalatable assumptions must η α λ effect: E( it | Xit, Dit) E( i | Xit, Dit) t. If be maintained in order for them to be used to ε τ it, and are related, DID is inconsistently identify the treatment effect of interest. τt τ ε ε estimated as E( ) E( it1 it0 | D 1) A final alternative to the DID model is ε ε E( it1 it0 | D 0). structural modeling. Such models often entail One alternative method of assessing the a heavy mix of identifying restrictions (e.g., impact of the treatment is the method of propensity score matching (PSM) developed 16 If one is interested in estimating the average treat- in P. Rosenbaum and Donald Rubin (1983). ment effect, only the weaker condition E(y0|T 1, This method has been used extensively in Z) E(y0|T 0, Z) E(y0 | Z) is required. This assumption is the debate over experimental and nonexper- called the “conditional independence assumption,” and intuitively means that given Z, the nontreated outcomes imental evaluation of treatment effects initi- are what the treated outcomes would have been had they ated by Lalonde (1986): see Rajeev Dehejia not been treated. Or, likewise, that selection occurs only and Sadek Wahba (1999, 2002) and Jeffrey on observables. Note that the dimensionality of the problem, as measured by Z, may limit the use of matching. A Smith and Petra Todd (2000). The goal of more feasible alternative is to match on a function of Z. PSM is to make non-experimental data “look Rosenbaum and Rubin (1983, 1984) showed that matching like” experimental data. The intuition on p(Z) instead of Z is valid. This is usually carried out on the “propensity” to get treated p(Z), or the propensity behind PSM is that if the researcher can score, which in turn is often implemented by a simple pro- select observable factors so that any two bit or logit model with T as the dependent variable. dec04_Article 1 12/14/04 2:25 PM Page 1016

1016 Journal of Economic Literature, Vol. XLII (December 2004)

separability), impose structure on technology could be applied to real people, but to actu- and preferences (e.g., constant returns to ally do so entails some serious and often scale or unitary income elasticities), and sim- unattractive logistical problems.19 plifying assumptions about equilibrium out- A more substantial response to this criti- comes (e.g., zero-profit conditions defining cism is to consider what it is about students equilibrium industrial structure). Perhaps the that is viewed, a priori, as being nonrepre- best-known class of such structural models is sentative of the target population. There are computable general equilibrium models, at least two issues here. The first is whether which have been extensively applied to evalu- endogenous sample selection or attrition has ate trade policies, for example.17 It typically occurred due to incomplete control over relies on complex estimation strategies, but recruitment and retention, so that the yields structural parameters that are well- observed sample is unreliable in some statis- suited for ex ante policy simulation, provided tical sense (e.g., generating inconsistent esti- one undertakes systematic sensitivity analysis mates of treatment effects). The second is of those parameters.18 In this sense, structur- whether the observed sample can be informal models have been the cornerstone of non- ative on the behavior of the population, experimental evaluation of tax and welfare assuming away sample selection issues. policies (R. Blundell and Thomas MaCurdy 4.2 Sample Selection in the Field 1999; and Blundell and M. Costas Dias 2002). Conventional lab experiments typically 4. Artefactual Field Experiments use students who are recruited after being told only general statements about the 4.1 The Nature of the Subject Pool experiment. By and large, recruitment pro- A common criticism of the relevance of cedures avoid mentioning the nature of the inferences drawn from laboratory experi- task, or the expected earnings. Most lab ments is that one needs to undertake an experiments are also one-shot, in the sense experiment with “real” people, not students. that they do not involve repeated observa- This criticism is often deflected by experi- tions of a sample subject to attrition. Of menters with the following imperative: if you course, neither of these features is essential. think that the experiment will generate differ- If one wanted to recruit subjects with specif- ent results with “real” people, then go ahead ic interest in a task, it would be easy to do and run the experiment with real people. A (e.g., Peter Bohm and Hans Lind 1993). And variant of this response is to challenge the crit- if one wanted to recruit subjects for several sessions, to generate “super-experienced” ics’ assertion that students are not representa- 20 tive. As we will see, this variant is more subtle subjects or to conduct pre-tests of such things as risk aversion, trust, or “other- and constructive than the first response. 21 The first response, to suggest that the crit- regarding preferences,” that could be built ic run the experiment with real people, is into the design as well. often adequate to get rid of unwanted refer- One concern with lab experiments con- ees at academic journals. In practice, howev- ducted with convenience samples of students er, few experimenters ever examine field behavior in a serious and large-sample way. 19 Or one can use “real” nonhuman species: see John It is relatively easy to say that the experiment Kagel, Don MacDonald, and Raymond Battalio (1990) and Kagel, Battalio, and Leonard Green (1995) for dramatic demonstrations of the power of economic theory to organ- 17 For example, the evaluation of the Uruguay Round ize data from the animal kingdom. of multilateral trade liberalization by Harrison, Thomas 20 For example, John Kagel and Dan Levin (1986, 1999, Rutherford, and David Tarr (1997). 2002). 18 For example, see Harrison and H.D. Vinod (1992). 21 For example, Cox (2004). dec04_Article 1 12/14/04 2:25 PM Page 1017

Harrison and List: Field Experiments 1017

is that students might be self-selected in allows one to remove this recruitment bias some way, so that they are a sample that from the resulting inference. excludes certain individuals with characteris- Some field experiments face a more seri- tics that are important determinants of ous problem of sample selection that underlying population behavior. Although depends on the nature of the task. Once the this problem is a severe one, its potential experiment has begun, it is not as easy as it is importance in practice should not be in the lab to control information flow about overemphasized. It is always possible to sim- the nature of the task. This is obviously a ply inspect the sample to see if certain strata matter of degree, but can lead to endoge- of the population are not represented, at nous subject attrition from the experiment. least under the tentative assumption that it is Such attrition is actually informative about only observables that matter. In this case it subject preferences, since the subject’s exit would behoove the researcher to augment from the experiment indicates that the sub- the initial convenience sample with a quota ject had made a negative evaluation of it sample, in which the missing strata were sur- (Tomas Philipson and Larry Hedges 1998). veyed. Thus one tends not to see many con- The classic problem of sample selection victed mass murderers or brain surgeons in refers to possible recruitment biases, such student samples, but we certainly know that the observed sample is generated by a where to go if we feel the need to include process that depends on the nature of the them in our sample. experiment. This problem can be serious for Another consideration, of increasing any experiment, since a hallmark of virtually importance for experimenters, is the possi- every experiment is the use of some ran- bility of recruitment biases in our proce- domization, typically to treatment.22 If the dures. One aspect of this issue is studied by population from which volunteers are being Rutström (1998). She examines the role of recruited has diverse risk attitudes and plau- recruitment fees in biasing the samples of sibly expects the experiment to have some subjects that are obtained. The context for element of randomization, then the her experiment is particularly relevant here observed sample will tend to look less risk- since it entails the elicitation of values for a averse than the population. It is easy to private commodity. She finds that there are imagine how this could then affect behavior some significant biases in the strata of the differentially in some treatments. James population recruited as one varies the Heckman and Jeffrey Smith (1995) discuss recruitment fee from zero dollars to two dol- this issue in the context of social experi- lars, and then up to ten dollars. An important ments, but the concern applies equally to finding, however, is that most of those biases field and lab experiments. can be corrected simply by incorporating the 4.3 Are Students Different? relevant characteristics in a statistical model of the behavior of subjects and thereby con- This question has been addressed in sev- trolling for them. In other words, it does not eral studies, including early artefactual field matter if one group of subjects in one treat- experiments by Sarah Lichtenstein and Paul ment has 60 percent females and the other Slovic (1973), and Penny Burns (1985). sample of subjects in another treatment has Glenn Harrison and James Lesley (1996) only 40 percent females, provided one con- (HL) approach this question with a simple trols for the difference in gender when pool- statistical framework. Indeed, they do not ing the data and examining the key treatment consider the issue in terms of the relevance effect. This is a situation in which gender might influence the response or the effect of 22 If not to treatment, then randomization often occurs the treatment, but controlling for gender over choices to determine payoff. dec04_Article 1 12/14/04 2:25 PM Page 1018

1018 Journal of Economic Literature, Vol. XLII (December 2004)

of experimental methods, but rather in the subject was asked whether he or she terms of the relevance of convenience sam- would be willing to pay $X towards a public ples for the contingent valuation method.23 good, where $X was randomly selected to However, it is easy to see that their methods be $10, $30, $60, or $120. A subject would apply much more generally. respond to this question with a “yes,” a The HL approach may be explained in “no,” or a “not sure.” A simple statistical terms of their attempt to mimic the results model is developed to explain behavior as a of a large-scale national survey conducted function of the observable socioeconomic for the Exxon Valdez oil-spill litigation. A characteristics.24 major national survey was undertaken in this Assuming that a statistical model has case by Richard Carson et al. (1992) for the been developed, HL then proceeded to the attorney general of the state of Alaska. This key stage of their method. This is to assume survey used then-state-of-the-art survey that the coefficient estimates from the statis- methods but, more importantly for present tical model based on the student sample purposes, used a full probability sample of apply to the population at large. If this is the the nation. HL asked if one can obtain case, or if this assumption is simply main- essentially the same results using a conven- tained, then the statistical model may be ience sample of students from the University used to predict the behavior of the target of South Carolina. Using students as a con- population if one can obtain information venience sample is largely a matter of about the socioeconomic characteristics of methodological bravado. One could readily the target population. obtain convenience samples in other ways, The essential idea of the HL method is but using students provides a tough test of simple and more generally applicable than their approach. this example suggests. If students are repre- They proceeded by developing a simpler sentative in the sense of allowing the survey instrument than the one used in the researcher to develop a “good” statistical original study. The purpose of this is purely model of the behavior under study, then to facilitate completion of the survey and is one can often use publicly available infor- not essential to the use of the method. This mation on the characteristics of the target survey was then administered to a relatively population to predict the behavior of that large sample of students. An important part population. Their fundamental point is that of the survey, as in any field survey that aims the “problem with students” is the lack of to control for subject attributes, is the col- variability in their socio-demographic char- lection of a range of standard socioeconom- acteristics, not necessarily the unrepresen- ic characteristics of the individual (e.g., sex, tativeness of their behavioral responses age, income, parental income, household conditional on their socio-demographic size, and marital status). Once these data characteristics. are collated, a statistical model is developed To the extent that student samples exhibit in order to explain the key responses in the limited variability in some key characteris- survey. In this case the key response is a tics, such as age, then one might be wary of simple “yes” or “no” to a single dichotomous the veracity of the maintained assumption choice valuation question. In other words, involved here. However, the sample does not have to look like the population in order for 23 The contingent valuation method refers to the use of the statistical model to be an adequate one hypothetical field surveys to value the environment, by posing a scenario that asks the subject to place a value on 24 The exact form of that statistical model is not impor- an environmental change contingent on a market for it tant for illustrative purposes, although the development of existing. See Cummings and Harrison (1994) for a critical an adequate statistical model is important to the reliability review of the role of experimental economics in this field. of this method. dec04_Article 1 12/14/04 2:25 PM Page 1019

Harrison and List: Field Experiments 1019

for predicting the population response.25 All The reason is simple to understand. It is that is needed is for the behavioral respons- much easier to predict the behavior of a 26- es of students to be the same as the behav- year-old when one has a model that is based ioral responses of nonstudents. This can on the behavior of people whose ages range either be assumed a priori or, better yet, from 21 to 79 than it is to estimate the tested by sampling nonstudents as well as behavior of a 69-year-old based on the students. behavioral model from a sample whose ages Of course, it is always better to be fore- range from 19 to 27. casting on the basis of an interpolation What is the relevance of these methods for rather than an extrapolation, and that is the the original criticism of experimental proce- most important problem one has with stu- dures? Think of the experimental subjects as dent samples. This issue is discussed in some the convenience sample in the HL approach. detail by Blackburn, Harrison, and Rutström The lessons that are learned from this stu- (1994). They estimated a statistical model of dent sample could be embodied in a statisti- subject response using a sample of college cal model of their behavior, with implications students and also estimated a statistical drawn for a larger target population. model of subject response using field sub- Although this approach rests on an assump- jects drawn from a wide range of churches in tion that is as yet untested, concerning the the same urban area. Each were conven- representativeness of student behavioral ience samples. The only difference is that responses conditional on their characteris- the church sample exhibited a much wider tics, it does provide a simple basis for evalu- variability in their socio-demographic char- ating the extent to which conclusions about acteristics. In the church sample, ages students apply to a broader population. ranged from 21 to 79; in the student sample, How could this method ever lead to inter- ages ranged from 19 to 27. When predicting esting results? The answer depends on the behavior of students based on the church- context. Consider a situation in which the estimated behavioral model, interpolation behavioral model showed that age was an was used and the predictions were extreme- important determinant of behavior. Consider ly accurate. In the reverse direction, howev- further a situation in which the sample used er, when predicting church behavior from to estimate the model had an average age that the student-estimated behavioral model, the was not representative of the population as a predictions were disastrous in the sense of whole. In this case, it is perfectly possible that having extremely wide forecast variances.26 the responses of the student sample could be quite different than the predicted responses 25 For example, assume a population of 50 percent men of the population. Although no such instances and 50 percent women, but where a sample drawn at ran- have appeared in the applications of this dom happens to have 60 percent men. If responses differ method thus far, they should not be ruled out. according to sex, predicting the population is simply a matter of reweighting the survey responses. We conclude, therefore, that many of the 26 On the other hand, reporting large variances may be concerns raised by this criticism, while valid, the most accurate reflection of the wide range of valua- are able to be addressed by simple exten- tions held by this sample. We should not always assume that distributions with smaller variances provide more sions of the methods that experimenters cur- accurate reflections of the underlying population just rently use. Moreover, these extensions because they have little dispersion; for this to be true, would increase the general relevance of many auxiliary assumptions about randomness of the sampling process must be assumed, not to mention issues experimental methods obtained with student about the stationarity of the underlying population convenience samples. process. This stationarity is often assumed away in contin- Further problems arise if one allows unob- gent valuation research (e.g., the proposal to use double- bounded dichotomous choice formats without allowing for served individual effects to play a role. In possible correlation between the two questions). some statistical settings it is possible to allow dec04_Article 1 12/14/04 2:25 PM Page 1020

1020 Journal of Economic Literature, Vol. XLII (December 2004)

for those effects by means of “fixed effect” or beauty. Again, the immediate implication is “random effects” analyses. But these stan- to collect a standard battery of measures of dard devices, now quite common in the tool- individual characteristics to allow some sta- kit of experimental economists, do not tistical comparisons of conditional treatment address a deeper problem. The internal effects to be drawn.27 But even here we can validity of a randomized design is maximized only easily condition on observable charac- when one knows that the samples in each teristics, and additional identifying assump- treatment are identical. This happy extreme tions will be needed to allow for correlated leads many to infer that matching subjects differences in unobservables. on a finite set of characteristics must be bet- 4.4 Precursors ter in terms of internal validity than not matching them on any characteristics. Several experimenters have used artefac- But partial matching can be worse than tual field experiments; that is, they have no matching. The most important example deliberately sought out subjects in the of this is due to James Heckman and Peter “wild,” or brought subjects from the “wild” Siegelman (1993) and Heckman (1998), into labs. It is notable that this effort has who critique paired-audit tests of discrimi- occurred from the earliest days of experi- nation. In these experiments, two applicants mental economics, and that it has only for a job are matched in terms of certain recently become common. observables, such as age, sex, and education, Lichtenstein and Slovic (1973) replicated and differ in only one protected characteris- their earlier experiments on “preference tic, such as race. However, unless some reversals” in “… a nonlaboratory real-play extremely strong assumptions about how setting unique to the experimental litera- characteristics map into wages are made, ture on decision processes—a casino in there will be a predetermined bias in out- downtown Las Vegas” (p. 17). The experi- comes. The direction of the bias “depends,” menter was a professional dealer, and the and one cannot say much more. A metaphor subjects were drawn from the floor of the from Heckman (1998, p. 110) illustrates: casino. Although the experimental equip- Boys and girls of the same age are in a high- ment may have been relatively forbidding jump competition, and jump the same (it included a PDP-7 computer, a DEC-339 height on average. But boys have a higher CRT, and a keyboard), the goal was to iden- variance in their jumping technique, for any tify gamblers in their natural habitat. The number of reasons. If the bar is set very low subject pool of 44 did include seven known relative to the mean, then the girls will look dealers who worked in Las Vegas, and the like better jumpers; if the bar is set very “… dealer’s impression was that the game high then the boys will look like better attracted a higher proportion of profession- jumpers. The implications for numerous al and educated persona than the usual (lab and field) experimental studies of the casino clientele” (p. 18). effect of gender, that do not control for Kagel, Battalio, and James Walker (1979) other characteristics, should be apparent. provide a remarkable, early examination of This metaphor also serves to remind us many of the issues we raise. They were con- that what laboratory experimenters think of cerned with “volunteer artifacts” in lab as a “standard population” need not be a experiments, ranging from the characteristics homogeneous population. Although stu- that volunteers have to the issue of sample dents from different campuses in a given country may have roughly the same age, 27 George Lowenstein (1999) offers a similar criticism of the popular practice in experimental economics of not they can differ dramatically in influential conditioning on any observable characteristics or random- characteristics such as intelligence and izing to treatment from the same population. dec04_Article 1 12/14/04 2:25 PM Page 1021

Harrison and List: Field Experiments 1021

selection bias.28 They conducted a field reason for trade in this environment.30 The experiment in the homes of the volunteer major empirical result is the large number of subjects, examining electricity demand in observed price bubbles: fourteen of the 22 response to changes in prices, weekly feed- experiments can be said to have had some back on usage, and energy conservation price bubble. information. They also examined a compar- In an effort to address the criticism that ison sample drawn from the same popula- bubbles were just a manifestation of using tion, to check for any biases in the volunteer student subjects, Smith, Suchanek, and sample. Williams (1988) recruited nonstudent sub- Binswanger (1980, 1981) conducted jects for one experiment. As they put it, one experiments eliciting measures of risk aver- experiment “… is noteworthy because of its sion from farmers in rural India. Apart from use of professional and business people from the policy interest of studying agents in the Tucson community, as subjects. This developing countries, one stated goal of market belies any notion that our results are using artefactual field experiments was to an artifact of student subjects, and that busi- assess risk attitudes for choices in which the nessmen who ‘run the real world’ would income from the experimental task was a quickly learn to have rational expectations. substantial fraction of the wealth or annual This is the only experiment we conducted income of the subject. The method he devel- that closed on a mean price higher than in all oped has been used recently in conventional previous trading periods” (p. 1130–31). The laboratory settings with student subjects by reference at the end is to the observation Charles Holt and Susan Laury (2002). that the price bubble did not burst as the Burns (1985) conducted induced-value finite horizon of the experiment was market experiments with floor traders from approaching. Another notable feature of this wool markets, to compare with the behav- price bubble is that it was accompanied by ior of student subjects in such settings. The heavy volume, unlike the price bubbles goal was to see if the heuristics and deci- observed with experienced subjects.31 sion rules these traders evolved in their Although these subjects were not students, natural field setting affected their behavior. they were inexperienced in the use of the She did find that their natural field rivalry double auction experiments. Moreover, had a powerful motivating effect on their there is no presumption that their field expe- behavior. rience was relevant for this type of asset Vernon Smith, G. L. Suchanek, and market. Arlington Williams (1988) conducted a large series of experiments with student subjects 30 There are only two reasons players may want to trade in an “asset bubble” experiment. In the 22 in this market. First, if players differ in their risk attitudes then we might see the asset trading below expected divi- experiments they report, nine to twelve dend value (since more-risk-averse players will pay less- traders with experience in the double-auc- risk-averse players a premium over expected dividend tion institution29 traded a number of fifteen value to take their assets). Second, if subjects have diverse price expectations, we can expect trade to occur because of or thirty period assets with the same com- expected capital gains. This second reason for trading mon value distribution of dividends. If all (diverse price expectations) can actually lead to contract subjects are risk neutral and have common prices above expected dividend value, provided some subject believes that there are other subjects who believe the price expectations, then there would be no price will go even higher. 31 Harrison (1992b) reviews the detailed experimental 28 They also have a discussion of the role that these pos- evidence on bubbles, and shows that very few significant sible biases play in social psychology experiments, and how bubbles occur with subjects who are experienced in asset they have been addressed in the literature. market experiments in which there is a short-lived asset, 29 And either inexperienced, once experienced, or such as those under study. A bubble is significant only if twice experienced in asset market trading. there is some nontrivial volume associated with it. dec04_Article 1 12/14/04 2:25 PM Page 1022

1022 Journal of Economic Literature, Vol. XLII (December 2004)

Artefactual field experiments have also tasks, experience is generated in the field made use of children and high school sub- and not the lab. These results provide sup- jects. For example, William Harbaugh and port for the notion that context-specific Kate Krause (2000), Harbaugh, Krause, and experience does appear to carry over to com- Timothy Berry (2001), and Harbaugh, parable settings, at least with respect to Krause, and Lise Vesterlund (2002) explore these types of auctions. other-regarding preferences, individual This experimental design emphasizes the rationality, and risk attitudes among children identification of a naturally occurring setting in school environments. in which one can control for experience in Joseph Henrich (2000) and Henrich and the way that it is accumulated in the field. Richard McElreath (2002), and Henrich et Experienced traders gain experience over al. (2001, 2004) have even taken artefactual time by observing and surviving a relatively field experiments to the true “wilds” of a wide range of trading circumstances. In number of peasant societies, employing the some settings this might be proxied by the procedures of cultural anthropology to manner in which experienced or super-expe- recruit and instruct subjects and conduct rienced subjects are defined in the lab, but it artefactual field experiments. Their focus remains on open question whether standard was on the ultimatum bargaining game and lab settings can reliably capture the full measures of risk aversion. extent of the field counterpart of experience. This is not a criticism of lab experiments, just their domain of applicability. 5. Framed Field Experiments The methodological lesson we draw is that 5.1 The Nature of the Information Subjects one should be careful not to generalize from Already Have the evidence of a winner’s curse by student subjects that have no experience at all with Auction theory provides a rich set of pre- the field context. These results do not imply dictions concerning bidders’ behavior. One that every field context has experienced sub- particularly salient finding in a plethora of jects, such as professional sports-card deal- laboratory experiments that is not predicted ers, that avoid the winner’s curse. Instead, in first-price common-value auction theory they point to a more fundamental need to is that bidders commonly fall prey to the consider the field context of experiments winner’s curse. Only “super-experienced” before drawing general conclusions. It is not subjects, who are in fact recruited on the the case that abstract, context-free experi- basis of not having lost money in previous ments provide more general findings if the experiments, avoid it regularly. This would context itself is relevant to the performance seem to suggest that experience is a suffi- of subjects. In fact, one would generally cient condition for an individual bidder to expect such context-free experiments to be avoid the winner’s curse. Harrison and List unusually tough tests of economic theory, (2003) show that this implication is support- since there is no control for the context that ed when one considers a natural setting in subjects might themselves impose on the which it is relatively easy to identify traders abstract experimental task. that are more or less experienced at the task. The main result is that if one wants to In their experiments the experience of sub- draw conclusions about the validity of theory jects is either tied to the commodity, the val- in the field, then one must pay attention to uation task, and the use of auctions (in the the myriad of ways in which field context can field experiments with sports cards), or sim- affect behavior. We believe that convention- ply to the use of auctions (in the laboratory al lab experiments, in which roles are exoge- experiments with induced values). In all nously assigned and defined in an abstract dec04_Article 1 12/14/04 2:25 PM Page 1023

Harrison and List: Field Experiments 1023

manner, cannot ubiquitously provide reliable rather than abstract commodities, is not insights into field behavior. One might be unique to the field, nor does one have to able to modify the lab experimental design to eschew experimenter-induced valuations in mimic those field contexts more reliably, and the field. But the use of real goods does have that would make for a more robust applica- consequences that apply to both lab and tion of the experimental method in general. field experiments.32 Consider, as an example, the effect of Abstraction Requires Abstracting. One “insiders” on the market phenomenon simple example is the Tower of Hanoi known as the “winner’s curse.” For now we game, which has been extensively studied define an insider as anyone who has better by cognitive psychologists (e.g., J. R. Hayes information than other market participants. and H. A. Simon 1974) and more recently If insiders are present in a market, then one by economists (Tanga McDaniel and might expect that the prevailing prices in the Rutström 2001) in some fascinating experi- market will reflect their better information. ments. The physical form of the game, as This leads to two general questions about found in all serious Montessori classrooms market performance. First, do insiders fall and in Judea Pearl (1984, p. 28), is shown in prey to the winner’s curse? Second, does the figure 1. presence of insiders mitigate the winner’s The top picture shows the initial state, in curse for the market as a whole? which n disks are on peg 1. The goal is to The approach adopted by Harrison and move all of the disks to peg 3, as shown in List (2003) is to undertake experiments in the goal state in the bottom picture. The naturally occurring settings in which the fac- constraints are that only one disk may be tors that are at the heart of the theory are moved at a time, and no disk may ever lie identifiable and arise endogenously, and under a bigger disk. The objective is to reach then to impose the remaining controls need- the goal state in the least number of moves. ed to implement a clean experiment. In other The “trick” to solving the Tower of Hanoi is words, rather than impose all controls exoge- to use backwards induction: visualize the nously on a convenience sample of college final, goal state and use the constraints to fig- students, they find a population in the field ure out what the penultimate state must in which one of the factors of interest arises have looked like (viz., the tiny disk on the top naturally, where it can be identified easily, of peg 3 in the goal state would have to be on and then add the necessary controls. To test peg 1 or peg 2 by itself). Then work back their methodological hypotheses, they also from that penultimate state, again respecting implement a fully controlled laboratory the constraints (viz., the second smallest disk experiment with subjects drawn from the on peg 3 in the goal state would have to be same field population. We discuss some of on whichever of peg 1 or peg 2 the smallest their findings below. disk is not on). One more step in reverse and the essential logic should be clear (viz., in 5.2 The Nature of the Commodity order for the third largest disk on peg 3 to be Many field experiments involve real, phys- off peg 3, one of peg 1 or peg 2 will have to ical commodities and the values that subjects be cleared, so the smallest disk should be on place on them in their daily lives. This is dis- top of the second-smallest disk). tinct from the traditional focus in experi- Observation of students in Montessori mental economics on experimenter-induced classrooms makes it clear how they (eventu- valuations on an abstract commodity, often ally) solve the puzzle, when confronted with referred to as “tickets” just to emphasize the lack of any field referent that might suggest 32 See Harrison, Ronald Harstad, and Rutström (2004) a valuation. The use of real commodities, for a general treatment. dec04_Article 1 12/14/04 2:25 PM Page 1024

1024 Journal of Economic Literature, Vol. XLII (December 2004)

Figure 1. The Tower of Hanoi Game

the initial state. They shockingly violate the induction, if attained, it is quite possible that constraints and move all the disks to the goal it posed an insurmountable cognitive burden state en masse, and then physically work for some of the experimental subjects. backwards along the lines of the above It might be tempting to think of this as just thought experiment in backwards induction. two separate tasks, instead of a real com- The critical point here is that they temporar- modity and its abstract analogue. But we ily violate the constraints of the problem in believe that this example does identify an order to solve it “properly.” important characteristic of commodities in Contrast this behavior with the laboratory ideal field experiments: the fact that they subjects in McDaniel and Rutström (2001). allow subjects to adopt the representation of They were given a computerized version of the commodity and task that best suits their the game and told to try to solve it. However, objective. In other words, the representation the computerized version did not allow them of the commodity by the subject is an inte- to violate the constraints. Hence the labora- gral part of how the subject solves the task. tory subjects were unable to use the class- One simply cannot untangle them, at least room Montessori method, by which the not easily and naturally. student learns the idea of backwards induc- This example also illustrates that off- tion by exploring it with physical referents. equilibrium states, in which one is not opti- This is not a design flaw of the McDaniel and mizing in terms of the original constrained Rutström (2001) lab experiments, but simply optimization task, may indeed be critical one factor to keep in mind when evaluating to the attainment of the equilibrium state.33 the behavior of their subjects. Without the physical analogue of the final goal state being 33 This is quite distinct from the valid point made by allowed in the experiment, the subject was Smith (1982, p. 934, fn. 17), that it is appropriate to design the experimental institution so as to make the task as sim- forced to visualize that state conceptually, ple and transparent as possible, providing one holds con- and to likewise imagine conceptually the stant these design features as one compares experimental penultimate states. Although that may treatments. Such designs may make the results of less interest for those wanting to make field inferences, but encourage more fundamental conceptual that is a trade-off that every theorist and experimenter understanding of the idea of backwards faces to varying degrees. dec04_Article 1 12/14/04 2:25 PM Page 1025

Harrison and List: Field Experiments 1025

Thus we should be mindful of possible field reasons why homegrown values might be devices that allow subjects to explore off- affiliated in such experiments. equilibrium states, even if those states are The first is that the good being auctioned ruled out in our null hypotheses. might have some uncertain attributes, and fel- Field Goods Have Field Substitutes. There low bidders might have more or less informa- are two respects in which “field substitutes” tion about those attributes. Depending on how play a role whenever one is conducting an one perceives the knowledge of other bidders, experiment with naturally occurring, or observation of their bidding behavior35 can field, goods. We can refer to the former as affect a given bidder’s estimate of the true sub- the natural context of substitutes, and to the jective value to the extent that they change the latter as an artificial context of substitutes. bidder’s estimate of the lottery of attributes The former needs to be captured if reliable being auctioned.36 Note that what is being valuations are to be elicited; the latter needs 35 The term “bidding behavior” is used to allow for to be minimized or controlled. information about bids as well as non-bids. In the repeat- The first way in which substitutes play a ed Vickrey auction it is the former that is provided (for role in an experiment is the traditional sense winners in previous periods). In the one-shot English auction it is the latter (for those who have not yet caved in at of demand theory: to some individuals, a the prevailing price). Although the inferential steps in bottle of scotch may substitute for a bible using these two types of information differ, they are each when seeking peace of mind. The degree of informative in the same sense. Hence any remarks about the dangers of using repeated Vickrey auctions apply substitutability here is the stuff of individual equally to the use of English auctions. demand elasticities, and can reasonably be 36 To see this point, assume that a one-shot Vickrey auc- expected to vary from subject to subject. The tion was being used in one experiment and a one-shot English auction in another experiment. Large samples of upshot of this consideration is, yet again, that subjects are randomly assigned to each institution, and the one should always collect information on commodity differs. Let the commodity be something whose observable individual characteristics and quality is uncertain; an example used by Cummings, Harrison and Rutström (1995) and Rutström (1998) might be control for them. a box of gourmet chocolate truffles. Amongst undergraduate The second way in which substitutes play students in South Carolina, these boxes present something of a role in an experiment is the more subtle a taste challenge. The box is not large in relation to those found in more common chocolate products, and many of the issue of affiliation which arises in lab or field students have not developed a taste for gourmet chocolates. settings that involve preferences over a field A subject endowed with a diverse pallet is faced with an good. To see this point, consider the use of uncertain lottery. If these are just ordinary chocolates dressed up in a small box, then the true value to the subject is small repeated Vickrey auctions in which subjects (say, $2). If they are indeed gourmet chocolates then the true learn about prevailing prices. This results in value to the subject is much higher (say, $10). Assuming an a loss of control, since we are dealing with equal chance of either state of chocolate, the risk-neutral subject would bid their true expected value (in this example, $6). the elicitation of homegrown values rather In the Vickrey auction this subject will have an incentive to than experimenter-induced private values. write down her reservation price for this lottery as described To the extent that homegrown values are above. In the English auction, however, this subject is able to see a number of other subjects indicate that they are willing affiliated across subjects, we can expect an to pay reasonably high sums for the commodity. Some have effect on elicited values from using repeated not dropped out of the auction as the price has gone above Vickrey auctions rather than a one-shot $2, and it is closing on $6. What should the subject do? The 34 answer depends critically on how knowledgeable he thinks Vickrey auction. There are, in turn, two the other bidders are as to the quality of the chocolates. If those who have dropped out are the more knowledgeable 34 The theoretical and experimental literature makes ones, then the correct inference is that the lottery is more this point clearly by comparing real-time English auctions heavily weighted towards these being common chocolates. If with sealed-bid Vickrey auctions: see Paul Milgrom and those remaining in the auction are the more knowledgeable Robert Weber (1982) and Kagel, Harstad, and Levin ones, however, then the opposite inference is appropriate. In (1987). The same logic that applies for a one-shot English the former case the real-time observation should lead the auction applies for a repeated Vickrey auction, even if the subject to bid lower than in the Vickrey auction, and in the specific bidding opponents were randomly drawn from the latter case the real-time observation should lead the subject population in each round. to bid higher than in the Vickrey auction. dec04_Article 1 12/14/04 2:25 PM Page 1026

1026 Journal of Economic Literature, Vol. XLII (December 2004)

affected here by this knowledge is the subject’s 5.3 The Nature of the Task best estimate of the subjective value of the Who Cares If Hamburger Flippers Violate good. The auction is still eliciting a truthful EUT? Who cares if a hamburger flipper vio- revelation of this subjective value; it is just that lates the independence axiom of expected the subjective value itself can change with utility theory in an abstract task? His job information on the bidding behavior of others. description, job evaluation, and job satisfac- The second reason that bids might be tion do not hinge on it. He may have left affiliated is that the good might have some some money on the table in the abstract task, extra-experimental market price. Assuming but is there any sense in which his failure transaction costs of entering the “outside” suggests that he might be poor at flipping market to be zero for a moment, information hamburgers? gleaned from the bidding behavior of others Another way to phrase this point is to can help the bidder infer what that market actively recruit subjects who have experi- price might be. To the extent that it is less ence in the field with the task being stud- than the subjective value of the good, this ied.39 Trading houses do not allow neophyte information might result in the bidder delib- pit-traders to deviate from proscribed limits, erately bidding low in the experiment.37 The in terms of the exposure they are allowed. A reason is that the expected utility of bidding survival metric is commonly applied in the below the true value is clearly positive: if field, such that the subjects who engage in lower bidding results in somebody else win- certain tasks of interest have specific types of ning the object at a price below the true value, training. then the bidder can (costlessly) enter the out- The relevance of field subjects and field side market anyway. If lower bidding results environments for tests of the winner’s curse in the bidder winning the object, and market is evident from Douglas Dyer and Kagel price and bids are not linked, then consumer (1996, p. 1464), who review how executives surplus is greater than if the object had been in the commercial construction industry bought in the outside market. Note that this avoid the winner’s curse in the field: argument suggests that subjects might have an incentive to strategically misrepresent 38 Two broad conclusions are reached. One is that their true subjective value. the executives have learned a set of situation- The upshot of these concerns is that specific rules of thumb which help them to avoid unless one assumes that homegrown values the winner’s curse in the field, but which could for the good are certain and not affiliated not be applied in the laboratory markets. The second is that the bidding environment created across bidders, or can provide evidence that in the laboratory and the theory underlying it they are not affiliated in specific settings, are not fully representative of the field environ- one should avoid the use of institutions that ment. Rather, the latter has developed escape can have uncontrolled influences on esti- mechanisms for avoiding the winner’s curse that mates of true subjective value and/or the are mutually beneficial to both buyers and sell- ers and which have not been incorporated into incentive to truthfully reveal that value. the standard one-shot auction theory literature.

These general insights motivated the 37 Harrison (1992a) makes this point in relation to some design of the field experiments of Harrison previous experimental studies attempting to elicit homegrown values for goods with readily accessible outside markets. 39 The subjects may also have experience with the good 38 It is also possible that information about likely out- being traded, but that is a separate matter worthy of study. side market prices could affect the individual’s estimate of For example, List (2004c) had sports-card enthusiasts true subjective value. Informal personal experience, albeit trade coffee mugs and chocolates in tests of loss aversion, over a panel data set, is that higher-priced gifts seem to even though they had no experience in openly trading elicit warmer glows from spouses and spousal-equivalents. those goods. dec04_Article 1 12/14/04 2:25 PM Page 1027

Harrison and List: Field Experiments 1027

and List (2003), mentioned earlier. They played by dealers, they frequently fall prey study the behavior of insiders in their field to the winner’s curse. We conclude that the context, while controlling the “rules of the theory predicts field behavior well when game” to make their bidding behavior fall one is able to identify naturally occurring into the domain of existing auction theory. In field counterparts to the key theoretical this instance, the term “field context” means conditions. the commodity with which the insiders are At a more general level, consider the familiar, as well as the type of bidders they argument that subjects who behave irra- normally encounter. tionally could be subjected to a “money- This design allows one to tease apart the pump” by some arbitrager from hell. When two hypotheses implicit in the conclusions of we explain transitivity of preferences to Dyer and Kagel (1996). If these insiders fall undergraduates, the common pedagogy prey to the winner’s curse in the field exper- includes stories of intransitive subjects iment, then it must be40 that they avoid it by mindlessly cycling forever in a series of low- using market mechanisms other than those cost trades. If these cycles continue, the sub- under study. The evidence is consistent with ject is pumped of money until bankrupt. In the notion that dealers in the field do not fall fact, the absence of such phenomena is prey to the winner’s curse in the field exper- often taken as evidence that contracts or iment, providing tentative support for the markets must be efficient. hypothesis that naturally occurring markets There are several reasons why this may are efficient because certain traders use not be true. First, it is only when certain heuristics to avoid the inferential error that consistency conditions are imposed that suc- underlies the winner’s curse. cessful money-pumps provide a general indi- This support is only tentative, however, cator of irrationality, defeating their use as a because it could be that these dealers have sole indicator (Robin Cubitt and Robert developed heuristics that protect them Sugden 2001). from the winner’s curse only in their spe- Second, and germane to our concern cialized corner of the economy. That would with the field, subjects might have devel- still be valuable to know, but it would mean oped simple heuristics to avoid such that the type of heuristics they learn in their money-pumps: for example, never retrade corner are not general and do not transfer the same objects with the same person.41 to other settings. Hence, the complete As John Conlisk (1996, p. 684) notes, design also included laboratory experi- “Rules of thumb are typically exploitable ments in the field, using induced valuations by ‘tricksters,’ who can in principle ‘money as in the laboratory experiments of Kagel pump’ a person using such rules. … and Levin (1999), to see if the heuristic of Although tricksters abound—at the door, insiders transfers. We find that it does when on the phone, and elsewhere—people can they are acting in familiar roles, adding fur- easily protect themselves, with their ther support to the claim that these insiders pumpable rules intact, by such simple have indeed developed a heuristic that devices as slamming the door and hanging “travels” from problem domain to problem up the phone. The issue is again a matter domain. Yet when dealers are exogenously of circumstance and degree.” The last provided with less information than their point is important for our argument—only bidding counterparts, a role that is rarely when the circumstance is natural might

40 This inference follows if one assumes that a dealer’s 41 Slightly more complex heuristics work against arbi- survival in the industry provides sufficient evidence that he tragers from meta-hell who understand that this simple does not make persistent losses. heuristic might be employed. dec04_Article 1 12/14/04 2:25 PM Page 1028

1028 Journal of Economic Literature, Vol. XLII (December 2004)

one reasonably expect the subject to be the sense of knowing what actions are feasi- able to call upon survival heuristics that ble and what the consequences of different protect against such irrationality. To be actions might be, then control has been lost sure, some heuristics might “travel,” and at a basic level. In cases where the subject that was precisely the research question understands all the relevant aspects of the examined by Harrison and List (2003) with abstract game, problems may arise due to respect to the dreaded winner’s curse. But the triggering of different methods for solv- they might not; hence we might have sight- ing the decision problem. The use of field ings of odd behavior in the lab that would referents could trigger the use of specific simply not arise in the wild. heuristics from the field to solve the specif- Third, subjects might behave in a non- ic problem in the lab, which otherwise may separable manner with respect to sequen- have been solved less efficiently from first tial decisions over time, and hence avoid principles (e.g., Gerd Gigerenzer et al. the pitfalls of sequential money pumps 2000). For either of these reasons—a lack (Mark Machina 1989; and Edward of understanding of the task or a failure to McClennan 1990). Again, the use of such apply a relevant field heuristic—behavior sophisticated characterizations of choices may differ between the lab and the field. over time might be conditional on the indi- The implication for experimental design is vidual having familiarity with the task and to just “do it both ways,” as argued by Chris the consequences of simpler characteriza- Starmer (1999) and Harrison and Rutström tions, such as those employing intertempo- (2001). Experimental economists should be ral additivity. It is an open question if the willing to consider the effect in their exper- richer characterization that may have iments of scripts that are less abstract, but evolved for familiar field settings travels to in controlled comparisons with scripts that other settings in which the individual has are abstract in the traditional sense. less experience. Nevertheless, it must also be recognized Our point is that one should not assume that inappropriate choice of field referents that heuristics or sophisticated characteriza- may trigger uncontrolled psychological tions that have evolved for familiar field set- motivations. Ultimately, the choice tings do travel to the unfamiliar lab. If they between an abstract script and one with do exist in the field, and do not travel, then field referents must be guided by the evidence from the lab might be misleading. research question. “Context” Is Not a Dirty Word. One tra- This simple point can be made more dition in experimental economics is to use forcefully by arguing that the passion for scripts that abstract from any field counter- abstract scripts may in fact result in less con- part of the task. The reasoning seems to be trol than context-ridden scripts. It is not the that this might contaminate behavior, and case that abstract, context-free experiments that any observed behavior could not then provide more general findings if the context be used to test general theories. There is itself is relevant to the performance of sub- logic to this argument, but context should jects. In fact, one would generally expect not be jettisoned without careful consider- such context-free experiments to be unusu- ation of the unintended consequences. ally tough tests of economic theory, since Field referents can often help subjects there is no control for the context that sub- overcome confusion about the task. jects might themselves impose on the Confusion may be present even in settings abstract experimental task. This is just one that experimenters think are logically or part of a general plea for experimental econ- strategically transparent. If the subject does omists to take the psychological process of not understand what the task is about, in “task representation” seriously. dec04_Article 1 12/14/04 2:25 PM Page 1029

Harrison and List: Field Experiments 1029

This general point has already emerged in the specific information in the foreground several areas of research in experimental eco- of the task (e.g., Ulric Neisser and Ira nomics. Noticing large differences between Hyman 2000).42 contributions to another person and a charity At a more homely level, the “simple” in between-subjects experiments that were choice of parameters can add significant otherwise identical in structure and design, field context to lab experiments. The idea, Catherine Eckel and Philip Grossman (1996, pioneered by Grether, Isaac, and Plott p. 188ff.) drew the following conclusion: (1981, 1989), Grether and Plott (1984), and Hong and Plott (1982), is to estimate param- It is received wisdom in experimental econom- eters that are relevant to field applications ics that abstraction is important. Experimental and take these into the lab. procedures should be as context-free as possible, and the interaction among subjects should 5.4 The Nature of the Stakes be carefully limited by the rules of the experiment to ensure that they are playing the game One often hears the criticism that lab we intend them to play. For tests of economic experiments involve trivial stakes, and that theory, these procedural restrictions are critical. they do not provide information about As experimenters, we aspire to instructions that agents’ behavior in the field if they faced most closely mimic the environments implicit in serious stakes, or that subjects in the lab the theory, which is inevitably a mathematic experiments are only playing with “house abstraction of an economic situation. We are 43 careful not to contaminate our tests by unnec- money.” The immediate response to this essary context. But it is also possible to use experimental methodology to explore the 42 importance and consequence of context. A healthy counter-lashing was offered by Mahzarin Economists are becoming increasingly aware Banaji and Robert Crowder (1989), who concede that that social and psychological factors can only be needlessly artefactual designs are not informative. But they conclude that “we students of memory are just as introduced by abandoning, at least to some interested as anybody else in why we forget where we left extent, abstraction. This may be particularly the car in the morning or in who was sitting across the true for the investigation of other-regarding table at yesterday’s meeting. Precisely for this reason we behavior in the economic arena. are driven to laboratory experimentation and away from naturalistic observation. If the former method has been disappointing to some after about 100 years, so should Our point is simply that this should be a the latter approach be disappointing after about 2,000. more general concern. Above all, the superficial glitter of everyday methods should not be allowed to replace the quest for generalizable Indeed, research in memory reminds us principles.” (p. 1193). that subjects will impose a natural context 43 This problem is often confused with another issue: on a task even if it literally involves “non- the validity and relevance of hypothetical responses in the lab. Some argue that hypothetical responses are the only sense.” Long traditions in psychology, no way that one can mimic the stakes found in the field. doubt painful to the subjects, involved Conlisk (1989) runs an experiment to test the Allais detecting how many “nonsense syllables” a Paradox with small, real stakes and finds that virtually no subjects violated the predictions of expected utility theory. subject could recall. The logic behind the Subjects drawn from the same population did violate the use of nonsense was that the researchers “original recipe” version of the Allais Paradox with large, were not interested in the role of specific hypothetical stakes. Conlisk (1989; p. 401ff.) argues that inferences from this evidence confound hypothetical semantic or syntactic context as an aid to rewards with the reward scale, which is true. Of course, memory, and in fact saw those as nuisance one could run an experiment with small, hypothetical variables to be controlled by the use of ran- stakes and see which factor is driving this result. Chinn- Ping Fan (2002) did this, using Conlisk’s design, and found dom syllables. Such experiments generated that subjects given low, hypothetical stakes tended to a backlash of sorts in memory research, avoid the Allais Paradox, just as his subjects with low, real with many studies focusing instead on stakes avoided it. Many of the experiments that find viola- tions of the Allais Paradox in small, real stake settings memory within a natural context, in which embed these choices in a large number of tasks, which cues and frames could be integrated with could affect outcomes. dec04_Article 1 12/14/04 2:25 PM Page 1030

1030 Journal of Economic Literature, Vol. XLII (December 2004)

point is perhaps obvious: increase the stakes exchange rates to the U.S. dollar prevailing in the lab and see if it makes a difference at the time, these stakes were $1.90, $9.70, (e.g., Elizabeth Hoffman, Kevin McCabe, and $48.40, respectively. In terms of average and Vernon Smith (1996), or have subjects local monthly wages, they were equivalent to earn their stakes in the lab (e.g., Rutström approximately 2.5 hours, 12.5 hours, and and Williams 2000; and List 2004a), or seek 62.5 hours of work, respectively. out lab subjects in developing countries for They conclude that there was no effect whom a given budget is a more substantial on initial offer behavior in the first round, fraction of their income (e.g., Steven but that the higher stakes did have an Kachelmeier and Mohamed Shehata 1992; effect on offers as the subjects gained Lisa Cameron 1999; and Robert Slonim and experience with subsequent rounds. They Alvin Roth 1998). also conclude that acceptances were Colin Camerer and Robin Hogarth (1999) greater in all rounds with higher payoffs, review the issues here, identifying many but that they did not change over time. instances in which increased stakes are asso- Their experiment is particularly significant ciated with improved performance or less because they varied the stakes by a factor variation in performance. But they also alert of 25 and used procedures that have been us to important instances in which increased widely employed in comparable experi- stakes do not improve performance, so that ments.46 On the other hand, one might one does not casually assume that there will question if there was any need to go to the be such an improvement. field for this treatment. Fifty subjects Taking the Stakes to Subjects Who Are dividing roughly $50 per game is only Relatively Poor. One of the reasons for run- $1,250, and this is quite modest in terms of ning field experiments in poor countries is most experimental budgets. But fifty sub- that it is easier to find subjects who are rela- jects dividing the monetary equivalent of tively poor. Such subjects are presumably 62.5 hours is another matter. If we assume more motivated by financial stakes of a given $10 per hour in the United States for level than subjects in richer countries. lower-skilled blue-collar workers or stu- Slonim and Roth (1998) conducted bar- dents, that is $15,625, which is substantial gaining experiments in the Slovak Republic but feasible.47 to test for the effect of “high stakes” on Similarly, consider the “high payoff” behavior.44 The bargaining game they stud- experiments from China reported by ied entails one person making an offer to the Kachelmeir and Shehata (1992) (KS). other person, who then decides whether to These involved subjects facing lotteries accept it. Bargaining was over a pie worth 60 with prizes equal to 0.5 yuan, 1 yuan, 5 Slovak Crowns (Sk) in one session, a pie yuan, or 10 yuan, and being asked to state worth 300 Sk in another session, and a pie certainty-equivalent selling prices using worth 1500 Sk in a third session.45 At the “BDM” mechanism due to Gordon Becker, Morris DeGroot, and Jacob 44 Their subjects were students from universities, so Marschak (1964). Although 10 yuan only one could question how “nonstandard” this population is. But the design goal was to conduct the experiment in a converted to about $2.50 at the time of the country in which the wage rates were low relative to the experiments, this represented a consider- United States (p. 569), rather than simply conduct the able amount of purchasing power in that same experiment with students from different countries as in Roth et al. (1991). 45 Actually, the subjects bargained over points which 46 Harrison (2005a) reconsiders their conclusions. were simply converted to currency at different exchange 47 For July 2002 the Bureau of Labor Statistics estimat- rates. This procedure seems transparent enough, and ed average private sector hourly wages in the United States served to avoid possible focal points defined over differing at $16.40, with white-collar workers earning roughly $4 cardinal ranges of currency. more and blue-collar workers roughly $2 less than that. dec04_Article 1 12/14/04 2:25 PM Page 1031

Harrison and List: Field Experiments 1031

region of China, as discussed by KS (p. has repeatedly stressed the importance of 1123). Their results support several conclu- recruiting subjects who have some field sions. First, the coefficients for lotteries experience with the task or who have an involving low win probabilities imply interest in the particular task. His experi- extreme risk loving. This is perfectly plausi- ments have generally involved imposing ble given the paltry stakes involved in such institutions on the subjects who are not lotteries using the BDM elicitation proce- familiar with the institution, since the dure. Second, “bad joss,” as measured by objective of the early experiments was to the fraction of random buying prices below study new ways of overcoming free-rider the expected buying price of 50 percent of bias. But his choice of commodity has usu- the prize, is associated with a large increase ally been driven by a desire to confront sub- in risk-loving behavior.48 Third, experience jects with stakes and consequences that are with the general task increases risk aver- natural to them. In other words, his experi- sion. Fourth, increasing the prize from 5 ments illustrate how one can seek out sub- yuan to 10 yuan increases risk aversion sig- ject pools for whom certain stakes are nificantly. Of course, this last result is con- meaningful. sistent with non-constant RRA, and should Bohm (1972) is a landmark study that had not be necessarily viewed as a problem a great impact on many researchers in the unless one insisted on applying the same areas of field public-good valuation and CRRA coefficient over these two reward experimentation on the extent of free-riding. domains. The commodity was a closed-circuit broad- Again, however, the question is whether cast of a new Swedish TV program. Six elic- one needed to go to nonstandard popula- itation procedures were used. In each case tions in order to scale up the stakes to draw except one, the good was produced, and these conclusions. Using an elicitation pro- the group was able to see the program, if cedure different than the BDM procedure, aggregate WTP (willingness to pay) Holt and Laury (2002) undertake conven- equaled or exceeded a known total cost. tional laboratory experiments in which they Every subject received SEK50 upon arrival scale up stakes and draw the same conclu- at the experiment, broken down into stan- sions about experience and stake size. Their dard denominations. Bohm employed five scaling factors are generally twenty com- basic procedures for valuing his compared to a baseline level, although they also modity.49 No formal theory is provided to conducted a handful of experiments with factors as high as fifty and ninety. The overall cost of these scale treatments was 49 In Procedure I the subject pays according to his stat- $17,000, although $10,000 was sufficient for ed WTP. In Procedure II the subject pays some fraction of their primary results with a scaling of twen- stated WTP, with the fraction determined equally for all in the group such that total costs are just covered (and the ty. These are not cheap experiments, but fraction is not greater than one). In Procedure III the pay- budgets of this kind are now standard for ment scheme is unknown to subjects at the time of their many experimenters. bid. In Procedure IV each subject pays a fixed amount. In Procedure V the subject pays nothing. For comparison, a Taking the Task to the Subjects Who Care quite different Procedure VI was introduced in two stages. About It. Bohm (1972; 1979; 1984a,b; 1994) The first stage, denoted VI:1, approximates a CVM, since nothing is said to the subject as to what considerations would lead to the good being produced or what it would 48 Although purely anecdotal, our own experience is cost him if it was produced. The second stage, VI:2, that many subjects faced with the BDM task believe that involves subjects bidding against what they think is a group the buying price depends in some way on their selling of 100 for the right to see the program. This auction is con- price. To mitigate such possible perceptions, we have ducted as a discriminative auction, with the ten highest tended to use physical randomizing devices that are less bidders actually paying their bid and being able to see the prone to being questioned. program. dec04_Article 1 12/14/04 2:25 PM Page 1032

1032 Journal of Economic Literature, Vol. XLII (December 2004)

generate free-riding hypotheses for these SEK500 in group 2 would be excluded from procedures.50 The major result from enjoying the good. Bohm’s study was that bids were virtually In group 1 a subject has an incentive to identical for all institutions, averaging understate only if he conjectures that the sum between SEK7.29 and SEK10.33. of the contributions of others in his group is Bohm (1984a) uses two procedures that greater than or equal to total cost minus his elicit a real economic commitment, albeit true valuation. Total cost was known to be under different (asserted) incentives for SEK200,000, but the contributions of (many) free-riding. He implemented this experi- others must be conjectured. It is not possible ment in the field with local government to say what the extent of free-riding is in this bureaucrats bidding on the provision of a case without further information as to expec- new statistical service from the Central tations that were not observed. In group 2 Bureau of Statistics.51 The two procedures only those subjects who actually stated a are used to extract a lower and an upper WTP greater than or equal to SEK500 might bound, respectively, to the true average have had an incentive to free-ride. Forty-nine WTP for an actual good. Each agent in subjects reported exactly SEK500 in group 2, group 1 was to state his individual WTP, and whereas 93 reported a WTP of SEK500 or his actual cost would be a percentage of that higher. Thus the extent of free-riding in stated WTP such that costs for producing group 2 could be anywhere from 0 percent (if the good would be covered exactly. This per- those reporting SEK500 indeed had a true centage could not exceed 100 percent. WTP of exactly that amount) to 53 percent Subjects in group 2 were asked to state their (49 free-riders out of 93 possible free-riders). WTP. If the interval estimated for total stat- The main result reported by Bohm (1984a) ed WTP equaled or exceeded the (known) is that the average WTP interval between the total cost, the good was to be provided and two groups was quite small. Group 1 had an subjects in group 2 would pay only SEK500. average WTP of SEK827 and group 2 an Subjects bidding zero in group 1 or below average WTP of SEK889, for an interval that is only 7.5 percent of the smaller average WTP of group 1. Thus the conclusion in this 50 Procedure I is deemed the most likely to generate case must be that if free-riding incentives strategic under-bidding (p. 113), and procedure V the were present in this experiment, they did not most likely to generate strategic over-bidding. The other procedures, with the exception of VI, are thought to lie make much of a difference to the outcome. somewhere in between these two extremes. Explicit admo- One can question, however, the extent to nitions against strategic bidding were given to subjects in which these results generalize. The subjects procedures I, II, IV, and V (see p. 119, 127–29). Although no theory is provided for VI:2, it can be recognized as a were representatives of local governments, multiple-unit auction in which subjects have independent and it was announced that all reported WTP and private values. It is well-known that optimal bids for values would be published. This is not a fea- risk-neutral agents can be well below the true valuation of the agent in a Nash Equilibrium, and will never exceed the ture of most surveys used to study public pro- true valuation (e.g., bidders truthfully reveal demand for grams, which often go to great lengths to the first unit, but understate demand for subsequent units ensure subject confidentiality. On the other to influence the price). Unfortunately there is insufficient information to be able to say how far below true valuations hand, the methodological point is clear: some these optimal bids will be, since we do not know the con- subjects may simply care more about under- jectured range of valuations for subjects. List and Lucking- taking certain tasks, and in many field set- Reiley (2000) use a framed field experiment to test for demand reduction in the field and find significant demand tings this is not difficult to identify. For reduction. example, Juan Cardenas (2003) collects 51 In addition, he conducted some comparable experi- experimental data on common pool extrac- ments in a more traditional laboratory setting, albeit for a non-hypothetical good (the viewing of a pilot of a TV tion from participants that have direct, field show). experience extracting from a common pool dec04_Article 1 12/14/04 2:25 PM Page 1033

Harrison and List: Field Experiments 1033

resource. Similarly, Jeffrey Carpenter, Amrita We consider here two potentially important Daniere, and Lois Takahashi (2003) conduct parts of the experimental environment: the social dilemma experiments with urban slum physical place of the actual experiment, and dwellers who face daily coordination and col- whether subjects are informed that they are lective action problems, such as access to taking part in an experiment. clean water and solid waste disposal. Experimental Site. The relationship between behavior and the environmental 6. Natural Field Experiments context in which it occurs refers to one’s 6.1 The Nature of the Environment physical surroundings (viz., noise level, extreme temperatures, and architectural Most of the stimuli a subject encounters in design) as well as the nature of the human a lab experiment are controlled. The labora- intervention (viz., interaction with the tory, in essence, is a pristine environment experimental monitor). For simplicity and where the only thing varied is the stressor in 52 concreteness, we view the environment as a which one is interested. Indeed, some labo- whole rather than as a bundle of stimuli. For ratory researchers have attempted to expunge example, a researcher interested in the all familiar contextual cues as a matter of con- labels attached to colors may expose sub- trol. This approach is similar to mid-twenti- jects to color stimuli under sterile laborato- eth-century psychologists who attempted to ry conditions (e.g., Brent Berlin and Paul conduct experiments in “context-free” envi- Kay 1969). A field experimenter, and any ronments: egg-shaped enclosures where tem- artist, would argue that responses to color peratures and sound were properly regulated stimuli could very well be different from (Lowenstein 1999, p. F30). This approach those in the real world, where colors occur omits the context in which the stressor is nor- in their natural context (e.g., Anna mally considered by the subject. In the “real Wierzbicka 1996, ch. 10). We argue that, to world” the individual is paying attention not fully examine such a situation, the laborato- only to the stressor, but also to the environ- ry should not be abandoned but supple- ment around him and various other influ- mented with field research. Since it is often ences. In this sense, individuals have natural difficult to maintain proper experimental tools to help cope with several influences, procedures in the field, laboratory work is whereas these natural tools are not available often needed to eliminate alternatives and to individuals in the lab, and thus the full to refine concepts. effect of the stressor is not being observed. Of course, the emphasis on the interrelat- An ideal field experiment not only increas- edness of environment and behavior should es external validity, but does so in a manner 53 not be oversold: the environment clearly in which little internal validity is foregone. constrains behavior, providing varying options in some instances, and influences 52 Of course, the stressor could be an interaction of two treatments. behavior more subtly at other times. 53 We do not like the expression “external validity.” However, people also cope by changing their What is valid in an experiment depends on the theoretical environments. A particular arrangement of framework that is being used to draw inferences from the observed behavior in the experiment. If we have a theory space, or the number of windows in an that (implicitly) says that hair color does not affect behav- office, may affect employee social interac- ior, then any experiment that ignores hair color is valid tion. One means of changing interaction is to from the perspective of that theory. But one cannot identify what factors make an experiment valid without some change the furniture arrangement or win- priors from a theoretical framework, which is crossing into dow cardinality, which of course changes the the turf of “internal validity.” Note also that the “theory” environment’s effect on the employees. we have in mind here should include the assumptions required to undertake statistical inference with the exper- Environment-behavior relationships are imental data. more or less in flux continuously. dec04_Article 1 12/14/04 2:25 PM Page 1034

1034 Journal of Economic Literature, Vol. XLII (December 2004)

Experimental Proclamation. Whether sub- performed identically on achievement tests jects are informed that they are taking part (R. Rosenthal and L. Jacobsen 1968), teachers’ in an experiment may be an important factor. expectations based on the labeling led to dif- In physics, the Heisenberg Uncertainty ferences in student performance. Krueger Principle reminds us that the act of meas- (1999) offers a dissenting view, arguing that urement and observation alters that which is Hawthorne Effects are unlikely. being measured and observed. In the study Project Star studied class sizes in Tennessee of human subjects, a related, though distinct, schools. Teachers in the schools with smaller concept is the Hawthorne Effect. It suggests classes were informed that if their students “… that any workplace change, such as a performed well, class sizes would be reduced research study, makes people feel important statewide. If not, they would return to their and thereby improves their performance.”54 earlier levels. In other words, Project Star’s The notion that agents may alter their teachers had a powerful incentive to improve behavior when observed by others, especial- student performance that would not exist ly when they know what the observer is under ordinary circumstances. Recent empiri- looking for, is not novel to the Hawthorne cal results have shown that students performed Effect. Other terminology includes “inter- better in smaller classrooms. Caroline Hoxby personal self-fulfilling prophecies” and the (2000) reported on a natural experiment using “Pygmalion Effect.” data from a large sample of Connecticut Studies that claim to demonstrate the exis- schools which was free from the bias of the tence of the Hawthorne Effect include Phyllis experiment participants knowing about the Gimotty (2002), who used a treatment that study’s goal. She found no effect of smaller reminded physicians to refer women for free class sizes. Using data from the same natural mammograms. In this treatment she observed experiment, Krueger (1999) did find a positive declining referral rates from the beginning of effect from small class sizes. Similarly, Angrist the twelve-month study to the end. This result and Lavy (1999) find a positive effect in Israel, led her to argue that the results were “consis- exploiting data from a natural experiment tent with the Hawthorne Effect where a tem- “designed” by ancient rabbinic dogma. porary increase in referrals is observed in Who Makes the Decisions? Many decisions response to the initiation of the breast cancer in life are not made by individuals. In some control program.” Many other studies, ranging cases “households” arrive at a decision, which from asthma incidence to education to crimi- can be variously characterized as the out- nal justice, have attributed empirical evidence come of some cooperative or noncooperative to support the concept of the Hawthorne process. In some cases, groups, such as com- Effect. For example, in an experiment in edu- mittees, make decisions. To the extent that cation research in the 1960s where some chil- experimenters focus on individual decision- dren were labeled as high performers and making when group decision-making is more others low performers, when they had actually natural, there is a risk that the results will be misleading. Similarly, even if the decision is 54 From P. G. Benson (2000, p. 688). The Hawthorne made by an individual, there is a possibility of Effect was first demonstrated in an industrial/organiza- tional psychological study by Professor Elton Mayo of the social learning or “cheap talk” advice to aid Harvard Business School at the Hawthorne Plant of the the decision. Laboratory experimenters have Western Electric Company in Cicero, Illinois, from 1927 begun to study this characteristic of field to 1932. Researchers were confounded by the fact that productivity increased each time a change was made to the decision-making, in effect taking one of the lighting no matter if it was an increase or a decrease. What characteristics of naturally occurring field brought the Hawthorne Effect to prominence in behav- environments back into the lab: for example, ioral research was the publication of a major book in 1939 describing Mayo’s research by his associates F. J. see Gary Bornstein and Ilan Yaniv (1998), Roethlisberger and William J. Dickson. James Cox and Stephen Hayne (2002), and T. dec04_Article 1 12/14/04 2:25 PM Page 1035

Harrison and List: Field Experiments 1035

Parker Ballinger, Michael Palumbo, and specific agenda was designed to generate Nathaniel Wilcox (2003). the preferred outcome to Levine. Plott and Levine (1978) took this field 6.2 Three Examples of Minimally Invasive result back into the lab, as well as to the the- Experiments ory chalkboard. This process illustrates the Committees in the Field. Michael Levine complementarity we urge in all areas of and Charles Plott (1977) report on a field research with lab and field experiments. experiment they conducted on members of Betting in the Field. Camerer (1998) is a a flying club in which Levine was a mem- wonderful example of a field experiment ber.55 The club was to decide on a particular that allowed the controls necessary for an configuration of planes for the members, experiment, but otherwise studied naturally and Levine wanted help designing a fair occurring behavior. He recognized that agenda to deal with this problem. Plott sug- computerized betting systems allowed bets gested to Levine that there were many fair to be placed and cancelled before the race agendas, each of which would lead to a dif- was run. Thus he could try to manipulate ferent outcome, and suggested choosing the the market by placing bets in certain ways to one that got the outcome Levine desired. move the market odds, and then cancelling Levine agreed, and the agenda was designed them. The cancellation keeps his net budg- using principles that Plott understood from et at zero, and in fact is one of the main committee experiments (but not agenda treatments—to see if such a temporary bet experiments, which had never been affects prices appreciably. He found that it attempted at that stage). The parameters did not, but the methodological cleanliness assumed about the field were from Levine’s of the test is remarkable. It is also of inter- impressions and his chatting among mem- est to see that the possibility of manipulat- bers. The selected agenda was implemented ing betting markets in this way was and Levine got what he wanted: the group motivated in part by observations of such even complemented him on his work. efforts in laboratory counterparts (p. 461). A controversy at the flying club followed The only issue is how general such oppor- during the process of implementing the group tunities are. This is not a criticism of their decision. The club president, who did not like use: serendipity has always been a hand- the choice, reported to certain decision-mak- maiden of science. One cannot expect that ers that the decision was something other all problems of interest can be addressed in than the actual vote. This resulted in another a natural setting in such a minimally invasive polling of the group, using a questionnaire manner. that Plott was allowed to design. He designed Begging in the Field. List and Lucking- it to get the most complete and accurate pic- Reiley (2002) designed charitable solicita- ture possible of member preferences. tions to experimentally compare outcomes Computation and laboratory experiments, between different seed-money amounts and using induced values with the reported pref- different refund rules by using three differ- erences, demonstrated that in the lab the ent seed proportion levels: 10 percent, 33 outcomes were essentially as predicted. percent, or 67 percent of the $3,000 Levine and Plott (1977) counts as a “min- required to purchase a computer. These pro- imally invasive” field experiment, at least in portions were chosen to be as realistic as the ex ante sense, since there is evidence possible for an actual fundraising campaign that the members did not know that the while also satisfying the budget constraints they were given for this particular fundraiser. 55 We are grateful to Charles Plott for the following They also experimented with the use of a account of the events “behind the scenes.” refund, which guarantees the individual her dec04_Article 1 12/14/04 2:25 PM Page 1036

1036 Journal of Economic Literature, Vol. XLII (December 2004)

money back if the goal is not reached by the treatment to another. In treatment 10NR, group. Thus, potential donors were assigned for example, the first of two crucial sen- to one of six treatments, each funding a dif- tences read as follows: “We have already ferent computer. They refer to their six obtained funds to cover 10 percent of the treatments as 10, 10R, 33, 33R, 67, and 67R, cost for this computer, so we are soliciting with the numbers denoting the seed-money donations to cover the remaining $2,700.” In proportion, and R denoting the presence of treatments where the seed proportion dif- a refund policy. fered from 10 percent, the 10 percent and In carrying out their field experiments, $2,700 numbers were changed appropriate- they wished to solicit donors in a way that ly. The second crucial sentence stated: “If we matched, as closely as possible, the current fail to raise the $2,700 from this group of 500 state of the art in fundraising. With advice individuals, we will not be able to purchase from fundraising companies Donnelley the computer, but we will use the received Marketing in Englewood, Colorado, and funds to cover other operating expenditures Caldwell in Atlanta, Georgia, they followed of CEPA.” The $2,700 number varied with generally accepted rules believed to maxi- the seed proportion, and in refund treat- mize overall contributions. First, they pur- ments this sentence was replaced with: “If chased the names and addresses of 3,000 we fail to raise the $2,700 from this group of households in the Central Florida area that 500 individuals, we will not be able to pur- met two important criteria: 1) annual house- chase the computer, so we will refund your hold income above $70,000, and 2) house- donation to you.” All other sentences were hold was known to have previously given to a identical across the six treatments. charity (some had in fact previously given to In this experiment the responses from the University of Central Florida). They then agents were from their typical environments, assigned 500 of these names to each of the six and the subjects were not aware that they treatments. Second, they designed an attrac- were participating in an experiment. tive brochure describing the new center and its purpose. Third, they wrote a letter of solic- 7. Social Experiments itation with three main goals in mind: making 7.1 What Constitutes a Social Experiment the letter engaging and easy to read, promot- in Economics? ing the benefits of a proposed Center for Environmental Policy Analysis (CEPA), and Robert Ferber and Warner Hirsch (1982, clearly stating the key points of the experi- p. 7) define social experiments in economics mental protocol. In the personalized letter, as “… a publicly funded study that incorpo- they noted CEPA’s role within the Central rates a rigorous statistical design and whose Florida community, the total funds required experimental aspects are applied over a peri- to purchase the computer, the amount of od of time to one or more segments of a seed money available, the number of solicita- human population, with the aim of evaluating tions sent out, and the refund rule (if any). the aggregate economic and social effects of They also explained that contributions in the experimental treatments.” In many excess of the amount required for the com- respects this definition includes field experi- puter would be used for other purposes at ments and even lab experiments. The point CEPA, noted the tax deductibility of the con- of departure for social experiments seems to tribution, and closed the letter with contact be that they are part of a government agency’s information in case the donors had questions. attempt to evaluate programs by deliberate The text of the solicitation letter was com- variations in agency policies. Thus they typi- pletely identical across treatments, except cally involve variations in the way that the for the variables that changed from one agency does its normal business, rather than dec04_Article 1 12/14/04 2:25 PM Page 1037

Harrison and List: Field Experiments 1037

de novo programs. This characterization fits determining punitive damages in the civil well with the tradition of large-scale social lawsuits generated by the Exxon Valdez oil experiments in the 1960s and 1970s, dealing spill. It is also playing a major role in ongo- with negative income taxes, employment pro- ing efforts by some corporations to affect grams, health insurance, electricity pricing, “tort reform” with respect to limiting appeal and housing allowances.56 bonds for punitive awards and even caps on In recent years the lines have become punitive awards. blurred. Government agencies have been 7.2 Methodological Lessons using experiments to examine issues or policies that have no close counterpart, so that The literature on social experiments has their use cannot be viewed as variations on a been the subject of sustained methodologi- bureaucratic theme. Perhaps the most cal criticism. Unfortunately, this criticism notable social experiments in recent years has created a false tension between the use have been paired-audit experiments to iden- of experiments and the use of econometrics tify and measure discrimination. These applied to field data. We believe that virtual- involve the use of “matched pairs” of indi- ly all of the criticisms of social experiments viduals, who are made to look as much alike potentially apply in some form to field as possible apart from the protected charac- experiments unless they are run in an ideal teristics (e.g., race). These pairs then con- manner, so we briefly review the important front the target subjects, who are employers, ones. Indeed, many of them also apply to landlords, mortgage loan officers, or car conventional lab experiments. salesmen. The majority of audit studies con- Recruitment and the Evaluation Problem. ducted to date have been in the fields of Heckman and Smith (1995, p. 87) go to the employment discrimination and housing dis- heart of the role of experiments in a social- crimination (see P. A. Riach and J. Rich 2002 policy setting, when they note that “the for a review).57 strongest argument in favor of experiments The lines have also been blurred by open is that under certain conditions they solve lobbying efforts by private companies to the fundamental evaluation problem that influence social-policy change by means of arises from the impossibility of observing experiments. Exxon funded a series of exper- what would happen to a given person in iments and surveys, collected by Jerry both the state where he or she receives a Hausman (1993), to ridicule the use of the treatment (or participates in a program) and contingent valuation method in environmen- the state where he or she does not. If a per- tal damage assessment. This effort was in son could be observed in both states, the response to the role that such surveys poten- impact of the treatment on that person tially played in the criminal action brought could be calculated by comparing his or by government trustees after the Exxon her outcomes in the two states, and the Valdez oil spill. Similarly, ExxonMobil fund- evaluation problem would be solved.” ed a series of experiments and focus groups, Randomization to treatment is the means by collected in Cass Sunstein et al. (2002), to which social experiments solve this problem ridicule the way in which juries determine if one assumes that the act of randomizing punitive damages. This effort was in subjects to treatment does not lead to a clas- response to the role that juries played in sic sample selection effect, which is to say that it does not “alter the pool of participants 56 See Ferber and Hirsch (1978, 1982) and Jerry of their behavior” (p. 88). Hausman and David Wise (1985) for wonderful reviews. Unfortunately, randomization could plau- 57 Some discrimination studies have been undertaken by academics with no social-policy evaluation (e.g., Chaim sibly lead to either of these outcomes, which Fershtman and Uri Gneezy 2001 and List 2004b). are not fatal but do necessitate the use of dec04_Article 1 12/14/04 2:25 PM Page 1038

1038 Journal of Economic Literature, Vol. XLII (December 2004)

“econometric(k)s.” We have discussed “for” the transaction with the experimenter. already the possibility that the use of ran- Of course, they were bookies, and hence domization could attract subjects to experi- selected for that occupation. ments that are less risk-averse than the A variant on the recruitment problem population, if the subjects rationally antici- occurs in settings where subjects are pate the use of randomization. It is well- observed over a period of time, and attrition known in the field of clinical drug trials that is a possibility. Statistical methods can be persuading patients to participate in random- developed to use differential attrition rates ized studies is much harder than persuading as valuable information on how subjects them to participate in nonrandomized stud- value outcomes (e.g., see Philipson and ies (e.g., Michael Kramer and Stanley Hedges 1998). Shapiro 1984). The same problem applies to Substitution and the Evaluation Problem. social experiments, as evidenced by the diffi- The second assumption underlying the valid- culties that can be encountered when ity of social experiments is that “close substi- recruiting decentralized bureaucracies to tutes for the experimental treatment are not administer the random treatment (e.g., V. readily available” (Heckman and Smith Joseph Hotz 1992). James Heckman and 1995, p. 88). If they are, then subjects who Richard Robb (1985) note that the refusal are placed in the control group could opt for rate in one randomized job-training program the substitutes available outside the experi- was over 90 percent, with many of the mental setting. The result is that outcomes in refusals citing ethical concerns with adminis- the control no longer show the effect of “no tering a random treatment. treatment,” but instead the effect of “possi- What relevance does this have for field or ble access to an uncontrolled treatment.” lab experiments? The answer is simple: we Again, this is not a fatal problem, but one do not know, since it has not been systemat- that has to be addressed explicitly. In fact, it ically studied. On the other hand, field has arisen already in the elicitation of prefer- experiments have one major advantage if ences over field commodities, as discussed in they involve the use of subjects in their nat- section 4. ural environment, undertaking tasks that Experimenter Effects. In social experi- they are familiar with, since no sample selec- ments, given the open nature of the political tion is involved at the first level of the exper- process, it is almost impossible to hide the iment. In conventional lab experiments experimental objective from the person there is sample selection at two stages: the implementing the experiment or the subject. decision to attend college, and then the deci- The paired-audit experiments are perhaps sion to participate in the experiment. In arte- the most obvious targets of this, since the factual field experiments, as we defined the “treatments” themselves have any number of term in section 1, the subject selects to be in ways to bring about the conclusion that is the naturally occurring environment and favored by the research team conducting the then in the decision to be in the experiment. experiment. In this instance, the Urban So the artefactual field experiment shares Institute makes no bones about its view that this two-stage selection process with conven- discrimination is a widespread problem and tional lab experiments. However, the natural that paired-audit experiments are a critical field experiment has only one source of pos- way to address it (e.g., a casual perusal of sible selection bias: the decision to be in the Michael Fix and Raymond Struyk 1993). naturally occurring market. Hence the book- There is nothing wrong with this, apart from ies that accepted the contrived bets of the fact that it is hard to imagine how volun- Camerer (1998) had no idea that he was con- teer auditors would not see things similarly. ducting an experiment, and did not select Indeed, Heckman (1998, p. 104) notes that dec04_Article 1 12/14/04 2:25 PM Page 1039

Harrison and List: Field Experiments 1039

“auditors are sometimes instructed on the expressed their concerns, summarized by ‘problem of discrimination in American soci- Bohm (1984b, p. 136) as follows: ety’ prior to sampling firms, so they may have been coached to find what audit agen- They reported that they had held meetings of cies wanted them to find.” The opportunity their own and had decided (1) that they did not accept the local government’s decision not to for unobservables to influence the outcome provide them with regular bus service on regu- are potentially rampant in this case. lar terms; (2) that they did not accept the idea of Of course, simple controls could be having to pay in a way that differs from the way designed to address this issue. One could that “everybody else” pays (bus service is subsi- have different test-pairs visit multiple loca- dized in the area)—the implication being that they would rather go without this bus service, tions to help identify the effect of a given even if their members felt it would be worth the pair on the overall measure of discrimina- costs; (3) that they would not like to help in real- tion. The variability of measured discrimi- izing an arrangement that might reduce the nation across audit pairs is marked, and level of public services provided free or at low raises statistical issues, as well as issues of costs. It was argued that such an arrangement, if accepted here, could spread to other parts of the interpretation (e.g., see Heckman and public sector; and (4) on these grounds, they Siegelman 1993). Another control could be advised their union members to abstain from to have an artificial location for the audit participating in the project. pair to visit, where their “unobservables” could be “observed” and controlled in later This fascinating outcome is actually more statistical analyses. This procedure is used relevant for experimental economics in gen- in a standard manner in private business eral than it might seem.58 concerned with measuring the quality of When certain institutions are imposed on customer relations in the field. subjects, and certain outcomes tabulated, it One stunning example of experimenter does not necessarily follow that the out- effects from Bohm (1984b) illustrates what comes of interest for the experimenter are can happen when the subjects see a meta- the ones that are of interest to the subject.59 game beyond the experiment itself. In 1980 For example, Isaac and Smith (1985) he undertook a framed field experiment for observe virtually no instances of predatory a local government in Stockholm that was pricing in a partial equilibrium market in considering expanding a bus route to a which the prey had no alternative market to major hospital and a factory. The experiment escape to at the first taste of blood. In a was to elicit valuations from people who comparable multi-market setting in which were naturally affected by this route, and to subjects could choose to exit markets for test whether their aggregate contributions other markets, Harrison (1988) observed would make it worthwhile to provide the many instances of predatory pricing. service. A key feature of the experiment was 7.3 Surveys that Whisper in the Ears of that the subjects would have to be willing to Princes pay for the public good if it was to be provided for a trial period of six months. Field surveys are often undertaken to eval- Everyone who was likely to contribute was uate environmental injury. Many involve con- given information on the experiment, but trolled treatments such as “scope tests” of when it came time for the experiment virtually nobody turned up! The reason was that the local trade unions had decided to boy- 58 It is a pity that Bohm (1984b) himself firmly catego- cott the experiment, since it represented a rized this experiment as a failure, although one can understand that perspective. threat to the current way in which such serv- 59 See Philipson and Hedges (1998) for a general statis- ices were provided. The union leaders tical perspective on this problem. dec04_Article 1 12/14/04 2:25 PM Page 1040

1040 Journal of Economic Literature, Vol. XLII (December 2004)

changes in the extent of the injury, or differ- to inform. The second paragraph is intended ences in the valuation placed on the injury.60 to convince the recipient of their importance Unfortunately, such surveys suffer from the to the study. The idea here is to explain that fact that they do not ask subjects to make a their name has been selected as one of a direct economic commitment, and that this small sample, and that for the sample to be will likely generate an inflated valuation representative they need to respond. The report.61 However, many field surveys are goal is clearly to put some polite pressure on designed to avoid the problem of hypothetical the subject to make sure that their socio-eco- bias, by presenting the referenda as “adviso- nomic characteristic set is represented. ry.” Great care is often taken in the selection The third paragraph ensures confidential- of motivational words in cover letters, open- ity, so that the subject can ignore any possi- ing survey questions, and key valuation ques- ble repercussion from responding one way or tions, to encourage the subject to take the the other in a “politically incorrect” manner. survey seriously in the sense that their Although seemingly mundane, this assur- response will “count.”62 To the extent that ance can be important when the researcher they achieve success in this, these surveys interprets the subject as responding to the should be considered social experiments. question at hand rather than uncontrolled Consider the generic cover letter advocat- perceptions of repercussions. It also serves ed by Don Dillman (1978, pp. 165ff.) for use to mimic the anonymity of the ballot box. in mail surveys. The first paragraph is The fourth paragraph builds on the pre- intended to convey something about the ceding three to drive home the usefulness of social usefulness of the study: that there is the survey response itself, and the possibility some policy issue that the study is attempting that it will influence behavior:

The fourth paragraph of our cover letter reem- 60 Scope treatments might be employed if there is some phasizes the basic justification for the study—its scientific uncertainty about the extent of the injury to the social usefulness. A somewhat different environment at the time of the valuation, as in the two sce- approach is taken here, however, in that the narios used in the survey of the Kakadu Conservation Zone in Australia reported in David Imber, Gay Stevenson, and intent of the researcher to carry through on any Leanne Wilks (1991). Or they may be used to ascertain promises that are made, often the weakest link some measure of the internal validity of the elicited valua- in making study results useful, is emphasized. In tions, as discussed by Carson (1997) and V. Kerry Smith {an example cover letter in the text} the promise and Laura Osborne (1996). Variations in the valuation are (later carried out) was made to provide results to the basis for inferring the demand curve for the environ- government officials, consistent with the lead mental curve, as discussed by Glenn Harrison and Bengt paragraph, which included a reference to bills Kriström (1996). 61 being considered in the State Legislature and See Cummings and Harrison (1994), Cummings, Congress. Our basic concern here is to make the Harrison, and Rutström (1995), and Cummings et al. (1997). 62 There are some instances in which the agency under- promise of action consistent with the original taking the study is deliberately kept secret to the respon- social utility appeal. In surveys of particular dent. For example, this strategy was adopted by Carson et communities, a promise is often made to pro- al. (1992) in their survey of the Exxon Valdez oil spill vide results to the local media and city officials. undertaken for the attorney-general of the state of Alaska. (Dillman 1978, p. 171) They in fact asked subjects near the end of the survey who they thought had sponsored the study, and only 11 percent responded correctly (p. 91). However, 29 percent thought From our perspective, the clear intent and that Exxon had sponsored the study. Although no explicit effect of these admonitions is to attempt to connection was made to suggest who would be using the convince the subject that their response will results, it is therefore reasonable to presume that at least 40 percent of the subjects expected the responses to go have some probabilistic bearing on actual directly to one or another of the litigants in this well- outcomes. known case. Of course, that does not ensure that the This generic approach has been used, for responses will have a direct impact, since there may have been some (rational) expectation that the case would settle example, in the CVM study of the Nestucca without the survey results being entered as evidence. oil spill by Rowe et al. (1991). Their cover dec04_Article 1 12/14/04 2:25 PM Page 1041

Harrison and List: Field Experiments 1041

letter contained the following sentences in Although no promise of a direct policy the opening and penultimate paragraphs: impact is made, the survey responses are obviously valued in this instance by the Government and industry officials throughout agency charged with directly and publically the Pacific Northwest are evaluating programs to prevent oil spills in this area. Before making advising the relevant politicians on the matter. decisions that may cost you money, these offi- It remains an open question if these “advi- cials want your input. … The results of this study sory referenda” actually motivate subjects to will be made available to representatives of state, respond truthfully, although that is obviously provincial and federal governments, and indus- something that could be studied systemati- try in the Pacific Northwest. (emphasis added) cally as part of the exercise or using con- 64 In the key valuation question, subjects are trolled laboratory and field experiments. motivated by the following words: 8. Natural Experiments Your answers to the next questions are very 8.1 What Constitutes a Natural Experiment important. We do not yet know how much it will in Economics? cost to prevent oil spills. However, to make decisions about new oil spill prevention programs Natural experiments arise when the that could cost you money, government and experimenter simply observes naturally industry representatives want to learn how much occurring, controlled comparisons of one it is worth to people like you to avoid more spills. or more treatments with a baseline.65 The common feature of these experiments is These words reinforce the basic message serendipity: policy makers, nature, or tele- of the cover letter: there is some probabili- vision game-show producers66 conspire to ty, however small, that the response of the subject will have an actual impact. 64 Harrison (2005b) reviews the literature. More direct connections to policy impact 65 Good examples in economics include H. E. Frech occur when the survey is openly undertaken (1976); Roth (1991); Jere Behrman, Mark Rosenzweig, and for a public agency charged with making the Paul Taubman (1994); Stephen Bronars and Jeff Grogger (1994); Robert Deacon and Jon Sonstelie (1985); Andrew policy decision. For example, the Resource Metrick (1995); Bruce Meyer, W. Kip Viscusi, and David Assessment Commission of Australia was Durbin (1995); John Warner and Saul Pleeter (2001); and charged with making a decision on an applica- Mitch Kunce, Shelby Gerking, and William Morgan (2002). 66 Smith (1982; p. 929) compared the advantages of lab- tion to mine in public lands, and used a survey oratory experiments to econometric practice, noting that to help it evaluate the issue. The cover letter, “Over twenty-five years ago, Guy Orcutt characterized the signed by the chairperson of the commission econometrician as being in the same predicament as the electrical engineer who has been charged with the task of under the letterhead of the commission, deducing the laws of electricity by listening to a radio play. spelled out the policy setting clearly: To a limited extent, econometric ingenuity has provided some techniques for conditional solutions to inference The Resource Assessment Commission has problems of this type.” Arguably, watching the television been asked by the Prime Minister to conduct an can be an improvement on listening to the radio, since TV inquiry into the use of the resources of the game shows provide a natural avenue to observe real deci- Kakadu Conservation Zone in the Northern sions in an environment with high stakes. J. B. Berk, E. Hughson, and K. Vandezande (1996) and Rafael Tenorio Territory and to report to him on this issue by 63 and Timothy Cason (2002) study contestants’ behavior on the end of April 1991. … You have been The Price Is Right to investigate rational decision theory selected randomly to participate in a national and whether subjects play the unique subgame perfect survey related to this inquiry. The survey will be Nash equilibrium. R. Gertner (1993) and R. M. W. J. asking the views of 2500 people across Beetsma and P. C. Schotman (2001) make use of data from Australia. It is important that your views are Card Sharks and Lingo to examine individual risk prefer- recorded so that all groups of Australians are ences. Steven Levitt (2003) and List (2003) use data from included in the survey. (Imber, Stevenson, and The Weakest Link and Friend or Foe to examine the nature Wilks 1991, p. 102) and extent of disparate treatment among game-show contestants. And Metrick (1995) uses data from Jeopardy! to analyze behavior under uncertainty and players’ ability to 63 The cover letter was dated August 28, 1990. choose strategic best-responses. dec04_Article 1 12/14/04 2:25 PM Page 1042

1042 Journal of Economic Literature, Vol. XLII (December 2004)

generate these comparisons. The main elected to take the annuity, one could infer attraction of natural experiments is that that his discount rate was less than the they reflect the choices of individuals in a threshold.67 natural setting, facing natural conse- This design is essentially the same as one quences that are typically substantial. The used in a long series of laboratory experi- main disadvantage of natural experiments ments studying the behavior of college stu- derives from their very nature: the experi- dents.68 Comparable designs have been menter does not get to pick and choose the taken into the field, such as the study of the specifics of the treatments, and the experi- Danish population by Harrison, Lau, and menter does not get to pick where and Williams (2002). The only difference is that when the treatments will be imposed. The the field experiment evaluated by WP first problem may result in low power to offered each individual only one discount detect any responses of interest, as we rate: Harrison, Lau, and Williams offered illustrate with a case study in section 8.2 each subject twenty different discount below. While there is a lack of control, we rates, ranging between 2.5 percent and 50 should obviously not look a random gift percent. horse in the mouth when it comes to mak- Five features of this natural experiment ing inferences. There are some circum- make it particularly compelling for the pur- stances, briefly reviewed in section 8.3, pose of estimating individual discount when nature provides useful controls to rates. First, the stakes were real. Second, augment those from theory or “manmade” the stakes were substantial and dwarf any- experimentation. thing that has been used in laboratory experiments with salient payoffs in the United States. The average lump-sum 8.2 Inferring Discount Rates by Heroic amounts were around $50,000 and $25,000 Extrapolation for officers and enlisted personnel, respec-

In 1992, the United States Department of Defense started offering substantial early retirement options to nearly 300,000 indi- 67 Warner and Pleeter (2001) recognize that one prob- viduals in the military. This voluntary sepa- lem of interpretation might arise if the very existence of the scheme signaled to individuals that they would be ration policy was instituted as part of a forced to retire anyway. As it happens, the military also general policy of reducing the size of the significantly tightened up the rules governing “progres- military as part of the “Cold War dividend.” sion through the ranks,” so that the probability of being involuntarily separated from the military increased at the John Warner and Saul Pleeter (2001) (WP) same time as the options for voluntary separation were recognize how the options offered to mili- offered. This background factor could be significant, tary personnel could be viewed as a natural since it could have led to many individuals thinking that they were going to be separated from the military anyway experiment with which one could estimate and hence deciding to participate in the voluntary individual discount rates. In general terms, scheme even if they would not have done so otherwise. one option was a lump-sum amount, and Of course, this background feature could work in any direction, to increase or decrease the propensity of a the other option was an annuity. The indi- given individual to take one or the other option. In any vidual was told what the cut-off discount event, WP allow for the possibility that the decision to rate was for the two to be actuarially equal, join the voluntary separation process itself might lead to sample selection issues. They estimate a bivariate probit and this concept was explained in various model, in which one decision is to join the separation ways. If an individual is observed to take process and the other decision is to take the annuity the lump-sum, one could infer that his dis- rather than the lump-sum. 68 See Coller and Williams (1999), and Shane count rate was greater than the threshold Frederick, George Loewenstein, and Ted O’Donoghue rate. Similarly, for those individuals that (2002), for recent reviews of those experiments. dec04_Article 1 12/14/04 2:25 PM Page 1043

Harrison and List: Field Experiments 1043

tively.69 Third, the military went to some iment are simply not robust to the sampling lengths to explain to everyone the financial and predictive uncertainty of having to use implications of choosing one option over an estimated model to infer discount rates. the other, making the comparison of per- We use the same method as WP (2001, table sonal and threshold discount rates relative- 6, p. 48) to calculate estimated discount 71 ly transparent. Fourth, the options were rates. In their table 3, WP calculate the offered to a wide range of officers and mean predicted discount rate from a single- enlisted personnel, such that there are sub- equation probit model, using only the dis- stantial variations in key demographic vari- count rate as an explanatory variable, ables such as income, age, race, and employing a shortcut formula that correctly education. Fifth, the time horizon for the evaluates the mean discount rate. After each annuity differed in direct proportion to the probit equation is estimated, it is used to years of military service of the individual, predict the probability that each individual so that there are annuities between four- would accept the lump-sum alternative at teen and thirty years in length. This facili- discount rates varying between 0 percent tates evaluation of the hypothesis that and 100 percent in increments of 1 percent- discount rates are stationary over different age point. For example, consider a 5 percent time horizons. discount rate offered to officers, and the WP conclude that the average individual results of the single-equation probit model. discount rates implied by the observed sep- Of the 11,212 individuals in this case, 72 per- aration choices were high relative to a priori cent are predicted to have a probability of expectations for enlisted personnel. In one accepting the lump-sum of 0.5 or greater. model in which the after-tax interest rate The lowest predicted probability of accept- offered to the individual appears in linear ance for any individual at this rate is 0.207, form, they predict average rates of 10.4 per- and the highest is 0.983. cent and 35.4 percent for officers and enlist- Similar calculations are undertaken for ed personnel, respectively. However, this each possible discount rate between 0 per- model implicitly allows estimated discount cent and 100 percent, and the results tabu- rates to be negative, and indeed allows them lated. Once the predicted probabilities of to be arbitrarily negative. In an alternative acceptance are tabulated for each of the model in which the interest rate term individuals offered the buy-out, and each appears in logarithmic form, and one possible discount rate between 0 percent implicitly imposes the a priori constraint and 100 percent, we loop over each individ- that an elicited individual discount rate be ual and identify the smallest discount rate at positive, they estimate average rates of 18.7 which the lump-sum would be accepted. percent and 53.6 percent, respectively. We This smallest discount rate is precisely prefer the estimates that impose this prior where the probit model predicts that this belief, although nothing below depends on individual would be indifferent between the using them.70 lump-sum and the annuity. This provides a We show that many of the conclusions distribution of estimated minimum discount about discount rates from this natural exper- rates, one for each individual in the sample. In figure 2 we report the results of this calculation, showing the distribution of 69 92 percent of the enlisted personnel accepted the lump-sum, and 51 percent of the officers. However, these personal discount rates initially offered to acceptance rates varied with the interest rates offered, the subjects and then the distributions particularly for enlisted personnel. implied by the single-equation probit 70 Harrison (2005a) documents the detailed calculations involved, and examines the differences that arise with alternative specifications and samples. 71 John Warner kindly provided the data. dec04_Article 1 12/14/04 2:25 PM Page 1044

1044 Journal of Economic Literature, Vol. XLII (December 2004)

.025 1

.02 .8

.015 .6

.01 .4 Fraction Offered Fraction Estimated .005 .2

0 0 0 20 40 60 80 100 Discount Rate in Percent Estimated Offered

Figure 2. Offered and Estimated Discount Rates in Warner and Pleeter Natural Experiments

model used by WP. 72 These results pool roughly centered on the distribution of the data for all separating personnel. The offered rates, but much more dispersed. grey histogram shows the after-tax discount There is nothing “wrong” with these differ- rates that were offered, and the black his- ences between the offered and estimated togram shows the discount rates inferred discount rates, although they will be critical from the estimated “log-linear” model that when we calculate standard errors on these constrains discount rates to be positive. estimated discount rates. Again, the estimat- Given the different shapes of the his- ed rates in figure 2 are based on the logic tograms, they use different vertical axes to described above: no prediction error is allow simple visual comparisons. assumed from the estimated statistical The main result is that the distribution of model when it is applied at the level of the estimated discount rates is much wider than individual to predict the threshold rate at the distribution of offered rates. Harrison which the lump-sum would be accepted. (2005a) presents separate results for the The main conclusion of WP is contained samples of officers and enlisted personnel, in their table 6, which lists estimates of the and for the alternative specifications consid- average discount rates for various groups of ered by WP. For enlisted personnel the dis- their subjects. Using the model that imposes tribution of estimated rates is almost entirely the a priori restriction that discount rates be out-of-sample in comparison to the offered positive, they report that the average dis- rates above it. The distribution for officers is count rate for officers was 18.7 percent, and 53.6 percent for enlisted personnel. What 72 Virtually identical results are obtained with the are the standard errors on these means? model that corrects for possible sample-selection effects. There is reason to expect that they could be dec04_Article 1 12/14/04 2:25 PM Page 1045

Harrison and List: Field Experiments 1045

quite large, due to constraints on the scope distributions only reflects sampling over the of the natural experiment. individuals. One can generate standard Individuals were offered a choice errors that also capture the uncertainty in between a lump-sum and an annuity. The the probit model coefficients as well. before-tax discount rate that just equated Figure 3 displays the results of taking into the present value of the two instruments account the uncertainty about the coeffi- ranged between 17.5 percent and 19.8 per- cients of the estimated model used by WP. cent, which is a very narrow range of dis- Since it is an important dimension to consid- count rates. The after-tax equivalent rates er, we show the time horizon for the elicited ranged from a low of 14.5 percent up to 23.5 discount rates on the horizontal axis.74 The percent for those offered the separation middle line shows a cubic spline through the option, but over 99 percent of the after-tax predicted average discount rate. The top rates were between 17.6 percent and 20.4 (bottom) line shows a cubic spline through percent. Thus the above inferences about the upper (lower) bound of the 95 percent average discount rates for enlisted person- confidence interval, allowing for uncertainty nel are “out of sample,” in the sense that in the individual predictions due to reliance they do not reflect direct observation of on an estimated statistical model to infer dis- responses at those rates of 53.6 percent, or count rates.75 Thus, in figure 3 we see that indeed at any rates outside the interval (14.5 there is considerable uncertainty about the percent, 23.5 percent). Figure 2 illustrates discount rates for enlisted personnel, and this point as well, since the right mode is that it is asymmetric. On balance, the model entirely due to the estimates of enlisted per- implies a considerable skewness in the dis- sonnel. The average for enlisted personnel tribution of rates for enlisted personnel, therefore reflects, and relies on, the predic- with some individuals having extremely high tive power of the parametric functional implied discount rates. Turning to the forms fitted to the observed data. The same results for officers, we find much less of an general point is true for officers, but the effect from model uncertainty. In this case problem is far less severe. the rates are relatively precisely inferred, Even if one accepted the parametric func- particularly around the range of rates span- tional forms (probit), the standard errors of ning the effective rates offered, as one predictions outside of the sample range of would expect.76 break-even discount rates will be much larg- We conclude that the results for enlisted er than those within the sample range.73 The personnel are too imprecisely estimated for standard errors of the predicted response can be calculated directly from the estimat- 74 The time horizon of the annuity offered to individu- ed model. Note that this is not the same as als in the field varied directly with the years of military the distribution shown in figure 2, which is a service completed. For each year of service the horizon on the annuity was two years longer. As a result, the annuities distribution over the sample of individuals at being considered by individuals were between fourteen each simulated discount rate that assume and thirty years in length. With roughly 10 percent of the that the model provides a perfect prediction sample at each horizon, the average annuity horizon was around 22 years. for each individual.Inother words, the pre- 75 In fact, we calculate rates only up to 100 percent, so dictions underlying figure 2 just use the the upper confidence intervals for the model is con- average prediction for each individual as the strained to equal 100 percent for that reason. It would be a simple matter to allow the calculation to consider higher truth, so the sampling error reflected in the rates, but there would be little inferential value in doing so. 76 It is a standard result from elementary econometrics that the forecast interval widens as one uses the regression 73 Relaxing the functional form also allows some addi- model to predict for values of the exogenous variables that tional uncertainty into the estimation of individual dis- are further and further away from their average (e.g., count rates. William Greene 1993, p. 164–66). dec04_Article 1 12/14/04 2:25 PM Page 1046

1046 Journal of Economic Literature, Vol. XLII (December 2004)

Officers Enlisted Personnel 100 100

75 75

50 50

25 25

0 0 15 20 25 30 15 20 25 30 time horizon of annuity in years time horizon of annuity in years

Figure 3. Implied Discount Rates Incorporating Model Uncertainty

them to be used to draw reliable inferences “natural natural experimental approach” by about the discount rates. However, the Rosenzweig and Wolpin (2000). results for officers are relatively tightly esti- For example, monozygotic twins are effec- mated, and can be used to draw more reli- tively natural clones of each other at birth. able inferences. The reason for the lack of Thus one can, in principle, compare out- precision in the estimates for enlisted per- comes for such twins to see the effect of dif- sonnel is transparent from the design, which ferences in their history, knowing that one was obviously not chosen by the experi- has a control for abilities that were innate at menters: the estimates rely on out-of-sam- birth. Of course, a lot of uncontrolled and ple predictions, and the standard errors unobserved things can occur after birth and embodied in figure 3 properly reflect the before humans get to make choices that are uncertainty of such an inference. of any policy interest. So the use of such instruments obviously requires additional 8.3 Natural Instruments assumptions, beyond the a priori plausible Some variable or event is said to be a one that the natural biological event that led good instrument for unobserved factors if it to these individuals being twins was inde- is orthogonal to those factors. Many of the pendent of the efficacy of their later educa- difficulties of “manmade” random treat- tional and labor-market experiences. Thus ments have been discussed in the context of the lure of “measurement without theory” is social experiments. However, in recent clearly illusory. years many economists have turned to Another concern with the “natural instru- “nature-made” random treatments instead, ments” approach is that it often relies on the employing an approach to the evaluation of assumption that only one of the explanatory treatments that has come to be called the variables is correlated with the unobserved dec04_Article 1 12/14/04 2:25 PM Page 1047

Harrison and List: Field Experiments 1047

factors.77 This means that only one instru- market behavior, they are positing “what if” ment is required, which is fortunate since scenarios which need not be tethered to nature is a stingy provider of such instru- reality. Sometimes theorists constrain their ments. Apart from twins, natural events that propositions by the requirement that they be have been exploited in this literature “operationally meaningful,” which only include birth dates, gender, and even weath- requires that they be capable of being refut- er events, and these are not likely to grow ed, and not that anyone has the technology dramatically over time. or budget to actually do so. Both of these concerns point the way to a Tests of expected utility theory have pro- complementary use of different methods of vided a dramatic illustration of the impor- experimentation, much as econometricians tance of thought experiments being use a priori identifying assumptions as a explicitly linked to stochastic assumptions substitute for data in limited information involved in their use. Several studies offer a environments. rich array of different error specifications leading to very different inferences about 9. Thought Experiments the validity of expected utility theory, and particularly about what part of it appears to Thought experiments are extremely com- be broken: Ballinger and Wilcox (1997); mon in economics, and would seem to be Enrica Carbonne (1997); David Harless fundamentally different from lab and field and Camerer (1994); Hey (1995); John Hey experiments. We argue that they are not, and Chris Orme (1994); Graham Loomes, drawing on recent literature examining the Peter Moffatt, and Sugden (2002); and role of statistical specifications of experi- Loomes and Sugden (1995, 1998). The mental tests of deterministic theories. methodological problem is that debates Although it may surprise some, the compar- over the characterization of the residual ison between lab experiments and field have come to dominate the substantive experiments that we propose has analogues issues, as crisply drawn by Ballinger and to the way thought experiments have been Wilcox (1997, p. 1102)78: debated in analytic philosophy and the view that thought experiments are just “attenuat- We know subjects are heterogeneous. The rep- ed experiments.” Finally, we consider the resentative decision maker … restriction fails place of measures of the natural functioning miserably both in this study and new ones …. of the brain during artefactual experimental Purely structural theories permit heterogeneity by allowing several preference patterns, but are conditions. mute when it comes to mean error rate variability between or within patterns (restrictions like 9.1 Where Are the Econometric Instructions CE) and within-pattern heterogeneity of choice to Test Theory? probabilities (restrictions like CH and ZWC). We believe Occam’s Razor and the ‘Facts don’t To avoid product liability litigation, it is kill theories, theories do’ cliches do not apply: standard practice to sell commodities with CE, CH and ZWC are an atheoretical supporting clear warnings about dangerous use and cast in dramas about theoretical stars, and poor showings by this cast should be excused neither operating instructions designed to help one because they are simple nor because there are no get the most out of the product. replacements. It is time to audition a new cast. Unfortunately, the same is not true of economic theories. When theorists undertake In this instance, a lot has been learned thought experiments about individual or about the hidden implications of alternative

77 Rosenzweig and Wolpin (2000, p. 829, fn.4, and p. 78 The notation in this quote does not need to be 873). defined for the present point to be made. dec04_Article 1 12/14/04 2:25 PM Page 1048

1048 Journal of Economic Literature, Vol. XLII (December 2004)

stochastic specifications for experimental illustrated by example. We choose an exam- tests of theory. But the point is that all of this ple in which there have been actual (lab and could have been avoided if the thought field) experiments, but where the actual experiments underlying the structural mod- experiments could have been preceded by a els had accounted for errors and allowed for thought experiment. Specifically, consider individual heterogeneity in preferences the identification of “trust” as a characteris- from the outset. That relaxation does not tic of an individual’s utility function. In rescue expected utility theory, nor is that the some studies this concept is defined as the intent, but it does serve to make the experi- sole motive that leads a subject to transfer mental tests informative for their intended money to another subject in an investment purpose of identifying when and where that game.79 For example, Joyce Berg, John theory fails. Dickhaut, and McCabe (1995) use the game to measure “trust” by the actions of the first 9.2 Are Thought Experiments Just Slimmed- player and hence “trustworthiness” from Down Experiments? the responses of the second player.80 But “trust” measured in this way obviously Roy Sorenson (1992) presents an elabo- suffers from at least one confound: aversion rate defense of the notion that a thought to inequality, or “other-regarding prefer- experiment is really just an experiment “that ences.” The idea is that someone may be purports to achieve its aim without the ben- averse to seeing different payoffs for the two efit of execution” (p. 205). This lack of exe- players, since roles and hence endowments cution leads to some practical differences, in the basic version are assigned at random. such as the absence of any need to worry This is one reason that almost all versions of about luck affecting outcomes. Another dif- the experiments have given each player the ference is that thought experiments actually same initial endowment to start, so that the require more discipline if they are to be first player does not invest money with the valid. In his Nobel Prize lecture, Smith second player just to equalize their payoffs. (2003, p. 465) notes that: But it is possible that the first player would Doing experimental economics has changed the like the other player to have more, even if it way I think about economics. There are many means having more than the first player. reasons for this, but one of the most prominent Cox (2004) proposes that one pair the is that designing and conducting experiments investment game with a dictator game81 to forces you to think through the process rules and identify how much of the observed transfer procedures of an institution. Few, like Einstein, can perform detailed and imaginative mental from the first player is due to “trust” and experiments. Most of us need the challenge of how much is due to “other-regarding prefer- real experiments to discipline our thinking. ences.” Since there is strong evidence that

There are, of course, other differences 79 Player 1 transfers some percentage of an endowment between the way that thought experiments to player 2, that transfer is tripled, and then player 2 and actual experiments are conducted and decides how much of the expanded pie to return. 80 This game has been embedded in many other set- presented. But these likely have more to do tings before and after Berg, Dickhaut, and McCabe with the culture of particular scholarly (1995). We do not question the use of this game in the groups than anything intrinsic to each type investigation of broader assessments of the nature of “social preferences,” which is an expression that subsumes of experiment. many possible motives for the observed behavior, includ- The manner in which thought experi- ing the ones discussed below. 81 ments can be viewed as “slimmed-down The first player transfers money to the second player, who is unable to return it or respond in any way. Martin experiments—ones that are all talk and no Dufwenberg and Gneezy (2000) also compare the trust action” (Sorenson 1992, p. 190), is best and dictator games directly. dec04_Article 1 12/14/04 2:25 PM Page 1049

Harrison and List: Field Experiments 1049

subjects appear to exhibit substantial aver- more to do with the aims and rhetorical sion to inequality in experiments of this goals of doing experiments. As Sorenson kind, do we need to actually run the experi- (1991, p. 205) notes: ment in which the same subject participates in a dictator game and an investment game The aim of any experiment is to answer or raise to realize that “trust” is weakly overestimat- its question rationally. As stressed (earlier …), the motives of an experiment are multifarious. ed by the executed trust experiments? One One can experiment in order to teach a new might object that we would not be able to technique, to test new laboratory equipment, or make this inference without having run to work out a grudge against white rats. (The some prior experiments in which subjects principal architect of modern quantum electro- transfer money under dictator, so this design dynamics, Richard Feynman, once demonstrated that the bladder does not require gravity by proposal of Cox (2004) does not count as a standing on his head and urinating.) The dis- thought experiment. But imagine counter- tinction between aim and motive applies to factually82 that Cox (2004) left it at that, and thought experiments as well. When I say that an did not actually run an experiment. We experiment ‘purports’ to achieve its aim without would still be able to draw the new inference execution, I mean that the experimental design is presented in a certain way to the audience. from his design that trust is weakly over-esti- The audience is being invited to believe that mated in previous experiments if one contemplation of the design justifies an answer accounts for the potential confound of to the question or (more rarely) justifiably raises inequality aversion.83 Thus, in what sense its question. should we view the thought experiment of the proposed design of Cox (2004) as any- In effect, then, it is caveat emptor with thing other than an attenuated version of the thought experiments—but the same homily ordinary experiment that he actually surely applies to any experiment, even if designed and executed? executed. One trepidation with treating a thought 9.3 That’s Not a Thought Experiment … experiment as just a slimmed-down experi- This Is! ment is that it is untethered by the reality of “proof by data” at the end. But this has We earlier defined the word “field” in the following manner: “used attributively to denote an investigation, study, etc., carried 82 A thought experiment at work. 83 As it happens, there are two further confounds at out in the natural environment of a given work in the trust design, each of which can be addressed. material, language, animal, etc., and not in One is risk attitudes, at least as far as the interpretation of the laboratory, study, or office.” Thus, in an the behavior of the first player is concerned. Sending money to the other player is risky. If the first player keeps important sense, experiments that employ all of his endowment, there is no risk. So a risk-loving play- methods to measure neuronal activity during er would invest, just for the thrill. A risk-averse player controlled tasks would be included, since the would not invest for this reason. But if there are other motives for investing, then risk attitudes will exacerbate or functioning of the brain can be presumed to temper them, and need to be taken into account when be a natural reaction to the controlled stim- identifying the residual as trust. Risk attitudes play no role ulus. Neuroeconomics is the study of how for the second player’s decision. The other confound, in the proposed design of Cox (2004), is that the “price of different parts of the brain light up when giving” in his proposed dictator game is $1 for $1 trans- certain tasks are presented, such as exposure ferred, whereas it is $1 for $3 transferred in the invest- to randomly generated monetary gain or loss ment game. Thus one would weakly understate the extent of other-regarding preferences in his design, and hence in Hans Breiter et al. (2001), the risk elicita- weakly overstate the residual “trust.” The general point is tion tasks of Kip Smith et al. (2002) and even clearer: after these potential confounds are taken Dickhaut et al. (2003), the trust games of into account, what faith does one have that a reliable measure of trust has been identified statistically in the McCabe et al. (2001), and the ultimatum original studies? bargaining games of Alan Sanfey et al. dec04_Article 1 12/14/04 2:25 PM Page 1050

1050 Journal of Economic Literature, Vol. XLII (December 2004)

(2003). In many ways these methods are The main methodological conclusion we extensions of the use of verbal protocols draw is that experimenters should be wary of (speaking out loud as the task is performed) the conventional wisdom that abstract, used by K. Anders Ericsson and Herbert imposed treatments allow general inferences. Simon (1993) to study the algorithmic In an attempt to ensure generality and con- processes that subjects were going through trol by gutting all instructions and procedures as they solved problems, and the use of of field referents, the traditional lab experi- mouse-tracking technology by Eric Johnson menter has arguably lost control to the extent et al. (2002) to track sequential information that subjects seek to provide their own field search in bargaining tasks. The idea is to referents. The obvious solution is to conduct monitor some natural mental process as the experiments both ways: with and without nat- experimental treatment is administered, urally occurring field referents and context. even if the treatment is artefactual. If there is a difference, then it should be studied. If there is no difference, one can conditionally conclude that the field behavior 10. Conclusion in that context travels to the lab environment. We have avoided drawing a single, bright line between field experiments and lab REFERENCES experiments. One reason is that there are Angrist, Joshua D.; Guido W. Imbens and Donald B. Rubin. 1996. “Identification of Casual Effects Using several dimensions to that line, and inevitably Instrumental Variables,” J. Amer. Statist. Assoc. there will be some trade-offs between those. 91:434, pp. 444–45. The extent of those trade-offs will depend on Angrist, Joshua D. and Alan B. Krueger. 2001. “Instrumental Variables and the Search for where researchers fall in terms of their agree- Identification: From Supply and Demand to Natural ment with the argument and issues we raise. Experiments,” J. Econ. Persp. 15:4, pp. 69–85. Another reason is that we disagree where Angrist, Joshua D. and Victor Lavy. 1999. “Using Maimonides’ Rule to Estimate the Effect of Class the line would be drawn. One of us Size on Scholastic Achievement,” Quart. J. Econ. (Harrison), bred in the barren test-tube set- 114:2, pp. 553–75. ting of classroom labs sans ferns, sees virtu- Ballinger, T. Parker; Michael G. Palumbo and Nathaniel T. Wilcox. 2003. “Precautionary Saving ally any effort to get out of the classroom as and Social Learning Across Generations: An constituting a field experiment to some use- Experiment,” Econ. J. 113:490, pp. 920–47. ful degree. The other (List), raised in the Ballinger, T. Parker and Nathaniel T. Wilcox. 1997. “Decisions, Error and Heterogeneity,” Econ. J. wilds amidst naturally occurring sports-card 107:443, pp. 1090–105. geeks, would include only those experiments Banaji, Mahzarin R. and Robert G. Crowder. 1989. that used free-range subjects. Despite this “The Bankruptcy of Everyday Memory,” Amer. Pyschol. 44, pp. 1185–93. disagreement on the boundaries between Bateman, Ian; Alistair Munro, Bruce Rhodes, Chris one category of experiments and another Starmer and Robert Sugden. 1997. “Does Part- category, however, we agree on the charac- Whole Bias Exist? An Experimental Investigation,” Econ. J. 107:441, pp. 322–32. teristics that make a field experiment differ Becker, Gordon M.; Morris H. DeGroot and Jacob from a lab experiment. Marschak. 1964. “Measuring Utility by a Single- Using these characteristics as a guide, we Response Sequential Method,” Behav. Sci. 9:July, pp. 226–32. propose a taxonomy of field experiments Beetsma, R. M. W. J. and P. C. Schotman. 2001. that helps one see their connection to lab “Measuring Risk Attitudes in a Natural Experiment: experiments, social experiments, and natural Data from the Television Game Show Lingo,” Econ. J. 111:474, pp. 821–48. experiments. Many of the differences are Behrman, Jere R.; Mark R. Rosenzweig and Paul illusory, such that the same issues of control Taubman. 1994. “Endowments and the Allocation of apply. But many of the differences matter Schooling in the Family and in the Marriage Market: The Twins Experiment,” J. Polit. Econ. 102:6, pp. for behavior and inference, and justify the 1131–74. focus on the field. Benson, P. G. 2000. “The Hawthorne Effect,” in The dec04_Article 1 12/14/04 2:25 PM Page 1051

Harrison and List: Field Experiments 1051

Corsini Encyclopedia of Psychology and Behavioral Making: A Comparison of Students and Science. Vol. 2, 3rd ed. W. E. Craighead and C. B. Businessmen in a Simulated Progressive Auction,” in Nemeroff, eds. NY: Wiley. Research in Experimental Economics, Vol. 3. V. L. Berg, Joyce E.; John Dickhaut and Kevin McCabe. Smith, ed. Greenwich, CT: JAI Press. 1995. “Trust, Reciprocity, and Social History,” Games Camerer, Colin F. 1998. “Can Asset Markets Be Econ. Behav. 10, pp. 122–42. Manipulated? A Field Experiment with Racetrack Berk, J. B.; E. Hughson and K. Vandezande. 1996. Betting,” J. Polit. Econ. 106:3, pp. 457–82. “The Price Is Right, but Are the Bids? An Camerer, Colin and Robin Hogarth. 1999. “The Effects Investigation of Rational Decision Theory,” Amer. of Financial Incentives in Experiments: A Review Econ. Rev. 86:4, pp. 954–70. and Capital-Labor Framework,” J. Risk Uncertainty Berlin, Brent and Paul Kay. 1969. Basic Color Terms: 19, pp. 7–42. Their Universality and Evolution. Berkeley: UC Press. Cameron, Lisa A. 1999. “Raising the Stakes in the Binswanger, Hans P. 1980. “Attitudes Toward Risk: Ultimatum Game: Experimental Evidence from Experimental Measurement in Rural India,” Amer. J. Indonesia,” Econ. Inquiry 37:1, pp. 47–59. Ag. Econ. 62:3, pp. 395–407. Carbone, Enrica. 1997. “Investigation of Stochastic ———. 1981. “Attitudes Toward Risk: Theoretical Preference Theory Using Experimental Data,” Econ. Implications of an Experiment in Rural India,” Econ. Letters 57:3, pp. 305–11. J. 91:364, pp. 867–90. Cardenas, Juan C. 2003. “Real Wealth and Blackburn, McKinley; Glenn W. Harrison and E. Experimental Cooperation: Evidence from Field Elisabet Rutström. 1994. “Statistical Bias Functions Experiments,” J. Devel. Econ. 70:2, pp. 263–89. and Informative Hypothetical Surveys,” Amer. J. Ag. Carpenter, Jeffrey; Amrita Daniere and Lois Takahashi. Econ. 76:5, pp. 1084–88. 2004. “Cooperation, Trust, and Social Capital in Blundell, R. and M. Costa-Dias. 2002. “Alternative Southeast Asian Urban Slums,” J. Econ. Behav. Org. Approaches to Evaluation in Empirical 55:4, pp. 533–51. Microeconomics,” Portuguese Econ. J. 1, pp. 91–115. Carson, Richard T. 1997. “Contingent Valuation Blundell, R. and Thomas MaCurdy. 1999. “Labor Surveys and Tests of Insensitivity to Scope,” in Supply: A Review of Alternative Approaches,” in Determining the Value of Non-Marketed Goods: Handbook of Labor Economics Vol. 3C. O. Economic, Psychological, and Policy Relevant Aspects Ashenfelter and D. Card, eds. Amsterdam: Elsevier of Contingent Valuation Methods. R. J. Kopp, W. Science BV. Pommerhene and N. Schwartz, eds. Boston: Kluwer. Bohm, Peter. 1972. “Estimating the Demand for Public Carson, Richard T.; Robert C. Mitchell, W. Michael Goods: An Experiment,” Europ. Econ. Rev. 3:2, pp. Hanemann, Raymond J. Kopp, Stanley Presser and 111–30. Paul A. Ruud. 1992. A Contingent Valuation Study of ———. 1979. “Estimating Willingness to Pay: Why Lost Passive Use Values Resulting From the Exxon and How?” Scand. J. Econ. 81:2, pp. 142–53. Valdez Oil Spill. Anchorage: Attorney Gen. Alaska. ———. 1984a. “Revealing Demand for an Actual Chamberlin, Edward H. 1948. “An Experimental Public Good,” J. Public Econ. 24, pp. 135–51. Imperfect Market,” J. Polit. Econ. 56:2, 95–108. ———. 1984b. “Are There Practicable Demand- Coller, Maribeth and Melonie B. Williams. 1999. Revealing Mechanisms?” in Public Finance and the “Eliciting Individual Discount Rates,” Exper. Econ. Quest for Efficiency. H. Hanusch, ed. Detroit: 2, pp. 107–27. Wayne State U. Press. Conlisk, John. 1989. “Three Variants on the Allais ———. 1994. “Behavior under Uncertainty without Example,” Amer. Econ. Rev. 79:3, pp. 392–407. Preference Reversal: A Field Experiment,” in ———. 1996. “Why Bounded Rationality?” J. Econ. Experimental Economics. J. Hey, ed. Heidelberg: Lit. 34:2, pp. 669–700. Physica-Verlag. Cox, James C. 2004. “How To Identify Trust and Bohm, Peter and Hans Lind. 1993. “Preference Reciprocity,” Games Econ. Behav. 46:2, pp. 260–81. Reversal, Real-World Lotteries, and Lottery- Cox, James C. and Stephen C. Hayne. 2002. “Barking Interested Subjects,” J. Econ. Behav. Org. 22:3, pp. Up the Right Tree: Are Small Groups Rational 327–48. Agents?” work. pap. Dept. Econ. U. Arizona. Bornstein, Gary and Ilan Yaniv. 1998. “Individual and Cubitt, Robin P. and Robert Sugden. 2001. “On Group Behavior in the Ultimatum Game: Are Money Pumps,” Games Econ. Behav. 37:1, pp. Groups More Rational Players?” Exper. Econ. 1:1. 121–60. pp. 101–108. Cummings, Ronald G.; Steven Elliott, Glenn W. Breiter, Hans C.; Itzhak Aharon, Daniel Kahneman, Harrison and James Murphy. 1997. “Are Anders Dale, and Peter Shizgal. 2001. “Functional Hypothetical Referenda Incentive Compatible?” J. Imaging of Neural Responses to Expectancy and Polit. Econ. 105:3, pp. 609–21. Experience of Monetary Gains and Losses,” Neuron Cummings, Ronald G. and Glenn W. Harrison. 1994. 30:2, pp. 619–39. “Was the Ohio Court Well Informed in Their Bronars, Stephen G. and Jeff Grogger. 1994. “The Assessment of the Accuracy of the Contingent Economic Consequences of Unwed Motherhood: Valuation Method?” Natural Res. J. 34:1, pp. 1–36. Using Twin Births as a Natural Experiment,” Amer. Cummings, Ronald G.; Glenn W. Harrison and Laura Econ. Rev. 84:5, pp. 1141–56. L. Osborne. 1995. “Can the Bias of Contingent Burns, Penny. 1985. “Experience and Decision Valuation Be Reduced? Evidence from the dec04_Article 1 12/14/04 2:25 PM Page 1052

1052 Journal of Economic Literature, Vol. XLII (December 2004)

Laboratory,” econ. work. pap. B-95-03, College Preference: A Critical Review,” J. Econ. Lit. 40:2, pp. Business Admin., U. South Carolina. 351–401. Cummings, Ronald G.; Glenn W. Harrison and E. Gertner, R. 1993. “Game Shows and Economic Elisabet Rutström. 1995. “Homegrown Values and Behavior: Risk-Taking on Card Sharks,” Quart. J. Hypothetical Surveys: Is the Dichotomous Choice Econ. 108:2, pp. 507–21. Approach Incentive Compatible?” Amer. Econ. Rev. Gigerenzer, Gerd; Peter M. Todd and the ABC 85:1, pp. 260–66. Research Group. 2000. Simple Heuristics That Make Cummings, Ronald G. and Laura O. Taylor. 1999. Us Smart. NY: Oxford U. Press. “Unbiased Value Estimates for Environmental Gimotty, Phyllis A. 2002. “Delivery of Preventive Goods: A Cheap Talk Design for the Contingent Health Services for Breast Cancer Control: A Valuation Method,” Amer. Econ. Rev. 89:3, pp. Longitudinal View of a Randomized Controlled 649–65. Trial,” Health Services Res. 37:1, pp. 65–85. Deacon, Robert T. and Jon Sonstelie. 1985. “Rationing Greene, William H. 1993. Econometric Analysis, 2nd by Waiting and the Value of Time: Results from a ed. NY: Macmillan. Natural Experiment,” J. Polit. Econ. 93:4, pp. 627–47. Grether, David M.; R. Mark Isaac and Charles R. Plott. Dehejia, Rajeev H. and Sadek Wahba. 1999. “Causal 1981. “The Allocation of Landing Rights by Effects in Nonexperimental Studies: Reevaluating Unanimity among Competitors,” Amer. Econ. Rev. the Evaluation of Training Programs,” J. Amer. Pap. Proceed. 71:May, pp. 166–71. Statist. Assoc. 94:448, pp. 1053–62. ———. 1989. The Allocation of Scarce Resources: ———. 2002. “Propensity Score Matching for Experimental Economics and the Problem of Nonexperimental Causal Studies,” Rev. Econ. Allocating Airport Slots. Boulder: Westview Press. Statist. 84, pp. 151–61. Grether David M. and Charles R. Plott. 1984. “The Dickhaut, John; Kevin McCabe, Jennifer C. Nagode, Effects of Market Practices in Oligopolistic Markets: Aldo Rustichini and José V. Pardo. 2003. “The Impact An Experimental Examination of the Ethyl Case,” of the Certainty Context on the Process of Choice,” Econ. Inquiry 22:Oct. pp. 479–507. Proceed. Nat. Academy Sci. 100:March, pp. 3536–41. Haigh, Michael, and John A. List. 2004. “Do Dillman, Don. 1978. mail and telephone surveys; The Professional Traders Exhibit Myopic Loss Aversion? Total Design Method. NY: Wiley. An Experimental Analysis,” J. Finance 59, forthcom- Duddy, Edward A. 1924 “Report on an Experiment in ing. Teaching Method,” J. Polit. Econ. 32:5, pp. 582–603. Harbaugh, William T. and Kate Krause. 2000. Dufwenberg, Martin and Uri Gneezy. 2000. “Children’s Altruism in Public Good and Dictator “Measuring Beliefs in an Experimental Lost Wallet Experiments,” Econ. Inquiry 38:1, pp. 95–109. Game,” Games Econ. Behav. 30:2, pp. 163–82. Harbaugh, William T.; Kate Krause and Timothy R. Dyer, Douglas and John H. Kagel. 1996. “Bidding in Berry. 2001. “GARP for Kids: On the Development Common Value Auctions: How the Commercial of Rational Choice Behavior,” Amer. Econ. Rev. 91:5, Construction Industry Corrects for the Winner’s pp. 1539–45. Curse,” Manage. Sci. 42:10, pp. 1463–75. Harbaugh, William T.; Kate Krause and Lise Eckel, Catherine C. and Philip J. Grossman. 1996. Vesterlund. 2002. “Risk Attitudes of Children and “Altruism in Anonymous Dictator Games,” Games Adults: Choices Over Small and Large Probability Econ. Behav. 16, pp. 181–91. Gains and Losses,” Exper. Econ. 5, pp. 53–84. Ericsson, K. Anders, and Herbert A. Simon. 1993. Harless, David W. and Colin F. Camerer. 1994 “The Protocol Analysis: Verbal Reports as Data, rev. ed. Predictive Utility of Generalized Expected Utility Cambridge, MA: MIT Press. Theories,” Econometrica 62:6, pp. 1251–89. Fan, Chinn-Ping. 2002. “Allais Paradox in the Small,” J. Harrison, Glenn W. 1988. “Predatory Pricing in A Econ. Behav. Org. 49:3, pp. 411–21. Multiple Market Experiment,” J. Econ. Behav. Org. Ferber, Robert and Werner Z. Hirsch. 1978. “Social 9, pp. 405–17. Experimentation and Economic Policy: A Survey,” J. ———. 1992a. “Theory and Misbehavior of First-Price Econ. Lit. 16:4, pp. 1379–414. Auctions: Reply,” Amer. Econ. Rev. 82:5, 1426–43. ———. 1982. Social Experimentation and Economic ———. 1992b. “Market Dynamics, Programmed Policy. NY: Cambridge U. Press. Traders, and Futures Markets: Beginning the Fershtman, Chaim and Uri Gneezy. 2001. Laboratory Search for a Smoking Gun,” Econ. “Discrimination in a Segmented Society: An Record 68, Special Issue Futures Markets, pp. 46–62. Experimental Approach,” Quart. J. Econ. 116, pp. ———. 2005a. “Field Experiments and Control,” in 351–77. Field Experiments in Economics. J. Carpenter, G. W. Fix, Michael and Raymond J. Struyk, eds. 1993. Clear Harrison and J. A. List, eds. Research in Exper. and Convincing Evidence: Measurement of Econ. Vol. 10. Greenwich, CT: JAI Press, Discrimination in America. Washington, DC: Urban ———. 2005b “Experimental Evidence on Institute Press. Alternative Environmental Valuation Methods,” Frech, H. E. 1976. “The Property Rights Theory of the Environ. Res. Econ. 23, forthcoming. Firm: Empirical Results from a Natural Harrison, Glenn W.; Ronald M. Harstad and E. Experiment,” J. Polit. Econ. 84:1, pp. 143–52. Elisabet Rutström. 2004. “Experimental Methods Frederick, Shane; George Loewenstein and Ted and Elicitation of Values,” Exper. Econ. 7:June, pp. O’Donoghue. 2002. “Time Discounting and Time 123–40. dec04_Article 1 12/14/04 2:25 PM Page 1053

Harrison and List: Field Experiments 1053

Harrison, Glenn W. and Bengt Kriström. 1996. “On the Henrich, Joseph and Richard McElreath. 2002. “Are Interpretation of Responses to Contingent Valuation Peasants Risk-Averse Decision Markers?” Current Surveys,” in Current Issues in Environmental Anthropology 43:1, pp. 172–81. Economics. P. O. Johansson, B. Kriström and K. G. Hey, John D. 1995. “Experimental Investigations of Mäler, eds. Manchester: Manchester U. Press. Errors in Decision Making Under Risk,” Europ. Harrison, Glenn W.; Morten Igel Lau and Melonie B. Econ. Rev. 39, pp. 633–40. Williams. 2002. “Estimating Individual Discount Hey, John D. and Chris Orme. 1994. “Investigating Rates for Denmark: A Field Experiment,” Amer. Generalizations of Expected Utility Theory Using Econ. Rev. 92:5, pp. 1606–17. Experimental Data,” Econometrica 62:6, pp. Harrison, Glenn W. and James C. Lesley. 1996. “Must 1291–326. Contingent Valuation Surveys Cost So Much?” J. Hoffman, Elizabeth; Kevin A. McCabe and Vernon L. Environ. Econ. Manage. 31:1, pp. 79–95. Smith. 1996. “On Expectations and the Monetary Harrison, Glenn W. and John A. List. 2003. “Naturally Stakes in Ultimatum Games,” Int. J. Game Theory Occurring Markets and Exogenous Laboratory 25:3, pp. 289–301. Experiments: A Case Study of the Winner’s Curse,” Holt, Charles A. and Susan K. Laury. 2002. “Risk work. pap. 3-14, Dept. Econ., College Bus. Admin., Aversion and Incentive Effects,” Amer. Econ. Rev. U. Central Florida. 92:5, pp. 1644–55. Harrison, Glenn W.; Thomas F. Rutherford and David Hong, James T. and Charles R. Plott. 1982. “Rate G. Tarr. 1997. “Quantifying the Uruguay Round,” Filing Policies for Inland Water Transportation: An Econ. J. 107:444, pp. 1405–30. Experimental Approach,” Bell J. Econ. 13:1, pp. Harrison, Glenn W. and Elisabet Rutström. 2001. 1–19. “Doing It Both Ways—Experimental Practice and Hotz, V. Joseph. 1992. “Designing an Evaluation of Heuristic Context,” Behav. Brain Sci. 24:3, pp. JTPA,” in Evaluating Welfare and Training 413–14. Programs. C. Manski and I. Garfinkel, eds. Harrison, Glenn W. and H. D. Vinod. 1992. “The Cambridge: Harvard U. Press. Sensitivity Analysis of Applied General Equilibrium Hoxby, Caroline M. 2000. “The Effects of Class Size on Models: Completely Randomized Factorial Sampling Student Achievement: New Evidence From Designs,” Rev. Econ. Statist. 74:2, pp. 357–62. Population Variation,” Quart. J. Econ. 115:4, pp. Hausman, Jerry A. 1993. Contingent Valuation. NY: 1239–85. North-Holland. Imber, David; Gay Stevenson and Leanne Wilks. 1991. Hausman, Jerry A. and David A. Wise. 1985. Social A Contingent Valuation Survey of the Kakadu Experimentation Chicago: U. Chicago Press. Conservation Zone Canberra: Austral. Govt. Pub., Hayes, J. R. and H. A. Simon. 1974. “Understanding Resource Assess. Com. Written Problem Instructions,” in Knowledge and Isaac, R. Mark and Vernon L. Smith. 1985. “In Search Cognition. L. W. Gregg, ed. Hillsdale, NJ: Erlbaum. of Predatory Pricing,” J. Polit. Econ. 93:2, pp. Heckman, James J. 1998. “Detecting Discrimination,” 320–45. J. Econ. Perspect. 12:2, pp. 101–16. Johnson, Eric J.; Colin F. Camerer, Sen Sankar and Heckman, James J. and Richard Robb. 1985. Talia Tymon. 2002. “Detecting Failures of Backward “Alternative Methods for Evaluating the Impact of Induction: Monitoring Information Search in Interventions,” in Longitudinal Analysis of Labor Sequential Bargaining,” J. Econ. Theory 104:1, pp. Market Data. J. Heckman and B. Singer, eds. NY: 16–47. Cambridge U. Press. Kachelmeier, Steven J. and Mohamed Shehata. 1992. Heckman, James J. and Peter Siegelman. 1993. “The “Examining Risk Preferences Under High Monetary Urban Institute Audit Studies: Their Methods and Incentives: Experimental Evidence from the Findings,” in Clear and Convincing Evidence: People’s Republic of China,” Amer. Econ. Rev. 82:5, Measurement of Discrimination in America. M. Fix pp. 1120–41. and R. J. Struyk, eds. Washington, DC: Urban Kagel, John H.; Raymond C. Battalio and Leonard Institute Press. Green. 1995. Economic Choice Theory. An Heckman, James J. and Jeffrey A. Smith. 1995. Experimental Analysis of Animal Behavior. NY: “Assessing the Case for Social Experiments,” J. Econ. Cambridge U. Press. Perspect. 9:2, pp. 85–110. Kagel, John H.; Raymond C. Battalio and James M. Henrich, Joseph. 2000. “Does Culture Matter in Walker. 1979. “Volunteer Artifacts in Experiments in Economic Behavior? Ultimatum Game Bargaining Economics: Specification of the Problem and Some Among the Machiguenga,” Amer. Econ. Rev. 90:4, Initial Data from a Small-Scale Field Experiment,” pp. 973–79. in Research in Experimental Economics. Vol. 1. V.L. Henrich, Joseph; Robert Boyd, Samuel Bowles, Colin Smith, ed. Greenwich, CT: JAI Press. Camerer, Herbert Gintis, Richard McElreath and Kagel, John H.; Ronald M. Harstad and Dan Levin. Ernst Fehr. 2001. “In Search of Homo Economicus: 1987. “Information Impact and Allocation Rules in Experiments in 15 Small-Scale Societies,” Amer. Auctions with Affiliated Private Values: A Laboratory Econ. Rev. 91:2, pp. 73–79. Study,” Econometrica 55:6, pp. 1275–304. Henrich, Joseph; Robert Boyd, Samuel Bowles, Colin Kagel, John H. and Dan Levin. 1986. “The Winner’s Camerer, Ernst Fehr and Herbert Gintis, eds. 2004. Curse and Public Information in Common Value Foundations of Human Sociality. NY: Oxford U. Press Auctions,” Amer. Econ. Rev. 76:5, pp. 894–920. dec04_Article 1 12/14/04 2:25 PM Page 1054

1054 Journal of Economic Literature, Vol. XLII (December 2004)

———. 1999. “Common Value Auctions with Insider from the Vantage-Point of Behavioral Economics,” Information,” Econometrica 67:5, pp. 1219–38. Econ. J. 109:453, pp. F25–F34. ———. 2002. Common Value Auctions and the Loomes, Graham; Peter G. Moffatt and Robert Winner’s Curse. Princeton: Princeton U. Press. Sugden. 2002, “A Microeconometric Test of Kagel, John H.; Don N. MacDonald and Raymond C. Alternative Stochastic Theories of Risky Choice,” J. Battalio. 1990. “Tests of ‘Fanning Out’ of Indifference Risk Uncertainty 24:2, pp. 103–30. Curves: Results from Animal and Human Loomes, Graham and Robert Sugden. 1995. Experiments,” Amer. Econ. Rev. 80:4, pp. 912–21. “Incorporating a Stochastic Element Into Decision Kramer, Michael and Stanley Shapiro. 1984. “Scientific Theories,” Europ. Econ. Rev. 39, pp. 641–48. Challenges in the Application of Randomized Trials,” ———. 1998. “Testing Different Stochastic J. Amer. Medical Assoc. 252:19, pp. 2739–45. Specifications of Risky Choice,” Economica 65, pp. Krueger, Alan B. 1999. “Experimental Estimates of 581–98. Production Functions, “Quart. J. Econ. 114:2, pp. Lucking-Reiley, David. 1999. “Using Field 497–532. Experiments to Test Equivalence Between Auction Kunce, Mitch; Shelby Gerking and William Morgan. Formats: Magic on the Internet,” Amer. Econ. Rev. 2002. “Effects of Environmental and Land Use 89:5, pp. 1063–80. Regulation in the Oil and Gas Industry Using the Machina, Mark J. 1989. “Dynamic Consistency and Wyoming Checkerboard as an Experimental Non-Expected Utility Models of Choice Under Design,” Amer. Econ. Rev. 92:5, pp. 1588–93. Uncertainty,” J. Econ. Lit. 27:4, pp. 1622–68. Lalonde. Robert J. 1986. “Evaluating the Econometric McCabe, Kevin; Daniel Houser, Lee Ryan, Vernon Evaluations of Training Programs with Experimental Smith and Theodore Trouard. 2001. “A Functional Data,” Amer. Econ. Rev. 76:4, pp. 604–20. Imaging Study of Cooperation in Two-Person Levine, Michael E. and Charles R. Plott. 1977. Reciprocal Exchange,” Proceed. Nat. Academy Sci. “Agenda Influence and Its Implications,” Virginia 98:20, pp. 11832–35. Law Rev. 63:May, pp. 561–604. McClennan, Edward F. 1990. Rationality and Levitt, Steven D. 2003. “Testing Theories of Dynamic Choice NY: Cambridge U. Press. Discrimination: Evidence from the Weakest Link,” McDaniel, Tanga M. and E. Elisabet Rutström, 2001. NBER work. pap. 9449. “Decision Making Costs and Problem Solving Lichtenstein, Sarah and Paul Slovic. 1973. “Response- Performance,” Exper. Econ. 4:2, pp. 145–61. Induced Reversals of Gambling: An Extended Metrick, Andrew. 1995. “A Natural Experiment in Replication in Las Vegas,” J. Exper. Psych. 101, pp. ‘Jeopardy!’” Amer. Econ. Rev. 85:1, pp. 240–53. 16–20. Meyer, Bruce D.; W. Kip Viscusi and David L. Durbin. List, John A. 2001. “Do Explicit Warnings Eliminate 1995. “Workers’ Compensation and Injury Duration: the Hypothetical Bias in Elicitation Procedures? Evidence from a Natural Experiment,” Amer. Econ. Evidence from Field Auctions for Sportscards,” Rev. 85:3, pp. 322–40. Amer. Econ. Rev. 91:5, pp. 1498–507. Milgrom, Paul R. and Robert J. Weber. 1982. “A ———. 2003. “Friend or Foe: A Natural Experiment Theory of Auctions and Competitive Bidding,” of the Prisoner’s Dilemma,” unpub. manuscript, U. Econometrica 50:5, pp. 1089–122. Maryland Dept. Ag. Res. Econ. Neisser, Ulric, and Ira E. Hyman, Jr. eds. 2000. ———. 2004a. “Young, Selfish and Male: Field Memory Observed: Remembering in Natural Evidence of Social Preferences,” Econ. J. 114:492, Contexts. 2nd ed. NY: Worth Publishers. pp. 121–49. Pearl, Judea. 1984. Heuristics: Intelligent Search ———. 2004b. “The Nature and Extent of Strategies for Computer Problem Solving. Reading, Discrimination in the Marketplace: Evidence from MA: Addison-Wesley. the Field,” Quart. J. Econ. 119:1, pp. 49–89. Philipson, Tomas and Larry V. Hedges. 1998. “Subject ———. 2004c. “Neoclassical Theory Versus Prospect Evaluation in Social Experiments,” Econometrica Theory: Evidence from the Marketplace,” 66:2, pp. 381–408. Econometrica 72:2, pp. 615–25. Plott, Charles R. and Michael E. Levine. 1978. “A ———. 2004d. “Field Experiments: An Introduction Model of Agenda Influence on Committee and Survey,” work. pap., U. Maryland. Dept. Ag. Decisions,” Amer. Econ. Rev. 68:1, pp. 146–60. Res. Econ. and Dept. Econ. Riach, P. A. and J. Rich. 2002. “Field Experiments of ———. 2004e. “Testing Neoclassical Competitive Discrimination in the Market Place,” Econ. J. Theory in Multi-Lateral Decentralized Markets,” J. 112:483, pp. F480–F518. Polit. Econ. 112:5, pp. 1131–56.. Rosenbaum, P. and Donald Rubin. 1983. “The Central List, John A. and David Lucking-Reiley. 2000. Role of the Propensity Score in Observational “Demand Reduction in a Multi-Unit Auction: Studies for Causal Effects,” Biometrika 70, pp. Evidence from a Sportscard Experiment,” Amer. 41–55. Econ. Rev. 90:4, pp. 961–72. ———. 1984. “Reducing Bias in Observational ———. 2002. “The Effects of Seed Money and Studies Using Multivariate Matched Sampling Refunds on Charitable Giving: Experimental Methods that Incorporate the Propensity Score,” J. Evidence from a University Capital Campaign,” J. Amer. Statist. Assoc. 79, pp. 39–68. Polit. Econ. 110:1, pp. 215–33. Rosenthal, R. and L. Jacobson. 1968. Pygmalion in the Loewenstein, George. 1999. “Experimental Economics Classroom. NY: Holt, Rhinehart & Winston. dec04_Article 1 12/14/04 2:25 PM Page 1055

Harrison and List: Field Experiments 1055

Rosenzweig, Mark R. and Kenneth I. Wolpin. 2000. Smith, Kip; John Dickhaut, Kevin McCabe and José V. “Natural ‘Natural Experiments’ in Economics,” J. Pardo. 2002. “Neuronal Substrates for Choice under Econ. Lit. 38:4, pp. 827–74. Ambiguity, Risk, Gains, and Losses,” Manage. Sci. Roth, Alvin E. 1991. “A Natural Experiment in the 48:6, pp. 711–18. Organization of Entry-Level Labor Markets: Smith, V. Kerry and Laura Osborne. 1996. “Do Regional Markets for New Physicians and Surgeons Contingent Valuation Estimates Pass a Scope Test? A in the United Kingdom,” Amer. Econ. Rev. 81:3, pp. Meta Analysis,” J. Environ. Econ. Manage. 31, pp. 415–40. 287–301. Roth, Alvin E. and Michael W. K. Malouf. 1979. Smith, Vernon L. 1962. “An Experimental Study of “Game-Theoretic Models and the Role of Competitive Market Behavior,” J. Polit. Econ. 70, pp. Information in Bargaining,” Psych. Rev. 86, pp. 111–37. 574–94. ———. 1982. “Microeconomic Systems as an Roth, Alvin E.; Vesna Prasnikar, Masahiro Okuno- Experimental Science,” Amer. Econ. Rev. 72:5, pp. Fujiwara and Shmuel Zamir. 1991. “Bargaining and 923–55. Market Behavior in Jerusalem, Ljubljana, ———. 2003. “Constructivist and Ecological Pittsburgh, and Tokyo: An Experimental Study,” Rationality in Economics,” Amer. Econ. Rev. 93:3, Amer. Econ. Rev. 81:5, pp. 1068–95. pp. 465–508. Rowe, R. D.; W. Schulze, W. D. Shaw, L. D. Chestnut Smith, Vernon L.; G. L. Suchanek and Arlington W. and D. Schenk. 1991. “Contingent Valuation of Williams. 1988. “Bubbles, Crashes, and Endogenous Natural Resource Damage Due to the Nestucca Oil Expectations in Experimental Spot Asset Markets,” Spill,” report to British Columbia Ministry of Econometrica 56, pp. 1119–52. Environment. Sorenson, Roy A. 1992. Thought Experiments. NY: Rutström, E. Elisabet. 1998. “Home-Grown Values Oxford U. Press. and the Design of Incentive Compatible Auctions,” Starmer, Chris. 1999. “Experiments in Economics: Int. J. Game Theory 27:3, pp. 427–41. Should We Trust the Dismal Scientists in White Rutström, E. Elisabet and Melonie B. Williams. 2000. Coats?” J. Exper. Method. 6, pp. 1–30. “Entitlements and Fairness: An Experimental Study Sunstein, Cass R.; Reid Hastie, John W. Payne, David of Distributive Preferences,” J. Econ. Behav. Org. A. Schkade and W. Kip Viscusi. 2002. Punitive 43, pp. 75–80. Damages: How Juries Decide. Chicago: U. Chicago Sanfey, Alan G.; James K. Rilling, Jessica A. Aronson, Press. Leigh E. Nystrom and Jonathan D. Cohen. 2003. Tenorio, Rafael and Timothy Cason. 2002. “To Spin or “The Neural Basis of Economic Decision-Making in Not To Spin? Natural and Laboratory Experiments the Ultimatum Game,” Science 300:5626, pp. from The Price is Right,” Econ. J. 112, pp. 170–95. 1755–58. Warner, John T. and Saul Pleeter. 2001. “The Personal Slonim, Robert and Alvin E. Roth. 1998. “Learning in Discount Rate: Evidence from Military Downsizing High Stakes Ultimatum Games: An Experiment in Programs,” Amer. Econ. Rev. 91:1, pp. 33–53. the Slovak Republic,” Econometrica 66:3, pp. Wierzbicka, Anna. 1996. Semantics: Primes and 569–96. Universals. NY: Oxford U. Press. Smith, Jeffrey and Petra Todd. 2000. “Does Matching Winkler, Robert L. and Allan H. Murphy. 1973. Address LaLonde’s Critique of Nonexperimental “Experiments in the Laboratory and the Real Estimates?” unpub. man., Dept. Econ. U. Western World,” Org. Behav. Human Perform. 10, pp. Ontario. 252–70.