
Experimental and Quasi-Experimental Designs for Generalized Causal


EXPERIMENTAL AND QUASI-EXPERIMENTAL DESIGNS FOR GENERALIZED CAUSAL INFERENCE

William R. Shadish, The University of Memphis
Thomas D. Cook, Northwestern University
Donald T. Campbell

HOUGHTON MIFFLIN COMPANY, Boston, New York

Experiments and Generalized Causal Inference

Ex·per·i·ment (ĭk-spĕr′ə-mənt): [Middle English, from Old French, from Latin experimentum, from experiri, to try; see per- in Indo-European Roots.] n. Abbr. exp., expt. 1. a. A test under controlled conditions that is made to demonstrate a known truth, examine the validity of a hypothesis, or determine the efficacy of something previously untried. b. The process of conducting such a test; experimentation. 2. An innovative act or procedure: "Democracy is only an experiment in government" (William Ralph Inge).

Cause (kôz): [Middle English, from Old French, from Latin causa, reason, purpose.] n. 1. a. The producer of an effect, result, or consequence. b. The one, such as a person, an event, or a condition, that is responsible for an action or a result. v. 1. To be the cause of or reason for; result in. 2. To bring about or compel by authority or force.

To many historians and philosophers, the increased emphasis on experimentation in the 16th and 17th centuries marked the emergence of modern science from its roots in natural philosophy (Hacking, 1983). Drake (1981) cites Galileo's 1612 treatise Bodies That Stay Atop Water, or Move in It as ushering in modern experimental science, but earlier claims can be made favoring William Gilbert's 1600 study On the Loadstone and Magnetic Bodies, Leonardo da Vinci's (1452-1519) many investigations, and perhaps even the 5th-century B.C. philosopher Empedocles, who used various empirical demonstrations to argue against Parmenides (Jones, 1969a, 1969b). In the everyday sense of the term, humans have been experimenting with different ways of doing things from the earliest moments of their history. Such experimenting is as natural a part of our life as trying a new recipe or a different way of starting campfires.

However, the scientific revolution of the 17th century departed in three ways from the common use of observation in natural philosophy at that time. First, it increasingly used observation to correct errors in theory. Throughout history, natural philosophers often used observation in their work, usually to win philosophical arguments by finding evidence that supported their theories. However, they still subordinated the use of observation to the practice of deriving theories from "first principles," starting points that humans know to be true by our nature or by divine revelation (e.g., the assumed properties of the four basic elements of fire, water, earth, and air in Aristotelian natural philosophy). According to some accounts, this subordination of evidence to theory degenerated in the 17th century: "The Aristotelian principle of appealing to experience had degenerated among philosophers into dependence on reasoning supported by casual examples and the refutation of opponents by pointing to apparent exceptions not carefully examined" (Drake, 1981, p. xxi). When some 17th-century scholars then began to use observation to correct apparent errors in theoretical and religious first principles, they came into conflict with religious or philosophical authorities, as in the case of the Inquisition's demands that Galileo recant his account of the earth revolving around the sun. Given such hazards, the fact that the new experimental science tipped the balance toward observation and away from dogma is remarkable. By the time Galileo died, the role of systematic observation was firmly entrenched as a central feature of science, and it has remained so ever since (Harré, 1981).

Second, before the 17th century, appeals to experience were usually based on passive observation of ongoing systems rather than on observation of what happens after a system is deliberately changed.
After the scientific revolution in the 17th century, the word experiment (terms in boldface in this book are defined in the Glossary) came to connote taking a deliberate action followed by systematic observation of what occurred afterward. As Hacking (1983) noted of Francis Bacon: "He taught that not only must we observe nature in the raw, but that we must also 'twist the lion's tail,' that is, manipulate our world in order to learn its secrets" (p. 149). Although passive observation reveals much about the world, active manipulation is required to discover some of the world's regularities and possibilities (Greenwood, 1989). As a mundane example, stainless steel does not occur naturally; humans must manipulate it into existence. Experimental science came to be concerned with observing the effects of such manipulations.

Third, early experimenters realized the desirability of controlling extraneous influences that might limit or bias observation. So telescopes were carried to higher points at which the air was clearer, the glass for microscopes was ground ever more accurately, and scientists constructed laboratories in which it was possible to use walls to keep out potentially biasing ether waves and to use (eventually sterilized) test tubes to keep out dust or bacteria. At first, these controls were developed for astronomy, chemistry, and physics, the natural sciences in which interest in science first bloomed.
But when scientists started to use experiments in areas such as public health or education, in which extraneous influences are harder to control (e.g., Lind, 1753), they found that the controls used in natural science in the laboratory worked poorly in these new applications. So they developed new methods of dealing with extraneous influence, such as random assignment (Fisher, 1925) or adding a nonrandomized control group (Coover & Angell, 1907). As theoretical and observational experience accumulated across these settings and topics, more sources of bias were identified and more methods were developed to cope with them (Dehue, 2000).

Today, the key feature common to all experiments is still to deliberately vary something so as to discover what happens to something else later; that is, to discover the effects of presumed causes. As laypersons we do this, for example, to assess what happens to our blood pressure if we exercise more, to our weight if we diet less, or to our behavior if we read a self-help book. However, scientific experimentation has developed increasingly specialized substance, language, and tools, including the practice of experimentation in the social sciences that is the primary focus of this book. This chapter begins to explore these matters by (1) discussing the nature of the causation that experiments test, (2) explaining the specialized terminology (e.g., randomized experiments, quasi-experiments) that describes social experiments, (3) introducing the problem of how to generalize causal connections from individual experiments, and (4) briefly situating the experiment within a larger literature on the nature of science.

EXPERIMENTS AND CAUSATION

A sensible discussion of experiments requires both a vocabulary for talking about causation and an understanding of key concepts that underlie that vocabulary.

Defining Cause, Effect, and Causal Relationships

Most people intuitively recognize causal relationships in their daily lives. For instance, you may say that another automobile's hitting yours was a cause of the damage to your car; that the number of hours you spent studying was a cause of your test grades; or that the amount of food a friend eats was a cause of his weight. You may even point to more complicated causal relationships, noting that a low test grade was demoralizing, which reduced subsequent studying, which caused even lower grades. Here the same variable (low grade) can be both a cause and an effect, and there can be a reciprocal relationship between two variables (low grades and not studying) that cause each other.

Despite this intuitive familiarity with causal relationships, a precise definition of cause and effect has eluded philosophers for centuries.1 Indeed, the definitions

1. Our analysis reflects the use of the word causation in ordinary language, not the more detailed discussions of cause by philosophers. Readers interested in such detail may consult a host of works that we reference in this chapter, including Cook and Campbell (1979).

of terms such as cause and effect depend partly on each other and on the causal relationship in which both are embedded. So the 17th-century philosopher John Locke said: "That which produces any simple or complex idea, we denote by the general name cause, and that which is produced, effect" (1975, p. 324) and also: "A cause is that which makes any other thing, either simple idea, substance, or mode, begin to be; and an effect is that, which had its beginning from some other thing" (p. 325). Since then, other philosophers and scientists have given us useful definitions of the three key ideas (cause, effect, and causal relationship) that are more specific and that better illuminate how experiments work. We would not defend any of these as the true or correct definition, given that the latter has eluded philosophers for millennia; but we do claim that these ideas help to clarify the scientific practice of probing causes.

Cause

Consider the cause of a forest fire. We know that fires start in different ways: a match tossed from a car, a lightning strike, or a smoldering campfire, for example. None of these causes is necessary because a forest fire can start even when, say, a match is not present. Also, none of them is sufficient to start the fire. After all, a match must stay "hot" long enough to start combustion; it must contact combustible material such as dry leaves; there must be oxygen for combustion to occur; and the weather must be dry enough so that the leaves are dry and the match is not doused by rain. So the match is part of a constellation of conditions without which a fire will not result, although some of these conditions can usually be taken for granted, such as the availability of oxygen. A lighted match is, therefore, what Mackie (1974) called an inus condition: "an insufficient but nonredundant part of an unnecessary but sufficient condition" (p. 62; italics in original). It is insufficient because a match cannot start a fire without the other conditions. It is nonredundant only if it adds something fire-promoting that is uniquely different from what the other factors in the constellation (e.g., oxygen, dry leaves) contribute to starting a fire; after all, it would be harder to say whether the match caused the fire if someone else simultaneously tried starting it with a cigarette lighter. It is part of a sufficient condition to start a fire in combination with the full constellation of factors. But that condition is not necessary because there are other sets of conditions that can also start fires.

A research example of an inus condition concerns a new potential treatment for cancer. In the late 1990s, a team of researchers in Boston headed by Dr. Judah Folkman reported that a new drug called Endostatin shrank tumors by limiting their blood supply (Folkman, 1996). Other respected researchers could not replicate the effect even when using drugs shipped to them from Folkman's lab.
Scientists eventually replicated the results after they had traveled to Folkman's lab to learn how to properly manufacture, transport, store, and handle the drug and how to inject it in the right location at the right depth and angle. One observer labeled these contingencies the "in-our-hands" phenomenon, meaning "even we don't

know which details are important, so it might take you some time to work it out" (Rowe, 1999, p. 732). Endostatin was an inus condition. It was an insufficient cause by itself, and its effectiveness required it to be embedded in a larger set of conditions that were not even fully understood by the original investigators.

Most causes are more accurately called inus conditions. Many factors are usually required for an effect to occur, but we rarely know all of them and how they relate to each other. This is one reason that the causal relationships we discuss in this book are not deterministic but only increase the probability that an effect will occur (Eells, 1991; Holland, 1994). It also explains why a given causal relationship will occur under some conditions but not universally across time, space, human populations, or other kinds of treatments and outcomes that are more or less related to those studied. To different degrees, all causal relationships are context dependent, so the generalization of experimental effects is always at issue. That is why we return to such generalizations throughout this book.
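The logic of an inus condition can be made concrete with a toy boolean model. The factor names and the particular rule below are our own illustration, not drawn from Mackie; a sketch under the assumption that a fire requires a full constellation of conditions and that more than one constellation can suffice:

```python
# Toy model: a fire starts if EITHER constellation of conditions holds.
# The match is one inus factor within the first constellation.
def fire(match=False, lightning=False, oxygen=True, dry_leaves=True):
    return (match and oxygen and dry_leaves) or (lightning and dry_leaves)

# Insufficient: a match alone (here, without dry leaves) starts no fire.
assert fire(match=True, dry_leaves=False) is False
# Nonredundant: within its constellation, removing the match matters.
assert fire(match=True) and not fire(match=False)
# Part of a sufficient condition: match + oxygen + dry leaves suffices.
assert fire(match=True, oxygen=True, dry_leaves=True) is True
# Unnecessary: a fire also starts with no match at all (lightning route).
assert fire(lightning=True) is True
```

Each assertion checks one clause of Mackie's definition: the match is an insufficient but nonredundant part of an unnecessary but sufficient condition.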

Effect

We can better understand what an effect is through a counterfactual model that goes back at least to the 18th-century philosopher Hume (Lewis, 1973, p. 556). A counterfactual is something that is contrary to fact. In an experiment, we observe what did happen when people received a treatment. The counterfactual is knowledge of what would have happened to those same people if they simultaneously had not received treatment. An effect is the difference between what did happen and what would have happened.

We cannot actually observe a counterfactual. Consider phenylketonuria (PKU), a genetically-based metabolic disorder that causes mental retardation unless treated during the first few weeks of life. PKU is the absence of an enzyme that would otherwise prevent a buildup of phenylalanine, a substance toxic to the nervous system. When a restricted phenylalanine diet is begun early and maintained, retardation is prevented. In this example, the cause could be thought of as the underlying genetic defect, as the enzymatic disorder, or as the diet. Each implies a different counterfactual. For example, if we say that a restricted phenylalanine diet caused a decrease in PKU-based mental retardation in infants who were phenylketonuric at birth, the counterfactual is whatever would have happened had those same infants not received a restricted phenylalanine diet. The same logic applies to the genetic or enzymatic version of the cause. But it is impossible for these very same infants simultaneously to both have and not have the diet, the genetic disorder, or the enzyme deficiency. So a central task for all cause-probing research is to create reasonable approximations to this physically impossible counterfactual.
For instance, if it were ethical to do so, we might contrast phenylketonuric infants who were given the diet with other phenylketonuric infants who were not given the diet but who were similar in many ways to those who were (e.g., similar race, gender, age, socioeconomic status, health status). Or we might (if it were ethical) contrast infants who

were not on the diet for the first 3 months of their lives with those same infants after they were put on the diet starting in the 4th month. Neither of these approximations is a true counterfactual. In the first case, the individual infants in the treatment condition are different from those in the comparison condition; in the second case, the identities are the same, but time has passed and many changes other than the treatment have occurred to the infants (including permanent damage done by phenylalanine during the first 3 months of life). So two central tasks in experimental design are creating a high-quality but necessarily imperfect source of counterfactual inference and understanding how this source differs from the treatment condition.

This counterfactual reasoning is fundamentally qualitative because causal inference, even in experiments, is fundamentally qualitative (Campbell, 1975; Shadish, 1995a; Shadish & Cook, 1999). However, some of these points have been formalized by statisticians into a special case that is sometimes called Rubin's Causal Model (Holland, 1986; Rubin, 1974, 1977, 1978, 1986). This book is not about statistics, so we do not describe that model in detail (West, Biesanz, & Pitts [2000] do so and relate it to the Campbell tradition). A primary emphasis of Rubin's model is the analysis of cause in experiments, and its basic premises are consistent with those of this book.2 Rubin's model has also been widely used to analyze causal inference in case-control studies in public health and medicine (Holland & Rubin, 1988), in path analysis in sociology (Holland, 1986), and in a paradox that Lord (1967) introduced into psychology (Holland & Rubin, 1983); and it has generated many statistical innovations that we cover later in this book. It is new enough that critiques of it are just now beginning to appear (e.g., Dawid, 2000; Pearl, 2000). What is clear, however, is that Rubin's is a very general model with obvious and subtle implications.
Both it and the critiques of it are required material for advanced students and scholars of cause-probing methods.
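The counterfactual logic behind Rubin's model can be illustrated with a small simulation. All numbers and variable names here are our own hypothetical illustration, not from the text: each unit carries two potential outcomes, one with and one without treatment, but a real study ever observes only one of the pair; random assignment lets the control group stand in for the treated group's missing counterfactual.

```python
import random

random.seed(1)

# Hypothetical potential outcomes for 1,000 units:
# y0 = outcome if untreated, y1 = outcome if treated.
units = []
for _ in range(1000):
    y0 = random.gauss(50, 10)   # outcome without treatment
    y1 = y0 + 5                 # treatment adds 5 points for every unit
    units.append((y0, y1))

# The true average effect uses BOTH potential outcomes per unit,
# which is computable only in simulation, never in a real study.
true_effect = sum(y1 - y0 for y0, y1 in units) / len(units)

# A randomized experiment approximates the missing counterfactual:
# each unit reveals only one potential outcome, chosen by a coin flip.
treated, control = [], []
for y0, y1 in units:
    if random.random() < 0.5:
        treated.append(y1)      # we observe y1 only
    else:
        control.append(y0)      # we observe y0 only

estimated_effect = sum(treated) / len(treated) - sum(control) / len(control)
print(round(true_effect, 2), round(estimated_effect, 2))
```

The estimated effect differs from the true effect only by sampling error, which is the sense in which a randomized control group is a high-quality but imperfect source of counterfactual inference.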

Causal Relationship

How do we know if cause and effect are related? In a classic analysis formalized by the 19th-century philosopher John Stuart Mill, a causal relationship exists if (1) the cause preceded the effect, (2) the cause was related to the effect, and (3) we can find no plausible alternative explanation for the effect other than the cause. These three characteristics mirror what happens in experiments in which (1) we manipulate the presumed cause and observe an outcome afterward; (2) we see whether variation in the cause is related to variation in the effect; and (3) we use various methods during the experiment to reduce the plausibility of other explanations for the effect, along with ancillary methods to explore the plausibility of those we cannot rule out (most of this book is about methods for doing this).

2. However, Rubin's model is not intended to say much about the matters of causal generalization that we address in this book.

Hence experiments are well-suited to studying causal relationships. No other scientific method regularly matches the characteristics of causal relationships so well. Mill's analysis also points to the weakness of other methods. In many correlational studies, for example, it is impossible to know which of two variables came first, so defending a causal relationship between them is precarious. Understanding this logic of causal relationships and how its key terms, such as cause and effect, are defined helps researchers to critique cause-probing studies.

Causation, Correlation, and Confounds

A well-known maxim in research is: Correlation does not prove causation. This is so because we may not know which variable came first nor whether alternative explanations for the presumed effect exist. For example, suppose income and education are correlated. Do you have to have a high income before you can afford to pay for education, or do you first have to get a good education before you can get a better paying job? Each possibility may be true, and so both need investigation. But until those investigations are completed and evaluated by the scholarly community, a simple correlation does not indicate which variable came first. Correlations also do little to rule out alternative explanations for a relationship between two variables such as education and income. That relationship may not be causal at all but rather due to a third variable (often called a confound), such as intelligence or family socioeconomic status, that causes both high education and high income. For example, if high intelligence causes success in education and on the job, then intelligent people would have correlated educations and incomes, not because education causes income (or vice versa) but because both would be caused by intelligence. Thus a central task in the study of experiments is identifying the different kinds of confounds that can operate in a particular research area and understanding the strengths and weaknesses associated with various ways of dealing with them.
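A confound of this kind can be demonstrated in a few lines of simulation. The setup below is our own illustrative sketch, not data from the text: a third variable ("ability", standing in for intelligence) causes both education and income, education has no direct effect on income at all, and yet the two end up substantially correlated.

```python
import random

random.seed(7)

# Hypothetical data-generating process: ability causes both variables;
# education does NOT cause income here.
n = 5000
ability = [random.gauss(0, 1) for _ in range(n)]
education = [a + random.gauss(0, 1) for a in ability]
income = [a + random.gauss(0, 1) for a in ability]  # depends on ability only

def correlation(x, y):
    """Pearson correlation computed from first principles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Education and income correlate (about 0.5 under these settings)
# even though neither causes the other.
print(round(correlation(education, income), 2))
```

Under these settings the expected correlation is 0.5, produced entirely by the confound; nothing in the observed correlation distinguishes this situation from one in which education really does cause income, which is exactly why a simple correlation cannot settle the causal question.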

Manipulable and Nonmanipulable Causes

In the intuitive understanding of experimentation that most people have, it makes sense to say, "Let's see what happens if we require welfare recipients to work"; but it makes no sense to say, "Let's see what happens if I change this adult male into a three-year-old girl." And so it is also in scientific experiments. Experiments explore the effects of things that can be manipulated, such as the dose of a medicine, the amount of a welfare check, the kind or amount of psychotherapy, or the number of children in a classroom. Nonmanipulable events (e.g., the explosion of a supernova) or attributes (e.g., people's ages, their raw genetic material, or their biological sex) cannot be causes in experiments because we cannot deliberately vary them to see what then happens. Consequently, most scientists and philosophers agree that it is much harder to discover the effects of nonmanipulable causes.

To be clear, we are not arguing that all causes must be manipulable, only that experimental causes must be so. Many variables that we correctly think of as causes are not directly manipulable. Thus it is well established that a genetic defect causes PKU even though that defect is not directly manipulable. We can investigate such causes indirectly in nonexperimental studies or even in experiments by manipulating biological processes that prevent the gene from exerting its influence, as through the use of diet to inhibit the gene's biological consequences. Both the nonmanipulable gene and the manipulable diet can be viewed as causes: both covary with PKU-based retardation, both precede the retardation, and it is possible to explore other explanations for the gene's and the diet's effects on cognitive functioning. However, investigating the manipulable diet as a cause has two important advantages over considering the nonmanipulable genetic problem as a cause. First, only the diet provides a direct action to solve the problem; and second, we will see that studying manipulable agents allows a higher quality source of counterfactual inference through such methods as random assignment. When individuals with the nonmanipulable genetic problem are compared with persons without it, the latter are likely to be different from the former in many ways other than the genetic defect. So the counterfactual inference about what would have happened to those with the PKU genetic defect is much more difficult to make.

Nonetheless, nonmanipulable causes should be studied using whatever means are available and seem useful. This is true because such causes eventually help us to find manipulable agents that can then be used to ameliorate the problem at hand. The PKU example illustrates this. Medical researchers did not discover how to treat PKU effectively by first trying different diets with retarded children.
They first discovered the nonmanipulable biological features of retarded children affected with PKU, finding abnormally high levels of phenylalanine and its associated metabolic and genetic problems in those children. Those findings pointed in certain ameliorative directions and away from others, leading scientists to experiment with treatments they thought might be effective and practical. Thus the new diet resulted from a sequence of studies with different immediate purposes, with different forms, and with varying degrees of uncertainty reduction. Some were experimental, but others were not.

Further, analogue experiments can sometimes be done on nonmanipulable causes, that is, experiments that manipulate an agent that is similar to the cause of interest. Thus we cannot change a person's race, but we can chemically induce skin pigmentation changes in volunteer individuals, though such analogues do not match the reality of being Black every day and everywhere for an entire life. Similarly, past events, which are normally nonmanipulable, sometimes constitute a natural experiment that may even have been randomized, as when the 1970 Vietnam-era draft lottery was used to investigate a variety of outcomes (e.g., Angrist, Imbens, & Rubin, 1996a; Notz, Staw, & Cook, 1971).

Although experimenting on manipulable causes makes the job of discovering their effects easier, experiments are far from perfect means of investigating causes.

Sometimes experiments modify the conditions in which testing occurs in a way that reduces the fit between those conditions and the situation to which the results are to be generalized. Also, knowledge of the effects of manipulable causes tells nothing about how and why those effects occur. Nor do experiments answer many other questions relevant to the real world; for example, which questions are worth asking, how strong the need for treatment is, how a cause is distributed through society, whether the treatment is implemented with theoretical fidelity, and what value should be attached to the experimental results.

In addition, in experiments, we first manipulate a treatment and only then observe its effects; but in some other studies we first observe an effect, such as AIDS, and then search for its cause, whether manipulable or not. Experiments cannot help us with that search. Scriven (1976) likens such searches to detective work in which a crime has been committed (e.g., a robbery), the detectives observe a particular pattern of evidence surrounding the crime (e.g., the robber wore a baseball cap and a distinct jacket and used a certain kind of gun), and then the detectives search for criminals whose known method of operating (their modus operandi or m.o.) includes this pattern. A criminal whose m.o. fits that pattern of evidence then becomes a suspect to be investigated further. Epidemiologists use a similar method, the case-control design (Ahlbom & Norell, 1990), in which they observe a particular health outcome (e.g., an increase in brain tumors) that is not seen in another group and then attempt to identify associated causes (e.g., increased cell phone use). Experiments do not aspire to answer all the kinds of questions, not even all the types of causal questions, that social scientists ask.

Causal Description and Causal Explanation

The unique strength of experimentation is in describing the consequences attributable to deliberately varying a treatment. We call this causal description. In contrast, experiments do less well in clarifying the mechanisms through which and the conditions under which that causal relationship holds, what we call causal explanation. For example, most children very quickly learn the descriptive causal relationship between flicking a light switch and obtaining illumination in a room. However, few children (or even adults) can fully explain why that light goes on. To do so, they would have to decompose the treatment (the act of flicking a light switch) into its causally efficacious features (e.g., closing an insulated circuit) and its nonessential features (e.g., whether the switch is thrown by hand or a motion detector). They would have to do the same for the effect (either incandescent or fluorescent light can be produced, but light will still be produced whether the light fixture is recessed or not). For full explanation, they would then have to show how the causally efficacious parts of the treatment influence the causally affected parts of the outcome through identified mediating processes (e.g., the

passage of electricity through the circuit, the excitation of photons).3 Clearly, the cause of the light going on is a complex cluster of many factors. For those philosophers who equate cause with identifying that constellation of variables that necessarily, inevitably, and infallibly results in the effect (Beauchamp, 1974), talk of cause is not warranted until everything of relevance is known. For them, there is no causal description without causal explanation. Whatever the philosophic merits of their position, though, it is not practical to expect much current social science to achieve such complete explanation.

The practical importance of causal explanation is brought home when the switch fails to make the light go on and when replacing the light bulb (another easily learned manipulation) fails to solve the problem. Explanatory knowledge then offers clues about how to fix the problem, for example, by detecting and repairing a short circuit. Or if we wanted to create illumination in a place without lights and we had explanatory knowledge, we would know exactly which features of the cause-and-effect relationship are essential to create light and which are irrelevant. Our explanation might tell us that there must be a source of electricity but that that source could take several different molar forms, such as a battery, a generator, a windmill, or a solar array. There must also be a switch mechanism to close a circuit, but this could also take many forms, including the touching of two bare wires or even a motion detector that trips the switch when someone enters the room. So causal explanation is an important route to the generalization of causal descriptions because it tells us which features of the causal relationship are essential to transfer to other situations.
This benefit of causal explanation helps elucidate its priority and prestige in all sciences and helps explain why, once a novel and important causal relationship is discovered, the bulk of basic scientific effort turns toward explaining why and how it happens. Usually, this involves decomposing the cause into its causally effective parts, decomposing the effects into its causally affected parts, and identifying the processes through which the effective causal parts influence the causally affected outcome parts.

These examples also show the close parallel between descriptive and explanatory causation and molar and molecular causation.4 Descriptive causation usually concerns simple bivariate relationships between molar treatments and molar outcomes, molar here referring to a package that consists of many different parts. For instance, we may find that psychotherapy decreases depression, a simple descriptive causal relationship between a molar treatment package and a molar outcome. However, psychotherapy consists of such parts as verbal interactions, placebo-generating procedures, setting characteristics, time constraints, and payment for services. Similarly, many depression measures consist of items pertaining to the physiological, cognitive, and affective aspects of depression. Explanatory causation breaks these molar causes and effects into their molecular parts so as to learn, say, that the verbal interactions and the placebo features of the treatment both cause changes in the cognitive symptoms of depression, but that payment for services does not do so even though it is part of the molar treatment package.

3. However, the full explanation a physicist would offer might be quite different from this electrician's explanation, perhaps invoking the behavior of subparticles. This difference indicates just how complicated is the notion of explanation and how it can quickly become quite complex once one shifts levels of analysis.

4. By molar, we mean something taken as a whole rather than in parts. An analogy is to physics, in which molar might refer to the properties or motions of masses, as distinguished from those of molecules or atoms that make up those masses.

If experiments are less able to provide this highly-prized explanatory causal knowledge, why are experiments so central to science, especially to basic social science, in which theory and explanation are often the coin of the realm? The answer is that the dichotomy between descriptive and explanatory causation is less clear in scientific practice than in abstract discussions about causation. First, many causal explanations consist of chains of descriptive causal links in which one event causes the next. Experiments help to test the links in each chain. Second, experiments help distinguish between the validity of competing explanatory theories, for example, by testing competing mediating links proposed by those theories.
Third, some experiments test whether a descriptive causal relationship varies in strength or direction under Condition A versus Condition B (then the condition is a moderator variable that explains the conditions under which the effect holds). Fourth, some experiments add quantitative or qualitative observations of the links in the explanatory chain (mediator variables) to generate and study explanations for the descriptive causal effect.

Experiments are also prized in applied areas of research, in which the identification of practical solutions to social problems has as great or even greater priority than explanations of those solutions. After all, explanation is not always required for identifying practical solutions. Lewontin (1997) makes this point about the Human Genome Project, a coordinated multibillion-dollar research program to map the human genome that it is hoped eventually will clarify the genetic causes of disease. Lewontin is skeptical about aspects of this search:

What is involved here is the difference between explanation and intervention. Many disorders can be explained by the failure of the organism to make a normal protein, a failure that is the consequence of a gene mutation. But intervention requires that the normal protein be provided at the right place in the right cells, at the right time and in the right amount, or else that an alternative way be found to provide normal cellular function. What is worse, it might even be necessary to keep the abnormal protein away from the cells at critical moments. None of these objectives is served by knowing the DNA sequence of the defective gene. (Lewontin, 1997, p. 29)

Practical applications are not immediately revealed by theoretical advance. Instead, to reveal them may take decades of follow-up work, including tests of simple descriptive causal relationships. The same point is illustrated by the cancer drug Endostatin, discussed earlier.
Scientists knew the action of the drug occurred through cutting off tumor blood supplies; but to successfully use the drug to treat cancers in mice required administering it at the right place, angle, and depth, and those details were not part of the usual scientific explanation of the drug's effects.

In the end, then, causal descriptions and causal explanations are in delicate balance in experiments. What experiments do best is to improve causal descriptions; they do less well at explaining causal relationships. But most experiments can be designed to provide better explanations than is typically the case today. Further, in focusing on causal descriptions, experiments often investigate molar events that may be less strongly related to outcomes than are more molecular mediating processes, especially those processes that are closer to the outcome in the explanatory chain. However, many causal descriptions are still dependable and strong enough to be useful, to be worth making the building blocks around which important policies and theories are created. Just consider the dependability of such causal statements as that school desegregation causes white flight, or that outgroup threat causes ingroup cohesion, or that psychotherapy improves mental health, or that diet reduces the retardation due to PKU. Such dependable causal relationships are useful to policymakers, practitioners, and scientists alike.

MODERN DESCRIPTIONS OF EXPERIMENTS

Some of the terms used in describing modern experimentation (see Table 1.1) are unique, clearly defined, and consistently used; others are blurred and inconsistently used. The common attribute in all experiments is control of treatment (though control can take many different forms). So Mosteller (1990, p. 225) writes, "In an experiment the investigator controls the application of the treatment"; and Yaremko, Harari, Harrison, and Lynn (1986, p. 72) write, "one or more independent variables are manipulated to observe their effects on one or more dependent variables." However, over time many different experimental subtypes have developed in response to the needs and histories of different sciences (Winston, 1990; Winston & Blais, 1996).

TABLE 1.1 The Vocabulary of Experiments

Experiment: A study in which an intervention is deliberately introduced to observe its effects.

Randomized Experiment: An experiment in which units are assigned to receive the treatment or an alternative condition by a random process such as the toss of a coin or a table of random numbers.

Quasi-Experiment: An experiment in which units are not assigned to conditions randomly.

Natural Experiment: Not really an experiment because the cause usually cannot be manipulated; a study that contrasts a naturally occurring event such as an earthquake with a comparison condition.

Correlational Study: Usually synonymous with nonexperimental or observational study; a study that simply observes the size and direction of a relationship among variables.

Randomized Experiment

The most clearly described variant is the randomized experiment, widely credited to Sir Ronald Fisher (1925, 1926). It was first used in agriculture but later spread to other topic areas because it promised control over extraneous sources of variation without requiring the physical isolation of the laboratory. Its distinguishing feature is clear and important: the various treatments being contrasted (including no treatment at all) are assigned to experimental units5 by chance, for example, by coin toss or use of a table of random numbers. If implemented correctly, random assignment creates two or more groups of units that are probabilistically similar to each other on the average.6 Hence, any outcome differences that are observed between those groups at the end of a study are likely to be due to treatment, not to differences between the groups that already existed at the start of the study. Further, when certain assumptions are met, the randomized experiment yields an estimate of the size of a treatment effect that has desirable statistical properties, along with estimates of the probability that the true effect falls within a defined confidence interval. These features of experiments are so highly prized that in a research area such as medicine the randomized experiment is often referred to as the gold standard for treatment outcome research.7

Closely related to the randomized experiment is a more ambiguous and inconsistently used term, true experiment. Some authors use it synonymously with randomized experiment (Rosenthal & Rosnow, 1991). Others use it more generally to refer to any study in which an independent variable is deliberately manipulated (Yaremko et al., 1986) and a dependent variable is assessed. We shall not use the term at all, given its ambiguity and given that the modifier true seems to imply restricted claims to a single correct experimental method.
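The probabilistic similarity that random assignment produces can be illustrated with a small simulation (a hypothetical sketch; the scale, sample size, and seed are invented for illustration):

```python
import random
import statistics

random.seed(42)

# Hypothetical pretest scores for 1,000 units (e.g., a depression scale).
pretest = [random.gauss(50, 10) for _ in range(1000)]

# Random assignment: shuffle the units, then split them into two groups,
# the equivalent of assigning each unit by coin toss.
random.shuffle(pretest)
treatment, control = pretest[:500], pretest[500:]

# Before any treatment is given, the groups differ only by chance, so the
# pretest means are probabilistically similar.
diff = statistics.mean(treatment) - statistics.mean(control)
print(f"Pretest mean difference: {diff:.2f}")  # small, within chance variation
```

Any single run can still show a nonzero difference; the guarantee concerns the average over repeated randomizations, which is why the word probabilistically matters.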

Quasi-Experiment

Much of this book focuses on a class of designs that Campbell and Stanley (1963) popularized as quasi-experiments.8 Quasi-experiments share with all other

5. Units can be people, animals, time periods, institutions, or almost anything else. Typically in field experimentation they are people or some aggregate of people, such as classrooms or work sites. In addition, a little thought shows that random assignment of units to treatments is the same as assignment of treatments to units, so these phrases are frequently used interchangeably.

6. The word probabilistically is crucial, as is explained in more detail in Chapter 8.

7. Although the term is used this way consistently across many fields and in this book, statisticians sometimes use the closely related term random experiment in a different way to indicate experiments for which the outcome cannot be predicted with certainty (e.g., Hogg & Tanis, 1988).

8. Campbell (1957) first called these compromise designs but changed terminology very quickly; Rosenbaum (1995a) and Cochran (1965) refer to these as observational studies, a term we avoid because many people use it to refer to correlational or nonexperimental studies as well. Greenberg and Shroder (1997) use quasi-experiment to refer to studies that randomly assign groups (e.g., communities) to conditions, but we would consider these group-randomized experiments (Murray, 1998).

experiments a similar purpose, to test descriptive causal hypotheses about manipulable causes, as well as many structural details, such as the frequent presence of control groups and pretest measures, to support a counterfactual inference about what would have happened in the absence of treatment. But, by definition, quasi-experiments lack random assignment. Assignment to conditions is by means of self-selection, by which units choose treatment for themselves, or by means of administrator selection, by which teachers, bureaucrats, legislators, therapists, physicians, or others decide which persons should get which treatment. However, researchers who use quasi-experiments may still have considerable control over selecting and scheduling measures, over how nonrandom assignment is executed, over the kinds of comparison groups with which treatment groups are compared, and over some aspects of how treatment is scheduled. As Campbell and Stanley note:

There are many natural social settings in which the research person can introduce something like experimental design into his scheduling of data collection procedures (e.g., the when and to whom of measurement), even though he lacks the full control over the scheduling of experimental stimuli (the when and to whom of exposure and the ability to randomize exposures) which makes a true experiment possible. Collectively, such situations can be regarded as quasi-experimental designs. (Campbell & Stanley, 1963, p. 34)

In quasi-experiments, the cause is manipulable and occurs before the effect is measured. However, quasi-experimental design features usually create less compelling support for counterfactual inferences. For example, quasi-experimental control groups may differ from the treatment condition in many systematic (nonrandom) ways other than the presence of the treatment. Many of these ways could be alternative explanations for the observed effect, and so researchers have to worry about ruling them out in order to get a more valid estimate of the treatment effect. By contrast, with random assignment the researcher does not have to think as much about all these alternative explanations. If correctly done, random assignment makes most of the alternatives less likely as causes of the observed treatment effect at the start of the study.

In quasi-experiments, the researcher has to enumerate alternative explanations one by one, decide which are plausible, and then use logic, design, and measurement to assess whether each one is operating in a way that might explain any observed effect. The difficulties are that these alternative explanations are never completely enumerable in advance, that some of them are particular to the context being studied, and that the methods needed to eliminate them from contention will vary from alternative to alternative and from study to study. For example, suppose two nonrandomly formed groups of children are studied, a volunteer treatment group that gets a new reading program and a control group of nonvolunteers who do not get it. If the treatment group does better, is it because of treatment or because the cognitive development of the volunteers was increasing more rapidly even before treatment began? (In a randomized experiment, maturation rates would

have been probabilistically equal in both groups.) To assess this alternative, the researcher might add multiple pretests to reveal the maturational trend before the treatment, and then compare that trend with the trend after treatment. Another alternative explanation might be that the nonrandom control group included more disadvantaged children who had less access to books in their homes or who had parents who read to them less often. (In a randomized experiment, both groups would have had similar proportions of such children.) To assess this alternative, the experimenter may measure the number of books at home, parental time spent reading to children, and perhaps trips to libraries. Then the researcher would see if these variables differed across treatment and control groups in the hypothesized direction that could explain the observed treatment effect. Obviously, as the number of plausible alternative explanations increases, the design of the quasi-experiment becomes more intellectually demanding and complex, especially because we are never certain we have identified all the alternative explanations. The efforts of the quasi-experimenter start to look like attempts to bandage a wound that would have been less severe if random assignment had been used initially.

The ruling out of alternative hypotheses is closely related to a falsificationist logic popularized by Popper (1959). Popper noted how hard it is to be sure that a general conclusion (e.g., all swans are white) is correct based on a limited set of observations (e.g., all the swans I've seen were white). After all, future observations may change (e.g., some day I may see a black swan). So confirmation is logically difficult. By contrast, observing a disconfirming instance (e.g., a black swan) is sufficient, in Popper's view, to falsify the general conclusion that all swans are white. Accordingly, Popper urged scientists to try deliberately to falsify the conclusions they wish to draw rather than only to seek information corroborating them.
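The maturation threat in the volunteer reading-program example above can be made concrete with a toy calculation (a hypothetical sketch; the growth rates and scores are invented, and the true treatment effect is deliberately set to zero):

```python
# Hypothetical reading scores: volunteers mature faster than controls,
# even though the treatment itself does nothing in this example.
volunteer_growth, control_growth = 4.0, 2.0   # points gained per period
start = 100.0
periods = [-2, -1, 0, 1]                      # two pretests; treatment at t = 0

volunteers = [start + volunteer_growth * (t + 2) for t in periods]
controls = [start + control_growth * (t + 2) for t in periods]

# With a single pretest at t = -1 and a posttest at t = 1, the gain scores
# differ even though the treatment effect here is zero.
naive_gain_diff = (volunteers[-1] - volunteers[1]) - (controls[-1] - controls[1])
print(naive_gain_diff)  # 4.0: pure maturation, easily mistaken for an effect

# Multiple pretests reveal the differing pre-treatment trends directly.
pre_trend_diff = (volunteers[1] - volunteers[0]) - (controls[1] - controls[0])
print(pre_trend_diff)  # 2.0 per period, visible before treatment begins
```

This is why the text recommends adding pretests: the pre-treatment trend supplies evidence about maturation that a single pretest cannot.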
Conclusions that withstand falsification are retained in scientific books or journals and treated as plausible until better evidence comes along. Quasi-experimentation is falsificationist in that it requires experimenters to identify a causal claim and then to generate and examine plausible alternative explanations that might falsify the claim.

However, such falsification can never be as definitive as Popper hoped. Kuhn (1962) pointed out that falsification depends on two assumptions that can never be fully tested. The first is that the causal claim is perfectly specified. But that is never the case. So many features of both the claim and the test of the claim are debatable: for example, which outcome is of interest, how it is measured, the conditions of treatment, who needs treatment, and all the many other decisions that researchers must make in testing causal relationships. As a result, disconfirmation often leads theorists to respecify part of their causal theories. For example, they might now specify novel conditions that must hold for their theory to be true and that were derived from the apparently disconfirming observations. Second, falsification requires measures that are perfectly valid reflections of the theory being tested. However, most philosophers maintain that all observation is theory-laden. It is laden both with intellectual nuances specific to the partially

unique scientific understandings of the theory held by the individual or group devising the test and also with the experimenters' extrascientific wishes, hopes, aspirations, and broadly shared cultural assumptions and understandings. If measures are not independent of theories, how can they provide independent theory tests, including tests of causal theories? If the possibility of theory-neutral observations is denied, with them disappears the possibility of definitive knowledge both of what seems to confirm a causal claim and of what seems to disconfirm it.

Nonetheless, a fallibilist version of falsification is possible. It argues that studies of causal hypotheses can still usefully improve understanding of general trends despite ignorance of all the contingencies that might pertain to those trends. It argues that causal studies are useful even if we have to respecify the initial hypothesis repeatedly to accommodate new contingencies and new understandings. After all, those respecifications are usually minor in scope; they rarely involve wholesale overthrowing of general trends in favor of completely opposite trends. Fallibilist falsification also assumes that theory-neutral observation is impossible but that observations can approach a more factlike status when they have been repeatedly made across different theoretical conceptions of a construct, across multiple kinds of measurements, and at multiple times. It also assumes that observations are imbued with multiple theories, not just one, and that different operational procedures do not share the same multiple theories. As a result, observations that repeatedly occur despite different theories being built into them have a special factlike status even if they can never be fully justified as completely theory-neutral facts. In summary, then, fallible falsification is more than just seeing whether observations disconfirm a prediction.
It involves discovering and judging the worth of ancillary assumptions about the restricted specificity of the causal hypothesis under test and also about the heterogeneity of theories, viewpoints, settings, and times built into the measures of the cause and effect and of any contingencies modifying their relationship.

It is neither feasible nor desirable to rule out all possible alternative interpretations of a causal relationship. Instead, only plausible alternatives constitute the major focus. This serves partly to keep matters tractable because the number of possible alternatives is endless. It also recognizes that many alternatives have no serious empirical or experiential support and so do not warrant special attention. However, the lack of support can sometimes be deceiving. For example, the cause of stomach ulcers was long thought to be a combination of lifestyle (e.g., stress) and excess acid production. Few scientists seriously thought that ulcers were caused by a pathogen (e.g., virus, germ, bacteria) because it was assumed that an acid-filled stomach would destroy all living organisms. However, in 1982 Australian researchers Barry Marshall and Robin Warren discovered spiral-shaped bacteria, later named Helicobacter pylori (H. pylori), in ulcer patients' stomachs. With this discovery, the previously possible but implausible became plausible. By 1994, a U.S. National Institutes of Health Consensus Development Conference concluded that H. pylori was the major cause of most peptic ulcers. So labeling rival hypotheses as plausible depends not just on what is logically possible but on social consensus, shared experience, and empirical data.

Because such factors are often context specific, different substantive areas develop their own lore about which alternatives are important enough to need to be controlled, even developing their own methods for doing so.
In early psychology, for example, a control group with pretest observations was invented to control for the plausible alternative explanation that, by giving practice in answering test content, pretests would produce gains in performance even in the absence of a treatment effect (Coover & Angell, 1907). Thus the focus on plausibility is a two-edged sword: it reduces the range of alternatives to be considered in quasi-experimental work, yet it also leaves the resulting causal inference vulnerable to the discovery that an implausible-seeming alternative may later emerge as a likely causal agent.

Natural Experiment

The term natural experiment describes a naturally occurring contrast between a treatment and a comparison condition (Fagan, 1990; Meyer, 1995; Zeisel, 1973). Often the treatments are not even potentially manipulable, as when researchers retrospectively examined whether earthquakes in California caused drops in property values (Brunette, 1995; Murdoch, Singh, & Thayer, 1993). Yet plausible causal inferences about the effects of earthquakes are easy to construct and defend. After all, the earthquakes occurred before the observations on property values, and it is easy to see whether earthquakes are related to property values. A useful source of counterfactual inference can be constructed by examining property values in the same locale before the earthquake or by studying similar locales that did not experience an earthquake during the same time. If property values dropped right after the earthquake in the earthquake condition but not in the comparison condition, it is difficult to find an alternative explanation for that drop.

Natural experiments have recently gained a high profile in economics. Before the 1990s economists had great faith in their ability to produce valid causal inferences through statistical adjustments for initial nonequivalence between treatment and control groups. But two studies on the effects of job training programs showed that those adjustments produced estimates that were not close to those generated from a randomized experiment and were unstable across tests of the model's sensitivity (Fraker & Maynard, 1987; Lalonde, 1986). Hence, in their search for alternative methods, many economists came to do natural experiments, such as the economic study of the effects that occurred in the Miami job market when many prisoners were released from Cuban jails and allowed to come to the United States (Card, 1990).
They assume that the release of prisoners (or the timing of an earthquake) is independent of the ongoing processes that usually affect unemployment rates (or housing values). Later we explore the validity of this assumption; of its desirability there can be little question.
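The counterfactual logic of the earthquake example can be written out as a difference-in-differences calculation (a hypothetical sketch; all property values below are invented for illustration):

```python
# Hypothetical mean property values (in $1,000s), before and after the quake.
quake_before, quake_after = 300.0, 270.0   # locale that had the earthquake
comp_before, comp_after = 310.0, 308.0     # similar locale, no earthquake

# Change in each locale over the same period.
quake_change = quake_after - quake_before  # -30.0
comp_change = comp_after - comp_before     # -2.0

# The comparison locale supplies the counterfactual trend; the remaining
# change in the earthquake locale is attributed to the earthquake.
effect = quake_change - comp_change
print(effect)  # -28.0
```

The validity of this estimate rests on exactly the assumption named above: that absent the earthquake, the quake locale would have followed the comparison locale's trend.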

Nonexperimental Designs

The terms correlational design, passive observational design, and nonexperimental design refer to situations in which a presumed cause and effect are identified and measured but in which other structural features of experiments are missing. Random assignment is not part of the design, nor are such design elements as pretests and control groups from which researchers might construct a useful counterfactual inference. Instead, reliance is placed on measuring alternative explanations individually and then statistically controlling for them. In cross-sectional studies in which all the data are gathered on the respondents at one time, the researcher may not even know if the cause precedes the effect. When these studies are used for causal purposes, the missing design features can be problematic unless much is already known about which alternative interpretations are plausible, unless those that are plausible can be validly measured, and unless the substantive model used for statistical adjustment is well specified. These are difficult conditions to meet in the real world of research practice, and therefore many commentators doubt the potential of such designs to support strong causal inferences in most cases.

EXPERIMENTS AND THE GENERALIZATION OF CAUSAL CONNECTIONS

The strength of experimentation is its ability to illuminate causal inference. The weakness of experimentation is doubt about the extent to which that causal relationship generalizes. We hope that an innovative feature of this book is its focus on generalization. Here we introduce the general issues that are expanded in later chapters.

Most Experiments Are Highly Local But Have General Aspirations

Most experiments are highly localized and particularistic. They are almost always conducted in a restricted range of settings, often just one, with a particular version of one type of treatment rather than, say, a sample of all possible versions. Usually they have several measures, each with theoretical assumptions that are different from those present in other measures, but far from a complete set of all possible measures. Each experiment nearly always uses a convenient sample of people rather than one that reflects a well-described population; and it will inevitably be conducted at a particular point in time that rapidly becomes history.

Yet readers of experimental results are rarely concerned with what happened in that particular, past, local study. Rather, they usually aim to learn either about theoretical constructs of interest or about a larger policy. Theorists often want to

connect experimental results to theories with broad conceptual applicability, which requires generalization at the linguistic level of constructs rather than at the level of the operations used to represent these constructs in a given experiment. They nearly always want to generalize to more people and settings than are represented in a single experiment. Indeed, the value assigned to a substantive theory usually depends on how broad a range of phenomena the theory covers. Similarly, policymakers may be interested in whether a causal relationship would hold (probabilistically) across the many sites at which it would be implemented as a policy, an inference that requires generalization beyond the original experimental study context. Indeed, all human beings probably value the perceptual and cognitive stability that is fostered by generalizations. Otherwise, the world might appear as a buzzing cacophony of isolated instances requiring constant cognitive processing that would overwhelm our limited capacities.

In defining generalization as a problem, we do not assume that more broadly applicable results are always more desirable (Greenwood, 1989). For example, physicists who use particle accelerators to discover new elements may not expect that it would be desirable to introduce such elements into the world. Similarly, social scientists sometimes aim to demonstrate that an effect is possible and to understand its mechanisms without expecting that the effect can be produced more generally. For instance, when a "sleeper effect" occurs in an attitude change study involving persuasive communications, the implication is that change is manifest after a time delay but not immediately so.
The circumstances under which this effect occurs turn out to be quite limited and unlikely to be of any general interest other than to show that the theory predicting it (and many other ancillary theories) may not be wrong (Cook, Gruder, Hennigan, & Flay, 1979). Experiments that demonstrate limited generalization may be just as valuable as those that demonstrate broad generalization.

Nonetheless, a conflict seems to exist between the localized nature of the causal knowledge that individual experiments provide and the more generalized causal goals that research aspires to attain. Cronbach and his colleagues (Cronbach et al., 1980; Cronbach, 1982) have made this argument most forcefully, and their works have contributed much to our thinking about causal generalization. Cronbach noted that each experiment consists of units that receive the experiences being contrasted, of the treatments themselves, of observations made on the units, and of the settings in which the study is conducted. Taking the first letter from each of these four words, he defined the acronym utos to refer to the "instances on which data are collected" (Cronbach, 1982, p. 78), that is, to the actual people, treatments, measures, and settings that were sampled in the experiment. He then defined two problems of generalization: (1) generalizing to the "domain about which [the] question is asked" (p. 79), which he called UTOS; and (2) generalizing to "units, treatments, variables, and settings not directly observed" (p. 83), which he called *UTOS.9

9. We oversimplify Cronbach's presentation here for pedagogical reasons. For example, Cronbach only used capital S, not small s, so that his system referred only to utoS, not utos. He offered diverse and not always consistent definitions of UTOS and *UTOS, in particular. And he does not use the word generalization in the same broad way we do here.

Our theory of causal generalization, outlined below and presented in more detail in Chapters 11 through 13, melds Cronbach's thinking with our own ideas about generalization from previous works (Cook, 1990, 1991; Cook & Campbell, 1979), creating a theory that is different in modest ways from both of these predecessors. Our theory is influenced by Cronbach's work in two ways. First, we follow him by describing experiments consistently throughout this book as consisting of the elements of units, treatments, observations, and settings,10 though we frequently substitute persons for units given that most field experimentation is conducted with humans as participants. We also often substitute outcome for observations given the centrality of observations about outcome when examining causal relationships. Second, we acknowledge that researchers are often interested in two kinds of generalization about each of these five elements, and that these two types are inspired by, but not identical to, the two kinds of generalization that Cronbach defined. We call these construct validity generalizations (inferences about the constructs that research operations represent) and external validity generalizations (inferences about whether the causal relationship holds over variation in persons, settings, treatment, and measurement variables).

Construct Validity: Causal Generalization as Representation

The first causal generalization problem concerns how to go from the particular units, treatments, observations, and settings on which data are collected to the higher order constructs these instances represent. These constructs are almost always couched in terms that are more abstract than the particular instances sampled in an experiment. The labels may pertain to the individual elements of the experiment (e.g., is the outcome measured by a given test best described as intelligence or as achievement?). Or the labels may pertain to the nature of relationships among elements, including causal relationships, as when cancer treatments are classified as cytotoxic or cytostatic depending on whether they kill tumor cells directly or delay tumor growth by modulating their environment.

Consider a randomized experiment by Fortin and Kirouac (1976). The treatment was a brief educational course administered by several nurses, who gave a tour of their hospital and covered some basic facts about surgery with individuals who were to have elective abdominal or thoracic surgery 15 to 20 days later in a single Montreal hospital. Ten specific outcome measures were used after the surgery, such as an activities of daily living scale and a count of the analgesics used to control pain. Now compare this study with its likely target constructs: whether

10. We occasionally refer to time as a separate feature of experiments, following Campbell (1957) and Cook and Campbell (1979), because time can cut across the other factors independently. Cronbach did not include time in his notational system, instead incorporating time into treatment (e.g., the scheduling of treatment), observations (e.g., when measures are administered), or setting (e.g., the historical context of the experiment).

patient education (the target cause) promotes physical recovery (the target effect) among surgical patients (the target population of units) in hospitals (the target universe of settings). Another example occurs in basic research, in which the question frequently arises as to whether the actual manipulations and measures used in an experiment really tap into the specific cause and effect constructs specified by the theory. One way to dismiss an empirical challenge to a theory is simply to make the case that the data do not really represent the concepts as they are specified in the theory.

Empirical results often force researchers to change their initial understanding of what the domain under study is. Sometimes the reconceptualization leads to a more restricted inference about what has been studied. Thus the planned causal agent in the Fortin and Kirouac (1976) study, patient education, might need to be respecified as informational patient education if the informational component of the treatment proved to be causally related to recovery from surgery but the tour of the hospital did not. Conversely, data can sometimes lead researchers to think in terms of target constructs and categories that are more general than those with which they began a research program. Thus the creative analyst of patient education studies might surmise that the treatment is a subclass of interventions that function by increasing "perceived control" or that recovery from surgery can be treated as a subclass of "personal coping." Subsequent readers of the study can even add their own interpretations, perhaps claiming that perceived control is really just a special case of the even more general self-efficacy construct. There is a subtle interplay over time among the original categories the researcher intended to represent, the study as it was actually conducted, the study results, and subsequent interpretations.
This interplay can change the researcher's thinking about what the study particulars actually achieved at a more conceptual level, as can feedback from readers. But whatever reconceptualizations occur, the first problem of causal generalization is always the same: How can we generalize from a sample of instances and the data patterns associated with them to the particular target constructs they represent?

External Validity: Causal Generalization as Extrapolation

The second problem of generalization is to infer whether a causal relationship holds over variations in persons, settings, treatments, and outcomes. For example, someone reading the results of an experiment on the effects of a kindergarten Head Start program on the subsequent grammar school reading test scores of poor African American children in Memphis during the 1980s may want to know if a program with partially overlapping cognitive and social development goals would be as effective in improving the mathematics test scores of poor Hispanic children in Dallas if this program were to be implemented tomorrow. This example again reminds us that generalization is not a synonym for broader application. Here, generalization is from one city to another and from one kind of clientele to another kind, but there is no presumption that Dallas is somehow broader than Memphis or that Hispanic children constitute a broader population than African American children. Of course, some generalizations are from narrow to broad. For example, a researcher who randomly samples experimental participants from a national population may generalize (probabilistically) from the sample to all the other unstudied members of that same population. Indeed, that is the rationale for choosing random selection in the first place. Similarly, when policymakers consider whether Head Start should be continued on a national basis, they are not so interested in what happened in Memphis. They are more interested in what would happen on the average across the United States, as its many local programs still differ from each other despite efforts in the 1990s to standardize much of what happens to Head Start children and parents. But generalization can also go from the broad to the narrow.
Cronbach (1982) gives the example of an experiment that studied differences between the performances of groups of students attending private and public schools. In this case, the concern of individual parents is to know which type of school is better for their particular child, not for the whole group. Whether from narrow to broad, broad to narrow, or across units at about the same level of aggregation, all these examples of external validity questions share the same need: to infer the extent to which the effect holds over variations in persons, settings, treatments, or outcomes.

Approaches to Making Causal Generalizations

Whichever way the causal generalization issue is framed, experiments do not seem at first glance to be very useful. Almost invariably, a given experiment uses a limited set of operations to represent units, treatments, outcomes, and settings. This high degree of localization is not unique to the experiment; it also characterizes case studies, performance monitoring systems, and opportunistically administered questionnaires given to, say, a haphazard sample of respondents at local shopping centers (Shadish, 1995b). Even when questionnaires are administered to nationally representative samples, they are ideal for representing that particular population of persons but have little relevance to citizens outside of that nation. Moreover, responses may also vary by the setting in which the administration took place (a doorstep, a living room, or a work site), by the time of day at which it was administered, by how each question was framed, or by the particular race, age, and gender combination of interviewers. But the fact that the experiment is not alone in its vulnerability to generalization issues does not make it any less a problem. So what is it that justifies any belief that an experiment can achieve a better fit between the particulars of a study and more general inferences to constructs or over variations in persons, settings, treatments, and outcomes?

Sampling and Causal Generalization

The method most often recommended for achieving this close fit is the use of formal probability sampling of instances of units, treatments, observations, or settings (Rossi, Wright, & Anderson, 1983). This presupposes that we have clearly delineated populations of each and that we can sample with known probability from within each of these populations. In effect, this entails the random selection of instances, to be carefully distinguished from the random assignment discussed earlier in this chapter. Random selection involves selecting cases by chance to represent that population, whereas random assignment involves assigning cases to multiple conditions. In cause-probing research that is not experimental, random samples of individuals are often used. Large-scale longitudinal surveys such as the Panel Study of Income Dynamics or the National Longitudinal Survey are used to represent the population of the United States, or certain age brackets within it, and measures of potential causes and effects are then related to each other using time lags in measurement and statistical controls for group nonequivalence. All this is done in hopes of approximating what a randomized experiment achieves. However, cases of random selection from a broad population followed by random assignment from within this population are much rarer (see Chapter 12 for examples). Also rare are studies of random selection followed by a quality quasi-experiment. Such experiments require a high level of resources and a degree of logistical control that is rarely feasible, so many researchers prefer to rely on an implicit set of nonstatistical heuristics for generalization that we hope to make more explicit and systematic in this book.
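The distinction between random selection and random assignment can be made concrete in a few lines of code. This is a minimal illustrative sketch only; the population, sample sizes, and condition labels are invented for the example and are not drawn from the text:

```python
import random

# Hypothetical population of units (names and size are illustrative).
population = [f"person_{i}" for i in range(1000)]

rng = random.Random(42)  # fixed seed for reproducibility

# Random SELECTION: draw cases by chance so the sample can represent
# the population (this supports generalization to that population).
sample = rng.sample(population, 100)

# Random ASSIGNMENT: allocate the selected cases by chance to
# conditions (this supports causal inference about the treatment).
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:50], shuffled[50:]

print(len(sample), len(treatment), len(control))
```

Note that the two procedures are independent: a study can use either, both, or neither, which is exactly why the text treats selection (a generalization tool) and assignment (a causal-inference tool) as separate design choices.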
Random selection occurs even more rarely with treatments, outcomes, and settings than with people. Consider the outcomes observed in an experiment. How often are they randomly sampled? We grant that the domain sampling model of classical test theory (Nunnally & Bernstein, 1994) assumes that the items used to measure a construct have been randomly sampled from a domain of all possible items. However, in actual experimental practice few researchers ever randomly sample items when constructing measures. Nor do they do so when choosing manipulations or settings. For instance, many settings will not agree to be sampled, and some of the settings that agree to be randomly sampled will almost certainly not agree to be randomly assigned to conditions. For treatments, no definitive list of possible treatments usually exists, as is most obvious in areas in which treatments are being discovered and developed rapidly, such as in AIDS research. In general, then, random sampling is always desirable, but it is only rarely and contingently feasible. However, formal sampling methods are not the only option. Two informal, purposive sampling methods are sometimes useful: purposive sampling of heterogeneous instances and purposive sampling of typical instances. In the former case, the aim is to include instances chosen deliberately to reflect diversity on presumptively important dimensions, even though the sample is not formally random. In the latter

case, the aim is to explicate the kinds of units, treatments, observations, and settings to which one most wants to generalize and then to select at least one instance of each class that is impressionistically similar to the class mode. Although these purposive sampling methods are more practical than formal probability sampling, they are not backed by a statistical logic that justifies formal generalizations. Nonetheless, they are probably the most commonly used of all sampling methods for facilitating generalizations. A task we set ourselves in this book is to explicate such methods and to describe how they can be used more often than is the case today. However, sampling methods of any kind are insufficient to solve either problem of generalization. Formal probability sampling requires specifying a target population from which sampling then takes place, but defining such populations is difficult for some targets of generalization such as treatments. Purposive sampling of heterogeneous instances is differentially feasible for different elements in a study; it is often more feasible to make measures diverse than it is to obtain diverse settings, for example. Purposive sampling of typical instances is often feasible when target modes, medians, or means are known, but it leaves questions about generalizations to a wider range than is typical. Besides, as Cronbach points out, most challenges to the causal generalization of an experiment typically emerge after a study is done. In such cases, sampling is relevant only if the instances in the original study were sampled diversely enough to promote responsible reanalyses of the data to see if a treatment effect holds across most or all of the targets about which generalization has been challenged. But packing so many sources of variation into a single experimental study is rarely practical and will almost certainly conflict with other goals of the experiment.
Formal sampling methods usually offer only a limited solution to causal generalization problems. A theory of generalized causal inference needs additional tools.
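The two purposive strategies described above can be loosely sketched in code. This is a hypothetical illustration under our own assumptions, not a procedure the authors specify: the site data, the dimension names, the greedy max-spread heuristic for heterogeneity, and the distance-to-centroid rule for typicality are all invented for the example.

```python
import random

# Invented candidate "sites," each scored on two presumptively
# important dimensions (names are illustrative only).
rng = random.Random(0)
sites = [{"id": i, "urban": rng.random(), "size": rng.random()}
         for i in range(50)]

def heterogeneous(instances, k, dims):
    """Greedily pick k instances that maximize spread on the dimensions
    (one plausible formalization of 'deliberate diversity')."""
    chosen = [instances[0]]
    while len(chosen) < k:
        def min_dist(cand):
            return min(sum((cand[d] - c[d]) ** 2 for d in dims)
                       for c in chosen)
        remaining = [x for x in instances if x not in chosen]
        chosen.append(max(remaining, key=min_dist))
    return chosen

def typical(instances, k, dims):
    """Pick the k instances closest to the centroid, a rough stand-in
    for 'impressionistically similar to the class mode'."""
    center = {d: sum(x[d] for x in instances) / len(instances)
              for d in dims}
    return sorted(instances,
                  key=lambda x: sum((x[d] - center[d]) ** 2
                                    for d in dims))[:k]

diverse_sites = heterogeneous(sites, 5, ["urban", "size"])
modal_sites = typical(sites, 5, ["urban", "size"])
```

As the text notes, neither routine carries the statistical warrant of probability sampling; the point of the sketch is only that the two aims, spreading instances out versus clustering them near the mode, are distinct selection rules.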

A Grounded Theory of Causal Generalization

Practicing scientists routinely make causal generalizations in their research, and they almost never use formal probability sampling when they do. In this book, we present a theory of causal generalization that is grounded in the actual practice of science (Matt, Cook, & Shadish, 2000). Although this theory was originally developed from ideas that were grounded in the construct and external validity literatures (Cook, 1990, 1991), we have since found that these ideas are common in a diverse literature about scientific generalizations (e.g., Abelson, 1995; Campbell & Fiske, 1959; Cronbach & Meehl, 1955; Davis, 1994; Locke, 1986; Medin, 1989; Messick, 1989, 1995; Rubins, 1994; Willner, 1991; Wilson, Hayward, Tunis, Bass, & Guyatt, 1995). We provide more details about this grounded theory in Chapters 11 through 13, but in brief it suggests that scientists make causal generalizations in their work by using five closely related principles:

1. Surface Similarity. They assess the apparent similarities between study operations and the prototypical characteristics of the target of generalization.

2. Ruling Out Irrelevancies. They identify those things that are irrelevant because they do not change a generalization.
3. Making Discriminations. They clarify key discriminations that limit generalization.
4. Interpolation and Extrapolation. They make interpolations to unsampled values within the range of the sampled instances and, much more difficult, they explore extrapolations beyond the sampled range.
5. Causal Explanation. They develop and test explanatory theories about the pattern of effects, causes, and mediational processes that are essential to the transfer of a causal relationship.

In this book, we want to show how scientists can and do use these five principles to draw generalized conclusions about a causal connection. Sometimes the conclusion is about the higher order constructs to use in describing an obtained connection at the sample level. In this sense, these five principles have analogues or parallels both in the construct validity literature (e.g., with construct content, with convergent and discriminant validity, and with the need for theoretical rationales for constructs) and in the cognitive science and philosophy literatures that study how people decide whether instances fall into a category (e.g., concerning the roles that prototypical characteristics and surface versus deep similarity play in determining category membership). But at other times, the conclusion about generalization refers to whether a connection holds broadly or narrowly over variations in persons, settings, treatments, or outcomes. Here, too, the principles have analogues or parallels that we can recognize from scientific theory and practice, as in the study of dose-response relationships (a form of interpolation-extrapolation) or the appeal to explanatory mechanisms in generalizing from animals to humans (a form of causal explanation). Scientists use these five principles almost constantly during all phases of research. For example, when they read a published study and wonder if some variation on the study's particulars would work in their lab, they think about similarities of the published study to what they propose to do. When they conceptualize the new study, they anticipate how the instances they plan to study will match the prototypical features of the constructs about which they are curious.
They may design their study on the assumption that certain variations will be irrelevant to it but that others will point to key discriminations over which the causal relationship does not hold or the very character of the constructs changes. They may include measures of key theoretical mechanisms to clarify how the intervention works. During data analysis, they test all these hypotheses and adjust their construct descriptions to match better what the data suggest happened in the study. The introduction section of their articles tries to convince the reader that the study bears on specific constructs, and the discussion sometimes speculates about how results might extrapolate to different units, treatments, outcomes, and settings. Further, practicing scientists do all this not just with single studies that they read or conduct but also with multiple studies. They nearly always think about

how their own studies fit into a larger literature about both the constructs being measured and the variables that may or may not bound or explain a causal connection, often documenting this fit in the introduction to their study. And they apply all five principles when they conduct reviews of the literature, in which they make inferences about the kinds of generalizations that a body of research can support. Throughout this book, and especially in Chapters 11 to 13, we provide more details about this grounded theory of causal generalization and about the scientific practices that it suggests. Adopting this grounded theory of generalization does not imply a rejection of formal probability sampling. Indeed, we recommend such sampling unambiguously when it is feasible, along with purposive sampling schemes to aid generalization when formal random selection methods cannot be implemented. But we also show that sampling is just one method that practicing scientists use to make causal generalizations, along with practical logic, application of diverse statistical methods, and use of features of design other than sampling.

EXPERIMENTS AND METASCIENCE

Extensive philosophical debate sometimes surrounds experimentation. Here we briefly summarize some key features of these debates, and then we discuss some implications of these debates for experimentation. However, there is a sense in which all this philosophical debate is incidental to the practice of experimentation. Experimentation is as old as humanity itself, so it preceded humanity's philosophical efforts to understand causation and generalization by thousands of years. Even over just the past 400 years of scientific experimentation, we can see some constancy of experimental concept and method, whereas diverse philosophical conceptions of the experiment have come and gone. As Hacking (1983) said, "Experimentation has a life of its own" (p. 150). It has been one of science's most powerful methods for discovering descriptive causal relationships, and it has done so well in so many ways that its place in science is probably assured forever. To justify its practice today, a scientist need not resort to sophisticated philosophical reasoning about experimentation. Nonetheless, it does help scientists to understand these philosophical debates. For example, previous distinctions in this chapter between molar and molecular causation, descriptive and explanatory cause, or probabilistic and deterministic causal inferences all help both philosophers and scientists to understand better both the purpose and the results of experiments (e.g., Bunge, 1959; Eells, 1991; Hart & Honoré, 1985; Humphreys, 1989; Mackie, 1974; Salmon, 1984, 1989; Sobel, 1993; P. A. White, 1990). Here we focus on a different and broader set of critiques of science itself, not only from philosophy but also from the history, sociology, and psychology of science (see useful general reviews by Bechtel, 1988; H. I. Brown, 1977; Oldroyd, 1986). Some of these works have been explicitly about the nature of experimentation, seeking to create a justified role for it (e.g.,

Bhaskar, 1975; Campbell, 1982, 1988; Danziger, 1990; S. Drake, 1981; Gergen, 1973; Gholson, Shadish, Neimeyer, & Houts, 1989; Gooding, Pinch, & Schaffer, 1989b; Greenwood, 1989; Hacking, 1983; Latour, 1987; Latour & Woolgar, 1979; Morawski, 1988; Orne, 1962; R. Rosenthal, 1966; Shadish & Fuller, 1994; Shapin, 1994). These critiques help scientists to see some limits of experimentation in both science and society.

The Kuhnian Critique

Kuhn (1962) described scientific revolutions as different and partly incommensurable paradigms that abruptly succeeded each other in time and in which the gradual accumulation of scientific knowledge was a chimera. Hanson (1958), Polanyi (1958), Popper (1959), Toulmin (1961), Feyerabend (1975), and Quine (1951, 1969) contributed to the critical momentum, in part by exposing the gross mistakes in logical positivism's attempt to build a philosophy of science based on reconstructing a successful science such as physics. All these critiques denied any firm foundations for scientific knowledge (so, by extension, experiments do not provide firm causal knowledge). The logical positivists hoped to achieve foundations on which to build knowledge by tying all theory tightly to theory-free observation through predicate logic. But this left out important scientific concepts that could not be tied tightly to observation; and it failed to recognize that all observations are impregnated with substantive and methodological theory, making it impossible to conduct theory-free tests.11 The impossibility of theory-neutral observation (often referred to as the Quine-Duhem thesis) implies that the results of any single test (and so any single experiment) are inevitably ambiguous. They could be disputed, for example, on grounds that the theoretical assumptions built into the outcome measure were wrong or that the study made a faulty assumption about how high a treatment dose was required to be effective. Some of these assumptions are small, easily detected, and correctable, such as when a voltmeter gives the wrong reading because the impedance of the voltage source was much higher than that of the meter (Wilson, 1952). But other assumptions are more paradigmlike, impregnating a theory so completely that other parts of the theory make no sense without them (e.g., the assumption that the earth is the center of the universe in pre-Galilean astronomy).
Because the number of assumptions involved in any scientific test is very large, researchers can easily find some assumptions to fault or can even posit new

11. However, Holton (1986) reminds us not to overstate the reliance of positivists on empirical data: "Even the father of positivism, Auguste Comte, had written . . . that without a theory of some sort by which to link phenomena to some principles 'it would not only be impossible to combine the isolated observations and draw any useful conclusions, we would not even be able to remember them, and, for the most part, the fact would not be noticed by our eyes'" (p. 32). Similarly, Uebel (1992) provides a more detailed historical analysis of the protocol sentence debate in logical positivism, showing some surprisingly nonstereotypical positions held by key players such as Carnap.

assumptions (Mitroff & Fitzgerald, 1977). In this way, substantive theories are less testable than their authors originally conceived. How can a theory be tested if it is made of clay rather than granite? For reasons we clarify later, this critique is more true of single studies and less true of programs of research. But even in the latter case, undetected constant biases can result in flawed inferences about cause and its generalization. As a result, no experiment is ever fully certain, and extrascientific beliefs and preferences always have room to influence the many discretionary judgments involved in all scientific belief.

Modern Social Psychological Critiques

Sociologists working within traditions variously called social constructivism, epistemological relativism, and the strong program (e.g., Barnes, 1974; Bloor, 1976; Collins, 1981; Knorr-Cetina, 1981; Latour & Woolgar, 1979; Mulkay, 1979) have shown those extrascientific processes at work in science. Their empirical studies show that scientists often fail to adhere to norms commonly proposed as part of good science (e.g., objectivity, neutrality, sharing of information). They have also shown how that which comes to be reported as scientific knowledge is partly determined by social and psychological forces and partly by issues of economic and political power both within science and in the larger society, issues that are rarely mentioned in published research reports. The most extreme among these sociologists attributes all scientific knowledge to such extrascientific processes, claiming that "the natural world has a small or nonexistent role in the construction of scientific knowledge" (Collins, 1981, p. 3). Collins does not deny ontological realism, that real entities exist in the world. Rather, he denies epistemological (scientific) realism, that whatever external reality may exist can constrain our scientific theories. For example, if atoms really exist, do they affect our scientific theories at all? If our theory postulates an atom, is it describing a real entity that exists roughly as we describe it? Epistemological relativists such as Collins respond negatively to both questions, believing that the most important influences in science are social, psychological, economic, and political, and that these might even be the only influences on scientific theories. This view is not widely endorsed outside a small group of sociologists, but it is a useful counterweight to naive assumptions that scientific studies somehow directly reveal nature to us (an assumption we call naive realism).
The results of all studies, including experiments, are profoundly subject to these extrascientific influences, from their conception to reports of their results.

Science and Trust

A standard image of the scientist is as a skeptic, a person who only trusts results that have been personally verified. Indeed, the scientific revolution of the 17th century claimed that trust, particularly trust in authority and dogma, was antithetical to good science. Every authoritative assertion, every dogma, was to be open to question, and the job of science was to do that questioning. That image is partly wrong. Any single scientific study is an exercise in trust (Pinch, 1986; Shapin, 1994). Studies trust the vast majority of already developed methods, findings, and concepts that they use when they test a new hypothesis. For example, statistical theories and methods are usually taken on faith rather than personally verified, as are measurement instruments. The ratio of trust to skepticism in any given study is more like 99% trust to 1% skepticism than the opposite. Even in lifelong programs of research, the single scientist trusts much more than he or she ever doubts. Indeed, thoroughgoing skepticism is probably impossible for the individual scientist, to judge from what we know of the psychology of science (Gholson et al., 1989; Shadish & Fuller, 1994). Finally, skepticism is not even an accurate characterization of past scientific revolutions; Shapin (1994) shows that the role of "gentlemanly trust" in 17th-century England was central to the establishment of experimental science. Trust pervades science, despite its rhetoric of skepticism.

Implications for Experiments

The net result of these criticisms is a greater appreciation for the equivocality of all scientific knowledge. The experiment is not a clear window that reveals nature directly to us. To the contrary, experiments yield hypothetical and fallible knowledge that is often dependent on context and imbued with many unstated theoretical assumptions. Consequently, experimental results are partly relative to those assumptions and contexts and might well change with new assumptions or contexts. In this sense, all scientists are epistemological constructivists and relativists. The difference is whether they are strong or weak relativists. Strong relativists share Collins's position that only extrascientific factors influence our theories. Weak relativists believe that both the ontological world and the worlds of ideology, interests, values, hopes, and wishes play a role in the construction of scientific knowledge. Most practicing scientists, including ourselves, would probably describe themselves as ontological realists but weak epistemological relativists.12 To the extent that experiments reveal nature to us, it is through a very clouded windowpane (Campbell, 1988). Such counterweights to naive views of experiments were badly needed. As recently as 30 years ago, the central role of the experiment in science was probably

12. If space permitted, we could extend this discussion to a host of other philosophical issues that have been raised about the experiment, such as its role in discovery versus confirmation, incorrect assertions that the experiment is tied to some specific philosophy such as logical positivism or pragmatism, and the various mistakes that are frequently made in such discussions (e.g., Campbell, 1982, 1988; Cook, 1991; Cook & Campbell, 1986; Shadish, 1995a).

taken more for granted than is the case today. For example, Campbell and Stanley (1963) described themselves as:

committed to the experiment: as the only means for settling disputes regarding educational practice, as the only way of verifying educational improvements, and as the only way of establishing a cumulative tradition in which improvements can be introduced without the danger of a faddish discard of old wisdom in favor of inferior novelties. (p. 2)

Indeed, Hacking (1983) points out that "'experimental method' used to be just another name for science" (p. 149); and experimentation was then a more fertile ground for examples illustrating basic philosophical issues than it was a source of contention itself. Not so today. We now understand better that the experiment is a profoundly human endeavor, affected by all the same human foibles as any other human endeavor, though with well-developed procedures for partial control of some of the limitations that have been identified to date. Some of these limitations are common to all science, of course. For example, scientists tend to notice evidence that confirms their preferred hypotheses and to overlook contradictory evidence. They make routine cognitive errors of judgment and have limited capacity to process large amounts of information. They react to peer pressures to agree with accepted dogma and to social role pressures in their relationships to students, participants, and other scientists. They are partly motivated by sociological and economic rewards for their work (sadly, sometimes to the point of fraud), and they display all-too-human psychological needs and irrationalities about their work. Other limitations have unique relevance to experimentation. For example, if causal results are ambiguous, as in many weaker quasi-experiments, experimenters may attribute causation or causal generalization based on study features that have little to do with orthodox logic or method.
They may fail to pursue all the alternative causal explanations because of a lack of energy, a need to achieve closure, or a bias toward accepting evidence that confirms their preferred hypothesis. Each experiment is also a social situation, full of social roles (e.g., participant, experimenter, assistant) and social expectations (e.g., that people should provide true information) but with a uniqueness (e.g., that the experimenter does not always tell the truth) that can lead to problems when social cues are misread or deliberately thwarted by either party. Fortunately, these limits are not insurmountable, as formal training can help overcome some of them (Lehman, Lempert, & Nisbett, 1988). Still, the relationship between scientific results and the world that science studies is neither simple nor fully trustworthy. These social and psychological analyses have taken some of the luster from the experiment as a centerpiece of science. The experiment may have a life of its own, but it is no longer life on a pedestal. Among scientists, belief in the experiment as the only means to settle disputes about causation is gone, though it is still the preferred method in many circumstances. Gone, too, is the belief that the power experimental methods often displayed in the laboratory would transfer easily to applications in field settings. As a result of highly publicized science-related events such as the tragic results of the Chernobyl nuclear disaster, the disputes over certainty levels of DNA testing in the O. J. Simpson trials, and the failure to find a cure for most cancers after decades of highly publicized and funded effort, the general public now better understands the limits of science. Yet we should not take these critiques too far. Those who argue against theory-free tests often seem to suggest that every experiment will come out just as the experimenter wishes.
This expectation is totally contrary to the experience of researchers, who find instead that experimentation is often frustrating and disappointing for the theories they loved so much. Laboratory results may not speak for themselves, but they certainly do not speak only for one's hopes and wishes. We find much to value in the laboratory scientist's belief in "stubborn facts" with a life span that is greater than the fluctuating theories with which one tries to explain them. Thus many basic results about gravity are the same, whether they are contained within a framework developed by Newton or by Einstein; and no successor theory to Einstein's would be plausible unless it could account for most of the stubborn factlike findings about falling bodies. There may not be pure facts, but some observations are clearly worth treating as if they were facts. Some theorists of science (Hanson, Polanyi, Kuhn, and Feyerabend included) have so exaggerated the role of theory in science as to make experimental evidence seem almost irrelevant. But exploratory experiments that were unguided by formal theory and unexpected experimental discoveries tangential to the initial research motivations have repeatedly been the source of great scientific advances. Experiments have provided many stubborn, dependable, replicable results that then become the subject of theory. Experimental physicists feel that their laboratory data help keep their more speculative theoretical counterparts honest, giving experiments an indispensable role in science. Of course, these stubborn facts often involve both commonsense presumptions and trust in many well-established theories that make up the shared core of belief of the science in question.
And of course, these stubborn facts sometimes prove to be undependable, are reinterpreted as experimental artifacts, or are so laden with a dominant focal theory that they disappear once that theory is replaced. But this is not the case with the great bulk of the factual base, which remains reasonably dependable over relatively long periods of time.

A WORLD WITHOUT EXPERIMENTS OR CAUSES?

To borrow a thought experiment from MacIntyre (1981), imagine that the slates of science and philosophy were wiped clean and that we had to construct our understanding of the world anew. As part of that reconstruction, would we reinvent the notion of a manipulable cause? We think so, largely because of the practical utility that dependable manipulanda have for our ability to survive and prosper. Would we reinvent the experiment as a method for investigating such causes?

Again yes, because humans will always be trying to better know how well these manipulable causes work. Over time, they will refine how they conduct those experiments and so will again be drawn to problems of counterfactual inference, of cause preceding effect, of alternative explanations, and of all of the other features of causation that we have discussed in this chapter. In the end, we would probably end up with the experiment or something very much like it. This book is one more step in that ongoing process of refining experiments. It is about improving the yield from experiments that take place in complex field settings, both the quality of causal inferences they yield and our ability to generalize these inferences to constructs and over variations in persons, settings, treatments, and outcomes.

A Critical Assessment of Our Assumptions

As.sump.tion (e-sump'shen): [Middle English assumpcion, from Latin assumptio, assumption-, adoption, from assumptus, past participle of assumere, to adopt; see assume.] n. 1. The act of taking to or upon oneself: assumption of an obligation. 2. The act of taking over: assumption of command. 3. The act of taking for granted: assumption of a false theory. 4. Something taken for granted or accepted as true without proof; a supposition: a valid assumption. 5. Presumption; arrogance. 6. Logic. A minor premise.

THIS BOOK covers five central topics across its 13 chapters. The first topic (Chapter 1) deals with our general understanding of descriptive causation and experimentation. The second (Chapters 2 and 3) deals with the types of validity and the specific validity threats associated with this understanding. The third (Chapters 4 through 7) deals with quasi-experiments and illustrates how combining design features can facilitate better causal inference. The fourth (Chapters 8 through 10) concerns randomized experiments and stresses the factors that impede and promote their implementation. The fifth (Chapters 11 through 13) deals with causal generalization, both theoretically and as concerns the conduct of individual studies and programs of research. The purpose of this last chapter is to critically assess some of the assumptions that have gone into these five topics, especially the assumptions that critics have found objectionable or that we anticipate they will find objectionable. We organize the discussion around each of the five topics and then briefly justify why we did not deal more extensively with nonexperimental methods for assessing causation.

We do not delude ourselves that we can be the best explicators of our own assumptions. Our critics can do that task better. But we want to be as comprehensive and as explicit as we can. This is in part because we are convinced of the advantages of falsification as a major component of any epistemology for the social sciences, and forcing out one's assumptions and confronting them is one part of falsification. But it is also because we would like to stimulate critical debate about these assumptions so that we can learn from those who would challenge our thinking. If there were to be a future book that carried even further forward the tradition emanating from Campbell and Stanley via Cook and Campbell to this book, then that future book would probably be all the better for building upon all the justified criticisms coming from those who do not agree with us, either on particulars or on the whole approach we have taken to the analysis of descriptive causation and its generalization. We would like this chapter not only to model the attempt to be critical about the assumptions all scholars must inevitably make but also to encourage others to think about these assumptions and how they might be addressed in future empirical or theoretical work.

CAUSATION AND EXPERIMENTATION

Causal Arrows and Pretzels

Experiments test the influence of only one or at most a small subset of descriptive causes. If statistical interactions are involved, they tend to be among very few treatments or between a single treatment and a limited set of moderator variables. Many researchers believe that the causal knowledge that results from this typical experimental structure fails to map the many causal forces that simultaneously affect any given outcome in complex and nonlinear ways (e.g., Cronbach et al., 1980; Magnusson, 2000). These critics assert that experiments prioritize on arrows connecting A to B when they should instead seek to describe an explanatory pretzel or set of intersecting pretzels, as it were. They also believe that most causal relationships vary across units, settings, and times, and so they doubt whether there are any constant bivariate causal relationships (e.g., Cronbach & Snow, 1977). Those that do appear to be dependable in the data may simply reflect statistically underpowered tests of moderators or mediators that failed to reveal the true underlying complex causal relationships. True variation in effect sizes might also be obscured because the relevant substantive theory is underspecified, or the outcome measures are partially invalid, or the treatment contrast is attenuated, or causally implicated variables are truncated in how they are sampled (McClelland & Judd, 1993).

As valid as these objections are, they do not invalidate the case for experiments. The purpose of experiments is not to completely explain some phenomenon; it is to identify whether a particular variable or small set of variables makes a marginal difference in some outcome over and above all the other forces affecting that outcome. Moreover, ontological doubts such as the preceding have not stopped believers in more complex causal theories from acting as though many causal relationships can be usefully characterized as dependable main effects or as very simple nonlinearities that are also dependable enough to be useful. In this connection, consider some examples from education in the United States, where objections to experimentation are probably the most prevalent and virulent. Few educational researchers seem to object to the following substantive conclusions of the form that A dependably causes B: small schools are better than large ones; time-on-task raises achievement; summer school raises test scores; school desegregation hardly affects achievement but does increase White flight; and assigning and grading homework raises achievement. The critics also do not seem to object to other conclusions involving very simple causal contingencies: reducing class size increases achievement, but only if the amount of change is "sizable" and to a level under 20; or Catholic schools are superior to public ones, but only in the inner city and not in the suburbs, and then most noticeably in graduation rates rather than in achievement test scores.

The primary justification for such oversimplifications, and for the use of the experiments that test them, is that some moderators of effects are of minor relevance to policy and theory even if they marginally improve explanation. The most important contingencies are usually those that modify the sign of a causal relationship rather than its magnitude. Sign changes imply that a treatment is beneficial in some circumstances but might be harmful in others. This is quite different from identifying circumstances that influence just how positive an effect might be. Policy-makers are often willing to advocate an overall change, even if they suspect it has different-sized positive effects for different groups, as long as the effects are rarely negative. But if some groups will be positively affected and others negatively, political actors are loath to prescribe different treatments for different groups because rivalries and jealousies often ensue.
Theoreticians also probably pay more attention to causal relationships that differ in causal sign because this result implies that one can identify the boundary conditions that impel such a disparate data pattern.

Of course, we do not advocate ignoring all causal contingencies. For example, physicians routinely prescribe one of several possible interventions for a given diagnosis. The exact choice may depend on the diagnosis, test results, patient preferences, insurance resources, and the availability of treatments in the patient's area. However, the costs of such a contingent system are high. In part to limit the number of relevant contingencies, physicians specialize, and within their own specialty they undergo extensive training to enable them to make these contingent decisions. Even then, substantial judgment is still required to cover the many situations in which causal contingencies are ambiguous or in dispute. In many other policy domains it would also be costly to implement the financial, management, and cultural changes that a truly contingent system would require even if the requisite knowledge were available. Taking such a contingent approach to its logical extremes would entail in education, for example, that individual tutoring become the order of the day. Students and instructors would have to be carefully matched for overlap in teaching and learning skills and in the curriculum supports they would need.

Within limits, some moderators can be studied experimentally, either by measuring the moderator so it can be tested during analysis or by deliberately varying it in the next study in a program of research. In conducting such experiments, one moves away from the black-box experiments of yesteryear toward taking causal contingencies more seriously and toward routinely studying them by, for example, disaggregating the treatment to examine its causally effective components, disaggregating the effect to examine its causally impacted components, conducting analyses of demographic and psychological moderator variables, and exploring the causal pathways through which (parts of) the treatment affects (parts of) the outcome. To do all of this well in a single experiment is not possible, but to do some of it well is possible and desirable.

Epistemological Criticisms of Experiments

In highlighting statistical conclusion validity and in selecting examples, we have often linked causal description to quantitative methods and hypothesis testing. Many critics will (wrongly) see this as implying a discredited theory of positivism. As a philosophy of science first outlined in the early 19th century, positivism rejected metaphysical speculations, especially about unobservables, and equated knowledge with descriptions of experienced phenomena. A narrower school of logical positivism emerged in the early 20th century that also rejected realism while emphasizing the use of data-theory connections in predicate logic form and a preference for predicting phenomena over explaining them. Both these related epistemologies were long ago discredited, especially as explanations of how science operates, so few critics seriously criticize experiments on this basis. However, many critics use the term positivism with less historical fidelity to attack quantitative social science methods in general (e.g., Lincoln & Guba, 1985). Building on the rejection of logical positivism, they reject the use of quantification and formal logic in observation, measurement, and hypothesis testing. Because these last features are part of experiments, to reject this loose conception of positivism entails rejecting experiments. However, the errors in such criticisms are numerous. For example, to reject a specific feature of positivism (like the idea that quantification and predicate logic are the only permissible links between data and theory) does not necessarily imply rejecting all related and more general propositions (such as the notion that some kinds of quantification and hypothesis testing may be useful for knowledge growth). We and others have outlined such errors elsewhere (Phillips, 1990; Shadish, 1995a).

Other epistemological criticisms of experimentation cite the work of historians of science such as Kuhn (1962), of sociologists of science such as Latour and Woolgar (1979), and of philosophers of science such as Harré (1981). These critics tend to focus on three things. One is the incommensurability of theories, the notion that theories are never perfectly specified and so can always be reinterpreted. As a result, when disconfirming data seem to imply that a theory should be rejected, its postulates can instead be reworked in order to make the theory and observations consistent with each other. This is usually done by adding new contingencies to the theory that limit the conditions under which it is thought to hold. A second critique is of the assumption that experimental observations can be used as truth tests. We would like observations to be objective assessments that can adjudicate between different theoretical explanations of a phenomenon. But in practice, observations are not theory neutral; they are open to multiple interpretations that include such irrelevancies as the researcher's hopes, dreams, and predilections. The consequence is that observations rarely result in definitive hypothesis tests. The final criticism follows from the many behavioral and cognitive inconsistencies between what scientists do in practice and what scientific norms prescribe they should do. Descriptions of scientists' behavior in laboratories reveal them as choosing to do particular experiments because they have an intuition about a relationship, or they are simply curious to see what happens, or they want to play with a new piece of equipment they happen to find lying around. Their impetus, therefore, is not a hypothesis carefully deduced from a theory that they then test by means of careful observation.

Although these critiques have some credibility, they are overgeneralized. Few experimenters believe that their work yields definitive results even after it has been subjected to professional review. Further, though these philosophical, historical, and social critiques complicate what a "fact" means for any scientific method, nonetheless many relationships have stubbornly recurred despite changes associated with the substantive theories, methods, and researcher biases that first generated them. Observations may never achieve the status of "facts," but many of them are so stubbornly replicable that they may be considered as though they were facts.
For experimenters, the trick is to make sure that observations are not impregnated with just one theory, and this is done by building multiple theories into observations and by valuing independent replications, especially those of substantive critics, what we have elsewhere called critical multiplism (Cook, 1985; Shadish, 1989, 1994).

Although causal claims can never be definitively tested and proven, individual experiments still manage to probe such claims. For example, if a study produces negative results, it is often the case that program developers and other advocates then bring up methodological and substantive contingencies that might have changed the result. For instance, they might contend that a different outcome measure or population would have led to a different conclusion. Subsequent studies then probe these alternatives and, if they again prove negative, lead to yet another round of probes of whatever new explanatory possibilities have emerged. After a time, this process runs out of steam, so particularistic are the contingencies that remain to be examined. It is as though a consensus emerges: "The causal relationship was not obtained under many conditions. The conditions that remain to be examined are so circumscribed that the intervention will not be worth much even if it is effective under these conditions." We agree that this process is as much or more social than logical. But the reality of elastic theory does not mean that decisions about causal hypotheses are only social and devoid of all empirical and logical content.

The criticisms noted are especially useful in highlighting the limited value of individual studies relative to reviews of research programs. Such reviews are better because the greater diversity of study features makes it less likely that the same theoretical biases that inevitably impregnate any one study will reappear across all the studies under review. Still, a dialectic process of point, response, and counterpoint is needed even with reviews, again implying that no single review is definitive. For example, in response to Smith and Glass's (1977) meta-analytic claim that psychotherapy was effective, Eysenck (1977) and Presby (1977) pointed out methodological and substantive contingencies that challenged the original reviewers' results. They suggested that a different answer would have been achieved if Smith and Glass had not combined randomized and nonrandomized experiments or if they had used narrower categories in which to classify types of therapy. Subsequent studies probed these challenges to Smith and Glass or brought forth novel ones (e.g., Weisz et al., 1992). This process of challenging causal claims with specific alternatives has now slowed in reviews of psychotherapy as many major contingencies that might limit effectiveness have been explored. The current consensus from reviews of many experiments in many kinds of settings is that psychotherapy is effective; it is not just the product of a regression process (spontaneous remission) whereby those who are temporarily in need seek professional help and get better, as they would have even without the therapy.
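The regression process invoked by that rival explanation is easy to demonstrate. The sketch below is a minimal simulation, with every number and name invented for illustration (it is not data from any psychotherapy study): people selected because they score poorly at one measurement occasion tend, as a group, to score closer to the population average at the next occasion, with no treatment at all, simply because the transient part of an extreme score rarely recurs.

```python
import random

def simulate_regression_to_mean(n=10_000, seed=42):
    """Select the most distressed tenth of a sample at time 1 and
    remeasure the same people at time 2 with no treatment between."""
    rng = random.Random(seed)
    # Each person has a stable 'true' score plus an independent
    # transient fluctuation at each measurement (arbitrary units,
    # population mean 50).
    true_scores = [rng.gauss(50, 10) for _ in range(n)]
    time1 = [t + rng.gauss(0, 10) for t in true_scores]
    time2 = [t + rng.gauss(0, 10) for t in true_scores]
    # The 'seek help' group: the lowest 10% of observed time-1 scores.
    cutoff = sorted(time1)[n // 10]
    group = [i for i, s in enumerate(time1) if s <= cutoff]
    mean1 = sum(time1[i] for i in group) / len(group)
    mean2 = sum(time2[i] for i in group) / len(group)
    return mean1, mean2

m1, m2 = simulate_regression_to_mean()
print(f"selected group, time 1 mean: {m1:.1f}")
print(f"same group, time 2 mean (untreated): {m2:.1f}")
```

The untreated group "improves" substantially between measurements while remaining below the population mean, which is exactly why control groups, rather than pre-post change alone, are needed to credit therapy with the improvement.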

Neglected Ancillary Questions

Our focus on causal questions within an experimental framework neglects many other questions that are relevant to causation. These include questions about how to decide on the importance or leverage of any single causal question. This could entail exploring whether a causal question is even warranted, as it often is not at the early stages of development of an issue. Or it could entail exploring what type of causal question is more important: one that fills an identified hole in some literature, or one that sets out to identify specific boundary conditions limiting a causal connection, or one that probes the validity of a central assumption held by all the theorists and researchers within a field, or one that reduces uncertainty about an important decision when formerly uncertainty was high. Our approach also neglects the reality that how one formulates a descriptive causal question usually entails meeting some stakeholders' interests in the social research more than those of others. Thus to ask about the effects of a national program meets the needs of Congressional staffs, the media, and policy wonks to learn about whether the program works. But it can fail to meet the needs of local practitioners who usually want to know about the effectiveness of microelements within the program so that they can use this knowledge to improve their daily practice. In more theoretical work, to ask how some intervention affects personal self-efficacy is likely to promote individuals' autonomy needs, whereas to ask about the effects of a persuasive communication designed to change attitudes could well cater to the needs of those who would limit or manipulate such autonomy. Our narrow technical approach to causation also neglected issues related to how such causal knowledge might be used and misused. It gave short shrift to a systematic analysis of the kinds of causal questions that can and cannot be answered through experiments. What about the effects of abortion, divorce, stable cohabitation, birth out of wedlock, and other possibly harmful events that we cannot ethically manipulate? What about the effects of class, race, and gender that are not amenable to experimentation? What about the effects of historical occurrences that can be studied only by using time-series methods on whatever variables might or might not be in the archives? Of what use, one might ask, is a method that cannot get at some of the most important phenomena that shape our social world, often over generations, as in the case of race, class, and gender?

Many now consider questions about things that cannot be manipulated as being beyond causal analysis, so closely do they link manipulation to causation. To them, the cause must be at least potentially manipulable, even if it is not actually manipulated in a given observational study. Thus they would not consider race a cause, though they would speak of the causal analysis of race in studies in which Black and White couples are, say, randomly assigned to visiting rental units in order to see if the refusal rates vary, or that entail chemically changing skin color to see how individuals are responded to differently as a function of pigmentation, or that systematically varied the racial mix of students in schools or classrooms in order to study teacher responses and student performance. Many critics do not like so tight a coupling of manipulation and causation.
For example, those who do status attainment research consider it obvious that race causally influences how teachers treat individual minority students and thus affects how well these children do in school and therefore what jobs they get and what prospects their own children will subsequently have. So this coupling of cause to manipulation is a real limit of an experimental approach to causation. Although we like the coupling of causation and manipulation for purposes of defining experiments, we do not see it as necessary to all useful forms of cause.

VALIDITY

Objections to Internal Validity

There are several criticisms of Campbell's (1957) validity typology and its extensions (Gadenne, 1976; Kruglanski & Kroy, 1976; Hultsch & Hickey, 1978; Cronbach, 1982; Cronbach et al., 1980). We start first with two criticisms of internal validity raised by Cronbach (1982) and to a lesser extent by Kruglanski and Kroy (1976): (1) an atheoretically defined internal validity (A causes B) is trivial without reference to constructs; and (2) causation in single instances is impossible, including in single experiments.

Internal Validity Is Trivial

Cronbach (1982) writes:

I consider it pointless to speak of causes when all that can be validly meant by reference to a cause in a particular instance is that, on one trial of a partially specified manipulation under conditions A, B, and C, along with other conditions not named, phenomenon P was observed. To introduce the word cause seems pointless. Campbell's writings make internal validity a property of trivial, past-tense, and local statements. (p. 137)

Hence, "causal language is superfluous" (p. 140). Cronbach does not retain a specific role for causal inference in his validity typology at all. Kruglanski and Kroy (1976) criticize internal validity similarly, saying:

The concrete events which constitute the treatment within a specific research are meaningful only as members of a general conceptual category. . . . Thus, it is simply impossible to draw strictly specific conclusions from an experiment: our concepts are general and each presupposes an implicit general theory about resemblance between different concrete cases. (p. 157)

All these authors suggest collapsing internal with construct validity in different ways.

Of course, we agree that researchers conceptualize and discuss treatments and outcomes in conceptual terms. As we said in Chapter 3, constructs are so basic to language and thought that it is impossible to conceptualize scientific work without them. Indeed, in many important respects, the constructs we use constrain what we experience, a point agreed to by theorists ranging from Quine (1951, 1969) to the postmodernists (Conner, 1989; Tester, 1993). So when we say that internal validity concerns an atheoretical local molar causal inference, we do not mean that the researcher should conceptualize experiments or report a causal claim as "something made a difference," to use Cronbach's (1982, p. 130) exaggerated characterization.

Still, it is both sensible and useful to differentiate internal from construct validity. The task of sorting out constructs is demanding enough to warrant separate attention from the task of sorting out causes. After all, operations are concept laden, and it is very rare for researchers to know fully what those concepts are. In fact, the researchers almost certainly cannot know them fully because paradigmatic concepts are so implicitly and universally imbued that those concepts and their assumptions are sometimes entirely unrecognized by research communities for years. Indeed, the history of science is replete with examples of famous series of experiments in which a causal relationship was demonstrated early, but it took years for the cause (or effect) to be consensually and stably named. For instance, in psychology and linguistics many causal relationships originally emanated from a behaviorist paradigm but were later relabeled in cognitive terms; in the early Hawthorne study, illumination effects were later relabeled as effects of obtrusive observers; and some cognitive dissonance effects have been reinterpreted as attribution effects. In the history of a discipline, relationships that are correctly identified as causal can be important even when the cause and effect constructs are incorrectly labeled. Such examples exist because the reasoning used to draw causal inferences (e.g., requiring evidence that treatment preceded outcome) differs from the reasoning used to generalize (e.g., matching operations to prototypical characteristics of constructs). Without understanding what is meant by descriptive causation, we have no means of telling whether a claim to have established such causation is justified.

Cronbach's (1982) prose makes clear that he understands the importance of causal logic; but in the end, his sporadically expressed craft knowledge does not add up to a coherent theory of judging the validity of descriptive causal inferences. His equation of internal validity with reliability misses the point that one can replicate incorrect causal conclusions. His solution to such questions is simply that "the force of each question can be reduced by suitable controls" (1982, p. 233). This is inadequate, for a complete analysis of the problem of descriptive causal inference requires concepts we can use to recognize suitable controls. If a suitable control is one that reduces the plausibility of, say, history or maturation, as Cronbach (1982, p. 233) suggests, this is little more than internal validity as we have formulated it. If one needs the concepts enough to use them, then they should be part of a validity typology for cause-probing methods.

For completeness, we might add that a similar boundary question arises between construct validity and external validity and between construct validity and statistical conclusion validity.
In the former case, no scientist ever frames an external validity question without couching the question in the language of constructs. In the latter case, researchers never conceptualize or discuss their results solely in terms of statistics. Constructs are ubiquitous in the process of doing research because they are essential for conceptualizing and reporting operations. But again, the answer to this objection is the same. The strategies for making inferences about a construct are not the same as strategies for making inferences about whether a causal relationship holds over variation in persons, settings, treatments, and outcomes in external validity or for drawing valid statistical conclusions in the case of statistical conclusion validity. Construct validity requires a theoretical argument and an assessment of the correspondence between samples and constructs. External validity requires analyzing whether causal relationships hold over variations in persons, settings, treatments, and outcomes. Statistical conclusion validity requires close examination of the statistical procedures and assumptions used. And again, one can be wrong about construct labels while being right about external or statistical conclusion validity.

Objectionsto Causationin SingleExperiments A second criticism of internal validity deniesthe possibility of inferring causation in a single experiment. Cronbach (1982) says that the important feature of cau- "progressivelocalization sation is the of a cause" (Mackie, 1974, p.73) over mul- J vALrDrry otu | tiple experimentsin a program of researchin which the uncertainties about the es- sential i."t.rr.r of the causeare reduced to the point at which one can character- ize exacflywhat the causeis and is not. Indeed, much philosophy of causation as- serts that we only recognize causesthrough observing multiple instances of a putative causalrelationship, although philosophers differ as to whether the mech- anism for recognition involves logical or empirical regularities (Beauchamp, 1974;P. White, 1990). However, some philosophers do defend the position that causescan be in- ferred in singleinstances (e.g., Davidson, 1,967;Ducasse' 1,95L1' Madden & Hum- ber, L97'1,).A good exampleis causationin the (e.g.,Hart & Honore, 1985)' by which we judge whether or not one person, say, caused the death of another despitethe fact that the defendant may 4ever before have been on trial for a crime. The verdict requires a plausible casethat (among other things) the defendantb ac- tions precededlhe death of the victim, that those actions were related to the death, that other potential causesof the death are implausible, and that the death would not have occurred had the defendant not taken those actions-the very logic of causalrelationships and counterfactualsthat we outlined in Chapter 1. In fact, the defendant'scriminal history will often be specifically excluded from consideration we may learn more in iudging guilt during the trial. The lessonis clear. Although "bo,rt ."nsation from multiple than from single experiments, we can rnf.ercause in single experiments.Indeed, experimenterswill do so whether we tell them to or not. 
Providing them with conceptual help in doing so is a virtue, not a vice; failing to do so is a major flaw in a theory of cause-probing methods.

Of course, individual experiments virtually always use prior concepts from other experiments. However, such prior conceptualizations are entirely consistent with the claim that internal validity is about causal claims in single experiments. If it were not (at least partly) about single experiments, there would be no point to doing the experiment, for the prior conceptualization would successfully predict what will be observed. The possibility that the data will not support the prior conceptualization makes internal validity essential. Further, prior conceptualizations are not logically necessary; we can experiment to discover effects that we have no prior conceptual structure to expect: "The physicist George Darwin used to say that once in a while one should do a completely crazy experiment, like blowing the trumpet to the tulips every morning for a month. Probably nothing will happen, but if something did happen, that would be a stupendous discovery" (Hacking, 1983, p. 154). But we would still need internal validity to guide us in judging if the trumpets had an effect.

Objections to Descriptive Causation A few authorsobject to the very notion of descriptivecausation. Typicall5 how- ever,such objections are made about a caricatureof descriptivecausation that has not teen usedin philosophyor in sciencefor many years-for example,a billiard ball modelthat requiresa commitmentto deterministiccausation or that excludes 466 ra.n cRrrcALAssEssMENT oFouR AssuMproNs |

reciprocal causation. In contrast, most who write about experimentation today espouse theories of probabilistic causation in which the many difficulties associated with identifying dependable causal relationships are humbly acknowledged. Even more important, these critics inevitably use causal-sounding language themselves, for example, replacing "cause" with "mutual simultaneous shaping" (Lincoln & Guba, 1985, p. 151). These replacements seem to us to avoid the word but keep the concept, and for good reason. As we said at the end of Chapter 1, if we wiped the slate clean and constructed our knowledge of the world anew, we believe we would end up reinventing the notion of descriptive causation all over again, so greatly does knowledge of causes help us to survive in the world.

Objections Concerning the Discrimination Between Construct Validity and External Validity

Although we traced the history of the present validity system briefly in Chapter 2, readers may want additional historical perspective on why we made the changes we made in the present book regarding construct and external validity. Both Campbell (1957) and Campbell and Stanley (1963) only used the phrase external validity, which they defined as inferring to what populations, settings, treatment variables, and measurement variables an effect can be generalized. They did not refer at all to construct validity. However, from his subsequent writings (Campbell, 1986), it is clear Campbell thought of construct validity as being part of external validity. In Campbell and Stanley, therefore, external validity subsumed generalizing from research operations about persons, settings, causes, and effects for the purposes of labeling these particulars in more abstract terms, and also generalizing by identifying sources of variation in causal relationships that are attributable to person, setting, cause, and effect factors. All subsequent conceptualizations also share the same generic strategy based on sampling instances of persons, settings, causes, and effects and then evaluating them for their presumed correspondence to targets of inference.

In Campbell and Stanley's formulation, person, setting, cause, and effect categories share two basic similarities despite their surface differences-to wit, all of them have both ostensive qualities and construct representations. Populations of persons or settings are composed of units that are obviously individually ostensive. This capacity to point to individual persons and settings, especially when they are known to belong in a referent category, permits them to be readily enumerated and selected for study in the formal ways that sampling statisticians prefer.
By contrast, although individual measures (e.g., the Beck Depression Inventory) and treatments (e.g., a syringe full of a vaccine) are also ostensive, efforts to enumerate all existing ways of measuring or manipulating such measures and treatments are much more rare (e.g., Bloom, 1956; Ciarlo et al., 1986; Steiner & Gingrich, 2000). The reason is that researchers prefer to use substantive theory to determine which attributes a treatment or outcome measure should contain in any


given study, recognizing that scholars often disagree about the relevant attributes of the higher order entity and of the supposed best operations to represent them. None of this negates the reality that populations of persons or settings are also defined in part by the theoretical constructs used to refer to them, just like treatments and outcomes; they also have multiple attributes that can be legitimately contested. What, for instance, is the American population? While a legal definition surely exists, it is not inviolate. The German conception of nationality allows that the great-grandchildren of a German are Germans even if their parents and grandparents have not claimed German nationality. This is not possible for Americans. And why privilege a legal definition? A cultural conception might admit as American all those illegal immigrants who have been in the United States for decades, and it might exclude those American adults with passports who have never lived in the United States. Given that persons, settings, treatments, and outcomes all have both construct and ostensive qualities, it is no surprise that Campbell and Stanley did not distinguish between construct and external validity.

Cook and Campbell, however, did distinguish between the two. Their unstated rationale for the distinction was mostly pragmatic-to facilitate memory for the very long list of threats that, with the additions they made, would have had to fit under Campbell and Stanley's umbrella conception of external validity. In their theoretical discussion, Cook and Campbell associated construct validity with generalizing to causes and effects, and external validity with generalizing to and across persons, settings, and times. Their choice of terms explicitly referenced Cronbach and Meehl (1955), who used constructs and construct validity in measurement theory to justify inferences "about higher-order constructs from research operations" (Cook & Campbell, 1979, p. 38).
Likewise, Cook and Campbell associated the terms population and external validity with sampling theory and the formal and purposive ways in which researchers select instances of persons and settings. But to complicate matters, Cook and Campbell also briefly acknowledged that "all aspects of the research require naming samples in generalizable terms, including samples of peoples and settings as well as samples of measures or manipulations" (p. 59). And in listing their external validity threats as statistical interactions between a treatment and population, they linked external validity more to generalizing across populations than to generalizing to them. Also, their construct validity threats were listed in ways that emphasized generalizing to cause and effect constructs. Generalizing across different causes and effects was listed as external validity because this task does not involve attributing meaning to a particular measure or manipulation. To read the threats in Cook and Campbell, external validity is about generalizing across populations of persons and settings and across different cause and effect constructs, while construct validity is about generalizing to causes and effects. Where, then, is generalizing from samples of persons or settings to their referent populations? The text discusses this as a matter of external validity, but this classification is not apparent in the list of validity threats. A system is needed that can improve on Cook and Campbell's partial confounding between objects of generalization (causes

and effects versus persons and settings) and functions of generalization (generalizing to higher-order constructs from research operations versus inferring the degree of replication across different constructs and populations).

This book uses such a functional approach to differentiate construct validity from external validity. It equates construct validity with labeling research operations, and external validity with sources of variation in causal relationships. This new formulation subsumes all of the old. Thus, Cook and Campbell's understanding of construct validity as generalizing from manipulations and measures to cause and effect constructs is retained. So is external validity understood as generalizing across samples of persons, settings, and times. And generalizing across different cause or effect constructs is now even more clearly classified as part of external validity. Also highlighted is the need to label samples of persons and settings in abstract terms, just as measures and manipulations need to be labeled. Such labeling would seem to be a matter of construct validity given that construct validity is functionally defined in terms of labeling. However, labeling human samples might have been read as being a matter of external validity in Cook and Campbell, given that their referents were human populations and their validity types were organized more around referents than functions. So, although the new formulation in this book is definitely more systematic than its predecessors, we are unsure whether that systematization will ultimately result in greater terminological clarity or confusion. To keep the latter to a minimum, the following discussion reflects issues pertinent to the demarcation of construct and external validity that have emerged either in deliberations between the first two authors or in classes that we have taught using pre-publication versions of this book.

Is Construct Validity a Prerequisite for External Validity?

In this book, we equate external validity with variation in causal relationships and construct validity with labeling research operations. Some readers might see this as suggesting that successful generalization of a causal relationship requires the accurate labeling of each population of persons and each type of setting to which generalization is sought, even though we can never be certain that anything is labeled with perfect accuracy. The relevant task is to achieve the most accurate assessment available under the circumstances. Technically, we can test generalization across entities that are already known to be confounded and thus not labeled well-e.g., when causal data are broken out by gender but the females in the sample are, on average, more intelligent than the males and therefore score higher on everything else correlated with intelligence. This example illustrates how dangerous it is to rely on measured surface similarity alone (i.e., gender differences) for determining how a sample should be labeled in population terms. We might more accurately label gender differences if we had a random sample of each gender taken from the same population. But this is not often found in experimental work, and even this is not perfect because gender is known to be confounded with other attributes (e.g., income, work status) even in the population, and those other attributes may be pertinent labels for some of the inferences being made. Hence, we usually have to rely on the assumption that, because gender samples come from the same physical setting, they are comparable on all background characteristics that might be correlated with the outcome. Because this assumption cannot be fully tested and is anyway often false-as in the hypothetical example above-this means that we could and should measure all the potential confounds within the limits of our theoretical knowledge to suggest them, and that we should also use these measures in the analysis to reduce confounding.

Even with acknowledged confounding, sample-specific differences in effect sizes may still allow us to conclude that a causal relationship varies by something associated with gender. This is a useful conclusion for preventing premature overgeneralization. With more breakdowns, confounded or not, one can even get a sense of the percentage of contrasts across which a causal relationship does and does not hold. But without further work, the populations across which the relationship varies are incompletely identified. The value of identifying them better is particularly salient when some effect sizes cannot be distinguished from zero. Although this clearly identifies a nonuniversal causal relationship, it does not advance theory or practice by specifying the labeled boundary conditions over which a causal relationship fails to hold. Knowledge gains are also modest from generalization strategies that do not explicitly contrast effect sizes. Thus, when different populations are lumped together in a single hypothesis test, researchers can learn how large a causal relationship is despite the many unexamined sources of variation built into the analysis. But they cannot accurately identify which constructs do and do not co-determine the relationship's size.
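The adjustment strategy described above (measure the potential confounds, then use those measures in the analysis) can be illustrated with a minimal simulation. Everything here is hypothetical: a binary group variable stands in for gender, an invented "IQ" covariate is deliberately confounded with group membership, and the adjustment is a plain least-squares regression computed with NumPy; the book does not prescribe this particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical data: a binary group (standing in for gender) confounded
# with a covariate ("IQ"): group 1 averages 5 points higher.
group = rng.integers(0, 2, n)
iq = rng.normal(100 + 5 * group, 10, n)
treat = rng.integers(0, 2, n)  # randomized treatment

# True data-generating process (assumed for illustration): the treatment
# effect is 3.0 in BOTH groups; outcome depends on IQ, not on group per se.
y = 0.5 * iq + 3.0 * treat + rng.normal(0, 5, n)

# Raw group means differ even though group itself has no causal role,
# because group is confounded with the measured covariate.
for g in (0, 1):
    print(f"group {g}: mean outcome {y[group == g].mean():.1f}")

# Adjusting for the measured confound: regress y on treatment, group, IQ.
X = np.column_stack([np.ones(n), treat, group, iq])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"adjusted treatment effect: {beta[1]:.2f}")  # near the true 3.0
print(f"adjusted 'group' effect:   {beta[2]:.2f}")  # near 0 once IQ is held fixed
```

The raw group gap in outcomes here reflects only the IQ confound; once the measured confound enters the regression, the spurious "group" difference shrinks toward zero while the treatment effect is recovered, which is the sense in which measuring and modeling confounds reduces mislabeling.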
Construct validity adds useful specificity to external validity concerns, but it is not a necessary condition for external validity. We can generalize across entities known to be confounded, albeit less usefully than across accurately labeled entities.

This last point is similar to the one raised earlier to counter the assertion of Gadenne (1976) and Kruglanski and Kroy (1976) that internal validity requires the high construct validity of both cause and effect. They assert that all science is about constructs, and so it has no value to conclude that "something caused something else," the result that would follow if we did a technically exemplary randomized experiment with correspondingly high internal validity but the cause and effect were not labeled. Nonetheless, a causal relationship is demonstrably entailed, and the finding that "something reliably caused something else" might lead to further research to refine whatever clues are available about the cause and effect constructs. A similar argument holds for the relationship of construct to external validity. Labels with high construct validity are not necessary for internal or for external validity, but they are useful for both.

Researchers necessarily use the language of constructs (including human and setting population ones) to frame their research questions and select their representations of constructs in the samples and measures chosen. If they have designed their work well and have had some luck, the constructs they begin and end with will be the same, though critics can challenge any claims they make. However, the

samples and constructs might not match well, and then the task is to examine the samples and ascertain what they might alternatively stand for. As critics like Gadenne, Kruglanski, and Kroy have pointed out, such reliance on the operational level seems to legitimize operations as having a life independent of constructs. This is not the case, though, for operations are intimately dependent on interpretations at all stages of research. Still, every operation fits some interpretations, however tentative that referent may be due to poor research planning or to nature turning out to be more complex than the researcher's initial theory.

How Does Variation Across Different Operational Representations of the Same Intended Cause or Effect Relate to Construct and External Validity?

In Chapter 3 we emphasized how the valid labeling of a cause or effect benefits from multiple operational instances, and also that these various instances can be fruitfully analyzed to examine how a causal relationship varies with the definition used. If each operational instance is indeed of the same underlying construct, then the same causal relationship should result regardless of how the cause or effect is operationally defined. Yet data analysis sometimes reveals that a causal relationship varies by operational instance. This means that the operations are not in fact equivalent, so that they presumably tap both into different constructs and into different causal relationships. Either the same causal construct is differently related to what now must be seen as two distinct outcomes, or the same effect construct is differently related to two or more unique causal agents. So the intention to promote the construct validity of causes and effects by using multiple operations has now facilitated conclusions about the external validity of causes or effects; that is, when the external validity of the cause and effect are in play, the data analysis has revealed that more than one causal relationship needs to be invoked.
Fortunately, when we find that a causal relationship varies over different causes or different effects, the research and its context often provide clues as to how the causal elements in each relationship might be (re)labeled. For example, the researcher will generally examine closely how the operations differ in their particulars, and will also study which unique meanings have been attached to variants like these in the existing literature. While the meanings that are achieved might be less successful because they have been devised post hoc to fit novel findings, they may in some circumstances still attain an acceptable level of accuracy and will certainly prompt continued discussion to account for the findings. Thus, we come full circle. We began with multiple operational representations of the same cause or effect when testing a single causal relationship; then the data forced us to invoke more than one relationship; and finally the pattern of the outcomes and their relationship to the existing literature can help improve the labeling of the new relationships achieved. A construct validity exercise begets an external validity conclusion that prompts the need for relabeling constructs. Demonstrating variation across operations presumed to represent the same cause or effect can enhance external validity by

showing that more constructs and causal relationships are involved than was originally envisaged; and in that case, it can eventually increase construct validity by preventing any mislabeling of the cause or effect inherent in the original choice of measures and by providing clues from details of the causal relationships about how the elements in each relationship should be labeled. We see here analytic tasks that flow smoothly between construct and external validity concerns, involving each.
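The diagnostic pattern this subsection describes, in which the same treatment appears to have different effects under two operationalizations of what was intended to be one outcome construct, can be sketched in a few lines of simulation. The two measures and their effect sizes are invented for illustration; nothing here reproduces an analysis from the book.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
treat = rng.integers(0, 2, n)  # randomized treatment indicator

# Two hypothetical operationalizations of "the" outcome construct:
# measure A responds strongly to the treatment, while measure B, which
# (by assumption) partly taps a different construct, responds only weakly.
measure_a = 2.0 * treat + rng.normal(0, 1, n)
measure_b = 0.2 * treat + rng.normal(0, 1, n)

effects = {}
for name, y in (("A", measure_a), ("B", measure_b)):
    effects[name] = y[treat == 1].mean() - y[treat == 0].mean()
    print(f"measure {name}: estimated effect {effects[name]:.2f}")

# If A and B measured the same construct, the two estimates should agree
# within sampling error; a reliable divergence like this one signals that
# more than one causal relationship is in play.
```

With roughly 1,000 cases per arm, sampling error in each estimate is small, so a gap this large between the two effect estimates is the kind of evidence that forces the relabeling exercise the text describes.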

Should Generalizing from a Single Sample of Persons or Settings Be Classified as External or Construct Validity?

If a study has a single sample of persons or settings, this sample must represent a population. How that sample should be labeled is an issue. Given that construct validity is about labeling, is labeling the sample an issue of construct validity? After all, external validity hardly seems relevant, since with a single sample it is not immediately obvious what comparison of variation in causal relationships would be involved. So if generalizing from a sample of persons or settings is treated as a matter of construct validity analogous to generalizing from treatment and outcome operations, two problems arise. First, this highlights a potential conflict in usage in the general social science community, some parts of which say that generalizations from a sample of people to its population are a matter of external validity, even when those same parts say that labeling people is a matter of construct validity. Second, this does not fit with the discussion in Cook and Campbell that treats generalizing from individual samples of persons and settings as an external validity matter, though their list of external validity threats does not explicitly deal with this and only mentions interactions between the treatment and attributes of the setting and person.

The issue is most acute when the sample was randomly selected from the population. Consider why sampling statisticians are so keen to promote random sampling for representing a well-designated universe. Such sampling ensures that the sample and population distributions are identical on all measured and unmeasured variables within the limits of sampling error. Notice that this includes the population label (whether more or less accurate), which also applies to the sample. Key to the usefulness of random sampling is having a well-bounded population from which to sample, a requirement in sampling theory and something often obvious in practice. Given that many well-bounded populations are also well labeled, random sampling then guarantees that a valid population label can equally validly be applied to the sample. For instance, the population of telephone prefixes used in the city of Chicago is known and is obviously correctly labeled. Hence, it would be difficult to use random digit dialing from that list of Chicago prefixes and then mislabel the resulting sample as representing telephone owners in Detroit or only in the Edgewater section of Chicago. Given a clearly bounded population and random sampling, the sample label is the population label, which is why sampling statisticians believe that no method is superior to random selection for labeling samples when the population label is known.

With purposive sample selection, this elegant rationale cannot be used, whether or not the population label is known. Thus, if respondents were selected haphazardly from shopping malls all over Chicago, many of the people studied would belong in the likely population of interest-residents of Chicago. But many would not, because some Chicago residents do not go to malls at the hours interviewing takes place, and because many persons in these malls are not from Chicago. Lacking random sampling, we could not even confidently call this sample "people walking in Chicago malls," for other constructs such as volunteering to be interviewed may be systematically confounded with sample membership. So, mere membership in the sample is not sufficient for accurately representing a population, and by the rationale in the previous paragraph, it is also not sufficient for accurately labeling the sample. All this leads to two conclusions worth elaborating: (1) that random sampling can sometimes promote construct validity, and (2) that external validity is in play when inferring that a single causal relationship from a sample would hold in a population, whether from a random sample or not.

On the first point, the conditions under which random sampling can sometimes promote the construct validity of single samples are straightforward. Given a well-bounded universe, sampling statisticians have justified random sampling as a way of clearly representing in the sample all population attributes. This must include the population label, and so random sampling results in labeling the sample in the same terms that apply to the population. Random sampling does not, of course, tell us whether the population label is itself reasonably accurate; random sampling will also replicate in the sample any mistakes that are made in labeling the population.
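The contrast between random selection from a bounded population and haphazard mall-intercept selection can be illustrated with a toy simulation. The population, the age attribute, and the selection mechanism that makes mall-goers skew young are all invented assumptions for the sketch, not empirical claims about Chicago.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical well-bounded population of 100,000 "residents" with one
# measured attribute (age); the true population distribution is known here.
pop_age = rng.normal(40, 15, 100_000)

# Random sample: every unit has an equal chance of selection.
random_sample = rng.choice(pop_age, 500, replace=False)

# Haphazard "mall" sample: selection probability falls off with distance
# from age 25 (a made-up selection mechanism standing in for who happens
# to be at the mall during interviewing hours).
w = np.exp(-((pop_age - 25) ** 2) / (2 * 20**2))
mall_sample = rng.choice(pop_age, 500, replace=False, p=w / w.sum())

print(f"population mean age:   {pop_age.mean():.1f}")
print(f"random sample mean:    {random_sample.mean():.1f}")  # close to population
print(f"mall sample mean:      {mall_sample.mean():.1f}")    # biased downward
```

Within sampling error, the random sample mirrors the population distribution, so the population label transfers to it; the mall sample is systematically off, so calling it a sample of "residents" would mislabel it in just the way the paragraph above describes.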
However, given that many populations are already reasonably well labeled based on past research and theory, and that such situations are often intuitively obvious for researchers experienced in an area, random sampling can, under these circumstances, be counted on to promote construct validity. However, when random selection has not occurred or when the population label is itself in doubt, this book has explicated other principles and methods that can be used for labeling study operations, including labeling the samples of persons and settings in a study.

On the second point, when the question concerns the validity of generalizing from a causal relationship in a single sample to its population, the reader may also wonder how external validity can be in play at all. After all, we have framed external validity as being about whether the causal relationship holds over variation in persons, settings, treatment variables, and measurement variables. If there is only one random sample from a population, where is the variation over which to examine that causal relationship? The answer is simple: the variation is between sampled and unsampled persons in that population. As we said in Chapter 2 (and as was true in our predecessor books), external validity questions can be about whether a causal relationship holds (a) over variations in persons, settings, treatments, and outcomes that were in the experiment, and (b) for persons, settings, treatments, and outcomes that were not in the experiment. Those persons in a population who were not randomly sampled fall into the latter category. Nothing about external validity, either in the present book or in its predecessors, requires that all possible variations of external validity interest actually be observed in the study-indeed, it would be impossible to do so, and we provided several arguments in Chapter 2 about why it would not be wise to limit external validity questions only to variations actually observed in a study. Of course, in most cases external validity generalizations to things that were not studied are difficult, having to rely on the concepts and methods we outlined in our grounded theory of generalized causal inference in Chapters 11 through 13. But it is the great beauty of random sampling that it guarantees that this generalization will hold over both sampled and unsampled persons. So it is indeed an external validity question whether a causal relationship that has been observed in a single random sample would hold for those units that were in the population but not in the random sample.

In the end, this book treats the labeling of a single sample of persons or settings as a matter of construct validity, whether or not random sampling is used. It also treats the generalization of causal relationships from a single sample to unobserved instances as a matter of external validity-again, whether or not random sampling was used. The fact that random sampling (which is associated with external validity in this book) sometimes happens to facilitate the construct labeling of a sample is incidental to the fact that the population label is already known. Though many population labels are indeed well known, many more are still matters of debate, as reflected in the examples we gave in Chapter 3 of whether persons should be labeled schizophrenic or settings labeled as hostile work environments. In these latter cases, random sampling makes no contribution to resolving debates about the applicability of those labels.
Instead, the principles and methods we outlined in Chapters 11 through 13 will have to be brought to bear. And when random sampling has not been used, those principles and methods will also have to be brought to bear on the external validity problem of generalizing causal relationships from single samples to unobserved instances.

Objections About the Completeness of the Typology

The first objection of this kind is that our lists of particular threats to validity are incomplete. Bracht and Glass (1968), for example, added new external validity threats that they thought were overlooked by Campbell and Stanley (1963), and more recently Aiken and West (1991) pointed to new reactivity threats. These challenges are important because the key to the most confident causal conclusions in our theory of validity is the ability to construct a persuasive argument that every plausible threat to validity has been identified and ruled out. However, there is no guarantee that all relevant threats to validity have been identified. Our lists are not divinely ordained, as can be observed from the changes in the threats from Campbell (1957) to Campbell and Stanley (1963) to Cook and

Campbell (1979) to this book. Threats are better identified from insider knowledge than from abstract and nonlocal lists of threats.

A second objection is that we may have left out particular validity types or organized them suboptimally. Perhaps the best illustration that this is true is Sackett's (1979) treatment of bias in case-control studies. Case-control studies do not commonly fall under the rubric of experimental or quasi-experimental designs; but they are cause-probing designs, and in that sense a general interest in generalized causal inference is at least partly shared. Yet Sackett created a different typology. He organized his list around seven stages of research at which bias can occur: (1) in reading about the field, (2) in sample specification and selection, (3) in defining the experimental exposure, (4) in measuring exposure and outcome, (5) in data analysis, (6) in interpretation of analyses, and (7) in publishing results. Each of these could generate a validity type, some of which would overlap considerably with our validity types. For example, his concept of biases "in executing the experimental manoeuvre" (p. 62) is quite similar to our internal validity, whereas his withdrawal bias mirrors our attrition. However, his list also suggests new validity types, such as biases in reading the literature, and the biases he lists at each stage are partly orthogonal to our lists. For example, biases in reading include biases of rhetoric, in which "any of several techniques are used to convince the reader without appealing to reason" (p. 60).
In the end, then, our claim is only that the present typology is reasonably well informed by knowledge of the nature of generalized causal inference and of some of the problems that are frequently salient about those inferences in field experimentation. It can and hopefully will continue to be improved, both by addition of threats to existing validity types and by thoughtful exploration of new validity types that might pertain to the problem of generalized causal inference that is our main concern.1

1. We are acutely aware of, and modestly dismayed at, the many different usages of these validity labels that have developed over the years and of the risk that poses for terminological confusion, even though we are responsible for many of these variations ourselves. After all, the understandings of validity in this book differ from those in Campbell and Stanley (1963), whose only distinction was between internal and external validity. They also differ from Cook and Campbell (1979), in which external validity was concerned with generalizing to and across populations of persons and settings, whereas all issues of generalizing from the cause and effect operations constituted the domain of construct validity. Further, Campbell (1986) himself relabeled internal validity and external validity as local molar causal validity and the principle of proximal similarity, respectively. Stepping outside Campbell's tradition, Cronbach (1982) used these labels with yet other meanings. He said internal validity is the problem of generalizing from samples to the domain about which the question is asked, which sounds much like our construct validity except that he specifically denied any distinction between construct validity and external validity, using the latter term to refer to generalizing results to unstudied populations, an issue of extrapolation beyond the data at hand. Our understanding of external validity includes such extrapolations as one case, but it is not limited to that because it also has to do with empirically identifying sources of variation in an effect size when existing data allow doing so. Finally, many other authors have casually used all these labels in completely different ways (Goetz & LeCompte, 1984; Kleinbaum, Kupper, & Morgenstern, 1982; Menard, 1991). So in view of all these variations, we urge that these labels be used only with descriptions that make their intended understandings clear.


Objections Concerning the Nature of Validity

We defined validity as the approximate truth of an inference. Others define it differently. Here are some alternatives and our reasons for not using them.

Validity in the New Test Theory Tradition

Test theorists discussed validity (e.g., Cronbach, 1946; Guilford, 1946) well before Campbell (1957) invented his typology. We can only begin to touch on the many issues pertinent to validity that abound in that tradition. Here we outline a few key points that help differentiate our approach from that of test theory. The early emphasis in test theory was mostly on inferences about what a test measures, with a pinnacle being reached in the notion of construct validity. Cronbach (1989) credits Cook and Campbell for giving "proper breadth to the notion of constructs" (p. 152) in construct validity through their claim that construct validity is not just limited to inferences about outcomes but also about causes and about other features of experiments. In addition, early test theory tied validity to the truth of such inferences: "The literature on validation has concentrated on the truthfulness of test interpretation" (Cronbach, 1988, p. 5).

However, the years have brought change to this early understanding. In one particularly influential definition of validity in test theory, Messick (1989) said, "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 13); and later he says that "Validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual - as well as potential - consequences of score interpretation and use" (1995, p. 741). Whereas our understanding of validity is that inferences are the subject of validation, this definition suggests that actions are also subject to validation and that validation is actually evaluation. These extensions are far from our view.

A little history will help here. Tests are designed for practical use. Commercial test developers hope to profit from sales to those who use tests; employers hope to use
t.rt, to seiectbetter personnel; and test takershope that testswill tell them somethinguseful about themsqlves.These practical applications gen- eratedconcern i., tf,e AmericanPsychological Association (APA) to identify the characteristicsof better and worsetests. APA appointeda committeechaired by Cronbachto addressthe problem.The committeeproduced the first in a contin- uing seriesof teststandaris (APA, 1,954);and this wolk alsoled to Cronbachand Melhl', (1955)classic article on constructvalidity. The test standardshave been freq.rerrtiyrevised, most recentlycosponsored by other professionalassociations (AmericanEducaiional Research Association, American PsychologicalAssocia- Re- tion, and National Council on Measurementin Education,1985, 1999)' qoirl-.nts to adhereto rhe standardsbecame part of professionalethical codes. Th" ,tandardswere also influential in legaland regulatoryproceedings and have 14.A CRITICALASSESSMENT OF OURASSUMPTIONS beencited, for example,in U.S.Supreme Court casesabout alleged misuses of test- ing practices (e.g., Albermarle Paper Co. v. MoodS 1975; Washington v. Davis, L976) and have influencedthe "Uniform Guidelines"for personnelselection by the Equal EmploymentOpportunity Commission(EEOC) et al. (1978).Various validity standardswere particularly salientin theseuses. Becauseof this legal,professional, and regulatoryconcern with the useof test- ing, the researchcommunity concerned with measurementvalidity began to use the ,;;::i' word ualiditymoreexpansivelyforexample, "asonewayto justifytheuseof atest" (Cronbach,1989, p. M9).It is only a shortdistance from validatinguse to validat- ing action, becausemost of the relevantuses were actionssuch as hiring or firing someoneor labelingsomeone retarded. 
Actions, in turn, have consequences: some positive, such as efficiency in hiring and accurate diagnosis that allows better tailoring of treatment, and some negative, such as loss of income and stigmatization. So Messick (1989, 1995) proposed that validation also evaluate those consequences, especially the social justice of consequences. Thus evaluating the consequences of test use became a key feature of validity in test theory. The net result was a blurring of the line between validity-as-truth and validity-as-evaluation, to the point where Cronbach (1988) said "Validation of a test or test use is evaluation" (p. 4).

We strongly endorse the legitimacy of questions about the use of both tests and experiments. Although scientists have frequently avoided value questions in the mistaken belief that they cannot be studied scientifically or that science is value free, we cannot avoid values even if we try. The conduct of experiments involves values at every step, from question selection through the interpretation and reporting of results. Concerns about the uses to which experiments and their results are put and the value of the consequences of those uses are all important (e.g., Shadish et al., 1991), as we illustrated in Chapter 9 in discussing ethical concerns with experiments.

However, if validity is to retain its primary association with the truth of knowledge claims, then it is fundamentally impossible to validate an action because actions are not knowledge claims. Actions are more properly evaluated, not validated. Suppose an employer administers a test, intending to use it in hiring decisions. Suppose the action is that a person is hired. The action is not itself a knowledge claim and therefore cannot be either true or false. Suppose that person then physically assaults a subordinate. That consequence is also not a knowledge claim and so also cannot be true or false. The action and the consequences merely exist; they are ontological entities, not epistemological ones.
Perhaps Messick (1989) really meant to ask whether inferences about actions and consequences are true or false. If so, the inclusion of action in his (1989) definition of validity is entirely superfluous, for validity-as-truth is already about evidence in support of inferences, including those about action or consequences.²

2. Perhaps partly in recognition of this, the most recent version of the test standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) helps resolve some of the problems outlined herein by removing reference to validating action from the definition of validity: "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9).

Alternatively, perhaps Messick (1989, 1995) meant his definition to instruct test validators to evaluate the action or its consequences, as intimated in: "Validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual-as well as potential-consequences of score interpretation and use" (1995, p. 742). Validity-as-truth certainly plays a role in evaluating tests and experiments. But we must be clear about what that role is and is not. Philosophers (e.g., Scriven, 1980; Rescher, 1969) tell us that a judgment about the value of something requires that we (1) select criteria of merit on which the thing being evaluated would have to perform well, (2) set standards of performance for how well the thing must do on each criterion to be judged positively, (3) gather pertinent data about the thing's performance on the criteria, and then (4) integrate the results into one or more evaluative conclusions. Validity-as-truth is one (but only one) criterion of merit in evaluation; that is, it is good if inferences about a test are true, just as it is good for the causal inference made from an experiment to be true. However, validation is not isomorphic with evaluation. First, criteria of merit for tests (or experiments) are not limited to validity-as-truth. For example, a good test meets other criteria, such as having a test manual that reports norms, being affordable for the contexts of application, and protecting confidentiality as appropriate. Second, the theory of validity Messick proposed gives no help in accomplishing some of the other steps in the four-step evaluation process outlined previously. To evaluate a test, we need to know something about how much validity the inference should have to be judged good; and we need to know how to integrate results from all the other criteria of merit along with validity into an overall evaluation.
It is not a flaw in validity theory that these other steps are not addressed, for they are the domain of evaluation theory. The latter tells us something about how to execute these steps (e.g., Scriven, 1980, 1991) and also about other matters to be taken into account in the evaluation. Validation is not evaluation; truth is not value. Of course, the definition of terms is partly arbitrary. So one might respond that one should be able to conflate validity-as-truth and validity-as-evaluation if one so chooses. However:

The very fact that terms must be supplied with arbitrary meanings requires that words be used with a great sense of responsibility. This responsibility is twofold: first, to established usage; second, to the limitations that the definitions selected impose on the user. (Goldschmidt, 1982, p. 642)

We need the distinction between truth and value because true inferences can be about bad things (the fact that smoking causes cancer does not make smoking or cancer good); and false inferences can lead to good things (the astrologer's advice to Pisces to "avoid alienating your coworkers today" may have nothing to do with heavenly bodies, but may still be good advice). Conflating truth and value can be actively harmful. Messick (1995) makes clear that the social consequences of testing are to be judged in terms of "bias, fairness, and distributive justice" (p. 745). We agree with this statement, but this is test evaluation, not test validation. Messick notes that his intention is not to open the door to the social policing of truth (i.e., a test is valid if its social consequences are good), but ambiguity on this issue has nonetheless opened this very door. For example, Kirkhart (1995) cites Messick as justification for judging the validity of evaluations by their social consequences: "Consequential validity refers here to the soundness of change exerted on systems by evaluation and the extent to which those changes are just" (p. 4). This notion is risky because the most powerful arbiter of the soundness and justice of social consequences is the sociopolitical system in which we live. Depending on the forces in power in that system at any given time, we may find that what counts as valid is effectively determined by the political preferences of those with power.

Validity in the Qualitative Traditions

One of the most important developments in recent social research is the expanded use of qualitative methods such as ethnography, ethnology, participant observation, unstructured interviewing, and case study methodology (e.g., Denzin & Lincoln, 2000). These methods have unrivaled strengths for the elucidation of meanings, the in-depth description of cases, the discovery of new hypotheses, and the description of how treatment interventions are implemented or of possible causal explanations. Even for those purposes for which other methods are usually preferable, such as for making the kinds of descriptive causal inferences that are the topic of this book, qualitative methods can often contribute helpful knowledge and on rare occasions can be sufficient (Campbell, 1975; Scriven, 1976). Whenever resources allow, field experiments will benefit from including qualitative methods both for the primary benefits they are capable of generating and also for the assistance they provide to the descriptive causal task itself. For example, they can uncover important site-specific threats to validity and also contribute to explaining experimental results in general and perplexing outcome patterns in particular.

However, the flowering of qualitative methods has often been accompanied by theoretical and philosophical controversy, often referred to as the qualitative-quantitative debates. These debates concern not just methods but roles and rewards within science, ethics and morality, and epistemologies and ontologies. As part of the latter, the concept of validity has received considerable attention (e.g., Eisenhart & Howe, 1992; Goetz & LeCompte, 1984; Kirk & Miller, 1986; Kvale, 1989; Maxwell, 1992; Maxwell & Lincoln, 1990; Mishler, 1990; Phillips, 1987; Wolcott, 1990). Notions of validity that are different from ours have occasionally resulted from qualitative work, and sometimes validity is rejected entirely.
However, before we review those differences, we prefer to emphasize the commonalities that we think dominate on all sides of the debates.

Commonalities. As we read it, the predominant view among qualitative theorists is that validity is a concept that is and should be applicable to their work. We start with examples of discussions of validity by qualitative theorists that illustrate these similarities because they are surprisingly more common than some portrayals in the
qualitative-quantitative debates suggest and because they demonstrate an underlying unity of interest in producing valid knowledge that we believe is widely shared by most social scientists. For example, Maxwell (1990) says, "qualitative researchers are just as concerned as quantitative ones about 'getting it wrong,' and validity broadly defined simply refers to the possible ways one's account might be wrong, and how these 'validity threats' can be addressed" (p. 505). Even those qualitative theorists who say they reject the word validity will admit that they "go to considerable pains not to get it all wrong" (Wolcott, 1990, p. 127). Kvale (1989) ties validity directly to truth, saying "concepts of validity are rooted in more comprehensive epistemological assumptions of the nature of true knowledge" (p. 11); and later that validity "refers to the truth and correctness of a statement" (p. 73). Kirk and Miller (1986) say "the technical use of the term 'valid' is as a properly hedged weak synonym for 'true'" (p. 19). Maxwell (1992) says "Validity, in a broad sense, pertains to this relationship between an account and something outside that account" (p. 283). All these seem quite compatible with our understanding of validity.

Maxwell's (1992) account points to other similarities. He claims that validity is always relative to "the kinds of understandings that accounts can embody" (p. 284) and that different communities of inquirers are interested in different kinds of understandings. He notes that qualitative researchers are interested in five kinds of understandings about: (1) the descriptions of what was seen and heard, (2) the meaning of what was seen and heard, (3) theoretical constructions that characterize what was seen and heard at higher levels of abstraction, (4) generalization of accounts to other persons, times, or settings than originally studied, and (5) evaluations of the objects of study (Maxwell, 1992; he says that the last two understandings are of interest relatively rarely in qualitative work).
He then proposes a five-part validity typology for qualitative researchers, one for each of the five understandings. We agree that validity is relative to understanding, though we usually refer to inference rather than understanding. And we agree that different communities of inquirers tend to be interested in different kinds of understandings, though common interests are illustrated by the apparently shared concerns that both experimenters and qualitative researchers have in how best to characterize what was seen and heard in a study (Maxwell's theoretical validity and our construct validity). Our extended discussion of internal validity reflects the interest of the community of experimenters in understanding descriptive causes, proportionately more so than is relevant to qualitative researchers, even when their reports are necessarily replete with the language of causation. This observation is not a criticism of qualitative researchers, nor is it a criticism of experimenters as being less interested than qualitative researchers in thick description of an individual case.

On the other hand, we should not let differences in prototypical tendencies across research communities blind us to the fact that when a particular understanding is of interest, the pertinent validity concerns are the same no matter what the methodology used to develop the knowledge claim. It would be wrong for a qualitative researcher to claim that internal validity is irrelevant to qualitative methods. Validity is not a property of methods but of inferences and knowledge claims. On those infrequent occasions in which a qualitative researcher has a strong interest in a local molar causal inference, the concerns we have outlined under internal validity pertain. This argument cuts both ways, of course. An experimenter who wonders what the experiment means to participants could learn a lot from the concerns that Maxwell outlines under interpretive validity.
Maxwell (1992) also points out that his validity typology suggests threats to validity about which qualitative researchers seek "evidence that would allow them to be ruled out . . . using a logic similar to that of quasi-experimental researchers such as Cook and Campbell" (p. 296). He does not outline such threats himself, but his description allows one to guess what some might look like. To judge from Maxwell's prose, threats to descriptive validity include errors of commission (describing something that did not occur), errors of omission (failing to describe something that did occur), errors of frequency (misstating how often something occurred), and interrater disagreement about description. Threats to the validity of knowledge claims have also been invoked by qualitative theorists other than Maxwell, for example, by Becker (1979), Denzin (1989), and Goetz and LeCompte (1984). Our only significant disagreement with Maxwell's discussion of threats is his claim that qualitative researchers are less able to use "design features" (p. 296) to deal with threats to validity. For instance, his preferred use of multiple observers is a qualitative design feature that helps to reduce errors of omission, commission, and frequency. The repertoire of design features that qualitative researchers use will usually be quite different from those used by researchers in other traditions, but they are design features (methods) all the same.

Differences. These agreements notwithstanding, many qualitative theorists approach validity in ways that differ from our treatment. A few of these differences are based on arguments that are simply erroneous (Heap, 1995; Shadish, 1995a). But many are thoughtful and deserve more attention than our space constraints allow. Following is a sample.

Some qualitative theorists either mix together evaluative and social theories of truth (Eisner, 1979, 1983) or propose to substitute the social for the evaluative. So Jensen (1989) says that validity refers to whether a knowledge claim is "meaningful and relevant" (p. 107) to a particular language community; and Guba and Lincoln (1982) say that truth can be reduced to whether an account is credible to those who read it. Although we agree that social and evaluative theories complement each other and are both helpful, replacing the evaluative with the social is misguided. These social alternatives allow for devastating counterexamples (Phillips, 1987): the swindler's story is coherent but fraudulent; cults convince members of beliefs that have little or no apparent basis otherwise; and an account of an interaction between teacher and student might be true even if neither found it to be credible. Bunge (1992) shows how one cannot define the basic idea of error using social theories of truth. Kirk and Miller (1986) capture the need for an evaluative theory of truth in qualitative methods:

In response to the propensity of so many nonqualitative research traditions to use such hidden positivist assumptions, some social scientists have tended to overreact by stressing the possibility of alternative interpretations of everything to the exclusion of any effort to choose among them. This extreme relativism ignores the other side of objectivity: that there is an external world at all. It ignores the distinction between knowledge and opinion, and results in everyone having a separate insight that cannot be reconciled with anyone else's. (p. 15)

A second difference refers to equating the validity of knowledge claims with their evaluation, as we discussed earlier with test theory (e.g., Eisenhart & Howe, 1992). This is most explicit in Salner (1989), who suggested that much of validity in qualitative methodology concerns "the criteria that are useful for evaluating competing claims" (p. 51); and she urges researchers to expose the moral and value implications of research, much as Messick (1989) said in reference to test theory. Our response is the same as for test theory. We endorse the need to evaluate knowledge claims broadly, including their moral implications; but this is not the same as saying that the claim is true. Truth is just one criterion of merit for a good knowledge claim.

A third difference makes validity a result of the process by which truth emerges. For instance, emphasizing the dialectic process that gives rise to truth, Salner (1989) says: "Valid knowledge claims emerge . . . from the conflict and differences between the contexts themselves as these differences are communicated and negotiated among people who share decisions and actions" (p. 61). Miles and Huberman (1984) speak of the problem of validity in qualitative methods being "an insufficiency of procedures for qualitative data analysis" (p. 230). Guba and Lincoln (1989) argue that trustworthiness emerges from communication with other colleagues and stakeholders. The problem with all these positions is the error of thinking that validity is a property of methods. Any procedure for generating knowledge can generate invalid knowledge, so in the end it is the knowledge claim itself that must be judged. As Maxwell (1992) says, "The validity of an account is inherent, not in the procedures used to produce and validate it, but in its relationship to those things it is intended to be an account of" (p. 281).

A fourth difference suggests that traditional approaches to validity must be reformulated for qualitative methods because "historically, validity arose in the context of experimental research" (Eisenhart & Howe, 1992, p. 644). Others reject validity for similar reasons except that they say that validity arose in test theory (e.g., Wolcott, 1990). Both are incorrect, for validity concerns probably first arose systematically in philosophy, preceding test theory and experimental science by hundreds or thousands of years. Validity is pertinent to any discussion of the warrant for believing knowledge and is not specific to particular methods.

A fifth difference concerns the claim that there is no ontological reality at all, so there is no truth to correspond to it. The problems with this perspective are enormous (Schmitt, 1995). First, even if it were true, it would apply only to


correspondence theories of truth; coherence and pragmatist theories would be unaffected. Second, the claim contradicts our experience. As Kirk and Miller (1986) put it:

There is a world of empirical reality out there. The way we perceive and understand that world is largely up to us, but the world does not tolerate all understandings of it equally (so that the individual who believes he or she can halt a speeding train with his or her bare hands may be punished by the world for acting on that understanding). (p. 11)

Third, the claim ignores evidence about the problems with people's constructions. Maxwell notes that "one of the fundamental insights of the social sciences is that people's constructions are often systematic distortions of their actual situation" (p. 506). Finally, the claim is self-contradictory because it implies that the claim itself cannot be true.

A sixth difference is the claim that it makes no sense to speak of truth because there are many different realities, with multiple truths to match each (Filstead, 1979; Guba & Lincoln, 1982; Lincoln & Guba, 1985). Lincoln (1990), for example, says that "a realist philosophical stance requires, indeed demands, a singular reality and therefore a singular truth" (p. 502), which she juxtaposes against her own assumption of multiple realities with multiple truths. Whatever the merits of the underlying ontological arguments, this is not an argument against validity. Ontological realism (a commitment that "something" does exist) does not require a singular reality but merely a commitment that there be at least one reality. To take just one example, physicists have speculated that there may be circumstances under which multiple physical realities could exist in parallel, as in the case of Schrödinger's cat (Davies, 1984; Davies & Brown, 1986). Such circumstances would in no way constitute an objection to pursuing valid characterizations of those multiple realities. Nor for that matter would the existence of multiple realities require multiple truths; physicists use the same principles to account for the multiple realities that might be experienced by Schrödinger's cat.
Epistemological realism (a commitment that our knowledge reflects ontological reality) does not require only one true account of that world(s), but only that there not be two contradictory accounts that are both true of the same ontological referent.³ How many realities there might be, and how many truths it takes to account for them, should not be decided by fiat.

A seventh difference objects to the belief in a monolithic or absolute Truth (with a capital T). Wolcott (1990) says, "What I seek is something else, a quality that points more to identifying critical elements and wringing plausible interpretations from them, something one can pursue without becoming obsessed with

3. The fact that different people might have different beliefs about the same referent is sometimes cited as violating this maxim, but it need not do so. For example, if the knowledge claim being validated is "John views the program as effective but Mary views it as ineffective," the claim can be true even though the views of John and Mary are contradictory.


finding the right or ultimate answer, the correct version, the Truth" (p. 146). He describes "the critical point of departure between quantities-oriented and qualities-oriented research [as being that] we cannot 'know' with the former's satisfying levels of certainty" (p. 147). Mishler (1990) objects that traditional approaches to validation are portrayed "as universal, abstract guarantors of truth" (p. 420). Lincoln (1990) thinks that the realist position demands "the absolute truth" (p. 502). However, it is misguided to attribute beliefs in certainty or absolute truth to approaches to validity such as that in this book. We hope we have made clear by now that there are no guarantors of valid inferences. Indeed, the more experience that most experimenters gain, the more they appreciate the ambiguity of their results. Albert Einstein once said, "An experiment is something everybody believes, except the person who made it" (Holton, 1986, p. 13). Like Wolcott, most experimenters seek only to wring plausible interpretations from their work, believing that "prudence sat poised between skepticism and credulity" (Shapin, 1994, p. xxix). We need not, should not, and frequently cannot decide that one account is absolutely true and the other completely false. To the contrary, tolerance for multiple knowledge constructions is a virtual necessity (Lakatos, 1978) because evidence is frequently inadequate to distinguish between two well-supported accounts (is light a particle or a wave?), and sometimes accounts that appear to be unsupported by evidence for many years turn out to be true (do germs cause ulcers?).

An eighth difference claims that traditional understandings of validity have moral shortcomings. The arguments here are many, for example, that it "forces issues of politics, values (social and scientific), and ethics to be submerged" (Lincoln, 1990, p. 503) and implicitly empowers "social science 'experts' . . . whose class preoccupations (primarily White, male, and middle-class) ensure status for some voices while marginalizing . . .
those of women, persons of color, or minority group members" (Lincoln, 1990, p. 502). Although these arguments may be overstated, they contain important cautions. Recall the example in Chapter 3 that even the rats were "White males" in health research. No doubt this bias was partly due to the dominance of White males in the design and execution of health research. None of the methods discussed in this book are intended to redress this problem or are capable of it. The purpose of experimental design is to elucidate causal inferences more than moral inferences. What is less clear is that this problem requires abandoning notions of validity or truth. The claim that traditional approaches to truth forcibly submerge political and ethical issues is simply wrong. To the extent that morality is reflected in the questions asked, the assumptions made, and the outcomes examined, experimenters can go a long way by ensuring a broad representation of stakeholder voices in study design. Further, moral social science requires commitment to truth. Moral righteousness without truthful analysis is the stuff of totalitarianism. Moral diversity helps prevent totalitarianism, but without the discipline provided by truth-seeking, diversity offers no means to identify those options that are good for the human condition, which is, after all, the essence of morality. In order to have a moral social science, we must have both the capacity to elucidate personal constructions and the capacity to see how those constructions reflect and distort reality (Maxwell, 1992). We embrace the moral aspirations of scholars such as Lincoln, but giving voice to those aspirations simply does not require us to abandon such notions as validity and truth.

QUASI-EXPERIMENTATION

Criteria for Ruling Out Threats: The Centrality of Fuzzy Plausibility

In a randomized experiment in which all groups are treated in the same way except for treatment assignment, very few assumptions need to be made about sources of bias. And those that are made are clear and can be easily tested, particularly as concerns the fidelity of the original assignment process and its subsequent maintenance. Not surprisingly, statisticians prefer methods in which the assumptions are few, transparent, and testable. Quasi-experiments, however, rely heavily on researcher judgments about assumptions, especially on the fuzzy but indispensable concept of plausibility. Judgments about plausibility are needed for deciding which of the many threats to validity are relevant in a given study, for deciding whether a particular design element is capable of ruling out a given threat, for estimating by how much the bias might have been reduced, and for assessing whether multiple threats that might have been only partially adjusted for might add up to a total bias greater than the effect size the researcher is inclined to claim. With quasi-experiments, the relevant assumptions are numerous, their plausibility is less evident, and their single and joint effects are less easily modeled. We acknowledge the fuzzy way in which particular internal validity threats are often ruled out, and it is because of this that we too prefer randomized experiments (and regression discontinuity designs) over most of their quasi-experimental alternatives.
But quasi-experiments vary among themselves with respect to the number, transparency, and testability of assumptions. Indeed, we deliberately ordered the chapters on quasi-experiments to reflect the increase in inferential power that comes from moving from designs without a pretest or without a comparison group to those with both, to those based on an interrupted time series, and from there to regression discontinuity and random assignment. Within most of these chapters we also illustrated how inferences can be improved by adding design elements: more pretest observation points, better stable matching, replication and systematic removal of the treatment, multiple control groups, and nonequivalent dependent variables. In a sense, the plan of the four chapters on quasi-experiments reflects two purposes. One is to show how the number, transparency, and testability of assumptions varies by type of quasi-experimental design so that, in the best of quasi-experiments, internal validity is not much worse than with the randomized experiment. The other is to get students of quasi-experiments to be more sparing with the use of this overly general label, for it threatens to tar all quasi-experiments with the same negative brush. As scholars who have contributed to the institutionalization of the term quasi-experiment, we feel a lot of ambivalence about our role. Scholars need to think critically about alternatives to the randomized experiment, and from this need arises the need for the quasi-experimental label. But all instances of quasi-experimental design should not be brought under the same unduly broad quasi-experimental umbrella if attributes of the best studies do not closely match the weaker attributes of the field writ large.

Statisticians seek to make their assumptions transparent through the use of formal models laid out as formulae. For the most part, we have resisted this strategy because it backfires with so many readers, alienating them from the very conceptual issues the formulae are designed to make evident. We have used words instead. There is a cost to this, and not just in the distaste of statistical cognoscenti, particularly those whose own research has emphasized statistical models. The main cost is that our narrative approach makes it more difficult to formally demonstrate how much fewer and more evident and more testable the alternative interpretations became as we moved from the weaker to the stronger quasi-experiments, both within the relevant quasi-experimental chapters and across the set of them. We regret this, but we do not apologize for the accessibility we tried to create by minimizing the use of Greek symbols and Roman subscripts. Fortunately, this deficit is not absolute, as both we and others have worked to develop methods that can be used to measure the size of particular threats, both in particular studies (e.g., Gastwirth et al., 1994; Shadish et al., 1998; Shadish, 2000) and in sets of studies (e.g., Kazdin & Bass, 1989; Miller, Turner, Tindale, Posavac, & Dugoni, 1991; Rosenthal & Rubin, 1978; Willson & Putnam, 1982). Further, our narrative approach has a significant advantage over a more narrowly statistical emphasis: it allows us to address a broader array of qualitatively different threats to validity, threats for which no statistical measure is yet available and that therefore might otherwise be overlooked with too strict an emphasis on quantification. Better to have imprecise attention to plausibility than to have no attention at all paid to many important threats just because they cannot be well measured.

Pattern Matching as a Problematic Criterion

This book is more explicit than its predecessors about the desirability of imbuing a causal hypothesis with multiple testable implications in the data, provided that they serve to reduce the viability of alternative causal explanations. In a sense, we have sought to substitute a pattern-matching methodology for the usual assessment of whether a few means, often only two, reliably differ. We do this not because complexity itself is a desideratum in science. To the contrary, simplicity in the number of questions asked and methods used is highly prized in science. The simplicity of randomized experiments for descriptive causal inference illustrates this well. However, the same simple circumstance does not hold with quasi-experiments. With them, we have asserted that causal inference is improved the more specific,

generating these lists. The main concern was to have a consensus of education researchers endorsing each practice; and he guessed that the number of these best practices that depended on randomized experiments would be zero. Several nationally known educational researchers were present, agreed that such assignment probably played no role in generating the list, and felt no distress at this. So long as the belief is widespread that quasi-experiments constitute the summit of what is needed to support causal conclusions, the support for experimentation that is currently found in health or agriculture is unlikely to occur in schools. Yet randomization is possible in many educational contexts within schools if the will exists to carry it out (Cook et al., 1999; Cook et al., in press). An unfortunate and inadvertent side effect of serious discussion of quasi-experiments may sometimes be the practical neglect of randomized experiments. That is a pity.

RANDOMIZED EXPERIMENTS

This section lists objections that have been raised to doing randomized experiments, and our analysis of the more and less legitimate issues that these objections raise.

Experiments Cannot Be Successfully Implemented

Even a little exposure to large-scale social experimentation shows that treatments are often improperly or incompletely implemented and that differential attrition often occurs. Organizational obstacles to experiments are many. They include the reality that different actors vary in the priority they attribute to random assignment, that some interventions seem disruptive at all levels of the organization, and that those at the point of service delivery often find the treatment requirements a nuisance addition to their already overburdened daily routine. Then there are sometimes treatment crossovers, as units in the control condition adopt or adapt components from the treatment or as those in a treatment group are exposed to some but not all of these same components. These criticisms suggest that the correct comparison is not between the randomized experiment and better quasi-experiments when each is implemented perfectly but rather between the randomized experiment as it is often imperfectly implemented and better quasi-experiments. Indeed, implementation can sometimes be better in the quasi-experiment if the decision not to randomize is based on fears of treatment degradation. This argument cannot be addressed well because it depends on specifying the nature and degree of degradation and the kind of quasi-experimental alternative. But taken to its extreme it suggests that randomized experiments have no special warrant in field settings because there is no evidence that they are stronger than other designs in practice (only in theory).

But the situation is probably not so bleak. Methods for preventing and coping with treatment degradation are improving rapidly (see Chapter 10, this volume; Boruch, 1997; Gueron, 1999; Orr, 1999). More important, random assignment may still create a superior counterfactual to its alternatives even with the flaws mentioned herein. For example, Shadish and Ragsdale (1996) found that, compared with randomized experiments without attrition, randomized experiments with attrition still yielded better effect size estimates than did nonrandomized experiments. Sometimes, of course, an alternative to severely degraded randomization will be best, such as a strong interrupted time series with a control. But routine rejection of degraded randomized experiments is a poor rule to follow; it takes careful study and judgment to decide. Further, many alternatives to experimentation are themselves subject to treatment implementation flaws that threaten the validity of inferences from them. Attrition and treatment crossovers also occur in them. We also suspect that implementation flaws are salient in experimentation because experiments have been around so long and experimenters are so critical of each other's work. By contrast, criteria for assessing the quality of implementation and results from other methods are far more recent (e.g., Datta, 1997), and they may therefore be less well developed conceptually, less subjected to peer criticism, and less improved by the lessons of experience.
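The attrition problem discussed above can be made concrete with a small simulation (a sketch; the numbers and the dropout mechanism are hypothetical, not from Shadish and Ragsdale): when dropouts in one arm are systematically different from those who remain, a simple difference in follow-up means is biased even though assignment was random.

```python
import random

random.seed(0)

TRUE_EFFECT = 2.0  # hypothetical gain from treatment, in outcome points

def run_trial(n=50_000, differential_attrition=True):
    """Simulate a two-arm randomized trial and return the naive
    difference in posttest means among units still present at follow-up."""
    treat, control = [], []
    for _ in range(n):
        baseline = random.gauss(50, 10)   # pre-existing ability
        if random.random() < 0.5:         # random assignment
            # Hypothetical mechanism: low-baseline treated units drop out.
            if differential_attrition and baseline < 45:
                continue
            treat.append(baseline + TRUE_EFFECT)
        else:
            control.append(baseline)
    return sum(treat) / len(treat) - sum(control) / len(control)

clean = run_trial(differential_attrition=False)   # recovers roughly 2.0
biased = run_trial(differential_attrition=True)   # inflated well above 2.0
```

Because attrition here deletes low scorers from one arm only, the naive estimate is pushed upward; pretest measures and attrition analyses of the kind discussed in Chapter 10 are what allow researchers to diagnose and partially correct such bias.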

Experimentation Needs Strong Theory and Standardized Treatment Implementation

Many critics claim that experimentation is more fruitful when an intervention is based on strong substantive theory, when implementation of treatment details is faithful to that theory, when the research setting is well managed, and when implementation does not vary much between units. In many field experiments, these conditions are not met. For example, schools are large, complex social organizations with multiple programs, disputatious politics, and conflicting stakeholder goals. Many programs are implemented variably across school districts, as well as across schools, classrooms, and students. There can be no presumption of standard implementation or fidelity to program theory (Berman & McLaughlin, 1977).

But these criticisms are, in fact, misplaced. Experiments do not require well-specified program theories, good program management, standard implementation, or treatments that are totally faithful to theory. Experiments make a contribution when they simply probe whether an intervention-as-implemented makes a marginal improvement beyond other background variability. Still, the preceding factors can reduce statistical power and so cloud causal inference. This suggests that in settings in which these conditions do not hold, experiments should: (1) use large samples to detect effects; (2) take pains to reduce the influence of extraneous variation either by design or through measurement and statistical manipulation; and (3) study implementation quality both as a variable worth studying in its own right in order to ascertain which settings and providers implement the intervention better and as a mediator to see how implementation carries treatment effects to outcome.
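Point (1) can be given numbers with a standard two-sample power calculation (a sketch using the normal approximation; the effect and noise figures are illustrative, not from the text). Because the required sample size grows with the inverse square of the standardized effect, the extra outcome variation typical of field settings quickly forces large samples:

```python
import math
from statistics import NormalDist

def n_per_group(effect_in_sd_units, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided, two-sample comparison of
    means (normal approximation to the power of the test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_in_sd_units) ** 2)

# A 2-point treatment effect against sd = 10 is d = 0.2; doubling the
# extraneous noise (sd = 20) halves d and roughly quadruples needed n.
n_low_noise = n_per_group(2 / 10)    # d = 0.20 -> 393 per arm
n_high_noise = n_per_group(2 / 20)   # d = 0.10 -> 1570 per arm
```

The same arithmetic motivates point (2): any design or measurement step that shrinks the extraneous standard deviation raises the standardized effect and cuts the sample size requirement quadratically.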

Indeed, for many purposes the lack of standardization may aid in understanding how effective an intervention will be under normal conditions of implementation. In the social world, few treatments are introduced in a standard and theory-faithful way. Local adaptations and partial implementation are the norm. If this is the case, then some experiments should reflect this variation and ask whether the treatment can continue to be effective despite all the variation within groups that we would expect to find if the treatment were policy. Program developers and social theorists may want standardization at high levels of implementation, but policy analysts should not welcome this if it makes the research conditions different from the practice conditions to which they would like to generalize. Of course, it is most desirable to be able to answer both sets of questions: about policy-relevant effects of treatments that are variably implemented and also about the more theory-relevant effects of optimal exposure to the intervention. In this regard, one might recall recent efforts to analyze the effects of the original intent to treat through traditional means but also of the effects of the actual treatment through using random assignment as an instrumental variable (Angrist et al., 1996a).
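The closing contrast between intent-to-treat effects and effects of the actual treatment, with random assignment used as an instrument, can be sketched in a small simulation of a trial with noncompliance (all numbers hypothetical; the ratio estimator below is the simple Wald form of the instrumental-variable idea in Angrist et al.):

```python
import random

random.seed(42)

TRUE_EFFECT = 5.0   # hypothetical effect of actually receiving treatment
COMPLIANCE = 0.6    # hypothetical probability an assignee takes it up

def simulate(n=100_000):
    data = []
    for _ in range(n):
        z = random.randint(0, 1)  # random assignment
        d = 1 if z == 1 and random.random() < COMPLIANCE else 0  # uptake
        y = 10.0 + TRUE_EFFECT * d + random.gauss(0, 2)  # outcome
        data.append((z, d, y))
    return data

def mean(xs):
    return sum(xs) / len(xs)

data = simulate()
y1 = mean([y for z, d, y in data if z == 1])
y0 = mean([y for z, d, y in data if z == 0])
d1 = mean([d for z, d, y in data if z == 1])
d0 = mean([d for z, d, y in data if z == 0])

itt = y1 - y0           # intent-to-treat: effect of assignment, near 3.0
wald = itt / (d1 - d0)  # IV estimate: effect of actual treatment, near 5.0
```

The ITT estimate answers the policy question (what assignment achieves under realistic uptake), while the Wald ratio recovers the theory-relevant effect among compliers; both come from the same randomized design.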

Experiments Entail Tradeoffs Not Worth Making

The choice to experiment involves a number of tradeoffs that some researchers believe are not worth making (Cronbach, 1982). Experimentation prioritizes unbiased answers to descriptive causal questions. But, given finite resources, some researchers prefer to invest what they have not into marginal improvements in internal validity but into promoting higher construct and external validity. They might be content with a greater degree of uncertainty about the quality of a causal connection in order to purposively sample a greater range of populations of people or settings or, when a particular population is central to the research, in order to generate a formally representative sample. They might even use the resources to improve treatment fidelity or to include multiple measures of a very important outcome construct. If a consequence of this preference for construct and external validity is to conduct a quasi-experiment or even a nonexperiment rather than a randomized experiment, then so be it. Similar preferences make other critics look askance when advocates of experimentation counsel restricting a study to volunteers in order to increase the chances of being able to implement and maintain random assignment, or when these same advocates advise close monitoring of the treatment to ensure its fidelity, thereby creating a situation of greater obtrusiveness than would pertain if the same treatment were part of some ongoing social policy (e.g., Heckman, 1992). In the language of Campbell and Stanley (1963), the claim was that experimentation traded off external validity in favor of internal validity. In the parlance of this book and of Cook and Campbell (1979), it is that experimentation trades off both external and construct validity for internal validity, to its detriment.

Critics also claim that experiments overemphasize conservative standards of scientific rigor. These include (1) using a conservative criterion to protect against

wrongly concluding a treatment is effective (p < .05) at the risk of failing to detect true treatment effects; (2) recommending intent-to-treat analyses that include as part of the treatment those units that have never received treatment; (3) denigrating inferences that result from exploring unplanned treatment interactions with characteristics of units, observations, settings, or times; and (4) rigidly pursuing a priori experimental questions when other interesting questions emerge during a study. Most laypersons use a more liberal risk calculus to decide causal inferences in their own lives, as when they consider taking up some potentially lifesaving therapy. Should not science do the same, be less conservative? Should it not at least sometimes make different tradeoffs between protection against incorrect inferences and the failure to detect true effects?

Critics further object that experiments prioritize descriptive over explanatory causation. The critics in question would tolerate more uncertainty about whether the intervention works in order to learn more about any explanatory processes that have the potential to generalize across units, settings, observations, and times. Further, some critics prefer to pursue this explanatory knowledge using qualitative methods similar to those of the historian, journalist, and ethnographer than by means of, say, structural equation modeling that seems much more opaque than the narrative reports of these other fields.

Critics also dislike the priority that experiments give to providing policymakers with often belated answers about what works instead of providing real-time help to service providers in local settings. These providers are rarely interested in a long-delayed summary of what a program has achieved. They often prefer receiving continuous feedback about their work and especially about those elements of practice
that they can change without undue complication. A recent letter to the New York Times captured this preference:

Alan Krueger . . . claims to eschew value judgments and wants to approach issues (about educational reform) empirically. Yet his insistence on postponing changes in education policy until studies by researchers approach certainty is itself a value judgment in favor of the status quo. In view of the tragic state of affairs in parts of public education, his judgment is a most questionable one. (Petersen, 1999)

We agree with many of these criticisms. Among all possible research questions, causal questions constitute only a subset. And of all possible causal methods, experimentation is not relevant to all types of questions and all types of circumstance. One need only read the list of options and contingencies outlined in Chapters 9 and 10 to appreciate how foolhardy it is to advocate experimentation on a routine basis as a causal "gold standard" that will invariably result in clearly interpretable effect sizes. However, many of the criticisms about tradeoffs are based on artificial dichotomies, correctable problems, and oversimplifications. Experiments can and should examine reasons for variable implementation, and they should search to uncover mediating processes. They need not use stringent alpha rates; only statistical tradition argues for the .05 level. Nor need one restrict data analyses only to the intent-to-treat, though one analysis should definitely be one. Experimenters can also explore statistical interactions to the extent that substantive theory and statistical power allow, guarding against profligate error rates and couching their conclusions cautiously. Interim results from experiments can be and are published. There can and should also be nonexperimental analyses of the representativeness of samples and the construct validity of assessments of persons, settings, treatments, and outcomes. There can and should be qualitative data collection aimed at discovering unintended outcomes and mediating processes. And as much information as possible about causal generalization should be generated using the methods outlined in this book. All these procedures require resources, but sometimes few of them (e.g., adding measures of mediating variables). Experiments need not be as rigid as some texts and the goal of ever-finer marginal improvements suggest.

To do without strong studies on the effects of an intervention is to risk drawing broad general conclusions about a causal connection that is undependable. This happens all too often. It is now 30 years since school vouchers were first proposed, and we still have no clear answers about their effects. It is 15 years since Henry Levin began accelerated schools, and we have no experiments and no answers. It is 30 years since James Comer began the School Development Program, and almost the same situation holds. Although premature experimentation is a real danger, such decade-long time lines without clear answers are probably even more problematic, particularly for those legislators and their staffs who want to promote effectiveness-based social policies. Finding out what works is too important to suggest that experiments require tradeoffs that are never worth making. By contrast, we are impressed with the capacity of programs of experimental research to address both construct and external validity issues. Granted, individual experiments have a modest reach in addressing both these issues. But as we saw most clearly in meta-analysis, the capacity of multiple experiments to address both construct and external validity issues greatly exceeds what was achievable in the past, as we made clear in Chapter 1. Of course, we are not suggesting any routine primacy of internal validity over external and construct validity; every validity type must have its time in the spotlight. Rather, we are calling attention to the inferential weaknesses that history suggests have emerged in programs of research that deemphasize internal validity too much, and to the surprisingly broad substantive reach of programs in which internal validity plays a much more prominent role.

Experiments Assume an Invalid Model of Research Utilization

To some critics, experiments recreate a naive rational choice model of decision making. That is, one first lays out the alternatives to choose among (the treatments); then one decides on criteria of merit (the outcomes); then one collects information on each criterion for each treatment (the experiment); and finally one makes a decision about the superior alternative. Unfortunately, empirical work on the use of social science data shows that use is not so simple as the rational choice model suggests (C. Weiss & Bucuvalas, 1980; C. Weiss, 1988).

First, even when cause and effect questions are asked in decision contexts, experimental results are still used along with other forms of information: from existing theories, personal testimony, extrapolations from surveys, consensus of a field, claims from experts with interests to defend, and ideas that have recently become trendy. Decisions are shaped partly by ideology, interests, politics, personality, windows of opportunity, and values; and they are made as much by a policy-shaping community (Cronbach et al., 1980) as by an individual or committee. Further, many decisions are not so much made as accreted over time as earlier decisions constrain later ones, leaving the final decision maker with few options (Weiss, 1980). Indeed, by the time experimental results are available, new decision makers and issues may have replaced old ones.

Second, experiments often yield contested rather than unanimous verdicts that therefore have uncertain implications for decisions. Disputes arise about
whether the causal questions were correctly framed, whether results are valid, whether relevant outcomes were assessed, and whether the results entail a specific decision. For example, reexaminations of the Milwaukee educational voucher study offered different conclusions about whether and where effects occurred (H. Fuller, 2000; Greene, Peterson, & Du, 1999; Witte, 1998, 1999, 2000). Similarly, different effect sizes were generated from the Tennessee class size experiment (Finn & Achilles, 1990; Hanushek, 1999; Mosteller, Light, & Sachs, 1996). Sometimes, scholarly disagreements are at issue, but at other times the disputes reflect deeply conflicted stakeholder interests.

Third, short-term instrumental use of experimental data is more likely when the intervention is a minor variant on existing practice. For example, it is easier to change textbooks in a classroom or pills given to patients or eligibility criteria for program entry than it is to relocate hospitals to underserved locations or to open day-care centers for welfare recipients throughout an entire state. Because the more feasible changes are so modest in scope, they are less likely to dramatically affect the problems they address. So critics note that prioritizing on short-term instrumental change tends to preserve most of the status quo and is unlikely to solve truly trenchant social problems. Of course, there are some experiments that truly twist the lion's tail and involve bold initiatives. Thus moving families from densely poor inner-city locations to the suburbs involved a change of three standard deviations

in the poverty level of the sending and receiving communities, much greater than what happens when poor families spontaneously move. Whether such a dramatic change could ever be used as a model for cleaning out the inner cities of those who want to move is a moot issue. Many would judge such a policy to be unlikely. Truly bold experiments have many important rationales; but creating new policies that look like the treatment soon after the experiment is not one of them.

Fourth, the most frequent use of research may be conceptual rather than instrumental, changing how users think about basic assumptions, how they understand contexts, and how they organize or label ideas. Some conceptual uses are intentional, as when a person deliberately reads a book on a current problem; for example, Murray's (1984) book on social policy had such a conceptual impact in the 1980s, creating a new social policy agenda. But other conceptual uses occur in passing, as when a person reads a newspaper story referring to research. Such uses can have great long-run impact as new ways of thinking move through the system, but they rarely change particular short-term decisions.

These arguments against a naive rational decision-making model of experimental usefulness are compelling. That model is rightly rejected. However, most of the objections are true not just of experiments but of all social science methods. Consider controversies over the accuracy of the U.S. census, the entirely descriptive results of which enter into a decision-making process about the apportionment of resources that is complex and highly politically charged. No method offers a direct road to short-term instrumental use. Moreover, the objections are exaggerated. In settings such as the U.S. Congress, decision making is sometimes influenced instrumentally by social science information (Chelimsky, 1998), and experiments frequently contribute to that use as part of a research review on effectiveness questions.
Similarly, policy initiatives get recycled, as happened with school vouchers, so that social science data that were not used in past years are used later when they become instrumentally relevant to a current issue (Polsby, 1984; Quirk, 1986). In addition, data about effectiveness influence many stakeholders' thinking even when they do not use the information quickly or instrumentally. Indeed, research suggests that high-quality experiments can confer extra credibility among policymakers and decision makers (C. Weiss & Bucuvalas, 1980), as happened with the Tennessee class size study. We should also not forget that the conceptual use of experiments occurs when the texts used to train professionals in a given field contain results of past studies about successful practice (Leviton & Cook, 1983). And using social science data to produce incremental change is not always trivial. Small changes can yield benefits of hundreds of millions of dollars (Fienberg, Singer, & Tanur, 1985). Sociologist Carol Weiss, an advocate of doing research for enlightenment's sake, says that 3 decades of experience and her studies of the use of social science data leave her "impressed with the utility of evaluation findings in stimulating incremental increases in knowledge and in program effectiveness. Over time, cumulative increments are not such small potatoes after all" (Weiss, 1998, p. 319). Finally, the usefulness of experiments can be increased by the actions outlined earlier in this chapter that involve complementing basic experimental design with adjuncts such as measures of implementation and mediation or qualitative methods: anything that will help clarify program process and implementation problems. In summary, invalid models of the usefulness of experimental results seem to us to be no more nor less common than invalid models of the use of any other social science methods. We have learned much in the last several decades about use, and experimenters who want their work to be useful can take advantage of those lessons (Shadish et al., 1991).

The Conditions of Experimentation Differ from the Conditions of Policy Implementation

Experiments are often done on a smaller scale than would pertain if services were implemented state- or nationwide, and so they cannot mimic all the details relevant to full policy implementation. Hence policy implementation of an intervention may yield different outcomes than the experiment (Elmore, 1996). For example, based partly on research about the benefits of reducing class size, Tennessee and California implemented statewide policies to have more classes with fewer students in each. This required many new teachers and new classrooms. However, because of a national teacher shortage, some of those new teachers may have been less qualified than those in the experiment; and a shortage of classrooms led to more use of trailers and dilapidated buildings that may have harmed effectiveness further.

Sometimes an experimental treatment is an innovation that generates enthusiastic efforts to implement it well. This is particularly frequent when the experiment is done by a charismatic innovator whose tacit knowledge may exceed that of those who would be expected to implement the program in ordinary practice and whose charisma may induce high-quality implementation. These factors may generate more successful outcomes than will be seen when the intervention is implemented as routine policy.

Policy implementation may also yield different results when experimental treatments are implemented in a fashion that differs from or conflicts with practices in real-world application. For example, experiments studying psychotherapy outcome often standardize treatment with a manual and sometimes observe and correct the therapist for deviating from the manual (Shadish et al., 2000); but these practices are rare in clinical practice. If manualized treatment is more effective (Chambless & Hollon, 1998; Kendall, 1998), experimental results might transfer poorly to practice settings.
Random assignment may also change the program from the intended policy implementation (Heckman, 1992). For example, those willing to be randomized may differ from those for whom the treatment is intended; randomization may change people's psychological or social response to treatment compared with those who self-select treatment; and randomization may disrupt administration and implementation by forcing the program to cope with a different mix of clients.

Heckman claims this kind of problem with the Job Training Partnership Act (JTPA) evaluation "calls into question the validity of the experimental estimates as a statement about the JTPA system as a whole" (Heckman, 1992, p. 221). In many respects, we agree with these criticisms, though it is worth noting several responses to them. First, they assume a lack of generalizability from experiment to policy, but that is an empirical question. Some data suggest that generalization may be high despite differences between lab and field (C. Anderson, Lindsay, & Bushman, 1999) or between research and practice (Shadish et al., 2000). Second, it can help to implement treatment under conditions that are more characteristic of practice if it does not unduly compromise other research priorities. A little forethought can improve the surface similarity of units, treatments, observations, settings, or times to their intended targets. Third, some of these criticisms are true of any research methodology conducted in a limited context, such as locally conducted case studies or quasi-experiments, because local implementation issues always differ from large-scale issues. Fourth, the potentially disruptive nature of experimentally manipulated interventions is shared by many locally invented novel programs, even when they are not studied by any research methodology at all. Innovation inherently disrupts, and substantive literatures are rife with examples of innovations that encountered policy implementation impediments (Shadish, 1984).

However, the essential problem remains that large-scale policy implementation is a singular event, the effects of which cannot be fully known except by doing the full implementation. A single experiment, or even a small series of similar ones, cannot provide complete answers about what will happen if the intervention is adopted as policy. However, Heckman's criticism needs reframing. He fails to distinguish among validity types (statistical conclusion, internal, construct, external).
Doing so makes it clear that his claim that such criticism "calls into question the validity of the experimental estimates as a statement about the JTPA system as a whole" (Heckman, 1992, p. 221) is really about external validity and construct validity, not statistical conclusion or internal validity. Except in the narrow econometrics tradition that he understandably cites (Haavelmo, 1944; Marschak, 1953; Tinbergen, 1956), few social experimenters ever claimed that experiments could describe the "system as a whole"; even Fisher (1935) acknowledged this tradeoff. Further, the econometric solutions that Heckman suggests cannot avoid the same tradeoffs between internal and external validity. For example, surveys and certain quasi-experiments can avoid some problems by observing existing interventions that have already been widely implemented, but the validity of their estimates of program effects is suspect, and those effects may themselves change if the program were imposed even more widely as policy.

Addressing these criticisms requires multiple lines of evidence: randomized experiments of efficacy and effectiveness, nonrandomized experiments that observe existing interventions, nonexperimental surveys to yield estimates of representativeness, statistical analyses that bracket effects under diverse assumptions,


qualitative observation to discover potential incompatibilities between the intervention and its context of likely implementation, historical study of the fates of similar interventions when they were implemented as policy, policy analyses by those with expertise in the type of intervention at issue, and the methods for causal generalization in this book. The conditions of policy implementation will be different from the conditions characteristic of any research study of it, so predicting generalization to policy will always be one of the toughest problems.

Imposing Treatments Is Fundamentally Flawed Compared with Encouraging the Growth of Local Solutions to Problems

Experiments impose treatments on recipients. Yet some late 20th-century thought suggests that imposed solutions may be inferior to solutions that are locally generated by those who have the problem. Partly, this view is premised on research findings of few effects for the Great Society social programs of the 1960s in the United States (Murray, 1984; Rossi, 1987), with the presumption that a portion of the failure was due to the federally imposed nature of the programs. Partly, the view reflects the success of late 20th-century free market economics and conservative political ideologies compared with centrally controlled economies and more liberal political beliefs. Experimentally imposed treatments are seen in some quarters as being inconsistent with such thinking.

Ironically, the first objection is based on results of experiments: if it is true that imposed programs do not work, experiments provided the evidence. Moreover, these no-effect findings may have been partly due to methodological failures of experiments as they were implemented at that time. Much progress in solving practical experimental problems occurred after, and partly in response to, those experiments. If so, it is premature to assume these experiments definitively demonstrated no effect, especially given our increased ability to detect small effects today (D. Greenberg & Shroder, 1997; Lipsey, 1992; Lipsey & Wilson, 1993).

We must also distinguish between political-economic currency and the effects of interventions. We know of no comparisons of, say, the effects of locally generated versus imposed solutions. Indeed, the methodological problems in doing such comparisons are daunting, especially accurately categorizing interventions into the two categories and unconfounding the categories with correlated method differences. Barring an unexpected solution to the seemingly intractable problems of causal inference in nonrandomized designs, answering questions about the effects of locally generated solutions may require exactly the kind of high-quality experimentation being criticized. Though it is likely that locally generated solutions may indeed have significant advantages, it also is likely that some of those solutions will have to be experimentally evaluated.

CAUSAL GENERALIZATION: AN OVERLY COMPLICATED THEORY?

Internal validity is best promoted via random assignment, an omnibus mechanism that ensures that we do not have many assumptions to worry about when causal inference is our goal. By contrast, quasi-experiments require us to make explicit many assumptions (the threats to internal validity) that we then have to rule out by fiat, by design, or by measurement. The latter is a more complex and assumption-riddled process that is clearly inferior to random assignment. Something similar holds for causal generalization, in which random selection is the most parsimonious and theoretically justified method, requiring the fewest assumptions when causal generalization is our goal. But because random selection is so rarely feasible, one instead has to construct an acceptable theory of generalization out of purposive sampling, a much more difficult process. We have tried to do this with our five principles of generalized causal inference. These, we contend, are the keys to generalized inference that lie behind random sampling and that have to be identified, explicated, and assessed if we are to make better general inferences, even if they are not perfect ones. But these principles are much more complex to implement than is random sampling.

Let us briefly illustrate this with the category called American adult women. We could represent this category by random selection from a critically appraised register of all women who live in the United States and who are at least 21 years of age. Within the limits of sampling error, we could formally generalize any characteristics we measured on this sample to the population on that register. Of course, we cannot select this way because no such register exists. Instead, one does one's experiment with an opportunistic sample of women. On inspection they all turn out to be between 19 and 30 years of age, to be higher than average in achievement and ability, and to be attending school; that is, we have used a group of college women.
Surface similarity suggests that each is an instance of the category woman. But it is obvious that the modal American woman is clearly not a college student. Such students constitute an overly homogeneous sample with respect to educational abilities and achievement, socioeconomic status, occupation, and all observable and unobservable correlates thereof, including health status, current employment, and educational and occupational aspirations and expectations. To remedy this bias, we could use a more complex purposive sampling design that selects women heterogeneously on all these characteristics. But purposive sampling for heterogeneous instances can never do this as well as random selection can, and it is certainly more complex to conceive and execute. We could go on and illustrate how the other principles facilitate generalization. The point is that any theory of generalization from purposive samples is bound to be more complicated than the simplicity of random selection.

But because random selection is rarely possible when testing causal relationships within an experimental framework, we need these purposive alternatives.
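The cost of such homogeneity can be made concrete with a small simulation. Everything here is stipulated purely for illustration: the population of ages, the age-dependent treatment effect, and the sampling plan are toy assumptions, not real data. The point is only that an overly homogeneous (college-age) sample misestimates a population-average quantity that a purposively heterogeneous sample recovers well.

```python
import random

random.seed(7)

def avg(xs):
    return sum(xs) / len(xs)

# Stipulated toy effect function: the treatment effect grows with age.
def effect(age):
    return 1.0 + 0.05 * (age - 21)

# Hypothetical population of adult women, ages 21-80.
population = [random.randint(21, 80) for _ in range(100_000)]
true_avg = avg([effect(a) for a in population])

# Opportunistic, homogeneous sample: only college-age women.
college = [a for a in population if 19 <= a <= 30]
college_est = avg([effect(a) for a in college])

# Purposive heterogeneous sampling: 50 women from each decade of age.
hetero = []
for lo in range(21, 81, 10):
    band = [a for a in population if lo <= a < lo + 10]
    hetero.extend(random.sample(band, 50))
hetero_est = avg([effect(a) for a in hetero])

print(f"true={true_avg:.2f} college={college_est:.2f} hetero={hetero_est:.2f}")
```

The heterogeneous estimate lands near the population average while the college-only estimate is far off, which is the purpose of sampling heterogeneously on attributes such as age, even though (as the text notes) doing so is more complex to conceive and execute than random selection.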

Yet most experimental work probably still relies on the weakest of these alternatives, surface similarity. We seek to improve on such uncritical practice. Unfortunately, though, there is often restricted freedom for the more careful selection of instances of units, treatments, outcomes, and settings, even when the selection is done purposively. It requires resources to sample irrelevancies so that they are heterogeneous on many attributes, to measure several related constructs that can be discriminated from each other conceptually, and to measure a variety of possible explanatory processes. This is partly why we expect more progress on causal generalization from a review context rather than from single studies. Thus, if one researcher can work with college women, another can work with female schoolteachers, and another with female retirees, this creates an opportunity to see if these sources of irrelevant homogeneity make a difference to a causal relationship or whether it holds over all these different types of women.

Ultimately, causal generalization will always be more complicated than assessing the likelihood that a relationship is causal. The theory is more diffuse, more recent, and less well tested in the crucible of research experience. And in some quarters there is disdain for the issue, given the belief and practice that relationships that replicate once should be considered as general until proven otherwise, not to speak of the belief that little progress and prestige can be achieved by designing the next experiment to be some minor variant on past studies. There is no point in pretending that causal generalization is as institutionalized procedurally as other methods in the social sciences. We have tried to set the theoretical agenda in a systematic way. But we do not expect to have the last word.
There is still no explication of causal generalization equivalent to the empirically produced list of threats to internal validity and the quasi-experimental designs that have evolved over 40 years to rule out these threats. The agenda is set but not complete.

NONEXPERIMENTAL ALTERNATIVES

Though this book is about experimental methods for answering questions about causal hypotheses, it is a mistake to believe that only experimental approaches are used for this purpose. In the following, we briefly consider several other approaches, indicating the major reasons why we have not dwelt on them in detail. Basically, the reason is that we believe that, whatever their merits for some research purposes, they generate less clear causal conclusions than randomized experiments or even the best quasi-experiments such as regression discontinuity or interrupted time series.

The nonexperimental alternatives we examine are the major ones to emerge in various academic disciplines. In education and parts of anthropology and sociology, one alternative is intensive qualitative case studies. In these same fields, and also in developmental psychology, there is an emerging interest in theory-based causal studies based on causal modeling practices. Across the social sciences other than economics and statistics, the word quasi-experiment is routinely used to justify causal inferences, even though designs so referred to are so primitive in structure that causal conclusions are often problematic. We have to challenge such advocacy of low-grade quasi-experiments as a valid alternative to the quality of studies we have been calling for in this book. And finally, in parts of statistics and epidemiology, and overwhelmingly in economics and those parts of sociology and political science that draw from econometrics, the emphasis is more on control through statistical manipulation than on experimental design. When descriptive causal inferences are the primary concern, all of these alternatives will usually be inferior to experiments.

Intensive Qualitative Case Studies

The call to generate causal conclusions from intensive case studies comes from several sources. One is from quantitative researchers in education who became disenchanted with the tools of their trade and subsequently came to prefer the qualitative methods of the historian and journalist and especially of the ethnographer (e.g., Guba, 1981, 1990; and more tentatively, Cronbach, 1986). Another is from those researchers originally trained in primary disciplines such as qualitative anthropology (e.g., Fetterman, 1984) or sociology (Patton, 1980).

The enthusiasm for case study methods arises for several different reasons. One is that qualitative methods often reduce enough uncertainty about causation to meet stakeholder needs. Most advocates point out that journalists, historians, ethnographers, and lay persons regularly make valid causal inferences using a qualitative process that combines reasoning, observation, and falsificationist procedures in order to rule out threats to internal validity, even if that kind of language is not explicitly used (e.g., Becker, 1958; Cronbach, 1982). A small minority of qualitative theorists go even further to claim that case studies can routinely replace experiments for nearly any causal-sounding question they can conceive (e.g., Lincoln & Guba, 1985). A second reason is the belief that such methods can also engage a broad view of causation that permits getting at the many forces in the world and human minds that together influence behavior in much more complex ways than any experiment will uncover. And the third reason is the belief that case studies are broader than experiments in the types of information they yield.
For example, they can inform readers about such useful and diverse matters as how pertinent problems were formulated by stakeholders, what the substantive theories of the intervention are, how well implemented the intervention components were, what distal, as well as proximal, effects have come about in respondents' lives, what unanticipated side effects there have been, and what processes explain the pattern of obtained results. The claim is that intensive case study methods allow probes of an A to B connection, of a broad range of factors conditioning this relationship, and of a range of intervention-relevant questions that is broader than the experiment allows.


Although we agree that qualitative evidence can reduce some uncertainty about cause (sometimes substantially), the conditions under which this occurs are usually rare (Campbell, 1975). In particular, qualitative methods usually produce unclear knowledge about the counterfactual of greatest importance: how those who received treatment would have changed without treatment. Adding design features to case studies, such as comparison groups and pretreatment observations, clearly improves causal inference. But it does so by melding case-study data collection methods with experimental design. Although we consider this a valuable addition to ways of thinking about case studies, many advocates of the method would no longer recognize it as still being a case study. To our way of thinking, case studies are very relevant when causation is at most a minor issue; but in most other cases when substantial uncertainty reduction about causation is required, we value qualitative methods within experiments rather than as alternatives to them, in ways similar to those we outlined in Chapter 12.

Theory-Based Evaluations

This approach has been formulated relatively recently and is described in various books or special journal issues (Chen & Rossi, 1992; Connell, Kubisch, Schorr, & Weiss, 1995; Rogers, Hacsi, Petrosino, & Huebner, 2000). Its origins are in path analysis and causal modeling traditions that are much older. Although advocates have some differences with each other, basically they all contend that it is useful: (1) to explicate the theory of a treatment by detailing the expected relationships among inputs, mediating processes, and short- and long-term outcomes; (2) to measure all the constructs specified in the theory; and (3) to analyze the data to assess the extent to which the postulated relationships actually occurred. For shorter time periods, the available data may address only the first part of a postulated causal chain; but over longer periods the complete model could be involved. Thus, the priority is on highly specific substantive theory, high-quality measurement, and valid analysis of multivariate explanatory processes as they unfold in time (Chen & Rossi, 1987, 1992).

Such theoretical exploration is important. It can clarify general issues with treatments of a particular type, suggest specific research questions, describe how the intervention functions, spell out mediating processes, locate opportunities to remedy implementation failures, and provide lively anecdotes for reporting results (Weiss, 1998). All these serve to increase the knowledge yield, even when such theoretical analysis is done within an experimental framework. There is nothing about the approach that makes it an alternative to experiments. It can clearly be a very important adjunct to such studies, and in this role we heartily endorse the approach (Cook, 2000).

However, some authors (e.g., Chen & Rossi, 1987, 1992; Connell et al., 1995) have advocated theory-based evaluation as an attractive alternative to experiments when it comes to testing causal hypotheses. It is attractive for several reasons. First, it requires only a treatment group, not a comparison group whose

agreement to be in the study might be problematic and whose participation increases research costs. Second, demonstrating a match between theory and data suggests the validity of the causal theory without having to go through a laborious process of explicitly considering alternative explanations. Third, it is often impractical to measure distant end points in a presumed causal chain. So confirmation of attaining proximal end points through theory-specified processes can be used in the interim to inform program staff about effectiveness to date, to argue for more program resources if the program seems to be on theoretical track, to justify claims that the program might be effective in the future on the as-yet-not-assessed distant criteria, and to defend against premature summative evaluations that claim that an intervention is ineffective before it has been demonstrated that the processes necessary for the effect have actually occurred.

However, major problems exist with this approach for high-quality descriptive causal inference (Cook, 2000). First, our experience in writing about the theory of a program with its developer (Anson et al., 1991) has shown that the theory is not always clear and could be clarified in diverse ways. Second, many theories are linear in their flow, omitting reciprocal feedback or external contingencies that might moderate the entire flow. Third, few theories specify how long it takes for a given process to affect an indicator, making it unclear if null results disconfirm a link or suggest that the next step did not yet occur. Fourth, failure to corroborate a model could stem from partially invalid measures as opposed to invalidity of the theory. Fifth, many different models can fit a data set (Glymour et al., 1987; Stelzl, 1986), so our confidence in any given model may be small. Such problems are often fatal to an approach that relies on theory to make strong causal claims.
Though some of these problems are present in experiments (e.g., failure to incorporate reciprocal causation, poor measures), they are of far less import because experiments do not require a well-specified theory in constructing causal knowledge. Experimental causal knowledge is less ambitious than theory-based knowledge, but the more limited ambition is attainable.
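The three-step logic of theory-based evaluation (explicate a program theory, measure its constructs, test the postulated links) can be sketched on simulated data. This is a stipulated toy model, not a recommended analysis: the path coefficients, variable names, and the simple slope-based check of each link are illustrative assumptions.

```python
import random

random.seed(1)

def slope(xs, ys):
    """OLS slope of y on x (covariance over variance)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

n = 5000
# Step 1 (theory): program T is posited to raise a mediator M
# (say, attendance), which in turn raises the outcome Y.
# Step 2 (measurement): here, simulated measures of T, M, and Y.
T = [random.choice([0, 1]) for _ in range(n)]
M = [0.8 * t + random.gauss(0, 1) for t in T]
Y = [0.5 * m + random.gauss(0, 1) for m in M]

# Step 3 (analysis): check that each postulated link shows up in the data.
a = slope(T, M)  # T -> M path; the data were built with coefficient 0.8
b = slope(M, Y)  # M -> Y path; the data were built with coefficient 0.5
print(f"T->M: {a:.2f}  M->Y: {b:.2f}")
```

Even in this favorable toy case, the match between model and data is only suggestive: as the text notes, many different models could fit such data equally well, which is exactly why this kind of confirmation falls short of the uncertainty reduction an experiment provides.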

Weaker Quasi-Experiments

For some researchers, random assignment is undesirable for practical or ethical reasons, so they prefer quasi-experiments. Clearly, we support thoughtful use of quasi-experimentation to study descriptive causal questions. Both interrupted time series and regression discontinuity often yield excellent effect estimates. Slightly weaker quasi-experiments can also yield defensible estimates, especially when they involve control groups with careful matching on stable pretest attributes combined with other design features that have been thoughtfully chosen to address contextually plausible threats to validity. However, when a researcher can choose, randomized designs are usually superior to nonrandomized designs.

This is especially true of nonrandomized designs in which little thought is given to such matters as the quality of the match when creating control groups, including multiple hypothesis tests rather than a single one, generating data from several pretreatment time points rather than one, or having several comparison groups to create controls that bracket performance in the treatment groups. Indeed, when results from typical quasi-experiments are compared with those from randomized experiments on the same topic, several findings emerge. Quasi-experiments frequently misestimate effects (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). These biases are often large and plausibly due to selection biases such as the self-selection of more distressed clients into psychotherapy treatment conditions (Shadish et al., 2000) or of patients with a poorer prognosis into controls in medical experiments (Kunz & Oxman, 1998). These biases are especially prevalent in quasi-experiments that use poorly matched groups and have higher attrition (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). So, if the answers obtained from randomized experiments are more credible than those from quasi-experiments on theoretical grounds and are more accurate empirically, then the arguments for randomized experiments are even stronger whenever a high degree of uncertainty reduction is required about a descriptive causal claim.

Because all quasi-experiments are not equal in their ability to reduce uncertainty about cause, we want to draw attention again to a common but unfortunate practice in many social sciences: to say that a quasi-experiment is being done in order to provide justification that the resulting inference will be valid. Then a quasi-experimental design is described that is so deficient in the desirable structural features noted previously, which promote better inference, that it is probably not worth doing. Indeed, over the years we have repeatedly noted the term quasi-experiment being used to justify designs that fell into the class that Campbell and Stanley (1963) labeled as uninterpretable and that Cook and Campbell (1979) labeled as generally uninterpretable. These are the simplest forms of the designs discussed in Chapters 4 and 5. Quasi-experiments cannot be an alternative to randomized experiments when the latter are feasible, and poor quasi-experiments can never be a substitute for stronger quasi-experiments when the latter are also feasible. Just as Gueron (1999) has reminded us about randomized experiments, good quasi-experiments have to be fought for, too. They are rarely handed out as though on a silver platter.

Statistical Controls

In this book, we have advocated that statistical adjustments for group nonequivalence are best used after design controls have already been used to the maximum in order to reduce nonequivalence to a minimum. So we are not opponents of statistical adjustment techniques such as those advocated by the statisticians and econometricians described in the appendix to Chapter 5. Rather, we want to use them as the last resort. The position we do not like is the assumption that statistical controls are so well developed that they can be used to obtain confident results in nonexperimental and weak quasi-experimental contexts. As we saw in Chapter 5, research in the past two decades has not much supported the notion that a control group can be constructed through matching from some national or state registry when the treatment group comes from a more circumscribed and local setting. Nor has research much supported the use of statistical adjustments in longitudinal national surveys in which individuals with different experiences are explicitly contrasted in order to estimate the effects of this experience difference. Undermatching is a chronic problem here, as are consequences of unreliability in the selection variables, not to speak of specification errors due to incomplete knowledge of the selection process. In particular, endogeneity problems are a real concern. We are heartened that more recent work on statistical adjustments seems to be moving toward the position we represent, with greater emphasis being placed on internal controls, on stable matching within such internal controls, on the desirability of seeking cohort controls through the use of siblings, on the use of pretests collected on the same measures as the posttest, on the utility of such pretest measures collected at several different times, and on the desirability of studying interventions that are clearly exogenous shocks to some ongoing system. We are also heartened by the progress being made in the statistical domain because it includes progress on design considerations, as well as on analysis per se (e.g., Rosenbaum, 1999a). We are agnostic at this time as to the virtues of the propensity score and instrumental variable approaches that predominate in discussions of statistical adjustment. Time will tell how well they pan out relative to the results from randomized experiments. We have surely not heard the last word on this topic.
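The core worry about statistical controls (that adjustment works only when the selection process is well modeled) can be made concrete with a toy simulation. All numbers here are stipulated for illustration: a single confounder drives both selection into treatment and the outcome, so the naive group difference is biased, while stratifying on the confounder (a crude stand-in for propensity-score stratification, since here the confounder fully determines selection) recovers something close to the true effect.

```python
import random

random.seed(42)

TRUE_EFFECT = 2.0
units = []
for _ in range(20000):
    z = random.random()                        # confounder in [0, 1)
    treated = random.random() < 0.2 + 0.6 * z  # selection depends on z
    y = 5 * z + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)
    units.append((z, treated, y))

def mean(xs):
    return sum(xs) / len(xs)

# Naive estimate: ignores the confounder, so it mixes the treatment
# effect with the fact that treated units have higher z on average.
naive = (mean([y for z, t, y in units if t])
         - mean([y for z, t, y in units if not t]))

# Adjusted estimate: compare treated vs. untreated within narrow
# strata of z, then average the within-stratum differences.
diffs = []
for k in range(10):
    lo, hi = k / 10, (k + 1) / 10
    t_ys = [y for z, t, y in units if t and lo <= z < hi]
    c_ys = [y for z, t, y in units if not t and lo <= z < hi]
    if t_ys and c_ys:
        diffs.append(mean(t_ys) - mean(c_ys))
adjusted = mean(diffs)

print(f"naive={naive:.2f} adjusted={adjusted:.2f} true={TRUE_EFFECT}")
```

The catch is the assumption baked into the simulation: the selection variable is fully observed and reliably measured. With unmeasured or unreliably measured selection variables, the adjusted estimate would inherit exactly the undermatching and specification-error biases the text describes.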

CONCLUSION

We cannot point to one new development that has revolutionized field experimentation in the past few decades, yet we have seen a very large number of incremental improvements. As a whole, these improvements allow us to create far better field experiments than we could do 40 years ago when Campbell and Stanley (1963) first wrote. In this sense, we are very optimistic about the future. We believe that we will continue to see steady, incremental growth in our knowledge about how to do better field experiments. The cost of this growth, however, is that field experimentation has become a more specialized topic, both in terms of knowledge development and of the opportunity to put that knowledge into practice in the conduct of field experiments. As a result, nonspecialists who wish to do a field experiment may greatly benefit by consulting with those with the expertise, especially for large experiments, for experiments in which implementation problems may be high, or for cases in which methodological vulnerabilities will greatly reduce credibility. The same is true, of course, for many other methods. Case-study methods, for example, have become highly enough developed that most researchers would do an amateurish job of using them without specialized training or supervised practice. Such Balkanization of methodology is, perhaps, inevitable, though none the less regrettable. We can ease the regret somewhat by recognizing that with specialization may come faster progress in solving the problems of field experimentation.