
HANDBOOK OF METHODS IN SOCIAL AND PERSONALITY PSYCHOLOGY

Edited by HARRY T. REIS, University of Rochester

CHARLES M. JUDD, University of Colorado at Boulder

CAMBRIDGE UNIVERSITY PRESS

CHAPTER THREE

Causal Inference and Generalization in Field Settings: Experimental and Quasi-Experimental Designs

STEPHEN G. WEST, JEREMY C. BIESANZ, AND STEVEN C. PITTS

The purpose of this chapter is to introduce researchers in social psychology to designs that permit relatively strong causal inference in the field. We begin the chapter by considering some basic issues in inferring causality, drawing on work by Rubin and his associates (Holland, 1986, 1988; Rubin, 1974, 1978) in statistics and by Campbell and his associates (Campbell, 1957; Campbell & Stanley, 1966; Cook, 1993; Cook & Campbell, 1979; Shadish, Cook, & Campbell, in press) in psychology. Rubin's approach emphasized formal statistical criteria for inference; Campbell's approach emphasized concepts from philosophy of science and the practical issues confronting social researchers. These approaches are then applied to provide insights on a variety of difficult issues that arise in randomized experiments when they are conducted in the field. In addition, Cook's (1993) new perspective on the generalization of causal effects is also discussed. We then turn to consideration of three classes of quasi-experimental designs - the regression discontinuity design, the interrupted time series design, and the nonequivalent control group design - that can sometimes provide a relatively strong basis for causal inference. The application of the Rubin and the Campbell frameworks helps identify strengths and weaknesses of each design. Methods of strengthening each design type with respect to causal inference and generalization of causal effects are also considered.

The emphasis on designs for field research in the present chapter contrasts sharply with recent practice in basic social psychology, in which the modal research design consists of a randomized experiment, conducted in the laboratory, lasting no more than 1 hour, and using undergraduate students as the participants (West, Newsom, & Fenaughty, 1992). This practice has clear strengths that are articulated in other chapters in this handbook. At the same time, the recent extraordinary level of dominance of this practice potentially limits the generalization of social psychological findings in important ways. At earlier points in the history of the field, it was far easier to find examples of basic social psychological hypotheses that were tested in both laboratory and field settings (see Bickman & Henchy, 1972; Swingle, 1973, for collections of examples); currently, the majority of research conducted by social psychologists in field settings is applied in nature. The earlier interplay between laboratory and field research often provided a more convincing basis for generalization of causal effects found in basic social psychological investigations. We suggest that perhaps at least some of the current widespread perception that many articles in our leading journals "just aren't that interesting anymore" (Reis & Stiller, 1992, p. 470; see also Funder, 1992) may be occasioned by the limited contexts in which social psychological phenomena are studied and the distance of present research paradigms from the real world.

[Author note: Stephen G. West was partially supported by NIMH Grant P50-MN39246 during the writing of this chapter. We thank Tom Belin, Tom Cook, Bill Graziano, Chick Judd, Chip Reichardt, Harry Reis, Will Shadish, and the graduate students of the Fall 1997 Psychology 555 class (experimental and quasi-experimental designs in research) at Arizona State University for their comments and suggestions on an earlier version of this chapter. Correspondence should be directed to Stephen G. West, Department of Psychology, Arizona State University, Tempe, AZ 85287-1104 (e-mail: [email protected]).]

The current modal laboratory research is particularly good at establishing internal validity, that is, showing that the treatment caused the observed response. However, as complications are introduced into the



laboratory setting, establishing internal validity becomes more tenuous. Experiments conducted over repeated sessions involve attrition; requests for participants to bring peers or parents to the laboratory are not always fulfilled. Basic and applied social psychological research conducted in field settings may further magnify these potential issues. Consequently, researchers must increase their focus on articulating those threats to internal validity, which Campbell (1957) termed "plausible rival hypotheses," that may be problematic in their specific research context. A central theme of this chapter is the identification of major classes of plausible rival hypotheses associated with each of the design types and the use of specific design and analysis strategies that can help rule them out. And, as we will indicate, many of these strategies also turn out to have potential implications for traditional laboratory experiments as well.

A FRAMEWORK FOR CAUSAL INFERENCE IN EXPERIMENTS AND QUASI-EXPERIMENTS

Over the past two decades, Donald Rubin and his colleagues in statistics (Angrist, Imbens, & Rubin, 1996; Holland, 1986, 1988; Imbens & Rubin, 1997; Rosenbaum & Rubin, 1983, 1984; Rubin, 1974, 1978, 1986) have developed a very useful framework for understanding the causal effects of treatments. This framework has come to be known as the Rubin Causal Model (RCM). The use of the RCM is particularly helpful in identifying strengths and limitations of many of the experimental and quasi-experimental designs in which the independent variable is manipulated and posttest measures are collected at only one point in time following the treatment.

Design Approaches to the Fundamental Problem of Causal Inference

Consider the simple case of two treatments whose effects the researcher wishes to compare. For example, the researcher may wish to compare a severe frustration (treatment condition) and a mild frustration (comparison condition) in the level of aggressive responses they produce (see Berkowitz, 1993). Or, the researcher may wish to compare a 10-week smoking prevention program (treatment condition) and a no program group (comparison condition) on attitudes toward smoking among teenagers (see Flay, 1986).

Rubin begins with a consideration of the hypothetical ideal conditions under which a causal effect could be observed. He defines the causal effect as the difference between what would have happened to the participant under the treatment condition and what would have happened to the same participant under the control condition under identical circumstances. That is, the causal effect is defined as:

Yt(u) - Yc(u).

Here, t refers to the treatment condition, c refers to the comparison condition (often a control group), Y is the observed response (dependent measure), and u is the unit (in psychology, typically a specific participant) on which we observe the effects of the two treatments.

Rubin's definition leads to a clear theoretical statement of what a causal effect is, but it also implies a fundamental problem. "It is impossible to observe the value of Yt(u) and Yc(u) on the same unit and, therefore, it is impossible to observe the effect of t on u" (Holland, 1986, p. 947, italics in original). We cannot expose a preteenage boy to the 10-week smoking prevention program, measure his attitudes toward smoking, then return the child to the beginning of the school year, expose him to the comparison (control) program, and remeasure his attitudes toward smoking. Because of this fundamental problem of causal inference, we will not be able to observe causality directly. However, by making specific assumptions, we can develop research designs that permit us to infer causality. The certainty of the causal inference will depend strongly on the viability of the assumptions that we make. Three design approaches that address the fundamental problem of causal inference are available, which may be employed alone or in combination. After briefly presenting the first two design approaches, we focus on the third approach, randomization, because it is most often used in basic and applied social psychology.

WITHIN-SUBJECT DESIGNS. In within-subject designs, participants are exposed to treatment (i) and their responses are measured, following which they are exposed to treatment (ii) and their responses are measured. Within-subject designs make two strong assumptions. First is temporal stability, which means that the response does not depend on the time the treatment is delivered. Many factors such as normal human development, historical events, fatigue, or even daily or monthly cycles (see Cook & Campbell, 1979, chapter 2) can affect participants' responses. Second is causal transience, which means that the effects of each treatment and each measurement of the response do not persist over time. For example, in a study of the effects of a high-stress and a low-stress film clip on increases in blood pressure, the researcher would need to make the strong assumptions that exposure to the initial film clip did not affect the perception of the second clip, that the participant had fully returned to her baseline level for all bodily systems (i.e., not only blood pressure, but also physiological, motivational, and cognitive systems), and that the initial blood pressure measurement did not affect subsequent blood pressure measurements (e.g., changes in blood pressure readings from the participant's anticipation of the unpleasant constriction from the inflation of the blood pressure cuff). For studies of physical devices, such as the performance of engines under varying treatment conditions, the assumptions of temporal stability and causal transience will often be reasonable. For studies of social psychological phenomena, these assumptions will often be more problematic. Greenwald (1976), Erlebacher (1977), Rosenthal and Rubin (1980), Judd and Kenny (1981), and Keren (1993) presented fuller discussions of issues in within-subject designs.1

[Footnote 1: A hybrid approach that combines elements of the within-subject design and randomization approaches has traditionally been used in some areas of psychology. In the simplest version of such a design, half of the participants are randomly assigned to receive treatment (i) and then treatment (ii), whereas the other half of the participants are randomly assigned to receive treatment (ii) and then treatment (i). Such designs make the typically untestable assumption that there are no interactions between treatment condition and order of presentation or treatment condition and trial. This assumption will be met when there is causal transience. Statistical presentations of the assumptions and analysis of such counterbalanced designs can be found in Winer (1971) and Kirk (1995).]

UNIT HOMOGENEITY. If experimental units can be assumed to be identical in all respects, then it makes no difference which unit receives the treatment. That is, Yt(u1) = Yt(u2) and Yc(u1) = Yc(u2), where u1 and u2 represent two different units (participants). In this case, Yt(u1) - Yc(u2) [or Yt(u2) - Yc(u1)] will provide an accurate estimate of the causal effect. Again, this assumption of unit homogeneity is often made in the physical sciences and engineering, but it will almost never be reasonable in social psychology. Even monozygotic (identical) twins who are known to share identical genes can easily differ in knowledge, motivation, mood, or other participant-related factors that may affect their responses to treatment.

RANDOMIZATION. Randomization is used to approximately equate the treatment and control groups at pretest, prior to any treatment delivery. In randomization, participants are assigned to treatment conditions using a method that gives every participant an equal chance of being assigned to the treatment and control conditions.2 Common randomization methods include flipping a coin (e.g., heads = treatment; tails = control), drawing a number out of a hat (e.g., 1 = treatment; 2 = control), or using a computer to generate random numbers identifying treatment and control group assignments. Although randomization is normally straightforward in the laboratory, some field research settings present very difficult randomization problems. For example, in health care settings participants often arrive when they become ill; facilities to deliver the experimental treatment, the control treatment, or both may or may not be available at that time. A wide variety of different methods reviewed by Boruch (1997) and Shadish et al. (in press) have been developed to address complex randomization problems.

[Footnote 2: In fact, the probabilities of being assigned to the treatment and control conditions may differ. For example, the probability of assignment to the treatment group might be .25 and the probability of assignment to the control group might be .75 for each participant. Unequal allocation of participants to treatment and control groups is normally used when the cost or difficulty in implementing one of the treatment conditions is substantially greater than for the other. We will assume equal allocation of participants to treatment and control groups here. Given usual statistical assumptions underlying hypothesis testing, this equal allocation strategy maximizes the power of the test of the null hypothesis.]

Following random assignment to treatment conditions, each participant then receives the treatment condition (e.g., experimental treatment vs. comparison [control] treatment) to which he or she was assigned. Then, the responses of each participant are measured after the receipt of the treatment condition. These procedures are characteristic of the basic randomized experiment used in both laboratory and field settings in social psychology.

Random assignment means that the variable representing the treatment condition (treatment vs. control) can be expected on average to be independent of any measured or unmeasured variable prior to treatment. This outcome is reflected in two closely related ways.

1. At pretest, prior to any treatment delivery, the levels in the treatment and control groups will, on average, be equal for any measured or unmeasured variable. More formally, the expected values for the two groups (Group 1 and Group 2) will be equal, E(Y1) = E(Y2), for any variable Y. At pretest, the demographic characteristics, the attitudes, the motivations, the personality traits, the abilities, as well as any other participant attributes can, on average,

be expected to be equal in the treatment and control groups.

2. The treatment variable (hereafter referred to as t for treatment and c for comparison) will, on average, be unrelated to any measured or unmeasured participant variable prior to treatment. One implication of this is that the correlation of the treatment variable and any participant variable measured prior to treatment delivery will, on average, be 0.

These two outcomes allow the RCM to provide a new definition of the causal effect that is appropriate for randomized experiments. The estimate of the causal effect is now defined as:

Ȳt - Ȳc,

where Ȳt is the mean response of the treatment group and Ȳc is the mean response of the control group. Three observations are in order. First, the comparison must now shift to the average response for the group of participants receiving the experimental treatment compared with the average response for the group of participants receiving the control treatment. Causal inferences may no longer be made about individual participants. Second, the two outcomes are expectations: They are what occurs "on average." Exact equivalence of the pretest means of variables and exact independence of the treatment and pretest measures of participant variables does not occur routinely in practice. With simple randomization, exact pretest equivalence and exact independence of the treatment and background variables occur only if randomization is applied to a very large (n → ∞) population or the results are averaged across a very large number (all possible) of different randomizations of the same sample. In any given real experiment, there is no guarantee that pretest means will not differ on important variables, in which case the estimate of the causal effect, Ȳt - Ȳc, will be too low or too high. Randomization replaces definitive statements about causal effects with probabilistic statements based on theory. Third, for causal inference we need to assume that neither the randomization process itself nor participants' potential awareness of other participants' treatment conditions influence the participants' responses. This assumption, known as the stable-unit-treatment-value assumption, is considered in more detail following the presentation of an illustrative example to help make some of these ideas concrete.

Illustrative Example: Randomization and Rubin's Causal Model

In Table 3.1 we present an example data set that we will use throughout this section of the chapter. There are 32 participants. The example is constructed based on Rubin's ideal case: The response of each participant is observed under both the control (Yc column) and treatment (Yt column) conditions. The causal effect, Yt - Yc, has a constant size of 0.5 for each participant. The mean in the control group is 3.0, the mean in the treatment group is 3.5, and the standard deviation for the example is 1.0. The distributions within each group are roughly normal. Note that in this example the standardized measure of effect size is d = (3.5 - 3.0)/1.0 = 0.50. This standardized effect size is described by Cohen (1988) as moderate, and he believes it represents a typical effect size found in the behavioral sciences. By way of comparison, some recent meta-analyses have found mean effect sizes of d = 0.54 for the effect of alcohol on aggression (Ito, Miller, & Pollock, 1996), d = 0.34 for the effect of prevention programs on improved mental health outcomes for children (Durlak & Wells, 1997), and d = 0.59 for the relation between confidence and accuracy in studies of eyewitness identification (Sporer, Penrod, Read, & Cutler, 1995).

TABLE 3.1. Illustration: Estimating the Causal Effect - Ideal Case

Participant   a1   a2   a3   Yc    Yt
     1         0    1    0    1    1.5
     2         1    0    0    2    2.5
     3         1    1    1    2    2.5
     4         1    0    0    2    2.5
     5         0    0    0    2    2.5
     6         0    1    0    3    3.5
     7         1    1    0    3    3.5
     8         1    0    1    3    3.5
     9         0    1    1    3    3.5
    10         0    1    0    3    3.5
    11         0    1    1    3    3.5
    12         1    0    0    4    4.5
    13         0    0    1    4    4.5
    14         0    1    1    4    4.5
    15         1    1    1    4    4.5
    16         0    0    1    5    5.5
    17         1    1    1    1    1.5
    18         1    1    0    2    2.5
    19         0    0    0    2    2.5
    20         1    0    1    2    2.5
    21         1    1    0    2    2.5
    22         0    1    0    3    3.5
    23         1    1    1    3    3.5
    24         1    1    1    3    3.5
    25         0    0    1    3    3.5
    26         0    0    0    3    3.5
    27         1    1    1    3    3.5
    28         1    0    0    4    4.5
    29         1    0    0    4    4.5
    30         0    0    1    4    4.5
    31         0    0    1    4    4.5
    32         0    0    0    5    5.5

Note: The column labeled Yc contains the true response of each participant in the control condition. The column labeled Yt contains the true response of each participant in the treatment condition. a1, a2, and a3 represent three different random assignments of 16 participants to the control group and 16 participants to the treatment group. In each random assignment, 0 means the participant was assigned to the control group and 1 means the participant was assigned to the treatment group. The mean of all 32 participants under the control condition is 3.0, the mean of all 32 participants under the treatment condition is 3.5, and the standard deviation of each condition is 1.0. The distribution of each group is approximately normal. The causal effect for each participant is 0.5, corresponding to a moderate effect size.

Also presented in Table 3.1 are three columns labeled a1, a2, and a3. These represent three different actual random assignments of the 32 participants. In each case, 16 participants are assigned to the control condition, and 16 participants are assigned to the treatment condition. In general, there are (2n choose n) different possible random assignments, where n is the number of participants in each treatment group (Cochran & Cox, 1957). In the present example, there are (32 choose 16) = 32!/(16! 16!) = (32 × 31 × 30 × ... × 2 × 1)/[(16 × 15 × ... × 2 × 1)(16 × 15 × ... × 2 × 1)] combinations, or over 600 million different possible random assignments.

The result of random assignment is that we will be able to observe only one of the two responses for each participant. Table 3.2 illustrates this result for randomization a1. We see that for Participant 1, the response is observed under the control condition, but not the treatment condition; for Participant 2, the response is observed under the treatment, but not the control condition; and so on. The black squares represent the unobserved response for each participant.

TABLE 3.2. Illustration: Estimating the Causal Effect - Randomization a1

Participant   a1   Yc    Yt
     1         0    1    ■
     2         1    ■    2.5
     3         1    ■    2.5
     4         1    ■    2.5
     5         0    2    ■
     6         0    3    ■
     7         1    ■    3.5
     8         1    ■    3.5
     9         0    3    ■
    10         0    3    ■
    11         0    3    ■
    12         1    ■    4.5
    13         0    4    ■
    14         0    4    ■
    15         1    ■    4.5
    16         0    5    ■
    17         1    ■    1.5
    18         1    ■    2.5
    19         0    2    ■
    20         1    ■    2.5
    21         1    ■    2.5
    22         0    3    ■
    23         1    ■    3.5
    24         1    ■    3.5
    25         0    3    ■
    26         0    3    ■
    27         1    ■    3.5
    28         1    ■    4.5
    29         1    ■    4.5
    30         0    4    ■
    31         0    4    ■
    32         0    5    ■

Note: The column labeled Yc contains the true response of each participant in the control condition. The column labeled Yt contains the true response of each participant in the treatment condition. a1 represents the random assignment of 16 participants to the control group and 16 participants to the treatment group. ■ means the response was not observed.

In Table 3.3, we show the calculations of the causal effect for each of the three randomizations, a1, a2, and a3. The example is constructed so that the true standard deviation (σ = 1) is known for both the treatment and control conditions for the full set of 32 participants in this ideal case, so we can use the z-test, rather than the more typical t-test, for two independent groups. The true causal effect of treatment in the population is 0.50. However, note that the estimates of the causal effect in the three samples are 0.0, -0.125, and 0.875. We can construct 95% confidence intervals (CI) around these estimates by applying the formula:

CI = (Ȳt - Ȳc) ± (1.96)(SE),

where SE is the standard error, which is equal to √[σ²(1/nt + 1/nc)]. For randomization 1, CI = 0.0 ± (1.96)(0.354) = (-0.69, 0.69).

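The estimates and the confidence interval discussed above can be reproduced with a short script of our own (it is not part of the original chapter); the Yc, Yt, a1, a2, and a3 values below are taken from Table 3.1:

```python
import math

# Potential outcomes following Table 3.1: each participant's treatment
# response (Yt) is exactly 0.5 above his or her control response (Yc).
Yc = [1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5,
      1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5]
Yt = [y + 0.5 for y in Yc]

# The three random assignments from Table 3.1 (1 = treatment, 0 = control).
a1 = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
      1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
a2 = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
      1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
a3 = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
      1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

def estimate(assignment):
    """Observed treatment-group mean minus observed control-group mean."""
    t = sum(yt for yt, a in zip(Yt, assignment) if a == 1) / assignment.count(1)
    c = sum(yc for yc, a in zip(Yc, assignment) if a == 0) / assignment.count(0)
    return t - c

# Only one response per participant enters each mean, yet the estimates
# scatter around the true causal effect of 0.5.
print([estimate(a) for a in (a1, a2, a3)])   # [0.0, -0.125, 0.875]

# 95% confidence interval for a1, with sigma = 1 known and 16 per group.
se = math.sqrt(1 * (1 / 16 + 1 / 16))        # 0.354
ci_a1 = (estimate(a1) - 1.96 * se, estimate(a1) + 1.96 * se)

# Number of possible assignments of 16 participants to each group.
print(math.comb(32, 16))                     # 601080390, i.e., over 600 million
```

Note that the a1 interval (-0.69, 0.69) covers the true effect of 0.5 even though the point estimate is 0.0.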

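The behavior of the estimator across randomizations (its unbiasedness and its sampling variability) can likewise be approximated by brute force, re-randomizing the 32 participants of Table 3.1 many times. This simulation is our own illustration, not code from the chapter:

```python
import random
import statistics

# Potential outcomes from Table 3.1 (constant causal effect of 0.5).
Yc = [1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5,
      1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5]
Yt = [y + 0.5 for y in Yc]

def random_estimate():
    """Draw one random 16/16 assignment and return the estimated causal effect."""
    assignment = random.sample([1] * 16 + [0] * 16, 32)  # random permutation
    t = statistics.mean(yt for yt, a in zip(Yt, assignment) if a == 1)
    c = statistics.mean(yc for yc, a in zip(Yc, assignment) if a == 0)
    return t - c

random.seed(3)
estimates = [random_estimate() for _ in range(20000)]

# Unbiasedness: the estimates center on the true effect of 0.5 ...
print(round(statistics.mean(estimates), 2))   # close to 0.5
# ... but any single randomization scatters widely around it (SD near 0.35),
# and only about 30% of randomizations exceed the 0.69 rejection criterion.
print(round(statistics.stdev(estimates), 2))
print(sum(e > 0.69 for e in estimates) / len(estimates))
```

The scatter of these 20,000 estimates approximates the sampling distribution pictured in Figure 3.1.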

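The roughly 30% rejection rate can also be obtained in closed form from the normal distribution. This is a sketch of our own; normal_cdf is a small helper built on the standard error function, not a function used in the chapter:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Running example: true effect 0.5, sigma = 1, 16 participants per group.
se = math.sqrt(1.0 * (1 / 16 + 1 / 16))   # 0.354
criterion = 1.96 * se                     # 0.69: smallest difference that rejects H0

# Power of the z-test: probability that the observed mean difference
# exceeds the criterion when the true difference is 0.5 (upper tail).
power = 1.0 - normal_cdf((criterion - 0.5) / se)
print(round(power, 2))  # about 0.29, i.e., roughly 30% of experiments reject H0
```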

[Figure 3.1. Distribution of Ȳt - Ȳc. X-axis: values of Ȳt - Ȳc, from -1 to 2. Note: The mean of the sampling distribution is 0.5. The standard deviation of the sampling distribution is 0.354. Ȳt - Ȳc must exceed 1.96 × 0.354 = 0.69 to reject the null hypothesis of no difference between μt and μc in the population. The experimenter will correctly reject the null hypothesis about 30% of the time.]

[...] had a negative effect on the response.3 Only about 30% of the estimates (i.e., those > 0.69) would lead to correct rejection of the null hypothesis of no causal effect. Indeed, had we been in the more usual situation and used a t-test rather than a z-test because we did not know the population standard deviation, this value would have been slightly lower, about 28%. This value is known as the statistical power of the test, the probability of rejecting the null hypothesis when it is false. We consider the issue of statistical power in more detail later in the chapter.

According to the traditional null hypothesis significance testing view,4 if the null hypothesis can be rejected as being unlikely (typically p < .05), then we can conclude that an effect of the treatment exists. Ȳt - Ȳc provides an unbiased estimate of the causal effect of the treatment. Once again, this estimate is accurate on average; however, there is no guarantee that the estimate will be accurate in any specific experiment - the value may be too high or too low. These considerations highlight the value of replications and meta-analyses (see, e.g., Hedges & Olkin, 1985; Hunter & Schmidt, 1990; Johnson & Eagly, this volume, Ch. 19) in establishing more accurate and more certain estimates of causal effects.

[Footnote 3: Researchers typically only see the causal effect estimate from their own single experiment. They have worked hard designing the experiment, recruiting and running the participants, and analyzing the data, and consequently have great confidence in the results of their single experiment. When another researcher fails to find the same result, it is easy to attribute his lack of findings to methodological problems (the "crap research" hypothesis; Hunter, 1996). However, Hunter argued that meta-analyses of several research areas have suggested that methodological quality accounts for relatively little of the variability in the causal effects. Rather, simple sampling variability of the type illustrated here appears to be the main source of the variability in estimates of causal effects (Hunter, 1996).]

[Footnote 4: At the time of this writing, intense reconsideration of hypothesis testing procedures is taking place among methodologists (see Harlow, Mulaik, & Steiger, 1997; Wilkinson and the Task Force on Statistical Inference, 1999). Our belief is that traditional null hypothesis testing procedures will likely be revised or replaced by an alternative procedure during the next few years. One major alternative is that researchers will be asked to report confidence intervals for all effect estimates. Confidence intervals that do not include 0 lead to the same conclusions as null hypothesis significance tests with respect to ruling out chance as a plausible interpretation of the results. But confidence intervals also provide information about the size of the treatment effect. Standardized and unstandardized measures of the size of treatment effects are likely to become increasingly important in many areas of psychology in the future (Cohen, Cohen, Aiken, & West, in press).]

FIELD EXPERIMENTS

Both basic and applied social psychological experiments may be carried out in field settings. The defining characteristics of the field experiment closely follow those of the laboratory experiment in social psychology (Aronson, Wilson, & Brewer, 1998; Smith, this volume, Ch. 12) - random assignment, manipulation of the treatment conditions, and measurement of the dependent variable. However, additional issues arise because of differences between the field and laboratory

contexts. Below we first briefly present two simplified examples of field experiments, then identify some of the new issues that arise as well as possible solutions for those issues.

Illustrations of Field Experiments

Some field experiments test hypotheses derived from basic social psychological theory. For example, Cialdini, Reno, and Kallgren (1990) conducted a series of field experiments to examine the impact of norms on littering. In one study on the impact of descriptive norms (what most others do), solitary dormitory residents retrieving their mail (which contained a flier) were confronted with an environment containing either no litter, a single conspicuous piece of litter, or many pieces of litter. More residents in the heavily littered environment condition discarded the flier than in the other two conditions. However, fewer residents in the single piece of litter condition littered than in the clean environment condition. Cialdini et al. argued that the single piece of litter invoked the descriptive norm against littering.

Other field experiments test hypotheses from applied social psychology. For example, Evans et al. (1981) evaluated the effectiveness of a program to deter smoking among junior high school students in Houston. The program built on social psychological theory and research (e.g., Bandura, 1977; McGuire, 1964). The program included three major components: (a) the social pressures created by peers, family, and the media to smoke; (b) the immediate negative physiological effects of smoking; and (c) methods of resisting pressures to smoke. The program was intended to be of 3 years' duration. Entire schools were assigned to program (treatment) and no treatment (control) conditions. Although the great majority of students participated, informed consent requirements resulted in a sample of self-selected volunteers. Complex methodological issues arose in this experiment: how to assign the small number of units (schools) to treatment conditions, how to analyze data from students in the treatment condition who received from 1 to 3 years of the program because of school transfers, and how to address the loss of participants at the posttest measurement. At the end of the 3-year experiment, students in the smoking prevention program reported lower intentions to start smoking and lower rates of cigarette smoking than did control students. A number of experiments evaluating other smoking prevention programs derived from Evans et al.'s original work have shown generally positive results in preventing or delaying the onset of smoking in preteenagers (Flay, 1986).

Problems of Randomized Experiments and Their Remedies

THE STABLE-UNIT-TREATMENT-VALUE ASSUMPTION (SUTVA). Randomization yields unbiased estimates of the causal effect given that the SUTVA is met. SUTVA has two parts. First, the assignment mechanism, here randomization, should not affect the participants' response to the experimental or control treatment. Although we normally consider randomization to be an inert process with respect to the participant's response, consider the following thought experiment. A student volunteers for an exciting new graduate training program in which she is handsomely funded each summer of her graduate career to work with a different professor of her choice in her speciality area at any university in the United States. She is informed that she has been chosen for this program on the basis of (i) her extraordinary merit, or (ii) random assignment. Would the participant's response (e.g., measured achievement in research) to the same program be identical in each case? At least some social psychological theorizing (e.g., Weiner, 1974) would predict that a student who believed good luck was responsible for her program participation might show lower achievement than the same student who believed her own high levels of ability and effort were responsible for her program participation.

Second, the response of the participant should not be affected by the treatments other units receive. Knowledge of other treatments may change the participants' level of motivation, particularly for those in the control group, a problem Higginbotham, West, and Forsyth (1988) termed "atypical reactions of control participants." For example, participants assigned to the control group in a randomized experiment of a promising new treatment for a serious disease may give up their normal health-maintaining activities, leading to poorer outcomes in the control group than normal (resentful demoralization). Alternatively, participants assigned to a well-liked standard (comparison) program may try harder if the existence of the standard program may be in jeopardy as a result of a positive outcome for a new, experimental program (compensatory rivalry). Such atypical responses of control participants can lead to overestimates or underestimates of causal effects, respectively. The participant's response can also be affected in other ways when the response involves a comparative process. For example, the causal effect of a job training program relative to a no training control group may be overestimated in a small community if most of the limited pool of available jobs are then taken by trainees, so that there are few, if any, of the usual number of possible positions available for control group members.

Researchers have developed a variety of strategies to increase the likelihood that SUTVA is met, of which the following three are the most commonly used.

1. If participants are not aware of the random assignment and are only aware of the nature of the experimental condition in which they participate, the likelihood that SUTVA will be met is greatly increased. Such conditions can typically be achieved in laboratory settings5 and some field settings. In particular, geographical or temporal isolation of the participants in the treatment and control conditions can often help minimize these problems in field settings. However, a caution is in order here: Informed consent procedures sometimes mandate informing participants that they will be randomly assigned to one of several treatment conditions whose nature is outlined in the consent form. Such procedures may potentially lead to violations of SUTVA.

2. Successful masking (blinding) procedures in which the participants (and ideally the experimenter as well) are kept unaware of their treatment condition also increase the likelihood that SUTVA will be satisfied. [...] prisoners unaware that they have received financial assistance would eliminate any positive effects this treatment could be expected to have. Further, treatment masking is not always successful: Participants sometimes see through even the best masking procedures on the basis of outcomes they experience early in the experiment (Meier, 1991). Even in the most carefully conducted drug trials, participants sometimes can identify the medication they have been assigned based on factors such as its taste, early positive effects, and early side effects. Participants who have strong positive (or negative) expectations about the effects of that drug may then show increased positive (or negative) responses to the drug.

3. The specific treatments do not represent outcomes or opportunities that are important to participants. Participants assigned to a 6-person simulated jury in a laboratory experiment are unlikely to change their responses on the basis of their knowledge that other participants have been assigned to a 12-person simulated jury. In contrast, released prisoners in the control condition may well be angered or demoralized if they learn that other former prisoners in the treatment condition receive 6 months of transitional aid to cover basic living expenses. An alternative strategy is to offer control (comparison) participants an equally attractive program that is not expected to affect the response. Bryan, Aiken, and West (1996) presented an example of this design, comparing the effectiveness of an STD- (sexually transmitted dis-
Such procedures are commonly used ease) prevention program with an equally attractive in drug research in which participants are not in- stress-reduction comparison program. The stress- formed whether they are receiving an active drug reduction comparison program did not include any or an inert placebo. However, masking procedures content that focused on increasing condom use and often cannot be successfully applied in real world was thus not expected to influence the response of experiments. For example, consider an experiment interest, condom use. (Mallar & Thornton, 1978) in which released prison- ers are given transitional financial aid for 6 months BREAKDOWN OF RANDOMIZATION. For the ran- (treatment) or no financial aid, and the researchers domized experiment to yield an unbiased estimate recorded whether they return to jail for property of the causal effect, Y, -Yc, random assignment of crimes during the following year. Keeping released participants to treatment and comparison conditions must, in fact, be properly carried out. Studies of large- 5 SUTVA may be violated in laboratory experiments through scale randomized field experiments, particularly those prior communication of information about the experiment conducted at multiple sites, suggest that full or par- among potential participants. The little research to date (e.g., tial breakdowns of randomization occur with some Aronson, 1966) suggeststhat this may be a minor problem, (Boruch, McSweeny, & Soderstrom, 1978; at least among unacquainted participants, given adequate Conner, 1977). Problems tend to occur more frequently debriefing following the experiment. 
Potentially more prob- when the individuals responsible for delivering the lematic, but little researched, is the prevalence and effects of communication through informal social networks of ac- treatment, such as school or medical personnel, are quaintances who may seek information prior to signing up allowed to carry out the assignment of participants for a particular experiment. to treatment conditions and when monitoring of the CAUSAL INFERENCE AND GENERALIZATION IN FIELD SETTINGS 49

maintenance of the treatment assignment is poor. For example, Kopans (1994) reviewed the large recent Canadian experiment on the effectiveness of mammography screening for reducing deaths from breast cancer. He presented data suggesting that women in the mammography group had a substantially higher cancer risk at pretest than women in the no mammography screening group. It seems likely that some of the physicians involved in the trial saw to it that their patients with family histories of breast cancer or prior episodes of breast-related disease were assigned to the screening group.

One way to address this problem is through careful monitoring of the randomization process as well as careful monitoring of the treatment(s) each participant actually receives following randomization (Braucht & Reichardt, 1993). For example, students (or their parents) in school-based experiments are sometimes able to agitate successfully to change from a control to a treatment class during the randomization process itself or during the school year following randomization. Careful monitoring can help minimize this problem and can also allow formal assessment of the magnitude of the problem. If there is a systematic movement of children between treatment conditions (e.g., the brighter children are moved to the treatment group), the estimate of the causal effect of treatment will potentially be biased. Attempts to correct for such bias need to be undertaken.

An alternative strategy to minimize breakdowns of randomization is to use units that are temporally or geographically isolated in the experiment. For example, randomization breakdowns in school-based experiments, particularly those which occur postassignment, are far more likely when different treatments are given to different classrooms (low isolation of units) than when different treatments are given to different schools (high isolation of units).

GROUP ADMINISTRATION OF TREATMENT. Often in field research interventions are offered to groups of participants. For example, Evans et al. (1981) delivered their smoking prevention intervention to intact school classrooms; Aiken, West, Woodward, Reno, and Reynolds (1994) delivered a program encouraging compliance with American Cancer Society guidelines for regular screening mammograms to women's groups in the community; and Vinokur, Price, and Caplan (1991) recruited unemployed individuals to participate in a job-seeking skills training program that was delivered in a group format. Group administration of treatments, whether in the laboratory or the field, leads to statistical and conceptual issues that need to be considered.

When treatments are delivered to groups, the entire group is assigned to either the treatment or control condition. Thus, randomization occurs at the level of the group and not the individual participant. The statistical outcome of this procedure is that the responses of the members of each treatment group may no longer be independent. As an illustration, consider the Evans et al. (1981) smoking prevention experiment. The responses of 2 children randomly chosen from a single classroom would be expected to be more similar than the responses of 2 children randomly chosen from different classrooms (Kashy & Kenny, this volume, Ch. 17). Although nonindependence has no impact on the causal effect estimate, Yt − Yc, it does lead to estimates of the standard error of this effect that are typically too small.6 The magnitude of this problem increases as the amount of dependence (measured by the intraclass correlation) and the size of the groups to which treatment is delivered increase. For example, Barcikowski (1981) showed that, even with relatively low levels of dependency in groups (intraclass correlation = .05), when groups were of a size typical of grade school classroom studies (n = 25 per class), the Type I error rate (rejecting the null hypothesis when in fact it is true) was in fact .19 rather than the stated value of .05.

Following Fisher (1935), researchers traditionally "solved" this problem by aggregating their data to the level of the group (e.g., average response of each classroom). The unit of analysis should match the unit of assignment: "analyze them as you've randomized them" (Fisher, as cited in Boruch, 1997, p. 195). However, this solution is often not fully satisfactory because such analyses can be generalized only to a population of groups, not to a population of individuals, which is typically the researcher's interest. Over the past decade, new statistical procedures termed "hierarchical linear (random coefficient, multilevel) models" have been developed. These models simultaneously provide an estimate of the causal effect at the group level as well as individual level analyses that appropriately correct standard errors for the degree of nonindependence within groups. Introductions to these models are presented in Bryk and Raudenbush (1992), Hedeker, Gibbons, and Flay (1994), and Kreft and DeLeeuw (1998).

6 In randomized field experiments, participants are nearly always more similar within than between groups. In other situations involving within-subject designs or other forms of dependency, the direction of bias may change (see Kenny & Judd, 1986).
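The Type I error inflation documented by Barcikowski (1981) is easy to demonstrate by simulation. The following sketch is ours, not Barcikowski's design: the number of classes per condition, the number of replications, and the random seed are illustrative choices. It generates clustered data with an intraclass correlation of .05 and no true treatment effect, then applies a naive individual-level t-test that ignores the clustering:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_rejection_rate(n_classes_per_arm=5, class_size=25, icc=0.05,
                            n_sims=2000, alpha=0.05):
    """Share of 'significant' naive t-tests when the true effect is zero."""
    # Variance components chosen so that between / (between + within) = icc.
    sd_between, sd_within = np.sqrt(icc), np.sqrt(1.0 - icc)

    def one_arm():
        # Each child's score = shared classroom effect + individual noise.
        class_effects = rng.normal(0.0, sd_between, n_classes_per_arm)
        children = rng.normal(0.0, sd_within, (n_classes_per_arm, class_size))
        return (class_effects[:, None] + children).ravel()

    rejections = 0
    for _ in range(n_sims):
        # Naive analysis: individual-level t-test that ignores clustering.
        _, p = stats.ttest_ind(one_arm(), one_arm())
        rejections += p < alpha
    return rejections / n_sims

rate = simulate_rejection_rate()
print(f"nominal alpha = .05; actual Type I error rate about {rate:.2f}")
```

With these settings the naive test rejects a true null several times more often than the nominal 5%, in the general neighborhood of the .19 figure quoted above.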

Given the proper use of hierarchical linear models to analyze the data, the random assignment of groups to treatments (when participants are also randomly assigned to the individual groups) rules out traditional sources of bias, known as ecological bias, associated with aggregation of units (Greenland & Morgenstern, 1989; Robinson, 1950). However, a conceptual issue remains. Rubin's causal model emphasizes that causal effects represent the comparison of one well-articulated treatment with another well-articulated treatment. In comparisons among treatments delivered to individuals in group settings, the articulation of the treatment becomes murkier as treatment now includes the other individuals in the setting and all of the activities of members within the group. Cronbach (1976) and Burstein (1980) discussed many of the issues and opportunities associated with multilevel designs, although their recommendations of analytic strategies have been superseded by the hierarchical linear modeling approaches noted previously. Draper (1995), Holland (1989), Burstein (1985), and Burstein, Kim, and Delandshere (1989) considered many of the inferential issues associated with hierarchical linear modeling.

TREATMENT NONCOMPLIANCE. When participants are randomly assigned to treatment and control conditions in field experiments, not all participants may actually get the treatment. A portion of the participants randomly assigned to programs designed to increase exercise, improve nutrition, or improve job-seeking skills simply do not show up for program sessions, and so do not receive the treatment. These participants are referred to as treatment noncompliers. Practical methods exist for minimizing this problem, notably making the program attractive to participants, removing barriers to program attendance (e.g., providing transportation or child care), giving participants incentives for program attendance, and only including those participants who are willing to participate in both the treatment and control programs in the randomization (Cook & Campbell, 1979). Despite these efforts, some participants assigned to the treatment condition never receive any treatment. We assume in this section that the researcher was able to measure the dependent variable on all participants, including those who do not receive treatment.

Three statistical approaches have been taken in response to this problem. The first, known as the intention to treat analysis (Lee, Ellenberg, Hirtz, & Nelson, 1991), is to compare the mean response of all participants assigned to the treatment condition (regardless of whether they received treatment) with the mean response of all participants assigned to the control condition. This analysis typically yields conservative estimates of the causal effect and requires no assumptions beyond those required for the randomized experiment. The second is to throw out all participants assigned to the treatment group who do not in fact receive treatment. Such a comparison will yield a biased estimate of the causal effect (with the direction of bias being unknown) unless the stringent assumption can be made that the participants who drop out of the treatment condition represent a random sample of the participants in that condition.7 The third, known as the complier average causal effect (CACE), uses ideas from instrumental variables and missing data theory (Little & Rubin, 1987) to create an unbiased estimate of the treatment effect for participants who actually receive the treatment (Angrist et al., 1996; Little & Yau, 1998; see also Bloom, 1984). Both the first and the third approaches produce meaningful estimates of treatment effects; however, they answer different questions (see West & Sagarin, 2000). The intention to treat analysis estimates the causal effect in the entire sample, whereas CACE estimates the causal effect only for those participants who actually receive the treatment.

7 The approach of throwing out participants has been the standard procedure in laboratory experiments in social psychology when technical problems arise or when participants are suspicious or uncooperative. The possibility that this procedure introduces potential bias should always be considered.

To understand these three approaches, let us consider the data presented in Table 3.4. Table 3.4 uses the same data as Table 3.1 with some added features. First, in assignment 4 (a4) Participants 1-16, who are assigned to the control group, are assumed to be perfectly matched with Participants 17-32, who are assigned to the treatment group. In this assignment Participant 1 is identical to Participant 17, Participant 2 is identical to Participant 18, and so on, so that we do not need to consider the effects of sampling error. Second, we have indicated a systematic pattern of noncompliance. In column C1, participants with the lowest scores prior to treatment do not comply with the treatment. These participants are termed "never takers" in Angrist et al.'s (1996) model.

The use of the RCM, which emphasizes the comparison of the same unit receiving the treatment and the control conditions, highlights an easily overlooked point. Participants 1-5 in the control group are identical to the never takers in the treatment group (Participants 17-21) in Table 3.4. They are participants who would not have complied with the treatment if they had been assigned to the treatment group. All standard analyses that throw out noncompliers fail to take into account this group of participants who would fail to comply if they were given the opportunity. As will be illustrated below, the failure to consider this group of participants potentially yields biased estimates of treatment effects.

TABLE 3.4. Illustration of Effects of Treatment Noncompliance

Participant   a4   C1   Yci    Yti
 1            0    0    1*     1.5
 2            0    0    2*     2.5
 3            0    0    2*     2.5
 4            0    0    2*     2.5
 5            0    0    2*     2.5
 6            0    1    3*     3.5
 7            0    1    3*     3.5
 8            0    1    3*     3.5
 9            0    1    3*     3.5
10            0    1    3*     3.5
11            0    1    3*     3.5
12            0    1    4*     4.5
13            0    1    4*     4.5
14            0    1    4*     4.5
15            0    1    4*     4.5
16            0    1    5*     5.5
17            1    0    1*     1.5
18            1    0    2*     2.5
19            1    0    2*     2.5
20            1    0    2*     2.5
21            1    0    2*     2.5
22            1    1    3      3.5*
23            1    1    3      3.5*
24            1    1    3      3.5*
25            1    1    3      3.5*
26            1    1    3      3.5*
27            1    1    3      3.5*
28            1    1    4      4.5*
29            1    1    4      4.5*
30            1    1    4      4.5*
31            1    1    4      4.5*
32            1    1    5      5.5*

Note. The column labeled Yci contains the true response of each participant in the control condition. The column labeled Yti contains the true response of each participant in the treatment condition. a4 represents the assignment of the first 16 participants to the control group and the second 16 participants to the treatment group. C1 = 1 means the participant follows the treatment or control condition as assigned; C1 = 0 means the participant is a never taker and does not comply when in the treatment condition. The starred value of Y is the value actually observed for each participant. As before, the true causal effect, Yti − Yci, is 0.5.

Applying these three statistical approaches to the present example, the intention to treat analysis compares the observed data (indicated with an asterisk in Table 3.4) for participants assigned to the treatment group with the observed data for participants assigned to the control group. The mean for the control group is as before, Yc = 3.0. However, in the treatment group Participants 17-21 did not comply and thus received no benefit from treatment. Consequently, the treatment group mean is correspondingly reduced, Yt = 3.344. The causal effect estimate, Yt − Yc, is 0.344, more than a 30% reduction in the effect size from the true value for compliers of 0.50.

Following the second approach, Participants 17-21 are eliminated from the analysis and the mean for the treatment group is Σ(i=22..32) Yi / 11 = 4.045. Thus, the causal effect estimate is Yt − Yc = 4.045 − 3.000 = 1.045, which in this case is considerably larger than the true value of 0.5. Finally, using the CACE approach (Little & Yau, 1998), we eliminate the never takers from both the treatment and control groups. We find that

Yt = Σ(i=22..32) Yi / 11 = 4.045  and  Yc = Σ(i=6..16) Yi / 11 = 3.545,

so that the estimate of the causal effect, Yt − Yc, is 0.5, which is equal to the true effect for compliers.

The conceptual problem with CACE is that we cannot identify which participants in the control group would comply if they were in the treatment group. However, given a randomized experiment and SUTVA, an unbiased estimate of CACE can be calculated. In the simplest case (Bloom, 1984), in which we assume the treatment effect is constant for all participants, the causal effect estimate, Yt − Yc, from the intention to treat analysis (.344) can be adjusted by the inverse of the proportion of compliers in the treatment group, (11/16)^-1. In the present example, this effect can be calculated as CACE = (.344)(16/11) = 0.5. Angrist et al. (1996) and Little and Yau (1998) provided more advanced statistical discussions of the CACE approach. Vinokur, Price, and Caplan (1991) provided an empirical illustration.

The CACE approach makes several assumptions. For field experiments in social psychology, the most important of the assumptions is that the response of a never taker participant who does not comply with the treatment will be identical to the response of the same
participant in the control group. This assumption implies that the treatment received by noncompliers in the treatment group must be identical to the treatment received by participants in the control group. It will only be met in designs in which the control group represents "no treatment" or in which the control group represents a base treatment (t_base, which everyone receives) to which one or more additional components are added in the treatment group (t_base + t_additional; the constructive research strategy; see West & Aiken, 1997). Designs in which an alternative treatment is used as the comparison group will violate this assumption.

The assumption will also be violated in experiments in which participants partially comply with treatment (e.g., they attend 5 of 20 required treatment sessions). Attempts to adjust intention to treat causal effect estimates based on careful measures of the degree of compliance of participants in both the treatment and control groups have thus far not been fully satisfactory. They require either an excellent model of the determinants of compliance in both the treatment and control groups or the use of a successful double-masking (blinding) procedure, in which neither the participant nor the experimenter is aware of the participant's treatment condition. Holland (1988) presented an extensive discussion of this problem in terms of the RCM. Efron and Feldman (1991) presented one of the best of the current procedures for addressing partial treatment compliance. The discussions following both the Holland and the Efron and Feldman articles clearly articulate the complex issues and limitations associated with the use of such procedures.

PARTICIPANT LOSS AT POSTTEST MEASUREMENT. The approaches for addressing treatment noncompliance just discussed assumed that posttest measurements were available for all participants regardless of whether they complied. A second problem that occurs in many randomized field experiments is that some participants in both the treatment and comparison groups cannot be remeasured at posttest. This problem, known as participant attrition, is a major potential source of bias in the estimation of causal effects.

Attrition can often be minimized by careful attention during the planning of the experiment. Securing the addresses and telephone numbers of the participants, their close friends or relatives, and their employer or school greatly aids in locating dropouts (attriting participants). Keeping in touch with both treatment and control participants through periodic mailings or telephone calls and providing incentives for continued participation can also help minimize participant loss. A considerable body of specialized techniques for tracking, locating, and contacting participants now exists in many research areas (Ribisl, Walton, Mowbray, Luke, Davidson, & Bootsmiller, 1996). Nonetheless, even given the use of careful procedures, attrition still typically does occur. For example, Biglan et al. (1991), in their review of longitudinal studies of substance abuse prevention, reported attrition rates ranging from 5% to 66% (mean = approximately 25%). Furthermore, dropouts typically report greater substance use at the initial measurement. Such findings suggest that estimates of treatment effects on the outcomes of interest may be biased if attrition is not addressed.

Researchers have some ability to estimate the likely effects of attrition on their results, but only if pretest measures that are expected to be related to the response measures of interest have been collected. Jurs and Glass (1971; see also Cook & Campbell, 1979) outlined a two-step strategy to detect possible bias introduced by differential attrition between the treatment and comparison groups.

1. The first step is to compare the percentage of participants who drop out in the treatment and comparison groups. If these values differ, then the differential attrition (participant loss) rates may be interpreted as a treatment effect. However, the interpretation of all measured response variables becomes more problematic (see Sackett & Gent, 1979).

2. The second step is to conduct a series of 2 (treatment group: t vs. c) x 2 (completer vs. attriter) ANOVAs on each of the pretest measures. The goal is to identify all possible background characteristics (e.g., attitudes, behaviors, personality, demographics) on which participants may differ. A main effect for treatment group indicates a possible failure of the randomization to properly equate the groups prior to treatment. A main effect for attriter versus completer indicates that the characteristics of the attriters (dropouts) differ from those of the completers, suggesting that the findings of the experiment cannot be generalized to the full population of interest (external validity threat). Of most concern, a Treatment Group x Attrition Status interaction indicates that the characteristics of participants who dropped out differed in the treatment and control conditions, indicating potential bias in the estimate of the causal effect (internal validity threat). When problems of differential attrition are identified, researchers should explore possible adjustments of the treatment effect through missing data imputation techniques (Little & Rubin, 1987; Little & Schenker, 1995), through attempts to understand
and model the effects of attrition on the response OTHER ISSUES IN EXPERIMENTS (e.g., Arbuckle, 1996; Muthen, Kaplan, & Hollis, 1987), or through attempts to estimate reasonable Statistical Power maximum and minimum values that bracket the treatment effect (Shadish, Hu, Glaser, Knonacki, & For a randomized trial to be worth doing, it must Wong, 1998; West & Sagarin, 2000). Comparison of have adequate statistical power to detect differences alternative adjustments that make different assump- between the treatment and control conditions. Yet, ex- tions to see if they yield similar results is particularly isting reviews of the statistical power of tests in major worthwhile. psychology journals, including the Journal o/Personality and Social Psychology(e.g., Rossi, 1990), have suggested There are two major problems in the use of the Jurs that many researchers continue to conduct statistical and Glass (1971) technique just outlined. First, differ- tests with inadequate power. Following Cohen (1988), ential attrition may be related to participant character- differences between the treatment and control groups istics that were not measured at pretest. If these un- of.80, .50, and .20 standard deviation units, (d), are de- observed characteristics differ between the treatment fined as large, moderate, and small effect sizes,and .80 and control groups and are related to the responses of is defined as adequate statistical power. In a random- interest, the causal effect estimate may be misleading. ized experiment with an equal number of participants For example, it is possible that attriters from smoking in the treatment and control groups, 52 total partici- prevention programs may have peers or parents who pants (n = 26 per condition) would be needed to de- smoke and who put pressure on them to drop out of tect a large effect, 126 participants would be needed to the program. 
In contrast, children in the control group detect a moderate effect, and 786 participants would with parents or peers who smoke would not experience be required to detect a small effect at alpha = .05 and similar pressure and would be less likely to drop out. If power = .80. Our earlier example illustrated this is- measures of peer and parent smoking are not collected sue: With 32 total participants, we were able to detect at pretest, this potential source of differential attrition a moderate effect size (d = 0.5) only about 30% of could not be detected. Second,the Jurs and Glasstech- the time. Given an estimate of the likely effect size nique uses statistical hypothesis tests to identify vari- based either on prior research in the area (especially abIes that are related or unrelated to attrition status. estimates from meta-analyses) or on a normative basis Concluding that a variable is unrelated to attrition sta- (e.g., d = .50 following Cohen, 1988), statistical power tus requires the researcherto accept the null hypothesis can now be easily computed in advance of conduct- that there is no Treatment x Attrition Status interac- ing an experiment with user-friendly software (e.g., tion.1n fact, some of the individual pretest variables, or Borenstein, Cohen, & Rothstein, 1997). combinations of these pretest variables, may represent In the , several methods can important sources of differential attrition even though be used to increase power. These methods are exten- the statistical test fails to reach conventional levels of sively discussed in Higginbotham, West, and Forsyth significance. Some researchers (e.g., Hansen, Collins, (1988, chapter 2); Hansen and Collins (1994); Lipsey Malotte, Johnson, & Fielding, 1985) have called for (1997); and Dennis, Lennox, and Williams (1997). The using less conservative levels of significance (e.g., a = most obvious method is to increase the sample size. 
.25) for all tests so that the corresponding Type n error However, practical issues such as cost, available re- rates will be reduced. Others (Tebes, Snow, & Arthur, sources, or the limited availability of certain special 1992) have proposed choosing the Type I error rate (a) participant populations (e.g., bereaved children; in- based on an expected or calculated effect size so that tensive care nurses) often restrict this option, even in the ratio of Type n to Type I errors will approximate the large metropolitan areas. Other methods include using ratio of 4:1 (viz. .B = .20; a = .05) recommended by stronger treatments, maintaining the integrity of the Cohen (1988). The bottom line is that rejection of the treatment implementation, using more reliable mea- null hypothesis is used as a method of screening poten- sures, maximizing participant uniformity, minimizing tially important from unimportant sources of differen- participant attrition, and adding covariates measured tial attrition. The basic goal is to identify all potential at pretest that are related to the dependent variable. sources of differential attrition that should be retained Also of interest are a variety of rarely used techniques for further investigation. Even if a large number of tests that equate participants on one or morecovariates that are performed, they should not be corrected for alpha are known to be important predictors of the response inflation (experimentwise error rate). Such correction prior to random assignment to treatment and control defeats the basic goal of the procedure. conditions. Such procedures are particularly valuable 54 STEPHEN G. WEST, JEREMY C. BIESANZ, AND STEVEN C. PITTS ps 1 - when o~y .a small number of units are available for Generalization of Causal Relationshi randoIll1zatIon. .Eq~ating participant~ at pretest on a single variable Our presentation of the RCM has focused primarily : IS straIghtforward. 
Typically, participants are simply ranked on the background variable and then grouped into pairs (the 2 highest, the next 2 highest, and so on, down to the 2 lowest scores on the pretest measure). Then, 1 participant from each pair is randomly assigned to the treatment and the other to the control condition. A matched-pairs t-test is used to compare the responses of the treatment and control groups. If the basis for matching participants is highly related to the response variable, these procedures can lead to large increases in statistical power. For example, Student (1931, the nom de plume of W. S. Gosset, a statistician at the Guinness brewing company) critiqued the Lanarkshire experiment, an early experiment comparing the gains in height and weight of 5,000 children given pasteurized milk with those of 5,000 children given raw milk. He noted that identifying 50 pairs of identical twins and then randomly assigning 1 twin from each pair to the pasteurized milk condition and 1 twin to the raw milk condition would yield the same statistical power as the original experiment with 10,000 children.

With several variables, these procedures become more complex. To illustrate one method of matching on several variables, imagine that the 32 participants in our earlier example had been measured on several important background variables at pretest. We noted earlier that there were over 600 million different possible randomizations of the 32 participants into two equal groups (n1 = n2 = 16). We could have a computer program generate a large number (e.g., 1,500) of different randomizations. A multivariate measure (e.g., Hotelling's T²) of the difference between the two groups on the pretest measures would be calculated for each of the 1,500 randomizations. We could then rank order the randomizations according to their success in equating the treatment and control groups on the pretest measures on the basis of the multivariate measure. We would then randomly select one of the best randomizations (e.g., from the 50 with the lowest value of the Hotelling's T²) to use in our experiment. Such procedures assure that our two groups are well-equated at pretest, minimizing sampling error to the extent the pretest measures are related to the response.8

... on what Cook and Campbell (1979) have termed internal validity: Something about the particular treatment caused the specific observed response in this particular experiment. The RCM is focused on the estimation of the causal effect; it is largely mute as to what the active causal agent(s) of the treatment might be or how one might go about determining them. Campbell (1986) also emphasized the limited causal understanding provided by internal validity, even suggesting that internal validity be relabeled as "local molar (pragmatic, atheoretical) causal validity" (p. 69).

In contrast, social psychological researchers are rarely interested in limiting their causal statements to a specific treatment implementation, delivered in a specific setting, to a specific sample of participants, and assessed with a specific measure. Methodologists (Campbell, 1957; Campbell & Stanley, 1966; Cook, 1993; Cook & Campbell, 1979; Cronbach, 1982; Shadish et al., in press; see also Brewer, this volume, Ch. 1) have articulated principles for understanding and generalizing the causal effect obtained in a single experiment. Cronbach (1982) developed a model with a focus on generalization with respect to four dimensions: units (typically participants), treatments, observations (measures or responses), and settings. He labeled the specific values of these dimensions that characterize a specific experiment as (lower case) units, treatments, observations, and settings (utos). He labeled the target values of the dimensions that characterize the classes to which the results of the experiment can be generalized as (upper case) Units, Treatments, Observations, and Settings (UTOS). In Cook and Campbell's (1979, chapter 2) analysis of validity, generalization to Treatments and Observations represents two dimensions of the issue of construct validity, and generalization to Units and Settings (and Times) represents the dimensions of the issue of external validity. Cronbach (1982) also described another type of generalization, utos to U*T*O*S*. Here, the researcher desires to generalize to new populations of Units, Treatments, Observations, and Settings (U*T*O*S*) that were not studied in the original experiment.

8 Analysis of covariance or blocking on the pretest score can also be used to increase statistical power. These techniques are more sensitive to violations of assumptions (e.g., curvilinear relationship between pretest measure and response) than matching and generally perform less well in small samples. Maxwell and Delaney (1990, chapter 9) presented a thorough discussion and comparison of matching and blocking techniques in experiments.
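The rerandomization procedure just described is straightforward to implement. A minimal sketch in Python, assuming two hypothetical pretest background variables and using Hotelling's T² (computed from the pooled covariance of the two groups) as the multivariate balance measure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretest scores: 32 participants, 2 background variables.
pretest = rng.normal(size=(32, 2))

def hotelling_t2(assign, data):
    """Hotelling's T^2 for the difference between the two groups' means."""
    g1, g2 = data[assign == 1], data[assign == 0]
    n1, n2 = len(g1), len(g2)
    diff = g1.mean(axis=0) - g2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(g1, rowvar=False)
              + (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)
    return (n1 * n2) / (n1 + n2) * diff @ np.linalg.inv(pooled) @ diff

# Generate 1,500 candidate randomizations (16 treatment, 16 control).
base = np.array([1] * 16 + [0] * 16)
candidates = [rng.permutation(base) for _ in range(1500)]
scores = [hotelling_t2(a, pretest) for a in candidates]

# Rank the candidates by balance, keep the 50 best-balanced ones,
# and randomly select one of them as the actual assignment.
best_50 = np.argsort(scores)[:50]
chosen = candidates[rng.choice(best_50)]
print("T2 of chosen randomization:", hotelling_t2(chosen, pretest))
```

Selecting at random from the 50 best-balanced candidates, rather than simply taking the single best, preserves an element of chance in the final assignment.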

CAUSAL INFERENCE AND GENERALIZATION IN FIELD SETTINGS 55

[Figure 3.2. The formal sampling model for generalization. Panel A: Random Sampling, from a defined population to a sample. Panel B: Randomization, assigning the sampled units to a Treatment Group and a Control Group. Note. The purpose of Stage B is to provide unbiased estimates of the causal effect of the treatment. The purpose of Stage A is to permit generalization of the results obtained in the sample to a defined population of participants.]
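The two stages of Figure 3.2 can be sketched directly in code; a toy illustration, assuming a hypothetical enumerated population of 10,000 units and a sample of 200:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stage A: random sampling from a defined, enumerated population,
# which permits generalization of the result to that population.
population = np.arange(10_000)          # hypothetical sampling frame
sample = rng.choice(population, size=200, replace=False)

# Stage B: random assignment of the sampled units to conditions,
# which permits an unbiased estimate of the causal effect.
shuffled = rng.permutation(sample)
treatment_group, control_group = shuffled[:100], shuffled[100:]

print(len(treatment_group), len(control_group))  # 100 100
```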

Let us apply the Cronbach model to the Evans et al. (1981) experiment described earlier. In this experiment, the units were children in specific school classrooms in Houston, the treatment was this specific implementation of the social influences smoking prevention program, the observations were children's reports of smoking, and the setting was the classroom. The UTOS to which Evans et al. presumably wished to generalize were all school children of specific ages, the social influences smoking prevention program, cigarette smoking, and school classrooms. As an applied experiment, the goal is to generalize to specific participant populations and to a specific school setting. The treatment is also conceptually quite circumscribed, as is the observation of interest, smoking.

Strategies for Generalization

STATISTICAL STRATEGY: RANDOM SAMPLING. The only formal statistical basis for generalization is through the use of random sampling from well-defined populations. Surveys using national probability samples assess the attitudes of representative samples of adults; other surveys may assess the attitudes of a more focused sample, such as hospital nurses in a section of Denver. In each case, a defined population is enumerated from which a random sample is collected, yielding estimates of the attitudes of the population that are accurate to within a specified level of sampling error.9

Figure 3.2 presents the randomized experiment in the context of this formal sampling model. Stage A represents random sampling from a defined population; the purpose of this stage is to assure generalization to a defined population of Units (participants), as is discussed below. Stage B represents random assignment of the units in the sample to treatment and control conditions; as discussed previously, the purpose is to achieve unbiased estimates of the causal effect in the sample. The combination of Stages A and B formally permits generalization of an unbiased causal effect of the specific treatment conditions, studied in the context of a specific experimental setting, using the specific dependent measures, to the full population of Units (participants). Note that generalization to Treatments, Observations, and Settings is not formally addressed by this model.

This formal sampling model is routinely recommended by statisticians and epidemiologists (e.g., Draper, 1995; Kish, 1987) as the ideal model for experimental research. Unfortunately, it is extraordinarily difficult to implement in practice. With respect to Units (participant populations), many cannot be precisely enumerated (e.g., children whose parents are alcoholics; Chassin, Barrera, Bech, & Kossak-Fuller, 1992). Even when the researcher can precisely enumerate the participant population (e.g., using official court records to enumerate recently divorced individuals), there is no guarantee that such participants can actually be located. Further, even if they are located, participants may refuse to be randomized or refuse to participate in a particular experiment, despite being offered large incentives for their participation.

A few recent experiments have approximated the ideal model of statisticians despite the difficulties. Randomized experiments have been conducted within national or local probability surveys (e.g., investigating question context; Schwarz & Hippler, 1995). Randomized experiments have compared treatment versus control programs using random samples of specific populations of individuals in the community (e.g., job seekers selected from state unemployment lines;

9 Social psychologists interested in basic research have traditionally focused primarily on whether theoretically predicted effects exist. However, recent criticisms of traditional null hypothesis significance testing (e.g., Cohen, 1994; Harlow, Mulaik, & Steiger, 1997) and the increased prominence of meta-analysis have greatly increased the focus of both basic and applied psychologists on the size of treatment effects. This shift in focus is likely to lead to a stronger focus on issues of generalization in basic as well as in applied research in the future.

56 STEPHEN G. WEST, JEREMY C. BIESANZ, AND STEVEN C. PITTS

Vinokur et al., 1991; Wolchik et al., in press). Such experiments routinely include heroic efforts to study nonparticipants in the experiment in order to understand the probable limits, if any, on the generalization of their findings to the full participant population of interest. Nonetheless, even such extraordinary efforts only address the generalization of units; they only take us from u to U. When our interest turns to the generalization of findings to a class (or population) of Treatments, Observations, and Settings, we almost never have any strong basis for defining a population of interest. In short, in nearly all basic and applied social psychological research we have to turn to extra-statistical methods of enhancing causal generalization.

EXTRA-STATISTICAL APPROACHES: COOK'S FIVE PRINCIPLES. Cook (1993) synthesized earlier ideas about causal generalization and articulated five general principles. These principles may be applied to strengthen causal generalization with respect to Units, Treatments, Observations, and Settings.

PROXIMAL SIMILARITY. The specific units, treatments, observations, and settings should include most of the components of the construct or population, particularly those that are judged to be prototypical. A researcher wishing to generalize to a population of nurses (e.g., in metropolitan Denver) should choose nurses from this area in his sample. The sample should include the various modal types of nurses (e.g., LPN, RN). To generalize to settings, the modal settings in which nurses work (e.g., hospital, home-care) should be identified, and nurses should be sampled from each. With respect to constructs, the researcher should design a treatment and either design or select a measurement instrument that includes most of the important components specified by the theory.

HETEROGENEOUS IRRELEVANCIES. The units, treatments, observations, and settings we use in our experiments are specific instances chosen to represent the population or the construct. They will typically underrepresent certain features of the target Units, Treatments, Observations, and Settings. They will typically also include other extraneous features that are not part of the target of generalization. For example, nearly all research on attitudes uses paper-and-pencil measurement techniques, yet paper-and-pencil measurement is not part of the definition of attitudes (see Sears, 1986, and Houts, Cook, & Shadish, 1986, for other examples of these issues). Following the principle of heterogeneous irrelevancies calls for the use of multiple instances in our research that are heterogeneous with respect to aspects of units, treatments, observations, and settings that are theoretically expected to be irrelevant to the treatment-outcome relationship. To the degree that the results of the experiment are consistent across different types of Units, different manipulations of the independent variable (Treatments), different measures of the dependent variable (Observations), and different types of Settings, the researcher can conclude that generalization of the findings is not limited.

DISCRIMINANT VALIDITY. Basic social psychological theory and theories of programs (Lipsey, 1993; West & Aiken, 1997) have specified the processes through which a treatment is expected to have an effect on the outcome. These theories identify specific Treatment constructs (causal agents) that are supposed to affect specific Observation constructs (dependent variables). For example, the specific Treatment of frustration, defined as the blockage of an ongoing goal-directed behavior, is hypothesized to lead to increases in aggression. Other similar treatments that do not involve goal blockage, such as completing an exciting task, should not produce aggression. Similarly, given the focus of the hypothesis on the construct of aggression, the researcher should be able to show that frustration does not lead to other emotion-related responses, such as depression or euphoria. To the extent that the causal agent of the Treatment (here, frustration) is shown to match the hypothesized construct and the class of Observations (here, aggression) affected by the Treatment matches those specified by the theory, claims for understanding the causal relationship are strengthened. This same approach can be taken to Units and Settings when hypotheses identify specific classes of units or settings over which the causal effect will generalize.

CAUSAL EXPLANATION. To the extent that we can support a causal explanation of our findings and rule out competing explanations, the likelihood of generalization is increased. The causal explanation distinguishes the active from the inert components of our treatment package and provides an understanding of the processes underlying our phenomenon of interest. These features permit us to specify which components need to be included in any new experimental context. This principle has long been at the heart of basic experimental work in social psychology with its focus on the articulation and ruling out of competing theoretical explanations. More recently, both basic and applied social psychologists have used mediational analysis (Baron &


Kenny, 1986; MacKinnon, 1994; West & Aiken, 1997) as a means of probing whether their favored theoretical explanation is consistent with the data. To the extent that the data support the favored theoretical explanation, it can provide strong guidance for the design, implementation, and evaluation of future programs. However, mediational analysis does not automatically rule out other competing explanations for the observed effects. To the extent that these alternative causal explanations (a) are plausible and (b) make different predictions when new units (participants), new treatments, new observations, or new settings are studied, generalization of the findings of an experiment will potentially be limited.

EMPIRICAL INTERPOLATION AND EXTRAPOLATION. For ease of presentation, this chapter has focused on the comparison of two treatment conditions, typically treatment versus control (or comparison) groups. Although experiments are conducted in social psychology with more than two levels of each separate treatment variable, such experiments are not common, particularly in applied research. In general, a high dose of the treatment (e.g., a smoking prevention program carried out over 3 years; Evans et al., 1981) is compared with a no treatment (or minimal treatment) control group. This practice yields an estimate of the causal effect that is limited to the specific implementations of treatment and control groups used in the experiment.

Occasionally, parametric experiments or dose-response (response surface) experiments involving one or more dimensions (Box & Draper, 1987; Smith, this volume, Ch. 2; West, Aiken, & Todd, 1993) can be conducted. Such experiments help establish the functional form of the relationship between the strength of each treatment variable and the outcome of interest. Once the functional form of the dose-response curve (or response surface) is known, causal effects for the comparison of any pair of treatment conditions can be estimated through interpolation.

In the absence of such dose-response curves, caution must be exercised in generalizing the magnitude of causal effects very far beyond the specific levels of the treatment and control groups implemented in a particular experiment. The functional form of dose-response relationships may be nonlinear. Complications like threshold effects, the creation of new processes (e.g., psychological reactance if an influence attempt becomes too strong), and the influence of interactions with other background variables become increasingly likely as the gap between the treatments studied and those to which the researcher wishes to generalize increases. To the extent we are extrapolating beyond the range of units, treatments, observations, or settings used in previous research, our generalizations of estimates of treatment effects become increasingly untrustworthy.

SUMMARY. Traditional social psychological perspectives on generalization (e.g., Aronson et al., 1998; Berkowitz & Donnerstein, 1982) have relied nearly exclusively on the single principle of causal explanation, making interpolation and extrapolation of causal effects difficult in the absence of a formal specification of the Units, Treatments, Observations, and Settings addressed by the theory. Cook's five principles add other criteria beyond causal explanation that help identify when generalization of the causal effects to the UTOS of interest is possible.10 Cook (1993) and Shadish et al. (in press) also noted that these same five principles can be applied to meta-analyses of entire research literatures.

10 Some social psychologists concerned primarily with basic research argue that constancy of the direction of the causal effect is all that is needed for successful generalization. Taking a position based on traditional null hypothesis significance testing, they argue that whether the effect of a treatment is large, moderate, small, or even tiny in magnitude makes little difference so long as it is in the predicted direction (but see Greenwald, Pratkanis, Leippe, & Baumgardner, 1986; Reichardt & Gollob, 1997). On the other hand, some influential social psychological theorists (e.g., McGuire, 1983) have emphasized the contextual nature of social phenomena, and much of the basic research conducted in social psychology during the past four decades has emphasized the finding of predicted interaction effects, ideally of a disordinal or crossover form. In contrast, in many areas of applied research, the magnitude of effects is currently viewed as more important. For example, an expensive job training program that led to a consistent mean increase of $100 per year in the participants' annual salary would be deemed to be ineffective, no matter how consistently the effect was obtained or how tiny the associated p-value of the statistical test (e.g., p < .00001).

QUASI-EXPERIMENTAL DESIGNS

The randomized experiment is nearly always the design of choice for testing causal hypotheses about the effects of a treatment. In many real-world contexts, however, randomization may be precluded by ethical, legal, practical (logistic), or policy concerns. In such cases, it is frequently possible to develop alternative quasi-experimental designs that share many of the strengths of the randomized experiment in reaching valid estimates of the causal effect of a treatment (Cook & Campbell, 1979; Reichardt & Mark, 1997). Each of these designs involves manipulation of the independent variable by the experimenter or other entities, pretest and posttest measurement, and design or statistical controls that attempt to address plausible threats to internal validity.

Regression Discontinuity Design

One of the strongest alternatives to the randomized experiment is the regression discontinuity design. The regression discontinuity design can be used when treatments are assigned on the basis of a quantitative measure, often a measure of need or merit. Following Reichardt and Mark (1997), we term this measure the quantitative assignment variable. For example, at some universities, entering freshmen who have a verbal scholastic aptitude test (SAT) score below a specified value (e.g., 380) are assigned to take a remedial English course, whereas freshmen who have a score above this value are assigned to a regular English class (Aiken, West, Schwalm, Carroll, & Hsuing, 1998). The outcome of interest is the students' performance on a test of writing skill. Similarly, children who reach their sixth birthday by December 31 of the school year are assigned to begin first grade, whereas younger children do not begin school (Cahan & Davis, 1987). The outcome of interest is the children's level of performance on verbal and math tests taken the following Spring. Union members are laid off from work on the basis of their number of years of seniority. Mark and Mellor (1991) compared the degree of hindsight bias (retrospective judgments of the perceived likelihood of layoffs) among laid-off workers (< 20 years seniority) and workers who survived layoffs (20 or more years of seniority). The central feature of each of these examples is that participants are assigned to treatment or control conditions solely on the basis of whether they exceed or are below a cutpoint on the quantitative assignment variable and that an outcome hypothesized to be affected by the treatment is measured following treatment. This assignment rule meets the objections of those critics of randomized experiments who believe that potentially beneficial treatments should not be withheld from the neediest (or most deserving) participants.

To understand the strengths and weaknesses of this design, let us review a classic regression discontinuity study by Seaver and Quarton (1976) in some detail. Seaver and Quarton were interested in testing the hypothesis that social recognition for achievement leads to higher levels of subsequent achievement. A random sample of undergraduate students who completed the Fall and Winter quarters in a large university constituted the participants. As at many universities, those students who received a Fall quarter grade point average (GPA) of 3.5 or greater were given formal recognition by being placed on the Dean's list, whereas students who did not attain this GPA were not given any special recognition. The grades for all students were recorded at the end of the Winter quarter as a measure of subsequent achievement. Seaver and Quarton estimated that the awarding of Dean's list led to a 0.17-point increase in the students' GPA, the equivalent of a full grade higher in one 3-hour course during the next term.

To understand how the regression discontinuity design works, it is useful to consider several possible outcomes patterned generally after those of the Seaver and Quarton (1976) study. Each of the outcomes presented in Figure 3.3 is based on simulated data from 200 students, with about 17% of the students having Fall quarter GPAs of 3.5 or better (Dean's list).

[Figure 3.3, Panels (a) and (b): scatterplots of Winter quarter GPA against Fall quarter GPA (0 to 4.0), with the Dean's list cutoff at a Fall quarter GPA of 3.5 marked.]
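Simulated data of this general kind are easy to produce. A sketch in which the GPA distribution, regression slope, and noise level are hypothetical values chosen so that roughly 17% of students reach the 3.5 cutoff, with a 0.17-point Dean's list effect built in:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
# Fall GPAs on a 0-4 scale; mean and spread are hypothetical values
# chosen so that roughly 17% of students reach the 3.5 cutoff.
fall_gpa = np.clip(rng.normal(2.8, 0.7, n), 0, 4)
deans_list = (fall_gpa >= 3.5).astype(float)

# Winter GPA: linear in Fall GPA (slope and noise are hypothetical),
# plus a 0.17-point boost for students placed on the Dean's list.
winter_gpa = 0.9 + 0.65 * fall_gpa + 0.17 * deans_list \
    + rng.normal(0, 0.3, n)
winter_gpa = np.clip(winter_gpa, 0, 4)

print(f"proportion on Dean's list: {deans_list.mean():.2f}")
```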



[Figure 3.3, Panels (c) and (d): scatterplots of Winter quarter GPA against Fall quarter GPA (0 to 4.0), with the Dean's list cutoff at a Fall quarter GPA of 3.5 marked.]

Figure 3.3. Illustration of possible outcomes in the regression discontinuity design. (A) No effect of treatment. (B) Positive effect of treatment. (C) Curvilinear functional form. (D) Nonparallel regression lines in treatment and control groups. Note. In each panel, the relationship between students' Fall quarter and Winter quarter GPAs is indicated. The vertical line represents the Fall GPA cutoff for Dean's list. O represents a Control student; + represents a student on Dean's list. In Panel (A), representing no treatment effect of Dean's list, a single regression line fits the data for both the Dean's list and Control students. In Panel (B), the vertical distance between the regression lines for the Dean's list and for the Control students (the discontinuity) represents the treatment effect. In Panel (C), the solid line represents the fit of a curvilinear relationship between students' Fall and Winter quarter GPAs; the dotted line represents the treatment effect that would result if this relationship were specified as being linear. In Panel (D), the slope of the regression line for the Dean's list students is steeper than the slope of the regression line for the Control students (Treatment x Pretest interaction).

Figure 3.3 illustrates four possible outcomes of the study, of which outcomes (A) and (B) are important in the present context. Outcome (A) illustrates a case in which the treatment, Dean's list, has no effect on Winter quarter GPA. Note that we have a single regression line that characterizes the full range of the data. The slope of the regression line indicates only that there is a strong positive relationship (correlation) between Fall quarter and Winter quarter GPA. In contrast, outcome (B) illustrates the general pattern of results found by Seaver and Quarton (1976). Here, the same regression line as in outcome (A) holds for students who have Fall GPAs below 3.5. However, for those students who have GPAs of 3.5 or above and are awarded Dean's list, the regression line is now elevated. At the cutpoint of 3.5 for Dean's list, we see that the difference in the levels (intercepts) of the two regression lines is 0.17, which represents Seaver and Quarton's estimate of the treatment effect. This discontinuity in the two regression lines can potentially be interpreted as a treatment effect. Statistically, the treatment effect is estimated through the use of the following regression equation:

Y = b0 + b1X + b2T + e.    (1)

In this equation, Y is the outcome variable, here Winter quarter GPA; X is the quantitative assignment variable, here Fall quarter GPA; and T is a dummy code that has a value of 1 if the participant receives the treatment, here the social recognition of Dean's list, and a value of 0 if the participant does not receive treatment. Each of the bs is a regression coefficient that may be estimated by any standard statistical package (e.g., SPSS; SAS). The regression coefficients are most easily interpretable if the data are rescaled so that X has a value of 0 at the cutpoint (i.e., X = GPA_Fall - 3.5). In this case, b0 is the predicted value of Winter quarter GPA for participants at the cutpoint (3.5) who are in the control group (non-Dean's list), b1 is the slope of the regression line (i.e., the predicted amount of increase in the Winter quarter GPA corresponding to a 1-point increase in the Fall quarter GPA), and b2 represents the treatment effect. The test of b2 informs us whether being on the Dean's list led to a significant increase in GPA the following quarter. Finally, e is the residual (error in prediction). Readers wishing a more thorough presentation of the statistical procedures should see Aiken and West (1991, chapter 7); Cohen and Cohen (1983); Reichardt, Trochim, and Cappelleri (1995); or Trochim (1984).
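Equation (1) can be estimated with any least-squares routine. A minimal sketch on simulated data (the data-generating values here are hypothetical), with X rescaled to 0 at the cutpoint as recommended:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with a built-in treatment effect of 0.17
# (all data-generating values here are hypothetical).
fall = np.clip(rng.normal(2.8, 0.7, 500), 0, 4)
T = (fall >= 3.5).astype(float)         # Dean's list dummy code
winter = 0.9 + 0.65 * fall + 0.17 * T + rng.normal(0, 0.2, 500)

# Equation (1): Y = b0 + b1*X + b2*T + e,
# with X rescaled to 0 at the 3.5 cutpoint.
X = fall - 3.5
design = np.column_stack([np.ones_like(X), X, T])
b0, b1, b2 = np.linalg.lstsq(design, winter, rcond=None)[0]

print(f"estimated treatment effect b2 = {b2:.3f}")
```

With a correctly specified linear model, b2 should recover a value close to the built-in 0.17 effect.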

The ideas behind the regression discontinuity design are straightforward, but the design is counterintuitive because it explicitly violates a usual canon of research: The treatment and control groups should be as equivalent as possible at pretest. Instead, this design takes advantage of the known rule for assignment to treatment (Dean's list: Fall GPA ≥ 3.5) and control groups (no recognition: Fall GPA < 3.5). A statistical adjustment is performed that permits comparison of the levels in the two groups at the same value on the quantitative assignment variable11 (GPA = 3.5). The estimate of the treatment effect is now conditioned on the participant's Fall GPA.

11 In terms of the RCM, the expected value of the treatment group is compared with the expected value of the control group, conditioned on the specific value of the quantitative assignment variable at the cutpoint (Rubin, 1977). If the quantitative assignment variable is the sole basis on which participants are given the experimental versus comparison treatments, then this difference provides an unbiased estimate of the treatment effect.

STATISTICAL ASSUMPTIONS OF THE REGRESSION DISCONTINUITY DESIGN. Because of its strong reliance on statistical adjustment to yield proper estimates of treatment effects, the regression discontinuity design requires two statistical assumptions in addition to those required for the randomized experiment. The ability to check these assumptions will be much greater in studies with large sample sizes in which the cutpoint is not too extreme.

CORRECT SPECIFICATION OF FUNCTIONAL FORM. Researchers in social psychology rarely have strong a priori theoretical or empirical bases for specifying the form of the relationship between two variables. Consequently, we normally assume that the form of the relationship between the two variables is linear, as was done in the regression equation (Equation 1). To the degree that the form of the relationship between Fall quarter and Winter quarter GPA in our example is not well approximated by a straight line, the estimate of the treatment effect based on Equation 1 may be biased. For example, Cook and Campbell (1979, p. 140) presented data suggesting that a single curvilinear relationship without a treatment effect could fit the Seaver and Quarton (1976) data equally as well as the more typical equation (Equation 1). (Figure 3.3, Panel C illustrates this issue.) Consequently, it is important for researchers to examine the robustness of the results as the functional form of the relationship between the quantitative assignment and outcome variables is varied.

Two simple probes of the functional form may be performed. First, scatterplots of the relationship between the quantitative assignment variable and (a) the outcome measure and (b) the residuals should be carefully examined. As is usual in regression analysis, evidence of outliers, nonconstant variance of the residuals across the range of X, and nonnormality of the residuals suggest potential problems in the specification of the regression model (see R. D. Cook & Weisberg, 1994; McClelland, this volume, Ch. 15). Of particular importance in the context of the regression discontinuity design is the existence of systematic positive or negative residuals from the regression line near the cutpoint, suggesting a potentially major problem in the estimation of the treatment effect. This examination is greatly facilitated by the use of modern graphical techniques, such as the fitting of lowess curves (Cleveland, 1993) separately to the treatment and the control groups. Lowess curves are "nonparametric" curves that may be used to describe the functional form of the relationship in the sample. Second, the regression equation may be estimated using a nonlinear functional form. Following Trochim (1984) and Reichardt et al. (1995), higher order polynomial terms (e.g., X²) are typically added to the regression equation and tested for significance. Alternatively, the data may be transformed to achieve linearity (Daniel & Wood, 1980), or other parametric or nonparametric regression equations suggested by the lowess curve may be estimated (Daniel & Wood, 1980; Hastie & Tibshirani, 1990). To the extent that similar estimates of the treatment effect are found, or the alternative specifications of the model fit the data less well than Equation (1), the possibility that a misspecified functional form accounts for the results can be minimized.

NO TREATMENT × PRETEST INTERACTION. Researchers normally make the assumption that the regression lines will be parallel in the treatment and control groups. However, this need not be the case, as is depicted in Figure 3.3, Panel D. In this example, there is no discontinuity in the regression line at the cutpoint of 3.5, but the slope of the regression line is slightly steeper above this value. This example illustrates one form of a Treatment × Pretest (quantitative assignment variable) interaction that may occur.

Treatment × Pretest interactions represent another variant of misspecification. Once again, they may be detected from careful examination of scatterplots and plots of residuals. If a Treatment × Pretest interaction is suspected, a new term, XT, representing the interaction may be added to Equation (1), resulting in Equation (2):

Y = b0 + b1X + b2T + b3XT + e.    (2)

Once again, the b2 coefficient estimates the treatment effect, and the b3 coefficient provides information about the magnitude of the difference in the slopes in the treatment and control groups. Note that X must be rescaled so that it has a value of 0 at the cutpoint if b2 is to be easily interpretable (Aiken & West, 1991).
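Equation (2) is a one-term extension of Equation (1) and can be estimated the same way. A minimal sketch on simulated data (all data-generating values are hypothetical); because X is rescaled to 0 at the cutpoint, b2 remains interpretable as the treatment effect at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data with both a discontinuity (0.17) and a steeper
# slope (by 0.20) above the 3.5 cutpoint; values are hypothetical.
fall = np.clip(rng.normal(2.8, 0.7, 1000), 0, 4)
T = (fall >= 3.5).astype(float)
X = fall - 3.5                           # rescaled: 0 at the cutpoint
winter = (0.9 + 0.65 * fall + 0.17 * T + 0.20 * X * T
          + rng.normal(0, 0.2, 1000))

# Equation (2): Y = b0 + b1*X + b2*T + b3*X*T + e
design = np.column_stack([np.ones_like(X), X, T, X * T])
b0, b1, b2, b3 = np.linalg.lstsq(design, winter, rcond=None)[0]

print(f"treatment effect at the cutpoint b2 = {b2:.3f}")
print(f"slope difference b3 = {b3:.3f}")
```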
When significant changes in slope are detected, researchers need to be cautious in their interpretation of the results. If there is no discontinuity between the two regression lines (i.e., no treatment main effect), differences in slope are not usually interpretable. The alternative explanation that there is no treatment effect, but rather a nonlinear relationship between the quantitative assignment variable and the outcome variable, cannot be easily ruled out. However, when there is a substantial discontinuity between the two regression lines at the cutpoint, such an alternative explanation becomes considerably less plausible, and the treatment effect can be directly interpreted at the cutpoint (Trochim, Cappelleri, & Reichardt, 1991). As with any interaction, the estimate of the treatment effect is conditional: It would be different at any other potential cutpoint. Statistically, treatment effects at other cutpoints are easily estimated by rescaling the value of X to be 0 at the new cutpoint and examining the new b2 effect; however, extrapolation of estimates of treatment effects to other cutpoints makes two strong assumptions (Cochran, 1957). First, the regression model is a close approximation of the "true" regression model in the population. Second, the same true regression model holds for all values of the quantitative assignment variable. Thus, extrapolation yields treatment effect estimates that are less precise and less credible than the estimates at the actual cutpoint on the quantitative assignment variable.

SELECTION ISSUES IN THE REGRESSION DISCONTINUITY DESIGN. The regression discontinuity design assumes that participants are assigned to treatment and control groups solely on the basis of the cutoff score on the quantitative assignment variable. Any other influences that affect assignment represent a potentially serious misspecification of the model. Paralleling our earlier review of the randomized experiment, breakdowns in treatment assignment give rise to the possibility that there may be important differences in the people participating in the treatment and control groups after statistical adjustment for the quantitative assignment variable, even in the absence of treatment. We term such influences participant selection issues; they can arise in the regression discontinuity design in at least three ways.

1. If the population to which a desirable (or undesirable) treatment is being offered is aware of the cutpoint, participants may decide to enroll in the study based in part on their perception of the likelihood that they will receive the treatment. Meier (1985) reported that the Salk polio vaccine was tested in some states using a simple version of the regression discontinuity design in which the child's age was the assignment variable. However, many parents knew that children in a specified age range would receive the experimental polio vaccine. Gilbert, Light, and Mosteller (1975) documented how refusals of parents of children in this age range to permit their children to participate in the study led to a large bias in the estimates of the effectiveness of the vaccine.

2. Practitioners may "adjust" the scores of individuals who are just below (or above) the cutpoints so that these individuals may receive a desired treatment, a problem Campbell (1969) termed "fuzzy assignment rules." Teachers may give their favorite students better grades so these students receive academic honors; welfare workers may understate a family's income so that a child may receive special medical or educational treatment. Those in charge of treatment allocation may also take into account other unquantified factors, such as letters of recommendation or interviews with the candidates.

3. Following the pretest, participants may drop out of the treatment and control groups. In such cases, participant characteristics that may not have been measured at pretest, such as their interest in the topic area of the study, may determine in part whether they complete the study. As in the randomized experiment, the estimate of the treatment effect will be biased to the extent that attrition is substantial, the characteristics of attriters and completers differ in the treatment and control conditions, and these characteristics are related to the measured outcome variable.

Many of the remedies for selection issues in the regression discontinuity design are extensions of those used in the randomized experiment. In terms of design approaches, not announcing the cutpoint until after participants have been assigned to conditions masks

(blinds) the assignment rule from both participants and Ideally, generalization of the treatment effects should personnel in the institutional setting and thereby helps be restricted to values that are close to the original minimize the first two selection issues. Trochim and cutpoint. Often this will be sufficient: Regression Cappelleri (1992) have suggested that the assignment discontinuity designs are typically conducted with the rule used in the regression discontinuity design ad- populations and in the settings of interest and are thus dressesthe third selection issue because it is more sensi- high in external validity. The cutpoints used are also ble to participants; however, little evidence exists with often those for which there are supporting data (e.g., which to evaluate their claim of reduced attrition. A cutoffs for clinical levels of high blood pressure) or variety of simple (e.g., dropping participants within a strong historical tradition for their use (e.g., Dean's narrow range around the cutpoint; Mohr, 1988) and List = 3.5 GPA). Thus, this limitation is often an complex econometric approaches have attempted to issue more in theory than in practice because of the provide more adequate adjustments when fuzzy as- restricted range of generalization that is sought. signment appears to have taken place (see Trochim, 1984, for a review). Statistical work on issues related SUMMARY. The regression discontinuity design to other selection issues is less well-developed. How- provides an excellent approach when participants are ever, many of the statistical techniques suggested for assigned to conditions on the basis of a quantitative the problems of treatment noncompliance and attrition measure. As Marcantonio and Cook (1994) noted, the in the randomized experiment can potentially be ex- regression discontinuity design is one of two quasi- tended to the regression discontinuity design. 
These ap- experimental designs that stand out ~because of the proaches make the strong assumptions that the quan- high quality of causal inference they often engender" titative assignment variable provides the only basis for (p. 134). When its assumptions are met, it rules out treatment assignment and that the functional form of most of the threats to internal validity and has good the relationship between the quantitative assignment external validity for much applied work. On the neg- variable and the outcome variable has been correctly ative side, it is considerably lower in power than the specified. randomized experiment. The central concerns of the design focused around a number of selection-related OTHER ISSUES issues and the correct specification of the functional STATISTICALPOWER. The regression discontinuity form. More complex variants of the regression discon- design will typically have considerably lower power tinuity design can be implemented: Multiple pretest than the randomized experiment, with the degree to measures may be used to assign participants to treat- which power is reduced being dependent on the ex- ment, multiple treatments may be delivered with mul- tremity of the cutpoints and the magnitude of the tiple cutpoints, multiple pretest measures may be used correlation between the quantitative assignment vari- as additional covariates, and multiple outcome vari- able and the posttest (Cappelleri, 1990; Cappelleri, ables may be collected (see Judd & Kenny, 1981; Darlington, & Trochim, 1994). Goldberger (1972) es- Shadish, etal., in press; Trochim, 1984). 
In addition, the timated that, in properly specified models in which the regression discontinuity design can be supplemented quantitative assignment variable and the posttest had with a randomized tie-breaking experiment around the bivariate normal distributions and a relatively strong cutpoint to provide even stronger inferences (Rubin, correlation, 2.75 times as many participants would be 1977; Shadish, et al., in press; see Aiken et al., required using the regression discontinuity design to 1998, for an empirical illustration of combining these achieve the same level of statistical power as in the ran- designs). domized experiment. The lesson is clear: Researchers planning regression discontinuity designs will need to Interrupted Time Series Design use relatively large sample sizes,both to provide a test of the treatment that has adequate statistical power and The interrupted time series design is the second to probe the statistical assumptions underlying the test. of the two quasi-experimental designs that Marcan- tonio and Cook (1994) highlighted as engendering a CAUSALGENERALIZAnON. Drawing on Cook's (1993) high quality of causal inference. In the basic inter- five prindples of causal generalization discussed rupted time series design, measurements of the out- earlier, we see that generalization of results using the come variable are collected at equally spaced intervals regression discontinuity design is limited relative to the (e.g., daily; yearly) over a long period of time. An inter- randomized experiment by the use of one cutpoint. vention is implemented at a specific point in time; the

,:~LIi i i,[ I,il CAUSAL INFERENCE AND GENERALIZAllON IN FmLD SE1TINGS 63

.ld intervention is expected to affect the outcome variable of time (or after a fixed number of observations!3), .a.l measured in the series. Time series analysis permits the and possible treatment effects may be inferred from a )ll researcher to identify changes in the level and slope of change in the level of the behavior that has occurred Ie the series that occur as a result of the intervention.12 between the baseline and treatment periods. lIS In terms of the RCM, the expected value of the treat- Figure 3.4 illustrates some of the different types of ;0 ment group is compared with the expected value of effects that can be detected using interrupted time se- the control group, conditioned on the specific point in ries designs. In Panel A, there is no effect of the inter- time at which the treatment is introduced. U (a) time vention.1n Panel B, the intervention leads to a dramatic t'S is assumed to be a good proxy for the actual rule upon decrease in the level of occurrence of the behavior. In ~ which treatment and control conditions are assigned Panel C, the intervention leads to an immediate effect Ie (Judd & Kenny, 1981) and (b) the correct functional of the intervention, followed by a slow decay of the form of the relationship between time and the outcome treatment effect over time. Finally, in Panel D, there is a variable is specified, then controlling for time will lead permanent change in both the level and the slope of the n to an unbiased estimate of the treatment effect. U time series following the intervention. Each of the results °e is not an adequate proxy, then estimates of the magni- depicted in Panels B-D are potentially interpretable 'e tude of the treatment effect may be biased. as treatment effects, given appropriate statistical anal- e Interrupted time series designs have typically been yses and attention to potential threats to internal utilized in two different research settings. First, time validity. 
e series designs provide an excellent method of con- From consideration of Figures 3.3 and 3.4, readers ,N ducting naturalistic studies of the effects of a change in may have detected a strong similarity between the re- It law or a social innovation. To cite two examples, West, gression discontinuity design and the interrupted time d Hepworth, McCall, and Reich (1989) used this design series design. In both designs, treatment is assignedbe- to test whether the introduction of a law requiring ginning at some specific point represented on the X- a 24-hour mandatory jail term for driving under the axis. Any alternative explanations of the results need influence of alcohol led to a subsequent decrease in to provide an account of why the level, slope, or both fatal traffic accidents. Hennigan et al. (1982) used this of the series change at that spedfic point. Note, how- design to test the hypothesis that the introduction of ever, that there is one major conceptual difference be- television broadcasting in u.s. cities in the late 1940s tween the two designs: In the interrupted time series and early 1950s would lead to subsequent increases in design, participants are assigned to the treatment or crime rates, particularly violent crime rates. Second, control group on the basis of time; in the regression time series designs may be used to test the effects discontinuity design, participants are assigned to the of interventions in single-subject designs. This ap- treatment or control group on the basis of their score proach has primarily been followed by experimental on the quantitative assignment variable. 
As we show psychologists in the Skinnerian tradition and clinical below, this change from pretest scores to time on the researchers who wish to make causal inferences at X-axis greatly complicates the statistical analysis and the level of the individual participant (Crosbie, 1993; leads to a different set of potential threats to internal Franklin, Allison, & Gorman, 1996; Home, Yang, & validity. Ware, 1982; Sidman, 1960). For example, Gregson (1987) used time series methods to understand AN OVERVIEW OF STATISTICAL ANALYSES. To il- individual differences in the pattern of response to lustrate time series analysis, consider a classic study by treatment in 9 participants with chronic headaches. McSweeny (1978). McSweeny was interested in the These participants reported the intensity and duration effects of the introduction of a monetary charge for of their headaches on a daily basis over an extended calls to directory assistance. He obtained the monthly period of time. In each of these examples, the treat- records of the number of calls made to Cincinnati Bell ment in this design was assigned a priori on the basis~ directory assistance for the period 1962 to 1976. Be- ginning in March 1974, and announced with consider- able publicity, Cincinnati Bell began charging 20 cents 12 Although less often hypothesized by sodal psychologists, changes in the variance of the series or in cyclical patterns in the series following an intervention can also be detected. Franklin et at. (1996) discussedsome of the inferential prob- Larsen (1989) has discussed some ways in which cyclical lems that occur when other assignment rules (e.g., the par- patterns of variables such as mood and activity levels may be ticipant's level of responding reaches asymptote) are used to important in human sodal behavior. determine when treatments are introduced or withdrawn.


[Figure 3.5 appears here: a monthly time series plot (x-axis: Month Number, Beginning January, 1962) with the 20-cent charge marked by a vertical line.] Figure 3.5. Intervention effect: Introduction of a charge for directory assistance. Note. The series represents the average daily number of calls to directory assistance per 100,000 total calls. The vertical line indicates the point (the beginning of month 147 of the series) when the charge for directory assistance was introduced. The effect of this charge is shown by the large vertical drop in the level of the series at the point of intervention.

The substitution of time for the pretest score introduces major complications into the statistical analysis. In the analysis of time series data, three statistical problems must be adequately addressed or the regression equation will be misspecified. First, any long-term linear or curvilinear trends over time must be properly represented. In the present example, the long-term linear increase over time in the series is represented by the b1T term in Equation 3. Second, time series data, particularly those in which the data are collected on an hourly, daily, or monthly basis, frequently contain cycles. For example, people's moods often change according to a regular weekly cycle, with particularly large differences being found between the weekend and weekdays. Again, these cycles must be detected, and terms representing the cycles must be added to the regression equation to avoid misspecification (see West & Hepworth, 1991). Finally, when data are collected over time, adjacent observations are often more similar than observations that are further removed in time. For example, the prediction of today's weather from the previous day's weather is, on average, far better than the prediction of today's weather from the weather 7 days ago. This problem, known as serial dependency, implies that the residuals (the es) in Equation 3 will not be independent (Judd & Kenny, 1981; West & Hepworth, 1991). Failure to properly specify the long-term trends, the cycles, or both in the model may lead to model misspecifications and consequently to bias in the estimate of the treatment effect. Failure to adequately represent the serial dependency leads to incorrect standard errors and consequently to incorrect significance tests and confidence intervals (Box, Jenkins, & Reinsel, 1994; Chatfield, 1996; Judd & Kenny, 1981).

Time series analysis offers a variety of graphical displays and statistical tests that are useful in the detection of trends, cycles, and serial dependency. Time series techniques are most accurate in large samples where the number of time points is 100 or more, although even in large samples there can be problems in correctly identifying the pattern of serial dependency (Velicer & Harrop, 1983). In smaller samples, there is a high likelihood that several models will adequately fit the data, and the possibility exists that each of these models may be associated with a substantially different estimate of the treatment effect (see Velicer & Colby, 1997). This result occurs because time series analysis requires a series of preliminary tests to determine whether the model is correctly specified; these preliminary tests require large numbers of observations to distinguish sharply between different possible models. Among the techniques that may be used to identify potential problems with time series models are graphical techniques for visualizing trends and cycles (see Cleveland, 1993) and plots of specialized statistics (e.g., spectral density plots; autocorrelograms) to detect cycles and serial dependency. McCleary and Hay (1980), Judd and Kenny (1981, chapter 7), West and Hepworth (1991), and Velicer and Colby (1997) presented good introductions to the ideas of time series analysis. Chatfield (1996), Box, Jenkins, and Reinsel (1994), and Velicer and MacDonald (1984) presented more in-depth treatments of the traditional and newer statistical procedures for the identification and testing of time series models.

THREATS TO INTERNAL VALIDITY. As in the regression discontinuity design, for threats to internal validity to be plausible, they must offer an explanation of why a change in the level, slope, or both of the series occurred at the point of the intervention (Shadish et al., in press). Three generic classes of explanations that are potential threats to the internal validity of the design can be identified; however, the plausibility of each of these threats will depend on the specific research context.

As an illustration, consider the example of a new smoking prevention program implemented at the beginning of a school year. Based on school records, the total number of students who are cited by school personnel for smoking each month on school grounds for the 5 years prior to and 5 years after the beginning of the intervention provides the data for the time series.
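The serial dependency problem described above can be made concrete with a small simulation. The autoregressive structure and the diagnostic below are illustrative assumptions, not a prescription from the chapter.

```python
# A sketch of serial dependency (simulated data): errors generated by a
# first-order autoregressive process show a substantial lag-1 autocorrelation,
# violating the independence assumption of ordinary regression.
import numpy as np

rng = np.random.default_rng(1)
n, rho = 120, 0.6
e = np.zeros(n)
for i in range(1, n):
    e[i] = rho * e[i - 1] + rng.normal(0, 1)   # part of today's error carries over

def lag1_autocorr(x):
    """Sample correlation between a series and itself shifted by one step."""
    d = x - x.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

print(round(lag1_autocorr(e), 2))                    # large, near rho
print(round(lag1_autocorr(rng.normal(0, 1, n)), 2))  # near 0 for independent data
```

An autocorrelogram is essentially this statistic computed at many lags; a large value at low lags signals that standard errors from ordinary regression cannot be trusted.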


HISTORY. Some other event that could also be expected to decrease smoking may occur at about the same time that the smoking prevention program is initially implemented. For example, the community may remove cigarette machines or institute fines for selling cigarettes to minors, making access to cigarettes more difficult.

SELECTION. The population of the students in the school may change in the direction of having a greater proportion of nonsmokers. Parents who support the prevention efforts may hear about the planned intervention and make efforts to enroll their children in the school, whereas parents who support smokers' rights may transfer their children to other districts.

INSTRUMENTATION. Aspects of the record-keeping procedures may change, leading to decreased reports of smoking. Students caught smoking may no longer be written up for their first offense, or decreased staff may be available for monitoring of the school's grounds.

Once again, these threats are only plausible if they occur at about the point of the intervention. For example, if the community were to remove cigarette machines 3 years after the implementation of the smoking prevention program, this action would not provide an alternative explanation of a drop in the number of students who are cited for smoking at the point of implementation of the program. Many of these threats are less plausible in time series designs using single participants, in which the experimenter has considerable control over the procedures, than in less controlled, naturalistic designs testing the effects of new laws, programs, or innovations on a population of participants.

DESIGN ENHANCEMENTS. The three potential threats to internal validity can be made less plausible through the addition of one or more design enhancements to the basic time series design. We consider three design enhancements that can often be used both in naturalistic tests of policy and in single-subject designs.

NO TREATMENT CONTROL SERIES. Comparable data may be available from another similar unit that did not receive treatment during the same time period. Recall that West et al. (1989) studied the effect of a new state law in Arizona that mandated a 24-hour jail term for drivers convicted of driving while intoxicated. They showed that this law led to an immediate 50% decrease in traffic fatalities in the city of Phoenix; however, the magnitude of this decrease declined over time following the intervention. In an attempt to rule out possible historical effects (e.g., unusually good weather; a change in the speed limit) coinciding with the implementation of the new law, West et al. also analyzed the traffic fatalities from a city (El Paso) in a nearby state that did not introduce a similar drunk driving law during the period of the study. The results showed no comparable change in traffic fatalities in El Paso at the same point in time (July 1982) when the law went into effect in Arizona. Ideally, the control series should be as similar as possible to the treatment series, and similar data should be available for the same time period.

OTHER CONTROL SERIES. Sometimes data are available from another series that would not be expected to be influenced by the treatment, but which would be expected to be impacted by many of the same nontreatment influences as the treatment series of interest. Reichardt and Mark (1997) illustrated examples of such control series using data collected on the same participants in different settings or using different measures that are not expected to be affected by the treatment. Often such a series can be constructed by disaggregating a series into units in which little, if any, impact is expected versus units in which a large impact of the intervention is expected. In a classic study, Ross, Campbell, and Glass (1970; see also Glass, 1988) studied the introduction of a new intervention to decrease drunk driving in Great Britain. British police were equipped with breathalysers and were empowered to suspend the driver's license of anyone who was driving while intoxicated. Ross et al. divided the data into weekend evenings, when the amount of alcohol use is normally high, and weekday mornings, when alcohol use is normally low. Consistent with their predictions, the implementation of the breathalyser program led to a large decrease in fatalities in the weekend evening hours, followed by a slow return to baseline. However, the implementation of the program did not lead to a detectable decrease during the weekday morning hours. The comparison of the two series helps rule out effects due to history and changes in instrumentation that should affect both series equally. Selection is an implausible threat in this case because it is unlikely that a country would experience appreciable in- or out-migration or changes in the population of drivers that were associated with a traffic safety campaign against drunk driving.

"i'f.


SWITCHING REPLICATIONS. In some cases, the same intervention may be implemented in more than one locale at different times. Time series analyses of the data from the different locales would be expected to show similar effects occurring at the point of intervention in each locale. For example, West et al. (1989) found that California implemented a law mandating a 24-hour jail term for driving under the influence of alcohol 6 months before the similar law was implemented in Arizona. Analysis of highway fatality data from San Diego showed a similar effect to that which occurred 6 months later in Phoenix: an immediate 50% reduction in highway fatalities followed by a slow decrease in the magnitude of the effect over time. The replication of the effect at a different point in time helps rule out other possible explanations associated with history (e.g., weather) and changes in instrumentation.

COMBINING DESIGN ENHANCEMENTS. As should be evident, multiple design features can be combined in a single study to strengthen further the causal inferences that may be made. For example, West et al. (1989) combined all three of the previously discussed design features in their studies of the effects of drunk driving laws. Ross et al. (1970) added a number of other control series to their basic time series design, together with a number of other design and measurement features that were designed to address specific threats to internal validity. When the researcher has control over treatment delivery, the design can be further strengthened by introducing and removing treatments following an a priori schedule. In well-designed time series designs, inferences can sometimes be made with a certainty approaching that of a randomized experiment.

DELAYED EFFECTS. Time series designs provide a relatively strong basis for causal inferences when the intervention produces immediate effects, but the strength of the causal inference may be weakened considerably when treatment effects are delayed, unless the length of the delay in the effect is predicted ahead of time. As an illustration, consider again our example of the school-based smoking prevention program. Suppose that a gradual decrease in the number of students cited for smoking on school grounds began 1 year following the implementation of the program. Under such circumstances, it would be difficult to attribute these changes in the level and slope of the series to the prevention program.

True delayed effects normally occur in one of two ways. First, new policies often do not go abruptly into effect on the specific starting date of the intervention. New programs often require time for personnel to be trained before they fully go into effect; new innovations often require time to diffuse through society. Inferences from time series may be strengthened by collecting supplementary data that assess the extent of implementation of the intervention over time. Using such data to model a gradual implementation process can strengthen causal inferences and increase the power of the statistical tests in time series analyses. Hennigan et al. (1982) used data on the proportion of households with television sets following the beginning of television broadcasts to strengthen their potential causal inferences about the effects of television on crime rates. In addition, they observed that the pattern of increase in burglaries was nearly identical following the beginning of broadcasting in some American cities in 1948 and others in 1952 in their switching replication design.

Second, some interventions affect processes whose outcomes will only be evident months or years later. To cite two examples, a birth control intervention would be expected to affect birth rates about 9 months later. A nationwide school-based smoking prevention program would be expected to show effects on lung cancer rates beginning about 35 years after the intervention as the participants matured into the age range when lung cancer first begins to become manifest. In both cases theory provides a strong basis for expecting the effect to occur at a specified time lag or over a specified distribution of time lags following the intervention. In the absence of strong theory, similarly strong causal inferences cannot be made. As previously noted, a decrease in school citations for smoking 1 year following the introduction of a smoking prevention program cannot be interpreted confidently as a program effect.

STATISTICAL POWER. Two issues of statistical power commonly arise in time series analysis. First, researchers often evaluate the effects of an intervention shortly after the intervention has been implemented, giving rise to a series with many preintervention, but few postintervention, observations. Given a series with a fixed number of time points, the statistical power of the analysis to detect treatment effects is greatest when the intervention occurs at the middle rather than at one of the ends of the series. Second, the existence of serial dependency has complex effects on calculations of statistical power. Within the preintervention series and within the postintervention series, to the degree that observations are not independent, each observation in effect counts less than 1, thus increasing the total number of observations needed to achieve a specified level of statistical power. On the other hand, the use of the same participant or same population of participants throughout the series reduces the variability of the series relative to one in which different participants are sampled at each time point. The statistical power of a given time series analysis is determined in large part by the trade-off between these two effects in the specific research context.
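The idea that each autocorrelated observation "counts less than 1" can be expressed with a standard first-order approximation. This formula assumes AR(1)-like dependency and concerns estimation of a series' level; it is an illustration, not a formula given in this chapter.

```python
# A sketch of how serial dependency erodes effective sample size: under a
# first-order (AR(1)) approximation, n autocorrelated observations carry
# roughly the information of n * (1 - rho) / (1 + rho) independent ones
# when estimating the mean level of a series.
def effective_n(n, rho):
    """Approximate number of independent observations equivalent to n
    observations whose lag-1 autocorrelation is rho."""
    return n * (1 - rho) / (1 + rho)

print(effective_n(100, 0.0))   # 100.0 -- independent observations count fully
print(effective_n(100, 0.5))   # about 33 -- each observation counts less than 1
```

Power calculations based on the nominal number of time points will therefore be too optimistic whenever positive serial dependency is present.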

CONCLUSION. Interrupted time series designs provide a strong quasi-experimental approach when interventions are introduced at a specific point in time so that time may be used as a proxy for the true model of treatment assignment. Such designs provide a strong basis for the evaluation of the effects of changes in social policy or new innovations that are implemented at a specific point in time. The basic time series design often allows researchers to credibly rule out several threats to internal validity: Any alternative explanation, to be plausible, must account for why the change in the series occurred at a particular point in time. Design enhancements, including the use of various control series, switching replications, and the introduction and removal of treatments, can further strengthen the causal inferences that may be made (see also Barlow & Hersen, 1984; Kratochwill & Levin, 1992). Given their ability to rule out alternative explanations through both design features and statistical adjustment, interrupted time series designs represent one of the strongest alternatives to the randomized experiment.

Nonequivalent Control Group Design

The most commonly used quasi-experimental alternative to the randomized experiment is the nonequivalent control group design. In this design, a group of participants is given a treatment, or a "natural event" occurs to the group. A second (comparison) group that does not receive the treatment is also identified. Both groups of participants are measured both before and after the treatment. Randomization is not used to determine the treatment group assignment; rather, the process through which participants end up in the treatment and comparison groups is not fully known and is therefore difficult to model statistically. To cite three examples, Lehman, Lampert, and Nisbett (1988) compared the level of statistical reasoning of advanced graduate students in psychology, which emphasizes training in statistics, with that of advanced graduate students in other disciplines (e.g., chemistry) that have substantially lower training emphasis in statistics.

and control groups on covariates and hidden variables have been adequately ruled out as potential explanations for observed differences on the outcome variable (e.g., see Barnow, 1973; Bentler & Woodward, 1978; Cicirelli, Cooper, & Granger, 1969; and Magidson, 1977, for diverse analyses and reanalyses of the data from the original Head Start study). Because of this problem, researchers are well advised to add other design features (presented later in this section) to the basic nonequivalent control group design in order to reach more convincing causal inferences.

STRATEGIES FOR EQUATING GROUPS. From the standpoint of the RCM, the goal of the researcher using the nonequivalent control group design parallels that of the regression discontinuity design: The researcher must adequately model the mechanism by which participants were assigned to the treatment and control groups. When this goal is accomplished, the treatment effect estimate will be an unbiased estimate of the treatment effect in the population (Rosenbaum, 1984; Rosenbaum & Rubin, 1984). As in the regression discontinuity design, the estimate of the treatment effect is conditioned on the participant background variables that are related to treatment assignment.

As noted previously, the central problem in the nonequivalent control group design is that units (participants) are not assigned to the treatment and control groups according to any known rule. Researchers must use whatever information they have available to identify pretest or other background variables (e.g., socioeconomic status, prior educational or medical history) with which to model the process of selection into the treatment and control groups. The critical background variables are those that are related to both (a) selection into the two treatment groups and (b) the outcome variable in the population. Pretest variables that are related to either (a) or (b) but not both do not bias treatment effect estimates (Berk, 1988).

Researchers use these pretest and background variables in an attempt to adjust statistically for differences between the two groups that existed prior to treatment. Several assumptions must be made in order for these statistical adjustment techniques to produce unbiased estimates of the causal effect. These include:

1. All important covariates have been identified; there are no hidden variables.
2. Each of the covariates has been measured with perfect reliability, or statistical procedures (e.g., structural equation modeling) have properly adjusted for unreliability in the covariates.
3. The functional form (e.g., linear) of the relationship between the pretest variables and the outcome variable has been correctly specified.
4. The pretest maturation rates do not differ in the treatment and control groups, or proper adjustments for differences in the pretest maturation rates have been successfully undertaken. In the basic nonequivalent control group design, such differences in maturation rates can be indicated by Group × Pretest interactions or unequal variances in the treatment and control groups at pretest or posttest.

When each of these assumptions is met, the nonequivalent control group design has in effect been made equivalent to the regression discontinuity design. The estimate of the treatment effect conditioned on the statistical model that properly adjusts for selection into the treatment and control groups will lead to an unbiased estimate of the causal effect in the population. All other covariates and hidden variables will then be unrelated to treatment condition, a condition that has been termed strong ignorability (Holland, 1986; Rosenbaum, 1984; Rosenbaum & Rubin, 1984). In practice, however, meeting these assumptions and thereby achieving strong ignorability is both extraordinarily difficult and fraught with uncertainty. Consequently, researchers cannot be confident in practice that their estimates of the treatment effect using the nonequivalent control group design are unbiased.

SELECTING COVARIATES. Statisticians and methodologists have considered a variety of possible methods of identifying important covariates that should be included in statistical models that attempt to adjust for pretest differences and achieve strong ignorability. Reichardt, Minton, and Schellenger (1986) suggested considering both (a) those variables that are theoretically expected to be related to selection into the treatment and control groups and (b) those variables that are known from the literature to be related to the outcome variables. Cochran (1965) and Rosenbaum (1995) have both suggested conducting exploratory analyses of all variables measured at pretest. Based on this work, Cochran suggested retaining for further consideration any covariate for which the t-value for the difference between the treatment and control groups at pretest was greater than 1.5 (see also Canner, 1984, 1991). This technique can be supplemented by a reexamination of the excluded covariates once the selection model has been developed to detect any remaining covariates that are


not adequately equated in the treatment and control groups (Rosenbaum, 1995).

Note that the pretest measure of the outcome variable of interest has a special status in the nonequivalent control group design. Reichardt (1979) and Rosenbaum (1995) have both noted that the best method of reducing sensitivity to unmeasured variables is to identify covariates that are (a) unaffected (i.e., not caused) by the treatment, but are (b) strongly related to the outcome variable. The pretest score on the outcome of interest, particularly when there is a short time lag between pretest and posttest, normally meets these requirements.

STATISTICAL ADJUSTMENT TECHNIQUES. A variety of statistical techniques can be used to analyze the data from the nonequivalent control group design. The techniques do not always produce the same result; they use different statistical procedures and consequently have different strengths and weaknesses. For example, only the second technique considered below, analysis of covariance with correction for unreliability, addresses the issue of measurement error. Nonetheless, each statistical technique has the same goal: to attempt to achieve strong ignorability by providing a statistical adjustment that properly removes pretest differences between the treatment and control groups.

ANALYSIS OF COVARIANCE. Analysis of covariance adjusts the posttest score for the measured pretest scores. Typically, adjustment is made only for the linear effect of the pretest scores on the posttest scores. No adjustment is made for variables that are not included in the analysis, nor is the proper adjustment made for variables that are measured with error. When one covariate is used, unreliability will result in too little adjustment for the initial differences between treatment groups. When more than one covariate is used and one or more of the variables is less than perfectly measured, the outcome is less clear: Typically, unreliability results in too little adjustment, but proper adjustment or even overadjustment may occur in some cases. In a specific study, the estimate of the treatment effect may be too large, too small, or even just right.

ANALYSIS OF COVARIANCE WITH CORRECTION FOR UNRELIABILITY. In this procedure, an attempt is made to adjust the treatment effect based on an estimate of what the pretest scores would have been if they had been measured without error (Huitema, 1980). This adjustment is now typically performed using structural equation modeling programs such as EQS (Bentler, 1995) or LISREL (Jöreskog & Sörbom, 1993). In one variant of this procedure, each measured covariate is specified to be the result of an unmeasured true score and an error of measurement. The variance of the error of measurement is normally set equal to the variance of the variable multiplied by (1 − reliability of the measure), i.e., σ²x[1 − ρxx] (see Bollen, 1989). The test-retest correlation for a relatively brief time interval is normally preferred as the estimate of ρxx, although other measures of reliability may be preferred in some situations15 (Campbell & Boruch, 1975; Campbell & Erlebacher, 1970; Linn & Werts, 1973). The outcome variable is then regressed on the true scores of each pretest variable and a dummy-coded variable representing the treatment. The estimate then represents the effect of the treatment adjusted for the differences among the covariates at pretest, corrected for unreliability of measurement. A second important method of correcting for unreliability uses multiple measures of each pretest construct rather than a single measure that is adjusted with an estimate of unreliability. Using structural equation modeling, the multiple measures are used to provide a theoretically error-free estimate of each latent construct assessed at pretest. The difference between the latent means on the outcome variable in the treatment and control groups is compared after adjustment for the latent construct(s) represented by the covariates (Sörbom, 1979). Aiken, Stein, and Bentler (1994) illustrated the application of this approach to the nonequivalent control group design in their comparison of the effectiveness of two drug treatment programs.

GAIN SCORE ANALYSIS. Kenny (1975; see also Huitema, 1980; Judd & Kenny, 1981) has proposed an alternative approach to the analysis of the nonequivalent control group design. In this approach, the pretest and posttest scores are first transformed so that the pretest and posttest variances are equated. This transformation is most easily accomplished by separate standardizations of the pretest and posttest data (see Huitema, 1980, chapter 15, for details). The mean gain in the treatment group is then compared with the mean gain in the control group to provide an estimate of the treatment effect. This approach provides good estimates of the treatment effect when it can be assumed that the natural pattern of growth in the absence of treatment is linear and constant in the two groups, or follows a fan-spread pattern, in which the growth is linear but occurs at a higher rate for participants having higher scores at pretest. Many other forms of growth (or decline) may not be well modeled by this procedure.

15 Cook and Reichardt (1976) and Judd and Kenny (1981) have suggested conducting sensitivity analyses using several plausible estimates of the reliability to bracket the true effect.
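A small simulation can make the contrast among these adjustment techniques concrete. The sketch below is not drawn from the chapter or its sources: it assumes hypothetical data in which the treatment group starts 0.5 units higher at baseline, the single observed pretest has a known reliability of .80, and the true treatment effect is 1.0. It computes an ordinary ANCOVA estimate, a reliability-corrected ANCOVA estimate (disattenuating the pooled within-group slope, a simple stand-in for the structural equation approaches described above), and a standardized gain score estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000      # participants per group (hypothetical simulated study)
tau = 1.0     # true treatment effect built into the simulation
rel = 0.8     # reliability of the observed pretest, assumed known

d = np.repeat([0.0, 1.0], n)                     # 0 = control, 1 = treatment
t_true = rng.standard_normal(2 * n) + 0.5 * d    # latent true pretest; groups differ at baseline
x = t_true + np.sqrt((1 - rel) / rel) * rng.standard_normal(2 * n)  # fallible observed pretest
y = t_true + tau * d + 0.5 * rng.standard_normal(2 * n)             # posttest

# 1. Ordinary analysis of covariance: regress posttest on pretest + dummy.
X = np.column_stack([np.ones(2 * n), x, d])
tau_ancova = np.linalg.lstsq(X, y, rcond=None)[0][2]

# 2. ANCOVA with correction for unreliability: disattenuate the pooled
#    within-group slope by the known reliability before adjusting.
mx = np.where(d == 1, x[d == 1].mean(), x[d == 0].mean())
my = np.where(d == 1, y[d == 1].mean(), y[d == 0].mean())
b_within = ((x - mx) @ (y - my)) / ((x - mx) @ (x - mx))
tau_corrected = (y[d == 1].mean() - y[d == 0].mean()) \
    - (b_within / rel) * (x[d == 1].mean() - x[d == 0].mean())

# 3. Standardized gain scores: equate pretest and posttest variances by
#    standardizing each, then compare mean gains (result is in SD units).
gain = (y - y.mean()) / y.std() - (x - x.mean()) / x.std()
tau_gain_sd = gain[d == 1].mean() - gain[d == 0].mean()
```

Because the pretest is measured with error, the uncorrected ANCOVA underadjusts for the baseline difference (here, it overestimates the effect), while the reliability-corrected estimate should fall close to the true value of 1.0; the gain score estimate is on a different (standardized) scale and is not directly comparable.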

ECONOMETRIC SELECTION MODELS. A variety of selection models originally developed in econometrics have been applied to the nonequivalent control group design (e.g., Barnow, Cain, & Goldberger, 1980; Heckman, 1979, 1989, 1990; Muthén & Jöreskog, 1983). In these models, two separate regression equations are estimated. The first (selection) equation uses measured variables to predict the assignment of the participant to the treatment versus the control group and yields a probability for each participant of assignment to the treatment group. The second equation uses this selection probability, the treatment actually received by the participant, and other measured covariates to estimate the treatment effect. In practice, both equations are estimated simultaneously.

Selection models can be shown to provide highly accurate estimates of treatment effects when their assumptions are met. These models presume that highly reliable measures of all variables related to selection are included in the equation and that the sample size is relatively large. More critical, the results of these models are highly sensitive to violations of several additional statistical assumptions (e.g., the inclusion of a measured variable that affects selection but is independent of the outcome measure) that must be made for proper estimation. Simulation studies by Stolzenberg and Relles (1990) and Virdin (1993) showed that selection models produced much poorer estimates than simpler techniques, such as analysis of covariance and gain score analysis, when these assumptions are violated (see also Imbens & Rubin, 1994). LaLonde (1986) showed large discrepancies between causal effect estimates obtained from randomized experiments and econometric selection models.
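A minimal two-equation sketch of this family of models, in the spirit of Heckman's (1979) two-step approach but not taken from any of the sources cited above: hypothetical simulated data with correlated selection and outcome errors, a probit selection equation fit by Fisher scoring, and an outcome equation augmented with the generalized residual (inverse Mills ratio terms) as a control function. The variable z1, which affects selection but not the outcome, plays the role of the exclusion restriction the text describes; all names and values are assumptions of the sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 4000
tau = 2.0   # true treatment effect built into the simulation

z1 = rng.standard_normal(n)                      # affects selection only
rho = 0.6                                        # corr(selection error, outcome error)
u = rng.standard_normal(n)
eps = rho * u + math.sqrt(1 - rho**2) * rng.standard_normal(n)

d = (z1 + u > 0).astype(float)                   # selection equation
y = 1.0 + tau * d + eps                          # outcome equation

Phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
phi = lambda t: np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)

# Step 1: probit regression of treatment assignment on the selection
# predictors, fit by Fisher scoring.
Z = np.column_stack([np.ones(n), z1])
gamma = np.zeros(2)
for _ in range(25):
    idx = Z @ gamma
    p = np.clip(Phi(idx), 1e-10, 1 - 1e-10)
    f = phi(idx)
    score = Z.T @ ((d - p) * f / (p * (1 - p)))
    info = (Z * (f**2 / (p * (1 - p)))[:, None]).T @ Z
    gamma = gamma + np.linalg.solve(info, score)

# Generalized residual (inverse Mills ratio) for each participant.
idx = Z @ gamma
lam = np.where(d == 1, phi(idx) / Phi(idx), -phi(idx) / (1 - Phi(idx)))

# Step 2: outcome regression including the control function term; the
# coefficient on lam estimates rho under the bivariate-normal assumption.
X_cf = np.column_stack([np.ones(n), d, lam])
tau_cf = np.linalg.lstsq(X_cf, y, rcond=None)[0][1]

# Naive comparison that ignores selection: difference in group means.
tau_naive = y[d == 1].mean() - y[d == 0].mean()
```

With correlated errors, the naive mean difference is badly biased, while the control-function estimate recovers the built-in effect; as the text warns, this hinges on the distributional assumptions and the exclusion restriction holding.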

MATCHING: TRADITIONAL AND MODERN PROCEDURES. In the simplest form of matching that is traditionally considered, each case in the treatment group is paired with a single case in the control group using a single measured covariate. To illustrate, consider a simplified example comparing two small school classrooms (nA = 12; nB = 13), one of which is to be given a new instructional treatment and one of which is to be given standard instruction (control group). Table 3.5 presents hypothetical data for this illustration in which the pretest (e.g., IQ) data have been ordered from low to high within each group. We note

true differences between the means of the two groups on the matching variable. This is exactly the same problem we noted earlier with using analysis of covariance (without correction for unreliability) to adjust for initial differences between groups.

Statisticians have been more enthusiastic about the possibilities of matching and have developed techniques that can overcome several of the limitations noted above. Of special importance is Rosenbaum and Rubin's (1983, 1984) development of the concept of matching on the propensity score (see Rubin & Thomas, 1996). As in econometric selection models, the propensity score takes seriously the idea of attempting to develop a model of selection into the treatment and control groups. In this approach, a logistic regression equation is developed in which all of the available covariates are used to predict group assignment (treatment vs. control). For example, Rosenbaum and Rubin (1984) began with 74 available covariates in their comparison of the effectiveness of two treatments for heart disease in a study of 1,500 patients. Based on this equation, they estimated a propensity score for each participant that represents the probability that the specific participant would be assigned to the treatment group, given these background variables. These propensity scores are then used as the basis for matching participants in the treatment and control groups.

Rosenbaum and Rubin (1983, 1984) have shown that matching on propensity scores minimizes pretest differences between the treatment and control groups across the full set of measured variables. Rosenbaum (1986) also pointed out that matching adjusts for any functional form of relationship between the propensity score and the outcome measure, whereas the typical analysis of covariance adjusts only for linear relationships. Many of the earlier problems in finding adequate matches in the treatment and control groups can also be reduced through the use of modern computer search algorithms that identify the optimal pairing of participants16 (Rosenbaum, 1995). Further, if matching on propensity scores is supplemented with other analyses, a variety of other potential issues can be probed (see Rosenbaum, 1986, 1987, 1995; Rubin & Thomas, 1996). The two following sets of supplementary analyses are of particular interest.

16 The failure to find adequate matches highlights the limitations of generalization of the findings. Model-based statistical adjustments such as analysis of covariance imply that generalization across the full theoretical range of the covariate is possible. However, in regions where there are sparse data in one of the groups, such model-based generalization can represent a risky extrapolation of the findings (Cochran, 1957; see also Cook, 1993).

1. Although matching on propensity scores optimally balances the treatment and control groups across the set of covariates, it may not optimally match the two groups on variables of particular theoretical or empirical importance (e.g., the pretest measure of the outcome variable of interest). The treatment and control groups may be compared on each of the covariates using t-tests. If a researcher believes that certain covariates are especially critical and they do not appear to be well equated, the researcher may also perform analysis of covariance (or analysis of covariance with correction for unreliability to address issues of measurement error) to provide more precise adjustment for the effects of baseline differences in these specific variables.

2. A number of sensitivity analyses have been developed for matched group designs. These sensitivity analyses give an indication of the magnitude of bias

due to hidden variables that would be necessary to eliminate the treatment effect. As a simple example, Rosenbaum (1987) suggested calculating the standardized effect size d for the covariate (not including the pretest) that shows the largest (unadjusted) difference at pretest; this estimate is used as a rough estimate of the magnitude of an important hidden variable. The correlation of the pretest and the posttest on the outcome variable of interest (ρ12) is taken as the estimate of the maximum possible value of the relationship of the hidden variable with the outcome variable of interest. The product dρ12 is then used as a reasonable estimate of the amount that an important hidden variable not assessed at pretest could potentially reduce the estimate of the causal effect. If the difference between the treatment and control groups on the outcome variable can be reduced by this (or even a larger) amount and still attain statistical significance, then the estimate of the causal effect is taken as an indication that the treatment effect is likely to be robust to the effects of hidden variables. Rosenbaum (1991a, 1991b, 1994, 1995) also developed procedures for conducting sensitivity analyses using matched designs for cases involving a dichotomous outcome variable and for cases involving multiple continuous outcome variables. He has also developed procedures based on rank statistics that are less dependent on assuming that a specific statistical model (e.g., linear regression) is correct.17

In summary, the use of propensity scores overcomes many of the objections that were directed at earlier and simpler versions of matching. Particularly when supplemented by additional analyses to address specific issues (quality of matching on key covariates; unreliability of these key variables) and sensitivity analyses to address the likely impact of hidden variables, matching on the propensity score often provides distinct advantages over the sole use of statistical adjustment techniques. Belin et al. (1995) and Rosenbaum (1986) presented empirical illustrations of how the use of propensity scores combined with careful supplementary analyses can strengthen the conclusions that can be reached from the nonequivalent control group design. Recent meta-analytic studies in several areas of applied research have also provided an optimistic assessment of the success of careful matching procedures: These studies found no detectable difference in either the mean or variance of estimates of standardized treatment effects from randomized experiments and matched group designs (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). Nonetheless, a strong caveat remains: The fundamental condition of strong ignorability that is necessary for the causal interpretation of treatment effects in the nonequivalent control group design can be probed, but never definitively established. Thus, there is always a degree of uncertainty associated with estimates of causal effects on the basis of this design.

17 Rosenbaum (1986, 1987, 1995) also provided illustrations of a third type of supplementary analysis. He showed how the combination of propensity score matching and substantive theory in an area can be used to make predictions that provide checks on the assumption of strong ignorability. Because his examples depend on theory in substantive areas other than social psychology, they will not be discussed here.

THREATS TO INTERNAL VALIDITY. In terms of Campbell's approach, the basic nonequivalent control group design is particularly susceptible to four threats to internal validity (Cook & Campbell, 1979). The strength of the design is that the inclusion of the control group rules out basic threats, such as maturation, that occur equally in the treatment and control groups. The weakness of the basic design is that it does not rule out cases in which these threats operate differently in the treatment and control groups.

MATURATION. The treatment and control groups may differ in the rate at which the participants are growing or declining on the outcome variable prior to treatment. To illustrate, consider a nonequivalent control group design in which the Evans et al. (1981) smoking prevention program is given to all students in a suburban high school, whereas students in an inner city high school receive a control program unrelated to smoking. An example of the threat of differential maturation would occur if the number of cigarettes smoked per day in the urban high school was increasing at a faster rate than in the suburban high school in the absence of intervention.

HISTORY. Some other event may occur to one of the two groups, but not the other, that may be expected to affect the outcome variable. For example, the media in the suburban site might independently start a series pointing out the dangers of teenage smoking. Alternatively, lower cost generic or contraband cigarettes might become available in the urban, but not in the suburban, area.

STATISTICAL REGRESSION. Participants may be selected for the treatment or control group based on an unreliable or temporally unstable measured variable.
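Returning to the matching procedures discussed earlier in this section, the following sketch illustrates the full propensity score workflow on hypothetical simulated data (none of the variable names or numbers come from the studies cited above): a logistic selection model fit by Newton's method, greedy nearest-neighbor matching with replacement (a simplification of the optimal matching algorithms the text mentions), and a rough d × ρ12 hidden-bias screen in the spirit of Rosenbaum (1987). Because the simulation has no separate pretest of the outcome, the covariate c1 stands in when computing ρ12.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Hypothetical covariates that drive both selection into treatment and the outcome.
c1, c2 = rng.standard_normal(n), rng.standard_normal(n)
p_true = 1.0 / (1.0 + np.exp(-(0.8 * c1 - 0.5 * c2)))
d = (rng.random(n) < p_true).astype(float)
y = 2.0 * d + 1.0 * c1 + 0.7 * c2 + rng.standard_normal(n)  # true effect = 2

# Step 1: logistic regression of treatment on the covariates (Newton/IRLS).
Z = np.column_stack([np.ones(n), c1, c2])
beta = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    W = p * (1 - p)
    beta += np.linalg.solve((Z * W[:, None]).T @ Z, Z.T @ (d - p))
pscore = 1.0 / (1.0 + np.exp(-(Z @ beta)))

# Step 2: greedy matching (with replacement) of each treated participant
# to the control participant with the closest propensity score.
treated, controls = np.where(d == 1)[0], np.where(d == 0)[0]
matches = controls[np.abs(pscore[controls][None, :] -
                          pscore[treated][:, None]).argmin(axis=1)]
tau_matched = (y[treated] - y[matches]).mean()
tau_naive = y[d == 1].mean() - y[d == 0].mean()   # unadjusted comparison

# Rosenbaum-style screen: d * rho_12 approximates how much a hidden variable
# as imbalanced as the worst measured covariate could shift the estimate.
def cohens_d(v):
    sp = np.sqrt((v[d == 1].var(ddof=1) + v[d == 0].var(ddof=1)) / 2)
    return (v[d == 1].mean() - v[d == 0].mean()) / sp

worst_d = max(abs(cohens_d(c1)), abs(cohens_d(c2)))
rho_12 = abs(np.corrcoef(c1, y)[0, 1])   # stand-in for the pretest-posttest correlation
max_hidden_bias = worst_d * rho_12
```

The matched estimate should sit near the built-in effect of 2 while the naive comparison is biased by selection; if the matched effect minus max_hidden_bias would remain clearly nonzero, the effect is, by this rough screen, robust to a hidden variable of comparable size.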

Participants selected for the study in the inner city school may have on average been temporarily smoking fewer than their normal number of cigarettes per day (e.g., if several of the participants had just recovered from colds), whereas participants in the suburban school may have on average been smoking their normal number of cigarettes per day. Upon remeasurement, participants in each group would tend to report their typical level of smoking.

INSTRUMENTATION. Some aspect of the measuring instrument may change from pretest to posttest in only one of the two groups. This threat can take many forms, which can be manifested in such problems as differences between the two groups in the factor structure (see Pitts, West, & Tein, 1996), the reliability of the scale, or the interval properties of the scale itself (e.g., ceiling or floor effects). Local changes in record keeping practices or in the sensitivity of the measures can also produce this threat.

SUMMARY. In the example given above, the smoking prevention program was expected to reduce the number of cigarettes that were smoked by the average student. Given that the suburban high school was assigned to receive the smoking prevention program, any observed treatment effect is confounded with the specific forms of the threats of maturation, history, regression, and instrumentation presented above. Any observed difference between the means of the treatment and control groups may be due to the causal effect of treatment, the specific threats to internal validity, or both. However, note that if the smoking prevention intervention had been assigned to the inner city school instead of the suburban school, none of the specific forms of threats to internal validity illustrated above would provide a plausible alternative explanation of the effects of the program. In this second case, the threats to internal validity bias the estimates of the causal effect, but in a direction opposite from that expected for true effects of the program. These specific threats may lead to estimates of the causal effect of treatment that are too low, thus lowering the statistical power of the test. However, these threats do not call the existence of the causal effect into question.

DESIGN ENHANCEMENTS. The basic nonequivalent control group design may be strengthened by adding design and analysis components that specifically address the plausible threats to validity. The ideas here build on previous discussions of design enhancements for addressing threats to internal validity in other quasi-experimental designs. Reynolds and West (1987) presented extensive examples of the use of several of these techniques in the evaluation of the effectiveness of a sales campaign.

MULTIPLE CONTROL GROUPS. Typically, no control group can be identified that is comparable to the treatment group on all factors that could affect the outcome of the study. It is often possible, however, to identify multiple imperfect control groups, each of which can address some of the threats to validity. For example, Roos, Roos, and Henteleff (1978) compared the pre- and postoperation health of tonsillectomy patients with (a) a nonoperated control group matched on age and medical diagnosis and (b) nonoperated siblings who were within 5 years of the patient's age. The first control group roughly equates the medical history of the treatment participant; the second control group roughly equates the genetic predisposition, environment, and family influences of the treatment participant. Obtaining similar estimates of the causal effects across each of the comparison groups can help minimize the plausibility of alternative explanations of the obtained results. Rosenbaum (1987; see also associated discussion papers) discussed several methods of using the data from carefully chosen multiple control groups to reduce the likelihood that hidden variables may be responsible for the results.

NONEQUIVALENT DEPENDENT VARIABLES. Sometimes data can be collected on additional dependent variables or in different contexts on the same group of participants. Ideally, the measures selected should be conceptually related to the outcome variable of interest and should be affected by the same threats to internal validity as the primary dependent variable. However, the researcher would not expect these measures to be affected by the treatment. To illustrate, Roos et al. (1978) compared the tonsillectomy group and the two control groups on their primary outcome variable, health insurance claims for respiratory illness, which they expected to decrease only in the treatment group. They also compared health insurance claims for other, nonrespiratory illness that would not be expected to decrease following treatment. To the extent that the hypothesized pattern of results is obtained on the primary outcome variable, but not on the nonequivalent dependent variables, the interpretation of the results as a causal effect is strengthened.

MULTIPLE PRETESTS OVER TIME. The addition of multiple pretests on the same variable over time to the nonequivalent control group design can help rule out threats associated with differential rates of maturation

or regression to the mean in the two groups. In this design, the multiple pretests are used to estimate the rates of maturation in the two groups prior to treatment. These estimates can then be used to adjust the treatment effect under the assumption that the pattern of maturation within each group would not change during the study. For example, Reynolds and West (1987) used data from sales in treatment and comparison stores prior to the introduction of a sales campaign in treatment stores to adjust for potential (maturational) trends in sales levels. Recent applications of hierarchical linear models to repeated measures data (Bryk & Raudenbush, 1992; Willett & Sayer, 1994) have provided a number of promising and flexible models of adjusting the treatment effect based on information about each participant's growth rate and the mean growth rates within the treatment and control groups (see Pitts, 1999, for a detailed development of possible models).

PATTERN OF RESULTS. As we briefly noted previously, not all threats to internal validity are plausible given certain patterns of results. Cook and Campbell (1979) noted that the four threats to validity may increase or decrease in plausibility depending on the specific area of research and the pattern of results that is obtained. For example, cases of differential rates of maturation almost never involve cases in which only one group is growing or the two groups are growing in opposite directions; rather, they involve differential rates of either growth or decline in the two groups. Similarly, regression to the mean involves a disadvantaged group improving or an advantaged group declining on remeasurement relative to another selected group. It is implausible to expect that regression to the mean would lead a previously disadvantaged group to surpass a previously advantaged group.

Figure 3.6 draws on these ideas and helps identify the likely threats to internal validity that would be associated with several different possible outcome patterns. For example, applying the previous reasoning to the examples, outcomes B and E suggest that differential rates of maturation should be carefully considered as a potential threat to the internal validity of the results. In contrast, differential rates of maturation are a less likely threat given outcomes A, C, or particularly D.

Using the guidelines presented in Figure 3.6, researchers can identify those threats to internal validity that deserve special attention given the specific pattern of results obtained in their study. For example, if we consider outcome D, the threats of differential maturation rates, instrumentation, and regression are unlikely, leaving differences in history in the two groups as the threat to internal validity on which researchers should focus their attention. If examination of careful documentation of other possible influences on the treatment and control groups failed to identify plausible historical influences that differed in the two groups, the researchers could reach at least a tentative conclusion that their treatment led to a causal effect.

OTHER ISSUES

STATISTICAL POWER. As noted previously, to the extent the condition of strong ignorability can be approached, the nonequivalent control group design theoretically becomes increasingly similar to the regression discontinuity design. Because in practice researchers cannot convincingly establish that the assumptions necessary for strong ignorability have been met, multiple analyses must often be conducted to probe assumptions and to rule out plausible threats to validity. Methodologists frequently suggest that researchers attempt to bracket the estimate of the treatment effect, sometimes reporting the mean (or median), largest, and smallest effect sizes across diverse analyses (see Reichardt & Gollob, 1997, for a discussion). This strategy increases the certainty of our causal inference to the extent that all of the analyses yield similar estimates of the treatment effect. However, this strategy also implies that if we wish to show that the bracket does not include a treatment effect of 0, our estimate of statistical power must be based on the statistical model that can be expected to lead to the greatest attenuation of the magnitude of the treatment effect. As a result, the power of the statistical tests in the nonequivalent control group design can be expected to be lower than for a randomized experiment.

CAUSAL GENERALIZATION. Nonequivalent control group designs are normally conducted with the units and settings of interest and are thus typically relatively high in external validity. Randomization can and should be implemented more frequently than it has been (Cook & Campbell, 1979; Rosenbaum, 1986; Shadish et al., in press), but many potential participants and organizational contexts will not accept this assignment procedure. Consequently, the nonequivalent control group design, ideally including the design enhancements discussed previously, is sometimes the only feasible option. For example, it would be difficult to get public and private schools to accept random assignment of students; furthermore, not all students (or their families) would be willing to participate in such an assignment process. Even if a random assignment process could be implemented, it is likely that the


[Figure 3.6 appears here: five panels (A–E), each plotting treatment and control group means at pretest and posttest, paired with ratings of the associated threats to internal validity.]

Figure 3.6. Interpretable outcomes associated with the nonequivalent control group design. Note: ++ indicates a highly likely threat to the design when this outcome occurs; + indicates a likely threat; −− indicates a less likely threat; − indicates a very unlikely threat. Adapted from S. G. West, Beyond the laboratory experiment, in P. Karoly (Ed.), Measurement strategies in health psychology, p. 227. Reprinted by permission of John Wiley and Sons.
participating families and schools would be highly atypical of the population of units and settings to which the researchers intend to generalize their results (West & Sagarin, 2000). In addition, treatments and observations in nonequivalent control group designs are often more similar to those that will be implemented in actual settings, further enhancing causal generalization to the UTOS of interest. The ability to generalize causal effects to the UTOS of interest is one of the primary strengths of the nonequivalent control group design; the inability to make precise statements about the magnitude (and sometimes direction) of causal effects is its primary weakness.

CONCLUSION. The basic nonequivalent control group design provides the least satisfactory of the three quasi-experimental alternatives to the randomized experiment we have considered. Such designs have traditionally been viewed as "very weak, easily misinterpreted, and difficult to analyze" (Huitema, 1980, p. 352). From the perspective of the RCM, there is always some uncertainty as to whether the conditions of strong ignorability and SUTVA have been met. From the perspective of Cook and Campbell (1979) and Shadish et al. (in press), the four threats of differential maturation rates, history, statistical regression, and instrumentation in the two groups represent the

primary threats to internal validity that must be ruled out. The internal validity of the design can potentially be enhanced by the inclusion of design features that address specific threats to internal validity, such as multiple control groups, nonequivalent dependent measures, and multiple pretests over time (Reynolds & West, 1987). Data may sometimes be collected that permit the researcher either to adequately model the selection process or to match groups on propensity scores. The use of multiple statistical adjustment procedures that have different assumptions, but which converge on similar estimates of the causal effect, can also produce increased confidence in the results. In general, strong treatment effects that can be shown through sensitivity analyses to be robust to the potential influence of hidden variables, coupled with careful consideration of the remaining threats to internal validity, can sometimes overcome most of the limitations of this design.

SOME FINAL OBSERVATIONS

This chapter has provided an introduction to experimental and quasi-experimental designs that are useful in field settings. In contrast to the laboratory research methods, in which social psychologists have traditionally been trained, modern field research methodology reflects the more complex and less certain real-world settings in which it has been applied. As we have seen, even the randomized experiment may not definitively rule out all threats to internal validity - problems such as attrition, treatment noncompliance, and atypical responses of control groups occur with some regularity. Weaker quasi-experimental designs, such as the nonequivalent control group design, rule out a far smaller set of the threats to internal validity. Consequently, social psychologists working in field settings must carefully articulate the threats to internal validity and use specific procedures in an attempt to minimize the likelihood that a specific threat to internal validity could account for the results. These procedures include adding specific design features (e.g., multiple control groups), additional measurements (e.g., time series data), and statistical adjustments (see also Reichardt & Mark, 1997). The use of these procedures can in some cases produce strong designs whose internal validity approaches that of a randomized experiment. In other cases, one or more threats to internal validity will remain plausible despite the investigator's best efforts.

Because of the complexity and uncertainty associated with research conducted in the field, it is important for researchers to acknowledge publicly the known limitations of their findings and to make their data available for analysis by other researchers (Cook, 1983; Houts et al., 1986; Sechrest, West, Phillips, Redner, & Yeaton, 1979). Public criticism of research provides an important mechanism through which biases and threats to internal validity can be identified and their likely effects, if any, on the results of the study can be assessed. Additional studies, with different strengths and weaknesses, can then be conducted to help bracket the true causal effect. Although considerable uncertainty may be associated with the results of any single study, a consistent finding in the research literature can greatly increase confidence in the robustness of the causal effect (Shadish et al., in press).

Reflecting recent practice, the majority of the research examples discussed in this chapter have been applied in nature. The development of the methods discussed in this chapter has provided a strong basis for applied social psychologists to make causal inferences about the effects of treatment programs delivered in the settings of interest in the field. At the same time, social psychologists interested in basic research are beginning to develop new areas of substantive interest. Among these areas are the influence of culture (e.g., collectivist vs. individualist), major life stressors, long-term relationships, and new aspects of the self (e.g., generativity). Methodological issues that arise in these areas, such as selection of participants into conditions, attrition, and growth or decline over time, require that researchers consider many of the specific design, measurement, and statistical techniques developed to address these problems. Interest in problems such as coping with the stress of physical illness often requires that researchers collect data in settings outside the laboratory. And interest in problems like long-term relationships and generativity will often require that researchers study relevant samples of participants over extended time periods. These substantive developments suggest that the traditional modal study in social psychology identified in West et al.'s (1992) review - a randomized experiment, conducted in the laboratory, lasting no more than 1 hour, and using undergraduate students as participants - may no longer represent the design of choice in many of these emerging research areas. These substantive developments call for new variants of the laboratory experiment that incorporate some of the features of modern field experiments and quasi-experiments discussed in this chapter. These developments also may portend a possible return to modern improved versions of research methods more commonly used in some

previous eras in the history of social psychology - experiments, quasi-experiments, and other studies testing basic social psychological principles in field contexts. Such potential developments would help social psychology broaden the Units, Treatments, Observations, and Settings represented in its research base, providing a stronger basis for causal generalization. They would supplement the demonstrated strengths and complement the weaknesses of traditional laboratory experiments. Such designs hold the promise of a more balanced mix of methodological approaches to basic issues in social psychology, in which researchers could make legitimate claims for both the internal validity and the causal generalization of their effects.

REFERENCES

Aiken, L. S., Stein, J. A., & Bentler, P. M. (1994). Structural equation analyses of clinical subpopulation differences and comparative treatment outcomes: Characterizing the daily lives of drug addicts. Journal of Consulting and Clinical Psychology, 62, 488-499.
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Aiken, L. S., West, S. G., Schwalm, D. E., Carroll, J. L., & Hsiung, S. (1998). Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review, 22, 207-244.
Aiken, L. S., West, S. G., Woodward, C. K., Reno, R. R., & Reynolds, K. D. (1994). Increasing screening mammography in asymptomatic women: Evaluation of a second generation, theory-based program. Health Psychology, 13, 526-538.
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with commentary). Journal of the American Statistical Association, 91, 444-472.
Arbuckle, J. L. (1996). Full information estimation in the presence of incomplete data. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 243-278). Mahwah, NJ: Erlbaum.
Aronson, E. (1966). Avoidance of inter-subject communication. Psychological Reports, 19, 238.
Aronson, E., Wilson, T. D., & Brewer, M. B. (1998). Experimentation in social psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (4th ed., Vol. 1, pp. 99-142). Boston: McGraw-Hill.
Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice Hall.
Barcikowski, R. S. (1981). Statistical power with group mean as the unit of analysis. Journal of Educational Statistics, 6, 267-285.
Barlow, D. H., & Hersen, M. (1984). Single case experimental designs: Strategies for studying behavior change (2nd ed.). New York: Pergamon.
Barnow, L. S. (1973). The effects of Head Start and socioeconomic status on cognitive development of disadvantaged students (Doctoral dissertation, University of Wisconsin, Madison, 1974). Dissertation Abstracts International, 34, 6191A.
Barnow, L. S., Cain, G. G., & Goldberger, A. S. (1980). Issues in the analysis of selectivity bias. In E. S. Stromsdorfer & G. Farkas (Eds.), Evaluation studies review annual (Vol. 5, pp. 43-59). Beverly Hills, CA: Sage.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator distinction in social psychological research: Conceptual, strategic and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182.
Belin, T. R., Elashoff, R. M., Leung, K.-M., Nisenbaum, R., Bastani, R., Nasseri, K., & Maxwell, A. (1995). Combining information from multiple sources in the analysis of a non-equivalent control group design. In C. Gatsonis, J. S. Hodges, R. E. Kass, & N. D. Singpurwalla (Eds.), Case studies in Bayesian statistics (Vol. 2, pp. 241-260). New York: Springer-Verlag.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software, Inc.
Bentler, P. M., & Woodward, J. A. (1978). A Head Start re-evaluation: Positive effects are not yet demonstrable. Evaluation Quarterly, 2, 493-510.
Berk, R. A. (1988). Causal inference for sociological data. In N. J. Smelser (Ed.), Handbook of sociology (pp. 155-172). Newbury Park, CA: Sage.
Berkowitz, L. (1993). Aggression: Its causes, consequences, and control. New York: McGraw-Hill.
Berkowitz, L., & Donnerstein, E. (1982). External validity is more than skin deep. American Psychologist, 37, 245-257.
Bickman, L., & Henchy, T. (Eds.). (1972). Beyond the laboratory: Field research in social psychology. New York: McGraw-Hill.
Biglan, A., Hood, D., Borzovsky, P., Ochs, L., Ary, D., & Black, C. (1991). Subject attrition in prevention research. In C. G. Luekefeld & W. Bukoski (Eds.), Drug abuse prevention intervention research: Methodological issues (pp. 213-234). Washington, DC: NIDA Research Monograph #107.
Bloom, H. S. (1984). Accounting for no-shows in experimental evaluation designs. Evaluation Review, 8, 225-246.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Borenstein, M., Cohen, J., & Rothstein, H. (1997). Confidence intervals, effect size, and power [Computer program]. Mahwah, NJ: Erlbaum.
Boruch, R. F. (1997). Randomized experiments for planning and evaluation. Thousand Oaks, CA: Sage.
Boruch, R. F., McSweeny, A. J., & Soderstrom, E. J. (1978). Randomized field experiments for program planning, development, and evaluation. Evaluation Quarterly, 2, 655-695.

Box, G. E. P., & Draper, N. R. (1987). Empirical model building and response surfaces. New York: Wiley.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time series analysis: Forecasting and control (3rd ed.). San Francisco: Holden-Day.
Braucht, G. N., & Reichardt, C. S. (1993). A computerized approach to trickle-process, random assignment. Evaluation Review, 17, 79-90.
Bryan, A. D., Aiken, L. S., & West, S. G. (1996). Increasing condom use: Evaluation of a theory-based intervention to prevent sexually transmitted disease in young women. Health Psychology, 15, 371-382.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
Burstein, L. (1980). The analysis of multilevel data in educational research and evaluation. In D. Berliner (Ed.), Review of research in education (Vol. 8, pp. 158-223). Washington, DC: American Educational Research Association.
Burstein, L. (1985). Units of analysis. In International encyclopedia of education (pp. 5368-5375). Oxford, England: Pergamon.
Burstein, L., Kim, K.-S., & Delandshire, G. (1989). Multilevel investigations of systematically varying slopes: Issues, alternatives, and consequences. In R. D. Bock (Ed.), Multilevel analysis of educational data (pp. 233-276). San Diego, CA: Academic Press.
Cahan, S., & Davis, D. (1987). A between-grade-levels approach to the investigation of the absolute effects of schooling on achievement. American Educational Research Journal, 24, 1-12.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297-312.
Campbell, D. T. (1969). Reforms as experiments. American Psychologist, 24, 409-429.
Campbell, D. T. (1986). Relabeling internal and external validity for applied social scientists. In W. M. K. Trochim (Ed.), Advances in quasi-experimental design and analysis (Vol. 31, pp. 67-78). San Francisco: Jossey-Bass.
Campbell, D. T., & Boruch, R. F. (1975). Making the case for randomized assignment by considering the alternatives: Six ways in which quasi-experimental evaluations tend to underestimate effects. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiment: Some critical issues in assessing social programs (pp. 195-296). New York: Academic Press.
Campbell, D. T., & Erlebacher, A. (1970). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), Compensatory education: A national debate. Vol. 3: Disadvantaged child (pp. 185-225). New York: Brunner/Mazel.
Campbell, D. T., & Kenny, D. A. (1999). A primer on regression artifacts. New York: Guilford.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Canner, P. (1984). How much data should be collected in a clinical trial? Statistics in Medicine, 3, 423-432.
Canner, P. (1991). Covariate adjustment of treatment effects in clinical trials. Controlled Clinical Trials, 12, 359-366.
Cappelleri, J. C. (1990, October). Power analysis of regression-discontinuity designs. Paper presented at the annual meeting of the American Evaluation Association, Washington, DC.
Cappelleri, J. C., Darlington, R. B., & Trochim, W. M. K. (1994). Power analysis of cutoff-based randomized clinical trials. Evaluation Review, 18, 141-152.
Chassin, L., Barrera, M., Jr., Bech, K., & Kossak-Fuller, J. (1992). Recruiting a community sample of adolescent children of alcoholics: A comparison of three subject sources. Journal of Studies on Alcohol, 53, 316-319.
Chatfield, C. (1996). The analysis of time series: An introduction (5th ed.). London: Chapman & Hall.
Cialdini, R. B., Reno, R. R., & Kallgren, C. A. (1990). A focus theory of normative conduct: Recycling the concept of norms to reduce littering in public places. Journal of Personality and Social Psychology, 58, 1015-1026.
Cicarelli, V. G., Cooper, W. H., & Granger, R. L. (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development. Athens, OH: Ohio University and Westinghouse Learning Corporation.
Cleveland, W. S. (1993). Visualizing data. Summit, NJ: Hobart Press.
Cochran, W. G. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13, 261-281.
Cochran, W. G. (1965). The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, 128, 134-155.
Cochran, W. G., & Cox, G. M. (1957). Experimental designs (2nd ed.). New York: Wiley.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (in press). The problem of units and the circumstance for POMP. Multivariate Behavioral Research.
Coleman, J., Hoffer, T., & Kilgore, S. (1982). Cognitive outcomes in public and private schools. Sociology of Education, 55, 65-76.
Conner, R. F. (1977). Selecting a control group: An analysis of the randomization process in twelve social reform programs. Evaluation Quarterly, 1, 195-244.
Cook, R. D., & Weisberg, S. (1994). An introduction to regression graphics. New York: Wiley.


Cook, T. D. (1983). Quasi-experimentation: Its ontology, epistemology, and methodology. In G. Morgan (Ed.), Beyond method: Strategies for social research. Beverly Hills, CA: Sage.
Cook, T. D. (1993). A quasi-sampling theory of the generalization of causal relationships. In L. B. Sechrest & A. G. Scott (Eds.), New directions for program evaluation (No. 57, pp. 39-81). San Francisco: Jossey-Bass.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton-Mifflin.
Cook, T. D., & Reichardt, C. S. (1976). Statistical analysis of data from the nonequivalent control group design: A guide to some current literature. Evaluation, 3, 136-138.
Crano, W. D., & Brewer, M. B. (1986). Principles and methods of social research (2nd ed.). Boston: Allyn & Bacon.
Cronbach, L. J. (1976). Research on classrooms and schools: Formulation of questions, design, and analysis. Occasional Paper, Stanford Evaluation Consortium, Stanford, CA.
Cronbach, L. J. (1982). Designing evaluations of social and educational programs. San Francisco: Jossey-Bass.
Crosbie, J. (1993). Interrupted time-series analysis with brief single-subject data. Journal of Consulting and Clinical Psychology, 61, 966-974.
Daniel, C., & Wood, F. S. (1980). Fitting equations to data (2nd ed.). New York: Wiley.
Dennis, M. L., Lennox, R. D., & Williams, R. (1997). Practical power analysis. In K. Bryant, M. Windle, & S. G. West (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 367-404). Washington, DC: American Psychological Association.
Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of Educational and Behavioral Statistics, 20, 115-147.
Durlak, J. A., & Wells, A. M. (1997). Primary prevention mental health programs for children and adolescents: A meta-analytic review. American Journal of Community Psychology, 25, 115-152.
Efron, B., & Feldman, D. (1991). Compliance as an explanatory variable in clinical trials (with discussion). Journal of the American Statistical Association, 86, 9-26.
Erlebacher, A. (1977). Design and analysis of experiments contrasting the within- and between-subjects manipulation of the independent variable. Psychological Bulletin, 84, 212-219.
Evans, R. I., Rozelle, R. M., Maxwell, S. E., Raines, B. E., Dill, C. A., Guthrie, T. J., Henderson, A. H., & Hill, P. C. (1981). Social modeling films to deter smoking in adolescents: Results of a three-year field investigation. Journal of Applied Psychology, 66, 399-414.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Flay, B. R. (1986). Psychosocial approaches to smoking prevention: A review of findings. Health Psychology, 4, 449-488.
Franklin, R. D., Allison, D. B., & Gorman, B. S. (Eds.). (1996). Design and analysis of single case research. Mahwah, NJ: Erlbaum.
Funder, D. C. (1992). Psychology from the other side of the line: Editorial processes and publication trends at JPSP. Personality and Social Psychology Bulletin, 18, 493-497.
Gilbert, J. P., Light, R. J., & Mosteller, F. (1975). Assessing social innovations: An empirical base for policy. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiment: Some critical issues in assessing social programs (pp. 39-193). New York: Academic Press.
Glass, G. V. (1988). Quasi-experiments: The case of interrupted time series. In R. M. Jaeger (Ed.), Complementary methods for research in education (pp. 445-464). Washington, DC: American Educational Research Association.
Goldberger, A. S. (1972, April). Selection bias in evaluating treatment effects: Some formal illustrations (Discussion Paper 123-72). Madison: University of Wisconsin, Institute for Research on Poverty.
Greenland, S., & Morgenstern, H. (1989). Ecological bias, confounding, and effect modification. International Journal of Epidemiology, 18, 269-274.
Greenwald, A. G. (1976). Within-subjects design: To use or not to use? Psychological Bulletin, 83, 314-320.
Greenwald, A. G., Pratkanis, A. R., Leippe, M. R., & Baumgardner, M. H. (1986). Under what conditions does theory obstruct research progress? Psychological Review, 93, 216-229.
Gregson, R. A. (1987). The time-series analysis of self-reported headache symptoms. Behavior Change, 4(2), 6-13.
Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (pp. 184-195). Rockville, MD: NIDA Research Monograph 142. NIH Publication No. 94-3599.
Hansen, W. B., Collins, L. M., Malotte, C. K., Johnson, C. A., & Fielding, J. E. (1985). Attrition in prevention research. Journal of Behavioral Medicine, 8, 261-275.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. New York: Chapman & Hall.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 46, 153-162.
Heckman, J. J. (1989). Causal inference and nonrandom samples. Journal of Educational Statistics, 14, 159-168.
Heckman, J. J. (1990). Varieties of selection bias. American Economic Review, 80, 313-318.
Hedeker, D., Gibbons, R. D., & Flay, B. R. (1994). Random-effects regression models for clustered data with an example from smoking prevention research. Journal of Consulting and Clinical Psychology, 62, 757-765.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.

Heinsman, D. T., & Shadish, W. R. (1996). Assignment methods in experimentation: When do nonrandomized experiments approximate answers from randomized experiments? Psychological Methods, 1, 154-169.
Hennigan, K. M., del Rosario, M. L., Heath, L., Cook, T. D., Wharton, J. L., & Calder, B. J. (1982). Impact of the introduction of television on crime in the United States: Empirical findings and theoretical implications. Journal of Personality and Social Psychology, 42, 461-477.
Higginbotham, H. N., West, S. G., & Forsyth, D. R. (1988). Psychotherapy and behavior change: Social, cultural, and methodological perspectives. New York: Pergamon.
Holland, P. W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association, 81, 945-970.
Holland, P. W. (1988). Causal inference, path analysis, and recursive structural equation models (with discussion). In C. Clogg (Ed.), Sociological methodology 1988 (pp. 449-493). Washington, DC: American Sociological Association.
Holland, P. W. (1989). Discussion of Aitkin's and Longford's papers. In R. D. Bock (Ed.), Multilevel analysis of educational data (pp. 311-317). San Diego, CA: Academic Press.
Horne, G. P., Yang, M. C. K., & Ware, W. B. (1982). Time series analysis for single subject designs. Psychological Bulletin, 91, 178-189.
Houts, A. C., Cook, T. D., & Shadish, W. R. (1986). The person-situation debate: A critical multiplist perspective. Journal of Personality, 54, 52-105.
Huitema, B. E. (1980). The analysis of covariance and alternatives. New York: Wiley.
Hunter, J. E. (1996, August). Needed: A ban on the significance test. In P. E. Shrout (Chair), Significance tests: Should they be banned from APA journals? Symposium conducted at the meeting of the American Psychological Association, Toronto, Ontario, Canada.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Imbens, G. W., & Rubin, D. B. (1994). On the fragility of instrumental variables estimators (Discussion Paper No. 1675). Unpublished manuscript, Harvard University, Institute of Economic Research.
Imbens, G. W., & Rubin, D. B. (1997). Bayesian inference for causal effects in randomized experiments with noncompliance. Annals of Statistics, 25, 305-327.
Ito, T. A., Miller, N., & Pollock, V. E. (1996). Alcohol and aggression: A meta-analysis on the moderating effects of inhibitory cues, triggering events, and self-focused attention. Psychological Bulletin, 120, 60-82.
Joreskog, K. G., & Sorbom, D. (1993). LISREL 8: User's reference guide. Chicago: Scientific Software.
Judd, C. M., & Kenny, D. A. (1981). Estimating the effects of social interventions. New York: Cambridge University Press.
Judd, C. M., Smith, E. R., & Kidder, L. H. (1991). Research methods in social relations (6th ed.). Fort Worth, TX: Harcourt Brace Jovanovich.
Jurs, S. G., & Glass, G. V. (1971). The effect of experimental mortality on the internal and external validity of the randomized comparative experiment. Journal of Experimental Education, 40, 62-66.


Little, R. J., & Yau, L. H. Y. (1998). Statistical techniques for analyzing data from preventive trials: Treatment of no-shows using Rubin's causal model. Psychological Methods, 3, 147-159.
MacKinnon, D. P. (1994). Analysis of mediating variables in prevention and intervention research. In A. Cazares & L. Beatty (Eds.), Scientific methods for prevention intervention research. Rockville, MD: National Institute on Drug Abuse.
Magidson, J. (1977). Toward a causal modeling approach to adjusting for pre-existing differences in the nonequivalent group situation: A general alternative to ANCOVA. Evaluation Quarterly, 1, 399-420.
Mallar, C. D., & Thornton, C. V. D. (1978). Transitional aid for released prisoners: Evidence from the LIFE experiment. In T. D. Cook, M. L. Del Rosario, K. M. Hennigan, M. M. Mark, & W. M. K. Trochim (Eds.), Evaluation studies review annual (Vol. 3, pp. 498-517). Beverly Hills, CA: Sage.
Marcantonio, R. J., & Cook, T. D. (1994). Convincing quasi-experiments: The interrupted time series and regression-discontinuity designs. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (pp. 133-254). San Francisco: Jossey-Bass.
Mark, M. M., & Mellor, S. (1991). Effect of self-relevance of an event on hindsight bias: The foreseeability of a layoff. Journal of Applied Psychology, 76, 569-577.
Martin, S. E., Annan, S., & Forst, B. (1993). The special deterrent effects of a jail sanction on first-time drunk drivers: A quasi-experimental study. Accident Analysis and Prevention, 25, 561-568.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Pacific Grove, CA: Brooks/Cole.
McCleary, R., & Hay, R. A. (1980). Applied time series analysis. Beverly Hills, CA: Sage.
McGuire, W. J. (1964). Inducing resistance to persuasion: Some contemporary approaches. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 1, pp. 192-229). New York: Academic Press.
McGuire, W. J. (1983). A contextualist theory of knowledge: Its implications for innovation and reform in psychological research. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 16, pp. 1-47). New York: Academic Press.
McSweeny, A. J. (1978). The effects of response cost on the behavior of a million persons. Journal of Applied Behavior Analysis, 11, 47-51.
Meier, P. (1985). The biggest public health experiment ever: The 1954 field trial of the Salk polio vaccine. In J. M. Tanur, F. Mosteller, W. H. Kruskal, R. F. Link, R. S. Pieters, & G. R. Rising (Eds.), Statistics: A guide to the unknown (2nd ed., pp. 3-15). Monterey, CA: Wadsworth & Brooks/Cole.
Meier, P. (1991). Comment. Journal of the American Statistical Association, 86, 19-22.
Mohr, L. B. (1988). Impact analysis for program evaluation. New York: Basic Books.
Murnane, R. J., Newstead, S., & Olson, F. J. (1985). Comparing public and private schools: The puzzling role of selectivity bias. Journal of Business and Economic Statistics, 3, 23-35.
Muthen, B., & Joreskog, K. G. (1983). Selectivity problems in quasi-experimental studies. Evaluation Review, 7, 139-174.
Muthen, B., Kaplan, M., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52, 431-462.
Pitts, S. C. (1999). The use of latent growth models to estimate treatment effects in longitudinal experiments. Unpublished doctoral dissertation, Arizona State University.
Pitts, S. C., West, S. G., & Tein, J.-Y. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reichardt, C. S. (1979). The statistical analysis of data from nonequivalent group designs. In T. D. Cook & D. T. Campbell, Quasi-experimentation: Design and analysis issues for field settings (pp. 147-205). Boston: Houghton-Mifflin.
Reichardt, C. S., & Gollob, H. F. (1997). When confidence intervals should be used instead of statistical tests, and vice versa. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 259-284). Mahwah, NJ: Erlbaum.
Reichardt, C. S., & Mark, M. M. (1997). Quasi-experimentation. In L. Bickman & D. Rog (Eds.), Handbook of applied social research methods (pp. 193-228). Thousand Oaks, CA: Sage.
Reichardt, C. S., Minton, B. A., & Schellenger, J. D. (1986). The analysis of covariance (ANCOVA) and the assessment of treatment effects. Unpublished manuscript, University of Denver.
Reichardt, C. S., Trochim, W. M. K., & Cappelleri, J. C. (1995). Reports of the death of regression-discontinuity analysis are greatly exaggerated. Evaluation Review, 19, 39-63.
Reis, H. T., & Stiller, J. (1992). Publication trends in JPSP: A three-decade review. Personality and Social Psychology Bulletin, 18, 465-472.
Reynolds, K. D., & West, S. G. (1987). A multiplist strategy for strengthening nonequivalent control group designs. Evaluation Review, 11, 691-714.
Ribisl, K. M., Walton, M. A., Mowbray, C. T., Luke, D. A., Davidson, W. A., & Bootsmiller, B. J. (1996). Minimizing participant attrition in panel studies through the use of effective retention and tracking strategies: Review and recommendations. Evaluation and Program Planning, 19, 1-25.
Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15, 351-357.
Roos, L. L., Jr., Roos, N. P., & Henteleff, P. D. (1978). Assessing the impact of tonsillectomies. Medical Care, 16, 502-518.

Rosenbaum, P. R. (1984). From association to causation view of human nature. Journal of Personality and Sodal Psy- in observational studies: The role of strongly ignorable chology,51,515-530. treatment assignment. Journal of theAmerican StatisticalAs- Seaver, W. B., & Quarton, R. J. (1976). Regression discon- sociation,79,41-48. tinuity analysis of the dean's list effects. JournalofEduca- Rosenbaum, P. R. (1986). Dropping out of high school in tional Psychology,68, 459-465. the United States: An . JournalofEdu- Sechrest, 1., West, S. G., Phillips, M. A., Redner, R., & cational Statistics,11, 207-224. Yeaton, W. (1979). Some neglected problems in evalu- Rosenbaum, P.R. (1987). The role of a second control group ation research: Strength and integrity of treatments. In in an observational study (with discussion). StatisticalSci- L. Sechrest, S. G. West, M. A. Phillips, R. Rodner, & W. ence,2,292-316. Yeaton (Eds), Evaluation studiesreview annual (Vol. 4, pp. Rosenbaum, P.R. (1991a). Discussing hidden bias in obser- 15-35). Beverly Hills, CA: Sage. vational studies. Annals of Internal Medicine,115, 901-905. Shadish, W. R. (in press). The empirical program of quasi- Rosenbaum, P.R. (1991b). Sensitivity analysis for matched experimentation. In L. Bickman (Ed.), Contributions to re- case-control studies. Biometrics,47,87-100. searchdesign: Donald Campbell's legacy(Vol. 2). Thousand Rosenbaum, P.R. (1994). Coherence in observational stud- Oaks, CA: Sage. ies. Biometrics,50, 368-374. Shadish, W. R., Cook, T. D., & Campbell, D. T. (in press). Ex- Rosenbaum, P. R. (1995). Observationalstudies. New York: perimental and quasi-experimentaldesign for generalizedcausal Springer-Verlag. inference.Boston: Houghton-Mifflin. Rosenbaum, P. R., & Rubin, D. (1983). The central role of Shadish, W. R., Hu, X., Glaser,R. R., Knonacki, R., & Wong, the propensity score in observational studies for causal S. (1998). A method for exploring the effects of attrition effects. 
Biometrika, 70, 41-55. in randomized experiments with dichotomous outcomes. Rosenbaum, P. R., & Rubin, D. (1984). Reducing bias in ob- PsychologicalMethods, 3, 3-22. servational studies using subclassification on the propen- Shadish, W. R., & Ragsdale, K. (1996). Random versus non- sity score. Journal of the American Statistical Association,79,I random assignment to psychotherapy experiments: Do 516-524.I you get the same answer? Journal of Consultingand Clinical Rosenthal, R., & Rubin, D. (1980). Comparing within- and Psychology,64, 1290-1305. between -subjects studies. SociologicalMethods and Research, Sidman, M. (1960). Tacticsofsdentific research.New York: Ba- 9, 127-136. sic Books. Ross, H. 1., Campbell, D. T., & Glass, G. V. (1970). Deter- Sorbom, D. (1979). An alternative to the methodology for mining the social effects of a legal reform: The British analysis of covariance. In K. G. Joreskog & D. Sorbom "breathalyser" crackdown of 1967. American Behavioral (Eds.), Advancesinfactor analysisand structural equation mod- Scientist,13, 493-509. els (pp. 219-234). Cambridge, MA: Abt Books. Rossi, J. S. (1990). Statistical power of psychological re- Sporer, S. 1., Penrod, S., Read, D., & Cutler, B. search: What have we gained in 20 years? Journal of Con- (1995). Choosing, confidence, and accuracy: A meta- sulting and Clinical Psychology,58, 646-656. analysis of the confidence-accuracy relation in eyewit- Rubin, D. B. (1974). Estimating causal effects of treatments ness identification studies. PsychologicalBulletin, 118, 315- in randomized and nonrandomized studies. JournalofEd- 327. ucational Psychology,66, 688-701. MStudent" (W. S. Gosset). (1931). The Lanarkshire milk ex- Rubin, D. B. (1977). Assignment to treatment group on the periment. Biometrika, 23, 398-406. basis of a covariate. Journal of Educational Statistics,2, 1-26. Stolzenberg, R. M., & RelIes, D. A. (1990). Theory testing in !.liD Rubin, D. B. (1978). Bayesian inference for causal effects. 
a world of constrained . SodologicalMethods The Annals of Statistics,6, 34-58. and Research,18, 395-415. Rubin, D. B. (1986). Which ifs have causal answers. Journal Swingle, P. G. (Ed.). (1973). Sodalpsychology in naturalset- of the American Statistical Association,81, 961-962. tings. Chicago: AIrline. R~bin, D. B., & Thomas, N. (1996). Matching using esti- Tebes, J. K., Snow, D. 1., & Arthur, M. W. (1992). Panel mated propensity scores: Relating theory to practice, Bio- attrition and external validity in the short-term follow- up study of adolescent substance use. Evaluation Review, metrics,52, 249-264. Sackett, D. 1., & Gent, M. (1979). Controversy in count- 16,151-170. ing and attributing events in clinical trials. New England Trochim, W. M. K. (1984). Researchdesign for program eval- uation: The regression-discontinuityapproach. Beverly Hills, Journal of Medicine,301, 1410-1412. Schachter, S. (1959). The psychologyof affiliation. Stanford, CA: Sage. Trochim, W. M. K., & Cappelleri, J. C. (1992). Cutoff assign- CA: Stanford University Press. ment strategies for enhancing randomized clinical trials. Schwarz, N. B., & Hippler, H. J. (1995). Subsequent ques- tions may influence answers to preceding questions in ControlledClinical Trials, 13,571-604. Trochim, W. M. K., Cappelleri, J. C., & Reichardt, C. S. mail surveys. Quarterly, 59, 93-97. (1991). Random measurement error does not bias the Sears, D. O. (1986). College sophomores in the laboratory: Influences of a narrow data base on social psychology's treatment effect estimate in the regression-discontinuity 84 STEPHEN G. WEST, JEREMY C. BIESANZ, AND STEVEN C. PITTS

design: II. When an interaction effect is present. Evalua- prevention programs. American Journal of Community Psy- tion Review,15, 571-604. chology,21, 571-605. Velicer, W. F., & Colby, S. M. (1997). Time series analy- West, S. G., & Hepwonh, J. T. (1991). Statistical issues in sis of prevention and treatment research. In K. Bryant, the study of temporal data: Daily experiences. Journal of M. Wmdle, & S. G. West (Eds.), The scienceof prevention: Personality,59,611-662. Methodological advancesfrom alcohol and substanceabuse re- West, S. G., Hepworth, J. T., McCall, M. A., & Reich, J. W. search(pp. 211-250), Washington, DC: American Psycho- (1989). An evaluation of Arizona's July 1992 drunk driv- logical Association. ing law: Effects on the dty of Phoenix. Journal of Applied Velicer, W. F., & Harrop, J. W. (1983). The reliability and SocialPsychology, 19, 1212-1237. accuracy of time series model identification. Evaluation West, S. G., Newsom, J. T., & Fenaughty, A. M. Review,7, 551-560. (1992). Publication trends in JPSP: Stability and change Velicer, W. F., & MacDonald, R. P. (1984). Time series anal- in the topics, methods, and theories across two ysis without model identification. Multivariate Behavioral decades. Personality and Social PsychologyBulletin, 18,473- Research,19, 33-47. 484. Vmokur, A. D., Price, R. H., & Caplan, R. D. (1991). From West, S. G., & Sagarin, B. (in press). Subject selection and field experiments to program implementation: Assess- loss in randomized experiments. In L. Bickman (Ed.), Con- ing the potential outcomes of an experimental interven- tributions to researchdesign: Donald Campbell's legacy(Vol. 2, tion program for unemployed persons. American Journal pp. 117-154). Thousand Oaks, CA: Sage. of Community Psychology,19, 543-562. Wilkinson, L. and the Task Force on Statistical Inference VlIdin, L. M. (1993). A testof the robustnessof estimators that (1999). 
Statistical methods in psychology journals: Guide- model selectionin the nonequivalentcontrol group design. Un- lines and explanations. American Psychologist,54, 594- published doctoral dissertation, Arizona State University. 604. Weiner, B. (Ed.). (1974). Achievementmotivation and attribu- Willett, J. B., & Sayer, A. G. (1994). Using covariance struc- tion theory. Morristown, NJ: General Learning Press. ture analysis to detect correlates and predictors of change. West, S. G., & Aiken, L. S. (1997). Towards understand- PsychologicalBulletin, 116, 363-381. ing individual effects in multiple component prevention Wmer, B. J. (1971). Statistical principles in experimental design programs: Design and analysis strategies. In K. Bryant, (2nd ed.). New York: McGraw-Hill. M. Wmdle, & S. G. West (Eds.), The scienceof preven- wolchik, S. A., West, S. G., Sandler, I. N., Tein, J.-Y., tion: Methodologicaladvances from alcohol and substanceabuse Coatsworth, D., Lengua, L., Weiss, L., Anderson, E. R., research(pp. 167-210). Washington, DC: American Psy- Greene, S. M., & Griffin, W. A. (in press). An experimental chological Association. evaluation of theory mother and mother-child programs West, S. G., Aiken, L. S., & Todd, M. (1993). Probing the for children of divorce. Journal of Consulting and Clinical effects of individual components in multiple component Psychology.