
03-Mckay-45470.qxd 11/17/2007 5:40 PM Page 23

3

INTERNAL AND EXTERNAL VALIDITY IN CLINICAL RESEARCH

STEVEN TAYLOR

GORDON J. G. ASMUNDSON

Advances in clinical psychology critically depend on methods researchers use for investigating the causal relationships among variables. Research questions commonly include issues about the putative mechanisms associated with various forms of psychopathology (e.g., do dysfunctional beliefs cause depression?) and questions about the effects of treatments (e.g., does cognitive therapy cause a greater reduction in eating disorder symptoms than does placebo?). How do we go about drawing objective conclusions about causality? This is a question about whether the manipulation of one variable (i.e., the independent variable) has effects on another variable (i.e., the dependent variable). The answer to this question is more complex than it might seem. Consider the following common scenarios.

Scenario 1: Dr. Smith is a well-known local advocate of a controversial form of psychotherapy. He claims that it works faster and more powerfully than all other treatments. For many years he has practiced this treatment and trained other clinicians through workshops, even though there were no scientific data on its efficacy. When challenged on this point, Dr. Smith retorted that he had “seen all the proof he needed with his own eyes.” That is, he claimed his patients almost invariably benefited from his therapy. When a treatment outcome study was conducted and published by an independent group of investigators, it was found that the psychotherapy used by Dr. Smith was no more effective than giving patients a placebo. Dr. Smith’s response was “there must have been something wrong with the research; the researchers didn’t include patients seen in the real world.”

Scenario 2: The interns and their clinical supervisors gathered in the seminar room for the weekly journal club. This week’s article, which recently appeared in a leading journal, described an experimental investigation of the effects of cognitive factors (expectations of disapproval) on social anxiety. Research participants led to expect high disapproval (from an experimental confederate) experienced more social anxiety than participants who were led to expect no disapproval. The investigators concluded that expectations of disapproval can cause social anxiety and likely play a role in clinically severe conditions such as social anxiety disorder. Despite the many methodological strengths of the study, the participants of the journal club quickly searched out the methodological weaknesses. One of the interns raised a particularly important


24 INTRODUCTION/OVERVIEW

question about the generalizability of the study. Her observation was met with nods of approval from her supervisors. By the time the journal club had finished, all the attendees had convinced themselves that the study was fatally flawed. Some attendees even left the meeting with the impression that psychological research, even research published in leading journals, is largely a waste of time.

This chapter is written largely in response to these two types of scenarios, which we have encountered time and again. The reactions illustrated in these scenarios retard the scientific progress of clinical psychology. The first scenario raises many issues regarding internal and external validity. Dr. Smith claims that all his patients benefit from his treatment. Yet, his series of case observations has many problems of internal validity. His dismissal of a recent study of his psychotherapy raises the issue of external validity. The journal club scenario raises the question of what we can conclude from research that does not have “perfect” internal and external validity.

The important issues raised in these scenarios are the focus of the remainder of this chapter. We will begin by defining internal validity and illustrating the various threats to it. Some studies, such as those using quasi-experimental designs, are widely used in clinical research, despite their imperfect internal validity. After reading this chapter, you should have a good understanding of why such studies are used and why they are useful. After discussing internal validity, we then examine the concept of external validity (generalizability) and consider the relationship between internal and external validity. As you will see, there are some research situations in which high external validity is vital and other situations in which it is not a priority. Finally, we will conclude with some comments about how scientific knowledge can be advanced even though most research studies have imperfect internal or external validity. Throughout this discussion we will consider a number of commonly used experimental designs that have been developed to deal with issues of internal and external validity. Our discussion of these designs will be illustrative rather than comprehensive. Detailed discussions of experimental and quasi-experimental designs are available elsewhere (e.g., Asmundson, Norton, & Stein, 2002; Barlow & Hersen, 1984; Campbell & Stanley, 1970; Cook & Campbell, 1979; Onghena & Edgington, 2005).

INTERNAL VALIDITY

Internal validity is the degree to which observed changes in a dependent variable can be attributed to changes in an independent variable. Thus, internal validity is a matter of degree (e.g., high, medium, low) rather than one of presence or absence. The researcher’s confidence in his or her findings is proportionate to the strength of internal validity of the research design (Finger & Rand, 2003). True experiments are designs that have strong internal validity; that is, participants are randomized to experimental conditions, and other means are used to ensure that changes in the dependent variable can be attributed to the experimental manipulation of the independent variable. Quasi-experimental designs have weaker internal validity, as we will illustrate later. There are several types of threat to internal validity (Cook & Campbell, 1979; Finger & Rand, 2003; Rosenthal, 2002), including

• history,
• maturation,
• testing,
• instrumentation,
• statistical regression,
• attrition,
• selection,
• interactions with selection,
• diffusion or imitation of treatments,
• compensatory equalization of treatments, and
• experimenter expectancy.

Each of these threats to internal validity is defined and illustrated in the following sections.

History

Description. When changes in the dependent variable are due to some extraneous event that takes place between pre- and posttest, it is difficult to determine whether the results were due to the experimental manipulation (i.e., changes in the independent variable) or to the extraneous event. In some research, such as a short study of memory, this threat can be controlled by shielding participants from outside influences during the study (e.g., testing them in a quiet lab) or by choosing dependent variables that could not plausibly have been affected by

Internal and External Validity in Clinical Research 25

outside influences (Cook & Campbell, 1979). In treatment studies, which may take the participant several weeks to complete, these methods may not effectively control for the effects of history. Other methods are more often used, such as random assignment of participants to an experimental group or a control group. In a therapy study, the latter might be a waiting list control in which the participants are simply assessed twice, with the retest interval being matched to the duration of the treatment study. Participants in the treatment condition would be similarly assessed twice, before and after treatment.

Example. Midway through an uncontrolled study of treatments of driving phobia, a well-known celebrity was killed in a car accident, thereby inflating the fears of the treatment participants. As a result, their posttreatment scores on a measure of driving fear tended to be higher than their pretreatment scores, thereby giving the misleading impression that treatment worsened their phobias. Inclusion of a waiting list control group would have demonstrated the impact of this event on those with driving phobia who were not receiving treatment.

Maturation

Description. Change in the participant over the course of time, where such change is not the focus of interest of the research study. This may involve growth (e.g., getting smarter or stronger) or decline (e.g., dementing). This threat can be addressed by using a control group.

Example. A drug company treated 20 elderly people in the early stages of Alzheimer’s disease with a new medication for depression. The investigators concluded that the drug was effective in alleviating depression in this population. However, they failed to realize that depression naturally remitted for many patients as their dementia worsened. That is, as their memories became worse, the participants no longer had insight into the fact that they were dementing and so were no longer depressed about this problem. Inclusion of a control group (in this case, one receiving a placebo pill) would have allowed the researchers to assess natural changes in depressive symptoms associated with increasing dementia.

Testing

Description. The reactive effects of testing, where the very act of assessment influences the variable under investigation. Some measures are highly reactive, whereas other measures are largely unreactive. Also, repeated testing can increase familiarity with the test, which might bias scores. This threat can be dealt with in various ways, such as by selecting unreactive measures (e.g., unobtrusive observation) or by including a control group.

Example. In studies of smoking, the act of monitoring one’s use of cigarettes affects the frequency of smoking. That is, self-monitoring may help some people realize how much they smoke, thereby motivating them to cut down. To illustrate the effects of test familiarity, consider a study in which tests of intelligence are administered on multiple occasions. With repeated testing, participants may become better at some tests (e.g., the digit-symbol subtest of the Wechsler Adult Intelligence Scale) simply because they have learned the correct responses (e.g., the symbol that goes with each digit).

Instrumentation

Description. When an effect is due to a change in the measuring instrument from pre- to posttest rather than due to the manipulation of the independent variable. Instrumentation can affect all forms of measurement, including observers, self-report tools, interview schedules, and devices that measure physiological processes.

Example. In observational studies, progressive fatigue of observers who are coding various types of marital interaction can impact their rating accuracy. With increasing fatigue, the observers may be less likely to detect subtle interaction patterns. Using one scale at pretest (e.g., the first edition of the Beck Depression Inventory) and another edition at posttest (e.g., the second edition of this inventory) might suggest a change in depressive symptoms where there was no change. Similarly, in a study measuring physiological reactivity to stress before and after stress inoculation training, changes in equipment calibration might falsely indicate or mask a treatment effect.

Statistical Regression

Description. People selected for extreme scores (very high or very low) will have less extreme scores when they are retested on the same or related variables. Why does regression occur? The


farther a score is from the mean, the more extreme it is. The more extreme the score, the rarer it is and the more likely it is to have been the result of a very rare combination of factors. If these factors are temporally unstable, then statistical regression will occur (Furby, 1973). Statistical regression is always toward the population mean of the group. Its magnitude is greater when the test-retest reliability of a measure is low (indicating that scores are readily influenced by chance factors) and when a person’s score is extreme, relative to the mean of the population from which the person was chosen (Cook & Campbell, 1979). Regression effects will not be a threat if assessment methods are chosen that are virtually error free or uninfluenced by random factors (e.g., measuring a person’s height; Finger & Rand, 2003). It is important to note that statistical regression effects can be due to psychologically substantive phenomena and should not be automatically dismissed out of hand as statistical artifacts (Taylor, 1994). Regression might be either noise or the phenomenon of interest, depending on one’s research goals.

Examples. A gambling researcher screened a large group of students in order to identify people who could be classified as heavy gamblers, as measured by a questionnaire. When these people presented to the lab to participate in the experiment, they completed the questionnaire a second time. To the researcher’s chagrin, many of the participants no longer had extreme scores on the questionnaire and had to be excluded from the study. Another example concerns uncontrolled treatment studies, in which a group of patients is selected on the basis of extreme scores on some measure (e.g., scores on a perfectionism scale) and then receives an intervention (e.g., a treatment for excessive perfectionism) as well as a posttest. Statistical regression may occur, resulting in what appears to be a treatment effect (i.e., a decline in scores from pre- to posttest). The solution to this problem is to include a control group. Note that statistical regression is unlikely to occur if participants are selected because they have persistently elevated scores, such as people with chronically high scores on a measure of anxiety (i.e., high trait anxiety). Such people are unlikely to show statistical regression because this phenomenon is due to transient factors that produce elevations in scores (e.g., a near-miss while driving to the lab to participate in an experiment would transiently increase one’s anxiety).

Attrition

Description. The loss of participants from a study (e.g., due to mortality or treatment dropout). This threatens internal validity if attrition is not random: for instance, if attrition is greater in one experimental condition than another, or if particular participants are most likely to drop out of the study. Attrition can cause serious problems for clinical researchers by introducing biases into a study. There are various methodological and statistical procedures for limiting, evaluating, and correcting for attrition (see Flick, 1988). However, there are circumstances in which attrition can render the results uninterpretable, as illustrated in the following example.

Example. A residential treatment center reported that 80% of patients with anorexia nervosa were “much improved” or “greatly improved” after completing the program. Unfortunately, the results were biased because 20% of patients did not complete the program, and no outcome data were available for them. Some withdrew because they benefited quickly from treatment and felt that they no longer needed to be in the program. Others dropped out because they failed to benefit. Some severely anorexic patients had either died or withdrawn from the clinic and were admitted to hospital. Given the large proportion of treatment dropouts and the uncertainty about whether treatment completers differed, as a group, from treatment dropouts, it was not possible to draw any legitimate conclusions from the treatment study.

Selection

Description. When the effects on the dependent variable arise from differences in the kinds of people in the experimental groups. Selection effects are pervasive in quasi-experimental designs (Cook & Campbell, 1979). These are among the most widely used designs in clinical psychology, in which a target group is compared with one or more control groups (e.g., a group of healthy people, a group with another psychopathology). Attempts are made to match the groups on background variables (e.g., demographics), and then they are compared on the variables of interest. However, assignment of participants to groups (e.g., target group vs. healthy control group) is, by definition, nonrandom in quasi-experimental designs. One must remember


that the distinction between the target group and any control group is an observed, not experimentally manipulated, distinction.

Example. Many studies have investigated whether people with anxiety disorders, as compared with healthy controls, tend to selectively focus their attention on sources of threat in their environment (Mogg & Bradley, 1999). Although such studies have yielded a good deal of useful information and have stimulated a great deal of research, these studies are prone to selection effects. That is, although the clinical (target) and control groups were matched on many background variables, there is no guarantee that the differences between the groups (e.g., the threat-focused attention effects) were due to the presence versus absence of an anxiety disorder. The effects could have been due to other factors that were not assessed in the study. In these studies, selection effects are addressed in three ways. First, the plausibility of confounding factors is taken into consideration. Anxious patients and healthy controls could differ on an almost infinite range of factors. Some of these factors could confound the study of threat-focused attention (e.g., depression), while other factors are less plausible (e.g., the participant’s Zodiac sign). Second, researchers try to control for all the plausible confounding factors (e.g., all participants are asked to refrain from caffeine consumption on the day of testing; testing is done under conditions of normal or corrected-to-normal vision). Finally, if another confounding factor is subsequently identified (e.g., whether or not the person is taking antianxiety medication), then the study can be replicated, controlling for this factor.

Interactions With Selection. Many of the previously mentioned threats to internal validity can interact with selection to produce effects on the dependent variable that may be confused with effects due to the independent variable (Cook & Campbell, 1979). Examples include selection-history, selection-maturation, and selection-instrumentation effects. Selection-maturation interactions occur when the experimental groups mature at different speeds. Selection-history interactive effects occur when different experimental groups come from different settings, where each setting is associated with different histories. In a study designed to test the hypothesis that people with hypertension tend to be high in trait anger (i.e., anger proneness), for example, the hypertensive patients might tend to live in stressful environments, whereas normotensive controls might tend to live in relatively low stress environments. Thus, the different histories of the groups (i.e., differences in environmental stressors) might be responsible for any group differences in anger proneness, even if anger is unrelated to hypertension.

Diffusion or Imitation of Treatments

Description. When participants in the different experimental conditions can communicate with one another, such that participants in one condition learn about what happens in the other condition. This can undermine the differences between the experimental manipulations in each condition.

Example. In a study of the effects of stress (in the form of electric shock) on snake phobias, snake-fearful undergraduate psychology students were randomly assigned to one of two groups. Participants in each group were tested individually. Each participant was asked to walk up to a container housing a large, harmless snake and to touch it. Participants in the experimental group received a painful electric shock at a randomly determined point as they approached the snake. Participants in the control group experienced no shock as they approached the snake. Unfortunately, the students who had completed the experiment described their experiences to students who were soon to participate in the study. This contaminated the experimental manipulation because many of the people in the control group had heard about the electric shock. As they approached the snake they worried about getting a shock. Thus, the “no shock” control condition was compromised. Possible solutions to the problem of diffusion of treatments are to ask participants not to discuss the experiment with other students (during the period in which the study is being conducted) or to use experimental designs in which diffusion is not an issue (e.g., in the snake example, one could explicitly inform the control participants that they have been allocated to a “no shock” condition).

Compensatory Equalization of Treatments

Description. When participants learn that they have been assigned to an experimental condition where they won’t receive the possible benefits received by participants in another experimental


condition. Participants may be reluctant to tolerate this inequality and thereby seek out the potential benefits.

Examples. In a study examining the effects of biofeedback for tension headaches, participants were randomly assigned to either biofeedback or a no-treatment (waiting list) control. Participants in the control group were aware that participants in the other group were receiving a potentially beneficial treatment. This perceived inequality prompted some people from the control group to seek out headache treatment while they were in the waiting list condition. This confounded the investigation of the effects of biofeedback. Another example concerns the use of pill placebo in drug studies. Participants in these studies are informed that they will be randomly assigned to receive either capsules containing the drug under investigation or capsules containing an inert substance (placebo). Unfortunately, in many drug studies, it is not difficult for patients to discover whether they have been assigned to the drug or placebo conditions, because drugs, unlike placebos, commonly produce side effects. Antidepressant medications, for example, may produce dry mouth or temporary jitteriness as side effects. Participants who discover that they are taking placebos may therefore seek out additional treatments during the experiment, thereby confounding the investigation of the effects of the drug. Alternatively, some participants who realize that they are taking a placebo may become further depressed about not getting the “real” treatment. A solution to such confounds is to use placebos that produce side effects. Some studies have taken this approach (Margraf et al., 1991).

Experimenter Expectancy

Description. A phenomenon whereby the participant’s responses are influenced by expectations of the experimenter (or a proxy for the experimenter, such as a therapist or research assistant conducting a component of a study; Rosenthal, 2002). In other words, the participant’s responses are shaped in the direction of the experimenter’s expectations. These effects may be unintentional on the part of the experimenter (or therapist). This bias, sometimes known as the “allegiance effect,” can be circumvented by keeping the people running the experiment (e.g., therapists, research assistants) blind to the aims of the investigation and by using procedures to counter any expectation effects of therapists.

Examples. In studies of a novel treatment, compared with some standard treatment, therapists may be highly enthusiastic about the new treatment and less impassioned by the standard treatment. A similar problem was encountered in our recent randomized, controlled study of three treatments for post-traumatic stress disorder (PTSD): behavior therapy, relaxation training, and eye movement desensitization and reprocessing (EMDR; Taylor et al., 2003). Some therapists were enthusiastic advocates of behavior therapy, while others were equally enthusiastic about EMDR. To control for possible expectancy effects, we had two therapists deliver all three treatments. One therapist was an expert and advocate of EMDR, while the other had expertise in behavior therapy. Thus, the therapists had potentially opposite expectations. This design enabled us to assess whether these and other therapist factors influenced treatment outcome. In this study, the treatments differed in efficacy (behavior therapy tended to be most effective), whereas there were no differences in the efficacy of the therapists, and there was no treatment-by-therapist interaction.

Comment

Returning to the first scenario that opened this chapter, we can see that most of these threats to internal validity would apply to Dr. Smith’s observations about the effects of his treatment. His conclusions that his treatment was highly effective were based on simple pre/post case studies, that is, on patients assessed before and after this therapy. These studies failed to control for history, maturation, and statistical regression. Attrition was also a problem. A number of patients started treatment with Dr. Smith but dropped out because they failed to benefit from therapy. Dr. Smith conveniently failed to include these patients in his appraisal of his treatment’s efficacy.

Not all case studies suffer as much from threats to internal validity. Various single-case experimental designs (which are part of the family of quasi-experimental designs) have been developed to deal with these threats (e.g., Barlow & Hersen, 1984; Onghena & Edgington, 2005). To illustrate, as part of our research into


cognitive-behavioral therapy (CBT) of panic disorder, we conducted a case study of an unusual presentation, in which the patient’s panic disorder appeared to arise from blood-injury reactivity (vasovagal dizziness and fainting in response to the sight of blood or injury; Anderson, Taylor, & McLean, 1996). The patient was initially treated with standard CBT for panic disorder (Taylor, 2000). Two years later, he relapsed when exposed unexpectedly to blood-injury stimuli. This led us to hypothesize that his blood-injury reactivity played a causal role in his panic disorder. To test this possibility, we provided the patient with another course of standard CBT for panic disorder. As before, he was no longer panicking after treatment. Then we asked the patient if we could expose him to blood-injury stimuli for one month (a videotape of injections and blood extractions). The thought of being exposed to such a tape stimulated blood-injury reactions (e.g., dizziness), which were followed by a relapse of his panic disorder. The next part of the case study involved treating the patient with applied tension, which is a specific treatment for blood-injury reactivity (Öst & Sterner, 1987). This treatment reduced his panic attacks and blood-injury reactivity. When he was reexposed to the videotape, he did not have any blood-injury reactions, his panic disorder did not return, and he was free of psychopathology at his four-month follow-up. This case study involved an ABABCB design, where A = the first and second courses of CBT, B = exposure to blood-injury stimuli, and C = treatment with applied tension. This design makes it unlikely that the results are due to threats to internal validity such as history or maturation.

There are many other types of single-case experimental designs, which can be used for other types of research questions (see Barlow & Hersen, 1984; Onghena & Edgington, 2005). Studies using single-case experimental designs are useful for studying unusual cases and for conducting preliminary evaluations of new treatments. These studies are insufficient in themselves for drawing strong conclusions, but they provide some indication of whether it is useful to conduct further investigations. Early case studies of CBT for panic disorder (e.g., Clark, Salkovskis, & Chalkey, 1985) provided encouraging results, which led researchers to conduct open (uncontrolled) trials to evaluate the treatment with more patients (e.g., Sokol, Beck, Greenberg, Wright, & Berchick, 1989) and to randomized, controlled trials (which have very strong internal validity; Barlow, Gorman, Shear, & Woods, 2000) in which CBT was compared to control conditions (e.g., waiting list or placebo) and to other treatments (e.g., imipramine). Thus, even though single-case experimental designs often have far-from-perfect internal validity, they can yield valuable information and thereby can advance our understanding of psychopathology and its treatment.

Drawing inferences, whether in quasi-experiments or experiments, is a matter of ruling out rival hypotheses (e.g., hypotheses about the role of threats to internal validity) that could account for the results. Randomizing participants to experimental and control groups can overcome many of the threats to internal validity. Random selection of participants and random allocation to experimental conditions ensure, within the limits of sampling error, that the sample is representative of the target population and that the samples in the experimental groups are comparable to one another in terms of the background features of the participants, such as demographics or other variables (Cook & Campbell, 1979). Randomization doesn’t control for some threats, such as diffusion of treatments, compensatory equalization of treatments, or experimenter expectancy. These threats can be overcome by other means, such as those mentioned earlier. For quasi-experimental designs, however, there is always some degree of threat to internal validity, such as the selection threat. Cook and Campbell (1979) offer the following guidelines about how to assess the degree of threat to internal validity.

Estimating the internal validity of a relationship is a deductive process in which the investigator has to systematically think through how each of the internal validity threats may have influenced the data. Then, the investigator has to examine the data to test which relevant threats can be ruled out. In all of this process, the researcher has to be his or her own best critic, trenchantly examining all of the threats he or she can imagine. When all of the threats can be plausibly eliminated, it is possible to make confident conclusions about whether a relationship is probably causal. (p. 55)
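The logic of ruling out a rival hypothesis with a randomized control group can be made concrete with a small simulation of statistical regression. The sketch below is illustrative only and is not taken from the chapter. It assumes that each observed score is a stable trait plus transient noise (both drawn from a standard normal distribution), that participants are selected because their pretest score exceeds an arbitrary cutoff, and that selected participants are randomly split into treated and control groups. The function name `simulate_pre_post` and every parameter value are hypothetical.

```python
import random
import statistics


def simulate_pre_post(n_screened=20000, cutoff=1.5, treatment_effect=0.0, seed=0):
    """Return the mean pre-to-post change for treated and control groups.

    Assumed model (for illustration only): score = stable trait + transient noise.
    """
    rng = random.Random(seed)

    # Screen a large sample and keep only people with extreme pretest scores.
    selected = []
    for _ in range(n_screened):
        trait = rng.gauss(0.0, 1.0)            # stable component
        pretest = trait + rng.gauss(0.0, 1.0)  # plus transient noise
        if pretest > cutoff:
            selected.append((trait, pretest))

    # Random assignment: shuffle the selected people, then split in half.
    rng.shuffle(selected)
    half = len(selected) // 2
    groups = {"treated": selected[:half], "control": selected[half:]}

    mean_change = {}
    for name, members in groups.items():
        changes = []
        for trait, pretest in members:
            posttest = trait + rng.gauss(0.0, 1.0)  # fresh transient noise at retest
            if name == "treated":
                posttest -= treatment_effect        # true effect lowers the score
            changes.append(posttest - pretest)
        mean_change[name] = statistics.mean(changes)
    return mean_change


if __name__ == "__main__":
    result = simulate_pre_post(treatment_effect=0.0)
    print("treated change:", round(result["treated"], 2))
    print("control change:", round(result["control"], 2))
    print("treated minus control:", round(result["treated"] - result["control"], 2))
```

Under these assumptions, both groups should show a sizable drop from pretest to posttest even when `treatment_effect` is zero, which is what an uncontrolled pre/post study would mistake for improvement. Because regression affects both randomized groups alike, the treated-minus-control difference stays near zero unless the treatment has a real effect.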


EXTERNAL VALIDITY

External validity has to do with the generalizability of the research findings; to what extent can the findings of an experiment or quasi-experiment be generalized to and across various populations, settings, and epochs? In the following sections, we examine, in further detail, the major types of threats to external validity, the relationship between internal and external validity, and the situations in which we should (or shouldn’t) be concerned with threats to external validity. Threats to external validity are evaluated by tests of the extent to which one can generalize across various kinds of people, settings, and times and are, in essence, tests of statistical interactions (Cook & Campbell, 1979). The major threats include three types of interactions with the experimental condition that the participants are in. These are interactions with selection, setting, and history.

Interaction of Selection and Experimental Condition

Description. This concerns the question of whether the findings from the selected group of research participants can be generalized to other categories of people, such as people with other geographic or demographic features.

Examples. A study comparing patients with severe major depression with healthy controls might seek to match the participants on demographic features. Many severely depressed patients are unable to work and are therefore unemployed, receiving welfare or disability assistance. To match the patients with the controls on demographic factors, the researcher might decide to include only unemployed control participants. While this strengthens the internal validity of the study, it raises the question of whether the results can be generalized to people from other levels of occupational functioning. If the results of the research study vary across occupational levels, then there is an interaction between selection (in this case, occupational status) and experimental condition. This interaction threatens the external validity of the study. The only way to determine whether this threat exists is to determine whether the results vary with occupational status. This means that further studies might be needed to better understand the external validity of the findings.

A popular research strategy is to use analogue samples. For example, students may be selected because of their high scores on a measure of schizotypy for a study of variables thought to be relevant to schizophrenia. Analogue studies have the advantage of having strong internal validity (e.g., randomized assignment of schizotypal students to two or more experimental conditions). However, analogue studies may also have important problems with external validity. Can findings obtained from schizotypal students who, for example, report having some degree of magical thinking and perceptual aberration, be generalized to people with schizophrenia?

Studies using clinical samples also may encounter problems with external validity. Some treatment outcome studies, for example, may be highly selective in the patients that are enrolled. A study of the treatment of bulimia nervosa might only include patients if they agree to suspend any other treatment they might be receiving and remain on a stable dose of any psychotropic medication they might be receiving. These research requirements have the advantage of controlling for threats to internal validity, but they do raise questions about external validity; that is, are the patients yielding these clinical findings representative of patients typically seen in clinical practice? If the patients are not representative, then the question arises as to whether the treatment findings can be generalized to clinical practice in the “real world.” These concerns with patient representativeness and the use of analogue samples were raised in the two scenarios that opened this chapter.

Even when participants belong to the target population of interest, recruitment factors might lead to threats to external validity (Cook & Campbell, 1979). A researcher, for example, who is interested in studying conversion disorder might recruit patients by placing advertisements in the local newspaper. This process of recruitment could possibly result in a sample of people with conversion disorder that is unrepresentative of people in general with this disorder. This threat to external validity can be examined by comparing patients recruited from the newspaper to patients recruited by other means (e.g., from physician referrals) to see whether the groups differ on relevant variables such as the type and severity of the conversion disorder.
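Because a selection threat is, in essence, a statistical interaction between selection and experimental condition, checking whether results vary with occupational status amounts to computing a difference of differences. The following is a minimal sketch, assuming a simple two-by-two layout; the function names, the group and condition labels, and all scores are hypothetical and invented for illustration, not data from any study cited in this chapter.

```python
from statistics import mean


def condition_effect(scores, group, treated, comparison):
    """Mean outcome difference between two conditions within one selection group."""
    return mean(scores[(group, treated)]) - mean(scores[(group, comparison)])


def interaction_contrast(scores, group_a, group_b, treated, comparison):
    """Difference of differences for a 2 x 2 design.

    scores maps (selection_group, condition) -> list of outcome scores.
    Returns the effect in each group and their difference. A contrast near
    zero is what one would hope to see if the effect generalizes across groups.
    """
    effect_a = condition_effect(scores, group_a, treated, comparison)
    effect_b = condition_effect(scores, group_b, treated, comparison)
    return effect_a, effect_b, effect_a - effect_b


# Hypothetical symptom-reduction scores (higher = more improvement).
example_scores = {
    ("unemployed", "treatment"): [12, 10, 14, 12],
    ("unemployed", "placebo"): [6, 8, 7, 7],
    ("employed", "treatment"): [9, 11, 10, 10],
    ("employed", "placebo"): [7, 9, 8, 8],
}

effects = interaction_contrast(
    example_scores, "unemployed", "employed", "treatment", "placebo"
)
print(effects)
```

A nonzero contrast in a sample is only descriptive. In practice the interaction would be tested formally, for example as the interaction term in an analysis of variance or regression model, with enough participants in every cell.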


Interaction of Setting or Context and Experimental Condition

Description. The question of concern is whether findings obtained in one setting or context can be generalized to other settings.

Example. Research conducted at Harvard University suggests that people who claim to have been abducted by space aliens are more susceptible, compared to control groups, to forming false memories (McNally, 2003). But do these findings apply to purported alien abductees in general, including people from other educational levels or geographic locations? Alien encounters are commonly reported in Brazil, for example (Pulos & Richman, 1990). Are these people similarly subject to false memories? To answer this question, one may need to repeat the experiment in different settings.

A related threat concerns the novelty of an intervention (Finger & Rand, 2003). If a new treatment is evaluated, typically in a university or hospital research setting, the participant may be aware that he or she is receiving a novel treatment, and the therapist may be highly optimistic or enthusiastic about the intervention. The result obtained under such conditions might not generalize to other contexts, such as settings in which the treatment is no longer regarded as novel.

Interaction of History and Experimental Condition

Description. This concerns the question of whether the findings obtained today would apply to the past or future, or whether the findings would apply to people who had otherwise different histories.

Examples. Contemporary observations of the effects of traumatic stressors and quasi-experimental analogue studies of the effects of mildly disturbing events (e.g., medical students conducting their first human dissection) suggest that exposure to traumatic events leads to symptoms of PTSD, particularly persistent reexperiencing of the event (e.g., dreams or unwanted thoughts of the event). It has been debated as to whether these are timeless responses or whether they are simply a product of contemporary Western culture. In other words, it is unclear whether the finding that stress produces reexperiencing symptoms has strong external validity. There is some suggestion that there are important limits to the external validity (generalizability) of the findings; for example, many soldiers in World War I apparently responded to the trauma of war with symptoms of conversion disorders (e.g., “hysterical” paralysis or blindness) rather than by developing reexperiencing symptoms (e.g., Lerner, 2003).

Sometimes an experiment takes place in a very special epoch, such as during the weeks following the September 11, 2001, terrorist attacks in New York and Washington, D.C. The results of a study, for example, of college student stress around the time of September 11 might not apply to other periods, past or future. One needs to rely, in part, on common sense to determine whether the results of an experiment would generalize from one time period to another.

Even when circumstances are relatively more mundane, we still cannot logically extrapolate findings from the present to the future. Yet, while logic can never be satisfied, “commonsense” solutions for short-term historical effects lie in either replicating the experiment at different times...or in conducting a literature review to see if prior evidence exists which does not refute the causal relationship. (Cook & Campbell, 1979, p. 74)

Comment

Internal Versus External Validity. Internal validity often takes precedence over external validity, because one must first obtain an unambiguous finding before one can generalize the results. Accordingly, many studies in clinical psychology have high internal validity and lower (or unknown) external validity. To illustrate, in a study of memory functioning in generalized anxiety disorder (GAD), internal validity is improved if participants taking medication are excluded from the study. This is because some anxiolytic medications, such as benzodiazepines, may impair memory function. Excluding such participants improves internal validity, but it raises questions about external validity because it remains to be established that the results would apply to GAD patients who happen to be taking medication. Such patients might have clinically more severe GAD than unmedicated patients. This would be an important issue if the researcher hypothesizes that GAD arises from particular patterns of memory processing.
By excluding the more severe patients, it is not possible to determine whether the memory results can be generalized to more severe cases of the disorder.

When Does External Validity Matter? One should not automatically assume that it is important that a study has good external validity. We may not be so concerned with external validity if the focus of the investigation concerns what can happen, instead of what typically does happen (Mook, 1983). Thus external validity is less of a concern if the goal of one’s research is to test predictions derived from theory or conjecture. Consider, for example, patients who report that they suddenly became aware of long-buried memories of childhood sexual abuse. The veracity of such “recovered” memories is highly controversial. Some clinicians argue that these are genuine memories that had been repressed and then retrieved. A number of researchers have argued that these are false memories, sometimes implanted by therapists using hypnosis, guided imagery, or other “memory recovery” techniques to get to the bottom of the patient’s problems (for a review of this debate, see McNally, 2003). This debate raises the following question about the mechanisms of memory, which has been evaluated in several laboratory studies: Is it possible to implant a clearly false childhood memory using memory recovery techniques? Note that this is not an issue of does it happen but a question of can it happen. The answer is yes. Analogue research using university students has shown that it is possible to lead the participants to “recall” something that, according to their parents, never happened to them, such as being savagely mauled by a dog (e.g., Porter, Yuille, & Lehman, 1999). Although such findings have relevance to the memory recovery controversy, the primary value of this type of research is to shed light on memory processes.

To determine whether external validity is important in a given research investigation, you need to consider the conclusion that you would like to make and whether your sample and research design will enable you to reach this conclusion. The following is a sample of questions that you might ask in deciding whether the usual criteria of external validity should even be considered (Mook, 1983):

• Regarding the Sample. Am I trying to estimate from sample characteristics the characteristics of some population? Or am I trying to draw conclusions not about a population but about a theory that specifies what these participants ought to do? Or (as in the case of false memories) would it be important if any subject does, or can be induced to do, this or that?

• Regarding the Setting. Is my intention to predict what would happen in a real-life setting or target class of such settings? The answer may be no if the aim is to test a prediction about what ought to happen in the experimental setting. In this situation, external validity is not an issue. If the answer is yes, then you need to consider whether it is necessary that the setting be representative.

Evaluating and Improving External Validity. There are several ways of evaluating and improving external validity. One approach is to try to ensure that the sample is representative of the target population. The deliberate inclusion of a heterogeneous sample can be used to determine if particular variables predict the results. If you are conducting a treatment outcome study, for example, and want to know whether the results vary with socioeconomic status (SES), then you could select patients from a range of different SES levels (using stratified random sampling) and determine whether SES predicts treatment outcome. Note that this approach often requires a large sample (e.g., n = 50 per treatment condition), so that sufficient numbers of participants from each SES level are in each treatment condition. Another approach is to conduct multiple studies across different subgroups, settings, or times. This provides a means of determining whether the findings are replicable.

Benchmarking studies can also be used as a means of evaluating external validity. These are investigations in which research conducted in tightly controlled laboratory situations (which has high internal validity and may have low external validity) is compared with field studies (which may have good external validity but lower internal validity). A recent meta-analysis compared results from lab and field studies across a range of research domains, including clinically relevant investigations such as studies of aggression or depression (Anderson, Lindsay, & Bushman, 1999). The investigators examined the correspondence between lab- and field-based effect sizes from studies using conceptually similar dependent and independent variables. The results
of field research tended to mirror the findings from lab research, suggesting that lab studies generally have good external validity.

To illustrate benchmarking research, several studies have been conducted in which treatment-outcome findings from a university-based specialty clinic (e.g., the Center for Anxiety and Related Disorders at Boston University) are compared to findings from community mental health clinics. The university-based research tended to have high internal validity, although the use of patient inclusion and exclusion criteria raised questions about the external validity of the findings. Patients are typically excluded from CBT studies if their doses of psychotropic medication are unstable or if they have particular comorbid disorders. Studies of panic disorder, for example, often exclude patients who have comorbid paranoid, schizotypal, or borderline personality disorders. Studies conducted in community clinic settings are more liberal in their inclusion criteria and more closely approximate routine clinical treatment that patients would receive. This means that these studies have good external validity but weaker internal validity. Benchmarking studies of major depression and panic disorder indicate that the results from community clinics are similar to those obtained in university clinics and that the patients from both settings are broadly similar in their pretreatment clinical characteristics, such as the severity and duration of their disorders (e.g., Merrill, Tolbert, & Wade, 2003; Wade, Treat, & Stuart, 1998). These findings indicate that tightly controlled treatment studies from university clinics have good external validity. Such studies address the concerns of critics like Dr. Smith from Scenario 1, who claimed that treatment research findings do not generalize to patients in the “real world.”

CONCLUSIONS: PERFECTING OUR KNOWLEDGE FROM IMPERFECT RESEARCH

Few, if any, research studies are methodologically perfect. Some consumers of the research literature tend to throw out the baby with the bathwater; that is, if a study has a minor limitation, they tend to dismiss it entirely. This was the case for the attendees of the journal club discussed in Scenario 2. But is it really true that “imperfect” studies are worthless? If this were the case, then scientific progress would not be possible, neither in psychology nor in the other sciences. But what can we legitimately conclude from imperfect investigations? As in all areas of science, no single study in clinical psychology provides the final answer to an important research question. Science is a cumulative process, whereby different studies investigate the research questions in different ways, controlling for different factors. In other words, science progresses through the development of cumulative findings from programs of research (Lakatos & Musgrave, 1970). The overall pattern of findings that emerges across studies is the most important factor in answering important research questions.

The strength of internal and external validity of a study can help researchers evaluate the relative importance of that study in an overall program of research. If a study has very weak internal validity, then it may be given little or no consideration in evaluating what the corpus of research suggests about an important research question. A study might have several strengths but might have some noteworthy weaknesses. An analogue study of schizophrenia, for example, has the shortcoming of not using actual participants with the disorder. This is not a legitimate reason for dismissing the study altogether. The limitation simply raises another question to be answered in another study: If people who have features similar to schizophrenia (analogues) produce particular patterns of findings, then do people with full-blown schizophrenia show the same pattern of results? The analogue study may have high internal validity and lower external validity, whereas the field study (using actual patients with schizophrenia) would probably have lower internal validity (because it is difficult to control for all confounding factors when using clinical samples) but higher external validity. Together, the two types of studies complement one another.

Internal and external validity are important issues in evaluating the merits of a study, but they are not the only considerations. Other important issues include the way the data are analyzed, the reliability and validity of the measures or manipulations used, and the statistical power of the design. Those issues are discussed elsewhere in this volume.
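The stratified random sampling strategy discussed under Evaluating and Improving External Validity can be made concrete with a short sketch. It is illustrative only: the function `stratified_assignment`, the three SES labels, the pool of patient IDs, and the two condition names are hypothetical, and equal allocation across SES levels is an assumption made here for simplicity rather than something the chapter prescribes.

```python
import random
from collections import Counter


def stratified_assignment(pool, strata, conditions, n_per_cell, seed=0):
    """Sample within each stratum, then randomly assign to conditions.

    pool maps participant id -> stratum label (e.g., an SES level).
    Returns a dict mapping participant id -> (stratum, condition) with
    exactly n_per_cell participants in every stratum-by-condition cell.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    needed = n_per_cell * len(conditions)
    assignment = {}
    for stratum in strata:
        members = sorted(pid for pid, label in pool.items() if label == stratum)
        if len(members) < needed:
            raise ValueError(f"not enough participants in stratum {stratum!r}")
        chosen = rng.sample(members, needed)  # random draw within the stratum
        for i, pid in enumerate(chosen):
            # chosen is in random order, so dealing it out in turn gives
            # random assignment with equal cell sizes
            assignment[pid] = (stratum, conditions[i % len(conditions)])
    return assignment


# Hypothetical pool: 300 patient IDs spread evenly over three SES levels.
ses_levels = ["low", "middle", "high"]
pool = {pid: ses_levels[pid % 3] for pid in range(300)}

assignment = stratified_assignment(
    pool, ses_levels, ["treatment", "placebo"], n_per_cell=17
)
cell_sizes = Counter(assignment.values())
print(cell_sizes)
```

With 17 participants per cell and three SES levels, each condition contains 51 patients, close to the n = 50 per condition mentioned earlier, yet only 17 of them come from any one SES level. This is why the approach tends to require large samples.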


REFERENCES

Anderson, C. A., Lindsay, J. J., & Bushman, B. J. (1999). Research in the psychological laboratory: Truth or triviality? Current Directions in Psychological Science, 8, 3–9.
Anderson, K. W., Taylor, S., & McLean, P. (1996). Panic disorder associated with blood-injury reactivity: The necessity of establishing functional relationships among maladaptive behaviors. Behavior Therapy, 27, 463–472.
Asmundson, G. J. G., Norton, G. R., & Stein, M. B. (2002). Clinical research in mental health: A practical guide. Thousand Oaks, CA: Sage Publications.
Barlow, D. H., Gorman, J. M., Shear, M. K., & Woods, S. W. (2000). Cognitive-behavioral therapy, imipramine, or their combination for panic disorder: A randomized controlled trial. Journal of the American Medical Association, 283, 2529–2536.
Barlow, D. H., & Hersen, H. (1984). Single case experimental designs. New York: Pergamon.
Campbell, D. T., & Stanley, J. C. (1970). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Clark, D. M., Salkovskis, P. M., & Chalkey, A. J. (1985). Respiratory control as a treatment for panic attacks. Journal of Behavior Therapy and Experimental Psychiatry, 16, 23–30.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Finger, M. S., & Rand, K. L. (2003). Addressing validity concerns in clinical psychology research. In M. C. Roberts & S. S. Ilardi (Eds.), Handbook of research methods in clinical psychology (pp. 13–30). Malden, MA: Blackwell.
Flick, S. N. (1988). Managing attrition in clinical research. Clinical Psychology Review, 8, 499–515.
Furby, L. (1973). Interpreting regressions toward the mean in developmental research. Developmental Psychology, 8, 172–179.
Lakatos, I., & Musgrave, A. (1970). Criticism and the growth of knowledge. Cambridge, UK: Cambridge University Press.
Lerner, P. (2003). Hysterical men: War, psychiatry, and the politics of trauma in Germany, 1890–1930. New York: Cornell University Press.
Margraf, J., Ehlers, A., Roth, W. T., Clark, D. B., Sheikh, J., Agras, W. S., et al. (1991). How “blind” are double-blind studies? Journal of Consulting and Clinical Psychology, 59, 184–187.
McNally, R. J. (2003). Remembering trauma. Cambridge, MA: Harvard University Press.
Merrill, K. A., Tolbert, V. E., & Wade, W. A. (2003). Effectiveness of cognitive therapy for depression in a community mental health center: A benchmarking study. Journal of Consulting and Clinical Psychology, 71, 404–409.
Mogg, K., & Bradley, B. P. (1999). Selective attention and anxiety: A cognitive-motivational perspective. In T. Dalgleish & M. J. Power (Eds.), Handbook of cognition and emotion (pp. 145–170). New York: Wiley.
Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38, 379–387.
Onghena, P., & Edgington, E. S. (2005). Customization of pain treatments: Single-case design and analysis. Clinical Journal of Pain, 21, 56–68.
Öst, L.-G., & Sterner, U. (1987). Applied tension: A specific behavioral method for treatment of blood phobia. Behaviour Research and Therapy, 25, 25–29.
Porter, S., Yuille, J. C., & Lehman, D. R. (1999). The nature of real, implanted, and fabricated childhood emotional events: Implications for the recovered memory debate. Law and Human Behavior, 23, 517–537.
Pulos, L., & Richman, G. (1990). Miracles and other realities. Vancouver, British Columbia: Omega.
Rosenthal, R. (2002). Covert communication in classrooms, clinics, courtrooms, and cubicles. American Psychologist, 57, 839–849.
Sokol, L., Beck, A. T., Greenberg, R. L., Wright, F. D., & Berchick, R. J. (1989). Cognitive therapy for panic disorder: A nonpharmacological alternative. Journal of Nervous and Mental Disease, 177, 711–716.
Taylor, S. (1994). The overprediction of fear: Is it a form of regression toward the mean? Behaviour Research and Therapy, 32, 753–757.
Taylor, S. (2000). Understanding and treating panic disorder. New York: Wiley.
Taylor, S., Thordarson, D. S., Maxfield, L., Fedoroff, I. C., Lovell, K., & Ogrodniczuk, J. (2003). Comparative efficacy, speed, and adverse effects of three treatments for PTSD: Exposure therapy, EMDR, and relaxation training. Journal of Consulting and Clinical Psychology, 71, 330–338.
Wade, W. A., Treat, T. A., & Stuart, G. L. (1998). Transporting an empirically supported treatment for panic disorder to a service clinic setting: A benchmarking study. Journal of Consulting and Clinical Psychology, 66, 231–239.