
Comprehensive Clinical Psychology

Editor: Nina R. Schooler, Hillside Hospital, Glen Oaks, NY, USA

Editors-in-Chief: Alan S. Bellack, The University of Maryland at Baltimore, MD, USA; Michel Hersen, Pacific University, Forest Grove, OR, USA

Research and Methods Volume 3

2001 AN IMPRINT OF ELSEVIER SCIENCE AMSTERDAM—LONDON—NEW YORK—OXFORD—PARIS—SHANNON—TOKYO

Elsevier Science Ltd., The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK

Copyright © 2001 Elsevier Science Ltd.

All rights reserved. No part of this publication may be reproduced, stored in any retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic tape, mechanical photocopying, recording or otherwise, without permission in writing from the publishers.

First edition 1998 Paperback edition 2001

Library of Congress Cataloging In Publication Data Comprehensive clinical psychology / editors-in-chief, Alan S. Bellack, Michel Hersen. -- 1st ed. p. cm. Includes indexes. Contents: v. 1. Foundations / volume editor, C. Eugene Walker -- v. 2. Professional issues / volume editor, Arthur N. Wiens -- v. 3. Research and methods / volume editor, Nina R. Schooler -- v. 4. Assessment / volume editor, Cecil R. Reynolds -- v. 5. Children & adolescents / volume editor, Thomas Ollendick -- v. 6. Adults / volume editor, Paul Salkovskis -- v. 7. Clinical geropsychology / volume editor, Barry Edelstein -- v. 8. Health psychology / volume editors, Derek W. Johnston and Marie Johnston -- v. 9. Applications in diverse populations / volume editor, Nirbhay N. Singh -- v. 10. Sociocultural and individual differences / volume editor, Cynthia D. Belar -- v. 11. Indexes. 1. Clinical psychology. I. Bellack, Alan S. II. Hersen, Michel. [DNLM: 1. Psychology, Clinical. WM 105 C737 1998] RC467.C597 1998 616.89--dc21 DNLM/DLC for Library of Congress 97-50185 CIP

British Library Cataloguing In Publication Data Comprehensive clinical psychology 1. Clinical psychology I. Bellack, Alan S. (Alan Scott), 1944- II. Hersen, Michel 616.8'9

ISBN 0-08-042707-3 (set : alk. paper) ISBN 0-08-043146-1 (Volume 7) ISBN 0-08-044069-X (Volume 7 paperback)

Typeset by Bibliocraft, Dundee, UK. Printed and bound in The Netherlands by Giethoorn Media Group

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.01 Observational Methods

FRANK J. FLOYD, DONALD H. BAUCOM, JACOB J. GODFREY, and CARLETON PALMER University of North Carolina, Chapel Hill, NC, USA

3.01.1 INTRODUCTION
3.01.2 HISTORICAL INFLUENCES ON BEHAVIORAL OBSERVATION
3.01.3 THE PROS AND CONS OF BEHAVIORAL OBSERVATION
3.01.4 OBTAINING OBSERVATIONS OF BEHAVIOR
3.01.4.1 Selection of a Setting for Observing Behavior
3.01.4.2 Selection of Live vs. Recorded Observation
3.01.4.3 Selection of a "Task" or Stimulus Conditions for Observing Behavior
3.01.4.4 Selection of a Time Period for Observing Behavior
3.01.5 ADOPTING OR DEVELOPING A BEHAVIORAL CODING SYSTEM
3.01.5.1 Adopting an Existing Coding System
3.01.5.2 Developing a New Coding System
3.01.5.2.1 Cataloging relevant behaviors
3.01.5.2.2 Selecting a unit of observation
3.01.5.2.3 Creating code categories
3.01.5.2.4 Content validation of codes
3.01.5.2.5 Developing a codebook
3.01.6 TRAINING CODERS
3.01.7 RELIABILITY
3.01.7.1 Enhancing Reliability
3.01.7.2 Evaluating Reliability
3.01.8 ANALYZING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION
3.01.8.1 Data Reduction
3.01.8.1.1 Measurement of base rates of behavior
3.01.8.1.2 Measurement of sequential patterns of behavior
3.01.8.2 Computer Technology for Recording and Coding Behavioral Observations
3.01.9 INTERPRETING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION
3.01.10 CONCLUDING COMMENTS
3.01.11 REFERENCES

3.01.1 INTRODUCTION

Behavioral observation is a commonplace practice in our daily lives. As social creatures and "informal scientists," we rely upon observations of behavior to understand current social experiences and predict future social events. In fact, direct observation of behavior is one of the most important strategies we use to process our social world. Thus, it is not surprising that the field of psychology also is drawn to behavioral observation as a research method for understanding human behavior. The current chapter will focus upon behavioral observation as a formal research tool. In this context, behavioral observation involves the systematic observation of specific domains of behavior such that the resulting descriptions of behavior are replicable. In order to accomplish this task, the ongoing stream of behavior must be coded or broken down into recordable units, and the criteria for the assignment of labels or for making evaluations must be objectified. These practices of specifying units of behavior and objectifying coding criteria are the key steps in translating informal behavioral observations into formal, scientific observations. As will be seen below, the challenge of employing behavioral observation in research settings involves the myriad of decisions that an investigator must make in this translation process from informal to formal observation.

3.01.2 HISTORICAL INFLUENCES ON BEHAVIORAL OBSERVATION

The development of behavioral observation methodology is attributable to two major sources, the science of human behavior and the clinical practice of behaviorally oriented interventions. The science of human behavior is often traced to Watson and his colleagues (e.g., Watson & Raynor, 1924), who developed sophisticated methods to observe the behavior of children. Several important features distinguish this approach from early research that also used the observation of human actions as data. Most notably, unlike earlier trait-based researchers who measured behavior to make inferences about internal causes (e.g., Binet, Galton), the behavior itself is the focus of study. The approach also emphasizes direct observation in naturalistic settings as opposed to contrived responses that are elicited under artificial, controlled conditions. The further development of this approach was greatly stimulated by Skinner's (e.g., Skinner, 1938) theories that emphasized a focus on overt observable behavior, and by research manuals (e.g., Sidman, 1960) that further developed the rationale and methods for conducting and interpreting research on directly observed behaviors (Johnston & Pennypacker, 1993).

The second historical influence, behaviorally oriented clinical practice, also emphasizes direct observation under naturalistic circumstances. This approach defines pathology in terms of maladaptive behavior and then focuses on how environmental factors control maladaptive responding. Thus, behaviorally oriented clinicians conduct functional analyses to determine the environmental events that elicit and maintain maladaptive as opposed to adaptive behaviors, and they focus on observable behavior change as the criterion for treatment success. The need for precision both in measuring the behavior of interest and in identifying relevant environmental events has produced a body of scholarship on behavioral assessment as a clinical tool (e.g., Haynes, 1978; Hersen & Bellack, 1998), of which systematic behavioral observation is generally considered the hallmark strategy.

Ironically, Haynes (1998) notes that despite the theoretical importance of direct observation of behavior as a central feature of the behavioral approach, most research and clinical practice from a behaviorally oriented perspective relies on indirect measures such as self-report questionnaires. This increased reliance upon indirect assessment stems in part from the recognition that demonstrating improvement in subjective well-being is important, in addition to showing changes in overt behavior (e.g., Jacobson, 1985). Also, there has been an increasing emphasis on cognitions and other internal, nonobservable experiences (e.g., fear, dysphoria) as relevant focuses for behavioral intervention. However, Haynes suspects that most researchers and clinicians fail to conduct direct observations of relevant behaviors because they believe that the difficulty associated with conducting naturalistic observations outweighs the expected benefits in terms of incremental validity of the assessment.

Accordingly, recent advances in observational methodology and technology tend to come from fields of study and practice in which the veracity of self-reports is particularly suspect. For example, researchers and clinicians working with people who have mental retardation have produced important contributions in theory and methodology (e.g., Sackett, 1979b) as well as observational technology (Tapp & Walden, 1993). Similarly, research on infants and young children continues to emphasize direct observation over other types of measures (e.g., parent or teacher report).

Another more recent source for advances in behavioral observation is the growth of the marital and family systems perspective in clinical research and practice. This perspective emphasizes interpersonal exchanges as both focuses of interest in their own right and as contexts for individual functioning. The theoretical justification for this emphasis stems from family clinicians (e.g., Haley, 1971; Minuchin, 1974) who argue that couples, parent-child dyads, siblings, and larger family units are interacting subsystems that may cause and maintain pathological responding for individuals. In behavioral terms, marital and family interactions elicit and maintain maladaptive responding (e.g., Patterson, 1982). Because individuals are often biased reporters who have a limited perspective on the operation of the family system, observation of family interactions by outside observers is necessary for understanding these family processes. Thus, research and clinical work on marriage (e.g., Gottman, 1979), parenting (Patterson, 1982), and larger family units (e.g., Barton & Alexander, 1981) paved the way for many advances in observational technology and statistical approaches to analyzing observational data (e.g., Bakeman & Gottman, 1997).

3.01.3 THE PROS AND CONS OF BEHAVIORAL OBSERVATION

Behavioral observation is a fundamental form of measurement in clinical research and practice. The function of observation is the conversion of an ongoing, complex array of behavioral events into a complete and accurate set of data that can influence scientific or clinical decisions (Hawkins, 1979; Johnston & Pennypacker, 1993). As a measurement tool, behavioral observation is systematic; it is guided by a predetermined set of categories and criteria, and it is conducted by reliable observers (Bakeman & Gottman, 1997). In many instances, behavioral observation is the most objective form of measurement of relevant behaviors in relevant contexts because it is the most direct approach. Indeed, observations provide the yardstick against which other, more indirect measures, such as self-reports or rating scales, are evaluated.

In addition to providing direct, objective measurements, behavioral observation has several other strengths as an assessment tool. Hartmann and Wood (1990) note that (i) observation is flexible in providing various forms of data, such as counts of individual behaviors, or records of sequences of events; (ii) the measurements are relatively simple and noninferential, so that they are readily obtained by nonprofessional observers; and (iii) observations have a wide range of applicability across populations, behaviors, and settings. For clinical purposes, behavioral observation produces specificity in identifying problem behaviors that are targets for intervention, and in identifying causal and maintaining variables. Thus, it supports an idiographic, functional-analytic approach in which assessment leads to specific treatment goals. Furthermore, when observation is conducted in relevant settings, the data are criterion-referenced, so that treatment effects are likely to generalize to the natural environment.

Despite these strengths, behavioral observation also has several limitations. First, the "objectivity" of behavioral observation is far from absolute. Even when relatively simple forms of behavior are observed, the observational system imposes considerable structure on how behaviors are segmented and labeled, which substantially affects the nature of the data that are obtained. Second, behavioral observation is expensive and labor-intensive as compared to self-reports. Third, observation cannot access inner experiences that are unobservable. Finally, observations provide only a limited snapshot of behavior under a specific set of circumstances, which often is not helpful for generalizing to other situations. Ironically, this latter limitation reveals how sensitivity to the effects of context on behavior is both a strength and a limitation of this approach. Although behavioral observation is the method of choice for examining functional relationships that elicit and maintain actions in a particular context, such observations may have limited utility for predicting responses in different contexts or circumstances.

In designing a research study, an investigator has a variety of strategies available for gathering data, including self- and other-report measures, behavioral observation, and physiological indices, and must decide which and how many of these strategies are appropriate. Whereas behavioral observation may be more direct than self-report measures in many cases, observation is only a reasonable candidate if the investigator wishes to assess how an individual or group actually behaves in a given context. Understanding the different types of information that are obtained with each measurement strategy is critical when interpreting findings.

For example, in prevention studies to assist couples getting married, investigators often teach communication skills. Then, to assess whether the couples have learned these skills, they might employ behavioral observations of the couples' interactions, as well as obtaining self-report measures from the couples about their communication. In reviewing the findings across such investigations, an interesting pattern of results seems to have evolved. In most instances, investigators have been able to demonstrate that, based on behavioral observation, couples change the ways that they communicate with each other when asked to do so in a laboratory setting. However, on self-report measures, these same couples often report that their communication has not changed (Van Widenfelt, Baucom, & Gordon, 1997). How can we interpret these seemingly discrepant findings? Some investigators would argue that the behavioral observation findings are in some sense more meaningful because they reflect how the couple actually behaves. The self-report findings are more suspect because self-reports are subject to memory distortion, might reflect the couples' overall feelings about their marriages, and are impacted by response sets or test-taking attitudes. However, the results could be interpreted in other ways as well. Perhaps the couples' behavior in the laboratory does not reflect how they actually behave at home, and this discrepancy is demonstrated by the self-report data. Conversely, the behavioral observation in the laboratory might reflect ongoing behavior at home, but this behavior does not impact the couples' experience of their overall communication. That is, perhaps the intervention did not target, or the behavioral coding system did not capture, the critical elements of communication that impact the couples' subjective experience of how they communicate with each other.

When behavior is observed, the investigator must make a series of decisions that can significantly impact the results and interpretation of the findings. First, the investigator must decide what behavior to observe. This will include a consideration of (i) the setting in which the behavior will be observed, (ii) the length of time for observing the behavior on a given occasion, (iii) the number of occasions on which behavior will be observed, (iv) who will observe the behavior, and (v) the actual behavior that will be the focus of the investigation. After or while the data are gathered, they are coded according to some scheme or coding system. Thus, in evaluating a child's social behavior, the investigator must decide whether nonverbal facial cues will be coded, what verbal categories will be included in the coding system, whether the type of interaction such as play or classroom activities will be coded, and so forth. Once the behavior is coded, then the investigator must decide how to analyze the data. Often, due to limited amounts of data or for conceptual reasons, specific codes are collapsed into larger codes; for example, many specific codes may be summarized as positive or negative behaviors. Finally, the investigator must decide how to analyze the data obtained from the observational coding scheme. Is the investigator interested in the frequency of various behaviors that were observed, or is the investigator focused upon the pattern of interaction and the interdependencies among various behaviors? These different questions of interest to the investigator will result in different data analytic strategies that will provide different information about the observed behavior.

3.01.4 OBTAINING OBSERVATIONS OF BEHAVIOR

The first challenge for an investigator who wishes to employ behavioral observation is deciding upon what behavior to observe, which actually involves a series of decisions. In almost all instances, the investigator is interested in drawing conclusions about a class of behaviors. However, the investigator can observe only a sample of that behavior while wishing to draw broader conclusions. For example, a family researcher might be interested in family interaction patterns and decide to observe a family while they interact around a dinner table; however, the investigator is interested in much more than dinner-time behavior. Similarly, a marital investigator might be interested in couples' communication fairly broadly, but observes their interaction for only 10 minutes in a laboratory setting. Or someone studying children's peer relationships might observe playground behavior at school but not in the neighborhood. In all of these instances, the investigator must be concerned with whether the sample of behavior is a representative or reasonable sample of the broader domain to which the investigator wishes to generalize.

3.01.4.1 Selection of a Setting for Observing Behavior

One major decision that the investigator must make is the setting for observing the behavior. A major distinction in this regard is whether the behavior is to be observed in a controlled or laboratory setting or in a more natural setting. Both strategies have their advantages and disadvantages. Laboratory settings have the asset of being more controlled, such that the behavior of various participants can be observed under more standard conditions. For example, a variety of parents and children can be brought into a laboratory setting and observed in the same setting with the same task. This standardized environment can be of help when interpreting the findings because it can help to exclude some alternative explanations of the findings based around differences among settings. For example, observing families in their home settings might be greatly impacted by whether there are interruptions from the outside, whether the home is excessively hot or cold, and so forth.

However, typically the investigator is interested in much more than laboratory behavior and wishes to draw conclusions about behavior in other settings in the natural environment. Not only are natural environments typically less controlled, but often the effort and expense involved in observing behavior in its natural environment is prohibitive. Therefore, in deciding which setting to employ in observing behavior, the investigator must address the question of the extent to which behavior observed in one setting is generalizable to other settings. Understandably, there is no general answer to this question, and it must be evaluated for the particular settings and participants of interest.

As an example of how this issue of generalizability of behavior across settings has been explored, Gottman (1979) evaluated couples' conversations both in a laboratory setting and at home. His findings indicated that although there is a tendency for couples to be more negative with each other at home than in a laboratory setting, along with more reciprocation of negative emotion at home, the couples generally demonstrated similar interaction patterns across settings. Even so, this finding applies only to the types of interaction that Gottman assessed with his particular sample, employing a given coding system. This issue of generalizability across settings from control/research settings to natural settings applies equally to generalizability within one of these domains, as is discussed later. For example, dinner-time behavior among family members in their own home might or might not reflect family interaction later in the evening when homework or bedtime activities are the focus of the family's discussion. Thus, even within a natural environment, behavior in one aspect of that setting might not reflect behavior in other aspects of the natural family environment. Consequently, the investigator should give a great deal of thought to the setting in which the behavior is observed in order to increase the likelihood that the resulting behavioral observations access the behavior of interest.

3.01.4.2 Selection of Live vs. Recorded Observation

The setting is altered by the experimenter when coders or recording devices are introduced into the environment. The decision to have coders present for the actual observation session, or instead to record the session for later coding, raises two major concerns. The first concern is the reactivity of live observation and recording equipment. Although research on this topic has a long history (e.g., Haynes & Horn, 1982; Kazdin, 1982), the findings are mixed and do not specifically address the relative effects on reactivity of having an observer present as opposed to using video or audio recording equipment. Much research shows that the presence of a live observer alters the behavior of subjects (Kazdin, 1982), and this reactivity may be greatest in intervention studies when the demands to display desired behaviors are relatively clear (e.g., Harris & Lahey, 1986). However, research on the reactivity of recording equipment is less certain. For example, studies using repeated sampling with recording equipment fail to detect habituation effects (e.g., Christensen & Hazzard, 1983; Pett, Wampold, Vaughn-Cole, & East, 1992), which suggests that the equipment does not evoke an initial orienting response. Further, studies that compare different recording methods show that relatively obtrusive as opposed to unobtrusive procedures produce few differences in the quality of most behaviors observed, although positivity may be increased somewhat (e.g., Carpenter & Merkel, 1988; Jacob, Tennenbaum, Seilhamer, Bargiel, & Sharon, 1994). Of course, it is possible that the mere presence of any type of recording equipment (or knowledge that it is present although unseen) may cause sustained changes in behavior similar to the effects of self-monitoring or participant observation (e.g., Jarrett & Nelson, 1984). Nevertheless, this set of studies suggests that using recording equipment with no observer present may be a relatively less reactive approach than live observation of behavior.

Another concern about live as opposed to recorded behavior is the accuracy of coded data. In general, we assume that video and audio recorded data help to improve coder accuracy because they provide the capacity to play back events repeatedly that are ambiguous or happen quickly. However, recorded data also may interfere with a coder's ability to attend selectively to salient behavior, particularly in a setting in which there is considerable background activity and noise. For example, Fagot and Hagen (1988) found that coders evaluating children in a preschool setting were less reliable and tended to miss relevant events when they coded from videotape as opposed to live observations. In part, the superiority of recorded observations depends on the ability to obtain excellent recordings. In many circumstances, audio recordings are particularly problematic because behavior is ambiguous without visual cues. When audio recordings are transcribed, some investigators also include live observers who make notes about nonverbal events that are added to the transcript.

3.01.4.3 Selection of a "Task" or Stimulus Conditions for Observing Behavior

Not only must investigators decide upon the physical setting for observing behavior, but the task or stimulus conditions within which the behavior is to be observed also must be decided. On many occasions, investigators ask the participants to engage in a specific task; on other occasions, the investigator merely decides to observe the participants' behavior in a given setting at a given time. If a particular task or interaction is structured by the investigator, then the effects of this particular task on the interaction must be considered. This is of less importance if the investigator is interested only in this particular type of interaction. For example, if an investigator is interested only in how a family would make weekend plans as a full family if asked to do so, then asking the full family to have such a discussion is relatively straightforward. However, the investigator might be interested in some more general question having to do with how the family makes decisions. If this is the issue of interest, then the investigator must carefully consider, and hopefully assess, the impact of this particular task involving planning a weekend. Deciding how to spend the weekend might or might not generalize to how the family makes decisions in other domains, such as how to divide household chores. Indeed, asking the family to sit and have a discussion resulting in weekend plans might not mirror how decisions are made in the family at all. Perhaps the parents make such decisions, or perhaps these decisions occur over the course of a number of informal interactions with different persons present at different times. More generally, when investigators structure particular types of interactions or ask the participants to engage in specific tasks, they must carefully consider whether the task or stimulus conditions that they have created mirror the ways in which the participants typically behave. Either the participants might behave differently when the task is different, or they might not typically engage in the task or situation that is constructed. The degree of concern raised by these issues is a function of the degree to which the investigator wishes to describe how the participants typically behave in their day-to-day lives.

How the task is selected also might impact the behavior. For example, Christensen and Heavey (1990) have described different interaction patterns among husbands and wives. This includes a well-known interaction pattern that they label "demand-withdraw," in which one partner presses for the discussion of a topic, and the other partner attempts to withdraw from the interaction. A number of investigations indicate that females are more likely to assume the "demand" role, and husbands are more likely to assume the "withdraw" role in marital interactions. However, Christensen and Heavey found that husbands were more likely to engage in the demand role during problem-solving interactions when the husbands selected the topic of conversation, compared to interactions in which the wife selected the topic to discuss. Thus, factors that influence an individual's interest in a task or motivation to participate in the task might significantly influence the resulting behavior that is observed.

3.01.4.4 Selection of a Time Period for Observing Behavior

In making decisions about the representativeness of the behavior that is observed, the investigator must also be concerned with the degree to which the observed behavior is generalizable across time. Classical test theory indicates that longer "tests" are more reliable in the sense that, keeping all other factors constant, they generally include a more representative sample of behavior and thus are more stable across time. In terms of behavioral observation, this raises two questions. First, how long should behavior be observed on a given occasion; second, on how many different occasions should behavior be observed? Whereas the answers to these questions should be based upon empirical findings, often pragmatic and logistic issues influence investigators' decisions along these lines. For example, sending observers into persons' homes or into classrooms can be troublesome and intrusive; sending observers to Africa to follow the social interactions among baboons can prove to be a logistical nightmare. Consequently, the difficulty, intrusiveness, and expense of behavioral observation often are a limiting factor in deciding how long to observe behavior. Similarly, the amount of time required to employ certain coding systems limits the length of behavioral observation. For example, some coding systems might require 20 hours to code one hour of recorded behavioral interaction. Therefore, if the investigator intends to employ such a coding system, he or she might limit the amount of observed behavior to short time periods.

In deciding how long to observe behavior during a given observation period, several factors come into play. First and most generally, a long enough time period is needed such that the findings are relatively stable and replicable on other occasions. In part this is related to the frequency or base rate with which the behaviors of interest occur. If the behavior of interest is a relatively infrequent behavior, then longer observation periods are needed to obtain a stable assessment of its occurrence. However, if a behavior occurs with a high degree of frequency, then shorter observation periods can suffice. Second, the length of the observation period is influenced by the "complexity" of the phenomenon under consideration. For example, the investigator might be interested in whether there are different stages or phases in a couple's conversation as they attempt to reach resolution to a problem; in this instance, it would be essential to observe the entire problem-solving interaction. Or the investigator might be interested in whether there are different stages or phases in how a child responds to a new setting with the mother present. Attachment theory has explored this question and has provided indications of how securely and insecurely attached children initially respond in such settings, how they venture forth into a room after an initial exposure, and how their interaction with their mothers changes across time in this setting (Ainsworth, Blehar, Waters, & Wall, 1978). Therefore, if the investigator hypothesizes or wishes to explore whether there are different stages or phases in the interaction, this necessitates following the interaction for a sufficient time period to allow for an examination of the different phases of interest.

Second, the investigator must decide on how many occasions to observe the behavior. A given observation period might be significantly influenced by the occurrence of a major event, or the interaction of participants being observed might proceed in a given manner based upon what happens early in the interaction. For example, if a child is taunted on the playground early during a recess period, the child might withdraw from the group, which will significantly impact the child's behavior for the duration of the observation period on the playground. If the child is not taunted the next day, then his or her interaction pattern might proceed quite differently. Consequently, the number of occasions on which to observe interaction will differ according to the variability in the behavior being observed. If the behavior of interest occurs in a relatively consistent manner across different occasions, then fewer occasions are necessary for obtaining a stable sample of behavior.

Whereas there are far too few empirical investigations that have been conducted to determine the number of observational sessions needed to obtain stable results, some such investigations do exist. For example, Wieder and Weiss (1980) explored how many observational sessions were needed in order to obtain a stable assessment of couples' interaction when employing the Marital Interaction Coding System (MICS; Weiss, 1986). Based on generalizability theory, they concluded that observing couples on a single occasion could provide meaningful information about the couple, but observations across two separate evenings provide more stable interaction patterns when coded by the MICS. Similarly, Haynes, Follingstad, and Sullivan (1979) found that across three evenings, there was high stability on only 5 of 10 selected coding categories of communication between spouses. Interestingly, in spite of these findings, no marital therapy outcome investigations have observed couples' interactions across two or more consecutive evenings.

As can be seen based on the above discussion, there are a number of questions that the investigator must address in deciding what behaviors to observe. The decisions that are made will certainly impact the investigator's findings. Yet before these findings are obtained, there are many more decisions for the investigator to make that will influence the results.

3.01.5 ADOPTING OR DEVELOPING A BEHAVIORAL CODING SYSTEM

Foremost among the additional decisions to be made is the choice of coding system to employ to classify the data that have been observed. In fact, in order to address the questions raised above regarding what behavior to assess, the investigator needs to know ahead of time what coding system he or she will employ. At times the behavior is coded live during the interaction, so the coding system must be known. Even if the observed behavior is to be coded later based on video recordings of the behavior, it is important to decide ahead of time what coding system will be employed. For example, some coding systems might be appropriate only in certain settings or with certain tasks. Similarly, some coding systems might break behavior into a number of fine-grained categories that occur on a somewhat infrequent basis, thus necessitating longer observational periods. Therefore, deciding what coding system to employ is a significant factor in developing a thoughtful study based on behavioral observation.

An initial consideration in selecting a coding system is whether to adopt an existing coding system or to develop a new one. Each approach has some obvious benefits and limitations, as well as other subtle impacts that may not be apparent to investigators until they are well into the task of coding the behavioral observations.

3.01.5.1 Adopting an Existing Coding System

In many cases, it may be possible to adopt a coding system that has been used previously in the same research area, or one that can be imported from another area where similar constructs were assessed.

analysis may be appropriate for evaluating
Adoption of an discrete events that require little inference; existing system has the great advantage of however, larger units of behavior might be saving the time and energy required to develop a needed to capture more complex phenomena. reliable, valid, and practical coding scheme. It Foster, Inderbitzen, and Nangle (1993) discuss a also links research across laboratories, samples, similar point regarding the selection of a coding and locations, and thus provides for ready system to evaluate the effectiveness of social synthesis of research findings from various skills training with children. They note that a studies. frequent problem with interpreting the results of The selection of a coding system is guided by treatment outcome studies is that whereas the both theoretical and practical considerations. treatment teaches children specific social skills, The primary theoretical issue is whether the cod- such as offering to share a toy, the observational ing system assesses behaviors that address the assessment evaluates only molar, or global constructs of interest to the investigator. All codes, such as ªpositive interactions.º From coding systems organize data into some set of data such as these, it is impossible to know categories or units for coding, and these cate- whether the behaviors that were trained actually gories are based on issues of importance to the were displayed during the assessment. person who developed the coding system; how- Alternatively, it also is important to question ever, they might or might not coincide with whether a complex phenomenon is accurately another investigator's concerns or theoretical evaluated by merely summarizing elemental model. Before beginning a search for a coding codes. 
For example, Jacob (1975) illustrates system, it is essential that the investigator first how power in interpersonal interactions may review theory and research to clarify the nature of not be indicated by who speaks more frequently the behavioral phenomena under investigation. or wins more disagreements, but rather by the Behavioral phenomena related to a construct ability to prevail on the central, important under one situation may take on different conflicts. Evaluations such as these may require characteristics in a new situation, thus making making judgments about relatively large units of an existing system inappropriate. For example, behavior. Ammerman, Van Hasselt, and Hersen (1991) Choosing among systems with different units coded problem-solving interactions between of observation also has practical implications. children with visual impairments and their Microanalytic coding systems that parse beha- parents using the Marital Interaction Coding viors into minute elements may be overly System, III (MICS III; Weiss, 1986), a well- complex and labor-intensive for investigators validated system for assessing problem-solving who merely want to assess global levels of interactions within marital dyads. The study characteristics, such as positiveness, compe- detected no differences between groups of tence, anxiety, or irritability. In such cases, it families with and without children with dis- may be more efficient to use a system that rates abilities, possibly because the coding system was dimensions such as the intensity or quality of inappropriate for the family context. Whereas behavior exhibited over an extended period of the MICS III evaluates warm and hostile time.
On the other hand, small, elemental units exchanges that indeed were similar across the of observation and analysis are useful for groups, it does not assess factors such as detecting situational influences on behavior behavior management, instruction, or other and sequential patterns among minute events; socialization practices that are important as- larger, more integrative units are useful for pects of parent±child exchanges that are understanding cross-situational consistency responsive to children's disabilities (Floyd & (Cairns & Green, 1979). Thus, findings based Costigan, 1997). Thus, a behavioral coding on larger units of observation may be more system may be sensitive to relevant variables generalizable than microanalytic coding. Some only for certain types of people, relationships, or investigators appear to assume that behavioral circumstances. observation is synonymous with microanalytic In addition to the substantive content of the coding; such assumptions can serve as a major system, various coding systems differ in the impediment to the more widespread use of ways that they ªchunkº or segment behavior observational measures in research settings with into coding units. This ªchunkingº of behavior limited resources. We encourage investigators has both theoretical and practical implications to explore macroanalytical coding procedures for investigators. From a theoretical perspec- as a practical and, in some cases, more tive, the nature of the phenomena being assessed informative alternative to microanalytic coding. should influence how the stream of ongoing Every coding system incorporates an array of behavior is segmented (Floyd, 1989). More assumptions, biases, and procedural preferences specifically, relatively small, elemental units of that the originator used to guide coding Adopting or Developing a Behavioral Coding System 9 decisions. 
These preferences are particularly sample coverage with the theta statistic, calcu- relevant in decisions about how to handle lated as 1 − (number of different behaviors seen/ ambiguous incidents. Many decision rules are total number of acts observed). As the value of not made explicit in published reports and theta approaches 1, the probability of encoun- coding manuals, so that it is difficult for tering a new behavior approaches zero. That is, investigators who did not participate in its we assume that if new behaviors are not development to apply an existing system encountered with additional observations, the accurately and in a way that is consistent with behavioral repertoire has been adequately other research. Whenever possible, consultation sampled. with the originator is invaluable. Some origina- Of course, a strictly empirical approach such tors of popular measures conduct periodic as this usually is not adequate for evaluating workshops to train new users (e.g., SASB, human behavior. As we noted at the beginning Humphrey & Benjamin, 1989; SPAFF, Gott- of the chapter, a stream of human behavior man, 1994). Most developers cannot be ex- often can be classified according to an en- pected to provide ongoing consultation to ormous variety of characteristics. In order to support their system, but should be willing to focus attention on a limited set of character- share examples of coded data and advice about istics, the investigator should begin with a list of common problems. these characteristics and their manifestations as gleaned from previous research, experience, and theory. Pilot observations then can be directed 3.01.5.2 Developing a New Coding System toward refining this list by broadening some categories, tightening others to make finer New ideas focused on new constructs and distinctions between theoretically disparate employing new assumptions are the lifeblood of behaviors, and adding new categories not progress in the social sciences.
Thus, there will suggested by previous research. For an excellent always be a need to develop new coding example of this process, see Jacob, Tennen- schemes. Even when well-known constructs baum, Bargiel and Seilhamer's (1995) descrip- are studied, if observational procedures become tion of the development of their Home highly standardized within a research area, the Interaction Scoring System. phenomenon of interest may become totally One frequent concern while developing dependent on the existing measure. This situa- coding systems involves what to do with rare tion can lead to replicated, well-established but theoretically important events. Rare events findings that are largely an artifact of a tend to decrease the reliability of coding particular measurement procedure. The need systems; however, such rare events may be to disentangle method variance from the highly meaningful, and thus they cannot be phenomenon of interest is strong justification excluded from a system without compromising for the development of new coding systems. validity and utility. It may be possible to Detailed instructions about developing cod- collapse similar rare events into broader ing systems are given in Bakeman & Gottman categories or to alter the observational situation (1997) regarding interpersonal interaction, and in order to elicit these behaviors more consis- by O'Neill, Horner, Albin, Storey, and Sprague tently and reliably. (1990) regarding functional analysis of problem behaviors. The key steps in developing any coding system are summarized below. 3.01.5.2.2 Selecting a unit of observation An important component of identifying relevant behaviors is to determine the appro- 3.01.5.2.1 Cataloging relevant behaviors priate unit of observation. This involves the A useful initial step is to develop an decision to evaluate behavioral states, events, or exhaustive list of all relevant behaviors to be some combination of the two. In general, a state coded. 
In some cases, this may be accomplished is any ongoing condition that persists over an by conducting initial observations and record- extended period of time, whereas an event is a ing all relevant behaviors that occur. Animal discrete action. States are usually measured with researchers frequently use this procedure to time-based indices such as duration or latency, develop an ethogram, which is a comprehensive whereas events are usually measured with list of all behaviors that are characteristic of a frequency counts or sequential patterns. Both species. Several ethograms are published to types of unit also can be rated for intensity. The describe the behavior repertoire of different distinction between states and events is blurred animal species. However, because it is usually by the fact that the same behavior might be impossible to sample all possible behaviors for a measured with both units, such as measuring species, investigators estimate the quality of both the duration of anxiety episodes or 10 Observational Methods disruptive episodes in the classroom, as well as data against actual rates or durations, and the frequency of the episodes. The type of unit is adjust the length of the recording interval to not always mandated by the content of the produce the most accurate data possible. See behavior and, once again, the decision about the Altmann (1974) for an extensive review of appropriate unit for a particular purpose must sampling protocols, and Bakeman and Quera be guided by theoretical, empirical, and prac- (1995) for considerations about how to design tical considerations. At first glance, it may sampling protocols and record the data for data appear that recording onset and offset times for analysis purposes. all behaviors is desirable so that information about duration and latency can always be 3.01.5.2.3 Creating code categories retrieved. 
However, Bakeman and Gottman (1997) warn against "the tyranny of time" and Typically, code categories will be mutually propose that, even with sophisticated recording exclusive and exhaustive. Mutual exclusivity and analytical devices, the precise measurement means that coding categories are discrete and of time adds substantial complexity to data homogeneous, and that each event can be recording and analysis, and can cause problems classified into one and only one category. with reliability that outweigh the benefits of Exhaustiveness means that all manifestations having these data. of a construct are included in the system. In The unit of observation also involves the most cases, exhaustiveness can be achieved by sampling protocol. The two most common including a category, such as "other" or sampling protocols in clinical psychology "neutral," to label all behaviors that do not research are event sampling, which involves fit well into any other category. For example, a noting each occurrence of events during the measure of parent–adolescent interaction by entire observation period, and time sampling or Robin and Foster (1989) includes a "talk" code interval coding, which involves noting occur- to cover all behaviors that are not instances of rences during selected time intervals. Most other categories. In some cases, the investigator commonly, time sampling involves alternating may believe that it is necessary to violate these periods of watching and recording, each lasting guidelines. For example, another measure of for several seconds. During the recording parent–adolescent interaction, the Constraining period, the coder usually records whether or and Enabling Coding System (Leaper et al., not each possible code occurred during the 1989), allows behaviors by parents to receive preceding interval.
This procedure is useful for both ªconstrainingº and ªenablingº codes live observations in which the coder must track a because the authors believe that these ªmixed large number of behaviors, so that recording messagesº may be highly relevant to adolescent each event would interfere with accurate functioning. In other cases, investigators only observation. The procedure assumes that re- label certain subsets of behaviors (i.e., as in scan cording periods are random, and will not distort sampling where, for example, only instances of the data. Several studies challenge this assump- target children's aggressive behavior and pro- tion and reveal that interval coding can distort vocations by peers are recorded). Both situa- the amount of behavior, sequential patterns, tions create difficulties for the analysis of and observer agreement for both behavioral sequences of events, although Bakeman and events (e.g., Mehm & Knutson, 1987) and Quera (1995) provide some solutions for behavioral states (e.g., Ary, 1984; Gardner & managing these types of data. Griffin, 1989). However, other studies demon- Often it is useful to organize codes into groups strate that interval coding can accurately reflect and, if appropriate, to arrange the groups into a actual behavior rates and durations (e.g., hierarchical classification. This arrangement Klesges, Woolfrey, & Vollmer, 1985). A study makes the codes easier to recall; in addition, by Mann, ten-Have, Plunkett, and Meisels this hierarchical arrangement can help to (1991) on mother±infant interactions illustrates fashion a set of decision steps to employ in that the accuracy of data from interval sampling the coding process, both of which may improve depends on how well the duration of the observer reliability. For example, children's sampling interval matches with the duration social behaviors might first be classified as of the behavior of interest. 
They found that the initiations versus reactions, and initiations actual durations or frequencies of mother and could be classified as prosocial versus antag- infant behaviors, which tended to occur in onistic, followed by more specific categories of relatively short episodes (ªboutsº), were inac- events within each of these general categories. curately assessed when the sampling interval The coders can then use this hierarchical was relatively long. Thus, before proceeding arrangement to organize their decision process, with a time-sampling/interval-coding approach, so that once they decide that a particular action the investigator should test the time-sampled is an initiation, and it is prosocial, there is a Training Coders 11 relatively small number of ªprosocial initiationº tests (APA, 1985) lists several other types of codes to choose among. information that would also be useful to include An alternative approach to forming a in codebooks, including information about the hierarchical organization is exemplified in the theoretical underpinnings for the measure, and Structural Analysis of Social Behavior (SASB) data on reliability and validity from research to system (Humphrey & Benjamin, 1989). This date. Herbert and Attridge (1975) propose that system uses a rating scale format as an aid to providing this type of information to coders categorical coding. In this system, all code may help to facilitate training and improve categories are classified within a circumplex coder reliability. defined by two dimensions: Interdependence and Affiliation. Coders receive extensive in- struction in the theory underlying each dimen- 3.01.6 TRAINING CODERS sion, and they learn to rate behaviors in terms of their degree of interdependence, represented on Once a coding system has been adopted or a vertical axis, and their degree of affiliation, developed, the next step is to train coders in its represented on a horizontal axis. The axes use. 
A preliminary step in conducting efficient intersect at their midpoints. The location on the and effective coder training is the selection of circumplex defined by the coordinates of these coders who will perform well. Unfortunately, dimensional ratings is the code for that act. there is little research on personal characteristics Thus, for example, a behavior rated as +5 for or abilities that predict good performance as a Interdependence (i.e., somewhat independent) behavioral coder. Research on interpersonal and −4 for Affiliation (i.e., somewhat dis- communication indicates that, in some circum- affiliative) is assigned to the category "walling stances, women tend to be more accurate than off and distancing." men at decoding the meaning of interpersonal behaviors (e.g., Noller, 1980). However, this effect is hardly large enough to warrant the 3.01.5.2.4 Content validation of codes exclusion of male coders in studies of inter- A useful, though often neglected, step in personal behavior. To the extent that coder coding system development is content valida- characteristics such as gender, age, education, tion of the codes. This might be accomplished or ethnicity may be associated with biases that by having "experts" or the actors themselves could influence coding decisions, it may be classify codes into relevant categories, then important to ensure that coders are diverse on comparing these categories to the expected these characteristics in order to randomize these categories. For example, in order to evaluate a effects and improve the validity of the data. family coding system that classified family Ironically, one characteristic that may be behaviors as aversive and nonaversive, Snyder important to select against is prior experience or (1983) had family members rate the aversiveness personal investment in the research area under of concrete examples of behaviors classified by consideration. The use of naive, uninvolved the coding system.
observers controls for biases caused by factors such as familiarity with the research hypotheses and prior expectations. We believe that in many 3.01.5.2.5 Developing a codebook instances naive observers also tend to provide After codes are labeled, defined, and classi- more reliable codes. Extensive research on fied, the next step is to produce a codebook. In clinical decision making demonstrates that general, most experts recommend that the more highly experienced judges tend to employ thorough, precise, and clearly written the idiosyncratic and inconsistent decision criteria codebook, the better the chances of training that can reduce both intraobserver consistency new coders to produce reliable, valid data (e.g., and interobserver agreement as compared to Hartmann & Wood, 1990; Herbert & Attridge, naive observers (e.g., Dawes, 1979). Our 1975); however, other studies demonstrate that experiences bear this out. When coders have naive coders can at times produce valid codes extensive previous training or experience in the (e.g., Prinz & Kent, 1978). The codebook should domain of interest, they tend to have difficulty include a list of all codes, a descriptive definition adhering strictly to the coding criteria outlined for each code, and examples of behaviors that in the coding manual, particularly if their represent each code, along with examples that experiences involved a different theoretical do not match the code. In order to assist with perspective or set of assumptions than those reliable coding, it is particularly important to that undergird the coding system. include examples of differential decisions in Wilson (1982) wisely proposed that observer borderline or ambiguous cases. The APA training should address two equally important guidelines for educational and psychological goals: teaching the skills needed to perform the 12 Observational Methods coding task, and motivating the coders to record, greater responsibility in the form of perform well. 
Teaching the skills usually training and monitoring new coders, or public involves a combination of didactic and experi- acknowledgment of good work. ential training sessions. A helpful first step is to explain the rationale and theory that underlie the coding system. In cases where the coding 3.01.7 RELIABILITY involves considerable judgment in the applica- In observational research, the reliability, or tion of a theory or model, the coders might precision, of the measure is almost always benefit from readings and instruction about the evaluated with indices of interobserver agree- model. The coders should read and demonstrate ment. Actually, according to psychometric a thorough working knowledge of the coding theory, reliability concerns the extent to which manual and should be tested on this material coded data map onto ªtrue scores,º and thus, before proceeding. Practice with the coding the reliability of coded data also relates to system should begin with the presentation of intraobserver consistency in applying the coding examples of behaviors, followed by an explana- scheme (Bakeman & Gottman, 1997). However, tion for the coding decisions. Initial examples because agreement between independent coders should be relatively unambiguous representa- is a higher standard for judging precision, it is tions of the coding categories or examples of the the focus of formal psychometric evaluations, extremes of the rating scales. The coders should and intraobserver consistency is addressed more be required to practice the coding with feedback informally through the implementation of about their accuracy and discussion of the training and monitoring procedures to prevent rationale for coding decisions. Of course, the observer drift. practice materials should match the actual coding situation as closely as possible. Training sessions should be relatively frequent and 3.01.7.1 Enhancing Reliability relatively brief to enhance alertness. 
Training continues until the coders reach acceptable As noted throughout the previous sections, levels of agreement with preestablished criterion reliability can be enhanced in the way the codes codes. Because accuracy can be expected to are developed, in the procedures for training decrease once actual coding begins (e.g., Taplin and monitoring the observers, and in the & Reid, 1973), most investigators set a training procedures for conducting the observations. criterion that is higher than the minimal Regarding code development, greater reliability acceptable criterion during actual coding. can be expected when codes are clearly defined Typically, this criterion is 80–90% agreement. in operational terms, when categories are The maintenance of the coders' motivation to relatively elemental and overt as opposed to perform well probably involves the same types complex and inferential, and when coded of factors that enhance performance in any behaviors occur at least moderately frequently. work setting, including clarity of the task, If some of these conditions cannot be met, investment in the outcome, personal responsi- coders likely will need relatively more training, bility, monitoring of performance, and a fair practice, and experience in order to produce reward structure. One typical procedure used by reliable data. Instructions that explicitly dis- investigators is to develop a contract that courage or encourage specific expectancies specifies the tasks to be completed, the about frequencies or patterns of codes either expectations, and the reward structure. Coder tend to reduce observer agreement or spuriously investment in the project might be enhanced by inflate it (Kazdin, 1977).
Similar to guidelines providing them with the opportunity to parti- for training sessions, frequent and relatively cipate as a member of the scientific or clinical short observation sessions produce more reli- team, to the extent that this does not bias able data than infrequent, longer observation coding, or by group interactions that build sessions. In addition to booster sessions to solidarity and cohesion among the coding team. reduce observer drift, periodically training new As described below, reliability should be coders may help to reduce the biasing effects of monitored on an ongoing basis. An unfortunate prior experience on the data, and observations feature of reliability monitoring is that there is a involving different subject groups or experi- built-in punishment schedule for inadequate mental conditions should be intermingled performance, such as having to recode sessions whenever possible to avoid systematic effects or completing additional training, but the related to observer drift. Studies repeatedly rewards for good performance are less tangible. demonstrate that coders are less reliable when Investigators should be sensitive to the need to they believe that their agreement is not being instigate a reward system in whatever way checked, although this effect may abate with possible, including providing raises for a good increased experience (e.g., Serbin, Citron, & Reliability 13

Conner, 1978; Weinrott & Jones, 1984). Therefore, frequent, overt, and random assessments of reliability should help to maintain coder precision.

3.01.7.2 Evaluating Reliability

The appropriate procedure and calculations for evaluating reliability depend on the purposes of the evaluation and the nature of the inferences that will be drawn from the data. Thus, there is no one way to evaluate reliability, but rather an array of approaches that can be followed. Helpful reviews of various procedures and computational formulas are presented by Bakeman and Gottman (1997), Hartmann (1977, 1982), Jacob, Tennenbaum, and Krahn (1987), Stine (1989), and Suen, Ary, and Covalt (1990). Two important decisions include (i) whether to compute exact agreement for specific codes (point-by-point agreement) or aggregate agreement for larger units, and (ii) whether and how to correct for chance agreement. Below is a summary of some of the more common procedures, along with guidelines for selecting among them.

For the purpose of training, monitoring, and providing corrective feedback to coders, in most cases it is useful to assess the most precise form of point-by-point agreement that is possible with the coding system. For example, with event recording, an observer's codes for each event are compared with a set of criterion codes or those of an independent coder. A contingency table that cross-lists the two sets of codes and tallies the frequencies of each pair of codes (i.e., a "confusion matrix") is a useful method of feedback for the coders. The total percent agreement for all codes (# agreements/total # events), and the percent agreement for specific codes (# agreements/(# agreements + # disagreements)) are summary scores that are easy to understand. During coder training, we compute these statistics beginning with the first practice assignments because the rapid improvement that usually occurs during the first few assignments can be highly rewarding and reinforcing for new coders. The contingency table also can display a pattern of disagreements that is instructive, such as when two coders consistently give a different code to the same type of behavior. Even when the actual observational procedure involves a larger or less precise unit of observation, such as in interval sampling, it may be helpful to employ an event-based tracking procedure during coder training in order to have more precise information about when disagreements occur. In our experience, the identification of consistent disagreements is helpful when follow-up instruction is used to clarify and resolve the confusion about the coding criteria. However, providing feedback about disagreements without instruction or resolution can make coders feel more confused and uncertain and can decrease reliability. Also, it usually is not helpful to coders to correct their scores for chance agreement, because corrected scores may be highly variable depending on the range of behaviors displayed in each session; thus they present a confusing picture of absolute agreement. Finally, if agreement statistics are used to identify unreliable coders in need of additional training, it is important to base this decision on multiple observation sessions, because occasional unreliable sessions can be expected due to the ambiguity in the target's behavior rather than because of coder error.

For the purpose of determining the precision of the measurement after coding has been completed, the method of computing reliability depends on the type of data that will be analyzed (Hartmann, 1977). That is, the investigator typically is interested in the reliability/precision of the scores that are actually computed from the data. Although it is common for investigators to report some form of point-by-point agreement for individual codes, most data analyses focus on aggregate indices such as the frequency, relative frequency, or rates of groups of behaviors. Thus, if several specific codes are combined or aggregated into a broader category of "positive behavior" for data-analytic purposes, then reliability estimates should be performed at the level of "positive behavior." We usually assume that observer agreement for specific codes ensures that aggregate scores are reliable; nonetheless, it is useful to determine the actual level of reliability for these aggregate codes. Furthermore, agreement at the specific code level is usually an overly stringent requirement that can unnecessarily prolong a study.

For the purpose of assessing the precision of the measure, percent agreement statistics are usually corrected for chance agreement between coders by using Cohen's kappa statistic. This statistic uses the information about agreements and disagreements from the confusion matrix, and evaluates the observed amount of agreement relative to the expected amount of agreement due to chance because of the base rates of the behaviors. The formula for kappa is: kappa = [p(Observed agreement) - p(Expected agreement)] / [1 - p(Expected agreement)], where p(Observed agreement) is the percent agreement for the two coders, and p(Expected agreement) is the percent agreement expected by chance. The probability of chance agreement is computed by calculating the relative frequencies for each code by each coder, obtaining the product of the two coders' relative frequency scores for each code, then summing these products. Usually, kappa is computed for each reliability session, and the mean and range across all reliability sessions are reported. One complication with the kappa statistic is that it can severely underrepresent observer agreement in sessions during which the subject displays a limited number of behaviors and/or some behaviors have exceptionally high base rates. This situation produces very large values for expected agreement, and thus can produce a very low score for kappa even when the proportion of observed agreement is very high. A potential solution in this situation is to use a sample-wise estimate of expected agreement derived from the base rates for the entire set of data (Bakeman & Gottman, 1997). Another complication is that point-by-point kappa may be overly stringent when, as frequently occurs, the actual scores used in the data analysis are aggregates (e.g., total frequency scores). Jacob et al. (1987) present a method for computing kappa for aggregate scores.

Often the measure of interest in the data analysis represents the data on an interval scale of measurement (e.g., relative frequency scores or average ratings). In this situation, the question of measurement precision concerns the relative positions of subjects on the interval scale rather than agreement between coders on individual coded behaviors. Coder agreement for interval data can be assessed with the correlation between the scores for pairs of coders; for example, the correlation between summary scores for a group of subjects, or the correlation between interval ratings within an observation session. Alternatively, the intraclass correlation is also appropriate for interval data. This approach is derived from the analysis of variance (Winer, 1971), and is the most commonly used application of generalizability theory for evaluating coder agreement. It assesses the proportion of variance in the scores that can be attributed to variation among subjects as opposed to variation among observers. An advantage of the approach is that when more than two coders are being evaluated, one score can reflect the level of agreement (i.e., proportion of variation attributed to differences among subjects) for the entire group of coders, rather than calculating individual correlations for all possible pairs of coders. Shrout and Fleiss (1979) outline the procedures for computing intraclass correlations under various assumptions about the data.

Alternative procedures for detecting agreement between coders may also be appropriate when the focus of study is the sequential patterns in the behavioral stream. Bakeman and Gottman (1997) vividly illustrate how point-by-point agreement between coders can be sharply deflated when one coder occasionally inserts extra codes, although the agreement about the sequential pattern of the codes is very high. Wampold and Holloway (1983) make a similar case that agreement about individual codes may be too stringent a criterion for sequential data. Gottman (1980) recommends an approach similar to intraclass correlation in which the investigator demonstrates that the measures of sequential dependency (e.g., lag z-scores, see below) are consistent across observers.

3.01.8 ANALYZING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION

Apart from the usual decisions that guide the design of data analysis, observational data present some unique challenges for investigators. These include questions about data reduction, dimensions and scales of measurement, and the identification of sequential patterns of behavior.

3.01.8.1 Data Reduction

Most observational studies produce a large amount of data for each subject, particularly when the coding scheme exhaustively labels all behaviors emitted by one or more subjects during an observational session. Even sessions lasting only 5–10 minutes can produce several hundred codes when behavior units are small. The challenge is to reduce these data to make them interpretable, without sacrificing the rich descriptive information in the coded data.

A first step is to group individual codes into broader categories. For the purposes of data analysis, the primary problem with many observational coding schemes is that they include behavioral events that rarely or never occur for many subjects, and thus produce highly skewed distributions of scores for the sample, resulting in missing cells in contingency tables for sequential analyses. Whenever possible, it is important to incorporate rare events into categories of events that occur with sufficient frequency to be used in subsequent analyses. In most cases, categories are specified on an a priori basis by investigators, using theory and rational analysis to determine how codes are grouped.

Another approach is to factor analyze individual codes to identify clusters of codes that share common variance. This approach is probably most appropriate when the investigator is interested in examining behavior styles, because the factor analysis identifies groups of behaviors that covary across subjects. That is, behaviors are grouped based on their co-occurrence within subjects, irrespective of their meaning or functional impact. For example, if different children in classrooms tend to misbehave in different ways, with some children spending much time out of their seats, others talking out of turn, and others being withdrawn and inattentive, a factor analysis would likely identify these three types of behaviors as three separate clusters. Kuller and Linsten (1992) used this method and identified social behaviors and individual concentration as separate clusters of behaviors by children in classrooms. An alternative approach is to group behaviors according to their functional impact. If all three types of behaviors disrupt the activities of other children in the classroom, they can be grouped into one functional category of disruptive behavior. For example, Schaap (1982) used lag sequential analysis of marital interactions to identify a set of behaviors that were most likely to elicit negative responses from a spouse. These specific behaviors could then be grouped together as negative eliciting behaviors.

3.01.8.1.1 Measurement of base rates of behavior

Base rates of behavior can be expressed in various scales of measurement. For example, the frequency of events can be expressed as raw frequency, relative frequency, rate per minute, or ratio scores (i.e., ratio of positive to negative behaviors). The selection of a scale of measurement is usually determined by factors such as the focus of the investigation, the precedent in the field of study, and psychometric properties of various scales of measurement (e.g., distributions of the scores, reliability of the index). Tryon (1991) argues that scores should also be expressed in "natural" scientific units in order to make measurements as comparable as possible across time, situations, or different investigations. A compelling point in this argument is that relativistic measurements such as standardized z-scores, which provide information about the position of the subject relative to other subjects in the same study, are highly "unnatural" units that vary as a function of who participates in the study. Instead, scores should be expressed in relation to some objective criterion that remains invariant across samples, so that a body of knowledge about the phenomenon of interest can more readily develop. For example, because raw frequency counts will depend on the duration of the measurement period, relative frequency scores may be preferable because they are comparable across measurement situations. However, the comparability of relative frequency scores requires exhaustive coding of all behavior with the same set of codes. Thus, rate per minute scores may be preferable because they are less dependent on other characteristics of the coding scheme or measurement situation.

3.01.8.1.2 Measurement of sequential patterns of behavior

When events or states are coded exhaustively, it is possible to evaluate patterns of behavior for an individual or between individuals interacting with each other. The behaviors can occur concurrently (e.g., the co-occurrence of eyebrow arch and smile) or serially (e.g., reciprocal negative behaviors between spouses). Sequential patterns are usually derived from Bayesian statistics that relate the occurrence or nonoccurrence of an antecedent event or state with the occurrence or nonoccurrence of a consequent event or state. In lag-sequential analysis (e.g., Bakeman & Gottman, 1997; Sackett, 1979a), the investigator is concerned with the transitional probability of the antecedent/consequent sequence, which reveals the probability that the consequent occurs, given the antecedent event or state. The important question regarding sequential dependency is the extent to which this transitional probability exceeds (or is smaller than) the base rate for the consequent. If the consequent is more (or less) likely to occur in the presence of, or following, the antecedent than in other circumstances, then there is dependency between the antecedent and the consequent. If, for example, the probability that a mother smiles at her infant (consequent behavior) is greater after the infant smiles at her (antecedent behavior) than after other types of infant behaviors, then there is a sequential dependency in infant smile–mother smile exchanges.

Sequential dependency can be estimated with several statistics. One common metric is the lag sequential z-score developed by Sackett (1979a), and a modification developed by Allison and Liker (1982) that corrects for sampling error. This statistic compares the observed frequency of the antecedent/consequent sequence with the expected frequency of the sequence (based on the base rates for both events). More recently, Bakeman and colleagues (Bakeman & Gottman, 1997; Bakeman & Quera, 1995) recommended a slightly different formula derived from two-way contingency tables, which is the adjusted residual obtained in log-linear analysis. A convenient feature of this statistic is that because the scores are distributed as z, the statistical significance of sequential patterns is readily discerned (i.e., scores greater than z = 1.96). The statistic can be computed with standard statistical packages. Importantly, Bakeman and Quera (1995) also suggest statistical programs to compute these scores when codes in a sequence cannot repeat, such as when the coding system requires that new codes are assigned only when the behavior changes. Because a behavior can never follow itself, this situation produces "structural zeros" in the diagonal of the contingency table.

The z-statistic or adjusted residual is most useful for examining a sequential pattern in a single set of observations on an individual subject or a group of subjects. However, it is not recommended for use when separate sequential scores for each of multiple subjects are entered as data points into subsequent inferential analyses. The problem is that the size of adjusted residuals will vary (become larger) as a function of the number of observations that are made, even when the actual conditional probabilities remain constant. Thus, the z-scores are influenced by the base rates of behaviors, and can differ dramatically across subjects when response productivity differs. Wampold (1989) recommends a transformed kappa statistic as an alternative; however, this statistic requires selecting among three different computational formulas depending on the relative size of expected and actual probabilities. Other commonly used statistics are Yule's Q and phi. Bakeman and Casey (1995) provide computational formulas, discuss the conditions under which various statistics may be applicable, and suggest that investigators examine the distributions of scores for each statistic to determine which statistic provides the best distribution of scores for a particular set of data.

Investigators are commonly concerned about the amount of data needed to calculate sequential statistics. This issue is most relevant when the investigator is interested in studying sequences involving relatively rare events. Unfortunately, clinically relevant phenomena are often rare phenomena, so the problem is a common one in the field of clinical psychology. Bakeman and Gottman (1997) present a detailed analysis of this issue using guidelines employed in log-linear analysis of contingency tables. As a general rule of thumb, the investigator should obtain enough data so that each antecedent and consequent event in a contingency table occurs at least five times. In many cases, this criterion can be met by collapsing codes into more general categories. Another strategy is to conduct analyses using pooled observations on a group of subjects, rather than calculating sequential statistics for individual subjects. For example, Gottman, Markman, and Notarius (1977) examined sequences of effective and ineffective problem solving in groups of distressed and nondistressed married couples. Although incidents of positive behaviors such as "validate" and negative behaviors such as "put-down" were rare in some couples, within the two groups they occurred with sufficient frequency to identify patterns of supportive interactions that were more typical of happily married couples, and patterns of hostile exchange that were more typical of spouses in distressed marriages.

3.01.8.2 Computer Technology for Recording and Coding Behavioral Observations

Since the late 1970s, computer technology has become increasingly available for use in recording and coding observational data. Whereas the use of devices and software requires extra initial costs in time and money, they may ultimately increase the efficiency of data collection, coding, and data management. If developers find a market for these devices, we can expect that their availability will become more widespread. Because specific products are likely to undergo rapid changes and development, a list of currently available products would be immediately out of date, and thus would be of little use. However, it is possible to illustrate the range of current options and some of the factors to consider in selecting equipment and software.

For most situations in which behavioral observation is used, the major function of automated systems is to record the observations and the codes reported by the human observer. Although Tryon (1991) catalogs many automated devices that actually measure "actions," most of these devices track physiological responses or simple, gross motor activity. To date, no automated system can code the types of complex events that are the focus of this chapter.

The first systems designed for event recording were "dedicated devices" because their sole function was to aid in data collection. Data could be stored in a temporary system and then uploaded to a computer for analysis. These devices have been rapidly replaced by software systems that run on standard computers. The newest automated systems combine data entry and management functions with computer control of video playback devices.

One consideration in selecting a system is the number of codes the system can handle. For example, one software package, the Observational Data Acquisition Program (ODAP; Hetrick, Isenhart, Taylor & Sandman, 1991), allows for recording frequency and duration of up to 10 behaviors. Duration is recorded by depressing a key during the occurrence of a behavior. Taylor et al. (1991) used this system to measure six types of self-injurious behaviors. Although this system is easy to use, the major limitation is that only a circumscribed range of behavior can be evaluated with simple, one dimensional codes. In contrast, other systems allow the investigator to configure a data entry file to handle as many as 1000 codes, and the codes can be multifaceted. For example, the Observer system (Noldus, 1991) has the capacity to hold 1000 different codes which can be classified as occurring only under certain circumstances. The system prompts the coder to correct errors when codes that violate the classification structure are entered (e.g., the same behavior entered twice in systems where codes cannot repeat, or when a teacher code is assigned to a student). The Multiple Option Observation System for Experimental Studies (MOOSES; Tapp, Wehby, & Ellis, 1995) offers similar flexibility. In a study by Shores et al. (1993), MOOSES was used to record classroom interactions of children with behavior disorders. Codes were devised to indicate the actor, the behavior exhibited, and the recipient of the behavior. In addition, conditional variables such as the presence of a teacher or the grouping of students who were present were also recorded.

A second consideration is whether the system can interface with video playback devices. Several systems are available that link the computer with professional quality videotape players that include a computer interface port. A machine-readable time code is stamped on the videotape, and the computer reads the time code to track onset and offset times for events, duration for states, or to access specific intervals. The advantage of these systems over simple real-time recording devices is that the videotape can be stopped, reversed, and started repeatedly without needing to reset a timer. Examples are the Observer system (Noldus, 1991), Observation Coding System Toolset (OCS Tools; Triangle Research Collaborative, Inc, 1994), and Procoder (Tapp & Walden, 1993). These systems differ greatly in terms of the number and complexity of codes they can handle and the ability and ease of controlling the video player from the computer keyboard. More recently, packages are becoming available that use CD-ROM instead of videotape input (e.g., vPrism; Digital LAVA Inc., 1996). These systems provide easier access to specific sections of an observation session because of the digital access features of CD-ROM, and easier replay as compared to rewinding a video player. They may also allow for greater precision in labeling exact time segments. However, they require converting videotaped observations to compact disks, which is expensive when completed by companies that provide this service, and time-consuming when equipment is available for converting data in-house. Nevertheless, the reduced storage requirements for CDs as opposed to videotapes is an advantage.

A third consideration is the statistical analysis that a particular program may offer. Whereas some systems include software for calculating coder agreement, linking multiple channels (e.g., behavioral codes with heart rate data), and conducting sequential analyses, others output the data so that it can be imported into other programs for these purposes. Graphic presentation of data is also an option with some packages. In some cases, statistical programs are included as part of the basic system; in other instances, the statistical software is optional.

3.01.9 INTERPRETING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION

As can be seen, there are a number of steps before investigators obtain findings based on behavioral observation. The final step for the investigator is interpreting these findings. Whereas this interpretive process is psychological in nature and dependent upon the specifics of the particular investigation, our own research on couples' relationships, marital therapy, and family development illustrates how a pattern of findings that include both behavioral observation and self-report measures can help to elucidate family/marital functioning.

Baucom and his colleagues have conducted several treatment outcome studies with maritally distressed couples from a cognitive-behavioral perspective. Cognitive-behavioral conceptualizations of marital distress have placed a major emphasis on the centrality of communication as essential for effective marital functioning. However, the pattern of findings across our investigations indicates that communications training might be neither necessary nor sufficient for affecting changes in marital functioning. In these investigations, couples' communication was assessed by the MICS III (Weiss, 1986) after the couples attempted to solve problems on areas of difficulty in their marriage in a laboratory setting. In the first study, all couples received some set of behavioral interventions to assist their marriages (Baucom, 1982). The difference among the treatment conditions involved whether the couples received communications training.
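The percent agreement and kappa computations described in Section 3.01.7.2 can be sketched in a few lines of code. The Python fragment below is an illustrative sketch only, not a procedure taken from the sources cited in this chapter. The function name `agreement_summary`, the code labels, and the 40-event session are hypothetical, and the sketch assumes event recording in which both coders have coded the same sequence of events.

```python
from collections import Counter


def agreement_summary(codes_a, codes_b):
    """Return the confusion matrix, percent agreement, chance agreement, and kappa.

    codes_a and codes_b are equal-length lists holding each coder's code
    for the same sequence of events (point-by-point comparison).
    """
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty code lists")
    n = len(codes_a)

    # Confusion matrix: tally of every (coder A code, coder B code) pair.
    confusion = Counter(zip(codes_a, codes_b))

    # Observed agreement: share of events on the diagonal of the matrix.
    observed = sum(count for (a, b), count in confusion.items() if a == b) / n

    # Chance agreement: for each code, multiply the two coders' relative
    # frequencies, then sum the products across codes.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum((freq_a[code] / n) * (freq_b[code] / n) for code in freq_a)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (observed - expected) / (1.0 - expected)
    return {"confusion": confusion, "observed": observed,
            "expected": expected, "kappa": kappa}


# Hypothetical session in which one code ("talk") has a very high base rate.
coder_a = ["talk"] * 38 + ["smile"] * 2
coder_b = ["talk"] * 37 + ["smile", "talk", "smile"]
summary = agreement_summary(coder_a, coder_b)
print(round(summary["observed"], 3),
      round(summary["expected"], 3),
      round(summary["kappa"], 3))
```

In this hypothetical session the coders agree on 95% of the events, yet because one code accounts for nearly all events the chance agreement is 0.905 and kappa falls to roughly 0.47.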
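As a rough illustration of the intraclass correlation discussed in Section 3.01.7.2, the sketch below computes one common variant, the one-way random-effects form for single scores, from a subjects-by-coders table. Choosing this variant is an assumption made for illustration; Shrout and Fleiss (1979) describe several forms that rest on different assumptions about the data. The function name `intraclass_correlation` and the scores are hypothetical.

```python
def intraclass_correlation(ratings):
    """One-way random-effects ICC for a subjects-by-coders table of scores.

    ratings is a list of rows, one row per subject, each row holding one
    score from every coder (all rows the same length, at least two coders).
    """
    n_subjects = len(ratings)
    n_coders = len(ratings[0]) if ratings else 0
    if n_subjects < 2 or n_coders < 2 or any(len(r) != n_coders for r in ratings):
        raise ValueError("need >= 2 subjects and >= 2 coders, with no missing scores")

    grand_mean = sum(sum(row) for row in ratings) / (n_subjects * n_coders)
    row_means = [sum(row) / n_coders for row in ratings]

    # Between-subjects mean square: variation among the subjects' mean scores.
    ms_between = (n_coders * sum((m - grand_mean) ** 2 for m in row_means)
                  / (n_subjects - 1))

    # Within-subjects mean square: disagreement among coders about a subject.
    ss_within = sum((score - m) ** 2
                    for row, m in zip(ratings, row_means) for score in row)
    ms_within = ss_within / (n_subjects * (n_coders - 1))

    denominator = ms_between + (n_coders - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical summary scores: three subjects, two coders; the second
# coder scores every subject one point higher than the first.
scores = [[1, 2], [3, 4], [5, 6]]
print(round(intraclass_correlation(scores), 3))
```

Because this form treats a constant difference between coders as disagreement, the coefficient for these hypothetical scores is below 1 (about 0.88) even though both coders rank the subjects identically.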
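The lag-sequential statistics described in Section 3.01.8.1.2 can be illustrated with a 2 × 2 table of lag-1 transitions. The sketch below is a minimal illustration under stated assumptions: codes are allowed to repeat (so there are no structural zeros), the adjusted residual is computed in its standard contingency-table form, and the function names, code labels, and counts are hypothetical rather than taken from any of the cited studies.

```python
import math


def lag1_table(sequence, antecedent, consequent):
    """Tally a 2x2 table of lag-1 transitions from one coded sequence."""
    a = b = c = d = 0
    for given, target in zip(sequence, sequence[1:]):
        if given == antecedent and target == consequent:
            a += 1  # antecedent followed by consequent
        elif given == antecedent:
            b += 1  # antecedent followed by some other code
        elif target == consequent:
            c += 1  # other code followed by consequent
        else:
            d += 1  # other code followed by some other code
    return a, b, c, d


def sequential_stats(a, b, c, d):
    """Transitional probability, consequent base rate, adjusted residual, Yule's Q."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")
    p_antecedent = (a + b) / n
    p_consequent = (a + c) / n
    if not (0 < p_antecedent < 1 and 0 < p_consequent < 1):
        raise ValueError("antecedent and consequent must each occur, but not always")

    transitional = a / (a + b)                  # p(consequent | antecedent)
    expected = n * p_antecedent * p_consequent  # expected count for cell a
    z = (a - expected) / math.sqrt(
        expected * (1 - p_antecedent) * (1 - p_consequent))
    yules_q = (a * d - b * c) / (a * d + b * c)
    return {"transitional": transitional, "base_rate": p_consequent,
            "z": z, "yules_q": yules_q}


# Hypothetical pooled counts, and the same pattern with four times the data.
small = sequential_stats(20, 10, 10, 60)
large = sequential_stats(80, 40, 40, 240)
print(round(small["transitional"], 2), round(small["z"], 2), round(small["yules_q"], 2))
print(round(large["transitional"], 2), round(large["z"], 2), round(large["yules_q"], 2))
```

Multiplying every cell by four leaves the transitional probability (0.67) and Yule's Q (0.85) unchanged but doubles the adjusted residual, which illustrates why z-scores are not comparable across subjects who differ in the number of observations.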

The findings indicated that couples in all of the active treatment conditions improved equally on communication and marital adjustment, suggesting that communications training was not the critical element in the treatment. In a subsequent investigation, Baucom, Sayers, and Sher (1990) provided maritally distressed couples with a variety of cognitive-behavioral interventions, but all couples received communications training. Again, the findings indicated that couples in all active treatment conditions improved equally on communication and marital adjustment. However, additional analyses indicated that changes in communication were not correlated with changes in marital adjustment; thus, communication training could not be interpreted as the critical ingredient that altered marital adjustment (Iverson & Baucom, 1990). This is not to say that communication is unimportant in marriage; a number of investigations indicates that it is (e.g., Gottman & Krokoff, 1989). However, the results from these investigations and others have led cognitive-behavioral investigators to develop more complex and multifaceted models of marital distress that include, but are not limited to, a communications skills deficit explanation (e.g., Karney & Bradbury, 1995). Thus, in this instance, the findings from investigations involving behavioral observation of communication with couples have led to theoretical changes in the conceptualization of marital adjustment.

Research by Floyd and colleagues investigates associations among subsystems within the families of children who have disabilities, and examines how family relationships predict adaptive functioning for the child and the other family members. All observations and data collection are conducted in the families' homes in order to maximize the likelihood that observed behaviors are relevant to daily functioning in the family. One set of reports (Floyd, Gilliom, & Costigan, in press; Floyd & Zmich, 1991) focuses on the hypothesis that the quality of the parents' marital functioning and their parenting alliance influence the quality of parenting experiences. In order to test the hypothesis, observational measures were combined with self-reports of both marital functioning and parenting experiences.

Similar to procedures commonly employed in studies of marital relationships, and as illustrated in the studies by Baucom and colleagues, the parents' marital problem-solving skills were assessed by having them discuss and attempt to resolve a significant area of disagreement in their relationship. This procedure linked the investigation to the large body of previous observational research on marriage and marital therapy. Also, because the focus of the study was to understand how the parents functioned as a marital dyad, the marital discussions were conducted in a room separate from the children. Other studies interested in how children witness marital conflict have used procedures where children are present either as observers or participants in the discussions (e.g., Cummings, 1987). In order to reduce reactive effects associated with the presence of a live observer, the couples were left alone together for 10 minutes to complete the discussion, and the discussions were videotaped for later coding.

Parenting experiences were evaluated in a separate family interaction session. The procedures for this session were adopted from the work of Patterson, Reid, and colleagues (Patterson, 1982; Patterson, Reid, & Dishion, 1992) with aggressive children and their families. These procedures had been used to identify how behavior management and control are disrupted, and lead to the escalation of negative exchanges in the families of aggressive children. Because children with mental retardation also present behavior management challenges for parents, the possibility of negative escalation was potentially relevant to these families. In order to make the interaction as naturalistic as possible, families were videotaped in the home while completing a task of their own choosing. All family members were present. However, it was also necessary to structure the session somewhat in order to ensure that the family members interacted together and that the videotapes were sufficiently clear so that the behaviors could be reliably coded. Thus, families were instructed to complete an activity together (e.g., baking cookies, working on a crafts project), to refrain from watching television or making or taking telephone calls, and to remain together in one or two rooms within range of the video camera. The families were observed for a total of 50 minutes, which, for coding purposes, was divided into 10 minute segments. During each segment, one family member was identified as the focus of coding, and only behaviors that occurred by the focus person and anyone interacting with that person were coded by the observer. This procedure allowed the camera operator to focus on subsets of family members rather than trying to keep all family members in view at all times.

The findings support the value of these observational methods. Most notably, whereas self-report measures of marital quality, the parenting alliance, and parenting experiences failed to distinguish families of children with mental retardation from a comparison group with typically developing children, both the marital interactions and the family interactions demonstrated greater marital negativity and more parent–child behavior management struggles for the MR group (Floyd & Zmich, 1991). Furthermore, negative marital interactions were predictive of negative parent–child exchanges. A subsequent longitudinal evaluation demonstrated that marital quality, including the quality of marital problem-solving interactions, predicts changes in negative parent–child exchanges over a two-year period, with couples who are most negative together showing increases in their negative exchanges with children over time (Floyd, Gilliom, & Costigan, in press).

3.01.10 CONCLUDING COMMENTS

As can be seen in the above discussion, behavioral observation provides an extremely rich source of information for investigators as they attempt to understand the complexities of human behavior and interaction. This richness presents many opportunities for investigators and many challenges as well. These challenges are incorporated in the myriad of decisions that the investigator must make in employing behavioral observation, and obviously the route that the investigator chooses greatly impacts the findings. Thus, the investigator incurs responsibility for understanding the impact of these decisions on the questions under investigation. Often in reporting the results of investigations employing behavioral observation, the method section of the report involving the application of coding systems and data analytic strategies is presented in a few short paragraphs. Consequently, most readers will have only a general level of understanding of how the coding system was employed and how that impacts the findings. Therefore, the investigator must thoroughly understand the coding process so that discussions of the findings accurately represent what is indicated by the data. When this is done in a thoughtful manner, we have the opportunity to use one of our most natural strategies, observing people's behavior, as a major way to advance the science of clinical psychology.

3.01.11 REFERENCES

Ainsworth, M. D., Blehar, M. C., Waters, E., & Wall, S. (1978). Patterns of attachment: A psychological study of the strange situation. Hillsdale, NJ: Erlbaum.
Allison, P. D., & Liker, J. K. (1982). Analyzing sequential categorical data on dyadic interaction: A comment on Gottman. Psychological Bulletin, 91, 393–403.
Altmann, J. (1974). Observational study of behavior: Sampling methods. Behaviour, 49, 227–265.
American Psychological Association (1985). Standards for educational and . Washington, DC: Author.
Ammerman, R. T., Van Hasselt, V. B., & Hersen, M. (1991). Parent–child problem-solving in families of visually impaired youth. Journal of Pediatric Psychology, 16, 87–101.
Ary, D. (1984). Mathematical explanation of error in duration recording using partial interval, whole interval, and momentary time sampling. Behavioral Assessment, 6, 221–228.
Bakeman, R., & Casey, R. L. (1995). Analyzing family interaction: Taking time into account. Journal of Family Psychology, 9, 131–143.
Bakeman, R., & Gottman, J. M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). New York: Cambridge University Press.
Bakeman, R., & Quera, V. (1995). Log-linear approaches to lag-sequential analysis when consecutive codes may and cannot repeat. Psychological Bulletin, 118, 272–284.
Barton, C. & Alexander, J. F. (1981). Functional family therapy. In A. S. Gurman & D. P. Kniskern (Eds.), Handbook of family therapy (pp. 403–443). New York: Brunner/Mazel.
Baucom, D. H. (1982). A comparison of behavioral contracting and problem-solving/communications training in behavioral marital therapy. Behavior Therapy, 13, 162–174.
Baucom, D. H., Sayers, S. L., & Sher, T. G. (1990). Supplementing behavioral marital therapy with cognitive restructuring and emotional expressiveness training: An outcome investigation. Journal of Consulting and Clinical Psychology, 58, 636–645.
Cairns, R. B., & Green, J. A. (1979). How to assess and social patterns: Observations or ratings? In R. B. Cairns (Ed.), The analysis of social interactions: Methods, issues, and illustrations (pp. 209–226). Hillsdale, NJ: Erlbaum.
Carpenter, L. J., & Merkel, W. T. (1988). The effects of three methods of observation on couples in interactional research. American Journal of Family Therapy, 16, 144–157.
Christensen, A., & Hazzard, A. (1983). Reactive effects during naturalistic observation of families. Behavioral Assessment, 5, 349–362.
Christensen, A., & Heavey, C. L. (1990). Gender and social structure in the demand/withdraw pattern of marital conflict. Journal of Personality and Social Psychology, 59, 73–81.
Cummings, E. M. (1987). Coping with background anger in early childhood. Child Development, 58, 976–984.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571–582.
Digital LAVA Inc. (1996). 10850 Wilshire Blvd., Suite 1260, LA, CA 90024.
Fagot, B., & Hagan, R. (1988). Is what we see what we get? Comparisons of taped and live observations. Behavioral Assessment, 10, 367–374.
Floyd, F. J. (1989). Segmenting interactions: Coding units for assessing marital and family behaviors. Behavioral Assessment, 11, 13–29.
Floyd, F. J., & Costigan, C. L. (1997). Family interactions and family adaptation. In N. W. Bray (Ed.), International review of research in mental retardation (Vol. 20, pp. 47–74). New York: Academic Press.
Floyd, F. J., Gilliom, L. A., & Costigan, C. L. (in press). Marriage and the parenting alliance: Longitudinal prediction of change in parenting perceptions and behaviors. Child Development.
Floyd, F. J., & Zmich, D. E. (1991). Marriage and the parenting partnership: Perceptions and interactions of parents with mentally retarded and typically developing children. Child Development, 62, 1434–1448.

Foster, S. L., Inderbitzen, H. M., & Nangle, D. W. (1993). Assessing acceptance and social skills with peers in childhood. Behavior Modification, 17, 255–286.
Gardner, W., & Griffin, W. A. (1989). Methods for the analysis of parallel streams of continuously recorded social behaviors. Psychological Bulletin, 105, 446–455.
Gottman, J. M. (1979). Marital interaction: Experimental investigations. New York: Academic Press.
Gottman, J. M. (1980). Analyzing for sequential connection and assessing interobserver reliability for the sequential analysis of observational data. Behavioral Assessment, 2, 361–368.
Gottman, J. M. (1994). What predicts divorce? Hillsdale, NJ: Erlbaum.
Gottman, J. M., & Krokoff, L. J. (1989). Marital interaction and satisfaction: A longitudinal view. Journal of Consulting & Clinical Psychology, 57(1), 47–52.
Gottman, J. M., Markman, H., & Notarius, C. (1977). The topography of marital conflict: A sequential analysis of verbal and nonverbal behavior. Journal of Marriage and the Family, 39, 461–477.
Haley, J. (Ed.) (1971). Changing families. New York: Grune & Stratton.
Harris, F. C., & Lahey, B. B. (1986). Condition-related reactivity: The interaction of observation and intervention in increasing peer praising in preschool children. Education and Treatment of Children, 9, 221–231.
Hartmann, D. P. (1977). Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis, 10, 103–116.
Hartmann, D. P. (Ed.) (1982). Using observers to study behavior. San Francisco: Jossey-Bass.
Hartmann, D. P., & Wood, D. D. (1990). Observational methods. In A. S. Bellack, M. Hersen, & A. E. Kazdin (Eds.), International handbook of behavior modification and therapy (2nd ed., pp. 109–138). New York: Plenum.
Hawkins, R. P. (1979). The functions of assessment: Implications for selection and development of devices for assessing repertoires in clinical, educational, and other settings. Journal of Behavioral Assessment, 12, 501–516.
Haynes, S. N. (1978). Principles of behavioral assessment. New York: Gardner Press.
Haynes, S. N. (1998). The changing nature of behavioral assessment. In M. Hersen & A. S. Bellack (Eds.), Behavioral assessment: A practical handbook (4th ed., pp. 1–21). Boston: Allyn & Bacon.
Haynes, S. N., Follingstad, D. R., & Sullivan, J. C. (1979). Assessment of marital satisfaction and interaction. Journal of Consulting and Clinical Psychology, 47, 789–791.
Haynes, S. N., & Horn, W. F. (1982). Reactivity in behavioral observation: A review. Behavioral Assessment, 4, 369–385.
Herbert, J., & Attridge, C. (1975). A guide for developers and users of observational systems and manuals. American Educational Research Journal, 12, 1–20.
Hersen, M., & Bellack, A. S. (1998). Behavioral assessment: A practical handbook (4th ed.). Boston: Allyn & Bacon.
Hetrick, W. P., Isenhart, R. C., Taylor, D. V., & Sandman, C. A. (1991). ODAP: A stand-alone program for observational data acquisition. Behavior Research Methods, Instruments, and Computers, 23, 66–71.
Humphrey, L. L., & Benjamin, L. S. (1989). An observational coding system for use with structural analysis of social behavior: The training manual. Unpublished manuscript, Northwestern University Medical School, Chicago.
Iverson, A., & Baucom, D. H. (1990). Behavioral marital therapy outcomes: Alternate interpretations of the data. Behavior Therapy, 21(1), 129–138.
Jacob, T. (1975). Family interaction in disturbed and normal families: A methodological and substantive review. Psychological Bulletin, 82, 33–65.
Jacob, T., Tennenbaum, D., Bargiel, K., & Seilhamer, R. A. (1995). Family interaction in the home: Development of a new coding system. Behavior Modification, 19, 147–169.
Jacob, T., Tennenbaum, D. L., & Krahn, G. (1987). Factors influencing the reliability and validity of observation data. In T. Jacob (Ed.), Family interaction and psychopathology: Theories, methods, and findings (pp. 297–328). New York: Plenum.
Jacob, T., Tennenbaum, D., Seilhamer, R. A., Bargiel, K., & Sharon, T. (1994). Reactivity effects during naturalistic observation of distressed and nondistressed families. Journal of Family Psychology, 8, 354–363.
Jacobson, N. S. (1985). The role of observational measures in behavior therapy outcome research. Behavioral Assessment, 7, 297–308.
Jarrett, R. B., & Nelson, R. O. (1984). Reactivity and unreliability of husbands as participant observers. Journal of Behavioral Assessment, 6, 131–145.
Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Erlbaum.
Karney, B. R., & Bradbury, T. N. (1995). The longitudinal course of marital quality and stability: A review of theory, methods, and research. Psychological Bulletin, 118(1), 3–34.
Kazdin, A. E. (1977). Artifact, bias and complexity of assessment: The ABC's of reliability. Journal of Applied Behavior Analysis, 10, 141–150.
Kazdin, A. E. (1982). Observer effects: Reactivity of direct observation. New Directions for Methodology of Social and Behavioral Science, 14, 5–19.
Klesges, R. C., Woolfrey, J., & Vollmer, J. (1985). An evaluation of the reliability of time sampling versus continuous observation data collection. Journal of Behavior Therapy and Experimental Psychiatry, 16, 303–307.
Kuller, R., & Linsten, C. (1992). Health and behavior of children in classrooms with and without windows. Journal of Environmental Psychology, 12, 305–317.
Leaper, C., Hauser, S., Kremen, A., Powers, S. I., Jacobson, A. M., Noam, G. G., Weiss-Perry, B., & Follansbee, D. (1989). Adolescent–parent interactions in relation to adolescents' gender and ego development pathway: A longitudinal study. Journal of Early Adolescence, 9, 335–361.
Mann, J., ten-Have, T., Plunkett, J. W., & Meisels, S. J. (1991). Time sampling: A methodological critique. Child Development, 62, 227–241.
Mehm, J. G., & Knutson, J. F. (1987). A comparison of event and interval strategies for observational data analysis and assessments of observer agreement. Behavioral Assessment, 9, 151–167.
Minuchin, S. (1974). Families and family therapy. Cambridge, MA: Harvard University Press.
Noldus, L. P. J. J. (1991). The Observer: A software system for collection and analysis of observational data. Behavior Research Methods, Instruments, and Computers, 23, 415–429.
Noller, P. (1980). Misunderstandings in marital communication: A study of couples' nonverbal communication. Journal of Personality and Social Psychology, 39, 1135–1148.
O'Neill, R. E., Horner, R. H., Albin, R. W., Storey, K., & Sprague, J. R. (1990). Functional analysis of problem behavior: A practical assessment guide. Sycamore, IL: Sycamore Publishing.
Patterson, G. R. (1982). A social learning approach, Vol 3: Coercive family process. Eugene, OR: Castalia Publishing Company.
Patterson, G. R., Reid, J. B., & Dishion, T. J. (1992). Antisocial Boys. Eugene, OR: Castalia.
Pett, M. A., Wampold, B. E., Vaughn-Cole, B., & East, T. D. (1992). Consistency of behaviors within a naturalistic setting: An examination of the impact of context and repeated observations on mother–child interactions. Behavioral Assessment, 14, 367–385.
Prinz, R. J., & Kent, R. N. (1978). Recording parent–adolescent interactions without the use of frequency or interval-by-interval coding. Behavior Therapy, 9, 602–604.
Robin, A. L., & Foster, S. L. (1989). Negotiating parent–adolescent conflict: A behavioral-family systems approach. New York: Guilford.
Sackett, G. P. (1979a). The lag sequential analysis of contingency and cyclicity in behavioral interaction research. In J. D. Osofsky (Ed.), Handbook of infant development (pp. 623–649). New York: Wiley.
Sackett, G. P. (Ed.) (1979b). Observing behavior. Vol. 2: Data collection and analysis methods. Baltimore: University Park Press.
Schaap, C. (1982). Communication and adjustment in marriage. Lisse, Holland: Swets & Zeitlinger.
Serbin, L. A., Citron, C., & Connor, J. M. (1978). Covert assessment of observer agreement: An application and extension. Journal of Genetic Psychology, 133, 155–161.
Shores, R. E., Jack, S. L., Gunter, P. L., Ellis, D. N., Debreire, T. J., & Wehby, J. H. (1993). Classroom interactions of children with behavior disorders. Journal of Emotional and Behavioral Disorders, 1, 27–39.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. New York: Basic Books.
Skinner, B. F. (1938). The behavior of organisms. New York: Appleton-Century-Crofts.
Snyder, J. (1983). Aversive social stimuli in the Family Interaction Coding System: A validation study. Behavioral Assessment, 5, 315–331.
Stine, W. W. (1939). Interobserver relational agreement. Psychological Bulletin, 106, 341–347.
Suen, H. K., Ary, D., & Covalt, W. (1990). A decision tree approach to selecting an appropriate observation reliability index. Journal of Psychopathology and Behavioral Assessment, 12, 359–363.
Taplin, P. S., & Reid, J. B. (1973). Effects of instructional set and experimenter influence on observer reliability. Child Development, 44, 547–554.
Tapp, J., & Walden, T. (1993). PROCORDER: A professional tape control, coding, and analysis system for behavioral research using videotape. Behavior Research Methods, Instruments, and Computers, 25, 53–56.
Tapp, J., Wehby, J., & Ellis, D. (1995). A multiple option observation system for experimental studies: MOOSES. Behavior Research Methods, Instruments, and Computers, 27, 25–31.
Taylor, D. V., Hetrick, W. P., Neri, C. L., Touchette, P., Barron, J. L., & Sandman, C. A. (1991). Effect of naltrexone upon self-injurious behavior, learning, and activity: A case study. Pharmacology, Biochemistry, and Behavior, 40, 79–82.
Triangle Research Collaborative, Inc. (1994). P. O. Box 12167, 100 Park, Suite 115, Research Triangle Park, NC 27709.
Tryon, W. W. (1991). Activity measurement in psychology and medicine. New York: Plenum.
Van Widenfelt, B., Baucom, D. H., & Gordon, K. C. (1997). The Prevention and Relationship Enhancement Program: An empirical analysis. Manuscript submitted for publication.
Wampold, B. E. (1989). Kappa as a measure of pattern in sequential data. Quality and Quantity, 23, 171–187.
Wampold, B. E., & Holloway, E. L. (1983). A note on interobserver reliability for sequential data. Journal of Behavioral Assessment, 5, 217–225.
Watson, J. B., & Raynor, R. (1920). Conditioned emotional reactions. Journal of Experimental Psychology, 3, 1–12.
Weinrott, M. R., & Jones, R. R. (1984). Overt versus covert assessment of observer reliability. Child Development, 5, 1125–1137.
Weiss, R. L. (1986). MICS-III manual. Unpublished manuscript, Oregon Marital Studies Program, University of Oregon, Eugene.
Wieder, G. B., & Weiss, R. L. (1980). Generalizability theory and the coding of marital interactions. Journal of Consulting and Clinical Psychology, 48, 469–477.
Wilson, F. R. (1982). Systematic rater training model: An aid to counselors in collecting observational data. Measurement and Evaluation in Guidance, 14, 187–194.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.02 Single Case Experimental Designs: Clinical Research and Practice

STEVEN C. HAYES and JOHN T. BLACKLEDGE University of Nevada, Reno, NV, USA

3.02.1 INTRODUCTION
  3.02.1.1 Brief History of Single Case Experimental Designs
  3.02.1.2 Utility of Single-subject Designs
3.02.2 ESSENTIAL COMPONENTS OF SINGLE CASE EXPERIMENTAL DESIGNS
  3.02.2.1 Repeated Measurement
  3.02.2.2 Detailed Information
  3.02.2.3 Graphing of Data
  3.02.2.4 Creative Use of Design Elements
3.02.3 INTERPRETATION OF SINGLE-SUBJECT DATA
  3.02.3.1 Variability and Stability
  3.02.3.2 Level
  3.02.3.3 Trend
  3.02.3.4 Use of Statistics with Single-subject Data
3.02.4 VARIETIES OF SINGLE-SUBJECT DESIGNS
  3.02.4.1 Within-series Design Elements
    3.02.4.1.1 Simple phase change
    3.02.4.1.2 Complex phase change elements
    3.02.4.1.3 Parametric design
    3.02.4.1.4 Final suggestions for interpreting within-series data
  3.02.4.2 Between-series Designs
    3.02.4.2.1 Alternating treatment design
    3.02.4.2.2 Simultaneous treatment design
  3.02.4.3 Combined-series Elements
    3.02.4.3.1 Multiple baseline
    3.02.4.3.2 Crossover design
    3.02.4.3.3 Constant series control
3.02.5 FUTURE DIRECTIONS
  3.02.5.1 Potential for Research Production and Consumption by Practicing Clinicians
  3.02.5.2 Managed Health Care, Single-subject Design, and the Demonstration of Efficacy
3.02.6 CONCLUSION
3.02.7 REFERENCES


3.02.1 INTRODUCTION

The purpose of clinical research from the point of view of the consumer of research knowledge can be stated succinctly: "What treatment, by whom, is most effective for this individual with that specific problem under which set of circumstances, and how does it come about?" (Paul, 1969, p. 44). This question has always been of relevance to practicing clinical psychologists in the fee-for-service environment, but it is also of increasing relevance in the era of managed care. Mental health delivery systems cannot succeed, either in the world of public opinion or in the world of fiscal reality, without finding a way to deliver services that are both effective and efficient (Cummings, Cummings, & Johnson, 1997; Cummings, Pollack, & Cummings, 1996). In order to do that, Paul's clinical question above must be answered for the varieties of clients demanding and receiving services.

There is another way to say this. Clinical research must have external validity (must apply to the settings and clients of the research consumer), not merely internal validity (an unambiguous relationship between a dependent and independent variable). In group comparison research, external validity flows in principle from internal validity. In a classic group comparison design, researchers hope to randomly sample from a known population, randomly assign to a treatment and control group, and collect before and after measures on all. If these methodological requirements have been met, the results should apply to other random samples from that same population. In the practical world of clinical research, however, we do not have access to the entire population of interest (say, all panic disordered clients). We cannot randomly sample from this population because clients nonrandomly refuse to participate. We cannot randomly assign because clients nonrandomly drop out. And even if all this were not true, we never apply clinical research to other random samples from the same population; rather, a clinician treats the person who (nonrandomly) walked through the door.

External validity thus must be earned, whether in group comparison or single-case research, in a piecemeal inductive fashion by demonstrating that particular treatments work with particular clients with particular problems. In group comparison research, this is usually done by showing that treatments work in highly homogeneous groups of clients. By replicating effects across many such homogeneous groups, external validity can be demonstrated in group research. Efforts to show broad external validity in a single group comparison experiment are usually extremely difficult. Blocking or stratifying samples on even a few factors in a single study can lead to huge designs that cannot be mounted without millions of dollars of research funds. One compromise is to use diagnostic categories as a way of establishing homogeneity; however, large and unexplained between-subject variation invariably results because the current diagnostic system is based on loose collections of signs and symptoms rather than functional processes.

An alternative approach is to build this knowledge base about treatment response from the ground up, person by person. In this approach, clinical replication across myriad clients provides the evidence that a treatment effect holds for a population and that it is moderated by specific subject, therapist, or setting variables. That is the approach of single case experimental designs (SCEDs) or what has also been termed "time-series designs." The former term is more popular but falsely suggests that the number of subjects is necessarily few in this approach. The latter correctly points to the source of the analysis but it is not very popular. Both terms will be used.

The purpose of this chapter is to show how single case designs are used in a research and practice environment, to provide an introduction to the various types of designs, and to explore the advantages of this research approach in the modern world of health care delivery.

3.02.1.1 Brief History of Single Case Experimental Designs

The origin of SCEDs in clinical psychology can be traced back to the beginnings of the scientist–practitioner model. Two years before the first Boulder Conference, Thorne (1947) advocated the use of such designs as a practical way for clinicians to incorporate empirical science into their everyday interactions with clients, a goal concurrently explicated by Shakow et al. (1947). The type of experimental designs proposed by Thorne was a very significant improvement over the traditional case study because continual data collection over a series of phase changes was required, allowing more objective, data-driven decisions to be made. These designs were an adaptation of those previously used by experimental psychologists (Ferster & Skinner, 1957) working with animals.

Single case designs became considerably more popular with the popularization of behavioral approaches in the 1960s. For example, Baer, Wolf, and Risley (1968) described a number of these designs in an early issue of the Journal of Applied Behavior Analysis. Hersen and Barlow's groundbreaking text (1976) in turn brought these designs into the mainstream of behavior therapy.

Probably the biggest factor inhibiting the use of SCEDs by many clinicians has been a bias against idiographic research. Yet many methodological leaders of the field have been quite accepting of such an approach. For example, Cronbach (1975) advocated careful observation of individual clinical cases using SCED controls, maintaining that the use of such design tools allows a level of detail in observation and hypothesis testing not available in traditional group designs. Many others (Campbell, 1957; Koan & McGuire, 1973; Snow, 1974) have agreed, adding that tightly controlled, fixed condition, group experimentation is often not well-suited to a science of vastly differing humans with often vastly differing needs. Cone (1986) noted that idiographic, single-subject research is particularly well-suited for detecting point-to-point behavioral changes occurring in response to environmental variables, including psychological treatment.

3.02.1.2 Utility of Single-subject Designs

A clinician carefully conducting an appropriate single-subject design for a client could both circumvent the above difficulties and obtain scientifically useful results. Often no subject pool larger than one is needed to conduct a single-subject design (although it is necessary that the results of many such analyses be considered before any general conclusions can be made about a treatment's efficacy). Single-subject experiments are extremely flexible. For example, if a client is not responding as hoped to a treatment, components can be added or subtracted or an altogether new treatment can be implemented, without necessarily damaging the validity of conclusions drawn from the experiment. Essentially, the only limitation on the usefulness of single-subject designs in research and clinical practice is the flexibility of the researcher. If use of a planned design does not allow an adequate interpretation of emerging data, then the design can be altered at that point.

Perhaps more importantly, a properly used single-subject design can be extremely useful in facilitating responsible assessment and treatment. A clinician conducting a single-subject experiment is forced to pay close attention to repeated assessments of client behavior that provide invaluable information about target behaviors and treatment efficacy. With the continuous feedback that single-subject experiments provide, treatment components not working as planned can be altered, abandoned, or supplemented.

The comprehensive data recorded over several single-subject designs can also be used to provide linkage between client characteristics and treatment success or failure. As more detailed information is gathered in time-series designs than in a typical group experiment, events in individual client lives and various client characteristics that coincide with declines in treatment efficacy can be identified and taken into consideration for subsequent clients. The advantage at this level is that variability due to sources other than treatment can be identified at the level of the individual. This means that when many cases are collected and analyzed, correlations between subject characteristics and treatment responsiveness can be more refined. For example, it may become apparent after conducting several single-subject designs mapping the effects of a given intervention that subjects with certain characteristics respond in characteristically similar ways. Such a hypothesis can be followed up using additional single subjects, or by using a group design. Subjects that fail to respond positively to a given treatment may do so because of a detectable reason, and this reason may point to an important aspect of a relevant theory. Outcomes like this critically depend on the foresight of the researcher. Potentially important background data, as indicated by theory and common sense, should be collected in every single-subject experiment.

The use of time-series designs also compels a clinician to focus on the careful description of patient problems and treatment characteristics, and how these data relate to treatment outcome. In such a manner, variables that are functionally important for treatment can become evident. Over a series of SCEDs, generalizations concerning the active and efficacious components of treatment can be made from such data.

Finally, analysis of SCEDs concentrates on the magnitude of treatment effects rather than their statistical significance. If the treatment analyzed in a properly conducted SCED is clinically significant, it will be clearly so.

3.02.2 ESSENTIAL COMPONENTS OF SINGLE CASE EXPERIMENTAL DESIGNS

Although various types of individual time-series designs exist, and are discussed in some detail later, several procedures common to all time-series designs are first described.
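As a rough illustration of the emphasis on the magnitude of treatment effects rather than their statistical significance, the following Python sketch summarizes repeated measurements by phase and reports the change in mean score from one phase to the next. The scores, the phase labels ("A" for baseline, "B" for treatment), and the helper names `phase_means` and `phase_changes` are hypothetical and chosen only for illustration; they are not taken from any study discussed in this chapter.

```python
from statistics import mean

# Hypothetical weekly scores on some symptom measure, labeled by phase.
# "A" = baseline, "B" = treatment; all values are invented for illustration.
observations = [
    ("A", 20), ("A", 22), ("A", 21),
    ("B", 15), ("B", 12), ("B", 9),
    ("A", 14), ("A", 18), ("A", 19),
]


def phase_means(obs):
    """Group consecutive observations that share a phase label and
    return a list of (label, mean score) pairs, one per phase."""
    phases = []
    for label, score in obs:
        if phases and phases[-1][0] == label:
            phases[-1][1].append(score)
        else:
            phases.append((label, [score]))
    return [(label, mean(scores)) for label, scores in phases]


def phase_changes(means):
    """Return the change in mean score from each phase to the next."""
    return [
        (f"{a}->{b}", round(mb - ma, 2))
        for (a, ma), (b, mb) in zip(means, means[1:])
    ]


means = phase_means(observations)
changes = phase_changes(means)
print(means)
print(changes)
```

With these invented scores the mean falls when treatment begins and partly rebounds when baseline is reinstated. A summary of this kind is only a starting point; the data would still need to be graphed and inspected before any conclusion is drawn.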

3.02.2.1 Repeated Measurement

The bedrock of single case experimental designs is repeated measurements of client functioning taken in relevant domains throughout the course of treatment (Barlow and Hersen (1984), Hayes, Barlow, and Nelson (1997), and Kazdin (1980) give information on specific measurement strategies). Repeated measurements enable the estimation of variability within a case over time to provide the sources of information about treatment, measurement error, and extraneous factors in a time-series approach.

In the real world of clinical evaluation, the goal is to use measures that are both high quality and highly practical. It is advisable that such measurements begin as soon as possible, ideally during the first session. It is also advisable, in choosing which instruments to administer to the client or research subject, that the researcher err on the side of caution and administer as many instruments as may be even partially relevant to the current experiment's concerns. Theory and even common sense should be used to guide the choice as to what instruments are relevant. Practical constraints also play an important part. If an adequate but imperfect instrument already exists and no time is currently available to create a new one, the adequate instrument may be used. Client self-monitoring and self-report are entirely acceptable. In fact, any method of collecting data is acceptable with single-subject designs, so long as the researcher deems it appropriate. Flexibility, again, is the by-word. But, as with any instrument, pay attention to its established reliability and validity before using or interpreting data. As treatment progresses, instruments useless or irrelevant to the current case can be discarded, but data not initially gathered can never be collected afterward. It is worth spending a little extra time administering an instrument on a hunch on the chance that it will yield valuable information. Measurements are then ideally administered as frequently as is practical and meaningful.

3.02.2.2 Detailed Information

Specification of the particular intervention made with the client, including when each component is delivered and any possibly significant deviations from the standardized treatment, allows meaningful inferences to be made from collected data. For the data to be meaningful indicators of the effects of a treatment and its components, the clinician must be able to temporally link specific phases of their intervention with the ongoing flow of outcome data as described above. Details regarding the nature of the treatments, the environment that the therapy was delivered in, and characteristics of the therapist provide a level of detail conducive to proper replication, especially if the completed SCED is made available to others. Steps taken in group research to ensure treatment integrity can be taken here as well. For example, a colleague or student might be asked to assess a clinician's adherence to a designated treatment protocol, as well as the competence with which the treatment is delivered. Specification of the treatment requires that the researcher has a clear, theoretically-based idea of what they are attempting to accomplish and how to accomplish it. The type of treatment being used, its specific techniques, and the phases or subphases of each technique or strategy that are active should be noted. Enough detail should be added so that after the intervention is complete, an informed guess can be made as to what might have been responsible for observed effects.

Collection of detailed client information allows more meaningful inferences to be drawn from single-subject data. The recording of any information that might possibly affect the course and effectiveness of treatment may prove to be invaluable for data analysis. Seemingly relevant background information and significant events occurring in the client's life during the course of treatment qualify as important client information. If the client's spouse leaves her during the course of treatment, for example, notation of this event may serve as a possible explanation for a brief or sustained decline in the treatment's effectiveness. Of importance here is the chronicling of any information about the client or the client's life that might be expected to affect their response to treatment. Over a series of SCEDs, such information can be used to speculate as to why clients reacted differentially to treatment.

3.02.2.3 Graphing of Data

Analysis of an individual time-series design requires a visual representation of the data. A simple line graph, with time plotted on the x-axis and measurement score plotted on the y-axis, should be sufficient for most data because it presents an immediate picture of variability within a data series. Pragmatic considerations determine what unit of time to plot. Thus, if an instrument is measuring behaviors in frequent time intervals but the behavior itself is better thought of in temporally larger units, analysis may be facilitated if time intervals are collapsed (e.g., if daily measurements are summed and recorded as weekly measurements). Frequent and creative graphing of data, from various angles, using different units of time, composite scores from separate measures, etc., can allow insights into the effects of treatment.

3.02.2.4 Creative Use of Design Elements

Individual time-series design elements (discussed below) should be thought of as tools, not restrictive agents. If a complex phase change with an interaction element is initially planned, and the data indicate the client might benefit most from continuing with the current treatment, then continuation of treatment is clearly justified. Unexpected client or therapist vacations can be an opportunity for a return to baseline phase, allowing a chance to monitor, by contrast, the effectiveness of treatment. Clinical common sense and detailed understanding of the nature of the different design elements and when they are useful allow effective and serendipitous adaptation to events.

3.02.3 INTERPRETATION OF SINGLE-SUBJECT DATA

Time-series designs typically (though not always) consist of units of time called phases, each phase designating the continuous presence of a given condition (e.g., treatment, baseline, and so on). Data within each phase can be described as having various degrees of variability around a given level and trend. Use of statistical analyses is not necessary at the level of the individual, and indeed use of most inferential statistics with a single subject violates the assumptions of standard statistical tests. Clear graphing of data and a thorough understanding of interpreting such single-subject graphs are all that are needed to observe an effect. It is to the nature of variability, level, and trend, and to the opportunities they provide for the interpretation of the data, that we now turn.

3.02.3.1 Variability and Stability

Data within a phase are said to be stable to the extent that the effects of extraneous variables and measurement error, as reflected in variability within a subject across time, are sufficiently limited or identifiable that variability due to treatment can be ascertained. Determining stability requires that the clinician have some clarity about what treatment effects are large enough to be worth detection: the more variability due to extraneous factors or measurement error, the larger the treatment effect would have to be to be seen.

If the data are not stable (in the sense that important treatment effects might be obscured), the experimenter can (i) continue the phase until the data do become stable, (ii) examine the data using longer units of time if longer units are more sensible, or (iii) analyze the possible sources of variability. Each of these is defensible, but the last option is often the best because it can provide important information about the case.

3.02.3.2 Level

Level refers to the "height" on the y-axis at which the data tend to aggregate. For example, data taken during a treatment phase tending to stay at the upper points of the measurement scale would be notably distinguishable, in terms of level, from data congregating around the mid-point of a measurement scale.

3.02.3.3 Trend

Trend refers to the general linear direction in which data are moving across a given period of time. It takes a minimum of three data points to establish a trend and estimate variability around that trend.

Converging and diverging trends between phases can be analyzed to help differentiate variability due to treatment from variability due to extraneous sources. For example, if a client's data on a given measure show a positive trend in baseline that continues when treatment is implemented and continues yet again when the baseline is reinstated, it is likely that something other than treatment is responsible for the improvement. Conversely, if a strong positive trend during a treatment phase levels out or especially reverses after a change to a baseline phase, the clinician can usually feel confident that treatment is responsible for the change (unless some potentially significant life change happened to co-occur with the phase change). Thus, trends are not only useful indicators as to the improvement or decline of clients on various measures, but are also useful indicators, coupled with phase changes, of the sources of those changes.

3.02.3.4 Use of Statistics with Single-subject Data

For the most part, inferential statistics were designed for use in between-group comparisons. The assumptions underlying the widely accepted classical model of statistics are usually violated when statistical tests based on the model are applied to single-subject data. To begin with, presentation of conditions is not generally random in single-subject designs, and randomization is a necessary prerequisite to statistical analysis. More importantly (and more constantly), the independence of data required in classical statistics is generally not achieved when statistical analyses are applied to time-series data from a single subject (Sharpley & Alavosius, 1988). Busk and Marascuilo (1988) found, in a review of 101 baselines and 125 intervention phases from various single-subject experiments, that autocorrelations between data, in most cases, were significantly greater than zero and detectable even in cases of low statistical power. Several researchers have suggested using analyses based on a randomiza-

not mean the same thing as a statistically significant result with a group, and the assumptions and evidentiary base supporting classical statistics simply do not tell us what a significant result with a single subject means. Beyond the technical incorrectness of using nomothetic statistical approaches to idiographic data, it is apparent that such use of these statistics is of extremely limited use in guiding further research and bolstering confidence about an intervention's efficacy with an individual subject. If, for example, a statistically significant result were to be obtained in the treatment of a given client, this would tell us nothing about that treatment's efficacy with other potential clients.
Moreover, data indicat- tion task to circumvent the autocorrelation ing a clinically significant change in a single problem (Edgington, 1980; Levin, Marascuilo, client would be readily observable in a well- & Hubert, 1978; Wampold & Furlong, 1981). conducted and properly graphed single-subject For example, data from an alternating treat- experiment. StatisticsÐso necessary in detect- ment design or extended complex phase change ing an overall positive effect in a group of design, where the presentation of each phase is subjects where some improved, some worsened, randomly determined, could be statistically and some remained unchangedÐwould not be analyzed by a procedure based on a randomiza- necessary in the case of one subject exhibiting tion task. Some controversy surrounds the issue one trend at any given time. (Huitema, 1988), but the consensus seems to be that classical statistical analyses are too risky to use in individual time-series data unless at least 35±40 data points per phase are gathered 3.02.4 VARIETIES OF SINGLE-SUBJECT (Horne, Yang, & Ware, 1982). Very few DESIGNS researchers have the good fortune to collect so much data. Time-series design elements can be classified Time-series analyses where collected data is either as within-series, between-series, or simply used to predict subsequent behavior combined-series designs. Different types of (Gottman, 1981; Gottman & Glass, 1978) can time-series design elements within each group also be used, and is useful when such predictions of designs are used for different purposes are desired. However, such an analysis is not (Hayes et al., 1997). The nature of each type suitable for series with less than 20 points, as of element, as well as its purpose, will now be serial dependence and other factors will con- described. tribute to an overinflated alpha in such cases (Greenwood & Matyas, 1990). 
In cases where 3.02.4.1 Within-series Design Elements statistical analysis indicates the data is not autocorrelated, basic inferential statistical pro- A design element is classified as within-series cedures such as a t-test may be used. Finally, the if data points organized sequentially within a Box±Jenkins procedure (Box & Jenkins, 1976) consistent condition are compared to other such can technically be used to determine the sequential series that precede and follow. In presence of a main effect based on the departure such a design, the clinician is typically faced of observed data from an established pattern. with a graphed data from a single measure or a However, this procedure would require a homogenous group of measures, organized into minimum of about 50 data points per phase, a sequence of phases during which a consistent and thus is impractical for all but a few single- approach is applied. Phase changes occur, subject analyses. ideally, when data has stabilized. Otherwise, In addition, most statistical procedures are of practical or ethical issues determine the time at unknown utility when used with single-subject which phases must change. For example, an data. As most statistical procedures and inter- extended baseline with a suicidal client would pretations of respective statistical results were certainly not be possible, and time and financial derived from between-group studies, use of constraints may determine phase changes in these procedures in single-subject designs yields other cases. Aspects of specific design strategies ambiguous results. The meaning of a statisti- that influence phase length are discussed at the cally significant result with a lone subject does appropriate points below. Varieties of Single-subject Designs 29
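The level, trend, and variability that a clinician reads from each phase of a graph can also be written down as simple descriptive quantities. The Python sketch below is illustrative only and is not a procedure taken from this chapter: the function name `phase_summary`, the use of a least-squares slope as the trend, the root-mean-square residual as the variability, and the example scores are all assumptions made for the illustration. The lag-1 autocorrelation is included as a rough indicator of the serial dependence discussed in Section 3.02.3.4. These numbers are meant as aids to visual inspection, not as inferential tests.

```python
from statistics import mean


def phase_summary(scores):
    """Describe one phase of equally spaced measurements.

    Returns the level (mean), the trend (least-squares slope per
    measurement occasion), the variability around that trend (root mean
    square of the residuals), and the lag-1 autocorrelation.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a trend needs at least three data points")
    times = list(range(n))
    t_bar, level = mean(times), mean(scores)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - level) for t, y in zip(times, scores))
    slope = sxy / sxx
    intercept = level - slope * t_bar
    residuals = [y - (intercept + slope * t) for t, y in zip(times, scores)]
    spread = (sum(r * r for r in residuals) / n) ** 0.5
    # Lag-1 autocorrelation: how strongly each point resembles its predecessor.
    deviations = [y - level for y in scores]
    denom = sum(d * d for d in deviations)
    lag1 = (sum(a * b for a, b in zip(deviations, deviations[1:])) / denom
            if denom else 0.0)
    return {"level": level, "trend": slope, "variability": spread, "lag1": lag1}


# Hypothetical weekly scores for a baseline (A) and a treatment (B) phase.
baseline = [12, 14, 13, 15, 14]
treatment = [13, 10, 9, 7, 5]
for name, data in (("A", baseline), ("B", treatment)):
    print(name, phase_summary(data))
```

Comparing the two summaries side by side mirrors the visual comparison of adjacent phases: a change in level or a reversal of trend that clearly exceeds the variability within each phase is the kind of difference the graph is expected to make obvious.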

Within-series design elements include the simple phase change, the complex phase change, the parametric design elements, and the changing criterion design. Each is described below. Note that specific phases are designated below by capital letters. Generally (but not always), the letter A refers to a baseline phase, and letters such as B and C refer to different interventions.

3.02.4.1.1 Simple phase change

If the aim is simply to determine a treatment's efficacy vs. no treatment at all for a given client, or to compare the relative efficacy of two validated treatments for a given client, then a simple phase change design is probably the right choice. Consider the case in which only a single treatment's efficacy is at issue. A similar approach can be taken to answer questions about the relative efficacy of two treatments.

In the standard simple phase change, baseline data on relevant client behaviors are typically taken for a period of time (enough time to estimate the level, trend, and variability around level and trend of the behavior of interest). This baseline phase can occur during initial sessions with the client while assessment (and no treatment per se) is taking place, while the client is on a waiting list, or through similar means. Treatment is then administered, and while it is in place a second estimate is made of the level, trend, and variability around level and trend of the behavior of interest. If clear changes occur, treatment may have had an impact.

In order to control for extraneous events that might have co-occurred with the phase change from baseline to treatment, the phase change must be replicated. Usually this is done by changing from treatment back to baseline, but other options exist. Such a change, called a withdrawal, is useful in aiding the clinician in deciding whether it is indeed the treatment, and not some extraneous variable, that is responsible for any changes. However, certain ethical considerations must be weighed before a withdrawal is executed. If treatment definitely seems effective and is well-validated, it may not be necessary to withdraw the treatment to conclude that a treatment effect is likely. If the clinician is uncertain of the treatment's effectiveness, or if it is not well-validated, a withdrawal or some other means of replication is necessary in order to conclude that an effect has occurred. Even a short withdrawal can be sufficient if a clear trend is evident. Alternatively, a period of time in which therapy is focused on a different, relatively unrelated problem can also be considered a withdrawal. Serendipitous opportunities for a return to a baseline phase can provide important information. Impromptu client or therapist vacations can be viewed as a chance to make more informed decisions regarding the treatment's efficacy.

More than one replication of the underlying phase change (such as an ABABAB design) may be necessary for the clinician to be confident of a treatment's effects for a particular client. Interpretation of the data is facilitated by referring to the specifics of stability, level, and trend as discussed above.

Examples of well-conducted simple phase change designs include Gunter, Shores, Denny, and DePaepe (1994), who utilized the design to evaluate the effects of different instructional interactions on the disruptive behavior of a severely behaviorally disordered child. Gunter et al. (1994) used a simple phase change with reversal (in this case ABAB), with the baseline or A phase consisting of a math assignment with between five and 15 difficult multiplication problems. The treatment or B phase consisted of equally difficult problems, but the experimenter would provide the correct answer and then present the same problem again whenever the subject answered incorrectly. Institution of the first treatment phase indicated a desirable effect, with the subject's rate of disruptive behavior falling from around 0.3 to around 0.1. Gunter et al. (1994) wisely decided to replicate the phase changes with the subject, allowing them to be more confident that extraneous variables such as time or (to some extent) order were not responsible for the changes.

Orsborn, Patrick, Dixon, and Moore (1995) provide another good, contemporary example of the simple phase change design (Figure 1). They used an ABAB design to evaluate the effect of reducing the frequency of teacher questions and increasing the frequency of pauses on the frequency of student talk, using 21 first- and second-grade subjects. Both B phases showed a marked increase in student talk frequency relative to baseline phases. The strength of the design, however, could have been improved with the inclusion of more data points. Five data points were collected during baseline phases, and three in intervention phases. Three is an acceptable minimum, but more data points are advisable.

3.02.4.1.2 Complex phase change elements

A complex phase change combines a specific sequence of simple phase changes into a new logical whole.

(i) ABACA

When comparing the effectiveness of two (or more) treatments relative to each other and to

no treatment, an ABACA complex phase change element might be used. Typically, the clinician would choose such a design when there is reason to believe that two treatments, neither yet well-validated, may be effective for a given client. When it is unclear whether either treatment will be effective, treatment phases can be interspersed with baseline phases. Phase changes conducted in such a manner will allow meaningful comparisons to be made between both treatments, as well as between each treatment and no treatment. After each treatment has been administered once, if one results in a clearly more desirable data level and trend, it may be reinstated. If each treatment's relative efficacy is still unclear, and the data give no reason to believe that either may be iatrogenic, phase changes may be carried out as is practical. A sequence such as ABACACAB might not be unreasonable if the data and clinical situation warranted. Regardless of the sequence implemented, the clinician should remain aware that order effects in a complex phase change can be critical. The second treatment administered may be less effective simply because it is the second treatment. Counterbalancing of phase sequences in other cases can circumvent such ambiguity.

Figure 1 An adaptation of a simple phase change with reversal design (phases ABAB; y-axis: talk in seconds; x-axis: sessions). Data were relatively stable before phase changes (adapted from Orsborn et al., 1995).

The clinician should stay alert to the possibility of introducing even a third treatment in such a design, if the original treatments do not appear to be having the desired effect. An example of such a situation is shown in a study by Cope, Moy, and Grossnickle (1988). In this study, McDonald's restaurants promoted the use of seat belts with an advertising campaign (phase B), and the distribution of instructive stickers for children asking them to "Make it Click" (phase C). These two interventions were tried repeatedly. Finally, an incentive program was implemented giving away soft drinks to drivers who arrived at McDonald's with their seat belt fastened. The design could be described as an ABCBCBDA.

The spirit here, as always when using time-series approaches, should be one of flexible and data-driven decisions. With such a spirit, the clinician should be willing to abandon even an empirically validated treatment if it is clear, over a reasonable length of time, that there is no positive effect. No treatment is all things to all people.

Another well-conducted complex phase change design was reported by Peterson and Azrin (1992; Figure 2). Three treatments, including self-monitoring (phase B), relaxation training (phase C), and habit reversal (phase D), were compared with a baseline or no treatment (phase A). Six subjects were used, and the authors took advantage of the extra subjects by counterbalancing the presentation of phases. For example, while the first subject was presented with the phase sequence AAABCDCBDA, other subjects were presented with sequences such as AAADCBCDBA, AAACDBDCBA, and AAACDBDCBA. A minimum of three data points were contained in each phase (generally four or more data points were used), and the authors more often than not waited until stability was achieved and clear trends were present before changing phases for subjects.
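One simple way to counterbalance the order of treatments across subjects is to rotate the list of treatments so that each treatment occupies each ordinal position once. The Python sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the rotation is a plain cyclic Latin square, and it is not the scheme used by Peterson and Azrin (1992). A cyclic rotation balances position only; it does not balance which treatment immediately follows which.

```python
def counterbalanced_orders(treatments):
    """Rotate the treatments so each one appears once in every position."""
    k = len(treatments)
    return [[treatments[(start + i) % k] for i in range(k)]
            for start in range(k)]


def phase_sequence(order, baseline="A"):
    """Bracket one treatment order with baseline phases: A, treatments, A."""
    return [baseline, *order, baseline]


# Three hypothetical treatments labeled as in the text (B, C, D).
for subject, order in enumerate(counterbalanced_orders(["B", "C", "D"]), start=1):
    print(subject, "".join(phase_sequence(order)))
```

With three treatments this yields three orders, so six subjects could be assigned two per order. In practice the planned sequence remains subordinate to the data: phases are still extended, repeated, or dropped as the graphed series requires.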

Figure 2 An example of a complex phase change (phase sequence AAACBDBDCA; y-axis: number of tics; x-axis: minutes, with measurements taken every 2.5 minutes). Waiting for the data in the first and third baseline (A) phase to stabilize (as well as the data in the first D phase) would have been preferable before initiating phase changes (adapted from Peterson and Azrin, 1992).

(ii) Interaction element

In an interaction element the separate and combined effects of intervention elements are examined. Use of this design element is appropriate both when a treatment is working and the clinician wishes to ascertain whether it will continue working without a particular component (one that is costly in terms of time and/or money), and when a treatment is not maximally effective and the clinician believes adding or subtracting a specific component might enhance the treatment's efficacy.

White, Mathews, and Fawcett (1989) provide an example of the use of the interaction element. They examined the effect of contingencies for wheelchair push-ups designed to avoid the development of pressure sores in disabled children. Wheelchair push-ups were automatically recorded by a computer. After a baseline, two subjects were exposed to an alarm avoidance contingency (B), a beeper prompt (C), or a combination. An interaction design element was combined with a multiple baseline component. The design for one subject was A/B+C/B/B+C/B/B+C/C/B+C, and for the other was A/B+C/C/B+C/C/B+C. Each component (B or C) was more effective than baseline, but for both children the combined (B+C) condition was the most effective overall.

Shukla and Albin (1996) provided another good example when examining the effects of extinction vs. the effects of extinction plus functional communication training on problem behaviors of severely disabled subjects (Figure 3). Extinction had previously been found to be effective in reducing the frequency of target behaviors, but it had the unfortunate effect of sometimes causing more problematic behaviors to emerge. By alternating phases consisting of extinction alone and extinction plus communication training, the authors were able to show that the second condition resulted in uniform decreases in problematic behavior.

(iii) Changing criterion

The changing criterion design element consists of a series of shifts in an expected benchmark for behavior, such that the correspondence between these shifts and changes in behavior can be assessed. It is particularly useful in the case of behaviors that tend to change gradually, provided that some benchmark, goal, criterion, or contingency is a key component of treatment.

The establishment of increasingly strict limits on the number of times a client smokes per day provides a simple example. A changing criterion design could be implemented when it is unclear whether the criteria themselves, and no other variable, were responsible for observed changes in smoking. As another example, a changing criterion design could be implemented to assess the degree to which changing minimum numbers of social contacts per day affects actual social contacts in a socially withdrawn client.
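The correspondence between criterion shifts and behavior can be tabulated phase by phase before it is graphed. The following Python sketch assumes a smoking-reduction case in which each criterion is a daily maximum; the function name `criterion_tracking` and all of the counts are hypothetical values invented for the illustration.

```python
def criterion_tracking(phases):
    """Summarize how closely behavior follows each criterion.

    `phases` is a list of (criterion, daily_counts) pairs in the order
    the criteria were in effect. A day meets the criterion when its
    count does not exceed the maximum allowed.
    """
    summary = []
    for criterion, counts in phases:
        met = sum(1 for count in counts if count <= criterion)
        summary.append({
            "criterion": criterion,
            "mean": sum(counts) / len(counts),
            "proportion_met": met / len(counts),
        })
    return summary


# Hypothetical cigarettes per day under three successive daily limits.
phases = [
    (30, [29, 30, 28]),
    (25, [25, 24, 26, 25]),
    (20, [20, 19, 20, 18, 20]),
]
for row in criterion_tracking(phases):
    print(row)
```

If the mean in each phase settles near its criterion and moves when the criterion moves, the series is tracking the criterion. A mean that drifts steadily regardless of the limit in force would point toward some other source of change.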

Figure 3 An interaction element design (phases A, B, A, B+C; y-axis: problem behavior per minute; x-axis: sessions). The interaction (B+C) condition did not yield a better result than the B condition, but the demonstration of no further efficacy was still an important finding (adapted from Shukla & Albin, 1996).

In order to maximize the possibility that the data from a changing criterion design are correctly interpreted, five heuristics seem useful. First, the number of criterion shifts in the design should be relatively high. Each criterion shift is a replication of the effect of setting a criterion on subsequent client behavior. As with any type of experiment, the more frequently the results are replicated, the more confident the researcher can be of the effect and its causes. As a rule of thumb, four or more criterion shifts should occur when a changing criterion design is implemented.

Second, the length of the phase in which one level of the criterion is in effect should be long enough to allow the stability, level, and trend of the data to be interpreted relative to the criterion. Additionally, if a clear trend and level do not initially emerge, the criterion should remain in effect until a clear trend and level do emerge. Belles and Bradlyn (1987) provide a good example of properly timed criterion changes. The goal was to reduce the smoking rate of a long-time smoker who smoked several packs a day. The client recorded the number of cigarettes smoked each day (with reliability checks by the spouse). After a baseline period, goals were set by the therapist for the maximum number of cigarettes to be smoked. If the criterion was exceeded, the client sent a $25 check to a disliked charity. For each day the criterion was not exceeded, $3 went into a fund that could be used to purchase desirable items. Each criterion was left in place for at least three days, and the length of time, magnitude, and direction of criterion shifts varied.

Third, as in the Belles and Bradlyn (1987) study, criterion changes should occur at irregular intervals. As certain behaviors may change in a cyclical or gradual manner naturally, criteria should be shifted after differing lengths of time. Thus, if the length of one criterion's phase happens to correspond to changes occurring naturally, the phase lengths of other levels of the criterion will be unlikely to continue this trend.

Fourth, the magnitude of criterion shifts should be altered. If the data can be shown to track criterion changes of differing magnitudes, the statement that the criterion itself is responsible for observed changes can be made with a greater level of assurance.

Finally, a periodic changing of the direction of criterion shifts can be useful in assisting interpretations of effects. Such a strategy is similar to the reversal common in simple and complex phase changes. If client behavior can be shown to systematically track increasing and decreasing criteria, the data can be more confidently interpreted to indicate a clear effect of the changing criteria on those behavioral changes.
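The five heuristics can be treated as a checklist to run against a planned or completed criterion schedule. The Python sketch below is one hedged way of doing so: the function name `check_schedule` and the example schedules are hypothetical, and the default of three data points per phase is an assumption borrowed from the minimum needed to establish a trend, since the text asks only that a phase be long enough for stability, level, and trend to be interpreted.

```python
def check_schedule(phases, min_shifts=4, min_points=3):
    """Check a changing criterion schedule against the five heuristics.

    `phases` is an ordered list of (criterion, number_of_data_points)
    pairs, one per criterion phase (baseline phases are left out).
    """
    criteria = [criterion for criterion, _ in phases]
    lengths = [points for _, points in phases]
    shifts = [later - earlier for earlier, later in zip(criteria, criteria[1:])]
    return {
        # 1. Four or more criterion shifts (rule of thumb from the text).
        "enough_shifts": len(shifts) >= min_shifts,
        # 2. Every phase has at least `min_points` data points (assumed floor).
        "phases_long_enough": all(n >= min_points for n in lengths),
        # 3. Phase lengths are not all identical.
        "irregular_lengths": len(set(lengths)) > 1,
        # 4. Shift magnitudes are not all identical.
        "varied_magnitudes": len({abs(s) for s in shifts}) > 1,
        # 5. The criterion moves in both directions at least once.
        "direction_reversed": any(s > 0 for s in shifts)
                              and any(s < 0 for s in shifts),
    }


# Hypothetical schedules: daily cigarette limits and days spent at each limit.
varied = [(40, 5), (35, 3), (28, 6), (31, 4), (22, 5)]
uniform = [(30, 4), (25, 4), (20, 4)]
print(check_schedule(varied))
print(check_schedule(uniform))
```

A schedule that fails one of these checks is not thereby invalid. The checklist only flags where an alternative explanation, such as a naturally gradual change, is harder to rule out from the graph alone.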

DeLuca and Holborn (1992) used the changing criterion design in analyzing the effects of a variable-ratio schedule on weight loss in obese and nonobese subjects (Figure 4). Phase sequences consisted of a baseline (no treatment) phase, three different criterion phases, an additional baseline phase, and finally a return to the third criterion phase. The criterion involved the number of revolutions completed on a stationary exercise bike during the allotted time in each phase. Criterion phases were determined by a calculation of 15% over the criterion in place for the previous phase; when the criterion was met, a subject would receive reinforcement in the form of tokens exchangeable for established reinforcers. Each phase (except for the five-session final phase) lasted for eight 30-minute sessions, resulting in eight data points per phase. The increasing variable ratio schedules were shown, through use of this design, to exert control over increased frequencies of revolution. Although the design did not exhibit some typical features of the changing criterion design, such as staggered phase length, criterion direction reversals, and varying phase change depths, the observation of clear effects was facilitated by the return to baseline phase and subsequent replication of the third criterion phase. In addition, the use of the design in an exercise program was prudent, as exercise involves gradual performance improvements of a type easily detected by the changing criterion design.

Figure 4 An example of a changing criterion design (phases: A, 80 rpm, 115 rpm, 130 rpm, A, 130 rpm; y-axis: revolutions per minute; x-axis: sessions). Dotted horizontal lines indicate set criterion in each phase. The use of more than three criteria might have been preferable to further indicate experimental control over behavior, but the design seems adequate nonetheless (taken from DeLuca & Holborn, 1992).

3.02.4.1.3 Parametric design

When it is suspected that different levels or frequencies of a component of an intervention might have a differential effect on client behavior, a parametric design can be implemented. Such designs are frequently used to assess the effects of different psychotropic drug dosages, but the design is relevant for many other purposes. Kornet, Goosen, and Van Ree (1991) demonstrated the use of the parametric design in investigating the effects of Naltrexone on alcohol consumption.

Ideally, a kind of reversal can be incorporated into a parametric design, where levels of the independent variable in question are systematically increased and then decreased. As with the changing criterion element, showing that an effect tracks a raised and lowered standard bolsters the confidence with which an interpretation is made. Baseline data should usually be taken before and after changes in the parameter of interest. If certain levels of the parameter provide interesting or unclear data, alternations between those levels, even if not originally planned, can aid in clarification (e.g., if the sequence A/B/B'/B''/B'''/B''/B'/B/A was originally planned and the data spur increased interest in the B and B' levels, a sequence such as B/B'/B/B' could be inserted or added).

An example of acceptable use of parametric design is provided by Stavinoah, Zlomke, Adams, and Lytton (1996). The experimenters systematically varied dosages of haloperidol and fluoxetine while measuring the frequency of the subject's impulsive aggressive behaviors (IABs). Dosages of haloperidol ranged from about 40 mg to 20 mg over the 40 weeks that it was administered. As a 40 mg dose initially resulted in an average of 10 IABs per week, dosage was reduced to 20 mg after 12 weeks. During the next 34 weeks, IABs increased to 13 per week, and the decision was made to increase the dosage back to 40 mg. This resulted in an average of 45 IABs per week for the four weeks this dosage was in place. The experimenters then administered a 20 mg dose of fluoxetine for the next 62 weeks, resulting in an average IAB frequency of near zero. A 40 mg dose of fluoxetine administered for 58 weeks yielded IAB frequencies of near zero. A subsequent reduction to 20 mg for five weeks resulted in an average IAB frequency of 12; the dosage was then returned to 40 mg, with a resulting IAB frequency of almost zero. Ideally, less time could have been spent at each dosage, and a greater variety of dosages could have been employed. But the experimenters did vary dosage and even drugs, and did so with a sufficient number of data points to determine the effect each drug and dosage had on behavior.

Lerman and Iwata (1996) provide a better example of use of the parametric design in treating the chronic hand mouthing of a profoundly retarded man (Figure 5). Sessions were conducted two or three times per day. Baseline frequencies (with no attempts to stop the hand mouthing behavior) were first calculated; baseline rates of three times per minute, on average, were recorded over several sessions. During the intervention phase, all subject attempts at hand mouthing were blocked by the experimenter putting his hand in front of the subject's mouth, resulting in less than one instance of hand mouthing per minute. Subsequently, attempts were blocked at a ratio of 1 block per 2 attempts, 1/4, 1/2, 2/3, and 3/4. The frequency of hand mouthing remained near zero over all levels. The experimenters properly used a descending/ascending order for levels, but also allowed subject behavior to determine what blocking schedule would be used. The experimenters thus remained responsive to the data, and their efforts yielded a less intensive intervention than one block per attempt.

3.02.4.1.4 Final suggestions for interpreting within-series data

Besides the suggestions listed above for interpreting various types of within-series data, additional general suggestions are offered here. First, as with any type of data-producing procedure, replicated effects are always more believable than nonreplicated effects. If an effect is consistently duplicated across several clients or across several different behaviors in the same client, the clinician can feel more confident in stating that the treatment in question is responsible for the effect. Each additional reinstatement of a treatment phase resulting in client improvement within a single series of data should also allow more confident statements to be made about that treatment's efficacy with that client. Second, effects are much more believable to the extent that they occur in a consistent manner. Third, changes of a greater magnitude (when parsing out changes apparently caused by extraneous variables) should generally be taken as more robust evidence of the treatment's effects. Fourth, effects occurring immediately after the onset of a treatment phase are logically stronger indicators that the treatment is responsible for the changes than are delayed effects, since fewer alternative explanations exist for the effects seen. Fifth, greater changes in the level and trend of the data are generally more indicative of a treatment's efficacy. Sixth, any effects not explainable by variables other than treatment should naturally be more convincing. Finally, all effects should be interpreted while considering the background variability of the data. The variability in the data around a given level and trend in a consistent condition provides an individual estimate of the impact of extraneous factors and measurement error against which any treatment effect is seen. If, for example, the level and/or trend of baseline data at times overlaps with the level and/or trend of treatment phase data, the clear possibility that factors other than treatment may be augmenting (or inhibiting) the treatment effect should be considered.

Brief mention of some commonly occurring conditions illustrates application of the guidelines discussed thus far. If a significant upward or downward data trend begins one to three data points before a phase change is planned, delay of the phase change until the data stabilize is suggested. A trend is significant if it is of a greater degree than changes attributable to background variability, such as would be observed when a series of relatively stable data fluctuates around an average level. However, if an established and significant trend has emerged over a few (e.g., three or more) data points, a phase change might be acceptable if such results had been expected. Instability and unclear trends are obviously of less importance at the beginning of a phase than at the end; data at first uninterpretable often have a way of telling a clearer story after more data are collected. The

value of collecting as many data points as feasible (e.g., seven or more) becomes clear after only a few data points have been graphed.

Figure 5 An example of a parametric design (conditions in order: baseline, response block 1/1, baseline, response block 1/2, 1/4, 1/2, 2/3, 3/4, 1/1; y-axis: responses per minute; x-axis: sessions). Response block conditions refer to differing levels of intervention; that is, 1/4 translates to one response block for every four hand mouthing attempts (adapted from Lerman & Iwata, 1996).

3.02.4.2 Between-series Designs

Within-series designs involve comparisons of sequences of repeated measurements in a succession of consistent conditions. Between-series designs are based on a comparison of conditions that are concurrent or rapidly alternating, so that multiple data series are simultaneously created. Pure between-series designs consist of the alternating treatments design and the simultaneous treatment design.

3.02.4.2.1 Alternating treatment design

The alternating treatment design (ATD) consists of rapid and random or semirandom alternation of two or more conditions such that each has an approximately equal probability of being present during each measurement opportunity. As an example, it was observed during a clinical training case that a student therapist, during many sessions, would alternate between two conditions: leaning away from the client and becoming cold and predictable when he was uncomfortable, and leaning towards the client and becoming warm and open when feeling comfortable. The client would disclose less when the therapist leaned away, and more when he leaned forward. If it were assumed that the therapist had preplanned the within-session alternations, an ATD as shown in Figure 6 would be obtained. The condition present in the example at any given time of measurement is rapidly alternating. No phase exists; however, if the data in each respective treatment condition are examined separately, the relative level and trend of each condition can be compared between the two data series (hence the name between-series designs).

An example of the ATD is provided by Jordan, Singh, and Repp (1989). In this study, two methods of reducing stereotypical behavior (e.g., rocking, hand-flapping) in retarded subjects were examined: gentle teaching (the use of social bonding and gentle with the developmentally disabled) and visual screening (covering the client's eyes for a few seconds following stereotypic behavior, thus reducing visual stimulation including that provided by these movements). Each of the two conditions was randomly alternated with a baseline condition. After a baseline period, visual screening produced a dramatic reduction in stereotypy, whereas gentle teaching had only a transient effect.

Another proper use of the alternating treatments design is provided by Van Houten (1993; Figure 7). Four children were taught subtraction using two different procedures (one procedure involved the use of a general rule,

Forward and warm

Amount of client self-disclosure Back and cool

Time Figure 6 A hypothetical example of the use of an ATD to assess the effects of therapist behavior on client self- disclosure. the other involved only rote learning). Use of ATDs hold several other advantages over the procedures was alternated randomly and standard within-series designs. First, treatment every 15 minutes over the length of 15 or more need not be withdrawn in an ATDÐif treatment sessions, and the subtraction problems used in is periodically withdrawn, it can be for relatively each session were counterbalanced across the short periods of time. Second, comparisons subjects so that effects could be attributed to the between components can be made more teaching methods and not the problem sets. The quickly. If a clear favorite emerges early in a use of an ATD rather than a complex phase well-conducted ATD, the clinician can be change was prudent, as the order the methods reasonably sure that its comparative efficacy were presented in longer phases could probably will be maintained McCullough, Cornell, have exerted a practice effect. McDaniel, and Mueller (1974), for example, One of the benefits of the ATD is the compared the relative efficacy of two treatments simplicity with which it can be used to compare in four days using an ATD. ATDs can be used three or even more treatment conditions. Proper without collecting baseline data, or with base- comparisons of three conditions in a within- line data through the creation of a concurrent series design can be difficult due to counter- baseline data series. Any background within- balancing concerns, order effects, and the sheer series trends (such as those due to maturation of number of phase changes that need to be the client or etiology of the disorder) are executed over a relatively long period of time. unlikely to obscure interpretation of the data With an ATD, three or even more conditions because the source of data comparisons are can be presented in a short time. 
The rapid and purely between series, not within. random alternations between conditions makes ATD requires a minimum of two alterations order effects less likely, but multiple treatment per data series. As both series can be combined interference (the impact of one treatment is to assist assessments of measurement error and different due to the presence of another) is extraneous factors, the number of data points arguably likely. ATDs are ideally used with required is less than with a within-series design. behaviors emitted at a relatively high frequency The collection of more than two data points per that correspondingly allows many instances of series is typical and highly useful, however. In a each alternate intervention to be applied. sense, each alternation is a replication and However, the design may be used with relatively conclusions from all time-series designs can be infrequent behaviors if data is collected for a stated with more confidence with each consis- longer period of time. In addition, behaviors tent replication. that tend not to have an impact for long after a When planning alternations, the clinician discrete intervention is made and withdrawn should be alert to the duration, after presenta- make better targets for an ATD. If a change tion, of a component's effect. An administered initiated by such a discrete intervention con- drug, for example, exerts an effect over a period tinues over a long period of time, effects of of time, and presenting a new treatment subsequent interventions are obscured and component before that time has expired would reliable data interpretation is often not possible. confound interpretation. Varieties of Single-subject Designs 37

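The random or semirandom alternation that defines the ATD can be illustrated with a short scheduling sketch. This is only one possible approach, not a procedure taken from the studies cited: it assumes block randomization (every condition appears once per block, so each condition is presented equally often) and a cap on consecutive repeats. The function names and the `max_run` parameter are hypothetical.

```python
import random
from collections import Counter
from itertools import groupby


def longest_run(sequence):
    """Length of the longest streak of identical consecutive conditions."""
    return max((len(list(group)) for _, group in groupby(sequence)), default=0)


def atd_schedule(conditions, n_blocks, max_run=2, seed=None):
    """Semirandom ATD schedule (illustrative assumption, not a published rule).

    Each block contains every condition once in shuffled order, so all
    conditions are presented equally often. A shuffled block is rejected
    and reshuffled if appending it would create a run longer than max_run.
    """
    if len(conditions) < 2 or len(set(conditions)) != len(conditions):
        raise ValueError("need at least two distinct conditions")
    if max_run < 1:
        raise ValueError("max_run must be at least 1")
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        block = list(conditions)
        while True:
            rng.shuffle(block)
            if longest_run(schedule + block) <= max_run:
                break
        schedule.extend(block)
    return schedule


# Hypothetical usage: two therapist conditions over 10 blocks (20 measurements).
schedule = atd_schedule(["A", "B"], n_blocks=10, max_run=2, seed=1)
print("".join(schedule))
print(Counter(schedule))
```

Blocking trades some randomness for balance; a fully random sequence would also be an ATD, but conditions could then be presented unequally often in a short series.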
Figure 7 An ATD in which rule-learning trials are interspersed with rote-learning trials. A practice or generalization effect is apparent in the rote-learning condition beginning around session 8 (adapted from Van Houten, 1993). (Plot of correct (%) across sessions; series: rule, rote.)

One of the shortcomings of the ATD is that observed effects in the design can be due to the way in which conditions are presented and combined. Three areas of concern in this domain of multiple treatment interference are sequential confounding, carry-over effects, and alternation effects (Barlow & Hayes, 1979; Ullman & Sulzer-Aszaroff, 1975).

Sequential confounding occurs when there is a possibility that a treatment condition A yields different effects when presented before a treatment condition B than it does when presented after condition B. To control for sequential confounding, the clinician is encouraged to alternate treatment conditions randomly or at least semirandomly. With a randomly delivered sequence such as ABBBAABABBBABBAAAABBAAABAA, if consistent differences between each condition's effects continue to show up throughout the sequence, despite the fact that the order and frequency of each condition's presence differs through the sequence, the clinician can be relatively certain that observed effects are not an artifact of order of condition presentation.

A carry-over effect occurs when the presentation of one condition somehow affects the impact of the subsequent condition, regardless of the presentation order of the conditions. Potentially this can occur in two ways. The effects of two conditions can change in opposite directions, or in the same direction. For example, when a strong reinforcer is delivered after a weak reinforcer, the weak reinforcer can subsequently cease to reinforce the desired behavior at all while the other reinforcer has a strong effect. At times, exposure to one condition results in a similar response to a somewhat similar second condition. Implementing each condition for a relatively short period of time can help reduce these problems (O'Brien, 1968), as might clear separations between each treatment condition (such as introducing only one treatment condition per session).

Several procedures exist to help detect multiple treatment interference (Sidman, 1960). A simple phase change where one treatment condition is preceded by a baseline phase, when compared to another AB design containing the other treatment, and finally compared to an ATD combining both conditions, could be used to parse out the separate and interactive effects of the treatment conditions. Alternatively, the intensity of one treatment condition could be increased, with any subsequent changes in the following conditions (as compared to changes already witnessed in an ATD containing both conditions) attributable to carry-over effects.

Some additional caveats regarding proper use of the ATD are of note. First, although a baseline phase is not necessary in an ATD, inclusion of baseline data can be useful for both gathering further information on client functioning and interpreting the magnitude of treatment effects. If periodic baseline points can be included within the ATD itself, valuable information regarding background level, trend, and variability can also be gleaned, over and above what would be interpretable if treatment conditions alone were present.

Second, it is important to realize that although ATDs can effectively be used with four or even more treatment conditions and corresponding data series, an upper limit exists on the number of data series that can be meaningfully interpreted. One useful heuristic (Barlow, Hayes, & Nelson, 1984) is to count the number of data points that will likely be collected for the ATD's purpose and then divide this number by the desired number of data series. If several data points will be collected for each series, the clinician should be able to proceed as planned.

Third, the clinician must consider the amount of data overlap between data series when interpreting ATD results. Overlap refers to the duplication of level between series. Several issues must be considered if considerable overlap exists. First, the percentage of data points, relative to all data points in an involved series, that indeed overlap can be calculated. If this percentage is low, the likelihood of a differential effect is higher. Second, the stability of the measured behavior should be considered. If the frequency of a given behavior is known to vary widely over time, then some overlap on measures of that behavior between ATD conditions would be expected. Third, the clinician must note if any overlapping trends occur in the two conditions. If data in two series are similar not only in level but also in trend, then it seems plausible that a background variable, rather than the treatment conditions, might affect the data.

One final example is shown in Figure 8. These are the data from an airplane-phobic client in the study on the effect of cognitive coping on progress in desensitization (Hayes, Hussian, Turner, Anderson, & Grubb, 1983). Notice that there is a clear convergence as the two series progress. The orderliness of the data suggested that the results from cognitive coping were generalizing to the untreated scenes. Alternating a reminder not to use the coping statements with the usual statements then tested this possibility. The data once again diverged. When the original conditions were then reinstated, the data converged once more. This showed that the convergence was a form of systematic generalization, rather than a lack of difference between the two conditions. This is also a good example of the combination of design elements to answer specific questions. This particular design does not have a name, but it is a logical extension of design tools discussed above.

3.02.4.2.2 Simultaneous treatment design

The simultaneous treatment design (STD) is similar to the ATD, except that the treatments are continuously present and are accessed by the choice of the subject. What is plotted is not the impact of the treatment but the degree to which it is accessed. In other words, an STD measures preference. As an example, suppose a clinician wished to assess the motivation of a disabled child for different kinds of sensory stimulation. Several kinds of toys that produced different sensory consequences could be placed in a room with the child, and the percentage of time played with each kind of toy could be recorded and graphed. This would be an STD.

3.02.4.3 Combined-series Elements

Combined-series designs contain elements from both within- and between-series designs, and combine them into a new logical whole. Although many examples in the literature contain elements of both between- and within-series designs, true combined-series designs involve more than merely piecing components together. What distinguishes combined-series elements from any combination of elements is that the combination yields analytical logic.

3.02.4.3.1 Multiple baseline

One of the more often encountered SCEDs is the multiple baseline. An adaptation of the simple phase change, the multiple baseline design allows control over some variables that often confound interpretation of within-series phase change data. Upon a phase change from baseline to treatment in a simple phase change, for example, the data would ideally indicate a sudden and very strong treatment effect. The data would be arranged in such a way that the effects of the treatment could be easily distinguished from background variability. Even when this ideal outcome occurs, however, there is also the possibility that some extraneous variable having a strong effect on measured outcomes might co-occur with the onset of a treatment phase. Such an occurrence would have dire implications for interpretation of the simple phase change result. The multiple baseline design allows considerable control over such threats to validity.

The multiple baseline typically involves a simple phase change across at least three data series, such that the phase change between baseline and treatment occurs at a different time in each of the three data series (Figure 9). The logic of the design is elegantly simple. Staggering the implementation of each respective phase change allows the effects of extraneous variables to be more readily observed. It is generally unlikely that any given extraneous occurrence will have an equal effect on phase changes occurring at three different points in time.

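Returning to the ATD caveats above, the data-points-per-series heuristic and the overlap percentage can both be computed in a few lines. The sketch below rests on assumptions that the text does not spell out: a data point is counted as overlapping when it falls within the observed range (minimum to maximum) of the other series, the function names are hypothetical, and the numbers are invented for illustration.

```python
def points_per_series(total_points, n_series):
    """Heuristic from the text: expected data points divided by number of series."""
    return total_points // n_series


def percent_overlap(series, other):
    """Percentage of points in `series` lying within the range of `other`.

    Assumption: 'overlap' means falling between the minimum and maximum
    of the other series. Other definitions are possible.
    """
    low, high = min(other), max(other)
    overlapping = sum(1 for value in series if low <= value <= high)
    return 100.0 * overlapping / len(series)


# Invented self-disclosure scores for the two therapist conditions.
forward_and_warm = [8, 9, 7, 10, 9]
back_and_cool = [3, 4, 2, 7, 3]

print(points_per_series(total_points=20, n_series=4))
print(percent_overlap(forward_and_warm, back_and_cool))
print(percent_overlap(back_and_cool, forward_and_warm))
```

A low percentage is consistent with a differential effect, but as noted above it should be read alongside the known variability of the behavior and any shared trend between the series.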
Figure 8 An example of series convergence in the ATD and its analysis by adding within-series components (redrawn from Hayes et al., 1983). (Plot of average latency to anxiety (s) across desensitization scenes; series: concurrent coping statements, no coping statements mentioned, told not to use coping statements.)

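The staggered phase changes of a multiple baseline design, and the comparisons they make available, can be laid out programmatically. The following is a minimal sketch under stated assumptions: sessions are indexed from 0, each series has a single AB phase change at a chosen onset session, and the function names and example data are hypothetical rather than drawn from any study cited here.

```python
from statistics import mean


def phase_labels(n_sessions, onset):
    """Label each session 'A' (baseline) before onset and 'B' (treatment) after."""
    return ["A" if session < onset else "B" for session in range(n_sessions)]


def level_change(data, onset):
    """Within-series comparison: mean of treatment phase minus mean of baseline."""
    return mean(data[onset:]) - mean(data[:onset])


def between_series_windows(onsets, n_sessions):
    """Sessions in which one series is in treatment while another is still in baseline.

    Returns a dict keyed by (treated_series, baseline_series) index pairs.
    """
    windows = {}
    for i, onset_i in enumerate(onsets):
        for j, onset_j in enumerate(onsets):
            if onset_i < onset_j:
                windows[(i, j)] = range(onset_i, min(onset_j, n_sessions))
    return windows


# Hypothetical design: three series, treatment introduced at sessions 5, 10, and 15.
onsets = [5, 10, 15]
for (treated, control), sessions in between_series_windows(onsets, 20).items():
    print(f"series {treated} treated, series {control} in baseline:", list(sessions))

# Invented data for one series with a phase change at session 3.
print(level_change([2, 3, 2, 6, 7, 8], onset=3))
```

If the still-untreated series shifts during one of these windows, either an extraneous variable or a generalization effect is a plausible explanation, and the design alone cannot separate the two.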
Implementation of a multiple baseline design greatly increases the potential number of comparisons that can be made between and within data series, ultimately strengthening the confidence with which conclusions are made from the data. Figure 9 details where such comparisons can be made. First, changes in level and trend within each data series (that is, between the baseline and treatment phase of each of the three data series) can be analyzed, just as with a simple AB design. Unlike a simple phase change, however, differences in level and trend between baseline and treatment can also be compared between three series of data. The design, in effect, contains replications of the phase change. If similar levels and trends are exhibited across all three series, the clinician can feel confident that the treatment is the most likely candidate for exerting the effect. Comparisons can also be made between the point in time at which the first phase change occurs and the same points in time in the two remaining data series, where baseline data are still being collected. Such comparisons give the researcher information on whether some variable other than treatment might be responsible for observed changes. For example, a strong positive trend and marked change in level might be indicated by the data after the treatment phase is implemented. If similar changes occur in the other two data series at the same points in time, before treatment has even been implemented in those data series, it would seem clear that something besides treatment was exerting an influence on the data. A similar comparison can be made between the point at which the second phase change is implemented and the corresponding data points on the third series of data.

Figure 9 A hypothetical example of a multiple baseline, with between- and within-sources of data comparisons. Comparisons can be made between pre- and postphase change data within any of the three series, as well as between any two series at the points where an intervention is in place for one series and baseline conditions are still in effect for the second series. (Plot of behavior over time in three series with A and B phases; comparisons labeled Within 1, Between 2, Between 3, Within 4, Between 5, Within 6.)

The type of data recorded in each of the three series must be similar enough so that comparisons can be made between the series, yet different enough that effects in one series are not expected from a phase change in another series. The context in which the data in each of the three series are collected can be of three varieties. The multiple baseline across behaviors requires that three relatively discrete and problematic behaviors, each of which might be expected to respond to a given treatment, be chosen. The treatment is then implemented in staggered, multiple baseline fashion for each behavior. The clinician would probably wish to choose behaviors that are unlikely to be subject to a generalization effect, in which a treatment implemented with behavior 1 results in concomitant change in one or both of the other behaviors (because, for example, the client begins to apply the principles learned immediately to those other behaviors). Such an effect would be easily observed in between-series comparisons (i.e., when a data trend in a condition where the intervention has not yet been initiated resembles the trend exhibited in an intervention condition, a generalization effect may very likely be present). However, the clinician could not be absolutely certain in such a case that the changes across behaviors were due to some generalization effect and not an extraneous variable. In general, if it seems theoretically likely that a generalization effect would occur under such circumstances, and no apparent confounding variable is present, then it is often safe to assume generalization occurred. Even if such an effect were observed, it might not be disastrous by any means. Implementing a treatment only to serendipitously find that it positively affects behaviors other than the targeted one could hardly be considered unfortunate.

An example of the multiple baseline across behaviors is provided by Trask-Tyler, Grossi, and Heward (1994). Developmentally disabled young adults were taught to use three cooking recipes of varying degrees of complexity (simple, trained, and complex), with the goal of the study being to determine whether or not specific skills would generalize across recipes. Simple tasks included preparing microwave popcorn while receiving specific instructions regarding its preparation. Trained tasks were analogs of previously taught simple tasks, where subjects used previously taught skills to prepare a different food (e.g., microwave french fries). Complex tasks consisted either of new tasks or novel combinations of previously trained tasks. Intervention phases were staggered across complexity levels, and occurred after a baseline phase where subjects completed recipes without instruction. Complexity level served as an appropriate analog for differing behaviors in this case, and the implementation of new phases was staggered with sufficient data points (i.e., three or more) occurring in each phase. The use of the design was also innovative in that it involved applied skills.

Alternatively, the clinician may choose to implement a multiple baseline design across settings. If, for example, a client was socially withdrawn in a variety of circumstances (e.g., at work, at the gym, and on dates), social skills training might be implemented at different points in time across several of those circumstances. As another example, anger management training with an adolescent could be implemented in a staggered fashion at home, at school, and at a part-time job site. As with the multiple baseline across behaviors, the clinician should be alert to the possibility of generalization effects.

An example of the multiple baseline across settings design is provided by Kennedy, Cushing, and Itkonen (1997; Figure 10). The study investigated the effects of an inclusion intervention on the frequency and quality of social contacts with nondisabled people, using two developmentally disabled children as subjects. The intervention included a number of components, such as placement in general school settings with nondisabled peers, and feedback. After a baseline where both subjects participated in special education classes only, the design took the shape of a multiple baseline across classes. One subject eventually participated in four classes, the other in two. Phase lengths were more than adequate (minimum of five data points), and phase changes were appropriately staggered (levels and trends stabilized before onset of the intervention in a new setting). The intervention was effective in increasing the frequency and quality of peer interactions.

Figure 10 A multiple baseline across settings. Data were stable before phase changes in both settings, and comparisons between the settings indicate that the intervention was responsible for changes and that no generalization effect occurred (adapted from Kennedy et al., 1997). (Plot of social contacts per day across weeks; panels: class period 2, class period 6.)

Finally, a multiple baseline design can be implemented across persons. Such a manifestation of the design would, of course, require access to three clients with fairly similar presenting concerns subjected to the same treatment. As it is probably unlikely that anyone but a therapist at a university or health center would have simultaneous access to three such clients, it is acceptable to collect data on new clients as they present for services (Hayes, 1985). Objections exist to this approach (Harris & Jenson, 1985), but we feel strongly that the practicality of the approach and the control against extraneous factors that the design possesses greatly outweigh the potential risks.

The multiple baseline across persons design is exemplified by Kamps et al. (1992). The effects of social skills groups on the frequency of social skills interactions of high functioning autistic subjects with normal peers were analyzed across three subjects. Baseline consisted of frequency counts of social interactions before social skills training. Phase changes within each series did not occur until data stabilized, and phase changes across subjects were staggered approximately 10 sessions apart. A minimum of seven data points (usually more) were present in each phase. Social skills training had a positive effect on social interactions.

Ideally, a phase change should not be executed until the data indicate a clear level and trend and are stable. This is advisable when changing from baseline to treatment within a given series, and especially when executing staggered phase changes in the second and third data series. If no clear effect is observable in a series, then co-occurring data points between series cannot be meaningfully interpreted.

An example of the multiple baseline is provided by Croce (1990). A multiple baseline across persons was used to evaluate the effects of an aerobic fitness program on three obese adults, while weekly body fat and cardiovascular fitness measurements were taken. More than three weeks of baseline data were collected for all subjects before treatment was delivered to the first subject. Treatment for the second subject was implemented three weeks after the first, and treatment for the third subject was delayed an additional three weeks. The data clearly indicated a desirable effect.

3.02.4.3.2 Crossover design

The crossover design essentially involves a modification of the multiple baseline allowing an additional degree of control over extraneous variables (Kazdin, 1980). It can be especially useful when only two (rather than three or even more) series of data can plausibly be gathered, as this added control tends to compensate for the control lost by the omission of a third series. This design has been widely used in pharmacological studies (Singh & Aman, 1981).

Execution of a crossover design simply involves simultaneous phase changes across two series of data such that, at any given point in time, opposite phases are in operation in the two series. If series one's phase sequence was, for example, ABAB, a BABA sequence would be simultaneously delivered in series two, with each phase change occurring at the same time in both series, and after treatment phases of precisely equal lengths. Such an arrangement can make even a relatively weak effect evident, as each treatment (B phase) always has a corresponding baseline (A phase) in the other data series to allow finer interpretations of stability, level, and trend.

3.02.4.3.3 Constant series control

One final combined-series design warrants brief mention. A constant series control can be added to a series of data when baseline data alone are collected concurrently on some other person, problem, or situation of interest. In adding a constant series control to an in-school anger management intervention with a child, for example, relevant behavioral data might be collected at the child's home (where the treatment is not considered active) throughout the period where an ABAB phase sequence is being implemented at school. Treatment B effects from the school setting can then be compared to concurrently gathered baseline data from the home setting to assist in interpretation of the treatment's effects. Such a control is extremely useful when used in conjunction with a simple or complex phase change design.

A study by O'Reilly, Green, and Braunling-McMorrow (1990) provides an example of a baseline-only constant series control. O'Reilly et al. were attempting to change the accident-prone actions of brain-injured individuals. A written safety checklist that listed hazards, but did not specifically prompt their remediation, was prepared for each of several areas of the home. If improvement was not maximal in a given area, individualized task analyses were prepared that prompted change in areas that still required remediation. The design used was a multiple baseline across settings (living room, kitchen, bedroom, and bathroom). Although phase changes from baseline to checklist prompts to task analysis occurred in staggered multiple baseline fashion across the first three settings, a baseline condition remained in effect in the fourth setting for the entire 29 weeks of the study. Results indicated very little evidence of generalization across responses, and the baseline-only constant series control provided additional evidence that training was responsible for the improvements that were seen.

3.02.5 FUTURE DIRECTIONS

Widespread use of SCEDs by practicing clinicians could provide a partial remedy to two currently omnipresent concerns in the mental health care delivery field: a virtual lack of use and production of psychotherapy outcome literature by clinicians, and the demand for demonstrated treatment efficacy by the managed care industry.

3.02.5.1 Potential for Research Production and Consumption by Practicing Clinicians

Line clinicians currently produce very little research. This is unfortunate since it is widely recognized that assessment of the field effectiveness of treatment technology is a critical and largely absent phase in the research enterprise (Strosahl, Hayes, Bergan, & Romano, in press).

Single-subject designs are well suited to answer many of the questions most important to a clinician. They are useful tools in keeping the clinician in touch with client progress and informing treatment decisions. Most of the requirements of these designs fit with the requirements of good clinical decision making and with the realities of the practice environment. This has been true for some time, and the research production using these designs in the on-line practice environment is still limited. That may be about to change, however, for the reason discussed next.

3.02.5.2 Managed Health Care, Single-subject Design, and the Demonstration of Efficacy

The managed health care revolution currently underway in the USA represents a force that, in all likelihood, will soon encompass nearly all US mental health services delivery systems (Strosahl, 1994; Hayes et al., in press). The hallmark of managed care organizations (MCOs) is the provision of behavioral health services in a way that is directed and monitored, not merely compensated.

In generation I of the managed care revolution, cost savings came mostly from cost reduction strategies. That phase seems nearly complete. In generation II, cost savings are coming from quality improvement strategies. Uppermost in this approach is the development of effective and efficient treatment approaches, and encouragement of their use through clinical practice guidelines.

Time-series designs are relevant to MCOs in three ways. First, they can allow much greater accountability. Single-subject designs provide an excellent opportunity for the clinician to document client progress and provide sufficient justification to MCOs, HMOs, and PPOs for implementing treatments or treatment components. Even a simple AB is a big step forward in that area. Second, when cases are complex or treatment resistant, these designs provide a way of evaluating clinical innovation that might be useful for improving the quality of treatment for other cases not now covered by empirically supported treatments. Finally, these designs can be used to evaluate the impact of existing treatment programs developed by MCOs.

3.02.6 CONCLUSION

This chapter provides a brief overview of current single-subject design methodologies and their use in the applied environment. The design elements are a set of clinical tools that are effective both in maximally informing treatment decisions and in generating and evaluating research hypotheses. Single-subject designs fill a vital role in linking clinical practice to clinical science. With the evolution of managed care, this link is now of significant economic importance to a major sector of our economy.

3.02.7 REFERENCES

Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97.
Barlow, D. H., & Hayes, S. C. (1979). Alternating treatments design: One strategy for comparing the effects of two treatments in a single subject. Journal of Applied Behavior Analysis, 12, 199–210.
Barlow, D. H., Hayes, S. C., & Nelson, R. O. (1984). The scientist practitioner: Research and accountability in clinical and educational setting. New York: Pergamon.
Barlow, D. H., & Hersen, M. (1984). Single case experimental designs: Strategies for studying behavior change (2nd ed.). New York: Pergamon.
Belles, D., & Bradlyn, A. S. (1987). The use of the changing criterion design in achieving controlled smoking in a heavy smoker: A controlled case study. Journal of Behavior Therapy and Experimental Psychiatry, 18, 77–82.
Box, G. E. P., & Jenkins, G. M. (1976). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Busk, P. L., & Marascuilo, L. A. (1988). Autocorrelation in single-subject research: A counterargument to the myth of no autocorrelation: The autocorrelation debate. Behavioral Assessment, 10(3), 229–242.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312.
Cone, J. D. (1986). Idiographic, nomothetic, and related perspectives in behavioral assessment. In R. O. Nelson & S. C. Hayes (Eds.), Conceptual foundations of behavioral assessment (pp. 111–128). New York: Guilford Press.
Cope, J. G., Moy, S. S., & Grossnickle, W. F. (1988). The behavioral impact of an advertising campaign to promote safety belt use. Journal of Applied Behavior Analysis, 21, 277–280.
Croce, R. V. (1990). Effects of exercise and diet on body composition and cardiovascular fitness in adults with severe mental retardation. Education and Training in Mental Retardation, 25(2), 176–187.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
Cummings, N. A., Cummings, J. L., & Johnson, J. N. (1997). Behavioral health in primary care: A guide for clinical integration. Madison, CT: Psychosocial Press.
Cummings, N. A., Pollack, M. S., & Cummings, J. L. (1996). Surviving the demise of solo practice: Mental health practitioners prospering in the era of managed care. Madison, CT: Psychosocial Press.
DeLuca, R. V., & Holborn, S. W. (1992). Effects of a variable ratio reinforcement schedule with changing criteria on exercise in obese and nonobese boys. Journal of Applied Behavior Analysis, 25, 671–679.
Edgington, E. S. (1980). Validity of randomization tests for one-subject experiments. Journal of Educational Statistics, 5, 235–251.
Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York: Appleton-Century-Crofts.
Gottman, J. M. (1981). Time-series analysis: A comprehensive introduction for social scientists. Cambridge, UK: Cambridge University Press.
Gottman, J. M., & Glass, G. V. (1978). Analysis of interrupted time-series experiments. In T. R. Kratochwill (Ed.), Single subject research: Strategies for evaluating change (pp. 197–235). New York: Academic Press.
Greenwood, K. M., & Matyas, T. A. (1990). Problems with the application of interrupted time series analysis for brief single subject data. Behavioral Assessment, 12, 355–370.
Gunter, P. L., Shores, R. E., Jac, K. S. L., Denny, R. K., & DePaepe, P. A. (1994). A case study of the effects of altering instructional interactions on the disruptive behavior of a child identified with severe behavior disorders. Education and Treatment of Children, 17(3), 435–444.
Harris, F. N., & Jenson, W. R. (1985). Comparisons of multiple-baseline across persons designs and AB designs with replication: Issues and confusions. Behavioral Assessment, 7(2), 121–127.
Hayes, S. C. (1985). Natural multiple baselines across persons: A reply to Harris and Jenson. Behavioral Assessment, 7(2), 129–132.
Hayes, S. C., Barlow, D. H., & Nelson, R. O. (1997). The scientist practitioner: Research and accountability in the age of managed care (2nd ed.). Boston: Allyn & Bacon.
Hayes, S. C., Hussian, R. A., Turner, A. E., Anderson, N. B., & Grubb, T. D. (1983). The effect of coping statements on progress through a desensitization hierarchy. Journal of Behavior Therapy and Experimental Psychiatry, 14, 117–129.
Hersen, M., & Barlow, D. H. (1976). Single case experimental designs: Strategies for studying behavior change. New York: Pergamon.
Horne, G. P., Yang, M. C. K., & Ware, W. B. (1982). Time series analysis for single subject designs. Psychological Bulletin, 91, 178–189.
Huitema, B. E. (1988). Autocorrelation: 10 years of confusion. Behavioral Assessment, 10, 253–294.
Jordan, J., Singh, N. N., & Repp, A. (1989). An evaluation of gentle teaching and visual screening in the reduction of stereotypy. Journal of Applied Behavior Analysis, 22, 9–22.
Kamps, D. M., Leonard, B. R., Vernon, S., Dugan, E. P., Delquadri, J. C., Gershon, B., Wade, L., & Folk, L. (1992). Teaching social skills to students with autism to increase peer interactions in an integrated first-grade classroom. Journal of Applied Behavior Analysis, 25, 281–288.
Kazdin, A. E. (1980). Research design in clinical psychology. New York: Harper & Row.
Kennedy, C. H., Cushing, L. S., & Itkonen, T. (1997). General education participation improves the social contacts and friendship networks of students with severe disabilities. Journal of Behavioral Education, 7, 167–189.
Koan, S., & McGuire, W. J. (1973). The Yin and Yang of progress in social psychology. Journal of Personality and Social Psychology, 28, 446–456.
Kornet, M., Goosen, C., & Van Ree, J. M. (1991). Effect of naltrexone on alcohol consumption during chronic alcohol drinking and after a period of imposed abstinence in free-choice drinking rhesus monkeys. Psychopharmacology, 104(3), 367–376.
Lerman, D. C., & Iwata, B. A. (1996). A methodology for distinguishing between extinction and punishment effects associated with response blocking. Journal of Applied Behavior Analysis, 29, 231–233.
Levin, J. R., Marascuilo, L. A., & Hubert, L. J. (1978). N = nonparametric randomization tests. In T. R. Kratochwill (Ed.), Single subject research: Strategies for evaluating change (pp. 167–196). New York: Academic Press.
McCullough, J. P., Cornell, J. E., McDaniel, M. H., & Mueller, R. K. (1974). Utilization of the simultaneous treatment design to improve student behavior in a first-grade classroom. Journal of Consulting and Clinical Psychology, 42, 288–292.
O'Brien, F. (1968). Sequential contrast effects with human subjects. Journal of the Experimental Analysis of Behavior, 11, 537–542.
O'Reilly, M. F., Green, G., & Braunling-McMorrow, D. (1990). Self-administered written prompts to teach home accident prevention skills to adults with brain injuries. Journal of Applied Behavior Analysis, 23, 431–446.
Orsborn, E., Patrick, H., Dixon, R. S., & Moore, D. W. (1995). The effects of reducing teacher questions and increasing pauses on child talk during morning news. Journal of Behavioral Education, 5(3), 347–357.
Paul, G. L. (1969). Behavior modification research: Design and tactics. In C. M. Franks (Ed.), Behavior therapy: Appraisal and status (pp. 29–62). New York: McGraw-Hill.
Peterson, A. L., & Azrin, N. H. (1992). An evaluation of behavioral treatments for Tourette Syndrome. Behavior Research and Therapy, 30(2), 167–174.
Shakow, D., Hilgard, E. R., Kelly, E. L., Luckey, B., Sanford, R. N., & Shaffer, L. F. (1947). Recommended graduate training program in clinical psychology. American Psychologist, 2, 539–558.
Sharpley, C. F., & Alavosius, M. P. (1988). Autocorrelation in behavioral data: An alternative perspective. Behavioral Assessment, 10, 243–251.
Shukla, S., & Albin, R. W. (1996). Effects of extinction alone and extinction plus functional communication training on covariation of problem behaviors. Journal of Applied Behavior Analysis, 29(4), 565–568.
Singh, N. N., & Aman, M. G. (1981). Effects of thioridazine dosage on the behavior of severely mentally retarded persons. American Journal of Mental Deficiency, 85, 580–587.
Snow, R. E. (1974). Representative and quasi-representative designs for research in teaching. Review of Educational Research, 44, 265–291.
Stavinoah, P. L., Zlomke, L. C., Adams, S. F., & Lytton, G. J. (1996). Treatment of impulsive self and other directed aggression with fluoxetine in a man with mild mental retardation. Journal of Developmental and Physical Disabilities, 8(4), 367–373.
Strosahl, K. (1994). Entering the new frontier of managed mental health care: Gold mines and land mines. Cognitive and Behavioral Practice, 1, 5–23.
Strosahl, K., Hayes, S. C., Bergan, J., & Romano, P. (in press). Evaluating the field effectiveness of Acceptance and Commitment Therapy: An example of the manipulated training research method. Behavior Therapy.
Thorne, F. C. (1947). The clinical method in science. American Psychologist, 2, 159–166.
Trask-Tyler, S. A., Grossi, T. A., & Heward, W. L. (1994). Teaching young adults with developmental disabilities and visual impairments to use tape-recorded recipes: Acquisition, generalization, and maintenance of cooking skills. Journal of Behavioral Education, 4, 283–311.
Ulman, J. D., & Sulzer-Azaroff, B. (1975). Multielement baseline design in educational research. In E. Ramp & G. Semb (Eds.), Behavior analysis: Areas of research and application (pp. 377–391). Englewood Cliffs, NJ: Prentice-Hall.
Van Houten, R. (1993). Rote vs. rules: A comparison of two teaching and correction strategies for teaching basic subtraction facts. Education and Treatment of Children, 16, 147–159.
Wampold, B. E., & Furlong, M. J. (1981). Randomization tests in single subject designs: Illustrative examples. Journal of Behavioral Assessment, 3, 329–341.
White, G. W., Mathews, R. M., & Fawcett, S. B. (1989). Reducing risks of pressure sores: Effects of watch prompts and alarm avoidance on wheelchair pushups. Journal of Applied Behavior Analysis, 22, 287–295.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.03 Group Comparisons: Randomized Designs

NINA R. SCHOOLER Hillside Hospital, Glen Oaks, NY, USA

3.03.1 INTRODUCTION
3.03.2 COMPARING STRATEGIES FOR TREATMENT EVALUATION
  3.03.2.1 Case Reports and Summaries of Case Series
  3.03.2.2 Single Case Experimental Designs
  3.03.2.3 Quasi-experimental Designs
  3.03.2.4 Randomized Experimental Designs
3.03.3 HISTORICAL BACKGROUND
3.03.4 APPROPRIATE CONDITIONS FOR RANDOMIZED GROUP DESIGNS
  3.03.4.1 Ethical Requirements
  3.03.4.2 Stage of Treatment Development
  3.03.4.3 Treatment Specification: Definition of Independent Variables
  3.03.4.4 Client/Subject Characteristics
  3.03.4.5 Specification and Assessment of Outcome: Definition of Dependent Variables
  3.03.4.6 Study Duration
3.03.5 STUDY DESIGNS
  3.03.5.1 Simple Two-group Comparisons
  3.03.5.2 Multiple-group Comparisons
  3.03.5.3 Multiple-group Comparisons that Include Medication Conditions
  3.03.5.4 Factorial Designs
3.03.6 INDIVIDUAL DIFFERENCES: WHAT TREATMENT WORKS FOR WHOM?
  3.03.6.1 Individual Difference Variable
  3.03.6.2 Post hoc Examination of Predictors of Treatment Response
3.03.7 CONCLUSIONS
3.03.8 REFERENCES

3.03.1 INTRODUCTION

A major function of clinical psychology is to provide treatment or other interventions for clients who suffer from mental illness or seek relief from problems. This chapter addresses the role of experimental evaluation of treatments as a source of data in deciding about the utility of a treatment. In the medical literature, such studies are referred to as randomized clinical trials (RCTs) (Pocock, 1983) and many recent studies of psychological treatments also use this terminology (e.g., Bickel, Amass, Higgins, Badger, & Esch, 1997). This chapter will consider the strengths and weaknesses of a number of strategies for judging treatment utility; review some historical background of the experimental study of treatment; examine the experimental designs that have been used to study psychological treatments and the circumstances under

which such experiments are appropriate. In each of the sections regarding specific experimental designs, exemplar studies will be discussed in some detail in order to highlight relevant issues for evaluating the literature and for the design of future studies.

3.03.2 COMPARING STRATEGIES FOR TREATMENT EVALUATION

In order to set the experimental study of treatment into context, this section will review a variety of strategies for evaluation of treatment efficacy attending to their strengths and weaknesses.

3.03.2.1 Case Reports and Summaries of Case Series

This strategy represents perhaps the method with the longest history. For example, Brill (1938) cites a case treated in 1880 by Josef Breuer (cf. p. 7) of a young girl with symptoms of "paralyses with contractures, inhibitions and states of psychic confusion." Under hypnosis she was able to describe the connections between her symptoms and past experiences and ". . . by this simple method he freed her of her symptoms." Another case described by Brill concerns a four-year-old child who became nervous, refused to eat, and had frequent crying spells and tantrums. The symptoms began shortly after the mother was separated from the child and she was "cured" soon after the mother returned. The mechanism that was posited to account for the effect was a disturbance in libido.

Much more recently, Consumer Reports (1994, 1995) reported the results of a survey of 4000 subscribers who reported their experiences with psychotherapy broadly defined. Respondents to the survey reported in general that they had benefitted from their psychotherapy and, of particular interest, that more and longer psychotherapy was associated with greater benefit. This survey has been reviewed and received careful critique by psychologists (Brock, Green, Reich, & Evans, 1996; Hunt, 1996; Kotkin, Daviet, & Gurin, 1996; Kriegman, 1996; Mintz, Drake, & Crits-Christoph, 1996; Seligman, 1995, 1996).

These examples differ in a number of ways. The most obvious is that they span over a century. The more relevant one for the present purpose is that the Consumer Reports survey carries the weight of numbers and implied authority as a result. However, both single case reports and reports that summarize large numbers of cases share limitations. Neither can address the very compelling question regarding the reported outcomes: "compared to what?" In the Consumer Reports survey, there are a number of obvious hypotheses that can be entertained regarding the absent comparison groups that call into question the validity of the reported finding. The finding that psychotherapy was helpful could be because those who did not feel they had benefitted from their psychotherapy did not respond to the questionnaire. The finding that longer psychotherapy was more helpful could be because those who experienced benefit continued in treatment longer or because those who discontinued early needed to justify their discontinuation. Statistical modeling of alternative causal interpretations cannot fully exclude these interpretations. The common characteristic of both the attributed interpretations of findings from case reports or case series and the criticism of them is that there is no formal procedure for judging the accuracy of the interpretation.

Such reports serve the important function of generating hypotheses regarding treatments and interventions that can lead to well-designed studies. But they are not designed to test hypotheses and therefore cannot provide definitive evidence for treatment efficacy.

3.03.2.2 Single Case Experimental Designs

Chapter 2, this volume, on single case designs, discusses the methodologies for assessing causal relationships in individuals. A striking advantage of single case designs is that they can be used to evaluate treatments for individuals with very specifically defined characteristics. These characteristics do not need to be shared with large numbers of other patients or clients for the experiments to be valid. The applicability of the specific interventions studied to broader populations of clients will, of course, depend on the similarity of such clients to the subjects studied.

An important disadvantage of single case designs is that they are only applicable in conditions in which return of the target complaints can be reliably demonstrated once the treatment has been removed. If a treatment results (or is hypothesized to result) in a change in personality or in the orientation of an individual to himself or others, then the repeated administration and withdrawal of treatment that is the hallmark of single case designs will not provide useful information. A further disadvantage is that the comparisons are all internally generated and generalization is accomplished only through assessment of multiple cases with similar individual characteristics and treatment.
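Returning to the survey example in Section 3.03.2.1, the nonresponse explanation can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and every rate in it are hypothetical and are not taken from the Consumer Reports survey. It shows how a survey can overstate benefit when people who benefitted are more likely to return the questionnaire.

```python
def observed_benefit_rate(true_rate, p_respond_if_benefit, p_respond_if_not):
    """Share of survey responders who report benefit.

    true_rate:             assumed proportion of all clients who benefitted
    p_respond_if_benefit:  assumed response probability among those who benefitted
    p_respond_if_not:      assumed response probability among those who did not
    """
    responders_with_benefit = true_rate * p_respond_if_benefit
    responders_without_benefit = (1 - true_rate) * p_respond_if_not
    return responders_with_benefit / (
        responders_with_benefit + responders_without_benefit
    )


# Hypothetical numbers: half of all clients benefit, but those who benefit
# are three times as likely to answer the questionnaire.
print(round(observed_benefit_rate(0.5, 0.6, 0.2), 2))  # 0.75

# With equal response rates the survey recovers the true rate.
print(round(observed_benefit_rate(0.5, 0.4, 0.4), 2))  # 0.5
```

Under these assumed rates a true benefit rate of 50% would appear as 75% among responders, and nothing in the survey data alone reveals the gap. This is the sense in which a report without a comparison group cannot answer "compared to what?"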

3.03.2.3 Quasi-experimental Designs

These methods, described in detail in Chapter 4, this volume, take advantage of naturalistic treatment assignment or self-selection and use statistical methods to model outcome and to provide controls for the major limitation of these methods, namely that assignment to treatment is not random. A further advantage is that evaluation of treatment outcome can be independent of treatment delivery (and even blind to the knowledge of treatment) so that the effects of expectancy on outcome are controlled. The major weakness of such designs is that the biases of treatment selection are allowed full rein.

3.03.2.4 Randomized Experimental Designs

Randomization is the hallmark of these studies. Random assignment to intervention provides the potential to control for biases that treatment self-selection and choice introduce into the evaluation of treatment outcome. The disadvantages of randomized treatment studies include logistical difficulties in implementation and restrictions in generalizability. Generalizability is restricted to clients who are willing to consider randomization. Individuals who come to treatment facilities with a strong preference or expectation may well reject randomization to treatment as an option and, insofar as matching to treatment influences outcome, the absence of such subjects reduces generalization. Subject refusal to participate, particularly if it occurs after treatment assignment is known, can further compromise the advantage that random assignment introduces. Refusal after treatment assignment is known implies that subjects whose expectations are met continue and that those who did not get the treatment they preferred have chosen not to participate. If such refusal is frequent, the principle of randomization has been compromised and interpretation of findings will encounter limitations similar to those described for quasi-experimental designs. Attrition from all causes may introduce bias (Flick, 1988). Despite these disadvantages, randomization offers insurance that, within the population studied, bias, particularly from unknown sources, has been contained. Randomized experimental studies share with quasi-experimental designs the potential advantages afforded by the separation of evaluation of outcome from provision of the intervention.

3.03.3 HISTORICAL BACKGROUND

Early research in psychotherapy ran on two relatively independent tracks: the first emphasized the internal process of the psychotherapeutic encounter; the second was concerned with evaluating the effects of psychotherapy per se. In the latter studies the primary comparisons were between patients or clients who received psychotherapy and those who did not. Most controls centered around controls for treatment time such as deferred entry into treatment using waiting lists. In these experimental studies, attention to the content of the psychotherapy was descriptive. This position was strongly bolstered by the writings of theorists such as Frank (1971), who held that general constructs such as positive regard and the therapeutic alliance underlay the effects of psychotherapy.

Smith and Glass (1977) revolutionized the field with the publication of the first meta-analysis of the psychotherapy literature, demonstrating a moderate effect size for the comparison between psychotherapy and no treatment (see Chapter 15, this volume, for a discussion of current meta-analytic methods). At the same time, increased attention to measurement of change in psychotherapy (Waskow & Parloff, 1975) and the development of manualized treatments (see Chapter 9, this volume) signaled a change in the models that were available for the study of psychological treatments.

During the same period, attention to methodology in clinical psychology quickened (Campbell & Stanley, 1963, 1966; Cook & Campbell, 1979). The emergence of medications for the treatment of mental illness had inaugurated experimental studies of these treatments using rigorous designs drawn initially from clinical psychology that included randomization, double-blind administration of medication, placebo controls, diagnostic specificity, and rating scales that paid attention to specific signs and symptoms of psychopathology (see Prien & Robinson, 1994, for a recent review of progress in methods for psychopharmacology). The demand for similar studies of psychological interventions became apparent. The modern era of experimental studies of psychotherapy and other psychological interventions was upon us.

3.03.4 APPROPRIATE CONDITIONS FOR RANDOMIZED GROUP DESIGNS

Experimentation is best suited to address comparative questions that would be biased if the groups being compared were self-selected. Random assignment of individuals to groups controls for biases introduced by self-selection. Some questions (for example, those regarding differences between young and old people or men and women) obviously will not be addressed experimentally, although later in the chapter the question of designs to study individual differences in relation to an experimental condition will be considered.

Some examples: does marital therapy prevent divorce?; is cognitive-behavior therapy helpful in depression?; does family psychoeducation potentiate the effects of medication in schizophrenia?; should children with attention deficit disorder receive medication or family support?; does respite care for Alzheimer's patients reduce physical illness in caregivers?; does medication reduce craving in alcohol dependent individuals? In all these examples, an intervention is either contrasted to others or, in the cases where no comparison is stated, it is implied that the comparison is with a group that does not receive an alternate intervention.

The hallmark of all these questions is that they identify a group or condition and ask about an intervention. The implication is that the findings from the experiment will apply to members of the group who did not receive the intervention in the experimental study. This assumption represents a statistical assumption: it is required for the use of virtually all statistical tests applied in evaluating the questions under review. It also represents an important clinical concern. Experiments are not conducted solely for the benefit of the subjects who participate in the research. It is anticipated that the conclusions will be applicable to other individuals who share characteristics (ideally key characteristics) with those who were subjects in the research.

Perhaps the greatest threat to the validity of experimental studies is that there is something that distinguishes the members of the group (depressed patients, children with attention deficit disorder) who agree to randomization from those who will not participate in a study of randomized treatment. In the general medical clinical trials literature, where the conditions to be treated are defined by readily measurable signs and symptoms or by physiological measurements (e.g., tumor size), that assumption is readily accepted and is therefore generally untested. In the psychological literature the question may be considered but there are very few studies that have actually drawn a sample of subjects from a population in order to know whether the subjects of a study are like the population from which they came. In other words, although we look to randomization as a control for bias in evaluating the effect of a treatment or intervention, there has been little attention paid to whether subjects who agree to randomization represent a bias that is negligible or substantial.

3.03.4.1 Ethical Requirements

Certain ethical conditions regarding the subjects and the treatments or interventions must be met (Department of Health and Human Services, 1994). Research participation requires informed consent from the participant or from an appropriately authorized surrogate, for example, a parent for a minor child. Even in cases where consent is obtained from a surrogate, the research participant (e.g., the child) must provide "assent." Assent is the term used for agreement to participate by individuals who are judged to lack the capacity to provide fully informed consent. Currently in the USA regulations are promulgated by the Department of Health and Human Services that define 12 elements of informed consent. They are:

(i) An explanation of the procedures to be followed, including specification of those that are experimental.
(ii) A description of the reasonably foreseeable attendant discomforts and risks and a statement of the uncertainty of the anticipated risks due to the inherent nature of the research process.
(iii) A description of the benefits that may be expected.
(iv) A disclosure of appropriate and available alternate procedures that might be advantageous for the subject.
(v) An offer to answer any inquiries concerning the procedures.
(vi) A statement that information may be withheld from the subject in certain cases when the investigator believes that full disclosure may be detrimental to the subject or fatal to the study design (provided, however, that the Institutional Review Board (IRB) has given proper approval to such withholding of information).
(vii) A disclosure of the probability that the subject may be given a placebo at some time during the course of the research study if placebo is to be utilized in the study.
(viii) An explanation in lay terms of the probability that the subject may be placed in one or another treatment group if randomization is a part of the study design.
(ix) An instruction that the subject may withdraw consent and may discontinue participation in the study at any time.
(x) An explanation that there is no penalty for not participating in or withdrawing from the study once the project has been initiated.
(xi) A statement that the investigator will inform the subject of any significant new information arising from the experiment or other ongoing experiments which may bear on the subject's choice to remain in the study.

(xii) A statement that the investigator will provide a review of the nature and results of the study to subjects who request such information.

All these essential elements must be included in an Informed Consent Form to insure adequate consent to participate in any research.

In studies that involve randomization to treatment, some elements are of particular importance because they represent concepts that are sometimes difficult for potential subjects to appreciate. The first, of course, is that of randomization itself: that treatment will be decided "as if tossing a coin." The second is that the clinician or treater will not choose "the best" treatment for the client. Other elements of informed consent that need to be emphasized for a truly informed process are that the subject is free to withdraw at any time and that those providing treatment may also discontinue the subject's participation if they believe it to be in the subject's best interest.

In most countries research involving human subjects must be reviewed by local Research Ethics Committees or IRBs that are mandated to insure that the research meets ethical standards, that all elements of informed consent are present, and that the interests of subjects are protected.

A number of issues regarding consent that go beyond the 12 critical items are currently a subject of debate. Participants in these deliberations include various regulatory bodies, independent commissions, and private individuals and groups that are self-declared protectors of those who may participate in experiments regarding treatments or interventions of interest to clinical psychology. Among these issues are the following: requiring that an independent person unaffiliated with a given research project oversee the consent process; expanding the definition of those who lack the capacity to provide informed consent to include populations such as all those diagnosed with schizophrenia and other illnesses; restricting the conduct of research that does not offer direct benefit to the participant subjects; elimination of the use of placebo treatment conditions in patient groups for whom there is any evidence of treatments that are effective. These potential changes may well alter the nature of experimental research regarding treatment and intervention.

In general, concerns about placebo treatment or no-treatment control conditions have been hotly debated in the arena of pharmacological treatment rather than psychological interventions. However, as evidence regarding the efficacy and effectiveness of psychological interventions becomes widely accepted, experimental strategies such as waiting list controls, "usual care," and even nonspecified psychotherapies may require re-evaluation. One strategy that may receive increasing attention and popularity is that of treatment dosage. In other words, a comparison group may be defined as one that receives the intervention of interest but receives less of it.

Interventions and control conditions being compared should be potentially effective; there should be evidence regarding benefit. If this assertion is accepted, how can a "no-treatment" comparison group be included in an experiment? First, there should be provision for the subject to receive the potentially more effective intervention following study participation. In other words, the "no-treatment" group is really a delayed treatment or the time-hallowed waiting list condition. This may be relatively easy to implement with short-term treatment interventions but more difficult for long-term interventions. Alternate solutions include the use of minimal treatment conditions. Later in the chapter, both of these conditions will be discussed in more detail from the substantive perspective. In the present context, both represent plausible solutions to the ethical dilemma of providing appropriate treatment for client populations for whom there exists a corpus of information regarding treatment effects. Obviously, these concerns are moot in specific patient or client populations where there are no data regarding treatment effects. The ethical dilemma is heightened when the intervention is long term or there may be substantial risk if treatment is deferred. For example, studies of treatment for depression generally exclude subjects who are suicidal.

3.03.4.2 Stage of Treatment Development

The ideal sequence of treatment development includes the following stages: innovation; preliminary description in practice, usually by the innovators or developers of the treatment; comparative experimental studies to determine treatment efficacy and investigate client characteristics that may be linked to treatment response; dissemination into clinical practice; and evaluation of outcome under conditions of usual clinical care. Experimental, randomized designs may take several forms during the sequence outlined. The first is often a study that compares the new treatment with no treatment. As indicated earlier, that condition is most likely to be defined by a waiting list condition or deferred treatment. Other studies that provide comparisons to other established treatments may follow. At the present time, studies that examine psychological interventions in relationship to medication are often carried out. These studies may address direct comparative questions regarding medication and a psychological intervention or they may address relatively complex questions regarding the additive or interactive effects of medication and a psychological intervention.

3.03.4.3 Treatment Specification: Definition of Independent Variables

As indicated above, randomized designs are most valuable when the interventions can be well-specified or defined. Since the goal of such research is generalization, the advantage of well-specified treatment interventions is that their reproduction by other clinicians is possible. Manuals such as those discussed in Chapter 9, this volume, represent the ideal model of treatment specification, but other methods can be considered. The training of the treatment provider can provide a plausible statement regarding specification. For example, clinical psychologists with a Ph.D. who are board certified and who have completed an established training course may represent an appropriate statement regarding a treatment specification.

Specification should also include such elements as treatment duration, frequency, and number of contacts. Duration and number may appear to be relatively simple constructs but in examining psychological interventions the duration of treatment may not be rigidly defined. In some cases, treatment adherence by clients will affect duration, frequency, and number. In other circumstances, treatment may be continued until a specified outcome is reached. Under these conditions, duration of treatment until a criterion of improvement is reached may represent an outcome itself.

Finally, the most careful specification needs to be matched by measurement of adherence to the specified treatment. Did therapists do what they were supposed to do? This represents an interesting shift in research on interventions. Earlier in the history of psychotherapy research, when the emphasis was on the process of psychotherapy, the evaluation of what happened during the psychotherapeutic encounter was of interest because it was assumed that psychotherapeutic interventions altered outcomes. One could draw an analogy to recording the practices of a gifted cook in order to ascertain the recipe. As the emphasis has shifted to the evaluation of outcomes, the interest in the nature of the interpersonal interactions that comprise the intervention has come to be seen as assurance of adherence or "fidelity" (Hollon, Waskow, Evans, & Lowery, 1984). The question now is to determine whether a given cook has followed the recipe.

3.03.4.4 Client/Subject Characteristics

Characterization of the clients in the randomized trial provides the means for communicating attributes of the population of clients who may be appropriate for the treatment in routine clinical practice. Currently, the most common strategy for characterizing clients is the use of diagnostic criteria that provide decision rules for determining whether clients fit categories. The most widely used are those of the World Health Organization's International classification of diseases (World Health Organization, 1992) and the American Psychiatric Association's Diagnostic and statistical manual of mental disorders (American Psychiatric Association, 1987, 1994). The use of specified diagnostic criteria (and even standardized instruments for ascertaining diagnosis) provides some insurance that clients participating in a given study share characteristics with those who have been participants in other studies and with potential clients who may receive the treatment at a later time. However, under some circumstances it may be preferable to specify client inclusion based on other methods, such as the reason for seeking clinical assistance rather than a formal clinical diagnosis. An advantage of using inclusion criteria other than diagnosis is that problem-focused inclusion criteria may make translation of findings from a randomized trial to clinical practice easier.

A second issue regarding client characteristics in randomized studies is insuring that important client characteristics which may influence outcome are balanced across treatment groups. One strategy is to conduct relatively large studies. Randomization is designed to minimize imbalance but, as anyone who has seen heads come up 10 times in a row knows, in the short term randomization may well result in imbalance. One of the most common ways to achieve balanced groups is to randomize to treatment or condition within a prespecified group (for example, gender or age) to insure against the possibility that by chance disproportionate numbers of one group are randomized to one treatment condition. A second strategy is to use an adaptive randomization algorithm (Pocock & Simon, 1975; White & Freedman, 1978). In this method, several client characteristics that are known or hypothesized to affect the outcomes of interest in the study are identified. The goal of the randomization algorithm is to insure that the groups are relatively well balanced considering all of the characteristics simultaneously. A particular algorithm may specify a number of characteristics but when the number exceeds three or four, the overall balance is unlikely to be affected. The characteristics can also be weighted so that some are

independent assessors. It is not a perfect strategy. Such assessors have only limited opportunity to observe subjects and may not be sensitive to subtle but important cues because of their limited contact with the subjects.
In this more likely to influence treatment assignment context, both initial training of assessors to than others. Adaptive randomization is a insure reliability and ongoing monitoring of dynamic process in which subject characteristics reliability are critical. are fed into a program that generates a treatment assignment such that the identified character- istics will be well balanced across all groups. 3.03.4.6 Study Duration What is central to randomization within groups or adaptive randomization is the Study duration includes two components: premise that the chosen characteristics are duration of the intervention and duration of known or hypothesized to affect the outcomes postintervention evaluation. Duration of the of interest in the study. Characteristics that are intervention should be theoretically driven, not expected to influence outcome do not need based on the nature of the clinical problem to be considered in this way. Section 3.03.6 that is being addressed and the mechanism of discusses the use of individual differences action of the intervention. Short-term interven- among study participants in attempting to tions are much easier to investigate, particularly understand outcome differences among treat- short-term interventions for which manuals ments or interventions. have been developed. It appears that manuals are easier to develop for short-term interven- tions. However, some questions will require 3.03.4.5 Specification and Assessment of longer-term intervention. Interventions whose Outcome: Definition of Dependent duration are reckoned in months rather than Variables weeks require additional care to avoid subject attrition and to insure that the treatment To state the obvious, outcome criteria should remains constant during the long period of reflect the intended effect of the interventions time that it is being delivered. 
studied and the reliability and validity of the Post-treatment evaluation or follow-up after assessment instruments used should be estab- treatment ends is common in studies of lished. In general, there are advantages to using psychological interventions and addresses the some measures that are established in the field. important question of whether effects persist in This provides the opportunity to compare the absence of continued treatment interven- findings across studies and aids in the cumu- tion. Such follow-up studies are subject to the lative development of knowledge (Waskow & criticism that uncontrolled interventions may Parloff, 1975). In addition, it is valuable to have occurred during the post-treatment period include study-specific measures that focus on and may bias the outcome. But, if an interven- the particular intervention(s) under investiga- tion is hypothesized to exert a long-lasting effect tion and hypotheses being tested. In discussing that may include change in personality or long- examples of specific studies in later sections of term functioning, such evaluation is required. this chapter, the issue of whether differences Follow-up evaluations are also subject to were found on standard measures or on increased problems of attrition and the risk measures that were idiosyncratic to the study that differential attrition may introduce bias. in question will be considered. Comparisons of baseline and demographic Also of relevance is who assesses outcome. In characteristics are frequently made between pharmacologic clinical trials, double-blind pro- those subjects who are ascertained at the end of cedures represent standard operating proce- follow-up and those who are lost. This provides dure. 
Concerns are often expressed that side some measure of comfort, but further important effect patterns and other ªcluesº may serve to comparisons should include treatment group break the blind, but blinding preserves a and measures of outcome at the completion of measure of uncertainty regarding treatment study treatment. An incorrect strategy that has assignment (Cole, 1968). In studies of psycho- fallen into disuse was to compare those not logical intervention, double-blind conditions ascertained with the full cohort, including those are not possible. The subject and the treating lost to follow-up, so that those not ascertained clinician know what treatment the subject is were included in both groups. receiving and commitment and enthusiasm for In studies of pharmacologic treatment, treatment are assumed to be present. For this reversal of effects on discontinuation of treat- reason, a common strategy is to include ment has been taken as prima facie evidence of 54 Group Comparisons: Randomized Designs efficacy. In fact, a major research design to as ªplaceboº treatment in the literature. strategy in psychopharmacology is to establish Comparisons of two specific psychological efficacy of treatment in a cohort and then to interventions also represent potential two- randomize subjects to continuation or disconti- group comparisons. A final model may include nuation of medication. Differential levels of a treatment and no-treatment comparison in the symptom severity after a fixed experimental presence of a specified medication condition period and time to symptom exacerbation are (see Section 3.03.5.3 for discussion of this design taken as evidence of treatment efficacy. In in the context of other designs that include contrast, studies of psychological interventions medication conditions). 
have often considered persistence of effect Such studies may evaluate the benefit of two following the end of treatment as evidence for specific psychological interventions, of a spe- the efficacy of the psychological intervention. cific intervention vs. a nonspecific control (usual The problem that these conflicting paradigms care) or of intervention vs. none. The hypoth- represent will be considered further in Section eses and specific clinical questions that drive a 3.03.5.3 which considers comparative effects of given investigation should, in principle, drive pharmacologic and psychological treatments. the particular nature of the experimental and control groups. However, sometimes logistic considerations such as the nature of the treatment setting and the treatments that are 3.03.5 STUDY DESIGNS routinely provided in the setting may influence the design of the experiment. A further, The choice of study design always involves important factor may be the clinical needs of compromise. The clever experimenter can al- the clients and the urgency of providing some ways think of additional controls and varia- intervention. tions. In the previous section some of the There are limitations of two-group designs. relevant issues have been highlighted. Ethical Whatever the outcome, the design will not considerations may proscribe some scientifically allow testing of a range of alternate hypotheses. appropriate conditions. For example, under If the study does not include a no-treatment some circumstances deferring treatment control group, there is no control for time, through the use of a ªwaiting listº control spontaneous remission, or improvement. If the may be inappropriate. Financial resources may study does not include an alternate interven- constrain designs. 
Aside from funding con- tion group there is no control for nonspecific straints, availability of appropriate subjects may factors in treatment or for the specific limit the number of conditions that can be characteristics of the designated experimental studied. This may be stating the obvious, but the treatment. Further, if the study includes two more conditions and the larger the number of specified interventions, there is no control for subjects in a study, the longer it will take to either receipt of any treatment or for non- complete the study and the longer an important specific factors in treatment. clinical question will remain unanswered. Each Interpretation of the findings from two-group additional group in a study increases both the studies absent a no-treatment control is difficult cost and the duration of a study. Finally, there is if there are no differences in outcome between no single, ideal design. Ultimately, the design of the groups. No difference can be interpreted as a clinical experiment represents a decision that is Lewis Carroll and the Red Queen would have us driven by the hypotheses that are under believe ªthat all have won and all shall have investigation. What is critical is the recognition prizesº or that there is no effect of either. by the investigator (and by the reader/critic of Interpretation of studies that do not include a the research) of the hypotheses that can be no-treatment group may often depend on tested, and conversely, the questions that are integrating findings from other prior studies simply not addressed by a given study. that did include such controlsÐnot necessarily a bad state of affairs. As the field of experimental 3.03.5.1 Simple Two-group Comparisons studies of psychological interventions matures, it may become less appropriate to implement Designs involving only two groups are often studies with no-treatment control groups. 
used in the early stages of research with a given In the following example of a two group form of psychotherapy. The comparison group design, Keane, Fairbank, Caddell, and Zimer- may be a no-treatment waiting list for the ing (1989) compared implosive (flooding) duration of the treatment, treatment ªas usualº therapy to a waiting list control in 24 subjects in the treatment setting, or a nonspecific with post-traumatic stress disorder (PTSD). psychological intervention. Nonspecific psy- Interestingly, the study design had originally chological interventions are sometimes referred included a stress management group, but for Study Designs 55 unspecified reasons, this condition was not knew the treatment assignment. As discussed in successfully implemented, and those subjects Section 3.03.4.5, when treatment assignment is are not included in the report. Thus, the two- known, there is a potential bias. group comparison is between a specified treatment and a no-treatment control. Implo- sive therapy was manual driven, following a 3.03.5.2 Multiple-group Comparisons manual (Lyons and Keane, as cited in Keane et al., 1989), and included between 14 and 16 Multiple-group comparisons offer the op- sessions. The experimental group received portunity to test hypotheses simultaneously baseline, post-treatment (elapsed time from regarding comparisons of interventions (speci- baseline not reported), and a six-month follow- ficity of effects) and of intervention to a no- up assessment. The wait-list control group was intervention control. If the interventions being assessed twice: at baseline prior to randomiza- studied derive from theoretical models, their tion and after, on average, four months. Half of comparison may test specific hypotheses re- the implosive therapy group and the majority garding the psychological mechanisms that of subjects in the wait-list group received underlie the condition being treated. 
Other anxiolytic medication because of, as the more pragmatic questions that can be addressed authors note, ª. . . concerns about placebo in multiple-group comparisons include group groups and no treatment controls . . . we didn't vs. individual contact and treatment intensity. attempt to withdraw these patients from the However, it should be noted that when intensity medications which were prescribed to themº or dosage of the intervention becomes a (Keane et al., 1989, p. 249). The authors condition, control for amount of contact is maintain that the comparison of implosive lost. Again, there is no single ªbestº design. The therapy to subjects who were receiving phar- appropriate design depends on the questions macotherapy (even if it was not systematically being asked and the appropriate questions administered) represented a more stringent test depend on the stage of research. For example, of implosive therapyÐalthough it was not a questions of treatment duration or frequency part of the original design. Subjects were are more likely to be asked after a particular assessed using well-known standardized assess- treatment has been shown to have an effect. ments scales for depression, trait and state Brown and Lewinsohn (1984) compared anxiety, and instruments specifically designed three psychoeducational approaches in treating by the investigators for the assessment of depression that all used the same course PTSD. Post-test assessments were completed materials: class, individual tutoring, brief tele- by the therapist who treated the subject in the phone contact, and a delayed treatment control. implosive therapy group and by one of the Sixty-three individuals who met diagnostic same therapists for the wait-list group. In criteria for unipolar depression and paid a addition, the subjects rated themselves on course fee were randomly assigned to one of the depression, anxiety, and satisfaction with social four groups. 
Half the course fee was refunded if adjustment in several life areas. subjects completed all evaluations. Subjects in Implosive therapy appeared to improve the three immediate treatment groups were depression and anxiety according to self-report assessed pre- and post-treatment and at two and specific features of PTSD as rated by the later follow-up points. The delayed treatment therapists. No changes in social adjustment control group was assessed at baseline and at were seen. eight weeks, the time equivalent of post- Strengths of the study include randomiza- treatment after which they received the class tion, specification of diagnosis, apparent ab- condition. All treatments focused on specific sence of dropouts in both conditions, the behaviors believed to be dysfunctional in existence of a manual for the experimental depression (although not necessarily in the treatment, and the use of both standard study subjects) and followed a syllabus that outcome measures and measures tailored to included a text (Lewinsohn, Munoz, Youngren, the specific study hypotheses. Although sub- & Zeiss, 1978) and a workbook (Brown & jects were randomly assigned to treatment, the Lewinsohn, 1984). An independent interviewer randomization was compromised by the fact completed a standardized diagnostic assessment that treatment in one of three groups to which at baseline and of symptoms at later assessment subjects were randomized was, for unspecified points. Subjects completed three standardized reasons, not fully implemented. In other words, self-report measures. The primary outcome this study was a two-group design in execution measure was a factor-analytically derived score rather than intent. Another weakness is that that included all three self-report measures. 
assessments were completed by the therapists Improvement was seen in all three immediate who treated the subjects and who therefore treatment groups on this single composite 56 Group Comparisons: Randomized Designs measure compared to the delayed treatment 3.03.5.3 Multiple-group Comparisons that group. There were no differences among the Include Medication Conditions active treatment groups either in self-report or in rate of diagnosed depression at follow-up. Although in principle, designs that compare Comparison of high and low responders to medication to psychological interventions can treatment found few differences; none were a be classified in terms of whether they are simple function of specific treatment. No detailed comparisons (Section 3.03.5.1), multiple com- assessment of psychopathology or self-report parisons (Section 3.03.5.2) or represent factorial measures was made. designs (Section 3.03.5.4), they entail unique The strengths of the study include randomi- research challenges and therefore merit separate zation, a design that tested specific effects of a consideration. The first design challenge was treatment modality and the method of delivery, introduced earlier, namely the difference in the the absence of drop-outs from treatment or model of assessing effects. Follow-up assess- assessment, specification of treatment condi- ments after treatment ends are seen as important tions, the use of standardized measures of sources of information in studies of psycholo- outcome assessment, and the use of independent gical treatments. In contrast, such assessments assessors. The major weakness of the study is are rare in studies of pharmacotherapy. Relapse not in its design but the limited use made of the or re-emergence of symptoms after treatment assessment battery. Ratings made by the discontinuation is taken as evidence of efficacy. 
independent assessors were only used to assess Therefore, discontinuation designs to study the presence or absence of diagnosable depres- efficacy of treatment are common in psycho- sion at the six-month follow-up and the use of a pharmacology. A second design challenge is single summary score for all self-report mea- whether psychological and pharmacologic sures may well conceal more than it reveals. A treatments should be introduced at the same second weakness is the absence of assessments time or are appropriate for different stages of an of implementationÐhow well the instructors illness. Third, pharmacologic studies use used the course materials in the three conditions double-blind methods as the preferred strategy and whether all three conditions were imple- to control assessor bias and therefore studies mented equally well. often rely on treating clinicians to assess It is difficult to decide whether charging outcome. Because it is not possible to blind subjects for treatment is a strength or a treating clinicians to the psychological treat- weakness. On the one hand, payment for ment they are providing, studies of psycholo- treatment is characteristic of clinical treatment gical interventions rely on independent settings. All subjects completed the study and it assessors who, although they will be blind to could be argued that motivation to receive a treatment assignment, have reduced opportu- 50% rebate on the course fee was a factor in nity to observe, and have minimal opportunity enhancing study completion. On the other hand, to develop rapport with the subjects they are the authors report that one of the major reasons assessing. For these reasons, evaluations by that screened potential participants did not independent assessors may be less sensitive than enter the study was inability to pay. Thus, the those made by treating clinicans. 
Fourth, study population may have been biased by the psychological interventions are often hypothe- exclusion of these subjects. sized to affect different aspects of outcome than This study exemplifies a model that decon- pharmacologic interventions, so that direct structs a psychotherapeutic approachÐin this comparisons of effects may be difficult. case a psychoeducational approach to the treat- Historically, pharmacologic clinical trials ment of depressionÐin order to identify which have relied more heavily on medical diagnoses treatment delivery strategy offers advantages in of mental illness than studies of psychological outcome given that the strategies differ in the interventions. But current research in both amount of clinical time and resources required pharmacologic and psychological treatment is to deliver the treatment. All subjects, including largely wedded to specific diagnoses derived the delayed treatment group, reported signifi- from the DSM of the American Psychiatric cant improvement. The implication of these Association (1987, 1994) or the ICD of the findings is that the least costly method of treat- World Health Organization (1992). Parentheti- ment delivery, brief telephone contact, should cally, one can question whether the increased be implemented. However, there was no report reliance on medical diagnostic schemes such as regarding the acceptability of the method of DSM and ICD represents an advance in studies treatment delivery and as noted earlier, there of psychological intervention. It has been was no assessment of post-treatment symptoms. argued that diagnostic specificity enhances In the absence of these data, perhaps a firm reliability and therefore enhances replication clinical recommendation is premature. and communication. On the other hand, clients Study Designs 57 who seek treatment do not necessarily seek help medication and psychological treatment. These for a specific diagnostic label. 
designs will be considered in Section A wide range of designs have been used to 3.03.5.4. examine the relationship of medications and An example of a study that compared psychological treatments. In the simplest, a medication and a specific psychotherapy is medication or placebo is added to a uniform the study of prevention of recurrence of psychological intervention, or the converse, a depression by Frank et al. (1990). Patients psychological intervention or control condition characterized by recurrent depression (at least is added to an established medication condition. two prior episodes) were treated with medica- These designs examine the additive effect of the tion and manual-based interpersonal psy- modality that is superimposed. Questions like chotherapy (IPT) (Klerman, Weissman, does naltrexone enhance abstinence in alcoholic Rounsaville, & Chevron, 1984) until symptom subjects who are in a therapeutic milieu or does remission was documented and maintained for social skills training enhance social functioning 20 weeks. One hundred and twenty-eight in schizophrenic subjects who are maintained subjects were then randomized to one of five on antipsychotic medication can be addressed in treatment conditions: a maintenance form of this manner. The major design challenges faced interpersonal psychotherapy (IPT-M) that was by such studies are: the inclusion of appropriate characterized by less frequent visits; IPT-M and outcome measures to evaluate change on an antidepressant (imipramine) at the acute psychological interventions; insuring that the treatment dose; IPT-M and medication place- treatment modality which represents the back- bo; medication clinic visits and imipramine; ground condition remains relatively constant medication clinic visits; and medication place- during the course of the study; and insuring an bo. Subjects were treated for three years. 
adequate timeframe in which to evaluate effects Therapists were trained and certified by the of psychological treatments (in the design where developers of IPT. The major difference from psychological intervention is manipulated). the published manual is described as visit The second class of studies adds a medication frequency. The major outcome examined was condition (or conditions) to a multiple-group recurrence of depressive episodes during the design such as those discussed in the previous three-year period. The two treatment arms that section. These studies examine the comparative included imipramine (IPT-M and imipramine; effects of medication and psychological treat- imipramine and clinic visits) had the lowest risk ments and face greater challenges. They have of recurrence. Mean survival time was more been called ªhorse racesº and may thereby con- than 83 weeks. The two groups that received tribute to guild conflicts between psychologists IPT-M without medication (IPT-M alone; IPT- and psychiatrists (Kendall & Lipman, 1991). M and placebo) had a higher risk of recurrence; Appropriate inclusion criteria for medication mean survival time was more than 60 weeks. and psychological treatments may differ and The lowest survival time was seen in the group compromises may be necessary. Such studies that received placebo and medication clinic will generally include double-blind conditions visits; survival time was 38 weeks on average. for medication and also benefit from inclusion The authors conclude that medication, at a of independent assessors. The studies require relatively high maintenance dose, affords the expertise in both clinical psychopharmacology greatest protection from recurrence in indivi- and the relevant psychological interventions. duals who have experienced recurrent depres- Insuring this expertise for a study usually sion. 
However, psychotherapy, absent requires close collaboration and mutual respect medication, represents a distinct advantage by investigators drawn from psychology and over clinic attendance coupled with a placebo. psychiatry. Outcome criteria need to include The study has a number of strengths. measures of dimensions that are hypothesized Patients were randomized to treatment condi- to be sensitive to change in both modalities tion. There was an extremely low dropout rate being studied and the reasons for inclusion of (17%) over the three-year study duration. The specific outcome measures should be explicit. A design included control conditions for medica- detailed review of these and other methodolo- tion (placebo) and psychotherapy was admi- gical considerations can be found in an article nistered under three conditions: alone, in by Kendall and Lipman (1991) and in the article combination with imipramine, and in combi- detailing the research plan for the National nation with imipramine placebo. The psy- Institute of Mental Health Treatment of chotherapy followed an established manual Depression Collaborative Research Program and therapists were trained by the originators (Elkin, Parloff, Hadley, & Autry, 1985). of the therapy. Survival analysis represents the Finally, factorial designs provide explicit tests most powerful statistical method for evaluating of both additive and interactive effects of risk of recurrence. 58 Group Comparisons: Randomized Designs
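To make the idea of survival analysis for recurrence concrete, the following is a minimal sketch of a product-limit (Kaplan-Meier) estimate of the proportion of subjects remaining recurrence-free over time. It is offered only as an illustration under stated assumptions: it is not the analysis reported by Frank et al. (1990), the function name `kaplan_meier` is hypothetical, and the toy data are invented for the example rather than taken from any study.

```python
from typing import List, Tuple


def kaplan_meier(weeks: List[float], recurred: List[bool]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the recurrence-free proportion.

    weeks    -- time to recurrence, or to last observation if no recurrence
    recurred -- True if a recurrence was observed, False if the subject was
                censored (dropped out or reached the end of the study)
    Returns one (week, estimated recurrence-free proportion) pair per
    distinct recurrence time, in ascending order of time.
    """
    if len(weeks) != len(recurred):
        raise ValueError("weeks and recurred must have the same length")

    # Distinct times at which at least one recurrence was observed.
    event_times = sorted({t for t, e in zip(weeks, recurred) if e})

    surviving = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for w in weeks if w >= t)
        # Recurrences observed exactly at time t.
        events = sum(1 for w, e in zip(weeks, recurred) if e and w == t)
        surviving *= 1.0 - events / at_risk
        curve.append((t, surviving))
    return curve


# Hypothetical toy data for six subjects (NOT data from any cited study).
toy_weeks = [10, 20, 20, 35, 50, 60]
toy_recurred = [True, True, False, True, False, False]

curve = kaplan_meier(toy_weeks, toy_recurred)
for week, proportion in curve:
    print(f"week {week}: estimated recurrence-free proportion {proportion:.2f}")
```

The point of the sketch is that subjects who drop out or finish the study without recurrence still contribute to the number at risk up to their last observation, which is why this approach makes fuller use of the data than a simple comparison of recurrence rates at a single time point.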

Weaknesses include the limited examination of clinical outcomes. Although the authors mention the importance of judging interepisode social functioning, the article examined only recurrence risk. Second, the study population was, by design, limited to those who both had recurrent episodes and recovered from them. Of the 230 patients who met the first criterion, fully 42% did not enter the randomized trial. This represents a substantial limitation to generalization of the findings to depression broadly defined.

3.03.5.4 Factorial Designs

Designs that include at least two factors allow for detection of main effects in each factor as well as interactive effects. In the simplest of these, 2 × 2 designs, there is a control condition or at least two defined levels of each factor, as shown in Figure 1. If the factors represent independently defined treatment modalities, then such designs can detect the ability of one treatment to potentiate the other or even to inhibit the effect of the other. For this reason, such designs are particularly suited to examining the relationship of medication and psychological interventions. Uhlenhuth, Lipman, and Covi (1969), Schooler (1978), and Hollon and DeRubeis (1981) have considered factorial models for the study of the interaction of medication and psychological treatments. According to Uhlenhuth et al. (1969), there are four possible forms that the interaction can take. These are shown in Figure 1. An additive effect is defined as the simple sum of the effects of the two treatment modalities. The effect in cell 1 is equal to the combined effects seen in cells 2 and 3. The treatments may potentiate each other. In that case, the combined effect (cell 1) is greater than the additive sum of the two treatment effects (cells 2 and 3). The treatment modalities may inhibit one another. In that case the combined effect (cell 1) is less than the effects of each treatment independently (cells 2 and 3). Finally, there may be an effect that they call reciprocal, in which the combined effect (cell 1) is equal to the main effect of the more effective treatment (the effect seen in either cell 2 or cell 3). Detection of all these effects depends on the factorial design and the presence of cell 4, which represents the combined control condition or, as it has sometimes been dubbed, the double placebo.

Figure 1

                           Psychological treatment
Pharmacologic treatment    Experimental              Control
Experimental               1 Alprazolam/exposure     2 Alprazolam/relaxation
Control                    3 Placebo/exposure        4 Placebo/relaxation

Factorial designs are potentially a valuable model for examination of other important variables in understanding treatment, for example, to control for the effect of setting in studies that are conducted in more than one clinical setting. However, the most important use, from the perspective of this chapter, is to study the relationship of medication and psychological treatment.

Marks and his colleagues (1993) studied the effects of medication and a psychological treatment for panic disorder. There was prior evidence of efficacy for both the medication, alprazolam, an antianxiety drug, and the experimental psychological treatment, live exposure. One hundred and fifty-four subjects were randomly assigned to four treatment conditions: alprazolam and exposure (AE cell 1, Figure 1); alprazolam and relaxation, the control for the psychological treatment (AR cell 2, Figure 1); placebo and live exposure (PE cell 3, Figure 1); and placebo and relaxation, so-called double-placebo (PR cell 4, Figure 1). Pharmacologic and psychological treatment lasted for eight weeks, medication was tapered during the following eight weeks, and subjects were followed for an additional six months to assess relapse or maintenance of gains following discontinuation of treatment. Both exposure and relaxation followed published guides, not treatment manuals (Marks, 1978; Wolpe & Lazarus, 1966). The study was conducted at two centers, one in London, UK, and the other in Toronto, Canada.

The primary outcome measures were: ratings of avoidance by an assessor and the subject; the number of major panics; work and social disability; and the clinician's rating of improvement. After eight weeks of treatment, there were significant main effects of both alprazolam and exposure. The effects of alprazolam, as would be expected for a pharmacologic treatment, did not persist during follow-up, whereas the effects of exposure persisted after treatment discontinuation. Interpretation of findings is complicated by the fact that there was substantial improvement in total major panic in the "double-placebo" group that received medication placebo and relaxation.

In addition to reporting statistical significance of outcome measures, the article also reports the effect size of various outcome measures (Cohen, 1988, 1992). With large samples, differences that are clinically uninteresting may be statistically significant, that is, they are unlikely to be due to chance. The effect size is less influenced by sample size and may therefore be considered a better indicator of clinical relevance. During the first eight weeks, when both treatments were being delivered and were significantly better than their controls, the effect size is larger for exposure than for alprazolam. According to the definitions of Uhlenhuth et al. (1969), this would be a reciprocal effect. Effect sizes are moderate to large for alprazolam and very large for exposure according to criteria defined by Cohen (1992). The effect sizes diminish over time for exposure but are still in the large to very large range after six months. Those for alprazolam are absent to small during the follow-up period. Effect sizes for total major panics are absent to small throughout the study because of the effect in the control group. In general, the findings show a greater effect for exposure and no consistent pattern of interaction between psychological treatment and medication.

The strengths of the study include an appropriate population for study. All subjects met stringent criteria for panic disorder and none had been nonresponders to either of the treatment modalities prior to study entry. Further, assignment to treatment was random, assessment was double-blind to medication and single-blind to psychological treatment, assessment included both self- and assessor-completed measures, and analyses addressed the range of assessments included.

3.03.6 INDIVIDUAL DIFFERENCES: WHAT TREATMENT WORKS FOR WHOM?

From a clinical perspective, perhaps the most important question regarding treatment is the appropriate matching of patient or client to treatment. The idea that a given treatment approach is appropriate for all (even all who share certain characteristics) runs counter to clinical and research experience. Although randomized experiments may not represent the optimal method for evaluating this question, they can provide useful information. Two rather different methods can be used in the context of randomized trials. The first is the use of factorial designs and the other is the post hoc examination of individual characteristics as predictors of treatment response. This section will consider these methods.

3.03.6.1 Individual Difference Variable
The use of an individual difference variable as The design reveals one of the problems a factor in a factorial design allows a direct inherent in studies designed to assess interac- experimental test of a hypothesis regarding tions of medication and psychological treat- differential effects of a patient or client char- ments. The discontinuation design, that was acteristic in relation to a psychological appropriate for the psychological treatment in treatment. This strategy is difficult to the present study, is not optimal for assessment implementÐeach client characteristic included of pharmacologic treatments and the fact that doubles the number of subjects needed in a pharmacologic effects were not maintained study. Perhaps the most common example of a after discontinuation should come as no client characteristic studied in this way is the use surprise. Analysis was restricted to 134 subjects of more than one center in a study so that the who completed at least six weeks of treatment (a generalization across centers can be tested. 16% dropout rate). Although the rate is Examples of this strategy are relatively common relatively low compared to other studies of (e.g., Elkin et al., 1989; Marks et al., 1993), but it alprazolam, and there were no differences in is unusual for hypothesized differential effects to baseline characteristics between the 20 subjects be proposed. A second variable that is some- who withdrew from the study and the 134 who times investigated in a factorial design is gender. did not, it is unclear how their inclusion in analyses might have altered the results. 
The 3.03.6.2 Post hoc Examination of Predictors of general wisdom regarding randomized clinical Treatment Response trials is unequivocal in stating that all subjects randomized must be included in analysis but a In general, post hoc analyses of individual review of a standard journal in the field, client characteristics as differential predictors of Controlled Clinical Trials, during the past four treatment response have been disappointing. years did not reveal a single specific citation. For example, several of the studies that are 60 Group Comparisons: Randomized Designs reviewed in this chapter have attempted to of the contributions made by subjects who agree identify characteristics of clients who fare to participate in experiments and improve the particularly well in the treatments studied and treatment and care of clients and patients. as has been noted in the descriptions, reliable differences have not been found. Review of the psychopharmacology literature yields a similar 3.03.8 REFERENCES conclusion. Perhaps the most reliable finding regarding American Psychiatric Association (1987). Diagnostic and statistical manual of mental disorders (3rd ed. Rev.). prediction of treatment outcome has been Washington, DC: Author. severity of symptoms. In a number of studies American Psychiatric Association (1994). Diagnostic and where symptom severity has been studied, the statistical manual of mental disorders (4th ed.). Washing- evidence suggests that it usefully discriminated ton, DC: Author. Bickel, W. K., Amass, L., Higgins, S. T., Badger, G. J., & response (e.g., Elkin et al., 1989). In that study, Esch, R. A. (1997). Effects of adding behavioral discrimination of differences between medica- treatment to opioid detoxification with buprenorphine. tions and psychotherapies was clearer among Journal of Consulting and Clinical Psychology, 65, subjects with greater symptom severity. 803±810. Difficulty in reliable identification of indivi- Brill, A. A. (1938). 
The basic writings of Sigmund Freud. New York: Modern Library. dual characteristics that predict differential Brock, T. C., Green, M. C., Reich, D. A., & Evans, L. M. response may stem from a number of causes. (1996). The Consumer Reports Study of Psychotherapy: The nihilistic view is that there simply are no Invalid is Invalid. American Psychologist, 51, 1083. reliable predictors. Two alternatives seem more Brown, R. A., & Lewinsohn, P. M. (1984). A psychoeduca- tional approach to the treatment of depression: Compar- reasonable. The first is that although studies ison of group, individual, and minimal contact may have adequate power to detect treatment procedures. Journal of Consulting and Clinical Psychol- differences, they do not have adequate power to ogy, 52, 774±783. examine individual characteristics. If this is the Campbell, D. T., & Stanley, J. C. (1963). Experimental and case, meta-analytic strategies could allow the quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching examination of data regarding characteristics (pp. 171±246). Chicago: Rand McNally. that are frequently considered such as gender, Campbell, D. T., & Stanley, J. C. (1966). Experimental and age, symptom severity, duration of symptoms, quasi-experimental designs for research. Chicago: Rand referral source, comorbid psychological pro- McNally. Cohen, J. (1992). A power primer. Psychological Bulletin, blems, or diagnoses. See Chapter 15, this 112, 155±159. volume, for a discussion of this problem and Cohen, J. (1988). Statistical power analysis for the methods to deal with it. The final alternative is behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. that client characteristics which are predictive of Cole, J. O. (1968). Peeking through the double blind. In D. treatment outcome are elusive and are not H. Efron (Ed.), Psychopharmacology. A review of progress 1957±1967 (pp. 979±984). 
Washington, DC: captured adequately in experimental studies. US Government Printing Office. Characteristics such as motivation and client± Consumer Reports (1994). Annual . therapist rapport come immediately to mind. Consumer Reports (1995, November). Mental health: Does therapy help? Consumer Reports, 734±739. Cook, T. D., & Campbell, D. T. (Eds.) (1979). Quasi- experimentation: design and analysis issues for field 3.03.7 CONCLUSIONS settings. Boston, MA: Houghton Mifflin. Department of Health and Human Services, Office of the Randomized, experimental designs represent Secretary (1994). Protection of Human Subjects. Title 45 a useful method in the evaluation of psycholo- of the Code of Federal Regulations, Sub-part 46. OPRR Reports, Revised June 18, 1991, Reprinted March 15, gical treatments. They allow unambiguous 1994. conclusions regarding the treatments that are Elkin, I., Parloff, M. B., Hadley, S. W., & Autry, J. H. studied in the subjects who receive them. This (1985). NIMH Treatment of Depression Collaborative chapter has reviewed in some detail several Research Program. Background and Research Plan. Archives of General Psychiatry, 42, 305±316. generally excellent individual studies in order to Elkin, I., Shea, T., Watkins, J. T., Imber, S. O., Sotsky, S. identify their strengths and weaknesses. The M., Collins, J. F., Glass, D. R., Pilkonis, P. A., Leber, astute reader will have noted that all the studies W. R., Docherty, J. P., Fiester, S. J., & Parloff, M. B. had both strengths and weaknesses. The goal of (1989). National Institute of Mental Health treatment of drawing attention to these is twofold. The first is depression collaborative research program. General effectiveness of treatments. Archives of General Psychia- to provide readers of the research literature with try, 46, 971±982. a framework within which to evaluate other Flick, S. N. (1988). Managing attrition in clinical research. studies. 
The second is to hope that the Clinical Psychology Review, 8, 499±515. observations in this chapter may contribute to Frank, E., Kupfer, D. J., Perel, J. M., Cornes, C., Jarrett, D. B., Mallinger, A. G., Thase, M. E., McEachran, A. the ongoing process of improving the quality of B., & Grochocinski, V. J. (1990). Three-year outcomes experimental studies of psychological treat- for maintenance therapies in recurrent depression. ments. In this way we can maximize the value Archives of General Psychiatry, 47, 1093±1099. References 61

Frank, J. D. (1971). Therapeutic factors in psychotherapy. American Journal of Psychotherapy, 25, 350–361.
Hollon, S. D., & DeRubeis, R. J. (1981). Placebo–psychotherapy combinations: Inappropriate representations of psychotherapy in drug-psychotherapy comparative trials. Psychological Bulletin, 90, 467–477.
Hollon, S. D., Waskow, I. E., Evans, M., & Lowery, H. A. (1984). System for rating therapies for depression. Read before the annual meeting of the American Psychiatric Association, Los Angeles, May 9, 1984. For copies of the Collaborative Study Psychotherapy Rating Scale and related materials prepared under NIMH contract 278-81-003 (ER), order "System for Rating Psychotherapy Audiotapes" from US Dept of Commerce, National Technical Information Service, Springfield, VA 22161.
Hunt, E. (1996). Errors in Seligman's "The Effectiveness of Psychotherapy: The Consumer Reports Study." American Psychologist, 51(10), 1082.
Keane, T. M., Fairbank, J. A., Caddell, J. M., & Zimering, R. T. (1989). Implosive (flooding) therapy reduces symptoms of PTSD in Vietnam combat veterans. Behavior Therapy, 20, 245–260.
Kendall, P. C., & Lipman, A. J. (1991). Psychological and pharmacological therapy: Methods and modes for comparative outcome research. Journal of Consulting and Clinical Psychology, 59, 78–87.
Klerman, G. L., Weissman, M. D., Rounsaville, B. J., & Chevron, E. S. (1984). Interpersonal psychotherapy of depression. New York: Basic Books.
Kotkin, M., Daviet, C., & Gurin, J. (1996). The Consumer Reports Mental Health Survey. American Psychologist, 51, 1080–1082.
Kriegman, D. (1996). The effectiveness of medication: The Consumer Reports study. American Psychologist, 51, 1086–1088.
Lewinsohn, P. M., Munoz, R. F., Youngren, M. A., & Zeiss, A. M. (1978). Control your depression. Englewood Cliffs, NJ: Prentice-Hall.
Marks, I. M. (1978). Living with fear. New York: McGraw-Hill.
Marks, I. M., Swinson, R. P., Basoglu, M., Kuch, K., Noshirvani, H., O'Sullivan, G., Lelliott, P. T., Kirby, M., McNamee, G., Sengun, S., & Wickwire, K. (1993). Alprazolam and exposure alone and combined in panic disorder with agoraphobia. A controlled study in London and Toronto. British Journal of Psychiatry, 162, 776–787.
Mintz, J., Drake, R. E., & Crits-Christoph, P. (1996). Efficacy and Effectiveness of Psychotherapy: Two Paradigms, One Science. American Psychologist, 51, 1084–1085.
Pocock, S. J. (1983). Clinical trials: a practical approach. New York: Wiley.
Pocock, S. J., & Simon, R. (1975). Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics, 31, 103–115.
Prien, R. F., & Robinson, D. S. (Eds.) (1994). Clinical evaluation of psychotropic drugs: Principles and guidelines. New York: Raven Press.
Schooler, N. R. (1978). Antipsychotic drugs and psychological treatment in schizophrenia. In M. A. Lipton, A. DiMascio, & K. F. Killam (Eds.), Psychopharmacology: A generation of progress (pp. 1155–1168). New York: Raven Press.
Seligman, M. E. P. (1995). The effectiveness of psychotherapy: The Consumer Reports study. American Psychologist, 50, 965–974.
Seligman, M. E. P. (1996). Science as an ally of practice. American Psychologist, 51, 1072–1079.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752–760.
Uhlenhuth, E. H., Lipman, R. S., & Covi, L. (1969). Combined pharmacotherapy and psychotherapy: Controlled studies. Journal of Nervous and Mental Diseases, 148, 52–64.
Waskow, I. E., & Parloff, M. B. (1975). Psychotherapy change measures. Washington, DC: National Institute of Mental Health, US Government Printing Office.
White, S. J., & Freedman, L. S. (1978). Allocation of patients to treatment groups in a controlled clinical study. British Journal of Cancer, 37, 849–857.
Wolpe, J., & Lazarus, A. (1966). Behaviour therapy techniques. Oxford, UK: Pergamon.
World Health Organization (1992). International classification of diseases (ICD-10) (10th ed.). Geneva, Switzerland: Author.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.
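As a supplementary sketch to Section 3.03.5.4, the four interaction forms described by Uhlenhuth et al. (1969) can be written as a small decision rule over the four cell means of a 2 × 2 design. The code below is an illustration only, not a procedure taken from any of the studies reviewed. It assumes that higher scores mean more improvement, that each single-treatment effect is positive and is measured against the double-placebo cell (cell 4), and that a hypothetical tolerance `tol` decides when two effects count as "equal". The function name and the label "unclassified" (for patterns the four definitions do not cover) are likewise assumptions.

```python
from math import isclose


def classify_interaction(cell1, cell2, cell3, cell4, tol=0.05):
    """Label the interaction form in a 2 x 2 design from its four cell means.

    cell1: both active treatments, cell2 and cell3: one active treatment each,
    cell4: combined control ("double placebo").
    Assumption: higher means indicate more improvement, and effects are
    expressed as differences from cell4.
    """
    combined = cell1 - cell4          # effect of the combination
    effect_a = cell2 - cell4          # effect of the first treatment alone
    effect_b = cell3 - cell4          # effect of the second treatment alone
    additive_sum = effect_a + effect_b
    best_single = max(effect_a, effect_b)

    if isclose(combined, additive_sum, abs_tol=tol):
        return "additive"             # combination equals the sum of both effects
    if combined > additive_sum:
        return "potentiation"         # combination exceeds the sum
    if isclose(combined, best_single, abs_tol=tol):
        return "reciprocal"           # combination equals the more effective treatment
    if combined < min(effect_a, effect_b):
        return "inhibition"           # combination falls below each single effect
    return "unclassified"             # pattern not covered by the four definitions


# Hypothetical cell means, chosen only to show each branch.
print(classify_interaction(10, 6, 4, 0))
print(classify_interaction(14, 6, 4, 0))
print(classify_interaction(6, 6, 4, 0))
print(classify_interaction(2, 6, 4, 0))
```

Without cell 4 the single-treatment effects cannot be computed at all, which is why detection of these forms depends on the combined control condition. In practice the comparison would rest on a statistical test of the interaction term rather than on a fixed tolerance.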

3.04 Multiple Group Comparisons: Quasi-experimental Designs

HANS-CHRISTIAN WALDMANN and FRANZ PETERMANN
Universität Bremen, Germany

3.04.1 INTRODUCTION
  3.04.1.1 The Relevance of Experimentation in Clinical Psychology
  3.04.1.2 Terms: "True," "Quasi-," and "Non-" Experimental Studies
  3.04.1.3 The Concept of an "Effect"
3.04.2 THE LOGIC OF DESIGN IN THE COURSE OF EMPIRICAL RESEARCH
3.04.3 CRITERIA: SENSITIVITY, VALIDITY, AND CAUSALITY
  3.04.3.1 Sensitivity
  3.04.3.2 Validity
  3.04.3.3 Causality, Practice, and Pragmatism
3.04.4 CONTROL IN QUASI-EXPERIMENTAL DESIGN
3.04.5 A PRIMER SYSTEM OF DESIGN
  3.04.5.1 Selections from the General Scheme: Nonequivalent Group Designs
  3.04.5.2 Using a Pretest
    3.04.5.2.1 Using pretest as a reference for gain scores
    3.04.5.2.2 Using pretests as approximations to initial group equivalence
    3.04.5.2.3 When not to use a pretest
    3.04.5.2.4 Outcome patterns
  3.04.5.3 Using Multiple Observations
  3.04.5.4 Using Different Forms of Treatment
  3.04.5.5 Using Different Forms of Comparison Groups
  3.04.5.6 Using Combinations of Different Designs
  3.04.5.7 Regression Discontinuity
3.04.6 ANALYSIS
3.04.7 CONCLUSION
3.04.8 REFERENCES

3.04.1 INTRODUCTION growing utilization of psychological aids, how- ever, must be justified on grounds of the 3.04.1.1 The Relevance of Experimentation in scientific method. There is a double need for Clinical Psychology valid and reliable demonstration of the value in technology derived from and offered by clinical Contributions of clinical psychology to psychology: efficacy in serving the customer and health services have received widespread re- efficiency legitimizing it to the supplier must be cognition and the benefits are undisputed. The shown. Quasi-experimental multiple group parallel increase of costs in such services due to comparison designs are well-suited for this

63 64 Multiple Group Comparisons: Quasi-experimental Designs task: they account for the fact that randomiza- comprehend the identification of latent traits tion may often not be an adequate sampling and their structural composition (factor analy- strategy in real-life clinical research situations sis), the identification of the dimensional metrics and still allow for assessment of absolute and according to which subjects determine the relative efficacy of psychological interventions. relative value of different objects (multidimen- They help evaluate the merits of different sional scaling/conjoint measurement), or in the components in interventions and support identification of a time series regression model in decision making as regards implementation, order to predict value and stability of a criterion. monitoring, and optimization of respective In neither case is a treatment strictly required. programs. This practical, if not wholly prag- matic, research and reasoning contributes directly to either correction or refinement of 3.04.1.3 The Concept of an ªEffectº the present state of scientific knowledge and development in the domain of clinical psychol- Multiple groups comparison experiments are ogy. It is practice where results from clinical conducted in order to demonstrate relative psychology are to be evaluated, and it will be effects of treatment. Effect hypotheses refer to seen that multiple group comparisons offer presence, size, and variability of effects with adequate tools to do so. respect to certain criterion measurements. The major objective of multiple group comparison designs lies with the generation of data in order 3.04.1.2 Terms: ªTrue,º ªQuasi-,º and ªNon-º to test such hypotheses. 
Subjects or groups Experimental Studies methods representing different levels of inde- pendent variables or being subjected to In this chapter, multiple group comparisons different kinds of treatment, eventually mea- will be focused on by means of quasi-experi- sured at different points of time, are compared mental designs. Delimiting characteristics to on data obtained in the same way in each distinguish the three kinds of empirical research group and at each occurrence. Then a classical situations quoted above are subject assignment, concept of an ªeffectº is identified with the modus of treatment, and control. It is agreed interaction of the subject or group factor and widely to speak of true experiments when both time/treatment with respect to the outcome random subject assignment and creation of measure. For example, the idea of a case± treatment or, in a broader sense, active manip- control study to determine therapy effective- ulation of conditions occur. In a quasi-experi- ness might be pointed out as ªif you apply X mental setup the first condition is weakened, and (treatment, independent variable) in one group, in observational studies both conditions are while not in another, and you observe a completely released. Some authors refer to the significant difference in Y (some functional term quasi-experimental as to the constraint that status, dependent varable) between both the researcher cannot deliberately assign sub- groups, then you have reason to believe that jects to conditions or that occurrence of a there is a relationship between X and Y, given treatment is only a typical but not necessary all else being equal.º It will be seen that it is the feature (e.g., Spector, 1981). In this case, the ceteris paribus condition in the given clause, differentiation to pure observational studies and how it is tentatively realized, that makes seems obsolete. 
It is argued that any experi- (quasi-)experimentation a delicate challenge to mental situation, be it ªtrueº or ªquasi,º involves researchers' creativity and expertise. an arbitrary discernible kind of treatment, by A worked example of a quasi-experiment is creation or manipulation of conditions regard- chosen to introduce into the present chapter an less of mode of subject assignment, and that illustration of some of these challenges and ways quasi-experiments feature all elements of true to counter them. experiments except randomization. Worked example: Outline If there is no manipulation of subjects or conditions in order to evoke variation in In a very well designed study, Yung and Kepner dependent measures (and thus contrasts for (1996) compared absolute and relative efficacy of testing), reference is made to ªpureº observa- both muscular and cognitive relaxation procedures tional studies (see Chapters 4 and 13, this on blood pressure in borderline hypertensives volume). The main problem then lies with employing a multiple-treatment±multiple-control extension of design Number 6 in Figure 2. It is selecting a statistical model and further in reported that most clinical applications make use devising an appropriate parametrization that of an amalgam of cognitive techniques like sugges- reflects as closely as possible the structure of tion, sensational focusing, and strict behavioral scientific hypotheses (in their original, verbal training of muscle stretching relaxation. The form). The objective of such inquiries may authors aim at partialing out the effects of these The Logic of Design in the Course of Empirical Research 65

various components by clearly separating instruc- observations from practice must serve as a tions that target muscles directly and such instruc- starting point for the research process. tions that try to mediate muscle tension by mental (ii) Relative to marginal conditions (re- efforts. As a consequence, special attention is given sources, time, etc.) and constraints in construct to operational validity and treatment integrity. In operationalization, a subject matter model is order to counter confounding of subject charac- teristics and procedure variables the authors devise derived in terms of variables and their inter- a rigorous subject selection and assignment policy relation. Temporal ordering and other prere- which makes their work an excellent example for quisites of model theory guaranteed, the case here. predictions regarding empirical manifestations are derivable (ªshould then be higher thanº). Before engaging in describing various designs Devise a design now in order to translate a and methods of data analysis, and in evaluating hypothesized structure of variables into a this particular experiment, it seems instructive measurable one and a sampling strategy in to outline a general process of experimental order to actually perform the measurement. research. Hypotheses stating that certain strategies of Generally, it will not be possible to observe all verbal intervention in psychotherapy are of use possible instances of variable relations (which for depressives might be translated into a would innecessitate inference), but evidence for ªbetter-offº proposition as quoted in the text the hypotheses will be sought by investigating above. It is clear that, at this level, an experi- subsets of the population(s) in which these ment rationale is suggested: comparison of relations are considered to hold true. Sampling groups. 
Also, global aspects of testing are variability and measurement error lead to the implied: assessment strategy (will psychometric use of statistical models in order to determine tests be applied or will therapists rate patients' the genuity of an observed effect. Effect sizes depressiveness by virtue of their expertise?), or (ES) serves to model the above ªdifferenceº in inference conditions (will an effect be required terms of statistical distributions (which, in turn, to instantiate simultaneously on all criteria and provide both the mathematical background and to be stable for at least a year, thus calling for the factual means for testing effect size). The multiple post-tests, or will the whole study be significance of statistical significance and the replicated using different subjects and other concept of reason based hereupon are evaluated therapists?). In transition to the next level, a set in Chapter 14, this volume, and the issue of of decisions to be made is encountered that are interpreting such a relationship as causal is known to statistical consultants as ªhow manyº briefly discussed in Section 3.04.3. But how does questions in project and design planning (how one arrive at effect hypotheses and correspond- many subjects are enough per group in order to ing statistical models in the first place? balance sensibly factors of sensitivity?, how many time intervals make up the optimal retest interval?, how many outliers can be tolerated in and across groups?, etc.). 3.04.2 THE LOGIC OF DESIGN IN THE (iii) Multiple group comparison as a con- COURSE OF EMPIRICAL ceptual strategy implies suppositions about RESEARCH structures in the data. Here, conceptually identified effect sizes receive an accurate, It is suggested that experimentation as a quantitative, and thereby testable formulation. 
procedure is a special realization of a general A ªpre±postdifferenceº between groups may process of empirical research, as depicted in now be specified in terms of an expectation of a Figure 1. Whether experimentation is seen as an difference in central tendency of an outcome appropriate access to address the problems in variable, a kind of second-order comprehensive question depends on preceding decisions; if so, measure that can be tested easily for signifi- choice of design and sampling structure is cance at the next level. More generally, statis- conceived as an embedded level. Furthermore, it tical predictions formally denote compound is proposed that this procedural outline is suppositions about the ªbehaviorº of data, and paralleled by a conceptual process of hypoth- a ªgoodº design generates data behavior that eses derivation. Figure 1 illustrates this corre- by maximum degree either confirms or contra- spondence with hypotheses levels according to dicts behavior that would be expected by the deductive four-step model presented by hypothesis. Hager (1992): (iv) Statistical hypotheses are statements (i) Generally agreeing with the hypothetico- about specific values, variability, and ordinal deductive method, scientific hypotheses are or metric relations of parameters of distribu- derived from proposition structures in theory. tions which follow from statistical predictions. Sometimes, mere need for evaluation or striking A common prototype for such a measure is 66 Multiple Group Comparisons: Quasi-experimental Designs

Emergence of need for research

Primary research question Theory: Substantial hypotheses

Operationalization Operationalization into subject matter model:

substantial prediction via logic of contrasts Design/sampling strategy

Diagnostic devices/measurement Selection and parametrization of a statistical model:

statistical hypotheses statistical data modeling

statistical testing of parameters Statistical predictions : ES=0

inference/decision

Figure 1 The process of empirical research.

given by $ES = (m_{\mathrm{treat}} - m_{\mathrm{control}})/s_{\mathrm{control}}$. Its problems notwithstanding, this is considered a fairly reasonable and intuitive basic concept. Various other ES formulations can be translated to it (Cohen, 1988; Lipsey, 1990). For instance, in the example of therapy effectiveness, an effect is assumed to be present when statistical testing has determined that its ES differs from zero significantly, that is, beyond the expected limits of sampling error and with prespecified error probabilities of the α- and β-kind. In fact the probability is always estimated that there is no effect (therefore speaking of "null" hypotheses) or that, as it is frequently put in terms of sampling theory, comparison groups are samples that were drawn from the same population. In this case, observed differences merely reflect sampling error. Depending on the formulation of predictions, it is not always obvious at first sight whether these "null" probabilities are required to be high or low.

Further complicating things, a statistical prediction must sometimes be further differentiated into a set of single-testable statistical hypotheses. It must then be decided whether an overall null (or alternative) hypothesis conforms to the compound statistical prediction or what pattern of acceptance or rejection of individual hypotheses integrates to a favorable finding. Moreover, there is a need to decide whether integration of separate findings should follow a disjunctive (compensatory) or conjunctive rule (e.g., must depressives benefit on all scales applied in order to justifiably speak of an effect). As a consequence of this multiple testing, there is a need to control for accumulation of error probabilities.

Note, however, that "difference" does not necessarily imply the comparison of means. In most cases, pretest-standardized differences in means are in fact used for this purpose because such measures have known statistical properties (e.g., distributions and standard errors) following normal theory and can thus be easily tested statistically. But a test might be made for a ratio to be greater than a prespecified number (as is done frequently when using differential model fit across groups as a criterion) or the inequality of regression intercept and slope parameters (as featured in regression discontinuity designs, see Section 3.04.5.7). The general rationale is to determine whether or not groups differ in any statistical parameter and to relate this difference to a reference sampling error estimate. (For details on procedures of significance testing see Cohen, 1988; Fisher, 1959; Giere, 1972; Mayo, 1985.)

Criteria like adequacy (on each level, hypotheses preserve to maximum degree classificatory, ordinal, or metric relations assumed on the preceding level), sufficiency (a linearly predictable trend may not be tested for monotonicity only), decisiveness and unambiguous identification of theory-conforming empirical states (Hager, 1992) serve to estimate validity of hypothesis derivation. After statistical testing, however, there is a need to work backwards: what effect size is considered practically significant? Is there a failure in predicting things as they should occur in theory? Do the findings entitle inductive inferences to be drawn to higher levels in the hierarchy of hypotheses, leading to tentative confirmation of propositions in the theoretical frame of reference? This introduces some of the most serious problems in philosophy of science, and there is a need to go still further into them before turning to techniques.

3.04.3 CRITERIA: SENSITIVITY, VALIDITY, AND CAUSALITY

Conventionally, the major objective of experimentation is seen in demonstrating a causal relationship between phenomena, or, as Cook and Campbell (1979) put it in more correct terms, in facilitating causal inference. The logic underlying experimental design is an implicit variation of the concepts of covariation, temporal precedence, and of ruling out alternatives. This means, in terms of experimentation, that the hypothesized "cause" should be capable of being omitted and control group designs are preferable. One-shot case studies or pretest–post-test designs without at least one control or, in a broader sense, a comparison group, in no way allow for the kind of inference scientists are most interested in: causal ones. But, as was seen from Figure 1, experimentation ends up in statistics. The result of an experiment is that data are communicated in the form of statistical propositions. Whether statistical relationships should be taken as indicators only or be conceived as "causal," "functional," or "probabilistic" cannot be determined by simply rating how well the rules of experimentation were obeyed, but depends on affiliations in philosophy of science. Criteria for when to apply the predicate causal, and be justified in doing so, are hardly available. But before drawing any such inferences it will be necessary to (i) ensure sensitivity in order to detect the effect and (ii) rely on it being a valid indicator for what is to be interpreted as causal.

In the following section, a start is made on practice by introducing the key concepts whose consideration will hopefully establish a sensitive, valid, and meaningful design. It is clear, however, that an evaluation of meaning must rely on protoscientific argumentation.

3.04.3.1 Sensitivity

Sensitivity refers to the likelihood of detecting an effect in sample-based investigations, given it is indeed present in the population from which samples were drawn. It is determined by sample size, magnitude and direction of effect size, heterogeneity of both subjects and treatment applications, and various undesired features of diagnostic devices (unreliability, etc.). While those factors affect experimental precision, others like prescribed error probabilities of both α and β type and the kind of statistical test used affect the associated concept on the level of statistical analysis: power. It must be borne in mind that sampling error stands for the variance of parameter estimates and is, thus, maximally dependent on sample size. As a consequence, an effect size however small will become significant in statistical terms with increasing sample size, while being wholly uninformative for practical considerations.

For an ES of the kind presented above less than or equal to 0.3 to attain statistical significance at α = 0.05, some n = 300 subjects would be in order; for an ES of 0.5 still a hundred or more. In a meta-analysis on psychotherapy efficacy Grawe (1992) found that 87% of 897 studies were using samples of n = 30 or fewer subjects in intervention groups. To attain statistical significance at the usual level an effect size of greater than 1.20 would then be required. This may be considered quite an unlikely prerequisite: in their meta-analysis on psychotherapy research Smith, Glass, and Miller (1980) reported an overall expectation of treatment effectiveness of no more than ES = 0.85. A tentative counterbalancing of the determinants of effect sizes other than the "true" effect (population effect), namely sample size, error probability, and assumed ES, is best achieved using power tables as proposed in Cohen (1988).

Since detecting an effect usually means observing an effect size that statistically differs from zero, the levels of design and data analysis in Figure 1 become mixed up when evaluating the whole procedure of quasi-experimental multiple group comparisons. It is the correspondence of design structure "giving" the data and statistical data model that establishes a "good" design. Sometimes data obtained from one design can be translated into different models (problem of parametrization) and effect sizes may be evaluated for their significance by various statistical tests (problem of relative power).

3.04.3.2 Validity

Validity refers to the likelihood that the effect detected is in fact the effect of interest. Campbell and Stanley (1963), and Cook and Campbell (1979), originators of the concept, distinguish internal, external, and statistical validity as well as construct validity. The latter refers to operationalization problems (mono-operationalization bias, monomethod bias, evaluation apprehension, experimenter expectancies, etc.). Internal validity evaluates to which extent variation in measured dependent variables is attributable solely to variation of independent variables or treatment, respectively. Again, assuming an effect for a difference in the value of some statistical parameter calculated from data on dependent variables (e.g., the mean), internal validity assures that nothing but treatment "caused" this difference, and that, as Lipsey (1990) puts it, the observed difference parallels the fact that "the only difference between the treatment and the control conditions is the intervention of interest" (p. 11). This, of course, relies heavily on the experimenter's ability to control for other possible "sources of variation" or undesired confounding influences.

External validity refers to the extent to which results obtained from a sample-based study may be generalized to a specified population from which samples were drawn in the first place and/or across slightly different (sub)populations. Such extrapolation must be reasonably justified as regards target population (i.e., persons), settings, and future time. Note that representativeness is a prerequisite to external validity that not only holds for subject selection but also for situative conditions of experimentation itself: ecological validity must be given careful consideration especially in clinical fields.

Campbell and Stanley (1963) have raised the argument that internal validity can be thought of as a necessary while not sufficient condition for external validity. Note, however, that some methods of control depicted in the next section show conflicting effects on internal and external validity, and to compensate for these imbalances with a reasonable trade-off certainly is an art of its own. Most flaws in experimental and quasi-experimental design boil down to violations of factors of validity, which is why a short outline is presented (see Table 1) of their respective factors as designated by Cook and Campbell (1979). A far more comprehensive listing of possible threats to validity and various other shortcomings in planning experimental research is presented in Hoole (1978). Special attention to external validity in evaluation contexts is given in Bernstein (1976).

Admittedly, full consideration of all these caveats, if not mere apprehension of this rather lengthy list, seems to discourage any use of (quasi-)experimental designs at all. The contrary is true, though. There will be a benefit from

Table 1 Factors of internal and external validity.

Internal validity

History: In the time interval between pretest and post-test measure, various influences or sources of variation that are not of interest or that may even distort "true" findings may be in effect besides the applied treatment. This clearly leads to confounding. Also, these factors may be differently effective for different subjects, thereby introducing additional error variance. Controlling for such intrusion by elimination or by holding it constant to some degree is an appropriate method in laboratory settings but is almost impossible to achieve in real-world clinical (field) settings (apart from identifying such factors as relevant in the first place). Moreover, many disorders are known to be subject to seasonal variation or other periodicities.

Maturation: In longitudinal studies with fairly large retest intervals, effects may be confounded with or entirely due to "natural" change in subjects' characteristics that are correlated with the variables and treatments of interest. A prominent example in clinical research is spontaneous remission in psychotherapy studies. The hope is that such change takes effect equivalently in both groups of a case–control study (which again is rather a "large sample" argument but is usually held true for randomized samples).

Mortality: In repeated measures designs, attrition or loss of a certain proportion of subjects is to be expected on a regular basis. Differential loss, however, means that the observed drop-out pattern across groups seems not to be random but is dependent on variables that might correlate with treatment or outcome measure. This in turn introduces post hoc selection artifacts and is likely to erroneously contribute to effect size in that it may change value and variability of post-test statistics.

Statistical regression: A typical problem with measuring change lies with regression towards the mean. Subjects may show gain or loss in scores from pretest to post-test solely due to the unreliability of the test. This may be compensated for to a certain degree by multivariate assessment of dependent variables. Moreover, differential effects are likely to fall into place depending on subjects' pretest or starting-off scores: "viewed more generally, statistical regression (a) operates to increase obtained pretest–post-test gain scores among low pretest scorers, since this group's pretest scores are more likely to have been depressed by error, (b) operates to decrease obtained change scores among persons with high pretest scores since their pretest scores are likely to have been inflated by error, and (c) does not affect change scores among scorers at the center of the pretest distribution since the group is likely to contain as many units whose pretest scores are inflated by error as units whose pretest scores are deflated by it."ᵃ

Testing, instrument reactivity, and sensitization: In repeated measures designs subjects are likely to become familiar with diagnostic devices or may carry over pretest scoring behavior. In addition, there are many ways that the testing procedure itself may affect what is tested. Items that evoke strong emotional reactions may distort scoring in subsequent ability subtests by increasing subjects' unspecific arousal. Moreover, pretests may sensitize subjects for treatment reception by enhancing attention for certain attributes or by facilitating self-communication and introspection. Sometimes it is possible to control for this effect in quasi-experimental settings by applying strategies similar to the Solomon four-group design (see Chapter 3, this volume).

Selection/interactions with selection: Selection is a threat to internal validity when an effect size (a difference in some parameter across groups) reflects pre-experimental differences between cases and controls rather than differential impact of treatment. Note that inference from such experimentation relies heavily on the ceteris paribus condition (denoted in the given clause in the outline in Section 3.04.1.3), and that systematic differences across groups due to selection bias are most pervasive in quasi-experimentation where we have to do without randomization. This is especially true in the case of clinical treatment: one would not want to allot in-house patients or program participants to different treatments (or exclude them to serve as controls) by random numbers but according to their individual needs and aptitude. Patients assigned for cases on behalf of equal (that is, equivocal) diagnoses may yet still differ from other patients in their group (error variance) as well as from controls in a whole pattern of other variables that are likely to be related to both reception of treatment and outcome measures (e.g., hospitalization, socioeconomic status, verbal fluency, etc.). Note that most factors of internal validity listed so far may interact with subject selection, producing differential effects and thereby increasing experimental error.

Table 1 (continued)

Instrumentation: In the first place it must be ensured, of course, that diagnostic devices meet criteria of reliability, objectivity, and validity in terms of measurement and testing theory. In repeated measures designs, discrimination and differentiation capabilities of tests may change. Thus a gain in scores from one occasion to another may partly be due to enhanced performance of observers who improved skills with practice, and less due to differential effects of applied treatment at these occasions. A related problem lies with so-called floor or ceiling effects. Subjects scoring at the upper end of a scale on one occasion, say 6 on a 1–7 Likert scale, may easily report a decrease by half but will find it difficult to state a major increase of similar proportion. Note that clinical samples are often defined by a shift in mean in certain indicators with respect to a "normal" population.

Control group behavior: Besides the problem of recruiting and maintaining a control group, "compensatory rivalry" of respondents receiving less desirable treatment and "resentful demoralization" of such subjects are further threats to internal validity. In the first case, being assigned to the nonreceiving group may motivate subjects to behave competitively so as to reduce the "true" effect (due to treatment) in the counter-hypothesized direction. In the second case, the deprived get more deprived and less motivated: a patient who suffers from a disease is assigned to the control group and thus kept from a therapy that is thought to be efficient by the experimenter. The "lose-heart" effect would then artificially enlarge effects. Both variants are of great practical importance in therapy effectiveness research.ᵃ

External validity

Interaction of pretest and treatment, reactivity: It is well known that mere knowledge of participating in a study or the awareness of being observed does affect behavior. Furthermore, due to sensitization effects or conditioning effects (see above), results obtained from samples that have completed a pretest cannot be fully generalized to populations that have not done so.

Interaction of selection and treatment: People who deliberately ask for psychological care or for access to health care programs are likely to be more susceptible to treatment or more motivated than are accidental samples of convenience (collective of inpatients, rooms, etc.). Often, selection processes that have led to admission of subjects are unknown. It is clear, however, that results obtained from such samples cannot be generalized unreservedly.

Interfering treatments: Parallel, overlapping, or interacting treatments lead to confounding and thus constitute a major threat to internal validity. But there are undesirable consequences for external validity as well. Suppose patients are receiving medical treatment in ascending dosage. Due to idiosyncratic effects of cumulation, conclusions drawn with respect to an intermediate level of dosage that generalize across people may be erroneous.

Nonrepresentative samples: Generally, target populations must be clearly specified in advance. In quasi-experimental design, representativeness cannot be ensured by random sampling on a large scale; instead, advantage is taken of "natural" (nonmanipulative) sample formation. Often characteristics of the research context naturally differentiate subjects into groups (rooming policies of hospital, record analysis, etc.). Strictly speaking, since there is no real sampling there is no point in making inferences. But this aside, sampling should not refer to subjects only: generalizing also includes care givers (therapists, etc.), hospitals, and regions. Therefore, another strategy lies with deliberate sampling for heterogeneity with respect to certain attributes relevant to the research question. Inference would then be across a wide range of units holding subsets of these attributes, but still limited to the respective superset. When dealing with clinical samples, this method seems fairly reasonable.

ᵃ Source: Cook & Campbell (1979).

it in that the degree to which attention is given to these issues prior to experimentation parallels the degree to which results can be interpreted and inferences drawn after experimentation. Factors of validity thus serve as criteria for comparative evaluation of quasi-experimental studies; they are not meant as a checklist for building the perfect one (remember that several factors are mutually exclusive). Another precaution must be added. While the bearings of internal validity on technical issues in research planning are widely agreed upon and factors listed above have nearly become codified into a guide on design construction, most important issues of construct validity often receive less recognition. But re-examining the procedural structure of empirical research, as sketched in Figure 1, the view that emerges is that everything begins and ends with valid operationalization of constructs into measurable variables, and of hypothesized relations into exercisable methods for "visualizing" them. Greater appreciation of construct and derivation validity is advocated before turning too readily to tactics and techniques of experimentation. For instance, one should examine treatment integrity well before constructing sophisticated application schedules in an interrupted time series design (see Section 3.04.5.3). Nonetheless, within the entire validity framework, causal inference is predicated on maximum appreciation mostly of internal validity. The question is what is meant by causal.

3.04.3.3 Causality, Practice, and Pragmatism

However sensitive and valid a design might be by these terms, an effect or statistical relationship cannot be interpreted as truly causal in an objectivist ontological sense or in the traditional Humean readings (for an introduction, see Cook & Campbell, 1979, and Salmon, 1966). Therefore, and for the following reasons, a wholly alternative view is taken on these issues in the tradition of James and other pragmatist reviewers of methodology. Scientific inquiry and its results (e.g., explanations) should always bear technological relevance. As one consequence, the importance of manipulation is re-emphasized as a feature of treatment in quasi-experimentation. Consider the following argument raised by Collingwood (1940):

Suppose one claimed to have discovered cause of cancer, but added that his discovery would be of no practical use because the cause he had discovered was not a thing that could be produced or prevented at will. His discovery would be denounced as a sham. (p. 300)

Collingwood's point is that causal explanations are often valued for the leads they give about factors that can be manipulated. If the essential cause of some effect does not imply controlling the effect (e.g., cancer) through manipulating some factor, then knowledge of this cause is not considered useful by many persons (Cook & Campbell, 1979). Scientific inquiry and quasi-experimental research designs are conceived as concretization devices, as tools to obtain technological knowledge. Thus, research designs need to take into account the feasibility of manipulation of conditions that produce effects in order to at least allow for transfer into practice. It is in practice that results from research in clinical psychology should be evaluated, which is why a concept of causality and explanation is needed that is of relevance.

Modern philosophers and methodologists have demonstrated that pragmatist reasoning can be adapted successfully to methodology in social sciences, and have done so down to the level of hypothesis testing (Putnam, 1974). Of course there is no pragmatic explanation in its own right. But our attempts to explain phenomena on experimental or even statistical grounds can be justified pragmatically by introducing criteria beyond a "logic of discovery":

The first and ineliminable criterion for the adequacy of an explanation must be based on what it does for a man who wants explanation. This depends on contextual factors that are not reflected in the forms of propositions or the structure of inferences. (Collins, 1966, p. 140)

Knowledge of causal manipulanda, even the tentative, partial and probabilistic knowledge of which we are capable, can help improve social life. What policy makers and individual citizens seem to want from science are recipes that can be followed and that usually lead to the desired positive effects, even if understanding the micromediational is only partial and the positive effects are not invariably brought about. (Cook & Campbell, 1979, p. 28)

If efficacy of, say, a new drug has been demonstrated on a statistical basis by experimentation, it is not possible, strictly speaking, to conclude that deficiency in the effective substance of the drug caused the disease or that the substance now reversed the pathogenetic process to regeneration. Having controlled for side effects, it is perfectly justifiable to maintain the drug as a successful therapy (and raise funds for further elaboration which might give new insights for science as well as for refinement of the drug). On a statistical level this "better-off" thinking in interpreting results from case–control studies translates into the concept of statistical relevance (Salmon, 1966) or into the notion of Granger causality. The latter derives from the theory of linear systems and defines causal relevance of independent variables ("causes") in terms of increased predictability of value and variability of dependent variables ("effects"). Prognosis should then be more precise with a relevant variable included in the model than with the variable removed, and good prognosis is definitely a very useful outcome of research (Granger, 1969). Good prognosis is, after all, a feature of operational models. This is what should be expected from experimentation in clinical psychology: identification of factors that influence etiology and the spread of disorders and derivation of operational models that in turn might help to confront them.
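The idea of causal relevance as increased predictability can be sketched numerically. The following Python example is a simplified illustration, not Granger's (1969) original procedure: it simulates two series under assumed coefficients, fits a restricted model (the outcome predicted from its own past) and a full model (adding the lagged candidate "cause") by ordinary least squares, and compares residual sums of squares with an F-type ratio. All names, the lag of one, and the simulated data are assumptions made for the illustration.

```python
import random

def ols_rss(rows, y):
    """Residual sum of squares of a least-squares fit of y on the columns of rows."""
    k = len(rows[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            f = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= f * a[col][c]
    # Back substitution for the coefficient vector.
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (a[i][k] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

# Simulated data (hypothetical): y depends on its own past and on lagged x.
rng = random.Random(0)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 1))

target = y[1:]
restricted = [[1.0, y[t - 1]] for t in range(1, n)]          # without x
full = [[1.0, y[t - 1], x[t - 1]] for t in range(1, n)]      # with lagged x

rss_restricted = ols_rss(restricted, target)
rss_full = ols_rss(full, target)
df_resid = len(target) - 3                                   # full model has 3 parameters
f_stat = (rss_restricted - rss_full) / (rss_full / df_resid) # one added predictor
```

A large F-type ratio indicates that prognosis of y improves when the lagged x is included, which is the pragmatic sense of "causal relevance" described above; it says nothing about the underlying mechanism.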

3.04.4 CONTROL IN QUASI-EXPERIMENTAL DESIGN

Control aims at enhancing and balancing factors of internal and external validity following the max-con-min-rule formulated by Kerlinger (1973):
(i) maximize impact of independent variables on dependent ones,
(ii) while holding constant factors of systematic intrusion, and
(iii) minimizing the influence of unsystematic variables or error.

What is meant by factors, influences, and error? Control is always exerted over independent variables that contribute to effects. It is instructive to refine the term variable into four types relevant in experimentation. Systematic variance denotes differences between groups and, with reference to the definition of an effect size in Section 3.04.1.3, thus indicates the presence of "true" effects. In concept, it is solely attributed to explanatory variables, indicating treatment. Extraneous variables, however, are not a primary focus of a study but in cases of substantial correlation with dependent ones introduce error variance and do bias estimates of effects. Such variables may either be subjected to control or be left uncontrolled in one of two ways, depending on both sample size and subject assignment policy. In the latter case, randomized variables are let run freely in the hope that their effects will cancel out in the long run and thus equate groups in a probabilistic sense, while confounded variables constitute a threat to validity that cannot be ruled out. It follows that controlled variables are extraneous but their impact on measures of effect is accounted for because error variance is reduced. If such erroneous influences are known to operate (by background knowledge introduced into the subject matter model, or simply by intuition), the techniques presented in Table 2 are used or at least an attempt is made to obtain accurate measures on them. Statistical techniques may, then, be used in order to separate post hoc the effect of treatment (in terms of ES) and selection differences.

In quasi-experimentation, control is exercised by means of subject selection or statistical analysis rather than by manipulating conditions. Because no randomization is used in quasi-experimental settings to wash out all initial differences in and across groups and because an effect has been defined in terms of post-test differences, credibility of findings will depend on the ability to demonstrate that groups have been as alike as possible except for the difference in treatment they received. It will be noticed, however, that pretest equivalence does not lead, as might be implied by name, to equivalent groups. There may still be reliable and substantial differences on other factors relevant but left unconsidered (or even unmeasurable if known) that affect the phenomenon in question. The argument by Cook and Campbell (1979) is followed that only understanding the process of subject selection allows full understanding of subject differences that cannot be attributed to treatment, and that randomization must be considered the only model of such a process that can be completely understood.

The following is a worked example of control by subject selection and assignment:

Of 307 recruited individuals, only 40 were eligible for participation. Persons showing habituation effects were excluded, because their high pressure in multiple baseline readings was assumed to be a reaction to the measurement situation itself (sensitization). To control for treatment interference persons currently engaged in other pressure control techniques were also excluded. Persons on medication were asked to present medical records to assure constant dosage for the time of their participation (history). The authors cite evidence for a linear relationship between blood pressure level prior to behavioral intervention and treatment change. They further attribute contradictory results in recent research on relaxation efficacy to the common practice of matching experimental groups with respect to the average on certain blood pressure parameters, which leaves differential treatment susceptibility in higher degrees of hypertension uncontrolled. Subjects were therefore orthogonally matched on systolic blood pressure, age, and sex to assure pretest equivalence.

Experimental error also leads to bias in effect size estimates. While disturbing marginal influences (disruptions due to periodicities in hospitals' daily routines, noise, etc.) may be held approximately constant or be eliminated, thereby having their influence equated in and across groups, other unintended variations of experimental conditions (e.g., treatment inflation or interferences) and measurement effects are to a far lesser extent under the control of the experimenter. But in many field settings there may be no desire to exert artificial control over such (in theory) irregular components of overall effect. It may be preferable to preserve or enhance external or ecological validity when finally turning to generalization. Since practice is defined as primary, however, bias due to naturally varying concomitants is something to be faced. There is no need to put research back into the laboratory by excellent control when findings need to be transferred into applications for a noncompliant world.

The worked example is continued to demonstrate trial structure and procedural control:

Table 2 Controlling subject-induced extraneous variables.

Matching: The rationale of matching is to fully equate groups with respect to pretest scores, so that differential post-test scoring could legitimately be interpreted as differential change. The more variables are called in for matching, the greater the shrinkage in sample size that must be expected, sometimes deflating degrees of freedom to an extent where statistical analysis is no longer possible. Moreover, such samples may no longer be representative of the target populations to which one finally intends to generalize findings. There is also great appeal in the idea of matching subjects with themselves across treatment levels as is the case, for example, in crossover designs.

Parallel groups or aggregate matching: This means of control is, in a way, a weaker version of matching that resorts to group statistics rather than to individual values. The influence of confounds or, more generally speaking, nuisance factors is considered not relevant if it is distributed evenly across groups. If such a variable has equal distribution in each group (with respect to first- and second-order moments, e.g., mean and variance), groups are said to be parallel on this variable. When dealing with intact groups or when assignment is guided by need or merit, this strategy would entail exclusion of subjects in one group in order to approximate the other or vice versa. Again, sample size would be reduced and groups may finally differ on other variables by selective exclusion. Control always seems to trade off with external validity, notably generalizability. Parallel groups, however, are much easier to obtain than samples matched by score.

Blocking: Blocking further draws variance out of the error term when nuisance factor(s) are known and measurable on a categorical level. The rationale is to match subjects with equal score on this measure into as many groups as there are levels of the nuisance factor. If, for example, $n_j = 10$ subjects are to be assigned to each of $J = 3$ treatment conditions (factor A), and another factor B is known to contribute to variance, one would obtain $K = 10$ levels of factor B with $n_k = 3$ subjects each and then randomly assign $n_{jk} = 1$ subject per B-level to each of the $J$ levels in A. An increase in the number of blocks will further decrease error variance since scores within a block of subjects matched this way are less likely to vary due to selection differences to the same extent than would have to be expected in blocks comprising more or all (no blocking) subjects. Note that in this model blocks represent a randomly sampled factor in terms of mixed-model ANOVA ᵃ and thus can enter statistical analysis to have their effect as well as possible interactions with treatment tested for. As a result, blocking could be viewed as a transition from matching (above) to factorization (below). On the other hand, matching is but an extreme case of blocking where any pair of matched subjects can be considered a separate block.

Factorization: Confounding can be completely removed from analysis when confounding variables are leveled (measured by assignment to categories or by recoding their continuous scale into discrete divisions) and introduced into the design as a systematic source of variation. This generally results in factorial designs where effects of factors can be quantified and tested (main effect and interaction in fully crossed designs, main effect only in hierarchical or incomplete structures). ᵃ ᵇ ᶜ

Analysis of covariance: While factorization is a pre-experimental device of control (it gives, after all, a sampling structure), analysis of covariance serves to post hoc correct results for concomitant variables or pretest differences. ᵈ

Holding constant/elimination: If hospitalization must be expected to substantially mediate results, a most intuitive remedy lies with selecting subjects with the same in-house time. Strictly speaking, there is then no way of generalizing to other levels of the hospitalization variable.

Sources: ᵃ Hays (1990). ᵇ Pedhazur & Pedhazur (1991). ᶜ Spector (1981). ᵈ Cook & Campbell (1979).
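The blocking scheme described in Table 2 ($J = 3$ treatments, $K = 10$ blocks of three subjects, one subject per block and treatment) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function name, the simulated nuisance scores, and the choice of forming blocks by rank order on a single measured nuisance variable are not prescribed by the chapter, and the number of subjects is assumed to be a multiple of the number of treatments.

```python
import random

def blocked_assignment(scores, n_treatments, seed=0):
    """Rank subjects on a nuisance score, cut the ranking into blocks of
    n_treatments neighbors, and randomly assign one subject per block to
    each treatment. Returns {subject_index: (block, treatment)}."""
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    assignment = {}
    for start in range(0, len(order), n_treatments):
        block = order[start:start + n_treatments]
        treatments = list(range(n_treatments))
        rng.shuffle(treatments)  # random assignment within the block
        for subject, treatment in zip(block, treatments):
            assignment[subject] = (start // n_treatments, treatment)
    return assignment

# Hypothetical nuisance scores (e.g., weeks of hospitalization) for 30 subjects.
score_rng = random.Random(1)
scores = [score_rng.gauss(50, 10) for _ in range(30)]

plan = blocked_assignment(scores, n_treatments=3)
group_sizes = [sum(1 for _, t in plan.values() if t == j) for j in range(3)]
```

Because every block contributes exactly one subject to each treatment, the treatment groups end up with equal sizes and closely similar distributions on the nuisance score, so its variance can be drawn out of the error term as a block factor.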

All subjects received treatment, placebo, or pure assessment in equal time spacing in eight experimental sessions. They were instructed to replicate these sessions for 30 days after the laboratory trial and to keep a record of their practice. After this follow-up period, all subjects were finally assessed on their blood pressure. Though this procedure implied a loss in standardization, it certainly enhanced ecological validity and increased the clinical relevance of the study's findings. Light and temperature were held constant in all measurement situations, and all equipment was visually shielded to prevent feedback effects. Measurement equipment was calibrated before each trial. All participants spent equal time in the measurement setting and were blinded with respect to assessment results until completion of the measurement occasion. To control for experimenter effects, subjects received live relaxation instructions in three sessions followed by five sessions with taped instructions. This procedure further improved standardization of treatment situations. A training time of 18–20 minutes was justified as optimal by referring to recent findings that shorter training lacks any effects and that longer training often results in loss of motivation to practice, increased dropout rates, and reduced stability of effects over time.

3.04.5 A PRIMER SYSTEM OF DESIGN

As will be obvious from Figure 2, basic forms of designs mostly differ in whether or not they use randomization to assure initial group equivalence and control groups to determine effects, and whether or not they include pretests, multiple observations, multiple forms of treatment (applied, removed, reversed, faded, etc.), and different controls. Following on from this, the regression discontinuity approach is presented as leading to a class of designs in its own right.

In the case of more than one explanatory independent variable, additional decisions must be made regarding logical and temporal order of treatment application (factorization into fully crossed and balanced plans, incomplete and hierarchical plans, balancing sequence effects, etc.). These topics are more fully covered in Chapter 4, this volume. Note that many important practical considerations and implications for the choice of a statistical model for subsequent analysis are hidden completely in what is called the "data acquisition box" (or "O" in standard notation, see below). Here decisions must be made on (i) the number of dependent variables (uni- vs. multivariate analysis), (ii) various parameters of repeated measurement (number of occasions, lengths of retest intervals, etc.), (iii) order of dependent measures and error modeling (e.g., use of latent variable models), and (iv) scale characteristics of dependents (parametric vs. arbitrary distribution models).

Most authors follow the standard ROX terminology and notation introduced by Campbell and Stanley (1963): R stands for randomization, O stands for an observation or measurement occasion, X stands for treatment or intervention. Temporal sequence is expressed by left-to-right notation in a row; separate rows refer to different groups.

3.04.5.1 Selections from the General Scheme: Nonequivalent Group Designs

The designs shown in Figure 2, while not a focus of this chapter, will now be considered. Designs one and four are considered "pre-experimental" by Campbell and Stanley (1963):

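Columns: single group study; multiple group comparisons as "true" experiments; multiple group comparisons as quasi-experiments. A slash separates the rows (groups) of a design.

Post-test only
Design 1 (single group study): X O
Design 2 ("true" experiment): R X O / R O
Design 3 (quasi-experiment): X O / O

Pretest and post-test
Design 4 (single group study): O X O
Design 5 ("true" experiment): R O X O / R O O
Design 6 (quasi-experiment): O X O / O O

Time series data
Design 7 (single group study): OO...X...OO
Design 8 ("true" experiment): R OO...X...OO / R OO......OO
Design 9 (quasi-experiment): OO...X...OO / OO......OO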

Figure 2 A basic system of designs.

findings from such studies are most ambiguous since almost any threat to internal validity listed above may be in effect and left uncontrolled. As a consequence, and because experimentation is meant to demonstrate effects by assessing comparative change due to treatment while ruling out alternative explanations, these designs are commonly held inconclusive of effects. They are not recommended for the purposes sketched in Sections 3.04.1 and 3.04.3. Design three does implement an independent control group, but still remains inconclusive as regards the genuineness of effects due to selection differences. In design four, subjects serve as their own control. With two measures only, little can be inferred about true effects since maturation and other time-relevant threats to validity remain uncontrolled. Extending the principle to design seven, however, leads to a typical time series design for basic impact analysis that to some extent allows for more control. Suppose a trend line is fitted to both consecutive pretests and post-tests: a discernible shift in intercept and/or slope parallel to treatment application would indicate presence of an experimental effect (a conceptually different, but technically similar concept will be discussed with the regression discontinuity approach). Without determination of such a "natural" trend a higher post-test level might reflect what should be expected from extrapolation of regularities in pretest values only, disregarding treatment. In other words, dependent variables might show a change in value that is due to natural change over time (inflation, deflation, or periodicities of the phenomenon under study). Using multiple observations does not rule out this possibility as would using multiple observations and control groups, but still makes it highly unlikely for such a trend to go undiscovered. Some authors argue that lack of a control group may be of less importance for this reason. Delay in effect instantiation or special forms of temporal calibration of effects can be modeled by univariate intervention (or interrupted time series) analyses (see McDowall, McCleary, Meidinger, & Hay, 1980). Note, however, that group designs are still being dealt with exclusively where N > 1. Single-subject designs, even when presented in the same notation, employ another rationale of contrasting and inference and are covered in Section 3.04.2. Multiple-group extensions to this design are covered in Section 3.04.5.3.

Designs two, five, and eight have the most desirable properties: they implement randomization, repeated measurements, and control groups, with multiple observations additionally available in the last one. These designs are discussed in Chapter 4, this volume. From the preceding, it is clear that the rightmost column contains three nonequivalent group designs which do implement multiple group comparisons. Design three, however, is subject to the same reservations made for designs one and four. It lacks both randomization and pretesting as an approximation to the same end. Therefore designs six and nine, when used with appropriate care and caution, permit the testing of causal hypotheses, with causality being conceived as in Section 3.04.3.

The worked example is continued with aspects of design:

In quasi-experimental design, the treatments to be compared serve as independent variables and experimental groups are built by assigning subjects to treatment conditions by nonrandom procedures. In the worked example study by Yung and Kepner (1996), the treatment factor comprised stretch release (SR), tense release (TR), and cognitive relaxation (COG) as well as a placebo attention group and a test-only control group (TOC) that remained wholly untreated. The placebo group was given a medicament by a physician with the ruse that it would reduce blood pressure in order to control for positive treatment expectancy effects. The control group was given pure assessments of blood pressure to control for effects of the measurement process itself. Systolic and diastolic blood pressure as well as heart rate were chosen as dependent variables, entailing a multivariate setup for subsequent statistical analysis. All subjects were measured repeatedly on several occasions (multiple baseline readings, with assessments before and after trial, follow-ups), which gives a multiple-treatment–multiple-control extension of design six in Figure 2.

Note that the study outlined above does in fact realize most of those items that make up the following sectional structure: conduct of pretests, multiple observations, different forms of treatment, different forms of control groups, and, introduced post hoc by means of statistical analysis, even combination of different designs.

3.04.5.2 Using a Pretest

3.04.5.2.1 Using pretest as a reference for gain scores

The major step from design three to design six clearly lies with the introduction of pretesting, seemingly indicating analysis of change. A natural concept of an ES would be derived by subtracting pretest averages from post-test means and normalizing the result with respect to a control group for a comparison. There has been extensive work on the question whether or not to use difference scores as an estimate of ES. Some main issues addressed in this context comprise the fact that difference scores (i) carry the unreliability of both their constituents, (ii) depend on correlation with initial (pretest) scores, (iii) depend on scale characteristics, and (iv) are subject to regression towards the mean. Various alternative techniques have been proposed among which are (i) standardizing prior to calculating differences, (ii) establishing "true" difference scores (Lord, 1963), (iii) using regression residuals (Cronbach & Furby, 1970) and, (iv) most commonly used, regression adjustment by means of analysis of covariance. Further, in terms of mathematical models, specification of correlation structure for error components of total scores, decomposition of error itself into various time-depending and static components, as well as the formulation of model-fit criteria and appropriate statistical testing, with serial dependence of data contradicting the "usual" (IIND) assumptions on error terms, are of major concern when analyzing change. These items denote not mere problems of statistical theory, but reflect substantial problems about a concept of empirical change. Here, however, the concern is not with all the complications that arise when trying to measure change but with the logic of design. For an introduction and further reading on the above issues, see Eye (1990), Diggle, Liang, and Zeger (1994) and for special topics as indicated Kepner and Robinson (1988) and Thompson (1991) (ordinal measures), Hagenaars (1990) and Lindsey (1993) (categorical data), Christensen (1991) and Toutenburg (1994) (generalized linear models), Vonesh (1992) (nonlinear models), Farnum and Stanton (1989) and Box, Jenkins, and Reinsel (1994) (forecasting), Petermann (1996) and Plewis (1985) (measurement).

3.04.5.2.2 Using pretests as approximations to initial group equivalence

In Section 3.04.2 the concept was introduced of an ES as the observable and testable formulation of what is sought in experimentation: effects. As can be seen from Figure 2, repeated measures are not strictly required to demonstrate effects. As is the case in design three, one may observe differences in dependent measures across independent samples that underwent two different treatments (or none for a control) and state that one group scored higher on the average (this being favorable). Drawing any inferences or preferring this treatment over the other (or none) as a result of such a post-test-only design would rest entirely on the assumption that groups were equivalent with respect to dependent variables prior to treatment application. In "true" experiments, random subject assignment and/or large samples may be relied upon to justify this assumption, but in quasi-experimental design performing a pretest is strictly required to obtain meaningful results. One still does not need to take gain scores or measures of different growing rates as a base for an effect size. In Section 3.04.2, they were not. But post-test differences can only be interpreted as monitoring effects when initial group equivalence is given as a common ground for comparison. As a consequence, subject assignment in quasi-experimental design is based upon techniques that require a pretest (see Table 2).

3.04.5.2.3 When not to use a pretest

Research situations can be imagined where conducting a pretest may not be recommended for even stronger reasons. First, suppose that the phenomenon in question is likely to change qualitatively rather than quantitatively. A pretest would then presumably measure something different than the post-test. Second, a pretest may interact with treatment by differential sensitization (see factors of internal validity) or by producing such bias that results could not sensibly be attributed to treatment anymore (e.g., when using attitude tests). Third, retesting may be impossible because testing can only be set up once on the relevant variable, because measurement must be assumed to distort or annul the phenomenon, or because no parallel form of a test is available. In such cases, the use of a proxy variable might be considered which, measuring something different but conceptually related to post-test criteria and, thus, (hopefully) being correlated to it in the statistical sense, serves as a substitute for evaluating initial group equivalence as a prerequisite for interpreting post-test differences (Rao & Miller, 1971). In extreme cases, when any kind of pretesting, be it on behalf of proxies or using the post-test instrument, seems impossible, using separate samples "from the same population" for pre- and post-test may be in order. From the preceding it is clear, however, that this variant of the basic design (six) provides a very weak basis for inference.

Selection cohort designs as proposed in Cook and Campbell (1979) can be understood as a conceptually improved (while more demanding) extension of separate sample designs. When using self-selected samples it is obvious that subjects collected from cohorts that fall into place with regular periodicity because of characteristics of natural environment can be considered more alike than separate samples with obscure nonrandom selection processes. As this is most typically the case for institutions with cyclical (e.g., annual) turnover, like schools, such designs have been labeled "recurrent institutional cycle designs" (Campbell & Stanley, 1963).

3.04.5.2.4 Outcome patterns

It is trivial to state that informational complexity of outcome pattern depends on the factorial complexity of design. It should be borne in mind, however, that sophisticated group and treatment structures do not necessarily entail higher levels of information or yield better causal models of a post-test. For example, the higher the order of interaction allowed in fully crossed factorial designs, the less readable the results obtained will be and the more difficult their interpretation. Pedhazur and Pedhazur (1991) give examples of how adding predictors to regression equations alters coefficients of formerly included predictors and changes overall results. Though the foregoing statements about validity and control may suggest a kind of "the more the better" strategy when designating independent and dependent variables for a design, the ultimate purpose of all modeling, simplification, should be recalled. The ultimate prerequisite in modeling, however, is of equal importance to the same end: unambiguous specification. Design six, for instance, when including two pretest occasions and being carried out on near-perfectly matched groups of sufficient size with near-constant treatment integrity, clearly outperforms a time series extension of the same design (see below) with uncontrolled subject attrition and carry-over effects.

A naturally expected, and in most cases desired, outcome of basic design six would be obtained when the average post-test response of the treatment group is found to differ significantly in the desired (hypothesized) direction from the control group post-test scoring, given pretest equivalence and given zero or nonsignificant trend in the control group pretest–post-test mean scores. But there are several variations of that scheme, largely due to selection–maturation interaction, regression artifacts, and scale characteristics (see Table 1). Cook and Campbell (1979) discuss at length such different outcomes of design six in its basic form (parallel trend lines, differential growing rates for groups, nonconstant controls, controls outperforming cases, treatment leading to post-test equivalence, trendline crossover ["switching means"]). Testers should not, however, as might be suggested by these authors and the foregoing presentation of all those threats to validity, too readily attribute findings not in line with expectations to design misspecification, artifact operation, and limitations of diagnostic devices. It is true that there is always our obligation to evaluate and justify the whole procedure of research instead of its findings only. But having revised the process of hypotheses derivation that leads to design, and to data as a result, and having confirmed derivation validity to a maximum extent attainable within the limits set by a concrete research context, there should be no complaints about reluctant and unruly reality.

3.04.5.3 Using Multiple Observations

It is obvious that design nine is but a generalization of standard design six, taking advantage of multiple observations pointed out above, and being subjected to the flaws of multiple repeated measurements. While the extension is straightforward as regards the logic of contrasts (the rationale for multiple group comparisons), the notion of an effect and the corresponding ES is more difficult to define. Time series analysis is usually carried out to detect and analyze patterns of change over time or to build models for forecasting. Multiple group comparisons enter when chosen to define an effect in terms of different time series evolution for cases and controls, conditional on a pre-intervention phase (baseline) and triggered by differential onset, intensity, and duration of treatment. Two major problems must be considered, then: first, how should treatment and observation phases be scheduled in the course of the whole study in order to set contrasts for obtaining effects while accounting for time-relevant threats to validity (see Section 3.04.5.4 using multiple forms of treatment); second, how to statistically analyze serially dependent data and interpret contrasts in the data in terms of within-subject (within-group) vs. between-subject (between groups) patterns (see Section 3.04.6).

The major benefit of using multiple observations has been presented in Section 3.04.5.1 (control for maturation and regression). Adding a control group further permits control of history and instrumentation, for there is no a priori reason to suppose that untreated subjects experienced different environmental conditions during the study or that they should react differently to repeated measures. While nonspuriousness of treatment effects justifiably may be assumed by reference to significantly different pretreatment levels or increased/decreased slope of baseline, temporal persistence of effects can never be assured without comparison to an untreated control group.

As it is intended to dedicate the section on analysis mostly to handling design six and its nontime-series (i.e., simple pretest–post-test) extensions, a few words on statistical analysis of time series experiments are in order here. For more detail see Glass, Wilson, and Gottman (1975), Gottman (1981), and Lindsey (1993). Comparisons of intercepts (level), slopes (trend) and variance (stability, predictability) are mostly used for contrasting pre–post phases. Note that effects in time series designs need not instantiate immediately with treatment application and persist. Impact of treatment may be delayed, with gradual or one-step onset, and various forms of nonconstancy show up over time (linear, nonlinear, or cyclic fading or gain, etc.). But such rather sophisticated parameters of time series data will only be used rarely for ES definition: statistical tests for directly comparing oscillation parameters or analytically derived characteristics like asymptotic behavior (e.g., speed of point convergence) are not readily available. While most one-sample time series parameters of interest to researchers concerned with intervention analyses can be identified, estimated, and tested for by transfer function models, between-sample comparisons often reduce to fitting an optimal model to one group and observing that fit indices decrease when obtained from application of the same model in another group. Cross-series statistics are, after all, hard to interpret in many cases (they require consideration of lead-lag relations, etc.). For an introduction to interrupted time series analysis for intervention designs see McDowall et al. (1980) or Schmitz (1989); illustrated extensions to nonlinear and system models are presented in Gregson (1983).

3.04.5.4 Using Different Forms of Treatment

In this section concerns about implementing different forms of the same treatment instead of adding treatments as in a factorial design will be discussed. In basic pretest–post-test designs of type six, treatment is applied only once. Hence, variations concern application or removal only. Graduation in treatment application or withdrawal calls for multiple groups to undergo such graduation. This would be achieved, in concept, by stacking k–1 pretest–post-test designs without a control group over standard design six for k levels of treatment to be tested. The resulting model would conform to standard one-way analysis of variance for difference scores with a fixed or randomly sampled treatment factor (see Section 3.04.6). In time series designs, however, some more challenging variations are at hand that further increase validity and clarity of inference, if properly implemented. Figure 3 gives an overview.

Design 10 replicates design four, with treatment removed in the second block in order for this block to serve as a "same-subjects" substitute for an otherwise nonavailable control group. Careful consideration must be given to the following questions:

(i) Are treatment effects of a transient nature, and can they be made to fade out (it is supposed, generally, that persistence of intervention effects is desired)? Ethical concerns, financial questions and potential attrition effects could be added to hesitations about mere feasibility of such designs.

(ii) Is it reasonable to expect that average scoring will return to a near-baseline level after removal of treatment?

(iii) Will it be expected that, after reintroduction of treatment as in design 12, effects will reinstall the same way (delay, size, stability, etc.) as in the first turn? (It is supposed that the reader is acquainted with general replication effects in experimenting like learning and resentment.) Then, and especially in case of high-amplitude treatment and obtrusive measurement, construct validity will be seriously affected.

If answers are favorable to these concerns then designs are at hand that are particularly useful in evaluation of recurrent intervention components. Next should be considered the reversed treatment nonequivalent control design 11 with pretest and post-test. Cook and Campbell (1979) suggest that this design outperforms standard design six as regards construct validity since a treatment variable has to be specified rather rigorously in order to determine its logical reverse and an operational mode for it to be applied. Nevertheless the additional use of a wholly untreated control group is strongly advised in order to arrive at clear-cut results, in case trend lines do in fact depart from (assumed equivalent) pretest levels in opposite directions but in slopes that differ with respect to size and significance. Including a common reference for contrasting provides a very strong basis for inference. Finally design 13 (interrupted time series with switching replications) is considered. Due to delayed treatment application, one group can always serve as a control during treatment periods of the other ("reflexive controls," Rossi, Freeman, & Wright, 1979; "taking-turn controls," Fitz-Gibbon & Morris, 1987). The particular strength of this design lies with its capability to demonstrate simultaneously effects in two different settings and points of time with greatest parsimony. Still it is possible to overcome most threats to validity that are faced when using the above designs.

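[Figure 3 diagrams, in ROX notation: design 10 (treatment removed in the second block); design 11 (reversed treatment): O X+ O / O X– O; design 12, panels a and b (treatment removed and reintroduced); design 13 (interrupted time series with switching replications): OOOOOOOOOXOOOO / OOOXOOOOOOOOOO.]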

Figure 3 Designs using different forms of treatment.
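
The comparison of level and slope before and after treatment (Section 3.04.5.3) can be sketched for the two series of design 13 as a segmented regression. The sketch rests on strong simplifying assumptions that are not part of the chapter: a linear trend, independent errors (serial dependence is ignored), and noise-free made-up data. The function names solve, fit_segmented, and simulate are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_segmented(y, t0):
    """Least squares fit of y_t = b0 + b1*t + b2*D_t + b3*(t - t0)*D_t,
    where D_t = 1 from observation t0 onward (after treatment), else 0.
    b2 is the shift in level, b3 the change in slope."""
    rows = []
    for t in range(len(y)):
        d = 1.0 if t >= t0 else 0.0
        rows.append([1.0, float(t), d, (t - t0) * d])
    p = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def simulate(t0, level_shift, slope_change, n=13):
    """Noise-free made-up series: baseline 10 + 0.5*t plus an effect from t0 on."""
    return [10 + 0.5 * t + ((level_shift + slope_change * (t - t0)) if t >= t0 else 0.0)
            for t in range(n)]

# Design 13: one group is treated after 9 observations, the other after 3.
group_a = fit_segmented(simulate(9, 3.0, 0.2), t0=9)
group_b = fit_segmented(simulate(3, 3.0, 0.2), t0=3)
```

If both fits return a comparable level shift and slope change at their respective interruption points, the effect has been replicated at two points in time, which is the parsimony argument made for switching replications. Real data would call for models that respect serial dependence, as discussed in the references given in Section 3.04.5.3.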

3.04.5.5 Using Different Forms of Comparison Groups

It has been pointed out frequently that the use of a control group is indispensable as a baseline for comparison in most cases of quasi-experimental research, and highly advocated in any other. Effects should be demonstrated by either relative change or post-test difference to rule out alternative explanations besides treatment impact. While this is widely agreed, there has been some controversy as regards various qualitative characteristics of controls. Are controls really untreated as assumed for inference? How can control groups be maintained (they do not receive, after all, any benefits of treatment)? Is there such a thing as placebo treatment? Do waiting groups (subjects scheduled for a subsequent cycle of treatment, see design 13) differ in any substantial respect from wholly unaffiliated pools of control? In many cases the nature of a control group is determined simply by contextual factors like availability in terms of total number of subjects, financial resources, ethical considerations, or the time horizon of the study.

While "real" control groups receive no treatment at all, placebo groups receive only irrelevant treatment. Conceptually, there are two forms of such irrelevance. In the first case, intervention neither in theory nor in statistical terms relates to the treatment variable of interest and is only meant to minimize motivation effects like resentful demoralization. A second version refers to treatments that in fact can be, or even are, expected to contribute to effects by operating through subjects' expectancies and external attributions that are hypothesized to go along with actual treatment. Perhaps the most prominent example is medication with a sugar pill. Note that such placebos work on psychological grounds and usually are implemented to separate such effects from "real," physical, or pharmacological treatment. As a consequence, it might be difficult to define a psychological placebo in psychotherapy research. For further detail on the issue of placebo groups, see a well-received article of Prioleau, Murdock, and Brody (1983), defending admission of such placebo treatment, and subsequent peer commentaries in The Behavioral and Brain Sciences.

Waiting control groups differ from untreated groups in that they must be supposed to expect benefits from treatment to come. As a consequence, after assignment, such subjects may show more sensible behavior in any respects related to health and well-being or more medical compliance (thus "raising" baselines into trendlines). Waiting groups are, however, easiest to recruit and maintain, they naturally fall into place when treatment cannot be delivered unlimitedly, and there is no way of avoiding selection for admission. Still another concept of a control group emerges when using qualitatively different treatments in comparison groups, that is, distributing treatment across groups. A new way of treating a disease might, for instance, be compared against current or standard treatment, thus highlighting the relative merits of the new and saliently different features of the innovative treatment. Technically, within the ANOVA framework introduced below, treatment would represent a fixed factor with more than two levels (treat1, treat2, etc.). It is clear that adding a further untreated control group is still preferable (see Section 3.04.5.6).

3.04.5.6 Using Combinations of Different Designs

There are three convincing arguments for combining designs. First, combining complete designs rather than extending one of their internal features (as discussed in the above subsections) will generally enhance control. Second, broadly defining designs themselves as methods and using different methods and research plans in addressing the same question and phenomenon will always decrease monomethod bias and increase both construct validity and generalizability of results. Third, combining designs offers refined testing of hypotheses. For example, in Section 3.04.5.4 it was proposed to multiply stack design four over design six. Consider now the application of different dosages of a drug as treatment variable. Given initially comparable groups (in the present, quasi-experimental sense), trend hypotheses about dosage could even be tested. Also, integral features of basic designs as depicted in Figure 2 may be tested for their mediating effects. For example, if design six is stacked over design three a composite design is obtained that allows for separation of pretest and treatment effects and is named after its proposer Solomon (1949). Note that there are two fully crossed factors: whether treatment is applied or not (case–control factor), and whether pretesting is administered or not. This design may now be conceived as factorial (see Chapter 3, this volume) and analyzed for the main effect of treatment (disregarding effects of pretest sensitization and the like), the main effect of pretesting (effect obtained without regard to treatment), and their interaction (treatment effects depend on pretest). If statistical testing has determined that such interaction is present, consideration of the main effects due to one sole factor (e.g., treatment) is sure to yield misleading results and inferences. But the latter is always true in cases where any design is used that lacks both randomization and pretest. As a consequence, this rather sophisticated design is mostly applied in research situations that allow for randomization. Trivial as it may seem, another way of combining designs lies with replicating the whole basic design using the same samples and measurement occasions by simply obtaining multivariate measures of post-test indicators and, thus, testing multiple predictions concerning the same outcome criterion.

3.04.5.7 Regression Discontinuity

As if to provoke further discussion in order to put an end to presumed underutilization of regression discontinuity (RD) designs, Trochim (1984) stated that RD "is inherently counterintuitive . . . not easy to implement . . . statistical analysis of the regression discontinuity design is not trivial . . . [there are] few good instance of the use" (pp. 46–47). So what is the point in these designs? In structure and objective, RD designs conform to the layout of design six. The major difference lies with assignment policy: while randomization is believed to guarantee fully equivalent groups (in the long run, that is) and pretest equivalence in nonequivalent group designs assures comparability with respect to the variables in question only (leaving obscure the presence and influence of other selection processes), it is maximum nonequivalence that serves as a rationale of regression discontinuity designs. This is achieved by ordering subjects according to their scores on the continuum of a pretest scale and subsequently defining a cut-off point for division into further subsets to be assigned to different conditions. Most commonly, subjects scoring below the cutting point would then be assigned to one group, subjects scoring above this point to another. If a "sharp" cut-off point is not desirable on theoretical grounds or because of known unreliability regions of scales, a midscale cut-off interval might be used instead, with interval width depending on estimates of measurement error, similar to a standard deviation. Subjects within this range would then be assigned randomly to either treatment or control group condition. While it is true that the process of subject selection is still only imperfectly controlled because merely one factor (the dependent variable) is accounted for, it is perfectly known since it is specified entirely by the researcher. It is clear that, in order to prevent cut-off point selection being rather arbitrary, strong theoretical grounds or consent on practical concerns (limited therapy resources, etc.) are called for.

Basically, analysis of RD designs means testing hypotheses of equality of regression equations across groups similar to analysis of covariance. The objective is to ask whether the trend or regression line obtained from analysis of the treatment group pretest and post-test scores is displaced significantly from the respective control group regression line. If displacement is due largely to greater (or lesser) intercept, while slopes are statistically equal (parallel line regression), a main effect has been successfully demonstrated. Stated differently, with no effect present, a single regression line would equally fit scores of both treatment and comparison group and, as a consequence, a trend in the treatment group could be predicted from the control group scores. Interaction effects are revealed by discontinuity in slope of regression lines at the cut-off point. Note that, technically, regression need not necessarily be linear but can include parameters of any order, but that extrapolation of a trend line becomes extremely difficult then and often leads to false conclusions about effects (see Pedhazur & Pedhazur, 1991). These problems notwithstanding, more frequent use of these designs is recommended. In many intervention research contexts, subjects not only are in fact but also should be assigned to either treatment or control groups by need or similar, perhaps medical, criteria related to both treatment and outcome.

3.04.6 ANALYSIS

Having successfully derived and implemented a multiple group comparison design there is now a need to fit a statistical model to the data

respective conceptual analogies when ordinal or categorical data have been obtained. Since there are many textbooks on statistical data analysis that cover technique in detail, attention is drawn merely to the principle and major formulae, and detail is dispensed with (references are given, however).

It is important to understand that until now discussion has been located entirely in the realm of logic. Nothing has been implied as regards scale and complexity of measurement and data. The point is that anything can be stored in O (Figures 2 and 3) and that structural complexity of whatever resides in this data acquisition box is logically independent of complexity in design structure. Enlarging designs by stacking subelements simply adds data boxes and obtains lots of data and can sometimes lead to technical difficulties in statistics, but both of these features should not complicate the idea or inference behind the material. In fact, there is only one condition: statistical parameters obtained from the data must be comparable on formal (i.e., mathematical) grounds (distributions, standard error of estimate, etc.). Now the dominance of comparison-of-means-style models is readily explained: differences in means of normally distributed variables again follow a normal distribution, and departure from zero is most easily tested using direct probabilities. In essence, value and standard error of any parameter's estimate should be determined and related to each other (e.g., in a ratio, giving the standard t-test in case of metric dependents) or to another pair of parameters (e.g., in a double ratio, giving the standard F-test, respectively).
Interaction of whatever resides in this data aquisition box is effects are revealed by discontinuity in slope logically independent of complexity in design of regression lines at cut-off point. Note that, structure. Enlarging designs by stacking sub- technically, regression need not necessarily be elements simply adds data boxes and obtains linear but can include parameters of any order, lots of data and can sometimes lead to technical but that extrapolation of a trend line becomes difficulties in statistics, but both of these extremely difficult then and often leads to false features should not complicate the idea or conclusions about effects (see Pedhazur & inference behind the material. In fact, there is Pedhazur, 1991). These problems notwithstand- only one condition: statistical parameters ing, more frequent use of these designs is obtained from the data must be comparable recommended. In many intervention research on formal (i.e., mathematical) grounds (dis- contexts, subjects not only are in fact but also tributions, standard error of estimate, etc.). should be assigned to either treatment or Now dominance of comparison-of-means-style control groups by need or similar, perhaps models is readily explained: differences in means medical, criteria related to both treatment and of normally distributed variables again follow outcome. normal distribution, and departure from zero is most easily tested using direct probabilities. In essence, value and standard error of any 3.04.6 ANALYSIS parameter's estimate should be determined and related to each other (e.g., in a ratio, giving Having successfully derived and implemented standard t-test in case of metric dependents) or a multiple group comparison design there is to another pair of parameters (e.g., in a double now a need to fit a statistical model to the data ratio, giving standard F-test, respectively). 
But obtained in order to test the hypotheses on the there are many other ways besides of comparing ªfinalº deductive level in the process model of distributions, some less strict but more straight- empirical research (Figure 1). In particular, forward or creative. there is an interest in deriving ESs and whether Why not obtain entire time series for one they are different from zero. Note that from an single data box in a multiply repeated measure- epistemiological point of view, use of a different ments design? A time series model may be fitted statistical model is made than observational to data at each occassion resulting in micro- methods (see Section 3.04.1.2). Here, models are structural analysis. Stated differently: time considered as mere tools for hypotheses quali- series information has been condensed into fication and inference and are not of conten- some quantities that now enter macrostructural tional value besides some implications of their analysis reflecting design logic. Here, one might structural assumptions (like additivity, linear- compare such aggregate measures in a pre± ity, etc.). Analysis of nonequivalent group postfashion analysis like in standard design six, designs is achieved in three major steps. First, thus evaluating treatment impact by relative the overall rationale of analysis is introduced by change in parameters of local change. Other borrowing from the theory of generalized linear studies involve entire structural equations models. Second, its specification language is systems in a single data box, constructed from used to acquaint the reader with a system of covariance structure in multivariate data ob- models that indeed is a superset to nearly all tained at one occassion (Bentler, 1995; Duncan, models applicable in this context. Third, a closer 1975; MoÈ bus & BaÈ umer, 1986; JoÈ reskog & look is taken at a subset of models that are most SoÈ rbom, 1993). 
commonly employed in treating standard de- Most textbooks on the statistical analysis sign six, namely analysis of variance with or of experiments focus on analysis of variance without adjusting for a covariate and its and associates. While preference for models 82 Multiple Group Comparisons: Quasi-experimental Designs assuming variables on an interval or rational that remained unspecified or did not explicitly scale is perfectly understandable from the enter the model (i.e., remain uncontrolled). researchers point of view, it is felt that there To sum up, ingredients of a model are is a rationale of testing that allows for far more variables, parameters, a link function to variety. Appreciation of this variety is badly algebraically interconnect all these, and an needed in order to be able to choose an analysis estimation strategy to optimally fit the resulting model for the design present, thus preserving model to the data. The point in mapping most the rather deductive idea of the process model in different ideas and objectives into a common- Figure 1, instead of being forced to the models structure is variable designation: using contrary: adapting design and idea to some intensities can still mean single-attribute utili- analysis model readily at hand from common ties, attitude strength, blood pressure. But y = textbooks. Variety need not imply incompre- f(x, b, e) is too much in the abstract to be of use hensible diversity if there is a model rationale. for this purpose, and even giving an outline on Here, the rationale states: all analysis, excepting mathematical formulation and derivation of some tests for ordinal data working on special some less hard-to-handle submodels or estima- ranking methods, boils down to regression. This tion techniques would be beyond the limits of holds true even in cases where it might not be this chapter. 
For an advanced reading and a expected that a regression structure presents at complete build-up of generalized linear model first glance (joint space analysis in multidimen- theory and practice see Arminger, Clogg, and sional scaling, etc.). Most important, it is also Sobel (1995), Christensen (1991), and Diggle true for analysis of variance and companions et al. (1994). But model specification terms since these models can be written as regression summarized in Figure 4 can be used to figure out equations of metric dependent variables on a set the greater proportions of the generalized linear of coded vectors. These dummy variables, then, model. Note that not everything goes along with represent treatment in all its aspects (intensity, anything else when crossing items on the utmost frequency, schedule of application, etc.). En- right side and that some models use parame- coding logic of designs into what is now called a trization strategies that do not satisfy the design or hypotheses matrix for statistical condition of formal comparability of para- analysis certainly is an art of its own (for an meters across groups. What is required now is introduction, see Kerlinger (1973), or Pedhazur careful examination of the nature of the & Pedhazur (1991)). But here, regression stands variables and the justifiability of assumed form for a whole family of models obeying a certain of relations among them. Never follow the relational form. Thus, the rationale translates to impression possibly implied that all it takes is concrete statistics by proposing that all variable simply picking up a model from the scheme and relations (and contrasts are just a special applying it to the data if the marginal relation) be specified in the form y = f(x, b, parameters are met. 
While it is true that with e), where y denotes a possibly vector-valued set the exceptions mentioned above almost any of responses (dependents) and x denotes a set of statistical model of interest for analysis of explanatory variables (independents, respec- designs can be integrated in or extrapolated tively) that are used to approximate the from the generalized linear models scheme, the response with determinable residual or stochas- fit of design and analysis model is never tic error e. A set of coefficients b quantify the guaranteed from this alone. intensity of an element of x (which can be Several submodels can be located from that interpreted as effect information), these para- scheme that are most commonly used in meters are estimated conditional to the data by analyzing experiments of type three and six minimizing a quality function (ªleast squares,º (in Figure 2) and that may now be arranged in a ª(72)*maximum likelihoodº). Finally, f(.) more readable tabular system. If groups or denotes a characteristic link function that discrete time are assumed for a single indepen- relates the explanatory or predictive determistic dent variable of categorical nature, and an part of the model and error component to the arbitrary scale for a single dependent manifests, response. Mostly used for this purpose are variables measured in discrete time and every- identity link (giving standard multiple regres- thing else in Figure 4 is dispensed with, the sion), logit link (for binary response), and following table results (Table 3). In view of logarithmic links (leading to regression models practice, that is, software, some submodels involving polytoneous categorical data). In this present not as rigorously special cases of the way of thinking, the data are thought of as Generalized Linear Model (though they all can determined by the model but jammed by be specified accordingly), but rather as ªstand- additional random error. 
Note that ªerrorº aloneº implementations of tests. They serve the does not denote anything false but is meant to same purpose and are integrated in all major incorporate, by definition, all factors in effect statistical software packages. Analysis 83
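The claim that analysis of variance can be written as a regression of a metric dependent variable on a set of coded vectors can be illustrated with a minimal sketch. It assumes NumPy and an identity link; the function names, the group labels, and the scores are hypothetical and serve only to show that dummy-coded least squares and the classical between/within partition yield the same F-ratio.

```python
import numpy as np

def dummy_code(groups):
    """Return (levels, X): an intercept column plus one 0/1 column per
    group level except the first, which acts as the reference level."""
    levels = sorted(set(groups))
    cols = [np.ones(len(groups))]
    for level in levels[1:]:
        cols.append(np.array([1.0 if g == level else 0.0 for g in groups]))
    return levels, np.column_stack(cols)

def regression_f(y, X):
    """Least-squares fit of y on X and the F-ratio for the hypothesis
    that all non-intercept coefficients are zero."""
    y = np.asarray(y, dtype=float)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    ss_error = float(resid @ resid)
    ss_total = float(((y - y.mean()) ** 2).sum())
    df_model = X.shape[1] - 1
    df_error = len(y) - X.shape[1]
    f = ((ss_total - ss_error) / df_model) / (ss_error / df_error)
    return b, f

def anova_f(y, groups):
    """Classical one-way ANOVA F from between- and within-group sums of squares."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    levels = sorted(set(groups.tolist()))
    grand = y.mean()
    ss_between = sum((groups == g).sum() * (y[groups == g].mean() - grand) ** 2
                     for g in levels)
    ss_within = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum()
                    for g in levels)
    return (ss_between / (len(levels) - 1)) / (ss_within / (len(y) - len(levels)))

# Hypothetical post-test scores for one control and two treatment groups.
groups = ["control"] * 4 + ["treat1"] * 4 + ["treat2"] * 4
y = [10, 12, 11, 13, 14, 15, 13, 16, 18, 17, 19, 20]

levels, X = dummy_code(groups)
b, f_reg = regression_f(y, X)   # b[0]: control mean; b[1], b[2]: differences from control
f_anova = anova_f(y, groups)
```

With this coding the intercept estimates the reference group mean and each further coefficient estimates a difference from it; other coding schemes (e.g., effect coding) change the meaning of the coefficients but not the overall F-ratio.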

Figure 4 Building blocks in the Generalized Linear Model (GLM). [Diagram not reproduced. Specification terms shown: links (identity, logit, log); number of variables (uni-, multiple, multi-; orthogonal, collinear, simultaneous); scale (categorical, ordinal, interval, rational; censored, uncensored); variable type (intensity, frequency, durations, direct time; dependent, independent); measurement level (manifest, latent; endogenous; factor order k); model setup (static, linear additive, cross-lagged, autoregressive); temporal order (discrete time, continuous time); error assumptions (IIND(0, Σ), AR(p)).]
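The three links named in the text and in Figure 4 can be written down directly. The following is a minimal sketch using only the Python standard library; the coefficient values B0 and B1 are hypothetical. It shows why a logit-model coefficient is read as an effect on the odds: moving the independent variable from 0 to 1 multiplies the odds by exp(B1).

```python
import math

def identity_link(mu):
    """Identity link: the linear predictor is the expected response itself."""
    return mu

def logit_link(p):
    """Logit link for a binary response with success probability p."""
    return math.log(p / (1.0 - p))

def log_link(mu):
    """Logarithmic link, as used for expected frequencies."""
    return math.log(mu)

def inverse_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Ratio of favorable to unfavorable responses."""
    return p / (1.0 - p)

# Hypothetical coefficients: intercept B0 and effect B1 of a 0/1 group dummy.
B0, B1 = -1.0, 0.8

p_control = inverse_logit(B0)        # dummy = 0
p_treated = inverse_logit(B0 + B1)   # dummy = 1
odds_ratio = odds(p_treated) / odds(p_control)  # equals exp(B1)
```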

Table 3 Common models for analysis of designs three/six.

Idea of effect                  Independent (group-time)   Scale of single dependent   Analysis/test
Simple change                   One sample pre–post        categorical                 McNemar test
                                                           ordinal                     Wilcoxon test
                                                           metric                      dep. t-test
                                One sample k times         categorical                 Cochran Q-test
                                                           ordinal                     Friedman test
                                                           metric                      One-way ANOVA rep. meas.
Post-test differences           Two samples                categorical                 Simple χ²
                                                           ordinal                     Mann–Whitney test
                                                           metric                      indep. t-test
                                m samples                  categorical                 Logit/Loglinear models
                                                           ordinal                     Kruskal–Wallis test
                                                           metric                      One-way ANOVA
Relative change (interaction)   m samples k times each     categorical                 Logit/Loglinear models
                                                           ordinal                     Loglinear models
                                                           metric                      m-way ANOVA rep. meas.
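Table 3 is in effect a lookup from sampling layout and scale of the dependent variable to an analysis. A minimal sketch of that lookup follows; the dictionary keys and the function name `suggest_test` are hypothetical labels chosen for illustration, while the entries themselves are taken from the table.

```python
# Keys: (independent variable layout, scale of the single dependent variable).
TABLE_3 = {
    ("one sample, pre-post", "categorical"): "McNemar test",
    ("one sample, pre-post", "ordinal"): "Wilcoxon test",
    ("one sample, pre-post", "metric"): "dependent t-test",
    ("one sample, k times", "categorical"): "Cochran Q-test",
    ("one sample, k times", "ordinal"): "Friedman test",
    ("one sample, k times", "metric"): "one-way ANOVA, repeated measures",
    ("two samples", "categorical"): "simple chi-square",
    ("two samples", "ordinal"): "Mann-Whitney test",
    ("two samples", "metric"): "independent t-test",
    ("m samples", "categorical"): "logit/loglinear models",
    ("m samples", "ordinal"): "Kruskal-Wallis test",
    ("m samples", "metric"): "one-way ANOVA",
    ("m samples, k times each", "categorical"): "logit/loglinear models",
    ("m samples, k times each", "ordinal"): "loglinear models",
    ("m samples, k times each", "metric"): "m-way ANOVA, repeated measures",
}

def suggest_test(layout, scale):
    """Return the Table 3 entry for a layout/scale pair.

    Raises KeyError for combinations the table does not cover."""
    return TABLE_3[(layout, scale)]

# Design six with a metric outcome: several groups, each measured repeatedly.
choice = suggest_test("m samples, k times each", "metric")
```

Such a lookup only narrows the choice; whether the assumptions of the selected model are met still has to be checked against the data.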

Further insight can be gained by the introduction of the analyses of variance. These models are considered the standard for analyses of multiple group comparison designs, but there is no hope of covering their various features and the richness of their submodels in detail. For the time being, do not bother with the distinction of fixed-effect, random-effect, and mixed models but with the rationale of analysis (for details, see Hays, 1980). Especially in the case of repeated measurement models, there are, however, some ideas and assumptions that must be understood before actually using these techniques. Most prominent and important among these are: heteroscedasticity, additivity, orthogonality, balance, degrees of freedom, compound symmetry, sphericity. In addition, further precautions as regards covariance structure are in order when extending the model to multivariate analysis of several dependent variables. For advanced reading on these topics, whose consideration is indispensable for enhanced statistical conclusion validity, see Toutenburg (1994) or Christensen (1991).

Consider that response data $y = [y_1\ y_2 \ldots y_m \ldots y_M]$ on M dependent variables are obtained on interval-scale level for k = 2 ... k ... K samples of equal size; for convenience, n = 1 ... i ... N at t = 1, 2 ... t ... T occasions. A categorical independent variable (or factor $f_1$ denoting treatment(s), $f_1 = (f_{11}, \ldots, f_{1k}, \ldots, f_{1K})$) has been realized at least twice in one sample or independently across different samples at any occasion 1 ≤ t ≤ T. Treatment $f_{1t}$ denotes "no treatment" for implementing nonequivalent control group(s). If y offers data on more than one dimension (M > 1), multivariate analysis (MANOVA) is called for; if it reduces to a scalar (M = 1) but extends in time (T > 1), univariate repeated measures ANOVA is applicable. For M = 1 and T = 1, effects are evaluated using a post-test-differences-only-given-pretest-equivalence rationale as presented in the first row in Table 3 (performed with a t-test for k = 2, both uni- and multivariate, and any (M)ANOVA for k > 2). With repeated measures carrying effect information (T > 1), as holds for both other rationales, data setup is a bit more sophisticated. Common software packages usually take repeated measures of one variable for separate variables (maybe with the same root name: e.g., out1 out2 out3 . . . for criterion variable out at three occasions). It follows that a multivariate data setup of the input data matrix is required. Note that when more than one genuine variable besides out enters analysis, a doubly multivariate procedure is called for. Stated differently, M variables are to be analyzed whose respective T time-dependent realizations are coded as T different variables, giving M × T data columns in a typical data entry spreadsheet of minimal logical dimension (N * K) × (M * T), since data are obtained for K groups of N subjects each. This holds true, so far, for any one-way (M)ANOVA with a k-level treatment factor affecting one or more metric dependent variables. In most instances, however, it is likely that this root design will be extended by introducing:

(i) further experimental factors subject to manipulation (e.g., single vs. group therapy),
(ii) further cross-classifying factors that fall into place from basic sample analyses (e.g., age, sex, hospitalization, etc.),
(iii) further metric time-invariant covariates (e.g., onset of disease, initial dosage of medication), and
(iv) further metric time-varying covariates (e.g., ongoing medication).

The general idea of analysis of variance is rather simple: if means between k groups and/or t occasions differ, and this is what is called an effect, roughly, then they are supposed to vary from the grand mean μ obtained from collapsing over all M × K data points disregarding groups and treatments, or collapsing over occasions giving the grand mean as a "no-change" profile of t equal means. If they do vary significantly, then this variation should be greater, to a certain proportion, than the variation present within groups (i.e., with respect to the mean in a group). More specifically, a partitioning of total variance found in the data matrix into (at least) two components is attempted: variance of groups' means reflects the systematic ("causal model") part that is solely attributable to variation of treatment (be it in time or across groups), whereas variance of subjects' measurements within a group is generated by interindividual differences (i.e., violation of pretest equivalence or lack of treatment integrity) and is therefore taken as an "error term" against which the other part must be evaluated. Remember that the ratio of two variances, or two sums of squares divided according to degrees of freedom, constitutes an F-value when certain assumptions on data (Hays, 1980) are met: statistical significance is readily tested for. In this way, model variance due to effects (possibly containing interactions when more independent variables are present) and error variance due to anything else add up to give the total variance (assuming a fixed effect model). In the case of repeated measures analysis, there is also variance among t different measures of a single variable within one subject. To sum up: three main effect sources of variation are accounted for by standard ANOVA models:

(i) Differences between individuals, to be further decomposed into differences between k groups (disregarding the individual by taking means and thereby constituting a "true value" outcome measure for this treatment) and differences between subjects within a group, obtained from subtracting the individual's measure from the group mean. In analyses with no repeated measures factor present, these differences contribute to error variance against which the first kind of variation is evaluated. In the case of f-way ANOVAs, interactions of independent variables (factors) must be taken into account as well.

(ii) Differences within repeated measures of a subject and according interactions with factors varied between groups in f-way analyses. This variation may be further decomposed into three sources: a main effect of retesting (supposed to be present in each subject's profile over time), an interaction of treatment and retesting, and differential reactions to retesting (specific to each of the subjects). In case of analyses with no between-factors present, the latter source of variation is taken to contribute to residual variance. In multiway repeated measures analyses of variance, determination of appropriate components for an F-ratio in order to test for statistical significance of a factor or an interaction of some factors can be very complicated; excellent advice is given in Bortz (1993).

For design six, a differential change would be required for an effect by testing the interaction term of a repeated measures "within" factor and the case–control "between" factor for significance using the appropriate F ratio. Change takes place but in different directions (or none) depending on whether treatment is received or not. Suppose one group of aggressive children received behavior modification training while another group of equal size is still waiting for admission and thus serves as a control group. Both groups are considered equivalent on the dependent variable of interest beforehand. Take it that there are t = 3 measures on each of $N_1 + N_2$ individuals in k = 2 groups: pretest, post-test after training, and a follow-up retest six months after intervention to estimate the stability of possible effects. Formally expressing the above sources of variation in a structural model for the individual measure, as is common for GLM submodels, and dispensing with explicit notation of associated dummies for discrete independent variables leads to:

$$y_{ikt} = \mu + \alpha_k + \beta_t + (\alpha\beta)_{kt} + \gamma_{ik} + \varepsilon_{ikt}$$

where $\alpha_k$ refers to the effect of being in either treatment or control group, $\beta_t$ refers to differences solely attributable to time or change, $(\alpha\beta)_{kt}$ stands for the interaction of both main effects and thus denotes differential change as sketched above and in greater detail in Section 3.04.1.3, $\gamma_{ik}$ refers to a random (= erroneous) effect of a single subject i in the kth kind of treatment (training vs. control, in the case present), maybe for individual specific reaction types to either form of treatment at any occasion, and $\varepsilon_{ikt}$ accounts for anything not covered by the model (error term).

For full coverage of mathematical assumptions underlying this model, see Toutenburg (1994). As mentioned before, statistical hypotheses are formulated as null hypotheses conjecturing "no effect" of the according term in the model. In addition, for the present example, effects of several levels of an independent variable are required, coded by an according contrast in the hypothesis matrix (see above), which add up to zero. Hypotheses testable in this model include:

$H_0\colon \alpha_1 = \alpha_2$
There is no difference between treatment and control group with respect to the outcome variable (dependent), disregarding point of time of measurement. More generally speaking, such between hypotheses state homogeneity of measurement levels in groups. Note that no differences at pretest time are expected. Moreover, differences possibly present at post-test time (near termination of intervention) tend to attenuate up until the follow-up testing. It is not known whether group differences were at maximum by consideration of this main effect of group membership alone.

$H_0\colon \beta_{c_1(t)} = \beta_{c_2(t)}$, specific formulation depending on contrasts c(t) for time effects.
There is no overall change between occasions disregarding group membership. Again more generally speaking, the hypothesis states homogeneous influence of time effects in the course of experimentation. Considering this time main effect alone, it is known that whatever change occurred applied unspecifically to both groups and can therefore not be attributed to treatment. Obviously, this effect does not carry the information sought. When certain assumptions on covariance structure in time are met (sphericity condition, see above), t − 1 orthonormal linear contrasts may be used to evaluate change at certain periods using univariate and averaged tests. These assumptions assure, roughly speaking, that present differences are due to inhomogeneity of means rather than of variances (for details, see Toutenburg, 1994). Major software packages offer correction methods to compensate for violations of these assumptions (e.g., by correcting components of the F-test) or have alternatives available.

$H_0\colon (\alpha\beta)_{kt} = 0$
This hypothesis states parallelism of progressive forms in groups or, equivalently, the absence of differential change. If this hypothesis can be rejected, there is an interaction of treatment and the repeated measurement factor in the sense that (at least) one group apparently took a different course from pretest to post-test and follow-up as regards the outcome criteria. This is precisely what was presented as the standard notion of an effect in Section 3.04.3. Such a change can be attributed to treatment because overall effects like common linear trends and so on have been accounted for by the time main effect. In more complex cases, there may also be interest in differential growth rates or whether one group arrived sooner at a saturation level (suggesting earlier termination of treatment).

Actual testing of these hypotheses might be performed using "means models" available with the (M)ANOVA procedures or by parameterizing the above equation to fit the general linear model regression equation with identity link. In essence, this is done by building a parameter vector b of semipartial regression coefficients from all effects noted ($\mu, \alpha_1, \ldots, \alpha_{k-1}, \beta_{c_1(t)}, \ldots, \beta_{c_{t-1}}$), associated by the corresponding hypotheses matrix X (made up from coded vectors of contrasted independent variables) and an additive error term. This right-hand part of the equation is regressed on to the vector y, giving a standard Gauss–Markov–Aitken estimate of

$$b = (X'\Sigma^{-1}X)^{-1} X'\Sigma^{-1} y$$

from least squares normal equations. A typical printout from respective software procedures includes parameter value, associated standard error, and t value, confidence limits for the parameter estimate, $\eta^2$ effect size, F-noncentrality parameter, and power estimate. When interpreting results, remember that an effect is conceptualized as a difference of the respective group (= treatment) mean from the grand mean, and that effects are usually normalized to add up to zero over all treatments (for details and exceptions, see Hays, 1980).

Analysis of the worked example study was carried out as follows:

A one-way ANOVA with repeated measurements was used to test statistical significance of recorded difference in three dependent variables for five experimental conditions at pre, post, and follow-up occasions. There was no significant overall main group effect. Only systolic blood pressure was found to differ across groups at follow-up time with greatest reduction present in a comparison of TR to TOC. Group-time-interactions for testing differential change were carried out by an ANOVA using repeated contrasts (pre vs. post; post vs. follow-up). Systolic blood pressure decreased significantly over time in all relaxation groups and the placebo group, but not in the control group. TR muscle relaxation showed greatest reduction of diastolic pressure, whereas SR and COG significantly lowered heart rate. Pooled treatment effects were evaluated against nontreatment groups by special ANOVA contrasts, showing significant reduction in systolic pressure when compared to control group and significant reduction in diastolic pressure when compared to the placebo group. There were no effects on heart rate in comparison to either one of the nontreatment groups. The results indicate a strong placebo attention effect. It is suggested that independence of heart rate and significant blood pressure reduction is in line with the findings in the field.

With additional restrictions put on Σ, this rationale of testing easily extends to multivariate and doubly multivariate cases (Toutenburg, 1994). Also, there is much appeal in the idea of comparing groups on entire linear systems obtained from all measured variables by structural equations modeling (SEM; see Duncan, 1975). Here, measurement error in both independent and dependent variables is accounted for explicitly. For an introduction to multiple-group covariance structure analyses and to models additionally including structured means see Bentler (1995), Möbus and Bäumer (1986), and Sörbom (1979). MANOVA and SEM are compared on theoretical and practical grounds in Cole, Maxwell, Arvey, and Salas (1993). Note that the latest formulations of these models and associated software packages can now deal with nonnormal and categorical data.

When ordinal measures have been obtained, ANOVA on ranks of subjects' measures (with respect to the order in their respective group) rather than on absolute values may be applied instead of "genuine" ordinal methods as listed in Table 3. In the case of categorical measures, frequencies mostly have been obtained, denoting the number of subjects with the characteristic feature of the category level in question present relative to the total number of subjects over the range of all possible levels of that categorical variable. Basically, analysis is carried out on an aggregate level (groups or subpopulations) rather than on the individual level, and link functions other than the identity link are employed to regress independent variables (which have been of a categorical nature all the time: group membership, etc.) on to categorical outcome measures (now dependent variables have also become categorical). Regression coefficients must be interpreted differently, though, when dependent variables have been reconceptualized in this way. A positive b indicates that the probability of a specific category of a dependent variable is positively affected (raised) relative to another category, including the case of simply not being in that very category, when the characteristic feature of the category for an independent variable is present in a subject. In the case of a dichotomous dependent variable (logit models) interpretation is facilitated: b quantifies the effect of the independent variable on the odds, that is, the ratio of favorable to unfavorable responses (e.g., whether or not cancer is or will be present given the information available from independent variables). On a more formal level, a nonzero regression coefficient in log-linear models indicates that the expected cell frequency in the category r of R categories or "levels" of the dependent variable departs from what would be expected in the "null" case of no effects, that is, $n_r = n_{\cdot} = N/R$. As with the metric case, regression coefficients are considered effect parameters and tested for statistical significance by evaluating their standard error. Multiple group comparisons can be obtained in one of two ways: by introducing an independent variable denoting group membership, or by testing whether there is a significant change in the associated overall model fit indices from the more saturated model incorporating the effect in question (H1) to the more restricted model dispensing with it (H0).

3.04.7 CONCLUSION

The last section described the statistical models for multiple group comparisons. In most cases, however, analysis will be carried out using techniques presented in Table 3. It is re-emphasized that it is not the complexity or elegance of the final data model that puts a good end to the process of quasi-experimental research: it is the fit of research questions or hypotheses and according methodological tools to tackle them. This is why a major part of this chapter is devoted to a presentation of an idealized procedural outline of this research process as sketched in Figure 1, and to discussing validity affairs. Generally speaking, the last step in this process lies with reversing it inductively and with introducing evaluative elements. It must be decided to which degree present patterns of statistical significance justify the induction that predictions derived from substantial hypotheses have instantiated as they should and whether validity issues have been attended to a degree that inference to the level of theory is admissible. Limitations in generalizability should be examined and problems encountered in the course of the study must be evaluated in order to point to future directions of research. With reference to Section 3.04.3.3, implications for clinical practice should be derived and evaluated. In publications, these issues are usually covered in the "Results" and "Discussion" sections.

In the worked example study, the following discussion is offered:

A result most relevant to clinical practice lies with the fact that cognitive and muscular relaxation both significantly reduce blood pressure and therapists may therefore choose from both methods with respect to their individual clients' special needs and aptitudes (differential indication). The critical prerequisite for this conclusion in terms of validity lies with clear separation of experimental instructions concerning either muscularly or cognitively induced relaxation. It might be argued that any instruction relevant to improving irritating bodily sensations does feature cognitive elements, like expectancy, that contribute to intervention
effects. It might be better to differentiate between Collingwood, R. G. (1940). An essay on metaphysics. techniques involving direct muscular practice and Oxford, UK: Oxford University Press. a pool of techniques that dispense with the Collins, A. W. (1966). The use of statistics in explanation. behavioral element. In the end, all relaxation British Journal for the Philosophy of Science, 17, 127±140. techniques affect muscle tension. But persons Cook, T. D., & Campbell, D. T. (1979). Quasi-experi- mentation: Design and analysis issues for field settings. might well vary on the degree to which they can Chicago: Rand McNally. benefit from direct or indirect techniques. It is Cronbach, L. J., & Furby, L. (1970). How should we known that more susceptible persons, or persons measure ªchangeºÐor should we? Psychological Bulle- with greater ability to engage in and maintain vivid tin, 74, 68±80. imagination and concentration, will benefit more Diggle, P. J., Liang, K. Y., & Zeger, S. L. (1994). Analysis from cognitive techniques, and that physical dis- of longitudinal data. Oxford, UK: Oxford University abilities often rule out application of muscularly Press. directed interventions. On the other hand, muscle Duncan, O. D. (1975). Introduction to structural equations relaxation instructions are considered easier to models. New York: Academic Press. Eye, A. V. (Ed.) (1990). Statistical methods for longitudinal follow, and, after all, straight muscle relaxation research. Boston: Academic Press. techniques showed the greatest overall effect in the Farnum, N. R., & Stanton, L. W. (1989). Quantitative present study, thus pointing to a ªsafe-sideº forecasting methods. Boston: Kent Publishing. decision. These arguments refer to client selection Fisher, R. A. (1959). Statistical method and statistical in clinical practice as a parallel to the referred inference. Edinburgh, UK: Oliver & Boyd. effects of subject selections in experimentation. Fitz-Gibbon, C. T., & Morris, L. L. 
(1991). How to design a program evaluation. Program evaluation kit, 3. Newbury Park, CA: Sage. Accounting for the demand for both meth- Giere, R. N. (1972). The significance test controversy. odologically valid and practically relevant British Journal for the Philosophy of Science, 23, 170±181. methods, the following guidelines for quasi- Glass, G. V., Willson, V. L., & Gottman, J. M. (1975). experimental research may be offered: Design and analysis of time series experiments. Boulder, (i) preference for case±control designs, CO: Associated University Press. Gottman, J. M. (1981). Time series analysis. Cambridge, (ii) preference for repeated measures designs, UK: Cambridge University Press. thus preference for diagnostic devices and Granger, C. W. (1969). Investigating causal relations by statistical methods sensitive to (relative) change, econometric models and cross-spectral methods. Econo- (iii) specification of hypotheses in terms of metrica, 37, 424±438. Grawe, K. (1992). Psychotherapieforschung zu Beginn der effect sizes, neunziger Jahre. Psychologische Rundschau, 43, 132±168. (iv) consideration of multiple indicators per Gregson, R. (1983). Times series in Psychology. Hillsdale, criterion, entailing a multivariate data setup, NJ: Erlbaum. and Hager, W. (1992). Jenseits von Experiment und Quasiexperi- (v) special attention to balancing N (sample ment: Zur Struktur psychologischer Versuche und zur Ableitung von Vorhersagen.GoÈ ttingen, Germany: Ho- size), T (rep. measures level), ES (effect size), grefe. and other factors determining both statistical Hagenaars, J. A. (1990). Categorical longitudinal data: log- and practical significance of results presented in linear panel, trend and cohort analysis. Newbury Park, Section 3.04.3.1. CA: Sage. Hays, W. (1980). Statistics for the social sciences. New York: Holt, Rinehart, & Winston. Hoole, F. W. (1978). Evaluation research and development 3.04.8 REFERENCES activities. Beverly Hills, CA: Sage. 
JoÈ reskog, K., & SoÈ rbom, D. (1993). LISREL 8 user's Arminger, G., Clogg, C. C., & Sobel, M. E. (Eds.) (1995). reference guide. Chicago: Scientific Software Interna- Handbook of statistical modeling for the social and tional. behavioral sciences. New York: Plenum. Kepner, J. L., & Robinson, D. H. (1988). Nonparametric Bentler, P. M. (1995). EQS structural equations program methods for detecting treatment effects in repeated manual. Encino, CA: Multivariate Software. measures designs. Journal of the American Statistical Bernstein, I. N. (Ed.) (1976). Validity issues in evaluative Association, 83, 456±461. research. Beverly Hills, CA: Sage. Kerlinger, F. (1973). Foundations of behavioral research Bortz, J. (1993). Statistik. Berlin: Springer. (2nd ed.). New York: Holt, Rinehart, & Winston. Box, G. E., Jenkins, G. M., & Reinsel, G. C. (1994). Time Lindsey, J. K. (1993). Models for repeated measurements. series analysis: forecasting and control. Englewood Cliffs, Oxford statistical science series, 10. Oxford, UK: NJ: Prentice-Hall. Claredon Press. Campbell, D. T., & Stanley, J. C. (1963). Experimental and Lipsey, M. W. (1990). Design sensitivity: statistical power quasi-experimental design for research on teaching.In for experimental research. Newbury Park, CA: Sage. N. L. Gage (Ed.), Handbook for research on teaching, Lord, F. M. (1963). Elementary models for measuring (pp. 171±246). Chicago: Rand McNally. change. In C. W. Harris (Ed.), Problems in measuring Christensen, R. (1991). Linear models for multivariate, time change. Madison, WI: University of Wisconsin Press. series, and spatial data. New York: Springer. Mayo, D. (1985). Behavioristic, evidentialist and learning Cohen, J. (1988). Statistical power analysis for the models of statistical testing. Philosophy of Science, 52, behavioral sciences. Hillsdale, NJ: Erlbaum. 493±516. Cole, D. A., Maxwell, S. E., Arvey, R., & Salas, E. (1993). McDowall, D., & McCleary, R., Meidinger, E. 
E., & Hay, Multivariate group comparison of variable systems: R. A. (1980). Interrupted time series analysis. Newbury MANOVA and structural equation modeling. Psycholo- Park, CA: Sage gical Bulletin, 114, 174±184. MoÈ bus, C., & BaÈ umer, H. P. (1986). Strukturmodelle fuÈr References 89

LaÈngsschnittdaten und Zeitreihen. Bern, Switzerland: Solomon, R. L. (1949). An extension of control group Huber. design. Psychological Bulletin, 46, 137±150. Pedhazur, E. J., & Pedhazur, L. (1991). Measurement, SoÈ rbom,D.(1979).Ageneralmodelforstudying design and analysis: an integrated approach. Hillsdale, differences in factor means and factor structure between NJ: Erlbaum. groups.InK.G.JoÈ reskog & D. SoÈ rbom (Eds.), Petermann, F. (Ed.) (1996). Einzelfallanalyse (3rd ed.). Advances in factor analysis and structural equation models MuÈ nchen, Germany: Oldenbourg. (pp. 207±217). Cambridge, MA: Abt Books. Prioleau, L., Murdock, M., & Brody, N. (1983). An analyis Spector, P. E. (1981). Research designs. Beverly Hills, CA: of psychotherapy versus placebo studies. The Behaviour- Sage. al and Brain Sciences, 6, 275±285. Thompson, G. L. (1991). A unified approach to rank tests Plewis, I. (1985). Analysing change: measurement and for multivariate and repeated measures designs. Journal explanation using longitudinal data. New York: Wiley. of the American Statistical Association, 86, 410±419. Putnam, H. (1991). The ªCorroborationº of theories. In R. Toutenburg, H. (1994). Versuchsplanung und Modellwahl: Boyd, P. Gasper, & J. D. Trout (Eds.), The philosophy of statistische Planung und Auswertung von Experimenten science (pp. 121±137). Cambridge, MA: MIT Press. mit stetigem oder kategorialem Response. Heidelberg, Rao, P., & Miller, R. L. (1971). Applied econometrics. Germany: Physica. Belmont, CA: Wadsworth. Trochim, W. M. (1984). Research design for program Rossi, P. H., Freeman, H. E., & Wright, S. R. (1979). evaluation: the regression-discontinuity approach. Con- Evaluation: a systematic approach. Beverly Hills, CA: temporary evaluation research, 6. Beverly Hills, CA: Sage. Sage. Salmon, W. C. (1966). The foundations of scientific Vonesh, E. F. (1992). Non-linear models for the analysis of inference. Pittsburg, OH: University of Pittsburgh Press. longitudinal data. 
Statistics in Medicine, 11, 1929±1954. Schmitz, B. (1989). EinfuÈhrung in die Zeitreihenanalyse. Yung, P. M. B., & Kepner, A. A. (1996). A controlled Bern, Switzerland: Huber. comparison on the effect of muscle and cognitive Smith, M. L, Glass, V. G., & Miller, T. I. (1980). The relaxation procedures on blood pressure: implications benefits of psychotherapy. Baltimore: Johns Hopkins for the behavioral treatment of borderline hypertensives. University Press. Behavioral Research and Therapy, 34, 821±826. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.05 Epidemiologic Methods

WILLIAM W. EATON Johns Hopkins University, Baltimore, MD, USA

3.05.1 INTRODUCTION
  3.05.1.1 Epidemiology and Clinical Psychology
  3.05.1.2 Dichotomy and Dimension
  3.05.1.3 Exemplar Study: Ecological Approach to Pellagra
3.05.2 RATES
  3.05.2.1 Prevalence
  3.05.2.2 Incidence
  3.05.2.3 Incidence and Onset
3.05.3 MORBIDITY SURVEYS
  3.05.3.1 Exemplar Study: The Epidemiologic Catchment Area (ECA) Program
  3.05.3.2 Defining and Sampling the Population
  3.05.3.3 Sample Size
  3.05.3.4 Standardization of Data Collection
3.05.4 EXPOSURE AND DISEASE
  3.05.4.1 Cohort Design
  3.05.4.2 Exemplar Study: the British Perinatal Study
  3.05.4.3 Case–Control Design
  3.05.4.4 Exemplar Study: Depression in Women in London
3.05.5 USING AVAILABLE RECORDS
  3.05.5.1 Government Surveys
  3.05.5.2 Vital Statistics
  3.05.5.3 Record-based Statistics and Record Linkage
  3.05.5.4 Exemplar Study: the Danish Adoption Study
3.05.6 PREVENTIVE TRIALS
  3.05.6.1 The Logic of Prevention
  3.05.6.2 Attributable Risk
  3.05.6.3 Developmental Epidemiology
  3.05.6.4 Exemplar Study: the Johns Hopkins Prevention Trial
3.05.7 BIAS
  3.05.7.1 Sampling Bias
  3.05.7.2 Measurement Bias
3.05.8 CONCLUSION: THE FUTURE OF PSYCHOLOGY IN EPIDEMIOLOGY
3.05.9 REFERENCES


3.05.1 INTRODUCTION

3.05.1.1 Epidemiology and Clinical Psychology

Is psychological epidemiology an oxymoron? The primary goals of psychology differ from the primary goals of epidemiology. Psychology seeks to understand the general principles of psychological functioning of individual human beings. Epidemiology seeks to understand pathology in populations. These two differences (general principles vs. pathology, and individuals vs. populations) form the foundations for ideologic differences which have led to failures of linkage between the two disciplines, and many missed opportunities for scientific advances and human betterment. The gap between the two is widened by their location in different types of institutions: liberal arts universities vs. medical schools.

Various definitions of epidemiology reflect the notion of pathology in populations. Mausner and Kramer define epidemiology as "the study of the distribution and determinants of diseases and injuries in human populations" (1985, p. 1). Lilienfeld and Stolley use a very similar definition but include the notion that "epidemiologists are primarily interested in the occurrence of disease as categorized by time, place, and persons" (1994, p. 1). A less succinct, but nevertheless classic, statement is by Morris (1975), who listed seven "uses" of epidemiology:
(i) To study the history of health of populations, thus facilitating projections into the future.
(ii) To diagnose the health of the community, which facilitates prioritizing various health problems and identifying groups in special need.
(iii) To study the working of health services, with a view to their improvement.
(iv) To estimate individual risks, and chances of avoiding them, which can be communicated to individual patients.
(v) To identify syndromes, by describing the association of clinical phenomena in the population.
(vi) To complete the clinical picture of chronic diseases, especially in terms of natural history.
(vii) To search for causes of health and disease.

The seventh use is regarded by Morris as the most important, but, in reviewing the literature on psychiatric disorders for the third (1975) edition of his text, he was surprised at

The sad dearth of hard fact on the causes of major as well as minor mental disorders, and so on how to prevent them . . . [and] the dearth of epidemiological work in search of causes . . . Lack of ideas? Methods? Money? It wants airing. Of course, the popular trend open-endedly to hand over to psychiatry (and to social work), in hope or resignation, the whole of the human condition is no help. (p. 220)

3.05.1.2 Dichotomy and Dimension

Pathology is a process, but diseases are regarded as present or absent. Psychologists have been interested historically in promoting health and in understanding the nature of each individual's phenomenology, leading to emphasis on dimensions of behavior; but epidemiologists have focused on disorders without much regard to individual differences within a given diagnostic category, leading to focus on dichotomies. For clinical psychologists adopting the epidemiologic perspective, the phrase "psychiatric epidemiology" could describe their work accurately (Tsuang, Tohen, & Zahner, 1995). Another area of work for psychologists in epidemiology focuses on psychological contributors to diseases in general, which can include psychiatric disorders but is not limited to them. This type of epidemiology might be termed "psychosocial epidemiology" (Anthony, Eaton, & Henderson, 1995), and receives less attention in this chapter.

Interest in normal functioning, or in deviations from it, has led psychologists to focus on statistical measures related to central tendency, such as the mean, and the statistical methods oriented toward that parameter, such as the comparison of means and analysis of variance. Epidemiology's base in medicine has led it to focus on the dichotomy of the presence or absence of disease. There is the old adage in epidemiology: "The world is divided into two types of persons: those who think in terms of dichotomies, and those who do not." The result for statistical analysis is a recurring focus on forms involving dichotomies: rates and proportions (Fleiss, 1981), the two-by-two table (Bishop, Fienberg, & Holland, 1975), the life table (Lawless, 1982), and logistic regression (Kleinbaum, Kupper, & Morgenstern, 1982).

The epidemiologic approach is focused on populations. A population is a number of individuals having some characteristic in common, such as age, gender, occupation, nationality, and so forth. Population is a broader term than group, which has awareness of itself, or structured interactions: populations typically are studied by demographers, whereas groups typically are studied by sociologists. The general population is not limited to any particular class or characteristic, but includes all persons without regard to special features. In the context of epidemiologic research studies, the general population usually refers to the individuals who normally reside in the locality of the study. Many epidemiologic studies involve large populations, such as a nation, state, or a county of several hundred thousand inhabitants. In such large studies, the general population includes individuals with a wide range of social and biological characteristics, and this variation is helpful in generalizing the results. In the broadest sense, the general population refers to the human species. The fundamental parameters used in epidemiology often are conceptualized in terms of the population: for example, prevalence sometimes is used to estimate the population's need for health care; incidence estimates the force of morbidity in the population. Generally diseases have the characteristic of rarity in the population. As discussed below, rates of disease often are reported per 1000, 10 000, or 100 000 population.

Much of the logic that distinguishes the methods of epidemiology from those of psychology involves the epidemiologic focus on rare dichotomies. The case–control method maximizes efficiency in the face of this rarity by searching for and selecting cases of disease very intensively, for example, as in a hospital or a catchment area record system, and selecting controls at a fraction of their representation in the general population. Exposures can also be rare, and this possibility is involved in the logic of many cohort studies which search for and select exposed individuals very intensively, for example, as in an occupational setting with toxins present, and select nonexposed controls at a fraction of their representation in the general population. In both strategies the cases, exposed groups, and controls are selected with equal care, and in both strategies they are drawn, if possible, to represent the population at risk of disease onset: the efficiency comes in the comparison to a relatively manageable sample of controls.

3.05.1.3 Exemplar Study: Ecological Approach to Pellagra

The ecologic approach compares populations in geographic areas as to their rates of disease, and has been a part of epidemiology since its beginnings. A classic ecologic study in epidemiology is Goldberger's work on pellagra psychosis (Cooper & Morgan, 1973). The pellagra research also illustrates the concept of the causal chain and its implications for prevention. In the early part of the twentieth century, pellagra was most prevalent in the lower classes in rural villages in the southeastern US. As the situation may be in the 1990s for many mental disorders (Eaton & Muntaner, 1996), the relationship of pellagra prevalence to low social class was a consistent finding: a leading etiologic clue among many others with less consistent evidence. Many scientists felt that pellagra psychosis was infectious, and that lower class living situations promoted breeding of the infectious agent. But Goldberger noticed that there were striking failures of infection: for example, aides in large mental hospitals, where pellagra psychosis was prevalent, seemed immune. He also observed certain exceptions to the tendency of pellagra to congregate in the lower class, among upper-class individuals with unusual eating habits. Goldberger became convinced that the cause was a nutritional deficiency which was connected to low social class. His most powerful evidence was an ecologic comparison of high and low rate areas: two villages which differed as to the availability of agricultural produce. The comparison made a strong case for the nutritional theory using the ecological approach, even though he could not identify the nutrient. He went so far as to ingest bodily fluids of persons with pellagra to demonstrate that it was not infectious. Eventually a deficiency in vitamin B was identified as a necessary causal agent. Pellagra psychosis is now extremely rare in the US, in part because of standard supplementation of bread products with vitamin B.

Goldberger understood that low social class position increased risk of pellagra, and he also believed that nutritional deficiencies which resulted from lower class position were a necessary component in pellagra. With the advantage of hindsight, nutritional deficiency appears to have a stronger causal status, because there were many lower class persons who did not have pellagra, but few persons with pellagra who were not nutritionally deprived: vitamin B deprivation is a necessary cause, but lower social class position is a contributing cause. The concept of the causal chain points to both causes, and facilitates other judgements about the importance of the cause. For example, which cause produces the most cases of pellagra? (Since they operate in a single causal chain, the nutritional deficiency is more important from this point of view.) Which cause is antecedent in time? (Here the answer is lower social class position.) Which cause is easiest to change in order to prevent pellagra? The answer to this question is not so obvious, but, from the public health viewpoint, it is the strongest argument about which cause deserves attention. Which cause is consistent with an accepted framework of disease causation supportive of the ideology of a social group with power in society? Here the two causes diverge: funneling resources to the medical profession to cure pellagra is one approach to the problem consistent with the power arrangement in the social structure; another approach is redistributing income so that nutritional deficiency is less common. As it happens, neither approach would have been effective. The consensus to redistribute income, in effect, to diminish social class differences, is not easy to achieve in a democracy, since the majority are not affected by the disease. Bread supplementation for the entire population was a relatively cost-effective public health solution. The more general point is that Goldberger's work encompassed the complexity of the causal process.

Epidemiology accepts the medical framework as defining its dependent variable, but it is neutral as to discipline of etiology. Goldberger did not eliminate one or the other potential cause due to an orientation toward social vs. biological disciplines of study. He did not favor one or the other potential cause because it focused on the individual vs. the collective level of analysis. This eclecticism is an important strength in epidemiology because causes in different disciplines can interact, as they did in the case of pellagra, and epidemiology still provides a framework for study. The public health approach is to search for the most important causes, and to concentrate on those that are malleable, since they offer possibilities for prevention.

3.05.2 RATES

An early and well-known epidemiologist, Alexander Langmuir, used to say that "stripped to its basics, epidemiology is simply a process of obtaining the appropriate numerator and denominator, determining the rate, and interpreting that rate" (cited in Foege, 1996, p. S11). In the sense in which Langmuir was speaking, the term rates includes proportions such as the prevalence "rate" as well as the incidence rate, as explained below. Table 1 shows the minimum design requirements, and the definitions of numerators and denominators, for various types of rates and proportions. The table is ordered from top to bottom by increasing difficulty of longitudinal follow-up.

3.05.2.1 Prevalence

Prevalence is the proportion of individuals ill in a population. Temporal criteria allow for several types of prevalence: point, lifetime, and period. Point prevalence is the proportion of individuals in a population who are ill at a given point in time. The most direct use of point prevalence is as an estimate of need for care or potential treatment load, and it is favored by health services researchers. It is also useful because it identifies groups at high risk of having a disorder, or greater chronicity of the disorder, or both. Finally, the point prevalence can be used to measure the impact of prevention programs in reducing the burden of disease on the community.

Lifetime prevalence is the proportion of individuals alive on a given day in the population who have ever been ill. As those who die are not included in the numerator or denominator of the proportion, the lifetime prevalence is sometimes called the proportion of survivors affected (PSA). It differs from the lifetime risk because the latter attempts to include the entire lifetime of a birth cohort, both past and future, and includes those deceased at the time of the survey. Lifetime risk is the quantity of most interest to geneticists. Lifetime prevalence also differs from the proportion of cohort affected (PCA), which includes members of a given cohort who have ever been ill by the study date, regardless of whether they are still alive at that time.

Lifetime prevalence has the advantage over lifetime risk and PCA in that it does not require ascertaining who is deceased, whether those deceased had the disorder of interest, or how likely those now alive without the disorder are to develop it before some given age. Thus, lifetime prevalence can be estimated from a cross-sectional survey. The other measures require either following a cohort over time or asking relatives to identify deceased family members and report symptoms suffered by them. Often these reports must be supplemented with medical records. The need for these other sources adds possible errors: relatives may forget to report persons who died many years before or may be uninformed about their psychiatric status; medical records may be impossible to locate or inaccurate; and the prediction of onset in young persons not yet affected requires the assumption that they will fall ill at the same rate as the older members of the sample and will die at the same ages if they do not fall ill. If risks of disorder or age-specific death rates change, these predictions will fail.

Lifetime prevalence requires that the diagnostic status of each respondent be assessed over his or her lifetime. Thus, accuracy of recall of symptoms after a possible long symptom-free period is a serious issue, since symptoms and disorders that are long past, mild, short-lived, and less stigmatizing are particularly likely to be forgotten. For example, data from several cross-sectional studies of depression seem to indicate a rise in the rate of depression in persons born after World War II (Cross-National Collaborative Group, 1992; Klerman & Weissman, 1989).

Table 1 Rates and proportions in epidemiology.

Rate                            Minimum design     Numerator           Denominator

Lifetime prevalence             Cross-section      Ever ill            Alive
Point prevalence                Cross-section      Currently ill       Alive
Period prevalence (1)           Cross-section      Ill during period   Alive at survey
Period prevalence (2)           Two waves          Ill during period   Alive during period
First incidence                 Two waves          Newly ill           Never been ill
Attack rate                     Two waves          Newly ill           Not ill at baseline
Proportion of cohort affected   Birth to present   Ever ill            Born and still alive
Lifetime risk                   Birth to death     Ever ill            Born
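
The numerator and denominator definitions in Table 1 can be made concrete with a small sketch. The code below is illustrative only: the `Respondent` record, its field names, the function `table1_proportions`, and the ten-person toy sample are all hypothetical and are not taken from any study discussed in this chapter. It assumes a two-wave design in which every respondent is alive and interviewed at both waves, so that mortality and loss to follow-up do not affect the denominators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Respondent:
    """One person in a hypothetical two-wave survey (all fields are assumptions)."""
    ever_ill_at_baseline: bool     # any lifetime history of the disorder at wave 1
    ill_at_baseline: bool          # currently ill at wave 1
    new_episode_by_followup: bool  # an episode began between wave 1 and wave 2


def proportion(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, refusing an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def table1_proportions(sample):
    """Compute four of the Table 1 quantities for a fully followed-up sample."""
    alive = len(sample)
    # First incidence: the denominator is restricted to those never ill before.
    never_ill = [r for r in sample if not r.ever_ill_at_baseline]
    # Attack rate: the denominator excludes only cases active at baseline.
    not_ill_now = [r for r in sample if not r.ill_at_baseline]
    return {
        "lifetime_prevalence": proportion(
            sum(r.ever_ill_at_baseline for r in sample), alive),
        "point_prevalence": proportion(
            sum(r.ill_at_baseline for r in sample), alive),
        "first_incidence": proportion(
            sum(r.new_episode_by_followup for r in never_ill), len(never_ill)),
        "attack_rate": proportion(
            sum(r.new_episode_by_followup for r in not_ill_now), len(not_ill_now)),
    }


# Toy sample of 10 people (invented numbers, for illustration only).
toy_sample = (
    [Respondent(True, True, False)] * 2      # ill at baseline
    + [Respondent(True, False, True)]        # past history, recurrent episode
    + [Respondent(True, False, False)]       # past history, no new episode
    + [Respondent(False, False, True)]       # never ill, first episode
    + [Respondent(False, False, False)] * 5  # never ill, stays well
)

if __name__ == "__main__":
    for name, value in table1_proportions(toy_sample).items():
        print(f"{name}: {value:.3f}")
```

With this toy sample the lifetime prevalence is 4/10, the point prevalence 2/10, the first incidence 1/6, and the attack rate 2/8. The last two differ only because the recurrent case counts toward the attack rate, while the first-incidence denominator drops everyone with any prior history.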

Persons born earlier are older at interview, however, and it may be that they have forgotten episodes of depression which occurred many years prior to the interview (Simon & Von Korff, 1995). Data showing that lifetime prevalence of depressive disorder actually declines with age (Robins et al., 1984) are consistent with the recall explanation.

Period prevalence is the proportion of the population ill during a specified period of time. Customarily the numerator is estimated by adding the prevalent cases at the beginning of the defined period to the incident (first and recurrent) cases that develop between the beginning and the end of the period. This form is shown as period prevalence (2) in Table 1. In research based on records, all cases of a disorder found over a one-year period are counted. The denominator is the average population size during the interval. Thus, the customary definition of period prevalence requires at least two waves of data collection. Both Mausner and Kramer (1985) and Kleinbaum et al. (1982) have noted the advantages of period prevalence for the study of psychiatric disorders, where onset and termination of episodes are difficult to ascertain exactly (e.g., the failure to distinguish new from recurrent episodes is unimportant in the estimation of period prevalence). Furthermore, the number of episodes occurring during the follow-up is unimportant; it is important only to record whether there was one or more vs. none. The disadvantage of period prevalence is that it is not as useful in estimating need as point prevalence, nor as advantageous in studying etiology as incidence.

Another type of period prevalence, sometimes labeled by a prefix denoting the period, such as one-year prevalence, is a hybrid type of rate, conceptually mixing aspects of point and period prevalence, which has been found useful in the ECA Program (Eaton et al., 1985). This type of period prevalence is labeled period prevalence (1) in Table 1. It includes in the numerator all those surveyed individuals who have met the criteria for disorder in the past year, and as denominator all those interviewed. It is not a point prevalence rate because it covers a longer period of time, which can be defined as six months, two years, and so forth, as well as one year. But one-year prevalence is not a period prevalence rate because some individuals in the population who are ill at the beginning of the period are not successfully interviewed, because they either die or emigrate. As the time period covered in this rate becomes shorter, it approximates the point prevalence; as the time period becomes longer, the rate approaches the period prevalence. If there is large mortality, the one-year prevalence rate will diverge markedly from period prevalence.

3.05.2.2 Incidence

Incidence is the rate at which new cases develop in a population. It is a dynamic or time-dependent quantity and can be expressed as an instantaneous rate, although, usually, it is expressed with a unit of time attached, in the manner of an annual incidence rate. In order to avoid confusion, it is essential to distinguish first incidence from total incidence. The distinction itself commonly is assumed by epidemiologists but there does not appear to be consensus on the terminology. Most definitions of the incidence numerator include a concept such as "new cases" (Lilienfeld & Stolley, 1994, p. 109), or persons who "develop a disease" (Mausner & Kramer, 1985, p. 44). Morris (1975) defines incidence as equivalent to our "first incidence," and "attack rate" as equivalent to our "total incidence." First incidence corresponds to the most common use of the term "incidence," but since the usage is by no means universal, keeping the prefix is preferred.

The numerator of first incidence for a specified time period is composed of those individuals who have had an occurrence of the disorder for the first time in their lives and the denominator includes only persons who start the period with no prior history of the disorder. The numerator for attack rate (or total incidence) includes all individuals who have had an occurrence of the disorder during the time period under investigation, whether or not it is the initial episode of their lives or a recurrent episode. The denominator for total incidence includes all population members except those cases of the disorder which are active at the beginning of the follow-up period.

A baseline and follow-up generally are needed to estimate incidence. Cumulative incidence (not shown in Table 1) is the proportion of the sample or population who become a case for the first time between initial and follow-up interviews (Kleinbaum et al., 1982). But incidence generally is measured per unit of time, as a rate. When the follow-up period extends over many years, the exposure period of the entire population at risk is estimated by including all years prior to onset for a given individual in the denominator, and removing years from the denominator when an individual has onset or dies. In this manner, the incidence is expressed per unit time per population: for example, "three new cases per 1000 person years of exposure." This method of calculating facilitates comparison between studies with different lengths of follow-up and different mortality. When the mean interval between initial and follow-up interview is approximately 365 days, then the ratio of new cases to the number of followed-up respondents who were ascertained as being without a current or past history of the disorder at the time of the initial interview can serve as a useful approximation of the disorder's annual first incidence.

First incidence also can be estimated by retrospection if the date of onset is obtained for each symptom or episode, so that the proportion of persons who first qualified in the year prior to the interview can be estimated. For this type of estimate (not shown in Table 1), only one wave of data collection is needed. This estimate of first incidence is subject to the effects of mortality, however, because those who have died will not be available for the interview.

The preference for first or total incidence in etiologic studies depends on hypotheses and assumptions about the way causes and outcomes important to the disease ebb and flow. If the disease is recurrent and the causal factors vary in strength over time, then it might be important to study risk factors not only for first but for subsequent episodes (total incidence): for example, the effects of changing levels of stress on the occurrence of episodes of neurosis (Tyrer, 1985) or schizophrenia (Brown, Birley, & Wing, 1972). For a disease with a presumed fixed progression from some starting point, such as dementia, the first occurrence might be the most important episode to focus on, and first incidence is the appropriate rate. In the field of psychiatric epidemiology, there are a range of disorders with both types of causal structures operating, which has led us to focus on this distinction in types of incidence.

The two types of incidence are related functionally to different measures of prevalence. Kramer et al. (1980) have shown that lifetime prevalence (i.e., the proportion of persons in a defined population who have ever had an attack of a disorder) is a function of first incidence and mortality in affected and unaffected populations. Point prevalence (i.e., the proportion of persons in a defined population on a given day who manifest the disorder) is linked to total incidence by the queuing formula P = f(I × D) (Kleinbaum et al., 1982; Kramer, 1957), where P is point prevalence, I is total incidence, and D is duration; that is, point prevalence is a function of the total number of cases occurring and their duration. In the search for risk factors that have etiologic significance for the disorder, comparisons based on point prevalence rates suffer the disadvantage that differences between groups as to the chronicity of the disorder (that is, the duration of episodes, the probability that episodes will recur, or the mortality associated with episodes) affect the comparisons (Kramer, 1957). For example, it appears that Blacks may have episodes of depression of longer duration than Whites (Eaton & Kessler, 1981). If so, the point prevalence of depression would be biased toward a higher rate for Blacks, based solely on their greater chronicity.

3.05.2.3 Incidence and Onset

Dating the onset of episodes is problematic for most mental disorders for many reasons. One is that the diagnostic criteria for the disorders themselves are not well agreed upon, and continual changes are being made in the definition of a case of disorder, such as the recent fourth revision of the Diagnostic and statistical manual of mental disorders (DSM-IV; American Psychiatric Association, 1994). The DSM-IV has the advantage that criteria for mental disorders are more or less explicitly defined, but it is nevertheless true that specific mental disorders are often very difficult to distinguish from nonmorbid psychological states. Most disorders include symptoms that, taken by themselves, are part of everyone's normal experience: for example, feeling fearful, being short of breath or dizzy, and having sweaty palms are not uncommon experiences, but they are also symptoms of panic disorder. It is the clustering of symptoms, often with the requirement that they be brought together in one period of time or "spell," that generally forms the requirement for diagnosis. Although the clustering criteria are fairly explicit in the

DSM, it is not well established that they correspond to the characteristics generally associated with a disease, such as a predictable course, a response to treatment, an association with a biological aberration in the individual, or an associated disability. Thus, the lack of established validity of the criteria-based classification system exacerbates problems of dating the onset of disorder.

The absence of firm data on the validity of the classification system requires care about conceptualizing the process of disease onset. One criterion of onset used in the epidemiology of some diseases is entry into treatment, but this is unacceptable in psychiatry since people with mental disorders so often do not seek treatment for them. Another criterion of onset sometimes used is detectability, that is, when the symptoms first appear, but this is also unacceptable because experiences analogous to the symptoms of most psychiatric disorders are so widespread.

It is preferable to conceptualize onset as a continuous line of development towards manifestation of a disease. There is a threshold at which the development becomes irreversible, so that at some minimal level of symptomatology it is certain that the full characteristics of the disease, however defined, will become apparent. This use of irreversibility is consistent with some epidemiological uses (Kleinbaum et al., 1982). Prior to this point, the symptoms are thought of as “subcriterial.” At the state of knowledge in psychiatry in the 1990s, longitudinal studies in the general population, such as the ECA program and others mentioned above, are needed to determine those levels of symptomatology at which irreversibility is achieved.

There are at least two ways of thinking about development towards disease. The first way is the increase in severity or intensity of symptoms. An individual could have all the symptoms required for diagnosis but none of them be sufficiently intense or impairing. The underlying logic of such an assumption may well be the relatively high frequency of occurrence of the symptoms in milder form, making it difficult to distinguish normal and subcriterial complaints from manifestations of disease. For many chronic diseases, it may be inappropriate to regard the symptom as ever having been “absent,” for example, personality traits giving rise to deviant behavior, categorized as personality disorders on Axis II of the DSM (APA, 1994). This type of development can be referred to as “symptom intensification,” indicating that the symptoms are already present and have become more intense during a period of observation. This concept leads the researcher to consider whether there is a crucial level of severity of a given symptom or symptoms at which the rate of development towards a full-blown disease state is accelerated, or becomes irreversible.

A second way of thinking about progress towards disease is the occurrence of new symptoms which did not exist before. This involves the gradual acquisition of symptoms so that clusters are formed which increasingly approach the constellation required to meet specified definitions for diagnosis. A cluster of symptoms which occur more often together than would be expected by their individual prevalence in the population, that is, more often than expected by chance, is a syndrome. “Present” can be defined as occurrence either at a nonsevere or at a severe level; thus, decisions made about the symptom intensification process complicate the idea of acquisition. This idea leads the researcher to consider the order in which symptoms occur over the natural history of disease, and in particular, whether one symptom is more important than others in accelerating the process.

Figure 1 is an adaptation of a diagram used by Lilienfeld and Lilienfeld (1980, Figure 6.3) to visualize the concept of incidence as a time-oriented rate which expresses the force of morbidity in the population. In their original figure (1(a)), the horizontal axis represents time and the presence of a line indicates disease. The adaptations (1(b) and 1(c)) give examples of the several distinct forms that onset can take when the disorder is defined by constellations of symptoms varying in intensity and co-occurrence, as is the case with mental disorders. In Figure 1(b), the topmost subject (1) is what might be considered the null hypothesis, and it corresponds to simple onset as portrayed in the original. Figure 1(b) shows how intensity, represented by the vertical width of the bars, might vary. The threshold of disease is set at four units of width, and in the null hypothesis subject 1 progresses from zero intensity to four units, becoming a case during the observation period. Subject 2 changes from nearly meeting the criteria (width of three units) to meeting it (four units) during the year. Both subjects 1 and 2 are new cases, even though the onset was more sudden in subject 1 than in subject 2, that is, the force of morbidity is stronger in subject 1 than subject 2. Subjects 3 and 4 are not new cases, even though their symptoms intensify during the year as much or more than those of subject 2.

Figure 1(c) adapts the same original diagram to conceptualize acquisition of symptoms and the development of syndromes. At one point in time there is no correlation between symptoms, but the correlation gradually develops, until there is a clear separation of the population into one group, with no association of symptoms,

[Figure 1(a): six cases (Case No. 1 to Case No. 6), after Lilienfeld and Stolley, Figure 6.2. Length of line corresponds to duration of episode. The time axis runs from baseline to followup.]

[Figure 1(b): subjects 1 and 2 are two new cases, like Case No. 3 in panel (a), with different degrees of intensification. Subjects 3 and 4 are two subjects not defined as new cases, with different degrees of intensification; subject 3 is never a case, and subject 4 corresponds to Case No. 1 in panel (a). Width corresponds to intensity of symptoms.]

Figure 1(a) and (b) Incidence and intensification.
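The classification rule described for Figure 1(b) can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold of four units and the intensities of subjects 1 and 2 come from the text, whereas the intensities used for subjects 3 and 4 and the function name `is_new_case` are hypothetical, chosen so that subject 3 never reaches the threshold and subject 4 is already a case at baseline.

```python
# Threshold of disease, in units of symptom intensity (four units in Figure 1(b)).
THRESHOLD = 4


def is_new_case(baseline, followup, threshold=THRESHOLD):
    """New case: below the threshold at baseline, at or above it at followup."""
    return baseline < threshold <= followup


# (baseline, followup) intensities. Subjects 1 and 2 follow the text;
# the values for subjects 3 and 4 are hypothetical illustrations.
subjects = {
    1: (0, 4),  # simple onset, the "null hypothesis" subject
    2: (3, 4),  # nearly meets criteria at baseline, meets them at followup
    3: (0, 3),  # intensifies but never reaches the threshold (assumed values)
    4: (4, 6),  # already a case at baseline (assumed values)
}

for sid, (baseline, followup) in subjects.items():
    change = followup - baseline
    print(f"subject {sid}: change = {change}, "
          f"new case = {is_new_case(baseline, followup)}")
```

With these values, subjects 3 and 4 intensify more than subject 2 yet are not counted as new cases, which is the point the figure makes: a threshold-based incidence count does not capture intensification on its own.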

[Figure 1(c) panel, titled “Incidence and Development of Syndromes”: scatter of Symptom 1 against Symptom 2, with the new cases marked.]

[Time axis: Baseline, Prodrome, Followup.]

Figure 1(c) Incidence and the development of syndromes.

and another group where the two symptoms co-occur. This example could be expanded to syndromes involving more than two symptoms.

Acquisition and intensification are indicators of the force of morbidity in the population, as are more traditional forms of incidence rate. But they are not tied to any one definition of caseness. Rather, these concepts allow study of disease progression independently of case definition. Risk factors at different stages of the disease may be differentially related to disease progression only above or below the threshold set by the diagnosis. In this situation, we might reconsider the diagnostic threshold.

3.05.3 MORBIDITY SURVEYS

Morbidity surveys are cross-sectional surveys of the general population. They are used to estimate prevalence of disease in the population as well as to estimate need for care. Morbidity surveys address Morris's (1975, discussed above) second “use” of epidemiology, “diagnosing the health of the community, prioritizing health problems and identifying groups in need;” as well as the third use, “studying the working of health services;” and the fifth use, “identifying syndromes.” Morbidity surveys are sometimes called the descriptive aspect of epidemiology, but they can also be used to generate hypotheses about associations in the population, and to generate control samples for cohort and case–control studies.

3.05.3.1 Exemplar Study: The Epidemiologic Catchment Area (ECA) Program

An example of a morbidity survey is the Epidemiologic Catchment Area (ECA) Program, sponsored by the United States National Institute of Mental Health from 1978 through 1985. The broad aims of the ECA Program were “to estimate the incidence and prevalence of mental disorders, to search for etiological clues, and to aid in the planning of health care facilities” (Eaton, Regier, Locke, & Taube, 1981). The Program involved sample surveys of populations living in the catchment areas of already designated Community Mental Health Centers. The broad goals of the ECA Program are described in Eaton et al. (1981), the methods are described in Eaton and Kessler (1985) and Eaton et al. (1984), the cross-sectional results are described in Robins and Regier (1991), and the incidence estimates are described in Eaton, Kramer, Anthony, Chee, and Shapiro (1989).

A principal advantage of the morbidity survey is that it includes several or many disorders, which helps in assessing their relative importance from the public health point of view. Another advantage is that estimates of prevalence and association are measured without regard to the treatment status of the sample (Shapiro et al., 1984). Figure 2 displays results from the ECA study, showing the relatively high prevalence of phobia and relatively low prevalence of schizophrenia. The figure also shows the proportion meeting criteria for a given disorder within the last six months who had seen either a physician or mental health specialist during that period. These proportions are well under 50%, with the exception of schizophrenia and panic disorder. For depression, which is disabling and highly treatable, less than a third of those with the disorder have received treatment.

3.05.3.2 Defining and Sampling the Population

Defining the target population for the morbidity survey is the first step. The best way to define the target population is not always clear, and different definitions have implications for the ultimate value of the results, as well as the feasibility of the study. A good definition of a target population is an entire nation, such as in the National Comorbidity Survey, or, better yet, a birth cohort of an entire nation, such as the British Perinatal Study, which included all births in Britain during a single week in March of 1958 (Butler & Alberman, 1969). Other studies usually involve compromises of one form or another. The goal of the sampling procedure is that each respondent is selected into the sample with a known, and nonzero,

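[Figure 2: bar chart of prevalence in percent (vertical axis 0 to 12), each bar divided into treated and untreated portions. Disorders shown: Panic, Mania, Phobia, Drug A/D, Dysthymia, Alcohol A/D, Depression, Somatization, Schizophrenia, Schizophreniform, Cognitive impairment, Obsessive-compulsive, Anti-social personality. Bar labels shown: 10.8, 4.2, 3.4, 2.9, 2.7, 1.8, 1.7, 0.9, 0.8, 0.8, 0.7, 0.3, 0.1. Unweighted data from four sites of ECA program.]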

Figure 2 Prevalence of disorder in percent in six months prior to interview.

probability. Strict probabilistic sampling characterizes high-quality epidemiologic surveys, and is a requirement for generalization to the target population.

Most surveys are of the household residing, noninstitutionalized population, where the survey technology for sampling and interviewing individuals is strongest (e.g., Sudman, 1976). The ECA design defined the target population as “normal” residents of previously established catchment areas. Sampling was conducted in two strata, as shown in Figure 3. The household residing population was sampled via area probability methods or household listings provided by utility companies (e.g., as in Sudman, 1976). This stratum included short-stay group quarters such as jails, hospitals, and dormitories. After making a list of the residents in each household, the interviewer asked the person answering the door whether there were any other individuals who “normally” resided there but were temporarily absent. “Normally” was defined as the presence of a bed reserved for the individual at the time of the interview. Temporarily absent residents were added to the household roster before the single respondent was chosen randomly. If an individual was selected who was temporarily absent, the interviewer made an appointment for a time after their return, or conducted the interview at their temporary group quarters residence (i.e., in hospital, jail, dormitory, or other place). The ECA sampled the institutional populations separately, by listing all the institutions in the catchment area, as well as all those nearby institutions that admitted residents of the catchment area. Then the inhabitants of each institution were rostered and selected probabilistically. Sampling the institutional population required many more resources, per sampled individual, than the household sample, because each institution had to be approached individually. Inclusion of temporary and long-stay group quarters is important for health services research because many of the group quarters are involved in provision of health services, and because residents of group quarters may be high utilizers. The ultimate result of these procedures in the ECA was that each normal resident of the catchment area was sampled with a known probability.

It is not enough to crisply define a geographic area, because different areas involve different limitations on the generalizability of results. A nationally representative sample, such as the National Comorbidity Survey (NCS; Kessler et al., 1994), may seem to be the best. But how are the results of the national sample to be applied to a given local area, where decisions about services are made? In the ECA the decision was made to select five separate sites of research in order to provide the possibility of replication of results across sites, and to better understand the effects of local variation (Eaton et al., 1981). The ECA target population thus consisted, not of the nation, but rather of an awkward aggregation of catchment areas. Nevertheless, the ECA data were considered as benchmarks for a generation of research (Eaton, 1994) because there was sufficient variation in important sociodemographic variables to allow generalization to other large populations, that is, sufficiently large subgroups of young and old, men and women, married and unmarried, rich and poor, and so forth. Generalization to other target populations, such as Asian-Americans or Native Americans, or to rural areas, was not logical from the ECA. But note that generalization from a national random sample to all rural areas, or to small ethnic groups, would likewise not always be possible. The point is that the target population should be chosen with a view toward later generalization.

3.05.3.3 Sample Size

General population surveys are not efficient designs for rare disorders or unusual patterns of service use. Even for research on outcomes that are not rare, sample sizes are often larger than one thousand. A common statistic to be estimated is the prevalence, which is a proportion. For a proportion, the precision is affected by the square root of the sample size (Blalock, 1979). If the distribution of the proportion is favorable, say, 30–70%, then a sample size of 1000 produces precision which may be good enough for common sample surveys such as for voter preference. For example, a proportion of 0.50 has a 95% confidence interval from 0.47 to 0.53 with a sample of 1000. For rarer disorders, the confidence interval grows relative to the size of the proportion (i.e., 0.082–0.118 for a proportion of 0.10; 0.004–0.016 for a proportion of 0.01). Often, there is interest in patterns broken down by subpopulations, thus challenging the utility of samples with as few as 1000 individuals. Many community surveys, such as the ECA, have baseline samples in excess of 3000. It is important to estimate the precision of the sample for parameters of interest, and the power of the sample to test hypotheses of interest, before beginning the data collection.

3.05.3.4 Standardization of Data Collection

Assessment in epidemiology should ideally be undertaken with standardized measurement instruments which have known and acceptable
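The precision figures in the sample-size discussion can be approximated with a short sketch. It assumes the usual normal-approximation (Wald) interval with z = 1.96; the source does not say which formula was used, and the function name `wald_ci` is illustrative only.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion p
    estimated from a simple random sample of n respondents."""
    se = math.sqrt(p * (1 - p) / n)  # standard error shrinks with sqrt(n)
    return p - z * se, p + z * se


for p in (0.50, 0.10, 0.01):
    lo, hi = wald_ci(p, 1000)
    relative_half_width = (hi - p) / p
    print(f"p = {p:.2f}: {lo:.3f} to {hi:.3f} "
          f"(half-width is {relative_half_width:.0%} of p)")
```

For n = 1000 this gives roughly 0.47 to 0.53 for a proportion of 0.50, about 0.08 to 0.12 for 0.10, and about 0.004 to 0.016 for 0.01. The interval narrows in absolute terms as the proportion falls, but widens relative to the proportion itself, which is why rare disorders call for larger samples.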

[Figure 3: diagram titled “Sampling Strata for Residents: ECA Study Design,” with two strata:
Households + Temporary Group Quarters (Jails, Hospitals, Dormitories), n = 3000
Institutional Group Quarters (Nursing Homes, Prisons, Mental Hospitals), n = 500]

Figure 3 Sampling strata for residents: ECA study design.

reliability and validity. In community surveys, and automated record systems, reliable and valid measurement must take place efficiently and in “field conditions.” The amount of training for interviewers in household studies depends on the nature of the study. Interviewers in the Longitudinal Study on Aging (LSOA; Kovar, 1992), a well-known cohort study in the United States, received about 1.5 days of training (Fitti & Kovar, 1987), while ECA interviewers received slightly more than a week (Munson et al., 1985). Telephone interviews, such as in the LSOA, can be monitored systematically by recording or listening in on a random basis (as long as the subject is made aware of this possibility), but it is difficult to monitor household interviews, since it cannot be predicted exactly when and where the interview will take place.

The ECA Program involved a somewhat innovative interview called the Diagnostic Interview Schedule, or DIS (Robins, Helzer, Croughan, & Ratcliff, 1981). The DIS was designed to resemble a typical psychiatric interview, in which questions asked are highly dependent on answers already given. For example, if an individual responds positively to a question about the occurrence of panic attacks, a series of questions about that particular response are asked, but if the response to the question on panic is negative, the interviewer skips to the next section. In effect, the interview adapts to the responses of the subject, so that more questions are asked where more information is needed. The high degree of structure in the DIS required more than one week of training, as well as attention to the visual properties of the interview booklet itself. The result was that the interviewer could follow instructions regarding the flow of the interview, and the recording of data, smoothly, so as not to offend or alienate the respondent. Household surveys are becoming increasingly adaptive and response dependent, because more information can be provided in a shorter amount of time with adaptive interviews. Inexpensive laptop computers will facilitate such adaptive interviewing. Self-administered, computerized admission procedures are becoming more widely disseminated in the health care system, expanding the database, and facilitating the retrieval and linkage of records.

The reliability and validity of measurement are usually assessed prior to beginning a field study. Usually, the assessment involves pilot tests on samples of convenience to determine the optimal order of the questions, the time required for each section, and whether any questions are unclear or offensive. Debriefing subjects in these pilot tests is often helpful. The next step is a test of reliability and validity. Many pretests select samples from populations according to their health or services use characteristics in order to generate enough variation on responses to adequately estimate reliability and validity. In order to economize, such pretests are often conducted in clinical settings. But such pretests do not predict reliability and validity under the “field” conditions of household interviews. The ECA Program design involved reliability and validity assessment in a hospital setting (Robins et al., 1981), and later under field conditions (Anthony et al., 1985). The reliability and validity of the DIS were lower in the household setting.

3.05.4 EXPOSURE AND DISEASE

Two basic research designs in epidemiology are the cohort and the case–control design. These designs are used mostly in addressing Morris' seventh “use” of epidemiology, the search for causes. This is sometimes called the analytic aspect of epidemiology. These designs differ in their temporal orientation to collection of data (prospective vs. retrospective), and in the criteria by which they sample (by exposure or by caseness), as shown in Table 2.

3.05.4.1 Cohort Design

In a cohort design, incidence of disorder is compared in two or more groups which differ in some exposure thought to be related to the disorder (Table 2). The cohort design differs from the morbidity survey in that it is prospective, involving longitudinal follow-up of an identified cohort of individuals. As well as the search for causes, the cohort design addresses Morris' fourth “use” of epidemiology, “estimating individual risks;” and the sixth “use,” that is, “completing the clinical picture, especially the natural history.” The entire cohort can be sampled, but when a specific exposure is hypothesized, the design can be made more efficient by sampling for intensive measurement on the basis of the putative exposure, for example, children living in an area of toxic exposures, or with parents who have been convicted of abuse, or who have had problems during delivery, and a control group from the general population. Follow-up allows comparison of incidence rates in both groups.

3.05.4.2 Exemplar Study: the British Perinatal Study

An example is the British Perinatal Study, a cohort study of 98% of all births in Great Britain during a single week in March, 1958 (Butler & Alberman, 1969; Done, Johnstone, & Frith, 1991; Sacker, Done, Crow, & Golding, 1995). The cohort had assessments at the ages of 7 and 11 years, and, later, mental hospital records were linked for those entering psychiatric hospitals between 1974 and 1986, by which time the cohort was 30 years old. Diagnoses were made from hospital case notes using a standard system. There was some incompleteness in the data, but it was not too large: for example, 12 of the 49 individuals diagnosed as “narrow” schizophrenia were excluded because they were immigrants, multiple births, or because they lacked data. It is difficult to say how many individuals in the cohort had episodes of disorder that did not result in hospitalization, but that does not necessarily threaten the results, if the attrition occurred equally for

Table 2 Case–control and cohort studies.

                               Case–control study
                               Cases      Controls
Cohort study   Exposed         a          b            a + b
               Not exposed     c          d            c + d

Relative risk = [a/(a + b)] / [c/(c + d)]

Relative odds = (a/b) / (c/d) = ad/bc

different categories of exposure. Table 3 shows results for one of 37 different variables related to birth, that is, “exposures,” that were available in the midwives' report: birth weight under 2500 g. With n given at the head of the columns in Table 3, the reader can fill in the four cells of a two-by-two table, as in Table 2, for each of the four disorders. For example, the cells, labeled as in Table 2, for narrow schizophrenia, are: a = 5; b = 706; c = 30; d = 16 106. The number with the disorder is divided by the number of births to generate the risk of developing the disorder by the time of follow-up: the cumulative incidence. The risks can be compared across those with and without the exposure. In Table 2, the incidence of those with exposure is a/(a+b), and the incidence of those without exposure is c/(c+d). The relative risk is a comparison of the two risks: [a/(a+b)]/[c/(c+d)]. For narrow schizophrenia the relative risk is RR = [5/711]/[30/16 136], or 3.78.

The relative risk is closely related to causality since it quantifies the association in the context of a prospective study, so the temporal ordering is clear. The relative risk is approximated closely by the relative odds or odds ratio, which does not include the cases in the denominator (i.e., OR = [5/706]/[30/16 106] = 3.80, not 3.78). The odds ratio has many statistical advantages for epidemiology. It is easy to calculate in the two-by-two table by the cross-products ratio (i.e., ad/bc). The odds ratio quantifies the association without being affected by the prevalence of the disorder or the exposure, which is important for the logic of cohort and case–control studies, where these prevalences may change from study to study, depending on the research design. This lack of dependence on the marginal distribution is not characteristic of many measures of association typically used in psychology, such as the correlation coefficient, the difference in proportions, or the kappa coefficient (Bishop et al., 1975). The odds ratio is a standard result of logistic regression, and can be adjusted by other factors such as gender, age, and so forth. The odds ratio for low birth weight and narrow schizophrenia in the British Perinatal Study, as it happens, was 3.9, after adjustment for social class and other demographic variables.

The logic of the cohort study includes assessing the natural history of the disorder, that is, the study of onset and chronicity, in a population context without specific intervention by the researcher. Study of natural history requires repeated follow-up observations. In the British Perinatal Study, there were assessments of the cohort at the ages of 7 and 11. Once the case groups had been defined, it was possible to study their reading and mathematics performance well before hospitalization. Those destined to become schizophrenic had lower reading and mathematics scores at both 7 and 11 years of age (Crow, Done, & Sacker, 1996). Males who would eventually be diagnosed schizophrenic had problems relating to conduct during childhood, while females were anxious, as compared to children who did not end up being hospitalized with a diagnosis of schizophrenia later. Later follow-ups in this study may economize by studying only those who have had onset, and a small random subsample of others, to estimate such factors as the length of episodes, the probability of recurrences, prognostic predictors, and long-term functioning.

3.05.4.3 Case–Control Design

In the case–control study, it is the disease or disorder which drives the logic of the design, and many factors can be studied efficiently as possible causes of the single disorder (Schlesselman, 1982). The case–control study may be the most important contribution of epidemiology to the advancement of public health, because it is so efficient in searching for causes when there is little knowledge. For example, adenocarcinoma of the vagina occurs so rarely that, prior to 1966, not a single case under the age of 50 years
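The relative risk and odds ratio arithmetic for narrow schizophrenia can be checked with a short sketch. The cell counts a = 5, b = 706, c = 30 and d = 16 106 are the ones given in the text, labeled as in Table 2; the function names are illustrative and not part of the source.

```python
def relative_risk(a, b, c, d):
    """Ratio of cumulative incidence in the exposed, a/(a+b),
    to cumulative incidence in the unexposed, c/(c+d)."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-products ratio ad/bc from the two-by-two table."""
    return (a * d) / (b * c)


# Narrow schizophrenia by low birth weight, cells labeled as in Table 2.
a, b, c, d = 5, 706, 30, 16106

risk_exposed = 1000 * a / (a + b)      # cumulative incidence per 1000, low birth weight
risk_unexposed = 1000 * c / (c + d)    # cumulative incidence per 1000, normal birth weight

print(f"incidence per 1000: {risk_exposed:.2f} vs {risk_unexposed:.2f}")
print(f"relative risk = {relative_risk(a, b, c, d):.2f}")
print(f"odds ratio    = {odds_ratio(a, b, c, d):.2f}")
```

The two measures differ only in whether the cases are counted in the denominators, so they nearly coincide when the disorder is rare, as it is here.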

Table 3 Mental disorder and low birth weight: British Perinatal study.

                                   Cumulative incidence per 1000
Disorder                           Low birth weight    Normal birth weight    Odds ratio
                                   (n = 727)           (n = 16 812)

Narrow schizophrenia (n = 35)      7.03                1.86                   3.8
Broad schizophrenia (n = 57)       9.82                3.09                   3.2
Affective psychosis (n = 32)       8.43                1.61                   5.3
Neurosis (n = 76)                  4.23                4.51                   0.9

Source: Adapted from Sacker et al. (1995).

had been recorded at the Vincent Memorial Hospital in Boston. A time-space clustering of eight cases, all among young women born within the period 1946-1951, was studied (Herbst, Ulfelder, & Poskanzer, 1971). There was almost no knowledge of etiology. A case-control study was conducted, matching four controls to each case. The study reported a highly significant (p < 0.00001) association between treatment of the mothers with the estrogen diethylstilbestrol during pregnancy and the subsequent development of adenocarcinoma of the vagina in the daughters. The results led to recommendations to avoid administering stilbestrol during pregnancy. Logistic regression, developed by epidemiologists and used in case-control studies, has distinct advantages over analysis of variance, ordinary least-squares regression, discriminant function analysis, and probit regression, developed by agronomists, psychologists, economists, and educational psychologists, especially when the dichotomous dependent variable has a very skewed distribution, that is, when the outcome is very rare.

3.05.4.4 Exemplar Study: Depression in Women in London

The most convincing demonstration of social factors in the etiology of any mental disorder is The social origins of depression, by Brown and Harris (1978). That study used the case-control method to demonstrate the importance of life-event stresses and chronic difficulties as causal agents provoking the onset of depression. The method involved in-depth diagnostic and etiologic interviews with a sample of 114 patients and a sample of 458 household residents. The target population is that residing in the Camberwell area in south London.

The analysis presented by Brown and Harris is logical but does not always follow the standard epidemiologic style. Table 4 presents the crucial findings in the typical epidemiologic manner (following Table 2). Two case groups are defined: the 114 women presenting at clinics and hospitals serving the Camberwell area, diagnosed by psychiatrists using a standardized clinical assessment tool called the Present State Examination, and 76 women in the community survey (17%) who were depressed at the time of the survey, as judged by a highly trained, nonmedical interviewer using a shortened version of the same instrument. Sixty-one percent of the patient cases (70/114) and 68% of the survey cases (52/76) experienced the "provoking agent" of a severe life event during the year prior to the onset of depression. Twenty percent (76/382) of the healthy controls experienced such an event in the year prior to the interview. Patient cases had 6.4 times the odds of having experienced a severe life event compared with the controls (i.e., [70/44]/[76/306]).

The case-control study is very efficient, especially when the disease is rare, because it approximates the statistical power of a huge cohort study with a relatively limited number of controls. For a disease like schizophrenia, which occurs in less than 1% of the population, 100 cases from hospitals or a psychiatric case register can be matched to 300 controls from the catchment area of the hospital. A cohort study yielding this number of cases would involve measurements on 10 000 persons instead of 400! Since the disease is rare, it may be unnecessary to conduct diagnostic examinations on the group of controls, which would have only a small number of cases. Furthermore, a few cases distributed among the controls would weaken the comparison of cases to controls, generating a conservative bias. The statistical power of the case-control study is very close to that of the analogous cohort study, and, as shown above, the odds ratio is a close estimate of the relative risk. The case-control study can be used to test hypotheses about exposures, but it also has the very great ability to make comparisons across a wide range of possible risk factors, and can be useful even when there are very few hypotheses available.

3.05.5 USING AVAILABLE RECORDS

3.05.5.1 Government Surveys

There are many sources of data for psychiatric epidemiologic research which are available to the public. These include statistics from

Table 4 Depression and life events and difficulties in London.

                            Cases (patients)   Cases (survey)   Controls (survey)

One or more severe event     70    61%          52    68%         76    20%
No severe events             44    39%          24    32%        306    80%
Total                       114   100%          76   100%        382   100%
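The odds ratio quoted for the patient cases can be reproduced from the cell counts of Table 4, and the same calculation can be applied to the survey cases. The following is only a minimal sketch of that arithmetic; the function name `odds_ratio` and the variable names are illustrative and do not come from Brown and Harris.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)


# Cell counts from Table 4: (one or more severe events, no severe events).
patients = (70, 44)
survey_cases = (52, 24)
controls = (76, 306)

or_patients = odds_ratio(*patients, *controls)
or_survey = odds_ratio(*survey_cases, *controls)

print(f"Patient cases vs. controls: odds ratio = {or_patients:.1f}")
print(f"Survey cases vs. controls:  odds ratio = {or_survey:.1f}")
```

Both case groups are compared with the same 382 survey controls, so the denominator odds (76/306) are shared between the two ratios.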

Source: Brown & Harris (1978).

treatment facilities and from large sample surveys of populations conducted by a large organization such as a national government. Often the measures of interest to psychologists are only a small part of the survey, but the availability of a range of measures, drawn from other disciplines, can be a strong advantage. In the United States an important source of data is the National Center for Health Statistics (NCHS). For example, the Health and Nutrition Examination Survey (HANES) is a national sample survey conducted by the NCHS involving physical examinations of a sample of the United States population. Its first phase included the Center for Epidemiologic Studies Depression Scale as part of its battery of measures, which included anthropometric measures, nutritional assessments, blood chemistry, medical history, and medical examinations (Eaton & Kessler, 1981; Eaton & McLeod, 1984).

Later phases of the HANES included portions of the DIS. The Health Interview Survey conducts health interviews of the general population of the United States, and includes reports by the respondent of named psychiatric disorders such as schizophrenia, depressive disorder, and so forth. The National Medical Care Utilization and Expenditure Survey (MCUES) is conducted by the NCHS to help understand the health service system and its financing. The MCUES samples records of practitioners from across the nation, and there are several questions on treatment for psychological problems.

Some governments also sponsor national surveys which focus on psychological disorders. The ECA is one such example, although it was not, strictly speaking, a sample of the nation. The later National Comorbidity Survey (NCS) includes 8058 respondents from a probability sample of the US, with its major measurement instrument being a version of the Composite International Diagnostic Interview (CIDI), a descendant of the DIS used in the ECA surveys (Kessler et al., 1994). The British government conducted a large survey of psychological disorders in a national sample, using a similar instrument to the DIS and CIDI (Meltzer, Gill, Petticrew, & Hinds, 1995a, 1995b). Anonymous, individual-level data from the ECA, the British survey, and the NCS are available at nominal cost.

3.05.5.2 Vital Statistics

Governments generally assimilate data on births, deaths, and marriages from states, provinces, or localities, ensuring some minimal degree of uniformity of reporting and creating data files for public use. The mortality files usually list the cause of death, including suicide. These data files are often available from the government at nominal cost.

3.05.5.3 Record-based Statistics and Record Linkage

Statistics originating from treatment facilities can also be put to good use in psychiatric epidemiology. Many early epidemiologic studies used hospital statistics as the numerators in estimating rates, and census statistics in the denominators (Kramer, 1969). The utility of rates estimated in this manner is dependent on the degree to which the clinical disorder is associated with treatment, presumably a stronger association for severe schizophrenia than for mild phobia. The value of these rates also depends on the relative scarcity or abundance of treatment facilities. In the United States, Medicaid and Medicare files, which include such data as diagnosis and treatment, are available for research use. Many Health Maintenance Organizations maintain data files which could be used in psychological epidemiological research.

Statistics from treatment facilities are enhanced considerably by linkage with other facilities in the same geographic area, serving the same population (Mortensen, 1995). Although the number has declined, there still exist several registers which attempt to record and link together all psychiatric treatment episodes for a given population (ten Horn, Giel, Gulbinat, & Henderson, 1986). Linkage across facilities allows multiple treatment episodes for the same individual (even the same episode of illness) to be combined into one record (so-called "unduplicating"). Linkage across time allows longitudinal study of the course of treatment. Linkage with general health facilities, and with birth and mortality records, provides special opportunities. The best known example of a comprehensive health registry is the Oxford Record Linkage Study (ORLS; Acheson, Baldwin, & Graham, 1987), which links together all hospital treatment episodes in its catchment area. The database of the ORLS consists only of births, deaths, and hospital admissions, which is a strong limitation. However, due to the catchmenting aspect of the British National Health Service, the data are not limited to the household-residing population, as is the National Comorbidity Survey, for example, or the LSOA. In automated data collection systems such as the ORLS, the recordation is often done under the auspices of the medical record systems, with data recorded by the physician, such as the diagnosis. It cannot be presumed that a physician's diagnosis is "standardized" and therefore more reliable than other interview or record data. In fact, there is significant variation in diagnosis among physicians. Research using measurements and diagnoses recorded by the physician, as in record linkage systems such as the ORLS, should ideally include studies of reliability and validity (e.g., Loffler et al., 1994).

3.05.5.4 Exemplar Study: the Danish Adoption Study

An example of the benefits of record linkage in psychological research is the adoption study of schizophrenia in Denmark (Kety, Rosenthal, Wender, Schulsinger, & Jacobsen, 1975). Familial and twin studies of schizophrenia suggested strongly that schizophrenia was inherited, but these studies were open to the interpretation that the inheritance was cultural, not genetic, because family members are raised together, and identical twins may be raised in social environments which are more similar than the environments of fraternal twins. The Danish Adoption Study ingeniously combined the strategy of file linkage with interviews of cases, controls, and their relatives. In Denmark, each individual receives a number at birth which is included in most registration systems. The registration systems are potentially linkable, after appropriate safeguards and clearances. In the adoption study, three registers were used. First, all 5483 individuals in the county and city of Copenhagen who had been adopted by persons or families other than their biological relatives, from 1924 through 1947, were identified from a register of adoptions (Figure 4). These were linked to the psychiatric case register, wherein it was determined that 507 adoptees had ever been admitted to a psychiatric hospital. From case notes in the hospitals, 34 adoptees who met criteria for schizophrenia (about 0.5% of the total number of adoptees) were selected, and matched to control adoptees on the basis of age, sex, socioeconomic status of the adopting family, and time with biologic family, or in institution, prior to adoption. The relatives of these 68 cases and controls were identified by linkage with yet another register in Denmark which permits locating families, parents, and children. After allowing for mortality, refusal, and a total of three who were impossible to trace, a psychiatric interview was conducted on 341 individuals (including 12 on whom the interview was not conducted but sufficient information for diagnosis was obtained). Eleven of the 118 biologic relatives of the index adoptees were schizophrenic (9%), vs. one in the 35 relatives of adoptive families of the index adoptees (3%), and three of the 140 biologic relatives of the control adoptees (2%). The findings for schizophrenia spectrum disorders (including uncertain schizophrenia and schizoid personality) also show a pattern consistent only with genetic inheritance (26/118 in the biologic relatives of index cases, or 22%, vs. 16/140 biologic relatives of control adoptees, or 11%).

3.05.6 PREVENTIVE TRIALS

3.05.6.1 The Logic of Prevention

Some of the earliest epidemiologists used interventions in the community to stop an epidemic or to gather information about the causes of diseases in the population. This is sometimes called the experimental aspect of epidemiology. The best known experiment is that conducted by Snow in the cholera epidemic in London. Snow's work exemplifies epidemiologic principles in several ways (Cooper & Morgan, 1973). It was widely believed that cholera was spread through the air, in a miasma, leading many to seek safety by retreating to rural areas. Snow showed with ecologic data in London that areas serviced by a water company taking water from upstream in the Thames had lower rates of cholera than areas serviced by a company taking water from downstream. This ecologic comparison suggested that cholera was borne by water. In the context of a single cholera epidemic, Snow identified individual cases of cholera, showing that they tended to cluster around a single water pump at Broad Street. He further showed that many exceptional cases, that is, cases of cholera residing distant from the pump, had actually drawn or drunk water from that pump (e.g., on their way home from work). His action to remove the handle of the pump, which roughly coincided with the termination of the epidemic, is regarded as an early instance of experimental epidemiology.

In epidemiology, as with medicine generally, intervention is valued if it is effective, regardless of whether the causation of the disease is understood or not. Table 5 shows examples of preventive measures which were implemented well prior to knowledge of the cause, discovered much later in many cases. This logic leads to experimentation even in the absence of causal information.

Various conceptual frameworks have been used to organize the area of prevention research. The Commission on Chronic Illness (1957) divided prevention into three types, dependent on the stage of the disease the intervention was designed to prevent or treat. Prior to onset of the disease, the prevention was primary, and its goal was to reduce incidence; after disease onset, the intervention was directed at promoting

Research design

Method                  Sample
Adoption Register       5483 adoptees
Psychiatric Register    507 admitted; 4976 not admitted
Case Notes              34 index cases (473 other admitted adoptees); 34 controls
Folkeregister           247 relatives of index cases; 265 relatives of controls

                                       Index cases             Controls
                                       Biologic   Adoptive     Biologic   Adoptive
Relatives identified                   173        74           174        91
Psychiatric interview
  (after mortality/refusal)            118        35           140        48

Results

Schizophrenic                          11         1            3          2
Spectrum                               26         3            16         5
Normal                                 81         31           121        41
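The proportions discussed for the Danish Adoption Study follow directly from the counts in Figure 4. The sketch below simply divides each result count by the number of relatives interviewed in that group; the list and function names are illustrative assumptions, not part of the original study.

```python
# Counts taken from Figure 4, in the order:
# index biologic, index adoptive, control biologic, control adoptive.
groups = ["index biologic", "index adoptive", "control biologic", "control adoptive"]
interviewed = [118, 35, 140, 48]
schizophrenic = [11, 1, 3, 2]
spectrum = [26, 3, 16, 5]


def percent(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


schizophrenic_pct = [percent(c, n) for c, n in zip(schizophrenic, interviewed)]
spectrum_pct = [percent(c, n) for c, n in zip(spectrum, interviewed)]

for name, s_pct, sp_pct in zip(groups, schizophrenic_pct, spectrum_pct):
    print(f"{name:17s} schizophrenic {s_pct:4.1f}%   spectrum {sp_pct:4.1f}%")
```

The contrast of interest is between the biologic relatives of index cases and every other group: only a genetic explanation predicts an excess confined to that one column.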

Figure 4 Danish adoption study: research design and results. Source: Kety, Rosenthal, Wender, Schulsinger, and Jacobsen (1975).

recovery, and its goal was to reduce prevalence, so-called secondary prevention. Tertiary prevention was designed to prevent impairment and handicap which might result from the disease. The Institute of Medicine (Mrazek & Haggerty, 1994) integrated this framework into that of Gordon (1983), who directed attention at the population to which the preventive intervention was directed: universal preventions at the entire general population; targeted interventions at subgroups at high risk of development of disorder; and indicated interventions, directed at individuals who have already manifested signs and symptoms of the disease.

The results of integrating the two frameworks are shown in Figure 5. Curative medicine generally operates in the area of secondary prevention, with indicated interventions. Rehabilitative medicine is in the area of tertiary prevention. Primary, universal, and targeted interventions are the province of experimental epidemiology. Prominent examples of universal interventions in the area of epidemiology are various smoking cessation campaigns, and such studies as the Stanford Five-City Project, which was designed to reduce risk of heart disease by lowering levels of several associated risk factors (Farquhar et al., 1990).

Table 5 Knowledge of prevention and etiology.

                   Prevention            Etiology
Disease            Discoverer    Year    Agent                   Discoverer      Year

Scurvy             Lind          1753    Ascorbic acid           Szent-Györgyi   1928
Pellagra           Casal         1755    Niacin                  Goldberger      1924
Scrotal cancer     Pott          1775    Benzopyrene             Cook            1933
Smallpox           Jenner        1798    Orthopoxvirus           Fenner          1958
Puerperal fever    Semmelweis    1847    Streptococcus           Pasteur         1879
Cholera            Snow          1849    Vibrio cholerae         Koch            1893
Bladder cancer     Rehn          1895    2-Naphthylamine         Hueper          1938
Yellow fever       Reed          1901    Flavivirus              Stokes          1928
Oral cancer        Abbe          1915    N-Nitrosonornicotine    Hoffman         1974

Source: Wynder (1994).

3.05.6.2 Attributable Risk

From among many possibilities, how should interventions be selected? Epidemiology provides a helpful tool in the form of the Population Attributable Risk, sometimes called the Attributable Fraction or the Etiologic Fraction (Lilienfeld & Stolley, 1994, p. 202). The attributable risk is the maximum estimate of the proportion of the incidence of disease that would be prevented if a given risk factor were eliminated. For a given disease, the attributable risk combines information from the relative risk for a given exposure with the prevalence of the exposure in the population. The formula for attributable risk is:

\[ \text{Attributable Risk} = \frac{P(RR - 1)}{P(RR - 1) + 1} \]

where:
P = Prevalence of Exposure, and
RR = Relative Risk of Exposure to Disease

The relative risk can be estimated from a cohort study, as described above, and the prevalence of the exposure can be estimated from a separate survey. A simple case-control study also provides the information for attributable risk, under certain conditions. The relative risk is approximated by the relative odds, as discussed above. If the controls are selected from the general population, the level of exposure can be estimated from them. For example, case-control studies of smoking and lung cancer in the early 1950s showed that the odds of dying from lung cancer were about 15 times higher for smokers than for nonsmokers. About half the population smoked, leading to the attributable risk estimate of about 80%. In the United States, this meant that about 350 000 of the 400 000 annual deaths due to lung cancer would not occur if smoking were eliminated totally. In the situation of many possible risk factors, the attributable risk is a tool which helps prioritize them.

Epidemiologic cohort studies can provide information which may help to stage the intervention at the most appropriate time. The intervention should occur before most onsets in the population have taken place, but not so far before onset that the effects of the intervention wash out before the process of disease development has begun. The appropriate stage for prevention in many situations would appear to be that of precursors, in which there are subgroups which can be identified at high risk, but before the disease prodrome has started (Eaton et al., 1995).

3.05.6.3 Developmental Epidemiology

Epidemiologists are gradually becoming aware of the need for a life course perspective (Kellam & Werthamer-Larsson, 1986). In social and psychological research on human beings, cohort studies may suggest regularities in human development which can be considered etiologic clues. The clues consist of temporal sequences of various behaviors over several years, with only a probabilistic connection between each behavior. The behaviors may have a common antecedent, such as a genetic background, or they may depend on the sequence of environments that the individual experiences, and the interaction between environments, genetic traits, and habits formed by the history of prior behaviors. The epidemiologic notion of exposures, developed in the context of infectious agents and toxins, is too simple to be useful in these multicausal, developmental sequences. For example, aggressive behavior by boys in early elementary school years is followed by conduct problems in middle

[Diagram of the intervention spectrum, with three segments.
Prevention: Universal; Selective; Indicated.
Treatment: Case identification; Standard treatment for known disorders.
Maintenance: Compliance with long-term treatment (goal: reduction in relapse and recurrence); After-care (including rehabilitation).]
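As a worked check of the attributable risk formula in Section 3.05.6.2, the smoking and lung cancer figures given there (prevalence of exposure about one half, relative risk about 15) can be substituted directly. This is only a sketch of the arithmetic; the function and variable names are assumptions made for illustration.

```python
def attributable_risk(prevalence, relative_risk):
    """Population attributable risk: P(RR - 1) / (P(RR - 1) + 1)."""
    excess = prevalence * (relative_risk - 1)
    return excess / (excess + 1)


# Values quoted in the text for smoking and lung cancer in the early 1950s.
prevalence_of_smoking = 0.5
relative_risk_lung_cancer = 15

ar = attributable_risk(prevalence_of_smoking, relative_risk_lung_cancer)
annual_deaths = 400_000
preventable_deaths = ar * annual_deaths

print(f"Attributable risk: {ar:.3f}")
print(f"Deaths attributable to smoking: {preventable_deaths:,.0f} of {annual_deaths:,}")
```

With these inputs the formula returns 7/8, which matches the 350 000 of 400 000 deaths cited in the text. Because the result depends on the prevalence of exposure as well as the relative risk, a common exposure with a modest relative risk can carry a larger attributable risk than a rare exposure with a high one.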

Figure 5 The mental health intervention spectrum for mental disorders.

school; later behaviors such as smoking cigarettes, having sexual relations at an early age, and drinking alcohol, in high school, and, eventually, the diagnosis of antisocial personality disorder as an adult. Which of these are essential causes, and which simply are associated behaviors which have no influence on the causal chain to the outcome (in this example, the diagnosis of antisocial personality disorder)? Since the chain of causes is probabilistic and multicausal, the notion of attributable risk (discussed above) is too simplistic, and it is unlikely that any estimate of attributable risk would be even moderate in size. A preventive intervention trial with random or near-random assignment to conditions which manipulate a putative cause may be the only way to generate an understanding of the causal chain. The preventive trial serves the traditional purpose in epidemiology of testing the efficacy of the intervention, but it also serves as an examination of the developmental pathway leading to disorder, for building and testing theories.

3.05.6.4 Exemplar Study: the Johns Hopkins Prevention Trial

An example of a preventive intervention trial is the Johns Hopkins Prevention Trial (Kellam, Rebok, Ialongo, & Mayer, 1994). The trial was designed to examine developmental pathways generally occurring after two important characteristics which could be identified in first graders: success at learning (especially, learning to read) and aggressive behavior. Table 6 shows the design of the trials, using the notation of experimental and quasi-experimental design in psychology (Campbell & Stanley, 1971). The population for the trial was carefully defined in epidemiologic terms to include all first graders in public schools in an area of Baltimore. The Baltimore City Public School system was an active participant in the research. The 19 schools were divided into five levels, with three or four schools per level, based on the socioeconomic characteristics of the areas they served. At each of the five levels there was a control school and two experimental schools. For the study of aggressive behavior, 153 children were assigned at random to eight separate classrooms in which the teacher used a special classroom procedure called the "Good Behavior Game," which had been shown to reduce the level of aggressive behavior in classrooms. Nine classrooms with 163 children received an intervention to improve reading mastery, and there were 377 children who were in control classrooms. The control classrooms consisted of classrooms in the same school as the experimental classrooms, but which did not receive the intervention; and classrooms in schools in the same area of Baltimore, where there were no interventions at all. One outcome of the Good Behavior Game Trial was that the median level of aggressive behavior went down during the period of the trial for those in the experimental group. For controls, the level of aggressive behavior was relatively stable. The most impressive result was that aggressive behavior continued to decline in the experimental group, even after discontinuation of the intervention in third grade, at least through the spring of sixth grade.

Table 6 Intervention and assessment for Johns Hopkins Preventive Trials in 19 public elementary schools.

                     Number of    Number of   Grade 1     Grade 2     Grade 3     Grade 4     Grade 5     Grade 6
                     classrooms   students    1985-1986   1986-1987   1987-1988   1988-1989   1989-1990   1990-1991
                                              F/S         F/S         F/S         F/S         F/S         F/S

Good behavior game   8            153         RO/XO       XO/O        /O          /O          /O          /O
Mastery learning     9            163         RO/XO       XO/O        /O          /O          /O          /O
Controls             24           377         RO/O        O/O         /O          /O          /O          /O

Source: Kellam et al. (1994). F=fall; S=spring; O=observation; X=intervention; R=random assignment.

3.05.7 BIAS

Epidemiologic research entails bias, and it is beneficial to anticipate bias in designing research studies. Common biases in epidemiology take a slightly different form than in typical psychological research (e.g., as enumerated in Campbell & Stanley, 1971), but the principle of developing a language to conceptualize bias, and of attempting to anticipate, eliminate, or measure bias, is common to both disciplines. In this section five basic types of bias are considered: sampling bias, divided into three subtypes of treatment bias, prevalence bias, and response bias; and measurement bias, divided into subtypes of diagnostic bias and exposure bias.

3.05.7.1 Sampling Bias

Sampling bias arises when the sample studied does not represent the population to which generalization is to be made. An important type of sampling bias is treatment bias. The case-control design often takes advantage of treatment facilities for finding cases of disorder to compare with controls. With severe diseases which invariably lead to treatment, there is less bias than for disorders which are less noticeable, distressing, or impairing to the individual. As shown in Figure 2, data from the ECA Program indicate that among the mental disorders, only for schizophrenia and panic disorder are as many as half the current cases under treatment. In 1946, Berkson showed that, where the probability of treatment for a given disorder is less than 1.0, cases in treatment over-represented individuals with more than one disorder, that is, comorbid cases. In studying risk factors for disorder with the case-control design, where cases are found through clinics, the association revealed in the data may be an artifact of the comorbidity, that is, exposure x may appear to be associated with disease y, the focus disorder of the study, but actually be related to disease z (perhaps not measured in the study). This type of bias is so important in epidemiologic research, especially case-control studies, that it is termed "Berkson's bias," or sometimes "treatment bias." The existence of this bias is possibly one factor connected to the little use of the case-control design in studies of psychiatric disorders.

Another type of sampling bias very common in case-control studies arises from another type of efficiency in finding cases of disorder, that is, the use of prevalent cases. This is sometimes termed prevalence bias, or the clinician's illusion (Cohen & Cohen, 1984). The ideal is to compare relative risks among exposed and nonexposed populations, which entails comparison of incidence rates, as discussed above. But the rate of incidence is often too low to generate sufficient cases for analysis, and necessitates extending the data collection over a period of months or years during which new cases are collected. Prevalent cases are easier to locate, either through a cross-sectional survey, or, easier yet, through records of individuals currently under treatment. The problem with study of current cases is that the prevalence is a function of the incidence and chronicity of the disorder, as discussed above. Association of an exposure with the presence or absence of disease mixes up influences on incidence with influences on chronicity. Influences on chronicity can include treatment. For example, comparing the brain structure of prevalent cases of schizophrenia to controls showed differences in the size of the ventricles, possibly an exciting clue to the etiology of schizophrenia. But a possible bias is that phenothiazine or other medication, used to treat schizophrenia, produced the changes in brain structure. Later studies of brain structure had to focus on schizophrenics who had not been treated in order to eliminate this possibility. The study of depression among women in London is possibly subject to prevalence bias since cases were selected on the basis of presentation to treatment, and controls via a cross-sectional (prevalence) survey. Thus, it may be that provoking agents such as life events contribute only to the recurrence of episodes of depression, not necessarily to their initial onset.

Response bias is a third general threat to the validity of findings from epidemiologic research. The epidemiologic design includes explicit statements of the population to which generalization is sought, as discussed above. The sampling procedure includes ways to designate individuals thought to be representative of that population. After designation as respondents, before measurements can be taken, some respondents become unavailable to the research, usually through one of three mechanisms: death, refusal, or change of residence. If these designated-but-not-included respondents are not a random sample, there will be bias in the results. The bias can result from differences in the distribution of cases vs. noncases as to response rate, or from differences in distribution of exposures, or both. In cross-sectional field surveys (Von Korff et al., 1985) and in follow-up surveys (Eaton, Anthony, Tepper, & Dryman, 1992), the response bias connected to psychopathology has not been extremely strong. Persons with psychiatric diagnoses are not more likely to refuse to participate, for example. Persons with cognitive impairment are more likely to die during a follow-up interval, and persons with antisocial personality disorder, or abuse of illegal drugs, are more likely to change address and be difficult to locate. Differential response bias is particularly threatening to studies with long follow-up periods, such as the British Perinatal Study and the Danish Adoption Study, since there was sizable attrition in both studies.

3.05.7.2 Measurement Bias

Bias in measurement is called invalidity in psychology, and this term is also used in epidemiology. But the study of validity in epidemiology has been more narrowly focused than in psychology. The concentration in epidemiology has been on dichotomous measures, as noted above. The medical ideology has ignored the notion that concepts for disease and pathology might actually be conventions of thought. Where the psychometrician takes as a basic assumption that the true state of nature is not observable, the epidemiologist tends to think of disease as a state whose presence is essentially knowable. As a result, discussions of face validity, content validity, or construct validity are rare in epidemiologic research. Instead the focus has been on sensitivity and specificity, which are two aspects of criterion validity (also a term not used much in epidemiologic research).

Sensitivity is the proportion of true cases that are identified as cases by the measure (Table 7). Specificity is the proportion of true noncases that are identified as noncases by the measure. In psychological research, validity is often estimated with a correlation coefficient, but this statistic is inappropriate because the construct of disease is dichotomous and differences in rarity of the disorder will constrain the value of the correlation coefficient. Furthermore, use of sensitivity and specificity has the advantage over correlation that it forces quantification of both types of error, that is, false-positive and false-negative error. These errors, and the calibration of the measurement to adapt to them, depend heavily on the prevalence of the disorder, on the importance of locating cases, and on the expense of dealing with those cases that are located. Choice of a threshold to define disorder as present or absent has important consequences, and is aided by detailed study of the effects of different thresholds (sometimes termed receiver operating characteristic (ROC) analysis, as in Murphy, 1987). There are simple procedures for correcting prevalence estimates according to the sensitivity and specificity of the measurement (Rogan & Gladen, 1978). Procedures for correcting measures of association, such as the odds ratio, are more complicated since they depend on the precise study design.

Bias also exists in the measurement of exposure. The epidemiologist's tendency to categorical measurement leads to the term "misclassification" for this type of bias. A well-known example is the study of Lilienfeld and Graham (1958: cited in Schlesselman, 1982, pp. 137-138), which compared male patients' declarations as to whether they were circumcised to the results of a doctor's examination. Of the 84 men judged to be circumcised by the doctor (the "gold standard" in this situation), only 37 reported being circumcised (44% sensitivity). Eighty-nine of the 108 men who were not circumcised in the view of the doctor reported not being circumcised (82% specificity). In psychological research, the exposures are often subjective psychological happenings, such as emotions, or objective events recalled by the subject, such as life events, instead of, for example, residence near a toxic waste dump, as might be the putative exposure in a study of cancer. The importance of psychological exposures, or subjective reporting of exposures, threatens the validity of the case-control design in psychiatric epidemiology, and may be one reason it has been used so little. The cases in a case-control study by definition have a disease or problem which the controls do not. In any kind of case-control study where the measure of exposure is based on recall by the subjects, the cases may be more likely than controls to recall exposures, because they wish to explain the occurrence of the disease. In the study of depression in London, depressed women may be more likely to recall a difficult life event that happened earlier in their lives, because of their current mood, than women who are not depressed; or depressed women may magnify the importance of a given event which actually did occur. These problems of recall raise the importance of strictly prospective designs in psychological epidemiological research.

3.05.8 CONCLUSION: THE FUTURE OF PSYCHOLOGY IN EPIDEMIOLOGY

Epidemiology is developing rapidly in ways that will promote effective collaboration with psychology. The traditional interest of psychologists in health as well as illness links up to epidemiologists' growing interest in the development of host resistance. The traditional interest of psychologists in phenomenology is beginning to link up with epidemiologists'

Table 7 Sensitivity and specificity.

                                 True disease status
                        Present                 Absent                  Total
Test      Present       a (True-positives)      b (False-positives)     a + b
results   Absent        c (False-negatives)     d (True-negatives)      c + d
          Total         a + c                   b + d
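The quantities defined with Table 7 can be checked against the Lilienfeld and Graham counts quoted in the text. The sketch below is illustrative only: the function names are hypothetical, the prevalence correction is assumed to be the estimator of Rogan and Gladen (1978), which appears in the reference list, and the 90% sensitivity and specificity used for the correlation comparison are made-up values chosen to show how rarity constrains the correlation coefficient.

```python
from math import sqrt


def sensitivity_specificity(a, b, c, d):
    """Cells follow Table 7: a = true-positives, b = false-positives,
    c = false-negatives, d = true-negatives."""
    return a / (a + c), d / (b + d)


def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct an observed prevalence for known sensitivity and specificity
    (assumed here to be the 'simple procedure' the text refers to)."""
    return (apparent_prevalence + specificity - 1) / (sensitivity + specificity - 1)


def phi(a, b, c, d):
    """Correlation (phi) coefficient for a 2 x 2 table."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def expected_table(n, prevalence, sensitivity, specificity):
    """Expected Table 7 cells for n people at a given true prevalence."""
    cases = n * prevalence
    noncases = n - cases
    a = cases * sensitivity       # true-positives
    c = cases - a                 # false-negatives
    d = noncases * specificity    # true-negatives
    b = noncases - d              # false-positives
    return a, b, c, d


# Lilienfeld and Graham: the doctor's examination is the gold standard.
a, c = 37, 84 - 37      # 84 circumcised men, 37 of whom reported it
d, b = 89, 108 - 89     # 108 uncircumcised men, 89 of whom reported it
sens, spec = sensitivity_specificity(a, b, c, d)

n = a + b + c + d
apparent = (a + b) / n                       # prevalence by self-report
corrected = rogan_gladen(apparent, sens, spec)

# Same hypothetical measure (90% sensitivity and specificity), two prevalences.
phi_common = phi(*expected_table(10_000, 0.50, 0.9, 0.9))
phi_rare = phi(*expected_table(10_000, 0.01, 0.9, 0.9))

print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
print(f"apparent prevalence = {apparent:.3f}, corrected = {corrected:.3f}")
print(f"phi at 50% prevalence = {phi_common:.2f}, at 1% = {phi_rare:.2f}")
```

Because the correction is applied here to the same sample from which sensitivity and specificity were computed, it recovers the doctor-based prevalence exactly; in practice the two error rates would come from a separate validation study. The last line shows why the correlation coefficient is a poor validity index for dichotomous disease measures: the measure is equally accurate in both scenarios, yet the coefficient falls sharply when the disorder is rare.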

Sensitivity = a / (a + c)
Specificity = d / (b + d)

growing interest in measurement and complex nosology. In statistics, techniques of multivariate analysis are being adapted increasingly well to the developmental perspective. Developments in the field of physiological and biological measurement, including nearly noninvasive assessments of DNA, hormones, and brain structure and functioning, have led to concepts like “molecular epidemiology,” which will lead to vast increases in knowledge about how psychological functioning affects diseases across the spectrum of mental and physical illness. Many of these measures are efficient enough to be applied in the context of a large field survey. In a few decades, the concept of the “mind–body split” may be an anachronism. Finally, the increasing use of laptop computers for assistance in assessment of complex human behaviors and traits will increase the utility of questionnaires and interview data. For all these reasons psychologists should look forward to increasingly productive collaborations in the field of epidemiology.

ACKNOWLEDGMENTS

Production of this paper was supported by NIMH grant 47447, and by the Oregon Social Learning Center. The author is grateful to Nina Schooler and Ray Lorion for comments on early drafts.

3.05.9 REFERENCES

Acheson, E. D., Baldwin, J. A., & Graham, W. J. (1987). Textbook of medical record linkage. New York: Oxford University Press.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Anthony, J. C., Eaton, W. W., & Henderson, A. S. (1995). Introduction: Psychiatric epidemiology (Special Issue of Epidemiologic Reviews). Epidemiologic Reviews, 17(1), 1–8.
Anthony, J. C., Folstein, M. F., Romanoski, A., Von Korff, M., Nestadt, G., Chahal, R., Merchant, A., Brown, C. H., Shapiro, S., Kramer, M., & Gruenberg, E. M. (1985). Comparison of the lay Diagnostic Interview Schedule and a standardized psychiatric diagnosis: Experience in Eastern Baltimore. Archives of General Psychiatry, 42, 667–675.
Berkson, J. (1946). Limitations of the application of fourfold table analysis to hospital data. Biometrics, 2, 47–53.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Blalock, H. M. (1979). Social statistics (2nd ed.). New York: McGraw-Hill.
Brown, G. W., Birley, J. L. T., & Wing, J. K. (1972). Influence of family life on the course of schizophrenic disorders: a replication. British Journal of Psychiatry, 121, 241–258.
Brown, G. W., & Harris, T. (1978). The social origins of depression: A study of psychiatric disorder in women. London: Tavistock.
Butler, N. R., & Alberman, E. D. (1969). Perinatal Problems: The Second Report of the 1958 British Perinatal Survey. Edinburgh, Scotland: Livingstone.
Campbell, D. T., & Stanley, J. C. (1971). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
Cohen, P., & Cohen, J. (1984). The Clinician's Illusion. Archives of General Psychiatry, 41, 1178–1182.
Commission on Chronic Illness (1957). Chronic illness in the United States. Cambridge, MA: Harvard University Press.
Cooper, B., & Morgan, H. G. (1973). Epidemiological psychiatry. Springfield, IL: Charles C. Thomas.
Cross-National Collaborative Group (1992). The changing rate of major depression: cross-national comparisons. Journal of the American Medical Association, 268, 3098–3105.
Crow, T. J., Done, D. J., & Sacker, A. (1996). Birth cohort study of the antecedents of psychosis: Ontogeny as witness to phylogenetic origins. In H. Hafner & W. F. Gattaz (Eds.), Search for the causes of schizophrenia (Vol. 3, pp. 3–20). Heidelberg, Germany: Springer.
Done, D. J., Johnstone, E. C., Frith, C. D., et al. (1991). Complications of pregnancy and delivery in relation to psychosis in adult life: data from the British perinatal mortality survey sample. British Medical Journal, 302, 1576–1580.
Eaton, W. W. (1994). The NIMH epidemiologic catchment area program: Implementation and major findings. International Journal of Methods in Psychiatric Research, 4, 103–112.
Eaton, W. W. (1995). Studying the natural history of psychopathology. In M. Tsuang, M. Tohen, & G. Zahner (Eds.), Textbook in psychiatric epidemiology (pp. 157–177). New York: Wiley-Liss.
Eaton, W. W., Anthony, J. C., Tepper, S., & Dryman, A. (1992). Psychopathology and Attrition in the Epidemiologic Catchment Area Surveys. American Journal of Epidemiology, 135, 1051–1059.
Eaton, W. W., Badawi, M., & Melton, B. (1995). Prodromes and precursors: Epidemiologic data for primary prevention of disorders with slow onset. American Journal of Psychiatry, 152(7), 967–972.
Eaton, W. W., Holzer, C. E., Von Korff, M., Anthony, J. C., Helzer, J. E., George, L., Burnam, A., Boyd, J. H., Kessler, L. G., & Locke, B. Z. (1984). The design of the Epidemiologic Catchment Area surveys: the control and measurement of error. Archives of General Psychiatry, 41, 942–948.
Eaton, W. W., & Kessler, L. G. (1981). Rates of symptoms of depression in a national sample. American Journal of Epidemiology, 114, 528–538.
Eaton, W. W., Kramer, M., Anthony, J. C., Chee, E. M. L., & Shapiro, S. (1989). Conceptual and methodological problems in estimation of the incidence of mental disorders from field survey data. In B. Cooper & T. Helgason (Eds.), Epidemiology and the prevention of mental disorders (pp. 108–127). London: Routledge.
Eaton, W. W., & McLeod, J. (1984). Consumption of coffee or tea and symptoms of anxiety. American Journal of Public Health, 74, 66–68.
Eaton, W. W., & Muntaner, C. (1996). Socioeconomic stratification and mental disorder. In A. V. Horwitz & T. L. Scheid (Eds.), Sociology of mental health and illness. New York: Cambridge University Press.
Eaton, W. W., Regier, D. A., Locke, B. Z., & Taube, C. A. (1981). The Epidemiologic Catchment Area Program of the National Institute of Mental Health. Public Health Reports, 96, 319–325.
Eaton, W. W., Weissman, M. M., Anthony, J. C., Robins, L. N., Blazer, D. G., & Karno, M. (1985). Problems in the definition and measurement of prevalence and incidence of psychiatric disorders. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic Field Methods in Psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 311–326). Orlando, FL: Academic Press.
Farquhar, J. W., Fortmann, S. P., Flora, J. A., Taylor, C. B., Haskell, W. L., Williams, P. T., Maccoby, N., & Wood, P. D. (1990). Effects of communitywide education on cardiovascular disease risk factors: The Stanford Five-City Project. Journal of the American Medical Association, 264(3), 359–365.
Fitti, J. E., & Kovar, M. G. (1987). The Supplement on Aging to the 1984 National Health Interview Survey. Vital and Health Statistics. Washington, DC: Government Printing Office.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Foege, W. H. (1996). Alexander D. Langmuir: His impact on public health. American Journal of Epidemiology, 144(8) (Suppl.), S11–S15.
Gordon, R. (1983). An operational classification of disease prevention. Public Health Reports, 98, 107–109.
Herbst, A. L., Ulfelder, H., & Poskanzer, D. C. (1971). Adenocarcinoma of the vagina. Association of maternal stilbestrol therapy with tumor appearance in young women. The New England Journal of Medicine, 284(16), 878–881.
Kellam, S. G., & Werthamer-Larsson, L. (1986). Developmental epidemiology: a basis for prevention. In M. Kessler & S. E. Goldston (Eds.), A Decade of Progress in Primary Prevention (pp. 154–180). Hanover, NH: University Press of New England.
Kellam, S. G., Rebok, G. W., Ialongo, N., & Mayer, L. S. (1994). The course and malleability of aggressive behavior from early first grade into middle school: Results of a developmental epidemiologically-based preventive trial. Journal of Child Psychology and Psychiatry, 35(2), 259–281.
Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshelman, S., Wittchen, H., & Kendler, K. S. (1994). Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States. Archives of General Psychiatry, 51, 8–19.
Kety, S. S., Rosenthal, D., Wender, P. H., Schulsinger, F., & Jacobsen, B. (1975). Mental illness in the biological and adoptive families of adopted individuals who have become schizophrenic: a preliminary report based on psychiatric interviews. In R. R. Fieve, D. Rosenthal, & H. Brill (Eds.), Genetic Research in Psychiatry (pp. 147–165). Baltimore, MD: Johns Hopkins University Press.
Kleinbaum, D. G., Kupper, L. L., & Morgenstern, H. (1982). Epidemiologic Research: Principles and Quantitative Methods. Belmont, CA: Lifetime Learning.
Klerman, G. L., & Weissman, M. M. (1989). Increasing rates of depression. JAMA, 261, 2229–2235.
Kovar, M. (1992). The Longitudinal Study of Aging: 1984–90. Hyattsville, MD: US Department of Health and Human Services.
Kramer, M. (1957). A discussion of the concepts of incidence and prevalence as related to epidemiologic studies of mental disorders. American Journal of Public Health & Nation's Health, 47(7), 826–840.
Kramer, M. (1969). Applications of Mental Health Statistics: Uses in mental health programmes of statistics derived from psychiatric services and selected vital and morbidity records. Geneva: World Health Organization.
Kramer, M., Von Korff, M., & Kessler, L. (1980). The lifetime prevalence of mental disorders: estimation, uses and limitations. Psychological Medicine, 10, 429–435.
Lawless, J. F. (1982). Statistical models and methods for lifetime data (Wiley Series in Probability and Mathematical Statistics). New York: Wiley.
Lilienfeld, A., & Lilienfeld, D. (1980). Foundations of epidemiology. New York: Oxford University Press.
Lilienfeld, D. E., & Stolley, P. D. (1994). Foundations of epidemiology. New York: Oxford University Press.
Loffler, W., Hafner, H., Fatkenheuer, B., Maurer, K., Riecher-Rossler, A., Lutzhoft, J., Skadhede, S., Munk-Jorgensen, P., & Stromgren, E. (1994). Validation of Danish case register diagnosis for schizophrenia. Acta Psychiatrica Scandinavica, 90, 196–203.
Mausner, J. S., & Kramer, S. (1985). Mausner & Bahn epidemiology: An introductory text (2nd ed.). Philadelphia: Saunders.
Meltzer, H., Gill, B., Petticrew, M., & Hinds, K. (1995a). Physical complaints, service use and treatment of adults with psychiatric disorders. London: Her Majesty's Stationery Office.
Meltzer, H., Gill, B., Petticrew, M., & Hinds, K. (1995b). Prevalence of psychiatric morbidity among adults living in private households. London: Her Majesty's Stationery Office.
Morris, J. N. (1975). Use of epidemiology (3rd ed.). Edinburgh: Churchill Livingstone.
Mortensen, P. B. (1995). The untapped potential of case registers and record-linkage studies in psychiatric epidemiology. Epidemiologic Reviews, 17(1), 205–209.
Mrazek, P. J., & Haggerty, R. J. (1994). Reducing risks for mental disorders. Washington, DC: National Academy Press.
Munson, M. L., Orvaschel, H., Skinner, E. A., Goldring, E., Pennybacker, M., & Timbers, D. M. (1985). Interviewers: Characteristics, training and field work. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic Field Methods in Psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 69–83). Orlando, FL: Academic Press.
Murphy, J. M., Berwick, D. M., Weinstein, M. C., Borus, J. F., Budman, S. H., & Klerman, G. L. (1987). Performance of Screening and Diagnostic Tests: Application of receiver operating characteristic analysis. Archives of General Psychiatry, 44, 550–555.
Robins, L. N., Helzer, J. E., Croughan, J., & Ratcliff, K. S. (1981). National Institute of Mental Health Diagnostic Interview Schedule: Its history, characteristics, and validity. Archives of General Psychiatry, 38, 381–389.
Robins, L. N., Helzer, J. E., Weissman, M. M., Orvaschel, H., Gruenberg, E. M., Burke, J. D., & Regier, D. A. (1984). Lifetime prevalence of specific psychiatric disorders in three sites. Archives of General Psychiatry, 41, 949–958.
Robins, L. N., & Regier, D. A. (Eds.) (1991). Psychiatric disorders in America: The epidemiologic catchment area study. New York: Free Press.
Rogan, W. J., & Gladen, B. (1978). Estimating Prevalence from the Results of a Screening Test. American Journal of Epidemiology, 107, 71–76.
Sacker, A., Done, J., Crow, T. J., & Golding, J. (1995). Antecedents of schizophrenia and affective illness: Obstetric complications. British Journal of Psychiatry, 166, 734–741.
Schlesselman, J. J. (1982). Case–control studies: Design, conduct, analysis. New York: Oxford University Press.
Shapiro, S., Skinner, E. A., Kessler, L. G., Von Korff, M., German, P. S., Tischler, G. L., Leaf, P. J., Benham, L., Cottler, L., & Regier, D. A. (1984). Utilization of health and mental health services, three epidemiologic catchment area sites. Archives of General Psychiatry, 41, 971–978.
Simon, G. E., & Von Korff, M. (1995). Recall of Psychiatric History in Cross-sectional Surveys: Implications for Epidemiologic Research. Epidemiologic Reviews, 17(1), 221–227.
Sudman, S. (1976). Applied sampling. New York: Academic Press.
ten Horn, G. H., Giel, R., Gulbinat, W. H., & Henderson, J. H. (Eds.) (1986). Psychiatric case registers in public health: A worldwide inventory 1960–1985. Amsterdam: Elsevier Science.
Tsuang, M., Tohen, M., & Zahner, G. (1995). Textbook in psychiatric epidemiology. New York: Wiley-Liss.
Tyrer, P. (1985). Neurosis divisible? Lancet, 1, 685–688.
Von Korff, M., Cottler, L., George, L. K., Eaton, W. W., Leaf, P. J., & Burnam, A. (1985). Nonresponse and nonresponse bias in the ECA surveys. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic field methods in psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 85–98). Orlando, FL: Academic Press.
Wynder, E. L. (1994). Studies in mechanism and prevention: Striking a proper balance. American Journal of Epidemiology, 139(6), 547–549.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.06 Qualitative and Discourse Analysis

JONATHAN A. POTTER Loughborough University, UK

3.06.1 INTRODUCTION: QUALITATIVE RESEARCH IN CONTEXT
  3.06.1.1 Historical Moments in Qualitative Research in Clinical Settings
  3.06.1.2 Background Issues
    3.06.1.2.1 Philosophy, sociology, and changing conceptions of science
    3.06.1.2.2 Investigatory procedures vs. justificatory rhetoric
    3.06.1.2.3 Quality and quantity
  3.06.1.3 Qualitative Research and Theory
  3.06.1.4 Boundaries of Qualitative Research and Coverage of the Current Chapter
3.06.2 GROUNDED THEORY
  3.06.2.1 Questions
  3.06.2.2 Procedures
    3.06.2.2.1 Materials
    3.06.2.2.2 Coding
    3.06.2.2.3 Method of constant comparison
    3.06.2.2.4 Memo writing
    3.06.2.2.5 Validity
  3.06.2.3 Example: Clegg, Standen, and Jones (1996)
  3.06.2.4 Virtues and Limitations
3.06.3 ETHNOGRAPHY AND PARTICIPANT OBSERVATION
  3.06.3.1 Questions
  3.06.3.2 Procedures
    3.06.3.2.1 Access
    3.06.3.2.2 Field relations
    3.06.3.2.3 Fieldnotes
    3.06.3.2.4 Interviews
    3.06.3.2.5 Analysis
    3.06.3.2.6 Validity
  3.06.3.3 Example: Gubrium (1992)
  3.06.3.4 Virtues and Limitations
3.06.4 DISCOURSE ANALYSIS
  3.06.4.1 Questions
  3.06.4.2 Procedures
    3.06.4.2.1 Research materials
    3.06.4.2.2 Access and ethics
    3.06.4.2.3 Audio and video recording
    3.06.4.2.4 Transcription
    3.06.4.2.5 Coding
    3.06.4.2.6 Analysis
    3.06.4.2.7 Validity
  3.06.4.3 Example: Peräkylä (1995)
  3.06.4.4 Virtues and Limitations
3.06.5 FUTURE DIRECTIONS
3.06.6 REFERENCES

3.06.1 INTRODUCTION: QUALITATIVE RESEARCH IN CONTEXT

For many researchers the words “qualitative method” spell out psychology's most notorious oxymoron. If there is one thing that qualitative methods are commonly thought to lack it is precisely an adequate methodic way of arriving at findings. Indeed, for much of the twentieth century quantification has been taken as the principal marker of the boundary between a mature scientific psychology and commonsense, intuitive approaches. Since the late 1980s, however, there has been a remarkable increase in interest in qualitative research in psychology. This partly reflects a broader turn to qualitative research across the social sciences, although qualitative research of one kind or another has long been an established feature of disciplines such as education, sociology, and, most prominently, anthropology.

In psychology there is a handbook of qualitative research (Richardson, 1996) as well as a range of volumes and special journal issues whose major focus is on developing qualitative approaches to psychological problems (Antaki, 1988; Bannister, Burman, Parker, Taylor, & Tyndall, 1994; Henwood & Parker, 1995; Henwood & Nicolson, 1995; Smith, Harré, & van Langenhove, 1995). Psychology methods books and collections are increasingly serving up qualitative methods to accompany the more usual diet of experiments, questionnaires, and surveys. At the same time, an increasing permeability of boundaries between the social sciences has provided the environment for a range of trans-disciplinary qualitative methods books including a useful doorstop-sized handbook (Denzin & Lincoln, 1994) and varied edited and authored works (Bogdan & Taylor, 1975; Bryman & Burgess, 1994; Coffey & Atkinson, 1996; Gilbert, 1993; Lofland & Lofland, 1984; Miles & Huberman, 1994; Miller & Dingwall, 1997; Silverman, 1993, 1997a). These general qualitative works are complemented by a mushrooming range of books and articles devoted to specific methods and approaches.

Why has there been this increase in interest in qualitative research? Three speculations are proffered. First, there is a widespread sense that traditional psychological methods have not proved successful in providing major advances in the understanding of human life. Despite regular promissory notes, psychology seems to offer no earth-moving equivalent of the transistor, of general relativity, or of molecular genetics. Second, as is discussed below, views of science have changed radically since the 1950s, making it much harder to paint qualitative researchers as either antiscientific extremists or merely sloppy humanists. Third, psychology is no longer as insulated from other social sciences as it has been in the past. Of course, for much of its twentieth century existence psychology has been invigorated by infusions from other sciences such as physiology and linguistics. However, in recent years there has been increasing exchange with disciplines where qualitative methods have been more established, such as sociology and anthropology. This is reflected in contemporary theoretical developments such as constructionism (Gergen, 1994) and poststructuralism (Henriques, Hollway, Irwin, Venn, & Walkerdine, 1984) that have swept right across the human sciences.

This is, then, an exciting time for qualitative researchers, with new work and new opportunities of all kinds. Yet it should also be emphasised that qualitative research in psychology is in a chaotic state, with a muddle of inconsistent questions and approaches being blended together. Much poor work has involved taking questions formulated in the metaphysics of experimental psychology and attempting to plug them into one or more qualitative methods. At its worst such research peddles unsystematic and untheorized speculations about the influences on some piece of behavior which are backed up with two or three quotes from an interview transcript. This expanding literature and its variable quality create some problems for the production of a useful overview. This chapter selects what are seen as the most coherent and successful qualitative approaches from a range of possibilities, as well as focusing on those approaches which are being used, and have prospects for success, in clinical settings. A range of references is provided for those who wish to follow up alternative methods.

3.06.1.1 Historical Moments in Qualitative Research in Clinical Settings

In one sense a large proportion of twentieth century clinical therapeutic practice was based on qualitative methods. The process of conducting some kind of therapy or counseling with clients and then writing them up as case histories, or using them as the basis for inferences about aspects of the psyche, behavior, or cognitive processes, has been a commonplace of clinical work. Freud's use of case histories in the development of psychoanalytic thinking is probably the most influential. Although it is an approach to clinical knowledge that overwhelmingly eschews quantification, it is hard to say much about its methodic basis. For good or bad, it is dependent on the unformulated skills and intuitions of the therapist/researcher. In the hands of someone as brilliant as Freud the result can be extraordinary; elsewhere the product has often been merely ordinary. The problem for the readers of such research is that they can do little except either take it on trust or disagree. The process through which certain claims are established is not open to scrutiny. However, Freud's study of the case of Little Hans is exceptional in this regard, and so it is worth considering briefly (see Billig, 1998).

Although Freud initially based his arguments for the existence of the Oedipus complex on the interpretation of what patients told him in the course of therapy sessions, he attempted, unusually, to support this part of psychoanalytic theory with more direct evidence. He asked some of his followers to collect observations from their own children. The music critic Max Graf was most helpful in this regard and presented Freud with copious notes on conversations between his son, Hans, and other family members, as well as descriptions of dreams he had recounted. The published case history (Freud, 1977 [1909]) contains more than 70 pages of reports about Hans which Freud describes as reproduced “just as I received them” without “conventional emendations” (1977, p. 170). Here is an example:

    Another time he [Hans] was looking on intently while his mother undressed before going to bed. “What are you staring like that for?” she asked.
    Hans: “I was only looking to see if you'd got a widdler too.”
    Mother: “Of course. Didn't you know that?”
    Hans: “No. I thought you were so big you'd have a widdler like a horse.”
    (1977, p. 173)

Freud's fascinating materials and striking interpretations beg many of the questions that have been central to qualitative research ever since. For example, what is the role of Max Graf's expectations (he was already an advocate of Freud's theories) in his selection and rendering of conversations with Hans? How closely do the extracts capture the actual interactions (including the emphasis, nonvocal elements, and so on)? What procedure did Freud use to select the examples that were reproduced from the full corpus? And, most importantly, what is the basis of Freud's interpretations? His interpretations are strongly derived from his theory, as is shown by his willingness to straightforwardly rework the overt sense of the records. Take this example:

    Hans (aged four and a half) was again watching his little sister being given her bath, when he began laughing. On being asked why he was laughing, he replied: “I'm laughing at Hanna's widdler.” “Why?” “Because her widdler's so lovely.”
    Of course his answer was a disingenuous one. In reality her widdler had seemed to him funny. Moreover, this is the first time he has recognized in this way the distinction between male and female genitals instead of denying it. (1977, p. 184)

Note the way Graf here, and implicitly Freud in his text, treat the laughter as the real indicator of Hans's understanding of events, and his overt claim to find his sister's “widdler” lovely as a form of dissembling. Hans is not delighted by the appearance of his sister's genitals but is amused, in line with psychoanalytic theory, by their difference from his own. Again, the issue of how to treat the sense of records of interaction, and what interpretations should be made from them about things going on elsewhere such as actions or cognitions, is a fundamental one in qualitative research (Silverman, 1993).

Some 40 years later another of clinical psychology's great figures, Carl Rogers, advocated the use of newly developed recording technology to study the use of language in psychotherapy itself, with the aim of understanding and improving therapeutic skills. For him such recordings offered “the first opportunity for an adequate study of counseling and therapeutic procedures, based on thoroughly objective data” (Rogers, 1942, p. 433). Rogers envisaged using such recordings in the development of a scale to differentiate the styles of different counselors and studies of the patterns of interaction; for example, “what type of counselor statement is most likely to be followed by statements of the client's feeling about himself?” (1942, p. 434).

Rogers' emphasis on the virtue of recordings was followed up in two major “microscopic” studies of psychotherapeutic discourse. The first, by Pettinger, Hockett, and Danehy (1960), focused on the initial portion of an initial interview. A typical page of their study has just a few words of transcript coupled with an extended discussion of their sense. Much of the focus was on the prosodic cues (the intonation and stress) provided in the interview and their contextual significance. Prosody is, of course, a feature of interaction which is almost impossible to capture reliably in post hoc notes made by key informants and so highlights the virtue of the new technology. A second study, by Labov and Fanshell (1977), also focused on the opening stages of a therapy session, in this case five episodes of interaction from the first 15 minutes of a psychoanalytic therapy session with Rhoda, a 19-year-old girl with a history of anorexia nervosa.

The classic example of ethnographic work in the history of clinical psychology is Goffman's study of the everyday life of a mental hospital published under the title Asylums (1961). It is worth noting that although Goffman was a sociologist, the various essays that make up Asylums were initially published in psychiatry journals. Rather than utilize tape recording technology to capture the minutiae of some social setting, Goffman used an ethnographic approach. He spent a year working ostensibly as an assistant to the athletic director of a large mental hospital, interacting with patients and attempting to build up an account of the institution as viewed by the patients. His justification for working in this way is instructive for how the strengths and weaknesses of qualitative work have been conceptualized:

    Desiring to obtain ethnographic detail regarding selected aspects of patient social life, I did not employ usual kinds of measurements and controls. I assumed that the role and time required to gather statistical evidence for a few statements would preclude my gathering data on the tissue and fabric of patient life. (1961, p. 8)

As an ethnographic observer, he developed an understanding of the local culture and customs of the hospital by taking part himself. He used the competence generated in this way as the basis for his writing about the life and social organization in a large mental hospital.

3.06.1.2 Background Issues

Before embarking on a detailed overview of some contemporary qualitative approaches to clinical topics there are some background issues that are worth commenting on, as they will help make sense of the aims and development of qualitative approaches. In some cases it is necessary to address issues that have been a long-standing source of confusion where psychologists have discussed the use of qualitative methods.

3.06.1.2.1 Philosophy, sociology, and changing conceptions of science

As noted above, the development of qualitative work in psychology has been facilitated by the more sophisticated understanding of the nature of science provided by philosophers and sociologists since the 1970s. The image of the lone scientist harvesting facts, whose truth is warranted through the cast-iron criterion of replication, has intermittently been wheeled on to defend supposedly scientific psychology against a range of apparently sloppier alternatives. However, this image now looks less than substantial (see Chalmers, 1992; Potter, 1996a; Woolgar, 1988).

The bottom-line status of scientific observation has been undermined by a combination of philosophical analyses and sociological case studies. Philosophers have highlighted the logical relationships between observation statements and theoretical notions (Hesse, 1974; Kuhn, 1962; Popper, 1959). Their argument is that even the simplest of scientific descriptions is dependent on a whole variety of theoretical assumptions. Sociologists have supplemented these insights with studies of the way notions of observation are used in different scientific fields. For example, Lynch (1994) notes the way the term observation is used in astronomy as a loose device for collecting together a range of actions such as setting up the telescope, attaching sensors to it, building up traces on an oscilloscope, converting these into a chart, and canvassing the support of colleagues. Knorr Cetina (1997) documents the different notions of observation that appear in different scientific specialities, suggesting that high energy physicists and molecular biologists, for example, work with such strikingly different notions of what is empirical that they are best conceived of as members of entirely different epistemic cultures.

The idea that experimental replication can work as a hard criterion for the adequacy of any particular set of research findings has been shown to be too simple by a range of sociological studies of replication in different fields (Collins, 1981). For example, Collins (1985) has shown that the achievement of a successful replication is closely tied to the conception of what counts as a competent experiment in the first place, and this itself was often as much a focus of controversy as the phenomenon itself. In a study of gravity wave researchers, Collins found that those scientists who believed in gravity waves tended to treat replications that claimed to find them as competent and replications that failed to find them as incompetent. The reverse pattern was true of nonbelievers. What this meant was that replication did not stand outside the controversy as a neutral arbiter of the outcome, but was as much part of the controversy as everything else.

Philosophical and sociological analysis has also shown that the idea that a crucial experiment can be performed which will force the abandonment of one theory and demonstrate the correctness of another is largely mythical (Lakatos, 1970; Collins & Pinch, 1993). Indeed, historical studies suggest that so-called crucial experiments are not merely insufficient to effect the shift from one theory to another; they are often performed, or at least constructed as crucial, after the shift to provide illustration and legitimation (Kuhn, 1977).

Let us be clear at this point. This research does not show that careful observation, skilful replication, and theoretically sophisticated experiments are not important in science. Rather, the point is that none of these things are bottom-line guarantees of scientific progress. Moreover, these sociological studies have suggested that all these features of science are embedded in, and inextricable from, its communal practices. Their sense is developed and negotiated in particular contexts in accordance with ad hoc criteria and a wide range of craft skills which are extremely hard to formulate in an explicit manner (Knorr Cetina, 1995; Latour & Woolgar, 1986). The message taken from this now very large body of work (see Jasanoff, Markle, Pinch, & Petersen, 1995) is not that psychologists must adopt qualitative methods, or that qualitative methods will necessarily be any better than the quantitative methods that they may replace or supplement; it is that those psychologists who have argued against the adoption of such methods on the principle that they are unscientific are uninformed about the nature of science.

3.06.1.2.2 Investigatory procedures vs. justificatory rhetoric

There are a number of basic linguistic and metatheoretical difficulties in writing about qualitative methods for psychologists. Our terminology for methodological discussion (reliability, validity, sampling, factors, variance, hypothesis testing, and so on) has grown up with the development of quantitative research using experiments and surveys. The language has become so taken-for-granted that it is difficult to avoid treating it as obvious and natural. However, it is a language that is hard to disentangle from a range of metatheoretical assumptions about the nature of behavior and processes of interaction. Traditional psychology has become closely wedded to a picture of factors and outcomes which, in turn, cohabits with the multivariate statistics which are omnipresent where data is analyzed. For some forms of qualitative research, particularly most discourse and ethnographic work, such a picture is inappropriate. This does not mean that such research is incoherent or unscientific, merely that it should not be construed and evaluated using the family of concepts whose home is experimental journal articles. Likewise the psychological model of hypothesis testing is just one available across the natural and human sciences. Qualitative research that utilizes theoretically guided induction, or tries to give a systematic description of some social realm, should not be criticized on the grounds that it is unscientific, let alone illegitimate. Ultimately, the only consistent bottom line for the production of excellent qualitative work is excellent scholarship (Billig, 1988).

Another difference between traditional quantitative and qualitative work is that in the traditional work the justification of research findings is often taken to be equivalent to the complete and correct carrying out of a set of codified procedures. Indeed, methods books are often written as if they were compendia of recipes for achieving adequate knowledge. Sampling, operationalization of variables, statistical tests, and interpretation of significance levels are discussed with the aid of tree diagrams and flow charts intended to lead the apprentice researcher to the correct conclusion. In one sense, much qualitative work is very different to this, with the procedures for justifying the research claims being very different to the procedures for producing the work. Thus, the manner in which a researcher arrives at some claims about the various functions of “mm hm's” in psychiatric intake interviews, say, may be rather different from the manner in which they justify the adequacy of the analysis. Yet, in another sense the difference between qualitative and quantitative research is more apparent than real, for studies of the actual conduct of scientists following procedural rules of method show that such rules require a large amount of tacit knowledge to make them understandable and workable, and that they are often more of a rhetorical device used to persuade other scientists than an actual constraint on practice (Gilbert & Mulkay, 1984; Polyani, 1958). As Collins (1974) showed in an innovative ethnographic study, when a group of scientists wrote a paper offering the precise technical specification of how to make a new laser, the only people who were able to build a working laser of their own had actually seen one built; just reading the paper was not enough.

This presents something of a dilemma for anyone writing about qualitative methods. Should they write to help people conduct their research so as better to understand the world, or should they work to provide the sorts of formalized procedural rules that can be drawn on in the methods sections of articles to help persuade the psychologist reader? In practice, most writing does some of each. However, the difficulty that psychologists often report when

up modern cognitivism the objects of observation are hypothetical mental entities (the Oedipus complex, attributional heuristics).
attempting qualitative work is probably symp- Psychoanalytic researchers have generally pre- tomatic of the failure to fully explicate the craft ferred to engage in an interpretative exercise of skills that underpin qualitative work. reconstructing those entities from the talk of patients undergoing therapy. Cognitive psy- chologists have typically used some 3.06.1.2.3 Quality and quantity hypothetico-deductive procedure where predic- There are different views on how absolute the tions are checked in experiments which inves- quantity/quality divide is. Arguments at differ- tigate, say, the attributional style of people ent levels of sophistication have been made for classified as depressed. Note that in both of future integration of qualitative and quantita- these cases they are using people's discourseÐ tive research (Bryman, 1988; Silverman, 1993). talk in the therapy session, written responses to It is undoubtedly the case that at times a questionnaireÐyet in neither case is the proponents of both quantitative and qualitative discourse as such of interest. In contrast, for research have constructed black and white researchers working with different perspectives stereotypes with little attempt at dialog such as social constructionism or discursive (although a rare debate about the relative psychology, the talk or writing itself, and the virtues of quantitative and qualitative research practices of which it is part, is the central topic. on the specific topic of attitudes towards mental For these researchers there is a need to use hospitalisation is revealingÐWeinstein, 1979, procedures which can make those practices 1980; Essex et al., 1980). It is suggested that available for study and allow their organization quantification is perfectly appropriate in a to be inspected and compared. range of situations, dependent on appropriate To take another example, behaviorist psy- analytic and theoretical judgements. 
chologists have done countless studies on the In many other research situations the goal is effects of particular regimes of reward and not something that can be achieved through punishment on target behaviors such as com- counting. For example, if the researcher is pulsive hand washing. However, such studies explicating the nature of ªcircular questioningº are typically focused on outcomes and statistical in Milan School Family Therapy, that goal is a associations, whereas a theoretical perspective prerequisite for a study which considers the such as symbolic interactionism or, to give a statistical prevalence of such questioning. more psychological example, Vygotskyan ac- Moreover, there are arguments for being tivity theory, encourage a more ethnographic cautious about quantification when studying examination of the settings in which rewards are the sorts of discursive and interactional materi- administered and of the sense that those als which have often been central to qualitative behaviors have in their local context research because of distortions and information Without trying to flesh out these examples in loss that can result (see Schegloff, 1993, and any detail, the important practical point they papers in Wieder, 1993). Some of the grounds make is that it is a mistake to attempt simply to for caution come from a range of qualitative import a question which has been formulated in studies of quantification in various clinical the problematics of one theoretical system, and settings (Ashmore, Mulkay, & Pinch, 1989; attempt to answer it using a method developed Atkinson, 1978; Garfinkel, 1967a; Potter, for the problematics of another. The failure to Wetherell, & Chitty, 1991). properly conceptualize a research question that fits with the research method is a major source of confusion when psychologists start to use 3.06.1.3 Qualitative Research and Theory qualitative methods. 
It is hard to overestimate how close the relationship is between the theories, methods, 3.06.1.4 Boundaries of Qualitative Research and and questions used by psychologists. Theories Coverage of the Current Chapter specify different positions on cognition and behavior, different notions of science, different What distinguishes qualitative research from views of the role of action and discourse, quantitative? And what qualitative approaches different understandings of the role of social are there? These questions are not as straight- settings, and, most fundamentally, different forward as they seem. In the past the qualitative/ objects for observation. quantitative distinction has sometimes been For both psychoanalytic theory and most of treated as the equivalent of the distinction the mass of theories and perspectives that make between research that produces objective and Grounded Theory 123 subjective knowledgeÐa distinction which enough to warrant discussion. In others, their makes little sense in the light of recent sociology central problematics are better addressed by the and philosophy of science. Sometimes certain approaches that are discussed. approaches using numbers have been treated as The most controversial exclusion is probably qualitative. For example, content analysis has humanistic methods given that humanistic occasionally been treated as a qualitative psychology developed in settings which had a method because it is used to deal with ªnaturally broad emphasis on therapy and psychological occurringº objects such as diaries, novels, well-being. It is suggested that the romanticism transcripts of meetings, and so on. Content of much humanistic psychology is attractive, analysis was meant to eliminate many of the but ultimately unconvincing. 
However, it is potential ªreactiveº effects that bedevil social often, quite legitimately, more concerned with research and thereby avoid the problem in developing participants' skills and sensitivity experimental and survey research, of how than developing propositional claims and findings relate to what goes on in the real arguments; as such it is often offering a set of world; for these are records of (indeed, examples techniques for expanding human potential of) what is actually going on. However, in this rather than developing methods for research. chapter content analysis is treated as quantita- Feminist methods are excluded, despite an tive, and therefore outside the scope of this appreciation of the importance of feminist survey, on the grounds that it (i) transforms issues in clinical settings, because the arguments phenomena into numerical counts of one kind for the existence of specifically feminist methods or another and (ii) typically attempts to (as opposed to theories or arguments) are not statistically relate these counts to some broader convincing. This is particularly true where such factors or variables. For useful introductions to claims give a major epistemological role for content analysis related to psychological topics experience or intuition (Ellis, Kiesinger, & see Holsti (1969) and Krippendorf (1980). Tillmann-Healy, 1997). These are topics for For similar reasons repertory grid analysis decomposition and analysis rather than associated with personal construct theory, and bottom-lines for knowledge. For some argu- the ªQº methodology developed by William ments in both directions on this topic see Stephenson, have sometimes been treated as Gelsthorpe (1992), Hammersley (1992), Rama- qualitative. The rationale for this was probably zanoglu (1992), and Wilkinson (1986). 
that they were often focused on understanding Finally, it should be stressed that qualitative the reasoning or cognitive organization of single work is not seen as having some overall individuals rather than working exclusively coherence. Quite the contrary, it is fragmented from population statistics. However, as they and of highly variable quality. Nor is some involve quantitative manipulation of elicited overall coherence seen as a desirable goal. Those responses from participants they will not be workers who talk of a qualitative paradigm dealt with here. The ideographic/nomathetic (Guba & Lincoln, 1994, Reason & Rowan, distinction will be treated as orthogonal to the 1981) unhelpfully blur over a range of theore- qualitative/quantitative one! For accessible tical and metatheoretical differences (see Hen- introductions to these approaches, see Smith wood & Pidgeon, 1994). (1995) on repertory grid methods and Stainton Rogers (1995) on Q methodology. In addition to these methods which are 3.06.2 GROUNDED THEORY excluded as not properly qualitative, a wide range of methods have not been discussed which Grounded theory has the most clear-cut nevertheless satisfy the criterion of being at least origin of any of the approaches discussed here. minimally methodic and generally eschewing The term was coined by two sociologists in an quantification. For simplicity, Table 1 lists nine influential book: The discovery of grounded methods or approaches which have been theory (Glaser & Strauss, 1967). Some of its key excluded, along with one or two references that features and concerns were a product of its would provide a good start point for any birthplace within sociology, where it was researcher who was interested in learning more developed to counter what the authors saw as about them. It would take some time to make a preoccupation, on the one hand, with abstract explicit the reasons for excluding all of them. 
grand theories and, on the other, with testing Generally, the problem is that they have not those theories through large scale quantitative been, and are unlikely to be in the future, studies. Grounded theory was intended to link particularly useful for studying problems in the theoretical developments (conceived as plausi- area of clinical psychology (e.g., focus groupsÐ ble relations among concepts and sets of although, see Piercy & Nickerson, 1996). In conceptsÐStrauss & Corbin, 1994) more some cases, the approaches are not coherent closely to the particulars of settings, to ground 124 Qualitative and Discourse Analysis

Table 1 Varieties of qualitative research not covered.

Qualitative research method Source

Action research Argyris, Putnam, & Smith, 1985, Whyte, 1991 Documentary studies Hodder, 1994, Scott, 1990 Ethogenics Harre , 1992 Feminist approaches Olesen, 1994, Reinharz, 1992 Focus groups Krueger, 1988, Morgan, 1997 Humanistic, participative research Reason & Rowan, 1981, Reason & Heron, 1995 Life histories Plummer, 1995, Smith, 1994 Role play Yardley, 1995 Semiotics Manning & Cullum-Swan, 1994 middle range theories in actual qualitative data manmade disasters like fires and industrial rather than to start from preconceived hypoth- accidents; Charmaz (1991) has studied the eses. It is important to stress that grounded various facets that make up people's experience theory is not a theory as such, rather it is an of chronic illness; Clegg, Standen, and Jones approach to theorising about data in any (1996) focused on the staff members' under- domain. Moreover, since its inception in the standing of their relationship with adults with 1960s, work on the theory ladenness of data has profound learning disabilities. In each case, a made the idea of ªgroundingº theory increas- major concern was to incorporate the perspec- ingly problematic. tives of the actors as they construct their Much grounded theory research has been particular social worlds. Grounded theory done outside of psychology; however, psychol- methods can help explicate the relation of ogists have become increasingly interested in the actions to settings (how does the behavior of approach in general (Charmaz, 1995; Henwood key personnel in the evolution of a major fire and Pidgeon, 1995; Pidgeon, 1996; Pidgeon and follow from their individual understanding of Henwood, 1996; Rennie, Phillips & Quartaro, events and physical positioning?); it can be used 1988), and have carried out specific studies in for developing typologies of relevant phenom- health (Charmaz, 1991, 1994) and clinical ena (in what different ways do sufferers of (Clegg, Standen, & Jones, 1996) settings. 
chronic illness conceptualize their problem?); and it can help identify patterns in complex 3.06.2.1 Questions systems (how does the information flowing between social actors help explain the develop- Grounded theory is designed to be usable with ment of a laboratory smallpox outbreak?). a very wide range of research questions and in Like most of the qualitative approaches the context of a variety of metatheoretical discussed here, grounded theory is not well approaches. Rather like the statistical analyses suited to the kinds of hypothesis testing and that psychologists are more familiar with, it outcome evaluation that have traditionally been deals with patterns and relationships. However, grist to the mill of clinical psychology, because of these are not relationships between numbers but its open-ended and inductive nature. Although between ideas or categories of things, and the the researcher is likely to come to a topic with a relationships can take a range of different forms. range of more or less explicit ideas, questions, In some respects the procedures in grounded and theories, it is not necessary for any or all of theory are like the operation of a sophisticated these to be formally stated before research gets filing system where entries are cross-referenced under way. The approach can start with a and categorized in a range of different ways. specific problem or it may be more directed at Indeed, this is one qualitative approach that can making sense of an experience or setting. be effectively helped by the use of computer Grounded theory can be applied to a range of packages such as NUDIST, which was itself different textual materials such as documents, developed to address grounded theory notions. interview transcripts and records of interaction, Grounded theory has proved particularly and this makes it particularly suitable for certain appropriate for studying people's understand- kinds of questions. 
It can deal with records which ings of the world and how these are related to exist prior to the research and it can deal with their social context. For example, Turner (1994; materials specifically collected. The processes of Turner & Pidgeon, 1997) has used grounded coding allow quite large amounts of material to theory to attempt to explain the origins of be dealt with. For example, while Turner studied Grounded Theory 125 a single (lengthy) official report of a major fire in suggests a series of specific questions that are a holiday complex, Charmaz studied 180 inter- useful for picking out the key concepts: views with 90 different people with chronic (i) What is going on? illnesses. The requirement is only that the (ii) What are the people doing? material can be coded. (iii) What is the person saying? (iv) What do these actions and statements take for granted? 3.06.2.2 Procedures (v) How do structure and context serve to The procedures for conducting grounded support, maintain, impede, or change these theory work are straightforward to describe, if actions and statements? less so to follow in practice. Pidgeon and More broadly, Pidgeon and Henwood suggest Henwood (1996, p. 88) provide a useful diagram that this phase of coding is answering the to explicate the process (Figure 1). question: ªwhat categories or labels do I need in order to account for what is of importance to me in this paragraph?º (1996, p. 92). 3.06.2.2.1 Materials Such coding is intensive and time consuming. For example, Table 2 shows an example by In line with the general emphasis on parti- Charmaz of line-by-line coding of just a brief cipants' perspectives and on understanding fragment of one of her 180 interviews. Note the patterns of relationships, researchers often way that the interview fragment is coded under a attempt to obtain rich materials such as number of different topics. 
There is no require- documents and conversational, semistructured ment in grounded theory that categories apply interviews. These may be supplemented by exclusively. participant observation in the research domain, generating fieldnotes which can be added to other data sets or simply used to improve the researcher's understanding so they can better 3.06.2.2.3 Method of constant comparison deal with the other materials. Coding is not merely a matter of carefully After data is collected and stored the intensive reading and labeling the materials. As the coding process that is most characteristic of grounded continues the researcher will be starting to theory is performed. This involves coding the identify categories that are interesting or data, refining the coding and identifying links relevant to the research questions. They will between categories, and writing ªmemosº which refine their indexing system by focused coding start to capture theoretical concepts and which will pick out all the instances from the relationships. data coded as, for example ªavoiding disclo- sure.º When such a collection has been produced the researcher can focus on the differences in the 3.06.2.2.2 Coding use of this category according to the setting or Different grounded theory researchers ap- the actors involved. This is what grounded proach coding in different ways. For example, it theorists refer to as the ªmethod of constant can involve generating index cards or making comparison.º In the course of such comparisons annotations next to the relevant text. The the category system may be reworked; some researcher works through the text line by line, categories will be merged together and others or paragraph by paragraph, labeling the key will be broken up, as the close reading of the data concepts that appear. Charmaz (1995, p. 38) allows an increasingly refined understanding.
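The indexing and comparison steps just described can be pictured as operations on a small cross-referenced index. The following Python sketch is purely illustrative and rests on several assumptions: the segments, the `setting` attribute, the merged label "managing disclosure," and the function names are all hypothetical, and nothing here is taken from Charmaz's data or from any software package. It assumes only what the text states, namely that a segment may carry several codes at once, that focused coding retrieves every instance of one code, and that comparison may lead to categories being merged.

```python
from collections import defaultdict

# Hypothetical coded segments. Texts, settings, and code assignments are
# invented for illustration; a segment may carry more than one code.
segments = [
    {"id": 1, "setting": "workplace",
     "text": "I don't mention it to my boss.",
     "codes": {"avoiding disclosure", "predicting rejection"}},
    {"id": 2, "setting": "family",
     "text": "I keep quiet at dinner about the pain.",
     "codes": {"avoiding disclosure"}},
    {"id": 3, "setting": "family",
     "text": "Some days it is my joints, some days my head.",
     "codes": {"shifting symptoms"}},
    {"id": 4, "setting": "workplace",
     "text": "They would think I was just complaining.",
     "codes": {"predicting rejection", "keeping others unaware"}},
]


def focused_coding(segments, code):
    """Pick out every segment that has been coded with the given label."""
    return [s for s in segments if code in s["codes"]]


def compare_by(segments, code, attribute):
    """Group the instances of one code by an attribute (e.g., setting)
    so that differences in its use can be inspected side by side."""
    groups = defaultdict(list)
    for s in focused_coding(segments, code):
        groups[s[attribute]].append(s["id"])
    return dict(groups)


def merge_codes(segments, old_codes, new_code):
    """Return a reworked copy of the index in which several codes are
    merged into one; the original segments are left untouched."""
    old_codes = set(old_codes)
    reworked = []
    for s in segments:
        codes = set(s["codes"])
        if codes & old_codes:
            codes = (codes - old_codes) | {new_code}
        reworked.append({**s, "codes": codes})
    return reworked


# Instances of one category, grouped by setting for comparison.
by_setting = compare_by(segments, "avoiding disclosure", "setting")

# Reworking the category system after comparison: two codes become one.
reworked = merge_codes(
    segments,
    {"avoiding disclosure", "keeping others unaware"},
    "managing disclosure",
)
```

The sketch captures only the bookkeeping. Deciding which instances differ in an interesting way, and whether two categories should be merged or one broken up, remains an analytic judgement made through close reading of the data.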

Table 2 Line-by-line coding.

Coding                                    Interview
Shifting symptoms, having inconsistent    If you have lupus, I mean one day it's my liver;
days                                      one day it's my joints; one day it's my head, and
Interpreting images of self given by      it's like people really think you're a
others                                    hypochondriac if you keep complaining about
Avoiding disclosure                       different ailments. . . . It's like you don't want to
                                          say anything because people are going to start
Predicting rejection                      thinking, you know, "God, don't go near her,
Keeping others unaware                    all she is – is complaining about this."

Source: Charmaz (1995, p. 39)

[Figure 1: flow diagram. Labels recoverable from the original: Data collection; Data preparation; Data storage; Initial analysis; Coding; Refine indexing system; Core analysis; Memo writing; Category linking; Outcomes (key concepts, definitions, memos, relationships and models).]

Figure 1 Procedures for conducting grounded theory research.

3.06.2.2.4 Memo writing

Throughout this stage of coding and comparison grounded theorists recommend what they call memo writing. That is, writing explicit notes on the assumptions that underlie any particular coding. Memo writing is central to the process of building theoretical understandings from the categories, as it provides a bridge between the categorization of data and the writing up of the research. In addition to the process of refining categories, a further analytic task involves linking categories together. The goal here is to start to model relationships between categories. Indeed, one possibility may be the production of a diagram or flow chart which explicitly maps relationships.

As Figure 1 makes clear, these various elements in grounded theory analysis are not discrete stages. In the main phase of the analysis, refining the indexing system is intimately bound up with the linking of categories, and memo writing is likely to be an adjunct to both. Moreover, this analysis may lead the researcher back to the basic line-by-line coding, or even suggest the need to collect further material for analysis.

3.06.2.2.5 Validity

There is a sense in which the general methodological procedures of grounded theory are together designed to provide a level of validation as they force a thoroughgoing engagement with the research materials. Line-by-line coding, constant comparison, and memo writing are all intended to ensure that the theoretical claims made by the analyst are fully grounded in the data. That, after all, was the original key idea of grounded theory. However, some specific procedures of validation have been proposed.

Some grounded theorists have suggested that respondent validation could be used as a criterion. This involves the researcher taking back interpretations to the participants to see if they are accepted. The problem with such an approach is that participants may agree or disagree for a range of social reasons or they may not understand what is being asked of them. Other approaches to validation suggest that research should be generative, that is, facilitate further issues and questions; have rhetorical power, that is, prove effective in persuading others of its effectiveness; or that there could be an audit trail which would allow another researcher to check how the conclusions were reached (Henwood & Pidgeon, 1995; Lincoln & Guba, 1985).

3.06.2.3 Example: Clegg, Standen, and Jones (1996)

There is no example in clinical psychology of the sorts of full-scale grounded theory study that Charmaz (1991) conducted on the experience of illness or Glaser and Strauss (1968) carried out on hospital death. Nevertheless, a modest study by Clegg et al. (1996) of the relationships between staff members and adults with profound learning disabilities illustrates some of the potential of grounded theory for clinical work.

In line with the grounded theory emphasis on the value of comparison, 20 staff members from four different residential settings were recruited. Each member of staff was videotaped during eight sessions of interaction with a single client that they had known for over three months. Each staff member was subsequently interviewed about their work, their relationship with the client, a specific experience with the client, and their understanding of the client's view of the world. These conversational interviews were tape recorded and transcribed, and form the data for the major part of the study.

The study followed standard grounded theory procedures in performing a range of codings, developing links between categories, and building up theoretical notions from those codings. The range of different outcomes of the study is typical of grounded theory. In the first place, they identify four kinds of relationships that staff may have with clients: provider (where meeting of the client's needs is primary); meaning-maker (where making sense of the client's moods or gestures is primary); mutual (where shared experience and joy at the client's development is primary); companion (where merely being together is treated as satisfying). The authors go on to explore the way different settings and different client groups were characterized by different relationships. Finally, they propose that the analysis supports four propositions about staff–client relationships: some types of relationship are better than others (although this will vary with setting and client); staff see the balance of control in the relationship as central; families can facilitate relationships; professional networks create dilemmas.

3.06.2.4 Virtues and Limitations

Grounded theory has a range of virtues. It is flexible with respect to forms of data and can be applied to a wide range of questions in varied domains. Its procedures, when followed fully, force the researcher into a thorough engagement with the materials; it encourages a slow-motion reading of texts and transcripts that should avoid the common qualitative research trap of trawling a set of transcripts for quotes to illustrate preconceived ideas. It makes explicit some of the data management procedures that are commonly left inexplicit in other qualitative approaches. The method is at its best where there is an issue that is tractable from a relatively common sense actor's perspective. Whether studying disasters, illness, or staff relationships, the theoretical notions developed are close to the everyday notions of the participants. This makes the work particularly suitable for policy implementation, for the categories and understandings of the theory are easily accessible to practitioners and policy makers.

Some problems and questions remain, however. First, although there is a repeated emphasis on theory (after all, it is in the very name of the method), the notion of theory is a rather limited one strongly influenced by the empiricist philosophy of science of the 1950s. The approach works well if theory is conceived in a limited manner as a pattern of relationships between categories, but less well if theories are conceived of as, say, models of underlying generative mechanisms (Harré, 1986).

Second, one of the claimed benefits of grounded theory work is that it works with the perspective of participants through its emphasis on accounts and reports. However, one of the risks of processes such as line-by-line coding is that it leads to a continual pressure to assign pieces of talk or elements of texts to discrete categories rather than seeing them as inextricably bound up with broader sequences of talk or broader textual narratives. Ironically this can mean that instead of staying with the understandings of the participants their words are assigned to categories provided by the analyst.

Third, grounded theorists have paid little attention to the sorts of problems in using textual data that ethnomethodologists and discourse analysts have emphasised (Atkinson, 1978; Gilbert & Mulkay, 1984; Silverman, 1993; Widdicombe & Wooffitt, 1995). For example, how far is the grounding derived not from theorizing but from reproducing common sense theories as if they were analytic conclusions? How far are Clegg et al.'s (1996) staff participants, say, giving an accurate picture of their relationships with clients, and how far are they drawing on a range of ideas and notions to deal with problems and work up identities in the interview itself?

Some practitioners are grappling with these problems in a sophisticated manner (Pidgeon & Henwood, 1996). As yet there is not a large enough body of work with clinical materials to allow a full evaluation of the potential of this method. For more detailed coverage of grounded theory the reader should start with the excellent introductions by Pidgeon (1996) and Pidgeon and Henwood (1996); Charmaz (1991) provides a full scale research illustration of the potential of grounded theory; Rafuls and Moon (1996) discuss grounded theory in the context of family therapy; and, despite its age, Glaser and Strauss (1967) is still an informative basis for understanding the approach.

3.06.3 ETHNOGRAPHY AND PARTICIPANT OBSERVATION

Ethnography is not a specific method so much as a general approach which can involve a number of specific research techniques such as interviewing and participant observation. Indeed, this has been a source of some confusion as rather different work is described as ethnography in anthropology, sociology, and other disciplines such as education or management science. The central thrust of ethnographic research is to study people's activities in their natural settings. The concern is to get inside the understanding and culture, to develop a subtle grasp of how members view the world, and why they act as they do. Typically, there is an emphasis on the researcher being involved in the everyday world of those who are being studied. Along with this goes a commitment to working with unstructured data (i.e., data which has not been coded at the point of data collection) and a tendency to perform intensive studies of small numbers of cases. Ethnography is not suited to the sorts of hypothetico-deductive procedures that are common in psychology.

Two important tendencies in current ethnographic work can be seen in their historical antecedents in anthropology and sociology. In the nineteenth century, anthropology often involved information collected by colonial administrators about members of indigenous peoples. Its use of key informants and its focus on revealing the details of exotic or hidden cultures continue in much work. Sociologists of the "Chicago School" saw ethnography as a way of revealing the lives and conditions of the US underclass. This social reformism was married to an emphasis on understanding their participants' lives from their own perspective.

3.06.3.1 Questions

Ethnography comes into its own where the researcher is trying to understand some particular sub-cultural group in a specific setting. It tends to be holistic, focusing on the entire experience participants have of a setting, or their entire cosmology, rather than focusing on discrete variables or phenomena. This means that ethnographic questions tend to be general: What happens in this setting? How does this group understand their world? In Goffman's (1961) Asylums he tried to reveal the different worlds lived by the staff and inmates, and to describe and explicate some of the ceremonies that were used to reinforce boundaries between the two groups. A large part of his work tracked what he called the "moral careers" of inmates from prepatient, through admission, and then as inpatients. Much of the force and influence of Goffman's work derived from its revelations about the grim "unofficial" life lived by patients in large state mental hospitals in the 1950s. In this respect it followed in the Chicago school tradition of exposé and critique. Rosenhan's (1973) classic study of hospital admission and the life of the patient also followed in this tradition. Famously it posed the question of what was required to be diagnosed as mentally ill and then incarcerated, and discovered that it was sufficient to report hearing voices saying "empty," "hollow," and "thud." This "pseudopatient" study was designed with a very specific question about diagnostic criteria in mind; however, after the pseudopatients were admitted they addressed themselves to more typically ethnographic concerns, such as writing detailed descriptions of their settings, monitoring patient contact with different kinds of staff, and documenting the experience of powerlessness and depersonalization.

Goffman's and Rosenhan's work picks up the ethnographic traditions of revealing hidden worlds and providing a basis for social reform. Jodelet (1991) illustrates another analytic possibility by performing an intensive study of one of the longest running community care schemes in the world, the French colony of Ainay-le-Château where mental patients live with ordinary families. Again, in line with the possibilities of ethnography, she attempted to explore the whole setting, including the lives of the patients and their hosts and their understandings of the world. Her work, however, is notable for showing how ethnography can explore the representational systems of participants and relate that system to the lives of the participants. To give just one small example, she shows the way the families' representation of a close link between madness and uncleanliness relates to practices such as taking meals separately from the lodgers.

Another topic for ethnographic work has been the practice of psychotherapy itself (Gubrium, 1992; Newfield, Kuehl, Joanning, & Quinn, 1990, 1991). These studies are interested in the experience of patients in therapy and their conceptions of what therapy is, as well as the practices and conceptions of the therapists. Such studies do not focus exclusively on the interaction in the therapy session itself, but on the whole setting.

Although ethnography is dependent on close and systematic description of practices it is not necessarily atheoretical. Ethnographic studies are often guided by one of a range of theoretical conceptions. For example, Jodelet's (1991) study was informed by Moscovici's (1984) theory of social representations and Gubrium's (1992) study of family therapy was guided by broader questions about the way the notion of family and family disorder are constructed in Western society. Ethnography has sometimes been treated as particularly appropriate for feminist work because of the possibility of combining concerns with experience and social context (e.g., Ronai, 1996).

3.06.3.2 Procedures

Ethnography typically involves a mix of different methods with interviews and participant observation being primary, but often combined with nonparticipant observation and the analysis of documents of one kind or another. Such a mixture raises a large number of separate issues which will only be touched on here. There are whole books on some elements of ethnographic work such as selecting informants (Johnson, 1990), living with informants (Rose, 1990), and interpreting ethnographic writings (Atkinson, 1990). The focus here is on research access, field relations, interviewing, observing, fieldnotes, and analysis (see Ellen, 1984; Fetterman, 1989; Fielding, 1993; Hammersley & Atkinson, 1995; Rachel, 1996; Toren, 1996; Werner & Schoepfle, 1987).

3.06.3.2.1 Access

Research access is often a crucial issue in ethnography, as a typical ethnographic study will require not only access to some potentially sensitive group or setting but may involve the researcher's bodily presence in delicate contexts. Sitting in on a family therapy session, for example, may involve obtaining consent from a range of people who have different concerns and has the potential for disrupting, or at least subtly changing, the interaction that would have taken place. There are only restricted possibilities for the ethnographer to enter settings with concealed identities, as Goffman did with his mental hospital study, and Rosenhan's pseudopatients did. Moreover, such practices raise a host of ethical and practical problems which increasingly lead ethnographers to avoid deception. Access can present another problem in sensitive settings if it turns out that it is precisely unusual examples where access is granted, perhaps because the participants view them as models of good practice.

3.06.3.2.2 Field relations

Field relations can pose a range of challenges. Many of these follow from the nature of the participation of the researcher in the setting. How far should researchers become full participants and how far should they stay uninvolved observers? The dilemma here is that much of the power of ethnography comes from the experience and knowledge provided by full participation, and yet such participation may make it harder to sustain a critical distance from the practices under study. The ethnographer should not be converted to the participants' cultural values; but neither should they stay entirely in the Martian role that will make it harder to understand the subtle senses through which the participants understand their own practices. Field relations also generate many practical, prosaic, but nevertheless important problems which stem from the sheer difficulty of maintaining participant status in an unfamiliar and possibly difficult setting for a long period of time. At the same time there is a whole set of skills required for building productive and harmonious relationships with participants.

3.06.3.2.3 Fieldnotes

One of the central features of participant observation is the production of fieldnotes. Without notes to take away there is little point in conducting observation. In some settings it may be possible to take notes concurrently with the action but often the researcher will need to rely on their memory, writing up notes on events as soon as possible after they happened. A rule of thumb is that writing up fieldnotes will take just as much time as the original period of observation (Fielding, 1993). In some cases it may be possible to tape record interaction as it happens. However, ethnographers have traditionally placed less value on recording as they see the actual process of note taking as itself part of the process through which the researcher comes to understand connections between processes and underlying elements of interaction.

Ethnographers stress that fieldnotes should be based around concrete descriptions rather than consisting of abstract higher-order interpretations. The reason for this is that when observation is being done it may not yet be clear what questions are to be addressed. Notes that stay at low levels of inference are a resource that can be used to address a range of different questions. Fielding argues that fieldnotes are expected:

to provide a running description of events, people and conversation. Consequently each new setting observed and each new member of the setting

merits description. Similarly, changes in the human or other constituents of the setting should be recorded. (1993, p. 162)

It is also important to distinguish in notes between direct quotation and broad précis of what participants are saying. A final point emphasised by ethnographers is the value of keeping a record of personal impressions and feelings.

3.06.3.2.4 Interviews

Ethnographers make much use of interviews. However, in this tradition interviews are understood in a much looser manner than in much of psychology. Indeed, the term interview may be something of a misnomer with its image of the researcher running through a relatively planned set of questions with a single passive informant in a relatively formal setting. In ethnography what is involved is often a mix of casual conversations with a range of different participants. Some of these may be very brief, some extended, some allowing relatively formal questioning, others allowing no overt questioning. In the more formal cases the interview may be conducted with a planned schedule of questions and the interaction is recorded and transcribed.

3.06.3.2.5 Analysis

There is not a neat separation of the data collection and analytic phases of ethnographic research. The judgements about what to study, what to focus on, which elements of the local culture require detailed description, and which can be taken for granted, are already part of analysis. Moreover, it is likely that in the course of a long period of participant observation, or a series of interviews, the researcher will start to develop accounts for particular features of the setting, or begin to identify the set of representations shared by the participants. Such interpretations are refined, transformed, and sometimes abandoned when the fieldwork is completed and the focus moves on to notes and transcripts.

Fielding (1993, p. 192) suggests that the standard pattern of work with ethnographic data is straightforward. The researcher starts with the fieldnotes and transcripts and searches them for categories and patterns. These themes form a basis for the ethnographic account of the setting, and they also structure the more intensive analysis and probably the write-up of the research. The data will be marked or cut up (often on computer text files) to collect these themes together. In practice, the ethnographer is unlikely to attempt a complete account of a setting, but will concentrate on a small subset of themes which are most important or which relate to prior questions and concerns.

The analytic phase of ethnography is often described in only the sketchiest terms in ethnography guidebooks. It is clear that ethnographers often make up their own ways of managing the large amount of materials that they collect, and for using that material in convincing research accounts. At one extreme, ethnography can be considered an approach to develop the researcher's competence in the community being studied: they learn to be a member, to take part actually and symbolically, and they can use this competence to write authoritatively about the community (Collins, 1983). Here extracts from notes and interview transcripts become merely exemplars of the knowledge that the researcher has gained through participation. At the other extreme, ethnography blurs into grounded theorizing, with the notes and transcripts being dealt with through line-by-line coding, comparison of categories, and memo writing. Here the researcher's cultural competence will be important for interpreting the material, but the conclusions will ultimately be dependent on the quality of the fieldnotes and transcripts and what they can support.

3.06.3.2.6 Validity

One of the virtues of ethnography is its rich ecological validity. The researcher is learning directly about what goes on in a setting by observing it, by taking part, and/or by interviewing the members. This circumvents many of the inferences that are needed in extrapolating from more traditional psychological research tools (questionnaires, experimental simulations) to actual settings. However, the closeness of the researcher to the setting does not in itself ensure that the research that is produced will be of high quality.

The approach to validity most often stressed by ethnographers is triangulation. At the level of data this involves checking to see that different informants make the same sorts of claims about actions or events. At the level of method, it involves checking that conclusions are supported by different methods, for example, by both interviews and observation. However, triangulation is not without its problems. Discourse researchers have noted that in practice the sheer variability in and between accounts makes triangulation of only limited use (Potter & Wetherell, 1987) and others have identified conceptual problems in judging what a successful triangulation between methods would be (Silverman, 1993).
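The two levels of triangulation described in the validity discussion can be made concrete with a small sketch. It is illustrative only: the record format, the informant and event names, and the function `triangulate` are hypothetical and are not drawn from any of the works cited. It also assumes that claims have already been coded by hand into comparable labels, which is itself an analytic judgment of exactly the kind that the critics of triangulation question.

```python
from collections import defaultdict

# Hypothetical hand-coded records: (informant, method, event, claim).
# All names and labels are invented for illustration.
records = [
    ("counselor_A", "interview", "intake_meeting", "family_arrived_late"),
    ("counselor_B", "interview", "intake_meeting", "family_arrived_late"),
    ("fieldworker", "observation", "intake_meeting", "family_arrived_late"),
    ("counselor_A", "interview", "session_3", "father_dominated_talk"),
    ("mother", "interview", "session_3", "father_stayed_silent"),
]


def triangulate(records):
    """Summarise, per event, agreement across informants and methods."""
    by_event = defaultdict(list)
    for informant, method, event, claim in records:
        by_event[event].append((informant, method, claim))

    summary = {}
    for event, items in by_event.items():
        claims = {claim for _, _, claim in items}
        methods = {method for _, method, _ in items}
        agree = len(claims) == 1
        summary[event] = {
            # Data level: do all sources make the same coded claim?
            "informants_agree": agree,
            # Method level: which methods produced material on this event?
            "methods": sorted(methods),
            # Agreement that also spans more than one method.
            "supported_across_methods": agree and len(methods) > 1,
        }
    return summary


result = triangulate(records)
for event in sorted(result):
    print(event, result[event])
```

A disagreement flag in such a summary is best read as a prompt for closer analysis rather than as an error to be resolved, since variability between accounts is precisely what discourse researchers treat as the phenomenon of interest.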

3.06.3.3 Example: Gubrium (1992)

There are only a small number of ethnographies done in clinical settings or on clinical topics. For instance, many of the clinical examples in Newfield, Sells, Smith, Newfield, & Newfield's (1996) chapter on ethnography in family therapy are unpublished dissertations. The body of ethnographic work is small but increasing rapidly. I have chosen Gubrium's (1992) study of family therapy in two institutions as an example because it is a book-length study, and it addresses the therapeutic process itself rather than concentrating solely on the patients' lives in hospital or community care schemes. However, it is important to stress that Gubrium's focus was as much on what the therapy revealed about the way notions of family and order are understood in American culture as on therapeutic techniques and effectiveness. He was concerned with the way behaviours such as truancy take their sense as part of a troubled family and the way service professionals redefine family order as they instigate programmes of rehabilitation.

Gubrium's choice of two contrasting institutions is a commonplace one in ethnography. The small number enables an intensive approach; having more than one setting allows an illuminating range of comparisons and contrasts. In this case one was inpatient, one outpatient; one more middle class than the other; they also varied in their standard approaches to treatment. The virtues of having two field sites shine through in the course of the write-up, although it is interesting to note that the selection was almost accidental as the researcher originally expected to be successful in gaining access to only one of the institutions.

The fieldwork followed a typical ethnographic style of spending a considerable amount of time at the two facilities, talking to counselors, watching them at work in therapy sessions, reviewing videos of counseling, and making fieldnotes. The study also drew on a range of documents including patients' case notes and educational materials. In some ways this was a technically straightforward setting for ethnographic observation as many of the participants were themselves university trained practitioners who made notes and videos as part of the general workings of the facilities.

One of the striking differences between this ethnographic study of therapy and a typical process or outcome study is that the therapy is treated as a part of its physical, institutional, and cultural contexts. For instance, time is spent documenting the organization of the reception areas of the two facilities and the way the counselors use the manner in which the families seat themselves in that area as evidence of family dynamics. Gubrium writes about the important role of tissue boxes in both signaling the potential for emotional display and providing practical support when such display occurs:

I soon realized that tissues were about more than weeping and overall emotional composure during therapy. Tissues mundanely signaled the fundamental reality of the home as locally understood: a configuration of emotional bonds. For Benson [a counselor] their usage virtually put the domestic disorder of the home on display, locating the home's special order in the minutiae of emotional expression. (Gubrium, 1992, p. 26)

The ethnographic focus on events in context means that therapy is treated as a product of actual interactions full of contingency and locally managed understandings. It shows the way abstract notions such as family systems or tough love are managed in practice, and the way the various workers relate to each other as well as to the clients. It provides an insight into the world of family therapy quite different from most other styles of research.

3.06.3.4 Virtues and Limitations

Much of the power of ethnographic research comes from its emphasis on understanding people's actions and representations both in context and as part of the everyday practices that make up their lives, whether they are Yanomami Indians or family therapists. It can provide descriptions which pick out abstract organizations of life in a setting as well as allowing the reader rich access. Ethnography can be used in theory development and even theory testing (Hammersley & Atkinson, 1995). It is flexible; research can follow up themes and questions as they arise rather than necessarily keeping to preset goals.

Ethnographic research can be very time consuming and labor intensive. It can also be very intrusive. Although Gubrium was able to participate in some aspects of family therapy, this was helped by the sheer number of both staff and family members who were involved. It is less easy to imagine participant observation on individual therapy.

One of the most important difficulties with ethnographic work is that the reader often has to take on trust the conclusions because the evidence on which they are based is not available for assessment (Silverman, 1993). Where field notes of observations are reproduced in ethnographies (and this is relatively rare), such notes are nevertheless a ready-theorized version of events. Descriptions of actions and events are always bound up with a range of judgments (Potter, 1996a). Where analysis depends on the claims of key informants the problem is assessing how these claims relate to any putative activities that are described. Ethnographers deal with these problems with varying degrees of sophistication (for discussion see Nelson, 1994). However, some researchers treat them as inescapable and have turned to some form of discourse analysis instead.

For more detailed discussions of ethnography readers should start with Fielding's (1993) excellent brief introduction and then use Hammersley and Atkinson (1995) as an authoritative and up to date overview. Two of the most comprehensive, although not always most sophisticated, works are Werner and Schoepfle (1987) and Ellen (1984). Both were written by anthropologists, and this shows in their understanding of what is important. Newfield, Sells, Smith, Newfield, and Newfield (1996) provide a useful discussion of ethnography in family therapy research.

3.06.4 DISCOURSE ANALYSIS

Although both grounded theorizing and ethnographic work in clinical areas have increased, the most striking expansion has been in research in discourse analysis (Soyland, 1995). This work is of variable quality and often done by researchers isolated in different subdisciplines; moreover, it displays considerable terminological confusion. For simplicity, discourse analysis is taken as covering a range of work which includes conversation analysis and ethnomethodology (Heritage, 1984; Nofsinger, 1991), some specific traditions of discourse analysis and discursive psychology (Edwards & Potter, 1992a; Potter & Wetherell, 1987), some of the more analytically focused social constructionist work (McNamee & Gergen, 1992), and a range of work influenced by post-structuralism, Continental discourse analysis, and particularly the work of Foucault (Burman, 1995; Madigan, 1992; Miller & Silverman, 1995). In some research these different themes are woven together; elsewhere strong oppositions are marked out.

The impetus for discourse analysis in clinical settings comes from two directions. On the one hand, there are practitioner/researchers who have found ideas from social constructionism, literary theory, and narrative useful (e.g., Anderson & Goolishian, 1988; White & Epston, 1990). On the other, there are academic researchers who have extended analytic and theoretical developments in discourse studies to clinical settings (e.g., Aronsson & Cederborg, 1996; Bergmann, 1992; Buttny, 1996; Edwards, 1995; Lee, 1995). Collections such as Siegfried (1995), Burman, Mitchel, and Salmon (1996), and Morris and Chenail (1995) reflect both types of work, sometimes in rather uneasy combination (see Antaki, 1996).

This tension between an applied and academic focus is closely related to the stance taken on therapy. In much discourse analysis, therapy is the start point for research and the issue is how therapy gets done. For example, Gale's (1991; Gale & Newfield, 1992) intensive study of one of O'Hanlon's therapy sessions considered the various ways in which the goals of solution focused family therapy were realized in the talk between therapist and client. However, some conversation analysts and ethnomethodologists resist assuming that conversational interaction glossed by some parties as therapy (solution focused, Milan School, or whatever) must have special ingredient X (therapy) that is absent in, say, the everyday "troubles talk" done with a friend over the telephone (Jefferson, 1988; Jefferson & Lee, 1992; Schegloff, 1991; Watson, 1995).

This is a significant point for all researchers into therapy and counseling, so it is worth illustrating with an example. In Labov and Fanshel's classic study, the therapy session starts in the following manner:

Rhoda: I don't (1.0) know, whether (1.5) I-I think I did- the right thing, jistalittle situation came up (4.5) an' I tried to uhm (3.0) well, try to (4.0) use what I- what I've learned here, see if it worked (0.3)
Therapist: Mhm
Rhoda: Now, I don't know if I did the right thing. Sunday (1.0) um- my mother went to my sister's again.
Therapist: Mm-hm
Rhoda: And she usu'lly goes for about a day or so, like if she leaves on Sunday, she'll come back Tuesday morning. So- it's nothing. But- she lef' Sunday, and she's still not home.
Therapist: O- oh.
(1977, p. 263)

Labov and Fanshel provide many pages of analysis of this sequence. They identify various direct and indirect speech acts and make much of what they call its therapeutic interview style, particularly the vague reference terms at the start: "right thing" and "jistalittle situation." This vagueness can easily be heard as the 19-year-old anorexia sufferer struggling to face up to her relationship with her difficult mother. However, in a reanalysis from a conversation analytic perspective, Levinson (1983) suggests that this sequence is characteristic of mundane news telling sequences in everyday conversation. These typically have four parts: the pre-announcement, the go ahead, the news telling, and the news receipt. For example:

D: I forgot to to tell you the two best things that happen' to me today. [preannouncement]
R: Oh super = what were they. [go ahead turn]
D: I got a B+ on my math test . . . and I got an athletic award [news telling]
R: Oh excellent. [news receipt]
(Levinson, 1983, p. 349, slightly modified)

A particular feature of preannouncements is their vagueness, for their job is to prefigure the story (and thereby check its newsworthiness), not to actually tell it. So, rather than following Labov and Fanshel (1977) in treating this vagueness as specific to a troubled soul dealing with a difficult topic in therapy, Levinson (1983) proposes that it should be understood as a commonplace characteristic of mundane interaction.

This example illustrates a number of features typical of a range of discursive approaches to therapy talk. First, the talk is understood as performing actions; it is not being used as a pathway to cognitive processes (repression, say) or as evidence of what Rhoda's life is like (her difficult mother). Second, the interaction is understood as sequentially organized so any part of the talk relates to what came immediately before and provides an environment for what will immediately follow. The realization of how far interaction gets its sense from its sequential context has critical implications for approaches such as content analysis and grounded theory which involve making categorizations and considering relations between them; for such categorizations tend to cut across precisely the sequential relations that are important for the sense of the turn of talk.

The third feature is that the talk is treated as ordered in its detail not merely in its broad particulars. For example, Levinson (1983) highlights a number of orderly elements in what we might easily mistake for clumsiness in Rhoda's first turn:

R's first turn is . . . formulated to prefigure (i) the telling of something she did (I think I did the right thing), and (ii) the describing of the situation that led to the action (jistalittle situation came up). We are therefore warned to expect a story with two components; moreover the point of the story and its relevance to the here and now is also prefigured (use what I've learned here, see if it worked). (1983, p. 353)

Even the hesitations and glottal stops in Rhoda's first turn, which seem so redolent of a troubled young person, are "typical markings of self-initiated self-repair, which is characteristic of the production of first topics" (Levinson, 1983, p. 353). This emphasis on the significance of detail has an important methodological consequence: if interaction is to be understood properly it must be represented in a way that captures this detail. Hence the use of a transcription scheme that attempts to represent a range of paralinguistic features of talk (stress, intonation) on the page as well as elements of the style of its delivery (pauses, cut-offs).

A fourth feature to note here is the comparative approach that has been taken. Rather than focus on therapy talk alone Levinson is able to support an alternative account of the interaction by drawing on materials, and analysis, taken from mundane conversations. Since the mid 1980s there has been a large amount of work in different institutional settings as well as everyday conversation, and it is now possible to start to show how a news interview, say, differs from the health visitor's talk with an expectant mother, and how that differs in turn from conversation between two friends over the telephone (Drew & Heritage, 1992a).

A fifth and final feature of this example is that it is an analysis of interaction. It is neither an attempt to reduce what is going on to cognitions of the various parties (Rhoda's denial, say, or the therapist's eliciting strategies) nor to transform it into countable behaviors such as verbal reinforcers. This style of discourse work develops a Wittgensteinian notion of cognitive words and phrases as elements in a set of language games for managing issues of blame, accountability, description, and so on (Coulter, 1990; Edwards, 1997; Harré & Gillett, 1994). Such a "discursive psychology" analyzes notions such as attribution and memory in terms of the situated practices through which responsibility is assigned and the business done by constructing particular versions of past events (Edwards & Potter, 1992b, 1993).

3.06.4.1 Questions

Discourse analysis is more directly associated with particular theoretical perspectives (ethnomethodology, post-structuralism, discursive psychology) than either grounded theory or ethnography. The questions it addresses focus on the practices of interaction in their natural contexts and the sorts of discursive resources that are drawn on in those contexts. Some of the most basic questions concern the standardized sequences of interaction that take place in therapy and counseling (Buttny & Jensen, 1995; Lee, 1995; Peräkylä, 1995; Silverman, 1997b). This is closely related to a concern with the activities that take place. What is the role of the therapist's tokens such as "okay" or "mm hm" (Beach, 1995; Czyzewski, 1995)? How do different styles of questioning perform different tasks (Bergmann, 1992; Peräkylä, 1995; Silverman, 1997b)? What is the role of problem formulations by both therapists and clients, and how are they transformed and negotiated (Buttny, 1996; Buttny & Jensen, 1995; Madill & Barkham, 1997)? For example, in a classic feminist paper Davis (1986) charts the way a woman's struggles with her oppressive social circumstances are transformed into individual psychological problems suitable for individual therapy. While much discourse research is focused on the talk of therapy and counseling itself, studies in other areas show the value of standing back and considering clinical psychology as a set of work practices in themselves, including management of clients in person and as records, conducting assessments, delivering diagnoses, intake and release, stimulating people with profound learning difficulties, case conferences and supervisions, offering advice, and managing resistance (see Drew & Heritage, 1992a). Discourse researchers have also been moving beyond clinical settings to study how people with clinical problems or learning difficulties manage in everyday settings (Brewer & Yearley, 1989; Pollner & Wikler, 1985; Wootton, 1989).

Another set of questions is suggested by the perspective of discursive psychology. For example, Edwards (1997) has studied the rhetorical relationship between problem formulations, descriptions of activities, and issues of blame in counseling. Cheston (1996) has studied the way descriptions of the past in a therapy group of older adults can create a set of social identities for the members. Discursive psychology can provide a new take on emotions, examining how they are constructed and their role in specific practices (Edwards, 1997).

From a more Foucaultian inspired direction, studies may consider the role of particular discourses, or interpretative repertoires, in constructing the sense of actions and experiences (Potter & Wetherell, 1987). For example, what are the discursive resources that young women draw on to construct notions of femininity, agency, and body image in the context of eating disorders (Malson & Ussher, 1996; Wetherell, 1996)? What discourses are used to construct different notions of the person, of the family, and of breakdown in therapy (Burman, 1995; Frosh, Burck, Strickland-Clark, & Morgan, 1996; Soal & Kotter, 1996)? This work is often critical of individualistic conceptions of therapy.

Finally, discourse researchers have stood back and taken the administration of psychological research instruments as their topic. The intensive focus of such work can show the way that the sort of idiosyncratic interaction that takes place when filling in questionnaires or producing records can lead to particular outcomes (Antaki & Rapley, 1996; Garfinkel, 1967b; Rapley & Antaki, 1996; Soyland, 1994).

Different styles of discourse work address rather different kinds of questions. However, conversation analytic work is notable in commonly starting from a set of transcribed materials rather than preformulated research questions, on the grounds that such questions often embody expectations and assumptions which prevent the analyst seeing a range of intricacies in the interaction. Conversation analysis reveals an order to interaction that participants are often unable to formulate in abstract terms.

3.06.4.2 Procedures

The majority of discourse research in the clinical area has worked with records of natural interaction, although a small amount has used open-ended interviews. There is not space here to discuss the role of interviews in discourse analysis or qualitative research generally (see Kvale, 1996; Mischler, 1986; Potter & Mulkay, 1985; Widdicombe & Wooffitt, 1995). For simplicity discourse work will be discussed in terms of seven elements.

3.06.4.2.1 Research materials

Traditionally psychologists have been reluctant to deal with actual interaction, preferring to model it experimentally, or reconstruct it via scales and questionnaires. Part of the reason for this lies in the prevalent cognitivist assumptions which have directed attention away from interaction itself to focus on generative mechanisms within the person. In contrast, discourse researchers have emphasised the primacy of practices of interaction themselves. The most obvious practice setting for clinical work is the therapy session itself, and this has certainly received the most attention. After all, there is an elegance in studying the "talking cure" using methods designed to acquire an understanding of the nature of talk. However, there is a danger that such an exclusive emphasis underplays mundane aspects of clinical practices: giving advice, offering a diagnosis, the reception of new clients, casual talk to clients' relatives, writing up clinical records, case conferences, clinical training, and assessment.

Notions of sample size do not translate easily from traditional research as the discourse research focus is not so much on individuals as on interactional phenomena of various kinds. Various considerations can come to the fore here, including the type of generalizations that are to be made from the research, the time and resources available, and the nature of the topic being studied. For example, if the topic is the role of "mm hms" in therapy a small number of sessions may generate a large corpus; other phenomena may be much rarer and require large quantities of interaction to develop a useful corpus. For some questions, single cases may be sufficient to underpin a theoretical point or reveal a theoretically significant phenomenon.

3.06.4.2.2 Access and ethics

One of the most difficult practical problems in conducting discourse research involves getting access to sometimes sensitive settings in ways which allow for informed consent from all the parties involved. Experience suggests that more often than not it is the health professionals

may be missed from the tape, and good quality equipment is now compact and cheap. On the other hand, video can be more intrusive, particularly where the recording is being done by one of the participants (a counselor, say), and may be hard to position so it captures gestures and expressions from all parties to an interaction. Video poses a range of practical and theoretical problems with respect to the transcription of nonvocal activity which can both be time consuming and create a transcript that is difficult to understand. Moreover, there is now a large body of studies that shows high quality analysis can, in many cases, be performed with an audio tape alone. One manageable solution is to use video if doing so does not disrupt the interaction, and then to transcribe the audio and work with a combination of video tape and audio transcript. Whether audio or video is chosen, the quality (clear audibility and visibility) is probably the single most consequential feature of the recording for the later research.

Another difficulty is how far the recording of interaction affects its nature. This is a subtle issue. On the one hand, there are a range of ways of minimizing such influences including acclimatizing participants and giving clear descriptions of the role of the research. On the other, experience has shown that recording has little influence on many, perhaps most, of the activities in which the discourse researcher is interested. Indeed, in some clinical settings recordings may be made as a matter of course for purposes of therapy and training, and so no new disruption is involved.
rather than the clients who are resistant to their practices being studied, perhaps because they 3.06.4.2.4 Transcription are sensitive to the difference between the idealized version of practices that was used in Producing a high-quality transcript is a training and the apparently more messy pro- crucial prerequisite for discourse research. A cedures in which they actually engage. Some- transcript is not a neutral, simple rendition of times reassurances about these differences can the words on a tape (Ochs, 1979). Different be productive. transcription systems emphasize different fea- Using records of interaction such as these tures of interaction. The best system for most raise particular issues for ensuring anonymity. work of this kind was developed by the This is relatively manageable with transcripts conversation analyst Gail Jefferson using where names and places can be changed. It is symbols that convey features of vocal delivery harder for audio tape and harder still with that have been shown to be interactionally video. However, technical advances in the use of important to participants (Jefferson, 1985). At digitized video allow for disguising of identity the same time the system is designed to use with relatively little loss of vocal information. characters and symbols easily available on wordprocessors making it reasonably easy to learn and interpret. The basic system is 3.06.4.2.3 Audio and video recording summarized in Table 3. For fuller descriptions There is a range of practical concerns in of using the system see Button and Lee, (1987), recording natural interaction, some of them Ochs, Schegloff, and Thompson (1996), and pulling in different directions. An immediate Psathas and Anderson (1990). issue is whether to use audio or video recording. Producing high quality transcript is very On the one hand, video provides helpful demanding and time consuming. 
It is hard to information about nonverbal activities that give a standard figure for how long it takes 136 Qualitative and Discourse Analysis

Table 3 Brief transcription notation.

Um::                colons represent lengthening of the preceding sound; the more colons, the greater the lengthening.
I've-               a hyphen represents the cut-off of the preceding sound, often by a stop.
↑ ↓                 up and down arrows represent sharp upward and downward pitch shifts in the following sounds.
Already             underlining represents stress, usually by volume; the more underlining the greater the stress.
settled in his=     the equals signs join talk that is continuous although separated across different lines of transcript.
Mm
=own mind.
hhh hh .hh          "h" represents aspiration, sometimes simply hearable breathing, sometimes laughter, etc.; when preceded by a superimposed dot, it marks in-breath; in parentheses inside a word it represents laugh infiltration.
P(h)ut
hhh[hh .hh]         left brackets represent point of overlap onset; right brackets represent point of overlap resolution.
   [I just]
. , ?               punctuation marks intonation, not grammar; period, comma, and "question mark" indicate downward, "continuative," and upward contours respectively.
( )                 single parentheses mark problematic or uncertain hearings; two parentheses separated by an oblique represent alternative hearings.
(0.2) (.)           numbers in parentheses represent silence in tenths of a second; a dot in parentheses represents a micropause, less than two tenths of a second.
°mm hmm°            the degree signs enclose significantly lowered volume.

Source: Modified from Schegloff (1997, pp. 184–185).

because much depends on the quality of the recording (fuzzy, quiet tapes can quadruple the time needed) and the type of interaction (an individual therapy session presents much less of a challenge than a lively case conference with a lot of overlapping talk and extraneous noise); nevertheless, a ratio of one hour of tape to twenty hours of transcription time is not unreasonable. However, this time should not be thought of as dead time before the analysis proper. Often some of the most revealing analytic insights come during transcription because a profound engagement with the material is needed to produce good transcript; it is generally useful to make analytic notes in parallel to the actual transcription.

3.06.4.2.5 Coding

In discourse research the principal task of coding is to make the task of analysis more straightforward by sifting relevant materials from a large body of transcript. In this it differs from both grounded theory and traditional content analysis where coding is a more intrinsic part of the analysis. Typically coding will involve sifting through materials for instances of a phenomenon of interest and copying them into an archive. This coding will often be accompanied by preliminary notes as to their nature and interest. At this stage selection is inclusive: it is better to include material that can turn out to be irrelevant at a later stage than exclude it for ill-formulated reasons early on. Coding is a cyclical process. Full analysis of a corpus of materials can often take the researcher back to the originals as a better understanding of the phenomenon reveals new examples. Often initially disparate topics merge together in the course of analysis while topics which seemed unitary can be separated.

3.06.4.2.6 Analysis

There is no single recipe for analyzing discourse. Nevertheless, there are five considerations which are commonly important in analysis. First, the researcher can use variation in and between participants' discourse as an analytic lever. The significance of variation is that it can be used for identifying and explicating the activities that are being performed by talk and texts. This is because the discourse is constructed in the specific ways that it is precisely to perform these actions; a description of jealousy in couple counseling can be assembled very differently when excusing or criticizing certain actions (Edwards, 1995). The researcher will benefit from attending to variations in the discourse of single individuals, between different individuals, and between what is said and what might have been said.

Second, discourse researchers have found it highly productive to attend to the detail of discourse. Conversation analysts such as Sacks (1992) have shown that the details in discourse (the hesitations, lexical choice, repair, and so on) are commonly part of the performance of some act or are consequential in some way for the outcome of the interaction. Attending to the detail of interaction, particularly in transcript, is one of the most difficult things for psychologists who are used to reading through the apparently messy detail for the gist of what is going on. Developing analytic skills involves a discipline of close reading.

Third, analysis often benefits from attending to the rhetorical organization of discourse. This involves inspecting discourse both for the way it is organized to make argumentative cases and for the way it is designed to undermine alternative cases (Billig, 1996). A rhetorical orientation refocuses the analyst's attention away from questions about how a version (a description of a psychological disorder, say) relates to some putative reality and focuses it on how it relates to competing alternatives.

Concern with rhetoric is closely linked to a fourth analytic concern with accountability. That is, displaying one's activities as rational, sensible, and justifiable. Ethnomethodologists have argued that accountability is an essential and pervasive character of the design and understanding of human conduct generally (Garfinkel, 1967c; Heritage, 1984). Again an attention to the way actions are made accountable is an aid for understanding precisely what those actions are.

A fifth and final analytic consideration is of a slightly different order. It is an emphasis on the virtue of building on prior analytic studies. In particular, researchers into interaction in an institutional setting such as a family therapy setting will almost certainly benefit from a familiarity with research on mundane talk as well as an understanding of how the patterning of turn taking and activities change in different institutional settings.

The best way to think of these five considerations is not as specific rules for research but as elements in an analytic mentality that the researcher will develop as they become more and more skilled. It does not matter that they are not spelled out in studies because they are separate from the procedures for validating discourse analytic claims.

3.06.4.2.7 Validity

Discourse researchers typically draw on some combination of four considerations to justify the validity of analytic claims. First, they make use of participants' own understandings as they are displayed in interaction. One of the features of a conversation is that any turn of talk is oriented to what came before and what comes next, and that orientation typically displays the sense that the participant makes of the prior turn. Thus, at its simplest, when someone provides an answer they thereby display the prior turn as a question and so on. Close attention to this turn-by-turn display of understanding provides one important check on analytic interpretations (see Heritage, 1988).

Second, researchers may find (seemingly) deviant cases most useful in assessing the adequacy of a claim. Deviant cases may generate problems for a claimed generalization, and lead the researcher to abandon it; but they may also display in their detailed organization precisely the reason why a standard pattern should take the form that it does.

Third, a study may be assessed in part by how far it is coherent with previous discourse studies. A study that builds coherently on past research is more plausible than one that is more anomalous.

Fourth, and most important, are readers' evaluations. One of the distinctive features of discourse research is its presentation of rich and extended materials in a way that allows the reader to make their own judgements about interpretations that are placed alongside them. This form of validation contrasts with much grounded theory and ethnography where interpretations often have to be taken on trust; it also contrasts with much traditional experimental and content analytic work where it is rare for "raw" data to be included or more than one or two illustrative codings to be reproduced.

Whether they appear singly or together in a discourse study none of these procedures guarantees the validity of an analysis. However, as the sociology of science work reviewed earlier shows, there are no such guarantees in science.

3.06.4.3 Example: Peräkylä (1995)

Given the wide variety of discourse studies with different questions and styles of analysis it is not easy to choose a single representative study. The one selected is Peräkylä's (1995) investigation of AIDS counseling because it is a major integrative study that addresses a related set of questions about interaction, counseling, and family therapy from a rigorous conversation analytic perspective and illustrates the potential of discourse work on clinical topics. It draws heavily on the perspective on institutional talk surveyed by Drew and Heritage (1992b) and is worth reading in conjunction with Silverman's (1997b) related study of HIV+ counseling which focuses more on advice giving.

Peräkylä focused on 32 counseling sessions conducted with HIV+ hemophilic mainly gay identified men and their partners at a major London hospital. The sessions were videotaped and transcribed using the Jeffersonian system. A wider archive of recordings (450 hours) was drawn on to provide further examples of phenomena of interest but not otherwise transcribed. The counselors characterized their practices in terms of Milan School Family Systems Theory and, although this is not the starting point of Peräkylä's study, he was able to explicate some of the characteristics of such counseling.

Part of the study is concerned with identifying the standard turn-taking organization of the counseling. Stated baldly it is that (i) counselors ask questions; (ii) clients answer; (iii) counselors comment, advise, or ask further questions. When laid out in this manner the organization may not seem much of a discovery. However, the power of the study is showing how this organization is achieved in the interaction and how it can be used to address painful and delicate topics such as sexual behavior, illness, and death.

Peräkylä goes on to examine various practices that are characteristic of family systems theory such as "circular questioning," where the counselor initially questions the client's partner or a family member about the client's feelings, and "live open supervision," where a supervisor may offer questions to the counselor that are, in turn, addressed to the client. The study also identifies some of the strategies by which counselors can address "dreaded issues" in a manageable way. Take "circular questioning," for example. In mundane interaction providing your partial experience of some event or experience is a commonplace way of "fishing" for a more authoritative version (Pomerantz, 1980). For example:

    A: Yer line's been busy.
    B: Yeuh my fu (hh)- .hh my father's wife called me

In a similar way, the use of questioning where a client's partner, say, offers their understanding of an experience "can create a situation where the clients, in an unacknowledged but most powerful way, elicit one another's descriptions of their inner experiences" (Peräkylä, 1995, p. 110). In the following extract the client is called Edward; his partner and the counselor are also present.

    Counselor: What are some of things that you think E:dward might have to
               do.=He says he doesn't know where to go from here maybe: and
               awaiting results and things.
               (0.6)
    Counselor: What d'you think's worrying him.
               (0.4)
    Partner:   Uh::m hhhhhh I think it's just fear of the unknow:n.
    Client:    Mm[:
    Counselor:   [Oka:y.
    Partner:     [At- at the present ti:me. (0.2) Uh:m (.) once: he's (0.5) got a
               better understanding of (0.2) what could happen
    Counselor: Mm:
    Partner:   uh:m how .hh this will progre:ss then: I think (.) things will be a
               little more [settled in his=
    Counselor:             [Mm
    Partner:   =own mi:nd.
    Counselor: Mm:
               (.)
    Client:    Mm[:
    Counselor:   [Edward (.) from what you know::
               ((sequence continues with Edward responding to a direct question
               with a long and detailed narrative about his fears))
    (Peräkylä, 1995, p. 110)

Peräkylä emphasizes the way that the client's talk about his fears is elicited, in part, through the counselor asking the partner for his own view of those fears. The point is not that the client is forced to reveal his experiences, rather it is that the prior revelation of his partner's partial view produces an environment where such a revelation is expected and nonrevelation will itself be a delicate and accountable matter. In effect, what Peräkylä is documenting here are the conversational mechanisms which family therapists characterize as using circular questioning to overcome clients' resistance.

3.06.4.4 Virtues and Limitations

Given the variety of styles of work done under the rubric of discourse analysis it is difficult to give an overall summary of virtues and limitations. However, the virtue of a range of studies in the conversation and discourse analytic tradition is that they offer, arguably for the first time in psychology, a rigorous way of directly studying human social practices. For example, the Peräkylä study discussed above is notable in studying actual HIV+ counseling in all its detail. It is not counseling as recollected by participants while filling in rating scales or questionnaires; it is not an experimental simulation of counseling; it does not depend on post hoc ethnographic reconstructions of events; nor are the activities immediately transformed into broad coding categories or used as a mere shortcut to underlying cognitions.

A corollary of this emphasis on working with tapes and transcripts of interaction is that these are used in research papers to allow readers to evaluate the adequacy of interpretations in a way that is rare in other styles of research.

Studies in this tradition have started to reveal an organization of interaction and its local management that has been largely missed from traditional psychological work from a range of perspectives. Such studies offer new conceptions of what is important in clinical practice and may be particularly valuable in clinical training which has often been conducted with idealized or at least cleaned up examples of practice.

Discourse research is demanding and requires a considerable investment of the researcher's time to produce satisfactory results. It does not fit neatly into routines that can be done by research assistants. Indeed, even transcription, which may seem to be the most mechanical element in the research, requires considerable skill and benefits from the involvement of the primary researchers. Analysis also requires considerable craft skills which can take time to learn.

With its focus on interaction, this would not necessarily be the perspective of choice for researchers with a more traditional cognitive or behavioral focus, although it has important implications for both of these. Some have claimed that it places too much emphasis on verbal interaction at the expense of nonverbal elements, and broader issues of embodiment. Others have claimed that it places too much emphasis on the importance of local contexts of interaction rather than on broader issues such as gender or social class. For some contrasting and strongly expressed claims about the role of discourse analysis in the cognitive psychology of memory, see papers in Conway (1992).

An accessible general introduction to various practical aspects of doing discourse analysis is provided in Potter and Wetherell (1987; see also Potter, 1996b, 1997). Potter and Wetherell (1995) discuss the analysis of broader content units such as interpretative repertoires, while Potter and Wetherell (1994) and Wooffitt (1993) discuss the analysis of how accounts are constructed. For work in the distinctive conversation analytic tradition Drew (1995) and Heritage (1995) provide clear overviews and Heath and Luff (1993) discuss analysis which incorporates video material; Gale (1996) explores the use of conversation analysis in family therapy research.

3.06.5 FUTURE DIRECTIONS

The pace of change in qualitative research in clinical settings is currently breathtaking, and it is not easy to make confident predictions. However, it is perhaps useful to speculate on how things might develop over the next few years. The first prediction is that the growth in the sheer quantity of qualitative research will continue for some time. There is so much new territory, and so many possibilities have been opened up by new theoretical and analytic developments, that they are bound to be explored.

The second prediction is that research on therapy and counseling talk will provide a particular initial focus because it is here that discourse analytic approaches can clearly provide new insights and possibly start to provide technical analytically grounded specifications of the interactional nature of different therapies in practice, as well as differences in interactional style between therapists. There may well be conflicts here between the ideological goals of constructionist therapists and the research goals of discourse analysts.

The third prediction is that the growth of qualitative work will encourage more researchers to attempt integrations of qualitative and quantitative research strategies. There will be attempts to supplement traditional outcomes research with studies of elements of treatment which are not easily amenable to quantification. Here the theoretical neutrality of grounded theory (ironically) is likely to make for easier integration than the more theoretically developed discourse perspectives. The sheer difficulty of blending qualitative and quantitative work should not be underestimated: research that has attempted this has often found severe problems (see Mason, 1994, for a discussion).

The final prediction is that there will be an increased focus on clinical work practices embodied within settings such as clinics and networks of professional and lay relationships. Here the richness of ethnographic work will be drawn on, but increasingly the conversation analytic discipline of working with video and transcript will replace field notes and recollections. Such work will have the effect of respecifying some of the basic problems of clinical research. Its broader significance, however, may depend on the course of wider debates in Psychology over the development and success of the cognitive paradigm and whether it will have a discursive and interaction based successor.

ACKNOWLEDGMENTS

I would like to thank Alan Bryman, Karen Henwood, Alexa Hepburn, Celia Kitzinger, and Denis Salter for making helpful comments on an earlier draft of this chapter.

3.06.6 REFERENCES

Anderson, H., & Goolishian, H. A. (1988). Human systems as linguistic systems: Preliminary and evolving ideas about the implications for clinical theory. Family Process, 27, 371–393.
Antaki, C. (Ed.) (1988). Analysing everyday explanation. London: Sage.
Antaki, C. (1996). Review of The talk of the clinic. Journal of Language and Social Psychology, 15, 176–181.
Antaki, C., & Rapley, M. (1996). "Quality of life" talk: The liberal paradox of psychological testing. Discourse and Society, 7, 293–316.
Argyris, C., Putnam, R., & Smith, D. M. (1985). Action Science: Concepts, methods, and skills for research and intervention. San Francisco: Jossey-Bass.
Aronsson, K., & Cederborg, A.-C. (1996). Coming of age in family therapy talk: Perspective setting in multiparty problem formulations. Discourse Processes, 21, 191–212.
Ashmore, M., Mulkay, M., & Pinch, T. (1989). Health and efficiency: A sociological study of health economics. Milton Keynes, UK: Open University Press.
Atkinson, J. M. (1978). Discovering suicide: Studies in the social organization of sudden death. London: Macmillan.
Atkinson, P. (1990). The ethnographic imagination: The textual construction of reality. London: Routledge.
Bannister, P., Burman, E., Parker, I., Taylor, M., & Tindall, C. (1994). Qualitative methods in psychology: A research guide. Buckingham, UK: Open University Press.
Beach, W. A. (1995). Preserving and constraining options: "Okays" and "official" priorities in medical interviews. In G. H. Morris & R. J. Chenail (Eds.), The talk of the clinic: Explorations in the analysis of medical and therapeutic discourse (pp. 259–289). Hillsdale, NJ: Erlbaum.
Bergmann, J. R. (1992). Veiled morality: Notes on discretion in psychiatry. In P. Drew & J. Heritage (Eds.), Talk at work: Interaction in institutional settings (pp. 137–162). Cambridge, UK: Cambridge University Press.
Billig, M. (1988). Methodology and scholarship in understanding ideological explanation. In C. Antaki (Ed.), Analysing everyday explanation: A casebook of methods (pp. 199–215). London: Sage.
Billig, M. (1996). Arguing and thinking: A rhetorical approach to social psychology (2nd ed.). Cambridge, UK: Cambridge University Press.
Billig, M. (1998). Dialogic repression and the Oedipus Complex: Reinterpreting the Little Hans case. Culture and Psychology, 4, 11–47.
Bogdan, R., & Taylor, S. J. (1975). Introduction to qualitative research methods: A phenomenological approach to social sciences. New York: Wiley.
Brewer, J. D., & Yearley, S. (1989). Stigma and conversational competence: A conversation-analytic study of the mentally handicapped. Human Studies, 12, 97–115.
Bryman, A. (1988). Quantity and quality in social research. London: Unwin Hyman.
Bryman, A., & Burgess, R. G. (Eds.) (1994). Analyzing qualitative data. London: Routledge.
Burman, E. (1995). Identification, subjectivity, and power in feminist psychotherapy. In J. Siegfried (Ed.), Therapeutic and everyday discourse as behaviour change: Towards a micro-analysis in psychotherapy process research (pp. 469–490). Norwood, NJ: Ablex.
Burman, E., Mitchell, S., & Salmon, P. (Eds.) (1996). Changes: An International Journal of Psychology and Psychotherapy (special issue on qualitative methods), 14, 175–243.
Buttny, R. (1996). Clients' and therapists' joint construction of the clients' problems. Research on Language and Social Interaction, 29, 125–153.
Buttny, R., & Jensen, A. D. (1995). Telling problems in an initial family therapy session: The hierarchical organization of problem-talk. In G. H. Morris & R. J. Chenail (Eds.), The talk of the clinic: Explorations in the analysis of medical and therapeutic discourse (pp. 19–48). Hillsdale, NJ: Erlbaum.
Button, G., & Lee, J. R. E. (Eds.) (1987). Talk and social organization. Clevedon, UK: Multilingual Matters.
Chalmers, A. (1992). What is this thing called science?: An assessment of the nature and status of science and its methods (2nd ed.). Milton Keynes, UK: Open University Press.
Charmaz, K. (1991). Good days, bad days: The self in chronic illness and time. New Brunswick, NJ: Rutgers University Press.
Charmaz, K. (1994). Identity dilemmas of chronically ill men. The Sociological Quarterly, 35, 269–288.
Charmaz, K. (1995). Grounded theory. In J. A. Smith, R. Harré, & L. van Langenhove (Eds.), Rethinking methods in psychology (pp. 27–49). London: Sage.
Cheston, R. (1996). Stories and metaphors: Talking about the past in a psychotherapy group for people with dementia. Ageing and Society, 16, 579–602.
Clegg, J. A., Standen, P. J., & Jones, G. (1996). Striking the balance: A grounded theory analysis of staff perspectives. British Journal of Clinical Psychology, 35, 249–264.
Coffey, A., & Atkinson, P. (1996). Making sense of qualitative data: Complementary research strategies. London: Sage.
Collins, H. M. (1974). The TEA Set: Tacit knowledge and scientific networks. Science Studies, 4, 165–186.
Collins, H. M. (Ed.) (1981). Knowledge and controversy: Studies of modern natural science. Social Studies of Science (special issue), 11.
Collins, H. M. (1983). The meaning of lies: Accounts of action and participatory research. In G. N. Gilbert & P. Abell (Eds.), Accounts and action (pp. 69–76). Aldershot, UK: Gower.
Collins, H. M. (1985). Changing order: Replication and induction in scientific practice. London: Sage.
Collins, H. M., & Pinch, T. (1993). The Golem: What everyone should know about science. Cambridge, UK: Cambridge University Press.
Conway, M. (Ed.) (1992). Developments and debates in the study of human memory (special issue). The Psychologist, 5, 439–461.
Coulter, J. (1990). Mind in action. Cambridge, UK: Polity.
Czyzewski, M. (1995). "Mm hm" tokens as interactional devices in the psychotherapeutic intake interview. In P. ten Have & G. Psathas (Eds.), Situated order: Studies in the social organization of talk and embodied activities (pp. 73–89). Washington, DC: International Institute for

Ethnomethodology and Conversation Analysis & University Press of America.
Davis, K. (1986). The process of problem (re)formulation in psychotherapy. Sociology of Health and Illness, 8, 44–74.
Denzin, N. K., & Lincoln, Y. S. (Eds.) (1994). Handbook of qualitative research. London: Sage.
Drew, P. (1995). Conversation analysis. In J. Smith, R. Harré, & L. van Langenhove (Eds.), Rethinking methods in psychology (pp. 64–79). London: Sage.
Drew, P., & Heritage, J. (Eds.) (1992a). Talk at work: Interaction in institutional settings. Cambridge, UK: Cambridge University Press.
Drew, P., & Heritage, J. (1992b). Analyzing talk at work: An introduction. In P. Drew & J. Heritage (Eds.), Talk at work: Interaction in institutional settings (pp. 3–65). Cambridge, UK: Cambridge University Press.
Edwards, D. (1995). Two to tango: Script formulations, dispositions, and rhetorical symmetry in relationship troubles talk. Research on Language and Social Interaction, 28, 319–350.
Edwards, D. (1997). Discourse and cognition. London: Sage.
Edwards, D., & Potter, J. (1992a). Discursive psychology. London: Sage.
Edwards, D., & Potter, J. (1992b). The chancellor's memory: Rhetoric and truth in discursive remembering. Applied Cognitive Psychology, 6, 187–215.
Edwards, D., & Potter, J. (1993). Language and causation: A discursive action model of description and attribution. Psychological Review, 100, 23–41.
Ellen, R. F. (1984). Ethnographic research: A guide to general conduct. London: Academic Press.
Ellis, C., Kiesinger, C., & Tillmann-Healy, L. M. (1997). Interactive interviewing: Talking about emotional experience. In R. Hertz (Ed.), Reflexivity and Voice. Thousand Oaks, CA: Sage.
Essex, M., Estroff, S., Kane, S., McLanahan, J., Robbins, J., Dresser, R., & Diamond, R. (1980). On Weinstein's "Patient attitudes toward mental hospitalization: A review of quantitative research." Journal of Health and Social Behaviour, 21, 393–396.
Fetterman, D. M. (1989). Ethnography: Step by step. London: Sage.
Fielding, N. (1993). Ethnography. In N. Gilbert (Ed.), Researching social life (pp. 154–171). London: Sage.
Freud, S. (1977). Case histories. I: "Dora" and "Little Hans." London: Penguin.
Frosh, S., Burck, C., Strickland-Clark, L., & Morgan, K. (1996). Engaging with change: A process study of family therapy. Journal of Family Therapy, 18, 141–161.
Gale, J. E. (1991). Conversation analysis of therapeutic discourse: The pursuit of a therapeutic agenda. Norwood, NJ: Ablex.
Gale, J. E. (1996). Conversation analysis: Studying the construction of therapeutic realities. In D. H. Sprenkle & S. M. Moon (Eds.), Research methods in family therapy (pp. 107–124). New York: Guilford.
Gale, J. E., & Newfield, N. (1992). A conversation analysis of a solution-focused marital therapy session. Journal of Marital and Family Therapy, 18, 153–165.
Garfinkel, H. (1967a). "Good" organizational reasons for "bad" clinical records. In H. Garfinkel (Ed.), Studies in ethnomethodology (pp. 186–207). Englewood Cliffs, NJ: Prentice-Hall.
Garfinkel, H. (1967b).

paper "On feminist methodology." Sociology, 26, 213–218.
Gergen, K. J. (1994). Realities and relationships: Soundings in social construction. Cambridge, MA: Harvard University Press.
Gilbert, G. N. (Ed.) (1993). Researching social life. London: Sage.
Gilbert, G. N., & Mulkay, M. (1984). Opening Pandora's box: A sociological analysis of scientists' discourse. Cambridge, UK: Cambridge University Press.
Glaser, B., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. Chicago: Aldine.
Glaser, B., & Strauss, A. L. (1968). Time for dying. Chicago: Aldine.
Goffman, E. (1961). Asylums: Essays on the social situation of mental patients and other inmates. London: Penguin.
Guba, E. G., & Lincoln, Y. S. (1994). Competing paradigms in qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 105–117). London: Sage.
Gubrium, J. F. (1992). Out of control: Family therapy and domestic disorder. London: Sage.
Hammersley, M. (1992). On feminist methodology. Sociology, 26, 187–206.
Hammersley, M., & Atkinson, P. (1995). Ethnography: Principles in practice (2nd ed.). London: Routledge.
Harré, R. (1986). Varieties of realism. Oxford, UK: Blackwell.
Harré, R. (1992). Social being: A theory for social psychology (2nd ed.). Oxford, UK: Blackwell.
Harré, R., & Gillett, G. (1994). The discursive mind. London: Sage.
Heath, C., & Luff, P. (1993). Explicating face-to-face interaction. In N. Gilbert (Ed.), Researching social life (pp. 306–326). London: Sage.
Henriques, J., Hollway, W., Irwin, C., Venn, C., & Walkerdine, V. (1984). Changing the subject: Psychology, social regulation and subjectivity. London: Methuen.
Henwood, K., & Nicolson, P. (Eds.) (1995). Qualitative research methods (special issue). The Psychologist, 8, 109–129.
Henwood, K., & Parker, I. (1994). Qualitative social psychology (special issue). Journal of Community and Applied Social Psychology, 4, 219–223.
Henwood, K., & Pidgeon, N. (1994). Beyond the qualitative paradigm: A framework for introducing diversity within qualitative psychology. Journal of Community and Applied Social Psychology, 4, 225–238.
Henwood, K., & Pidgeon, N. (1995). Grounded theory and psychological research. The Psychologist, 8, 115–118.
Heritage, J. C. (1984). Garfinkel and ethnomethodology. Cambridge, UK: Polity.
Heritage, J. C. (1988). Explanations as accounts: A conversation analytic perspective. In C. Antaki (Ed.), Analysing everyday explanation: A casebook of methods (pp. 127–144). London: Sage.
Heritage, J. C. (1995). Conversation analysis: Methodological aspects. In U. Quasthoff (Ed.), Aspects of oral communication (pp. 391–418). Berlin, Germany: De Gruyter.
Hesse, M. B. (1974). The structure of scientific inference. London: Macmillan.
Hodder, I. (1994). The interpretation of documents and material culture. In N. K. Denzin & Y. S. Lincoln (Eds.),
Methodological adequacy in the Handbook of qualitative research (pp. 395±402). London: quantitative study of selection criteria and selection Sage. practices in psychiatric outpatient clinics. In H. Garfin- Holsti, O. R. (1969). Content analysis for the social sciences kel (Ed.), Studies in ethnomethodology (pp. 208±261). and humanities. Reading, MA: Addison-Wesley. Englewood Cliffs, NJ: Prentice-Hall. Jasanoff, S., Markle, G., Pinch T., & Petersen, J. (Eds.) Garfinkel, H. (1967c). Studies in ethnomethodology. Engle- (1995). Handbook of science and technology studies. wood Cliffs, NJ: Prentice-Hall. London: Sage. Gelsthorpe, L. (1992). Response to Martyn Hammersley's Jefferson, G. (1985). An exercise in the transcription and 142 Qualitative and Discourse Analysis

analysis of laughter. In T. van Dijk (Ed.), Handbook of Mason, J. (1994). Linking qualitative and quantitative data discourse analysis (Vol. 3, pp. 25±34). London: Academic analysis. In A. Bryman & R. G. Burgess (Eds.), Press. Analyzing qualitative data. London: Routledge. Jefferson, G. (1988). On the sequential organization of McNamee, S., & Gergen, K. (Eds) (1992). Therapy as social troubles-talk in ordinary conversation. Social Problems, construction. London: Sage 35, 418±441. Miles, M. B., & Huberman, A. M. (1994). Qualitative data Jefferson, G., & Lee, J. R. E. (1992). The rejection of analysis: An expanded sourcebook (2nd Ed.). London: advice: Managing the problematic convergence of a Sage. ªtroubles-tellingº and a ªservice encounter.º In P. Drew Miller, G., & Dingwall, R. (Ed.) (1997). Context and & J. Heritage (Eds.), Talk at work: Interaction in method in qualitative research. London: Sage institutional settings (pp. 521±548). Cambridge, UK: Miller, G., & Silverman, D. (1995). Troubles talk and Cambridge University Press. counseling discourse: A comparative study. The Socio- Jodelet, D. (1991). Madness and social representations. logical Quarterly, 36, 725±747. London: Harvester/Wheatsheaf. Mischler, E. G. (1986). Research interviewing: Context and Johnson, J. C. (1990). Selecting ethnographic informants. narrative. Cambridge, MA: Harvard University Press. London: Sage. Morgan, D. L. (1997). Focus groups as qualitative research Knorr Cetina, K. (1995). Laboratory studies: The cultural (2nd ed.). London: Sage. approach to the study of science. In S. Jasanoff, G. Morris, G. H., & Chenail, R. J. (Eds.) (1995). The talk of Markle, T. Pinch, & J. Petersen (Eds.), Handbook of the clinic: Explorations in the analysis of medical and science and technology studies. London: Sage. therapeutic discourse. Hillsdale, NJ: Erlbaum. Knorr Cetina, K. (1997). Epistemic cultures: How scientists Moscovici, S. (1984). The phenomenon of social represen- make sense. 
Chicago: Indiana University Press. tations. In R. M. Farr & S. Moscovici (Eds.), Social Krippendorff, K. (1980). Content analysis: An introduction representations (pp. 3±69). Cambridge, UK: Cambridge to its methodology. London: Sage. University Press. Krueger, R. A. (1988). Focus groups: A practical guide for Nelson, C. K. (1994). Ethnomethodological positions on applied research. London: Sage. the use of ethnographic data in conversation analytic Kuhn, T. S. (1977). The essential tension: Selected studies in research. Journal of Contemporary Ethnography, 23, scientific tradition and change. Chicago: University of 307±329. Chicago Press. Newfield, N. A., Kuehl, B. P., Joanning, H. P., & Quinn, Kvale, S. (1996). InterViews: An introduction to qualitative W. H. (1990). A mini ethnography of the family therapy research interviewing. London: Sage. of adolescent drug abuse: The ambiguous experience. Labov, W., & Fanshel, D. (1977). Therapeutic discourse: Alcoholism Treatment Quarterly, 7, 57±80. Psychotherapy as conversation. London: Academic Press. Newfield, N., Sells, S. P., Smith, T. E., Newfield, S., & Lakatos, I. (1970). Falsification and the methodology of Newfield, F. (1996). Ethnographic research methods: scientific research programmes. In I. Lakatos & A. Creating a clinical science of the humanities. In D. H. Musgrave (Eds.), Criticism and the growth of knowledge Sprenkle & S. M. Moon (Eds.), Research Methods in (pp. 91±195). Cambridge, UK: Cambridge University Family Therapy. New York: Guilford. Press. Newfield, N. A., Kuehl, B. P., Joanning, H. P., & Quinn, Latour, B., & Woolgar, S. (1986). Laboratory life: The W. H. (1991). We can tell you about ªPsychosº and construction of scientific facts (2nd ed.). Princeton, NJ: ªShrinksº: An ethnography of the family therapy of Princeton University Press. adolescent drug abuse. In T. C. Todd & M. D. Slekman Lee, J. R. E. (1995). The trouble is nobody listens. In J. 
(Eds.), Family therapy approaches with adolescent sub- Siegfried (Ed.), Therapeutic and everyday discourse as stance Abusers (pp. 277±310). London: Allyn & Bacon. behaviour change: Towards a micro-analysis in psy- Nofsinger, R. (1991). Everyday conversation. London: chotherapy process research (pp. 365±390). Norwood, Sage. NJ: Ablex. Ochs, E. (1979). Transcription as theory. In E. Ochs & B. Levinson, S. (1983). Pragmatics. Cambridge, UK: Cam- Schieffelin (Eds.), Developmental pragmatics (pp. 43±47). bridge University Press. New York: Academic Press. Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Ochs, E., Schegloff, E., & Thompson, S. A. (Eds.) (1996). London: Sage. Interaction and grammar. Cambridge, UK: Cambridge Lofland, J., & Lofland, L. H. (1984). Analyzing social University Press. settings: A guide to qualitative observation and analysis. Olesen, V. (1994), Feminisms and models of qualitative Belmont, CA: Wadsworth. research. In N. K. Denzin & Y. S. Lincoln (Eds.), Lynch, M. (1994). Representation is overrated: Some Handbook of qualitative research (pp. 158±174). London: critical remarks about the use of the concept of Sage. representation in science studies. Configurations: A PeraÈ kylaÈ , A. (1995). AIDS counseling: Institutional inter- Journal of Literature, Science and Technology, 2, action and clinical practice. Cambridge, UK: Cambridge 137±149. University Press. Madigan, S. P. (1992). The application of Michel Pettinger, R. E., Hockett, C. F., & Danehy, J. J. (1960). Foucault's philosophy in the problem externalizing The first five minutes: A sample of microscopic interview discourse of Michael White. Journal of Family Therapy, analysis. Ithica, NY: Paul Martineau. 14, 265±279. Pidgeon, N. (1996). Grounded theory: Theoretical back- Madill, A., & Barkham, M. (1997). Discourse analysis of a ground. In J. T. E. 
Richardson (Ed.), Handbook of theme in one successful case of brief psychodynamic- qualitative research methods for psychology and the social interpersonal psychotherapy. Journal of Counselling sciences (pp. 75±85). Leicester, UK: British Psychological Psychology, 44, 232±244. Society. Malson, H., & Ussher, J. M. (1996). Bloody women: A Pidgeon, N., & Henwood, K. (1996). Grounded theory: discourse analysis of amenorrhea as a symptom of Practical implementation. In J. T. E. Richardson (Ed.), anorexia nervosa. Feminism and Psychology, 6, 505±521. Handbook of qualitative research methods for psychology Manning, P. K., & Cullum-Swan, B. (1994) Narrative, and the social sciences (pp. 86±101). Leicester, UK: content and semiotic analysis. In N. K. Denzin & Y. S. British Psychological Society. Lincoln (Eds.), Handbook of qualitative research Piercy F. P., & Nickerson, V. (1996). Focus groups in (pp. 463±477). London: Sage. family therapy research. In D. H. Sprenkle & S. M. References 143

Moon (Eds.), Research methods in family therapy Richardson, J. E. (Ed.) (1996). Handbook of qualitative (pp. 173±185). New York: Guilford. research methods for psychology and the social sciences. Plummer, K. (1995). Life story research. In J. A. Smith, R. Leicester, UK: British Psychological Society. Harre , & L. van Langenhove (Eds.), Rethinking methods Rogers, C. R. (1942). The use of electrically recorded in psychology (pp. 50±63). London: Sage. interviews in improving psychotherapeutic techniques. Polanyi, M. (1958). Personal knowledge. London: Rout- American Journal of Orthopsychiatry, 12, 429±434. ledge and Kegan Paul. Ronai, C. R. (1996). My mother is mentally retarded. In C. Pollner, M., & Wikler, L. M. (1985). The social construc- Ellis & A. P. Bochner (Eds.), Composing ethnography: tion of unreality: A case study of a family's attribution of Alternative forms of qualitative writing. Walnut Creek, competence to a severely retarded child. Family Process, CA: AltaMira Press. 24, 241±254. Rose, D. (1990). Living the ethnographic life. London: Sage. Pomerantz, A. M. (1980). Telling my side: ªlimited accessº Rosenhan, D. L. (1973). On being sane in insane places. as a fishing device. Sociological Inquiry, 50, 186±198. Science, 179, 250±258. Popper, K. (1959). The logic of scientific discovery. London: Sacks, H. (1992). Lectures on conversation. (Vols. I & II). Hutchinson. Oxford, UK: Blackwell. Potter, J. (1996a). Representing reality: Discourse, rhetoric Schegloff, E. A. (1991). Reflections on talk and social and social construction. London: Sage. structure. In D. Boden & D. H. Zimmerman (Eds.), Talk Potter, J. (1996b). Discourse analysis and constructionist and social structure (pp. 44±70). Berkeley, CA: Uni- approaches: Theoretical background. In J. T. E. versity of California Press. Richardson (Ed.), Handbook of qualitative research Schegloff, E. A. (1993). Reflections on quantification in the methods for psychology and the social sciences. 
Leicester, study of conversation. Research on Language and Social UK: British Psychological Society. Interaction, 26, 99±128. Potter, J. (1997). Discourse analysis as a way of analysing Schegloff, E. A. (1997) Whose text? Whose Context? naturally occurring talk. In D. Silverman (Ed.), Quali- Discourse and Society, 8, 165±187. tative research: Theory, method and practice Scott, J. (1990). A matter of record: Documentary sources in (pp. 144±160). London: Sage. social research. Cambridge, UK: Polity. Potter, J., & Mulkay, M. (1985). Scientists' interview talk: Siegfried, J. (Ed.) (1995). Therapeutic and everyday Interviews as a technique for revealing participants' discourse as behaviour change: Towards a micro-analysis interpretative practices. In M. Brenner, J. Brown, & D. in psychotherapy process research. Norwood, NJ: Ablex. Canter (Eds.), The research interview: Uses and ap- Silverman, D. (1993). Interpreting qualitative data: Methods proaches (pp. 247±271). London: Academic Press. for analysing talk, text and interaction. London: Sage. Potter, J., & Wetherell, M. (1987). Discourse and social Silverman, D. (Ed.) (1997a). Qualitative research: Theory, psychology: Beyond attitudes and behaviour. London: method and practice. London: Sage. Sage. Silverman, D. (1997b). Discourses of counselling: HIV Potter, J., & Wetherell, M. (1994) Analyzing discourse. In counselling as social interaction. London: Sage. A. Bryman & B. Burgess (Eds.), Analyzing qualitative Smith, L. M. (1994). Biographical method. In N. K. data. London: Routledge. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative Potter, J., & Wetherell, M. (1995). Discourse analysis. In J. research (pp. 286±305) London: Sage. Smith, R. Harre , & L. van Langenhove (Eds.), Rethink- Smith, J. A. (1995). Repertory grids: An interactive, case- ing methods in psychology (pp. 80±92). London: Sage. study perspective. In J. A. Smith, R. Harre , & L. van Potter, J., Wetherell, M., & Chitty, A. (1991). 
Quantifica- Langehove (Eds.), Rethinking methods in psychology tion rhetoricÐcancer on television. Discourse and (pp. 162±177). London: Sage. Society, 2, 333±365. Smith, J. A., Harre , R., & van Langenhove, L. (Eds.) Psathas, G., & Anderson, T. (1990). The ªpracticesº of (1995). Rethinking methods in psychology. London: Sage. transcription in conversation analysis. Semiotica, 78, Soal, J., & Kottler, A. (1996). Damaged, deficient or 75±99. determined? Deconstructing narratives in family ther- Rachel, J. (1996). Ethnography: Practical implementation. apy. South African Journal of Psychology, 26, 123±134. In J. T. E. Richardson (Ed.), Handbook of qualitative Soyland, A. J. (1994). Functions of a psychiatric case- research methods for psychology and the social sciences summary. Text, 14, 113±140. (pp. 113±124). Leicester, UK: British Psychological Soyland, A. J. (1995). Analyzing therapeutic and profes- Society. sional discourse. In J. Siegfried (Ed.), Therapeutic and Rafuls, S. E., & Moon, S. M. (1996). Grounded theory everyday discourse as behaviour change: Towards a micro- methodology in family therapy research. In D. H. analysis in psychotherapy process research (pp. 277±300). Sprenkle & S. M. Moon (Eds.), Research methods in Norwood, NJ: Ablex. family therapy (pp. 64±80). New York: Guilford. Stainton Rogers, R. (1995). Q methodology. In J. A. Smith, Ramazanoglu, C. (1992). On feminist methodology: Male R. Harre , & L. van Langenhove (Eds.), Rethinking reason versus female empowerment. Sociology, 26, methods in psychology (p. 178±192). London: Sage. 207±212. Strauss, A. L., & Corbin, J. (1994). Grounded theory Rapley, M., & Antaki, C. (1996). A conversation analysis methodology: An overview. In N. K. Denzin, & Y. S. of the ªacquiescenceº of people with learning disabilities. Lincoln (Eds.), Handbook of qualitative research Journal of Community and Applied Social Psychology, 6, (pp. 273±285). London: Sage. 207±227. Toren, C. (1996). 
Ethnography: Theoretical background. Reason, P., & Heron, J. (1995). Co-operative inquiry. In J. A. In J. T. E. Richardson (Ed.), Handbook of qualitative Smith, R. Harre , & L. van Langenhove (Eds.), Rethinking research methods for psychology and the social sciences methods in psychology (pp. 122±142). London: Sage. (pp. 102±112). Leicester, UK: British Psychological Reason, P., & Rowan, J. (Eds.) (1981). Human inquiry: A Society. sourcebook of new paradigm research. Chichester, UK: Turner, B. A. (1994). Patterns of crisis behaviour: A Wiley. qualitative inquiry. In A. Bryman & R. G. Burgess Reinharz, S. (1992). Feminist methods in social research. (Eds.), Analyzing qualitative data (pp. 195±215). London: New York: Oxford University Press. Routledge. Rennie, D., Phillips, J., & Quartaro, G. (1988). Grounded Turner, B. A., & Pidgeon, N. (1997). Manmade disasters theory: A promising approach to conceptualisation in (2nd ed.). Oxford, UK: Butterworth-Heinemann. psychology? Canadian Psychology, 29, 139±150. Watson, R. (1995). Some potentialities and pitfalls in the 144 Qualitative and Discourse Analysis

analysis of process and personal change in counseling Widdicombe, S., & Wooffitt, R. (1995). The language of and therapeutic interaction. In J. Siegfried (Ed.), youth subcultures: Social identity in action. Hemel Therapeutic and everyday discourse as behaviour change: Hempstead, UK: Harvester/Wheatsheaf. Towards a micro-analysis in psychotherapy process Wieder, D. L. (Ed.) (1993). Colloquy: On issues of research (pp. 301±339). Norwood, NJ: Ablex. quantification in conversation analysis. Research on Weinstein, R. M. (1979). Patient attitudes toward mental Language and Social Interaction, 26, 151±226. hospitalization: A review of quantitative research. Wilkinson, S. (1986). Introduction. In S. Wilkinson (Ed.), Journal of Health and Social Behavior, 20, 237±258. Feminist social psychology (pp. 1±6). Milton Keynes, Weinstein, R. M. (1980). The favourableness of patients' UK: Open University Press. attitudes toward mental hospitalization. Journal of Wooffitt, R. (1993). Analysing accounts. In N. Gilbert Health and Social Behavior, 21, 397±401. (Ed.), Researching social life (pp. 287±305). London: Werner,O.,&Schoepfle,G.M.(1987).Systematic Sage. fieldwork: Foundations of ethnography and interviewing Woolgar, S. (1988). Science: The very idea. London: (Vol. 1). London: Sage. Tavistock. Wetherell, M. (1996). Fear of fat: Interpretative repertoires Wootton, A. (1989). Speech to and from a severely and ideological dilemmas. In J. Maybin & N. Mercer retarded young Down's syndrome child. In M. Bever- (Eds.), Using English: From conversation to canon idge, G. Conti-Ramsden, & I. Leudar (Eds.), The (pp. 36±41). London: Routledge. language and communication of mentally handicapped White, M., & Epston, D. (1990). Narrative means to people (pp. 157±184). London: Chapman-Hall. therapeutic ends. New York: Norton. Yardley, K. (1995). Role play. In J. A. Smith, R. Harre & Whyte, W. F. (1991). Participatory action research. L. van Langenhove (Eds.), Rethinking methods in London: Sage. 
psychology (pp. 106±121). London: Sage. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.07 Personality Assessment

THOMAS A. WIDIGER and KIMBERLY I. SAYLOR University of Kentucky, Lexington, KY, USA

3.07.1 INTRODUCTION
3.07.2 SELF-REPORT INVENTORIES
  3.07.2.1 Item Analyses
  3.07.2.2 Gender, Ethnic, and Cultural Differences
  3.07.2.3 Individual Differences
  3.07.2.4 Response Distortion
  3.07.2.5 Automated Assessment and Base Rates
  3.07.2.6 Illustrative Instruments
    3.07.2.6.1 Minnesota Multiphasic Personality Inventory-2
    3.07.2.6.2 NEO Personality Inventory-Revised
3.07.3 SEMISTRUCTURED INTERVIEWS
  3.07.3.1 Personality, Mood States, and Mental Disorders
  3.07.3.2 Dissimulation and Distortion
  3.07.3.3 Intersite Reliability
  3.07.3.4 Self and Informant Ratings
  3.07.3.5 Convergent Validity Across Different Interviews
  3.07.3.6 Illustrative Instruments
    3.07.3.6.1 SIDP-IV
    3.07.3.6.2 PDI-IV
3.07.4 PROJECTIVE TECHNIQUES
  3.07.4.1 Rorschach
  3.07.4.2 Thematic Apperception Test
3.07.5 CONCLUSIONS
3.07.6 REFERENCES

3.07.1 INTRODUCTION

Most domains of psychology have documented the importance of personality traits to adaptive and maladaptive functioning, including the fields of behavioral medicine (Adler & Matthews, 1994), psychopathology (Watson, Clark, & Harkness, 1994; Widiger & Costa, 1994), and industrial-organizational psychology (Hogan, Curphy, & Hogan, 1994). The assessment of personality is perhaps fundamental to virtually all fields of applied psychology. However, there are as many as 4500 different personality traits, or at least 4500 different trait terms within the English language (Goldberg, 1990), and there might be almost as many personality assessment instruments. There are instruments for the assessment of individual traits (e.g., tender-mindedness), for collections of traits (e.g., the domain of agreeableness, which includes tender-mindedness, trust, straightforwardness, altruism, compliance, and modesty), for constellations of traits (e.g., the personality syndrome of psychopathy, which includes such traits as arrogance, superficial charm, impulsivity, callousness, deception, irresponsibility, and low empathy), and for traits identified by theorists for which there might not yet be a specific term within the language (e.g., extratensive experience balance; Exner, 1993).

There are also different methods for the assessment of personality traits. The primary methods used in personality research are self-report inventories, semistructured interviews, and projective techniques. Self-report inventories consist of written statements or questions, to which a person responds in terms of a specified set of options (e.g., true vs. false or agree vs. disagree along a five-point scale). Semistructured interviews consist of a specified set of verbally administered questions, accompanied by instructions (or at least guidelines) for follow-up questions and for the interpretation and scoring of responses. Projective techniques consist of relatively ambiguous stimuli or prompts, the possible responses to which are largely open-ended, accompanied by instructions (or at least guidelines) for their scoring or interpretation. The distinctions among these three methods, however, are somewhat fluid. A self-report inventory is equivalent to a fully structured, self-administered interview; semistructured interviews are essentially verbally administered self-report inventories that include at least some open-ended questions; and fully structured interviews are essentially verbally administered self-report inventories.

Each of the methods will be considered below, including a discussion of issues relevant to or problematic for that respective method. However, the issues discussed for one method of assessment will apply to another. Illustrative instruments for each method are also presented.

3.07.2 SELF-REPORT INVENTORIES

The single most popular method for the assessment of personality by researchers of normal personality functioning is a self-report inventory (SRI). The advantages of SRIs, relative to semistructured interviews and projective techniques, are perhaps self-evident. They are substantially less expensive and time-consuming to administer. Data on hundreds of persons can be obtained, scored, and analyzed at relatively little cost to the researcher. Their inexpensive nature allows collection of vast amounts of normative data that are unavailable for most semistructured interviews and projective techniques. These normative data in turn facilitate substantially their validation, as well as their interpretation by and utility to researchers and clinicians.

The high degree of structure of SRIs has also contributed to much better intersite reliability than is typically obtained by semistructured interviews or projective techniques. The findings of SRIs are very sensitive to the anxious, depressed, or angry mood states of respondents, contributing at times to poor test–retest reliability (discussed further below). However, the correlation between two SRIs is much more likely to be consistent across time and across research sites than the correlation between two semistructured interviews or two projective techniques. SRIs might be more susceptible to mood state distortions than semistructured interviews, but this susceptibility may itself be observed more reliably across different studies than the lack of susceptibility of semistructured interviews.

The specific and explicit nature of SRIs has also been very useful in researching and understanding the source and nature of subjects' responses. Much more is known about the effects of different item formats, length of scales, demographic variables, base rates, response sets, and other moderating variables from the results of SRIs than from semistructured interviews or projective techniques. Five issues to be discussed below are (i) item analyses; (ii) gender, ethnic, and cultural differences; (iii) individual differences; (iv) response distortion; and (v) automated assessments.

3.07.2.1 Item Analyses

There are a variety of methods for constructing, selecting, and evaluating the items to be included within an SRI (Clark & Watson, 1995). However, it is useful to highlight a few points within this chapter as such analyses are of direct importance to personality assessment research.

An obvious method for item construction, often termed the rational approach, is to construct items that describe explicitly the trait being assessed. For example, the most frequently used SRI in clinical research for the assessment of personality disorders is the Personality Diagnostic Questionnaire-4 (PDQ-4; Hyler, 1994). Its items were written to inquire explicitly with respect to each of the features of the 10 personality disorders included in the American Psychiatric Association's (APA) Diagnostic and statistical manual of mental disorders (DSM-IV; APA, 1994). For example, the dependent personality criterion, "has difficulty expressing disagreement with others because of fear of loss of support or approval" (APA, 1994, p. 668), is assessed by the PDQ-4 item, "I fear losing the support of others if I disagree with them" (Hyler, 1994, p. 4).

The content validity of an SRI is problematic to the extent that any facets of the trait being assessed are not included or are inadequately represented, items representing other traits are included, or the set provides a disproportionately greater representation of one facet of the trait relative to another (Haynes, Richard, & Kubany, 1995). "The exact phrasing of items can exert a profound influence on the construct that is actually measured" (Clark & Watson, 1995, p. 7), yet the content of items often receives little consideration in personality assessment research. Widiger, Williams, Spitzer, and Frances (1985) concluded from a systematic content analysis of the Millon Clinical Multiaxial Inventory (MCMI; Millon, 1983) that the MCMI failed to represent adequately many significant features of the antisocial personality disorder: "Many of the [personality] traits are not sampled at all (e.g., inability to sustain consistent work behavior, lack of ability to function as a responsible parent, inability to maintain an enduring attachment to a sexual partner, failure to plan ahead or impulsivity, and disregard for the truth), including an essential requirement of the DSM-III antisocial criteria to exhibit significant delinquent behavior prior to the age of 15" (Widiger et al., 1985, p. 375). Many MCMI-III items are written in part to represent an alternative theoretical formulation for the DSM-IV personality syndromes (Millon et al., 1996).

High content validity, however, will not ensure a valid assessment of a respective personality trait. Although many of the PDQ-4 dependency items do appear to represent adequately their respective diagnostic criteria, it is difficult to anticipate how an item will actually perform when administered to persons with varying degrees of dependency and to persons with varying degrees of other personality traits, syndromes, or demographic characteristics. Most authors of SRIs also consider the findings of internal consistency analyses (Smith & McCarthy, 1995). "Currently, the single most widely used method for item selection in scale development is some form of internal consistency analysis" (Clark & Watson, 1995, p. 313).

Presumably, an item should correlate more highly with other items from the same scale (e.g., dependent personality traits) than with items from another scale (e.g., borderline personality traits), consistent with the basic principles of convergent and discriminant validity. However, these assumptions hold only for personality syndromes that are homogeneous in content and that are distinct from other syndromes, neither of which may be true for syndromes that involve overlapping constellations of personality traits (Pilkonis, 1997; Shea, 1995). Confining a scale to items that correlate highly with one another can result in an overly narrow assessment of a construct, and deleting items that correlate highly with other scales can result in false distinctions and a distorted representation of the trait being assessed (Clark & Watson, 1995; Smith & McCarthy, 1995). For example, the APA (1994) criteria set for the assessment of antisocial personality disorder does not include such items as lacks empathy and arrogant self-appraisal that are included in alternative criteria sets for this personality syndrome (Hare, 1991), in part because these items are already contained within the DSM-IV criteria set for the narcissistic personality disorder. Their inclusion within the criteria set for antisocial personality disorder would complicate the differentiation of the antisocial and narcissistic personality syndromes (Gunderson, 1992), but the failure to include these items may also provide an inadequate description and assessment of antisocial (psychopathic) personality traits (Hare, Hart, & Harpur, 1991).

Overlapping scales, on the other hand, have their own problems, particularly if they are to be used to test hypotheses regarding the relationships among or differences between the traits and syndromes being assessed (Helmes & Reddon, 1993). For example, the MCMI-III personality scales overlap substantially (Millon, Millon, & Davis, 1994). Scale overlap was in part a pragmatic necessity of assessing 14 personality disorders and 10 clinical syndromes with no more than 175 items. However, this overlap was also intentional to ensure that the scales be consistent with the overlap of the personality constructs being assessed. "Multiple keying or item overlapping for the MCMI inventories was designed to fit its theory's polythetic structure, and the scales constructed to represent it" (Millon, 1987, p. 130). However, the MCMI-III cannot then be used to assess the validity of this polythetic structure, or to test hypotheses concerning the covariation, differentiation, or relationship among the personality constructs, as the findings will be compelled by the scale overlap (Helmes & Reddon, 1993). For example, the MCMI-III provides little possibility for researchers to fail to confirm a close relationship between, or comparable findings for, sadistic and antisocial personality traits, given that eight of the 17 antisocial items (47%) are shared with the sadistic scale. The same point can be made for studies using the Minnesota Multiphasic Personality Inventory-2 (MMPI-2) antisociality and cynicism scales, as they share approximately a third of their items (Helmes & Reddon, 1993). It was for this reason that Morey, Waugh, and Blashfield (1985) developed both overlapping and nonoverlapping MMPI personality disorder scales.

Scale homogeneity and interscale distinctiveness are usually emphasized in SRIs constructed through factor analyses (Clark & Watson, 1995). For example, in the construction of the NEO Personality Inventory-Revised (NEO-PI-R), Costa and McCrae (1992) "adopted factor analysis as the basis for item selection because it identifies clusters of items that covary with each other but which are relatively independent of other item clusters – in other words, items that show convergent validity with respect to other items in the cluster and divergent validity with respect to other items [outside the cluster]" (p. 40).

Reliance upon factor analysis for scale construction has received some criticism (Block, 1995; Millon et al., 1996) and there are, indeed, instances in which factor analyses have been conducted with little appreciation of the limitations of the approach (Floyd & Widaman, 1995), but this is perhaps no different than for any other statistical technique. Factor analysis remains a powerful tool for identifying underlying, latent constructs, for data reduction, and for the validation of an hypothesized dimensional structure (Smith & McCarthy, 1995; Watson et al., 1994).

vergent, and discriminant validity (Ozer & Reise, 1994). Illustrative studies with SRIs include Ben-Porath, McCully, and Almagor (1993), Clark, Livesley, Schroeder, and Irish (1996), Lilienfeld and Andrews (1996), Robins and John (1997), and Trull, Useda, Costa, and McCrae (1995).

The MMPI-2 clinical scales exemplify the purely empirical approach to scale construction. "The original MMPI (Hathaway & McKinley, 1940) was launched with the view that the content of the items was relatively unimportant and what actually mattered was whether the item was endorsed by a particular clinical group" (Butcher, 1995, p. 302). This allowed the test to be "free from the restriction that the subject must be able to describe his own behavior accurately" (Meehl, 1945, p. 297). For example, one of the items on the MMPI hysteria scale was "I enjoy detective or mystery stories" (keyed false; Hathaway & McKinley, 1982).

Indicating that one does not enjoy detective or mystery stories does not appear to have any obvious (or perhaps even meaningful) relationship to the presence of the mental disorder of hysteria, but answering false did correlate significantly with the occurrence of this disorder. "The literal content of the stimulus (the item) is entirely unimportant, and even irrelevant and potentially misleading. Sophisticated
In any case, the psychometric use of test items dictates that the construction of the NEO-PI-R through factor test interpreter ignore item content altogether analysis was consistent with the theoretical lest he or she be misledº (Ben-Porath, 1994, model for the personality traits being assessed p. 364). Limitations to this approach, however, (Clark & Watson, 1995; Costa & McCrae, 1992; are suggested below. Goldberg, 1990). Additional illustrations of A sophisticated approach to item analysis is theoretically driven factor analytic scale con- item response theory (IRT; Hambleton, Swa- structions are provided by the Dimensional minathan, & Rogers, 1991). Its most unique Assessment of Personality PathologyÐBasic datum is the probability of a response to a Questionnaire (DAPP-BQ; Livesley & Jack- particular item given a particular level of the son, in press), the Schedule for Nonadaptive personality trait being assessed. IRT analyses and Adaptive Personality (SNAP; Clark, 1993), demonstrate graphically how items can vary in the Interpersonal Adjective Scales (IAS; Wig- their discriminative ability at different levels of gins & Trobst, 1997), the Multidimensional the trait being assessed (assuming that the Personality Questionnaire (MPQ; Tellegen & anchoring items are not themselves system- Waller, in press), and the Temperament and atically biased). It is not the case that all items Character Inventory (TCI; Cloninger & perform equally well, nor does any particular Svrakic, 1994). item perform equally well at all levels of the The most informative item analyses will be trait. 
It may be the case that none of the items on correlations with external validators of the a scale provide any discrimination at particular personality traits, including the ability of items levels of a trait, or that the scale is predominated to identify persons with the trait in question, by items that discriminate at high rather than correlations with hypothesized indicators of moderate or low levels of the trait. For example, the trait, and an absence of correlations with it is possible that the MMPI-2 items to assess the variables that are theoretically unrelated to the personality domain of neuroticism are weighted trait (Smith & McCarthy, 1995). These data toward a discrimination of high levels of typically are discussed under a general heading neuroticism, providing very little discrimination of construct validity, including data concerning at low levels. IRT analyses might allow items' concurrent, postdictive, predictive, con- researchers to maximize the sensitivity of a test Self-report Inventories 149 to a particular population (e.g., inpatient psychopathic. A male who has more traits (or hospital vs. prison setting) by deleting items symptoms) of could be described that work poorly within certain settings (Em- by the MMPI-2 as being less psychopathic than bretson, 1996). IRT analyses have been used a female with the same number of traits. The widely with tests of abilities and skills, and are separate norms provided for males and females now being applied to measures of personality on SRIs are never so substantial as to eliminate (Ozer & Reise, 1994). Illustrative applications entirely the differences between the sexes, but include analyses of items from the MPQ the rationale for reducing or minimizing any (Tellegen & Waller, in press) by Reise and differences is unclear. 
Waller (1993), items from a measure of The provision of separate norms for males alexithymia by Hendryx, Haviland, Gibbons, and females by SRIs is inconsistent with other and Clark (1992) and items from a psychopathy domains of assessment, including most semi- semistructured interview by Cooke and Michie structured interview and projective assessments (1997). of personality variables. For example, the same threshold is used for males and females by the 3.07.2.2 Gender, Ethnic, and Cultural most commonly used semistructured interview Differences for the assessment of psychopathy, the (revised) Psychopathy Checklist (Hare, 1991). Separate Differences among gender, ethnic, cultural, norms are not provided for females and males in and other demographic populations with re- the assessment of intelligence, nor are they spect to personality traits are often socially provided for different cultural and ethnic sensitive, politically controversial, and difficult groups in the assessment of personality. to explain (Eagly, 1995). There is substantial Although statistically significant differences SRI research to indicate differences, on average, have also been obtained across ethnic and between males and females for a wide variety of cultural groups for SRI personality measures, personality traits, as assessed by research with ªnone of the major personality measures . . . SRIs (Sackett & Wilk, 1994). For example, offered norm scoring based on race or ethnicity males obtain higher scores on measures of either as a routine aspect of the scoring system assertiveness and self-esteem; whereas females or as a scoring optionº (Sackett & Wilk, 1994, obtain higher scores on measures of anxious- p. 947). It is unclear whether separate norms ness, trust, gregariousness, and tender-mind- should be provided for ethnic groups, but it does edness (Feingold, 1994). appear to be inconsistent to provide separate Separate norms therefore have been devel- norms for gender and not for ethnicity. 
oped for the interpretation of most SRI Separate norms would be appropriate if there personality scales (e.g., Costa & McCrae, is reason to believe that the SRI items are biased 1992; Graham, 1993; Millon et al., 1994). against a particular gender, ethnic, or cultural However, the rationale for providing different group. ªBias is the extent to which measured norms for each sex (gender) is unclear. Sackett group differences are invalid . . . Group differ- and Wilk (1994) ªsearched test manuals, hand- ences are invalid to the extent that the constructs books, and the like for a discussion of the that distinguish between groups are different rationale for the practice. We typically found from the constructs the measures were intended noneº (p. 944). They indicated, with some to representº (Kehoe & Tenopyr, 1994, p. 294). surprise, that the provision of separate norms Consider, for example, the MMPI item cited for males and females ªdoes not seem to have earlier, ªI enjoy detective or mystery storiesº been viewed as at all controversialº (Sackett & (Hathway & McKinley, 1982, p. 3). The MMPI Wilk, 1994, p. 944). If the SRI is indicating an reliance upon a blind empiricism for item actual difference between males and females selection can be problematic if the basis for (Feingold, 1994) it is unclear why this difference the item endorsement is for reasons other than, showed then be eliminated or diminished in SRI or in addition to, the presence of the construct assessments of males and females (Kehoe & being assessed. Lindsay and Widiger (1995) Tenopyr, 1994; Sackett & Wilk, 1994). suggested that this item might have been For example, males and females with the correlated with hysteria because hysteria was same raw score on the psychopathic deviate itself correlated with female gender. Signifi- scale of the MMPI-2 will be given different final cantly more females than males are diagnosed scores due to the use of different norms. 
with hysteria and significantly more females Different raw scores for males and females than males respond negatively to an interest in can then result in the same final score (Graham, detective or mystery stories; therefore, the item 1993). The extent of psychopathy in females is may have correlated with hysteria because it was then relative to other females; it is not a measure identifying the presence of females rather than of the actual (absolute) extent to which they are the presence of hysteria. In addition, 150 Personality Assessment

because such an item concerns normal behavior that occurs more often in one sex than in the other, we would consider it to be sex biased because its errors in prediction (false positives) will occur more often in one sex than in the other. (Lindsay & Widiger, 1995, p. 2)

One would never consider including as one of the DSM-IV diagnostic criteria for histrionic personality disorder, “I have often enjoyed reading Cosmopolitan and Ms. Magazine,” but many such items are included in SRIs to diagnose the presence of a personality disorder. For example, responding false to the item “in the past, I've gotten involved sexually with many people who didn't matter much to me,” is used for the assessment of a dependent personality disorder on the MCMI-II (Millon, 1987, p. 190). Not getting involved sexually with many people who do not matter much is hardly an indication of a dependent personality disorder (or of any dysfunction), yet it is used to diagnose the presence of this personality disorder.

There are also data to suggest that the same items on an SRI may have a different meaning to persons of different ethnic, cultural, or gender groups (Okazaki & Sue, 1995), although the magnitude, consistency, and significance of these differences have been questioned (Ozer & Reise, 1994; Timbrook & Graham, 1994). Applications of IRT analyses to the assessment of bias would be useful (Kehoe & Tenopyr, 1994). For example, Santor, Ramsay, and Zuroff (1994) examined whether men and women at equal levels of depression respond differentially to individual items. They reported no significant differences for most items, but bias was suggested for a few, including a body-image dissatisfaction item. Females were more likely to endorse body-image dissatisfaction than males even when they were at the same level of depression as males, suggesting that the endorsement of the item by women (relative to men) reflected their gender independently of their level of depression.

3.07.2.3 Individual Differences

The provision of any set of norms may itself be questioned. Personality description with SRIs traditionally has been provided in reference to individual differences. “Raw scores on personality inventories are usually meaningless--responses take on meaning only when they are compared to the responses of others” (Costa & McCrae, 1992, p. 13). However, this individual differences approach can itself be problematic. “The individual differences research paradigm, which has thoroughly dominated empirical personality research throughout the present century, is fundamentally inadequate for the purposes of a science of personality” (Lamiell, 1981, p. 36). Lamiell's scathing critique of individual differences research raised many important and compelling concerns. For example, he noted how the test-retest reliability of a personality scale typically is interpreted as indicating the extent to which the expression of a trait is stable in persons across time, but most analyses in fact indicate the extent to which persons maintain their relative position on a scale over time. The test-retest reliability of relative position may itself correlate highly with the test-retest reliability of the magnitude within each person, but it need not. For example, the reliability of a measure of height, assessed across 100 persons using a product-moment correlation, would suggest substantial stability between the ages of eight and 15, despite the fact that each person's actual height would have changed substantially during this time. Height is not at all consistent or stable across childhood, yet a measure of stability in height would be very high if the relative height among the persons changed very little. This confusion, however, would be addressed by alternative reliability statistics (e.g., an intraclass correlation coefficient).

Interpreting personality test scores relative to a particular population is not necessarily problematic as long as the existence of these norms and their implications for test interpretation are understood adequately. For example, a score on an MMPI-2 social introversion scale does not indicate how introverted a person is, but how much more (or less) introverted the person is relative to a particular normative group. The MMPI-2 social introversion scale does not indicate the extent to which an Asian-American male is socially introverted; it indicates how much more (or less) introverted he is relative to a sample of 1138 American males who provided the normative data, only 1% of whom were Asian-American (Graham, 1993).

Nevertheless, researchers and clinicians will at times refer to SRI results as if they are providing an absolute measure of a personality trait. For example, researchers and clinicians will describe a person as having high self-esteem, low self-monitoring, or high dependency, when in fact the person is simply higher or lower than a particular comparison group. It is common in personality research to construct groups of subjects on the basis of median splits on SRI scores for such traits as self-esteem, self-monitoring, dependency, autonomy, or some other personality construct. Baumeister, Tice, and Hutton (1989) searched the literature for all studies concerning self-esteem. “Most often . . .
high and low self-esteem groups are created by performing a median split on the self-esteem scores across the sample” (p. 556). Persons above the median are identified as having “high self-esteem” whereas persons below the median are identified as having “low self-esteem.” However, interpreting SRI data in this manner is inaccurate and potentially misleading. Persons below a median would indeed have less self-esteem than the persons above a median, but it is unknown whether they are in fact either high or low in self-esteem. All of the subjects might be rather high (or low). This method of assessing subjects is comparable to providing a measure of psychopathy to a sample of nuns to identify a group of psychopaths and nonpsychopaths, or a measure of altruism to convicts within a prison to identify a group of saints and sinners. Yet, many researchers will provide measures of self-esteem (self-monitoring, dependency, narcissism, or some other SRI measure) to a group of well-functioning college students to identify persons high and low in self-esteem. Baumeister et al. (1989) indeed found that “in all cases that we could determine, the sample midpoint was higher (in self-esteem) than the conceptual midpoint, and generally the discrepancy was substantial and significant” (p. 559).

3.07.2.4 Response Distortion

SRIs rely substantially on the ability of the respondent to provide a valid self-description. To the extent that a person does not understand the item, or is unwilling or impaired in his or her ability to provide an accurate response, the results will be inaccurate. “Detection of an attempt to provide misleading information is a vital and necessary component of the clinical interpretation of test results” (Ben-Porath & Waller, 1992, p. 24).

The presence of sufficiently accurate self-description is probably a reasonable assumption in most instances (Costa & McCrae, 1997). However, it may also be the case that no person will be entirely accurate in the description of his or her personality. Each person probably evidences some degree of distortion, either minimizing or exaggerating flaws or desirabilities. Such response distortion will be particularly evident in persons characterized by personality disorders (Westen, 1991). Antisocial persons will tend to be characteristically dishonest or deceptive in their self-descriptions, dependent persons may self-denigrate, paranoid persons will often be wary and suspicious, borderline persons will tend to idealize and devalue, narcissistic persons will often be arrogant or self-promotional, and histrionic persons will often be overemotional, exaggerated, or melodramatic in their self-descriptions (APA, 1994). Response distortion may also be common within some populations, such as forensic, disability, or psychiatric settings, due in part to the higher prevalence of personality disorder symptomatology within these populations but due as well to the pressures, rewards, and inducements within these settings to provide inaccurate self-descriptions (Berry, 1995).

A substantial amount of research has been conducted on the detection of response distortion (otherwise known as response sets or biases), particularly with the MMPI-2 (Butcher & Rouse, 1996). There are self-report scales to detect nonresponsiveness (e.g., random responding, yea-saying or nay-saying), overreporting of symptoms (e.g., malingering, faking bad, exaggeration, or self-denigration), and underreporting (e.g., faking good, denial, defensiveness, minimization, or self-aggrandizement). These scales often are referred to as validity scales, as their primary function has been to indicate the extent to which the scores on the personality (or clinical) scales are providing an accurate or valid self-description, and they do appear to be generally successful in identifying the respective response distortions (Berry, Wetter, & Baer, 1995).

However, it is not always clear whether the prevalence of a respective form of response distortion is frequent enough to warrant its detection within all settings (Costa & McCrae, 1997). Acquiescence, random responding, and nay-saying might be so infrequent within most settings that the costs of false positive identifications (i.e., identifying a personality description as invalid when it is in fact valid) will outweigh the costs of false negatives (identifying a description as valid when it was in fact invalid). In addition, much of the research on validity scales has been confined to analogue studies in which various response distortions are simulated by college students, psychiatric patients, or other persons. There is much less data to indicate that the inclusion of a validity scale as a moderator or suppressor variable actually improves the validity of the personality assessment. For example, it is unclear whether the correlation of a measure of the personality trait of neuroticism with another variable (e.g., drug usage) would increase when variance due to a response distortion (e.g., malingering or exaggeration) is partialled from the measure of neuroticism (Costa & McCrae, 1997).

In some contexts, such as a disability evaluation, response distortions are to be avoided, whereas in other contexts, such as the assessment of maladaptive personality traits, they may be what the clinician is seeking to identify (Ozer & Reise, 1994). Validity scales typically are interpreted as indicators of the presence of misleading or inaccurate information, but the response tendencies assessed by validity scales are central to some personality disorders. For example, an elevation on a malingering scale would indicate the presence of deception or dishonesty, suggesting perhaps that the information provided by the subject in response to the other SRI items was inaccurate or misleading. However, one might not want to partial the variance due to malingering from an SRI measure of antisocial or psychopathic personality. The validity of a measure of psychopathy would be reduced if variance due to deception or dishonesty was extracted, as dishonesty or deception is itself a facet of this personality syndrome (APA, 1994; Hare, 1991; Lilienfeld, 1994). It might be comparably misleading to extract variance due to symptom exaggeration from a measure of borderline personality traits, self-aggrandizement from a measure of narcissism, or self-denigration from a measure of dependency. Validity scales are not just providing a measure of response distortion that undermines the validity of personality description; they are also providing valid descriptions of highly relevant and fundamental personality traits.

For example, a response distortion that has been of considerable interest and concern is social desirability. Persons who are instructed to attribute falsely to themselves good, desirable qualities provide elevations on measures of social desirability, and many researchers have therefore extracted from a measure of personality the variance that is due to this apparent response bias. However, extracting this variance typically will reduce rather than increase the validity of personality measures because much of the socially desirable self-description does in fact constitute valid self-description (Borkenau & Ostendorf, 1992; McCrae & Costa, 1983). Persons who characteristically describe themselves in a socially desirable manner are not simply providing false and misleading information. Much of the variance within a measure of social desirability is due to persons either describing themselves accurately as having many desirable personality traits or, equally accurately, as having a personality disposition of self-aggrandizement, arrogance, or denial. This problem could be addressed by first extracting the variance due to valid individual differences from a measure of social desirability before it is used as a moderator variable, but then it might obviously fail to be useful as a moderator variable to the personality scales.

3.07.2.5 Automated Assessment and Base Rates

The structured nature of the responses to SRIs has also facilitated the development of automated (computerized) systems for scoring and interpretation. The researcher or clinician simply uses score sheets that can be submitted for computerized scanning, receiving in return a complete scoring and, in most cases, a narrative description of the subject's personality derived from the theoretical model of, and the data considered by, the author(s) of the computer system. Automated systems are available for most of the major personality SRIs, and their advantages are self-evident. “The computer can store and access a much larger fund of interpretive literature and base-rate data than any individual clinician can master, contributing to the accuracy, objectivity, reliability, and validity of computerized reports” (Keller, Butcher, & Slutske, 1990, p. 360).

Automated interpretive systems can also be seductive. They might provide the appearance of more objectivity and validity than is in fact the case (Butcher, 1995). Clinicians and researchers should always become closely familiar with the actual data and procedures used to develop an automated system, and the subsequent research assessing its validity, in order to evaluate objectively for themselves the nature and extent of the empirical support. For example,

while any single clinician may do well to learn the norms and base rates of the client population he or she sees most often, a computer program can refer to a variety of population norms and, if programmed to do so, will always “remember” to tailor interpretive statements according to modifying demographic data such as education, marital status, and ethnicity. (Keller et al., 1990, p. 360)

This is an excellent sentiment, describing well the potential benefits of an automated report. However, simply providing a subject's education, marital status, and ethnicity to the automated computer system does not necessarily mean that this information will in fact be used. As noted above, very few of the SRIs consider ethnicity.

A purported advantage of the MCMI-III automated scoring system is that “actuarial base rate data, rather than normalized standard score transformations, were employed in calculating scale measures” (Millon et al., 1994, p. 4). The cutoff points used for the interpretation of the MCMI-III personality scales are based on the base rates of the personality syndromes within the population. “The BR [base rate] score was designed to anchor cut-off points to the prevalence of a particular attribute in the psychiatric population” (Millon et al., 1994, p. 26). This approach appears to be quite sophisticated, as it is seemingly responsive to the failure of clinicians and researchers to consider the effect of base rates on the validity of an SRI cutoff point (Finn & Kamphuis, 1995). “These data not only provide a basis for selecting optimal differential diagnostic cut-off scores but also ensure that the frequencies of MCMI-III diagnoses and profile patterns are comparable to representative clinical prevalence rates” (Millon et al., 1994, p. 4).

However, the MCMI-III automated scoring system does not in fact make adjustments to cutoff points depending upon the base rate of the syndrome within a local clinical setting. The MCMI-III uses the same cutoff point for all settings and, therefore, for all possible base rates. The advantage of setting a cutoff point according to the base rate of a syndrome is lost if the cutoff point remains fixed across different base rates (Finn & Kamphuis, 1995). “The use of the MCMI-III should be limited to populations that are not notably different in background from the sample employed to develop the instrument's base rate norms” (Millon et al., 1994, p. 35). It is for this reason that Millon et al. discourage the use of the MCMI-III within normal (community or college) populations (a limitation not shared by most other SRIs). In fact, the population sampled for the MCMI-III revision might have itself been notably different in background from the original sample used to develop the base rate scores. Retzlaff (1996) calculated the probability of having a personality disorder given the obtainment of an MCMI-III respective cutoff point, using the data provided in the MCMI-III test manual. Retzlaff concluded that “as determined by currently available research, the operating characteristics of the MCMI-III scales are poor” (p. 437). The probabilities varied from 0.08 to a high of only 0.32. For example, the probability of having an avoidant personality disorder if one surpassed the respective cutoff point was only 0.17, for the borderline scale it was only 0.18, and for the antisocial scale it was only 0.07. As Retzlaff indicated, “these hit rate statistics are well under one-half of the MCMI-II validities” (Retzlaff, 1996, p. 435).

Retzlaff (1996) suggested that there were two possible explanations for these results, either “the test is bad or that the validity study was bad” (p. 435) and he concluded that the fault lay with the MCMI-III cross-validation data. “At best, the test is probably valid, but there is no evidence and, as such, it cannot be trusted until better validity data are available” (Retzlaff, 1996, p. 435). However, it is unlikely that any new data will improve these statistics, unless the new study replicates more closely the population characteristics of the original derivation study. A third alternative is to have the cutoff points for the scales be adjusted for different base rates within local settings, but no automated scoring system currently provides this option.

3.07.2.6 Illustrative Instruments

There are many SRIs for the assessment of normal and maladaptive personality traits, including the SNAP (Clark, 1993), the DAPP-BQ (Livesley & Jackson, in press), the Personality Assessment Inventory (Morey, 1996), the MPQ (Tellegen & Waller, in press), the MCMI-III (Millon et al., 1994), the PDQ-4 (Hyler, 1994), the IAS (Wiggins & Trobst, 1997), the Wisconsin Personality Inventory (Klein et al., 1993), and the TCI (Cloninger & Svrakic, 1994). Space limitations prohibit a consideration of each of them. A brief discussion, however, will be provided for the two dominant SRIs within the field of clinical personality assessment, the MMPI-2 (Graham, 1993; Hathaway et al., 1989) and the NEO-PI-R (Costa & McCrae, 1992).

3.07.2.6.1 Minnesota Multiphasic Personality Inventory-2

The MMPI-2 is an SRI that consists of 566 true/false items (Graham, 1993; Hathaway et al., 1989). It is the most heavily researched and validated SRI, and the most commonly used in clinical practice (Butcher & Rouse, 1996). Its continued usage is due in part to familiarity and tradition, but its popularity also reflects the obtainment of a substantial amount of empirical support and normative data over the many years of its existence (Graham, 1993).

The MMPI-2 is described as “the primary self-report measure of abnormal personality” (Ben-Porath, 1994, p. 363) but its most common usage is for the assessment of anxiety, depressive, substance, psychotic, and other such (Axis I) mental disorders rather than for the assessment of personality traits (Ozer & Reise, 1994). “The MMPI-2 clinical scales are . . . measures of various forms of psychopathology . . . and not measures of general personality” (Greene, Gwin, & Staal, 1997, p. 21). This is somewhat ironic, given that it is titled as a personality inventory (Helmes & Reddon, 1993).

However, the MMPI-2 item pool is extensive and many additional scales have been developed beyond the basic 10 clinical and three validity scales (Graham, 1993). For example, some of the new MMPI-2 content scales (Ben-Porath, 1994; Butcher, 1995), modeled after the seminal research of Wiggins (1966), do assess personality traits (e.g., cynicism, social discomfort, Type A behavior, and low self-esteem). The Morey et al. (1985) personality disorder scales have also been updated and normed for the MMPI-2 (Colligan, Morey, & Offord, 1994) and an alternative set of MMPI-2, DSM-IV personality disorder scales are being developed by Somwaru and Ben-Porath (1995). Harkness and McNulty (1994) have developed scales (i.e., the PSY-5) to assess five broad domains of personality (i.e., neuroticism, extraversion, psychoticism, aggressiveness, and constraint) that are said to provide the MMPI-2 assessment of, or an alternative to, the five-factor model of personality (Ben-Porath, 1994; Butcher & Rouse, 1996). “Other measures of five-factor models of normal personality will need to demonstrate incremental validity in comparison to the full set of MMPI-2 scales (including the PSY-5) to justify their use in clinical practice” (Ben-Porath, 1994, p. 393). However, there are important distinctions between the PSY-5 and the five-factor model constructs, particularly for psychoticism and aggressiveness (Harkness & McNulty, 1994; Harkness, McNulty, & Ben-Porath, 1995; Widiger & Trull, 1997). In addition, the utility of the MMPI-2 for personality trait description and research may be limited by the predominance of items concerning clinical symptomatology and the inadequate representation of important domains of personality, such as (Costa, Zonderman, McCrae, & Williams, 1985) and constraint (DiLalla, Gottesman, Carey, & Vogler, 1993)

3.07.2.6.2 NEO Personality Inventory-Revised

The most comprehensive model of personality trait description is provided by the five-factor model (Saucier & Goldberg, 1996; Wiggins & Pincus, 1992).

or aggression, tender-mindedness vs. tough-mindedness (lack of empathy), modesty vs. arrogance, and straightforwardness vs. deception or manipulation. The domains and facets of the NEO-PI-R relate closely to other models of personality, such as the constructs of affiliation and power within the interpersonal circumplex model of personality (McCrae & Costa, 1989; Wiggins & Pincus, 1992). The NEO-PI-R assessment of the five-factor model also relates closely to the DSM-IV personality disorder nomenclature, despite its original development as a measure of normal personality functioning (Widiger & Costa, 1994).

The application of the NEO-PI-R within clinical settings, however, may be problematic, due to the absence of extensive validity scales to detect mood state and response-set distortion (Ben-Porath & Waller, 1992). A valid application of the NEO-PI-R requires that the respondent be capable of and motivated to provide a reasonably accurate self-description. This is perhaps a safe assumption for most cases (Costa & McCrae, 1997), but the NEO-PI-R might not be successful in identifying when this assumption is inappropriate. Potential validity scales for the NEO-PI-R, however, are being researched (Schinka, Kinder, & Kremer, 1997).

3.07.3 SEMISTRUCTURED INTERVIEWS

The single most popular method for clinical assessments of personality by psychologists working in either private practice, inpatient hospitals, or university psychology departments, is an unstructured interview (Watkins, Campbell, Nieberding, & Hallmark, 1995). Whereas most researchers of personality disorder rely upon semistructured interviews (SSI) (Rogers, 1995; Zimmerman, 1994), there are but a few SSIs for the assessment of normal
Even the most ardent personality functioning (e.g., Trull & Widiger, critics of the five-factor model acknowledge its 1997) and none of the psychologists in a survey importance and impact (e.g., Block, 1995; of practicing clinicians cited the use of a Butcher & Rouse, 1996; Millon et al., 1996). semistructured interview (Watkins et al., 1995). And, the predominant measure of the five- Unstructured clinical interviews rely entirely factor model is the NEO-PI-R (Costa & upon the training, expertise, and conscientious- McCrae, 1992, 1997; Ozer & Reise, 1994; ness of the interviewer to provide an accurate Widiger & Trull, 1997). assessment of a person's personality. They are The 240 item NEO-PI-R (Costa & McCrae, problematic for research as they are notor- 1992) assesses five broad domains of person- iously unreliable, idiosyncratic, and prone to ality: neuroticism, extraversion (vs. introver- false assumptions, attributional errors, and sion), openness (vs. closedness), agreeableness misleading expectations (Garb, 1997). For (vs. antagonism), and conscientiousness. Each example, many clinical interviewers fail to item is rated on a five-point scale, from strongly provide a comprehensive assessment of a disagree to strongly agree. Each domain is patient's maladaptive personality traits. Only differentiated into six underlying facets. For one personality disorder diagnosis typically is example, the six facets of agreeableness vs. provided to a patient, despite the fact that most antagonism are trust vs. mistrust, altruism vs. patients will meet criteria for multiple diag- exploitiveness, compliance vs. oppositionalism noses (Gunderson, 1992). Clinicians tend to Semistructured Interviews 155 diagnose personality traits and disorders hier- Ochoa, 1989). This overdiagnosis of histrionic archically. 
Once a patient is identified as having personality disorder in females is diminished a particular personality disorder (e.g., border- substantially when the clinician is compelled to line), clinicians will fail to assess whether assess systematically each one of the features. additional personality traits are present (Her- ªSex biases may best be diminished by an kov & Blashfield, 1995). Alder, Drake, and increased emphasis in training programs and Teague (1990) provided 46 clinicians with case clinical settings on the systematic use and histories of a patient that met the DSM-III adherence to the [diagnostic] criteriaº (Ford criteria for four personality disorders (i.e., & Widiger, 1989, p. 304). histrionic, narcissistic, borderline, and depen- The reluctance to use semistructured inter- dent). ªDespite the directive to consider each views within clinical practice, however, is category separately . . . most clinicians assigned understandable. Semistructured interviews that just one [personality disorder] diagnosisº (Adler assess all of the DSM-IV personality disorder et al., 1990, p. 127). Sixty-five percent of the diagnostic criteria can require more than two clinicians provided only one diagnosis, 28% hours for their complete administration (Widi- provided two, and none provided all four. ger & Sanderson, 1995). This is unrealistic and Unstructured clinical assessments of person- impractical in routine clinical practice, particu- ality also fail to be systematic. Morey and larly if the bulk of the time is spent in Ochua (1989) provided 291 clinicians with the determining the absence of traits. 
However, 166 DSM-III personality disorder diagnostic the time can be diminished substantially by first criteria (presented in a randomized order) and administering an SRI to identify which domains asked them to indicate which personality of personality functioning should be empha- disorder(s) were present in one of their patients sized and which could be safely ignored and to indicate which of the 166 diagnostic (Widiger & Sanderson, 1995). criteria were present. Kappa for the agreement Unstructured interviews are also preferred by between their diagnoses and the diagnoses that clinicians because they find SSIs to be too would be given based upon the diagnostic constraining and superficial. Most clinicians criteria they indicated to be present, ranged prefer to follow leads that arise during an from 0.11 (schizoid) to only 0.58 (borderline). In interview, adjusting the content and style to other words, their clinical diagnoses agreed facilitate rapport and to respond to the poorly with their own assessments of the particular needs of an individual patient. The diagnostic criteria for each disorder. The results questions provided by an SSI can appear, in of this study were replicated by Blashfield and comparison, to be inadequate and simplistic. Herkov (1996). Agreement in this instance However, SSIs are considered to be semistruc- ranged from 0.28 (schizoid) to 0.63 (borderline). tured because they require (or allow) profes- ªIt appears that the actual diagnoses of sional judgment and discretion in their clinicians do not adhere closely to the diagnoses administration and scoring. They are not simply suggested by the [diagnostic] criteriaº (Blash- mindless administrations of an SRI. The field & Herkov, 1996, p. 226). 
responsibility of an SSI interviewer is to assess Clinicians often base their personality dis- for the presence of a respective trait, not to just order assessments on the presence of just one or record a subject's responses to a series of two features, failing to consider whether a structured questions. Follow-up questions that sufficient number of the necessary features are must be sensitive and responsive to the mood present (Blashfield & Herkov, 1996). In addi- state, defensiveness, and self-awareness of the tion, the one or two features that one clinician person being interviewed are always required considers to be sufficient may not be consistent and are left to the expertise and discretion of the with the feature(s) emphasized by another interviewer (Widiger, Frances, & Trull, 1989). clinician, contributing to poor interrater relia- There are only a few fully structured interviews bility (Widiger & Sanderson, 1995; Zimmer- for the assessment of personality traits and they man, 1994) and to misleading expectations and may be inadequate precisely because of their false assumptions. For example, a number of excessive constraint and superficiality (Perry, studies have indicated that many clinicians tend 1992). to overdiagnose the histrionic personality The questions provided by an SSI are useful disorder in females (Garb, 1997). Clinicians in ensuring that each trait is assessed, and that a tend to perceive a female patient who has just set of questions found to be useful in prior one or two histrionic features as having a studies is being used in a consistent fashion. 
histrionic personality disorder, even when she Systematic biases in clinical assessments are may instead meet the diagnostic criteria for an more easily identified, researched, and ulti- alternative personality disorder (Blashfield & mately corrected with the explicit nature of SSIs Herkov, 1996; Ford & Widiger, 1989; Morey & and their replicated use across studies and 156 Personality Assessment research sites. A highly talented and skilled narcissistic personality traits), until it was clinician can outperform an SSI, but it is risky to recognized that these scales include many items presume that one is indeed that talented that involve self-confidence, assertion, and clinician or that one is consistently skilled and gregariousness. Piersma concluded that ªthe insightful with every patient (Dawes, 1994). It MCMI-II is not able to measure long-term would at least seem desirable for talented personality characteristics (`trait' characteris- clinicians to be informed by a systematic and tics) independent of symptomatology (`state' comprehensive assessment of a patient's per- characteristics)º (p. 91). Mood state distortion, sonality traits. however, might not be problematic within outpatient and normal community samples (Trull & Goodwin, 1993). 3.07.3.1 Personality, Mood States, and Mental SSIs appear to be more successful than SRIs Disorders in distinguishing recently developed mental disorders from personality traits, particularly Personality traits typically are understood to if the interviewers are instructed explicitly to be stable behavior patterns present since young make this distinction when they assess each adulthood (Wiggins & Pincus, 1992). However, item. Loranger et al. (1991) compared the few SRIs emphasize this fundamental feature. 
assessments provided by the Personality Dis- For example, the instructions for the MMPI-2 order Examination (PDE; Loranger, in press) at make no reference to age of onset or duration admission and one week to six months later. (Hathaway et al., 1989). Responding true to the Reduction in scores on the PDE were not MMPI-2 borderline item, ªI cry easilyº (Hath- associated with depression or anxiety. Loranger away et al., p. 6) could be for the purpose of et al. concluded that the ªstudy provides describing the recent development of a depres- generally encouraging results regarding the sive mood disorder rather than a characteristic apparent ability of a particular semistructured manner of functioning. Most MMPI-2 items are interview, the PDE, to circumvent trait-state in fact used to assess both recently developed artifacts in diagnosing personality disorders in mental disorders as well as long-term person- symptomatic patientsº (p. 727). ality traits (Graham, 1993). However, it should not be presumed that SSIs SRIs that required a duration since young are resilient to mood state distortions. O'Boyle adulthood for the item to be endorsed would be and Self (1990) reported that ªPDE dimensional relatively insensitive to changes in personality scores were consistently higher (more sympto- during adulthood (Costa & McCrae, 1994), but matic) when subjects were depressedº (p. 90) SSIs are generally preferred over SRIs within and Loranger et al. (1991) acknowledged as well clinical settings for the assessment of personality that all but two of the PDE scales decreased traits due to the susceptibility of SRIs to mood- significantly across the hospitalization. These state confusion and distortion (Widiger & changes are unlikely to reflect actual, funda- Sanderson, 1995; Zimmerman, 1994). Persons mental changes in personality secondary to the who are depressed, anxious, angry, manic, brief psychiatric hospitalization. 
hypomanic, or even just agitated are unlikely The methods by which SRIs and SSIs address to provide accurate self-descriptions. Low self- mood state and mental disorder confusion esteem, hopelessness, and negativism are central should be considered in future research (Zim- features of depression and will naturally affect merman, 1994). For example, the DSM-IV self-description. Persons who are depressed will requires that the personality disorder criteria be describe themselves as being dependent, intro- evident since late adolescence or young adult- verted self-conscious, vulnerable, and pessimis- hood (APA, 1994). However, the PDE (Lor- tic (Widiger, 1993). Distortions will even anger, in press) requires only that one item be continue after the remission of the more present since the age of 25, with the others obvious, florid symptoms of depression present for only five years. A 45-year old adult (Hirschfeld et al., 1989). with a mood disorder might then receive a Piersma (1989) reported significant decreases dependent, borderline, or comparable person- on the MCMI-II schizoid, avoidant, dependent, ality disorder diagnosis by the PDE, if just one passive±aggressive, self-defeating, schizotypal, of the diagnostic criteria was evident since the borderline, and paranoid personality scales age of 25. across a brief inpatient treatment, even with the MCMI-II mood state correction scales. Significant increases were also obtained for the 3.07.3.2 Dissimulation and Distortion histrionic and narcissistic scales, which at first appeared nonsensical (suggesting that treatment Unstructured and semistructured inter- had increased the presence of histrionic and viewers base much of their assessments on the Semistructured Interviews 157 self-descriptions of the respondent. 
However, a review of institutional file data rather than substantial proportion of the assessment is also answers to interview questions, given the based on observations of a person's behavior expectation that psychopathic persons will be and mannerisms (e.g., the schizotypal trait of characteristically deceptive and dishonest dur- odd or eccentric behavior; APA, 1994) the ing an interview. It is unclear whether the PCL-R manner in which a person responds to questions could provide a valid assessment of psychopathy (e.g., excessive suspiciousness in response to an in the absence of this additional, corrobatory innocuous question), and the consistency of the information (Lilienfeld, 1994; Salekin, Rogers, responses to questions across the interview. SSIs & Sewell, 1996). will also require that respondents provide A notable exception to the absence of SSI examples of affirmative responses to ensure validity scales is the Structured Interview of that the person understood the meaning or Reported Symptoms (SIRS; Rogers, Bagby, & intention of a question. The provision of follow- Dickens, 1992) developed precisely for the up queries provides a significant advantage of assessment of subject distortion and dissimula- SSIs relative to SRIs. For example, schizoid tion. The SIRS includes 172 items (sets of persons may respond affirmatively to an SRI questions) organized into eight primary and five question, ªdo you have any close friends,º but supplementary scales. Three of the primary further inquiry might indicate that they never scales assess for rare, improbable, or absurd invite these friends to their apartment, they symptoms, four assess for an unusual range and rarely do anything with them socially, they are severity of symptoms, and the eighth assesses unaware of their friends' personal concerns, and for inconsistencies in self-reported and observed they never confide in them. Widiger, Mangine, symptoms. 
A substantial amount of supportive Corbitt, Ellis, and Thomas (1995) described a data has been obtained with the SIRS, person who was very isolated and alone, yet particularly within forensic and neuropsycho- indicated that she had seven very close friends logical settings (Rogers, 1995). with whom she confided all of her personal feelings and insecurities. When asked to describe one of these friends, it was revealed 3.07.3.3 Intersite Reliability that she was referring to her seven cats. In sum, none of the SSIs should or do accept a The allowance for professional judgment in respondent's answers and self-descriptions sim- the selection of follow-up queries and in the ply at face value. Nevertheless, none of the interpretation of responses increases signifi- personality disorder SSIs includes a formal cantly the potential for inadequate interrater method by which to assess defensiveness, reliability. Good to excellent interrater relia- dissimulation, exaggeration, or malingering. bility has been consistently obtained in the An assessment of exaggeration and defensive- assessment of maladaptive personality traits ness can be very difficult and complicated with SSIs (Widiger & Sanderson, 1995; Zimmer- (Berry et al., 1995) yet it is left to the discretion man, 1994), but an SSI does not ensure the and expertise of the interviewer in the assess- obtainment of adequate interrater reliability. ment of each individual personality trait. Some An SSI only provides the means by which this interviewers may be very skilled at this assess- reliability can be obtained. Most SSI studies ment, whereas others may be inadequate in the report inadequate to poor interrater reliability effort or indifferent to its importance. for at least one of the domains of personality SSIs should perhaps include validity scales, being assessed (see Widiger & Sanderson, 1995; comparable to those within self-report inven- Zimmerman, 1994). The obtainment of ade- tories. Alterman et al. 
(1996) demonstrated quate interrater reliability should be assessed empirically that subjects exhibiting response sets and documented in every study in which an SSI of positive or negative impression management is being used, as it is quite possible that the as assessed by the PAI (Morey, 1996) showed personality disorder of particular interest is the similar patterns of response distortion on two one for which weak interrater reliability has semistructured interviews, yet the interviewers been obtained. appeared to be ªessentially unaware of such The method by which interrater reliability has behaviorº (Alterman et al., 1996, p. 408). ªThe been assessed in SSI research is also potentially findings suggest that some individuals do exhibit misleading. Interrater reliability has tradition- response sets in the context of a structured ally been assessed in SSI research with respect interview and that this is not typically detected to the agreement between ratings provided by the interviewerº (Alterman et al., 1996, by an interviewer and ratings provided by a p. 408). Much of the assessment of psychopathic person listening to an audiotaped (or video- personality traits by the revised Psychopathy taped) recording of this interview (Zimmerman, Checklist (PCL-R; Hare, 1991) is based on a 1994). However, this methodology assesses only 158 Personality Assessment the agreement regarding the ratings of the 3.07.3.4 Self and Informant Ratings respondents' statements. It is comparable to confining the assessment of the interrater Most studies administer the SSI to the person reliability of practicing clinicians' personality being assessed. However, a method by which to assessments to the agreement in their ratings of a address dissimulation and distortion is to recording of an unstructured clinical interview. administer the SSI to a spouse, relative, or The poor interrater reliability that has been friend who knows the person well. 
Many reported for practicing clinicians has perhaps personality traits involve a person's manner been due largely to inconsistent, incomplete, of relating to others (McCrae & Costa, 1989; and idiosyncratic interviewing (Widiger & Wiggins & Trobst, 1997) and some theorists Sanderson, 1995). It is unclear whether person- suggest that personality is essentially this ality disorder SSIs have actually resolved this manner of relating to others (Kiesler, 1996; problem as few studies have in fact assessed Westen, 1991; Wiggins & Pincus, 1992). interrater reliability using independent admin- A useful source for the description of these istrations of the same interview. traits would then be persons with whom the The misleading nature of currently reported subject has been interacting. These ªinfor- SSI interrater reliability is most evident in mantsº (as they are often identified) can be studies that have used telephone interviews. intimately familiar with the subject's character- Zimmerman and Coryell (1990) reported kappa istic manner of relating to them, and they would agreement rates of 1.0 for the assessment of the not (necessarily) share the subject's distortions schizotypal, histrionic, and dependent person- in self-description. The use of peer, spousal, and ality disorders using the Structured Interview other observer ratings of personality has a rich for DSM-III Personality (SIDP; Pfohl, Blum, & tradition in SRI research (Ozer & Reise, 1994). Zimmerman, in press). However, 86% of the An interview with a close friend, spouse, SIDP administrations were by telephone. Tele- employer, or relative is rarely uninformative phone administrations of an SSI will tend to be and typically results in the identification of more structured than face-to-face interviews, additional maladaptive personality traits. How- and perhaps prone to brief and simplistic ever, it is unclear which source will provide the administrations. They may degenerate into a most valid information. 
Zimmerman, Pfohl, verbal administration of an SRI. Interrater Coryell, Stangl, and Corenthal (1988) adminis- agreement with respect to the ratings of subjects' tered an SSI to both a patient and an informant, affirmative or negative responses to MMPI-2 and obtained rather poor agreement, with items (i.e., highly structured questions) is not correlations ranging from 0.17 (compulsive) particularly informative. to only 0.66 (antisocial). The informants There is reason to believe that there might identified significantly more dependent, avoi- be poor agreement across research sites using dant, narcissistic, paranoid, and schizotypal the same SSI. For example, the prevalence rate traits, but Zimmerman et al. (1988) concluded of borderline personality disorder within that ªpatients were better able to distinguish psychiatric settings has been estimated at between their normal personality and their 15% (Pilkonis et al., 1995), 23% (Riso, Klein, illnessº (p. 737). Zimmerman et al. felt that the Anderson, Ouimette, & Lizardi, 1994), 42% informants were more likely to confuse patients' (Loranger et al., 1991), and 71% (Skodol, current depression with their longstanding and Oldham, Rosnick, Kellman, & Hyler, 1991), premorbid personality traits than the patients' all using the same PDE (Loranger, in press). themselves. Similar findings have been reported This substantial disagreement is due to many by Riso et al. (1994). Informants, like the different variables (Gunderson, 1992; Pilkonis, patients themselves, are providing subjective 1997; Shea, 1995), but one distinct possibility opinions and impressions rather than objective is that there is an inconsistent administration descriptions of behavior patterns, and they may and scoring of the PDE by Loranger et al. have their own axes to grind in their emotionally (1991), Pilkonis et al. (1995), Riso et al. (1994), invested relationship with the identified patient. and Skodol et al. (1991). 
Excellent interrater The fundamental attribution error is to over- reliability of a generally unreliably adminis- explain behavior in terms of personality traits, tered SSI can be obtained within one parti- and peers might be more susceptible to this error cular research site through the development of than the subjects themselves. local (idiosyncratic) rules and policies for the administration of follow-up questions and the 3.07.3.5 Convergent Validity Across Different scoring of respondents' answers that are Interviews inconsistent with the rules and policies devel- oped for this same SSI at another research One of the most informative studies on the site. validity of personality disorder SSIs was Semistructured Interviews 159 provided by Skodol et al. (1991). They identity disturbance, defined in DSM-IV as a administered to 100 psychiatric inpatients two markedly and persistently unstable sense of self- different SSIs on the same day (alternating in image or sense of self (APA, 1994). All five SSIs morning and afternoon administrations). do ask about significant or dramatic changes in Agreement was surprisingly poor, with kappa self-image across time. However, there are also ranging from a low of 0.14 (schizoid) to a high of notable differences. For example, the DIPD-IV only 0.66 (dependent). Similar findings have and SIDP-IV refer specifically to feeling evil; the since been reported by Pilkonis et al. (1995). 
If SIDP-IV highlights in particular a confusion this is the best agreement one can obtain with regarding sexual orientation; the SCID-II the administration of different SSIs to the same appears to emphasize changes or fluctuations persons by the same research team on the same in self-image, whereas the DIPD-IV appears to day, imagine the disagreement that must be emphasize an uncertainty or absence of self- obtained by different research teams with image; and the Personality Disorder Interview-4 different SSIs at different sites (however, both (PDI-IV) includes more open-ended self-de- studies did note that significantly better agree- scriptions. ment was obtained when they considered the It is possible that the variability in content of dimensional rating of the extent to which each questions across different SSIs may not be as personality syndrome was present). important as a comparable variability in Skodol et al. (1991) concluded that the source questions across different SRIs, as the inter- of the disagreement they obtained between the views may converge in their assessment through Structured Clinical Interview for DSM-III-R the unstructured follow-up questions, queries Personality Disorders (SCID-II; First, Gibbon, and clarifications. However, the findings of Spitzer Williams, & Benjamin, in press) and the Pilkonis et al. (1995) and Skodol et al. (1991) PDE (Loranger, in press) was the different suggest otherwise. In addition, as indicated in questions (or items) used by each interview. ªIt Table 1, most of the questions within the five is fair to say that, for a number of disorders (i.e., SSIs are relatively structured, resulting perhaps paranoid, schizoid, schizotypal, narcissistic, in very little follow-up query. and passive±aggressive) the two [interviews] studied do not operationalize the diagnoses 3.07.3.6 Illustrative Instruments similarly and thus yield disparate resultsº (Skodol et al., p. 22). 
The three most commonly used SSIs within There appear to be important differences clinical research for the assessment of the DSM among the SSIs available for the assessment of personality disorders are the SIDP-IV (Pfohl personality disorders (Clark,1992; Widiger & et al., in press), the SCID-II (First et al., in Sanderson, 1995; Zimmerman, 1994). Table 1 press), and the PDE (Loranger, in press). The presents the number of structured questions SIDP-IV has been used in the most number of (i.e., answerable by one word, such as ªyesº or studies. A distinctive feature of the PDE is the ªfrequentlyº), open-ended questions, and ob- inclusion of the diagnostic criteria for the servational ratings provided in each of the five personality disorders of the World Health major personality disorder SSIs. It is evident Organization's (WHO) International Classifica- from Table 1 that there is substantial variation tion of Diseases (ICD-10; 1992). However, using across SSIs simply with respect to the number of this international version of the PDE to questions provided, ranging from 193 to 373 compare the DSM-IV with the ICD-10 is with respect to structured questions and from 20 problematic, as the PDE does not in fact to 69 for open-ended questions. There is also provide distinct questions for the ICD-10 variability in the reliance upon direct observa- criteria. It simply indicates which of the existing tions of the respondent. The PDE, Diagnostic PDE questions, developed for the assessment of Interview for DSM-IV Personality Disorders the DSM-IV personality disorders, could be (DIPD-IV), and SIDP-IV might be more used to assess the ICD-10 criteria. The DSM-IV difficult to administer via telephone, given the and ICD-10 assessments are not then indepen- number of observational ratings of behavior, dent. 
The DIPD (Zanarini, Frankenburg, appearance, and mannerisms that are required Sickel, & Yong, 1996) is the youngest SSI, (particularly to assess the schizotypal and but it is currently being used in an extensive histrionic personality disorders). multi-site, longitudinal study of personality There is also significant variability in the disorders. content of the questions used to assess the same personality trait (Clark, 1992; Widiger & 3.07.3.6.1 SIDP-IV Sanderson, 1995; Zimmerman, 1994). Table 2 presents the questions used by each of the five The SIDP-IV (Pfohl et al., in press) is the major SSIs to assess the borderline trait of oldest and most widely used SSI for the 160 Personality Assessment

Table 1 Amount of structure in personality disorder semistructured interviews.

                      Number of questions
Interview   Structured   Open-ended   Examples   Total   Observations
DIPD-IV            373           20          5     398             19
PDE                272           69        196     537             32
PDI-IV             290           35                325              3
SCID-II            193           35         75     303              7
SIDP-IV            244           58         35     337             16

Note. Examples = specified request for examples (PDI-IV instructs interviewers to always consider asking for examples, and therefore does not include a specified request for individual items); DIPD-IV = Diagnostic Interview for DSM-IV Personality Disorders (Zanarini et al., 1996); PDE = Personality Disorder Examination (Loranger, in press); PDI-IV = Personality Disorder Interview-4 (Widiger et al., 1995); SCID-II = Structured Clinical Interview for DSM-IV Axis II Personality Disorders (First et al., in press); SIDP-IV = Structured Interview for DSM-IV Personality (Pfohl et al., in press).

assessment of the DSM personality disorders (Rogers, 1995; Widiger & Sanderson, 1995; Zimmerman, 1994). It includes 353 items (337 questions and 16 observational ratings) to assess the 94 diagnostic criteria for the 10 DSM-IV personality disorders. Additional items are provided to assess the proposed but not officially recognized negativistic (passive-aggressive), depressive, self-defeating, and sadistic personality disorders. It is available in two versions, one in which the items are organized with respect to the diagnostic criteria sets (thereby allowing the researcher to assess only a subset of the disorders) and the other in which the items are organized with respect to similar content (e.g., perceptions of others and social conformity) to reduce repetition and redundancy. Each diagnostic criterion is assessed on a four-point scale (0=not present, 1=subthreshold, 2=present, 3=strongly present). Administration time is about 90 minutes, depending in part on the experience of the interviewer and the verbosity of the subject. A computerized administration and scoring system is available. The accompanying instructions and manual are limited, but training videotapes and courses are available.

3.07.3.6.2 PDI-IV

The PDI-IV (Widiger et al., 1995) is the second oldest SSI for the assessment of the DSM personality disorders. It is comparable in content and style to the SIDP-IV, although it has fewer observational ratings and more open-ended inquiries. The PDI-IV has been used in substantially fewer studies than the SIDP-IV, PDE, or SCID-II, it lacks supportive training material, and only one of the published studies using the PDI-IV was conducted by independent investigators (Rogers, 1995). However, its accompanying manual is the most systematic and comprehensive, providing extensive information regarding the history, rationale, and common assessment issues for each of the DSM-IV personality disorder diagnostic criteria.

3.07.4 PROJECTIVE TECHNIQUES

Projective techniques are not used as often as SRIs in personality research, "and many academic psychologists have expressed the belief that knowledge of projective testing is not as important as it used to be and that use of projective tests will likely decline in the future" (Butcher & Rouse, 1996, p. 91). However, "of the top 10 assessment procedures [used by clinicians] . . . 4 are projectives, and another (Bender-Gestalt) is sometimes used for projective purposes" (Watkins et al., 1995, p. 59). Butcher and Rouse (1996) document as well that the second most frequently researched clinical instrument continues to be the Rorschach. "Predictions about the technique's demise appear both unwarranted and unrealistic" (Butcher & Rouse, 1996, p. 91). "Whatever negative opinions some academics may hold about projectives, they clearly are here to stay, wishing will not make them go away . . . and their place in clinical assessment practice now seems as strong as, if not stronger than, ever" (Watkins et al., 1995, p. 59).

The term "projective," however, may be somewhat misleading, as it suggests that these tests share an emphasis upon the interpretation of a projection of unconscious conflicts, impulses, needs, or wishes onto ambiguous stimuli. This is true for most projective tests but it is not in fact the case for the most commonly used scoring system (i.e., Exner, 1993) for the most commonly used projective test (i.e., the Rorschach). The Exner (1993) Comprehensive

Table 2 Semistructured interview questions for the assessment of identity disturbance.

DIPD-IV 1. During the past two years, have you often been unsure of who you are or what you're really like? 2. Have you frequently gone from feeling sort of OK about yourself to feeling that you're bad or even evil? 3. Have you often felt that you had no identity? 4. How about that you had no idea of who you are or what you believe in? 5. That you don't even exist?

PDE 1. Do you think one of your problems is that you're not sure what kind of person you are? 2. Do you behave as though you don't know what to expect of yourself? 3. Are you so different with different people or in different situations that you don't behave like the same person? 4. Have others told you that you're like that? Why do you think they've said that? 5. What would you like to accomplish during your life? Do your ideas about this change often? 6. Do you often wonder whether you've made the right choice of job or career? (If housewife, ask:) Do you often wonder whether you've made the right choice in becoming a housewife? (If student, ask:) Have you made up your mind about what kind of job or career you would like to have? 7. Do you have trouble deciding what's important in life? 8. Do you have trouble deciding what's morally right and wrong? 9. Do you have a lot of trouble deciding what type of friends you should have? 10. Does the kind of people you have as friends keep changing? 11. Have you ever been uncertain whether you prefer a sexual relationship with a man or a woman?

PDI-IV 1. How would you describe your personality? 2. What is distinct or unique about you? 3. Do you ever feel that you don't know who you are or what you believe in? 4. Has your sense of who you are, what you value, what you want from life, or what you want to be, been fairly consistent, or has this often changed significantly?

SCID-II 1. Have you all of a sudden changed your sense of who you are and where you are headed? 2. Does your sense of who you are often change dramatically? 3. Are you different with different people or in different situations so that you sometimes don't know who you really are? 4. Have there been lots of sudden changes in your goals, career plans, religious beliefs, and so on?

SIDP-IV 1. Does the way you think about yourself change so often that you don't know who you are anymore? 2. Do you ever feel like you're someone else, or that you're evil, or maybe that you don't even exist? 3. Some people think a lot about their sexual orientation, for instance, trying to decide whether or not they might be gay (or lesbian). Do you often worry about this?

Note. DIPD-IV = Diagnostic Interview for DSM-IV Personality Disorders (Zanarini et al., 1996, pp. 25-26); PDE = Personality Disorder Examination (Loranger, in press, pp. 58-60, 83, & 117); PDI-IV = Personality Disorder Interview-4 (Widiger et al., 1995, p. 92); SCID-II = Structured Clinical Interview for DSM-IV Axis II Personality Disorders (First et al., in press, p. 37); SIDP-IV = Structured Interview for DSM-IV Personality (Pfohl et al., in press, p. 19).

System does include a few scores that appear to concern a projection of personal needs, wishes, or preoccupations (e.g., morbid content) but much of the scoring concerns individual differences in the perceptual and cognitive processing of the form or structure of the shapes, textures, details, and colors of the ambiguous inkblots. Exner (1989) has himself stated that "unfortunately, the Rorschach has been erroneously mislabeled as a projective test for far too long" (p. 527).

The label "projective test" is also contrasted traditionally with the label "objective test" (e.g., Keller et al., 1990) but this is again misleading. The Exner (1993) Comprehensive System scoring for the Rorschach is as objective as the scoring of an MMPI-2 (although not as reliable). In addition, clinicians often can interpret MMPI-2 profiles in an equally subjective manner.

A more appropriate distinction might be a continuum of structure vs. ambiguity with respect to the stimuli provided to the subject and the range of responses that are allowed. Most of the techniques traditionally labeled as projective do provide relatively more ambiguous stimuli than either SRIs or SSIs (e.g., inkblots or drawings). However, many SRI items can be as ambiguous as an item from a projective test. Consider, for example, the MMPI-2 items "I like mechanics magazines," "I used to keep a diary," and "I would like to be a journalist" (Hathaway et al., 1989, pp. 5, 8, 9). These items are unambiguous in their content, but the trait or characteristic they assess is very ambiguous (Butcher, 1995). There is, perhaps, less ambiguity in the meaning of the stems provided in a sentence completion test, such as "My conscience bothered me most when," "I used to dream about" and "I felt inferior when" (Lah, 1989, p. 144) than in many of the MMPI-2 items.

SRIs, on the other hand, are much more structured (or constraining) in the responses that are allowed to these stimuli. The only responses can be "true" or "false" to an ambiguous MMPI-2 item, whereas anything can be said in response to a more obvious sentence completion stem. Projective tests are uniformly more open-ended than SRIs in the responses that are allowed, increasing substantially the potential for unreliability in scoring. However, SSIs can be as open-ended as many projective tests in the responses that are allowed. For example, the PDI-IV SSI begins with the request of "having you tell me the major events, issues, or incidents that you have experienced since late childhood or adolescence" (Widiger et al., 1995, p. 245). This initial question is intentionally open-ended to ensure that the most important events or incidents are not neglected during the interview (Perry, 1992). However, such open-ended inquiries will contribute to problematic intersite reliability.

3.07.4.1 Rorschach

The Rorschach consists of 10 ambiguously shaped inkblots, some of which include various degrees of color and shading. The typical procedure is to ask a person what each inkblot might be, and to then follow at some point with inquiries that clarify the bases for the response (Aronow, Reznikoff, & Moreland, 1994; Exner, 1993). The Rorschach has been the most popular projective test in clinical practice since the Second World War, although the Thematic Apperception Test (TAT) and sentence completion tests are gaining in frequency of usage (Watkins et al., 1995).

There is empirical support for many Rorschach variables (Bornstein, 1996; Exner, 1993; Weiner, 1996), although the quality of some of this research has been questioned, including the support for a number of the fundamental Exner variables, such as the experience ratio (Kleiger, 1992; Wood, Nezworski, & Stejskal, 1996). The Rorschach can be administered and scored in a reliable manner, but the training that is necessary to learn how to score reliably the 168 variables of the Exner Comprehensive System is daunting, at best. SRIs and SSIs might provide a more cost-efficient method to obtain the same results (although profile interpretation of the MMPI-2 clinical scales can at times be equally complex; Helmes & Reddon, 1993). Exner (1996) has acknowledged, at least with respect to a depression index, that "there are other measures, such as the [MMPI-2], that might identify the presence of reported depression much more accurately than the Rorschach" (p. 12). The same point can perhaps be made for the assessment of personality traits, such as narcissism and dependency. Bornstein (1995), however, has argued that the Rorschach provides a less biased measure of sex differences in personality (dependency, in particular) because its scoring is less obvious to the subject. He suggested that the findings from SRIs and SSIs have provided inaccurate estimates of dependency in males because males are prone to deny the extent of their dependent personality traits in response to SRIs and SSIs.

An additional issue for the Rorschach is that its relevance to personality research and assessment is at times unclear. For example, the cognitive-perceptual mechanisms assessed by the Exner Comprehensive System do not appear to be of central importance to many theories of personality and personality disorder (Kleiger, 1992). "It is true that the Rorschach does not offer a precise measure for any single personality trait" (Exner, 1997, p. 41). How a person perceptually organizes an ambiguous inkblot may indeed relate to extratensive ideational activity, but constructs such as introversion, conscientiousness, need for affection, and empathy have a more direct, explicit relevance to current personality theory and research.

More theoretically meaningful constructs are perhaps assessed by content (e.g., Aronow et al., 1994; Bornstein, 1996) or object-relational (e.g., Lerner, 1995) scoring systems. Content interpretations of the Rorschach are more consistent with the traditional understanding of the instrument as a projective stimulus, but this approach also lacks the empirical support of the cognitive-perceptual scoring systems and may only encourage a return to less reliable and subjective interpretations (Acklin, 1995).

3.07.4.2 Thematic Apperception Test

The TAT (Cramer, 1996) consists of 31 cards: one is blank, seven are for males, seven for females, one for boys or girls, one for men or women and one each for a boy, girl, man, and woman (the remaining 10 are for anyone). Thus, a complete set for any particular individual could consist of 20 stimulus drawings, although only 10 typically are used per person. Most of the drawings include person(s) in an ambiguous but emotionally provocative context. The instruction to the subject is to make up a dramatic story for each card, describing what is happening, what led up to it, what is the outcome, and what the persons are thinking and feeling. It is common to describe the task as a test of imaginative intelligence to encourage vivid, involved, and nondefensive stories.

The TAT is being used increasingly in personality research with an interpersonal or object-relational perspective (e.g., Westen, 1991). The TAT's provision of cues for a variety of interpersonal issues and relationships makes it particularly well suited for such research, and the variables assessed are theoretically and clinically meaningful (e.g., malevolent vs. benevolent affect and the capacity for an emotional investment in relationships). The necessary training for reliable scoring is also less demanding than for the Rorschach, although a TAT administration remains time-consuming. There are many SRI measures of closely related interpersonal constructs that are less expensive and complex to administer and score (e.g., Kiesler, 1996; Wiggins & Trobst, 1997).

3.07.5 CONCLUSIONS

The assessment of personality is a vital component of clinical research and practice, particularly with the increasing recognition of the importance of personality traits to the development and treatment of psychopathology (Watson et al., 1994). The assessment of adaptive and maladaptive personality functioning is fundamental to virtually all fields of applied psychology.

It is then surprising and regrettable that clinical psychology training programs provide so little attention to the importance of and methods for obtaining comprehensive and systematic interviewing. The primary method for the assessment of personality in clinical practice is an unstructured interview that has been shown to be quite vulnerable to misleading expectations, inadequate coverage, and gender and ethnic biases. Training programs will devote a whole course, perhaps a whole year, to learning different projective techniques, but may never even inform students of the existence of any particular semistructured interview.

The preferred method of assessment in personality disorder research appears to be SSIs (Zimmerman, 1994), whereas the preferred method in normal personality research is SRIs (Butcher & Rouse, 1996). However, the optimal approach for both research and clinical practice would be a multimethod assessment, using methods whose errors of measurement are uncorrelated. No single approach will be without significant limitations. The convergence of findings across SRI, SSI, and projective methodologies would provide the most compelling results.

3.07.6 REFERENCES

Acklin, M. W. (1995). Integrative Rorschach interpretation. Journal of Personality Assessment, 64, 235-238.
Adler, D. A., Drake, R. E., & Teague, G. B. (1990). Clinicians' practices in personality assessment: Does gender influence the use of DSM-III Axis II? Comprehensive Psychiatry, 31, 125-133.
Adler, N., & Matthews, K. (1994). Health psychology: Why do some people get sick and some stay well? Annual Review of Psychology, 45, 229-259.
Alterman, A. I., Snider, E. C., Cacciola, J. S., Brown, L. S., Zaballero, A., & Siddiqui, N. (1996). Evidence for response set effects in structured research interviews. Journal of Nervous and Mental Disease, 184, 403-410.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Aronow, E., Reznikoff, M., & Moreland, K. (1994). The Rorschach technique: Perceptual basics, content, interpretation and applications. Boston: Allyn and Bacon.
Baumeister, R. F., Tice, D. M., & Hutton, D. G. (1989). Self-presentational motivations and personality differences in self-esteem. Journal of Personality, 57, 547-579.
Ben-Porath, Y. S. (1994). The MMPI and MMPI-2: Fifty

years of differentiating normal and abnormal personality. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 361-401). New York: Springer.
Ben-Porath, Y. S., McCully, E., & Almagor, M. (1993). Incremental validity of the MMPI-2 content scales in the assessment of personality and psychopathology by self-report. Journal of Personality Assessment, 61, 557-575.
Ben-Porath, Y. S., & Waller, N. G. (1992). "Normal" personality inventories in clinical assessment: General requirements and the potential for using the NEO Personality Inventory. Psychological Assessment, 4, 14-19.
Berry, D. T. R. (1995). Detecting distortion in forensic evaluations with the MMPI-2. In Y. S. Ben-Porath, J. R. Graham, G. C. N. Hall, R. D. Hirschman, & M. S. Zaragoza (Eds.), Forensic applications of the MMPI-2 (pp. 82-102). Thousand Oaks, CA: Sage.
Berry, D. T. R., Wetter, M. W., & Baer, R. A. (1995). Assessment of malingering. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 236-248). New York: Oxford University Press.
Blashfield, R. K., & Herkov, M. J. (1996). Investigating clinician adherence to diagnosis by criteria: A replication of Morey and Ochoa (1989). Journal of Personality Disorders, 10, 219-228.
Block, J. (1995). A contrarian view of the five-factor approach to personality description. Psychological Bulletin, 117, 187-215.
Bornstein, R. F. (1995). Sex differences in objective and projective dependency tests: A meta-analytic review. Assessment, 2, 319-331.
Bornstein, R. F. (1996). Construct validity of the Rorschach oral dependency scale: 1967-1995. Psychological Assessment, 8, 200-205.
Borkenau, P., & Ostendorf, F. (1992). Social desirability scales as moderator and suppressor variables. European Journal of Personality, 6, 199-214.
Butcher, J. N. (1995). Item content in the interpretation of the MMPI-2. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 302-316). New York: Oxford University Press.
Butcher, J. N. (1995). How to use computer-based reports. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 78-94). New York: Oxford University Press.
Butcher, J. N., & Rouse, S. V. (1996). Personality: Individual differences and clinical assessment. Annual Review of Psychology, 47, 87-111.
Clark, L. A. (1992). Resolving taxonomic issues in personality disorders. The value of large-scale analysis of symptom data. Journal of Personality Disorders, 6, 360-376.
Clark, L. A. (1993). Manual for the schedule for nonadaptive and adaptive personality. Minneapolis, MN: University of Minnesota Press.
Clark, L. A., Livesley, W. J., Schroeder, M. L., & Irish, S. L. (1996). Convergence of two systems for assessing specific traits of personality disorder. Psychological Assessment, 8, 294-303.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319.
Cloninger, C. R., & Svrakic, D. M. (1994). Differentiating normal and deviant personality by the seven factor personality model. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 40-64). New York: Springer.
Colligan, R. C., Morey, L. C., & Offord, K. P. (1994). The MMPI/MMPI-2 personality disorder scales. Contemporary norms for adults and adolescents. Journal of Clinical Psychology, 50, 168-200.
Cooke, D. J., & Michie, C. (1997). An item response theory analysis of the Hare Psychopathy Checklist-Revised. Psychological Assessment, 9, 3-14.
Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Costa, P. T., & McCrae, R. R. (1994). Set like plaster? Evidence for the stability of adult personality. In T. F. Heatherton & J. L. Weinberger (Eds.), Can personality change? (pp. 21-40). Washington, DC: American Psychological Association.
Costa, P. T., & McCrae, R. R. (1997). Stability and change in personality assessment: The Revised NEO Personality Inventory in the year 2000. Journal of Personality Assessment, 68, 86-94.
Costa, P. T., Zonderman, A. B., McCrae, R. R., & Williams, R. B. (1985). Content and comprehensiveness in the MMPI: An item factor analysis in a normal adult sample. Journal of Personality and Social Psychology, 48, 925-933.
Cramer, P. (1996). Storytelling, narrative, and the Thematic Apperception Test. New York: Guilford.
Dawes, R. M. (1994). House of cards. Psychology and psychotherapy built on myth. New York: Free Press.
DiLalla, D. L., Gottesman, I. I., Carey, G., & Vogler, G. P. (1993). Joint factor structure of the Multidimensional Personality Questionnaire and the MMPI in a psychiatric and high-risk sample. Psychological Assessment, 5, 207-215.
Eagly, A. H. (1995). The science and politics of comparing women and men. American Psychologist, 50, 145-158.
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341-349.
Exner, J. E. (1989). Searching for projection in the Rorschach. Journal of Personality Assessment, 53, 520-536.
Exner, J. E. (1993). The Rorschach: A comprehensive system (Vol. 1). New York: Wiley.
Exner, J. E. (1996). A comment on "The Comprehensive System for the Rorschach: A critical examination." Psychological Science, 7, 11-13.
Exner, J. E. (1997). The future of Rorschach in personality assessment. Journal of Personality Assessment, 68, 37-46.
Feingold, A. (1994). Gender differences in personality: A meta-analysis. Psychological Bulletin, 116, 429-456.
Finn, S. E., & Kamphuis, J. H. (1995). What a clinician needs to know about base rates. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 224-235). New York: Oxford University Press.
First, M. B., Gibbon, M., Spitzer, R. L., Williams, J. B. W., & Benjamin, L. S. (in press). User's guide for the Structured Clinical Interview for DSM-IV Axis II Personality Disorders. Washington, DC: American Psychiatric Press.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.
Ford, M. R., & Widiger, T. A. (1989). Sex bias in the diagnosis of histrionic and antisocial personality disorders. Journal of Consulting and Clinical Psychology, 57, 301-305.
Garb, H. N. (1997). Race bias, social class bias, and gender bias in clinical judgment. Clinical Psychology: Science and Practice.
Goldberg, L. R. (1990). An alternative "Description of personality": The Big Five factor structure. Journal of Personality and Social Psychology, 59, 1216-1229.
Graham, J. R. (1993). MMPI-2. Assessing personality and psychopathology (2nd ed.). New York: Oxford University Press.
Greene, R. L., Gwin, R., & Staal, M. (1997). Current status of MMPI-2 research: A methodologic overview. Journal of Personality Assessment, 68, 20-36.

Gunderson, J. G. (1992). Diagnostic controversies. In A. Tasman & M. B. Riba (Eds.), Review of psychiatry (Vol. 11, pp. 9-24). Washington, DC: American Psychiatric Press.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hare, R. D. (1991). Manual for the Revised Psychopathy Checklist. Toronto, Canada: Multi-Health Systems.
Hare, R. D., Hart, S. D., & Harpur, T. J. (1991). Psychopathy and the DSM-IV criteria for antisocial personality disorder. Journal of Abnormal Psychology, 100, 391-398.
Harkness, A. R., & McNulty, J. L. (1994). The Personality Psychopathology Five (PSY-5): Issue from the pages of a diagnostic manual instead of a dictionary. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 291-315). New York: Springer.
Harkness, A. R., McNulty, J. L., & Ben-Porath, Y. S. (1995). The Personality Psychopathology Five (PSY-5): Constructs and MMPI-2 scales. Psychological Assessment, 7, 104-114.
Hathaway, S. R., & McKinley, J. C. (1940). A multiphasic personality schedule (Minnesota): I. Construction of the schedule. Journal of Psychology, 10, 249-254.
Hathaway, S. R., & McKinley, J. C. (1982). Minnesota Multiphasic Personality Inventory test booklet. Minneapolis, MN: University of Minnesota.
Hathaway, S. R., McKinley, J. C., Butcher, J. N., Dahlstrom, W. G., Graham, J. R., & Tellegen, A. (1989). Minnesota Multiphasic Personality Inventory test booklet. Minneapolis, MN: Regents of the University of Minnesota.
Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238-247.
Helmes, E., & Reddon, J. R. (1993). A perspective on developments in assessing psychopathology: A critical review of the MMPI and MMPI-2. Psychological Bulletin, 113, 453-471.
Hendryx, M. S., Haviland, M. G., Gibbons, R. D., & Clark, D. C. (1992). An application of item response theory to alexithymia assessment among abstinent alcoholics. Journal of Personality Assessment, 58, 506-515.
Herkov, M. J., & Blashfield, R. K. (1995). Clinician diagnoses of personality disorders: Evidence of a hierarchical structure. Journal of Personality Assessment, 65, 313-321.
Hirschfeld, R. M., Klerman, G. L., Lavori, P., Keller, M., Griffith, P., & Coryell, W. (1989). Premorbid personality assessments of first onset of major depression. Archives of General Psychiatry, 46, 345-350.
Hogan, R., Curphy, G. J., & Hogan, J. (1994). What we know about leadership. Effectiveness and personality. American Psychologist, 49, 493-504.
Hyler, S. E. (1994). Personality Diagnostic Questionnaire-4 (PDQ-4). Unpublished test. New York: New York State Psychiatric Institute.
Kehoe, J. F., & Tenopyr, M. L. (1994). Adjustment in assessment scores and their usage: A taxonomy and evaluation of methods. Psychological Assessment, 6, 291-303.
Keller, L. S., Butcher, J. N., & Slutske, W. S. (1990). Objective personality assessment. In G. Goldstein & M. Hersen (Eds.), Handbook of psychological assessment (2nd ed., pp. 345-386). New York: Pergamon.
Kiesler, D. J. (1996). Contemporary interpersonal theory & research, personality, psychopathology, and psychotherapy. New York: Wiley.
Kleiger, J. H. (1992). A conceptual critique of the EA:es comparison in the Comprehensive Rorschach System. Psychological Assessment, 4, 288-296.
Klein, M. H., Benjamin, L. S., Rosenfeld, R., Treece, C., Husted, J., & Greist, J. H. (1993). The Wisconsin Personality Disorders Inventory: development, reliability, and validity. Journal of Personality Disorders, 7, 285-303.
Lah, M. I. (1989). Sentence completion tests. In C. S. Newmark (Ed.), Major psychological assessment instruments (Vol. II, pp. 133-163). Boston: Allyn & Bacon.
Lamiell, J. T. (1981). Toward an idiothetic psychology of personality. American Psychologist, 36, 276-289.
Lerner, P. M. (1995). Assessing adaptive capacities by means of the Rorschach. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 317-325). New York: Oxford University Press.
Lilienfeld, S. O. (1994). Conceptual problems in the assessment of psychopathy. Clinical Psychology Review, 14, 17-38.
Lilienfeld, S. O., & Andrews, B. P. (1996). Development and preliminary validation of a self-report measure of psychopathic personality traits in noncriminal populations. Journal of Personality Assessment, 66, 488-524.
Lindsay, K. A., & Widiger, T. A. (1995). Sex and gender bias in self-report personality disorder inventories: Item analyses of the MCMI-II, MMPI, and PDQ-R. Journal of Personality Assessment, 65, 1-20.
Livesley, W. J., & Jackson, D. (in press). Manual for the Dimensional Assessment of Personality Pathology-Basic Questionnaire. Port Huron, MI: Sigma.
Loranger, A. W. (in press). Personality disorder examination. Washington, DC: American Psychiatric Press.
Loranger, A. W., Lenzenweger, M. F., Gartner, A. F., Susman, V. L., Herzig, J., Zammit, G. K., Gartner, J. D., Abrams, R. C., & Young, R. C. (1991). Trait-state artifacts and the diagnosis of personality disorders. Archives of General Psychiatry, 48, 720-728.
McCrae, R. R., & Costa, P. T. (1983). Social desirability scales: More substance than style. Journal of Consulting and Clinical Psychology, 51, 882-888.
McCrae, R. R., & Costa, P. T. (1989). The structure of interpersonal traits: Wiggins' circumplex and the Five-Factor Model. Journal of Personality and Social Psychology, 56, 586-595.
Meehl, P. E. (1945). The dynamics of "structured" personality tests. Journal of Clinical Psychology, 1, 296-303.
Millon, T. (1983). Millon Clinical Multiaxial Inventory manual (3rd ed.). Minneapolis, MN: National Computer Systems.
Millon, T. (1987). Manual for the MCMI-II (2nd ed.). Minneapolis, MN: National Computer Systems.
Millon, T., Davis, R. D., Millon, C. M., Wenger, A. W., Van Zuilen, M. H., Fuchs, M., & Millon, R. B. (1996). Disorders of personality. DSM-IV and beyond. New York: Wiley.
Millon, T., Millon, C., & Davis, R. (1994). MCMI-III manual. Minneapolis, MN: National Computer Systems.
Morey, L. C. (1996). An interpretive guide to the Personality Assessment Inventory (PAI). Odessa, FL: Psychological Assessment Resources.
Morey, L. C., & Ochoa, E. S. (1989). An investigation of adherence to diagnostic criteria: Clinical diagnosis of the DSM-III personality disorders. Journal of Personality Disorders, 3, 180-192.
Morey, L. C., Waugh, M. H., & Blashfield, R. K. (1985). MMPI scales for DSM-III personality disorders: Their derivation and correlates. Journal of Personality Assessment, 49, 245-251.
O'Boyle, M., & Self, D. (1990). A comparison of two interviews for DSM-III-R personality disorders. Psychiatry Research, 32, 85-92.

Okazaki, S., & Sue, S. (1995). Methodological issues in assessment research with ethnic minorities. Psychological Assessment, 7, 367-375.
Ozer, D. J., & Reise, S. P. (1994). Personality assessment. Annual Review of Psychology, 45, 357-388.
Perry, J. C. (1992). Problems and considerations in the valid assessment of personality disorders. American Journal of Psychiatry, 149, 1645-1653.
Pfohl, B., Blum, N., & Zimmerman, M. (in press). Structured Interview for DSM-IV Personality. Washington, DC: American Psychiatric Press.
Piersma, H. L. (1989). The MCMI-II as a treatment outcome measure for psychiatric inpatients. Journal of Clinical Psychology, 45, 87-93.
Pilkonis, P. A. (1997). Measurement issues relevant to personality disorders. In H. H. Strupp, M. J. Lambert, & L. M. Horowitz (Eds.), Measuring patient change in mood, anxiety, and personality disorders: Toward a core battery (pp. 371-388). Washington, DC: American Psychological Association.
Pilkonis, P. A., Heape, C. L., Proietti, J. M., Clark, S. W., McDavid, J. D., & Pitts, T. E. (1995). The reliability and validity of two structured diagnostic interviews for personality disorders. Archives of General Psychiatry, 52, 1025-1033.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Retzlaff, P. (1996). MCMI-III diagnostic validity: Bad test or bad validity study. Journal of Personality Assessment, 66, 431-437.
Riso, L. P., Klein, D. N., Anderson, R. L., Ouimette, P. C., & Lizardi, H. (1994). Concordance between patients and informants on the Personality Disorder Examination. American Journal of Psychiatry, 151, 568-573.
Robins, R. W., & John, O. P. (1997). Effects of visual perspective and narcissism on self-perception. Is seeing believing? Psychological Science, 8, 37-42.
Rogers, R. (1995). Diagnostic and structured interviewing. A handbook for psychologists. Odessa, FL: Psychological Assessment Resources.
Rogers, R., Bagby, R. M., & Dickens, S. E. (1992). Structured Interview of Reported Symptoms (SIRS) professional manual. Odessa, FL: Psychological Assessment Resources.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Santor, D. A., Ramsay, J. O., & Zuroff, D. C. (1994). Nonparametric item analyses of the Beck Depression Inventory: Evaluating gender item bias and response option weights. Psychological Assessment, 6, 255-270.
Salekin, R. T., Rogers, R., & Sewell, K. W. (1996). A review and meta-analysis of the Psychopathy Checklist and Psychopathy Checklist-Revised: Predictive validity of dangerousness. Clinical Psychology: Science and Practice, 3, 203-215.
Saucier, G., & Goldberg, L. R. (1996). The language of personality: Lexical perspectives on the five-factor model. In J. S. Wiggins (Ed.), The five-factor model of personality. Theoretical perspectives (pp. 21-50). New York: Guilford.
Schinka, J. A., Kinder, B. N., & Kremer, T. (1997). Research validity scales for the NEO-PI-R: Development and initial validation. Journal of Personality Assessment, 68, 127-138.
Shea, M. T. (1995). Interrelationships among categories of personality disorders. In W. J. Livesley (Ed.), The DSM-IV personality disorders (pp. 397-406). New York: Guilford.
Skodol, A. E., Oldham, J. M., Rosnick, L., Kellman, H. D., & Hyler, S. E. (1991). Diagnosis of DSM-III-R personality disorders: A comparison of two structured interviews. International Journal of Methods in Psychiatric Research, 1, 13-26.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7, 300-308.
Somwaru, D. P., & Ben-Porath, Y. S. (1995). Development and reliability of MMPI-2 based personality disorder scales. Paper presented at the 30th Annual Workshop and Symposium on Recent Developments in Use of the MMPI-2 & MMPI-A. St. Petersburg Beach, FL.
Tellegen, A., & Waller, N. G. (in press). Exploring personality through test construction: Development of the Multidimensional Personality Questionnaire. In S. R. Briggs & J. M. Cheek (Eds.), Personality measures: Development and evaluation (Vol. 1). Greenwich, CT: JAI Press.
Timbrook, R. E., & Graham, J. R. (1994). Ethnic differences on the MMPI-2? Psychological Assessment, 6, 212-217.
Trull, T. J., & Goodwin, A. H. (1993). Relationship between mood changes and the report of personality disorder symptoms. Journal of Personality Assessment, 61, 99-111.
Trull, T. J., Useda, J. D., Costa, P. T., & McCrae, R. R. (1995). Comparison of the MMPI-2 Personality Psychopathology Five (PSY-5), the NEO-PI, and the NEO-PI-R. Psychological Assessment, 7, 508-516.
Trull, T. J., & Widiger, T. A. (1997). Structured Interview for the Five-Factor Model of Personality professional manual. Odessa, FL: Psychological Assessment Resources.
Watkins, C. E., Campbell, V. L., Nieberding, R., & Hallmark, R. (1995). Contemporary practice of psychological assessment by clinical psychologists. Professional Psychology: Research and Practice, 26, 54-60.
Watson, D., Clark, L. A., & Harkness, A. R. (1994). Structures of personality and their relevance to psychopathology. Journal of Abnormal Psychology, 103, 18-31.
Weiner, I. B. (1996). Some observations on the validity of the Rorschach inkblot method. Psychological Assessment, 8, 206-213.
Westen, D. (1991). Social cognition and object relations. Psychological Bulletin, 109, 429-455.
Widiger, T. A. (1993). Personality and depression: Assessment issues. In M. H. Klein, D. J. Kupfer, & M. T. Shea (Eds.), Personality and depression. A current view (pp. 77-118). New York: Guilford.
Widiger, T. A., & Costa, P. T. (1994). Personality and personality disorders. Journal of Abnormal Psychology, 103, 78-91.
Widiger, T. A., Frances, A. J., & Trull, T. J. (1989). Personality disorders. In R. Craig (Ed.), Clinical and diagnostic interviewing (pp. 221-236). Northvale, NJ: Aronson.
Widiger, T. A., Mangine, S., Corbitt, E. M., Ellis, C. G., & Thomas, G. V. (1995). Personality Disorder Interview-IV. A semistructured interview for the assessment of personality disorders. Odessa, FL: Psychological Assessment Resources.
Widiger, T. A., & Sanderson, C. J. (1995). Assessing personality disorders. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 380-394). New York: Oxford University Press.
Widiger, T. A., & Trull, T. J. (1997). Assessment of the five factor model of personality. Journal of Personality Assessment, 68, 228-250.
Widiger, T. A., Williams, J. B. W., Spitzer, R. L., & Frances, A. J. (1985). The MCMI as a measure of DSM-III. Journal of Personality Assessment, 49, 366-378.
Wiggins, J. S. (1966). Substantive dimensions of self-report in the MMPI item pool. Psychological Monographs, 80, (22, Whole No. 630).

Wiggins, J. S., & Pincus, A. L. (1992). Personality: Yong, L. (1996). Diagnostic Interview for DSM-IV Structure and assessment. Annual Review of Psychology, Personality Disorders (DIPD-IV). Boston: McLean 43, 473±504. Hospital. Wiggins, J. S., & Trobst, K. K. (1997). Prospects for the Zimmerman, M. (1994). Diagnosing personality disorders. assessment of normal and abnormal interpersonal A review of issues and research methods. Archives of behavior. Journal of Personality Assessment, 68, General Psychiatry, 51, 225±245. 110±126. Zimmerman, M., & Coryell, W. H. (1990). Diagnosing Wood, J. M., Nezworski, M. T., & Stejskal, W. J. (1996). personality disorders in the community. A comparison The Comprehensive System for the Rorschach: A critical of self-report and interview measures. Archives of examination. Psychological Science, 7, 3±10. General Psychiatry, 47, 527±531. World Health Organization. (1992). The ICD-10 classifica- Zimmerman, M., Pfohl, B., Coryell, W., Stangl, D., & tion of mental and behavioural disorders. Clinical descrip- Corenthal, C. (1988). Diagnosing personality disorders tions and diagnostic guidelines. Geneva, Switzerland: in depressed patients. A comparison of patient and Author. informant interviews. Archives of General Psychiatry, 45, Zanarini, M. C., Frankenburg, F. R., Sickel, A. E., & 733±737. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.08 Assessment of Psychopathology: Nosology and Etiology

NADER AMIR University of Georgia, Athens, GA, USA and CATHERINE A. FEUER University of Missouri at St. Louis, MO, USA

3.08.1 INTRODUCTION
3.08.2 HISTORY OF CLASSIFICATION
3.08.2.1 Medical Classification
3.08.2.2 Early Nosology of Psychopathology
3.08.3 CLASSIFICATION OF MENTAL ILLNESS
3.08.3.1 History of Classification of Mental Illness
3.08.3.2 Current Classification Systems
3.08.3.3 Tools for Clinical Assessment
3.08.4 CURRENT ISSUES IN THE CLASSIFICATION OF PSYCHOPATHOLOGY
3.08.4.1 Definition of Psychopathology
3.08.4.2 Comorbidity
3.08.4.2.1 Actual co-occurrence of disorders
3.08.4.2.2 Populations considered
3.08.4.2.3 Range, severity, and base rates of disorders considered
3.08.4.2.4 Assessment methods
3.08.4.2.5 Structure of the classification system
3.08.4.3 Clinical and Research Applications of Classification Systems
3.08.4.4 Organizational Problems in the DSM
3.08.5 ALTERNATIVE APPROACHES TO CLASSIFICATION
3.08.5.1 Types of Taxonomies
3.08.5.2 Taxonomic Models
3.08.5.2.1 Neo-Kraepelinian (DSM) model
3.08.5.2.2 Prototype model
3.08.5.2.3 Dimensional and categorical models
3.08.5.3 The Use of Etiological Factors for Diagnosis
3.08.6 CONCLUSIONS
3.08.7 REFERENCES


3.08.1 INTRODUCTION

The history of clinical psychology, similar to the history of medicine, has been characterized by the quest for knowledge of pathological processes. This knowledge enables practitioners in both fields to treat and prevent disorders that threaten the quality of life. Health practitioners must be able to identify and conceptualize the problem, communicate research and clinical findings to other members of their field, and ideally, to reach a scientific understanding of the disorder. Diagnostic taxonomies have evolved in part to achieve these goals (Millon, 1991). Classification systems have a long history, first in the basic sciences and medicine, and later in psychopathology. Modern psychopathology taxonomies owe much to the work of earlier medical diagnosticians, and their development has often paralleled that of medical systems (Clementz & Iacono, 1990). Both medical and psychopathological classification systems have served the purposes of improving diagnostic reliability, guiding clinical conceptualization and treatment, and facilitating research and scientific advancement (Blashfield, 1991; Clark, Watson, & Reynolds, 1995). Despite the usefulness of classification systems in psychopathology, no system is accepted universally. Furthermore, some have questioned the utility of classification systems as a whole, favoring behavioral descriptions of individual presentations instead (Kanfer & Saslow, 1965; Ullman & Krasner, 1975). The purpose of this chapter is to review the issues related to the assessment, diagnosis, and classification of psychopathology. The chapter will begin with a brief outline of the history of medical and psychopathological classification. Next, it will review the systems in use in the late 1990s, and evaluate these systems from the standpoint of clinical utility, research facilitation, and scientific understanding. It will then discuss the alternative approaches and future directions in the classification of psychopathology. Finally, the role of assessment strategies in informing this debate will be examined.

3.08.2 HISTORY OF CLASSIFICATION

3.08.2.1 Medical Classification

As Clementz and Iacono (1990) point out, medical classification systems have relied historically on careful observation of patients and their symptoms. The importance of observation as an adjunct to theorizing was recognized as long ago as the seventeenth century by thinkers such as Thomas Sydenham (1624–1689). Observations were used to make inferences about the causes of disease. The practice of medical observation and classification gained momentum with the advent of medical instruments. The most notable of these was the stethoscope, developed in 1819 by Rene Theophile Hyacinthe-Laennec (1787–1826). This advance in instrumentation led to the use of objective signifiers of pathology in diagnosing disease and a diminished interest in less technologically sophisticated observations or subjective reports of symptoms. The continuous development of new technologies increased the ability of physicians to determine the function of organs and their modes of operation when healthy. This novel conceptualization of dysfunction as related to the functioning of organs led the French physician Broussais (1772–1838) to propose the radical idea that specific symptom clusters could not define a disease adequately because of the high overlap between different disorders. He suggested that in order to identify a disease one needs to study the patient's physiological processes and constitution. Mid-nineteenth century advances in the various basic sciences (such as anatomy, histology, and biology) led to the further decline of the practice of observation in favor of the experimental study of physiological disease processes. Although this emphasis on understanding function was a great benefit to the various basic sciences, it culminated in the near abandonment of the practice of clinical observation (Clementz & Iacono, 1990). The increasing popularity of laboratory research also resulted in an emphasis on specific symptoms, rather than on the phenomenology of disease. This topographic or symptom-based approach to the study of pathology lent itself to a descriptive system of classification. However, this approach had its critics, as far back as the 1800s. Some investigators (e.g., Trousseau, 1801–1867) believed that a more comprehensive approach to diagnosis, incorporating both clinical observation and laboratory findings, would better allow the recognition of disorders and their treatment.

The field of genetics provided another major development in the science of classification. Mendel, working in an isolated monastery, pioneered the science of genetics. His efforts were continued by others (e.g., Garrod, 1902) who applied his work to humans. Watson and Crick's (1953a, 1953b) detailed description of the structure of human genetic material supplied yet another powerful tool for identifying the presence and etiology of certain disorders. A classification system based on genetic and biological etiology may seem promising. However, researchers soon realized that the specificity of genetics is less than perfect. This lack of specificity seemed particularly evident in the psychiatric disorders. We now know several psychiatric disorders are at least partly genetic. This is revealed by the finding that individuals with identical genetic make-up have higher concordance rates for developing the same disease than those who do not. However, because the concordance rate between identical twins is less than 100% (e.g., schizophrenia; Dworkin, Lenzenweger, Moldin, & Cornblatt, 1987), not all the variance in psychiatric disorders can be explained by genetic inheritance. These findings have led to the view that genetic make-up provides a diathesis for a particular disorder, which, when interacting with environmental factors, may produce a disease.

3.08.2.2 Early Nosology of Psychopathology

Emil Kraepelin (1856–1926) was the first to adapt a medical classification framework for psychiatric illnesses (Blashfield, 1991; Kraepelin, 1971). He considered psychopathology to be the result of underlying disease states and advocated the scientific investigation of illnesses. He coined the term "dementia praecox" (later known as schizophrenia) to describe what had previously been considered distinct pathologies. Kraepelin's thinking regarding psychopathology and classification was shaped by the rapid advances in German medical science in the nineteenth century and by his training in behaviorism from Wilhelm Wundt (Berrios & Hauser, 1988). German medicine during Kraepelin's lifetime emphasized the interpretation of mental disorders as diseases of the brain (Menninger, 1963).

Kraepelin's training with Wundt contributed to his use of behavioral descriptions for clusters of symptoms he believed were linked by a common etiology. Kraepelin's psychopathology classification became influential through the publication of his textbooks on psychiatry. His categories later became the basis for the first official classification adopted by the American Psychiatric Association (APA) in 1917 (Menninger, 1963), and revised in 1932 (APA, 1933).

A contemporary of Kraepelin's, Sigmund Freud (1856–1939), was also influential in forming our understanding of psychopathology. In contrast to Kraepelin, Freud placed little emphasis on the phenomenology of disorders, stressing the importance of diagnoses based on the underlying cause of patients' manifest symptoms. Kraepelin, however, opposed Freudian theory and psychoanalytic practice because of its nonempirical orientation (Kahn, 1959). Freud's theories were well articulated, however, and by the 1950s had come to dominate American psychiatry (Blashfield, 1991). Psychoanalytic thought did not lose its foothold until the community-based mental health movement of the 1960s, which pointed out that psychoanalysis was only accessible to the wealthy. Other factors that contributed to the decline of psychoanalytic thought include the development of alternative interventions such as psychotropic medications and behavioral intervention, as well as a general shift toward empiricism. Around this time, there was a resurgence of Kraepelinian thought, mainly by a group of psychiatrists at Washington University in St. Louis (Robins & Guze, 1970). They believed that psychiatry had drifted too far from its roots in medicine, and that it was necessary to identify the biological bases of psychopathology. They advocated an emphasis on classification as a tool for helping psychopathology to evolve as a field (Blashfield, 1991; Klerman, 1978).

3.08.3 CLASSIFICATION OF MENTAL ILLNESS

3.08.3.1 History of Classification of Mental Illness

After World War II, the armed forces developed a more comprehensive nomenclature to facilitate the treatment of World War II servicemen. During this same period, the World Health Organization (WHO) published the first version of the International classification of diseases (ICD-6; WHO, 1992) that included psychiatric disorders. The authors of the ICD-6 relied heavily on the work of various branches of the armed forces in the development of their taxonomy of psychopathology. The first version of the Diagnostic and statistical manual of mental disorders (DSM), a variant on the ICD-6, was published in 1952 by the APA's Committee on Nomenclature and Statistics. This nomenclature was designed for use in the civilian population, and focused on clinical utility (APA, 1994). The DSM (APA, 1994) was originally created as a means of standardizing the collection of statistics in psychiatric hospitals in the early 1900s. By the release of the second edition of the DSM (DSM-II) (1975), the authors had moved toward the elimination of exclusion rules and provision of more explicit and descriptive diagnostic criteria. The inclusion of explicit diagnostic criteria in the DSM-II was inspired by the 1972 work of a group of Neo-Kraepelinian theorists, Feighner, Baran, Furman, and Shipman at the Washington University School of Medicine in St. Louis (Feighner et al., 1972). This group had originally created a six-category system as a guideline for researchers in need of homogeneous subject groups.

In 1978 and 1985, Spitzer and his colleagues modified and expanded the Washington University system and labeled it the Research Diagnostic Criteria (RDC). The RDC established criteria for 25 major categories, with an emphasis on the differential diagnosis between schizophrenia and the affective disorders. The authors of the DSM-III (1980) followed the example of the Washington University and RDC groups, including diagnostic criteria for each disorder and exclusion criteria for 60% of all DSM-III disorders. Most of these exclusion rules described hierarchical relationships between disorders, in which a diagnosis is not given if its symptoms are part of a more pervasive disorder.

As a result of these changes, the DSM-III and DSM-III-R are based almost exclusively on descriptive criteria. These criteria are grouped into distinct categories representing different disorders. The DSM-IV continues this tradition, using a categorical model in which patients are assigned to classes that denote the existence of a set of related signs and symptoms. The constituent signs and symptoms are thought to reflect an underlying disease construct. Researchers who advocate a categorical model of disease (e.g., Kraepelin) propose that disorders differ qualitatively from each other and from the nondisordered state. Although later editions of the DSM have relied progressively more on research findings in their formulation, the stated goal of DSM-IV remains the facilitation of clinical practice and communication (Frances et al., 1991). The DSM system has evolved as a widely used manual in both clinical and research settings. However, the structure and the stated goals of the DSM have been at the core of many controversies regarding the assessment, diagnosis, and classification of psychopathology.

3.08.3.2 Current Classification Systems

Two systems of classification are currently in common use: the DSM-IV (APA, 1994) and the ICD-10 (WHO, 1992). A detailed description of these systems is beyond the scope of this chapter and the interested reader should consult recent reviews of these systems (Frances, 1998; Regier et al., 1998; Spitzer, 1998). A brief overview of each system is provided below.

The DSM-IV is a multiaxial classification system. The individual is rated on five separate axes. Axis I includes all diagnoses except personality disorders and mental retardation, both of which are rated on Axis II. The rationale for the separation of these axes is to ensure that the presence of long-term disturbances is not overlooked in favor of current pathology. Together, these axes constitute the classification of abnormal behavior. The remaining three axes are not needed to make a diagnosis but contribute to the recognition of factors, other than the individual's symptoms, that should be considered in determining the person's diagnosis. General medical conditions are rated on Axis III, psychosocial and environmental problems are coded on Axis IV, and the individual's current level of functioning is rated on Axis V.

The 10th Revision of the ICD (ICD-10) is a multiaxial system. Axis I includes clinical diagnoses for both mental and physical disorders, Axis II outlines disabilities due to impairments produced by the disorder, and Axis III lists the environmental, circumstantial, and personal lifestyle factors that influence the presentation, course, or outcome of disorders. ICD-10 differs from other multiaxial classifications in that it: (i) records all medical conditions on the same axis (Axis I), (ii) assesses comorbid disorders (Axis II) in specific areas of functioning without ascribing a specific portion to each disorder, and (iii) allows expression of environmental (Axis III) factors determined by clinical practice and epidemiological evidence (Janca, Kastrup, Katschnig, & Lopez-Ibor, 1996).

One purpose for the collaboration between DSM-IV and ICD-10 authors was to foster the goal of the transcultural applicability of ICD-10 (Uestuen, Bertelsen, Dilling, & van Drimmelen, 1996). A multiaxial coding format in both DSM-IV and ICD-10 was adopted in order to provide unambiguous information with maximum clinical usefulness in the greatest number of cases (WHO, 1996). Although not explicitly stated by the authors of either manual, both the DSM-IV and ICD-10 are examples of the "neo-Kraepelinian revolution" in psychiatric diagnostic classification (Compton & Guze, 1995). Progress toward the shared goals of the two systems falls into two areas: general clinical use of the systems, and fostering international communication and cultural sensitivity. Regarding general clinical use, a multicenter field trial evaluating aspects of the ICD-10 and DSM-IV involving 45 psychiatrists and psychologists in seven centers was conducted in Germany (Michels et al., 1996). Results revealed moderate inter-rater reliability for the ICD-10 Axis I and Axis II. However, the number of relevant psychosocial circumstances coded on Axis III by the different raters varied greatly. The authors concluded that the multiaxial system was generally well accepted by participating clinicians, and that it is worth studying empirically and revising accordingly.
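Inter-rater reliability of the kind reported in such field trials is commonly summarized with a chance-corrected agreement index. The following is a minimal sketch in Python of one such index, Cohen's kappa, for two raters assigning categorical diagnoses. It is offered only as an illustration: the field trial described above is not stated here to have used this particular statistic, and the function name, diagnostic labels, and ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the two raters assign the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of ten patients by two clinicians (illustrative data only).
clinician_1 = ["MDD"] * 5 + ["GAD"] * 5
clinician_2 = ["MDD", "MDD", "MDD", "MDD", "GAD", "GAD", "GAD", "GAD", "GAD", "MDD"]
kappa = cohen_kappa(clinician_1, clinician_2)
print(round(kappa, 2))
```

In this made-up example the raters agree on 8 of 10 cases, but half of that agreement would be expected by chance given their label frequencies, so the chance-corrected value is lower than the raw 80% agreement.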

Clinicians specializing in several specific areas understanding of mental illness may be mis- have not been as positive in their comments guided (Patel & Winston, 1994). Specifically, about the ICD-10. For instance, Jacoby (1996) while mental illness as a phenomenon may be has argued that neither the DSM-IV nor the universal, specific categories of mental illness as ICD-10 is adequate in its categorization of outlined by the DSM and ICD may require certain disorders associated with the elderly, identification and validation of particular such as certain dementias and psychotic dis- diagnoses within specific cultures (Patel & orders. Others have argued that specific ICD-10 Winston, 1994). diagnoses show poor levels of stability of diagnosis across time. Specifically, a 1997 study 3.08.3.3 Tools for Clinical Assessment found that neurotic, stress-related, adjustment, generalized anxiety, and panic disorders as well A comprehensive review of assessment mea- as some psychoses and personality disorders sures for psychopathology is beyond the scope showed low rates of reliability across interviews of this chapter. The interested reader should at different time points (Daradkeh, El-Rufaie, consult comprehensive reviews of this topic (e.g., Younis, & Ghubash, 1997). Baumann, 1995; Bellack & Hersen, 1998; Issues regarding the fostering of international Sartorius & Janca, 1996). One useful method communication and cultural sensitivity were of classifying assessment techniques is to addressed by a Swiss study published in 1995 consider two classes of measures: those that that assessed the inter-rater reliability and aim to aid in diagnosis and those that aim to confidence with which diagnoses could be made assess the severity of symptoms relatively using the ICD Diagnostic Criteria for Research independent of diagnosis. 
Examples of the first (ICD-10 DCR), as well as examining the category include The Structured Clinical Inter- concordance between ICD-10 DCR, ICD-10 view for DSM-IV Diagnoses (SCID; First, Clinical Descriptions and Diagnostic Guidelines, Spitzer, Gibbon, & Williams, 1995) and the and other classification systems including the Composite International Diagnostic Interview DSM-IV. Field trials were carried out at 151 (CIDI; Robins et al., 1988). These measures are clinical centers in 32 countries by 942 clinician/ mapped closely on the diagnostic systems on researchers who conducted 11 491 individual which they are based (DSM-IV and ICD-10, patient assessments. The authors report that respectively). The second class of assessment most of the clinician/researchers found the instruments use a dimensional approach and criteria to be explicit and easy to apply and the aim to assess severity of symptoms relatively inter-rater agreement was high for the majority independently of diagnosis. Examples of such of diagnostic categories. In addition, their instruments include the Hamilton Depression results suggested that the use of operational Inventory (HAM-D; Riskind, Beck, Brown, & criteria across different systems increases levels Steer, 1987) and the Yale-Brown Obsessive- of inter-rater agreement. Compulsive Scale (Y-BOCS; Goodman, Price, Regier, Kaelber, Roper, Rae, and Sartorius Rasmussen, & Mazure, 1989a). The advantages (1994) cited findings suggesting that while and disadvantages of these assessment techni- overall inter-rater reliability across 472 clin- ques are tied closely to advantages and dis- icians in the US and Canada was good, the advantages of the categorical and dimensional clinician tended to make more use of multiple approaches to classification and will be dis- coding for certain disorders than clinicians from cussed in Section 3.08.5.2.3. other countries. 
This suggests that certain An important aspect of developing new aspects of the DSM system (e.g., its encourage- knowledge in psychopathology is the improve- ment of multiple diagnoses) may make the ment and standardized use of assessment tools transition and agreement between the two across studies of psychopathology. The vast systems somewhat more difficult. majority of studies of DSM disorders have used The ICD's efforts to place psychiatric self-report or interview-based measures of disorders in the context of the world commu- symptoms. In some cases, behavioral checklists nity's different religions, nationalities, and (either self- or other-reports) or psychological cultures has received praise in the literature tests have been employed. In this section, issues (Haghighat, 1994), while the DSM-IV may be such as the accuracy, reliability, and discrimi- seen as somewhat less successful in achieving nate validity of such assessment tools and how this goal. The ICD's increased cultural sensi- they may influence findings in psychopathology tivity was at the expense of increased length, but studies will be examined. efforts to omit cultural-biased criteria were Self-report measures may be the most time- largely successful. However, other authors have and labor-efficient means of gathering data argued that the attempts to improve cultural about psychological symptoms and behaviors. sensitivity to the degree of a ªuniversalº However, individuals often are inconsistent in 174 Assessment of Psychopathology: Nosology and Etiology their observations of their own behaviors, and fore, in order to efficiently discriminate between the usefulness of these reports depends heavily disorders, future research should emphasize on often limited powers of self-observation. identification of symptoms which optimally Psychometric methods for enhancing the useful- discriminate between diagnoses. 
Watson and ness of self-report data generally focus on Clark (1992) found that even when factor increasing the consistency (i.e., reliability and analytically derived mood measures such as accuracy or validity) of the measures. A number the Profile of Moods (POMS; McNair, Lorr, & of psychometrically sophisticated scales are Droppleman, 1981) and the Positive and available for several different types of disorders Negative Affectivity Scale (PANAS; Watson, (e.g., anxiety; BAI; Beck, Epstein, Brown, & Clark, & Tellegen, 1988) are used, certain basic Steer, 1988; PSS; Foa, Riggs, Dancu, & affects (i.e., anxiety and depression) are only Rothbaum, 1993; Y-BOCS; Goodman et al., partially differentiable. Their data suggest that 1989a, 1989b). Many of the traditional psycho- the overlap between basic affects represents a metric approaches classify disorders into state shared component inherent in each mood state, (transitory feelings) and trait (stable personality) which must be measured in order to understand attributes (e.g., state anxiety vs. trait anxiety, the overlap between different mood states and STAI; Spielberger, Gorsuch, & Lushene, 1970). disorders. The accuracy of these scales is most often evaluated by comparing results on the measure 3.08.4 CURRENT ISSUES IN THE to results on other measures of the same emotion CLASSIFICATION OF or disorder (e.g., interviews, physiological PSYCHOPATHOLOGY assessments, observations of behavior). By assessing various aspects of an emotion or As noted earlier, the classification of psycho- disorder, the investigator tries to create a pathology has been a controversial topic since composite gauge of an emotion for which no inception. Currently, the discussions revolve single indicator offers a perfect yardstick. primarily around the DSM system, although One impediment to discovering which symp- many of the debates predate the DSM. 
The most toms may be shared across disorders has been common topics of controversy in the classifica- the structure of many clinical measures. Ex- tion of psychopathology are: (i) the definition of amples of widely-used measures of psychologi- psychopathology; (ii) the artificially high rates cal states are the Beck Depression Inventory of comorbidity found between disorders when (BDI; Beck & Steer, 1987), the Beck Anxiety using the DSM system (Frances, Widiger, & Inventory (BAI; Beck et al., 1988), the Minne- Fyer, 1990; Robins, 1994; (iii) the balance sota Multiphasic Personality Inventory between a focus on clinical utility and the (MMPI-2; Butcher, Dahlstrom, Graham, Telle- facilitation of research and scientific progress gen, & Kaemmer, 1989), and structured inter- (Follette, 1996); and (iv) organizational pro- view scales such as the Diagnostic Inventory blems, both within and across disorders in the Scale (DIS; Helzer & Robins, 1988) and the DSM (Brown & Barlow, 1992; Quay, Routh, & Structured Clinical Interview for DSM-IV Shapiro, 1987). Diagnoses (SCID; First et al., 1995). These self-report and clinician-rated scales 3.08.4.1 Definition of Psychopathology usually assess ªmodalº symptoms, as they focus on core aspects of each syndrome rather than on There is a lack of agreement about the all possible variants. Structured interviews, such definition of psychopathology. Early versions as the SCID-IV or the DIS, often allow ªskip of both the ICD and the DSM attempted to guide outsº in which the interviewer need not the categorization of mental disorders without necessarily assess all of the symptoms of all addressing the definition of psychopathology disorders. Many studies have examined the (Spitzer, Endicott, & Robins, 1978). Some convergent and divergent validity patterns of contend that this lack of agreement about the self-report (e.g., modal) measures of various definition of psychopathology remains the disorders (e.g., Foa et al., 1993). 
These measures current state of affairs (e.g., Bergner, 1997). tend to yield strongly convergent assessments of Others (e.g., Kendell, 1975, 1982) have noted their respective syndromes, but there is little that physicians frequently diagnose and treat specificity in their measurement, especially in disorders for which there is no specific definition, nonpatient samples. The data often suggest the and which, in many cases, are not technically presence of a large nonspecific component considered disorders (e.g., pregnancy). The shared between syndromes such as anxiety controversy over the definition of mental and depression (Clark & Watson, 1991). Some disorder may be fueled by the fact that such scales appear to be more highly loaded with the illnesses are signified commonly by behaviors, nonspecific distress factor than others. There- and distinguishing between ªnormalº and Current Issues in the Classification of Psychopathology 175

"deviant" or "disordered" behaviors is viewed by some as having serious cultural and sociological implications (Mowrer, 1960; Szasz, 1960). According to Gorenstein (1992), attempts at defining mental illness have fallen historically into several categories. Some have relied on statistical definitions based on the relative frequency of certain characteristics in the general population. Others have employed a social definition in which behaviors which conflict with the current values of society are considered deviant. Still other approaches have focused on the subjective discomfort caused by the problem, as reported by the individual. Finally, some definitions of psychopathology have been based on psychological theories regarding states or behaviors thought to signify problems within the individual. The DSM-IV (APA, 1994) defines a mental disorder as a "clinically significant behavioral or psychological syndrome or pattern that occurs in an individual and . . . is associated with present distress . . . or disability . . . or with a significant increased risk of suffering death, pain, disability or an important loss of freedom." The disorder should not be an "expectable and culturally sanctioned response to an event," and must be considered a manifestation of dysfunction in the individual. Wakefield has expanded and refined this idea in his concept of "harmful dysfunction" (Wakefield, 1992, 1997c). This concept is a carefully elaborated idea considered by some to be a workable solution to the problem of defining abnormality (e.g., Spitzer, 1997). In essence, the harmful dysfunction concept means that behaviors are abnormal to the extent that they imply something is not working properly as would be expected based on its evolutionary function. 
Specifically, Wakefield (1992) states that a mental disorder is present

if and only if, a) the condition causes some harm or deprivation of benefit to the person as judged by the standards of the person's culture (the value criteria), and b) the condition results from the inability of some mental mechanism to perform its natural function, wherein a natural function is an effect that is part of the evolutionary explanation of the existence and structure of the mental mechanism (the explanatory criterion). (p. 385)

Critics of Wakefield's concept argue that it represents a misapplication of evolutionary theory (Follette & Houts, 1996). They contend that evolutionary selection was not meant to apply on the level of behavioral processes, that is, it is not possible to know the function of a part or process of the individual by examining evolutionary history because random variation precludes a straightforward interpretation of the "design" of various parts of organisms (Mayr, 1981; Tattersall, 1995). Proponents of the "harmful dysfunction" definition suggest that it may provide a useful starting point for research in psychopathology, although further research needs to identify the specific mechanisms that are not functioning properly (Bergner, 1997).

3.08.4.2 Comorbidity

The term comorbidity was first used in the medical epidemiology literature and has been defined as "the co-occurrence of different diseases in the same individual" (Blashfield, 1990; Lilienfeld, Waldman, & Israel, 1994). Many factors potentially may contribute to the comorbidity rates of psychiatric disorders. Reported comorbidity rates are influenced by the actual co-occurrence of disorders, the populations considered, the range, severity, and base rates of the disorders considered, the method of assessment, and the structure of the classification system used.

3.08.4.2.1 Actual co-occurrence of disorders

In the medical literature, comorbidity often refers to diagnoses which occur together in an individual, either over one's lifetime or simultaneously. This concept emphasizes the recognition that different diagnoses potentially are related in several ways. One disease can signal the presence of another, predispose the patient to the development of another, or be etiologically related to another disorder (Lilienfeld et al., 1994). Lilienfeld et al. suggest that the increased attention to comorbidity in psychopathology is due to acknowledgment of the extensive co-occurrence and covariation that exists between diagnostic categories (Lilienfeld et al.; Widiger & Frances, 1985). For example, Kessler et al. (1994) reported the results of a study on the lifetime and 12-month prevalence of DSM-III-R disorders in the US, in a random sample of 8098 adults. Forty-eight percent of the subjects reported at least one lifetime disorder, with the vast majority of these individuals, 79%, reporting comorbid disorders. Other studies using large community samples report that more than 50% of the participants diagnosed with one DSM disorder also meet criteria for a second disorder (Brown & Barlow, 1992). It has been argued that the common etiological factors across different diagnoses are of greater importance than the etiological factors specific to one disorder (Andrews, 1991; Andrews, Stewart, Morris-Yates, Holt, & Henderson, 1990).

176 Assessment of Psychopathology: Nosology and Etiology

3.08.4.2.2 Populations considered

One factor that affects comorbidity rates is the population under study. Kessler et al. (1994) found that people with comorbid disorders were likely to report higher utilization of services than those without a comorbid disorder. Similarly, higher rates of comorbidity were found among urban populations than rural populations, despite the higher rates of single psychiatric disorders and financial problems among rural participants. The race of the population studied also seems to influence comorbidity rates. African-American participants in the Kessler et al. study reported lower comorbidity rates than Caucasian participants, after controlling for the effects of income and education. However, Caucasian participants reported lower rates of current comorbid disorders than Hispanics. Disparities in prevalence of reported comorbidity were also found between respondents of different age groups, with people aged between 25 and 34 years reporting the highest rates. The income and education of the participants were also associated with reported comorbidity in the Kessler et al. study.

3.08.4.2.3 Range, severity, and base rates of disorders considered

The rates of comorbidity are also influenced by the disorders studied. Specifically, the range, severity, and base rates of certain disorders increase the likelihood that these disorders will be comorbid with another disorder. Certain diagnostic categories, including childhood disorders (Abikoff & Klein, 1992; Biederman, Newcorn, & Sprich, 1991), anxiety disorders (Brown & Barlow, 1992; Goldenberg et al., 1996), and personality disorders (Oldham et al., 1992; Widiger & Rogers, 1989), appear to be comorbid with other disorders. For example, 50% of patients with a principal anxiety disorder reported at least one additional clinically significant anxiety or depressive disorder in a large-scale study by Moras, DiNardo, Brown, and Barlow (1991). Similarly, anxiety disorders are not only highly likely to be comorbid with each other but also with mood, substance use, and personality disorders (Brown & Barlow, 1992). These high rates of comorbidity may in part be due to the degree to which the different anxiety disorders include overlapping features (Brown & Barlow, 1992). Finally, the degree of comorbidity is influenced directly by thresholds set to determine the presence or absence of various disorders. The choice of threshold appears to affect comorbidity rates differentially, depending on the disorder. For instance, certain disorders (e.g., social phobia) appear to be more likely to accompany other anxiety disorders when considered at subclinical levels (Rapee, Sanderson, & Barlow, 1988). Conversely, Frances et al. (1990) have suggested that the severity of mental health problems in a sample will influence comorbidity rates, in that a patient with a severe presentation of one disorder is more likely to report other comorbid disorders.

In addition to the range and severity thresholds of disorders, the base rates of a particular disorder have a strong influence on the apparent comorbidity rates. Conditions that are frequently present in a given sample will tend to be diagnosed together more often than those that are infrequent. This may help explain the comorbidity rates of certain highly prevalent disorders such as anxiety, depression, and substance abuse.

3.08.4.2.4 Assessment methods

The choice of assessment methods may influence comorbidity rates in much the same way as aspects of the disorders. Disagreement regarding comorbidity rates may be partly due to differences in the definition of comorbidity. For example, comorbidity has been defined as within-episode co-occurrence (or dual diagnosis) among disorders by some (August & Garfinkel, 1990; Fulop, Strain, Vita, Lyons, & Hammer, 1987), lifetime co-occurrence by others (Feinstein, 1970; Shea, Widiger, & Klein, 1992, p. 859), and covariation among diagnoses (i.e., across individuals) by still other researchers (Cole & Carpentieri, 1990; Lewinsohn, Rohde, Seeley, & Hops, 1991). Even when researchers agree on a definition, their estimates of comorbidity may differ based on the type of assessment tool they use. As Lilienfeld et al. (1994) point out, assessment techniques have error variance that by definition is not related to the construct(s) of interest or may artificially inflate the actual comorbidity rate.

Furthermore, individual raters may hold biases or differing endorsement thresholds for behaviors which are common to a variety of disorders. Likewise, raters may have specific beliefs about which disorders tend to covary. These types of biases may affect both self-report and interviewer-based data. Similarly, studies utilizing structured interviews may differ from studies in which severity thresholds are not described as concretely. Therefore, differing rates of comorbidity across studies may be an artifact of the types of measurement used, or the biases of raters involved (Zimmerman, Pfohl, Coryell, Corenthal, & Stangl, 1991).

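The base-rate argument in Section 3.08.4.2.3 can be made concrete with simple arithmetic. The Python sketch below assumes two disorders that are statistically independent, so that any co-occurrence is due to chance alone. The prevalence figures and the function name are hypothetical and are not taken from the studies cited.

```python
def chance_cooccurrence(base_rate_a: float, base_rate_b: float, sample_size: int) -> float:
    """Expected number of people meeting criteria for both disorders,
    assuming the two disorders are statistically independent."""
    return base_rate_a * base_rate_b * sample_size


# Hypothetical base rates in a hypothetical sample of 1,000 people.
common_pair = chance_cooccurrence(0.20, 0.15, 1000)  # two prevalent conditions
rare_pair = chance_cooccurrence(0.02, 0.01, 1000)    # two infrequent conditions

print(f"Expected chance co-occurrence, common pair: {common_pair:.1f} people")
print(f"Expected chance co-occurrence, rare pair: {rare_pair:.1f} people")
```

Under these assumed figures the prevalent pair would co-occur by chance in about 30 people and the infrequent pair in fewer than one, which suggests that an observed comorbidity rate is most informative when compared against a chance baseline of this kind.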
3.08.4.2.5 Structure of the classification system diagnosis. The symptoms are not weighted, implying that they are of equal importance in Frances et al. (1990) argue that the classifica- defining the disorder. For many diagnoses, the tion system used in the various versions of the structure of this system makes it possible for two DSM increases the chance of comorbidity in individuals to meet the same diagnostic criteria comparison to other, equally plausible systems. without sharing many symptoms. Conversely, it The early systems (e.g., DSM-III) attempted to is possible for one person to meet the criteria address the comorbidity issue by proposing while another person who shares all but one elaborate hierarchical exclusionary rules speci- feature does not meet the criteria. As a result, fying that if a disorder was present in the course patients who actually form a fairly heteroge- of another disorder that took precedence, the neous group may be ªlumpedº into one second disorder was not diagnosed (Boyd et al., homogeneous diagnostic category. 1984). Thus, disorders which may have been Combined with unweighted symptoms and a truly comorbid were overlooked. Examination lack of attention to severity of symptoms, this of these issues resulted in the elimination of this ªlumpingº can lead to what Wakefield (1997a) exclusionary criterion in DSM-III-R. This new refers to as ªoverinclusivenessº of diagnostic method of diagnosis, however, has been criteria. Specifically, people who do not truly criticized for artificially inflating comorbidity suffer from a mental disorder may nonetheless between various disorders. receive the diagnosis, thus lowering the con- Frances et al. (1990) point out that new ceptual validity of DSM diagnostic categories. editions of the DSM expanded coverage by Conversely, minute differences in reported adding new diagnoses. This was done by symptoms may result in dichotomizing between ªsplittingº similar disorders into subtypes. 
They disordered and nondisordered individuals, or argue that the tendency to split diagnoses between individuals with one disorder as creates much more comorbidity than would opposed to another (e.g., avoidant personality ªlumpingº systems of classification. This is disorder vs. generalized social phobia), creating because disorders that share basic underlying heterogeneity where there may actually be features are viewed as separate. none. Much of the ªsplittingº in the DSM has This point is further elaborated by Clark et al. resulted from the increasing reliance on proto- (1995), who point out that within-category typical models of disorders (Blashfield, 1991). heterogeneity constitutes a serious challenge to The creators of the DSM increasingly have the validity of the DSM. These authors view relied on prototypes, defining a diagnostic comorbidity and heterogeneity as related category by its most essential features, regard- problems, that is, within-group heterogeneity less of whether these features are also present in of symptoms found across diagnostic cate- other diagnoses. McReynolds (1989) argued gories leads to increased rates of comorbidity that categories with a representative prototype among the disorders. Homogenous categories, and indistinct or ªfuzzyº boundaries are the on the other hand, lead to patient groups that basis of the most utilitarian classification share defining symptoms and produce lower systems because they are representative of rates of both comorbidity and heterogeneity. categories in nature. The use of this type of This is in part because of polythetic systems prototype-based classification has improved the that require some features of a disorder for sensitivity and clinical utility of the DSM diagnosis. In contrast, a monothetical system system. However, these gains are achieved at would require all features of a disorder for the expense of increased comorbidity and diagnosis. 
Because the monothetic approach to decreased specificity of differential diagnosis diagnosis produces very low base rates for any due to definitional overlap. Thus, this system of diagnosis (Morey, 1988), researchers generally diagnostic classification makes it unclear have used the polythetic approach. However, whether there is any true underlying affinity this approach promotes within-category het- between disorders currently considered to have erogeneity because individuals who do not high rates of comorbidity. share the same symptom profiles can receive The stated approach of the DSM-IV is a the same diagnosis. This poses a problem for descriptive method relying on observable signs any categorical system, including the DSM. and symptoms rather than underlying mechan- Because of the inherent heterogeneity in patient isms. The sets of signs and symptoms that profiles, a categorical system must address this constitute the different diagnoses are, by issue by either proposing artificial methods of definition, categorical, that is, the presence of limiting heterogeneity (e.g., DSM-III), using the required number and combination of unrealistic homogeneous categories, or ac- symptoms indicates a diagnosis and the absence knowledging the heterogeneity (e.g., DSM- of the required number of symptoms precludes a IV; Clark et al., 1995). Schizophrenia is an 178 Assessment of Psychopathology: Nosology and Etiology example of a diagnosis that has proved to have 3.08.4.3 Clinical and Research Applications of within-group heterogeneity. For example, cur- Classification Systems rent nosology (e.g., DSM-IV) assumes that schizophrenia and affective illness are distinct Classification serves various purposes, de- disorders. However, shared genetic vulnerabil- pending on the setting in which it is used. In ity has been proposed for schizophrenia and clinical settings, classification is used for some affective disorders (Crow, 1994). 
Taylor treatment formulation, whereas in research (1992) reviewed the evidence for this continuum settings it allows researchers to formulate and and suggested that the discrimination of these communicate ideas. The DSM-IV taskforce has disorders by their signs and symptoms is stated that the various uses of classification are inadequate (Taylor & Amir, 1994). The need usually compatible (APA, 1994, p. xv). It is not to know whether psychoses vary along a clear whether this assertion has been tested continuum is obviously critical to our under- empirically, or whether steps have been taken to standing of their pathogenesis and etiology. resolve any divergence between the various uses To address within-group heterogeneity of of classification. The goals of clinical utility and disorders, researchers have attempted to create outcome of research are not inherently incon- subtypes (e.g., positive vs. negative symptoms of sistent, but situations may arise in which the two schizophrenia; Andreasen, Flaum, Swayze, diverge in their application. The majority of the Tyrell, & Arndt, 1990). Consistent with this modifications made to recent versions of the conceptualization, researchers have correlated DSM were designed to improve clinical utility negative symptoms of schizophrenia with by simplifying or improving everyday diagnos- cognitive deficits (Johnstone et al., 1978), poor tic procedures. While not necessarily empirically premorbid adjustment (Pogue-Geile & Harrow, driven, these changes do not appear to have 1984), neuropathologic and neurologic abnorm- adversely impacted the validity of the diagnoses alities (Stevens, 1982), poor response to neuro- they affect. Assigning diagnoses facilitates leptics (Angrist, Rotrosen, & Gershon, 1980), information storage and retrieval for both and genetic factors (Dworkin & Lenzenweger, researchers and clinicians (Blashfield & Dra- 1984). On the other hand, positive symptoms guns; 1976; Mayr, 1981). 
However, problematic have been correlated with attention (Cornblatt, issues have been raised about the use and Lenzenweger, Dworkin, Erlenmeyer-Kimling, structure of diagnostic systems in both clinical 1992) and a positive response to neuroleptics and research settings. (Angrist et al., 1980). Despite these differences, The use of classification and the reliance on some investigators have questioned the relia- the DSM appears to be increasing (Follette, bility of these findings (Carpenter, Heinrichs, & 1996). This is partly because of the trend toward Alphs, 1985; Kay, 1990) and the clinical clarity the development and empirical examination of of the subtypes (Andreasen et al., 1990). The treatments geared toward specific diagnoses majority of these studies included patients with (Wilson, 1996). Although the use of diagnostic schizophrenia but not affective disorders, and classification is beneficial in conceptualizing some assessed only limited symptoms (e.g., only cases, formulating treatment plans, and com- negative symptoms; Buchanan & Carpenter, municating with other providers, some have 1994; Kay, 1990). These approaches are poten- argued that assigning a diagnosis may have a tially problematic because they assume impli- negative impact on patients (Szasz, 1960), that citly that the psychoses are distinct, and that the is, decisions about what constitutes pathology specific types of psychopathology are easily as opposed to normal reactions to stressors may discriminated. be arbitrary (Frances, First, & Pincus, 1995), In summary, the issue of comorbidity poses gender-based, and culturally or socioeconomi- what is potentially the greatest challenge to cally based. Furthermore, the choice of which diagnosis. 
Factors including the populations behaviors are considered pathological (de considered, the range, severity, and base rates of Fundia, Draguns, & Phillips, 1971) may be the disorders considered, the method of assess- influenced by the characteristics of the client ment, and the structure of the classification (Broverman, Broverman, Clarkson, Rusenk- system contribute to reported comorbidity rantz, & Vogel, 1970; Gove, 1980; Hamilton, rates. These factors are often extremely difficult Rothbart, & Dawes, 1986). to distinguish from the true comorbidity of Diagnostic categories may also have adverse disorders. A full understanding of the common consequences on research. For example, factors and shared etiologies of psychiatric although the diagnostic categories defined by disorders will most likely require an under- the classifications systems are well studied and standing of their true comorbidity. It is likely, easily compared across investigations, patients then, that this will remain a highly controversial who do not clearly fit a diagnosis are ignored. topic in psychopathology assessment. Furthermore, frequent and generally clinically- Current Issues in the Classification of Psychopathology 179 driven changes in the diagnostic criteria for a neutral while it may arguably be considered particular disorder make sustained investiga- theory-bound. Second, because the authors of tion of the disorder difficult (Davidson & Foa, the DSM do not explicitly recognize any theory, 1991). This situation is further exacerbated by specific theories cannot be empirically tested the frequent addition of new diagnostic cate- against competing theories of psychopathology. gories to the DSM (Blashfield, 1991). 
The DSM-IV taskforce addressed this issue by explicitly recommending that the diagnostic 3.08.4.4 Organizational Problems in the DSM classifications be used as a ªguideline informed by clinical judgment,º as they ªare not meant to Several of the issues regarding assessment and be used in a cookbook fashionº (APA, 1994, classification revolve around the organization p. xxiii). This suggestion, while useful in a of the DSM. These problems can be classified clinical setting, may not be as applicable in into two broad categories: problems with the research settings. multiaxial system and problems concerning the The proliferation of diagnostic criteria in placement and utilization of specific categories recent versions of the DSM have successfully of disorders (Clark et al., 1995). improved diagnostic reliability, but have done The multiaxial system of the DSM has no so at the expense of external validity (Clementz doubt accomplished its goal of increasing the & Iacono, 1990). That may adversely affect attention given to various (e.g., personality) research findings (Carey & Gottesman, 1978; aspects of functioning. However, the distinction Meehl, 1986). According to Clementz and between certain personality (Axis II) disorders Iacono (1990), the achievement of high relia- and some Axis I disorders are problematic. For bility in a diagnostic classification is often taken example, there is now ample data suggesting to signify either validity or a necessary first step that there is a high degree of overlap between toward demonstrating validity. However, this is avoidant personality disorder and generalized not the case in situations in which disorder social phobia (Herbert, Hope, & Bellack, 1992; criteria are designed significantly to increase Widiger, 1992). Similarly, it is difficult to reliability (i.e., by using a very restrictive distinguish schizotypal personality disorder definition). 
In such situations, validity may and schizophrenia on empirical grounds (Grove actually suffer greatly due to the increase in false et al., 1991). categorizations of truly disordered people as The second type of organizational issue relates unaffected. Conversely, the generation of to the placement and utilization of individual specific behavioral criteria (e.g., Antisocial disorders. For example, because post-traumatic Personality Disorder in DSM-III-R) may stress disorder (PTSD) shares many features increase reliability but lead to overinclusion (e.g., depersonalization, detachment) with the (e.g., criminals in the antisocial category) while dissociative disorders, some (e.g., Davidson & entirely missing other groups (e.g., ªsuccessfulº Foa, 1991) have argued for its placement with psychopaths who have avoided legal repercus- the dissociative disorders instead of the anxiety sions of the behavior) (Lykken, 1984). disorders. Likewise, the overly exclusive diag- Clearly there are possible negative impacts of nostic categories of the DSM have led to our current diagnostic system for research. situations in which clearly disordered patients More broadly, some have argued that nosolo- exhibiting symptoms of several disorders do not gical questions involve ªvalue judgmentsº fit into any specific category. The authors of the (Kendler, 1990, p. 971) and as such are DSM have attempted to remedy this by creating nonempirical. But, as Follette and Houts ªnot otherwise specifiedº (NOS) categories for (1996) point out, the classification of pathology, several classes of disorders. While at times it is even in physical medicine, requires the identi- difficult to distinguish the meaning of NOS fication of desirable endstates which are categories, they are often utilized in clinical culturally rather than biologically defined. practice. 
Clark et al (1995) view the high Furthermore, value judgments lie in the choices prevalence of subthreshold and atypical forms made between competing theories (Widiger & of the disorders commonly classified as NOS as Trull, 1993) and the attempt by the authors of contributing to the problem of heterogeneity. the DSM to remain theoretically ªneutralº is For example, various diagnoses including mood inherently problematic. In its attempt to remain disorders (Mezzich, Fabrega, Coffman, & ªneutral,º the DSM actually adopts a model of Haley, 1989), dissociative disorders (Spiegel & shared phenomenology, which implicitly ac- Cardena, 1991), and personality disorders cepts a theory of the underlying structure of (Morey, 1992) are characterized by high rates psychopathology (Follette & Houts, 1996). This of NOS. One method of combating the high implicit theory is problematic on several related NOS rates would be to create separate categories fronts. First, the DSM professes to be theory- to accommodate these individuals. For example, 180 Assessment of Psychopathology: Nosology and Etiology the rate of bipolar disorder NOS was reduced by must be made about the structure that best the inclusion of the subcategories I and II in suits their classification. Before undertaking an DSM-IV. Another method addressing the NOS investigation of the specific psychopathology problems is to include clear examples of classifications which have been proposed, the individuals who would meet the NOS criteria. structural design of diagnostic taxonomies in This second approach has been tested for some general will be outlined briefly. The frameworks potential diagnostic groups, such as the mixed suggested fall into three categories: hierarchical, anxiety-depression category (Clark & Watson, multiaxial, and circular. The hierarchical model 1991; Zinbarg & Barlow, 1991; Zinbarg et al., organizes disorders into sets with higher-order 1994). diagnoses subsuming lower-order classifica- tions. 
The multiaxial model assigns parallel roles for the different aspects of a disorder. 3.08.5 ALTERNATIVE APPROACHES TO Finally, the circular model assigns similar CLASSIFICATION disorders to adjoining segments of a circle and dissimilar disorders to opposite sides of the While many well-articulated criticisms of the circle (Millon, 1991). These three conceptuali- classification system have been presented, the zations are not mutually exclusive, and many literature contains far fewer suggested alter- taxonomies share aspects of more than one natives (Follette, 1996). There is great disagree- structure. Within each of these three general ment among authors who have suggested other frameworks, taxa may be considered categorical taxonomic models about not only the type of or dimensional. The current DSM combines model, but also its content, structure, and both hierarchical and multiaxial approaches. methodology. The study of psychopathology Circular frameworks, which generally employ has been pursued by researchers from different the dimensional approach, have been used in theoretical schools, including behaviorists, neu- theories of personality. The model that is the rochemists, phenomenologists, and psychody- basis for the DSM system, the Neo-Kraepelian namicists (Millon, 1991). These approaches rely model will be examined first, and then prototype on unique methodologies and produce data and dimensional models will be considered regarding different aspects of psychopathology. Finally, suggested methodological and statisti- No one conceptualization encompasses the cal approaches to improving the classification of complexity of any given disorder. These differ- psychopathology will be covered. ing views are not reducible to a hierarchy, and cannot be compared in terms of their objective value (Millon, 1991). 
Biophysical, phenomen- 3.08.5.2 Taxonomic Models ological, ecological, developmental, genetic, and behavioral observations have all been 3.08.5.2.1 Neo-Kraepelinian (DSM) model suggested as important contents to include in The Neo-Kraepelinian model, inspired by the the categorization of psychopathology. How to 1972 work of the Washington University group structure or organize content has also been a and embodied in the recent versions of the DSM, topic of much debate. In this section the is the current standard of psychopathology structural and methodological approaches to classification. According to the Neo-Kraepeli- creating classification systems will be reviewed. nian view, diagnostic categories represent Models of psychopathology will be discussed, medical diseases, and each diagnosis is con- including the Neo-Kraepelinian model, proto- sidered to be a discrete entity. Each of the type models, and dimensional and categorical discrete diagnostic categories is viewed as having models. Finally, the use of etiological factors in a describable etiology, course, and pattern of the categorization of psychopathology (J. F. occurrence. Clearly specified operational criter- Kihlstrom, personal communication, Septem- ia are used to define each category and foster ber 2, 1997, SSCPnet) will be examined and objectivity. This type of classification is aided by other research and statistical methodologies will the use of structured interviews in gathering be outlined that may potentially lead to better relevant symptom information to assign diag- systems of classification (J. F. Kihlstrom, noses. Diagnostic algorithms specify the objec- personal communication, September 11, 1997, tive rules for combining symptoms and reaching SSCPnet; D. Klein, personal communication, a diagnosis (Blashfield, 1991). In this view, the September 2, 1997, SSCPnet). 
3.08.5.1 Types of Taxonomies

As knowledge about features best signifying various disorders increases, determinations

establishment of the reliability of diagnostic categories is considered necessary before any type of validity can be established. Categories that refer to clearly described patterns of symptoms are considered to have good internal validity, while the utility of a diagnosis in predicting the course and treatment response of the disorder is seen as a marker of good external validity. Despite its many shortcomings, the widespread adoption of the DSM system in both clinical work and research is a testament to the utility of the Neo-Kraepelinian model.

3.08.5.2.2 Prototype model

The prototype model has been suggested as a viable alternative to the current Neo-Kraepelinian approach (Cantor, Smith, French, & Mezzich, 1980; Clarkin, Widiger, Frances, Hurt, & Gilmore, 1983; Horowitz, Post, French, Wallis, & Siegelman, 1981; Livesley, 1985a, 1985b). In this system, patients' symptoms are evaluated in terms of their correlation with a standard or prototypical representation of specific disorders. A prototype consists of the most common features of members of a category, and is the standard against which patients are evaluated (Horowitz et al., 1981).

The prototype model differs from the Neo-Kraepelinian model in several ways. First, the prototype model is based on a philosophy of nominalism, in which diagnostic categories represent concepts used by mental health professionals (Blashfield, 1991). Diagnostic groups are not viewed as discrete; rather, individuals may warrant membership in a category to a greater or lesser degree. The categories are defined by exemplars, or prototypes, and the presentation of features or symptoms in an individual is neither necessary nor sufficient to determine membership in a category. Instead, the prototype model holds that the degree of membership in a category is correlated with the number of representative symptoms the patient has.

Some authors have described the DSM system as a prototype model, primarily because it uses polythetic, as opposed to monothetic, definitions (Clarkin et al., 1983; Widiger & Frances, 1985). Although the DSM does use polythetic definitions, it does not constitute a prototypical model because specific subsets of symptoms are sufficient for making a diagnosis. Prototype and polythetic models both allow variability among features within a category; however, they view category definition differently. Prototype models focus on definition by example, whereas polythetic models focus on category membership as achieved by the presence of certain features that are sufficient. In a prototype model the level of membership is correlated with the number of features present, and features are neither necessary nor sufficient since membership is not an absolute. Furthermore, categories in the prototype model have indistinct boundaries, and the membership decision relies largely on clinician judgment. It is likely that the adoption of this model would result in a decrease in reliability compared to the DSM. However, proponents argue that this model is more reflective of real-world categories in psychopathology (Chapman & Chapman, 1969).

3.08.5.2.3 Dimensional and categorical models

An alternative to the categorical classification system is the dimensional approach. In dimensional models of psychopathology, symptoms are assessed along several continua, according to their severity. Several dimensional models have been suggested (Eysenck, 1970; Russell, 1980; Tellegen, 1985). Dimensional models are proposed as a means of increasing the empirical parsimony of the diagnostic system. The personality disorders may be most readily adapted to this approach (McReynolds, 1989; Widiger, Trull, Hurt, Clarkin, & Frances, 1987; Wiggins, 1982), but this approach is by no means limited to personality disorders and has been suggested for use in disorders including schizophrenia (Andreasen & Carpenter, 1993), somatoform disorder (Katon et al., 1991), bipolar disorder (Blacker & Tsuang, 1992), childhood disorders (Quay et al., 1987), and obsessive-compulsive disorder (Hollander, 1993). Dimensional models are more agnostic (i.e., making fewer assumptions), more parsimonious (i.e., possibly reducing the approximately 300 diagnosis classifications in the DSM to a smaller subset of dimensions), more sensitive to differences in the severity of disorders across individuals, and less restrictive.

While a dimensional approach might simplify some aspects of the diagnostic process, it would undoubtedly create new problems. First, categorical models are resilient because of the psychological tendency to change dimensional concepts into categorical ones (Cantor & Genero, 1986; Rosch, 1978). Second, implementation of dimensional systems would require a major overhaul of current practice in the mental health field. Third, replacing the DSM categorical model with a dimensional model will undoubtedly meet with much resistance from proponents of clinical descriptiveness, who believe that each separate category provides a more richly textured picture of the disorder. Finally, there are currently no agreed-upon dimensions to be included in such a classification model (Millon, 1991).
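The difference between the polythetic, prototype, and dimensional views of category membership can be made concrete with a small sketch. The example below is illustrative only: the criterion names, the 3-of-5 threshold, and the 0 to 4 severity scale are assumptions invented for the example and are not taken from the DSM or from any of the models cited above. In particular, the prototype rule is reduced here to a simple proportion of features present, whereas in practice prototype matching relies largely on clinician judgment.

```python
# Illustrative sketch only. The criteria, threshold, and severity scale
# below are hypothetical and are not taken from the DSM or any cited model.
CRITERIA = ("low_mood", "anhedonia", "sleep_change", "fatigue", "poor_concentration")
POLYTHETIC_THRESHOLD = 3  # assumed rule: any 3 of the 5 criteria are sufficient
MAX_SEVERITY = 4          # assumed rating scale: 0 (absent) to 4 (severe)


def polythetic_member(present):
    """Polythetic rule: a sufficient subset of criteria gives yes/no membership."""
    return len(set(present) & set(CRITERIA)) >= POLYTHETIC_THRESHOLD


def prototype_membership(present):
    """Prototype rule: graded membership that rises with the share of features shown."""
    return len(set(present) & set(CRITERIA)) / len(CRITERIA)


def dimensional_profile(ratings):
    """Dimensional rule: keep a severity score per continuum, with no cut-off."""
    profile = {}
    for name in CRITERIA:
        score = ratings.get(name, 0)  # unrated dimensions default to 0 (absent)
        if not 0 <= score <= MAX_SEVERITY:
            raise ValueError(f"severity for {name} must be 0..{MAX_SEVERITY}")
        profile[name] = score
    return profile


if __name__ == "__main__":
    patient = {"low_mood": 3, "anhedonia": 1}  # two features present
    present = [name for name, score in patient.items() if score > 0]
    print(polythetic_member(present))     # False: below the assumed threshold
    print(prototype_membership(present))  # 0.4: partial resemblance to the prototype
    print(dimensional_profile(patient))
```

Under these assumptions the same patient is a non-member under the polythetic rule, a partial member under the prototype rule, and simply a profile of severity scores under the dimensional rule, which is the contrast the three models draw.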

Thus, the task of advocates of the dimensional approach is twofold. First, they must determine the type and number of dimensions that are necessary to describe psychopathology. Second, they will need to demonstrate that it is possible to describe the entire range of psychopathology using a single set of dimensions. At this stage, the most useful starting point may be examination of the role of various dimensions in the description of psychopathology, as opposed to arguing the virtues and limitations of categorical and dimensional approaches to psychopathology.

3.08.5.3 The Use of Etiological Factors for Diagnosis

Another proposed remedy to the problems facing current classification systems is to examine the role of etiology. Andreasen and Carpenter (1993) point out the need to identify etiologic mechanisms in order to understand a disorder. In addition, understanding etiologic factors may help explain the heterogeneity and comorbidity of the disorders currently listed in the DSM. The authors of the DSM have generally avoided making statements regarding the etiology of disorders in keeping with the "theoretically neutral" stance. However, some authors have argued that this caveat is only loosely enforced in DSM as it is, as exemplified by life-stress-based disorders such as PTSD and adjustment disorder (Brett, 1993).

Traditional models of etiology have focused on either the biological or environmental causes of psychopathology. Wakefield (1997b) has warned against the assumption that the etiology of psychopathological disorders will necessarily be found to be a physiological malfunction. He argued that the mind can begin to function abnormally without a corresponding brain disorder. More recent conceptualizations of the etiology of psychopathology acknowledge the role of both biological and environmental factors, and debate the ratio of the contribution from each. As would be expected, etiological studies of psychopathology tend to reflect the underlying theories of mental disorders accepted by the researchers who conduct them. For example, biological theorists attempt to identify biological markers of certain disorders (e.g., Clementz & Iacono, 1990; Klein, 1989). Environmental theories attempt to identify specific events that are necessary or sufficient to produce a disorder. These approaches have achieved varying degrees of success depending on which diagnostic groups were considered. One explanation for the limited success of attempts to identify etiological factors in psychopathology is the limited number of techniques available for the examination of potential factors. For example, the history of psychiatry and medicine is replete with examples of major findings due in part to advances in technology (e.g., computerized axial tomography [CAT] scans as a method of examining the function of internal organs).

An alternative explanation for the limited success of etiological studies is that most researchers have relied on theoretical perspectives that assume distinct categories of illness. Specifically, the assumption of distinct diagnostic entities masks the possibility that multiple etiological factors may lead to the development of the same disorder, and that biological and environmental factors may ameliorate the effect of strong etiological factors. Even when a diagnosis may seem to have a clear etiology (e.g., PTSD), the picture is not clear. For example, although the diagnosis of PTSD requires a clear stressor, it is known that not all individuals who experience that stressor develop PTSD and not all stressors are likely to create PTSD (Foa & Rothbaum, 1998). Furthermore, the presence of a stressor alone is not sufficient to warrant the diagnosis. In fact, research suggests that etiologic factors entirely outside the diagnostic criteria (e.g., IQ; Macklin et al., 1998; McNally & Shin, 1995) may ameliorate the effects of the identified etiologic factors on the development of the disorder.

Much of the controversy about assessment and classification in psychopathology stems from the conflict about the use of value judgments as opposed to data-driven theory testing in creating diagnostic categories. Some have suggested that a combination of the two perspectives, including the use of both theory and data, may be the most workable approach (Blashfield & Livesley, 1991; Morey, 1991). This approach would amount to a process of construct validation depending on both theory and evaluation of the theory by data analysis. As in other areas of theory development, testability and parsimony would play a crucial role in choosing between competing theories. In the following section, the need for the adoption of new research methodologies in the field of assessment and classification of psychopathology will be considered. Next, some of the areas of research which illustrate promising methods, most of which focus on identifying etiologic factors in psychopathology, will be discussed.

As mentioned earlier, researchers have called for a move away from a system of diagnosis based on superficial features (symptoms) toward diagnosis based on scientifically based theories of underlying etiology and disease processes. This focus parallels that of medical diagnosis of physical illness. Physicians do not diagnose based on symptoms. Instead, patient reports of symptoms are seen as indicators of potential illnesses. Diagnoses are not made until specific indicators of pathology (e.g., biopsies, masses, blood draws) have been examined. The interpretation of such laboratory results requires an understanding of the differences between normal and abnormal functioning of the cell or organ in question. In order for assessment of psychopathology to follow this route, researchers must continue to amass information on normal mental and behavioral functioning (J. F. Kihlstrom, personal communication, September 11, 1997, SSCPnet). This endeavor can be facilitated by technological advances in experimental paradigms and measurement techniques and devices. The issue of what constitutes a great enough deviation from "normal" functioning to warrant treatment has been and most likely will continue to be controversial. However, such decisions are a necessary hurdle if psychopathology is to evolve as a science.

It is likely that the identification of etiological factors in psychopathology will not rely entirely on biological factors. The validation of etiological constructs in psychopathology will undoubtedly include studies designed to identify potential contributing causes including environmental, personality, and physiological factors. Examples of research methods and paradigms which may prove useful in determining the etiology of psychiatric disorders are becoming increasingly evident in the literature. Possible methodologies include: psychopharmacological efficacy studies (Harrison et al., 1984; Hudson & Pope, 1990; Papp et al., 1993; Quitkin et al., 1993; Stein, Hollander, & Klein, 1994); family and DNA studies (Fyer et al., 1996); treatment response studies (Clark et al., 1995; Millon, 1991); in vivo monitoring of physiological processes (Martinez et al., 1996); and identification of abnormal physiological processes (Klein, 1993, 1994; Pine et al., 1994). These approaches may prove informative in designing future versions of psychopathology classification systems.

Other researchers have chosen a more direct approach, as is evidenced in a series of articles in a special section of the Journal of Consulting and Clinical Psychology (JCCP) entitled "Development of theoretically coherent alternatives to the DSM-IV" (Follette, 1996). The authors of this special issue of JCCP pose radical alternatives to the DSM. The alternative classification systems are proposed from a clearly stated behavioral theoretical viewpoint, which differs considerably from many of the more biologically based approaches described above.

A number of the alternatives and improvements suggested in the 1996 JCCP are based on functional analysis, that is, the identification of the antecedents and consequences of each behavior (Hayes, Wilson, Gifford, Follette, & Strosahl, 1996; Scotti, Morris, McNeil, & Hawkins, 1996; Wulfert, Greenway, & Dougher, 1996). Wulfert et al. (1996) use the example of depression as a disorder which may be caused by a host of different factors (biological, cognitive, social skills deficits, or a lack of reinforcement). They argue that the fact that functionally similar behavior patterns may have very different structures may contribute to the heterogeneity found in the presumably homogeneous DSM categories. These authors contend that functional analysis may constitute one means of identifying homogeneous subgroups whose behaviors share similar antecedents and consequences. This approach could be used to refine the existing DSM categories, and to inform treatment strategies.

Hayes et al. (1996) describe a classification system based on functional analysis as a fundamentally different alternative to the current syndromal classification system. They proposed that a classification system could be based on dimensions derived from the combined results of multiple functional analyses tied to the same dimension. Each dimension would then be associated with specific assessment methods and therapy recommendations. The authors describe one such functional dimension, "experiential avoidance," and illustrate its utility across disorders such as substance abuse, obsessive-compulsive disorder, panic disorder, and borderline personality disorder (Hayes et al., 1996). This model provides an alternative to the current DSM syndromal model, using the methodological approach of functional analysis. Scotti et al. (1996) proposed a similar system of categorization of childhood and adolescent disorders utilizing functional analysis. Hayes et al. (1996), Scotti et al. (1996), and Wulfert et al. (1996) have all successfully illustrated that alternatives or improvements to the current DSM system are possible.

However, functional analysis is a distinctly behavioral approach which assumes that a learned stimulus–response connection is an important element in the development or maintenance of psychopathology. Other authors, for example, proponents of the genetic or biological explanations of psychopathology described above, might strongly oppose a classification system based purely on the methodology and tenets of a behavioral approach. Others disagree with the notion of any unified system of classification of psychopathology, arguing that no one diagnostic system will be equally useful for all of the classes of disorders now included in the DSM (e.g., what works for Axis I may not apply to the personality or childhood disorders). Several of these authors have taken a more radical stance, asserting the need for separate diagnostic systems for different classes of disorder (Kazdin & Kagan, 1994; Koerner, Kohlenberg, & Parker, 1996).

3.08.6 CONCLUSIONS

The assessment and classification of psychopathology is relatively new, and controversies abound. However, the heated debates regarding issues such as comorbidity, types of taxonomies, and alternative approaches are indicators of the strong interest in this area. Comparisons between the classification of psychopathology and taxonomies in the basic sciences and medicine can be informative. However, the classification of psychopathology is a difficult task, and the methods used in other fields are not always applicable. It is likely that the systems of classification, informed by the continuing debates and research on the topic, will continue to evolve at a rapid pace. As Clark et al. (1995) remarked, the science of classification has inspired research in new directions and helped to guide future developments of psychopathology.

ACKNOWLEDGMENTS

We would like to thank Amy Przeworski and Melinda Freshman for their help in editing this chapter.

3.08.7 REFERENCES

Abikoff, H., & Klein, R. G. (1992). Attention-deficit hyperactivity and conduct disorder: Comorbidity and implications for treatment. Journal of Consulting and Clinical Psychology, 60(6), 881–892.

American Psychiatric Association (1933). Notes and comment: Revised classification of mental disorders. American Journal of Psychiatry, 90, 1369–1376.

American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.

Andreasen, N. C., & Carpenter, W. T. (1993). Diagnosis and classification of schizophrenia. Schizophrenia Bulletin, 19(2), 199–214.

Andreasen, N. C., Flaum, M., Swayze, V. W., Tyrrell, G., & Arndt, S. (1990). Positive and negative symptoms in schizophrenia: A critical reappraisal. Archives of General Psychiatry, 47, 615–621.

Angrist, B., Rotrosen, J., & Gershon, S. (1980). Differential effects of amphetamine and neuroleptics on negative vs. positive symptoms in schizophrenia. Psychopharmacology, 72, 17–19.

August, G. J., & Garfinkel, B. D. (1990). Comorbidity of ADHD and reading disability among clinic-referred children. Journal of Abnormal Child Psychology, 18, 29–45.

Baumann, U. (1995). Assessment and documentation of psychopathology. Psychopathology, 28(Suppl. 1), 13–20.

Beck, A. T., Epstein, N., Brown, G., & Steer, R. A. (1988). An inventory for measuring anxiety: Psychometric properties. Journal of Consulting and Clinical Psychology, 56(6), 893–897.

Beck, A. T., & Steer, R. A. (1987). Beck depression inventory manual. San Antonio, TX: The Psychological Corporation.

Bellack, A. S., & Hersen, M. (Eds.) (1998). Behavioral assessment: A practical handbook. Needham Heights, MA: Allyn & Bacon.

Bergner, R. M. (1997). What is psychopathology? And so what? Clinical Psychology: Science and Practice, 4, 235–248.

Berrios, G. E., & Hauser, R. (1988). The early development of Kraepelin's ideas on classification: A conceptual history. Psychological Medicine, 18, 813–821.

Biederman, J., Newcorn, J., & Sprich, S. (1991). Comorbidity of attention deficit hyperactivity disorder with conduct, depressive, anxiety, and other disorders. American Journal of Psychiatry, 148(5), 564–577.

Blacker, D., & Tsuang, M. T. (1992). Contested boundaries of bipolar disorder and the limits of categorical diagnosis in psychiatry. American Journal of Psychiatry, 149(11), 1473–1483.

Blashfield, R. K. (1990). Comorbidity and classification. In J. D. Maser & C. R. Cloninger (Eds.), Comorbidity of mood and anxiety disorders (pp. 61–82). Washington, DC: American Psychiatric Press.

Blashfield, R. K. (1991). Models of psychiatric classification. In M. Hersen & S. M. Turner (Eds.), Adult psychopathology and diagnosis (pp. 3–22). New York: Wiley.

Blashfield, R. K., & Draguns, J. G. (1976). Toward a taxonomy of psychopathology: The purpose of psychiatric classification. British Journal of Psychiatry, 129, 574–583.

Blashfield, R. K., & Livesley, W. J. (1991). Metaphorical analysis of psychiatric classification as a psychological test. Journal of Abnormal Psychology, 100, 262–270.

Boyd, J. H., Burke, J. D., Gruenberg, E., Holzer, C. E., Rae, D. S., George, L. K., Karno, M., Stoltzman, R., McEvoy, L., & Nestadt, G. (1984). Exclusion criteria of DSM-III. Archives of General Psychiatry, 41, 983–989.

Brett, E. A. (1993). Classifications of posttraumatic stress disorder in DSM-IV: Anxiety disorder, dissociative disorder, or stress disorder? In J. R. T. Davidson & E. B. Foa (Eds.), Posttraumatic stress disorder: DSM-IV and beyond (pp. 191–204). Washington, DC: American Psychiatric Press.

Broverman, I. K., Broverman, D. M., Clarkson, F. E., Rosenkrantz, P. S., & Vogel, S. R. (1970). Sex-role stereotypes and clinical judgments of mental health. Journal of Consulting and Clinical Psychology, 34(1), 1–7.

Brown, T. A., & Barlow, D. H. (1992). Comorbidity among anxiety disorders: Implications for treatment and DSM-IV. Journal of Consulting and Clinical Psychology, 60(6), 835–844.

Buchanan, R. W., & Carpenter, W. T. (1994). Domains of psychopathology: An approach to the reduction of heterogeneity in schizophrenia. The Journal of Nervous and Mental Disease, 182(4), 193–204.

Butcher, J. N., Dahlstrom, W. G., Graham, J. R., Tellegen, A., & Kaemmer, B. (1989). Minnesota multiphasic personality inventory (MMPI-2). Administration and scoring. Minneapolis, MN: University of Minnesota Press.

Cantor, N., & Genero, N. (1986). Psychiatric diagnosis and natural categorization: A close analogy. In T. Millon & G. L. Klerman (Eds.), Contemporary directions in psychopathology: Toward the DSM-IV (pp. 233–256). New York: Guilford Press.

Carey, G., & Gottesman, I. I. (1978). Reliability and validity in binary ratings: Areas of common misunderstanding in diagnosis and symptom ratings. Archives of General Psychiatry, 35, 1454–1459.

Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric evidence and taxonomic implications. Journal of Abnormal Psychology, 100(3), 316–336.

Clark, L. A., Watson, D., & Reynolds, S. (1995). Diagnosis and classification of psychopathology: Challenges to the current system and future directions. Annual Review of Psychology, 46, 121–153.

Clarkin, J. F., Widiger, T. A., Frances, A. J., Hurt, S. W., & Gilmore, M. (1983). Prototypic typology and the borderline personality disorder. Journal of Abnormal Psychology, 92(3), 263–275.

Clementz, B. A., & Iacono, W. G. (1990). Nosology and diagnosis. In L. Willerman & D. B. Cohen (Eds.), Psychopathology. New York: McGraw-Hill.

Cole, D. A., & Carpentieri, S. (1990). Social status and the comorbidity of child depression and conduct disorder. Journal of Consulting and Clinical Psychology, 58, 748–757.

Compton, W. M., & Guze, S. B. (1995). The neo-Kraepelinian revolution in psychiatric diagnosis. European Archives of Psychiatry & Clinical Neuroscience, 245, 196–201.

Cornblatt, B. A., Lenzenweger, M. F., Dworkin, R. H., & Erlenmeyer-Kimling, L. (1992). Childhood attentional dysfunctions predict social deficits in unaffected adults at risk for schizophrenia. British Journal of Psychiatry, 161, 59–64.

Crow, T. J. (1994). The demise of the Kraepelinian binary system as a prelude to genetic advance. In E. S. Gershon & C. R. Cloninger (Eds.), Genetic approaches to mental disorders (pp. 163–192). Washington, DC: American Psychiatric Press.

Daradkeh, T. K., El-Rufaie, O. E. F., Younis, Y. O., & Ghubash, R. (1997). The diagnostic stability of ICD-10 psychiatric diagnoses in clinical practice. European Psychiatry, 12, 136–139.

Davidson, J. R. T., & Foa, E. B. (1991). Diagnostic issues in posttraumatic stress disorder: Considerations for the DSM-IV. Journal of Abnormal Psychology, 100(3), 346–355.

de Fundia, T. A., Draguns, J. G., & Phillips, L. (1971). Culture and psychiatric symptomatology: A comparison of Argentine and United States patients. Social Psychiatry, 6(1), 11–20.

Dworkin, R. H., & Lenzenweger, M. F. (1984). Symptoms and the genetics of schizophrenia: Implications for diagnosis. American Journal of Psychiatry, 141(12), 1541–1546.

Dworkin, R. H., Lenzenweger, M. F., Moldin, S. O., & Cornblatt, B. A. (1987). Genetics and the phenomenology of schizophrenia. In P. D. Harvey & E. F. Walker (Eds.), Positive and negative symptoms of psychosis: Description, research and future directives (pp. 258–288). Hillsdale, NJ: Erlbaum.

Eysenck, H. J. (1970). A dimensional system of psychodiagnostics. In A. R. Mahrer (Ed.), New approaches to personality classification (pp. 169–207). New York: Columbia University Press.

Feighner, A. C., Baran, I. D., Furman, S., & Shipman, W. M. (1972). Private psychiatry in community mental health. Hospital and Community Psychiatry, 23(7), 212–214.

Feinstein, A. R. (1970). The pre-therapeutic classification of co-morbidity in chronic disease. Journal of Chronic Diseases, 23, 455–468.

First, M. B., Spitzer, R. L., Gibbon, M., & Williams, J. B. W. (1995). Structured clinical interview for DSM-IV Axis I Disorders. Washington, DC: American Psychiatric Press.

Foa, E. B., Riggs, D. S., Dancu, C. V., & Rothbaum, B. O. (1993). Reliability and validity of a brief instrument for assessing post-traumatic stress disorder. Journal of Traumatic Stress, 6, 459–473.

Foa, E. B., & Rothbaum, B. (1998). Treating the trauma of rape. New York: Guilford Press.

Follette, W. C. (1996). Introduction to the special section on the development of theoretically coherent alternatives to the DSM system. Journal of Consulting and Clinical Psychology, 64, 1117–1119.

Follette, W. C., & Houts, A. C. (1996). Models of scientific progress and the role of theory in taxonomy development: A case study of the DSM. Journal of Consulting and Clinical Psychology, 64, 1120–1132.

Frances, A. J. (1998). Problems in defining clinical significance in epidemiological studies. Archives of General Psychiatry, 55, 119.

Frances, A. J., First, M. B., & Pincus, H. A. (1995). DSM-IV guidebook. Washington, DC: American Psychiatric Press.

Frances, A. J., First, M. B., Widiger, T. A., Miele, G. M., Tilly, S. M., Davis, W. W., & Pincus, H. A. (1991). An A to Z guide to DSM-IV conundrums. Journal of Abnormal Psychology, 100(3), 407–412.

Frances, A. J., Widiger, T. A., & Fyer, M. R. (1990). The influence of classification methods on comorbidity. In J. D. Maser & C. R. Cloninger (Eds.), Comorbidity of mood and anxiety disorders (pp. 41–59). Washington, DC: American Psychiatric Press.

Fulop, G., Strain, J., Vita, J., Lyons, J. S., & Hammer, J. S. (1987). Impact of psychiatric comorbidity on length of hospital stay for medical/surgical patients: A preliminary report. American Journal of Psychiatry, 144, 878–882.

Fyer, A. J., Mannuzza, S., Chapman, T. F., Lipsitz, J., Martin, L. Y., & Klein, D. F. (1996). Panic disorder and social phobia: Effects of comorbidity on familial transmission. Anxiety, 2, 173–178.

Goldenberg, I. M., White, K., Yonkers, K., Reich, J., Warshaw, M. G., Goisman, R. M., & Keller, M. B. (1996). The infrequency of "Pure Culture" diagnoses among the anxiety disorders. Journal of Clinical Psychiatry, 57(11), 528–533.

Goodman, W. K., Price, L. H., Rasmussen, S. A., & Mazure, C. (1989a). The Yale-Brown obsessive compulsive scale: I. Development, use and reliability. Archives of General Psychiatry, 46, 1006–1011.

Goodman, W. K., Price, L. H., Rasmussen, S. A., & Mazure, C. (1989b). The Yale-Brown obsessive compulsive scale: II. Validity. Archives of General Psychiatry, 46, 1012–1016.

Gorenstein, E. E. (1992). The science of mental illness. New York: Academic Press.

Gove, W. R. (1980). Mental illness and psychiatric treatment among women. Psychology of Women Quarterly, 4, 345–362.

Grove, W. M., Lebow, B. S., Clementz, B. A., Cerri, A., Medus, C., & Iacono, W. G. (1991). Familial prevalence and coaggregation of schizotypy indicators: A multitrait family study. Journal of Abnormal Psychology, 100, 115–121.

Haghighat, R. (1994). Cultural sensitivity: ICD-10 versus DSM-III-R. International Journal of Social Psychiatry, 40, 189–193.

Hamilton, S., Rothbart, M., & Dawes, R. (1986). Sex bias, diagnosis and DSM-III. Sex Roles, 15, 269–274.

Harrison, W. M., Cooper, T. B., Stewart, J. W., Quitkin, F. M., McGrath, P. J., Liebowitz, M. R., Rabkin, J. R., Markowitz, J. S., & Klein, D. F. (1984). The tyramine challenge test as a marker for melancholia. Archives of General Psychiatry, 41(7), 681–685.

Hayes, S. C., Wilson, K. G., Gifford, E. V., Follette, V. M., & Strosahl, K. D. (1996). Experiential avoidance and behavioral disorders: A functional dimensional approach to diagnosis and treatment. Journal of Consulting and Clinical Psychology, 64(6), 1152–1168.

Helzer, J. E., & Robins, L. N. (1988). The Diagnostic Interview Schedule: Its development, evolution, and use. Social Psychiatry and Psychiatric Epidemiology, 23(1), 6–16.

Herbert, J. D., Hope, D. A., & Bellack, A. S. (1992). Validity of the distinction between generalized social phobia and avoidant personality disorder. Journal of Abnormal Psychology, 101(2), 332–339.

Hollander, E. (1993). Obsessive-compulsive spectrum disorders: An overview. Psychiatric Annals, 23(7), 355–358.

Horowitz, L. M., Post, D. L., French, R. S., Wallis, K. D., & Siegelman, E. Y. (1981). The prototype as a construct in abnormal psychology: II. Clarifying disagreement in psychiatric judgments. Journal of Abnormal Psychology, 90(6), 575–585.

Hudson, J. I., & Pope, H. G. (1990). Affective spectrum disorder: Does antidepressant response identify a family of disorders with a common pathophysiology? American Journal of Psychiatry, 147(5), 552–564.

Jacoby, R. (1996). Problems in the use of ICD-10 and DSM-IV in the psychiatry of old age. In N. S. Costas & H. Hanns (Eds.), Neuropsychiatry in old age: An update. Psychiatry in progress series (pp. 87–88). Göttingen, Germany: Hogrefe & Huber.

Janca, A., Kastrup, M. C., Katschnig, H., & Lopez-Ibor, J. J., Jr. (1996). The ICD-10 multiaxial system for use in adult psychiatry: Structure and applications. Journal of Nervous and Mental Disease, 184, 191–192.

Johnstone, E. C., Crow, T. J., Frith, C. D., Stevens, M., Kreel, L., & Husband, J. (1978). The dementia of dementia praecox. Acta Psychiatrica Scandinavica, 57, 305–324.

Kanfer, F. H., & Saslow, G. (1965). Behavioral analysis. Archives of General Psychiatry, 12, 529–538.

Kahn, E. (1959). The Emil Kraepelin memorial lecture. In D. Pasamanick (Ed.), Epidemiology of mental disorders. Washington, DC: American Association for the Advancement of Science. (As cited in M. Hersen & S. M. Turner (Eds.), Adult psychopathology and diagnosis (p. 20). New York: Wiley.)

Katon, W., Lin, E., Von Korff, M., Russo, J., Lipscomb, P., & Bush, T. (1991). Somatization: A spectrum of severity. American Journal of Psychiatry, 148(1), 34–40.

Kay, S. R. (1990). Significance of the positive-negative distinction in schizophrenia. Schizophrenia Bulletin, 16(4), 635–652.

Kazdin, A. E., & Kagan, J. (1994). Models of dysfunction in developmental psychopathology. Clinical Psychology: Science and Practice, 1(1), 35–52.

Kendell, R. E. (1975). The concept of disease and its implications for psychiatry. British Journal of Psychiatry, 127, 305–315.

Kendell, R. E. (1982). The choice of diagnostic criteria for biological research. Archives of General Psychiatry, 39, 1334–1339.

Kendler, K. S. (1990). Toward a scientific psychiatric nosology. Archives of General Psychiatry, 47, 969–973.

Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshelman, S., Wittchen, H. U., & Kendler, K. S. (1994). Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States. Archives of General Psychiatry, 51, 8–19.

Klein, D. F. (1989). The pharmacological validation of psychiatric diagnosis. In L. N. Robins & J. E. Barrett (Eds.), The validity of psychiatric diagnosis (pp. 203–214). New York: Raven Press.

Klein, D. F. (1993). False suffocation alarms, spontaneous panics, and related conditions: An integrative hypothesis. Archives of General Psychiatry, 50, 306–317.

Klein, D. F. (1994). Testing the suffocation false alarm theory of panic disorder. Anxiety, 1, 1–7.

Klerman, G. L. (1978). The evolution of a scientific nosology. In J. C. Shershow (Ed.), Schizophrenia: Science and practice (pp. 99–121). Cambridge, MA: Harvard University Press.

Koerner, K., Kohlenberg, R. J., & Parker, C. R. (1996). Diagnosis of personality disorder: A radical behavioral alternative. Journal of Consulting and Clinical Psychology, 64(6), 1169–1176.

Kraepelin, E. (1971). Dementia praecox and paraphrenia (R. M. Barclay, Trans.). Huntington, NY: Krieger. (Original work published in 1919).

Lewinsohn, P. M., Rohde, P., Seeley, J. R., & Hops, H. (1991). Comorbidity of unipolar depression: I. Major depression with dysthymia. Journal of Abnormal Psychology, 100(2), 205–213.

Lilienfeld, S. O., Waldman, I. D., & Israel, A. C. (1994). A critical examination of the use of the term and concept of comorbidity in psychopathology research. Clinical Psychology: Science and Practice, 1(1), 71–83.

Livesley, W. J. (1985a). The classification of personality disorder: I. The choice of category concept. Canadian Journal of Psychiatry, 30(5), 353–358.

Livesley, W. J. (1985b). The classification of personality disorder: II. The problem of diagnostic criteria. Canadian Journal of Psychiatry, 30(5), 359–362.

Lykken, D. T. (1984). Psychopathic personality. In R. I. Corsini (Ed.), Encyclopedia of psychology (pp. 165–167). New York: Wiley. (As cited in B. A. Clementz & W. G. Iacono (1990). In L. Willerman & D. B. Cohen (Eds.), Psychopathology. New York: McGraw-Hill.)

Macklin, M. L., Metzger, L. J., Litz, B. T., McNally, R. J., Lasko, N. B., Orr, S. P., & Pitman, R. K. (1998). Lower precombat intelligence is a risk factor for posttraumatic stress disorder. Journal of Consulting and Clinical Psychology, 66, 323–326.

Martinez, J. M., Papp, L. A., Coplan, J. D., Anderson, D. E., Mueller, C. M., Klein, D. F., & Gorman, J. M. (1996). Ambulatory monitoring of respiration in anxiety. Anxiety, 2, 296–302.

Mayr, E. (1981). Biological classification: Toward a synthesis of opposing methodologies. Science, 214(30), 510–516.

McNair, D. M., Lorr, M., & Droppleman, L. F. (1981). POMS manual (2nd ed.). San Diego: Educational and Industrial Testing Service.

McNally, R., & Shin, L. M. (1995). Association of intelligence with severity of posttraumatic stress disorder symptoms in Vietnam combat veterans. American Journal of Psychiatry, 152(6), 936–938.

McReynolds, P. (1989). Diagnosis and clinical assessment: Current status and major issues. Annual Review of Psychology, 40, 83–108.

Meehl, P. E. (1986). Diagnostic taxa as open concepts: Metatheoretical and statistical questions about reliability and construct validity in the grand strategy of nosological revision. In T. Millon & G. L. Klerman (Eds.), Contemporary directions in psychopathology: Toward the DSM-IV (pp. 215–231). New York: Guilford Press.

Menninger, K. (1963). The vital balance: The life process in mental health and illness. New York: Viking Press.

Mezzich, J. E., Fabrega, H., Coffman, G. A., & Haley, R. (1989). DSM-III disorders in a large sample of psychiatric patients: Frequency and specificity of diagnoses. American Journal of Psychiatry, 146(2), 212–219.

Michels, R., Siebel, U., Freyberger, H. J., Stieglitz, R. D., Schaub, R. T., & Dilling, H. (1996). The multiaxial system of ICD-10: Evaluation of a preliminary draft in a multicentric field trial. Psychopathology, 29, 347–356.

Millon, T. (1991). Classification in psychopathology: Rationale, alternatives and standards. Journal of Abnormal Psychology, 100, 245–261.

Moras, K., DiNardo, P. A., Brown, T. A., & Barlow, D. H. (1991). Comorbidity and depression among the DSM-III-R anxiety disorders. Manuscript as cited in Brown, T. A. & Barlow, D. H. (1992). Comorbidity among anxiety disorders: Implications for treatment and DSM-IV. Journal of Consulting and Clinical Psychology, 60(6), 835–844.

Morey, L. C. (1988). Personality disorders in DSM-III and DSM-III-R: Convergence, coverage, and internal consistency. American Journal of Psychiatry, 145(5), 573–577.

Morey, L. C. (1991). Classification of mental disorder as a collection of hypothetical constructs. Journal of Abnormal Psychology, 100(3), 289–293.

Morey, L. C. (1992). Personality disorder NOS: Specifying patterns of the otherwise unspecified. Paper presented at the 100th Annual Convention of the American Psychological Association. Washington, DC, USA.

Mowrer, O. H. (1960). "Sin," the lesser of two evils. American Psychologist, 15, 301–304.

Oldham, J. M., Skodol, A. E., Kellman, H. D., Hyler, S. E., Rosnick, L., & Davies, M. (1992). Diagnosis of DSM-III-R personality disorders by two structured interviews: Patterns of comorbidity. American Journal of Psychiatry, 149, 213–220.

Papp, L. A., Klein, D. F., Martinez, J., Schneier, F., Cole, R., Liebowitz, M. R., Hollander, E., Fyer, A. J., Jordan, F., & Gorman, J. M. (1993). Diagnostic and substance specificity of carbon dioxide-induced panic. American Journal of Psychiatry, 150, 250–257.

Patel, V., & Winston, M. (1994). "Universality of mental illness" revisited: Assumptions, artefacts and new directions. British Journal of Psychiatry, 165, 437–440.

Pine, D. S., Weese-Mayer, D. E., Silvestri, J. M., Davies, M., Whitaker, A., & Klein, D. F. (1994). Anxiety and congenital central hypoventilation syndrome. American Journal of Psychiatry, 151, 864–870.

Pogue-Geile, M. F., & Harrow, M. (1984). Negative and positive symptoms in schizophrenia and depression: A follow-up. Schizophrenia Bulletin, 10(3), 371–387.

Quay, H. C., Routh, D. K., & Shapiro, S. K. (1987). Psychopathology of childhood: From description to validation. Annual Review of Psychology, 38, 491–532.

Babor, T. F., Burke, J. D., Farmer, A., Jablenski, A., Pickens, R., Regier, D. A., Sartorius, N., & Towle, L. H. (1988). The Composite International Diagnostic Interview: An epidemiological instrument suitable for use in conjunction with different diagnostic systems and different cultures. Archives of General Psychiatry, 45, 1069–1077.

Rosch, E. H. (1978). Principles of categorization. In E. H. Rosch & B. B. Lloyd (Eds.), Cognition and categorization (pp. 27–48). Hillsdale, NJ: Erlbaum.

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.

Sartorius, N., & Janca, A. (1996). Psychiatric assessment instruments developed by the World Health Organization. Social Psychiatry and Psychiatric Epidemiology, 32, 55–69.

Scotti, J. R., Morris, T. L., McNeil, C. B., & Hawkins, R. P. (1996). DSM-IV and disorders of childhood and adolescence: Can structural criteria be functional? Journal of Consulting and Clinical Psychology, 64(6), 1177–1191.

Shea, M. T., Widiger, T. A., & Klein, M. H. (1992). Comorbidity of personality disorders and depression: Implications for treatment. Journal of Consulting and Clinical Psychology, 60, 857–868.

Spiegel, D., & Cardena, E. (1991). Disintegrated experience: The dissociative disorders revisited. Journal of Abnormal Psychology, 100(3), 366–378.

Spielberger, C. D., Gorsuch, R. R., & Lushene, R. E. (1970). State-trait anxiety inventory: Test manual for form X. Palo Alto, CA: Consulting Psychologists Press.

Spitzer, R. L. (1997). Brief comments from a psychiatric nosologist weary from his own attempts to define mental disorder: Why Ossorio's definition muddles and Wakefield's "Harmful Dysfunction" illuminates the issues. Clinical Psychology: Science and Practice, 4, 259–261.

Spitzer, R. L. (1998). Diagnosis and need for treatment are not the same. Archives of General Psychiatry, 55, 120.

Spitzer, R. L., Endicott, J., & Robins, E. (1978). Research
diagnostic criteria: Rationale and reliability. Archives of Quitkin, F. M., Stewart, J. W., McGrath, P. J., Tricamo, General Psychiatry, 35(6), 773±782. E., Rabkin, J. G., Ocepek-Welikson, K., Nunes, E., Stein, D. J., Hollander, E., & Klein, D. F. (1994). Harrison, W., & Klein, D. F. (1993). Columbia atypical Biological markers of depression and anxiety. Medico- depression: A sub-group of depressives with better graphia, 16(1), 18±21. response to MAOI than to tricyclic antidepressants or Stevens, J. R. (1982). Neuropathology of schizophrenia. placebo. British Journal of Psychiatry, 163, 30±34. Archives of General Psychiatry, 39, 1131±1139. Rapee, R. M., Sanderson, W. C., & Barlow, D. H. (1988). Szasz, T. (1960). The myth of mental illness. American Social phobia features across the DSM-III-R anxiety Psychologist, 15, 113±118. disorders. Journal of Psychopathology and Behavioral Tattersall, I. (1995). The fossil trail: How we know what we Assessment, 10(3), 287±299. think we know about human evolution. New York: Regier, D. A., Kaelber, C. T., Rae, D. S., Farmer, M. E., Oxford University Press. (As cited in Follette, W. C., & Knauper, B., Kessler, R. C., & Norquist, G. S. (1998) Houts, A. C. (1996). Models of scientific progress and Limitations of diagnostic criteria and assessment instru- the role of theory in taxonomy development: A case ments for mental disorders. Implications for research study of the DSM. Journal of Consulting and Clinical and policy. Archives of General Psychiatry, 55, 109±115. Psychology, 64, 1120±1132.) Regier, D. A., Kaelber, C. T., Roper, M. T., Rae, D. S. & Taylor, M. A. (1992). Are schizophrenia and affective Sartorius, N. (1994). The ICD-10 clinical field trial for disorder related? A selective literature review. American mental and behavioral disorders: Results in Canada and Journal of Psychiatry, 149, 22±32. the United States. American Journal of Psychiatry, 151, Taylor, M. A., & Amir, N. (1994). Are schizophrenia and 1340±1350. 
affective disorder related?: The problem of schizoaffec- Riskind, J. H., Beck, A. T., Brown, G., & Steer, R. A. tive disorder and the discrimination of the psychoses by (1987). Taking the measure of anxiety and depression. signs and symptoms. Comprehensive Psychiatry, 35(6), Validity of the reconstructed Hamilton scale. Journal of 420±429. Nervous and Mental Disease, 175(8), 474±479. Tellegen, A. (1985). Structures of mood and personality Robins, E., & Guze, S. B. (1970). Establishment of and their relevance to assessing anxiety, with an diagnostic validity in psychiatric illness: Its application emphasis on self-report. In A. H. Tuma & J. D. Maser to schizophrenia. American Journal of Psychiatry, 126, (Eds.), Anxiety and the anxiety disorders (pp. 681±706). 983±987. Hillsdale, NJ: Erlbaum. Robins, L. N. (1994). How recognizing ªcomorbiditiesº in Uestuen, T. B., Bertelsen, A., Dilling, H., & van psychopathology may lead to an improved research Drimmelen, J. (1996). ICD-10 casebook: The many faces nosology. Clinical Psychology: Science and Practice, 1, of mental disordersÐAdult case histories according to 93±95. ICD-10. Washington, DC: American Psychiatric Press. Robins, L. N., Wing, J., Wittchen, H. J., Helzer, J. E., Ullman, L. P., & Krasner, L. (1975). A psychological 188 Assessment of Psychopathology: Nosology and Etiology

approach to abnormal behavior. Englewood Cliffs, NJ: Widiger, T. A., & Trull, T. J. (1993). The Scholarly Prentice-Hall. development of DSM-IV. In J. A. Costa, E. Silva, & C. Wakefield, J. C. (1992). Disorder as harmful dysfunction: C. Nadelson (Eds.), International Review of Psychiatry A conceptual critique of DSM-III-R's definition of (pp. 59±78). Washington DC, American Psychiatric mental disorder. Psychological Review, 99, 232±247. Press. Wakefield, J. C. (1997a). Diagnosing DSM-IV, Part I: Widiger, T. A., Trull, T. J., Hurt, S. W., Clarkin, J., & DSM-IV and the concept of disorder. Behavioral Frances, A. (1987). A multidimensional scaling of the Research Therapy, 35(7), 633±649. DSM-III personality disorders. Archives of General Wakefield, J. C. (1997b). Diagnosing DSM-IV, Part II: Psychiatry, 44, 557±563. Eysenck (1986) and the essential fallacy. Behavioral Wiggins, J. S. (1982). Circumplex models of interpersonal Research Therapy, 35(7), 651±665. behavior in clinical psychology. In P. C. Kendall & J. N. Wakefied, J. C. (1997c). Normal inability versus patholo- Butcher (Eds.), Handbook of research methods in clinical gical disability: Why Ossorio's definition of mental psychology (pp. 183±221). New York: Wiley. disorder is not sufficient. Clinical Psychology: Science Wilson, G. T. (1996). Empirically validated treatments: and Practice, 4, 249±258. Realities and resistance. Clinical Psychology-Science & Watson, D., & Clark, L. A. (1992). Affects separable and Practice, 3(3), 241±244. inseparable: On the hierarchical arrangement of the World Health Organization (1992). International statistical negative affects. Journal of Personality and Social classification disease, 10th revision (ICD-10). Geneva, Psychology, 62(3), 489±505. Switzerland: Author. Watson, D., Clark, L. A., & Tellegen, A. (1988). Develop- World Health Organization (1996). 
Multiaxial classifica- ment and validation of brief measures of positive and tion of child and adolescent psychiatric disorders: The negative affect: The PANAS scales. Journal of Person- ICD-10 classification of mental and behavioral disorders in ality and Social Psychology, 54(6), 1063±1070. children and adolescents. Cambridge, UK: Cambridge Watson, J. D., & Crick, F. H. C. (1953a). General University Press. implications of the structure of deoxyribose nucleic acid. Wulfert, E., Greenway, D. E., & Dougher, M. J. (1996). A Nature, 171, 964±967. (As cited in Clementz B. A., & logical functional analysis of reinforcement-based dis- Iacono, W. G. (1990). In L. Willerman & D. B. Cohen orders: Alcoholism and pedophilia. Journal of Consulting (Eds.), Psychopathology. New York: McGraw-Hill.) and Clinical Psychology, 64(6), 1140±1151. Watson, J. D., & Crick, F. H. C. (1953b). A structure for Zimmerman, M., Pfohl, B., Coryell, W. H., Corenthal, C., deoxyribose nucleic acid. Nature, 171, 737±738. (As cited & Stangl, D. (1991). Major depression and personality in Clementz, B. A., & lacono, W. G. (1990). In L. disorder. Journal of Affective Disorders, 22, 199±210. Willerman & D. B. Cohen (Eds.), Psychopathology. New Zinbarg, R. E., & Barlow, D. D. (1991). Mixed anxiety- York: McGraw-Hill. depression: A new diagnostic category? In R. M. Rapee Widiger, T. A. (1992). Categorical versus dimensional & D. H. Barlow (Eds.), Chronic anxiety: Generalized classification: Implications from and for research. anxiety disorder and mixed anxiety-depression Journal of Personality Disorders, 6(4), 287±300. (pp. 136±152). New York: Guilford Press. Widiger, T. A., & Frances, A. (1985). The DSM-III Zinbarg, R. E., Barlow, D. H., Liebowitz, M., Street, L., personality disorders: Perspectives form psychology. Broadhead, E., Katon, W., Roy-Byrne, P., Lepine, J. P., Archives of General Psychiatry, 42, 615±623. Teherani, M., Richards, J., Brantley, P., & Kraemer, H. Widiger, T. A., & Rogers, J. H. 
(1989). Prevalence and (1994). The DSM-IV field trial for mixed anxiety- comorbidity of personality disorders. Psychiatric Annals, depression. American Journal of Psychiatry, 151(8), 19(3), 132±136. 1153±1162. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.09 Intervention Research: Development and Manualization

JOHN F. CLARKIN Cornell University Medical College, New York, NY, USA

3.09.1 INTRODUCTION
3.09.2 AIMS AND INGREDIENTS OF A TREATMENT MANUAL
3.09.3 DIMENSIONS ALONG WHICH THE MANUALS VARY
3.09.3.1 Process of Manual Generation and Elucidation
3.09.3.2 Patient Population
3.09.3.3 Knowledge Base of the Disorder
3.09.3.4 Treatment Strategies and Techniques
3.09.3.5 Treatment Duration
3.09.3.6 Treatment Format
3.09.3.7 Level of Abstraction
3.09.4 RANGE OF TREATMENT MANUALS
3.09.5 REPRESENTATIVE TREATMENT MANUALS
3.09.5.1 IPT: An Early Manual
3.09.5.2 Cognitive-behavioral Treatment of BPD
3.09.5.3 Psychosocial Treatment For Bipolar Disorder
3.09.5.4 Other Treatment Manuals
3.09.6 ADVANTAGES OF TREATMENT MANUALS
3.09.7 POTENTIAL DISADVANTAGES AND LIMITATIONS OF TREATMENT MANUALS
3.09.8 USE AND ROLE OF TREATMENT MANUALS
3.09.8.1 Efficacy to Effectiveness
3.09.9 SUMMARY
3.09.10 REFERENCES

3.09.1 INTRODUCTION

Psychotherapy research can progress only if the treatments that are investigated can be replicated by each of the therapists within the study, and the therapy can be described to the consumers of the research results. Standards of psychotherapy research have reached a point where one cannot communicate about results without a written manual describing the treatment and methods for demonstrating therapists' adherence and competence in the delivery of the treatment. In the recent history of psychotherapy research, the manual has played a vital role in the specification of clinical trials. In addition, if a science and not just an art of psychotherapy is to flourish, there must be methods to teach clinicians how to perform the various therapies with adherence to the specific treatment strategies and techniques, and with competence in this complex interpersonal process. It is in this context that the written description of psychotherapies for specific patient populations in the form of manuals has become an essential fixture in the psychotherapy world.

There has been much lament that clinical research has little if any impact on clinical practice (Talley, Strupp, & Butler, 1994). Researchers complain that their findings are ignored. Clinicians argue that the research on rarified samples yields findings that are detailed, tedious, and irrelevant to their heterogeneous patients. It may be that psychotherapy treatment manuals will provide a bridge between clinical research and clinical practice. The treatment manual is an accessible yield of clinical research that the clinician may find helpful, if not necessary, as health delivery systems change.

This chapter draws upon various sources to describe the contents of the psychotherapy manual, provides some examples of the therapy manual movement, and discusses the advantages and disadvantages of this technology both for research and for the instruction of therapists. It is interesting that, although medication management is a social process between patient and physician, with many similarities to psychotherapy, there has been little development of manuals for medication management. The informative exception is the medication management manual used in the multisite treatment study of depression (Elkin et al., 1989). Despite the fact that medication adherence is often variable and generally poor, and that there is great variation in physician behavior in dispensing medications, this area has attracted little attention, apparently because researchers and clinicians assume that providing medication is a simple and standardized behavior. We suspect this assumption is false, and that more sophisticated medication and medication–psychotherapy studies in the future will have manuals with validity checks on both psychotherapy and medication management.

3.09.2 AIMS AND INGREDIENTS OF A TREATMENT MANUAL

The concept of a treatment manual has quickly evolved in the research and clinical communities. The treatment manual was acclaimed (Luborsky & DeRubeis, 1984) as a quantum leap forward which introduced a new era in clinical research. Since that time there has been the development of numerous treatment manuals. There are even special series geared around treatment manuals, and at times accompanying materials for patients as well.

In addition, the intellectual community is moving forward in defining what should be included in a treatment manual to meet standards for treatment development.

An example of the development of standards for treatment manuals comes from the National Institute on Drug Abuse (NIDA; Moras, 1995). The experts on this panel suggested that the aims of a treatment manual are to:
(i) make the therapy reproducible by therapists other than the immediate research group;
(ii) include information on those issues the investigator thinks are important determinants of therapy response;
(iii) delineate the knowledge base and skills needed by a therapist to learn from the manual.

Given these aims, the group suggested the following content for the manual:
(i) theory and principles of the treatment (theory underpinning the therapy, decision rules and principles to guide the therapists, information on the therapy that therapists should have to help them adopt the requisite attitude about it, i.e., belief in the treatment);
(ii) elements of the treatment (therapist's primary interpersonal stance in relation to the patient, common and unique elements of therapy);
(iii) intervention strategies for handling problems commonly encountered in delivering the therapy (miscellaneous interventions, specification of interventions not to be used in the therapy);
(iv) companion videotapes for teaching the treatment; and
(v) criteria and method of assessing therapist competence in the treatment.

In essence, these are the criteria by which one can judge the adequacy of any treatment manual.

These criteria are used to assess existing treatment manuals, some of which are used as illustrations in this chapter. As noted previously (Kazdin, 1991), it seems clear that the manuals are often incomplete, as they do not reflect the complexity of treatment and the full scope of the exchanges between therapist and patient. However, the development of treatment manuals has helped and continues to help the field move forward.

It should be noted that there is a difference between a treatment manual and a book published and marketed that has many but not all of the attributes of a practical, ever-changing treatment manual. In an attempt to make a book of publishable size, things that might be included in the working manual, such as clinician work sheets or rating scales, might be eliminated from the book. A notable exception is the work by Linehan, which includes both a book (Linehan, 1993a) and an accompanying workbook (Linehan, 1993b) with therapist work sheets and other material most useful in practical delivery of the treatment. The treatment manual of the near future will probably be both a written description of the treatment in book form and an accompanying CD-ROM that provides video demonstration of the treatment (using therapists and actor/patients).

3.09.3 DIMENSIONS ALONG WHICH THE MANUALS VARY

The existing treatment manuals can be compared not only against the content specified by the NIDA experts, but also by differences in aspects of the treatment, including: the process by which the authors generated and elucidated their treatment; the patient population for which the treatment is intended; the knowledge base of the disorder being treated; treatment strategies and techniques, duration, and format; and the level of abstraction at which the manual is written.

3.09.3.1 Process of Manual Generation and Elucidation

What was the process of manual generation by the authors? Was the manual generated from extensive experience with a patient population, followed by a codification of the techniques that seem to work on a clinical basis? Conversely, was the manual generated from a theoretical understanding of the patient population followed by treatment contact? Was it generated from a specific treatment orientation, for example, cognitive-behavioral, that is then applied to a specific patient population? Is the manual generated by clinical researchers or by clinicians? Was the manual generated during extensive clinical research, profiting from that research, or without research experience? Does the manual have accompanying aids, such as audiotapes and work sheets?

3.09.3.2 Patient Population

Therapy manuals differ in the extent to which they specify for which patients the manual is intended and for which patients the manual has not been attempted, or, indeed, for whom the treatment does not work. Since psychotherapy research funding is focused on DSM-diagnosed patient groups, many of the manuals describe treatments for patient groups defined by diagnosis. This fact has both positive and negative aspects. Some diagnoses are close to phenomenological descriptions of behavioral problems, so the diagnoses are immediate descriptions of problems to be solved. An example is the treatment of phobic anxiety, agoraphobia, and panic attacks. In contrast, the diagnosis of depression is more distant from specific behaviors, so the manuals describing treatments for depression have implicit theories about the nature of depression, such as the interpersonal treatment or the cognitive treatment. Linehan's manual for the treatment of parasuicidal individuals is an interesting transitional one, from a behavior (repetitive suicidal behavior) to a diagnosis that sometimes includes the behavior (borderline personality disorder). In contrast, there are manuals written not for patients with specific diagnoses but for those with certain problems such as experiencing emotions (Greenberg, Rice, & Elliott, 1993).

An additional way in which the manual being considered is specific to a population is the attention given to the details of patient assessment. Ideally, the assessment section would inform the reader of the patients included and excluded, and the subgroups of patients within the diagnosis/category treated by this manual, but treated with the differences in mind depending on such factors as comorbid conditions.

One difficulty that is becoming clearer concerns manuals that have been written for a specific disorder: how do they apply to patients with the disorder but with a common comorbid disorder? For example, the interpersonal psychotherapy manual was constructed with ambulatory, nonpsychotic patients with depression in mind, but gives little indication of its application when there are prominent Axis II disorders. Of course, this point also relates to the accumulation of knowledge that occurs all the time, so that a manual published a few years ago may need modification and updating as more data become available.

3.09.3.3 Knowledge Base of the Disorder

One could make a cogent argument that a treatment for a particular disorder cannot (should not) be developed until there is sufficient information on the natural course of the disorder at issue. Only with the background of the natural development of the disorder can one examine the impact of intervention. Furthermore, the age of the patients to whom the treatment is applied is most relevant in the context of the longitudinal pattern of the disorder in question. However, in light of the clinical reality of patients in pain needing intervention, the development of intervention strategies cannot wait for longitudinal investigations of the disorder. Both arguments are sound, but we can describe intervention manuals based on the natural history of the disorder versus those that do not have such a database upon which to build.

3.09.3.4 Treatment Strategies and Techniques

Obviously, treatment manuals differ in the strategies and techniques that are described. This would include manualization of treatments that use behavioral, cognitive, cognitive-behavioral, systems, supportive, and psychodynamic techniques. Less obvious is the fact that it may be easier to manualize some strategies/techniques than others. In fact, some therapies may be more technique oriented and others relatively more interpersonally oriented, and this may have implications for the method of manualization and the results.

An issue related to strategies and techniques is the degree to which the treatment manual describes the stance of the therapist with respect to the patient, the induction of the patient into the treatment roles and responsibilities, and the development of the therapeutic relationship or alliance. Often, it is the ambivalent patient, who questions the value of the treatment, is prone to drop out, or attends therapy with little investment and involvement (e.g., does not carry out therapy homework), who provides a serious challenge to the therapist. The more complete manuals note the typical difficulties and challenges to full participation in the treatment that have been observed in experience with the patient population, and offer guidelines for the therapist in overcoming these challenges.

3.09.3.5 Treatment Duration

Most manuals describe a brief treatment of some 12–20 sessions. The methods for manualizing longer treatments may present a challenge to the field. A brief treatment can be described in a manual session by session. Each session can be anticipated in a sequence and described. This is especially true if the treatment is cognitive-behavioral in strategies and techniques, and can be anticipated as applied to the particular condition/disorder. In contrast, as the treatment becomes longer, and as the treatment deviates from a cognitive-behavioral orientation to one that depends more on the productivity and nature of the individual patient (e.g., interpersonal and psychodynamic treatment), the manual will of necessity become more principle based and reliant on what the patient brings to the situation.

3.09.3.6 Treatment Format

Psychotherapy is delivered in individual, group, marital, and family treatment formats. It would seem to be simpler, if not easier, to articulate a treatment manual for two participants (i.e., therapist and one patient), rather than a manual for therapist and many patients (i.e., family and group treatment). As the number of patients in the treatment room increases, the number of patient-supplied interactions increases, as does the potential for treatment difficulties and road blocks.

3.09.3.7 Level of Abstraction

The very term "manual" conjures up an image of the book that comes in the pocket of a new car. It tells how to operate all the gadgets in the car, and provides a guide as to when to get the car serviced. Diagrams and pictures are provided. It is a "how to" book on repair and maintenance of your new car. Thus, the term "treatment manual" promises to describe how to do the treatment in step-by-step detail.

More relevant to psychotherapy manuals are the manuals that inform the reader, often with graphs and pictures, how to sail a boat or play tennis. This is a more apt analogy because the manual attempts in written (and graphic) form to communicate how to achieve complex cognitive and motor skills. Opponents of treatment manuals will point to golfers who have read all the manuals, and hack their way around the course with less than sterling skill. (To us, this does not argue against the manual, but speaks of its limitations and how it should be used in the context of other teaching devices.)

Ignoring the critics for the moment, this discussion raises the issue of the level of concreteness or abstraction at which the manual is best formulated. Some manuals describe the treatment session by session, and within the individual session the flow of the session and details about its construction are indicated. Obviously, this is easier to do the shorter the treatment, and the more the treatment can be predicted from the beginning (i.e., the more the treatment is driven by the therapist and little influenced by patient response or spontaneous contribution). Probably the best manuals are those that constantly weave abstract principles and strategies of the treatment with specific examples in the form of clinical vignettes that provide illustrations of the application of the principles to the individual situation.

3.09.4 RANGE OF TREATMENT MANUALS

An exhaustive list of existing treatment manuals cannot be provided for many reasons, not the least of which is that the list would be out of date before this chapter goes to print. The American Psychological Association (APA), Division 12, Task Force on Psychological Interventions lists the so-called efficacious treatments (Chambless et al., 1998) and has also listed treatment manuals that are relevant to the efficacious treatments (Woody & Sanderson, 1998). Some 33 treatment manuals are listed for 13 disorders and/or problem areas including bulimia, chronic headache, pain associated with rheumatic disease, stress, depression, discordant couples, enuresis, generalized anxiety disorder, obsessive-compulsive disorder, panic disorder, social phobia, specific phobia, and children with oppositional behavior. Not all of the manuals listed have been published; unpublished manuals must be requested from the originator. Further, this list is limited to only those treatments that have been judged by the Task Force of Division 12 to meet their criteria for efficaciousness.

3.09.5 REPRESENTATIVE TREATMENT MANUALS

In contrast, a listing of representative treatment manuals is provided here, using as the structure for such sampling the diagnoses in the Diagnostic and statistical manual of mental disorders (4th ed., DSM-IV).

This chapter uses three treatment manuals to illustrate the present state of the art in manualization of psychotherapy: Interpersonal psychotherapy of depression (IPT; Klerman, Weissman, Rounsaville, & Chevron, 1984), the dialectical behavioral treatment (DBT) for borderline personality disorder (BPD) and related self-destructive behaviors (Linehan, 1993a), and the family treatment of bipolar disorder (Miklowitz & Goldstein, 1997). One of these treatments and related treatment manuals (IPT) has achieved recognition in the APA Division 12 listing of efficacious treatments. One (DBT for BPD) is listed as a "probably efficacious treatment." The third (family treatment for bipolar disorder) is not listed at this time. These manuals are also selected because they relate to existing DSM diagnostic categories which are central to research data and to insurance company procedures for reimbursement of treatment. These manuals for the specified disorders are also chosen because they represent three levels of severity of disorder that face the clinician. Ambulatory patients with depression of a nonpsychotic variety are treated with IPT. BPD is a serious disorder, often involving self-destructive and suicidal behavior. Finally, bipolar disorder is a psychiatric condition with biological and genetic causes that can be addressed by both the empirically efficacious medications and psychotherapy. In addition, these three treatment manuals provide some heterogeneity in terms of treatment format, strategies, and level of treatment development. IPT is delivered in an individual treatment format (patient and individual therapist), the treatment format that is the simplest to manualize. DBT involves both an individual and a group treatment format. The family treatment of bipolar disorder involves the treatment of the individual with bipolar disorder and family or marital partner. There is less diversity among these three treatments in terms of treatment strategies and techniques. Two (DBT and family treatment of bipolar) are informed by cognitive-behavioral strategies and techniques, but the latter introduces the intriguing concepts of family dynamics and systems issues. The remaining (IPT) is interpersonal in focus and orientation. The three manuals vary in terms of treatment duration. IPT is brief, the family treatment of bipolar is intermediate, and DBT is the longest, presenting the most challenge to issues of manualization and therapist adherence across a longer period of time. A comparison of these three manuals gives a sense of development in this area, as the IPT manual was one of the first, and the latter two provide a view of recent manuals and their contents.

3.09.5.1 IPT: An Early Manual

One of the earliest and most influential manuals is Interpersonal psychotherapy of depression by Klerman et al. (1984). This manual was written as a time-limited, outpatient treatment for depressed individuals. The treatment focuses on the current rather than on past interpersonal situations and difficulties. While making no assumption about the origin of the symptoms, the authors connect the onset of depression with current grief, role disputes and/or transitions, and interpersonal deficits.

This brief intervention has three treatment phases which are described clearly in the manual. The first is an evaluation phase during which the patient and therapist review depressive symptoms, give the syndrome of depression a name, and induce the patient into a sick role. (Interestingly, the role of the therapist is not described explicitly.) The patient and therapist discuss and, hopefully, agree upon a treatment focus limited to four possibilities: grief, role disputes, role transitions, or interpersonal deficits. The middle phase of treatment involves work between therapist and patient on the defined area of focus. For example, with current role disputes the therapist explores with the patient the nature of the disputes and options for resolution. The final phase of treatment involves reviewing and consolidating therapeutic gains. A recent addition is the publication of a client workbook (Weissman, 1995) and client assessment forms.

IPT has been used in many clinical trials, the first of which was in 1974 (Klerman, DiMascio, Weissman, Prusoff, & Paykel, 1974). In addition to the IPT manual, a training videotape has been produced and an IPT training program is being developed (Weissman & Markowitz, 1994).

IPT provides a generic framework guiding patient and therapist to discuss current difficulties, and this framework has been applied to symptom conditions other than depression. For example, the format has been applied to patients with bipolar disorder (Ehlers, Frank, & Kupfer, 1988), drug abuse (Carroll, Rounsaville, & Gawin, 1991; Rounsaville, Glazer, Wilber, Weissman, & Kleber, 1983), and bulimia (Fairburn et al., 1991). In addition to its initial use in treating depression in ambulatory adult patients, it has now been utilized as an acute treatment, as a continuation and maintenance treatment (Frank et al., 1990), and has been used for geriatric (Reynolds et al., 1992) and adolescent patients (Mufson, Moreau, Weissman, & Klerman, 1993) and in various settings such as primary care and hospitalized elderly.

sion in its earliest articulation (Beck, Rush, Shaw, & Emery, 1979) and with more recent additions (Beck, 1995).

3.09.5.2 Cognitive-behavioral Treatment of BPD

Linehan's (1993a, 1993b) cognitive-behavioral treatment of the parasuicidal individual is an example of the application of a specific school of psychotherapy adapted to a specific patient population defined both by a personality disorder diagnosis (BPD) and repetitive self-destructive behavior.

The rationale and data by which the patient pathology is understood as related to the treatment are well described and thorough. Linehan points out that theories of personality functioning/dysfunctioning are based upon world views and assumptions. The assumption behind DBT is that of dialectics. The world view of dialectics involves notions of inter-relatedness and wholeness, compatible with feminist views of psychopathology, rather than an emphasis on separation, individuation, and independence. A related principle is that of polarity, that is, all propositions contain within them their own oppositions.
As related to the borderline The success of IPT seems to be the articulation pathology, it is assumed that within the border- of a rather straightforward approach to discus- line dysfunction there is also function and sion between patient and therapist, of current accuracy. Thus, in DBT, it is assumed that each situations without the use of more complicated individual, including the borderline clients, are procedures such as transference interpretation. capable of wisdom as related to their own life and The IPT manual was one of the first in the capable of change. At the level of the relationship field, and its straightforward description of a between borderline client and DBT therapist, common-sense approach to patients with de- dialectics refers to change by persuasion, a pression is readily adopted by many clinicians. process involving truth not as an absolute but as However, the process of treatment development an evolving, developing phenomenon. and amplification is also relevant to this, one of Borderline pathology is conceptualized as a the earliest and best treatment manuals. It is dialectical failure on the part of the client. The now clear that depression is often only partially thinking of the BPD patient has a tendency to responsive to brief treatments, such as IPT, and become paralyzed in either the thesis or that many patients relapse. It would appear that antithesis, without movement toward a synth- maintenance treatment with IPT may be useful esis. There is a related dialectical theory of the (Frank et al., 1991), and the IPT manual must development of borderline pathology. BPD is therefore be amplified for this purpose. 
Further- seen primarily as a dysfunction of the emotion more, it has become clear that depressed regulation system, with contributions to this individuals with personality disorders respond state of malfunction from both biological to treatment less thoroughly and more slowly irregularities and interaction over time with a than depressed individuals without personality dysfunctional environment. In this point of disorders (Clarkin & Abrams, 1998). This would view, the BPD client is prey to emotional suggest that IPT may need modification for vulnerability, that is, high sensitivity to emo- those with personality disorders, either in terms tional stimuli, emotional intensity, and a slow of how to manage the personality disorder return to emotional baseline functioning. The during treatment, or to include treatment of invalidating environment, seen as crucial to the relevant parts of the personality disorder to the BPD development, is one in which interpersonal depression. communication of personal and private experi- In order to place IPT in perspective, one could ences is met by others in the environment with compare it to the cognitive therapy of depres- trivialization and/or punishment. In such an Representative Treatment Manuals 195 environment, the developing individual does disturbed group of individuals, and provides not learn to label private experiences, nor does the challenge of extending treatment manuals the individual learn emotion regulation. beyond brief treatments. A most important and These assumptions and related data on the practical addition to the book is an accompany- developmental histories and cross-sectional ing workbook with work sheets for therapist and behaviors of those with BPD provide the patients. rationale and shape of the treatment. 
The major To put DBT in context one could compare it tasks of the treatment are, therefore, to teach the to other cognitive approaches to the personality client skills so that they can modulate emotional disorders (Beck & Freeman, 1990), to an experiences and related mood-dependent beha- interpersonal treatment delivered in a group viors, and to learn to trust and validate their own format (Marziali & Munroe-Blum, 1994), to a emotions, thoughts and activities. The relevant modified psychodynamic treatment (Clarkin, skills are described as four in type: skills that Yeomans, & Kernberg, in press), and to a increase interpersonal effectiveness in conflict supportive treatment (Rockland, 1992) for these situations, strategies to increase self-regulation patients. of unwanted affects, skills for tolerating emo- tional distress, and skills adapted from Zen 3.09.5.3 Psychosocial Treatment For Bipolar meditation to enhance the ability to experience Disorder emotions and avoid emotional inhibition. The manual provides extensive material on Capitalizing on their years of clinical research basic treatment strategies (e.g., dialectical stra- with patients with schizophrenia and bipolar tegies), core strategies of emotional validation disorder and their families, Miklowitz and (e.g., teaching emotion observation and labeling Goldstein (1997) have articulated a treatment skills), behavioral validation (e.g., identifying, manual for patients with bipolar disorder and countering and accepting ªshouldsº), and their spouses or family members. cognitive validation (e.g., discriminating facts The rationale for the treatment is threefold. from interpretations), and problem-solving An episode of bipolar disorder in a family strategies. Problem solving consists of analysis member affects not only the patient but the of behavior problems, generating alternate entire family. 
Thus, with each episode of the behavioral solutions, orientation to a solution disorder (and this is a recurring disorder) there behavior, and trial of the solution behavior. is a crisis and disorganization of the entire The core of the treatment is described as family. Thus, the family treatment is an attempt balancing problem-solving strategies with vali- to assist the patient and family to cooperatively dation strategies. deal with this chronic illness. Finally, both in the This manual is exceptional, and provides a rationale for the treatment and in the psychoe- new and very high standard in the field for ducational component of the treatment, the treatment manuals. There are a number of authors espouse a vulnerability±stress model of exemplary features. First, the patient population the disorder, which would argue for a treatment is defined by DSM criteria in addition to specific that reduces patient and family stress. problematic behaviors, that is, parasuicidal The therapeutic stance of the clinician behaviors. Second, the treatment manual was conducting this treatment is explicated well. generated in the context of clinical research. The The clinician is encouraged to be approachable, treatment was developed in the context of open and emotionally accessible. It is recom- operationalization and discovery, that is, how mended that the clinician develop a ªSocratic to operationalize a treatment for research that dialogueº with the family in providing informa- fits the pathology of the patients who are selected tion, and develop a give and take between all and described by specific criteria. The skills parties. Although the treatment has a major training manual (Linehan, 1993b) notes that it psychoeducational component, the clinician is has evolved over a 20 year period, and has been encouraged to be ready to explore the emotional tested on over 100 clients. 
The treatment impact of the information, and not to be simply generated in this context has been used in a classroom teacher. In dealing with both diverse treatment settings to teach therapists of difficult news (e.g., you have a life-long illness) various levels of expertise and training to address and tasks (e.g., problem solving helps reduce BPD patients. This process of teaching the stress), the clinician is encouraged to be treatment to therapists of multiple levels of reasonably optimistic about what the patient competence and training can foster the articula- and family can accomplish if they give the tion of the treatment and enrich its adaptability treatment and its methods a chance to succeed. to community settings. This treatment has a The treatment is described in the manual in duration of one year or more because it is terms of three major phases or modules that addressed to a very difficult and seriously have a clear sequence to them. The first phase of 196 Intervention Research: Development and Manualization psychoeducation is intended to provide the depression, anxiety disorders, substance abuse patient and family with information that gives and eating disorders. Treatments for Axis II them an understanding of bipolar disorder and disorders are less well developed and manua- its treatment. It is hoped that this information lized, except for BPD. Problem areas such as may have practical value in informing the marital discord, sexual dysfunction and proble- patient and family as to things they should do matic emotion expression are also addressed. (e.g., medication compliance) that assists in Further development will probably come in managing the illness. The second phase or addressing common comorbid conditions. In module of treatment is a communication addition, there has been more funding and enhancement training phase. 
The aim of this research for brief treatments with a cognitive- phase is to improve and/or enhance the patient behavioral orientation. Longer treatments, and family's ability to communicate with one maintenance treatment for chronic conditions, another clearly and effectively, especially in the and the development of manuals for strategies face of the stress-producing disorder. The third and techniques other than cognitive-behavioral or problem-solving phase provides the patient ones can be expected in the future. With the and family with training in effective conflict heavy incursion of managed care with its resolution. Prior to extensive explanation of the emphasis on cost-cutting, more development three phases of the treatment is information on of treatments delivered in a group format may connecting and engaging with patient and also be seen. family, and conducting initial assessment. The authors also examine carefully patient and family criteria for admission to this family 3.09.6 ADVANTAGES OF TREATMENT intervention. Interestingly, they do not rule out MANUALS patients who are not medication compliant at It is clear that the introduction of the present, nor those who are currently abusing treatment manual has had tremendous impact. substances, both common situations in the The very process of writing a treatment manual history of many bipolar patients. This is an forces the author to think through and articulated treatment broad enough to encou- articulate the details of the treatment that rage those patients who are on the cusp of what may not have been specified before. In this way, is actually treatable. the era of the written treatment manual has This manual raises directly the issue of patient fostered fuller articulation of treatment by the medication compliance and the combination of treatment originators, and furthered the ability psychosocial treatment with medication treat- of teachers of the treatment to explicate the ment. 
Bipolar patients are responsive to certain treatment to trainees. medications, and medication treatment is Manuals provide an operationalized state- considered a necessity. Thus, this treatment ment of the treatment being delivered or manual provides psychoeducation around the researched so that all can examine it for its foci need for medication, and encourages patient and procedures. It cannot be assumed that the and family agreement about continuous med- treatment described in the written manual was ication compliance. delivered as described by the therapists in the This treatment manual can be placed in study or in a treatment delivery system. This gap context by comparing it to a treatment for between the manual and the treatment as bipolar disorder in the individual treatment delivered highlights the need for rating scales format (Basco & Rush, 1996). This manual to assess the faithful (i.e., adherent and should also be seen in the context of quite competent) delivery of the treatment as de- similar treatments for patients with schizophre- scribed in the manual. nia and their families (such as those articulated Manuals provide a training tool for clinical by Anderson, Reiss, & Hogarty, 1986; Bellack, research and for clinical practice. It has been Mueser, Gingerich, & Agresta, 1997; Falloon, noted (Chambless, 1996; Moras, 1993) that Boyd, & McGill, 1984). students learn treatment approaches much more quickly from their systematic depiction in 3.09.5.4 Other Treatment Manuals manuals than through supervision alone. Beyond the three manuals chosen for illus- tration, it is interesting to note the patient 3.09.7 POTENTIAL DISADVANTAGES diagnoses and problem areas that are currently AND LIMITATIONS OF addressed by a manual describing a psychother- TREATMENT MANUALS apy. 
As noted in Table 1, there are treatments described in manuals for major Axis I disorders Dobson and Shaw (1988) have noted six such as schizophrenia, bipolar disorder, major disadvantages of treatment manuals: (i) the Potential Disadvantages and Limitations of Treatment Manuals 197

inability to assess the effects of therapists' variables, (ii) the diminished ability to study the therapy process, (iii) a focus on treatment fidelity rather than on competence, (iv) the increased expense for research, (v) the over-researching of older and more codified therapy procedures, and (vi) the promotion of schools of psychotherapy. It is important to distinguish the limitations of the manuals as they have been developed up to now from the limitations of manuals as they can be if the field continues to improve them. There is no reason why many of the concerns listed above cannot be incorporated into future manuals. Some are already being included, such as therapist variables, process, and competence.

Table 1 Representative treatment manuals.

Disorder/problem area            Reference

Panic                            Barlow and Cerny (1988)
Obsessive-compulsive disorder    Steketee (1993)
                                 Turner and Beidel (1988)
PTSD                             Foy (1992)
Depression                       Beck, Rush, Shaw, and Emery (1979)
                                 Klerman, Weissman, Rounsaville, and Chevron (1984)
Bipolar                          Miklowitz and Goldstein (1997)
                                 Basco and Rush (1996)
Schizophrenia                    Bellack, Mueser, Gingerich, and Agresta (1997)
                                 Anderson, Reiss, and Hogarty (1986)
                                 Falloon, Boyd, and McGill (1984)
Borderline                       Linehan (1993a, 1993b)
                                 Clarkin, Yeomans, and Kernberg (in press)
                                 Marziali and Munroe-Blum (1994)
                                 Rockland (1992)
Marital discord                  Baucom and Epstein (1990)
Sexual dysfunction               Wincze and Carey (1991)
Alcohol abuse                    Sobell and Sobell (1993)
Binge eating                     Fairburn, Marcus, and Wilson (1993)
Problematic emotion schemas      Greenberg, Rice, and Elliott (1993)

Probably the most extensively debated issue around manuals is the issue of therapist flexibility in the execution of the treatment. Some would argue that a manual seriously curtails the flexibility and creativity of the therapist, thus potentially eliminating the therapeutic effectiveness of talented therapists. This is the type of issue that, unless infused with data, could be debated intensely for a very long time. Jacobson et al. (1989) compared two versions of behavioral marital therapy, one inflexible and the other in which therapists had flexibility. The outcome of the flexibility condition was no better than that of the inflexible condition. There was a trend for less relapse at follow-up in the flexibly treated couples.

Wilson (1996) notes that allowing therapists to pursue their own somewhat idiographic case formulations rather than following validated treatment in manuals on average might reduce effectiveness rather than enhance it (Schulte, Kuenzel, Pepping, & Schulte-Bahrenberg, 1992). When research therapy programs are transferred to a clinic setting, there tends to be a decrease in effectiveness (Weisz, Donenberg, Han, & Weiss, 1995). A number of studies show that greater adherence to the psychotherapy protocol predicts better outcome (Frank et al., 1991; Luborsky, McLellan, Woody, O'Brien, & Auerbach, 1985).

However, it has been pointed out that adherence to manualized treatments may lead to certain disruptive therapist behaviors (Henry, Schacht, Strupp, Butler, & Binder, 1993). There is the possibility that therapists delivering a manualized treatment, especially those who are just learning the treatment and adhering with concentration, will deliver it with strict adherence but without competence. At its extreme, it might be argued that the use of the treatment manual may de-skill the therapist and interfere with therapist competence. In fact, rigid adherence could lead to poor therapy, and a mockery of what the treatment is intended to be.

In a study reported by Castonguay, Goldfried, Wiser, Raue, and Hayes (1996), when therapists were engaged in an abstract cognitive intervention the outcome appeared worse. However, those interventions were related to a bad outcome only when practiced to the detriment of the therapeutic alliance.

Jacobson and Hollon (1996) point out that the measurement of therapist adherence to treatments as manualized has received much attention, but the measurement of competence is at a rudimentary stage. The manual should provide instruments that have been shown to reliably assess therapist adherence to the treatment as described in the manual, and competence in the delivery of that treatment. Many books that purport to be treatment manuals do not include such instrumentation. The instruments are helpful in further specifying necessary therapist behaviors, and may be useful to the clinical supervisor.

A major limitation of manual development is the number of patients who appear for treatment who are not addressed in the existing treatment manuals. Some of this is due to the research focus of manuals, that is, the need for clinical research to have a narrowly and carefully defined patient population for whom the treatment is intended. Unfortunately, many if not most patients who appear in clinical settings would not fit into a research protocol because of comorbidity or because they do not quite meet criteria for a specific diagnosis (e.g., personality disorder NOS, not otherwise specified). Some of this is simply due to the newness of manuals and the little time there has been for their development.

3.09.8 USE AND ROLE OF TREATMENT MANUALS

Treatment manuals have had a brief but exciting and productive history. There is an extreme position that something as complex, unique, and creative as a psychotherapy between two individuals cannot be manualized. Experience suggests that this is an extreme view, and that manualization has been useful to the field. The issues are how best to utilize a manual and what role the manual should play in clinical research and training.

Whether in the early stages of clinical research or in a clinical training program, the treatment manual can serve as a tool to be used by the expert therapist who is teaching the treatment. It has been noted that the manual enables the student to learn the treatment more quickly than with supervision alone (Chambless, 1996; Moras, 1993). It is our experience, however, that the manual has a place in the teaching toolbox, but is less important than supervision and watching experts doing the treatment on videotape. The manual provides a conceptual overview of the treatment. Videotapes provide a visual demonstration of the manual being applied to a particular patient, in the style of a particular therapist model. Supervision allows the expert to help the trainee apply the manual to a specific patient in a specific treatment, which will always produce some situations that are not exactly covered in the written manual.

It is a naive and false assumption that a clinician can simply read a treatment manual, and thereby be enabled to deliver the treatment with skill. It is interesting that the authors of a treatment manual (Turner & Beidel, 1988) felt the need to argue that just reading a manual is not enough to train an individual in the competent delivery of the treatment in question to the individual patient.

3.09.8.1 Efficacy to Effectiveness

There is currently much discussion about the need to extend clinical trials of empirically validated treatments, that is, treatments that have been shown to be efficacious, to larger, community samples with average therapists (effectiveness research). It may be that treatment manuals can play an important role in this work. Indeed, the ultimate value of a treatment manual and the training system within which it operates will be the result that clinicians can perform the treatment with adherence and competence. Manuals that are most useful will contain scales developed to measure therapist adherence and competence, and most current manuals are lacking this feature.

3.09.9 SUMMARY

Although treatment manuals have been effective in specifying psychotherapy for clinical research, the step from clinical efficacy studies to demonstration of clinical effectiveness of the treatments that have been manualized is still lacking. This is an issue of knowledge transfer. That is, given the demonstration that a specific therapy (that has been manualized) has shown clinical efficacy in randomized clinical trials with a homogeneous patient population and with carefully selected therapists, can this treatment also show clinical benefits when transferred to a setting in which the patients are less homogeneous and the therapists are those who are working at the local level? The written treatment manual may play a role in this transfer of expertise from a small, clinical research group to a larger group of therapists. However, this step has yet to be demonstrated. Thus, the test of a manual, and the entire teaching package within which it resides, is to demonstrate that a wider group of therapists can do the treatment with adherence and competence.

It is interesting to speculate about the future of training in psychotherapy given the advances in the field, including the generation of treatment manuals. For the sake of argument, we noted that there are some 33 manuals for 13 disorders for which there are efficacious treatments in the field of clinical psychology. Should these manuals form the basis of the training in psychotherapy of future psychologists? Or can one generate principles of treatment out of these disparate treatments for various disorders and conditions, and teach these principles? There are obvious redundancies across the manuals, and one could imagine a supermanual with branching points for various disorders. For example, most manuals have an assessment phase, followed by the phase of making an alliance with the patient and describing the characteristics of the treatment to follow, with some indication of the roles of patient and therapist. These are obvious skills that a therapist must learn, with nuances depending upon the patient and the disorder in question. For those manuals that are cognitive-behavioral in strategies and techniques, there seems to be great redundancy in terms of the selected finite number of techniques that are used.

What is missing is guidance on assessing the patient when there is as yet no indication of what the problem is, what the diagnosis is, or which manual or manuals to use for treatment. Each manual seems to presume that clinicians can properly identify patients for that manual. Unfortunately, one cannot train a psychologist to treat only the 13 disorders/problem areas in the list, as many patients suffer from other conditions not covered. To compound things even further, many patients (if not most, depending on the clinical setting) do not suffer from just one condition, but from several.

Our own approach to training is the one implied by Roth, Fonagy, Parry, Target, and Woods (1996), with emphasis on the initial assessment of the patient with specific clinical hypotheses about the situation, proceeding to the most relevant treatment that has been validated to various degrees. This is less the black-and-white world of the empirically supported treatment approach and more the complex condition we call clinical work. Treatment manuals will play some role in this process, but it might be less than the quantum leap suggested by Luborsky and DeRubeis (1984). In our experience, trainees read the treatment manual if they must, but they look forward to seeing experts do the treatment on videotape, and they see supervision from an expert in the treatment as a matter of course.

3.09.10 REFERENCES

Anderson, C. M., Reiss, D. J., & Hogarty, G. E. (1986). Schizophrenia and the family. New York: Guilford Press.
Barlow, D. H., & Cerny, J. A. (1988). Psychological treatment of panic. New York: Guilford Press.
Basco, M. R., & Rush, A. J. (1996). Cognitive-behavioral therapy for bipolar disorder. New York: Guilford Press.
Baucom, D. H., & Epstein, N. (1990). Cognitive-behavioral marital therapy. New York: Brunner/Mazel.
Beck, A. T., & Freeman, A. M. (1990). Cognitive therapy of personality disorders. New York: Guilford Press.
Beck, A. T., Rush, A. J., Shaw, B. F., & Emery, G. (1979). Cognitive therapy of depression. New York: Guilford Press.
Beck, J. S. (1995). Cognitive therapy: Basics and beyond. New York: Guilford Press.
Bellack, A. S., Mueser, K. T., Gingerich, S., & Agresta, J. (1997). Social skills training for schizophrenia: A step-by-step guide. New York: Guilford Press.
Carroll, K. M., Rounsaville, B. J., & Gawin, F. H. (1991). A comparative trial of psychotherapies for ambulatory cocaine abusers: Relapse prevention and interpersonal psychotherapy. American Journal of Drug and Alcohol Abuse, 17, 229–247.
Castonguay, L. G., Goldfried, M. R., Wiser, S., Raue, P. J., & Hayes, A. M. (1996). Predicting the effect of cognitive therapy for depression: A study of unique and common factors. Journal of Consulting and Clinical Psychology, 64, 497–504.
Chambless, D. L. (1996). In defense of dissemination of empirically supported psychological interventions. Clinical Psychology: Science and Practice, 3(3), 230–235.
Chambless, D. L., Baker, M. J., Baucom, D. H., Beutler, L. E., Calhoun, K. S., Crits-Christoph, P., Daiuto, A., DeRubeis, R., Detweiler, J., Haaga, D. A. F., Johnson, S. B., McCurry, S., Mueser, K. T., Pope, K. S., Sanderson, W. C., Shoham, V., Stickle, T., Williams, D. A., & Woody, S. R. (1998). Update on empirically validated therapies. II. Clinical Psychologist, 51, 3–13.
Clarkin, J. F., & Abrams, R. (1998). Management of personality disorders in the context of mood and anxiety disorders. In A. J. Rush (Ed.), Mood and anxiety disorders (pp. 224–235). Philadelphia: Current Science.
Clarkin, J. F., Yeomans, F., & Kernberg, O. F. (in press). Psychodynamic psychotherapy of borderline personality organization: A treatment manual. New York: Wiley.
Dobson, K. S., & Shaw, B. F. (1988). The use of treatment manuals in cognitive therapy: Experience and issues. Journal of Consulting and Clinical Psychology, 56, 673–680.
Ehlers, C. L., Frank, E., & Kupfer, D. J. (1988). Social zeitgebers and biological rhythms: A unified approach to understanding the etiology of depression. Archives of General Psychiatry, 45, 948–952.
Elkin, I., Shea, M. T., Watkins, J. T., Imber, S. D., Sotsky, S. M., Collins, J. F., Glass, D. R., Pilkonis, P. A., Leber, W. R., Docherty, J. P., Fiester, S. J., & Parloff, M. B. (1989). National Institute of Mental Health Treatment of Depression Collaborative Research Program: General effectiveness of treatments. Archives of General Psychiatry, 46, 971–982.
Fairburn, C. G., Jones, R., Peveler, R. C., Carr, S. J., Solomon, R. A., O'Connor, M. E., Burton, J., & Hope, R. A. (1991). Three psychological treatments for bulimia nervosa: A comparative trial. Archives of General Psychiatry, 48, 463–469.
Fairburn, C. G., Marcus, M. D., & Wilson, G. T. (1993). Cognitive-behavioral therapy for binge eating and bulimia nervosa: A comprehensive treatment manual. In C. G. Fairburn & G. T. Wilson (Eds.), Binge eating: Nature, assessment and treatment (pp. 361–404). New York: Guilford Press.

Falloon, I. R. H., Boyd, J. L., & McGill, C. W. (1984). Family care of schizophrenia. New York: Guilford Press.
Foy, D. W. (Ed.) (1992). Treating PTSD: Cognitive-behavioral strategies. Treatment manuals for practitioners. New York: Guilford Press.
Frank, E., Kupfer, D. J., Perel, J. M., Cornes, C., Jarrett, D. B., Mallinger, A. G., Thase, M. E., McEachran, A. B., & Grochocinski, V. J. (1990). Three-year outcomes for maintenance therapies in recurrent depression. Archives of General Psychiatry, 47, 1093–1099.
Frank, E., Kupfer, D. J., Wagner, E. F., McEachran, A. B., & Cornes, C. (1991). Efficacy of interpersonal psychotherapy as a maintenance treatment of recurrent depression. Archives of General Psychiatry, 48, 1053–1059.
Greenberg, L. S., Rice, L. N., & Elliott, R. K. (1993). Facilitating emotional change: The moment-by-moment process. New York: Guilford Press.
Henry, W. P., Schacht, T. E., Strupp, H. H., Butler, S. F., & Binder, J. L. (1993). Effects of training in time-limited dynamic psychotherapy: Mediators of therapists' responses to training. Journal of Consulting and Clinical Psychology, 61, 441–447.
Jacobson, N. S., & Hollon, S. D. (1996). Prospects for future comparisons between drugs and psychotherapy: Lessons from the CBT-versus-pharmacotherapy exchange. Journal of Consulting and Clinical Psychology, 64, 104–108.
Jacobson, N. S., Schmaling, K. B., Holtzworth-Munroe, A., Katt, J. L., Wood, L. F., & Follette, V. M. (1989). Research-structured vs. clinically flexible versions of social learning-based marital therapy. Behaviour Research and Therapy, 27, 173–180.
Kazdin, A. E. (1991). Treatment research: The investigation and evaluation of psychotherapy. In M. Hersen, A. E. Kazdin, & A. S. Bellack (Eds.), The clinical psychology handbook (2nd ed., pp. 293–312). New York: Pergamon.
Klerman, G. L., DiMascio, A., Weissman, M., Prusoff, B., & Paykel, E. S. (1974). Treatment of depression by drugs and psychotherapy. American Journal of Psychiatry, 131, 186–191.
Klerman, G. L., Weissman, M. M., Rounsaville, B. J., & Chevron, E. S. (1984). Interpersonal psychotherapy of depression. New York: Basic Books.
Linehan, M. M. (1993a). Cognitive-behavioral treatment of borderline personality disorder. New York: Guilford Press.
Linehan, M. M. (1993b). Skills training manual for treating borderline personality disorder. New York: Guilford Press.
Luborsky, L., & DeRubeis, R. J. (1984). The use of psychotherapy treatment manuals: A small revolution in psychotherapy research style. Clinical Psychology Review, 4, 5–14.
Luborsky, L., McLellan, A. T., Woody, G. E., O'Brien, C. P., & Auerbach, A. (1985). Therapist success and its determinants. Archives of General Psychiatry, 42, 602–611.
Marziali, E., & Munroe-Blum, H. (1994). Interpersonal group psychotherapy for borderline personality disorder. New York: Basic Books.
Miklowitz, D. J., & Goldstein, M. J. (1997). Bipolar disorders: A family-focused treatment approach. New York: Guilford Press.
Moras, K. (1993). The use of treatment manuals to train psychotherapists: Observations and recommendations. Psychotherapy, 30, 581–586.
Moras, K. (1995, January). Behavioral therapy development program workshop (Draft 2, 3/24/95). National Institute on Drug Abuse, Washington, DC.
Mufson, L., Moreau, D., Weissman, M. M., & Klerman, G. L. (Eds.) (1993). Interpersonal psychotherapy for depressed adolescents. New York: Guilford Press.
Reynolds, C. F., Frank, E., Perel, J. M., Imber, S. D., Cornes, C., Morycz, R. K., Mazumdar, S., Miller, M., Pollock, B. G., Rifai, A. H., Stack, J. A., George, C. J., Houck, P. R., & Kupfer, D. J. (1992). Combined pharmacotherapy and psychotherapy in the acute and continuation treatment of elderly patients with recurrent major depression: A preliminary report. American Journal of Psychiatry, 149, 1687–1692.
Rockland, L. H. (1992). Supportive therapy for borderline patients: A psychodynamic approach. New York: Guilford Press.
Roth, A., Fonagy, P., Parry, G., Target, M., & Woods, R. (1996). What works for whom? A critical review of psychotherapy research. New York: Guilford Press.
Rounsaville, B. J., Glazer, W., Wilber, C. H., Weissman, M. M., & Kleber, H. D. (1983). Short-term interpersonal psychotherapy in methadone-maintained opiate addicts. Archives of General Psychiatry, 40, 629–636.
Schulte, D., Kuenzel, R., Pepping, G., & Schulte-Bahrenberg, T. (1992). Tailor-made versus standardized therapy of phobic patients. Advances in Behaviour Research and Therapy, 14, 67–92.
Sobell, M. B., & Sobell, L. C. (1993). Problem drinkers: Guided self-change treatment. New York: Guilford Press.
Steketee, G. (1993). Treatment of obsessive compulsive disorder. New York: Guilford Press.
Talley, P. F., Strupp, H. H., & Butler, S. F. (Eds.) (1994). Psychotherapy research and practice: Bridging the gap. New York: Basic Books.
Turner, S. M., & Beidel, D. C. (1988). Treating obsessive-compulsive disorder. Oxford, UK: Pergamon.
Weissman, M. M. (1995). Mastering Depression: A patient's guide to interpersonal psychotherapy. San Antonio, TX: Psychological Corporation.
Weissman, M. M., & Markowitz, J. C. (1994). Interpersonal psychotherapy: Current status. Archives of General Psychiatry, 51, 599–606.
Weisz, J. R., Donenberg, G. R., Han, S. S., & Weiss, B. (1995). Bridging the gap between laboratory and clinic in child and adolescent psychotherapy. Journal of Consulting and Clinical Psychology, 63, 688–701.
Wilson, G. T. (1996). Empirically validated treatments: Realities and resistance. Clinical Psychology, 3, 241–244.
Wincz, J. P., & Carey, M. P. (1991). Sexual dysfunction: A guide for assessment and treatment. New York: Guilford Press.
Woody, S. R., & Sanderson, W. C. (1998). Manuals for empirically supported treatments: 1998 update. Clinical Psychologist, 51, 17–21.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.10 Internal and External Validity of Intervention Studies

KARLA MORAS University of Pennsylvania, Philadelphia, PA, USA

3.10.1 INTRODUCTION
3.10.2 DEFINITIONS
3.10.2.1 Overview of IV, EV, CV, and SCV
3.10.2.2 Internal Validity
3.10.2.3 External Validity
3.10.2.4 Construct Validity
3.10.2.5 Statistical Conclusion Validity
3.10.3 COMMON THREATS TO IV, EV, CV, AND SCV
3.10.4 EXPERIMENTAL DESIGNS AND METHODS THAT ENHANCE IV, EV, CV, AND SCV
3.10.4.1 IV Methods and Designs
3.10.4.1.1 Random assignment
3.10.4.2 Untreated or Placebo-treated Control Group
3.10.4.3 Specification of Interventions
3.10.4.4 Process Research Methods
3.10.4.5 EV Methods
3.10.4.5.1 Naturalistic treatment settings
3.10.4.6 Inclusive Subject Selection
3.10.4.6.1 Staff therapists
3.10.4.7 CV Methods
3.10.4.7.1 Specification of interventions: treatment manuals
3.10.4.7.2 Specification of interventions: therapist adherence measures
3.10.4.7.3 Distinctiveness of interventions
3.10.4.7.4 Adjunctive therapies
3.10.4.7.5 Assessment procedures
3.10.4.8 SCV Methods
3.10.4.8.1 Statistical power analysis
3.10.4.8.2 Controlling for the type 1 error rate
3.10.4.8.3 Testing assumptions of statistical tests
3.10.5 FROM IV VS. EV TO IV + EV IN MENTAL HEALTH INTERVENTION RESEARCH
3.10.5.1 Strategies for Efficacy + Effectiveness (IV + EV) Intervention Studies
3.10.5.1.1 The conventional “phase” strategy
3.10.5.1.2 A dimensional adaptation of the phase model
3.10.5.1.3 Stakeholder's model to guide data selection
3.10.5.1.4 Mediator's model
3.10.5.2 Examples of Efficacy + Effectiveness Intervention Studies
3.10.5.2.1 Schulberg et al. (1995)
3.10.5.2.2 Drake et al. (1996)
3.10.6 A CONCLUDING OBSERVATION
3.10.7 REFERENCES


3.10.1 INTRODUCTION

The concepts of internal and external validity (IV and EV) were introduced by Campbell and Stanley in 1963. IV and EV concern the validity of inferences that can be drawn from an intervention study, given its design and methods. The concepts are used to assess the extent to which a study's outcome findings can be confidently: (i) interpreted as evidence for hypothesized causal relationships between interventions and outcomes (IV), and (ii) assumed to generalize beyond the study situation (EV). IV and EV are conceptual tools that guide deductive (IV) and inductive (EV) thinking about the impact of a study's design and methods on the validity of the conclusions that can be drawn from it. The concepts are not only of academic interest. Evaluation of a study's IV and EV is a logical, systematic way to judge if its outcome findings provide compelling evidence that an intervention merits implementation in public sector settings.

This chapter is written at a unique time in the history of IV and EV. The concepts have been at the forefront of a contemporary, often contentious debate in the USA about the public health value of alternative methods and designs for intervention research (e.g., Goldfried & Wolfe, 1998; Hoagwood, Hibbs, Brent, & Jensen, 1995; Jacobson & Christensen, 1996; Lebowitz & Rudorfer, 1998; Mintz, Drake, & Crits-Christoph, 1996; Newman & Tejeda, 1996; Seligman, 1996; Wells & Sturm, 1996). The alternatives are referred to as “efficacy” vs. “effectiveness” studies. In current parlance, efficacy studies have high IV due to designs and methods that reflect a priority on drawing causal conclusions about the relationship between interventions and outcomes (e.g., Elkin et al., 1989). Effectiveness studies have high EV due to designs and methods that reflect the priority of obtaining findings that can be assumed to generalize to nonresearch intervention settings and clinic populations (e.g., Speer, 1994). Typically, effectiveness studies are done to examine the effects of interventions with demonstrated efficacy in high IV studies, when the interventions are used in standard community treatment settings. A related term, “clinical utility research,” has started to appear (Beutler & Howard, 1998). The types of research questions connoted by clinical utility are broader than those connoted by effectiveness research (Howard, Moras, Brill, Martinovich, & Lutz, 1996; Kopta, Howard, Lowry, & Beutler, 1994; Lueger, 1998).

The efficacy vs. effectiveness debate is prominent in mental health intervention research. IV and EV often are viewed as competing rather than complementary research aims (Kazdin, 1994; Roth & Fonagy, 1996). An alternative view is that EV questions can be asked about a study's findings only after the study's IV has been established (Flick, 1988; Hoagwood et al., 1995). A recent trend is to encourage investigators to design studies that can optimize both IV and EV (Clarke, 1995; Hoagwood et al., 1995). The topic is pursued later in the chapter.

This chapter is intended to provide a relatively concise, simplified explication of IV and EV that will be useful to neophyte intervention researchers and to consumers of intervention research. The main aims are to enhance the reader's sophistication as a consumer of intervention research and the reader's ability to use the concepts of IV and EV to design intervention studies. IV and EV are discussed and illustrated mainly from the perspective of research on interventions of a certain type: psychological therapies for mental health problems (Bergin & Garfield, 1971, 1994; Garfield & Bergin, 1978, 1986). The topics covered are: (i) definitions of IV and EV and of two newer, closely related concepts, construct validity (CV) and statistical conclusion validity (SCV); (ii) threats to IV, EV, CV, and SCV; (iii) designs and methods that are commonly used in mental health intervention research to enhance IV, EV, CV, and SCV; and (iv) suggested strategies from the efficacy vs. effectiveness debate to optimize the scientific validity, generalizability, and public health value of intervention studies. Finally, two intervention studies that were designed to meet both IV and EV aims are used to illustrate application of the concepts.

3.10.2 DEFINITIONS

3.10.2.1 Overview of IV, EV, CV, and SCV

Kazdin (1994) provided a concise overview of IV, EV, CV, and SCV, and of common threats to each type of validity (Table 1). The discussion of the four concepts in this chapter is based on Cook and Campbell's (1979) conceptualizations, as is Kazdin's table. Campbell and Stanley's (1963, 1966) original conceptualizations of IV and EV were extended and slightly revised by Cook and Campbell in 1979. The arguments and philosophy of science assumptions that underpin Cook and Campbell's (1979) perspective have been challenged (Cronbach, 1982). A key criticism is that they overemphasized the value of high IV studies in a way that was inconsistent with EV (generalizability) trade-offs often associated with such studies (Shadish, 1995). Cook and Campbell's views were adopted in this chapter because, challenges notwithstanding, they have had a major influence on mental health intervention research since the late 1960s.

Table 1 Types of experimental validity, questions they address, and their threats to drawing valid inferences.

Type of validity: Internal validity
Questions addressed: To what extent can the intervention, rather than extraneous influences, be considered to account for the results, changes, or group differences?
Threats to validity: Changes due to influences other than the experimental conditions, such as events (history) or processes (maturation) within the individual, repeated testing, statistical regression, and differential loss of subjects

Type of validity: External validity
Questions addressed: To what extent can the results be generalized or extended to persons, settings, times, measures, and characteristics other than those in this particular experimental arrangement?
Threats to validity: Possible limitations on the generality of the findings because of characteristics of the sample; therapists; or conditions, context, or setting of the study

Type of validity: Construct validity
Questions addressed: Given that the intervention was responsible for change, what specific aspects of the intervention or arrangement were the causal agents; that is, what is the conceptual basis (construct) underlying the effect?
Threats to validity: Alternative interpretations that could explain the effects of the intervention, that is, the conceptual basis of the findings, such as attention and contact with the subject, expectancies of subjects or experimenters, cues of the experiment

Type of validity: Statistical conclusion validity
Questions addressed: To what extent is a relation shown, demonstrated, or evident, and how well can the investigation detect effects if they exist?
Threats to validity: Any factor related to the quantitative evaluation that could affect interpretation of the findings, such as low statistical power, variability in the procedures, unreliability of the measurement, inappropriate statistical tests

Source: Kazdin (1994).

A basic premise of Cook and Campbell's definitions of IV and EV is that the term “validity” can only be used in an approximate sense. They would say, for example, that judgments of the validity of the conclusions that can be drawn from a study must always be understood as approximate because, from an epistemological perspective, we can never definitively know what is true, only that which has not been shown to be false.

3.10.2.2 Internal Validity

Simply stated, IV refers to the extent to which causal conclusions can be correctly drawn from a study about the relationship between an independent variable (e.g., a type of therapy) and a dependent variable (e.g., symptom change). The full definition reads: “Internal validity refers to the approximate validity with which we infer that a relationship between two variables is causal or that the absence of a relationship implies the absence of cause” (Cook & Campbell, 1979, p. 37).

3.10.2.3 External Validity

EV refers to the extent to which causal conclusions from a study about a relationship between interventions and outcomes can be assumed to generalize beyond the study's specific features (e.g., the patient sample, the therapists, the measures of outcome, the study setting). The full definition of EV is: “. . . the approximate validity with which we can infer that the presumed causal relationship can be generalized to and across alternative measures of the cause and effect and across different types of persons, settings, and times” (Cook & Campbell, 1979, p. 37).

3.10.2.4 Construct Validity

In their 1979 update of Campbell and Stanley (1963, 1966), Cook and Campbell highlighted two new concepts that are closely linked to IV and EV: CV and SCV. Both have been incorporated into contemporary thinking about experimental design and threats to IV and EV (e.g., Kazdin, 1994).

Cook and Campbell (1979) focused their discussion of CV on the “putative causes and effects” (i.e., interventions and outcomes) of intervention studies. CV reflects the addition of a concept developed earlier by Cronbach and Meehl (1955) to Cook and Campbell's (1979) model of experimental validity. Simply stated, CV is the goodness of fit between the methods used to operationalize constructs (interventions and outcome variables) and the referent constructs. In other words, CV is the extent to which the methods used to measure and operationalize interventions and outcomes are likely to reflect the constructs that the investigators say they studied (e.g., cognitive therapy and depression). A more precise definition of CV is: “. . . the possibility that the operations which are meant to represent a particular cause or effect construct can be construed in terms of more than one construct, each of which is stated at the same level of reduction” (p. 59). For example, the more possible it is to construe measures in terms of constructs other than those named by the investigators, the lower the CV.

CV also can be described as “what experimental psychologists are concerned with when they worry about ‘confounding’” (Cook & Campbell, 1979, p. 59). An example of a threat to CV is the possibility that the effects of a medication for depression are due largely to the treating psychiatrist's concern for a patient and nonjudgmental reactions to his or her symptoms, rather than to neurochemical effects of the drug. Such alternative interpretations of the therapeutic cause of outcomes in medication intervention studies lead to: (i) the inclusion of pill placebo control conditions, (ii) randomized assignment to either active drug or pill placebo, and (iii) the use of double-blind procedures, that is, neither treater nor patient knows if the patient is receiving active drug or placebo.

Cook and Campbell (1979) link CV to EV. They say that generalizability is the essence of both. However, an integral aspect of CV is the adequacy with which the central variables of an intervention study (the interventions and outcomes) are operationalized. Thus, CV is also necessarily linked to IV: the CV of the methods used to measure and operationalize interventions and outcomes affects the validity of any causal conclusions drawn about a relationship between the designated interventions and outcomes.

3.10.2.5 Statistical Conclusion Validity

SCV is described by Cook and Campbell (1979) as a component of IV. SCV refers to the effects of the statistical methods used to analyze study data on the validity of the conclusions about a relationship (i.e., covariation) between interventions and outcomes. SCV concerns “particular reasons why we can draw false conclusions about covariates” (Cook & Campbell, 1979, p. 37). It “is concerned . . . with sources of random error and with the appropriate use of statistics and statistical tests” (p. 80). For example, a determinant of SCV is whether the assumptions of the statistical test used to analyze a set of outcome data were met by the data.

3.10.3 COMMON THREATS TO IV, EV, CV, AND SCV

A study's IV, EV, CV, and SCV are determined by its design and methods, by the psychometric adequacy of its measures and operationalizations of central constructs, and by the appropriateness of the statistical tests used to analyze the data. All must be assessed to evaluate a study's IV, EV, CV, and SCV.

“Design” refers to elements of a study's construction (the situation into which subjects are placed) that determine the probability that causal hypotheses about a relationship between the independent and dependent variables can validly be tested. For example, a design feature is the inclusion of a no-treatment control group of some type (e.g., pill placebo) in addition to the treatment group. A no-therapy group provides outcome data to compare with the outcomes of the treated group. The comparison allows examination of the possibility that changes associated with the treatment also occur without it and, thus, cannot be causally attributed to it.

“Methods” refers to a wide variety of procedures that are used to implement study designs, such as random assignment of subjects to each intervention group included in a study. A study's methods and design together determine the degree to which any relationship found between interventions and outcomes can validly be attributed to the interventions rather than to something else. In other words, study methods and design determine whether alternative or rival interpretations of findings can be dismissed as improbable. Designs and methods that affect a study's IV, EV, CV, and SCV are discussed in Section 3.10.4.

The use of IV, EV, CV, and SCV to guide study design and critical appraisal of study findings is assisted by knowledge of common threats to each type of validity. Cook and Campbell (1979) discussed several threats to each type. They cautioned, however, that no list is perfect and that theirs was derived from their own research experience and from reading about potential sources of fallacious inferences.
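The reasoning behind a no-treatment comparison group, combined with random assignment as a method, can be made concrete with a small simulation. The sketch below is illustrative only and rests on stated assumptions: the function name, the sample size, and the numerical values for natural change and treatment effect are hypothetical, and the data are simulated rather than taken from any study cited in this chapter. It shows that when all subjects change somewhat without treatment, the mean change in the treated group alone overstates the effect of the intervention, whereas the difference between the randomized groups approximates it.

```python
import random
import statistics


def simulate_trial(n_subjects=2000, natural_change=-3.0,
                   treatment_effect=-5.0, seed=0):
    """Simulate pre-to-post symptom change in a two-group randomized study.

    All values are hypothetical. Negative numbers mean symptom reduction.
    Every subject changes by `natural_change` on average (change that occurs
    without treatment); treated subjects change by an additional
    `treatment_effect`.
    """
    rng = random.Random(seed)
    changes = {"treatment": [], "control": []}
    for _ in range(n_subjects):
        # Simple random assignment: each subject has an equal chance
        # of entering either condition.
        group = "treatment" if rng.random() < 0.5 else "control"
        change = natural_change + rng.gauss(0.0, 4.0)
        if group == "treatment":
            change += treatment_effect
        changes[group].append(change)
    mean_treated = statistics.mean(changes["treatment"])
    mean_control = statistics.mean(changes["control"])
    # The between-group difference estimates the change attributable
    # to the intervention itself.
    return mean_treated, mean_control, mean_treated - mean_control


if __name__ == "__main__":
    treated, control, difference = simulate_trial()
    print(f"mean change, treated group: {treated:.2f}")
    print(f"mean change, control group: {control:.2f}")
    print(f"estimated treatment effect: {difference:.2f}")
```

Under these assumptions the treated group's mean change reflects both natural change and the intervention, so only the comparison with the control group isolates the part that can be causally attributed to treatment.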

The threats to each type of validity identified by Cook and Campbell are listed in Tables 2–5. The tables also present examples of designs and methods that can offset the various threats (“antidotes”). The interested reader is encouraged to review Cook and Campbell's discussion of threats to IV, EV, CV, and SCV; design and methodological antidotes to the threats; and limitations of common antidotes.

In the following sections, the four types of threats to validity are discussed in turn. In each section, examples of designs and methods are described that are used in contemporary psychological and pharmacological mental health treatment research to enhance that type of validity. Only a sampling of designs and methods relevant to each type of validity is described due to space limitations.

3.10.4 EXPERIMENTAL DESIGNS AND METHODS THAT ENHANCE IV, EV, CV, AND SCV

Each design and method discussed in the sections that follow is limited in the extent to which it can ensure that a particular type of experimental validity is achieved. Some limitations result mainly from the fact that patient-subjects are always free to deviate from research treatment protocols they enter, by dropping out of treatment, for example. Other limitations arise because research methods that can enhance IV can simultaneously reduce EV. For example, using psychotherapists who are experts in conducting a form of therapy can both increase IV and reduce EV.

One point merits emphasis. The experimental validity potential of alternative designs and methods is always contingent on the match between study hypotheses and a study's design and methods. For example, the IV potential of a particular design differs if the main aim of a study is to obtain data that can be interpreted as evidence for a therapy's hypothesized effects vs. to obtain data that can be interpreted as evidence for the comparative efficacy of alternative therapies for the same problem.

3.10.4.1 IV Methods and Designs

Table 2 lists and defines common IV threats. The text that follows elaborates on some of the information in Table 2.

3.10.4.1.1 Random assignment

Random assignment of subjects to all intervention groups in a study design is a sine qua non of IV. Broadly defined, random assignment means that a procedure is used to ensure that each new subject has an equal chance to be assigned to every intervention condition in a study. A simple random assignment procedure is the coin flip: heads, the subject is assigned to one intervention in a two-intervention study; tails, he or she receives the other one. Various procedures are used for random assignment, including sophisticated techniques like urn randomization (Wei, 1978). Urn randomization simultaneously helps ensure that subjects are randomly assigned to interventions, and that the subjects in each one are matched on preidentified characteristics (e.g., co-present problems) that might moderate the effects of an intervention (Baron & Kenny, 1986).

Random assignment contributes to IV by helping to ensure that any differences found between interventions can be attributed to the interventions rather than to the subjects who received them. The rationale for randomization is that known and unknown characteristics of subjects that can affect outcomes will be equally distributed across all intervention conditions. Hence, subject features will not systematically affect (bias) the outcomes of any intervention.

Random assignment has limitations as a way to enhance IV. For example, it is not a completely reliable method to ensure that all outcome-relevant, preintervention features of subjects are equally distributed across intervention conditions. Even when randomization is used, by chance some potentially outcome-relevant subject characteristics can be more prevalent in one intervention than another (Collins & Elkin, 1985). This fact can be discovered post hoc when subjects in each intervention are found to differ on a characteristic (e.g., marital status) that is also found to relate to outcome. This happened, for example, in the US National Institute of Mental Health's Treatment of Depression Collaborative Research Program study (Elkin et al., 1989).

Attrition (e.g., of subjects, of subjects' outcome data) also limits the IV protection provided by random assignment (Flick, 1988; Howard, Cox, & Saunders, 1990; Howard, Krause, & Orlinsky, 1986). For example, subjects can always drop out of treatment or fail to provide data at all required assessment points. Both are examples of postinclusion attrition in Howard, Krause, et al.'s (1986) terminology. The problems associated with, and types of, attrition have been carefully explicated (Flick, 1988; Howard, Krause, et al., 1986). For example, the core IV threat associated with dropout is that the subjects who drop out of each study intervention might differ somehow (e.g., severity of symptoms). This would create differences in the subjects who complete each intervention and simultaneously

Table 2 Internal validity.

Threat: Historical factors
Description: Events that occur between pre- and postintervention measurement could be responsible for the intervention effects found (e.g., a public information campaign to reduce drug use in teenagers coincides with an adolescent drug treatment intervention study)
Example antidote: If the historical event does not overlap completely with the study period, the outcome data can be analyzed separately for subject cohorts that received the intervention before the event vs. during it, and results compared to estimate the intervention's effects independent of the event

Threat: Maturation
Description: Changes in subjects that naturally occur with the passage of time could be responsible for pre- to postintervention changes in outcomes
Example antidote: Include a no or minimal treatment control condition in the study, and randomize subjects to treatment conditions

Threat: Testing
Description: Pre- to postintervention changes on outcome measures could be due to repeated administration of the measures, rather than to change in the outcome variables assessed
Example antidote: Evaluate the impact of repeated testing by administering the outcome measures to a randomly selected subset of subjects only at postintervention. Compare the final scores with those of subjects to whom the measures were repeatedly administered. Use the Solomon Four-Group design (Campbell & Stanley, 1963; Rosenthal & Rosnow, 1991) to examine both the effects of repeated testing and any interaction of testing and interventions

Threat: Instrumentation
Description: Pre- to postintervention differences in the way an outcome measure is administered could be responsible for intervention effects (e.g., different diagnosticians administer a structured diagnostic interview at pre- and at post-test to the same subject)
Example antidote: Keep instruments and administration procedures constant throughout a study (e.g., recalibrate laboratory instruments and human assessors during study)

Threat: Statistical regression
Description: Change in pre- to postintervention scores includes “normal” drift toward more moderate scores whenever an initial score is either higher or lower than the average (mean) tendency of the group
Example antidote: Use the statistical technique of covarying preintervention scores from termination scores in outcome analyses (e.g., analysis of covariance, multiple regression)

Threat: Selection
Description: Differences between subjects assigned to a study's intervention conditions could account for outcome differences in the interventions
Example antidote: Randomly assign subjects to intervention conditions

Threat: Attrition or mortality
Description: Differences between subjects who complete each study intervention (e.g., due to dropout from the interventions) could be responsible for differences in intervention outcomes
Example antidote: Collect outcome data at every planned assessment point from all subjects (including dropouts and those withdrawn due to deterioration) and conduct outcome analyses on the “intention-to-treat” sample (see text)

Threat: Ambiguity about the direction of causal influence
Description: The study design does not make it possible to determine if A causes B or B causes A, although the hypotheses assume unidirectional causality (e.g., in a correlational study it cannot be determined whether improvement in the therapeutic alliance enhanced outcome or if a subject's improvement enhanced the therapeutic alliance)
Example antidote: Do not use cross-sectional designs and standard correlational statistics if causal conclusions are desired. When theory and empirical findings exist to support causal hypotheses, one alternative is to develop and test causal models on data that lack experimental controls needed to infer causality (e.g., Kenny, 1979)

Threat: Diffusion or imitation of treatments
Description: The putative active elements of a study intervention are somehow disseminated to subjects in other intervention and/or control conditions (e.g., subjects who receive a treatment for drug addiction give copies of their treatment workbooks to subjects in the control condition)
Example antidote: When informed consent is administered, emphasize the requirement to refrain from discussing one's treatment with other subjects. Guarantee all subjects the opportunity to receive a study treatment (e.g., at the end of the study)

Threat: Compensatory equalization of treatments
Description: Personnel who have control over desirable interventions compromise the study design by somehow making them available to all subjects. For example, in a study in which therapists serve as their own controls (the same therapists administer the interventions and control treatment conditions), therapists have difficulty refraining from using elements of the study's “active” interventions in the therapy sessions of subjects who are in the control condition
Example antidote: Ensure that subjects who do not get the desirable intervention during the study will have access to it after the study. Let subjects know at the beginning of the study that this will happen. Antidotes for the therapist example are to audiotape all therapy sessions and monitor the tapes for therapists' adherence to the intervention conditions. Immediately give therapists feedback when they deviate from adherence requirements. Also, offer all subjects who are randomized to control conditions the opportunity to receive the (putative) active treatments at the end of the study; inform therapists at the study outset that control subjects will have the opportunity to receive an active treatment

Threat: Compensatory rivalry by respondents receiving less desirable treatments
Description: Subjects are aware of all intervention conditions in the study, and those in a “standard” condition are motivated to outperform those in “special” intervention conditions. Thus, any differences or lack of differences found between intervention and/or control conditions cannot be attributed to the interventions
Example antidote: This threat applies only to some types of intervention studies. For example, it can occur when a study setting is an intact unit (such as a department) and subjects in the standard condition expect some negative personal effect (e.g., lose their jobs) of study findings. Investigators need to be aware of the potential relevance of this threat to a study and either redesign the project or take actions to reduce its potential influence (e.g., see above antidote for compensatory equalization)

Threat: Resentful demoralization of respondents receiving less desirable treatments
Description: This threat is the opposite of the threat of compensatory rivalry. Subjects in a control or other intervention condition underperform relative to their capacity and thereby contribute to artificial differences between the intervention outcomes
Example antidote: Investigators need to be aware of the relevance of this threat to a particular study and either redesign the project or take actions to reduce its potential influence (e.g., promise subjects in the less desirable intervention condition the opportunity to receive the more desirable intervention at the end of the study)

Statements in the first two columns and some in the third are abstracted from Cook and Campbell (1979).

in those who provide outcome data. This type of threat of attrition to IV has been called the "differential sieve" (Hollon, Shelton, & Loosen, 1991), an apt metaphor. If differential attrition by intervention occurs, any outcome differences could be due to the subjects from whom assessments were obtained rather than to the interventions examined.

One methodological antidote to the IV problems posed by attrition due to dropout is to conduct outcome analyses on the "intention-to-treat" sample (Gillings & Koch, 1991). The intention-to-treat sample can be defined in different ways. For example, one definition is all subjects who were randomized to study interventions. Another definition is all subjects who attended at least one intervention session. The latter type of definition is often used when investigators want to generalize outcome findings only to patients who at least are willing to try an intervention.

Intention-to-treat analyses ideally require outcome assessments of all dropouts (and of all others who discontinued treatment prematurely, such as by being withdrawn by the investigators due to treatment noncompliance or deterioration) at the time(s) when they would have been assessed if they had remained in treatment. The intent-to-treat analytic strategy also contributes to a study's EV by maximizing the probability that the findings can be assumed to generalize to all individuals who would meet the study entry criteria.

Unfortunately, intention-to-treat analyses have limitations with respect to protecting a study's IV. The reader is referred to Flick (1988) and to Howard, Krause et al. (1986) for comprehensive discussions of other methods to evaluate and reduce the effects of attrition on a study's IV and EV.

3.10.4.2 Untreated or Placebo-treated Control Group

The inclusion of a no-treatment group (e.g., wait list) or putative sham treatment (e.g., pill placebo) in a study design is required to validly answer causal questions like "Does an intervention have certain effects?" A no-treatment or placebo condition to which subjects are randomly assigned and which continues for the same duration as a putative active treatment allows several rival explanations for a relationship between a treatment and outcomes to be dismissed. For example, if the outcomes associated with an active treatment are statistically better than the outcomes associated with a control treatment, then the active treatment outcomes cannot be attributed solely to changes that occur naturally with the passage of time, maturation, or "spontaneous remission."

The inclusion of control treatment conditions of some type is logically very compelling as a way to increase an intervention study's IV. Unfortunately, the compelling logic is not mirrored in practice. Reams have been written on problems associated with designing placebo therapies that meet IV goals (e.g., O'Leary & Borkovec, 1978). For example, credibility is one issue (Parloff, 1986). How can a placebo psychotherapy be developed that has no theoretically active therapeutic ingredients but that seems equally credible to subjects as a putative active treatment?

Pill placebo conditions used in psychopharmacology research are often thought to be free of the IV limitations associated with psychotherapy placebos. However, pill placebo conditions are also limited in terms of IV. For example, active medications almost always have side effects. Thus, the placebo pill used should mimic the side effects of the active drug. It also is important that both the subject and the prescribing therapist be blind (uninformed) about who is getting placebo and who is getting active medication. Experienced psychopharmacologists often, probably typically, can tell which subjects are receiving active medication (Greenberg, Bornstein, Greenberg, & Fisher, 1992; Margraf et al., 1991; Ney, Collins, & Spensor, 1986; Rabkin et al., 1986). The failure of the blinding procedure is problematic because therapists' impressions could be associated with different behavior toward placebo and active medication subjects. For example, a therapist might be less encouraging about the possibility of improvement if he or she suspects that a subject is getting placebo rather than medication. Differential therapist behavior toward placebo and active medication subjects could affect outcomes and thereby reduce the IV of the study design.

Another common threat to the IV value of no-treatment and placebo treatment conditions is contamination or diffusion. Putative therapeutic elements of an intervention can somehow infiltrate the control condition. A well-known example of this is the "MRFIT" study conducted in the USA. It was designed to evaluate an educational intervention to reduce the incidence of heart disease in men. The intervention provided information about diet, smoking, and exercise as a way to lower cholesterol and other risks of heart disease (Gotto, 1997; Multiple Risk Factor Trial Interventions Group, 1977). The study design was a randomized, controlled comparison of men who received the intervention with men who received "usual care." During the years that the study was done, information used in the intervention was widely disseminated in the US media, thus threatening the IV value of the usual care control sample.

Another way contamination of a control treatment can occur is via contact between the control subjects and those receiving the active treatments. For example, subjects in active treatments might give control subjects intervention materials (e.g., copies of intervention self-help workbooks). To help prevent this threat to IV, subjects can be required, as part of the study informed consent, to refrain from sharing information about their respective treatments while a study is in progress.
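The two definitions of the intention-to-treat sample given above in the discussion of attrition can be made concrete with a small sketch. It is only an illustration under stated assumptions: the record fields (`randomized`, `sessions_attended`, `outcome_score`) and the example subjects are hypothetical, and the second definition is assumed here to apply to randomized subjects only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subject:
    # All field names are hypothetical, chosen only for this illustration.
    subject_id: str
    randomized: bool
    sessions_attended: int
    outcome_score: Optional[float]  # None means the outcome assessment is missing


def itt_all_randomized(subjects: List[Subject]) -> List[Subject]:
    """Definition 1: every subject who was randomized to a study intervention."""
    return [s for s in subjects if s.randomized]


def itt_attended_at_least_one(subjects: List[Subject]) -> List[Subject]:
    """Definition 2: randomized subjects who attended at least one session."""
    return [s for s in subjects if s.randomized and s.sessions_attended >= 1]


def missing_outcomes(sample: List[Subject]) -> List[str]:
    """IDs of subjects in the sample whose outcome assessment still has to be obtained."""
    return [s.subject_id for s in sample if s.outcome_score is None]


# Made-up example records.
subjects = [
    Subject("S01", True, 12, 8.0),   # completed treatment, assessed
    Subject("S02", True, 0, None),   # randomized, never attended, not assessed
    Subject("S03", True, 3, None),   # dropped out after three sessions, not assessed
    Subject("S04", False, 0, None),  # screened but never randomized
]

sample_1 = itt_all_randomized(subjects)
sample_2 = itt_attended_at_least_one(subjects)
to_follow_up = missing_outcomes(sample_1)
```

Under the first definition, S02 and S03 both belong to the analysis sample and still need outcome assessments; under the second, S02 is excluded, which narrows the population to which the findings can be generalized.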

3.10.4.3 Specification of Interventions

A relatively new set of IV methods focuses on detailed specification of the interventions examined in a study. These methods have been adopted widely in psychotherapy research since the 1980s. Their primary aims are to facilitate replication of studies and to ensure that interventions are properly implemented. The methods include the preparation of treatment manuals (e.g., Beck, 1995; Klerman, Weissman, Rounsaville, & Chevron, 1984), for example. Such manuals describe the treatment rationale, the therapeutic techniques to be used, and the circumstances that should prompt therapists to choose between alternative techniques (Lambert & Ogles, 1988). The methods also include the development of therapist adherence measures to be used by observer-judges of therapy session material (e.g., audiotapes) to assess the extent to which treatments were delivered by study therapists as specified in the manual (Waltz, Addis, Koerner, & Jacobson, 1993).

Treatment specification methods are described more fully in Section 3.10.4.7 on CV. They are cross-referenced here to emphasize the fundamental dependence of a study's IV on the CV of its intervention variables. The dependence is essentially the same as that between the psychometric adequacy of the measures and methods used to operationalize independent and dependent variables, and the possibility of conducting a valid test of hypotheses about the relationship between independent and dependent variables (Kraemer & Telch, 1992). The validity of conclusions about hypotheses is necessarily contingent on the adequacy of the operationalization of central study variables or constructs. The preceding point, albeit basic, merits emphasis because it is often ignored or not understood well in mental health intervention research (Kraemer & Telch, 1992).

3.10.4.4 Process Research Methods

"Process" research refers to a highly developed methodology that has been used in adult psychotherapy research since at least the 1950s (Greenberg & Pinsof, 1986; Kiesler, 1973; Rice & Greenberg, 1984; Russell, 1987). Described generally, process research is the application of measurement methods to therapy session material, such as videotaped recordings or transcripts of therapy sessions. A typical strategy for process research consists of four basic steps. First, a measure is developed of a theoretically important therapy construct (e.g., therapeutic alliance). The measure is designed to be rated by trained judges, based on observation of some type of therapy session material (e.g., videotapes of sessions). Judges are selected and trained to apply the measure. Psychometric studies are done to evaluate and ensure that judges' ratings on the measure meet standards for inter-rater reliability. A plan for sampling material from therapy sessions for judges to rate is developed. The sampling plan is designed to provide a valid idea of the construct. Judges then rate the selected material. In addition to the foregoing basic steps, it is also important to conduct psychometric evaluations of the measure (e.g., the internal consistency reliability of ratings).

Process methods are used to examine a host of research questions (for comprehensive reviews see Orlinsky & Howard, 1986; Orlinsky, Grawe, & Parks, 1994). One of the most valuable uses of process research methods is to test the theories of therapeutic change that are associated with different types of psychotherapy. A standard treatment outcome study is designed to ask questions like "Does a treatment work?" and "Does a therapy work better than an alternative one for the same problem?" Process methods allow the next logical questions to be asked, for example, "Does a therapy work for the reasons it is theoretically posited to?" and "What are the therapeutically active elements of a therapy?" Viewed in terms of Cook and Campbell's (1979) types of experimental validity, process research methods contribute mainly to IV and CV.

Process methods can contribute to a study's IV when they are used to test hypothesized relationships about theoretically causal elements of a therapy (such as specific therapeutic techniques) and targeted outcomes. The precision of the causal conclusions that can be drawn from intervention studies is limited when patients are randomly assigned to treatments, receive treatment, and then outcomes are measured. Even when intervention outcome studies yield positive findings, many fundamental and important questions remain unanswered. Are a therapy's effects mediated by the theoretically hypothesized activities of therapists? Alternatively, are the effects primarily due to nonspecific features of the therapy (e.g., the therapist's empathy and understanding) rather than to its specific techniques (e.g., cognitive restructuring)? (Ilardi & Craighead, 1994; Kazdin, 1979; Strupp & Hadley, 1979). Are all of the recommended therapeutic techniques necessary; are any necessary and sufficient?

Process methods can be used to examine questions of the foregoing type. When used in such ways, they contribute to a study's IV because they help elucidate causality, that is, the extent to which specific elements of a therapeutic intervention can validly be assumed to play a causal role in the outcomes obtained (Gomes-Schwartz, 1978). Furthermore, when process research methods yield answers to mechanisms of action questions like those stated above, the findings can be used to refine and improve existing treatments, for example, to increase their efficiency and efficacy, and to modify the theories of therapeutic change on which treatments are founded.

Process methods can also contribute to CV. For example, process methods are used both to develop instruments to assess therapists' adherence to manualized treatments (e.g., Evans, Piasecki, Kriss, & Hollon, 1984), and to develop a rating strategy (e.g., plan for sampling from the therapy session material to be rated by judges) to assess therapist adherence for intervention studies (e.g., Hill, O'Grady, & Elkin, 1992). More detail is provided on process research methods that contribute to CV in Section 3.10.4.3.

3.10.4.5 EV Methods

Table 3 presents three types of threats to EV that were highlighted by Cook and Campbell (1979). Each threat is described as an interaction between a study feature and the study treatment(s). This is because Cook and Campbell emphasized the difference between generalizing across subject populations, settings, etc., vs. generalizing to target subject populations, settings. Cook and Campbell chose to focus on the former because the latter requires sophisticated random sampling methods which are rarely used in intervention research.

As stated earlier, EV concerns the extent to which outcome findings validly can be assumed to generalize beyond a study's specific therapists, subjects, setting(s), measures, and point in time when a study was conducted. For about the past eight years in the USA, criticisms have been raised increasingly often about intervention studies with apparent low EV (e.g., Lebowitz & Rudorfer, 1998). The criticisms are part of the efficacy vs. effectiveness debate. For example, it is said that the outcomes of therapies done by study therapists who are trained in a therapy and then receive ongoing supervision during a study (to ensure that they continue to adhere to the therapy manual) cannot be assumed to generalize to community therapists who might provide the treatment. The generalizability point is a valid one. At minimum, it underscores the need for therapist training methods and materials to be made widely available for interventions found to be efficacious in high IV studies.

Contemporary controversies aside, methods that can contribute to EV are described next. Some of their limitations are also noted.

3.10.4.5.1 Naturalistic treatment settings

The ultimate aim of psychotherapy intervention studies is to identify therapies that can be provided effectively in settings where individuals who have the problems that are targeted by the therapy are treated. However, studies with the highest IV, for example, those that include random assignment and a placebo treatment, are often done in university-affiliated clinic settings. Such settings are where researchers tend to be located. The settings also tend to be most amenable to the implementation of research methods that are needed for high IV studies.

University-affiliated clinical settings differ in many ways from other settings in which mental health interventions are provided, such as urban and rural community mental health clinics and private practice offices. The differences between a study's setting(s) and the settings to which it is important to generalize its findings reduce EV only if the differences affect the outcomes of the interventions being tested. Thus, to accurately evaluate a study's EV with respect to nonstudy settings, one must know which setting features affect the outcomes of the interventions. Unfortunately, research generally is not done on the foregoing topic. In the absence of knowledge, the prevailing convention seems to be to adopt a conservative position: namely, the more similar to "typical" mental health treatment settings that an intervention study setting is, the greater the study's probable EV.

The foregoing convention results in recommendations like: a study protocol that has high IV should be implemented in a variety of public and private mental health treatment settings, both urban and rural, and in different sections of a country (i.e., in large countries where areas exist that have different subcultures). Unfortunately, the preceding strategy to enhance the EV of settings is rife with limitations (cf. Clarke, 1995). Cook and Campbell (1979) extensively discussed the problems, which they referred to as "major obstacles to conducting randomized experiments in field settings." They also discussed social forces and field settings that are conducive to randomized intervention studies.

A common obstacle is that all community clinics have standard operating procedures. The need to introduce the key IV-enhancing experimental method, random assignment, typically is an enormous challenge in such settings. At minimum, it requires that all regular clinic staff, or at least a representative sample of them, be willing to learn and to correctly implement a new intervention. Assuming that the staff are interested in being trained in a new intervention,

Table 3 External validity (generalizability).

Threat: Interaction of selection and treatment
Description: The study's subject inclusion and exclusion criteria and/or other requirements of study participation could result in a sample that responds differently to the intervention(s) than would other individuals of the type for whom the intervention(s) was designed.
Example antidote: Make entry into the study as convenient as possible to maximize the representativeness of the sample of the desired population. Attempt to replicate the study using modified inclusion criteria (e.g., drop the prerandomization study induction procedure that was used in the original study)

Threat: Interaction of setting and treatment
Description: Features unique to the study setting(s) could contribute to the intervention effects (e.g., the setting was a well-known treatment and research clinic, famous for one of the interventions)
Example antidote: Attempt to replicate the study by simultaneously or consecutively carrying out the same design and procedure in two or more settings

Threat: Interaction of history and treatment
Description: Features unique to the time when the study was done could contribute to the intervention effects (e.g., a study of therapies for depression was conducted during the same year that the US National Institute of Mental Health ran a major public information campaign on the efficacy of psychological therapies for depression)
Example antidote: Attempt to replicate the study

Statements in the first two columns and some in the third are abstracted from Cook and Campbell (1979). The threats are listed as interactions between variables and study treatments because the threats apply when the intention is to generalize across populations, settings, etc., rather than to target populations, settings, etc. Cook and Campbell provide an extended discussion of the difference.

training time must be made available to them. The need for training can pose a major hurdle because staff work days typically are fully booked with existing job responsibilities.

An alternative is to hire new clinic staff specifically to conduct treatments for a study. This strategy typically is associated with pragmatic limitations such as office space. Also, hiring new staff can compromise the setting EV goal, the main reason for wanting to conduct an intervention study in a field setting.

3.10.4.6 Inclusive Subject Selection

Mental health interventions are developed to be helpful to individuals who have particular types of problems. Hence, subject EV is a critical type of EV for an intervention study. Several threats to subject EV that are associated with procedures which enhance IV have been described (e.g., Howard, Cox, & Saunders, 1990; Howard, Krause et al., 1986). For example, IV requires that subjects accept random assignment to any intervention condition in a study. Some study applicants will refuse to participate due to this requirement. Such refusal is a threat to EV because the study sample will not be representative of all individuals for whom the study intervention is believed to be appropriate. Also, studies with high IV typically require subjects to participate in research assessments, often extensive ones. Some require subjects to agree to audio- or videorecording of therapy sessions. A portion of potential subjects are likely to refuse participation due to assessment requirements, thereby reducing the subject EV of study findings. All of the preceding examples are types of preinclusion attrition.

The fundamental threat to IV posed by subject dropout from treatment studies (postinclusion attrition) was described previously. Dropout is also a serious, predictable, and often intractable EV threat. Researchers' careful selection, implementation, and description of subject inclusion and exclusion criteria all potentially enhance a study's EV by providing specific information on the types of individuals to whom the findings are most likely to generalize. The loss of outcome data that typically is associated with subject dropout compromises the confidence with which it can be concluded that a study's findings generalize to individuals like those selected for the study. The threat to EV of subject dropout is another reason why it is critically important to try to obtain data at all required assessment points from all subjects who enter a study, dropouts and remainers alike.

One of the most frequent subject EV criticisms of contemporary US studies with high IV is their psychodiagnostic inclusion and exclusion criteria (e.g., American Psychiatric Association, 1994). The criticism is that the use of specific diagnostic criteria for sample selection yields subjects who are unrepresentative of individuals who would receive the interventions in community settings. Study samples are said to be highly selected, consisting of "pure" types that are unrepresentatively homogeneous compared to treatment-seeking individuals who contact field clinical settings (e.g., Lebowitz & Rudorfer, 1998; Seligman, 1995).

The preceding subject EV concern is a relatively new one. It relates directly to a mental health intervention research strategy that developed in the USA after 1980, when the third edition of the Diagnostic and statistical manual of mental disorders (DSM; American Psychiatric Association, 1980) was published. Shortly thereafter, sociopolitical forces converged to contribute to the requirement that applications for federal grant funding for mental health intervention efficacy studies be focused on specific diagnoses (e.g., major depressive disorder; social phobia) as defined in the DSM. Previously, the sample selection criteria for such studies were very broad, for example, treatment-seeking adults who complained of a variety of "problems in living" but who had no history of a psychotic disorder or current drug addictions.

The validity of the foregoing subject EV criticism merits examination. It is true that the trend in intervention research in the USA since 1980 has been to develop and test interventions for relatively specific symptoms and syndromes (e.g., depression). However, study reports generally do not include comprehensive descriptions of co-present diagnoses (i.e., comorbidity patterns) of subjects. Thus, the extent to which most study samples are indeed more homogeneous than community setting patients for whom an intervention would be recommended typically cannot be determined. Also, some evidence from research clinics is beginning to appear that does not support the homogeneity criticism (e.g., Kendall & Brady, 1995).

3.10.4.6.1 Staff therapists

Another commonly mentioned threat to EV is the use of specially selected therapists. For example, sometimes investigators limit study therapists to a subset who meet a criterion of skillfulness. The rationale is that delivery of study interventions by therapists who are optimally skilled will provide the most accurate test of an intervention's efficacy. However, the strategy also poses a threat to EV. Can the outcomes of a study in which expert therapists delivered the interventions be assumed to generalize to less skilled therapists? The most conservative answer to this EV question is probably not. However, many contemporary intervention studies with high IV also specify (i) methods to identify therapists like those selected for a study, and (ii) therapist training programs that include ways to measure skill attained before therapists start to treat study patients. Thus, high IV studies that meet current methodological standards describe ways to identify and train nonstudy therapists who are like study therapists and, thus, who are likely to provide treatments that have outcomes similar to those obtained (cf. Kazdin, Kratochwill, & VandenBos, 1986).

3.10.4.6.2 CV Methods

Table 4 summarizes threats to CV and lists some antidotes. The text that follows elaborates on some of the information in Table 4.

In Cook and Campbell's 1979 discussion of CV, they opined that "most applied experimental research is much more oriented toward high construct validity of effects than of causes" (p. 63). Their observation was accurate for psychotherapy intervention research at the time but, notably, does not hold for therapy research published since about 1990 (e.g., Elkin et al., 1989). A methodological revolution in psychotherapy research enabled the CV of putative causes, that is, therapeutic interventions, to be enhanced. The relevant new methods are described in Sections 3.10.4.7.1–3.10.4.7.3.

3.10.4.6.3 Specification of interventions: treatment manuals

Cook and Campbell (1979) highlighted the dependence of CV on careful pre-operationalization explication of the essential features of study constructs. Prestudy preparation of treatment manuals, discussed in Section 3.10.4.1.3, is a research method that achieves pre-operationalization explication of the intervention construct, if done well. Criticisms of the treatment manual revolution in therapy research have been frequent and heated, despite the critical contribution of manuals to both the IV and potential EV of psychotherapy intervention studies

Table 4 Construct validity of putative causes and effects.

Threat: Inadequate preoperational explication of construct
Description: The researchers' elucidation of one or more of the main study constructs (i.e., the interventions and outcome) was not adequate to guide valid operationalization (measurement) of them
Example antidote: Devote necessary resources to identifying or developing valid operationalizations of interventions and outcomes because high construct validity of intervention and outcome variables is essential to conducting interpretable intervention research

Threat: Mono-operation bias
Description: Reasonable variation in the implementation of an intervention does not occur (e.g., only male therapists are used) which reduces the validity of the operationalization of the intervention
Example antidote: Vary putatively irrelevant features of study interventions (e.g., use male and female therapists who represent a wide age range to deliver interventions)

Threat: Monomethod bias
Description: Only one measurement method is used to assess central variables (e.g., all outcome measures use the paper and pencil self-report method) which confounds variance in scores due to method with variance that reflects the variable of interest
Example antidote: Include measures of central variables that are based on more than one assessment method (e.g., self-report and behavioral observation of symptoms)

Threat: Hypothesis-guessing within experimental conditions
Description: Subjects try to intuit the study hypotheses and skew their responses based on what they think the hypotheses are
Example antidote: Try to make hypotheses hard to guess or intentionally tell subjects false hypotheses (and debrief them after the study is over). Tell subjects who are in a treatment outcome study that honest, candid responses are more useful to the research than trying to "help" the researcher by showing improvement on measures.

Threat: Evaluation apprehension
Description: Subjects might present themselves more positively (or more negatively) than their honest self-evaluation when they are being assessed by experts in personality, psychopathology, or basic human skills
Example antidote: Use paper and pencil self-report measures of the same constructs that "expert" interviewers assess; include measures of social desirability response style, and use them as covariates if they are correlated with scores on outcome measures. Administer measures of socially undesirable variables at a prestudy baseline point and again immediately before study begins. Assume that the second administration is likely to be the more valid (candid) index

Threat: Experimenter expectancies
Description: Investigators' biases influence the study findings
Example antidote: Investigators should not administer study interventions. Conduct the same study protocol in two or more settings, each of which is overseen by an investigator with a different preference (bias) for the interventions examined

Threat: Confounding constructs and levels of constructs
Description: Some interventions might be differentially efficacious when provided at different levels or strengths (e.g., doses of psychotropic medications; a psychotherapy conducted with two sessions per week for the first 12 weeks vs. once weekly for the first 12 weeks)
Example antidote: Conduct a study to examine whether an intervention's effects differ by the strength of the "dose" administered

Threat: Interaction of different treatments
Description: Subjects concurrently receive nonstudy interventions and the study intervention (e.g., some subjects in a psychotherapy study start taking psychotropic medications; in a study of medication treatment for drug addiction, some subjects also receive medical and housing help at the study clinic)
Example antidote: Use procedures to reduce the probability that subjects will receive treatments other than the study treatments. For example, as a criterion for study participation, require subjects to refrain from taking psychotropic medications or having any other psychotherapeutic treatment while in the study. Check throughout a study for use of proscribed treatments and either withdraw subjects from study or exclude the data of subjects who use proscribed treatments from outcome analyses

Threat: Interaction of testing and treatment
Description: The study assessment procedures affected the response of subjects to study interventions (e.g., a preintervention measure sensitized subjects in a way that enhanced their response to the study intervention)
Example antidote: Examine the presence of this threat experimentally by administering the measures to two subject samples. Administer tests to one sample at pre- and postintervention; administer to other only at postintervention. A complete Solomon four-group design (Campbell & Stanley, 1963) can also be used

Threat: Restricted generalizability across constructs
Description: Interventions can affect many important outcome variables. However, effects on outcomes (constructs) of interest often are not highly correlated (generalizable)
Example antidote: Include measures that are closely related to the main outcomes sought with an intervention, in addition to the key outcomes (e.g., in a study of treatments for depression, include measures of severity of depression symptoms, work functioning [number of sick days], and marital conflict)

Statements in the first two columns and some in the third are abstracted from Cook and Campbell (1979).
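The antidote listed in Table 4 for the interaction of testing and treatment can be made concrete with a small numerical sketch. What follows is only an illustration, assuming hypothetical posttest scores for the four cells of a Solomon four-group design; the function name, the cell labels, and the data are invented here and are not taken from Campbell and Stanley (1963). The sketch computes the treatment effect separately for pretested and unpretested subjects; the difference between those two effects is the testing-by-treatment interaction contrast.

```python
from statistics import mean


def solomon_contrasts(cells):
    """Descriptive contrasts for a Solomon four-group design.

    `cells` maps (pretested, treated) -> list of posttest scores.
    All names and data in this sketch are hypothetical.
    """
    m = {key: mean(scores) for key, scores in cells.items()}
    effect_pretested = m[(True, True)] - m[(True, False)]
    effect_unpretested = m[(False, True)] - m[(False, False)]
    # Main effect of pretesting: pretested cells minus unpretested cells.
    testing_main = (
        (m[(True, True)] + m[(True, False)]) / 2
        - (m[(False, True)] + m[(False, False)]) / 2
    )
    return {
        "treatment_effect_pretested": effect_pretested,
        "treatment_effect_unpretested": effect_unpretested,
        "testing_main_effect": testing_main,
        # A nonzero value suggests the pretest changed the response to treatment.
        "testing_by_treatment_interaction": effect_pretested - effect_unpretested,
    }


# Hypothetical posttest scores, keyed by (pretested, treated).
posttest = {
    (True, True): [14, 16, 15],
    (True, False): [10, 11, 9],
    (False, True): [12, 13, 11],
    (False, False): [10, 9, 11],
}

result = solomon_contrasts(posttest)
for name, value in result.items():
    print(f"{name}: {value}")
```

These are descriptive contrasts only. An actual study would also test them for statistical significance, for example with a 2 x 2 analysis of variance on the posttest scores.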

(Wilson, 1996). However, the CV of intervention constructs requires more than manuals. It also requires prestudy therapist training procedures to teach therapists how to conduct the interventions as specified in the manual. A final step is to evaluate and ensure the effectiveness of the training "manipulation" before therapists are permitted to conduct therapies with study subjects. Methods for this are discussed next.

3.10.4.6.4 Specification of interventions: therapist adherence measures

The previously mentioned process research methodology, observer-judge ratings of therapy session material, provides the most valid index available to date of the extent to which a therapist has acquired a requisite skill level with a study intervention. Some research findings support the preceding conclusion. For example, trainee therapists' scores on written tests about a therapy were not highly correlated with supervisors' judgments of trainees' skill in conducting a therapy. Likewise, supervisors' impressions of trainees' competence with a therapy, based on trainees' descriptions of therapy sessions, were not highly correlated with supervisors' ratings of trainees' competence based on videotapes of therapy sessions (Chevron & Rounsaville, 1983).

The use of process methods to assess the extent to which therapists conduct treatments as specified requires the development of adherence measures (e.g., Evans et al., 1984). Adherence measures typically consist of items that operationally define what the developers of an intervention believe to be its essential therapeutic elements. Judges review therapy session material and rate the extent to which a therapist enacts (verbally and/or behaviorally) each item as specified by the adherence measure. The development of a psychometrically sound therapist adherence measure for an intervention, including a system for sampling the therapy session material to which the measure should be applied, is a time-consuming and sophisticated research methodology.

The use of self-report therapist adherence measures recently has been proposed as a substitute for judge-rated measures (Carroll, Nich, & Rounsaville, in press). Self-report adherence measures have interesting uses in therapy research, perhaps particularly in therapy development research. However, the only psychometric examination to date of the validity of such methods as indices of therapists' adherence to a treatment as specified in a manual yielded negative results (Carroll et al., in press).

Therapist adherence measures are a way to assess the CV of study interventions. Adherence measures can be applied to systematically sampled therapy session material to yield an index of the extent to which study therapies were conducted as specified. The contribution of psychometrically sound therapist adherence measures to both CV and IV cannot be overemphasized: they allow us to evaluate whether or not the designated interventions (the independent variables) were delivered as intended in intervention studies.

The preceding discussion is a simplified explication of the issues involved. The advent of therapist adherence methodology was associated with recognition of the difference between two constructs, adherence and competence (e.g., Waltz et al., 1993). Basically, adherence denotes a therapist's fidelity to the therapeutic techniques that are described in a treatment manual; competence denotes skill in implementing the techniques (e.g., tailoring a technique to increase its acceptability to a particular client). Research to date is consistent with the conclusion that inter-rater reliability is more achievable with adherence measures than with competence measures. Recruiting experts in a particular form of therapy to judge competence does not solve the reliability problem (a "problem" which signals needed directions for psychotherapy outcome research at this time).

3.10.4.6.5 Distinctiveness of interventions

The main aim of some intervention studies is to compare the outcomes of alternate treatments for the same problem. A fundamental assumption of comparative treatment studies, albeit sometimes implicit, is that the interventions being compared are distinctive in terms of their putative therapeutically active elements. For example, a study that is designed to compare the outcomes of a form of cognitive therapy for depression with another type of psychotherapy for depression is based on the assumption that the therapies are substantively different somehow. The assumed distinction is what makes it worthwhile to compare therapies in a single study. In CV terms, a comparative intervention study design reflects the assumption that the intervention constructs are distinguishable. While the issue might seem moot, it is not. Therapies that sound distinct based on their theories of therapeutic change and prescribed techniques might not be distinguishable in practice, that is, when they are implemented (operationalized) in words and behaviors by therapists.

Therapist adherence measures for different interventions provide a systematic way to evaluate the distinctiveness of putatively different forms of psychotherapy and, hence, a way to evaluate the CV of comparative intervention studies. The general methodology is to apply the therapist adherence measure for each intervention to the intervention for which it was developed and to the other interventions in a study (e.g., cognitive therapy, psychodynamic therapy, and placebo therapy sessions each are rated on a cognitive therapy adherence scale and a psychodynamic therapy adherence scale). The scores obtained on the adherence measures for each intervention can then be compared to a priori criterion scores that are established to indicate the distinctiveness of the interventions. Hill et al. (1992) exemplifies the methodology.

3.10.4.6.6 Adjunctive therapies

When an intervention study is described as a test of a specific intervention, it is assumed that the intervention is the only one that the subjects received. However, sometimes study subjects are allowed to participate in adjunctive interventions (e.g., Alcoholics Anonymous) or are provided with additional interventions on an as-needed basis (e.g., aid from a case worker for housing and other needs; psychotropic medications). Participation in adjunctive interventions constitutes a threat to the CV of the study intervention, as well as to the study's IV. Outcome findings cannot be validly interpreted as due to the designated intervention(s) when adjunctive interventions are allowed but not controlled in some way.

3.10.4.6.7 Assessment procedures

Repeated testing on the same outcome measures is a CV threat that often is ignored. Cook and Campbell (1979) frame the issue as follows: can an intervention–outcome relationship be generalized to testing conditions other than those used in the study to assess outcomes?

For example, if an outcome measure was administered both pre- and postintervention, would the same outcomes have been obtained if the measure was administered only at postintervention? One way to examine the threat of repeated testing to the CV of outcome scores is to compare findings of two randomly assigned intervention groups: one that was both pre- and post-tested, and another that was only post-tested.

A related CV threat is the possibility that study outcome measures interact with interventions. For example, studies of cognitive-behavioral (CB) therapy for panic disorder have been criticized for the possibility that CB affects a primary outcome measure used in such studies, self-reported frequency of panic attacks, mainly by teaching subjects to redefine panic attack rather than by reducing the actual frequency of attacks. To the extent that the criticism is valid, differences in panic attack frequency found between a control intervention and CB have low IV because the outcome measure in the CB condition has low CV. The measure does not reflect change in frequency of panic attacks; it reflects change in subjects' definitions of panic attack. The Solomon Four-Group Design (Campbell & Stanley, 1963; Rosenthal & Rosnow, 1991) can be used to simultaneously evaluate main effects of repeated testing on outcome measures and interactions of interventions and measures.

A common way to reduce CV threats to an outcome measure like the one just described for panic is to include other measures of closely related constructs in a study. The other measures should not be vulnerable to the same CV threat. For example, a spouse could be asked to provide ratings of the frequency and intensity of a subject's panic attacks. (This solution is, of course, associated with other types of CV concerns.)

3.10.4.7 SCV Methods

Only a few SCV-relevant methods are discussed here due to space limitations and to the technical detail associated with many SCV issues. The reader is referred to Cook and Campbell (1979) for a more complete review and to Table 5 for a synopsis of their discussion of SCV.

3.10.4.7.1 Statistical power analysis

Conducting a statistical power analysis at the planning phase of an intervention study is critical to its SCV (Cohen, 1977, 1992; Kraemer & Thiemann, 1987; Rosenthal & Rosnow, 1991). A power analysis helps an investigator determine the minimum sample size needed to detect a statistically significant outcome difference of a magnitude that he or she regards as clinically meaningful (e.g., Jacobson & Truax, 1991), given his or her willingness to risk making a type II error (i.e., accept the null hypothesis of no difference when it should be rejected). The smaller the sample size, the greater the risk of type II error when standard statistical procedures like analysis of variance are used in intervention research. However, the larger the sample size, the more expensive, time-consuming, and less feasible a study typically becomes.

A prestudy power analysis is a computation that allows investigators to enhance a study's SCV by determining the minimum sample size needed to statistically detect an outcome difference of a specified magnitude, given expected variances in the outcome measures. Ideally, an investigator can draw on criteria of some type to determine the size of a difference (effect size) that will have applied significance, that is, be clinically meaningful (Jacobson & Truax, 1991). For example, a clinically meaningful difference is one that experienced therapists would regard as a compelling reason to recommend one treatment over another for a particular problem.

The rationale for a power analysis is as follows. The magnitude of the variances of outcome measures and the study sample size are major determinants of whether or not an outcome difference of a certain magnitude between interventions will be statistically significant (i.e., unlikely to be due to chance). The same absolute difference between mean outcome scores of two interventions could be statistically significant if the sample size in each group was 30, but not if it was 15. Another perspective on the situation is that the smaller a study's sample size, the higher the probability of a type II error if the conventional p ≤ 0.05 is used to indicate statistical significance. Rosnow and Rosenthal (1989) present a compelling perspective on the fundamental importance of the foregoing issues.

3.10.4.7.2 Controlling for the type I error rate

Statistical tests commonly used in intervention research (e.g., the F-test in analysis of variance) are associated with probability values, for example, p < 0.05. A p value indicates the probability of obtaining a difference at least as large as the one found between interventions by chance alone, that is, if there were no true difference. Thus, for example, a statistical test that meets the p ≤ 0.05 significance level means that a difference of the same magnitude would

Table 5 Statistical conclusion validity.

Threat Description Example antidote

Threat: Low statistical power
Description: Limited potential of a study to yield statistically significant intervention effects (i.e., low potential to reject the null hypothesis) due to sample size and to the probable variances of outcome measures
Example antidote: Conduct a power analysis (e.g., Cohen, 1977, 1992; Rosenthal & Rosnow, 1991) during the design phase of a study to determine the sample size needed to detect an intervention effect of a predetermined size

Threat: Violated assumptions of statistical tests
Description: Failure to ensure that a study's outcome data meet the assumptions of the statistical tests used to analyze it
Example antidote: Know the crucial assumptions of all statistical tests used and conduct analyses needed to evaluate whether the study data meet the assumptions. Seek expert statistical consultation if needed

Threat: Fishing and the error rate problem
Description: Several unplanned (i.e., a posteriori rather than a priori) statistical tests of effects (fishing) are conducted but no adjustments are made to reduce the likelihood that statistically significant findings are due to chance. In other words, the probability of type I error is larger than indicated by the reported p values
Example antidote: Consult statistics textbooks for methods to handle the error rate problem (e.g., Bonferroni correction; Tukey or Scheffé multiple comparison tests)

Threat: Reliability of measures
Description: Low test–retest reliability (stability) and/or low internal consistency reliability of outcome measures inflates the error variance in scores and thus reduces the probability of finding intervention effects
Example antidote: Rely on psychometric properties of measures to guide selection of study measures. Choose measures with strong psychometric properties over measures with poor or unknown psychometric properties that might have better "face validity" as measures of the construct of interest (Kraemer & Telch, 1992)

Threat: Reliability of treatment implementation
Description: Variability in the implementation of an intervention with different subjects (lack of standardization across subjects) inflates error and, thus, reduces the probability of finding intervention effects
Example antidote: Train study therapists to a criterion level of adherence to interventions before they treat any study patients; monitor their adherence to the treatment manual throughout the study and intervene if drift occurs (e.g., Shaw, 1984)

Threat: Random irrelevancies in the experimental setting
Description: Features of a study setting other than the interventions could affect outcome scores (e.g., two research assistants administer the outcome measures to subjects: one is warm and friendly and the other is abrupt and distant)
Example antidote: Code features of the study setting that are likely to affect outcome measures as variables and include them in the data analyses to reduce error (e.g., research assistant who administered outcome measures is coded as a variable and included as a factor in an analysis of variance that is used to test for treatment effects)

Threat: Random heterogeneity of subjects
Description: Differences in subjects could affect scores on the outcome measures independent of any intervention effects
Example antidote: Measure subject characteristics that are likely to be correlated with scores on outcome measures; use the characteristics as covariates or as blocking variables
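Two of the antidotes in Table 5, a prestudy power analysis and the Bonferroni correction, can be illustrated together with a short calculation. The sketch below is not the procedure of Cohen (1977, 1992) or of any other source cited here. It assumes a two-group, two-sided comparison of means, a standardized effect size d (the mean difference divided by the common standard deviation), and the normal approximation n = 2((z_{1-alpha/2} + z_{power}) / d)^2 per group, which runs slightly below t-based values. The function name, its parameters, and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80, n_tests=1):
    """Approximate per-group sample size for comparing two group means.

    Normal approximation, two-sided test. Dividing alpha by n_tests
    applies a Bonferroni adjustment for n_tests planned comparisons.
    All names and defaults are illustrative assumptions.
    """
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    adjusted_alpha = alpha / n_tests
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - adjusted_alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Hypothetical planning values: d = 0.5, 80% power, alpha = 0.05.
single_test_n = n_per_group(0.5)
# Same effect size, but alpha is split across five planned tests.
five_tests_n = n_per_group(0.5, n_tests=5)
print(single_test_n, five_tests_n)
```

The comparison shows the trade-off: protecting the type I error rate across several tests raises the sample size needed to keep the same power.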

Statements in the first two columns and some in the third are abstracted from Cook and Campbell (1979).

be expected to be obtained on average five times or fewer if 100 of the statistical tests were done, if the null hypothesis of no difference between interventions were true. The use of p values in conjunction with statistical tests is the principal way investigators guard against type I error (rejecting the null hypothesis of no difference when it should be accepted) when interpreting study findings. Thus, conducting statistical tests in ways that maintain the interpretability of p values is crucial to a study's SCV.

The interpretation of a p value is compromised when multiple statistical tests are done but no precautions are taken to ensure that the degree of protection against a type I error remains constant. A variety of procedures exist to maintain a specified level of protection against type I error when several statistical tests are performed. Statistics textbooks describe procedures such as the Bonferroni correction (e.g., Rosenthal & Rosnow, 1991). The main point for present purposes is that conducting multiple statistical tests without adjusting the probability level of the tests compromises a study's SCV. Unfortunately, this is a common type of threat to SCV in intervention research (Dar, Serlin, & Omer, 1994).

3.10.4.7.3 Testing assumptions of statistical tests

One fundamental assumption of the statistical tests that are commonly used for outcome analyses is that the dependent (outcome) variable is normally distributed. If the data do not meet the normality assumption, then the validity of the statistical test is compromised, that is, the study SCV is threatened. A special class of statistical procedures, nonparametric statistics (Leach, 1979), exists for analyzing data that are not normally distributed. Alternatively, procedures exist to transform data (e.g., square root transformation) to make their distribution more symmetrical before applying statistical tests (Atkinson, 1985; Carroll & Rupert, 1988).

3.10.5 FROM IV VS. EV TO IV + EV IN MENTAL HEALTH INTERVENTION RESEARCH

One illustration of the centrality of IV and EV to intervention research is the aforementioned controversy on efficacy vs. effectiveness studies (e.g., Goldfried & Wolfe, 1998; Jacobson & Christensen, 1996; Lebowitz & Rudorfer, 1998; Mintz et al., 1996; Seligman, 1995, 1996). The controversy took center stage largely as a side effect of dramatic shifts in health care delivery in the USA since the mid-1980s. Two forces were particularly valent in drawing the attention of intervention researchers and consumers of their research to efficacy (IV) and effectiveness (EV) methods. The forces were (i) a move from indemnity insurance coverage for health care to managed care coverage to control costs, and (ii) the US Congress's attempt in 1993 to reform national health care. Both forces highlighted the need for valid information about interventions to guide decisions of government policymakers and managed care entrepreneurs. Both groups were motivated to make decisions rapidly that would impact the interventions that millions of Americans could receive in insured programs. The need for valid findings to inform the decisions was keenly felt by all interested parties, including researchers. Limitations of information from existing studies became widely and painfully evident. Lively debates later appeared about how intervention research in the USA, particularly federally funded research, should be designed (e.g., Clarke, 1995; Jacobson & Christensen, 1996; Lebowitz & Rudorfer, 1998; Newman & Tejeda, 1996; Seligman, 1995, 1996; Speer, 1994; Wells & Sturm, 1996).

A question at the core of the debate is how to design intervention studies to maximize three aims: (i) the value of findings in terms of definitive public health implications and enhancement, (ii) the speed with which valid and useful findings can be obtained, and (iii) the public health policy yield from relatively scarce federal funding for mental health intervention studies. Personal biases are evident in positions taken by many who have spoken on the issues (e.g., pragmatism vs. concern that policy decisions that affect large numbers of people will be made based on scientifically weak findings). However, a valuable product of the debate has been creative thinking about how IV and EV goals might be maximized, for example, via new study questions and methods, and the use of statistical techniques such as causal modeling (Kenny, 1979). Interestingly, Cook and Campbell (1979) were keenly aware of all three of the aforementioned aims in the late 1970s. The ideas developed in all editions of their book, particularly in the sections on quasiexperimental designs, were largely intended to address them.

The current debate illustrates the importance of being fully fluent with the concepts of IV and EV for both creators and consumers of intervention research. A few suggestions from the debate for mounting studies that have both high IV and EV are described next. Then, two illustrative IV + EV studies are reviewed in terms of IV and EV.

3.10.5.1 Strategies for Efficacy + Effectiveness (IV + EV) Intervention Studies

Most strategies for designing intervention studies that optimize both IV and EV that have emerged so far from the efficacy vs. effectiveness debate are intended to maximize the applied public health (utilitarian) information that is obtained from studies. However, to test treatments that hold promise as marked advances over currently available interventions, high IV designs and methods are likely to be the most informative research strategies, due to their suitability for theory-testing and for testing causal hypotheses (Jacobson & Christensen, 1996; Mintz et al., 1996). The IV + EV strategies described here were selected to illustrate the diversity of approaches that have been proposed. Each strategy is described briefly, in a way intended to highlight its essence.

3.10.5.1.1 The conventional "phase" strategy

Before the recent efficacy vs. effectiveness debate, federal agencies that funded mental health research in the USA endorsed a phase model to meet the goals of IV and EV inference. According to Hoagwood et al. (1995), the phase model was adapted from one used by the US National Cancer Institute. Described in a simplified way, the model is to conduct efficacy studies of interventions for which promising outcome data were obtained in prior, preliminary research; the next phase is to conduct effectiveness studies of interventions for which compelling efficacy data are obtained.

3.10.5.1.2 A dimensional adaptation of the phase model

Hoagwood et al. (1995) identified several reasons why the phase model, although elegant, is rarely fully realized. A major hurdle has been moving interventions with demonstrated efficacy to effectiveness studies. A typical stumbling block to the transfer is introducing research methods into community settings to conduct effectiveness studies. Hoagwood et al. presented an alternative strategy, the essence of which is to conceptualize efficacy and effectiveness goals "in terms of continuous dimensions, rather than discrete phases" (p. 685). The authors describe a conceptual framework, not a model per se. It is linked specifically to mental health interventions for children, adolescents, and families.

The framework emphasizes three bipolar dimensions of intervention research, each of which can be described as having an IV and an EV pole. The dimensions are types of validity, types of intervention modalities and parameters, and types of outcomes. For example, at the IV pole of the intervention dimension are short-term (e.g., three months), highly structured, single modality therapies that are operationalized in detailed manuals; at its EV pole are longer-term, less well-specified treatments that allow for the integration of more than one therapy modality.

3.10.5.1.3 Stakeholder's model to guide data selection

Newman and Tejeda (1996) highlighted the need to collect data that are sought by all parties who have a vested interest in mental health interventions (the stakeholders). The authors endorsed the view that high IV studies have an essential role in intervention research because they provide scientifically well-founded, important information about interventions, for example, their potential effects and their safety. EV goals are accomplished in the stakeholders model mainly by collecting certain kinds of outcome and other data about interventions that can help inform the decisions of all stakeholders in mental health treatment (e.g., government policymakers and corporate officials of managed care companies; potential recipients of interventions; family members and others who are directly affected by those who receive mental health interventions).

Types of data emphasized by Newman and Tejeda (1996) include costs of interventions and estimates of the amount of therapeutic effort required to obtain specified behavioral outcomes. Hoagwood et al. (1995) also included cost data in their framework. The need for cost data (e.g., cost-effectiveness, cost-offset, comparative cost) is a dominant theme in research strategies that have been proposed for efficacy + effectiveness studies. This emphasis is largely a response to increasing concerns about, and concomitant restrictions on, health care costs in the USA. Cost data have not been standard products of federally funded efficacy studies in the USA for the past 30 years. Yates (1997) has presented guidelines for psychotherapy researchers on how to collect cost data. Wolff, Helminiak, and Tebes (1997) illustrate the impact of different assumptions on cost estimates in mental health treatment research. Other relevant references are Knapp (1995) and Yates (1996). Research strategies to examine the amount of therapeutic effort required to obtain specific outcomes have been available for some time in the psychotherapy research literature (e.g., Howard, Kopta, Krause, & Orlinsky, 1986; Kopta et al., 1994).

3.10.5.1.4 Mediator's model

The crux of a model proposed by Clarke (1995) is to use IV experimental methods to examine crucial generalizability (EV) questions about an intervention's outcomes. For example, a study design could include two or more intervention conditions distinguished mainly by the degree to which an intervention is implemented as specified in its manual. The essence of Clarke's (1995) model is to use IV methods to evaluate the effects of several potential mediator variables on the outcomes that can be achieved with an intervention. Examples of relevant mediators are the degree to which therapists adhere to the structure of a therapy as specified in its manual, heterogeneity of the patients treated, and the adjunctive therapeutic context in which an intervention is provided (e.g., used as a stand-alone intervention vs. included in an intervention package).

3.10.5.2 Examples of Efficacy + Effectiveness Intervention Studies

This section illustrates the use of the concepts of IV and EV to evaluate intervention studies. Two highly regarded US studies that were designed to meet both IV and EV aims are reviewed: Drake, McHugo, Becker, Anthony, and Clark (1996) and Schulberg et al. (1995). The main intent of each review is to illustrate how the concepts that are the focus of this chapter can be applied to help determine the conclusions that can validly be drawn from an intervention study. The reviews obviously are not complete or definitive evaluations of either study.

3.10.5.2.1 Schulberg et al. (1995)

The investigators' interest in integrating IV and EV considerations in a single study is evident in the study's two main aims. One aim was to compare the outcomes of a type of psychotherapy and a medication when used to treat depression in patients who were being seen in general medical clinics, not psychiatric settings. Both of the interventions were tested previously in randomized controlled studies and found to be efficacious for depression in psychiatric outpatients. A second aim was to compare the effects on depression of each of the two tested interventions with the effects of a third intervention of "usual care," that is, treatments of the type typically provided to depressed patients by doctors in primary care settings.

Central IV features of the study were: random assignment of patients to all three treatment conditions, use of sample selection procedures to identify patients who met standard psychiatric diagnostic criteria for depression, and outcome analyses on an intention-to-treat sample. IV and EV were simultaneously enhanced by other methods. The same study protocol was implemented concurrently at four primary care clinic settings. Procedures were used to help ensure that the psychotherapy was conducted in ways consistent with its manual but that also were minimally intrusive into the conduct of the therapies. Central EV features of the study were: it was conducted in primary care settings (although all four settings had academic affiliations), outcome data were obtained from patients who were randomly assigned to usual care, and the natural course of the usual care condition was preserved as much as possible.

A noteworthy EV threat was substantial preinclusion attrition of eligible subjects due to refusal to participate and other reasons. An IV threat was that therapists were "nested" in a treatment condition. That is, the medical professionals who provided usual care were not the same professionals who provided the psychotherapy intervention or the pharmacological intervention. Thus, better outcomes of the previously tested (efficacious) interventions compared to usual care could not be attributed validly to the therapies per se. Findings of this type were, in fact, obtained. However, it can be argued that the confounding of therapists and treatments does not pose an important IV limitation to this particular study. Rather, the confound enhances the generalizability of findings to standard practice (EV) because in the public sector, primary medical care providers are not the same as those who provide specific forms of psychotherapy for depression or those who specialize in psychopharmacological treatments for depression.

3.10.5.2.2 Drake et al. (1996)

Drake et al. (1996) integrated IV and EV methods in a study designed to examine interventions of a very different type from those examined by Schulberg et al. (1995). Their main aim was to compare two vocational programs for individuals with severe mental disorders (SMD) such as schizophrenia. Drake et al. described the interventions as differing in two primary ways. One intervention was provided by a professional rehabilitation agency external to the mental health clinics where individuals with SMD were treated; the other was integrated with mental health services and provided at the clinics. A second difference was in the structure of the interventions. The external intervention ("group skills training," GST) started with three months of pre-employment skills training; the mental health clinic intervention ("individual placement and support," IPS) started immediately with job placement activities.

The central IV feature of the study was random assignment to each of the intervention conditions. A key EV enhancement design feature was that each intervention was concurrently tested at two sites. However, the full EV value of the multisite design was compromised because GST staff who had central training roles overlapped at the two GST sites. Also, the GST and IPS interventions each were conducted at different sites. Thus, as the authors point out, interventions and sites were confounded. This type of confound notably compromises IV, that is, the confidence with which outcome findings can be attributed primarily to interventions.

The study illustrates that sample selection criteria can simultaneously increase the homogeneity of a study sample and enhance EV. The inclusion and exclusion criteria constitute minimal, yet realistic, qualifications for SMD clients to receive vocational interventions (e.g., lack of noteworthy memory impairment; clinical stability as indicated by not being hospitalized for at least one month; unemployment for at least one month but interest in employment). The potential subject EV enhancement associated with the sample selection criteria was reduced by another procedure, a "prerandomization group experience." SMD clients who were interested in participating in the study were required to attend a minimum of four weekly research induction group sessions. One purpose of the prerandomization group attendance requirement was to ensure that potential participants were "motivated" for the project interventions.

Prerandomization and pretreatment "orientation" or "stabilization" phases are sometimes used in mental health intervention research, particularly with populations in which high treatment failure and/or dropout rates are expected, such as drug abusers. EV is one trade-off for the use of such procedures. On the other hand, the procedures can have an IV advantage. They can create samples of individuals who are most likely to respond to an intervention, thereby yielding findings on its efficacy with individuals hypothesized to be most responsive to it. One potential risk is that the sample will consist of individuals who are likely to benefit from many incidental life/environmental experiences and who, therefore, might not need the study interventions to achieve the targeted outcomes. The foregoing IV threat can be offset by the inclusion of a placebo control intervention of some type in the design.

Drake et al. (1996) also illustrates the critical impact on a study's CV, IV, and EV of the methods used to operationalize the interventions. The study report does not indicate if the interventions were specified in manuals or the type of training that counselors received before implementing them for the study. A related threat to the CV of the interventions and, thus, to the IV of study findings, is that the methods

lor other than a client's SMD therapist. This is an IV consideration because the GST interventions were always provided by someone other than the client's SMD therapist.

One point illustrated by the preceding intervention-related aspects of the study is that the less clear and complete the description of methods used to implement interventions is, the less possible it is to evaluate their CV and, consequently, the IV of outcome findings. Thus, the public health significance of findings cannot be evaluated confidently. In turn, study findings can have less impact than they otherwise might.

An additional IV threat due to intervention procedures was site differences in the implementation of both interventions. The investigators intervened at one of the sites to correct deviations from the IPS model. However, the corrective attempts were not notably effective. The implementation of IPS is described as "weaker" at one site. Also, the GST counselors "concentrated their efforts" in one of their sites.

The study findings highlight the crucial importance to IV of intervention implementation procedures (intervention CV). In some analyses, the interventions differed on the major outcome variable (employment), favoring IPS; in others, the interventions did not differ. Interpretation of mixed findings about interventions' comparative effects is seriously compromised when the intervention CV is either poor or difficult to evaluate from information provided in a study report.

The foregoing review of Drake et al. (1996) from the perspective of IV and EV threats might seem inconsistent with the fact that it is widely regarded as one of the best IV + EV mental health intervention studies available. If it is assumed that the preceding IV and EV analysis itself is valid, what might be concluded? One conclusion is that difficulties maintaining the CV of interventions in studies that are done in natural settings to enhance EV can be costly in terms of IV. In turn, this analysis suggests that a cost-effective use of resources in high EV studies is to invest in methods that achieve and maintain the CV of interventions.

3.10.6 A CONCLUDING OBSERVATION

IV and EV considerations, and the related efficacy vs. effectiveness debate, highlight two vital applied aims of intervention research.
The used to evaluate the success of the operationa- aims are: (i) to obtain findings that can be used lization of the interventions (i.e., counselor directly and quickly in public mental health adherence to GST and to IPS) are described too delivery systems, and (ii) to obtain findings that vaguely to evaluate. An additional IV threat is can markedly improve, ideally revolutionize, that it is not clear from the report if the IPS the effectiveness of treatments available in the intervention was always provided by a counse- public health sector. Both aims have clear public 222 Internal and External Validity of Intervention Studies health significance. Aim (i) is most consistent 3.10.7 REFERENCES with using designs and methods that are firmly American Psychiatric Association (1980). Diagnostic and tied to contemporary constraints on clinical statistical manual of mental disorders (3rd ed.). Washing- practice in ªreal worldº settings and obtaining ton, DC: Author. findings that can affect practice in the short run. American Psychiatric Association (1994). Diagnostic and Aim (ii) is consistent with using scientific statistical manual of mental disorders (4th ed.). Washing- ton, DC: Author. methods that are most likely to foster discovery Atkinson, A. C. (1985). Plots, transgressions and regression: and methods that can rule out rival interpreta- An introduction to graphical methods of diagnostic tions of obtained covariation between interven- regression analysis. New York: Oxford University Press. tions and outcomes. Immediate generalizability Baron, R. M., & Kenny, D. A. (1986). The moderator- to current practice of findings from aim (ii) mediator variable distinction in social psychological research: conceptual, strategic, and statistical considera- studies cannot be a central criterion of their tions. Journal of Personality and Social Psychology, 51, public health significance. 1173±1182. The efficacy vs. effectiveness debate no doubt Beck, J. S. (1995). 
Cognitive therapy: Basics and beyond. was partially fueled by efficacy researchers' New York: Guilford Press. frustration. Many of them observed that years of Bergin, A. E., & Garfield, S. L. (Eds.) (1971). Handbook of psychotherapy and behavior change (1st ed.). New York: findings from high IV studies that support the Wiley. value of certain types of psychotherapy (Task Bergin, A. E., & Garfield, S. L. (Eds.) (1994). Handbook of Force on the Promotion and Dissemination of psychotherapy and behavior change (4th ed.). New York: Psychological Procedures, 1995) were eschewed Wiley. Beutler, L. E., & Howard, K. I. (1998). Clinical utility by many who focused on the EV of those studies. research: An introduction. Journal of Clinical Psychol- Also, emphasizing EV limitations of well-done, ogy, 54, 297±301. high IV studies can impede implementation of Campbell, D. T., & Stanley, J. C. (1963). Experimental and tested therapies in public sector settings. By quasi-experimental designs for research on teaching. In default, then, the use of untested, ªtreatment as N. L. Gage (Ed.), Handbook of research on teaching (pp. 171±246). Chicago: Rand McNally. usualº interventions is supported and main- Campbell, D. T., & Stanley, J. C. (1996). Experimental and tained. This is the current situation in the USA. quasi-experimental designs for research. Chicago: Rand A partial antidote to the preceding problem, a McNally. serious problem from the public health stand- Carroll, K. M., Nich, C., & Rounsaville, B. J. (in press). Use of observer and therapist ratings to monitor delivery point, was suggested by Kazdin et al. in 1986. of coping skills treatment for cocaine abusers: Utility of Their suggestion is consistent with a point made therapist session checklists. Psychotherapy Research. earlier in this chapter. Mental health interven- Carroll, R. J., & Ruppert, D. (1988). Transformation and tion research advances since the late 1980s weighting in regression. New York: Chapman and Hall. 
include methods to systematically train thera- Chevron, E. S., & Rounsaville, B. J. (1983). Evaluating the clinical skills of psychotherapists: A comparison of pists in therapies with demonstrated efficacy, techniques. Archives of General Psychiatry, 40, and to evaluate the skillfulness that therapists 1129±1132. attain from the training. Many therapist Clarke, G. N. (1995). Improving the transition from basic training and adherence assessment research efficacy research to effectiveness studies: Methodological methods generalize directly to public sector issues and procedures. Journal of Consulting and Clinical Psychology, 63, 718±725. therapists. Thus, rather than look at findings Cohen, J. (1977). Statistical power analysis for the from high IV studies and ask the therapist EV behavioral sciences (Rev. ed.) New York: Academic question: ªDoes the study provide evidence that Press. therapists in the public health sector will be able Cohen, J. (1992). Quantitative methods in psychology: A power primer. Psychological Bulletin, 112, 155±159. to get the same results?,º the question could be Collins, J. F., & Elkin, I. (1985). Randomization in the refocused: ªDoes the study provide methods NIMH treatment of depression collaborative research that enable therapists to be identified and program. In R. F. Boruch & W. Wothke (Eds.), trained to get the same outcomes in public Randomization and field experimentations: New directions treatment settings?º for program evaluation. San Francisco: Jossey-Bass. Consumer Reports (1995, November). Mental health: Does therapy help? (pp. 734±739). Cook, T. D., & Campbell, D. T. (Eds.) (1979). Quasi- ACKNOWLEDGMENTS experimentation: design and analysis issues for field settings. Boston: Houghton Mifflin. Preparation of this chapter was supported by Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco: Jossey-Bass. National Institute of Mental Health Grant K02 Cronbach, L. J., & Meehl, P. E. 
(1955). Construct validity MH01443 to Karla Moras. Sincere thanks are in psychological tests. Psychological Bulletin, 52, extended to those who provided consultations, 281±302. comments, or other contributions to the pre- Dar, R., Serlin, R. C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy paration of the manuscript: Phyllis Solomon, research. Journal of Consulting and Clinical Psychology, Elizabeth McCalmont, Christine Ratto, Heidi 62, 75±82. Grenke, J. R. Landis, and Jesse Chittams. Drake, R. E., McHugo, G. J., Becker, D. R., Anthony, W. References 223

A., & Clark, R. E. (1996). The New Hampshire study of supported employment for people with severe mental illness. Journal of Consulting and Clinical Psychology, 64, 391–399.

Elkin, I., Shea, M. T., Watkins, J. T., Imber, S. D., Sotsky, S. M., Collins, J. F., Glass, D. R., Pilkonis, P. A., Leber, W. R., Docherty, J. P., Fiester, S. J., & Parloff, M. B. (1989). NIMH Treatment of Depression Collaborative Research Program: I. General effectiveness of treatments. Archives of General Psychiatry, 46, 971–982.

Evans, M. E., Piasecki, J. M., Kriss, M. R., & Hollon, S. D. (1984). Raters' manual for the Collaborative Study Psychotherapy Rating Scale–form 6. Minneapolis. University of Minnesota and the St. Paul-Ramsey medical Center. (Available from the US Department of Commerce, National Technical Information Service, Springfield, VA 22161.)

Flick, S. N. (1988). Managing attrition in clinical research. Clinical Psychology Review, 8, 499–515.

Garfield, S. L., & Bergin, A. E. (Eds.) (1978). Handbook of psychotherapy and behavior change (2nd ed.). New York: Wiley.

Garfield, S. L., & Bergin, A. E. (Eds.) (1986). Handbook of psychotherapy and behavior change (3rd ed.). New York: Wiley.

Gibbons, R. D., Hedeker, D., Elkin, I., et al. (1993). Some conceptual and statistical issues in analysis of longitudinal psychiatric data. Archives of General Psychiatry, 50, 739–750.

Gillings, D., & Koch, G. (1991). The application of the principle of intention-to-treat to the analysis of clinical trials. Drug Information Journal, 25, 411–424.

Goldfried, M. R., & Wolfe, B. E. (1998). Toward a more clinically valid approach to therapy research. Journal of Consulting and Clinical Psychology, 66, 143–150.

Gomes-Schwartz, B. (1978). Effective ingredients in psychotherapy: Prediction of outcome from process variables. Journal of Consulting & Clinical Psychology, 46, 1023–1035.

Gotto, A. M. (1997). The multiple risk factor intervention trial (MRFIT): A return to a landmark trial. Journal of the American Medical Association, 277, 595–597.

Greenberg, L., & Pinsof, W. M. (Eds.) (1986). The psychotherapeutic process: A research handbook. New York: Guilford Press.

Greenberg, R. P., Bornstein, R. F., Greenberg, M. D., & Fisher, S. (1992). A meta-analysis of anti-depressant outcome under “blinder” conditions. Journal of Consulting and Clinical Psychology, 60, 664–669.

Hill, C. E., O'Grady, K. E., & Elkin, I. (1992). Applying the Collaborative Study Psychotherapy Rating Scale to rate therapist adherence in cognitive-behavior therapy, interpersonal therapy, and clinical management. Journal of Consulting and Clinical Psychology, 60, 73–79.

Hoagwood, K., Hibbs, E., Bren, T. D., & Jensen, P. (1995). Introduction to the special section: Efficacy and effectiveness in studies of child and adolescent psychotherapy. Journal of Consulting and Clinical Psychology, 63, 683–687.

Hollon, S. D., Shelton, R. C., & Loosen, P. T. (1991). Cognitive therapy and pharmacotherapy for depression. Journal of Consulting and Clinical Psychology, 59, 88–99.

Howard, K. I., Cox, W. M., & Saunders, S. M. (1990). Attrition in substance abuse comparative treatment research: the illusion of randomization. In L. S. Onken & J. D. Blaine (Eds.), Psychotherapy and counseling in the treatment of drug abuse (DHHS Publication No. (ADM)90-1722, pp. 66–79). Washington, DC: NIDA Research Monograph 104.

Howard, K. I., Kopta, S. M., Krause, M. S., & Orlinsky, D. E. (1986). The dose–effect relationship in psychotherapy. American Psychologist, 41, 159–164.

Howard, K. I., Krause, M. S., & Orlinsky, D. E. (1986). The attrition dilemma: Toward a new strategy for psychotherapy research. Journal of Consulting and Clinical Psychology, 54, 106–107.

Howard, K. I., Moras, K., Brill, P. L., Martinovich, Z., & Lutz, W. (1996). Evaluation of psychotherapy: Efficacy, effectiveness, and patient progress. American Psychologist, 51, 1059–1064.

Ilardi, S. S., & Craighead, W. E. (1994). The role of nonspecific factors in cognitive-behavior therapy for depression. Clinical Psychology: Science & Practice, 1, 138–156.

Jacobson, N. S., & Christensen, A. (1996). Studying the effectiveness of psychotherapy: How well can clinical trials do the job? American Psychologist, 51, 1031–1039.

Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.

Kazdin, A. E. (1979). Nonspecific treatment factors in psychotherapy outcome research. Journal of Consulting & Clinical Psychology, 47, 846–851.

Kazdin, A. E. (1994). Methodology, design, and evaluation in psychotherapy research. In A. E. Bergin & S. L. Garfield (Eds.), Handbook of psychotherapy and behavior change (4th ed., pp. 19–71). New York: Wiley.

Kazdin, A. E., Kratochwill, T. R., & VandenBos, G. R. (1986). Beyond clinical trials: Generalizing from research to practice. Professional Psychology: Research and Practice, 17, 391–398.

Kendall, P. C., & Brady, E. U. (1995). Comorbidity in the anxiety disorders of childhood. In K. D. Craig & K. S. Dobson (Eds.), Anxiety and depression in adults and children. Newbury Park, CA: Sage.

Kenny, D. A. (1979). Correlation and causality. New York: Wiley.

Kiesler, D. (1973). The process of psychotherapy: Empirical foundations and systems of analysis. Chicago: Aldine.

Klerman, G. L., Weissman, M. M., Rounsaville, B. J., & Chevron, E. S. (1984). Interpersonal psychotherapy of depression. New York: Basic Books.

Knapp, M. (Ed.) (1995). The economic evaluation of mental health care. London: Arena, Aldershot.

Kopta, S. M., Howard, K. I., Lowry, J. L., & Beutler, L. E. (1994). Patterns of symptomatic recovery in psychotherapy. Journal of Consulting and Clinical Psychology, 62, 1009–1016.

Kraemer, H. C., & Telch, C. F. (1992). Selection and utilization of outcome measures in psychiatric clinical trials: Report on the 1988 MacArthur Foundation Network I Methodology Institute. Neuropsychopharmacology, 7, 85–94.

Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research, Newbury Park, CA: Sage.

Lambert, M. J., & Ogles, B. M. (1988). Treatment manuals: Problems and promise. Journal of Integrative and Eclectic Psychotherapy, 7, 187–204.

Leach, C. (1979). Introduction to statistics: A nonparametric approach for the social sciences. New York: Wiley.

Lebowitz, B. D., & Rudorfer, M. V. (1998). Treatment research at the millennium: From efficacy to effectiveness. Journal of Clinical Psychopharmacology, 18, 1.

Levenson, H. (1995). Time-limited dynamic psychotherapy: A guide to clinical practice. New York: Basic Books.

Lueger, R. J. (1998). Using feedback on patient progress to predict outcome of psychotherapy. Journal of Clinical Psychology, 54, 383–393.

Margraf, J., Ehlers, A., Roth, W. T., Clark, D. B., Sheikh, J., Agras, W. S., & Taylor, C. B. (1991). How “blind” are double-blind studies? Journal of Consulting and Clinical Psychology, 59, 184–187.

McGrew, J. H., Bond, G. R., Dietzen, L., & Salyers, M. (1994). Measuring the fidelity of implementation of a
mental health program model. Journal of Consulting and Clinical Psychology, 62, 670–678.

Mintz, J., Drake, R. E., & Crits-Christoph, P. (1996). Efficacy and effectiveness of psychotherapy: Two paradigms, one science. American Psychologist, 51, 1084–1085.

Multiple Risk Factor Intervention Trial Group. (1977). Statistical design considerations in the NHLI multiple risk factor intervention trial (MRFIT). Journal of Chronic Disease, 30, 261–275.

Ney, P. G., Collins, C., & Spensor, C. (1986). Double blind: Double talk or are there ways to do better research. Medical Hypotheses, 21, 119–126.

Newman, F. L., & Tejeda, M. J. (1996). The need for research that is designed to support decisions in the delivery of mental health services. American Psychologist, 51, 1040–1049.

O'Leary, K. D., & Borkovec, T. D. (1978). Conceptual, methodological, and ethical problems of placebo groups in psychotherapy research. American Psychologist, 33, 821–830.

Orlinsky, D. E., Grawe, K., & Parks, B. K. (1994). Process and outcome in psychotherapy–Noch einmal. In A. E. Bergin & S. L. Garfield (Eds.), Handbook of psychotherapy and behavior change (pp. 270–376). New York: Wiley.

Orlinsky, D. E., & Howard, K. I. (1986). Process and outcome in psychotherapy. In S. L. Garfield & A. E. Bergin (Eds.), Handbook of psychotherapy and behavior change (3rd ed. pp. 311–381). New York: Wiley.

Parloff, M. B. (1986). Placebo controls in psychotherapy research: a sine qua non or a placebo for research problems? Journal of Consulting and Clinical Psychology, 54, 79–87.

Rabkin, J. G., Markowitz, J. S., Stewart, J., McGrath, P., Harrison, W., Quitkin, F. M., & Klein, D. F. (1986). How blind is blind? Assessment of patient and doctor medication guesses in a placebo-controlled trial of imipramine and phenelzine. Psychiatry Research, 19, 75–86.

Rice, L. N., & Greenberg, L. S. (1984). Patterns of change: Intensive analysis of psychotherapy process. New York: Guilford Press.

Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis. New York: McGraw-Hill.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.

Roth, A., & Fonagy, P. (1996). What works for whom? A critical review of psychotherapy research. New York: Guilford Press.

Russell, R. L. (Ed.) (1987). Language in psychotherapy: Strategies of discovery. New York: Plenum.

Seligman, M. (1995). The effectiveness of psychotherapy. The Consumer Report study. American Psychologist, 50, 965–974.

Seligman, M. (1996). Science as an ally of practice. American Psychologist, 51, 1072–1079.

Shadish, W. R. (1995). The logic of generalization: Five principles common to experiments and ethnographies. American Journal of Community Psychology, 23, 419–428.

Shadish, W. R., Matt, G. E., Navarro, A. N., Siegle, G., Crits-Christoph, P., Hazelrigg, M. D., Jorm, A. F., Lyons, L. C., Nietzel, M. T., Prout, H. T., Robinson, L., Smith, M. L., Svartberg, M., & Weiss, B. (1997). Evidence that therapy works in clinically representative conditions. Journal of Consulting and Clinical Psychology, 65, 355–365.

Shaw, B. F. (1984). Specification of the training and evaluation of cognitive therapists for outcome studies. In J. B. W. Williams & R. L. Spitzer (Eds.), Psychotherapy research: Where are we and where should we go? (pp. 173–189). New York: Guilford Press.

Schulberg, H. C., Block, M. R., Madonia, M. J., Scott, P., Rodriguez, E., Imber, S. D., Perel, J., Lave, J., Houck, P. R., & Coulehan, J. L. (1996). Treating major depression in primary care practice: Eight-month clinical outcomes. Archives of General Psychiatry, 53, 913–919.

Speer, D. C. (1994). Can treatment research inform decision makers? Nonexperimental method issues and examples among older outpatients. Journal of Consulting and Clinical Psychology, 62, 560–568.

Strupp, H. H., & Hadley, S. (1979). Specific vs. nonspecific factors in psychotherapy. Archives of General Psychiatry, 36, 1125–1136.

Task Force on Promotion and Dissemination of Psychological Procedures (1995). Training in and dissemination of empirically-validated psychological treatments: Report and recommendations. The Clinical Psychologist, 48, 3–24.

Waltz, J., Addis, M. E., Koerner, K., & Jacobson, N. S. (1993). Testing the integrity of a psychotherapy protocol: Assessment of adherence and competence. Journal of Consulting and Clinical Psychology, 61, 620–630.

Wei, L. J. (1978). An application of an urn model to the design of sequential controlled trials. Journal of the American Statistical Association, 73, 559–563.

Wells, K. B., & Sturm, R. (1996). Informing the policy process: From efficacy to effectiveness data on pharmacotherapy. Journal of Consulting and Clinical Psychology, 64, 638–645.

Wilson, G. T. (1996). Manual-based treatments: The clinical application of research findings. Behavior Research and Therapy, 34, 295–314.

Wolff, N., Helminiak, T. W., & Tebes, J. K. (1997). Getting the cost right in cost-effectiveness analyses. American Journal of Psychiatry, 154, 736–743.

Woody, S. R., & Kihlstrom, L. C. (1997). Outcomes, quality, and cost: Integrating psychotherapy and mental health services research. Psychotherapy Research, 7, 365–381.

Yates, B. T. (1996). Analyzing costs, procedures, process, and outcomes in human services. Thousand Oaks, CA: Sage.

Yates, B. T. (1997). From psychotherapy research to cost–outcome research: What resources are necessary to implement which therapy procedures that change what processes to yield which outcomes? Psychotherapy Research, 7, 345–364.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.11 Mental Health Services Research

WILLIAM A. HARGREAVES
University of California, San Francisco, CA, USA

RALPH A. CATALANO, TEH-WEI HU
University of California, Berkeley, CA, USA

and

BRIAN CUFFEL
United Behavioral Health, San Francisco, CA, USA

3.11.1 INTRODUCTION
3.11.2 EFFICACY RESEARCH
3.11.2.1 Examples of Efficacy Research
3.11.3 COST-OUTCOME RESEARCH
3.11.3.1 Examples of Cost-outcome Studies
3.11.4 SERVICE SYSTEM RESEARCH
3.11.5 POLICY RESEARCH
3.11.6 CONCLUSION
3.11.7 REFERENCES

3.11.1 INTRODUCTION Mental health services research is inherently multidisciplinary, drawing from clinical psy- The US National Institute of Mental Health chology, psychiatry, health economics, epide- blueprint for improving services (NIMH, 1991; miology, sociology, political science, social Attkisson et al., 1992; Lalley et al., 1992; work, nursing, pharmacology, pharmacy, gen- Mechanic et al., 1992; Steinwachs et al., 1992) eral medicine, and public health. This multi- broadened and clarified the scope of mental disciplinary character is reflected in the health services research. New and experienced authorship of this article, including two clinical investigators trying to understand mental health psychologists, an urban planner, and a health services research, however, may find it hard to economist. Many clinical psychologists are comprehend the multiple perspectives on the contributing to mental health services research field and how seemingly disparate research and more are needed. agendas contribute to a unified body of In spite of great heterogeneity of methodol- knowledge. Nevertheless, a unified view of the ogy, all of services research can be viewed as field can be gained by reflecting on key examining some aspect of a nested system of constructs and the causal paths that connect causal processes that influence the mental them. health and well being of society. At the inner

225 226 Mental Health Services Research core are the effects of specific services on research models differ on a number of features: individual users of these services. These effects (i) their typical unit of analysis; (ii) the outcomes are contained within the broader effects of the of primary interest; (iii) their typical design way services are organized into local or other strategies; (iv) the independent variables usually end-user service systems, how services are experimentally controlled or sampled by the financed, and how these services systems impact investigator; and (v) the hypothesized causal relevant target populations. These effects of the variables of greatest interest. Table 1 sum- organization and financing of services are in marizes these features for each of the four turn influenced by policy innovations of local, models. As each model is discussed it will state, national, and international scope that illustrate the causal pathways of usual interest, influence the functioning of local service as portrayed in Figures 1 to 4. systems. These three domains of effects corre- spond to different types of services research: cost-outcome studies, service system studies, 3.11.2 EFFICACY RESEARCH and policy studies. The inner core of services research, cost- Efficacy studies examine the effects of a well- outcome studies, in turn grew out of an defined treatment or other service on the clinical established tradition of clinical efficacy research and rehabilitation outcomes of individual utilizing the randomized clinical trial. The recipients of care. Figure 1 is familiar to clinical Agency for Health Care Policy Research of trials investigators and shows the key groups of the US Government, in an attempt to foster variables and causal paths of primary interest. 
services research, popularized a distinction Fifteen causal paths, numbered A1 to A15, between efficacy and effectiveness research in encompass the scientific focus of efficacy the evaluation of medical technologies (Salive, research. Mayfield, & Weissman, 1990). Efficacy studies The ultimate outcomes of efficacy studies are evaluate the effect of an intervention under reductions in illness symptoms, changes in optimal conditions, when implemented by behavior, or improvements in skills. Outcomes highly skilled clinicians using comprehensive are usually measured over weeks or months treatment protocols under expert supervision, rather than years. The unit of analysis is the and where the intervention is usually delivered person who consumes services. The typical to patients with a single, well-defined disorder. study will compare an innovative treatment to Effectiveness studies complement the findings either a placebo, an accepted ªstandardº of efficacy studies by evaluating interventions in treatment, or some other control for attention conditions, settings, and target populations that and expectation effects. Double-blind controls reflect the usual circumstances in the delivery of are employed to minimize bias that might be services, often giving attention to long-term caused by expectation effects on outcomes and effects of treatments on both costs and out- on judgments of outcomes. When the blind is comes. The term ªcost-outcomeº is used to likely to ªleakº or is otherwise impossible to avoid confusion between ªeffectivenessº studies maintain, blinding of interviewers and raters is and ªcost-effectivenessº studies, since the latter sometimes attempted as a partial effort to use a specific subset of cost-outcome methods. reduce bias. Cost-outcome research remains conceptually The causal chain of primary interest is usually and methodologically linked to efficacy re- causal path A10, the effect of the implemented search. 
The refinement of clinical trial metho- service on ultimate outcomes. Investigators may dology over several decades has established a also be interested in the mediation of this effect useful state of the art that includes random through the use and ªreceiptº of service and assignment to treatment, appropriate blinding, through proximal outcomes. Thus, several complete follow-up with ªintent-to-treatº ana- aspects of the whole cascade of effects A1 to lyses, and improved statistical handling of A8 may be examined. Even more broadly, longitudinal data with missing or unevenly efficacy studies may take account of the spaced measurement occasions, as discussed in characteristics of the person at baseline such Chapter 3, this volume. Cost-outcome studies as diagnosis, severity of disability, history of involve the extension and improvement of treatment compliance, and stage of illness in clinical trials methodology necessary to address order to understand their modulation of broader questions of generalizability, accessi- intervention effects (A11 and A13). Similarly, bility, acceptability, cost, and long-term out- the effects of the person's preservice life context come of interventions. (e.g., homelessness, poverty, social support This chapter contrasts efficacy research and system, entitlements, insurance coverage) may the three types of services research on several be examined, along with life context during and dimensions, and gives examples of each. These after service (A14 to A15). In efficacy studies, Table 1 Typical features of different research models.

Research model Feature Efficacy Cost-outcome Service system Policy

Unit of analysis Individual representing a Individual representing a system Individuals in target population Community service systems defined disorder group target population Aggregate subgroup status, use, or change

Outcomes of primary Symptom improvement Efficacy outcomes Service demand, access, and use Local service policy adoption, interest Rehabilitative outcome Societal cost and payer mix Service cost innovation Humanistic outcome Preference-weighted outcomes Prevalence of psychiatric Service cost Public safety or utilities impairment Prevalence of psychiatric impairment

Typical design strategies Randomized clinical trial Randomized trial with Policy experiment Policy experiment epidemiological specification Population survey Population survey of target population Org-network analysis Org-network analysis Quasi-experiment MIS analysis MIS analysis Economic modeling and Economic modeling and computer simulation computer simulation

Variables experimentally Mental health system Mental health system Mental health system Population demographics controlled or set by environment environment environment Government structure investigator Clinical environment and Clinical environment and Policy innovation process service context service context Baseline characteristics of Characteristics of available subjects services Amount, type and delivery of services to study subjects

Hypothesized causal Service delivery and receipt Service delivery and receipt Local system organization, Policy innovation above the variables of interest financing, management, and local level innovation. 228 Mental Health Services Research

[Path diagram: person pre-service life context, person baseline, implemented service, service entry, use and "receipt" of service, proximal outcomes, life context during and after service, and ultimate outcomes, connected by causal paths A1 to A15.]

Figure 1 Efficacy studies.

however, the goal is usually focussed on causal path A10 while other causal paths are viewed as "nuisance" effects or "artifacts" that threaten a valid assessment of path A10.

The typical design strategy is the randomized trial. The design of efficacy studies typically emphasizes the "internal" validity of the estimate of the effect of a treatment, taking care to exclude alternative explanations for the observed effects, such as baseline or post-attrition differences in persons receiving different services.

Efficacy studies may focus on the mechanism of action of the intervention. Mechanism can often be inferred from analyses of "intermediate" variables that may account for the service effect, such as service use and "receipt" (e.g., medication or psychosocial treatment adherence, regularity of appointments, attitude toward the services), proximal outcomes (e.g., learning a communication skill, immediate symptomatic response to a medication), and changes in living setting or social support system resulting from the intervention.

The quest for internal validity and adequate statistical power often leads to considerable sacrifice of "external" validity, or generalizability. Service entry by study subjects is taken as a given, and usually little effort is made to relate the characteristics of the subject sample to those of the target population. Indeed, the usual efficacy study purposely employs narrow criteria to produce a homogeneous sample of subjects with few "distracting" sources of variance, such as comorbid illnesses, that may be quite prevalent in the population likely to receive the innovative intervention. Subjects are recruited from some convenient source that can yield enough subjects who qualify and will consent, making the assumption (usually unstated and unexamined) that the clients entering screening represent the target population on the dimensions that matter for generalization.

3.11.2.1 Examples of Efficacy Research

The National Institute of Mental Health "Treatment Strategies in Schizophrenia" cooperative study is an example of a late-stage, complex efficacy study. Schooler, Keith, Severe, & Matthews (1989) and Schooler et al. (1997) compared maintenance treatment strategies for people with a recent schizophrenia relapse whose family is in the picture. The study assessed the impact of dose reduction of antipsychotic medication and family treatment on relapse and rehospitalization during maintenance treatment (paths A1 to A15 in Figure 1). Standard dose medication was compared to continuous low-dose or targeted treatment, in which active medication was used only upon incipient signs of relapse. A widely studied and effective type of family intervention was compared to a much less intensive family intervention in the context of assertive medication management and ongoing supportive assistance to patients and families. It was hypothesized that the more intensive family intervention would make the conservative low-dose and targeted medication strategies more effective, perhaps as effective as standard-dose but with less burden from medication side effects. The three-by-two design thus produced six treatment combinations, and the interaction between dosing strategy and family intervention level was of primary interest. The investigators found that both continuous low-dose and targeted medication increased the use of "rescue" medication and the risk of symptomatic relapse, but only targeted treatment increased hospitalization compared to standard-dose medication. The two family interventions showed essentially no differences in effectiveness, and family intervention did not interact with dose effects. There were two important policy implications of the findings: (i) targeted treatment, which had been advocated by some investigators, should not be adopted as a standard maintenance strategy, and (ii) the intensive family interventions that had been shown in previous research to be effective compared to no family intervention are more intensive than needed when provided in the context of good continuity of maintenance treatment and initial education and support of family members.

Like most efficacy studies, however, the primary focus was on mechanism of action rather than policy generalization. All sites were university-affiliated clinics, not sites picked to represent the "usual" circumstances of schizophrenia treatment. The design limited the ability to generalize to patients who did not reconstitute within six months of their most recent psychotic episode.

Nevertheless this study also had some features of cost-outcome research. "Ordinary" clinicians were recruited and trained as family clinicians, the study was carried out in five sites allowing a test of whether the findings generalized across site, and costs were assessed, although cost analysis was not part of the original design.

The methodology for efficacy research is well developed and relies heavily on the randomized clinical trial. The field reached a reasonable consensus on efficacy methodology in clinical psychopharmacology in the 1970s, and in psychotherapy efficacy research since the mid-1980s.

3.11.3 COST-OUTCOME RESEARCH

Cost-outcome studies of mental health services expand the scope of efficacy research to address key questions related to the effects of mental health services delivered in typical care settings. Figure 2 illustrates the expanded causal paths examined in cost-outcome studies. These include the same causal paths (A1 to A15) from efficacy studies (Figure 1), but include additional factors not addressed in efficacy trials (B1 to B21). Access to services is often examined in cost-outcome trials (paths B1 to B9), and cost-outcome studies always measure the use and cost of services, including services other than the experimental interventions themselves (B10, B14, and B15).

The ultimate outcomes of cost-outcome studies are broad, including clinical outcomes (illness symptoms, side effects), rehabilitative outcomes (productivity), humanistic outcomes (life satisfaction, family psychological burden), and public safety outcomes (disruptive behavior, assaultiveness, suicidality, criminal behavior). Cost-outcome studies also may consider determinants of outcomes resulting from a wider range of causes including program case mix, consumer life context, and service cost/payer mix (B3, B7 to B9).

More importantly, perhaps, there is a fundamental shift in the perspective on outcomes in cost-outcome studies (Hargreaves, Shumway, Hu, & Cuffel, 1997). The ubiquitous fact of multiple, poorly correlated outcomes in mental disorders creates a problem that cannot be ignored in cost-outcome studies, since policy conclusions regarding the choice of alternative interventions depend on the relation of overall cost to overall outcome.

Different approaches to integrating outcomes are reflected in four different types of cost-outcome studies. Cost-minimization studies examine only costs, in situations in which outcomes are known or assumed to be equal (B10, B14, B15). Cost-benefit studies focus on outcomes that can be monetized, such as labor force productivity, and analyze either cost-benefit ratios or net benefits (benefits minus costs) (B11, B16, B17). Cost-effectiveness studies leave outcomes in their natural units, and cost-effectiveness ratios are computed either on single outcomes (e.g., cost per suicide prevented) or on a global or integrated outcome score (e.g., cost per one unit of level-of-functioning score gained) (B12, B18, B19). Cost-utility studies utilize cost-utility ratios, where utility is a preference-weighted unitary outcome (e.g., cost per quality-adjusted life year gained) (B13, B20, B21). All of these studies are comparative: no intervention is "cost-effective" by itself. The cost-outcome performance of an intervention is always judged against a policy-relevant standard, such as treatment as usual. Cost-outcome ratios are not needed to compare interventions when the less costly intervention produces better outcomes. When an innovation is both more expensive and more

[Path diagram: the efficacy-study elements (person pre-service life context, person baseline, implemented service, service entry, use and "receipt" of service, proximal outcomes, life context during and after service, ultimate outcomes; causal paths A1 to A15) extended by eligible pool, case mix, person pre-service costs, costs during/after service, monetized outcome, outcome in natural units, outcome as utility, cost-benefit analysis, cost-effectiveness analysis, and cost-utility analysis, connected by paths B1 to B21.]

Figure 2 Cost-outcome studies.

effective than standard care, a common type of cost-outcome ratio compares the differential cost per differential outcome gained by the innovation compared to standard care.

Cost, as a consequence of an intervention, has its own measurement technology that has been carefully worked out in econometrics and health economics (Drummond, Brandt, Luce, & Rovira, 1993; Drummond, Stoddard, & Torrance, 1997; Sloan, 1995). Economic theory provides the conceptual underpinning of cost estimation methods. In a theoretically ideal competitive market, goods and services are exchanged at prices that reflect their true value. In the real world market quality varies, and in health care the operation of a free, competitive market is impaired because buyers and sellers have inadequate information, and because third-party payers and government regulation also introduce market distortions. The theory of welfare economics provides an alternative way to estimate the true value of goods and services when market pricing is not accurate. Under these circumstances the true value is estimated by considering the "opportunity cost" of resources. This value estimate, called a "shadow price," is the value of a resource were it to be used for its "next-best" purpose.

Economists also developed the concept of "societal" cost to describe a cost perspective of an entire economic system, without regard to who pays the cost. Societal cost is accepted widely as a standard reference method in cost-outcome research (Drummond et al., 1997). Most policy questions, however, also need to be informed by information about cost burdens to particular payers. For example, if a new intervention results in the same cost as usual care (from a societal perspective), but shifts the payment burden from the state and federal government to local government, or from government to the ill person's family, the payer information is important. Economists call shifts in payer burden a "distributional effect" of the innovative intervention, since it distributes the grief in a new way.

The difference in societal cost of two intervention strategies is the difference of the value of resources consumed or lost. Resources consumed (e.g., treatment activities) are called "direct" costs, while resources lost (e.g., time prevented from working, which produces resources) are called "indirect" costs. The amounts of resources are measured in natural units (hours of psychologist time, days of hospitalization), but values are assigned to these resource units by a separate process. In this way, questions about the accuracy of methods for measuring resource use can be examined separately from questions about the appropriateness of the values assigned to each type of resource.

Seven types of cost elements are often considered when planning mental health cost-outcome research: mental health treatment, physical health care, criminal justice procedures, social services, time and productivity of the patient and of family caregivers, other family costs, and sources and amounts of patient income. Different interventions and target populations vary greatly in the relative emphasis that needs to be given these seven types of cost. The most accurate measures of consumption and the most careful estimates of value are usually reserved for the most commonly used resources, especially the new activities introduced in the innovative intervention being studied.

Expert judgment is required to choose an appropriate balance of methods and to apply them skillfully. In most situations neither health economists nor clinical investigators can do it alone, and effective collaboration is required. Hargreaves et al. (1997) provide an introduction to cost-outcome methods for mental health research, designed to prepare noneconomist investigators for this type of collaboration.

The unit of analysis in cost-outcome studies is still the person, but attention is given to the person as a member of a target population, the paths from target population membership to service, the generalization of the findings to this target population, and even the impact of service provision or access policies on the whole target population.

The typical design strategy is the randomized cost-effectiveness trial focused on both internal and external validity, comparing a service innovation to standard services, and estimating the impact on the total target population as well as individual members of that target population. Ideally the size and character of the target population is estimated, and a random sample is invited to participate in the study. More often, a series of analyses tracks the biases in selection from target population, to potential subject pool, to recruited subject sample, to analyzable subject sample.

In these decision-oriented studies, certain "standard" features of efficacy trials, such as double-blind conditions, may be avoided in order to enhance external validity. For example, an open study does not require sham procedures such as unnecessary blood samples, or active placebos that mimic side effects, both of which may obscure the relative acceptability of services being compared. This may increase risks to internal validity, as when an open study allows rater bias and alters client expectation effects. These problems should not be ignored, although they are less troublesome when conclusions hinge on long-term social and vocational functioning improvement rather than short-term symptom improvement.

Despite the broader scope of cost-outcome studies, several types of variables are taken as given or set by the investigator, including the service model, the character of the target population, the consumer's baseline clinical characteristics and life context, and preservice utilization and costs. In service system research, by contrast, it will be seen that investigators may view these variables as outcomes.

3.11.3.1 Examples of Cost-outcome Studies

A study of assertive community treatment was an exemplary cost-benefit study (Stein & Test, 1980; Test & Stein, 1980; Weisbrod, 1983; Weisbrod, Test, & Stein, 1980). In this study the experimental treatment was applied for 14 months. Subjects in the experimental program had less time in the hospital, more time in independent living situations, better employment records, increased life satisfaction, decreased symptoms, and improved treatment adherence. Family and community burden did not appear to increase (A1, A3, A4). Societal cost was found to be slightly lower in the experimental treatment when wages earned were considered monetary benefits and welfare support was considered a cost (B10, B11, B14 to B17). Thus, the innovative program showed both an economic advantage and a psychosocial advantage.

Research on assertive community treatment (ACT) for persons with severe mental illness is the strongest and most coherent body of cost-effectiveness investigation in mental health. More than a dozen additional randomized trials of ACT were published from 1986 to 1996 and the findings have been critiqued and summarized in many reviews (Bond, McGrew, & Fekete, 1995; Burns & Santos, 1995; Chamberlain & Rapp, 1991; Hargreaves & Shumway, 1989; Olfson, 1990; Rubin, 1992; Santos, Henggeler, Burns, Arana, & Meisler, 1995; Scott & Dixon, 1995; Solomon, 1992; Test, 1992). Studies in the 1990s utilize stronger usual care conditions, reflecting the widespread adoption of linkage case management in the USA (Burns & Santos, 1995; Randolph, 1992). These studies are refining understanding of the conditions and target groups in which assertive community treatment is cost-efficient or has such superior outcome in spite of increased cost that it can be considered cost-effective relative to available service alternatives.

A randomized cost-effectiveness study of the atypical antipsychotic clozapine for persons with schizophrenia was carried out by Essock (Essock, Hargreaves, Covell, & Goethe, 1996a; Essock et al., 1996b). It featured a complete enumeration of clozapine-eligible patients in an entire state hospital system, and a mix of random and purposive selection of members of the eligible pool to be offered study entry during a time when clozapine was rationed by the state hospital system and study participation was essentially the only route of access to clozapine treatment (B1 to B9). Study subjects were randomized to clozapine or usual antipsychotic medication in an open study (A1 to A15). An effectiveness and cost evaluation was carried out over two years of treatment (B10, B12, B14, B15, B18, B19). Clozapine showed fewer side effects with some advantage in reduced restrictiveness of care and reduced assaultive or disruptive behavior. In this long-stay state hospital patient group (mean length of their hospital episode was over eight years) no effect was seen on time from study entry to hospital discharge. In contrast, once discharged, clozapine patients were much less likely to be readmitted, resulting in saved hospital days. Full cost analyses have not been completed, but saved hospital costs may fully offset the higher price of the novel antipsychotic medication plus the increased cost of community care when patients are out of the hospital. This study illustrates how the results of an effectiveness trial comparing a new treatment to usual practice can be more conservative than an efficacy trial comparing a novel medication to a single alternative. Previous efficacy trials had shown larger effects on symptom improvement, typically in patients who were in an acute illness exacerbation, not the relatively stable but still symptomatic long-stay state hospital patients who showed lower baseline scores on symptom scales.

A study by Henggeler, Melton, and Smith (1992) illustrates how research can have an intermediate position between efficacy and cost-outcome research. Henggeler studied "multisystemic therapy" (MST) as an intervention for serious juvenile offenders and their multiproblem families. Yoked pairs of qualified subjects randomized to multisystemic therapy or usual services were referred by the South Carolina Department of Youth Services. Endpoints included family relations, peer relations, symptomatology, social competence, self-reported delinquency, number of arrests, and weeks of incarceration. Compared to youths who received usual services, youths who received MST had fewer arrests and self-reported offenses and spent an average of 10 fewer weeks incarcerated during a 59-week study period. Families assigned to MST reported increased family cohesion and decreased youth aggression in peer relations. Detailed cost analyses were not reported in this initial publication. This is one of the first controlled trials of interventions with this difficult target population to show significant beneficial outcomes and clinically important effects. The success of the intervention plausibly is related to its addressing each of the major factors found in multidimensional causal models of antisocial behavior in adolescents (Henggeler, 1991).

This study can be considered efficacy research in that it was delivered under close supervision of those who designed the intervention, and further evidence is needed that its effectiveness can be maintained when the intervention is taught and supervised by others. In addition, the statistical power of the initial study is limited, so while large and statistically significant effects were observed, effect size in the population still needs to be estimated more accurately before basic efficacy can be considered to be established. Features more like cost-outcome research include the construction of a subject pool that is plausibly representative of an important target population, and (except for service supervision) the intervention reflects much of the reality of providing such services in ordinary settings of care. The study can also provide preliminary information about the cost of the intervention and the cost savings that may result from its effective implementation.

The methodology for cost-outcome studies is a combination of cost-effectiveness methods from health economics, mental health epidemiology methods, and randomized controlled trial methods, applied to intervention processes that sometimes take place over several years. The merger seems to require ambitious and expensive multidisciplinary studies, but that conclusion is probably incorrect. The new investigator can work in this area by learning the methods and gathering appropriate consultants and collaborators. A workable state of the art is reasonably well advanced and accessible (Drummond et al., 1997; Hargreaves et al., 1997; Knapp, 1995; Sloan, 1995; Yates, 1995).

3.11.4 SERVICE SYSTEM RESEARCH

Research on local mental health service systems examines the effects of organization, financing, and service delivery mechanisms on utilization of services, costs, outcomes, cost-effectiveness, and community prevalence. The objects of study, mental health service systems and community outcomes, are of broader scope than phenomena affecting the relative cost-effectiveness of a circumscribed group of services for a specific problem or disorder. Therefore, cost-outcome research on mental health systems draws on economics, organizational sociology, epidemiology, and political science for explanatory theory and method. A mental health system serves an array of target populations whose service needs differ, yet these target populations also compete for many of the same services. Therefore it is reasonable to view system impact as its effect on a value-weighted sum of its impact on the prevalence of the array of mental disorders and their resulting impairments.

The ultimate outcomes in service system research are the prevalence and disability consequences of various mental disorders in the population. Intermediate outcomes include demand for mental health services, access and use of those services, and outcomes of those services. More broadly, the ultimate outcome is seen as the quality of the life of all members of the community, to the extent that it is affected by psychiatric disorders. Mental health systems and system managers work within a political, fiscal, and organizational context to focus service resources on high priority target populations so that the overall cost-effectiveness or cost-utility of the system is maximized.

Service system utility implies value judgments not only about trade-offs in outcomes for a particular target population, but trade-offs in the value of higher levels of functioning in one target population as against another. For example, after many years during which community programs were focused on improving the level of functioning of clients and families with serious life problems but who were not severely and persistently mentally ill, these values were called into question, and service programs shifted their emphasis to the latter group. While many factors influenced this change (Mechanic, 1987), including a rise in the relative cost of hospital care, the change also reflected an implicit value judgment that serving the severely and persistently mentally ill in the community was of overriding importance to the overall utility of the mental health service system. This implicit utility judgement could never be examined explicitly, because the complexity of the problem exceeded capacity to study it. An effort to identify the parameters of this problem may clarify the advances in theory and method needed to address such issues.

The system dynamics of prevalence, demand, utilization, and outcome can be seen as interacting with the service process, which in turn interacts with the system structure. Process and structure subsystems may experience innovation or innovation may be introduced experimentally. In either case, innovation can give the investigator an opportunity to learn something about causal phenomena in the system.

A simplistic view of system dynamics (Figure 3) is that different levels of prevalence cause different levels of service seeking behavior, that service seeking leads to utilization, that utilization influences outcomes for individuals, and these outcomes change prevalence and disability. Some variables of interest in each of these subsystems can illustrate the possibility for more realistic models.

Prevalence is used here to include not only the local prevalence of persons with psychiatric disorders and subgroups of such persons, but the distribution within these populations of living independence/freedom, productivity, homelessness, life satisfaction, public safety, and the societal cost of illness. These valued

[Path diagram: local service system innovation, system structure, service process, prevalence, service seeking, utilization, and outcome, connected by causal paths C1 to C9; the diagram also carries the label "system process".]

Figure 3 Service systems studies.

conditions and costs define the momentary utility of the human service system with regard to persons with psychiatric disorders. This utility can be influenced by effective human services. It can also be influenced by migration, system induced or otherwise. An innovation that incurs some marginal cost can be evaluated by comparing this cost to the change in utility that one is able to attribute to the innovation.

Priorities for various target populations and levels of disability are an unavoidable part of any definition of system utility. An egalitarian society aims to maximize cost utility without differential weighing of gender, age, socioeconomic status, ethnic group, residence location, or reason for disability. While real-world policy is formulated from narrower perspectives, it is useful for science to identify the larger impact of narrow policies. In the short term more modest research goals should be sought. Practical compromises may include restricting study samples to persons known by the service system to be in the target population, and considering only expenditures within the mental health system. This is a narrower perspective on target populations and cost than suggested for cost-outcome studies, in order to allow greater research investment to be focused on measuring a larger variety of intermediate effects and associated causes.

Service seeking results from experienced need and the lack of social supports (Alegria et al., 1991; Freeman et al., 1992), and is affected independently by economic conditions and job insecurity (Catalano, Rook, & Dooley, 1986). The need may be experienced by the ill person, by family members, or as a community response to illness-related disruptive behavior. Service seeking produces demand on each type of service provided in the system or seen as needed by each target population subgroup. There may be a pattern of over-capacity demand experienced by specific services, with demand-buffering patterns and consequent generation of secondary demand, that is, inappropriately high levels of care and lack of access to desired services. Cost-outcome and service system research should inform both providers and consumers about the relative value of alternative mental health services, enabling them to make informed choices about which services to supply or to consume.

Utilization is the actual amount of each type of service consumed by various subgroups of persons in the target population, and in some cases by other clients who compete for the same services. It is this consumption and its effects on outcome and cost compared to what would occur in the absence of service or with alternative services that are often the focus of research attention.

Outcome includes those changes in symptoms, functioning, life satisfaction, and public safety that result from the consumption of services. The strengthening of natural supports and the reduction of risk of relapse might be seen as relevant outcomes. At the individual level these are the same variables as the prevalence conditions. Whether outcomes lead to changes in prevalence variables depends on the proportion of the target population that consumes appropriate and effective services, and the degree of offsetting changes that may occur in the rest of the target population.

The unit of analysis may be the individual for purposes of measurement although there is greater attention to the characteristics of the population from which samples are drawn. For example, service system research often attempts to make inferences about the effects of service system characteristics on mental health outcomes in the general population. In this regard, the unit of analysis can be the geographic area served by the local service system, particularly when the focus of study is aggregate statistics such as per capita rates of violence or suicide measured over time or across regions.

The causal chain of interest includes three broad aspects of local area service systems that are common sources of independent variables in service systems research. These are system structure, service process, and system innovation.

System structure includes a study of local system expenditures and expenditure decisions, the organization governance, networks, and environments, service system size, integration, and cohesiveness and local provider reimbursement policies, benefit eligibilities, risk-sharing incentives, and managed care strategies for monitoring and controlling the delivery of care.

Service process at the system level might be viewed as consisting of two types of variables: (i) service capacity mix (how many beds or slots of various types) and the unit costs of each type; and (ii) services integration methods such as linkage case management, assertive community treatment, entry and exit criteria, and treatment guidelines. For each service program or provider in the system one might examine management style, staff morale, and adherence to system-wide service delivery standards. Adequate measures of these aspects of the mental health delivery system need to be developed, especially such dynamic processes as eligibility criteria and service practices that shift in the demand-buffering process, or transient degradation of service capacity that may occur during rapid system change.

Local service system innovation can be viewed from a research perspective more as an event than a subsystem. Innovation, and indeed any change including financial disaster and dramatic downsizing, is of interest to the investigator because change offers the opportunity to study causal effects. This is particularly true when a similar change occurs or is introduced experimentally in a number of local mental health systems, allowing better strategies for distinguishing the effect of the change from other variations in system performance.

The typical design strategies that are employed in the study of local service systems draw from economics, organizational sociology, and epidemiology, in addition to traditional clinical research and cost-outcome research. Studies of the variability in local mental health policy often involve economic analysis of utilization and expenditure patterns across counties or other local geographical areas. Often system level changes resulting from imposed changes in conditions or local innovation facilitate naturalistic experiments, pre-post designs, or time series analyses that allow one to study the effects of these system changes on service seeking, use, and outcomes. States and the federal government in the USA are also seeking to learn from "pilot" innovation using experimental or quasi-experimental approaches, in collaboration with investigators.

The Substance Abuse and Mental Health Services Administration has sponsored a multi-state experiment comparing two experimentally introduced innovations in local systems of care focused on homeless persons with dual disorders (severe mental illness and a substance use disorder). State mental health authorities were invited to propose pairs of sites to undertake a program to reduce homelessness among substance abusing persons with mental illness by increasing case management and housing supports. Both sites in each applicant state were to propose two specific innovations appropriate for their site: (i) methods for delivering augmented case finding and case management services, and (ii) methods to achieve increased system integration to enhance services for homeless persons with dual disorders. Site pairs were evaluated for their suitability for the demonstration and selected by reviewers. Selected pairs of sites received funding for initial engagement, case management, and follow-up of a comparable cohort of clients, while one site of each pair was randomly selected to receive additional "services integration" funding. Thus, this design addresses causal paths C1 to C5 and C6 to C8 in Figure 3, although it does not examine changes in the prevalence of this homeless subpopulation (C9). Preliminary results have been reported showing that the additional services integration funding augmented the effectiveness of services for the target population over and above the provision of augmented engagement and case management services (CMHS, 1996; Robert Rosenheck, personal communication, November 20, 1996).

Not all service system research requires a large budget. Some dramatic system changes occur with little warning but can be investigated retrospectively using time series methods. For example, Fenton, Catalano, and Hargreaves (1996) studied the impact of passage of Proposition 187, a ballot initiative in California mandating that undocumented immigrants be ineligible for most state-funded health services and requiring health providers to report persons suspected of being undocumented to the Immigration and Naturalization Service. Hispanic-Americans widely felt this as a racist initiative specifically directed at them. Though quickly blocked in court, publicity about the initiative and its level of support (passing in 50 of 58 counties) was hypothesized by the investigators to cause young Hispanics to avoid preventive psychiatric services (outpatient mental health treatment) putting them at risk for increased use of emergency services at some subsequent time. This hypothesis could be seen as involving causal path C6 in Figure 3, though (to anticipate) it also reaches up to causal paths D5 and D7 in Figure 4, discussed in the Policy Studies section.

relation to emergency service episodes. Low rates of outpatient initiations were correlated significantly with high levels of emergency service episodes six weeks later in the client population as a whole. Looking at emergency use by young Hispanics (controlling for use by young non-Hispanic whites), the investigators found significantly higher use of emergency services six weeks after the election. Analyses of older populations showed neither of these effects of the election. Thus, the effect was specific to the typical age range of undocumented Hispanic residents. For a relatively nontechnical introduction to time-series methods see Catalano and Serxner (1987), and for more detail see Box, Jenkins and Reinsel (1994).

Time-series analysis also can be used to test hypotheses about service system policies. In a provocative study, Catalano and McConnell (1996) tested the "quarantine" theory of involuntary commitment, which asserts that involuntary commitment of persons judged to be a danger to others reduces violent crime, a theory vigorously challenged by civil libertarians. This hypothesis could be seen as encompassing the entire causal loop in Figure 3 of paths C6 to C9.

No one has much enthusiasm for carrying out an experimental test of this theory. The investigators tested the specific hypothesis that the daily incidence of reported assaults and batteries by males is related negatively to the incidence of involuntary commitments of males as a danger to others during the preceding two days: the more men committed, the fewer the subsequent arrests. Involuntary commitments on the day of the reported assault were omitted because the direction of causality is ambiguous. Using data from the City and County of San Francisco for 303 days, the investigators carried out a time-series analysis that supported (p < .05) the conclusion that increased commitment of men judged a danger to others reduces
the incidence of assault and battery offences. The investigators analyzed data on weekly Score: paternalists, 1, civil libertarians, 0. The initiation of public outpatient mental health discussion continues. episodes and psychiatric emergency service These three examples illustrate that service contacts in the City and County of San system changes of interest may be introduced Francisco for 67 weeks before and 23 weeks intentionally (a new intervention for home- after the date of the election. Young Hispanics lessness), or arrive as unpredictable exogenous were compared to young non-Hispanic whites shocks (Proposition 187), or arise from endemic using Box-Jenkins time series methods. Initia- variations that all organizations exhibit by tion of outpatient services by young Hispanics virtue of their inability to behave with perfect dropped significantly the week of the election consistency over time (variation in the number and remained low compared to use by non- of involuntary commitments). This third type of Hispanic whites. variation is of particular interest to managers of To refine their hypothesis of a delayed effect mental health systems. They manage organiza- on emergency services use, the investigators tions to minimize this inconsistency, but examined the ªnaturalº variation in the initia- theories of rational management as well as tion of outpatient mental health services and its common sense suggest that the effort to do so Service System Research 237

Figure 4 Policy studies. [Diagram with boxes labeled National, State Conditions; National, State System Structure; National, State Policy Process; National, State Policy Initiatives; Local Conditions; Local Policy Process; and Local Service System Innovation, connected by causal paths labeled D1 to D7.]

should cost less than the benefit of greater standardization of output. Managed care is the health system's version of scientific management, and one of its consistency tools is the treatment guideline or algorithm.

Research on the impact of treatment guidelines and algorithms is an important and growing area of service systems research. Suppose an investigator wishes to test the hypothesis that new treatment guidelines improve outcomes while containing cost. The guidelines might be both restrictive (specifying when a procedure should not be provided even when sought by the client) and prescriptive (specifying when a procedure should be recommended proactively when the client fails to seek it). It is interesting to consider the design issues posed by this research goal.

To test the investigator's hypothesis for a particular disorder, the investigator needs to identify or articulate the relevant restrictive and prescriptive guidelines. Identifying potential health risks and benefits from these guidelines compared to usual practice will guide the choice of outcomes and the length of observation needed to detect each outcome.

In implementing the study the investigator needs to demonstrate that the implementation has induced optimal or near-optimal guideline adherence by health care providers, has induced it relatively rapidly, and has maintained adherence consistently over the long haul. This will usually require prompting and monitoring techniques, which are part of the cost of the intervention being tested. The investigator also will want to demonstrate that the usual-care control condition had not undergone similar practice changes. Another concern is that guidelines may alter the detection of the targeted health conditions. The investigator needs a way to determine whether detection of the target conditions is altered in either experimental or usual care. Given these possible diagnostic screening effects, the investigator has the problem of designing an unbiased way to select "tracer" patients to assay outcomes in both experimental and control conditions.

These competing threats to validity demand well-articulated hypotheses and carefully chosen design compromises. Rather than randomizing patients, an investigator may decide to randomize clinician practices or clinical sites to experimental and usual-care conditions. Cluster randomization often is used in educational research, where an innovative teaching method is randomized to teachers or classrooms (Murray et al., 1994). Thus, individual subjects are nested within the unit of randomization. This does not preclude examining individual baseline characteristics in the analysis, however. Power is limited by the number of units randomized, but error variance is also reduced (and power increased) by increasing the number of subjects within each randomization unit. For each of the primary outcome measures, the power analysis informs the investigator about trade-offs between the number of randomization units and the number of subjects within each unit. The efficiency of the design will further depend on the cost of adding units versus the cost of adding subjects within units. Feasibility and cost constraints often limit the number of randomization units to a smaller than optimal number. Under these circumstances the investigator may be able to gain power if between-unit outcome variance can be reduced by using covariates (Raudenbush, 1997).

As these considerations illustrate, the important design issues in service systems research are the same as in efficacy and cost-outcome research. Intervention effects are most easily understood when interventions are introduced experimentally with some form of random assignment. Multistep causal processes can be studied most effectively with a clear theory about intermediate causal pathways and effects, and ways to demonstrate that expected intermediate effects are produced and potential confounding effects are avoided.

3.11.5 POLICY RESEARCH

National and state reform of the organization and finance of mental health systems is a major research issue in mental health economics. Financing processes and incentives are changing rapidly. Local government is being given increasing responsibility for providing mental health services for persons with severe mental illness, either through direct operation of the services or through contracting with the private sector. Mental health services research is examining the impact of these economic forces and changes at both the national and local levels. Figures 3 and 4 illustrate the linkage between national and state policy decisions and their effects on local service system innovation and adaptation. Research focuses on the impact of these changes on access to care, organizational adaptation, outcomes for the most vulnerable target populations, costs, and the changes in cost burden to various payers.

Figure 4 suggests that policy initiatives by national and state government agencies offer a useful research focus for the mental health services investigator. Examining the link between national/state initiatives and the cost-effectiveness of services at the client level involves "long-chain" causal paths in the extreme. Therefore, research designs need to be thoughtful and realistic. Some research questions may focus simply on the link from national/state policy initiatives to local service system innovation. This research strategy will be most defensible when a national policy aims to encourage specific types of local innovation and previous services research has demonstrated the value of those specific innovations in improving the cost effectiveness of services.

Political scientists and organizational sociologists may also discover useful links among "national/state conditions," "national/state system structure," and the "national/state policy process" that generate policy initiatives. Such investigators may also fruitfully examine the comparable processes at the local level that influence local response to outside policy initiatives.

A third strategy is to take a national or state policy initiative, such as a state decision to enter into capitated contracts with local providers or managed care companies, and treat it as a case example in policy research, but to carry out an experimental or quasi-experimental study across local treatment systems of the impact on client access, client outcomes, and other endpoints. State adoption of managed care and capitation contracting in providing publicly funded mental health services in the USA is being evaluated in several states using this type of design.

Thus, the typical outcomes of national and state policy research are changes in local organization or financing of mental health services. It is also important, although more difficult, to determine the effect of national and state policy initiatives on the prevalence and consequences of mental disorder in the general population.

The causal chain of interest includes the effects of national and state conditions, system structure, and policy processes on those characteristics of local service systems thought to be related to their cost-effectiveness. The independent variables of causal interest to national and state policy research bear directly on local service systems as shown by the intersection of Figures 3 and 4 (C1 in Figure 3 and D5, D6, and D7 in Figure 4).
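The trade-off noted in the discussion of cluster randomization, between the number of randomization units and the number of subjects within each unit, can be made concrete with a small sketch. It assumes the common textbook design-effect approximation for equal-sized units, 1 + (m - 1) x ICC, where m is the number of subjects per unit and ICC is the intraclass correlation. This approximation is not taken from the chapter, and the function names and the ICC value of 0.05 are purely illustrative assumptions.

```python
# Illustrative sketch only: the design-effect approximation for equal-sized
# clusters. Function names and the ICC value are hypothetical assumptions.


def design_effect(subjects_per_unit: int, icc: float) -> float:
    """Variance inflation due to clustering: 1 + (m - 1) * icc."""
    return 1.0 + (subjects_per_unit - 1) * icc


def effective_sample_size(units: int, subjects_per_unit: int, icc: float) -> float:
    """Number of independent subjects that would carry equivalent information."""
    total_subjects = units * subjects_per_unit
    return total_subjects / design_effect(subjects_per_unit, icc)


if __name__ == "__main__":
    ICC = 0.05  # hypothetical intraclass correlation, chosen for illustration
    # Baseline design, then doubling subjects per unit, then doubling units.
    for units, per_unit in [(10, 20), (10, 40), (20, 20)]:
        n_eff = effective_sample_size(units, per_unit, ICC)
        print(f"{units} units x {per_unit} subjects: effective N = {n_eff:.1f}")
```

Under these assumed values, doubling the subjects per unit raises the effective sample size from roughly 103 to roughly 136, whereas doubling the number of units raises it to roughly 205. This is one way to see why power is limited mainly by the number of units randomized, and why reducing between-unit variance with covariates can help when units are scarce.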

National/state conditions that influence the policy process include such variables as political stability, economic conditions, human resources mix, and cultural conditions such as family and community caretaking traditions. Corresponding factors presumably make up some of the relevant local conditions that influence local policy.

System structure at the national/state level includes such variables as centralization of control and funding, administrative standards, level of funding from sources above the local tax base, eligibility and benefit design, financing incentives for local innovation such as supply-side and demand-side payment incentives. In considering the economic concept of demand-side incentive as affecting local innovation behavior, copayment requirements imposed on persons with psychiatric illnesses and their families, might be included, and also copayment requirements that funding initiatives may impose on the local tax base.

Policy process at all levels includes such variables as leadership, research, and advocacy. Advocacy may come from research and service professionals, from industry, and from consumers and family members. Advocacy can be an important pathway from research findings to policy initiatives.

Mental health system reform in the USA, and specifically the introduction of capitated financing and competitive contracting for the provision of publicly funded mental health care for the severely and persistently mentally ill by many states, is of great interest (Brach, 1995). Several investigators are examining state policy initiatives such as carve-out capitation in mental health Medicaid, interventions related to causal paths D5 to D7 in Figure 4 (e.g., Christianson & Gray, 1994; Christianson & Osher, 1994; Dickey et al., 1996). Christianson, Gray, Kihlstrom, and Speckman (1995a) are studying Utah's prepaid mental health plan and have reported preliminary findings (Christianson et al., 1995b). There is also a study of Colorado's Medicaid carve-out capitated contracting for mental health services (Bloom et al., unpublished). Studies of state Medicaid innovations typically take advantage of a sequential conversion of different regions of a state from fee-for-service reimbursement to capitation contracts to study as a quasi-experiment.

In Colorado the state initially contracted with a for-profit contractor for coverage of a region of the state containing about one-third of the population, contracted with several existing community mental health centers to cover another third of the state's population, and delayed conversion in the rest of the state. Research assessments focus on organizational innovation and adaptation, service utilization, access to services, and individual client outcomes using a pre-post design in each of the three regions. Studies are in progress in several other states. Studies like this of statewide mental health system reform are of necessity relatively large projects, but state mental health authorities, services research investigators, and research funding agencies have joined together to mount major research efforts. By the end of the twentieth century the field will have a set of findings about cost control and outcomes that will set the policy and research agenda for the decade that follows.

As in the other areas of services research, small studies also make important contributions. Cuffel, Wait, and Head (1994) studied a 1990 innovation in the state of Arkansas in which the state transferred hospital funding and admission authority to local mental health agencies, essentially putting state hospitals into a competitive market. The investigators found, as expected, that state hospital use was reduced significantly after decentralization, although the reduction in urban areas was proportionally greater than in rural areas. Contrary to expectation, admissions were not limited to the most severely ill, disruptive, or substance-abusing patients, nor were discharged patients more likely to be readmitted. For patients treated in the state hospitals, communication between the community and the state hospital increased after decentralization.

Studies of national and state innovations such as the examples above make this aspect of mental health services research both feasible and increasingly important in guiding the evolution of mental health care.

3.11.6 CONCLUSION

Mental health services research encompasses a broad range of mental health services research projects. No one project can ordinarily study the whole system of variables at any of the three levels discussed in this Chapter. The scope of mental health services research, however, invites construction of integrated theory, especially in service system and policy domains. Such theory can lay the groundwork for a coordinated approach to empirical research. Furthermore, in spite of its great diversity, mental health services research shares a common objective to understand and improve the effectiveness and efficiency of mental health services for individual consumers and for entire target populations.

3.11.7 REFERENCES

Alegria, M., Robles, R., Freeman, D. H., Vera, M., Jimenez, A. L., Rios, C., & Rios R. (1991). Patterns of mental health utilization among island Puerto Rican poor. American Journal of Public Health, 81, 875–879.
Attkisson, C., Cook, J., Karno, M., Lehman, A., McGlashan, T. H., Meltzer, H. Y., O'Connor, M., Richardson, D., Rosenblatt, A., Wells, K., Williams, J., & Hohmann, A. A. (1992). Clinical services research. Schizophrenia Bulletin, 18, 561–626.
Bloom, J., Hu, T., Cuffel, B., Hausman, J., Wallace, N., & Scheffler, R. (Unpublished). Mental health costs under alternative capitation systems in Colorado.
Bond, G. R., McGrew, J. H., & Fekete, D. M. (1995). Assertive outreach for frequent users of psychiatric hospitals: A meta-analysis. Journal of Mental Health Administration, 22, 4–16.
Box, G. E. P., Jenkins, G. M., & Reinsel, G. (1994). Time series analysis: Forecasting and control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Brach, C. (1995). Designing capitation projects for persons with severe mental illness: A policy guide for state and local officials. Boston: Technical Assistance Collaborative.
Burns, B. J., & Santos, A. B. (1995). Assertive community treatment: An update of randomized trials. Psychiatric Services, 46, 669–675.
Catalano, R. A., & McConnell, W. (1996). A time-series test of the quarantine theory of involuntary commitment. Journal of Health and Social Behavior, 37, 381–387.
Catalano, R., Rook, K., & Dooley, D. (1986). Labor markets and help-seeking: A test of the employment security hypothesis. Journal of Health and Social Behavior, 27, 277–287.
Catalano, R., & Serxner, S. (1987). Time series designs of potential interest to epidemiologists. American Journal of Epidemiology, 126, 724–731.
Center for Mental Health Services. (1996). Second year interim status report on the evaluation of the ACCESS demonstration program. (Vol. I., Summary of Second year findings). Rockville, MD: Center for Mental Health Services.
Chamberlain, R., & Rapp, C. A. (1991). A decade of case management: A methodological review of outcome research. Community Mental Health Journal, 27, 171–188.
Christianson, J. B., & Gray, D. Z. (1994). What CMHCs can learn from two states' efforts to capitate Medicaid benefits. Hospital and Community Psychiatry, 45, 777–781.
Christianson, J. B., Gray, D. Z., Kihlstrom, L. C., & Speckman, Z. K. (1995a). Development of the Utah Prepaid Mental Health Plan. Advances in Health Economics and Health Services Research, 15, 117–135.
Christianson, J. B., Manning, W., Lurie, N., Stoner, T. J., Gray, D. Z., Popkin, M., & Marriott, S. (1995b). Utah's prepaid mental health plan: The first year. Health Affairs, 14, 160–172.
Christianson, J. B., & Osher, F. C. (1994). Health maintenance organizations, health care reform, and persons with serious mental illness. Hospital and Community Psychiatry, 45, 898–905.
Cuffel, B. J., Wait, D., & Head, T. (1994). Shifting the responsibility for payment for state hospital services to community mental health agencies. Hospital and Community Psychiatry, 45, 460–465.
Dickey, B., Normand, S. L., Norton, E. C., Azeni, H., Fisher, W., & Altaffer, F. (1996). Managing the care of schizophrenia: Lessons from a 4-year Massachusetts Medicaid study. Archives of General Psychiatry, 53(10), 945–952.
Drummond, M., Brandt, A., Luce, B., & Rovira, J. (1993). Standardizing methodologies for economic evaluation in health care. International Journal of Technology Assessment in Health Care, 9, 26–36.
Drummond, M. F., Stoddard, G. L., & Torrance, G. W. (1997). Methods for the economic evaluation of health care programmes (2nd ed.). Oxford, UK: Oxford University Press.
Essock, S. M., Hargreaves, W. A., Covell, N. H., & Goethe, J. (1996a). Clozapine's effectiveness for patients in state hospitals: Results from a randomized trial. Psychopharmacology Bulletin, 32, 683–697.
Essock, S. M., Hargreaves, W. A., Dohm, F-A., Goethe, J., Carver, L., & Hipshman, L. (1996b). Clozapine eligibility among state hospital patients. Schizophrenia Bulletin, 22, 15–25.
Fenton, J., Catalano, R., & Hargreaves, W. A. (1996). Effect of Proposition 187 on mental health service use in California: A case study. Health Affairs, 15, 182–190.
Freeman, D. H., Alegria, M., Vera, M., Munoz, C. A., Robles, R. R., Jimenez, A. L., Calderon, J. M., & Pena, M. (1992). A receiver operating characteristic (ROC) curve analysis of a model of mental health services use by Puerto Rican poor. Medical Care, 30, 1142–1153.
Hargreaves, W. A., & Shumway, M. (1989). Effectiveness of mental health services for the severely mentally ill. In C. A. Taube, D. Mechanic, & A. A. Hohmann (Eds.), The future of mental health services research. Rockville, MD: National Institute of Mental Health.
Hargreaves, W. A., Shumway, M., Hu, T.-W., & Cuffel, B. (1997). Cost-outcome methods for mental health. San Diego, CA: Academic Press.
Henggeler, S. W. (1991). Multidimensional causal models of delinquent behavior. In R. Cohen & A. Siegel (Eds.), Context and development (pp. 211–231). Hillsdale, NJ: Erlbaum.
Henggeler, S. W., Melton, G. B., & Smith, L. A. (1992). Family preservation using multisystem therapy: An effective alternative to incarcerating serious juvenile offenders. Journal of Consulting and Clinical Psychology, 60, 953–961.
Knapp, M. R. J. (Ed.) (1995). The economic evaluation of mental health care. Aldershot, UK: Ashgate.
Lalley, T. L., Hohmann, A. A., Windle, C. D., Norquist, G. S., Keith, S. J., & Burke, J. D. (1992). Caring for people with severe mental disorders: A national plan to improve services. Schizophrenia Bulletin, 18, 559–560.
Mechanic, D. (1987). Correcting misperceptions in mental health policy: Strategies for improved care of the seriously mentally ill. Milbank Quarterly, 65, 203–230.
Mechanic, D., Bevilacqua, J., Goldman, H., Hargreaves, W., Howe, J., Knisley, M., Scherl, D. J., Stuart, G., Unhjem, M. B., & Lalley, T. L. (1992). Research resources. Schizophrenia Bulletin, 18, 669–696.
Murray, D., McKinlay, S., Martin, D., Donner, A., Dwyer, J., Raudenbush, S., & Graubard, B. (1994). Design and analysis issues in community trials. Evaluation Review, (August), 493–514.
National Institute of Mental Health. (1991). Caring for people with severe mental disorders: A national plan of research to improve services. DHHS Pub. No. (ADM)91–1762. Washington, DC: US Government Printing Office.
Olfson, M. (1990). Assertive community treatment: An evaluation of the experimental evidence. Hospital and Community Psychiatry, 41, 634–641.
Randolph, F. L. (1992). NIMH funded research demonstration of (P/ACT) models. Outlook, 2(2), 9–12.
Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomization trials. Psychological Methods, 2, 173–185.
Rubin, A. (1992). Is case management effective for people with serious mental illness? A research review. Health and Social Work, 17, 138–150.
Salive, M. E., Mayfield, J. A., & Weissman, N. W. (1990). Patient outcomes research teams and the Agency for Health Care Policy Research. Health Services Research, 25, 697–708.
Santos, A. B., Henggeler, S. W., Burns, B. J., Arana, G. W., & Meisler, N. (1995). Research on field-based services: Models for reform in the delivery of mental health care to populations with complex clinical problems. American Journal of Psychiatry, 152, 1111–1123.
Schooler, N. R., Keith, S. J., Severe, J. B., & Matthews, S. (1989). Acute treatment response and short term outcome in schizophrenia: First results of the NIMH Treatment Strategies in Schizophrenia study. Psychopharmacology Bulletin, 25, 331–335.
Schooler, N. R., Keith, S. J., Severe, J. B., Matthews, S. M., Bellack, A. S., Glick, I. D., Hargreaves, W. A., Kane, J. M., Ninan, P. T., Frances, A., Jacobs, M., Lieberman, J. A., Mance, R., Simpson, G. M., & Woerner, M. (1997). Relapse and rehospitalization during maintenance treatment of schizophrenia: The effects of dose reduction and family treatment. Archives of General Psychiatry, 54, 453–463.
Scott, J. E., & Dixon, L. B. (1995). Assertive community treatment and case management for schizophrenia. Schizophrenia Bulletin, 21, 657–691.
Sloan, F. (Ed.) (1995). Valuing health care: Costs, benefits, and effectiveness of pharmaceuticals and other medical technologies. Cambridge, UK: Cambridge University Press.
Solomon, P. (1992). The efficacy of case management services for severely mentally disabled clients. Community Mental Health Journal, 28, 163–180.
Stein, L., & Test, M. A. (1980). Alternative to mental hospital treatment: I. Conceptual model, treatment program, and clinical evaluation. Archives of General Psychiatry, 37, 392–397.
Steinwachs, D. M., Cullum, H. M., Dorwart, R. A., Flynn, L., Frank, R., Friedman, M. B., Herz, M. I., Mulvey, E. P., Snowden, L., Test, M. A., Tremaine, L. S., & Windle, C. D. (1992). Service systems research. Schizophrenia Bulletin, 18, 627–668.
Test, M. A. (1992). Training in community living. In R. P. Liberman (Ed.), Handbook of Psychiatric Rehabilitation. Boston: Allyn and Bacon.
Test, M. A., & Stein, L. (1980). Alternative to mental hospital treatment: III. Social cost. Archives of General Psychiatry, 37, 409–412.
Weisbrod, B. A. (1983). A guide to benefit-cost analysis, as seen through a controlled experiment in treating the mentally ill. Journal of Health Politics, Policy and Law, 7, 808–845.
Weisbrod, B., Test, M. A., & Stein, L. (1980). Alternative to mental hospital treatment: II. Economic benefit cost analysis. Archives of General Psychiatry, 37, 400–405.
Yates, B. T. (1995). Cost-effectiveness analysis, cost-benefit analysis, and beyond: Evolving models for the scientist–manager–practitioner. Clinical Psychology: Science and Practice, 2, 385–398.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.12 Descriptive and Inferential Statistics

ANDREW C. LEON Cornell University Medical College, New York, NY, USA

3.12.1 INTRODUCTION
3.12.2 TABULAR AND GRAPHIC DISPLAYS
3.12.3 DESCRIPTIVE STATISTICS
  3.12.3.1 Measures of Central Tendency
  3.12.3.2 Measures of Variability
  3.12.3.3 Describing a Frequency Distribution
    3.12.3.3.1 Skewness
    3.12.3.3.2 Kurtosis
  3.12.3.4 Other Graphic Displays
  3.12.3.5 Statistical Software
3.12.4 INFERENTIAL STATISTICS
  3.12.4.1 One- vs. Two-tailed Tests
  3.12.4.2 Parametric vs. Nonparametric Tests
  3.12.4.3 Between-subject Designs: Dimensional and Ordinal Dependent Variables
    3.12.4.3.1 t-test
    3.12.4.3.2 Mann–Whitney test
    3.12.4.3.3 One-way analysis of variance
    3.12.4.3.4 ANOVA: Post hoc tests
    3.12.4.3.5 Kruskal–Wallis test
    3.12.4.3.6 Factorial analysis of variance
    3.12.4.3.7 Hotelling's T²
    3.12.4.3.8 Multivariate analysis of variance
  3.12.4.4 Linear Relations
    3.12.4.4.1 Pearson correlation coefficient
    3.12.4.4.2 Test for a difference between two independent correlation coefficients
    3.12.4.4.3 Spearman rank correlation
    3.12.4.4.4 Simple linear regression analysis
    3.12.4.4.5 Multiple linear regression analysis
  3.12.4.5 Between-subject Designs: Categorical Dependent Variables
    3.12.4.5.1 Chi-square test
    3.12.4.5.2 Fisher's Exact test
    3.12.4.5.3 Logistic regression
    3.12.4.5.4 Survival analysis
  3.12.4.6 Within-subject Designs: Dimensional and Ordinal Dependent Variables
    3.12.4.6.1 Paired t-test
    3.12.4.6.2 Wilcoxon signed-rank test
    3.12.4.6.3 Repeated measures ANOVA
  3.12.4.7 Within-subject Designs: Categorical Dependent Variables
    3.12.4.7.1 McNemar test
3.12.5 SUMMARY
3.12.6 APPENDIX: SPSS COMMAND LINES
  3.12.6.1 Descriptive Statistics
  3.12.6.2 Inferential Statistics
    3.12.6.2.1 Between-subject analyses
    3.12.6.2.2 Within-group analyses
3.12.7 REFERENCES

3.12.1 INTRODUCTION

A fundamental motivation in science is to understand individual differences. It is deviation and variability that stimulate scientific curiosity. When do behavioral scientists study constants? One is not compelled to study differences in clinical psychology students who did and did not experience infancy. Instead a comparison of clinical psychology students who do and do not remember parental affect expression in infancy might be compelling. Similarly it is differences among scores on an achievement test that are exciting, not the proportion of psychology graduate students who took the Graduate Record Examination.

Psychological research is conducted to understand individual differences. In reporting research results, data are summarized with statistical analyses. In choosing among appropriate statistical procedures, several issues must be considered. First, are the results to be used strictly to describe one sample or will they also be used to draw inferences about the population from which the sample was drawn? Descriptive statistics are used for the former purpose, whereas inferential statistics are used for the latter. Here several of the more commonly used procedures are discussed. The choice among statistical procedures is also guided by the distribution, the sample size, and of course, the research question. This chapter gives an overview of those statistical procedures. It is not meant to provide a comprehensive explication of each technique, but instead to provide a guide to the selection, implementation, and interpretation of appropriate procedures. References which provide comprehensive discussions of the procedures are cited in each section.

Descriptive statistics are used to summarize and describe data from a sample. Measures of central tendency and variability are fundamental. The former include quantities that represent the typical data from the sample, whereas the latter provide information on the dispersion of data within a sample. Data from a hypothetical clinical trial comparing the effects of two forms of psychotherapy for depression, interpersonal therapy (IPT) and cognitive-behavioral therapy (CBT), will be used repeatedly in this chapter. Table 1 presents only the data from the CBT group. Taking pedagogical license, the structure of the data will be changed to illustrate a variety of statistical procedures in this chapter. Throughout the chapter $X_i$ will refer to an individual observation of the variable, X. For example, $X_{17}$ represents the X value for subject number 17 (where i is each subject's arbitrarily assigned, unique identifying number, which ranges from 1 to N, in a study of N subjects).

Table 1 Hamilton Rating Scale for Depression (HRSD: $X_i$) for 20 patients (CBT group).

Subject number   X_i
 1               12
 2               12
 3               13
 4               14
 5               15
 6               17
 7               18
 8               19
 9               21
10               21
11               21
12               22
13               24
14               25
15               26
16               27
17               27
18               28
19               29
20               29

3.12.2 TABULAR AND GRAPHIC DISPLAYS

There are a variety of tabular displays used to summarize data. Hamilton Rating Scale for Depression (HRSD) data will be used for illustration. Initially a frequency distribution is constructed by counting the number of occurrences, or frequency, of each value of the variable ($X_i$). In this case, $X_i$ represents the HRSD rating for subject i, where i ranges from 1 to N. Five columns are included in the frequency distribution that is displayed in Table 2. The first column displays the values of $X_i$, in ascending order. The second column displays

the frequency for each value of $X_i$. The third column shows the number of subjects that have a value of $X_i$ or lower, which is referred to as the cumulative frequency. In the fourth column the percentage of subjects with each value of $X_i$, or relative frequency, is displayed. In the fifth column, the cumulative percentage, or cumulative relative frequency, is shown, which is the percentage of subjects with a value less than or equal to $X_i$. Two column totals (sums, which are symbolized $\Sigma$) are presented. In the second column, the total of the frequencies is displayed, the value of which should be equal to the sample size (N). The total of the fourth column should equal 100%. Other cell entries worth noting include the final number in the column of cumulative frequencies, which should equal N, and the final entry in the cumulative relative frequencies column, which should be equal to 100%. Finally, the number of values of $X_i$ provides an initial indication of its variability.

Table 2 Frequency distribution of HRSD: CBT group.

X_i     f    cum(n)    %    cum(%)
12      2     2       10     10
13      1     3        5     15
14      1     4        5     20
15      1     5        5     25
17      1     6        5     30
18      1     7        5     35
19      1     8        5     40
21      3    11       15     55
22      1    12        5     60
24      1    13        5     65
25      1    14        5     70
26      1    15        5     75
27      2    17       10     85
28      1    18        5     90
29      2    20       10    100
Total  20            100

A graphic display of the contents of any of the last four columns can be informative. In these graphs, the x-axis of the graph represents the $X_i$ values and the frequency data are represented on the y-axis. These graphs provide the reader with a sense of the distribution of the data at a glance. For instance, Figure 1 is a bar chart, or histogram, displaying the frequencies of each X value. (If there were no variability, only one bar would be displayed.) Figure 2 displays the relative frequencies. Figure 3 displays the cumulative frequency distribution. Finally, the cumulative relative frequencies of the HRSD ratings are displayed in Figure 4. Although Figures 1 and 3 display sample sizes, the proportions presented in Figures 2 and 4 can be readily compared across samples of diverse sizes.

3.12.3 DESCRIPTIVE STATISTICS

Descriptive statistics are summary measures that describe characteristics of data. The two general areas of descriptive statistics that will be discussed here are measures of central tendency and measures of variability. The data that were originally shown in Table 1 are now displayed in Table 3, with additional information that will be used for calculations throughout this section.

Table 3 HRSD: CBT group (N = 20).

X_i     X_i^2    X_i - X̄    (X_i - X̄)^2
12      144      -9          81
12      144      -9          81
13      169      -8          64
14      196      -7          49
15      225      -6          36
17      289      -4          16
18      324      -3           9
19      361      -2           4
21      441       0           0
21      441       0           0
21      441       0           0
22      484       1           1
24      576       3           9
25      625       4          16
26      676       5          25
27      729       6          36
27      729       6          36
28      784       7          49
29      841       8          64
29      841       8          64
Total 420  9460   0         640

3.12.3.1 Measures of Central Tendency

The arithmetic mean ($\bar{X}$) of a sample, commonly referred to as the average, is equal to the sum of all observations in the sample, $\sum_{i=1}^{N} X_i$, divided by the number of observations (N) in the sample.

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{420}{20} = 21$$

The median, Md, is the middle observation in a sample of ordered data (i.e., ranked from lowest to highest value of $X_i$). In the case of ordered data, the subscript represents the ascending rank, ranging from 1 to N. With an

Figure 1 Plot of frequency distribution of HRSD: CBT group. [x-axis: HRSD; y-axis: Frequency]

Figure 2 Plot of relative frequencies of HRSD: CBT group. [x-axis: HRSD; y-axis: Relative frequency]

Figure 3 Plot of cumulative frequencies of HRSD: CBT group. [x-axis: HRSD; y-axis: Cumulative frequency]
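The five columns of the frequency distribution can be rebuilt directly from the raw ratings. The following is a minimal Python sketch, not part of the chapter's SPSS appendix; the function name `frequency_distribution` and the variable names are illustrative assumptions, and the data are the 20 CBT ratings from Table 1.

```python
from collections import Counter

# HRSD ratings for the 20 CBT patients, copied from Table 1.
hrsd_cbt = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21,
            21, 22, 24, 25, 26, 27, 27, 28, 29, 29]


def frequency_distribution(values):
    """Return rows of (value, f, cumulative f, %, cumulative %).

    Hypothetical helper for illustration only.
    """
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):  # ascending order, as in the first column
        f = counts[value]
        cumulative += f
        rows.append((value, f, cumulative, 100 * f / n, 100 * cumulative / n))
    return rows


table = frequency_distribution(hrsd_cbt)
print(" X_i   f  cum(n)    %  cum(%)")
for value, f, cum_f, pct, cum_pct in table:
    print(f"{value:>4} {f:>3} {cum_f:>7} {pct:>4.0f} {cum_pct:>7.0f}")
```

The checks described in the text apply to the result: the frequencies should sum to N, the last cumulative frequency should equal N, and the last cumulative percentage should equal 100.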

even number of observations, the median is the mean of the two middle observations. For instance, there are 20 observations in the sample presented in Table 1. Thus the median of the sample is the mean of the middle two values, the 10th and 11th ordered observations:

$$Md = \frac{X_{10} + X_{11}}{2} = \frac{21 + 21}{2} = 21$$

Figure 4 Plot of cumulative relative frequencies of HRSD: CBT group. [x-axis: HRSD; y-axis: Cumulative relative frequency]

The median represents the fiftieth percentile, the point below which half of the data fall. For example, in Table 1, a value of 21 is the median. When the data include a very small number of extreme values, the data set is referred to as skewed (discussed in greater detail below). With skewed data, the median is a more representative measure of central tendency than the mean because the mean is influenced by extreme values. For instance, the median is a more appropriate measure of central tendency of US household income than the mean.

The mode, Mo, is the most frequently occurring value. The modal value can be determined by examining the frequency distribution, which is presented in Table 2, and selecting the $X_i$ with the largest value of $f_i$:

$$Mo = 21 \quad (f = 3)$$

3.12.3.2 Measures of Variability

This chapter began by stating that it is an understanding of individual differences that motivates a great deal of behavioral research. A wide range of statistical procedures are used to analyze differences. Each of the procedures that is discussed in this chapter has to do with differences: individual or group; temporal or conditional. Here the quantities used to assess differences, or variability, are discussed. The crudest measure of variability is the range, which is the difference between the maximum and minimum value in the data. For instance, in Table 1:

$$\text{Range} = \text{Maximum} - \text{Minimum} = X_N - X_1 = 29 - 12 = 17$$

There are several quantities that provide a more comprehensive description of the variability in data. Most are derived from the Sum of Squares (SS), which is the sum of squared deviations from the mean. The squared deviations are presented in the right-hand column of Table 3. Note that one of the properties of the mean is that the sum of the deviations from the mean is equal to zero:

$$\sum_{i=1}^{N} (X_i - \bar{X}) = 0$$

Notice that the sum of the deviations from the mean in Table 3 is equal to zero. However, the sum of squared deviations from the mean is a useful quantity:

$$SS = \sum_{i=1}^{N} (X_i - \bar{X})^2 = 640$$

The SS can be used for calculating the variance ($s^2$), which is approximately the mean of the squared deviations from the mean. The variance, a quantity that is commonly used to convey information regarding variability, is calculated as $\frac{SS}{N-1}$, and not $\frac{SS}{N}$, because the former provides an unbiased estimator of the population variance. The reasons for this are not discussed here. Note that, as the sample size increases, the difference between $\frac{SS}{N-1}$ and $\frac{SS}{N}$ becomes trivial. The variance is calculated as follows:

$$s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1} = \frac{640}{20 - 1} = 33.684$$

Or, if the SS is available:

$$s^2 = \frac{SS}{N - 1} = \frac{640}{20 - 1} = 33.684$$

The standard deviation (s), a fundamental measure of variability, is equal to the square root of the variance. It is approximately equal to the average deviation from the mean. It is approximately equal because the deviations have been squared, summed, and divided by N - 1, not by N, before the square root is computed. Note that by taking the square root, the original unit of measurement is restored.

$$s = \sqrt{s^2} = \sqrt{33.684} = 5.804$$

3.12.3.3 Describing a Frequency Distribution

Modality has to do with the number of modes contained within a frequency distribution. The HRSD data in Figure 1 have one mode and are thus referred to as unimodal. The most commonly occurring value in this sample is 21. There are three observations equal to 21 and there is no other value that is represented as often as 21. Because the y-axis represents the frequency, the mode is the highest point in the plot of the frequency distribution in Figure 1. In contrast to these data, some frequency distributions are bimodal, having two peaks, or multimodal, having more than two peaks.

3.12.3.3.1 Skewness

If one folds the distribution vertically in the middle and one half can be superimposed on the other, the distribution is symmetric. The extent to which a frequency distribution diverges from symmetry is described as skewness. With help from Bill Gates and a few of his peers, incomes in the United States are positively skewed. The skewness, which is one aspect of the departure from symmetry, is described as either skewed to the right (positively skewed), as shown in Figure 5, or skewed to the left (negatively skewed), as shown in Figure 6. For example, with a different set of HRSD data, a positively skewed distribution is one in which there are a very few severely depressed scores, whereas a distribution in which there are a very few euphoric scores might be negatively skewed.

3.12.3.3.2 Kurtosis

Kurtosis has to do with the extent to which a frequency distribution is peaked or flat. A normal bell-shaped distribution is referred to as a mesokurtic distribution. An example of this, a nicely rounded distribution, is shown in Figure 7. Many human traits are normally distributed, including height and intelligence. A platykurtic distribution, like a platypus, is somewhat flat. For instance, as shown in Figure 8, this would describe data in which there are several modes, which cluster around, and include, the median. A leptokurtic distribution is more peaked. In Figure 9 hypothetical data show that such data would be characterized by a very high, unimodal peak at the median.

3.12.3.4 Other Graphic Displays

Two other formats for graphic display are the Stem and Leaf and the Box and Whisker plots. These graphic displays, which are described in detail by Tukey (1977), each provide the researcher with a great deal of information about the distribution of the data. The importance of examining such plots cannot be overemphasized. One outlier (i.e., extreme score) in a data set can markedly alter the mean and exaggerate trivial group differences. The existence of outliers is most readily apparent in a good graph.

A Stem and Leaf plot is an efficient method of displaying a frequency distribution. In Figure 10, the vertical line, called the stem, displays the possible values, ranging from the minimum (at the top) to maximum value (at the bottom). Each leaf represents an individual with a particular value. A leaf could be an "x," an asterisk, or more commonly the value itself. A data value (on the stem) with several leaves is a frequently occurring value. Of course, the mode has the most leaves.

Box and Whisker plots are based on five points as shown in Figure 11. The first three of those have been discussed: the median, the minimum, and maximum. The final two points of the box and whisker plot are the values that fall midway between each extreme score and the median. These are the first and third quartiles. A long thin box is drawn with the latter two points serving as the hinges. A horizontal bar is drawn through the median. Whiskers are drawn at the two extreme values. The figure illustrates how easily outliers are detected in a Box and Whisker plot. In some situations, the plot is more informative when an identifier is used to label values such as an outlier.

When an outlier is identified, the researcher must determine if the value is real or if it is a keypunch error. The raw data forms will help

Figure 5 Example of positively skewed data. [x-axis: X; y-axis: Frequency]
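The measures of central tendency and variability defined in Section 3.12.3 can be verified with a few lines of code. This is a minimal Python sketch under stated assumptions: the data are the CBT ratings from Table 1, and the helper name `median` and the variable names are illustrative rather than taken from the chapter.

```python
import math
from collections import Counter

# HRSD ratings for the 20 CBT patients, copied from Table 1.
hrsd_cbt = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21,
            21, 22, 24, 25, 26, 27, 27, 28, 29, 29]


def median(values):
    """Middle observation, or the mean of the two middle ones when N is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return float(ordered[middle])
    return (ordered[middle - 1] + ordered[middle]) / 2


n = len(hrsd_cbt)
mean = sum(hrsd_cbt) / n                       # sum of X_i divided by N
md = median(hrsd_cbt)
mode, mode_frequency = Counter(hrsd_cbt).most_common(1)[0]
value_range = max(hrsd_cbt) - min(hrsd_cbt)    # maximum minus minimum

deviations = [x - mean for x in hrsd_cbt]      # these sum to zero
ss = sum(d ** 2 for d in deviations)           # sum of squares (SS)
variance = ss / (n - 1)                        # divide by N - 1, not N
sd = math.sqrt(variance)                       # back in HRSD units

print(f"mean={mean}, median={md}, mode={mode} (f={mode_frequency})")
print(f"range={value_range}, SS={ss}, variance={variance:.3f}, sd={sd:.3f}")
```

The printed values can be compared with the totals in Table 3 and with the worked equations for the variance and standard deviation.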

one correct keypunch errors. However, extreme scores are not necessarily incorrect values. In a study of depression, an HRSD score of 58 is possible, but extremely rare. In analyzing data in which valid extreme values have been identified, one has several strategies to choose from. First, a nonparametric procedure could be used in which the ranks of the raw data values are used. In such a case the minimum and maximum are equal to 1 and N, respectively. Second, the data values could be transformed using either the natural logarithm, where $y' = \ln(y)$, or the square root of the original scale, $y'' = \sqrt{y}$. Third, an extreme score might be winsorized, a process in which an outlier is simply recoded to the second highest value (see Tukey, 1977). Alternative strategies for outliers are discussed in detail by Wilcox (1996).

Figure 6 Example of negatively skewed data. [x-axis: X; y-axis: Frequency]

3.12.3.5 Statistical Software

Many of the statistical procedures that are described in this chapter involve tedious calculations. This leaves a great deal of room for computational errors. For all practical purposes, most statistical analyses in psychological research are conducted using a computer. There is a wide variety of statistical software that is commercially available for personal computers. Software that is commonly used in the social sciences includes SPSS, SAS, BMDP, and SYSTAT. Comprehensive reviews of statistical software are available (Lurie, 1995; Morgan, 1998). The choice among statistical software has to do with the statistical procedures that are included, the data management capabilities, and the ease of use. Nevertheless, a

Figure 7 Example of normally distributed data. [x-axis: X; y-axis: Frequency]

Figure 8 Example of data with a platykurtic distribution. [x-axis: X; y-axis: Frequency]

Figure 9 Example of data with a leptokurtic distribution. [x-axis: X; y-axis: Frequency]
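The three strategies for valid extreme scores described in Section 3.12.3.4 (rank transformation, log or square-root transformation, and winsorizing) can be sketched as follows. This is an illustrative Python sketch only: the six scores are hypothetical, the function names `midranks` and `winsorize_max` are assumptions, and winsorizing is implemented in the simple form the text describes, recoding the top score to the second highest distinct value.

```python
import math

# Hypothetical ratings with one valid but extreme score (58).
scores = [12, 14, 15, 15, 17, 58]


def midranks(values):
    """Rank from 1 to N; tied values share the midpoint of their ranks."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    rank_of = {v: sum(p) / len(p) for v, p in positions.items()}
    return [rank_of[v] for v in values]


def winsorize_max(values):
    """Recode the highest score to the second highest distinct value.

    Assumes the data contain at least two distinct values.
    """
    second_highest = sorted(set(values))[-2]
    highest = max(values)
    return [second_highest if v == highest else v for v in values]


ranks = midranks(scores)                          # strategy 1: ranks
log_scores = [math.log(y) for y in scores]        # strategy 2a: y' = ln(y)
sqrt_scores = [math.sqrt(y) for y in scores]      # strategy 2b: y'' = sqrt(y)
winsorized = winsorize_max(scores)                # strategy 3: winsorize

print(ranks)
print(winsorized)
```

After ranking, the extreme score is only one rank above its neighbor, which is why rank-based procedures are less sensitive to outliers than procedures based on the raw values.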

statistical procedure should only be used in appropriate situations. A thorough understanding of the statistical technique, the assumptions, and a familiarity with the estimation procedure is necessary. Furthermore, the user must have the background necessary to interpret the results that are generated by the software. One of the goals of this chapter is to provide the reader with such a background for a variety of descriptive and inferential procedures. In addition, the appendix of this chapter includes SPSS command lines for each of the analyses that is discussed (SPSS, 1997). SPSS is used for illustration here because it is accessible with minimal programming skills, it covers a wide array of statistical procedures, and it has excellent data management capacity.

Figure 10 HRSD: Stem and Leaf Plot.

Frequency   Stem and Leaf
4           1* 2234
4           1. 5789
5           2* 11124
7           2. 5677899

Stem width: 10.00
Each leaf: 1 case(s)

3.12.4 INFERENTIAL STATISTICS

Once again, an inferential statistical test is used to make an inference about a population based on data from a sample. In choosing among inferential procedures, one must consider the nature of both the independent and dependent variable, whether a between-subject design or within-subject design is employed, and the sample size. The procedures that are presented below are organized to reflect such considerations. Initially statistical procedures for between-subject designs are discussed. This is followed by a discussion of procedures for within-subject designs. In each of those sections both parametric and nonparametric procedures are considered. The distinction between parametric and nonparametric procedures is discussed in detail in Section 3.12.4.2.

First, a digression into general features of inferential tests. A statistical test is used to draw an inference about a population(s) based on sample data. The test is used to examine the research question at hand. For instance, "Which of two forms of psychotherapy is more effective for the treatment of depression?" For purposes of hypothesis testing, the question is reframed into a null hypothesis (i.e., a hypothesis of no difference). For example, "There is no difference in the effects of CBT and IPT for the treatment of depression." In conducting such tests, the researcher might make a correct decision, but errors are certainly possible. Consider the correspondence between the truth and the research results (see Table 4). Assume that with omniscient (noninferential) ability, someone is able to describe the truth (e.g., population characteristics) without conducting a statistical test. Thus the omniscience. (Every department has at least one member with self-rated ability of that level.) For example, consider a clinical trial in which the post-treatment psychopathology of two groups (CBT and IPT) is being compared. Our colleague, Dr. Omniscient, could tell us whether or not the severity of psychopathology in the two groups differs. If there is no difference in the population, the correct decision would be to conclude that there is no difference (i.e., fail to reject the null hypothesis). In contrast, if there is no population difference, yet we find a difference, we would be in error. Specifically, rejecting a true null hypothesis is referred to as Type I Error. The probability of Type I Error is referred to as the alpha level ($\alpha$).

On the other hand, if Dr. Omniscient tells us that there is, in fact, a treatment difference (i.e., population treatment differences), we would be correct if we detected the difference (rejected a false null hypothesis). The probability of such a rejection is referred to as statistical power. In contrast, we would be in error if we fail to detect the treatment difference (i.e., fail to reject a false null hypothesis). This is referred to as Type II Error and p(Type II error) = $\beta$. In the absence of Dr. Omniscient's assistance, replication of results in subsequent studies provides convincing evidence that our inference was likely to be accurate.

Table 4 Cross-classification of the "truth" and research results.

Results of research          Placebo = Active (H0 true)   Placebo ≠ Active (H0 false)
No difference (CBT = IPT)    Correct decision             Type II error
Difference (CBT ≠ IPT)       Type I error                 Correct decision ("Power")

One further general digression regarding inferential statistics. In general a test statistic

is a ratio of a statistic (e.g., difference in means) to its standard error. Loosely speaking, the standard error is an estimate of the variability of the statistic in the population. Stated differently, and hypothetically, if the experiment were conducted repeatedly, say 10 000 times (with the same sample size each time), and the quantity of interest (e.g., difference between two means) were calculated in each of those experiments, the standard error would represent the standard deviation of that distribution (the sampling distribution) of that quantity (e.g., the standard deviation of the differences between two means).

As stated earlier, one component of each inferential statistical test is the null hypothesis,

Figure 11 HRSD: Box and Whisker Plot. [Variable X; N = 20; symbol key: * = median]
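The thought experiment about repeating a study 10 000 times can be imitated by simulation. The following Python sketch is purely illustrative: the population means (21 and 15), the common standard deviation (5), the normal populations, and the function name `simulate_mean_differences` are all assumptions chosen for the example, not values estimated in the chapter.

```python
import random
import statistics


def simulate_mean_differences(n_per_group, n_experiments, mu1, mu2, sigma, seed=0):
    """Repeat a two-group experiment and collect the difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    differences = []
    for _ in range(n_experiments):
        group1 = [rng.gauss(mu1, sigma) for _ in range(n_per_group)]
        group2 = [rng.gauss(mu2, sigma) for _ in range(n_per_group)]
        differences.append(statistics.fmean(group1) - statistics.fmean(group2))
    return differences


# Assumed population values, for illustration only.
n_per_group, sigma = 20, 5.0
differences = simulate_mean_differences(n_per_group, 10_000, 21.0, 15.0, sigma)

# Standard deviation of the simulated sampling distribution.
empirical_se = statistics.stdev(differences)

# Known result for two independent groups: sigma * sqrt(1/n1 + 1/n2).
theoretical_se = sigma * (1 / n_per_group + 1 / n_per_group) ** 0.5

print(f"empirical SE   = {empirical_se:.3f}")
print(f"theoretical SE = {theoretical_se:.3f}")
```

Under these assumptions the two quantities should agree closely, which is the sense in which the standard error is the standard deviation of the sampling distribution.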

Table 5 Hamilton Depression Ratings (Xi) for 20 was examined. For instance, results from a patients (IPT group). clinical trial for the treatment of patients with depression cannot be used to draw inferences 2   2 Xi Xi Xi7X (Xi7X) about the treatment of patients with panic disorder. 6367982 7497865 8647750 3.12.4.1 One- vs. Two-tailed Tests 981637 7 In some settings, the researcher might have 12 144 739 12 144 739preconceived expectations regarding the out- 12 144 739come of the experiment. For instance, it might 13 169 724be expected that a novel treatment is superior to 17 289 2 4 the standard. The researcher can then choose to 17 289 2 4 specify the null hypothesis as that of no 18 324 3 9 treatment difference or superiority of the novel 18 324 3 9 treatment. A two-tailed null hypothesis is 19 361 4 16 designed to detect any difference, positive or 19 361 4 16 negative, whereas a one-tailed null hypothesis is 19 361 4 16 19 361 4 16 designed to detect a difference in the prespeci- 19 361 4 16 fied direction (e.g., efficacy) but it would fail to 19 361 4 16 detect the opposite (e.g., detrimental) effects. It 19 361 4 16 is often as important, and ethically compelling, 19 361 4 16 to detect detrimental treatments as it is to detect efficacious treatments. A two-tailed test is used Total 301 4945 0 414.95 for a nondirectional null hypothesis and a one- tailed test is used for a directional null hypothesis. Although there is debate regarding n the exclusive use of two-tailed tests, there is a X i 301 consensus that the rationale for a one-tailed test X 1 15:05 ˆ Pn ˆ 20 ˆ must be stated a priori. That is, if a one-tailed 2 SS Xi X 414:95 test is to be used, it must be so stated prior to ˆ † ˆ study implementation and data collection. SS 414:95 s2 X 21:839 There has been an extensive discussion in the ˆ n 1 ˆ 20 1 ˆ literature of the costs and benefits of one and s ps2 p21:839 4:673 ˆ ˆ ˆ two-tailed tests (e.g., Fleiss, 1981, pp. 
27±29;   Hopkins, Glass, & Hopkins, 1987, pp. 164±165, 205±206). All examples in this chapter employ two-tailed, nondirectional tests.

H0. In fact that is one of the eight components of the hypothesis testing procedure which are 3.12.4.2 Parametric vs. Nonparametric Tests listed below: (i) State research question. Technically speaking, one must not use (ii) State two hypotheses which are mutually parametric tests such as a t-test and an ANOVA exclusive and exhaustive: unless assumptions such as normality are (a) null hypothesis (e.g., H0 : m1 = m2) fulfilled. By normality, it is assumed that the (b) alternative hypothesis sample come from a population with normally (e.g., HA : m1 = m2). distributed data. Briefly, in a normal distribu- (iii) Choose an appropriate statistical test tion,whichisbell-shaped,themean,median,and (iv) Specify the critical value for the statistical mode are equal. The distribution is symmetrical test and the alpha level (both of which are around these measures of central tendency. The discussed below). shape of a normal distribution is determined by (v) Collect data. the population means and variance. The non- (vi) Perform calculations. parametric alternatives to the t-test and the (vii) Make decision regarding H0. ANOVA are the Mann±Whitney test and (viii) Conclusion. The results are interpreted Kruskal±Wallis test. Neither of these makes as an inference about the population (which the the normality assumptions. Many of the non- sample represents) based on the sample data. parametric procedures require a simple rank The generalizability of the findings should be transformation of the data (Conover, 1980; discussed. Inferences about the results cannot Sprent, 1989). This involves pooling the data be generalized beyond the scope of the data that from all subjects, regardless of treatment group, 254 Descriptive and Inferential Statistics and ranking in ascending order based on the 2 SS1 SS2 sp ‡ † value of the dependent variable. The subject with ˆ n1 1 n2 1 the lowest value is assigned the rank of 1. 
The †‡ † subject with the next lowest is assigned the value 640:0 414:95 ‡ 27:76 of 2, and so on, until the subject with the highest ˆ 20 1 20 1 ˆ value is assigned a rank of N. If all assumptions †‡ † have been reasonably fulfilled and all other and aspects of the analysis are equal, parametric tests 2 2 generally have more statistical power than sp sp 27:76 27:76 nonparametric tests. However, if the assump- s X1 X2 1:67 † ˆ sn1 ‡ n2 ˆ 20 ‡ 20 ˆ tions are not met, parametric tests could provide r misleading results about the population which   the sample represents. X1 X2 21:00 15:05 t 3:56 ˆ s X1 X2 ˆ 1:67 ˆ 3.12.4.3 Between-subject Designs: Dimensional † and Ordinal Dependent Variables This t-statistic is compared with a critical 3.12.4.3.1 t-test value from tabled values, which can be found in the appendix of any statistics book. To identify The t-test is used to compare the means of two that critical value, one must determine the groups. For instance, in a clinical trial, the degrees of freedom (df). Although a discussion of severity of depression of two treatment groups the meaning of degrees of freedom is beyond the (IPT and CBT) can be compared with a t-test. scope of this chapter, a value for df is often The HRSD is the dependent variable. The null needed to find the critical value of a test statistic. hypothesis is that the population means do not In the case of a two-group t-test, df is equal to differ: H0 : m1 = m2 and the alternative hypoth- the total number of subjects minus the number esis is that population means are not the same: of groups: HA : m1 = m2. df n n 20 20 2 38 The test is simply a ratio of the between-group ˆ 1 ‡ 2 ˆ ‡ ˆ differences relative to the within-group differ- Using this value for df, and a two-tailed alpha ences. It is used to ask the question, ªIs the mean level of 0.05, the critical t-statistic is 2.024. 
The difference between groups large relative to the observed t of 3.56 exceeds this critical value. fluctuation that is observed within each group?º Thus the null hypothesis is rejected and one can To illustrate this, the data from the CBT group conclude that the population means are not (presented in Table 3) are compared with data identical. IPT treatment is superior to CBT. from the other group of subjects in the Without the t-test it was apparent that the IPT hypothetical clinical trial that was described mean indicated lower symptom severity as earlier, those assigned to IPT (see Table 5). measured by the HRSD than the CBT mean. The algorithm for the t-test is a ratio of To get a handle on the magnitude of that between-group variability to within-group varia- difference, it is compared with the within-group bility: standard deviations. The IPT mean was over 6 X1 X2 HRSD points lower, which exceeds one within- t ˆ SX X group standard deviation (i.e., sIPT = 4.67 and 1 2 sCBT = 5.80). Using the t-test, one can conclude where the numerator represents the between- that the difference in group means is greater group variability (i.e., difference between the than would be expected by chance. group means) and the denominator is derived The assumptions of the t-test include: from the pooled variance. The pooled variance (i) Independence among observations. For is estimated as: example, one subject cannot re-enroll in the clinical trial to try another treatment if the 2 SS1 SS2 initial treatment does not result in symptom sp ‡ † ˆ n1 1 n2 1 remission. If that were permitted, some subjects †‡ † would contribute multiple observations to the and sample and those observations would not be independent. 2 2 sp sp (ii) The data are normally distributed in the s X1 X2 † ˆ sn1 ‡ n2 population. Technically, this assumption has to do with the distribution of the difference in Applied to the CBT vs. 
IPT data set: the pooled means, $\bar{X}_1$ $\bar{X}_2$, variance estimate,

or the population sampling distribution of $\mu_1 - \mu_2$. This assumption is generally met if the sample distributions of the raw data (i.e., the actual measurements), as opposed to the mean differences, are normal, or if the sample sizes are reasonably large.

(iii) Homogeneity of variance. The population variances are presumed to be equal. There are several tests of homogeneity of variance including Levene's (1960) test of equality of variances, Bartlett's test for homogeneity of variances, or Hartley's F-Max test (1950). These tests will not be discussed here. In calculating the t-test statistic, one generally uses a pooled variance estimate. However, when the assumption cannot be met, separate variance estimates can be incorporated into the t-test, or another procedure altogether, such as the Mann–Whitney test (discussed below), can be used.

In the case of heteroscedasticity (unequal group variances), a different algorithm for the t-test is used which incorporates separate variance estimates in the denominator. In addition, an adjustment in degrees of freedom is necessary. This will not be discussed here, but is presented in a variety of other references (Armitage & Berry, 1987; Fleiss, 1986; Zar, 1996).

In summary, the hypothesis testing steps for the t-test are as follows:
(i) State research question. Do CBT and IPT have similar effects for the treatment of patients with depression?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \mu_1 = \mu_2$
(b) $H_A: \mu_1 \neq \mu_2$
(iii) If the assumptions appear to be fulfilled, choose the t-test to evaluate the hypotheses.
(iv) The critical value for the t-test with $\alpha = 0.05$ and df = 38 is 2.024. If $|t_{observed}| > t_{critical}$, reject $H_0$.
(v) Data are displayed in Table 3 for the CBT group and Table 5 for the IPT group.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$. Because $|t_{observed}| > t_{critical}$ (i.e., 3.564 > 2.024), reject $H_0$.
(viii) Conclusion. IPT is a more effective treatment of depressive symptoms in these patients with major depressive disorder.

3.12.4.3.2 Mann–Whitney test

If the assumptions of the t-test cannot be met, yet the observations are independent, the Mann–Whitney test (also called the Wilcoxon test) may be an appropriate method for two-group comparisons, if the data are at least ordinal in nature. An ordinal measurement need not have equal intervals between units, but must be ordered numbers (e.g., 1st, 2nd, 3rd, . . .). The Mann–Whitney test can be used to compare two groups without making any distributional assumptions. It is used when there is reason to believe that the data are not normally distributed, when the sample sizes are small, or when the variances are heterogeneous. It can be used for dimensional data (e.g., HRSD), as well as ordered categorical data [e.g., illness severity categories rated on a Likert scale from 0 (none) to 10 (severe)].

Analyses are not conducted on the raw data, but instead on rank-transformed data, as described above. For instance, using the data from the clinical trial comparing IPT and CBT, the ranked data are displayed in Table 6. Instead of analyzing the actual Hamilton scores (the dependent variable), each of the scores is ranked, with the lowest score assigned a rank of 1 and the highest assigned a rank of N. For purposes of ranking, the data are pooled across groups. In the case of ties, the midpoint is used. Subsequent calculations are conducted with the ranks.

The rationale behind the Mann–Whitney test is that if the treatments are equally effective, one would expect about half of each group to be below the median rank for the pooled sample. Another way to think of it is to consider all possible pairs of data across groups: subject 1 from group 1 with subject 1 from group 2; then subject 1 with subject 2 from group 2; . . . subject 20 from group 1 with subject 20 from group 2.

Table 6 Hamilton Rating Scale of Depression for patients in the IPT group (N = 20) and patients in the CBT group (N = 20): raw ratings ($X_i$) and rank-transformed ratings ($R(X_i)$).

          IPT                      CBT
$X_{i1}$  $R(X_{i1})$    $X_{i2}$  $R(X_{i2})$
12        7              13        10.5
12        7              12        7
17        15             14        12
18        18             15        13
19        24             17        15
19        24             18        18
19        24             21        30
19        24             21        30
19        24             21        30
12        7              22        32
13        10.5           12        7
17        15             25        34
18        18             24        33
19        24             19        24
19        24             28        38
19        24             27        36.5
6         1              26        35
7         2              29        39.5
8         3              29        39.5
9         4              27        36.5

Total 301 299.5          420       520.5
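The rank transformation in Table 6 can be sketched in a few lines of Python (a language chosen here for illustration; the chapter itself prescribes no software). The helper name `midranks` and the variable names are hypothetical. The sketch pools the two groups, assigns midpoint ranks to ties, sums the ranks per group, and then applies the rank-sum formula $U = n_1 n_2 + n_1(n_1+1)/2 - R_1$ with IPT taken as group 1.

```python
def midranks(values):
    """Rank values 1..N; tied values share the midpoint of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        midpoint = ((start + 1) + (end + 1)) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = midpoint
        start = end + 1
    return ranks


# HRSD scores as listed in Table 6 (group 1 = IPT, group 2 = CBT).
ipt = [12, 12, 17, 18, 19, 19, 19, 19, 19, 12, 13, 17, 18, 19, 19, 19, 6, 7, 8, 9]
cbt = [13, 12, 14, 15, 17, 18, 21, 21, 21, 22, 12, 25, 24, 19, 28, 27, 26, 29, 29, 27]

pooled_ranks = midranks(ipt + cbt)      # ranking is done on the pooled sample
n1, n2 = len(ipt), len(cbt)
r1 = sum(pooled_ranks[:n1])             # rank sum for IPT
r2 = sum(pooled_ranks[n1:])             # rank sum for CBT

# Rank-sum form of the U statistic, using group 1's rank sum.
u_obs = n1 * n2 + n1 * (n1 + 1) / 2 - r1

print(r1, r2, u_obs)
```

Under these assumptions the rank sums match the totals of Table 6, and together they equal N(N + 1)/2 for N = 40.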

Examine whether or not in more than half of the ranked pairs the group 1 member is lower than its counterpart from group 2. That is, we would count the number of times that a subject in group 1 has a higher rank than a subject in group 2. Computationally, the tedious process of all $n_1 \cdot n_2$ comparisons need not be made.

The computations for the Mann–Whitney U test that are now described are based on the formula for the case where both $n_i$s are less than or equal to 20. The total number of subjects across groups is $N = n_1 + n_2$ and

$$U = n_1 \cdot n_2 + \frac{n_1(n_1 + 1)}{2} - R_1,$$

where $R_1$ is the sum of the ranks for group 1. Note that, in general, the sum of N ranks = $\frac{N(N+1)}{2}$. More specifically, in this case, $\frac{40(40+1)}{2} = 820$, which is equal to the sum of the ranks: 299.5 + 520.5 = 820.

The observed U statistic ($U_{obs}$) is compared with a critical range which is tabled in most introductory statistics texts (see for example, Fleiss, 1986; Zar, 1996). The critical range is a function of the sample size of each group.

$$U_{obs} = n_1 \cdot n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 = 20 \cdot 20 + \frac{20 \cdot 21}{2} - 299.5 = 310.5$$

For the Mann–Whitney test, the hypothesis testing steps are as follows:
(i) State research question. Do CBT and IPT have similar effects for the treatment of patients with depression?
(ii) State two mutually exclusive and exhaustive hypotheses with respect to the population medians ($M_k$).
$H_0: M_1 = M_2$
$H_A: M_1 \neq M_2$
(iii) Choose the Mann–Whitney test to evaluate the hypotheses.
(iv) The critical values for the Mann–Whitney test, with $\alpha = 0.05$ and $n_1 = n_2 = 20$, are 127 and 273. Thus, if $U_{obs} \leq 127$ or $U_{obs} \geq 273$, reject $H_0$.
(v) The ranked data are displayed in Table 6 for the IPT and CBT groups.
(vi) Calculations were described above.
(vii) Make decision regarding $H_0$. $U_{obs} \geq U_{critical}$ (i.e., 310.5 $\geq$ 273), thus reject $H_0$.
(viii) Conclusion: IPT is a more effective treatment of depressive symptoms in patients with major depressive disorder.

The assumptions of the Mann–Whitney test include:
(i) The two samples have been independently and randomly sampled from their respective populations.
(ii) The groups are independent.
(iii) The scale of measurement is at least ordinal.

For larger samples (i.e., when $n_1$ or $n_2$ is greater than 20), a Normal Approximation can be used for the Mann–Whitney test (see Conover, 1980).

3.12.4.3.3 One-way analysis of variance

When more than two groups are compared on a dimensional dependent variable, a fixed effects analysis of variance (ANOVA) is used. As with the t-test, the test statistic is a ratio of between-group differences to within-group differences. In this case the sum of squares is the quantity representing variability. The sum of squares was introduced earlier as the sum of squared deviations from the mean. For example, consider the clinical trial that has been described above, with the addition of a placebo group. Say that patients are randomly assigned to one of three treatments: CBT, IPT, or a psychotherapy placebo. The psychotherapy placebo patients see their clinicians for medical management (MM), a brief nondirective/reflective talk therapy. The data from the MM group are presented in Table 7. The efficacy of the three treatments for depression in this study can be compared using an ANOVA, once again with the HRSD as the dependent variable.

Initially, three different types of sums of squared deviations are computed. The total sum of squared deviations ($SS_{Total}$), sum of squared deviations accounted for by treatment group ($SS_{Treatment}$), and the residual sum of squared deviations, also referred to as "sum of squares error" ($SS_{Residual}$), are calculated as described below. There are several aspects about these three quantities that are worth noting at this point. First, the $SS_{Total}$ is the sum of squared deviations from the grand mean (i.e., for all subjects pooled, without regard to treatment group). Second, $SS_{Treatment}$ is a measure of the squared deviations of the treatment group means from the grand mean. It can be thought of as the sum of squares that is explained (by knowing what treatment the subjects received). Third, the sum of squared differences between each observation and its sample mean is the $SS_{Residual}$. It represents the sum of squares that is not explained by treatment, and in this context, is the difference between $SS_{Total}$ and $SS_{Treatment}$. These quantities are used in the F-test of the fixed effects ANOVA that is presented in Table 8 (the calculation of each quantity is also described in Table 8).

The assumptions of fixed effects ANOVA are:
(i) The samples are independent.
(ii) The populations from which the samples are drawn are normally distributed.

(iii) Each of the populations has the same variance.

For a fixed effects ANOVA, the hypothesis testing steps are as follows:
(i) State research question. Do CBT, IPT and MM have similar effects for the treatment of patients with depression?
(ii) State two mutually exclusive and exhaustive hypotheses:
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_{A1}: \mu_1 \neq \mu_2$ or $H_{A2}: \mu_1 \neq \mu_3$ or $H_{A3}: \mu_2 \neq \mu_3$
(iii) If assumptions are met, choose the fixed effects ANOVA to evaluate the hypotheses.
(iv) The critical value for the F-test with $\alpha = 0.05$ and df = 2,57 is 3.16 (df is defined in Table 8). If $F_{observed} > F_{critical}$, reject $H_0$.
(v) Data are displayed in Tables 3, 5, and 7 for the CBT, IPT, and MM groups, respectively.
(vi) Calculations are described above in text and in Table 8.
(vii) Make decision regarding $H_0$. Since $F_{observed} > F_{critical}$ (i.e., 10.966 > 3.16), reject $H_0$.
(viii) Conclusion: There is a significant difference in severity of depression among the treatments. And which is superior?

Table 7 Hamilton Depression Ratings ($X_i$) for 20 patients in the MM group.

$X_i$    $X_i^2$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
16       256        -6.25              39.0625
16       256        -6.25              39.0625
17       289        -5.25              27.5625
17       289        -5.25              27.5625
19       361        -3.25              10.5625
19       361        -3.25              10.5625
19       361        -3.25              10.5625
19       361        -3.25              10.5625
20       400        -2.25              5.0625
20       400        -2.25              5.0625
21       441        -1.25              1.5625
21       441        -1.25              1.5625
22       484        -0.25              0.0625
25       625        2.75               7.5625
27       729        4.75               22.5625
28       784        5.75               33.0625
28       784        5.75               33.0625
29       841        6.75               45.5625
31       961        8.75               76.5625
31       961        8.75               76.5625

Total 445  10385    0                  483.75

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{445}{20} = 22.25$$
$$SS = \sum (X_i - \bar{X})^2 = 483.75$$
$$s^2 = \frac{SS}{n - 1} = \frac{483.75}{20 - 1} = 25.46$$
$$s = \sqrt{s^2} = \sqrt{25.46} = 5.046$$

3.12.4.3.4 ANOVA: Post hoc tests

A significant omnibus F-test indicates that the null hypothesis is rejected: $H_0: \mu_1 = \mu_2 = \mu_3$. Nonetheless, further analyses are necessary to determine where the difference lies, that is, which groups are significantly different from each other. A variety of post hoc tests can be used for this purpose. A post hoc test is designed to be used only after an omnibus F-test is statistically significant. Each test incorporates the mean square within (MSW), which comes from the calculations that are presented in the ANOVA table (Table 8). The MSW is the denominator of the F-test. A nonsignificant omnibus F-test, however, indicates that there is not a difference in means among the groups. Of course, when there is no difference, there is no need to look for differences. Consequently, the use of a post hoc test after a nonsignificant omnibus F-test is inappropriate.

Post hoc procedures provide a strategy for multiple comparisons while at the same time maintaining the probability of Type I error (i.e., the probability of detecting differences that do not actually exist). This is in stark contrast with the inappropriate strategy of following a significant omnibus F-test with several t-tests, each used to compare a pair of groups. That approach fails to account for the multiple tests, and as a consequence, increases the probability of Type I error. Three post hoc tests are described below. The choice among these tests is based on the research question and the design of the study.

(i) Tukey's Honestly Significant Difference

Tukey's Honestly Significant Difference (HSD) test is used to compare all pairs of group means. HSD can be applied when each of the pairwise comparisons is of interest to the researcher, or when the researcher does not have specific hypotheses about which group means will differ. Initially the group means are calculated. In the hypothetical clinical trial example that has been discussed in this chapter, the groups are arranged in ascending order of their means: IPT ($\bar{X} = 15.05$), CBT ($\bar{X} = 21.00$), MM ($\bar{X} = 22.25$). Note that with k groups there are $\frac{k(k-1)}{2}$ pairwise comparisons. Thus, in this example, there are $\frac{k(k-1)}{2} = \frac{3(3-1)}{2} = 3$ comparisons: IPT vs. CBT, IPT vs. MM, and CBT vs. MM.
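The quantities behind these comparisons can be checked numerically. The following is a minimal Python sketch, not the chapter's own procedure: it recomputes the one-way ANOVA sums of squares and F ratio for the three groups, then forms a q ratio for each of the three pairs, taking q as the absolute mean difference divided by $\sqrt{MS_{Within}/n}$ for equal group sizes. The variable names are illustrative, and the scores are the HRSD values listed for the IPT, CBT, and MM groups.

```python
from itertools import combinations
from math import sqrt

# HRSD scores for the three treatment groups (20 patients each).
groups = {
    "IPT": [6, 7, 8, 9, 12, 12, 12, 13, 17, 17, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19],
    "CBT": [12, 12, 13, 14, 15, 17, 18, 19, 21, 21, 21, 22, 24, 25, 26, 27, 27, 28, 29, 29],
    "MM": [16, 16, 17, 17, 19, 19, 19, 19, 20, 20, 21, 21, 22, 25, 27, 28, 28, 29, 31, 31],
}


def mean(xs):
    return sum(xs) / len(xs)


n_total = sum(len(xs) for xs in groups.values())
grand_mean = sum(sum(xs) for xs in groups.values()) / n_total

# Between-group and within-group sums of squared deviations.
ss_treatment = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in groups.values())
ss_within = sum((x - mean(xs)) ** 2 for xs in groups.values() for x in xs)

df_treatment = len(groups) - 1          # k - 1
df_within = n_total - len(groups)       # N - k
ms_treatment = ss_treatment / df_treatment
ms_within = ss_within / df_within
f_observed = ms_treatment / ms_within

# Pairwise q ratios; the equal-n standard error is assumed (n = 20 per group).
n_per_group = 20
se = sqrt(ms_within / n_per_group)
q = {}
for a, b in combinations(groups, 2):    # k(k - 1)/2 = 3 pairs
    q[(a, b)] = abs(mean(groups[a]) - mean(groups[b])) / se

print(round(f_observed, 3), round(se, 2))
for pair, value in q.items():
    print(pair, round(value, 2))
```

Because the sketch carries the unrounded standard error through the division, its q values can differ in the second decimal place from hand calculations that first round the standard error to 1.16.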

Table 8 ANOVA Table

Source       SS          df    MS         F
Treatment    592.033     2     296.017    10.966
Within       1538.697    57    26.995
Total        2130.730    59

$$SS_{Total} = \sum_{i=1}^{N} (X_i - \bar{X}_{..})^2 = (12 - 19.43)^2 + (12 - 19.43)^2 + \cdots + (31 - 19.43)^2 + (31 - 19.43)^2 = 2130.733$$
$$SS_{Treatment} = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X}_{..})^2 = 20(21.00 - 19.43)^2 + 20(15.05 - 19.43)^2 + 20(22.25 - 19.43)^2 = 592.033$$
$$SS_{Within} = \sum_{i=1}^{N} (X_{ij} - \bar{X}_j)^2 = (12 - 21.00)^2 + (12 - 21.00)^2 + \cdots + (31 - 22.25)^2 + (31 - 22.25)^2 = 1538.697$$
$$df_{Treatment} = k - 1, \text{ where } k = \text{number of groups}$$
$$df_{Within} = N - k$$
$$df_{Total} = N - 1$$
$$MS_{Treatment} = \frac{SS_{Treatment}}{df_{Treatment}}$$
$$MS_{Within} = \frac{SS_{Within}}{df_{Within}}$$
$$F = \frac{MS_{Treatment}}{MS_{Within}}$$

The three pairwise mean differences are then calculated. The HSD test statistic, q, a ratio of the mean difference over the standard error, is computed for each pairwise comparison. For each comparison, if the observed q > critical q (a tabled value), the null hypothesis of no difference between the pair of means is rejected. When there are an equal number of subjects per group, the standard error (se) is calculated as:

$$se = \sqrt{\frac{MSE}{n_k}} = \sqrt{\frac{26.995}{20}} = 1.16$$

where MSE comes from the ANOVA table (Table 8) and $n_k$ is the number of subjects per group. The algorithm for unequal sample sizes can be found in other texts (e.g., Zar, 1996). The results are presented in Table 9.

Table 9 Tukey HSD pairwise comparisons

Comparison     Mean difference    se      q
IPT vs. CBT    5.95               1.16    5.13
IPT vs. MM     7.20               1.16    6.21
CBT vs. MM     1.25               1.16    1.08

The hypothesis testing steps for Tukey's HSD are as follows:
(i) State research question. The ANOVA results indicated that the IPT, CBT, and MM do not have similar effects for the treatment of patients with depression. Which pairs of groups differ?
(ii) State two mutually exclusive and exhaustive hypotheses for each pairwise comparison:
(a) $H_0: \mu_{IPT} = \mu_{CBT}$; $H_A: \mu_{IPT} \neq \mu_{CBT}$
(b) $H_0: \mu_{IPT} = \mu_{MM}$; $H_A: \mu_{IPT} \neq \mu_{MM}$
(c) $H_0: \mu_{CBT} = \mu_{MM}$; $H_A: \mu_{CBT} \neq \mu_{MM}$
(iii) Choose Tukey's HSD as a post hoc test to evaluate the hypotheses.
(iv) The critical value for the q-test with $\alpha = 0.05$ and df = 3,56 is 3.408. (Note that df = k, v, where k = number of groups and $v = n - k - 1$.) Reject $H_0$ if q > 3.408.
(v) Data are displayed in Table 9 for the three pairwise comparisons.
(vi) Calculations were described above.
(vii) Decisions regarding $H_0$.
IPT vs. CBT: $q_{observed} > q_{critical}$ (i.e., 5.13 > 3.408), thus reject $H_0$.
IPT vs. MM: $q_{observed} > q_{critical}$ (i.e., 6.21 > 3.408), thus reject $H_0$.
CBT vs. MM: $q_{observed} < q_{critical}$ (i.e., 1.08 < 3.408), thus do not reject $H_0$.
(viii) Conclusion: IPT was more effective than either CBT or MM for the treatment of depressive symptoms in patients with major depressive disorder. There was no difference between CBT and MM for the treatment of depressive symptoms.

(ii) The Dunnett test

The Dunnett test is to be used when a control group is to be compared with each of several other treatments, and when no other comparisons are of interest. For instance, it could be used to demonstrate the efficacy of each treatment relative to placebo, but not to examine differences between active treatments. In the example that has been discussed, the

Dunnett test could be used for two comparisons: (i) CBT vs. MM, and (ii) IPT vs. MM. It could be used to show that IPT and CBT each have active ingredients; but the Dunnett test could not be used by members of the IPT and CBT camps to settle the score of whose method is superior. The Dunnett test, which will not be illustrated here, is described in detail in Fleiss (1986).

(iii) Scheffé Post Hoc Procedure

The Scheffé test is a more general procedure than those described above. It is not restricted to pairwise comparisons, but can evaluate comparisons of any linear combination of means. For instance, in the above example, one might have two comparisons of interest: (i) active (CBT and IPT) vs. placebo (MM), and (ii) CBT vs. IPT. The Scheffé procedure would be appropriate for those comparisons. The Scheffé test is somewhat more conservative than others in that it has less statistical power. For that reason some criticize the test for being more likely to miss true differences. Others consider the conservative nature of the test appealing. The Scheffé test will not be applied here. It is discussed in detail in Fleiss (1986).

3.12.4.3.5 Kruskal–Wallis test

The Kruskal–Wallis test (1952) is a nonparametric approach to the one-way ANOVA. The procedure is used to compare three or more groups on a dependent variable that is measured on at least an ordinal level. Ordinal data extend beyond rating scores such as the HRSD, and can include ordered categorical variables such as Hollingshead and Redlich's (1958) four broad categories of socioeconomic status. As with the Mann–Whitney test, which is a special two-group case of the Kruskal–Wallis test, the data are pooled (across groups) and ranked from 1 for the lowest value of the dependent variable to N for the highest value. In the case of ties, the midpoint is used. For example, if two subjects had the third lowest value, they would each be given a rank of 3.5, the midpoint of 3 and 4. The data from the hypothetical clinical trial comparing CBT, IPT, and MM (originally displayed in Tables 3, 5, and 7, respectively) have been ranked and are presented in Table 10.

After the data have been ranked, the sum of the ranks for each group is calculated:

$$R_j = \sum_{i=1}^{n_j} r_i$$

The mean rank for each of the three groups is then calculated:

$$\bar{r}_j = \frac{\sum_{i=1}^{n_j} r_i}{n_j}$$

The respective rank sums and means for each group are also presented in Table 10. Note that the IPT (18.075) group has a substantially lower mean rank than either the CBT (34.975) or MM (38.45) groups. The Kruskal–Wallis test can be used to determine whether that difference is larger than might be expected by chance. Stated differently, consider an experiment in which 20 numbers are randomly selected from a box that contains 60 balls, numbered 1 through 60. Would it be unlikely to randomly select a group of 20 balls whose sum is 361.5 or lower? Or whose mean is 18.075 or lower? In essence, the Kruskal–Wallis test addresses this question.

Once the ranking has been conducted and the sum of ranks for each group has been determined, the calculations for the Kruskal–Wallis test statistic, T, can proceed as follows:

$$T = \frac{(N - 1)(S_t^2 - C)}{S_r^2 - C}$$

where

$$S_t^2 = \sum_{i=1}^{k} \frac{R_i^2}{n_i}$$
$$S_r^2 = \sum_{i=1}^{N} r_{ij}^2$$
$$C = \frac{N(N + 1)^2}{4}$$

Applying these algorithms to the data that are presented in Table 10:

$$S_t^2 = \sum_{i=1}^{k} \frac{R_i^2}{n_i} = \frac{699.5^2}{20} + \frac{361.5^2}{20} + \frac{769^2}{20} = 60567.175$$
$$S_r^2 = \sum_{i=1}^{N} r_{ij}^2 = 1^2 + 2^2 + 3^2 + \cdots + 57^2 + 59.5^2 + 59.5^2 = 73587$$
$$C = \frac{N(N + 1)^2}{4} = \frac{60(60 + 1)^2}{4} = 55815$$
$$T = \frac{(N - 1)(S_t^2 - C)}{S_r^2 - C} = \frac{(60 - 1)(60567.175 - 55815)}{73587 - 55815} = 15.776$$
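The calculation of T above can be reproduced with a short Python sketch. This is an illustration under stated assumptions rather than the chapter's own code: the helper `midranks` and all variable names are hypothetical, and the statistic is computed directly from the three quantities $S_t^2$, $S_r^2$, and C defined above, using the pooled HRSD scores of the CBT, IPT, and MM groups.

```python
def midranks(values):
    """Rank values 1..N; tied values share the midpoint of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        midpoint = ((start + 1) + (end + 1)) / 2.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = midpoint
        start = end + 1
    return ranks


cbt = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21, 21, 22, 24, 25, 26, 27, 27, 28, 29, 29]
ipt = [6, 7, 8, 9, 12, 12, 12, 13, 17, 17, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19]
mm = [16, 16, 17, 17, 19, 19, 19, 19, 20, 20, 21, 21, 22, 25, 27, 28, 28, 29, 31, 31]

groups = [cbt, ipt, mm]
pooled = [x for g in groups for x in g]
ranks = midranks(pooled)                 # ranks over all N = 60 scores
n_total = len(pooled)

# Rank sum R_j for each group, in the order CBT, IPT, MM.
rank_sums = []
start = 0
for g in groups:
    rank_sums.append(sum(ranks[start:start + len(g)]))
    start += len(g)

s_t2 = sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
s_r2 = sum(r ** 2 for r in ranks)
c = n_total * (n_total + 1) ** 2 / 4
t_observed = (n_total - 1) * (s_t2 - c) / (s_r2 - c)

print(rank_sums, round(t_observed, 3))
```

Summing the squared midranks directly, as done here, takes the ties into account without a separate correction term.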

Table 10 HRSD for 60 patients in the clinical trial of CBT, IPT, and MM: raw ($X_{ij}$) and ranked ($R_{ij}$) values.

          CBT                   IPT                   MM
$X_{ij}$  $R_{ij}$    $X_{ij}$  $R_{ij}$    $X_{ij}$  $R_{ij}$
12        7           6         1           16        14.5
12        7           7         2           16        14.5
13        10.5        8         3           17        18
14        12          9         4           17        18
15        13          12        7           19        30
17        18          12        7           19        30
18        22          12        7           19        30
19        30          13        10.5        19        30
21        41          17        18          20        37.5
21        41          17        18          20        37.5
21        41          18        22          21        41
22        44.5        18        22          21        41
24        46          19        30          22        44.5
25        47.5        19        30          25        47.5
26        49          19        30          27        51
27        51          19        30          28        54
27        51          19        30          28        54
28        54          19        30          29        57
29        57          19        30          31        59.5
29        57          19        30          31        59.5

Total 420  699.5      301       361.5       445       769
$\bar{R}$  34.975                18.075                38.45

Assumptions of the Kruskal–Wallis test include:
(i) The samples are independent random samples from their respective populations.
(ii) The scale of measurement (of the dependent variable) is at least ordinal.

The hypothesis testing steps for the Kruskal–Wallis test are as follows:
(i) State research question. Do CBT, IPT, and MM have similar effects for the treatment of patients with depression?
(ii) State two mutually exclusive and exhaustive hypotheses with regard to group medians:
(a) $H_0: M_1 = M_2 = M_3$
(b) $H_A: M_1 \neq M_2 \neq M_3$
(iii) Use the Kruskal–Wallis test to evaluate the hypotheses.
(iv) The critical value for the Kruskal–Wallis test comparing k groups comes from a $\chi^2$ distribution, with $k - 1$ degrees of freedom and $\alpha = 0.05$. In this case there are three groups (k = 3) and df = 3 − 1 = 2. Therefore, the critical $\chi^2_{(2,.05)} = 5.99$. If $T_{observed} > 5.99$, reject $H_0$.
(v) Data are displayed in Table 10 for the three groups.
(vi) Calculations are described above.
(vii) Decision regarding $H_0$. $T_{observed} > \chi^2_{critical}$ (i.e., 15.776 > 5.99), thus reject $H_0$.
(viii) Conclusion. There is a significant difference in severity of depression among the treatments. (Which groups actually differ can only be determined using an appropriate post hoc test, as discussed below.)

This significant result in a Kruskal–Wallis test indicates that there are group differences, but does not indicate which groups differ. As with an ANOVA, a post hoc procedure that is analogous to the HSD for ANOVAs can be used to determine which groups are significantly different from each other. That procedure, which will not be discussed here, is described in detail by Conover (1980). Note that in this case, the Kruskal–Wallis results are similar to those of the ANOVA. This is not always the case. The choice between the two tests should be based on whether or not the assumptions of the test are fulfilled and not on a comparison of their results. When the results from an ANOVA and a Kruskal–Wallis test conflict, the degree of discrepancy between the two tests may, in part, reflect the extent to which the assumptions were fulfilled.

3.12.4.3.6 Factorial analysis of variance

The fixed effect ANOVA model that was just discussed can be extended to include more than one independent variable. Consider a clinical trial in which the two treatments (CBT and IPT) were compared among samples of two types of mood disorders (major depressive disorder and dysthymia). In such a design the two fixed factors (independent variables) are treatment and diagnosis, which each have two levels. The two-way ANOVA is used to test main effects of each independent variable and the interaction between the independent variables. The null hypotheses are:

Main effect of treatment: $H_0: \mu_{1.} = \mu_{2.}$
Main effect of diagnosis: $H_0: \mu_{.1} = \mu_{.2}$
Treatment by diagnosis interaction: $H_0: \mu_{11} - \mu_{12} = \mu_{21} - \mu_{22}$

Note that the first subscript represents treatment level (where 1 = CBT and 2 = IPT), the second subscript represents diagnosis (where 1 = dysthymia and 2 = MDD). In the null hypotheses for the main effects, a dot (.) in the subscript indicates that the corresponding particular variable is ignored. These three null hypotheses are tested with data from one experiment. The two main effects in this two-way ANOVA represent an extension of the one main effect that is tested in a one-way ANOVA model described earlier. What is more novel about a factorial ANOVA design is the interaction between the factors, which is used to examine the moderating effect one factor has on the effect that the other factor has on the dependent variable. In the hypothetical study presented here, an interaction is used to compare treatment differences (CBT vs. IPT) across diagnostic groups. Stated differently, it is used to examine whether there is a larger difference between IPT and CBT for subjects with MDD or for subjects with dysthymia.

The data from the clinical trial shown earlier in Tables 3 and 5 have been stratified by diagnosis and are presented in Table 11. In a simple comparison of the means, it is clear that the IPT group has a substantially lower mean than the CBT group. Likewise, the dysthymia group has a lower mean than the MDD group. In both cases, the differences are of the order of one standard deviation. The factorial ANOVA is used to evaluate whether or not these differences are larger than might be expected by chance. Furthermore, the ANOVA will be used to compare the treatment effects across diagnostic groups.

The calculations for a two-way factorial ANOVA involve partitioning the total sums of squares into four components: $SS_{Tx}$, $SS_{Dx}$, $SS_{Tx\,by\,Dx}$, $SS_{within}$. The algorithms are described in Table 12 and illustrated in Table 13. Note that the subscript g represents treatment (g = 1,2), the subscript h represents diagnosis (h = 1,2), and the subscript i represents individual subject ($i = 1, \ldots, N$; N = 40 in this case). The formulae that are presented are applicable for designs with cells of equal size (n; n = 10 in this case) and are presented to illustrate the concept of the partitioning of sums of squares in a two-way ANOVA. Notice that each estimates a different form of sum of squared deviations (SS) from a mean that represents a component of the variability among the HRSD of all subjects in the clinical trial.

In order to conduct the F-tests of the ANOVA that correspond to the three null hypotheses, several additional calculations are required. The algorithms for these quantities are presented in Table 12. The first column of the table indicates the source of variability. The second column displays the SS (calculated above) that correspond to each of the four sources of variability. The third column of Table 12 presents the degrees of freedom that correspond to each source of variability. The calculation of degrees of freedom is described in Table 12. The fourth column in Table 12 presents the mean square (MS) for each source of variability, which is equal to the sum of squares for that source divided by the corresponding degrees of freedom. In the final column of Table 12, the F-test for each effect is equal to the mean square for that effect divided by the mean square within ($MS_{within}$). The results of the data from the example are shown in Table 13.

Assumptions of the factorial ANOVA are as follows:
(i) The observations in each of the cells are independent samples.
(ii) The population data are normally distributed.
(iii) The populations all have equal variances.

The hypothesis testing steps for the factorial ANOVA are as follows:
(i) State research questions.
(a) Do CBT and IPT have similar effects on depressive symptomatology for the treatment of patients with depression?
(b) Do patients with major depressive disorder and dysthymia have similar depressive symptomatology after the treatment of depression?
(c) Do CBT and IPT have similar effects on depressive symptomatology for patients with major depressive disorder and dysthymia?
(ii) State sets of mutually exclusive and exhaustive hypotheses:
(a) Main effect of treatment
$H_0: \mu_{1.} = \mu_{2.}$
$H_A: \mu_{1.} \neq \mu_{2.}$
(b) Main effect of diagnosis
$H_0: \mu_{.1} = \mu_{.2}$
$H_A: \mu_{.1} \neq \mu_{.2}$

(c) Interaction of treatment by diagnosis
$H_0: \mu_{11} - \mu_{21} = \mu_{12} - \mu_{22}$
$H_A: \mu_{11} - \mu_{21} \neq \mu_{12} - \mu_{22}$
(iii) If assumptions have been fulfilled, use the two-way factorial ANOVA to evaluate the hypotheses.
(iv) The critical F-values are selected from an F-table based on the degrees of freedom numerator, degrees of freedom denominator, and alpha-level. The critical value for the F-test with $\alpha = 0.05$ and df = 1,36 is 5.47. Reject $H_0$ if $F_{observed} > F_{critical}$.
(v) Data are displayed in Table 11.
(vi) Calculations are described above.
(vii) Decisions regarding the three $H_0$s.
(a) Main effect of treatment. The main effect of treatment is statistically significant: $F_{obs} > F_{critical(1,36;.05)}$ (i.e., 17.114 > 5.47).
(b) Main effect of diagnosis. The main effect of diagnosis is statistically significant: $F_{obs} > F_{critical(1,36;.05)}$ (i.e., 14.359 > 5.47).
(c) Interaction of treatment by diagnosis. The treatment by diagnosis interaction is not statistically significant: $F_{obs} < F_{critical(1,36;.05)}$ (i.e., 0.639 < 5.47).
(viii) Conclusion. IPT is a significantly more effective treatment than CBT for depressive symptoms. Patients with major depressive disorder have significantly more depressive symptomatology than those with dysthymia after treatment. The differential effects of IPT and CBT are similar for patients with major depressive disorder and patients with dysthymia.

In this case, post hoc tests such as those described earlier are not needed to interpret these results because there are only two levels of each significant factor and there is not a significant interaction. To determine which level of each factor had a better outcome in this clinical trial, simply compare the means of each level of the significant effects and determine which is lower (i.e., less depressed on the HRSD). For the treatment effect, the IPT group mean (sd) is substantially lower than that of the CBT group: 15.05 (4.67) vs. 21.00 (5.80). The two groups differ by more than one standard deviation, which is a very large difference. Similarly, the dysthymia group mean (sd) is substantially lower than that of the MDD group: 15.30 (5.57) vs. 20.75 (5.24). Here the groups differ by slightly less than one standard deviation, which is

Table 11 Hamilton Depression Ratings for 40 patients in the clinical trial of CBT and IPT stratified by diagnosis.

Treatment    CBT      CBT      IPT      IPT
Diagnosis    DYS      MDD      DYS      MDD
             12       14       6        12
             12       19       7        13
             13       26       8        17
             15       27       9        17
             17       22       12       18
             18       24       12       19
             21       25       18       19
             21       28       19       19
             21       29       19       19
             27       29       19       19

Four cells
sum          177      243      129      172
mean         17.70    24.30    12.90    17.20
sd           4.88     4.81     5.38     2.62
SS           3347     6113     1925     3020

Diagnosis    DYS      MDD
sum          306      415
mean         15.30    20.75
sd           5.57     5.24
SS           5272     9133

Treatment    CBT      IPT
sum          420.00   301.00
mean         21.00    15.05
sd           5.80     4.67
SS           9460     4945
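The partition of the total sum of squares for the data in Table 11 can be sketched in Python as follows. This is an illustrative sketch that assumes equal cell sizes (n = 10), as the chapter's formulae do; the dictionary layout and variable names are hypothetical. Each sum of squares follows the definitions used for the two-way design: treatment and diagnosis marginal means against the grand mean, the cell-mean remainder for the interaction, and deviations from cell means for the within term.

```python
# HRSD scores by cell: (treatment, diagnosis) -> 10 patients each (Table 11).
cells = {
    ("CBT", "DYS"): [12, 12, 13, 15, 17, 18, 21, 21, 21, 27],
    ("CBT", "MDD"): [14, 19, 26, 27, 22, 24, 25, 28, 29, 29],
    ("IPT", "DYS"): [6, 7, 8, 9, 12, 12, 18, 19, 19, 19],
    ("IPT", "MDD"): [12, 13, 17, 17, 18, 19, 19, 19, 19, 19],
}
treatments = ["CBT", "IPT"]
diagnoses = ["DYS", "MDD"]
p, q, n = len(treatments), len(diagnoses), 10


def mean(xs):
    return sum(xs) / len(xs)


all_scores = [x for xs in cells.values() for x in xs]
grand = mean(all_scores)
cell_mean = {key: mean(xs) for key, xs in cells.items()}
tx_mean = {t: mean([x for d in diagnoses for x in cells[(t, d)]]) for t in treatments}
dx_mean = {d: mean([x for t in treatments for x in cells[(t, d)]]) for d in diagnoses}

ss_total = sum((x - grand) ** 2 for x in all_scores)
ss_tx = n * q * sum((tx_mean[t] - grand) ** 2 for t in treatments)
ss_dx = n * p * sum((dx_mean[d] - grand) ** 2 for d in diagnoses)
# Interaction: what remains of each cell mean after removing both main effects.
ss_int = n * sum(
    ((cell_mean[(t, d)] - grand) - (tx_mean[t] - grand) - (dx_mean[d] - grand)) ** 2
    for t in treatments
    for d in diagnoses
)
ss_within = sum((x - cell_mean[key]) ** 2 for key, xs in cells.items() for x in xs)

ms_within = ss_within / (p * q * (n - 1))
f_tx = (ss_tx / (p - 1)) / ms_within
f_dx = (ss_dx / (q - 1)) / ms_within
f_int = (ss_int / ((p - 1) * (q - 1))) / ms_within

print(round(ss_tx, 3), round(ss_dx, 3), round(ss_int, 3), round(ss_within, 3))
print(round(f_tx, 3), round(f_dx, 3), round(f_int, 3))
```

In a balanced design such as this one, the four components add up to the total sum of squares, which gives a quick arithmetic check on the partition.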

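Table 12 Two-way ANOVA table.

Source       SS                  df              MS                                 F
Treatment    $SS_{Tx}$           $p - 1$         $SS_{Tx}/(p - 1)$                  $MS_{Tx}/MS_{within}$
Diagnosis    $SS_{Dx}$           $q - 1$         $SS_{Dx}/(q - 1)$                  $MS_{Dx}/MS_{within}$
Tx by dx     $SS_{Tx\,by\,Dx}$   $(p - 1)(q - 1)$  $SS_{Tx\,by\,Dx}/(p - 1)(q - 1)$  $MS_{Tx\,by\,Dx}/MS_{within}$
Within       $SS_{within}$       $pq(n - 1)$     $SS_{within}/pq(n - 1)$
Total        $SS_{Total}$        $N - 1$

$$SS_{Total} = \sum_{i=1}^{N} (X_{ghi} - \bar{X}_{...})^2$$
$$SS_{Tx} = nq \sum_{g=1}^{p} (\bar{X}_{g..} - \bar{X}_{...})^2$$
$$SS_{Dx} = np \sum_{h=1}^{q} (\bar{X}_{.h.} - \bar{X}_{...})^2$$
$$SS_{Tx\,by\,Dx} = n \sum_{g=1}^{p} \sum_{h=1}^{q} \left[ (\bar{X}_{gh.} - \bar{X}_{...}) - (\bar{X}_{g..} - \bar{X}_{...}) - (\bar{X}_{.h.} - \bar{X}_{...}) \right]^2$$
$$SS_{Within} = \sum_{g=1}^{p} \sum_{h=1}^{q} \sum_{i=1}^{n} (X_{ghi} - \bar{X}_{gh.})^2$$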

again a substantial difference. In contrast, the interaction was not significant. Consider the differences in treatment means across diagnoses:

$$\bar{X}_{CBT/DYS} - \bar{X}_{IPT/DYS} = 17.70 - 12.90 = 4.80$$

and

$$\bar{X}_{CBT/MDD} - \bar{X}_{IPT/MDD} = 24.30 - 17.20 = 7.10$$

Relative to the variation within treatment (marginal standard deviations range from 4.67 to 5.80), the difference in mean differences (7.10 − 4.80 = 2.30) is not substantial enough to consider it more than might be expected by chance.

The two-way factorial ANOVA example that was just discussed included two factors (i.e., independent variables: treatment and diagnosis), each of which had two levels. The factorial ANOVA can incorporate factors which have more than two levels each. This was shown in the earlier example of the one-way ANOVA in which there were three levels of treatment (CBT, IPT, and MM). In addition, the factorial design can be extended to include more than two factors. For instance, the clinical trial could have been conducted to test the effect of gender as well as treatment and diagnosis. Such a design is called a three-way ANOVA, which includes tests of three main effects (treatment, diagnosis, and gender), three two-way interactions (treatment by diagnosis, treatment by gender, and diagnosis by gender), and one three-way interaction (treatment by diagnosis by gender). For an extensive discussion of ANOVA designs see Keppel (1991).

In this chapter, two multivariate statistical procedures will be introduced, but not discussed in depth. For a comprehensive discussion of these and other multivariate tests, see Bock (1985), Grimm and Yarnold (1995), Marcoulides and Hershberger (1997), or Stevens (1992). Here multivariate means that multiple dependent variables are being examined in one test. For instance, consider the comparison of affectively ill patients with schizophrenic patients on a battery of six personality scales. There are two schools of thought for use of a multivariate procedure such as this. One is that it should be used if there is a multivariate research question. For example, do the personality profiles of the two diagnostic groups differ as a whole? The researcher who asks this question might be most interested in personality as a general construct, and less interested in the specific differences in ego resiliency, dependency and other personality characteristics. In contrast, others would argue that the multivariate procedures should be used to protect against Type I error whenever several dependent variables are examined.

Table 13 Hamilton depression ratings for clinical trial of CBT and IPT by diagnosis: ANOVA table.

Source       SS          df    MS         F
Treatment    354.025     1     354.025    17.114
Diagnosis    297.025     1     297.025    14.359
Tx by dx     13.225      1     13.225     0.639
Within       744.70      36    20.686
Total        1408.975    39

$$SS_{Total} = \sum_{i=1}^{N} (X_{ghi} - \bar{X}_{...})^2 = (12 - 18.025)^2 + (12 - 18.025)^2 + \cdots + (19 - 18.025)^2 + (19 - 18.025)^2 = 1408.975$$
$$SS_{Tx} = nq \sum_{g=1}^{p} (\bar{X}_{g..} - \bar{X}_{...})^2 = 10 \cdot 2 \cdot \left[ (21 - 18.025)^2 + (15.05 - 18.025)^2 \right] = 354.025$$
$$SS_{Dx} = np \sum_{h=1}^{q} (\bar{X}_{.h.} - \bar{X}_{...})^2 = 10 \cdot 2 \cdot \left[ (15.3 - 18.025)^2 + (20.75 - 18.025)^2 \right] = 297.025$$
$$SS_{Tx\,by\,Dx} = n \sum_{g=1}^{p} \sum_{h=1}^{q} \left[ (\bar{X}_{gh.} - \bar{X}_{...}) - (\bar{X}_{g..} - \bar{X}_{...}) - (\bar{X}_{.h.} - \bar{X}_{...}) \right]^2$$
$$= 10 \left\{ \left[ (17.7 - 18.025) - (21.0 - 18.025) - (15.3 - 18.025) \right]^2 + \cdots + \left[ (17.2 - 18.025) - (15.05 - 18.025) - (20.75 - 18.025) \right]^2 \right\} = 13.225$$
$$SS_{Within} = \sum_{g=1}^{p} \sum_{h=1}^{q} \sum_{i=1}^{n} (X_{ghi} - \bar{X}_{gh.})^2 = (12 - 17.7)^2 + (12 - 17.7)^2 + \cdots + (19 - 17.2)^2 + (19 - 17.2)^2 = 744.70$$

3.12.4.3.7 Hotelling's $T^2$

Hotelling's $T^2$ is used to compare two groups on multiple continuous dependent variables, simultaneously. Consider that if a series of six univariate t-tests were performed on these scales, the probability of a Type I error would be substantially elevated from the conventional 0.05 level. In general, the experimentwise alpha level for c comparisons is calculated as follows: $\alpha_{EW} = 1 - (1 - \alpha)^c$. Thus, in lieu of a multivariate procedure, the two groups might be compared using six univariate t-tests (one for each of the six dependent variables, the personality scales) and the resulting experimentwise alpha level would be $\alpha_{EW} = 1 - (1 - 0.05)^6 = 1 - 0.735 = 0.265$, which certainly exceeds the conventional Type I error level of 0.05.

Although the computational details of Hotelling's $T^2$ are beyond the scope of this chapter, some general issues regarding the procedure will be introduced. First, the null hypothesis is that the populations from which the two groups come do not differ on (a linear combination of) the scales (see below). Second, assumptions of Hotelling's $T^2$ include multivariate normality and independence of observations across subjects. Third, Hotelling's $T^2$ is an omnibus multivariate procedure. If, and only if, the multivariate null hypothesis is rejected, can subsequent post hoc univariate tests be used to identify the source of the difference (i.e., on which dependent variable(s) do the groups differ?). This is similar to the approach to the comparison of three or more groups in an ANOVA where, if the omnibus F-test is significant, a post hoc test must be used to determine which groups differ significantly. However, those post hoc tests cannot be used if the omnibus F of the univariate ANOVA is not significant. In the case of Hotelling's $T^2$, the post hoc tests would simply be a t-test for each dependent variable.

The hypothesis testing steps for Hotelling's $T^2$ are as follows:
(i) State research question. Are the personality profiles of affectively ill patients and schizophrenic patients similar (using a battery of six personality scales)?
(ii) State two mutually exclusive and exhaustive hypotheses:

$$H_0: \begin{bmatrix} \mu_{11} \\ \mu_{21} \\ \mu_{31} \\ \mu_{41} \\ \mu_{51} \\ \mu_{61} \end{bmatrix} = \begin{bmatrix} \mu_{12} \\ \mu_{22} \\ \mu_{32} \\ \mu_{42} \\ \mu_{52} \\ \mu_{62} \end{bmatrix}$$

$$H_A: \begin{bmatrix} \mu_{11} \\ \mu_{21} \\ \mu_{31} \\ \mu_{41} \\ \mu_{51} \\ \mu_{61} \end{bmatrix} \neq \begin{bmatrix} \mu_{12} \\ \mu_{22} \\ \mu_{32} \\ \mu_{42} \\ \mu_{52} \\ \mu_{62} \end{bmatrix}$$

(iii) If the assumptions are met, use Hotelling's $T^2$ to evaluate the hypotheses.
(iv) The critical value, calculations, decision rules, and interpretation of Hotelling's $T^2$ are described elsewhere (e.g., Stevens (1992) or Marcoulides & Hershberger (1997)).

3.12.4.3.8 Multivariate analysis of variance

Multivariate analysis of variance (MANOVA) is an extension of the $T^2$ for the comparison of three or more groups. For example, three groups (e.g., mood disorders, schizophrenics, and no history of a mental disorder) can be compared on a battery of six personality scales using a MANOVA. Similar to the factorial ANOVA, MANOVA can also be extended to incorporate more than one factor. For instance, a two-way MANOVA could evaluate the main effects and interaction of gender and diagnosis (mood disorders vs. schizophrenics vs. no history of a mental disorder) in which the dependent variables are a battery of six personality scales. If any of those three omnibus multivariate tests (i.e., main effects of gender or diagnosis or the gender by diagnosis interaction) are statistically significant, subsequent post hoc tests must be conducted to determine the source of the group differences. The choice among multivariate test statistics (e.g., Pillai's, Hotelling's, Wilks', or Roy's) is considered by Olson (1976). Post hoc tests for MANOVA are described in detail by Stevens (1992).

Assumptions of MANOVA include:
(i) independence among observations (i.e., subjects)
(ii) a multivariate normal distribution among the dependent variables
(iii) equal population covariance matrices for the dependent variables.

The hypothesis testing steps for MANOVA are as follows:
(i) State research question. Are the personality profiles of affectively ill patients, schizophrenic patients, and those with no history of mental disorders similar (using a battery of six personality scales)?
(ii) State two mutually exclusive and exhaustive hypotheses:

$$H_0: \begin{bmatrix} \mu_{11} \\ \mu_{21} \\ \mu_{31} \\ \mu_{41} \\ \mu_{51} \\ \mu_{61} \end{bmatrix} = \begin{bmatrix} \mu_{12} \\ \mu_{22} \\ \mu_{32} \\ \mu_{42} \\ \mu_{52} \\ \mu_{62} \end{bmatrix} = \begin{bmatrix} \mu_{13} \\ \mu_{23} \\ \mu_{33} \\ \mu_{43} \\ \mu_{53} \\ \mu_{63} \end{bmatrix} \qquad H_A: \begin{bmatrix} \mu_{11} \\ \mu_{21} \\ \mu_{31} \\ \mu_{41} \\ \mu_{51} \\ \mu_{61} \end{bmatrix} \neq \begin{bmatrix} \mu_{12} \\ \mu_{22} \\ \mu_{32} \\ \mu_{42} \\ \mu_{52} \\ \mu_{62} \end{bmatrix} \neq \begin{bmatrix} \mu_{13} \\ \mu_{23} \\ \mu_{33} \\ \mu_{43} \\ \mu_{53} \\ \mu_{63} \end{bmatrix}$$

(iii) If the assumptions are met, use a MANOVA to evaluate the hypotheses.
(iv) The critical value, calculations, decision rules, and interpretation of MANOVA are described elsewhere (e.g., Marcoulides & Hershberger, 1997; Pedhazur, 1982; Stevens, 1992).

3.12.4.4 Linear Relations

3.12.4.4.1 Pearson correlation coefficient

The strength of the bivariate linear relation between variables can be examined with a correlation coefficient. The Pearson product moment correlation coefficient, r, is most often used for this purpose. In correlational analyses, a distinction need not be made between the independent and dependent variables. Unlike the univariate analyses that were discussed earlier, each subject contributes data in the form of paired coordinates (x, y) to the analysis.

Consider the data in Figure 12, which represents the relation between depression (on the x-axis) and functional impairment (on the y-axis). (These same data are presented in the first two columns of Table 14.) If there were even a modest linear relation between depression and functional impairment, one would expect minimal impairment in the mildly depressed and greater impairment in the severely depressed. It is a correlation coefficient that quantifies the strength of that linear relationship. (Once again, as a point of clarification, unlike much of the prior discussion in this chapter, here we consider two variables from one group of subjects.)

A brief digression may help with this explanation. It is quite likely that the two variables in a correlational analysis are measured on very different scales. For instance, in this sample, the depression and impairment means (21.00 and 14.45, respectively) and standard deviations (5.80 and 4.93, respectively) differ somewhat. For that reason, consider a scale-free metric, a standardized score, referred to as a "z-score," which has the property of a mean of zero and standard deviation of one. A transformation from the original scale ($X_i$) to the standardized scale ($Z_i$) is quite simple. The original variable ($X_i$) is transformed by subtracting the mean and dividing by the standard deviation:

Table 14 Hamilton depression ratings and Sheehan Disability Scale for CBT group.

          X_i     Y_i    X_i^2    Y_i^2   X_i*Y_i       x       y      x^2      y^2       xy     z_x     z_y  z_x*z_y
           12      11      144      121      132    -9.00   -3.45    81.00    11.90    31.05   -1.55   -0.70     1.09
           12       8      144       64       96    -9.00   -6.45    81.00    41.60    58.05   -1.55   -1.31     2.03
           13       6      169       36       78    -8.00   -8.45    64.00    71.40    67.60   -1.38   -1.72     2.36
           14      14      196      196      196    -7.00   -0.45    49.00     0.20     3.15   -1.21   -0.09     0.11
           15      11      225      121      165    -6.00   -3.45    36.00    11.90    20.70   -1.03   -0.70     0.72
           17       8      289       64      136    -4.00   -6.45    16.00    41.60    25.80   -0.69   -1.31     0.90
           18      16      324      256      288    -3.00    1.55     9.00     2.40    -4.65   -0.52    0.31    -0.16
           19       8      361       64      152    -2.00   -6.45     4.00    41.60    12.90   -0.34   -1.31     0.45
           21      15      441      225      315     0.00    0.55     0.00     0.30     0.00    0.00    0.11     0.00
           21      22      441      484      462     0.00    7.55     0.00    57.00     0.00    0.00    1.53     0.00
           21      12      441      144      252     0.00   -2.45     0.00     6.00     0.00    0.00   -0.50     0.00
           22      19      484      361      418     1.00    4.55     1.00    20.70     4.55    0.17    0.92     0.16
           24      15      576      225      360     3.00    0.55     9.00     0.30     1.65    0.52    0.11     0.06
           25      19      625      361      475     4.00    4.55    16.00    20.70    18.20    0.69    0.92     0.64
           26      16      676      256      416     5.00    1.55    25.00     2.40     7.75    0.86    0.31     0.27
           27      22      729      484      594     6.00    7.55    36.00    57.00    45.30    1.03    1.53     1.58
           27      18      729      324      486     6.00    3.55    36.00    12.60    21.30    1.03    0.72     0.75
           28      21      784      441      588     7.00    6.55    49.00    42.90    45.85    1.21    1.33     1.60
           29      11      841      121      319     8.00   -3.45    64.00    11.90   -27.60    1.38   -0.70    -0.97
           29      17      841      289      493     8.00    2.55    64.00     6.50    20.40    1.38    0.52     0.71

Total     420     289     9460     4637  6421.00     0.00    0.00   640.00   460.95   352.00    0.00    0.00    12.31
Mean    21.00   14.45   473.00   231.85   321.05     0.00    0.00    32.00    23.05    17.60    0.00    0.00     0.62
sd      5.804   4.925  239.790  142.538  165.468    5.803   4.925   28.038   22.404   23.171       1       1   0.8106

Note: $x_i = X_i - \bar{X}$; $y_i = Y_i - \bar{Y}$; $z_{x_i} = (X_i - \bar{X})/s_x$; $z_{y_i} = (Y_i - \bar{Y})/s_y$.
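The standardized columns of Table 14 can be reproduced with a short sketch. This is a minimal illustration rather than part of the original chapter: the variable names `hrsd` and `sds` and the helper `standardize` are hypothetical, and the sample standard deviation (dividing by n − 1) is assumed, which is what the sd row of the table implies.

```python
import statistics

# Depression (HRSD) and impairment (SDS) ratings copied from Table 14.
hrsd = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21,
        21, 22, 24, 25, 26, 27, 27, 28, 29, 29]
sds = [11, 8, 6, 14, 11, 8, 16, 8, 15, 22,
       12, 19, 15, 19, 16, 22, 18, 21, 11, 17]


def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample SD (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


z_x = standardize(hrsd)  # z_x column of Table 14
z_y = standardize(sds)   # z_y column of Table 14

# Standardized scores have a mean of zero and a standard deviation of one.
print(round(statistics.mean(z_x), 6), round(statistics.stdev(z_x), 6))
print(round(statistics.mean(z_y), 6), round(statistics.stdev(z_y), 6))

# First subject (HRSD = 12, SDS = 11), rounded as in the table.
print(round(z_x[0], 2), round(z_y[0], 2))
```

Rounded to two decimals, the first subject's standardized scores agree with the first row of the table.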

[Figure 12: scatterplot titled "Relationship between depression and functional impairment"; x-axis: Depression (HRSD), 0 to 35; y-axis: Impairment (SDS), 0 to 25.]

Figure 12 Plot of Hamilton Depression Ratings and Sheehan Disability Scale for CBT group.

$$Z_i = \frac{X_i - \bar{X}}{s_x}$$

The plot of standardized depression and functional impairment in Figure 13 looks identical to that of Figure 12, with the exception of the scale of the axes. (Note that in order to distinguish between the standardized $X_i$s and standardized $Y_i$s, Xs and Ys will be included in the subscript of the standardized scores: $Z_{x_i}$ and $Z_{y_i}$.) As a consequence of the standardization, the $Z_x$-axis is plotted at $Z_y = 0$, which corresponds to the unstandardized y-mean, and the $Z_y$-axis is plotted at $Z_x = 0$, which corresponds to the unstandardized x-mean.

As stated earlier, if there were a modest linear relation between depression and impairment, there would be minimal impairment in the mildly depressed and greater impairment in the severely depressed. Consider this positive linear relationship as presented in Figure 12. The points that are plotted tend to cluster either in quadrant I or in quadrant III. The former fall in the upper-right quadrant, where subjects tend to have both relatively elevated depression and relatively elevated functional impairment. Stated differently, both the x-values and y-values are above the mean in Figure 12, and the standardized $z_x$ and $z_y$ scores are both positive in Figure 13. On the other hand, the subjects who have both relatively lower depression and relatively lower functional impairment are in the lower-left quadrant. That is, both the x-values and y-values are below the mean. In contrast, a negative linear relationship would characterize an association in which a high x-value tends to correspond with a low y-value, and a low x-value corresponds with a high y-value (e.g., depression and work productivity).

Nunnally (1978) shows that the Pearson r can be calculated as approximately the mean of the products of z-scores:

$$r_{xy} = \frac{\sum_{i=1}^{N} Z_x Z_y}{N - 1}$$

With this algorithm it can be seen that if most subjects tend to have either high $Z_x$s and high $Z_y$s, or low $Z_x$s and low $Z_y$s, r will be large and positive, approaching 1.0. In contrast, if most subjects tend to have either high $Z_x$s with low $Z_y$s, or low $Z_x$s with high $Z_y$s, r will have a large absolute value, but will be negative, approaching −1.0.

Using the data that are presented in Table 14:

$$r = \frac{\sum_{i=1}^{N} z_x z_y}{N - 1} = \frac{12.31}{20 - 1} = 0.648$$

The presentation of the algorithm for the Pearson correlation coefficient (r) usually comes from a seemingly different perspective which will now be discussed. The correlation coefficient is derived from the covariance. First, consider that the sum of cross-products ($\sum xy$) is a sum of the product of both the x-deviations (x) and y-deviations (y) from their respective means:

$$\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y})$$

For instance, the first subject in Table 14 has an X-value (HRSD) of 12 and a Y-value (SDS) of 11. The respective deviation scores are:

[Figure 13: scatterplot; x-axis: Depression (standardized), −2 to 2; y-axis: Impairment (standardized), $z_y$, −2 to 2.]

Figure 13 Plot of standardized Hamilton Depression Ratings and standardized Sheehan Disability Scale for subjects in the CBT group (N = 20).
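The z-score formula for r discussed above (the sum of the z-score products divided by N − 1) can be checked against the Table 14 data with a brief sketch. The names `hrsd`, `sds`, `standardize`, and `sum_zz` are hypothetical and chosen for illustration only; the sample standard deviation (n − 1) is assumed.

```python
import statistics

# Depression (HRSD) and impairment (SDS) ratings copied from Table 14.
hrsd = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21,
        21, 22, 24, 25, 26, 27, 27, 28, 29, 29]
sds = [11, 8, 6, 14, 11, 8, 16, 8, 15, 22,
       12, 19, 15, 19, 16, 22, 18, 21, 11, 17]
n = len(hrsd)


def standardize(values):
    """Return z-scores based on the sample mean and sample SD (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


z_x = standardize(hrsd)
z_y = standardize(sds)

# Sum of the products of paired z-scores (last column total in Table 14).
sum_zz = sum(zx * zy for zx, zy in zip(z_x, z_y))

# Pearson r as the sum of z-score products over N - 1.
r = sum_zz / (n - 1)

print(f"sum of z-products = {sum_zz:.2f}, r = {r:.3f}")
```

Subjects whose two z-scores share a sign contribute positive products, which is why clustering in the upper-right and lower-left quadrants pushes r toward 1.0.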

$$x_1 = X_1 - \bar{X} = 12 - 21 = -9$$

$$y_1 = Y_1 - \bar{Y} = 11 - 14.45 = -3.45$$

The product of the deviation scores is called the cross-product. For the first subject, the cross-product is:

$$x_1 y_1 = (-9) \cdot (-3.45) = 31.05$$

Once again, the sum of cross-products ($\sum_{i=1}^{N} x_i y_i$ or $\sum xy$) is a sum of each subject's product of deviation scores. Notice that the sum of cross-products is a bivariate extension of the sum of squares that was described earlier, but the sum of cross-products incorporates two variables:

$$\sum x^2 = \sum (X_i - \bar{X})(X_i - \bar{X})$$

The covariance ($s_{xy}$), therefore, is a bivariate analogue of the variance. Consequently, the covariance is a function of the sum of cross-products:

$$s_{xy} = \frac{\sum xy}{n - 1}$$

Using the equations above, the covariance ($s_{xy}$) can also be calculated:

$$s_{xy} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

The covariance ($s_{xy}$) is a useful quantity that also measures the strength of the linear relationship between two variables. However, its interpretability is somewhat limited because it is not scale-free. As a consequence, unlike the correlation coefficient (r), which ranges from negative to positive (i.e., $-1.0 \le r \le 1.0$), the covariance does not have a fixed range.

Notice that the variance of a variable can be thought of as a special case of the covariance of that variable with itself:

$$s_{xx} = \frac{\sum (X_i - \bar{X})(X_i - \bar{X})}{n - 1}$$

A variance–covariance matrix is a convenient format to display the variances and covariances among several variables:

         X          Y          Z
X      s_X^2      s_XY       s_XZ
Y      s_YX       s_Y^2      s_YZ
Z      s_ZX       s_ZY       s_Z^2

The variances are on the main diagonal and the covariances are the remaining elements of the matrix. The matrix is symmetrical about the main diagonal. That is, in general, $s_{ij} = s_{ji}$, or in this case, $s_{XY} = s_{YX}$.

Using the data from Table 14, the covariance is:

$$s_{xy} = \frac{\sum xy}{n - 1} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{352}{20 - 1} = 18.526$$

Turning back to the correlation coefficient, r can be calculated using these quantities. It is equal to the covariance ($s_{xy}$) of $X_i$ and $Y_i$, divided by the product of their standard deviations:

$$r_{XY} = \frac{s_{XY}}{s_X \cdot s_Y} = \frac{18.526}{5.803 \cdot 4.925} = 0.648$$

As stated earlier, the correlation coefficient (r) can range from −1 to +1. The greater the absolute value of r, the greater the strength of the bivariate linear relationship.

The statistical significance of the population correlation coefficient, $\rho$, can be tested using the sample data. The null hypothesis, $H_0: \rho = 0$, is tested with the following ratio:

$$t = \frac{r}{s_r}$$

where $s_r$ is the standard error of r,

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

This t-ratio is tested with n − 2 df. Applying this to the data from Table 14, the standard error of r is:

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{1 - 0.648^2}{20 - 2}} = 0.1795$$

The t-ratio is:

$$t = \frac{r}{s_r} = \frac{0.648}{0.1795} = 3.61$$

The hypothesis testing steps for the Pearson correlation coefficient are as follows:
(i) State research question. Is there a linear relationship between depression and impairment?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \rho_{XY} = 0$
(b) $H_A: \rho_{XY} \neq 0$
(iii) Choose the t-ratio to evaluate the hypotheses about the Pearson correlation coefficient.
(iv) The critical value for the t-ratio with $\alpha$ = 0.05 and df = 18 is 2.101. If $t_{observed} > t_{critical}$, reject $H_0$.
(v) Data regarding depression and impairment are displayed in Table 14.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$. Since $t_{observed} > t_{critical}$ (i.e., 3.61 > 2.101), reject $H_0$.
(viii) Conclusion. Based on data from this sample, we conclude that there is a statistically significant linear relationship between depression and impairment.

Although the null hypothesis, $H_0: \rho = 0$, was rejected, be cautious not to overinterpret the results of such a test. The hypothesis simply examines the likelihood of observing a correlation of the magnitude seen in the sample, when the population correlation is zero (i.e., $\rho = 0$). Nevertheless, it ignores the meaningfulness of the magnitude of r. There is a disproportionate emphasis on the statistical significance of r while at the same time a tendency to ignore the magnitude of r in the research literature.

The magnitude of the correlation can be thought of as the proportion of variance in $Y_i$ explained by $X_i$. In fact, because in correlational analyses a distinction between independent variable and dependent variable is unnecessary, the magnitude of the correlation can also be thought of as the proportion of variance in $X_i$ explained by $Y_i$. In either case, the magnitude of the correlation coefficient r is calculated as the squared correlation coefficient, $r^2$, or, more commonly, $R^2$. Since the variability in human behavior and affect is not easily explained, meaningful $R^2$s in behavioral sciences can be quite small ($R^2 < 0.10$) and an $R^2 > 0.20$ may be considered substantial. Using the data from the hypothetical example in Table 14, 42% of the variance in functional impairment is explained by depression (i.e., $R^2 = r^2 = 0.648^2 = 0.420$).

3.12.4.4.2 Test for a difference between two independent correlation coefficients

The correlation coefficients of two groups can be compared by testing the following null hypothesis:

$$H_0: \rho_1 = \rho_2$$

The test is conducted in the following manner. First, each correlation coefficient ($r_k$) is transformed into a $Z_k$ using a procedure called the r to Z transformation. Note that the subscript k represents group k. Unlike earlier notation, in this context the subscript is not used to differentiate among subjects.

$$Z_k = 0.5 \cdot \ln\left(\frac{1 + r_k}{1 - r_k}\right)$$

where k = 1 or 2 and ln represents the natural logarithm. Next a standard error is calculated:

$$\sigma_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}$$

Finally, the Z-statistic is calculated:

$$Z = \frac{Z_1 - Z_2}{\sigma_{z_1 - z_2}}$$

For example, this could be used to address the question, "Is the strength of the linear relation between depression and functional impairment different in the subjects in the CBT and IPT groups?" Although IPT impairment data are not presented in a table, for illustration assume that for the IPT group r = 0.41. Comparison of the correlation coefficients for the CBT and IPT groups proceeds as follows:

$$Z_1 = 0.5 \cdot \ln\left(\frac{1 + r_1}{1 - r_1}\right) = 0.5 \cdot \ln\left(\frac{1 + .648}{1 - .648}\right) = .772$$

$$Z_2 = 0.5 \cdot \ln\left(\frac{1 + r_2}{1 - r_2}\right) = 0.5 \cdot \ln\left(\frac{1 + .41}{1 - .41}\right) = .436$$

$$\sigma_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} = \sqrt{\frac{1}{20 - 3} + \frac{1}{20 - 3}} = .343$$

$$Z = \frac{Z_1 - Z_2}{\sigma_{z_1 - z_2}} = \frac{.772 - .436}{.343} = .980$$

The assumption of the Z-test of correlation coefficients across two groups is that X and Y come from a bivariate normal distribution. This means that X comes from a normal distribution and that Y is normally distributed at every value of X.

The hypothesis testing steps for the Z-test to compare correlation coefficients across two groups are as follows:
(i) State research question. Is the strength of the linear relation between depression and functional impairment different for subjects in the CBT and IPT groups?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \rho_1 = \rho_2$
(b) $H_A: \rho_1 \neq \rho_2$
(iii) Choose the Z-test to evaluate the hypotheses.
(iv) The critical value for the Z-test with $\alpha$ = 0.05 is 1.96. If $z_{observed} > z_{critical}$, reject $H_0$.
(v) Data are displayed in Table 14 for the CBT group. The correlation for the IPT group was specified in the text as r = 0.41.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$. Since $z_{observed} < z_{critical}$ (i.e., .980 < 1.96), do not reject $H_0$.
(viii) Conclusion. There is not a statistically significant difference in the strength of the linear relationship between depression and impairment for those in the CBT and IPT groups in this study of treatment of depressive symptoms in patients with major depressive disorder.

There are several more points regarding the Pearson r that will be discussed briefly. First, the correlation of a variable with a constant is zero. One of the methods of calculating r that was described above includes the covariance ($s_{xy}$) in the numerator. Recall that the covariance is equal to the sum of cross-products over N − 1. Also, recall that the sum of the cross-products is $\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y})$. If Y is a constant, then for each subject $Y_i = \bar{Y}$ and thus the sum of cross-products is also equal to zero:

$$\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i - \bar{X}) \cdot 0 = 0$$

If the covariance is equal to zero, r = 0.

There is another useful fact about how r is affected by a constant ($c_i$). If the correlation between depression ($X_i$) and impairment ($Y_i$) is $r_{xy}$, and a constant is added to either $X_i$ or $Y_i$, the correlation is not affected. For instance, say that the 24-item HRSD ratings were to be collected for all acutely suicidal subjects in a study. However, in the interest of time the HRSD suicide item was not scored and not included in a modified HRSD 23-item total, call it $Y_{M_i}$. Technically, all of these subjects, being acutely suicidal, would rate 2 on the HRSD suicide item. Thus a 24-item total, $Y_i$, could be calculated for each subject simply by adding 2 to the subject's 23-item total. The addition of a constant ($c_i$) to a variable ($Y_i$) increases the value of the y-mean by $c_i$ (i.e., $\bar{Y}_M = \bar{Y} + c_i$) and it does not affect the standard deviation (i.e., $S_{Y_M} = S_Y$). As a consequence neither the covariance ($s_{xy}$) nor the sum of cross-products $\sum xy$ is changed, and thus the correlation coefficient is not affected: $r_{xy} = r_{xy_M}$.

Finally, correlation coefficients for pairs of several variables (x, y, and z) can be conveniently presented in the form of a correlation matrix:

         X         Y         Z
X       1.0       r_XY      r_XZ
Y       r_YX      1.0       r_YZ
Z       r_ZX      r_ZY      1.0

Notice that the elements of the correlation matrix are symmetric. That is, the correlation of $X_i$ with $Y_i$ ($r_{XY}$) is equal to the correlation of $Y_i$ with $X_i$ ($r_{YX}$). Furthermore, the correlation of a variable (e.g., $X_i$) with itself is equal to unity. All elements of the main diagonal of a correlation matrix are 1.0: $r_{XX} = r_{YY} = r_{ZZ} = 1.0$.

3.12.4.4.3 Spearman rank correlation

The Spearman rank correlation is a nonparametric correlation coefficient, which is computed using the ranks of the raw data. This is appropriate when the population data are believed to be non-normal. The Spearman rank correlation is especially useful when data contain an outlier (i.e., an extreme value),

[Figure 14: scatterplot; x-axis: Depression, 0 to 20; y-axis: Impairment, 0 to 20.]

Figure 14 Plot of Hamilton Depression Ratings and Sheehan Disability Scores.

[Figure 15: scatterplot; x-axis: Depression, 0 to 40; y-axis: Impairment, 0 to 35.]

Figure 15 Plot of Hamilton Depression Ratings and Sheehan Disability Scores.

which can strongly, and inappropriately, increase the magnitude of the Pearson correlation coefficient. For instance, compare the plots presented in Figures 14 and 15. The only difference in these plots is that Figure 15 includes one additional case, which is an outlier. That is, the additional case has extreme values on both the x- and y-variables. In Figure 14 there is no linear relationship between depression (X) and impairment (Y). Stated differently, knowledge of the level of depression ($X_i$) will not help in "guessing" the level of impairment ($Y_i$). In fact, for the data in Figure 14, the Pearson correlation coefficient between depression and impairment is r = 0.066. In stark contrast, with the addition of one outlier, the Pearson correlation coefficient between depression and impairment is r = 0.697 for the data in Figure 15. Yet in Figure 15, which includes one outlier, knowledge of the severity of a subject's depression would still not help in guessing the corresponding level of impairment.

The discrepancy between these two Pearson correlation coefficients illustrates that the Pearson correlation coefficient can be quite sensitive to an outlier. The influence of outliers on the correlation coefficient can be minimized by calculating the correlation on the rank-transformed data to examine the linear relationship between the ranks of $X_i$ and $Y_i$. The Spearman rank correlation is used for that purpose.

For example, the data displayed in Figure 15 are presented in Table 15. In addition to the actual depression ($X_i$) and impairment ($Y_i$) ratings, the ascending ranks of those ratings are presented in Table 15. The rank transformation procedure was described earlier in this chapter (see discussion of the Mann–Whitney test). Although this is a bivariate analysis, the rankings of the depression ($R_{X_i}$) and impairment ($R_{Y_i}$) ratings are performed independently. (The upper-case R is used to represent a ranked variable and is not to be confused with lower-case r, which is generally used to represent a Pearson correlation coefficient.) To calculate the Spearman rank correlation coefficient $r_s$, the difference ($d_i$) between the paired ranks for each subject is calculated ($d_i = R_{X_i} - R_{Y_i}$) and squared ($d_i^2$). The sum of the squared differences in ranks ($\sum d_i^2$) is used in the calculations of $r_s$. In the case of no tied ranks, $r_s$ is calculated:

$$r_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{n^3 - n} = 1 - \frac{6 \cdot 345}{14^3 - 14} = 0.242$$

Notice that the value of $d_i$ is minimized, and consequently the quantity $6 \sum_{i=1}^{N} d_i^2$ is minimized, as the pairs of ranks of $X_i$ and $Y_i$ for each subject correspond closely. Since the ratio in that equation is subtracted from unity, the value of $r_s$ is closer to 1.0 when that ratio is small; and $r_s$ is closer to zero when the ratio approaches 1.0. Thus, $r_s$ is a quantity that reflects the strength of a linear relationship.

An $r_s$ of .242 is certainly a more reasonable reflection of the strength of the linear relationship between $X_i$ and $Y_i$ in Table 15 than the Pearson r of .697.

The calculations for data involving tied data are more extensive (see, for example, Conover, 1980) and will not be discussed here. Incidentally, the data displayed in Figure 14 come from the first 13 of the 14 subjects that are presented in Table 15. The Pearson and Spearman rank correlation coefficients for the sample of 13 subjects are r = 0.066 and $r_s$ = 0.052, respectively.

In using the Spearman rank correlation coefficient, it is assumed that the data are randomly sampled, that the subjects are independent, and that the scale of measurement is at least ordinal.

The hypothesis testing steps for the Spearman rank correlation coefficient are as follows:
(i) State research question. Is there a relationship between depression and impairment such that high values of depression tend to be associated with high values of impairment? Alternatively, one might hypothesize that there is a negative relationship between two variables X and Y (i.e., $r_s < 0$), such that high values of X tend to be associated with low values of Y.
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \rho_{S_{XY}} = 0$
(b) $H_A: \rho_{S_{XY}} \neq 0$
(iii) Use a table of critical values for the Spearman rank correlation coefficient (for example, see Conover, 1980, p. 456). The critical value with a two-tailed $\alpha$ = 0.05 and N = 14 is 0.534. If $|r_{s\,observed}| > r_{s\,critical}$, reject $H_0$.
(iv) Data regarding depression and impairment are displayed in Table 15.
(v) Calculations were described above.
(vi) Decision regarding $H_0$. Since $|r_{s\,observed}| < r_{s\,critical}$ (i.e., 0.242 < 0.534), do not reject $H_0$.
(vii) Conclusion. Based on data from this sample, we conclude that there is not a statistically significant correlation between depression and impairment.

3.12.4.4.4 Simple linear regression analysis

If a distinction between the independent and dependent variables is made, a correlational analysis can be extended to a simple linear regression analysis. In a regression analysis, the dependent variable (y) is regressed on the independent variable (x). Graphically, the x-axis represents the independent variable and the y-axis represents the dependent variable. As in a correlational analysis, each subject contributes data in the form of paired coordinates (x, y) to the regression analysis. The regression equation, $\hat{Y} = a + bX$, includes two variables, Y and X, and two constants, the y-intercept (a) and the slope (b). After the constants have been calculated, the equation can be used to estimate a conditional value of Y. That is, the equation is used to estimate a value of Y, conditioned on a value of X (i.e., Y|X). The algorithms which are used to estimate the regression constants, a and b, employ the least-squares criterion, which minimizes the sum of the squared distances between the observed data ($Y_i$) and those estimated with the regression equation ($\hat{Y}_i$). $\hat{Y}$ is called predicted Y or Y-hat. The estimated values of $\hat{Y}$ are points along the regression line, which is defined by the regression equation.

For example, consider the relation between depression (X) and functional impairment (Y), where the latter is the dependent variable. The data from Table 14 are plotted in Figure 16 where depression is measured using the HRSD and functional impairment is measured with the Sheehan Disability Scale. The slope (b) represents the expected change in functional impairment that is associated with a one-unit change in depression. Stated differently, consider two subjects with depression ratings of $X_i$ and $X_i + 1$, respectively. If the former subject has an impairment rating of $Y_i$, the latter will have

Table 15 Depression ($X_i$) and impairment ($Y_i$) ratings of 14 patients with affective symptomatology.

          X_i   Rank(X_i)    Y_i   Rank(Y_i)    d_i   d_i^2
            8       1         19      13        -12    144
            9       2         14       7         -5     25
           10       3         13       6         -3      9
           11       4          9       2          2      4
           12       5         12       5          0      0
           13       6         10       3          3      9
           14       7         17      10         -3      9
           15       8         18      12         -4     16
           16       9          8       1          8     64
           17      10         11       4          6     36
           18      11         16       9          2      4
           19      12         15       8          4     16
           20      13         17      10          3      9
           36      14         31      14          0      0
Total     218     105        210     104          1    345

an expected impairment rating of $Y_i + b$. The intercept (a) represents the functional impairment that is expected with a hypothetical depression rating of 0. Of course, an HRSD of 0 is not realistic in a clinical trial for subjects meeting criteria for major depressive disorder, but it represents the point along the y-axis where an extrapolated regression line would cross.

The two regression constants are estimated as follows (using the summary data that are presented in Table 14):

$$b = \frac{\sum_{i=1}^{N} xy}{\sum_{i=1}^{N} x_i^2} = \frac{352}{640} = 0.55$$

$$a = \bar{Y} - b\bar{X} = 14.45 - 0.55 \cdot 21.0 = 2.90$$

Thus, the regression equation is:

$$\hat{Y} = a + bX = 2.90 + 0.55X$$

This indicates that a one-unit increase on the HRSD is associated with a 0.55 unit increase on the Sheehan Disability Scale. A regression line can be graphed by plotting two paired coordinates, (0, a) and $(\bar{X}, \bar{Y})$, and drawing a line between the two points. Applying this to the data from Table 14, the regression line might be graphed by plotting: (0, 2.9) and (21, 14.45). However, it is inappropriate to extrapolate beyond the range of the sample data. This is because the relation between X and Y might look entirely different for values beyond the range of the data. In the example, the association between functional impairment and depression might be very different for those with minimal depressive symptomatology than for those meeting criteria for major depressive disorder.

In lieu of extending the regression line beyond the range of the sample data, consider the minimum and maximum HRSD ($X_i$) ratings. In this sample, the HRSD ratings range from 12 to 29. The corresponding predicted functional impairment (i.e., $\hat{Y}_i$) for the lowest and highest HRSD ratings in this sample are estimated by substituting the respective HRSD values in the regression equation:

$$\hat{Y} = 2.90 + 0.55X = 2.90 + 0.55(12) = 9.50$$

$$\hat{Y} = 2.90 + 0.55X = 2.90 + 0.55(29) = 18.85$$

Consequently, the regression line, which is presented in Figure 16, will go from (12, 9.50) for the least severely ill to (29, 18.85) for the most severely ill. Note that the regression line passes through the intersection of the X and Y means, $(\bar{X}, \bar{Y})$, which in this case is (21, 14.45).

Recall that the proportion of variance in functional impairment that is explained by depression was presented above in the discussion of the correlation coefficient as $R^2 = r^2 = 0.648^2 = 0.420$. These are the same data and thus the same value of $R^2$ applies. As with correlational analyses, $R^2$ is a measure of the magnitude of the association between $Y_i$ and $X_i$ in simple linear regression. One of the descriptive statistics that was discussed earlier in this chapter is a measure of variability called the sum of squared deviations, in this case, deviations from the Y-mean:

$$\sum_{i=1}^{N} y_i^2 = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$$

In the context of a regression analysis, this quantity can be partitioned into two components:

$$\sum y_i^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

or

$$\sum y_i^2 = SS_{reg} + SS_{res}$$

The former of the two components, the sum of squares regression, or $SS_{reg}$, represents the sum of squares in Y that is accounted for by X. If one were attempting to guess the level of impairment ($Y_i$), $SS_{reg}$ can be thought of as the improvement over guessing the mean impairment ($\bar{Y}$) for each subject, and instead using the information known from the sample data about the $X_i$–$Y_i$ relation. $SS_{reg}$ is estimated as follows (using the summary data from the total row in Table 14):

[Figure 16: scatterplot with fitted regression line, titled "Relationship between depression and functional impairment"; x-axis: Depression (HRSD), 0 to 35; y-axis: Impairment (SDS), 0 to 25.]

Figure 16 Relationship between depression and functional impairment.
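The regression constants and endpoint predictions reported for the Table 14 data can be checked with a brief sketch. It follows the deviation-score formulas given in the text (slope as the sum of cross-products over the sum of squared x-deviations, intercept from the two means); the names `hrsd`, `sds`, and `predict` are hypothetical and used only for illustration.

```python
# Depression (HRSD) and impairment (SDS) ratings copied from Table 14.
hrsd = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21,
        21, 22, 24, 25, 26, 27, 27, 28, 29, 29]
sds = [11, 8, 6, 14, 11, 8, 16, 8, 15, 22,
       12, 19, 15, 19, 16, 22, 18, 21, 11, 17]
n = len(hrsd)

mean_x = sum(hrsd) / n
mean_y = sum(sds) / n

# Sum of cross-products and sum of squared x-deviations.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hrsd, sds))
sum_xx = sum((x - mean_x) ** 2 for x in hrsd)

b = sum_xy / sum_xx       # slope
a = mean_y - b * mean_x   # y-intercept


def predict(x):
    """Predicted impairment (Y-hat) for a depression rating x."""
    return a + b * x


print(f"Y-hat = {a:.2f} + {b:.2f}X")
# Predictions at the lowest and highest HRSD ratings in the sample.
print(f"{predict(min(hrsd)):.2f}, {predict(max(hrsd)):.2f}")
# The fitted line passes through the point of means.
print(f"{predict(mean_x):.2f} vs mean Y = {mean_y:.2f}")
```

Evaluating the line only at the observed minimum and maximum of X keeps the illustration within the range of the sample data, in keeping with the caution against extrapolation.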

\[ SS_{reg} = \sum (\hat{Y} - \bar{Y})^2 = \frac{(\sum xy)^2}{\sum x^2} = \frac{352^2}{640} = 193.6 \]

The sum of squares residual ($SS_{res}$) is the latter component of $\sum y^2$ and represents the sum of squared deviations of Y that is not accounted for by the X–Y relation:

\[ SS_{res} = \sum (Y_i - \hat{Y}_i)^2 = \sum y^2 - SS_{reg} = 460.95 - 193.6 = 267.35 \]

$R^2_{Y \cdot X}$ is the quantity that represents the magnitude of the association between Y and X. It represents the proportion of variance in Y that is accounted for by X:

\[ R^2_{Y \cdot X} = \frac{SS_{reg}}{SS_{reg} + SS_{res}} = \frac{193.6}{193.6 + 267.35} = 0.42 \]

Being a proportion, $R^2_{Y \cdot X}$ can range from 0 to 1; $0 \le R^2_{Y \cdot X} \le 1$. $R^2_{Y \cdot X}$ can also be estimated as follows:

\[ R^2_{Y \cdot X} = \frac{(\sum xy)^2}{\sum x^2} \cdot \frac{1}{\sum y^2} = \frac{352^2}{640} \cdot \frac{1}{460.95} = 0.42 \]

The first component is the numerator of the prior equation ($SS_{reg}$) and the second component is the reciprocal of its denominator ($SS_{reg} + SS_{res} = \sum y^2$). Presented as one ratio, using the data from Table 14:

\[ R^2_{Y \cdot X} = \frac{(\sum xy)^2}{\sum x^2 \sum y^2} = \frac{352^2}{640 \cdot 460.95} = 0.420 \]

which is the square of the correlation between Y and X: $R^2_{Y \cdot X} = r^2_{YX} = 0.648^2 = 0.420$.

Thus 42% of the variance in Y is accounted for by X. Although this seems to be a substantial proportion of variance, a statistical test is used to determine whether this is more than might be expected by chance. The null hypothesis, $H_0: \rho^2 = 0$, is tested with an F-test:

\[ F = \frac{R^2 / k}{(1 - R^2)/(N - k - 1)} \]

where k is equal to the number of independent variables, k = 1 in this case:

\[ F = \frac{.420 / 1}{(1 - .420)/(20 - 1 - 1)} = 13.034 \]

The degrees of freedom for the F-test are k and (N − k − 1), or with these data, 1 and 18. The corresponding critical F is F(1, 18; α = .05) = 4.41. The null hypothesis is rejected because 13.034 > 4.41. In other words, the amount of variance in functional impairment that is explained by depression is statistically significant.

The statistical significance of the regression coefficient b can also be tested using the null hypothesis $H_0: \beta = 0$. Consider that if the null hypothesis is not rejected, the regression coefficient could not be considered to be different than zero. Then if b = 0, the X-term would fall out of the regression equation, and $\hat{Y}$ would not be a function of X. The statistical significance of the regression coefficient b is tested with a t-ratio, which is equal to the regression coefficient divided by its standard error:

\[ t = \frac{b}{s_b} \]

where the standard error of the regression coefficient $s_b$ is estimated as:

\[ s_b = \sqrt{\frac{SS_{res}/(N - k - 1)}{\sum_{i=1}^{N} x_i^2}} \]

Notice that since $s_b$ is in the denominator of the t-ratio, a larger value of $s_b$ reduces the value of t. Holding other aspects of the analysis constant, if $SS_{res}$ is large, which can result from a small X–Y relation, then $s_b$ will be large, and t will be small. In contrast, with a larger sample size (N), $s_b$ will be smaller and, as a consequence, t will be larger.

\[ SS_{res} = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \]

This quantity incorporates the differences between the observed ($Y_i$) and predicted ($\hat{Y}_i$) values of Y. In lieu of calculating each of the predicted values, $SS_{res}$ can be calculated using quantities that have been calculated above:

\[ SS_{res} = (1 - R^2_{Y \cdot X}) \sum y^2 = (1 - .420) \cdot 460.95 = 267.351 \]

The standard error of the regression coefficient is given as:

\[ s_b = \sqrt{\frac{SS_{res}/(N - k - 1)}{\sum_{i=1}^{N} x_i^2}} = \sqrt{\frac{267.351/(20 - 1 - 1)}{640}} = .152 \]

and the t-ratio as:

\[ t = \frac{b}{s_b} = \frac{0.55}{0.152} = 3.61 \]

In a simple linear regression analysis there is one independent variable. Thus the tests of significance of $R^2_{Y \cdot X}$ and b are synonymous. In fact, the square root of F is equal to t:

\[ \sqrt{13.034} = 3.61 \]

The assumptions of simple linear regression analysis include:
(i) independence across observations (i.e., subjects);
(ii) the independent variable is fixed;
(iii) the error term (e) is not correlated with the independent variable (X);
(iv) the variance of the error terms is constant across all values of X. This is referred to as homoscedasticity.

The hypothesis testing steps for a simple linear regression analysis are as follows:
(i) State research question. Is there a linear relationship between depression and impairment?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) Hypotheses about the regression coefficient (b):
$H_0: \beta_{XY} = 0$
$H_A: \beta_{XY} \ne 0$
(b) Hypotheses about the population $R^2$ (i.e., $\rho^2_{XY}$):
$H_0: \rho^2_{XY} = 0$
$H_A: \rho^2_{XY} \ne 0$
(iii) Choose the t-ratio to evaluate the hypotheses about the regression coefficient (b). Use an F-test to test hypotheses about $R^2$.
(iv) The critical value for the t-ratio with two-tailed α = 0.05 and df = 18 is 2.101. If $t_{observed} > t_{critical}$, reject $H_0: \beta_{XY} = 0$. The critical value for the F-test with α = 0.05 and df = 1, 18 is 4.41. If $F_{observed} > F_{critical}$, reject $H_0: \rho^2_{XY} = 0$.
(v) Data regarding depression and impairment are given in Table 14.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$ about (b). Observed t > critical t (i.e., 3.61 > 2.101), thus reject $H_0$. Decision regarding $H_0$ about $R^2$. Observed F > critical F (i.e., 13.034 > 4.41), thus reject $H_0$.
(viii) Conclusion. Based on data from this sample, we conclude that there is a statistically significant linear relationship between depression and impairment.

Linear regression is a useful statistical procedure for examining a bivariate relation. Unlike a correlational analysis in which $r_{yx} = r_{xy}$, linear regression is not a symmetric procedure. That is, the regression of $Y_i$ on $X_i$ will not result in the same regression constants (a and b) as the regression of $X_i$ on $Y_i$.

3.12.4.4.5 Multiple linear regression analysis

A simple linear regression analysis involves the regression of a dependent variable on one independent variable. Multiple linear regression analysis extends the statistical model such that one dependent variable is regressed on multiple independent variables. Multiple linear regression analysis is a general statistical model that can evaluate both dimensional and categorical independent variables. In fact it can test main effects and interactions of the ANOVA model and can be used to control for variables (i.e., covariates) if certain assumptions are fulfilled (see Pedhazur, 1982, Chap. 12).

Pedhazur (1982) distinguishes between the use of regression for prediction and explanation. The former has to do with estimating the dependent variable given a value of the independent variable (i.e., $Y_i$ given $X_i$), whereas the latter focuses on the proportion of variance in the dependent variable that is explained by the independent variables ($R^2$). Although this chapter will not discuss multiple linear regression analysis in detail, several comprehensive examinations of multiple linear regression analysis are available (Aiken & West, 1991; Cohen & Cohen, 1983; Cook & Weisberg, 1982; Neter, Wasserman, & Kutner, 1989; Pedhazur, 1982; Pedhazur & Schmelkin, 1991).

3.12.4.5 Between-subject Designs: Categorical Dependent Variables

The statistical procedures that have been discussed until this point in the chapter are appropriate for continuous or ordinal dependent variables. Nevertheless, it is not unusual for the dependent variable to be categorical. There are a multitude of examples: pass or fail, sick or well, and live or die. There are many statistical procedures for categorical data and comprehensive texts devoted exclusively to categorical data analysis (see for example Agresti, 1996; Bishop, Feinberg, & Holland, 1975; Fleiss, 1981).

Consider the data from the hypothetical clinical trial that have been used throughout this chapter (originally presented in Tables 3 and 5). The HRSD is a continuous dependent variable that could be dichotomized such that the subjects with an HRSD rating of 14 or less are classified as "responders" and all other subjects are classified as "nonresponders." (For readers familiar with the HRSD, this is a rather crude and quite unconventional threshold, used only as an example for data analysis and not for the study of psychopathology.) Both the independent variable (treatment: CBT vs. IPT) and the dependent variable (response: yes vs. no) are categorical variables. Although there is some literature discouraging categorization of continuous measures because of the loss of precision and statistical power, in the clinical setting decisions often must be dichotomous.

3.12.4.5.1 Chi-square test

The chi-square ($\chi^2$) test is generally used to compare rates or proportions across groups (Fleiss, 1981). In the example presented in Table 16, the $\chi^2$ test is used to compare the response rates, and determine if rates as diverse as 20% and 40% can be expected by chance. This format of data display is referred to as a contingency table. Initially, notice that the IPT group had twice the response rate of the CBT group (40% vs. 20%). Also note the degree to which the rate for each group differs from that of the marginal (30%), which is presented in the margins of Table 16, and represents the response rate of the sample as a whole (N = 40).

Table 16 Treatment response status of subjects in CBT and IPT groups.

                Nonresponse   Response   Total
CBT             16 (a)        4 (b)      20
  [row %]       [80%]         [20%]      [100%]
  {column %}    {57.1%}       {33.3%}    {50%}
IPT             12 (c)        8 (d)      20
  [row %]       [60%]         [40%]      [100%]
  {column %}    {42.9%}       {66.7%}    {50%}
Total           28            12         40
  [row %]       [70%]         [30%]      [100%]
  {column %}    {100%}        {100%}     {100%}

The null hypothesis is that the population rates do not differ, $H_0: \pi_1 = \pi_2$. That hypothesis is tested by comparing the observed cell frequencies with those expected if treatment and response were independent. It is tested in the following manner:

\[ \chi^2 = n_{..} \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left( |p_{ij} - p_{i.} p_{.j}| - 1/(2 n_{..}) \right)^2}{p_{i.} p_{.j}} \]

where $p_{ij}$ is the proportion of the entire sample that is in row i and column j (i.e., cell ij). Note that a dot (.) in the subscript represents a dimension (row or column) that is ignored. For example, the marginal value $p_{i.}$ represents the proportion of subjects in row i and $p_{.j}$ represents the proportion of subjects in column j. The product of those marginal proportions is the proportion of subjects that would be expected in the cell if the row and column variables were independent of each other. This product is referred to as the expected value. The $\chi^2$ test essentially compares the observed and expected values. Its application to the data from Table 16 follows.
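The comparison of observed and expected proportions can also be sketched in code. The following is a minimal illustration, assuming the cell frequencies of Table 16 and using only the Python standard library; the function name `chi_square_fourfold` is hypothetical and is not part of any statistical package.

```python
def chi_square_fourfold(table):
    """Continuity-corrected chi-square for a fourfold (2 x 2) table of counts.

    Follows the proportion-based formula in the text: each observed cell
    proportion is compared with the product of its marginal proportions.
    """
    n = sum(sum(row) for row in table)
    row_props = [sum(row) / n for row in table]        # p_i.
    col_props = [sum(col) / n for col in zip(*table)]  # p_.j

    total = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            observed = count / n                       # p_ij
            expected = row_props[i] * col_props[j]     # p_i. * p_.j
            total += (abs(observed - expected) - 1 / (2 * n)) ** 2 / expected
    return n * total


# Rows: CBT, IPT. Columns: nonresponse, response (Table 16).
table_16 = [[16, 4], [12, 8]]
chi2 = chi_square_fourfold(table_16)
print(round(chi2, 2))  # 1.07
```

The result is then compared with the critical value for 1 degree of freedom (3.84 at a two-tailed alpha of 0.05).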

\[ \chi^2 = 40 \left[ \frac{\left( |.40 - .50 \cdot .70| - 1/(2 \cdot 40) \right)^2}{.50 \cdot .70} + \frac{\left( |.10 - .50 \cdot .30| - 1/(2 \cdot 40) \right)^2}{.50 \cdot .30} + \frac{\left( |.30 - .50 \cdot .70| - 1/(2 \cdot 40) \right)^2}{.50 \cdot .70} + \frac{\left( |.20 - .50 \cdot .30| - 1/(2 \cdot 40) \right)^2}{.50 \cdot .30} \right] = 1.07 \]

Another algorithm for a fourfold table (i.e., a two by two contingency table) is somewhat simpler, using cell and marginal frequencies, but no proportions:

\[ \chi^2 = \frac{n_{..} \left( |n_{11} n_{22} - n_{12} n_{21}| - \tfrac{1}{2} n_{..} \right)^2}{n_{1.} n_{2.} n_{.1} n_{.2}} \]

With the data presented in Table 16:

\[ \chi^2 = \frac{40 \left( |16 \cdot 8 - 4 \cdot 12| - \tfrac{1}{2} \cdot 40 \right)^2}{20 \cdot 20 \cdot 28 \cdot 12} = 1.07 \]

Assumptions of the $\chi^2$ test include:
(i) The samples are independent of each other.
(ii) The subjects within each group are independent of each other.
(iii) No subject can be classified in more than one category.
(iv) No cell has an expected value less than five.

The hypothesis testing steps for the $\chi^2$ test are as follows:
(i) State research question. Do CBT and IPT have similar effects for the treatment of patients with depression?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \pi_1 = \pi_2$
(b) $H_A: \pi_1 \ne \pi_2$
(iii) If the assumptions are met, use the $\chi^2$ test to evaluate the hypotheses.
(iv) The critical value for the $\chi^2$ test with two-tailed α = 0.05 and df = 1 is 3.84. (Note that df is equal to the product of (# rows − 1) and (# columns − 1).) If $\chi^2_{observed} > \chi^2_{critical}$, reject $H_0$.
(v) Data are displayed in Table 16.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$: $\chi^2_{observed} < \chi^2_{critical}$ (i.e., 1.07 < 3.84), thus do not reject $H_0$.
(viii) Conclusion. In this sample of patients with major depressive disorder, those who were treated with IPT did not have different response rates than those who were treated with CBT.

The null hypothesis is not rejected because the difference in response rates across treatment groups (20% vs. 40%) is not more than is expected by chance. The chi-square test can also be used for two-way contingency tables that have more than two rows and two columns.

3.12.4.5.2 Fisher's Exact test

When at least one expected frequency in a fourfold table is less than five, the Fisher's Exact test is more appropriate than the $\chi^2$ test. The test assesses the probability of the cell frequencies, given the marginal frequencies. Stated differently, given the respective sample sizes and the response rates of the sample as a whole, the Fisher's Exact test evaluates the probability of the four cell frequencies. The algorithm, which is based on the hypergeometric probability distribution, is:

\[ p = \frac{n_{1.}! \, n_{2.}! \, n_{.1}! \, n_{.2}!}{n_{..}! \, n_{11}! \, n_{12}! \, n_{21}! \, n_{22}!} \]

where "n-factorial" (n!) is

\[ n! = n \cdot (n - 1) \cdot (n - 2) \cdot \ldots \cdot 1 \]

and, by definition, 0! = 1.

The algorithm for p calculates the probability of getting precisely the observed configuration of frequencies. There were no expected frequencies that were less than five in Table 16. For that reason, the data in Panel A of Table 17 will be used to illustrate the calculations involved in a Fisher's Exact test. These data, which will be referred to as the observed data, come from a hypothetical pilot study comparing electroconvulsive therapy (ECT) and psychodynamic psychotherapy (PP) for the treatment of dysthymia. The response rates were 93.33% for ECT and 85.0% for PP.

Applying the above algorithm:

\[ p = \frac{15! \cdot 20! \cdot 4! \cdot 31!}{35! \cdot 1! \cdot 14! \cdot 3! \cdot 17!} = 0.3266 \]

That is, given the observed marginals, about one of every three random configurations will be that which is contained in Panel A of Table 17. Consider that there are only four other configurations of cell frequencies that conform to the marginals of this contingency table, which are also presented in Table 17, Panels B–E. Fisher's Exact test involves calculating the probability of each configuration that conforms to the marginals. Applying the equation that was presented above, the respective probabilities of those tables are presented in the final column of Table 17. (Note that the sum of these four probabilities, and that of the observed table, 0.3266, is 1.00. That is, the probability of any of the possible outcomes is 1.00.)

For the Fisher's Exact test, one determines which of the configurations have a probability that is less than or equal to the probability associated with the observed contingency table, configuration #1.

Table 17 Treatment by responder status.

          Configuration   Treatment   Nonresponder   Responder   Total   p
Panel A   #1              ECT         1              14          15      0.3266
                          PP          3              17          20
                          Total       4              31          35

Panel B   #2              ECT         0              15          15      0.0925
                          PP          4              16          20
                          Total       4              31          35

Panel C   #3              ECT         2              13          15      0.3810
                          PP          2              18          20
                          Total       4              31          35

Panel D   #4              ECT         3              12          15      0.1738
                          PP          1              19          20
                          Total       4              31          35

Panel E   #5              ECT         4              11          15      0.0261
                          PP          0              20          20
                          Total       4              31          35
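The probabilities in the final column of Table 17 can be reproduced with a short sketch. It assumes the factorial formula given for Fisher's Exact test and the Panel A frequencies, and it uses only the Python standard library; the function names `configuration_probability` and `fisher_exact_fourfold` are hypothetical.

```python
from math import factorial


def configuration_probability(n11, n12, n21, n22):
    """Hypergeometric probability of one fourfold table, given its marginals."""
    r1, r2 = n11 + n12, n21 + n22          # row totals
    c1, c2 = n11 + n21, n12 + n22          # column totals
    n = r1 + r2
    numerator = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    denominator = (factorial(n) * factorial(n11) * factorial(n12)
                   * factorial(n21) * factorial(n22))
    return numerator / denominator


def fisher_exact_fourfold(n11, n12, n21, n22):
    """Return (p of observed table, p of every conforming table, p*)."""
    r1, r2 = n11 + n12, n21 + n22
    c1 = n11 + n21
    p_observed = configuration_probability(n11, n12, n21, n22)

    # Enumerate every table that shares the observed row and column totals.
    probabilities = []
    for a in range(max(0, c1 - r2), min(r1, c1) + 1):
        b, c = r1 - a, c1 - a
        d = r2 - c
        probabilities.append(configuration_probability(a, b, c, d))

    # p* sums the tables that are no more probable than the observed one
    # (a small tolerance guards against floating-point ties).
    p_star = sum(p for p in probabilities if p <= p_observed + 1e-12)
    return p_observed, probabilities, p_star


# Panel A of Table 17: ECT (1 nonresponder, 14 responders), PP (3, 17).
p_observed, probabilities, p_star = fisher_exact_fourfold(1, 14, 3, 17)
print(round(p_observed, 4))                   # 0.3266
print([round(p, 4) for p in probabilities])   # Panels B, A, C, D, E in order
print(round(p_star, 3))                       # 0.619
```

The enumeration runs over the first cell only, because the three remaining cells are fixed once the marginals and that cell are known.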

In this case, the probabilities of configurations #2 (p = 0.0925), #4 (p = 0.1738), and #5 (p = 0.0261) are all less than that of the observed configuration #1 (p = 0.3266). These probabilities, and that of the observed, are summed and compared to the alpha level:

p* = 0.0925 + 0.1738 + 0.0261 + 0.3266 = 0.6190

Assumptions of the Fisher's Exact test are as follows:
(i) The data are in a contingency table with two rows and two columns.
(ii) The samples are independent of each other.
(iii) The subjects within each group are independent of each other.
(iv) No subject can be classified in more than one category.
(v) The row and column totals are given. They are not random.

The hypothesis testing steps for the Fisher's Exact test are as follows:
(i) State research question. Do ECT and PP have similar effects on response?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \pi_1 = \pi_2$
(b) $H_A: \pi_1 \ne \pi_2$
(iii) If assumptions are fulfilled, use Fisher's Exact test to evaluate the hypotheses.
(iv) The critical value for the Fisher's Exact test with two-tailed α = 0.05 is 0.05. If p* ≤ 0.05, reject $H_0$.
(v) Data are displayed in Table 17.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$: $p^*_{observed} > p_{critical}$ (i.e., 0.619 > 0.05), thus do not reject $H_0$.
(viii) Conclusion. Dysthymic subjects treated with ECT and PP do not have differential response rates.

There is a growing body of literature on the use of exact tests, such as that just described, in other contexts (e.g., Agresti, 1992; Mehta & Patel, 1997). These procedures are based on minimal distributional assumptions. As a consequence they can be applied in situations when the use of other procedures is more tenuous, particularly with small data sets. Because exact procedures are often computationally intensive, their use was rare until software that implemented the exact algorithms was available. StatXact software (Mehta & Patel, 1997) is designed to conduct exact tests for a wide variety of types of data. It is available as a stand-alone package and also as an add-on module with statistical software such as SPSS and SAS.

3.12.4.5.3 Logistic regression

Logistic regression analysis is used to examine the association of (categorical or continuous) independent variable(s) with one dichotomous dependent variable. This is in contrast to linear regression analysis in which the dependent variable is a continuous variable. The discussion of logistic regression in this chapter is brief. Hosmer and Lemeshow (1989) provide a comprehensive introduction to logistic regression analysis.

Consider an example in which logistic regression could be used to examine the research question, "Is a history of suicide attempts associated with the risk of a subsequent (i.e., prospectively observed) attempt?" The logistic regression model compares the odds of a prospective attempt in those with and without prior attempts. The ratio of those odds is called the odds ratio. A logistic regression does not analyze the odds, but a natural logarithmic transformation of the odds, the log odds. Although the calculations are more complicated when there are multiple independent variables, computer programs can be used to perform the analyses. However, because of the logarithmic transformation of the odds ratio, the interpretation of results from the computer output is not necessarily straightforward. Interpretation requires a transformation back to the original scale by taking the inverse of the natural log of the regression coefficient, which is called exponentiation. The exponentiated regression coefficient represents the strength of the association of the independent variable with the outcome. More specifically, it represents the increase (or decrease) in risk of the outcome that is associated with the independent variable. The exponentiated regression coefficient represents the difference in risk of the outcome (e.g., suicide attempt) for two subjects who differ by one point on the independent variable. In this case, that is the difference between those with and without a history of attempts (i.e., when history of attempts is coded: 0 = no and 1 = yes).

The logistic regression model can be extended to include several independent variables (i.e., hypothesized risk factors). For instance, are history of attempts, severity of depression, and employment status risk factors for suicidal behavior, controlling for diagnosis, age, and gender? Each odds ratio from such a model represents the change in risk of the outcome (i.e., a suicide attempt) that is associated with the independent variable, controlling for the other independent variables.

3.12.4.5.4 Survival analysis

In a study such as the example used above, the probability of the event is somewhat constrained by the follow-up time of the study. That is, the number of subjects with a suicide attempt will very likely be much smaller in a six-week study than in a six-year study. Survival analysis is a statistical technique that is used to examine the risk of an event over time. As with logistic regression analysis, this technique is appropriate for a dichotomous dependent variable. For example, survival analyses can be used to examine the risk of a suicide attempt over the course of a follow-up study. The survival model incorporates the differential risk period of the subjects. For that reason, it makes use of the available data in a follow-up study. For example, in a 15-year follow-up study of subjects with affective disorders, some of the subjects will be followed for the entire study. Yet others will be followed for as short as a few weeks and others for several years, but not the duration of the entire 15-year study. The Kaplan–Meier product limit estimate (Kaplan & Meier, 1958) is used to estimate the proportion of subjects who have not had a postbaseline suicide attempt (i.e., the proportion surviving without a suicide attempt). Patients who do not attempt suicide by the end of follow-up, or who drop out of the follow-up study, are classified as censored. The Kaplan–Meier (1958) estimate is calculated as the probability of the event up to a given point in the follow-up. In this example, say that there are monthly assessments in which the investigators ask about psychopathology, suicide attempts, and treatment during the preceding month. The researcher could estimate the proportion of subjects who remained suicide-free after one month, two months, and so on to the end of follow-up.

There are several survival analytic procedures. For instance, the cumulative probability of an event (i.e., a suicide attempt) can be compared across groups using a logrank test (Peto & Peto, 1972). The survival model can be further extended to incorporate many independent variables that are hypothesized to be associated with risk of the event using a Cox proportional hazards regression model (Cox, 1972). There are several comprehensive texts on survival analytic techniques (e.g., Collett, 1994; Kalbfleisch & Prentice, 1980; Lawless, 1982; Lee, 1980; Parmar & Machin, 1995).

3.12.4.6 Within-subject Designs: Dimensional and Ordinal Dependent Variables

Subjects can serve as their own controls by being successively exposed to each treatment condition. The major advantage is that a design such as this reduces the between-subject variability that is inherent in a between-subjects design. As a consequence, a smaller number of subjects is needed in such a study. However, this design is inappropriate for many types of experimental intervention. For instance, the approach cannot be used in most studies of learning, unless it is feasible to get subjects to unlearn a task that has been mastered. The same holds true for acute treatment of psychopathology. Once a subject is well, the efficacy of subsequent treatments cannot be evaluated.

3.12.4.6.1 Paired t-test

The paired t-test is used for a within-subject comparison of dimensional dependent variables. For example, in a one-group design, the pretreatment HRSD scores could be compared with the post-treatment HRSD scores. (Note that there are many reasons that a multigroup design is preferable, but that is beyond the scope of this presentation. See Campbell & Stanley, 1963, for a comprehensive discussion of this.) The null hypothesis is $H_0: \mu_{pre} = \mu_{post}$, whereas the alternative hypothesis is $H_A: \mu_{pre} \ne \mu_{post}$. Stated differently, the data that are examined are the paired differences ($d_i = X_i - Y_i$) between pretreatment ($X_i$) and post-treatment ($Y_i$) for each subject. A mean difference ($\bar{d}$) and standard deviation of differences ($s_d$) are calculated.

For illustration, Table 18 presents HRSD scores for a group of 20 subjects before and after CBT treatment of depression. (At this point, focus only on the first three columns of Table 18.) Prior to treatment, the mean HRSD rating was 21.0 (sd = 5.8). After treatment, the mean HRSD rating was reduced to 15.95 (sd = 7.16). The paired t-test, which can generally be thought of as a test of whether the mean difference is equal to zero, is calculated as follows:

\[ t = \frac{\bar{d}}{s_{\bar{d}}} \quad \text{where} \quad s_{\bar{d}} = \sqrt{\frac{s_d^2}{n}} \]

and $s_d^2$ is the variance of the differences (i.e., the square of the standard deviation of the differences):

\[ s_d^2 = 5.4626^2 = 29.84 \]

\[ s_{\bar{d}} = \sqrt{\frac{s_d^2}{n}} = \sqrt{\frac{29.84}{20}} = 1.22 \]

\[ t = \frac{\bar{d}}{s_{\bar{d}}} = \frac{5.05}{1.22} = 4.13 \]

In a paired t-test, it is assumed that
(i) the data consist of correlated pairs of observations (i.e., paired data);
(ii) the subjects are randomly sampled;
(iii) the differences between each pair of observations are from a normally distributed population of differences.

The hypothesis testing steps for the paired t-test are as follows:
(i) State research question. Is there a difference in depression severity from pre- to post-CBT treatment?
(ii) State two mutually exclusive and exhaustive hypotheses:
(a) $H_0: \mu_d = 0$
(b) $H_A: \mu_d \ne 0$
(iii) If assumptions are met, use the paired t-test.
(iv) The critical value for the paired t-test with a two-tailed α = 0.05 and df = 19 is $t_{critical}$ = 2.093. (Note that for a paired t-test, df is equal to one less than the number of pairs.) If $t_{observed} > t_{critical}$, reject $H_0$.
(v) Data are displayed in Table 18.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$: $t_{observed} > t_{critical}$ (i.e., 4.13 > 2.093), thus reject $H_0$.
(viii) Conclusion. There is a statistically significant reduction in severity of depression, as measured by the HRSD, from pre- to post-CBT treatment.

3.12.4.6.2 Wilcoxon signed-rank test

The Wilcoxon signed-rank test is used to test differences of paired data without the normal distribution assumption of the differences that is required for the paired t-test. The procedure is as follows. First, the difference between each pair of data is calculated. The absolute values of the differences are then ranked. Next, the ranks are assigned the sign (+ or −) of the corresponding difference. Then two sums are computed: the sum of the positive signed ranks ($T_+$) and the sum of the absolute values of the negative signed ranks ($T_-$). (The differences of zero are ignored in these calculations.) Note that the sum of the totals should be $T_+ + T_- = n(n + 1)/2$, where n is the number of nonzero differences. If $H_0$ were true, one would expect about half of the differences to be positive and about half of the differences to be negative. If so, the sum of positive ranks would be approximately equal to the sum of the negative ranks: $T_+ \approx T_- \approx \tfrac{1}{2} \cdot n(n + 1)/2$. The test statistic ($W_{obs}$) is used to examine this. $W_{obs}$ is equal to the smaller of the two signed rank totals.

For example, see Table 18, which displays the pre- and post-HRSD scores and the difference scores. The median pretreatment rating (21.0) and the median post-treatment rating (15.0) appear to be quite different. The Wilcoxon signed-rank test examines the magnitude of the difference. Separate columns display the ranks of the positive differences and the ranks of the absolute values of the negative differences.
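Both within-subject procedures can be sketched in a few lines. The example below is a minimal illustration, assuming the 20 pre- and post-treatment HRSD pairs of Table 18 and using only the Python standard library; the names `paired_t` and `signed_rank_sums` are hypothetical, and tied absolute differences are given the average of the ranks they occupy.

```python
from math import sqrt
from statistics import mean, stdev

# HRSD scores from Table 18: X = pretreatment, Y = post-treatment.
pre = [12, 12, 13, 14, 15, 17, 18, 19, 21, 21,
       21, 22, 24, 25, 26, 27, 27, 28, 29, 29]
post = [11, 4, 15, 8, 9, 8, 11, 13, 5, 22,
        28, 25, 21, 15, 22, 22, 14, 22, 23, 21]
diffs = [x - y for x, y in zip(pre, post)]


def paired_t(differences):
    """t = mean difference / standard error of the mean difference."""
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)  # stdev uses n - 1
    return mean(differences) / standard_error


def signed_rank_sums(differences):
    """Return (T+, T-) from average ranks of the nonzero absolute differences."""
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(nonzero, key=abs)

    # Map each absolute difference to the average rank of its tied group.
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    t_plus = sum(rank_of[abs(d)] for d in nonzero if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in nonzero if d < 0)
    return t_plus, t_minus


t = paired_t(diffs)
t_plus, t_minus = signed_rank_sums(diffs)
w_observed = min(t_plus, t_minus)

print(round(t, 2))          # 4.13
print(t_plus, t_minus)      # 187.5 22.5
print(w_observed)           # 22.5
```

The two rank sums add to n(n + 1)/2 = 210, which serves as a check on the ranking.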

Table 18 HRSD scores: pre-(X) and post-(Y) CBT treatment (N = 20).

X        Y        Difference   Ranked         Signed rank   Positive ranks   Negative ranks
                               |Difference|
12       11       1            1.5            1.5           1.5
12       4        8            15.5           15.5          15.5
13       15       −2           3              −3                             3
14       8        6            10             10            10
15       9        6            10             10            10
17       8        9            17             17            17
18       11       7            13.5           13.5          13.5
19       13       6            10             10            10
21       5        16           20             20            20
21       22       −1           1.5            −1.5                           1.5
21       28       −7           13.5           −13.5                          13.5
22       25       −3           4.5            −4.5                           4.5
24       21       3            4.5            4.5           4.5
25       15       10           18             18            18
26       22       4            6              6             6
27       22       5            7              7             7
27       14       13           19             19            19
28       22       6            10             10            10
29       23       6            10             10            10
29       21       8            15.5           15.5          15.5

Total    420.00   319.00       101.00         210.00        165.00           187.50           22.50
(Total columns, in order: X, Y, Difference, Ranked |Difference|, Signed rank, Positive ranks, Negative ranks.)
Mean     21.00    15.95        5.05
sd       5.80     7.16         5.46
d = X − Y.

The sums of those columns are:

$T_+ = 187.5$

$T_- = 22.5$

$W_{observed} = T_- = 22.5$

Check the calculations: the sum of the positive signed ranks and the absolute values of the negative signed ranks should equal n(n + 1)/2:

\[ T_+ + T_- = 187.5 + 22.5 = 210 \quad \text{and} \quad \frac{n(n + 1)}{2} = \frac{20(20 + 1)}{2} = 210 \]

The assumptions of the Wilcoxon signed-rank test include:
(i) the data consist of correlated observations (i.e., paired data);
(ii) subjects with paired data are randomly and independently sampled;
(iii) the differences are measured on at least an ordinal scale.

The hypothesis testing steps for the Wilcoxon signed-rank test are as follows:
(i) State research question. Is there a difference in depression severity from pre- to post-CBT treatment?
(ii) State two mutually exclusive and exhaustive hypotheses with regard to location (e.g., median):
(a) $H_0$: Median$_{Time1}$ = Median$_{Time2}$
(b) $H_A$: Median$_{Time1}$ ≠ Median$_{Time2}$
(iii) If the assumptions are fulfilled, use the Wilcoxon signed-rank test to evaluate the hypotheses.
(iv) The critical value for the Wilcoxon signed-rank test, with two-tailed α = 0.05 and N = 20, is 52. If $W_{observed} \le W_{critical}$, reject $H_0$.
(v) Data are displayed in Table 18.
(vi) Calculations were described above.
(vii) Decision regarding $H_0$: $W_{observed} < W_{critical}$ (i.e., 22.5 < 52), thus reject $H_0$.
(viii) Conclusion. There is a significant reduction in depressive severity from pre- to post-CBT treatment.

3.12.4.6.3 Repeated measures ANOVA

The repeated measures ANOVA is a statistical procedure that can include both within-subject and between-subject factors. Initially the former will be considered. Most simply, it can be an extension of the paired t-test, used to compare subjects over time. Consider three HRSD assessments during the course of 12 weeks of CBT (e.g., weeks 0, 6, and 12). The null hypothesis is that there is no change in depression over the three assessments: $H_0: \mu_1 = \mu_2 = \mu_3$. The alternative hypothesis is that the null hypothesis is false. The F-test of the repeated measures ANOVA tests whether that difference is more than can be expected by chance. If $H_0$ is rejected, the source of the difference must be examined more closely in a post hoc test. This is discussed in detail elsewhere (e.g., Keppel, 1991; Winer, 1971).

The repeated measures design can be extended to include between-subject factors as well as within-subject factors. Such a design is referred to as a split plot design. There are two types of within- and between-subject factors: fixed and random effects. The specific levels of fixed variables are determined by the experimenter, whereas the levels of a random effect are randomly selected from a larger population of levels. Results of fixed effects cannot be generalized beyond the levels of the factors that are tested. In contrast, results of random effects can be generalized to the population from which the levels were selected. The within-subject factor that was discussed in the repeated measures model above was a fixed effect. Random effects are discussed in detail elsewhere (e.g., Keppel, 1991; Winer, 1971). An extension of the previous example that includes two independent treatment groups (e.g., IPT vs. CBT) illustrates a design in which a repeated measures ANOVA with both within-subject and between-subject factors would be applied. Treatment would be the between-subjects factor, whereas time would be the within-subjects factor. Subjects are treated with only one modality, but have all levels of the within-subject factor (i.e., assessments at weeks 0, 6, and 12). This ANOVA includes three F-tests: the main effect of treatment, the main effect of time, and the interaction of treatment by time.

3.12.4.7 Within-subject Designs: Categorical Dependent Variables

3.12.4.7.1 McNemar test

The McNemar test is used to examine paired dichotomous data. For example, one might compare the symptomatology pretreatment and post-treatment. Specifically, one might hypothesize that sleep disturbance is neither developed nor overcome during the course of treatment with IPT for depression, as presented in Table 19.

Table 19 Cross-classification of presence of symptoms before and after treatment.

               Post absent   Post present
Pre absent     76 (a)        11 (b)
Pre present    2 (c)         47 (d)

The McNemar test is calculated as follows:

\[ \chi^2 = \frac{(|b - c| - 1)^2}{b + c} \]

(The algorithm, as presented with the term outside of the absolute value, incorporates a continuity correction in the numerator.) Note that the calculations focus on the discordant pairs and ignore the concordant pairs. More specifically, the test is a ratio of the squared difference in the discordant frequencies relative to the total discordant frequencies. In the example above, the test would detect a disproportionate representation of an emergent sleep disturbance among those who had a change in sleep disturbance. This is illustrated with data in Table 19, where 11 of the 13 (84.6%) with changes in sleep disturbance developed the symptom during the course of treatment:

\[ \chi^2 = \frac{(|11 - 2| - 1)^2}{11 + 2} = 4.92 \]

The McNemar $\chi^2$ of 4.92 exceeds the critical $\chi^2$ with 1 df, 3.84, and is thus statistically significant at the 0.05 level.

3.12.5 SUMMARY

There are several components of the analysis one must consider when choosing among statistical procedures. Is there a dependent variable? If so, is it dimensional, ordinal, or categorical? What are the comparisons that will answer the research question? Are there group comparisons? If so, how many groups are being compared? If not, is it bivariate linear relations that are of interest? In addition, the assumptions of the statistical procedures should be carefully considered before the procedure is applied. Figure 17 is a flow chart that is designed to facilitate selection of the appropriate inferential statistical procedure.

Finally, the value of descriptive statistical procedures must not be minimized. They are used not only to summarize data and describe a sample, but also to help a researcher become familiar with the data. The initial examination of descriptive statistics provides information that is useful when choosing among inferential statistical procedures.

In summary, this chapter began by stating that a fundamental motivation in science is to understand individual differences.

Between Within Subject Subject Design Design

Non- Dimensional Categorical Parametric Parametric Summary Two Group Multiple Factorial Multivariate Wilcoxson Compari- Group Analysis of Group Linear Logistic Survival Paired Rank McNemar Relations Chi-squareRegression Analysis Test sons Comparisons Variance Comparisons t-test Sum Test

Non- Non- Hotelling’s Multiple ParametricParametric ParametricParametric 2T MANOVA Bivariate Variables

Multiple Mann– Analysis ofKruskal– Linear t-test Whitney Variance Wallis Continuous Ordinal Regression

Pearson Simple Correlation Linear Spearman CoefficientRegressionCorrelation

Figure 17 A guide for the selection of an appropriate inferential test. 283 284 Descriptive and Inferential Statistics statistical procedures that are used to examine Spearman Rank deviation and variability have been outlined. The Nonpar corr variables= depression with decision rules to choose among them has been impairment discussed. Reference citations have been pro- /print=spearman. vided for a more comprehensive discussion of Simple Linear Regression each technique. In closing, the reader is Regression variables=impairment encouraged to consider not only the test statistic depression and corresponding statistical significance of /descriptives=default research results, but also focus on the clinical /statistics=default meaningfulness of the findings. If a result cannot /dependent=impairment be replicated because it is trivial or if a finding /method=enter depression. has no practical value, it will serve no useful Categorical Dependent Variables scientific or clinical purpose. Chi-square Crosstab tx by response /cell=count row column /sta=chisq. ACKNOWLEDGMENTS 3.12.6.2.2 Within-group analyses The author is grateful to Laura Portera, M.S., who provided valuable feedback regarding the Dimensional Ordinal Dependent Variables structure and content of earlier versions of this Paired t-test manuscript, and to Kira Lowell, M.A., who t-test pair=prehrsd with posthrsd. provided editorial assistance. Wilcoxon Npar test wilcoxon= prehrsd with posthrsd 3.12.6 APPENDIX: SPSS COMMAND /statistics=descriptives. LINES Categorical Dependent Variables 3.12.6.1 Descriptive Statistics McNemar test Npar test mcnemar= presx with postsx. Descriptive variables= hrsd /statistics=all. Frequency variables= hrsd /histogram=normal /statistics=all. 3.12.7 REFERENCES Examine variables= hrsd by tx /plot=boxplot stemleaf histogram /statistics=all. Agresti, A. (1992). A survey of exact inference for contingency tables. Statistical Science, 7, 131±177. Agresti, A. (1996). 
An introduction to categorical data analysis. New York: Wiley. 3.12.6.2 Inferential Statistics Aiken L. S., & West S. G. (1991) Multiple regression: 3.12.6.2.1 Between-subject analyses testing and interpreting interactions. Newbury Park, CA: Sage. Dimensional/Ordina Dependent Variables Armitage, P., & Berry, G. (1987). Statistical methods in medical research (2nd ed.). Oxford, UK: Blackwell Two-group Comparisons Science. t-test Bishop, Y. M. M., Feinberg, S. E., & Holland, P. W. t-test group= tx (1,2) /variables=hrsd. (1975). Discrete multivariate analysis: Theory and prac- Mann±Whitney tice. Cambridge, MA: MIT Press. Npar tests m±w=hrsd by tx (1,2) Bock, R. D. (1985). Multivariate statistical procedures for behavioral research. Chicago: Scientific Software Inc. Multiple-group Comparisons Campbell, D. T., & Stanley, J. (1963). Experimental and Analysis of Variance (One-Way ANOVA) quasi-experimental designs for research. Chicago: Rand Oneway hrsd by tx (1,3) McNally. /posthoc=tukey, scheffe, dunnett Cohen, J., & Cohen, P. (1983). Applied multiple regression/ correlation analysis for the behavioral sciences (2nd ed.). /statistics=descriptives,homogeneity. Hillsdale, NJ: Erlbaum. Kruskal±Wallis Collett, D. (1994). Modelling survival data in medical Npar tests k±w=hrsd by tx (1,3) research. London: Chapman and Hall. Factorial Analysis of Variance Conover, W. J. (1980). Practical nonparametric statistics (Two-Way ANOVA) (2nd. ed.). New York: Wiley. Cook, R. D., & Weisberg, S. (1982). Residuals and influence Anova variables= hrsd by tx (1,2) in regression. New York: Chapman and Hall. diagnosis (1,2) Cox, D. R.(1972). Regression models and life tables (with /maxorders=all. discussion). Journal of Royal Statistical Society, B34, Bivariate Linear Relations 187±220. Fleiss, J. F. (1981). Statistical methods for rates and Pearson Correlation Coefficient proportions (2nd. ed.). New York: Wiley. Correlations variables= depression with Fleiss, J. F. (1986). 
The design and analysis of clinical impairment. experiments. New York: Wiley. References 285

Grimm, L. G., & Yarnold, P. R. (Eds.) (1995). Reading and understanding multivariate statistics. Washington, DC: American Psychological Association.
Hartley, H. O. (1950). The maximum F-ratio as a short cut test for heterogeneity of variances. Biometrika, 37, 308–312.
Hollingshead, A. B., & Redlich, F. C. (1958). Social class and mental illness. New York: Wiley.
Hopkins, K. D., Glass, G. V., & Hopkins, B. R. (1987). Basic statistics for the behavioral sciences (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
Kalbfleisch, J. D., & Prentice, R. L. (1980). The statistical analysis of failure time data. New York: Wiley.
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Lawless, J. F. (1982). Statistical models and methods for lifetime data. New York: Wiley.
Lee, E. T. (1980). Statistical methods for survival data methods. Belmont, CA: Wadsworth Inc.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics. Palo Alto, CA: Stanford University Press.
Lurie, P. M. (1995). A review of five statistical packages for Windows. The American Statistician, 49, 99–107.
Marcoulides, G. A., & Hershberger, S. L. (1997). Multivariate statistical methods: A first course. Hillsdale, NJ: Erlbaum.
Mehta, C., & Patel, N. (1997). StatXact 3.1 for Windows: Statistical software for exact nonparametric inference: User manual. Cambridge, MA: CYTEL Software Corporation.
Morgan, W. T. (1998). A review of eight statistics software packages for general use. The American Statistician, 52, 70–82.
Neter, J., Wasserman, W., & Kutner, M. H. (1989). Applied linear regression models (2nd ed.). Homewood, IL: Richard D. Irwin Inc.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Olson, C. L. (1976). On choosing a test statistic in multivariate analysis of variance. Psychological Bulletin, 83, 579–586.
Parmar, M. K. B., & Machin, D. (1995). Survival analysis: A practical approach. New York: Wiley.
Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction. New York: Holt, Rinehart and Winston.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement design and analysis: An integrated approach. Hillsdale, NJ: Erlbaum.
Peto, R., & Peto, J. (1972). Asymptotically efficient rank invariant procedures (with discussion). Journal of the Royal Statistical Society, Series A, 135, 185–206.
Sokal, R. R., & Rohlf, F. J. (1995). Biometry (3rd ed.). New York: W. H. Freeman.
Sprent, P. (1989). Applied nonparametric statistical methods. London: Chapman and Hall.
SPSS (1997). SPSS Base 7.5 Syntax reference guide. Chicago: SPSS Inc.
Stevens, J. (1992). Applied multivariate statistics for the social sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Zar, J. H. (1996). Biostatistical analysis (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.13 Latent Variables, Factor Analysis, and Causal Modeling

BRIAN S. EVERITT Institute of Psychiatry, London, UK

3.13.1 INTRODUCTION
3.13.2 LATENT VARIABLES, OR, HOW LONG IS A PIECE OF STRING?
3.13.3 FACTOR ANALYSIS
  3.13.3.1 Exploratory Factor Analysis
    3.13.3.1.1 Statements about pain
    3.13.3.1.2 Determining the number of factors
    3.13.3.1.3 Factor rotation
  3.13.3.2 Confirmatory Factor Analysis
    3.13.3.2.1 Confirmatory factor analysis models for length and intelligence
    3.13.3.2.2 Ability and aspiration
    3.13.3.2.3 Drug usage among American students
3.13.4 CAUSAL MODELS AND STRUCTURAL EQUATION MODELING
  3.13.4.1 Examples of Causal Models
    3.13.4.1.1 Ability scores over time
    3.13.4.1.2 Stability of alienation
3.13.5 LATENT VARIABLES: MYTHS AND REALITIES
3.13.6 SUMMARY
3.13.7 APPENDIX
3.13.8 REFERENCES

3.13.1 INTRODUCTION

Many psychological investigations involve the collection of multivariate data, a term used when more than a single variable value is observed on each subject in a study. Table 1, for example, shows a set of data giving the examination scores obtained in four subjects by 15 students. For many such data sets the question of most interest is how the variables are related, and the first step in answering this question usually involves the calculation of the covariance or more commonly the correlation between each pair of variables (Table 2).

The techniques discussed in this chapter are aimed primarily at explaining and describing as concisely as possible the relationships between the variables in a set of multivariate data. Some of the methods are most applicable when the investigator has only limited expectation of what patterns to expect, and wishes to explore the covariance or correlation matrix in an effort to uncover hopefully some psychologically meaningful structure. Other methods are designed to test specific theories or hypotheses as to why the variables have their observed relationships. Both approaches rely heavily on the concept of a latent variable, a topic discussed in the next section.

Table 1 Examination scores for four subjects.

Student   English   French   Algebra   Statistics
 1        77        82       67        67
 2        63        78       80        70
 3        75        73       71        66
 4        55        72       63        70
 5        63        63       65        70
 6        53        61       72        64
 7        51        67       65        65
 8        59        70       68        62
 9        62        60       58        62
10        64        72       60        62
11        52        64       60        63
12        55        67       59        62
13        50        50       64        55
14        65        63       58        56
15        31        55       60        57

3.13.2 LATENT VARIABLES, OR, HOW LONG IS A PIECE OF STRING?

Table 3 shows the lengths of 15 pieces of string as measured by a ruler (R) and as estimated (guessed) by three different people (G, B, and D). Corresponding to each piece of string is a true but unknown length. The four observed measurements all give the true length plus some amount of measurement error, although R is likely to be considerably more accurate than the others. The four measurements are said to be fallible indicators of length.

Fascinating though the data in Table 3 may be (at least to the author), many readers of this article may quite reasonably ask what they have to do with psychology in general, and factor analysis in particular? To answer such questions consider again the data shown in Table 1, consisting of the examination scores of 15 individuals in four subjects. Analogous to the true string length of each of the 15 observations in Table 3, each individual in Table 1 might be assumed to have a particular, but unknown, level of cognitive ability (which some bolder psychologists might label "intelligence"), of which the four observed examinations scores are again fallible indicators.

And so in the data sets in both Tables 1 and 3, it can be assumed that the observed or manifest variables are fallible indicators of an underlying variable that cannot be measured directly, a so-called latent variable. Length is, of course, a far more straightforward latent variable than intelligence and one which most people would find little difficulty in accepting; its direct measurement is prevented simply by measurement error. Many of the latent variables postulated in psychology and related disciplines, however, are often more problematical and even in some cases controversial. Certainly there has been considerable debate about the nature of intelligence. The reason that such variables cannot be measured directly is often more complex than simple measurement error. The question of how seriously latent variables should be taken in general is discussed in Section 3.13.5 but, as will be seen in later examples, invoking the concept of a latent variable is frequently extremely useful when exploring the relationships between manifest variables. In an exploratory analysis the aim will be to assess whether there is any evidence that the structure in an observed correlation matrix is due to a number of underlying latent variables and, if so, to describe and possibly label these variables. When testing a specific hypothesis or model for the data, the relationships between the observed and latent variables will be specified a priori, and the investigator will be primarily interested in assessing the fit of the suggested model for the observed correlations or covariances. In addition to specifying how observed and latent variables are related, more complex models might also specify relationships between the latent variables themselves; examples are given later.

3.13.3 FACTOR ANALYSIS

Factor analysis is a generic term given to a class of multivariate statistical methods whose primary purpose is data reduction and summarization. In general terms factor analysis

Table 2 Covariances and correlations.

(i) A set of multivariate data is generally represented by a matrix $X$ where

\[
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\]

where $n$ is the number of observations in the data, $p$ is the number of variables, and $x_{ij}$ represents the value of variable $j$ for observation $i$

(ii) The sample covariance $s_{ij}$ of variables $i$ and $j$ is given by

\[
s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)
\]

where $\bar{x}_i$ and $\bar{x}_j$ are the sample means of the two variables

(iii) A covariance can take any value from $-\infty$ to $\infty$. A value of zero indicates that there is no association (strictly, no linear association) between the variables. A positive value occurs when, as one of the variables increases in value then, on average, so does the other. A negative covariance indicates that as one variable increases, the other decreases, and vice versa

(iv) The sample covariance matrix is a matrix with the variances of each variable on the main diagonal and the covariances of each pair of variables as off-diagonal elements

(v) The sample correlation coefficient, $r_{ij}$, between variables $i$ and $j$ is given by

\[
r_{ij} = \frac{s_{ij}}{(s_{ii} s_{jj})^{1/2}}
\]

where $s_{ii}$ and $s_{jj}$ are used to denote the variances of the two variables

(vi) A correlation takes values between $-1$ and $1$, with the two extremes indicating that the two variables are perfectly linearly related

(vii) The sample correlation matrix is

\[
R = \begin{bmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{21} & 1      & \cdots & r_{2p} \\
\vdots & \vdots &        & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
\]

techniques address themselves to the problem of analyzing the interrelationships amongst a possibly large set of observed variables and explaining these relationships in terms of a (hopefully) small number of underlying latent variables. By using factor analysis, the investigator should be able to identify the separate dimensions being measured by the manifest variables. When used in an exploratory mode, factor analysis methods place no constraints on the possible relationships between observed and latent variables. When used in a confirmatory manner, however, these relationships will be constrained in particular ways specified by the hypothesized model. The next section deals with exploratory factor analysis and Section 3.13.3.2 with confirmatory factor analysis.
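The definitions of the sample covariance and correlation in Table 2 can be sketched in a few lines of code. The following is a minimal illustration only, assuming Python with NumPy (neither is used in the chapter) and the examination scores as read from Table 1; the variable names are hypothetical.

```python
import numpy as np

# Examination scores as read from Table 1
# (rows: students 1-15; columns: English, French, Algebra, Statistics).
X = np.array([
    [77, 82, 67, 67], [63, 78, 80, 70], [75, 73, 71, 66],
    [55, 72, 63, 70], [63, 63, 65, 70], [53, 61, 72, 64],
    [51, 67, 65, 65], [59, 70, 68, 62], [62, 60, 58, 62],
    [64, 72, 60, 62], [52, 64, 60, 63], [55, 67, 59, 62],
    [50, 50, 64, 55], [65, 63, 58, 56], [31, 55, 60, 57],
], dtype=float)

n, p = X.shape

# Table 2 (ii): s_ij = sum_k (x_ki - mean_i)(x_kj - mean_j) / (n - 1)
centered = X - X.mean(axis=0)
S = centered.T @ centered / (n - 1)

# Table 2 (v): r_ij = s_ij / sqrt(s_ii * s_jj)
sd = np.sqrt(np.diag(S))
R = S / np.outer(sd, sd)

print(np.round(R, 3))
```

The hand-written formulas should agree with NumPy's built-in `np.cov(X, rowvar=False)` and `np.corrcoef(X, rowvar=False)`, which offers a quick check on the arithmetic.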

Table 3 How long is a piece of string? Measurement by ruler (R) and guesses by three observers (G, B, and D) are given.

Piece   R     G     B     D
 1      6.3   5.0   4.8   6.0
 2      4.1   3.2   3.1   3.5
 3      5.1   3.6   3.8   4.5
 4      5.0   4.5   4.1   4.3
 5      5.7   4.0   5.2   5.0
 6      3.3   2.5   2.8   2.6
 7      1.3   1.7   1.4   1.6
 8      5.8   4.8   4.2   5.5
 9      2.8   2.4   2.0   2.1
10      6.7   5.2   5.3   6.0
11      1.5   1.2   1.1   1.2
12      2.1   1.8   1.6   1.8
13      4.6   3.4   4.1   3.9
14      7.6   6.0   6.3   6.5
15      2.5   2.2   1.6   2.0

3.13.3.1 Exploratory Factor Analysis

The mathematical details of the factor analysis model are listed in Table 4, and the essential features of the technique are now demonstrated by considering an application.

3.13.3.1.1 Statements about pain

This illustration is based on a subset of the data reported in Skevington (1990). The study was concerned with beliefs about controlling pain and 123 individuals suffering from severe pain were presented with nine statements about pain. Each statement was scored on a scale from 1 to 6, ranging from disagreement to agreement. The nine statements and the observed correlations between them are shown in Table 5. One of the aims of the study was to ascertain whether the responses reflected the existence of subscales or groups of attitudes.

In an attempt to identify the latent variables that might account for the observed pattern of correlations between the pain statements, a particular form of factor analysis, maximum likelihood factor analysis was applied (described by Everitt & Dunn, 1991). The results from a factor analysis consist of the estimated regression coefficients of each observed variable on each latent variable (also known in this context as common factors). When the factor analysis has been carried out on the observed correlation matrix rather than the covariance matrix, the estimated regression coefficients are simply the correlations between each manifest variable and each latent variable. (In an exploratory factor analysis the choice of covariance or correlation matrix is not critical since there is a simple relationship between the solutions derived from each.) Table 6 shows these estimated correlations for both the two- and three-factor solutions. (Methods for deciding on the appropriate number of common factors are discussed later.)

How are the results given by a factor analysis interpreted? First, the estimated correlations (more commonly known as factor loadings) can be used to identify and perhaps name the underlying latent variables, although this is often more straightforward after the process of rotation, which is discussed in Section 3.13.3.1.3. But for now, examining the unrotated results in Table 6 it is seen that for both solutions the second factor is positively correlated, to a greater or lesser extent, with all nine statements. It would not require a great leap of imagination to suggest that this factor might be labeled "general pain level." The first factor is negatively correlated with statements taking personal responsibility for one's pain and positively correlated with statements in which the control of, and reasons for, pain are attributed elsewhere. Factors with a mixture of positive and negative loadings (often referred to as bipolar factors), usually become easier to understand after rotation and so further interpretation of the results is left until Section 3.13.3.1.3.

Apart from the factor loadings, a number of other quantities which need explanation are given in Table 6. First, the sum of squares of the factor loadings of a particular observed variable gives what is known as the communality of that variable, that is, the variance shared with the other manifest variables via their relationships with the common factors. Subtracting the communality of a variable

Table 4 The factor analysis model.

(i) In general terms factor analysis is concerned with whether the covariance, or correlations between a set of observed variables, $x_1, x_2, \ldots, x_p$ can be explained in terms of a smaller number of unobservable latent variables (common factors), $f_1, f_2, \ldots, f_k$ where $k < p$ (hopefully $k$, the number of common factors, will be much less than the number of original variables $p$)

(ii) The factor analysis model is essentially a regression-type model in which the observed variables are regressed on the assumed common factors. In mathematical terms the factor analysis model can be written as

\[
\begin{aligned}
x_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1k} f_k + e_1 \\
x_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2k} f_k + e_2 \\
    &\;\;\vdots \\
x_p &= \lambda_{p1} f_1 + \lambda_{p2} f_2 + \cdots + \lambda_{pk} f_k + e_p
\end{aligned}
\]

The $\lambda$s are factor loadings and the terms $e_1, e_2, \ldots, e_p$ are known as specific variates; they represent that part of an observed variable not accounted for by the common factors

(iii) The following assumptions are made:
  (a) The common factors are in standardized form and have variance one
  (b) The common factors are uncorrelated
  (c) The specific variates are uncorrelated
  (d) The common factors are uncorrelated with the specific variates

(iv) With these assumptions the factor analysis model implies that the population variances ($\sigma_i^2$, $i = 1, 2, \ldots, p$) and covariances ($\sigma_{ij}$) of the observed variables can be written as

\[
\sigma_i^2 = \sum_{j=1}^{k} \lambda_{ij}^2 + \psi_i
\]

\[
\sigma_{ij} = \sum_{l=1}^{k} \lambda_{il} \lambda_{jl}
\]

where $\psi_i$ is the variance of specific variate $e_i$, i.e., the specific variance of variable $x_i$

(v) The model implies that the variance of an observed variable can be split into two parts, $\sum_{j=1}^{k} \lambda_{ij}^2$ and $\psi_i$. The first of those is known as the communality of the variable $x_i$; it is the variance in the variable shared with the other observed variables via their relationships with the common factors

(vi) Note that the covariances of the observed variables are generated solely from their relationships with the common factors. The specific variates play no part in determining the covariances of the observed variables; they contribute only to the variances of those variables

(vii) There are a number of different methods for fitting the factor analysis model. The two most commonly used are principal factor analysis and maximum likelihood factor analysis; both are described in Everitt and Dunn (1991)

(viii) A method of factor analysis commonly used in practice is principal components analysis (Everitt & Dunn, 1991). Although this is an extremely useful technique for the summarization of multivariate data, it is not discussed in detail here because it is not a natural precursor to the confirmatory and causal models to be discussed later. It may, however, be worthwhile listing the main differences between the two approaches:
  (a) Factor analysis (FA) and principal components analysis (PCA) each attempt to describe a set of multivariate data in a smaller number of dimensions than one starts with, but the procedures used to achieve this goal are essentially quite different in the two approaches
  (b) FA, unlike PCA, begins with a hypothesis about the covariance (or correlational) structure of the variables, namely that there exists a set of $k$ latent variables ($k < p$) and these are adequate to account for the interrelationships of the variables though not for their full variances
  (c) PCA, however, is merely a transformation of the data and no assumptions are made about the form of the covariance matrix of the data. In particular PCA has no part corresponding to the specific variates of FA. Consequently, if the FA model holds and the specific variances are small, both forms of analysis would be expected to give similar results
  (d) A clear advantage of FA over PCA is that there is a simple relationship between the solutions obtained from the covariance and correlation matrices
  (e) It should be remembered that PCA and FA are both pointless if the observed variables are uncorrelated: FA because it has nothing to explain and PCA because it would lead to components which are essentially identical to the original variables

(ix) In many (perhaps most) examples the results from a principal components analysis and an exploratory factor analysis will be similar, with any differences not usually affecting the substantive interpretation
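The variance and covariance decomposition in Table 4 (iv) to (vi) can be illustrated numerically. The sketch below is only an illustration, assuming Python with NumPy and taking the two-factor loadings reported in Table 6 (i) as given; the specific variances are set to one minus the communality, which presumes standardized variables (a correlation-matrix analysis). Variable names are hypothetical.

```python
import numpy as np

# Two-factor loadings for the nine pain statements, copied from Table 6 (i).
loadings = np.array([
    [ 0.643, 0.211], [-0.361, 0.413], [ 0.718, 0.401],
    [ 0.687, 0.288], [-0.311, 0.569], [-0.521, 0.614],
    [-0.563, 0.477], [ 0.709, 0.254], [ 0.482, 0.004],
])

# Communality of each variable: sum of its squared loadings (Table 4, v).
communality = (loadings ** 2).sum(axis=1)

# Specific variance, assuming each observed variable has variance one.
psi = 1.0 - communality

# Model-implied correlation matrix: Lambda Lambda' + diag(psi) (Table 4, iv).
implied = loadings @ loadings.T + np.diag(psi)

# Variance accounted for by each factor: column sums of squared loadings.
factor_variance = (loadings ** 2).sum(axis=0)
percent = 100 * factor_variance / loadings.shape[0]

print(np.round(communality, 3))
print(np.round(factor_variance, 3), np.round(percent, 1))
```

Only the diagonal of the implied matrix involves the specific variances; the off-diagonal entries come entirely from the loadings, which is the point made in Table 4 (vi).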

Table 5 Pain statements and their correlations.

Pain statements

(1) Whether or not I am in pain in the future depends on the skills of the doctors
(2) Whenever I am in pain, it is usually because of something I have done or not done
(3) Whether or not I am in pain depends on what the doctors do for me
(4) I cannot get any help for my pain unless I go to seek medical advice
(5) When I am in pain I know that it is because I have not been taking proper exercise or eating the right food
(6) People's pain results from their own carelessness
(7) I am directly responsible for my pain
(8) Relief from pain is chiefly controlled by the doctors
(9) People who are never in pain are just plain lucky

Correlations

     1        2        3        4        5        6        7        8        9
1    1.0000
2   -0.0385   1.0000
3    0.6066  -0.0693   1.0000
4    0.4507  -0.1167   0.5916   1.000
5    0.0320   0.4881   0.0317  -0.082    1.0000
6   -0.2877   0.4271  -0.1336  -0.2073   0.4731   1.0000
7   -0.2974   0.3045  -0.2404  -0.1850   0.4138   0.6346   1.0000
8    0.4526  -0.3090   0.5886   0.6286  -0.1397  -0.1329  -0.2599   1.0000
9    0.2952  -0.1704   0.3165   0.3680  -0.2367   0.1541  -0.2893   0.4047   1.000

Table 6 Maximum likelihood factor analysis solutions for pain statement correlations.

(i) Two-factor solution

Statement   Factor 1   Factor 2   Communality   Specific variance
1            0.643      0.211      0.458         0.542
2           -0.361      0.413      0.301         0.699
3            0.718      0.401      0.676         0.324
4            0.687      0.288      0.555         0.445
5           -0.311      0.569      0.420         0.580
6           -0.521      0.614      0.648         0.352
7           -0.563      0.477      0.544         0.456
8            0.709      0.254      0.567         0.433
9            0.482      0.004      0.233         0.767
Variance     2.950      1.451

(ii) Three-factor solution

Statement   Factor 1   Factor 2   Factor 3   Communality   Specific variance
1            0.605      0.295      0.372      0.592         0.408
2           -0.455      0.291      0.431      0.477         0.523
3            0.613      0.498      0.192      0.661         0.339
4            0.621      0.399      0.000      0.545         0.455
5           -0.407      0.450      0.372      0.506         0.494
6           -0.671      0.594     -0.149      0.825         0.175
7           -0.625      0.342     -0.060      0.512         0.488
8            0.681      0.475     -0.273      0.763         0.237
9            0.449      0.162     -0.138      0.247         0.753
Variance     3.007      1.502      0.619

from the value one gives the specific variance of a variable, that is, the variation in the variable not shared with the other variables. So, for example, in the two-factor solution the communality of the statement "people who are never in pain are just plain lucky," is rather low at 0.23 and its specific variance consequently relatively high at 0.77. Variation in the response to this statement is largely unrelated to the two common factors.

The sum of squares of the loadings on a common factor gives the variation in the manifest variables accounted for by that factor. This is to be compared with the total variation in the observed variables, which since this example uses a correlation matrix and hence relates to variables standardized to have variance one, is simply equal to the number of variables, that is, nine. So, in the two-factor solution, the first factor has variance 2.95 and accounts for 33% of the variation in the observed variables. Both factors together in the two-factor solution account for 49% of the variance. The factors in the three-factor solution together account for 57% of the variance.

Note that factors are extracted in order of their variance, and are so constructed that they are uncorrelated, that is, independent; an alternative technical term that is sometimes encountered is "orthogonal."

3.13.3.1.2 Determining the number of factors

Before rotating and interpreting a factor solution the investigator needs to answer the important question, "How many factors?" A variety of informal and formal methods have been suggested. The former include taking as many factors as account for an adequate amount of the variation in the observed variables (where "adequate" is usually interpreted as roughly around 60% or above), and plotting factor variances against factor number (a so-called scree plot) and identifying the point where the curve flattens out. When applying the maximum likelihood method of factor analysis a more formal significance testing procedure is available based on what is known as the likelihood function, which is, essentially, a measure of how well the estimated factor solution fits the observed correlations. Everitt and Dunn (1991) give a specific definition. With this approach a sequential procedure is used to determine k, the number of common factors. Starting with some small value of k (usually one), the test for number of factors is applied and, if the test is nonsignificant, the current value of k is deemed acceptable; otherwise k is increased by one and the process repeated until an acceptable solution is found.

Each of the procedures described above can be applied to the pain statements data, and the results are shown in Table 7 and Figure 1. It is clear from these results that the three-factor solution is the one to choose and is consequently subjected to the process of rotation described in the next section.

3.13.3.1.3 Factor rotation

The initial factors extracted from a factor analysis are often difficult to interpret and

[Figure 1: scree plot of variance (vertical axis, 0.0 to 2.0) against factor number (horizontal axis, 1 to 4).]

Figure 1 Scree plot.

Table 7 Determining number of factors for pain statement data.

(i) Percentage variance accounted for in a five-factor solution

Number of factors   Percentage variance accounted for
1                   22
2                   44
3                   55
4                   65
5                   70

(ii) Significance test for number of factors

Model          Test statistic   Degrees of freedom   P-value
One-factor     172.34           27                   <0.001
Two-factor      58.95           19                   <0.001
Three-factor    18.19           12                   0.11
Four-factor     10.01            6                   0.124
Five-factor      0.85            1                   0.356
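The sequential procedure for choosing the number of factors can be sketched against the values in Table 7 (ii). This is a minimal illustration in Python; the 0.05 significance level and the function names are assumptions, since the chapter does not state the level used. The degrees-of-freedom helper uses the usual expression for the maximum likelihood test with p variables and k factors, ((p - k)^2 - (p + k)) / 2, which reproduces the values listed in the table.

```python
# (number of factors, test statistic, degrees of freedom, p-value), Table 7 (ii).
# P-values reported as "<0.001" are stored as the upper bound 0.001.
TESTS = [
    (1, 172.34, 27, 0.001),
    (2, 58.95, 19, 0.001),
    (3, 18.19, 12, 0.11),
    (4, 10.01, 6, 0.124),
    (5, 0.85, 1, 0.356),
]


def ml_degrees_of_freedom(p, k):
    """Degrees of freedom of the ML test that k factors suffice for p variables."""
    return ((p - k) ** 2 - (p + k)) // 2


def choose_number_of_factors(tests, alpha=0.05):
    """Return the smallest k whose test is nonsignificant at level alpha.

    alpha=0.05 is an assumed, conventional level, not one given in the text.
    Returns None if every model in the list is rejected.
    """
    for k, _statistic, _df, p_value in tests:
        if p_value >= alpha:
            return k
    return None


if __name__ == "__main__":
    print(choose_number_of_factors(TESTS))
```

With the tabulated p-values the first nonsignificant model at the assumed level is the three-factor one, in line with the choice made in the text.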

name. A process which can frequently aid in these tasks is factor rotation whereby the initial solution is described in a different and, in many cases, a simpler fashion. A rotated and unrotated factor analysis solution are mathematically equivalent, but the former usually leads to a clearer picture of the nature of the underlying latent variables. The rotation methods usually employed are designed to lead to a factor solution with the properties that Thurstone (1947) referred to as a simple structure. In very general terms such a structure results when the common factors involve subsets of the original variables with as little overlap as possible, i.e., variables have high loadings on a particular factor and negligible loadings on the others. In this way the original variables are divided into groups relatively independent of each other. Methods of rotation operate by seeking, essentially, to make large loadings larger and small loadings smaller. This rather vague aim is translated into more specific mathematical terms by selecting a rotated solution so that the loadings optimize some suitable numerical criterion. For example, a well known method of rotation known as varimax attempts to maximize the within-factor variance of the squared loadings. Other methods (of which there are several) choose to optimize somewhat different criteria in their aim to achieve simple structure. In many examples the solutions given by the competing methods of rotation will be very similar.

To illustrate the application of rotation, Table 8 shows the varimax-rotated, three-factor solution for the pain statement data. Note that although the factor loadings have changed, the communalities of the variables are unaltered, as is the total variance accounted for by the solution. The variance attributable to each common factor has, however, changed.

A possible interpretation of the rotated three-factor solution is in terms of different aspects of the control of, and responsibility for, one's pain. The first factor attributes both to others, particularly doctors. The second factor, with high loadings on statements 6 and 7, involves complete personal responsibility for one's pain and the third factor, having its highest loadings on statements 2 and 5, might be seen as attributing pain to deficiencies in one's lifestyle.

The possibility of rotating factor solutions arises because of the lack of uniqueness of the factor loadings in the basic factor analysis model described in Table 4 (Everitt, 1996). This property once caused many statisticians to view factor analysis with grave suspicion, since apparently it allows investigators licence to consider a large number of solutions (each corresponding to a different rotation of the factors) and to select the one closest to their a priori expectations (or prejudices) about the factor structure of the data. In general, however, such suspicion is misplaced and factor rotation can be a useful procedure for simplifying an exploratory factor analysis solution. Factor rotation merely allows the fitted factor analysis model to be described as simply as possible. Rotation does not alter the overall structure of a solution, but only how the

Table 8 Rotated three-factor solution for pain statement data.

Statement    Factor 1   Factor 2   Factor 3   Communality

1              0.654     -0.358      0.188       0.591
2             -0.121      0.200      0.650       0.478
3              0.793     -0.138      0.112       0.661
4              0.726     -0.10      -0.09        0.545
5              0.02       0.305      0.642       0.506
6             -0.09       0.827      0.365       0.825
7             -0.226      0.598      0.323       0.512
8              0.813      0.06      -0.312       0.763
9              0.435     -0.08      -0.229       0.247

Variances      2.509      1.340      1.279
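The relationship between the loadings, the communalities, and the factor variances in Table 8 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the loadings are transcribed from Table 8, while the second orthogonal transformation is a hypothetical one generated at random purely for illustration (it is not varimax, and the variable names are not part of the original analysis).

```python
import numpy as np

# Varimax-rotated loadings transcribed from Table 8
# (rows: statements 1-9; columns: factors 1-3).
loadings = np.array([
    [0.654, -0.358, 0.188],
    [-0.121, 0.200, 0.650],
    [0.793, -0.138, 0.112],
    [0.726, -0.10, -0.09],
    [0.02, 0.305, 0.642],
    [-0.09, 0.827, 0.365],
    [-0.226, 0.598, 0.323],
    [0.813, 0.06, -0.312],
    [0.435, -0.08, -0.229],
])

# Communality of a variable: sum of its squared loadings across the factors.
communalities = (loadings ** 2).sum(axis=1)
# Variance attributable to a factor: sum of squared loadings down its column.
factor_variances = (loadings ** 2).sum(axis=0)

# Hypothetical orthogonal transformation, taken from the QR decomposition
# of a random matrix purely for illustration (an assumption, not varimax).
rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.standard_normal((3, 3)))
rotated = loadings @ rotation

rotated_communalities = (rotated ** 2).sum(axis=1)
rotated_factor_variances = (rotated ** 2).sum(axis=0)

print(np.round(communalities, 3))
print(np.round(factor_variances, 3))
# Communalities and total variance survive any orthogonal transformation.
print(np.allclose(communalities, rotated_communalities))
print(np.isclose(factor_variances.sum(), rotated_factor_variances.sum()))
```

The row sums of squares agree with the communality column of Table 8 to within rounding, and the column sums agree with the row of variances. Only the split of the total variance between the factors depends on the particular orthogonal transformation chosen.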

solution is described; rotation of factors is a process by which a solution is made more interpretable without changing its underlying mathematical properties.

It should be noted that there are two distinct types of rotation, orthogonal and oblique. With the former, the factors in the rotated solution remain independent of one another as they were in the initial solution but, with the latter, correlated factors are allowed. The consequence of allowing correlations between factors is that the sum of squares of a factor's loadings can no longer be used to determine the amount of variance attributable to a factor. Additionally, the sums of squares of factor loadings for each variable no longer give the communality of the variable. In practice, in an exploratory factor analysis, orthogonal rotation is far more commonly used than oblique rotation since the solutions are often satisfactory without introducing the complication of factor correlations.

3.13.3.2 Confirmatory Factor Analysis

In a confirmatory factor analysis, particular parameters of the model, for example, some of the factor loadings, are specified a priori to be fixed at some specified value, usually zero or one, whilst other parameters are free to be estimated from the data. The model considered may have arisen from theoretical considerations, or perhaps on the basis of the results of earlier exploratory factor analyses in the same area.

How is such a model fitted to an observed correlation or covariance matrix? The simple answer is by using the appropriate software as described in the Appendix. The essentials of the fitting process are, however, also given in Table 9. Once again, however, much can be learnt from examining a series of examples.

3.13.3.2.1 Confirmatory factor analysis models for length and intelligence

Consider once again the string data given in Table 3. The model suggested previously for these data is that the four observed measurements of string length are fallible estimates of the latent variable true string length. This model can be represented more specifically in mathematical terms as shown in Table 10, and in the form of a diagram as shown in Figure 2. The model is a very simple example of a confirmatory factor analysis model, and differs from the exploratory models considered in the previous section in two ways: (i) the number of factors, one, is specified a priori; and (ii) the loadings of the observed variables on the postulated latent variable are all fixed to be one.

The correlations between the four measurements are given in Table 11. Applying the procedure outlined in Table 9 to these correlations using the EQS package (Bentler, 1995), gives the results shown in Table 12. Judged by the chi-squared statistic the model fits very well. As was to be expected, measurement by ruler is the most accurate; observer B appears to be marginally less successful than G and D in guessing the string lengths. The residuals are all very small.

In the same way, the model detailed in Table 9 can be fitted to the examination scores data in Table 1. Details of the results are shown in Table 13. Here the model again fits reasonably well as judged simply by the chi-squared statistic, but rather less satisfactorily than for the string data, reflecting the less certain nature of intelligence as compared to length. In fact, the residuals in Table 13 indicate that the predicted and observed correlations differ quite considerably and that the model cannot really be considered adequate for these data despite the nonsignificant chi-squared value. The latter may be misleading here since it is not strictly

Table 9 Estimation and goodness of fit in confirmatory factor analysis models.

(i) Suppose that S is the observed covariance (or correlation) matrix with elements sij. Corresponding to S will be a matrix Ŝ giving the covariances (or correlations) as predicted by the assumed model
(ii) The elements of the matrix Ŝ will be functions of the parameters of the confirmatory factor analysis model, for example, factor loadings, specific variances, and factor correlations
(iii) Estimation of a confirmatory factor analysis (CFA) model involves finding values for the model's parameters (parameter estimates) so that the corresponding predicted and observed covariances and variances are as close as possible to one another
(iv) More specifically, a function of the differences between sij and ŝij is minimized with respect to the parameters
(v) One such function would be the sum of squares of the differences; this corresponds to least-squares estimation
(vi) Other estimation criteria are discussed in Everitt (1984)
(vii) Having decided on a measure of closeness, some type of mathematical optimization algorithm (Everitt, 1987) is applied to find the parameter values that minimize the measure
(viii) The fitting procedure leads not only to parameter estimates but also to the standard errors of the estimates
(ix) A chi-squared statistic measuring the discrepancies between the observed variances and covariances and those predicted by the model is one of the methods that can be used to assess the fit of the model. The degrees of freedom are

    ν = p(p + 1)/2 - t

where p is the number of observed variables and t is the number of free parameters in the model
(x) Fit can (and should) also be assessed by examining the differences between the corresponding elements of the observed and predicted covariance or correlation matrix. Any differences (residuals) which are unacceptably large, or a pattern of moderately sized residuals, cast some doubt on the acceptability of the model
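Two of the steps in Table 9 reduce to simple arithmetic: the least-squares measure of closeness in step (v) and the degrees of freedom in step (ix). The sketch below is illustrative only and assumes NumPy; the function names are hypothetical, and the discrepancy is summed over the distinct elements of the symmetric matrices (the lower triangle including the diagonal), which is one reasonable reading of step (v) rather than the definition used by any particular package.

```python
import numpy as np


def cfa_degrees_of_freedom(p: int, t: int) -> int:
    """Degrees of freedom p(p + 1)/2 - t for p observed variables and t free parameters."""
    return p * (p + 1) // 2 - t


def least_squares_discrepancy(observed, predicted) -> float:
    """Sum of squared differences over the distinct elements of two symmetric matrices."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rows, cols = np.tril_indices(observed.shape[0])
    differences = observed[rows, cols] - predicted[rows, cols]
    return float(np.sum(differences ** 2))


# Six observed variables and 13 free parameters leave 8 degrees of freedom.
print(cfa_degrees_of_freedom(6, 13))
# A model predicting zero correlation when the observed value is 0.5.
print(least_squares_discrepancy([[1.0, 0.5], [0.5, 1.0]], np.eye(2)))
```

In practice the discrepancy function is handed to a numerical optimizer, as step (vii) describes; the sketch stops short of that and only evaluates the criterion for a given predicted matrix.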

valid for samples as small as 15. Questions of the appropriate sample size necessary to achieve an adequate statistical power are as relevant when fitting confirmatory factor analysis as when applying a simple t-test (examples of how to determine sample size for the former are given in Dunn, Everitt, & Pickles, 1993).

A natural question that arises therefore is how the model might be amended to improve its fit. Some general points about how poorly fitting models might be amended are made later, but one relatively minor change that might be considered here is to allow the loading on each examination score to be a free parameter to be estimated rather than having a fixed value of one. Such a model is again easily fitted using the EQS software and the results are given in Table 14. Now the residuals are much more acceptable with only the correlation between algebra and statistics scores being poorly predicted. The estimated factor loadings show that the algebra scores are least related to the postulated latent variable.

3.13.3.2.2 Ability and aspiration

As a further example of fitting a confirmatory factor analysis model, the study of ability and aspiration described in Caslyn and Kenny (1977) is used. In this study the investigators observed the values of the following six variables for 556 white eighth-grade students: (1) self-concept of ability, (2) perceived parental evaluation, (3) perceived teacher evaluation, (4) perceived friend's evaluation, (5) educational aspiration, and (6) college plans.

The correlations between the six variables are given in Table 15. The model postulated to explain these correlations involves two latent variables, named by the authors as ability and aspiration. The first four observed variables are assumed to be indicators of ability and the last two observed variables are assumed to be indicators of aspiration. In addition the two latent variables are assumed to be correlated. Consequently the model is known as a correlated two-factor model. It may be represented in a diagram (a so-called path diagram (Everitt, 1996)) as shown in Figure 3. The mathematical form of the model appears in Table 16. Note again the differences between this model and the exploratory models of Section 3.13.3.1; here, each observed variable is constrained to have fixed zero loadings on one of the latent variables, and on the other, a free loading to be estimated from the data. The results of fitting the model to the observed correlations, using once again the EQS package, are shown in Table 17. Of particular note among these results is the estimate of the correlation between the two postulated latent variables. This estimate

Table 10 Mathematical specification of one-factor model for string data.

The factor analysis model for the string length data is

R = f + e1
G = f + e2
B = f + e3
D = f + e4

where f represents the true length and e1, e2, e3, and e4 are the errors in the four observed length measurements. These errors are assumed to have zero mean and their estimated variances will reflect the accuracy of R, G, B, and D

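[Figure 2: path diagram linking the single latent variable f to the four observed measurements R, G, B, and D, which have error terms e1, e2, e3, and e4 respectively.]

Figure 2 Single-factor model for string lengths.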

(0.666 with a standard error of 0.03) is the disattenuated correlation, which represents the correlation between true ability and true aspiration, uncontaminated by measurement error in the observed indicators of these concepts. Note that the value of the disattenuated correlation is higher than any of the observed correlations between an indicator of ability and an indicator of aspiration (the largest of which is 0.56 in Table 15).

3.13.3.2.3 Drug usage among American students

As a final example of fitting a confirmatory factor analysis model, an investigation of drug usage among American students reported by Huba, Wingard, and Bentler (1981) is considered.

The majority of adult and adolescent Americans use psychoactive substances during an increasing proportion of their lifetime. Various forms of licit and illicit psychoactive substance use are prevalent, suggesting that patterns of psychoactive substance taking are major components of the individual's behavioral repertoire and have pervasive implications for the performance of other behaviors. In an investigation of these phenomena, Huba et al. (1981) collected data on drug usage rates for 1634 students in the seventh to ninth grades in 11 schools in the greater metropolitan area of Los Angeles. Each participant completed a questionnaire about the number of times he or she had ever used a particular substance.

Table 11 Correlation matrix for the four measurements of string length.

R =

          R        G        B        D

R      1.0000
G      0.9802   1.0000
B      0.9811   0.9553   1.0000
D      0.9899   0.9807   0.9684   1.0000

Table 12 Results from fitting one-factor model with equal loadings for each variable to string data.

(i) Residuals (differences between observed and predicted correlations)

          R        G        B        D

R       0.007
G      -0.010   -0.025
B      -0.009   -0.035   -0.028
D       0.000   -0.009   -0.022   -0.007

(ii) Variances of error terms for each observed variable and their standard errors (SE)

R: variance = 0.003, SE = 0.005
G: variance = 0.035, SE = 0.014
B: variance = 0.038, SE = 0.016
D: variance = 0.017, SE = 0.008

(iii) Chi-squared goodness of fit statistic=2.575, degrees of freedom=5, P-value=0.765
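The residuals in Table 12 can be reproduced, to within rounding, from the structure of the one-factor model with unit loadings: every predicted off-diagonal element equals the variance of the latent variable, and every predicted diagonal element equals that variance plus the corresponding error variance. The sketch below assumes NumPy. The error variances and observed correlations are transcribed from Tables 11 and 12; the factor variance of 0.990 is not reported in Table 12 and is an assumption, inferred here from the diagonal residual for R (1 - 0.003 - 0.007).

```python
import numpy as np

# Observed correlations between the four length measurements (Table 11).
observed = np.array([
    [1.0000, 0.9802, 0.9811, 0.9899],
    [0.9802, 1.0000, 0.9553, 0.9807],
    [0.9811, 0.9553, 1.0000, 0.9684],
    [0.9899, 0.9807, 0.9684, 1.0000],
])

# Estimated error variances for R, G, B, and D (Table 12).
error_variances = np.array([0.003, 0.035, 0.038, 0.017])

# ASSUMPTION: variance of the latent variable f, inferred from the table
# rather than reported in it.
factor_variance = 0.990

# With all loadings fixed at one, the predicted matrix is
# factor_variance * (matrix of ones) + diag(error variances).
predicted = factor_variance * np.ones((4, 4)) + np.diag(error_variances)
residuals = observed - predicted

print(np.round(residuals, 3))
```

Every residual is small in absolute value, which is the pattern that step (x) of Table 9 asks the analyst to look for alongside the chi-squared statistic.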

The substances for which data were collected were as follows: (1) cigarettes, (2) beer, (3) wine, (4) spirits, (5) cocaine, (6) tranquilizers, (7) drugstore medications used to get high, (8) heroin and other opiates, (9) marijuana, (10) hashish, (11) inhalants (glue, petrol, etc.), (12) hallucinogenics (LSD, mescalin, etc.), and (13) amphetamine stimulants.

Responses were recorded on a five-point scale which ranged from "never tried" to "used regularly." The correlations between the usage rates of the 13 substances are shown in Table 18. The model proposed by Huba et al. (1981) for these data arose from considering previously reported research in the area, and postulated the following three latent variables:

(i) Alcohol use (f1) with nonzero loadings on beer, wine, spirits, and cigarettes.
(ii) Cannabis use (f2) with nonzero loadings on marijuana, hashish, cigarettes, and wine. The cigarette variable is assumed to load on both the first and second latent variable because it sometimes occurs with both alcohol and marijuana and at other times does not. The nonzero loading on wine was allowed because of reports that wine is frequently used with marijuana and that consequently some of the use of wine may be an indicator of tendencies towards cannabis.
(iii) Hard drug use (f3) with nonzero loadings on amphetamines, tranquilizers, hallucinogenics, hashish, cocaine, heroin, drugstore medication, inhalants, and spirits. The use of

Table 13 Results of fitting the one-factor model to examination scores, with loadings fixed at one.

(i) Correlations

              English   French   Algebra   Statistics

English        1.000
French         0.683    1.000
Algebra        0.286    0.451    1.000
Statistics     0.431    0.690    0.544     1.000

(ii) Residuals

              English   French   Algebra   Statistics

English       -0.113
French         0.104    0.147
Algebra       -0.293   -0.128   -0.265
Statistics    -0.148    0.111   -0.035     0.025

(iii) Variances of error terms for each observed variable and their standard errors

English: variance = 0.534, SE = 0.244
French: variance = 0.274, SE = 0.156
Algebra: variance = 0.686, SE = 0.299
Statistics: variance = 0.396, SE = 0.196

(iv) Chi-squared goodness-of-fit statistic=4.136, degrees of freedom=5, P-value=0.530.

each of these substances was considered to suggest a strong commitment to the notion of psychoactive drug use.

The path diagram of the proposed model is shown in Figure 4, and Table 19 shows the equivalent mathematical structure of the model. Note that each pair of the postulated variables is allowed to have a nonzero correlation. The results of fitting the model are detailed in Table 20. The chi-squared goodness-of-fit statistic takes a value 323.96 with 58 degrees of freedom and has a very small associated P-value. It appears that the proposed model does not provide an adequate explanation for the correlations between the recorded usage rates of the 13 substances, although the large sample size (n = 1634) may lead to even relatively trivial discrepancies between observed and predicted correlations being declared significant. In practice the chi-squared statistic is only one of the pieces of evidence that should be used when judging the fit of a model; other measures of fit are discussed in Dunn et al. (1993).

Amending a model that is considered not to provide an adequate fit will generally involve a mixture of the theoretical and empirical. Information from the data about possibly better fitting models is provided by two types of tests, the Lagrange multiplier test and Wald's test, both of which are described in detail in the EQS manual (Bentler, 1995). Essentially, however, the former evaluates whether, from a statistical viewpoint at least, the model could be improved by freeing a previously fixed parameter. The latter is designed to determine whether sets of parameters that were treated as free in the model could in fact be simultaneously set to zero without substantial loss in model fit. Examples of the use of the two test procedures are given in Dunn et al. (1993). In

Table 14 Results of fitting further single-factor model to examination scores; loadings now free parameters to be estimated.

(i) Residuals

              English   French   Algebra   Statistics

English        0.000
French         0.010    0.000
Algebra       -0.052   -0.016    0.000
Statistics    -0.070   -0.002    0.197     0.000

(ii) Variances of error terms for each observed variable and their standard errors

English: variance = 0.513, SE = 0.226
French: variance = 0.069, SE = 0.217
Algebra: variance = 0.766, SE = 0.298
Statistics: variance = 0.485, SE = 0.221

(iii) Estimated factor loadings and their standard errors

English: loading = 0.698, SE = 0.247
French: loading = 0.965, SE = 0.225
Algebra: loading = 0.484, SE = 0.262
Statistics: loading = 0.718, SE = 0.246

(iv) Chi-squared goodness-of-fit statistic=1.940, degrees of freedom=2, P-value=0.379

the original analysis of the drug usage data given in Huba et al. (1981) various amendments to the correlated three-factor model were tried in order to improve fit; for example, a number of error terms were allowed to be correlated. However, making such amendments to gain small improvements in fit is not always satisfactory and it should perhaps be pointed out that, in most cases, a theoretically justifiable model that provides a less adequate fit than a model adjusted on an ad hoc basis is likely to be preferred.

Recent examples of the application of confirmatory factor analysis in clinical psychology are given in Osman et al. (1995) and Hittner (1995).

3.13.4 CAUSAL MODELS AND STRUCTURAL EQUATION MODELING

The confirmatory factor analysis models described in the previous section are a subset of the models for correlational data that can now be fitted routinely using software such as EQS or LISREL (see Appendix). More complex models that specify regression-type relationships between latent variables, as well as how the manifest and latent variables are linked, have now been used in many areas from models of the female orgasm (Newcomb & Bentler, 1983) to models to represent both psychological and physical processes in cognitive functioning (Hines, Chiu, McAdams, Bentler, & Lipcamon, 1992).

Such models are most generally (and most suitably) referred to as structural equation models. However, since they are often used as a means of describing some prespecified causal theory of the structure of a set of variables of interest, they are also often labeled causal models where "causal" implies that a change in one variable is assumed to result in the change of another variable. No matter how convincing, respectable, and reasonable a path diagram and its associated model may appear, it is important to recognize that seldom do structural equation models provide any direct test of the causal

Table 15 Observed correlations for ability and aspiration example.

R =

        1      2      3      4      5      6

1     1.00
2     0.73   1.00
3     0.70   0.68   1.00
4     0.58   0.61   0.57   1.00
5     0.46   0.43   0.40   0.37   1.00
6     0.56   0.52   0.48   0.41   0.72   1.00

[Figure 3: path diagram with two latent variables, Ability and Aspiration. The observed variables are self-concept of ability, perceived parental evaluation, perceived teacher evaluation, perceived friend's evaluation, educational aspiration, and college plans, with error terms e1 to e6 respectively.]

Figure 3 Path diagram for ability and aspiration.

assumptions on which these models are based. This is generally true simply because, in most cases, an investigator's beliefs about the causal structures which underlie a set of observed variables are based on little more than common sense and intuition; it is only rarely that there is strong evidence (such as that provided by controlled experimentation) about the causal structure involved, and the only satisfactory way to demonstrate causality is through the active control of variables. As pointed out by Cliff (1983) it is simply not possible, with correlational data, to isolate the empirical system sufficiently so that the nature of the relationships among variables can be unambiguously ascertained. It seems that the old aphorism much stressed in introductory statistics courses, "correlation does not imply causation" is still apposite even in the sophisticated world of structural equation modeling.

Many investigators proposing causal models might, of course, argue that they are using the term causal in a purely metaphorical fashion. As pointed out by deLeeuw (1985), however, such a cavalier attitude towards terminology becomes hard to defend if, for example, educational programs are based on the metaphors, such as "intelligence is largely genetically determined" or "allocation of resources to schools has only very minor impact on the careers of students."

Many of the problems of so called causal modeling mentioned above arise largely because of the relative lack of well specified causal theories in the social sciences. However, not all is doom and gloom for an investigator eager to try out such models. Although it may be the case

Table 16 Mathematical structure of correlated two-factor model for ability and aspiration.

(i) The two common factors or latent variables are ability (f1) and aspiration (f2). Both are assumed to have variance of one (this is necessary since they are unobserved and their scale needs to be set in some arbitrary way)
(ii) The proposed model postulates that the relationships between the observed variables and the latent variables are as follows:

    x1 = λ1 f1 + 0 f2 + e1
    x2 = λ2 f1 + 0 f2 + e2
    x3 = λ3 f1 + 0 f2 + e3
    x4 = λ4 f1 + 0 f2 + e4
    x5 = 0 f1 + λ5 f2 + e5
    x6 = 0 f1 + λ6 f2 + e6

(iii) This may be rewritten as

    x1 = λ1 f1 + e1
    x2 = λ2 f1 + e2
    x3 = λ3 f1 + e3
    x4 = λ4 f1 + e4
    x5 = λ5 f2 + e5
    x6 = λ6 f2 + e6

(iv) Note that, unlike an exploratory factor analysis, a number of loadings are fixed a priori at zero, that is, they play no part in the estimation process
(v) The model also allows for f1 and f2 to be correlated
(vi) The model has a total of 13 free parameters (six loadings, six error variances and one correlation). The observed correlation matrix has six variances and 15 correlations, a total of 21 terms. Consequently the postulated model has 21 - 13 = 8 degrees of freedom

Table 17 Results from fitting the correlated two-factor model to correlations in Table 15.

Parameter       Estimate   Standard error   Estimate/standard error

λ1               0.863        0.035             24.558
λ2               0.849        0.035             23.961
λ3               0.805        0.035             22.115
λ4               0.695        0.039             18.000
λ5               0.775        0.040             19.206
λ6               0.929        0.039             23.569
var(e1)          0.255        0.023             19.911
var(e2)          0.279        0.024             11.546
var(e3)          0.352        0.027             13.070
var(e4)          0.516        0.035             14.876
var(e5)          0.399        0.038             10.450
var(e6)          0.137        0.044              3.152
corr(f1, f2)     0.667        0.031             21.521

The chi-square test of the fit of the model takes the value 9.26 with 8 degrees of freedom. The associated P-value is 0.321. The model provides a very adequate fit for the data.
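The estimates in Table 17 can be turned back into a predicted correlation matrix, which makes the meaning of the disattenuated correlation concrete. The sketch below assumes NumPy and uses the standard form of the predicted matrix for a factor model with correlated factors, namely (loadings) x (factor correlations) x (loadings transposed) plus a diagonal matrix of error variances. The estimates and observed correlations are transcribed from Tables 15 and 17; the variable names are illustrative.

```python
import numpy as np

# Loadings from Table 17: x1-x4 load on ability (f1), x5-x6 on aspiration (f2).
loadings = np.array([
    [0.863, 0.0],
    [0.849, 0.0],
    [0.805, 0.0],
    [0.695, 0.0],
    [0.0, 0.775],
    [0.0, 0.929],
])
factor_correlations = np.array([[1.0, 0.667], [0.667, 1.0]])
error_variances = np.array([0.255, 0.279, 0.352, 0.516, 0.399, 0.137])

# Predicted correlation matrix implied by the fitted two-factor model.
predicted = loadings @ factor_correlations @ loadings.T + np.diag(error_variances)

# Observed correlations (Table 15).
observed = np.array([
    [1.00, 0.73, 0.70, 0.58, 0.46, 0.56],
    [0.73, 1.00, 0.68, 0.61, 0.43, 0.52],
    [0.70, 0.68, 1.00, 0.57, 0.40, 0.48],
    [0.58, 0.61, 0.57, 1.00, 0.37, 0.41],
    [0.46, 0.43, 0.40, 0.37, 1.00, 0.72],
    [0.56, 0.52, 0.48, 0.41, 0.72, 1.00],
])
residuals = observed - predicted

# Largest observed correlation between an ability and an aspiration indicator.
largest_cross_correlation = observed[4:, :4].max()

print(np.round(residuals, 3))
print(largest_cross_correlation, factor_correlations[0, 1])
```

Each predicted correlation between an ability indicator and an aspiration indicator is the product of two loadings (both less than one) and the factor correlation, which is why the correlation between the latent variables exceeds every observed correlation across the two sets of indicators.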

[Figure 4: path diagram with three latent variables f1, f2, and f3. The observed variables and their error terms are cigarettes (e1), beer (e2), wine (e3), spirits (e4), cocaine (e5), tranquilizers (e6), drugstore medications (e7), heroin (e8), marijuana (e9), hashish (e10), inhalants (e11), hallucinogenics (e12), and amphetamines (e13).]

Figure 4 Path diagram for three-factor model for drug usage example.

Table 18 Correlations between usage rates for 13 substances (key to substances is given in text).

        1      2      3      4      5      6      7      8      9     10     11     12     13

 1   1.000
 2   0.447  1.000
 3   0.422  0.619  1.000
 4   0.436  0.604  0.585  1.000
 5   0.114  0.068  0.053  0.115  1.000
 6   0.203  0.146  0.139  0.258  0.349  1.000
 7   0.091  0.103  0.110  0.122  0.209  0.221  1.000
 8   0.082  0.063  0.066  0.097  0.321  0.355  0.201  1.000
 9   0.513  0.445  0.365  0.482  0.186  0.316  0.150  0.154  1.000
10   0.304  0.318  0.240  0.368  0.303  0.377  0.163  0.219  0.534  1.000
11   0.245  0.203  0.183  0.255  0.272  0.323  0.310  0.288  0.301  0.302  1.000
12   0.100  0.088  0.074  0.139  0.279  0.367  0.232  0.320  0.204  0.368  0.340  1.000
13   0.245  0.199  0.184  0.293  0.278  0.545  0.232  0.314  0.394  0.467  0.392  0.511  1.000

Table 19 Mathematical structure of the correlated three-factor model for the drug usage data.

(i) Three latent variables, alcohol use (f1), cannabis use (f2), and hard drug use (f3) are postulated. All are assumed to have variance unity
(ii) The proposed model postulates the following relationship between the observed and latent variables:

    cigarettes      = λ1 f1 + λ2 f2 + e1
    beer            = λ3 f1 + e2
    wine            = λ4 f1 + λ5 f2 + e3
    spirits         = λ6 f1 + λ7 f3 + e4
    cocaine         = λ8 f3 + e5
    tranquilizers   = λ9 f3 + e6
    drugstore       = λ10 f3 + e7
    heroin          = λ11 f3 + e8
    marijuana       = λ12 f2 + e9
    hashish         = λ13 f2 + λ14 f3 + e10
    inhalants       = λ15 f3 + e11
    hallucinogenics = λ16 f3 + e12
    amphetamines    = λ17 f3 + e13

(iii) The proposed model also allows for nonzero correlations between each pair of latent variables. The proposed model has a total of 33 parameters to estimate (17 loadings, 13 error variances, and 3 between-factor correlations). Consequently, the model has 13 × 14/2 - 33 = 58 degrees of freedom
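The parameter count in part (iii) of Table 19 follows directly from the pattern of free and fixed loadings in part (ii). As a minimal sketch, assuming NumPy, the pattern can be written as a 13 x 3 matrix of ones (free loading) and zeros (loading fixed at zero) and the count carried out mechanically; the array and variable names are illustrative rather than taken from any package.

```python
import numpy as np

substances = [
    "cigarettes", "beer", "wine", "spirits", "cocaine", "tranquilizers",
    "drugstore", "heroin", "marijuana", "hashish", "inhalants",
    "hallucinogenics", "amphetamines",
]

# 1 = free loading, 0 = loading fixed at zero; columns are f1, f2, f3.
pattern = np.array([
    [1, 1, 0],  # cigarettes
    [1, 0, 0],  # beer
    [1, 1, 0],  # wine
    [1, 0, 1],  # spirits
    [0, 0, 1],  # cocaine
    [0, 0, 1],  # tranquilizers
    [0, 0, 1],  # drugstore
    [0, 0, 1],  # heroin
    [0, 1, 0],  # marijuana
    [0, 1, 1],  # hashish
    [0, 0, 1],  # inhalants
    [0, 0, 1],  # hallucinogenics
    [0, 0, 1],  # amphetamines
])

n_variables, n_factors = pattern.shape
n_loadings = int(pattern.sum())
n_error_variances = n_variables
n_factor_correlations = n_factors * (n_factors - 1) // 2
n_free = n_loadings + n_error_variances + n_factor_correlations

# Distinct elements of the correlation matrix minus free parameters.
degrees_of_freedom = n_variables * (n_variables + 1) // 2 - n_free

print(n_loadings, n_free, degrees_of_freedom)
```

Writing the pattern out in this way also makes the cross-loadings easy to see: cigarettes and wine load on both alcohol and cannabis use, spirits on alcohol and hard drug use, and hashish on cannabis and hard drug use.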

that correlation does not imply causation, it is equally true that a well specified causal model implies testable propositions about the structure of observed correlations, and so is amenable to falsification in the same way as is any other scientific theory. At the very least, such a model may provide a convenient and parsimonious description of a set of correlations and serve to rule out many alternative hypotheses about the structure of the data.

3.13.4.1 Examples of Causal Models

After the rather lengthy discussion of their merits or otherwise given above, it is now time to look at some examples of the application of structural equation/causal modeling.

3.13.4.1.1 Ability scores over time

Table 21 gives, for a group of children, the means, standard deviations, correlations and covariances for four measures of ability (percentage of test questions correct) made at the ages of 6, 7, 9, and 11 (Osbourne & Suddick, 1972). As one would expect, the means increase progressively over time, as do the standard deviations. The correlation matrix shows higher correlations between adjacent measures than between those further apart in time.

The simplest model that might be considered for these data is that specified by the path diagram shown in Figure 5; here, ability is assumed to be a fixed trait and variation over time is assumed to arise solely from measurement error. Fitting such a model to the covariance matrix in

Table 20 Results of fitting the correlated three-factor model to the drug usage data.

Parameter       Estimate   Standard error   Estimate/standard error

λ1               0.358        0.035             10.371
λ2               0.332        0.035              9.401
λ3               0.792        0.023             35.021
λ4               0.875        0.038             23.285
λ5              -0.152        0.037             -4.158
λ6               0.722        0.024             30.673
λ7               0.123        0.023              5.439
λ8               0.465        0.026             18.079
λ9               0.676        0.024             28.182
λ10              0.359        0.025             13.602
λ11              0.476        0.026             18.571
λ12              0.912        0.030             29.958
λ13              0.396        0.030             13.379
λ14              0.381        0.029             13.050
λ15              0.543        0.025             21.602
λ16              0.618        0.025             25.233
λ17              0.763        0.023             32.980
var(e1)          0.611        0.024             25.823
var(e2)          0.374        0.020             18.743
var(e3)          0.379        0.024             16.052
var(e4)          0.408        0.019             21.337
var(e5)          0.784        0.029             26.845
var(e6)          0.544        0.023             23.222
var(e7)          0.871        0.032             27.653
var(e8)          0.773        0.029             26.735
var(e9)          0.169        0.044              3.846
var(e10)         0.547        0.022             24.593
var(e11)         0.705        0.027             25.941
var(e12)         0.618        0.025             24.655
var(e13)         0.418        0.021             19.713
corr(f1, f2)     0.634        0.027             23.369
corr(f1, f3)     0.313        0.029             10.674
corr(f2, f3)     0.499        0.027             18.412

Table 21 gives the results shown in Table 22. The chi-squared goodness-of-fit statistic is significant at the 5% level, suggesting that the simple single factor model does not provide an adequate fit for the covariance matrix.

In this example, the observed covariance matrix has been used as the basis for fitting the model of interest. In general the question of whether the covariance or correlation matrix should be used as the basis for structural equation modeling is probably not of great importance, although there are some situations where the covariance matrix is definitely to be preferred (Cudeck, 1989).

A path diagram for a more plausible model for the ability data is shown in Figure 6; this model postulates causal effects between one latent variable and another and the presence of the disturbance terms on f2, f3, and f4 (terms d2, d3, and d4) means that latent ability is not regarded as an entirely fixed and stable trait, but may vary, increasing or decreasing relative to other children, from one time to the next. The results of fitting this model are shown in Table 23. The chi-squared test statistic now indicates that the model fits satisfactorily. The regression coefficients of each latent variable on the one preceding it in time are all highly significant (see Dunn et al., 1993, for a detailed account).

3.13.4.1.2 Stability of alienation

As a further example of structural equation modeling a study reported by Wheaton, Muthen, Alwin, and Summers (1977) is used. The study was concerned with the stability over time of attitudes such as alienation and their relationships to background variables such as

Table 21 Ability scores over time.

Statistic               Age 6     Age 7     Age 9     Age 11

Mean                   18.034    25.819    35.255    46.593
Standard deviation      6.374     7.319     7.796    10.386

(i) Correlation matrix

            Age 6     Age 7     Age 9     Age 11

Age 6       1.000
Age 7       0.809     1.000
Age 9       0.806     0.850     1.000
Age 11      0.765     0.831     0.867     1.000

(ii) Covariance matrix

            Age 6     Age 7     Age 9     Age 11

Age 6      40.628
Age 7      37.741    53.568
Age 9      40.052    48.500    60.778
Age 11     50.643    63.169    70.200   107.869
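The covariance matrix in part (ii) of Table 21 carries the same information as the correlation matrix in part (i) together with the standard deviations: each covariance is the corresponding correlation multiplied by the two standard deviations. A minimal sketch of the conversion, assuming NumPy, with the values transcribed from Table 21:

```python
import numpy as np

# Standard deviations at ages 6, 7, 9, and 11 (Table 21).
standard_deviations = np.array([6.374, 7.319, 7.796, 10.386])

# Correlation matrix (Table 21, part (i)), written out in full symmetric form.
correlations = np.array([
    [1.000, 0.809, 0.806, 0.765],
    [0.809, 1.000, 0.850, 0.831],
    [0.806, 0.850, 1.000, 0.867],
    [0.765, 0.831, 0.867, 1.000],
])

# cov(i, j) = corr(i, j) * sd(i) * sd(j); the diagonal gives the variances.
covariances = correlations * np.outer(standard_deviations, standard_deviations)

print(np.round(covariances, 3))
```

The result agrees with part (ii) of the table to within rounding. Because the standard deviations grow with age, the covariance matrix retains information about the changing scale of the scores that the correlation matrix discards.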

education and occupation. Data were collected on attitude scales from 932 people in two rural regions in Illinois at three points in time (1966, 1967, and 1971). Only that part of the data collected in 1967 and 1971 will be of concern here and Table 24 shows the covariances between six observed variables. The anomia and powerlessness subscales are taken to be indicators of a latent variable, alienation, and the two background variables, education (years of schooling completed) and Duncan's socioeconomic index (SEI) are assumed to relate to a respondent's socioeconomic status. The path diagram for the model postulated to explain the covariances between the observed variables is shown in Figure 7. The model involves a combination of a confirmatory factor analysis model with a regression model for the latent variables. One of the important questions here involves the size of the regression coefficient of alienation in 1971 on alienation in 1967, since this reflects the stability of the attitude over time. Note that the error terms of anomia and powerlessness are allowed to be correlated over time to account for possible memory or other retest effects. Some of the results of fitting the proposed model are shown in Table 25.

The chi-squared goodness-of-fit statistic takes a value of 4.73 with four degrees of freedom and suggests that the proposed model fits the observed covariances extremely well. The estimated regression coefficient of alienation on socioeconomic status in both 1967 and 1971 is negative, as might have been expected since higher socioeconomic status is likely to result in lower alienation and vice versa. The

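[Figure 5: path diagram linking a single latent variable f to the observed ability scores at age 6, age 7, age 9, and age 11, which have error terms e1, e2, e3, and e4 respectively.]

Figure 5 Single-factor model for ability data.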

Table 22 Simple single-factor model: results for ability data.

Parameter    Estimate   Standard error   Estimate/standard error

λ1             5.84        0.360             15.244
λ2             6.684       0.397             16.842
λ3             7.326       0.414             17.705
λ4             9.476       0.564             16.813
var(e1)       10.555       1.225              8.617
var(e2)        8.89        1.197              7.427
var(e3)        7.113       1.163              6.133
var(e4)       18.084       2.424              7.460

(i) Chi-squared goodness-of-fit statistic = 6.085, degrees of freedom = 2, P-value = 0.048.
(ii) Note that these estimates are obtained from the covariance matrix so that the factor loadings are no longer correlations between observed and latent variables. They now represent regression coefficients.
(iii) The statistical significance of each parameter can be judged by the z-statistics given in the final column. Values outside (-2, 2) are roughly significant at the 5% level.

estimated regression coefficient for alienation in 1971 on alienation in 1967 is positive and highly significant. Clearly the attitude remains relatively stable over the time period.

An example of the application of structural equation modeling in clinical psychology is given in Taylor and Rachman (1994).

3.13.5 LATENT VARIABLES: MYTHS AND REALITIES

Having already commented in Section 3.13.4 that the "causal" in causal modeling is usually a misnomer, what can be said about the concept of a latent variable, central to the methods described in this chapter? In one sense, latent variables can never be anything more than is contained in the observed variables and never anything beyond what has been specified in the model. For example, in the statement that verbal ability is whatever certain tests have in common, the empirical meaning is nothing more than a shorthand for the observation of the correlations. It does not mean that verbal ability is a variable that is measurable in any manifest sense. In fact latent variables are essentially hypothetical constructs invented by a scientist for the purpose of understanding some research area of interest, and for which there exists no operational method for direct measurement. Consequently, a question that needs to be asked is "Can science advance by inferences based upon hypothetical constructs that cannot be measured or empirically tested?" According to Lenk (1986) the answer is a resounding "sometimes." For example, atoms in the eighteenth and nineteenth centuries were hypothetical constructs which allowed the foundation of thermodynamics; gravity and the electromagnetic field are further examples from physics. Clearly a science can advance using the concept of a latent variable, but their importance is not their reality or otherwise, but rather to what extent the models of which they are a part are able to describe and predict

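Figure 6 Causal model for ability data. [Path diagram not reproduced. Labels in the diagram: manifest variables age 6, age 7, age 9, and age 11; error terms e1 to e4; latent variables f1 to f4; disturbance terms d2 to d4.]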

Table 23 Results from fitting causal model for ability data.

(i) Regression coefficients of latent variables on preceding latent variables

f2 on f1: 1.120, SE = 0.064
f3 on f2: 1.044, SE = 0.046
f4 on f3: 1.296, SE = 0.054

(ii) Chi-squared goodness-of-fit statistic=1.433, degrees of freedom=2, P-value=0.489
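
The P-values quoted alongside the chi-squared fit statistics can be checked by hand when there are 2 degrees of freedom, because in that case the upper-tail probability has the closed form exp(-x/2). The sketch below is only an illustrative check; the function name is hypothetical and is not part of EQS or LISREL.

```python
import math


def chi2_upper_tail_2df(x: float) -> float:
    """Upper-tail probability P(X >= x) for a chi-squared variable with 2 df.

    With 2 degrees of freedom the chi-squared distribution is an
    exponential distribution with mean 2, so the tail is exp(-x / 2).
    """
    if x < 0:
        raise ValueError("a chi-squared statistic cannot be negative")
    return math.exp(-x / 2.0)


# Fit statistics quoted in the tables, each on 2 degrees of freedom.
for statistic, reported_p in [(6.085, 0.048), (1.433, 0.489)]:
    p = chi2_upper_tail_2df(statistic)
    print(f"chi-squared = {statistic}: computed P = {p:.4f}, reported P = {reported_p}")
```

The computed values agree with the reported ones to within rounding of the printed statistics. A small P-value (as for 6.085) indicates that the model fits poorly; a large one (as for 1.433) gives no evidence against the model.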

phenomena (Lakatos, 1977). This point is nicely summarized by D. M. Fergusson and L. J. Horwood (personal communication, 1986):

Scientific theories describe the properties of observed variables in terms of abstractions which summarize and make coherent the properties of observed variables. Latent variables, are, in fact, one of this class of abstract statements and the justification for the use of these variables lies not in an appeal to their "reality" or otherwise but rather to the fact that these variables serve to synthesize and summarize the properties of observed variables.

This point was also made by the participants in the Conference on Systems under Indirect Observation who concluded, after some debate (Bookstein, 1982) that latent variables are "as real as their predictive consequences are valid." Such a comment implies that the justification for postulating latent variables is their theoretical utility rather than their reality.

3.13.6 SUMMARY

The possibility of making causal inferences about latent variables has great appeal for the social and behavioural scientist, simply because many of the concepts in which they are most interested are not directly measurable. Many of the statistical and technical problems in applying the appropriate models to empirical data have largely been solved, and sophisticated software such as EQS means that researchers can investigate and fit extremely complex models routinely. Unfortunately, in their rush not to be left behind in the causal modeling stakes, many investigators appear to have abandoned completely their proper scientific skepticism, and accepted models as reasonable, simply because it has been possible to fit them to data. This would not be so important if it were not the case that much of the research involved is in areas where action, perhaps far-reaching action, taken on the basis of the findings of the research can have enormous implications, for example, in resources for education and

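Figure 7 Causal model for stability of alienation. [Path diagram not reproduced. Labels in the diagram: manifest variables Anomia 67, Powerlessness 67, Anomia 71, Powerlessness 71, Education, and SEI; error terms e1 to e6; latent variables f1, f2, and f3; disturbance terms d2 and d3.]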

Table 24 Covariance of manifest variables in the stability of alienation example.

         1        2        3        4        5        6

1   11.834
2    6.947    9.364
3    6.819    5.09    12.532
4    4.783    5.028    7.495    9.986
5   -3.839   -3.889   -3.841   -3.625    9.610
6   -2.190   -1.883   -2.175   -1.878    3.552    4.503

Note. 1 = Anomia 67, 2 = Powerlessness 67, 3 = Anomia 71, 4 = Powerlessness 71, 5 = Education, 6 = Duncan's Socioeconomic Index.

Table 25 Regression coefficients for stability of the alienation model in Figure 7.

Alienation 67 on socioeconomic status: -1.500, SE = 0.124
Alienation 71 on alienation 67: 0.607, SE = 0.051
Alienation 71 on socioeconomic status: -0.592, SE = 0.131
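
The rule of thumb that z-statistics outside (-2, 2) are roughly significant at the 5% level can be applied to the coefficients in Table 25. The sketch below assumes the usual ratio of estimate to standard error and a normal reference distribution; the function name is hypothetical and the calculation is an illustration, not output from EQS.

```python
from statistics import NormalDist


def z_and_two_sided_p(estimate: float, standard_error: float) -> tuple[float, float]:
    """Return the z-statistic (estimate / SE) and its two-sided normal P-value."""
    z = estimate / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Estimates and standard errors as printed in Table 25.
table_25 = {
    "Alienation 67 on socioeconomic status": (-1.500, 0.124),
    "Alienation 71 on alienation 67": (0.607, 0.051),
    "Alienation 71 on socioeconomic status": (-0.592, 0.131),
}

for label, (estimate, se) in table_25.items():
    z, p = z_and_two_sided_p(estimate, se)
    verdict = "outside (-2, 2)" if abs(z) > 2 else "inside (-2, 2)"
    print(f"{label}: z = {z:.2f} ({verdict}), two-sided P = {p:.2g}")
```

All three ratios fall well outside (-2, 2), which is consistent with the text's description of the stability coefficient as positive and highly significant.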

legislation on racial inequality. Consequently, both producers of such research and audiences or consumers of it need to be particularly concerned that the conclusions reached are valid ones. With this in mind I would like to end with the caveat issued by Cliff (1983, p. 125):

beautiful computer programs do not really change anything fundamental. Correlational data are still correlational, and no computer program can take account of variables that are not in the analysis. Causal relations can only be established through patient, painstaking attention to all the relevant variables, and should involve active manipulation as a final confirmation.

ACKNOWLEDGMENT

I would like to thank Dr. Nina Schooler for many helpful suggestions which substantially improved this chapter.

Table 26 Identification examples.

(i) Consider three variables, x1, x2, and x3, with correlation matrix R given by

          x1      x2      x3

     x1   1.00
R =  x2   0.83    1.00
     x3   0.78    0.67    1.00

(ii) Suppose we are interested in fitting a single-factor model, that is

\[ x_1 = \lambda_1 f + e_1 \]
\[ x_2 = \lambda_2 f + e_2 \]
\[ x_3 = \lambda_3 f + e_3 \]

(iii) There are seven parameters to be estimated, namely

\[ \lambda_1,\ \lambda_2,\ \lambda_3,\ \operatorname{var}(f),\ \operatorname{var}(e_1),\ \operatorname{var}(e_2),\ \operatorname{var}(e_3) \]

(iv) There are, however, only six statistics for use in parameter estimation:

\[ \operatorname{var}(x_1),\ \operatorname{var}(x_2),\ \operatorname{var}(x_3),\ \operatorname{corr}(x_1, x_2),\ \operatorname{corr}(x_1, x_3),\ \operatorname{corr}(x_2, x_3) \]

(v) Consequently, the model is underidentified

(vi) If var(f) is set equal to one then the model is just identified: there are exactly the same number of parameters to estimate as there are informative sample statistics

(vii) Equating observed and expected variances and correlations will give the required estimates:

\[ \hat{\lambda}_1 \hat{\lambda}_2 = 0.83 \]
\[ \hat{\lambda}_1 \hat{\lambda}_3 = 0.78 \]
\[ \hat{\lambda}_2 \hat{\lambda}_3 = 0.67 \]
\[ \widehat{\operatorname{var}}(e_1) = 1.0 - \hat{\lambda}_1^2 \]
\[ \widehat{\operatorname{var}}(e_2) = 1.0 - \hat{\lambda}_2^2 \]
\[ \widehat{\operatorname{var}}(e_3) = 1.0 - \hat{\lambda}_3^2 \]

(where "hats" indicate estimates)

(ix) Solving these equations leads to the estimates

\[ \hat{\lambda}_1 = 0.99,\quad \hat{\lambda}_2 = 0.84,\quad \hat{\lambda}_3 = 0.79,\quad \widehat{\operatorname{var}}(e_1) = 0.02,\quad \widehat{\operatorname{var}}(e_2) = 0.30,\quad \widehat{\operatorname{var}}(e_3) = 0.38 \]

(x) Now consider an analogous measurement model with four observed variables and again set var(f) = 1 (this model is now the one used on the string data and the examination scores data)

(xi) Equating observed and expected variances and correlations in this case will lead to more than a single unique estimate for some of the parameters. The model is overidentified and represents a genuinely more parsimonious description of the structure of the data. Here a better strategy for estimation is clearly needed (see Table 9)
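
The just-identified case in Table 26 can be solved in closed form: multiplying the first two product equations and dividing by the third isolates the square of the first loading. The sketch below works through that arithmetic and also counts degrees of freedom as (number of distinct variances and correlations) minus (number of free parameters). The function names are hypothetical, and this is a hand calculation rather than the estimation method used by EQS or LISREL.

```python
import math


def solve_single_factor(r12: float, r13: float, r23: float):
    """Loadings and residual variances for a one-factor model with three
    standardized indicators and var(f) fixed at 1.

    From l1*l2 = r12, l1*l3 = r13 and l2*l3 = r23 it follows that
    l1**2 = r12 * r13 / r23.
    """
    l1 = math.sqrt(r12 * r13 / r23)
    l2 = r12 / l1
    l3 = r13 / l1
    loadings = (l1, l2, l3)
    residual_variances = tuple(1.0 - l * l for l in loadings)
    return loadings, residual_variances


def degrees_of_freedom(n_observed: int, n_free_parameters: int) -> int:
    """Distinct variances and correlations minus free parameters."""
    n_statistics = n_observed * (n_observed + 1) // 2
    return n_statistics - n_free_parameters


loadings, residuals = solve_single_factor(0.83, 0.78, 0.67)
print("loadings:          ", [round(v, 3) for v in loadings])
print("residual variances:", [round(v, 3) for v in residuals])

# Three indicators, var(f) fixed: 3 loadings + 3 residual variances.
print("df, three indicators:", degrees_of_freedom(3, 6))
# Four indicators, var(f) fixed: 4 loadings + 4 residual variances.
print("df, four indicators: ", degrees_of_freedom(4, 8))
```

The values obtained this way are close to, though not identical with, the two-decimal estimates printed in the table; the small differences may simply reflect rounding in the printed figures. A result of zero degrees of freedom corresponds to the just-identified case, and a positive result to the overidentified case in which fit can be tested.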

3.13.7 APPENDIX

Exploratory factor analysis methods such as principal factor analysis and maximum likelihood factor analysis are available in all major packages, for example, SPSS, SAS, MINITAB, and SYSTAT. Options that need to be selected include (i) method of factor analysis, (ii) method of estimating initial communalities, and (iii) method of rotation.

For many examples the solutions given by a different combination of the available options will be very similar. In most cases the available software makes the derivation of the results of a factor analysis very simple; interpretation of these results is often, however, a different matter.

Confirmatory factor analysis and structural equation modeling are generally undertaken using either the LISREL or EQS package. The former is described in Jöreskog and Sörbom (1993) and the latter in Dunn et al. (1993). Using either piece of software requires some degree of understanding of how to translate proposed models into equations that specify the model explicitly (see Tables 10 and 19). One of the problems of structural equation modeling conveniently ignored in the text is that of model identification, which refers to the degree to which there is a sufficient number of equations to solve for each of the parameters to be estimated. Models can be (i) underidentified (too few equations), (ii) just identified (no degrees of freedom remain for testing the fit of the model; the model will fit perfectly but will not give a more parsimonious description of the data than is provided by the observed correlations or covariances), or (iii) overidentified (more equations than parameters, so the fit of the model can be tested). Table 26 illustrates these different situations with a number of simple models, and a full discussion of identification is given in Dunn et al. (1993). In many cases the identification status of a complex model (and, at times, relatively simple models) is very difficult to ascertain a priori. Nonidentified models will cause problems for both EQS and LISREL and users need to be very wary of results which are accompanied by messages that point out that a parameter is, for example, at its lower or upper bound (i.e., a variance is zero or a correlation is one) or that some parameters are linearly dependent on others.

3.13.8 REFERENCES

Bentler, P. M. (1995). EQS Structural Equations Program Manual. Encino, CA: Multivariate Software.
Bookstein, F. L. (1982). Panel discussion: modeling method. In K. Jöreskog & H. Wold (Eds.), Systems under indirect observation: Causality, structure and prediction (pp. 317–322). Amsterdam: North-Holland.
Caslyn, R. J. & Kenny, D. A. (1977). Self-concept of ability and perceived evaluation of others. Cause or effect of academic achievement? Journal of Educational Psychology, 69, 136–145.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Cudeck, R. (1989). Analysis of correlation matrices using covariance structure models. Psychological Bulletin, 105, 317–327.
deLeeuw, J. (1985). Book review. Psychometrika, 50, 371–375.
Dunn, G., Everitt, B. S., & Pickles, A. (1993). Modelling covariances and latent variables using EQS. London: Chapman and Hall.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.
Everitt, B. S. (1987). An introduction to optimization methods and their application in statistics. London: Chapman and Hall.
Everitt, B. S. (1996). Making sense of statistics in psychology. Oxford: Oxford University Press.
Everitt, B. S. & Dunn, G. (1991). Applied multivariate data analysis. London: Arnold.
Hines, M., Chiu, L., McAdams, L. A., Bentler, P. A., & Lipcamon, J. (1992). Cognition and the corpus callosum: Verbal fluency, visuospatial ability and language lateralization related to midsagittal surface areas of callosal subregions. Behavioral Neuroscience, 106, 3–14.
Huba, G. J., Wingard, J. A., & Bentler, P. M. (1981). A comparison of two latent variable causal models for adolescent drug use. Journal of Personality and Social Psychology, 40, 180–193.
Hittner, J. B. (1995). Factorial validity and equivalence of the alcohol expectancy questionnaire tension-reduction subscale across gender and drinking frequency. Journal of Clinical Psychology, 51, 563–576.
Jöreskog, K. & Sörbom, D. (1993). Lisrel 8 structural equation modeling with the simplis command language. Hillsdale, NJ: Erlbaum.
Lakatos, I. (1977). The methodology of scientific research programmes. Cambridge, UK: Cambridge University Press.
Lenk, P. J. (1986). Book review. Journal of the American Statistical Association, 81, 1123–1124.
Newcomb, M. D. & Bentler, P. M. (1983). Dimensions of subjective female orgasmic responsiveness. Journal of Personality and Social Psychology, 44, 862–873.
Osborne, R. T. & Suddick, D. E. (1972). A longitudinal investigation of the intellectual differentiation hypothesis. Journal of Genetic Psychology, 121, 83–89.
Osman, A., Barrios, F. X., Kopper, B., Osman, J. R., Grittman, L., Troutman, J. A., & Panak, W. J. (1995). The pain behavior check list (PBCL); psychometric properties in a college sample. Journal of Clinical Psychology, 51, 775–782.
Skevington, S. M. (1990). A standardised scale to measure beliefs about controlling pain: A preliminary study. Psychology and Health, 4, 221–232.
Taylor, S. & Rachman, S. J. (1994). Stimulus estimation and the overprediction of fear. British Journal of Clinical Psychology, 33, 173–181.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Wheaton, B., Muthen, B., Alwin, D., & Summers, G. (1977). Assessing reliability and stability in panel models. In D. R. Heise (Ed.), Sociological methodology (pp. 84–136). San Francisco: Jossey-Bass.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.14 The Shift from Significance Testing to Effect Size Estimation

MICHAEL BORENSTEIN Hillside Hospital, Glen Oaks, NY, USA

3.14.1 INTRODUCTION
3.14.2 THE SIGNIFICANCE TEST
  3.14.2.1 The Logic of the Significance Test
  3.14.2.2 Errors of Inference
    3.14.2.2.1 Type I error
    3.14.2.2.2 Type II error
3.14.3 POWER ANALYSIS
  3.14.3.1 Role of Effect Size in Power Analysis
    3.14.3.1.1 The effect size used in power analysis is not necessarily the population effect size
    3.14.3.1.2 Conventions for effect size
  3.14.3.2 Role of Alpha in Power Analysis
  3.14.3.3 Role of Tails in Power Analysis
  3.14.3.4 Role of Sample Size in Power Analysis
  3.14.3.5 Computing Power in Power Analysis
  3.14.3.6 The Null Hypothesis vs. the Nil Hypothesis
  3.14.3.7 Retrospective Power Analysis
  3.14.3.8 Ethical Issues in Power Analysis
  3.14.3.9 Power Analysis in Perspective
  3.14.3.10 Some Real World Issues in Power Analysis
  3.14.3.11 Computer Programs for Power Analysis
3.14.4 PROBLEMS WITH THE SIGNIFICANCE TEST
  3.14.4.1 Significance Tests Address the Wrong Question
  3.14.4.2 The Null Hypothesis is not a Viable Hypothesis
  3.14.4.3 The Gratuitous Use of Significance Tests
  3.14.4.4 p-Values: Responding with a Misdirected Answer
    3.14.4.4.1 Misinterpreting the significant p-value
    3.14.4.4.2 Misinterpreting the nonsignificant p-value
  3.14.4.5 Summary
3.14.5 FOCUSED SIGNIFICANCE TESTS: A DIGRESSION
3.14.6 EFFECT SIZE AND CONFIDENCE INTERVALS
  3.14.6.1 The Key Advantage of this Approach
  3.14.6.2 Effect Size and Confidence Intervals can be Computed for Any Type of Study
  3.14.6.3 Factors Affecting the Confidence Interval Width
  3.14.6.4 Information Conveyed by the Effect Size and Confidence Interval
3.14.7 THE RELATIONSHIP BETWEEN SIGNIFICANCE TESTING AND EFFECT SIZE ESTIMATION
  3.14.7.1 Confidence Intervals as a Surrogate for the Significance Test
    3.14.7.1.1 Confidence intervals should serve only as a surrogate for the significance test
    3.14.7.1.2 Confidence intervals can optionally be used as a surrogate for the significance test
    3.14.7.1.3 Confidence intervals should never be used as a surrogate for the significance test
  3.14.7.2 Should the Significance Test be Banned?
  3.14.7.3 Should the Analysis of the Single Study be Banned?
  3.14.7.4 Looking Ahead
3.14.8 USE OF CONFIDENCE INTERVALS IN STUDIES WHOSE PURPOSE IS TO PROVE THE NULL
3.14.9 COMPUTATIONAL ISSUES IN CONFIDENCE INTERVALS
3.14.10 STUDY PLANNING: PRECISION ANALYSIS VS. POWER ANALYSIS
  3.14.10.1 Planning for Precision
3.14.11 META-ANALYSIS
3.14.12 CONCLUSIONS
3.14.13 REFERENCES

3.14.1 INTRODUCTION

A statistical wit once remarked that researchers often pose the wrong question and then proceed to answer that question incorrectly. Statistical analyses in psychological research traditionally have taken the form of significance tests. The near-universal use of significance tests has made them the de facto standard of proof in research and the logic that underlies significance tests has played an important role in shaping the development of psychological theory. In fact, though, significance tests are generally not appropriate for the kinds of questions that are addressed in psychological research. First, significance tests address the wrong question. Researchers are concerned with clinical significance (i.e., the magnitude of the effect) but significance tests address only statistical significance (whether or not the effect is zero). Second, the significance test, by focusing attention on the p-value, lends itself to mistakes of interpretation. Significant p-values are assumed to reflect clinically important effect sizes, while in fact a significant p-value may reflect a large sample size rather than a large effect size. Similarly, nonsignificant p-values are taken as evidence that the treatment has no impact, though this conclusion is almost always incorrect.

To address these issues researchers have been moving away from significance tests and toward effect size estimation. Rather than report that "p=0.01," the researcher would report that "the treatment improved the response rate by 20 percentage points (95% confidence interval of 0.15–0.25)." While there is widespread consensus that the shift toward effect size estimation is appropriate, there is considerable disagreement over the nature of the proposed shift. Some feel that the logic of the significance test should be retained, with the shift to the use of confidence intervals being one of format rather than substance. From this perspective, the confidence intervals serve as a surrogate for the significance test. Others argue that we need to rethink the role of the single study as an element in the research process, and that any vestige of the significance test should be excluded. From this perspective, confidence intervals serve as an index of precision only.

We also consider the implications of this paradigm for study planning. The discussion of significance tests is followed by a section on power analysis, which is used in study planning to ensure that the study will yield a statistically significant result. The discussion of effect size estimation is followed by a discussion of precision analysis, which may be used in study planning to ensure that the study will yield a precise estimate of the treatment effect.

3.14.2 THE SIGNIFICANCE TEST

3.14.2.1 The Logic of the Significance Test

Research trials are always carried out on a sample of finite size. The results obtained in the sample are assumed to be representative of the larger population but because of random sampling error will rarely, if ever, mirror that population exactly.

For example, consider a study whose goal is to determine whether or not a new drug being tested with acute schizophrenics yields a higher response rate than the current treatment. Assume that the drug really is effective, and (in the population) increases the response rate by 20 percentage points. The mean effect over an infinite number of samples would be 20 points, but the effect observed in any single sample would almost invariably fall somewhat below or above 20 points. Similarly, if the drug really has zero effect in the population then the mean effect over an infinite number of samples would be zero points, but the effect observed in any single sample would likely fall somewhat below zero (the response rate being lower for the new drug), or above zero.

Therefore, if we run a study and observe a difference in response rates for the two groups, we need to determine whether or not the sample may have been drawn from the second population above (where the treatment effect is zero). To this end we pose the null hypothesis that the two treatments are equally effective. If the null hypothesis is true, then the effect observed in our sample, if not zero, should fall within a "reasonable" distance of zero. If our sample were to yield a rate difference that was "compelling," in the sense that it was not "reasonably close to zero," we would conclude that the assumption of equal response rates is not viable.

Our ability to conclude that the observed effect is "compelling" will depend on three factors, each of them intuitive. The first is the size of the effect. An observed difference of 40 points in response rates is more compelling than an observed difference of 20 points. The second is the size of the sample. An observed difference of 20 points may be discounted if based on a sample of 10 cases per group since a single aberrantly responding patient could pull the response rate by 10 percentage points in either direction. The same size effect would be more convincing in a sample of 40 patients per group, and would probably be seen as definitive in a study that employed 100 patients per group. The third element is the stringency of the criteria we require as evidence that a difference exists. If the treatment carried little risk of harm and was inexpensive, we might be willing to declare it "effective" even if the evidence was relatively weak. On the other hand, if it carried a nontrivial risk of harm or a high cost we might require that it meet a more stringent criterion.

This logic (how large is the effect, how large is the sample, how certain do I want to be) is formalized in the test of significance that is routinely applied to studies. Significance tests vary from one statistical procedure to the next but are all variations on the theme

\[ \text{Test statistic} = \frac{\text{Observed difference}}{\text{Dispersion of the difference}} \]

The numerator in this equation is the observed difference, for example, a rate difference of 10 points or 20 points between groups. The denominator reflects the sample-to-sample dispersion expected in the numerator, and incorporates information about the sample size (when the sample is large, the expected dispersion is small). The test statistic is computed as the ratio of these two values. As such, it will be high when the observed effect is large and/or the sample size is large. In a sense, the test statistic serves as a kind of signal-to-noise ratio: if the population effect is zero we expect the sample effect to vary within some range of zero. The numerator in the equation reflects the observed effect and the denominator defines the expected range in which sample effects will normally fall. The test statistic tells us whether or not the sample effect falls outside the normal range.

The test statistic is compared with a table of values showing the expected range of the test statistic under the assumption that the null hypothesis is true. If our test statistic falls within this range, we cannot reject the null. If our test statistic falls outside this range, however, we would conclude that the null hypothesis is most likely false, and the treatment does have an effect. We need to decide what level of evidence we will require prior to rejecting the null, and this threshold is known as alpha. For example, if alpha is set at 0.05, then the study result will be deemed significant if, on the assumption that the null is true, the study effect (or larger) would be observed in only 5% of cases.

3.14.2.2 Errors of Inference

The significance test carries the potential for two types of error: We may reject as false a null hypothesis that in fact is true (type I error), or we may accept as true a null hypothesis that in fact is false (type II error).

3.14.2.2.1 Type I error

As noted earlier, study results would be declared "significant" if the sample effect was "compelling." The type I error rate is denoted by alpha. With alpha set at 0.05, over an infinite number of studies where the null is true, our sample will nevertheless meet the criterion and be declared significant in 5% of the studies. This situation, where the null is true but we decide in error that the treatment is effective, is referred to as a type I error.

This point is repeated for emphasis. There is a general perception that 5% of studies will result in a type I error, but this is not correct. When the null hypothesis is true 5% of studies will result in a type I error. However, if the treatment really is effective, then by definition a type I error is not possible and the type I error rate is zero.

3.14.2.2.2 Type II error

To meet the criterion for statistical significance a study must yield an effect size that is large enough to establish significance, given the sample size and the criterion alpha. Even if the treatment is effective, and even if the size of the effect is substantial (in the population and/or in the sample), there is no guarantee that the study will yield a statistically significant result. The effect observed in the sample may be smaller than the population effect (in half of all unbiased studies, it will be), which may yield a nonsignificant result. Indeed, even if the sample effect is as large (or larger) than the population effect, unless the sample size is adequate the results may not meet the criterion for statistical significance. The type II error rate is denoted by beta. If the treatment really is effective and beta is 0.20, then 20% of all studies will yield nonsignificant results.

3.14.3 POWER ANALYSIS

The likelihood that a study will yield a statistically significant effect is the study's power and the process of designing a study to ensure an appropriate level of power is referred to as a power analysis. If the purpose of the study is to test the null hypothesis, then a power analysis is a critical element in study planning, to ensure that the study will be able to meet this goal.

The statistical significance computed subsequent to the study is a function of three elements: effect size, sample size, and alpha. It follows that power, which reflects the likelihood that the study will yield statistical significance, is determined by the same three elements. Specifically, the larger the effect size used in the power analysis, the larger the sample size, and/or the more liberal the criterion required for significance (alpha), the higher the expectation that the study will yield a statistically significant effect. These three factors, together with power, form a closed system: once any three are established, the fourth is completely determined. The goal of a power analysis is to find an appropriate balance among these factors by taking into account the substantive goals of the study and the resources available to the researcher.

3.14.3.1 Role of Effect Size in Power Analysis

Power analysis gives power for a specific effect size. For example, the researcher might report "If the treatment increases the response rate by 20 percentage points the study will have power of 80% to yield a significant effect." For the same sample size and alpha, if the treatment effect is less than 20 points then power will be less than 80%. If the treatment effect exceeds 20 points, then power will exceed 80%.

Since our computation of power ensures, to some likelihood, that the study will succeed in rejecting the null given a specific effect size, it follows that the effect size used in the power analysis should represent the smallest effect that would be of clinical or substantive significance. In clinical trials, for example, the selection of an effect size might take account of the following factors:

(i) The severity of the illness being treated. If we are testing a treatment that will reduce the likelihood of relapse among schizophrenics, we might decide that a reduction of as little as 10% is clinically important since the implications of a relapse can be relatively severe. By contrast, if we are testing a treatment that was expected to reduce the likelihood of an anxiety attack, a 10% reduction might be of little clinical import since the attacks are transitory. In this case, we might decide that only a reduction of 20% or more is clinically important.

(ii) The availability of alternate treatments. If no treatments exist for a particular condition, then a treatment that was effective in even a small proportion of patients might be clinically important. By contrast, if treatments already exist for the condition, then a new treatment might need to surpass these to be considered clinically important. This might mean that the new treatment was effective in a larger proportion of affected persons, or had a larger impact for individual patients.

(iii) Treatment cost and side effects. In deciding what constitutes a clinically important effect we might want to take account of such issues as treatment costs and side effects. A drug that carried a risk of serious side effects or that carried a high price might be considered clinically useful only if it was effective in a substantially higher proportion of cases than available treatments, or was being used to treat a severe condition.

The factors that are taken into account for identifying a clinically important effect will of course vary from one study to the next and the three items mentioned above are intended only as examples. The general point is that the selection of an effect size should be made on the basis of substantive issues and that these will vary from one study to the next.

A study that has adequate power to detect a relatively large effect will not have adequate power to detect a small effect. By contrast, a study that has adequate power to detect a relatively small effect will, of course, have more than enough power to detect moderate or large effects. While one might therefore be tempted to set the "clinically important effect" at a small value to ensure high power for even a small effect, this determination cannot be made in isolation. Small effects will require a larger investment of resources than large effects. The selection of an effect size reflects the need for balance between the size of the effect that we can detect, and the resources available for the study.

Figure 1 shows power as a function of sample size for three levels of effect size (assuming alpha, two-tailed, is set at 0.05). For the smallest effect (30% vs. 40%), we would need a sample of 376 per group to yield power of 80%. For the intermediate effect (30% vs. 50%), we would need a sample of 103 per group to yield this level of power. For the highest effect size (30% vs. 60%), we would need a sample of 49 per group to yield power of 80%. In this example we may decide that it would make sense to enroll 103 patients per group to detect the intermediate effect but inappropriate to enroll 376 patients per group to detect the smallest effect.

3.14.3.1.1 The effect size used in power analysis is not necessarily the population effect size

Researchers often assume that the effect size used in a power analysis is the "true" (population) effect size. In fact, though, the "true" effect size is not known. While the effect size in the power analysis is assumed to reflect the population effect size for the purpose of calculations, the power analysis is more appropriately expressed as "If the true effect is 20 percentage points power would be . . . " rather than "The true effect is 20 percentage points, and therefore power is . . . "

This distinction is an important one. Researchers sometimes assume that a power analysis cannot be performed in the absence of pilot data. In fact, it is usually possible to perform a power analysis based entirely on a logical assessment of what constitutes a clinically (or theoretically) important effect. Indeed, while the effect observed in prior studies might help to provide an estimate of the true effect it is not likely to be the true effect in the population; if we knew that the effect size in these studies was accurate, there would be no need to run the new study.

Since the effect size used in power analysis is not the "true" population value, the researcher may elect to present a range of power estimates. For example (assuming N=103 per group and alpha=0.05, two-tailed), "The study will have power of 80% to detect a treatment effect of 20 points (30% vs. 50%), and power of 99% to detect a treatment effect of 30 points (30% vs. 60%)."

The nature of the effect size will vary from one statistical procedure to the next. In all cases, however, it serves the same function of providing a pure index of the effect, that is, an index that focuses exclusively on the effect, independent of the sample size, and that is not affected by the metric in which the effect is measured.

3.14.3.1.2 Conventions for effect size

Cohen (1988, 1992) has suggested conventional values for "small," "medium," and "large" effects in the social sciences. For tests of a mean difference between two groups the effect size index is d, the standardized mean difference (i.e., mean difference divided by the

common within-group standard deviation). A "small" effect is given as d=0.20 (i.e., 20% of the common within-group standard deviation); a medium effect as d=0.50; and a large effect as d=0.80. For tests of a difference between proportions a small effect is given as proportions of 40% vs. 50%; a medium effect corresponds to proportions of 40% vs. 65%; and a large effect corresponds to proportions of 40% vs. 78%. Note that the effect size for proportions cannot be specified simply by giving the difference between the two proportions, but requires that the absolute proportions be specified. For example, the effect size of 10% vs. 20% is larger than the effect size of 40% vs. 50%, despite the fact that the difference is 10 points in either case. For tests of a single correlation, the effect size is the correlation itself. A small effect is given as a correlation of 0.10, a medium effect is a correlation of 0.30, and a large effect is a correlation of 0.50. Cohen also provides conventions for other tests including analysis of variance and multiple regression. These conventions are also included in some computer programs (Borenstein & Cohen, 1988; Borenstein, Cohen, Rothstein, Pollack, & Kane, 1990, 1992; Borenstein, Rothstein, & Cohen, 1997; Rothstein, Borenstein, Cohen, & Pollack, 1990).

Cohen himself cautions that these conventions should not be used routinely, since it is preferable to select an effect size based on the substantive issues involved in the specific study (as above). Nevertheless, these conventions do serve two functions. In all cases the researcher may want to use these values as a kind of reality-check, to ensure that the values specified make sense relative to these anchors. Additionally, in cases where the researcher has difficulty deriving any kind of effect size index, they may elect to fall back on these conventions (see also Kraemer & Thiemann, 1987).

Figure 1 Impact of effect size on statistical power. With 100 cases per group, power to detect a treatment effect of 10 percentage points (response rates of 40% vs. 30%) is about 0.25, power to detect a treatment effect of 20 points is about 0.80, and power to detect a treatment effect of 30 points is about 0.99. [Plot not reproduced. Panel title: Power as a function of effect size and N: two sample proportions. Vertical axis: power, 0.0 to 1.0. Horizontal axis: number of cases per group, 0 to 300. Curves: 60% vs. 30%, 50% vs. 30%, 40% vs. 30%. Alpha = 0.05, tails = 2.]

3.14.3.2 Role of Alpha in Power Analysis

The significance test yields a p-value that reflects the likelihood of obtaining an effect as large (or larger) than the observed effect, under the assumption that the null hypothesis is true. For example, a p-value of 0.02 means that, assuming that the treatment has no effect, and given the sample size, an effect as large as the observed effect would be seen in only 2% of studies. The p-value obtained in the study is evaluated against the criterion alpha. If alpha is set at 0.05, then a p-value of 0.05 or lower is required to reject the null hypothesis and establish statistical significance.

If our only concern in study design were to prevent a type I error, it would make sense to set alpha as conservatively as possible (e.g., at 0.001). However, alpha does not operate in isolation. If we select a more stringent criterion for alpha, then our ability to meet this criterion

[Figure 2 plot: "Power as a function of alpha and N: two sample proportions." Power (y-axis, 0.0 to 1.0) against number of cases per group (x-axis, 0 to 300), with one curve each for alpha = 0.10, 0.05, and 0.01. Proportions of 50% vs. 30%.]

Figure 2 Impact of alpha on statistical power. With 100 cases per group and alpha set at 0.01, power is less than 0.60; with alpha at 0.05, power is about 0.80; and with alpha at 0.10, power is about 0.85. These computations are based on response rates of 50% vs. 30%, corresponding to the center line in Figure 1.

Power Analysis 319

is reduced. By moving alpha from (say) 0.10 toward 0.01, we reduce the likelihood of a type I error but increase the likelihood of a type II error. (More accurately, and to emphasize a point made earlier: We reduce the likelihood of a type I error if the null is true, but increase the likelihood of a type II error if the null is false.) Figure 2 shows power as a function of sample size for three levels of alpha (assuming an effect size of 30% vs. 50%, which is the intermediate effect size in the previous figure). With a sample size of 100 per group, with alpha set at 0.10 power exceeds 85%; with alpha set at 0.05 power is about 80%; and with alpha set at 0.01, power is under 60%.

Traditionally, researchers in some fields have accepted the notion that alpha should be set at 0.05 and power at 80% (corresponding to a beta of 0.20). This notion is implicitly based on the assumption that a type I error is four times as harmful as a type II error (the ratio of alpha to beta is 0.05:0.20), an assumption that has no basis in fact. Rather, it should fall to the researcher to strike a balance between alpha and beta as befits the issues at hand. For example, if the study will be used to screen a new drug for further testing we might want to set alpha at 0.20 and power at 95%, to ensure that a potentially useful drug is not overlooked. On the other hand, if we were working with a drug that carried the risk of side effects and the study goal was to obtain Food and Drug Administration (FDA) approval for use, we might want to set alpha at 0.05 while keeping power at 95%.

3.14.3.3 Role of Tails in Power Analysis

The significance test is always defined as either one- or two-tailed. A two-tailed test is a test that will be interpreted if the effect meets the criterion for significance and falls in either direction. As such, it is appropriate for the vast majority of research studies. A one-tailed test is a test that will be interpreted only if the effect meets the criterion for significance and falls in the expected direction (i.e., the treatment improves the cure rate).

A one-tailed test is appropriate only if an effect in the unexpected direction would be functionally equivalent to no effect. For example, assume that the treatment we are using for depression is relatively inexpensive and carries a minimal risk of side effects. We will be testing a new treatment which is more expensive but carries the potential for a greater effect. The possible conclusions are that (i) the old treatment is better, (ii) there is no difference, or (iii) the new treatment is better. For our purposes, however, conclusions (i) and (ii) are functionally equivalent since either would lead us to retain the standard treatment. In this case, a one-tailed test, whose only goal is to test whether or not conclusion (iii) is true, might be an appropriate choice.

Note that a one-tailed test should be used only in a study in which, as in this example, an effect in the reverse direction is, for all intents and purposes, identical to "no effect." It is not appropriate to use a one-tailed test merely because one is able to specify the expected direction of the effect prior to running the study. In psychological research, for example, we typically expect that the new procedure will increase, rather than decrease, the cure rate. Nevertheless, a finding that it decreases the cure rate would be important, since it would demonstrate a possible flaw in the underlying theory. Even in the example cited, one would want to be certain that a profound effect in the reverse direction could safely be ignored, since under a one-tailed test it cannot be interpreted. In behavioral research, the use of a one-tailed test can be justified only rarely.

For a given effect size, sample size, and alpha, a one-tailed test is more powerful than a two-tailed test (a one-tailed test with alpha set at 0.05 has the same power as a two-tailed test with alpha set at 0.10). However, the number of tails should be set based on the substantive issue ("Will an effect in the reverse direction be meaningful?"). In general, it would not be appropriate to run a test as one-tailed rather than two-tailed as a means of increasing power. (Note also that power is higher for the one-tailed test only under the assumption that the observed effect falls in the expected direction. When the test is one-tailed, the power for an effect in the reverse direction is zero by definition.)

3.14.3.4 Role of Sample Size in Power Analysis

For any given effect size and alpha, increasing the sample size will increase the power. As is true of effect size and alpha, sample size cannot be viewed in isolation but rather as one element in a complex balancing act. In some studies it might be important to detect even a small effect while maintaining high power. In this case it might be appropriate to enroll many thousands of patients (as was done in the "Physicians" study that found a relationship between aspirin use and cardiovascular events).

Typically, though, the number of available cases is limited. The researcher might need to find the largest N that can be enrolled, and work backwards from there to find an appropriate balance between alpha and beta. They may need to forgo the possibility of finding a small effect, and acknowledge that power will be adequate for a large effect only.

For studies that involve two groups, power is generally maximized when the subjects are divided evenly between the two groups. When the number of cases in the two groups is uneven the "effective N" for computing power falls much closer to the smaller sample size than the larger one.

There is one exception to the rule that an increase in sample size will always yield an increase in power. When we are working with exact formulas for discrete distributions, such as the binomial test for a single proportion or the Fisher exact test for a crosstabulation, it is possible that a modest increase in sample size will serve to lower power (Borenstein et al., 1997).

3.14.3.5 Computing Power in Power Analysis

Power is the fourth element in this closed system: for a given effect size, alpha, and sample size, power is completely determined. A convention exists that power should be set at 80% but this convention has no clear a priori basis. The appropriate level of power should be decided on a case-by-case basis, taking into account the potential harm attendant on a type I error, the determination of a clinically important effect, the available sample size, as well as the importance of identifying a small, medium, or large effect.

While power is a function of three elements (effect size, sample size, and alpha), as a practical matter the effect size tends to dominate the calculation of power. Sample size plays a secondary role, and the impact of alpha is relatively modest. For example, with response rates of 30% vs. 50%, a sample size of 100 per group, and alpha (two-tailed) set at 0.05, power would be 83%. Consider, then, the impact of changes to these factors. If we elect to work with an effect size of 25 percentage points rather than 20 percentage points (i.e., 55% vs. 30% rather than 50% vs. 30%) power would be increased from 83% to 95%. To achieve the same 12-point increase in power by manipulating the sample size we would have to increase the sample size from the initial value of 100 per group to 160 per group. To achieve the same increase in power by selecting a more liberal value for alpha we would need to move alpha from 0.05 to 0.20.

Because the effect size plays such an important role in the computation of power, it is imperative for the researcher to use an appropriate effect size in the computations. If we use an effect size that is too large, then the study will not have adequate power to detect a more modest effect. If we use an effect size that is too low, then we might increase the sample size in an effort to reach acceptable levels of power, thus inappropriately putting patients at risk. Rather, the effect size used in the power analysis should be selected carefully, based on the kinds of substantive issues outlined earlier.

3.14.3.6 The Null Hypothesis vs. the Nil Hypothesis

Power analysis focuses on the study's potential for rejecting the null hypothesis. In most cases the null hypothesis is the null hypothesis of no effect, or the "nil" hypothesis. In some studies, however, the researcher might want to test another null hypothesis. For example, rather than testing the null hypothesis that a correlation coefficient is zero, a researcher might want to test the null hypothesis that the correlation is 0.80 (i.e., to demonstrate that the correlation exceeds this value). Intuitively, it is easier to show that an observed correlation exceeds zero than to show that it exceeds 0.80. In the latter case the effect size is smaller and a substantially larger sample would be required to ensure adequate power.

3.14.3.7 Retrospective Power Analysis

One occasionally sees a retrospective power analysis of the following form. A researcher completes a study, and reports that the effect size in the study was, for example, a standardized mean difference of d = 0.2. The researcher then goes on to report that, based on this effect size, the study actually had power of only, for example, 40%. However well intentioned, this type of retrospective power analysis is generally misguided for several reasons. First, as noted earlier, the role of effect size in a power analysis is not "The effect size is d = 0.4 and therefore power is 80%" but rather "If the effect size is d = 0.4, then power would be 80%." The effect size d = 0.4 is supposed to represent the smallest effect that would be important to detect. As such, it is not affected by evidence that the actual size of the effect might be larger or smaller than this value. Second, even if one wanted to base power on the "true" effect, the fact remains that the value observed in any single study is not likely to be the true effect. Indeed, if we had confidence that the observed effect size was the true effect we would have no need to run another study to pinpoint the effect size, much less to test the hypothesis that it was zero.

While this application of a retrospective power analysis is inappropriate, there are two related applications where it may be useful. One is the use of study data to identify a base rate.

Assume, for example, that our goal is to reduce by 20% the proportion of patients having a panic attack within two days of treatment. Initially, we assume that the base rate for these attacks is 30%, compute power for a difference of 30% vs. 24% (corresponding to a 20% decrease), and set the sample size on this basis. It emerges in the study that the base rate is 20%, which means that we should have computed power for a difference of 20% vs. 16% (corresponding also to a 20% decrease). In this case, one could argue that the original power calculations were based on incorrect assumptions, and that for the a priori effect size the study was underpowered. Similarly, if the effect size was computed as a mean difference in raw units (such as SAT [Scholastic Aptitude Test] scores) but we underestimated the standard deviation of the scores, one could argue that our initial computations had been in error and that power for the desired effect was too low. A second logical application along these lines is to use the study data to obtain a more informed estimate of the likely effect size by computing the confidence intervals based on the sample. This information could prove useful in the design of a subsequent study.

Retrospective power analyses are usually reported subsequent to a study that failed to yield a significant effect. Occasionally, though, researchers who have obtained a significant result may discover that their study had been underpowered and then wonder if the study results are still valid. Of course, this is a nonissue. Power tells us the likelihood that the study will yield a significant effect and is used for purposes of planning. If the study is completed, then we know the outcome: either the effect was significant, or it was not. To worry in retrospect that the chances of success were small is akin to the parent whose teenager has taken the car for a ride during a bad rain but returned home safely. There might have been reason for worry while the teenager was still out (and the parent might want to be more careful the next time) but it makes no sense to worry about the outcome of this particular trip after the fact.

3.14.3.8 Ethical Issues in Power Analysis

Some studies involve putting patients at risk. At one extreme, the risk might involve a loss of time spent completing a questionnaire. At the other extreme, the risk might involve the use of an ineffective treatment for a potentially fatal disease. While an extensive discussion of research ethics is outside the scope of this chapter, it must be emphasized that ethical issues should play a central role in power analysis. If a study to test a new drug will have adequate power with a sample of 100 patients, then it would be inappropriate to use a sample of 200 patients since the second 100 are being put at risk unnecessarily. At the same time, if the study requires 200 patients to yield adequate power, it would be inappropriate to use only 100. These 100 patients consent to take part in the study on the assumption that the study is likely to yield useful information. If the study is underpowered, then the 100 patients will have been put at risk for no reason.

Of course, the actual decision-making process is complex. One can argue about whether "adequate" power for the study is 80%, or 90%, or 99%. One can argue about whether power should be set based on an improvement of 10 points, or 20 points, or 30 points. One can argue about the appropriate balance between alpha and beta. The point being made here is that these kinds of issues need to be addressed explicitly as part of the decision-making process.

3.14.3.9 Power Analysis in Perspective

To this point we have focused on the details of a power analysis, but it is important also that the entire procedure be seen in an appropriate context. The use of a power analysis in study planning is appropriate only when the study goal is to test the null hypothesis. By contrast, if the study goal is to estimate the magnitude of a treatment effect, then statistical power has no bearing on sample size. Later in this chapter it is argued that most studies should focus on effect size estimation rather than power analysis, and so this becomes a critical point.

More generally, before performing a power analysis or any other procedure to set sample size, it behooves the researcher to ensure that the study should be conducted at all. Chalmers and co-workers have shown that many studies have put patients at risk in attempts to test treatments when, in fact, the data required to answer the question already existed from prior studies. In one famous example (Lau et al., 1992), a series of 33 randomized trials were carried out between 1959 and 1988 to assess the impact of intravenous (iv) streptokinase in preventing mortality subsequent to a myocardial infarction. In fact, the life-saving power of these drugs could have been proven as early as 1973, if the data available at that time had been subjected to a meta-analysis. At that time, a total of eight studies had been completed using a total of 2432 patients. The p-value for a test of the nil was <0.01 and the effect size (odds ratio) 0.74 with a 95% confidence interval of

0.59 to 0.92. Subsequent to 1973, researchers ran an additional 25 studies which employed a total of 34,542 additional patients, with approximately half of these assigned to placebo and denied the potential benefits of the treatment. In fairness, it must be noted that meta-analysis was not widely accepted in 1973, and even if the researchers had access to meta-analytic methods, results based on these procedures might not have been accepted by the medical community. Our intention is not to argue that this specific set of studies was not needed, but to make the point that the study must hold the potential to provide useful information and that power analysis is only one element in this process.

3.14.3.10 Some Real World Issues in Power Analysis

In some ways, the funding process and other constraints ensure that studies will continue to be run with low power. Feinstein (1975) may have identified one part of the problem with his straightforward account of how studies are designed in real life. The statistician (i) computes the maximum number of patients for which funding can be obtained, (ii) finds the effect size required to yield power of 80%, given the sample size, and (iii) develops a rationale to present to the funding agency to show that the effect size is the smallest effect size that would be clinically important to detect. The process identified by Feinstein leads to grant applicants asserting that a clinically important effect would be one that improves the response rate by precisely "22 percentage points."

In fact, the thrust of this approach is not necessarily inappropriate. The selection of an effect size should take into account any number of factors, and it might be appropriate to allocate resources to detect a large effect but not a small one. However, the grant application would be more credible if the applicant acknowledged that the effect size was set in this way, rather than arguing that the smallest effect of clinical importance would be a rate difference of precisely 22 percentage points.

3.14.3.11 Computer Programs for Power Analysis

Power is defined as the proportion of samples that will yield a statistically significant effect, given a specific set of assumptions. The computation of this value is a two-step process. First, we compute the effect size required to yield a significant effect, for the given alpha and sample size. Second, we compute the proportion of samples that will yield an effect this large (or larger), given the effect size in the population.

As a practical matter, computation of power is performed by means of computer programs designed for this purpose. The example that follows is taken from Power and Precision, a program developed by Borenstein et al. (1997). The example is for a study in which patients will be assigned at random to either a new treatment or the standard treatment, and we will compare the proportion responding in the two groups. Computation of power proceeds as follows (Figure 3):

(i) Optionally, enter names for the two groups: "New treatment" and "Standard treatment."
(ii) Enter the proportion responding in each group: 40% for the standard group and 60% for the new treatment.
(iii) Click "Find N." The program shows that a sample of 107 per group will yield power of 80%.
(iv) At this point the researcher may elect to vary the study parameters, for example, to modify the effect size or sample size and see the impact on power. This may be done interactively. Alternatively, the program will automatically generate a table and graph that allow the researcher to simultaneously take account of several factors.

One possible graph is shown in Figure 4. The three lines in the graph represent treatment effects of 25 points, 20 points, and 15 points (specifically, response rates of 65% vs. 40%, 60% vs. 40%, and 55% vs. 40% for the new treatment vs. the standard treatment). The sample size required to yield power of 80% is 70 patients per group for the largest effect, 107 per group for the second effect, and 186 per group for the third.

In this example, taking account of substantive issues and available resources, we decide that we will base the study on the center line. In other words, we decide that a 20-point improvement in response rates is important enough to justify a trial with 107 patients per group. By following the trajectory of the center line we see that 139 patients per group would yield power of 90% for this effect, but decide that we cannot commit the additional resources. By comparing the position of the three lines at "107 per group" we note that if the treatment effect is actually 25 points (65% vs. 40%) rather than 20 points, then the study's power is actually 95% rather than 80%. If the true effect is actually 15 points (55% vs. 40%) then power is actually 54%.

The program will also generate a text report of the computation which serves as an educational tool and may also be copied into

[Figure 3 screen text: With 107 patients per group and alpha (two-tailed) of 0.05, the study will have power of 0.80 to detect a treatment effect of 20 percentage points (response rates of 60% vs. 40%).]

Figure 3 Screen from Power and Precision showing computation of power. The researcher enters an effect size (response rates of 60% vs. 40%), alpha (0.05, two-tailed) and clicks "Find N" on the toolbar. The program shows that a sample size of 107 per group will yield a power of 80%. The researcher may modify any of the study parameters, and the program will update power automatically in response.
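The "Find N" step and the comparison of the three effect sizes can be illustrated with a small search over sample sizes. This is a minimal sketch, assuming a normal approximation with a continuity correction for two equal groups; whether Power and Precision uses exactly this formula is an assumption, although under it the search returns the per-group sample sizes quoted in the text (107, 70, and 186). The names `corrected_power` and `find_n` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def corrected_power(p1, p2, n_per_group, alpha=0.05, tails=2):
    """Approximate power for two independent proportions, equal group sizes.

    Assumption: normal approximation with a continuity correction, which
    shrinks the absolute difference |p1 - p2| by 1/n before testing.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / tails)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    corrected_diff = abs(p1 - p2) - 1 / n_per_group
    return _STD_NORMAL.cdf((corrected_diff - z_crit * se_null) / se_alt)


def find_n(p1, p2, target_power=0.80, alpha=0.05, tails=2, n_max=100_000):
    """Return the smallest per-group n whose approximate power meets the target."""
    for n in range(2, n_max + 1):
        if corrected_power(p1, p2, n, alpha, tails) >= target_power:
            return n
    raise ValueError("target power not reached within n_max cases per group")


# Sample size per group for 80% power at each of the three effect sizes.
for new_rate in (0.65, 0.60, 0.55):
    print(f"{new_rate:.2f} vs. 0.40: n per group = {find_n(new_rate, 0.40)}")

# Power at the chosen design (107 per group) if the true effect differs.
for new_rate in (0.65, 0.60, 0.55):
    print(f"{new_rate:.2f} vs. 0.40 at n = 107: "
          f"power = {corrected_power(new_rate, 0.40, 107):.3f}")
```

Searching over n rather than inverting a closed-form expression keeps the sample-size answer consistent with the power function by construction, at the cost of a loop that is trivial at these sizes.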

[Figure 4 plot: "Power as a function of effect size and N: two sample proportions." Power (y-axis, 0.0 to 1.0) against number of cases per group (x-axis, 0 to 300), with one curve each for P1 = 0.65 vs. P2 = 0.40, P1 = 0.60 vs. P2 = 0.40, and P1 = 0.55 vs. P2 = 0.40. Alpha = 0.05, tails = 2.]

Figure 4 Power as a function of effect size and sample size. This graph, which is an extension of Figure 3, allows the researcher to take account of multiple factors simultaneously while planning the study. We can design the study to detect the largest effect (top line) which would require only about 70 cases per group but would result in low power to detect the more modest effects. We can design the study to detect the middle effect, which would require 107 cases per group but would result in low power to detect the smallest effect. To ensure power for the smallest effect (with the least potential benefit) we would need to enroll some 180 cases per group.

a word-processing program. In this example the text of the report reads (in part) as follows:

"One goal of the proposed study is to test the null hypothesis that the proportion responding is identical in the two populations. The criterion for significance (alpha) has been set at 0.05. The test is two-tailed, which means that an effect in either direction will be interpreted. With the proposed sample size of 107 and 107 for the two groups, the study will have power of 80.1% to yield a statistically significant result. This computation assumes that the difference in proportions is 0.20 (specifically, 0.60 vs. 0.40). This effect was selected as the smallest effect that would be important to detect, in the sense that any smaller effect would not be of clinical or substantive significance. It is also assumed that this effect size is reasonable, in the sense that an effect of this magnitude could be anticipated in this field of research."

The software available for power analysis is constantly changing but up-to-date information is available on the Internet. Information on the program used in this example is available at http://www.PowerAndPrecision.com. Thomas and Krebbs (1997) maintain a site with links to any number of programs for power analysis at http://www.Interchg.ubc.ca/cacb/power.

3.14.4 PROBLEMS WITH THE SIGNIFICANCE TEST

To this point we have highlighted the key features of significance testing and power analysis. The vast majority of statistical analyses in psychological research take the form of significance tests and an understanding of the process is essential to an understanding of the research literature.

However, this was intended primarily as an introduction. The thesis of this chapter is that while the significance test is ubiquitous in the literature of psychological and medical research, its use in these fields is inappropriate for a number of reasons. In this section we will make the following points:

(i) The significance test addresses the wrong question. Researchers are concerned with clinical significance (i.e., the magnitude of the treatment effect), but significance tests address statistical significance (whether or not the effect is zero).
(ii) The question that is addressed by significance tests (whether or not the null hypothesis is true) is inappropriate not only because it is tangential to clinical significance but also because the null hypothesis is almost always false in psychological research.

Additional problems derive from the fact that significance tests lend themselves to mistakes of interpretation. In particular:

(i) A significant p-value is assumed to reflect a clinically important effect, and the level of significance is assumed to reflect the magnitude of the effect. In fact, though, the p-value incorporates information about the size of the sample as well as the size of the effect. Therefore, the use of the p-value for this purpose results in confusion and errors of interpretation.
(ii) Historically, a majority of studies have been run with low power, which means that even moderate or large treatment effects may go undetected. Additionally, the absence of significance is often interpreted, inappropriately, as evidence that the treatment effect is zero. The combination of these two types of error has seriously impeded research in psychology.

3.14.4.1 Significance Tests Address the Wrong Question

The key problem with significance testing is that it addresses the wrong question. The question that researchers want to ask is "How large is the treatment effect?" By contrast, the question actually addressed by the significance test is "Can we conclude that the treatment effect is not zero?"

Researchers typically assume that reports of "significance" or "nonsignificance" refer to clinical significance. This assumption follows from the fact that clinical significance is the question of interest, and also from the fact that "significance" refers to substantive (or clinical) significance in common parlance. Nevertheless, this assumption is incorrect. A significant p-value tells us only that the treatment (probably) increases the response rate by some amount. It does not provide information about the size of the effect.

The clinician, of course, needs information about the magnitude of the treatment effect (does the treatment increase the likelihood of response by five percentage points, or 10 points, or 50 points?), since this is the kind of information required to balance the treatment's potential benefits against the costs and potential side effects. It is also the kind of information required to allow for an informed choice between the treatment in question and other possible treatments. The only information that the significance test can provide (that the treatment effect exceeds zero) is at best of tangential interest.

The same point being made here about implications for clinical practice applies also to the development of psychological theory. For example, Tukey (1969) points out that if the nature of elasticity had been defined as "when you pull it, it gets longer" the science of elasticity would have progressed very slowly. In psychological research as well, we need to address the size of the effect in order for theory to develop, and this is not addressed by the significance test.

3.14.4.2 The Null Hypothesis is not a Viable Hypothesis

There is a second problem with the use of the null hypothesis in psychological research. In addition to being irrelevant, it is also not a viable hypothesis, in the sense that it is known to be false in the overwhelming majority of psychological research.

For example, consider studies that compare the impact of two treatments. Hunter (1997) points to a survey conducted by Lipsey and Wilson (1993) which reviewed 302 meta-analyses of treatment studies in psychology and education. The average number of studies in each meta-analysis is 60, and the total number of studies included is 18,120. The treatment effect was zero (the null hypothesis was true) in less than 1% of the domains studied. Similarly, studies that look at the relationship between behaviors or traits are designed to test the null hypothesis that the correlation between traits is zero. In fact, though, it would be difficult to name two human characteristics whose correlation with each other is 0.0000000. In this kind of study the null hypothesis is always false.

The typical retort to the argument that the null is always false is that we do not intend to test whether or not the response rates are identical; rather, we intend to test whether or not they differ by some important amount. But this is exactly the problem. What we care about is whether or not the difference is clinically important, but what we are testing is whether or not the difference is exactly zero (Cohen, 1990).

This fact, followed to its logical conclusion, leads to an interesting paradox. Given that the effect is not zero, and since the p-value is a function of both effect size and sample size, it follows that with a large enough sample size, any effect will meet the criterion for significance. One can extend this argument in several ways:

(i) With a large enough sample size we know the study will yield a significant effect, so there is no need to run the study (Cohen, 1994).
(ii) With a large enough sample size we know the study will yield a significant effect, so if our result is not significant, then by definition we are committing a type II error (Schmidt, 1992).
(iii) With a large enough sample size the study will yield a significant effect, so what we are really testing is not whether the effect is large, but whether or not we have run enough subjects. Thompson (1992) suggests that instead of testing for significance we can assess how tired the researchers are. If they are tired, then we probably ran a lot of subjects and have a "significant" effect. If they are very tired, then we probably ran even more subjects and have a "very significant" effect.

There are exceptions to this point, which are discussed later, but these are rarely encountered in psychological research.

3.14.4.3 The Gratuitous Use of Significance Tests

Even those who argue that significance tests can serve an important role in psychological research recognize that the tests are almost invariably abused. Abelson (1997a) wrote a chapter in which he argues that the null hypothesis should not be banned, but includes a section entitled "The Null Hypothesis: Merely Misused, or Really Idiotic?" He uses the term "gratuitous" to describe the ritualistic application of significance tests to all kinds of data where the test could not possibly provide any useful information.

Indeed, significance tests are sometimes applied in ways that defy common sense. Abelson (1997a) cites the example of a researcher who divided a cohort of subjects at the group median into "low" and "high" groups and then performed a significance test to find out whether or not the groups were, in fact, different. In doing so the researcher missed the point that we use significance tests to make inferences about whether or not the groups differ in some systematic way, but in this case we know that they differ systematically because we created them that way. Simply put, one is high and one is low. In fact, this is the archetypical case of the situation cited earlier, where the groups clearly differ, and the only thing being tested is whether or not we have a large enough sample. What if the test had failed to yield a significant effect? Would the researcher have concluded that he had assigned persons to the two groups at random?

Papers are often seen in which persons report significance tests for reliability coefficients, in effect making the point that the scale's reliability is not zero. To show that the test has reliability better than zero is hardly reassuring to potential users of the test. Surely, the relevant question is whether or not the reliability exceeds 0.70 or 0.80. Abelson (1997b) suggests that ". . . declaring a reliability coefficient to be nonzero constitutes the ultimate in stupefyingly vacuous information" (p. 13).

Cohen (1994) traces the history of harangues against null hypothesis significance testing (NHST) and some of his quotes are reproduced here. Rozeboom (1960) wrote "The statistical folkways of a more primitive past continue to dominate the local scene" (p. 417). Bakan (1966), writing that "a great deal of mischief has been associated [with the significance test]" noted that the idea was hardly original with him and that "to say it 'out loud' is . . . to assume the role of the child who pointed out that the emperor was really outfitted in his underwear" (p. 423). Meehl (1967) likened NHST to "a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring." Rothstein (personal communication) noted that ". . . researchers [using NHST] often pose the wrong question and then proceed to answer that question incorrectly."

Cohen (1994) goes on to say that ". . . we, as teachers, consultants, authors, and otherwise perpetrators of quantitative methods, are responsible for the ritualization of null hypothesis significance testing . . . to the point of meaninglessness and beyond . . . NHST has not only failed to support the advance of psychology as a science but also has seriously impeded it" (p. 997) (see also Cohen, 1990).

Rothman (1986b) writes "Testing for statistical significance today continues not on its own merits as a methodological tool but on the momentum of tradition. Rather than serving as a thinker's tool, it has become for some a clumsy substitute for thought, subverting what should be a contemplative exercise into an algorithm prone to error" (p. 447).

Additional citations on this topic include
Abelson (1995, 1997a, 1997b); Borenstein Cohen (1994) cites the case of a researcher (1994b); Carver (1978); Dar (1987); Estes who wanted to test the hypothesis that a (1997); Fisher (1959); Gonzalez (1994); Hagen particular ailment would never exist in a specific (1997); Harlow, Mulaik, and Steiger (1997); population. He tested 30 patients and found the Harris (1997a, 1997b); Hunter (1997); Loftus ailment in only one. He wanted to know (1991); Lykken (1968); Meehl (1967, 1978, whether or not this (one case in 30) was 1990); Murphy (1990); Neyman and Pearson ªsignificant.º In doing so, he missed the point (1932a, 1932b); Oakes (1986); Peto and Doll that the theory required that there be no cases in (1977); Scarr (1997); Schmidt (1992, 1996); the population, and if he found any then the Schmidt and Hunter (1996); Shaver (1993), and theory was clearly false. Shrout (1997). Problems with the Significance Test 327
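
The paradox noted earlier in this section, that with a large enough sample any nonzero effect will meet the criterion for significance, can be illustrated numerically. The following is a minimal sketch rather than a procedure taken from the chapter. It assumes hypothetical response rates of 50% vs. 52% (a two-point difference that most clinicians would regard as trivial), assumes the observed rates equal these values exactly at every sample size, and uses a pooled two-proportion z-test with a normal approximation. The function name is our own.

```python
import math


def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value for a pooled two-proportion z-test (equal group sizes).

    Illustrative assumption: the observed response rates are exactly p1 and p2.
    """
    pooled = (p1 + p2) / 2.0
    standard_error = math.sqrt(2.0 * pooled * (1.0 - pooled) / n_per_group)
    z = abs(p2 - p1) / standard_error
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical example: the effect is held fixed, only the sample size changes.
for n in (100, 1_000, 10_000, 100_000):
    p_value = two_proportion_p_value(0.50, 0.52, n)
    print(f"n per group = {n:>7}: p = {p_value:.4g}")
```

Under these assumptions the same two-point difference is far from significant with 100 cases per group and clearly significant with 10 000 per group, although the size of the effect has not changed at all.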

3.14.4.4 p-Values: Responding with a Misdirected Answer

To this point we have shown that significance tests address a question that is at best tangential to the question of interest and often entirely irrelevant. The use of significance tests carries with it an additional problem: Significance tests produce p-values that lend themselves to mistakes of interpretation.

The "p-value" highlighted by the significance test is a function of two elements: the size of the effect, and the precision of the estimate. When we consider the logic of the significance test, that is, to determine whether or not the population effect is zero, we can appreciate the simple elegance of the p-value. Either a large effect, or a large sample (yielding a precise estimate of the effect), or some appropriate combination of the two, provides evidence that the effect is not zero. By the same logic, however, it would not be appropriate to press the p-value into service as an indicator of effect size. Since the p-value incorporates information about both the size of the effect and the size of the sample, it does not allow us to distinguish between the two.

3.14.4.4.1 Misinterpreting the significant p-value

Statistical significance is often assumed to reflect substantive significance. Almost invariably, the first question asked by the reader, and the first point made by the researcher, is that the results were "significant." This is the point highlighted at meetings, in abstracts, and in the results section of publications. Often, the discussion of effect does not proceed beyond the question of significance at all. Even when it does, the issue of significance, that is, statistical significance, is emphasized over clinical significance or effect size. In fact, though, the only information imparted by a statistically significant p-value is that the true effect is (probably) not nil. A significant p-value could reflect a clinically meaningful effect. It could also reflect a clinically trivial effect that had been found in a large sample (because of the large sample the effect size is reported precisely, and though small is known to be non-nil).

Cohen (1965) writes "Again and again, the results section of an article describing an effect as significant or highly significant is followed by a discussion section which (usually implicitly) proceeds to treat the effect as if it had been found to be large or very large" (p. 102). The same point has been documented repeatedly in the field of psychology and medicine (Feinstein, 1975, 1976, 1977; Friedman & Phillips, 1981; Mainland, 1982; Nelson, Rosenthal, & Rosnow, 1986; Rothman, 1986b; Tversky & Kahneman, 1971; Wonnacott, 1985).

A review of a manuscript is recalled that purported to show a substantial advantage for a novel antipsychotic drug. The basis for this claim was a difference between treatment groups on a critical variable, with p < 0.001. As it happens, the sample size was of the order of 2000 patients, and a p-value < 0.001 could reflect a difference that was so small as to be trivial. In fact, some of the baseline differences between the groups (which had, appropriately, been dismissed as negligible in size) were larger (and more significant!) than the post-treatment differences being submitted as evidence of the drug effect.

It gets worse. If the confusion of statistical significance with clinical significance is a problem in the interpretation of single studies, the situation is even worse when researchers use p-values to compare the results in different studies. This type of comparison is common when we want to know if the treatment is more effective in men than it is for women, or if one treatment is more effective than another.

Since the p-value incorporates information about both the sample size and effect size, a p-value of 0.05 could represent response rates in two groups of 50% vs. 70% (a 20-point effect) with a sample size of 50 cases per group. It could also represent response rates of 50% vs. 90% (a 40-point effect) with a sample size of 10 cases per group. In the second case the effect size is substantially larger than it is in the first case, but this fact is lost in the p-values, which are identical. Similarly, a p-value of 0.01 could represent response rates of 40% vs. 65% (25 points) with 50 cases per group, or 40% vs. 95% (55 points) with 10 cases per group. Again, the difference in effect sizes is not evident from the p-value.

In fact, Tversky and Kahneman (1971) found that students presented with information about p-values and sample sizes tend to draw exactly the wrong conclusions about the effect size. Students were presented with two studies where the p-value was 0.05, and told that the sample size was 10 per group in one and 50 per group in the other. Invariably, students assumed that the effect size in the second case was more impressive, while exactly the reverse is true (see also Berkson, 1938, 1942; Friedman & Phillips, 1981; Rosenthal & Gaito, 1963, 1964; Rozeboom, 1960).

The possibilities for mistakes expand when we consider the possibility of comparing results between studies when both the p-value and the sample size differ. If one study yielded a p-value of 0.05 and another yielded a p-value of 0.01, then in the absence of any additional information a reader might assume that the effect size was stronger in the latter case. In fact, though, if the first study (p = 0.05) used a sample of 10 per group and the second (p = 0.01) used 50 per group, then the effect size would have been substantially larger in the study with the modest p-value (a 40-point effect as compared with a 25-point effect).

3.14.4.4.2 Misinterpreting the nonsignificant p-value

The complement to misinterpreting the significant p-value is to misinterpret the nonsignificant p-value. The only information imparted by a nonsignificant p-value is that we have failed to reject the null. Assume, for example, that refractory schizophrenic patients currently in relapse are assigned to be treated either with clozapine or with haldol. In this hypothetical example we find that the proportion meeting a remission criterion within six weeks is not significantly different in the two groups. The nonsignificant p-value could reflect a finding in a large sample that the proportion remitting was virtually identical in the two groups (the large sample ensuring to a high degree of certainty that the same would hold true in the population). Or, it could reflect a finding in a small sample that the remission rate was twice as high (or more) for the patients treated with clozapine (the finding failing to prove significant in part because of the low sample size). The first scenario would justify a conclusion that clozapine does not substantively increase the likelihood of remission and the second scenario would not justify this conclusion. However, researchers and readers almost invariably fail to make this distinction, and routinely interpret the absence of significance to mean that an effect does not exist. A related problem is the fact that many studies in psychological research suffer from low statistical power. In this section we will make four points.

(i) Power for research in psychology is abysmally low;

(ii) Rule (i) appears to be impervious to change;

(iii) The absence of significance should be interpreted as "more information is required" but is interpreted in error as meaning "no effect exists"; and

(iv) Rule (iii) appears to be impervious to change.

While Cohen has often made the point that he discovered, rather than invented, power analysis, the fact is that Cohen's papers over a series of decades have played a key part in making researchers aware of the power issue, and providing them with the mechanisms for computing power. At present, Cohen's papers and texts (Cohen, 1962, 1965, 1969, 1977, 1988, 1990, 1994) also serve as a kind of historical map that traces the role of power analysis since the 1960s.

Cohen (1962) surveyed papers published in the Journal of Abnormal and Social Psychology in 1960. Mean power to detect a small, medium, or large effect, respectively, was 0.18, 0.48, and 0.83. (Definitions of small, medium, and large were developed by Cohen with reference to the effect sizes typically found in social science research; the definitions were modified slightly subsequent to the 1962 paper.) The implications of this were spelled out clearly in the paper. At a minimum, a great deal of time and effort is being wasted on studies that have little chance of meeting their goals. Worse, when the studies with "negative" results are published, readers tend to interpret the absence of statistical significance as evidence that the treatment has been proven ineffective.

What was the response to this paper? Woody Allen has spoken of the time that he was kidnapped as a child. It took his father a while to catch on ("He had bad reading habits - he started reading the ransom note, but in the middle he fell asleep") but eventually he "sprang into action and rented out my room." In that vein, it took the scientific community a while to appreciate the implications of Cohen's paper but eventually it sprang into action also. First one, then another researcher found that what Cohen had done for the field of psychology could be done equally well for any field of research, and in the years that followed a kind of cottage industry developed of papers that documented the fact of low power in any number of journals in the area of behavioral research. Many of these are cited in Sedlmeier and Gigerenzer (1989) and Rossi (1990). Similar papers were published to document the same problem in the field of medicine (Borenstein, 1994b; Hartung, Cottrell & Giffen, 1983; Phillips, Scott, & Blasczcynski, 1983; Reed & Slaichert, 1981; Reynolds, 1980) and psychiatry (Kane & Borenstein, 1985).

One oft-cited paper by Freiman, Chalmers, Smith and Kuebler (1978) surveyed reports of controlled clinical trials that had been published in a number of medical journals (primarily the Lancet, the New England Journal of Medicine, and the Journal of the American Medical Association during the period 1960–1977), and selected 71 that had reported negative results. The authors found that if the true drug effect had been on the order of 50% (e.g., a mortality rate of 30% for placebo vs. 15% for drug), median power would have been 60%. In other words, even if the drug cut the mortality rate in half there was still a 40% probability that the study would have failed to obtain a significant result.

How does the story end? In Woody Allen's tale the FBI surrounded the kidnappers. They wanted to "toss in the tear gas," but they had no tear gas so instead they put on the death scene from Camille. The kidnappers were overcome. The FBI and the kidnappers bargained with each other and eventually reached an agreement that the kidnappers would "throw out their guns and keep the kid."

The power analysis story has a similar ending. In keeping with the general move away from isolated studies and toward meta-analysis, papers that document low power in specific journals are now being replaced by meta-analysis of these papers. Now, researchers have been able to document the fact that low power exists not only in specific journals, but also in every field of research.

Sedlmeier and Gigerenzer (1989) published a paper entitled "Do studies of statistical power have an effect on the power of statistical studies?" They found that in the 25 years since Cohen's initial survey power had not changed in any substantive way. Specifically, they reported that in the 1984 volume of the Journal of Abnormal Psychology, mean power to detect small, medium, and large effects was 0.21, 0.50, and 0.84, which are essentially similar to the values reported by Cohen for the 1960 volume (in 1960, for the Journal of Abnormal and Social Psychology). Similarly, Rossi (1990) reviewed papers published in 1982 in the Journals of Abnormal Psychology, Consulting and Clinical Psychology, and Personality and Social Psychology. Mean power to detect small, medium, and large effects, respectively, was 0.17, 0.57, and 0.83. In fact, since papers with higher power are more likely to yield significant results and be published, the mean power reported by these surveys is almost certainly higher than the mean power for all studies in the field.

Given that the fact of low power is well documented, one would hope that researchers would adapt their thinking to accommodate this state of affairs. In fact, though, the mistake of interpreting nonsignificant results as implying the absence of an effect runs through the psychological and medical literature. One could say that it practically gallops.

Cohen (1990) relates the story of a doctoral candidate who completed a study with 20 cases per group, and power of 0.33 to detect a medium sized effect. As Cohen recalls "He ended up with non-significant results - with which he proceeded to demolish an important branch of psychoanalytic theory" (p. 1304).

Earlier, the paper by Freiman et al. (1978) was cited, showing that power for papers published in some of the most prestigious medical journals was clearly at unacceptable levels. The authors went on to make a second point: Despite the fact that power was terribly low, in most cases the absence of significance was interpreted as meaning that the drug was not effective. The authors wrote: "The conclusion is inescapable that many of the therapies discarded as ineffective after inconclusive 'negative' trials may still have a clinically meaningful effect" (p. 694). In fact, it is possible (or likely) that some of the therapies discarded on this basis might well have had very substantial therapeutic effects.

In fact, Sedlmeier and Gigerenzer reported that power had actually dropped slightly over the 25 years subsequent to Cohen's initial survey. The magnitude of the drop is of little consequence, but the reason for the drop is interesting: researchers in 1990 were more likely to be making adjustments to alpha when running multiple tests (e.g., using alpha of 0.025 rather than 0.05) to ensure that the type I error rate was kept at 0.05. Schmidt (1996) points to the irony that researchers were protecting themselves against an error that could not occur (since the null hypothesis was false) while oblivious to the type II error rate, which exceeds 50%.

As was the case for misinterpretation of significant results, the misinterpretation of nonsignificant results becomes even more complicated when researchers try to compare results across studies. One would hope that the error of interpretation would be obvious when there are a series of studies, with some yielding a significant effect and others failing to do so. In some such series the effect size is remarkably consistent from one study to the next: the studies with a large sample size met the criteria for significance and those with a smaller sample size did not. Even in these cases, however, there is usually a perception that "the studies yield conflicting results."

Even in 1998 a cardiologist remarked that thrombolytics are incredibly effective in reducing mortality for patients suffering a myocardial infarction, and that he sees clearly that they save lives, but that the studies of effect "flip back and forth." In fact, thrombolytics represent the textbook case (literally) cited earlier (Lau et al., 1992), where the treatment effect is consistent from one study to the next, and the presence or absence of statistical significance varies with sample size.

Even worse, when the significant study was performed in one type of sample and the nonsignificant study was performed in another type of sample, researchers sometimes try to interpret this difference as meaning that the effect exists in one population only (e.g., males but not females). Abelson (1997a) notes that if a treatment effect yields a p-value of 0.07 for wombats and 0.05 for dingbats we are likely to see a discussion explaining why the treatment is effective only in the latter group, completely missing the point that the treatment effect may have been virtually identical in the two. The treatment effect may have even been larger in the nonsignificant group, if the sample size was smaller.

A more serious example of this is cited by Poole (1987a). Selikoff, Hammond, and Churg (1968) found a statistically significant association between asbestos exposure and lung cancer among cigarette smokers (the risk of death from lung cancer was 90 times as high as that for nonsmokers in the general population), but failed to find this same association among nonsmokers (no deaths were reported in this group during the observation period). The authors, taking into account the low statistical power for detecting this effect, drew no conclusions about the nonsmoking group. Others, however, were less cautious and proceeded to (mis)interpret this negative finding. Hoffman and Wynder (1976) wrote: "We conclude from [this study] that asbestos induces mesothelioma of the pleura and peritoneum, but not by itself [cancer] of the bronchus." Similarly, Cole and Goldman (1975) interpreted the study to mean that "[A]pparently, asbestos will produce lung cancer only in smokers." Poole reports that this finding was eventually summarized in a pamphlet distributed to workers in the asbestos industry: "Studies show that if you don't smoke cigarettes, asbestos does not increase your risk of lung cancer." Poole used the data reported in the original study to compute confidence intervals for the risk of cancer associated with asbestos exposure for the nonsmokers, and found the odds ratio to fall in the range of zero to 46.1.

We have shown that power for many studies is low, which leads to nonsignificant results even when treatments are actually effective. And, we have shown that the nonsignificant results are often taken as evidence that the treatment is not effective. Either of these two issues would be problematic by itself, and together these items seriously impede research in psychology.

Schmidt (1996) outlines the impact of this practice on research and policy: an idea is proposed (school integration will result in better test scores for African-American children). A number of studies are run. All of the studies show that scores are increased and the magnitude of the increase is consistent from one study to the next, but the sample sizes are small and, in some of the studies, not significant. Researchers report that the evidence is "conflicting." A series of studies are funded to determine why the integration had a positive effect in some studies but not others (is it the teacher's attitude? Is it the students' socioeconomic status; is it the students' age?), entirely missing the point that the effect was actually consistent from one study to the next. No pattern can be found (since none exists). Eventually, researchers decide that the issue cannot be understood. A promising idea is lost, and a perception builds that research is not to be trusted. A similar point is made by Meehl (1978, 1990).

Rossi (1997) gives an example of this same phenomenon from the field of memory research. The issue of whether or not researchers could demonstrate the spontaneous recovery of previously extinguished associations had a bearing on a number of important learning theories, and some 40 studies on the topic were published between 1948 and 1969. Evidence of the effect (i.e., significant findings) was obtained in only about half the studies, which led most texts and reviews to conclude that the effect was ephemeral and "the issue was not so much resolved as it was abandoned" (page 179). Recently, Rossi returned to these studies and found that the average effect size (d) for the studies was 0.39. If we assume that this is the population effect size then, given the sample size in the various studies, the mean power for these studies would have been slightly under 50%. On this basis we would expect about half the studies to yield a significant effect, which is exactly what happened.

It seems unlikely that these errors of interpretation will change as long as researchers continue to work under the current system of significance testing. These types of problems have continued despite educational efforts such as those of Cohen (1962, 1990). They have continued despite the development of simple techniques for computing power, including rule-of-thumb tables (Cohen, 1988; Kraemer & Thiemann, 1987) and computer programs (Borenstein & Cohen, 1988; Borenstein et al., 1990, 1992, 1997; Goldstein, 1989; Rothstein et al., 1990). They have continued despite research that has addressed the kinds of thought processes on the part of researchers that appear to drive these kinds of errors (Tversky & Kahneman, 1971). There are many researchers who are very cognizant of power issues and insist that their studies be properly powered. Most large-scale and multicenter studies are also designed to ensure adequate power for important tests. In this sense, there has been some very substantial progress since the late

1960s. Still, it is clear from Sedlmeier and Gigerenzer, and Rossi, that low power continues to be a problem.

3.14.4.5 Summary

Significance tests are not the appropriate vehicle for the vast majority of psychological research, both because they are inherently inappropriate for this role and additionally because they foster mistakes of interpretation. Specifically:

(i) Significance tests focus attention on statistical significance (whether or not we can reject the nil) rather than clinical significance (which is the question of interest). The test can tell us about the direction of the effect, but in virtually all cases we want also to address the size of the effect.

(ii) There is a general perception that significance tests allow us to distinguish between effects that are "real" and those that are not. In fact, though, virtually all effects we test are nonzero, so the only question resolved by the significance test is whether or not we used a large enough sample.

(iii) The significance test yields p-values that lend themselves to misinterpretation. A significant p-value is assumed to reflect a clinically important effect, but it may not. A nonsignificant p-value is taken to reflect a zero effect, but it does not. These problems are compounded when a series of studies are seen as yielding "conflicting results" for one population, or are seen as evidence that the effect exists in one population but not another.

3.14.5 FOCUSED SIGNIFICANCE TESTS: A DIGRESSION

To this point we have used the term significance test to refer to the null hypothesis of no effect. Overwhelmingly, this is the manner in which statistical tests are applied in the literature. For example, the null hypothesis being tested is that the mean in two groups is identical (i.e., the intervention had no effect), or the response rates in two groups are identical, or the correlation between two variables is zero.

In fact, though, we can adopt a more focused approach by formulating and testing a hypothesis that predicts the magnitude of an effect. This would be appropriate if a theory had developed to the point that questions of effect size would bear on the validity of the theory. It would also be appropriate if we were working with a prospective treatment and the question of interest was whether or not the treatment was clinically useful.

For example, assume that we are planning to test a new drug as a maintenance medication for schizophrenic patients. The drug is known to increase the level of negative symptoms slightly, but would nevertheless be considered useful if it increased the response rate (defined by control of positive symptoms) by 20 percentage points. The hypothesis to be tested is not "Does it increase the response rate at all" but rather "Does it increase the response rate by 20 points or more."

The same logic can be extended to a test of means. Assume, for example, that we want to evaluate a new teaching strategy that is expected to yield an improvement in SAT scores. The strategy will be expensive to implement, and a decision is made that it would be adopted on a wide scale only if it increased the mean score by at least 50 points. The hypothesis to be tested is obvious.

The idea of using significance testing in this way is not new. In the psychological literature the null hypothesis tested is almost invariably the null hypothesis of no effect and it is commonly assumed that the word "null" in "null hypothesis" refers to the size of the effect. In fact, though, it is a reference to the hypothesis to be tested, or "nullified" (Cohen, 1997). While the hypothesis to be nullified is usually the hypothesis that two response rates are identical it could also be the hypothesis that, for example, the response rates differ by 20 percentage points. To avoid confusion, Cohen suggests that the term "null hypothesis" be used to refer to any hypothesis to be tested, while the term "nil hypothesis" be used to refer to the null hypothesis of no effect.

Similarly, in the clinical literature the term "significance test" is almost invariably used to refer to a test of the nil hypothesis, but it can refer to any null hypothesis. Cohen suggests that the term "significance test" be used to refer to a test of the nil, while the term "hypothesis test" be used to refer to other tests (such as the 20 point difference). These conventions will be followed here to avoid confusion. The reader should note, however, that these terms are used interchangeably elsewhere.

The use of hypothesis testing rather than significance testing would provide some obvious benefits. For one thing, it would force researchers to think about the meaning of the null hypothesis. Earlier, the example of researchers testing the null hypothesis that a reliability coefficient is zero was cited. With a focused test the researcher could test the hypothesis that the reliability was significantly better than, for example, 0.80. If these kinds of tests were in common use, researchers would have an incentive to formulate the hypotheses more clearly. Additionally, in those cases where the null hypothesis was rejected, one could assert that the effect size fell in a clinically important range ("the relapse rate is reduced by at least 20%").

However, many of the problems described for the use of significance tests with the nil hypothesis would apply also to the use of hypothesis tests for other null hypotheses.

First, the problems in logic that were identified for significance tests apply also, albeit in modified form, to hypothesis tests. It was argued earlier that the significance test addresses the question of whether or not the effect is precisely nil, which is of little import. Similarly, the focused hypothesis test might address the question of whether or not the effect is precisely 20 points. A 20-point difference might be of more interest than a nil effect but is still an arbitrary value. In practice, what we would like to do is to get a feel for the size of the effect as being trivial, or moderate, or substantial, rather than to rule out any specific value. Similarly, it was argued earlier that the nil hypothesis was rarely true, and that with a large enough sample it would almost always be disproved. Similarly, the null hypothesis that the rate difference is precisely 20 points is virtually certain to be false and with a large enough sample will almost always be rejected. Under the system of hypothesis testing, with a large enough sample, an observed rate difference of 21 points, or even 20.1 points, will require us to reject the null of 20 points. Of course, this is not the intention in running the test, but it serves as a stark example of why the test is not the appropriate vehicle for this kind of application.

Second, the mistakes of interpretation that were identified for significance tests apply, again in modified form, to focused hypothesis tests. One problem with the significance test is that researchers use p-values (improperly) as surrogates for effect size indices. A p-value of 0.05 was assumed to reflect a modest effect and a p-value of 0.01 was assumed to reflect a strong effect. While the focused hypothesis test does introduce the concept of effect size (a 20-point effect in this example) the key index continues to be the p-value. A p-value of 0.05 is likely to be seen as reflecting a modest departure from 20 points, and a p-value of 0.01 as reflecting a substantial departure from 20 points. Similarly, the absence of significance has traditionally been mistaken as evidence that the effect is zero. With the focused hypothesis test it would be mistaken as evidence that the effect is precisely 20 points.

Additionally, if a move was made to focused hypothesis tests, low power would become even more of a problem than it is now. For example, with a sample of 70 per group using a 50-point difference as the effect size (and SD = 100), power to show that one educational method is superior to another is 84%. With the same assumptions, power to show that it is superior by at least 20 points is only 42%. Logically, in fact, we might want to use 50 points as the effect size and also test the hypothesis that the effect is 50 points. When the effect size and the effect being tested are identical to each other, then no sample size will yield power greater than alpha (e.g., 5%).

In sum, the adoption of hypothesis tests would be appropriate for some studies but would not address the key problems outlined above for significance tests, and would additionally introduce some new problems. In any event, the possibility of working with hypothesis tests rather than significance tests is something of an academic point, since the field has bypassed this approach in favor of a more comprehensive approach, the report of effect size with confidence intervals, and it is to this approach that we turn now.

3.14.6 EFFECT SIZE AND CONFIDENCE INTERVALS

For the reasons outlined above researchers and journals are now moving away from significance tests and toward a focus on effect size estimation. Rather than report that an effect is (or is not) significantly different from zero, we report the size of the effect, and specify additionally a confidence interval, the range of treatment effects in the population that are likely to have given rise to the observed data.

In the example cited earlier, assume that we had completed the study and obtained response rates of 30% on drug A vs. 50% for drug B, with a sample size of 100 patients per group. We would report that drug B increased the response rate by 20 percentage points, with a 95% confidence interval of 7 to 33. The effect size observed in the sample (a rate difference of 20 points) is our best guess about the size of the effect in the population. The confidence interval (seven points to 33 points) reflects the likely upper and lower bounds of the effect in the population (Altman, Gore, Gardner, & Pocock, 1983; Borenstein, 1994a, 1994b; Gardner & Altman, 1986, 1989a, 1989b; Morgan, 1989; Simon, 1986, 1987).

3.14.6.1 The Key Advantage of this Approach

If the key problem with significance tests is that they address the wrong question, then the key advantage of effect size estimation is that it addresses the right question. Working with significance tests we defined an effect and asked whether or not that effect was probably true. It was problematic that the effect was typically of little interest (the nil effect) or selected arbitrarily (the focused hypothesis test). Working with effect size and confidence intervals, rather than report on what the effect is not we report on what the effect is. By doing so we focus on the question of interest.

miology, are the relative risk and the odds ratio, either of which is often presented in log units. For studies that deal with correlations, the correlation itself serves as an index of effect size. For studies that focus on time to an event, or survival, the effect size reported is often the hazard ratio, that is, the relative risk per unit time in one group vs. the other. Computational
Additionally, we bypass the details for any of these computations are readily need to specify an arbitrary null hypothesis. available (see, for example, Harris & Albert, The effect size with its corresponding con- 1991; Kleinbaum, Kupper, & Morgenstein, fidence interval provides three key pieces of 1982; Lawless, 1982; Rosner, 1990; Selvin, information: 1991). The goal here is to point out that these (i) An estimate of effect size that is separate exist, and that the effect size can be presented for and distinct from the precision of this estimate. any type of study. Because this is a pure measure of effect, it The effect size is reported with a correspond- focuses our attention on the substantive impact. ing confidence interval. The computation of For example, the treatment increases the re- confidence intervals is always some variation on sponse rate by 25 percentage points. the theme (ii) A value for the lower bound of this value Lower Limit = Observed effect minus in the population (for a given confidence level). (Standard Error* 1.96) With the traditional test we know only that the Upper Limit = Observed effect plus response rate is increased by some amount (Standard Error* 1.96) exceeding zero. By contrast, with the confidence interval we may be able to report that it is At the core of either equation is the observed increased by at least 20 percentage points. effect, about which the confidence interval is (iii) A value for the upper bound of this value constructed. The lower and upper limits in the population. The traditional test provides represent the lower and upper bounds for the no estimate at all of the upper bound. By effect size in the population from which the contrast, with the confidence interval we might sample was drawn. 
Computation of confidence report that the upper bound for the treatment's intervals for proportions, correlations, or impact is 10 percentage points (in which case we survival times is more complicated than sug- know that the impact is of limited substantive gested by this summary, and is discussed in value) or that the upper bound is 50 percentage more detail below. These computational details points. aside, the formula shown here offers a straight- forward conceptual overview.

3.14.6.2 Effect Size and Confidence Intervals can be Computed for Any Type of Study 3.14.6.3 Factors Affecting the Confidence Interval Width Any study that may be used to compute a p- value may also be used to generate an effect size The width of the confidence interval is based with confidence intervals. This follows from the on two factors. The first of these is the sample- fact that the p-value includes both components to-sample dispersion of the effect size: The (the effect size and the precision). The p-value higher the dispersion within a sample, and/or combines the two to yield a single value, but the lower the sample size, the more the effect size each value can be presented separately. As was will vary from one sample to the next. When the case with significance tests, the exact form of working with means the dispersion within a the report will vary with the nature of the data, sample is indexed by the standard deviation of but the differences are minor. the scores. Intuitively, if working with a select For studies that focus on the mean difference group of students and the SAT scores cluster in between groups a natural index of effect is the the range of 600±750, sample-to-sample disper- mean difference itself (either in the original sion will be modest. By contrast, if we are metric, such as SAT points, or standardized to a working with a broader population, in which common metric with a standard deviation of the scores cover the full range of 200±800, 1.0). For studies that focus on the proportion sample-to-sample dispersion will be wider. The of cases responding to a treatment, one index of impact of sample size on the standard error of effect is the rate difference, that is, the difference the effect size is also intuitive. If the sample size in the proportion responding in either group. is small the group mean might be affected Other indices, common in medicine and epide- substantially by a few outliers. 
By contrast, if the sample size is large the impact of outliers will be dissipated. It should be noted that when working with proportions or correlations, the first of these elements, the within-sample dispersion, is a function of the index itself. In the case of proportions, when the population proportion is close to 50% the standard deviation is relatively high, and it declines as the proportions approach either zero or unity. Intuitively, one can imagine that samples drawn from a population where the proportion of responders is 99% will yield values close to 99%; it would be difficult to draw many nonresponders since so few exist. By contrast, if the proportion of responders is 50% it would not be hard to draw one sample with a 40% response rate and another with a 60% response rate. Similarly, for correlations, the standard deviation is highest when the correlation is zero, and it declines as the correlation approaches minus one or plus one.

The second factor controlling the width of the confidence interval is the level of confidence desired. For example, one could choose to present the effect size plus/minus one standard error. In approximately 68% of all studies where the confidence intervals are reported in this manner, the interval will include the population effect size. One could also choose to present the effect size ±1.96 standard errors (assume for this example that the variance is known). On the normal curve, 95% of all observations will fall inside the interval of -1.96 to +1.96. Therefore, if the effect size is presented as ±1.96 standard error units, over an infinite number of studies 95% of the intervals will include the population effect.

The researcher may elect to compute two-tailed or one-tailed bounds for the confidence interval. A two-tailed confidence interval extends from some finite value below the observed effect to another finite value above the observed effect. A one-tailed confidence interval extends from minus infinity to some value above the observed effect, or from some value below the observed effect to plus infinity (the term "interval" is a misnomer in the one-tailed case but is used to maintain readability).

3.14.6.4 Information Conveyed by the Effect Size and Confidence Interval

Figure 5 shows the results of six fictional studies. For each study, the effect (the difference in response rates between placebo and drug) is depicted as a vertical line, bounded on either side by the 95% confidence interval. A solid line from the top to the bottom of the graph marks the null effect of no difference. An effect to the left of this line indicates that the drug group was inferior to placebo, while an effect to the right indicates that the drug group was superior. The studies with narrow confidence intervals are based on a sample of 700 patients per group. Those with wide confidence intervals are based on a sample of between 10 and 15 patients per group. All examples assume that the response rate on placebo is 30%. We can visualize a hierarchy of possible study results, some of which are depicted on this schematic.

Study A: Zero difference (-30 to +30). We cannot rule out the possibility that the effect is nil. Nor, however, can we rule out the possibility that the effect is large enough to be clinically useful or clinically harmful.

Study B: Zero difference (-4 to +4). We cannot rule out the possibility that the effect is nil. More to the point, however, it is clear that the effect is trivial, at best.

Study C: Twenty-five point difference (-10 to +54). The effect size in the sample (and our best estimate of the population effect) is 25 points. We cannot rule out the possibility that the effect is nil (nor can we rule out the possibility that it is quite potent).

Study D: Five-point difference (1–9). The effect is probably not nil. More to the point, however, we can assert with a high level of certainty that the effect is not clinically important (at best, a nine-point advantage in favor of the drug).

Study E: 40-point difference (10–63). The effect is probably not nil. The possible magnitude of the effect ranges from small to extremely potent (the interval is intentionally asymmetric).

Study F: 40-point difference (36–44). The drug is quite potent. The likely range of effects falls entirely within the "very potent" range.

3.14.7 THE RELATIONSHIP BETWEEN SIGNIFICANCE TESTING AND EFFECT SIZE ESTIMATION

The formulas for significance tests and for confidence intervals are mathematically congruent. The standard error of the effect size in confidence intervals is identical to the standard error used in significance testing. The confidence level used for confidence intervals can be complementary to the significance level used in significance tests (with the 95% confidence level corresponding to alpha of 5%).

Given the correspondence between the two methods, any significance test of the form "the difference is zero" and tested with alpha of 0.05 may also be tested by computing the 95% confidence interval and determining whether or

[Figure 5 plot: Studies A through F, one per row, on a horizontal axis running from -100 (favors placebo) through 0 to 100 (favors treatment).]

Figure 5 Data for six fictional studies. The effect size for each study is depicted by a vertical line bounded by the corresponding 95% confidence interval. The effect size highlights the magnitude of the effect while the interval width provides information about the precision with which the effect is estimated. This schematic also provides information about the significance test: if the 95% confidence interval excludes zero, the study is significant at the 0.05 level.

not this interval includes zero. More generally, any hypothesis test of the form "the difference is 20 points" and tested with alpha of 0.05 may also be tested by computing the 95% confidence interval and determining whether or not the interval includes 20 points. If the confidence interval includes the null, then we do not reject the possibility that the true effect is the null. If the confidence interval does not include the null, then we conclude that the effect is (probably) not the null. The same holds true, of course, for a significance test with alpha set at 0.01, and a report of the 99% confidence interval.

When we report a confidence interval and then focus on the question of whether or not the interval includes zero, the distinction between confidence intervals and significance tests is one of format rather than substance. However, as suggested by Figure 6, this shift in format is critically important. Working within the traditional framework of significance testing (shown by the columns in this figure) the key distinction is between studies A, B, C (nonsignificant) on the one hand, as opposed to D, E, F (significant) on the other. By contrast, working within the framework of effect size estimation (shown by the rows) we would group studies B and D (not clinically important), then A, C, E (possibly clinically important), while study F (clinically important) would be classified by itself. (A 20-point difference in response rates would be clinically important in the context of this study.)

3.14.7.1 Confidence Intervals as a Surrogate for the Significance Test

Mathematically, then, confidence intervals can serve as a surrogate for the significance test. Should they be used in this way? Researchers have staked out three positions:

(i) Confidence intervals should be used only as a surrogate for the significance test. Otherwise the results are entirely subjective, and we have no basis to make a decision about whether or not the treatment is effective.

(ii) Confidence intervals can serve as a surrogate for the significance test but they can also stand on their own. Some studies will serve best if used to provide an estimate of effect size, accompanied by an estimate of precision. Others would additionally take note of whether or not the confidence interval included specific values.

(iii) Confidence intervals should never serve as a surrogate for the significance test. The significance test has no role in psychological research, and any vestige of the significance test should be removed.

There is a substantial literature developing on these points of view. In general, the debate in the medical literature is between the first two options while the debate in the psychological literature is between the first and the third. The rationale for each of the positions is outlined here.
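The correspondence between the interval and the test can be checked numerically. The following is a minimal sketch, not the chapter's own computation: it assumes a normal-approximation (Wald) interval with an unpooled standard error for the difference between two proportions, and the function names are hypothetical. It uses the running example of 30% vs. 50% response with 100 patients per group, then tests the nulls "the difference is zero" and "the difference is 20 points" with the same standard error.

```python
from math import sqrt
from statistics import NormalDist


def rate_difference_ci(p1, n1, p2, n2, confidence=0.95):
    """Return (effect, standard error, lower, upper) for the difference p2 - p1.

    Assumption: a normal-approximation (Wald) interval with an unpooled
    standard error; the chapter does not say which method it used.
    """
    effect = p2 - p1
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return effect, se, effect - z_crit * se, effect + z_crit * se


def two_sided_p(effect, se, null_value):
    """Two-sided p-value for the hypothesis that the population effect equals null_value."""
    z = (effect - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Running example: 30% response on drug A vs. 50% on drug B, 100 patients per group.
effect, se, lower, upper = rate_difference_ci(0.30, 100, 0.50, 100)
print(f"effect = {effect * 100:.0f} points, "
      f"95% CI = {lower * 100:.0f} to {upper * 100:.0f}")

# A null value is rejected at alpha = 0.05 exactly when it lies outside the 95% interval.
for null_value in (0.0, 0.20):
    p = two_sided_p(effect, se, null_value)
    outside = not (lower <= null_value <= upper)
    print(f"null = {null_value * 100:.0f} points: p = {p:.4f}, outside CI = {outside}")
```

Under these assumptions the interval rounds to the 7 to 33 points reported for this example. The test and the interval agree only because both use the same standard error and the same critical value, which is the congruence the text describes.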

                          Statistically significant
Clinically important      No          Yes
No                        B           D
Possibly                  A, C        E
Yes                       (none)      F

Figure 6 Effect size estimation vs. significance testing. The columns correspond to significance tests: Studies A, B, C are not significant while studies D, E, F are significant. The rows correspond to effect size estimation: Studies B and D show that the treatment is not clinically important, studies A, C, E show that the treatment may be clinically important, and study F shows that the treatment is clinically important.
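The two-way grouping in Figure 6 can be reproduced from the intervals alone. The sketch below is illustrative only. The interval values are those given in the chapter's description of the six fictional studies, and the 20-point threshold is the chapter's stated criterion for clinical importance. The decision rule in `clinical_importance` is an assumption: it compares the whole interval, not just the point estimate, with that threshold.

```python
# Intervals for the six fictional studies, in percentage points, as read from
# the chapter's description of Figure 5: (observed effect, lower, upper).
STUDIES = {
    "A": (0, -30, 30),
    "B": (0, -4, 4),
    "C": (25, -10, 54),
    "D": (5, 1, 9),
    "E": (40, 10, 63),
    "F": (40, 36, 44),
}

# The chapter treats a 20-point difference as clinically important in this example.
CLINICAL_THRESHOLD = 20


def is_significant(lower, upper, null_value=0):
    """Significant at the matching alpha when the interval excludes the null value."""
    return not (lower <= null_value <= upper)


def clinical_importance(lower, upper, threshold=CLINICAL_THRESHOLD):
    """Hypothetical rule: classify by where the whole interval sits relative to the threshold."""
    if upper < threshold:
        return "No"        # even the upper bound falls short of the threshold
    if lower >= threshold:
        return "Yes"       # even the lower bound reaches the threshold
    return "Possibly"      # the interval straddles the threshold


for name, (effect, lower, upper) in STUDIES.items():
    significance = "significant" if is_significant(lower, upper) else "not significant"
    print(f"Study {name}: effect {effect}, {significance}, "
          f"clinically important: {clinical_importance(lower, upper)}")
```

With this rule the six studies fall into the same cells as in Figure 6, which shows that the row classification uses information (the interval bounds relative to 20 points) that the column classification discards.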

3.14.7.1.1 Confidence intervals should serve only as a surrogate for the significance test

Some researchers have taken the position that the basic idea of significance testing is a good one, and that the problems outlined above stem from the misuse of these tests rather than from any flaw inherent in the method. They argue that it is critical to retain the logic of the significance test since the researcher and the reader need to make a decision at the end of the study about whether or not the treatment effect was probably "real." By using confidence intervals as a surrogate for the significance test these decision rules are maintained while gaining the following critical advantages over the traditional significance test.

First, attention is focused on the magnitude of the treatment effect. By doing so the tendency is avoided to press the p-value into service as an indicator of effect size. A rate difference of five percentage points is seen as being a small effect (in most contexts), even if the sample is so large that the confidence interval does not include zero (and the test is statistically significant). By the same logic an effect of 50 percentage points is large, even if the sample size is so small that the confidence interval does include zero (and the test is not statistically significant).

Second, unambiguous information is provided about the precision of the effect. An effect reported with a precision of ±5 percentage points is definitive: the effect is known to be useless if small, or is known to be useful if large. An effect reported with a precision of ±30 percentage points is likely to leave open the question of clinical utility, but this point will be obvious to the researcher and the reader. Mistakes of interpretation will be less likely than they are with significance tests.

At the same time, this approach preserves the protective mechanisms that are incorporated in the significance test. Specifically, if the confidence interval excludes zero then it is assumed that the effect is not nil; if it includes zero then it is acknowledged that the true effect may in fact be zero (see, for example, Chow, 1988, 1989; Walker, 1986a, 1986b).

3.14.7.1.2 Confidence intervals can optionally be used as a surrogate for the significance test

Others have taken the position that confidence intervals can be used as a surrogate for the significance test, but that it is not necessary to focus on significance when interpreting the results. In the running example one could articulate a series of goals, including the following:

(i) We would like to be certain (where "certain" corresponds to a given level of certainty) that the treatment is not actually harmful.

(ii) We would like to be certain that the treatment effect exceeds zero.

(iii) We would like to be certain that the treatment effect is large enough to be clinically important.

As suggested by Figure 5, the 95% confidence interval might or might not exclude effects that are clinically harmful, that are zero, or that are clinically trivial. More generally, the effect size and confidence intervals might allow us to draw a conclusion about some of these points but not others. Or, they might allow us to answer some questions with one level of certainty, and others at another level of certainty.

In fact, rather than focus on the issue of whether or not the interval includes any single value we can consider the full range of population parameters that are consistent with the study results, recognizing the fact that those falling closer to the observed effect are more probable than those closer to the extremes. This approach was suggested by Birnbaum (1961). More recently, it has been popularized by Miettinen and by Poole (1987a) who suggests that this is displayed most effectively as a graph of confidence interval by confidence level (see also Foster & Sullivan, 1987; Poole, 1987b, 1987c; Sheehe, 1993; Smith & Bates, 1993; Sullivan & Foster, 1990).

In the running example, assume that 20 patients were followed under each treatment condition and the proportion responding was 25% in one treatment group as compared with 50% in the other. The continuous confidence interval for this data is shown in Figure 7. This graph highlights the fact that the most likely impact of the new treatment mode is to increase the response rate by about 25 percentage points. The 60% confidence interval includes a benefit of between 12 and 37 percentage points for the new treatment; the 80% confidence interval extends from five percentage points to 42 percentage points; the 90% confidence interval extends from zero to 46 percentage points; and the 95% confidence interval extends from minus five (a five-point advantage for the standard treatment) to plus 49 (a 49 percentage point advantage for the new treatment).

While it might seem at first glance that this approach introduces an element of subjectivity into the analysis, the fact is that any test of significance involves an element of subjectivity since the decision to set alpha at a particular level is a subjective one, based (at least in theory) on the researcher's need to balance type-I error against type-II error. Additionally, the decision to focus on whether or not the interval includes the nil effect, rather than (say) an effect of five points or 10 points, is also arbitrary. Therefore, the use of confidence intervals in the manner being discussed here, rather than introducing an element of subjectivity into the analysis, merely shifts the subjectivity that is already there from the researcher and on to the reader (Fleiss, 1986a, 1986b; Thompson, 1987a, 1987b, 1987c, 1987d. See also Cohen, 1962; Derouen, 1987; Lachenbruch et al., 1987; Rosnow & Rosenthal, 1989; Rothman, 1986b; Savitz, 1987).

Perhaps more to the point, the reader may not feel a need to frame the answer as a dichotomy, and to conclude that the treatment does (or does not) have any impact on recurrence rates. Rather, the take-home message might be the more comprehensive picture, that is, (in Figure 7 and introducing subjective judgments) "the 40% confidence interval falls entirely in the range of substantial effects; the 80% interval falls in the range of substantial or moderate effects; and the 95% interval includes effects that are trivial" with the entire gamut of possibilities being factored into a decision about the utility of this treatment (see also Burnand, Kernan & Feinstein, 1990; Poole, 1987a, 1987b, 1987c; Rothman, 1986a, p. 119; Walker, 1986a, 1986b; Walter, 1991).

3.14.7.1.3 Confidence intervals should never be used as a surrogate for the significance test

Others have taken the position that even if the abuse associated with the significance test could be avoided, there is a fatal flaw inherent in this approach. The argument cited above in favor of the first position is also the key argument cited for this position: "If we use confidence intervals as surrogates for the significance test, then we retain the basic protective mechanisms of the significance test. Specifically, if the confidence interval excludes zero then we assume that the effect is not nil; if it includes zero then we explicitly acknowledge that the true effect may in fact be zero." Proponents of the third position would add "And therein lies the problem."

It was argued earlier that the null hypothesis is of little interest. If this is true, then when reporting an effect size we should care about the precision of the estimate, but not about whether or not the confidence interval (CI) includes zero. Put another way, an effect of 30 points with CI of 20–40 is informative; with CI of 10–50 is less informative; with CI of 0–60 is less informative, still. But there is no reason that we should be more concerned about "ruling out zero" than about "ruling out" a five-point effect or a 10-point effect. Knowing that the effect is probably not zero does not tell us that the effect is probably clinically important.

Similarly, it was argued that significance tests are misleading since they allow us to confuse effect size with precision. Studies meet the criterion for significance in part because the sample size is large; studies fail to meet the criterion for significance in part because the sample size is not large enough. By exactly the same logic, if the confidence interval includes zero it is in part because the sample size is too small. If the key issue being addressed by the significance test is whether or not there was a large enough sample, then it makes no sense to perform a significance test, in any format. To do so is to propagate and codify exactly the same error that attempts are being made to avoid (see Schmidt, 1992; Schmidt & Hunter, 1996, 1997).

3.14.7.2 Should the Significance Test be Banned?

In 1998 the significance test continues to serve as the de facto standard of proof in most research

[Figure 7 plot: confidence interval function for difference in proportions. Vertical axis: two-tailed confidence intervals, 0.0 to 1.0. Horizontal axis: difference in proportions, -0.6 to 0.6.]

Figure 7 Continuous confidence intervals. Rather than focus on whether or not we can exclude a specific effect size with an arbitrary level of confidence, we can consider the full range of population parameters that are consistent with the study results, recognizing that those falling closer to the observed effect are more probable than those falling closer to the extremes.

studies. As a practical matter, the grant applicant who proposes to skip the significance test is not likely to be taken seriously by the funding agency; drug manufacturers are not likely to submit studies for FDA approval without data on statistical significance; and papers submitted for publication without tests of significance will face a serious obstacle in most journals. In the context of the three positions outlined above, the current state is somewhere prior to the first position. Nevertheless, recent debate has focused on the proposition that the significance test should be banned, which falls at or beyond the third position.

The most vocal proponents in favor of a ban are Schmidt and Hunter (1997) who attack the null hypothesis test on three fronts. First, they note that the null hypothesis is almost always false. Rather than thinking of studies as "significant" or "not significant" we would do well to think of them as "correct" and "incorrect." As long as researchers are permitted to report significance tests, type-II error will continue to run rampant in psychological research. Second, they argue that the role of the single study should be to accumulate data which can later be incorporated in a meta-analysis, and so there is no need to perform a significance test for the single study. As long as significance tests are permitted, the significant studies will be accepted for publication at a higher rate than non-significant studies, which will introduce a bias into the research literature and impair ability to conduct a meta-analysis at a later point. Finally, they contend that the significance test serves no useful function at all. In support of this latter point they proposed a ban on significance tests and then systematically collected objections to this proposal. They summarize these objections by reporting that "Each of these objections is intuitively appealing and plausible but is easily shown to be logically and intellectually bankrupt" (p. 37). Some of the objections to a ban follow, together with comments.

Some have taken issue with the contention that the nil hypothesis is always false. For example, if using a randomized trial to test an entirely new class of drug that actually has zero effect, then the null hypothesis would be true. This is a valid point but largely irrelevant, since it applies to only a trivial number of studies.

Some have argued that significance tests assure objectivity. In fact, one could argue that the hypothesis test is not objective since the selection of a criterion alpha, and also the selection of a specific null to be tested, are made subjectively. Even if one accepts the idea that confidence intervals are less objective, the fact remains that this objectivity leads to ubiquitous errors of interpretation. To paraphrase Tukey, "Be approximately right rather than exactly wrong."

Others have taken issue with Schmidt and Hunter's contention that a single study cannot provide definitive evidence to support or refute a theory. Abelson, for example, cites the classic Festinger experiment, which raised questions about the prevailing theory and set the stage for the studies that followed. (In this study one theory predicted that one group would score higher and the competing theory predicted that the second group would score higher. Ratings were conducted on an arbitrary scale, and the only issue to be resolved was the direction of the effect.) This is a single study that had a substantial impact on the direction of the field, inasmuch as it set the direction for numerous confirmatory studies that followed. Again, these arguments are valid but they apply in only a small fraction of studies.

Perhaps the key point of Schmidt and Hunter is not that any single objection to the idea of banning significance tests is or is not valid, but the fact that researchers cling so hard to their beliefs about significance tests. For example, researchers keep coming back to the idea that we need the tests to distinguish between "real" effects and "chance effects," thereby missing the key point: All effects are real. The tests distinguish between studies that showed this and those that (in error) did not. Schmidt and Hunter write "We attribute this psychologically interesting fact to the virtual brainwashing in significance testing that all of us have undergone, beginning with our first undergraduate course in psychological statistics."

Should significance tests be banned, then? Abelson (1997a) wrote that the idea of such a ban is appealing: "In the words of a famous empathizer 'I feel your pain'" (p. 118). The abuse of significance tests is so pervasive that it is likely to require some kind of absolute ban to prevent the continued abuse of this tool. If significance tests are allowed to continue, they will continue to be abused, and if the confidence interval is allowed to serve as a surrogate for the significance test, then the door to abuse will be wide open. He notes that abuse would be not only possible, but inevitable: "Indeed, under the Law of Diffusion of Idiocy, every foolish application of significance testing will beget a corresponding foolish practice for confidence limits." Despite these concerns, Abelson concludes that the tests should not be banned. He writes, "Create a list of things that people misuse – for example, oboes, ice skates, band saws, skis, and college educations. Would you be inclined to ban them because people make errors with them? Will we want to ban effect sizes, too, when their misuse escalates?" (p. 13).

While Abelson is correct that the idea of imposing a ban is unappealing, it is also clear that some kind of decisive action is required. Without some concrete and clear action on the part of educators, funding agencies, and journal editors, the shift away from significance tests will drag out over an indeterminate period of time. During this period a great deal of effort will be spent in repeatedly making the case for the shift, harm will be caused, and a great deal of potentially important work will go undone.

It is submitted that the appropriate course of action would be not to ban the significance test but rather to require that researchers report effect size indices and confidence intervals. Scientifically, this is more defensible. Pragmatically, this can take effect immediately without further debate. In fact, this requirement is implicit in guidelines now, in the sense that analyses are expected to address the research question and conclusions are expected to follow logically from the analyses. Only rarely do we care about excluding a specific value, and even more rarely do we care about excluding the nil. In virtually all studies, the requirement that analyses address the research question will require that the report highlight the effect size and the precision with which this is reported.

While researchers would be free to report p-values in addition to the effect size and confidence interval, it is likely that this practice would disappear because it would become evident that the significance test is of tangential interest, at best. The experience of Rothman is informative in this regard. As an editor at the American Journal of Public Health, he took a firm stance against the inappropriate use of significance tests, and in doing so both educated and empowered researchers to focus on effect size in their analyses. More recently, as the editor of Epidemiology, Rothman has established a policy that prohibits references to significance testing, but reports that "Interestingly, few readers seem to be aware of this policy, which tells me that they don't miss the absence of claims regarding statistical significance. As you urge, we emphasize the measurement of effect size, and this gets a clear message across to readers."

3.14.7.3 Should the Analysis of the Single Study be Banned?

Similarly, while one can empathize with the desire to ban the analysis of single studies, Schmidt and Hunter overstate their case when they argue that no single study can provide definitive information on its own. Abelson writes, "Indeed, one might say that Hunter and Schmidt use the 'Brooklyn' method of argumentation: The Brooklynite asks you a question, tells you his answer, and then says that you are wrong." The fact is that some studies are well designed, with adequate power and appropriate controls. In fact, some single studies contain more subjects and better designed controls than some meta-analyses.

The first is a shift in educational resources. Some statistics texts now place more of an emphasis on confidence intervals than they had in the past. Computer programs used for analysis are more likely now than in the past to present confidence intervals as well as tests of significance. Some programs for study planning
To balance Hunter now incorporate the ability to plan for and Schmidt's point that no substantive issue in confidence intervals as well as statistical psychology has ever been resolved by a single significance (see, for example, Borenstein study and that all testing should be deferred to et al., 1997 or Elashoff, 1997). the meta-analysis, we cite a recent editorial in the The second is the acceptance of effect size New England Journal of Medicine. Bailar (1997, estimation by important segments of the main- 1998) noted that no substantive issue in medicine stream research community. Journal editors in has ever been resolved by a meta-analysis, and some fields now encourage or require the use of that the randomized clinical trial is still confidence intervals in research reports (Alt- considered to be the gold standard. (Bailar man, 1991; Altman et al., 1983; Rothman, 1978, acknowledges that meta-analysis hold great 1986b; Rothman & Yankauer, 1986). The use of potential but objects to the improper use of this confidence intervals is also encouraged in the procedure. Recall Abelson's prediction, cited Uniform Requirements for Manuscripts Sub- earlier.) Additionally, even when the single study mitted to Biomedical Journals (International cannot yield adequate precision there is a need to Committee of Medical Journal Editors (Inter- provide advice about treatment options based national Committee of Medical Journal Edi- on the evidence at hand, or to use this evidence to tors, 1991) [p. 425]). There is now a task force help design subsequent research. In such cases working on behalf of the American Psycholo- the analysis of a single study may be required. gical Association, charged with the goal of Schmidt and Hunter highlight an important studying the issues raised in this paper and problem when they note that the practice of making recommendations. Harlow et al. 
(1997) allowing significance testing for individual edited a book entitled ªWhat if there were no studies contributes to the problem of publica- significance testsº and discuss eight points that tion bias, but this problem exists, in large part, are endorsed by some or all of the chapter because researchers and editors focus on the p- contributors. There is unanimous rejection of value. Abelson wrote that in deciding whether or the traditional use of significance testsÐ not a study should be published he takes into specifically the exclusive focus on the nil account five components, including such issues hypothesis and p-values. There is unanimous as the role of the study in the context of other agreement that researchers should present effect research and theory. The p-value is a subset of size estimates with confidence intervals, should the fifth component, and the least interesting be cognizant of power issues, and should one, at that. If the recommendations outlined evaluate the results in the context of a theory above for focusing on effect size are followed, rather than in isolation. There is strong support, the illogic of basing a publication decision on the with a few dissenters, for the use of tests that are significance test should become clear, and the designed to confirm or refute specific, publication bias minimized. hypothesis-based theories other than the nil hypothesis. 3.14.7.4 Looking Ahead The third, and most important, factor in this move away from significance tests is the It seems likely that the shift from significance increasingly common application of meta- tests to effect size estimation with confidence analyses. Meta-analyses serve both to highlight intervals is likely to take place over the next the flaws in significance tests and to provide a several years. Earlier, the history of researchers realistic mechanism for the application of effect railing against the null hypothesis tests was size estimation. 
Meta-analysis is the next logical outlined, beginning in the 1950s and proceeding step in the research process and we return to this to the present. The impact of these writings has point at the conclusion of this chapter. been modest. Nevertheless, whether or not significance tests are banned, it is likely that the 3.14.8 USE OF CONFIDENCE INTERVALS next few years will be marked by a shift away IN STUDIES WHOSE PURPOSE IS from significance tests and toward effect size TO PROVE THE NULL estimation with confidence intervals. This prediction is based on the convergence of Some studies are conducted for the express several factors (Schmidt, 1996). purpose of showing that the null is trueÐfor Study Planning: Precision Analysis vs. Power Analysis 341 example, that two drugs are bioequivalent or cited above we would report that the rate that two treatments are equally effective. These difference falls in the range of 70.02 to 0.35, or studies are framed along the following lines: A the rate difference falls in the range of 70.15 to study compares the mortality rates in patients 0.18. In other words, confidence intervals allow treated with either of two drugs. The study has us to report what the effect is rather than what power of 95% to detect a difference of, say, five the effect is not (Blackwelder, 1982; Detsky & percentage points in mortality rates and so if the Sackett, 1985, 1986; Makuch & Johnson, 1986). study fails to yield a significant effect we can The response which had been framed in terms conclude (with 95% certainty) that the groups of significance tests and p-values is confusing do not differ by five percentage points. because the sole function of the significance test The same logic is sometimes applied after the is to evaluate the null hypothesis. If we try to fact to a study which was initially run in hopes of reframe the null to mean almost null because we finding a significant group effect. 
It was shown are trying to estimate the magnitude of an effect, earlier that the absence of an effect is often then we are using the technique in a manner for interpreted without justification as meaning that which it was not intended and the results are the groups are identical. While a conclusion that awkward. the treatments are equivalent cannot be justi- fied, one can make the argument that the study had power to detect a difference of some 3.14.9 COMPUTATIONAL ISSUES IN magnitude, and the absence of significance CONFIDENCE INTERVALS indicates that the groups do not differ by this Earlier, it was noted that a confidence interval much. While this approach is technically correct could be approximated as the observed effect it suffers from two basic flaws. plus/minus a specified distance. This is a First, it involves the loss of useful informa- shortcut that yields approximate results, and tion. Assume, for example, that a study is two caveats are in order. First, even for this planned in which schizophrenic patients cur- approximation the computation is somewhat rently in remission are assigned to treatment more complex than suggested here. If the index with either a low dose or a standard dose of is based on a mean difference, then to compute neuroleptic treatment (N=40 per group) and the 95% interval we would multiply not by 1.96 followed for a year. The study has power of 95% but by a factor, based on the t-distribution, that to detect a difference of 38 percentage points in takes sample size into account. Additionally, for recurrence rates (specifically, rates of 0.80 vs. most indices, including those based on propor- 0.42). Therefore, if the study failed to yield tions or correlations, we would transform the significance it would be fair to conclude that the data prior to computing the confidence interval impact of treatment on recurrence rates, if any, and then convert back to the original metric is less than 38 percentage points. 
prior to reporting the results. Formulas for However, the study that fails to yield these transformations are widely available significance could yield a rate difference of 18 (Fleiss, 1981; Kleinbaum et al., 1982; Rothman, percentage points (0.80 vs. 0.62) with p=0.08 1986a; Selvin, 1991) and are applied as a matter and a 95% confidence interval of 0.02 to 0.35. 7 of course by computer programs. Second, even Or, it could yield a rate difference of two when these transformation are applied, for most percentage points (0.80 vs. 0.78) with p=0.83 indices the result is still approximate. While the and a confidence interval of 0.15 to 0.18. Once 7 approximation yields an interval that is accurate the study is completed it would be inappropriate enough for most purposes, it is not identical to to fall back on the original computations and the interval obtained by more sophisticated report that the rate difference in the population procedures (see, for example, Cornfield, 1956; is less than 38 percentage points. Rather, we Fleiss, 1979, 1981). The assertion that the 95% could report (in the first example) that the rate confidence interval will exclude the null if and difference could be as much as 35 points and (in only if the significance test is significant at 0.05 the second) that it is at most 18 percentage assumes that exact formulas (or the identical points. approximation) are used for computing the p- Second, the logic of using a power analysis to value and the confidence interval. argue that two groups are comparable is muddled and often misunderstood. We are arguing that if the effect had been of a certain 3.14.10 STUDY PLANNING: PRECISION size we would have found it, and we did not, so ANALYSIS VS. POWER ANALYSIS we can assert that the effect is not that large. 
Consider, by contrast, the elegant simplicity by If the goal of a study is to yield a test of which the same information is conveyed in the significance, it follows that the study should context of confidence intervals. In the examples have adequate power to yield a significant 342 The Shift from Significance Testing to Effect Size Estimation effect. By the same logic, if the goal of a study is than 95% will yield a narrower interval. If the to provide an estimate of effect size, it follows study index is a mean difference, the researcher that the study should be able to provide this may modify the standard deviation (SD) within estimate with an acceptable degree of precision. groupsÐa smaller SD will yield a narrower While the research community is clearly moving interval. The researcher may also modify the toward effect size estimation for data analysis, effect size but, as noted earlier, the confidence relatively little has been said about the need for a interval width will be affected only slightly, if at corresponding shift in the mechanisms used for all. study planning. In fact, it is not unusual to see a It should be noted that Figure 8 is an paper that focuses on effect size in the analysis extension of the example cited earlier for power but whose sample size was set on the basis of a (Figure 3). In that case the researcher had power analysis. There is neither a mathematical determined that a sample of 107 patients per nor a logical basis for this mixture of apples and group would yield power of 80%, and the estimates. The two goals, significance tests and confidence interval is shown as 0.07±0.33. When effect size estimation, have in common that they planning for precision the researcher has are both things we can do with data. Beyond increased the sample size to 190 to yield a more this, however, the two criteria are quite precise estimate of the effect size, and power has different. increased as well, from 80% to 97%. 
Power addresses the likelihood that a study As it did for power, the program will generate will yield a significant effect. As such, it is a table or graph of precision that allows us to controlled by effect size, sample size, and quickly take account of the larger picture alphaÐthe same elements that control statis- (Figure 9). With 50 patients per group we will tical significance. Precision addresses the width be able to report the rate difference with a of the confidence interval surrounding the effect precision (95% confidence interval) + 20 size estimate. As such, it is controlled by the percentage points. With 100 patients per group confidence level and the sample sizeÐthe same we will report the rate difference with a elements that control the confidence interval precision of +13 percentage points. With 200 width. Of particular note, effect size is the patients per group we will report the rate dominant factor in computation of power but difference with a precision of +10 percentage plays little (if any) role in the computation of points. precision. The sample size required to yield The program will also generate a summary of adequate power and the sample size required to the analysis. This report reads (in part) ªA yield adequate precision will differ from each second goal of this study is to estimate the other in almost all cases, and often by very difference between the two populations . . . The substantial amounts. study will enable us to report the difference in proportions with a precision (95.0% confidence level) of approximately + 0.098 points. Speci- 3.14.10.1 Planning for Precision fically, an observed difference of 0.200 would be reported with a 95.0% confidence interval of To determine the study's precision during the 0.100±0.296. The precision estimated here is the process of study planning, we may ªplug inº the approximate expected precision. 
Precision will sample size and confidence level, and compute vary as a function of the observed proportions the confidence interval width that will result. As (as well as sample size), and in any single study was the case for power analysis, this process is will be narrower or wider than this estimate.º facilitated by computer programs designed for In this example the power analysis led to a this purpose. sample size requirement of 107 per group while Figure 8 is an example of this process using the precision analysis led to a sample size of 190 the program cited earlier for power analysis. On per group. In other studies the relationship this panel the researcher has entered response between the two sample sizes will be different, rates of 60% and 40% for the new treatment and could even be reversed (with the sample for and the standard treatment, respectively, and a power being larger than the sample for sample size of 190 patients per group. The precision). The key point is that the two goals program displays the effect size as a rate of significance testing and effect size estimation difference of 20 points. This effect will be are distinct. Precision analysis focuses on the reported with a 95% confidence interval of plus/ precision of the estimate rather than the power minus 10 points (0.10±0.30). to test the null hypothesis, and as such is the The researcher may modify the sample appropriate method to use when the study goal sizeÐincreases to sample size will yield a more is to estimate the magnitude of the treatment precise estimate. The researcher may modify the effect (Bristol, 1989; Burnand et al., 1990; confidence levelÐsetting the level at 90% rather Cobb, 1985; Feinstein, 1990; Gordon, 1987; With 190 patients per group, a rate difference of 20 points (60% vs. 40%) would be reported with a 95% confidence interval of ± 10 percentage points (0.10 to 0.30)

Figure 8 Screen from Power and Precision showing computation of precision. The effect size is shown as a rate difference of 20 points with a 95% confidence interval of 0.10 to 0.30. The researcher may modify any of the study parameters such as sample size and confidence level and immediately see the impact on the confidence interval width.
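The contrast between power and precision in the 60% vs. 40% example can be sketched numerically. The block below is an illustration only: the function names are hypothetical, the interval is the simple normal-approximation (Wald) interval, and the power formula (a normal approximation with a continuity correction of 1/n) is an assumption that happens to land close to the figures quoted in the text; the program itself may use different formulas.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ci_half_width(p1, p2, n_per_group, confidence=0.95):
    """Half-width of the normal-approximation (Wald) interval for p1 - p2."""
    z = _STD_NORMAL.inv_cdf(0.5 + confidence / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    return z * se


def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate two-tailed power for comparing two proportions.

    Assumed formula: normal approximation with a continuity correction
    of 1/n. Only the tail in the direction of the effect is counted.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    z = (abs(p1 - p2) - 1 / n_per_group - z_alpha * se_null) / se_alt
    return _STD_NORMAL.cdf(z)


# Same sample sizes as in the text, plus a smaller effect at n = 190.
for p1, p2, n in [(0.60, 0.40, 107), (0.60, 0.40, 190), (0.55, 0.45, 190)]:
    print(
        f"{p1:.2f} vs {p2:.2f}, n={n}: "
        f"power={approx_power(p1, p2, n):.2f}, "
        f"half-width={ci_half_width(p1, p2, n):.3f}"
    )
```

Under these assumptions the power comes out near 0.80 at 107 per group and near 0.97 at 190 per group, while the interval half-width narrows from about 0.13 to about 0.10. Halving the effect size at 190 per group drops power to under 0.5 but leaves the half-width almost unchanged, which is the asymmetry the section describes.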

[Figure 9 plot: 95% confidence interval for rate difference, two sample proportions (proportion responding 60% vs. 40%). Vertical axis: rate difference (−0.2 to 0.5). Horizontal axis: number of cases per group (0 to 300).]

Figure 9 Planning for precision. With a sample of 50 cases per group the treatment effect will be reported with a 95% confidence interval some 40 percentage points wide. With 100 cases per group the interval will narrow to 30 points, and with 200 per group the interval will narrow to 20 points.
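A curve like the one in Figure 9 can be approximated with a few lines of code. This is a minimal sketch, assuming the simple normal-approximation interval for a difference in proportions; the helper names are hypothetical, and the program described in the chapter may use more refined formulas, so its numbers can differ slightly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def half_width(p1, p2, n_per_group, confidence=0.95):
    """Expected half-width of the Wald interval for p1 - p2."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)


def n_for_half_width(p1, p2, target, confidence=0.95):
    """Smallest n per group whose expected half-width is at most `target`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / target ** 2)


# Precision as a function of sample size for response rates of 60% vs. 40%.
for n in (50, 100, 200):
    hw = half_width(0.60, 0.40, n)
    print(f"n={n:3d} per group: +/-{hw:.3f} (interval about {2 * hw:.2f} wide)")

# Inverse question: how many cases per group for a half-width of 0.10?
print(n_for_half_width(0.60, 0.40, 0.10))
```

With these assumptions the half-widths come out near 0.19, 0.14, and 0.10 for 50, 100, and 200 cases per group, in line with the rounded figures in the caption. The inverse calculation gives a sample size somewhat below the 190 per group used in the chapter's example, which corresponds to a half-width of about 0.098 rather than exactly 0.10.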

Greenland, 1988; Oakes, 1986; Smith & Bates, 1992; Walter, 1991).

Of course, even a fairly narrow interval may include some effects that are clinically meaningful and others that are not, but as the interval narrows in width it becomes increasingly likely that the study results will provide the information required for an informed clinical decision.

In many cases, the sample size required for a precise estimate will exceed the available resources. For example, if only 50 cases per group are available, then the 95% interval will be nearly 40 points wide. In this case the study could still provide data that would be pooled with data from other studies, in a meta-analysis.

The method presented here for planning for precision provides an estimate of the confidence interval width, but the actual confidence interval in any given study will almost always be narrower or wider than the width estimated in this way. When working with means, for example, the confidence intervals are based in part on the sample standard deviation, which will vary from one sample to the next. When working with proportions, correlations, or survival data the confidence intervals are based in part on the absolute effect size obtained, which will vary from one sample to the next. The procedure outlined here is based on the assumption that the sample effect will mirror the population effect precisely. As such, it yields the median confidence interval width for some procedures and the expected (mean) width for others. Computer programs use more complex formulas that take account of the sampling distribution of the confidence intervals. As such they may allow us to report, for example, that "With a sample of 50 cases per group and a standard deviation of 10 rating points within groups, the median width of the confidence interval will be 7.9 points. In 80% of studies, the confidence interval will be no wider than 8.4 points." If the researcher needs to ensure that the confidence interval width will fall within a specific range, then these formulas should be employed (Borenstein et al., 1997; Elashoff, 1997; Hahn & Meeker, 1991).

3.14.11 META-ANALYSIS

Meta-analysis is the process of developing a comprehensive picture of a field of research by performing analyses on summary data from individual studies, and as such is the next logical step in the research process. The thesis of this chapter has been that the goal of research should be to identify the size of a treatment effect. The single study, because it includes a limited number of subjects, typically yields an imprecise estimate. By combining data from multiple studies we effectively increase the sample size and yield a more precise estimate of the effect. Additionally, the single study typically includes one treatment and one treatment population. The meta-analysis, by contrast, may synthesize data from multiple treatments or multiple populations, and as such may be used to identify the type of treatment that will work best for a specific type of patient. Meta-analyses are being used to provide treatment recommendations based on the results of completed studies. They are also being used to identify areas where the questions of interest have not yet been answered, and to help shape the development of new studies.

The relatively widespread acceptance of meta-analysis since the late 1980s has important implications for the topic of this chapter in a more direct sense as well. The thesis of this chapter is that in psychological research the null hypothesis is rarely true, and what appear to be "conflicting results" (one study is statistically significant while another is not) often reflect normal sample-to-sample variation in effect size coupled with variations in sample size. In past decades this argument had been made in the abstract. With the advent of meta-analysis it is possible to actually "see" the bigger picture. Almost invariably, it becomes clear that the null hypothesis is false, and the study-to-study variations can be seen in context. This has been a critical factor in the shift toward effect size estimation (Cook et al., 1992; Cooper, 1989; Cooper & Hedges, 1994; Eagly & Wood, 1994; Hedges, 1990; Hedges & Olkin, 1985; Schmidt & Hunter, 1997; Shadish, 1992).

3.14.12 CONCLUSIONS

Researchers typically want to know whether or not a treatment has any impact, and also what the size of this impact might be. Researchers working exclusively with significance tests are able to address only the first of these issues, and leave the second unanswered. Worse, because the issue of effect magnitude is such a compelling one, the significance tests are often pressed into service to address this issue as well, which often leads to study results (whether statistically significant or not) being misinterpreted. Statistical significance is equated with clinical significance, though this interpretation may not be warranted. Absence of statistical significance is taken as evidence that a treatment is not effective, though this interpretation is rarely justified.

By contrast, estimates of effect size bounded by confidence intervals provide the kind of information that researchers and clinicians require. The report of effect size focuses attention on the magnitude of the effect, and the confidence intervals tell us how much confidence we can have that the population effect lies within a given range. The data may allow us to conclude that the treatment is clearly not effective, or that it clearly is effective, or that additional information is required. Critically, though, this type of report focuses attention on the relevant issues. If and when we elect to make a dichotomous decision, the format of the presentation helps to ensure that the meaning of this decision is clearly understood.

There is an ongoing debate about the role of the single study as part of the research process. Some take the position that any single study should allow researchers to conclude that a specific effect size is or is not tenable. Others have argued that the single study cannot provide this kind of information reliably, and that hypothesis testing should be deferred to the time that the data from many studies can be synthesized in a meta-analysis.

This debate has carried over to the role of confidence intervals. There is complete agreement that researchers should present the effect size bounded by confidence intervals since this will focus attention on the issue of interest and will help to prevent mistakes of interpretation. However, there is sharp disagreement over how the confidence interval should be interpreted. Some take the position that confidence intervals should serve as a surrogate for the significance test, and the focus should be on the question of whether or not the interval includes a specific value. When used in this way the shift from significance tests to confidence intervals is one of format rather than substance: it helps to focus attention on the proper issue and to avoid mistakes of interpretation. Others feel that this approach will merely codify the kinds of errors inherent in significance tests. They argue that the key element in a shift to confidence intervals should be the abandonment of significance testing, and that one should refrain from any discussion of whether or not specific points fall within the interval.

The issues addressed in study planning should correspond to the issues that will be addressed in the data analyses. If the goal of a study is to yield a test of significance, the study should have adequate power to yield a significant effect. Similarly, if the goal of a study is to provide an estimate of effect size that can stand on its own, the study should be able to provide this estimate with an acceptable degree of precision. Prior to ensuring that the power analysis is performed properly it behooves the researcher to ensure that it makes sense to test the null, and indeed that the study should be carried out at all.
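The claim made in the meta-analysis section, that combining studies effectively increases the sample size and sharpens the estimate, can be illustrated with a small sketch. The three studies below are hypothetical (invented for illustration), the function names are not from any particular package, and the pooling rule shown is the standard fixed-effect inverse-variance weighting, which is only one of several approaches a real meta-analysis might use.

```python
from math import sqrt
from statistics import NormalDist


def rate_difference(p1, p2, n_per_group):
    """Return (estimate, variance) for an observed difference p1 - p2."""
    variance = (p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group
    return p1 - p2, variance


def pool_fixed_effect(studies, confidence=0.95):
    """Inverse-variance (fixed-effect) pooled estimate and interval half-width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    weights = [1 / variance for _, variance in studies]
    total = sum(weights)
    estimate = sum(w * d for w, (d, _) in zip(weights, studies)) / total
    return estimate, z * sqrt(1 / total)


# Hypothetical studies: (response rate A, response rate B, cases per group).
hypothetical = [(0.62, 0.40, 50), (0.56, 0.42, 50), (0.64, 0.38, 50)]
studies = [rate_difference(p1, p2, n) for p1, p2, n in hypothetical]

Z95 = NormalDist().inv_cdf(0.975)
for d, v in studies:
    print(f"single study: {d:.2f} +/- {Z95 * sqrt(v):.2f}")

pooled_estimate, pooled_half_width = pool_fixed_effect(studies)
print(f"pooled:       {pooled_estimate:.2f} +/- {pooled_half_width:.2f}")
```

Each hypothetical study alone reports an interval close to 40 points wide, while the pooled interval is considerably narrower, so studies that would look "conflicting" under a significance test are seen as consistent estimates of one effect.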

The argument that effect size estimation and confidence intervals should replace significance testing in behavioral research is not new but the research community is beginning to take concrete steps in this direction. This shift has been driven by the advent of meta-analysis. Meta-analyses have served to highlight the kinds of errors that are fostered by the use of significance tests, and at the same time provide a vehicle for estimating treatment effects precisely. Additionally, they provide a vehicle for moving beyond the single study and for synthesizing data from multiple sources, which is the next logical step in the research process.

ACKNOWLEDGMENTS

This work was supported in part by the following grants: NIMH/SBIR MH-43083, NIMH/SBIR MH-52969, NIMH/SBIR MH 558483, and NIMH 41960. I am grateful for their comments to Hannah Rothstein, Nina Schooler, Kenneth Rothman, and the late Jacob Cohen, to whom this chapter is dedicated.

3.14.13 REFERENCES

Abelson, R. P. (1995). Statistics as principled argument. Mahwah, NJ: Erlbaum.
Abelson, R. P. (1997a). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would be invented). In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 117–144). Mahwah, NJ: Erlbaum.
Abelson, R. P. (1997b). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8(1), 12–15.
Altman, D. G. (1991). Statistics in medical journals: Developments in the 1980s. Statistics in Medicine, 10, 1897–1913.
Altman, D. G., Gore, S. M., Gardner, M. J., & Pocock, S. J. (1983). Statistical guidelines for contributors to medical journals. British Medical Journal, 286, 1489–1493.
Bailar, J. C. (1997). The promise and problems of meta analysis. New England Journal of Medicine, 337, 559–561.
Bailar, J. C. (1998). Letter. New England Journal of Medicine, 338, 62.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423–437.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33, 526–542.
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37, 325–335.
Birnbaum, A. (1961). Confidence curves: An omnibus technique for estimation and testing statistical hypotheses. Journal of the American Statistical Association, 56, 246–249.
Blackwelder, W. C. (1982). "Proving the null hypothesis" in clinical trials. Controlled Clinical Trials, 3, 345–353.
Borenstein, M. (1994a). Planning for precision in survival studies. Journal of Clinical Epidemiology, 47(11), 1277–1285.
Borenstein, M. (1994b). The case for confidence intervals in controlled clinical trials. Controlled Clinical Trials, 15, 411–428.
Borenstein, M., & Cohen, J. (1988). Statistical power analysis: A computer program. Hillsdale, NJ: Erlbaum.
Borenstein, M., Cohen, J., Rothstein, H., Pollack, S., & Kane, J. (1990). Statistical power analysis for one-way analysis of variance: A computer program. Behavior Research Methods, Instruments, and Computers, 22, 271–282.
Borenstein, M., Cohen, J., Rothstein, H., Pollack, S., & Kane, J. (1992). A visual approach to statistical power analysis on the microcomputer. Behavior Research Methods, Instruments & Computers, 24, 565–572.
Borenstein, M., Rothstein, H., & Cohen, J. (1997). Power and precision. Teaneck, NJ: Biostat. http://www.PowerAndPrecision.com
Bristol, D. R. (1989). Sample sizes for constructing confidence intervals and testing hypotheses. Statistics in Medicine, 8, 803–811.
Burnand, B., Kernan, W. N., & Feinstein, A. R. (1990). Indexes and boundaries for "quantitative significance" in statistical decisions. Journal of Clinical Epidemiology, 43, 1273–1284.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105–110.
Chow, S. L. (1989). Significance tests and deduction: Reply to Folger (1989). Psychological Bulletin, 106, 161–165.
Cobb, E. B. (1985). Planning research studies: An alternative to power analysis. Nursing Research, 34, 386–388.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153.
Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology (pp. 95–121). New York: McGraw-Hill.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York: Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohen, J. (1997). The earth is round (p < .05). In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 21–36). Mahwah, NJ: Erlbaum.
Cole, P., & Goldman, M. B. (1975). Occupation. In J. F. J. Fraumeni (Ed.), Persons at high risk of cancer: An approach to cancer etiology and control (pp. 167–183). New York: Academic Press.
Cook, T. D., Cooper, H., Cordray, D. S., Hartman, H., Hedges, L. V., Light, L. V., Louis, T. A., & Mosteller, F. (1992). Meta-analysis for explanation. New York: Russell Sage Foundation.
Cooper, H. (1989). Integrating research: A guide for literature reviews (2nd ed.). Newbury Park, CA: Sage.
Cooper, H., & Hedges, L. V. (1994). Research synthesis as a scientific enterprise. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 3–14). New York: Russell Sage Foundation.
Cornfield, J. (1956). A statistical problem arising from retrospective studies. In J. Neyman (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (pp. 135–148). Berkeley, CA: University of California Press.

Dar, R. (1987). Another look at Meehl, Lakatos, and the PC±DOS computers. American Statistician, 43, 253±260. scientific practices of psychologists. American Psycholo- Gonzalez, R. (1994). The statistics ritual in psychological gist, 42, 145±151. research. Psychological Science, 5, 321±328. Derouen, T. A. (1987). Letter. American Journal of Public Gordon, I. (1987). Sample size estimation in occupational Health, 77, 237. mortality studies with use of confidence interval theory Detsky, A. S., & Sackett, D. L. (1985). When was a (see comments). American Journal of Epidemiology, 125, ªnegativeº clinical trial big enough? How many patients 158±162. you need depends on what you found. Archives of Greenland, S. (1988). On sample-size and power calcula- International Medicine, 145, 709±712. tions for studies using confidence intervals. American Detsky, A. S., & Sackett, D. L. (1986). Establishing Journal of Epidemiology, 128, 231±237. therapeutic equivalency: What is clinically significant Hagen, R. L. (1997). In praise of the null hypothesis difference? Archives of International Medicine, 146, significance test. American Psychologist, 52, 15±24. 861±862. Hahn, G. J., & Meeker, W. Q. (1991). Statistical intervals. Eagly, A. H., & Wood, W. (1994). Using research synthesis New York: Wiley. to plan future research. In H. Cooper & L. V. Hedges Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if (Eds.), The handbook of research synthesis (pp. 485±500). there were no significance tests? Mahwah, NJ: Erlbaum. New York: Russell Sage Foundation. Harris, E. K., & Albert, A. (1991). Survivorship analysis for Elashoff, J. (1997). nQuery. Cork, Ireland: Statistical clinical studies. New York: Marcel Dekker. Solutions. Harris, R. J. (1997a). Reforming significance testing via Estes, W. K. (1997). Significance testing in psychological three-valued logic. In L. L. Harlow, S. A. Mulaik, & J. research: Some persisting issues. Psychological Science, H. 
Steiger (Eds.), What if there were no significance tests? 8(1), 18±20. (pp. 145±174). Mahwah, NJ: Erlbaum. Feinstein, A. R. (1975). Clinical biostatistics XXXIV: The Harris, R. J. (1997b). Significance tests have their place. other side of ªstatistical significanceº: Alpha, beta, delta, Psychological Science, 8(1), 8±11. and the calculation of sample size. Clinical Pharmacology Hartung, J., Cottrell, J. E., & Giffen, J. P. (1983). Absence and Therapeutics, 18, 491±505. of evidence is not evidence of absence. Anesthesiology, Feinstein, A. R. (1976). Clinical biostatistics XXXVII. 58, 298±300. Demeaned errors, confidence games, nonplussed Hedges, L. V. (1990). Directions for future methodology. minuses, inefficient coefficients, and other statistical In K. W. Wachter & M. L. Straf (Eds.), The future of disruptions of scientific communication. Clinical Phar- meta-analysis (pp. 11±26). New York: Russell Sage macology and Therapeutics, 20, 617±631. Foundation. Feinstein, A. R. (1977). Clinical biostatistics XL: Stochas- Hedges, L. V., & Olkin, I. (1985). Statistical methods for tic significance, consistency, apposite data, and some meta-analysis. Boston: Academic Press. other remedies for the intellectual pollutants of statistical Hoffman, D., & Wynder, E. L. (1976). Smoking and vocabulary. Clinical Pharmacology and Therapeutics, 22, occupational cancers. Preventative Medicine, 5, 245±261. 113±123. Hunter, J. E. (1997). Needed: A ban on the significance Feinstein, A. R. (1990). The unit fragility index: An test. Psychological Science, 8(1), 3±7. additional appraisal of ªstatistical significanceº for a International Committee of Medical Journal Editors contrast of two proportions. Journal Clinical Epidemiol- (1991). Uniform requirements for manuscripts submitted ogy, 43, 201±209. to biomedical journals. New England Journal of Medi- Fisher, R. A. (1959). Statistical methods and scientific cine, 324, 424±428. inference (2nd ed.). Edinburgh, UK: Oliver & Boyd. Kane, J. 
M., & Borenstein, M. (1985). Compliance in the Fleiss, J. (1979). Confidence intervals for the odds ratio in long-term treatment of schizophrenia. Psychopharmacol- case-control studies: The state of the art. Journal of ogy Bulletin, 21, 23±27. Chronic Diseases, 32, 69±77. Kleinbaum, D. G., Kupper, L. L., & Morgenstern, H. Fleiss, J. (1981). Statistical methods for rates and propor- (1982). Epidemiologic research: Principles and quantita- tions (2nd ed.). New York: Wiley. tive methods. Belmont, CA: Lifetime Learning Publica- Fleiss, J. L. (1986a). Confidence intervals vs. significance tions. tests: Quantitative interpretation (letter). American Kraemer, H. C., & Thiemann, S. (1987). How many Journal Public Health, 76, 587. subjects? Statistical power analysis in research. Newbury Fleiss, J. L. (1986b). Significance tests have a role in Park, CA: Sage. epidemiologic research: Reactions to A. M. Walker. Lachenbruch, P. A., Clark, V. A., Cumberland, W. G., American Journal Public Health, 76, 559±560. Chang, P. C., Afifi, A. A., Flack, V. F., & Elashoff, R. M. Foster, D. A., & Sullivan, K. M. (1987). Computer (1987). Letter. American Journal of Public Health, 77, 237. program produces p-value graphics (letter). American Lau, J., Antman, E. M., Jimenez Silva, J., Kupelnick, B., Journal Public Health, 77, 880±881. Mosteller, F., & Chalmers, T. C. (1992). Cumulative Freiman, J. A., Chalmers, T. C., Smith, H., Jr., & Kuebler, meta-analysis of therapeutic trials for myocardial R. R. (1978). The importance of beta, the type-II error, infarction. New England Journal of Medicine, 327, and sample size in the design and interpretation of the 248±254. randomized control trial. Survey of 71 ªnegativeº trials. Lawless, J. F. (1982). Statistical models and methods for New England Journal Medicine, 299, 690±694. lifetime data. New York: Wiley. Friedman, S. B., & Phillips, S. (1981). What's the Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of difference? 
Pediatric residents and their inaccurate psychological, educational, and behavioral treatment. concepts regarding statistics. Pediatrics, 68, 644±646. Confirmation from meta-analysis. American Psycholo- Gardner, M. J., & Altman, D. G. (1986). Confidence gist, 48, 1181±1209. intervals rather than P values: Estimation rather than Loftus, G. R. (1991). On the tyranny of hypothesis testing hypothesis testing. British Medical Journal, 292, 746±750. in the social sciences. Contemporary Psychology, 36, Gardner, M. J., & Altman, D. G. (1989a). Confidence 102±105. intervals analysis. London: British Medical Journal. Lykken, D. T. (1968). Statistical significance in psycholo- Gardner, M. J., & Altman, D. G. (1989b). Statistics with gical research. Psychology Bulletin, 70, 151±159. confidenceÐconfidence intervals and statistical guidelines. Mainland, D. (1982). Medical statisticsÐthinking vs. London: British Medical Journal. arithmetic. Journal of Chronic Diseases, 35, 413±417. Goldstein, R. (1989). Power and sample size via MS/ Makuch, R. W., & Johnson, M. F. (1986). Some issues in 348 The Shift from Significance Testing to Effect Size Estimation

the design and interpretation of ªnegativeº clinical Rothstein, H., Borenstein, M., Cohen, J., & Pollack, S. studies. Archives of Internal Medicine, 146, 986±989. (1990). Statistical power analysis for multiple regression/ Meehl, P. E. (1967). Theory-testing in psychology and correlation: A computer program. Educational and physics: A methodological paradox. Philosophy of Psychological Measurement, 50, 819±830. Science, 34, 103±115. Rozeboom, W. W. (1960). The fallacy of the null hypothesis Meehl, P. E. (1978). Theoretical risks and tabular asterisks: significance test. Psychological Bulletin, 57, 416±428. Sir Karl, Sir Ronald, and the slow progress in soft Savitz, D. (1987). Letter. American Journal of Public psychology. Journal of Consulting and Clinical Psychol- Health, 77, 237±238. ogy, 46, 806±834. Scarr, S. (1997). Rules of evidence: A larger context for the Meehl, P. E. (1990). Why summaries of research on statistical debate. Psychological Science, 8(1), 16±17. psychological theories are often uninterpretable. Psycho- Schmidt, F. L. (1992). What do data really mean? Research logical Reports, 66, 195±244. findings, meta-analysis, and cumulative knowledge in Morgan, P. P. (1989). Confidence intervals: From statis- psychology. American Psychologist, 47, 1173±1181. tical significance to clinical significance. Canadian Schmidt, F. L. (1996). Statistical significance testing and Medical Association Journal, 141, 881±883. cumulative knowledge in psychology: Implications for Murphy, K. R. (1990). If the null hypothesis is impossible, training of researchers. Psychological Methods, 1, why test it? American Psychology, 45, 403±404. 115±129. Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Schmidt, F. L., & Hunter, J. E. (1996). Measurement error Interpretation of significance levels and effect sizes by in psychological research: Lessons from 26 research psychological researchers. American Psychologist, 41, scenarios. Psychological Methods, 1, 199±223. 1299±1301. 
Schmidt, F. L., & Hunter, J. E. (1997). Eight common Neyman, J., & Pearson, E. S. (1932a). On the use and but false objections to the discontinuation of significance interpretation of certain test criteria for purposes of testing in the analysis of research data. In L. L. Harlow, statistical inference: Part II. Biometrika, 20A, 263±294. S. A. Mulaik, & J. H. Steiger (Eds.), What if there were Neyman, J., & Pearson, E. S. (1932b). On the use and no significance tests? (pp. 37±64). Mahwah, NJ: interpretation of certain test criteria for purposes of Erlbaum. statistical inference: Part 1. Biometrika, 20A, 175±240. Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of Oakes, M. (1986). Statistical inference: A commentary for statistical power have an effect on the power of studies? the social and behavioral sciences. New York: Wiley. Psychological Bulletin, 105, 309±316. Peto, R., & Doll, R. (1977). When is significant not Selikoff, I. J., Hammond, E. C., & Churg, J. (1968). significant. British Medical Journal, 2, 259. Asbestos exposure, smoking and neoplasia. Journal of Phillips, W. C., Scott, J. A., & Blasczcynski, G. (1983). The the American Medical Association, 204, 106±112. significance of ªNo significanceº: What a negative Selvin, S. (1991). Statistical analysis of epidemiologic data. statistical test really means. American Journal of New York: Oxford University Press. Roentgenology, 41, 203±206. Shadish, W. R. (1992). Do family and marital psy- Poole, C. (1987a). Beyond the confidence interval. Amer- chotherapies change what people do? A meta-analysis of ican Journal of Public Health, 77, 195±199. behavioral outcomes. In T. D. Cook & H. Cooper Poole, C. (1987b). Confidence intervals exclude nothing. (Eds.), Meta-analysis for explanation (pp. 129±208). New American Journal of Public Health, 77, 492±493. York: Russell Sage. Poole, C. (1987c). Mr. Poole's response (letter). American Shaver, J. P. (1993). 
What statistical significance testing is, Journal of Public Health, 77, 880. and what it is not. Journal of Experimental Education, 61, Reed, J. F., & Slaichert, W. (1981). Statistical proof in 293±316. inconclusive ªNegativeº trials Archives of Internal Sheehe, P. R. (1993). A variation on a confidence interval Medicine, 141, 1307±1310. theme (letter). Epidemiology, 4, 185±186. Reynolds, T. B. (1980). Type II error in clinical trials Shrout, P. E. (1997). Should significance tests be banned? (Editor's reply to letter). Gastroenterology, 79, 180. Introduction to a special section exploring the pros and Rosenthal, R., & Gaito, J. (1963). The interpretation of cons. Psychological Science, 8, 1±2. levels of significance by psychological researchers. Simon, R. (1986). Confidence intervals for reporting results Journal of Psychology, 55, 33±38. of clinical trials. Annals of Internal Medicine, 105, Rosenthal, R., & Gaito, J. (1964). Further evidence for the 429±435. cliff effect in the interpretation of levels of significance. Simon, R. (1987). The role of overviews in cancer Psychological Reports, 15, 570. therapeutics. Statistics in Medicine, 6, 389±396. Rosner, B. (1990). Fundamentals of biostatistics (3rd ed.). Smith, A. H., & Bates, M. N. (1992). Confidence limit Boston: PWS-Kent Publishing. analyses should replace power calculations in the inter- Rosnow, R. L., & Rosenthal, R. (1989). Statistical pretation of epidemiologic studies. Epidemiology, 3, procedures and the justification of knowledge in psycho- 449±452. logical science. American Psychology, 44, 1276±1284. Smith, A. H., & Bates, M. N. (1993). A variation on a Rossi, J. (1990). Statistical power of psychological confidence interval theme (Reply to letter). Epidemiol- research: What have we gained in 20 years? Journal of ogy, 4, 186±187. Consulting and Clinical Psychology, 58, 646±656. Sullivan, K. M., & Foster, D. A. (1990). Use of the Rossi, J. (1997). 
A case study in the failure of psychology as confidence interval function. Epidemiology, 1, 39±42. a cumulative science: The spontaneous recovery of Thomas, L., & Krebs, C. (1997). A comprehensive list of verbal learning. In L. L. Harlow, S. A. Mulaik, & J. power analysis software for microcomputers. http:// H. Steiger (Eds.), What if there were no significance tests? www.Interchg.ubc.ca/cacb/power. (pp. 175±198). Mahwah, NJ: Erlbaum. Thompson, B. (1992). Two and one-half decades of Rothman, K. J. (1978). A show of confidence (letter). New leadership in measurement and evaluation. Journal of England Journal of Medicine, 299, 1362±1363. Counseling and Development, 70, 434±438. Rothman, K. J. (1986a). Modern epidemiology. Boston: Thompson, W. D. (1987a). Exclusion and uncertainty Little, Brown and Company. (letter). American Journal of Public Health, 77, 879±880. Rothman, K. J. (1986b). Significance questing (editorial). Thompson, W. D. (1987b). Letter. American Journal of Annals of Internal Medicine, 105, 445±447. Public Health, 77, 238. Rothman, K. J., & Yankauer, A. (1986). Editors' note. Thompson, W. D. (1987c). On the comparison of effects. American Journal of Public Health, 76, 587±588. American Journal of Public Health, 77, 491±492. References 349

Thompson, W. D. (1987d). Statistical criteria in the logic studies. American Journal of Public Health, 76, interpretation of epidemiologic data (published erratum 556±558. appears in American Journal of Public Health, 1987(Apr), Walker, A. M. (1986b). Significance tests represent 77(4), 515). American Journal of Public Health, 77, consensus and standard practice (letter). American 191±194. Journal of Public Health, 76, 1033. Tukey, J. W. (1969). Analyzing data: Sanctification or Walter, S. D. (1991). Statistical significance and fragility detective work? American Psychologist, 24, 83±91. criteria for assessing a difference of two proportions. Tversky, A., & Kahneman, D. (1971). Belief in the law of Journal of Clinical Epidemiology, 44, 1373±1378. small numbers. Psychological Bulletin, 76, 105±110. Wonnacott, T. (1985). ªStatistically significant.º Canadian Walker, A. M. (1986a). Reporting the results of epidemio- Medical Association Journal, 133, 843. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.15 Meta-analytic Research Synthesis

SHARON H. KRAMER and ROBERT ROSENTHAL Harvard University, Cambridge, MA, USA

3.15.1 INTRODUCTION
  3.15.1.1 A Brief History
3.15.2 WHY DO A META-ANALYSIS?
  3.15.2.1 The Problem of Poor Cumulation
  3.15.2.2 Primary Purposes of Meta-analyses
  3.15.2.3 Pooling Pilots
3.15.3 CONDUCTING A META-ANALYSIS
  3.15.3.1 Formulating the Question
  3.15.3.2 Defining Criteria for Inclusion
  3.15.3.3 Searching the Literature
  3.15.3.4 Recording Study Characteristics and Identifying Moderators
  3.15.3.5 Coding
  3.15.3.6 Descriptive Data Displays
    3.15.3.6.1 Stem and leaf
    3.15.3.6.2 Box plot and summary table
3.15.4 QUANTITATIVELY ANALYZING THE STUDIES
  3.15.4.1 Comparing Two Studies
    3.15.4.1.1 Significance testing
    3.15.4.1.2 Effect size estimation
  3.15.4.2 Combining Two Studies
    3.15.4.2.1 Significance testing
    3.15.4.2.2 Effect size estimation
  3.15.4.3 Comparing Three or More Studies: Diffuse Test
    3.15.4.3.1 Significance testing
    3.15.4.3.2 Effect size estimation
    3.15.4.3.3 Comparing three or more studies: focused tests
    3.15.4.3.4 Significance levels
    3.15.4.3.5 Effect size estimation
  3.15.4.4 Combining Three or More Studies
    3.15.4.4.1 Significance testing
    3.15.4.4.2 Effect size estimation
  3.15.4.5 Weighting Studies
3.15.5 SPECIAL CASES
  3.15.5.1 Imprecisely Reported p's
  3.15.5.2 Repeated Measures Designs
  3.15.5.3 Random vs. Fixed Effects
3.15.6 CRITICISMS OF META-ANALYSIS
  3.15.6.1 Sampling Bias and the File Drawer Problem
  3.15.6.2 Combining Apples and Oranges: Problems of Heterogeneity
  3.15.6.3 Problems of Independence
    3.15.6.3.1 Nonindependence of subjects
    3.15.6.3.2 Nonindependence of experimenters
  3.15.6.4 Exaggerated Significance Levels
  3.15.6.5 Too Small Effects
  3.15.6.6 Abandonment of Primary Studies
3.15.7 REFERENCES

3.15.1 INTRODUCTION

3.15.1.1 A Brief History

A little over 20 years ago, educational researcher Gene V. Glass was interested in the question of whether psychotherapy works. In other words, does it help patients any more than would a placebo treatment? Furthermore, if in fact psychotherapy overall is effective, can it be determined whether one form is more effective than another? Glass was not the first researcher interested in this question. Hundreds of studies had previously been conducted in this area. Yet the results from these studies, and even from reviews which summarized the studies, had reached contradictory conclusions. What was unique about Glass' attempt to answer this question was that he was the first to conduct a quantitative synthesis of the research, a procedure which he called a “meta-analysis” (Glass, 1976; Smith, Glass, & Miller, 1980). In conducting the meta-analysis, Glass first located studies which asked this question of whether psychotherapy works and then he quantitatively characterized the outcomes and features of the studies and statistically described his findings. Glass ultimately found that psychotherapy shows positive results and that there is little difference between the different types of therapy.

While Glass is responsible for coining the term and for carrying out the first large-scale meta-analysis, meta-analytic enterprises had previously been undertaken. For example, in 1904, Karl Pearson, the inventor of the Pearson product moment correlation coefficient, had been investigating the degree to which inoculation against smallpox saved lives. He collected the following six correlation coefficients (r's) which measured the relationship of inoculation to survival: 0.58, 0.58, 0.60, 0.63, 0.66, and 0.77, and averaged them, finding a mean correlation of about 0.6. Thus, his “meta-analysis” or synthesis of these six studies found a rather large effect of inoculation on survival.

In the 1930s, Ronald Fisher was experimenting with combining significance levels of independent studies (Fisher, 1938) as were Frederick Mosteller and Robert Bush in the 1950s (Mosteller & Bush, 1954). Mosteller and Bush also showed the usefulness of combining effect sizes as well as significance levels. Even George Snedecor's 1946 statistics textbook included an example drawn from agriculture of combining correlation coefficients measuring the relationship between the initial weight of steers and their subsequent gain in weight. Finally, back to the field of psychology, in early work on experimenter expectancy effects in the 1960s, there is an example of combining three probability levels to give an overall test of significance of three experiments (Rosenthal, 1966).

Since Glass' use of the term, meta-analyses have become widely used for research in the behavioral and physical sciences and in applied situations, such as for policy making. By one estimate, nearly 2000 meta-analytic reviews were conducted in the social and health sciences between 1980 and 1994 (Bausell, Li, Gau, & Soeken, 1995). In addition to the growing numbers of meta-analyses that have been conducted, a number of different procedures for conducting meta-analyses have also been described. Indeed, whole books, not to mention numerous individual papers and chapters, have addressed the issue of conducting meta-analyses (e.g., Cooper, 1989; Cooper & Hedges, 1994; Glass, McGaw, & Smith, 1981; Hedges & Olkin, 1985; Hunter & Schmidt, 1990; Light & Pillemer, 1984; Rosenthal, 1991). This chapter will illustrate the basic quantitative procedures for conducting a meta-analysis. This chapter draws on material presented in Rosenthal (1984, 1991, 1993, 1995a, 1995b, 1998) and in Rosenthal and Rosnow (1991). For more detailed information about the procedures discussed here, the reader may wish to consult these sources. As the chapter will demonstrate, the level of quantitative skill and training required to employ basic meta-analytic procedures is so modest that any researchers capable of analyzing the results of their own research will be capable of learning the small number of calculations required to conduct a first-rate meta-analysis.

3.15.2 WHY DO A META-ANALYSIS?

3.15.2.1 The Problem of Poor Cumulation

A meta-analysis is literally an analysis of analyses or a procedure for quantitatively cumulating research results addressing specific research questions. Unlike a typical literature review, a meta-analysis addresses the problem of “poor cumulation” in the social sciences. It is quite common to find a literature review in which dozens of studies are discussed but which ends with a call for more research. When these studies are discussed within the review, only the statistical significance or nonsignificance of the results is presented, without any information about the size of the effects. Until somewhat recently, when there existed only a small number of studies in any given research area, it was easier to form conclusions about a body of research, but now, as hundreds of studies have been conducted in many fields in the social sciences, the problem of poor cumulation continues to grow.

Unlike the physical sciences in which newer work builds directly upon the older work, in the social sciences each succeeding volume of our scientific journals almost seems to be starting anew, posing the same research questions time and again. The common explanation for this problem is that there is more variability or less agreement in our field which prevents the kind of building seen in other fields. Yet, this commonly held belief has not been supported. A recent comparison of 13 research areas in particle physics with 13 areas in psychology did not actually find more conflicting results in our field (Hedges, 1987). For the researcher trying to develop theory, for the practitioner trying to decide on treatment, for the policy maker trying to change policy, and for the funding agency trying to allocate research money, meta-analysis holds great promise.

3.15.2.2 Primary Purposes of Meta-analyses

The first primary purpose of a meta-analysis is to summarize for a set of studies the overall relationship between the two variables that had been investigated in each study. It can be used, as with the Glass psychotherapy meta-analysis, for pulling together hundreds of studies, or it can also be used simply by an investigator asking whether a study and its replication produce similar or different results and what the net results of these two studies might be. By quantitatively summarizing the existing research, a meta-analysis is able to discover what has so far been learned and help to discover what has not yet been learned.

Another equally important purpose of meta-analytic work is to learn from comparisons between studies. By comparing the studies in a meta-analysis, the researcher hopes to determine factors that are associated with variations in the magnitudes of the relationships between two variables. These factors are known as moderator variables because they moderate or alter the magnitude of a relationship. For example, a meta-analysis might suggest that the effectiveness of a given type of therapy is moderated by the age of the patient, such that it is more effective with adolescents but less so with adults. By comparing studies, the meta-analysis uncovers new relationships or provides support for suspected relationships, thus paving the way for future research.

3.15.2.3 Pooling Pilots

While meta-analysis is potentially valuable in all fields, it can be particularly helpful for clinical research. Often in clinical research, the number of patients available is so small that all the researcher is able to do is to conduct pilot studies, none of which will ever reach a significance level of p < 0.05. However, the researcher is able to combine these individual studies meta-analytically. A recent example of this appeared in an issue of Science (Cohen, 1993) which described two pilot studies in which experimental monkeys were vaccinated using SIV (akin to HIV), while control monkeys were not. In the first pilot, six monkeys were available, three vaccinated and three controls. The results showed two of the three vaccinated monkeys and none of the three controls in better health (p = 0.20, one-tailed). In the second pilot, 11 animals were available and two of five of the experimentals and none of six of the controls wound up in better health (p = 0.18, one-tailed). Although neither of these tiny pilot studies produced a significant result, treated meta-analytically they show dramatic benefits of vaccination.

3.15.3 CONDUCTING A META-ANALYSIS

3.15.3.1 Formulating the Question

The first step in conducting a meta-analysis, as with any research enterprise, is a clear formulation of the question being examined. The only constraint on the question is that primary research on the topic must have already been conducted. The specificity and clarity of the question are especially important in a meta-analysis in that they will help the researcher define the criteria for inclusion (see below). The primary question the meta-analysis seeks to answer is: what is the relationship between any variable X and any variable Y? In order to answer this question, we need to have

(i) an estimate of the level of significance of the difference between the obtained effect size and the effect size expected under the null hypothesis (usually an effect size of 0). Estimates of levels of significance are expressed by p values associated with the given significance test, such as t, F, chi-square or Z;

(ii) an estimate of the magnitude of the relationship between X and Y (an effect size). An effect size commonly used by Glass and his colleagues is the number of standard deviation units that separate outcome scores of experimental and control group, expressed as d. According to Cohen (1977), d's around 0.2 are considered small, around 0.5 are considered medium, and around 0.8 are large. A concrete example of the meaning of an effect size which is drawn from educational research is that a d of 0.2 would raise a student's performance from the 50th percentile to the 58th percentile, a 0.5 would raise it to the 69th, and a 0.8 would raise it to the 79th percentile (Cohen, 1977). Another common effect size estimate is given by r. The relationship between r and d is $r = d / \sqrt{d^2 + 4}$.

The relationship between an effect size and a significance test is given by:

test of significance = size of effect × size of study    (1)

Table 1 provides several examples of this relationship.

As this basic equation illustrates, as a study increases its sample size, it will obtain more significant results (except in the unusual case where an effect size is truly zero, in which case a larger study will not produce a result any more significant than a smaller study). An effect size, however, would not be affected by the size of the sample used in the study.

3.15.3.2 Defining Criteria for Inclusion

Once the research question has been formulated, the next step in the meta-analysis is a careful consideration of what criteria must be met in order for a study to be included in the synthesis. For example, a study comparing individual to group therapy would need to define what constitutes group therapy. Would couples therapy meet these criteria; would therapy which combines group meetings with individual sessions meet these criteria? Deciding on the criteria should be theory or hypothesis driven. For example, if previous research had found that couples therapy is fundamentally different from group therapy, then studies with couples should be excluded from the meta-analysis.

Regardless of the specific question being examined, a basic criterion for inclusion in the meta-analysis is that the studies have reported, or have provided the necessary information for us to obtain, an estimate of an effect size and a significance level for each study. Studies which do not provide this information or a way for the reader to derive this information are not ordinarily included in the analysis.

3.15.3.3 Searching the Literature

There are four basic types of documents from which studies may be retrieved: (i) published books or book chapters; (ii) published journal articles, magazines, newsletters, and newspapers; (iii) bachelor's, master's, and doctoral theses; (iv) unpublished works such as technical reports, convention papers, or raw data. There are a number of computer databases available to help with the retrieval process, such as PsycLit or PsycInfo (see Dickersin, 1994; Reed & Baxter, 1994; Rosenthal, 1994; White, 1994 for specific details on helpful retrieval techniques). Because successful retrieval from these databases is dependent on identifying the right keyword or key subject, the searcher may occasionally miss some references. Therefore, retrieval should also rely on the “ancestry approach,” also known as “footnote chasing” (White, 1994), which involves searching reference lists of relevant articles and books. Past issues of relevant
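
The effect size conversions of Section 3.15.3.1 and the relationship in Equation (1) can be checked numerically. The following is a minimal Python sketch, not part of the original chapter: the function names are illustrative, the percentile reading of d assumes normally distributed scores with equal variances in both groups, and the t formula used is t = [r / sqrt(1 - r^2)] × sqrt(df), one standard instance of "size of effect × size of study".

```python
import math
from statistics import NormalDist


def d_to_percentile(d):
    """Percentile of the control distribution reached by the average treated
    subject, assuming normal scores with equal variances (an assumption)."""
    return 100 * NormalDist().cdf(d)


def d_to_r(d):
    """Convert d to r using r = d / sqrt(d^2 + 4), as given in the text."""
    return d / math.sqrt(d * d + 4)


def t_from_r(r, df):
    """Significance test = effect size x study size:
    t = [r / sqrt(1 - r^2)] * sqrt(df)."""
    return r / math.sqrt(1 - r * r) * math.sqrt(df)


# Cohen's small, medium, and large d values.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: percentile = {d_to_percentile(d):.0f}, r = {d_to_r(d):.2f}")

# The same effect size (r = 0.30) in studies of increasing size.
r = 0.30
for df in (10, 40, 160):
    print(f"r = {r}, df = {df}: t = {t_from_r(r, df):.2f}")
```

Under these assumptions the percentiles come out at 58, 69, and 79, matching the values attributed to Cohen (1977). The second loop holds r fixed at 0.30 and shows that quadrupling df doubles t, which is the sense in which a larger study yields a more significant result while the effect size itself is unchanged.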

Table 1 Examples of the relationship between tests of significance and effect size: chi-square, Z, t, and F.

Test of significance = Size of effect × Size of study

$$
\begin{aligned}
\chi^2(1) &= \phi^2 \times N \\
Z &= \phi \times \sqrt{N} \\
t &= \frac{r}{\sqrt{1 - r^2}} \times \sqrt{df} \\
t &= d \times \frac{\sqrt{df}}{2} \\
F &= \frac{r^2}{1 - r^2} \times df \\
t^{a} &= d \times \sqrt{df}
\end{aligned}
$$

a Correlated observations.

journals should also be hand searched. Finally, if primary researchers in the field can be contacted, they should be consulted for their knowledge of any unpublished work in the field by themselves or by their colleagues. This last method has been referred to as using the "invisible college" grapevine (Crane, 1972).

A comprehensive search for all the research on the given topic is necessary in order to avoid conducting a biased retrieval of only the major journals, which may selectively publish results characterized by lower p values. If the field we are searching is too vast for us to analyze every study, it is better to sample randomly from this complete selection than to analyze only the more readily retrievable results.

3.15.3.4 Recording Study Characteristics and Identifying Moderators

In addition to an estimate of effect size and a level of significance, there are always other features of the retrieved studies that will be of interest both for purely descriptive purposes and for analysis as potential moderating variables. For example, the meta-analyst should always record and report the following study characteristics: (i) descriptions of the participant population, such as number, age, sex, education, and volunteer status; (ii) descriptions of how the study was conducted, such as whether it was conducted in a laboratory or in the field, whether it was observational or randomized; and (iii) year and form of publication of the study. In addition to listing each study and its relevant characteristics, it is also good meta-analytic practice to provide the range and median of these features, such as the range and median of dates of the studies and ages of the participants.

All of the study characteristics recorded can also be employed as moderator variables, that is, variables correlated with the magnitude of the effect size obtained for the different studies. For example, a meta-analysis conducted by social psychologist Alice Eagly on sex differences in influenceability found that the time when a study was published was strongly related to the study's results. Studies published before 1970 were somewhat more likely to show greater influenceability among females, while studies published during the period of the women's movement in the 1970s were less likely to find differences between sexes (Eagly & Carli, 1981).

In addition to these general characteristics, each area of research will also have its own specific moderator variables which are of particular interest or meaning. For example, a meta-analysis in the field of the effects of teachers' expectations on pupils' IQ gains specifically looked at the possible moderator variable of how long teachers had known their pupils before the expectations were assigned (Raudenbush, 1994). This analysis found that the longer teachers had known their pupils before the experiment began, the smaller were the effects of experimentally-induced teacher expectations. One method of organizing the various relevant variables that has often proven helpful is to break down the variables into substantive and methodological variables. Substantive variables are those relating to the phenomenon under investigation while methodological variables are those that relate to the procedural aspect of the study (Lipsey, 1994). Some examples of substantive features might be type of treatment administered or characteristics of the subject population. Some methodological variables of interest might be variations in the research design or procedure or internal validity of the study (e.g., random assignment of participants to conditions or use of a control group).

An example of a methodological variable from the study of female influenceability mentioned above that was found to play a large role was the sex of the researcher (Eagly & Carli, 1981). In this meta-analysis, male researchers were more likely to find female influenceability in their studies than female researchers. An explanation for this finding may be that if a difference was found between the sexes, it was equally likely to be reported by male and female researchers. However, if no difference was found, a male researcher would tend to report the results less often than a female researcher, who would be more likely to see the absence of an effect as important in that it disputes the stereotype of females being more conforming and persuasible. After forming a preliminary list of moderators which have been identified, this list should be discussed with colleagues, advisors, and other workers in the field for suggestions of other moderators that should be coded.

3.15.3.5 Coding

Once relevant features have been identified, they need to be coded and entered into a database type of system. Some variables, such as subject sex or number of subjects in the study, can be easily coded (e.g., a 1 for female subjects or a 0 for males). Others, such as an assessment of the quality of the study, will require more sophisticated coding. For these types of codings, each study should be coded by several knowledgeable methodologists who are not necessarily specialists in the field being examined. The agreement or reliability of the judges can then be assessed (for details on how to compute reliability, see Rosenthal, 1991 or Rosenthal & Rosnow, 1991) and when agreement is satisfactory the feature is coded. Table 2 provides an example of a partial database of relevant variables which were coded from a recent meta-analysis examining individual vs. group problem-solving performance (Kramer & Rosenthal, 1997).

Table 2 An example of a partial coding database of four studies from a meta-analysis of individual vs. group problem-solving performance.

Author    Year  Type of publication  N    Subject occupation  Size of group  Type of task             Significance level  Effect size estimate
Barnlund  1959  Journal article      143  Students            5              Logic problem            Z = 6.3             r = 0.88
Campbell  1968  Journal article      48   Managers            4              Human relations problem  Z = −1.4            r = −0.43
Kanekar   1982  Journal article      36   Students            2              Anagrams                 Z = 6.9             r = 0.80
Knight    1921  Master's thesis      35   Students            6              Judgment task            Z = 5.2             r = 0.74

3.15.3.6 Descriptive Data Displays

Once all the studies have been collected and the relevant information extracted and recorded from each, the next step is to display what has been found. One of the most important parts of a meta-analysis is a graphic display of the effect sizes and the summary of their distribution and central tendency. We will describe two visual displays that may be used, but there are a great many more to choose from (e.g., Cooper, 1989; Glass et al., 1981; Greenhouse & Iyengar, 1994; Hedges & Olkin, 1985; Light & Pillemer, 1984; Light, Singer, & Willett, 1994; Rosenthal & Rosnow, 1991; Tukey, 1977).

3.15.3.6.1 Stem and leaf

Table 3 is a stem and leaf display (Tukey, 1977) from a recent meta-analysis of the effects of gender on judgments of sexual harassment (Blumenthal, 1997). Each of the 83 effect sizes is recorded with the first digit found in the column labeled "stem" and the second digit found in the column labeled "leaf." The top two entries of Table 3, therefore, are read as two r's of 0.65 and 0.43. The stem and leaf display provides a clear overall impression of all of the results; in this particular case, the display quickly and easily demonstrates that most of the effects of gender differences found in these studies tend to be small. It can also point out when there is an unusual distribution of effect sizes; for example, a bimodal distribution may suggest that a certain innovation works well for one group but not for another.

3.15.3.6.2 Box plot and summary table

The box plot or box graph, originally called a box-and-whisker plot (Tukey, 1977), provides a pictorial summary of five descriptive statistics from the distribution of effect sizes: the maximum, the minimum, the median, the 75th percentile, and the 25th percentile. Figure 1 displays a box plot of the data in Table 3. The top and bottom dots represent the maximum and minimum scores, the top and bottom of the rectangle represent the 75th and 25th percentile, and the line dividing the rectangle represents the median. The box plot is especially useful when there are data to display from several subsamples.

In addition to the graphic display, it is also instructive to provide a summary table of these measures of central tendency and variability. Several indices of central tendency should be reported: the unweighted mean effect size, the weighted mean effect size (see below for more details on weighting), and the median. The number of independent effect sizes on which these indices are based should be reported. The standard deviation, the maximum and minimum effect size, and the effect sizes found at the 75th percentile (Q3) and the 25th percentile (Q1) should also be given. The following may also be reported: the proportion of studies showing effect sizes in the predicted direction; the total number of subjects on which the weighted mean is based; and the median number of subjects per obtained effect size. Table 4 provides a summary of the data displayed in the box plot in Figure 1.

3.15.4 QUANTITATIVELY ANALYZING THE STUDIES

In addition to the descriptive techniques just discussed, there are also procedures for inferential analysis of the retrieved data. As Table 5 shows, there are two major ways to evaluate the results of research studies: (i) in terms of their statistical significance (e.g., p levels); and (ii) in terms of their effect sizes (e.g., r). When evaluating, there are two major analytic processes that can be applied to the set of studies: comparing and combining. Because there are some especially convenient computational procedures for the two-study situation, Table 5 separates the procedures applicable to the case of combining and comparing two studies from three or more studies. In cases of three or more studies being represented, we are able to compare studies on two levels: with diffuse tests and with focused tests. When diffuse tests are used to show whether significance levels or effect sizes differ, the researcher may learn that they differ but not how they differ. By using focused tests, or contrasts, we are able to expand on the knowledge of the diffuse tests by learning whether the studies differ in a theoretically predictable or meaningful way.

3.15.4.1 Comparing Two Studies

3.15.4.1.1 Significance testing

Although we are generally more interested in comparing the results of the effect sizes of the

two studies, sometimes this information is not available and all we are able to compare is p values. The following is the procedure for comparing the significance levels of two studies (Rosenthal & Rubin, 1979); all p levels are one-tailed. First, for each of the studies, we obtain as exact a p level as possible. For example, if we obtain t(150) = 2.32, our p = 0.011, not < 0.05. Extended tables of the t distribution are helpful here (e.g., Federighi, 1959; Rosenthal & Rosnow, 1991), as are calculators and software packages with built-in distributions of Z, t, F, and chi-square that will provide exact p values for the various statistical tests. Second, for each p, we find Z, the standard normal deviate corresponding to the p value (using a Z table or calculator or statistical software that converts from p to Z). Since only one-tailed p's were used, if both Z results are in the same direction, they will both have positive signs; if the results are in the opposite direction, one of the two Z's will be negatively signed. The two Z's are then compared by the following formula:

$$\frac{Z_1 - Z_2}{\sqrt{2}} = Z \tag{2}$$

Table 3 A stem and leaf display of 83 effect sizes of the effect of gender on judgments of sexual harassment.

Stem  Leaf
0.6   5
0.6
0.5
0.5
0.4
0.4   3,3
0.3   7,7,9,9
0.3   0,0,2,3
0.2   5,5,5,5,5,6,7
0.2   2,2,3,3,4,4
0.1   5,5,5,5,5,6,6,6,7,7,7,8,8,8,8,9,9,9
0.1   0,1,1,1,2,2,2,3,3,3,3,4,4,4
0.0   5,6,6,6,7,7,8,8,8,8,8,8,9,9,9,9,9
0.0   0,0,0,0,0,0,0,0,0,3

Figure 1 A box plot of the data displayed in Table 3. (Vertical axis: effect size r, from 0.00 to 0.70.)

Example 1

Studies A and B yield results in the same direction; one is significant at the 0.05 level, the other is not. The p level of Study A is 0.04, the p level of Study B is 0.30. The Z's corresponding to these p's are found in a table of the normal curve to be 1.75 and 0.52, respectively. Using Equation (2), we find that:

$$\frac{Z_1 - Z_2}{\sqrt{2}} = \frac{1.75 - 0.52}{\sqrt{2}} = 0.87$$

The p value associated with this Z is 0.19, one-tailed. Thus, when comparing these two studies, one of which reaches "significance" while the other does not, we find that the difference between them does not come close to the conventional levels of significance. The examples in the chapter are hypothetical in order to keep the computational examples small and manageable. For illustrations of various meta-analytic procedures with real-life examples, see the final chapter in Rosenthal (1991).

Example 2

Studies A and B yield results in different directions and both are significant. One p is 0.001, the other is 0.03. The Z's corresponding to these p's are 3.09 and −1.88 (note the opposite signs to indicate results in opposite directions). From Equation (2), we have:

$$\frac{(3.09) - (-1.88)}{\sqrt{2}} = 3.51$$

which has a corresponding p of 0.0002, which indicates that the two p values differ significantly.

3.15.4.1.2 Effect size estimation

When comparing studies, we are usually more interested in looking at the consistency or heterogeneity of the effect sizes than of the significance levels, since a difference in Z's might simply be due to the difference in the size of the studies but not truly reflective of whether the two studies differ in effect size.

Researchers do not routinely report an effect size estimate together with their test of significance. Yet, in most cases, by rearranging the formulas in Table 1 into the following formulas, the meta-analyst can compute the

necessary effect size from the information that has been reported. There are several ways of computing effect sizes (r) from significance test results (t, Z, chi-square, and F) as shown below.

$$r = \sqrt{\frac{t^2}{t^2 + df}}$$

$$r = \sqrt{\frac{\chi^2(1)}{N}}$$

$$r = \frac{Z}{\sqrt{N}}$$

$$r = \sqrt{\frac{F(1,-)}{F(1,-) + df_{\text{error}}}}$$

Table 4 A summary table of the measures of central tendency and variability of the distribution of effect sizes in Table 3.

Central tendency
  Unweighted mean                              0.17
  Weighted mean                                n.a.
  Median                                       0.15
  Proportion of studies greater than 0.00      0.89
  Median number of subjects per effect size    n.a.
  Total number of independent effect sizes     83
  Total number of subjects                     34,350

Variability
  Maximum                                      0.65
  Q3                                           0.23
  Median                                       0.15
  Q1                                           0.09
  Minimum                                      0.00
  Q3 − Q1                                      0.14
  Standard deviation                           0.13

n.a. = information not available

The effect size emphasized in this chapter will be r but analogous procedures are available for comparing other effect size indicators such as Hedges's g or d′ (Hedges, 1982; Hsu, 1980; Rosenthal, 1991; Rosenthal & Rubin, 1982a; Rosenthal, Rosnow, & Rubin, 1997).

As the population value of r gets further and further from zero, the distribution of r's sampled from that population becomes more and more skewed. To adjust for this skew, it is recommended that all r's first be transformed to Fisher $z_r$'s (Fisher, 1928), which are distributed nearly normally, before further computations are carried out. The relationship between r and $z_r$ is given by $z_r = \tfrac{1}{2}\log_e[(1 + r)/(1 - r)]$. There are handy tables available which convert r to $z_r$ and $z_r$ to r (e.g., Rosenthal & Rosnow, 1991) and most statistical software packages will easily perform this transformation as well. After converting obtained r's to $z_r$'s, the following formula may be used:

$$\frac{z_{r_1} - z_{r_2}}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}} = Z \tag{3}$$

which is distributed as Z (Snedecor & Cochran, 1989).

Example 3

Studies A and B yield results in the same direction with effect sizes of r = 0.70 (n = 25) and r = 0.25 (n = 100), respectively. The Fisher $z_r$'s corresponding to these r's are 0.87 and 0.26, respectively. From Equation (3), we have

$$Z = \frac{z_{r_1} - z_{r_2}}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}} = \frac{0.87 - 0.26}{\sqrt{\dfrac{1}{22} + \dfrac{1}{97}}} = 2.58$$

which has an associated p of 0.005, one-tailed. Thus, these two studies agree on a significant positive relationship between variables X and Y but disagree significantly in their estimates of the size of that relationship.

Table 5 An overview of 10 meta-analytic procedures.

                                          Significance testing    Effect size estimation
Comparing
  Two studies
  Three or more studies: diffuse tests
  Three or more studies: focused tests
Combining
  Two studies
  Three or more studies

3.15.4.2 Combining Two Studies

3.15.4.2.1 Significance testing

While comparisons of two studies allow us to ask whether our studies differ in some way, it is also of interest to combine the two studies. By combining the significance levels of two studies, we are able to obtain an overall estimate of the probability that the two p levels might have been obtained if there truly were no relationship between the variables X and Y. We will present here the simplest and most versatile procedure for combining probability levels, the method of adding Z's called the Stouffer method (Mosteller & Bush, 1954); other methods are summarized elsewhere (Rosenthal, 1978, 1991). Like the method for comparing p values, the first step is to obtain accurate p levels for each of the two studies and then to find the Z's corresponding to those p's. The following equation is then applied:

$$\frac{Z_1 + Z_2}{\sqrt{2}} = Z \tag{4}$$

Example 4

Studies A and B yield results in the same direction but neither is significant. One p value is 0.10, the other is 0.15, with corresponding Z's of 1.28 and 1.04. From Equation (4), we have

$$\frac{Z_1 + Z_2}{\sqrt{2}} = \frac{1.28 + 1.04}{\sqrt{2}} = 1.65$$

The p associated with this Z is 0.05. Thus, we have an example of two studies, neither of which yielded significant results when considered individually but which were significant when meta-analytically combined.

Example 5

The results of Studies A and B are in opposite directions and are both significant. One p is 0.00001, one-tailed, and the other is 0.03, one-tailed, but in the opposite direction. The Z's corresponding to these p's are 4.26 and −1.88 (note the opposite signs indicate results in opposite directions). Using Equation (4), we have

$$\frac{(4.26) + (-1.88)}{\sqrt{2}} = 1.68$$

with an associated p of 0.046. Thus, the combined p supports the finding of the more significant of the two results. However, since the two p's were so significantly different, we should be cautious in our interpretation and should search for the factors that may have led Studies A and B to have obtained such different results.

3.15.4.2.2 Effect size estimation

Once we have computed the associated Fisher $z_r$'s for each r, the combined effect size estimate is simply

$$\frac{z_{r_1} + z_{r_2}}{2} = \bar{z}_r \tag{5}$$

or the Fisher $z_r$ corresponding to our mean r. We can then look up the r associated with this mean $z_r$ (using an r to $z_r$ or $z_r$ to r table or statistical software).

Example 6

Studies A and B yield results in the same direction, one r = 0.95, the other r = 0.25. The Fisher $z_r$'s corresponding to these r's are 1.83 and 0.26, respectively. Applying Equation (5), we have

$$\frac{z_{r_1} + z_{r_2}}{2} = \frac{1.83 + 0.26}{2} = 1.045$$

which is our mean Fisher $z_r$. This value can then be converted back to an r of 0.78, which can be interpreted as showing a rather large combined mean effect (Cohen, 1977, 1988) from these two studies.

Example 7

Studies A and B yield effect sizes of r = 0.30 and 0.00, respectively. The Fisher $z_r$'s corresponding to these two effect sizes are 0.31 and 0.00. Again applying Equation (5) we have

$$\frac{0.31 + 0.00}{2} = 0.155$$

as our mean $z_r$, which is associated with a mean r of 0.15.

These last two examples also illustrate that the use of Fisher's $z_r$ gives heavier weight to r's that are further from zero in either direction but makes little difference when working with very small r's. If we had averaged the two r's from Example 6 without first transforming them to Fisher $z_r$'s, we would have found a mean r of (0.95 + 0.25)/2 = 0.60, substantially smaller than the 0.78 which we found using the Fisher $z_r$'s. However, if we recompute Example 7 using r's instead of Fisher $z_r$'s, we would obtain the same result: (0.30 + 0.00)/2 = 0.15.

3.15.4.3 Comparing Three or More Studies: Diffuse Test

The following procedures are generalizations of the procedures just described which can be used to compare more than two studies. As mentioned earlier, when dealing with three or more studies, we are able to conduct both diffuse comparisons and focused comparisons. We will begin with the procedures for diffuse tests.

3.15.4.3.1 Significance testing

As with two studies, the first step is to find the Z associated with each p level. The studies can then be compared using the following equation:

$$\sum (Z_j - \bar{Z})^2 = \chi^2 \tag{6}$$

which is distributed as chi-square with K − 1 df. In this equation, $Z_j$ is the Z for any one study, $\bar{Z}$ is the mean of all the Z's obtained, and K is the number of studies being compared. A significant chi-square tells us that the p's associated with our studies differ significantly among themselves.

Example 8

Studies A, B, C, and D yield one-tailed p values of 0.05, 0.001, 0.17, and 0.01, respectively. However, Study C shows results in the opposite direction from Studies A, B, and D. We find the Z's corresponding to these p's to be 1.64, 3.09, −0.95, and 2.33 (note the negative sign for Study C whose results were in the opposite direction). Using Equation (6) we find a chi-square value of

$$\sum (Z_j - \bar{Z})^2 = [(1.64) - (1.53)]^2 + [(3.09) - (1.53)]^2 + [(-0.95) - (1.53)]^2 + [(2.33) - (1.53)]^2 = 9.24$$

which for K − 1 = 3 df is significant at p = 0.03. The p values from these four studies, then, are significantly heterogeneous.

3.15.4.3.2 Effect size estimation

The statistical significance of the heterogeneity of the effect sizes of three or more studies is also obtained from a chi-square using the following equation

$$\sum (n_j - 3)(z_{r_j} - \bar{z}_r)^2 = \chi^2 \tag{7}$$

where $n_j$ is the number of sampling units on which each r is based, $z_{r_j}$ is the Fisher $z_r$ corresponding to each r, and $\bar{z}_r$ is the weighted mean $z_r$, that is

$$\bar{z}_r = \frac{\sum (n_j - 3)\, z_{r_j}}{\sum (n_j - 3)} \tag{8}$$

Example 9

Studies A, B, C, and D yield effect sizes of r = 0.70 (n = 30), r = 0.45 (n = 45), r = 0.10 (n = 20), and r = −0.15 (n = 25), respectively. The Fisher $z_r$'s corresponding to these r's are 0.87, 0.48, 0.10, and −0.15. First, using Equation (8), the weighted mean, $\bar{z}_r$, is

$$\frac{27(0.87) + 42(0.48) + 17(0.10) + 22(-0.15)}{27 + 42 + 17 + 22} = 0.39$$

Then, applying Equation (7), we find that

$$27(0.87 - 0.39)^2 + 42(0.48 - 0.39)^2 + 17(0.10 - 0.39)^2 + 22(-0.15 - 0.39)^2 = 14.4$$

which for K − 1 = 3 df is significant at p = 0.0024. These four effect sizes, then, are significantly heterogeneous.

3.15.4.3.3 Comparing three or more studies: focused tests

While the two examples above showed us that the set of four studies was significantly heterogeneous both in terms of significance levels and effect sizes, they merely told us that overall the four studies differed. It is generally more useful to have more focused information about how the four studies differed. For example, did the prescribed treatment differ when given to women vs. men? Or, for example, did the studies in which the subjects lived in nursing homes differ from those studies in which the subjects lived with their families? In order to answer questions of this kind, it is necessary to conduct focused tests, also known as contrasts.

3.15.4.3.4 Significance levels

As with the diffuse comparisons, we first find the Z's that correspond to the p level of each study. We then compute a Z from the following equation

$$\frac{\sum \lambda_j Z_j}{\sqrt{\sum \lambda_j^2}} = Z \tag{9}$$

In this equation $\lambda_j$ is the theoretically derived prediction or contrast weight for any one study, chosen such that the sum of the $\lambda_j$'s will be zero. $Z_j$ is the Z for any one study.

Example 10

Studies A, B, C, and D were conducted to examine the efficacy of a new type of behavioral therapy. They yielded the following one-tailed p values: 0.0001, 0.03, 0.15, and 0.10, all in the same direction. We calculate the Z's corresponding to these p's to be 3.72, 1.88, 1.04, and 1.28, respectively. We also know that the amount of therapy given in each of the studies differed such that Studies A, B, C, and D involved 8, 6, 4, and 2 h of therapy per month, respectively. A focused question that would be of interest to us in this example might be whether there was a significant linear relationship between the number of hours of therapy and the statistical significance of the result. The weights of a linear contrast involving four studies are 3, 1, −1, −3, as obtained from a table of orthogonal polynomials (e.g., Rosenthal & Rosnow, 1991). Therefore, from Equation (9) we have

$$\frac{\sum \lambda_j Z_j}{\sqrt{\sum \lambda_j^2}} = \frac{(3)3.72 + (1)1.88 + (-1)1.04 + (-3)1.28}{\sqrt{(3)^2 + (1)^2 + (-1)^2 + (-3)^2}} = 1.82$$

as our Z value, which is significant at p = 0.03. Thus, the four p values tend to grow more significant as more hours of therapy are received per month.

3.15.4.3.5 Effect size estimation

It is almost always quite valuable to the researcher to compute a focused comparison on a series of effect sizes. For example, given a set of effect sizes for studies of a new behavioral therapy, we might want to know whether these effects are increasing or decreasing linearly with the number of hours of therapy received per month. It is from the focused comparison of effects that we are able to test moderator variables and often form new theories.

As was the case for diffuse tests, we begin by computing the effect size r, its associated Fisher $z_r$, and n − 3, where n is the number of sampling units on which each r is based. The statistical significance of the focused test or contrast then is obtained from a Z computed as follows (Rosenthal & Rubin, 1982a):

$$\frac{\sum \lambda_j z_{r_j}}{\sqrt{\sum \dfrac{\lambda_j^2}{w_j}}} = Z \tag{10}$$

Once again, $\lambda_j$ is the theoretically derived prediction or contrast weight for any one study, chosen such that the sum of the $\lambda_j$'s will be zero. The $z_{r_j}$ is the Fisher $z_r$ for any one study and $w_j$ is the inverse of the variance of the effect size for each study. When using Fisher $z_r$ transformations of the effect size r, the variance is $1/(n_j - 3)$, so for the present example, $w_j = n_j - 3$.

Example 11

Studies A, B, C, and D yield effect sizes of 0.89, 0.76, 0.23, and 0.59, respectively, all with n = 12. The Fisher $z_r$'s corresponding to these r's are found to be 1.42, 1.00, 0.23, and 0.68, respectively. We also know that Studies A, B, C, and D involved 8, 6, 4, and 2 h of therapy per month, respectively. We are interested in whether there is a relationship between number of hours of therapy received and size of observed effect favoring the new therapy. As in Example 10, the appropriate $\lambda$'s are 3, 1, −1, and −3. Therefore, applying Equation (10) we find

$$\frac{\sum \lambda_j z_{r_j}}{\sqrt{\sum \dfrac{\lambda_j^2}{w_j}}} = \frac{(3)1.42 + (1)1.00 + (-1)0.23 + (-3)0.68}{\sqrt{\dfrac{(3)^2}{9} + \dfrac{(1)^2}{9} + \dfrac{(-1)^2}{9} + \dfrac{(-3)^2}{9}}} = 2.01$$

as our Z value, which is significant at p = 0.022. Thus, the four effect sizes tend to grow larger as the number of hours of therapy increases.

(i) Interpretation of moderator variables

Examples 10 and 11 seem to show that the number of hours of therapy is linked linearly to effectiveness of the therapy. Yet, in these hypothetical examples, subjects were not assigned at random to the four studies, each of which studied the effects of treatment employing a different number of hours per month of therapy. Therefore, our interpretation of this finding must be cautious. We cannot be sure that differences between the four studies are due to the different number of therapy hours and not to some other variable that is also correlated with this moderator. The finding should not be taken as evidence for a causal relationship, but rather as being suggestive of the possibility of a causal relationship, a possibility which can then be studied experimentally in future research (Hall, Tickle-Degnen, Rosenthal, & Mosteller, 1994).

3.15.4.4 Combining Three or More Studies

3.15.4.4.1 Significance testing

By combining the p levels of a set of studies, we are able to obtain an overall estimate of the probability that the set of p levels might have been obtained if the null hypothesis of no relationship between X and Y were true. The method presented here is the generalized version of the method presented earlier as applying to the results of two studies; other methods are described in detail elsewhere (e.g., Rosenthal, 1991).
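Before turning to the combining procedures, the comparison procedures described so far (Equations (2), (3), and (6) to (10)) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not software from the chapter; the function names (`fisher_z`, `compare_two_z`, `compare_two_r`, `diffuse_test_z`, `diffuse_test_r`, `contrast_z`, `contrast_r`, `one_tailed_p`) are assumptions introduced here for illustration. Because the functions work with unrounded Fisher $z_r$ values, results may differ slightly in the second decimal place from the hand calculations in the examples, which round each $z_r$ first. The diffuse tests return only the chi-square statistic; its p value on K − 1 df would have to come from a chi-square table or a statistics package.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher transformation z_r = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


def compare_two_z(z1, z2):
    """Equation (2): compare the significance levels of two studies."""
    return (z1 - z2) / math.sqrt(2)


def compare_two_r(r1, n1, r2, n2):
    """Equation (3): compare two effect sizes r via their Fisher z_r."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


def diffuse_test_z(zs):
    """Equation (6): chi-square (K - 1 df) for heterogeneity of Z's."""
    mean_z = sum(zs) / len(zs)
    return sum((z - mean_z) ** 2 for z in zs)


def diffuse_test_r(rs, ns):
    """Equations (7) and (8): chi-square (K - 1 df) for heterogeneity of r's."""
    zrs = [fisher_z(r) for r in rs]
    ws = [n - 3 for n in ns]
    mean_zr = sum(w * z for w, z in zip(ws, zrs)) / sum(ws)
    return sum(w * (z - mean_zr) ** 2 for w, z in zip(ws, zrs))


def _check_contrast_weights(lambdas):
    """Contrast weights must sum to zero."""
    if abs(sum(lambdas)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")


def contrast_z(zs, lambdas):
    """Equation (9): focused test (contrast) on significance levels."""
    _check_contrast_weights(lambdas)
    numerator = sum(l * z for l, z in zip(lambdas, zs))
    return numerator / math.sqrt(sum(l * l for l in lambdas))


def contrast_r(rs, ns, lambdas):
    """Equation (10): focused test on effect sizes, with w_j = n_j - 3."""
    _check_contrast_weights(lambdas)
    numerator = sum(l * fisher_z(r) for l, r in zip(lambdas, rs))
    denominator = math.sqrt(sum(l * l / (n - 3) for l, n in zip(lambdas, ns)))
    return numerator / denominator


if __name__ == "__main__":
    linear = [3, 1, -1, -3]  # linear contrast weights for four studies
    z10 = contrast_z([3.72, 1.88, 1.04, 1.28], linear)            # Example 10
    z11 = contrast_r([0.89, 0.76, 0.23, 0.59], [12] * 4, linear)  # Example 11
    print(round(z10, 2), round(one_tailed_p(z10), 3))
    print(round(z11, 2), round(one_tailed_p(z11), 3))
```

Keeping the contrast-weight check in one helper makes the requirement that the $\lambda_j$'s sum to zero explicit for both focused tests.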

As before, we first obtain the Z's corresponding to our p levels (with Z's disagreeing in direction given negative signs). The sum of the Z's is then divided by the square root of the number of studies (K), yielding a new statistic which is distributed as Z, as follows:

$$\frac{\sum Z_j}{\sqrt{K}} = Z \tag{11}$$

Example 12

Studies A, B, C, and D yield one-tailed p values of 0.05, 0.001, 0.17, and 0.01, respectively. However, Study C shows results in the opposite direction from Studies A, B, and D. We find the Z's corresponding to these p's to be 1.64, 3.09, −0.95, and 2.33 (note the negative sign for Study C whose results were in the opposite direction). Applying Equation (11) we then find:

$$\frac{\sum Z_j}{\sqrt{K}} = \frac{(1.64) + (3.09) + (-0.95) + (2.33)}{\sqrt{4}} = 3.06$$

as our new Z value, which is associated with a p of 0.001, one-tailed. (We would normally use a one-tailed p if we had correctly predicted the bulk of the findings but would use a two-tailed p value if we had not.) Thus, we find a combined p level for the four studies that would be highly unlikely if the null hypothesis were true. Yet, we should be very cautious about drawing any simple overall conclusion from this combined significance level because of the heterogeneity of these four p levels. Example 8 employed the same p values and found that they were significantly different at p = 0.03. This heterogeneity, however, could be due to heterogeneity of effect sizes, of sample sizes, or both. We should always look carefully at the effect sizes and sample sizes of the studies before drawing any conclusions based on the significance levels above.

3.15.4.4.2 Effect size estimation

When combining the results of three or more studies, we are at least as interested in the combined effect size estimate as in the combined probability discussed above. In computing this value, we once again convert our r's to the associated Fisher $z_r$'s and simply find the mean Fisher $z_r$ as follows:

$$\frac{\sum z_r}{K} = \bar{z}_r \tag{12}$$

where K refers to the number of studies being combined. We then find the r associated with this mean $z_r$.

Example 13

Studies A, B, and C yield effect sizes of r = 0.70, 0.45, and 0.10, respectively, all in the same direction. The Fisher $z_r$ values associated with these r's are 0.87, 0.48, and 0.10, respectively. Applying Equation (12), we find

$$\frac{\sum z_r}{K} = \frac{0.87 + 0.48 + 0.10}{3} = 0.48$$

as our mean Fisher $z_r$, which corresponds to an r of 0.45, a rather strong combined effect of the three studies. In this example, all the effect sizes were in the same direction; however, if we had a case of three studies where effect sizes seemed to be substantially heterogeneous, we would be cautious in our interpretation of the combined effect size, just as we were in our treatment of the combined probabilities in Example 12.

3.15.4.5 Weighting Studies

When combining two or more studies, the meta-analyst may choose to weight each study by size of study (df), estimated quality of study, or any other desired weights (Mosteller & Bush, 1954; Rosenthal, 1978, 1980, 1984). The general procedure for weighting Z's is to (i) first assign to each study a given weight (symbolized by w and assigned before inspection of the data); (ii) multiply each Z by the desired weight, w; (iii) add the weighted Z's; and (iv) divide this sum by the square root of the sum of the squared weights, as follows:

$$\frac{\sum w_j Z_j}{\sqrt{\sum w_j^2}} = Z_w \tag{13}$$

Example 14

We are interested in finding the combined Z from the four studies discussed in Example 12 and now we want to weight them each by their degrees of freedom, which are 24, 75, 4, and 50, respectively. From Example 12, we found Z's for the four studies to be 1.64, 3.09, −0.95, and 2.33. Applying Equation (13) we now find:

$$\frac{\sum w_j Z_j}{\sqrt{\sum w_j^2}} = \frac{(24)1.64 + (75)3.09 + (4)(-0.95) + (50)2.33}{\sqrt{(24)^2 + (75)^2 + (4)^2 + (50)^2}} = 4.11$$

as our combined Z, which is associated with a p of 0.00002. We should keep in mind that when weighting Z's by df, the size of the study is actually playing a doubly large role since it has already been entered into the determination of each Z initially.

We can similarly weight effect sizes by df or any other desired weight using the following equation:

$$\bar{z}_{r\,\text{weighted}} = \frac{\sum w_j z_{r_j}}{\sum w_j} \tag{14}$$

Example 15

From Example 13, we found that Studies A, B, and C yield Fisher $z_r$ values of 0.87, 0.48, and 0.10, respectively. The df's for each study are 56, 120, and 24, respectively. Applying Equation (14), we find:

$$\bar{z}_{r\,\text{weighted}} = \frac{\sum w_j z_{r_j}}{\sum w_j} = \frac{(56)0.87 + (120)0.48 + (24)0.10}{56 + 120 + 24} = 0.54$$

as our mean weighted Fisher $z_r$, which corresponds to an r of 0.49, slightly larger than the unweighted mean effect size r of 0.45 which we found above.

3.15.5 SPECIAL CASES

3.15.5.1 Imprecisely Reported p's

The meta-analytic procedures discussed above for combining and comparing significance levels require the meta-analyst to find the Z associated with the given p from each study. Unfortunately, it is quite common for p's to be reported imprecisely as p < 0.05 or < 0.01, so that p might be 0.001 or 0.0001 or even 0.00001. When this is the case, the researcher needs to go back to the original test statistic employed, for example, t, F, Z, or chi-square, which many journals require their contributors to report. The df for the t and for the denominator of the F test tell us about the size of the study. The df for chi-square is analogous to the df for the numerator of the F test and so tells us about the number of conditions, not the number of sampling units. Fortunately, the 1983 edition of the Publication Manual of the American Psychological Association has added the requirement that the total N be reported together with the df for conditions for chi-square tests. Using the reported test statistic and the information about the size of the study, we should be able to obtain the exact p value for the given test statistic and the Z associated with that p.

Occasionally, we may have information about an effect size estimator (such as r, d, or g) but not a test statistic. Using the equations in section 3.15.4.1.2 we can often derive the associated test statistic, its exact p value, and the associated Z.

If a result is simply reported as "nonsignificant," and if no further information is available, we have little choice but to treat the result as a p of 0.50, which corresponds to a Z of 0.00. Although this procedure is ordinarily conservative, leading to effect size estimates that are too small, the alternative of not using those studies is likely to lead to effect size estimates that are too large and almost surely to p values that are too significant. The recommended approach is to conduct the analysis both ways, with the studies and without the studies, and to see how much of a difference it will make to the overall view of the data.

3.15.5.2 Repeated Measures Designs

A recent meta-analysis (Kramer & Rosenthal, 1997) introduced the concept of the repeated measures meta-analysis. In this situation, the question of interest concerns the difference between two or more effect sizes for each study. The meta-analysis compared the improvement in problem-solving performance when individuals were formed into noninteracting groups vs. their performance when they actually interacted as a group. A similar situation where one may be interested in a repeated measures question might be where patients' improvement is first measured one month after receiving a new treatment and this improvement is then compared to their improvement a year after receiving the treatment. One useful way to approach this type of meta-analysis is to treat it as if it were two meta-analyses whose results are then compared. In the first example above, a meta-analysis was conducted of the effect sizes associated with improvement from the individual to the noninteracting group condition. The results from this meta-analysis were then compared with the meta-analytic results of the size of the effect associated with improvement due to interaction.

3.15.5.3 Random vs. Fixed Effects

A common area of confusion is the distinction between random and fixed effects models of analysis. The fixed effects model is in the same spirit as a regression model or a fixed analysis of variance in which the treatment levels are considered fixed and the random variation is from the sample of individuals within treatment levels. With a fixed effects analysis, we are able to generalize to participants of the type found in the studies we included in the meta-analysis but we are not able to generalize to other studies.

This is because the source of sampling error in the fixed model is the variation among the people in the studies. Although our generalizability is somewhat restricted by using a fixed effects model, we gain in statistical power because our df are drawn from the total N of all the studies.

With a random effects model, our generalizability increases to all studies of the type from which our sample of studies was drawn. However, by employing a random effects model, we lose in statistical power compared to the fixed effects analysis. A simple one-sample t-test on the mean effect size (Mosteller & Bush, 1954) is often an appropriate random effects analysis. It will typically be more conservative than the results of the Stouffer method, but would allow for more generalizability. It is highly recommended to conduct an analysis with both a random and a fixed effects model and to compare the two methods. (For more detailed discussion of these issues, see Hedges, 1994; Raudenbush, 1994; Rosenthal, 1995a; Shadish & Haddock, 1994.)

3.15.6 CRITICISMS OF META-ANALYSIS

3.15.6.1 Sampling Bias and the File Drawer Problem

One of the most frequent criticisms leveled against meta-analyses is that they are inherently biased because there is a greater likelihood that published studies will have significant results and thus might not be representative of the population of studies which have been conducted. This criticism, while well taken, could equally well be applied to narrative reviews of the literature. However, with a meta-analysis, there are now certain computational procedures that can be employed to address this problem.

One visual way of examining the issue of publication bias is through the use of a funnel plot (Light & Pillemer, 1984), a scatterplot of sample size vs. estimated effect sizes. Another method is to compute a "file drawer analysis" (Rosenthal, 1991; Rosenthal & Rubin, 1988) as follows.

To find the number (X) of new, filed, or unretrieved studies averaging null results required to bring the new overall p to any desired level of significance (e.g., p = 0.05, Z = 1.645), the following equation may be used:

$$X = \frac{(\sum Z)^2}{2.706} - k \tag{15}$$

where k is the number of studies combined and ΣZ is the sum of the Z's obtained for the k studies.

Example 16

Fifty-seven experiments have been conducted examining the effect of a new treatment for alcoholism. The ΣZ of these 57 studies was 98.6. How many new, filed, or unretrieved studies (X) would be required to bring this Z down to a barely significant level (Z = 1.645)? From Equation (15) we have

$$X = \frac{(98.6)^2}{2.706} - 57 = 3535$$

Thus, there would need to be over 3000 studies averaging null results tucked away in file drawers before one could conclude that the overall results of the 57 experiments were due to sampling bias.

3.15.6.2 Combining Apples and Oranges: Problems of Heterogeneity

Meta-analysis has been criticized for combining apples and oranges, in other words, for combining studies with different operationalizations of independent and dependent variables and different sampling units. This is in fact true, and a good meta-analysis will examine these differing factors as potential moderating variables, as was discussed above.

Critics of meta-analysis have also pointed out the dangers of combining studies of different quality and treating them all as equal. This is an issue which can be dealt with by weighting studies according to their quality using the same procedures illustrated in Examples 14 and 15, where studies were weighted by study size. Consider the following example.

Example 17

Studies A, B, and C yield effect sizes of Fisher z_r = 0.70, 0.23, and 0.37. A group of judges rated these three studies for their internal validity and found mean ratings of 1.5, 3.4, and 4.0. Using these mean ratings as weights, the weighted combined effect size can be calculated using Equation (14):

$$\frac{\sum w_j z_{r_j}}{\sum w_j} = \frac{(1.5)(0.70) + (3.4)(0.23) + (4.0)(0.37)}{1.5 + 3.4 + 4.0} = 0.37$$

We can conclude that weighting by quality of research in this case led to a somewhat different result than not weighting (the unweighted mean z_r is 0.43), in that it reduced the influence of the perhaps somewhat elevated effect size of the poorer quality study.

Coding of studies for their quality involves having raters check each study for the presence of certain aspects, such as random assignment, experimenter blind to hypothesis, presence of demand characteristics, etc., and then adding the number of desirable features present for each study. Rating of studies involves having raters make a more global, overall assessment of the methodological quality of a study. When establishing weightings of quality, the coding or rating should be done, as in hypothetical Example 17, by a group of disinterested judges, rather than by the researchers conducting the meta-analysis. This will safeguard against the understandable tendency to think of our own studies, of those of our students and our friends, and of those who successfully replicate our work as good studies while thinking of those studies conducted by our enemies and by those who failed to replicate our work as bad studies. Reliability of coding or rating should also be reported.

3.15.6.3 Problems of Independence

3.15.6.3.1 Nonindependence of subjects

It would not be unusual for the meta-analyst to discover multiple studies in which the same subjects have participated or that within one study subjects have contributed to more than one dependent variable. For example, a study may present the assessments of a group of subjects on a number of different measures of sensitivity to nonverbal cues. In this situation, the results generated by each dependent variable (each measure of nonverbal sensitivity) cannot be considered independent of each other since the same subjects contributed toward each one. Although it would be an error to treat multiple results from one study as though they were independent for significance testing, there is nothing wrong in doing so for the purposes of effect size estimation. By doing so, the researcher weights each study in proportion to the number of different effect sizes it generates, a procedure which meta-analysts might not choose but one which would not be wrong. Our recommendation is to use any of the procedures mentioned in order to produce only one effect size estimate and significance level for each study for the overall analysis. One option for combining the various significance levels and effect sizes would simply be to take the mean or the median Z or z_r of all the dependent variables. (The mean Z could be computed by averaging the Z's or by first averaging the z_r's and then getting the Z associated with the mean z_r; the mean z_r could also be computed by averaging the p levels and then computing the effect size that corresponds to that mean p level. These different procedures can yield different results, none of which is intrinsically more correct, but one should be chosen beforehand and used consistently throughout the meta-analysis.)

However, unless the dependent variables are perfectly correlated, using the mean or the median estimates is likely to be somewhat conservative. If the meta-analyst has available the degrees of freedom and typical intercorrelation among the dependent variables, another approach, described in more detail elsewhere (Rosenthal & Rubin, 1986), can provide a more accurate combination of the effect sizes which can then be entered into the meta-analysis. This approach also allows the meta-analyst to compute contrasts among the effect sizes for the multiple dependent variables.

3.15.6.3.2 Nonindependence of experimenters

Given that researchers within the same laboratory or who once worked within the same laboratory group may have similar interests, it quite frequently occurs that several of the studies in the meta-analysis have been conducted by the same experimenter or experimental team. This may be problematic in that the results of researcher A may be correlated with those of researcher B despite the fact that they used different participants. It would be sound practice to analyze the studies by laboratory as well as by study. For example, a recent meta-analysis of the predictive power of short samples of nonverbal behavior (Ambady & Rosenthal, 1992) compared studies authored by either of the two meta-analysts to studies conducted by other researchers. The comparison showed all the effect sizes to be quite homogeneous, with a mean effect size of studies by the two researchers of r = 0.38 compared to a mean r of 0.39 by the others.

3.15.6.4 Exaggerated Significance Levels

The "criticism" that meta-analyses yield more significant results is to a great extent true. As we saw with Equation (1) (Significance Test = Size of Effect × Size of Study), as more subjects are added, either to a single study or to a meta-analysis of many studies, the results will become more significant. In fact, this is one of the benefits of conducting meta-analyses in that it allows for the cumulation of knowledge that was discussed above, so that we no longer dismiss studies which have not found p's < 0.05 but rather cumulate them so that the body of work may show its overall significance level and effect size.

3.15.6.5 Too Small Effects

In contrast to the argument above, another criticism of meta-analysis is that the results of
some meta-analyses really only show "small effects" because the obtained r's and subsequent r²'s are small. While this may be true, it is also true that small effects can have great importance. A classic example of this is the Physician's Aspirin Study (see Rosenthal, 1995b). In 1988, a study of the effects of aspirin on reducing heart attacks was prematurely terminated. It had become clear to the physicians involved that it would be unethical to continue to give half of the subjects a placebo, since aspirin clearly prevented heart attacks. What was the r² of this study? The r was 0.034, with an r² of 0.0011.

Table 6 presents the results of the aspirin study as a Binomial Effect Size Display (BESD), a display which allows us to see the practical importance of the effect (Rosenthal & Rubin, 1982b). The correlation is shown as the difference in outcome rates between the experimental and the control group. As the BESD shows, aspirin led to a 4% decrease in heart attacks, a small but rather important effect.

Table 6  A binomial effect size display (BESD) of the relationship between aspirin usage and heart attacks.

           Heart attack   No heart attack   Total
Aspirin        48.3            51.7          100
Placebo        51.7            48.3          100
Total         100.0           100.0          200

3.15.6.6 Abandonment of Primary Studies

Finally, critics have claimed that the growing use of meta-analysis has destroyed the motivation (or even the need) to conduct primary research studies. Certainly not: first of all, the primary studies are needed in order to do a meta-analysis. Second, the research suggested by meta-analyses can only be addressed by newly designed primary studies employing randomization. A great benefit of meta-analysis is that it will prevent unnecessary duplication, replication, and the wasting of scientifically and socially valuable resources. Conducting more meta-analyses may very well reduce the number of primary studies conducted, but it should increase the potential value of those that are conducted.

3.15.7 REFERENCES

Ambady, N., & Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111, 256-274.
Bausell, R. B., Li, Y., Gau, M., & Soeken, K. L. (1995). The growth of meta-analytic literature from 1980 to 1993. Evaluation and the Health Professions, 18, 238-251.
Blumenthal, J. A. (1998). The reasonable woman standard: A meta-analytic review of gender differences in perceptions of sexual harassment. Law and Human Behavior, 22, 33-57.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York: Academic Press.
Cohen, Jon (1993). A new goal: Preventing disease, not infection. Science, 262, 1820-1821.
Cooper, H. M. (1989). Integrating research: A guide for literature reviews (2nd ed.). Newbury Park, CA: Sage.
Cooper, H. M., & Hedges, L. V. (Eds.) (1994). The handbook of research synthesis. New York: Sage.
Crane, D. (1972). Invisible colleges: Diffusion of knowledge in scientific communities. Chicago: University of Chicago Press.
Dickersin, K. (1994). Research registers. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 71-83). New York: Sage.
Eagly, A. H., & Carli, L. L. (1981). Sex of researchers and sex-typed communications as determinants of sex differences in influenceability: A meta-analysis of social influence studies. Psychological Bulletin, 90, 1-20.
Federighi, E. T. (1959). Extended tables of the percentage points of Student's t-distribution. Journal of the American Statistical Association, 54, 683-688.
Fisher, R. A. (1928). Statistical methods for research workers (4th ed.). London: Oliver & Boyd.
Fisher, R. A. (1938). Statistical methods for research workers (7th ed.). London: Oliver & Boyd.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3-8.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Greenhouse, J. B., & Iyengar, S. (1994). Sensitivity analysis and diagnostics. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 383-398). New York: Sage.
Hall, J. A., Tickle-Degnen, L., Rosenthal, R., & Mosteller, F. (1994). Hypotheses and problems in research synthesis. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 17-28). New York: Sage.
Hedges, L. V. (1982). Fitting categorical models to effect sizes from a series of experiments. Journal of Educational Statistics, 7, 119-137.
Hedges, L. V. (1987). How hard is hard science, how soft is soft science? American Psychologist, 42, 443-455.
Hedges, L. V. (1994). Fixed effects models. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 285-299). New York: Sage.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hsu, L. M. (1980). Tests of differences in p levels as tests of differences in effect sizes. Psychological Bulletin, 88, 705-708.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Kramer, S. H., & Rosenthal, R. Why are two heads better than one: A meta-analytic comparison of individual, statistical, and interacting group problem solving. Manuscript submitted for publication.
Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.
Light, R. J., Singer, J. D., & Willett, J. B. (1994). The visual presentation and interpretation of meta-analyses. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 439-453). New York: Sage.
Lipsey, M. W. (1994). Identifying potentially interesting variables and analysis opportunities. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 111-124). New York: Sage.
Mosteller, F. M., & Bush, R. R. (1954). Selected quantitative techniques. In G. Lindzey (Ed.), Handbook of social psychology: Vol. 1. Theory and method (pp. 289-334). Cambridge, MA: Addison-Wesley.
Raudenbush, S. W. (1994). Random effects models. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 301-322). New York: Sage.
Reed, J. G., & Baxter, P. M. (1994). Using reference databases. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 57-70). New York: Sage.
Rosenthal, R. (1966). Experimenter effects in behavioral research. New York: Appleton-Century-Crofts.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin, 85, 185-193.
Rosenthal, R. (Ed.) (1980). New directions for methodology of social and behavioral science: Quantitative assessment of research domains (No. 5). San Francisco: Jossey-Bass.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Rosenthal, R. (1993). Cumulating evidence. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 519-559). Hillsdale, NJ: Erlbaum.
Rosenthal, M. C. (1994). The fugitive literature. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 85-95). New York: Sage.
Rosenthal, R. (1995a). Writing meta-analytic reviews. Psychological Bulletin, 118(2), 183-192.
Rosenthal, R. (1995b). Progress in clinical psychology: Is there any? Clinical Psychology: Science and Practice, 2, 133-150.
Rosenthal, R. (1998). Meta-analysis: Concepts, corollaries and controversies. In J. G. Adair, D. Bellanger, & K. L. Dion (Eds.), Advances in psychological science. Vol. 1: Social, personal, and cultural aspects (pp. 371-384). Hove, UK: Psychology Press.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). New York: McGraw-Hill.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. Contrasts and effect sizes in behavioral research: A correlational approach. Unpublished volume.
Rosenthal, R., & Rubin, D. B. (1979). Comparing significance levels of independent studies. Psychological Bulletin, 86, 1165-1168.
Rosenthal, R., & Rubin, D. B. (1982a). Comparing effect sizes of independent studies. Psychological Bulletin, 92, 500-504.
Rosenthal, R., & Rubin, D. B. (1982b). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.
Rosenthal, R., & Rubin, D. B. (1986). Meta-analytic procedures for combining studies with multiple effect sizes. Psychological Bulletin, 99(3), 400-406.
Rosenthal, R., & Rubin, D. B. (1988). Comment: Assumptions and procedures in the file drawer problem. Statistical Science, 3, 120-125.
Shadish, W. R., & Haddock, C. K. (1994). Combining estimates of effect size. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 261-281). New York: Sage.
Smith, M. L., Glass, G. V., & Miller, T. I. (1980). The benefits of psychotherapy. Baltimore, MD: Johns Hopkins University Press.
Snedecor, G. W. (1946). Statistical methods applied to experiments in agriculture (4th ed.). Ames, IA: Iowa State University Press.
Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods (6th ed.). Ames, IA: Iowa State University Press.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
White, H. D. (1994). Scientific communication and literature retrieval. In H. M. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 41-55). New York: Sage.
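
For readers who wish to check the arithmetic of Equations (12) to (15), the computations in Examples 14 to 17 can be sketched in a few lines of Python. This is an illustrative sketch only and is not part of the procedures as originally published: the function names are hypothetical, the one-tailed p is taken from the standard normal distribution via the standard library's math.erfc, and the back-transformation from Fisher z_r to r uses the hyperbolic tangent.

```python
import math


def weighted_combined_z(zs, weights):
    """Equation (13): sum(w_j * Z_j) / sqrt(sum(w_j ** 2))."""
    numerator = sum(w * z for w, z in zip(weights, zs))
    denominator = math.sqrt(sum(w ** 2 for w in weights))
    return numerator / denominator


def one_tailed_p(z):
    """Upper-tail area of the standard normal distribution beyond z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def mean_fisher_zr(zrs, weights=None):
    """Equation (12) if weights is None, otherwise Equation (14)."""
    if weights is None:
        weights = [1.0] * len(zrs)
    return sum(w * zr for w, zr in zip(weights, zrs)) / sum(weights)


def zr_to_r(zr):
    """Back-transform a Fisher z_r to r (tanh is the inverse of atanh)."""
    return math.tanh(zr)


def file_drawer_x(sum_z, k):
    """Equation (15): X = (sum of Z)^2 / 2.706 - k, with 2.706 = 1.645^2."""
    return sum_z ** 2 / 2.706 - k


if __name__ == "__main__":
    # Example 14: four Z's weighted by their df.
    z_w = weighted_combined_z([1.64, 3.09, -0.95, 2.33], [24, 75, 4, 50])
    print(round(z_w, 2))                          # 4.11
    print(one_tailed_p(z_w))                      # roughly 0.00002

    # Example 15: Fisher z_r values weighted by df.
    zr_w = mean_fisher_zr([0.87, 0.48, 0.10], [56, 120, 24])
    print(round(zr_w, 2))                         # 0.54
    # Back-transform the rounded 0.54, as is done in the text.
    print(round(zr_to_r(round(zr_w, 2)), 2))      # 0.49

    # Example 16: file drawer analysis for 57 studies with sum of Z = 98.6.
    print(int(file_drawer_x(98.6, 57)))           # 3535

    # Example 17: Fisher z_r values weighted by quality ratings.
    print(round(mean_fisher_zr([0.70, 0.23, 0.37], [1.5, 3.4, 4.0]), 2))  # 0.37
```

Keeping the unweighted and weighted means in one function (the weights default to 1) makes it easy to run both versions side by side, which is in the spirit of the recommendation to compare analyses conducted in more than one way.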