Comprehensive Clinical Psychology. Volume 3: Research and Methods

Editor: Nina R. Schooler, Hillside Hospital, Glen Oaks, NY, USA

Editors-in-Chief: Alan S. Bellack, The University of Maryland at Baltimore, MD, USA; Michel Hersen, Pacific University, Forest Grove, OR, USA

2001 AN IMPRINT OF ELSEVIER SCIENCE AMSTERDAM—LONDON—NEW YORK—OXFORD—PARIS—SHANNON—TOKYO

Elsevier Science Ltd., The Boulevard, Langford Lane, Kidlington, Oxford, OX5 1GB, UK

All rights reserved. No part of this publication may be reproduced, stored in any retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic tape, mechanical photocopying, recording or otherwise, without permission in writing from the publishers.

First edition 1998 Paperback edition 2001

Library of Congress Cataloging in Publication Data Comprehensive clinical psychology / editors-in-chief Alan S. Bellack, Michel Hersen. -1st ed. p. cm. Includes indexes. Contents: v. 1. Foundations / volume editor, C. Eugene Walker — v. 2. Professional issues / volume editor, Arthur N. Wiens — v. 3. Research and Methods / volume editor, Nina R. Schooler — v. 4. Assessment / volume editor, Cecil R. Reynolds — v. 5. Children & adolescents / volume editor, Thomas Ollendick — v. 6. Adults / volume editor, Paul Salkovskis — v. 7. Clinical geropsychology / volume editor, Barry Edelstein — v. 8. Health psychology / volume editor, Derek W. Johnston and Marie Johnston — v. 9. Applications in diverse Populations / volume editor, Nirbhay N. Singh — v. 10. Sociocultural and individual differences / volume editor, Cynthia D. Belar — v. 11. Indexes. 1. Clinical psychology I. Bellack, Alan S. II. Hersen, Michel. [DNLM: 1. Psychology, Clinical. WM 105 C737 1998] RC467.C597 1998 616.89--dc21 DNLM/DLC for Library of Congress 97-50185 CIP

British Library Cataloguing in Publication Data Comprehensive clinical psychology I. Clinical psychology II. Bellack, Alan S. (Alan Scott), 1944- II Hersen, Michel 616.8'9

ISBN 0-08-042707-3 (set : alk. paper) ISBN 0-08-043146-1 (Volume 7) ISBN 0-08-044069-X (Volume 7 paperback)

Typeset by Bibliocraft, Dundee, UK. Printed and bound in The Netherlands by Giethoorn Media Group.

3.01 Observational Methods

FRANK J. FLOYD, DONALD H. BAUCOM, JACOB J. GODFREY, and CARLETON PALMER University of North Carolina, Chapel Hill, NC, USA

3.01.1 INTRODUCTION
3.01.2 HISTORICAL INFLUENCES ON BEHAVIORAL OBSERVATION
3.01.3 THE PROS AND CONS OF BEHAVIORAL OBSERVATION
3.01.4 OBTAINING OBSERVATIONS OF BEHAVIOR
  3.01.4.1 Selection of a Setting for Observing Behavior
  3.01.4.2 Selection of Live vs. Recorded Observation
  3.01.4.3 Selection of a "Task" or Stimulus Conditions for Observing Behavior
  3.01.4.4 Selection of a Time Period for Observing Behavior
3.01.5 ADOPTING OR DEVELOPING A BEHAVIORAL CODING SYSTEM
  3.01.5.1 Adopting an Existing Coding System
  3.01.5.2 Developing a New Coding System
    3.01.5.2.1 Cataloging relevant behaviors
    3.01.5.2.2 Selecting a unit of observation
    3.01.5.2.3 Creating code categories
    3.01.5.2.4 Content validation of codes
    3.01.5.2.5 Developing a codebook
3.01.6 TRAINING CODERS
3.01.7 RELIABILITY
  3.01.7.1 Enhancing Reliability
  3.01.7.2 Evaluating Reliability
3.01.8 ANALYZING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION
  3.01.8.1 Data Reduction
    3.01.8.1.1 Measurement of base rates of behavior
    3.01.8.1.2 Measurement of sequential patterns of behavior
  3.01.8.2 Computer Technology for Recording and Coding Behavioral Observations
3.01.9 INTERPRETING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION
3.01.10 CONCLUDING COMMENTS
3.01.11 REFERENCES

3.01.1 INTRODUCTION

Behavioral observation is a commonplace practice in our daily lives. As social creatures and "informal scientists," we rely upon observations of behavior to understand current social experiences and predict future social events. In fact, direct observation of behavior is one of the most important strategies we use to process our social world. Thus, it is not surprising that the field of psychology also is drawn to behavioral observation as a research method for understanding human behavior. The current chapter will focus upon behavioral observation as a formal research tool. In this context, behavioral observation involves the systematic observation of specific domains of behavior such that the resulting descriptions of behavior are replicable. In order to accomplish this task, the ongoing stream of behavior must be coded or broken down into recordable units, and the criteria for the assignment of labels or for making evaluations must be objectified. These practices of specifying units of behavior and objectifying coding criteria are the key steps in translating informal behavioral observations into formal, scientific observations. As will be seen below, the challenge of employing behavioral observation in research settings involves the myriad of decisions that an investigator must make in this translation process from informal to formal observation.

3.01.2 HISTORICAL INFLUENCES ON BEHAVIORAL OBSERVATION

The development of behavioral observation methodology is attributable to two major sources, the science of human behavior and the clinical practice of behaviorally oriented interventions. The science of human behavior is often traced to Watson and his colleagues (e.g., Watson & Raynor, 1924), who developed sophisticated methods to observe the behavior of children. Several important features distinguish this approach from early research that also used the observation of human actions as data. Most notably, unlike earlier trait-based researchers who measured behavior to make inferences about internal causes (e.g., Binet, Galton), the behavior itself is the focus of study. The approach also emphasizes direct observation in naturalistic settings as opposed to contrived responses that are elicited under artificial, controlled conditions. The further development of this approach was greatly stimulated by Skinner's (e.g., Skinner, 1938) theories that emphasized a focus on overt observable behavior, and by research manuals (e.g., Sidman, 1960) that further developed the rationale and methods for conducting and interpreting research on directly observed behaviors (Johnston & Pennypacker, 1993).

The second historical influence, behaviorally oriented clinical practice, also emphasizes direct observation under naturalistic circumstances. This approach defines pathology in terms of maladaptive behavior and then focuses on how environmental factors control maladaptive responding. Thus, behaviorally oriented clinicians conduct functional analyses to determine the environmental events that elicit and maintain maladaptive as opposed to adaptive behaviors, and they focus on observable behavior change as the criterion for treatment success. The need for precision both in measuring the behavior of interest and in identifying relevant environmental events has produced a body of scholarship on behavioral assessment as a clinical tool (e.g., Haynes, 1978; Hersen & Bellack, 1998), of which systematic behavioral observation is generally considered the hallmark strategy.

Ironically, Haynes (1998) notes that despite the theoretical importance of direct observation of behavior as a central feature of the behavioral approach, most research and clinical practice from a behaviorally oriented perspective relies on indirect measures such as self-report questionnaires. This increased reliance upon indirect assessment stems in part from the recognition that demonstrating improvement in subjective well-being is important, in addition to showing changes in overt behavior (e.g., Jacobson, 1985). Also, there has been an increasing emphasis on cognitions and other internal, nonobservable experiences (e.g., fear, dysphoria) as relevant focuses for behavioral intervention. However, Haynes suspects that most researchers and clinicians fail to conduct direct observations of relevant behaviors because they believe that the difficulty associated with conducting naturalistic observations outweighs the expected benefits in terms of incremental validity of the assessment.

Accordingly, recent advances in observational methodology and technology tend to come from fields of study and practice in which the veracity of self-reports is particularly suspect. For example, researchers and clinicians working with people who have mental retardation have produced important contributions in theory and methodology (e.g., Sackett, 1979b) as well as observational technology (Tapp & Walden, 1993). Similarly, research on infants and young children continues to emphasize direct observation over other types of measures (e.g., parent or teacher report).

Another more recent source for advances in behavioral observation is the growth of the marital and family systems perspective in clinical research and practice. This perspective emphasizes interpersonal exchanges as both focuses of interest in their own right and as contexts for individual functioning. The theoretical justification for this emphasis stems from family clinicians (e.g., Haley, 1971; Minuchin, 1974) who argue that couples, parent-child dyads, siblings, and larger family units are interacting subsystems that may cause and maintain pathological responding for individuals. In behavioral terms, marital and family interactions elicit and maintain maladaptive responding (e.g., Patterson, 1982). Because individuals are often biased reporters who have a limited perspective on the operation of the family system, observation of family interactions by outside observers is necessary for understanding these family processes. Thus, research and clinical work on marriage (e.g., Gottman, 1979), parenting (Patterson, 1982), and larger family units (e.g., Barton & Alexander, 1981) paved the way for many advances in observational technology and statistical approaches to analyzing observational data (e.g., Bakeman & Gottman, 1997).

3.01.3 THE PROS AND CONS OF BEHAVIORAL OBSERVATION

Behavioral observation is a fundamental form of measurement in clinical research and practice. The function of observation is the conversion of an ongoing, complex array of behavioral events into a complete and accurate set of data that can influence scientific or clinical decisions (Hawkins, 1979; Johnston & Pennypacker, 1993). As a measurement tool, behavioral observation is systematic; it is guided by a predetermined set of categories and criteria, and it is conducted by reliable observers (Bakeman & Gottman, 1997). In many instances, behavioral observation is the most objective form of measurement of relevant behaviors in relevant contexts because it is the most direct approach. Indeed, observations provide the yardstick against which other, more indirect measures, such as self-reports or rating scales, are evaluated.

In addition to providing direct, objective measurements, behavioral observation has several other strengths as an assessment tool. Hartmann and Wood (1990) note that (i) observation is flexible in providing various forms of data, such as counts of individual behaviors, or records of sequences of events; (ii) the measurements are relatively simple and noninferential, so that they are readily obtained by nonprofessional observers; and (iii) observations have a wide range of applicability across populations, behaviors, and settings. For clinical purposes, behavioral observation produces specificity in identifying problem behaviors that are targets for intervention, and in identifying causal and maintaining variables. Thus, it supports an idiographic, functional-analytic approach in which assessment leads to specific treatment goals. Furthermore, when observation is conducted in relevant settings, the data are criterion-referenced, so that treatment effects are likely to generalize to the natural environment.

Despite these strengths, behavioral observation also has several limitations. First, the "objectivity" of behavioral observation is far from absolute. Even when relatively simple forms of behavior are observed, the observational system imposes considerable structure on how behaviors are segmented and labeled, which substantially affects the nature of the data that are obtained. Second, behavioral observation is expensive and labor-intensive as compared to self-reports. Third, observation cannot access inner experiences that are unobservable. Finally, observations provide only a limited snapshot of behavior under a specific set of circumstances, which often is not helpful for generalizing to other situations. Ironically, this latter limitation reveals how sensitivity to the effects of context on behavior is both a strength and a limitation of this approach. Although behavioral observation is the method of choice for examining functional relationships that elicit and maintain actions in a particular context, such observations may have limited utility for predicting responses in different contexts or circumstances.

In designing a research study, an investigator has a variety of strategies available for gathering data, including self- and other-report measures, behavioral observation, and physiological indices, and must decide which and how many of these strategies are appropriate. Whereas behavioral observation may be more direct than self-report measures in many cases, observation is only a reasonable candidate if the investigator wishes to assess how an individual or group actually behaves in a given context. Understanding the different types of information that are obtained with each measurement strategy is critical when interpreting findings.

For example, in prevention studies to assist couples getting married, investigators often teach communication skills. Then, to assess whether the couples have learned these skills, they might employ behavioral observations of the couples' interactions, as well as obtaining self-report measures from the couples about their communication. In reviewing the findings across such investigations, an interesting pattern of results seems to have evolved. In most instances, investigators have been able to demonstrate that, based on behavioral observation, couples change the ways that they communicate with each other when asked to do so in a laboratory setting. However, on self-report measures, these same couples often report that their communication has not changed (Van Widenfelt, Baucom, & Gordon, 1997). How can we interpret these seemingly discrepant findings? Some investigators would argue that the behavioral observation findings are in some sense more meaningful because they reflect how the couple actually behaves. The self-report findings are more suspect because self-reports are subject to memory distortion, might reflect the couples' overall feelings about their marriages, and are impacted by response sets or test-taking attitudes. However, the results could be interpreted in other ways as well. Perhaps the couples' behavior in the laboratory does not reflect how they actually behave at home, and this discrepancy is demonstrated by the self-report data. Conversely, the behavioral observation in the laboratory might reflect ongoing behavior at home, but this behavior does not impact the couples' experience of their overall communication. That is, perhaps the intervention did not target, or the behavioral coding system did not capture, the critical elements of communication that impact the couples' subjective experience of how they communicate with each other.

When behavior is observed, the investigator must make a series of decisions that can significantly impact the results and interpretation of the findings. First, the investigator must decide what behavior to observe. This will include a consideration of (i) the setting in which the behavior will be observed, (ii) the length of time for observing the behavior on a given occasion, (iii) the number of occasions on which behavior will be observed, (iv) who will observe the behavior, and (v) the actual behavior that will be the focus of the investigation. After or while the data are gathered, they are coded according to some scheme or coding system. Thus, in evaluating a child's social behavior, the investigator must decide whether nonverbal facial cues will be coded, what verbal categories will be included in the coding system, whether the type of interaction such as play or classroom activities will be coded, and so forth. Once the behavior is coded, then the investigator must decide how to analyze the data. Often due to limited amounts of data or for conceptual reasons, specific codes are collapsed into larger codes; for example, many specific codes may be summarized as positive or negative behaviors. Finally, the investigator must decide how to analyze the data obtained from the observational coding scheme. Is the investigator interested in the frequency of various behaviors that were observed, or is the investigator focused upon the pattern of interaction and the interdependencies among various behaviors? These different questions of interest to the investigator will result in different data analytic strategies that will provide different information about the observed behavior.

3.01.4 OBTAINING OBSERVATIONS OF BEHAVIOR

The first challenge for an investigator who wishes to employ behavioral observation is deciding upon what behavior to observe, which actually involves a series of decisions. In almost all instances, the investigator is interested in drawing conclusions about a class of behaviors. However, the investigator can observe only a sample of that behavior while wishing to draw broader conclusions. For example, a family researcher might be interested in family interaction patterns and decide to observe a family while they interact around a dinner table; however, the investigator is interested in much more than dinner-time behavior. Similarly, a marital investigator might be interested in couples' communication fairly broadly, but observes their interaction for only 10 minutes in a laboratory setting. Or someone studying children's peer relationships might observe playground behavior at school but not in the neighborhood. In all of these instances, the investigator must be concerned with whether the sample of behavior is a representative or reasonable sample of the broader domain to which the investigator wishes to generalize.

3.01.4.1 Selection of a Setting for Observing Behavior

One major decision that the investigator must make is the setting for observing the behavior. A major distinction in this regard is whether the behavior is to be observed in a controlled or laboratory setting or in a more natural setting. Both strategies have their advantages and disadvantages. Laboratory settings have the asset of being more controlled, such that the behavior of various participants can be observed under more standard conditions. For example, a variety of parents and children can be brought into a laboratory setting and observed in the same setting with the same task. This standardized environment can be of help when interpreting the findings because it can help to exclude some alternative explanations of the findings based around differences among settings. For example, observing families in their home settings might be greatly impacted by whether there are interruptions from the outside, whether the home is excessively hot or cold, and so forth.

However, typically the investigator is interested in much more than laboratory behavior and wishes to draw conclusions about behavior in other settings in the natural environment. Not only are natural environments typically less controlled, but often the effort and expense involved in observing behavior in its natural environment is prohibitive. Therefore, in deciding which setting to employ in observing behavior, the investigator must address the question of the extent to which behavior observed in one setting is generalizable to other settings. Understandably, there is no general answer to this question, and it must be evaluated for the particular settings and participants of interest.

As an example of how this issue of generalizability of behavior across settings has been explored, Gottman (1979) evaluated couples' conversations both in a laboratory setting and at home. His findings indicated that although there is a tendency for couples to be more negative with each other at home than in a laboratory setting, along with more reciprocation of negative emotion at home, the couples generally demonstrated similar interaction patterns across settings. Even so, this finding applies only to the types of interaction that Gottman assessed with his particular sample, employing a given coding system. This issue of generalizability across settings from control/research settings to natural settings applies equally to generalizability within one of these domains, as is discussed later. For example, dinner-time behavior among family members in their own home might or might not reflect family interaction later in the evening when homework or bedtime activities are the focus of the family's discussion. Thus, even within a natural environment, behavior in one aspect of that setting might not reflect behavior in other aspects of the natural family environment. Consequently, the investigator should give a great deal of thought to the setting in which the behavior is observed in order to increase the likelihood that the resulting behavioral observations access the behavior of interest.

3.01.4.2 Selection of Live vs. Recorded Observation

The setting is altered by the experimenter when coders or recording devices are introduced into the environment. The decision to have coders present for the actual observation session, or instead to record the session for later coding, raises two major concerns. The first concern is the reactivity of live observation and recording equipment. Although research on this topic has a long history (e.g., Haynes & Horn, 1982; Kazdin, 1982), the findings are mixed and do not specifically address the relative effects on reactivity of having an observer present as opposed to using video or audio recording equipment. Much research shows that the presence of a live observer alters the behavior of subjects (Kazdin, 1982), and this reactivity may be greatest in intervention studies when the demands to display desired behaviors are relatively clear (e.g., Harris & Lahey, 1986). However, research on the reactivity of recording equipment is less certain. For example, studies using repeated sampling with recording equipment fail to detect habituation effects (e.g., Christensen & Hazzard, 1983; Pett, Wampold, Vaughn-Cole, & East, 1992), which suggests that the equipment does not evoke an initial orienting response. Further, studies that compare different recording methods show that relatively obtrusive as opposed to unobtrusive procedures produce few differences in the quality of most behaviors observed, although positivity may be increased somewhat (e.g., Carpenter & Merkel, 1988; Jacob, Tennenbaum, Seilhamer, Bargiel, & Sharon, 1994). Of course, it is possible that the mere presence of any type of recording equipment (or knowledge that it is present although unseen) may cause sustained changes in behavior similar to the effects of self-monitoring or participant observation (e.g., Jarrett & Nelson, 1984). Nevertheless, this set of studies suggests that using recording equipment with no observer present may be a relatively less reactive approach than live observation of behavior.

Another concern about live as opposed to recorded behavior is the accuracy of coded data. In general, we assume that video and audio recorded data help to improve coder accuracy because they provide the capacity to play back events repeatedly that are ambiguous or happen quickly. However, recorded data also may interfere with a coder's ability to attend selectively to salient behavior, particularly in a setting in which there is considerable background activity and noise. For example, Fagot and Hagen (1988) found that coders evaluating children in a preschool setting were less reliable and tended to miss relevant events when they coded from videotape as opposed to live observations. In part, the superiority of recorded observations depends on the ability to obtain excellent recordings. In many circumstances, audio recordings are particularly problematic because behavior is ambiguous without visual cues. When audio recordings are transcribed, some investigators also include live observers who make notes about nonverbal events that are added to the transcript.

3.01.4.3 Selection of a "Task" or Stimulus Conditions for Observing Behavior

Not only must investigators decide upon the physical setting for observing behavior, but the task or stimulus conditions within which the behavior is to be observed also must be decided. On many occasions, investigators ask the participants to engage in a specific task; on other occasions, the investigator merely decides to observe the participants' behavior in a given setting at a given time. If a particular task or interaction is structured by the investigator, then the effects of this particular task on the interaction must be considered. This is of less importance if the investigator is interested only in this particular type of interaction. For example, if an investigator is interested only in how a family would make weekend plans as a full family if asked to do so, then asking the full family to have such a discussion is relatively straightforward. However, the investigator might be interested in some more general question having to do with how the family makes decisions. If this is the issue of interest, then the investigator must carefully consider, and hopefully assess, the impact of this particular task involving planning a weekend. Deciding how to spend the weekend might or might not generalize to how the family makes decisions in other domains, such as how to divide household chores. Indeed, asking the family to sit and have a discussion resulting in weekend plans might not mirror how decisions are made in the family at all. Perhaps the parents make such decisions, or perhaps these decisions occur over the course of a number of informal interactions with different persons present at different times. More generally, when investigators structure particular types of interactions or ask the participants to engage in specific tasks, they must carefully consider whether the task or stimulus conditions that they have created mirror the ways in which the participants typically behave. Either the participants might behave differently when the task is different, or they might not typically engage in the task or situation that is constructed. The degree of concern raised by these issues is a function of the degree to which the investigator wishes to describe how the participants typically behave in their day-to-day lives.

How the task is selected also might impact the behavior. For example, Christensen and Heavey (1990) have described different interaction patterns among husbands and wives. This includes a well-known interaction pattern that they label "demand-withdraw," in which one partner presses for the discussion of a topic, and the other partner attempts to withdraw from the interaction. A number of investigations indicate that females are more likely to assume the "demand" role, and husbands are more likely to assume the "withdraw" role in marital interactions. However, Christensen and Heavey found that husbands were more likely to engage in the demand role during problem-solving interactions when the husbands selected the topic of conversation, compared to interactions in which the wife selected the topic to discuss. Thus, factors that influence an individual's interest in a task or motivation to participate in the task might significantly influence the resulting behavior that is observed.

3.01.4.4 Selection of a Time Period for Observing Behavior

In making decisions about the representativeness of the behavior that is observed, the investigator must also be concerned with the degree to which the observed behavior is generalizable across time. Classical test theory indicates that longer "tests" are more reliable in the sense that, keeping all other factors constant, they generally include a more representative sample of behavior and thus are more stable across time. In terms of behavioral observation, this raises two questions. First, how long should behavior be observed on a given occasion; second, on how many different occasions should behavior be observed? Whereas the answers to these questions should be based upon empirical findings, often pragmatic and logistic issues influence investigators' decisions along these lines. For example, sending observers into persons' homes or into classrooms can be troublesome and intrusive; sending observers to Africa to follow the social interactions among baboons can prove to be a logistical nightmare. Consequently, the difficulty, intrusiveness, and expense of behavioral observation often are a limiting factor in deciding how long to observe behavior. Similarly, the amount of time required to employ certain coding systems limits the length of behavioral observation. For example, some coding systems might require 20 hours to code one hour of recorded behavioral interaction. Therefore if the investigator intends to employ such a coding system, he or she might limit the amount of observed behavior to short time periods.

In deciding how long to observe behavior during a given observation period, several factors come into play. First and most generally, a long enough time period is needed such that the findings are relatively stable and replicable on other occasions. In part this is related to the frequency or base rate with which the behaviors of interest occur. If the behavior of interest is a relatively infrequent behavior, then longer observation periods are needed to obtain a stable assessment of its occurrence. However, if a behavior occurs with a high degree of frequency, then shorter observation periods can suffice. Second, the length of the observation period is influenced by the "complexity" of the phenomenon under consideration. For example, the investigator might be interested in whether there are different stages or phases in a couple's conversation as they attempt to reach resolution to a problem; in this instance, it would be essential to observe the entire problem-solving interaction. Or the investigator might be interested in whether there are different stages or phases in how a child responds to a new setting with the mother present. Attachment theory has explored this question and has provided indications of how securely and insecurely attached children initially respond in such settings, how they venture forth into a room after an initial exposure, and how their interaction with their mothers changes across time in this setting (Ainsworth, Blehar, Waters, & Wall, 1978). Therefore, if the investigator hypothesizes or wishes to explore whether there are different stages or phases in the interaction, this necessitates following the interaction for a sufficient time period to allow for an examination of the different phases of interest.

Second, the investigator must decide on how many occasions to observe the behavior. A given observation period might be significantly influenced by the occurrence of a major event, or the interaction of participants being observed might proceed in a given manner based upon what happens early in the interaction. For example, if a child is taunted on the playground early during a recess period, the child might withdraw from the group, which will significantly impact the child's behavior for the duration of the observation period on the playground. If the child is not taunted the next day, then his or her interaction pattern might proceed quite differently. Consequently, the number of occasions on which to observe interaction will differ according to the variability in the behavior being observed. If the behavior of interest occurs in a relatively consistent manner across different occasions, then fewer occasions are necessary for obtaining a stable sample of behavior.

Whereas there are far too few empirical investigations that have been conducted to determine the number of observational sessions needed to obtain stable results, some such investigations do exist. For example, Wieder and Weiss (1980) explored how many observational sessions were needed in order to obtain a stable assessment of couples' interaction when employing the Marital Interaction Coding System (MICS; Weiss, 1986). Based on generalizability theory, they concluded that observing couples on a single occasion could provide meaningful information about the couple, but observations across two separate evenings provide more stable interaction patterns when coded by the MICS. Similarly, Haynes, Follingstad, and Sullivan (1979) found that across three evenings, there was high stability on only 5 of 10 selected coding categories of communication between spouses. Interestingly, in spite of these findings, no marital therapy outcome investigations have observed couples' interactions across two or more consecutive evenings.

As can be seen based on the above discussion, there are a number of questions that the investigator must address in deciding what behaviors to observe. The decisions that are made will certainly impact the investigator's findings. Yet before these findings are obtained, there are many more decisions for the investigator to make that will influence the results.

3.01.5 ADOPTING OR DEVELOPING A BEHAVIORAL CODING SYSTEM

Foremost among the additional decisions to be made is the choice of coding system to employ to classify the data that have been observed. In fact, in order to address the questions raised above regarding what behavior to assess, the investigator needs to know ahead of time what coding system he or she will employ. At times the behavior is coded live during the interaction, so the coding system must be known. Even if the observed behavior is to be coded later based on video recordings of the behavior, it is important to decide ahead of time what coding system will be employed. For example, some coding systems might be appropriate only in certain settings or with certain tasks. Similarly, some coding systems might break behavior into a number of fine-grained categories that occur on a somewhat infrequent basis, thus necessitating longer observational periods. Therefore, deciding what coding system to employ is a significant factor in developing a thoughtful study based on behavioral observation.

An initial consideration in selecting a coding system is whether to adopt an existing coding system or to develop a new one. Each approach has some obvious benefits and limitations, as well as other subtle impacts that may not be apparent to investigators until they are well into the task of coding the behavioral observations.

3.01.5.1 Adopting an Existing Coding System

In many cases, it may be possible to adopt a coding system that has been used previously in the same research area, or one that can be imported from another area where similar constructs were assessed.

analysis may be appropriate for evaluating
Adoption of an discrete events that require little inference; existing system has the great advantage of however, larger units of behavior might be saving the time and energy required to develop a needed to capture more complex phenomena. reliable, valid, and practical coding scheme. It Foster, Inderbitzen, and Nangle (1993) discuss a also links research across laboratories, samples, similar point regarding the selection of a coding and locations, and thus provides for ready system to evaluate the effectiveness of social synthesis of research findings from various skills training with children. They note that a studies. frequent problem with interpreting the results of The selection of a coding system is guided by treatment outcome studies is that whereas the both theoretical and practical considerations. treatment teaches children specific social skills, The primary theoretical issue is whether the cod- such as offering to share a toy, the observational ing system assesses behaviors that address the assessment evaluates only molar, or global constructs of interest to the investigator. All codes, such as ªpositive interactions.º From coding systems organize data into some set of data such as these, it is impossible to know categories or units for coding, and these cate- whether the behaviors that were trained actually gories are based on issues of importance to the were displayed during the assessment. person who developed the coding system; how- Alternatively, it also is important to question ever, they might or might not coincide with whether a complex phenomenon is accurately another investigator's concerns or theoretical evaluated by merely summarizing elemental model. Before beginning a search for a coding codes. 
For example, Jacob (1975) illustrates system, it is essential that the investigator first how power in interpersonal interactions may reviewtheoryandresearchtoclarifythenatureof not be indicated by who speaks more frequently the behavioral phenomena under investigation. or wins more disagreements, but rather by the Behavioral phenomena related to a construct ability to prevail on the central, important under one situation may take on different conflicts. Evaluations such as these may require characteristics in a new situation, thus making making judgments about relatively large units of an existing system inappropriate. For example, behavior. Ammerman, Van Hasselt, and Hersen (1991) Choosing among systems with different units coded problem-solving interactions between of observation also has practical implications. children with visual impairments and their Microanalytic coding systems that parse beha- parents using the Marital Interaction Coding viors into minute elements may be overly System, III (MICS III; Weiss, 1986), a well- complex and labor-intensive for investigators validated system for assessing problem-solving who merely want to assess global levels of interactions within marital dyads. The study characteristics, such as positiveness, compe- detected no differences between groups of tence, anxiety, or irritability. In such cases, it families with and without children with dis- may be more efficient to use a system that rates abilities, possibly because the coding system was dimensions such as the intensity or quality of inappropriate for the family context. Whereas behavior exhibited over an extended period of the MICS III evaluates warm and hostile time. 
On the other hand, small, elemental units exchanges that indeed were similar across the of observation and analysis are useful for groups, it does not assess factors such as detecting situational influences on behavior behavior management, instruction, or other and sequential patterns among minute events; socialization practices that are important as- larger, more integrative units are useful for pects of parent±child exchanges that are understanding cross-situational consistency responsive to children's disabilities (Floyd & (Cairns & Green, 1979). Thus, findings based Costigan, 1997). Thus, a behavioral coding on larger units of observation may be more system may be sensitive to relevant variables generalizable than microanalytic coding. Some only for certain types of people, relationships, or investigators appear to assume that behavioral circumstances. observation is synonymous with microanalytic In addition to the substantive content of the coding; such assumptions can serve as a major system, various coding systems differ in the impediment to the more widespread use of ways that they ªchunkº or segment behavior observational measures in research settings with into coding units. This ªchunkingº of behavior limited resources. We encourage investigators has both theoretical and practical implications to explore macroanalytical coding procedures for investigators. From a theoretical perspec- as a practical and, in some cases, more tive, the nature of the phenomena being assessed informative alternative to microanalytic coding. should influence how the stream of ongoing Every coding system incorporates an array of behavior is segmented (Floyd, 1989). More assumptions, biases, and procedural preferences specifically, relatively small, elemental units of that the originator used to guide coding Adopting or Developing a Behavioral Coding System 9 decisions. 
These preferences are particularly relevant in decisions about how to handle ambiguous incidents. Many decision rules are not made explicit in published reports and coding manuals, so that it is difficult for investigators who did not participate in its development to apply an existing system accurately and in a way that is consistent with other research. Whenever possible, consultation with the originator is invaluable. Some originators of popular measures conduct periodic workshops to train new users (e.g., SASB, Humphrey & Benjamin, 1989; SPAFF, Gottman, 1994). Most developers cannot be expected to provide ongoing consultation to support their system, but should be willing to share examples of coded data and advice about common problems.

3.01.5.2 Developing a New Coding System

New ideas focused on new constructs and employing new assumptions are the lifeblood of progress in the social sciences. Thus, there will always be a need to develop new coding schemes. Even when well-known constructs are studied, if observational procedures become highly standardized within a research area, the phenomenon of interest may become totally dependent on the existing measure. This situation can lead to replicated, well-established findings that are largely an artifact of a particular measurement procedure. The need to disentangle method variance from the phenomenon of interest is strong justification for the development of new coding systems.

Detailed instructions about developing coding systems are given in Bakeman & Gottman (1997) regarding interpersonal interaction, and by O'Neill, Horner, Albin, Storey, and Sprague (1990) regarding functional analysis of problem behaviors. The key steps in developing any coding system are summarized below.

3.01.5.2.1 Cataloging relevant behaviors

A useful initial step is to develop an exhaustive list of all relevant behaviors to be coded. In some cases, this may be accomplished by conducting initial observations and recording all relevant behaviors that occur. Animal researchers frequently use this procedure to develop an ethogram, which is a comprehensive list of all behaviors that are characteristic of a species. Several ethograms are published to describe the behavior repertoire of different animal species. However, because it is usually impossible to sample all possible behaviors for a species, investigators estimate the quality of sample coverage with the theta statistic, calculated as 1 - (number of different behaviors seen/total number of acts observed). As the value of theta approaches 1, the probability of encountering a new behavior approaches zero. That is, we assume that if new behaviors are not encountered with additional observations, the behavioral repertoire has been adequately sampled.

Of course, a strictly empirical approach such as this usually is not adequate for evaluating human behavior. As we noted at the beginning of the chapter, a stream of human behavior often can be classified according to an enormous variety of characteristics. In order to focus attention on a limited set of characteristics, the investigator should begin with a list of these characteristics and their manifestations as gleaned from previous research, experience, and theory. Pilot observations then can be directed toward refining this list by broadening some categories, tightening others to make finer distinctions between theoretically disparate behaviors, and adding new categories not suggested by previous research. For an excellent example of this process, see Jacob, Tennenbaum, Bargiel and Seilhamer's (1995) description of the development of their Home Interaction Scoring System.

One frequent concern while developing coding systems involves what to do with rare but theoretically important events. Rare events tend to decrease the reliability of coding systems; however, such rare events may be highly meaningful, and thus they cannot be excluded from a system without compromising validity and utility. It may be possible to collapse similar rare events into broader categories or to alter the observational situation in order to elicit these behaviors more consistently and reliably.

3.01.5.2.2 Selecting a unit of observation

An important component of identifying relevant behaviors is to determine the appropriate unit of observation. This involves the decision to evaluate behavioral states, events, or some combination of the two. In general, a state is any ongoing condition that persists over an extended period of time, whereas an event is a discrete action. States are usually measured with time-based indices such as duration or latency, whereas events are usually measured with frequency counts or sequential patterns. Both types of unit also can be rated for intensity. The distinction between states and events is blurred by the fact that the same behavior might be measured with both units, such as measuring both the duration of anxiety episodes or disruptive episodes in the classroom, as well as the frequency of the episodes. The type of unit is not always mandated by the content of the behavior and, once again, the decision about the appropriate unit for a particular purpose must be guided by theoretical, empirical, and practical considerations. At first glance, it may appear that recording onset and offset times for all behaviors is desirable so that information about duration and latency can always be retrieved. However, Bakeman and Gottman (1997) warn against "the tyranny of time" and propose that, even with sophisticated recording and analytical devices, the precise measurement of time adds substantial complexity to data recording and analysis, and can cause problems with reliability that outweigh the benefits of having these data.

The unit of observation also involves the sampling protocol. The two most common sampling protocols in clinical psychology research are event sampling, which involves noting each occurrence of events during the entire observation period, and time sampling or interval coding, which involves noting occurrences during selected time intervals. Most commonly, time sampling involves alternating periods of watching and recording, each lasting for several seconds. During the recording period, the coder usually records whether or not each possible code occurred during the preceding interval. This procedure is useful for live observations in which the coder must track a large number of behaviors, so that recording each event would interfere with accurate observation. The procedure assumes that recording periods are random, and will not distort the data. Several studies challenge this assumption and reveal that interval coding can distort the amount of behavior, sequential patterns, and observer agreement for both behavioral events (e.g., Mehm & Knutson, 1987) and behavioral states (e.g., Ary, 1984; Gardner & Griffin, 1989). However, other studies demonstrate that interval coding can accurately reflect actual behavior rates and durations (e.g., Klesges, Woolfrey, & Vollmer, 1985). A study by Mann, ten-Have, Plunkett, and Meisels (1991) on mother-infant interactions illustrates that the accuracy of data from interval sampling depends on how well the duration of the sampling interval matches with the duration of the behavior of interest. They found that the actual durations or frequencies of mother and infant behaviors, which tended to occur in relatively short episodes ("bouts"), were inaccurately assessed when the sampling interval was relatively long. Thus, before proceeding with a time-sampling/interval-coding approach, the investigator should test the time-sampled data against actual rates or durations, and adjust the length of the recording interval to produce the most accurate data possible. See Altmann (1974) for an extensive review of sampling protocols, and Bakeman and Quera (1995) for considerations about how to design sampling protocols and record the data for data analysis purposes.

3.01.5.2.3 Creating code categories

Typically, code categories will be mutually exclusive and exhaustive. Mutual exclusivity means that coding categories are discrete and homogeneous, and that each event can be classified into one and only one category. Exhaustiveness means that all manifestations of a construct are included in the system. In most cases, exhaustiveness can be achieved by including a category, such as "other" or "neutral," to label all behaviors that do not fit well into any other category. For example, a measure of parent-adolescent interaction by Robin and Foster (1989) includes a "talk" code to cover all behaviors that are not instances of other categories. In some cases, the investigator may believe that it is necessary to violate these guidelines. For example, another measure of parent-adolescent interaction, the Constraining and Enabling Coding System (Leaper et al., 1989), allows behaviors by parents to receive both "constraining" and "enabling" codes because the authors believe that these "mixed messages" may be highly relevant to adolescent functioning. In other cases, investigators only label certain subsets of behaviors (i.e., as in scan sampling where, for example, only instances of target children's aggressive behavior and provocations by peers are recorded). Both situations create difficulties for the analysis of sequences of events, although Bakeman and Quera (1995) provide some solutions for managing these types of data.

Often it is useful to organize codes into groups and, if appropriate, to arrange the groups into a hierarchical classification. This arrangement makes the codes easier to recall; in addition, this hierarchical arrangement can help to fashion a set of decision steps to employ in the coding process, both of which may improve observer reliability. For example, children's social behaviors might first be classified as initiations versus reactions, and initiations could be classified as prosocial versus antagonistic, followed by more specific categories of events within each of these general categories. The coders can then use this hierarchical arrangement to organize their decision process, so that once they decide that a particular action is an initiation, and it is prosocial, there is a relatively small number of "prosocial initiation" codes to choose among.

An alternative approach to forming a hierarchical organization is exemplified in the Structural Analysis of Social Behavior (SASB) system (Humphrey & Benjamin, 1989). This system uses a rating scale format as an aid to categorical coding. In this system, all code categories are classified within a circumplex defined by two dimensions: Interdependence and Affiliation. Coders receive extensive instruction in the theory underlying each dimension, and they learn to rate behaviors in terms of their degree of interdependence, represented on a vertical axis, and their degree of affiliation, represented on a horizontal axis. The axes intersect at their midpoints. The location on the circumplex defined by the coordinates of these dimensional ratings is the code for that act. Thus, for example, a behavior rated as +5 for Interdependence (i.e., somewhat independent) and -4 for Affiliation (i.e., somewhat disaffiliative) is assigned to the category "walling off and distancing."

3.01.5.2.4 Content validation of codes

A useful, though often neglected, step in coding system development is content validation of the codes. This might be accomplished by having "experts" or the actors themselves classify codes into relevant categories, then comparing these categories to the expected categories. For example, in order to evaluate a family coding system that classified family behaviors as aversive and nonaversive, Snyder (1983) had family members rate the aversiveness of concrete examples of behaviors classified by the coding system.

3.01.5.2.5 Developing a codebook

After codes are labeled, defined, and classified, the next step is to produce a codebook. In general, most experts recommend that the more thorough, precise, and clearly written the codebook, the better the chances of training new coders to produce reliable, valid data (e.g., Hartmann & Wood, 1990; Herbert & Attridge, 1975); however, other studies demonstrate that naive coders can at times produce valid codes (e.g., Prinz & Kent, 1978). The codebook should include a list of all codes, a descriptive definition for each code, and examples of behaviors that represent each code, along with examples that do not match the code. In order to assist with reliable coding, it is particularly important to include examples of differential decisions in borderline or ambiguous cases. The APA guidelines for educational and psychological tests (APA, 1985) lists several other types of information that would also be useful to include in codebooks, including information about the theoretical underpinnings for the measure, and data on reliability and validity from research to date. Herbert and Attridge (1975) propose that providing this type of information to coders may help to facilitate training and improve coder reliability.

3.01.6 TRAINING CODERS

Once a coding system has been adopted or developed, the next step is to train coders in its use. A preliminary step in conducting efficient and effective coder training is the selection of coders who will perform well. Unfortunately, there is little research on personal characteristics or abilities that predict good performance as a behavioral coder. Research on interpersonal communication indicates that, in some circumstances, women tend to be more accurate than men at decoding the meaning of interpersonal behaviors (e.g., Noller, 1980). However, this effect is hardly large enough to warrant the exclusion of male coders in studies of interpersonal behavior. To the extent that coder characteristics such as gender, age, education, or ethnicity may be associated with biases that could influence coding decisions, it may be important to ensure that coders are diverse on these characteristics in order to randomize these effects and improve the validity of the data.

Ironically, one characteristic that may be important to select against is prior experience or personal investment in the research area under consideration. The use of naive, uninvolved observers controls for biases caused by factors such as familiarity with the research hypotheses and prior expectations. We believe that in many instances naive observers also tend to provide more reliable codes. Extensive research on clinical decision making demonstrates that highly experienced judges tend to employ idiosyncratic and inconsistent decision criteria that can reduce both intraobserver consistency and interobserver agreement as compared to naive observers (e.g., Dawes, 1979). Our experiences bear this out. When coders have extensive previous training or experience in the domain of interest, they tend to have difficulty adhering strictly to the coding criteria outlined in the coding manual, particularly if their experiences involved a different theoretical perspective or set of assumptions than those that undergird the coding system.

Wilson (1982) wisely proposed that observer training should address two equally important goals: teaching the skills needed to perform the coding task, and motivating the coders to perform well. Teaching the skills usually involves a combination of didactic and experiential training sessions. A helpful first step is to explain the rationale and theory that underlie the coding system. In cases where the coding involves considerable judgment in the application of a theory or model, the coders might benefit from readings and instruction about the model. The coders should read and demonstrate a thorough working knowledge of the coding manual and should be tested on this material before proceeding. Practice with the coding system should begin with the presentation of examples of behaviors, followed by an explanation for the coding decisions. Initial examples should be relatively unambiguous representations of the coding categories or examples of the extremes of the rating scales. The coders should be required to practice the coding with feedback about their accuracy and discussion of the rationale for coding decisions. Of course, the practice materials should match the actual coding situation as closely as possible. Training sessions should be relatively frequent and relatively brief to enhance alertness. Training continues until the coders reach acceptable levels of agreement with preestablished criterion codes. Because accuracy can be expected to decrease once actual coding begins (e.g., Taplin & Reid, 1973), most investigators set a training criterion that is higher than the minimal acceptable criterion during actual coding. Typically, this criterion is 80-90% agreement.

The maintenance of the coders' motivation to perform well probably involves the same types of factors that enhance performance in any work setting, including clarity of the task, investment in the outcome, personal responsibility, monitoring of performance, and a fair reward structure. One typical procedure used by investigators is to develop a contract that specifies the tasks to be completed, the expectations, and the reward structure. Coder investment in the project might be enhanced by providing them with the opportunity to participate as a member of the scientific or clinical team, to the extent that this does not bias coding, or by group interactions that build solidarity and cohesion among the coding team. As described below, reliability should be monitored on an ongoing basis. An unfortunate feature of reliability monitoring is that there is a built-in punishment schedule for inadequate performance, such as having to recode sessions or completing additional training, but the rewards for good performance are less tangible. Investigators should be sensitive to the need to instigate a reward system in whatever way possible, including providing raises for a good record, greater responsibility in the form of training and monitoring new coders, or public acknowledgment of good work.

3.01.7 RELIABILITY

In observational research, the reliability, or precision, of the measure is almost always evaluated with indices of interobserver agreement. Actually, according to psychometric theory, reliability concerns the extent to which coded data map onto "true scores," and thus, the reliability of coded data also relates to intraobserver consistency in applying the coding scheme (Bakeman & Gottman, 1997). However, because agreement between independent coders is a higher standard for judging precision, it is the focus of formal psychometric evaluations, and intraobserver consistency is addressed more informally through the implementation of training and monitoring procedures to prevent observer drift.

3.01.7.1 Enhancing Reliability

As noted throughout the previous sections, reliability can be enhanced in the way the codes are developed, in the procedures for training and monitoring the observers, and in the procedures for conducting the observations. Regarding code development, greater reliability can be expected when codes are clearly defined in operational terms, when categories are relatively elemental and overt as opposed to complex and inferential, and when coded behaviors occur at least moderately frequently. If some of these conditions cannot be met, coders likely will need relatively more training, practice, and experience in order to produce reliable data. Instructions that explicitly discourage or encourage specific expectancies about frequencies or patterns of codes either tend to reduce observer agreement or spuriously inflate it (Kazdin, 1977). Similar to guidelines for training sessions, frequent and relatively short observation sessions produce more reliable data than infrequent, longer observation sessions. In addition to booster sessions to reduce observer drift, periodically training new coders may help to reduce the biasing effects of prior experience on the data, and observations involving different subject groups or experimental conditions should be intermingled whenever possible to avoid systematic effects related to observer drift. Studies repeatedly demonstrate that coders are less reliable when they believe that their agreement is not being checked, although this effect may abate with increased experience (e.g., Serbin, Citron, &
Conner, 1978; Weinrott & Jones, 1984). Therefore, frequent, overt, and random assessments of reliability should help to maintain coder precision.

3.01.7.2 Evaluating Reliability

The appropriate procedure and calculations for evaluating reliability depend on the purposes of the evaluation and the nature of the inferences that will be drawn from the data. Thus, there is no one way to evaluate reliability, but rather an array of approaches that can be followed. Helpful reviews of various procedures and computational formulas are presented by Bakeman and Gottman (1997), Hartmann (1977, 1982), Jacob, Tennenbaum, and Krahn (1987), Stine (1989), and Suen, Ary, and Covalt (1990). Two important decisions include (i) whether to compute exact agreement for specific codes (point-by-point agreement) or aggregate agreement for larger units, and (ii) whether and how to correct for chance agreement. Below is a summary of some of the more common procedures, along with guidelines for selecting among them.

For the purpose of training, monitoring, and providing corrective feedback to coders, in most cases it is useful to assess the most precise form of point-by-point agreement that is possible with the coding system. For example, with event recording, an observer's codes for each event are compared with a set of criterion codes or those of an independent coder. A contingency table that cross-lists the two sets of codes and tallies the frequencies of each pair of codes (i.e., a "confusion matrix") is a useful method of feedback for the coders. The total percent agreement for all codes (# agreements/total # events), and the percent agreement for specific codes (# agreements/(# agreements + # disagreements)) are summary scores that are easy to understand. During coder training, we compute these statistics beginning with the first practice assignments because the rapid improvement that usually occurs during the first few assignments can be highly rewarding and reinforcing for new coders. The contingency table also can display a pattern of disagreements that is instructive, such as when two coders consistently give a different code to the same type of behavior. Even when the actual observational procedure involves a larger or less precise unit of observation, such as in interval sampling, it may be helpful to employ an event-based tracking procedure during coder training in order to have more precise information about when disagreements occur. In our experience, the identification of consistent disagreements is helpful when follow-up instruction is used to clarify and resolve the confusion about the coding criteria. However, providing feedback about disagreements without instruction or resolution can make coders feel more confused and uncertain and can decrease reliability. Also, it usually is not helpful to coders to correct their scores for chance agreement, because corrected scores may be highly variable depending on the range of behaviors displayed in each session; thus they present a confusing picture of absolute agreement. Finally, if agreement statistics are used to identify unreliable coders in need of additional training, it is important to base this decision on multiple observation sessions, because occasional unreliable sessions can be expected due to the ambiguity in the target's behavior rather than because of coder error.

For the purpose of determining the precision of the measurement after coding has been completed, the method of computing reliability depends on the type of data that will be analyzed (Hartmann, 1977). That is, the investigator typically is interested in the reliability/precision of the scores that are actually computed from the data. Although it is common for investigators to report some form of point-by-point agreement for individual codes, most data analyses focus on aggregate indices such as the frequency, relative frequency, or rates of groups of behaviors. Thus, if several specific codes are combined or aggregated into a broader category of "positive behavior" for data-analytic purposes, then reliability estimates should be performed at the level of "positive behavior." We usually assume that observer agreement for specific codes ensures that aggregate scores are reliable; nonetheless, it is useful to determine the actual level of reliability for these aggregate codes. Furthermore, agreement at the specific code level is usually an overly stringent requirement that can unnecessarily prolong a study.

For the purpose of assessing the precision of the measure, percent agreement statistics are usually corrected for chance agreement between coders by using Cohen's kappa statistic. This statistic uses the information about agreements and disagreements from the confusion matrix, and evaluates the observed amount of agreement relative to the expected amount of agreement due to chance because of the base rates of the behaviors. The formula for kappa is:

\[
\kappa = \frac{p(\text{Observed agreement}) - p(\text{Expected agreement})}{1 - p(\text{Expected agreement})}
\]

where p(Observed agreement) is the percent agreement for the two coders (expressed as a proportion), and p(Expected agreement) is the percent agreement expected by chance (also as a proportion). The probability of chance agreement is computed by calculating the relative frequencies for each code by each coder, obtaining the product of the two coders' relative frequency scores for each code, then summing these products.
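The agreement statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure taken from the sources cited: the function name `agreement_stats` and the example codes are hypothetical, and the two coders are assumed to have coded the same events in the same order, so that their codes can be paired event by event.

```python
from collections import Counter


def agreement_stats(coder_a, coder_b):
    """Confusion matrix, percent agreement, and Cohen's kappa for two
    equal-length lists of codes paired event by event (illustrative)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must supply the same non-zero number of codes")
    n = len(coder_a)

    # Confusion matrix: tally of each (coder A code, coder B code) pair.
    confusion = Counter(zip(coder_a, coder_b))
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    codes = sorted(set(freq_a) | set(freq_b))

    # Total agreement: # agreements / total # events.
    p_observed = sum(confusion[(c, c)] for c in codes) / n

    # Agreement for each code: # agreements / (# agreements + # disagreements),
    # where a disagreement is any event on which only one coder used the code.
    per_code = {}
    for c in codes:
        agree = confusion[(c, c)]
        disagree = freq_a[c] + freq_b[c] - 2 * agree
        per_code[c] = agree / (agree + disagree)

    # Chance agreement: product of the two coders' relative frequencies
    # for each code, summed over codes.
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in codes)

    if p_expected < 1:
        kappa = (p_observed - p_expected) / (1 - p_expected)
    else:
        kappa = float("nan")  # undefined when both coders used a single code
    return confusion, p_observed, per_code, kappa


# Hypothetical codes for ten events: P = positive, N = negative, O = other.
coder_a = ["P", "P", "N", "O", "P", "N", "N", "O", "P", "P"]
coder_b = ["P", "P", "N", "O", "N", "N", "N", "O", "P", "O"]
confusion, p_obs, per_code, kappa = agreement_stats(coder_a, coder_b)
print(confusion)
print(round(p_obs, 2), round(kappa, 2))
```

In this made-up session the coders agree on 8 of 10 events, and chance agreement is .33, so kappa is about .70. The off-diagonal entries of `confusion` show where the two disagreements fall.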
Usually, kappa is computed for each reliability session, and the mean and range across all reliability sessions are reported. One complication with the kappa statistic is that it can severely underrepresent observer agreement in sessions during which the subject displays a limited number of behaviors and/or some behaviors have exceptionally high base rates. This situation produces very large values for expected agreement, and thus can produce a very low score for kappa even when the proportion of observed agreement is very high. A potential solution in this situation is to use a sample-wise estimate of expected agreement derived from the base rates for the entire set of data (Bakeman & Gottman, 1997). Another complication is that point-by-point kappa may be overly stringent when, as frequently occurs, the actual scores used in the data analysis are aggregates (e.g., total frequency scores). Jacob et al. (1987) present a method for computing kappa for aggregate scores.

Often the measure of interest in the data analysis represents the data on an interval scale of measurement (e.g., relative frequency scores or average ratings). In this situation, the question of measurement precision concerns the relative positions of subjects on the interval scale rather than agreement between coders on individual coded behaviors. Coder agreement for interval data can be assessed with the correlation between the scores for pairs of coders; for example, the correlation between summary scores for a group of subjects, or the correlation between interval ratings within an observation session. Alternatively, the intraclass correlation is also appropriate for interval data. This approach is derived from the analysis of variance (Winer, 1971), and is the most commonly used application of generalizability theory for evaluating coder agreement. It assesses the proportion of variance in the scores that can be attributed to variation among subjects as opposed to variation among observers. An advantage of the approach is that when more than two coders are being evaluated, one score can reflect the level of agreement (i.e., proportion of variation attributed to differences among subjects) for the entire group of coders, rather than calculating individual correlations for all possible pairs of coders. Shrout and Fleiss (1979) outline the procedures for computing intraclass correlations under various assumptions about the data.

Alternative procedures for detecting agreement between coders may also be appropriate when the focus of study is the sequential patterns in the behavioral stream. Bakeman and Gottman (1997) vividly illustrate how point-by-point agreement between coders can be sharply deflated when one coder occasionally inserts extra codes, although the agreement about the sequential pattern of the codes is very high. Wampold and Holloway (1983) make a similar case that agreement about individual codes may be too stringent a criterion for sequential data. Gottman (1980) recommends an approach similar to intraclass correlation in which the investigator demonstrates that the measures of sequential dependency (e.g., lag z-scores, see below) are consistent across observers.

3.01.8 ANALYZING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION

Apart from the usual decisions that guide the design of data analysis, observational data present some unique challenges for investigators. These include questions about data reduction, dimensions and scales of measurement, and the identification of sequential patterns of behavior.

3.01.8.1 Data Reduction

Most observational studies produce a large amount of data for each subject, particularly when the coding scheme exhaustively labels all behaviors emitted by one or more subjects during an observational session. Even sessions lasting only 5–10 minutes can produce several hundred codes when behavior units are small. The challenge is to reduce these data to make them interpretable, without sacrificing the rich descriptive information in the coded data.

A first step is to group individual codes into broader categories. For the purposes of data analysis, the primary problem with many observational coding schemes is that they include behavioral events that rarely or never occur for many subjects, and thus produce highly skewed distributions of scores for the sample, resulting in missing cells in contingency tables for sequential analyses. Whenever possible, it is important to incorporate rare events into categories of events that occur with sufficient frequency to be used in subsequent analyses. In most cases, categories are specified on an a priori basis by investigators, using theory and rational analysis to determine how codes are grouped.

Another approach is to factor analyze individual codes to identify clusters of codes that share common variance. This approach is probably most appropriate when the investigator is interested in examining behavior styles, because the factor analysis identifies groups of behaviors that covary across subjects. That is, behaviors are grouped based on their co-occurrence within subjects, irrespective of their meaning or functional impact. For example, if different children in classrooms tend to misbehave in different ways, with some children
spending much time out of their seats, others talking out of turn, and others being withdrawn and inattentive, a factor analysis would likely identify these three types of behaviors as three separate clusters. Kuller and Linsten (1992) used this method and identified social behaviors and individual concentration as separate clusters of behaviors by children in classrooms. An alternative approach is to group behaviors according to their functional impact. If all three types of behaviors disrupt the activities of other children in the classroom, they can be grouped into one functional category of disruptive behavior. For example, Schaap (1982) used lag sequential analysis of marital interactions to identify a set of behaviors that were most likely to elicit negative responses from a spouse. These specific behaviors could then be grouped together as negative eliciting behaviors.

3.01.8.1.1 Measurement of base rates of behavior

Base rates of behavior can be expressed in various scales of measurement. For example, the frequency of events can be expressed as raw frequency, relative frequency, rate per minute, or ratio scores (i.e., ratio of positive to negative behaviors). The selection of a scale of measurement is usually determined by factors such as the focus of the investigation, the precedent in the field of study, and psychometric properties of various scales of measurement (e.g., distributions of the scores, reliability of the index). Tryon (1991) argues that scores should also be expressed in "natural" scientific units in order to make measurements as comparable as possible across time, situations, or different investigations. A compelling point in this argument is that relativistic measurements such as standardized z-scores, which provide information about the position of the subject relative to other subjects in the same study, are highly "unnatural" units that vary as a function of who participates in the study. Instead, scores should be expressed in relation to some objective criterion that remains invariant across samples, so that a body of knowledge about the phenomenon of interest can more readily develop. For example, because raw frequency counts will depend on the duration of the measurement period, relative frequency scores may be preferable because they are comparable across measurement situations. However, the comparability of relative frequency scores requires exhaustive coding of all behavior with the same set of codes. Thus, rate per minute scores may be preferable because they are less dependent on other characteristics of the coding scheme or measurement situation.

3.01.8.1.2 Measurement of sequential patterns of behavior

When events or states are coded exhaustively, it is possible to evaluate patterns of behavior for an individual or between individuals interacting with each other. The behaviors can occur concurrently (e.g., the co-occurrence of eyebrow arch and smile) or serially (e.g., reciprocal negative behaviors between spouses). Sequential patterns are usually derived from Bayesian statistics that relate the occurrence or nonoccurrence of an antecedent event or state with the occurrence or nonoccurrence of a consequent event or state. In lag-sequential analysis (e.g., Bakeman & Gottman, 1997; Sackett, 1979a), the investigator is concerned with the transitional probability of the antecedent/consequent sequence, which reveals the probability that the consequent occurs, given the antecedent event or state. The important question regarding sequential dependency is the extent to which this transitional probability exceeds (or is smaller than) the base rate for the consequent. If the consequent is more (or less) likely to occur in the presence of, or following, the antecedent than in other circumstances, then there is dependency between the antecedent and the consequent. If, for example, the probability that a mother smiles at her infant (consequent behavior) is greater after the infant smiles at her (antecedent behavior) than after other types of infant behaviors, then there is a sequential dependency in infant smile–mother smile exchanges.

Sequential dependency can be estimated with several statistics. One common metric is the lag sequential z-score developed by Sackett (1979a), and a modification developed by Allison and Liker (1982) that corrects for sampling error. This statistic compares the observed frequency of the antecedent/consequent sequence with the expected frequency of the sequence (based on the base rates for both events). More recently, Bakeman and colleagues (Bakeman & Gottman, 1997; Bakeman & Quera, 1995) recommended a slightly different formula derived from two-way contingency tables, which is the adjusted residual obtained in log-linear analysis. A convenient feature of this statistic is that because the scores are distributed as z, the statistical significance of sequential patterns is readily discerned (i.e., scores exceeding z = 1.96 in absolute value). The statistic can be computed with standard statistical packages. Importantly, Bakeman and Quera (1995) also suggest statistical programs to compute these scores when codes in a sequence cannot repeat, such as when the coding system requires that new codes are assigned only when the behavior changes.
When codes cannot repeat, a behavior can never follow itself, which produces "structural zeros" in the diagonal of the contingency table.

The z-statistic or adjusted residual is most useful for examining a sequential pattern in a single set of observations on an individual subject or a group of subjects. However, it is not recommended for use when separate sequential scores for each of multiple subjects are entered as data points into subsequent inferential analyses. The problem is that the size of adjusted residuals will vary (become larger) as a function of the number of observations that are made, even when the actual conditional probabilities remain constant. Thus, the z-scores are influenced by the base rates of behaviors, and can differ dramatically across subjects when response productivity differs. Wampold (1989) recommends a transformed kappa statistic as an alternative; however, this statistic requires selecting among three different computational formulas depending on the relative size of expected and actual probabilities. Other commonly used statistics are Yule's Q and phi. Bakeman and Casey (1995) provide computational formulas, discuss the conditions under which various statistics may be applicable, and suggest that investigators examine the distributions of scores for each statistic to determine which statistic provides the best distribution of scores for a particular set of data.

Investigators are commonly concerned about the amount of data needed to calculate sequential statistics. This issue is most relevant when the investigator is interested in studying sequences involving relatively rare events. Unfortunately, clinically relevant phenomena are often rare phenomena, so the problem is a common one in the field of clinical psychology. Bakeman and Gottman (1997) present a detailed analysis of this issue using guidelines employed in log-linear analysis of contingency tables. As a general rule of thumb, the investigator should obtain enough data so that each antecedent and consequent event in a contingency table occurs at least five times. In many cases, this criterion can be met by collapsing codes into more general categories. Another strategy is to conduct analyses using pooled observations on a group of subjects, rather than calculating sequential statistics for individual subjects. For example, Gottman, Markman, and Notarius (1977) examined sequences of effective and ineffective problem solving in groups of distressed and nondistressed married couples. Although incidents of positive behaviors such as "validate" and negative behaviors such as "put-down" were rare in some couples, within the two groups they occurred with sufficient frequency to identify patterns of supportive interactions that were more typical of happily married couples, and patterns of hostile exchange that were more typical of spouses in distressed marriages.

3.01.8.2 Computer Technology for Recording and Coding Behavioral Observations

Since the late 1970s, computer technology has become increasingly available for use in recording and coding observational data. Whereas the use of devices and software requires extra initial costs in time and money, they may ultimately increase the efficiency of data collection, coding, and data management. If developers find a market for these devices, we can expect that their availability will become more widespread. Because specific products are likely to undergo rapid changes and development, a list of currently available products would be immediately out of date, and thus would be of little use. However, it is possible to illustrate the range of current options and some of the factors to consider in selecting equipment and software.

For most situations in which behavioral observation is used, the major function of automated systems is to record the observations and the codes reported by the human observer. Although Tryon (1991) catalogs many automated devices that actually measure "actions," most of these devices track physiological responses or simple, gross motor activity. To date, no automated system can code the types of complex events that are the focus of this chapter.

The first systems designed for event recording were "dedicated devices" because their sole function was to aid in data collection. Data could be stored in a temporary system and then uploaded to a computer for analysis. These devices have been rapidly replaced by software systems that run on standard computers. The newest automated systems combine data entry and management functions with computer control of video playback devices.

One consideration in selecting a system is the number of codes the system can handle. For example, one software package, the Observational Data Acquisition Program (ODAP; Hetrick, Isenhart, Taylor & Sandman, 1991), allows for recording frequency and duration of up to 10 behaviors. Duration is recorded by depressing a key during the occurrence of a behavior. Taylor et al. (1991) used this system to measure six types of self-injurious behaviors. Although this system is easy to use, the major limitation is that only a circumscribed range of behavior can be evaluated with simple, one-dimensional codes. In contrast, other systems allow the investigator to configure a data entry file to handle as many as 1000 codes, and the codes can be multifaceted.
For example, the Observer system (Noldus, 1991) has the capacity to hold 1000 different codes which can be classified as occurring only under certain circumstances. The system prompts the coder to correct errors when codes that violate the classification structure are entered (e.g., the same behavior entered twice in systems where codes cannot repeat, or when a teacher code is assigned to a student). The Multiple Option Observation System for Experimental Studies (MOOSES; Tapp, Wehby, & Ellis, 1995) offers similar flexibility. In a study by Shores et al. (1993), MOOSES was used to record classroom interactions of children with behavior disorders. Codes were devised to indicate the actor, the behavior exhibited, and the recipient of the behavior. In addition, conditional variables such as the presence of a teacher or the grouping of students who were present were also recorded.

A second consideration is whether the system can interface with video playback devices. Several systems are available that link the computer with professional-quality videotape players that include a computer interface port. A machine-readable time code is stamped on the videotape, and the computer reads the time code to track onset and offset times for events, duration for states, or to access specific intervals. The advantage of these systems over simple real-time recording devices is that the videotape can be stopped, reversed, and started repeatedly without needing to reset a timer. Examples are the Observer system (Noldus, 1991), Observation Coding System Toolset (OCS Tools; Triangle Research Collaborative, Inc., 1994), and Procoder (Tapp & Walden, 1993). These systems differ greatly in terms of the number and complexity of codes they can handle and the ability and ease of controlling the video player from the computer keyboard. More recently, packages are becoming available that use CD-ROM instead of videotape input (e.g., vPrism; Digital LAVA Inc., 1996). These systems provide easier access to specific sections of an observation session because of the digital access features of CD-ROM, and easier replay as compared to rewinding a video player. They may also allow for greater precision in labeling exact time segments. However, they require converting videotaped observations to compact disks, which is expensive when completed by companies that provide this service, and time-consuming when equipment is available for converting data in-house. Nevertheless, the reduced storage requirements for CDs as opposed to videotapes are an advantage.

A third consideration is the statistical analysis that a particular program may offer. Whereas some systems include software for calculating coder agreement, linking multiple channels (e.g., behavioral codes with heart rate data), and conducting sequential analyses, others output the data so that it can be imported into other programs for these purposes. Graphic presentation of data is also an option with some packages. In some cases, statistical programs are included as part of the basic system; in other instances, the statistical software is optional.

3.01.9 INTERPRETING DATA GATHERED THROUGH BEHAVIORAL OBSERVATION

As can be seen, there are a number of steps before investigators obtain findings based on behavioral observation. The final step for the investigator is interpreting these findings. Whereas this interpretive process is psychological in nature and dependent upon the specifics of the particular investigation, our own research on couples' relationships, marital therapy, and family development illustrates how a pattern of findings that include both behavioral observation and self-report measures can help to elucidate family/marital functioning.

Baucom and his colleagues have conducted several treatment outcome studies with maritally distressed couples from a cognitive-behavioral perspective. Cognitive-behavioral conceptualizations of marital distress have placed a major emphasis on the centrality of communication as essential for effective marital functioning. However, the pattern of findings across our investigations indicates that communications training might be neither necessary nor sufficient for affecting changes in marital functioning. In these investigations, couples' communication was assessed by the MICS III (Weiss, 1986) after the couples attempted to solve problems on areas of difficulty in their marriage in a laboratory setting. In the first study, all couples received some set of behavioral interventions to assist their marriages (Baucom, 1982). The difference among the treatment conditions involved whether the couples received communications training.

The findings indicated that couples in all of the active treatment conditions improved equally on communication and marital adjustment, suggesting that communications training was not the critical element in the treatment. In a subsequent investigation, Baucom, Sayers, and Sher (1990) provided maritally distressed couples with a variety of cognitive-behavioral interventions, but all couples received communications training. Again, the findings indicated that couples in all active treatment conditions improved equally on communication and marital adjustment. However, additional analyses indicated that changes in communication were not correlated with changes in marital adjustment; thus, communication training could not be interpreted as the critical ingredient that altered marital adjustment (Iverson & Baucom, 1990). This is not to say that communication is unimportant in marriage; a number of investigations indicate that it is (e.g., Gottman & Krokoff, 1989). However, the results from these investigations and others have led cognitive-behavioral investigators to develop more complex and multifaceted models of marital distress that include, but are not limited to, a communications skills deficit explanation (e.g., Karney & Bradbury, 1995). Thus, in this instance, the findings from investigations involving behavioral observation of communication with couples have led to theoretical changes in the conceptualization of marital adjustment.

Research by Floyd and colleagues investigates associations among subsystems within the families of children who have disabilities, and examines how family relationships predict adaptive functioning for the child and the other family members. All observations and data collection are conducted in the families' homes in order to maximize the likelihood that observed behaviors are relevant to daily functioning in the family. One set of reports (Floyd, Gilliom, & Costigan, in press; Floyd & Zmich, 1991) focuses on the hypothesis that the quality of the parents' marital functioning and their parenting alliance influence the quality of parenting experiences. In order to test the hypothesis, observational measures were combined with self-reports of both marital functioning and parenting experiences.

Similar to procedures commonly employed in studies of marital relationships, and as illustrated in the studies by Baucom and colleagues, the parents' marital problem-solving skills were assessed by having them discuss and attempt to resolve a significant area of disagreement in their relationship. This procedure linked the investigation to the large body of previous observational research on marriage and marital therapy. Also, because the focus of the study was to understand how the parents functioned as a marital dyad, the marital discussions were conducted in a room separate from the children. Other studies interested in how children witness marital conflict have used procedures where children are present either as observers or participants in the discussions (e.g., Cummings, 1987). In order to reduce reactive effects associated with the presence of a live observer, the couples were left alone together for 10 minutes to complete the discussion, and the discussions were videotaped for later coding.

Parenting experiences were evaluated in a separate family interaction session. The procedures for this session were adopted from the work of Patterson, Reid, and colleagues (Patterson, 1982; Patterson, Reid, & Dishion, 1992) with aggressive children and their families. These procedures had been used to identify how behavior management and control are disrupted, and lead to the escalation of negative exchanges in the families of aggressive children. Because children with mental retardation also present behavior management challenges for parents, the possibility of negative escalation was potentially relevant to these families. In order to make the interaction as naturalistic as possible, families were videotaped in the home while completing a task of their own choosing. All family members were present. However, it was also necessary to structure the session somewhat in order to ensure that the family members interacted together and that the videotapes were sufficiently clear so that the behaviors could be reliably coded. Thus, families were instructed to complete an activity together (e.g., baking cookies, working on a crafts project), to refrain from watching television or making or taking telephone calls, and to remain together in one or two rooms within range of the video camera. The families were observed for a total of 50 minutes, which, for coding purposes, was divided into 10-minute segments. During each segment, one family member was identified as the focus of coding, and only behaviors that occurred by the focus person and anyone interacting with that person were coded by the observer. This procedure allowed the camera operator to focus on subsets of family members rather than trying to keep all family members in view at all times.

The findings support the value of these observational methods. Most notably, whereas self-report measures of marital quality, the parenting alliance, and parenting experiences failed to distinguish families of children with mental retardation from a comparison group with typically developing children, both the marital interactions and the family interactions demonstrated greater marital negativity and more parent–child behavior management struggles for the MR group (Floyd & Zmich, 1991). Furthermore, negative marital interactions were predictive of negative parent–child exchanges. A subsequent longitudinal evaluation demonstrated that marital quality, including the quality of marital problem-solving interactions, predicts changes in negative parent–child exchanges over a two-year period, with couples who are most negative together showing increases in their negative exchanges with children over time (Floyd, Gilliom, & Costigan, in press).

3.01.10 CONCLUDING COMMENTS

As can be seen in the above discussion, behavioral observation provides an extremely rich source of information for investigators as they attempt to understand the complexities of human behavior and interaction. This richness presents many opportunities for investigators and many challenges as well. These challenges are incorporated in the myriad of decisions that the investigator must make in employing behavioral observation, and obviously the route that the investigator chooses greatly impacts the findings. Thus, the investigator incurs responsibility for understanding the impact of these decisions on the questions under investigation. Often in reporting the results of investigations employing behavioral observation, the method section of the report involving the application of coding systems and data analytic strategies is presented in a few short paragraphs. Consequently, most readers will have only a general level of understanding of how the coding system was employed and how that impacts the findings. Therefore, the investigator must thoroughly understand the coding process so that discussions of the findings accurately represent what is indicated by the data. When this is done in a thoughtful manner, we have the opportunity to use one of our most natural strategies, observing people's behavior, as a major way to advance the science of clinical psychology.

3.01.11 REFERENCES

Ainsworth, M. D., Blehar, M. C., Waters, E., & Wall, S. (1978). Patterns of attachment: A psychological study of the strange situation. Hillsdale, NJ: Erlbaum.
Allison, P. D., & Liker, J. K. (1982). Analyzing sequential categorical data on dyadic interaction: A comment on Gottman. Psychological Bulletin, 91, 393–403.
Altmann, J. (1974). Observational study of behavior: Sampling methods. Behaviour, 49, 227–265.
American Psychological Association (1985). Standards for educational and psychological testing. Washington, DC: Author.
Ammerman, R. T., Van Hasselt, V. B., & Hersen, M. (1991). Parent–child problem-solving in families of visually impaired youth. Journal of Pediatric Psychology, 16, 87–101.
Ary, D. (1984). Mathematical explanation of error in duration recording using partial interval, whole interval, and momentary time sampling. Behavioral Assessment, 6, 221–228.
Bakeman, R., & Casey, R. L. (1995). Analyzing family interaction: Taking time into account. Journal of Family Psychology, 9, 131–143.
Bakeman, R., & Gottman, J. M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). New York: Cambridge University Press.
Bakeman, R., & Quera, V. (1995). Log-linear approaches to lag-sequential analysis when consecutive codes may and cannot repeat. Psychological Bulletin, 118, 272–284.
Barton, C., & Alexander, J. F. (1981). Functional family therapy. In A. S. Gurman & D. P. Kniskern (Eds.), Handbook of family therapy (pp. 403–443). New York: Brunner/Mazel.
Baucom, D. H. (1982). A comparison of behavioral contracting and problem-solving/communications training in behavioral marital therapy. Behavior Therapy, 13, 162–174.
Baucom, D. H., Sayers, S. L., & Sher, T. G. (1990). Supplementing behavioral marital therapy with cognitive restructuring and emotional expressiveness training: An outcome investigation. Journal of Consulting and Clinical Psychology, 58, 636–645.
Cairns, R. B., & Green, J. A. (1979). How to assess personality and social patterns: Observations or ratings? In R. B. Cairns (Ed.), The analysis of social interactions: Methods, issues, and illustrations (pp. 209–226). Hillsdale, NJ: Erlbaum.
Carpenter, L. J., & Merkel, W. T. (1988). The effects of three methods of observation on couples in interactional research. American Journal of Family Therapy, 16, 144–157.
Christensen, A., & Hazzard, A. (1983). Reactive effects during naturalistic observation of families. Behavioral Assessment, 5, 349–362.
Christensen, A., & Heavey, C. L. (1990). Gender and social structure in the demand/withdraw pattern of marital conflict. Journal of Personality and Social Psychology, 59, 73–81.
Cummings, E. M. (1987). Coping with background anger in early childhood. Child Development, 58, 976–984.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571–582.
Digital LAVA Inc. (1996). 10850 Wilshire Blvd., Suite 1260, LA, CA 90024.
Fagot, B., & Hagan, R. (1988). Is what we see what we get? Comparisons of taped and live observations. Behavioral Assessment, 10, 367–374.
Floyd, F. J. (1989). Segmenting interactions: Coding units for assessing marital and family behaviors. Behavioral Assessment, 11, 13–29.
Floyd, F. J., & Costigan, C. L. (1997). Family interactions and family adaptation. In N. W. Bray (Ed.), International review of research in mental retardation (Vol. 20, pp. 47–74). New York: Academic Press.
Floyd, F. J., Gilliom, L. A., & Costigan, C. L. (in press). Marriage and the parenting alliance: Longitudinal prediction of change in parenting perceptions and behaviors. Child Development.
Floyd, F. J., & Zmich, D. E. (1991). Marriage and the parenting partnership: Perceptions and interactions of parents with mentally retarded and typically developing children. Child Development, 62, 1434–1448.

Foster, S. L., Inderbitzen, H. M., & Nangle, D. W. (1993). Assessing acceptance and social skills with peers in childhood. Behavior Modification, 17, 255–286.
Gardner, W., & Griffin, W. A. (1989). Methods for the analysis of parallel streams of continuously recorded social behaviors. Psychological Bulletin, 105, 446–455.
Gottman, J. M. (1979). Marital interaction: Experimental investigations. New York: Academic Press.
Gottman, J. M. (1980). Analyzing for sequential connection and assessing interobserver reliability for the sequential analysis of observational data. Behavioral Assessment, 2, 361–368.
Gottman, J. M. (1994). What predicts divorce? Hillsdale, NJ: Erlbaum.
Gottman, J. M., & Krokoff, L. J. (1989). Marital interaction and satisfaction: A longitudinal view. Journal of Consulting & Clinical Psychology, 57(1), 47–52.
Gottman, J. M., Markman, H., & Notarius, C. (1977). The topography of marital conflict: A sequential analysis of verbal and nonverbal behavior. Journal of Marriage and the Family, 39, 461–477.
Haley, J. (Ed.) (1971). Changing families. New York: Grune & Stratton.
Harris, F. C., & Lahey, B. B. (1986). Condition-related reactivity: The interaction of observation and intervention in increasing peer praising in preschool children. Education and Treatment of Children, 9, 221–231.
Hartmann, D. P. (1977). Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis, 10, 103–116.
Hartmann, D. P. (Ed.) (1982). Using observers to study behavior. San Francisco: Jossey-Bass.
Hartmann, D. P., & Wood, D. D. (1990). Observational methods. In A. S. Bellack, M. Hersen, & A. E. Kazdin (Eds.), International handbook of behavior modification and therapy (2nd ed., pp. 109–138). New York: Plenum.
Hawkins, R. P. (1979). The functions of assessment: Implications for selection and development of devices for assessing repertoires in clinical, educational, and other settings. Journal of Behavioral Assessment, 12, 501–516.
Haynes, S. N. (1978). Principles of behavioral assessment. New York: Gardner Press.
Haynes, S. N. (1998). The changing nature of behavioral assessment. In M. Hersen & A. S. Bellack (Eds.), Behavioral assessment: A practical handbook (4th ed., pp. 1–21). Boston: Allyn & Bacon.
Haynes, S. N., Follingstad, D. R., & Sullivan, J. C. (1979). Assessment of marital satisfaction and interaction. Journal of Consulting and Clinical Psychology, 47, 789–791.
Haynes, S. N., & Horn, W. F. (1982). Reactivity in behavioral observation: A review. Behavioral Assessment, 4, 369–385.
Herbert, J., & Attridge, C. (1975). A guide for developers and users of observational systems and manuals. American Educational Research Journal, 12, 1–20.
Hersen, M., & Bellack, A. S. (1998). Behavioral assessment: A practical handbook (4th ed.). Boston: Allyn & Bacon.
Hetrick, W. P., Isenhart, R. C., Taylor, D. V., & Sandman, C. A. (1991). ODAP: A stand-alone program for observational data acquisition. Behavior Research Methods, Instruments, and Computers, 23, 66–71.
Humphrey, L. L., & Benjamin, L. S. (1989). An observational coding system for use with structural analysis of social behavior: The training manual. Unpublished manuscript, Northwestern University Medical School, Chicago.
Iverson, A., & Baucom, D. H. (1990). Behavioral marital therapy outcomes: Alternate interpretations of the data. Behavior Therapy, 21(1), 129–138.
Jacob, T. (1975). Family interaction in disturbed and normal families: A methodological and substantive review. Psychological Bulletin, 82, 33–65.
Jacob, T., Tennenbaum, D., Bargiel, K., & Seilhamer, R. A. (1995). Family interaction in the home: Development of a new coding system. Behavior Modification, 19, 147–169.
Jacob, T., Tennenbaum, D. L., & Krahn, G. (1987). Factors influencing the reliability and validity of observation data. In T. Jacob (Ed.), Family interaction and psychopathology: Theories, methods, and findings (pp. 297–328). New York: Plenum.
Jacob, T., Tennenbaum, D., Seilhamer, R. A., Bargiel, K., & Sharon, T. (1994). Reactivity effects during naturalistic observation of distressed and nondistressed families. Journal of Family Psychology, 8, 354–363.
Jacobson, N. S. (1985). The role of observational measures in behavior therapy outcome research. Behavioral Assessment, 7, 297–308.
Jarrett, R. B., & Nelson, R. O. (1984). Reactivity and unreliability of husbands as participant observers. Journal of Behavioral Assessment, 6, 131–145.
Johnston, J. M., & Pennypacker, H. S. (1993). Strategies and tactics of behavioral research (2nd ed.). Hillsdale, NJ: Erlbaum.
Karney, B. R., & Bradbury, T. N. (1995). The longitudinal course of marital quality and stability: A review of theory, methods, and research. Psychological Bulletin, 118(1), 3–34.
Kazdin, A. E. (1977). Artifact, bias and complexity of assessment: The ABC's of reliability. Journal of Applied Behavior Analysis, 10, 141–150.
Kazdin, A. E. (1982). Observer effects: Reactivity of direct observation. New Directions for Methodology of Social and Behavioral Science, 14, 5–19.
Klesges, R. C., Woolfrey, J., & Vollmer, J. (1985). An evaluation of the reliability of time sampling versus continuous observation data collection. Journal of Behavior Therapy and Experimental Psychiatry, 16, 303–307.
Kuller, R., & Linsten, C. (1992). Health and behavior of children in classrooms with and without windows. Journal of Environmental Psychology, 12, 305–317.
Leaper, C., Hauser, S., Kremen, A., Powers, S. I., Jacobson, A. M., Noam, G. G., Weiss-Perry, B., & Follansbee, D. (1989). Adolescent–parent interactions in relation to adolescents' gender and ego development pathway: A longitudinal study. Journal of Early Adolescence, 9, 335–361.
Mann, J., ten-Have, T., Plunkett, J. W., & Meisels, S. J. (1991). Time sampling: A methodological critique. Child Development, 62, 227–241.
Mehm, J. G., & Knutson, J. F. (1987). A comparison of event and interval strategies for observational data analysis and assessments of observer agreement. Behavioral Assessment, 9, 151–167.
Minuchin, S. (1974). Families and family therapy. Cambridge, MA: Harvard University Press.
Noldus, L. P. J. J. (1991). The Observer: A software system for collection and analysis of observational data. Behavior Research Methods, Instruments, and Computers, 23, 415–429.
Noller, P. (1980). Misunderstandings in marital communication: A study of couples' nonverbal communication. Journal of Personality and Social Psychology, 39, 1135–1148.
O'Neill, R. E., Horner, R. H., Albin, R. W., Storey, K., & Sprague, J. R. (1990). Functional analysis of problem behavior: A practical assessment guide. Sycamore, IL: Sycamore Publishing.
Patterson, G. R. (1982). A social learning approach, Vol. 3: Coercive family process. Eugene, OR: Castalia Publishing Company.
Patterson, G. R., Reid, J. B., & Dishion, T. J. (1992). Antisocial boys. Eugene, OR: Castalia.
Pett, M. A., Wampold, B. E., Vaughn-Cole, B., & East, T. D. (1992). Consistency of behaviors within a naturalistic setting: An examination of the impact of context and repeated observations on mother–child interactions. Behavioral Assessment, 14, 367–385.
Prinz, R. J., & Kent, R. N. (1978). Recording parent–adolescent interactions without the use of frequency or interval-by-interval coding. Behavior Therapy, 9, 602–604.
Robin, A. L., & Foster, S. L. (1989). Negotiating parent–adolescent conflict: A behavioral-family systems approach. New York: Guilford.
Sackett, G. P. (1979a). The lag sequential analysis of contingency and cyclicity in behavioral interaction research. In J. D. Osofsky (Ed.), Handbook of infant development (pp. 623–649). New York: Wiley.
Sackett, G. P. (Ed.) (1979b). Observing behavior. Vol. 2: Data collection and analysis methods. Baltimore: University Park Press.
Schaap, C. (1982). Communication and adjustment in marriage. Lisse, Holland: Swets & Zeitlinger.
Serbin, L. A., Citron, C., & Connor, J. M. (1978). Covert assessment of observer agreement: An application and extension. Journal of Genetic Psychology, 133, 155–161.
Shores, R. E., Jack, S. L., Gunter, P. L., Ellis, D. N., Debreire, T. J., & Wehby, J. H. (1993). Classroom interactions of children with behavior disorders. Journal of Emotional and Behavioral Disorders, 1, 27–39.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. New York: Basic Books.
Skinner, B. F. (1938). The behavior of organisms. New York: Appleton-Century-Crofts.
Snyder, J. (1983). Aversive social stimuli in the Family Interaction Coding System: A validation study. Behavioral Assessment, 5, 315–331.
Stine, W. W. (1989). Interobserver relational agreement. Psychological Bulletin, 106, 341–347.
Suen, H. K., Ary, D., & Covalt, W. (1990). A decision tree approach to selecting an appropriate observation reliability index. Journal of Psychopathology and Behavioral Assessment, 12, 359–363.
Taplin, P. S., & Reid, J. B. (1973). Effects of instructional set and experimenter influence on observer reliability. Child Development, 44, 547–554.
Tapp, J., & Walden, T. (1993). PROCORDER: A professional tape control, coding, and analysis system for behavioral research using videotape. Behavior Research Methods, Instruments, and Computers, 25, 53–56.
Tapp, J., Wehby, J., & Ellis, D. (1995). A multiple option observation system for experimental studies: MOOSES. Behavior Research Methods, Instruments, and Computers, 27, 25–31.
Taylor, D. V., Hetrick, W. P., Neri, C. L., Touchette, P., Barron, J. L., & Sandman, C. A. (1991). Effect of naltrexone upon self-injurious behavior, learning, and activity: A case study. Pharmacology, Biochemistry, and Behavior, 40, 79–82.
Triangle Research Collaborative, Inc. (1994). P. O. Box 12167, 100 Park, Suite 115, Research Triangle Park, NC 27709.
Tryon, W. W. (1991). Activity measurement in psychology and medicine. New York: Plenum.
Van Widenfelt, B., Baucom, D. H., & Gordon, K. C. (1997). The Prevention and Relationship Enhancement Program: An empirical analysis. Manuscript submitted for publication.
Wampold, B. E. (1989). Kappa as a measure of pattern in sequential data. Quality and Quantity, 23, 171–187.
Wampold, B. E., & Holloway, E. L. (1983). A note on interobserver reliability for sequential data. Journal of Behavioral Assessment, 5, 217–225.
Watson, J. B., & Raynor, R. (1920). Conditioned emotional reactions. Journal of Experimental Psychology, 3, 1–12.
Weinrott, M. R., & Jones, R. R. (1984). Overt versus covert assessment of observer reliability. Child Development, 55, 1125–1137.
Weiss, R. L. (1986). MICS-III manual. Unpublished manuscript, Oregon Marital Studies Program, University of Oregon, Eugene.
Wieder, G. B., & Weiss, R. L. (1980). Generalizability theory and the coding of marital interactions. Journal of Consulting and Clinical Psychology, 48, 469–477.
Wilson, F. R. (1982). Systematic rater training model: An aid to counselors in collecting observational data. Measurement and Evaluation in Guidance, 14, 187–194.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.02 Single Case Experimental Designs: Clinical Research and Practice

STEVEN C. HAYES and JOHN T. BLACKLEDGE University of Nevada, Reno, NV, USA

3.02.1 INTRODUCTION
3.02.1.1 Brief History of Single Case Experimental Designs
3.02.1.2 Utility of Single-subject Designs
3.02.2 ESSENTIAL COMPONENTS OF SINGLE CASE EXPERIMENTAL DESIGNS
3.02.2.1 Repeated Measurement
3.02.2.2 Detailed Information
3.02.2.3 Graphing of Data
3.02.2.4 Creative Use of Design Elements
3.02.3 INTERPRETATION OF SINGLE-SUBJECT DATA
3.02.3.1 Variability and Stability
3.02.3.2 Level
3.02.3.3 Trend
3.02.3.4 Use of Statistics with Single-subject Data
3.02.4 VARIETIES OF SINGLE-SUBJECT DESIGNS
3.02.4.1 Within-series Design Elements
3.02.4.1.1 Simple phase change
3.02.4.1.2 Complex phase change elements
3.02.4.1.3 Parametric design
3.02.4.1.4 Final suggestions for interpreting within-series data
3.02.4.2 Between-series Designs
3.02.4.2.1 Alternating treatment design
3.02.4.2.2 Simultaneous treatment design
3.02.4.3 Combined-series Elements
3.02.4.3.1 Multiple baseline
3.02.4.3.2 Crossover design
3.02.4.3.3 Constant series control
3.02.5 FUTURE DIRECTIONS
3.02.5.1 Potential for Research Production and Consumption by Practicing Clinicians
3.02.5.2 Managed Health Care, Single-subject Design, and the Demonstration of Efficacy
3.02.6 CONCLUSION
3.02.7 REFERENCES


3.02.1 INTRODUCTION

The purpose of clinical research from the point of view of the consumer of research knowledge can be stated succinctly: “What treatment, by whom, is most effective for this individual with that specific problem under which set of circumstances, and how does it come about?” (Paul, 1969, p. 44). This question has always been of relevance to practicing clinical psychologists in the fee-for-service environment, but it is also of increasing relevance in the era of managed care. Mental health delivery systems cannot succeed, either in the world of public opinion or in the world of fiscal reality, without finding a way to deliver services that are both effective and efficient (Cummings, Cummings, & Johnson, 1997; Cummings, Pollack, & Cummings, 1996). In order to do that, Paul's clinical question above must be answered for the varieties of clients demanding and receiving services.

There is another way to say this. Clinical research must have external validity (must apply to the settings and clients of the research consumer), not merely internal validity (an unambiguous relationship between a dependent and independent variable). In group comparison research, external validity flows in principle from internal validity. In a classic group comparison design, researchers hope to randomly sample from a known population, randomly assign to a treatment and control group, and collect before and after measures on all. If these methodological requirements have been met, the results should apply to other random samples from that same population. In the practical world of clinical research, however, we do not have access to the entire population of interest (say, all panic disordered clients). We cannot randomly sample from this population because clients nonrandomly refuse to participate. We cannot randomly assign because clients nonrandomly drop out. And even if all this were not true, we never apply clinical research to other random samples from the same population; rather, a clinician treats the person who (nonrandomly) walked through the door.

External validity thus must be earned, whether in group comparison or single-case research, in a piecemeal inductive fashion by demonstrating that particular treatments work with particular clients with particular problems. In group comparison research, this is usually done by showing that treatments work in highly homogeneous groups of clients. By replicating effects across many such homogeneous groups, external validity can be demonstrated in group research. Efforts to show broad external validity in a single group comparison experiment are usually extremely difficult. Blocking or stratifying samples on even a few factors in a single study can lead to huge designs that cannot be mounted without millions of dollars of research funds. One compromise is to use diagnostic categories as a way of establishing homogeneity; however, large and unexplained between-subject variation invariably results because the current diagnostic system is based on loose collections of signs and symptoms rather than functional processes.

An alternative approach is to build this knowledge base about treatment response from the ground up, person by person. In this approach, clinical replication across myriad clients provides the evidence that a treatment effect holds for a population and that it is moderated by specific subject, therapist, or setting variables. That is the approach of single case experimental designs (SCEDs) or what has also been termed “time-series designs.” The former term is more popular but falsely suggests that the number of subjects is necessarily few in this approach. The latter correctly points to the source of the analysis but it is not very popular. Both terms will be used.

The purpose of this chapter is to show how single case designs are used in a research and practice environment, to provide an introduction to the various types of designs, and to explore the advantages of this research approach in the modern world of health care delivery.

3.02.1.1 Brief History of Single Case Experimental Designs

The origin of SCEDs in clinical psychology can be traced back to the beginnings of the scientist–practitioner model. Two years before the first Boulder Conference, Thorne (1947) advocated the use of such designs as a practical way for clinicians to incorporate empirical science into their everyday interactions with clients, a goal concurrently explicated by Shakow et al. (1947). The experimental designs proposed by Thorne were a very significant improvement over the traditional case study because continual data collection over a series of phase changes was required, allowing more objective, data-driven decisions to be made. These designs were an adaptation of those previously used by experimental psychologists (Ferster & Skinner, 1957) working with animals.

Single case designs became considerably more popular with the popularization of behavioral approaches in the 1960s. For example, Baer, Wolf, and Risley (1968) described a number of these designs in an early issue of the Journal of Applied Behavior Analysis. Hersen and Barlow's groundbreaking text (1976) in turn brought these designs into the mainstream of behavior therapy.

Probably the biggest factor inhibiting the use of SCEDs by many clinicians has been a bias against idiographic research. Yet many methodological leaders of the field have been quite accepting of such an approach. For example, Cronbach (1975) advocated careful observation of individual clinical cases using SCED controls, maintaining that the use of such design tools allows a level of detail in observation and hypothesis testing not available in traditional group designs. Many others (Campbell, 1957; Koan & McGuire, 1973; Snow, 1974) have agreed, adding that tightly controlled, fixed condition, group experimentation is often not well-suited to a science of vastly differing humans with often vastly differing needs. Cone (1986) noted that idiographic, single-subject research is particularly well-suited for detecting point-to-point behavioral changes occurring in response to environmental variables, including psychological treatment.

3.02.1.2 Utility of Single-subject Designs

A clinician carefully conducting an appropriate single-subject design for a client could both circumvent the above difficulties and obtain scientifically useful results. Often no subject pool larger than one is needed to conduct a single-subject design (although it is necessary that the results of many such analyses be considered before any general conclusions can be made about a treatment's efficacy). Single-subject experiments are extremely flexible. For example, if a client is not responding as hoped to a treatment, components can be added or subtracted or an altogether new treatment can be implemented, without necessarily damaging the validity of conclusions drawn from the experiment. Essentially, the only limitation on the usefulness of single-subject designs in research and clinical practice is the flexibility of the researcher. If use of a planned design does not allow an adequate interpretation of emerging data, then the design can be altered at that point.

Perhaps more importantly, a properly used single-subject design can be extremely useful in facilitating responsible assessment and treatment. A clinician conducting a single-subject experiment is forced to pay close attention to repeated assessments of client behavior that provide invaluable information about target behaviors and treatment efficacy. With the continuous feedback that single-subject experiments provide, treatment components not working as planned can be altered, abandoned, or supplemented.

The comprehensive data recorded over several single-subject designs can also be used to provide linkage between client characteristics and treatment success or failure. As more detailed information is gathered in time-series designs than in a typical group experiment, events in individual client lives and various client characteristics that coincide with declines in treatment efficacy can be identified and taken into consideration for subsequent clients. The advantage at this level is that variability due to sources other than treatment can be identified at the level of the individual. This means that when many cases are collected and analyzed, correlations between subject characteristics and treatment responsiveness can be more refined. For example, it may become apparent after conducting several single-subject designs mapping the effects of a given intervention that subjects with certain characteristics respond in characteristically similar ways. Such a hypothesis can be followed up using additional single subjects, or by using a group design. Subjects that fail to respond positively to a given treatment may do so because of a detectable reason, and this reason may point to an important aspect of a relevant theory. Outcomes like this critically depend on the foresight of the researcher. Potentially important background data, as indicated by theory and common sense, should be collected in every single-subject experiment.

The use of time-series designs also compels a clinician to focus on the careful description of patient problems and treatment characteristics, and how these data relate to treatment outcome. In such a manner, variables that are functionally important for treatment can become evident. Over a series of SCEDs, generalizations concerning the active and efficacious components of treatment can be made from such data.

Finally, analysis of SCEDs concentrates on the magnitude of treatment effects rather than their statistical significance. If the treatment analyzed in a properly conducted SCED is clinically significant, it will be clearly so.

3.02.2 ESSENTIAL COMPONENTS OF SINGLE CASE EXPERIMENTAL DESIGNS

Although various types of individual time-series designs exist, and are discussed in some detail later, several procedures common to all time-series designs are first described.

3.02.2.1 Repeated Measurement

The bedrock of single case experimental designs is repeated measurement of client functioning in relevant domains throughout the course of treatment (see Barlow & Hersen, 1984; Hayes, Barlow, & Nelson, 1997; and Kazdin, 1980, for information on specific measurement strategies). Repeated measurement allows variability within a case to be estimated over time; in a time-series approach, this variability is the source of information about treatment, measurement error, and extraneous factors.

In the real world of clinical evaluation, the goal is to use measures that are both high quality and highly practical. It is advisable that such measurements begin as soon as possible, ideally during the first session. It is also advisable, in choosing which instruments to administer to the client or research subject, that the researcher err on the side of caution and administer as many instruments as may be even partially relevant to the current experiment's concerns. Theory and even common sense should be used to guide the choice as to what instruments are relevant. Practical constraints also play an important part. If an adequate but imperfect instrument already exists and no time is currently available to create a new one, the adequate instrument may be used. Client self-monitoring and self-report are entirely acceptable. In fact, any method of collecting data is acceptable with single-subject designs, so long as the researcher deems it appropriate. Flexibility, again, is the by-word. But, as with any instrument, pay attention to its established reliability and validity before using or interpreting data. As treatment progresses, instruments useless or irrelevant to the current case can be discarded, but data not initially gathered can never be collected. It is worth spending a little extra time administering an instrument on a hunch on the chance that it will yield valuable information. Measurements are then ideally administered as frequently as is practical and meaningful.

3.02.2.2 Detailed Information

Specification of the particular intervention made with the client, including when each component is delivered and any possibly significant deviations from the standardized treatment, allows meaningful inferences to be made from collected data. For the data to be meaningful indicators of the effects of a treatment and its components, the clinician must be able to link specific phases of the intervention in time with the ongoing flow of outcome data as described above. Details regarding the nature of the treatments, the environment that the therapy was delivered in, and characteristics of the therapist provide a level of detail conducive to proper replication, especially if the completed SCED is made available to others. Steps taken in group research to ensure treatment integrity can be taken here as well. For example, a colleague or student might be asked to assess a clinician's adherence to a designated treatment protocol, as well as the competence with which the treatment is delivered. Specification of the treatment requires that the researcher have a clear, theoretically based idea of what they are attempting to accomplish and how to accomplish it. The type of treatment being used, its specific techniques, and the phases or subphases of each technique or strategy that are active should be noted. Enough detail should be added so that after the intervention is complete, an informed guess can be made as to what might have been responsible for observed effects.

Collection of detailed client information allows more meaningful inferences to be drawn from single-subject data. The recording of any information that might possibly affect the course and effectiveness of treatment may prove to be invaluable for data analysis. Seemingly relevant background information and significant events occurring in the client's life during the course of treatment qualify as important client information. If a client's spouse leaves during the course of treatment, for example, notation of this event may serve as a possible explanation for a brief or sustained decline in the treatment's effectiveness. Of importance here is the chronicling of any information about the client or the client's life that might be expected to affect their response to treatment. Over a series of SCEDs, such information can be used to speculate as to why clients reacted differentially to treatment.

3.02.2.3 Graphing of Data

Analysis of an individual time-series design requires a visual representation of the data. A simple line graph, with time plotted on the x-axis and measurement score plotted on the y-axis, should be sufficient for most data because it presents an immediate picture of variability within a data series. Pragmatic considerations determine what unit of time to plot. Thus, if an instrument is measuring behaviors in frequent time intervals but the behavior itself is better thought of in temporally larger units, analysis may be facilitated if time intervals are collapsed (e.g., if daily measurements are summed and recorded as weekly measurements). Frequent and creative graphing of data, from various angles, using different units of time, composite scores from separate measures, etc., can allow insights into the effects of treatment.

3.02.2.4 Creative Use of Design Elements

Individual time-series design elements (discussed below) should be thought of as tools, not restrictive agents. If a complex phase change with an interaction element is initially planned, and the data indicate the client might benefit most from continuing with the current treatment, then continuation of treatment is clearly justified. Unexpected client or therapist vacations can be an opportunity for a return to baseline phase, allowing a chance to monitor, by contrast, the effectiveness of treatment. Clinical common sense and a detailed understanding of the nature of the different design elements and when they are useful allow effective and serendipitous adaptation to events.

3.02.3 INTERPRETATION OF SINGLE-SUBJECT DATA

Time-series designs typically (though not always) consist of units of time called phases, each phase designating the continuous presence of a given condition (e.g., treatment, baseline, and so on). Data within each phase can be described as having various degrees of variability around a given level and trend. Use of statistical analyses is not necessary at the level of the individual, and indeed use of most inferential statistics with a single subject violates the assumptions of standard statistical tests. Clear graphing of data and a thorough understanding of interpreting such single-subject graphs are all that are needed to observe an effect. It is to the nature of variability, level, and trend, and to the opportunities they provide for the interpretation of the data, that we now turn.

3.02.3.1 Variability and Stability

Data within a phase are said to be stable to the extent that the effects of extraneous variables and measurement error, as reflected in variability within a subject across time, are sufficiently limited or identifiable that variability due to treatment can be ascertained. Determining stability requires that the clinician have some clarity about what treatment effects are large enough to be worth detection: the more variability due to extraneous factors or measurement error, the larger the treatment effect would have to be to be seen.

If the data are not stable (in the sense that important treatment effects might be obscured), the experimenter can (i) continue the phase until the data do become stable, (ii) examine the data using longer units of time if longer units are more sensible, or (iii) analyze the possible sources of variability. Each of these is defensible, but the last option is often the best because it can provide important information about the case.

3.02.3.2 Level

Level refers to the “height” on the y-axis at which the data tend to aggregate. For example, data taken during a treatment phase tending to stay at the upper points of the measurement scale would be notably distinguishable, in terms of level, from data congregating around the mid-point of a measurement scale.

3.02.3.3 Trend

Trend refers to the general linear direction in which data are moving across a given period of time. It takes a minimum of three data points to establish a trend and estimate variability around that trend.

Converging and diverging trends between phases can be analyzed to help differentiate variability due to treatment from variability due to extraneous sources. For example, if a client's data on a given measure show a positive trend in baseline that continues when treatment is implemented and continues yet again when the baseline is reinstated, it is likely that something other than treatment is responsible for the improvement. Conversely, if a strong positive trend during a treatment phase levels out or especially reverses after a change to a baseline phase, the clinician can usually feel confident that treatment is responsible for the change (unless some potentially significant life change happened to co-occur with the phase change). Thus, trends are not only useful indicators as to the improvement or decline of clients on various measures, but are also useful indicators, coupled with phase changes, of the sources of those changes.

3.02.3.4 Use of Statistics with Single-subject Data

For the most part, inferential statistics were designed for use in between-group comparisons. The assumptions underlying the widely accepted classical model of statistics are usually violated when statistical tests based on the model are applied to single-subject data. To begin with, presentation of conditions is not generally random in single-subject designs, and randomization is a necessary prerequisite to statistical analysis. More importantly (and more constantly), the independence of data required in classical statistics is generally not achieved when statistical analyses are applied to time-series data from a single subject (Sharpley & Alavosius, 1988). Busk and Marascuilo (1988) found, in a review of 101 baselines and 125 intervention phases from various single-subject experiments, that autocorrelations between data, in most cases, were significantly greater than zero and detectable even in cases of low statistical power. Several researchers have suggested using analyses based on a randomiza-

not mean the same thing as a statistically significant result with a group, and the assumptions and evidentiary base supporting classical statistics simply do not tell us what a significant result with a single subject means. Beyond the technical incorrectness of using nomothetic statistical approaches to idiographic data, it is apparent that such use of these statistics is of extremely limited use in guiding further research and bolstering confidence about an intervention's efficacy with an individual subject. If, for example, a statistically significant result were to be obtained in the treatment of a given client, this would tell us nothing about that treatment's efficacy with other potential clients.
Moreover, data indicat- tion task to circumvent the autocorrelation ing a clinically significant change in a single problem (Edgington, 1980; Levin, Marascuilo, client would be readily observable in a well- & Hubert, 1978; Wampold & Furlong, 1981). conducted and properly graphed single-subject For example, data from an alternating treat- experiment. StatisticsÐso necessary in detect- ment design or extended complex phase change ing an overall positive effect in a group of design, where the presentation of each phase is subjects where some improved, some worsened, randomly determined, could be statistically and some remained unchangedÐwould not be analyzed by a procedure based on a randomiza- necessary in the case of one subject exhibiting tion task. Some controversy surrounds the issue one trend at any given time. (Huitema, 1988), but the consensus seems to be that classical statistical analyses are too risky to use in individual time-series data unless at least 35±40 data points per phase are gathered 3.02.4 VARIETIES OF SINGLE-SUBJECT (Horne, Yang, & Ware, 1982). Very few DESIGNS researchers have the good fortune to collect so much data. Time-series design elements can be classified Time-series analyses where collected data is either as within-series, between-series, or simply used to predict subsequent behavior combined-series designs. Different types of (Gottman, 1981; Gottman & Glass, 1978) can time-series design elements within each group also be used, and is useful when such predictions of designs are used for different purposes are desired. However, such an analysis is not (Hayes et al., 1997). The nature of each type suitable for series with less than 20 points, as of element, as well as its purpose, will now be serial dependence and other factors will con- described. tribute to an overinflated alpha in such cases (Greenwood & Matyas, 1990). 
In cases where 3.02.4.1 Within-series Design Elements statistical analysis indicates the data is not autocorrelated, basic inferential statistical pro- A design element is classified as within-series cedures such as a t-test may be used. Finally, the if data points organized sequentially within a Box±Jenkins procedure (Box & Jenkins, 1976) consistent condition are compared to other such can technically be used to determine the sequential series that precede and follow. In presence of a main effect based on the departure such a design, the clinician is typically faced of observed data from an established pattern. with a graphed data from a single measure or a However, this procedure would require a homogenous group of measures, organized into minimum of about 50 data points per phase, a sequence of phases during which a consistent and thus is impractical for all but a few single- approach is applied. Phase changes occur, subject analyses. ideally, when data has stabilized. Otherwise, In addition, most statistical procedures are of practical or ethical issues determine the time at unknown utility when used with single-subject which phases must change. For example, an data. As most statistical procedures and inter- extended baseline with a suicidal client would pretations of respective statistical results were certainly not be possible, and time and financial derived from between-group studies, use of constraints may determine phase changes in these procedures in single-subject designs yields other cases. Aspects of specific design strategies ambiguous results. The meaning of a statisti- that influence phase length are discussed at the cally significant result with a lone subject does appropriate points below. Varieties of Single-subject Designs 29
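The descriptors introduced in Section 3.02.3 (level, trend, and variability around the trend), the collapsing of daily scores into weekly ones mentioned in Section 3.02.2.3, and the autocorrelation concern raised in Section 3.02.3.4 can be illustrated with a short Python sketch. This is only one possible operationalization and is not taken from the chapter: level is computed as the phase mean, trend as the least-squares slope per measurement occasion, and variability as the sample standard deviation of the residuals around that line. All function names are hypothetical.

```python
from statistics import mean, stdev


def collapse(scores, width):
    """Sum consecutive blocks of `width` scores (e.g., daily -> weekly).

    Only complete blocks are kept; a trailing partial block is dropped.
    """
    usable = len(scores) - len(scores) % width
    return [sum(scores[i:i + width]) for i in range(0, usable, width)]


def summarize_phase(scores):
    """Describe one phase by level, trend, and variability around the trend.

    Assumption: equally spaced measurement occasions numbered 0, 1, 2, ...
    """
    n = len(scores)
    if n < 3:
        # The text notes that at least three points are needed for a trend.
        raise ValueError("need at least three data points in a phase")
    times = list(range(n))
    t_bar = mean(times)
    level = mean(scores)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - level) for t, y in zip(times, scores))
    slope = sxy / sxx
    residuals = [y - (level + slope * (t - t_bar)) for t, y in zip(times, scores)]
    return {"level": level, "trend": slope, "variability": stdev(residuals)}


def lag1_autocorrelation(scores):
    """Lag-1 autocorrelation: how strongly each score resembles the previous one."""
    y_bar = mean(scores)
    denom = sum((y - y_bar) ** 2 for y in scores)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((a - y_bar) * (b - y_bar) for a, b in zip(scores, scores[1:]))
    return num / denom


if __name__ == "__main__":
    # Hypothetical daily counts for a two-week baseline phase.
    daily = [4, 5, 3, 6, 5, 4, 5, 6, 5, 7, 6, 5, 6, 7]
    print(collapse(daily, 7))
    print(summarize_phase(daily))
    print(lag1_autocorrelation(daily))
```

Numbers like these supplement, rather than replace, the graph: the chapter's position is that a clear plot of the same data is what the clinician should interpret.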

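The randomization-based analyses mentioned in Section 3.02.3.4 can be sketched as an exact randomization (permutation) test. The sketch below is illustrative only and is not the specific procedure of any of the cited authors. It assumes that each measurement occasion was randomly assigned to one of two conditions, with the number of occasions per condition fixed in advance; if the actual assignment scheme differed, the set of reassignments enumerated here would be the wrong reference set. The function names and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def mean_difference(scores, treated):
    """Mean of the occasions in `treated` minus the mean of the other occasions."""
    inside = [y for i, y in enumerate(scores) if i in treated]
    outside = [y for i, y in enumerate(scores) if i not in treated]
    return mean(inside) - mean(outside)


def randomization_test(scores, labels, condition):
    """Exact two-sided randomization test for a difference in condition means.

    Enumerates every way the observed number of `condition` occasions could
    have been assigned, and reports the proportion of assignments whose mean
    difference is at least as extreme as the one observed.
    """
    treated = {i for i, lab in enumerate(labels) if lab == condition}
    if not 0 < len(treated) < len(scores):
        raise ValueError("both conditions must occur at least once")
    observed = mean_difference(scores, treated)
    extreme = 0
    total = 0
    for candidate in combinations(range(len(scores)), len(treated)):
        total += 1
        # Small tolerance so ties are counted despite floating-point error.
        if abs(mean_difference(scores, set(candidate))) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / total


if __name__ == "__main__":
    # Hypothetical scores from eight randomly ordered occasions.
    scores = [3, 9, 4, 8, 10, 2, 9, 3]
    labels = ["A", "B", "A", "B", "B", "A", "B", "A"]
    print(randomization_test(scores, labels, "B"))
```

Full enumeration is practical only for short series, because the number of possible assignments grows very quickly with the number of occasions.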
Within-series design elements include the simple phase change, the complex phase change, the parametric design elements, and the changing criterion design. Each is described below. Note that specific phases are designated below by capital letters. Generally (but not always), the letter A refers to a baseline phase, and letters such as B and C refer to different interventions.

3.02.4.1.1 Simple phase change

If simply trying to determine a treatment's efficacy vs. no treatment at all for a given client, or comparing the relative efficacy of two validated treatments for a given client, then a simple phase change design is probably the right choice. Consider the case in which only a single treatment's efficacy is at issue. A similar approach can be taken to answer questions about the relative efficacy of two treatments.

In the standard simple phase change, baseline data on relevant client behaviors are typically taken for a period of time (enough time to estimate the level, trend, and variability around level and trend of the behavior of interest). This baseline phase can occur during initial sessions with the client while assessment (and no treatment per se) is taking place, while the client is on a waiting list, or through similar means. Treatment is then administered and while it is in place a second estimate is made of the level, trend, and variability around level and trend of the behavior of interest. If clear changes occur, treatment may have had an impact.

In order to control for extraneous events that might have co-occurred with the phase change from baseline to treatment, the phase change must be replicated. Usually this is done by changing from treatment back to baseline, but other options exist. Such a change, called a withdrawal, is useful in aiding the clinician in deciding whether it is indeed the treatment, and not some extraneous variable, that is responsible for any changes. However, certain ethical considerations must be made before a withdrawal is executed. If treatment definitely seems effective and is well-validated, it may not be necessary to withdraw the treatment to conclude that a treatment effect is likely. If the clinician is uncertain of the treatment's effectiveness, or if it is not well-validated, a withdrawal or some other means of replication is necessary in order to conclude that an effect has occurred. Even a short withdrawal can be sufficient if a clear trend is evident. Alternatively, a period of time where therapy is focused on a different, relatively unrelated problem can also be considered a withdrawal. Serendipitous opportunities for a return to a baseline phase can provide important information. Impromptu client or therapist vacations can be viewed as a chance to make more informed decisions regarding the treatment's efficacy.

More than one replication of the underlying phase change (such as an ABABAB design) may be necessary for the clinician to be confident of a treatment's effects for a particular client. Interpretation of the data is facilitated by referring to the specifics of stability, level, and trends as discussed above.

Examples of well-conducted simple phase change designs include Gunter, Shores, Denny, and DePaepe (1994), who utilized the design to evaluate the effects of different instructional interactions on the disruptive behavior of a severely behaviorally disordered child. Gunter et al. (1994) used a simple phase change with reversal (in this case ABAB), with the baseline or A phase consisting of a math assignment with between five and 15 difficult multiplication problems. The treatment or B phase consisted of equally difficult problems, but the experimenter would provide the correct answer and then present the same problem again whenever the subject answered incorrectly. Institution of the first treatment phase indicated a desirable effect, with the subject's rate of disruptive behavior falling from around 0.3 to around 0.1. Gunter et al. (1994) wisely decided to replicate the phase changes with the subject, allowing them to be more confident that extraneous variables such as time or (to some extent) order were not responsible for the changes.

Orsborn, Patrick, Dixon, and Moore (1995) provide another good, contemporary example of the simple phase change design (Figure 1). They used an ABAB design to evaluate the effect of reducing the frequency of teacher's questions and increasing the frequency of pauses on the frequency of student talk, using 21 first- and second-grade subjects. Both B phases showed a marked increase in student talk frequency relative to baseline phases. The strength of the design, however, could have been improved with the inclusion of more data points. Five data points were collected during baseline phases, and three in intervention phases. Three is an acceptable minimum, but more data points are advisable.

Figure 1 An adaptation of a simple phase change with reversal design. Data were relatively stable before phase changes (adapted from Orsborn et al., 1995).
[Plot: phases A, B, A, B; y-axis: Talk (s), 0 to 360; x-axis: Sessions, 0 to 20.]

3.02.4.1.2 Complex phase change elements

A complex phase change combines a specific sequence of simple phase changes into a new logical whole.

(i) ABACA

When comparing the effectiveness of two (or more) treatments relative to each other and to no treatment, an ABACA complex phase change element might be used. Typically, the clinician would choose such a design when there is reason to believe that two treatments, neither yet well-validated, may be effective for a given client. When it is unclear whether either treatment will be effective, treatment phases can be interspersed with baseline phases. Phase changes conducted in such a manner will allow meaningful comparisons to be made between both treatments, as well as between each treatment and no treatment. After administering each treatment once, if one results in clearly more desirable data level and trend, it may be reinstated. If each treatment's relative efficacy is still unclear, and the data give no reason to believe that either may be iatrogenic, phase changes may be carried out as is practical. A sequence such as ABACACAB might not be unreasonable if the data and clinical situation warranted. Regardless of the sequence implemented, the clinician should remain aware that order effects in a complex phase change can be critical. The second treatment administered may be less effective simply because it is the second treatment. Counterbalancing of phase sequences in other cases can circumvent such ambiguity.

The clinician should stay alert to the possibility of introducing even a third treatment in such a design, if the original treatments do not appear to be having the desired effect. An example of such a situation is shown in a study by Cope, Moy, and Grossnickle (1988). In this study, McDonald's restaurants promoted the use of seat belts with an advertising campaign (phase B), and the distribution of instructive stickers for children asking them to "Make it Click" (phase C). These two interventions were tried repeatedly. Finally, an incentive program was implemented giving away soft drinks for drivers who arrived at McDonald's with their seat belt fastened. The design could be described as an ABCBCBDA.

The spirit here, as always when using time-series approaches, should be one of flexible and data-driven decisions. With such a spirit, the clinician should be willing to abandon even an empirically validated treatment if it is clear, over a reasonable length of time, that there is no positive effect. No treatment is all things to all people.

Another well-conducted complex phase change design was reported by Peterson and Azrin (1992; Figure 2). Three treatments, including self-monitoring (phase B), relaxation training (phase C), and habit reversal (phase D), were compared with a baseline or no treatment (phase A). Six subjects were used, and the authors took advantage of the extra subjects by counterbalancing the presentation of phases. For example, while the first subject was presented with the phase sequence AAABCDCBDA, other subjects were presented with sequences such as AAADCBCDBA, AAACDBDCBA, and AAACDBDCBA. A minimum of three data points were contained in each phase (generally four or more data points were used), and the authors more often than not waited until stability was achieved and clear trends were present before changing phases for subjects.
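The replication logic behind a phase change with reversal (for example, ABAB) can be sketched in Python. The sketch below is a deliberately crude illustration, not a method proposed in the chapter: it compares phase levels only (phase means) and asks whether every change out of baseline moves the level in one direction while every return to baseline moves it back. It ignores trend and variability, which the text treats as equally important, and all names and data are hypothetical.

```python
from itertools import groupby
from statistics import mean


def phase_levels(labels, scores):
    """Split a session-by-session series into consecutive phases.

    `labels` gives the condition in force at each session (e.g., "A" or "B").
    Returns a list of (label, mean score) pairs, one per phase, in order.
    """
    levels = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        levels.append((label, mean(scores[start:start + length])))
        start += length
    return levels


def effect_is_replicated(levels, baseline="A"):
    """True if level shifts reverse consistently with the phase changes."""
    into_treatment = []    # level change when leaving a baseline phase
    back_to_baseline = []  # level change when returning to baseline
    for (lab1, m1), (lab2, m2) in zip(levels, levels[1:]):
        if lab1 == baseline and lab2 != baseline:
            into_treatment.append(m2 - m1)
        elif lab1 != baseline and lab2 == baseline:
            back_to_baseline.append(m2 - m1)
    if not into_treatment or not back_to_baseline:
        return False  # no withdrawal, so nothing has been replicated
    rises = all(d > 0 for d in into_treatment) and all(d < 0 for d in back_to_baseline)
    falls = all(d < 0 for d in into_treatment) and all(d > 0 for d in back_to_baseline)
    return rises or falls


if __name__ == "__main__":
    labels = list("AAABBBAAABBB")  # hypothetical ABAB design
    scores = [10, 12, 11, 20, 22, 21, 11, 10, 12, 23, 22, 24]
    levels = phase_levels(labels, scores)
    print(levels)
    print(effect_is_replicated(levels))
```

A series that simply keeps rising through every phase fails this check, which matches the earlier point that a baseline trend continuing through treatment and withdrawal suggests something other than treatment is responsible.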

Figure 2 An example of a complex phase change. Waiting for the data in the first and third baseline (A) phases to stabilize (as well as the data in the first D phase) would have been preferable before initiating phase changes (adapted from Peterson and Azrin, 1992).
[Plot: phase sequence A, A, A, C, B, D, B, D, C, A; y-axis: Number of tics, 0 to 200; x-axis: Minutes, 0 to 140 (measurements taken every 2.5 minutes).]

(ii) Interaction element

In an interaction element the separate and combined effects of intervention elements are examined. Use of this design element is appropriate both when a treatment is working and the clinician wishes to ascertain whether it will continue working without a particular (and costly in terms of time and/or money) component, or when a treatment is not maximally effective and the clinician believes adding or subtracting a specific component might enhance the treatment's efficacy.

White, Mathews, and Fawcett (1989) provide an example of the use of the interaction element. They examined the effect of contingencies for wheelchair push-ups designed to avoid the development of pressure sores in disabled children. Wheelchair push-ups were automatically recorded by a computer. After a baseline, two subjects were exposed to an alarm avoidance contingency (B), a beeper prompt (C), or a combination. An interaction design element was combined with a multiple baseline component. The design for one subject was an A/B+C/B/B+C/B/B+C/C/B+C, and for the other was A/B+C/C/B+C/C/B+C. Each component (B or C) was more effective than a baseline, but for both children the combined (B+C) condition was the most effective overall.

Shukla and Albin (1996) provided another good example when examining the effects of extinction vs. the effects of extinction plus functional communication training on problem behaviors of severely disabled subjects (Figure 3). Extinction had previously been found to be effective in reducing the frequency of target behaviors, but it had the unfortunate effect of sometimes causing more problematic behaviors to emerge. By alternating phases consisting of extinction alone and extinction plus communication training, the authors were able to show that the second condition resulted in uniform decreases in problematic behavior.

(iii) Changing criterion

The changing criterion design element consists of a series of shifts in an expected benchmark for behavior, such that the correspondence between these shifts and changes in behavior can be assessed. It is particularly useful in the case of behaviors that tend to change gradually, provided that some benchmark, goal, criterion, or contingency is a key component of treatment.

The establishment of increasingly strict limits on the number of times a client smokes per day provides a simple example. A changing criterion design could be implemented when it is unclear as to whether the criteria themselves, and no other variable, were responsible for observed changes in smoking. As another example, a changing criterion design could be implemented to assess the degree to which changing minimum numbers of social contacts per day affects actual social contacts in a socially withdrawn client.
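The idea of assessing the correspondence between criterion shifts and behavior can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than a procedure from the chapter: for each criterion phase it reports the criterion, the phase level (mean), and the proportion of observations that met the criterion. The function name, the `ceiling` flag, and the cigarette counts are all hypothetical.

```python
from statistics import mean


def tracking_report(phases, ceiling=True):
    """Summarize how closely behavior follows a shifting criterion.

    `phases` is a list of (criterion, observations) pairs in the order the
    criteria were in force. With `ceiling=True` the criterion is a maximum
    (e.g., cigarettes per day); with `ceiling=False` it is a minimum
    (e.g., social contacts per day).
    """
    report = []
    for criterion, observations in phases:
        if ceiling:
            met = [obs <= criterion for obs in observations]
        else:
            met = [obs >= criterion for obs in observations]
        report.append({
            "criterion": criterion,
            "level": mean(observations),
            "proportion_met": sum(met) / len(met),
        })
    return report


if __name__ == "__main__":
    # Hypothetical daily cigarette counts under three successive limits.
    phases = [
        (30, [29, 30, 28, 30]),
        (22, [22, 21, 23, 22, 20]),
        (15, [15, 14, 15]),
    ]
    for row in tracking_report(phases):
        print(row)
```

If the phase level moves each time the criterion moves, and stays near the criterion within each phase, the data are consistent with the criterion controlling the behavior; the graph remains the primary evidence.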

Figure 3 An interaction element design. The interaction (B+C) condition did not yield a better result than the B condition, but the demonstration of no further efficacy was still an important finding (adapted from Shukla & Albin, 1996).
[Plot: phases A, B, A, B+C; y-axis: Problem behavior per minute, 0 to 5; x-axis: Sessions, 0 to 40.]

In order to maximize the possibility that the data from a changing criterion design are correctly interpreted, five heuristics seem useful. First, the number of criterion shifts in the design should be relatively high. Each criterion shift is a replication of the effect of setting a criterion on subsequent client behavior. As with any type of experiment, the more frequently the results are replicated, the more confident the researcher can be of the effect and its causes. As a rule of thumb, four or more criterion shifts should occur when a changing criterion design is implemented.

Second, the length of the phase in which one level of the criterion is in effect should be long enough to allow the stability, level, and trend of the data to be interpreted relative to the criterion. Additionally, if a clear trend and level do not initially emerge, the criterion should remain in effect until a clear trend and level do emerge. Belles and Bradlyn (1987) provide a good example of properly timed criterion changes. The goal was to reduce the smoking rate of a long-time smoker who smoked several packs a day. The client recorded the number of cigarettes smoked each day (with reliability checks by the spouse). After a baseline period, goals were set by the therapist for the maximum number of cigarettes to be smoked. If the criterion was exceeded, the client sent a $25 check to a disliked charity. For each day the criterion was not exceeded, $3 went into a fund that could be used to purchase desirable items. Each criterion was left in place for at least three days, and the length of time, magnitude, and direction of criterion shifts varied.

Third, criterion changes should occur at irregular intervals, as in the Belles and Bradlyn (1987) study. As certain behaviors may change in a cyclical or gradual manner naturally, criteria should be shifted after differing lengths of time. Thus, if the length of one criterion's phase happens to correspond to changes occurring naturally, the phase lengths of other levels of the criterion will be unlikely to continue this trend.

Fourth, the magnitude of criterion shifts should be altered. If the data can be shown to track criterion changes of differing magnitudes, the statement that the criterion itself is responsible for observed changes can be made with a greater level of assurance.

Finally, a periodic changing of the direction of criterion shifts can be useful in assisting interpretations of effects. Such a strategy is similar to the reversal common in simple and complex phase changes. If client behavior can be shown to systematically track increasing and decreasing criteria, the data can be more confidently interpreted to indicate a clear effect of the changing criteria on those behavioral changes.
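The five heuristics above lend themselves to a simple planning checklist. The Python sketch below is one hypothetical way to encode them and is not part of the chapter. The threshold of four shifts comes from the rule of thumb in the text; the minimum phase length of three observations is an assumption borrowed from the earlier remark that three points are needed to establish a trend, and it cannot stand in for the clinical judgment that a phase has actually stabilized.

```python
def check_changing_criterion_plan(criteria, phase_lengths,
                                  min_shifts=4, min_phase_length=3):
    """Check a planned changing criterion design against the five heuristics.

    `criteria` lists the successive criterion values; `phase_lengths` lists
    the number of observations planned under each criterion, in the same order.
    """
    if len(criteria) != len(phase_lengths):
        raise ValueError("one phase length is required per criterion")
    shifts = [later - earlier for earlier, later in zip(criteria, criteria[1:])]
    return {
        # 1. A relatively high number of criterion shifts.
        "enough_shifts": len(shifts) >= min_shifts,
        # 2. Phases long enough to read stability, level, and trend.
        "phases_long_enough": all(n >= min_phase_length for n in phase_lengths),
        # 3. Criterion changes at irregular intervals.
        "irregular_phase_lengths": len(set(phase_lengths)) > 1,
        # 4. Shifts of differing magnitude.
        "varied_magnitudes": len({abs(s) for s in shifts}) > 1,
        # 5. At least one change in the direction of the shifts.
        "direction_reversed": any(s > 0 for s in shifts) and any(s < 0 for s in shifts),
    }


if __name__ == "__main__":
    # Hypothetical plan: daily cigarette limits and days spent at each limit.
    plan = check_changing_criterion_plan(
        criteria=[40, 30, 25, 28, 20, 10],
        phase_lengths=[3, 5, 4, 3, 6, 4],
    )
    for heuristic, satisfied in plan.items():
        print(heuristic, satisfied)
```

A plan that fails one of these checks is not necessarily invalid; the checks only flag where the design offers fewer opportunities to rule out alternative explanations.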

DeLuca and Holborn (1992) used the changing criterion design in analyzing the effects of a variable-ratio schedule on weight loss in obese and nonobese subjects (Figure 4). Phase sequences consisted of a baseline (no treatment) phase, three different criterion phases, an additional baseline phase, and finally a return to the third criterion phase. The criterion involved the number of revolutions completed on a stationary exercise bike during the allotted time in each phase. Criterion phases were determined by a calculation of 15% over the criterion in place for the previous phase; when the criterion was met, a subject would receive reinforcement in the form of tokens exchangeable for established reinforcers. Each phase (except for the five-session final phase) lasted for eight 30-minute sessions, resulting in eight data points per phase. The increasing variable ratio schedules were shown, through use of this design, to exert control over increased frequencies of revolution. Although the design did not exhibit some typical features of the changing criterion design, such as staggered phase length, criterion direction reversals, and varying phase change depths, the observation of clear effects was facilitated by the return to baseline phase and subsequent replication of the third criterion phase. In addition, the use of the design in an exercise program was prudent, as exercise involves gradual performance improvements of a type easily detected by the changing criterion design.

Figure 4 An example of a changing criterion design. Dotted horizontal lines indicate set criterion in each phase. The use of more than three criteria might have been preferable to further indicate experimental control over behavior, but the design seems adequate nonetheless (taken from DeLuca & Holborn, 1992).
[Plot: phases A, 80 rpm, 115 rpm, 130 rpm, A, 130 rpm; y-axis: Revolutions per minute (rpm), 0 to 160; x-axis: Sessions, 0 to 40.]

3.02.4.1.3 Parametric design

When it is suspected that different levels or frequencies of a component of an intervention might have a differential effect on client behavior, a parametric design can be implemented. Such designs are frequently used to assess the effects of different psychotropic drug dosages, but the design is relevant for many other purposes. Kornet, Goosen, and Van Ree (1991) demonstrated the use of the parametric design in investigating the effects of Naltrexone on alcohol consumption.

Ideally, a kind of reversal can be incorporated into a parametric design, where levels of the independent variable in question are systematically increased and then decreased. As with the changing criterion element, showing that an effect tracks a raised and lowered standard bolsters the confidence with which an interpretation is made. Baseline data should usually be taken before and after changes in the parameter of interest. If certain levels of the parameter provide interesting or unclear data, alternations between those levels, even if not originally planned, can aid in clarification (e.g., if the sequence A/B/B'/B''/B'''/B''/B'/B/A was originally planned and the data spur increased interest in the B and B' levels, a sequence such as B/B'/B/B' could be inserted or added).

An example of acceptable use of parametric design is provided by Stavinoah, Zlomke, Adams, and Lytton (1996). The experimenters systematically varied dosages of haloperidol and fluoxetine while measuring the frequency of the subject's impulsive aggressive behaviors (IABs). Dosages of haloperidol ranged from about 40 mg to 20 mg over the 40 weeks that it was administered. As a 40 mg dose initially resulted in an average of 10 IABs per week, dosage was reduced to 20 mg after 12 weeks. During the next 34 weeks, IABs increased to 13 per week, and the decision was made to increase the dosage back to 40 mg. This resulted in an average of 45 IABs per week for the four weeks this dosage was in place. The experimenters then administered a 20 mg dose of fluoxetine for the next 62 weeks, resulting in an average IAB frequency of near zero. A 40 mg dose of fluoxetine administered for 58 weeks yielded IAB frequencies of near zero. A subsequent reduction to 20 mg for five weeks resulted in an average IAB frequency of 12; the dosage was then returned to 40 mg, with a resulting IAB frequency of almost zero. Ideally, less time could have been spent at each dosage, and a greater variety of dosages could have been employed. But the experimenters did vary dosage and even drugs, and did so with a sufficient number of data points to determine the effect each drug and dosage had on behavior.

Lerman and Iwata (1996) provide a better example of use of the parametric design in treating the chronic hand mouthing of a profoundly retarded man (Figure 5). Sessions were conducted two or three times per day. Baseline frequencies (with no attempts to stop the hand mouthing behavior) were first calculated; baseline rates of three times per minute, on average, were recorded over several sessions. During the intervention phase, all subject attempts at hand mouthing were blocked by the experimenter putting his hand in front of the subject's mouth, resulting in less than one instance of hand mouthing per minute. Subsequently, attempts were blocked at a ratio of 1 block per 2 attempts, 1/4, 1/2, 2/3, and 3/4. The frequency of hand mouthing remained near zero over all levels. The experimenters properly used a descending/ascending order for levels, but also allowed subject behavior to determine what blocking schedule would be used. The experimenters thus remained responsive to the data, and their efforts yielded a less intensive intervention than one block per attempt.

3.02.4.1.4 Final suggestions for interpreting within-series data

Besides the suggestions listed above for interpreting various types of within-series data, additional general suggestions are offered here. First, as with any type of data-producing procedure, replicated effects are always more believable than nonreplicated effects. If an effect is consistently duplicated across several clients or across several different behaviors in the same client, the clinician can feel more confident in stating that the treatment in question is responsible for the effect. Each additional reinstatement of a treatment phase resulting in client improvement within a single series of data should also allow more confident statements about that treatment's efficacy with that client to be made. Second, effects are much more believable to the extent that they occur in a consistent manner. Third, changes of a greater magnitude (when parsing out changes apparently caused by extraneous variables) should generally be taken as more robust evidence of the treatment's effects. Fourth, effects occurring immediately after the onset of a treatment phase are logically stronger indicators that the treatment is responsible for the changes than are delayed effects, since fewer alternative explanations exist for the effects seen. Fifth, greater changes in the level and trend of the data are generally more indicative of a treatment's efficacy. Sixth, any effects not explainable by variables other than treatment should naturally be more convincing. Finally, all effects should be interpreted while considering the background variability of the data. The variability in the data around a given level and trend in a consistent condition provides an individual estimate of the impact of extraneous factors and measurement error against which any treatment effect is seen. If, for example, the level and/or trend of baseline data at times overlaps with the level and/or trend of treatment phase data, the clear possibility that factors other than treatment may be augmenting (or inhibiting) the treatment effect should be considered.

Brief mention of some commonly occurring conditions illustrates application of the guidelines discussed thus far. If a significant upward or downward data trend begins one to three data points before a phase change is planned, delay of the phase change until the data stabilize is suggested. A trend is significant if it is of a greater degree than changes attributable to background variability, such as would be observed when a series of relatively stable data fluctuates around an average level. However, if an established and significant trend has emerged over a few (e.g., three or more) data points, a phase change might be acceptable if such results had been expected. Instability and unclear trends are obviously of less importance at the beginning of a phase than at the end; data at first uninterpretable often have a way of telling a clearer story after more data are collected. The
Baseline Response BL Response (BL) block 1/1 block 1/2 1/4 1/2 2/3 3/4 1/1 5

Figure 5 An example of a parametric design (y-axis: responses per minute; x-axis: sessions). Response block conditions refer to differing levels of intervention; that is, 1/4 translates to one response block per every four handmouthing attempts (adapted from Lerman & Iwata, 1996).

value of collecting as many data points as feasible (e.g., seven or more) becomes clear after only a few data points have been graphed.

3.02.4.2 Between-series Designs

Within-series designs involve comparisons of sequences of repeated measurements in a succession of consistent conditions. Between-series designs are based on a comparison of conditions that are concurrent or rapidly alternating, so that multiple data series are simultaneously created. Pure between-series designs consist of the alternating treatments design and the simultaneous treatment design.

3.02.4.2.1 Alternating treatment design

The alternating treatment design (ATD) consists of rapid and random or semirandom alternation of two or more conditions such that each has an approximately equal probability of being present during each measurement opportunity. As an example, it was observed during a clinical training case that a student therapist, during many sessions, would alternate between two conditions: leaning away from the client and becoming cold and predictable when he was uncomfortable, and leaning towards the client and becoming warm and open when feeling comfortable. The client would disclose less when the therapist leaned away, and more when he leaned forward. If it were assumed that the therapist had preplanned the within-session alternations, an ATD as shown in Figure 6 would be obtained. The condition present in the example at any given time of measurement is rapidly alternating. No phase exists; however, if the data in each respective treatment condition are examined separately, the relative level and trend of each condition can be compared between the two data series (hence the name between-series designs).

An example of the ATD is provided by Jordan, Singh, and Repp (1989). In this study, two methods of reducing stereotypical behavior (e.g., rocking, hand-flapping) in retarded subjects were examined: gentle teaching (the use of social bonding and gentle persuasion with the developmentally disabled) and visual screening (covering the client's eyes for a few seconds following stereotypic behavior, thus reducing visual stimulation including that provided by these movements). Each of the two conditions was randomly alternated with a baseline condition. After a baseline period, visual screening produced a dramatic reduction in stereotypy, whereas gentle teaching had only a transient effect.

Another proper use of the alternating treatments design is provided by Van Houten (1993; Figure 7). Four children were taught subtraction using two different procedures (one procedure involved the use of a general rule,

the other involved only rote learning). The procedures were alternated randomly every 15 minutes over the length of 15 or more sessions, and the subtraction problems used in each session were counterbalanced across the subjects so that effects could be attributed to the teaching methods and not the problem sets. The use of an ATD rather than a complex phase change was prudent, as the order in which the methods were presented in longer phases could probably have exerted a practice effect.

Figure 6 A hypothetical example of the use of an ATD to assess the effects of therapist behavior on client self-disclosure (y-axis: amount of client self-disclosure; x-axis: time; data series: forward and warm, back and cool).

One of the benefits of the ATD is the simplicity with which it can be used to compare three or even more treatment conditions. Proper comparisons of three conditions in a within-series design can be difficult due to counterbalancing concerns, order effects, and the sheer number of phase changes that need to be executed over a relatively long period of time. With an ATD, three or even more conditions can be presented in a short time. The rapid and random alternations between conditions make order effects less likely, but multiple treatment interference (the impact of one treatment is different due to the presence of another) is arguably likely. ATDs are ideally used with behaviors emitted at a relatively high frequency, which correspondingly allows many instances of each alternate intervention to be applied. However, the design may be used with relatively infrequent behaviors if data are collected for a longer period of time. In addition, behaviors on which a discrete intervention tends not to have an impact for long after it is made and withdrawn make better targets for an ATD. If a change initiated by such a discrete intervention continues over a long period of time, effects of subsequent interventions are obscured and reliable data interpretation is often not possible.

ATDs hold several other advantages over standard within-series designs. First, treatment need not be withdrawn in an ATD; if treatment is periodically withdrawn, it can be for relatively short periods of time. Second, comparisons between components can be made more quickly. If a clear favorite emerges early in a well-conducted ATD, the clinician can be reasonably sure that its comparative efficacy will be maintained. McCullough, Cornell, McDaniel, and Mueller (1974), for example, compared the relative efficacy of two treatments in four days using an ATD. ATDs can be used without collecting baseline data, or with baseline data through the creation of a concurrent baseline data series. Any background within-series trends (such as those due to maturation of the client or etiology of the disorder) are unlikely to obscure interpretation of the data because the source of data comparisons is purely between series, not within.

The ATD requires a minimum of two alternations per data series. As both series can be combined to assist assessments of measurement error and extraneous factors, the number of data points required is less than with a within-series design. The collection of more than two data points per series is typical and highly useful, however. In a sense, each alternation is a replication, and conclusions from all time-series designs can be stated with more confidence with each consistent replication.

When planning alternations, the clinician should be alert to the duration, after presentation, of a component's effect. An administered drug, for example, exerts an effect over a period of time, and presenting a new treatment component before that time has expired would confound interpretation.
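The random or semirandom alternation of conditions described for the ATD can be illustrated with a short sketch. The Python below is an illustration only and is not a procedure from the chapter: the function names `semirandom_schedule` and `longest_run`, the equal-count rule, and the `max_run` cap on consecutive repeats are all assumptions, chosen so that each condition has a roughly equal chance of being present at each measurement opportunity.

```python
import random


def longest_run(sequence):
    """Length of the longest stretch of identical consecutive items."""
    best = run = 1
    for prev, cur in zip(sequence, sequence[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def semirandom_schedule(conditions, per_condition, max_run=3, seed=None):
    """Return a randomized order of conditions for an ATD-style schedule.

    Each condition appears exactly `per_condition` times, and no condition
    occurs more than `max_run` times in a row. Both rules are illustrative
    assumptions, not requirements stated in the chapter.
    """
    if len(conditions) < 2 or len(set(conditions)) != len(conditions):
        raise ValueError("need at least two distinct conditions")
    if per_condition < 1 or max_run < 1:
        raise ValueError("per_condition and max_run must be at least 1")
    rng = random.Random(seed)
    pool = [c for c in conditions for _ in range(per_condition)]
    while True:
        # Shuffle, then reject any order that contains an overly long run.
        rng.shuffle(pool)
        if longest_run(pool) <= max_run:
            return list(pool)


# Hypothetical example: two conditions, six measurement occasions each.
schedule = semirandom_schedule(["A", "B"], per_condition=6, max_run=3, seed=1)
print("".join(schedule))
```

Capping the run length is one possible way to keep the alternation rapid; a purely random order (no cap) corresponds to setting `max_run` to the total number of measurement occasions.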

Figure 7 An ATD in which rule-learning trials are interspersed with rote-learning trials (y-axis: correct (%); x-axis: sessions; data series: rule, rote). A practice or generalization effect is apparent in the rote-learning condition beginning around session 8 (adapted from Van Houten, 1993).

One of the shortcomings of the ATD is that observed effects in the design can be due to the way in which conditions are presented and combined. Three areas of concern in this domain of multiple treatment interference are sequential confounding, carry-over effects, and alternation effects (Barlow & Hayes, 1979; Ullman & Sulzer-Aszaroff, 1975).

Sequential confounding occurs when there is a possibility that a treatment condition A yields different effects when presented before a treatment condition B than it does when presented after condition B. To control for sequential confounding, the clinician is encouraged to alternate treatment conditions randomly or at least semirandomly. With a randomly delivered sequence such as ABBBAABABBBABBAAAABBAAABAA, if consistent differences between each condition's effects continue to show up throughout the sequence, despite the fact that the order and frequency of each condition's presence differ through the sequence, the clinician can be relatively certain that observed effects are not an artifact of order of condition presentation.

A carry-over effect occurs when the presentation of one condition somehow affects the impact of the subsequent condition, regardless of the presentation order of the conditions. Potentially this can occur in two ways. The effects of two conditions can change in opposite directions, or in the same direction. For example, when a strong reinforcer is delivered after a weak reinforcer, the weak reinforcer can subsequently cease to reinforce the desired behavior at all while the other reinforcer has a strong effect. At times, exposure to one condition results in a similar response to a somewhat similar second condition. Implementing each condition for a relatively short period of time can help reduce these problems (O'Brien, 1968), as might clear separations between each treatment condition (such as introducing only one treatment condition per session).

Several procedures exist to help detect multiple treatment interference (Sidman, 1960). A simple phase change where one treatment condition is preceded by a baseline phase, when compared to another AB design containing the other treatment, and finally compared to an ATD combining both conditions, could be used to parse out the separate and interactive effects of the treatment conditions. Alternatively, the intensity of one treatment condition could be increased, with any subsequent changes in the following conditions (as compared to changes already witnessed in an ATD containing both conditions) attributable to carry-over effects.

Some additional caveats regarding proper use of the ATD are of note. First, although a baseline phase is not necessary in an ATD, inclusion of baseline data can be useful for both gathering further information on client functioning and interpreting the magnitude of treatment effects. If periodic baseline points can be included within the ATD itself, valuable information regarding background level, trend, and variability can also be gleaned, over and above what would be interpretable if treatment conditions alone were present.

Second, it is important to realize that although ATDs can effectively be used with four or even more treatment conditions and corresponding data series, an upper limit exists on the number of data series that can be meaningfully interpreted. One useful heuristic (Barlow, Hayes, & Nelson, 1984) is to count the number of data points that will likely be collected for the ATD's purpose and then divide this number by the desired number of data series. If several data points will be collected for each series, the clinician should be able to proceed as planned.

Third, the clinician must consider the amount of data overlap between data series when interpreting ATD results. Overlap refers to the duplication of level between series. Several issues must be considered if considerable overlap exists. First, the percentage of data points, relative to all data points in an involved series, that indeed overlap can be calculated. If this percentage is low, the likelihood of a differential effect is higher. Second, the stability of the measured behavior should be considered. If the frequency of a given behavior is known to vary widely over time, then some overlap on measures of that behavior between ATD conditions would be expected. Third, the clinician must note if any overlapping trends occur in the two conditions. If data in two series are similar not only in level but also in trend, then it seems plausible that a background variable, rather than the treatment conditions, might affect the data.

One final example is shown in Figure 8. These are the data from an airplane-phobic client in the study on the effect of cognitive coping on progress in desensitization (Hayes, Hussian, Turner, Anderson, & Grubb, 1983). Notice that there is a clear convergence as the two series progress. The orderliness of the data suggested that the results from cognitive coping were generalizing to the untreated scenes. Alternating a reminder not to use the coping statements with the usual statements then tested this possibility. The data once again diverged. When the original conditions were then reinstated, the data converged once more. This showed that the convergence was a form of systematic generalization, rather than a lack of difference between the two conditions. This is also a good example of the combination of design elements to answer specific questions. This particular design does not have a name, but it is a logical extension of design tools discussed above.

3.02.4.2.2 Simultaneous treatment design

The simultaneous treatment design (STD) is similar to the ATD, except that the two treatments are continuously present but are accessed by the choice of the subject. What is plotted is not the impact of the treatment but the degree to which it is accessed. In other words, an STD measures preference. As an example, suppose a clinician wished to assess the motivation of a disabled child for different kinds of sensory stimulation. Several kinds of toys that produced different sensory consequences could be placed in a room with the child, and the percentage of time played with each kind of toy could be recorded and graphed. This would be an STD.

3.02.4.3 Combined-series Elements

Combined-series designs contain elements from both within- and between-series designs, and combine them into a new logical whole. Although many examples in the literature contain elements of both between- and within-series designs, true combined-series designs involve more than merely piecing components together. What distinguishes combined-series elements from any combination of elements is that the combination yields analytical logic.

3.02.4.3.1 Multiple baseline

One of the more often encountered SCEDs is the multiple baseline. An adaptation of the simple phase change, the multiple baseline design allows control over some variables that often confound interpretation of within-series phase change data. Upon a phase change from baseline to treatment in a simple phase change, for example, the data would ideally indicate a sudden and very strong treatment effect. It would be arranged in such a way that background variability could be easily detected against the effects of the treatment. Even when this ideal outcome occurs, however, there is also the possibility that some extraneous variable having a strong effect on measured outcomes might co-occur with the onset of a treatment phase. Such an occurrence would have dire implications for interpretation of the simple phase change result. The multiple baseline design allows considerable control over such threats to validity.

Essentially, the multiple baseline typically involves a sort of simple phase change across at least three data series, such that the phase change between baseline and treatment occurs at different times in all three data series (Figure 9). The logic of the design is elegantly simple. Staggering the implementation of each respective phase change allows the effects of extraneous variables to be more readily observed. It is generally unlikely that any given extraneous occurrence will have an equal effect on phase changes occurring at three different points in time.
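The staggering logic of the multiple baseline can be made concrete with a small sketch. The Python below is a hypothetical illustration, not a procedure from the chapter: the function names `staggered_phases` and `between_series_windows`, the session count, and the onset values are assumptions. It labels each session of each series as baseline (A) or treatment (B) and lists the sessions at which one series is already in treatment while another is still in baseline, which is where between-series comparisons can be made.

```python
def staggered_phases(n_sessions, onsets):
    """Label every session of every series as 'A' (baseline) or 'B' (treatment).

    `onsets` maps a series name to the 0-based session at which treatment
    begins. Onsets must differ, since the design depends on staggering.
    """
    if len(set(onsets.values())) != len(onsets):
        raise ValueError("phase changes must occur at different times")
    return {
        name: ["A" if t < onset else "B" for t in range(n_sessions)]
        for name, onset in onsets.items()
    }


def between_series_windows(phases):
    """Sessions where one series is in treatment and another still in baseline.

    Returns a dict keyed by (treated_series, baseline_series).
    """
    windows = {}
    for treated, treated_labels in phases.items():
        for control, control_labels in phases.items():
            if treated == control:
                continue
            sessions = [
                t
                for t, (x, y) in enumerate(zip(treated_labels, control_labels))
                if x == "B" and y == "A"
            ]
            if sessions:
                windows[(treated, control)] = sessions
    return windows


# Hypothetical schedule: three series, treatment onsets three sessions apart.
phases = staggered_phases(12, {"series 1": 3, "series 2": 6, "series 3": 9})
windows = between_series_windows(phases)
for pair, sessions in windows.items():
    print(pair, sessions)
```

In this sketch, a change that appears in a baseline series during one of these windows would point toward an extraneous variable or a generalization effect rather than the treatment itself.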

Figure 8 An example of series convergence in the ATD and its analysis by adding within-series components (y-axis: average latency to anxiety (s); x-axis: desensitization scene; conditions: concurrent coping statements, no coping statements mentioned, told not to use coping statements; redrawn from Hayes et al., 1983).
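The overlap check described for interpreting ATD results (the percentage of data points in one series that duplicate the level of another) can be sketched as follows. This is a minimal illustration under stated assumptions: the chapter does not give a formula, so overlap is operationalized here as the share of points in one series that fall within the minimum-to-maximum range of the other series. The function name `percent_overlap` and the data values are hypothetical.

```python
def percent_overlap(series, other):
    """Percentage of points in `series` lying within the range of `other`.

    Assumed operationalization: a point "overlaps" when it falls between
    the minimum and maximum of the comparison series (inclusive).
    """
    if not series or not other:
        raise ValueError("both series need at least one data point")
    low, high = min(other), max(other)
    overlapping = sum(1 for value in series if low <= value <= high)
    return 100.0 * overlapping / len(series)


# Invented data for two alternating conditions (not from any cited study).
condition_a = [10, 12, 14, 16]
condition_b = [15, 18, 20, 22]

print(percent_overlap(condition_a, condition_b))  # share of A within B's range
print(percent_overlap(condition_b, condition_a))  # share of B within A's range
```

A low percentage in both directions is consistent with a differential effect; as the text notes, the figure should still be read alongside the known variability of the behavior and any shared trend across the series.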

Implementation of a multiple baseline design greatly increases the potential number of comparisons that can be made between and within data series, ultimately strengthening the confidence with which conclusions are made from the data. Figure 9 details where such comparisons can be made. First, changes in level and trend within each data series (that is, between the baseline and treatment phase of each of the three data series) can be analyzed, just as with a simple AB design. Unlike a simple phase change, however, differences in level and trend between baseline and treatment can also be compared between three series of data. The design, in effect, contains replications of the phase change. If similar levels and trends are exhibited across all three series, the clinician can feel confident that the treatment is the most likely candidate for exerting the effect. Comparisons can also be made between the point of time at which the first phase change occurs, and the same points of time in the two remaining data series, where baseline data are still being collected. Such comparisons give the researcher information on whether some variable other than treatment might be responsible for observed changes. For example, a strong positive trend and marked change in level might be indicated by the data after the treatment phase is implemented. If similar changes occur in the other two data series at the same points in time, before treatment has even been implemented in those data series, it would seem clear that something besides treatment was exerting an influence on the data. A similar comparison can be made between the point at which the second phase change is implemented and the corresponding data points on the third series of data.

Figure 9 A hypothetical example of a multiple baseline, with between- and within-sources of data comparisons (y-axis: behavior; x-axis: time; marked comparisons: Within 1, Between 2, Between 3, Within 4, Between 5, Within 6). Comparisons can be made between pre- and postphase change data within any of the three series, as well as between any two series at the points where an intervention is in place for one series and baseline conditions are still in effect for the second series.

The type of data recorded in each of the three series must be similar enough so that comparisons can be made between the series, yet different enough that effects in one series are not expected from a phase change in another series. The context in which the data in each of the three series are collected can be of three varieties. The multiple baseline across behaviors requires that three relatively discrete and problematic behaviors, each of which might be expected to respond to a given treatment, be chosen. The treatment is then implemented in staggered, multiple baseline fashion for each behavior. The clinician would probably wish to choose behaviors that are unlikely to be subject to some generalization effect, in which a treatment implemented with behavior 1 results in concomitant change in one or both of the other behaviors (because, for example, the client begins to apply the principles learned immediately to those other behaviors). Such an effect would be easily observed in between-series comparisons (i.e., when a data trend in a condition where the intervention has not yet been initiated resembles the trend exhibited in an intervention condition, a generalization effect may very likely be present). However, the clinician could not be absolutely certain in such a case that the changes across behaviors were due to some generalization effect and not an extraneous variable. In general, if it seems theoretically likely that a generalization effect would occur under such circumstances, and no apparent confounding variable is present, then it is often safe to assume generalization occurred. Even if such an effect were observed, it might not be disastrous by any means. Implementing a treatment only to serendipitously find that it positively affects behaviors other than the targeted one could hardly be considered unfortunate.

An example of the multiple baseline across behaviors is provided by Trask-Tyler, Grossi, and Heward (1994). Developmentally disabled young adults were taught to use three cooking recipes of varying degrees of complexity (simple, trained, and complex), with the goal of the study being to determine whether or not specific skills would generalize across recipes. Simple tasks included preparing microwave popcorn while receiving specific instructions regarding its preparation. Trained tasks were analogs of previously taught simple tasks, where subjects used previously taught skills to prepare a different food (e.g., microwave french fries). Complex tasks consisted either of new tasks or novel combinations of previously trained tasks. Intervention phases were staggered across complexity levels, and occurred after a baseline phase where subjects completed recipes without instruction. Complexity level served as an appropriate analog for differing behaviors in this case, and the implementation of new phases was staggered with sufficient data points (i.e., three or more) occurring in each phase. The design use was also innovative in that it involved applied skills.

Alternatively, the clinician may choose to implement a multiple baseline design across settings. If, for example, a client was socially withdrawn in a variety of circumstances (e.g., at work, at the gym, and on dates), social skills training might be implemented at different points in time across several of those circumstances. As another example, anger management training with an adolescent could be implemented in a staggered fashion at home, at school, and at a part-time job site. As with the multiple baseline across behaviors, the clinician should be alert to the possibility of generalization effects.

An example of the multiple baseline across settings design is provided by Kennedy, Cushing, and Itkonen (1997; Figure 10). The study investigated the effects of an inclusion intervention on the frequency and quality of social contacts with nondisabled people, using two developmentally disabled children as subjects. The intervention included a number of components, such as placement in general school settings with nondisabled peers, and feedback. After a baseline where both subjects participated in special education classes only, the design took the shape of a multiple baseline across classes. One subject eventually participated in four classes, the other in two. Phase lengths were more than adequate (minimum of five data points), and phase changes were appropriately staggered (levels and trends stabilized before onset of the intervention in a new setting). The intervention was effective in increasing the frequency and quality of peer interactions.

Finally, a multiple baseline design can be implemented across persons. Such a manifestation of the design would, of course, require access to three clients with fairly similar presenting concerns subjected to the same treatment. As it is probably unlikely that anyone but a therapist at a university or health center would have simultaneous access to three such clients, it is acceptable to collect data on new clients as they present for services (Hayes, 1985). Objections exist to this approach (Harris & Jenson, 1985), but we feel strongly that the practicality of the approach and the control against extraneous factors that the design possesses greatly outweigh the potential risks.

The multiple baseline across persons design is exemplified by Kamps et al. (1992). The effects of social skills groups on the frequency of social skills interactions of high functioning autistic subjects with normal peers were analyzed across three subjects. Baseline consisted of frequency counts of social interactions before social skills training. Phase changes within each series did not occur until data stabilized, and phase changes across subjects were staggered approximately 10 sessions apart. A minimum of seven data points (usually more) were present in each phase. Social skills training had a positive effect on social interactions.

Ideally, a phase change should not be executed until the data indicate a clear level and trend and are stable. This is advisable when changing from baseline to treatment within a given series, and especially when executing staggered phase changes in the second and third data series. If no clear effect is observable in a series, then co-occurring data points between series cannot be meaningfully interpreted.

An example of the multiple baseline is provided by Croce (1990). A multiple baseline across persons was used to evaluate the effects of an aerobic fitness program on three obese adults, while weekly body fat and cardiovascular fitness measurements were taken. More than three weeks of baseline data were collected for all subjects before treatment was delivered to the first subject. Treatment for the second subject was implemented three weeks after the first, and treatment for the third subject was delayed an

additional three weeks. The data clearly indicated a desirable effect.

Figure 10 A multiple baseline across settings (y-axis: social contacts per day; x-axis: weeks; panels: class period 2 and class period 6). Data were stable before phase changes in both settings, and comparisons between the settings indicate that the intervention was responsible for changes and that no generalization effect occurred (adapted from Kennedy et al., 1997).

3.02.4.3.2 Crossover design

The crossover design essentially involves a modification of the multiple baseline allowing an additional degree of control over extraneous variables (Kazdin, 1980). It can be especially useful when only two (rather than three or even more) series of data can plausibly be gathered, as this added control tends to compensate for the control lost by the omission of a third series. This design has been widely used in pharmacological studies (Singh & Aman, 1981).

Execution of a crossover design simply involves simultaneous phase changes across two series of data such that, at any given point in time, opposite phases are in operation between each series. If series one's phase sequence was, for example, ABAB, a BABA sequence would be simultaneously delivered in series two, with each phase change occurring at the same time in both series, and after treatment phases of precisely equal lengths. Such an arrangement can make even a relatively weak effect evident, as each treatment (B phase) always has a corresponding baseline (A phase) in the other data series to allow finer interpretations of stability, level, and trend.

3.02.4.3.3 Constant series control

One final combined-series design warrants brief mention. A constant series control can be added to a series of data when baseline data alone are collected concurrently on some other person, problem, or situation of interest. In adding a constant series control to an in-school anger management intervention with a child, for example, relevant behavioral data might be collected at the child's home (where the treatment is not considered active) throughout the period where an ABAB phase sequence is being implemented at school. Treatment B effects from the school setting can then be compared to concurrently gathered baseline data from the home setting to assist in interpretation of the treatment's effects. Such a control is extremely useful when used in conjunction with a simple or complex phase change design.

A study by O'Reilly, Green, and Braunling-McMorrow (1990) provides an example of a baseline-only constant series control. O'Reilly et al. were attempting to change the accident-prone actions of brain-injured individuals. A written safety checklist that listed, but did not specifically prompt, hazard remediation was prepared for each of several areas of the home. If improvement was not maximal in a given area, individualized task analyses were prepared that prompted change in areas that still required remediation. The design used was a multiple baseline across settings (living room, kitchen, bedroom, and bathroom). Although phase changes from baseline to checklist prompts to task analysis occurred in staggered multiple baseline fashion across the first three settings, a baseline condition remained in effect in the fourth setting for the entire 29 weeks of the study. Results indicated very little evidence of generalization across responses, and the baseline-only constant series control provided additional evidence that training was responsible for the improvements that were seen.

3.02.5 FUTURE DIRECTIONS

Widespread use of SCEDs by practicing clinicians could provide a partial remedy to two currently omnipresent concerns in the mental health care delivery field: a virtual lack of use and production of psychotherapy outcome literature by clinicians, and the demand for demonstrated treatment efficacy by the managed care industry.

3.02.5.1 Potential for Research Production and Consumption by Practicing Clinicians

Line clinicians currently produce very little research. This is unfortunate since it is widely recognized that assessment of the field effectiveness of treatment technology is a critical and largely absent phase in the research enterprise (Strosahl, Hayes, Bergan, & Romano, in press).

Single-subject designs are well suited to answer many of the questions most important to a clinician. They are useful tools in keeping the clinician in touch with client progress and informing treatment decisions. Most of the requirements of these designs fit with the requirements of good clinical decision making and with the realities of the practice environment. This has been true for some time, and the research production using these designs in the on-line practice environment is still limited. That may be about to change, however, for the reason discussed next.

3.02.5.2 Managed Health Care, Single-subject Design, and the Demonstration of Efficacy

The managed health care revolution currently underway in the USA represents a force that, in all likelihood, will soon encompass nearly all US mental health services delivery systems (Strosahl, 1994; Hayes et al., in press). The hallmark of managed care organizations (MCOs) is the provision of behavioral health services in a way that is directed and monitored, not merely compensated.

In generation I of the managed care revolution, cost savings accrued mostly from cost reduction strategies. That phase seems nearly complete. In generation II, cost savings are accruing from quality improvement strategies. Uppermost in this approach is the development of effective and efficient treatment approaches, and encouragement of their use through clinical practice guidelines.

Time-series designs are relevant to MCOs in three ways. First, they can allow much greater accountability. Single-subject designs provide an excellent opportunity for the clinician to document client progress and provide sufficient justification to MCOs, HMOs, and PPOs for implementing treatments or treatment components. Even a simple AB is a big step forward in that area. Second, when cases are complex or treatment resistant, these designs provide a way of evaluating clinical innovation that might be useful for improving the quality of treatment for other cases not now covered by empirically supported treatments. Finally, these designs can be used to evaluate the impact of existing treatment programs developed by MCOs.

3.02.6 CONCLUSION

This chapter provides a brief overview of current single-subject design methodologies and their use in the applied environment. The design elements are a set of clinical tools that are effective both in maximally informing treatment decisions and in generating and evaluating research hypotheses. Single-subject designs fill a vital role in linking clinical practice to clinical science. With the evolution of managed care, this link is now of significant economic importance to a major sector of our economy.

3.02.7 REFERENCES

Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1, 91–97.
Barlow, D. H., & Hayes, S. C. (1979). Alternating treatments design: One strategy for comparing the effects of two treatments in a single subject. Journal of Applied Behavior Analysis, 12, 199–210.
Barlow, D. H., Hayes, S. C., & Nelson, R. O. (1984). The scientist practitioner: Research and accountability in clinical and educational setting. New York: Pergamon.
Barlow, D. H., & Hersen, M. (1984). Single case experimental designs: Strategies for studying behavior change (2nd ed.). New York: Pergamon.
Belles, D., & Bradlyn, A. S. (1987). The use of the changing criterion design in achieving controlled smoking in a heavy smoker: A controlled case study. Journal of Behavior Therapy and Experimental Psychiatry, 18, 77–82.
Box, G. E. P., & Jenkins, G. M. (1976). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Busk, P. L., & Marascuilo, L. A. (1988). Autocorrelation in single-subject research: A counterargument to the myth of no autocorrelation: The autocorrelation debate. Behavioral Assessment, 10(3), 229–242.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312.
Cone, J. D. (1986). Idiographic, nomothetic, and related perspectives in behavioral assessment. In R. O. Nelson & S. C. Hayes (Eds.), Conceptual foundations of behavioral assessment (pp. 111–128). New York: Guilford Press.
Cope, J. G., Moy, S. S., & Grossnickle, W. F. (1988). The behavioral impact of an advertising campaign to promote safety belt use. Journal of Applied Behavior Analysis, 21, 277–280.
Croce, R. V. (1990). Effects of exercise and diet on body composition and cardiovascular fitness in adults with severe mental retardation. Education and Training in Mental Retardation, 25(2), 176–187.
Cronbach, L. J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
Cummings, N. A., Cummings, J. L., & Johnson, J. N. (1997). Behavioral health in primary care: A guide for clinical integration. Madison, CT: Psychosocial Press.
Cummings, N. A., Pollack, M. S., & Cummings, J. L. (1996). Surviving the demise of solo practice: Mental health practitioners prospering in the era of managed care. Madison, CT: Psychosocial Press.
DeLuca, R. V., & Holborn, S. W. (1992). Effects of a variable ratio reinforcement schedule with changing criteria on exercise in obese and nonobese boys. Journal of Applied Behavior Analysis, 25, 671–679.
Edgington, E. S. (1980). Validity of randomization tests for one-subject experiments. Journal of Educational Statistics, 5, 235–251.
Ferster, C. B., & Skinner, B. F. (1957). Schedules of reinforcement. New York: Appleton-Century-Crofts.
Gottman, J. M. (1981). Time-series analysis: A comprehensive introduction for social scientists. Cambridge, UK: Cambridge University Press.
Gottman, J. M., & Glass, G. V. (1978). Analysis of interrupted time-series experiments. In T. R. Kratochwill (Ed.), Single subject research: Strategies for evaluating change (pp. 197–235). New York: Academic Press.
Greenwood, K. M., & Matyas, T. A. (1990). Problems with the application of interrupted time series analysis for brief single subject data. Behavioral Assessment, 12, 355–370.
Gunter, P. L., Shores, R. E., Jac, K. S. L., Denny, R. K., & DePaepe, P. A. (1994). A case study of the effects of altering instructional interactions on the disruptive behavior of a child identified with severe behavior disorders. Education and Treatment of Children, 17(3), 435–444.
Harris, F. N., & Jenson, W. R. (1985). Comparisons of multiple-baseline across persons designs and AB designs with replication: Issues and confusions. Behavioral Assessment, 7(2), 121–127.
Hayes, S. C. (1985). Natural multiple baselines across persons: A reply to Harris and Jenson. Behavioral Assessment, 7(2), 129–132.
Hayes, S. C., Barlow, D. H., & Nelson, R. O. (1997). The scientist practitioner: Research and accountability in the age of managed care (2nd ed.). Boston: Allyn & Bacon.
Hayes, S. C., Hussian, R. A., Turner, A. E., Anderson, N. B., & Grubb, T. D. (1983). The effect of coping statements on progress through a desensitization hierarchy. Journal of Behavior Therapy and Experimental Psychiatry, 14, 117–129.
Hersen, M., & Barlow, D. H. (1976). Single case experimental designs: Strategies for studying behavior change. New York: Pergamon.
Horne, G. P., Yang, M. C. K., & Ware, W. B. (1982). Time series analysis for single subject designs. Psychological Bulletin, 91, 178–189.
Huitema, B. E. (1988). Autocorrelation: 10 years of confusion. Behavioral Assessment, 10, 253–294.
Jordan, J., Singh, N. N., & Repp, A. (1989). An evaluation of gentle teaching and visual screening in the reduction of stereotypy. Journal of Applied Behavior Analysis, 22, 9–22.
Kamps, D. M., Leonard, B. R., Vernon, S., Dugan, E. P., Delquadri, J. C., Gershon, B., Wade, L., & Folk, L. (1992). Teaching social skills to students with autism to increase peer interactions in an integrated first-grade classroom. Journal of Applied Behavior Analysis, 25, 281–288.
Kazdin, A. E. (1980). Research design in clinical psychology. New York: Harper & Row.
Kennedy, C. H., Cushing, L. S., & Itkonen, T. (1997). General education participation improves the social contacts and friendship networks of students with severe disabilities. Journal of Behavioral Education, 7, 167–189.
Koan, S., & McGuire, W. J. (1973). The Yin and Yang of progress in social psychology. Journal of Personality and Social Psychology, 28, 446–456.
Kornet, M., Goosen, C., & Van Ree, J. M. (1991). Effect of naltrexone on alcohol consumption during chronic alcohol drinking and after a period of imposed abstinence in free-choice drinking rhesus monkeys. Psychopharmacology, 104(3), 367–376.
Lerman, D. C., & Iwata, B. A. (1996). A methodology for distinguishing between extinction and punishment effects associated with response blocking. Journal of Applied Behavior Analysis, 29, 231–233.
Levin, J. R., Marascuilo, L. A., & Hubert, L. J. (1978). N = nonparametric randomization tests. In T. R. Kratochwill (Ed.), Single subject research: Strategies for evaluating change (pp. 167–196). New York: Academic Press.
McCullough, J. P., Cornell, J. E., McDaniel, M. H., & Mueller, R. K. (1974). Utilization of the simultaneous treatment design to improve student behavior in a first-grade classroom. Journal of Consulting and Clinical Psychology, 42, 288–292.
O'Brien, F. (1968). Sequential contrast effects with human subjects. Journal of the Experimental Analysis of Behavior, 11, 537–542.
O'Reilly, M. F., Green, G., & Braunling-McMorrow, D. (1990). Self-administered written prompts to teach home accident prevention skills to adults with brain injuries. Journal of Applied Behavior Analysis, 23, 431–446.
Orsborn, E., Patrick, H., Dixon, R. S., & Moore, D. W. (1995). The effects of reducing teacher questions and increasing pauses on child talk during morning news. Journal of Behavioral Education, 5(3), 347–357.
Paul, G. L. (1969). Behavior modification research: Design and tactics. In C. M. Franks (Ed.), Behavior therapy: Appraisal and status (pp. 29–62). New York: McGraw-Hill.
Peterson, A. L., & Azrin, N. H. (1992). An evaluation of behavioral treatments for Tourette Syndrome. Behavior Research and Therapy, 30(2), 167–174.
Shakow, D., Hilgard, E. R., Kelly, E. L., Luckey, B., Sanford, R. N., & Shaffer, L. F. (1947). Recommended graduate training program in clinical psychology. American Psychologist, 2, 539–558.
Sharpley, C. F., & Alavosius, M. P. (1988). Autocorrelation in behavioral data: An alternative perspective. Behavioral Assessment, 10, 243–251.
Shukla, S., & Albin, R. W. (1996). Effects of extinction alone and extinction plus functional communication training on covariation of problem behaviors. Journal of Applied Behavior Analysis, 29(4), 565–568.
Singh, N. N., & Aman, M. G. (1981). Effects of thioridazine dosage on the behavior of severely mentally retarded persons. American Journal of Mental Deficiency, 85, 580–587.
Snow, R. E. (1974). Representative and quasi-representative designs for research in teaching. Review of Educational Research, 44, 265–291.
Stavinoah, P. L., Zlomke, L. C., Adams, S. F., & Lytton, G. J. (1996). Treatment of impulsive self and other directed aggression with fluoxetine in a man with mild mental retardation. Journal of Developmental and Physical Disabilities, 8(4), 367–373.
Strosahl, K. (1994). Entering the new frontier of managed mental health care: Gold mines and land mines. Cognitive and Behavioral Practice, 1, 5–23.
Strosahl, K., Hayes, S. C., Bergan, J., & Romano, P. (in press). Evaluating the field effectiveness of Acceptance and Commitment Therapy: An example of the manipulated training research method. Behavior Therapy.
Thorne, F. C. (1947). The clinical method in science. American Psychologist, 2, 159–166.
Trask-Tyler, S. A., Grossi, T. A., & Heward, W. L. (1994). Teaching young adults with developmental disabilities and visual impairments to use tape-recorded recipes: Acquisition, generalization, and maintenance of cooking skills. Journal of Behavioral Education, 4, 283–311.
Ulman, J. D., & Sulzer-Azaroff, B. (1975). Multielement baseline design in educational research. In E. Ramp & G. Semb (Eds.), Behavior analysis: Areas of research and application (pp. 377–391). Englewood Cliffs, NJ: Prentice-Hall.
Van Houten, R. (1993). Rote vs. rules: A comparison of two teaching and correction strategies for teaching basic subtraction facts. Education and Treatment of Children, 16, 147–159.
Wampold, B. E., & Furlong, M. J. (1981). Randomization tests in single subject designs: Illustrative examples. Journal of Behavioral Assessment, 3, 329–341.
White, G. W., Mathews, R. M., & Fawcett, S. B. (1989). Reducing risks of pressure sores: Effects of watch prompts and alarm avoidance on wheelchair pushups. Journal of Applied Behavior Analysis, 22, 287–295.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.03 Group Comparisons: Randomized Designs

NINA R. SCHOOLER Hillside Hospital, Glen Oaks, NY, USA

3.03.1 INTRODUCTION
3.03.2 COMPARING STRATEGIES FOR TREATMENT EVALUATION
3.03.2.1 Case Reports and Summaries of Case Series
3.03.2.2 Single Case Experimental Designs
3.03.2.3 Quasi-experimental Designs
3.03.2.4 Randomized Experimental Designs
3.03.3 HISTORICAL BACKGROUND
3.03.4 APPROPRIATE CONDITIONS FOR RANDOMIZED GROUP DESIGNS
3.03.4.1 Ethical Requirements
3.03.4.2 Stage of Treatment Development
3.03.4.3 Treatment Specification: Definition of Independent Variables
3.03.4.4 Client/Subject Characteristics
3.03.4.5 Specification and Assessment of Outcome: Definition of Dependent Variables
3.03.4.6 Study Duration
3.03.5 STUDY DESIGNS
3.03.5.1 Simple Two-group Comparisons
3.03.5.2 Multiple-group Comparisons
3.03.5.3 Multiple-group Comparisons that Include Medication Conditions
3.03.5.4 Factorial Designs
3.03.6 INDIVIDUAL DIFFERENCES: WHAT TREATMENT WORKS FOR WHOM?
3.03.6.1 Individual Difference Variable
3.03.6.2 Post hoc Examination of Predictors of Treatment Response
3.03.7 CONCLUSIONS
3.03.8 REFERENCES

3.03.1 INTRODUCTION

A major function of clinical psychology is to provide treatment or other interventions for clients who suffer from mental illness or seek relief from problems. This chapter addresses the role of experimental evaluation of treatments as a source of data in deciding about the utility of a treatment. In the medical literature, such studies are referred to as randomized clinical trials (RCTs) (Pocock, 1983) and many recent studies of psychological treatments also use this terminology (e.g., Bickel, Amass, Higgins, Badger, & Esch, 1997). This chapter will consider the strengths and weaknesses of a number of strategies for judging treatment utility; review some historical background of the experimental study of treatment; examine the experimental designs that have been used to study psychological treatments and the circumstances under

which such experiments are appropriate. In each of the sections regarding specific experimental designs, exemplar studies will be discussed in some detail in order to highlight relevant issues for evaluating the literature and for the design of future studies.

3.03.2 COMPARING STRATEGIES FOR TREATMENT EVALUATION

In order to set the experimental study of treatment into context, this section will review a variety of strategies for evaluation of treatment efficacy attending to their strengths and weaknesses.

3.03.2.1 Case Reports and Summaries of Case Series

This strategy represents perhaps the method with the longest history. For example, Brill (1938) cites a case treated in 1880 by Josef Breuer (cf. p. 7) of a young girl with symptoms of "paralyses with contractures, inhibitions and states of psychic confusion." Under hypnosis she was able to describe the connections between her symptoms and past experiences and ". . . by this simple method he freed her of her symptoms." Another case described by Brill concerns a four-year-old child who became nervous, refused to eat, and had frequent crying spells and tantrums. The symptoms began shortly after the mother was separated from the child and she was "cured" soon after the mother returned. The mechanism that was posited to account for the effect was a disturbance in libido.

Much more recently, Consumer Reports (1994, 1995) report the results of a survey of 4000 subscribers who reported their experiences with psychotherapy broadly defined. Respondents to the survey reported in general that they had benefitted from their psychotherapy and, of particular interest, that more and longer psychotherapy was associated with greater benefit. This survey has been reviewed and received careful critique by psychologists (Brock, Green, Reich, & Evans, 1996; Hunt, 1996; Kotkin, Daviet, & Gurin, 1996; Kriegman, 1996; Mintz, Drake, & Crits-Christoph, 1996; Seligman, 1995, 1996).

These examples differ in a number of ways. The most obvious is that they span over a century. The more relevant one for the present purpose is that the Consumer Reports survey carries the weight of numbers and implied authority as a result. However, both single case reports and reports that summarize large numbers of cases share limitations. Neither can address the very compelling question regarding the reported outcomes: "compared to what?" In the Consumer Reports survey, there are a number of obvious hypotheses that can be entertained regarding the absent comparison groups that call into question the validity of the reported finding. The finding that psychotherapy was helpful could be because those who did not feel they had benefitted from their psychotherapy did not respond to the questionnaire. The finding that longer psychotherapy was more helpful could be because those who experienced benefit continued in treatment longer or because those who discontinued early needed to justify their discontinuation. Statistical modeling of alternative causal interpretations cannot fully exclude these interpretations. The common characteristic of both the attributed interpretations of findings from case reports or case series and the criticism of them is that there is no formal procedure for judging the accuracy of the interpretation.

Such reports serve the important function of generating hypotheses regarding treatments and interventions that can lead to well-designed studies. But they are not designed to test hypotheses and therefore cannot provide definitive evidence for treatment efficacy.

3.03.2.2 Single Case Experimental Designs

Chapter 2, this volume, on single case designs, discusses the methodologies for assessing causal relationships in individuals. A striking advantage of single case designs is that they can be used to evaluate treatments for individuals with very specifically defined characteristics. These characteristics do not need to be shared with large numbers of other patients or clients for the experiments to be valid. The applicability of the specific interventions studied to broader populations of clients will, of course, depend on the similarity of such clients to the subjects studied.

An important disadvantage of single case designs is that they are only applicable in conditions in which return of the target complaints can be reliably demonstrated once the treatment has been removed. If a treatment results (or is hypothesized to result) in a change in personality or in the orientation of an individual to himself or others, then the repeated administration and withdrawal of treatment that is the hallmark of single case designs will not provide useful information. A further disadvantage is that the comparisons are all internally generated and generalization is accomplished only through assessment of multiple cases with similar individual characteristics and treatment.

3.03.2.3 Quasi-experimental Designs

These methods, described in detail in Chapter 4, this volume, take advantage of naturalistic treatment assignment or self-selection and use statistical methods to model outcome and to provide controls for the major limitation of these methods, namely that assignment to treatment is not random. A further advantage is that evaluation of treatment outcome can be independent of treatment delivery (and even blind to the knowledge of treatment) so that the effects of expectancy on outcome are controlled. The major weakness of such designs is that the biases of treatment selection are allowed full rein.

3.03.2.4 Randomized Experimental Designs

Randomization is the hallmark of these studies. Random assignment to intervention provides the potential to control for biases that treatment self-selection and choice introduce into the evaluation of treatment outcome. The disadvantages that randomized treatment studies have include the logistical difficulties in implementation and restrictions in generalizability. Generalizability is restricted to clients who are willing to consider randomization. Individuals who come to treatment facilities with a strong preference or expectation may well reject randomization to treatment as an option and insofar as matching to treatment influences outcome, the absence of such subjects reduces generalization. Subject refusal to participate, particularly if it occurs after treatment assignment is known, can further compromise the advantage that random assignment introduces. Refusal after treatment assignment is known implies that subjects whose expectations are met continue and that those who did not get the treatment they preferred have chosen not to participate. If such refusal is frequent, the principle of randomization has been compromised and interpretation of findings will encounter limitations similar to those described for quasi-experimental designs. Attrition from all causes may introduce bias (Flick, 1988). Despite these disadvantages, randomization offers insurance that, within the population studied, bias, particularly from unknown sources, has been contained. Randomized experimental studies share with quasi-experimental designs the potential advantages afforded by the separation of evaluation of outcome from provision of the intervention.

3.03.3 HISTORICAL BACKGROUND

Early research in psychotherapy ran on two relatively independent tracks: the first emphasized the internal process of the psychotherapeutic encounter; the second was concerned with evaluating the effects of psychotherapy per se. In the latter studies the primary comparisons were between patients or clients who received psychotherapy and those who did not. Most controls centered around controls for treatment time such as deferred entry into treatment using waiting lists. In these experimental studies, attention to the content of the psychotherapy was descriptive. This position was strongly bolstered by the writings of theorists such as Frank (1971), who held that general constructs such as positive regard and the therapeutic alliance underlay the effects of psychotherapy.

Smith and Glass (1977) revolutionized the field with the publication of the first meta-analysis of the psychotherapy literature, demonstrating a moderate effect size for the comparison between psychotherapy and no treatment (see Chapter 15, this volume, for a discussion of current meta-analytic methods). At the same time, increased attention to measurement of change in psychotherapy (Waskow & Parloff, 1975) and the development of manualized treatments (see Chapter 9, this volume) signaled a change in the models that were available for the study of psychological treatments.

During the same period, attention to methodology in clinical psychology quickened (Campbell & Stanley, 1963, 1966; Cook & Campbell, 1979). The emergence of medications for the treatment of mental illness had inaugurated experimental studies of these treatments using rigorous designs drawn initially from clinical psychology that included randomization, double-blind administration of medication, placebo controls, diagnostic specificity, rating scales that paid attention to specific signs and symptoms of psychopathology (see Prien & Robinson, 1994, for a recent review of progress in methods for psychopharmacology). The demand for similar studies of psychological interventions became apparent. The modern era of experimental studies of psychotherapy and other psychological interventions was upon us.

3.03.4 APPROPRIATE CONDITIONS FOR RANDOMIZED GROUP DESIGNS

Experimentation is best suited to address comparative questions that would be biased if the groups being compared were self-selected. Random assignment of individuals to groups controls for biases introduced by self-selection. Some questions (for example, those regarding differences between young and old people or men and women) obviously will not be addressed experimentally, although later in the chapter the question of designs to study individual differences in relation to an experimental condition will be considered.

Some examples: does marital therapy prevent divorce?; is cognitive-behavior therapy helpful in depression?; does family psychoeducation potentiate the effects of medication in schizophrenia?; should children with attention deficit disorder receive medication or family support?; does respite care for Alzheimer's patients reduce physical illness in caregivers?; does medication reduce craving in alcohol dependent individuals? In all these examples, an intervention is either contrasted to others or in the cases where no comparison is stated, it is implied that the comparison is with a group that does not receive an alternate intervention.

The hallmark of all these questions is that they identify a group or condition and ask about an intervention. The implication is that the findings from the experiment will apply to members of the group who did not receive the intervention in the experimental study. This assumption represents a statistical assumption: it is required for the use of virtually all statistical tests applied in evaluating the questions under review. It also represents an important clinical concern. Experiments are not conducted solely for the benefit of the subjects who participate in the research. It is anticipated that the conclusions will be applicable to other individuals who share characteristics (ideally key characteristics) with those who were subjects in the research.

Perhaps the greatest threat to the validity of experimental studies is that there is something that distinguishes the members of the group (depressed patients, children with attention deficit disorder) who agree to randomization from those who will not participate in a study of randomized treatment. In the general medical clinical trials literature where the conditions to be treated are defined by readily measurable signs and symptoms or by physiological measurements (e.g., tumor size), that assumption is readily accepted and is therefore generally untested. In the psychological literature the question may be considered but there are very few studies that have actually drawn a sample of subjects from a population in order to know whether the subjects of a study are like the population from which they came. In other words, although we look to randomization as a control for bias in evaluating the effect of a treatment or intervention, there has been little attention paid to whether subjects who agree to randomization represent a bias that is negligible or substantial.

3.03.4.1 Ethical Requirements

Certain ethical conditions regarding the subjects and the treatments or interventions must be met (Department of Health and Human Services, 1994). Research participation requires informed consent from the participant or from an appropriately authorized surrogate, for example, a parent for a minor child. Even in cases where consent is obtained from a surrogate, the research participant (e.g., the child) must provide "assent." Assent is the term used for agreement to participate by individuals who are judged to lack the capacity to provide fully informed consent. Currently in the USA regulations are promulgated by the Department of Health and Human Services that define 12 elements of informed consent. They are:

(i) An explanation of the procedures to be followed, including specification of those that are experimental.
(ii) A description of the reasonably foreseeable attendant discomforts and risks and a statement of the uncertainty of the anticipated risks due to the inherent nature of the research process.
(iii) A description of the benefits that may be expected.
(iv) A disclosure of appropriate and available alternate procedures that might be advantageous for the subject.
(v) An offer to answer any inquiries concerning the procedures.
(vi) A statement that information may be withheld from the subject in certain cases when the investigator believes that full disclosure may be detrimental to the subject or fatal to the study design (provided, however, that the Institutional Review Board (IRB) has given proper approval to such withholding of information).
(vii) A disclosure of the probability that the subject may be given a placebo at some time during the course of the research study if placebo is to be utilized in the study.
(viii) An explanation in lay terms of the probability that the subject may be placed in one or another treatment group if randomization is a part of the study design.
(ix) An instruction that the subject may withdraw consent and may discontinue participation in the study at any time.
(x) An explanation that there is no penalty for not participating in or withdrawing from the study once the project has been initiated.
(xi) A statement that the investigator will inform the subject of any significant new information arising from the experiment or other ongoing experiments which may bear on the subject's choice to remain in the study.

(xii) A statement that the investigator will provide a review of the nature and results of the study to subjects who request such information.

All these essential elements must be included in an Informed Consent Form to insure adequate consent to participate in any research.

In studies that involve randomization to treatment, some elements are of particular importance because they represent concepts that are sometimes difficult for potential subjects to appreciate. The first, of course, is that of randomization itself: that treatment will be decided "as if tossing a coin." The second is that the clinician or treater will not choose "the best" treatment for the client. Other elements of informed consent that need to be emphasized for a truly informed process are that the subject is free to withdraw at any time and that those providing treatment may also discontinue the subject's participation if they believe it to be in the subject's best interest.

In most countries research involving human subjects must be reviewed by local Research Ethics Committees or IRBs that are mandated to insure that the research meets ethical standards, that all elements of informed consent are present, and that the interests of subjects are protected.

A number of issues regarding consent that go beyond the 12 critical items are currently a subject of debate. Participants in these deliberations include various regulatory bodies, independent commissions, and private individuals and groups that are self-declared protectors of those who may participate in experiments regarding treatments or interventions of interest to clinical psychology. Among these issues are the following: requiring that an independent person unaffiliated with a given research project oversee the consent process; expanding the definition of those who lack the capacity to provide informed consent to include populations such as all those diagnosed with schizophrenia and other illnesses; restricting the conduct of research that does not offer direct benefit to the participant subjects; elimination of the use of placebo treatment conditions in patient groups for whom there is any evidence of treatments that are effective. These potential changes may well alter the nature of experimental research regarding treatment and intervention.

In general, concerns about placebo treatment or no-treatment control conditions have been hotly debated in the arena of pharmacological treatment rather than psychological interventions. However, as evidence regarding the efficacy and effectiveness of psychological interventions becomes widely accepted, experimental strategies such as waiting list controls, "usual care," and even nonspecified psychotherapies may require re-evaluation. One strategy that may receive increasing attention and popularity is that of treatment dosage. In other words, a comparison group may be defined as one that receives the intervention of interest but receives less of it.

Interventions and control conditions being compared should be potentially effective; there should be evidence regarding benefit. If this assertion is accepted, how can a "no-treatment" comparison group be included in an experiment? First, there should be provision for the subject to receive the potentially more effective intervention following study participation. In other words, the "no-treatment" group is really a delayed treatment or the time-hallowed waiting list condition. This may be relatively easy to implement with short-term treatment interventions but more difficult for long-term interventions. Alternate solutions include the use of minimal treatment conditions. Later in the chapter, both of these conditions will be discussed in more detail from the substantive perspective. In the present context, both represent plausible solutions to the ethical dilemma of providing appropriate treatment for client populations for whom there exists a corpus of information regarding treatment effects. Obviously, these concerns are moot in specific patient or client populations where there are no data regarding treatment effects. The ethical dilemma is heightened when the intervention is long term or there may be substantial risk if treatment is deferred. For example, studies of treatment for depression generally exclude subjects who are suicidal.

3.03.4.2 Stage of Treatment Development

The ideal sequence of treatment development includes the following stages: innovation; preliminary description in practice, usually by the innovators or developers of the treatment; comparative experimental studies to determine treatment efficacy and investigate client characteristics that may be linked to treatment response; dissemination into clinical practice; and evaluation of outcome under conditions of usual clinical care. Experimental, randomized designs may take several forms during the sequence outlined. The first is often a study that compares the new treatment with no treatment. As indicated earlier, that condition is most likely to be defined by a waiting list condition or deferred treatment. Other studies that provide comparisons to other established treatments may follow. At the present time, studies that examine psychological interventions in relationship to medication are often carried out. These studies may address direct comparative questions regarding medication and a psychological intervention or they may address relatively complex questions regarding the additive or interactive effects of medication and a psychological intervention.

3.03.4.3 Treatment Specification: Definition of Independent Variables

As indicated above, randomized designs are most valuable when the interventions can be well-specified or defined. Since the goal of such research is generalization, the advantage of well-specified treatment interventions is that their reproduction by other clinicians is possible. Manuals such as those discussed in Chapter 9, this volume, represent the ideal model of treatment specification, but other methods can be considered. The training of the treatment provider can provide a plausible statement regarding specification. For example, clinical psychologists with a Ph.D. who are board certified and who have completed an established training course may represent an appropriate statement regarding a treatment specification.

Specification should also include such elements as treatment duration, frequency, and number of contacts. Duration and number may appear to be relatively simple constructs but in examining psychological interventions the duration of treatment may not be rigidly defined. In some cases, treatment adherence by clients will affect duration, frequency, and number. In other circumstances, treatment may be continued until a specified outcome is reached. Under these conditions, duration of treatment until a criterion of improvement is reached may represent an outcome itself.

Finally, the most careful specification needs to be matched by measurement of adherence to the specified treatment. Did therapists do what they were supposed to do? This represents an interesting shift in research on interventions. Earlier in the history of psychotherapy research, when the emphasis was on the process of psychotherapy, the evaluation of what happened during the psychotherapeutic encounter was of interest because it was assumed that psychotherapeutic interventions altered outcomes. One could draw an analogy to recording the practices of a gifted cook in order to ascertain the recipe. As the emphasis has shifted to the evaluation of outcomes, the interest in the nature of the interpersonal interactions that comprise the intervention has come to be seen as assurance of adherence or "fidelity" (Hollon, Waskow, Evans, & Lowery, 1984). The question now is to determine whether a given cook has followed the recipe.

3.03.4.4 Client/Subject Characteristics

Characterization of the clients in the randomized trial provides the means for communicating attributes of the population of clients who may be appropriate for the treatment in routine clinical practice. Currently, the most common strategy for characterizing clients is the use of diagnostic criteria that provide decision rules for determining whether clients fit categories. The most widely used are those of the World Health Organization's International classification of diseases (World Health Organization, 1992) and the American Psychiatric Association's Diagnostic and statistical manual of mental disorders (American Psychiatric Association, 1987, 1994). The use of specified diagnostic criteria (and even standardized instruments for ascertaining diagnosis) provides some insurance that clients participating in a given study share characteristics with those who have been participants in other studies and with potential clients who may receive the treatment at a later time. However, under some circumstances it may be preferable to specify client inclusion based on other methods, such as the reason for seeking clinical assistance rather than a formal clinical diagnosis. An advantage of using inclusion criteria other than diagnosis is that problem-focused inclusion criteria may make translation of findings from a randomized trial to clinical practice easier.

A second issue regarding client characteristics in randomized studies is insuring that important client characteristics which may influence outcome are balanced across treatment groups. One strategy is to conduct relatively large studies. Randomization is designed to minimize imbalance but, as anyone who has seen heads come up 10 times in a row knows, in the short term randomization may well result in imbalance. One of the most common ways to achieve balanced groups is to randomize to treatment or condition within a prespecified group (for example, gender or age) to insure against the possibility that by chance disproportionate numbers of one group are randomized to one treatment condition. A second strategy is to use an adaptive randomization algorithm (Pocock & Simon, 1975; White & Freedman, 1978). In this method, several client characteristics that are known or hypothesized to affect the outcomes of interest in the study are identified. The goal of the randomization algorithm is to insure that the groups are relatively well balanced considering all of the characteristics simultaneously. A particular algorithm may specify a number of characteristics but when the number exceeds three or four, the overall balance is unlikely to be affected. The characteristics can also be weighted so that some are

independent assessors. It is not a perfect strategy. Such assessors have only limited opportunity to observe subjects and may not be sensitive to subtle but important cues because of their limited contact with the subjects.
In this more likely to influence treatment assignment context, both initial training of assessors to than others. Adaptive randomization is a insure reliability and ongoing monitoring of dynamic process in which subject characteristics reliability are critical. are fed into a program that generates a treatment assignment such that the identified characteristics will be well balanced across all groups. 3.03.4.6 Study Duration What is central to randomization within groups or adaptive randomization is the Study duration includes two components: premise that the chosen characteristics are duration of the intervention and duration of known or hypothesized to affect the outcomes postintervention evaluation. Duration of the of interest in the study. Characteristics that are intervention should be theoretically driven, not expected to influence outcome do not need based on the nature of the clinical problem to be considered in this way. Section 3.03.6 that is being addressed and the mechanism of discusses the use of individual differences action of the intervention. Short-term interven- among study participants in attempting to tions are much easier to investigate, particularly understand outcome differences among treat- short-term interventions for which manuals ments or interventions. have been developed. It appears that manuals are easier to develop for short-term interventions. However, some questions will require 3.03.4.5 Specification and Assessment of longer-term intervention. Interventions whose Outcome: Definition of Dependent duration are reckoned in months rather than Variables weeks require additional care to avoid subject attrition and to insure that the treatment To state the obvious, outcome criteria should remains constant during the long period of reflect the intended effect of the interventions time that it is being delivered. 
studied and the reliability and validity of the Post-treatment evaluation or follow-up after assessment instruments used should be estab- treatment ends is common in studies of lished. In general, there are advantages to using psychological interventions and addresses the some measures that are established in the field. important question of whether effects persist in This provides the opportunity to compare the absence of continued treatment interven- findings across studies and aids in the cumu- tion. Such follow-up studies are subject to the lative development of knowledge (Waskow & criticism that uncontrolled interventions may Parloff, 1975). In addition, it is valuable to have occurred during the post-treatment period include study-specific measures that focus on and may bias the outcome. But, if an interven- the particular intervention(s) under investigation is hypothesized to exert a long-lasting effect tion and hypotheses being tested. In discussing that may include change in personality or long- examples of specific studies in later sections of term functioning, such evaluation is required. this chapter, the issue of whether differences Follow-up evaluations are also subject to were found on standard measures or on increased problems of attrition and the risk measures that were idiosyncratic to the study that differential attrition may introduce bias. in question will be considered. Comparisons of baseline and demographic Also of relevance is who assesses outcome. In characteristics are frequently made between pharmacologic clinical trials, double-blind pro- those subjects who are ascertained at the end of cedures represent standard operating proce- follow-up and those who are lost. This provides dure. 
Concerns are often expressed that side some measure of comfort, but further important effect patterns and other ªcluesº may serve to comparisons should include treatment group break the blind, but blinding preserves a and measures of outcome at the completion of measure of uncertainty regarding treatment study treatment. An incorrect strategy that has assignment (Cole, 1968). In studies of psycho- fallen into disuse was to compare those not logical intervention, double-blind conditions ascertained with the full cohort, including those are not possible. The subject and the treating lost to follow-up, so that those not ascertained clinician know what treatment the subject is were included in both groups. receiving and commitment and enthusiasm for In studies of pharmacologic treatment, treatment are assumed to be present. For this reversal of effects on discontinuation of treat- reason, a common strategy is to include ment has been taken as prima facie evidence of 54 Group Comparisons: Randomized Designs efficacy. In fact, a major research design to as ªplaceboº treatment in the literature. strategy in psychopharmacology is to establish Comparisons of two specific psychological efficacy of treatment in a cohort and then to interventions also represent potential two- randomize subjects to continuation or disconti- group comparisons. A final model may include nuation of medication. Differential levels of a treatment and no-treatment comparison in the symptom severity after a fixed experimental presence of a specified medication condition period and time to symptom exacerbation are (see Section 3.03.5.3 for discussion of this design taken as evidence of treatment efficacy. In in the context of other designs that include contrast, studies of psychological interventions medication conditions). 
have often considered persistence of effect Such studies may evaluate the benefit of two following the end of treatment as evidence for specific psychological interventions, of a spe- the efficacy of the psychological intervention. cific intervention vs. a nonspecific control (usual The problem that these conflicting paradigms care) or of intervention vs. none. The hypoth- represent will be considered further in Section eses and specific clinical questions that drive a 3.03.5.3 which considers comparative effects of given investigation should, in principle, drive pharmacologic and psychological treatments. the particular nature of the experimental and control groups. However, sometimes logistic considerations such as the nature of the treatment setting and the treatments that are 3.03.5 STUDY DESIGNS routinely provided in the setting may influence the design of the experiment. A further, The choice of study design always involves important factor may be the clinical needs of compromise. The clever experimenter can al- the clients and the urgency of providing some ways think of additional controls and varia- intervention. tions. In the previous section some of the There are limitations of two-group designs. relevant issues have been highlighted. Ethical Whatever the outcome, the design will not considerations may proscribe some scientifically allow testing of a range of alternate hypotheses. appropriate conditions. For example, under If the study does not include a no-treatment some circumstances deferring treatment control group, there is no control for time, through the use of a ªwaiting listº control spontaneous remission, or improvement. If the may be inappropriate. Financial resources may study does not include an alternate interven- constrain designs. 
Aside from funding con- tion group there is no control for nonspecific straints, availability of appropriate subjects may factors in treatment or for the specific limit the number of conditions that can be characteristics of the designated experimental studied. This may be stating the obvious, but the treatment. Further, if the study includes two more conditions and the larger the number of specified interventions, there is no control for subjects in a study, the longer it will take to either receipt of any treatment or for non- complete the study and the longer an important specific factors in treatment. clinical question will remain unanswered. Each Interpretation of the findings from two-group additional group in a study increases both the studies absent a no-treatment control is difficult cost and the duration of a study. Finally, there is if there are no differences in outcome between no single, ideal design. Ultimately, the design of the groups. No difference can be interpreted as a clinical experiment represents a decision that is Lewis Carroll and the Red Queen would have us driven by the hypotheses that are under believe ªthat all have won and all shall have investigation. What is critical is the recognition prizesº or that there is no effect of either. by the investigator (and by the reader/critic of Interpretation of studies that do not include a the research) of the hypotheses that can be no-treatment group may often depend on tested, and conversely, the questions that are integrating findings from other prior studies simply not addressed by a given study. that did include such controlsÐnot necessarily a bad state of affairs. As the field of experimental 3.03.5.1 Simple Two-group Comparisons studies of psychological interventions matures, it may become less appropriate to implement Designs involving only two groups are often studies with no-treatment control groups. 
used in the early stages of research with a given In the following example of a two group form of psychotherapy. The comparison group design, Keane, Fairbank, Caddell, and Zimer- may be a no-treatment waiting list for the ing (1989) compared implosive (flooding) duration of the treatment, treatment ªas usualº therapy to a waiting list control in 24 subjects in the treatment setting, or a nonspecific with post-traumatic stress disorder (PTSD). psychological intervention. Nonspecific psy- Interestingly, the study design had originally chological interventions are sometimes referred included a stress management group, but for Study Designs 55 unspecified reasons, this condition was not knew the treatment assignment. As discussed in successfully implemented, and those subjects Section 3.03.4.5, when treatment assignment is are not included in the report. Thus, the two- known, there is a potential bias. group comparison is between a specified treatment and a no-treatment control. Implo- sive therapy was manual driven, following a 3.03.5.2 Multiple-group Comparisons manual (Lyons and Keane, as cited in Keane et al., 1989), and included between 14 and 16 Multiple-group comparisons offer the op- sessions. The experimental group received portunity to test hypotheses simultaneously baseline, post-treatment (elapsed time from regarding comparisons of interventions (speci- baseline not reported), and a six-month follow- ficity of effects) and of intervention to a no- up assessment. The wait-list control group was intervention control. If the interventions being assessed twice: at baseline prior to randomiza- studied derive from theoretical models, their tion and after, on average, four months. Half of comparison may test specific hypotheses re- the implosive therapy group and the majority garding the psychological mechanisms that of subjects in the wait-list group received underlie the condition being treated. 
Other anxiolytic medication because of, as the more pragmatic questions that can be addressed authors note, ª. . . concerns about placebo in multiple-group comparisons include group groups and no treatment controls . . . we didn't vs. individual contact and treatment intensity. attempt to withdraw these patients from the However, it should be noted that when intensity medications which were prescribed to themº or dosage of the intervention becomes a (Keane et al., 1989, p. 249). The authors condition, control for amount of contact is maintain that the comparison of implosive lost. Again, there is no single ªbestº design. The therapy to subjects who were receiving phar- appropriate design depends on the questions macotherapy (even if it was not systematically being asked and the appropriate questions administered) represented a more stringent test depend on the stage of research. For example, of implosive therapyÐalthough it was not a questions of treatment duration or frequency part of the original design. Subjects were are more likely to be asked after a particular assessed using well-known standardized assess- treatment has been shown to have an effect. ments scales for depression, trait and state Brown and Lewinsohn (1984) compared anxiety, and instruments specifically designed three psychoeducational approaches in treating by the investigators for the assessment of depression that all used the same course PTSD. Post-test assessments were completed materials: class, individual tutoring, brief tele- by the therapist who treated the subject in the phone contact, and a delayed treatment control. implosive therapy group and by one of the Sixty-three individuals who met diagnostic same therapists for the wait-list group. In criteria for unipolar depression and paid a addition, the subjects rated themselves on course fee were randomly assigned to one of the depression, anxiety, and satisfaction with social four groups. 
Half the course fee was refunded if adjustment in several life areas. subjects completed all evaluations. Subjects in Implosive therapy appeared to improve the three immediate treatment groups were depression and anxiety according to self-report assessed pre- and post-treatment and at two and specific features of PTSD as rated by the later follow-up points. The delayed treatment therapists. No changes in social adjustment control group was assessed at baseline and at were seen. eight weeks, the time equivalent of post- Strengths of the study include randomiza- treatment after which they received the class tion, specification of diagnosis, apparent ab- condition. All treatments focused on specific sence of dropouts in both conditions, the behaviors believed to be dysfunctional in existence of a manual for the experimental depression (although not necessarily in the treatment, and the use of both standard study subjects) and followed a syllabus that outcome measures and measures tailored to included a text (Lewinsohn, Munoz, Youngren, the specific study hypotheses. Although sub- & Zeiss, 1978) and a workbook (Brown & jects were randomly assigned to treatment, the Lewinsohn, 1984). An independent interviewer randomization was compromised by the fact completed a standardized diagnostic assessment that treatment in one of three groups to which at baseline and of symptoms at later assessment subjects were randomized was, for unspecified points. Subjects completed three standardized reasons, not fully implemented. In other words, self-report measures. The primary outcome this study was a two-group design in execution measure was a factor-analytically derived score rather than intent. Another weakness is that that included all three self-report measures. 
assessments were completed by the therapists Improvement was seen in all three immediate who treated the subjects and who therefore treatment groups on this single composite 56 Group Comparisons: Randomized Designs measure compared to the delayed treatment 3.03.5.3 Multiple-group Comparisons that group. There were no differences among the Include Medication Conditions active treatment groups either in self-report or in rate of diagnosed depression at follow-up. Although in principle, designs that compare Comparison of high and low responders to medication to psychological interventions can treatment found few differences; none were a be classified in terms of whether they are simple function of specific treatment. No detailed comparisons (Section 3.03.5.1), multiple com- assessment of psychopathology or self-report parisons (Section 3.03.5.2) or represent factorial measures was made. designs (Section 3.03.5.4), they entail unique The strengths of the study include randomi- research challenges and therefore merit separate zation, a design that tested specific effects of a consideration. The first design challenge was treatment modality and the method of delivery, introduced earlier, namely the difference in the the absence of drop-outs from treatment or model of assessing effects. Follow-up assess- assessment, specification of treatment condi- ments after treatment ends are seen as important tions, the use of standardized measures of sources of information in studies of psycholo- outcome assessment, and the use of independent gical treatments. In contrast, such assessments assessors. The major weakness of the study is are rare in studies of pharmacotherapy. Relapse not in its design but the limited use made of the or re-emergence of symptoms after treatment assessment battery. Ratings made by the discontinuation is taken as evidence of efficacy. 
independent assessors were only used to assess Therefore, discontinuation designs to study the presence or absence of diagnosable depres- efficacy of treatment are common in psycho- sion at the six-month follow-up and the use of a pharmacology. A second design challenge is single summary score for all self-report mea- whether psychological and pharmacologic sures may well conceal more than it reveals. A treatments should be introduced at the same second weakness is the absence of assessments time or are appropriate for different stages of an of implementationÐhow well the instructors illness. Third, pharmacologic studies use used the course materials in the three conditions double-blind methods as the preferred strategy and whether all three conditions were imple- to control assessor bias and therefore studies mented equally well. often rely on treating clinicians to assess It is difficult to decide whether charging outcome. Because it is not possible to blind subjects for treatment is a strength or a treating clinicians to the psychological treat- weakness. On the one hand, payment for ment they are providing, studies of psycholo- treatment is characteristic of clinical treatment gical interventions rely on independent settings. All subjects completed the study and it assessors who, although they will be blind to could be argued that motivation to receive a treatment assignment, have reduced opportu- 50% rebate on the course fee was a factor in nity to observe, and have minimal opportunity enhancing study completion. On the other hand, to develop rapport with the subjects they are the authors report that one of the major reasons assessing. For these reasons, evaluations by that screened potential participants did not independent assessors may be less sensitive than enter the study was inability to pay. Thus, the those made by treating clinicans. 
Fourth, study population may have been biased by the psychological interventions are often hypothe- exclusion of these subjects. sized to affect different aspects of outcome than This study exemplifies a model that decon- pharmacologic interventions, so that direct structs a psychotherapeutic approachÐin this comparisons of effects may be difficult. case a psychoeducational approach to the treat- Historically, pharmacologic clinical trials ment of depressionÐin order to identify which have relied more heavily on medical diagnoses treatment delivery strategy offers advantages in of mental illness than studies of psychological outcome given that the strategies differ in the interventions. But current research in both amount of clinical time and resources required pharmacologic and psychological treatment is to deliver the treatment. All subjects, including largely wedded to specific diagnoses derived the delayed treatment group, reported signifi- from the DSM of the American Psychiatric cant improvement. The implication of these Association (1987, 1994) or the ICD of the findings is that the least costly method of treat- World Health Organization (1992). Parentheti- ment delivery, brief telephone contact, should cally, one can question whether the increased be implemented. However, there was no report reliance on medical diagnostic schemes such as regarding the acceptability of the method of DSM and ICD represents an advance in studies treatment delivery and as noted earlier, there of psychological intervention. It has been was no assessment of post-treatment symptoms. argued that diagnostic specificity enhances In the absence of these data, perhaps a firm reliability and therefore enhances replication clinical recommendation is premature. and communication. On the other hand, clients Study Designs 57 who seek treatment do not necessarily seek help medication and psychological treatment. These for a specific diagnostic label. 
designs will be considered in Section A wide range of designs have been used to 3.03.5.4. examine the relationship of medications and An example of a study that compared psychological treatments. In the simplest, a medication and a specific psychotherapy is medication or placebo is added to a uniform the study of prevention of recurrence of psychological intervention, or the converse, a depression by Frank et al. (1990). Patients psychological intervention or control condition characterized by recurrent depression (at least is added to an established medication condition. two prior episodes) were treated with medica- These designs examine the additive effect of the tion and manual-based interpersonal psy- modality that is superimposed. Questions like chotherapy (IPT) (Klerman, Weissman, does naltrexone enhance abstinence in alcoholic Rounsaville, & Chevron, 1984) until symptom subjects who are in a therapeutic milieu or does remission was documented and maintained for social skills training enhance social functioning 20 weeks. One hundred and twenty-eight in schizophrenic subjects who are maintained subjects were then randomized to one of five on antipsychotic medication can be addressed in treatment conditions: a maintenance form of this manner. The major design challenges faced interpersonal psychotherapy (IPT-M) that was by such studies are: the inclusion of appropriate characterized by less frequent visits; IPT-M and outcome measures to evaluate change on an antidepressant (imipramine) at the acute psychological interventions; insuring that the treatment dose; IPT-M and medication place- treatment modality which represents the back- bo; medication clinic visits and imipramine; ground condition remains relatively constant medication clinic visits; and medication place- during the course of the study; and insuring an bo. Subjects were treated for three years. 
adequate timeframe in which to evaluate effects Therapists were trained and certified by the of psychological treatments (in the design where developers of IPT. The major difference from psychological intervention is manipulated). the published manual is described as visit The second class of studies adds a medication frequency. The major outcome examined was condition (or conditions) to a multiple-group recurrence of depressive episodes during the design such as those discussed in the previous three-year period. The two treatment arms that section. These studies examine the comparative included imipramine (IPT-M and imipramine; effects of medication and psychological treat- imipramine and clinic visits) had the lowest risk ments and face greater challenges. They have of recurrence. Mean survival time was more been called ªhorse racesº and may thereby con- than 83 weeks. The two groups that received tribute to guild conflicts between psychologists IPT-M without medication (IPT-M alone; IPT- and psychiatrists (Kendall & Lipman, 1991). M and placebo) had a higher risk of recurrence; Appropriate inclusion criteria for medication mean survival time was more than 60 weeks. and psychological treatments may differ and The lowest survival time was seen in the group compromises may be necessary. Such studies that received placebo and medication clinic will generally include double-blind conditions visits; survival time was 38 weeks on average. for medication and also benefit from inclusion The authors conclude that medication, at a of independent assessors. The studies require relatively high maintenance dose, affords the expertise in both clinical psychopharmacology greatest protection from recurrence in indivi- and the relevant psychological interventions. duals who have experienced recurrent depres- Insuring this expertise for a study usually sion. 
However, psychotherapy, absent requires close collaboration and mutual respect medication, represents a distinct advantage by investigators drawn from psychology and over clinic attendance coupled with a placebo. psychiatry. Outcome criteria need to include The study has a number of strengths. measures of dimensions that are hypothesized Patients were randomized to treatment condi- to be sensitive to change in both modalities tion. There was an extremely low dropout rate being studied and the reasons for inclusion of (17%) over the three-year study duration. The specific outcome measures should be explicit. A design included control conditions for medica- detailed review of these and other methodolo- tion (placebo) and psychotherapy was admi- gical considerations can be found in an article nistered under three conditions: alone, in by Kendall and Lipman (1991) and in the article combination with imipramine, and in combi- detailing the research plan for the National nation with imipramine placebo. The psy- Institute of Mental Health Treatment of chotherapy followed an established manual Depression Collaborative Research Program and therapists were trained by the originators (Elkin, Parloff, Hadley, & Autry, 1985). of the therapy. Survival analysis represents the Finally, factorial designs provide explicit tests most powerful statistical method for evaluating of both additive and interactive effects of risk of recurrence. 58 Group Comparisons: Randomized Designs
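
As an illustration of the kind of survival analysis referred to above, the following minimal Python sketch computes a Kaplan-Meier (product-limit) estimate of the probability of remaining free of recurrence over time. It is offered only as a sketch under stated assumptions: the function name `kaplan_meier` and the data are hypothetical, they are not taken from Frank et al. (1990), and the specific estimator used in that study is not stated in this chapter. Subjects who drop out or reach the end of the study without recurrence are treated as censored, which is what allows all randomized subjects to contribute information.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival probability) at each time with a recurrence.

    times  -- weeks until recurrence or until last contact (hypothetical units)
    events -- 1 if a recurrence was observed, 0 if the subject was censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)          # subjects still under observation
    survival = 1.0                # running product-limit estimate
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        recurrences = 0
        leaving = 0
        # Group all subjects whose observation ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            recurrences += pairs[i][1]
            leaving += 1
            i += 1
        if recurrences > 0:
            survival *= 1.0 - recurrences / at_risk
            curve.append((t, survival))
        # Both recurrences and censored subjects leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical example: six subjects, two of them censored (event = 0).
weeks = [10, 20, 20, 35, 50, 60]
recur = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(weeks, recur)
for week, prob in curve:
    print(f"week {week}: estimated recurrence-free proportion {prob:.3f}")
```

In this sketch a subject censored at a given week is still counted as at risk for any recurrence recorded in that same week, which is the usual convention. Comparing such curves between treatment arms would require an additional test, which is not shown here.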

Weaknesses include the limited examination of clinical outcomes. Although the authors mention the importance of judging interepisode social functioning, the article only examined recurrence risk. Second, the study population was, by design, limited to those who both had recurrent episodes and recovered from them. Of the 230 patients who met the first criterion, fully 42% did not enter the randomized trial. This represents a substantial limitation to generalization of the findings to depression broadly defined.

3.03.5.4 Factorial Designs

Designs that include at least two factors allow for detection of main effects in each factor as well as interactive effects. In the simplest of these, 2 × 2 designs, there is a control condition or at least two defined levels of each factor, as shown in Figure 1. If the factors represent independently defined treatment modalities, then such designs can detect the ability of one treatment to potentiate the other or even to inhibit the effect of the other. For this reason, such designs are particularly suited to examining the relationship of medication and psychological interventions. Uhlenhuth, Lipman, and Covi (1969), Schooler (1978), and Hollon and DeRubeis (1981) have considered factorial models for the study of the interaction of medication and psychological treatments. According to Uhlenhuth et al. (1969), there are four possible forms that the interaction can take. These are shown in Figure 1. An additive effect is defined as the simple sum of the effects of the two treatment modalities. The effect in cell 1 is equal to the combined effects seen in cells 2 and 3. The treatments may potentiate each other. In that case, the combined effect (cell 1) is greater than the additive sum of the two treatment effects (cells 2 and 3). The treatment modalities may inhibit one another. In that case the combined effect (cell 1) is less than the effects of each treatment independently (cells 2 and 3). Finally, there may be an effect that they call reciprocal, in which the combined effect (cell 1) is equal to the main effect of the more effective treatment (the effect seen in either cell 2 or cell 3). Detection of all these effects depends on the factorial design and the presence of cell 4 which represents the combined control condition or as it has sometimes been dubbed, the double placebo.

Factorial designs are potentially a valuable model for examination of other important variables in understanding treatment, for example, to control for the effect of setting in studies that are conducted in more than one clinical setting. However, the most important use, from the perspective of this chapter, is to study the relationship of medication and psychological treatment.

Marks and his colleagues (1993) studied the effects of medication and a psychological treatment for panic disorder. There was prior evidence of efficacy for both the medication, alprazolam, an antianxiety drug, and the experimental psychological treatment, live exposure. One hundred and fifty-four subjects were randomly assigned to four treatment conditions: alprazolam and exposure (AE cell 1, Figure 1); alprazolam and relaxation, the control for the psychological treatment (AR cell 2, Figure 1); placebo and live exposure (PE cell 3, Figure 1); and placebo and relaxation, so-called double-placebo (PR cell 4, Figure 1). Pharmacologic and psychological treatment lasted for eight weeks, medication was tapered during the following eight weeks and subjects were followed for an additional six months to assess relapse or maintenance of gains following discontinuation of treatment. Both exposure and relaxation followed published guides, not treatment manuals (Marks, 1978; Wolpe & Lazarus, 1966). The study was conducted at two centers; one in London, UK, and the other in Toronto, Canada.

The primary outcome measures were: ratings of avoidance by an assessor and the subject; the number of major panics; work and social disability; and the clinician's rating of improvement. After eight weeks of treatment, there were significant main effects of both alprazolam and exposure. The effects of alprazolam, as would be expected for a pharmacologic treatment, did not persist during follow-up, whereas the effects of exposure persisted after treatment discontinuation. Interpretation of findings is complicated by the fact that there was substantial improvement in total major panic in the "double-placebo" group that received medication placebo and relaxation.

In addition to reporting statistical significance of outcome measures, the article also reports the effect size of various outcome measures (Cohen, 1988, 1992). With large samples, differences that are clinically uninteresting may be statistically significant, that is, they are unlikely to be due to chance. The effect size is less influenced by sample size and may therefore be considered a better indicator of clinical relevance. During the first eight weeks, when both treatments were being delivered and were significantly better than their controls, the effect size is larger for exposure than for alprazolam. According to the definitions of Uhlenhuth et al. (1969), this would be a

Pharmacologic treatment | Psychological treatment: Experimental | Psychological treatment: Control
Experimental | 1 Alprazolam/exposure | 2 Alprazolam/relaxation
Control | 3 Placebo/exposure | 4 Placebo/relaxation
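As a rough illustration of the four interaction forms that Uhlenhuth et al. (1969) distinguish, the cell numbering of Figure 1 can be turned into a small calculation. The sketch below is not taken from any of the cited studies: the function name, the tolerance `tol`, the convention that each cell holds a mean improvement score (higher is better), and the subtraction of cell 4 as a common baseline are all assumptions made for illustration. Cases that fall between the four definitions are reported as unclassified.

```python
def classify_interaction(cell1, cell2, cell3, cell4, tol=0.05):
    """Label the interaction form from the four cell means of Figure 1.

    cell1: drug + psychological treatment, cell2: drug + control,
    cell3: placebo + psychological treatment, cell4: double placebo.
    All values are hypothetical mean improvement scores (assumption).
    """
    # Express every effect relative to the double-placebo cell.
    combined = cell1 - cell4
    drug_only = cell2 - cell4
    psych_only = cell3 - cell4

    additive_sum = drug_only + psych_only
    best_single = max(drug_only, psych_only)
    worst_single = min(drug_only, psych_only)

    # Additive is checked first; if one single effect is zero,
    # additive and reciprocal coincide and "additive" is returned.
    if abs(combined - additive_sum) <= tol:
        return "additive"
    if combined > additive_sum:
        return "potentiating"
    if abs(combined - best_single) <= tol:
        return "reciprocal"
    if combined < worst_single:
        return "inhibiting"
    return "unclassified"


# Hypothetical cell means, for illustration only.
print(classify_interaction(6, 5, 6, 1))
```

With the hypothetical cell means 6, 5, 6, and 1, the combined effect equals the effect of the more effective single treatment, which the sketch labels reciprocal.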

reciprocal effect. Effect sizes are moderate to large for alprazolam and very large for exposure according to criteria defined by Cohen (1992). The effect sizes diminish over time for exposure but are still in the large to very large range after six months. Those for alprazolam are absent to small during the follow-up period. Effect sizes for total major panics are absent to small throughout the study because of the effect in the control group. In general, the findings show a greater effect for exposure and no consistent pattern of interaction between psychological treatment and medication.

The strengths of the study include an appropriate population for study. All subjects met stringent criteria for panic disorder and none had been nonresponders to either of the treatment modalities prior to study entry. Further, assignment to treatment was random, assessment was double-blind to medication and single-blind to psychological treatment, assessment included both self- and assessor-completed measures, and analyses addressed the range of assessments included.

The design reveals one of the problems inherent in studies designed to assess interactions of medication and psychological treatments. The discontinuation design, which was appropriate for the psychological treatment in the present study, is not optimal for assessment of pharmacologic treatments, and the fact that pharmacologic effects were not maintained after discontinuation should come as no surprise. Analysis was restricted to 134 subjects who completed at least six weeks of treatment (a 16% dropout rate). Although the rate is relatively low compared to other studies of alprazolam, and there were no differences in baseline characteristics between the 20 subjects who withdrew from the study and the 134 who did not, it is unclear how their inclusion in analyses might have altered the results. The general wisdom regarding randomized clinical trials is unequivocal in stating that all subjects randomized must be included in analysis but a review of a standard journal in the field, Controlled Clinical Trials, during the past four years did not reveal a single specific citation.

3.03.6 INDIVIDUAL DIFFERENCES: WHAT TREATMENT WORKS FOR WHOM?

From a clinical perspective, perhaps the most important question regarding treatment is the appropriate matching of patient or client to treatment. The idea that a given treatment approach is appropriate for all (even all who share certain characteristics) runs counter to clinical and research experience. Although randomized experiments may not represent the optimal method for evaluating this question, they can provide useful information. Two rather different methods can be used in the context of randomized trials. The first is the use of factorial designs and the other is the post hoc examination of individual characteristics as predictors of treatment response. This section will consider these methods.

3.03.6.1 Individual Difference Variable

The use of an individual difference variable as a factor in a factorial design allows a direct experimental test of a hypothesis regarding differential effects of a patient or client characteristic in relation to a psychological treatment. This strategy is difficult to implement: each client characteristic included doubles the number of subjects needed in a study. Perhaps the most common example of a client characteristic studied in this way is the use of more than one center in a study so that the generalization across centers can be tested. Examples of this strategy are relatively common (e.g., Elkin et al., 1989; Marks et al., 1993), but it is unusual for hypothesized differential effects to be proposed. A second variable that is sometimes investigated in a factorial design is gender.

3.03.6.2 Post hoc Examination of Predictors of Treatment Response

In general, post hoc analyses of individual client characteristics as differential predictors of treatment response have been disappointing. For example, several of the studies that are reviewed in this chapter have attempted to identify characteristics of clients who fare particularly well in the treatments studied and, as has been noted in the descriptions, reliable differences have not been found. Review of the psychopharmacology literature yields a similar conclusion.

Perhaps the most reliable finding regarding prediction of treatment outcome has been severity of symptoms. In a number of studies where symptom severity has been studied, the evidence suggests that it usefully discriminated response (e.g., Elkin et al., 1989). In that study, discrimination of differences between medications and psychotherapies was clearer among subjects with greater symptom severity.

Difficulty in reliable identification of individual characteristics that predict differential response may stem from a number of causes. The nihilistic view is that there simply are no reliable predictors. Two alternatives seem more reasonable. The first is that although studies may have adequate power to detect treatment differences, they do not have adequate power to examine individual characteristics. If this is the case, meta-analytic strategies could allow the examination of data regarding characteristics that are frequently considered such as gender, age, symptom severity, duration of symptoms, referral source, comorbid psychological problems, or diagnoses. See Chapter 15, this volume, for a discussion of this problem and methods to deal with it. The final alternative is that client characteristics which are predictive of treatment outcome are elusive and are not captured adequately in experimental studies. Characteristics such as motivation and client–therapist rapport come immediately to mind.

3.03.7 CONCLUSIONS

Randomized, experimental designs represent a useful method in the evaluation of psychological treatments. They allow unambiguous conclusions regarding the treatments that are studied in the subjects who receive them. This chapter has reviewed in some detail several generally excellent individual studies in order to identify their strengths and weaknesses. The astute reader will have noted that all the studies had both strengths and weaknesses. The goal of drawing attention to these is twofold. The first is to provide readers of the research literature with a framework within which to evaluate other studies. The second is to hope that the observations in this chapter may contribute to the ongoing process of improving the quality of experimental studies of psychological treatments. In this way we can maximize the value of the contributions made by subjects who agree to participate in experiments and improve the treatment and care of clients and patients.

3.03.8 REFERENCES

American Psychiatric Association (1987). Diagnostic and statistical manual of mental disorders (3rd ed. Rev.). Washington, DC: Author.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Bickel, W. K., Amass, L., Higgins, S. T., Badger, G. J., & Esch, R. A. (1997). Effects of adding behavioral treatment to opioid detoxification with buprenorphine. Journal of Consulting and Clinical Psychology, 65, 803–810.
Brill, A. A. (1938). The basic writings of Sigmund Freud. New York: Modern Library.
Brock, T. C., Green, M. C., Reich, D. A., & Evans, L. M. (1996). The Consumer Reports Study of Psychotherapy: Invalid is Invalid. American Psychologist, 51, 1083.
Brown, R. A., & Lewinsohn, P. M. (1984). A psychoeducational approach to the treatment of depression: Comparison of group, individual, and minimal contact procedures. Journal of Consulting and Clinical Psychology, 52, 774–783.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Chicago: Rand McNally.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cole, J. O. (1968). Peeking through the double blind. In D. H. Efron (Ed.), Psychopharmacology. A review of progress 1957–1967 (pp. 979–984). Washington, DC: US Government Printing Office.
Consumer Reports (1994). Annual questionnaire.
Consumer Reports (1995, November). Mental health: Does therapy help? Consumer Reports, 734–739.
Cook, T. D., & Campbell, D. T. (Eds.) (1979). Quasi-experimentation: design and analysis issues for field settings. Boston, MA: Houghton Mifflin.
Department of Health and Human Services, Office of the Secretary (1994). Protection of Human Subjects. Title 45 of the Code of Federal Regulations, Sub-part 46. OPRR Reports, Revised June 18, 1991, Reprinted March 15, 1994.
Elkin, I., Parloff, M. B., Hadley, S. W., & Autry, J. H. (1985). NIMH Treatment of Depression Collaborative Research Program. Background and Research Plan. Archives of General Psychiatry, 42, 305–316.
Elkin, I., Shea, T., Watkins, J. T., Imber, S. O., Sotsky, S. M., Collins, J. F., Glass, D. R., Pilkonis, P. A., Leber, W. R., Docherty, J. P., Fiester, S. J., & Parloff, M. B. (1989). National Institute of Mental Health treatment of depression collaborative research program. General effectiveness of treatments. Archives of General Psychiatry, 46, 971–982.
Flick, S. N. (1988). Managing attrition in clinical research. Clinical Psychology Review, 8, 499–515.
Frank, E., Kupfer, D. J., Perel, J. M., Cornes, C., Jarrett, D. B., Mallinger, A. G., Thase, M. E., McEachran, A. B., & Grochocinski, V. J. (1990). Three-year outcomes for maintenance therapies in recurrent depression. Archives of General Psychiatry, 47, 1093–1099.

Frank, J. D. (1971). Therapeutic factors in psychotherapy. American Journal of Psychotherapy, 25, 350–361.
Hollon, S. D., & DeRubeis, R. J. (1981). Placebo–psychotherapy combinations: Inappropriate representations of psychotherapy in drug-psychotherapy comparative trials. Psychological Bulletin, 90, 467–477.
Hollon, S. D., Waskow, I. E., Evans, M., & Lowery, H. A. (1984). System for rating therapies for depression. Read before the annual meeting of the American Psychiatric Association, Los Angeles, May 9, 1984. For copies of the Collaborative Study Psychotherapy Rating Scale and related materials prepared under NIMH contract 278-81-003 (ER), order "System for Rating Psychotherapy Audiotapes" from US Dept of Commerce, National Technical Information Service, Springfield, VA 22161.
Hunt, E. (1996). Errors in Seligman's "The Effectiveness of Psychotherapy: The Consumer Reports Study." American Psychologist, 51(10), 1082.
Keane, T. M., Fairbank, J. A., Caddell, J. M., & Zimering, R. T. (1989). Implosive (flooding) therapy reduces symptoms of PTSD in Vietnam combat veterans. Behavior Therapy, 20, 245–260.
Kendall, P. C., & Lipman, A. J. (1991). Psychological and pharmacological therapy: Methods and modes for comparative outcome research. Journal of Consulting and Clinical Psychology, 59, 78–87.
Klerman, G. L., Weissman, M. D., Rounsaville, B. J., & Chevron, E. S. (1984). Interpersonal psychotherapy of depression. New York: Basic Books.
Kotkin, M., Daviet, C., & Gurin, J. (1996). The Consumer Reports Mental Health Survey. American Psychologist, 51, 1080–1082.
Kriegman, D. (1996). The effectiveness of medication: The Consumer Reports study. American Psychologist, 51, 1086–1088.
Lewinsohn, P. M., Munoz, R. F., Youngren, M. A., & Zeiss, A. M. (1978). Control your depression. Englewood Cliffs, NJ: Prentice-Hall.
Marks, I. M. (1978). Living with fear. New York: McGraw-Hill.
Marks, I. M., Swinson, R. P., Basoglu, M., Kuch, K., Noshirvani, H., O'Sullivan, G., Lelliott, P. T., Kirby, M., McNamee, G., Sengun, S., & Wickwire, K. (1993). Alprazolam and exposure alone and combined in panic disorder with agoraphobia. A controlled study in London and Toronto. British Journal of Psychiatry, 162, 776–787.
Mintz, J., Drake, R. E., & Crits-Christoph, P. (1996). Efficacy and Effectiveness of Psychotherapy: Two Paradigms, One Science. American Psychologist, 51, 1084–1085.
Pocock, S. J. (1983). Clinical trials: a practical approach. New York: Wiley.
Pocock, S. J., & Simon, R. (1975). Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics, 31, 103–115.
Prien, R. F., & Robinson, D. S. (Eds.) (1994). Clinical evaluation of psychotropic drugs: principles and guidelines. New York: Raven Press.
Schooler, N. R. (1978). Antipsychotic drugs and psychological treatment in schizophrenia. In M. A. Lipton, A. DiMascio, & K. F. Killam (Eds.), Psychopharmacology: a generation of progress (pp. 1155–1168). New York: Raven Press.
Seligman, M. E. P. (1995). The effectiveness of psychotherapy: The Consumer Reports study. American Psychologist, 50, 965–974.
Seligman, M. E. P. (1996). Science as an ally of practice. American Psychologist, 51, 1072–1079.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752–760.
Uhlenhuth, E. H., Lipman, R. S., & Covi, L. (1969). Combined pharmacotherapy and psychotherapy: Controlled studies. Journal of Nervous and Mental Diseases, 148, 52–64.
Waskow, I. E., & Parloff, M. B. (1975). Psychotherapy change measures. Washington, DC: National Institute of Mental Health, US Government Printing Office.
White, S. J., & Freedman, L. S. (1978). Allocation of patients to treatment groups in a controlled clinical study. British Journal of Cancer, 37, 849–857.
Wolpe, J., & Lazarus, A. (1966). Behaviour therapy techniques. Oxford, UK: Pergamon.
World Health Organization (1992). International classification of diseases (ICD-10) (10th ed.). Geneva, Switzerland: Author.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.04 Multiple Group Comparisons: Quasi-experimental Designs

HANS-CHRISTIAN WALDMANN and FRANZ PETERMANN Universität Bremen, Germany

3.04.1 INTRODUCTION
  3.04.1.1 The Relevance of Experimentation in Clinical Psychology
  3.04.1.2 Terms: "True," "Quasi-," and "Non-" Experimental Studies
  3.04.1.3 The Concept of an "Effect"
3.04.2 THE LOGIC OF DESIGN IN THE COURSE OF EMPIRICAL RESEARCH
3.04.3 CRITERIA: SENSITIVITY, VALIDITY, AND CAUSALITY
  3.04.3.1 Sensitivity
  3.04.3.2 Validity
  3.04.3.3 Causality, Practice, and Pragmatism
3.04.4 CONTROL IN QUASI-EXPERIMENTAL DESIGN
3.04.5 A PRIMER SYSTEM OF DESIGN
  3.04.5.1 Selections from the General Scheme: Nonequivalent Group Designs
  3.04.5.2 Using a Pretest
    3.04.5.2.1 Using pretest as a reference for gain scores
    3.04.5.2.2 Using pretests as approximations to initial group equivalence
    3.04.5.2.3 When not to use a pretest
    3.04.5.2.4 Outcome patterns
  3.04.5.3 Using Multiple Observations
  3.04.5.4 Using Different Forms of Treatment
  3.04.5.5 Using Different Forms of Comparison Groups
  3.04.5.6 Using Combinations of Different Designs
  3.04.5.7 Regression Discontinuity
3.04.6 ANALYSIS
3.04.7 CONCLUSION
3.04.8 REFERENCES

3.04.1 INTRODUCTION

3.04.1.1 The Relevance of Experimentation in Clinical Psychology

Contributions of clinical psychology to health services have received widespread recognition and the benefits are undisputed. The parallel increase of costs in such services due to growing utilization of psychological aids, however, must be justified on grounds of the scientific method. There is a double need for valid and reliable demonstration of the value in technology derived from and offered by clinical psychology: efficacy in serving the customer and efficiency legitimizing it to the supplier must be shown. Quasi-experimental multiple group comparison designs are well-suited for this task: they account for the fact that randomization may often not be an adequate sampling strategy in real-life clinical research situations and still allow for assessment of absolute and relative efficacy of psychological interventions. They help evaluate the merits of different components in interventions and support decision making as regards implementation, monitoring, and optimization of respective programs. This practical, if not wholly pragmatic, research and reasoning contributes directly to either correction or refinement of the present state of scientific knowledge and development in the domain of clinical psychology. It is practice where results from clinical psychology are to be evaluated, and it will be seen that multiple group comparisons offer adequate tools to do so.

3.04.1.2 Terms: "True," "Quasi-," and "Non-" Experimental Studies

In this chapter, multiple group comparisons will be focused on by means of quasi-experimental designs. Delimiting characteristics to distinguish the three kinds of empirical research situations quoted above are subject assignment, modus of treatment, and control. It is agreed widely to speak of true experiments when both random subject assignment and creation of treatment or, in a broader sense, active manipulation of conditions occur. In a quasi-experimental setup the first condition is weakened, and in observational studies both conditions are completely released. Some authors refer to the term quasi-experimental as to the constraint that the researcher cannot deliberately assign subjects to conditions or that occurrence of a treatment is only a typical but not necessary feature (e.g., Spector, 1981). In this case, the differentiation to pure observational studies seems obsolete. It is argued that any experimental situation, be it "true" or "quasi," involves an arbitrary discernible kind of treatment, by creation or manipulation of conditions regardless of mode of subject assignment, and that quasi-experiments feature all elements of true experiments except randomization.

If there is no manipulation of subjects or conditions in order to evoke variation in dependent measures (and thus contrasts for testing), reference is made to "pure" observational studies (see Chapters 4 and 13, this volume). The main problem then lies with selecting a statistical model and further in devising an appropriate parametrization that reflects as closely as possible the structure of scientific hypotheses (in their original, verbal form). The objective of such inquiries may comprehend the identification of latent traits and their structural composition (factor analysis), the identification of the dimensional metrics according to which subjects determine the relative value of different objects (multidimensional scaling/conjoint measurement), or the identification of a time series regression model in order to predict value and stability of a criterion. In neither case is a treatment strictly required.

3.04.1.3 The Concept of an "Effect"

Multiple groups comparison experiments are conducted in order to demonstrate relative effects of treatment. Effect hypotheses refer to presence, size, and variability of effects with respect to certain criterion measurements. The major objective of multiple group comparison designs lies with the generation of data in order to test such hypotheses. Subjects or groups representing different levels of independent variables or being subjected to different kinds of treatment, eventually measured at different points of time, are compared on data obtained in the same way in each group and at each occurrence. Then a classical concept of an "effect" is identified with the interaction of the subject or group factor and time/treatment with respect to the outcome measure. For example, the idea of a case–control study to determine therapy effectiveness might be pointed out as "if you apply X (treatment, independent variable) in one group, while not in another, and you observe a significant difference in Y (some functional status, dependent variable) between both groups, then you have reason to believe that there is a relationship between X and Y, given all else being equal." It will be seen that it is the ceteris paribus condition in the given clause, and how it is tentatively realized, that makes (quasi-)experimentation a delicate challenge to researchers' creativity and expertise.

A worked example of a quasi-experiment is chosen to introduce into the present chapter an illustration of some of these challenges and ways to counter them.

Worked example: Outline

In a very well designed study, Yung and Kepner (1996) compared absolute and relative efficacy of both muscular and cognitive relaxation procedures on blood pressure in borderline hypertensives employing a multiple-treatment–multiple-control extension of design Number 6 in Figure 2. It is reported that most clinical applications make use of an amalgam of cognitive techniques like suggestion, sensational focusing, and strict behavioral training of muscle stretching relaxation. The authors aim at partialing out the effects of these

various components by clearly separating instructions that target muscles directly and such instructions that try to mediate muscle tension by mental efforts. As a consequence, special attention is given to operational validity and treatment integrity. In order to counter confounding of subject characteristics and procedure variables the authors devise a rigorous subject selection and assignment policy which makes their work an excellent example for the case here.

Before engaging in describing various designs and methods of data analysis, and in evaluating this particular experiment, it seems instructive to outline a general process of experimental research.

Generally, it will not be possible to observe all possible instances of variable relations (which would make inference unnecessary), but evidence for the hypotheses will be sought by investigating subsets of the population(s) in which these relations are considered to hold true. Sampling variability and measurement error lead to the use of statistical models in order to determine the genuineness of an observed effect. Effect size (ES) serves to model the above "difference" in terms of statistical distributions (which, in turn, provide both the mathematical background and the factual means for testing effect size). The significance of statistical significance and the concept of reason based hereupon are evaluated in Chapter 14, this volume, and the issue of interpreting such a relationship as causal is briefly discussed in Section 3.04.3. But how does one arrive at effect hypotheses and corresponding statistical models in the first place?

3.04.2 THE LOGIC OF DESIGN IN THE COURSE OF EMPIRICAL RESEARCH

It is suggested that experimentation as a procedure is a special realization of a general process of empirical research, as depicted in Figure 1. Whether experimentation is seen as an appropriate access to address the problems in question depends on preceding decisions; if so, choice of design and sampling structure is conceived as an embedded level. Furthermore, it is proposed that this procedural outline is paralleled by a conceptual process of hypotheses derivation. Figure 1 illustrates this correspondence with hypotheses levels according to the deductive four-step model presented by Hager (1992):

(i) Generally agreeing with the hypothetico-deductive method, scientific hypotheses are derived from proposition structures in theory. Sometimes, mere need for evaluation or striking observations from practice must serve as a starting point for the research process.

(ii) Relative to marginal conditions (resources, time, etc.) and constraints in construct operationalization, a subject matter model is derived in terms of variables and their interrelation. Temporal ordering and other prerequisites of model theory guaranteed, predictions regarding empirical manifestations are derivable ("should then be higher than"). Devise a design now in order to translate a hypothesized structure of variables into a measurable one and a sampling strategy in order to actually perform the measurement. Hypotheses stating that certain strategies of verbal intervention in psychotherapy are of use for depressives might be translated into a "better-off" proposition as quoted in the text above. It is clear that, at this level, an experiment rationale is suggested: comparison of groups. Also, global aspects of testing are implied: assessment strategy (will psychometric tests be applied or will therapists rate patients' depressiveness by virtue of their expertise?), or inference conditions (will an effect be required to instantiate simultaneously on all criteria and to be stable for at least a year, thus calling for multiple post-tests, or will the whole study be replicated using different subjects and other therapists?). In transition to the next level, a set of decisions to be made is encountered that are known to statistical consultants as "how many" questions in project and design planning (how many subjects are enough per group in order to balance sensibly factors of sensitivity?, how many time intervals make up the optimal retest interval?, how many outliers can be tolerated in and across groups?, etc.).

(iii) Multiple group comparison as a conceptual strategy implies suppositions about structures in the data. Here, conceptually identified effect sizes receive an accurate, quantitative, and thereby testable formulation. A "pre–post difference" between groups may now be specified in terms of an expectation of a difference in central tendency of an outcome variable, a kind of second-order comprehensive measure that can be tested easily for significance at the next level. More generally, statistical predictions formally denote compound suppositions about the "behavior" of data, and a "good" design generates data behavior that by maximum degree either confirms or contradicts behavior that would be expected by hypothesis.

(iv) Statistical hypotheses are statements about specific values, variability, and ordinal or metric relations of parameters of distributions which follow from statistical predictions. A common prototype for such a measure is

[Figure 1 is a flow diagram. Procedural steps: emergence of need for research; primary research question; operationalization; design/sampling strategy; diagnostic devices/measurement; statistical data modeling; statistical testing of parameters; inference/decision. Corresponding hypothesis levels: theory (substantial hypotheses); operationalization into subject matter model (substantial prediction via logic of contrasts); selection and parametrization of a statistical model (statistical predictions; statistical hypotheses, ES = 0).]

given by $ES = (m_{\text{treat}} - m_{\text{control}})/s_{\text{control}}$. Its problems notwithstanding, this is considered a fairly reasonable and intuitive basic concept. Various other ES formulations can be translated to it (Cohen, 1988; Lipsey, 1990). For instance, in the example of therapy effectiveness, an effect is assumed to be present when statistical testing has determined that its ES differs from zero significantly, that is, beyond the expected limits of sampling error and with prespecified error probabilities of the α- and β-kind. In fact the probability is always estimated that there is no effect (therefore speaking of "null" hypotheses) or that, as it is frequently put in terms of sampling theory, comparison groups are samples that were drawn from the same population. In this case, observed differences merely reflect sampling error. Depending on the formulation of predictions, it is not always obvious at first sight whether these "null" probabilities are required to be high or low.

Figure 1 The process of empirical research.

Further complicating things, a statistical prediction must sometimes be further differentiated into a set of single-testable statistical hypotheses. It must then be decided whether an overall null (or alternative) hypothesis conforms to the compound statistical prediction or what pattern of acceptance or rejection of individual hypotheses integrates to a favorable finding. Moreover, there is a need to decide whether integration of separate findings should follow a disjunctive (compensatory) or conjunctive rule (e.g., must depressives benefit on all scales applied in order to justifiably speak of an effect). As a consequence of this multiple testing, there is a need to control for accumulation of error probabilities. Note, however, that "difference" does not necessarily imply the comparison of means. In most cases, pretest-standardized differences in means are in fact used for this purpose because such measures have known statistical properties (e.g., distributions and standard errors) following normal theory and can thus be easily tested statistically. But a test might be made for a ratio to be greater than a prespecified number (as is done frequently when using differential model fit across groups as a criterion) or for the inequality of regression intercept and slope parameters (as featured in regression discontinuity designs, see Section 3.04.5.7). The general rationale is to determine whether or not groups differ in any statistical parameter and to relate this difference to a reference sampling error estimate. (For details on procedures of significance testing see Cohen, 1988; Fisher, 1959; Giere, 1972; Mayo, 1985.) Criteria like adequacy (on each level, hypotheses preserve to maximum degree classificatory, ordinal, or metric relations assumed on preceding level), sufficiency (a linearly predictable trend may not be tested for monotony only), decisiveness and unambiguous identification of theory-conforming empirical states (Hager, 1992) serve to estimate validity of hypothesis derivation. After statistical testing, however, there is a need to work backwards: what effect size is considered practically significant? Is there a failure in predicting things as they should occur in theory? Do the findings entitle inductive inferences to be drawn to higher levels in the hierarchy of hypotheses, leading to tentative confirmation of propositions in the theoretical frame of reference? Some of the most serious problems in philosophy of science have been introduced here, and there is a need to go still further into them before turning to techniques.

3.04.3 CRITERIA: SENSITIVITY, VALIDITY, AND CAUSALITY

Conventionally, the major objective of experimentation is seen in demonstrating a causal relationship between phenomena, or, as Cook and Campbell (1979) put it in more correct terms, in facilitating causal inference. The logic underlying experimental design is an implicit variation of the concepts of covariation, temporal precedence, and of ruling out alternatives. This means, in terms of experimentation, that the hypothesized "cause" should be capable of being omitted and control group designs are preferable. One-shot case studies or pretest–post-test designs without at least one control or, in a broader sense, a comparison group, do not in any way allow for the kind of inference scientists are most interested in: causal ones. But, as was seen from Figure 1, experimentation ends up in statistics. The result of an experiment is that data are communicated in the form of statistical propositions. Whether statistical relationships should be taken as indicators only or be conceived as "causal," "functional," or "probabilistic" cannot be determined by simply rating how well the rules of experimentation were obeyed, but depends on affiliations in philosophy of science. Criteria for when to apply the predicate causal, and be justified in doing so, are hardly available. But before drawing any such inferences it will be necessary to (i) ensure sensitivity in order to detect the effect and (ii) rely on it being a valid indicator for what is to be interpreted as causal. In the following section, a start is made on practice by introducing the key concepts whose consideration will hopefully establish a sensitive, valid, and meaningful design. It is clear, however, that an evaluation of meaning must rely on protoscientific argumentation.
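The effect size defined earlier in this section, and its link to the prespecified α and β error probabilities, can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the function names, the use of the sample standard deviation (n − 1 denominator) for the control group, the made-up scores, and the normal approximation to a two-sided two-group comparison. Cohen's (1988) power tables remain the reference; exact t-based values come out slightly larger.

```python
from math import ceil
from statistics import NormalDist, mean, stdev


def effect_size(treat, control):
    """ES = (m_treat - m_control) / s_control.

    Assumption: s_control is the sample standard deviation (n - 1).
    """
    return (mean(treat) - mean(control)) / stdev(control)


def n_per_group(es, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect `es`.

    Normal approximation, two-sided test, two equally sized groups;
    power = 1 - beta. This is a rough planning figure only.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the alpha error
    z_beta = z.inv_cdf(power)           # quantile matching the beta error
    return ceil(2 * ((z_alpha + z_beta) / es) ** 2)


# Hypothetical post-test scores, invented for this illustration.
treated = [12, 14, 15, 17, 19]
controls = [10, 11, 13, 14, 15]

print(round(effect_size(treated, controls), 2))
print(n_per_group(0.5), n_per_group(0.3))
```

Under these assumptions the sketch asks for roughly 63 subjects per group at ES = 0.5 and 175 per group at ES = 0.3, which shows how quickly the required sample grows as the effect to be detected shrinks.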

3.04.3.1 Sensitivity

Sensitivity refers to the likelihood of detecting an effect in sample-based investigations, given that it is indeed present in the population from which samples were drawn. It is determined by sample size, total and direction of effect size, heterogeneity of both subjects and treatment applications, and various undesired features of diagnostic devices (unreliability, etc.). While those factors affect experimental precision, others like prescribed error probabilities of both α and β type and the kind of statistical test used affect the associated concept on the level of statistical analysis: power. It must be borne in mind that sampling error stands for the variance of parameter estimates and is, thus, maximally dependent on sample size. As a consequence, an effect size however small will become significant in statistical terms with increasing sample size, while being wholly uninformative for practical considerations. For an ES of the kind presented above less than or equal to 0.3 to attain statistical significance at α = 0.05, some n = 300 subjects would be in order; for an ES of 0.5 still a hundred or more. In a meta-analysis on psychotherapy efficacy Grawe (1992) found that 87% of 897 studies were using samples of n = 30 or fewer subjects in intervention groups. To attain statistical significance at the usual level an effect size of greater than 1.20 would then be required. This may be considered quite an unlikely prerequisite: in their meta-analysis on psychotherapy research Smith, Glass, and Miller (1980) reported an overall expectation of treatment effectiveness of no more than ES = 0.85. Tentative counterbalancing of the determinants of effect sizes other than the "true" effect (population effect), namely sample size, error probability, and assumed ES, is best achieved using power tables as proposed in Cohen (1988). Since detecting an effect usually means observing an effect size that statistically differs from zero, the levels of design and data analysis in Figure 1 become mixed up when evaluating the whole procedure of quasi-experimental multiple group comparisons. It is the correspondence of design structure "giving" the data and statistical data model that establishes a "good" design. Sometimes data obtained from one design can be translated into different models (problem of parametrization) and effect sizes may be evaluated for their significance by various statistical tests (problem of relative power).

3.04.3.2 Validity

Validity refers to the likelihood that the effect detected is in fact the effect of interest. Campbell and Stanley (1963), and Cook and Campbell (1979), originators of the concept, distinguish internal, external, and statistical validity as well as construct validity. The latter refers to operationalization problems (mono-operationalization bias, monomethod bias, evaluation apprehension, experimenter expectancies, etc.). Internal validity evaluates to what extent variation in measured dependent variables is attributable solely to variation of independent variables or treatment, respectively. Again, assuming an effect for a difference in the value of some statistical parameter calculated from data on dependent variables (e.g., the mean), internal validity assures that nothing but treatment "caused" this difference, and that, as Lipsey (1990) puts it, the observed difference parallels the fact that "the only difference between the treatment and the control conditions is the intervention of interest" (p. 11). This, of course, relies heavily on the experimenter's ability to control for other possible "sources of variation" or undesired confounding influences.

External validity refers to the extent to which results obtained from a sample-based study may be generalized to a specified population from which samples were drawn in the first place and/or across slightly different (sub)populations. Such extrapolation must be reasonably justified as regards target population (i.e., persons), settings, and future time. Note that representativeness is a prerequisite to external validity that not only holds for subject selection but also for situational conditions of experimentation itself: ecological validity must be given careful consideration especially in clinical fields.

Campbell and Stanley (1963) have raised the argument that internal validity can be thought of as a necessary while not sufficient condition for external validity. Note, however, that some methods of control depicted in the next section show conflicting effects on internal and external validity, and to compensate for these imbalances with a reasonable trade-off certainly is an art of its own. Most flaws in experimental and quasi-experimental design boil down to violations of factors of validity, which is why a short outline is presented (see Table 1) of their respective factors as designated by Cook and Campbell (1979). A far more comprehensive listing of possible threats to validity and various other shortcomings in planning experimental research is presented in Hoole (1978). Special attention to external validity in evaluation contexts is given in Bernstein (1976).

Admittedly, full consideration of all these caveats, if not mere apprehension of this rather lengthy list, seems to discourage any use of (quasi-)experimental designs at all. The contrary is true, though. There will be a benefit from

Table 1 Factors of internal and external validity.

Internal validity

History: In the time interval between pretest and post-test measure, various influences or sources of variation that are not of interest or that may even distort "true" findings may be in effect besides the applied treatment. This clearly leads to confounding. Also, these factors may be differently effective for different subjects, thereby introducing additional error variance. Controlling for such intrusion by elimination or by holding it constant to some degree is an appropriate method in laboratory settings but is almost impossible to achieve in real-world clinical (field) settings (apart from identifying such factors as relevant in the first place). Moreover, many disorders are known to be subject to seasonal variation or other periodicities.

Maturation: In longitudinal studies with fairly large retest intervals, effects may be confounded with or entirely due to "natural" change in subjects' characteristics that are correlated with the variables and treatments of interest. A prominent example in clinical research is spontaneous remission in psychotherapy studies. The hope is that such change takes effect equivalently in both groups of a case–control study (which again is rather a "large sample" argument but is usually held true for randomized samples).

Mortality: In repeated measures designs, attrition or loss of a certain proportion of subjects is to be expected on a regular basis. Differential loss, however, means that the observed drop-out pattern across groups seems not to be random but is dependent on variables that might correlate with treatment or outcome measure. This in turn introduces post hoc selection artifacts and is likely to erroneously contribute to effect size in that it may change value and variability of post-test statistics.

Statistical regression: A typical problem with measuring change lies with regression towards the mean. Subjects may show gain or loss in scores from pretest to post-test solely due to the unreliability of the test. This may be compensated for to a certain degree by multivariate assessment of dependent variables. Moreover, differential effects are likely to fall into place depending on subjects' pretest or starting-off: "viewed more generally, statistical regression (a) operates to increase obtained pretest–post-test scores among low pretest scorers, since this group's pretest scores are more likely to have been depressed by error, (b) operates to decrease obtained change scores among persons with high pretest scores since their pretest scores are likely to have been inflated by error, and (c) does not affect change scores among scorers at the center of the pretest distribution since the group is likely to contain as many units whose pretest scores are inflated by error as units whose pretest scores are deflated by it."a

Testing, instrument reactivity, and sensitization: In repeated measures designs subjects are likely to become familiar with diagnostic devices or may carry over pretest scoring behavior. In addition, there are many ways that the testing procedure itself may affect what is tested. Items that evoke strong emotional reactions may distort scoring in subsequent ability subtests by increasing subjects' unspecific arousal. Moreover, pretests may sensitize subjects for treatment reception by enhancing attention for certain attributes or by facilitating self-communication and introspection. Sometimes it is possible to control for this effect in quasi-experimental settings by applying strategies similar to the Solomon four-group design (see Chapter 3, this volume).

Selection/interactions with selection: Selection is a threat to internal validity when an effect size (a difference in some parameter across groups) reflects pre-experimental differences between cases and controls rather than differential impact of treatment. Note that inference from such experimentation relies heavily on the ceteris paribus condition (denoted in the given clause in the outline in Section 3.04.1.3), and that systematic differences across groups due to selection bias are most pervasive in quasi-experimentation where we have to do without randomization. This is especially true in the case of clinical treatment: one would not want to allot inhouse patients or program participants to different treatments (or exclude them to serve as controls) by random numbers but according to their individual needs and aptitude. Patients assigned as cases on the basis of equal (that is, equivocal) diagnoses may still differ from other patients in their group (error variance) as well as from controls in a whole pattern of other variables that are likely to be related to both reception of treatment and outcome measures (e.g., hospitalization, socioeconomic status, verbal fluency, etc.). Note that most factors of internal validity listed so far may interact with subject selection, producing differential effects and thereby increasing experimental error.

Instrumentation: In the first place it must be ensured, of course, that diagnostic devices meet criteria of reliability, objectivity, and validity in terms of measurement and testing theory. In repeated measures designs, discrimination and differentiation capabilities of tests may change. Thus a gain in scores from one occasion to another may partly be due to enhanced performance of observers who improved skills with practice, and less due to differential effects of applied treatment at these occasions. A related problem lies with so-called floor or ceiling effects. Subjects scoring at the upper end of a scale on one occasion, say 6 on a 1–7 Likert scale, may easily report a decrease by half but will find it difficult to state a major increase of similar proportion. Note that clinical samples are often defined by a shift in mean in certain indicators with respect to a "normal" population.

Control group behavior: Besides the problem of recruiting and maintaining a control group, "compensatory rivalry" of respondents receiving less desirable treatment and "resentful demoralization" of such subjects are further threats to internal validity. In the first case, being assigned to the nonreceiving group may motivate subjects to behave competitively so as to reduce the "true" effect (due to treatment) in the counter-hypothesized direction. In the second case, the deprived get more deprived and less motivated: a patient who suffers from his disease is assigned as a control and thus kept from a therapy that is thought to be efficient by the experimenter. The "lose-heart" effect would then artificially enlarge effects. Both variants are of great practical importance in therapy effectiveness research.a

External validity

Interaction of pretest and treatment, reactivity: It is well known that mere knowledge of participating in a study or the awareness of being observed does affect behavior. Furthermore, due to sensitization effects or conditioning effects (see above), results obtained from samples that have completed a pretest cannot be fully generalized to populations that have not done so.

Interaction of selection and treatment: People who deliberately ask for psychological care or for access to health care programs are likely to be more susceptible to treatment or more motivated than are accidental samples of convenience (collective of inpatients, rooms, etc.). Often, selection processes that have led to admission of subjects are unknown. It is clear, however, that results obtained from such samples cannot be generalized unreservedly.

Interfering treatments: Parallel, overlapping or interacting treatments lead to confounding and thus constitute a major threat to internal validity. But there are undesirable consequences for external validity as well. Suppose patients are receiving medical treatment in ascending dosage. Due to idiosyncratic effects of cumulation, conclusions drawn with respect to an intermediate level of dosage that generalize across people may be erroneous.

Nonrepresentative samples: Generally, target populations must be clearly specified in advance. In quasi-experimental design, representativeness cannot be ensured by random sampling on a large scale; instead, advantage is taken of "natural" (nonmanipulative) sample formation. Often characteristics of the research context naturally differentiate subjects into groups (rooming policies of hospital, record analysis, etc.). Strictly speaking, since there is no real sampling there is no point in making inferences. But this aside, sampling should not refer to subjects only: generalizing also includes care givers (therapists, etc.), hospitals, and regions. Therefore, another strategy lies with deliberate sampling for heterogeneity with respect to certain attributes relevant to the research question. Inference would then be across a wide range of units holding subsets of these attributes, but still limited to the respective superset. When dealing with clinical samples, this method seems fairly reasonable.

Source: Cook & Campbell (1979).a

it in that the degree to which attention is given to these issues prior to experimentation parallels the degree to which results can be interpreted and inferences drawn after experimentation. Factors of validity thus serve as criteria for comparative evaluation of quasi-experimental studies; they are not meant as a checklist for building the perfect one (remember that several factors are mutually exclusive). Another precaution must be added. While the bearings of internal validity on technical issues in research planning are widely agreed upon and factors listed above have nearly become codified into a guide on design construction, most important issues of construct validity often receive less recognition. But re-examining the procedural structure of empirical research, as sketched in Figure 1, the view that emerges is that everything begins and ends with valid operationalization of constructs into measurable variables, and of hypothesized relations into exercisable methods for "visualizing" them. Greater appreciation of construct and derivation validity is advocated before turning too readily to tactics and techniques of experimentation. For instance, one should examine treatment integrity well before constructing sophisticated application schedules in an interrupted time series design (see Section 3.04.5.3). Nonetheless, within the entire validity framework, causal inference is predicated on maximum appreciation mostly of internal validity. The question is what is meant by causal.

3.04.3.3 Causality, Practice, and Pragmatism

However sensitive and valid a design might be by these terms, an effect or statistical relationship cannot be interpreted as truly causal in an objectivist ontological sense or in the traditional Humean readings (for an introduction, see Cook & Campbell, 1979, and Salmon, 1966). Therefore, and for the following reasons, a wholly alternative view is pledged on these issues in the tradition of James and other pragmatist reviewers of methodology. Scientific inquiry and its results (e.g., explanations) should always bear technological relevance. As one consequence, the importance of manipulation is re-emphasized as a feature of treatment in quasi-experimentation. Consider the following argument raised by Collingwood (1940):

Suppose one claimed to have discovered cause of cancer, but added that his discovery would be of no practical use because the cause he had discovered was not a thing that could be produced or prevented at will. His discovery would be denounced a shame. (p. 300)

Collingwood's point is that causal explanations are often valued for the leads they give about factors that can be manipulated. If the essential cause of some effect does not imply controlling the effect (e.g., cancer) through manipulating some factor, then knowledge of this cause is not considered useful by many persons (Cook & Campbell, 1979). Scientific inquiry and quasi-experimental research designs are conceived as concretization devices, as tools to obtain technological knowledge. Thus, research designs need to take into account the feasibility of manipulation of conditions that produce effects in order to at least allow for transfer into practice. It is practice where results from research in clinical psychology should be evaluated, which is why a concept of causality and explanation is needed that is of relevance. Modern philosophers and methodologists have demonstrated that pragmatist reasoning can be adapted successfully to methodology in social sciences, and have done so down to the level of hypothesis testing (Putnam, 1974). Of course there is no pragmatic explanation in its own right. But our attempts to explain phenomena on experimental or even statistical grounds can be justified pragmatically by introducing criteria beyond a "logic of discovery":

The first and ineliminable criterion for the adequacy of an explanation must be based on what it does for a man who wants explanation. This depends on contextual factors that are not reflected in the forms of propositions or the structure of inferences. (Collins, 1966, p. 140)

Knowledge of causal manipulanda, even the tentative, partial and probabilistic knowledge of which we are capable, can help improve social life. What policy makers and individual citizens seem to want from science are recipes that can be followed and that usually lead to the desired positive effects, even if understanding the micro-mediational is only partial and the positive effects are not invariably brought about. (Cook & Campbell, 1979, p. 28)

If efficacy of, say, a new drug has been demonstrated on a statistical basis by experimentation, it is not possible, strictly speaking, to conclude that deficiency in effective substance of the drug caused the disease or that the substance now reversed the pathogenetic process to regeneration. Having controlled for side effects, it is perfectly justifiable to maintain the drug as a successful therapy (and raise funds for further elaboration which might give new insights for science as well as for refinement of the drug). On a statistical level this "better-off" thinking in interpreting results from case–control studies translates into the concept of statistical relevance (Salmon, 1966) or into the notion of Granger causality. The latter derives from the theory of linear systems and defines causal relevance of independent variables ("causes") in terms of increased predictability of value and variability of dependent variables ("effects"). Prognosis should then be more precise with a relevant variable included in the model than with the variable removed, and good prognosis is definitely a very useful outcome of research (Granger, 1969). Good prognosis is, after all, a feature of operational models. This is what should be expected from experimentation in clinical psychology: identification of factors that influence etiology and the spread of disorders and derivation of operational models that in turn might help to confront them.
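The predictability idea behind Granger causality described above can be made concrete with a small sketch. It compares the residual sum of squares of a model that predicts y from its own previous value with one that also includes the previous value of x. This is only an illustration under stated assumptions: the series are simulated, a single lag is used, the helper names are hypothetical, and no formal significance test is carried out.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_ss(rows, target):
    """Least-squares fit of target on rows; returns the residual sum of squares."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((t - f) ** 2 for t, f in zip(target, fitted))


def predictability_gain(x, y):
    """RSS without and with lagged x as a predictor of y (one lag only)."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    full = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    return residual_ss(restricted, target), residual_ss(full, target)


# Simulated series in which x at time t-1 really does drive y at time t.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.2 * y[t - 1] + 1.0 * x[t - 1] + rng.gauss(0, 0.1))

rss_restricted, rss_full = predictability_gain(x, y)
print(rss_full < rss_restricted)
```

A markedly smaller residual sum of squares in the full model is what "increased predictability" amounts to here; whether the reduction is more than sampling error would still have to be tested.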

3.04.4 CONTROL IN QUASI-EXPERIMENTAL DESIGN

Control aims at enhancing and balancing factors of internal and external validity following the max-con-min-rule formulated by Kerlinger (1973):
(i) maximize impact of independent variables on dependent ones,
(ii) while holding constant factors of systematic intrusion, and
(iii) minimizing the influence of unsystematic variables or error.

What is meant by factors, influences, and error? Control is always exerted over independent variables that contribute to effects. It is instructive to refine the term variable into four types relevant in experimentation. Systematic variance denotes differences between groups and, with reference to the definition of an effect size in Section 3.04.1.3, thus indicates the presence of "true" effects. In concept, it is solely attributed to explanatory variables, indicating treatment. Extraneous variables, however, are not a primary focus of a study but in cases of substantial correlation with dependent ones introduce error variance and bias estimates of effects. Such variables may either be subjected to control or be left uncontrolled in one of two ways, depending on both sample size and subject assignment policy. In the latter case, randomized variables are let run freely in the hope that their effects will cancel out in the long run and thus equate groups in a probabilistic sense, while confounded variables constitute a threat to validity that cannot be ruled out. It follows that controlled variables are extraneous but their impact on measures of effect is accounted for because error variance is reduced. If such erroneous influences are known to operate (by background knowledge introduced into the subject matter model, or simply by intuition), the techniques presented in Table 2 are used or at least an attempt is made to obtain accurate measures on them. Statistical techniques may, then, be used in order to separate post hoc the effect of treatment (in terms of ES) and selection differences.

In quasi-experimentation, control is exercised by means of subject selection or statistical analysis rather than by manipulating conditions. Because no randomization is used in quasi-experimental settings to wash out all initial differences in and across groups and because an effect has been defined in terms of post-test differences, credibility of findings will depend on the ability to demonstrate that groups have been as alike as possible except for the difference in treatment they received. It will be noticed, however, that pretest equivalence does not lead, as might be implied by name, to equivalent groups. There may still be reliable and substantial differences on other factors relevant but left unconsidered (or even unmeasurable if known) that affect the phenomenon in question. The argument by Cook and Campbell (1979) is followed that only understanding the process of subject selection allows full understanding of subject differences that cannot be attributed to treatment, and that randomization must be considered the only model of such a process that can be completely understood.

The following is a worked example. Control is by subject selection and assignment:

Of 307 recruited individuals, only 40 were eligible for participation. Persons showing habituation effects were excluded, because their high pressure in multiple baseline readings was assumed to be a reaction to the measurement situation itself (sensitization). To control for treatment interference persons currently engaged in other pressure control techniques were also excluded. Persons on medication were asked to present medical records to assure constant dosage for the time of their participation (history). The authors cite evidence for a linear relationship between blood pressure level prior to behavioral intervention and treatment change. They further attribute contradictory results in recent research on relaxation efficacy to the common practice of matching experimental groups with respect to the average on certain blood pressure parameters, which leaves differential treatment susceptibility in higher degrees of hypertension uncontrolled. Subjects were therefore orthogonally matched on systolic blood pressure, age, and sex to assure pretest equivalence.

Experimental error also leads to bias in effect size estimates. While disturbing marginal influences (disruptions due to periodicities in hospitals' daily routines, noise, etc.) may be held approximately constant or be eliminated, thereby having their influence equated in and across groups, other undeliberate variations of experimental conditions (e.g., treatment inflation or interferences) and measurement effects are to a far lesser extent under the control of the experimenter. But in many field settings there may be no desire to exert artificial control over such (in theory) irregular components of overall effect. It may be preferable to preserve or enhance external or ecological validity when finally turning to generalization. Since practice is defined as primary, however, bias due to naturally varying concomitants is something to be faced. There is no need to put research back into the laboratory by excellent control when findings need to be transferred into applications for a noncompliant world.

The worked example is continued to demonstrate trial structure and procedural control:

Table 2 Controlling subject-induced extraneous variables.

Matching: The rationale of matching is to fully equate groups with respect to pretest scores, so that differential post-test scoring could legitimately be interpreted as differential change. The more variables are called in for matching, the greater the shrinkage in sample size that must be expected, sometimes deflating degrees of freedom to an extent where statistical analysis is no longer possible. Moreover, such samples may no longer be representative for target populations to which one finally intends to generalize findings. By the way, there is great appeal in the idea of matching subjects with themselves across treatment levels as is the case, for example, in crossover designs.

Parallel groups or aggregate matching: This means of control is, in a way, a weaker version of matching by taking resort to group statistics rather than to individual values. The influence of confounds or, more generally speaking, nuisance factors is considered not relevant if it is distributed evenly across groups. If such a variable has equal distribution in each group (with respect to first- and second-order moments, e.g., mean and variance), groups are said to be parallel on this variable. When dealing with intact groups or when assignment is guided by need or merit, this strategy would entail exclusion of subjects in one group in order to approximate the other or vice versa. Again, sample size would be reduced and groups may finally differ on other variables by selective exclusion. Control always seems to trade off with external validity, notably generalizability. Parallel groups, however, are much easier to obtain than samples matched by score.

Blocking: Blocking further draws variance out of the error term when nuisance factor(s) are known and measurable on a categorical level. The rationale is to match subjects with equal score on this measure into as many groups as there are levels of the nuisance factor. If, for example, $n_j = 10$ subjects are to be assigned to each of $J = 3$ treatment conditions (factor A), and another factor B is known to contribute to variance, one would obtain $K = 10$ levels of factor B with $n_k = 3$ subjects each and then randomly assign $n_{jk} = 1$ subject per B-level to each of the J levels in A. An increase in numbers of blocks will further decrease error variance since scores within a block of subjects matched this way are less likely to vary due to selection differences than would have to be expected in blocks comprising more or all (no blocking) subjects. Note that in this model blocks represent a randomly sampled factor in terms of mixed-model ANOVAa and thus can enter statistical analysis to have their effect as well as possible interactions with treatment tested for. As a result, blocking could be viewed as a transition from matching (above) into factorization (below). On the other hand, matching is but an extreme case of blocking where any pair of matched subjects can be considered a separate block.

Factorization: Confounding can be completely removed from analysis when confounding variables are leveled (measured by assignment to categories or by recoding their continuous scale into discrete divisions) and introduced into design as a systematic source of variation. This generally results in factorial designs where effects of factors can be quantified and tested (main effect and interaction in fully crossed designs, main effect only in hierarchical or incomplete structures).a,b,c

Analysis of covariance: While factorization is a pre-experimental device of control (it gives, after all, a sampling structure), analysis of covariance serves to post hoc correct results for concomitant variables or pretest differences.d

Holding constant, elimination: If hospitalization must be expected to substantially mediate results, a most intuitive remedy lies with selecting subjects with the same inhouse time. Strictly speaking, there is no way of generalizing to other levels of the hospitalization variable.

Source: Hays (1990).a Pedhazur & Pedhazur (1991).b Spector (1981).c Cook & Campbell (1979).d
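The blocking entry in Table 2 (K = 10 levels of a nuisance factor B, three subjects per level, one subject per level randomly assigned to each of J = 3 treatments) can be sketched as follows. The function name, the subject identifiers, and the seed are hypothetical choices made for illustration, and the sketch assumes every block already contains exactly as many subjects as there are treatments.

```python
import random
from collections import defaultdict


def assign_within_blocks(block_of, n_treatments, seed=0):
    """Randomly assign one subject per block to each treatment.

    block_of maps a subject id to its level on the nuisance factor B.
    Returns a dict mapping subject id to a treatment index 0..J-1.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject, level in block_of.items():
        blocks[level].append(subject)

    assignment = {}
    for level, members in blocks.items():
        if len(members) != n_treatments:
            raise ValueError(f"block {level} needs exactly {n_treatments} subjects")
        treatments = list(range(n_treatments))
        rng.shuffle(treatments)  # random order of treatments within the block
        for subject, treatment in zip(members, treatments):
            assignment[subject] = treatment
    return assignment


# 30 hypothetical subjects: 10 levels of factor B with 3 subjects each.
block_of = {f"s{i:02d}": i // 3 for i in range(30)}
assignment = assign_within_blocks(block_of, n_treatments=3, seed=1)

group_sizes = [list(assignment.values()).count(t) for t in range(3)]
print(group_sizes)  # each treatment receives one subject from every block
```

Because every block contributes exactly one subject to every treatment, the treatment groups are balanced on factor B by construction, and the block factor can later enter the analysis as described in the table.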

All subjects received treatment, placebo, or pure assessment in equal time spacing in eight experimental sessions. They were instructed to replicate these sessions for 30 days after the laboratory trial and to keep a record of their practice. After this follow-up period, all subjects were finally assessed on their blood pressure. Though this procedure implied a loss in standardization, it certainly enhanced ecological validity and increased the clinical relevance of the study's findings. Light and temperature were held constant in all measurement situations, and all equipment was visually shielded to prevent feedback effects. Measurement equipment was calibrated before each trial. All participants spent equal time in the measurement setting and were blinded with respect to assessment results until completion of the measurement occasion. To control for experimenter effects, subjects received live relaxation instructions in three sessions followed by five sessions with taped instructions. This procedure further improved standardization of treatment situations. A training time of 18–20 minutes was justified as optimal by reference to recent findings that shorter training lacks any effects and that longer training often results in loss of motivation to practice, increased dropout rates, and reduced stability of effects over time.

3.04.5 A PRIMER SYSTEM OF DESIGN

As will be obvious from Figure 2, basic forms of designs mostly differ in whether or not they use randomization to assure initial group equivalence and control groups to determine effects, and whether or not they include a pretest, multiple observations, multiple forms of treatment (applied, removed, reversed, faded, etc.) and different controls. Following on from this, the regression discontinuity approach is presented as leading to a class of designs in its own right. In the case of more than one explanatory independent variable, additional decisions must be made regarding logical and temporal order of treatment application (factorization into fully crossed and balanced plans, incomplete and hierarchical plans, balancing sequence effects, etc.). These topics are more fully covered in Chapter 4, this volume. Note that many important practical considerations and implications for the choice of a statistical model for subsequent analysis are hidden completely in what is called the "data acquisition box" (or "O" in standard notation, see below). Here it must be decided on (i) number of dependent variables (uni- vs. multivariate analysis), (ii) various parameters of repeated measurement (number of occasions, lengths of retest intervals, etc.), (iii) order of dependent measures and error modeling (e.g., use of latent variable models), and (iv) scale characteristics of dependents (parametric vs. arbitrary distribution models).

Most authors follow the standard ROX terminology and notation introduced by Campbell and Stanley (1963): R stands for randomization, O stands for an observation or measurement occasion, X stands for treatment or intervention. Temporal sequence is expressed by left-to-right notation in a row; separate rows refer to different groups.

3.04.5.1 Selections from the General Scheme: Nonequivalent Group Designs

The designs shown in Figure 2, while not a focus of this chapter, will now be considered. Designs one and four are considered "pre-experimental" by Campbell and Stanley (1963):

                  Single Group     Multiple Group Comparisons
                  Study            "True" experiments    Quasi-experiments

Post-test only    1                2                     3
                  X O              R X O                 X O
                                   R   O                   O

Pretest and       4                5                     6
post-test         O X O            R O X O               O X O
                                   R O   O               O   O

Time series       7                8                     9
data              OO...X...OO      R OO...X...OO         OO...X...OO
                                   R OO.......OO         OO.......OO

Figure 2 A basic system of designs.

findings from such studies are most ambiguous since almost any threat to internal validity listed above may be in effect and left uncontrolled. As a consequence, and because experimentation is meant to demonstrate effects by assessing comparative change due to treatment while ruling out alternative explanations, these designs are commonly held inconclusive of effects. They are not recommended for the purposes sketched in Sections 3.04.1 and 3.04.3. Design three does implement an independent control group, but still remains inconclusive as regards the genuineness of effects due to selection differences.

In design four, subjects serve as their own control. With two measures only, little can be inferred about true effects since maturation and other time-relevant threats to validity remain uncontrolled. Extending the principle to design seven, however, leads to a typical time series design for basic impact analysis that to some extent allows for more control. Suppose a trend line is fitted to both consecutive pretests and post-tests: a discernible shift in intercept and/or slope parallel to treatment application would indicate presence of an experimental effect (a conceptually different, but technically similar concept will be discussed with the regression discontinuity approach). Without determination of such a "natural" trend a higher post-test level might reflect what should be expected from extrapolation of regularities in pretest values only, disregarding treatment. In other words, dependent variables might show a change in value that is due to natural change over time (inflation, deflation, or periodicities of the phenomenon under study). Using multiple observations does not rule out this possibility as would using multiple observations and control groups, but still makes it highly unlikely for such a trend to go undiscovered. Some authors argue that lack of a control group may be of less importance for this reason. Delay in effect instantiation or special forms of temporal calibration of effects can be modeled by univariate intervention (or interrupted time series) analyses (see McDowall, McCleary, Meidinger, & Hay, 1980). Note, however, that group designs are still being dealt with exclusively where N > 1. Single-subject designs, even when presented in the same notation, employ another rationale of contrasting and inference and are covered in Section 3.04.2. Multiple-group extensions to this design are covered in Section 3.04.5.3. Designs two, five, and eight have the most desirable properties: they implement randomization, repeated measurements, and control groups, with multiple observations additionally available in the last one. These designs are discussed in Chapter 4, this volume. From the preceding, it is clear that the rightmost column contains three nonequivalent group designs which do implement multiple group comparisons. Design three, however, is subject to the same reservations made for designs one and four. It lacks both randomization and pretesting as an approximation to the same end. Therefore designs six and nine, when used with appropriate care and caution, permit the testing of causal hypotheses, with causality being conceived as in Section 3.04.3.

The worked example is continued with aspects of design

In quasi-experimental design, the treatments to be compared serve as independent variables and experimental groups are built by assigning subjects to treatment conditions by nonrandom procedures. In the worked example study by Yung and Kepner (1996), the treatment factor comprised stretch release (SR), tense release (TR), and cognitive relaxation (COG) as well as a placebo attention group and a test-only control group (TOC) that remained wholly untreated. The placebo group was given a medicament by a physician with the ruse that it would reduce blood pressure in order to control for positive treatment expectancy effects. The control group was given pure assessments of blood pressure to control for effects of the measurement process itself. Systolic and diastolic blood pressure as well as heart rate were chosen as dependent variables, entailing a multivariate setup for subsequent statistical analysis. All subjects were measured repeatedly on several occasions (multiple baseline readings, with assessments before and after trial, follow-ups), which gives a multiple-treatment–multiple-control extension of design six in Figure 2.

Note that the study outlined above does in fact realize most of those items that make up the following sectional structure: conduct of pretests, multiple observations, different forms of treatment, different forms of control groups, and, introduced post hoc by means of statistical analysis, even combination of different designs.

3.04.5.2 Using a Pretest

3.04.5.2.1 Using pretest as a reference for gain scores

The major step from design three to design six clearly lies with the introduction of pretesting, seemingly indicating analysis of change. A natural concept of an ES would be derived by subtracting pretest averages from post-test means and normalizing the result with respect to a control group for a comparison. There has been extensive work on the question of whether or not to use difference scores as an estimate of ES. Some main issues addressed in this context comprise the fact that difference scores (i) carry the unreliability of both their constituents, (ii) depend on correlation with initial (pretest) scores, (iii) depend on scale characteristics, and (iv) are subject to regression towards the mean. Various alternative techniques have been proposed, among which are (i) standardizing prior to calculating differences, (ii) establishing "true" difference scores (Lord, 1963), (iii) using regression residuals (Cronbach & Furby, 1970) and, (iv) most commonly used, regression adjustment by means of analysis of covariance. Further, in terms of mathematical models, specification of correlation structure for error components of total scores, decomposition of error itself into various time-depending and static components, as well as the formulation of model-fit criteria and appropriate statistical testing, with serial dependence of data contradicting the "usual" (IIND) assumptions on error terms, are of major concern when analyzing change. These items denote not mere problems of statistical theory, but reflect substantial problems about a concept of empirical change. The concern here, however, is not with all the complications that arise when trying to measure change but with the logic of design. For an introduction and further reading on the above issues, see Eye (1990), Diggle, Liang, and Zeger (1994) and for special topics as indicated Kepner and Robinson (1988) and Thompson (1991) (ordinal measures), Hagenaars (1990) and Lindsey (1993) (categorical data), Christensen (1991) and Toutenburg (1994) (generalized linear models), Vonesh (1992) (nonlinear models), Farnum and Stanton (1989) and Box, Jenkins, and Reinsel (1994) (forecasting), Petermann (1996) and Plewis (1985) (measurement).

3.04.5.2.2 Using pretests as approximations to initial group equivalence

In Section 3.04.2 the concept was introduced of an ES as the observable and testable formulation of what is sought in experimentation: effects. As can be seen from Figure 2, repeated measures are not strictly required to demonstrate effects. As is the case in design three, differences may be observed in dependent measures across independent samples that underwent two different treatments (or none for a control), and it may be stated that one group scored higher on the average (this being favorable). Drawing any inferences or preferring this treatment over the other (or none) as a result of such a post-test-only design would rest entirely on the assumption that groups were equivalent with respect to dependent variables prior to treatment application. In "true" experiments, random subject assignment and/or large samples may be relied upon to justify this assumption, but in quasi-experimental design performing a pretest is strictly required to obtain meaningful results. One still does not need to take gain scores or measures of different growing rates as a base for an effect size. In Section 3.04.2, they were not. But post-test differences can only be interpreted as monitoring effects when initial group equivalence is given as a common ground for comparison. As a consequence, subject assignment in quasi-experimental design is based upon techniques that require a pretest (see Table 2).

3.04.5.2.3 When not to use a pretest

Research situations can be imagined where conducting a pretest may not be recommended for even stronger reasons. Suppose that the phenomenon in question is likely to change qualitatively rather than quantitatively. First, a pretest would then presumably measure something different than the post-test. Second, a pretest may interact with treatment by differential sensitization (see factors of internal validity) or by producing such bias that results could not sensibly be attributed to treatment anymore (e.g., when using attitude tests). Third, retesting may be impossible because testing can only be set up once on the relevant variable, because measurement must be assumed to distort or annul the phenomenon, or because no parallel form of a test is available. In such cases, the use of a proxy variable might be considered which, measuring something different but conceptually related to post-test criteria and, thus, (hopefully) being correlated to it in the statistical sense, serves as a substitute for evaluating initial group equivalence as a prerequisite for interpreting post-test differences (Rao & Miller, 1971). In extreme cases, when any kind of pretesting, be it by means of proxies or using the post-test instrument, seems impossible, using separate samples "from the same population" for pre- and post-test may be in order. From the preceding it is clear, however, that this variant of the basic design (six) provides a very weak basis for inference. Selection cohort designs as proposed in Cook and Campbell (1979) can be understood as a conceptually improved (while more demanding) extension of separate sample designs. When using self-selected samples it is obvious that subjects collected from cohorts that fall into place with regular periodicity because of characteristics of the natural environment can be considered more alike than separate samples with obscure nonrandom selection processes. As this is most typically the case for institutions with cyclical (e.g., annual) turnover, like schools, such designs have been labeled "recurrent institutional cycle designs" (Campbell & Stanley, 1963).

3.04.5.2.4 Outcome patterns

It is trivial to state that informational complexity of outcome pattern depends on the factorial complexity of design. It should be borne in mind, however, that sophisticated group and treatment structures do not necessarily entail higher levels of information or yield better causal models of a post-test. For example, the higher the order of interaction allowed in fully crossed factorial designs, the less readable will be the results obtained and the more difficult their interpretation. Pedhazur and Pedhazur (1991) give examples of how adding predictors to regression equations alters coefficients of formerly included predictors and changes overall results. Though the foregoing statements about validity and control may suggest a kind of "the more the better" strategy when designating independent and dependent variables for a design, the ultimate purpose of all modeling, simplification, should be recalled. The ultimate prerequisite in modeling, however, is of equal importance to the same end: unambiguous specification. Design six, for instance, when including two pretest occasions and being carried out on near-perfectly matched groups of sufficient size with near-constant treatment integrity, clearly outperforms a time series extension of the same design (see below) with uncontrolled subject attrition and carry-over effects.

A naturally expected, and in most cases desired, outcome of basic design six would obtain when the average post-test response of the treatment group is found to differ significantly in the desired (hypothesized) direction from the control group post-test scoring, given pretest equivalence and given zero or nonsignificant trend in the control group pretest–post-test mean scores. But there are several variations of that scheme, largely due to selection–maturation interaction, regression artifacts, and scale characteristics (see Table 1). Cook and Campbell (1979) discuss at length such different outcomes of design six in its basic form (parallel trend lines, differential growing rates for groups, nonconstant controls, controls outperforming cases, treatment leading to post-test equivalence, trendline crossover ["switching means"]). Testers should not, however, as might be suggested by these authors and the foregoing presentation of all those threats to validity, too readily attribute findings not in line with expectations to design misspecification, artifact operation, and limitations of diagnostic devices. It is true that there is always our obligation to evaluate and justify the whole procedure of research instead of its findings only. But having revised the process of hypotheses derivation that leads to design, and to data as a result, and having confirmed derivation validity to a maximum extent attainable within the limits set by a concrete research context, there should be no complaints about reluctant and unruly reality.

3.04.5.3 Using Multiple Observations

It is obvious that design nine is but a generalization of standard design six, taking advantage of multiple observations pointed out above, and being subjected to the flaws of multiple repeated measurements. While the extension is straightforward as regards the logic of contrasts (the rationale for multiple group comparisons), the notion of an effect and the corresponding ES is more difficult to define. Time series analysis is usually carried out to detect and analyze patterns of change over time or to build models for forecasting. Multiple group comparisons enter when chosen to define an effect in terms of different time series evolution for cases and controls, conditional on a pre-intervention phase (baseline) and triggered by differential onset, intensity, and duration of treatment. Two major problems must be considered, then: first, how should treatment and observation phases be scheduled in the course of the whole study in order to set contrasts for obtaining effects while accounting for time-relevant threats to validity (see Section 3.04.5.4 using multiple forms of treatment); second, how to statistically analyze serially dependent data and interpret contrasts in the data in terms of within-subject (within-group) vs. between-subject (between groups) patterns (see Section 3.04.6).

The major benefit of using multiple observations has been presented in Section 3.04.5.1 (control for maturation and regression). Adding a control group further permits control of history and instrumentation, for there is no a priori reason to suppose that untreated subjects experienced different environmental conditions during the study or that they should react differently to repeated measures. While nonspuriousness of treatment effects justifiably may be assumed by reference to significantly different pretreatment levels or increased/decreased slope of baseline, temporal persistence of effects can never be assured without comparison to an untreated control group.

As it is intended to dedicate the section on analysis mostly to handling design six and its nontime-series (i.e., simple pretest–post-test) extensions, a few words on statistical analysis of time series experiments are in order here. For more detail see Glass, Wilson, and Gottman (1975), Gottman (1981), and Lindsey (1993). Comparisons of intercepts (level), slopes (trend) and variance (stability, predictability) are mostly used for contrasting pre–post phases. Note that effects in time series designs need not instantiate immediately with treatment application and persist. Impact of treatment may be delayed, with gradual or one-step onset, and various forms of nonconstancy show up over time (linear, nonlinear, or cyclic fading or gain, etc.). But such rather sophisticated parameters of time series data will only be used rarely for ES definition: statistical tests for directly comparing oscillation parameters or analytically derived characteristics like asymptotic behavior (e.g., speed of point convergence) are not readily available. While most one-sample time series parameters of interest to researchers concerned with intervention analyses can be identified, estimated, and tested for by transfer function models, between-sample comparisons often reduce to fitting an optimal model to one group and observing that fit indices decrease when obtained from application of the same model in another group. Cross-series statistics are, after all, hard to interpret in many cases (they require consideration of lead-lag relations, etc.). For an introduction to interrupted time series analysis for intervention designs see McDowall et al. (1980) or Schmitz (1989); illustrated extensions to nonlinear and system models are presented in Gregson (1983).

3.04.5.4 Using Different Forms of Treatment

In this section concerns about implementing different forms of the same treatment instead of adding treatments like in a factorial design will be discussed. In basic pretest–post-test designs of type six, treatment is applied only once. Hence, variations concern application or removal only. Graduation in treatment application or withdrawal calls for multiple groups to undergo such graduation. This would be achieved, in concept, by stacking k–1 pretest–post-test designs without a control group over standard design six for k levels of treatment to be tested. The resulting model would conform to standard one-way analysis of variance for difference scores with a fixed or randomly sampled treatment factor (see Section 3.04.6). In time series designs, however, some more challenging variations are at hand that further increase validity and clarity of inference, if properly implemented. Figure 3 gives an overview.

Design 10 replicates design four, with treatment removed in the second block in order for this block to serve as a "same-subjects" substitute for an otherwise nonavailable control group. Careful consideration must be given to the following questions:

(i) Are treatment effects of a transient nature, and can they be made to fade out (it is supposed, generally, that persistence of intervention effects is desired)? Ethical concerns, financial questions and potential attrition effects could be added to hesitations about mere feasibility of such designs.

(ii) Is it reasonable to expect that average scoring will return to a near-baseline level after removal of treatment?

(iii) Will it be expected that, after reintroduction of treatment as in design 12, effects will re-establish themselves in the same way (delay, size, stability, etc.) as in the first turn? (It is supposed that the reader is acquainted with general replication effects in experimenting like learning and resentment.) Then, and especially in case of high-amplitude treatment and obtrusive measurement, construct validity will be seriously affected.

If answers are favorable to these concerns, then designs are at hand that are particularly useful in evaluation of recurrent intervention components. Next should be considered the reversed treatment nonequivalent control design 11 with pretest and post-test. Cook and Campbell (1979) suggest that this design outperforms standard design six as regards construct validity since a treatment variable has to be specified rather rigorously in order to determine its logical reverse and an operational mode for it to be applied. Nevertheless the additional use of a wholly untreated control group is strongly advised in order to arrive at clear-cut results, in case trend lines do in fact depart from (assumed equivalent) pretest levels in opposite directions but in slopes that differ with respect to size and significance. Including a common reference for contrasting provides a very strong basis for inference. Finally design 13 (interrupted time series with switching replications) is considered. Due to delayed treatment application, one group can always serve as a control during treatment periods of the other ("reflexive controls," Rossi, Freeman, & Wright, 1979; "taking-turn controls," Fitz-Gibbon & Morris, 1987). The particular strength of this design lies with its capability to demonstrate simultaneously effects in two different settings and points of time with greatest parsimony. Still, it is possible to overcome most threats to validity that are faced when using the above designs.
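
The contrast of level and trend between pre- and post-intervention phases, as described for time series designs in Section 3.04.5.3, can be illustrated with a small Python sketch. It is a simplification under stated assumptions: separate ordinary least-squares lines are fitted to each phase, and serial dependence of the observations is ignored, which a proper interrupted time series analysis would have to model. The function names and the data are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least-squares intercept and slope for one phase."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def phase_contrast(times, values, onset):
    """Shift in level (at treatment onset) and change in slope between the
    baseline phase (t < onset) and the post-intervention phase (t >= onset)."""
    pre = [(t, y) for t, y in zip(times, values) if t < onset]
    post = [(t, y) for t, y in zip(times, values) if t >= onset]
    a_pre, b_pre = fit_line(*zip(*pre))
    a_post, b_post = fit_line(*zip(*post))
    # Compare the observed post-phase line with the extrapolated baseline trend.
    level_shift = (a_post + b_post * onset) - (a_pre + b_pre * onset)
    return level_shift, b_post - b_pre


# Hypothetical series: a constant upward trend with a jump of 8 units at t = 10.
times = list(range(20))
values = [2 + 0.5 * t if t < 10 else 10 + 0.5 * t for t in times]
level_shift, slope_change = phase_contrast(times, values, onset=10)
print(round(level_shift, 6), round(slope_change, 6))
```

In this constructed series the higher post-test level is not explained by extrapolating the baseline trend, which is the situation the text describes as indicating an effect; with a nonzero slope change the impact would instead show as a change in trend.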

Design 10:         oxo  oxo–

Design 11:         ox+o
                   ox–o

Design 12 (a, b):  oxoxox–   oxoxox– –

Design 13:         oooooooooxoooo
                   oooxoooooooooo

Figure 3 Designs using different forms of treatment.
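
The statement in Section 3.04.5.4 that stacked pretest–post-test designs lead to a one-way analysis of variance on difference scores can be made concrete as follows. This is only a sketch with a fixed treatment factor; the helper names and the scores are invented for illustration, and the resulting F ratio would still have to be referred to an F distribution with the reported degrees of freedom.

```python
def gain_scores(pre, post):
    """Post-test minus pretest score for each subject."""
    return [after - before for before, after in zip(pre, post)]


def one_way_f(groups):
    """F ratio of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical pretest and post-test scores for three treatment levels.
pre = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
post = [[11, 12, 13], [12, 13, 14], [15, 16, 17]]
gains = [gain_scores(p, q) for p, q in zip(pre, post)]
f_ratio, df_between, df_within = one_way_f(gains)
print(round(f_ratio, 6), df_between, df_within)
```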

3.04.5.5 Using Different Forms of Comparison Groups

It has been pointed out frequently that the use of a control group is indispensable as a baseline for comparison in most cases of quasi-experimental research, and highly advocated in any other. Effects should be demonstrated by either relative change or post-test difference to rule out alternative explanations besides treatment impact. While this is widely agreed, there has been some controversy as regards various qualitative characteristics of controls. Are controls really untreated as assumed for inference? How can control groups be maintained (they do not receive, after all, any benefits of treatment)? Is there such a thing as placebo treatment? Do waiting groups (subjects scheduled for a subsequent cycle of treatment, see design 13) differ in any substantial respect from wholly unaffiliated pools of controls? In many cases the nature of a control group is determined simply by contextual factors like availability in terms of total number of subjects, financial resources, ethical considerations, or the time horizon of the study. While "real" control groups receive no treatment at all, placebo groups receive only irrelevant treatment. Conceptually, there are two forms of such irrelevance. In the first case, intervention neither in theory nor in statistical terms relates to the treatment variable of interest and is only meant to minimize motivation effects like resentful demoralization. A second version refers to treatments that in fact can be, or even are, expected to contribute to effects by operating through subjects' expectancies and external attributions that are hypothesized to go along with actual treatment. Perhaps the most prominent example is medication with a sugar pill. Note that such placebos work on psychological grounds and usually are implemented to separate such effects from "real," physical, or pharmacological treatment. As a consequence, it might be difficult to define a psychological placebo in psychotherapy research. For further detail on the issue of placebo groups, see a well-received article of Prioleau, Murdock, and Brody (1983), defending admission of such placebo treatment, and subsequent peer commentaries in The Behavioral and Brain Sciences.

Waiting control groups differ from untreated groups in that they must be supposed to expect benefits from treatment to come. As a consequence, after assignment, such subjects may show more sensible behavior in any respects related to health and well-being or more medical compliance (thus "raising" baselines into trendlines). Waiting groups are, however, easiest to recruit and maintain; they naturally fall into place when treatment cannot be delivered unlimitedly, and there is no way of avoiding selection for admission. Still another concept of a control group emerges when using qualitatively different treatments in comparison groups, that is, distributing treatment across groups. A way of treating a disease might, for instance, be compared against current or standard treatment, thus highlighting the relative merits of the new and saliently different features of the innovative treatment. Technically, within the ANOVA framework introduced below, treatment would represent a fixed factor with more than two levels (treat1, treat2, etc.). It is clear that adding a further untreated control group is still preferable (see Section 3.04.5.6).

3.04.5.6 Using Combinations of Different Designs

There are three convincing arguments for combining designs. First, combining complete designs rather than extending one of their internal features (as discussed in the above subsections) will generally enhance control. Second, broadly defining designs themselves as methods and using different methods and research plans in addressing the same question and phenomenon will always decrease monomethod bias and increase both construct validity and generalizability of results. Third, combining designs offers refined testing of hypotheses. For example, in Section 3.04.5.4 it was proposed to stack design four multiple times over design six. Consider now the application of different dosages of a drug as treatment variable. Given initially comparable groups (in the present, quasi-experimental sense), trend hypotheses about dosage could even be tested. Also, integral features of basic designs as depicted in Figure 2 may be tested for their mediating effects. For example, if design six is stacked over design three a composite design is obtained that allows for separation of pretest and treatment effects and is named after its proposer Solomon (1949). Note that there are two fully crossed factors: whether treatment is applied or not (case–control factor), and whether pretesting is administered or not. This design may now be conceived as factorial (see Chapter 3, this volume) and an analysis made of the main effect of treatment (disregarding effects of pretest sensitization and the like), main effects of pretesting (effect obtained without regard to treatment), and interaction (treatment effects depend on pretest). If statistical testing has determined that such interaction is present, consideration of the main effects due to one sole factor (e.g., treatment) is sure to yield misleading results and inferences. But the latter is always true in cases where any design is used that lacks both randomization and pretest. As a consequence, this rather sophisticated design is mostly applied in research situations that allow for randomization. Trivial as it may seem, another way of combining designs lies with replicating the whole basic design using the same samples and measurement occasions by simply obtaining multivariate measures of post-test indicators and, thus, testing multiple predictions concerning the same outcome criterion.

3.04.5.7 Regression Discontinuity

As if to provoke further discussion in order to put an end to presumed underutilization of regression discontinuity (RD) designs, Trochim (1984) stated that RD "is inherently counterintuitive . . . not easy to implement . . . statistical analysis of the regression discontinuity design is not trivial . . . [there are] few good instances of the use" (pp. 46–47). So what is the point in these designs? In structure and objective, RD designs conform to the layout of design six. The major difference lies with assignment policy: while randomization is believed to guarantee fully equivalent groups (in the long run, that is) and pretest equivalence in nonequivalent group designs assures comparability with respect to the variables in question only (leaving obscure the presence and influence of other selection processes), it is maximum nonequivalence that serves as a rationale of regression discontinuity designs. This is achieved by ordering subjects according to their scores on the continuum of a pretest scale and subsequently defining a cut-off point for division into further subsets to be assigned to different conditions. Most commonly, subjects scoring below the cutting point would then be assigned to one group, subjects scoring above this point to another. If a "sharp" cut-off point is not desirable on theoretical grounds or because of known unreliability regions of scales, a midscale cut-off interval might be used instead, with interval width depending on estimates of measurement error, similar to a standard deviation. Subjects within this range would then be assigned randomly to either treatment or control group condition. While it is true that the process of subject selection is still only imperfectly controlled because merely one factor (the dependent variable) is accounted for, it is perfectly known since it is specified entirely by the researcher. It is clear that, in order to prevent cut-off point selection being rather arbitrary, strong theoretical grounds or consensus on practical concerns (limited therapy resources, etc.) are called for.

Basically, analysis of RD designs means testing hypotheses of equality of regression equations across groups, similar to analysis of covariance. The objective is to ask whether the trend or regression line obtained from analysis of the treatment group pretest and post-test scores is displaced significantly from the respective control group regression line. If displacement is due largely to greater (or lesser) intercept, while slopes are statistically equal (parallel line regression), a main effect has been successfully demonstrated. Stated differently, with no effect present, a single regression line would equally fit scores of both treatment and comparison group and, as a consequence, a trend in the treatment group could be predicted from the control group scores. Interaction effects are revealed by discontinuity in slope of regression lines at the cut-off point. Note that, technically, regression need not necessarily be linear but can include parameters of any order, but that extrapolation of a trend line becomes extremely difficult then and often leads to false conclusions about effects (see Pedhazur & Pedhazur, 1991). These problems notwithstanding, more frequent use of these designs is recommended. In many intervention research contexts, subjects not only are in fact but also should be assigned to either treatment or control groups by need or similar, perhaps medical, criteria related to both treatment and outcome.

3.04.6 ANALYSIS

Having successfully derived and implemented a multiple group comparison design there is now a need to fit a statistical model to the data

respective conceptual analogies when ordinal or categorical data have been obtained. Since there are many textbooks on statistical data analysis that cover technique in detail, attention is drawn merely to the principle, and major formulae are dispensed with (references are given, however).

It is important to understand that until now discussion has been located entirely in the realm of logic. Nothing has been implied as regards scale and complexity of measurement and data. The point is that anything can be stored in O (Figures 2 and 3) and that structural complexity of whatever resides in this data acquisition box is logically independent of complexity in design structure. Enlarging designs by stacking subelements simply adds data boxes and obtains lots of data and can sometimes lead to technical difficulties in statistics, but both of these features should not complicate the idea or inference behind the material. In fact, there is only one condition: statistical parameters obtained from the data must be comparable on formal (i.e., mathematical) grounds (distributions, standard error of estimate, etc.). Now the dominance of comparison-of-means-style models is readily explained: differences in means of normally distributed variables again follow a normal distribution, and departure from zero is most easily tested using direct probabilities. In essence, value and standard error of any parameter's estimate should be determined and related to each other (e.g., in a ratio, giving the standard t-test in case of metric dependents) or to another pair of parameters (e.g., in a double ratio, giving the standard F-test, respectively).
But obtained in order to test the hypotheses on the there are many other ways besides of comparing ªfinalº deductive level in the process model of distributions, some less strict but more straight- empirical research (Figure 1). In particular, forward or creative. there is an interest in deriving ESs and whether Why not obtain entire time series for one they are different from zero. Note that from an single data box in a multiply repeated measure- epistemiological point of view, use of a different ments design? A time series model may be fitted statistical model is made than observational to data at each occassion resulting in micro- methods (see Section 3.04.1.2). Here, models are structural analysis. Stated differently: time considered as mere tools for hypotheses quali- series information has been condensed into fication and inference and are not of conten- some quantities that now enter macrostructural tional value besides some implications of their analysis reflecting design logic. Here, one might structural assumptions (like additivity, linear- compare such aggregate measures in a pre± ity, etc.). Analysis of nonequivalent group postfashion analysis like in standard design six, designs is achieved in three major steps. First, thus evaluating treatment impact by relative the overall rationale of analysis is introduced by change in parameters of local change. Other borrowing from the theory of generalized linear studies involve entire structural equations models. Second, its specification language is systems in a single data box, constructed from used to acquaint the reader with a system of covariance structure in multivariate data ob- models that indeed is a superset to nearly all tained at one occassion (Bentler, 1995; Duncan, models applicable in this context. Third, a closer 1975; MoÈ bus & BaÈ umer, 1986; JoÈ reskog & look is taken at a subset of models that are most SoÈ rbom, 1993). 
commonly employed in treating standard de- Most textbooks on the statistical analysis sign six, namely analysis of variance with or of experiments focus on analysis of variance without adjusting for a covariate and its and associates. While preference for models 82 Multiple Group Comparisons: Quasi-experimental Designs assuming variables on an interval or rational that remained unspecified or did not explicitly scale is perfectly understandable from the enter the model (i.e., remain uncontrolled). researchers point of view, it is felt that there To sum up, ingredients of a model are is a rationale of testing that allows for far more variables, parameters, a link function to variety. Appreciation of this variety is badly algebraically interconnect all these, and an needed in order to be able to choose an analysis estimation strategy to optimally fit the resulting model for the design present, thus preserving model to the data. The point in mapping most the rather deductive idea of the process model in different ideas and objectives into a common- Figure 1, instead of being forced to the models structure is variable designation: using contrary: adapting design and idea to some intensities can still mean single-attribute utili- analysis model readily at hand from common ties, attitude strength, blood pressure. But y = textbooks. Variety need not imply incompre- f(x, b, e) is too much in the abstract to be of use hensible diversity if there is a model rationale. for this purpose, and even giving an outline on Here, the rationale states: all analysis, excepting mathematical formulation and derivation of some tests for ordinal data working on special some less hard-to-handle submodels or estima- ranking methods, boils down to regression. This tion techniques would be beyond the limits of holds true even in cases where it might not be this chapter. 
For an advanced reading and a expected that a regression structure presents at complete build-up of generalized linear model first glance (joint space analysis in multidimen- theory and practice see Arminger, Clogg, and sional scaling, etc.). Most important, it is also Sobel (1995), Christensen (1991), and Diggle true for analysis of variance and companions et al. (1994). But model specification terms since these models can be written as regression summarized in Figure 4 can be used to figure out equations of metric dependent variables on a set the greater proportions of the generalized linear of coded vectors. These dummy variables, then, model. Note that not everything goes along with represent treatment in all its aspects (intensity, anything else when crossing items on the utmost frequency, schedule of application, etc.). En- right side and that some models use parame- coding logic of designs into what is now called a trization strategies that do not satisfy the design or hypotheses matrix for statistical condition of formal comparability of para- analysis certainly is an art of its own (for an meters across groups. What is required now is introduction, see Kerlinger (1973), or Pedhazur careful examination of the nature of the & Pedhazur (1991)). But here, regression stands variables and the justifiability of assumed form for a whole family of models obeying a certain of relations among them. Never follow the relational form. Thus, the rationale translates to impression possibly implied that all it takes is concrete statistics by proposing that all variable simply picking up a model from the scheme and relations (and contrasts are just a special applying it to the data if the marginal relation) be specified in the form y = f(x, b, parameters are met. 
While it is true that with e), where y denotes a possibly vector-valued set the exceptions mentioned above almost any of responses (dependents) and x denotes a set of statistical model of interest for analysis of explanatory variables (independents, respec- designs can be integrated in or extrapolated tively) that are used to approximate the from the generalized linear models scheme, the response with determinable residual or stochas- fit of design and analysis model is never tic error e. A set of coefficients b quantify the guaranteed from this alone. intensity of an element of x (which can be Several submodels can be located from that interpreted as effect information), these para- scheme that are most commonly used in meters are estimated conditional to the data by analyzing experiments of type three and six minimizing a quality function (ªleast squares,º (in Figure 2) and that may now be arranged in a ª(72)*maximum likelihoodº). Finally, f(.) more readable tabular system. If groups or denotes a characteristic link function that discrete time are assumed for a single indepen- relates the explanatory or predictive determistic dent variable of categorical nature, and an part of the model and error component to the arbitrary scale for a single dependent manifests, response. Mostly used for this purpose are variables measured in discrete time and every- identity link (giving standard multiple regres- thing else in Figure 4 is dispensed with, the sion), logit link (for binary response), and following table results (Table 3). In view of logarithmic links (leading to regression models practice, that is, software, some submodels involving polytoneous categorical data). In this present not as rigorously special cases of the way of thinking, the data are thought of as Generalized Linear Model (though they all can determined by the model but jammed by be specified accordingly), but rather as ªstand- additional random error. 
Note that ªerrorº aloneº implementations of tests. They serve the does not denote anything false but is meant to same purpose and are integrated in all major incorporate, by definition, all factors in effect statistical software packages. Analysis 83

[Figure 4 Building blocks in the Generalized Linear Model (GLM). Blocks shown: links (identity, logit, log, ...); variables (number, type, measurement level, manifest or latent); model setup (static; linear additive, cross-lagged, autoregressive; discrete or continuous time); error assumptions (IIND, AR(p), ...).]
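The three link functions named in the text (identity, logit, logarithmic) can be sketched as plain Python functions. This is a minimal illustration, not a fitting procedure; the function names are chosen here for readability and are not taken from any particular software package.

```python
import math


def identity_link(mu):
    """Identity link: the linear predictor equals the expected response."""
    return mu


def logit_link(p):
    """Logit link: log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_link(mu):
    """Logarithmic link for positive expected values (e.g., cell frequencies)."""
    return math.log(mu)


# A probability of 0.5 corresponds to odds of 1 and log-odds of 0.
print(logit_link(0.5))
# Round trip: a probability mapped to log-odds and back again.
print(inverse_logit(logit_link(0.25)))
```

On the logit scale a regression coefficient is an additive change in log-odds, which is why its exponential is read as a multiplicative change in the odds.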

Table 3 Common models for analysis of designs three/six.

Idea of effect                  Independent (group-time)   Scale of single dependent   Analysis/test
------------------------------  -------------------------  --------------------------  ------------------------
Simple change                   One sample, pre-post       categorical                 McNemar test
                                                           ordinal                     Wilcoxon test
                                                           metric                      dep. t-test
                                One sample, k times        categorical                 Cochran Q-test
                                                           ordinal                     Friedman test
                                                           metric                      One-way ANOVA rep. meas.
Post-test differences           Two samples                categorical                 Simple χ²
                                                           ordinal                     Mann-Whitney test
                                                           metric                      indep. t-test
                                m samples                  categorical                 Logit/Loglinear models
                                                           ordinal                     Kruskal-Wallis test
                                                           metric                      One-way ANOVA
Relative change (interaction)   m samples, k times each    categorical                 Logit/Loglinear models
                                                           ordinal                     Loglinear models
                                                           metric                      m-way ANOVA rep. meas.
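As one illustration of an entry in Table 3 (metric dependent, m independent samples, one-way ANOVA), the following minimal Python sketch computes the F ratio as between-group variance over within-group variance. The data are invented and the function name `one_way_anova_f` is hypothetical; the sketch assumes a fixed-effect model and does not compute a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    scores = [v for g in groups for v in g]
    grand_mean = mean(scores)
    # Systematic part: variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error part: variation of subjects around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented scores for three groups, for illustration only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio, df_b, df_w = one_way_anova_f(groups)
print(f_ratio, df_b, df_w)
```

The resulting F value would then be referred to an F distribution with the two degrees of freedom returned, provided the usual assumptions on the data are met.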

Further insight can be gained by the introduction of the analyses of variance. These models are considered the standard for analyses of multiple group comparison designs, but there is no hope of covering its various features and the richness of its submodels in detail. For the time being, do not bother with the distinction of fixed-effect, random-effect, and mixed models but with the rationale of analysis (for details, see Hays, 1980). Especially in the case of repeated measurement models, there are, however, some ideas and assumptions that must be understood before actually using these techniques. Most prominent and important among these are: heteroscedasticity, additivity, orthogonality, balance, degrees of freedom, compound symmetry, and sphericity. In addition, further precautions as regards covariance structure are in order when extending the model to multivariate analysis of several dependent variables. For advanced reading on these topics, whose consideration is indispensable for enhanced statistical conclusion validity, see Toutenburg (1994) or Christensen (1991).

Consider that response data $y = [y_1\, y_2 \ldots y_m \ldots y_M]$ on $M$ dependent variables are obtained on interval-scale level for $k = 1, \ldots, K$ samples of equal size (for convenience), with $i = 1, \ldots, N$ subjects each, at $t = 1, 2, \ldots, T$ occasions. A categorical independent variable (or factor) $f_1$ denoting treatment(s), $f_1 = (f_{11}, \ldots, f_{1k}, \ldots, f_{1K})$, has been realized at least twice in one sample or independently across different samples at any occasion $1 \le t \le T$. Treatment $f_{1t}$ denotes "no treatment" for implementing nonequivalent control group(s). If $y$ offers data on more than one dimension ($M > 1$), multivariate analysis (MANOVA) is called for; if it reduces to a scalar ($M = 1$) but extends in time ($T > 1$), univariate repeated measures ANOVA is applicable. For $M = 1$ and $T = 1$, effects are evaluated using a post-test-differences-only-given-pretest-equivalence rationale as presented in the first row in Table 3 (performed with a t-test for $k = 2$, both uni- and multivariate, and any (M)ANOVA for $k > 2$). With repeated measures carrying effect information ($T > 1$), as holds for both other rationales, data setup is a bit more sophisticated. Common software packages usually take repeated measures of one variable for separate variables (maybe with the same root name, e.g., out1 out2 out3 . . . for criterion variable out at three occasions). It follows that a multivariate data setup of the input data matrix is required. Note that when more than one genuine variable besides out enters analysis, a doubly multivariate procedure is called for. Stated differently, $M$ variables are to be analyzed whose respective $T$ time-dependent realizations are coded as $T$ different variables, giving $M \times T$ data columns in a typical data entry spreadsheet of minimal logical dimension $(N \cdot K) \times (M \cdot T)$ since data are obtained for $K$ groups of $N$ subjects each. This holds true, so far, for any one-way (M)ANOVA with a k-level treatment factor affecting one or more metric dependent variables. In most instances, however, it is likely that this root design will be extended by introducing:
(i) further experimental factors subject to manipulation (e.g., single vs. group therapy),
(ii) further cross-classifying factors that fall into place from basic sample analyses (e.g., age, sex, hospitalization, etc.),
(iii) further metric time-invariant covariates (e.g., onset of disease, initial dosage of medication), and
(iv) further metric time-varying covariates (e.g., ongoing medication).

The general idea of analysis of variance is rather simple: if means between k groups and/or t occasions differ, and this is what is called an effect, roughly, then they are supposed to vary from the grand mean $\mu$ obtained from collapsing over all $M \times K$ data points disregarding groups and treatments, or collapsing over occasions, giving the grand mean as a "no-change" profile of t equal means. If they do vary significantly, then this variation should be to a certain proportion greater than variation present within groups (i.e., with respect to the mean in a group). More specifically, a partitioning of total variance found in the data matrix into (at least) two components is attempted: variance of groups' means reflects the systematic ("causal model") part that is solely attributable to variation of treatment (be it in time or across groups), whereas variance of subjects' measurements within a group is generated by interindividual differences (i.e., violation of pretest equivalence or lack of treatment integrity) and is therefore taken as an "error term" against which the other part must be evaluated. Remember that the ratio of two variances or two sums of squares divided according to degrees of freedom constitutes an F-value when certain assumptions on data (Hays, 1980) are met: statistical significance is readily tested for. In this way, model variance due to effects, possibly containing interactions when more independent variables are present, and error variance due to anything else add up to give the total variance (assuming a fixed effect model). In the case of repeated measures analysis, there is also variance among t different measures of a single variable within one subject. To sum up: three main effect sources of variation are accounted for by standard ANOVA models:
(i) Differences between individuals, to be further decomposed into differences between k groups (disregarding the individual by taking means and thereby constituting a "true value" outcome measure for this treatment) and differences between subjects within a group, obtained from subtracting the individual's measure from the group mean. In analyses with no repeated measures factor present, these differences contribute to error variance against which the first kind of variation is evaluated. In the case of f-way ANOVAs, interactions of independent variables (factors) must be taken into account as well.
(ii) Differences within repeated measures of a subject and according interactions with factors varied between groups in f-way analyses. This variation may be further decomposed into three sources: a main effect of retesting (supposed to be present in each subject's profile over time), an interaction of treatment and retesting, and differential reactions to retesting (specific to each of the subjects). In case of analyses with no between-factors present, the latter source of variation is taken to contribute to residual variance. In multiway repeated measures analyses of variance, determination of appropriate components for an F-ratio in order to test for statistical significance of a factor or an interaction of some factors can be very complicated; excellent advice is given in Bortz (1993).

For design six, a differential change would be required for an effect by testing the interaction term of a repeated measures "within" factor and the case-control "between" factor for significance using the appropriate F ratio. Change takes place but in different directions (or none) depending on whether receiving treatment or not. Suppose one group of aggressive children received behavior modification training while another group of equal size is still waiting for admission and thus serves as a control group. Both groups are considered equivalent on the dependent variable of interest beforehand. Take it that $t = 3$ measures are obtained on each of $N_1 + N_2$ individuals in $k = 2$ groups: pretest, post-test after training, and a follow-up retest six months after intervention to estimate the stability of possible effects. Formally expressing the above sources of variation in a structural model for the individual measure, as is common for GLM submodels, and dispensing with explicit notation of associated dummies for discrete independent variables leads to:

$$y_{ikt} = \mu + \alpha_k + \beta_t + (\alpha\beta)_{kt} + \gamma_{ik} + \epsilon_{ikt}$$

where $\alpha_k$ refers to the effect of being in either treatment or control group, $\beta_t$ refers to differences solely attributable to time or change, $(\alpha\beta)_{kt}$ stands for the interaction of both main effects and thus denotes differential change as sketched above and in greater detail in Section 3.04.1.3, $\gamma_{ik}$ refers to a random (= erroneous) effect of a single subject $i$ in the $k$th kind of treatment (training vs. control, in the case present), maybe for individual specific reaction types to either form of treatment at any occasion, and $\epsilon_{ikt}$ accounts for anything not covered by the model (error term).

For full coverage of mathematical assumptions underlying this model, see Toutenburg (1994). As mentioned before, statistical hypotheses are formulated as null hypotheses conjecturing "no effect" of the according term in the model. In addition, for the present example, effects of several levels of an independent variable are required, coded by an according contrast in the hypothesis matrix (see above), which add up to zero. Hypotheses testable in this model include:

$H_0: \alpha_1 = \alpha_2$
There is no difference between treatment and control group with respect to the outcome variable (dependent), disregarding point of time of measurement. More generally speaking, such between hypotheses state homogeneity of measurement levels in groups. Note that no differences at pretest time are expected. Moreover, differences possibly present at post-test time (near termination of intervention) tend to attenuate up until the follow-up testing. It is not known whether group differences were at maximum by consideration of this main effect of group membership alone.

$H_0: \beta_{c_1(t)} = \beta_{c_2(t)}$ (specific formulation depending on contrasts $c(t)$ for time effects)
There is no overall change between occasions disregarding group membership. Again more generally speaking, the hypothesis states homogeneous influence of time effects in the course of experimentation. Considering this time main effect alone, it is known that whatever change occurred applied unspecifically to both groups and can therefore not be attributed to treatment. Obviously, this effect does not carry the information sought. When certain assumptions on covariance structure in time are met (sphericity condition, see above), up to $t - 1$ orthonormal linear contrasts may be used to evaluate change at certain periods using univariate and averaged tests. These assumptions assure, roughly speaking, that present differences are due to inhomogeneity of means rather than of variances (for details, see Toutenburg, 1994). Major software packages offer correction methods to compensate for violations of these assumptions (e.g., by correcting components of the F-test) or have alternatives available.

$H_0: (\alpha\beta)_{kt} = 0$
This hypothesis states parallelism of progressive forms in groups or, equivalently, the absence of differential change. If this hypothesis can be rejected, there is an interaction of treatment and the repeated measurement factor in the sense that (at least) one group apparently took a different course from pretest to post-test and follow-up as regards the outcome criteria. This is precisely what was presented as the standard notion of an effect in Section 3.04.3. Such a change can be attributed to treatment because the overall effects, like common linear trends and so on, have been accounted for by the main effect of time. In more complex cases, there may also be interest in differential growth rates or whether one group arrived sooner at a saturation level (suggesting earlier termination of treatment).

Actual testing of these hypotheses might be performed using "means models" available with the (M)ANOVA procedures or by parameterizing the above equation to fit the general linear model regression equation with identity link. In essence, this is done by building a parameter vector $\beta$ of semipartial regression coefficients from all effects noted $(\mu, \alpha_1, \ldots, \alpha_{k-1}, \beta_{c_1(t)}, \ldots, \beta_{c_{t-1}})$, associated by the corresponding hypotheses matrix $X$ (made up from coded vectors of contrasted independent variables) and an additive error term. This right-hand part of the equation is regressed on to the vector $y$, giving a standard Gauss-Markov-Aitken estimate of $\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y$ from least squares normal equations. A typical printout from respective software procedures includes parameter value, associated standard error and t value, confidence limits for the parameter estimate, $\eta^2$ effect size, F-noncentrality parameter, and power estimate. When interpreting results, remember that an effect is conceptualized as a difference of the respective group (= treatment) mean from the grand mean, and that effects are usually normalized to add up to zero over all treatments (for details and exceptions, see Hays, 1980).

Analysis of the worked example study was carried out as follows:

A one-way ANOVA with repeated measurements was used to test statistical significance of recorded difference in three dependent variables for five experimental conditions at pre, post, and follow-up occasions. There was no significant overall main group effect. Only systolic blood pressure was found to differ across groups at follow-up time with greatest reduction present in a comparison of TR to TOC. Group-time-interactions for testing differential change were carried out by an ANOVA using repeated contrasts (pre vs. post; post vs. follow-up). Systolic blood pressure decreased significantly over time in all relaxation groups and the placebo group, but not in the control group. TR muscle relaxation showed greatest reduction of diastolic pressure, whereas SR and COG significantly lowered heart rate. Pooled treatment effects were evaluated against nontreatment groups by special ANOVA contrasts, showing significant reduction in systolic pressure when compared to control group and significant reduction in diastolic pressure when compared to the placebo group. There were no effects on heart rate in comparison to either one of the nontreatment groups. The results indicate a strong placebo attention effect. It is suggested that independence of heart rate and significant blood pressure reduction is in line with the findings in the field.

With additional restrictions put on $\Sigma$, this rationale of testing easily extends to multivariate and doubly multivariate cases (Toutenburg, 1994). Also, there is much appeal in the idea of comparing groups on entire linear systems obtained from all measured variables by structural equations modeling (SEM; see Duncan, 1975). Here, measurement error in both independent and dependent variables is accounted for explicitly. For an introduction to multiple-group covariance structure analyses and to models additionally including structured means see Bentler (1995), Möbus and Bäumer (1986), and Sörbom (1979). MANOVA and SEM are compared on theoretical and practical grounds in Cole, Maxwell, Arvey, and Salas (1993). Note that the latest formulations of these models and associated software packages can now deal with nonnormal and categorical data.

When ordinal measures have been obtained, ANOVA on ranks of subjects' measures (with respect to the order in their respective group) rather than on absolute values may be applied instead of "genuine" ordinal methods as listed in Table 3. In the case of categorical measures, frequencies mostly have been obtained denoting the number of subjects with the characteristic feature of the category level in question present relative to the total number of subjects over the range of all possible levels of that categorical variable. Basically, analysis is carried out on an aggregate level (groups or subpopulations) rather than on the individual level, and link functions other than identity link are employed to regress independent variables (which have been of a categorical nature all the time: group membership, etc.) on to categorical outcome measures (now dependent variables have also become categorical). Regression coefficients must be interpreted differently, though, when dependent variables have been reconceptualized in this way. A positive $\beta$ indicates that the probability of a specific category of a dependent variable is positively affected (raised) relative to another category, including the case of simply not being in that very category, when the characteristic feature of the category for an independent variable is present in a subject. In the case of a dichotomous dependent variable (logit models) interpretation is facilitated: $\beta$ quantifies the effect of the independent variable on the odds, that is, the ratio of favorable to unfavorable responses (e.g., whether or not cancer is or will be present given the information available from independent variables). On a more formal level, a nonzero regression coefficient in log-linear models indicates that the expected cell frequency in the category $r$ of $R$ categories or "levels" of the dependent variable departs from what would be expected in the "null" case of no effects, that is $n_r = n_{\cdot} = N/R$. As with the metric case, regression coefficients are considered effect parameters and tested for statistical significance by evaluating their standard error. Multiple group comparisons can be obtained in one of two ways: by introducing an independent variable denoting group membership, or by testing whether there is a significant change in the associated overall model fit indices from the more saturated model incorporating the effect in question ($H_1$) to the more restricted model dispensing with it ($H_0$).

3.04.7 CONCLUSION

The last section described the statistical models for multiple group comparisons. In most cases, however, analysis will be carried out using techniques presented in Table 3. It is re-emphasized that it is not the complexity or elegance of the final data model that puts a good end to the process of quasi-experimental research: it is the fit of research questions or hypotheses and according methodological tools to tackle them. This is why a major part of this chapter is devoted to a presentation of an idealized procedural outline of this research process as sketched in Figure 1, and to discussing validity affairs. Generally speaking, the last step in this process lies with reversing it inductively and with introducing evaluative elements. It must be decided to which degree present patterns of statistical significance justify the induction that predictions derived from substantial hypotheses have instantiated as they should and whether validity issues have been attended to a degree that inference to the level of theory is admissible. Limitations in generalizability should be examined and problems encountered in the course of the study must be evaluated in order to point to future directions of research. With reference to Section 3.04.3.3, implications for clinical practice should be derived and evaluated. In publications, these issues are usually covered in the "Results" and "Discussion" sections.

In the worked example study, the following discussion is offered:

A result most relevant to clinical practice lies with the fact that cognitive and muscular relaxation both significantly reduce blood pressure and therapists may therefore choose from both methods with respect to their individual clients' special needs and aptitudes (differential indication). The critical prerequisite for this conclusion in terms of validity lies with clear separation of experimental instructions concerning either muscularly or cognitively induced relaxation. It might be argued that any instruction relevant to improving irritating bodily sensations does feature cognitive elements, like expectancy, that contribute to intervention

effects. It might be better to differentiate between Collingwood, R. G. (1940). An essay on metaphysics. techniques involving direct muscular practice and Oxford, UK: Oxford University Press. a pool of techniques that dispense with the Collins, A. W. (1966). The use of statistics in explanation. behavioral element. In the end, all relaxation British Journal for the Philosophy of Science, 17, 127±140. techniques affect muscle tension. But persons Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. might well vary on the degree to which they can Chicago: Rand McNally. benefit from direct or indirect techniques. It is Cronbach, L. J., & Furby, L. (1970). How should we known that more susceptible persons, or persons measure ªchangeºÐor should we? Psychological Bulle- with greater ability to engage in and maintain vivid tin, 74, 68±80. imagination and concentration, will benefit more Diggle, P. J., Liang, K. Y., & Zeger, S. L. (1994). Analysis from cognitive techniques, and that physical dis- of longitudinal data. Oxford, UK: Oxford University abilities often rule out application of muscularly Press. directed interventions. On the other hand, muscle Duncan, O. D. (1975). Introduction to structural equations relaxation instructions are considered easier to models. New York: Academic Press. Eye, A. V. (Ed.) (1990). Statistical methods for longitudinal follow, and, after all, straight muscle relaxation research. Boston: Academic Press. techniques showed the greatest overall effect in the Farnum, N. R., & Stanton, L. W. (1989). Quantitative present study, thus pointing to a ªsafe-sideº forecasting methods. Boston: Kent Publishing. decision. These arguments refer to client selection Fisher, R. A. (1959). Statistical method and statistical in clinical practice as a parallel to the referred inference. Edinburgh, UK: Oliver & Boyd. effects of subject selections in experimentation. Fitz-Gibbon, C. T., & Morris, L. L. 
(1991). How to design a program evaluation. Program evaluation kit, 3. Newbury Park, CA: Sage. Accounting for the demand for both meth- Giere, R. N. (1972). The significance test controversy. odologically valid and practically relevant British Journal for the Philosophy of Science, 23, 170±181. methods, the following guidelines for quasi- Glass, G. V., Willson, V. L., & Gottman, J. M. (1975). experimental research may be offered: Design and analysis of time series experiments. Boulder, (i) preference for case±control designs, CO: Associated University Press. Gottman, J. M. (1981). Time series analysis. Cambridge, (ii) preference for repeated measures designs, UK: Cambridge University Press. thus preference for diagnostic devices and Granger, C. W. (1969). Investigating causal relations by statistical methods sensitive to (relative) change, econometric models and cross-spectral methods. Econo- (iii) specification of hypotheses in terms of metrica, 37, 424±438. Grawe, K. (1992). Psychotherapieforschung zu Beginn der effect sizes, neunziger Jahre. Psychologische Rundschau, 43, 132±168. (iv) consideration of multiple indicators per Gregson, R. (1983). Times series in Psychology. Hillsdale, criterion, entailing a multivariate data setup, NJ: Erlbaum. and Hager, W. (1992). Jenseits von Experiment und Quasiexperi- (v) special attention to balancing N (sample ment: Zur Struktur psychologischer Versuche und zur Ableitung von Vorhersagen.GoÈ ttingen, Germany: Ho- size), T (rep. measures level), ES (effect size), grefe. and other factors determining both statistical Hagenaars, J. A. (1990). Categorical longitudinal data: log- and practical significance of results presented in linear panel, trend and cohort analysis. Newbury Park, Section 3.04.3.1. CA: Sage. Hays, W. (1980). Statistics for the social sciences. New York: Holt, Rinehart, & Winston. Hoole, F. W. (1978). Evaluation research and development 3.04.8 REFERENCES activities. Beverly Hills, CA: Sage. 
JoÈ reskog, K., & SoÈ rbom, D. (1993). LISREL 8 user's Arminger, G., Clogg, C. C., & Sobel, M. E. (Eds.) (1995). reference guide. Chicago: Scientific Software Interna- Handbook of statistical modeling for the social and tional. behavioral sciences. New York: Plenum. Kepner, J. L., & Robinson, D. H. (1988). Nonparametric Bentler, P. M. (1995). EQS structural equations program methods for detecting treatment effects in repeated manual. Encino, CA: Multivariate Software. measures designs. Journal of the American Statistical Bernstein, I. N. (Ed.) (1976). Validity issues in evaluative Association, 83, 456±461. research. Beverly Hills, CA: Sage. Kerlinger, F. (1973). Foundations of behavioral research Bortz, J. (1993). Statistik. Berlin: Springer. (2nd ed.). New York: Holt, Rinehart, & Winston. Box, G. E., Jenkins, G. M., & Reinsel, G. C. (1994). Time Lindsey, J. K. (1993). Models for repeated measurements. series analysis: forecasting and control. Englewood Cliffs, Oxford statistical science series, 10. Oxford, UK: NJ: Prentice-Hall. Claredon Press. Campbell, D. T., & Stanley, J. C. (1963). Experimental and Lipsey, M. W. (1990). Design sensitivity: statistical power quasi-experimental design for research on teaching.In for experimental research. Newbury Park, CA: Sage. N. L. Gage (Ed.), Handbook for research on teaching, Lord, F. M. (1963). Elementary models for measuring (pp. 171±246). Chicago: Rand McNally. change. In C. W. Harris (Ed.), Problems in measuring Christensen, R. (1991). Linear models for multivariate, time change. Madison, WI: University of Wisconsin Press. series, and spatial data. New York: Springer. Mayo, D. (1985). Behavioristic, evidentialist and learning Cohen, J. (1988). Statistical power analysis for the models of statistical testing. Philosophy of Science, 52, behavioral sciences. Hillsdale, NJ: Erlbaum. 493±516. Cole, D. A., Maxwell, S. E., Arvey, R., & Salas, E. (1993). McDowall, D., & McCleary, R., Meidinger, E. 
E., & Hay, Multivariate group comparison of variable systems: R. A. (1980). Interrupted time series analysis. Newbury MANOVA and structural equation modeling. Psycholo- Park, CA: Sage gical Bulletin, 114, 174±184. MoÈ bus, C., & BaÈ umer, H. P. (1986). Strukturmodelle fuÈr References 89

LaÈngsschnittdaten und Zeitreihen. Bern, Switzerland: Solomon, R. L. (1949). An extension of control group Huber. design. Psychological Bulletin, 46, 137±150. Pedhazur, E. J., & Pedhazur, L. (1991). Measurement, SoÈ rbom,D.(1979).Ageneralmodelforstudying design and analysis: an integrated approach. Hillsdale, differences in factor means and factor structure between NJ: Erlbaum. groups.InK.G.JoÈ reskog & D. SoÈ rbom (Eds.), Petermann, F. (Ed.) (1996). Einzelfallanalyse (3rd ed.). Advances in factor analysis and structural equation models MuÈ nchen, Germany: Oldenbourg. (pp. 207±217). Cambridge, MA: Abt Books. Prioleau, L., Murdock, M., & Brody, N. (1983). An analyis Spector, P. E. (1981). Research designs. Beverly Hills, CA: of psychotherapy versus placebo studies. The Behaviour- Sage. al and Brain Sciences, 6, 275±285. Thompson, G. L. (1991). A unified approach to rank tests Plewis, I. (1985). Analysing change: measurement and for multivariate and repeated measures designs. Journal explanation using longitudinal data. New York: Wiley. of the American Statistical Association, 86, 410±419. Putnam, H. (1991). The ªCorroborationº of theories. In R. Toutenburg, H. (1994). Versuchsplanung und Modellwahl: Boyd, P. Gasper, & J. D. Trout (Eds.), The philosophy of statistische Planung und Auswertung von Experimenten science (pp. 121±137). Cambridge, MA: MIT Press. mit stetigem oder kategorialem Response. Heidelberg, Rao, P., & Miller, R. L. (1971). Applied econometrics. Germany: Physica. Belmont, CA: Wadsworth. Trochim, W. M. (1984). Research design for program Rossi, P. H., Freeman, H. E., & Wright, S. R. (1979). evaluation: the regression-discontinuity approach. Con- Evaluation: a systematic approach. Beverly Hills, CA: temporary evaluation research, 6. Beverly Hills, CA: Sage. Sage. Salmon, W. C. (1966). The foundations of scientific Vonesh, E. F. (1992). Non-linear models for the analysis of inference. Pittsburg, OH: University of Pittsburgh Press. longitudinal data. 
Statistics in Medicine, 11, 1929±1954. Schmitz, B. (1989). EinfuÈhrung in die Zeitreihenanalyse. Yung, P. M. B., & Kepner, A. A. (1996). A controlled Bern, Switzerland: Huber. comparison on the effect of muscle and cognitive Smith, M. L, Glass, V. G., & Miller, T. I. (1980). The relaxation procedures on blood pressure: implications benefits of psychotherapy. Baltimore: Johns Hopkins for the behavioral treatment of borderline hypertensives. University Press. Behavioral Research and Therapy, 34, 821±826. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.05 Epidemiologic Methods

WILLIAM W. EATON Johns Hopkins University, Baltimore, MD, USA

3.05.1 INTRODUCTION
  3.05.1.1 Epidemiology and Clinical Psychology
  3.05.1.2 Dichotomy and Dimension
  3.05.1.3 Exemplar Study: Ecological Approach to Pellagra
3.05.2 RATES
  3.05.2.1 Prevalence
  3.05.2.2 Incidence
  3.05.2.3 Incidence and Onset
3.05.3 MORBIDITY SURVEYS
  3.05.3.1 Exemplar Study: The Epidemiologic Catchment Area (ECA) Program
  3.05.3.2 Defining and Sampling the Population
  3.05.3.3 Sample Size
  3.05.3.4 Standardization of Data Collection
3.05.4 EXPOSURE AND DISEASE
  3.05.4.1 Cohort Design
  3.05.4.2 Exemplar Study: the British Perinatal Study
  3.05.4.3 Case-Control Design
  3.05.4.4 Exemplar Study: Depression in Women in London
3.05.5 USING AVAILABLE RECORDS
  3.05.5.1 Government Surveys
  3.05.5.2 Vital Statistics
  3.05.5.3 Record-based Statistics and Record Linkage
  3.05.5.4 Exemplar Study: the Danish Adoption Study
3.05.6 PREVENTIVE TRIALS
  3.05.6.1 The Logic of Prevention
  3.05.6.2 Attributable Risk
  3.05.6.3 Developmental Epidemiology
  3.05.6.4 Exemplar Study: the Johns Hopkins Prevention Trial
3.05.7 BIAS
  3.05.7.1 Sampling Bias
  3.05.7.2 Measurement Bias
3.05.8 CONCLUSION: THE FUTURE OF PSYCHOLOGY IN EPIDEMIOLOGY
3.05.9 REFERENCES


3.05.1 INTRODUCTION

3.05.1.1 Epidemiology and Clinical Psychology

Is psychological epidemiology an oxymoron? The primary goals of psychology differ from the primary goals of epidemiology. Psychology seeks to understand the general principles of psychological functioning of individual human beings. Epidemiology seeks to understand pathology in populations. These two differences (general principles vs. pathology, and individuals vs. populations) form the foundations for ideologic differences which have led to failures of linkage between the two disciplines, and many missed opportunities for scientific advances and human betterment. The gap between the two is widened by their location in different types of institutions: liberal arts universities vs. medical schools.

Various definitions of epidemiology reflect the notion of pathology in populations. Mausner and Kramer define epidemiology as "the study of the distribution and determinants of diseases and injuries in human populations" (1985, p. 1). Lilienfeld and Stolley use a very similar definition but include the notion that "epidemiologists are primarily interested in the occurrence of disease as categorized by time, place, and persons" (1994, p. 1). A less succinct, but nevertheless classic, statement is by Morris (1975), who listed seven "uses" of epidemiology:
(i) To study the history of health of populations, thus facilitating projections into the future.
(ii) To diagnose the health of the community, which facilitates prioritizing various health problems and identifying groups in special need.
(iii) To study the working of health services, with a view to their improvement.
(iv) To estimate individual risks, and chances of avoiding them, which can be communicated to individual patients.
(v) To identify syndromes, by describing the association of clinical phenomena in the population.
(vi) To complete the clinical picture of chronic diseases, especially in terms of natural history.
(vii) To search for causes of health and disease.

The seventh use is regarded by Morris as the most important, but, in reviewing the literature on psychiatric disorders for the third (1975) edition of his text, he was surprised at

"The sad dearth of hard fact on the causes of major as well as minor mental disorders, and so on how to prevent them . . . [and] the dearth of epidemiological work in search of causes . . . Lack of ideas? Methods? Money? It wants airing. Of course, the popular trend open-endedly to hand over to psychiatry (and to social work), in hope or resignation, the whole of the human condition is no help." (p. 220)

3.05.1.2 Dichotomy and Dimension

Pathology is a process, but diseases are regarded as present or absent. Psychologists have been interested historically in promoting health and in understanding the nature of each individual's phenomenology, leading to emphasis on dimensions of behavior; but epidemiologists have focused on disorders without much regard to individual differences within a given diagnostic category, leading to focus on dichotomies. For clinical psychologists adopting the epidemiologic perspective, the phrase "psychiatric epidemiology" could describe their work accurately (Tsuang, Tohen, & Zahner, 1995). Another area of work for psychologists in epidemiology focuses on psychological contributors to diseases in general, which can include psychiatric disorders but not be limited to them. This type of epidemiology might be termed "psychosocial epidemiology" (Anthony, Eaton, & Henderson, 1995), and receives less attention in this chapter.

Interest in normal functioning, or in deviations from it, has led psychologists to focus on statistical measures related to central tendency, such as the mean, and the statistical methods oriented toward that parameter, such as the comparison of means and analysis of variance. Epidemiology's base in medicine has led it to focus on the dichotomy of the presence or absence of disease. There is the old adage in epidemiology: "The world is divided into two types of persons: those who think in terms of dichotomies, and those who do not." The result for statistical analysis is recurring focus on forms involving dichotomies: rates and proportions (Fleiss, 1981), the two-by-two table (Bishop, Fienberg, & Holland, 1975), the life table (Lawless, 1982), and logistic regression (Kleinbaum, Kupper, & Morgenstern, 1982).

The epidemiologic approach is focused on populations. A population is a number of individuals having some characteristic in common, such as age, gender, occupation, nationality, and so forth. Population is a broader term than group, which has awareness of itself, or structured interactions: populations typically are studied by demographers, whereas groups typically are studied by sociologists. The general population is not limited to any particular class or characteristic, but includes all persons without regard to special features. In the context of epidemiologic research studies, the general population usually refers to the individuals who normally reside in the locality of the study. Many epidemiologic studies involve large populations, such as a nation, state, or a county of several hundred thousand inhabitants. In such large studies, the general population includes individuals with a wide range of social and biological characteristics, and this variation is helpful in generalizing the results. In the broadest sense, the general population refers to the human species. The fundamental parameters used in epidemiology often are conceptualized in terms of the population: for example, prevalence sometimes is used to estimate the population's need for health care; incidence estimates the force of morbidity in the population. Generally diseases have the characteristic of rarity in the population. As discussed below, rates of disease often are reported per 1000, 10 000, or 100 000 population.

Much of the logic that distinguishes the methods of epidemiology from those of psychology involves the epidemiologic focus on rare dichotomies. The case-control method maximizes efficiency in the face of this rarity by searching for and selecting cases of disease very intensively, for example, as in a hospital or a catchment area record system, and selecting controls at a fraction of their representation in the general population. Exposures can also be rare, and this possibility is involved in the logic of many cohort studies which search for and select exposed individuals very intensively, for example, as in an occupational setting with toxins present, and selecting nonexposed controls at a fraction of their representation in the general population. In both strategies the cases, exposed groups, and controls are selected with equal care, and in both strategies they are drawn, if possible, to represent the population at risk of disease onset: the efficiency comes in the comparison to a relatively manageable sample of controls.

3.05.1.3 Exemplar Study: Ecological Approach to Pellagra

The ecologic approach compares populations in geographic areas as to their rates of disease, and has been a part of epidemiology since its beginnings. A classic ecologic study in epidemiology is Goldberger's work on pellagra psychosis (Cooper & Morgan, 1973). The pellagra research also illustrates the concept of the causal chain and its implications for prevention. In the early part of the twentieth century, pellagra was most prevalent in the lower classes in rural villages in the southeastern US. As the situation may be in the 1990s for many mental disorders (Eaton & Muntaner, 1996), the relationship of pellagra prevalence to low social class was a consistent finding: a leading etiologic clue among many others with less consistent evidence. Many scientists felt that pellagra psychosis was infectious, and that lower class living situations promoted breeding of the infectious agent. But Goldberger noticed that there were striking failures of infection: for example, aides in large mental hospitals, where pellagra psychosis was prevalent, seemed immune. He also observed certain exceptions to the tendency of pellagra to congregate in the lower class, among upper-class individuals with unusual eating habits. Goldberger became convinced that the cause was a nutritional deficiency which was connected to low social class. His most powerful evidence was an ecologic comparison of high and low rate areas: two villages which differed as to the availability of agricultural produce. The comparison made a strong case for the nutritional theory using the ecological approach, even though he could not identify the nutrient. He went so far as to ingest bodily fluids of persons with pellagra to demonstrate that it was not infectious. Eventually a deficiency in vitamin B was identified as a necessary causal agent. Pellagra psychosis is now extremely rare in the US, in part because of standard supplementation of bread products with vitamin B.

Goldberger understood that low social class position increased risk of pellagra, and he also believed that nutritional deficiencies which resulted from lower class position were a necessary component in pellagra. With the advantage of hindsight, nutritional deficiency appears to have a stronger causal status, because there were many lower class persons who did not have pellagra, but few persons with pellagra who were not nutritionally deprived: vitamin B deprivation is a necessary cause, but lower social class position is a contributing cause. The concept of the causal chain points to both causes, and facilitates other judgements about the importance of the cause. For example, which cause produces the most cases of pellagra? (Since they operate in a single causal chain, the nutritional deficiency is more important from this point of view.) Which cause is antecedent in time? (Here the answer is lower social class position.) Which cause is easiest to change in order to prevent pellagra? The answer to this question is not so obvious, but, from the public health viewpoint, it is the strongest argument about which cause deserves attention. Which cause is consistent with an accepted framework of disease causation supportive of the ideology of a social group with power in society? Here the two causes diverge: funneling resources to the medical profession to cure pellagra is one approach to the problem consistent with the power arrangement in the social structure; another approach is redistributing income so that nutritional deficiency is less common. As it happens, neither approach would have been effective. The consensus to redistribute income, in effect, to diminish social class differences, is not easy to achieve in a democracy, since the majority are not affected by the disease. Bread supplementation for the entire population was a relatively cost-effective public health solution. The more general point is that Goldberger's work encompassed the complexity of the causal process.

Epidemiology accepts the medical framework as defining its dependent variable, but it is neutral as to discipline of etiology. Goldberger did not eliminate one or the other potential cause due to an orientation toward social vs. biological disciplines of study. He did not favor one or the other potential cause because it focused on the individual vs. the collective level of analysis. This eclecticism is an important strength in epidemiology because causes in different disciplines can interact, as they did in the case of pellagra, and epidemiology still provides a framework for study. The public health approach is to search for the most important causes, and to concentrate on those that are malleable, since they offer possibilities for prevention.

3.05.2 RATES

An early and well-known epidemiologist, Alexander Langmuir, used to say that "stripped to its basics, epidemiology is simply a process of obtaining the appropriate numerator and denominator, determining the rate, and interpreting that rate" (cited in Foege, 1996, p. S11). In the sense in which Langmuir was speaking, the term rates includes proportions such as the prevalence "rate" as well as the incidence rate, as explained below. Table 1 shows the minimum design requirements, and the definitions of numerators and denominators, for various types of the rates and proportions. The table is ordered from top to bottom by increasing difficulty of longitudinal follow-up.

3.05.2.1 Prevalence

Prevalence is the proportion of individuals ill in a population. Temporal criteria allow for several types of prevalence: point, lifetime, and period. Point prevalence is the proportion of individuals in a population who are ill at a given point in time. The most direct use of point prevalence is as an estimate of need for care or potential treatment load, and it is favored by health services researchers. It is also useful because it identifies groups at high risk of having a disorder, or greater chronicity of the disorder, or both. Finally, the point prevalence can be used to measure the impact of prevention programs in reducing the burden of disease on the community.

Lifetime prevalence is the proportion of individuals alive on a given day in the population who have ever been ill. As those who die are not included in the numerator or denominator of the proportion, the lifetime prevalence is sometimes called the proportion of survivors affected (PSA). It differs from the lifetime risk because the latter attempts to include the entire lifetime of a birth cohort, both past and future, and includes those deceased at the time of the survey. Lifetime risk is the quantity of most interest to geneticists. Lifetime prevalence also differs from the proportion of cohort affected (PCA), which includes members of a given cohort who have ever been ill by the study date, regardless of whether they are still alive at that time.

Lifetime prevalence has the advantage over lifetime risk and PCA in that it does not require ascertaining who is deceased, whether those deceased had the disorder of interest, or how likely those now alive without the disorder are to develop it before some given age. Thus, lifetime prevalence can be estimated from a cross-sectional survey. The other measures require either following a cohort over time or asking relatives to identify deceased family members and report symptoms suffered by them. Often these reports must be supplemented with medical records. The need for these other sources adds possible errors: relatives may forget to report persons who died many years before or may be uninformed about their psychiatric status; medical records may be impossible to locate or inaccurate; and the prediction of onset in young persons not yet affected requires the assumption that they will fall ill at the same rate as the older members of the sample and will die at the same ages if they do not fall ill. If risks of disorder or age-specific death rates change, these predictions will fail.

Lifetime prevalence requires that the diagnostic status of each respondent be assessed over his or her lifetime. Thus, accuracy of recall of symptoms after a possible long symptom-free period is a serious issue, since symptoms and disorders that are long past, mild, short-lived, and less stigmatizing are particularly likely to be forgotten. For example, data from several cross-sectional studies of depression seem to indicate a rise in the rate of depression in persons born after World War II (Cross-National Collaborative Group, 1992; Klerman & Weissman, 1989).

Table 1 Rates and proportions in epidemiology.

Rate                            Minimum design     Numerator           Denominator

Lifetime prevalence             Cross-section      Ever ill            Alive
Point prevalence                Cross-section      Currently ill       Alive
Period prevalence (1)           Cross-section      Ill during period   Alive at survey
Period prevalence (2)           Two waves          Ill during period   Alive during period
First incidence                 Two waves          Newly ill           Never been ill
Attack rate                     Two waves          Newly ill           Not ill at baseline
Proportion of cohort affected   Birth to present   Ever ill            Born and still alive
Lifetime risk                   Birth to death     Ever ill            Born
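
As a rough illustration of how the numerators and denominators in Table 1 combine, the following Python sketch computes four of the tabled quantities for a small invented two-wave cohort. The record fields, the function names, and the data are hypothetical and are not taken from any study cited in this chapter. The sketch also assumes, for simplicity, that no respondent dies or is lost between the two waves, so that everyone interviewed counts as alive.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    # All three fields are hypothetical names chosen for this illustration.
    ever_ill_before: bool      # any episode before the baseline interview
    ill_at_baseline: bool      # episode active on the day of the baseline interview
    episode_in_followup: bool  # an episode began between baseline and follow-up


def proportion(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is empty."""
    return numerator / denominator if denominator else float("nan")


def table1_rates(cohort):
    """Compute selected Table 1 quantities, assuming no deaths or dropouts."""
    alive = len(cohort)
    ever_ill = sum(r.ever_ill_before or r.ill_at_baseline for r in cohort)
    currently_ill = sum(r.ill_at_baseline for r in cohort)
    # First incidence: denominator is restricted to persons never ill at baseline.
    never_ill = [r for r in cohort if not (r.ever_ill_before or r.ill_at_baseline)]
    # Attack rate (total incidence): denominator excludes only active cases.
    not_ill_now = [r for r in cohort if not r.ill_at_baseline]
    return {
        "lifetime_prevalence": proportion(ever_ill, alive),
        "point_prevalence": proportion(currently_ill, alive),
        "first_incidence": proportion(
            sum(r.episode_in_followup for r in never_ill), len(never_ill)
        ),
        "attack_rate": proportion(
            sum(r.episode_in_followup for r in not_ill_now), len(not_ill_now)
        ),
    }


# Invented cohort of ten respondents.
cohort = (
    [Respondent(True, True, False)]       # recurrent case active at baseline
    + [Respondent(False, True, False)]    # first episode active at baseline
    + [Respondent(True, False, True)]     # past history, recurrence in follow-up
    + [Respondent(True, False, False)]    # past history, no new episode
    + [Respondent(False, False, True)]    # never ill, first onset in follow-up
    + [Respondent(False, False, False)] * 5
)

rates = table1_rates(cohort)
print({name: round(value, 3) for name, value in rates.items()})
# {'lifetime_prevalence': 0.4, 'point_prevalence': 0.2, 'first_incidence': 0.167, 'attack_rate': 0.25}
```

In this toy cohort the attack rate exceeds first incidence because the recurrent episode enters the attack-rate numerator while persons with a past history are excluded from the first-incidence denominator, which is the distinction the two incidence rows of Table 1 are meant to capture.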

These persons are older, however, and it may be that they have forgotten episodes of depression which occurred many years prior to the interview (Simon & Von Korff, 1995). Data showing that lifetime prevalence of depressive disorder actually declines with age (Robins et al., 1984) are consistent with the recall explanation.

Period prevalence is the proportion of the population ill during a specified period of time. Customarily the numerator is estimated by adding the prevalent cases at the beginning of the defined period to the incident (first and recurrent) cases that develop between the beginning and the end of the period. This form is shown as period prevalence (2) in Table 1. In research based on records, all cases of a disorder found over a one-year period are counted. The denominator is the average population size during the interval. Thus, the customary definition of period prevalence requires at least two waves of data collection. Both Mausner and Kramer (1985) and Kleinbaum et al. (1982) have noted the advantages of period prevalence for the study of psychiatric disorders, where onset and termination of episodes is difficult to ascertain exactly (e.g., the failure to distinguish new from recurrent episodes is unimportant in the estimation of period prevalence). Furthermore, the number of episodes occurring during the follow-up is unimportant; it is important only to record whether there was one or more vs. none. The disadvantage of period prevalence is that it is not as useful in estimating need as point prevalence, nor as advantageous in studying etiology as incidence.

Another type of period prevalence, sometimes labeled by a prefix denoting the period, such as one-year prevalence, is a hybrid type of rate, conceptually mixing aspects of point and period prevalence, which has been found useful in the ECA Program (Eaton et al., 1985). This type of period prevalence is labeled period prevalence (1) in Table 1. It includes in the numerator all those surveyed individuals who have met the criteria for disorder in the past year, and as denominator all those interviewed. It is not a point prevalence rate because it covers a longer period of time, which can be defined as six months, two years, and so forth, as well as one year. But one-year prevalence is not a period prevalence rate because some individuals in the population who are ill at the beginning of the period are not successfully interviewed, because they either die or emigrate. As the time period covered in this rate becomes shorter, it approximates the point prevalence; as the time period becomes longer, the rate approaches the period prevalence. If there is large mortality, the one-year prevalence rate will diverge markedly from period prevalence.

3.05.2.2 Incidence

Incidence is the rate at which new cases develop in a population. It is a dynamic or time-dependent quantity and can be expressed as an instantaneous rate, although, usually, it is expressed with a unit of time attached, in the manner of an annual incidence rate. In order to avoid confusion, it is essential to distinguish first incidence from total incidence. The distinction itself commonly is assumed by epidemiologists but there does not appear to be consensus on the terminology. Most definitions of the incidence numerator include a concept such as "new cases" (Lilienfeld & Stolley, 1994, p. 109), or persons who "develop a disease" (Mausner & Kramer, 1985, p. 44). Morris (1975) defines incidence as equivalent to our "first incidence," and "attack rate" as equivalent to our "total incidence." First incidence corresponds to the most common use of the term "incidence," but since the usage is by no means universal, keeping the prefix is preferred.

The numerator of first incidence for a specified time period is composed of those individuals who have had an occurrence of the disorder for the first time in their lives and the denominator includes only persons who start the period with no prior history of the disorder. The numerator for attack rate (or total incidence) includes all individuals who have had an occurrence of the disorder during the time period under investigation, whether or not it is the initial episode of their lives or a recurrent episode. The denominator for total incidence includes all population members except those cases of the disorder which are active at the beginning of the follow-up period.

A baseline and follow-up generally are needed to estimate incidence. Cumulative incidence (not shown in Table 1) is the proportion of the sample or population who become a case for the first time between initial and follow-up interviews (Kleinbaum et al., 1982). But incidence generally is measured per unit of time, as a rate. When the follow-up period extends over many years, the exposure period of the entire population at risk is estimated by including all years prior to onset for a given individual in the denominator, and removing years from the denominator when an individual has onset or dies. In this manner, the incidence is expressed per unit time per population: for example, "three new cases per 1000 person years of exposure." This method of calculating facilitates comparison between studies with different lengths of follow-up and different mortality. When the mean interval between initial and follow-up interview is approximately 365 days, then the ratio of new cases to the number of followed-up respondents who were ascertained as being without a current or past history of the disorder at the time of the initial interview can serve as a useful approximation of the disorder's annual first incidence.

First incidence also can be estimated by retrospection if the date of onset is obtained for each symptom or episode, so that the proportion of persons who first qualified in the year prior to the interview can be estimated. For this type of estimate (not shown in Table 1), only one wave of data collection is needed. This estimate of first incidence is subject to the effects of mortality, however, because those who have died will not be available for the interview.

The preference for first or total incidence in etiologic studies depends on hypotheses and assumptions about the way causes and outcomes important to the disease ebb and flow. If the disease is recurrent and the causal factors vary in strength over time, then it might be important to study risk factors not only for first but for subsequent episodes (total incidence): for example, the effects of changing levels of stress on the occurrence of episodes of neurosis (Tyrer, 1985) or schizophrenia (Brown, Birley, & Wing, 1972). For a disease with a presumed fixed progression from some starting point, such as dementia, the first occurrence might be the most important episode to focus on, and first incidence is the appropriate rate. In the field of psychiatric epidemiology, there are a range of disorders with both types of causal structures operating, which has led us to focus on this distinction in types of incidence.

The two types of incidence are related functionally to different measures of prevalence. Kramer et al. (1980) have shown that lifetime prevalence (i.e., the proportion of persons in a defined population who have ever had an attack of a disorder) is a function of first incidence and mortality in affected and unaffected populations. Point prevalence (i.e., the proportion of persons in a defined population on a given day who manifest the disorder) is linked to total incidence by the queuing formula $P = f(I \times D)$ (Kleinbaum et al., 1982; Kramer, 1957), that is, point prevalence is a function of the total number of cases occurring and their duration. In the search for risk factors that have etiologic significance for the disorder, comparisons based on point prevalence rates suffer the disadvantage that differences between groups as to the chronicity of the disorder (that is, the duration of episodes, the probability that episodes will recur, or the mortality associated with episodes) affect the comparisons (Kramer, 1957). For example, it appears that Blacks may have episodes of depression of longer duration than Whites (Eaton & Kessler, 1981). If so, the point prevalence of depression would be biased toward a higher rate for Blacks, based solely on their greater chronicity.

3.05.2.3 Incidence and Onset

Dating the onset of episodes is problematic for most mental disorders for many reasons. One is that the diagnostic criteria for the disorders themselves are not well agreed upon, and continual changes are being made in the definition of a case of disorder, such as the recent fourth revision of the Diagnostic and statistical manual of mental disorders (DSM-IV; American Psychiatric Association, 1994). The DSM-IV has the advantage that criteria for mental disorders are more or less explicitly defined, but it is nevertheless true that specific mental disorders are often very difficult to distinguish from nonmorbid psychological states. Most disorders include symptoms that, taken by themselves, are part of everyone's normal experience: for example, feeling fearful, being short of breath or dizzy, and having sweaty palms are not uncommon experiences, but they are also symptoms of panic disorder. It is the clustering of symptoms, often with the requirement that they be brought together in one period of time or "spell," that generally forms the requirement for diagnosis. Although the clustering criteria are fairly explicit in the

DSM, it is not well established that they correspond to the characteristics generally associated with a disease, such as a predictable course, a response to treatment, an association with a biological aberration in the individual, or an associated disability. Thus, the lack of established validity of the criteria-based classification system exacerbates problems of dating the onset of disorder.

The absence of firm data on the validity of the classification system requires care about conceptualizing the process of disease onset. One criterion of onset used in the epidemiology of some diseases is entry into treatment, but this is unacceptable in psychiatry since people with mental disorders so often do not seek treatment for them. Another criterion of onset sometimes used is detectability, that is, when the symptoms first appear, but this is also unacceptable because experiences analogous to the symptoms of most psychiatric disorders are so widespread. It is preferable to conceptualize onset as a continuous line of development towards manifestation of a disease. There is a threshold at which the development becomes irreversible, so that at some minimal level of symptomatology it is certain that the full characteristics of the disease, however defined, will become apparent. This use of irreversibility is consistent with some epidemiological uses (Kleinbaum et al., 1982). Prior to this point, the symptoms are thought of as “subcriterial.” At the state of knowledge in psychiatry in the 1990s, longitudinal studies in the general population, such as the ECA program and others mentioned above, are needed to determine those levels of symptomatology at which irreversibility is achieved.

There are at least two ways of thinking about development towards disease. The first way is the increase in severity or intensity of symptoms. An individual could have all the symptoms required for diagnosis but none of them be sufficiently intense or impairing. The underlying logic of such an assumption may well be the relatively high frequency of occurrence of the symptoms in milder form, making it difficult to distinguish normal and subcriterial complaints from manifestations of disease. For many chronic diseases, it may be inappropriate to regard the symptom as ever having been “absent,” for example, personality traits giving rise to deviant behavior, categorized as personality disorders on Axis II of the DSM (APA, 1994). This type of development can be referred to as “symptom intensification,” indicating that the symptoms are already present and have become more intense during a period of observation. This concept leads the researcher to consider whether there is a crucial level of severity of a given symptom or symptoms in which the rate of development towards a full-blown disease state is accelerated, or becomes irreversible.

A second way of thinking about progress towards disease is the occurrence of new symptoms which did not exist before. This involves the gradual acquisition of symptoms so that clusters are formed which increasingly approach the constellation required to meet specified definitions for diagnosis. A cluster of symptoms which occur more often together than would be expected by their individual prevalence in the population, that is, more often than expected by chance, is a syndrome. “Present” can be defined as occurrence either at a nonsevere or at a severe level; thus, decisions made about the symptom intensification process complicate the idea of acquisition. This idea leads the researcher to consider the order in which symptoms occur over the natural history of disease, and in particular, whether one symptom is more important than others in accelerating the process.

Figure 1 is an adaptation of a diagram used by Lilienfeld and Lilienfeld (1980, Figure 6.3) to visualize the concept of incidence as a time-oriented rate which expresses the force of morbidity in the population. In their original figure (1(a)), the horizontal axis represents time and the presence of a line indicates disease. The adaptations (1(b) and 1(c)) give examples of the several distinct forms that onset can take when the disorder is defined by constellations of symptoms varying in intensity and co-occurrence, as is the case with mental disorders. In Figure 1(b), the topmost subject (1) is what might be considered the null hypothesis, and it corresponds to simple onset as portrayed in the original. Figure 1(b) shows how intensity, represented by the vertical width of the bars, might vary. The threshold of disease is set at four units of width, and in the null hypothesis subject 1 progresses from zero intensity to four units, becoming a case during the observation period. Subject 2 changes from nearly meeting the criteria (width of three units) to meeting it (four units) during the year. Both subjects 1 and 2 are new cases, even though the onset was more sudden in subject 1 than in subject 2, that is, the force of morbidity is stronger in subject 1 than subject 2. Subjects 3 and 4 are not new cases, even though their symptoms intensify during the year as much or more than those of subject 2.

Figure 1(c) adapts the same original diagram to conceptualize acquisition of symptoms and the development of syndromes. At one point in time there is no correlation between symptoms, but the correlation gradually develops, until there is a clear separation of the population into one group, with no association of symptoms,

[Figure 1(a): Cases No. 1 to No. 6 drawn as horizontal lines (Lilienfeld and Stolley, Figure 6.2). Length of line corresponds to duration of episode.]

[Figure 1(b): Subjects 1 and 2 are two new cases, like No. 3 above, with different degrees of intensification. Subjects 3 and 4 are two subjects not defined as new cases, with different degrees of intensification; subject 3 is never a case, and subject 4 corresponds to Case No. 1 above. Width corresponds to intensity of symptoms. The horizontal axis runs from baseline to follow-up.]

Figure 1(a) and (b) Incidence and intensification.
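The classification of subjects in Figure 1(b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intensities for subjects 1 and 2 follow the text (zero to four units, and three to four units), while the values for subjects 3 and 4 are hypothetical, chosen so that subject 3 never reaches the threshold and subject 4 is already above it at baseline. Treating "below threshold at baseline" as equivalent to "no prior history" is also an assumption made for the sketch.

```python
from dataclasses import dataclass

THRESHOLD = 4  # disease threshold in units of symptom intensity, as in Figure 1(b)


@dataclass(frozen=True)
class Subject:
    label: int
    baseline: int  # symptom intensity at the baseline interview
    followup: int  # symptom intensity at the follow-up interview


def is_case(intensity, threshold=THRESHOLD):
    """A subject meets the case definition at or above the threshold."""
    return intensity >= threshold


def is_new_case(subject, threshold=THRESHOLD):
    """New case: below threshold at baseline, at or above it at follow-up."""
    return (not is_case(subject.baseline, threshold)
            and is_case(subject.followup, threshold))


def cumulative_first_incidence(subjects, threshold=THRESHOLD):
    """Return (new cases, persons at risk); only non-cases at baseline are at risk."""
    at_risk = [s for s in subjects if not is_case(s.baseline, threshold)]
    new_cases = [s for s in at_risk if is_case(s.followup, threshold)]
    return len(new_cases), len(at_risk)


subjects = [
    Subject(1, baseline=0, followup=4),  # from the text: sudden onset
    Subject(2, baseline=3, followup=4),  # from the text: gradual onset
    Subject(3, baseline=1, followup=3),  # hypothetical: intensifies, never a case
    Subject(4, baseline=4, followup=6),  # hypothetical: already a case at baseline
]

new_labels = [s.label for s in subjects if is_new_case(s)]
numerator, denominator = cumulative_first_incidence(subjects)
print("new cases:", new_labels)
print("cumulative first incidence:", numerator, "/", denominator)
```

Under these assumptions subjects 1 and 2 count as new cases, and subject 4 is excluded from the denominator because first incidence is defined only for persons who are not cases at the start of the period. The sketch records nothing about how fast each subject intensified, which is exactly the information the text argues a case-based incidence rate discards.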

[Figure 1(c) panel, titled "Incidence and Development of Syndromes": scatter of individuals plotted by Symptom 1 against Symptom 2, with a separate cluster marked as new cases.]

Figure 1(c) Incidence and the development of syndromes. (Horizontal axis labels: Baseline, Prodrome, Followup.)

and another group where the two symptoms co-occur. This example could be expanded to syndromes involving more than two symptoms.

Acquisition and intensification are indicators of the force of morbidity in the population, as are more traditional forms of incidence rate. But they are not tied to any one definition of caseness. Rather, these concepts allow study of disease progression independently of case definition. Risk factors at different stages of the disease may be differentially related to disease progression only above or below the threshold set by the diagnosis. In this situation, we might reconsider the diagnostic threshold.

3.05.3 MORBIDITY SURVEYS

Morbidity surveys are cross-sectional surveys of the general population. They are used to estimate prevalence of disease in the population as well as to estimate need for care. Morbidity surveys address Morris's (1975, discussed above) second “use” of epidemiology, “diagnosing the health of the community, prioritizing health problems and identifying groups in need;” as well as the third use, “studying the working of health services;” and the fifth use, “identifying syndromes.” Morbidity surveys are sometimes called the descriptive aspect of epidemiology, but they can also be used to generate hypotheses about associations in the population, and to generate control samples for cohort and case–control studies.

3.05.3.1 Exemplar Study: The Epidemiologic Catchment Area (ECA) Program

An example of a morbidity survey is the Epidemiologic Catchment Area (ECA) Program, sponsored by the United States National Institute of Mental Health from 1978 through 1985. The broad aims of the ECA Program were “to estimate the incidence and prevalence of mental disorders, to search for etiological clues, and to aid in the planning of health care facilities” (Eaton, Regier, Locke, & Taube, 1981). The Program involved sample surveys of populations living in the catchment areas of already designated Community Mental Health Centers. The broad goals of the ECA Program are described in Eaton et al. (1981), the methods are described in Eaton and Kessler (1985) and Eaton et al. (1984), the cross-sectional results are described in Robins and Regier (1991), and the incidence estimates are described in Eaton, Kramer, Anthony, Chee, and Shapiro (1989).

A principal advantage of the morbidity survey is that it includes several or many disorders, which helps in assessing their relative importance from the public health point of view. Another advantage is that estimates of prevalence and association are measured without regard to the treatment status of the sample (Shapiro et al., 1984). Figure 2 displays results from the ECA study, showing the relatively high prevalence of phobia and relatively low prevalence of schizophrenia. The figure also shows the proportion meeting criteria for a given disorder within the last six months, who had seen either a physician or mental health specialist during that period. These proportions are well under 50%, with the exception of schizophrenia and panic disorder. For depression, which is disabling and highly treatable, less than a third of those with the disorder have received treatment.

3.05.3.2 Defining and Sampling the Population

Defining the target population for the morbidity survey is the first step. The best way to define the target population is not always clear, and different definitions have implications for the ultimate value of the results, as well as the feasibility of the study. A good definition of a target population is an entire nation, such as in the National Comorbidity Survey, or, better yet, a birth cohort of an entire nation, such as the British Perinatal Study, which included all births in Britain during a single week in March of 1958 (Butler & Alberman, 1969). Other studies usually involve compromises of one form or another. The goal of the sampling procedure is that each respondent is selected into the sample with a known, and nonzero,

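[Figure 2 bar chart: prevalence in percent on the vertical axis (ticks from 0 to 10), each bar divided into treated and untreated portions. Bar values as printed, in descending order: 10.8, 4.2, 3.4, 2.9, 2.7, 1.8, 1.7, 0.9, 0.8, 0.8, 0.7, 0.3, 0.1. Disorders shown (listed in extracted order, not matched to the values): panic, mania, phobia, drug A/D, dysthymia, alcohol A/D, depression, somatization, schizophrenia, schizophreniform, cognitive impairment, obsessive-compulsive, anti-social personality. Unweighted data from four sites of ECA program.]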

Figure 2 Prevalence of disorder in percent in six months prior to interview.

probability. Strict probabilistic sampling characterizes high-quality epidemiologic surveys, and is a requirement for generalization to the target population.

Most surveys are of the household residing, noninstitutionalized population, where the survey technology for sampling and interviewing individuals is strongest (e.g., Sudman, 1976). The ECA design defined the target population as “normal” residents of previously established catchment areas. Sampling was conducted in two strata, as shown in Figure 3. The household residing population was sampled via area probability methods or household listings provided by utility companies (e.g., as in Sudman, 1976). This stratum included short-stay group quarters such as jails, hospitals, and dormitories. After making a list of the residents in each household, the interviewer asked the person answering the door whether there were any other individuals who “normally” resided there but were temporarily absent. “Normally” was defined as the presence of a bed reserved for the individual at the time of the interview. Temporarily absent residents were added to the household roster before the single respondent was chosen randomly. If an individual was selected who was temporarily absent, the interviewer made an appointment for a time after their return, or conducted the interview at their temporary group quarters residence (i.e., in hospital, jail, dormitory, or other place). The ECA sampled the institutional populations separately, by listing all the institutions in the catchment area, as well as all those nearby institutions that admitted residents of the catchment area. Then the inhabitants of each institution were rostered and selected probabilistically. Sampling the institutional population required many more resources, per sampled individual, than the household sample, because each institution had to be approached individually. Inclusion of temporary and long-stay group quarters is important for health services research because many of the group quarters are involved in provision of health services, and because residents of group quarters may be high utilizers. The ultimate result of these procedures in the ECA was that each normal resident of the catchment area was sampled with a known probability.

It is not enough to crisply define a geographic area, because different areas involve different limitations on the generalizability of results. A nationally representative sample, such as the National Comorbidity Survey (NCS; Kessler et al., 1994), may seem to be the best. But how are the results of the national sample to be applied to a given local area, where decisions about services are made? In the ECA the decision was made to select five separate sites of research in order to provide the possibility of replication of results across sites, and to better understand the effects of local variation (Eaton et al., 1981). The ECA target population thus consisted, not of the nation, but rather of an awkward aggregation of catchment areas. Nevertheless, the ECA data were considered as benchmarks for a generation of research (Eaton, 1994) because there was sufficient variation in important sociodemographic variables to allow generalization to other large populations, that is, sufficiently large subgroups of young and old, men and women, married and unmarried, rich and poor, and so forth. Generalization to other target populations, such as Asian-Americans or Native Americans, or to rural areas, was not logical from the ECA. But note that generalization from a national random sample to all rural areas, or to small ethnic groups, would likewise not always be possible. The point is that the target population should be chosen with a view toward later generalization.

3.05.3.3 Sample Size

General population surveys are not efficient designs for rare disorders or unusual patterns of service use. Even for research on outcomes that are not rare, sample sizes are often larger than one thousand. A common statistic to be estimated is the prevalence, which is a proportion. For a proportion, the precision is affected by the square root of the sample size (Blalock, 1979). If the distribution of the proportion is favorable, say, 30–70%, then a sample size of 1000 produces precision which may be good enough for common sample surveys such as for voter preference. For example, a proportion of 0.50 has a 95% confidence interval from 0.47 to 0.53 with a sample of 1000. For rarer disorders, the confidence interval grows relative to the size of the proportion (i.e., 0.082–0.118 for a proportion of 0.10; 0.004–0.016 for a proportion of 0.01). Often, there is interest in patterns broken down by subpopulations, thus challenging the utility of samples with as few as 1000 individuals. Many community surveys, such as the ECA, have baseline samples in excess of 3000. It is important to estimate the precision of the sample for parameters of interest, and the power of the sample to test hypotheses of interest, before beginning the data collection.

3.05.3.4 Standardization of Data Collection

Assessment in epidemiology should ideally be undertaken with standardized measurement instruments which have known and acceptable

[Figure 3 diagram: two sampling strata. Households + Temporary Group Quarters (Jails, Hospitals, Dormitories), n = 3000. Institutional Group Quarters (Nursing Homes, Prisons, Mental Hospitals), n = 500.]
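The precision figures quoted in Section 3.05.3.3 can be checked with a short Python sketch. It assumes the usual normal-approximation (Wald) interval for a proportion from a simple random sample; the chapter does not say which interval was used, and the function name is illustrative. The sample sizes 3000 and 500 are the stratum sizes shown in Figure 3.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def proportion_ci(p, n, z=Z_95):
    """Normal-approximation confidence interval for a proportion p with sample size n.

    Assumes simple random sampling; design effects from clustering or
    stratification, which real household surveys have, are ignored here.
    """
    standard_error = math.sqrt(p * (1.0 - p) / n)
    return p - z * standard_error, p + z * standard_error


for n in (500, 1000, 3000):
    for p in (0.50, 0.10, 0.01):
        low, high = proportion_ci(p, n)
        relative_half_width = (high - low) / 2.0 / p
        print(f"n={n:5d} p={p:.2f}  95% CI {low:.3f} to {high:.3f}  "
              f"half-width / p = {relative_half_width:.2f}")
```

With n = 1000 the intervals come out close to those in the text (about 0.47 to 0.53, 0.08 to 0.12, and 0.004 to 0.016). The last column shows the point made there: the interval widens relative to the proportion as the disorder becomes rarer. For very small proportions the normal approximation is itself rough, so an exact or Wilson interval may be preferable in practice.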

Figure 3 Sampling strata for residents: ECA study design.

reliability and validity. In community surveys, and automated record systems, reliable and valid measurement must take place efficiently and in “field conditions.” The amount of training for interviewers in household studies depends on the nature of the study. Interviewers in the Longitudinal Study on Aging (LSOA; Kovar, 1992), a well-known cohort study in the United States, received about 1.5 days of training (Fitti & Kovar, 1987), while ECA interviewers received slightly more than a week (Munson et al., 1985). Telephone interviews, such as in the LSOA, can be monitored systematically by recording or listening in on a random basis (as long as the subject is made aware of this possibility), but it is difficult to monitor household interviews, since it cannot be predicted exactly when and where the interview will take place.

The ECA Program involved a somewhat innovative interview called the Diagnostic Interview Schedule, or DIS (Robins, Helzer, Croughan, & Ratcliff, 1981). The DIS was designed to resemble a typical psychiatric interview, in which questions asked are highly dependent on answers already given. For example, if an individual responds positively to a question about the occurrence of panic attacks, a series of questions about that particular response are asked, but if the response to the question on panic is negative, the interviewer skips to the next section. In effect, the interview adapts to the responses of the subject, so that more questions are asked where more information is needed. The high degree of structure in the DIS required more than one week of training, as well as attention to the visual properties of the interview booklet itself. The result was that the interviewer could follow instructions regarding the flow of the interview, and the recording of data, smoothly, so as not to offend or alienate the respondent. Household survey questionnaires are becoming increasingly adaptive and response dependent, because more information can be provided in a shorter amount of time with adaptive interviews. Inexpensive laptop computers will facilitate such adaptive interviewing. Self-administered, computerized admission procedures are becoming more widely disseminated in the health care system, expanding the database, and facilitating the retrieval and linkage of records.

The reliability and validity of measurement are usually assessed prior to beginning a field study. Usually, the assessment involves pilot tests on samples of convenience to determine the optimal order of the questions, the time required for each section, and whether any questions are unclear or offensive. Debriefing subjects in these pilot tests is often helpful. The next step is a test of reliability and validity. Many pretests select samples from populations according to their health or services use characteristics in order to generate enough variation on responses to adequately estimate reliability and validity. In order to economize, such pretests are often conducted in clinical settings. But such pretests do not predict reliability and validity under the “field” conditions of household interviews. The ECA Program design involved reliability and validity assessment in a hospital setting (Robins et al., 1981), and later under field conditions (Anthony et al., 1985). The reliability and validity of the DIS were lower in the household setting.

3.05.4 EXPOSURE AND DISEASE

Two basic research designs in epidemiology are the cohort and the case–control design. These designs are used mostly in addressing Morris' seventh “use” of epidemiology, the search for causes. This is sometimes called the analytic aspect of epidemiology. These designs differ in their temporal orientation to collection of data (prospective vs. retrospective), and in the criteria by which they sample (by exposure or by caseness), as shown in Table 2.

3.05.4.1 Cohort Design

In a cohort design, incidence of disorder is compared in two or more groups which differ in some exposure thought to be related to the disorder (Table 2). The cohort design differs from the morbidity survey in that it is prospective, involving longitudinal follow-up of an identified cohort of individuals. As well as the search for causes, the cohort design addresses Morris' fourth “use” of epidemiology, “estimating individual risks;” and the sixth “use,” that is, “completing the clinical picture, especially the natural history.” The entire cohort can be sampled, but when a specific exposure is hypothesized, the design can be made more efficient by sampling for intensive measurement on the basis of the putative exposure, for example, children living in an area of toxic exposures, or with parents who have been convicted of abuse, or who have had problems during delivery, and a control group from the general population. Follow-up allows comparison of incidence rates in both groups.

3.05.4.2 Exemplar Study: the British Perinatal Study

An example is the British Perinatal Study, a cohort study of 98% of all births in Great Britain during a single week in March, 1958 (Butler & Alberman, 1969; Done, Johnstone, & Frith, 1991; Sacker, Done, Crow, & Golding, 1995). The cohort had assessments at the ages of 7 and 11 years, and, later, mental hospital records were linked for those entering psychiatric hospitals between 1974 and 1986, by which time the cohort was 30 years old. Diagnoses were made from hospital case notes using a standard system. There was some incompleteness in the data, but not too large: for example, 12 of the 49 individuals diagnosed as “narrow” schizophrenia were excluded because they were immigrants, multiple births, or because they lacked data. It is difficult to say how many individuals in the cohort had episodes of disorder that did not result in hospitalization, but that does not necessarily threaten the results, if the attrition occurred equally for

Table 2 Case–control and cohort studies.

                                 Case–control study
                                 Cases      Controls     Total
Cohort study    Exposed          a          b            a + b
                Not exposed      c          d            c + d

$$\text{Relative risk} = \frac{a / (a + b)}{c / (c + d)}$$

$$\text{Relative odds} = \frac{a / b}{c / d} = \frac{ad}{bc}$$

different categories of exposure. Table 3 shows results for one of 37 different variables related to birth, that is, “exposures,” that were available in the midwives' report: birth weight under 2500 g. With n given at the head of the columns in Table 3, the reader can fill in the four cells of a two-by-two table, as in Table 2, for each of the four disorders. For example, the cells, labeled as in Table 2, for narrow schizophrenia, are: a = 5; b = 706; c = 30; d = 16 106. The number with the disorder is divided by the number of births to generate the risk of developing the disorder by the time of follow-up: the cumulative incidence. The risks can be compared across those with and without the exposure. In Table 2, the incidence of those with exposure is a/(a+b), and the incidence of those without exposure is c/(c+d). The relative risk is a comparison of the two risks: [a/(a+b)]/[c/(c+d)]. For narrow schizophrenia the relative risk is RR = [5/711]/[30/16 136], or 3.78.

The relative risk is closely related to causality since it quantifies the association in the context of a prospective study, so the temporal ordering is clear. The relative risk is approximated closely by the relative odds or odds ratio, which does not include the cases in the denominator (i.e., OR = [5/706]/[30/16 106] = 3.80, not 3.78). The odds ratio has many statistical advantages for epidemiology. It is easy to calculate in the two-by-two table by the cross-products ratio (i.e., ad/bc). The odds ratio quantifies the association without being affected by the prevalence of the disorder or the exposure, which is important for the logic of cohort and case–control studies, where these prevalences may change from study to study, depending on the research design. This lack of dependence on the marginal distribution is not characteristic of many measures of association typically used in psychology, such as the correlation coefficient, the difference in proportions, or the kappa coefficient (Bishop et al., 1975). The odds ratio is a standard result of logistic regression, and can be adjusted by other factors such as gender, age, and so forth. The odds ratio for low birth weight and narrow schizophrenia in the British Perinatal Study, as it happens, was 3.9, after adjustment for social class and other demographic variables.

The logic of the cohort study includes assessing the natural history of the disorder, that is, the study of onset and chronicity, in a population context without specific intervention by the researcher. Study of natural history requires repeated follow-up observations. In the British Perinatal Study, there were assessments of the cohort at the ages of 7 and 11. Once the case groups had been defined, it was possible to study their reading and mathematics performance well before hospitalization. Those destined to become schizophrenic had lower reading and mathematics scores at both 7 and 11 years of age (Crow, Done, & Sacker, 1996). Males who would eventually be diagnosed schizophrenic had problems relating to conduct during childhood, while females were anxious, as compared to children who did not end up being hospitalized with a diagnosis of schizophrenia later. Later follow-ups in this study may economize by studying only those who have had onset, and a small random subsample of others, to estimate such factors as the length of episodes, the probability of recurrences, prognostic predictors, and long-term functioning.

3.05.4.3 Case–Control Design

In the case–control study, it is the disease or disorder which drives the logic of the design, and many factors can be studied efficiently as possible causes of the single disorder (Schlesselman, 1982). The case–control study may be the most important contribution of epidemiology to the advancement of public health, because it is so efficient in searching for causes when there is little knowledge. For example, adenocarcinoma of the vagina occurs so rarely that, prior to 1966, not a single case under the age of 50 years

Table 3 Mental disorder and low birth weight: British Perinatal Study.

                                  Cumulative incidence per 1000
Disorder                          Low birth weight    Normal birth weight    Odds ratio
                                  (n = 727)           (n = 16 812)

Narrow schizophrenia (n = 35)     7.03                1.86                   3.8
Broad schizophrenia (n = 57)      9.82                3.09                   3.2
Affective psychosis (n = 32)      8.43                1.61                   5.3
Neurosis (n = 76)                 4.23                4.51                   0.9
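The narrow schizophrenia row of Table 3 can be reproduced from the two-by-two cell counts quoted in the text (a = 5, b = 706, c = 30, d = 16 106), using the definitions given with Table 2. The Python sketch below is illustrative only; the function names are not from the chapter. The same odds-ratio function is also applied to the life-event counts that the chapter reports for the Brown and Harris study in Section 3.05.4.4 ([70/44]/[76/306]).

```python
def cumulative_incidence_per_1000(cases, non_cases):
    """Risk of disorder in one exposure group, scaled per 1000 persons."""
    return 1000.0 * cases / (cases + non_cases)


def relative_risk(a, b, c, d):
    """[a/(a+b)] / [c/(c+d)], with cells labeled as in Table 2."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-products ratio ad/bc, with cells labeled as in Table 2."""
    return (a * d) / (b * c)


# Narrow schizophrenia and low birth weight (cells as quoted in the text).
a, b, c, d = 5, 706, 30, 16106
risk_low_weight = cumulative_incidence_per_1000(a, b)
risk_normal_weight = cumulative_incidence_per_1000(c, d)
rr = relative_risk(a, b, c, d)
or_birth_weight = odds_ratio(a, b, c, d)

# Severe life event in patient cases vs. healthy controls (Brown and Harris counts).
or_life_event = odds_ratio(70, 44, 76, 306)

print(f"incidence per 1000: {risk_low_weight:.2f} vs {risk_normal_weight:.2f}")
print(f"relative risk: {rr:.2f}")
print(f"odds ratio (birth weight): {or_birth_weight:.2f}")
print(f"odds ratio (life event): {or_life_event:.1f}")
```

Because the disorder is rare in both exposure groups, b is close to a + b and d is close to c + d, which is why the odds ratio (about 3.80) lands so near the relative risk (about 3.78). In the life-event comparison roughly a fifth of controls are exposed, so the odds ratio there should not be read as a relative risk.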

Source: Adapted from Sacker et al. (1995).

had been recorded at the Vincent Memorial Hospital in Boston. A time-space clustering of eight cases, all among young women born within the period 1946-1951, was studied (Herbst, Utfeder, & Poskanzer, 1971). There was almost no knowledge of etiology. A case-control study was conducted, matching four controls to each case. The study reported a highly significant (p < 0.00001) association between treatment of the mothers with the estrogen diethylstilbestrol during pregnancy and the subsequent development of adenocarcinoma of the vagina in the daughters. The results led to recommendations to avoid administering stilbestrol during pregnancy.

Logistic regression, developed by epidemiologists and used in case-control studies, has distinct advantages over analysis of variance, ordinary least-squares regression, discriminant function analysis, and probit regression, developed by agronomists, psychologists, economists, and educational psychologists, especially when the dichotomous dependent variable has a very skewed distribution, that is, when the outcome is very rare.

3.05.4.4 Exemplar Study: Depression in Women in London

The most convincing demonstration of social factors in the etiology of any mental disorder is The social origins of depression, by Brown and Harris (1978). That study used the case-control method to demonstrate the importance of life-event stresses and chronic difficulties as causal agents provoking the onset of depression. The method involved in-depth diagnostic and etiologic interviews with a sample of 114 patients and a sample of 458 household residents. The target population was that residing in the Camberwell area in south London.

The analysis presented by Brown and Harris is logical but does not always follow the standard epidemiologic style. Table 4 presents the crucial findings in the typical epidemiologic manner (following Table 2). Two case groups are defined: the 114 women presenting at clinics and hospitals serving the Camberwell area, diagnosed by psychiatrists using a standardized clinical assessment tool called the Present State Examination, and 76 women in the community survey (17%) who were depressed at the time of the survey, as judged by a highly trained, nonmedical interviewer using a shortened version of the same instrument. Sixty-one percent of the patient cases (70/114) and 68% of the survey cases (52/76) experienced the "provoking agent" of a severe life event during the year prior to the onset of depression. Twenty percent (76/382) of the healthy controls experienced such an event in the year prior to the interview. Patient cases had 6.4 times the odds of the controls of having experienced a severe life event (i.e., [70/44]/[76/306]).

The case-control study is very efficient, especially when the disease is rare, because it approximates the statistical power of a huge cohort study with a relatively limited number of controls. For a disease like schizophrenia, which occurs in less than 1% of the population, 100 cases from hospitals or a psychiatric case register can be matched to 300 controls from the catchment area of the hospital. A cohort study yielding this number of cases would involve measurements on 10 000 persons instead of 400! Since the disease is rare, it may be unnecessary to conduct diagnostic examinations on the group of controls, which would have only a small number of cases. Furthermore, a few cases distributed among the controls would weaken the comparison of cases to controls, generating a conservative bias. The statistical power of the case-control study is very close to that of the analogous cohort study, and, as shown above, the odds ratio is a close estimate of the relative risk. The case-control study can be used to test hypotheses about exposures, but it also has the very great ability to make comparisons across a wide range of possible risk factors, and can be useful even when there are very few hypotheses available.

3.05.5 USING AVAILABLE RECORDS

3.05.5.1 Government Surveys

There are many sources of data for psychiatric epidemiologic research which are available to the public. These include statistics from

Table 4 Depression and life events and difficulties in London.

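| | Cases (patients) | % | Cases (survey) | % | Controls (survey) | % |
|---|---|---|---|---|---|---|
| One or more severe events | 70 | 61% | 52 | 68% | 76 | 20% |
| No severe events | 44 | 39% | 24 | 32% | 306 | 80% |
| Total | 114 | 100% | 76 | 100% | 382 | 100% |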
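The odds ratio quoted in the text for patient cases can be checked directly from the counts in Table 4. The following is a minimal sketch in Python; the function name `odds_ratio` and its argument names are illustrative choices, not part of the original study, and the same calculation is applied to the survey cases only as an illustration.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    case_odds = exposed_cases / unexposed_cases
    control_odds = exposed_controls / unexposed_controls
    return case_odds / control_odds


# Counts from Table 4; "exposure" is one or more severe events.
patients_or = odds_ratio(70, 44, 76, 306)  # patient cases vs. survey controls
survey_or = odds_ratio(52, 24, 76, 306)    # survey cases vs. survey controls

print(f"Patient cases: {patients_or:.1f}")  # 6.4
print(f"Survey cases:  {survey_or:.1f}")    # 8.7
```

Both case groups are compared against the same 382 survey controls, so the two ratios differ only through the case odds.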

Source: Brown & Harris (1978).

treatment facilities and from large sample surveys of populations conducted by large organizations such as national governments. Often the measures of interest to psychologists are only a small part of the survey, but the availability of a range of measures, drawn from other disciplines, can be a strong advantage. In the United States an important source of data is the National Center for Health Statistics (NCHS). For example, the Health and Nutrition Examination Survey (HANES) is a national sample survey conducted by the NCHS involving physical examinations of a sample of the United States population. Its first phase included the Center for Epidemiologic Studies Depression Scale as part of its battery of measures, which included anthropometric measures, nutritional assessments, blood chemistry, medical history, and medical examinations (Eaton & Kessler, 1981; Eaton & McLeod, 1984).

Later phases of the HANES included portions of the DIS. The Health Interview Survey conducts health interviews of the general population of the United States, and includes reports by the respondent of named psychiatric disorders such as schizophrenia, depressive disorder, and so forth. The National Medical Care Utilization and Expenditure Survey (MCUES) is conducted by the NCHS to help understand the health service system and its financing. The MCUES samples records of practitioners from across the nation, and there are several questions on treatment for psychological problems.

Some governments also sponsor national surveys which focus on psychological disorders. The ECA is one such example, although it was not, strictly speaking, a sample of the nation. The later National Comorbidity Survey (NCS) includes 8058 respondents from a probability sample of the US, with its major measurement instrument being a version of the Composite International Diagnostic Interview (CIDI), a descendant of the DIS used in the ECA surveys (Kessler et al., 1994). The British government conducted a large survey of psychological disorders in a national sample, using an instrument similar to the DIS and CIDI (Meltzer, Gill, Pettirew, & Hindi, 1995a, 1995b). Anonymous, individual-level data from the ECA, the British survey, and the NCS are available at nominal cost.

3.05.5.2 Vital Statistics

Governments generally assimilate data on births, deaths, and marriages from states, provinces, or localities, ensuring some minimal degree of uniformity of reporting and creating data files for public use. The mortality files usually list the cause of death, including suicide. These data files are often available from the government at nominal cost.

3.05.5.3 Record-based Statistics and Record Linkage

Statistics originating from treatment facilities can also be put to good use in psychiatric epidemiology. Many early epidemiologic studies used hospital statistics as the numerators in estimating rates, and census statistics in the denominators (Kramer, 1969). The utility of rates estimated in this manner is dependent on the degree to which the clinical disorder is associated with treatment, a stronger association for severe schizophrenia than for mild phobia, presumably. The value of these rates also depends on the relative scarcity or abundance of treatment facilities. In the United States, Medicaid and Medicare files, which include such data as diagnosis and treatment, are available for research use. Many Health Maintenance Organizations maintain data files which could be used in psychological epidemiological research.

Statistics from treatment facilities are enhanced considerably by linkage with other facilities in the same geographic area, serving the same population (Mortensen, 1995). Although the number has declined, there still exist several registers which attempt to record and link together all psychiatric treatment episodes for a given population (ten Horn, Giel, Gulbinar, & Henderson, 1986). Linkage across facilities allows multiple treatment episodes for the same individual (even the same episode of illness) to be combined into one record (so-called "unduplicating"). Linkage across time allows longitudinal study of the course of treatment. Linkage with general health facilities, and with birth and mortality records, provides special opportunities. The best known example of a comprehensive health registry is the Oxford Record Linkage Study (ORLS; Acheson, Baldwin, & Graham, 1987), which links together all hospital treatment episodes in its catchment area. The database of the ORLS consists only of births, deaths, and hospital admissions, which is a strong limitation. However, due to the catchmenting aspect of the British National Health Service, the data are not limited to the household-residing population, as is the National Comorbidity Survey, for example, or the LSOA. In automated data collection systems such as the ORLS, the recordation is often done under the auspices of the medical record systems, with data recorded by the physician, such as the diagnosis. It cannot be presumed that the physician's diagnosis is "standardized" and therefore more reliable than other interview or record data. In fact, there is significant variation in diagnosis among physicians. Research using measurements and diagnoses recorded by the physician, as in record linkage systems such as the ORLS, should ideally include studies of reliability and validity (e.g., Loffler et al., 1994).

3.05.5.4 Exemplar Study: the Danish Adoption Study

An example of the benefits of record linkage in psychological research is the adoption study of schizophrenia in Denmark (Kety, Rosenthal, Wender, Schulsinger, & Jacobsen, 1975). Familial and twin studies of schizophrenia suggested strongly that schizophrenia was inherited, but these studies were open to the interpretation that the inheritance was cultural, not genetic, because family members are raised together, and identical twins may be raised in social environments which are more similar than the environments of fraternal twins. The Danish Adoption Study ingeniously combined the strategy of file linkage with interviews of cases, controls, and their relatives. In Denmark, each individual receives a number at birth which is included in most registration systems. The registration systems are potentially linkable, after appropriate safeguards and clearances. In the adoption study, three registers were used. First, all 5483 individuals in the county and city of Copenhagen who had been adopted by persons or families other than their biological relatives, from 1924 through 1947, were identified from a register of adoptions (Figure 4). These were linked to the psychiatric case register, wherein it was determined that 507 adoptees had ever been admitted to a psychiatric hospital. From case notes in the hospitals, 34 adoptees who met criteria for schizophrenia (about 0.5% of the total number of adoptees) were selected, and matched with 34 control adoptees on the basis of age, sex, socioeconomic status of the adopting family, and time with biologic family, or in institution, prior to adoption. The relatives of these 68 cases and controls were identified by linkage with yet another register in Denmark which permits locating families, parents, and children. After allowing for mortality, refusal, and a total of three who were impossible to trace, a psychiatric interview was conducted on 341 individuals (including 12 on whom the interview was not conducted but sufficient information for diagnosis was obtained). Eleven of the 118 biologic relatives of the index adoptees were schizophrenic (9%), vs. one in the 35 relatives of adoptive families of the index adoptees (3%), and three of the 140 biologic relatives of the control adoptees (2%). The findings for schizophrenia spectrum disorders (including uncertain schizophrenia and schizoid personality) also show a pattern consistent only with genetic inheritance (26/118 in the biologic relatives of index cases, or 22%, vs. 16/140 biologic relatives of control adoptees, or 14%).

3.05.6 PREVENTIVE TRIALS

3.05.6.1 The Logic of Prevention

Some of the earliest epidemiologists used interventions in the community to stop an epidemic or to gather information about the causes of diseases in the population. This is sometimes called the experimental aspect of epidemiology. The best known experiment is that conducted by Snow in the cholera epidemic in London. Snow's work exemplifies epidemiologic principles in several ways (Cooper & Morgan, 1973). It was widely believed that cholera was spread through the air, in a miasma, leading many to seek safety by retreating to rural areas. Snow showed with ecologic data in London that areas serviced by a water company taking water from upstream in the Thames had lower rates of cholera than areas serviced by a company taking water from downstream. This ecologic comparison suggested that cholera was borne by water. In the context of a single cholera epidemic, Snow identified individual cases of cholera, showing that they tended to cluster around a single water pump at Broad Street. He further showed that many exceptional cases, that is, cases of cholera residing distant from the pump, had actually drawn or drunk water from that pump (e.g., on their way home from work). His action to remove the handle of the pump, which roughly coincided with the termination of the epidemic, is regarded as an early instance of experimental epidemiology.

In epidemiology, as with medicine generally, intervention is valued if it is effective, regardless of whether the causation of the disease is understood or not. Table 5 shows examples of preventive measures which were implemented well prior to knowledge of the cause, discovered much later in many cases. This logic leads to experimentation even in the absence of causal information.

Various conceptual frameworks have been used to organize the area of prevention research. The Commission on Chronic Illness (1957) divided prevention into three types, dependent on the stage of the disease the intervention was designed to prevent or treat. Prior to onset of the disease, the prevention was primary, and its goal was to reduce incidence; after disease onset, the intervention was directed at promoting

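Research design

| Method | Sample |
|---|---|
| Adoption Register | 5483 adoptees |
| Psychiatric Register | 507 admitted; 4976 not admitted |
| Case notes | 34 index cases; 34 controls; 473 remaining |
| Folkeregister | 247 relatives of index cases; 265 relatives of controls |

| | Index: biologic relatives | Index: adoptive relatives | Control: biologic relatives | Control: adoptive relatives |
|---|---|---|---|---|
| Relatives identified (before mortality/refusal) | 173 | 74 | 174 | 91 |
| Psychiatric interview | 118 | 35 | 140 | 48 |

Results

| Diagnosis | Index: biologic relatives | Index: adoptive relatives | Control: biologic relatives | Control: adoptive relatives |
|---|---|---|---|---|
| Schizophrenic | 11 | 1 | 3 | 2 |
| Spectrum | 26 | 3 | 16 | 5 |
| Normal | 81 | 31 | 121 | 41 |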

Figure 4 Danish adoption study: research design and results. Source: Kety, Rosenthal, Wender, Schulsinger, and Jacobsen (1975).

recovery, and its goal was to reduce prevalence, so-called secondary prevention. Tertiary prevention was designed to prevent impairment and handicap which might result from the disease. The Institute of Medicine (Mrazek & Haggerty, 1994) integrated this framework into that of Gordon (1983), who directed attention at the population to which the preventive intervention was directed: universal preventions at the entire general population; targeted interventions at subgroups at high risk of development of disorder; and indicated interventions, directed at individuals who have already manifested signs and symptoms of the disease.

The results of integrating the two frameworks are shown in Figure 5. Curative medicine generally operates in the area of secondary prevention, with indicated interventions. Rehabilitative medicine is in the area of tertiary prevention. Primary, universal, and targeted interventions are the province of experimental epidemiology. Prominent examples of universal interventions in the area of epidemiology are various smoking cessation campaigns, and such studies as the Stanford Five-City Project, which was designed to reduce risk of heart disease by lowering levels of several associated risk factors (Farquhar et al., 1990).

Table 5 Knowledge of prevention and etiology.

| Disease | Prevention: discoverer | Year | Etiology: agent | Discoverer | Year |
|---|---|---|---|---|---|
| Scurvy | Lind | 1753 | Ascorbic acid | Szent-Györgyi | 1928 |
| Pellagra | Casal | 1755 | Niacin | Goldberger | 1924 |
| Scrotal cancer | Pott | 1775 | Benzopyrene | Cook | 1933 |
| Smallpox | Jenner | 1798 | Orthopoxvirus | Fenner | 1958 |
| Puerperal fever | Semmelweis | 1847 | Streptococcus | Pasteur | 1879 |
| Cholera | Snow | 1849 | Vibrio cholerae | Koch | 1893 |
| Bladder cancer | Rehn | 1895 | 2-Naphthylamine | Hueper | 1938 |
| Yellow fever | Reed | 1901 | Flavivirus | Stokes | 1928 |
| Oral cancer | Abbe | 1915 | N-Nitrosonornicotine | Hoffman | 1974 |

Source: Wynder (1994).

3.05.6.2 Attributable Risk

From among many possibilities, how should interventions be selected? Epidemiology provides a helpful tool in the form of the Population Attributable Risk, sometimes called the Attributable Fraction or the Etiologic Fraction (Lilienfeld & Stolley, 1994, p. 202). The attributable risk is the maximum estimate of the proportion of the incidence of disease that would be prevented if a given risk factor were eliminated. For a given disease, the attributable risk combines information from the relative risk for a given exposure with the prevalence of the exposure in the population. The formula for attributable risk is:

\[ \text{Attributable Risk} = \frac{P(RR - 1)}{P(RR - 1) + 1} \]

where:
P = Prevalence of Exposure, and
RR = Relative Risk of Exposure to Disease

The relative risk can be estimated from a cohort study, as described above, and the prevalence of the exposure can be estimated from a separate survey. A simple case-control study also provides the information for attributable risk, under certain conditions. The relative risk is approximated by the relative odds, as discussed above. If the controls are selected from the general population, the level of exposure can be estimated from them. For example, case-control studies of smoking and lung cancer in the early 1950s showed that the odds of dying from lung cancer were about 15 times higher for smokers than for nonsmokers. About half the population smoked, leading to the attributable risk estimate of about 80%. In the United States, this meant that about 350 000 of the 400 000 annual deaths due to lung cancer would not occur if smoking were eliminated totally. In the situation of many possible risk factors, the attributable risk is a tool which helps prioritize them.

Epidemiologic cohort studies can provide information which may help to stage the intervention at the most appropriate time. The intervention should occur before most onsets in the population have taken place, but not so far before onset that the effects of the intervention wash out before the process of disease development has begun. The appropriate stage for prevention in many situations would appear to be that of precursors, in which there are subgroups which can be identified at high risk, but before the disease prodrome has started (Eaton et al., 1995).

3.05.6.3 Developmental Epidemiology

Epidemiologists are gradually becoming aware of the need for a life course perspective (Kellam & Werthamer-Larson, 1986). In social and psychological research on human beings, cohort studies may suggest regularities in human development which can be considered etiologic clues. The clues consist of temporal sequences of various behaviors over several years, with only a probabilistic connection between each behavior. The behaviors may have a common antecedent, such as a genetic background, or they may depend on the sequence of environments that the individual experiences, and the interaction between environments, genetic traits, and habits formed by the history of prior behaviors. The epidemiologic notion of exposures, developed in the context of infectious agents and toxins, is too simple to be useful in these multicausal, developmental sequences. For example, aggressive behavior by boys in early elementary school years is followed by conduct problems in middle

[Figure 5 labels. Prevention: universal, selective, indicated. Treatment: case identification; standard treatment for known disorders. Maintenance: compliance with long-term treatment (goal: reduction in relapse and recurrence); after-care (including rehabilitation).]

Figure 5 The mental health intervention spectrum for mental disorders.

school; later behaviors such as smoking cigarettes, having sexual relations at an early age, and drinking alcohol, in high school; and, eventually, the diagnosis of antisocial personality disorder as an adult. Which of these are essential causes, and which are simply associated behaviors which have no influence on the causal chain to the outcome (in this example, the diagnosis of antisocial personality disorder)? Since the chain of causes is probabilistic and multicausal, the notion of attributable risk (discussed above) is too simplistic, and it is unlikely that any estimate of attributable risk would be even moderate in size. A preventive intervention trial with random or near-random assignment to conditions which manipulate a putative cause may be the only way to generate an understanding of the causal chain. The preventive trial serves the traditional purpose in epidemiology of testing the efficacy of the intervention, but it also serves as an examination of the developmental pathway leading to disorder, for building and testing theories.

3.05.6.4 Exemplar Study: the Johns Hopkins Prevention Trial

An example of a preventive intervention trial is the Johns Hopkins Prevention Trial (Kellam, Rebek, Ialongo, & Mayer, 1994). The trial was designed to examine developmental pathways generally occurring after two important characteristics which could be identified in first graders: success at learning (especially, learning to read) and aggressive behavior. Table 6 shows the design of the trials, using the notation of experimental and quasi-experimental design in psychology (Campbell & Stanley, 1971). The population for the trial was carefully defined in epidemiologic terms to include all first graders in public schools in an area of Baltimore. The Baltimore City Public School system was an active participant in the research. The 19 schools were divided into five levels, with three or four schools per level, based on the socioeconomic characteristics of the areas they served. At each of the five levels there was a control school and two experimental schools. For the study of aggressive behavior, 153 children were assigned at random to eight separate classrooms in which the teacher used a special classroom procedure called the "Good Behavior Game," which had been shown to reduce the level of aggressive behavior in classrooms. Nine classrooms with 163 children received an intervention to improve reading mastery, and there were 377 children who were in control classrooms. The control classrooms consisted of classrooms in the same schools as the experimental classrooms, but which did not receive the intervention; and classrooms in schools in the same area of Baltimore, where there were no interventions at all. One outcome of the Good Behavior Game Trial was that the median level of aggressive behavior went down during the period of the trial for those in the experimental group. For controls, the level of aggressive behavior was relatively stable. The most impressive result was that aggressive behavior continued to decline in the experimental group, even after discontinuation of the intervention in third grade, at least through the spring of sixth grade.

Table 6 Intervention and assessment for Johns Hopkins Preventive Trials in 19 public elementary schools.

| Condition | Number of classrooms | Number of students | Grade 1, 1985-1986 (F/S) | Grade 2, 1986-1987 (F/S) | Grade 3, 1987-1988 (F/S) | Grade 4, 1988-1989 (F/S) | Grade 5, 1989-1990 (F/S) | Grade 6, 1990-1991 (F/S) |
|---|---|---|---|---|---|---|---|---|
| Good behavior game | 8 | 153 | RO/XO | XO/O | /O | /O | /O | /O |
| Mastery learning | 9 | 163 | RO/XO | XO/O | /O | /O | /O | /O |
| Controls | 24 | 377 | RO/O | O/O | /O | /O | /O | /O |

Source: Kellam et al. (1994). F=fall; S=spring; O=observation; X=intervention; R=random assignment.
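Returning to the attributable risk formula of Section 3.05.6.2, the smoking example given there can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python, assuming the inputs stated in the text (exposure prevalence of one half, relative risk approximated by relative odds of about 15, and 400 000 annual deaths); the function and variable names are illustrative only.

```python
def attributable_risk(prevalence, relative_risk):
    """Population attributable risk: P(RR - 1) / (P(RR - 1) + 1)."""
    excess = prevalence * (relative_risk - 1)
    return excess / (excess + 1)


# Inputs taken from the smoking and lung cancer example in Section 3.05.6.2.
smoking_prevalence = 0.5   # "about half the population smoked"
relative_odds = 15         # relative odds used as an estimate of relative risk
annual_deaths = 400_000

ar = attributable_risk(smoking_prevalence, relative_odds)
preventable_deaths = ar * annual_deaths

print(ar)                  # 0.875
print(preventable_deaths)  # 350000.0
```

With these inputs the formula gives 0.875, which corresponds to the 350 000 of 400 000 deaths cited in the text and which the text rounds to "about 80%". When the relative risk is 1 the function returns 0, as expected for an exposure unrelated to disease.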

3.05.7 BIAS

Epidemiologic research entails bias, and it is beneficial to anticipate bias in designing research studies. Common biases in epidemiology take a slightly different form than in typical psychological research (e.g., as enumerated in Campbell & Stanley, 1971), but the principle of developing a language to conceptualize bias, and of attempting to anticipate, eliminate, or measure bias, is common to both disciplines. In this section five basic types of bias are considered: sampling bias, divided into three subtypes of treatment bias, prevalence bias, and response bias; and measurement bias, divided into subtypes of diagnostic bias and exposure bias.

3.05.7.1 Sampling Bias

Sampling bias arises when the sample studied does not represent the population to which generalization is to be made. An important type of sampling bias is treatment bias. The case-control design often takes advantage of treatment facilities for finding cases of disorder to compare with controls. With severe diseases which invariably lead to treatment, there is less bias than for disorders which are less noticeable, distressing, or impairing to the individual. As shown in Figure 2, data from the ECA Program indicate that among the mental disorders, only for schizophrenia and panic disorder are as many as half the current cases under treatment. In 1946, Berkson showed that, where the probability of treatment for a given disorder is less than 1.0, cases in treatment over-represent individuals with more than one disorder, that is, comorbid cases. In studying risk factors for disorder with the case-control design, where cases are found through clinics, the association revealed in the data may be an artifact of the comorbidity, that is, exposure x may appear to be associated with disease y, the focus disorder of the study, but actually be related to disease z (perhaps not measured in the study). This type of bias is so important in epidemiologic research, especially case-control studies, that it is termed "Berkson's bias," or sometimes "treatment bias." The existence of this bias is possibly one factor connected to the limited use of the case-control design in studies of psychiatric disorders.

Another type of sampling bias very common in case-control studies arises from another type of efficiency in finding cases of disorder, that is, the use of prevalent cases. This is sometimes termed prevalence bias, or the clinician's illusion (Cohen & Cohen, 1984). The ideal is to compare relative risks among exposed and nonexposed populations, which entails comparison of incidence rates, as discussed above. But the rate of incidence is often too low to generate sufficient cases for analysis, and necessitates extending the data collection over a period of months or years during which new cases are collected. Prevalent cases are easier to locate, either through a cross-sectional survey, or, easier yet, through records of individuals currently under treatment. The problem with study of current cases is that the prevalence is a function of the incidence and chronicity of the disorder, as discussed above. Association of an exposure with the presence or absence of disease mixes up influences on incidence with influences on chronicity. Influences on chronicity can include treatment. For example, comparing the brain structure of prevalent cases of schizophrenia to controls showed differences in the size of the ventricles, possibly an exciting clue to the etiology of schizophrenia. But a possible bias is that phenothiazine or other medication, used to treat schizophrenia, produced the changes in brain structure. Later studies of brain structure had to focus on schizophrenics who had not been treated in order to eliminate this possibility. The study of depression among women in London is possibly subject to prevalence bias since cases were selected on the basis of presentation to treatment, and controls via a cross-sectional (prevalence) survey. Thus, it may be that provoking agents such as life events contribute only to the recurrence of episodes of depression, not necessarily to their initial onset.

Response bias is a third general threat to the validity of findings from epidemiologic research. The epidemiologic design includes explicit statements of the population to which generalization is sought, as discussed above. The sampling procedure includes ways to designate individuals thought to be representative of that population. After designation as respondents, before measurements can be taken, some respondents become unavailable to the research, usually through one of three mechanisms: death, refusal, or change of residence. If these designated-but-not-included respondents are not a random sample, there will be bias in the results. The bias can result from differences between cases and noncases in response rate, from differences in the distribution of exposures, or both. In cross-sectional field surveys (Von Korff et al., 1985) and in follow-up surveys (Eaton, Anthony, Tepper, & Dryman, 1992), the response bias connected to psychopathology has not been extremely strong. Persons with psychiatric diagnoses are not more likely to refuse to participate, for example. Persons with cognitive impairment are more likely to die during a follow-up interval, and persons with antisocial personality disorder, or abuse of illegal drugs, are more likely to change address and be difficult to locate. Differential response bias is particularly threatening to studies with long follow-up periods, such as the British Perinatal Study and the Danish Adoption Study, since there was sizable attrition in both studies.

3.05.7.2 Measurement Bias

Bias in measurement is called invalidity in psychology, and this term is also used in epidemiology. But the study of validity in epidemiology has been more narrowly focused than in psychology. The concentration in epidemiology has been on dichotomous measures, as noted above. The medical ideology has ignored the notion that concepts for disease and pathology might actually be conventions of thought. Where the psychometrician takes as a basic assumption that the true state of nature is not observable, the epidemiologist tends to think of disease as a state whose presence is essentially knowable. As a result, discussions of face validity, content validity, or construct validity are rare in epidemiologic research. Instead the focus has been on sensitivity and specificity, which are two aspects of criterion validity (also a term not used much in epidemiologic research).

Sensitivity is the proportion of true cases that are identified as cases by the measure (Table 7). Specificity is the proportion of true noncases that are identified as noncases by the measure. In psychological research, validity is often estimated with a correlation coefficient, but this statistic is inappropriate because the construct of disease is dichotomous and differences in rarity of the disorder will constrain the value of the correlation coefficient. Furthermore, use of sensitivity and specificity has the advantage over correlation that it forces quantification of both types of error, that is, false-positive and false-negative error. These errors, and the calibration of the measurement to adapt to them, depend heavily on the prevalence of the disorder, on the importance of locating cases, and on the expense of dealing with those cases that are located. Choice of a threshold to define disorder as present or absent has important consequences, and is aided by detailed study of the effects of different thresholds (sometimes termed receiver operating characteristic (ROC) analysis, as in Murphy, 1987). There are simple procedures for correcting prevalence estimates according to the sensitivity and specificity of the measurement (Rogan & Gladen, 1978). Procedures for correcting measures of association, such as the odds ratio, are more complicated since they depend on the precise study design.

Bias also exists in the measurement of exposure. The epidemiologist's tendency to categorical measurement leads to the term "misclassification" for this type of bias. A well-known example is the study of Lilienfeld and Graham (1958: cited in Schlesselman, 1982, pp. 137-138), which compared male patients' declarations as to whether they were circumcised to the results of a doctor's examination. Of the 84 men judged to be circumcised by the doctor (the "gold standard" in this situation), only 37 reported being circumcised (44% sensitivity). Eighty-nine of the 108 men who were not circumcised in the view of the doctor reported not being circumcised (82% specificity). In psychological research, the exposures are often subjective psychological happenings, such as emotions, or objective events recalled by the subject, such as life events, instead of, for example, residence near a toxic waste dump, as might be the putative exposure in a study of cancer. The importance of psychological exposures, or subjective reporting of exposures, threatens the validity of the case-control design in psychiatric epidemiology, and may be one reason it has been used so little. The cases in a case-control study by definition have a disease or problem which the controls do not. In any kind of case-control study where the measure of exposure is based on recall by the subjects, the cases may be more likely than controls to recall exposures, because they wish to explain the occurrence of the disease. In the study of depression in London, depressed women may be more likely to recall a difficult life event that happened earlier in their lives, because of their current mood, than women who are not depressed; or depressed women may magnify the importance of a given event which actually did occur. These problems of recall raise the importance of strictly prospective designs in psychological epidemiological research.

3.05.8 CONCLUSION: THE FUTURE OF PSYCHOLOGY IN EPIDEMIOLOGY

Epidemiology is developing rapidly in ways that will promote effective collaboration with psychology. The traditional interest of psychologists in health as well as illness links up to epidemiologists' growing interest in the development of host resistance. The traditional interest of psychologists in phenomenology is beginning to link up with epidemiologists'

Table 7 Sensitivity and specificity.

                                 True disease status
Test results    Present                  Absent                   Total
Present         a (True-positives)       b (False-positives)      a + b
Absent          c (False-negatives)      d (True-negatives)       c + d
Total           a + c                    b + d
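
The cell labels of Table 7 translate directly into a small calculation. The Python sketch below is an illustration, not something taken from the chapter: the names `TwoByTwo` and `corrected_prevalence` are hypothetical, and the correction formula is the one commonly attributed to Rogan and Gladen (1978), assumed here to be the kind of simple procedure for correcting prevalence estimates that the text mentions. The counts are those of the Lilienfeld and Graham circumcision example quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts labeled as in Table 7 (rows: test result, columns: true status)."""
    a: int  # test present, disease present (true positives)
    b: int  # test present, disease absent (false positives)
    c: int  # test absent, disease present (false negatives)
    d: int  # test absent, disease absent (true negatives)

    def sensitivity(self) -> float:
        # Proportion of true cases that the measure identifies as cases.
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        # Proportion of true noncases that the measure identifies as noncases.
        return self.d / (self.b + self.d)

    def apparent_prevalence(self) -> float:
        # Proportion classified as cases by the (imperfect) measure.
        total = self.a + self.b + self.c + self.d
        return (self.a + self.b) / total


def corrected_prevalence(apparent: float, sensitivity: float, specificity: float) -> float:
    """Assumed Rogan-Gladen style correction of an apparent prevalence."""
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("measure is no better than chance; correction undefined")
    estimate = (apparent + specificity - 1.0) / denominator
    # Sampling error can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, estimate))


# 84 men judged circumcised by the doctor, 37 of whom reported it;
# 108 judged not circumcised, 89 of whom reported not being circumcised.
table = TwoByTwo(a=37, b=108 - 89, c=84 - 37, d=89)
print(round(table.sensitivity(), 2))  # 0.44
print(round(table.specificity(), 2))  # 0.82
print(round(corrected_prevalence(table.apparent_prevalence(),
                                 table.sensitivity(),
                                 table.specificity()), 4))  # 0.4375
```

On these data the self-reported prevalence is 56/192, and the correction returns the doctor-assessed prevalence of 84/192. That exact agreement is expected only because sensitivity and specificity were estimated from the same sample that is being corrected; applied to a new survey the result would be an estimate.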

Sensitivity = a / (a + c)
Specificity = d / (b + d)

growing interest in measurement and complex nosology. In statistics, techniques of multivariate analysis are being adapted increasingly well to the developmental perspective. Developments in the field of physiological and biological measurement, including nearly noninvasive assessments of DNA, hormones, and brain structure and functioning, have led to concepts like “molecular epidemiology,” which will lead to vast increases in knowledge about how psychological functioning affects diseases across the spectrum of mental and physical illness. Many of these measures are efficient enough to be applied in the context of a large field survey. In a few decades, the concept of the “mind–body split” may be an anachronism. Finally, the increasing use of laptop computers for assistance in assessment of complex human behaviors and traits will increase the utility of questionnaires and interview data. For all these reasons psychologists should look forward to increasingly productive collaborations in the field of epidemiology.

ACKNOWLEDGMENTS

Production of this paper was supported by NIMH grants 47447, and by the Oregon Social Learning Center. The author is grateful to Nina Schooler and Ray Lorion for comments on early drafts.

3.05.9 REFERENCES

Acheson, E. D., Baldwin, J. A., & Graham, W. J. (1987). Textbook of medical record linkage. New York: Oxford University Press.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Anthony, J. C., Eaton, W. W., & Henderson, A. S. (1995). Introduction: Psychiatric epidemiology (Special Issue of Epidemiologic Reviews). Epidemiologic Reviews, 17(1), 1–8.
Anthony, J. C., Folstein, M. F., Romanoski, A., Von Korff, M., Nestadt, G., Chahal, R., Merchant, A., Brown, C. H., Shapiro, S., Kramer, M., & Gruenberg, E. M. (1985). Comparison of the lay Diagnostic Interview Schedule and a standardized psychiatric diagnosis: Experience in Eastern Baltimore. Archives of General Psychiatry, 42, 667–675.
Berkson, J. (1946). Limitations of the application of fourfold table analysis to hospital data. Biometrics, 2, 47–53.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Blalock, H. M. (1979). Social statistics (2nd ed.). New York: McGraw-Hill.
Brown, G. W., Birley, J. L. T., & Wing, J. K. (1972). Influence of family life on the course of schizophrenic disorders: a replication. British Journal of Psychiatry, 121, 241–258.
Brown, G. W., & Harris, T. (1978). The social origins of depression: A study of psychiatric disorder in women. London: Tavistock.
Butler, N. R., & Alberman, E. D. (1969). Perinatal Problems: The Second Report of the 1958 British Perinatal Survey. Edinburgh, Scotland: Livingstone.
Campbell, D. T., & Stanley, J. C. (1971). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
Cohen, P., & Cohen, J. (1984). The Clinician's Illusion. Archives of General Psychiatry, 41, 1178–1182.
Commission on Chronic Illness (1957). Chronic illness in the United States. Cambridge, MA: Harvard University Press.
Cooper, B., & Morgan, H. G. (1973). Epidemiological psychiatry. Springfield, IL: Charles C. Thomas.
Cross-National Collaborative Group (1992). The changing rate of major depression: cross-national comparisons. Journal of the American Medical Association, 268, 3098–3105.
Crow, T. J., Done, D. J., & Sacker, A. (1996). Birth cohort study of the antecedents of psychosis: Ontogeny as witness to phylogenetic origins. In H. Hafner & W. F. Gattaz (Eds.), Search for the causes of schizophrenia (Vol. 3, pp. 3–20). Heidelberg, Germany: Springer.
Done, D. J., Johnstone, E. C., Frith, C. D., et al. (1991). Complications of pregnancy and delivery in relation to psychosis in adult life: data from the British perinatal mortality survey sample. British Medical Journal, 302, 1576–1580.
Eaton, W. W. (1994). The NIMH epidemiologic catchment area program: Implementation and major findings. International Journal of Methods in Psychiatric Research, 4, 103–112.
Eaton, W. W. (1995). Studying the natural history of psychopathology. In M. Tsuang, M. Tohen, & G. Zahner (Eds.), Textbook in psychiatric epidemiology (pp. 157–177). New York: Wiley-Liss, Inc.
Eaton, W. W., Anthony, J. C., Tepper, S., & Dryman, A. (1992). Psychopathology and Attrition in the Epidemiologic Catchment Area Surveys. American Journal of Epidemiology, 135, 1051–1059.
Eaton, W. W., Badawi, M., & Melton, B. (1995). Prodromes and precursors: Epidemiologic data for primary prevention of disorders with slow onset. American Journal of Psychiatry, 152(7), 967–972.
Eaton, W. W., Holzer, C. E., Von Korff, M., Anthony, J. C., Helzer, J. E., George, L., Burnam, A., Boyd, J. H., Kessler, L. G., & Locke, B. Z. (1984). The design of the Epidemiologic Catchment Area surveys: the control and measurement of error. Archives of General Psychiatry, 41, 942–948.
Eaton, W. W., & Kessler, L. G. (1981). Rates of symptoms of depression in a national sample. American Journal of Epidemiology, 114, 528–538.
Eaton, W. W., Kramer, M., Anthony, J. C., Chee, E. M. L., & Shapiro, S. (1989). Conceptual and methodological problems in estimation of the incidence of mental disorders from field survey data. In B. Cooper & T. Helgason (Eds.), Epidemiology and the prevention of mental disorders (pp. 108–127). London: Routledge.
Eaton, W. W., & McLeod, J. (1984). Consumption of coffee or tea and symptoms of anxiety. American Journal of Public Health, 74, 66–68.
Eaton, W. W., & Muntaner, C. (1996). Socioeconomic stratification and mental disorder. In A. V. Horwitz & T. L. Scheid (Eds.), Sociology of mental health and illness. New York: Cambridge University Press.
Eaton, W. W., Regier, D. A., Locke, B. Z., & Taube, C. A. (1981). The Epidemiologic Catchment Area Program of the National Institute of Mental Health. Public Health Reports, 96, 319–325.
Eaton, W. W., Weissman, M. M., Anthony, J. C., Robins, L. N., Blazer, D. G., & Karno, M. (1985). Problems in the definition and measurement of prevalence and incidence of psychiatric disorders. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic Field Methods in Psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 311–326). Orlando, FL: Academic Press.
Farquhar, J. W., Fortmann, S. P., Flora, J. A., Taylor, C. B., Haskell, W. L., Williams, P. T., Maccoby, N., & Wood, P. D. (1990). Effects of communitywide education on cardiovascular disease risk factors: The Stanford Five-City Project. Journal of the American Medical Association, 264(3), 359–365.
Fitti, J. E., & Kovar, M. G. (1987). The Supplement on Aging to the 1984 National Health Interview Survey. Vital and Health Statistics. Washington, DC: Government Printing Office.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Foege, W. H. (1996). Alexander D. Langmuir: His impact on public health. American Journal of Epidemiology, 144(8) (Suppl.), S11–S15.
Gordon, R. (1983). An operational classification of disease prevention. Public Health Reports, 98, 107–109.
Herbst, A. L., Ulfelder, H., & Poskanzer, D. C. (1971). Adenocarcinoma of the vagina. Association of maternal stilbestrol therapy with tumor appearance in young women. The New England Journal of Medicine, 284(16), 878–881.
Kellam, S. G., & Werthamer-Larsson, L. (1986). Developmental epidemiology: a basis for prevention. In M. Kessler & S. E. Goldston (Eds.), A Decade of Progress in Primary Prevention (pp. 154–180). Hanover, NH: University Press of New England.
Kellam, S. G., Rebok, G. W., Ialongo, N., & Mayer, L. S. (1994). The course and malleability of aggressive behavior from early first grade into middle school: Results of a developmental epidemiologically-based preventive trial. Journal of Child Psychology and Psychiatry, 35(2), 259–281.
Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshelman, S., Wittchen, H., & Kendler, K. S. (1994). Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States. Archives of General Psychiatry, 51, 8–19.
Kety, S. S., Rosenthal, D., Wender, P. H., Schulsinger, F., & Jacobsen, B. (1975). Mental illness in the biological and adoptive families of adopted individuals who have become schizophrenic: a preliminary report based on psychiatric interviews. In R. R. Fieve, D. Rosenthal, & H. Brill (Eds.), Genetic Research in Psychiatry (pp. 147–165). Baltimore, MD: Johns Hopkins University Press.
Kleinbaum, D. G., Kupper, L. L., & Morgenstern, H. (1982). Epidemiologic Research. Principles and Quantitative Methods. Belmont, CA: Lifetime Learning.
Klerman, G. L., & Weissman, M. M. (1989). Increasing rates of depression. JAMA, 261, 2229–2235.
Kovar, M. (1992). The Longitudinal Study of Aging: 1984–90. Hyattsville, MD: US Department of Health and Human Services.
Kramer, M. (1957). A discussion of the concepts of incidence and prevalence as related to epidemiologic studies of mental disorders. American Journal of Public Health & Nation's Health, 47(7), 826–840.
Kramer, M. (1969). Applications of Mental Health Statistics: Uses in mental health programmes of statistics derived from psychiatric services and selected vital and morbidity records. Geneva: World Health Organization.
Kramer, M., Von Korff, M., & Kessler, L. (1980). The lifetime prevalence of mental disorders: estimation, uses and limitations. Psychological Medicine, 10, 429–435.
Lawless, J. F. (1982). Statistical models and methods for lifetime data (Wiley Series in Probability and Mathematical Statistics). New York: Wiley.
Lilienfeld, A., & Lilienfeld, D. (1980). Foundations of epidemiology. New York: Oxford University Press.
Lilienfeld, D. E., & Stolley, P. D. (1994). Foundations of epidemiology. New York: Oxford University Press.
Loffler, W., Hafner, H., Fatkenheuer, B., Maurer, K., Riecher-Rossler, A., Lutzhoft, J., Skadhede, S., Munk-Jorgensen, P., & Stromgren, E. (1994). Validation of Danish case register diagnosis for schizophrenia. Acta Psychiatrica Scandinavica, 90, 196–203.
Mausner, J. S., & Kramer, S. (1985). Mausner & Bahn epidemiology: An introductory text (2nd ed.). Philadelphia: Saunders.
Meltzer, H., Gill, B., Petticrew, M., & Hinds, K. (1995a). Physical complaints, service use and treatment of adults with psychiatric disorders. London: Her Majesty's Stationery Office.
Meltzer, H., Gill, B., Petticrew, M., & Hinds, K. (1995b). Prevalence of psychiatric morbidity among adults living in private households. London: Her Majesty's Stationery Office.
Morris, J. N. (1975). Use of epidemiology (3rd ed.). Edinburgh: Churchill Livingstone.
Mortensen, P. B. (1995). The untapped potential of case registers and record-linkage studies in psychiatric epidemiology. Epidemiologic Reviews, 17(1), 205–209.
Mrazek, P. J., & Haggerty, R. J. (1994). Reducing risks for mental disorders. Washington, DC: National Academy Press.
Munson, M. L., Orvaschel, H., Skinner, E. A., Goldring, E., Pennybacker, M., & Timbers, D. M. (1985). Interviewers: Characteristics, training and field work. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic Field Methods in Psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 69–83). Orlando, FL: Academic Press.
Murphy, J. M., Berwick, D. M., Weinstein, M. C., Borus, J. F., Budman, S. H., & Klerman, G. L. (1987). Performance of Screening and Diagnostic Tests: Application of receiver operating characteristic analysis. Archives of General Psychiatry, 44, 550–555.
Robins, L. N., Helzer, J. E., Croughan, J., & Ratcliff, K. S. (1981). National Institute of Mental Health Diagnostic Interview Schedule: Its history, characteristics, and validity. Archives of General Psychiatry, 38, 381–389.
Robins, L. N., Helzer, J. E., Weissman, M. M., Orvaschel, H., Gruenberg, E. M., Burke, J. D., & Regier, D. A. (1984). Lifetime prevalence of specific psychiatric disorders in three sites. Archives of General Psychiatry, 41, 949–958.
Robins, L. N., & Regier, D. A. (Eds.) (1991). Psychiatric disorders in America: The epidemiologic catchment area study. New York: Free Press.
Rogan, W. J., & Gladen, B. (1978). Estimating Prevalence from the Results of a Screening Test. American Journal of Epidemiology, 107, 71–76.
Sacker, A., Done, J., Crow, T. J., & Golding, J. (1995). Antecedents of schizophrenia and affective illness: obstetric complications. British Journal of Psychiatry, 166, 734–741.
Schlesselman, J. J. (1982). Case–control studies: Design, conduct, analysis. New York: Oxford University Press.
Shapiro, S., Skinner, E. A., Kessler, L. G., Von Korff, M., German, P. S., Tischler, G. L., Leaf, P. J., Benham, L., Cottler, L., & Regier, D. A. (1984). Utilization of health and mental health services, three epidemiologic catchment area sites. Archives of General Psychiatry, 41, 971–978.
Simon, G. E., & Von Korff, M. (1995). Recall of Psychiatric History in Cross-sectional Surveys: Implications for Epidemiologic Research. Epidemiologic Reviews, 17(1), 221–227.
Sudman, S. (1976). Applied sampling. New York: Academic Press.
ten Horn, G. H., Giel, R., Gulbinat, W. H., & Henderson, J. H. (Eds.) (1986). Psychiatric case registers in public health: A worldwide inventory 1960–1985. Amsterdam: Elsevier Science.
Tsuang, M., Tohen, M., & Zahner, G. (1995). Textbook in psychiatric epidemiology. New York: Wiley-Liss.
Tyrer, P. (1985). Neurosis divisible? Lancet, 1, 685–688.
Von Korff, M., Cottler, L., George, L. K., Eaton, W. W., Leaf, P. J., & Burnam, A. (1985). Nonresponse and nonresponse bias in the ECA surveys. In W. W. Eaton & L. G. Kessler (Eds.), Epidemiologic field methods in psychiatry: The NIMH Epidemiologic Catchment Area Program (pp. 85–98). Orlando, FL: Academic Press.
Wynder, E. L. (1994). Studies in mechanism and prevention: Striking a proper balance. American Journal of Epidemiology, 139(6), 547–549.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.06 Qualitative and Discourse Analysis

JONATHAN A. POTTER Loughborough University, UK

3.06.1 INTRODUCTION: QUALITATIVE RESEARCH IN CONTEXT
  3.06.1.1 Historical Moments in Qualitative Research in Clinical Settings
  3.06.1.2 Background Issues
    3.06.1.2.1 Philosophy, sociology, and changing conceptions of science
    3.06.1.2.2 Investigatory procedures vs. justificatory rhetoric
    3.06.1.2.3 Quality and quantity
  3.06.1.3 Qualitative Research and Theory
  3.06.1.4 Boundaries of Qualitative Research and Coverage of the Current Chapter
3.06.2 GROUNDED THEORY
  3.06.2.1 Questions
  3.06.2.2 Procedures
    3.06.2.2.1 Materials
    3.06.2.2.2 Coding
    3.06.2.2.3 Method of constant comparison
    3.06.2.2.4 Memo writing
    3.06.2.2.5 Validity
  3.06.2.3 Example: Clegg, Standen, and Jones (1996)
  3.06.2.4 Virtues and Limitations
3.06.3 ETHNOGRAPHY AND PARTICIPANT OBSERVATION
  3.06.3.1 Questions
  3.06.3.2 Procedures
    3.06.3.2.1 Access
    3.06.3.2.2 Field relations
    3.06.3.2.3 Fieldnotes
    3.06.3.2.4 Interviews
    3.06.3.2.5 Analysis
    3.06.3.2.6 Validity
  3.06.3.3 Example: Gubrium (1992)
  3.06.3.4 Virtues and Limitations
3.06.4 DISCOURSE ANALYSIS
  3.06.4.1 Questions
  3.06.4.2 Procedures
    3.06.4.2.1 Research materials
    3.06.4.2.2 Access and ethics
    3.06.4.2.3 Audio and video recording
    3.06.4.2.4 Transcription
    3.06.4.2.5 Coding
    3.06.4.2.6 Analysis
    3.06.4.2.7 Validity
  3.06.4.3 Example: Peräkylä (1995)
  3.06.4.4 Virtues and Limitations
3.06.5 FUTURE DIRECTIONS
3.06.6 REFERENCES

3.06.1 INTRODUCTION: QUALITATIVE RESEARCH IN CONTEXT

For many researchers the words “qualitative method” spell out psychology's most notorious oxymoron. If there is one thing that qualitative methods are commonly thought to lack it is precisely an adequate methodic way of arriving at findings. Indeed, for much of the twentieth century quantification has been taken as the principal marker of the boundary between a mature scientific psychology and commonsense, intuitive approaches. Since the late 1980s, however, there has been a remarkable increase in interest in qualitative research in psychology. This partly reflects a broader turn to qualitative research across the social sciences, although qualitative research of one kind or another has long been an established feature of disciplines such as education, sociology, and, most prominently, anthropology.

In psychology there is a handbook of qualitative research (Richardson, 1996) as well as a range of volumes and special journal issues whose major focus is on developing qualitative approaches to psychological problems (Antaki, 1988; Bannister, Burman, Parker, Taylor, & Tyndall, 1994; Henwood & Parker, 1995; Henwood & Nicolson, 1995; Smith, Harré, & van Langenhove, 1995). Psychology methods books and collections are increasingly serving up qualitative methods to accompany the more usual diet of experiments, questionnaires, and surveys. At the same time, an increasing permeability of boundaries between the social sciences has provided the environment for a range of trans-disciplinary qualitative methods books including a useful doorstop-sized handbook (Denzin & Lincoln, 1994) and varied edited and authored works (Bogdan & Taylor, 1975; Bryman & Burgess, 1994; Coffey & Atkinson, 1996; Gilbert, 1993; Lofland & Lofland, 1984; Miles & Huberman, 1994; Miller & Dingwall, 1997; Silverman, 1993, 1997a). These general qualitative works are complemented by a mushrooming range of books and articles devoted to specific methods and approaches.

Why has there been this increase in interest in qualitative research? Three speculations are proffered. First, there is a widespread sense that traditional psychological methods have not proved successful in providing major advances in the understanding of human life. Despite regular promissory notes, psychology seems to offer no earth-moving equivalent of the transistor, of general relativity, or of molecular genetics. Second, as is discussed below, views of science have changed radically since the 1950s, making it much harder to paint qualitative researchers as either antiscientific extremists or merely sloppy humanists. Third, psychology is no longer as insulated from other social sciences as it has been in the past. Of course, for much of its twentieth century existence psychology has been invigorated by infusions from other sciences such as physiology and linguistics. However, in recent years there has been increasing exchange with disciplines where qualitative methods have been more established, such as sociology and anthropology. This is reflected in contemporary theoretical developments such as constructionism (Gergen, 1994) and poststructuralism (Henriques, Hollway, Irwin, Venn, & Walkerdine, 1984) that have swept right across the human sciences.

This is, then, an exciting time for qualitative researchers, with new work and new opportunities of all kinds. Yet it should also be emphasized that qualitative research in psychology is in a chaotic state, with a muddle of inconsistent questions and approaches being blended together. Much poor work has involved taking questions formulated in the metaphysics of experimental psychology and attempting to plug them into one or more qualitative methods. At its worst such research peddles unsystematic and untheorized speculations about the influences on some piece of behavior which are backed up with two or three quotes from an interview transcript. This expanding literature and its variable quality create some problems for the production of a useful overview. This chapter selects what are seen as the most coherent and successful qualitative approaches from a range of possibilities, as well as focusing on those approaches which are being used, and have prospects for success, in clinical settings. A range of references is provided for those who wish to follow up alternative methods.

3.06.1.1 Historical Moments in Qualitative Research in Clinical Settings

In one sense a large proportion of twentieth century clinical therapeutic practice was based on qualitative methods. The process of conducting some kind of therapy or counseling with clients and then writing them up as case histories, or using them as the basis for inferences about aspects of the psyche, behavior, or cognitive processes, has been a commonplace of clinical work. Freud's use of case histories in the development of psychoanalytic thinking is probably the most influential. Although it is an approach to clinical knowledge that overwhelmingly eschews quantification, it is hard to say much about its methodic basis. For good or bad, it is dependent on the unformulated skills and intuitions of the therapist/researcher. In the hands of someone as brilliant as Freud the result can be extraordinary; elsewhere the product has often been merely ordinary. The problem for the readers of such research is that they can do little except either take it on trust or disagree. The process through which certain claims are established is not open to scrutiny. However, Freud's study of the case of Little Hans is exceptional in this regard, and so it is worth considering briefly (see Billig, 1998).

Although Freud initially based his arguments for the existence of the Oedipus complex on the interpretation of what patients told him in the course of therapy sessions, he attempted, unusually, to support this part of psychoanalytic theory with more direct evidence. He asked some of his followers to collect observations from their own children. The music critic Max Graf was most helpful in this regard and presented Freud with copious notes on conversations between his son, Hans, and other family members, as well as descriptions of dreams he had recounted. The published case history (Freud, 1977 [1909]) contains more than 70 pages of reports about Hans which Freud describes as reproduced “just as I received them” without “conventional emendations” (1977, p. 170). Here is an example:

    Another time he [Hans] was looking on intently while his mother undressed before going to bed. “What are you staring like that for?” she asked.
    Hans: “I was only looking to see if you'd got a widdler too.”
    Mother: “Of course. Didn't you know that?”
    Hans: “No. I thought you were so big you'd have a widdler like a horse.”
    (1977, p. 173)

Freud's fascinating materials and striking interpretations beg many of the questions that have been central to qualitative research ever since. For example, what is the role of Max Graf's expectations (he was already an advocate of Freud's theories) in his selection and rendering of conversations with Hans? How closely do the extracts capture the actual interactions (including the emphasis, nonvocal elements, and so on)? What procedure did Freud use to select the examples that were reproduced from the full corpus? And, most importantly, what is the basis of Freud's interpretations? His interpretations are strongly derived from his theory, as is shown by his willingness to straightforwardly rework the overt sense of the records. Take this example:

    Hans (aged four and a half) was again watching his little sister being given her bath, when he began laughing. On being asked why he was laughing, he replied. “I'm laughing at Hanna's widdler.” “Why?” “Because her widdler's so lovely.”
    Of course his answer was a disingenuous one. In reality her widdler had seemed to him funny. Moreover, this is the first time he has recognized in this way the distinction between male and female genitals instead of denying it. (1977, p. 184)

Note the way Graf here, and implicitly Freud in his text, treat the laughter as the real indicator of Hans's understanding of events, and his overt claim to find his sister's “widdler” lovely as a form of dissembling. Hans is not delighted by the appearance of his sister's genitals but is amused, in line with psychoanalytic theory, by their difference from his own. Again, the issue of how to treat the sense of records of interaction, and what interpretations should be made from them to things going on elsewhere such as actions or cognitions, is a fundamental one in qualitative research (Silverman, 1993).

Some 40 years later another of clinical psychology's great figures, Carl Rogers, advocated the use of newly developed recording technology to study the use of language in psychotherapy itself, with the aim of understanding and improving therapeutic skills. For him such recordings offered “the first opportunity for an adequate study of counseling and therapeutic procedures, based on thoroughly objective data” (Rogers, 1942, p. 433). Rogers envisaged using such recordings in the development of a scale to differentiate the styles of different counselors and studies of the patterns of interaction; for example, “what type of counselor statement is most likely to be followed by statements of the client's feeling about himself?” (1942, p. 434).

Rogers' emphasis on the virtue of recordings was followed up in two major “microscopic” studies of psychotherapeutic discourse. The first, by Pittenger, Hockett, and Danehy (1960), focused on the initial portion of an initial interview. A typical page of their study has just a few words of transcript coupled with an extended discussion of their sense. Much of the focus was on the prosodic cues (the intonation and stress) provided in the interview and their contextual significance. Prosody is, of course, a feature of interaction which is almost impossible to capture reliably in post hoc notes made by key informants, and so highlights the virtue of the new technology. A second study by Labov and Fanshel (1977) also focused on the opening stages of a therapy session, in this case five episodes of interaction from the first 15 minutes of a psychoanalytic therapy session with Rhoda, a 19-year-old girl with a history of anorexia nervosa.

The classic example of ethnographic work in the history of clinical psychology is Goffman's study of the everyday life of a mental hospital published under the title Asylums (1961). It is worth noting that although Goffman was a sociologist, the various essays that make up Asylums were initially published in psychiatry journals. Rather than utilize tape recording technology to capture the minutiae of some social setting, Goffman used an ethnographic approach. He spent a year working ostensibly as an assistant to the athletic director of a large mental hospital, interacting with patients and attempting to build up an account of the institution as viewed by the patients. His justification for working in this way is instructive for how the strengths and weaknesses of qualitative work have been conceptualized:

    Desiring to obtain ethnographic detail regarding selected aspects of patient social life, I did not employ usual kinds of measurements and controls. I assumed that the role and time required to gather statistical evidence for a few statements would preclude my gathering data on the tissue and fabric of patient life. (1961, p. 8)

As an ethnographic observer, he developed an understanding of the local culture and customs of the hospital by taking part himself. He used the competence generated in this way as the basis for his writing about the life and social organization in a large mental hospital.

3.06.1.2 Background Issues

Before embarking on a detailed overview of some contemporary qualitative approaches to clinical topics there are some background issues that are worth commenting on, as they will help make sense of the aims and development of qualitative approaches. In some cases it is necessary to address issues that have been a long-standing source of confusion where psychologists have discussed the use of qualitative methods.

3.06.1.2.1 Philosophy, sociology, and changing conceptions of science

As noted above, the development of qualitative work in psychology has been facilitated by the more sophisticated understanding of the nature of science provided by philosophers and sociologists since the 1970s. The image of the lone scientist harvesting facts, whose truth is warranted through the cast-iron criterion of replication, has intermittently been wheeled on to defend supposedly scientific psychology against a range of apparently sloppier alternatives. However, this image now looks less than substantial (see Chalmers, 1992; Potter, 1996a; Woolgar, 1988).

The bottom-line status of scientific observation has been undermined by a combination of philosophical analyses and sociological case studies. Philosophers have highlighted the logical relationships between observation statements and theoretical notions (Hesse, 1974; Kuhn, 1962; Popper, 1959). Their argument is that even the simplest of scientific descriptions is dependent on a whole variety of theoretical assumptions. Sociologists have supplemented these insights with studies of the way notions of observations are used in different scientific fields. For example, Lynch (1994) notes the way the term observation is used in astronomy as a loose device for collecting together a range of actions such as setting up the telescope, attaching sensors to it, building up traces on an oscilloscope, converting these into a chart, and canvassing the support of colleagues. Knorr Cetina (1997) documents the different notions of observation that appear in different scientific specialities, suggesting that high energy physicists and molecular biologists, for example, work with such strikingly different notions of what is empirical that they are best conceived of as members of entirely different epistemic cultures.

The idea that experimental replication can work as a hard criterion for the adequacy of any particular set of research findings has been shown to be too simple by a range of sociological studies of replication in different fields (Collins, 1981). For example, Collins (1985) has shown that the achievement of a successful replication is closely tied to the conception of what counts as a competent experiment in the first place, and this itself was often as much a focus of controversy as the phenomenon itself. In a study of gravity wave researchers, Collins found that those scientists who believed in gravity waves tended to treat replications that claimed to find them as competent and replications that failed to find them as incompetent. The reverse pattern was true of nonbelievers. What this meant was that replication did not stand outside the controversy as a neutral arbiter of the outcome, but was as much part of the controversy as everything else.

Philosophical and sociological analysis has also shown that the idea that a crucial experiment can be performed which will force the abandonment of one theory and demonstrate the correctness of another is largely mythical (Lakatos, 1970; Collins & Pinch, 1993). Indeed, historical studies suggest that so-called crucial experiments are not merely insufficient to effect the shift from one theory to another; they are often performed, or at least constructed as crucial, after the shift to provide illustration and legitimation (Kuhn, 1977).

Let us be clear at this point. This research does not show that careful observation, skilful replication, and theoretically sophisticated experiments are not important in science. Rather, the point is that none of these things are bottom-line guarantees of scientific progress. Moreover, these sociological studies have suggested that all these features of science are embedded in, and inextricable from, its communal practices. Their sense is developed and negotiated in particular contexts in accordance with ad hoc criteria and a wide range of craft skills which are extremely hard to formulate in an explicit manner (Knorr Cetina, 1995; Latour & Woolgar, 1986). The message taken from this now very large body of work (see Jasanoff, Markle, Pinch, & Petersen, 1995) is not that psychologists must adopt qualitative methods, or that qualitative methods will necessarily be any better than the quantitative methods that they may replace or supplement; it is that those psychologists who have argued against the adoption of such methods on the principle that they are unscientific are uninformed about the nature of science.

3.06.1.2.2 Investigatory procedures vs. justificatory rhetoric

There are a number of basic linguistic and metatheoretical difficulties in writing about qualitative methods for psychologists. Our terminology for methodological discussion (reliability, validity, sampling, factors, variance, hypothesis testing, and so on) has grown up with the development of quantitative research using experiments and surveys. The language has become so taken-for-granted that it is difficult to avoid treating it as obvious and natural. However, it is a language that is hard to disentangle from a range of metatheoretical assumptions about the nature of behavior and processes of interaction. Traditional psychology has become closely wedded to a picture of factors and outcomes which, in turn, cohabits with the multivariate statistics which are omnipresent where data is analyzed. For some forms of qualitative research, particularly most discourse and ethnographic work, such a picture is inappropriate. This does not mean that such research is incoherent or unscientific, merely that it should not be construed and evaluated using the family of concepts whose home is experimental journal articles. Likewise the psychological model of hypothesis testing is just one available across the natural and human sciences. Qualitative research that utilizes theoretically guided induction, or tries to give a systematic description of some social realm, should not be criticized on the grounds that it is unscientific, let alone illegitimate. Ultimately, the only consistent bottom line for the production of excellent qualitative work is excellent scholarship (Billig, 1988).

Another difference between traditional quantitative and qualitative work is that in the traditional work the justification of research findings is often taken to be equivalent to the complete and correct carrying out of a set of codified procedures. Indeed, methods books are often written as if they were compendia of recipes for achieving adequate knowledge. Sampling, operationalization of variables, statistical tests, and interpretation of significance levels are discussed with the aid of tree diagrams and flow charts intended to lead the apprentice researcher to the correct conclusion. In one sense, much qualitative work is very different to this, with the procedures for justifying the research claims being very different to the procedures for producing the work. Thus, the manner in which a researcher arrives at some claims about the various functions of “mm hm's” in psychiatric intake interviews, say, may be rather different from the manner in which they justify the adequacy of the analysis. Yet, in another sense the difference between qualitative and quantitative research is more apparent than real, for studies of the actual conduct of scientists following procedural rules of method show that such rules require a large amount of tacit knowledge to make them understandable and workable, and that they are often more of a rhetorical device used to persuade other scientists than an actual constraint on practice (Gilbert & Mulkay, 1984; Polanyi, 1958). As Collins (1974) showed in an innovative ethnographic study, when a group of scientists wrote a paper offering the precise technical specification of how to make a new laser, the only people who were able to build a working laser of their own had actually seen one built; just reading the paper was not enough.

This presents something of a dilemma for anyone writing about qualitative methods. Should they write to help people conduct their research so as better to understand the world, or should they work to provide the sorts of formalized procedural rules that can be drawn on in the methods sections of articles to help persuade the psychologist reader? In practice, most writing does some of each. However, the difficulty that psychologists often report when

up modern cognitivism the objects of observation are hypothetical mental entities (the Oedipus complex, attributional heuristics).
attempting qualitative work is probably symp- Psychoanalytic researchers have generally pre- tomatic of the failure to fully explicate the craft ferred to engage in an interpretative exercise of skills that underpin qualitative work. reconstructing those entities from the talk of patients undergoing therapy. Cognitive psychologists have typically used some 3.06.1.2.3 Quality and quantity hypothetico-deductive procedure where predic- There are different views on how absolute the tions are checked in experiments which inves- quantity/quality divide is. Arguments at differ- tigate, say, the attributional style of people ent levels of sophistication have been made for classified as depressed. Note that in both of future integration of qualitative and quantita- these cases they are using people's discourseÐ tive research (Bryman, 1988; Silverman, 1993). talk in the therapy session, written responses to It is undoubtedly the case that at times a questionnaireÐyet in neither case is the proponents of both quantitative and qualitative discourse as such of interest. In contrast, for research have constructed black and white researchers working with different perspectives stereotypes with little attempt at dialog such as social constructionism or discursive (although a rare debate about the relative psychology, the talk or writing itself, and the virtues of quantitative and qualitative research practices of which it is part, is the central topic. on the specific topic of attitudes towards mental For these researchers there is a need to use hospitalisation is revealingÐWeinstein, 1979, procedures which can make those practices 1980; Essex et al., 1980). It is suggested that available for study and allow their organization quantification is perfectly appropriate in a to be inspected and compared. range of situations, dependent on appropriate To take another example, behaviorist psy- analytic and theoretical judgements. 
chologists have done countless studies on the In many other research situations the goal is effects of particular regimes of reward and not something that can be achieved through punishment on target behaviors such as com- counting. For example, if the researcher is pulsive hand washing. However, such studies explicating the nature of ªcircular questioningº are typically focused on outcomes and statistical in Milan School Family Therapy, that goal is a associations, whereas a theoretical perspective prerequisite for a study which considers the such as symbolic interactionism or, to give a statistical prevalence of such questioning. more psychological example, Vygotskyan ac- Moreover, there are arguments for being tivity theory, encourage a more ethnographic cautious about quantification when studying examination of the settings in which rewards are the sorts of discursive and interactional materi- administered and of the sense that those als which have often been central to qualitative behaviors have in their local context research because of distortions and information Without trying to flesh out these examples in loss that can result (see Schegloff, 1993, and any detail, the important practical point they papers in Wieder, 1993). Some of the grounds make is that it is a mistake to attempt simply to for caution come from a range of qualitative import a question which has been formulated in studies of quantification in various clinical the problematics of one theoretical system, and settings (Ashmore, Mulkay, & Pinch, 1989; attempt to answer it using a method developed Atkinson, 1978; Garfinkel, 1967a; Potter, for the problematics of another. The failure to Wetherell, & Chitty, 1991). properly conceptualize a research question that fits with the research method is a major source of confusion when psychologists start to use 3.06.1.3 Qualitative Research and Theory qualitative methods. 
It is hard to overestimate how close the relationship is between the theories, methods, 3.06.1.4 Boundaries of Qualitative Research and and questions used by psychologists. Theories Coverage of the Current Chapter specify different positions on cognition and behavior, different notions of science, different What distinguishes qualitative research from views of the role of action and discourse, quantitative? And what qualitative approaches different understandings of the role of social are there? These questions are not as straight- settings, and, most fundamentally, different forward as they seem. In the past the qualitative/ objects for observation. quantitative distinction has sometimes been For both psychoanalytic theory and most of treated as the equivalent of the distinction the mass of theories and perspectives that make between research that produces objective and Grounded Theory 123 subjective knowledgeÐa distinction which enough to warrant discussion. In others, their makes little sense in the light of recent sociology central problematics are better addressed by the and philosophy of science. Sometimes certain approaches that are discussed. approaches using numbers have been treated as The most controversial exclusion is probably qualitative. For example, content analysis has humanistic methods given that humanistic occasionally been treated as a qualitative psychology developed in settings which had a method because it is used to deal with ªnaturally broad emphasis on therapy and psychological occurringº objects such as diaries, novels, well-being. It is suggested that the romanticism transcripts of meetings, and so on. Content of much humanistic psychology is attractive, analysis was meant to eliminate many of the but ultimately unconvincing. 
However, it is potential ªreactiveº effects that bedevil social often, quite legitimately, more concerned with research and thereby avoid the problem in developing participants' skills and sensitivity experimental and survey research, of how than developing propositional claims and findings relate to what goes on in the real arguments; as such it is often offering a set of world; for these are records of (indeed, examples techniques for expanding human potential of) what is actually going on. However, in this rather than developing methods for research. chapter content analysis is treated as quantita- Feminist methods are excluded, despite an tive, and therefore outside the scope of this appreciation of the importance of feminist survey, on the grounds that it (i) transforms issues in clinical settings, because the arguments phenomena into numerical counts of one kind for the existence of specifically feminist methods or another and (ii) typically attempts to (as opposed to theories or arguments) are not statistically relate these counts to some broader convincing. This is particularly true where such factors or variables. For useful introductions to claims give a major epistemological role for content analysis related to psychological topics experience or intuition (Ellis, Kiesinger, & see Holsti (1969) and Krippendorf (1980). Tillmann-Healy, 1997). These are topics for For similar reasons repertory grid analysis decomposition and analysis rather than associated with personal construct theory, and bottom-lines for knowledge. For some argu- the ªQº methodology developed by William ments in both directions on this topic see Stephenson, have sometimes been treated as Gelsthorpe (1992), Hammersley (1992), Rama- qualitative. The rationale for this was probably zanoglu (1992), and Wilkinson (1986). 
that they were often focused on understanding Finally, it should be stressed that qualitative the reasoning or cognitive organization of single work is not seen as having some overall individuals rather than working exclusively coherence. Quite the contrary, it is fragmented from population statistics. However, as they and of highly variable quality. Nor is some involve quantitative manipulation of elicited overall coherence seen as a desirable goal. Those responses from participants they will not be workers who talk of a qualitative paradigm dealt with here. The ideographic/nomathetic (Guba & Lincoln, 1994, Reason & Rowan, distinction will be treated as orthogonal to the 1981) unhelpfully blur over a range of theore- qualitative/quantitative one! For accessible tical and metatheoretical differences (see Hen- introductions to these approaches, see Smith wood & Pidgeon, 1994). (1995) on repertory grid methods and Stainton Rogers (1995) on Q methodology. In addition to these methods which are 3.06.2 GROUNDED THEORY excluded as not properly qualitative, a wide range of methods have not been discussed which Grounded theory has the most clear-cut nevertheless satisfy the criterion of being at least origin of any of the approaches discussed here. minimally methodic and generally eschewing The term was coined by two sociologists in an quantification. For simplicity, Table 1 lists nine influential book: The discovery of grounded methods or approaches which have been theory (Glaser & Strauss, 1967). Some of its key excluded, along with one or two references that features and concerns were a product of its would provide a good start point for any birthplace within sociology, where it was researcher who was interested in learning more developed to counter what the authors saw as about them. It would take some time to make a preoccupation, on the one hand, with abstract explicit the reasons for excluding all of them. 
grand theories and, on the other, with testing Generally, the problem is that they have not those theories through large scale quantitative been, and are unlikely to be in the future, studies. Grounded theory was intended to link particularly useful for studying problems in the theoretical developments (conceived as plausi- area of clinical psychology (e.g., focus groupsÐ ble relations among concepts and sets of although, see Piercy & Nickerson, 1996). In conceptsÐStrauss & Corbin, 1994) more some cases, the approaches are not coherent closely to the particulars of settings, to ground 124 Qualitative and Discourse Analysis

middle range theories in actual qualitative data rather than to start from preconceived hypotheses. It is important to stress that grounded theory is not a theory as such; rather, it is an approach to theorising about data in any domain. Moreover, since its inception in the 1960s, work on the theory ladenness of data has made the idea of "grounding" theory increasingly problematic.

Table 1 Varieties of qualitative research not covered.

Qualitative research method            Source
Action research                        Argyris, Putnam, & Smith, 1985; Whyte, 1991
Documentary studies                    Hodder, 1994; Scott, 1990
Ethogenics                             Harré, 1992
Feminist approaches                    Olesen, 1994; Reinharz, 1992
Focus groups                           Krueger, 1988; Morgan, 1997
Humanistic, participative research     Reason & Rowan, 1981; Reason & Heron, 1995
Life histories                         Plummer, 1995; Smith, 1994
Role play                              Yardley, 1995
Semiotics                              Manning & Cullum-Swan, 1994

Much grounded theory research has been done outside of psychology; however, psychologists have become increasingly interested in the approach in general (Charmaz, 1995; Henwood and Pidgeon, 1995; Pidgeon, 1996; Pidgeon and Henwood, 1996; Rennie, Phillips & Quartaro, 1988), and have carried out specific studies in health (Charmaz, 1991, 1994) and clinical (Clegg, Standen, & Jones, 1996) settings.

3.06.2.1 Questions

Grounded theory is designed to be usable with a very wide range of research questions and in the context of a variety of metatheoretical approaches. Rather like the statistical analyses that psychologists are more familiar with, it deals with patterns and relationships. However, these are not relationships between numbers but between ideas or categories of things, and the relationships can take a range of different forms. In some respects the procedures in grounded theory are like the operation of a sophisticated filing system where entries are cross-referenced and categorized in a range of different ways. Indeed, this is one qualitative approach that can be effectively helped by the use of computer packages such as NUDIST, which was itself developed to address grounded theory notions.

Grounded theory has proved particularly appropriate for studying people's understandings of the world and how these are related to their social context. For example, Turner (1994; Turner & Pidgeon, 1997) has used grounded theory to attempt to explain the origins of manmade disasters like fires and industrial accidents; Charmaz (1991) has studied the various facets that make up people's experience of chronic illness; Clegg, Standen, and Jones (1996) focused on the staff members' understanding of their relationship with adults with profound learning disabilities. In each case, a major concern was to incorporate the perspectives of the actors as they construct their particular social worlds. Grounded theory methods can help explicate the relation of actions to settings (how does the behavior of key personnel in the evolution of a major fire follow from their individual understanding of events and physical positioning?); they can be used for developing typologies of relevant phenomena (in what different ways do sufferers of chronic illness conceptualize their problem?); and they can help identify patterns in complex systems (how does the information flowing between social actors help explain the development of a laboratory smallpox outbreak?).

Like most of the qualitative approaches discussed here, grounded theory is not well suited to the kinds of hypothesis testing and outcome evaluation that have traditionally been grist to the mill of clinical psychology, because of its open-ended and inductive nature. Although the researcher is likely to come to a topic with a range of more or less explicit ideas, questions, and theories, it is not necessary for any or all of these to be formally stated before research gets under way. The approach can start with a specific problem or it may be more directed at making sense of an experience or setting.

Grounded theory can be applied to a range of different textual materials such as documents, interview transcripts and records of interaction, and this makes it particularly suitable for certain kinds of questions. It can deal with records which exist prior to the research and it can deal with materials specifically collected. The processes of coding allow quite large amounts of material to be dealt with. For example, while Turner studied a single (lengthy) official report of a major fire in a holiday complex, Charmaz studied 180 interviews with 90 different people with chronic illnesses. The requirement is only that the material can be coded.

3.06.2.2 Procedures

The procedures for conducting grounded theory work are straightforward to describe, if less so to follow in practice. Pidgeon and Henwood (1996, p. 88) provide a useful diagram to explicate the process (Figure 1).

3.06.2.2.1 Materials

In line with the general emphasis on participants' perspectives and on understanding patterns of relationships, researchers often attempt to obtain rich materials such as documents and conversational, semistructured interviews. These may be supplemented by participant observation in the research domain, generating fieldnotes which can be added to other data sets or simply used to improve the researcher's understanding so they can better deal with the other materials.

After data is collected and stored the intensive process that is most characteristic of grounded theory is performed. This involves coding the data, refining the coding and identifying links between categories, and writing "memos" which start to capture theoretical concepts and relationships.

3.06.2.2.2 Coding

Different grounded theory researchers approach coding in different ways. For example, it can involve generating index cards or making annotations next to the relevant text. The researcher works through the text line by line, or paragraph by paragraph, labeling the key concepts that appear. Charmaz (1995, p. 38) suggests a series of specific questions that are useful for picking out the key concepts:

(i) What is going on?
(ii) What are the people doing?
(iii) What is the person saying?
(iv) What do these actions and statements take for granted?
(v) How do structure and context serve to support, maintain, impede, or change these actions and statements?

More broadly, Pidgeon and Henwood suggest that this phase of coding is answering the question: "what categories or labels do I need in order to account for what is of importance to me in this paragraph?" (1996, p. 92).

Such coding is intensive and time consuming. For example, Table 2 shows an example by Charmaz of line-by-line coding of just a brief fragment of one of her 180 interviews. Note the way that the interview fragment is coded under a number of different topics. There is no requirement in grounded theory that categories apply exclusively.

3.06.2.2.3 Method of constant comparison

Coding is not merely a matter of carefully reading and labeling the materials. As the coding continues the researcher will be starting to identify categories that are interesting or relevant to the research questions. They will refine their indexing system by focused coding, which will pick out all the instances from the data coded as, for example, "avoiding disclosure." When such a collection has been produced the researcher can focus on the differences in the use of this category according to the setting or the actors involved. This is what grounded theorists refer to as the "method of constant comparison." In the course of such comparisons the category system may be reworked; some categories will be merged together and others will be broken up, as the close reading of the data allows an increasingly refined understanding.
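The filing-system image, and the steps of focused coding and constant comparison, can be made a little more concrete with a short sketch. The Python fragment below is illustrative only and is not taken from Charmaz, from Pidgeon and Henwood, or from any software package: the Segment record, the function names, the interview labels, the settings, the placeholder texts, and the merged category "managing disclosure" are all hypothetical, and only the other code labels are borrowed from Table 2. It shows segments cross-referenced under several codes at once, retrieval of every instance of one code, grouping of those instances by setting for comparison, and a merger of two categories.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One coded stretch of material, such as a line or paragraph of a transcript."""
    source: str        # hypothetical label for the interview or document
    setting: str       # hypothetical attribute used when comparing instances
    text: str          # the coded text itself (placeholders here)
    codes: frozenset   # several codes may apply; categories are not exclusive


def build_index(segments):
    """Cross-reference every segment under each of its codes."""
    index = defaultdict(list)
    for segment in segments:
        for code in segment.codes:
            index[code].append(segment)
    return index


def focused_coding(index, code):
    """Pick out all the instances coded under one category."""
    return list(index.get(code, []))


def compare_by(instances, attribute):
    """Group the instances of a category by setting or source for comparison."""
    groups = defaultdict(list)
    for segment in instances:
        groups[getattr(segment, attribute)].append(segment)
    return dict(groups)


def merge_codes(segments, old_codes, new_code):
    """Rework the category system by folding several codes into one."""
    old_codes = frozenset(old_codes)
    reworked = []
    for segment in segments:
        if segment.codes & old_codes:
            codes = (segment.codes - old_codes) | {new_code}
            segment = Segment(segment.source, segment.setting, segment.text, codes)
        reworked.append(segment)
    return reworked


# Invented placeholder material; no real interview data is reproduced here.
segments = [
    Segment("interview_01", "family", "<placeholder text A>",
            frozenset({"avoiding disclosure", "predicting rejection"})),
    Segment("interview_02", "workplace", "<placeholder text B>",
            frozenset({"avoiding disclosure"})),
    Segment("interview_02", "workplace", "<placeholder text C>",
            frozenset({"keeping others unaware"})),
]

index = build_index(segments)
instances = focused_coding(index, "avoiding disclosure")
by_setting = compare_by(instances, "setting")
merged = merge_codes(
    segments,
    {"avoiding disclosure", "keeping others unaware"},
    "managing disclosure",
)

for setting, group in by_setting.items():
    print(setting, [segment.text for segment in group])
```

The sketch handles only the clerical side of the work. Deciding which codes to apply, which differences between settings matter, and when two categories should be merged or split remains an interpretative judgement that the researcher makes through close reading of the material.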

Table 2 Line-by-line coding.

Coding
    Shifting symptoms, having inconsistent days
    Interpreting images of self given by others
    Avoiding disclosure
    Predicting rejection
    Keeping others unaware

Interview
    If you have lupus, I mean one day it's my liver; one day it's my joints; one day it's my head, and it's like people really think you're a hypochondriac if you keep complaining about different ailments. . . . It's like you don't want to say anything because people are going to start thinking, you know, "God, don't go near her, all she is - is complaining about this."

Source: Charmaz (1995, p. 39)

Figure 1 Procedures for conducting grounded theory research.
[Flow diagram. Labels as extracted: Data collection; Data preparation; Data storage; Initial analysis; Coding; Refine indexing system; Core analysis; Memo writing; Category linking; Key concepts; Definitions; Outcomes; Memos; Relationships and models.]

3.06.2.2.4 Memo writing

Throughout this stage of coding and comparison grounded theorists recommend what they call memo writing, that is, writing explicit notes on the assumptions that underlie any particular coding. Memo writing is central to the process of building theoretical understandings from the categories, as it provides a bridge between the categorization of data and the writing up of the research. In addition to the process of refining categories, a further analytic task involves linking categories together. The goal here is to start to model relationships between categories. Indeed, one possibility may be the production of a diagram or flow chart which explicitly maps relationships.

As Figure 1 makes clear, these various elements in grounded theory analysis are not discrete stages. In the main phase of the analysis, refining the indexing system is intimately bound up with the linking of categories, and memo writing is likely to be an adjunct to both. Moreover, this analysis may lead the researcher back to the basic line-by-line coding, or even suggest the need to collect further material for analysis.

3.06.2.2.5 Validity

There is a sense in which the general methodological procedures of grounded theory are together designed to provide a level of validation as they force a thoroughgoing engagement with the research materials. Line-by-line coding, constant comparison, and memo writing are all intended to ensure that the theoretical claims made by the analyst are fully grounded in the data. That, after all, was the original key idea of grounded theory. However, some specific procedures of validation have been proposed.

Some grounded theorists have suggested that respondent validation could be used as a criterion. This involves the researcher taking back interpretations to the participants to see if they are accepted. The problem with such an approach is that participants may agree or disagree for a range of social reasons or they may not understand what is being asked of them. Other approaches to validation suggest that research should be generative, that is, facilitate further issues and questions; have rhetorical power, that is, prove effective in persuading others of its effectiveness; or that there could be an audit trail which would allow another researcher to check how the conclusions were reached (Henwood & Pidgeon, 1995; Lincoln & Guba, 1985).

3.06.2.3 Example: Clegg, Standen, and Jones (1996)

There is no example in clinical psychology of the sorts of full-scale grounded theory study that Charmaz (1991) conducted on the experience of illness or Glaser and Strauss (1968) carried out on hospital death. Nevertheless, a modest study by Clegg et al. (1996) of the relationships between staff members and adults with profound learning disabilities illustrates some of the potential of grounded theory for clinical work.

In line with the grounded theory emphasis on the value of comparison, 20 staff members from four different residential settings were recruited. Each member of staff was videotaped during eight sessions of interaction with a single client that they had known for over three months. Each staff member was subsequently interviewed about their work, their relationship with the client, a specific experience with the client, and their understanding of the client's view of the world. These conversational interviews were tape recorded and transcribed, and form the data for the major part of the study.

The study followed standard grounded theory procedures in performing a range of codings, developing links between categories, and building up theoretical notions from those codings. The range of different outcomes of the study is typical of grounded theory. In the first place, the authors identify four kinds of relationships that staff may have with clients: provider (where meeting of the client's needs is primary); meaning-maker (where making sense of the client's moods or gestures is primary); mutual (where shared experience and joy at the client's development is primary); companion (where merely being together is treated as satisfying). The authors go on to explore the way different settings and different client groups were characterized by different relationships. Finally, they propose that the analysis supports four propositions about staff-client relationships: some types of relationship are better than others (although this will vary with setting and client); staff see the balance of control in the relationship as central; families can facilitate relationships; professional networks create dilemmas.

3.06.2.4 Virtues and Limitations

Grounded theory has a range of virtues. It is flexible with respect to forms of data and can be applied to a wide range of questions in varied domains. Its procedures, when followed fully, force the researcher into a thorough engagement with the materials; it encourages a slow-motion reading of texts and transcripts that should avoid the common qualitative research trap of trawling a set of transcripts for quotes to illustrate preconceived ideas. It makes explicit some of the data management procedures that are commonly left inexplicit in other qualitative approaches. The method is at its best where there is an issue that is tractable from a relatively common sense actor's perspective. Whether studying disasters, illness, or staff relationships, the theoretical notions developed are close to the everyday notions of the participants. This makes the work particularly suitable for policy implementation, for the categories and understandings of the theory are easily accessible to practitioners and policy makers.

Some problems and questions remain, however. First, although there is a repeated emphasis on theory (after all, it is in the very name of the method), the notion of theory is a rather limited one, strongly influenced by the empiricist philosophy of science of the 1950s. The approach works well if theory is conceived in a limited manner as a pattern of relationships between categories, but less well if theories are conceived of as, say, models of underlying generative mechanisms (Harré, 1986).

Second, one of the claimed benefits of grounded theory work is that it works with the perspective of participants through its emphasis on accounts and reports. However, one of the risks of processes such as line-by-line coding is that it leads to a continual pressure to assign pieces of talk or elements of texts to discrete categories rather than seeing them as inextricably bound up with broader sequences of talk or broader textual narratives. Ironically this can mean that instead of staying with the understandings of the participants their words are assigned to categories provided by the analyst.

Third, grounded theorists have paid little attention to the sorts of problems in using textual data that ethnomethodologists and discourse analysts have emphasised (Atkinson, 1978; Gilbert & Mulkay, 1984; Silverman, 1993; Widdicombe & Wooffitt, 1995). For example, how far is the grounding derived not from theorizing but from reproducing common sense theories as if they were analytic conclusions? How far are Clegg et al.'s (1996) staff participants, say, giving an accurate picture of their relationships with clients, and how far are they drawing on a range of ideas and notions to deal with problems and work up identities in the interview itself?

Some practitioners are grappling with these problems in a sophisticated manner (Pidgeon & Henwood, 1996). As yet there is not a large enough body of work with clinical materials to allow a full evaluation of the potential of this method. For more detailed coverage of grounded theory the reader should start with the excellent introductions by Pidgeon (1996) and Pidgeon and Henwood (1996); Charmaz (1991) provides a full scale research illustration of the potential of grounded theory; Rafuls and Moon (1996) discuss grounded theory in the context of family therapy; and, despite its age, Glaser and Strauss (1967) is still an informative basis for understanding the approach.

3.06.3 ETHNOGRAPHY AND PARTICIPANT OBSERVATION

Ethnography is not a specific method so much as a general approach which can involve a number of specific research techniques such as interviewing and participant observation. Indeed, this has been a source of some confusion as rather different work is described as ethnography in anthropology, sociology, and other disciplines such as education or management science. The central thrust of ethnographic research is to study people's activities in their natural settings. The concern is to get inside the understanding and culture, to develop a subtle grasp of how members view the world, and why they act as they do. Typically, there is an emphasis on the researcher being involved in the everyday world of those who are being studied. Along with this goes a commitment to working with unstructured data (i.e., data which has not been coded at the point of data collection) and a tendency to perform intensive studies of small numbers of cases. Ethnography is not suited to the sorts of hypothetico-deductive procedures that are common in psychology.

Two important tendencies in current ethnographic work can be seen in their historical antecedents in anthropology and sociology. In the nineteenth century, anthropology often involved information collected by colonial administrators about members of indigenous peoples. Its use of key informants and its focus on revealing the details of exotic or hidden cultures continue in much work. Sociologists of the "Chicago School" saw ethnography as a way of revealing the lives and conditions of the US underclass. This social reformism was married to an emphasis on understanding their participants' lives from their own perspective.

3.06.3.1 Questions

Ethnography comes into its own where the researcher is trying to understand some particular sub-cultural group in a specific setting. It tends to be holistic, focusing on the entire experience participants have of a setting, or their entire cosmology, rather than focusing on discrete variables or phenomena. This means that ethnographic questions tend to be general: What happens in this setting? How do this group understand their world? In Goffman's (1961) Asylums he tried to reveal the different worlds lived by the staff and inmates, and to describe and explicate some of the ceremonies that were used to reinforce boundaries between the two groups. A large part of his work tracked what he called the "moral careers" of inmates from prepatient, through admission, and then as inpatients. Much of the force and influence of Goffman's work derived from its revelations about the grim "unofficial" life lived by patients in large state mental hospitals in the 1950s. In this respect it followed in the Chicago school tradition of exposé and critique. Rosenhan's (1973) classic study of hospital admission and the life of the patient also followed in this tradition. Famously it posed the question of what was required to be diagnosed as mentally ill and then incarcerated, and discovered that it was sufficient to report hearing voices saying "empty," "hollow," and "thud." This "pseudopatient" study was designed with a very specific question about diagnostic criteria in mind; however, after the pseudopatients were admitted they addressed themselves to more typically ethnographic concerns, such as writing detailed descriptions of their settings, monitoring patient contact with different kinds of staff, and documenting the experience of powerlessness and depersonalization.

Goffman's and Rosenhan's work picks up the ethnographic traditions of revealing hidden worlds and providing a basis for social reform. Jodelet (1991) illustrates another analytic possibility by performing an intensive study of one of the longest running community care schemes in the world, the French colony of Ainay-le-Château, where mental patients live with ordinary families. Again, in line with the possibilities of ethnography, she attempted to explore the whole setting, including the lives of the patients and their hosts and their understandings of the world. Her work, however, is notable for showing how ethnography can explore the representational systems of participants and relate that system to the lives of the participants. To give just one small example, she shows the way the families' representation of a close link between madness and uncleanliness relates to practices such as taking meals separately from the lodgers.

Another topic for ethnographic work has been the practice of psychotherapy itself (Gubrium, 1992; Newfield, Kuehl, Joanning, & Quinn, 1990, 1991). These studies are interested in the experience of patients in therapy and their conceptions of what therapy is, as well as the practices and conceptions of the therapists. Such studies do not focus exclusively on the interaction in the therapy session itself, but on the whole setting.
Her work, however, is the ªChicago Schoolº saw ethnography as a notable for showing how ethnography can way of revealing the lives and conditions of the explore the representational systems of partici- US underclass. This social reformism was pants and relate that system to the lives of the married to an emphasis on understanding their participants. To give just one small example, she participants' lives from their own perspective. shows the way the families' representation of a close link between madness and uncleanliness 3.06.3.1 Questions relates to practices such as taking meals separately from the lodgers. Ethnography comes into its own where the Another topic for ethnographic work has been researcher is trying to understand some parti- the practice of psychotherapy itself (Gubrium, cular sub-cultural group in a specific setting. It 1992; Newfield, Kuehl, Joanning, & Quinn, tends to be holistic, focusing on the entire 1990, 1991). These studies are interested in the experience participants have of a setting, or experience of patients in therapy and their their entire cosmology, rather than focusing on conceptions of what therapy is, as well as the discrete variables or phenomena. This means practices and conceptions of the therapists. Such that ethnographic questions tend to be general: studies do not focus exclusively on the interac- What happens in this setting? How do this tion in the therapy session itself, but on the whole group understand their world? In Goffman's setting. Ethnography and Participant Observation 129

Although ethnography is dependent on close and systematic description of practices it is not necessarily atheoretical. Ethnographic studies are often guided by one of a range of theoretical conceptions. For example, Jodelet's (1991) study was informed by Moscovici's (1984) theory of social representations and Gubrium's (1992) study of family therapy was guided by broader questions about the way the notion of family and family disorder are constructed in Western society. Ethnography has sometimes been treated as particularly appropriate for feminist work because of the possibility of combining concerns with experience and social context (e.g., Ronai, 1996).

3.06.3.2 Procedures

Ethnography typically involves a mix of different methods with interviews and participant observation being primary, but often combined with nonparticipant observation and the analysis of documents of one kind or another. Such a mixture raises a large number of separate issues which will only be touched on here. There are whole books on some elements of ethnographic work such as selecting informants (Johnson, 1990), living with informants (Rose, 1990), and interpreting ethnographic writings (Atkinson, 1990). The focus here is on research access, field relations, interviewing, observing, fieldnotes, and analysis (see Ellen, 1984; Fetterman, 1989; Fielding, 1993; Hammersley & Atkinson, 1995; Rachel, 1996; Toren, 1996; Werner & Schoepfle, 1987).

3.06.3.2.1 Access

Research access is often a crucial issue in ethnography, as a typical ethnographic study will require not only access to some potentially sensitive group or setting but may involve the researcher's bodily presence in delicate contexts. Sitting in on a family therapy session, for example, may involve obtaining consent from a range of people who have different concerns and has the potential for disrupting, or at least subtly changing, the interaction that would have taken place. There are only restricted possibilities for the ethnographer to enter settings with concealed identities, as Goffman did with his mental hospital study, and Rosenhan's pseudopatients did. Moreover, such practices raise a host of ethical and practical problems which increasingly lead ethnographers to avoid deception. Access can present another problem in sensitive settings if it turns out that it is precisely unusual examples where access is granted, perhaps because the participants view them as models of good practice.

3.06.3.2.2 Field relations

Field relations can pose a range of challenges. Many of these follow from the nature of the participation of the researcher in the setting. How far should researchers become full participants and how far should they stay uninvolved observers? The dilemma here is that much of the power of ethnography comes from the experience and knowledge provided by full participation, and yet such participation may make it harder to sustain a critical distance from the practices under study. The ethnographer should not be converted to the participants' cultural values; but neither should they stay entirely in the Martian role that will make it harder to understand the subtle senses through which the participants understand their own practices. Field relations also generate many practical, prosaic, but nevertheless important problems which stem from the sheer difficulty of maintaining participant status in an unfamiliar and possibly difficult setting for a long period of time. At the same time a whole set of skills is required for building productive and harmonious relationships with participants.

3.06.3.2.3 Fieldnotes

One of the central features of participant observation is the production of fieldnotes. Without notes to take away there is little point in conducting observation. In some settings it may be possible to take notes concurrently with the action but often the researcher will need to rely on their memory, writing up notes on events as soon as possible after they happened. A rule of thumb is that writing up fieldnotes will take just as much time as the original period of observation (Fielding, 1993). In some cases it may be possible to tape record interaction as it happens. However, ethnographers have traditionally placed less value on recording as they see the actual process of note taking as itself part of the process through which the researcher comes to understand connections between processes and underlying elements of interaction.

Ethnographers stress that fieldnotes should be based around concrete descriptions rather than consisting of abstract higher-order interpretations. The reason for this is that when observation is being done it may not yet be clear what questions are to be addressed. Notes that stay at low levels of inference are a resource that can be used to address a range of different questions. Fielding argues that fieldnotes are expected:

to provide a running description of events, people and conversation. Consequently each new setting observed and each new member of the setting merits description. Similarly, changes in the human or other constituents of the setting should be recorded. (1993, p. 162)

It is also important to distinguish in notes between direct quotation and broad précis of what participants are saying. A final point emphasised by ethnographers is the value of keeping a record of personal impressions and feelings.

3.06.3.2.4 Interviews

Ethnographers make much use of interviews. However, in this tradition interviews are understood in a much looser manner than in much of psychology. Indeed, the term interview may be something of a misnomer with its image of the researcher running through a relatively planned set of questions with a single passive informant in a relatively formal setting. In ethnography what is involved is often a mix of casual conversations with a range of different participants. Some of these may be very brief, some extended, some allowing relatively formal questioning, others allowing no overt questioning. In the more formal cases the interview may be conducted with a planned schedule of questions and the interaction is recorded and transcribed.

3.06.3.2.5 Analysis

There is not a neat separation of the data collection and analytic phases of ethnographic research. The judgements about what to study, what to focus on, which elements of the local culture require detailed description, and which can be taken for granted, are already part of analysis. Moreover, it is likely that in the course of a long period of participant observation, or a series of interviews, the researcher will start to develop accounts for particular features of the setting, or begin to identify the set of representations shared by the participants. Such interpretations are refined, transformed, and sometimes abandoned when the fieldwork is completed and the focus moves on to notes and transcripts.

Fielding (1993, p. 192) suggests that the standard pattern of work with ethnographic data is straightforward. The researcher starts with the fieldnotes and transcripts and searches them for categories and patterns. These themes form a basis for the ethnographic account of the setting, and they also structure the more intensive analysis and probably the write-up of the research. The data will be marked or cut up (often on computer text files) to collect these themes together. In practice, the ethnographer is unlikely to attempt a complete account of a setting, but will concentrate on a small subset of themes which are most important or which relate to prior questions and concerns.

The analytic phase of ethnography is often described in only the sketchiest terms in ethnography guidebooks. It is clear that ethnographers often make up their own ways of managing the large amount of materials that they collect, and of using that material in convincing research accounts. At one extreme, ethnography can be considered an approach to develop the researcher's competence in the community being studied: they learn to be a member, to take part actually and symbolically, and they can use this competence to write authoritatively about the community (Collins, 1983). Here extracts from notes and interview transcripts become merely exemplars of the knowledge that the researcher has gained through participation. At the other extreme, ethnography blurs into grounded theorizing, with the notes and transcripts being dealt with through line-by-line coding, comparison of categories, and memo writing. Here the researcher's cultural competence will be important for interpreting the material, but the conclusions will ultimately be dependent on the quality of the fieldnotes and transcripts and what they can support.

3.06.3.2.6 Validity

One of the virtues of ethnography is its rich ecological validity. The researcher is learning directly about what goes on in a setting by observing it, by taking part, and/or by interviewing the members. This circumvents many of the inferences that are needed in extrapolating from more traditional psychological research tools (questionnaires, experimental simulations) to actual settings. However, the closeness of the researcher to the setting does not in itself ensure that the research that is produced will be of high quality.

The approach to validity most often stressed by ethnographers is triangulation. At the level of data this involves checking to see that different informants make the same sorts of claims about actions or events. At the level of method, it involves checking that conclusions are supported by different methods, for example, by both interviews and observation. However, triangulation is not without its problems. Discourse researchers have noted that in practice the sheer variability in and between accounts makes triangulation of only limited use (Potter & Wetherell, 1987) and others have identified conceptual problems in judging what a successful triangulation between methods would be (Silverman, 1993).
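The data-level check described above can be made concrete with a small sketch. It is only a minimal illustration in Python, not a procedure taken from the ethnographic literature: the function name `triangulate`, the record layout (informant, event, claim), and the sample data are all assumptions invented for the example. It groups claims by event and flags the events on which informants' accounts diverge.

```python
from collections import defaultdict


def triangulate(claims):
    """Group informants' claims by event and flag events where accounts differ.

    `claims` is a list of (informant, event, claim) tuples. All three fields
    are hypothetical labels a researcher might assign when coding fieldnotes.
    """
    by_event = defaultdict(dict)
    for informant, event, claim in claims:
        by_event[event][informant] = claim

    report = {}
    for event, accounts in by_event.items():
        distinct = set(accounts.values())
        report[event] = {
            "informants": sorted(accounts),
            # Agreement needs at least two informants giving one and the same claim.
            "agree": len(accounts) > 1 and len(distinct) == 1,
            "claims": sorted(distinct),
        }
    return report


# Invented illustrative data, not drawn from any study.
claims = [
    ("counselor_A", "intake_meeting", "family arrived late"),
    ("counselor_B", "intake_meeting", "family arrived late"),
    ("parent", "intake_meeting", "family arrived on time"),
    ("counselor_A", "session_2", "father dominated the talk"),
    ("counselor_B", "session_2", "father dominated the talk"),
]
report = triangulate(claims)
for event, summary in report.items():
    print(event, "agree" if summary["agree"] else "diverge", summary["claims"])
```

In keeping with the reservations noted above, an exact-match test of this kind is crude. Accounts normally vary in wording and emphasis, so the researcher would first have to decide what counts as "the same sort of claim" before any such comparison means much.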

3.06.3.3 Example: Gubrium (1992)

There are only a small number of ethnographies done in clinical settings or on clinical topics. For instance, many of the clinical examples in Newfield, Sells, Smith, Newfield, & Newfield's (1996) chapter on ethnography in family therapy are unpublished dissertations. The body of ethnographic work is small but increasing rapidly. I have chosen Gubrium's (1992) study of family therapy in two institutions as an example because it is a book-length study, and it addresses the therapeutic process itself rather than concentrating solely on the patients' lives in hospital or community care schemes. However, it is important to stress that Gubrium's focus was as much on what the therapy revealed about the way notions of family and order are understood in American culture as on therapeutic techniques and effectiveness. He was concerned with the way behaviours such as truancy take their sense as part of a troubled family and the way service professionals redefine family order as they instigate programmes of rehabilitation.

Gubrium's choice of two contrasting institutions is a commonplace one in ethnography. The small number enables an intensive approach; having more than one setting allows an illuminating range of comparisons and contrasts. In this case one was inpatient, one outpatient; one more middle class than the other; they also varied in their standard approaches to treatment. The virtues of having two field sites shine through in the course of the write-up, although it is interesting to note that the selection was almost accidental as the researcher originally expected to be successful in gaining access to only one of the institutions.

The fieldwork followed a typical ethnographic style of spending a considerable amount of time at the two facilities, talking to counselors, watching them at work in therapy sessions, reviewing videos of counseling, and making fieldnotes. The study also drew on a range of documents including patients' case notes and educational materials. In some ways this was a technically straightforward setting for ethnographic observation as many of the participants were themselves university trained practitioners who made notes and videos as part of the general workings of the facilities.

One of the striking differences between this ethnographic study of therapy and a typical process or outcome study is that the therapy is treated as a part of its physical, institutional, and cultural contexts. For instance, time is spent documenting the organization of the reception areas of the two facilities and the way the counselors use the manner in which the families seat themselves in that area as evidence of family dynamics. Gubrium writes about the important role of tissue boxes in both signaling the potential for emotional display and providing practical support when such display occurs:

I soon realized that tissues were about more than weeping and overall emotional composure during therapy. Tissues mundanely signaled the fundamental reality of the home as locally understood: a configuration of emotional bonds. For Benson [a counselor] their usage virtually put the domestic disorder of the home on display, locating the home's special order in the minutiae of emotional expression. (Gubrium, 1992, p. 26)

The ethnographic focus on events in context means that therapy is treated as a product of actual interactions full of contingency and locally managed understandings. It shows the way abstract notions such as family systems or tough love are managed in practice, and the way the various workers relate to each other as well as to the clients. It provides an insight into the world of family therapy quite different from most other styles of research.

3.06.3.4 Virtues and Limitations

Much of the power of ethnographic research comes from its emphasis on understanding people's actions and representations both in context and as part of the everyday practices that make up their lives, whether they are Yanomami Indians or family therapists. It can provide descriptions which pick out abstract organizations of life in a setting as well as allowing the reader rich access. Ethnography can be used in theory development and even theory testing (Hammersley & Atkinson, 1995). It is flexible; research can follow up themes and questions as they arise rather than necessarily keeping to preset goals.

Ethnographic research can be very time consuming and labor intensive. It can also be very intrusive. Although Gubrium was able to participate in some aspects of family therapy, this was helped by the sheer number of both staff and family members who were involved. It is less easy to imagine participant observation on individual therapy.

One of the most important difficulties with ethnographic work is that the reader often has to take on trust the conclusions because the evidence on which they are based is not available for assessment (Silverman, 1993). Where field notes of observations are reproduced in ethnographies (and this is relatively rare), such notes are nevertheless a ready-theorized version of events. Descriptions of actions and events are always bound up with a range of judgments (Potter, 1996a). Where analysis depends on the claims of key informants the problem is assessing how these claims relate to any putative activities that are described. Ethnographers deal with these problems with varying degrees of sophistication (for discussion see Nelson, 1994). However, some researchers treat them as inescapable and have turned to some form of discourse analysis instead.

For more detailed discussions of ethnography readers should start with Fielding's (1993) excellent brief introduction and then use Hammersley and Atkinson (1995) as an authoritative and up to date overview. Two of the most comprehensive, although not always most sophisticated, works are Werner and Schoepfle (1987) and Ellen (1984). Both were written by anthropologists, and this shows in their understanding of what is important. Newfield, Sells, Smith, Newfield, and Newfield (1996) provide a useful discussion of ethnography in family therapy research.

3.06.4 DISCOURSE ANALYSIS

Although both grounded theorizing and ethnographic work in clinical areas have increased, the most striking expansion has been in research in discourse analysis (Soyland, 1995). This work is of variable quality and often done by researchers isolated in different subdisciplines; moreover, it displays considerable terminological confusion. For simplicity, discourse analysis is taken as covering a range of work which includes conversation analysis and ethnomethodology (Heritage, 1984; Nofsinger, 1991), some specific traditions of discourse analysis and discursive psychology (Edwards & Potter, 1992a; Potter & Wetherell, 1987), some of the more analytically focused social constructionist work (McNamee & Gergen, 1992), and a range of work influenced by post-structuralism, Continental discourse analysis, and particularly the work of Foucault (Burman, 1995; Madigan, 1992; Miller & Silverman, 1995). In some research these different themes are woven together; elsewhere strong oppositions are marked out.

The impetus for discourse analysis in clinical settings comes from two directions. On the one hand, there are practitioner/researchers who have found ideas from social constructionism, literary theory, and narrative useful (e.g., Anderson & Goolishian, 1988; White & Epston, 1990). On the other, there are academic researchers who have extended analytic and theoretical developments in discourse studies to clinical settings (e.g., Aronsson & Cederborg, 1996; Bergmann, 1992; Buttny, 1996; Edwards, 1995; Lee, 1995). Collections such as Siegfried (1995), Burman, Mitchel, and Salmon (1996), and Morris and Chenail (1995) reflect both types of work, sometimes in rather uneasy combination (see Antaki, 1996).

This tension between an applied and academic focus is closely related to the stance taken on therapy. In much discourse analysis, therapy is the start point for research and the issue is how therapy gets done. For example, Gale's (1991; Gale & Newfield, 1992) intensive study of one of O'Hanlon's therapy sessions considered the various ways in which the goals of solution focused family therapy were realized in the talk between therapist and client. However, some conversation analysts and ethnomethodologists resist assuming that conversational interaction glossed by some parties as therapy (solution focused, Milan School, or whatever) must have a special ingredient X (therapy) that is absent in, say, the everyday "troubles talk" done with a friend over the telephone (Jefferson, 1988; Jefferson & Lee, 1992; Schegloff, 1991; Watson, 1995).

This is a significant point for all researchers into therapy and counseling, so it is worth illustrating with an example. In Labov and Fanshel's classic study, the therapy session starts in the following manner:

Rhoda: I don't (1.0) know, whether (1.5) I-I think I did- the right thing, jistalittle situation came up (4.5) an' I tried to uhm (3.0) well, try to (4.0) use what I- what I've learned here, see if it worked (0.3)
Therapist: Mhm
Rhoda: Now, I don't know if I did the right thing. Sunday (1.0) um- my mother went to my sister's again.
Therapist: Mm-hm
Rhoda: And she usu'lly goes for about a day or so, like if she leaves on Sunday, she'll come back Tuesday morning. So- it's nothing. But- she lef' Sunday, and she's still not home.
Therapist: O- oh.
(1977, p. 263)

Labov and Fanshel provide many pages of analysis of this sequence. They identify various direct and indirect speech acts and make much of what they call its therapeutic interview style, particularly the vague reference terms at the start: "right thing" and "jistalittle situation." This vagueness can easily be heard as the 19-year-old anorexia sufferer struggling to face up to her relationship with her difficult mother. However, in a reanalysis from a conversation analytic perspective, Levinson (1983) suggests that this sequence is characteristic of mundane news telling sequences in everyday conversation. These typically have four parts: the pre-announcement, the go ahead, the news telling, and the news receipt. For example:

D: I forgot to to tell you the two best things that happen' to me today. [preannouncement]
R: Oh super = what were they. [go ahead turn]
D: I got a B+ on my math test . . . and I got an athletic award [news telling]
R: Oh excellent. [news receipt]
(Levinson, 1983, p. 349, slightly modified)

A particular feature of preannouncements is their vagueness, for their job is to prefigure the story (and thereby check its newsworthiness), not to actually tell it. So, rather than following Labov and Fanshel (1977) in treating this vagueness as specific to a troubled soul dealing with a difficult topic in therapy, Levinson (1983) proposes that it should be understood as a commonplace characteristic of mundane interaction.

This example illustrates a number of features typical of a range of discursive approaches to therapy talk. First, the talk is understood as performing actions; it is not being used as a pathway to cognitive processes (repression, say) or as evidence of what Rhoda's life is like (her difficult mother). Second, the interaction is understood as sequentially organized so any part of the talk relates to what came immediately before and provides an environment for what will immediately follow. The realization of how far interaction gets its sense from its sequential context has critical implications for approaches such as content analysis and grounded theory which involve making categorizations and considering relations between them; for such categorizations tend to cut across precisely the sequential relations that are important for the sense of the turn of talk.

The third feature is that the talk is treated as ordered in its detail not merely in its broad particulars. For example, Levinson (1983) highlights a number of orderly elements in what we might easily mistake for clumsiness in Rhoda's first turn:

R's first turn is . . . formulated to prefigure (i) the telling of something she did (I think I did the right thing), and (ii) the describing of the situation that led to the action (jistalittle situation came up). We are therefore warned to expect a story with two components; moreover the point of the story and its relevance to the here and now is also prefigured (use what I've learned here, see if it worked). (1983, p. 353)

Even the hesitations and glottal stops in Rhoda's first turn, which seem so redolent of a troubled young person, are "typical markings of self-initiated self-repair, which is characteristic of the production of first topics" (Levinson, 1983, p. 353). This emphasis on the significance of detail has an important methodological consequence: if interaction is to be understood properly it must be represented in a way that captures this detail. Hence the use of a transcription scheme that attempts to represent a range of paralinguistic features of talk (stress, intonation) on the page as well as elements of the style of its delivery (pauses, cut-offs).

A fourth feature to note here is the comparative approach that has been taken. Rather than focus on therapy talk alone Levinson is able to support an alternative account of the interaction by drawing on materials, and analysis, taken from mundane conversations. Since the mid 1980s there has been a large amount of work in different institutional settings as well as everyday conversation, and it is now possible to start to show how a news interview, say, differs from the health visitor's talk with an expectant mother, and how that differs in turn from conversation between two friends over the telephone (Drew & Heritage, 1992a).

A fifth and final feature of this example is that it is an analysis of interaction. It is neither an attempt to reduce what is going on to cognitions of the various parties (Rhoda's denial, say, or the therapist's eliciting strategies) nor to transform it into countable behaviors such as verbal reinforcers. This style of discourse work develops a Wittgensteinian notion of cognitive words and phrases as elements in a set of language games for managing issues of blame, accountability, description, and so on (Coulter, 1990; Edwards, 1997; Harré & Gillett, 1994). Such a "discursive psychology" analyzes notions such as attribution and memory in terms of the situated practices through which responsibility is assigned and the business done by constructing particular versions of past events (Edwards & Potter, 1992b, 1993).

3.06.4.1 Questions

Discourse analysis is more directly associated with particular theoretical perspectives (ethnomethodology, post-structuralism, discursive psychology) than either grounded theory or ethnography. The questions it addresses focus on the practices of interaction in their natural contexts and the sorts of discursive resources that are drawn on in those contexts. Some of the most basic questions concern the standardized sequences of interaction that take place in therapy and counseling (Buttny & Jensen, 1995; Lee, 1995; Peräkylä, 1995; Silverman, 1997b). This is closely related to a concern with the activities that take place. What is the role of the therapist's tokens such as "okay" or "mm hm" (Beach, 1995; Czyzewski, 1995)? How do different styles of questioning perform different tasks (Bergmann, 1992; Peräkylä, 1995; Silverman, 1997b)? What is the role of problem formulations by both therapists and clients, and how are they transformed and negotiated (Buttny, 1996; Buttny & Jensen, 1995; Madill & Barkham, 1997)? For example, in a classic feminist paper Davis (1986) charts the way a woman's struggles with her oppressive social circumstances are transformed into individual psychological problems suitable for individual therapy. While much discourse research is focused on the talk of therapy and counseling itself, studies in other areas show the value of standing back and considering clinical psychology as a set of work practices in themselves, including management of clients in person and as records, conducting assessments, delivering diagnoses, intake and release, stimulating people with profound learning difficulties, case conferences and supervisions, offering advice, and managing resistance (see Drew & Heritage, 1992a). Discourse researchers have also been moving beyond clinical settings to study how people with clinical problems or learning difficulties manage in everyday settings (Brewer & Yearley, 1989; Pollner & Wikler, 1985; Wootton, 1989).

Another set of questions is suggested by the perspective of discursive psychology. For example, Edwards (1997) has studied the rhetorical relationship between problem formulations, descriptions of activities, and issues of blame in counseling. Cheston (1996) has studied the way descriptions of the past in a therapy group of older adults can create a set of social identities for the members. Discursive psychology can provide a new take on emotions, examining how they are constructed and their role in specific practices (Edwards, 1997).

From a more Foucaultian inspired direction, studies may consider the role of particular discourses, or interpretative repertoires, in constructing the sense of actions and experiences (Potter & Wetherell, 1987). For example, what are the discursive resources that young women draw on to construct notions of femininity, agency, and body image in the context of eating disorders (Malson & Ussher, 1996; Wetherell, 1996)? What discourses are used to construct different notions of the person, of the family, and of breakdown in therapy (Burman, 1995; Frosh, Burck, Strickland-Clark, & Morgan, 1996; Soal & Kotter, 1996)? This work is often critical of individualistic conceptions of therapy.

Finally, discourse researchers have stood back and taken the administration of psychological research instruments as their topic. The intensive focus of such work can show the way that the sort of idiosyncratic interaction that takes place when filling in questionnaires or producing records can lead to particular outcomes (Antaki & Rapley, 1996; Garfinkel, 1967b; Rapley & Antaki, 1996; Soyland, 1994).

Different styles of discourse work address rather different kinds of questions. However, conversation analytic work is notable in commonly starting from a set of transcribed materials rather than preformulated research questions, on the grounds that such questions often embody expectations and assumptions which prevent the analyst seeing a range of intricacies in the interaction. Conversation analysis reveals an order to interaction that participants are often unable to formulate in abstract terms.

3.06.4.2 Procedures

The majority of discourse research in the clinical area has worked with records of natural interaction, although a small amount has used open-ended interviews. There is not space here to discuss the role of interviews in discourse analysis or qualitative research generally (see Kvale, 1996; Mischler, 1986; Potter & Mulkay, 1985; Widdicombe & Wooffitt, 1995). For simplicity discourse work will be discussed in terms of seven elements.

3.06.4.2.1 Research materials

Traditionally psychologists have been reluctant to deal with actual interaction, preferring to model it experimentally, or reconstruct it via scales and questionnaires. Part of the reason for this lies in the prevalent cognitivist assumptions which have directed attention away from interaction itself to focus on generative mechanisms within the person. In contrast, discourse researchers have emphasised the primacy of practices of interaction themselves. The most obvious practice setting for clinical work is the therapy session itself, and this has certainly received the most attention. After all, there is an elegance in studying the "talking cure" using methods designed to acquire an understanding of the nature of talk. However, there is a danger that such an exclusive emphasis underplays mundane aspects of clinical practices: giving advice, offering a diagnosis, the reception of new clients, casual talk to clients' relatives, writing up clinical records, case conferences, clinical training, and assessment.

Notions of sample size do not translate easily from traditional research as the discourse research focus is not so much on individuals as on interactional phenomena of various kinds. Various considerations can come to the fore here, including the type of generalizations that are to be made from the research, the time and resources available, and the nature of the topic being studied. For example, if the topic is the role of "mm hms" in therapy a small number of sessions may generate a large corpus; other phenomena may be much rarer and require large quantities of interaction to develop a useful corpus. For some questions, single cases may be sufficient to underpin a theoretical point or reveal a theoretically significant phenomenon.

3.06.4.2.2 Access and ethics

One of the most difficult practical problems in conducting discourse research involves getting access to sometimes sensitive settings in ways which allow for informed consent from all the parties involved. Experience suggests that more often than not it is the health professionals

may be missed from the tape, and good quality equipment is now compact and cheap. On the other hand, video can be more intrusive, particularly where the recording is being done by one of the participants (a counselor, say), and may be hard to position so it captures gestures and expressions from all parties to an interaction. Video poses a range of practical and theoretical problems with respect to the transcription of nonvocal activity, which can be time consuming and can produce a transcript that is difficult to understand. Moreover, there is now a large body of studies that shows high quality analysis can, in many cases, be performed with an audio tape alone. One manageable solution is to use video if doing so does not disrupt the interaction, and then to transcribe the audio and work with a combination of video tape and audio transcript. Whether audio or video is chosen, the quality (clear audibility and visibility) is probably the single most consequential feature of the recording for the later research.

Another difficulty is how far the recording of interaction affects its nature. This is a subtle issue. On the one hand, there is a range of ways of minimizing such influences including acclimatizing participants and giving clear descriptions of the role of the research. On the other, experience has shown that recording has little influence on many, perhaps most, of the activities in which the discourse researcher is interested. Indeed, in some clinical settings recordings may be made as a matter of course for purposes of therapy and training, and so no new disruption is involved.
rather than the clients who are resistant to their practices being studied, perhaps because they 3.06.4.2.4 Transcription are sensitive to the difference between the idealized version of practices that was used in Producing a high-quality transcript is a training and the apparently more messy pro- crucial prerequisite for discourse research. A cedures in which they actually engage. Some- transcript is not a neutral, simple rendition of times reassurances about these differences can the words on a tape (Ochs, 1979). Different be productive. transcription systems emphasize different fea- Using records of interaction such as these tures of interaction. The best system for most raise particular issues for ensuring anonymity. work of this kind was developed by the This is relatively manageable with transcripts conversation analyst Gail Jefferson using where names and places can be changed. It is symbols that convey features of vocal delivery harder for audio tape and harder still with that have been shown to be interactionally video. However, technical advances in the use of important to participants (Jefferson, 1985). At digitized video allow for disguising of identity the same time the system is designed to use with relatively little loss of vocal information. characters and symbols easily available on wordprocessors making it reasonably easy to learn and interpret. The basic system is 3.06.4.2.3 Audio and video recording summarized in Table 3. For fuller descriptions There is a range of practical concerns in of using the system see Button and Lee, (1987), recording natural interaction, some of them Ochs, Schegloff, and Thompson (1996), and pulling in different directions. An immediate Psathas and Anderson (1990). issue is whether to use audio or video recording. Producing high quality transcript is very On the one hand, video provides helpful demanding and time consuming. 
It is hard to information about nonverbal activities that give a standard figure for how long it takes 136 Qualitative and Discourse Analysis

Table 3 Brief transcription notation.

Notation                             Meaning
Um::                                 Colons represent lengthening of the preceding sound; the more colons, the greater the lengthening.
I've-                                A hyphen represents the cut-off of the preceding sound, often by a stop.
↑Already                             Up and down arrows represent sharp upward and downward pitch shifts in the following sounds. Underlining represents stress, usually by volume; the more underlining the greater the stress.
settled in his= / Mm / =own mind.    The equals signs join talk that is continuous although separated across different lines of transcript.
hhh hh .hh P(h)ut                    "h" represents aspiration, sometimes simply hearable breathing, sometimes laughter, etc.; when preceded by a superimposed dot, it marks in-breath; in parentheses inside a word it represents laugh infiltration.
hhh[hh .hh] / [I just]               Left brackets represent point of overlap onset; right brackets represent point of overlap resolution.
. , ?                                Punctuation marks intonation, not grammar; period, comma and "question mark" indicate downward, "continuative", and upward contours respectively.
( )                                  Single parentheses mark problematic or uncertain hearings; two parentheses separated by an oblique represent alternative hearings.
(0.2) (.)                            Numbers in parentheses represent silence in tenths of a second; a dot in parentheses represents a micropause, less than two tenths of a second.
°mm hmm°                             The degree signs enclose significantly lowered volume.
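
Because the timing conventions in Table 3 are regular, they can be read mechanically once a transcript is held as plain text. The following is a minimal sketch under that assumption, not part of Jefferson's system or of any published tool: the function name, the two regular expressions, and the example line are all illustrative choices made here. It only picks out timed silences such as (0.2) and micropauses written (.), and it ignores every other symbol in the table.

```python
import re

# Assumed conventions, following Table 3: a timed silence is a number in
# parentheses, e.g. (0.6); a micropause is a dot in parentheses, (.).
TIMED_SILENCE = re.compile(r"\((\d+(?:\.\d+)?)\)")
MICROPAUSE = re.compile(r"\(\.\)")


def summarize_silences(transcript_line: str) -> dict:
    """Return timed silences (seconds) and the micropause count for one line."""
    timed = [float(value) for value in TIMED_SILENCE.findall(transcript_line)]
    return {
        "timed_silences": timed,
        "micropauses": len(MICROPAUSE.findall(transcript_line)),
        "total_timed_seconds": round(sum(timed), 1),
    }


# Illustrative usage on a short line written in the notation of Table 3.
example = "At- at the present ti:me. (0.2) Uh:m (.) once: he's (0.5) got a better"
summary = summarize_silences(example)
print(summary)
```

A count of this kind can at most help locate candidate passages; it is no substitute for the close reading of the transcript that the analysis itself requires.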

Source (Table 3): Modified from Schegloff (1997, pp. 184–185).

because much depends on the quality of the recording (fuzzy, quiet tapes can quadruple the time needed) and the type of interaction (an individual therapy session presents much less of a challenge than a lively case conference with a lot of overlapping talk and extraneous noise); nevertheless, a ratio of one hour of tape to twenty hours of transcription time is not unreasonable. However, this time should not be thought of as dead time before the analysis proper. Often some of the most revealing analytic insights come during transcription because a profound engagement with the material is needed to produce good transcript; it is generally useful to make analytic notes in parallel to the actual transcription.

3.06.4.2.5 Coding

In discourse research the principal task of coding is to make the task of analysis more straightforward by sifting relevant materials from a large body of transcript. In this it differs from both grounded theory and traditional content analysis where coding is a more intrinsic part of the analysis. Typically coding will involve sifting through materials for instances of a phenomenon of interest and copying them into an archive. This coding will often be accompanied by preliminary notes as to their nature and interest. At this stage selection is inclusive: it is better to include material that can turn out to be irrelevant at a later stage than exclude it for ill-formulated reasons early on. Coding is a cyclical process. Full analysis of a corpus of materials can often take the researcher back to the originals as a better understanding of the phenomenon reveals new examples. Often initially disparate topics merge together in the course of analysis while topics which seemed unitary can be separated.

3.06.4.2.6 Analysis

There is no single recipe for analyzing discourse. Nevertheless, there are five considerations which are commonly important in analysis. First, the researcher can use variation in and between participants' discourse as an analytic lever. The significance of variation is that it can be used for identifying and explicating the activities that are being performed by talk and texts. This is because the discourse is constructed in the specific ways that it is precisely to perform these actions; a description of jealousy in couple counseling can be assembled very differently when excusing or criticizing certain actions (Edwards, 1995). The researcher will benefit from attending to variations in the discourse of single individuals, between different individuals, and between what is said and what might have been said.

Second, discourse researchers have found it highly productive to attend to the detail of discourse. Conversation analysts such as Sacks (1992) have shown that the details in discourse (the hesitations, lexical choice, repair, and so on) are commonly part of the performance of some act or are consequential in some way for the outcome of the interaction. Attending to the detail of interaction, particularly in transcript, is one of the most difficult things for psychologists who are used to reading through the apparently messy detail for the gist of what is going on. Developing analytic skills involves a discipline of close reading.

Third, analysis often benefits from attending to the rhetorical organization of discourse. This involves inspecting discourse both for the way it is organized to make argumentative cases and for the way it is designed to undermine alternative cases (Billig, 1996). A rhetorical orientation refocuses the analyst's attention away from questions about how a version (a description of a psychological disorder, say) relates to some putative reality and focuses it on how it relates to competing alternatives.

Concern with rhetoric is closely linked to a fourth analytic concern with accountability. That is, displaying one's activities as rational, sensible, and justifiable. Ethnomethodologists have argued that accountability is an essential and pervasive character of the design and understanding of human conduct generally (Garfinkel, 1967c; Heritage, 1984). Again an attention to the way actions are made accountable is an aid for understanding precisely what those actions are.

A fifth and final analytic consideration is of a slightly different order. It is an emphasis on the virtue of building on prior analytic studies. In particular, researchers into interaction in an institutional setting such as a family therapy setting will almost certainly benefit from a familiarity with research on mundane talk as well as an understanding of how the patterning of turn taking and activities changes in different institutional settings.

The best way to think of these five considerations is not as specific rules for research but as elements in an analytic mentality that the researcher will develop as they become more and more skilled. It does not matter that they are not spelled out in studies because they are separate from the procedures for validating discourse analytic claims.

3.06.4.2.7 Validity

Discourse researchers typically draw on some combination of four considerations to justify the validity of analytic claims. First, they make use of participants' own understandings as they are displayed in interaction. One of the features of a conversation is that any turn of talk is oriented to what came before and what comes next, and that orientation typically displays the sense that the participant makes of the prior turn. Thus, at its simplest, when someone provides an answer they thereby display the prior turn as a question and so on. Close attention to this turn-by-turn display of understanding provides one important check on analytic interpretations (see Heritage, 1988).

Second, researchers may find (seemingly) deviant cases most useful in assessing the adequacy of a claim. Deviant cases may generate problems for a claimed generalization, and lead the researcher to abandon it; but they may also display in their detailed organization precisely the reason why a standard pattern should take the form that it does.

Third, a study may be assessed in part by how far it is coherent with previous discourse studies. A study that builds coherently on past research is more plausible than one that is more anomalous.

Fourth, and most important, are readers' evaluations. One of the distinctive features of discourse research is its presentation of rich and extended materials in a way that allows the reader to make their own judgements about interpretations that are placed alongside them. This form of validation contrasts with much grounded theory and ethnography where interpretations often have to be taken on trust; it also contrasts with much traditional experimental and content analytic work where it is rare for "raw" data to be included or more than one or two illustrative codings to be reproduced.

Whether they appear singly or together in a discourse study, none of these procedures guarantees the validity of an analysis. However, as the sociology of science work reviewed earlier shows, there are no such guarantees in science.

3.06.4.3 Example: Peräkylä (1995)

Given the wide variety of discourse studies with different questions and styles of analysis it is not easy to choose a single representative study. The one selected is Peräkylä's (1995) investigation of AIDS counseling because it is a major integrative study that addresses a related set of questions about interaction, counseling, and family therapy from a rigorous conversation analytic perspective and illustrates the potential of discourse work on clinical topics. It draws heavily on the perspective on institutional talk surveyed by Drew and Heritage (1992b) and is worth reading in conjunction with Silverman's (1997b) related study of HIV+ counseling which focuses more on advice giving.

Peräkylä focused on 32 counseling sessions conducted with HIV+ hemophilic mainly gay identified men and their partners at a major London hospital. The sessions were videotaped and transcribed using the Jeffersonian system. A wider archive of recordings (450 hours) was drawn on to provide further examples of phenomena of interest but not otherwise transcribed. The counselors characterized their practices in terms of Milan School Family Systems Theory and, although this is not the starting point of Peräkylä's study, he was able to explicate some of the characteristics of such counseling.

Part of the study is concerned with identifying the standard turn-taking organization of the counseling. Stated baldly it is that (i) counselors ask questions; (ii) clients answer; (iii) counselors comment, advise, or ask further questions. When laid out in this manner the organization may not seem much of a discovery. However, the power of the study is showing how this organization is achieved in the interaction and how it can be used to address painful and delicate topics such as sexual behavior, illness, and death.

Peräkylä goes on to examine various practices that are characteristic of family systems theory such as "circular questioning," where the counselor initially questions the client's partner or a family member about the client's feelings, and "live open supervision," where a supervisor may offer questions to the counselor that are, in turn, addressed to the client. The study also identifies some of the strategies by which counselors can address "dreaded issues" in a manageable way. Take "circular questioning," for example. In mundane interaction providing your partial experience of some event or experience is a commonplace way of "fishing" for a more authoritative version (Pomerantz, 1980). For example:

A: Yer line's been busy.
B: Yeuh my fu (hh)- .hh my father's wife called me

In a similar way, the use of questioning where a client's partner, say, offers their understanding of an experience "can create a situation where the clients, in an unacknowledged but most powerful way, elicit one another's descriptions of their inner experiences" (Peräkylä, 1995, p. 110). In the following extract the client is called Edward; his partner and the counselor are also present.

Counselor: What are some of things that you think E:dward might have to do.=He says he doesn't know where to go from here maybe: and awaiting results and things.
(0.6)
Counselor: What d'you think's worrying him.
(0.4)
Partner: Uh::m hhhhhh I think it's just fear of the unknow:n.
Client: Mm[:
Counselor: [Oka:y.
Partner: [At- at the present ti:me. (0.2) Uh:m (.) once: he's (0.5) got a better understanding of (0.2) what could happen
Counselor: Mm:
Partner: uh:m how .hh this will progre:ss then: I think (.) things will be a little more [settled in his=
Counselor: [Mm
Partner: =own mi:nd.
Counselor: Mm:
(.)
Client: Mm[:
Counselor: [Edward (.) from what you know::
((sequence continues with Edward responding to a direct question with a long and detailed narrative about his fears))
(Peräkylä, 1995, p. 110)

Peräkylä emphasizes the way that the client's talk about his fears is elicited, in part, through the counselor asking the partner for his own view of those fears. The point is not that the client is forced to reveal his experiences, rather it is that the prior revelation of his partner's partial view produces an environment where such a revelation is expected and nonrevelation will itself be a delicate and accountable matter. In effect, what Peräkylä is documenting here are the conversational mechanisms which family therapists characterize as using circular questioning to overcome clients' resistance.

3.06.4.4 Virtues and Limitations

Given the variety of styles of work done under the rubric of discourse analysis it is difficult to give an overall summary of virtues and limitations. However, the virtue of a range of studies in the conversation and discourse analytic tradition is that they offer, arguably for the first time in psychology, a rigorous way of directly studying human social practices. For example, the Peräkylä study discussed above is notable in studying actual HIV+ counseling in all its detail. It is not counseling as recollected by participants while filling in rating scales or questionnaires; it is not an experimental simulation of counseling; it does not depend on post hoc ethnographic reconstructions of events; nor are the activities immediately transformed into broad coding categories or used as a mere shortcut to underlying cognitions.

A corollary of this emphasis on working with tapes and transcripts of interaction is that these are used in research papers to allow readers to evaluate the adequacy of interpretations in a way that is rare in other styles of research.

Studies in this tradition have started to reveal an organization of interaction and its local management that has been largely missed from traditional psychological work from a range of perspectives. Such studies offer new conceptions of what is important in clinical practice and may be particularly valuable in clinical training which has often been conducted with idealized or at least cleaned up examples of practice.

Discourse research is demanding and requires a considerable investment of the researcher's time to produce satisfactory results. It does not fit neatly into routines that can be done by research assistants. Indeed, even transcription, which may seem to be the most mechanical element in the research, requires considerable skill and benefits from the involvement of the primary researchers. Analysis also requires considerable craft skills which can take time to learn.

With its focus on interaction, this would not necessarily be the perspective of choice for researchers with a more traditional cognitive or behavioral focus, although it has important implications for both of these. Some have claimed that it places too much emphasis on verbal interaction at the expense of nonverbal elements, and broader issues of embodiment. Others have claimed that it places too much emphasis on the importance of local contexts of interaction rather than on broader issues such as gender or social class. For some contrasting and strongly expressed claims about the role of discourse analysis in the cognitive psychology of memory, see papers in Conway (1992).

An accessible general introduction to various practical aspects of doing discourse analysis is provided in Potter and Wetherell (1987; see also Potter, 1996b, 1997). Potter and Wetherell (1995) discuss the analysis of broader content units such as interpretative repertoires, while Potter and Wetherell (1994) and Wooffitt (1993) discuss the analysis of how accounts are constructed. For work in the distinctive conversation analytic tradition Drew (1995) and Heritage (1995) provide clear overviews and Heath and Luff (1993) discuss analysis which incorporates video material; Gale (1996) explores the use of conversation analysis in family therapy research.

3.06.5 FUTURE DIRECTIONS

The pace of change in qualitative research in clinical settings is currently breathtaking, and it is not easy to make confident predictions. However, it is perhaps useful to speculate on how things might develop over the next few years. The first prediction is that the growth in the sheer quantity of qualitative research will continue for some time. There is so much new territory, and so many possibilities have been opened up by new theoretical and analytic developments, that they are bound to be explored.

The second prediction is that research on therapy and counseling talk will provide a particular initial focus because it is here that discourse analytic approaches can clearly provide new insights and possibly start to provide technical, analytically grounded specifications of the interactional nature of different therapies in practice, as well as differences in interactional style between therapists. There may well be conflicts here between the ideological goals of constructionist therapists and the research goals of discourse analysts.

The third prediction is that the growth of qualitative work will encourage more researchers to attempt integrations of qualitative and quantitative research strategies. There will be attempts to supplement traditional outcomes research with studies of elements of treatment which are not easily amenable to quantification. Here the theoretical neutrality of grounded theory (ironically) is likely to make for easier integration than the more theoretically developed discourse perspectives. The sheer difficulty of blending qualitative and quantitative work should not be underestimated: research that has attempted this has often found severe problems (see Mason, 1994, for a discussion).

The final prediction is that there will be an increased focus on clinical work practices embodied within settings such as clinics and networks of professional and lay relationships. Here the richness of ethnographic work will be drawn on, but increasingly the conversation analytic discipline of working with video and transcript will replace field notes and recollections. Such work will have the effect of respecifying some of the basic problems of clinical research. Its broader significance, however, may depend on the course of wider debates in psychology over the development and success of the cognitive paradigm and whether it will have a discursive and interaction-based successor.

ACKNOWLEDGMENTS

I would like to thank Alan Bryman, Karen Henwood, Alexa Hepburn, Celia Kitzinger, and Denis Salter for making helpful comments on an earlier draft of this chapter.

3.06.6 REFERENCES

Anderson, H., & Goolishian, H. A. (1988). Human systems as linguistic systems: Preliminary and evolving ideas about the implications for clinical theory. Family Process, 27, 371–393.
Antaki, C. (Ed.) (1988). Analysing everyday explanation. London: Sage.
Antaki, C. (1996). Review of The talk of the clinic. Journal of Language and Social Psychology, 15, 176–81.
Antaki, C., & Rapley, M. (1996). "Quality of life" talk: The liberal paradox of psychological testing. Discourse and Society, 7, 293–316.
Argyris, C., Putnam, R., & Smith, D. M. (1985). Action Science: Concepts, methods, and skills for research and intervention. San Francisco: Jossey-Bass.
Aronsson, K., & Cederborg, A.-C. (1996). Coming of age in family therapy talk: Perspective setting in multiparty problem formulations. Discourse Processes, 21, 191–212.
Ashmore, M., Mulkay, M., & Pinch, T. (1989). Health and efficiency: A sociological study of health economics. Milton Keynes, UK: Open University Press.
Atkinson, J. M. (1978). Discovering suicide: Studies in the social organization of sudden death. London: Macmillan.
Atkinson, P. (1990). The ethnographic imagination: The textual construction of reality. London: Routledge.
Bannister, P., Burman, E., Parker, I., Taylor, M., & Tindall, C. (1994). Qualitative methods in psychology: A research guide. Buckingham, UK: Open University Press.
Beach, W. A. (1995). Preserving and constraining options: "Okays" and "official" priorities in medical interviews. In G. H. Morris & R. J. Chenail (Eds.), The talk of the clinic: Explorations in the analysis of medical and therapeutic discourse (pp. 259–289). Hillsdale, NJ: Erlbaum.
Bergmann, J. R. (1992). Veiled morality: Notes on discretion in psychiatry. In P. Drew & J. Heritage (Eds.), Talk at work: Interaction in institutional settings (pp. 137–162). Cambridge, UK: Cambridge University Press.
Billig, M. (1988). Methodology and scholarship in understanding ideological explanation. In C. Antaki (Ed.), Analysing everyday explanation: A casebook of methods (pp. 199–215). London: Sage.
Billig, M. (1996). Arguing and thinking: A rhetorical approach to social psychology (2nd ed.). Cambridge, UK: Cambridge University Press.
Billig, M. (1998). Dialogic repression and the Oedipus Complex: Reinterpreting the Little Hans case. Culture and Psychology, 4, 11–47.
Bogdan, R., & Taylor, S. J. (1975). Introduction to qualitative research methods: A phenomenological approach to social sciences. New York: Wiley.
Brewer, J. D., & Yearley, S. (1989). Stigma and conversational competence: A conversation-analytic study of the mentally handicapped. Human Studies, 12, 97–115.
Bryman, A. (1988). Quantity and quality in social research. London: Unwin Hyman.
Bryman, A., & Burgess, R. G. (Eds.) (1994). Analyzing qualitative data. London: Routledge.
Burman, E. (1995). Identification, subjectivity, and power in feminist psychotherapy. In J. Siegfried (Ed.), Therapeutic and everyday discourse as behaviour change: Towards a micro-analysis in psychotherapy process research (pp. 469–490). Norwood, NJ: Ablex.
Burman, E., Mitchell, S., & Salmon, P. (Eds.) (1996). Changes: An International Journal of Psychology and Psychotherapy (special issue on qualitative methods), 14, 175–243.
Buttny, R. (1996). Clients' and therapists' joint construction of the clients' problems. Research on Language and Social Interaction, 29, 125–153.
Buttny, R., & Jensen, A. D. (1995). Telling problems in an initial family therapy session: The hierarchical organization of problem-talk. In G. H. Morris & R. J. Chenail (Eds.), The talk of the clinic: Explorations in the analysis of medical and therapeutic discourse (pp. 19–48). Hillsdale, NJ: Erlbaum.
Button, G., & Lee, J. R. E. (Eds.) (1987). Talk and social organization. Clevedon, UK: Multilingual Matters.
Chalmers, A. (1992). What is this thing called science?: An assessment of the nature and status of science and its methods (2nd ed.). Milton Keynes, UK: Open University Press.
Charmaz, K. (1991). Good days, bad days: The self in chronic illness and time. New Brunswick, NJ: Rutgers University Press.
Charmaz, K. (1994). Identity dilemmas of chronically ill men. The Sociological Quarterly, 35, 269–288.
Charmaz, K. (1995). Grounded theory. In J. A. Smith, R. Harré, & L. van Langenhove (Eds.), Rethinking methods in psychology (pp. 27–49). London: Sage.
Cheston, R. (1996). Stories and metaphors: Talking about the past in a psychotherapy group for people with dementia. Ageing and Society, 16, 579–602.
Clegg, J. A., Standen, P. J., & Jones, G. (1996). Striking the balance: A grounded theory analysis of staff perspectives. British Journal of Clinical Psychology, 35, 249–264.
Coffey, A., & Atkinson, P. (1996). Making sense of qualitative data: Complementary research strategies. London: Sage.
Collins, H. M. (1974). The TEA Set: Tacit knowledge and scientific networks. Science Studies, 4, 165–186.
Collins, H. M. (Ed.) (1981). Knowledge and controversy: Studies of modern natural science. Social Studies of Science (special issue), 11.
Collins, H. M. (1983). The meaning of lies: Accounts of action and participatory research. In G. N. Gilbert & P. Abell (Eds.), Accounts and action (pp. 69–76). Aldershot, UK: Gower.
Collins, H. M. (1985). Changing order: Replication and induction in scientific practice. London: Sage.
Collins, H. M., & Pinch, T. (1993). The Golem: What everyone should know about science. Cambridge, UK: Cambridge University Press.
Conway, M. (Ed.) (1992). Developments and debates in the study of human memory (special issue). The Psychologist, 5, 439–461.
Coulter, J. (1990). Mind in action. Cambridge, UK: Polity.
Czyzewski, M. (1995). "Mm hm" tokens as interactional devices in the psychotherapeutic intake interview. In P. ten Have & G. Psathas (Eds.), Situated order: Studies in the social organization of talk and embodied activities (pp. 73–89). Washington, DC: International Institute for

Ethnomethodology and Conversation Analysis & University Press of America.
Davis, K. (1986). The process of problem (re)formulation in psychotherapy. Sociology of Health and Illness, 8, 44–74.
Denzin, N. K., & Lincoln, Y. S. (Eds.) (1994). Handbook of qualitative research. London: Sage.
Drew, P. (1995). Conversation analysis. In J. Smith, R. Harré, & L. van Langenhove (Eds.), Rethinking methods in psychology (pp. 64–79). London: Sage.
Drew, P., & Heritage, J. (Eds.) (1992a). Talk at work: Interaction in institutional settings. Cambridge, UK: Cambridge University Press.
Drew, P., & Heritage, J. (1992b). Analyzing talk at work: An introduction. In P. Drew & J. Heritage (Eds.), Talk at work: Interaction in institutional settings (pp. 3–65). Cambridge, UK: Cambridge University Press.
Edwards, D. (1995). Two to tango: Script formulations, dispositions, and rhetorical symmetry in relationship troubles talk. Research on Language and Social Interaction, 28, 319–350.
Edwards, D. (1997). Discourse and cognition. London: Sage.
Edwards, D., & Potter, J. (1992a). Discursive psychology. London: Sage.
Edwards, D., & Potter, J. (1992b). The chancellor's memory: Rhetoric and truth in discursive remembering. Applied Cognitive Psychology, 6, 187–215.
Edwards, D., & Potter, J. (1993). Language and causation: A discursive action model of description and attribution. Psychological Review, 100, 23–41.
Ellen, R. F. (1984). Ethnographic research: A guide to general conduct. London: Academic Press.
Ellis, C., Kiesinger, C., & Tillmann-Healy, L. M. (1997). Interactive interviewing: Talking about emotional experience. In R. Hertz (Ed.), Reflexivity and Voice. Thousand Oaks, CA: Sage.
Essex, M., Estroff, S., Kane, S., McLanahan, J., Robbins, J., Dresser, R., & Diamond, R. (1980). On Weinstein's "Patient attitudes toward mental hospitalization: A review of quantitative research." Journal of Health and Social Behaviour, 21, 393–396.
Fetterman, D. M. (1989). Ethnography: Step by step. London: Sage.
Fielding, N. (1993). Ethnography. In N. Gilbert (Ed.), Researching social life (pp. 154–171). London: Sage.
Freud, S. (1977). Case histories. I: "Dora" and "Little Hans." London: Penguin.
Frosh, S., Burck, C., Strickland-Clark, L., & Morgan, K. (1996). Engaging with change: A process study of family therapy. Journal of Family Therapy, 18, 141–161.
Gale, J. E. (1991). Conversation analysis of therapeutic discourse: The pursuit of a therapeutic agenda. Norwood, NJ: Ablex.
Gale, J. E. (1996). Conversation analysis: Studying the construction of therapeutic realities. In D. H. Sprenkle & S. M. Moon (Eds.), Research methods in family therapy (pp. 107–124). New York: Guilford.
Gale, J. E., & Newfield, N. (1992). A conversation analysis of a solution-focused marital therapy session. Journal of Marital and Family Therapy, 18, 153–165.
Garfinkel, H. (1967a). "Good" organizational reasons for "bad" clinical records. In H. Garfinkel (Ed.), Studies in ethnomethodology (pp. 186–207). Englewood Cliffs, NJ: Prentice-Hall.
Garfinkel, H. (1967b).
paper "On feminist methodology." Sociology, 26, 213–218.
Gergen, K. J. (1994). Realities and relationships: Soundings in social construction. Cambridge, MA: Harvard University Press.
Gilbert, G. N. (Ed.) (1993). Researching social life. London: Sage.
Gilbert, G. N., & Mulkay, M. (1984). Opening Pandora's box: A sociological analysis of scientists' discourse. Cambridge, UK: Cambridge University Press.
Glaser, B., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. Chicago: Aldine.
Glaser, B., & Strauss, A. L. (1968). Time for dying. Chicago: Aldine.
Goffman, E. (1961). Asylums: Essays on the social situation of mental patients and other inmates. London: Penguin.
Guba, E. G., & Lincoln, Y. S. (1994). Competing paradigms in qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 105–117). London: Sage.
Gubrium, J. F. (1992). Out of control: Family therapy and domestic disorder. London: Sage.
Hammersley, M. (1992). On feminist methodology. Sociology, 26, 187–206.
Hammersley, M., & Atkinson, P. (1995). Ethnography: Principles in practice (2nd ed.). London: Routledge.
Harré, R. (1986). Varieties of realism. Oxford, UK: Blackwell.
Harré, R. (1992). Social being: A theory for social psychology (2nd ed.). Oxford, UK: Blackwell.
Harré, R., & Gillett, G. (1994). The discursive mind. London: Sage.
Heath, C., & Luff, P. (1993). Explicating face-to-face interaction. In N. Gilbert (Ed.), Researching social life (pp. 306–326). London: Sage.
Henriques, J., Hollway, W., Irwin, C., Venn, C., & Walkerdine, V. (1984). Changing the subject: Psychology, social regulation and subjectivity. London: Methuen.
Henwood, K., & Nicolson, P. (Eds.) (1995). Qualitative research methods (special issue). The Psychologist, 8, 109–129.
Henwood, K., & Parker, I. (1994). Qualitative social psychology (special issue). Journal of Community and Applied Social Psychology, 4, 219–223.
Henwood, K., & Pidgeon, N. (1994). Beyond the qualitative paradigm: A framework for introducing diversity within qualitative psychology. Journal of Community and Applied Social Psychology, 4, 225–238.
Henwood, K., & Pidgeon, N. (1995). Grounded theory and psychological research. The Psychologist, 8, 115–118.
Heritage, J. C. (1984). Garfinkel and ethnomethodology. Cambridge, UK: Polity.
Heritage, J. C. (1988). Explanations as accounts: A conversation analytic perspective. In C. Antaki (Ed.), Analysing everyday explanation: A casebook of methods (pp. 127–144). London: Sage.
Heritage, J. C. (1995). Conversation analysis: Methodological aspects. In U. Quasthoff (Ed.), Aspects of oral communication (pp. 391–418). Berlin, Germany: De Gruyter.
Hesse, M. B. (1974). The structure of scientific inference. London: Macmillan.
Hodder, I. (1994). The interpretation of documents and material culture. In N. K. Denzin & Y. S. Lincoln (Eds.),
Methodological adequacy in the Handbook of qualitative research (pp. 395±402). London: quantitative study of selection criteria and selection Sage. practices in psychiatric outpatient clinics. In H. Garfin- Holsti, O. R. (1969). Content analysis for the social sciences kel (Ed.), Studies in ethnomethodology (pp. 208±261). and humanities. Reading, MA: Addison-Wesley. Englewood Cliffs, NJ: Prentice-Hall. Jasanoff, S., Markle, G., Pinch T., & Petersen, J. (Eds.) Garfinkel, H. (1967c). Studies in ethnomethodology. Engle- (1995). Handbook of science and technology studies. wood Cliffs, NJ: Prentice-Hall. London: Sage. Gelsthorpe, L. (1992). Response to Martyn Hammersley's Jefferson, G. (1985). An exercise in the transcription and 142 Qualitative and Discourse Analysis

analysis of laughter. In T. van Dijk (Ed.), Handbook of Mason, J. (1994). Linking qualitative and quantitative data discourse analysis (Vol. 3, pp. 25±34). London: Academic analysis. In A. Bryman & R. G. Burgess (Eds.), Press. Analyzing qualitative data. London: Routledge. Jefferson, G. (1988). On the sequential organization of McNamee, S., & Gergen, K. (Eds) (1992). Therapy as social troubles-talk in ordinary conversation. Social Problems, construction. London: Sage 35, 418±441. Miles, M. B., & Huberman, A. M. (1994). Qualitative data Jefferson, G., & Lee, J. R. E. (1992). The rejection of analysis: An expanded sourcebook (2nd Ed.). London: advice: Managing the problematic convergence of a Sage. ªtroubles-tellingº and a ªservice encounter.º In P. Drew Miller, G., & Dingwall, R. (Ed.) (1997). Context and & J. Heritage (Eds.), Talk at work: Interaction in method in qualitative research. London: Sage institutional settings (pp. 521±548). Cambridge, UK: Miller, G., & Silverman, D. (1995). Troubles talk and Cambridge University Press. counseling discourse: A comparative study. The Socio- Jodelet, D. (1991). Madness and social representations. logical Quarterly, 36, 725±747. London: Harvester/Wheatsheaf. Mischler, E. G. (1986). Research interviewing: Context and Johnson, J. C. (1990). Selecting ethnographic informants. narrative. Cambridge, MA: Harvard University Press. London: Sage. Morgan, D. L. (1997). Focus groups as qualitative research Knorr Cetina, K. (1995). Laboratory studies: The cultural (2nd ed.). London: Sage. approach to the study of science. In S. Jasanoff, G. Morris, G. H., & Chenail, R. J. (Eds.) (1995). The talk of Markle, T. Pinch, & J. Petersen (Eds.), Handbook of the clinic: Explorations in the analysis of medical and science and technology studies. London: Sage. therapeutic discourse. Hillsdale, NJ: Erlbaum. Knorr Cetina, K. (1997). Epistemic cultures: How scientists Moscovici, S. (1984). The phenomenon of social represen- make sense. 
Chicago: Indiana University Press. tations. In R. M. Farr & S. Moscovici (Eds.), Social Krippendorff, K. (1980). Content analysis: An introduction representations (pp. 3±69). Cambridge, UK: Cambridge to its methodology. London: Sage. University Press. Krueger, R. A. (1988). Focus groups: A practical guide for Nelson, C. K. (1994). Ethnomethodological positions on applied research. London: Sage. the use of ethnographic data in conversation analytic Kuhn, T. S. (1977). The essential tension: Selected studies in research. Journal of Contemporary Ethnography, 23, scientific tradition and change. Chicago: University of 307±329. Chicago Press. Newfield, N. A., Kuehl, B. P., Joanning, H. P., & Quinn, Kvale, S. (1996). InterViews: An introduction to qualitative W. H. (1990). A mini ethnography of the family therapy research interviewing. London: Sage. of adolescent drug abuse: The ambiguous experience. Labov, W., & Fanshel, D. (1977). Therapeutic discourse: Alcoholism Treatment Quarterly, 7, 57±80. Psychotherapy as conversation. London: Academic Press. Newfield, N., Sells, S. P., Smith, T. E., Newfield, S., & Lakatos, I. (1970). Falsification and the methodology of Newfield, F. (1996). Ethnographic research methods: scientific research programmes. In I. Lakatos & A. Creating a clinical science of the humanities. In D. H. Musgrave (Eds.), Criticism and the growth of knowledge Sprenkle & S. M. Moon (Eds.), Research Methods in (pp. 91±195). Cambridge, UK: Cambridge University Family Therapy. New York: Guilford. Press. Newfield, N. A., Kuehl, B. P., Joanning, H. P., & Quinn, Latour, B., & Woolgar, S. (1986). Laboratory life: The W. H. (1991). We can tell you about ªPsychosº and construction of scientific facts (2nd ed.). Princeton, NJ: ªShrinksº: An ethnography of the family therapy of Princeton University Press. adolescent drug abuse. In T. C. Todd & M. D. Slekman Lee, J. R. E. (1995). The trouble is nobody listens. In J. 
(Eds.), Family therapy approaches with adolescent sub- Siegfried (Ed.), Therapeutic and everyday discourse as stance Abusers (pp. 277±310). London: Allyn & Bacon. behaviour change: Towards a micro-analysis in psy- Nofsinger, R. (1991). Everyday conversation. London: chotherapy process research (pp. 365±390). Norwood, Sage. NJ: Ablex. Ochs, E. (1979). Transcription as theory. In E. Ochs & B. Levinson, S. (1983). Pragmatics. Cambridge, UK: Cam- Schieffelin (Eds.), Developmental pragmatics (pp. 43±47). bridge University Press. New York: Academic Press. Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Ochs, E., Schegloff, E., & Thompson, S. A. (Eds.) (1996). London: Sage. Interaction and grammar. Cambridge, UK: Cambridge Lofland, J., & Lofland, L. H. (1984). Analyzing social University Press. settings: A guide to qualitative observation and analysis. Olesen, V. (1994), Feminisms and models of qualitative Belmont, CA: Wadsworth. research. In N. K. Denzin & Y. S. Lincoln (Eds.), Lynch, M. (1994). Representation is overrated: Some Handbook of qualitative research (pp. 158±174). London: critical remarks about the use of the concept of Sage. representation in science studies. Configurations: A PeraÈ kylaÈ , A. (1995). AIDS counseling: Institutional inter- Journal of Literature, Science and Technology, 2, action and clinical practice. Cambridge, UK: Cambridge 137±149. University Press. Madigan, S. P. (1992). The application of Michel Pettinger, R. E., Hockett, C. F., & Danehy, J. J. (1960). Foucault's philosophy in the problem externalizing The first five minutes: A sample of microscopic interview discourse of Michael White. Journal of Family Therapy, analysis. Ithica, NY: Paul Martineau. 14, 265±279. Pidgeon, N. (1996). Grounded theory: Theoretical back- Madill, A., & Barkham, M. (1997). Discourse analysis of a ground. In J. T. E. 
Richardson (Ed.), Handbook of theme in one successful case of brief psychodynamic- qualitative research methods for psychology and the social interpersonal psychotherapy. Journal of Counselling sciences (pp. 75±85). Leicester, UK: British Psychological Psychology, 44, 232±244. Society. Malson, H., & Ussher, J. M. (1996). Bloody women: A Pidgeon, N., & Henwood, K. (1996). Grounded theory: discourse analysis of amenorrhea as a symptom of Practical implementation. In J. T. E. Richardson (Ed.), anorexia nervosa. Feminism and Psychology, 6, 505±521. Handbook of qualitative research methods for psychology Manning, P. K., & Cullum-Swan, B. (1994) Narrative, and the social sciences (pp. 86±101). Leicester, UK: content and semiotic analysis. In N. K. Denzin & Y. S. British Psychological Society. Lincoln (Eds.), Handbook of qualitative research Piercy F. P., & Nickerson, V. (1996). Focus groups in (pp. 463±477). London: Sage. family therapy research. In D. H. Sprenkle & S. M. References 143

Moon (Eds.), Research methods in family therapy Richardson, J. E. (Ed.) (1996). Handbook of qualitative (pp. 173±185). New York: Guilford. research methods for psychology and the social sciences. Plummer, K. (1995). Life story research. In J. A. Smith, R. Leicester, UK: British Psychological Society. HarreÂ , & L. van Langenhove (Eds.), Rethinking methods Rogers, C. R. (1942). The use of electrically recorded in psychology (pp. 50±63). London: Sage. interviews in improving psychotherapeutic techniques. Polanyi, M. (1958). Personal knowledge. London: Rout- American Journal of Orthopsychiatry, 12, 429±434. ledge and Kegan Paul. Ronai, C. R. (1996). My mother is mentally retarded. In C. Pollner, M., & Wikler, L. M. (1985). The social construc- Ellis & A. P. Bochner (Eds.), Composing ethnography: tion of unreality: A case study of a family's attribution of Alternative forms of qualitative writing. Walnut Creek, competence to a severely retarded child. Family Process, CA: AltaMira Press. 24, 241±254. Rose, D. (1990). Living the ethnographic life. London: Sage. Pomerantz, A. M. (1980). Telling my side: ªlimited accessº Rosenhan, D. L. (1973). On being sane in insane places. as a fishing device. Sociological Inquiry, 50, 186±198. Science, 179, 250±258. Popper, K. (1959). The logic of scientific discovery. London: Sacks, H. (1992). Lectures on conversation. (Vols. I & II). Hutchinson. Oxford, UK: Blackwell. Potter, J. (1996a). Representing reality: Discourse, rhetoric Schegloff, E. A. (1991). Reflections on talk and social and social construction. London: Sage. structure. In D. Boden & D. H. Zimmerman (Eds.), Talk Potter, J. (1996b). Discourse analysis and constructionist and social structure (pp. 44±70). Berkeley, CA: Uni- approaches: Theoretical background. In J. T. E. versity of California Press. Richardson (Ed.), Handbook of qualitative research Schegloff, E. A. (1993). Reflections on quantification in the methods for psychology and the social sciences. 
Leicester, study of conversation. Research on Language and Social UK: British Psychological Society. Interaction, 26, 99±128. Potter, J. (1997). Discourse analysis as a way of analysing Schegloff, E. A. (1997) Whose text? Whose Context? naturally occurring talk. In D. Silverman (Ed.), Quali- Discourse and Society, 8, 165±187. tative research: Theory, method and practice Scott, J. (1990). A matter of record: Documentary sources in (pp. 144±160). London: Sage. social research. Cambridge, UK: Polity. Potter, J., & Mulkay, M. (1985). Scientists' interview talk: Siegfried, J. (Ed.) (1995). Therapeutic and everyday Interviews as a technique for revealing participants' discourse as behaviour change: Towards a micro-analysis interpretative practices. In M. Brenner, J. Brown, & D. in psychotherapy process research. Norwood, NJ: Ablex. Canter (Eds.), The research interview: Uses and ap- Silverman, D. (1993). Interpreting qualitative data: Methods proaches (pp. 247±271). London: Academic Press. for analysing talk, text and interaction. London: Sage. Potter, J., & Wetherell, M. (1987). Discourse and social Silverman, D. (Ed.) (1997a). Qualitative research: Theory, psychology: Beyond attitudes and behaviour. London: method and practice. London: Sage. Sage. Silverman, D. (1997b). Discourses of counselling: HIV Potter, J., & Wetherell, M. (1994) Analyzing discourse. In counselling as social interaction. London: Sage. A. Bryman & B. Burgess (Eds.), Analyzing qualitative Smith, L. M. (1994). Biographical method. In N. K. data. London: Routledge. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative Potter, J., & Wetherell, M. (1995). Discourse analysis. In J. research (pp. 286±305) London: Sage. Smith, R. HarreÂ , & L. van Langenhove (Eds.), Rethink- Smith, J. A. (1995). Repertory grids: An interactive, case- ing methods in psychology (pp. 80±92). London: Sage. study perspective. In J. A. Smith, R. HarreÂ , & L. van Potter, J., Wetherell, M., & Chitty, A. (1991). 
Quantifica- Langehove (Eds.), Rethinking methods in psychology tion rhetoricÐcancer on television. Discourse and (pp. 162±177). London: Sage. Society, 2, 333±365. Smith, J. A., HarreÂ , R., & van Langenhove, L. (Eds.) Psathas, G., & Anderson, T. (1990). The ªpracticesº of (1995). Rethinking methods in psychology. London: Sage. transcription in conversation analysis. Semiotica, 78, Soal, J., & Kottler, A. (1996). Damaged, deficient or 75±99. determined? Deconstructing narratives in family ther- Rachel, J. (1996). Ethnography: Practical implementation. apy. South African Journal of Psychology, 26, 123±134. In J. T. E. Richardson (Ed.), Handbook of qualitative Soyland, A. J. (1994). Functions of a psychiatric case- research methods for psychology and the social sciences summary. Text, 14, 113±140. (pp. 113±124). Leicester, UK: British Psychological Soyland, A. J. (1995). Analyzing therapeutic and profes- Society. sional discourse. In J. Siegfried (Ed.), Therapeutic and Rafuls, S. E., & Moon, S. M. (1996). Grounded theory everyday discourse as behaviour change: Towards a micro- methodology in family therapy research. In D. H. analysis in psychotherapy process research (pp. 277±300). Sprenkle & S. M. Moon (Eds.), Research methods in Norwood, NJ: Ablex. family therapy (pp. 64±80). New York: Guilford. Stainton Rogers, R. (1995). Q methodology. In J. A. Smith, Ramazanoglu, C. (1992). On feminist methodology: Male R. HarreÂ , & L. van Langenhove (Eds.), Rethinking reason versus female empowerment. Sociology, 26, methods in psychology (p. 178±192). London: Sage. 207±212. Strauss, A. L., & Corbin, J. (1994). Grounded theory Rapley, M., & Antaki, C. (1996). A conversation analysis methodology: An overview. In N. K. Denzin, & Y. S. of the ªacquiescenceº of people with learning disabilities. Lincoln (Eds.), Handbook of qualitative research Journal of Community and Applied Social Psychology, 6, (pp. 273±285). London: Sage. 207±227. Toren, C. (1996). 
Ethnography: Theoretical background. Reason, P., & Heron, J. (1995). Co-operative inquiry. In J. A. In J. T. E. Richardson (Ed.), Handbook of qualitative Smith, R. HarreÂ , & L. van Langenhove (Eds.), Rethinking research methods for psychology and the social sciences methods in psychology (pp. 122±142). London: Sage. (pp. 102±112). Leicester, UK: British Psychological Reason, P., & Rowan, J. (Eds.) (1981). Human inquiry: A Society. sourcebook of new paradigm research. Chichester, UK: Turner, B. A. (1994). Patterns of crisis behaviour: A Wiley. qualitative inquiry. In A. Bryman & R. G. Burgess Reinharz, S. (1992). Feminist methods in social research. (Eds.), Analyzing qualitative data (pp. 195±215). London: New York: Oxford University Press. Routledge. Rennie, D., Phillips, J., & Quartaro, G. (1988). Grounded Turner, B. A., & Pidgeon, N. (1997). Manmade disasters theory: A promising approach to conceptualisation in (2nd ed.). Oxford, UK: Butterworth-Heinemann. psychology? Canadian Psychology, 29, 139±150. Watson, R. (1995). Some potentialities and pitfalls in the 144 Qualitative and Discourse Analysis

analysis of process and personal change in counseling Widdicombe, S., & Wooffitt, R. (1995). The language of and therapeutic interaction. In J. Siegfried (Ed.), youth subcultures: Social identity in action. Hemel Therapeutic and everyday discourse as behaviour change: Hempstead, UK: Harvester/Wheatsheaf. Towards a micro-analysis in psychotherapy process Wieder, D. L. (Ed.) (1993). Colloquy: On issues of research (pp. 301±339). Norwood, NJ: Ablex. quantification in conversation analysis. Research on Weinstein, R. M. (1979). Patient attitudes toward mental Language and Social Interaction, 26, 151±226. hospitalization: A review of quantitative research. Wilkinson, S. (1986). Introduction. In S. Wilkinson (Ed.), Journal of Health and Social Behavior, 20, 237±258. Feminist social psychology (pp. 1±6). Milton Keynes, Weinstein, R. M. (1980). The favourableness of patients' UK: Open University Press. attitudes toward mental hospitalization. Journal of Wooffitt, R. (1993). Analysing accounts. In N. Gilbert Health and Social Behavior, 21, 397±401. (Ed.), Researching social life (pp. 287±305). London: Werner,O.,&Schoepfle,G.M.(1987).Systematic Sage. fieldwork: Foundations of ethnography and interviewing Woolgar, S. (1988). Science: The very idea. London: (Vol. 1). London: Sage. Tavistock. Wetherell, M. (1996). Fear of fat: Interpretative repertoires Wootton, A. (1989). Speech to and from a severely and ideological dilemmas. In J. Maybin & N. Mercer retarded young Down's syndrome child. In M. Bever- (Eds.), Using English: From conversation to canon idge, G. Conti-Ramsden, & I. Leudar (Eds.), The (pp. 36±41). London: Routledge. language and communication of mentally handicapped White, M., & Epston, D. (1990). Narrative means to people (pp. 157±184). London: Chapman-Hall. therapeutic ends. New York: Norton. Yardley, K. (1995). Role play. In J. A. Smith, R. HarreÂ & Whyte, W. F. (1991). Participatory action research. L. van Langenhove (Eds.), Rethinking methods in London: Sage. 
psychology (pp. 106±121). London: Sage. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.07 Personality Assessment

THOMAS A. WIDIGER and KIMBERLY I. SAYLOR
University of Kentucky, Lexington, KY, USA

3.07.1 INTRODUCTION
3.07.2 SELF-REPORT INVENTORIES
  3.07.2.1 Item Analyses
  3.07.2.2 Gender, Ethnic, and Cultural Differences
  3.07.2.3 Individual Differences
  3.07.2.4 Response Distortion
  3.07.2.5 Automated Assessment and Base Rates
  3.07.2.6 Illustrative Instruments
    3.07.2.6.1 Minnesota Multiphasic Personality Inventory-2
    3.07.2.6.2 NEO Personality Inventory-Revised
3.07.3 SEMISTRUCTURED INTERVIEWS
  3.07.3.1 Personality, Mood States, and Mental Disorders
  3.07.3.2 Dissimulation and Distortion
  3.07.3.3 Intersite Reliability
  3.07.3.4 Self and Informant Ratings
  3.07.3.5 Convergent Validity Across Different Interviews
  3.07.3.6 Illustrative Instruments
    3.07.3.6.1 SIDP-IV
    3.07.3.6.2 PDI-IV
3.07.4 PROJECTIVE TECHNIQUES
  3.07.4.1 Rorschach
  3.07.4.2 Thematic Apperception Test
3.07.5 CONCLUSIONS
3.07.6 REFERENCES

3.07.1 INTRODUCTION

Most domains of psychology have documented the importance of personality traits to adaptive and maladaptive functioning, including the fields of behavioral medicine (Adler & Matthews, 1994), psychopathology (Watson, Clark, & Harkness, 1994; Widiger & Costa, 1994), and industrial-organizational psychology (Hogan, Curphy, & Hogan, 1994). The assessment of personality is perhaps fundamental to virtually all fields of applied psychology. However, there are as many as 4500 different personality traits, or at least 4500 different trait terms within the English language (Goldberg, 1990) and there might be almost as many personality assessment instruments. There are instruments for the assessment of individual traits (e.g., tender-mindedness), for collections of traits (e.g., the domain of agreeableness, which includes tender-mindedness, trust, straightforwardness, altruism, compliance, and modesty), for constellations of traits (e.g., the personality syndrome of psychopathy, which includes such traits as arrogance, superficial charm, impulsivity, callousness, deception, irresponsibility, and low empathy), and for traits identified by theorists
for which there might not yet be a specific term within the language (e.g., extratensive experience balance; Exner, 1993).

There are also different methods for the assessment of personality traits. The primary methods used in personality research are self-report inventories, semistructured interviews, and projective techniques. Self-report inventories consist of written statements or questions, to which a person responds in terms of a specified set of options (e.g., true vs. false or agree vs. disagree along a five-point scale). Semistructured interviews consist of a specified set of verbally administered questions, accompanied by instructions (or at least guidelines) for follow-up questions and for the interpretation and scoring of responses. Projective techniques consist of relatively ambiguous stimuli or prompts, the possible responses to which are largely open-ended, accompanied by instructions (or at least guidelines) for their scoring or interpretation. The distinction among these three methods, however, is somewhat fluid. A self-report inventory is equivalent to a fully structured, self-administered interview; semistructured interviews are essentially verbally administered self-report inventories that include at least some open-ended questions; and fully structured interviews are essentially verbally administered self-report inventories.

Each of the methods will be considered below, including a discussion of issues relevant to or problematic for that respective method. However, the issues discussed for one method of assessment will apply to another. Illustrative instruments for each method are also presented.

3.07.2 SELF-REPORT INVENTORIES

The single most popular method for the assessment of personality by researchers of normal personality functioning is a self-report inventory (SRI). The advantages of SRIs, relative to semistructured interviews and projective techniques, are perhaps self-evident. They are substantially less expensive and time-consuming to administer. Data on hundreds of persons can be obtained, scored, and analyzed at relatively little cost to the researcher. Their inexpensive nature allows collection of vast amounts of normative data that are unavailable for most semistructured interviews and projective techniques. These normative data in turn facilitate substantially their validation, as well as their interpretation by and utility to researchers and clinicians.

The high degree of structure of SRIs has also contributed to much better intersite reliability than is typically obtained by semistructured interviews or projective techniques. The findings of SRIs are very sensitive to the anxious, depressed, or angry mood states of respondents, contributing at times to poor test–retest reliability (discussed further below). However, the correlation between two SRIs is much more likely to be consistent across time and across research sites than the correlation between two semistructured interviews or two projective techniques. SRIs might be more susceptible to mood state distortions than semistructured interviews, but this susceptibility may itself be observed more reliably across different studies than the lack of susceptibility of semistructured interviews.

The specific and explicit nature of SRIs has also been very useful in researching and understanding the source and nature of subjects' responses. Much more is known about the effects of different item formats, length of scales, demographic variables, base rates, response sets, and other moderating variables from the results of SRIs than from semistructured interviews or projective techniques. Five issues to be discussed below are (i) item analyses; (ii) gender, ethnic, and cultural differences; (iii) individual differences; (iv) response distortion; and (v) automated assessments.

3.07.2.1 Item Analyses

There are a variety of methods for constructing, selecting, and evaluating the items to be included within an SRI (Clark & Watson, 1995). However, it is useful to highlight a few points within this chapter as such analyses are of direct importance to personality assessment research.

An obvious method for item construction, often termed the rational approach, is to construct items that describe explicitly the trait being assessed. For example, the most frequently used SRI in clinical research for the assessment of personality disorders is the Personality Diagnostic Questionnaire-4 (PDQ-4; Hyler, 1994). Its items were written to inquire explicitly with respect to each of the features of the 10 personality disorders included in the American Psychiatric Association's (APA) Diagnostic and statistical manual of mental disorders (DSM-IV; APA, 1994). For example, the dependent personality criterion, "has difficulty expressing disagreement with others because of fear of loss of support or approval" (APA, 1994, p. 668) is assessed by the PDQ-4 item, "I fear losing the support of others if I disagree with them" (Hyler, 1994, p. 4).

The content validity of an SRI is problematic to the extent that any facets of the trait being assessed are not included or are inadequately represented, items representing other traits are included, or the set provides a disproportionately greater representation of one facet of the trait relative to another (Haynes, Richard, & Kubany, 1995). "The exact phrasing of items can exert a profound influence on the construct that is actually measured" (Clark & Watson, 1995, p. 7) yet the content of items often receives little consideration in personality assessment research. Widiger, Williams, Spitzer, and Frances (1985) concluded from a systematic content analysis of the Millon Clinical Multiaxial Inventory (MCMI; Millon, 1983) that the MCMI failed to represent adequately many significant features of the antisocial personality disorder.

    Many of the [personality] traits are not sampled at all (e.g., inability to sustain consistent work behavior, lack of ability to function as a responsible parent, inability to maintain an enduring attachment to a sexual partner, failure to plan ahead or impulsivity, and disregard for the truth), including an essential requirement of the DSM-III antisocial criteria to exhibit significant delinquent behavior prior to the age of 15. (Widiger et al., 1985, p. 375)

Many MCMI-III items are written in part to represent an alternative theoretical formulation for the DSM-IV personality syndromes (Millon et al., 1996).

High content validity, however, will not ensure a valid assessment of a respective personality trait. Although many of the PDQ-4 dependency items do appear to represent adequately their respective diagnostic criteria, it is difficult to anticipate how an item will actually perform when administered to persons with varying degrees of dependency and to persons with varying degrees of other personality traits, syndromes, or demographic characteristics. Most authors of SRIs also consider the findings of internal consistency analyses (Smith & McCarthy, 1995). "Currently, the single most widely used method for item selection in scale development is some form of internal consistency analysis" (Clark & Watson, 1995, p. 313).

Presumably, an item should correlate more highly with other items from the same scale (e.g., dependent personality traits) than items from another scale (e.g., borderline personality traits), consistent with the basic principles of convergent and discriminant validity. However, these assumptions hold only for personality syndromes that are homogeneous in content and that are distinct from other syndromes, neither of which may be true for syndromes that involve overlapping constellations of personality traits (Pilkonis, 1997; Shea, 1995). Confining a scale to items that correlate highly with one another can result in an overly narrow assessment of a construct, and deleting items that correlate highly with other scales can result in false distinctions and a distorted representation of the trait being assessed (Clark & Watson, 1995; Smith & McCarthy, 1995). For example, the APA (1994) criteria set for the assessment of antisocial personality disorder does not include such items as lacks empathy and arrogant self-appraisal that are included in alternative criteria sets for this personality syndrome (Hare, 1991) in part because these items are already contained within the DSM-IV criteria set for the narcissistic personality disorder. Their inclusion within the criteria set for antisocial personality disorder would complicate the differentiation of the antisocial and narcissistic personality syndromes (Gunderson, 1992), but the failure to include these items may also provide an inadequate description and assessment of antisocial (psychopathic) personality traits (Hare, Hart, & Harpur, 1991).

Overlapping scales, on the other hand, have their own problems, particularly if they are to be used to test hypotheses regarding the relationships among or differences between the traits and syndromes being assessed (Helmes & Reddon, 1993). For example, the MCMI-III personality scales overlap substantially (Millon, Millon, & Davis, 1994). Scale overlap was in part a pragmatic necessity of assessing 14 personality disorders and 10 clinical syndromes with no more than 175 items. However, this overlap was also intentional to ensure that the scales be consistent with the overlap of the personality constructs being assessed. "Multiple keying or item overlapping for the MCMI inventories was designed to fit its theory's polythetic structure, and the scales constructed to represent it" (Millon, 1987, p. 130). However, the MCMI-III cannot then be used to assess the validity of this polythetic structure, or to test hypotheses concerning the covariation, differentiation, or relationship among the personality constructs, as the findings will be compelled by the scale overlap (Helmes & Reddon, 1993). For example, the MCMI-III provides little possibility for researchers to fail to confirm a close relationship between, or comparable findings for, sadistic and antisocial personality traits, given that eight of the 17 antisocial items (47%) are shared with the sadistic scale. The same point can be made for studies using the Minnesota Multiphasic Personality Inventory (MMPI-2) antisociality and cynicism scales, as they share approximately a third of their items (Helmes & Reddon, 1993). It was for this reason that Morey, Waugh, and Blashfield (1985) developed both overlapping and nonoverlapping MMPI personality disorder scales.

Scale homogeneity and interscale distinctiveness are usually emphasized in SRIs constructed through factor analyses (Clark & Watson, 1995). For example, in the construction of the NEO Personality Inventory-Revised (NEO-PI-R), Costa and McCrae (1992)

    adopted factor analysis as the basis for item selection because it identifies clusters of items that covary with each other but which are relatively independent of other item clusters – in other words, items that show convergent validity with respect to other items in the cluster and divergent validity with respect to other items [outside the cluster]. (p. 40)

Reliance upon factor analysis for scale construction has received some criticism (Block, 1995; Millon et al., 1996) and there are, indeed, instances in which factor analyses have been conducted with little appreciation of the limitations of the approach (Floyd & Widaman, 1995), but this is perhaps no different than for any other statistical technique. Factor analysis remains a powerful tool for identifying underlying, latent constructs, for data reduction, and for the validation of a hypothesized dimensional structure (Smith & McCarthy, 1995; Watson et al., 1994).

vergent, and discriminant validity (Ozer & Reise, 1994). Illustrative studies with SRIs include Ben-Porath, McCully, and Almagor (1993), Clark, Livesley, Schroeder, and Irish (1996), Lilienfeld and Andrews (1996), Robins and John (1997), and Trull, Useda, Costa, and McCrae (1995).

The MMPI-2 clinical scales exemplify the purely empirical approach to scale construction. "The original MMPI (Hathaway & McKinley, 1940) was launched with the view that the content of the items was relatively unimportant and what actually mattered was whether the item was endorsed by a particular clinical group" (Butcher, 1995, p. 302). This allowed the test to be "free from the restriction that the subject must be able to describe his own behavior accurately" (Meehl, 1945, p. 297). For example, one of the items on the MMPI hysteria scale was "I enjoy detective or mystery stories" (keyed false; Hathaway & McKinley, 1982).

Indicating that one does not enjoy detective or mystery stories does not appear to have any obvious (or perhaps even meaningful) relationship to the presence of the mental disorder of hysteria, but answering false did correlate significantly with the occurrence of this disorder. "The literal content of the stimulus (the item) is entirely unimportant, and even irrelevant and potentially misleading. Sophisticated
In any case, the psychometric use of test items dictates that the construction of the NEO-PI-R through factor test interpreter ignore item content altogether analysis was consistent with the theoretical lest he or she be misledº (Ben-Porath, 1994, model for the personality traits being assessed p. 364). Limitations to this approach, however, (Clark & Watson, 1995; Costa & McCrae, 1992; are suggested below. Goldberg, 1990). Additional illustrations of A sophisticated approach to item analysis is theoretically driven factor analytic scale con- item response theory (IRT; Hambleton, Swa- structions are provided by the Dimensional minathan, & Rogers, 1991). Its most unique Assessment of Personality PathologyÐBasic datum is the probability of a response to a Questionnaire (DAPP-BQ; Livesley & Jack- particular item given a particular level of the son, in press), the Schedule for Nonadaptive personality trait being assessed. IRT analyses and Adaptive Personality (SNAP; Clark, 1993), demonstrate graphically how items can vary in the Interpersonal Adjective Scales (IAS; Wig- their discriminative ability at different levels of gins & Trobst, 1997), the Multidimensional the trait being assessed (assuming that the Personality Questionnaire (MPQ; Tellegen & anchoring items are not themselves system- Waller, in press), and the Temperament and atically biased). It is not the case that all items Character Inventory (TCI; Cloninger & perform equally well, nor does any particular Svrakic, 1994). item perform equally well at all levels of the The most informative item analyses will be trait. 
It may be the case that none of the items on correlations with external validators of the a scale provide any discrimination at particular personality traits, including the ability of items levels of a trait, or that the scale is predominated to identify persons with the trait in question, by items that discriminate at high rather than correlations with hypothesized indicators of moderate or low levels of the trait. For example, the trait, and an absence of correlations with it is possible that the MMPI-2 items to assess the variables that are theoretically unrelated to the personality domain of neuroticism are weighted trait (Smith & McCarthy, 1995). These data toward a discrimination of high levels of typically are discussed under a general heading neuroticism, providing very little discrimination of construct validity, including data concerning at low levels. IRT analyses might allow items' concurrent, postdictive, predictive, con- researchers to maximize the sensitivity of a test Self-report Inventories 149 to a particular population (e.g., inpatient psychopathic. A male who has more traits (or hospital vs. prison setting) by deleting items symptoms) of psychopathy could be described that work poorly within certain settings (Em- by the MMPI-2 as being less psychopathic than bretson, 1996). IRT analyses have been used a female with the same number of traits. The widely with tests of abilities and skills, and are separate norms provided for males and females now being applied to measures of personality on SRIs are never so substantial as to eliminate (Ozer & Reise, 1994). Illustrative applications entirely the differences between the sexes, but include analyses of items from the MPQ the rationale for reducing or minimizing any (Tellegen & Waller, in press) by Reise and differences is unclear. 
Waller (1993), items from a measure of The provision of separate norms for males alexithymia by Hendryx, Haviland, Gibbons, and females by SRIs is inconsistent with other and Clark (1992) and items from a psychopathy domains of assessment, including most semi- semistructured interview by Cooke and Michie structured interview and projective assessments (1997). of personality variables. For example, the same threshold is used for males and females by the 3.07.2.2 Gender, Ethnic, and Cultural most commonly used semistructured interview Differences for the assessment of psychopathy, the (revised) Psychopathy Checklist (Hare, 1991). Separate Differences among gender, ethnic, cultural, norms are not provided for females and males in and other demographic populations with re- the assessment of intelligence, nor are they spect to personality traits are often socially provided for different cultural and ethnic sensitive, politically controversial, and difficult groups in the assessment of personality. to explain (Eagly, 1995). There is substantial Although statistically significant differences SRI research to indicate differences, on average, have also been obtained across ethnic and between males and females for a wide variety of cultural groups for SRI personality measures, personality traits, as assessed by research with ªnone of the major personality measures . . . SRIs (Sackett & Wilk, 1994). For example, offered norm scoring based on race or ethnicity males obtain higher scores on measures of either as a routine aspect of the scoring system assertiveness and self-esteem; whereas females or as a scoring optionº (Sackett & Wilk, 1994, obtain higher scores on measures of anxious- p. 947). It is unclear whether separate norms ness, trust, gregariousness, and tender-mind- should be provided for ethnic groups, but it does edness (Feingold, 1994). appear to be inconsistent to provide separate Separate norms therefore have been devel- norms for gender and not for ethnicity. 
oped for the interpretation of most SRI Separate norms would be appropriate if there personality scales (e.g., Costa & McCrae, is reason to believe that the SRI items are biased 1992; Graham, 1993; Millon et al., 1994). against a particular gender, ethnic, or cultural However, the rationale for providing different group. ªBias is the extent to which measured norms for each sex (gender) is unclear. Sackett group differences are invalid . . . Group differ- and Wilk (1994) ªsearched test manuals, hand- ences are invalid to the extent that the constructs books, and the like for a discussion of the that distinguish between groups are different rationale for the practice. We typically found from the constructs the measures were intended noneº (p. 944). They indicated, with some to representº (Kehoe & Tenopyr, 1994, p. 294). surprise, that the provision of separate norms Consider, for example, the MMPI item cited for males and females ªdoes not seem to have earlier, ªI enjoy detective or mystery storiesº been viewed as at all controversialº (Sackett & (Hathway & McKinley, 1982, p. 3). The MMPI Wilk, 1994, p. 944). If the SRI is indicating an reliance upon a blind empiricism for item actual difference between males and females selection can be problematic if the basis for (Feingold, 1994) it is unclear why this difference the item endorsement is for reasons other than, showed then be eliminated or diminished in SRI or in addition to, the presence of the construct assessments of males and females (Kehoe & being assessed. Lindsay and Widiger (1995) Tenopyr, 1994; Sackett & Wilk, 1994). suggested that this item might have been For example, males and females with the correlated with hysteria because hysteria was same raw score on the psychopathic deviate itself correlated with female gender. Signifi- scale of the MMPI-2 will be given different final cantly more females than males are diagnosed scores due to the use of different norms. 
with hysteria and significantly more females Different raw scores for males and females than males respond negatively to an interest in can then result in the same final score (Graham, detective or mystery stories; therefore, the item 1993). The extent of psychopathy in females is may have correlated with hysteria because it was then relative to other females; it is not a measure identifying the presence of females rather than of the actual (absolute) extent to which they are the presence of hysteria. In addition, 150 Personality Assessment

    because such an item concerns normal behavior that occurs more often in one sex than in the other, we would consider it to be sex biased because its errors in prediction (false positives) will occur more often in one sex than in the other. (Lindsay & Widiger, 1995, p. 2)

One would never consider including as one of the DSM-IV diagnostic criteria for histrionic personality disorder, "I have often enjoyed reading Cosmopolitan and Ms. Magazine" but many such items are included in SRIs to diagnose the presence of a personality disorder. For example, responding false to the item "in the past, I've gotten involved sexually with many people who didn't matter much to me," is used for the assessment of a dependent personality disorder on the MCMI-II (Millon, 1987, p. 190). Not getting involved sexually with many people who do not matter much is hardly an indication of a dependent personality disorder (or of any dysfunction), yet it is used to diagnose the presence of this personality disorder.

There are also data to suggest that the same items on an SRI may have a different meaning to persons of different ethnic, cultural, or gender groups (Okazaki & Sue, 1995), although the magnitude, consistency, and significance of these differences has been questioned (Ozer & Reise, 1994; Timbrook & Graham, 1994). Applications of IRT analyses to the assessment of bias would be useful (Kehoe & Tenopyr, 1994). For example, Santor, Ramsay, and Zuroff (1994) examined whether men and women at equal levels of depression respond differentially to individual items. They reported no significant differences for most items, but bias was suggested for a few, including a body-image dissatisfaction item. Females were more likely to endorse body-image dissatisfaction than males even when they were at the same level of depression as males, suggesting that the endorsement of the item by women (relative to men) reflected their gender independently of their level of depression.

3.07.2.3 Individual Differences

The provision of any set of norms may itself be questioned. Personality description with SRIs traditionally has been provided in reference to individual differences. "Raw scores on personality inventories are usually meaningless--responses take on meaning only when they are compared to the responses of others" (Costa & McCrae, 1992, p. 13). However, this individual differences approach can itself be problematic. "The individual differences research paradigm, which has thoroughly dominated empirical personality research throughout the present century, is fundamentally inadequate for the purposes of a science of personality" (Lamiell, 1981, p. 36). Lamiell's scathing critique of individual differences research raised many important and compelling concerns. For example, he noted how the test-retest reliability of a personality scale typically is interpreted as indicating the extent to which the expression of a trait is stable in persons across time, but most analyses in fact indicate the extent to which persons maintain their relative position on a scale over time. The test-retest reliability of relative position may itself correlate highly with the test-retest reliability of the magnitude within each person, but it need not. For example, the reliability of a measure of height, assessed across 100 persons using a product-moment correlation, would suggest substantial stability between the ages of eight and 15, despite the fact that each person's actual height would have changed substantially during this time. Height is not at all consistent or stable across childhood, yet a measure of stability in height would be very high if the relative height among the persons changed very little. This confusion, however, would be addressed by alternative reliability statistics (e.g., an intraclass correlation coefficient).

Interpreting personality test scores relative to a particular population is not necessarily problematic as long as the existence of these norms and their implications for test interpretation are understood adequately. For example, a score on an MMPI-2 social introversion scale does not indicate how introverted a person is, but how much more (or less) introverted the person is relative to a particular normative group. The MMPI-2 social introversion scale does not indicate the extent to which an Asian-American male is socially introverted, it indicates how more (or less) introverted he is relative to a sample of 1138 American males who provided the normative data, only 1% of whom were Asian-American (Graham, 1993).

Nevertheless, researchers and clinicians will at times refer to SRI results as if they are providing an absolute measure of a personality trait. For example, researchers and clinicians will describe a person as having high self-esteem, low self-monitoring, or high dependency, when in fact the person is simply higher or lower than a particular comparison group. It is common in personality research to construct groups of subjects on the basis of median splits on SRI scores for such traits as self-esteem, self-monitoring, dependency, autonomy, or some other personality construct. Baumeister, Tice, and Hutton (1989) searched the literature for all studies concerning self-esteem. "Most often . . .
high and low self-esteem groups are created by performing a median split on the self-esteem scores across the sample" (p. 556). Persons above the median are identified as having "high self-esteem" whereas persons below the median are identified as having "low self-esteem." However, interpreting SRI data in this manner is inaccurate and potentially misleading. Persons below a median would indeed have less self-esteem than the persons above a median, but it is unknown whether they are in fact either high or low in self-esteem. All of the subjects might be rather high (or low). This method of assessing subjects is comparable to providing a measure of psychopathy to a sample of nuns to identify a group of psychopaths and nonpsychopaths, or a measure of altruism to convicts within a prison to identify a group of saints and sinners. Yet, many researchers will provide measures of self-esteem (self-monitoring, dependency, narcissism, or some other SRI measure) to a group of well-functioning college students to identify persons high and low in self-esteem. Baumeister et al. (1989) indeed found that "in all cases that we could determine, the sample midpoint was higher (in self-esteem) than the conceptual midpoint, and generally the discrepancy was substantial and significant" (p. 559).

3.07.2.4 Response Distortion

SRIs rely substantially on the ability of the respondent to provide a valid self-description. To the extent that a person does not understand the item, or is unwilling or impaired in his or her ability to provide an accurate response, the results will be inaccurate. "Detection of an attempt to provide misleading information is a vital and necessary component of the clinical interpretation of test results" (Ben-Porath & Waller, 1992, p. 24).

The presence of sufficiently accurate self-description is probably a reasonable assumption in most instances (Costa & McCrae, 1997). However, it may also be the case that no person will be entirely accurate in the description of his or her personality. Each person probably evidences some degree of distortion, either minimizing or exaggerating flaws or desirabilities. Such response distortion will be particularly evident in persons characterized by personality disorders (Westen, 1991). Antisocial persons will tend to be characteristically dishonest or deceptive in their self-descriptions, dependent persons may self-denigrate, paranoid persons will often be wary and suspicious, borderline persons will tend to idealize and devalue, narcissistic persons will often be arrogant or self-promotional, and histrionic persons will often be overemotional, exaggerated, or melodramatic in their self-descriptions (APA, 1994). Response distortion may also be common within some populations, such as forensic, disability, or psychiatric settings, due in part to the higher prevalence of personality disorder symptomatology within these populations but due as well to the pressures, rewards, and inducements within these settings to provide inaccurate self-descriptions (Berry, 1995).

A substantial amount of research has been conducted on the detection of response distortion (otherwise known as response sets or biases), particularly with the MMPI-2 (Butcher & Rouse, 1996). There are self-report scales to detect nonresponsiveness (e.g., random responding, yea-saying or nay-saying), overreporting of symptoms (e.g., malingering, faking bad, exaggeration, or self-denigration), and underreporting (e.g., faking good, denial, defensiveness, minimization, or self-aggrandizement). These scales often are referred to as validity scales, as their primary function has been to indicate the extent to which the scores on the personality (or clinical) scales are providing an accurate or valid self-description, and they do appear to be generally successful in identifying the respective response distortions (Berry, Wetter, & Baer, 1995).

However, it is not always clear whether the prevalence of a respective form of response distortion is frequent enough to warrant its detection within all settings (Costa & McCrae, 1997). Acquiescence, random responding, and nay-saying might be so infrequent within most settings that the costs of false positive identifications (i.e., identifying a personality description as invalid when it is in fact valid) will outweigh the costs of false negatives (identifying a description as valid when it was in fact invalid). In addition, much of the research on validity scales has been confined to analogue studies in which various response distortions are simulated by college students, psychiatric patients, or other persons. There is much less data to indicate that the inclusion of a validity scale as a moderator or suppressor variable actually improves the validity of the personality assessment. For example, it is unclear whether the correlation of a measure of the personality trait of neuroticism with another variable (e.g., drug usage) would increase when variance due to a response distortion (e.g., malingering or exaggeration) is partialled from the measure of neuroticism (Costa & McCrae, 1997).

In some contexts, such as a disability evaluation, response distortions are to be avoided, whereas in other contexts, such as the assessment of maladaptive personality traits, they may be what the clinician is seeking to identify (Ozer & Reise, 1994). Validity scales typically are interpreted as indicators of the presence of misleading or inaccurate information, but the response tendencies assessed by validity scales are central to some personality disorders. For example, an elevation on a malingering scale would indicate the presence of deception or dishonesty, suggesting perhaps that the information provided by the subject in response to the other SRI items was inaccurate or misleading. However, one might not want to partial the variance due to malingering from an SRI measure of antisocial or psychopathic personality. The validity of a measure of psychopathy would be reduced if variance due to deception or dishonesty was extracted, as dishonesty or deception is itself a facet of this personality syndrome (APA, 1994; Hare, 1991; Lilienfeld, 1994). It might be comparably misleading to extract variance due to symptom exaggeration from a measure of borderline personality traits, self-aggrandizement from a measure of narcissism, or self-denigration from a measure of dependency. Validity scales are not just providing a measure of response distortion that undermines the validity of personality description; they are also providing valid descriptions of highly relevant and fundamental personality traits.

For example, a response distortion that has been of considerable interest and concern is social desirability. Persons who are instructed to attribute falsely to themselves good, desirable qualities provide elevations on measures of social desirability, and many researchers have therefore extracted from a measure of personality the variance that is due to this apparent response bias. However, extracting this variance typically will reduce rather than increase the validity of personality measures because much of the socially desirable self-description does in fact constitute valid self-description (Borkenau & Ostendorf, 1992; McCrae & Costa, 1983). Persons who characteristically describe themselves in a socially desirable manner are not simply providing false and misleading information. Much of the variance within a measure of social desirability is due to persons either describing themselves accurately as having many desirable personality traits or, equally accurately, as having a personality disposition of self-aggrandizement, arrogance, or denial. This problem could be addressed by first extracting the variance due to valid individual differences from a measure of social desirability before it is used as a moderator variable, but then it might obviously fail to be useful as a moderator variable to the personality scales.

3.07.2.5 Automated Assessment and Base Rates

The structured nature of the responses to SRIs has also facilitated the development of automated (computerized) systems for scoring and interpretation. The researcher or clinician simply uses score sheets that can be submitted for computerized scanning, receiving in return a complete scoring and, in most cases, a narrative description of the subject's personality derived from the theoretical model of, and the data considered by, the author(s) of the computer system. Automated systems are available for most of the major personality SRIs, and their advantages are self-evident. "The computer can store and access a much larger fund of interpretive literature and base-rate data than any individual clinician can master, contributing to the accuracy, objectivity, reliability, and validity of computerized reports" (Keller, Butcher, & Slutske, 1990, p. 360).

Automated interpretive systems can also be seductive. They might provide the appearance of more objectivity and validity than is in fact the case (Butcher, 1995). Clinicians and researchers should always become closely familiar with the actual data and procedures used to develop an automated system, and the subsequent research assessing its validity, in order to evaluate objectively for themselves the nature and extent of the empirical support. For example,

    while any single clinician may do well to learn the norms and base rates of the client population he or she sees most often, a computer program can refer to a variety of population norms and, if programmed to do so, will always "remember" to tailor interpretive statements according to modifying demographic data such as education, marital status, and ethnicity. (Keller et al., 1990, p. 360)

This is an excellent sentiment, describing well the potential benefits of an automated report. However, simply providing a subject's education, marital status, and ethnicity to the automated computer system does not necessarily mean that this information will in fact be used. As noted above, very few of the SRIs consider ethnicity.

A purported advantage of the MCMI-III automated scoring system is that "actuarial base rate data, rather than normalized standard score transformations, were employed in calculating scale measures" (Millon et al., 1994, p. 4). The cutoff points used for the interpretation of the MCMI-III personality scales are based on the base rates of the personality syndromes within the population. "The BR [base rate] score was designed to anchor cut-off points to the prevalence of a particular attribute in the psychiatric population" (Millon et al., 1994, p. 26). This approach appears to be quite sophisticated, as it is seemingly responsive to the failure of clinicians and researchers to consider the effect of base rates on the validity of an SRI cutoff point (Finn & Kamphuis, 1995). "These data not only provide a basis for selecting optimal differential diagnostic cut-off scores but also ensure that the frequencies of MCMI-III diagnoses and profile patterns are comparable to representative clinical prevalence rates" (Millon et al., 1994, p. 4).

However, the MCMI-III automated scoring system does not in fact make adjustments to cutoff points depending upon the base rate of the syndrome within a local clinical setting. The MCMI-III uses the same cutoff point for all settings and, therefore, for all possible base rates. The advantage of setting a cutoff point according to the base rate of a syndrome is lost if the cutoff point remains fixed across different base rates (Finn & Kamphuis, 1995). "The use of the MCMI-III should be limited to populations that are not notably different in background from the sample employed to develop the instrument's base rate norms" (Millon et al., 1994, p. 35). It is for this reason that Millon et al. discourage the use of the MCMI-III within normal (community or college) populations (a limitation not shared by most other SRIs). In fact, the population sampled for the MCMI-III revision might have itself been notably different in background from the original sample used to develop the base rate scores. Retzlaff (1996) calculated the probability of having a personality disorder given the obtainment of a MCMI-III respective cutoff point, using the data provided in the MCMI-III test manual. Retzlaff concluded that "as determined by currently available research, the operating characteristics of the MCMI-III scales are poor" (p. 437). The probabilities varied from 0.08 to a high of only 0.32. For example, the probability of having an avoidant personality disorder if one surpassed the respective cutoff point was only 0.17, for the borderline scale it was only 0.18, and for the antisocial scale it was only 0.07. As Retzlaff indicated, "these hit rate statistics are well under one-half of the MCMI-II validities" (Retzlaff, 1996, p. 435).

Retzlaff (1996) suggested that there were two possible explanations for these results, either "the test is bad or that the validity study was bad" (p. 435) and he concluded that the fault lay with the MCMI-III cross-validation data. "At best, the test is probably valid, but there is no evidence and, as such, it cannot be trusted until better validity data are available" (Retzlaff, 1996, p. 435). However, it is unlikely that any new data will improve these statistics, unless the new study replicates more closely the population characteristics of the original derivation study. A third alternative is to have the cutoff points for the scales be adjusted for different base rates within local settings, but no automated scoring system currently provides this option.

3.07.2.6 Illustrative Instruments

There are many SRIs for the assessment of normal and maladaptive personality traits, including the SNAP (Clark, 1993), the DAPP-BQ (Livesley & Jackson, in press), the Personality Assessment Inventory (Morey, 1996), the MPQ (Tellegen & Waller, in press), the MCMI-III (Millon et al., 1994), the PDQ-4 (Hyler, 1994), the IAS (Wiggins & Trobst, 1997), the Wisconsin Personality Inventory (Klein et al., 1993), and the TCI (Cloninger & Svrakic, 1994). Space limitations prohibit a consideration of each of them. A brief discussion, however, will be provided for the two dominant SRIs within the field of clinical personality assessment, the MMPI-2 (Graham, 1993; Hathaway et al., 1989) and the NEO-PI-R (Costa & McCrae, 1992).

3.07.2.6.1 Minnesota Multiphasic Personality Inventory-2

The MMPI-2 is an SRI that consists of 566 true/false items (Graham, 1993; Hathaway et al., 1989). It is the most heavily researched and validated SRI, and the most commonly used in clinical practice (Butcher & Rouse, 1996). Its continued usage is due in part to familiarity and tradition, but its popularity also reflects the obtainment of a substantial amount of empirical support and normative data over the many years of its existence (Graham, 1993).

The MMPI-2 is described as "the primary self-report measure of abnormal personality" (Ben-Porath, 1994, p. 363) but its most common usage is for the assessment of anxiety, depressive, substance, psychotic, and other such (Axis I) mental disorders rather than for the assessment of personality traits (Ozer & Reise, 1994). "The MMPI-2 clinical scales are . . . measures of various forms of psychopathology . . . and not measures of general personality" (Greene, Gwin, & Staal, 1997, p. 21). This is somewhat ironic, given that it is titled as a personality inventory (Helmes & Reddon, 1993).

However, the MMPI-2 item pool is extensive and many additional scales have been developed beyond the basic 10 clinical and three validity scales (Graham, 1993). For example, some of the new MMPI-2 content scales (Ben-Porath, 1994; Butcher, 1995), modeled after the seminal research of Wiggins (1966), do assess personality traits (e.g., cynicism, social discomfort, Type A behavior, and low self-esteem). The Morey et al. (1985) personality disorder scales have also been updated and normed for the MMPI-2 (Colligan, Morey, & Offord, 1994) and an alternative set of MMPI-2, DSM-IV personality disorder scales are being developed by Somwaru and Ben-Porath (1995). Harkness and McNulty (1994) have developed scales (i.e., the PSY-5) to assess five broad domains of personality (i.e., neuroticism, extraversion, psychoticism, aggressiveness, and constraint) that are said to provide the MMPI-2 assessment of, or an alternative to, the five-factor model of personality (Ben-Porath, 1994; Butcher & Rouse, 1996). "Other measures of five-factor models of normal personality will need to demonstrate incremental validity in comparison to the full set of MMPI-2 scales (including the PSY-5) to justify their use in clinical practice" (Ben-Porath, 1994, p. 393). However, there are important distinctions between the PSY-5 and the five-factor model constructs, particularly for psychoticism and aggressiveness (Harkness & McNulty, 1994; Harkness, McNulty, & Ben-Porath, 1995; Widiger & Trull, 1997). In addition, the utility of the MMPI-2 for personality trait description and research may be limited by the predominance of items concerning clinical symptomatology and the inadequate representation of important domains of personality, such as conscientiousness (Costa, Zonderman, McCrae, & Williams, 1985) and constraint (DiLalla, Gottesman, Carey, & Vogler, 1993).

3.07.2.6.2 NEO Personality Inventory-Revised

The most comprehensive model of personality trait description is provided by the five-factor model (Saucier & Goldberg, 1996; Wiggins & Pincus, 1992).

or aggression, tender-mindedness vs. tough-mindedness (lack of empathy), modesty vs. arrogance, and straightforwardness vs. deception or manipulation. The domains and facets of the NEO-PI-R relate closely to other models of personality, such as the constructs of affiliation and power within the interpersonal circumplex model of personality (McCrae & Costa, 1989; Wiggins & Pincus, 1992). The NEO-PI-R assessment of the five-factor model also relates closely to the DSM-IV personality disorder nomenclature, despite its original development as a measure of normal personality functioning (Widiger & Costa, 1994).

The application of the NEO-PI-R within clinical settings, however, may be problematic, due to the absence of extensive validity scales to detect mood state and response-set distortion (Ben-Porath & Waller, 1992). A valid application of the NEO-PI-R requires that the respondent be capable of and motivated to provide a reasonably accurate self-description. This is perhaps a safe assumption for most cases (Costa & McCrae, 1997), but the NEO-PI-R might not be successful in identifying when this assumption is inappropriate. Potential validity scales for the NEO-PI-R, however, are being researched (Schinka, Kinder, & Kremer, 1997).

3.07.3 SEMISTRUCTURED INTERVIEWS

The single most popular method for clinical assessments of personality by psychologists working in either private practice, inpatient hospitals, or university psychology departments, is an unstructured interview (Watkins, Campbell, Nieberding, & Hallmark, 1995). Whereas most researchers of personality disorder rely upon semistructured interviews (SSI) (Rogers, 1995; Zimmerman, 1994), there are but a few SSIs for the assessment of normal
Even the most ardent personality functioning (e.g., Trull & Widiger, critics of the five-factor model acknowledge its 1997) and none of the psychologists in a survey importance and impact (e.g., Block, 1995; of practicing clinicians cited the use of a Butcher & Rouse, 1996; Millon et al., 1996). semistructured interview (Watkins et al., 1995). And, the predominant measure of the five- Unstructured clinical interviews rely entirely factor model is the NEO-PI-R (Costa & upon the training, expertise, and conscientious- McCrae, 1992, 1997; Ozer & Reise, 1994; ness of the interviewer to provide an accurate Widiger & Trull, 1997). assessment of a person's personality. They are The 240 item NEO-PI-R (Costa & McCrae, problematic for research as they are notor- 1992) assesses five broad domains of person- iously unreliable, idiosyncratic, and prone to ality: neuroticism, extraversion (vs. introver- false assumptions, attributional errors, and sion), openness (vs. closedness), agreeableness misleading expectations (Garb, 1997). For (vs. antagonism), and conscientiousness. Each example, many clinical interviewers fail to item is rated on a five-point scale, from strongly provide a comprehensive assessment of a disagree to strongly agree. Each domain is patient's maladaptive personality traits. Only differentiated into six underlying facets. For one personality disorder diagnosis typically is example, the six facets of agreeableness vs. provided to a patient, despite the fact that most antagonism are trust vs. mistrust, altruism vs. patients will meet criteria for multiple diag- exploitiveness, compliance vs. oppositionalism noses (Gunderson, 1992). Clinicians tend to Semistructured Interviews 155 diagnose personality traits and disorders hier- Ochoa, 1989). This overdiagnosis of histrionic archically. 
Once a patient is identified as having personality disorder in females is diminished a particular personality disorder (e.g., border- substantially when the clinician is compelled to line), clinicians will fail to assess whether assess systematically each one of the features. additional personality traits are present (Her- ªSex biases may best be diminished by an kov & Blashfield, 1995). Alder, Drake, and increased emphasis in training programs and Teague (1990) provided 46 clinicians with case clinical settings on the systematic use and histories of a patient that met the DSM-III adherence to the [diagnostic] criteriaº (Ford criteria for four personality disorders (i.e., & Widiger, 1989, p. 304). histrionic, narcissistic, borderline, and depen- The reluctance to use semistructured inter- dent). ªDespite the directive to consider each views within clinical practice, however, is category separately . . . most clinicians assigned understandable. Semistructured interviews that just one [personality disorder] diagnosisº (Adler assess all of the DSM-IV personality disorder et al., 1990, p. 127). Sixty-five percent of the diagnostic criteria can require more than two clinicians provided only one diagnosis, 28% hours for their complete administration (Widi- provided two, and none provided all four. ger & Sanderson, 1995). This is unrealistic and Unstructured clinical assessments of person- impractical in routine clinical practice, particu- ality also fail to be systematic. Morey and larly if the bulk of the time is spent in Ochua (1989) provided 291 clinicians with the determining the absence of traits. 
However, 166 DSM-III personality disorder diagnostic the time can be diminished substantially by first criteria (presented in a randomized order) and administering an SRI to identify which domains asked them to indicate which personality of personality functioning should be empha- disorder(s) were present in one of their patients sized and which could be safely ignored and to indicate which of the 166 diagnostic (Widiger & Sanderson, 1995). criteria were present. Kappa for the agreement Unstructured interviews are also preferred by between their diagnoses and the diagnoses that clinicians because they find SSIs to be too would be given based upon the diagnostic constraining and superficial. Most clinicians criteria they indicated to be present, ranged prefer to follow leads that arise during an from 0.11 (schizoid) to only 0.58 (borderline). In interview, adjusting the content and style to other words, their clinical diagnoses agreed facilitate rapport and to respond to the poorly with their own assessments of the particular needs of an individual patient. The diagnostic criteria for each disorder. The results questions provided by an SSI can appear, in of this study were replicated by Blashfield and comparison, to be inadequate and simplistic. Herkov (1996). Agreement in this instance However, SSIs are considered to be semistruc- ranged from 0.28 (schizoid) to 0.63 (borderline). tured because they require (or allow) profes- ªIt appears that the actual diagnoses of sional judgment and discretion in their clinicians do not adhere closely to the diagnoses administration and scoring. They are not simply suggested by the [diagnostic] criteriaº (Blash- mindless administrations of an SRI. The field & Herkov, 1996, p. 226). 
responsibility of an SSI interviewer is to assess Clinicians often base their personality dis- for the presence of a respective trait, not to just order assessments on the presence of just one or record a subject's responses to a series of two features, failing to consider whether a structured questions. Follow-up questions that sufficient number of the necessary features are must be sensitive and responsive to the mood present (Blashfield & Herkov, 1996). In addi- state, defensiveness, and self-awareness of the tion, the one or two features that one clinician person being interviewed are always required considers to be sufficient may not be consistent and are left to the expertise and discretion of the with the feature(s) emphasized by another interviewer (Widiger, Frances, & Trull, 1989). clinician, contributing to poor interrater relia- There are only a few fully structured interviews bility (Widiger & Sanderson, 1995; Zimmer- for the assessment of personality traits and they man, 1994) and to misleading expectations and may be inadequate precisely because of their false assumptions. For example, a number of excessive constraint and superficiality (Perry, studies have indicated that many clinicians tend 1992). to overdiagnose the histrionic personality The questions provided by an SSI are useful disorder in females (Garb, 1997). Clinicians in ensuring that each trait is assessed, and that a tend to perceive a female patient who has just set of questions found to be useful in prior one or two histrionic features as having a studies is being used in a consistent fashion. 
histrionic personality disorder, even when she Systematic biases in clinical assessments are may instead meet the diagnostic criteria for an more easily identified, researched, and ulti- alternative personality disorder (Blashfield & mately corrected with the explicit nature of SSIs Herkov, 1996; Ford & Widiger, 1989; Morey & and their replicated use across studies and 156 Personality Assessment research sites. A highly talented and skilled narcissistic personality traits), until it was clinician can outperform an SSI, but it is risky to recognized that these scales include many items presume that one is indeed that talented that involve self-confidence, assertion, and clinician or that one is consistently skilled and gregariousness. Piersma concluded that ªthe insightful with every patient (Dawes, 1994). It MCMI-II is not able to measure long-term would at least seem desirable for talented personality characteristics (`trait' characteris- clinicians to be informed by a systematic and tics) independent of symptomatology (`state' comprehensive assessment of a patient's per- characteristics)º (p. 91). Mood state distortion, sonality traits. however, might not be problematic within outpatient and normal community samples (Trull & Goodwin, 1993). 3.07.3.1 Personality, Mood States, and Mental SSIs appear to be more successful than SRIs Disorders in distinguishing recently developed mental disorders from personality traits, particularly Personality traits typically are understood to if the interviewers are instructed explicitly to be stable behavior patterns present since young make this distinction when they assess each adulthood (Wiggins & Pincus, 1992). However, item. Loranger et al. (1991) compared the few SRIs emphasize this fundamental feature. 
assessments provided by the Personality Dis- For example, the instructions for the MMPI-2 order Examination (PDE; Loranger, in press) at make no reference to age of onset or duration admission and one week to six months later. (Hathaway et al., 1989). Responding true to the Reduction in scores on the PDE were not MMPI-2 borderline item, ªI cry easilyº (Hath- associated with depression or anxiety. Loranger away et al., p. 6) could be for the purpose of et al. concluded that the ªstudy provides describing the recent development of a depres- generally encouraging results regarding the sive mood disorder rather than a characteristic apparent ability of a particular semistructured manner of functioning. Most MMPI-2 items are interview, the PDE, to circumvent trait-state in fact used to assess both recently developed artifacts in diagnosing personality disorders in mental disorders as well as long-term person- symptomatic patientsº (p. 727). ality traits (Graham, 1993). However, it should not be presumed that SSIs SRIs that required a duration since young are resilient to mood state distortions. O'Boyle adulthood for the item to be endorsed would be and Self (1990) reported that ªPDE dimensional relatively insensitive to changes in personality scores were consistently higher (more sympto- during adulthood (Costa & McCrae, 1994), but matic) when subjects were depressedº (p. 90) SSIs are generally preferred over SRIs within and Loranger et al. (1991) acknowledged as well clinical settings for the assessment of personality that all but two of the PDE scales decreased traits due to the susceptibility of SRIs to mood- significantly across the hospitalization. These state confusion and distortion (Widiger & changes are unlikely to reflect actual, funda- Sanderson, 1995; Zimmerman, 1994). Persons mental changes in personality secondary to the who are depressed, anxious, angry, manic, brief psychiatric hospitalization. 
hypomanic, or even just agitated are unlikely The methods by which SRIs and SSIs address to provide accurate self-descriptions. Low self- mood state and mental disorder confusion esteem, hopelessness, and negativism are central should be considered in future research (Zim- features of depression and will naturally affect merman, 1994). For example, the DSM-IV self-description. Persons who are depressed will requires that the personality disorder criteria be describe themselves as being dependent, intro- evident since late adolescence or young adult- verted self-conscious, vulnerable, and pessimis- hood (APA, 1994). However, the PDE (Lor- tic (Widiger, 1993). Distortions will even anger, in press) requires only that one item be continue after the remission of the more present since the age of 25, with the others obvious, florid symptoms of depression present for only five years. A 45-year old adult (Hirschfeld et al., 1989). with a mood disorder might then receive a Piersma (1989) reported significant decreases dependent, borderline, or comparable person- on the MCMI-II schizoid, avoidant, dependent, ality disorder diagnosis by the PDE, if just one passive±aggressive, self-defeating, schizotypal, of the diagnostic criteria was evident since the borderline, and paranoid personality scales age of 25. across a brief inpatient treatment, even with the MCMI-II mood state correction scales. Significant increases were also obtained for the 3.07.3.2 Dissimulation and Distortion histrionic and narcissistic scales, which at first appeared nonsensical (suggesting that treatment Unstructured and semistructured inter- had increased the presence of histrionic and viewers base much of their assessments on the Semistructured Interviews 157 self-descriptions of the respondent. 
However, a review of institutional file data rather than substantial proportion of the assessment is also answers to interview questions, given the based on observations of a person's behavior expectation that psychopathic persons will be and mannerisms (e.g., the schizotypal trait of characteristically deceptive and dishonest dur- odd or eccentric behavior; APA, 1994) the ing an interview. It is unclear whether the PCL-R manner in which a person responds to questions could provide a valid assessment of psychopathy (e.g., excessive suspiciousness in response to an in the absence of this additional, corrobatory innocuous question), and the consistency of the information (Lilienfeld, 1994; Salekin, Rogers, responses to questions across the interview. SSIs & Sewell, 1996). will also require that respondents provide A notable exception to the absence of SSI examples of affirmative responses to ensure validity scales is the Structured Interview of that the person understood the meaning or Reported Symptoms (SIRS; Rogers, Bagby, & intention of a question. The provision of follow- Dickens, 1992) developed precisely for the up queries provides a significant advantage of assessment of subject distortion and dissimula- SSIs relative to SRIs. For example, schizoid tion. The SIRS includes 172 items (sets of persons may respond affirmatively to an SRI questions) organized into eight primary and five question, ªdo you have any close friends,º but supplementary scales. Three of the primary further inquiry might indicate that they never scales assess for rare, improbable, or absurd invite these friends to their apartment, they symptoms, four assess for an unusual range and rarely do anything with them socially, they are severity of symptoms, and the eighth assesses unaware of their friends' personal concerns, and for inconsistencies in self-reported and observed they never confide in them. Widiger, Mangine, symptoms. 
A substantial amount of supportive Corbitt, Ellis, and Thomas (1995) described a data has been obtained with the SIRS, person who was very isolated and alone, yet particularly within forensic and neuropsycho- indicated that she had seven very close friends logical settings (Rogers, 1995). with whom she confided all of her personal feelings and insecurities. When asked to describe one of these friends, it was revealed 3.07.3.3 Intersite Reliability that she was referring to her seven cats. In sum, none of the SSIs should or do accept a The allowance for professional judgment in respondent's answers and self-descriptions sim- the selection of follow-up queries and in the ply at face value. Nevertheless, none of the interpretation of responses increases signifi- personality disorder SSIs includes a formal cantly the potential for inadequate interrater method by which to assess defensiveness, reliability. Good to excellent interrater relia- dissimulation, exaggeration, or malingering. bility has been consistently obtained in the An assessment of exaggeration and defensive- assessment of maladaptive personality traits ness can be very difficult and complicated with SSIs (Widiger & Sanderson, 1995; Zimmer- (Berry et al., 1995) yet it is left to the discretion man, 1994), but an SSI does not ensure the and expertise of the interviewer in the assess- obtainment of adequate interrater reliability. ment of each individual personality trait. Some An SSI only provides the means by which this interviewers may be very skilled at this assess- reliability can be obtained. Most SSI studies ment, whereas others may be inadequate in the report inadequate to poor interrater reliability effort or indifferent to its importance. for at least one of the domains of personality SSIs should perhaps include validity scales, being assessed (see Widiger & Sanderson, 1995; comparable to those within self-report inven- Zimmerman, 1994). The obtainment of ade- tories. Alterman et al. 
(1996) demonstrated quate interrater reliability should be assessed empirically that subjects exhibiting response sets and documented in every study in which an SSI of positive or negative impression management is being used, as it is quite possible that the as assessed by the PAI (Morey, 1996) showed personality disorder of particular interest is the similar patterns of response distortion on two one for which weak interrater reliability has semistructured interviews, yet the interviewers been obtained. appeared to be ªessentially unaware of such The method by which interrater reliability has behaviorº (Alterman et al., 1996, p. 408). ªThe been assessed in SSI research is also potentially findings suggest that some individuals do exhibit misleading. Interrater reliability has tradition- response sets in the context of a structured ally been assessed in SSI research with respect interview and that this is not typically detected to the agreement between ratings provided by the interviewerº (Alterman et al., 1996, by an interviewer and ratings provided by a p. 408). Much of the assessment of psychopathic person listening to an audiotaped (or video- personality traits by the revised Psychopathy taped) recording of this interview (Zimmerman, Checklist (PCL-R; Hare, 1991) is based on a 1994). However, this methodology assesses only 158 Personality Assessment the agreement regarding the ratings of the 3.07.3.4 Self and Informant Ratings respondents' statements. It is comparable to confining the assessment of the interrater Most studies administer the SSI to the person reliability of practicing clinicians' personality being assessed. However, a method by which to assessments to the agreement in their ratings of a address dissimulation and distortion is to recording of an unstructured clinical interview. administer the SSI to a spouse, relative, or The poor interrater reliability that has been friend who knows the person well. 
Many reported for practicing clinicians has perhaps personality traits involve a person's manner been due largely to inconsistent, incomplete, of relating to others (McCrae & Costa, 1989; and idiosyncratic interviewing (Widiger & Wiggins & Trobst, 1997) and some theorists Sanderson, 1995). It is unclear whether person- suggest that personality is essentially this ality disorder SSIs have actually resolved this manner of relating to others (Kiesler, 1996; problem as few studies have in fact assessed Westen, 1991; Wiggins & Pincus, 1992). interrater reliability using independent admin- A useful source for the description of these istrations of the same interview. traits would then be persons with whom the The misleading nature of currently reported subject has been interacting. These ªinfor- SSI interrater reliability is most evident in mantsº (as they are often identified) can be studies that have used telephone interviews. intimately familiar with the subject's character- Zimmerman and Coryell (1990) reported kappa istic manner of relating to them, and they would agreement rates of 1.0 for the assessment of the not (necessarily) share the subject's distortions schizotypal, histrionic, and dependent person- in self-description. The use of peer, spousal, and ality disorders using the Structured Interview other observer ratings of personality has a rich for DSM-III Personality (SIDP; Pfohl, Blum, & tradition in SRI research (Ozer & Reise, 1994). Zimmerman, in press). However, 86% of the An interview with a close friend, spouse, SIDP administrations were by telephone. Tele- employer, or relative is rarely uninformative phone administrations of an SSI will tend to be and typically results in the identification of more structured than face-to-face interviews, additional maladaptive personality traits. How- and perhaps prone to brief and simplistic ever, it is unclear which source will provide the administrations. They may degenerate into a most valid information. 
Zimmerman, Pfohl, verbal administration of an SRI. Interrater Coryell, Stangl, and Corenthal (1988) adminis- agreement with respect to the ratings of subjects' tered an SSI to both a patient and an informant, affirmative or negative responses to MMPI-2 and obtained rather poor agreement, with items (i.e., highly structured questions) is not correlations ranging from 0.17 (compulsive) particularly informative. to only 0.66 (antisocial). The informants There is reason to believe that there might identified significantly more dependent, avoi- be poor agreement across research sites using dant, narcissistic, paranoid, and schizotypal the same SSI. For example, the prevalence rate traits, but Zimmerman et al. (1988) concluded of borderline personality disorder within that ªpatients were better able to distinguish psychiatric settings has been estimated at between their normal personality and their 15% (Pilkonis et al., 1995), 23% (Riso, Klein, illnessº (p. 737). Zimmerman et al. felt that the Anderson, Ouimette, & Lizardi, 1994), 42% informants were more likely to confuse patients' (Loranger et al., 1991), and 71% (Skodol, current depression with their longstanding and Oldham, Rosnick, Kellman, & Hyler, 1991), premorbid personality traits than the patients' all using the same PDE (Loranger, in press). themselves. Similar findings have been reported This substantial disagreement is due to many by Riso et al. (1994). Informants, like the different variables (Gunderson, 1992; Pilkonis, patients themselves, are providing subjective 1997; Shea, 1995), but one distinct possibility opinions and impressions rather than objective is that there is an inconsistent administration descriptions of behavior patterns, and they may and scoring of the PDE by Loranger et al. have their own axes to grind in their emotionally (1991), Pilkonis et al. (1995), Riso et al. (1994), invested relationship with the identified patient. and Skodol et al. (1991). 
Excellent interrater The fundamental attribution error is to over- reliability of a generally unreliably adminis- explain behavior in terms of personality traits, tered SSI can be obtained within one parti- and peers might be more susceptible to this error cular research site through the development of than the subjects themselves. local (idiosyncratic) rules and policies for the administration of follow-up questions and the 3.07.3.5 Convergent Validity Across Different scoring of respondents' answers that are Interviews inconsistent with the rules and policies developed for this same SSI at another research One of the most informative studies on the site. validity of personality disorder SSIs was Semistructured Interviews 159 provided by Skodol et al. (1991). They identity disturbance, defined in DSM-IV as a administered to 100 psychiatric inpatients two markedly and persistently unstable sense of self- different SSIs on the same day (alternating in image or sense of self (APA, 1994). All five SSIs morning and afternoon administrations). do ask about significant or dramatic changes in Agreement was surprisingly poor, with kappa self-image across time. However, there are also ranging from a low of 0.14 (schizoid) to a high of notable differences. For example, the DIPD-IV only 0.66 (dependent). Similar findings have and SIDP-IV refer specifically to feeling evil; the since been reported by Pilkonis et al. (1995). 
If SIDP-IV highlights in particular a confusion this is the best agreement one can obtain with regarding sexual orientation; the SCID-II the administration of different SSIs to the same appears to emphasize changes or fluctuations persons by the same research team on the same in self-image, whereas the DIPD-IV appears to day, imagine the disagreement that must be emphasize an uncertainty or absence of self- obtained by different research teams with image; and the Personality Disorder Interview-4 different SSIs at different sites (however, both (PDI-IV) includes more open-ended self-de- studies did note that significantly better agree- scriptions. ment was obtained when they considered the It is possible that the variability in content of dimensional rating of the extent to which each questions across different SSIs may not be as personality syndrome was present). important as a comparable variability in Skodol et al. (1991) concluded that the source questions across different SRIs, as the inter- of the disagreement they obtained between the views may converge in their assessment through Structured Clinical Interview for DSM-III-R the unstructured follow-up questions, queries Personality Disorders (SCID-II; First, Gibbon, and clarifications. However, the findings of Spitzer Williams, & Benjamin, in press) and the Pilkonis et al. (1995) and Skodol et al. (1991) PDE (Loranger, in press) was the different suggest otherwise. In addition, as indicated in questions (or items) used by each interview. ªIt Table 1, most of the questions within the five is fair to say that, for a number of disorders (i.e., SSIs are relatively structured, resulting perhaps paranoid, schizoid, schizotypal, narcissistic, in very little follow-up query. and passive±aggressive) the two [interviews] studied do not operationalize the diagnoses 3.07.3.6 Illustrative Instruments similarly and thus yield disparate resultsº (Skodol et al., p. 22). 
The three most commonly used SSIs within There appear to be important differences clinical research for the assessment of the DSM among the SSIs available for the assessment of personality disorders are the SIDP-IV (Pfohl personality disorders (Clark,1992; Widiger & et al., in press), the SCID-II (First et al., in Sanderson, 1995; Zimmerman, 1994). Table 1 press), and the PDE (Loranger, in press). The presents the number of structured questions SIDP-IV has been used in the most number of (i.e., answerable by one word, such as ªyesº or studies. A distinctive feature of the PDE is the ªfrequentlyº), open-ended questions, and ob- inclusion of the diagnostic criteria for the servational ratings provided in each of the five personality disorders of the World Health major personality disorder SSIs. It is evident Organization's (WHO) International Classifica- from Table 1 that there is substantial variation tion of Diseases (ICD-10; 1992). However, using across SSIs simply with respect to the number of this international version of the PDE to questions provided, ranging from 193 to 373 compare the DSM-IV with the ICD-10 is with respect to structured questions and from 20 problematic, as the PDE does not in fact to 69 for open-ended questions. There is also provide distinct questions for the ICD-10 variability in the reliance upon direct observa- criteria. It simply indicates which of the existing tions of the respondent. The PDE, Diagnostic PDE questions, developed for the assessment of Interview for DSM-IV Personality Disorders the DSM-IV personality disorders, could be (DIPD-IV), and SIDP-IV might be more used to assess the ICD-10 criteria. The DSM-IV difficult to administer via telephone, given the and ICD-10 assessments are not then indepen- number of observational ratings of behavior, dent. 
The DIPD (Zanarini, Frankenburg, appearance, and mannerisms that are required Sickel, & Yong, 1996) is the youngest SSI, (particularly to assess the schizotypal and but it is currently being used in an extensive histrionic personality disorders). multi-site, longitudinal study of personality There is also significant variability in the disorders. content of the questions used to assess the same personality trait (Clark, 1992; Widiger & 3.07.3.6.1 SIDP-IV Sanderson, 1995; Zimmerman, 1994). Table 2 presents the questions used by each of the five The SIDP-IV (Pfohl et al., in press) is the major SSIs to assess the borderline trait of oldest and most widely used SSI for the 160 Personality Assessment

Table 1 Amount of structure in personality disorder semistructured interviews.

                          Number of questions
Interview    Structured   Open-ended   Examples   Total    Observations

DIPD-IV         373           20           5       398          19
PDE             272           69         196       537          32
PDI-IV          290           35           -       325           3
SCID-II         193           35          75       303           7
SIDP-IV         244           58          35       337          16

Note. Examples = specified request for examples (PDI-IV instructs interviewers to always consider asking for examples, and therefore does not include a specified request for individual items); DIPD-IV = Diagnostic Interview for DSM-IV Personality Disorders (Zanarini et al., 1996); PDE = Personality Disorder Examination (Loranger, in press); PDI-IV = Personality Disorder Interview-4 (Widiger et al., 1995); SCID-II = Structured Clinical Interview for DSM-IV Axis II Personality Disorders (First et al., in press); SIDP-IV = Structured Interview for DSM-IV Personality (Pfohl et al., in press).

assessment of the DSM personality disorders (Rogers, 1995; Widiger & Sanderson, 1995; Zimmerman, 1994). It includes 353 items (337 questions and 16 observational ratings) to assess the 94 diagnostic criteria for the 10 DSM-IV personality disorders. Additional items are provided to assess the proposed but not officially recognized negativistic (passive–aggressive), depressive, self-defeating, and sadistic personality disorders. It is available in two versions, one in which the items are organized with respect to the diagnostic criteria sets (thereby allowing the researcher to assess only a subset of the disorders) and the other in which the items are organized with respect to similar content (e.g., perceptions of others and social conformity) to reduce repetition and redundancy. Each diagnostic criterion is assessed on a four-point scale (0=not present, 1=subthreshold, 2=present, 3=strongly present). Administration time is about 90 minutes, depending in part on the experience of the interviewer and the verbosity of the subject. A computerized administration and scoring system is available. The accompanying instructions and manual are limited, but training videotapes and courses are available.

3.07.3.6.2 PDI-IV

The PDI-IV (Widiger et al., 1995) is the second oldest SSI for the assessment of the DSM personality disorders. It is comparable in content and style to the SIDP-IV, although it has fewer observational ratings and more open-ended inquiries. The PDI-IV has been used in substantially fewer studies than the SIDP-IV, PDE, or SCID-II, it lacks supportive training material, and only one of the published studies using the PDI-IV was conducted by independent investigators (Rogers, 1995). However, its accompanying manual is the most systematic and comprehensive, providing extensive information regarding the history, rationale, and common assessment issues for each of the DSM-IV personality disorder diagnostic criteria.

3.07.4 PROJECTIVE TECHNIQUES

Projective techniques are not used as often as SRIs in personality research, "and many academic psychologists have expressed the belief that knowledge of projective testing is not as important as it used to be and that use of projective tests will likely decline in the future" (Butcher & Rouse, 1996, p. 91). However, "of the top 10 assessment procedures [used by clinicians] . . . 4 are projectives, and another (Bender-Gestalt) is sometimes used for projective purposes" (Watkins et al., 1995, p. 59). Butcher and Rouse (1996) document as well that the second most frequently researched clinical instrument continues to be the Rorschach. "Predictions about the technique's demise appear both unwarranted and unrealistic" (Butcher & Rouse, 1996, p. 91). "Whatever negative opinions some academics may hold about projectives, they clearly are here to stay, wishing will not make them go away . . . and their place in clinical assessment practice now seems as strong as, if not stronger than, ever" (Watkins et al., 1995, p. 59).

The term "projective," however, may be somewhat misleading, as it suggests that these tests share an emphasis upon the interpretation of a projection of unconscious conflicts, impulses, needs, or wishes onto ambiguous stimuli. This is true for most projective tests but it is not in fact the case for the most commonly used scoring system (i.e., Exner, 1993) for the most commonly used projective test (i.e., the Rorschach). The Exner (1993) Comprehensive

Table 2 Semistructured interview questions for the assessment of identity disturbance.

DIPD-IV
1. During the past two years, have you often been unsure of who you are or what you're really like?
2. Have you frequently gone from feeling sort of OK about yourself to feeling that you're bad or even evil?
3. Have you often felt that you had no identity?
4. How about that you had no idea of who you are or what you believe in?
5. That you don't even exist?

PDE
1. Do you think one of your problems is that you're not sure what kind of person you are?
2. Do you behave as though you don't know what to expect of yourself?
3. Are you so different with different people or in different situations that you don't behave like the same person?
4. Have others told you that you're like that? Why do you think they've said that?
5. What would you like to accomplish during your life? Do your ideas about this change often?
6. Do you often wonder whether you've made the right choice of job or career? (If housewife, ask:) Do you often wonder whether you've made the right choice in becoming a housewife? (If student, ask:) Have you made up your mind about what kind of job or career you would like to have?
7. Do you have trouble deciding what's important in life?
8. Do you have trouble deciding what's morally right and wrong?
9. Do you have a lot of trouble deciding what type of friends you should have?
10. Does the kind of people you have as friends keep changing?
11. Have you ever been uncertain whether you prefer a sexual relationship with a man or a woman?

PDI-IV
1. How would you describe your personality?
2. What is distinct or unique about you?
3. Do you ever feel that you don't know who you are or what you believe in?
4. Has your sense of who you are, what you value, what you want from life, or what you want to be, been fairly consistent, or has this often changed significantly?

SCID-II
1. Have you all of a sudden changed your sense of who you are and where you are headed?
2. Does your sense of who you are often change dramatically?
3. Are you different with different people or in different situations so that you sometimes don't know who you really are?
4. Have there been lots of sudden changes in your goals, career plans, religious beliefs, and so on?

SIDP-IV
1. Does the way you think about yourself change so often that you don't know who you are anymore?
2. Do you ever feel like you're someone else, or that you're evil, or maybe that you don't even exist?
3. Some people think a lot about their sexual orientation, for instance, trying to decide whether or not they might be gay (or lesbian). Do you often worry about this?

Note. DIPD-IV = Diagnostic Interview for DSM-IV Personality Disorders (Zanarini et al., 1996, pp. 25–26); PDE = Personality Disorder Examination (Loranger, in press, pp. 58–60, 83, & 117); PDI-IV = Personality Disorder Interview-4 (Widiger et al., 1995, p. 92); SCID-II = Structured Clinical Interview for DSM-IV Axis II Personality Disorders (First et al., in press, p. 37); SIDP-IV = Structured Interview for DSM-IV Personality (Pfohl et al., in press, p. 19).
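
The counts reported in Tables 1 and 2 lend themselves to a simple cross-check. The following is a minimal sketch in Python and is not part of any interview's scoring materials. The dictionary and function names are hypothetical, the blank Examples cell for the PDI-IV is encoded as 0 by assumption, and the Table 2 tallies are simply counts of the numbered questions listed for each interview.

```python
# Counts transcribed from Table 1, in the order:
# (structured, open-ended, examples, total, observations).
# Assumption: the blank Examples cell for the PDI-IV is encoded as 0.
TABLE_1 = {
    "DIPD-IV": (373, 20, 5, 398, 19),
    "PDE": (272, 69, 196, 537, 32),
    "PDI-IV": (290, 35, 0, 325, 3),
    "SCID-II": (193, 35, 75, 303, 7),
    "SIDP-IV": (244, 58, 35, 337, 16),
}

# Number of identity disturbance questions listed per interview in Table 2.
TABLE_2_QUESTIONS = {
    "DIPD-IV": 5,
    "PDE": 11,
    "PDI-IV": 4,
    "SCID-II": 4,
    "SIDP-IV": 3,
}


def totals_are_consistent(table):
    """Return True if structured + open-ended + examples equals the total in every row."""
    return all(s + o + e == total for s, o, e, total, _obs in table.values())


def value_range(table, column):
    """Return (minimum, maximum) of one column index across all interviews."""
    values = [row[column] for row in table.values()]
    return min(values), max(values)


if __name__ == "__main__":
    print("Row totals consistent:", totals_are_consistent(TABLE_1))
    print("Structured questions range:", value_range(TABLE_1, 0))
    print("Open-ended questions range:", value_range(TABLE_1, 1))
    for name, count in TABLE_2_QUESTIONS.items():
        print(name, "identity disturbance questions:", count)
```

Under these assumptions, each row total in Table 1 equals the sum of its three question counts, and the 337 questions plus 16 observational ratings of the SIDP-IV reproduce the 353 items cited in the text.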

System does include a few scores that appear to concern a projection of personal needs, wishes, or preoccupations (e.g., morbid content) but much of the scoring concerns individual differences in the perceptual and cognitive processing of the form or structure of the shapes, textures, details, and colors of the ambiguous inkblots. Exner (1989) has himself stated that "unfortunately, the Rorschach has been erroneously mislabeled as a projective test for far too long" (p. 527).

The label "projective test" is also contrasted traditionally with the label "objective test" (e.g., Keller et al., 1990) but this is again misleading. The Exner (1993) Comprehensive System scoring for the Rorschach is as objective as the scoring of an MMPI-2 (although not as reliable). In addition, clinicians often can interpret MMPI-2 profiles in an equally subjective manner.

A more appropriate distinction might be a continuum of structure vs. ambiguity with respect to the stimuli provided to the subject and the range of responses that are allowed. Most of the techniques traditionally labeled as projective do provide relatively more ambiguous stimuli than either SRIs or SSIs (e.g., inkblots or drawings). However, many SRI items can be as ambiguous as an item from a projective test. Consider, for example, the MMPI-2 items "I like mechanics magazines," "I used to keep a diary," and "I would like to be a journalist" (Hathaway et al., 1989, pp. 5, 8, 9). These items are unambiguous in their content, but the trait or characteristic they assess is very ambiguous (Butcher, 1995). There is, perhaps, less ambiguity in the meaning of the stems provided in a sentence completion test, such as "My conscience bothered me most when," "I used to dream about" and "I felt inferior when" (Lah, 1989, p. 144) than in many of the MMPI-2 items.

SRIs, on the other hand, are much more structured (or constraining) in the responses that are allowed to these stimuli. The only responses can be "true" or "false" to an ambiguous MMPI-2 item, whereas anything can be said in response to a more obvious sentence completion stem. Projective tests are uniformly more open-ended than SRIs in the responses that are allowed, increasing substantially the potential for unreliability in scoring. However, SSIs can be as open-ended as many projective tests in the responses that are allowed. For example, the PDI-IV SSI begins with the request of "having you tell me the major events, issues, or incidents that you have experienced since late childhood or adolescence" (Widiger et al., 1995, p. 245). This initial question is intentionally open-ended to ensure that the most important events or incidents are not neglected during the interview (Perry, 1992). However, such open-ended inquiries will contribute to problematic intersite reliability.

3.07.4.1 Rorschach

The Rorschach consists of 10 ambiguously shaped inkblots, some of which include various degrees of color and shading. The typical procedure is to ask a person what each inkblot might be, and to then follow at some point with inquiries that clarify the bases for the response (Aronow, Reznikoff, & Moreland, 1994; Exner, 1993). The Rorschach has been the most popular projective test in clinical practice since the Second World War, although the Thematic Apperception Test (TAT) and sentence completion tests are gaining in frequency of usage (Watkins et al., 1995).

There is empirical support for many Rorschach variables (Bornstein, 1996; Exner, 1993; Weiner, 1996), although the quality of some of this research has been questioned, including the support for a number of the fundamental Exner variables, such as the experience ratio (Kleiger, 1992; Wood, Nezworski, & Stejskal, 1996). The Rorschach can be administered and scored in a reliable manner, but the training that is necessary to learn how to score reliably the 168 variables of the Exner Comprehensive System is daunting, at best. SRIs and SSIs might provide a more cost-efficient method to obtain the same results (although profile interpretation of the MMPI-2 clinical scales can at times be equally complex; Helmes & Reddon, 1993). Exner (1996) has acknowledged, at least with respect to a depression index, that "there are other measures, such as the [MMPI-2], that might identify the presence of reported depression much more accurately than the Rorschach" (p. 12). The same point can perhaps be made for the assessment of personality traits, such as narcissism and dependency. Bornstein (1995), however, has argued that the Rorschach provides a less biased measure of sex differences in personality (dependency, in particular) because its scoring is less obvious to the subject. He suggested that the findings from SRIs and SSIs have provided inaccurate estimates of dependency in males because males are prone to deny the extent of their dependent personality traits in response to SRIs and SSIs.

An additional issue for the Rorschach is that its relevance to personality research and assessment is at times unclear. For example, the cognitive-perceptual mechanisms assessed by the Exner Comprehensive System do not appear to be of central importance to many theories of personality and personality disorder (Kleiger, 1992). "It is true that the Rorschach does not offer a precise measure for any single personality trait" (Exner, 1997, p. 41). How a person perceptually organizes an ambiguous inkblot may indeed relate to extratensive ideational activity, but constructs such as introversion, conscientiousness, need for affection, and empathy have a more direct, explicit relevance to current personality theory and research.

More theoretically meaningful constructs are perhaps assessed by content (e.g., Aronow et al., 1994; Bornstein, 1996) or object-relational (e.g., Lerner, 1995) scoring systems. Content interpretations of the Rorschach are more consistent with the traditional understanding of the instrument as a projective stimulus, but this approach also lacks the empirical support of the cognitive-perceptual scoring systems and may only encourage a return to less reliable and subjective interpretations (Acklin, 1995).

3.07.4.2 Thematic Apperception Test

The TAT (Cramer, 1996) consists of 31 cards: one is blank, seven are for males, seven for females, one for boys or girls, one for men or women, and one each for a boy, girl, man, and woman (the remaining 10 are for anyone). Thus, a complete set for any particular individual could consist of 20 stimulus drawings, although only 10 typically are used per person. Most of the drawings include person(s) in an ambiguous but emotionally provocative context. The instruction to the subject is to make up a dramatic story for each card, describing what is happening, what led up to it, what is the outcome, and what the persons are thinking and feeling. It is common to describe the task as a test of imaginative intelligence to encourage vivid, involved, and nondefensive stories.

The TAT is being used increasingly in personality research with an interpersonal or object-relational perspective (e.g., Westen, 1991). The TAT's provision of cues for a variety of interpersonal issues and relationships makes it particularly well suited for such research, and the variables assessed are theoretically and clinically meaningful (e.g., malevolent vs. benevolent affect and the capacity for an emotional investment in relationships). The necessary training for reliable scoring is also less demanding than for the Rorschach, although a TAT administration remains time-consuming. There are many SRI measures of closely related interpersonal constructs that are less expensive and complex to administer and score (e.g., Kiesler, 1996; Wiggins & Trobst, 1997).

3.07.5 CONCLUSIONS

The assessment of personality is a vital component of clinical research and practice, particularly with the increasing recognition of the importance of personality traits to the development and treatment of psychopathology (Watson et al., 1994). The assessment of adaptive and maladaptive personality functioning is fundamental to virtually all fields of applied psychology.

It is then surprising and regrettable that clinical psychology training programs provide so little attention to the importance of, and methods for, comprehensive and systematic interviewing. The primary method for the assessment of personality in clinical practice is an unstructured interview that has been shown to be quite vulnerable to misleading expectations, inadequate coverage, and gender and ethnic biases. Training programs will devote a whole course, perhaps a whole year, to learning different projective techniques, but may never even inform students of the existence of any particular semistructured interview.

The preferred method of assessment in personality disorder research appears to be SSIs (Zimmerman, 1994), whereas the preferred method in normal personality research is SRIs (Butcher & Rouse, 1996). However, the optimal approach for both research and clinical practice would be a multimethod assessment, using methods whose errors of measurement are uncorrelated. No single approach will be without significant limitations. The convergence of findings across SRI, SSI, and projective methodologies would provide the most compelling results.

3.07.6 REFERENCES

Acklin, M. W. (1995). Integrative Rorschach interpretation. Journal of Personality Assessment, 64, 235–238.
Adler, D. A., Drake, R. E., & Teague, G. B. (1990). Clinicians' practices in personality assessment: Does gender influence the use of DSM-III Axis II? Comprehensive Psychiatry, 31, 125–133.
Adler, N., & Matthews, K. (1994). Health psychology: Why do some people get sick and some stay well? Annual Review of Psychology, 45, 229–259.
Alterman, A. I., Snider, E. C., Cacciola, J. S., Brown, L. S., Zaballero, A., & Siddiqui, N. (1996). Evidence for response set effects in structured research interviews. Journal of Nervous and Mental Disease, 184, 403–410.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Aronow, E., Reznikoff, M., & Moreland, K. (1994). The Rorschach technique: Perceptual basics, content, interpretation and applications. Boston: Allyn and Bacon.
Baumeister, R. F., Tice, D. M., & Hutton, D. G. (1989). Self-presentational motivations and personality differences in self-esteem. Journal of Personality, 57, 547–579.
Ben-Porath, Y. S. (1994). The MMPI and MMPI-2: Fifty
years of differentiating normal and abnormal personality. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 361–401). New York: Springer.
Ben-Porath, Y. S., McCully, E., & Almagor, M. (1993). Incremental validity of the MMPI-2 content scales in the assessment of personality and psychopathology by self-report. Journal of Personality Assessment, 61, 557–575.
Ben-Porath, Y. S., & Waller, N. G. (1992). "Normal" personality inventories in clinical assessment: General requirements and the potential for using the NEO Personality Inventory. Psychological Assessment, 4, 14–19.
Berry, D. T. R. (1995). Detecting distortion in forensic evaluations with the MMPI-2. In Y. S. Ben-Porath, J. R. Graham, G. C. N. Hall, R. D. Hirschman, & M. S. Zaragoza (Eds.), Forensic applications of the MMPI-2 (pp. 82–102). Thousand Oaks, CA: Sage.
Berry, D. T. R., Wetter, M. W., & Baer, R. A. (1995). Assessment of malingering. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 236–248). New York: Oxford University Press.
Blashfield, R. K., & Herkov, M. J. (1996). Investigating clinician adherence to diagnosis by criteria: A replication of Morey and Ochoa (1989). Journal of Personality Disorders, 10, 219–228.
Block, J. (1995). A contrarian view of the five-factor approach to personality description. Psychological Bulletin, 117, 187–215.
Bornstein, R. F. (1995). Sex differences in objective and projective dependency tests: A meta-analytic review. Assessment, 2, 319–331.
Bornstein, R. F. (1996). Construct validity of the Rorschach oral dependency scale: 1967–1995. Psychological Assessment, 8, 200–205.
Borkenau, P., & Ostendorf, F. (1992). Social desirability scales as moderator and suppressor variables. European Journal of Personality, 6, 199–214.
Butcher, J. N. (1995). Item content in the interpretation of the MMPI-2. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 302–316). New York: Oxford University Press.
Butcher, J. N. (1995). How to use computer-based reports. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 78–94). New York: Oxford University Press.
Butcher, J. N., & Rouse, S. V. (1996). Personality: Individual differences and clinical assessment. Annual Review of Psychology, 47, 87–111.
Clark, L. A. (1992). Resolving taxonomic issues in personality disorders. The value of large-scale analysis of symptom data. Journal of Personality Disorders, 6, 360–376.
Clark, L. A. (1993). Manual for the schedule for nonadaptive and adaptive personality. Minneapolis, MN: University of Minnesota Press.
Clark, L. A., Livesley, W. J., Schroeder, M. L., & Irish, S. L. (1996). Convergence of two systems for assessing specific traits of personality disorder. Psychological Assessment, 8, 294–303.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319.
Cloninger, C. R., & Svrakic, D. M. (1994). Differentiating normal and deviant personality by the seven factor personality model. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 40–64). New York: Springer.
Colligan, R. C., Morey, L. C., & Offord, K. P. (1994). The MMPI/MMPI-2 personality disorder scales. Contemporary norms for adults and adolescents. Journal of Clinical Psychology, 50, 168–200.
Cooke, D. J., & Michie, C. (1997). An item response theory analysis of the Hare Psychopathy Checklist-Revised. Psychological Assessment, 9, 3–14.
Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Costa, P. T., & McCrae, R. R. (1994). Set like plaster? Evidence for the stability of adult personality. In T. F. Heatherton & J. L. Weinberger (Eds.), Can personality change? (pp. 21–40). Washington, DC: American Psychological Association.
Costa, P. T., & McCrae, R. R. (1997). Stability and change in personality assessment: The Revised NEO Personality Inventory in the year 2000. Journal of Personality Assessment, 68, 86–94.
Costa, P. T., Zonderman, A. B., McCrae, R. R., & Williams, R. B. (1985). Content and comprehensiveness in the MMPI: An item factor analysis in a normal adult sample. Journal of Personality and Social Psychology, 48, 925–933.
Cramer, P. (1996). Storytelling, narrative, and the Thematic Apperception Test. New York: Guilford.
Dawes, R. M. (1994). House of cards. Psychology and psychotherapy built on myth. New York: Free Press.
DiLalla, D. L., Gottesman, I. I., Carey, G., & Vogler, G. P. (1993). Joint factor structure of the Multidimensional Personality Questionnaire and the MMPI in a psychiatric and high-risk sample. Psychological Assessment, 5, 207–215.
Eagly, A. H. (1995). The science and politics of comparing women and men. American Psychologist, 50, 145–158.
Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341–349.
Exner, J. E. (1989). Searching for projection in the Rorschach. Journal of Personality Assessment, 53, 520–536.
Exner, J. E. (1993). The Rorschach: A comprehensive system (Vol. 1). New York: Wiley.
Exner, J. E. (1996). A comment on "The Comprehensive System for the Rorschach: A critical examination." Psychological Science, 7, 11–13.
Exner, J. E. (1997). The future of Rorschach in personality assessment. Journal of Personality Assessment, 68, 37–46.
Feingold, A. (1994). Gender differences in personality: A meta-analysis. Psychological Bulletin, 116, 429–456.
Finn, S. E., & Kamphuis, J. H. (1995). What a clinician needs to know about base rates. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 224–235). New York: Oxford University Press.
First, M. B., Gibbon, M., Spitzer, R. L., Williams, J. B. W., & Benjamin, L. S. (in press). User's guide for the Structured Clinical Interview for DSM-IV Axis II Personality Disorders. Washington, DC: American Psychiatric Press.
Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286–299.
Ford, M. R., & Widiger, T. A. (1989). Sex bias in the diagnosis of histrionic and antisocial personality disorders. Journal of Consulting and Clinical Psychology, 57, 301–305.
Garb, H. N. (1997). Race bias, social class bias, and gender bias in clinical judgment. Clinical Psychology: Science and Practice.
Goldberg, L. R. (1990). An alternative "Description of personality": The Big Five factor structure. Journal of Personality and Social Psychology, 59, 1216–1229.
Graham, J. R. (1993). MMPI-2. Assessing personality and psychopathology (2nd ed.). New York: Oxford University Press.
Greene, R. L., Gwin, R., & Staal, M. (1997). Current status of MMPI-2 research: A methodologic overview. Journal of Personality Assessment, 68, 20–36.

Gunderson, J. G. (1992). Diagnostic controversies. In A. Tasman & M. B. Riba (Eds.), Review of psychiatry (Vol. 11, pp. 9–24). Washington, DC: American Psychiatric Press.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hare, R. D. (1991). Manual for the Revised Psychopathy Checklist. Toronto, Canada: Multi-Health Systems.
Hare, R. D., Hart, S. D., & Harpur, T. J. (1991). Psychopathy and the DSM-IV criteria for antisocial personality disorder. Journal of Abnormal Psychology, 100, 391–398.
Harkness, A. R., & McNulty, J. L. (1994). The Personality Psychopathology Five (PSY-5): Issue from the pages of a diagnostic manual instead of a dictionary. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 291–315). New York: Springer.
Harkness, A. R., McNulty, J. L., & Ben-Porath, Y. S. (1995). The Personality Psychopathology Five (PSY-5): Constructs and MMPI-2 scales. Psychological Assessment, 7, 104–114.
Hathaway, S. R., & McKinley, J. C. (1940). A multiphasic personality schedule (Minnesota): I. Construction of the schedule. Journal of Psychology, 10, 249–254.
Hathaway, S. R., & McKinley, J. C. (1982). Minnesota Multiphasic Personality Inventory test booklet. Minneapolis, MN: University of Minnesota.
Hathaway, S. R., McKinley, J. C., Butcher, J. N., Dahlstrom, W. G., Graham, J. R., & Tellegen, A. (1989). Minnesota Multiphasic Personality Inventory test booklet. Minneapolis, MN: Regents of the University of Minnesota.
Haynes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238–247.
Helmes, E., & Reddon, J. R. (1993). A perspective on developments in assessing psychopathology: A critical review of the MMPI and MMPI-2. Psychological Bulletin, 113, 453–471.
Hendryx, M. S., Haviland, M. G., Gibbons, R. D., & Clark, D. C. (1992). An application of item response theory to alexithymia assessment among abstinent alcoholics. Journal of Personality Assessment, 58, 506–515.
Herkov, M. J., & Blashfield, R. K. (1995). Clinician diagnoses of personality disorders: Evidence of a hierarchical structure. Journal of Personality Assessment, 65, 313–321.
Hirschfeld, R. M., Klerman, G. L., Lavori, P., Keller, M., Griffith, P., & Coryell, W. (1989). Premorbid personality assessments of first onset of major depression. Archives of General Psychiatry, 46, 345–350.
Hogan, R., Curphy, G. J., & Hogan, J. (1994). What we know about leadership. Effectiveness and personality. American Psychologist, 49, 493–504.
Hyler, S. E. (1994). Personality Diagnostic Questionnaire-4 (PDQ-4). Unpublished test. New York: New York State Psychiatric Institute.
Kehoe, J. F., & Tenopyr, M. L. (1994). Adjustment in assessment scores and their usage: A taxonomy and evaluation of methods. Psychological Assessment, 6, 291–303.
Keller, L. S., Butcher, J. N., & Slutske, W. S. (1990). Objective personality assessment. In G. Goldstein & M. Hersen (Eds.), Handbook of psychological assessment (2nd ed., pp. 345–386). New York: Pergamon.
Kiesler, D. J. (1996). Contemporary interpersonal theory & research, personality, psychopathology, and psychotherapy. New York: Wiley.
Kleiger, J. H. (1992). A conceptual critique of the EA:es comparison in the Comprehensive Rorschach System. Psychological Assessment, 4, 288–296.
Klein, M. H., Benjamin, L. S., Rosenfeld, R., Treece, C., Husted, J., & Greist, J. H. (1993). The Wisconsin Personality Disorders Inventory: Development, reliability, and validity. Journal of Personality Disorders, 7, 285–303.
Lah, M. I. (1989). Sentence completion tests. In C. S. Newmark (Ed.), Major psychological assessment instruments (Vol. II, pp. 133–163). Boston: Allyn & Bacon.
Lamiell, J. T. (1981). Toward an idiothetic psychology of personality. American Psychologist, 36, 276–289.
Lerner, P. M. (1995). Assessing adaptive capacities by means of the Rorschach. In J. N. Butcher (Ed.), Clinical personality assessment. Practical approaches (pp. 317–325). New York: Oxford University Press.
Lilienfeld, S. O. (1994). Conceptual problems in the assessment of psychopathy. Clinical Psychology Review, 14, 17–38.
Lilienfeld, S. O., & Andrews, B. P. (1996). Development and preliminary validation of a self-report measure of psychopathic personality traits in noncriminal populations. Journal of Personality Assessment, 66, 488–524.
Lindsay, K. A., & Widiger, T. A. (1995). Sex and gender bias in self-report personality disorder inventories: Item analyses of the MCMI-II, MMPI, and PDQ-R. Journal of Personality Assessment, 65, 1–20.
Livesley, W. J., & Jackson, D. (in press). Manual for the Dimensional Assessment of Personality Pathology-Basic Questionnaire. Port Huron, MI: Sigma.
Loranger, A. W. (in press). Personality disorder examination. Washington, DC: American Psychiatric Press.
Loranger, A. W., Lenzenweger, M. F., Gartner, A. F., Susman, V. L., Herzig, J., Zammit, G. K., Gartner, J. D., Abrams, R. C., & Young, R. C. (1991). Trait-state artifacts and the diagnosis of personality disorders. Archives of General Psychiatry, 48, 720–728.
McCrae, R. R., & Costa, P. T. (1983). Social desirability scales: More substance than style. Journal of Consulting and Clinical Psychology, 51, 882–888.
McCrae, R. R., & Costa, P. T. (1989). The structure of interpersonal traits: Wiggins' circumplex and the Five-Factor Model. Journal of Personality and Social Psychology, 56, 586–595.
Meehl, P. E. (1945). The dynamics of "structured" personality tests. Journal of Clinical Psychology, 1, 296–303.
Millon, T. (1983). Millon Clinical Multiaxial Inventory manual (3rd ed.). Minneapolis, MN: National Computer Systems.
Millon, T. (1987). Manual for the MCMI-II (2nd ed.). Minneapolis, MN: National Computer Systems.
Millon, T., Davis, R. D., Millon, C. M., Wenger, A. W., Van Zuilen, M. H., Fuchs, M., & Millon, R. B. (1996). Disorders of personality. DSM-IV and beyond. New York: Wiley.
Millon, T., Millon, C., & Davis, R. (1994). MCMI-III manual. Minneapolis, MN: National Computer Systems.
Morey, L. C. (1996). An interpretive guide to the Personality Assessment Inventory (PAI). Odessa, FL: Psychological Assessment Resources.
Morey, L. C., & Ochoa, E. S. (1989). An investigation of adherence to diagnostic criteria: Clinical diagnosis of the DSM-III personality disorders. Journal of Personality Disorders, 3, 180–192.
Morey, L. C., Waugh, M. H., & Blashfield, R. K. (1985). MMPI scales for DSM-III personality disorders: Their derivation and correlates. Journal of Personality Assessment, 49, 245–251.
O'Boyle, M., & Self, D. (1990). A comparison of two interviews for DSM-III-R personality disorders. Psychiatry Research, 32, 85–92.

Okazaki, S., & Sue, S. (1995). Methodological issues in assessment research with ethnic minorities. Psychological Assessment, 7, 367–375.
Ozer, D. J., & Reise, S. P. (1994). Personality assessment. Annual Review of Psychology, 45, 357–388.
Perry, J. C. (1992). Problems and considerations in the valid assessment of personality disorders. American Journal of Psychiatry, 149, 1645–1653.
Pfohl, B., Blum, N., & Zimmerman, M. (in press). Structured Interview for DSM-IV Personality. Washington, DC: American Psychiatric Press.
Piersma, H. L. (1989). The MCMI-II as a treatment outcome measure for psychiatric inpatients. Journal of Clinical Psychology, 45, 87–93.
Pilkonis, P. A. (1997). Measurement issues relevant to personality disorders. In H. H. Strupp, M. J. Lambert, & L. M. Horowitz (Eds.), Measuring patient change in mood, anxiety, and personality disorders: Toward a core battery (pp. 371–388). Washington, DC: American Psychological Association.
Pilkonis, P. A., Heape, C. L., Proietti, J. M., Clark, S. W., McDavid, J. D., & Pitts, T. E. (1995). The reliability and validity of two structured diagnostic interviews for personality disorders. Archives of General Psychiatry, 52, 1025–1033.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143–151.
Retzlaff, P. (1996). MCMI-III diagnostic validity: Bad test or bad validity study. Journal of Personality Assessment, 66, 431–437.
Riso, L. P., Klein, D. N., Anderson, R. L., Ouimette, P. C., & Lizardi, H. (1994). Concordance between patients and informants on the Personality Disorder Examination. American Journal of Psychiatry, 151, 568–573.
Robins, R. W., & John, O. P. (1997). Effects of visual perspective and narcissism on self-perception. Is seeing believing? Psychological Science, 8, 37–42.
Rogers, R. (1995). Diagnostic and structured interviewing. A handbook for psychologists. Odessa, FL: Psychological Assessment Resources.
Rogers, R., Bagby, R. M., & Dickens, S. E. (1992). Structured Interview of Reported Symptoms (SIRS) professional manual. Odessa, FL: Psychological Assessment Resources.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929–954.
Santor, D. A., Ramsay, J. O., & Zuroff, D. C. (1994). Nonparametric item analyses of the Beck Depression Inventory: Evaluating gender item bias and response option weights. Psychological Assessment, 6, 255–270.
Salekin, R. T., Rogers, R., & Sewell, K. W. (1996). A review and meta-analysis of the Psychopathy Checklist and Psychopathy Checklist-Revised: Predictive validity of dangerousness. Clinical Psychology: Science and Practice, 3, 203–215.
Saucier, G., & Goldberg, L. R. (1996). The language of personality: Lexical perspectives on the five-factor model. In J. S. Wiggins (Ed.), The five-factor model of personality. Theoretical perspectives (pp. 21–50). New York: Guilford.
Schinka, J. A., Kinder, B. N., & Kremer, T. (1997).

interviews. International Journal of Methods in Psychiatric Research, 1, 13–26.
Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7, 300–308.
Somwaru, D. P., & Ben-Porath, Y. S. (1995). Development and reliability of MMPI-2 based personality disorder scales. Paper presented at the 30th Annual Workshop and Symposium on Recent Developments in Use of the MMPI-2 & MMPI-A. St. Petersburg Beach, FL.
Tellegen, A., & Waller, N. G. (in press). Exploring personality through test construction: Development of the Multidimensional Personality Questionnaire. In S. R. Briggs & J. M. Cheek (Eds.), Personality measures: Development and evaluation (Vol. 1). Greenwich, CT: JAI Press.
Timbrook, R. E., & Graham, J. R. (1994). Ethnic differences on the MMPI-2? Psychological Assessment, 6, 212–217.
Trull, T. J., & Goodwin, A. H. (1993). Relationship between mood changes and the report of personality disorder symptoms. Journal of Personality Assessment, 61, 99–111.
Trull, T. J., Useda, J. D., Costa, P. T., & McCrae, R. R. (1995). Comparison of the MMPI-2 Personality Psychopathology Five (PSY-5), the NEO-PI, and the NEO-PI-R. Psychological Assessment, 7, 508–516.
Trull, T. J., & Widiger, T. A. (1997). Structured Interview for the Five-Factor Model of Personality professional manual. Odessa, FL: Psychological Assessment Resources.
Watkins, C. E., Campbell, V. L., Nieberding, R., & Hallmark, R. (1995). Contemporary practice of psychological assessment by clinical psychologists. Professional Psychology: Research and Practice, 26, 54–60.
Watson, D., Clark, L. A., & Harkness, A. R. (1994). Structures of personality and their relevance to psychopathology. Journal of Abnormal Psychology, 103, 18–31.
Weiner, I. B. (1996). Some observations on the validity of the Rorschach inkblot method. Psychological Assessment, 8, 206–213.
Westen, D. (1991). Social cognition and object relations. Psychological Bulletin, 109, 429–455.
Widiger, T. A. (1993). Personality and depression: Assessment issues. In M. H. Klein, D. J. Kupfer, & M. T. Shea (Eds.), Personality and depression. A current view (pp. 77–118). New York: Guilford.
Widiger, T. A., & Costa, P. T. (1994). Personality and personality disorders. Journal of Abnormal Psychology, 103, 78–91.
Widiger, T. A., Frances, A. J., & Trull, T. J. (1989). Personality disorders. In R. Craig (Ed.), Clinical and diagnostic interviewing (pp. 221–236). Northvale, NJ: Aronson.
Widiger, T. A., Mangine, S., Corbitt, E. M., Ellis, C. G., & Thomas, G. V. (1995). Personality Disorder Interview-IV. A semistructured interview for the assessment of personality disorders. Odessa, FL: Psychological Assessment Resources.
Widiger, T. A., & Sanderson, C. J. (1995). Assessing personality disorders. In J. N. Butcher (Ed.), Clinical
personality assessment. Practical approaches Research validity scales for the NEO-PI-R: Development (pp. 380±394). New York: Oxford University Press. and initial validation. Journal of Personality Assessment, Widiger, T. A., & Trull, T. J. (1997). Assessment of the five 68, 127±138. factor model of personality. Journal of Personality Shea M. T. (1995). Interrelationships among categories of Assessment, 68, 228±250. personality disorders. In W. J. Livesley (Ed.), The DSM- Widiger, T. A., Williams, J. B. W., Spitzer, R. L., & IV personality disorders (pp. 397±406). New York: Frances, A. J. (1985). The MCMI as a measure of DSM- Guilford. III. Journal of Personality Assessment, 49, 366±378. Skodol, A. E., Oldham, J. M., Rosnick, L., Kellman, Wiggins, J. S. (1966). Substantive dimensions of self-report H. D., & Hyler, S. E. (1991). Diagnosis of DSM-III-R in the MMPI item pool. Psychological Monographs, 80, personality disorders: A comparison of two structured (22, Whole No. 630). References 167

Wiggins, J. S., & Pincus, A. L. (1992). Personality: Yong, L. (1996). Diagnostic Interview for DSM-IV Structure and assessment. Annual Review of Psychology, Personality Disorders (DIPD-IV). Boston: McLean 43, 473±504. Hospital. Wiggins, J. S., & Trobst, K. K. (1997). Prospects for the Zimmerman, M. (1994). Diagnosing personality disorders. assessment of normal and abnormal interpersonal A review of issues and research methods. Archives of behavior. Journal of Personality Assessment, 68, General Psychiatry, 51, 225±245. 110±126. Zimmerman, M., & Coryell, W. H. (1990). Diagnosing Wood, J. M., Nezworski, M. T., & Stejskal, W. J. (1996). personality disorders in the community. A comparison The Comprehensive System for the Rorschach: A critical of self-report and interview measures. Archives of examination. Psychological Science, 7, 3±10. General Psychiatry, 47, 527±531. World Health Organization. (1992). The ICD-10 classifica- Zimmerman, M., Pfohl, B., Coryell, W., Stangl, D., & tion of mental and behavioural disorders. Clinical descrip- Corenthal, C. (1988). Diagnosing personality disorders tions and diagnostic guidelines. Geneva, Switzerland: in depressed patients. A comparison of patient and Author. informant interviews. Archives of General Psychiatry, 45, Zanarini, M. C., Frankenburg, F. R., Sickel, A. E., & 733±737. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.08 Assessment of Psychopathology: Nosology and Etiology

NADER AMIR University of Georgia, Athens, GA, USA and CATHERINE A. FEUER University of Missouri at St. Louis, MO, USA

3.08.1 INTRODUCTION
3.08.2 HISTORY OF CLASSIFICATION
3.08.2.1 Medical Classification
3.08.2.2 Early Nosology of Psychopathology
3.08.3 CLASSIFICATION OF MENTAL ILLNESS
3.08.3.1 History of Classification of Mental Illness
3.08.3.2 Current Classification Systems
3.08.3.3 Tools for Clinical Assessment
3.08.4 CURRENT ISSUES IN THE CLASSIFICATION OF PSYCHOPATHOLOGY
3.08.4.1 Definition of Psychopathology
3.08.4.2 Comorbidity
3.08.4.2.1 Actual co-occurrence of disorders
3.08.4.2.2 Populations considered
3.08.4.2.3 Range, severity, and base rates of disorders considered
3.08.4.2.4 Assessment methods
3.08.4.2.5 Structure of the classification system
3.08.4.3 Clinical and Research Applications of Classification Systems
3.08.4.4 Organizational Problems in the DSM
3.08.5 ALTERNATIVE APPROACHES TO CLASSIFICATION
3.08.5.1 Types of Taxonomies
3.08.5.2 Taxonomic Models
3.08.5.2.1 Neo-Kraepelinian (DSM) model
3.08.5.2.2 Prototype model
3.08.5.2.3 Dimensional and categorical models
3.08.5.3 The Use of Etiological Factors for Diagnosis
3.08.6 CONCLUSIONS
3.08.7 REFERENCES


3.08.1 INTRODUCTION

The history of clinical psychology, similar to the history of medicine, has been characterized by the quest for knowledge of pathological processes. This knowledge enables practitioners in both fields to treat and prevent disorders that threaten the quality of life. Health practitioners must be able to identify and conceptualize the problem, communicate research and clinical findings to other members of their field, and, ideally, reach a scientific understanding of the disorder. Diagnostic taxonomies have evolved in part to achieve these goals (Millon, 1991). Classification systems have a long history, first in the basic sciences and medicine, and later in psychopathology. Modern psychopathology taxonomies owe much to the work of earlier medical diagnosticians, and their development has often paralleled that of medical systems (Clementz & Iacono, 1990). Both medical and psychopathological classification systems have served the purposes of improving diagnostic reliability, guiding clinical conceptualization and treatment, and facilitating research and scientific advancement (Blashfield, 1991; Clark, Watson, & Reynolds, 1995). Despite the usefulness of classification systems in psychopathology, no system is accepted universally. Furthermore, some have questioned the utility of classification systems as a whole, favoring behavioral descriptions of individual presentations instead (Kanfer & Saslow, 1965; Ullman & Krasner, 1975).

The purpose of this chapter is to review the issues related to the assessment, diagnosis, and classification of psychopathology. The chapter will begin with a brief outline of the history of medical and psychopathological classification. Next, it will review the systems in use in the late 1990s, and evaluate these systems from the standpoint of clinical utility, research facilitation, and scientific understanding. It will then discuss the alternative approaches and future directions in the classification of psychopathology. Finally, the role of assessment strategies in informing this debate will be examined.

3.08.2 HISTORY OF CLASSIFICATION

3.08.2.1 Medical Classification

As Clementz and Iacono (1990) point out, medical classification systems have relied historically on careful observation of patients and their symptoms. The importance of observation as an adjunct to theorizing was recognized as long ago as the seventeenth century by thinkers such as Thomas Sydenham (1624–1689). Observations were used to make inferences about the causes of disease. The practice of medical observation and classification gained momentum with the advent of medical instruments. The most notable of these was the stethoscope, developed in 1819 by René Théophile Hyacinthe Laennec (1781–1826). This advance in instrumentation led to the use of objective signifiers of pathology in diagnosing disease and a diminished interest in less technologically sophisticated observations or subjective reports of symptoms. The continuous development of new technologies increased the ability of physicians to determine the function of organs and their modes of operation when healthy. This novel conceptualization of dysfunction as related to the functioning of organs led the French physician Broussais (1772–1838) to propose the radical idea that specific symptom clusters could not define a disease adequately because of the high overlap between different disorders. He suggested that in order to identify a disease one needs to study the patient's physiological processes and constitution.

Mid-nineteenth century advances in the various basic sciences (such as anatomy, histology, and biology) led to the further decline of the practice of observation in favor of the experimental study of physiological disease processes. Although this emphasis on understanding function was a great benefit to the various basic sciences, it culminated in the near abandonment of the practice of clinical observation (Clementz & Iacono, 1990). The increasing popularity of laboratory research also resulted in an emphasis on specific symptoms, rather than on the phenomenology of disease. This topographic or symptom-based approach to the study of pathology lent itself to a descriptive system of classification. However, this approach had its critics as far back as the 1800s. Some investigators (e.g., Trousseau, 1801–1867) believed that a more comprehensive approach to diagnosis, incorporating both clinical observation and laboratory findings, would better allow the recognition of disorders and their treatment.

The field of genetics provided another major development in the science of classification. Mendel, working in an isolated monastery, pioneered the science of genetics. His efforts were continued by others (e.g., Garrod, 1902) who applied his work to humans. Watson and Crick's (1953a, 1953b) detailed description of the structure of human genetic material supplied yet another powerful tool for identifying the presence and etiology of certain disorders. A classification system based on genetic and biological etiology may seem promising. However, researchers soon realized that the specificity of genetics is less than perfect. This lack of specificity seemed particularly evident in the psychiatric disorders. We now know that several psychiatric disorders are at least partly genetic. This is revealed by the finding that individuals with identical genetic make-up have higher concordance rates for developing the same disease than those who do not. However, because the concordance rate between identical twins is less than 100% (e.g., schizophrenia; Dworkin, Lenzenweger, Moldin, & Cornblatt, 1987), not all the variance in psychiatric disorders can be explained by genetic inheritance. These findings have led to the view that genetic make-up provides a diathesis for a particular disorder, which, when interacting with environmental factors, may produce a disease.

3.08.2.2 Early Nosology of Psychopathology

Emil Kraepelin (1856–1926) was the first to adapt a medical classification framework for psychiatric illnesses (Blashfield, 1991; Kraepelin, 1971). He considered psychopathology to be the result of underlying disease states and advocated the scientific investigation of illnesses. He coined the term “dementia praecox” (later known as schizophrenia) to describe what had previously been considered distinct pathologies. Kraepelin's thinking regarding psychopathology and classification was shaped by the rapid advances in German medical science in the nineteenth century and by his training in behaviorism from Wilhelm Wundt (Berrios & Hauser, 1988). German medicine during Kraepelin's lifetime emphasized the interpretation of mental disorders as diseases of the brain (Menninger, 1963).

Kraepelin's training with Wundt contributed to his use of behavioral descriptions for clusters of symptoms he believed were linked by a common etiology. Kraepelin's psychopathology classification became influential through the publication of his textbooks on psychiatry. His categories later became the basis for the first official classification adopted by the American Psychiatric Association (APA) in 1917 (Menninger, 1963), and revised in 1932 (APA, 1933).

A contemporary of Kraepelin's, Sigmund Freud (1856–1939), was also influential in forming our understanding of psychopathology. In contrast to Kraepelin, Freud placed little emphasis on the phenomenology of disorders, stressing the importance of diagnoses based on the underlying cause of patients' manifest symptoms. Kraepelin, however, opposed Freudian theory and psychoanalytic practice because of its nonempirical orientation (Kahn, 1959). Freud's theories were well articulated, however, and by the 1950s had come to dominate American psychiatry (Blashfield, 1991). Psychoanalytic thought did not lose its foothold until the community-based mental health movement of the 1960s, which pointed out that psychoanalysis was only accessible to the wealthy. Other factors that contributed to the decline of psychoanalytic thought include the development of alternative interventions such as psychotropic medications and behavioral intervention, as well as a general shift toward empiricism. Around this time, there was a resurgence of Kraepelinian thought, mainly by a group of psychiatrists at Washington University in St. Louis (Robins & Guze, 1970). They believed that psychiatry had drifted too far from its roots in medicine, and that it was necessary to identify the biological bases of psychopathology. They advocated an emphasis on classification as a tool for helping psychopathology to evolve as a field (Blashfield, 1991; Klerman, 1978).

3.08.3 CLASSIFICATION OF MENTAL ILLNESS

3.08.3.1 History of Classification of Mental Illness

After World War II, the armed forces developed a more comprehensive nomenclature to facilitate the treatment of World War II servicemen. During this same period, the World Health Organization (WHO) published the first version of the International classification of diseases (ICD-6; WHO, 1992) that included psychiatric disorders. The authors of the ICD-6 relied heavily on the work of various branches of the armed forces in the development of their taxonomy of psychopathology. The first version of the Diagnostic and statistical manual of mental disorders (DSM), a variant on the ICD-6, was published in 1952 by the APA's Committee on Nomenclature and Statistics. This nomenclature was designed for use in the civilian population, and focused on clinical utility (APA, 1994). The DSM (APA, 1994) was originally created as a means of standardizing the collection of statistics in psychiatric hospitals in the early 1900s. By the release of the second edition of the DSM (DSM-II) (1975), the authors had moved toward the elimination of exclusion rules and provision of more explicit and descriptive diagnostic criteria. The inclusion of explicit diagnostic criteria in the DSM-II was inspired by the 1972 work of a group of neo-Kraepelinian theorists, Feighner, Baran, Furman, and Shipman at the Washington University School of Medicine in St. Louis (Feighner et al., 1972). This group had originally created a six-category system as a guideline for researchers in need of homogeneous subject groups.

In 1978 and 1985, Spitzer and his colleagues modified and expanded the Washington University system and labeled it the Research Diagnostic Criteria (RDC). The RDC established criteria for 25 major categories, with an emphasis on the differential diagnosis between schizophrenia and the affective disorders. The authors of the DSM-III (1980) followed the example of the Washington University and RDC groups, including diagnostic criteria for each disorder and exclusion criteria for 60% of all DSM-III disorders. Most of these exclusion rules described hierarchical relationships between disorders, in which a diagnosis is not given if its symptoms are part of a more pervasive disorder.

As a result of these changes, the DSM-III and DSM-III-R are based almost exclusively on descriptive criteria. These criteria are grouped into distinct categories representing different disorders. The DSM-IV continues this tradition, using a categorical model in which patients are assigned to classes that denote the existence of a set of related signs and symptoms. The constituent signs and symptoms are thought to reflect an underlying disease construct. Researchers who advocate a categorical model of disease (e.g., Kraepelin) propose that disorders differ qualitatively from each other and from the nondisordered state. Although later editions of the DSM have relied progressively more on research findings in their formulation, the stated goal of DSM-IV remains the facilitation of clinical practice and communication (Frances et al., 1991). The DSM system has evolved as a widely used manual in both clinical and research settings. However, the structure and the stated goals of the DSM have been at the core of many controversies regarding the assessment, diagnosis, and classification of psychopathology.

3.08.3.2 Current Classification Systems

Two systems of classification are currently in common use: the DSM-IV (APA, 1994) and the ICD-10 (WHO, 1992). A detailed description of these systems is beyond the scope of this chapter and the interested reader should consult recent reviews of these systems (Frances, 1998; Regier et al., 1998; Spitzer, 1998). A brief overview of each system is provided below.

The DSM-IV is a multiaxial classification system. The individual is rated on five separate axes. Axis I includes all diagnoses except personality disorders and mental retardation, the latter being rated on Axis II. The rationale for the separation of these axes is to ensure that the presence of long-term disturbances is not overlooked in favor of current pathology. Together, these axes constitute the classification of abnormal behavior. The remaining three axes are not needed to make a diagnosis but contribute to the recognition of factors, other than the individual's symptoms, that should be considered in determining the person's diagnosis. General medical conditions are rated on Axis III, psychosocial and environmental problems are coded on Axis IV, and the individual's current level of functioning is rated on Axis V.

The 10th Revision of the ICD (ICD-10) is a multiaxial system. Axis I includes clinical diagnoses for both mental and physical disorders, Axis II outlines disabilities due to impairments produced by the disorder, and Axis III lists the environmental, circumstantial, and personal lifestyle factors that influence the presentation, course, or outcome of disorders. ICD-10 differs from other multiaxial classifications in that it: (i) records all medical conditions on the same axis (Axis I), (ii) assesses comorbid disorders (Axis II) in specific areas of functioning without ascribing a specific portion to each disorder, and (iii) allows expression of environmental (Axis III) factors determined by clinical practice and epidemiological evidence (Janca, Kastrup, Katschnig, & Lopez-Ibor, 1996).

One purpose for the collaboration between DSM-IV and ICD-10 authors was to foster the goal of the transcultural applicability of ICD-10 (Uestuen, Bertelsen, Dilling, & van Drimmelen, 1996). A multiaxial coding format in both DSM-IV and ICD-10 was adopted in order to provide unambiguous information with maximum clinical usefulness in the greatest number of cases (WHO, 1996). Although not explicitly stated by the authors of either manual, both the DSM-IV and ICD-10 are examples of the “neo-Kraepelinian revolution” in psychiatric diagnostic classification (Compton & Guze, 1995).

Progress toward the shared goals of the two systems can be considered in two areas: general clinical use of the systems, and the fostering of international communication and cultural sensitivity. Regarding general clinical use, a multicenter field trial evaluating aspects of the ICD-10 and DSM-IV involving 45 psychiatrists and psychologists in seven centers was conducted in Germany (Michels et al., 1996). Results revealed moderate inter-rater reliability for the ICD-10 Axis I and Axis II. However, the number of relevant psychosocial circumstances coded on Axis III by the different raters varied greatly. The authors concluded that the multiaxial system was generally well accepted by participating clinicians, and that it is worth studying empirically and revising accordingly.

Clinicians specializing in several specific areas have not been as positive in their comments about the ICD-10. For instance, Jacoby (1996) has argued that neither the DSM-IV nor the ICD-10 is adequate in its categorization of certain disorders associated with the elderly, such as certain dementias and psychotic disorders. Others have argued that specific ICD-10 diagnoses show poor levels of stability of diagnosis across time. Specifically, a 1997 study found that neurotic, stress-related, adjustment, generalized anxiety, and panic disorders as well as some psychoses and personality disorders showed low rates of reliability across interviews at different time points (Daradkeh, El-Rufaie, Younis, & Ghubash, 1997).

Issues regarding the fostering of international communication and cultural sensitivity were addressed by a Swiss study published in 1995 that assessed the inter-rater reliability and confidence with which diagnoses could be made using the ICD Diagnostic Criteria for Research (ICD-10 DCR), as well as examining the concordance between ICD-10 DCR, ICD-10 Clinical Descriptions and Diagnostic Guidelines, and other classification systems including the DSM-IV. Field trials were carried out at 151 clinical centers in 32 countries by 942 clinician/researchers who conducted 11 491 individual patient assessments. The authors report that most of the clinician/researchers found the criteria to be explicit and easy to apply and the inter-rater agreement was high for the majority of diagnostic categories. In addition, their results suggested that the use of operational criteria across different systems increases levels of inter-rater agreement.

Regier, Kaelber, Roper, Rae, and Sartorius (1994) cited findings suggesting that while overall inter-rater reliability across 472 clinicians in the US and Canada was good, these clinicians tended to make more use of multiple coding for certain disorders than clinicians from other countries. This suggests that certain aspects of the DSM system (e.g., its encouragement of multiple diagnoses) may make the transition and agreement between the two systems somewhat more difficult.

The ICD's efforts to place psychiatric disorders in the context of the world community's different religions, nationalities, and cultures have received praise in the literature (Haghighat, 1994), while the DSM-IV may be seen as somewhat less successful in achieving this goal. The ICD's increased cultural sensitivity was at the expense of increased length, but efforts to omit culturally biased criteria were largely successful. However, other authors have argued that the attempts to improve cultural sensitivity to the degree of a “universal” understanding of mental illness may be misguided (Patel & Winston, 1994). Specifically, while mental illness as a phenomenon may be universal, specific categories of mental illness as outlined by the DSM and ICD may require identification and validation of particular diagnoses within specific cultures (Patel & Winston, 1994).

3.08.3.3 Tools for Clinical Assessment

A comprehensive review of assessment measures for psychopathology is beyond the scope of this chapter. The interested reader should consult comprehensive reviews of this topic (e.g., Baumann, 1995; Bellack & Hersen, 1998; Sartorius & Janca, 1996). One useful method of classifying assessment techniques is to consider two classes of measures: those that aim to aid in diagnosis and those that aim to assess the severity of symptoms relatively independently of diagnosis. Examples of the first category include the Structured Clinical Interview for DSM-IV Diagnoses (SCID; First, Spitzer, Gibbon, & Williams, 1995) and the Composite International Diagnostic Interview (CIDI; Robins et al., 1988). These measures are mapped closely onto the diagnostic systems on which they are based (DSM-IV and ICD-10, respectively). The second class of assessment instruments uses a dimensional approach and aims to assess severity of symptoms relatively independently of diagnosis. Examples of such instruments include the Hamilton Rating Scale for Depression (HAM-D; Riskind, Beck, Brown, & Steer, 1987) and the Yale-Brown Obsessive-Compulsive Scale (Y-BOCS; Goodman, Price, Rasmussen, & Mazure, 1989a). The advantages and disadvantages of these assessment techniques are tied closely to the advantages and disadvantages of the categorical and dimensional approaches to classification and will be discussed in Section 3.08.5.2.3.

An important aspect of developing new knowledge in psychopathology is the improvement and standardized use of assessment tools across studies of psychopathology. The vast majority of studies of DSM disorders have used self-report or interview-based measures of symptoms. In some cases, behavioral checklists (either self- or other-reports) or psychological tests have been employed. In this section, issues such as the accuracy, reliability, and discriminant validity of such assessment tools and how they may influence findings in psychopathology studies will be examined.

Self-report measures may be the most time- and labor-efficient means of gathering data about psychological symptoms and behaviors. However, individuals often are inconsistent in their observations of their own behaviors, and the usefulness of these reports depends heavily on often limited powers of self-observation. Psychometric methods for enhancing the usefulness of self-report data generally focus on increasing the consistency (i.e., reliability and accuracy or validity) of the measures. A number of psychometrically sophisticated scales are available for several different types of disorders (e.g., anxiety; BAI; Beck, Epstein, Brown, & Steer, 1988; PSS; Foa, Riggs, Dancu, & Rothbaum, 1993; Y-BOCS; Goodman et al., 1989a, 1989b). Many of the traditional psychometric approaches classify disorders into state (transitory feelings) and trait (stable personality) attributes (e.g., state anxiety vs. trait anxiety, STAI; Spielberger, Gorsuch, & Lushene, 1970). The accuracy of these scales is most often evaluated by comparing results on the measure to results on other measures of the same emotion or disorder (e.g., interviews, physiological assessments, observations of behavior). By assessing various aspects of an emotion or disorder, the investigator tries to create a composite gauge of an emotion for which no single indicator offers a perfect yardstick.

One impediment to discovering which symptoms may be shared across disorders has been the structure of many clinical measures. Examples of widely used measures of psychological states are the Beck Depression Inventory (BDI; Beck & Steer, 1987), the Beck Anxiety Inventory (BAI; Beck et al., 1988), the Minnesota Multiphasic Personality Inventory (MMPI-2; Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989), and structured interview scales such as the Diagnostic Interview Schedule (DIS; Helzer & Robins, 1988) and the Structured Clinical Interview for DSM-IV Diagnoses (SCID; First et al., 1995).

These self-report and clinician-rated scales usually assess “modal” symptoms, as they focus on core aspects of each syndrome rather than on all possible variants. Structured interviews, such as the SCID-IV or the DIS, often allow “skip outs” in which the interviewer need not necessarily assess all of the symptoms of all disorders. Many studies have examined the convergent and divergent validity patterns of self-report (e.g., modal) measures of various disorders (e.g., Foa et al., 1993). These measures tend to yield strongly convergent assessments of their respective syndromes, but there is little specificity in their measurement, especially in nonpatient samples. The data often suggest the presence of a large nonspecific component shared between syndromes such as anxiety and depression (Clark & Watson, 1991). Some scales appear to be more highly loaded with the nonspecific distress factor than others. Therefore, in order to discriminate efficiently between disorders, future research should emphasize identification of symptoms which optimally discriminate between diagnoses. Watson and Clark (1992) found that even when factor analytically derived mood measures such as the Profile of Mood States (POMS; McNair, Lorr, & Droppleman, 1981) and the Positive and Negative Affect Schedule (PANAS; Watson, Clark, & Tellegen, 1988) are used, certain basic affects (i.e., anxiety and depression) are only partially differentiable. Their data suggest that the overlap between basic affects represents a shared component inherent in each mood state, which must be measured in order to understand the overlap between different mood states and disorders.

3.08.4 CURRENT ISSUES IN THE CLASSIFICATION OF PSYCHOPATHOLOGY

As noted earlier, the classification of psychopathology has been a controversial topic since its inception. Currently, the discussions revolve primarily around the DSM system, although many of the debates predate the DSM. The most common topics of controversy in the classification of psychopathology are: (i) the definition of psychopathology; (ii) the artificially high rates of comorbidity found between disorders when using the DSM system (Frances, Widiger, & Fyer, 1990; Robins, 1994); (iii) the balance between a focus on clinical utility and the facilitation of research and scientific progress (Follette, 1996); and (iv) organizational problems, both within and across disorders in the DSM (Brown & Barlow, 1992; Quay, Routh, & Shapiro, 1987).

3.08.4.1 Definition of Psychopathology

There is a lack of agreement about the definition of psychopathology. Early versions of both the ICD and the DSM attempted to guide the categorization of mental disorders without addressing the definition of psychopathology (Spitzer, Endicott, & Robins, 1978). Some contend that this lack of agreement about the definition of psychopathology remains the current state of affairs (e.g., Bergner, 1997). Others (e.g., Kendell, 1975, 1982) have noted that physicians frequently diagnose and treat disorders for which there is no specific definition, and which, in many cases, are not technically considered disorders (e.g., pregnancy). The controversy over the definition of mental disorder may be fueled by the fact that such illnesses are signified commonly by behaviors, and distinguishing between “normal” and

“deviant” or “disordered” behaviors is viewed by some as having serious cultural and sociological implications (Mowrer, 1960; Szasz, 1960). According to Gorenstein (1992), attempts at defining mental illness have fallen historically into several categories. Some have relied on statistical definitions based on the relative frequency of certain characteristics in the general population. Others have employed a social definition in which behaviors which conflict with the current values of society are considered deviant. Still other approaches have focused on the subjective discomfort caused by the problem, as reported by the individual. Finally, some definitions of psychopathology have been based on psychological theories regarding states or behaviors thought to signify problems within the individual. The DSM-IV (APA, 1994) defines a mental disorder as a “clinically significant behavioral or psychological syndrome or pattern that occurs in an individual and . . . is associated with present distress . . . or disability . . . or with a significant increased risk of suffering death, pain, disability or an important loss of freedom.” The disorder should not be an “expectable and culturally sanctioned response to an event,” and must be considered a manifestation of dysfunction in the individual. Wakefield has expanded and refined this idea in his concept of “harmful dysfunction” (Wakefield, 1992, 1997c). This concept is a carefully elaborated idea considered by some to be a workable solution to the problem of defining abnormality (e.g., Spitzer, 1997). In essence, the harmful dysfunction concept means that behaviors are abnormal to the extent that they imply something is not working properly as would be expected based on its evolutionary function. Specifically, Wakefield (1992) states that a mental disorder is present

if and only if, a) the condition causes some harm or deprivation of benefit to the person as judged by the standards of the person's culture (the value criterion), and b) the condition results from the inability of some mental mechanism to perform its natural function, wherein a natural function is an effect that is part of the evolutionary explanation of the existence and structure of the mental mechanism (the explanatory criterion). (p. 385)

Critics of Wakefield's concept argue that it represents a misapplication of evolutionary theory (Follette & Houts, 1996). They contend that evolutionary selection was not meant to apply on the level of behavioral processes, that is, it is not possible to know the function of a part or process of the individual by examining evolutionary history because random variation precludes a straightforward interpretation of the “design” of various parts of organisms (Mayr, 1981; Tattersall, 1995). Proponents of the “harmful dysfunction” definition suggest that it may provide a useful starting point for research in psychopathology, although further research needs to identify the specific mechanisms that are not functioning properly (Bergner, 1997).

3.08.4.2 Comorbidity

The term comorbidity was first used in the medical epidemiology literature and has been defined as “the co-occurrence of different diseases in the same individual” (Blashfield, 1990; Lilienfeld, Waldman, & Israel, 1994). Many factors potentially may contribute to the comorbidity rates of psychiatric disorders. Reported comorbidity rates are influenced by the actual co-occurrence of disorders, the populations considered, the range, severity, and base rates of the disorders considered, the method of assessment, and the structure of the classification system used.

3.08.4.2.1 Actual co-occurrence of disorders

In the medical literature, comorbidity often refers to diagnoses which occur together in an individual, either over one's lifetime or simultaneously. This concept emphasizes the recognition that different diagnoses potentially are related in several ways. One disease can signal the presence of another, predispose the patient to the development of another, or be etiologically related to another disorder (Lilienfeld et al., 1994). Lilienfeld et al. suggest that the increased attention to comorbidity in psychopathology is due to acknowledgment of the extensive co-occurrence and covariation that exists between diagnostic categories (Lilienfeld et al., 1994; Widiger & Frances, 1985). For example, Kessler et al. (1994) reported the results of a study on the lifetime and 12-month prevalence of DSM-III-R disorders in the US, in a random sample of 8098 adults. Forty-eight percent of the subjects reported at least one lifetime disorder, with the vast majority of these individuals, 79%, reporting comorbid disorders. Other studies using large community samples report that more than 50% of the participants diagnosed with one DSM disorder also meet criteria for a second disorder (Brown & Barlow, 1992). It has been argued that the common etiological factors across different diagnoses are of greater importance than the etiological factors specific to one disorder (Andrews, 1991; Andrews, Stewart, Morris-Yates, Holt, & Henderson, 1990).
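As a rough arithmetic check on the Kessler et al. (1994) figures quoted above, the short Python sketch below converts the reported percentages into approximate head counts. The counts are only implied by the rounded percentages (48% and 79% in a sample of 8098) and are not figures reported in the source; the function and variable names are illustrative.

```python
# Figures as quoted in the text; everything derived from them is approximate.
SAMPLE_SIZE = 8098            # adults in the sample
P_ANY_LIFETIME = 0.48         # share reporting at least one lifetime disorder
P_COMORBID_GIVEN_ANY = 0.79   # share of those reporting comorbid disorders


def implied_counts(n, p_any, p_comorbid_given_any):
    """Return (n_any, n_comorbid, comorbid_share_of_sample), rounded to people."""
    n_any = round(n * p_any)
    n_comorbid = round(n_any * p_comorbid_given_any)
    return n_any, n_comorbid, n_comorbid / n


if __name__ == "__main__":
    n_any, n_comorbid, share = implied_counts(
        SAMPLE_SIZE, P_ANY_LIFETIME, P_COMORBID_GIVEN_ANY
    )
    print(f"At least one lifetime disorder: about {n_any}")
    print(f"Comorbid disorders: about {n_comorbid} ({share:.0%} of the sample)")
```

Read this way, the two percentages together imply that somewhat more than a third of the whole sample reported comorbid disorders.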

3.08.4.2.2 Populations considered

One factor that affects comorbidity rates is the population under study. Kessler et al. (1994) found that people with comorbid disorders were likely to report higher utilization of services than those without a comorbid disorder. Similarly, higher rates of comorbidity were found among urban populations than rural populations, despite the higher rates of single psychiatric disorders and financial problems among rural participants. The race of the population studied also seems to influence comorbidity rates. African-American participants in the Kessler et al. study reported lower comorbidity rates than Caucasian participants, after controlling for the effects of income and education. However, Caucasian participants reported lower rates of current comorbid disorders than Hispanics. Disparities in prevalence of reported comorbidity were also found between respondents of different age groups, with people aged between 25 and 34 years reporting the highest rates. The income and education of the participants were also associated with reported comorbidity in the Kessler et al. study.

3.08.4.2.3 Range, severity, and base rates of disorders considered

The rates of comorbidity are also influenced by the disorders studied. Specifically, the range, severity, and base rates of certain disorders increase the likelihood that these disorders will be comorbid with another disorder. Certain diagnostic categories, including childhood disorders (Abikoff & Klein, 1992; Biederman, Newcorn, & Sprich, 1991), anxiety disorders (Brown & Barlow, 1992; Goldenberg et al., 1996), and personality disorders (Oldham et al., 1992; Widiger & Rogers, 1989), appear to be comorbid with other disorders. For example, 50% of patients with a principal anxiety disorder reported at least one additional clinically significant anxiety or depressive disorder in a large-scale study by Moras, DiNardo, Brown, and Barlow (1991). Similarly, anxiety disorders are not only highly likely to be comorbid with each other but also with mood, substance use, and personality disorders (Brown & Barlow, 1992). These high rates of comorbidity may in part be due to the degree to which the different anxiety disorders include overlapping features (Brown & Barlow, 1992).

Finally, the degree of comorbidity is influenced directly by thresholds set to determine the presence or absence of various disorders. The choice of threshold appears to affect comorbidity rates differentially, depending on the disorder. For instance, certain disorders (e.g., social phobia) appear to be more likely to accompany other anxiety disorders when considered at subclinical levels (Rapee, Sanderson, & Barlow, 1988). Conversely, Frances et al. (1990) have suggested that the severity of mental health problems in a sample will influence comorbidity rates, in that a patient with a severe presentation of one disorder is more likely to report other comorbid disorders.

In addition to the range and severity thresholds of disorders, the base rates of a particular disorder have a strong influence on the apparent comorbidity rates. Conditions that are frequently present in a given sample will tend to be diagnosed together more often than those that are infrequent. This may help explain the comorbidity rates of certain highly prevalent disorders such as anxiety, depression, and substance abuse.

3.08.4.2.4 Assessment methods

The choice of assessment methods may influence comorbidity rates in much the same way as aspects of the disorders. Disagreement regarding comorbidity rates may be partly due to differences in the definition of comorbidity. For example, comorbidity has been defined as within-episode co-occurrence (or dual diagnosis) among disorders by some (August & Garfinkel, 1990; Fulop, Strain, Vita, Lyons, & Hammer, 1987), lifetime co-occurrence by others (Feinstein, 1970; Shea, Widiger, & Klein, 1992, p. 859), and covariation among diagnoses (i.e., across individuals) by still other researchers (Cole & Carpentieri, 1990; Lewinsohn, Rohde, Seeley, & Hops, 1991). Even when researchers agree on a definition, their estimates of comorbidity may differ based on the type of assessment tool they use. As Lilienfeld et al. (1994) point out, assessment techniques have error variance that by definition is not related to the construct(s) of interest or may artificially inflate the actual comorbidity rate.

Furthermore, individual raters may hold biases or differing endorsement thresholds for behaviors which are common to a variety of disorders. Likewise, raters may have specific beliefs about which disorders tend to covary. These types of biases may affect both self-report and interviewer-based data. Similarly, studies utilizing structured interviews may differ from studies in which severity thresholds are not described as concretely. Therefore, differing rates of comorbidity across studies may be an artifact of the types of measurement used, or the biases of raters involved (Zimmerman, Pfohl, Coryell, Corenthal, & Stangl, 1991).
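The base-rate point made in Section 3.08.4.2.3 can be illustrated with a short Python sketch. It assumes, purely for illustration, that two disorders occur independently, so that the co-occurrence expected by chance is the product of their base rates. The rates used are hypothetical and are not drawn from any study cited here, and the function names are not taken from any library.

```python
import math


def chance_cooccurrence(base_rate_a, base_rate_b):
    """Joint rate expected if the two disorders were statistically independent."""
    return base_rate_a * base_rate_b


def observed_to_chance_ratio(observed_joint_rate, base_rate_a, base_rate_b):
    """How many times more often the pair co-occurs than chance alone predicts."""
    return observed_joint_rate / chance_cooccurrence(base_rate_a, base_rate_b)


if __name__ == "__main__":
    sample_size = 1000  # hypothetical sample
    pairs = {
        "two common disorders": (0.20, 0.15),  # hypothetical base rates
        "two rare disorders": (0.02, 0.01),    # hypothetical base rates
    }
    for label, (rate_a, rate_b) in pairs.items():
        expected_cases = chance_cooccurrence(rate_a, rate_b) * sample_size
        print(f"{label}: about {expected_cases:.1f} co-occurring cases by chance")

    # An observed joint rate of 6% for the common pair would be twice chance.
    ratio = observed_to_chance_ratio(0.06, 0.20, 0.15)
    print(f"observed / chance = {ratio:.1f}")
    assert math.isclose(ratio, 2.0)
```

Comparing an observed co-occurrence rate with this chance expectation is one simple way to ask whether a reported comorbidity rate exceeds what base rates alone would produce.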

3.08.4.2.5 Structure of the classification system

Frances et al. (1990) argue that the classification system used in the various versions of the DSM increases the chance of comorbidity in comparison to other, equally plausible systems. The early systems (e.g., DSM-III) attempted to address the comorbidity issue by proposing elaborate hierarchical exclusionary rules specifying that if a disorder was present in the course of another disorder that took precedence, the second disorder was not diagnosed (Boyd et al., 1984). Thus, disorders which may have been truly comorbid were overlooked. Examination of these issues resulted in the elimination of this exclusionary criterion in DSM-III-R. This new method of diagnosis, however, has been criticized for artificially inflating comorbidity between various disorders.

Frances et al. (1990) point out that new editions of the DSM expanded coverage by adding new diagnoses. This was done by “splitting” similar disorders into subtypes. They argue that the tendency to split diagnoses creates much more comorbidity than would “lumping” systems of classification. This is because disorders that share basic underlying features are viewed as separate.

Much of the “splitting” in the DSM has resulted from the increasing reliance on prototypical models of disorders (Blashfield, 1991). The creators of the DSM increasingly have relied on prototypes, defining a diagnostic category by its most essential features, regardless of whether these features are also present in other diagnoses. McReynolds (1989) argued that categories with a representative prototype and indistinct or “fuzzy” boundaries are the basis of the most utilitarian classification systems because they are representative of categories in nature. The use of this type of prototype-based classification has improved the sensitivity and clinical utility of the DSM system. However, these gains are achieved at the expense of increased comorbidity and decreased specificity of differential diagnosis due to definitional overlap. Thus, this system of diagnostic classification makes it unclear whether there is any true underlying affinity between disorders currently considered to have high rates of comorbidity.

The stated approach of the DSM-IV is a descriptive method relying on observable signs and symptoms rather than underlying mechanisms. The sets of signs and symptoms that constitute the different diagnoses are, by definition, categorical, that is, the presence of the required number and combination of symptoms indicates a diagnosis and the absence of the required number of symptoms precludes a diagnosis. The symptoms are not weighted, implying that they are of equal importance in defining the disorder. For many diagnoses, the structure of this system makes it possible for two individuals to meet the same diagnostic criteria without sharing many symptoms. Conversely, it is possible for one person to meet the criteria while another person who shares all but one feature does not meet the criteria. As a result, patients who actually form a fairly heterogeneous group may be “lumped” into one homogeneous diagnostic category.

Combined with unweighted symptoms and a lack of attention to severity of symptoms, this “lumping” can lead to what Wakefield (1997a) refers to as “overinclusiveness” of diagnostic criteria. Specifically, people who do not truly suffer from a mental disorder may nonetheless receive the diagnosis, thus lowering the conceptual validity of DSM diagnostic categories. Conversely, minute differences in reported symptoms may result in dichotomizing between disordered and nondisordered individuals, or between individuals with one disorder as opposed to another (e.g., avoidant personality disorder vs. generalized social phobia), creating heterogeneity where there may actually be none.

This point is further elaborated by Clark et al. (1995), who point out that within-category heterogeneity constitutes a serious challenge to the validity of the DSM. These authors view comorbidity and heterogeneity as related problems, that is, within-group heterogeneity of symptoms found across diagnostic categories leads to increased rates of comorbidity among the disorders. Homogeneous categories, on the other hand, lead to patient groups that share defining symptoms and produce lower rates of both comorbidity and heterogeneity. This is in part because of polythetic systems that require some features of a disorder for diagnosis. In contrast, a monothetic system would require all features of a disorder for diagnosis. Because the monothetic approach to diagnosis produces very low base rates for any diagnosis (Morey, 1988), researchers generally have used the polythetic approach. However, this approach promotes within-category heterogeneity because individuals who do not share the same symptom profiles can receive the same diagnosis. This poses a problem for any categorical system, including the DSM. Because of the inherent heterogeneity in patient profiles, a categorical system must address this issue by either proposing artificial methods of limiting heterogeneity (e.g., DSM-III), using unrealistic homogeneous categories, or acknowledging the heterogeneity (e.g., DSM-IV; Clark et al., 1995). Schizophrenia is an example of a diagnosis that has proved to have within-group heterogeneity. For example, current nosology (e.g., DSM-IV) assumes that schizophrenia and affective illness are distinct disorders. However, shared genetic vulnerability has been proposed for schizophrenia and some affective disorders (Crow, 1994). Taylor (1992) reviewed the evidence for this continuum and suggested that the discrimination of these disorders by their signs and symptoms is inadequate (Taylor & Amir, 1994). The need to know whether psychoses vary along a continuum is obviously critical to our understanding of their pathogenesis and etiology.

To address within-group heterogeneity of disorders, researchers have attempted to create subtypes (e.g., positive vs. negative symptoms of schizophrenia; Andreasen, Flaum, Swayze, Tyrell, & Arndt, 1990). Consistent with this conceptualization, researchers have correlated negative symptoms of schizophrenia with cognitive deficits (Johnstone et al., 1978), poor premorbid adjustment (Pogue-Geile & Harrow, 1984), neuropathologic and neurologic abnormalities (Stevens, 1982), poor response to neuroleptics (Angrist, Rotrosen, & Gershon, 1980), and genetic factors (Dworkin & Lenzenweger, 1984). On the other hand, positive symptoms have been correlated with attention (Cornblatt, Lenzenweger, Dworkin, & Erlenmeyer-Kimling, 1992) and a positive response to neuroleptics (Angrist et al., 1980). Despite these differences, some investigators have questioned the reliability of these findings (Carpenter, Heinrichs, & Alphs, 1985; Kay, 1990) and the clinical clarity of the subtypes (Andreasen et al., 1990). The majority of these studies included patients with schizophrenia but not affective disorders, and some assessed only limited symptoms (e.g., only negative symptoms; Buchanan & Carpenter, 1994; Kay, 1990). These approaches are potentially problematic because they assume implicitly that the psychoses are distinct, and that the specific types of psychopathology are easily discriminated.

In summary, the issue of comorbidity poses what is potentially the greatest challenge to diagnosis. Factors including the populations considered, the range, severity, and base rates of the disorders considered, the method of assessment, and the structure of the classification system contribute to reported comorbidity rates. These factors are often extremely difficult to distinguish from the true comorbidity of disorders. A full understanding of the common factors and shared etiologies of psychiatric disorders will most likely require an understanding of their true comorbidity. It is likely, then, that this will remain a highly controversial topic in psychopathology assessment.

3.08.4.3 Clinical and Research Applications of Classification Systems

Classification serves various purposes, depending on the setting in which it is used. In clinical settings, classification is used for treatment formulation, whereas in research settings it allows researchers to formulate and communicate ideas. The DSM-IV taskforce has stated that the various uses of classification are usually compatible (APA, 1994, p. xv). It is not clear whether this assertion has been tested empirically, or whether steps have been taken to resolve any divergence between the various uses of classification. The goals of clinical utility and outcome of research are not inherently inconsistent, but situations may arise in which the two diverge in their application. The majority of the modifications made to recent versions of the DSM were designed to improve clinical utility by simplifying or improving everyday diagnostic procedures. While not necessarily empirically driven, these changes do not appear to have adversely impacted the validity of the diagnoses they affect. Assigning diagnoses facilitates information storage and retrieval for both researchers and clinicians (Blashfield & Draguns, 1976; Mayr, 1981). However, problematic issues have been raised about the use and structure of diagnostic systems in both clinical and research settings.

The use of classification and the reliance on the DSM appears to be increasing (Follette, 1996). This is partly because of the trend toward the development and empirical examination of treatments geared toward specific diagnoses (Wilson, 1996). Although the use of diagnostic classification is beneficial in conceptualizing cases, formulating treatment plans, and communicating with other providers, some have argued that assigning a diagnosis may have a negative impact on patients (Szasz, 1960), that is, decisions about what constitutes pathology as opposed to normal reactions to stressors may be arbitrary (Frances, First, & Pincus, 1995), gender-based, and culturally or socioeconomically based. Furthermore, the choice of which behaviors are considered pathological (de Fundia, Draguns, & Phillips, 1971) may be influenced by the characteristics of the client (Broverman, Broverman, Clarkson, Rosenkrantz, & Vogel, 1970; Gove, 1980; Hamilton, Rothbart, & Dawes, 1986).

Diagnostic categories may also have adverse consequences on research. For example, although the diagnostic categories defined by the classification systems are well studied and easily compared across investigations, patients who do not clearly fit a diagnosis are ignored. Furthermore, frequent and generally clinically-driven changes in the diagnostic criteria for a particular disorder make sustained investigation of the disorder difficult (Davidson & Foa, 1991). This situation is further exacerbated by the frequent addition of new diagnostic categories to the DSM (Blashfield, 1991). The DSM-IV taskforce addressed this issue by explicitly recommending that the diagnostic classifications be used as a “guideline informed by clinical judgment,” as they “are not meant to be used in a cookbook fashion” (APA, 1994, p. xxiii). This suggestion, while useful in a clinical setting, may not be as applicable in research settings.

The proliferation of diagnostic criteria in recent versions of the DSM has successfully improved diagnostic reliability, but has done so at the expense of external validity (Clementz & Iacono, 1990). That may adversely affect research findings (Carey & Gottesman, 1978; Meehl, 1986). According to Clementz and Iacono (1990), the achievement of high reliability in a diagnostic classification is often taken to signify either validity or a necessary first step toward demonstrating validity. However, this is not the case in situations in which disorder criteria are designed significantly to increase reliability (i.e., by using a very restrictive definition). In such situations, validity may actually suffer greatly due to the increase in false categorizations of truly disordered people as unaffected. Conversely, the generation of specific behavioral criteria (e.g., Antisocial Personality Disorder in DSM-III-R) may increase reliability but lead to overinclusion (e.g., criminals in the antisocial category) while entirely missing other groups (e.g., “successful” psychopaths who have avoided legal repercussions of the behavior) (Lykken, 1984).

Clearly there are possible negative impacts of our current diagnostic system for research. More broadly, some have argued that nosological questions involve “value judgments” (Kendler, 1990, p. 971) and as such are nonempirical. But, as Follette and Houts (1996) point out, the classification of pathology, even in physical medicine, requires the identification of desirable endstates which are culturally rather than biologically defined. Furthermore, value judgments lie in the choices made between competing theories (Widiger & Trull, 1993) and the attempt by the authors of the DSM to remain theoretically “neutral” is inherently problematic. In its attempt to remain “neutral,” the DSM actually adopts a model of shared phenomenology, which implicitly accepts a theory of the underlying structure of psychopathology (Follette & Houts, 1996). This implicit theory is problematic on several related fronts. First, the DSM professes to be theory-neutral while it may arguably be considered theory-bound. Second, because the authors of the DSM do not explicitly recognize any theory, specific theories cannot be empirically tested against competing theories of psychopathology.

3.08.4.4 Organizational Problems in the DSM

Several of the issues regarding assessment and classification revolve around the organization of the DSM. These problems can be classified into two broad categories: problems with the multiaxial system and problems concerning the placement and utilization of specific categories of disorders (Clark et al., 1995).

The multiaxial system of the DSM has no doubt accomplished its goal of increasing the attention given to various (e.g., personality) aspects of functioning. However, the distinction between certain personality (Axis II) disorders and some Axis I disorders is problematic. For example, there is now ample data suggesting that there is a high degree of overlap between avoidant personality disorder and generalized social phobia (Herbert, Hope, & Bellack, 1992; Widiger, 1992). Similarly, it is difficult to distinguish schizotypal personality disorder and schizophrenia on empirical grounds (Grove et al., 1991).

The second type of organizational issue relates to the placement and utilization of individual disorders. For example, because post-traumatic stress disorder (PTSD) shares many features (e.g., depersonalization, detachment) with the dissociative disorders, some (e.g., Davidson & Foa, 1991) have argued for its placement with the dissociative disorders instead of the anxiety disorders. Likewise, the overly exclusive diagnostic categories of the DSM have led to situations in which clearly disordered patients exhibiting symptoms of several disorders do not fit into any specific category. The authors of the DSM have attempted to remedy this by creating “not otherwise specified” (NOS) categories for several classes of disorders. While at times it is difficult to distinguish the meaning of NOS categories, they are often utilized in clinical practice. Clark et al. (1995) view the high prevalence of subthreshold and atypical forms of the disorders commonly classified as NOS as contributing to the problem of heterogeneity. For example, various diagnoses including mood disorders (Mezzich, Fabrega, Coffman, & Haley, 1989), dissociative disorders (Spiegel & Cardena, 1991), and personality disorders (Morey, 1992) are characterized by high rates of NOS. One method of combating the high NOS rates would be to create separate categories to accommodate these individuals. For example, the rate of bipolar disorder NOS was reduced by the inclusion of the subcategories I and II in DSM-IV. Another method addressing the NOS problems is to include clear examples of individuals who would meet the NOS criteria. This second approach has been tested for some potential diagnostic groups, such as the mixed anxiety-depression category (Clark & Watson, 1991; Zinbarg & Barlow, 1991; Zinbarg et al., 1994).

3.08.5 ALTERNATIVE APPROACHES TO CLASSIFICATION

While many well-articulated criticisms of the classification system have been presented, the literature contains far fewer suggested alternatives (Follette, 1996). There is great disagreement among authors who have suggested other taxonomic models about not only the type of model, but also its content, structure, and methodology. The study of psychopathology has been pursued by researchers from different theoretical schools, including behaviorists, neurochemists, phenomenologists, and psychodynamicists (Millon, 1991). These approaches rely on unique methodologies and produce data regarding different aspects of psychopathology. No one conceptualization encompasses the complexity of any given disorder. These differing views are not reducible to a hierarchy, and cannot be compared in terms of their objective value (Millon, 1991). Biophysical, phenomenological, ecological, developmental, genetic, and behavioral observations have all been suggested as important contents to include in the categorization of psychopathology. How to structure or organize content has also been a topic of much debate. In this section the structural and methodological approaches to creating classification systems will be reviewed. Models of psychopathology will be discussed, including the Neo-Kraepelinian model, prototype models, and dimensional and categorical models. Finally, the use of etiological factors in the categorization of psychopathology (J. F. Kihlstrom, personal communication, September 2, 1997, SSCPnet) will be examined and other research and statistical methodologies will be outlined that may potentially lead to better systems of classification (J. F. Kihlstrom, personal communication, September 11, 1997, SSCPnet; D. Klein, personal communication, September 2, 1997, SSCPnet).

3.08.5.1 Types of Taxonomies

As knowledge about features best signifying various disorders increases, determinations must be made about the structure that best suits their classification. Before undertaking an investigation of the specific psychopathology classifications which have been proposed, the structural design of diagnostic taxonomies in general will be outlined briefly. The frameworks suggested fall into three categories: hierarchical, multiaxial, and circular. The hierarchical model organizes disorders into sets with higher-order diagnoses subsuming lower-order classifications. The multiaxial model assigns parallel roles for the different aspects of a disorder. Finally, the circular model assigns similar disorders to adjoining segments of a circle and dissimilar disorders to opposite sides of the circle (Millon, 1991). These three conceptualizations are not mutually exclusive, and many taxonomies share aspects of more than one structure. Within each of these three general frameworks, taxa may be considered categorical or dimensional. The current DSM combines both hierarchical and multiaxial approaches. Circular frameworks, which generally employ the dimensional approach, have been used in theories of personality. The model that is the basis for the DSM system, the Neo-Kraepelinian model, will be examined first, and then prototype and dimensional models will be considered. Finally, suggested methodological and statistical approaches to improving the classification of psychopathology will be covered.

3.08.5.2 Taxonomic Models

3.08.5.2.1 Neo-Kraepelinian (DSM) model

The Neo-Kraepelinian model, inspired by the 1972 work of the Washington University group and embodied in the recent versions of the DSM, is the current standard of psychopathology classification. According to the Neo-Kraepelinian view, diagnostic categories represent medical diseases, and each diagnosis is considered to be a discrete entity. Each of the discrete diagnostic categories is viewed as having a describable etiology, course, and pattern of occurrence. Clearly specified operational criteria are used to define each category and foster objectivity. This type of classification is aided by the use of structured interviews in gathering relevant symptom information to assign diagnoses. Diagnostic algorithms specify the objective rules for combining symptoms and reaching a diagnosis (Blashfield, 1991). In this view, the establishment of the reliability of diagnostic categories is considered necessary before any type of validity can be established. Categories that refer to clearly described patterns of symptoms are considered to have good internal validity, while the utility of a diagnosis in predicting the course and treatment response of the disorder is seen as a marker of good external validity. Despite its many shortcomings, the widespread adoption of the DSM system in both clinical work and research is a testament to the utility of the Neo-Kraepelinian model.

3.08.5.2.2 Prototype model

The prototype model has been suggested as a viable alternative to the current Neo-Kraepelinian approach (Cantor, Smith, French, & Mezzich, 1980; Clarkin, Widiger, Frances, Hurt, & Gilmore, 1983; Horowitz, Post, French, Wallis, & Siegelman, 1981; Livesley, 1985a, 1985b). In this system, patients' symptoms are evaluated in terms of their correlation with a standard or prototypical representation of specific disorders. A prototype consists of the most common features of members of a category, and is the standard against which patients are evaluated (Horowitz et al., 1981). The prototype model differs from the Neo-Kraepelinian model in several ways. First, the prototype model is based on a philosophy of nominalism, in which diagnostic categories represent concepts used by mental health professionals (Blashfield, 1991). Diagnostic groups are not viewed as discrete, but individuals may warrant membership in a category to a greater or lesser degree. The categories are defined by exemplars, or prototypes, and the presentation of features or symptoms in an individual is neither necessary nor sufficient to determine membership in a category. Rather, the prototype model holds that membership in a category is correlated with the number of representative symptoms the patient has. The prototype model suggests that the degree of membership in a category is correlated with the number of features that a member has, so features are neither necessary nor sufficient, since membership is not an absolute. Furthermore, categories in the prototype model have indistinct boundaries, and the membership decision relies largely on clinician judgment. It is likely that the adoption of this model would result in a decrease in reliability compared to the DSM. However, proponents argue that this model is more reflective of real-world categories in psychopathology (Chapman & Chapman, 1969).

3.08.5.2.3 Dimensional and categorical models

An alternative to the categorical classification system is the dimensional approach. In dimensional models of psychopathology, symptoms are assessed along several continua, according to their severity. Several dimensional models have been suggested (Eysenck, 1970; Russell, 1980; Tellegen, 1985). Dimensional models are proposed as a means of increasing the empirical parsimony of the diagnostic system. The personality disorders may be most readily adapted to this approach (McReynolds, 1989; Widiger, Trull, Hurt, Clarkin, & Frances, 1987; Wiggins, 1982), but this approach is by no means limited to personality disorders and has been suggested for use in disorders including schizophrenia (Andreasen & Carpenter, 1993), somatoform disorder (Katon et al., 1991), bipolar disorder (Blacker & Tsuang, 1992), childhood disorders (Quay et al., 1987), and obsessive-compulsive disorder (Hollander, 1993). Dimensional models are more agnostic (i.e., making fewer assumptions), more parsimonious (i.e., possibly reducing the approximately 300 diagnosis classifications in the DSM to a smaller subset of dimensions), more sensitive to differences in the severity of disorders across individuals, and less restrictive.
defining features are neither necessary nor While a dimensional approach might simplify sufficient. some aspects of the diagnostic process, it would Some authors have described the DSM undoubtedly create new problems. First, cate- system as a prototype model, primarily because gorical models are resilient because of the it uses polythetic, as opposed to monothetic, psychological tendency to change dimensional definitions (Clarkin et al., 1983; Widiger & concepts into categorical ones (Cantor & Frances, 1985). Although the DSM does use Genero, 1986; Rosch, 1978). Second, imple- polythetic definitions, it does not constitute a mentation of dimensional systems would re- prototypical model because specific subsets of quire a major overhaul of current practice in the symptoms are sufficient for making a diagnosis. mental health field. Third, replacing the DSM Prototype and polythetic models allow varia- categorical model with a dimensional model will bility among features within a category, how- undoubtedly meet with much resistance from ever, they view category definition differently. proponents of clinical descriptiveness, who Prototype models focus on definition by believe that each separate category provides a example, polythetic models focus on category more richly textured picture of the disorder. membership as achieved by the presence of Finally, there are currently no agreed upon certain features that are sufficient. In a proto- dimensions to be included in such a classifica- type model the level of membership is correlated tion model (Millon, 1991). 182 Assessment of Psychopathology: Nosology and Etiology
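The distinction drawn above between polythetic definitions, prototype membership, and dimensional description can be made concrete with a small sketch. The Python below is illustrative only: the feature names, the threshold of three features, the 0 to 10 severity scale, and the use of a simple proportion of features as a stand-in for "degree of membership" are all assumptions introduced here, not criteria taken from the DSM or from the authors cited.

```python
# Hypothetical feature list for an illustrative category (not actual DSM criteria).
PROTOTYPE_FEATURES = frozenset({
    "low_mood",
    "sleep_disturbance",
    "fatigue",
    "poor_concentration",
    "appetite_change",
})


def polythetic_diagnosis(ratings, threshold=3):
    """Categorical, polythetic rule: any `threshold` listed features are sufficient."""
    return len(PROTOTYPE_FEATURES & set(ratings)) >= threshold


def prototype_membership(ratings):
    """Graded membership: proportion of prototype features present (0.0 to 1.0)."""
    return len(PROTOTYPE_FEATURES & set(ratings)) / len(PROTOTYPE_FEATURES)


def dimensional_profile(ratings):
    """Dimensional description: keep every severity rating, assign no category."""
    return {feature: ratings.get(feature, 0) for feature in sorted(PROTOTYPE_FEATURES)}


# Two hypothetical patients; keys are features present, values are 0-10 severity ratings.
patient_a = {"low_mood": 8, "fatigue": 6, "sleep_disturbance": 2}
patient_b = {"low_mood": 3, "fatigue": 1}

for label, patient in (("A", patient_a), ("B", patient_b)):
    print(label,
          polythetic_diagnosis(patient),
          prototype_membership(patient),
          dimensional_profile(patient))
```

Under these assumptions the polythetic rule returns a yes or no answer once a sufficient subset of features is present, the prototype score grades the same patients along a continuum of resemblance, and the dimensional profile retains the severity information that both of the other summaries discard.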

Thus, the task of advocates of the dimensional approach is twofold. First, they must determine the type and number of dimensions that are necessary to describe psychopathology. Second, they will need to demonstrate that it is possible to describe the entire range of psychopathology using a single set of dimensions. At this stage, the most useful starting point may be examination of the role of various dimensions in the description of psychopathology, as opposed to arguing the virtues and limitations of categorical and dimensional approaches to psychopathology.

3.08.5.3 The Use of Etiological Factors for Diagnosis

Another proposed remedy to the problems facing current classification systems is to examine the role of etiology. Andreasen and Carpenter (1993) point out the need to identify etiologic mechanisms in order to understand a disorder. In addition, understanding etiologic factors may help explain the heterogeneity and comorbidity of the disorders currently listed in the DSM. The authors of the DSM have generally avoided making statements regarding the etiology of disorders in keeping with the "theoretically neutral" stance. However, some authors have argued that this caveat is only loosely enforced in the DSM as it is, as exemplified by life-stress based disorders such as PTSD and adjustment disorder (Brett, 1993).

Traditional models of etiology have focused on either the biological or environmental causes of psychopathology. Wakefield (1997b) has warned against the assumption that the etiology of psychopathological disorders will necessarily be found to be a physiological malfunction. He argued that the mind can begin to function abnormally without a corresponding brain disorder. More recent conceptualizations of the etiology of psychopathology acknowledge the role of both biological and environmental factors, and debate the ratio of the contribution from each. As would be expected, etiological studies of psychopathology tend to reflect the underlying theories of mental disorders accepted by the researchers who conduct them. For example, biological theorists attempt to identify biological markers of certain disorders (e.g., Clementz & Iacono, 1990; Klein, 1989). Environmental theories attempt to identify specific events that are necessary or sufficient to produce a disorder. These approaches have achieved varying degrees of success depending on which diagnostic groups were considered. One explanation for the limited success of attempts to identify etiological factors in psychopathology is the limited number of techniques available for the examination of potential factors. For example, the history of psychiatry and medicine is replete with examples of major findings due in part to advances in technology (e.g., computerized axial tomography [CAT] scans as a method of examining the function of internal organs). An alternative explanation for the limited success of etiological studies is that most researchers have relied on theoretical perspectives that assume distinct categories of illness. Specifically, the assumption of distinct diagnostic entities masks the possibility that multiple etiological factors may lead to the development of the same disorder, and that biological and environmental factors may ameliorate the effect of strong etiological factors. Even when a diagnosis may seem to have a clear etiology (e.g., PTSD), the picture is not clear. For example, although the diagnosis of PTSD requires a clear stressor, it is known that not all individuals who experience that stressor develop PTSD and not all stressors are likely to create PTSD (Foa & Rothbaum, 1998). Furthermore, the presence of a stressor alone is not sufficient to warrant the diagnosis. In fact, research suggests that etiologic factors entirely outside the diagnostic criteria (e.g., IQ; Macklin et al., 1998; McNally & Shin, 1995) may ameliorate the effects of the identified etiologic factors on the development of the disorder.

Much of the controversy about assessment and classification in psychopathology stems from the conflict about the use of value judgments as opposed to data-driven theory testing in creating diagnostic categories. Some have suggested that a combination of the two perspectives, including the use of both theory and data, may be the most workable approach (Blashfield & Livesley, 1991; Morey, 1991). This approach would amount to a process of construct validation depending on both theory and evaluation of the theory by data analysis. As in other areas of theory development, testability and parsimony would play a crucial role in choosing between competing theories. In the following section, the need for the adoption of new research methodologies in the field of assessment and classification of psychopathology will be considered. Next, some of the areas of research which illustrate promising methods, most of which focus on identifying etiologic factors in psychopathology, will be discussed.

As mentioned earlier, researchers have called for a move away from a system of diagnosis based on superficial features (symptoms) toward diagnosis based on scientifically-based theories of underlying etiology and disease processes. This focus parallels that of medical diagnosis of physical illness. Physicians do not diagnose based on symptoms. Instead, patient reports of symptoms are seen as indicators of potential illnesses. Diagnoses are not made until specific indicators of pathology (e.g., biopsies, masses, blood draws, etc.) have been examined. The interpretation of such laboratory results requires an understanding of the differences between normal and abnormal functioning of the cell or organ in question. In order for assessment of psychopathology to follow this route, researchers must continue to amass information on normal mental and behavioral functioning (J. F. Kihlstrom, personal communication, September 11, 1997, SSCPnet). This endeavor can be facilitated by technological advances in experimental paradigms and measurement techniques and devices. The issue of what constitutes a great enough deviation from "normal" functioning to warrant treatment has been and most likely will continue to be controversial. However, such decisions are a necessary hurdle if psychopathology is to evolve as a science.

It is likely that the identification of etiological factors in psychopathology will not rely entirely on biological factors. The validation of etiological constructs in psychopathology will undoubtedly include studies designed to identify potential contributing causes including environmental, personality, and physiological factors. Examples of research methods and paradigms which may prove useful in determining the etiology of psychiatric disorders are becoming increasingly evident in the literature. Possible methodologies include: psychopharmacological efficacy studies (Harrison et al., 1984; Hudson & Pope, 1990; Papp et al., 1993; Quitkin et al., 1993; Stein, Hollander, & Klein, 1994); family and DNA studies (Fyer et al., 1996); treatment response studies (Clark et al., 1995; Millon, 1991); in vivo monitoring of physiological processes (Martinez et al., 1996); and identification of abnormal physiological processes (Klein, 1993, 1994; Pine et al., 1994). These approaches may prove informative in designing future versions of psychopathology classification systems. Other researchers have chosen a more direct approach, as is evidenced in a series of articles in a special section of the Journal of Consulting and Clinical Psychology (JCCP) entitled "Development of theoretically coherent alternatives to the DSM-IV" (Follette, 1996). The authors of this special section of JCCP pose radical alternatives to the DSM. The alternative classification systems are proposed from a clearly stated behavioral theoretical viewpoint, which differs considerably from many of the more biologically-based approaches described above.

A number of the alternatives and improvements suggested in the 1996 JCCP are based on functional analysis, the identification of the antecedents and consequences of each behavior (Hayes, Wilson, Gifford, Follette, & Strosahl, 1996; Scotti, Morris, McNeil, & Hawkins, 1996; Wulfert, Greenway, & Dougher, 1996). Wulfert et al. (1996) use the example of depression as a disorder which may be caused by a host of different factors (biological, cognitive, social skills deficits, or a lack of reinforcement). They argue that the fact that functionally similar behavior patterns may have very different structures may contribute to the heterogeneity found in the presumably homogeneous DSM categories. These authors contend that functional analysis may constitute one means of identifying homogeneous subgroups whose behaviors share similar antecedents and consequences. This approach could be used to refine the existing DSM categories, and to inform treatment strategies. Hayes et al. (1996) describe a classification system based on functional analysis as a fundamentally different alternative to the current syndromal classification system. They proposed that a classification system could be based on dimensions derived from the combined results of multiple functional analyses tied to the same dimension. Each dimension would then be associated with specific assessment methods and therapy recommendations. The authors describe one such functional dimension, "experiential avoidance," and illustrate its utility across disorders such as substance abuse, obsessive-compulsive disorder, panic disorder, and borderline personality disorder (Hayes et al., 1996). This model provides an alternative to the current DSM syndromal model, using the methodological approach of functional analysis. Scotti et al. (1996) proposed a similar system of categorization of childhood and adolescent disorders utilizing functional analysis. Hayes et al. (1996), Scotti et al. (1996), and Wulfert et al. (1996) have all successfully illustrated that alternatives or improvements to the current DSM system are possible.

However, functional analysis is a distinctly behavioral approach which assumes that a learned stimulus–response connection is an important element in the development or maintenance of psychopathology. Other authors, for example, proponents of the genetic or biological explanations of psychopathology described above, might strongly oppose a classification system based purely on the methodology and tenets of a behavioral approach. Others disagree with the notion of any unified system of classification of psychopathology, arguing that no one diagnostic system will be equally useful for all of the classes of disorders now included in the DSM (e.g., what works for Axis I may not apply to the personality or childhood disorders). Several of these authors have taken a more radical stance, asserting the need for separate diagnostic systems for different classes of disorder (Kazdin & Kagan, 1994; Koerner, Kohlenberg, & Parker, 1996).

3.08.6 CONCLUSIONS

The assessment and classification of psychopathology is relatively new and controversies abound. However, the heated debates regarding issues such as comorbidity, types of taxonomies, and alternative approaches are indicators of the strong interest in this area. Comparisons between the classification of psychopathology and taxonomies in the basic sciences and medicine can be informative. However, the classification of psychopathology is a difficult task, and the methods used in other fields are not always applicable. It is likely that the systems of classification, informed by the continuing debates and research on the topic, will continue to evolve at a rapid pace. As Clark et al. (1995) remarked, the science of classification has inspired research in new directions and helped to guide future developments of psychopathology.

ACKNOWLEDGMENTS

We would like to thank Amy Przeworski and Melinda Freshman for their help in editing this chapter.

3.08.7 REFERENCES

Abikoff, H., & Klein, R. G. (1992). Attention-deficit hyperactivity and conduct disorder: Comorbidity and implications for treatment. Journal of Consulting and Clinical Psychology, 60(6), 881–892.
American Psychiatric Association (1933). Notes and comment: Revised classification of mental disorders. American Journal of Psychiatry, 90, 1369–1376.
American Psychiatric Association (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Andreasen, N. C., & Carpenter, W. T. (1993). Diagnosis and classification of schizophrenia. Schizophrenia Bulletin, 19(2), 199–214.
Andreasen, N. C., Flaum, M., Swayze, V. W., Tyrrell, G., & Arndt, S. (1990). Positive and negative symptoms in schizophrenia: A critical reappraisal. Archives of General Psychiatry, 47, 615–621.
Angrist, B., Rotrosen, J., & Gershon, S. (1980). Differential effects of amphetamine and neuroleptics on negative vs. positive symptoms in schizophrenia. Psychopharmacology, 72, 17–19.
August, G. J., & Garfinkel, B. D. (1990). Comorbidity of ADHD and reading disability among clinic-referred children. Journal of Abnormal Child Psychology, 18, 29–45.
Baumann, U. (1995). Assessment and documentation of psychopathology. Psychopathology, 28(Suppl. 1), 13–20.
Beck, A. T., Epstein, N., Brown, G., & Steer, R. A. (1988). An inventory for measuring anxiety: Psychometric properties. Journal of Consulting and Clinical Psychology, 56(6), 893–897.
Beck, A. T., & Steer, R. A. (1987). Beck depression inventory manual. San Antonio, TX: The Psychological Corporation.
Bellack, A. S., & Hersen, M. (Eds.) (1998). Behavioral assessment: A practical handbook. Needham Heights, MA: Allyn & Bacon.
Bergner, R. M. (1997). What is psychopathology? And so what? Clinical Psychology: Science and Practice, 4, 235–248.
Berrios, G. E., & Hauser, R. (1988). The early development of Kraepelin's ideas on classification: A conceptual history. Psychological Medicine, 18, 813–821.
Biederman, J., Newcorn, J., & Sprich, S. (1991). Comorbidity of attention deficit hyperactivity disorder with conduct, depressive, anxiety, and other disorders. American Journal of Psychiatry, 148(5), 564–577.
Blacker, D., & Tsuang, M. T. (1992). Contested boundaries of bipolar disorder and the limits of categorical diagnosis in psychiatry. American Journal of Psychiatry, 149(11), 1473–1483.
Blashfield, R. K. (1990). Comorbidity and classification. In J. D. Maser & C. R. Cloninger (Eds.), Comorbidity of mood and anxiety disorders (pp. 61–82). Washington, DC: American Psychiatric Press.
Blashfield, R. K. (1991). Models of psychiatric classification. In M. Hersen & S. M. Turner (Eds.), Adult psychopathology and diagnosis (pp. 3–22). New York: Wiley.
Blashfield, R. K., & Draguns, J. G. (1976). Toward a taxonomy of psychopathology: The purpose of psychiatric classification. British Journal of Psychiatry, 129, 574–583.
Blashfield, R. K., & Livesley, W. J. (1991). Metaphorical analysis of psychiatric classification as a psychological test. Journal of Abnormal Psychology, 100, 262–270.
Boyd, J. H., Burke, J. D., Gruenberg, E., Holzer, C. E., Rae, D. S., George, L. K., Karno, M., Stoltzman, R., McEvoy, L., & Nestadt, G. (1984). Exclusion criteria of DSM-III. Archives of General Psychiatry, 41, 983–989.
Brett, E. A. (1993). Classifications of posttraumatic stress disorder in DSM-IV: Anxiety disorder, dissociative disorder, or stress disorder? In J. R. T. Davidson & E. B. Foa (Eds.), Posttraumatic stress disorder: DSM-IV and beyond (pp. 191–204). Washington, DC: American Psychiatric Press.
Broverman, I. K., Broverman, D. M., Clarkson, F. E., Rosenkrantz, P. S., & Vogel, S. R. (1970). Sex-role stereotypes and clinical judgments of mental health. Journal of Consulting and Clinical Psychology, 34(1), 1–7.
Brown, T. A., & Barlow, D. H. (1992). Comorbidity among anxiety disorders: Implications for treatment and DSM-IV. Journal of Consulting and Clinical Psychology, 60(6), 835–844.
Buchanan, R. W., & Carpenter, W. T. (1994). Domains of psychopathology: An approach to the reduction of heterogeneity in schizophrenia. The Journal of Nervous and Mental Disease, 182(4), 193–204.
Butcher, J. N., Dahlstrom, W. G., Graham, J. R., Tellegen, A., & Kaemmer, B. (1989). Minnesota multiphasic personality inventory (MMPI-2). Administration and scoring. Minneapolis, MN: University of Minnesota Press.
Cantor, N., & Genero, N. (1986). Psychiatric diagnosis and natural categorization: A close analogy. In T. Millon & G. L. Klerman (Eds.), Contemporary directions in psychopathology: Toward the DSM-IV (pp. 233–256). New York: Guilford Press.
Carey, G., & Gottesman, I. I. (1978). Reliability and validity in binary ratings: Areas of common misunderstanding in diagnosis and symptom ratings. Archives of General Psychiatry, 35, 1454–1459.
Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric evidence and taxonomic implications. Journal of Abnormal Psychology, 100(3), 316–336.
Clark, L. A., Watson, D., & Reynolds, S. (1995). Diagnosis and classification of psychopathology: Challenges to the current system and future directions. Annual Review of Psychology, 46, 121–153.
Clarkin, J. F., Widiger, T. A., Frances, A. J., Hurt, S. W., & Gilmore, M. (1983). Prototypic typology and the borderline personality disorder. Journal of Abnormal Psychology, 92(3), 263–275.
Clementz, B. A., & Iacono, W. G. (1990). Nosology and diagnosis. In L. Willerman & D. B. Cohen (Eds.), Psychopathology. New York: McGraw-Hill.
Cole, D. A., & Carpentieri, S. (1990). Social status and the comorbidity of child depression and conduct disorder. Journal of Consulting and Clinical Psychology, 58, 748–757.
Compton, W. M., & Guze, S. B. (1995). The neo-Kraepelinian revolution in psychiatric diagnosis. European Archives of Psychiatry & Clinical Neuroscience, 245, 196–201.
Cornblatt, B. A., Lenzenweger, M. F., Dworkin, R. H., & Erlenmeyer-Kimling, L. (1992). Childhood attentional dysfunctions predict social deficits in unaffected adults at risk for schizophrenia. British Journal of Psychiatry, 161, 59–64.
Crow, T. J. (1994). The demise of the Kraepelinian binary system as a prelude to genetic advance. In E. S. Gershon & C. R. Cloninger (Eds.), Genetic approaches to mental disorders (pp. 163–192). Washington, DC: American Psychiatric Press.
Daradkeh, T. K., El-Rufaie, O. E. F., Younis, Y. O., & Ghubash, R. (1997). The diagnostic stability of ICD-10 psychiatric diagnoses in clinical practice. European Psychiatry, 12, 136–139.
Davidson, J. R. T., & Foa, E. B. (1991). Diagnostic issues in posttraumatic stress disorder: Considerations for the DSM-IV. Journal of Abnormal Psychology, 100(3), 346–355.
de Fundia, T. A., Draguns, J. G., & Phillips, L. (1971). Culture and psychiatric symptomatology: A comparison of Argentine and United States patients. Social Psychiatry, 6(1), 11–20.
Dworkin, R. H., & Lenzenweger, M. F. (1984). Symptoms and the genetics of schizophrenia: Implications for diagnosis. American Journal of Psychiatry, 141(12), 1541–1546.
Dworkin, R. H., Lenzenweger, M. F., Moldin, S. O., & Cornblatt, B. A. (1987). Genetics and the phenomenology of schizophrenia. In P. D. Harvey & E. F. Walker (Eds.), Positive and negative symptoms of psychosis: Description, research and future directives (pp. 258–288). Hillsdale, NJ: Erlbaum.
Eysenck, H. J. (1970). A dimensional system of psychodiagnostics. In A. R. Mahrer (Ed.), New approaches to personality classification (pp. 169–207). New York: Columbia University Press.
Feighner, A. C., Baran, I. D., Furman, S., & Shipman, W. M. (1972). Private psychiatry in community mental health. Hospital and Community Psychiatry, 23(7), 212–214.
Feinstein, A. R. (1970). The pre-therapeutic classification of co-morbidity in chronic disease. Journal of Chronic Diseases, 23, 455–468.
First, M. B., Spitzer, R. L., Gibbon, M., & Williams, J. B. W. (1995). Structured clinical interview for DSM-IV Axis I Disorders. Washington, DC: American Psychiatric Press.
Foa, E. B., Riggs, D. S., Dancu, C. V., & Rothbaum, B. O. (1993). Reliability and validity of a brief instrument for assessing post-traumatic stress disorder. Journal of Traumatic Stress, 6, 459–473.
Foa, E. B., & Rothbaum, B. (1998). Treating the trauma of rape. New York: Guilford Press.
Follette, W. C. (1996). Introduction to the special section on the development of theoretically coherent alternatives to the DSM system. Journal of Consulting and Clinical Psychology, 64, 1117–1119.
Follette, W. C., & Houts, A. C. (1996). Models of scientific progress and the role of theory in taxonomy development: A case study of the DSM. Journal of Consulting and Clinical Psychology, 64, 1120–1132.
Frances, A. J. (1998). Problems in defining clinical significance in epidemiological studies. Archives of General Psychiatry, 55, 119.
Frances, A. J., First, M. B., & Pincus, H. A. (1995). DSM-IV guidebook. Washington, DC: American Psychiatric Press.
Frances, A. J., First, M. B., Widiger, T. A., Miele, G. M., Tilly, S. M., Davis, W. W., & Pincus, H. A. (1991). An A to Z guide to DSM-IV conundrums. Journal of Abnormal Psychology, 100(3), 407–412.
Frances, A. J., Widiger, T. A., & Fyer, M. R. (1990). The influence of classification methods on comorbidity. In J. D. Maser & C. R. Cloninger (Eds.), Comorbidity of mood and anxiety disorders (pp. 41–59). Washington, DC: American Psychiatric Press.
Fulop, G., Strain, J., Vita, J., Lyons, J. S., & Hammer, J. S. (1987). Impact of psychiatric comorbidity on length of hospital stay for medical/surgical patients: A preliminary report. American Journal of Psychiatry, 144, 878–882.
Fyer, A. J., Mannuzza, S., Chapman, T. F., Lipsitz, J., Martin, L. Y., & Klein, D. F. (1996). Panic disorder and social phobia: Effects of comorbidity on familial transmission. Anxiety, 2, 173–178.
Goldenberg, I. M., White, K., Yonkers, K., Reich, J., Warshaw, M. G., Goisman, R. M., & Keller, M. B. (1996). The infrequency of "Pure Culture" diagnoses among the anxiety disorders. Journal of Clinical Psychiatry, 57(11), 528–533.
Goodman, W. K., Price, L. H., Rasmussen, S. A., & Mazure, C. (1989a). The Yale-Brown obsessive compulsive scale: I. Development, use and reliability. Archives of General Psychiatry, 46, 1006–1011.
Goodman, W. K., Price, L. H., Rasmussen, S. A., & Mazure, C. (1989b). The Yale-Brown obsessive compulsive scale: II. Validity. Archives of General Psychiatry, 46, 1012–1016.
Gorenstein, E. E. (1992). The science of mental illness. New York: Academic Press.
Gove, W. R. (1980). Mental illness and psychiatric treatment among women. Psychology of Women Quarterly, 4, 345–362.
Grove, W. M., Lebow, B. S., Clementz, B. A., Cerri, A., Medus, C., & Iacono, W. G. (1991). Familial prevalence and coaggregation of schizotypy indicators: A multitrait family study. Journal of Abnormal Psychology, 100, 115–121.
Haghighat, R. (1994). Cultural sensitivity: ICD-10 versus DSM-III-R. International Journal of Social Psychiatry, 40, 189–193.
Hamilton, S., Rothbart, M., & Dawes, R. (1986). Sex bias, diagnosis and DSM-III. Sex Roles, 15, 269–274.
Harrison, W. M., Cooper, T. B., Stewart, J. W., Quitkin, F. M., McGrath, P. J., Liebowitz, M. R., Rabkin, J. R., Markowitz, J. S., & Klein, D. F. (1984). The tyramine challenge test as a marker for melancholia. Archives of General Psychiatry, 41(7), 681–685.
Hayes, S. C., Wilson, K. G., Gifford, E. V., Follette, V. M., & Strosahl, K. D. (1996). Experiential avoidance and behavioral disorders: A functional dimensional approach to diagnosis and treatment. Journal of Consulting & Clinical Psychology, 64(6), 1152–1168.
Helzer, J. E., & Robins, L. N. (1988). The Diagnostic Interview Schedule: Its development, evolution, and use. Social Psychiatry and Psychiatric Epidemiology, 23(1), 6–16.
Herbert, J. D., Hope, D. A., & Bellack, A. S. (1992). Validity of the distinction between generalized social phobia and avoidant personality disorder. Journal of Abnormal Psychology, 101(2), 332–339.
Hollander, E. (1993). Obsessive-compulsive spectrum disorders: An overview. Psychiatric Annals, 23(7), 355–358.
Horowitz, L. M., Post, D. L., French, R. S., Wallis, K. D., & Siegelman, E. Y. (1981). The prototype as a construct in abnormal psychology: II. Clarifying disagreement in psychiatric judgments. Journal of Abnormal Psychology, 90(6), 575–585.
Hudson, J. I., & Pope, H. G. (1990). Affective spectrum disorder: Does antidepressant response identify a family of disorders with a common pathophysiology? American Journal of Psychiatry, 147(5), 552–564.
Jacoby, R. (1996). Problems in the use of ICD-10 and DSM-IV in the psychiatry of old age. In N. S. Costas & H. Hanns (Eds.), Neuropsychiatry in old age: An update. Psychiatry in progress series (pp. 87–88). Göttingen, Germany: Hogrefe & Huber.
Janca, A., Kastrup, M. C., Katschnig, H., & Lopez-Ibor, J. J., Jr. (1996). The ICD-10 multiaxial system for use in adult psychiatry: Structure and applications. Journal of Nervous and Mental Disease, 184, 191–192.
Johnstone, E. C., Crow, T. J., Frith, C. D., Stevens, M., Kreel, L., & Husband, J. (1978). The dementia of dementia praecox. Acta Psychiatrica Scandinavica, 57, 305–324.
Kahn, E. (1959). The Emil Kraepelin memorial lecture. In D. Pasamanick (Ed.), Epidemiology of mental disorders. Washington, DC: American Association for the Advancement of Science. (As cited in M. Hersen & S. M. Turner (Eds.), Adult psychopathology and diagnosis (p. 20). New York: Wiley).
Kanfer, F. H., & Saslow, G. (1965). Behavioral analysis. Archives of General Psychiatry, 12, 529–538.
Katon, W., Lin, E., Von Korff, M., Russo, J., Lipscomb, P., & Bush, T. (1991). Somatization: A spectrum of severity. American Journal of Psychiatry, 148(1), 34–40.
Kay, S. R. (1990). Significance of the positive-negative distinction in schizophrenia. Schizophrenia Bulletin, 16(4), 635–652.
Kazdin, A. E., & Kagan, J. (1994). Models of dysfunction in developmental psychopathology. Clinical Psychology-Science & Practice, 1(1), 35–52.
Kendell, R. E. (1975). The concept of disease and its implications for psychiatry. British Journal of Psychiatry, 127, 305–315.
Kendell, R. E. (1982). The choice of diagnostic criteria for biological research. Archives of General Psychiatry, 39, 1334–1339.
Kendler, K. S. (1990). Toward a scientific psychiatric nosology. Archives of General Psychiatry, 47, 969–973.
Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M., Eshelman, S., Wittchen, H. U., & Kendler, K. S. (1994). Lifetime and 12-month prevalence of DSM-III-R psychiatric disorders in the United States. Archives of General Psychiatry, 51, 8–19.
Klein, D. F. (1989). The pharmacological validation of psychiatric diagnosis. In L. N. Robins & J. E. Barrett (Eds.), The validity of psychiatric diagnosis (pp. 203–214). New York: Raven Press.
Klein, D. F. (1993). False suffocation alarms, spontaneous panics, and related conditions: An integrative hypothesis. Archives of General Psychiatry, 50, 306–317.
Klein, D. F. (1994). Testing the suffocation false alarm theory of panic disorder. Anxiety, 1, 1–7.
Klerman, G. L. (1978). The evolution of a scientific nosology. In J. C. Shershow (Ed.), Schizophrenia: Science and practice (pp. 99–121). Cambridge, MA: Harvard University Press.
Koerner, K., Kohlenberg, R. J., & Parker, C. R. (1996). Diagnosis of personality disorder: A radical behavioral alternative. Journal of Consulting & Clinical Psychology, 64(6), 1169–1176.
Kraepelin, E. (1971). Dementia praecox and paraphrenia (R. M. Barclay, Trans.). Huntington, NY: Krieger. (Original work published in 1919).
Lewinsohn, P. M., Rohde, P., Seeley, J. R., & Hops, H. (1991). Comorbidity of unipolar depression: I. Major depression with dysthymia. Journal of Abnormal Psychology, 100(2), 205–213.
Lilienfeld, S. O., Waldman, I. D., & Israel, A. C. (1994). A critical examination of the use of the term and concept of comorbidity in psychopathology research. Clinical Psychology-Science & Practice, 1(1), 71–83.
Livesley, W. J. (1985a). The classification of personality disorder: I. The choice of category concept. Canadian Journal of Psychiatry, 30(5), 353–358.
Livesley, W. J. (1985b). The classification of personality disorder: II. The problem of diagnostic criteria. Canadian Journal of Psychiatry, 30(5), 359–362.
Lykken, D. T. (1984). Psychopathic personality. In R. I. Corsini (Ed.), Encyclopedia of psychology (pp. 165–167). New York: Wiley. (As cited in B. A. Clementz & W. G. Iacono (1990). In L. Willerman & D. B. Cohen (Eds.), Psychopathology. New York: McGraw-Hill.)
Macklin, M. L., Metzger, L. J., Litz, B. T., McNally, R. J., Lasko, N. B., Orr, S. P., & Pitman, R. K. (1998). Lower precombat intelligence is a risk factor for posttraumatic stress disorder. Journal of Consulting and Clinical Psychology, 66, 323–326.
Martinez, J. M., Papp, L. A., Coplan, J. D., Anderson, D. E., Mueller, C. M., Klein, D. F., & Gorman, J. M. (1996). Ambulatory monitoring of respiration in anxiety. Anxiety, 2, 296–302.
Mayr, E. (1981). Biological classification: Toward a synthesis of opposing methodologies. Science, 214(30), 510–516.
McNair, D. M., Lorr, M., & Droppleman, L. F. (1981). POMS manual (2nd ed.). San Diego: Educational and Industrial Testing Service.
McNally, R., & Shin, L. M. (1995). Association of intelligence with severity of posttraumatic stress disorder symptoms in Vietnam combat veterans. American Journal of Psychiatry, 152(6), 936–938.
McReynolds, P. (1989). Diagnosis and clinical assessment: Current status and major issues. Annual Review of Psychology, 40, 83–108.
Meehl, P. E. (1986). Diagnostic taxa as open concepts: Metatheoretical and statistical questions about reliability and construct validity in the grand strategy of nosological revision. In T. Millon & G. L. Klerman (Eds.), Contemporary directions in psychopathology: Toward the DSM-IV (pp. 215–231). New York: Guilford Press.
Menninger, K. (1963). The vital balance: The life process in mental health and illness. New York: Viking Press.
Mezzich, J. E., Fabrega, H., Coffman, G. A., & Haley, R. (1989). DSM-III disorders in a large sample of psychiatric patients: Frequency and specificity of diagnoses. American Journal of Psychiatry, 146(2), 212–219.
Michels, R., Siebel, U., Freyberger, H. J., Stieglitz, R. D., Schaub, R. T., & Dilling, H. (1996). The multiaxial system of ICD-10: Evaluation of a preliminary draft in a multicentric field trial. Psychopathology, 29, 347–356.
Millon, T. (1991). Classification in psychopathology: Rationale, alternatives and standards. Journal of Abnormal Psychology, 100, 245–261.
Moras, K., DiNardo, P. A., Brown, T. A., & Barlow, D. H. (1991). Comorbidity and depression among the DSM-III-R anxiety disorders. Manuscript as cited in Brown, T.

A. & Barlow, D. H. (1992). Comorbidity among anxiety Barbor, T. F., Burke, J. D., Farmer, A., Jablenski, A., disorders: Implications for treatment and DSM-IV. Pickens, R., Reiger, D. A., Sartorius, N., & Towle, L. H. Journal of Consulting and Clinical Psychology, 60(6), (1988). The composite international Diagnostic Inter- 835±844. view: An epidemiological instrument suitable for use in Morey, L. C. (1988). Personality disorders in DSM-III and conjunction with different diagnostic systems and DSM-III-R: Convergence, coverage, and internal con- different cultures. Archives of General Psychiary, 45, sistency. American Journal of Psychiatry, 145(5), 1069±1077. 573±577. Rosch, E. H. (1978). Principles of categorization. In E. H. Morey, L. C. (1991). Classification of mental disorder As a Rosch & B. B. Lloyd (Eds.), Cognition and categorization collection of hypothetical constructs. Journal of Abnor- (pp. 27±48). Hillsdale, NJ: Erlbaum. mal Psychology, 100(3), 289±293. Russell, J. A. (1980). A circumplex model of affect. Journal Morey, L. C. (1992). Personality disorder NOS: Specifying of Personality and Social Psychology, 39(6), 1161±1178. patterns of the otherwise unspecified. Paper presented at Sartorius, N., & Janca, A. (1996). Psychiatric assessment the 100th Annual Convention of the American Psycho- instruments developed by the World Health Organiza- logical Association. Washington, DC: USA. tion. Social Psychiatry and Psychiatric Epidemiological, Mowrer, O. H. (1960). ªSin,º the lesser of two evils. 32, 55±69. American Psychologist, 15, 301±304. Scotti, J. R., Morris, T. L., McNeil, C. B., & Hawkins, R. Oldham, J. M., Skodal, A. E., Kellman, H. D., Hyler, S. E., P. (1996). DSM-IV and disorders of childhood and Rosnick, L., & Davies, M. (1992). Diagnosis of DSM-II- adolescence: Can structural criteria be functional? R personality disorders by two structured interviews: Journal of Consulting & Clinical Psychology, 64(6), Patterns of comorbidity. 
American Journal of Psychiatry, 1177±1191. 149, 213±220. Shea, M. T., Widiger, T. A., & Klein, M. H. (1992). Papp, L. A., Klein, D. F., Martinez, J., Schneier, F., Cole, Comorbidity of personality disorders and depression: R., Liebowitz, M. R., Hollander, E., Fyer, A. J., Jordan, Implications for treatment. Journal of Consulting and F., & Gorman, J. M. (1993). Diagnostic and substance Clinical Psychology, 60, 857±868. specificity of carbon dioxide-induced panic. American Spiegel, D., & Cardena, E. (1991). Disintegrated experi- Journal of Psychiatry, 150, 250±257. ence: The dissociative disorders revisited. Journal of Patel, V., & Winston, M. (1994). ªUniversality of mental Abnormal Psychology, 100(3), 366±378. illnessº revisited: Assumptions, artefacts and new direc- Spielberger, C. D., Gorsuch, R. R., & Lushene, R. E. tions. British Journal of Psychiatry, 165, 437±440. (1970). State-trait anxiety Inventory: Test manual for Pine, D. S., Weese-Mayer, D. E., Silvestri, J. M., Davies, form X. Palo Alto, CA: Consulting Psychologists Press. M., Whitaker, A., & Klein, D. F. (1994). Anxiety and Spitzer, R. B. (1997). Brief comments for a psychiatric congenital central hypoventilation syndrome. American nosologist weary from his own attempts to define mental Journal of Psychiatry, 151, 864±870. disorder: Why Ossorio's definition muddles and Wake- Pogue-Geile, M. F., & Harrow, M. (1984). Negative and field's ªHarmful Dysfunctionº illuminates the issues. positive symptoms in schizophrenia and depression: A Clinical Psychology: Science and Practice, 4, 259±261. follow-up. Schizophrenia Bulletin, 10(3), 371±387. Spitzer, R. B. (1998). Diagnosis and need for treatment are Quay, H. C., Routh, D. K., & Shapiro, S. K. (1987). not the same. Archieves of General Psychiatry, 55, 120. Psychopathology of childhood: From description to Spitzer, R. B., Endicott, J., & Robins, E. (1978). Research validation. Annual Review of Psychology, 38, 491±532. 
diagnostic criteria: Rationale and reliability. Archives of Quitkin, F. M., Stewart, J. W., McGrath, P. J., Tricamo, General Psychiatry, 35(6), 773±782. E., Rabkin, J. G., Ocepek-Welikson, K., Nunes, E., Stein, D. J., Hollander, E., & Klein, D. F. (1994). Harrison, W., & Klein, D. F. (1993). Columbia atypical Biological markers of depression and anxiety. Medico- depression: A sub-group of depressives with better graphia, 16(1), 18±21. response to MAOI than to tricyclic antidepressants or Stevens, J. R. (1982). Neuropathology of schizophrenia. placebo. British Journal of Psychiatry, 163, 30±34. Archives of General Psychiatry, 39, 1131±1139. Rapee, R. M., Sanderson, W. C., & Barlow, D. H. (1988). Szasz, T. (1960). The myth of mental illness. American Social phobia features across the DSM-III-R anxiety Psychologist, 15, 113±118. disorders. Journal of Psychopathology and Behavioral Tattersall, I. (1995). The fossil trail: How we know what we Assessment, 10(3), 287±299. think we know about human evolution. New York: Regier, D. A., Kaelber, C. T., Rae, D. S., Farmer, M. E., Oxford University Press. (As cited in Follette, W. C., & Knauper, B., Kessler, R. C., & Norquist, G. S. (1998) Houts, A. C. (1996). Models of scientific progress and Limitations of diagnostic criteria and assessment instru- the role of theory in taxonomy development: A case ments for mental disorders. Implications for research study of the DSM. Journal of Consulting and Clinical and policy. Archives of General Psychiatry, 55, 109±115. Psychology, 64, 1120±1132.) Regier, D. A., Kaelber, C. T., Roper, M. T., Rae, D. S. & Taylor, M. A. (1992). Are schizophrenia and affective Sartorius, N. (1994). The ICD-10 clinical field trial for disorder related? A selective literature review. American mental and behavioral disorders: Results in Canada and Journal of Psychiatry, 149, 22±32. the United States. American Journal of Psychiatry, 151, Taylor, M. A., & Amir, N. (1994). Are schizophrenia and 1340±1350. 
affective disorder related?: The problem of schizoaffec- Riskind, J. H., Beck, A. T., Brown, G., & Steer, R. A. tive disorder and the discrimination of the psychoses by (1987). Taking the measure of anxiety and depression. signs and symptoms. Comprehensive Psychiatry, 35(6), Validity of the reconstructed Hamilton scale. Journal of 420±429. Nervous and Mental Disease, 175(8), 474±479. Tellegen, A. (1985). Structures of mood and personality Robins, E., & Guze, S. B. (1970). Establishment of and their relevance to assessing anxiety, with an diagnostic validity in psychiatric illness: Its application emphasis on self-report. In A. H. Tuma & J. D. Maser to schizophrenia. American Journal of Psychiatry, 126, (Eds.), Anxiety and the anxiety disorders (pp. 681±706). 983±987. Hillsdale, NJ: Erlbaum. Robins, L. N. (1994). How recognizing ªcomorbiditiesº in Uestuen, T. B., Bertelsen, A., Dilling, H., & van psychopathology may lead to an improved research Drimmelen, J. (1996). ICD-10 casebook: The many faces nosology. Clinical Psychology: Science and Practice, 1, of mental disordersÐAdult case histories according to 93±95. ICD-10. Washington, DC: American Psychiatric Press. Robins, L. N., Wing, J., Wittchen, H. J., Helzer, J. E., Ullman, L. P., & Krasner, L. (1975). A psychological 188 Assessment of Psychopathology: Nosology and Etiology

approach to abnormal behavior. Englewood Cliffs, NJ: Widiger, T. A., & Trull, T. J. (1993). The Scholarly Prentice-Hall. development of DSM-IV. In J. A. Costa, E. Silva, & C. Wakefield, J. C. (1992). Disorder as harmful dysfunction: C. Nadelson (Eds.), International Review of Psychiatry A conceptual critique of DSM-III-R's definition of (pp. 59±78). Washington DC, American Psychiatric mental disorder. Psychological Review, 99, 232±247. Press. Wakefield, J. C. (1997a). Diagnosing DSM-IV, Part I: Widiger, T. A., Trull, T. J., Hurt, S. W., Clarkin, J., & DSM-IV and the concept of disorder. Behavioral Frances, A. (1987). A multidimensional scaling of the Research Therapy, 35(7), 633±649. DSM-III personality disorders. Archives of General Wakefield, J. C. (1997b). Diagnosing DSM-IV, Part II: Psychiatry, 44, 557±563. Eysenck (1986) and the essential fallacy. Behavioral Wiggins, J. S. (1982). Circumplex models of interpersonal Research Therapy, 35(7), 651±665. behavior in clinical psychology. In P. C. Kendall & J. N. Wakefied, J. C. (1997c). Normal inability versus patholo- Butcher (Eds.), Handbook of research methods in clinical gical disability: Why Ossorio's definition of mental psychology (pp. 183±221). New York: Wiley. disorder is not sufficient. Clinical Psychology: Science Wilson, G. T. (1996). Empirically validated treatments: and Practice, 4, 249±258. Realities and resistance. Clinical Psychology-Science & Watson, D., & Clark, L. A. (1992). Affects separable and Practice, 3(3), 241±244. inseparable: On the hierarchical arrangement of the World Health Organization (1992). International statistical negative affects. Journal of Personality and Social classification disease, 10th revision (ICD-10). Geneva, Psychology, 62(3), 489±505. Switzerland: Author. Watson, D., Clark, L. A., & Tellegen, A. (1988). Develop- World Health Organization (1996). 
Multiaxial classifica- ment and validation of brief measures of positive and tion of child and adolescent psychiatric disorders: The negative affect: The PANAS scales. Journal of Person- ICD-10 classification of mental and behavioral disorders in ality and Social Psychology, 54(6), 1063±1070. children and adolescents. Cambridge, UK: Cambridge Watson, J. D., & Crick, F. H. C. (1953a). General University Press. implications of the structure of deoxyribose nucleic acid. Wulfert, E., Greenway, D. E., & Dougher, M. J. (1996). A Nature, 171, 964±967. (As cited in Clementz B. A., & logical functional analysis of reinforcement-based dis- Iacono, W. G. (1990). In L. Willerman & D. B. Cohen orders: Alcoholism and pedophilia. Journal of Consulting (Eds.), Psychopathology. New York: McGraw-Hill.) and Clinical Psychology, 64(6), 1140±1151. Watson, J. D., & Crick, F. H. C. (1953b). A structure for Zimmerman, M., Pfohl, B., Coryell, W. H., Corenthal, C., deoxyribose nucleic acid. Nature, 171, 737±738. (As cited & Stangl, D. (1991). Major depression and personality in Clementz, B. A., & lacono, W. G. (1990). In L. disorder. Journal of Affective Disorders, 22, 199±210. Willerman & D. B. Cohen (Eds.), Psychopathology. New Zinbarg, R. E., & Barlow, D. D. (1991). Mixed anxiety- York: McGraw-Hill. depression: A new diagnostic category? In R. M. Rapee Widiger, T. A. (1992). Categorical versus dimensional & D. H. Barlow (Eds.), Chronic anxiety: Generalized classification: Implications from and for research. anxiety disorder and mixed anxiety-depression Journal of Personality Disorders, 6(4), 287±300. (pp. 136±152). New York: Guilford Press. Widiger, T. A., & Frances, A. (1985). The DSM-III Zinbarg, R. E., Barlow, D. H., Liebowitz, M., Street, L., personality disorders: Perspectives form psychology. Broadhead, E., Katon, W., Roy-Byrne, P., Lepine, J. P., Archives of General Psychiatry, 42, 615±623. Teherani, M., Richards, J., Brantley, P., & Kraemer, H. Widiger, T. A., & Rogers, J. H. 
(1989). Prevalence and (1994). The DSM-IV field trial for mixed anxiety- comorbidity of personality disorders. Psychiatric Annals, depression. American Journal of Psychiatry, 151(8), 19(3), 132±136. 1153±1162. Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.09 Intervention Research: Development and Manualization

JOHN F. CLARKIN Cornell University Medical College, New York, NY, USA

3.09.1 INTRODUCTION
3.09.2 AIMS AND INGREDIENTS OF A TREATMENT MANUAL
3.09.3 DIMENSIONS ALONG WHICH THE MANUALS VARY
3.09.3.1 Process of Manual Generation and Elucidation
3.09.3.2 Patient Population
3.09.3.3 Knowledge Base of the Disorder
3.09.3.4 Treatment Strategies and Techniques
3.09.3.5 Treatment Duration
3.09.3.6 Treatment Format
3.09.3.7 Level of Abstraction
3.09.4 RANGE OF TREATMENT MANUALS
3.09.5 REPRESENTATIVE TREATMENT MANUALS
3.09.5.1 IPT: An Early Manual
3.09.5.2 Cognitive-behavioral Treatment of BPD
3.09.5.3 Psychosocial Treatment For Bipolar Disorder
3.09.5.4 Other Treatment Manuals
3.09.6 ADVANTAGES OF TREATMENT MANUALS
3.09.7 POTENTIAL DISADVANTAGES AND LIMITATIONS OF TREATMENT MANUALS
3.09.8 USE AND ROLE OF TREATMENT MANUALS
3.09.8.1 Efficacy to Effectiveness
3.09.9 SUMMARY
3.09.10 REFERENCES

3.09.1 INTRODUCTION

Psychotherapy research can progress only if the treatments that are investigated can be replicated by each of the therapists within the study, and the therapy can be described to the consumers of the research results. Standards of psychotherapy research have reached a point where one cannot communicate about results without a written manual describing the treatment and methods for demonstrating therapists' adherence and competence in the delivery of the treatment. In the recent history of psychotherapy research, the manual has played a vital role in the specification of clinical trials. In addition, if a science and not just an art of psychotherapy is to flourish, there must be methods to teach clinicians how to perform the various therapies with adherence to the specific treatment strategies and techniques, and with competence in this complex interpersonal process. It is in this context that the written description of psychotherapies for specific patient populations in the form of manuals has become an essential fixture in the psychotherapy world.

There has been much lament that clinical research has little if any impact on clinical practice (Talley, Strupp, & Butler, 1994). Researchers complain that their findings are ignored. Clinicians argue that research on rarefied samples yields findings that are detailed, tedious, and irrelevant to their heterogeneous patients. It may be that psychotherapy treatment manuals will provide a bridge between clinical research and clinical practice. The treatment manual is an accessible yield of clinical research that the clinician may find helpful, if not necessary, as health delivery systems change.

This chapter draws upon various sources to describe the contents of the psychotherapy manual, provides some examples of the therapy manual movement, and discusses the advantages and disadvantages of this technology both for research and for the instruction of therapists. It is interesting that, although medication management is a social process between patient and physician, with many similarities to psychotherapy, there has been little development of manuals for medication management. The informative exception is the medication management manual used in the multisite treatment study of depression (Elkin et al., 1989). Despite the fact that medication adherence is often variable and generally poor, and that there is great variation in physician behavior in dispensing medications, this area has attracted little attention, apparently because researchers and clinicians assume that providing medication is a simple and standardized behavior. We suspect this assumption is false, and that more sophisticated medication and medication-psychotherapy studies in the future will have manuals with validity checks on both psychotherapy and medication management.

3.09.2 AIMS AND INGREDIENTS OF A TREATMENT MANUAL

The concept of a treatment manual has quickly evolved in the research and clinical communities. The treatment manual was acclaimed (Luborsky & DeRubeis, 1984) as a quantum leap forward which introduced a new era in clinical research. Since that time numerous treatment manuals have been developed. There are even special series geared around treatment manuals, at times with accompanying materials for patients as well. In addition, the intellectual community is moving forward in defining what should be included in a treatment manual to meet standards for treatment development.

An example of the development of standards for treatment manuals is the work of the National Institute on Drug Abuse panel (NIDA; Moras, 1995). The experts on this panel suggested that the aims of a treatment manual are to:
(i) make the therapy reproducible by therapists other than the immediate research group;
(ii) include information on those issues the investigator thinks are important determinants of therapy response; and
(iii) delineate the knowledge base and skills needed by a therapist to learn from the manual.

Given these aims, the group suggested the following content for the manual:
(i) theory and principles of the treatment (theory underpinning the therapy, decision rules and principles to guide the therapists, information on the therapy that therapists should have to help them adopt the requisite attitude about it, i.e., belief in the treatment);
(ii) elements of the treatment (therapist's primary interpersonal stance in relation to the patient, common and unique elements of therapy);
(iii) intervention strategies for handling problems commonly encountered in delivering the therapy (miscellaneous interventions, specification of interventions not to be used in the therapy);
(iv) companion videotapes for teaching the treatment; and
(v) criteria and method of assessing therapist competence in the treatment.

In essence, these are the criteria by which one can judge the adequacy of any treatment manual.

These criteria are used to assess existing treatment manuals, some of which are used as illustrations in this chapter. As noted previously (Kazdin, 1991), it seems clear that the manuals are often incomplete, as they do not reflect the complexity of treatment and the full scope of the exchanges between therapist and patient. However, the development of treatment manuals has helped and continues to help the field move forward.

It should be noted that there is a difference between a treatment manual and a published, marketed book that has many but not all of the attributes of a practical, ever-changing treatment manual. In an attempt to make a book of publishable size, things that might be included in the working manual, such as clinician work sheets or rating scales, might be eliminated from the book. A notable exception is the work by Linehan, which includes both a book (Linehan, 1993a) and an accompanying workbook (Linehan, 1993b) with therapist work sheets and other material most useful in practical delivery of the treatment. The treatment manual of the near future will probably be both a written description of the treatment in book form and an accompanying CD-ROM that provides video demonstration of the treatment (using therapists and actor/patients).

3.09.3 DIMENSIONS ALONG WHICH THE MANUALS VARY

The existing treatment manuals can be compared not only against the content specified by the NIDA experts, but also along several aspects of the treatment, including: the process by which the authors generated and elucidated their treatment; the patient population for which the treatment is intended; the knowledge base of the disorder being treated; treatment strategies and techniques, duration, and format; and the level of abstraction at which the manual is written.

3.09.3.1 Process of Manual Generation and Elucidation

What was the process of manual generation by the authors? Was the manual generated from extensive experience with a patient population, followed by a codification of the techniques that seem to work on a clinical basis? Conversely, was the manual generated from a theoretical understanding of the patient population followed by treatment contact? Was it generated from a specific treatment orientation, for example, cognitive-behavioral, that is then applied to a specific patient population? Is the manual generated by clinical researchers or by clinicians? Was the manual generated during extensive clinical research, profiting from that research, or without research experience? Does the manual have accompanying aids, such as audiotapes and work sheets?

3.09.3.2 Patient Population

Therapy manuals differ in the extent to which they specify for which patients the manual is intended and for which patients the manual has not been attempted, or, indeed, for whom the treatment does not work. Since psychotherapy research funding is focused on DSM-diagnosed patient groups, many of the manuals describe treatments for patient groups defined by diagnosis. This fact has both positive and negative aspects. Some diagnoses are close to phenomenological descriptions of behavioral problems, so the diagnoses are immediate descriptions of problems to be solved. An example is the treatment of phobic anxiety, agoraphobia, and panic attacks. In contrast, the diagnosis of depression is more distant from specific behaviors, so the manuals describing treatments for depression have implicit theories about the nature of depression, such as the interpersonal treatment or the cognitive treatment. Linehan's manual for the treatment of parasuicidal individuals is an interesting transitional one, from a behavior (repetitive suicidal behavior) to a diagnosis that sometimes includes the behavior (borderline personality disorder). In contrast, there are manuals written not for patients with specific diagnoses but for those with certain problems such as experiencing emotions (Greenberg, Rice, & Elliott, 1993).

An additional way in which a manual is specific to a population is the attention it gives to the details of patient assessment. Ideally, the assessment section would inform the reader of the patients included and excluded, and of the subgroups of patients within the diagnosis/category who are treated by this manual but treated with the differences in mind, depending on such factors as comorbid conditions.

One difficulty that is becoming clearer is how manuals written for a specific disorder apply to patients who have that disorder together with a common comorbid disorder. For example, the interpersonal psychotherapy manual was constructed with ambulatory, nonpsychotic patients with depression in mind, but gives little indication of its application when there are prominent Axis II disorders. Of course, this point also relates to the accumulation of knowledge that occurs all the time, so that a manual published a few years ago may need modification and updating as more data become available.

3.09.3.3 Knowledge Base of the Disorder

One could make a cogent argument that a treatment for a particular disorder cannot (should not) be developed until there is sufficient information on the natural course of the disorder at issue. Only with the background of the natural development of the disorder can one examine the impact of intervention. Furthermore, the age of the patients to whom the treatment is applied is most relevant in the context of the longitudinal pattern of the disorder in question. However, in light of the clinical reality of patients in pain needing intervention, the development of intervention strategies cannot wait for longitudinal investigations of the disorder. Both arguments are sound, but we can distinguish intervention manuals based on the natural history of the disorder from those that do not have such a database upon which to build.

3.09.3.4 Treatment Strategies and Techniques than a manual for therapist and many patients (i.e., family and group treatment). As the Obviously, treatment manuals differ in the number of patients in the treatment room strategies and techniques that are described. increases, the number of patient-supplied inter- This would include manualization of treatments actions increases, as does the potential for that use behavioral, cognitive, cognitive-beha- treatment difficulties and road blocks. vioral, systems, supportive, and psychodynamic techniques. Less obvious is the fact that it may be easier to manualize some strategies/techni- 3.09.3.7 Level of Abstraction ques more than others. In fact, some therapies may be more technique oriented and others The very term ªmanualº conjures up an relatively more interpersonally oriented, and image of the book that comes in the pocket of a this may have implications for the method of new car. It tells how to operate all the gadgets in manualization and the results. the car, and provides a guide as to when to get An issue related to strategies and techniques is the car serviced. Diagrams and pictures are the degree to which the treatment manual provided. It is a ªhow toº book on repair and describes the stance of the therapist with respect maintenance of your new car. Thus, the term to the patient, the induction of the patient into ªtreatment manualº promises to describe how the treatment roles and responsibilities, and the to do the treatment in step-by-step detail. development of the therapeutic relationship or More relevant to psychotherapy manuals are alliance. Often, it is the ambivalent patient who the manuals that inform the readerÐoften with questions the value of the treatment, is prone to graphs and picturesÐhow to sail a boat or play drop out, or to attend therapy with little tennis. 
This is a more apt analogy because the investment and involvement (e.g., does not manual attempts in written (and graphic) form carry out therapy homework) that provides a to communicate how to achieve complex serious challenge to the therapist. The more cognitive and motor skills. Opponents of complete manuals note the typical difficulties treatment manuals will point to golfers who and challenges to full participation in the have read all the manuals, and hack their way treatment that have been observed in experience around the course with less than sterling skill. with the patient population, and offer guidelines (To us, this does not argue against the manual, for the therapist in overcoming these challenges. but speaks of its limitations and how it should be used in the context of other teaching devises.) Ignoring the critics for the moment, this 3.09.3.5 Treatment Duration discussion raises the issue of what level of Most manuals describe a brief treatment of concreteness or abstraction is the manual best some 12±20 sessions. The methods for manualiz- formulated. Some manuals describe the treating longer treatments may present a challenge to ment session by session, and within the the field. A brief treatment can be described in a individual session the flow of the session and manual session by session. Each session can be details about its construction are indicated. anticipated in a sequence and described. This is Obviously, this is easier to do the shorter the especially true if the treatment is cognitive- treatment, and the more the treatment can be behavioral in strategies and techniques, and can predicted from the beginning (i.e., the more the be anticipated as applied to the particular treatment is driven by the therapist and little condition/disorder. In contrast, as the treatment influenced by patient response or spontaneous becomes longer, and as the treatment deviates contribution). 
Probably the best manuals are from a cognitive-behavioral orientation to one those that constantly weave abstract principles that depends more on the productivity and and strategies of the treatment with specific nature of the individual patient (e.g., interper- examples in the form of clinical vignettes that sonal and psychodynamic treatment), the man- provide illustrations of the application of the ual will of necessity become more principle based principles to the individual situation. and reliant on what the patient brings to the situation. 3.09.4 RANGE OF TREATMENT MANUALS 3.09.3.6 Treatment Format An exhaustive list of existing treatment Psychotherapy is delivered in individual, manuals cannot be provided for many reasons, group, marital, and family treatment formats. not the least of which is that the list would be out It would seem to be simpler, if not easier, to of date before this chapter goes to print. The articulate a treatment manual for two partici- American Psychological Association (APA), pants (i.e., therapist and one patient), rather Division 12, Task Force on Psychological Representative Treatment Manuals 193

Interventions lists the so-called efficacious treatments (Chambless et al., 1998) and has also listed treatment manuals that are relevant to the efficacious treatments (Woody & Sanderson, 1998). Some 33 treatment manuals are listed for 13 disorders and/or problem areas including bulimia, chronic headache, pain associated with rheumatic disease, stress, depression, discordant couples, enuresis, generalized anxiety disorder, obsessive-compulsive disorder, panic disorder, social phobia, specific phobia, and children with oppositional behavior. Not all of the manuals listed have been published; some must be requested from the originator. Further, this list is limited to only those treatments that have been judged by the Task Force of Division 12 to meet their criteria for efficaciousness.

3.09.5 REPRESENTATIVE TREATMENT MANUALS

In contrast, a listing of representative treatment manuals is provided here, using as the structure for such sampling the diagnoses in the Diagnostic and statistical manual of mental disorders (4th ed., DSM-IV).

This chapter uses three treatment manuals to illustrate the present state of the art in manualization of psychotherapy: Interpersonal psychotherapy of depression (IPT; Klerman, Weissman, Rounsaville, & Chevron, 1984), the dialectical behavioral treatment (DBT) for borderline personality disorder (BPD) and related self-destructive behaviors (Linehan, 1993a), and the family treatment of bipolar disorder (Miklowitz & Goldstein, 1997). One of these treatments and related treatment manuals (IPT) has achieved recognition in the APA Division 12 listing of efficacious treatments. One (DBT for BPD) is listed as a "probably efficacious treatment." The third (family treatment for bipolar disorder) is not listed at this time. These manuals are also selected because they relate to existing DSM diagnostic categories which are central to research data and to insurance company procedures for reimbursement of treatment. These manuals for the specified disorders are also chosen because they represent three levels of severity of disorder that face the clinician. Ambulatory patients with depression of a nonpsychotic variety are treated with IPT. BPD is a serious disorder, often involving self-destructive and suicidal behavior. Finally, bipolar disorder is a psychiatric condition with biological and genetic causes that can be addressed by both the empirically efficacious medications and psychotherapy.

In addition, these three treatment manuals provide some heterogeneity in terms of treatment format, strategies and level of treatment development. IPT is delivered in an individual treatment format (patient and individual therapist), the treatment format that is the simplest to manualize. DBT involves both an individual and a group treatment format. The family treatment of bipolar disorder involves the treatment of the individual with bipolar disorder and family or marital partner. There is less diversity among these three treatments in terms of treatment strategies and techniques. Two (DBT and family treatment of bipolar) are informed by cognitive-behavioral strategies and techniques, but the latter introduces the intriguing concepts of family dynamics and systems issues. The remaining one (IPT) is interpersonal in focus and orientation. The three manuals vary in terms of treatment duration. IPT is brief, the family treatment of bipolar is intermediate, and DBT is the longest, presenting the most challenge to issues of manualization and therapist adherence across a longer period of time. A comparison of these three manuals gives a sense of development in this area, as the IPT manual was one of the first, and the latter two provide a view of recent manuals and their contents.

3.09.5.1 IPT: An Early Manual

One of the earliest and most influential manuals is Interpersonal psychotherapy of depression by Klerman et al. (1984). This manual was written as a time-limited, outpatient treatment for depressed individuals. The treatment focuses on the current rather than on past interpersonal situations and difficulties. While making no assumption about the origin of the symptoms, the authors connect the onset of depression with current grief, role disputes and/or transitions, and interpersonal deficits.

This brief intervention has three treatment phases which are described clearly in the manual. The first is an evaluation phase during which the patient and therapist review depressive symptoms, give the syndrome of depression a name, and induce the patient into a sick role. (Interestingly, the role of the therapist is not described explicitly.) The patient and therapist discuss and, hopefully, agree upon a treatment focus limited to four possibilities: grief, role disputes, role transitions, or interpersonal deficits. The middle phase of treatment involves work between therapist and patient on the defined area of focus. For example, with current role disputes the therapist explores with the patient the nature of the disputes and options for resolution. The final phase of treatment involves reviewing and consolidating therapeutic gains. A recent addition is the publication of a client workbook (Weissman, 1995) and client assessment forms.

IPT has been used in many clinical trials, the first of which was in 1974 (Klerman, DiMascio, Weissman, Prusoff, & Paykel, 1974). In addition to the IPT manual, a training videotape has been produced and an IPT training program is being developed (Weissman & Markowitz, 1994).

IPT provides a generic framework guiding patient and therapist to discuss current difficulties and this framework has been applied to symptom conditions other than depression. For example, the format has been applied to patients with bipolar disorder (Ehlers, Frank, & Kupfer, 1988), drug abuse (Carroll, Rounsaville, & Gawin, 1991; Rounsaville, Glazer, Wilber, Weissman, & Kleber, 1983), and bulimia (Fairburn et al., 1991). In addition to its initial use in treating depression in ambulatory adult patients, it has now been utilized as an acute treatment, as a continuation and maintenance treatment (Frank et al., 1990), and has been used for geriatric (Reynolds et al., 1992) and adolescent patients (Mufson, Moreau, Weissman, & Klerman, 1993) and in various settings such as primary care and hospitalized elderly.

The success of IPT seems to be the articulation of a rather straightforward approach to discussion, between patient and therapist, of current situations without the use of more complicated procedures such as transference interpretation.

The IPT manual was one of the first in the field, and its straightforward description of a common-sense approach to patients with depression is readily adopted by many clinicians. However, the process of treatment development and amplification is also relevant to this, one of the earliest and best treatment manuals. It is now clear that depression is often only partially responsive to brief treatments, such as IPT, and that many patients relapse. It would appear that maintenance treatment with IPT may be useful (Frank et al., 1991), and the IPT manual must therefore be amplified for this purpose. Furthermore, it has become clear that depressed individuals with personality disorders respond to treatment less thoroughly and more slowly than depressed individuals without personality disorders (Clarkin & Abrams, 1998). This would suggest that IPT may need modification for those with personality disorders, either in terms of how to manage the personality disorder during treatment, or to include treatment of relevant parts of the personality disorder to the depression.

In order to place IPT in perspective, one could compare it to the cognitive therapy of depression in its earliest articulation (Beck, Rush, Shaw, & Emery, 1979) and with more recent additions (Beck, 1995).

3.09.5.2 Cognitive-behavioral Treatment of BPD

Linehan's (1993a, 1993b) cognitive-behavioral treatment of the parasuicidal individual is an example of the application of a specific school of psychotherapy adapted to a specific patient population defined both by a personality disorder diagnosis (BPD) and repetitive self-destructive behavior.

The rationale and data linking the understanding of the patient's pathology to the treatment are thoroughly described. Linehan points out that theories of personality functioning/dysfunctioning are based upon world views and assumptions. The assumption behind DBT is that of dialectics. The world view of dialectics involves notions of inter-relatedness and wholeness, compatible with feminist views of psychopathology, rather than an emphasis on separation, individuation, and independence. A related principle is that of polarity, that is, all propositions contain within them their own oppositions. As related to the borderline pathology, it is assumed that within the borderline dysfunction there is also function and accuracy. Thus, in DBT, it is assumed that each individual, including the borderline client, is capable of wisdom as related to their own life and capable of change. At the level of the relationship between borderline client and DBT therapist, dialectics refers to change by persuasion, a process involving truth not as an absolute but as an evolving, developing phenomenon.

Borderline pathology is conceptualized as a dialectical failure on the part of the client. The thinking of the BPD patient has a tendency to become paralyzed in either the thesis or antithesis, without movement toward a synthesis. There is a related dialectical theory of the development of borderline pathology. BPD is seen primarily as a dysfunction of the emotion regulation system, with contributions to this state of malfunction from both biological irregularities and interaction over time with a dysfunctional environment. In this view, the BPD client is prey to emotional vulnerability, that is, high sensitivity to emotional stimuli, emotional intensity, and a slow return to emotional baseline functioning. The invalidating environment, seen as crucial to the BPD development, is one in which interpersonal communication of personal and private experiences is met by others in the environment with trivialization and/or punishment. In such an environment, the developing individual does not learn to label private experiences, nor does the individual learn emotion regulation.

These assumptions and related data on the developmental histories and cross-sectional behaviors of those with BPD provide the rationale and shape of the treatment. The major tasks of the treatment are, therefore, to teach the client skills so that they can modulate emotional experiences and related mood-dependent behaviors, and to learn to trust and validate their own emotions, thoughts and activities. The relevant skills are described as four in type: skills that increase interpersonal effectiveness in conflict situations, strategies to increase self-regulation of unwanted affects, skills for tolerating emotional distress, and skills adapted from Zen meditation to enhance the ability to experience emotions and avoid emotional inhibition.

The manual provides extensive material on basic treatment strategies (e.g., dialectical strategies), core strategies of emotional validation (e.g., teaching emotion observation and labeling skills), behavioral validation (e.g., identifying, countering and accepting "shoulds"), and cognitive validation (e.g., discriminating facts from interpretations), and problem-solving strategies. Problem solving consists of analysis of behavior problems, generating alternate behavioral solutions, orientation to a solution behavior, and trial of the solution behavior. The core of the treatment is described as balancing problem-solving strategies with validation strategies.

This manual is exceptional, and provides a new and very high standard in the field for treatment manuals. There are a number of exemplary features. First, the patient population is defined by DSM criteria in addition to specific problematic behaviors, that is, parasuicidal behaviors. Second, the treatment manual was generated in the context of clinical research. The treatment was developed in the context of operationalization and discovery, that is, how to operationalize a treatment for research that fits the pathology of the patients who are selected and described by specific criteria. The skills training manual (Linehan, 1993b) notes that it has evolved over a 20 year period, and has been tested on over 100 clients. The treatment generated in this context has been used in diverse treatment settings to teach therapists of various levels of expertise and training to address BPD patients. This process of teaching the treatment to therapists of multiple levels of competence and training can foster the articulation of the treatment and enrich its adaptability to community settings. This treatment has a duration of one year or more because it is addressed to a very difficult and seriously disturbed group of individuals, and provides the challenge of extending treatment manuals beyond brief treatments. A most important and practical addition to the book is an accompanying workbook with work sheets for therapist and patients.

To put DBT in context one could compare it to other cognitive approaches to the personality disorders (Beck & Freeman, 1990), to an interpersonal treatment delivered in a group format (Marziali & Munroe-Blum, 1994), to a modified psychodynamic treatment (Clarkin, Yeomans, & Kernberg, in press), and to a supportive treatment (Rockland, 1992) for these patients.

3.09.5.3 Psychosocial Treatment For Bipolar Disorder

Capitalizing on their years of clinical research with patients with schizophrenia and bipolar disorder and their families, Miklowitz and Goldstein (1997) have articulated a treatment manual for patients with bipolar disorder and their spouses or family members.

The rationale for the treatment is threefold. An episode of bipolar disorder in a family member affects not only the patient but the entire family. Thus, with each episode of the disorder (and this is a recurring disorder) there is a crisis and disorganization of the entire family. The family treatment is therefore an attempt to assist the patient and family to deal cooperatively with this chronic illness. Finally, both in the rationale for the treatment and in the psychoeducational component of the treatment, the authors espouse a vulnerability-stress model of the disorder, which would argue for a treatment that reduces patient and family stress.

The therapeutic stance of the clinician conducting this treatment is explicated well. The clinician is encouraged to be approachable, open and emotionally accessible. It is recommended that the clinician develop a "Socratic dialogue" with the family in providing information, and develop a give and take between all parties. Although the treatment has a major psychoeducational component, the clinician is encouraged to be ready to explore the emotional impact of the information, and not to be simply a classroom teacher. In dealing with both difficult news (e.g., you have a life-long illness) and tasks (e.g., problem solving helps reduce stress), the clinician is encouraged to be reasonably optimistic about what the patient and family can accomplish if they give the treatment and its methods a chance to succeed.

The treatment is described in the manual in terms of three major phases or modules that have a clear sequence to them. The first phase of psychoeducation is intended to provide the patient and family with information that gives them an understanding of bipolar disorder and its treatment. It is hoped that this information may have practical value in informing the patient and family as to things they should do (e.g., medication compliance) that assist in managing the illness. The second phase or module of treatment is a communication enhancement training phase. The aim of this phase is to improve and/or enhance the patient and family's ability to communicate with one another clearly and effectively, especially in the face of the stress-producing disorder. The third or problem-solving phase provides the patient and family with training in effective conflict resolution. Prior to extensive explanation of the three phases of the treatment is information on connecting and engaging with patient and family, and conducting initial assessment.

The authors also examine carefully patient and family criteria for admission to this family intervention. Interestingly, they do not rule out patients who are not medication compliant at present, nor those who are currently abusing substances, both common situations in the history of many bipolar patients. This is an articulated treatment broad enough to encourage those patients who are on the cusp of what is actually treatable.

This manual raises directly the issue of patient medication compliance and the combination of psychosocial treatment with medication treatment. Bipolar patients are responsive to certain medications, and medication treatment is considered a necessity. Thus, this treatment manual provides psychoeducation around the need for medication, and encourages patient and family agreement about continuous medication compliance.

This treatment manual can be placed in context by comparing it to a treatment for bipolar disorder in the individual treatment format (Basco & Rush, 1996). This manual should also be seen in the context of quite similar treatments for patients with schizophrenia and their families (such as those articulated by Anderson, Reiss, & Hogarty, 1986; Bellack, Mueser, Gingerich, & Agresta, 1997; Falloon, Boyd, & McGill, 1984).

3.09.5.4 Other Treatment Manuals

Beyond the three manuals chosen for illustration, it is interesting to note the patient diagnoses and problem areas that are currently addressed by a manual describing a psychotherapy. As noted in Table 1, there are treatments described in manuals for major Axis I disorders such as schizophrenia, bipolar disorder, major depression, anxiety disorders, substance abuse and eating disorders. Treatments for Axis II disorders are less well developed and manualized, except for BPD. Problem areas such as marital discord, sexual dysfunction and problematic emotion expression are also addressed. Further development will probably come in addressing common comorbid conditions. In addition, there has been more funding and research for brief treatments with a cognitive-behavioral orientation. Longer treatments, maintenance treatment for chronic conditions, and the development of manuals for strategies and techniques other than cognitive-behavioral ones can be expected in the future. With the heavy incursion of managed care with its emphasis on cost-cutting, more development of treatments delivered in a group format may also be seen.

Table 1 Representative treatment manuals.

Disorder/problem area            Reference

Panic                            Barlow and Cerny (1988)
Obsessive-compulsive disorder    Steketee (1993)
                                 Turner and Beidel (1988)
PTSD                             Foy (1992)
Depression                       Beck, Rush, Shaw, and Emery (1979)
                                 Klerman, Weissman, Rounsaville, and Chevron (1984)
Bipolar                          Miklowitz and Goldstein (1997)
                                 Basco and Rush (1996)
Schizophrenia                    Bellack (1997)
                                 Anderson, Reiss, and Hogarty (1986)
                                 Falloon, Boyd, and McGill (1984)
Borderline                       Linehan (1993a, 1993b)
                                 Clarkin, Yeomans, and Kernberg (in press)
                                 Marziali and Munroe-Blum (1994)
                                 Rockland (1992)
Marital discord                  Baucom and Epstein (1990)
Sexual dysfunction               Wincze and Carey (1991)
Alcohol abuse                    Sobell and Sobell (1993)
Binge eating                     Fairburn, Marcus, and Wilson (1993)
Problematic emotion schemas      Greenberg, Rice, and Elliott (1993)

3.09.6 ADVANTAGES OF TREATMENT MANUALS

It is clear that the introduction of the treatment manual has had tremendous impact. The very process of writing a treatment manual forces the author to think through and articulate the details of the treatment that may not have been specified before. In this way, the era of the written treatment manual has fostered fuller articulation of treatment by the treatment originators, and furthered the ability of teachers of the treatment to explicate the treatment to trainees.

Manuals provide an operationalized statement of the treatment being delivered or researched so that all can examine it for its foci and procedures. It cannot be assumed that the treatment described in the written manual was delivered as described by the therapists in the study or in a treatment delivery system. This gap between the manual and the treatment as delivered highlights the need for rating scales to assess the faithful (i.e., adherent and competent) delivery of the treatment as described in the manual.

Manuals provide a training tool for clinical research and for clinical practice. It has been noted (Chambless, 1996; Moras, 1993) that students learn treatment approaches much more quickly from their systematic depiction in manuals than through supervision alone.

3.09.7 POTENTIAL DISADVANTAGES AND LIMITATIONS OF TREATMENT MANUALS

Dobson and Shaw (1988) have noted six disadvantages of treatment manuals: (i) the inability to assess the effects of therapist variables, (ii) the diminished ability to study the therapy process, (iii) a focus on treatment fidelity rather than on competence, (iv) the increased expense for research, (v) the over-researching of older and more codified therapy procedures, and (vi) the promotion of schools of psychotherapy. It is important to distinguish the limitations of the manuals as they have been developed up to now from the limitations of manuals as they can be if the field continues to improve them. There is no reason why many of the concerns listed above cannot be incorporated into future manuals. Some are already being included, such as therapist variables, process, and competence.

Probably the most extensively debated issue around manuals is the issue of therapist flexibility in the execution of the treatment. Some would argue that a manual seriously curtails the flexibility and creativity of the therapist, thus potentially eliminating the therapeutic effectiveness of talented therapists. This is the type of issue that, unless infused with data, could be debated intensely for a very long time. Jacobson et al. (1989) compared two versions of behavioral marital therapy, one inflexible and the other in which therapists had flexibility. The outcome of the flexibility condition was no better than that of the inflexible condition. There was a trend for less relapse at follow-up in the flexibly treated couples.

Wilson (1996) notes that allowing therapists to pursue their own somewhat idiographic case formulations rather than following validated treatment in manuals on average might reduce effectiveness rather than enhance it (Schulte, Kuenzel, Pepping, & Schulte-Bahrenberg, 1992). When research therapy programs are transferred to a clinic setting, there tends to be a decrease in effectiveness (Weisz, Donenberg, Han, & Weiss, 1995). A number of studies show that greater adherence to the psychotherapy protocol predicts better outcome (Frank et al., 1991; Luborsky, McLellan, Woody, O'Brien, & Auerbach, 1985).

However, it has been pointed out that adherence to manualized treatments may lead to certain disruptive therapist behaviors (Henry, Schacht, Strupp, Butler, & Binder, 1993). There is the possibility that therapists delivering a manualized treatment, especially those who are just learning the treatment and adhering with concentration, will deliver it with strict adherence but without competence. At its extreme, it might be argued that the use of the treatment manual may de-skill the therapist and interfere with therapist competence. In fact, rigid adherence could lead to poor therapy, and a mockery of what the treatment is intended to be.

In a study reported by Castonguay, Goldfried, Wiser, Raue, and Hayes (1996), when therapists were engaged in an abstract cognitive intervention the outcome appeared worse. However, those interventions were related to a bad outcome only when practiced to the detriment of the therapeutic alliance.

Jacobson and Hollon (1996) point out that the measurement of therapist adherence to treatments as manualized has received much attention, but the measurement of competence is at a rudimentary stage. The manual should provide instruments that have been shown to reliably assess therapist adherence to the treatment as described in the manual, and competence in the delivery of that treatment. Many books that purport to be treatment manuals do not include such instrumentation. The instruments are helpful in further specifying necessary therapist behaviors, and may be useful to the clinical supervisor.

A major limitation of manual development is the number of patients who appear for treatment who are not addressed in the existing treatment manuals. Some of this is due to the research focus of manuals, that is, the need for clinical research to have a narrowly and carefully defined patient population for whom the treatment is intended. Unfortunately, many if not most patients who appear in clinical settings would not fit into a research protocol because of comorbidity or because they do not quite meet criteria for a specific diagnosis (e.g., personality disorder NOS, not otherwise specified). Some of this is simply due to the newness of manuals and the little time there has been in their development.

3.09.8 USE AND ROLE OF TREATMENT MANUALS

Treatment manuals have had a brief but exciting and productive history. There is an extreme position that something as complex, unique and creative as a psychotherapy between two individuals cannot be manualized. It is thought that experience indicates that this is an extreme view, and that manualization has been useful to the field. The issues are how best to utilize a manual and what is the role of the manual in clinical research and training?

Whether in the early stages of clinical research or in a clinical training program, the treatment manual can serve as a tool to be used by the expert therapist who is teaching the treatment. It has been noted that the manual enables the student to learn the treatment more quickly than with supervision alone (Chambless, 1996; Moras, 1993). It is our experience, however, that the manual has a place in the teaching toolbox, but is less important than supervision and watching experts doing the treatment on videotape. The manual provides a conceptual overview of the treatment. Videotapes provide a visual demonstration of the manual being applied to a particular patient, in the style of a particular therapist model. Supervision allows the expert to help the trainee apply the manual to a specific patient in a specific treatment, which will always produce some situations that are not exactly covered in the written manual.

It is a naive and false assumption that a clinician can simply read a treatment manual, and thereby be enabled to deliver the treatment with skill. It is interesting that the authors of a treatment manual (Turner & Beidel, 1988) felt the need to argue that just reading a manual is not enough to train an individual in the competent delivery of the treatment in question to the individual patient.

3.09.8.1 Efficacy to Effectiveness

There is currently much discussion about the need to extend clinical trials of empirically validated treatments, that is, treatments that have been shown to be efficacious, to larger, community samples with average therapists (effectiveness research). It may be that treatment manuals can play an important role in this work. Indeed, the ultimate value of a treatment manual and the training system within which it operates will be the result that clinicians can perform the treatment with adherence and competence. Manuals that are most useful will contain scales developed to measure therapist adherence and competence, and most current manuals are lacking this feature.

3.09.9 SUMMARY

Although treatment manuals have been effective in specifying psychotherapy for clinical research, the step from clinical efficacy studies to demonstration of clinical effectiveness of the treatments that have been manualized is still lacking. This is an issue of knowledge transfer. That is, given the demonstration that a specific therapy (that has been manualized) has shown clinical efficacy in randomized clinical trials with a homogeneous patient population and with carefully selected therapists, can this treatment also show clinical benefits when transferred to a setting in which the patients are less homogeneous and the therapists are those who are working at the local level? The written treatment manual may play a role in this transfer of expertise from a small, clinical research group to a larger group of therapists. However, this step has yet to be demonstrated. Thus, the test of a manual, and the entire teaching package within which it resides, is to demonstrate that a wider group of therapists can do the treatment with adherence and competence.

It is interesting to speculate about the future of training in psychotherapy given the advances in the field, including the generation of treatment manuals. For the sake of argument, we indicated that there are some 33 manuals for 13 disorders for which there are efficacious treatments in the field of clinical psychology. Should these manuals form the basis of the training in psychotherapy of future psychologists? Or can one generate principles of treatment out of these disparate treatments for various disorders and conditions, and teach these principles? There are obvious redundancies across the manuals, and one could imagine a supermanual with branching points for various disorders. For example, most manuals have an assessment phase, followed by the phase of making an alliance with the patient and describing the characteristics of the treatment to follow, with some indication of the roles of patient and therapist. These are obvious skills that a therapist must learn, with nuances depending upon the patient and the disorder in question. For those manuals that are cognitive-behavioral in strategies and techniques, there seems to be great redundancy in terms of the selected finite number of techniques that are used.

What is missing is the assessment of the patient when there is as yet no indication of what the problem is, the diagnosis, or which manual or manuals to use for treatment. Each manual seems to presume that clinicians can properly identify patients for that manual. Unfortunately, one cannot train a psychologist to treat only the 13 disorders/problem areas in the list, as many patients suffer from other conditions not covered. To compound things even further, many patients (if not most, depending on the clinical setting) do not suffer from just one condition, but from several.

Our own approach to training is the one implied by Roth, Fonagy, Parry, Target, and Woods (1996), with emphasis on the initial assessment of the patient with specific clinical hypotheses about the situation, proceeding to the most relevant treatment that has been validated to various degrees. This is a less black-and-white world than that of the empirically supported treatment approach and more related to the complex condition we call clinical work.

Treatment manuals will play some role in this process, but it might be less than the quantum leap suggested by Luborsky and DeRubeis (1984). In our experience, trainees read the treatment manual if they must, but they look forward to seeing experts do the treatment on videotape, and they see supervision from an expert in the treatment as a matter of course.

3.09.10 REFERENCES

Anderson, C. M., Reiss, D. J., & Hogarty, G. E. (1986). Schizophrenia and the family. New York: Guilford Press.
Barlow, D. H., & Cerny, J. A. (1988). Psychological treatment of panic. New York: Guilford Press.
Basco, M. R., & Rush, A. J. (1996). Cognitive-behavioral therapy for bipolar disorder. New York: Guilford Press.
Baucom, D. H., & Epstein, N. (1990). Cognitive-behavioral marital therapy. New York: Brunner/Mazel.
Beck, A. T., & Freeman, A. M. (1990). Cognitive therapy of personality disorders. New York: Guilford Press.
Beck, A. T., Rush, A. J., Shaw, B. F., & Emery, G. (1979). Cognitive therapy of depression. New York: Guilford Press.
Beck, J. S. (1995). Cognitive therapy: Basics and beyond. New York: Guilford Press.
Bellack, A. S., Mueser, K. T., Gingerich, S., & Agresta, J. (1997). Social skills training for schizophrenia: A step-by-step guide. New York: Guilford Press.
Carroll, K. M., Rounsaville, B. J., & Gawin, F. H. (1991). A comparative trial of psychotherapies for ambulatory cocaine abusers: Relapse prevention and interpersonal psychotherapy. American Journal of Drug and Alcohol Abuse, 17, 229-247.
Castonguay, L. G., Goldfried, M. R., Wiser, S., Raue, P. J., & Hayes, A. M. (1996). Predicting the effect of cognitive therapy for depression: A study of unique and common factors. Journal of Consulting and Clinical Psychology, 64, 497-504.
Chambless, D. L. (1996). In defense of dissemination of empirically supported psychological interventions. Clinical Psychology: Science and Practice, 3(3), 230-235.
Chambless, D. L., Baker, M. J., Baucom, D. H., Beutler, L. E., Calhoun, K. S., Crits-Christoph, P., Daiuto, A., DeRubeis, R., Detweiler, J., Haaga, D. A. F., Johnson, S. B., McCurry, S., Mueser, K. T., Pope, K. S., Sanderson, W. C., Shoham, V., Stickle, T., Williams, D. A., & Woody, S. R. (1998). Update on empirically validated therapies, II. Clinical Psychologist, 51, 3-13.
Clarkin, J. F., & Abrams, R. (1998). Management of personality disorders in the context of mood and anxiety disorders. In A. J. Rush (Ed.), Mood and anxiety disorders (pp. 224-235). Philadelphia: Current Science.
Clarkin, J. F., Yeomans, F., & Kernberg, O. F. (in press). Psychodynamic psychotherapy of borderline personality organization: A treatment manual. New York: Wiley.
Dobson, K. S., & Shaw, B. F. (1988). The use of treatment manuals in cognitive therapy: Experience and issues. Journal of Consulting and Clinical Psychology, 56, 673-680.
Ehlers, C. L., Frank, E., & Kupfer, D. J. (1988). Social zeitgebers and biological rhythms: A unified approach to understanding the etiology of depression. Archives of General Psychiatry, 45, 948-952.
Elkin, I., Shea, M. T., Watkins, J. T., Imber, S. D., Sotsky, S. M., Collins, J. F., Glass, D. R., Pilkonis, P. A., Leber, W. R., Docherty, J. P., Fiester, S. J., & Parloff, M. B. (1989). National Institute of Mental Health Treatment of Depression Collaborative Research Program: General effectiveness of treatments. Archives of General Psychiatry, 46, 971-982.
Fairburn, C. G., Jones, R., Peveler, R. C., Carr, S. J., Solomon, R. A., O'Connor, M. E., Burton, J., & Hope, R. A. (1991). Three psychological treatments for bulimia nervosa: A comparative trial. Archives of General Psychiatry, 48, 463-469.
Fairburn, C. G., Marcus, M. D., & Wilson, G. T. (1993). Cognitive-behavioral therapy for binge eating and bulimia nervosa: A comprehensive treatment manual. In C. G. Fairburn & G. T. Wilson (Eds.), Binge eating: Nature, assessment and treatment (pp. 361-404). New York: Guilford Press.

Falloon, I. R. H., Boyd, J. L., & McGill, C. W. (1984). Family care of schizophrenia. New York: Guilford Press.

Foy, D. W. (Ed.) (1992). Treating PTSD: Cognitive-behavioral strategies. Treatment manuals for practitioners. New York: Guilford Press.

Frank, E., Kupfer, D. J., Perel, J. M., Cornes, C., Jarrett, D. B., Mallinger, A. G., Thase, M. E., McEachran, A. B., & Grochocinski, V. J. (1990). Three-year outcomes for maintenance therapies in recurrent depression. Archives of General Psychiatry, 47, 1093–1099.

Frank, E., Kupfer, D. J., Wagner, E. F., McEachran, A. B., & Cornes, C. (1991). Efficacy of interpersonal psychotherapy as a maintenance treatment of recurrent depression. Archives of General Psychiatry, 48, 1053–1059.

Greenberg, L. S., Rice, L. N., & Elliott, R. K. (1993). Facilitating emotional change: The moment-by-moment process. New York: Guilford Press.

Henry, W. P., Schacht, T. E., Strupp, H. H., Butler, S. F., & Binder, J. L. (1993). Effects of training in time-limited dynamic psychotherapy: Mediators of therapists' responses to training. Journal of Consulting and Clinical Psychology, 61, 441–447.

Jacobson, N. S., & Hollon, S. D. (1996). Prospects for future comparisons between drugs and psychotherapy: Lessons from the CBT-versus-pharmacotherapy exchange. Journal of Consulting and Clinical Psychology, 64, 104–108.

Jacobson, N. S., Schmaling, K. B., Holtzworth-Munroe, A., Katt, J. L., Wood, L. F., & Follette, V. M. (1989). Research-structured vs. clinically flexible versions of social learning-based marital therapy. Behaviour Research and Therapy, 27, 173–180.

Kazdin, A. E. (1991). Treatment research: The investigation and evaluation of psychotherapy. In M. Hersen, A. E. Kazdin, & A. S. Bellack (Eds.), The clinical psychology handbook (2nd ed., pp. 293–312). New York: Pergamon.

Klerman, G. L., DiMascio, A., Weissman, M., Prusoff, B., & Paykel, E. S. (1974). Treatment of depression by drugs and psychotherapy. American Journal of Psychiatry, 131, 186–191.

Klerman, G. L., Weissman, M. M., Rounsaville, B. J., & Chevron, E. S. (1984). Interpersonal psychotherapy of depression. New York: Basic Books.

Linehan, M. M. (1993a). Cognitive-behavioral treatment of borderline personality disorder. New York: Guilford Press.

Linehan, M. M. (1993b). Skills training manual for treating borderline personality disorder. New York: Guilford Press.

Luborsky, L., & DeRubeis, R. J. (1984). The use of psychotherapy treatment manuals: A small revolution in psychotherapy research style. Clinical Psychology Review, 4, 5–14.

Luborsky, L., McLellan, A. T., Woody, G. E., O'Brien, C. P., & Auerbach, A. (1985). Therapist success and its determinants. Archives of General Psychiatry, 42, 602–611.

Marziali, E., & Munroe-Blum, H. (1994). Interpersonal group psychotherapy for borderline personality disorder. New York: Basic Books.

Miklowitz, D. J., & Goldstein, M. J. (1997). Bipolar disorders: A family-focused treatment approach. New York: Guilford Press.

Moras, K. (1993). The use of treatment manuals to train psychotherapists: Observations and recommendations. Psychotherapy, 30, 581–586.

Moras, K. (1995, January). Behavioral therapy development program workshop (Draft 2, 3/24/95). National Institute on Drug Abuse, Washington, DC.

Mufson, L., Moreau, D., Weissman, M. M., & Klerman, G. L. (Eds.) (1993). Interpersonal psychotherapy for depressed adolescents. New York: Guilford Press.

Reynolds, C. F., Frank, E., Perel, J. M., Imber, S. D., Cornes, C., Morycz, R. K., Mazumdar, S., Miller, M., Pollock, B. G., Rifai, A. H., Stack, J. A., George, C. J., Houck, P. R., & Kupfer, D. J. (1992). Combined pharmacotherapy and psychotherapy in the acute and continuation treatment of elderly patients with recurrent major depression: A preliminary report. American Journal of Psychiatry, 149, 1687–1692.

Rockland, L. H. (1992). Supportive therapy for borderline patients: A psychodynamic approach. New York: Guilford Press.

Roth, A., Fonagy, P., Parry, G., Target, M., & Woods, R. (1996). What works for whom? A critical review of psychotherapy research. New York: Guilford Press.

Rounsaville, B. J., Glazer, W., Wilber, C. H., Weissman, M. M., & Kleber, H. D. (1983). Short-term interpersonal psychotherapy in methadone-maintained opiate addicts. Archives of General Psychiatry, 40, 629–636.

Schulte, D., Kuenzel, R., Pepping, G., & Schulte-Bahrenberg, T. (1992). Tailor-made versus standardized therapy of phobic patients. Advances in Behaviour Research and Therapy, 14, 67–92.

Sobell, M. B., & Sobell, L. C. (1993). Problem drinkers: Guided self-change treatment. New York: Guilford Press.

Steketee, G. (1993). Treatment of obsessive compulsive disorder. New York: Guilford Press.

Talley, P. F., Strupp, H. H., & Butler, S. F. (Eds.) (1994). Psychotherapy research and practice: Bridging the gap. New York: Basic Books.

Turner, S. M., & Beidel, D. C. (1988). Treating obsessive-compulsive disorder. Oxford, UK: Pergamon.

Weissman, M. M. (1995). Mastering Depression: A patient's guide to interpersonal psychotherapy. San Antonio, TX: Psychological Corporation.

Weissman, M. M., & Markowitz, J. C. (1994). Interpersonal psychotherapy: Current status. Archives of General Psychiatry, 51, 599–606.

Weisz, J. R., Donenberg, G. R., Han, S. S., & Weiss, B. (1995). Bridging the gap between laboratory and clinic in child and adolescent psychotherapy. Journal of Consulting and Clinical Psychology, 63, 688–701.

Wilson, G. T. (1996). Empirically validated treatments: Realities and resistance. Clinical Psychology: Science and Practice, 3, 241–244.

Wincze, J. P., & Carey, M. P. (1991). Sexual dysfunction: A guide for assessment and treatment. New York: Guilford Press.

Woody, S. R., & Sanderson, W. C. (1998). Manuals for empirically supported treatments: 1998 update. Clinical Psychologist, 51, 17–21.

Copyright © 1998 Elsevier Science Ltd. All rights reserved.

3.10 Internal and External Validity of Intervention Studies

KARLA MORAS University of Pennsylvania, Philadelphia, PA, USA

3.10.1 INTRODUCTION
3.10.2 DEFINITIONS
  3.10.2.1 Overview of IV, EV, CV, and SCV
  3.10.2.2 Internal Validity
  3.10.2.3 External Validity
  3.10.2.4 Construct Validity
  3.10.2.5 Statistical Conclusion Validity
3.10.3 COMMON THREATS TO IV, EV, CV, AND SCV
3.10.4 EXPERIMENTAL DESIGNS AND METHODS THAT ENHANCE IV, EV, CV, AND SCV
  3.10.4.1 IV Methods and Designs
    3.10.4.1.1 Random assignment
  3.10.4.2 Untreated or Placebo-treated Control Group
  3.10.4.3 Specification of Interventions
  3.10.4.4 Process Research Methods
  3.10.4.5 EV Methods
    3.10.4.5.1 Naturalistic treatment settings
  3.10.4.6 Inclusive Subject Selection
    3.10.4.6.1 Staff therapists
  3.10.4.7 CV Methods
    3.10.4.7.1 Specification of interventions: treatment manuals
    3.10.4.7.2 Specification of interventions: therapist adherence measures
    3.10.4.7.3 Distinctiveness of interventions
    3.10.4.7.4 Adjunctive therapies
    3.10.4.7.5 Assessment procedures
  3.10.4.8 SCV Methods
    3.10.4.8.1 Statistical power analysis
    3.10.4.8.2 Controlling for the type 1 error rate
    3.10.4.8.3 Testing assumptions of statistical tests
3.10.5 FROM IV VS. EV TO IV + EV IN MENTAL HEALTH INTERVENTION RESEARCH
  3.10.5.1 Strategies for Efficacy + Effectiveness (IV + EV) Intervention Studies
    3.10.5.1.1 The conventional “phase” strategy
    3.10.5.1.2 A dimensional adaptation of the phase model
    3.10.5.1.3 Stakeholder's model to guide data selection
    3.10.5.1.4 Mediator's model
  3.10.5.2 Examples of Efficacy + Effectiveness Intervention Studies
    3.10.5.2.1 Schulberg et al. (1995)
    3.10.5.2.2 Drake et al. (1996)
3.10.6 A CONCLUDING OBSERVATION
3.10.7 REFERENCES


3.10.1 INTRODUCTION

The concepts of internal and external validity (IV and EV) were introduced by Campbell and Stanley in 1963. IV and EV concern the validity of inferences that can be drawn from an intervention study, given its design and methods. The concepts are used to assess the extent to which a study's outcome findings can be confidently: (i) interpreted as evidence for hypothesized causal relationships between interventions and outcomes (IV), and (ii) assumed to generalize beyond the study situation (EV). IV and EV are conceptual tools that guide deductive (IV) and inductive (EV) thinking about the impact of a study's design and methods on the validity of the conclusions that can be drawn from it. The concepts are not only of academic interest. Evaluation of a study's IV and EV is a logical, systematic way to judge if its outcome findings provide compelling evidence that an intervention merits implementation in public sector settings.

This chapter is written at a unique time in the history of IV and EV. The concepts have been at the forefront of a contemporary, often contentious debate in the USA about the public health value of alternative methods and designs for intervention research (e.g., Goldfried & Wolfe, 1998; Hoagwood, Hibbs, Brent, & Jensen, 1995; Jacobson & Christensen, 1996; Lebowitz & Rudorfer, 1998; Mintz, Drake, & Crits-Christoph, 1996; Newman & Tejeda, 1996; Seligman, 1996; Wells & Sturm, 1996). The alternatives are referred to as “efficacy” vs. “effectiveness” studies. In current parlance, efficacy studies have high IV due to designs and methods that reflect a priority on drawing causal conclusions about the relationship between interventions and outcomes (e.g., Elkin et al., 1989). Effectiveness studies have high EV due to designs and methods that reflect the priority of obtaining findings that can be assumed to generalize to nonresearch intervention settings and clinic populations (e.g., Speer, 1994). Typically, effectiveness studies are done to examine the effects of interventions with demonstrated efficacy in high IV studies, when the interventions are used in standard community treatment settings. A related term, “clinical utility research,” has started to appear (Beutler & Howard, 1998). The types of research questions connoted by clinical utility are broader than those connoted by effectiveness research (Howard, Moras, Brill, Martinovich, & Lutz, 1996; Kopta, Howard, Lowry, & Beutler, 1994; Lueger, 1998).

The efficacy vs. effectiveness debate is prominent in mental health intervention research. IV and EV often are viewed as competing rather than complementary research aims (Kazdin, 1994; Roth & Fonagy, 1996). An alternative view is that EV questions only can be asked about a study's findings after the study's IV has been established (Flick, 1988; Hoagwood et al., 1995). A recent trend is to encourage investigators to design studies that can optimize both IV and EV (Clarke, 1995; Hoagwood et al., 1995). The topic is pursued later in the chapter.

This chapter is intended to provide a relatively concise, simplified explication of IV and EV that will be useful to neophyte intervention researchers and to consumers of intervention research. The main aims are to enhance the reader's sophistication as a consumer of intervention research, and ability to use the concepts of IV and EV to design intervention studies. IV and EV are discussed and illustrated mainly from the perspective of research on interventions of a certain type: psychological therapies for mental health problems (Bergin & Garfield, 1971, 1994; Garfield & Bergin, 1978, 1986). The topics covered are: (i) definitions of IV and EV and of two newer, closely related concepts, construct validity (CV) and statistical conclusion validity (SCV); (ii) threats to IV, EV, CV, and SCV; (iii) designs and methods that are commonly used in mental health intervention research to enhance IV, EV, CV, and SCV; and (iv) suggested strategies from the efficacy vs. effectiveness debate to optimize the scientific validity, generalizability, and public health value of intervention studies. Finally, two intervention studies that were designed to meet both IV and EV aims are used to illustrate application of the concepts.

3.10.2 DEFINITIONS

3.10.2.1 Overview of IV, EV, CV, and SCV

Kazdin (1994) provided a concise overview of IV, EV, CV, and SCV, and of common threats to each type of validity (Table 1). The discussion of the four concepts in this chapter is based on Cook and Campbell's (1979) conceptualizations, as is Kazdin's table. Campbell and Stanley's (1963, 1966) original conceptualizations of IV and EV were extended and slightly revised by Cook and Campbell in 1979. The arguments and philosophy of science assumptions that underpin Cook and Campbell's (1979) perspective have been challenged (Cronbach, 1982). A key criticism is that they overemphasized the value of high IV studies in a way that was inconsistent with EV (generalizability) trade-offs often associated with such studies (Shadish, 1995). Cook and Campbell's views were adopted in this chapter because, challenges notwithstanding, they have had a major influence on mental health intervention research since the late 1960s.

Table 1 Types of experimental validity, questions they address, and their threats to drawing valid inferences.

Type of validity: Internal validity
Questions addressed: To what extent can the intervention, rather than extraneous influences, be considered to account for the results, changes, or group differences?
Threats to validity: Changes due to influences other than the experimental conditions, such as events (history) or processes (maturation) within the individual, repeated testing, statistical regression, and differential loss of subjects

Type of validity: External validity
Questions addressed: To what extent can the results be generalized or extended to persons, settings, times, measures, and characteristics other than those in this particular experimental arrangement?
Threats to validity: Possible limitations on the generality of the findings because of characteristics of the sample; therapists; or conditions, context, or setting of the study

Type of validity: Construct validity
Questions addressed: Given that the intervention was responsible for change, what specific aspects of the intervention or arrangement were the causal agents; that is, what is the conceptual basis (construct) underlying the effect?
Threats to validity: Alternative interpretations that could explain the effects of the intervention, that is, the conceptual basis of the findings, such as attention and contact with the subject, expectancies of subjects or experimenters, cues of the experiment

Type of validity: Statistical conclusion validity
Questions addressed: To what extent is a relation shown, demonstrated, or evident, and how well can the investigation detect effects if they exist?
Threats to validity: Any factor related to the quantitative evaluation that could affect interpretation of the findings, such as low statistical power, variability in the procedures, unreliability of the measurement, inappropriate statistical tests

Source: Kazdin (1994).

A basic premise of Cook and Campbell's definitions of IV and EV is that the term “validity” can only be used in an approximate sense. They would say, for example, that judgments of the validity of the conclusions that can be drawn from a study must always be understood as approximate because, from an epistemological perspective, we can never definitively know what is true, only that which has not been shown to be false.

3.10.2.2 Internal Validity

Simply stated, IV refers to the extent to which causal conclusions can be correctly drawn from a study about the relationship between an independent variable (e.g., a type of therapy) and a dependent variable (e.g., symptom change). The full definition reads: “Internal validity refers to the approximate validity with which we infer that a relationship between two variables is causal or that the absence of a relationship implies the absence of cause” (Cook & Campbell, 1979, p. 37).

3.10.2.3 External Validity

EV refers to the extent to which causal conclusions from a study about a relationship between interventions and outcomes can be assumed to generalize beyond the study's specific features (e.g., the patient sample, the therapists, the measures of outcome, the study setting). The full definition of EV is: “ . . . the approximate validity with which we can infer that the presumed causal relationship can be generalized to and across alternative measures of the cause and effect and across different types of persons, settings, and times” (Cook & Campbell, 1979, p. 37).

3.10.2.4 Construct Validity

In their 1979 update of Campbell and Stanley (1963, 1966), Cook and Campbell highlighted two new concepts that are closely linked to IV and EV: CV and SCV. Both have been incorporated into contemporary thinking about experimental design and threats to IV and EV (e.g., Kazdin, 1994).

Cook and Campbell (1979) focused their discussion of CV on the “putative causes and effects” (i.e., interventions and outcomes) of intervention studies. CV reflects the addition of a concept developed earlier by Cronbach and Meehl (1955) to Cook and Campbell's (1979) model of experimental validity. Simply stated, CV is the goodness of fit between the methods used to operationalize constructs (interventions and outcome variables) and the referent constructs. In other words, CV is the extent to which the methods used to measure and operationalize interventions and outcomes are likely to reflect the constructs that the investigators say they studied (e.g., cognitive therapy and depression). A more precise definition of CV is: “ . . . the possibility that the operations which are meant to represent a particular cause or effect construct can be construed in terms of more than one construct, each of which is stated at the same level of reduction” (p. 59). For example, the more possible it is to construe measures in terms of constructs other than those named by the investigators, the lower the CV.

CV also can be described as “what experimental psychologists are concerned with when they worry about ‘confounding’” (Cook & Campbell, 1979, p. 59). An example of a threat to CV is the possibility that the effects of a medication for depression are due largely to the treating psychiatrist's concern for a patient and nonjudgmental reactions to his or her symptoms, rather than to neurochemical effects of the drug. Such alternative interpretations of the therapeutic cause of outcomes in medication intervention studies lead to: (i) the inclusion of pill placebo control conditions, (ii) randomized assignment to either active drug or pill placebo, and (iii) the use of double-blind procedures, that is, neither treater nor patient knows if the patient is receiving active drug or placebo.

Cook and Campbell (1979) link CV to EV. They say that generalizability is the essence of both. However, an integral aspect of CV is the adequacy with which the central variables of an intervention study (the interventions and outcomes) are operationalized. Thus, CV is also necessarily linked to IV: the CV of the methods used to measure and operationalize interventions and outcomes affects the validity of any causal conclusions drawn about a relationship between the designated interventions and outcomes.

3.10.2.5 Statistical Conclusion Validity

SCV is described by Cook and Campbell (1979) as a component of IV. SCV refers to the effects of the statistical methods used to analyze study data on the validity of the conclusions about a relationship (i.e., covariation) between interventions and outcomes. SCV concerns “particular reasons why we can draw false conclusions about covariates” (Cook & Campbell, 1979, p. 37). It “is concerned . . . with sources of random error and with the appropriate use of statistics and statistical tests” (p. 80). For example, a determinant of SCV is whether the assumptions of the statistical test used to analyze a set of outcome data were met by the data.

3.10.3 COMMON THREATS TO IV, EV, CV, AND SCV

A study's IV, EV, CV, and SCV are determined by its design and methods, by the psychometric adequacy of its measures and operationalizations of central constructs, and by the appropriateness of the statistical tests used to analyze the data. All must be assessed to evaluate a study's IV, EV, CV, and SCV.

“Design” refers to elements of a study's construction (the situation into which subjects are placed) that determine the probability that causal hypotheses about a relationship between the independent and dependent variables can validly be tested. For example, a design feature is the inclusion of a no-treatment control group of some type (e.g., pill placebo) in addition to the treatment group. A no-therapy group provides outcome data to compare with the outcomes of the treated group. The comparison allows examination of the possibility that changes associated with the treatment also occur without it and, thus, cannot be causally attributed to it.

“Methods” refers to a wide variety of procedures that are used to implement study designs, such as random assignment of subjects to each intervention group included in a study. A study's methods and design together determine the degree to which any relationship found between interventions and outcomes can validly be attributed to the interventions rather than to something else. In other words, study methods and design determine whether alternative or rival interpretations of findings can be dismissed as improbable. Designs and methods that affect a study's IV, EV, CV, and SCV are discussed in Section 3.10.4.

The use of IV, EV, CV, and SCV to guide study design and critical appraisal of study findings is assisted by knowledge of common threats to each type of validity. Cook and Campbell (1979) discussed several threats to each type. They cautioned, however, that no list is perfect and that theirs was derived from their own research experience and from reading about potential sources of fallacious inferences.
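The “Methods” paragraph above names random assignment of subjects to each intervention group as a typical procedure. As a minimal sketch, written in Python, two allocation schemes for a hypothetical two-arm study might look as follows: a plain coin flip, and a simplified urn scheme in which each assignment makes the other arm more likely on the next draw. The arm labels, the function names, and the `alpha` and `beta` parameters are assumptions made for illustration. The urn shown here balances group sizes only and does not match subjects on preidentified characteristics, so it should not be read as the procedure used in any study cited in this chapter.

```python
import random
from collections import Counter

ARMS = ("A", "B")  # hypothetical labels for a two-intervention study


def coin_flip_assign(n_subjects, rng):
    """Simple randomization: each subject has an equal chance of either arm."""
    return [rng.choice(ARMS) for _ in range(n_subjects)]


def urn_assign(n_subjects, rng, alpha=1, beta=1):
    """Simplified two-arm urn scheme (illustrative, not covariate-adaptive).

    The urn starts with `alpha` balls per arm. A ball is drawn at random,
    the subject is given that arm, and `beta` balls of the OTHER arm are
    added, so the arm that is currently behind becomes more likely next.
    """
    balls = {arm: alpha for arm in ARMS}
    assignments = []
    for _ in range(n_subjects):
        weights = [balls[arm] for arm in ARMS]
        drawn = rng.choices(ARMS, weights=weights)[0]
        assignments.append(drawn)
        other = ARMS[1] if drawn == ARMS[0] else ARMS[0]
        balls[other] += beta
    return assignments


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the demo is reproducible
    for name, scheme in (("coin flip", coin_flip_assign), ("urn", urn_assign)):
        counts = Counter(scheme(40, rng))
        print(name, dict(counts))
```

Passing in a seeded `random.Random` instance keeps each allocation sequence reproducible, which makes it possible to check afterwards how balanced the groups turned out to be.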

The threats to each type of validity identified by Cook and Campbell are listed in Tables 2–5. The tables also present examples of designs and methods that can offset the various threats (“antidotes”). The interested reader is encouraged to review Cook and Campbell's discussion of threats to IV, EV, CV, and SCV; design and methodological antidotes to the threats; and limitations of common antidotes.

In the following sections, the four types of threats to validity are discussed in turn. In each section, examples of designs and methods are described that are used in contemporary psychological and pharmacological mental health treatment research to enhance that type of validity. Only a sampling of designs and methods relevant to each type of validity is described due to space limitations.

3.10.4 EXPERIMENTAL DESIGNS AND METHODS THAT ENHANCE IV, EV, CV, AND SCV

Each design and method discussed in the sections that follow is limited in the extent to which it can ensure that a particular type of experimental validity is achieved. Some limitations result mainly from the fact that patient-subjects are always free to deviate from research treatment protocols they enter, by dropping out of treatment, for example. Other limitations arise because research methods that can enhance IV can simultaneously reduce EV. For example, using psychotherapists who are experts in conducting a form of therapy can both increase IV and reduce EV.

One point merits emphasis. The experimental validity potential of alternative designs and methods is always contingent on the match between study hypotheses and a study's design and methods. For example, the IV potential of a particular design differs if the main aim of a study is to obtain data that can be interpreted as evidence for a therapy's hypothesized effects vs. to obtain data that can be interpreted as evidence for the comparative efficacy of alternative therapies for the same problem.

3.10.4.1 IV Methods and Designs

Table 2 lists and defines common IV threats. The text that follows elaborates on some of the information in Table 2.

3.10.4.1.1 Random assignment

Random assignment of subjects to all intervention groups in a study design is a sine qua non of IV. Broadly defined, random assignment means that a procedure is used to ensure that each new subject has an equal chance to be assigned to every intervention condition in a study. A simple random assignment procedure is the coin flip: heads the subject is assigned to one intervention in a two-intervention study; tails he or she receives the other one. Various procedures are used for random assignment including sophisticated techniques like urn randomization (Wei, 1978). Urn randomization simultaneously helps ensure that subjects are randomly assigned to interventions, and that the subjects in each one are matched on preidentified characteristics (e.g., co-present problems) that might moderate the effects of an intervention (Baron & Kenny, 1986).

Random assignment contributes to IV by helping to ensure that any differences found between interventions can be attributed to the interventions rather than to the subjects who received them. The rationale for randomization is that known and unknown characteristics of subjects that can affect outcomes will be equally distributed across all intervention conditions. Hence, subject features will not systematically affect (bias) the outcomes of any intervention.

Random assignment has limitations as a way to enhance IV. For example, it is not a completely reliable method to ensure that all outcome-relevant, preintervention features of subjects are equally distributed across intervention conditions. Even when randomization is used, by chance some potentially outcome-relevant subject characteristics can be more prevalent in one intervention than another (Collins & Elkin, 1985). This fact can be discovered post hoc when subjects in each intervention are found to differ on a characteristic (e.g., marital status) that is also found to relate to outcome. This happened, for example, in the US National Institute of Mental Health's Treatment of Depression Collaborative Research Program study (Elkin et al., 1989).

Attrition (e.g., of subjects, of subjects' outcome data) also limits the IV protection provided by random assignment (Flick, 1988; Howard, Cox, & Saunders, 1990; Howard, Krause, & Orlinsky, 1986). For example, subjects can always drop out of treatment or fail to provide data at all required assessment points. Both are examples of postinclusion attrition in Howard, Krause et al.'s (1986) terminology. The problems associated with, and types of, attrition have been carefully explicated (Flick, 1988; Howard, Krause et al., 1986). For example, the core IV threat associated with dropout is that the subjects who drop out of each study intervention might differ somehow (e.g., severity of symptoms). This would create differences in the subjects who complete each intervention and simultaneously

Table 2 Internal validity.