JOURNAL OF APPLIED BEHAVIOR ANALYSIS 2010, 43, 195–213 NUMBER 2 (SUMMER 2010)

APPLYING SIGNAL-DETECTION THEORY TO THE STUDY OF OBSERVER ACCURACY AND BIAS IN BEHAVIORAL ASSESSMENT

DOROTHEA C. LERMAN, ALLISON TETREAULT, ALYSON HOVANETZ, EMILY BELLACI, JONATHAN MILLER, HILARY KARP, ANGELA MAHMOOD, MAGGIE STROBEL, SHELLEY MULLEN, ALICE KEYL, AND ALEXIS TOUPARD

UNIVERSITY OF HOUSTON, CLEAR LAKE

We evaluated the feasibility and utility of a laboratory model for examining observer accuracy within the framework of signal-detection theory (SDT). Sixty-one individuals collected data on aggression while viewing videotaped segments of simulated teacher–child interactions. The purpose of Experiment 1 was to determine if brief feedback and contingencies for scoring accurately would bias responding reliably. Experiment 2 focused on one variable (specificity of the operational definition) that we hypothesized might decrease the likelihood of bias. The effects of social consequences and information about expected behavior change were examined in Experiment 3. Results indicated that feedback and contingencies reliably biased responding and that the clarity of the definition only moderately affected this outcome.

Key words: behavioral assessment, college students, data collection, observer bias, signal-detection theory

We thank Jennifer Fritz for her assistance in conducting this study. Address correspondence to Dorothea C. Lerman, 2700 Bay Area Blvd., Box 245, Houston, Texas 77058 (e-mail: [email protected]). doi: 10.1901/jaba.2010.43-195

Direct observation and measurement of behavior are the cornerstones of effective research and practice in applied behavior analysis. Trained observers use various methods to record occurrences of precisely defined target behaviors and other events during designated observational periods. In application, behavioral consultants often rely on parents, teachers, and direct-care staff to collect data on target behaviors. The behavioral consultant examines these data to obtain information that is key to program effectiveness, such as the baseline level of responding, the conditions under which a behavior occurs, and changes in responding with the introduction of treatment or modifications to existing procedures. Little research has been conducted on the accuracy of data collected by direct-care staff or the best way to train people to collect these data.

Interobserver agreement, which is determined by having two observers record the same events at the same time, is routinely reported in published research to provide some degree of confidence in the accuracy of the reported data. However, practitioners do not routinely collect data on interobserver agreement (i.e., reliability). Furthermore, agreement is not synonymous with accuracy (i.e., two observers could agree but incorrectly score the behavior; Kazdin, 1977; Mudford, Martin, Hui, & Taylor, 2009).

A number of studies have identified variables that might lead observers to record data inaccurately. Research findings indicate that factors related to the measurement system (e.g., number of different behaviors scored), characteristics of the observers (e.g., duration of training), characteristics of the setting (e.g., presence of other observers), and consequences for scoring (e.g., social approval for recording changes in the level of the target behavior) can influence the accuracy and reliability of behavioral measurement (see Kazdin, 1977; Repp, Nieminen, Olinger, & Brusca, 1988, for reviews). A large portion of these studies, however, focused on interobserver agreement rather than accuracy. Furthermore, the nature of inconsistencies or inaccuracies in data collection was not systematically examined. For example, sources of random error versus nonrandom error (i.e., observer bias) have not been differentiated in previous research.


Signal-detection theory (SDT; Green & Swets, 1966) may provide a useful framework for further analysis of observer accuracy. SDT was developed to examine the behavior of an observer in the presence of ambiguous stimuli. The task of the observer is to discriminate the presence versus absence of a stimulus (i.e., detect a signal against a background of noise). In classical signal-detection experiments, the observer either responds "yes" or "no" regarding the presence of the signal on each trial. Correctly indicating that a stimulus is present is called a hit, and correctly indicating that a stimulus is absent is called a correct rejection. Indicating that a stimulus is absent when it is actually present is called a miss, and indicating that a stimulus is present when it is actually absent is called a false alarm.

According to SDT, the behavior of an observer in this type of situation has at least two dimensions. One dimension is determined by the sensory capability of the observer and the actual ambiguity of the stimulus and is called the sensitivity of the observer (i.e., how well the observer discriminates the signal from the noise). A second dimension is the proclivity of the observer to judge in one direction as opposed to the other (e.g., to indicate that the signal is present rather than absent), referred to as the observer's response bias. Research on SDT indicates that response bias is affected by a number of variables, including the consequences for each outcome of judgment, the a priori probability of each option, the decision rule that influences the observer, and instructions about how to make the observations (Green & Swets, 1966). Sensitivity, on the other hand, is usually affected only by operations that change the amount of ambiguity in the stimulus situation. SDT provides a way to evaluate the effect of factors on sensitivity and response bias separately.

Methods based on SDT have been applied across a variety of disciplines (e.g., medicine, industry, psychiatry, engineering) to evaluate decision making, including clinical diagnosis and assessment (see McFall & Treat, 1999; Swets, 1988, 1996, for reviews). In the experimental analysis of behavior, signal-detection methods have been used to study stimulus control and reinforcement effects in choice situations (e.g., Alsop & Porritt, 2006; Davison & McCarthy, 1987; Nevin, Olson, Mandell, & Yarensky, 1975).

The concepts of SDT also could be extended to the direct observation of behavior in research and clinical settings. Any behavior that should be recorded by observers is analogous to the signal in SDT. All other behaviors are analogous to the noise in SDT. Correctly recording that a behavior has occurred is analogous to responding, "Yes, the signal is present" (i.e., a hit). Correctly refraining from recording a behavior that does not meet the definition of the target response is analogous to responding, "No, the signal is absent" (i.e., a correct rejection). Failing to record a behavior that has occurred is analogous to responding incorrectly in the presence of the signal (i.e., a miss), whereas recording a behavior that did not occur is analogous to responding incorrectly in the presence of noise (i.e., a false alarm). Research on SDT indicates that observer error may reflect problems with sensitivity (i.e., discriminating the target behavior from other behaviors) or response bias (i.e., the criterion used by the observer to determine whether a behavior should be recorded). SDT also suggests that problems with sensitivity and bias are more likely to occur when the observer encounters ambiguous samples of the targeted behaviors.
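To make the distinction between sensitivity and bias concrete, the two dimensions can be summarized with the conventional SDT indexes d′ (sensitivity) and c (criterion, or response bias) computed from an observer's hit and false-alarm proportions. The article itself reports hit and false alarm rates and relies on visual analysis rather than these indexes, so the sketch below is illustrative only; the correction for extreme proportions is an assumption, not a procedure described by the authors.

```python
# Illustrative only: standard SDT indexes from hit and false-alarm counts.
from scipy.stats import norm

def sdt_indexes(hits, misses, false_alarms, correct_rejections):
    """Return (hit_rate, false_alarm_rate, d_prime, criterion_c)."""
    # Log-linear correction keeps proportions away from 0 and 1 so the
    # z transform is defined (an assumption, not the authors' method).
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa                # separation of signal from noise
    criterion_c = -0.5 * (z_hit + z_fa)   # positive = conservative scoring
    return hit_rate, fa_rate, d_prime, criterion_c

# Hypothetical observer: 30 of 40 samples scored, 10 of 40 nonexamples scored
print(sdt_indexes(hits=30, misses=10, false_alarms=10, correct_rejections=30))
```

A factor that shifts only c (hits and false alarms rising or falling together) would indicate response bias, whereas a factor that changes d′ would indicate a change in sensitivity.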
Thus far, research on observer accuracy in behavioral assessment has not differentiated between sensitivity and bias or considered the role of ambiguous behavioral samples when examining factors that may influence accuracy or reliability. Various types of ambiguous behavioral samples might arise during naturalistic observation. Three types will be described for illustrative purposes. First, a particular event may possess a subset (rather than all) of the criteria specified in the defined response class. For example, the definition of a tantrum might be "screaming and falling to the floor," such that both responses must be present for a tantrum to be scored. A sample consisting of falling to the floor in the absence of screaming might be associated with inconsistent or erroneous data collection (i.e., a false alarm). Second, a particular event may possess all of the criteria specified in the behavioral definition but include other elements that differentiate the sample from other members of the defined response class. For example, a behavioral sample consisting of screaming and laughing while falling to the floor may appear ambiguous to the observer who is scoring tantrums as defined above. Such a sample may increase the likelihood of inconsistent or erroneous data collection (i.e., a miss). Finally, ambiguity may arise when the nature of the behavior makes it difficult to specify samples that should be included and excluded in the behavioral definition. For example, inappropriate vocalizations might be defined as vocalizations unrelated to the topic being discussed or to stimuli in the environment (e.g., DeLeon, Arnold, Rodriguez-Catter, & Uy, 2003). Such a definition permits some degree of subjective interpretation. For example, disagreement might occur between two observers when an individual states, "I love marshmallows," after hearing someone say, "There are some beautiful, fluffy clouds in the sky today."

Factors that influence observer accuracy may be particularly problematic when an observer is faced with these types of ambiguous samples. Nonetheless, in prior research on observer accuracy and reliability, the nature of the behavioral samples (i.e., clear vs. ambiguous) was uncontrolled. Detailed analyses of observer errors (e.g., percentage of misses vs. false alarms) also have rarely been conducted. These gaps limit our knowledge about how and why different variables affect the accuracy and consistency of behavioral measurement. For example, several studies have shown that the presence of another observer (or general knowledge of monitoring) improves observer accuracy (Repp et al., 1988). It is possible that this factor influences the observer's criterion for a hit (e.g., the observer adopts a more conservative criterion for the target behavior) and, thus, is less likely to record false alarms. Alternatively, the observer may attend more closely to the behavior samples (i.e., increase vigilance without changing the criterion), thereby decreasing the number of misses. Similar changes in overall accuracy would occur but for very different reasons.

Analyses of measurement based on SDT might better differentiate among factors that influence response bias and sensitivity, leading to a greater understanding of factors that influence measurement and, thus, better solutions for rectifying these problems. For example, response bias would be indicated if a factor produces similar effects on both hits and false alarms. A simultaneous increase (or decrease) in hits and false alarms suggests that the observer has altered his or her criterion for scoring a behavior.

The identification of stimuli and conditions that reliably bias observers' responding as predicted by SDT would be useful for the further study of clinically relevant variables that may affect the accuracy of observations in the natural environment. The purpose of this study was to develop and test a procedure for evaluating factors that may influence observer accuracy and bias in behavioral assessment. As a first step, we sought to determine whether we could reliably bias responding in the laboratory by having individuals score videotaped segments of simulated child–teacher interactions with clear and ambiguous samples of designated behaviors. Based on previous research on SDT, we hypothesized that (a) hits and false alarms would increase when observers were given brief feedback and told that they would receive monetary points for each hit, (b) hits and false alarms would decrease when observers were given brief feedback and told that they would lose points for each false alarm, and (c) changes in hits and false alarms would be more likely to occur with samples designated as ambiguous rather than as clear.

After evaluating the utility of our procedures in Experiment 1, we conducted two additional studies on factors that might alter observer bias in the presence of ambiguous events. Experiment 2 focused on the specificity of the operational definition. We hypothesized that this variable might decrease the likelihood of bias under conditions found to bias responding reliably (i.e., the factors manipulated in Experiment 1). The relation between response bias and two other clinically relevant factors that have been shown to alter observers' recordings (social consequences and information about expected behavior change) was examined in Experiment 3.

GENERAL METHOD
EXPERIMENTS 1 AND 2

Participants and Setting

Graduate students enrolled in a learning principles course and undergraduate students enrolled in a research and statistics course were recruited for Experiments 1 and 2. Nineteen graduate students (22 years to 53 years old; M = 30 years) and 18 undergraduate students (20 years to 59 years old; M = 33 years) participated in one of two experiments (none participated in both). The graduate students received extra credit in their learning principles course for participating. The undergraduate students fulfilled a course requirement in their research and statistics class. (Students could choose among a variety of available experiments or select a nonresearch option to satisfy these course requirements.) Because similar results were obtained for the graduate and undergraduate participants, information about these two groups is not differentiated on this basis in the remainder of the paper. All participants (20 women, 17 men) completed an interview with the experimenter regarding previous experience with data collection and direct observation of behavior and prior course work in behavior analysis, behavior therapy, or behavior modification. The participants reported little or no previous relevant data-collection experience or course work (beyond a learning principles course). All sessions took place in a laboratory room that contained a table, chairs, TV/VCR, and handheld computer. In addition to these participants, an individual with a doctoral degree in behavior analysis and 8 years of experience collecting direct-observation data for both research and clinical purposes was recruited to participate in the experiments as an expert data collector (see further explanation below). She was naive to the purpose of the study.

Materials

A series of eight vignettes was developed and videotaped prior to the study. Each vignette showed a teacher instructing a student on a particular task (e.g., sweeping the floor, sorting objects at a table, playing with leisure materials) for approximately 4 min. The same two actors appeared in each vignette, and the total video lasted 33 min. The actors followed prepared scripts such that each vignette consisted of two or three clear samples of the target behavior (aggression, defined as "hitting the teacher, kicking the teacher, and throwing objects at the teacher"), two or three ambiguous samples of the target behavior, and five ambiguous nonexamples of the target behavior. The 33-min segment used in Experiments 1 and 2 contained a total of 20 clear samples of the target behavior, 20 ambiguous samples of the target behavior, and 40 ambiguous nonexamples of the target behavior (i.e., 40 possible hits and 40 possible false alarms [if, in fact, false alarms involved only ambiguous samples]). When creating the scripts for the vignettes, the experimenters classified samples as clear or ambiguous on the basis of expected outcomes. Examples of samples from each category are shown in Table 1.

Table 1. Examples of Clear and Ambiguous Samples and Nonexamples of Aggression Shown in the Videotaped Segments

Samples (signal)
Clear: Hitting the teacher on the shoulder with a closed fist while screaming "no," kicking the teacher when she begins to provide a physical prompt, throwing task materials at the teacher when given an instruction.
Ambiguous: Hitting the teacher's hand while reaching for task materials, hitting the teacher's shoulder while smiling, hitting the teacher on the arm with an object while playing with the object.

Nonexamples (noise)
Clear: Complying with an instruction, engaging in a task.
Ambiguous: Screaming at the teacher, hitting own head, throwing objects in the opposite direction of the teacher; swinging an arm towards the teacher without making contact.

The remaining interactions consisted of clear nonexamples of the target behavior (e.g., compliance to instructions). Each behavioral sample of the student was separated by at least 10 s from the previous sample, so that observer accuracy could be determined (see below for further description). Three different versions of the 33-min segment were constructed by altering the order of the vignettes such that no vignette occurred contiguous with another vignette in more than one version. A voice stating "one, two, three, start" was inserted prior to the first vignette of each segment to cue initiation of the data-collection program. Observers used handheld PCs with external keyboards (Dell Axim) and data-collection software (Instant Data) to collect data. The software generated a text record of each key pressed and the moment that it was pressed for the entire observation.

To generate gold standard data records for each version of the 33-min videotaped segment, a group of six trained observers used the same handheld PCs and software to score occurrences of clear and ambiguous samples of aggression as well as ambiguous nonexamples of aggression. Prior to and during the scoring, the observers had access to the written scripts of the vignettes but did not interact with each other during the scoring. Different key codes were used for each of these three response categories. Working in pairs, the observers compared their data records following the scoring of an entire 33-min segment. If the precise timing of any event differed by more than 3 s between the observers, the segment was rescored until no such discrepancies occurred (this rarely happened). One of the observers' records was selected randomly to serve as the gold standard data record for each version of the videotape. To verify the accuracy of the gold standard data, the first author compared these records to the written scripts and the written scripts to the videotaped vignettes.

Response Measurement and Interrater Agreement

The primary measures of interest were the number of hits and false alarms in each scoring session. A hit was defined as the observer scoring the occurrence of aggression (designated as an "a" on his or her data record) within ±5 s of the occurrence of aggression on the gold standard data record. A false alarm was defined as the observer scoring the occurrence of aggression more than ±5 s from the occurrence of aggression on the gold standard data record. Each participant's hit rate was determined by dividing the total number of hits by the total number of clear and ambiguous samples of aggression (40). The false alarm rate was determined by dividing the total number of false alarms by the total number of ambiguous nonexamples of aggression (40). All of the data records were scored independently by two experimenters, and the results were compared. The experimenters rescored any data records that contained one or more disagreements. The results were again compared, and interrater agreement of 100% was obtained.
Procedure

Participants were escorted into the lab and given the following written information and instructions, which the experimenter read aloud:

In my field (applied behavior analysis), direct observation of behavior is critical to both research and practice. We rely on observers to accurately record the occurrence of behaviors in field settings, such as homes, schools, and clinics. We must train observers to score data. The purpose of this study is to examine efficient ways to train observers and to evaluate the benefits of using handheld computers to collect data in field settings. We are simulating the procedures typically used to train observers who score behavior in the field. You will be viewing a series of videotaped segments with two actors who are pretending to be a teacher and a student with developmental disabilities. Each segment consists of a series of 2- to 5-min clips; you should score all of the clips in a segment as one continuous observation. First, you will be scoring some segments as practice sessions. During these practice sessions, you may discuss the scoring with me and ask questions. Once you feel comfortable with the scoring, you will score three additional 33-minute videotaped segments as though you are actually out in the field. Once you have begun this scoring, I will not be able to answer any questions about the data collection because we will be pretending that you are out in the field. Just do the best that you can.

You will be scoring the following behaviors by pressing the appropriate key, as shown in parentheses: Aggression (a): Score whenever the student hits the teacher, kicks the teacher, or throws objects at the teacher. Praise (c): Score whenever the teacher delivers a statement that includes praise, such as "Nice working," "I like that," "Good job," "That was a good one," etc. Tone of voice is not relevant.

It is important to score the data as accurately as possible. Press the appropriate key as soon as the behavior occurs. Any keys that are pressed more than 5 seconds after the behavior occurs will not be considered correct. Do not score additional occurrences of a behavior if they follow the initial behavior by less than 5 seconds. For example, if the student hits the teacher three times very quickly, only score the first behavior of this "burst" or "episode." If the teacher delivers multiple praise statements following the student's behavior, only score the first statement.

Depending on the condition, the participant also was told about the possibility of earning points for scoring correctly (see further description below). Participants were told to score praise in addition to aggression (the true target) to increase the demand of the observation and to reduce possible reactivity.

After receiving the instructions, all participants viewed a 16-min practice video consisting of eight vignettes that resembled those of the test video. However, the practice video contained only clear samples and clear nonexamples of the target behavior. Following the first practice session, the experimenter discussed any errors in the participant's scoring. Additional 10-min practice sessions continued (using different versions of the practice video) until the participant made no errors in the scoring of aggression. The instructor then asked the participant if he or she was ready to begin the actual scoring (all participants indicated that they were ready). At this point, the written instructions were removed from the room. Across both experiments, participants required a mean of 20 min to meet the training criterion (range, 16 to 36 min).

EXPERIMENT 1

The purpose of Experiment 1 was to examine the effects of instructions that included information about contingencies and brief feedback on the rate of hits and false alarms. The goal was to determine if our procedures would be useful for the further study of observer accuracy and bias within the framework of SDT.

Procedure

Twenty participants scored three versions of the 33-min video. For 15 participants, factors intended to bias responding were manipulated for two of the three scorings. The remaining 5 participants (control group) were not exposed to the experimental manipulation. These participants simply scored the three videos to evaluate patterns of responding in the absence of any factors designed to alter criterion. The participants were assigned randomly to the experimental and control groups. The expert data collector who was recruited for the study was asked to score one version of the 33-min video. She was not exposed to the experimental manipulation. Our intention was to use her scoring as a measure of the quality of the video. That is, we assumed that someone with expert levels of experience would capture most, if not all, instances of aggression (both clear and ambiguous) if these responses were clearly visible in the video. The data-collection expert was not given any feedback about her scoring.

In addition to the instructions shown above, all participants, with the exception of those in the control group and the data-collection expert, were told, "I will tell you if you can earn points for recording data in the upcoming segment and how you can earn these points." These participants also were told that the points would be exchangeable for money at the end of the sessions and that they could earn up to $40 (graduate students) or $20 (undergraduate students). After completion of the practice sessions, all participants were exposed to Condition A (baseline). Participants in the control group were then exposed to two additional scoring sessions under Condition A. The remaining participants were exposed to either Condition B (consequences for hits) or Condition C (consequences for false alarms) for the second scoring, depending on the pattern of scoring during baseline (see further description below). The participant's performance during the second scoring determined whether Condition B or Condition C was implemented during the third scoring. Participants received a 10-min break between each scoring session.

Baseline (A). Prior to the start of the video segment, participants were told, "For this segment, you will not have an opportunity to earn points. Just score as accurately as you can." (Participants in the control group were just told to score as accurately as they could.) During the session break, the experimenter compared the participant's data record to the gold standard data record. If the participant had more misses than false alarms, the participant was exposed to Condition B for the second scoring session. If the participant had more false alarms than misses, the participant was exposed to Condition C for the second scoring.

Consequences for hits (B). Prior to the scoring, participants were told, "When downloading your data record, I noticed that you missed some of the aggressions in the last segment. It's really important to catch all of the aggressions that occur. Thus, for this segment, you will have an opportunity to earn points for correctly scoring aggression. You will earn one point for each aggression that you catch. The more points that you earn, the more money you will receive."

The statements above included a brief feedback statement because feedback is typically included in signal-detection procedures. Participants were told about the importance of hits to increase the saliency of the consequences for hits.1 Following the scoring session, the experimenter compared the participant's data record to the gold standard data record. The participant remained in Condition B for the third scoring session as long as the frequency of hits was equal to or less than 37 (of a total possible 40). Otherwise, the participant was exposed to Condition C for the third scoring (this happened for 2 participants).

Consequences for false alarms (C). Prior to the scoring, participants were told, "When downloading your data record, I noticed that you scored aggression at times when aggression did not occur. It is really important not to include things that aren't aggression. Thus, for this segment, you will start with a certain number of points, and you will lose one point each time that you score something as aggression when it is not. You will earn however many points are left at the end of the session. The more points that you earn, the more money you will receive."

1 At the conclusion of this experiment, 10 additional participants were recruited. Half were exposed to feedback only and the other half were exposed to the point contingency only (combined with the statement about the importance of hits and avoiding false alarms; see also Condition C). Results suggested that both components were necessary to reliably bias responding. Data are available from the first author.
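The assignment rules described in this section can be restated compactly. The restatement below is illustrative only; the article does not state how a tie between misses and false alarms at baseline was handled, so that case is left open.

```python
# Condensed restatement of the condition-assignment rules described above.
def second_session_condition(baseline_misses, baseline_false_alarms):
    if baseline_misses > baseline_false_alarms:
        return "B"   # consequences for hits
    if baseline_false_alarms > baseline_misses:
        return "C"   # consequences for false alarms
    return None      # tie: not addressed in the text

def third_session_condition(second_condition, hits=None, false_alarms=None):
    if second_condition == "B":
        return "B" if hits <= 37 else "C"          # of 40 possible hits
    if second_condition == "C":
        return "C" if false_alarms >= 3 else "B"   # this switch never occurred
    return None
```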

Figure 1. Response patterns across the three scoring sessions that would indicate manipulation of response bias (left) or changes in sensitivity (right) for the four possible sequences of conditions (A-B-B, A-C-C, A-B-C, A-C-B).

As noted above, participants were given the brief feedback statement because feedback is typically included in signal-detection procedures. Participants were told about the importance of avoiding false alarms to increase the saliency of the consequences for false alarms. Following the scoring session, the experimenter compared the participant's data record to the gold standard data record. If the participant was completing the second scoring session in this condition, the participant remained in Condition C for the third scoring session as long as the frequency of false alarms was at or above three. Otherwise, the participant would have been exposed to Condition B for the third scoring (however, this did not happen for any participants).

Data Analysis

The performance of each participant was plotted on a graph displaying the hit rate (vertical axis) against the false alarm rate (horizontal axis) for each observation session. With this type of display, data points that appear close to the upper right corner of the graph indicate a high hit rate with a high false alarm rate (tendency to record events). Data points that appear close to the lower left corner of the graph indicate a low hit rate with a low false alarm rate (tendency to refrain from recording). Data points that appear closer to the upper left corner of the graph reflect higher levels of accuracy (i.e., a high hit rate with a low false alarm rate), whereas data points that appear closer to the lower right corner of the graph reflect lower levels of accuracy (i.e., a low hit rate with a high false alarm rate). Data for the three observation sessions were plotted on the same graph for each participant. Different symbols were used to indicate the session number, and the three points were connected by a line for visual inspection purposes.

For the purposes of illustration, response patterns that indicate manipulation of response bias are shown in the left panel of Figure 1 for the four possible sequences of conditions (A-B-B, A-C-C, A-B-C, A-C-B). Manipulation of response bias as predicted by SDT is indicated by a positive relation between the hit rate and false alarm rate across the three sessions, with the direction of the relation determined by the conditions experienced by the participant. Data points move towards the upper right corner of the graph from one session to the next session given a greater tendency to record events (i.e., an increase in the rate of hits and false alarms). Data points move towards the lower left corner of the graph from one session to the next session given a greater tendency to refrain from recording events (i.e., a decrease in the rate of hits and false alarms).
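The display described in this section can be reproduced with standard plotting tools. The sketch below uses matplotlib and hypothetical rates for a single participant; it is not the authors' figure code.

```python
# Illustrative sketch: hit rate (y) against false alarm rate (x) for one
# participant's three scoring sessions, with hypothetical values.
import matplotlib.pyplot as plt

false_alarm_rates = [0.10, 0.35, 0.20]   # Sessions 1-3 (hypothetical)
hit_rates = [0.60, 0.85, 0.70]
markers = ["o", "s", "^"]                # a different symbol per session
labels = ["Session 1 (A)", "Session 2 (B)", "Session 3 (C)"]

fig, ax = plt.subplots()
ax.plot(false_alarm_rates, hit_rates, color="gray", zorder=1)  # connect points
for f, h, m, label in zip(false_alarm_rates, hit_rates, markers, labels):
    ax.scatter(f, h, marker=m, s=60, zorder=2, label=label)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel("False alarm rate")
ax.set_ylabel("Hit rate")
ax.legend()
plt.show()
```

Points drifting toward the upper right or lower left across sessions would suggest a shift in response bias, whereas movement toward the upper left or lower right would suggest a change in sensitivity.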

Response patterns that reflect systematic changes in sensitivity across the three sessions are shown in the right panel of Figure 1. A negative relation between the hit rate and false alarm rate (i.e., data points moving towards the upper left corner or the lower right corner of the graph) indicates a change in sensitivity. For this particular study, we would expect to see an increase in the hit rate with a corresponding decrease in the false alarm rate (i.e., an increase in accuracy) if the manipulated contingencies altered sensitivity for any participants. Thus, in Figure 1, the pattern of responding across the three sessions is identical regardless of the conditions experienced by the participant.

The mean percentages of clear samples of the target behavior, ambiguous samples of the target behavior, and ambiguous nonexamples of the target behavior scored by participants also were calculated to determine if errors were more likely to occur with samples considered to be ambiguous than with those considered to be clear.

Results and Discussion

Plots showing the hit rates and false alarm rates for the individuals who participated in Experiment 1 are displayed in the left column of Figure 2. The data are grouped according to the sequence of conditions experienced by the participant (i.e., A-B-B, A-C-C, A-B-C, or A-A-A). The hit rate and false alarm rate for the session scored by the data-collection expert are plotted in the top left panel (displayed with an asterisk). Results for all participants (with the exception of the expert and those in the control group) showed a positive relation between the hit rate and false alarm rate across the three sessions. Data from the second and third sessions moved in the expected direction with respect to the previous sessions, given the consequences that were experienced (i.e., monetary points earned for hits vs. monetary points lost for false alarms). That is, Condition B (monetary points earned for hits) was always associated with an increase in both hits and false alarms relative to the previous session. Condition C (monetary points lost for false alarms) was always associated with a decrease in both hits and false alarms relative to the previous session. These findings indicate that the procedures reliably biased responding. It should be noted that the outcomes for the 2 participants who experienced the A-B-C sequence of conditions deviated slightly from the predicted response pattern (see Figure 1) in that the data point for the third session (Condition C) did not fall to the left of the first session (Condition A or no consequences). That is, although the participants' hit and false alarm rates were lower than in the previous session, they did not fall below baseline rates (i.e., Condition A). This may have occurred due to their initial low false alarm rates under Condition A or to their exposure to Condition B in the prior session (i.e., sequence effects).

The hit rate for the data-collection expert, who missed five instances of aggression (all categorized as ambiguous), was equal to or greater than that obtained for any participant under Condition A (first scoring session). Although this might suggest that some instances of aggression were not clearly visible in the video, a large number of participants exceeded her hit rate under other conditions. As such, it seems likely that the expert's misses were determined by her criteria for scoring an event as an instance of aggression (rather than her ability to detect the event). This possibility was further explored in Experiment 2.

The mean percentage of clear samples of the target behavior, ambiguous samples of the target behavior, and ambiguous nonexamples of the target behavior scored by participants are shown in Figure 3.
The participants scored nearly all clear samples of aggression, regardless of the condition. On the other hand, the mean percentage of ambiguous events scored by the participants depended on the condition, with a higher percentage scored in Condition B than in the other two conditions.

Figure 2. Hit rates and false alarm rates for the participants in Experiment 1 (left) and Experiment 2 (right) across the three scoring sessions. Data are grouped according to the sequence of conditions experienced. Different line patterns are used for some data sets to assist with visual inspection. The asterisks show the expert's hit and false alarm rates in Experiment 1 and Experiment 2.

Figure 3. Mean percentage of clear samples of aggression, ambiguous samples of aggression, and ambiguous nonexamples of aggression scored under each condition in Experiment 1.

To summarize, results of Experiment 1 indicated that we could reliably bias observers' scoring of aggression by providing feedback and points when ambiguous stimuli were present. These findings are important because observers are likely to encounter ambiguous events in the environment and because a variety of factors may influence observers' decision rules when scoring these events. Thus, strategies are needed for preventing or reducing this problem among observers. The laboratory stimuli and methods examined in Experiment 1 could be used to further evaluate factors that might alter response bias. Identifying factors that decrease the likelihood of response bias under conditions that have been found to reliably bias responding may lead to strategies for improving the accuracy of data collected in the natural environment.

One such factor is the specificity of the response definition. Numerous authors have suggested that operational definitions should be objective (i.e., refer to only observable features of responding), clear (i.e., provide unambiguous descriptions), and complete (i.e., differentiate between responses that should and should not be considered an occurrence; e.g., Cooper, Heron, & Heward, 2007). Nonetheless, surprisingly few studies have examined the role of this factor in the accuracy and reliability of behavioral observation. In Experiment 1, observers were given a relatively general definition of aggression (i.e., hitting, kicking, or throwing items at the teacher). It should be noted that this level of specificity was selected based on a review of articles on aggression published in the Journal of Applied Behavior Analysis. Providing observers with more specific definitions of aggression might result in less flexible criteria for determining whether events should be scored. In other words, by reducing the amount of ambiguity in the stimuli via more precise response definitions, observers may be less susceptible to factors that alter bias. In fact, the conditions found to produce bias in Experiment 1 (receiving brief feedback and information about contingencies) may improve accuracy (i.e., discriminability or sensitivity) among observers who have more specific definitions on which to base their criteria.

On the other hand, one assumption of SDT is that sensitivity and bias are independent. Factors that influence sensitivity (e.g., specificity of the behavior definition) should not influence response bias and vice versa. Nonetheless, some research findings have been inconsistent with this assumption. For example, in Alsop and Porritt (2006), pigeons' responding on a discrimination task showed less sensitivity to changes in reinforcement magnitude (i.e., less response bias) as the discriminability of the stimuli was increased. The interaction between sensitivity and bias was explored in Experiment 2 by giving observers more precise definitions.

EXPERIMENT 2

The purpose of Experiment 2 was to determine whether providing observers with more specific operational definitions of aggression would moderate the effects of feedback and points on response bias, as well as lead to higher levels of accuracy than those obtained in Experiment 1.

Procedure

Seventeen participants scored the same three versions of the 33-min video used in Experiment 1. For 12 participants, factors intended to bias responding were manipulated for two of the three scorings. The remaining 5 participants (control group) were not exposed to the experimental manipulation. The data-collection expert also participated in the experiment by again scoring one version of the 33-min video (she had not received any feedback about her previous scoring). All procedures were identical to those described for Experiment 1, with the exception of the definition for aggression. Participants were given the following definition of aggression, which closely described the topographies of hitting, kicking, and throwing that appeared in the video:

(a) Hitting the teacher, defined as any forceful contact between the student's hand or arm and any part of the teacher; (b) kicking the teacher, defined as any forceful contact between the student's foot or leg and any part of the teacher; and (c) throwing objects at the teacher, defined as any time the student released an object from her hand and it made contact with any part of the teacher.

In addition to this definition, participants were given the following additional instructions (orally and in writing) about the scoring: "Any response that meets the definition of aggression should be scored, even if the response appears to be accidental. Other behaviors that are aggressive in nature should not be scored if they do not meet the definition of hitting, kicking, or throwing objects at the teacher."

Results and Discussion

The hit rates and false alarm rates for the individuals in Experiment 2 are displayed in the right column of Figure 2. The data are grouped according to the sequence of conditions experienced by the participant (i.e., A-B-B, A-C-C, A-B-C, or A-A-A). The hit rate and false alarm rate for the session scored by the data-collection expert are plotted in the top right panel (displayed with an x). Of the 12 participants exposed to the experimental manipulation of bias, 7 consistently showed a positive relation between the hit rate and false alarm rate across the three sessions (5 of the 6 participants in the A-C-C group and both participants in the A-B-C group). For these 7 participants, Condition B (monetary points earned for hits) was always associated with an increase in both hits and false alarms relative to the previous condition experienced, and Condition C (monetary points lost for false alarms) was always associated with a decrease in both hits and false alarms relative to the previous condition experienced. For the remaining 5 participants who were exposed to the experimental manipulation, the points and feedback did not bias their responding consistently. Furthermore, none of these participants showed a systematic increase in accuracy, which would have been indicated by an increase in hits or a decrease in false alarms. However, it should be noted that the high levels of accuracy under Condition A for 1 participant in the A-B-B group left little room for improvement. The expert did not miss any instances of aggression, and, as expected, the control participants did not show evidence of bias across the three scoring sessions.

A comparison of the findings from Experiments 1 and 2 suggests that the type of definition received (general vs. specific) influenced the outcomes. Providing a more specific definition appeared to reduce the likelihood of bias among participants in Experiment 2, given that all of the participants in Experiment 1 showed bias. If so, the variable produced a relatively small effect, however, because more than half of the participants in Experiment 2 still showed evidence of bias. This was somewhat unexpected because the observers had been given definitions and scoring instructions that were more detailed and complete. For example, each form of aggression was explicitly defined, and observers were told to score anything that met this definition even if the response seemed unintentional.

Another feature of the data, however, suggests that the extent of the participants' bias was related to the type of definition received. The amount of change in the hit rate and false alarm rate across the three sessions was typically smaller for participants who received the specific definition than for those who received the general definition. This was particularly noticeable for the false alarm rates. The difference between the highest and lowest hit rate across the three scoring sessions for each participant was averaged across the participants who received the specific definition; a mean difference score of .20 resulted. The same calculation was performed with false alarm rates for participants who received the specific definition and was .23. Mean difference scores among hit rates and mean difference scores among false alarm rates also were calculated for participants who received the general definition (those from Experiment 1) and were .29 and .51, respectively. Two-tailed independent t tests were used to compare the mean difference scores for the hit rates (i.e., comparing .20 from Experiment 2 participants and .29 from Experiment 1 participants) and the mean difference scores for the false alarm rates (i.e., comparing .29 from Experiment 2 participants and .51 from Experiment 1 participants). Results were statistically significant for the experimental group differences in the false alarm rate scores, t(25) = 3.13, p = .004, but results were not statistically significant for the experimental group differences in the hit rate scores, t(25) = 1.99, p = .057. This suggests that the type of definition moderated the effects of the feedback and points on response bias.
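The group comparison described in the preceding paragraph can be reproduced with a standard two-tailed independent t test on the per-participant difference scores. The sketch below uses hypothetical per-participant rates, because the article reports only the group means; it is not the authors' analysis code.

```python
# Illustrative only: difference-score comparison with hypothetical data.
from scipy import stats

def range_score(rates_across_sessions):
    """Difference between the highest and lowest rate across the three sessions."""
    return max(rates_across_sessions) - min(rates_across_sessions)

# Hypothetical false-alarm rates (Sessions 1-3) for a few participants per group
general_definition = [[.10, .55, .30], [.20, .70, .45], [.05, .60, .25]]
specific_definition = [[.10, .30, .15], [.05, .25, .20], [.15, .35, .20]]

general_scores = [range_score(p) for p in general_definition]
specific_scores = [range_score(p) for p in specific_definition]

# Two-tailed independent t test, as in the comparison reported above
t, p = stats.ttest_ind(general_scores, specific_scores)
print(f"t = {t:.2f}, p = {p:.3f}")
```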
The type of definition also appeared to have a small influence on overall accuracy, particularly the false alarm rate. Overall, the mean hit rate for participants in Experiment 2 who were exposed to the experimental manipulation was slightly higher than the mean hit rate for those in Experiment 1 (Ms = .8 and .7, respectively). The participants in Experiment 2 also had a lower false alarm rate than those in Experiment 1 (Ms = .2 and .4, respectively). These differences, however, were not statistically significant, as indicated by the results of two-tailed independent t tests for the hit rates, t(25) = -2.04, p = .052, and false alarm rates, t(25) = 2.02, p = .054. Although participants in the control group who received the specific definition had a slightly lower hit rate than those who received the general definition (Ms = .6 and .7, respectively), they had a lower false alarm rate (Ms = .2 and .4, respectively). Finally, the data-collection expert had a much higher hit rate when given the specific definition (1.0) than when given the general definition (.87). Her false alarm rate remained unchanged (one false alarm).

Together, results of Experiments 1 and 2 suggest that the type of definition provided to observers may alter the likelihood of bias. The consequences manipulated in these experiments, however, were somewhat contrived for the purpose of conducting laboratory research on observer bias. This could potentially limit the generality of these laboratory findings to real-world settings. In the next experiment, the materials and procedures developed in Experiment 1 were used to evaluate factors that may be more likely to occur in research and clinical settings.

EXPERIMENT 3

Kazdin (1977) suggested that providing information about possible changes in behavior during upcoming observations might influence the accuracy of data collection if observers also receive feedback about their recording. In one of the few studies to examine these variables, observers recorded four target behaviors exhibited by 2 children on videotapes (O'Leary, Kent, & Kanowitz, 1975). An experimenter told observers that two of the behaviors were expected to decrease during treatment sessions relative to baseline sessions. In actuality, levels of responding were similar under both conditions. Observers also received social consequences in the form of approval or disapproval for their scoring. Specifically, the experimenter expressed approval (e.g., "Dr. O'Leary will be pleased to see the drop in the level of —"; "These tokens are really having an effect on —.") if the observers' data showed a reduction in responding during treatment sessions relative to baseline sessions. Social disapproval was provided (e.g., "You don't seem to be picking up the treatment effect on —.") if the data showed no reduction or an increase in behavior. Patterns of scoring across the two baseline and two treatment sessions suggested that these factors altered the accuracy of observation.

Nonetheless, it is possible that the social consequences, independent of information about expected behavior change, might have altered the observers' accuracy. It is also unclear whether information about expected behavior change contributed to the outcomes. Furthermore, neither factor has been examined while controlling the ambiguity of the behavioral samples or evaluating the effects on hits versus false alarms.

Thus, the purpose of Experiment 3 was to extend O'Leary et al. (1975) by examining the separate effects of social consequences and information about expected behavior change on response bias using the videotaped samples developed for Experiments 1 and 2. Unlike O'Leary et al., however, the social contingency was based on inaccurate scoring of hits rather than on changes in the amount of behavior scored. This modification was made to simulate more closely the social consequences that might occur when training observers or checking reliability in field settings. That is, a researcher or supervisor who sees evidence of errors might inform observers about the errors (e.g., "It looks like you missed some occurrences of self-injury") and encourage them to score more accurately in the future (among other things). This consequence was similar to that manipulated in Experiments 1 and 2, except that the points were replaced by social consequences only.

Participants and Setting

A total of 32 undergraduate students enrolled in a research and statistics course were recruited for the study. None had previously participated in any experiments conducted by the authors. The students fulfilled a course requirement in their class as a result of their participation. (Students could choose from among a variety of available experiments or select a nonresearch option to satisfy this course requirement.) Of these 32 participants, 8 met the exclusion criteria in the first condition (see description below), resulting in 24 individuals (19 women, 5 men) who completed the study. The mean age of the participants was 30 years (range, 20 years to 53 years).

Materials

Results of the scoring in Experiment 1 were used to create three different videotaped segments from the 33-min video. Each segment contained six clear samples of aggression, six ambiguous samples of aggression, six clear nonexamples of aggression, and six ambiguous nonexamples of aggression. None of the samples appeared in more than one segment. Three different videos were necessary because, for one condition, participants would be told to expect higher levels of problem behavior in the upcoming session. To equate the level of ambiguity across the three video segments, the specific samples for each segment were selected by examining the accuracy of scoring across all participants and conditions in Experiment 1. Clear and ambiguous samples that were associated with similar levels of accuracy were assigned randomly to Segments 1, 2, and 3. Gold standard data records were established for each segment using the same procedures as those described previously.

Procedure

Participants were escorted to the lab and given the same initial written and verbal instructions as the participants in Experiment 1 except that they were told that each behavioral sample lasted 12 s (rather than 2 min to 5 min) and that each video segment lasted 5 min (instead of 33 min). Practice sessions also were conducted as previously described. All participants were exposed to Condition A (baseline) after completing the practice sessions. Participants then were assigned randomly to either Condition B (social consequences) or Condition C (information about expected behavior change) in the second scoring and to the remaining condition in the third scoring. The three video segments were assigned randomly to each condition across participants. Participants received a brief break between each scoring session.

Baseline (A). Prior to the start of the video segment, participants were told to score as accurately as they could. During the session break, the experimenter compared each participant's data record to the gold standard data record. If the participant scored at least five of the six instances of ambiguous aggression or at least five of the six instances of ambiguous noise (aggression nonexamples), they were excluded from the remainder of the study. These exclusion criteria were used because both of the subsequent conditions were expected to increase the number of hits and false alarms. The 8 participants who met the exclusion criteria were thanked and told that they did not need to complete the remainder of the scorings.

Social consequences (B). Prior to scoring, participants were told, "When downloading your data record, I noticed that there were not as many aggressions scored as there should have been. It is really important that you score all the aggression that occurs and that you try harder on the next video."

Information about expected behavior change (C). Prior to scoring, participants were told, "These clips resemble the actual person's behavior on a day when there was more aggression because the behavior change intervention was not being implemented." No feedback was given to the participants about their scoring in the previous session.

Data Analysis

The total number of hits and false alarms in each condition was calculated for each participant. The number of hits and false alarms in Conditions B and C was then subtracted from the number of hits and false alarms in Condition A. The change in hits and false alarms in each condition was examined for each participant to determine (a) the number of participants who showed evidence of response bias (i.e., an increase in both hits and false alarms), (b) the number of participants who showed an increase in sensitivity (i.e., an increase in hits or a decrease in false alarms), and (c) the number of participants who showed a decrease in sensitivity (i.e., a decrease in hits or an increase in false alarms). These results were examined separately for each participant and for the two groups of participants who were exposed to the two conditions in a different order (B-C vs. C-B). The latter analysis was conducted to identify possible sequence effects.
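The classification used in this data analysis can be restated as a small decision rule. The sketch below infers how mixed or zero changes are categorized from the patterns reported in the Results and Discussion section; the exact handling of such cases is not spelled out in the article.

```python
# Illustrative restatement of the change classification described above.
# change_hits / change_fas: a participant's change in hits and false alarms
# in Condition B or C relative to Condition A.
def classify_change(change_hits, change_fas):
    if change_hits > 0 and change_fas > 0:
        return "response bias (increase in both hits and false alarms)"
    if change_hits < 0 and change_fas < 0:
        return "response bias, opposite to the expected direction"
    if change_hits > 0 or change_fas < 0:
        return "increase in sensitivity"
    if change_hits < 0 or change_fas > 0:
        return "decrease in sensitivity"
    return "no change"

print(classify_change(change_hits=2, change_fas=3))    # response bias
print(classify_change(change_hits=1, change_fas=-2))   # increase in sensitivity
```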

Figure 4. Change in the number of hits and false alarms under Condition B (social consequences) and Condition C (information about expected behavior change) relative to Condition A (baseline) for each participant. are displayed on the right. Under Condition B, participants, outcomes suggested an increase in 12 of the 24 participants showed an increase in sensitivity, with 3 participants showing an both hits and false alarms, indicating that social increase in hits only, 1 participant showing a consequences biased responding. One partici- decrease in false alarms only, and 2 participants pant showed a decrease in hits and false alarms, showing a decrease in false alarms combined an outcome that is consistent with response bias with an increase in hits. Similar results were but opposite to the direction expected. An obtained regardless of whether the participant additional 5 participants showed an increase in was exposed to Condition B before or after false alarms only (2 participants) or an increase Condition C. in false alarms combined with a decrease in hits When exposed to Condition C, 5 of the 24 (3 participants). These patterns suggest a participants showed an increase in both hits and decrease in sensitivity. For the remaining 6 false alarms. These 5 participants had been APPLYING SIGNAL-DETECTION THEORY 211 exposed to Condition B before Condition C relation occurred across participants despite and had shown evidence of bias under Condi- varying initial levels of hits and false alarms. tion B. Among the 12 participants who were This approach extends that used in previous exposed to Condition C before Condition B, research on observer accuracy by controlling the none showed evidence of bias. Three partici- nature of the behavioral samples and examining pants showed no change in responding, and 2 the sources of error obtained (i.e., random vs. participants showed evidence of response bias nonrandom error). but opposite to the direction expected (i.e., a Results of Experiments 2 and 3 further decrease in both hits and false alarms). The suggest that this procedure may be useful for remaining response patterns (i.e., increase or advancing our knowledge of factors that decrease in sensitivity) were similar to those for influence response bias, especially when some participants who had experienced Condition B degree of ambiguity is present in the situation. prior to Condition C. Numerous variables may affect the accuracy of Together, these results showed that social data collected in naturalistic settings, such as the consequences but not information about ex- type of instructions received, presence of other pected behavior change altered response bias observers, clarity of the behavioral definitions, among some observers who were scoring clear probability of the behavior, and consequences and ambiguous events in a laboratory setting. for scoring (or not scoring) events as instances Sequence or interaction effects also appeared to of the targeted response (for reviews, see influence the outcomes for observers who were Kazdin, 1977; Repp et al., 1988). In Experi- first exposed to social consequences. That is, the ment 2, providing observers with a more effects of social consequences either carried over complete and detailed definition of aggression into the next scoring session or altered the appeared to reduce the likelihood and amount participants’ responses to the experimenter’s of response bias, as well as decrease the statement about expected behavior change. incidence of false alarms. 
GENERAL DISCUSSION

Preliminary findings support the viability of a procedure based on SDT for evaluating variables that may influence observer accuracy and bias in behavioral assessment. In Experiment 1, individuals who collected data on clear and ambiguous samples of a common target behavior and ambiguous nonexamples of the behavior exhibited predictable patterns of responding. Consistent with previous research on SDT, response bias occurred when observers received brief feedback about their performances and consequences for either hits or false alarms. Changes in scoring were more likely to involve samples designated as ambiguous rather than as clear, providing some support for the designations. The effects of the experimental procedure were robust, in that a consistent relation occurred across participants despite varying initial levels of hits and false alarms. This approach extends that used in previous research on observer accuracy by controlling the nature of the behavioral samples and examining the sources of error obtained (i.e., random vs. nonrandom error).

Results of Experiments 2 and 3 further suggest that this procedure may be useful for advancing our knowledge of factors that influence response bias, especially when some degree of ambiguity is present in the situation. Numerous variables may affect the accuracy of data collected in naturalistic settings, such as the type of instructions received, presence of other observers, clarity of the behavioral definitions, probability of the behavior, and consequences for scoring (or not scoring) events as instances of the targeted response (for reviews, see Kazdin, 1977; Repp et al., 1988). In Experiment 2, providing observers with a more complete and detailed definition of aggression appeared to reduce the likelihood and amount of response bias, as well as decrease the incidence of false alarms. In one of the few previous studies to evaluate this variable, House and House (1979) examined the correlation between the clarity of response definitions and the level of reliability among pairs of observers who collected data on 27 behaviors of children and their parents. One hundred college students read the definitions and ranked them based on clarity. A median clarity rating was then determined for each definition. A Spearman rank-order correlation between these median values and mean reliability scores across the 27 behaviors was not statistically significant (0.25). The authors suggested that all of the definitions might have been reasonably clear due to previous refinements in the definitions. Furthermore, the observers had received extensive training.
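For illustration only, the type of analysis House and House (1979) reported can be reproduced in a few lines; the clarity ratings and reliability scores below are hypothetical placeholders, not their data.

```python
# Sketch of a Spearman rank-order correlation between median clarity ratings
# and mean interobserver reliability, one pair per behavioral definition.
# Values are hypothetical; House and House reported a nonsignificant
# correlation of .25 with their actual data.
from scipy.stats import spearmanr

median_clarity = [4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5]          # per definition
mean_reliability = [0.88, 0.82, 0.79, 0.81, 0.69, 0.70, 0.74]  # per definition

rho, p_value = spearmanr(median_clarity, mean_reliability)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```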

Although results of Experiment 2 suggested that the clarity of the definition may influence observer accuracy, the overall impact on response bias was not as substantial as expected. Nearly half of the participants still showed evidence of response bias. The degree of ambiguity present in the videos or the strength of the biasing factors may have weakened the effects of this variable on responding. Alternatively, the specific definition may not have been as clear as intended. Nonetheless, results suggest that observers may be more resistant to factors that produce response bias when they are given more detailed, specific definitions.

Experiment 3 built on these studies by examining other factors that may bias responding in clinical settings. Instead of combining the general feedback statement with points exchangeable for money, participants were urged to "try harder" after being told that they did not score accurately. This type of social consequence, although still somewhat contrived, may more closely approximate the consequences that occur in clinical situations. Participants' responses to this factor also provided a basis for examining the effects of another variable implicated in previous research on observer accuracy. As noted previously, Kazdin (1977) suggested that information about expected behavior change may influence observer accuracy when combined with feedback and social consequences. Results of Experiment 3, however, showed that the feedback and consequences were responsible for changes in response bias that occurred after observers received information about expected behavior change.

A similar feedback statement was used in Experiments 1 and 3. In both cases, the experimenter told participants that they did not score accurately and that it was important to do so. The participants in Experiment 3, however, were not offered the opportunity to earn points exchangeable for money. This may explain why only half of the participants showed evidence of bias under the social consequences condition in Experiment 3, whereas all of the participants did so in Experiment 1. Other differences in the methods and materials also could explain the discrepant results. It also should be noted that individuals with more extensive training in direct observation techniques may not be as susceptible to bias as the participants in this study. However, as noted previously, behavioral consultants often rely on data collected by caregivers, teachers, and others with limited training and experience.

The purpose of these experimental arrangements was to advance basic and applied research on observer accuracy. A similar procedure may be useful for evaluating other variables that may increase response bias or alter observer sensitivity. These factors include the probability of the behavior, type of instructions or training provided to observers, and the availability of concurrent consequences for the various response options (i.e., hits, correct rejections, misses, and false alarms). Depending on the experimental question and arrangement, it may be helpful to draw on other data analysis methods that are commonly used in SDT and behavioral detection theory research (e.g., receiver operating characteristics, indexes of discriminability and bias; see Irwin & McCarthy, 1998, for an overview of these methods). However, further research is needed to evaluate the utility of these methods for examining observer accuracy and bias in behavioral assessment.
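As a concrete, non-authoritative illustration of the indexes just mentioned (not code used in the present study), discriminability and bias can be computed directly from an observer's hit, miss, false-alarm, and correct-rejection counts; the counts shown below are hypothetical.

```python
# Sketch of SDT indexes (d' for discriminability, c for bias) computed from
# hypothetical observer counts; not part of the authors' analyses.
from scipy.stats import norm

def sdt_indexes(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion_c) for one observer.

    A log-linear correction (adding 0.5 to each cell) keeps the z transform
    finite when an observer produces no misses or no false alarms.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa              # larger = better discrimination
    criterion = -0.5 * (z_hit + z_fa)   # negative = liberal (scoring-prone)
    return d_prime, criterion

# Hypothetical counts before and after exposure to a biasing consequence
print(sdt_indexes(hits=16, misses=4, false_alarms=3, correct_rejections=17))
print(sdt_indexes(hits=19, misses=1, false_alarms=8, correct_rejections=12))
```

In this sketch, the second set of counts yields a nearly unchanged d' but a markedly more liberal criterion, the pattern interpreted as response bias in the discussion above.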
Although the concepts and methods of SDT may prove to be useful to researchers, they have no obvious direct application to the assessment of observer accuracy in clinical settings. Nonetheless, research findings may lead to a greater understanding of observer behavior and, thus, ways to improve the performance of those who collect data in the field. For example, although the monetary points in Experiments 1 and 2 were contrived, it is likely that other types of reinforcement contingencies operate in the natural environment for scoring ambiguous events in one manner or another. Several putative forms of positive and negative reinforcement may be available if data gathered by caregivers and staff alter conclusions about the effectiveness of an intervention or the severity of a problem. For example, teacher-collected data that include numerous false alarms and few misses (i.e., inflated levels of problem behavior) might lead to (a) negative reinforcement in the form of removal of the student from the classroom or (b) positive reinforcement in the form of continued assistance and attention from a behavioral consultant. Alternatively, data that contain numerous misses and few false alarms (i.e., deflated levels of problem behavior) might lead to (a) negative reinforcement via removal of the consultant or (b) positive reinforcement in the form of approval and recognition from the consultant, superiors, and peers. Thus, these findings may have some generality to behavioral assessment in the natural environment.

REFERENCES

Alsop, B., & Porritt, M. (2006). Discriminability and sensitivity to reinforcer magnitude in a detection task. Journal of the Experimental Analysis of Behavior, 85, 41–56.

Cooper, J. O., Heron, T. E., & Heward, W. L. (2007). Applied behavior analysis (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Davison, M., & McCarthy, D. (1987). The interaction of stimulus and reinforcer control in complex temporal discrimination. Journal of the Experimental Analysis of Behavior, 48, 97–116.

DeLeon, I. G., Arnold, K. L., Rodriguez-Catter, V., & Uy, M. L. (2003). Covariation between bizarre and nonbizarre speech as a function of the content of verbal attention. Journal of Applied Behavior Analysis, 36, 101–104.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

House, B. J., & House, A. E. (1979). Frequency, complexity, and clarity as covariates of observer reliability. Journal of Behavioral Assessment, 1, 149–165.

Irwin, R. J., & McCarthy, D. (1998). Psychophysics: Methods and analyses of signal detection. In K. Lattal & M. Perone (Eds.), Handbook of research methods in human operant behavior (pp. 291–321). New York: Plenum.

Kazdin, A. E. (1977). Artifact, bias, and complexity of assessment: The ABCs of reliability. Journal of Applied Behavior Analysis, 10, 141–150.

McFall, R. M., & Treat, T. A. (1999). Quantifying the information value of clinical assessments with signal detection theory. Annual Review of Psychology, 50, 215–241.

Mudford, O. C., Martin, N. T., Hui, J. K. Y., & Taylor, S. A. (2009). Assessing observer accuracy in continuous recording of rate and duration: Three algorithms compared. Journal of Applied Behavior Analysis, 42, 527–539.

Nevin, J. A., Olson, K., Mandell, C., & Yarensky, P. (1975). Differential reinforcement and signal detection. Journal of the Experimental Analysis of Behavior, 24, 355–367.

O'Leary, K. D., Kent, R. N., & Kanowitz, J. (1975). Shaping data collection congruent with experimental hypotheses. Journal of Applied Behavior Analysis, 8, 43–51.

Repp, A. C., Nieminen, G. S., Olinger, E., & Brusca, R. (1988). Direct observation: Factors affecting the accuracy of observers. Exceptional Children, 55, 29–36.

Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.

Swets, J. A. (1996). Signal detection theory and ROC analysis in psychology and diagnostics: Collected papers. Mahwah, NJ: Erlbaum.
Received February 16, 2009
Final acceptance August 21, 2009
Action Editor, Gregory P. Hanley