Archives of Forensic Psychology, 2014, Vol. 1, No. 1, 1–13
© 2014 Global Institute of Forensic Psychology. ISSN 2334-2749

Crime on the Border: Use of Evidence in Customs Interviews

Pär Anders Granhag Department of Psychology, University of Gothenburg, Gothenburg, Sweden; Norwegian Police University College, Oslo, Norway; Department of Psychology, University of Oslo, Oslo, Norway

Franziska Clemens Department of Psychology, University of Gothenburg, Gothenburg, Sweden

Leif A. Strömwall Department of Psychology, University of Gothenburg, Gothenburg, Sweden

Erik Mac Giolla∗ Department of Psychology, University of Gothenburg, Box 500, 405 30 Gothenburg, Sweden, [email protected]

Customs officers are an understudied population in psycho-legal research. The present study is an attempt to fill this gap by examining customs officers’ interview strategies. Specifically, in an experimental setup, we examined how customs officers (N = 80) planned to use evidence in an investigative interview. Half the customs officers were members of a crime-fighting unit (less experienced interviewers) and half were members of an investigative unit (more experienced interviewers). Participants were randomly assigned to one of two evidence conditions: weak or strong. Participants extracted evidence from a mock crime brief and described how they would use this evidence in a suspect interview. Of the extracted pieces of evidence, only 15% were planned to be used in a strategic manner. Evidence strength did not influence whether participants planned to use evidence strategically. The results showed that members of the crime-fighting unit were less likely to use evidence strategically than members of the investigative unit. Completed training courses were associated with an improvement in the planned use of evidence. In contrast, experience in years was associated with a poorer planned use of evidence. Taken together, the results imply that if customs officers are to improve with regard to their use of evidence in suspect interviews, explicit and systematic training may be more effective than experience.

Keywords: Customs officers, Suspect interviews, Evidence strength, Strategic interviewing

Customs officers represent an understudied population in psycho-legal research. They are, however, faced with many issues that are relevant to legal psychology. One example, and the focus of this study, is suspect interviewing. A primary aim of suspect interviews is to obtain information to resolve unsolved issues relating to a crime, such as a suspect’s potential role in it (Memon, Vrij, & Bull, 2003). A growing number of researchers argue that central to the task of suspect interviewing is (1) the evidence that the interviewer has against the suspect and (2) how this evidence is used during the interview (for reviews see Bull, 2014; Granhag & Hartwig, 2015). The more specific focus of the current study is therefore on customs officers’ planned use of evidence in suspect interviews. To our knowledge, this is the first study on customs officers’ interview strategies. For this reason, our literature review will focus on the related field of evidence use in police interviews. The best practice on how to use evidence in an interview remains open, with accepted practices varying from country to country. For instance, a study of real-life police interviews in the US found that interviewers disclosed the evidence against the suspect (together with a suggestion of guilt) at the beginning of the interview in 85% of cases (Leo, 1996). Such findings likely reflect the influence of police manuals advocating the Reid Technique (Inbau, Reid, Buckley, & Jayne, 2011) which recommends, amongst other things, confronting a suspect with much, but not all, of the evidence at the outset of an interview. This is done to emphasize the degree and severity of the evidence the interviewer holds against the suspect. In contrast, in the UK, results have shown that in only a minority of cases (slightly above 10%) was the evidence disclosed at the beginning of an interview (Moston & Engelberg, 1993). 
These differences are likely due to the differing goals of suspect interviews in the two countries: In the US interviews are typically confession-focused, whereas investigative interviews conducted in the UK focus on information gathering (Gudjonsson, 2003). In a recent study, Japanese police officers were asked about their use of evidence during interviews (Wachi et al., 2014). Statements relating to police behavior were rated on a five-point scale (1 = never occurs; 5 = always occurs). Mean ratings showed that police rarely confronted suspects with evidence, while they were somewhat more likely to highlight statement-evidence inconsistencies (for an overview of related literature see Moston & Engelberg, 2011). Researchers have also examined how the strength of evidence influences how police carry out interviews. For example, Moston, Stephenson, and Williamson (1992) found that interviewers chose an accusatory (vs. information-gathering) interviewing style more often when there was strong (vs. weak) evidence against the suspect. An accusatory interviewing style is often characterized by confronting the suspect with the available evidence at the outset of the interview (Moston & Stephenson, 1993). Hence, when the evidence is strong and the interviewer therefore focuses on eliciting a confession from the suspect, s/he might consider an accusatory interview style an effective approach. This finding was qualified by a recent questionnaire study, in which police officers’ evidence disclosure tactics were examined in both strong and weak evidence conditions (Smith & Bull, 2014). The vast majority of officers reported that they would choose to disclose evidence to the suspect either gradually throughout the interview or at the end of the interview. Furthermore, a majority did not change their disclosure tactic depending on evidence strength. 
It should be noted, however, that the populations differed between the two studies (the majority of officers, approximately 80%, in the study by Smith and Bull, were Australian, while officers in the study by Moston et al. were from the UK).

Additionally, there was a 20-year difference between the two data collections. Considering how standard practices vary from country to country and are likely to vary over time, these differences may account for the observed discrepancies.

In recent years researchers have turned to the laboratory to answer questions as to how best to use evidence during suspect interviews. Here the focus has primarily been on when evidence should be disclosed. Much of this research speaks against the early disclosure of evidence and instead suggests disclosing evidence either at the end of the interview or gradually throughout the interview (Bull, 2014; Hartwig, Granhag, Strömwall, & Kronkvist, 2006; Sellers & Kebbell, 2009). Late disclosure and gradual disclosure increase the chances of statement-evidence inconsistencies, an important indicator of deceit (Clemens, Granhag, & Strömwall, 2011; Granhag, Strömwall, Willén, & Hartwig, 2013; Hartwig, Granhag, Strömwall, & Vrij, 2005; Jordan & Hartwig, 2013; Sorochinski et al., 2014). Late and gradual disclosure have also been shown to improve the accuracy of judgments of guilt and innocence (Dando & Bull, 2011; Dando, Bull, Ormerod, & Sandham, 2013; Hartwig et al., 2006) and to result in more true confessions than early disclosure (Sellers & Kebbell, 2009). Some issues remain moot regarding whether to disclose evidence late or gradually: for instance, Dando and Bull (2011) found gradual disclosure to be more effective when detecting deceit than late disclosure, while Sorochinski et al. (2014) found late disclosure to be more effective than gradual disclosure. Nevertheless, the results consistently speak against the early disclosure of evidence.
Many scholars have raised a number of disadvantages to the early disclosure of evidence, including: (1) it does not take into account evasive strategies (e.g., withholding information, denial strategies) that guilty suspects typically employ during interviews (Hartwig, Granhag, & Strömwall, 2007); (2) by disclosing known evidence early in an interview, guilty suspects are better equipped to create statements that are consistent both with the evidence and with their claimed innocence (Sellers & Kebbell, 2009); and (3) there are generally more risks involved when disclosing compared with not disclosing information (Sellers & Kebbell, 2009). Sellers and Kebbell highlight a number of potential risks associated with evidence disclosure, including: (1) a diminished interviewer-suspect rapport (if the evidence is presented confrontationally); (2) a risk that the evidence is inaccurate (which may reduce the credibility of the interviewer in the eyes of the suspect and in turn the likelihood of true confessions; Kebbell, Hurren, & Roberts, 2006); and (3) the fact that once the evidence has been used, it cannot be unused.

The Current Study

The common theme of the research on this topic is a belief that evidence can and should be used in a strategic manner during suspect interviews. The primary aim of the current study was to examine the extent to which customs officers would use evidence in a strategic vs. a non-strategic manner. In addition, we examined whether experience, training, and evidence strength would influence the planned use of evidence.

Method

Participants and Design

A total of 80 Swedish customs officers (32 women, 48 men; M_age = 45.74 years, SD_age = 10.91), in central and southern Sweden, voluntarily participated in the study. Half of them were members of a crime-fighting unit (15 women, 25 men; M_age = 47.05 years, SD_age = 11.33) and the other half were members of an investigative unit (17 women, 23 men; M_age = 44.43 years, SD_age = 10.45). The participants in both groups were randomly assigned to the two evidence conditions; half received a case file containing weak evidence pointing to the suspect’s guilt (weak evidence) and half received a case file containing strong evidence pointing to the suspect’s guilt (strong evidence). Hence, the study employed a 2 (Unit: crime-fighting vs. investigative) × 2 (Evidence Strength: weak vs. strong) design. Ethical approval for the study was granted by Swedish Customs.

Customs Officers’ Interview Experience and Training

Members of the crime-fighting unit had less interview experience. This unit’s main task is the inspection of travelers and goods and/or surveillance work. The members of this unit have investigative interviews as a secondary work task and typically only in urgent situations when preparation is not possible (e.g., field interviews). Officers in this group receive basic information on suspect interviewing in the compulsory course, Criminal Investigative Training 1 (CIT1). CIT1 consists of basic training in criminal investigation, crime-fighting, and suspect interviewing. Some of the older officers were employed before the present courses had begun. The details on the training these officers received are not accessible. Those with more interview experience were members of the investigative unit. This unit can be compared to drugs and fraud investigators at the police. Investigative interviewing is the main work task for this group, including both field interviews and formal suspect interviews where they would have the time to prepare. Members of this group are typically recruited from the crime-fighting unit and hence will have completed CIT1. Most members of this unit have also completed additional courses with a greater focus on investigative interviewing, such as Criminal Investigative Training 2 (CIT2) and Criminal Investigative Training 3 (CIT3). CIT2 includes, amongst other things, further training on interview procedure (including feedback on mock interviews, and instruction on how to avail of evidence during interviews). CIT3 is a course conducted by criminal investigators at the Swedish Police College, and includes further training on interview methods such as the analysis of difficult or failed real-life interviews that the customs officers had conducted. The strength of the evidence was manipulated in the case file that the officers received. 
Due to the lack of previous research on customs officers’ interview strategies, the data analyses should be seen as largely explorative.

Materials and Procedure

Study materials were sent to the customs officers at Swedish Customs by postal mail. These materials included informed consent (explaining that participation was completely voluntary), written instructions, background information about a fictitious case, and a questionnaire. In the instructions, the participants were told that the purpose of the study was to explore how they plan and execute investigative interviews depending on different case characteristics and their work experience. The fictitious case was a smuggling vignette involving four persons. To increase ecological validity the case was constructed with the help of an experienced customs officer who did not take part in the study. Participants were asked to plan for an interview with one of the suspects on short notice based on the background information provided. Depending on the experimental condition, the background information included either weak or strong evidence pointing to the suspect’s guilt. For example, in the weak evidence condition, police found drugs in a false bottom inside a suitcase, but the suspect’s fingerprints were found only on the outside of the suitcase, whereas in the strong evidence condition, police found the suspect’s fingerprints on both the outside and the inside of the suitcase (see Appendix).

After reading the vignette, the officers were asked to fill out a questionnaire that included questions about their demographic background, their work experience as customs officers and investigators (in years), and about their training background. The question on training specifically asked how many of the criminal investigative training courses they had completed (0 = no courses completed; 1 = CIT1 completed; 2 = CIT1 + CIT2 completed; 3 = CIT1 + CIT2 + CIT3 completed). Next, the participants were asked to rate, on a seven-point Likert scale, the strength of the evidence against the suspect (1 = very weak; 7 = very strong) and then to list what they perceived to be the important pieces of evidence (“The crime brief contains a number of different pieces of evidence. Please list those that you believe are the most important.”).
After listing the evidence that participants believed was important, the customs officers were asked to explain how they would use these pieces of evidence in a suspect interview (“How would you present and formulate questions related to each piece of evidence you listed?”). All participants filled out the questionnaires individually.

Coding

How many pieces of evidence the officers considered to be important were summed and each piece was coded in accordance with how it was planned to be used. Three primary codings were used: non-strategic use of evidence, strategic use of evidence, and an other category that was used for situations when it was unclear how the interviewers were to use the evidence or when they did not describe how they were to use the evidence at all. Evidence was coded as having been used in a non-strategic manner when the customs officer stated that they would merely ask the suspect to explain the evidence and in doing so reveal to the suspect what evidence they held. We argue that this is a non-strategic use of evidence because it does not take into account the risks associated with evidence disclosure such as how guilty suspects can create statements to match the disclosed evidence (Sellers & Kebbell, 2009). Evidence was coded as having been used in a strategic manner if the goal was to systematically exhaust all alternative explanations the suspect could give without disclosing the evidence to the suspect. We argue that this is a strategic use of evidence because it increases the chance of diagnostic cues to deception (Granhag & Hartwig, 2015), while at the same time limiting the risks associated with evidence disclosure highlighted by Sellers and Kebbell (2009).

One person coded all 80 written statements. A random sample of 16 statements (20%) was then coded by a second person to examine interrater reliability. Both coders were trained in the coding procedure and both were blind to the hypotheses and experimental conditions. The agreement between the two coders was acceptable: 82.22% agreement for the codings on the number of identified pieces of evidence, and an intraclass correlation coefficient of 0.78 (95% CI [0.50, 0.92]) as well as a Cohen’s κ of 0.77 (95% CI [0.63, 0.91]) for the codings on evidence use. For each participant a ratio was calculated for strategically used pieces of evidence as well as for non-strategically used pieces of evidence. These ratios were calculated by dividing the number of strategically used and non-strategically used pieces of evidence, respectively, by the total number of pieces of evidence that the participant extracted. Consider, for example, a participant that extracted five pieces of evidence, of which one piece was used strategically, three pieces were used non-strategically, and one piece could not be coded. This would result in a strategically used ratio of 0.2 (1/5), and a non-strategically used ratio of 0.6 (3/5). The main analyses were carried out on ratios because the number of extracted pieces of evidence differed between participants.
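
The ratio calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' analysis script: the function name `evidence_use_ratios` and the label strings `"strategic"`, `"non_strategic"`, and `"other"` are hypothetical names introduced here.

```python
from collections import Counter


def evidence_use_ratios(codes):
    """Return the strategic and non-strategic ratios for one participant.

    `codes` holds one label per extracted piece of evidence. The label
    strings are hypothetical and chosen only for this illustration.
    """
    if not codes:
        raise ValueError("participant extracted no pieces of evidence")
    counts = Counter(codes)
    total = len(codes)  # uncodable ("other") pieces stay in the denominator
    return {
        "strategic": counts["strategic"] / total,
        "non_strategic": counts["non_strategic"] / total,
    }


# Worked example from the text: five pieces, one strategic,
# three non-strategic, one that could not be coded.
example = ["strategic", "non_strategic", "non_strategic", "non_strategic", "other"]
ratios = evidence_use_ratios(example)
print(ratios)
```

Because pieces that could not be coded remain in the denominator, a participant's two ratios need not sum to 1.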

Results

Preliminary Analyses

The manipulation check on evidence strength was successful, with customs officers perceiving the weak evidence (M = 4.45, SD = 1.15) as weaker than the strong evidence (M = 5.63, SD = 1.31), t(78) = 4.25, p < 0.001, r = 0.434, 95% CI [0.236, 0.596]. To check that the crime-fighting and investigative groups did not differ on potentially moderating characteristics, the two groups were compared on the following variables: age, gender, work experience (in years), and training (number of courses completed). The only significant difference found was for training. On average, members of the crime-fighting unit had completed fewer CIT courses (M = 1.75, SD = 1.25) than members of the investigative unit (M = 2.35, SD = 0.74), t(74) = 2.51, p = 0.015, r = 0.273, 95% CI [0.057, 0.465]. Participants had on average over 20 years of work experience as customs officers. The crime-fighting unit (M = 24.80, SD = 12.87) and the investigative unit (M = 20.38, SD = 12.87) did not significantly differ on years of total work experience, t(78) = 1.50, p = 0.137, r = 0.167, 95% CI [-0.054, 0.373]. There were no significant age or gender differences between the crime-fighting unit and the investigative unit (ps > 0.05).

Main Analyses

A total of 386 pieces of evidence were extracted, with an average of 4.84 (SD = 2.08) pieces extracted per participant. Investigative units and crime-fighting units extracted a similar number of evidence pieces, t(78) = 1.24, p = 0.219, r = 0.139, 95% CI [-0.083, 0.348]. In addition, the number of extracted pieces of evidence did not differ in the weak compared with the strong evidence conditions, t(78) = 1.57, p = 0.120, r = 0.175, 95% CI [-0.046, 0.380]. Of the total number of pieces of evidence, only 15.28% were coded as strategically used and 44.04% were coded as non-strategically used, while the remaining 40.68% could not be coded into either category. A paired samples t-test showed that participants were more likely to plan to use evidence in a non-strategic manner (M = 2.13, SD = 2.06) than in a strategic manner (M = 0.74, SD = 1.20), t(79) = 5.02, p < 0.001, r = 0.494, 95% CI [0.308, 0.644].

The remaining analyses were conducted on the ratios for strategically and non-strategically used pieces of evidence. A 2 (Evidence Strength: weak vs. strong) × 2 (Unit: crime-fighting vs. investigative) between-subjects Analysis of Variance (ANOVA) was performed, with the ratio of strategically used pieces of evidence as the dependent variable. The unit the officer belonged to had a significant effect, F(1, 76) = 12.77, p = 0.001, r = 0.379, 95% CI [0.174, 0.553]. Means show that members of the investigative unit were more likely to plan to use the evidence in a strategic manner compared with members of the crime-fighting unit (Table 1). The main effect of evidence strength was non-significant, F(1, 76) = 0.002, p = 0.961, r = 0.005, 95% CI [-0.218, 0.228]. Nor was the evidence strength × unit interaction effect significant, F(1, 76) = 1.717, p = 0.194, r = 0.14, 95% CI [-0.082, 0.349].

A second 2 (Evidence Strength: weak vs. strong) × 2 (Unit: crime-fighting vs. investigative) ANOVA was performed with the ratio of non-strategically used pieces of evidence as the dependent variable. The results complement those of the previous ANOVA. Again, the unit the officer belonged to had a significant effect, F(1, 76) = 6.255, p = 0.015, r = 0.276, 95% CI [0.060, 0.467], with means showing that members of the crime-fighting unit were more likely to plan to use the evidence in a non-strategic manner than members of the investigative unit (see Table 1). Furthermore, the main effect of evidence strength was non-significant, F(1, 76) = 0.757, p = 0.387, r = 0.100, 95% CI [-0.122, 0.313]. Nor was the evidence strength × unit interaction effect significant, F(1, 76) = 1.464, p = 0.498, r = 0.078, 95% CI [-0.144, 0.293].
Taken together, the results imply that the occupation of the officer is a greater predictor of his/her planned use of evidence than the evidence strength.
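
The text does not state how the effect sizes r accompanying the t and F statistics were obtained. One common conversion, assumed here, is r = sqrt(t² / (t² + df)) for a t-test and r = sqrt(F / (F + df_error)) for an F-test with one numerator degree of freedom. The sketch below (function names `r_from_t` and `r_from_f` are hypothetical) reproduces several of the reported values to three decimals, though not necessarily all of them.

```python
import math


def r_from_t(t, df):
    """Effect size r from a t statistic (assumed conversion, see lead-in)."""
    return math.sqrt(t ** 2 / (t ** 2 + df))


def r_from_f(f, df_error):
    """Effect size r from an F statistic with 1 numerator df.

    With one numerator df, F equals t squared, so this matches r_from_t.
    """
    return math.sqrt(f / (f + df_error))


# Values taken from the Results section.
print(round(r_from_t(4.25, 78), 3))    # manipulation check, reported r = 0.434
print(round(r_from_f(12.77, 76), 3))   # unit effect, strategic use, reported r = 0.379
print(round(r_from_f(6.255, 76), 3))   # unit effect, non-strategic use, reported r = 0.276
```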

Table 1. Descriptive statistics on the number and nature of extracted pieces of evidence

                       Extracted pieces     Strategic use    Non-strategic use
                       of evidence
                       M (SD)               M (SD)           M (SD)
Crime-fighting unit    4.55 (1.62)          0.07 (0.19)      0.55 (0.40)
Investigative unit     5.13 (2.41)          0.26 (0.28)      0.34 (0.34)
Weak evidence          4.48 (1.87)          0.16 (0.25)      0.48 (0.37)
Strong evidence        5.20 (2.24)          0.16 (0.27)      0.41 (0.39)

Note: M = Mean; SD = Standard Deviation. Means and standard deviations are for the main effects of unit and evidence strength.

As noted earlier, a significant difference existed between the crime-fighting unit and the investigative unit with regard to training courses completed. To investigate this issue further, correlations were computed between the number of relevant courses completed and the ratios of strategically used and non-strategically used pieces of evidence. The number of completed courses was positively correlated with a strategic use of evidence, r(74) = 0.258, 95% CI [0.035, 0.457], p = 0.024, and negatively correlated with a non-strategic use of evidence, r(74) = -0.222, 95% CI [-0.426, 0.004], p = 0.053 (the latter correlation narrowly missed the 0.05 significance level), suggesting that training improves the planned use of evidence in suspect interviews. Correlations were also computed with years of work experience. In contrast to training, years of experience was negatively correlated with a strategic use of evidence, r(78) = -0.242, 95% CI [-0.438, -0.024], p = 0.031, and positively correlated with a non-strategic use of evidence, r(78) = 0.250, 95% CI [0.032, 0.445], p = 0.025. Finally, to further examine the role of work experience, correlations were calculated for members of the investigative unit only, as only this unit had suspect interviewing as a primary task. The correlations were not significant for either strategically used pieces of evidence, r(38) = -0.040, 95% CI [-0.347, 0.275], p = 0.82, or for non-strategically used pieces of evidence, r(38) = 0.118, 95% CI [-0.201, 0.414], p = 0.468. Taken together, the results suggest that experience does not improve the planned use of evidence in suspect interviews and may in fact have a negative influence.
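
The confidence intervals reported for the correlations are consistent with the standard Fisher z-transformation, although the text does not say which method was used. The sketch below assumes that method and assumes a sample size of n = df + 2 for a correlation reported as r(df); the function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z-transformation.

    Assumes n paired observations; z_crit is the two-sided 95% normal quantile.
    """
    z = math.atanh(r)              # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Years of experience vs. strategic use: r(78) = -0.242, so n = 80.
print([round(x, 3) for x in fisher_ci(-0.242, 80)])
# Courses completed vs. strategic use: r(74) = 0.258, so n = 76.
print([round(x, 3) for x in fisher_ci(0.258, 76)])
```

Under these assumptions, the first interval lies entirely below zero and the second entirely above it, matching the direction and significance of the two correlations.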

Discussion

The aim of the present study was to examine how customs officers planned to use evidence during suspect interviews. The primary finding was that customs officers planned to use evidence in a non-strategic manner more often than in a strategic manner. This result is particularly striking considering that the experimental context was substantially less demanding and complex than real-life situations: officers were simply asked to make a plan for how they would use extracted pieces of evidence during an interview. Hence, it is likely that the results would be more pronounced when customs officers prepare for real-life interviews, where there is more at stake, as well as limited resources and time constraints which may negatively influence planning decisions.

It was found that members of the investigative unit planned to use more of the extracted pieces of evidence in a strategic manner than the members of the crime-fighting unit. To explore this finding further, we examined the officers’ training and years of experience. On average, members of the investigative unit had completed more criminal investigative training courses than members of the crime-fighting unit. In addition, a higher number of completed courses was associated with a more strategic use of evidence. In contrast, years of experience appeared to have a negative effect on how evidence was planned to be used: years of experience was associated with a reduction in strategic use and an increase in non-strategic use of evidence. Taken together, these findings highlight the value of training compared with experience. Such an interpretation would imply the need for further training in the use of evidence over and above knowledge that can be gained from experiential learning. This interpretation is in line with the results of a recent study aimed at training US law enforcement officers to detect deceit (Luke et al., 2014).
In the study, experience (measured in both years working as a law enforcement officer as well as the number of interviews conducted) was not correlated with ability to detect deceit. However, a four-hour training session on strategically using evidence in interviews to elicit statement-evidence inconsistencies improved detection accuracy by over 20%. This corroborates our finding concerning the importance of explicit and systematic training compared with learning solely from experience.

No significant effect was found for evidence strength: officers in the weak and strong conditions planned to use the evidence in a similar manner. This finding is in agreement with the recent study by Smith and Bull (2014) which found that a majority of police officers do not change their disclosure strategy depending on evidence strength. A limitation of both the current study and that of Smith and Bull is that vignettes rather than real-life interviews were used. It could be that evidence strength influences interview strategy once officers are in the interview room working real-life cases, as observed by Moston, Stephenson, and Williamson (1990, as cited in Moston, Stephenson, & Williamson, 1992). An additional issue concerns the evidence strength manipulation. Although the manipulation check showed that customs officers perceived the evidence as weaker in the weak (vs. strong) evidence condition, both groups rated the evidence as generally quite strong, with mean ratings above the mid-point. Therefore, an alternative interpretation may be that evidence strength did not influence the interviewers’ planned disclosure pattern because the difference in evidence strength was not great enough, or, more specifically, because the weak evidence condition was not weak enough.

Limitations

A number of limitations in the present study are worth noting. First, the study only concerns the pre-interview plans the customs officers made concerning evidence use. It does not speak to whether or how such plans would materialize in subsequent interviews. Future research could address this by having officers perform mock interviews. A second issue concerns our coding of strategic vs. non-strategic use of evidence. Our codings were derived from research on the topic that warns against the early disclosure of evidence. Nonetheless, alternative opinions on how evidence should be used in interviews do exist, opinions that may call into question the codings used. For instance, Inbau et al. (2011) recommend confronting the suspect with available evidence in an attempt to increase the suspect’s perception of the amount of evidence against them. A final and related issue concerns the large amount of extracted evidence (40.68%) that could not be coded. In many of these cases, participants listed pieces of evidence but did not explain how each would be used in an interview. Other explanations of evidence use simply could not be coded into our categorizations of strategic/non-strategic because the descriptions were too vague (e.g., “I don’t know how much I’d bother with the fingerprint evidence since we also had the CCTV footage.”). A recommendation for future research is to include and develop additional terms and gradations for what can be considered as strategic or non-strategic uses of evidence.

Conclusions

The primary findings of the current study showed that (1) customs officers generally planned to use evidence in a non-strategic manner, (2) training courses were associated with a better planned use of evidence, and (3) years of experience was associated with a poorer planned use of evidence. Taken together, these results imply that if customs officers are to improve with regard to their use of evidence in suspect interviews, explicit training may be more effective than experience.

References

Bull, R. (2014). When in interviews to disclose information to suspects and to challenge them? In R. Bull (Ed.), Investigative interviewing (pp. 167–181). New York: Springer.

Clemens, F., Granhag, P. A., & Strömwall, L. (2011). Eliciting cues to false intent: A new application of strategic interviewing. Law and Human Behavior, 35(6), 512–522. doi: 10.1007/s10979-010-9258-9

Dando, C. J., & Bull, R. (2011). Maximising opportunities to detect verbal deception: Training police officers to interview tactically. Journal of Investigative Psychology and Offender Profiling, 8(2), 189–202. doi: 10.1002/jip.145

Dando, C. J., Bull, R., Ormerod, T. C., & Sandham, A. L. (2013). Helping to sort the liars from the truth-tellers: The gradual revelation of information during investigative interviews. Legal and Criminological Psychology. doi: 10.1111/lcrp.12016

Granhag, P. A., & Hartwig, M. (2015). The strategic use of evidence (SUE) technique: A conceptual overview. In P. A. Granhag, A. Vrij, & B. Verschuere (Eds.), Deception detection: New challenges and cognitive approaches (pp. 231–251). Chichester, England: John Wiley & Sons.

Granhag, P. A., Strömwall, L. A., Willén, R. M., & Hartwig, M. (2013). Eliciting cues to deception by tactical disclosure of evidence: The first test of the evidence framing matrix. Legal and Criminological Psychology, 18(2), 341–355. doi: 10.1111/j.2044-8333.2012.02047.x

Gudjonsson, G. (2003). The psychology of interrogations and confessions: A handbook. Chichester, England: John Wiley & Sons.

Hartwig, M., Granhag, P. A., & Strömwall, L. A. (2007). Guilty and innocent suspects’ strategies during police interrogations. Psychology, Crime & Law, 13(2), 213–227. doi: 10.1080/10683160600750264

Hartwig, M., Granhag, P. A., Strömwall, L. A., & Kronkvist, O. (2006). Strategic use of evidence during police interviews: When training to detect deception works. Law and Human Behavior, 30(5), 603–619. doi: 10.1007/s10979-006-9053-9

Hartwig, M., Granhag, P. A., Strömwall, L. A., & Vrij, A. (2005). Detecting deception via strategic disclosure of evidence. Law and Human Behavior, 29(4), 469–484. doi: 10.1007/s10979-005-5521-x

Inbau, F., Reid, J., Buckley, J., & Jayne, B. C. (2011). Criminal interrogation and confessions (5th ed.). Burlington, MA: Jones & Bartlett Learning.

Jordan, S., & Hartwig, M. (2013). On the phenomenology of innocence: The role of belief in a just world. Psychiatry, Psychology and Law, 20(5), 749–760. doi: 10.1080/13218719.2012.730903

Kebbell, M. R., Hurren, E. J., & Roberts, S. (2006). Mock-suspects’ decisions to confess: The accuracy of eyewitness evidence is critical. Applied Cognitive Psychology, 20(4), 477–486. doi: 10.1002/acp.1197

Leo, R. A. (1996). Inside the interrogation room. Journal of Criminal Law & Criminology, 86(2), 266–303.

Luke, T., Hartwig, M., Joseph, E., Brimbal, L., Chan, G., Dawson, E., & Granhag, P. A. (2014). Training in the strategic use of evidence technique: Improving deception detection accuracy of American law enforcement officers. (Manuscript submitted for publication.)

Memon, A., Vrij, A., & Bull, R. (2003). Psychology and law: Truthfulness, accuracy and credibility. Chichester, England: John Wiley & Sons.

Moston, S., & Engelberg, T. (1993). Police questioning techniques in tape recorded interviews with criminal suspects. Policing and Society, 3(3), 223–237. doi: 10.1080/10439463.1993.9964670

Moston, S., & Engelberg, T. (2011). The effects of evidence on the outcome of interviews with criminal suspects. Police Practice and Research, 12(6), 518–526. doi: 10.1080/15614263.2011.563963

Moston, S., & Stephenson, G. M. (1993). The changing face of police interrogation. Journal of Community & Applied Social Psychology, 3(2), 101–115. doi: 10.1002/casp.2450030204

Moston, S., Stephenson, G. M., & Williamson, T. M. (1990). The incidence, antecedents and consequences of the use of the right to silence during police questioning. (Unpublished manuscript.)

Moston, S., Stephenson, G. M., & Williamson, T. M. (1992). The effects of case characteristics on suspect behaviour during police questioning. British Journal of Criminology, 32(1), 23–40.

Sellers, S., & Kebbell, M. R. (2009). When should evidence be disclosed in an interview with a suspect? An experiment with mock-suspects. Journal of Investigative Psychology and Offender Profiling, 6(2), 151–160. doi: 10.1002/jip.95

Smith, L. L., & Bull, R. (2014). Exploring the disclosure of forensic evidence in police interviews with suspects. Journal of Police and Criminal Psychology, 29(2), 81–86. doi: 10.1007/s11896-013-9131-0

Sorochinski, M., Hartwig, M., Osborne, J., Wilkins, E., Marsh, J., Kazakov, D., & Granhag, P. (2014). Interviewing to detect deception: When to disclose the evidence? Journal of Police and Criminal Psychology, 29(2), 87–94. doi: 10.1007/s11896-013-9121-2

Wachi, T., Watanabe, K., Yokota, K., Otsuka, Y., Kuraishi, H., & Lamb, M. (2014). Police interviewing styles and confessions in Japan. Psychology, Crime & Law, 20(7), 673–694. doi: 10.1080/1068316X.2013.854791

Author Note

The authors declared that there are no conflicts of interest and that no funding was received for the conduct of this study. We would like to thank Carin Ingefors at Swedish Customs for the collection of the data and valuable input to the planning of the study.


Appendix

Case Information the Officers Received

The information marked in bold within the square brackets differed between the weak and the strong evidence conditions. Information not in square brackets was identical for the two conditions.

Background

You are quickly brought on to a new case concerning four detainees, all of whom (in preliminary interviews) have denied any criminal involvement. Your task is to interview Fredrik Karlsson, who has been unwilling to answer questions without an attorney present. An attorney has now arrived and Fredrik is now willing to be interviewed. You are to conduct an interview with Fredrik on short notice based on the information you can obtain from the brief below.

Brief

At 18.30 Fredrik Karlsson (b. 1983) arrived from Denmark in a Swedish-registered Volvo. Although Fredrik appeared nervous, the border check found that he was transporting a legal amount of alcohol in accordance with his testimony about his upcoming 25th birthday party. Fredrik nonetheless seemed uneasy. He said he was traveling alone, but [weak evidence: two unused toothbrushes] [strong evidence: women’s underwear and a bra] were found in a bag as well as a note with “22.30” written on it. A detection dog vigorously nosed the back seat but nothing else was found. Fredrik was allowed to leave. He was initially tracked, but was lost shortly thereafter. Some 15 minutes later his car was found, and Fredrik and a woman with a suitcase were seen exiting a nearby McDonalds restaurant. A group of youths were hanging out near Fredrik’s car, and Fredrik and the woman were therefore arrested at the entrance to McDonalds. [weak evidence: There was some commotion during the arrest and afterwards a mobile phone was found on the ground] [strong evidence: A mobile phone was discovered in Fredrik’s back pocket]. The mobile (with the number 070-22 23 33) had recently been used to call an unregistered mobile number, 070-99 98 88. A Russian passport was found in the woman’s handbag, with the name Tatjana Mirkarov (b. 1988), as well as ten 500 krona notes. A false bottom was discovered in the suitcase along with 1kg of heroin in sealed bags. The fingerprints that could be obtained match those of Fredrik and Tatjana and were found on the handle and the outside of the suitcase [strong evidence: In the suitcase there was also a tax-free bag with chocolate and cigarettes. Fredrik’s fingerprints were found on this bag.] Tatjana was informed, through the use of a translator, that she is suspected of major drug smuggling, but she only responded “Please no questions”. She has previously been a plaintiff in a dismissed trafficking case. 
Fredrik was also informed that he is suspected of major drug smuggling. He denied involvement in any crime, and then refused to speak until an attorney was present.


The McDonalds CCTV camera showed how Tatjana arrived with the suitcase, bought a cup of coffee and sat at a table. [weak evidence: Fredrik arrived moments later, made a purchase and sat at another table. Both went to the restroom within moments of each other. The restrooms are not monitored by CCTV and the two were in the unmonitored area together for a brief time. They returned to their respective tables separately. Shortly after, they headed for the entrance at the same time.] [strong evidence: Fredrik arrived moments later and sat down at the same table where he handed over cash to Tatjana.] In Fredrik’s residence [weak evidence: a plate with remnants of cocaine and four straws were found on the kitchen table beside a laptop] [strong evidence: a plate with remnants of cocaine and a single straw were found on the bedside table beside a laptop], as well as notations about an internet site for instant loans and a note with “Quaystreet 7, 22.30” written on it. Rustan Olsson (b. 1955) and the Russian citizen Alexej Dvorski (b. 1978) were found when “Rustan’s Builder Supplies” at Quaystreet 7 was raided. Empty liquor bottles and empty cases of beer were found in the warehouse. In a broom closet that smelled strongly of solvent, a weighing scale with remnants of heroin and packaging bags were found. A mobile phone with the number 070-99 98 88 was found in Alexej’s jacket [strong evidence: In the mobile’s phonebook the number 070-22 23 33 was saved under the name “F 2”.] Rustan and Alexej were informed that they are suspected of major drug smuggling/drug-related crime. Neither was willing to be interviewed.

Received: July 3, 2014 Revision Received: Sept. 19, 2014 Revision Received: Oct. 9, 2014 Accepted: Oct. 10, 2014

Archives of Forensic Psychology © 2014 Global Institute of Forensic Psychology 2014, Vol. 1, No. 1, 14–26 ISSN 2334-2749

How Literature Can Add Value to Structured Professional Judgments of Violence Risks: An Illustrative Rare Risk Example Inspired by Alice Munro’s Child’s Play

Christopher D. Webster Department of Psychiatry, University of Toronto, Toronto, ON, Canada Department of Psychology, Simon Fraser University, Burnaby, BC, Canada Child Development Institute, Toronto, ON, Canada

Eric Bélisle∗ Child Development Institute, 22 St. Clair Avenue East, Suite 202, Toronto, ON, M4T 2S3 [email protected]

The short story Child’s Play by Alice Munro (2009) is used to illustrate that, though the seeds were there all along, it took a peculiar set of circumstances occurring in sequence for a pair of psychologically-allied girls to seize an opportunity to rid themselves permanently of another girl whom they disliked and perceived as abject. Munro’s story can hardly fail to capture the attention of forensic mental health practitioners and researchers. It is, apart from anything else, relevant to the rare but intriguing phenomenon of folie à deux. In the present paper we first outline Alice Munro’s story. We then portray how a recently published Structured Professional Judgement (SPJ) device, the Short-Term Assessment of Risk and Treatability: Adolescent Version (START:AV), can be brought to bear in showing how the feeling-driven ideas or fantasies of two or more persons can occasionally evolve in powerful concert with utterly disastrous consequences. The trend towards tailoring SPJ devices uniquely to each assessee is discussed in relation to the common narrativity of clinical assessments and fiction. Keywords: START:AV, Structured professional judgement, Folie à deux, Alice Munro, Child’s Play

“Intimacy is complicated, and it sometimes goes awry” (Narduzzi, 2013, p. 87)

“[The clinician should be allowed the option] to review a risk estimate produced by a structured tool so that the clinician may note the presence of rare risk or protective factors in a given case, and that these factors – precisely because they are rare – will not have been taken into account in the construction of structured tools for assessing violence risk. . . ” (Monahan, 2010, p. 195)


This article aims to make a case for encouraging mental health clinicians to take literature into account as they refine and test their evaluations of violence.1,2 We attempt to make our case by performing an analysis of Alice Munro’s short story Child’s Play. In masterfully executed prose from a writer with a history of deftly shaded portrayals of psychologies and violence, two normal girls who possess a peculiarly intense symbiotic connection drown a third, intellectually-disabled girl while away at summer camp. The paper outlines Munro’s short story, paying attention to her construction of characters and their reality. This leads to a discussion of the importance of narrative-making, interpretation, and story-telling for the clinician, and how literature can sometimes be brought to bear on these. To begin, however, allow us to discuss our literary example of choice. Alice Munro, winner of the 2013 Nobel Prize, has given violence a great deal of consideration throughout her decades-long career, and always does so with an abundance of understanding and compassion. Munro’s short stories explore moments of violence with great sympathy and subtlety. Time and again, she convincingly portrays the stunning eruptions of violence that punctuate what are often otherwise ordinary lives; her clear knowledge of her characters allows her to hint at what produces these eruptions. Let us now start by providing a summary of one of her many stories that deal with violence and mental disorder.

Synopsis of Child’s Play

Marlene – the narrator and one of the perpetrators of the murder – is presented to the reader as a fairly typical nine or ten year-old. She lives with the rest of her family in a modest yellow house in a small town, where her father is a teacher. Part of the house is occupied by the elderly Mrs. Home, who is joined by her granddaughter Verna in the summer before Marlene is to start school. Girls of Marlene’s age are, apparently, given to showing instantaneous likes or dislikes of other girls. Marlene comments: “From the very beginning I had an aversion to her unlike anything I had felt up to that time for any other person. I said that I hated her, and my mother said, How can you, what has she ever done to you?” (Munro, 2009, p.194). Marlene’s hate is based not on an actual fear of Verna but, rather, on the powerfully abject feeling that she inspires in Marlene. Munro reminds us that the very young tend to feel things peculiarly fiercely. Verna, who is two or three years older than Marlene, disturbs Marlene’s solitary play. Verna has “a strenuous determination and an inability to understand that she wasn’t wanted” (p. 195). Verna is viewed by Marlene’s mother as “young for her age” (p. 195), which is code for intellectually handicapped. Munro reminds us that “children of course are monstrously conventional, repelled at once by whatever is off-centre, out of whack, unmanageable” (p. 195). In the story, Marlene’s mother tries unsuccessfully to get her to take a kinder view of Verna. But Marlene senses that underneath, her mother is actually no more accepting of Verna’s presence and her limitations than is she. This sense is investigated by Narduzzi (2013), a literary scholar who has written on the reproduction of negative affective responses to disability in Child’s Play. Narduzzi sees it as the origin of Marlene’s “collectively appointed superiority” (p. 82). At the end of that first summer, the girls have to attend school. 
Marlene goes to the regular school; Verna goes to a “special class in a special building” (Munro, 2009, p. 197). Marlene does not therefore have to contend with Verna during the day. This segregation only reaffirms the divide Marlene senses. It takes a year or so before Marlene’s family is able to move into their new bungalow. Marlene is relieved that she now never has to go past the yellow house, which to her had taken on Verna’s “narrow slyness, her threatening squint” (p. 198). It turns out, though, that Verna goes out of her way to choose to walk past the bungalow on the way to her school. Marlene continues trying to evade Verna, despite never being able to articulate the source of her mistrust, even later when reflecting as an aged adult: “it was hardly likely that she was going to attack and pummel me or pull out my hair. But only adults would be so stupid as to believe she had no power. A power, moreover, that was specifically directed at me. I was the one she had her eye on. Or so I believed” (p. 200). Marlene adds: “I suppose I hated her as some people hate snakes or caterpillars or mice or slugs. For no decent reason” (p. 200). While Marlene genuinely seems not to have considered the matter too much further, Narduzzi (2013) argues quite forcefully that there is a rather indecent reason for Marlene’s hatred, and it stems from her able-bodied privilege. At the end of the following summer Marlene goes to summer camp for a couple of weeks. There she falls in with Charlene. They have adjacent beds in the dormitory. They swap stories: Charlene puts up a tale of her witnessing her brother having sex, emphasizing the horribleness of his bare backside; for her part, Marlene offers up Verna in all her amplified awfulness and as much withering detail as she can muster. Toward the end of camp Charlene rushes up to Marlene and tells her that Verna is at the camp, having just arrived as part of an influx of specials for that last weekend. 
On the final day of camp, when parents come to collect their kids, both groups of children are having a final swim in the lake. Marlene, Charlene, and Verna come into close proximity in water about up to their armpits. Suddenly, a couple of motorboats come toward the shore causing a big wave. The wave stirs up “a tumult of screaming and shouting all around” (p. 221). Charlene and Marlene are knocked off their feet but soon surface laughing. Verna is also tumbled over. Described ephemerally as a passive and leisurely jellyfish, she remains under water. In this pivotal instant, Marlene recounts the story’s psychological violence becoming physical: “Charlene and I had our hands on her, on her rubber cap” (p. 221). Of course, the drowning was later construed as an accident.3 In the recounted version many, many years later, Marlene figures that even if they had told what they had done, they would have been forgiven anyway. They were young and innocent, supposedly unaware of the consequences of their actions, but the now-aged narrator Marlene asks herself if this story she has told herself is true. She concludes that it is, but only in the sense that she and Charlene had not planned or premeditated their violence, though its seeds had clearly been sown. There was no conscious deliberation followed by conscious action; instead it was unspoken and spontaneous. The elderly Marlene recounts of Charlene that “her eyes were wide and gleeful, as I suppose mine were too. I don’t think we felt wicked, triumphing in our wickedness. . . More as if we were doing just what was – amazingly – demanded of us, as if this were the absolute high point, the culmination, in our lives, of our being ourselves” (p. 222; see Kunimatsu, Marsee, Lau, & Fassnacht, 2012, for a discussion of “happy victimization” in detained girls).4,5 This all took place in a couple of minutes or so. The idea that choice was involved did not occur to the girls. The story ends with the two girls being driven home by their respective parents. Before parting, the girls never say to one another “anything as banal, as insulting or unnecessary, as Don’t tell” (p. 223, original emphasis). They do not even say goodbye. And they are gone before there is general realization on the beach that Verna is dead. This story is particularly fraught by the knowledge that the two girls who commit murder each go on to live long, successful lives. Except for toward the end of Charlene’s life, they remain unconnected after leaving camp. Marlene becomes a successful anthropologist. Charlene marries and leads an ordinary domestic life. When Charlene is dying of cancer, she leaves a note for Marlene asking that she seek out a particular priest in Guelph, Ontario. After considerable misgivings, though without delay, Marlene does this. Although Marlene is unable to meet the priest, it is apparent that Charlene wished for the father to know that she was indeed dying. She makes it clear in the note to Marlene that Father Hofstrader knows about the secret murder, but knows “Nothing about you” (p. 214, original emphasis). By showing Charlene’s search for absolution as her life comes to a close, Munro develops a compelling sub-theme around the nature and necessity of forgiveness.

Comment on the Story

It is only in concert with Charlene that Marlene becomes a risk. It was a fluke that she met and over-bonded with Charlene at camp. If Marlene had picked a different bed in the dormitory, Verna would have lived to tell the non-tale. Munro’s story reminds us of the limits of statistical predictions of violence (see Singh, Grann, & Fazel, 2011; Webster, Haque, & Hucker, 2014, pp. 88–91).6,7 Most of the work undertaken by forensic mental health clinicians called upon to assess risks for possible future violence revolves around acts that are publicly recognized and eventually dealt with by courts, review boards, and other tribunals (see Shah, 1978). Often the issue has to do with whether, following legal proceedings, the individual should be sent to prison, to hospital, or to some kind of supervision in the community. Alice Munro’s short story is different in that her characters avoid formal encounters with the law. That their act of violence is never publicly recognized does not preclude it from being open to psychological analysis. It is worth noting that it would have taken only a slight change to the Munro story as written to have invoked the police (e.g., a bystander who noticed the drowning and reported the girls). In Marlene, Munro presents us with a perfectly normal nine or ten year-old girl. There is nothing particularly remarkable about her. The only thing that stands out is her unwillingness or inability to accept Verna. Marlene is unsettled by Verna’s lack of normalcy, her specialness, her disability. As already noted, this is despite her mother’s urgings that she be more accommodating and understanding toward this other, admittedly odd, girl. Verna’s death is then the result of this callous rejection arising in uniquely hostile circumstances. Had Verna kept more to herself and not passed Marlene’s door on the way to school, Marlene may have had fewer awful grievances to share with Charlene. 
Had Marlene and Charlene not become so entangled at summer camp, Verna would never have been drowned. The Verna that Marlene constructs for Charlene is far more odd and sinister than in actuality. The two normal girls are quick to build up an intense level of excitement between them. They come ever closer together in their symbiotic emotional entanglement, and now they have another girl to react against and to bind them together. Toward the end of camp, once Verna arrives on the scene, Charlene sees it as her duty to shield Marlene from the dreadful Verna. Charlene thereby joins more fully with Marlene in the ever-increasing amplification of Verna’s awfulness. Their emotional states became interconnected. These two normals resent the intrusion of the specials. The story, therefore, has to do with the development of stigma, with how negative affect is produced, how social rejection works, and how already vulnerable people become victimized (Narduzzi, 2013). The story also points out how some very serious offences can occur without ever coming close to being formally recognized. Marlene and Charlene hold Verna under the water but it occurs with no conversation whatever between them at the time or even after the act. These are not girls with obvious psychiatric problems. It is not possible to find a characterization for this murder in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5; American Psychiatric Association, 2013). All we can say is that the girls at their young ages were unlikely to be able to comprehend what they were doing at the moment and the dreadfulness of it. It was, as the author puts it in her title, Child’s Play. If anyone is at fault here, it is, at a stretch, Marlene’s parents, who failed to insist on Marlene’s displaying a more charitable and civilized attitude towards Verna. Yet how closely can even attentive parents be expected to monitor their child’s social arrangements with peers? At an even greater stretch, the camp counselors could also be implicated in helping to build the deadly connection between the two girls. They did, after all, refer to them as “twins” (p. 189), thereby reinforcing the bond between them. 
They may also have perpetuated the whole different feeling of the camp after the specials arrive, and in doing so they reinforced the object of Marlene and Charlene’s hatred. But, given the swirl of activities at camp, how fair would it be to fault the counselors for failing to note the burgeoning unhealthy, over the top, symbiotic connection between Marlene and Charlene? Doubtless, summer camps generally provide children with positive social experiences that can last a lifetime. All the same, there can be risks as well.8 In addition, the story has implications for our understanding of adults whose intellectual abilities are restricted at a child-like level and who commit “senseless” crimes of the sort laid out in this chilling story. The same might be said of adults who lack emotional maturity. Mostly though, the Munro story is about how negative relationships can spread like contagion, and about how careful we, as human beings, have to be to ensure that, even under provocations, our actions are considered and filtered through as much reasoning as possible.9

Folie à Deux

Conceivably, Munro’s “case” could be considered as folie à deux (i.e., shared insanity, or a madness shared by two). The folie à deux is of “intense professional interest because it may shed some light on the question of whether one person can drive another person mad” (Franzini & Grossberg, 1995, p. 140). This remains the case if in this quotation the word “violent” is substituted for or added to “mad.”10 Franzini and Grossberg (1995) say: “It involves the transference of delusional ideas from a psychotic individual to an intimate associate who has been under his or her influence for an extended period of time” (p. 139). While one girl is portrayed as being dominant – Charlene is slightly more precocious and manipulative – neither could be in any way considered psychotic. Although the camp experience was a peculiarly intense one, it was also distinctly short-lived. At the same time, the case could be made that an actual delusion had been created (formed on the basis that Verna was a creepy kid who did not deserve to live). This seems a stretch. If there was any madness at all, it was most fleeting.

Structured Professional Judgement: An Overview

Mental health, correctional, and forensic professionals are routinely called upon by courts and tribunals to evaluate persons presumed to pose risks of violence toward themselves or others (Otto & Douglas, 2010). It often proves helpful to approach this task from a statistical or actuarial point of view (Michel et al., 2013). Predictive accuracy can be enhanced to some degree by knowing the likely propensity for future violence of individuals sharing some of the characteristics of the person under assessment. Useful or even necessary though this may be, the law has difficulty awarding punishments, deprivations of liberties, and the like on the basis of what other people have done rather than the actions committed by the particular individual under evaluation. Decision-making bodies may allow themselves to be influenced by results from statistically-based, normative studies but will routinely require in addition an analysis based on idiographic considerations (Seifert, Jahn, Bolten, & Wirtz, 2002). It was precisely this reality that furthered the development 20 years ago of the Structured Professional Judgement (SPJ) approach to the assessment of violence risks.11 Today the most commonly used SPJ scheme across professional disciplines is the Historical Clinical Risk Management-20 (HCR-20) and its progeny (Hurducas, Singh, de Ruiter, & Petrila, 2014; Singh et al., 2014). Version 1 of this scheme appeared in 1995 (Webster et al., 1995). The framework for this scheme was borrowed directly from Hare’s manual for the assessment of psychopathy (1991). The Hare Psychopathy Checklist-Revised (PCL-R), like the original, now consists of 20 items (e.g., Glibness/superficial charm; Grandiose sense of self worth; Poor behavioural control; Impulsivity). The pertinence of each of these items to the individual case is scored as 0 (No); 1 (Maybe in some regards); or 2 (Yes; Hare, 2003, p. 20). 
Through extensive research, Hare set a cutoff score of 30 as defining the psychopath. Although never intended by Hare to be a device for predicting violence, it has in fact been shown to do so in study after study (Leistico, Salekin, DeCoster, & Rogers, 2008). While Webster et al. (1995) adopted the organizational scheme of the PCL-R in the original HCR-20 – 20 items, each scored on a 3-point scale – there was one important difference between these two schemes. Whereas the PCL-R relies on the summed item scores to indicate the degree of psychopathy present, the HCR-20 asked assessors, once all item scoring is complete, to stand back a little, and offer an overall projection of future violence as being Low, Moderate, or High. In other words, the detailed scoring is expected to guide the assessor to a conclusion that is clinically rather than actuarially based. This approach has not suited some who, mistakenly in our view, would have it that clinicians add nothing but error to the statistically determined estimate (Quinsey, Harris, Rice, & Cormier, 2006). From their origins in Hare’s 1991 PCL-R, SPJ schemes have continued to evolve into fluid devices like the Short Term Assessment of Risk and Treatability: Adolescent Version (START:AV; Viljoen, Nicholls, Cruise, Desmarais, & Webster, 2014). Inspired by the Short Term Assessment of Risk and Treatability (START; Webster, Martin, Brink, Nicholls, & Desmarais, 2009), it has adopted many of the innovations pioneered by preceding SPJ schemes, including the START’s giving assessors the chance to score all 20 items on its scale not just for risks but for strengths. The START:AV also takes notes from the Risk of Sexual Violence Protocol (RSVP; S. D. Hart et al., 2003); it allows clinicians to add other items of their own devising to the scheme should the fixed items not adequately cover the particular quality or situation being assessed. And as with additions in the Historical Clinical Risk Management-20, Version 3 (HCR-20, V3; Douglas, Hart, Webster, & Belfrage, 2013), the START:AV invites assessors to create vignettes in order to sharpen prediction to the individual case. While there are other SPJ schemes for children and adolescents, the START:AV is one of the most pliable thanks to its building on existing progress (e.g., emphasis on risk formulation, creation of risk scenarios, focus on risk management issues).12 The essential purpose of this paper is to determine the extent to which the START:AV proves responsive enough to help a hypothetical assessor elucidate the rare but masterfully written case: Munro’s.
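The contrast drawn above between summed PCL-R scoring and the HCR-20's overall judgement can be made concrete with a small sketch. It is offered only as an illustration and is not taken from either instrument's manual: the function and class names are hypothetical, and treating the cutoff as "30 or above" is our assumption about how the cited cutoff is applied. The point is that the first scheme's outcome is computed from the items, whereas in the second the Low, Moderate, or High rating is supplied by the assessor and is deliberately not derived from the item scores.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

N_ITEMS = 20            # both schemes, as described in the text, use 20 items
VALID_SCORES = (0, 1, 2)  # 0 = No, 1 = Maybe in some regards, 2 = Yes
PCL_R_CUTOFF = 30       # cutoff cited in the text; ">=" below is an assumption


def _check_items(item_scores: Sequence[int]) -> None:
    """Raise ValueError unless there are 20 scores, each 0, 1, or 2."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores")
    if any(score not in VALID_SCORES for score in item_scores):
        raise ValueError("each item must be scored 0, 1, or 2")


def summed_total(item_scores: Sequence[int]) -> int:
    """Actuarial-style outcome: the plain sum of the item scores (0 to 40)."""
    _check_items(item_scores)
    return sum(item_scores)


def meets_cutoff(item_scores: Sequence[int]) -> bool:
    """Hypothetical helper: is the summed total at or above the cutoff?"""
    return summed_total(item_scores) >= PCL_R_CUTOFF


@dataclass(frozen=True)
class StructuredJudgement:
    """SPJ-style record: item scores inform, but do not compute, the rating."""
    item_scores: Tuple[int, ...]
    overall_rating: str  # chosen by the assessor after reviewing the items

    def __post_init__(self) -> None:
        _check_items(self.item_scores)
        if self.overall_rating not in ("Low", "Moderate", "High"):
            raise ValueError("overall_rating must be Low, Moderate, or High")


if __name__ == "__main__":
    scores = tuple([2] * 15 + [0] * 5)
    print(summed_total(scores), meets_cutoff(scores))
    # Same item scores, different clinical conclusions are both representable.
    print(StructuredJudgement(scores, "Moderate"))
    print(StructuredJudgement(scores, "High"))
```

Because `overall_rating` is an input rather than a computed value, two assessments with identical item scores can carry different ratings, which is the room for clinical judgement that the SPJ approach preserves.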

Application of the START:AV to Child’s Play

It would be too much to ask that Munro provide detail enough that the two girls could be rated reliably on the START:AV.13,14 As a writer, Munro provides exactly sufficient background so that she can build to the zenith of the story, the climactic drowning, which takes place in a couple of minutes. Yet a scan of the 24 standard items from the START:AV would deem them largely irrelevant (e.g., School and Work, Recreation, Substance Use, Rule Adherence, Self Care, Mental/Cognitive State, Social Support, Parenting, Parental Functioning, Material Resources, Community, Medication Adherence, and Treatability). Conceivably, if this were an actual risk assessment, and if there were sufficient background information, something might be made of the remaining items (i.e., Conduct, Coping, Impulse Control, Emotional State, Attitudes, Peers, External Triggers, Insight, and Plans). The 24 defined items are designed to cover the vast majority of routine assessment cases. The authors of START:AV, following the already mentioned practice introduced in the RSVP and V3 of the HCR-20, allow in the manual for the inclusion, where warranted, of a Case-Specific Item (pp. 49-50). Before using this Item 25 write-in option, assessors are asked to consider whether or not there is research evidence for adding such an item or whether there is within-person evidence. The present instance is marked by there being between-person (and within-person) evidence. The authors of START:AV provide a lengthy list of possible case-specific examples (e.g., Cultural Connectedness, Spirituality, Sexual Orientation, Chronic Illness, Trauma of Various Types). One purpose of this paper is to illustrate that it is now considered good science in SPJ assessment to reach for information that, though not always nomothetically backed, is vital for a full understanding of the person or persons involved. 
The topic of symbiosis, though admittedly rare, needs to be opened up for scientific investigation.15 Unlike earlier SPJ schemes, the allowance of a case-specific item is expected to enhance predictive power, though this has yet to be demonstrated. The START:AV also improves on its less adaptable forebears by encouraging assessors to create scenarios. If it had turned out that the two girls had been arrested at the scene, evaluators would have had to consider the likelihood of either or both of them committing a similar crime in the future. More specifically, assessors using the START:AV are enjoined to ask questions like: “Will he or she engage in similar behaviors as before? Might these behaviors change over time? Who is likely to be hurt and how?” (p. 80). After cautioning would-be assessors not to approach risk assessment as an “exercise in creative writing” (p. 82), the authors of the START:AV nonetheless urge practitioners to “think outside the box” (p. 82). The present authors would label Item 25 as “Unhealthy Symbiotic Attachment” and rate it as “High”. Given more information, we would probably see that condition as interacting with certain of the other, set items (e.g., Emotional State, Relationships, Peers, Impulse Control).

The Art of Communication

It is a plain fact that, no matter how sophisticated the process of risk assessment and no matter how many well-researched scientific methods were used in the process, the eventual results must be understood by a layperson. As has been asserted, the criteria for an assessment’s evaluation must surely include, alongside scientific defensibility, clarity and comprehensibility (Webster et al., 2014, p. 163). One of the present authors (CDW) serves as a member of the Ontario Review Board (ORB). Those who belong to the ORB and similar tribunals are asked to read, in advance of each set of hearings, a report on each accused person. Mostly these are adequately written and conform to the kind of standards we have recommended elsewhere, and they provide acceptable, if uninspired, narratives (see chapter 19, Webster et al., 2014, pp. 163–169). Yet occasionally a clinician will write an account that is not only commendable from a scientific point of view, but would captivate the attention of any person reading it. The story is delivered with skill and precision, and as a result, the Board member decision-makers better understand the patient and the past and present circumstances. Because of the clinician’s sureness in the writing, it is likely that, confronted with such a well-considered and well-expressed report, Board members will have much less difficulty than usual in comprehending how the violent offense likely unfolded. In addition, chances are they will be more than usually apt to reach agreement during the in camera discussions which center on stipulating a course of habilitation for the coming year. This considerable benefit arises simply because the clinician weaving the story together knew how to write and had given the issues a lot of thought from several angles.
Rather than a formulaic narrative based on preset narrative blocks, the author of the report had made the story so particular to the individual that a full sense of character was developed, thereby making the risk assessment more accurate than it might otherwise have been. In such cases, the assessor’s job is closer to Alice Munro’s than most empiricists would care to admit. While it is certainly true that writers of fiction like Munro have the advantage over the clinician of being able to exert greater formal control over the flow of information, the fact is that every assessor has to become a writer. When that person can portray the patient’s story with precision, accuracy, and most importantly restrained compassion, he or she will be best serving the client and, likely, the community as well.

Limitations

The authors are aware that Alice Munro, writing for an entirely different purpose, did not include background information about the two girls sufficient to enable a precise scoring of the START:AV. They also warn that the philosopher’s inductive-logic method, though often helpful in opening up topics and phenomena for research, requires careful, scientific checks. The solution offered by the present authors, in terms of an analysis within the START:AV framework, was to invoke a case-specific item. Given the unusualness of the symbiosis phenomenon, this seems warranted in Child’s Play. Users of the START:AV must remember that such a course should be followed only when the “standing items” do not capture what needs to be captured (pp. 49–51).

Concluding Comment

In certain circumstances, some forensic clinicians charged with assessing and managing risks of violence will benefit from an acquaintance with literature. It is not that reading literature such as Child’s Play can provide immediate concrete solutions for the clinician, but rather that reading the careful descriptions of human psychologies composed by those with absolute mastery of language can help inspire and guide scientific investigation. One of the present authors concluded the discussion of a Graham Greene story centered on violence risk by remarking that the clinician’s job is “to create an impression of the person from whatever written material lies at hand and whatever information the person chooses to give” (Webster & Ben-Aron, 1985, p. 47). These impressions are narratives, and the greater the familiarity with narratives, the more easily and accurately these can be created. Entire schools of literary criticism have been founded on the principle that literature allows us to see ourselves in the unfamiliar; by encouraging the reader to connect with the new psychologies of each character encountered, literature acts as a gymnasium for the compassionate understanding of diverse human narratives. There may be no other skill of greater importance to the clinician tasked with the paramount challenge of assessing someone’s risk for violence.

References

American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Association.
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA.
Augimeri, L. K., Webster, C. D., Koegl, C., & Leven, K. (1998). Early assessment risk list for boys: EARL-20B (Version 1, Consultation ed.). Toronto, ON: Earlscourt Child and Family Centre.
Booth, J. P., & Jackson, P. (1994). Heavenly creatures [motion picture]. New Zealand: WingNut Films.
Borum, R., Bartel, P., & Forth, A. (2006). Manual for the structured assessment of violence risk in youth (SAVRY). Odessa, FL.
Brink, J. (2012, March). The art of risk assessment. Paper presented at the International Association for Forensic Mental Health Services, Miami, FL.
Daffern, M. (2010). A structured cognitive behavioural approach to the assessment and treatment of violent offenders using offence paralleling behaviours. In M. Daffern, L. Jones, & J. Shine (Eds.), Offence paralleling behaviour: A case formulation approach to offender assessment and intervention (pp. 105–120). Chichester: John Wiley & Sons Inc.
Douglas, K. S., Hart, S. D., Webster, C. D., & Belfrage, H. (2013). HCR-20, V3: Assessing risk for violence: User guide. Burnaby, BC: Mental Health, Law, and Policy Institute, Simon Fraser University.
Drayton, J. (2012). The search for Anne Perry: The hidden life of a bestselling crime writer. Toronto: HarperCollins.
Franzini, L., & Grossberg, J. (1995). Eccentric and bizarre behaviors. New York, NY: John Wiley & Sons, Inc.
Hare, R. D. (1991). The Psychopathy Checklist – Revised. Toronto, ON: Multi-Health Systems.
Hare, R. D. (2003). The Psychopathy Checklist – Revised (2nd ed.). Toronto, ON: Multi-Health Systems.
Hart, C. (2014). A pocket guide to risk assessment and management in mental health. Oxford: Routledge.
Hart, S. D., Kropp, P. K., Laws, D. R., Klaver, J., Logan, C., & Watt, K. A. (2003). The risk for sexual violence protocol: Structured professional guidelines for assessing risk of sexual violence. Burnaby, BC: Mental Health, Law, and Policy Institute, Simon Fraser University.
Heilbrun, K., Yasuhara, K., & Shah, S. (2010). Violence-risk assessment tools. In R. K. Otto & K. S. Douglas (Eds.), Handbook of violence risk assessment (pp. 1–17). New York: Routledge.
Hurducas, C. C., Singh, J. P., de Ruiter, C., & Petrila, J. (2014). Violence risk assessment tools: A systematic review of surveys. International Journal of Forensic Mental Health, 13(3), 181–192. doi: 10.1080/14999013.2014.942923
Kunimatsu, M., Marsee, M., Lau, K., & Fassnacht, G. (2012). Callous-unemotional traits and happy victimization: Relationships with delinquency in a sample of detained girls. International Journal of Forensic Mental Health, 11(1), 1–8. doi: 10.1080/14999013.2012.667509
Leistico, A. R., Salekin, R. T., DeCoster, J., & Rogers, R. (2008). A large-scale meta-analysis relating the Hare measures of psychopathy to antisocial conduct. Law and Human Behavior, 32(1), 28–45. doi: 10.1007/s10979-007-9096-6
Levene, K. S., Augimeri, L. K., Pepler, D. J., Walsh, M. M., Webster, C. D., & Koegl, C. J. (2001). Early assessment risk list for girls (EARL-21G) Version 1 – Consultation Edition. Toronto: Earlscourt Child and Family Centre.
Lovins, B., Lowenkamp, C. T., & Latessa, E. J. (2009). Applying the risk principle to sex offenders: Can treatment make some sex offenders worse? The Prison Journal, 89(3), 344–357. doi: 10.1177/0032885509339509
McCabe, P. (2001). The Butcher Boy. London: Picador.
McNiel, D. E., Chamberlain, J. R., Weaver, C. M., Hall, S. E., Fordwood, S. R., & Binder, R. L. (2008). Impact of clinical training on violence risk assessment. American Journal of Psychiatry, 165(2), 195–200. doi: 10.1176/appi.ajp.2007.06081396
Michel, S. F., Riaz, M., Webster, C., Hart, S. D., Levander, S., Müller-Isberner, R., . . . Hodgins, S. (2013). Using the HCR-20 to predict aggressive behavior among men with schizophrenia living in the community: Accuracy of prediction, general and forensic settings, and dynamic risk factors. International Journal of Forensic Mental Health, 12(1), 1–13. doi: 10.1080/14999013.2012.760182
Monahan, J. (2010). The classification of violence risk. In R. K. Otto & K. S. Douglas (Eds.), Handbook of violence risk assessment (pp. 187–198). New York, NY: Routledge.
Munro, A. (2009). Child’s play. In Too much happiness (pp. 188–223). Toronto: McClelland & Stewart, Ltd.
Narduzzi, D. (2013). Regulating affect and reproducing norms: Alice Munro’s “Child’s Play”. doi: 10.3828/jlcds.2013.5
Oates, J. C. (1994). Extenuating circumstances. In Haunted: Tales of the grotesque (pp. 147–153). New York, NY: Dutton Signet.
Otto, R., & Douglas, K. (Eds.). (2010). Handbook of violence risk assessment. New York, NY: Routledge/Taylor & Francis.
Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. A. (2006). Violent offenders: Appraising and managing risk (2nd ed.). Washington, DC: American Psychological Association.
Scott, P. D. (1977). Assessing dangerousness in criminals. The British Journal of Psychiatry, 131(2), 127–42. doi: 10.1192/bjp.131.2.127
Seifert, D., Jahn, K., Bolten, S., & Wirtz, M. (2002). Prediction of dangerousness in mentally disordered offenders in Germany. International Journal of Law and Psychiatry, 25(1), 51–66. doi: 10.1016/S0160-2527(01)00096-6
Shah, S. A. (1978). Dangerousness: A paradigm for exploring some issues in law and psychology. American Psychologist, 33(3), 224–238. doi: 10.1037/0003-066X.33.3.224
Singh, J. P., Desmarais, S. L., Hurducas, C., Arbach-Lucioni, K., Condemarin, C., Dean, K., . . . Otto, R. K. (2014). International perspectives on the practical application of violence risk assessment: A global survey of 44 countries. International Journal of Forensic Mental Health, 13(3), 193–206. doi: 10.1080/14999013.2014.922141
Singh, J. P., Grann, M., & Fazel, S. (2011). A comparative study of violence risk assessment tools: A systematic review and metaregression analysis of 68 studies involving 25,980 participants. Clinical Psychology Review, 31(3), 499–513. doi: 10.1016/j.cpr.2010.11.009
SNAP™ Girls Group Manual: The Girl’s Club. (2002). Toronto, ON: Child Development Institute (formerly Earlscourt Child and Family Centre).
van Bilsen, H. P. (2013). Lessons we can learn from opera. Journal of Psychiatric Intensive Care, 10(1), 5–12. doi: 10.1017/s1742646413000228
Viljoen, J. L., Nicholls, T. L., Cruise, K. R., Desmarais, S. L., & Webster, C. D. (2014). Short-term assessment of risk and treatability: Adolescent version, START:AV user guide. Burnaby, BC: Mental Health, Law and Policy Institute, Simon Fraser University.
Webster, C. D., & Ben-Aron, M. H. (1985). Dangerous Doctor Fischer: A case history in clinical assessment. In C. D. Webster, M. H. Ben-Aron, & S. J. Hucker (Eds.), Dangerousness: Probability and prediction, psychiatry and public policy (pp. 41–51). New York, NY: Cambridge University Press.
Webster, C. D., Douglas, K. S., Eaves, D., & Hart, S. (1997). HCR-20: Assessing risk for violence – version 2. Burnaby, BC: Mental Health, Law, and Policy Institute, Simon Fraser University.
Webster, C. D., Eaves, D., Douglas, K. S., & Winthrop, A. (1995). The HCR-20 scheme: The assessment of dangerousness and risk – version 1. Burnaby, BC.
Webster, C. D., Haque, Q., & Hucker, S. (2014). Violence risk-assessment and management: Advances through structured professional judgment and sequential redirections (2nd ed.). Chichester.
Webster, C. D., Martin, M. L., Brink, J., Nicholls, T. L., & Desmarais, S. (2009). Manual for the short-term assessment of risk and treatability (START) Version 1. Port Coquitlam, BC: Forensic Psychiatric Services Commission and St. Josephs Healthcare, Hamilton, ON.

Footnotes

1. The authors would like to thank the two anonymous reviewers for their detailed and considered suggestions. They helped us strengthen the paper considerably.

2. In fact, this very story was brought to the attention of the present authors by Dr. Virginia Edwards, a long-practicing child psychiatrist in Toronto.

3. The same point could be made from analysis of Joyce Carol Oates’ Extenuating Circumstances (1994). In this story we are presented with a single mother who is utterly overwhelmed by having to care for her baby without social or material support. Although in her misery she contrives “accidentally” to scald the child to death in her own kitchen, she would likely never be found guilty in a legal sense.

4. In forensic mental health practice where violence risk is being evaluated, clinicians are enjoined to pay careful attention to the assessee’s mental and emotional states before, during, and after a violent act (see, for example, Scott, 1977, pp. 129–132). In reality, this is a difficult task. By the time the clinician comes to sit with the accused, their mental and emotional furniture will likely have been rearranged in the course of clumsy investigations by police and other authorities. Here, the Munro story points out the necessity of trying to get an uncontaminated picture of the individual’s state at the time of the crucial event.

5. Kunimatsu et al. (2012) examined 59 detained girls. They were interested in two traits: “callous-unemotional” and “happy victimization.” The former is perhaps self-explanatory. The latter is defined as “inappropriate positive emotions following a transgression” (p. 1). Both of these traits have been studied previously, the former especially. An assessor, faced with examining Marlene and Charlene in real life, would find the results of this study to be pertinent to the investigation.

6. In this, it should not for a moment be thought we are arguing against the kinds of statistically-based studies we and our colleagues have conducted over many years (see Webster et al., 2014, pp. 88–91). There is need for both nomothetic and idiographic approaches.

7. Some readers may be inclined to dismiss what we have written by saying that Munro’s story is simply that: a story. To this we would reply, take a gander at Joanne Drayton’s The Search for Anne Perry (2012). Drayton’s book is based on interviews with Anne Perry, the highly successful writer of crime novels. It deals with the fact that Anne was originally Juliet Hulme, one of two symbiotically intertwined girls who murdered the mother of the other girl. This took place in New Zealand in 1954. The murder formed the basis of the highly successful movie Heavenly Creatures (Booth & Jackson, 1994). This “story” is much discussed by Narduzzi (2013), who also considers the 1997 Victoria, British Columbia murder of Reena Virk (footnote 2, p. 77). Indeed, Narduzzi does more than hint that Munro may have drawn on the “real-life” Virk case as she composed Child’s Play. Munro lives in Victoria.

8. The researcher Daffern (2010) reminds us of William Golding’s Lord of the Flies. He points out that, absent adult supervision, shipwrecked British choirboys can “behave like savages” (p. 114).

9. Although some boys may form unhealthy symbiotic relationships (consider, for example, the way Francie and Joe intertwined themselves by ganging up on Phillip in Patrick McCabe’s The Butcher Boy, 2001), it would seem that child and youth counselors know, in their treatment sessions with girls, to focus on the “social forms of aggression, which tend to characterize girlhood aggression” which is “...in contrast to the boys’ groups where the focus is more on ‘sportsmanship’ and rule breaking (e.g., stopping stealing)” (SNAP™ Girls Group Manual: The Girl’s Club, 2002, p. 6).

10. DSM-IV-TR (2000) specified criteria for Shared Psychotic Disorder (297.3, pp. 305–306), but, for whatever reasons, this classification has disappeared from DSM-5.

11. Heilbrun, Yasuhara, and Shah (2010) have summarized the essence of SPJ. They say it “...involves the presentation of specified risk factors, which are usually derived from a broad review of the literature rather than from a specific data set” (p. 5). They also inform us that “Risk factors are well operationalized so their applicability can be coded (usually as no, possible, or yes) reliably” (p. 5). Heilbrun et al. note that the raters themselves have to draw conclusions on the basis of factors that are dynamic, changeable; there is no set formula for integrating them.

12. The Early Assessment Risk List for Boys (EARL-20B; Augimeri, Webster, Koegl, & Leven, 1998) built off the improvements in the Historical Clinical Risk Management-20, Version 2 (HCR-20V2; Webster, Douglas, Eaves, & Hart, 1997) by targeting an SPJ scheme at under-12 boys. This was later adapted for under-12 girls in the Early Assessment Risk List for Girls (EARL-21G; Levene et al., 2001). Borum also used the HCR-20V2 as the basis of an early scheme devised for application to adolescents (Borum, Bartel, & Forth, 2006).

13. Consider, for example, the work of McNiel et al. (2008). These authors contrived case vignettes (in a true sense “stories”) which they then had rated by residents in psychiatry. What they were able to show, convincingly, on the basis of random assignment to groups, was that those trained previously with the HCR-20 achieved significantly higher quality of documentation than their control counterparts. We have been using contrived vignettes for training mental health professionals to assess violence risk over many years. Sometimes these cases are adapted and disguised from real cases. Sometimes they are “made up.” Either way, we create them in order to be able to demonstrate reliability in a classroom audience. Helpful though this may be from a pedagogical point of view, it is hardly scientific. Seeming high inter-rater reliability is created by unambiguity in the text. But it serves an important illustrative purpose nonetheless.

14. The START:AV is designed for use with adolescents aged 12 to 18 years old. Obviously, the girls featured in Child’s Play were below the lower bound. Yet the authors of the manual do indicate that “there may be exceptions” (p. 3). It is, after all, common under law to raise adolescents who have committed serious crimes to adult court. The seriousness of the act committed by Marlene and Charlene would make the START:AV relevant (though it is perhaps a pity that the EARL-21G has not yet been revised along the same lines as the START:AV).

15. Some critics might argue that our current emphasis on the importance of literature and the arts can only cede ground to those who mistakenly advocate for an intuitive rather than a structured approach to violence risk assessment. Chris Hart has recently pointed out why such an approach cannot be recommended (2014, p. 48). We and others have long called for considered but restrained use of art, theatre, and the like (e.g., Brink, 2012; van Bilsen, 2013), and believe that the value of any SPJ scheme can be enhanced, at least occasionally, through the work of writers and other artists who have pondered long and deep about the nature of violence in all its many forms. And we have to note that, as members of society, all of us (perpetrators of violence, victims of violence, bystanders to the conduct of violence, and professional assessors of violence) have been and continue to be influenced by formal literature as taught to us in schools and universities and by the daily incursion of news reports and social media.

Received: Oct. 3, 2014 Revision Received: Nov. 30, 2014 Accepted: Dec. 5, 2014

Archives of Forensic Psychology © 2014 Global Institute of Forensic Psychology 2014, Vol. 1, No. 1, 27–48 ISSN 2334-2749

Establishing Cut-off Scores for the Parent-Reported Inventory of Callous-Unemotional Traits

Eva R. Kimonis∗ School of Psychology, Mathews Building, The University of New South Wales, Sydney, NSW 2052, Australia, [email protected]

Kostas A. Fanti Department of Psychology, University of Cyprus, Nicosia, Cyprus

Jay P. Singh Global Institute of Forensic Research, Reston, VA, USA Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA Faculty of Health Sciences, Molde University College, Molde, Norway

Callous-unemotional (CU) traits (i.e., lack of empathy/guilt/concern for others) have proven useful for identifying a unique subgroup of antisocial youths at risk for severe, persistent, and impairing conduct problems attributed to distinct etiological processes. Several tools for measuring CU traits alone or as part of a broader assessment of psychopathy exist but none have established cut-off scores for making categorical decisions about youth. The aim of the present study was to establish clinically meaningful cut-off scores on the parent-reported Inventory of Callous-Unemotional Traits (ICU) for the purpose of identifying children with high stable co-occurring conduct problems (CP) and CU traits (CP+CU), while balancing costs of false positives and false negatives. Participants included 1,370 school-aged (Mage = 9.38, SDage = 1.64 at baseline) boys and girls followed prospectively over 18 months. Several statistical indices were applied to establish optimal cut-off scores for identifying those 2.3% of children on a trajectory of high stable co-occurring CP+CU according to latent class growth analyses. Results indicated that children who scored at or above the identified ICU cut-off scores (24 for mother-report and 27 for father-report) were significantly more likely to engage in future self-reported bullying compared to children who scored below the thresholds. With encouraging evidence for the success of nuanced treatments for children with CP+CU, these findings may assist in screening children that might benefit from them. Keywords: Callous-unemotional traits, Conduct problems, Assessment, Cut-off scores, Longitudinal

Youth exhibiting significant and impairing patterns of antisocial and aggressive behavior are heterogeneous with respect to their severity, prognosis, and presumed etiology (Frick & Viding, 2009). Further, a substantial body of research suggests that non-normative levels of callous-unemotional (CU) traits (i.e., lack of empathy/guilt/concern for others) are useful for identifying antisocial youth who show a distinct pattern of severe, chronic, and aggressive conduct problems that are resistant to traditional mental health interventions, and thought to be underpinned by distinct causal factors (Frick, Ray, Thornton, & Kahn, 2014). CU traits comprise one dimension of psychopathy along with narcissism/deceitfulness and impulsivity/irresponsibility. A key challenge in identifying this at-risk population comes in translating continuous scores obtained from existing tools into dichotomous decisions that can inform prognosis and guide treatment planning. Cut-off scores for commonly used measures of CU traits have not yet been established. Although we recognize that the most compelling data suggest that psychopathy is a dimensional trait rather than a categorical taxon (Edens, Marcus, Lilienfeld, & Poythress, 2006; Lilienfeld, 1994; Murrie, Marcus, et al., 2007; cf. Harris, Rice, & Quinsey, 1994; Vasey, Kotov, Frick, & Loney, 2005), using arbitrary cut-off scores (e.g., deviation from the sample mean) to make important decisions may be either over- or under-inclusive. The present study aimed to address this risk and increase fairness in decision-making by empirically identifying objective cut-off scores on one of the most comprehensive measures of CU traits currently available: the Inventory of Callous-Unemotional Traits (ICU). 
A primary motivation in achieving this aim was to optimize the accurate identification of children with poor prognosis—i.e., on a trajectory of stable high co-occurring conduct problems (CP) and CU traits (CP+CU)—thus carefully balancing the competing risks of identifying too many or too few youths with non-normative CU traits. Attempts to categorize children on the basis of CU traits, which are thought to signal risk for psychopathy in adulthood, naturally raise important and valid concerns. This criticism is rooted in part in concerns over stigma associated with labeling youth, in part in the limited longitudinal research literature establishing the stability of CU traits from childhood to adulthood, and in part in the potential for harm that comes with the misuse of psychological instruments (Seagrave & Grisso, 2002). With respect to stability, a long-term follow-up study measuring psychopathic (including CU) traits at age 13 years found that the construct was moderately stable to age 24 years (r = .32), despite using different informants and assessment instruments across the two age periods (Lynam, Caspi, Moffitt, Loeber, & Stouthamer-Loeber, 2007). Of those 13-year olds scoring in the top 20% on the psychopathy measure, 14% went on to warrant a diagnosis of psychopathy in young adulthood, making this construct more stable than other forms of childhood psychopathology (Mash & Dozois, 2003). Furthermore, the 11-year correlation is equivalent to that typically seen when different informants use the same instrument at the same time-point to rate an individual’s behavior on a construct. Within any given developmental stage (childhood or adolescence), the general stability of CU or psychopathic traits tends to be higher at around 30% (e.g., Lee, Klaver, Hart, Moretti, & Douglas, 2009). CU traits also signal risk for other antisocial outcomes in adulthood. 
For example, after accounting for the severity of co-occurring conduct problems, CU traits present at elementary and middle school age predicted criminal behavior and other antisocial outcomes (e.g., antisocial personality symptoms) in adulthood (Byrd, Loeber, & Pardini, 2012; McMahon, Witkiewitz, & Kotler, 2010). These findings highlight the importance of early identification and treatment of children with CP+CU given their heightened risk for long-term impairment. With respect to stigma, although there is no research directly testing the effects of labeling youths as having CU traits, there has been research on the negative effects of the use of the term psychopathy when applied to children and adolescents (for a review, see Murrie, Boccaccini, McCoy, & Cornell, 2007). The term psychopathy has been found to significantly affect decisions made by professionals (e.g., clinicians’ estimation of treatability). However, labeling children and adolescents as psychopathic does not produce more negative effects than using the term conduct disorder (CD). Thus, it appears that any term used to describe individuals with antisocial behavior or traits will acquire negative connotations. In fact, the perceived pejorative connotation associated with the term callous-unemotional (see Frick & Nigg, 2012) was one of the driving concerns behind referring to this constellation of traits as “With Limited Prosocial Emotions” in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5; American Psychiatric Association, 2013).

Assessing CU Traits

Several instruments are currently available to assess CU and psychopathic traits in youth. The majority originate from the Psychopathy Checklist-Revised (PCL-R; Hare, 1991, 2003) for adults, and include the Psychopathy Checklist: Youth Version (PCL:YV; Forth, Kosson, & Hare, 2003) and the Antisocial Process Screening Device (APSD; Frick & Hare, 2001). The Inventory of Callous-Unemotional Traits (ICU; Frick, 2004; Kimonis et al., 2008) is a derivative of the APSD and a commonly used measure of CU traits that was developed over decades to improve upon its predecessor in three important ways. First, the ICU includes a larger number of items to better capture the emotional detachment dimension of psychopathy. Second, it balances positive and negative wording of items. Third, the instrument uses a 4-point Likert-type response format to guard against response bias and to avoid an exact middle rating. Unlike the PCL-R, which has an established diagnostic cut-off score of 30 to identify adults with psychopathy (Hare, 2003), these instruments designed for youth do not have existing cut-off scores to assist practitioners in categorizing youth into those showing meaningful and clinically significant levels of CU traits.1 Establishing cut-off scores for measures of CU traits is of considerable importance to treatment planning for youth presenting with conduct problems. For example, several studies of justice-involved adolescents report that youth scoring high on psychopathic traits were less likely to participate in treatment, had lower quality participation, and were more likely to reoffend after treatment than those low on psychopathic traits (Gretton, McBride, Hare, O’Shaughnessy, & Kumka, 2001; O’Neill, Lidz, & Heilbrun, 2003; Spain, Douglas, Poythress, & Epstein, 2004). Younger samples have similarly been found to respond more poorly to traditional interventions when scoring high on CU traits (Hawes, Price, & Dadds, 2014).
For example, Hawes and Dadds (2005) found that parents of boys (Mage = 6.29 years) with high CU traits were more likely to rate the discipline component (i.e., time out) of a manualized parent training program as ineffective at reducing their children’s conduct problems during treatment relative to parents whose children had oppositional defiant disorder (ODD) alone (see also Haas et al., 2011). However, treatments tailored to the unique needs of children and adolescents with high CU traits are effective in reducing conduct problems, CU traits, and recidivism rates (Caldwell, Skeem, Salekin, & Van Rybroek, 2006; Kolko & Pardini, 2010). For example, in a study of 177 clinic-referred children, those with CU traits who received an individualized and comprehensive modular intervention evinced similar rates of improvement to other children with CD (Kolko & Pardini, 2010). These findings highlight the critical need to appropriately categorize youth with conduct problems into more homogeneous subgroups to ensure that they are receiving the optimal treatment and that time and resources are not wasted on contraindicated treatments.

The Present Study

The primary aim of the current longitudinal study was to determine cut-off scores on a parent-reported version of the ICU that optimally identified youth at risk for co-occurring stable and high CU traits and conduct problems across time. Given the difficulty in determining the optimal standard against which to develop a cut-off score, a statistical method called group-based trajectory modeling, which is widely used in clinical research (Nagin & Odgers, 2010), was applied to identify clusters of community youth presenting more homogeneous patterns of symptom presentations across childhood. ICU cut-off scores that optimally classified youth into a trajectory of high stable CU traits co-occurring with high stable conduct problems, according to mother- and father-report, were identified for the full sample and also separately by gender and age group. The second aim of the current study was to validate the ICU cut-off score using bullying as a criterion measure. This was accomplished by testing whether children scoring above the ICU cut-off scores were more likely to engage in future bullying behavior compared to children scoring below. Several studies report that youth scoring high on CU traits are more likely to engage in bullying behavior (e.g., Fanti, Frick, & Georgiou, 2009; Fanti & Kimonis, 2012, 2013; Viding, Simmonds, Petrides, & Frederickson, 2009). This link has been explained by their poor recognition of others’ distress cues, their lack of concern for others’ feelings, and their expectation that aggressive behavior will result in positive outcomes (Fanti & Kimonis, 2012, 2013). Bullying is a particularly appropriate outcome measure within school-age samples as it represents a developmentally appropriate form of antisocial behavior.
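As a rough illustration of the first aim, the sketch below shows one common way to choose a cut-off on a continuous screening score against a binary criterion: pick the threshold that maximizes Youden's J (sensitivity + specificity − 1), which weighs false negatives and false positives equally. The text here does not say which indices were used, so treating Youden's J as the index is an assumption, and the function name, toy scores, and toy labels are hypothetical rather than drawn from the study.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    A case is classified positive when score >= cutoff.
    labels: 1 = member of the criterion (high-risk) group, 0 = otherwise.
    Ties in J are resolved in favor of the lowest cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")

    best = None  # (J, cutoff, sensitivity, specificity)
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical toy data (not from the study): ten scores, three criterion cases.
toy_scores = [5, 8, 12, 15, 18, 19, 20, 23, 30, 33]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
cutoff, sens, spec = youden_cutoff(toy_scores, toy_labels)
```

When the costs of the two error types are unequal, a weighted index or a minimum-sensitivity constraint could replace the unweighted J; the loop structure stays the same.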

Method

Participants

Participants included 1,370 families in the country of Cyprus, each with a child (53.4% girls) between the ages of 6 and 12 years (Mage at first assessment = 9.38, SD = 1.64). Families with children in grades one through six were recruited for the study, and the resulting sample was divided evenly across grades, with approximately 16% of children in each. A cross-sequential research design combining both cross-sectional and longitudinal methods was employed. This design was selected to shorten the time required to collect longitudinal data and to approximate longitudinal growth across a period of six years, which involved the repeated measurement (i.e., three longitudinal waves approximately six months apart) of children in the six elementary school grades. Since data were originally collected by grade cohort, the dataset was transformed to age cohorts to enable the investigation of longitudinal trajectories between the ages of 6.5 and 12 years. This design is a good approximation of a single long-term longitudinal design, and Latent Growth Models are well suited for this type of model estimation (Duncan, Duncan, & Strycker, 2013). Full information maximum likelihood fitting can be used to impute missing data across time.

Procedure

Following approval of the study by the Cyprus Ministry of Education, the second author randomly selected 26 urban and rural schools in the four school districts (Larnaka, Lemesos, Pafos, and Lefkosia) to ensure that the sample was representative of the general population. Administrators and personnel were provided a description of the study, and the boards of participating schools approved the study. Prior to data collection, signed parental consent and youth assent were obtained from 85% of participating families. Parents and children completed a battery of study questionnaires. In some cases only mother or father reports were available, although for the majority of children both parents participated in the study. At Time 1, data were collected from 1,127 mothers and 818 fathers (both parents, n = 1,028); at Time 2 (six months after Time 1), data were collected from 1,050 mothers and 805 fathers (both parents, n = 991); and at Time 3 (six months after Time 2), data were collected from 955 mothers and 732 fathers (both parents, n = 886). Data on bullying behavior were also collected from 719 children at Time 2 and 792 children at Time 3. Families did not receive incentives for their participation.

Measures

Conduct problems. The Checkmate plus Child Symptom Inventory for Parents-4 (CSI-4; Gadow & Sprafkin, 2002) was designed to assess symptoms of ODD (8 items; e.g., “Argues with adults”) and CD (15 items; e.g., “Has stolen things from others using physical force”) based on the diagnostic criteria specified in the DSM-IV (APA, 1994). This measure was administered at all three time points, and parents indicated the frequency with which their child exhibited ODD and CD symptoms on a 4-point Likert-type scale (0 = never to 3 = very often). ODD and CD subscales were combined to create an overall conduct problem score. Cronbach’s α for this combined variable ranged from .86 to .87 across time and reporting parent, corresponding to good internal consistency (Barker, Pistrang, & Elliott, 1994).2 Prior research supports the validity of parent-reported ODD and CD symptom scores from the CSI-4 in community and clinical samples in Cyprus and the U.S. (Fanti & Muñoz Centifanti, 2013; Gadow & Sprafkin, 2002).

Callous-unemotional traits. CU traits were assessed using the 24-item parent-report ICU (Frick, 2004). ICU items, such as “shows no remorse when he/she has done something wrong,” were rated on a 4-point Likert-type scale (0 = not at all to 3 = definitely true), with higher scores indicating greater CU traits. The construct validity of the ICU is supported in community, clinic-referred, and incarcerated samples of youth (e.g., Fanti, Frick, & Georgiou, 2009). For example, the ICU total score has been found to be associated with aggression, delinquency, and psychosocial and psychophysiological impairment (e.g., Fanti et al., 2009; Kimonis et al., 2008). The total score demonstrated good internal consistency (Cronbach’s α ranged from .86 to .88 across time). For the purposes of identifying trajectories of CU traits, mother- and father-reported ICU scores were combined in a conservative fashion by taking the higher rating between parents. This method is beneficial for circumventing potential underreporting (e.g., Pardini, Lochman, & Powell, 2007). Results of studies that combine multiple informants in this way are similar to those that have used different procedures (Piacentini, Cohen, & Cohen, 1992). To establish cut-off scores across informants, mother- and father-reported ICU scores were examined separately.

Bullying. The Student Survey of Bullying Behavior-Revised (SSBB-R; Varjas, Meyers, & Hunt, 2006) was administered at Times 2 and 3 to measure school bullying. Children self-reported whether they had engaged in different types (physical, verbal, and relational) of bullying behavior on an ordinal scale (0 = never, 1 = once or twice a year, 2 = monthly, 3 = weekly, or 4 = daily). The SSBB-R includes 12 items assessing bullying (e.g., “How often do you pick on younger, smaller, less powerful, or less popular kids by hitting or kicking them?”) that are summed to create a total bullying scale with a range of possible scores from 0 to 48. Prior research reports that SSBB-R scores are a valid and reliable measure of school-based bullying in community samples of children and adolescents in Cyprus and the U.S. (Fanti et al., 2009; Hunt, Meyers, Jarrett, & Neel, 2005; Varjas et al., 2006). Cronbach’s αs for the bullying scale were 0.89 at Time 2 and 0.90 at Time 3, falling in the good to excellent range.
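As an illustration of the internal consistency index reported for these scales, Cronbach's α can be computed from item-level ratings as in the following minimal sketch. The function name and the toy ratings are hypothetical and are not taken from the study data; the formula is the standard one, α = k/(k − 1) × (1 − sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings.

    Hypothetical helper for illustration; assumes k >= 2, at least two
    respondents, and non-zero variance in the total scores.
    """
    k = len(ratings[0])
    # Sample variance of each item across respondents.
    item_variances = [variance(item) for item in zip(*ratings)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy ratings (hypothetical): four respondents, three items on a 0-3 scale.
toy_ratings = [[0, 1, 1], [2, 2, 3], [1, 1, 2], [3, 3, 3]]
print(round(cronbach_alpha(toy_ratings), 2))
```

Values closer to 1 indicate that the items covary strongly relative to their individual variances.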

Plan of Analysis

We accomplished the study aims by first identifying distinct groups of developmental trajectories of conduct problems and CU traits from ages 6.5 to 12 years using Latent Class Growth Analysis (LCGA) in Mplus 6.1 for Windows (Muthén & Muthén, 2010). The purpose of this analysis was to identify children showing a trajectory of stable high scores on measures of CU traits and conduct problem symptoms from early childhood to adolescence. LCGA identifies heterogeneous latent classes based on longitudinal data by modeling the relationship between a given attribute (e.g., conduct problems or CU traits) and age. LCGA also allows for the investigation of whether change over time in the attribute is linear or whether it follows a more complex, curvilinear pattern. Full information maximum likelihood fitting was used in Mplus to retain children with incomplete assessments in the analysis. The LCGA estimation in Mplus resulted in two outputs: (a) the shape and location of the different estimated class trajectories, and (b) the posterior probability and entropy values of class membership. Average probabilities and entropy values greater than .70 indicate clear classification and greater power to predict class membership (Muthén & Muthén, 2010; Nagin, 2005). The Bayesian Information Criterion (BIC) was one of the model fit statistics used in the current study. Models with lower BICs are preferred (Schwarz, 1978); however, since the BIC criterion tends to favor models with fewer classes by penalizing added parameters, the Lo, Mendell, Rubin (LMR) statistic was also used. Lo, Mendell, and Rubin (2001) adjusted the likelihood ratio test for use in growth mixture modeling, enabling the comparison of non-nested models with different numbers of classes. The LMR statistic tests k − 1 classes against k classes. A significant chi-square value (p < .05) indicates that the k − 1 class model is rejected in favor of the k-class model. A non-significant chi-square value (p > .05) suggests that a model with one fewer class is preferred.

Second, to identify subgroups of children with high stable CU traits co-occurring with high stable conduct problems, cross-tab analyses were performed in IBM SPSS Statistics for Windows, Version 19.0 (IBM Corp., 2010). Groups were created on the basis of membership into conduct problem and CU trajectories to identify a subgroup of children with high stable CU traits co-occurring with high stable conduct problems.

The primary purpose of the present study was to identify ICU cut-off scores for the accurate identification of the groups determined using LCGA. To accomplish this goal, intraclass correlation coefficients (ICC) were first used to test the a priori assumption that CU scores were stable across the three study time points. Next, sensitivity and specificity were calculated by rater using each ICU score as a potential cut-off threshold. In the context of the present study, sensitivity referred to the proportion of children in the diagnostic group of interest (e.g., high stable trajectory group) who scored at or above the cut-off threshold on the ICU, whereas specificity referred to the proportion of children not in the diagnostic group of interest who scored below the cut-off threshold (Singh, 2013a). These discriminative parameters were calculated both when “high CU/high CP” youths served as the diagnostic group of interest, as well as when the diagnostic group of interest was “high CU/high CP”, “high CU/moderate CP”, and “moderate CU/high CP” youths. 
The former analysis corresponds to Risk-Needs-Responsivity principles (Bonta & Andrews, 2006), allowing preventative resources to be allocated to only those youths at highest risk, whereas the latter represents an approach consistent with screening (Fazel, Singh, Doll, & Grann, 2012; Singh, Grann, & Fazel, 2011). Youden’s (1950) J, an index for determining the cut-off threshold that balanced the cost-ratio between sensitivity and specificity, was then calculated for each score. The score with the highest J value was identified as the optimal cut-off threshold, with a cost-ratio of false positive to false negative classifications of 1:1. Cut-off thresholds were investigated for the sample overall and by gender as well as age group.3 Finally, independent samples t-tests were used to investigate whether children scored by parents as at or above the ICU cut-off score identified for the general sample (measured at Time 1) were more likely to self-report engaging in future bullying (Times 2 and 3) compared with children scoring below the ICU cut-off score.
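The cut-off search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and toy data are hypothetical, a child is treated as flagged when the score is at or above the candidate threshold (as defined in the text), and ties in J are resolved in favor of the lowest score, which is a choice the text does not specify.

```python
def classification_stats(scores, in_group, cutoff):
    """Return (sensitivity, specificity, Youden's J) for one candidate cut-off.

    A child is flagged when the score is at or above the cut-off.
    Assumes both the target group and its complement are non-empty.
    """
    flagged = [s >= cutoff for s in scores]
    tp = sum(f and g for f, g in zip(flagged, in_group))
    fn = sum((not f) and g for f, g in zip(flagged, in_group))
    tn = sum((not f) and (not g) for f, g in zip(flagged, in_group))
    fp = sum(f and (not g) for f, g in zip(flagged, in_group))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, sensitivity + specificity - 1


def optimal_cutoff(scores, in_group):
    """Return the observed score with the highest J (lowest score wins ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: classification_stats(scores, in_group, c)[2])


# Toy data (hypothetical): scale scores and membership in the target trajectory group.
toy_scores = [5, 10, 12, 18, 20, 25, 27, 30]
toy_in_group = [False, False, False, False, False, True, True, True]
best = optimal_cutoff(toy_scores, toy_in_group)
print(best, classification_stats(toy_scores, toy_in_group, best))
```

Because J weights sensitivity and specificity equally, this search corresponds to the 1:1 cost-ratio of false positives to false negatives noted in the text.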

Results

Descriptive Statistics

Descriptive statistics and correlations are presented in Table 1. As shown, CU traits were correlated with conduct problems across time based on both mother and father reports, and mother and father reports of CU traits and conduct problems were moderately correlated across time. Similarly, single measures and average measures ICCs were high across time between mother and father reports of CU traits (single measures ICC: T1 = .68 [.63-.71], T2 = .66 [.62-.70], T3 = .68 [.64-.72], p < .001; average measures ICC: T1 = .81 [.78-.83], T2 = .79 [.76-.82], T3 = .81 [.78-.84], p < .001) and CP (single measures ICC: T1 = .66 [.61-.70], T2 = .66 [.62-.70], T3 = .69 [.64-.72], p < .001; average measures ICC: T1 = .79 [.76-.82], T2 = .80 [.77-.82], T3 = .81 [.78-.84], p < .001).

Figure 1a. LCGA model for CU traits. [Line plot of callous-unemotional traits (y-axis, 0 to 35) by age in years (x-axis, 6.5 to 12) for three trajectory classes: Low (40.3%), Moderate (35.6%), and High (24.1%). Trajectory labels from highest to lowest: i = 30.75, s = -.07; i = 22.44, s = -.40**; i = 11.57, s = -.14.]
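The single measures and average measures ICCs reported for mother and father ratings are linked by the Spearman-Brown relation, average = k × single / (1 + (k − 1) × single), where k is the number of ratings averaged. The sketch below is an illustrative check that assumes k = 2 (mother and father) and uses a hypothetical function name; small discrepancies from the reported values are expected because the inputs are rounded to two decimals.

```python
def average_measures_icc(single_icc, k):
    """Step a single measures ICC up to the reliability of the mean of k ratings."""
    return k * single_icc / (1 + (k - 1) * single_icc)


# T1 CU traits: the reported single measures ICC of .68 with two raters
# steps up to roughly .81, in line with the reported average measures ICC.
print(round(average_measures_icc(0.68, 2), 2))
```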

Identifying Trajectories of CU Traits and Conduct Problems

To identify the optimal number of trajectories for CU traits, models with one to four classes were estimated with LCGA and compared using the model fit indices. The BIC statistic increased from the three-class to the four-class model and the LMR statistic was no longer significant for the four-class model, suggesting that the three-class model better fit the data (Table 2). Furthermore, the BIC for the curvilinear model (22,156.15) was higher than the BIC for the linear model (22,138.17), indicating that the linear model better represented the data. In addition, the mean probability score for the identified classes ranged from .79 to .87 and the entropy value was .77, suggesting that the classes were well differentiated in the final model (Fig. 1a). Children assigned to the low risk group (40.3%) scored below average on CU traits across time. Children in the moderate risk group (35.6%) scored slightly above average on CU traits across time, although they showed decreases in these traits from early childhood to early adolescence. Children in the high stable group (24.1%) exhibited continuous high levels of CU traits, and scored approximately 1.5 standard deviations above the mean on CU traits across time. According to χ² analyses, boys were less likely to be in the low risk group than girls, χ²(2, N = 1,192) = 7.88, p < .05, d = .25, 95% CI [.08, .42].

To identify the optimal number of trajectories for conduct problem symptoms, models with one to four classes were estimated with LCGA. The change in the BIC statistic from class three to class four was much smaller compared to the change from

Table 1. Descriptive statistics and correlations among the main study variables

Variables: 1. T1 CU (father); 2. T1 CP (father); 3. T2 CU (father); 4. T2 CP (father); 5. T3 CU (father); 6. T3 CP (father); 7. T1 CU (mother); 8. T1 CP (mother); 9. T2 CU (mother); 10. T2 CP (mother); 11. T3 CU (mother); 12. T3 CP (mother)

Correlations: .46** .56** .35** .34** .59** .69** .38** .38** .42** .68** .59** .63** .32** .36** .38** .37** .45** .41** .66** .71** .28** .27** .23** .30** .47** .50** .53** .66** .47** .53** .33** .35** .35** .29** .31** .37** .49** .55** .66** .52** .56** .33** .33** .34** .35** .43** .68** .59** .57** .60** .41** .39** .31** .38** .62** .70** .69** .43** .37** .45** .70** .71** .43** .44** .78** .52**

Descriptives (Mean, SD): 16.50 4.61 9.34 14.48 4.40 4.15 14.44 9.51 4.36 4.28 16.05 9.29 5.33 4.35 14.42 9.23 4.92 4.94 13.93 5.05 9.29 4.65 9.06 5.04

Note: **p < .001; T = Time, CU = Callous-unemotional, CP = Conduct problems.


Table 2. Model fit statistics and posterior probabilities based on LCGA

(a) CU traits
Classes   BIC         Entropy   LMR
1         23,091.07   N/A       N/A
2         22,247.50   0.79      p < .001
3         22,138.17   0.77      p < .001
4         22,141.28   0.59      p = .11

(b) Conduct problems
Classes   BIC         Entropy   LMR
1         18,829.57   N/A       N/A
2         17,871.78   0.89      p < .001
3         17,271.06   0.87      p < .01
4         17,078.24   0.80      p = .08

(c) Proportion of children in each combined class
                      Conduct problems
                      Low      Moderate   High
Low CU traits         0.362    0.041      0.000
Moderate CU traits    0.244    0.105      0.007
High CU traits        0.105    0.113      0.023

Note: Parts a and b show model fit statistics based on the LCGA. Part c shows the proportion of children belonging to each class.

Class 2 to Class 3, which suggests that the greatest improvement in fit occurred from the two-class to the three-class model (Table 2). In addition, the LMR statistic fell out of significance for the four-class model, suggesting that the three-class model better represented the data. Moreover, the four-class model identified two very similar low classes of small theoretical importance. Accordingly, the more parsimonious three-class model was selected. The BIC (17,285.67) for the curvilinear model was higher in comparison to the BIC (17,271.06) for the linear model, indicating that the linear model better represented the data. The mean probability score for the three conduct problem classes ranged from .90 to .95 and the entropy value was .87, suggesting that the classes were well separated (Fig. 1b). Children assigned to the low risk group (71.1%) exhibited low conduct problems across time. Children in the moderate risk group (25.9%) showed above average levels of conduct problems across time. Children in the high risk group (3%) showed a linear increase in conduct problems from early childhood to early adolescence, and remained at higher risk compared with low and moderate risk groups. Both the moderate and high-risk groups showed above average levels of conduct problems. According to χ² analyses, the low group comprised more girls than boys and the moderate risk group was overrepresented by boys, χ²(2, N = 1,192) = 11.68, p < .01, d = .30, 95% CI [.13, .49]. Cross-tab analyses were used to create groups on the basis of CU trait and conduct problem trajectories. This analysis indicates the percentage of children represented in all possible joint classes between the three-class model for CU and the three-class model for conduct problems. Table 2(c) shows the proportion of children belonging to each combined group of conduct problems and CU traits. Results indicated that the majority of children in the stable high conduct problem group also fell in the stable high CU trait group (77%). Children with moderate conduct problems were more likely to score high or moderately on CU traits, and only a small percentage of children with moderate conduct problems scored low on CU traits. Finally, children with low conduct problems were more likely to score low on CU traits, although a number of children in the low conduct problems group were also classified in the moderate or high CU groups.

Figure 1b. LCGA model for conduct problems. [Line plot of conduct problems (y-axis, 0 to 35) by age in years (x-axis, 6.5 to 12) for three trajectory classes: Low (71.1%), Moderate (25.9%), and High (3%). Trajectory labels shown: i = 9.33, s = .09; i = 3.61, s = -.04.]
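The class-selection reasoning and the joint-class percentage reported in this section can be retraced from the values in Table 2. The sketch below is illustrative only: the function and variable names are hypothetical, the LMR results are coded simply as significant or not at p < .05, and the selection rules are a simplified reading of the criteria described in the Plan of Analysis.

```python
# Fit statistics transcribed from Table 2: class count -> (BIC, LMR significant at p < .05).
# None marks the one-class model, for which no LMR test is reported.
CU_FIT = {1: (23091.07, None), 2: (22247.50, True), 3: (22138.17, True), 4: (22141.28, False)}
CP_FIT = {1: (18829.57, None), 2: (17871.78, True), 3: (17271.06, True), 4: (17078.24, False)}

# Joint class proportions from Table 2(c), keyed as (CU class, CP class).
JOINT = {
    ("low", "low"): 0.362, ("low", "moderate"): 0.041, ("low", "high"): 0.000,
    ("moderate", "low"): 0.244, ("moderate", "moderate"): 0.105, ("moderate", "high"): 0.007,
    ("high", "low"): 0.105, ("high", "moderate"): 0.113, ("high", "high"): 0.023,
}


def classes_by_bic(fit):
    """Class count with the lowest BIC."""
    return min(fit, key=lambda k: fit[k][0])


def classes_by_lmr(fit):
    """Largest class count reached before the LMR test first becomes non-significant."""
    chosen = min(fit)
    for k in sorted(fit):
        significant = fit[k][1]
        if significant is None:
            continue
        if not significant:
            break
        chosen = k
    return chosen


def share_high_cu_among_high_cp(joint):
    """Proportion of the high conduct problem class that is also in the high CU class."""
    high_cp_total = sum(p for (_, cp), p in joint.items() if cp == "high")
    return joint[("high", "high")] / high_cp_total


print(classes_by_bic(CU_FIT), classes_by_lmr(CU_FIT))
print(classes_by_bic(CP_FIT), classes_by_lmr(CP_FIT))
print(round(share_high_cu_among_high_cp(JOINT), 2))
```

For conduct problems the BIC alone continues to decrease at four classes, which is why the text leans on the LMR result and on parsimony when selecting the three-class model.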

Identifying ICU Cut-off Scores

Prior to determining ICU cut-off scores, single measures and average measures ICCs from two-way mixed effects models for ICU scores were examined. Results indicated that resolved (i.e., combined mother and father) ICU total scores were stable across the three time points (single measures ICC = .61, p < .001, 95% CI [.57, .64]; average measures ICC = .82, p < .001, 95% CI [.80, .84]). Given these substantial to almost perfect levels of ICU score reliability (Landis & Koch, 1977), it was decided to investigate cut-off thresholds for the instrument using scores obtained at Time 1. Using the Time 1 data, sensitivity, specificity, and Youden’s J values were calculated for each potential mother- and father-reported ICU score for the overall sample and separately by gender and age group. Classification into the high CU/high CP group, high CP/moderate CU group, or moderate CP/high CU group was used as the diagnostic outcome. The cut-off scores that optimized both sensitivity and specificity on the ICU are presented in Table 3 for mother-report and Table 4 for father-report (full statistical details available from the authors). Overall, for mother-report for the full sample and for boys alone an ICU total score of 24 best identified those youth in the high CU/high CP trajectory, whereas a higher threshold score of 27 was identified for girls. For father-report for the full sample and for girls alone this score was 27, but was lower for boys at 25. Across raters, the range of cut-off scores varied by age and ranged from 16 to 29. Tables 3 and 4 also present cut-off scores for optimally predicting alternative combinations of CU and CP levels.

Criterion Validity of the Identified Cut-off Score

On average, relative to children rated by parents as below the general sample ICU cut-off score (total score = 24 for mother-report and 27 for father-report), those scoring at or above it were more likely to engage in bullying at Time 2, according to both mother-reported (M = 2.86, SD = 5.75 versus M = 5.35, SD = 8.59, d = .39, 95% CI [.23, .55], t(659) = 3.25, p < .001) and father-reported CU traits (M = 3.03, SD = 6.01 versus M = 5.66, SD = 8.82, d = .40, 95% CI [.20, .60], t(534) = 4.12, p < .001). This finding was consistent at Time 3 for mother-reported (M = 4.14, SD = 6.86 versus M = 6.21, SD = 8.02, d = .29, 95% CI [.10, .55], t(599) = 2.81, p < .01) and father-reported (M = 4.41, SD = 7.01 versus M = 6.01, SD = 8.05, d = .22, 95% CI [.08, .42], t(490) = 2.20, p < .05) CU traits.
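The group comparisons above rest on a standardized mean difference and an independent samples t statistic, both of which can be computed from summary statistics. The following is a minimal sketch with hypothetical function names. Because the sizes of the below and above cut-off groups are not reported in this section, the group sizes in the example are assumptions chosen only to match the reported degrees of freedom, so the printed values should not be expected to reproduce the reported d or t.

```python
from math import sqrt


def pooled_variance(sd1, n1, sd2, n2):
    """Pooled variance of two independent groups."""
    return ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference (group 2 minus group 1) using the pooled SD."""
    return (m2 - m1) / sqrt(pooled_variance(sd1, n1, sd2, n2))


def student_t(m1, sd1, n1, m2, sd2, n2):
    """Equal-variance t statistic and its degrees of freedom."""
    se = sqrt(pooled_variance(sd1, n1, sd2, n2) * (1 / n1 + 1 / n2))
    return (m2 - m1) / se, n1 + n2 - 2


# Time 2, mother-report means and SDs from the text; the split of the
# 661 children into 500 below and 161 at or above the cut-off is hypothetical.
print(cohens_d(2.86, 5.75, 500, 5.35, 8.59, 161))
print(student_t(2.86, 5.75, 500, 5.35, 8.59, 161))
```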

Discussion

The present study aimed to establish cut-off scores for the parent-report version of a commonly used comprehensive measure of CU traits in youth, the Inventory of Callous-Unemotional Traits. This was accomplished by first identifying trajectories of youth with more homogeneous patterns of conduct problems and CU traits across childhood. This analysis revealed that 2.3% of youth in this nationally representative Cypriot community sample showed high stable conduct problem symptoms co-occurring with high stable CU traits. Findings indicated that using cut-off scores of 24 and 27 on the mother- and father-reported ICU (respectively) most accurately identified this group of children. Consistent with prior research, CU traits were moderately to highly stable across the three study time points (Frick, Kimonis, Dandreaux, & Farell, 2003). Importantly, the identified cut-off scores also predicted future child-reported bullying behavior, an established external correlate to CU (Fanti & Kimonis, 2013). Currently there is no clear consensus regarding the most appropriate base rate for defining non-normative and impairing levels of CU traits and this base rate will likely depend on the setting (e.g., community, clinic-referred, incarcerated), number and type of informants, and the purpose for making this diagnosis (e.g., the importance of avoiding false positives versus avoiding false negatives; Kahn, Frick, Youngstrom, Findling, & Youngstrom, 2012). However, Frick and Viding (2009) estimated that 2–4% of all children show a joint CP+CU presentation on the basis of CU prevalence estimates obtained within samples of children with conduct problems, and in their Fast

Table 3. Summary of cut-off thresholds using mother-reported CU scores (n = 1,208)

Columns: Analysis; Cut-off 1, Sensitivity, Specificity, Youden’s J; Cut-off 2, Sensitivity, Specificity, Youden’s J
Rows: Overall, Boys, Girls, Age 7, Age 8, Age 9, Age 10, Age 11, Age 12

Values: 24 24 27 0.69 16 29 0.86 16 0.50 21 0.78 13* 0.88 0.75 26 0.76 0.96 0.86 1.00 0.47 1.00 0.57 0.89 1.00 0.62 0.63 0.36 0.71 17 0.45 0.44 0.64 0.75 16 0.60 0.71 17 0.86 0.45 16 26 0.75 0.91 28 0.84 17 0.61 22 0.94 0.54 20 0.59 1.00 0.61 0.93 0.48 0.75 0.63 0.86 0.70 0.5 0.85 0.45 0.61 0.80 0.57 0.40 0.64 0.85 0.54 0.55 0.34

Note: Cut-off 1 is for identifying youth on trajectories of high stable co-occurring conduct problems and CU traits; Cut-off 2 is for identifying youth on trajectories of either high stable co-occurring conduct problems and CU traits, high stable conduct problems-moderate CU traits, or moderate conduct problems-high stable CU traits; *The low base rate of membership in the high CU-conduct problems group for this age group suggests that this cut-off score is a statistical artifact.

Table 4. Summary of cut-off thresholds using father-reported CU scores (n = 818)

Columns: Analysis; Cut-off 1, Sensitivity, Specificity, Youden’s J; Cut-off 2, Sensitivity, Specificity, Youden’s J
Rows: Overall, Boys, Girls, Age 7, Age 8, Age 9, Age 10, Age 11, Age 12

Values: 27 25 27 0.70 25 35 0.80 28 0.70 22 0.83 37* 0.83 0.33 27 0.73 1.00 0.87 1.00 0.53 1.00 0.80 0.99 1.00 0.53 0.87 0.57 0.69 22 0.99 0.63 0.33 0.75 22 0.87 0.69 22 0.82 0.99 22 23 0.75 0.93 21 0.69 20 0.76 24 0.91 0.82 25 0.74 0.84 0.77 0.83 0.58 0.88 0.77 0.78 0.88 0.67 0.75 0.47 0.72 0.83 0.68 0.60 0.80 0.59 0.54 0.71 0.68

Note: Cut-off 1 is for identifying youth on trajectories of high stable co-occurring conduct problems and CU traits; Cut-off 2 is for identifying youth on trajectories of either high stable co-occurring conduct problems and CU traits, high stable conduct problems-moderate CU traits, or moderate conduct problems-high stable CU traits; *The low base rate of membership in the high CU-conduct problems group for this age group suggests that this cut-off score is a statistical artifact.


Track sample McMahon et al. (2010) reported a prevalence rate of 1.2%. These estimates are within range of the 2.3% prevalence rate for youth with high stable co-occurring CU traits and CP in the present study and lend confidence to the generalizability of the findings. Whereas the 3% prevalence of conduct problems was consistent with large epidemiological studies, the prevalence of high stable CU traits was relatively high in this sample of Cypriot youth, whether presenting alone (24%) or within youth with co-occurring high stable conduct problems (77%). In comparison, prior studies of children in the U.K. that used items from non-standardized instruments report prevalence estimates between 3-5% for high CU traits in the absence of conduct problems (Fontaine, McCrory, Boivin, Moffitt, & Viding, 2011; Rowe et al., 2010). Average ICU total scores in the present study (father-report: M = 16.50, SD = 9.34, mother-report: M = 16.05, SD = 9.23) contradict the possibility that Cypriot youth are on average more callous and unemotional than youth from the U.K., for whom mean ICU total scores were almost one standard deviation higher in a large (N = 704) community sample of 11-13 year-old children (M = 24.72, SD = 9.01; Viding, Simmonds, Petrides, & Frederickson, 2009). Future research is necessary to understand why a quarter of this sample presented with ICU scores approximately 1.5 standard deviations above the sample mean. Children presenting with CU traits without conduct problems have been relatively understudied in the literature, likely due in large part to their lesser impairment relative to those with co-occurring conduct problems. For example, children high on CU traits alone are at lower risk for adverse outcomes compared to children high on both dimensions (Fanti, 2013; Rowe et al., 2010). 
Focusing on children high on CU traits alone in future research may help elucidate those factors that protect at-risk children from maladaptive outcomes and a destructive course of antisocial and aggressive behavior. Perhaps importantly, although a high percentage of children were identified with high CU traits, only a small minority of children scored consistently high on both conduct problems and CU traits. According to a recent review, it appears to be the combination of conduct problems and CU traits, and not CU traits alone, that leads to future maladaptive outcomes, representing a more severely impaired group of youth (Frick et al., 2014). Extant research suggests that children with conduct problems and CU traits differ in important ways from children with conduct problems alone (Frick et al., 2014). For example, their antisocial behaviors tend to be more severe and aggressive, earlier starting and more stable than children without CU traits. To illustrate the utility of the identified cut-off scores in identifying children consistent with this profile, based on these data, a score of 27 or higher on the ICU (father-report) correctly identified 70% of children within the stable high CU/high CP trajectory whereas a score below 27 correctly identified 83% of children not in this high-risk group. When scored at or above this ICU threshold at Time 1, children were significantly more likely to report engaging in bullying behaviors six months and one year later, than children scoring below the cut-off score. This finding is consistent with cross-sectional (Fanti et al., 2009; Viding et al., 2009) and longitudinal research (Fanti & Kimonis, 2012) reporting an association between bullying and CU traits. It also provides support for the predictive validity of the identified ICU cut-off score with respect to its ability to predict an external correlate to CU traits measured from a different reporter.
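For the father-report threshold of 27 described above, Youden's J follows directly from the reported sensitivity and specificity. The worked value below is a simple arithmetic illustration using the rounded figures given in the text:

```latex
J = \text{sensitivity} + \text{specificity} - 1 = 0.70 + 0.83 - 1 = 0.53
```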


Relating the identified cut-off scores to the assessment of CU traits in previous research, the cut-off scores are roughly half a standard deviation above the mean self-reported ICU total score (M = 22.50, SD = 8.20) reported for a mixed sample (N = 383) of European community, clinic-referred, detained, and child welfare-involved boys (8–20 years; Feilhauer, Cima, & Arntz, 2012), and (for father-report) roughly equivalent to the mean parent-reported ICU total score (M = 27.19, SD = 12.20) reported for a sample (N = 94) of American adolescent boys (12–18 years) adjudicated of a sexual offense (White, Cruise, & Frick, 2009), recognizing important differences in age, informant, and setting across studies. When the intention is to screen youth for the purposes of preventive intervention, scores ≥ 17 for mother-report and ≥ 22 for father-report on the ICU most accurately identified those in the overall sample who fell in the groups showing moderate trajectories of CU or conduct problems combined with high stable trajectories of the other. Using either method, cut-off scores varied somewhat when examined separately by gender or age group. Whereas the cut-off scores generated in this study may have utility for identifying youth at risk for severe and stable conduct problems co-occurring with CU traits to deliver nuanced interventions, it is important to note that many research questions may be better addressed using continuous measures of CU traits that utilize the complete information provided by the Likert-type response format of most CU measures (Thornton, Frick, Crapanzano, & Terranova, 2013). The results of the present study must be considered within the context of its relative strengths and limitations. First, although both mother- and father-reports of CU traits were examined, data on the presence of CU traits in the school setting according to teacher reports was not available. 
Diagnostic systems such as the DSM-5 (APA, 2013) explicitly recognize the importance of carefully considering multiple sources of information including self- and informant report (e.g., parents, teachers, peers, other family members) from sources who have known the child for a significant period of time when evaluating the presence of non-normative CU traits. Second, symptoms of ODD and CD were combined in order to identify children with conduct problems more generally who are at greatest risk for continuing and severe impairment across childhood. Thus, these data cannot speak to the utility of the identified cut-off scores within the context of the “With Limited Prosocial Emotions” CU specifier to the diagnosis of CD. Third, bullying is a conduct disorder symptom (i.e., often bullies, threatens, or intimidates others), which raises the possibility of criterion contamination when using it as an external criterion measure. However, different sources rated measures of conduct problems and bullying, lending some confidence to the ability of parent-reported ICU cut-off scores to predict child-reported bullying. While using self-report to assess bullying runs the risk of response bias given links between psychopathy and deception (see Rogers & Cruise, 2000), some research suggests that antisocial attitudes and behavior are more reliably assessed using self-report than caregiver-report, especially into adolescence (Jolliffe et al., 2003). Future research might consider measuring bullying using a multi-method approach (self-, parent-, teacher-report), and incorporating other external correlates that are important to the construct of CU traits, such as cognitive (e.g., punishment insensitivity), emotional (e.g., distress insensitivity) and biological (e.g., reduced amygdala activation) variables (Frick et al., 2014). Fourth, the sample comprised Greek-speaking children from the Republic of Cyprus, and findings must be replicated in samples from other countries and with adolescent populations, examining the predictive validity of the identified cut-off scores with respect to a variety of outcomes, to determine their generalizability. Finally, cut-off scores could be established for all age groups except 11-year-olds, for whom there was a very low base rate of membership into the high CU/high CP trajectory group. Thus, future research is needed to establish whether similar cut-off scores identify at-risk children in this age group relative to other ages.

Important strengths of this study were the large sample size, collection of longitudinal data, and the use of both mother and father-reported CU traits and conduct problems to address the possibility of under-reporting of symptoms. We also employed advanced statistical modeling, adopted from the field of criminology and becoming increasingly popular in clinical research (Nagin & Odgers, 2010), to identify a subset of youth showing high chronic CU and conduct problem symptoms in order to optimize prediction from possible ICU cut-off scores.

In the context of these limitations and strengths, our results provide some useful information for defining significant levels of CU traits among community-based samples of school-aged youth. The Risk-Needs-Responsivity principles suggest that preventative resources should be allocated to only those youths at highest risk (Bonta & Andrews, 2006). The process of identifying which community-based youths are at greatest risk for future impairment, so that empirically supported interventions can be administered to them, may be assisted by applying the ICU cut-off scores identified in the present study. Utilizing Youden’s J statistic to identify optimal scores assumes that the costs of false positives and false negatives are equally deleterious, meaning that decision-making using the instrument will be neither over-inclusive nor under-inclusive. 
There were two primary reasons for selecting this approach. First, from a diagnostic perspective, youth inappropriately assigned the CU specifier (i.e., false positives) would be at no greater risk of suffering from this label than from the CD label, which must be diagnosed first according to the DSM-5 (see Murrie, Boccaccini, McCoy, & Cornell, 2007). Second, failing to classify a youth with CD with the “With Limited Prosocial Emotions” specifier (i.e., a false negative) may result in delivery of parent-training interventions that are the gold standard for children with conduct problems but are less effective for those with CU traits (Hawes et al., 2014). This may delay CP+CU children from receiving appropriate treatment (i.e., intensive interventions tailored to their unique emotional and cognitive characteristics), which has been found to improve their CU traits, conduct problems, and other antisocial outcomes (Kolko & Pardini, 2010).

International surveys of practitioners who use structured instruments in psychological assessments have clearly shown a preference for categorical methods of risk communication (Singh, 2013b). However, as recent research has suggested that antisociality may be a taxometrically continuous rather than dichotomous construct, some researchers have argued that cut-off scores should not be established for personality measures such as the ICU (Marcus, Lilienfeld, Edens, & Poythress, 2006). It may be that the ICU could be used most efficiently following a two-step decision-making strategy. First, the cut-off scores identified in the current study would be used to determine whether resources should be allocated in a given case. Second, risk ratio and percentile information derived from continuous ICU scores would be used to determine the amount and nature of the resources to be allocated. Future research could examine whether risk ratios and percentiles from large-scale or jurisdiction-specific normative studies result in the most effective risk management plans.

In conclusion, this promising line of research provides a step towards appropriately sorting antisocial youth into more homogeneous subgroups to improve outcomes for a particularly impaired population.

References

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author.
Barker, C., Pistrang, N., & Elliott, R. (1994). Research methods in clinical and counselling psychology. Chichester, UK: Wiley.
Bonta, J., & Andrews, D. A. (2006). Risk-need-responsivity model for offender assessment and rehabilitation. Ottawa, ON: Public Safety and Emergency Preparedness. (Cat. No. PS3-1/2007-6)
Byrd, A. L., Loeber, R., & Pardini, D. A. (2012). Understanding desisting and persisting forms of delinquency: The unique contributions of disruptive behavior disorders and interpersonal callousness. Journal of Child Psychology and Psychiatry, 53(4), 371–380. doi: 10.1111/j.1469-7610.2011.02504.x
Caldwell, M., Skeem, J. L., Salekin, R. T., & Van Rybroek, G. (2006). Treatment response of adolescent offenders with psychopathy features: A 2-year follow-up. Criminal Justice and Behavior, 33(5), 571–596. doi: 10.1177/0093854806288176
Campbell, M. A., Porter, S., & Santor, D. (2004). Psychopathic traits in adolescent offenders: An evaluation of criminal history, clinical, and psychosocial correlates. Behavioral Sciences & the Law, 22(1), 23–47. doi: 10.1002/bsl.572
Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2013). An introduction to latent variable growth curve modeling: Concepts, issues, and applications. Routledge Academic.
Edens, J. F., Marcus, D. K., Lilienfeld, S. O., & Poythress, N. G., Jr. (2006). Psychopathic, not psychopath: Taxometric evidence for the dimensional structure of psychopathy. Journal of Abnormal Psychology, 115(1), 131–144. doi: 10.1037/0021-843x.115.1.131
Fanti, K. A. (2013). Individual, social, and behavioral factors associated with co-occurring conduct problems and callous-unemotional traits. Journal of Abnormal Child Psychology, 41(5), 811–824. doi: 10.1007/s10802-013-9726-z
Fanti, K. A., Frick, P. J., & Georgiou, S. (2009). Linking callous-unemotional traits to instrumental and non-instrumental forms of aggression. Journal of Psychopathology and Behavioral Assessment, 31(4), 285–298. doi: 10.1007/s10862-008-9111-3
Fanti, K. A., & Kimonis, E. R. (2012). Bullying and victimization: The role of conduct problems and psychopathic traits. Journal of Research on Adolescence, 22(4), 617–631. doi: 10.1111/j.1532-7795.2012.00809.x
Fanti, K. A., & Kimonis, E. R. (2013). Dimensions of juvenile psychopathy distinguish “bullies,” “bully-victims,” and “victims”. Psychology of Violence, 3(4), 396–409. doi: 10.1037/a0033951
Fanti, K. A., & Muñoz Centifanti, L. C. (2013). Childhood callous-unemotional traits moderate the relation between parenting distress and conduct problems over time. Child Psychiatry and Human Development, 45(2), 173–184. doi: 10.1007/s10578-013-0389-3
Fazel, S., Singh, J. P., Doll, H., & Grann, M. (2012). The prediction of violence and antisocial behaviour: A systematic review and meta-analysis of the utility of risk assessment instruments in 73 samples involving 24,827 individuals. British Medical Journal, 345, e4692. doi: 10.1136/bmj.e4692
Feilhauer, J., Cima, M., & Arntz, A. (2012). Assessing callous-unemotional traits across different groups of youths: Further cross-cultural validation of the Inventory of Callous-Unemotional Traits. International Journal of Law and Psychiatry, 35(4), 251–262. doi: 10.1016/j.ijlp.2012.04.002
Fontaine, N. M. G., McCrory, E. J. P., Boivin, M., Moffitt, T. E., & Viding, E. (2011). Predictors and outcomes of joint trajectories of callous-unemotional traits and conduct problems in childhood. Journal of Abnormal Psychology, 120(3), 730–742. doi: 10.1037/a0022620
Forth, A. E., Kosson, D. S., & Hare, R. D. (2003). The Psychopathy Checklist: Youth Version manual. Toronto: Multi-Health Systems.
Frick, P. J. (2004). The inventory of callous-unemotional traits. New Orleans: The University of New Orleans. Unpublished rating scale.
Frick, P. J., & Hare, R. D. (2001). Antisocial process screening device (APSD). Toronto: Multi-Health Systems.
Frick, P. J., Kimonis, E. R., Dandreaux, D. M., & Farell, J. M. (2003). The four-year stability of psychopathic traits in non-referred youth. Behavioral Sciences & the Law, 21(6), 713–736. doi: 10.1002/bsl.568
Frick, P. J., & Nigg, J. T. (2012). Current issues in the diagnosis of attention deficit hyperactivity disorder, oppositional defiant disorder, and conduct disorder. Annual Review of Clinical Psychology, 8(1), 77–107. doi: 10.1146/annurev-clinpsy-032511-143150
Frick, P. J., Ray, J. V., Thornton, L. C., & Kahn, R. E. (2014). Can callous-unemotional traits enhance the understanding, diagnosis, and treatment of serious conduct problems in children and adolescents? A comprehensive review. Psychological Bulletin, 140(1), 1–57. doi: 10.1037/a0033076
Frick, P. J., & Viding, E. (2009). Antisocial behavior from a developmental psychopathology perspective. Development and Psychopathology, 21(4), 1111–1131. doi: 10.1017/s0954579409990071
Gadow, K. D., & Sprafkin, J. (2002). Child Symptom Inventory–4 screening and norms manual. Stony Brook, NY: Checkmate Plus.
Gretton, H. M., Mcbride, M., Hare, R. D., O’Shaughnessy, R., & Kumka, G. (2001). Psychopathy and recidivism in adolescent sex offenders. Criminal Justice and Behavior, 28(4), 427–449. doi: 10.1177/009385480102800403
Haas, S. M., Waschbusch, D. A., Pelham, W. E., King, S., Andrade, B. F., & Carrey, N. J. (2011). Treatment response in CP/ADHD children with callous/unemotional traits. Journal of Abnormal Child Psychology, 39(4), 541–552. doi: 10.1007/s10802-010-9480-4
Hare, R. D. (1991). The psychopathy checklist-revised (2nd ed.). Toronto, ON, Canada: Multi-Health Systems.
Hare, R. D. (2003). The psychopathy checklist-revised (2nd ed.). Toronto, ON, Canada: Multi-Health Systems.


Harris, G. T., Rice, M. E., & Quinsey, V. L. (1994). Psychopathy as a taxon: Evidence that psychopaths are a discrete class. Journal of Consulting and Clinical Psychology, 62(2), 387–397. doi: 10.1037/0022-006x.62.2.387
Hawes, D. J., & Dadds, M. R. (2005). The treatment of conduct problems in children with callous-unemotional traits. Journal of Consulting and Clinical Psychology, 73(4), 737–741. doi: 10.1037/0022-006x.73.4.737
Hawes, D. J., Price, M. J., & Dadds, M. R. (2014). Callous-unemotional traits and the treatment of conduct problems in childhood and adolescence: A comprehensive review. Clinical Child and Family Psychology Review, 17(3), 248–267. doi: 10.1007/s10567-014-0167-1
Hunt, M. H., Meyers, J., Jarrett, O., & Neel, J. (2005). Student survey of bullying behavior: Preliminary development and results from six elementary schools. Atlanta, Georgia: Georgia State University Center for Research on School Safety, School Climate and Classroom Management.
IBM Corp. (2010). IBM SPSS statistics for Windows, Version 19.0. Armonk, NY: IBM Corp.
Jolliffe, D., Farrington, D. P., Hawkins, J. D., Catalano, R. F., Hill, K. G., & Kosterman, R. (2003). Predictive, concurrent, prospective and retrospective validity of self-reported delinquency. Criminal Behaviour and Mental Health, 13(3), 179–197. doi: 10.1002/cbm.541
Kahn, R. E., Frick, P. J., Youngstrom, E., Findling, R. L., & Youngstrom, J. K. (2012). The effects of including a callous-unemotional specifier for the diagnosis of conduct disorder. Journal of Child Psychology and Psychiatry, 53(3), 271–282. doi: 10.1111/j.1469-7610.2011.02463.x
Kimonis, E. R., Frick, P. J., Skeem, J. L., Marsee, M. A., Cruise, K., Munoz, L. C., . . . Morris, A. S. (2008). Assessing callous-unemotional traits in adolescent offenders: Validation of the Inventory of Callous–Unemotional Traits. International Journal of Law and Psychiatry, 31(3), 241–252. doi: 10.1016/j.ijlp.2008.04.002
Kolko, D. J., & Pardini, D. A. (2010). ODD dimensions, ADHD, and callous-unemotional traits as predictors of treatment response in children with disruptive behavior disorders. Journal of Abnormal Psychology, 119(4), 713–725. doi: 10.1037/a0020910
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Lee, Z., Klaver, J. R., Hart, S. D., Moretti, M. M., & Douglas, K. S. (2009). Short-term stability of psychopathic traits in adolescent offenders. Journal of Clinical Child & Adolescent Psychology, 38(5), 595–605. doi: 10.1080/15374410903103536
Lilienfeld, S. O. (1994). Conceptual problems in the assessment of psychopathy. Clinical Psychology Review, 14(1), 17–38. doi: 10.1016/0272-7358(94)90046-9
Lo, Y., Mendell, N., & Rubin, D. B. (2001). Testing the number of components in a normal mixture. Biometrika, 88(3), 767–778. doi: 10.1093/biomet/88.3.767
Lynam, D. R., Caspi, A., Moffitt, T. E., Loeber, R., & Stouthamer-Loeber, M. (2007). Longitudinal evidence that psychopathy scores in early adolescence predict adult psychopathy. Journal of Abnormal Psychology, 116(1), 155–165. doi: 10.1037/0021-843x.116.1.155
Marcus, D. K., Lilienfeld, S. O., Edens, J. F., & Poythress, N. G. (2006). Is antisocial personality disorder continuous or categorical? A taxometric analysis. Psychological Medicine, 36(11), 1571–1581. doi: 10.1017/s0033291706008245
Mash, E., & Dozois, D. (2003). Child psychopathology: A developmental-systems perspective. In E. Mash & R. A. Barkley (Eds.), Child psychopathology (2nd ed., pp. 3–71). New York: Guilford Press.
McMahon, R. J., Witkiewitz, K., & Kotler, J. S. (2010). Predictive validity of callous-unemotional traits measured in early adolescence with respect to multiple antisocial outcomes. Journal of Abnormal Psychology, 119(4), 752–763. doi: 10.1037/a0020796
Murrie, D. C., Boccaccini, M. T., McCoy, W., & Cornell, D. G. (2007). Diagnostic labeling in juvenile court: How do descriptions of psychopathy and conduct disorder influence judges? Journal of Clinical Child & Adolescent Psychology, 36(2), 228–241. doi: 10.1080/15374410701279602
Murrie, D. C., Marcus, D. K., Douglas, K. S., Lee, Z., Salekin, R. T., & Vincent, G. (2007). Youth with psychopathy features are not a discrete class: A taxometric analysis. Journal of Child Psychology and Psychiatry, 48(7), 714–723. doi: 10.1111/j.1469-7610.2007.01734.x
Muthén, L., & Muthén, B. (2010). Mplus User’s Guide (6th ed.). Los Angeles: Muthén & Muthén.
Nagin, D. S. (2005). Group-based modeling of development. Boston, MA: Harvard University Press.
Nagin, D. S., & Odgers, C. L. (2010). Group-based trajectory modeling in clinical research. In Nolen-Hoekland, T. S. Cannon, & T. Widger (Eds.), Annual review of clinical psychology. Palo Alto, CA: Annual Reviews.
O’Neill, M. L., Lidz, V., & Heilbrun, K. (2003). Adolescents with psychopathic characteristics in a substance abusing cohort: Treatment process and outcomes. Law and Human Behavior, 27(3), 299–313. doi: 10.1023/a:1023435924569
Pardini, D. A., Lochman, J. E., & Powell, N. (2007). The development of callous-unemotional traits and antisocial behavior in children: Are there shared and/or unique predictors? Journal of Clinical Child and Adolescent Psychology, 36(3), 319–333. doi: 10.1080/15374410701444215
Piacentini, J. C., Cohen, P., & Cohen, J. (1992). Combining discrepant diagnostic information from multiple sources: Are complex algorithms better than simple ones? Journal of Abnormal Child Psychology, 20(1), 51–63. doi: 10.1007/bf00927116
Rogers, R., & Cruise, K. (2000). Malingering and deception among psychopaths. In C. Gacono (Ed.), The clinical and forensic assessment of psychopathy: A practitioner’s guide (pp. 269–284). Mahwah, NJ, US: Lawrence Erlbaum Associates Publishers.
Rowe, R., Maughan, B., Moran, P., Ford, T., Briskman, J., & Goodman, R. (2010). The role of callous and unemotional traits in the diagnosis of conduct disorder. Journal of Child Psychology and Psychiatry, 51(6), 688–695. doi: 10.1111/j.1469-7610.2009.02199.x
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Seagrave, D., & Grisso, T. (2002). Adolescent development and the measurement of juvenile psychopathy. Law and Human Behavior, 26(2), 219–239. doi: 10.1023/a:1014696110850
Singh, J. P. (2013a). Predictive validity performance indicators in violence risk assessment: A methodological primer. Behavioral Sciences & the Law, 31(1), 8–22. doi: 10.1002/bsl.2052
Singh, J. P. (2013b). Violence risk assessment in adults and juveniles: The state of the art. (Speech presented at Harvard University, Cambridge, MA, USA)
Singh, J. P., Grann, M., & Fazel, S. (2011). A comparative study of violence risk assessment tools: A systematic review and metaregression analysis of 68 studies involving 25,980 participants. Clinical Psychology Review, 31(3), 499–513. doi: 10.1016/j.cpr.2010.11.009
Spain, S. E., Douglas, K. S., Poythress, N. G., & Epstein, M. (2004). The relationship between psychopathic features, violence and treatment outcome: The comparison of three youth measures of psychopathic features. Behavioral Sciences & the Law, 22(1), 85–102. doi: 10.1002/bsl.576
Thornton, L. C., Frick, P. J., Crapanzano, A. M., & Terranova, A. M. (2013). The incremental utility of callous-unemotional traits and conduct problems in predicting aggression and bullying in a community sample of boys and girls. Psychological Assessment, 25(2), 366–378. doi: 10.1037/a0031153
Varjas, K., Meyers, J., & Hunt, M. H. (2006). Student survey of bullying behavior-Revised 2 (SSBB-R2). Atlanta, Georgia: Georgia State University, Center for Research on School Safety, School Climate and Classroom Management.
Vasey, M. W., Kotov, R., Frick, P. J., & Loney, B. R. (2005). The latent structure of psychopathy in youth: A taxometric investigation. Journal of Abnormal Child Psychology, 33(4), 411–429.
Viding, E., Simmonds, E., Petrides, K. V., & Frederickson, N. (2009). The contribution of callous-unemotional traits and conduct problems to bullying in early adolescence. Journal of Child Psychology and Psychiatry, 50(4), 471–481. doi: 10.1111/j.1469-7610.2008.02012.x
White, S. F., Cruise, K. R., & Frick, P. J. (2009). Differential correlates to self-report and parent-report of callous–unemotional traits in a sample of juvenile sexual offenders. Behavioral Sciences & the Law, 27(6), 910–928. doi: 10.1002/bsl.911
Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35.

Footnotes

1. Although no formal cut-off score has been identified for the PCL:YV, some studies have used scores of 25 or 30 to designate youth as psychopathic (e.g., Campbell, Porter, & Santor, 2004; Gretton, Mcbride, Hare, O’Shaughnessy, & Kumka, 2001).
2. Reliability coefficients < .60 are considered insufficient, .60–.69 marginal, .70–.79 acceptable, .80–.89 good, and ≥ .90 excellent.
3. Due to small sample sizes, cut-off thresholds were not calculated for ages 6 and 13 years.

Received: Sept. 17, 2014 Revision Received: Oct. 25, 2014 Accepted: Oct. 30, 2014

Archives of Forensic Psychology © 2014 Global Institute of Forensic Psychology 2014, Vol. 1, No. 1, 49–59 ISSN 2334-2749

Forensic Risk Assessment: A Beginner’s Guide

Jerrod Brown The American Institute for the Advancement of Forensic Studies, St. Paul, MN, USA Pathways Counseling Center, St. Paul, MN Concordia University, St. Paul, MN, USA

Jay P. Singh∗ Global Institute of Forensic Research, 11700 Plaza America Drive, Suite 810, Reston, VA 20190, USA [email protected] Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA Faculty of Health Sciences, Molde University College, Molde, Norway

Forensic risk assessment refers to the attempt to predict the likelihood of future offending in order to identify individuals in need of intervention. Risk assessment protocols have been implemented in mental health and criminal justice settings around the globe to prioritize risk reduction strategies for those most in need. Helping to allocate scarce resources more effectively and efficiently while protecting our communities, forensic risk assessment has become a cornerstone of forensic practice in many jurisdictions. The present article is intended to provide practitioners and policymakers with a general introduction to this fast-growing field of research. The process of identifying the static and dynamic risk and protective factors that are incorporated into the risk assessment process is examined. Thereafter, strengths and weaknesses of the three most common approaches to risk assessment are described and requirements under Tarasoff liability are discussed. Keywords: Risk assessment, Recidivism, Violence, Actuarial, Structured professional judgment

In 1972, Alberta Lessard, a mentally ill woman involuntarily committed to a psychiatric hospital in Wisconsin, filed a class action suit on behalf of all individuals aged 18 and older who had been committed under the Wisconsin State Mental Health Act. This legislation allowed mentally ill individuals who were “gravely disabled” (Wis. Stat. § 51.001 et seq.) to be involuntarily hospitalized for treatment. Striking down this law, a federal district court held that, in order to be involuntarily hospitalized: “The risk of violence to self or others must be established, with such dangerousness being demonstrated by a recent overt act plus the substantial probability of recurrence” (Lessard v. Schmidt, 1972, p. 1093). Dangerousness was defined as “having a high probability of inflicting imminent substantial physical harm” (Drake, Clemente, & Perrin, 2006, p. 2). This emphasis on the likelihood of inflicting harm reflected a growing preference in 20th century America for probabilistic estimates of the risk of antisocial behavior such as violence, as opposed to the clinicians’ dichotomous judgments of dangerousness that had been used throughout the first half of the century. It was this need to accurately establish the risk of future offending that gave birth to one of the largest fields in forensic mental health: forensic risk assessment.

Historically, the construct of risk referred to the probability of gain or loss weighted by the value of what stood to be gained or lost. Beginning largely in the 20th century, this construct was applied to the area of forensic mental health. Forensic risk assessment refers to the process by which the likelihood of future antisocial behavior is evaluated (Singh, 2012). The antisocial behavior being predicted may constitute a first-time offense or a repeat offense, the latter of which is referred to as recidivism. Risk assessments routinely involve the structured examination of a number of risk factors (biological, psychological, or sociological characteristics that increase the likelihood of antisocial behavior) and protective factors (biological, psychological, or sociological characteristics that decrease the likelihood of antisocial behavior). These may be static (historical or unchanging), acutely dynamic (modifiable and likely to change), or stably dynamic (modifiable but unlikely to change) in nature (Andrews & Bonta, 2010). An example of a static risk factor for antisocial behavior is a history of violence (Quinsey, Harris, Rice, & Cormier, 2006), whereas an acute dynamic risk factor would be stress (Borum, 1996), and a stable dynamic risk factor would be marital status (Andrews & Bonta, 1995). An example of a static protective factor against antisocial behavior is intelligence (de Vogel, de Ruiter, Bouman, & de Vries Robbé, 2007), whereas an acute dynamic protective factor would be medication adherence (Webster, Martin, Brink, Nicholls, & Desmarais, 2009), and a stable dynamic protective factor would be healthy peer relationships (Webster et al., 2009).

Identifying Risk and Protective Factors

According to the guidelines set forth by Grann and Långström (2007), risk and protective factors, be they static or dynamic in nature, can be identified using one of three techniques: (a) the empirical method, (b) the theoretical method, or (c) the clinical method. Each of these three techniques has its own merit, although they vary in terms of their focus on psychometrics versus practical application.

In the empirical method, risk and protective factors are identified through research in which a sample is followed for long enough to allow for the possibility of offending. The biopsychosocial characteristics of those who offend are analyzed to see if they systematically differ from those of individuals who do not. If the presence of a given characteristic increases the likelihood of offending to a statistically significant extent, it is considered a risk factor. If the presence of that characteristic decreases this likelihood, the characteristic is considered a protective factor.

In the theoretical method, a particular theory (e.g., psychoanalytic, behavioral, cognitive) is used to guide decisions as to which characteristics place an individual at a higher or lower risk of antisocial behavior (Grann & Långström, 2007). Different theoretical orientations offer different conceptualizations of what constitutes an “at risk” person and propose different mechanisms concerning how that individual came to be at risk. For example, risk assessments formulated from a psychoanalytic perspective may take into consideration information concerning disorganized attachment styles as well as an individual’s sexual history. (For an overview of forensic risk assessment and psychodynamic theory, see Doctor & Nettleton, 2003). Behavioral measures, on the other hand, would be more likely to include consideration of the individual’s previous offending history, social competence, and his or her parents’ style of discipline. (For an overview of forensic risk assessment and behavioral theory, see Eifert & Feldner, 2004). Risk tools adopting a cognitive approach would likely include consideration of an individual’s capacity for emotion regulation, tendency to ruminate, and level of impulsivity. (For an overview of cognitive approaches to forensic risk management, see Lipsey, Chapman, & Landenberger, 2001).

In contrast to the previous two approaches, the clinical method of identifying risk and protective factors involves identifying individual characteristics which, regardless of whether they are empirically or theoretically associated with offending, are changeable and thus can be addressed through clinical intervention (Grann & Långström, 2007). For example, although traits such as an individual’s history of antisocial behavior or severe mental illness cannot be altered, other characteristics such as employment status or level of education can. Hence, the clinical method places an emphasis on dynamic factors.
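The decision rule of the empirical method described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a procedure prescribed by Grann and Långström: the function name `classify_factor`, the use of an uncorrected Pearson chi-squared test on a 2 × 2 table, the .05 significance level, and all counts are hypothetical choices made for the example.

```python
import math


def classify_factor(exposed_offend, exposed_total,
                    unexposed_offend, unexposed_total, alpha=0.05):
    """Label a characteristic from follow-up counts.

    'exposed' = sample members with the characteristic,
    'unexposed' = sample members without it.
    Returns (label, offending rate if exposed, offending rate if unexposed, p).
    """
    a = exposed_offend                       # characteristic present, offended
    b = exposed_total - exposed_offend       # characteristic present, did not offend
    c = unexposed_offend                     # characteristic absent, offended
    d = unexposed_total - unexposed_offend   # characteristic absent, did not offend
    n = a + b + c + d

    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the 2x2 table must be non-empty")

    # Pearson chi-squared statistic for a 2x2 table (no continuity correction).
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))

    rate_exposed = a / (a + b)
    rate_unexposed = c / (c + d)
    if p_value >= alpha:
        label = "neither"
    elif rate_exposed > rate_unexposed:
        label = "risk factor"
    else:
        label = "protective factor"
    return label, rate_exposed, rate_unexposed, p_value


# Hypothetical follow-up counts (illustrative only).
print(classify_factor(40, 100, 20, 100)[0])  # offending more common when present
print(classify_factor(10, 100, 30, 100)[0])  # offending less common when present
print(classify_factor(21, 100, 20, 100)[0])  # difference not statistically significant
```

In practice, studies of this kind typically also account for differing follow-up times and for confounding between characteristics, which this two-group comparison does not attempt.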

Contemporary Approaches to Forensic Risk Assessment

Although there are numerous adverse outcomes that can be evaluated through forensic risk assessment (e.g., substance use, absconsion, self-harm), the current article will focus on evidence-based approaches to violence, sex offender, and general recidivism risk assessment. Specifically, we will explore the three leading approaches to risk assessment currently used in practice and examples of key tools that follow each. In addition, we will examine the importance of understanding Tarasoff liability in the context of forensic risk assessment.

Systematic reviews and meta-analyses of the research base on forensic risk assessment have established three leading approaches to this form of evaluation: (a) unstructured clinical judgment, (b) actuarial assessment, and (c) structured professional judgment (Singh & Fazel, 2010). In the following section, we will examine the relative strengths and weaknesses of each.

Unstructured clinical judgment. Unstructured clinical judgment (UCJ) refers to the subjective process of evaluating the likelihood of an adverse outcome without the use of a structured method (e.g., a risk assessment tool). Instead, clinical skills and experience with the individual whose risk is being assessed are relied upon (Murray & Thomson, 2010). The key benefits of the UCJ approach include its flexibility, its utility in tailoring risk assessments to a given individual, its incorporation of a variety of case-specific static and dynamic risk and protective factors, and its inexpensiveness (i.e., no materials need to be purchased).

The key drawback of the UCJ approach is its inherent subjectivity, resulting in poor rates of reliability and predictive validity. Of particular concern is this approach’s vulnerability to human judgment biases in the decision-making process. For example, hindsight bias due to recent tragic events involving high-profile homicides by individuals diagnosed with a mental illness may result in the overestimation of violence risk in persons with quite low base rates of interpersonal aggression (Arkes, 1991; Large, Ryan, Singh, Paton, & Nielssen, 2011). Consider a college-aged adolescent in Newtown, Connecticut in the United States, who was diagnosed with Asperger’s Syndrome and raised by a single mother who had taught him to fire guns. This adolescent would likely be perceived as a higher risk immediately after an armed gunman reportedly diagnosed with Asperger’s Syndrome entered Sandy Hook Elementary School in Newtown in 2012 and fatally shot 20 children and six adult staff members. This is despite epidemiological research findings suggesting that individuals diagnosed with Asperger’s Syndrome are not at increased risk of violence compared to the general population (Ghaziuddin, 2013) and that the large majority of individuals with this diagnosis who do go on to be violent do not commit crimes involving weapons (Lerner, Haque, Northrup, Lawer, & Bursztajn, 2012).

Perhaps the best-known criticism of unstructured clinical judgment in forensic risk assessment is the seminal monograph by Monahan (1981), entitled The Clinical Prediction of Violent Behavior. In this spiritual successor to Meehl’s (1954) Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review of the Evidence, in which it was argued that professionals cannot predict outcomes as successfully as statistical formulae, Monahan reviewed the research literature on unstructured clinical judgment and found that clinicians are unable to predict violence at rates above chance, concluding:

“[P]sychiatrists and psychologists are accurate in no more than one out of three predictions of violent behavior over a several-year period among institutionalized populations that had both committed violence in the past (and thus had high base rates for it) and those who were diagnosed as ‘mentally ill’” (Monahan, 1981, pp. 48–49).

Actuarial assessment. Actuarial risk assessment tools are structured instruments composed of risk and/or protective, static and/or dynamic factors that are found to be associated with the adverse event of interest using a statistical methodology (e.g., logistic regression, Cox regression, Chi-Squared Automatic Interaction Detection [CHAID]). Each item is weighted in accordance with the amount of variance it accounts for in the prediction of the adverse event of interest. Total scores are cross-referenced with a manual in which estimates of recidivism rates are provided either for each score or for ranges of scores (referred to as “risk bins” or “risk categories”). These estimates are derived from the actual rates of recidivism seen in groups with the same score or range of scores in the sample on which the tool was calibrated (i.e., the group whose data were used to develop the tool).

The key benefits of actuarial risk assessment tools include their objectivity and transparency in the risk assessment process, their speed of administration, their reliance on mostly historical information (i.e., mostly static risk factors) that is routinely available in criminal, court, and medical records, their removal of human judgment biases inherent in the clinical decision-making process, and the generation of an estimated recidivism rate. The latter is perceived as the most significant strength of actuarial risk assessment tools, making them of higher perceived usefulness in legal settings (Singh, 2013).
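The scoring logic of an actuarial tool, as described above, amounts to a weighted sum followed by a table lookup. The Python sketch below illustrates that structure only. Every item name, weight, bin boundary, and recidivism rate in it is hypothetical; none is taken from the VRAG, the Static-99, the Level of Service Inventory, or any other published instrument.

```python
from bisect import bisect_right

# Hypothetical item weights (points added when the item is present).
ITEM_WEIGHTS = {
    "prior_violence": 3,
    "early_onset_offending": 2,
    "substance_misuse": 1,
    "never_married": 1,
}

# Hypothetical risk bins: (lowest total score in the bin, label,
# recidivism rate observed for that bin in a calibration sample).
RISK_BINS = [
    (0, "low", 0.10),
    (3, "moderate", 0.25),
    (5, "high", 0.45),
]


def actuarial_estimate(present_items):
    """Return (total score, risk bin label, estimated recidivism rate)."""
    items = set(present_items)
    unknown = items - set(ITEM_WEIGHTS)
    if unknown:
        raise KeyError(f"items not on the instrument: {sorted(unknown)}")

    # Step 1: sum the weights of the items that are present.
    total = sum(ITEM_WEIGHTS[item] for item in items)

    # Step 2: look up the bin whose lower bound is the largest one <= total.
    lower_bounds = [lower for lower, _, _ in RISK_BINS]
    _, label, rate = RISK_BINS[bisect_right(lower_bounds, total) - 1]
    return total, label, rate


total, label, rate = actuarial_estimate(["prior_violence", "substance_misuse"])
print(f"score {total}: {label} risk bin, estimated rate {rate:.0%}")
```

The lookup returns a group-level rate: it describes how often people in the calibration sample with a score in that bin went on to reoffend, and the function rejects any item that is not on the instrument rather than silently adjusting the total.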


The key drawbacks of actuarial risk assessment tools are the inability to apply group-based recidivism rates to individual patients (Hart, Michie, & Cooke, 2007), the instability of estimated recidivism rates when applied to groups in different jurisdictions (Singh, Fazel, Gueorguieva, & Buchanan, 2014), and the inability to incorporate case-specific information to modify estimated recidivism rates. Concerning the latter, the preponderance of the research literature on modifying the findings of actuarial risk assessment tools suggests that such modification weakens rather than strengthens their reliability and predictive validity (Quinsey et al., 2006). In addition, adding or removing additional items on actuarial risk assessment tools or using them with unintended populations or to predict unintended outcomes has been found to weaken their predictive validity (Quinsey et al., 2006). Examples of commonly-used actuarial risk assessment schemes include the Violence Risk Appraisal Guide (VRAG; Quinsey et al., 2006), the Static-99 (Hanson & Thornton, 1999), and the Level of Service Inventory (Andrews & Bonta, 1995). All three of these schemes have been revised over the past decade to take into consideration new research findings concerning violence, sex offenders, and general recidivism risk assessment (respectively). New statistical analyses have recently been conducted to construct the Violence Risk Appraisal Guide-Revised (VRAG-R; Rice, Harris, & Lang, 2013), the Static-99-Revised (Static-99R; Helmus, Thornton, Hanson, & Babchishin, 2012), and the Level of Service/Case Management Inventory (LS/CMI; Andrews, Bonta, & Wormith, 2004). The VRAG-R is a 12-item instrument including static risk factors that capture seven domains: living situation, school performance, substance use, marital status, criminal history, index offense characteristics, and antisocial personality. 
Items are weighted using Nuffield’s 1982 base rate weighting strategy and total scores are used to place individuals into one of nine risk bins, each of which has associated recidivism rate estimates. As the instrument is intended for use in predicting violence in mentally disordered offenders, the VRAG-R may be particularly useful in psychiatric hospitals and clinics for determining the allocation of therapeutic resources as well as in aiding discharge decisions. Albeit a new scheme, the VRAG-R is based on an instrument that is amongst the most validated actuarial instruments available (Waypoint Centre for Mental Health Care, 2014). The Static-99R is a 10-item instrument including static risk factors that capture four domains: age, living situation, index offense characteristics, and prior offense characteristics. Items are weighted according to logistic regression weights and total scores are used to place individuals into one of four risk bins, each of which has associated recidivism rate estimates. As the instrument is intended for use in predicting sexual recidivism in adult sexual offenders, the Static-99R may be particularly useful in court settings for assisting in decisions such as bail determination and the need for community supervision in sexual offenders. Although the revision to the Static-99 is a relatively new instrument, the original scheme has been extensively validated across populations and settings (Helmus, 2008; Singh, Fazel, Gueorguieva, & Buchanan, 2013). The LS/CMI is a 43-item instrument including static and dynamic risk factors that capture eight domains: criminal history, leisure/recreation, alcohol/drug problems, education/employment, companions, procriminal attitudes, family/marital, and antisocial patterns. Items are weighted in a present vs. absent fashion and total scores are used to place individuals into one of five risk bins, each of which has associated recidivism rate estimates. 
As the instrument is intended for use in predicting general recidivism in late adolescent and adult offender populations, the LS/CMI may be particularly useful in jail, prison, and re-entry settings for identifying criminogenic needs and promoting responsive approaches to treatment for high-risk offenders. The LS/CMI is based on the highly-regarded Risk-Needs-Responsivity principles which emphasize prioritizing interventions for individuals at highest risk, focusing on criminogenic needs to reduce risk, and tailoring treatments to the characteristics of the individual case (Andrews & Bonta, 2010).

Structured professional judgment. Structured professional judgment (SPJ) risk assessment tools were developed to address the inflexibility of actuarial schemes. SPJ instruments are composed of risk and/or protective, static and/or dynamic factors that research or theory suggests are associated with the adverse event of interest. Total scores are used as an aide-memoire, guiding administrators in making a categorical risk judgment (e.g., Low, Moderate, or High) when combined with case-specific information gained through clinical experience with the client being evaluated. Hence, total scores are not to be used as statistical predictors of risk but rather as an important piece of a larger formulation process. SPJ schemes seek to address the weaknesses of actuarial schemes. Thus, the key benefits of SPJ risk assessment tools include being more focused on individual clients than groups and the ability to take into consideration information not included in the item content of specific tools. The predictive validity of SPJ tools has been found to be non-significantly different than that of actuarial tools (Fazel, Singh, Doll, & Grann, 2012). In addition, practitioners generally perceive SPJ instruments to be more accurate and reliable than actuarial instruments and also of greater interest to mental health boards (Singh, 2013).
The key drawbacks of SPJ risk assessment tools include a less objective evaluation process as well as the re-introduction of human decision-making biases into risk assessments. In addition, SPJ instruments are generally perceived as taking longer to administer than actuarial instruments (Singh, 2013). Examples of commonly-used SPJ risk assessment schemes include the Historical, Clinical, Risk Management-20 (HCR-20; Douglas, Hart, & Webster, 2013), the Sexual Violence Risk-20 (SVR-20; Boer, Hart, Kropp, & Webster, 1997), and the Short-Term Assessment of Risk and Treatability (START; Webster, Martin, Brink, Nicholls, & Desmarais, 2009). The HCR-20 is a 20-item instrument including both static and dynamic risk factors that capture three domains: historical risk factors, clinical risk factors, and risk management factors. As the instrument is intended for use in predicting violence in mentally disordered civil and forensic patients, the HCR-20 may be particularly useful in mental health settings. Though the third version of this instrument was recently released, the HCR-20 scheme is amongst the most validated available (Douglas et al., 2014). The SVR-20 is a 20-item instrument including both static and dynamic risk factors that capture three domains: psychosocial adjustment risk factors, sexual offense risk factors, and future plans risk factors. As the instrument is intended for use in predicting violence in sexual offenders, the SVR-20 may be particularly useful in Sexually Violent Predator hearings for determining whether indeterminate detention might be necessary for convicted sexual offenders. The second version of this instrument is scheduled to be released within the next year, but the reliability and validity of the original scheme has been evidenced internationally (Rettenberger, Hucker, Boer, & Eher, 2009).


The START is a 20-item instrument including dynamic risk and protective factors. As the instrument is intended for use in predicting violence in psychiatric populations, the START may be particularly useful in civil and forensic psychiatric hospitals and clinics for identifying treatment targets. Albeit comparatively newer than alternative schemes such as the HCR-20 and SVR-20, the START has become one of the most widely used risk assessment tools for the purposes of risk monitoring (Singh, 2013) and has been found to be a reliable and valid predictor of future violence (O’Shea & Dickens, 2014).

Tarasoff Liability and Forensic Risk Assessment

Tarasoff v. Regents of the University of California (1976) arose after a 25-year-old Master’s student at the University of California, Berkeley, who had been diagnosed with paranoid schizophrenia, murdered a 19-year-old girl named Tatiana Tarasoff (Slovenko, 1988). Tarasoff’s murderer had been receiving counselling and had disclosed his obsession with the girl and his fantasies about harming her to his therapist, who claimed that he did not inform the proper authorities due to doctor-patient confidentiality. Tarasoff’s parents sued the murderer’s therapist as well as other members of the University for negligence. The Supreme Court of California held that:

When a therapist determines, or pursuant to the standards of his profession should determine, that his patient presents a serious danger of violence to another, he incurs an obligation to use reasonable care to protect the intended victim against such danger. The discharge of this duty may require the therapist to take one or more of various steps, depending upon the nature of the case. Thus it may call for him to warn the intended victim or others likely to apprise the victim of the danger, to notify the police, or to take whatever other steps are reasonably necessary under the circumstances (Tarasoff v. Regents of the University of California, 1976, p. 431).

The ruling that mental health professionals have a duty to protect those individuals whom their patients threaten with bodily harm established a legal precedent that clinicians could be held partially responsible for their patients’ crimes unless they could prove that they thoroughly assessed their patients’ risk of harming others (Cooper, Griesel, & Yuille, 2008). Thus, the introduction of “Tarasoff liability” (Mason, 1998, p. 109) also increased the importance of forensic risk assessment tools, which could be cited as evidence against clinical negligence (Monahan, 2006). Although the Tarasoff ruling has not been upheld in all 50 U.S. states (Kaser-Boyd, 2015), the American Psychological Association continues to expect mental health professionals to be competent in the use of violence risk assessment tools when evaluating the risk of harm to others as part of civil commitment hearings (Gilfoyle et al., 2011).

Conclusion

Over the past 30 years, more than 400 forensic risk assessment tools have been developed for the purposes of predicting the likelihood of future violence, sex offending, and general recidivism (Singh, 2013). In accordance with a recent amicus curiae brief from the American Psychological Association (Gilfoyle et al., 2011), we recommend that such structured instruments be routinely used by mental health and criminal justice professionals and that judges and lawyers seek out evaluators who use such instruments rather than unstructured clinical judgments. This said, risk assessment tools should not be the sole determinants of decisions concerning civil liberties, especially when the base rate of the outcome of interest is particularly low (McSherry & Keyzer, 2009). Though currently used in over 40 countries for prediction, management, and monitoring purposes (Singh, 2013), no single risk assessment tool has emerged as being more accurate than others (Yang, Wong, & Coid, 2010). To decide which tool to use, meta-analytic research suggests focusing on the intended population and outcome for which a tool was designed, and then trying to find a “best fit” with the population and outcome of interest (Singh, Grann, & Fazel, 2011). The more deviations from a tool manual (e.g., item omissions, changes in scoring procedures), the weaker the tool’s performance – this extends to using a “clinical override” on estimates established by actuarial risk assessment tools (Quinsey et al., 2006). As new risk assessment tools have recently been developed for more specific populations – for example, intellectually disabled offenders (Lofthouse, Lindsay, Totsika, Hastings, & Roberts, 2014) – to assess the likelihood of more specific outcomes – for example, spousal assault (Kropp, Hart, & Belfrage, 2010) and suicide (Steeg et al., 2012) – there has been a renewed focus on moving beyond static risk factors and moving towards incorporating more dynamic and protective factors in the item content of these increasingly important instruments.
With the knowledge gained in this article on the broad approaches used in forensic risk assessment, it is recommended that interested readers continue their education on available tools using resources such as the Risk Management Authority’s (2007) Risk Assessment Tool Evaluation Directory or the Global Institute of Forensic Research’s monthly Executive Bulletin on risk assessment tools (www.gifrinc.com).
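
The caution about low base rates can be illustrated with a short calculation. The following is a minimal sketch, not taken from any cited tool: the sensitivity and specificity values are hypothetical and chosen only to show how the proportion of correctly flagged individuals (the positive predictive value) falls as the outcome becomes rarer.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              base_rate: float) -> float:
    """Share of individuals flagged as high risk who actually reoffend."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# Hypothetical tool with 80% sensitivity and 80% specificity,
# applied to outcomes with progressively lower base rates.
for base_rate in (0.30, 0.10, 0.01):
    ppv = positive_predictive_value(0.80, 0.80, base_rate)
    print(f"base rate {base_rate:.0%}: PPV = {ppv:.1%}")
```

Under these assumed accuracy figures, roughly 63% of flagged individuals would reoffend at a 30% base rate, but only about 4% at a 1% base rate, which is why a high-risk classification alone is a weak basis for decisions about rare outcomes.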

References

Andrews, D. A., & Bonta, J. (1995). LSI-R: The Level of Service Inventory-Revised. Toronto, ON: Multi-Health Systems.
Andrews, D. A., & Bonta, J. (2010). The psychology of criminal conduct (5th ed.). New Providence, NJ: Matthew Bender & Company, Inc.
Andrews, D. A., Bonta, J., & Wormith, J. (2004). LS/CMI: The level of service/case management inventory. Toronto, ON: Multi-Health Systems.
Arkes, H. R. (1991). Costs and benefits of judgment errors: Implications for debiasing. Psychological Bulletin, 110 (3), 486–498. doi: 10.1037/0033-2909.110.3.486
Boer, D. P., Hart, S. D., Kropp, P. R., & Webster, C. D. (1997). Manual for the sexual violence risk-20: Professional guidelines for assessing risk of sexual violence. Burnaby, BC: Mental Health, Law, and Policy Institute, Simon Fraser University.
Borum, R. (1996). Improving the clinical practice of violence risk assessment: Technology, guidelines, and training. American Psychologist, 51 (9), 945–956. doi: 10.1037/0003-066x.51.9.945
Cooper, B. S., Griesel, D., & Yuille, J. C. (2008). Clinical-forensic risk assessment: The past and current state of affairs. Journal of Forensic Psychology Practice, 7 (4), 1–63. doi: 10.1300/j158v07n04_01
de Vogel, V., de Ruiter, C., Bouman, Y., & de Vries Robbé, M. (2007). Handleiding bij de SAPROF: Structured assessment of protective factors for violence risk. Versie 1 [Guide to the SAPROF. Structured Assessment of Protective Factors for violence risk. Version 1]. Utrecht: Forum Educatief.
Doctor, R., & Nettleton, S. (2003). Dangerous patients: A psychodynamic approach to risk assessment and management. London: Karnac.
Douglas, K. S., Hart, S. D., Webster, C. D., & Belfrage, H. (2013). HCR-20 (Version 3): Assessing risk for violence. Burnaby, BC: Mental Health, Law, and Policy Institute, Simon Fraser University.
Douglas, K. S., Shaffer, C., Blanchard, A. J. E., Guy, L. S., Reeves, K., & Weir, J. (2014). HCR-20 violence risk assessment scheme: Overview and annotated bibliography. Burnaby, BC: Mental Health, Law, and Policy Institute, Simon Fraser University.
Drake, C., Clemente, L., & Perrin, G. (2006). Violence risk assessment in clinical practice. Phoenix, AZ: Arizona Psychologist. Retrieved from http://azpsychologist.com/sourcebook Violence Risk-Lia%5B1%5D.doc
Eifert, G. H., & Feldner, M. T. (2004). Comprehensive handbook of psychological assessment: Behavioral assessment. In M. Hersen (Ed.), (Vol. 3, pp. 95–107).
Fazel, S., Singh, J., Doll, H., & Grann, M. (2012). The prediction of violence and antisocial behaviour: A systematic review and meta-analysis of the utility of risk assessment instruments in 73 samples involving 24,827 individuals. British Medical Journal, 345, e4692.
Ghaziuddin, M. (2013). Violent behavior in autism spectrum disorder: Is it a fact, or fiction? Current Psychiatry, 12 (10), 22–32.
Gilfoyle, N. F. P., Ogden, D. W., Tran, D. H., Friedman, S. S., Owens, A. L., & Pickering, W. C. (2011). Brief for amici curiae American Psychological Association and Texas Psychological Association in support of petition for a writ of certiorari. Washington, DC: American Psychological Association.
Grann, M., & Långström, N. (2007). Actuarial assessment of violence risk: To weigh or not to weigh? Criminal Justice and Behavior, 34 (1), 22–36.
Hanson, R. K., & Thornton, D. (1999). Static 99: Improving actuarial risk assessments for sex offenders (Vol. 2; User Report 9902). Ottawa, ON: Department of the Solicitor General of Canada.
Hart, S. D., Michie, C., & Cooke, D. J. (2007). Precision of actuarial risk assessment instruments: Evaluating the ‘margins of error’ of group v. individual predictions of violence. The British Journal of Psychiatry, 190 (49), s60–s65. doi: 10.1192/bjp.190.5.s60
Helmus, L. (2008). Annotated bibliography of Static-99 replications. Retrieved from http://www.static99.org/pdfdocs/static-99annotatedbibliography.pdf
Helmus, L., Thornton, D., Hanson, R. K., & Babchishin, K. M. (2012). Improving the predictive accuracy of Static-99 and Static-2002 with older sex offenders: Revised age weights. Sexual Abuse: A Journal of Research and Treatment, 24 (1), 64–101. doi: 10.1177/1079063211409951
Kaser-Boyd, N. (2015). Threat assessment in homicide/suicide: The duty to warn. In C. de Ruiter & N. Kaser-Boyd (Eds.), Forensic psychological assessment in practice: Case studies. New York, NY: Routledge.
Kropp, P. R., Hart, S. D., & Belfrage, H. (2010). The brief spousal assault form for the evaluation of risk (B-SAFER), Version 2: User manual [Computer software manual]. Vancouver, BC.
Large, M. M., Ryan, C. J., Singh, S. P., Paton, M. B., & Nielssen, O. B. (2011). The predictive value of risk categorization in schizophrenia. Harvard Review of Psychiatry, 19 (1), 25–33. doi: 10.3109/10673229.2011.549770
Lerner, M. D., Haque, O. S., Northrup, E. C., Lawer, L., & Bursztajn, H. J. (2012). Emerging perspectives on adolescents and young adults with high-functioning autism spectrum disorders, violence, and criminal law. Journal of the American Academy of Psychiatry and the Law Online, 40 (2), 177–190.
Lessard v. Schmidt, 349 F. Supp. 1078 (E.D. Wis. 1972).
Lipsey, M. W., Chapman, G. L., & Landenberger, N. A. (2001). Cognitive-behavioral programs for offenders. The Annals of the American Academy of Political and Social Science, 578 (1), 144–157.
Lofthouse, R. E., Lindsay, W. R., Totsika, V., Hastings, R. P., & Roberts, D. (2014). Dynamic risk and violence in individuals with an intellectual disability: Tool development and initial validation. The Journal of Forensic Psychiatry & Psychology, 25, 288–306.
Mason, T. (1998). Tarasoff liability: Its impact for working with patients who threaten others. International Journal of Nursing Studies, 35 (1), 109–114.
McSherry, B., & Keyzer, P. (2009). Sex offenders and preventive detention: Politics, policy and practice. Annandale, VA: Federation Press.
Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press.
Monahan, J. (1981). The prediction of violent behavior. Rockville, MD: U.S. Department of Health and Human Services.
Monahan, J. (2006). Tarasoff at thirty: How developments in science and policy shape the common law. University of Cincinnati Law Review, 75, 497–521.
Murray, J., & Thomson, M. E. (2010). Clinical judgement in violence risk assessment. Europe’s Journal of Psychology, 6 (1), 128–149.
Nuffield, J. (1982). Parole decision-making in Canada: Research towards decision guidelines. Ottawa, ON: Ministry of Supply and Services Canada.
O’Shea, L. E., & Dickens, G. L. (2014). Short-term assessment of risk and treatability (START): Systematic review and meta-analysis. Psychological Assessment, 26 (3), 990–1002. doi: 10.1037/a0036794
Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. A. (2006). Violent offenders: Appraising and managing risk (2nd ed.). Washington, DC: American Psychological Association.
Rettenberger, M., Hucker, S. J., Boer, D. P., & Eher, R. (2009). The reliability and validity of the Sexual Violence Risk-20 (SVR-20): An international review. Sexual Offender Treatment, 4, 1–14.
Rice, M. E., Harris, G. T., & Lang, C. (2013). Validation of and revision to the VRAG and SORAG: The violence risk appraisal guide-revised (VRAG-R). Psychological Assessment, 25 (3), 951–965. doi: 10.1037/a0032878
Risk Management Authority. (2007). Risk assessment tools evaluation directory (RATED). Paisley: Risk Management Authority.
Singh, J. P. (2012). Handbook of juvenile forensic psychology and psychiatry. In E. Grigorenko (Ed.), Handbook of juvenile forensic psychology and psychiatry (pp. 215–225). New York: Springer.
Singh, J. P. (2013). The International Risk Survey (IRiS) project: Perspectives on the practical application of violence risk assessment tools. Paper presented at the Annual Conference of the American Psychology-Law Society, Portland, OR.
Singh, J. P., & Fazel, S. (2010). Forensic risk assessment: A metareview. Criminal Justice and Behavior, 37 (9), 965–988. doi: 10.1177/0093854810374274
Singh, J. P., Fazel, S., Gueorguieva, R., & Buchanan, A. (2013). Rates of sexual recidivism in high risk sex offenders: A meta-analysis of 10,422 participants. Sexual Offender Treatment, 7, 44–57.
Singh, J. P., Fazel, S., Gueorguieva, R., & Buchanan, A. (2014). Rates of violence in patients classified as high risk by structured risk assessment instruments. The British Journal of Psychiatry, 204 (3), 180–187. doi: 10.1192/bjp.bp.113.131938
Singh, J. P., Grann, M., & Fazel, S. (2011). A comparative study of violence risk assessment tools: A systematic review and metaregression analysis of 68 studies involving 25,980 participants. Clinical Psychology Review, 31 (3), 499–513. doi: 10.1016/j.cpr.2010.11.009
Slovenko, R. (1988). Commentary: The therapist’s duty to warn or protect third persons. Journal of Psychiatry and Law, 16, 139–209.
Steeg, S., Kapur, N., Webb, R., Applegate, E., Stewart, S. L. K., Hawton, K., ... Cooper, J. (2012, Mar). The development of a population-level clinical screening tool for self-harm repetition and suicide: The ReACT self-harm rule. Psychological Medicine, 42 (11), 2383–2394. doi: 10.1017/s0033291712000347
Tarasoff v. Regents of the University of California, 17 Cal. 3d 425, 551 P.2d 334 (1976).
Waypoint Centre for Mental Health Care. (2014). Research department bibliography on assessment and communication of violence risk. Retrieved from http://static.squarespace.com/static/520a76a0e4b03ad27abae1e3/t/53f52df0e4b07b3557c47122/1408577008298/Risk.pdf
Webster, C. D., Martin, M. L., Brink, J., Nicholls, T. L., & Desmarais, S. L. (2009). Manual for the short-term assessment of risk and treatability (START) (Version 1.1). Hamilton, ON: Forensic Psychiatric Services Commission.
Yang, M., Wong, S. C. P., & Coid, J. (2010). The efficacy of violence prediction: A meta-analytic comparison of nine risk assessment tools. Psychological Bulletin, 136 (5), 740–767. doi: 10.1037/a0020473

Received: Oct. 2, 2014 Revision Received: Dec. 8, 2014 Accepted: Dec. 11, 2014

Archives of Forensic Psychology © 2015 Global Institute of Forensic Psychology 2015, Vol. 1, No. 2, 1–15 ISSN 2334-2749

Assessing the Risk of Severe Intimate Partner Violence: Validating the DyRiAS in Switzerland

Juliane Gerth Department of Mental Health Services, Office of Corrections, Canton of Zurich, Feldstrasse 42, P.O. Box 8090, Zurich, Switzerland Department of Psychology, University of Konstanz, Universitätsstrasse 10, 78464 Konstanz, Germany

Astrid Rossegger Department of Mental Health Services, Office of Corrections, Canton of Zurich, Feldstrasse 42, P.O. Box 8090, Zurich, Switzerland Department of Psychology, University of Konstanz, Universitätsstrasse 10, 78464 Konstanz, Germany

Jay P. Singh Global Institute of Forensic Research, 11700 Plaza America Drive, Suite 810, Reston, VA 20190 Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania Faculty of Health Sciences, Molde University College, P.O. Box 2110, 6402 Molde, Norway

Jerome Endrass∗ Department of Mental Health Services, Office of Corrections, Canton of Zurich, Feldstrasse 42, P.O. Box 8090, Zurich, Switzerland Tel: +41 43 259 8168, Fax: +41 43 259 8451, E-mail: jerome.endrass@ji.zh.ch Department of Psychology, University of Konstanz, Universitätsstrasse 10, 78464 Konstanz, Germany

The aim of the present study was to investigate the performance of the Dynamic Risk Analysis System (DyRiAS) in assessing the risk of lethal and potentially lethal intimate partner violence (IPV) in the Canton of Zurich, Switzerland. Police records were used to retrospectively administer the DyRiAS for 171 IPV offenders processed by the municipal police of Zurich in 2008. The sample was then followed for a period of three months to five years. The ability of the six DyRiAS risk categories to discriminate between recidivists and non-recidivists was investigated using correlational and receiver operating characteristic curve analyses. DyRiAS assessments were not found to be significantly associated with lethal or potentially lethal IPV. Furthermore, the finding that no non-recidivists were assigned to the instrument’s lowest risk categories and that no recidivists were assigned to the instrument’s highest risk category could not be attributed to intense police interventions. On the basis of the current study, the DyRiAS does not appear to be able to predict the likelihood of lethal or potentially lethal IPV. Further research is necessary to replicate these findings in larger samples using prospective study designs. Keywords: Intimate partner violence, Risk assessment, Homicide, Crime, Police


Assessing the Risk of Severe Intimate Partner Violence: Validating the DyRiAS in Switzerland

Approximately 4 out of 10 female homicide victims are murdered by their current or former intimate partner (World Health Organization, 2013). Systematic review evidence suggests that this statistic is particularly representative of high-income countries (Stöckl et al., 2013). In the United States, 39% of female homicide victims are killed by their partners, whereas this applies to only 3% of all American male homicides (Catalano, 2007). Similarly, the rate of intimate partner homicide (IPH) in Canada has remained four times higher for women than for men over the past thirty years (Northcott, 2012). In England and Wales, 65% of all IPH victims are female (Dixon & Graham-Kevan, 2011). This trend also extends to Central Europe, where 41% of all police-registered female homicides in Germany (German Ministry of the Interior, 2014) and 58% in Switzerland are attributed to partners (Zoder, 2008). Given that more than half of women killed in Switzerland are murdered by their partner, policymakers have considerable interest in the reliable and valid assessment of IPH risk. Of particular interest is the assessment of individuals reported to the police for general intimate partner violence (IPV) incidents, as cases of IPH are frequently preceded by less severe forms of violence (65%–85%, Bailey et al., 1997; Browne, Williams, & Dutton, 1998; Dixon & Graham-Kevan, 2011; Moracco, Runyan, & Butts, 1998; Roehl, O’Sullivan, Webster, & Campbell, 2005a).

Intimate Partner Homicide Risk Assessment

On the one hand, as incidents of IPV often precede IPH, IPH is often discussed as part of a continuum (i.e., an aggravation of IPV). On the other hand, as IPH is not determined by IPV and discernible risk markers have been identified, these two forms of violence are also considered distinct from each other (Addington & Perumean-Chaney, 2014; Dobash, Dobash, Cavanagh, & Medina-Ariza, 2007; McCloskey, 2007). For example, Dobash et al. (2007) found in their comparative study that non-lethal IPV perpetrators are less possessive and less jealous, but have a longer criminal history than lethal IPV perpetrators. Also, the former are more often intoxicated at the time of their index offense (Addington & Perumean-Chaney, 2014; Dobash et al., 2007). In addition, the rate of domestic homicides is decidedly lower than that of domestic violence in general: In 2013, 28.5% (n = 13,003) of all violent and sexual offenses registered by the police in Switzerland were domestic. Of those, only 0.8% (n = 63) were lethal (Swiss Federal Statistical Office, 2014). Hence, assessing the likelihood of IPH in the same manner as IPV will result in an overestimation of risk, necessitating approaches with greater specificity (Dixon & Graham-Kevan, 2011).

Despite a considerable number of instruments having been developed for the purpose of assessing IPV risk and despite the publication of several articles on the use of those instruments for the prediction of femicide (e.g., the Ontario Domestic Assault Risk Assessment in Eke, Hilton, Harris, Rice, & Houghton, 2011), we identified only one validated risk assessment instrument that has been developed specifically for IPH: the Danger Assessment (DA; Campbell et al., 2003). Revised in 2003, the current version of the DA is a mechanical instrument comprised of 20 risk factors across four domains: threatening and sexual violence, characteristics of the relationship between the victim and offender, socio-demographic characteristics of the offender, and family status of the victim. Total scores are used to place offenders into one of four risk categories, with higher scores denoting higher recidivism risk. The two independent cross-validation studies published on the current version of the DA have found mixed evidence of the predictive validity of the instrument. One study found moderate levels of predictive validity for any and severe IPV (AUC = 0.67 and 0.69, respectively; Roehl, O’Sullivan, Webster, & Campbell, 2005b). However, the other showed that the DA strongly overestimated recidivism risk, as 55% of the sample were assigned to the highest risk category, but eventually none were actually charged with or convicted of any kind of (attempted) homicidal offense (Storey & Hart, 2014). Hence, additional IPH risk assessment instruments may be necessary.
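
The AUC values reported above can be read as the probability that a randomly chosen recidivist receives a higher risk score than a randomly chosen non-recidivist, with 0.5 indicating chance-level discrimination. The following is a minimal sketch of that pairwise definition; the scores are invented for illustration and are not data from any of the cited studies.

```python
from itertools import product


def auc(recidivist_scores, non_recidivist_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (recidivist, non-recidivist) pairs in which the
    recidivist has the higher score; ties count as half a win.
    """
    pairs = list(product(recidivist_scores, non_recidivist_scores))
    if not pairs:
        raise ValueError("both groups must contain at least one score")
    wins = sum(1.0 if r > n else 0.5 if r == n else 0.0 for r, n in pairs)
    return wins / len(pairs)


# Hypothetical risk categories (higher = higher assessed risk).
recidivists = [3, 4, 2, 4]
non_recidivists = [1, 2, 3, 1, 2, 4]
print(f"AUC = {auc(recidivists, non_recidivists):.2f}")
```

Because the statistic depends only on the rank order of scores, it suits ordinal risk categories, but it says nothing about how many individuals end up in the highest category, which is the overestimation problem noted for the DA.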

The Dynamic Risk Analysis System

An alternative to the DA has recently been proposed in the German-speaking region of Europe to provide a mechanical method of assessing the likelihood of lethal and potentially lethal incidents of IPV, namely the Dynamic Risk Analysis System (DyRiAS; Hoffmann & Glaz-Ocik, 2012). The DyRiAS was developed to assess the immediate (days to months) likelihood of lethal and potentially lethal IPV. The instrument contains static as well as dynamic items, which allows raters to monitor escalations in relational conflicts. The items were selected on the basis of an extensive literature review as well as case study analyses and represent risk and protective factors that capture situational, behavioral, and cognitive-emotional domains. The items’ content covers factors related to the current state of the relationship between the victim and the offender as well as critical life events. Further, characteristics of the offender are included, such as attitudes, controlling behavior, suicidal behavior, substance abuse, criminal history, and access to weapons. The instrument is a web-based application, and items are weighted relative to their perceived strength of association with IPH. Interdependencies between risk and protective factors for IPH are modeled using a proprietary hierarchical decision-making process. The DyRiAS classifies individuals into one of six risk categories, with higher categories denoting a higher risk of committing a serious violent offense against an intimate partner. Further, an assessment of four different subscales can be conducted on the basis of the DyRiAS item scoring: 1) ‘Situation’, which assesses potentially negative external factors affecting the person; 2) ‘Mind-Set’, which assesses the offender’s perception of the current situation; 3) ‘Verhalten’ [behavior], which assesses preparatory acts towards a future offense; and 4) ‘Mittelschwere Gewalt’ [moderately severe violence], which assesses the risk of future violent assaults that are less severe than lethal or near-lethal violence. The DyRiAS is currently used in women’s support organizations and police departments in Germany, Austria, and Switzerland (Hoffmann, personal communication, July 25, 2013).

A systematic literature search conducted in February 2015 using PubMed, PsycINFO, MEDLINE and Google Scholar identified only one previous publication examining the validity of the DyRiAS (Hoffmann & Glaz-Ocik, 2012). This retrospective validation study evaluated the convergent validity of the DyRiAS using 61 cases of attempted IPH in Germany. The authors found that 82% (n = 50) of the offenders were assigned to the two highest risk categories on the instrument. These findings may have been confounded, however, by the fact that assessors were not blind to outcomes at the time of administering the DyRiAS. Hence, additional research is needed to establish the validity of the instrument, especially in applied settings.

The Present Study

The objective of the present study was to investigate the discriminative validity of the DyRiAS in the prediction of both short-term and long-term IPV recidivism risk in the Canton of Zurich, Switzerland. Specifically, we sought to conduct a validation study in a police setting using a total cohort of male-to-female domestic violence cases in which an intimate partner was the perpetrator. In light of the ongoing discussion of the effect of interventions on the risk of recidivism, we made sure to collect data on police interventions that were mandated subsequent to the index assault. On the one hand, recent studies suggest that police interventions (e.g., protection orders) are correlated with reduced recidivism (Belfrage & Strand, 2012; Belfrage et al., 2012; Logan & Walker, 2009; Storey, Kropp, Hart, Belfrage, & Strand, 2014). On the other hand, the mediating role of risk management strategies such as police interventions in the performance of risk assessment tools is currently unclear (Belfrage et al., 2012; Storey et al., 2014). Due to a lack of validation studies of the DyRiAS, we aimed to assess not only its discriminative validity but also its concurrent validity as compared to a commonly used and well-validated risk assessment tool, the Ontario Domestic Assault Risk Assessment (ODARA; Hilton et al., 2004). According to a recent meta-analysis by Messing and Thaller (2013) on general IPV risk assessment, the ODARA showed the highest mean effect size (AUC = .67) compared to other IPV risk assessment tools across five prospective validation studies. Furthermore, findings of Hilton and colleagues (2004) indicate a positive correlation between the ODARA’s risk categories and the severity of the re-offense. In addition, Eke and colleagues (2011) showed in a retrospectively designed pilot study (sample size of n = 13) that 92% of IPH recidivism cases were assigned to the highest risk category of the ODARA. 
Thus, our analyses aimed to investigate the following three research questions:

1. Is the DyRiAS able to discriminate between potentially lethal or lethal IPV recidivists and non-recidivists?
2. Does police intervention impact the discrimination performance of the DyRiAS?
3. Is the DyRiAS concurrently valid with the commonly used ODARA?

Method

Participants

As part of a broader evaluation of IPV offenders in the Canton of Zurich, all cases of domestic violence processed by the municipal police of Zurich between January 1, 2008 and December 31, 2008 were eligible for inclusion in the present study (N = 342). Only adult men who either physically assaulted their female partner or threatened her with death while holding a weapon were included (N = 216). An additional 39 offenders were excluded due to a lack of adequate information to score the DyRiAS, three were excluded because they had no time at risk, never having been released from prison or a forensic psychiatry facility within the five-year follow-up, and another three were excluded because they received court-mandated therapy (N = 171). As the DyRiAS was designed to monitor risk over time, its discrimination performance was measured over four lengths of time at risk: three months, six months, one year, and five years. Offenders who died, were deported to their home country, or were incarcerated for a period of time and thus did not reach the necessary time at risk were also excluded. Hence, the sample size for the discrimination analyses differed across the four time periods as follows: 3 months (n = 168), 6 months (n = 167), 1 year (n = 166), and 5 years (n = 146).
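The sample flow described above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; all counts are taken from the text, while the variable and function names are illustrative assumptions.

```python
# Counts reported in the Participants section; names are illustrative only.
eligible = 342        # domestic violence cases processed in 2008
met_inclusion = 216   # adult men; physical assault or armed death threat
exclusions = {
    "insufficient information to score the DyRiAS": 39,
    "no time at risk within five years": 3,
    "court-mandated therapy": 3,
}

def final_sample(start: int, excluded: dict) -> int:
    """Subtract every exclusion count from the starting cohort."""
    return start - sum(excluded.values())

n_total = final_sample(met_inclusion, exclusions)
print(n_total)  # 171

# Offenders who did not reach the required time at risk, per follow-up period.
subsamples = {"3 months": 168, "6 months": 167, "1 year": 166, "5 years": 146}
not_reached = {period: n_total - n for period, n in subsamples.items()}
print(not_reached)
```

The second dictionary shows how many of the 171 offenders dropped out of each discrimination analysis because they did not reach the necessary time at risk.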

Procedure and Measures

DyRiAS. The DyRiAS is a Web-based application composed of 39 dichotomous items measuring static and dynamic risk and protective factors. To score the DyRiAS, information must be available for at least 55% of the instrument’s items. Missing items are not prorated, but rather scored as “0”. Offenders are classified into one of six risk categories (Category 0 = No Risk; Category 5 = High Risk) using the instrument’s proprietary weighting system, which is based on each item’s perceived importance to the prediction of IPH and on its interdependencies with other items as found in the literature.

ODARA. The Ontario Domestic Assault Risk Assessment (ODARA) is an actuarial instrument comprising 13 risk factors designed to assess the risk of IPV recidivism (Hilton et al., 2004). The instrument operationalizes recidivism as any new physical assault or death threat made while holding a weapon. Up to five missing items (about 38% of the instrument’s items) are allowed, and a prorating table is available. Total scores on the ODARA (range: 0–13) are used to classify offenders into one of seven risk categories, with higher categories denoting higher recidivism risk. A recent review of the ODARA literature identified ten cross-validation studies, which reported a moderate to good ability to discriminate between recidivists and non-recidivists (Gerth, Rossegger, Urbaniok, & Endrass, 2014).

Recidivism. Consistent with the DyRiAS manual, recidivism was defined as an incident of lethal or potentially lethal IPV, which was operationalized as bodily harm, endangering the life of another, or (attempted) homicide. To increase sensitivity, a second, broader outcome was also used, which added rape, sexual coercion, coercion, minor assaults, and threats to the criteria.
Only incidents registered by the municipal and cantonal police of Zurich in which the victim was a current or former female partner and which occurred after the index assault of IPV were considered acts of recidivism.

Police intervention. Interventions carried out by the police subsequent to the index incident included arrests, protective orders (e.g., no-contact orders), eviction and rayon orders (i.e., the designation of off-limit areas), as well as custody into which offenders were taken to be transferred to the federal prosecutor. Intensity of police interventions was defined as the number of interventions mandated.

Data collection. DyRiAS and ODARA assessments were conducted retrospectively by four psychologists trained in the use of the instruments. Data collection started after the psychologists had successfully assessed five training cases. Each of the four psychologists, blind to offender outcomes, rated non-overlapping subsamples. Scoring was based on police files, which included reports on the index incident by the offender, the victim, and collaterals, as well as medical evidence related to the index incident and information on the offender’s previous contacts with the police. Data were collected at the station of the municipal police of Zurich, where police staff instructed the raters in how to read their files. When uncertainty arose about item ratings, clarification was sought from the authors of the instruments (DyRiAS: Dr. Jens Hoffmann; ODARA: Dr. Zoe Hilton).

Ethics approval. In accordance with Swiss law, the present study did not need approval by the cantonal ethics committee, but rather by the municipal data protection authority. Such approval was provided in March 2013.
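The missing-item rules described for the two instruments can be expressed as simple checks. The sketch below is illustrative only: the thresholds (55% of 39 items for the DyRiAS; at most five of 13 items missing for the ODARA) come from the text, whereas the function names are assumptions and neither instrument's actual scoring or prorating is reproduced.

```python
def min_items_required(n_items: int, min_percent: int) -> int:
    """Smallest whole number of items that reaches the required percentage.
    Integer ceiling division avoids floating-point rounding surprises."""
    return -(-n_items * min_percent // 100)

def dyrias_scorable(n_available: int, n_items: int = 39, min_percent: int = 55) -> bool:
    """True if enough DyRiAS items have information to score the instrument."""
    return n_available >= min_items_required(n_items, min_percent)

def odara_scorable(n_missing: int, max_missing: int = 5) -> bool:
    """True if the number of missing ODARA items is within the allowed limit."""
    return n_missing <= max_missing

print(min_items_required(39, 55))  # 22 items needed for the DyRiAS
print(round(5 / 13 * 100))         # 38 (% of ODARA items that may be missing)
```

Under these assumptions, a case file with information on 22 of the 39 DyRiAS items would be scorable, whereas one with 21 would not.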

Statistical Analysis

The ability of the DyRiAS to discriminate between recidivists and non-recidivists was investigated using correlational and receiver operating characteristic (ROC) curve analyses at three months, six months, one year, and five years at risk in the community. Regression analyses were also conducted to examine the relationship between the intensity of police intervention and DyRiAS assessments. Finally, the concurrent validity of the DyRiAS and ODARA risk categories was examined using Spearman’s ρ. All analyses were two-tailed, used a standard significance threshold of α = .05, and were conducted in Stata/IC 13.0 for Windows (StataCorp, 2013).

Results

Characteristics of the Total Sample

At the time of the index incident, IPV offenders were on average 37.5 years old (SD = 11.2). Regarding nationality, 35.1% (n = 60) of the offenders were Swiss, 20.5% (n = 35) held nationalities of the European Union, 17.5% (n = 30) were from South East Europe, and 12.3% (n = 21) were from Africa. Slightly more than half of them were employed (n = 94, 55.0%). In more than three-quarters of the cases, the index incident occurred in the victim’s home (n = 134, 78.4%), and in more than two-thirds it resulted in physical injury to the victim (n = 118, 69%). According to police files, the index incident involved threats in 89 cases (52.1%), minor assaults in 98 cases (57.3%), bodily harm in 80 cases (46.8%), and coercion in 54 cases (31.6%). At the time of the index incident, half of the offenders were in a partnership with their victim (n = 86, 50.3%). Almost three-quarters (n = 121, 70.6%) of the offenders had previously been registered in police files with an IPV assault, and in four out of five cases the victim was repeatedly abused by the offender (n = 140, 81.9%).

DyRiAS Risk Category Distribution

None of the case files contained sufficient information to score all 39 DyRiAS items. An average of 36.4% (SD = 5.8%) of items remained unscored. None of the offenders were assigned to risk categories 0 or 1. The mode and mean were both Category 3 (n = 78, 45.6% of the offenders scored in this category). About one fourth of the sample was classified into each of Category 2 (n = 45, 26.3%) and Category 4 (n = 41, 24.0%), and only 4.1% (n = 7) was assigned to Category 5.


Rates of Recidivism

No offender recidivated with a lethal offense. The recidivism rate of potentially lethal IPV was 0.6% (n = 1) within three months, 2.4% (n = 4) within six months, 3.0% (n = 5) within one year, and 8.9% (n = 13) within five years of time at risk. Recidivism rates for each DyRiAS risk category within each of the four periods of time at risk are displayed in Table 1. Notably, no offender in Category 5 recidivated.
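The reported base rates follow directly from the number of recidivists and the size of each follow-up subsample. A minimal Python sketch, using only counts stated in the text (the dictionary names are illustrative assumptions):

```python
# Recidivists and subsample sizes per period of time at risk (from the text).
recidivists = {"3 months": 1, "6 months": 4, "1 year": 5, "5 years": 13}
at_risk = {"3 months": 168, "6 months": 167, "1 year": 166, "5 years": 146}

# Base rate in percent, rounded to one decimal place as in the report.
base_rates = {
    period: round(100 * recidivists[period] / at_risk[period], 1)
    for period in recidivists
}
print(base_rates)
# {'3 months': 0.6, '6 months': 2.4, '1 year': 3.0, '5 years': 8.9}
```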

Discrimination Performance

Potentially lethal IPV correlated positively but non-significantly with the DyRiAS risk categories across all subsamples (three months: r = .09, p = .24; six months: r = .09, p = .26; one year: r = .08, p = .30; five years: r = .13, p = .11). Similarly, ROC curve analyses revealed non-significant discrimination of recidivists and non-recidivists by the DyRiAS across all four periods of time at risk (Table 2). Analyses using the broader operationalization of IPV recidivism resulted in similar findings (Table 2). Post hoc power analyses for the discrimination analyses revealed that only the five-year subsample was sufficiently large to detect effects when “true” effects existed with a probability of 80% at a significance level of α = .05.
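To illustrate what the AUC expresses, the sketch below computes it as the probability that a randomly selected recidivist was assigned a higher DyRiAS category than a randomly selected non-recidivist, with ties counted as one half. This is a hedged illustration in Python rather than the Stata routine used in the study. The category counts are those of the five-year subsample in Table 1; the non-recidivist counts are derived here as category size minus recidivists, and the function name is an assumption.

```python
# Five-year subsample (n = 146): offenders per DyRiAS category.
recidivists = {2: 2, 3: 4, 4: 7, 5: 0}
# Derived as category size (35, 72, 35, 4) minus the recidivists above.
non_recidivists = {2: 33, 3: 68, 4: 28, 5: 4}

def auc_from_counts(pos: dict, neg: dict) -> float:
    """Share of recidivist/non-recidivist pairs in which the recidivist
    holds the higher risk category; tied pairs count as half a win."""
    wins = 0.0
    pairs = 0
    for cat_pos, n_pos in pos.items():
        for cat_neg, n_neg in neg.items():
            pairs += n_pos * n_neg
            if cat_pos > cat_neg:
                wins += n_pos * n_neg
            elif cat_pos == cat_neg:
                wins += 0.5 * n_pos * n_neg
    return wins / pairs

print(round(auc_from_counts(recidivists, non_recidivists), 2))  # 0.64
```

The result agrees with the five-year AUC for potentially lethal IPV reported in Table 2. A value of 0.5 would indicate chance-level discrimination.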

Levels of Intervention and Specificity Analyses in the 3-Month Subsample

For almost all offenders in the 3-month subsample (n = 166, 98.8%), protective orders such as no-contact (n = 162, 97.0%), eviction (n = 109, 65.3%), and rayon orders (n = 117, 70.1%) were issued. Most offenders were arrested (77.4%; n = 130) and then held in custody (69.0%; n = 116). Although a correlation was found between DyRiAS risk categories and the intensity of police interventions (rpb = .26, p < 0.001), the latter did not mediate the former’s ability to accurately predict IPV recidivism risk (β = 0.01, SE = 0.19, p = 0.58). Specificity analyses were conducted to descriptively examine different levels of intervention intensity for offenders in DyRiAS categories 4 and 5 (Table 3). All (n = 6; 4% of the 3-month subsample) of the offenders assigned to the highest risk category were arrested, held in custody, and transferred to the federal prosecutor subsequent to the index assault. None of them recidivated. However, of those offenders in the second highest DyRiAS category who did not receive such intensive interventions (n = 5; 4% of the 3-month subsample), there were also no recidivists.

Concurrent Validity

IPV offenders were assigned to ODARA categories 1 to 7 (M = 5.6, SD = 1.4). Information was missing for 53.8% of the offenders, with an average of 1.3 items missing (SD = 0.06). The DyRiAS categories and ODARA categories were not significantly correlated (ρ = .07, p = .34). Because the ODARA was not developed to test for IPH risk, we additionally explored the concurrent validity of the categories of the DyRiAS subscale “moderately severe IPV”, but again found no significant correlation (ρ = .09; p = .24). However, in contrast to the DyRiAS, the ODARA categories’ ability to discriminate between general IPV recidivists and non-recidivists at a follow-up of three months was found to be significant (AUC = 0.73, 95% CI = 0.57–0.90, p = 0.02).

Table 1. DyRiAS Risk Category Distribution and Recidivism Rates for IPV Offenders at 3 Months (n = 168), 6 Months (n = 167), 1 Year (n = 166), and 5 Years (n = 146) Time at Risk

Percentage of Sample in Each Risk Category

Risk Category   Total Sample (N = 171)   3 Months          6 Months          1 Year            5 Years
0               0% (n = 0)               0% (n = 0)        0% (n = 0)        0% (n = 0)        0% (n = 0)
1               0% (n = 0)               0% (n = 0)        0% (n = 0)        0% (n = 0)        0% (n = 0)
2               26.3% (n = 45)           26.8% (n = 45)    26.3% (n = 44)    26.5% (n = 44)    24.0% (n = 35)
3               45.6% (n = 78)           45.8% (n = 77)    46.1% (n = 77)    46.4% (n = 77)    49.3% (n = 72)
4               24.0% (n = 41)           23.8% (n = 40)    24.0% (n = 40)    24.1% (n = 40)    24.0% (n = 35)
5               4.1% (n = 7)             3.6% (n = 6)      3.6% (n = 6)      3.0% (n = 5)      2.7% (n = 4)

Recidivism Rate

Risk Category                  3 Months        6 Months        1 Year          5 Years
0                              N.A.            N.A.            N.A.            N.A.
1                              N.A.            N.A.            N.A.            N.A.
2                              0% (n = 0)      2.3% (n = 1)    2.3% (n = 1)    5.7% (n = 2)
3                              0% (n = 0)      0% (n = 0)      1.3% (n = 1)    5.6% (n = 4)
4                              2.5% (n = 1)    7.5% (n = 3)    7.5% (n = 3)    20.0% (n = 7)
5                              0% (n = 0)      0% (n = 0)      0% (n = 0)      0% (n = 0)
Overall Recidivism Base Rate   0.6%            2.4%            3.0%            8.9%

Note: n = sample size, N.A. = Not applicable.

Table 2. Comparing the Discrimination Performance of the DyRiAS across Two Outcomes

            Potentially Lethal IPV Recidivism            General IPV Recidivism
Follow-up   Base Rate   AUC (95% CI)          p          Base Rate   AUC (95% CI)          p
3 Months    0.6%        0.85 (0.00 - 1.00)    0.25       4.2%        0.57 (0.36 - 0.78)    0.42
6 Months    2.4%        0.67 (0.32 - 1.00)    0.26       11.3%       0.54 (0.39 - 0.69)    0.51
1 Year      3.0%        0.64 (0.36 - 0.92)    0.30       16.8%       0.55 (0.43 - 0.66)    0.41
5 Years     8.9%        0.64 (0.48 - 0.80)    0.11       34.8%       0.55 (0.46 - 0.64)    0.33

Note: AUC = area under the curve, CI = confidence interval, Potentially Lethal IPV Recidivism = bodily harm or endangering the life of another (though in this investigation none of the offenders recidivated with an [attempted] homicide), and General IPV Recidivism = rape, sexual coercion, coercion, minor assaults, and threats.

Table 3. Level of Police Ordered Post-Offense Intervention for Offenders in the 3-Months Subsample Assigned to the High-Risk DyRiAS Categories

Columns: Level of Intervention; DyRiAS Category 4 (n = 40): Non-recidivists, Recidivists; DyRiAS Category 5 (n = 6): Non-recidivists, Recidivists.
Rows: Protective Order; Arrest + Protective Order; Arrest + Custody(a); Arrest + Custody(a) + Protective Order.
Recoverable cell values: In Category 4, one intervention level contained 32 non-recidivists (97.0%) and the single recidivist (3.0%; n = 1); the remaining three levels contained 5, 1, and 1 non-recidivists (100% each) and no recidivists (0.0%; n = 0). In Category 5, all six offenders were non-recidivists at a single intervention level (100%; n = 6); all other cells were 0.0% (n = 0). The assignment of these values to individual rows cannot be recovered from the source layout.

Note: n = sample size. (a) Custody while one’s case is being transferred to the federal prosecutor.


Discussion

The present study investigated the ability of a recently developed risk assessment instrument – the DyRiAS – to discriminate between IPV offenders who went on to commit acts of potentially lethal or lethal IPV and those who did not. A total cohort of 177 IPV offenders processed by the municipal police of Zurich was followed for between three months and five years. Randomly selected recidivists were found to be classified into higher DyRiAS risk categories more often than not, but no significant evidence of discrimination was found. Although these discrimination findings have important implications for practitioners who use the DyRiAS, discrimination represents only one component of criterion validity. The second component is calibration (Rossegger, Endrass, Gerth, & Singh, 2014; Singh, 2013), which measures the fit between the expected and observed rates of recidivism (Schmid & Griffith, 2005). As normative recidivism rates for the six DyRiAS risk categories have not been published, the evaluation of the instrument’s calibration is currently limited to a descriptive analysis of how recidivists are distributed. As no recidivists were in the highest risk category and no non-recidivists were in the lowest risk category, the calibration of the DyRiAS in the Swiss intimate partner offender population appears to be wanting. In accordance with risk-need-responsivity principles (Bonta & Andrews, 2007), IPV offenders judged to be at higher risk of recidivism should receive more intensive police interventions. Indeed, we detected a significant relationship between recidivism risk as assessed by the DyRiAS and the intensity of police intervention. However, the intensity of police intervention did not appear to play a role in the accuracy of DyRiAS assessments of potentially lethal or lethal recidivism risk.
Finally, the poor discrimination validity of the DyRiAS did not appear to be the result of a sample bias, as a second, more widely-used IPV recidivism risk assessment tool – the ODARA – was not correlated with the DyRiAS, and in contrast to the “moderately severe violence” scale of the DyRiAS, the ODARA succeeded in significantly discriminating between IPV recidivists and non-recidivists. Thus, we assume that the non-significant discrimination findings are not an artifact of the sample, but rather of the instrument.

Limitations

There were several limitations of the present study beyond the rather small sample size given the low base rate of IPH. First, the current study was a retrospective, file-based study, which has implications for the quality and completeness of the data collected. In 18% of the cases that were processed by the police in 2008, the DyRiAS could not be administered due to missing information. For the remainder, our descriptive analyses suggested that 37% of the necessary information was missing for the average offender. Thus, it seems that more than a third of the information required to administer the DyRiAS is not routinely collected by the police in the Canton. Hence, there is a discrepancy at this time between the information collected by the police and the information that is necessary to administer the DyRiAS. This discrepancy may be important to consider, as previous research on mechanical instruments has found that discrimination increases with fewer missing items (e.g., Harris & Rice, 2003). It should be the aim of a prospective study to test whether missing information can be made accessible through elaborated witness interviews or whether the findings of the present study reflect a limitation of the practical usefulness of the DyRiAS.

Second, although the raters were trained in the use of the instruments, no inter-rater reliability analysis was conducted. Given the non-significant findings of the current study, it is necessary to examine the DyRiAS’ inter-rater reliability in future research.

Third, the practical utility of the DyRiAS was not assessed. Thus, it may be beneficial to conduct a qualitative investigation such as a survey or set of interviews with police to examine the perceived usefulness of the DyRiAS, not only in risk assessment but also in the development and monitoring of risk management plans. A recent international survey of psychiatrists, psychologists, and nurses suggests that practitioners rate violence risk assessment instruments differently in their utility in such tasks (Singh et al., 2014).

Fourth, no cases of IPH recidivism and only a few cases of potentially lethal IPV were detected. It has been argued that the prediction of low base rate events such as IPH results in false positive rates high enough to make such risk assessments lack practical utility (Szmukler, 2003). The same argument has been made in the evaluation of other low base rate events such as suicide in psychiatric patients (Large, 2010), where perceived community responsibility is also high, despite a general lack of research evidence on predictability. Statistically, the low base rate of IPH resulted in low power, which would have increased the likelihood of false negative findings in our null hypothesis significance testing for the correlational and ROC curve analyses.
As the present study represents a total population cohort, rates of IPH are not likely to be underestimated compared to less severe IPV, and a broader operational definition of IPV did not result in stronger predictive validity findings. Thus, it may be that IPH is not a predictable outcome using the DyRiAS.
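The base-rate argument raised in the fourth limitation can be made concrete with Bayes’ rule. The sketch below is illustrative only: the base rates are those observed in the present sample, but the sensitivity and specificity of 0.80 are hypothetical values chosen for the example and are not estimates for the DyRiAS or any other instrument.

```python
def positive_predictive_value(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Probability that an offender flagged as high risk actually recidivates."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)

# Hypothetical instrument accuracy (assumption, not taken from the study).
SENSITIVITY = 0.80
SPECIFICITY = 0.80

# Observed base rates of potentially lethal IPV at 3 months and 5 years.
for label, base_rate in [("3 months", 0.006), ("5 years", 0.089)]:
    ppv = positive_predictive_value(base_rate, SENSITIVITY, SPECIFICITY)
    print(f"{label}: PPV = {ppv:.3f}, false positives among flagged = {1 - ppv:.3f}")
```

Under these assumed accuracy values, the large majority of offenders flagged as high risk would not go on to recidivate at either base rate, which is the practical concern described by Szmukler (2003).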

Conclusion

The current study is the first validation study assessing the discrimination performance of the DyRiAS, a mechanical instrument developed for the assessment of potentially lethal and lethal intimate partner violence risk. The DyRiAS was not found to discriminate between recidivists and non-recidivists of potentially lethal IPV within three months to five years. On the basis of the current study, the DyRiAS’ ability to assess the risk of potentially lethal IPV and lethal IPV could not be demonstrated. Though further research is necessary to establish the reliability of the instrument and to replicate these findings in larger samples and using prospective study designs, caution is warranted when using paper- or Web-based risk assessment tools for which no validation or normative data are available.

References

Addington, L. A., & Perumean-Chaney, S. E. (2014). Fatal and non-fatal intimate partner violence: What separates the men from the women for victimizations reported to police? Homicide Studies, 18 (2), 196–220. doi: 10.1177/1088767912471341


Bailey, J. E., Kellermann, A. L., Somes, G. W., Banton, J. G., Rivara, F. P., & Rushford, N. P. (1997). Risk factors for violent death of women in the home. Archives of Internal Medicine, 157 (7), 777–782. doi: 10.1001/archinte.1997.00440280101009

Belfrage, H., & Strand, S. (2012). Measuring the outcome of structured spousal violence risk assessments using the B-SAFER: Risk in relation to recidivism and intervention. Behavioral Sciences & the Law, 30 (4), 420–430. doi: 10.1002/bsl.2019

Belfrage, H., Strand, S., Storey, J. E., Gibas, A. L., Kropp, P. R., & Hart, S. D. (2012). Assessment and management of risk for intimate partner violence by police officers using the spousal assault risk assessment guide. Law and Human Behavior, 36 (1), 60–67. doi: 10.1037/h0093948

Bonta, J., & Andrews, D. A. (2007). Risk-need-responsivity model for offender assessment and rehabilitation (Tech. Rep.). Ottawa: Department of Public Safety and Emergency Preparedness Canada. Retrieved from http://www.publicsafety.gc.ca/cnt/rsrcs/pblctns/rsk-nd-rspnsvty/rsk-nd-rspnsvty-eng.pdf

Browne, A., Williams, K. R., & Dutton, D. C. (1998). Homicide between intimate partners. In M. D. Smith & M. A. Zahn (Eds.), Homicide: A sourcebook of social research (pp. 149–164). SAGE Publications.

Campbell, J. C., Webster, D., Koziol-McLain, J., Block, C., Campbell, D., Curry, M. A., . . . Laughon, K. (2003). Risk factors for femicide in abusive relationships: Results from a multisite case control study. American Journal of Public Health, 93 (7), 1089–1097. doi: 10.2105/ajph.93.7.1089

Catalano, S. (2007). Intimate partner violence in the United States (Tech. Rep.). Washington, DC: U.S. Department of Justice, Office of Justice Programs, Bureau of Justice Statistics. Retrieved from http://www.bjs.gov/content/pub/pdf/ipvus.pdf

Dixon, L., & Graham-Kevan, N. (2011). Until death do they part: Preventing intimate partner homicide. The Psychologist, 24, 820–823.

Dobash, R. E., Dobash, R. P., Cavanagh, K., & Medina-Ariza, J. (2007). Lethal and nonlethal violence against an intimate female partner: Comparing male murderers to nonlethal abusers. Violence Against Women, 13 (4), 329–353. doi: 10.1177/1077801207299204

Eke, A. W., Hilton, N. Z., Harris, G. T., Rice, M. E., & Houghton, R. E. (2011). Intimate partner homicide: Risk assessment and prospects for prediction. Journal of Family Violence, 26 (3), 211–216. doi: 10.1007/s10896-010-9356-y

German Ministry of the Interior. (2014). Polizeiliche Kriminalstatistik 2013, Bundesrepublik Deutschland [Police Crime Statistics 2013, Germany] (Tech. Rep.). Wiesbaden: Author.

Gerth, J., Rossegger, A., Urbaniok, F., & Endrass, J. (2014). Das Ontario Domestic Assault Risk Assessment (ODARA) – Validität und autorisierte deutsche Übersetzung eines Screening-Instruments für Risikobeurteilungen bei Intimpartnergewalt [The Ontario Domestic Assault Risk Assessment (ODARA) – Validity and authorised German translation of an intimate partner violence screening tool]. Fortschritte der Neurologie Psychiatrie, 82 (11), 616–626. doi: 10.1055/s-0034-1384915

Harris, G. T., & Rice, M. E. (2003). Actuarial assessment of risk among sex offenders. Annals of the New York Academy of Sciences, 989, 198–210.


Hilton, N. Z., Harris, G. T., Rice, M. E., Lang, C., Cormier, C. A., & Lines, K. J. (2004). A brief actuarial assessment for the prediction of wife assault recidivism: The Ontario Domestic Assault Risk Assessment. Psychological Assessment, 16 (3), 267–275. doi: 10.1037/1040-3590.16.3.267

Hoffmann, J. (personal communication, July 25, 2013).

Hoffmann, J., & Glaz-Ocik, J. (2012). DyRiAS-Intimpartner: Konstruktion eines online gestützten Analyse-Instrumentes zur Risikoeinschätzung von tödlicher Gewalt gegen aktuelle oder frühere Intimpartnerinnen [DyRiAS intimate partner: Construction of a Web-based instrument for assessing the risk of lethal violence against a current or former female intimate partner]. Polizei & Wissenschaft, 2, 45–57.

Large, M. M. (2010). No evidence for improvement in the accuracy of suicide risk assessment. Journal of Nervous and Mental Disease, 198 (8), 604. doi: 10.1097/nmd.0b013e3181e9db3e

Logan, T. K., & Walker, R. T. (2009). Civil protective order effectiveness: Justice or just a piece of paper? Violence and Victims, 25 (3), 332–348. doi: 10.1891/0886-6708.25.3.332

McCloskey, K. (2007). Intimate partner violence. In Encyclopedia of psychology and law (pp. 383–387). Thousand Oaks, CA: SAGE.

Messing, J. T., & Thaller, J. (2013). The average predictive validity of intimate partner violence risk assessment instruments. Journal of Interpersonal Violence, 28 (7), 1537–1558. doi: 10.1177/0886260512468250

Moracco, K. E., Runyan, C. W., & Butts, J. D. (1998). Femicide in North Carolina, 1991–1993: A statewide study of patterns and precursors. Homicide Studies, 2 (4), 422–446. doi: 10.1177/1088767998002004005

Northcott, M. (2012). Intimate partner violence risk assessment tools: A review (Tech. Rep.). Ottawa, Canada: Department of Justice, Research and Statistics Division.

Roehl, J., O’Sullivan, C., Webster, D., & Campbell, D. (2005b). Intimate partner violence risk assessment validation study: The RAVE study practitioner summary and recommendations: Validation of tools for assessing risk from violent intimate partners (Report No. 209732) (Tech. Rep.). Washington, DC: U.S. Department of Justice. Retrieved from https://www.ncjrs.gov/pdffiles1/nij/grants/209732.pdf

Roehl, J., O’Sullivan, C., Webster, D., & Campbell, J. (2005a). Intimate partner violence risk assessment validation study, final report (Report No. 209731) (Tech. Rep.). Washington, DC: U.S. Department of Justice. Retrieved from https://www.ncjrs.gov/pdffiles1/nij/grants/209731.pdf

Rossegger, A., Endrass, J., Gerth, J., & Singh, J. P. (2014, Mar). Replicating the violence risk appraisal guide: A total forensic cohort study. PLoS ONE, 9 (3), e91845. doi: 10.1371/journal.pone.0091845

Schmid, C. H., & Griffith, J. L. (2005). Encyclopedia of biostatistics. In P. Armitage & T. Colton (Eds.), (2nd ed., Vol. 5, pp. 1–6). Chichester: John Wiley & Sons.

Singh, J. P. (2013). Predictive validity performance indicators in violence risk assessment: A methodological primer. Behavioral Sciences & the Law, 31 (1), 8–22. doi: 10.1002/bsl.2052

Singh, J. P., Desmarais, S. L., Hurducas, C., Arbach-Lucioni, K., Condemarin, C., Dean, K., . . . Otto, R. K. (2014). International perspectives on the practical application of violence risk assessment: A global survey of 44 countries. International Journal of Forensic Mental Health, 13 (3), 193–206. doi: 10.1080/14999013.2014.922141

StataCorp. (2013). Stata statistical software: Release 13 (Tech. Rep.). College Station, TX: StataCorp LP.

Stöckl, H., Devries, K., Rotstein, A., Abrahams, N., Campbell, J., Watts, C., & Moreno, C. G. (2013). The global prevalence of intimate partner homicide: A systematic review. The Lancet, 382 (9895), 859–865. doi: 10.1016/s0140-6736(13)61030-2

Storey, J. E., & Hart, S. D. (2014). An examination of the danger assessment as a victim-based risk assessment instrument for lethal intimate partner violence. Journal of Threat Assessment and Management, 1 (1), 56–66. doi: 10.1037/tam0000002

Storey, J. E., Kropp, P. R., Hart, S. D., Belfrage, H., & Strand, S. (2014). Assessment and management of risk for intimate partner violence by police officers using the brief spousal assault form for the evaluation of risk. Criminal Justice and Behavior, 41 (2), 256–271. doi: 10.1177/0093854813503960

Swiss Federal Statistical Office. (2014). Polizeiliche Kriminalstatistik (PKS) - Jahresbericht 2013 [Police Crime Statistics - Annual report 2013] (Tech. Rep.). Neuchâtel: Author.

Szmukler, G. (2003). Risk assessment: ‘numbers’ and ‘values’. Psychiatric Bulletin, 27 (6), 205–207. doi: 10.1192/pb.27.6.205

World Health Organization. (2013). Global and regional estimates of violence against women: Prevalence and health effects of intimate partner violence and non-partner sexual violence (Tech. Rep.). Geneva: Department of Reproductive Health and Research.

Zoder, I. (2008). Tötungsdelikte in der Partnerschaft. Polizeilich registrierte Fälle 2000–2004 [Homicides within intimate partnerships. Cases known to the Police 2000–2004] (Tech. Rep.). Neuchâtel: OFS.

Received: Oct. 11, 2014 Revision Received: Mar. 16, 2015 Accepted: Mar. 20, 2015

Archives of Forensic Psychology © 2015 Global Institute of Forensic Psychology 2015, Vol. 1, No. 2, 16–27 ISSN 2334-2749

The Concealed Information Test in the Laboratory Versus Japanese Field Practice: Bridging the Scientist–Practitioner Gap

Tokihiro Ogawa∗, Izumi Matsuda, Michiko Tsuneoka National Research Institute of Police Science, 6-3-1 Kashiwanoha, Kashiwa, Chiba 277-0882, Japan E-mail: [email protected]

Bruno Verschuere Department of Clinical Psychology, University of Amsterdam, Amsterdam, The Netherlands Department of Experimental Clinical and Health Psychology, Ghent University, Ghent, Belgium Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands

Whereas the Concealed Information Test (CIT) is heavily researched in laboratories, Japan is the only country that applies it on a large scale to real criminal investigations. Here we note that important differences exist in CIT design, data-analysis, and test conclusions between these two settings. These differences can be ascribed to using the CIT in the laboratory to judge the overall presence or absence of crime-related knowledge (examinee-focused), while using it in the field to assess recognition of individual pieces of crime-related knowledge (question-focused). The question-focused approach is one way to increase the usefulness of the CIT and is a key factor that allows Japanese law enforcement to apply the CIT to real criminal investigations. We hope this review can help bridge this apparent scientist–practitioner gap by encouraging critical reflection on the benefits and pitfalls of examinee- vs. question-based approaches, and by encouraging question-focused laboratory-based research that has direct relevance to Japanese field practice. Keywords: polygraph examination, concealed information test, memory detection

Polygraphy is one of the most important and controversial topics in applied psychophysiology. Polygraph testing refers to the recording of physiological signals (typically skin conductance, respiration, and cardiovascular responses) that can provide useful information during the course of criminal investigations. The Comparison Question Test (CQT) is the most commonly applied polygraph method around the world for detecting deception, although it is not the method used by Japanese police. It compares physiological responses to specific, accusatory questions (e.g., Did you steal money from the cashbox last Friday night?) with responses to control questions that are deliberately formulated to be more vague (e.g., In the first 25 years of your life, have you ever done anything illegal?). Stronger physiological responses to the accusatory questions are interpreted as a sign of deception. Yet, the rationale and validity of the CQT have been heavily challenged for decades (National Research Council, 2003). In particular, there is concern that those telling the truth may also show stronger physiological responses to the accusatory questions, resulting in a high rate of false-positive outcomes. Another, less contested polygraph method is the Guilty Knowledge Test (GKT; Lykken, 1959), now commonly referred to as the Concealed Information Test (CIT; Verschuere, Ben-Shakhar, & Meijer, 2011). The CIT does not assess deception, but rather assesses the presence of crime-related memories. In this review paper, we focus on the methodological features of the CIT in laboratory studies and field application to help bridge the gap between research and practice.

The Concealed Information Test

The CIT first appeared in the psychophysiology literature in Lykken’s seminal 1959 paper. The basic concept underlying the CIT was phrased as follows: “Use of physiological measurements to detect not lying, but the presence of ‘guilty knowledge’ requires only the more reasonable assumption that a guilty person will show some involuntary physiological response (e.g., GSR) to stimuli related to remembered details of his crime” (p. 385). Thus, if an examinee is involved in a crime, he or she should know details of the crime that are unknown to the innocent, and this knowledge will cause different physiological responses to the crime-relevant stimuli than to other stimuli. Based on this rationale, Lykken designed a test consisting of six multiple-choice questions. Participants in a mock-crime experiment were read questions such as: “Where did the thief hide the stolen watch? Was it (a) in the men’s room, (b) on the coat rack, (c) in the office, (d) on the windowsill, (e) in the locker?” The items were chosen so that an examinee with no knowledge of the crime would be unable to discriminate the relevant item (e.g., the locker) from among the irrelevant ones. During the presentation of these questions, electrodermal activity was recorded. A score of 2 was given if the crime-related item in a question elicited the largest physiological response and a score of 1 was given if it elicited the second largest response. Otherwise a score of 0 was given. The scores were summed across questions. Because there were six questions, the overall score could range from “a perfect innocent” score of 0 to “a perfect guilty” score of 12. A total score of 6 or less was used to classify the examinee as “innocent” and a total score greater than 6 signified “guilty”. With 100% specificity (no false positives) and 88% sensitivity (12% false negatives), the experiment was a great success.
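The scoring rule described above can be illustrated with a short sketch. It is not Lykken's own procedure in code form: the function names, the data layout, and the handling of tied responses (a tie is resolved in favor of the crime-related item) are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def lykken_question_score(responses: Sequence[float], relevant_index: int) -> int:
    """Score one question: 2 if the crime-related item drew the largest
    response, 1 if it drew the second largest, 0 otherwise.
    Assumption: ties are resolved in favor of the crime-related item."""
    relevant = responses[relevant_index]
    # Count the other items whose response is strictly larger.
    larger = sum(
        1 for i, r in enumerate(responses) if i != relevant_index and r > relevant
    )
    if larger == 0:
        return 2
    if larger == 1:
        return 1
    return 0


def lykken_classify(
    questions: List[Tuple[Sequence[float], int]], cutoff: int = 6
) -> Tuple[int, str]:
    """Sum the per-question scores and classify the examinee.
    A total of `cutoff` or less is "innocent"; a larger total is "guilty"."""
    total = sum(lykken_question_score(resp, idx) for resp, idx in questions)
    return total, ("guilty" if total > cutoff else "innocent")


# Hypothetical data: six questions, five items each, relevant item at index 4.
knowledgeable = [([0.2, 0.3, 0.1, 0.25, 0.9], 4)] * 6
unknowledgeable = [([0.5, 0.6, 0.4, 0.7, 0.1], 4)] * 6
print(lykken_classify(knowledgeable))
print(lykken_classify(unknowledgeable))
```

Note that the output is a single label for the examinee: information about which particular questions drew a response is lost once the scores are summed.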
Subsequent studies have modified the original CIT in efforts to increase its validity (Ben-Shakhar, 2012; Meijer, Selle, Elber, & Ben-Shakhar, 2014; Verschuere et al., 2011; Verschuere & Meijer, 2014). For example, whereas early studies were based solely on skin-conductance responses, additional measures such as respiratory activity (Timm, 1982), heart rate (Verschuere, Crombez, de Clercq, & Koster, 2004), and finger-pulse volume (Elaad & Ben-Shakhar, 2006) have been demonstrated as effective CIT measures and are now often included in the test, sometimes with variations. Studies have also shown that participants with concealed information typically respond to the crime-relevant item with larger electrodermal activity, lower respiratory activity, heart rate deceleration, and peripheral vasoconstriction (Elaad & Ben-Shakhar, 2006; Meijer et al., 2014). These physiological changes are typical of orienting responses (Verschuere et al., 2004). The relevant alternatives are significant only for knowledgeable individuals, and significant stimuli elicit enhanced orienting responses (e.g., Gati & Ben-Shakhar, 1990; Lykken, 1974; Sokolov, 1963). CIT accuracy has also been studied, and several reviews have reported that average correct detection rates (i.e., sensitivity) ranged from 76% to 88%, whereas average correct rejection rates (i.e., specificity) ranged from 83% to 97% (Ben-Shakhar & Furedy, 1990; Lykken, 1998; MacLaren, 2001). Two meta-analyses (Ben-Shakhar & Elaad, 2003; Meijer et al., 2014) have also demonstrated high detection-efficiency estimates for the CIT. Field use of the CIT has been limited. Despite support from scientific studies, it is often thought to be inapplicable (Podlesny, 1993; Vrij, 2008). For example, because the CIT focuses on knowledge known only to the criminal, leakage of crime-related information by the police, media, or attorneys makes it difficult to formulate the proper questions (Ben-Shakhar & Elaad, 2003; Ben-Shakhar, Gronau, & Elaad, 1999). If innocent people know critical details of the crime, detecting such knowledge makes little contribution to a criminal investigation or leads to a waste of time and resources. In this context, Japan is the exception in its practical use of the CIT.
The large-scale use of the CIT in Japan may not be related to Japanese law enforcement being better able to withhold information from the public (see e.g., Furedy, 2009), as much as it is a different way of using the CIT that makes it easier to apply.

Japanese Field Use of the CIT

The CIT is used to assess an examinee’s memory about a criminal case. As in laboratory studies, the Japanese field CIT is composed of multiple-choice questions (Matsuda, Nittono, & Allen, 2012; Osugi, 2011). There are two variants of the Japanese field CIT: the known-solution CIT and the searching CIT. The known-solution CIT asks whether an examinee knows a specific detail about the crime that law enforcement has identified. Suppose that a tie has been identified by law enforcement as the weapon used to strangle someone, and that this fact has not been disclosed to the public. A known-solution question about the weapon could be: “What was the murder weapon? Was it (a) a rope, (b) a scarf, (c) a tie, (d) a lamp cord, (e) a necklace, (f) a stocking?” This specific CIT assesses whether the examinee knows that the tie was the murder weapon, but does not go beyond this fact to assign overall guilt based on this knowledge. The searching CIT is also used to examine whether an examinee recognizes a specific item as relevant to the crime-related fact in question, but contrary to the known-solution CIT, law enforcement does not know the correct answer. In these cases, the investigators will select plausible options, hoping that the correct item is included among the choices. The investigators will also typically include an open alternative (i.e., “something else”) to avoid having all answers be incorrect. Using the same example as above, a differential response to the tie would suggest that it was the weapon used in the crime. In this way the searching CIT can provide clues about crime details that law enforcement has not yet discovered. The known-solution CIT and searching CIT are often combined within a single case. In Japan, the CIT typically consists of four to six questions (Kobayashi, Yoshimoto, & Fujihara, 2009). Each question is repeated three to five times, with the choices given in different orders. Physiological measures include skin conductance, respiratory activity, heart rate, and normalized pulse volume (i.e., improved measurement of pulse volume; Hirota et al., 2003; Sawada, Tanaka, & Yamakoshi, 2001). The examiner judges whether a specific item consistently elicits larger physiological responses than the other items (Osugi, 2011). This judgment is based on visual inspection of graphs and descriptive statistics derived from the physiological recordings. If an examinee’s responses are similar to all items, the examiner infers that the examinee does not recognize any item as being related to the crime. However, the examiner would infer that the examinee does recognize a specific item as related to the crime if differential responses to that item are observed. To form the final conclusion, the examiner takes into account additional factors such as the examinee’s physiological reactivity, environmental factors (e.g., noise), artifacts (e.g., physical movement or deep breathing), and reasonable alternative arguments for certain items (e.g., recognition of a relevant item for reasons unrelated to the crime).
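The question-by-question logic described in this section can be sketched as follows. This is only an illustration of the idea of looking for an item that stands out consistently across repetitions. Japanese examiners rely on visual inspection and contextual judgment rather than a fixed rule, so the function name, the `min_fraction` threshold, the single response measure, and all data below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def judge_question(
    repetitions: List[Dict[str, float]], min_fraction: float = 0.8
) -> Optional[str]:
    """Return the item that drew the largest response in at least
    `min_fraction` of the repetitions, or None if no item stands out.
    Assumption: one response magnitude per item per repetition."""
    # Item with the largest response in each repetition.
    top_items = [max(rep, key=rep.get) for rep in repetitions]
    item, count = Counter(top_items).most_common(1)[0]
    if count / len(repetitions) >= min_fraction:
        return item
    return None


# Hypothetical recordings: each question is repeated three times.
weapon_repetitions = [
    {"rope": 0.21, "scarf": 0.30, "tie": 0.92, "lamp cord": 0.18},
    {"rope": 0.25, "scarf": 0.22, "tie": 0.81, "lamp cord": 0.27},
    {"rope": 0.19, "scarf": 0.35, "tie": 0.77, "lamp cord": 0.20},
]
time_repetitions = [
    {"morning": 0.5, "afternoon": 0.3, "evening": 0.2},
    {"morning": 0.2, "afternoon": 0.6, "evening": 0.3},
    {"morning": 0.3, "afternoon": 0.2, "evening": 0.7},
]

# One conclusion per question; questions are never pooled.
conclusions = {
    "weapon": judge_question(weapon_repetitions),
    "time": judge_question(time_repetitions),
}
print(conclusions)
```

The result is a separate conclusion for each question rather than one label for the examinee. In a searching question, the returned item would be read as a candidate answer rather than as confirmation of a known fact.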

Differences Between Laboratory Research and Japanese Field Practice

The use of the CIT in laboratory-based studies and Japanese criminal investigations differs notably. First, in Japanese criminal investigations, each question is repeated at least three times. This repetition serves to increase reliability, and allows assessment of whether stronger responses to the crime-relevant item appear systematically across repetitions. In contrast, the laboratory CIT typically uses few or no repetitions of individual questions. For example, about half of the 80 studies included in the meta-analysis by Ben-Shakhar and Elaad (2003) presented questions only once, indicating that question repetition was not considered a requirement in these studies. Instead, laboratory researchers prefer using multiple questions to repeating individual questions, reasoning that repetition of the same question is more error-prone (Meijer et al., 2014). For instance, if a crime-relevant item happens to be more salient, familiar, or arousing to an examinee, it might always evoke a stronger response than the crime-irrelevant items, and repeating the question will not remedy the problem. Second, Japanese polygraphers working on real criminal investigations aggregate repetitions of a single question but never aggregate across different questions (Matsuda et al., 2012), whereas laboratory researchers aggregate different questions when scoring the physiological signals. Indeed, Lykken (1959) scored the CIT by summing up scores across questions, and this remains the standard practice in laboratory research today. Third, the test conclusions in laboratory studies differ from those of the Japanese CIT used in the field. Based on the aggregated responses across different questions, laboratory studies come to a single conclusion of “knowledgeable (guilty)” or “unknowledgeable (innocent)”. Thus, the conclusion in laboratory research refers to whether the examinee has knowledge about the crime.
In contrast, Japanese CIT practitioners provide conclusions for each separate question. For example, when a CIT in a Japanese theft case contains three questions that ask when it happened, where it happened, and what was stolen, the conclusion could be that the examinee seems to know the when and the where but not the what, rather than simply that the examinee knows something. This list is probably not exhaustive, but it covers the key differences that are related to the different conceptualizations of the CIT as outlined in the next section.

Conceptualizing Guilty Knowledge

How can we understand the differences between laboratory-based studies and the Japanese application of the CIT? We think it is helpful to frame these differences within a slightly different conceptualization of the notion of guilty knowledge. Lykken (1959) introduced the notion of guilty knowledge to psychology to describe knowing facts of a crime that are only known to the criminals. In his seminal paper, he concluded that the concealed information test provided a means “to determine guilt” (p. 388). Likewise, the title of the paper—The GSR in the Detection of Guilt—refers to determining whether the examinee is guilty. In such a conceptualization, multiple questions are equivalent in the sense that they all contribute to the overall judgment of guilty knowledge. The test conclusion is based upon averaging across the different questions. That the suspect reacts to some questions but not others is considered noise. Note that aggregating across different questions restricts the conclusions polygraph examiners can make because it can no longer be specified what the examinee knows and does not know, particularly when questions are presented only once. Here, the CIT seems to serve a single overall goal, to determine whether the examinee has knowledge of the crime. Thus, the CIT in laboratory-based research is examinee-focused. The concept of guilty knowledge is conceptualized differently in Japanese field practice, and refers to knowledge of specific detail of the crime (i.e., a specific CIT question). In Japanese field practice, each question has its own significance with the reasoning that different questions represent different aspects of the crime (e.g., time, place, victim, accomplices). Japanese polygraph examiners are required to provide separate conclusions for each question, and do not aggregate questions to derive a single outcome. Thus, the CIT in Japanese field practice is question-focused. 
The CIT in Japanese field practice can therefore be considered an information-gathering approach. As such, it can even be meaningfully applied when guilt is already known. Suppose a hypothetical case where law enforcement arrested a man for a theft and the suspect confesses that he did it alone. The investigators may doubt whether the man committed the crime alone or had an accomplice. In this case, a CIT that asks about the number of people involved in the crime may be informative (e.g., Did you perform the theft (a) with one accomplice, (b) with two accomplices, (c) with three accomplices, (d) with more than three accomplices?). The purpose of this type of searching CIT is not to determine guilt, but rather to gain further information.

What Could Laboratory-Based Research Learn From Japanese Field Practice?

Identifying the different ways in which the CIT is conceptualized might help elucidate the differences between research and practice. For instance, knowing the different ways that guilty knowledge is conceptualized should help Japanese practitioners understand why laboratory-based studies often aggregate scores across multiple questions, a practice uncommon among Japanese practitioners. Further, researchers may have wondered why Japanese police investigators are reluctant to provide hit rates from their field tests. This may be more understandable when one considers that such statistics involve an overall judgment as to whether guilty knowledge is present (examinee-focused) rather than the question-by-question (question-focused) judgments common in Japanese field tests. For example, Ogawa, Matsuda, and Tsuneoka (2013) planned a mock theft experiment in which 36 Japanese polygraphers served as examiners (see also Matsuda, Ogawa, Tsuneoka, & Verschuere, 2014, for further detail). Results indicated an 86% sensitivity and a 95% specificity after exclusion of inconclusive decisions (cf. Elaad, Ginton, & Jungman, 1992). These figures, however, reflect correct classification rates of a single detail of the crime and could be readily misinterpreted by laboratory researchers, who typically report the correct classifications of examinees. The question-focused approach might help encourage CIT implementation outside of Japan. The limited field use of the CIT in many countries has often been related to the difficulty in preparing sufficient questions (Podlesny, 1993, 2003). Lykken (1988) suggested that six or more questions are optimal for a CIT, yet Podlesny (1993, 2003) argued that developing such a CIT is only feasible in a small portion of cases. In contrast, the question-focused CIT has no minimum number of questions. Additionally, using searching CIT questions further broadens the ways the CIT can be used because developing searching CIT questions is often possible when preparing known-solution CIT questions is difficult. In the searching CIT, each question must be treated separately.
Notably, Japanese practitioners typically use a mixture of known-solution CIT and searching CIT within a single case. In contrast, laboratory-based research has typically used either the known-solution CIT or the searching CIT (for recent laboratory studies on the searching CIT see Breska, Ben-Shakhar, & Gronau, 2012; Meijer, Bente, Ben-Shakhar, & Schumacher, 2013; Meijer, Smulders, & Merckelbach, 2010; Meixner & Rosenfeld, 2011). We hope that our analysis will provide impetus for new research. Studies departing from the question-focused perspective may not be easily applicable to the CIT as used in Japan. Scientists can increase the applied value of their research in several ways. Foremost, Japanese field practice is more likely to incorporate laboratory conclusions that are based upon studies that treat individual CIT questions as the unit of analysis. This implies repetition of individual questions and not aggregating across different questions.

What Could Japanese Field Practice Learn From Laboratory-Based Research?

Laboratory-based research has provided the scientific foundation for how the CIT is used in the field. For example, the polygraph system used in Japan employs the 0.5 V constant voltage circuit following the recommendation by Fowles et al. (1981). It also uses a laboratory-based data-analysis procedure such as a standardization to reduce inter-repetition variance in physiological measures (Ben-Shakhar, 1985). However, although Japanese field practice has certainly been inspired by laboratory research, it should critically reflect upon how laboratory research can be used to further improve the situation in the field.
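As a rough illustration of what standardization within an individual can look like, the sketch below converts one examinee's raw responses to z-scores, so that each value is expressed relative to that examinee's own mean and spread. Treating the standardization as a plain z-score over one set of responses (for example, one repetition of a question) is an assumption made here; the function name and data are hypothetical, and the sketch does not claim to reproduce the procedure used in the Japanese polygraph system.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize_within_examinee(responses: Sequence[float]) -> List[float]:
    """Convert raw response magnitudes to z-scores computed from this
    examinee's own responses. Returns zeros if there is no variation."""
    m = mean(responses)
    sd = pstdev(responses)  # population standard deviation of this set
    if sd == 0:
        return [0.0 for _ in responses]
    return [(r - m) / sd for r in responses]


# Hypothetical skin conductance responses to five items in one repetition.
raw = [0.20, 0.25, 0.90, 0.15, 0.30]
z = standardize_within_examinee(raw)
print([round(v, 2) for v in z])
```

After this step, a highly reactive examinee and a weakly reactive one are placed on a comparable scale, which is what makes responses easier to compare across repetitions and across people.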

The choice to aggregate across multiple questions in laboratory-based research is based on the concern that using single questions can inflate false-positive errors. In the question-focused approach, the innocent examinee may consider the critical item in one question to be more salient or more plausible. In such a case, repeating the question does not solve the problem because the increased physiological responses of an innocent examinee may be the result of a bias in the test rather than being noise that can be ignored. The examiner checks for these possibilities during pre- and post-test interviews by explicitly asking the examinee how they felt about the items. However, when using several questions, it is unlikely that the innocent person considers all critical items to be more salient or more plausible. Therefore, even if questions are treated separately, presenting multiple questions should prevent errors during an investigation such as a mistaken arrest. Ogawa et al. (2013) reported 5% as the false-positive rate in question-based analysis. However, when four questions are administered, the probability that all answers are false positives is greatly reduced to 0.05^4 (roughly 0.0006%), assuming that errors on different questions are independent. Laboratory research has provided convincing evidence that several measures, not currently used in Japan, are valid. These include finger-pulse line length (Elaad & Ben-Shakhar, 2006; Vandenbosch, Verschuere, Crombez, & de Clercq, 2009) and non-autonomic measures such as reaction time (Kleinberg, Verschuere, & Theocharidou, 2015; Seymour & Fraynt, 2009; Seymour & Kerlin, 2008; Seymour, Seifert, Shafto, & Mosmann, 2000; Verschuere & Ben-Shakhar, 2011), and the P300 event-related potential recorded from the brain by EEG (Farwell & Donchin, 1991; Rosenfeld et al., 1988). We are not saying that these measures are ready to be used in the Japanese field.
But proposing new measures and new techniques is one of the most important contributions of laboratory-based studies, as are the follow-up studies that assess the validity of these new procedures before implementing them in the field. More importantly, Japanese field practice should develop a more valid and objective scoring method. In current Japanese field practice, physiological recordings are evaluated through visual inspection (Osugi, 2011). One may argue that human judgment is vulnerable to biases (Dawes, 1979), and an objective scoring system would produce results that are more impartial, reliable, and ultimately more valid (Matsuda et al., 2012). Current practitioners may argue that subjective scoring allows them to incorporate important information that is not captured in current objective scoring systems. This is illustrated in a case report by Yamamoto (2010) where an examinee confessed his knowledge to 4 out of 5 questions during the interrogation held after the polygraph test. For each of the 4 questions, the examinee consistently reacted to the critical items with increased vasodilation at the fingertip. Such a response directly opposes predictions made by laboratory-based research (i.e., vasoconstriction), and might have been missed by computerized scoring systems. Subjective scoring is often more valid than computerized scoring. Using Lykken scoring of Ogawa et al.’s (2013) data, Matsuda, Ogawa, and Tsuneoka (2015) found a sensitivity of 68% and a specificity of 96%. Human judgment, however, was better with 86% sensitivity and 95% specificity. This lower sensitivity in the Lykken scoring may have resulted from the inclusion of non-responding participants or from ignoring individual differences in responsive measures. Moreover, the Lykken scoring system was designed for the known-solution CIT and cannot simply be applied to the searching CIT.
Clearly, more sophisticated models are needed for analyzing physiological measures both for known-solution and searching CIT questions.
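The multiple-question argument made earlier in this section rests on simple arithmetic: if each question produces a false positive with some probability, and errors on different questions are independent, then the probability that every question produces one is that rate raised to the number of questions. The sketch below works this through. Independence is an assumption, the function name is hypothetical, and the 5% rate is the question-level false-positive rate reported by Ogawa et al. (2013).

```python
def prob_all_false_positive(per_question_rate: float, n_questions: int) -> float:
    """Probability that an unknowledgeable examinee produces a false
    positive on every one of n_questions questions.
    Assumption: errors on different questions are independent."""
    if not 0.0 <= per_question_rate <= 1.0:
        raise ValueError("per_question_rate must lie between 0 and 1")
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    return per_question_rate ** n_questions


# Question-level false-positive rate of 5%, with one to six questions.
for n in range(1, 7):
    p = prob_all_false_positive(0.05, n)
    print(f"{n} question(s): {p:.3e} ({p * 100:.6f}%)")
```

If a shared bias makes several critical items stand out to the same innocent examinee, the independence assumption fails and the true probability will be higher than this figure.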

An Agenda for Future Research

Research is needed for future development of the field CIT. An important topic in future CIT studies should be how the criminal perceives and remembers criminal events. Such studies have important practical implications for developing questions. For example, recent studies have suggested that central features such as the weapon are much better remembered than peripheral items such as a picture on the wall (Carmel, Dayan, Naveh, Raveh, & Ben-Shakhar, 2003; Gamer & Berti, 2012; Nahari & Ben-Shakhar, 2010). In addition, eyewitness research provides valuable information to practitioners regarding which facts are likely to be remembered and which are not. Additionally, eyewitnesses and perpetrators might have memories of events with different characteristics, and much needs to be learned about what criminals are likely to remember. Ongoing efforts should be dedicated to establishing theoretical bases of the Concealed Information Test. Good theory will provide the conditions, measures, and populations in which the technology will work (Verschuere & Ben-Shakhar, 2011). For example, one might ask about processes responsible for the atypical responses found by Yamamoto (2010). Only basic and applied studies are able to answer these questions. In this context, laboratory studies will be extremely useful for the practical use of the CIT. Practitioners should try to inform researchers about practical issues they confront in the field. For instance, when developing a field CIT, there is currently no consensus as to whether the examinee’s stated answer should be used as an option in the CIT. Suppose the examinee claims to have committed a theft alone, but law enforcement thinks there may have been two or three accomplices. Clearly, the ‘two accomplices’ and ‘three accomplices’ options should be included, but how is the ‘no accomplices’ option handled? As this is what the examinee claims, it is likely to evoke a response, irrespective of its truth-value. 
Excluding it as an option may seem strange to the examinee. Current field practice has several ways of handling this issue, such as excluding the option from the test or including it in the test but excluding it from the analyses. Laboratory research may tell us how these different procedures affect detection efficiency. For example, inclusion might not necessarily harm detection efficiency, as this is somewhat similar to a study in which detection efficiency was unaffected by including target items to which participants had to respond by pressing a key (Ben-Shakhar et al., 1999; Elaad, 1997). Laboratory studies that employ methods more closely related to those used in the Japanese field CIT will strengthen the scientific basis of the field application.

Conclusion

As the overall aim of this paper was to promote further development of the CIT, we have highlighted the differences in how the CIT is used in laboratory-based studies vs. Japanese criminal investigations. In laboratory-based studies, CIT questions have often been integrated to derive a single, unified outcome of ‘knowledgeable’ versus ‘unknowledgeable’ (examinee-focus), whereas Japanese polygraph examiners treat each CIT question as a unit of analysis (question-focus). This difference is not trivial, as it poses different constraints on the CIT method (e.g., the need to have a sufficient number of questions does not hold from the individual question approach), requires different methods for data analysis (e.g., aggregating across questions), and leads to very different test conclusions (i.e., knowledge present or absent vs. recognition of specific pieces of crime-related information). We hope this paper will foster a better understanding of how the CIT is applied in laboratory-based studies and in Japanese criminal investigations. As such, this paper might encourage laboratory-based studies that can be more readily translated to Japanese field application. Importantly, we hope the present paper will stimulate CIT field application outside Japan and encourage reflection by Japanese practitioners on how the CIT can be optimally applied.

References

Ben-Shakhar, G. (1985). Standardization within individuals: A simple method to neutralize individual differences in skin conductance. Psychophysiology, 22 (3), 292–299. doi: 10.1111/j.1469-8986.1985.tb01603.x Ben-Shakhar, G. (2012). Current research and potential applications of the concealed information test: An overview. Frontiers in Psychology, 3 , 342. doi: 10.3389/ fpsyg.2012.00342 Ben-Shakhar, G., & Elaad, E. (2003). The validity of psychophysiological detection of information with the guilty knowledge test: A meta-analytic review. Journal of Applied Psychology, 88 (1), 131–151. doi: 10.1037/0021-9010.88.1.131 Ben-Shakhar, G., & Furedy, J. J. (1990). Theories and applications in the detection of deception. New York, NY: Springer-Verlag. doi: 10.1007/978-1-4612-3282-7 Ben-Shakhar, G., Gronau, N., & Elaad, E. (1999). Leakage of relevant information to innocent examinees in the gkt: An attempt to reduce false-positive outcomes by introducing target stimuli. Journal of Applied Psychology, 84 (5), 651–660. doi: 10.1037/0021-9010.84.5.651 Breska, A., Ben-Shakhar, G., & Gronau, N. (2012). Algorithms for detecting concealed knowledge among groups when the critical information is unavailable. Journal of Experimental Psychology: Applied, 18 (3), 292–300. doi: 10.1037/a0028798 Carmel, D., Dayan, E., Naveh, A., Raveh, O., & Ben-Shakhar, G. (2003). Estimating the validity of the guilty knowledge test from simulated experiments: The external validity of mock crime studies. Journal of Experimental Psychology: Applied, 9 (4), 261–269. doi: 10.1037/1076-898x.9.4.261 Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34 (7), 571–582. doi: 10.1037/0003-066x.34.7.571 Elaad, E. (1997). Polygraph examiner awareness of crime-relevant information and the guilty knowledge test. Law and Human Behavior, 21 (1), 107–120. doi: 10.1023/a: 1024822211587 Elaad, E., & Ben-Shakhar, G. (2006). 
Finger pulse waveform length in the detection of concealed information. International Journal of Psychophysiology, 61 (2), 226–234. doi: 10.1016/j.ijpsycho.2005.10.005 Elaad, E., Ginton, A., & Jungman, N. (1992). Detection measures in real-life criminal guilty knowledge tests. Journal of Applied Psychology, 77 (5), 757–767. doi: 10 .1037/0021-9010.77.5.757 Farwell, L. A., & Donchin, E. (1991). The truth will out: Interrogative polygraphy (“lie detection”) with event-related brain potentials. Psychophysiology, 28 (5), 531–547. doi: 10.1111/j.1469-8986.1991.tb01990.x

24 laboratory cit vs. practical cit

Fowles, D. C., Christie, M. J., Edelberg, R., GRINGS, W. W., Lykken, D. T., & Venables, P. H. (1981). Publication recommendations for electrodermal measurements. Psychophysiology, 18 (3), 232–239. doi: 10.1111/j.1469-8986.1981.tb03024.x Furedy, J. J. (2009). The concealed information test as an instrument of applied differential psychophysiology: Methodological considerations. Applied Psychophysiology and Biofeedback, 34 (3), 149–160. doi: 10.1007/s10484-009-9097 -y Gamer, M., & Berti, S. (2012). P300 amplitudes in the concealed information test are less affected by depth of processing than electrodermal responses. Frontiers in Human Neuroscience, 6 . doi: 10.3389/fnhum.2012.00308 Gati, I., & Ben-Shakhar, G. (1990). Novelty and significance in orientation and habituation: A feature-matching approach. Journal of Experimental Psychology: General, 119 (3), 251–263. doi: 10.1037/0096-3445.119.3.251 Hirota, A., Sawada, Y., Tanaka, G., Nagano, Y., Matsuda, I., & Takasawa, N. (2003). A new index for psychophysiological detection of deception: Applicability of normalized pulse volume. Japanese Journal of Physiological Psychology and Psychophysiology, 21 (3), 217–230. doi: 10.5674/jjppp1983.21.217 Kleinberg, B., Verschuere, B., & Theocharidou, K. (2015). RT-based memory detection: Item saliency effects in the single-probe and the multiple-probe protocol. Journal of Applied Research in Memory and Cognition, 4 (1), 59–65. doi: 10.1016/j.jarmac .2015.01.001 Kobayashi, T., Yoshimoto, K., & Fujihara, S. (2009). The contemporary situation of field polygraph tests. Japanese Journal of Physioogical Psychoogy and Psychophysiology, 27 (1), 5–15. doi: 10.5674/jjppp.27.5 Lykken, D. T. (1959). The GSR in the detection of guilt. Journal of Applied Psychology, 43 (6), 385–388. doi: 10.1037/h0046060 Lykken, D. T. (1974). Psychology and the lie detector industry. American Psychologist, 29 (10), 725–739. doi: 10.1037/h0037441 Lykken, D. T. (1988). 
Detection of guilty knowledge: A comment on Forman and McCauley. Journal of Applied Psychology, 73 (2), 303–304. doi: 10.1037/0021 -9010.73.2.303 Lykken, D. T. (1998). A tremor in the blood: Uses and abuses of the lie detector. Plenum Trade. MacLaren, V. V. (2001). A quantitative review of the guilty knowledge test. Journal of Applied Psychology, 86 (4), 674–683. doi: 10.1037/0021-9010.86.4.674 Matsuda, I., Nittono, H., & Allen, J. J. B. (2012). The current and future status of the concealed information test for field use. Frontiers in Psychology, 3 . doi: 10.3389/fpsyg.2012.00532 Matsuda, I., Ogawa, T., & Tsuneoka, M. (2015). Discrimination performance of statistical values used in the concealed information test studies. Jpn. J. Forensic. Sci. Tech., 20 (1), 59–67. doi: 10.3408/jafst.681 Matsuda, I., Ogawa, T., Tsuneoka, M., & Verschuere, B. (2014). Using pretest data to screen low-reactivity individuals in the autonomic-based concealed information test. Psychophysiology, 52 (3), 436–439. doi: 10.1111/psyp.12328 Meijer, E. H., Bente, G., Ben-Shakhar, G., & Schumacher, A. (2013). Detecting concealed information from groups using a dynamic questioning approach: Simultaneous skin conductance measurement and immediate feedback. Frontiers in Psychology, 4 .

25 ogawa, matsuda, tsuneoka, & verschuere


Received: December 8, 2014 Revision Received: May 20, 2015 Accepted: May 27, 2015

Archives of Forensic Psychology © 2015 Global Institute of Forensic Psychology 2015, Vol. 1, No. 2, 28–54 ISSN 2334-2749

A systematic review of the effectiveness of anger management interventions among adult male offenders in secure settings

Sara Schamborg Centre for Forensic and Family Psychology, Division of Psychiatry and Applied Psychology, School of Medicine, The University of Nottingham, England, UK

Ruth J. Tully∗ Tully Forensic Psychology Ltd, Nottingham, UK Centre for Forensic and Family Psychology, Division of Psychiatry and Applied Psychology, School of Medicine, The University of Nottingham, YANG Fujia Building, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK E-mail: Ruth.Tully@nottingham.ac.uk

Background: Anger management is one of the most common interventions offered to forensic populations. The effectiveness of anger management interventions for adult male offenders in secure settings has not been systematically reviewed. Aim: To systematically review the effectiveness of anger management interventions for adult male offenders in secure settings, and to assess the quality of the existing research in this area. Method: Seven electronic databases were searched and hand searching was completed. Inclusion and exclusion criteria were applied to the identified studies. Data were extracted from the included studies and studies were quality assessed before being synthesised. Results: In total, 11 papers, containing 12 primary studies, were included in this systematic review. The majority of the studies reported that participants in the treatment groups made at least some gains following treatment compared to participants in the control groups. Conclusion: Preliminary results displayed promising but mixed findings with regard to the effectiveness of anger management interventions for adult males in secure settings. Limitations and implications for delivering effective anger management treatment in secure settings are discussed. Future high quality research is needed in order to inform evidence-based practice. Keywords: Anger, Anger Management, Offender, Systematic Review


Introduction

Definition of anger

As a variety of conceptualisations of anger have been put forward, attaining a shared definition of the construct has been a challenging task in the field of anger (Miller, 1985). At present, it is generally agreed that anger is the subjective experience of an emotion (Howells, Daffern, & Day, 2008). Specifically, anger has been defined as “an emotional state that consists of feelings that vary in intensity, from mild irritation or annoyance to intense fury and rage” (p. 16). Anger has been differentiated from other unpleasant feelings by the presence of the following: a cognitive component, which involves appraisal of wrongdoing, and a motivational component, which involves a desire to address the identified wrongdoing (Fernandez, 2013). In addition, anger has commonly been misunderstood or used interchangeably with other concepts, such as aggression (Parrott & Giancola, 2007). In contrast to anger, aggression refers to a behaviour that is used with an intention to cause harm (Baron & Richardson, 1994). Although anger may lead to aggression, it is not considered sufficient or necessary for aggression to take place (Spielberger, 1999a).

Anger Management Treatment

Even though anger is a universal feeling that has functional properties, such as signalling unfair treatment and threat, dysfunctional anger, i.e. intense, frequent or inappropriately expressed anger, is considered to be a widespread social problem linked to the display of antisocial behaviours. As a result, anger management treatment has commonly been delivered to individuals to address this issue (Beck & Fernandez, 1998; Reilly & Shopshire, 2012). The aim of such treatment has been to aid individuals in recognising the psychological causes of their experience of anger, such as triggers and cognitive appraisals of situations, as well as to help individuals to manage their anger appropriately, for example through cognitive reframing (Beck & Fernandez, 1998; Howells, 2004). A number of meta-analyses have been conducted to assess the effect of anger management interventions (e.g., Beck & Fernandez, 1998; Del Vecchio & O’Leary, 2004; DiGiuseppe & Tafrate, 2003; Edmondson & Conger, 1996; Sukhodolsky, Kassinove, & Gorman, 2004; Tafrate, 1995). These papers have included a mixture of non-clinical, clinical, forensic, and non-forensic populations. Overall, outcomes have been promising: anger management has been found to result in significant within- and between-group improvements. For example, Beck and Fernandez (1998) conducted a meta-analysis including mainly clinical and forensic samples; the outcome indicated that the average individual who received Cognitive Behavioural Therapy (CBT) based anger management treatment fared better in terms of anger reduction than 76% of individuals who did not receive such treatment (effect size .70). This effect size was predominantly based on outcomes of self-report anger measures, but the meta-analysis also included behavioural ratings of aggression.
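The 76% figure and the effect size of .70 reported for Beck and Fernandez (1998) can be related numerically. The sketch below assumes that the effect size is a standardised mean difference (Cohen's d) and that outcomes are normally distributed with equal variance in both groups; under those assumptions, the proportion of untreated individuals who fare worse than the average treated individual is the standard normal cumulative probability at d (often called Cohen's U3). The function name is illustrative and not taken from the original meta-analysis.

```python
from statistics import NormalDist


def proportion_of_controls_exceeded(d: float) -> float:
    """Cohen's U3: share of the untreated distribution that lies below
    the mean of the treated distribution, assuming normal outcomes with
    equal variance (an assumption made for this illustration)."""
    return NormalDist().cdf(d)


# Effect size reported in the text for CBT-based anger management.
u3 = proportion_of_controls_exceeded(0.70)
print(f"{u3:.0%}")  # prints "76%"
```

Read this way, the percentage describes where the average treated individual sits relative to untreated individuals, not a 76% larger reduction in anger.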


Anger Management Treatment in Forensic Settings

Spielberger (1991) found that the experience, expression, and control of anger amongst a sample of prisoners were significantly more problematic than in a sample of adults in the general population. These problems are more pronounced in violent offenders (Mills, Kroner, & Forth, 1998). Due to these findings, and because anger is a potential contributing factor to, or the function of, aggressive behaviour (e.g., Daffern & Howells, 2008; Deffenbacher et al., 1996; Giancola, 2002; Parrott & Zeichner, 2005), anger management treatment is readily provided in secure services (Howells et al., 2005), such as in prisons and secure psychiatric facilities. Despite such treatment being one of the most commonly delivered interventions in forensic settings, its empirical evaluation using forensic samples has been limited (Novaco, Ramm, & Black, 2004). Consequently, the impact of anger management interventions is not yet fully understood (DiGiuseppe & Tafrate, 2003). As yet, the literature relies on primary studies in the field because no systematic reviews or meta-analyses have been conducted using solely forensic adult populations. Findings of studies conducted to date are conflicting; some studies report positive results (e.g., Buş, Ştefan, & Visu-Petra, 2009), whereas other studies report that treatment does not result in significant differences between the experimental and control group (e.g., Stone, 1990). In addition, a great proportion of existing studies rely on single-group designs, i.e., pre- and post-treatment comparisons (e.g., McMurran, Charlesworth, Duggan, & McCarthy, 2001); however, the lack of comparison or control groups somewhat limits the validity and generalisability of the outcomes. Increased knowledge about the effects of this treatment is of importance. Gains are beneficial not only for the well-being and safety of clients and staff members in services, but also for wider society in terms of public protection (Black et al., 2011).

Aims and Objectives

Anger management is one of the most frequently delivered interventions provided in forensic settings (Howells et al., 2005). As yet, there has been no published systematic review evaluating the effect of anger management treatment solely delivered in forensic settings, and primary studies have shown mixed findings. This review aims to explore current literature and address gaps in research and clinical knowledge by systematically exploring the effectiveness of anger management treatment for adult male offenders in secure settings. Rigid eligibility and quality/risk of bias assessments will be conducted as part of this review in order to only include studies that fully meet the inclusion criteria and to carefully consider the quality of the included studies to guide recommendations as to how future research can be improved or further developed.

Method

Search strategy: Sources of literature and search terms

Prior to initiating electronic searches, the Cochrane Library was reviewed to identify any previous systematic reviews in the area. No relevant reviews were found. Subsequently, relevant primary studies were sourced using seven electronic databases. These were: Applied Social Sciences Index and Abstracts (ASSIA), EMBASE, MEDLINE: OVID, National Criminal Justice Reference Service Abstracts (NCJRS), ProQuest Database Search, ProQuest Dissertations and Theses, and PsycINFO: OVID¹. No restrictions on language, country of origin or publication status were made. Studies were retrieved by directly accessing journals, through university interlibrary loans or by contacting authors. In addition, hand searching was completed through reviewing reference lists of included studies and through searches on Google Scholar. Studies that were inaccessible through these means were excluded from the review.
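The search syntax given in footnote 1 is a chain of term groups joined by Boolean operators. As a minimal sketch of how such a string might be assembled reproducibly across databases, the snippet below builds a query of the same shape. The helper name `or_group` is hypothetical, and the setting list is abbreviated for illustration; the full list of terms is the one in the footnote.

```python
def or_group(terms):
    """Join terms with 'or' inside parentheses, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " or ".join(quoted) + ")"


topic = ["anger"]
intervention = ["treatment*", "intervention*", "therap*", "management",
                "program*", "control", "reduction"]
population = ["inpatient*", "offender*", "prison*", "inmate*",
              "convict*", "patient*"]
# Abbreviated for illustration; the footnote lists many more setting terms.
setting = ["psychiatric hospital", "jail", "forensic", "secure setting*"]

# Same shape as the footnote: three AND-ed groups followed by an OR-ed group.
query = (" AND ".join(or_group(g) for g in (topic, intervention, population))
         + " OR " + or_group(setting))
print(query)
```

How the final OR group binds relative to the AND groups depends on the operator precedence of the database in question, which the footnote does not specify.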

Study selection

Studies were assessed for eligibility using the inclusion and exclusion criteria (See Table 1); a clear proforma was used to aid this process. The process of assessment for eligibility was carried out by the primary reviewer. In addition, a second independent reviewer with training in systematic reviewing assessed 20% of the total hits for eligibility. Studies were included in the quality assessment stage only if they met the inclusion criteria. Papers that met the inclusion criteria, but also aspects of the exclusion criteria were only included in this review if they analysed the relevant variables in isolation. In cases where information was unclear or lacking, direct author contact was sought to obtain further clarification.

Population

The population was limited to adult males, i.e. 18 years of age or above, who had an average, or above, level of intellectual functioning. This was due to the developmental and cognitive differences between adults and children, adolescents or learning disabled offenders. These differences may affect the delivery and outcome of treatment and may therefore distort findings on treatment change, which provided the rationale for limiting the included population to adults aged 18 years or above. The inclusion criterion of ‘male’ was decided upon because most of the research and treatment in forensic settings is developed based on male needs (Horn & Towl, 1997; Koons, Burrow, Morash, & Bynum, 1997) and literature exploring the outcome of anger management treatment including female offenders is limited (R. Walker, 2001). Female offenders were therefore excluded.

The population was further limited to individuals in secure settings in order to reduce confounding influences as a result of including samples from other settings. Andrews, Bonta, and Hoge (1990) argued that treatment provided in community settings results in better outcomes than treatment provided in secure settings. This finding may be due to individuals in community settings being exposed to different and natural destabilisers, though they may gain opportunities to implement and practise the skills that they develop through treatment. Additionally, they may feel that treatment is more voluntary if they are located in the community, and they may make greater treatment gain through this perceived autonomy. In environments such as prison, adopting a hostile attitude and lifestyle is often reinforced. This can be because such behaviour enhances survival and status as well as social approval by peers, creating a safety network for prisoners (MacPherson, 1986; Shinkfield & Graffam, 2014). Moreover, prison and secure hospital settings are characterised by a variety of detrimental factors that may impact on the individual’s level of change. These factors may involve having limited personal choice and control, which may result in feelings of anger, frustration and helplessness (Novaco & Taylor, 2004).

¹ The search syntax used in each database was: (anger) AND (treatment* or intervention* or therap* or management or program* or control or reduction) AND (inpatient* or offender* or prison* or inmate* or convict* or patient*) OR (“psychiatric hospital” or “psychiatric institution” or “psychiatric services” or jail or forensic or prison* or incarcerated or “secure setting*” or “secure service*” or “forensic mental health services” or “forensic service*” or “forensic hospital” or “forensic setting*” or “forensic institution*” or “penal facilit*” or “penal institution*” or “correctional institution*” or “correctional service*” or “correctional setting*”)

Table 1. Inclusion and Exclusion Criteria

| | Inclusion criteria | Exclusion criteria |
|---|---|---|
| Population | Adult males (≥ 18 years of age) located in secure psychiatric hospital or prison settings | Female offenders, learning disabled offenders, child/young offenders, adolescent offenders, adult male offenders in community/non-forensic settings, or non-offenders |
| Intervention | Anger management interventions that include the administration of at least one anger specific self-report/psychometric measure pre- and post-treatment | Anger management treatments that do not administer anger specific self-report measures pre- and post-treatment, interventions that do not specifically aim to target anger, interventions that do not target anger in isolation (e.g., other offence based treatments), or non-psychologically informed interventions (e.g., medical/pharmacological interventions) |
| Comparator | No anger management intervention (control group) | Single group studies (before and after design), or the control group not being in the same setting (i.e., in a non-secure setting) |
| Outcome | Self-report measures/psychometric assessment of anger | Self-report measures that were not anger specific, behavioural observations (e.g., institutional aggression, role play), and reconviction rates |
| Type of Design | Randomised control trials or case control studies | Opinion papers, case reports, systematic reviews, meta-analyses, narrative reviews, case series reports, cross-sectional studies, or cohort studies |


Intervention and comparator

The intervention was limited to anger management treatments. To review the effectiveness of anger management treatment, only studies that aimed to treat anger as the primary treatment need were included. Although anger may overlap with aggression, including violence, these are two separate constructs (de Azevedo, Wang, Goulart, Lotufo, & Benseñor, 2010; Parrott & Giancola, 2007; Spielberger, 1999b). Therefore, interventions that were excluded were those that treated anger as part of a wider treatment programme (e.g., general offending behaviour programmes like the Thinking Skills Programme used in the prisons of England and Wales), offence specific treatment that addressed anger as a secondary focus (e.g., sex offender treatment programmes) or general aggression/violence interventions that were not anger specific. To account for changes on a sample level (within group) and treatment level (between group), as well as to increase the validity of the findings of this current review, only studies that included both treatment and control groups were included. Control groups refer to samples that did not receive anger management treatment, but who completed the anger related self-report assessments in parallel to the sample that completed the anger management treatment. This design is considered to be of higher quality when compared to studies that rely on pre- and post-treatment comparisons in a single group of participants to assess treatment change (DiGiuseppe & Tafrate, 2003). Given that anger is defined as an emotion, and as such is subjective in nature, it has been considered appropriate to measure anger through self-report assessments (Eckhardt, Norlander, & Deffenbacher, 2004; Novaco, 1994). Therefore, this review only included primary studies that incorporated at least one anger self-report measure to assess treatment change. 
Measures that had a primary focus on assessing other related concepts, for example aggression and hostility, did not meet the criteria for inclusion. This includes tools such as the Aggression Questionnaire (Buss & Warren, 2000) or the Buss-Durkee Hostility Inventory (Buss & Durkee, 1957). The limitations of self-report measures are discussed later.

Outcome

Assessment of anger and treatment change is inherently problematic. Self-report measures have been found to be the most common way of assessing treatment change in anger management interventions due to the subjective nature of the construct (Beck & Fernandez, 1998; Eckhardt et al., 2004; Novaco, 1994). Consequently, this review aimed to assess the efficacy of treatment based on anger self-report measures that were administered both pre- and post-treatment. Some of these measures not only include an assessment of the experience of anger but also include scales that measure the control and expression of anger, and this is accounted for in this review. Other types of assessments that do not rely on self-report, but which aim to explore anger related misconduct, such as institutional aggression or recidivism, were not reviewed in this paper. This is because institutional aggression or recidivism rates may not depict an accurate reflection of changes to anger or changes made as a result of anger management treatment. Whilst the aim of anger treatment may also be to reduce future aggression, aggressive behaviour can be a function of various other phenomena as opposed to just anger. Accordingly, it can be argued that simply considering post-treatment aggression as the outcome may not actually measure reduction in anger. This was demonstrated by Loza and Loza-Fanous (1999) who found that the level of anger in a sample of prisoners did not predict violent or non-violent recidivism and, as previously mentioned, anger alone is not considered to be sufficient, nor is anger necessary, for aggression to take place (Bushman & Anderson, 2001; Howells et al., 2005).

Type of design

Based on the nature of the studies in the field and due to the aim of this review being to evaluate primary studies, randomised control trials (RCT) and case control studies were considered to meet the inclusion criteria. Other study designs were excluded in order to enhance confidence in the results of this systematic review.

Data extraction

In order to avoid data extraction bias a pre-determined protocol was developed and documented. This was used to extract data for this review before synthesising the information. This protocol was adapted based on information from Bettany-Saltikov (2012) and Petticrew and Roberts (2006).

Quality assessment and risk of bias assessment

All studies that met the inclusion criteria were assessed for quality and risk of bias using one of two predetermined protocol proformas, depending on the study design. These forms were informed by the Critical Appraisal Skills Programme (CASP; CASP, 2013) and Bettany-Saltikov (2012). Specifically, the quality assessment forms included a number of questions that could be answered as yes, partially, no or unknown. In order to determine the overall quality of each study, a scoring system was applied. Yes constituted two points, partially constituted one point and no or unknown did not constitute any points. The items on the quality assessment forms were grouped into five categories, each category representing an area of risk. The areas of risk were assessed in accordance with Cochrane review guidelines for interventions (Higgins, Altman, & Sterne, 2011). Specifically, the areas of risk were: sampling and selection; performance; detection; attrition; and results/statistical analysis bias (See Appendix for description of these concepts). Each category was given an overall rating of low, unclear or high risk of bias based on the direction of the questions/items in each category. Studies were not excluded based on quality or risk of bias; however, the results of this review will be discussed against the backdrop of these assessments.

All included studies (N = 12) were assessed for quality and risk of bias by the primary reviewer. A second independent reviewer (a clinician with working and academic knowledge of anger treatment, and who had training in systematic reviewing methods) quality assessed five (42%) of the included studies to ensure reliability. The intra-class correlation coefficient (ICC) value between reviewers was .33, which is considered to be in the ‘fair’ range; however, on close examination of this finding, there appeared to be one particular study (Buş et al., 2009) that skewed the results due to a high level of disagreement between reviewers. Excluding this study from the ICC analysis resulted in an ICC value of .91, which is considered to be in the ‘excellent’ range (Landis & Koch, 1977). Any disagreements between reviewers were resolved through discussion of rationale for rating.
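The scoring rule described above (two points for yes, one point for partially, no points for no or unknown) can be expressed as a short calculation. The sketch below is illustrative only: the function name and the example answers are hypothetical, and the actual items on the quality assessment forms are not reproduced here.

```python
# Points per answer, as described in the text.
POINTS = {"yes": 2, "partially": 1, "no": 0, "unknown": 0}


def quality_score(answers):
    """Return (points obtained, maximum possible points) for one study."""
    obtained = sum(POINTS[answer.lower()] for answer in answers)
    maximum = 2 * len(answers)
    return obtained, maximum


# Hypothetical answers to five items for a single study.
example_answers = ["yes", "partially", "no", "unknown", "yes"]
score, maximum = quality_score(example_answers)
print(score, maximum)  # 5 10
```

Reporting the score alongside its maximum makes studies assessed with forms of different lengths easier to compare, since the two proformas need not contain the same number of items.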

Results

Description of studies

The full search generated 9804 hits; 1673 were duplicates, 8045 were irrelevant, 64 did not meet the required inclusion criteria, and 10 were inaccessible; these were all removed. One relevant paper was retrieved through hand searching; subsequently two papers were excluded because they were sample duplicates of this hand-searched study. Consequently, 11 papers, including 12 primary studies, met the inclusion criteria and were therefore reviewed. All of these studies were considered to follow a case control study design (See Figure 1 for breakdown of search hits).
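The counts reported above can be checked with simple arithmetic. The sketch below only restates the figures given in the text; the variable names are illustrative.

```python
hits = 9804
removed = {
    "duplicates": 1673,
    "irrelevant": 8045,
    "did not meet inclusion criteria": 64,
    "inaccessible": 10,
}

# Papers remaining from the electronic search after all removals.
remaining = hits - sum(removed.values())

hand_searched = 1      # one relevant paper found through hand searching
sample_duplicates = 2  # papers excluded as sample duplicates of that paper

papers_included = remaining + hand_searched - sample_duplicates
print(remaining, papers_included)  # 12 11
```

The result agrees with the 11 papers reported as meeting the inclusion criteria.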

Characteristics of included studies

The total sample size for this review is 985. Although the Howells et al. (2001) paper included both community-based participants and prison inmates at the start of their study (86% of participants who completed post-treatment assessments were prison inmates), for parts of the analysis they only included the participants in prison. Consequently, outcomes relating exclusively to the included population were distinct and reportable within this review. No sample overlap occurred between the 12 studies. Out of the total sample size, 215 participants were part of control groups (attrition: 0% – 40.62%) and 307 participants were enrolled in anger treatment groups (attrition: 0% – 71.74%). Three studies did not provide details about the breakdown of group allocation. All but one of the studies were completed in prison settings (11 studies), with the remaining study being located in a forensic psychiatric hospital. All but one anger management intervention incorporated CBT principles. Kennedy (1990) included a Cognitive Therapy group and a Behavioural Therapy group, separately. Treatment in four studies was based solely on CBT principles; six studies were influenced by additional principles: ‘stress inoculation’ (one study), ‘Buddhist psychology’ (one study), ‘mindfulness and acceptance’ (one study), and ‘rational behaviour/emotive therapy’ (three studies). The treatments ranged from six to 23 sessions, with each session lasting between one and three hours. The majority of studies provided treatment that lasted for 10 sessions (seven studies), with sessions lasting for two hours each (six studies). As per inclusion criteria, all studies measured the outcome of treatment by including at least one anger related self-report psychometric assessment, both pre- and post-treatment. 
The most common tools were: the State-Trait Anger Expression Inventory (STAXI; Spielberger, 1999b) or STAXI-II (Spielberger, 1999a) (six studies); the Novaco Anger Scale (NAS; Novaco, 1975, 1994), Novaco Provocation Inventory (NPI; Novaco, 1975, 1977, 1979) or Novaco Anger Scale and Provocation Inventory (NAS-PI; Novaco, 1994) (seven studies); and the Watt Anger Knowledge Scale (WAKS; Watt & Howells, 1999) (four studies). In addition, the following tools were used: the Short Anger Measure (SAM; Mohr, Heseltine, & Howells, 2001) (one study), Anger Inventory (AI; Gaertner, 1983) (one study), Dimensions of Anger Reactions (DAR; Novaco, 1975) (one study), Anger Experience Checklist (AEC; Meers, 1979) (one study), and Orlinappi Prison Situational Anger Measure (OPSAM; Napolitano, 1991) (one study). The majority of studies were conducted in English-speaking westernised countries: England (one study), Canada (two studies), United States (four studies) and Australia (four studies). One study was conducted in Romania. A summary of the characteristics of each study can be viewed in Table 2.

Figure 1. Flow chart of search hits and strategy
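The attrition ranges quoted in this section are percentages of participants lost between the pre- and post-treatment assessments. A minimal sketch of that calculation is given below; the function name is illustrative, and the example values (92 participants at pre-treatment, 26 at post-treatment) are the counts that accompany the 71.74% attrition figure in Table 2.

```python
def attrition_rate(pre: int, post: int) -> float:
    """Percentage of participants lost between pre- and post-treatment."""
    if pre <= 0:
        raise ValueError("pre-treatment group size must be positive")
    return 100 * (pre - post) / pre


# Treatment group with 92 participants at pre-treatment and 26 at post-treatment.
print(f"{attrition_rate(92, 26):.2f}%")  # 71.74%

# A group in which every participant completes has no attrition.
print(f"{attrition_rate(20, 20):.2f}%")  # 0.00%
```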

37 schamborg & tully 3 Findings TheSTAXI-II mean AngerIndex Expression score reducedfor on significantly compared the the group. to treatment the group control Participantsexperimentalsignificant group inanger knowledge increasesto had compared thetreatment. the control in significant groupbetween post and the Nofound. experimental control differences other group were ofsubscales, the exceptIN anger and forsignificantly scales increased, AX AX- were found and Index, inbut which not thetreatment. control experimental group post Participantsexperimentalsignificant group inincompared had group to the angerNo the increases differences post control experimental knowledge other betweengroup treatment. and were found.participants the in However, control bothi.e. significant groups, control group, weremake found significant to decreases in their experimental angrywell temperament and as asin significant their increases outwardcontrol and of inward anger. 2 post-anger self-report measure(s) Expression Index on STAXI-II NAS-PI, and c) Modified WAKS Modified WAKS, and c) SAM Continued on next page Location Pre- and 2 UK The Anger 2 Australia a) STAXI, b) 2 Romania STAXI2 Significant decreases Australia on each a) STAXI-II, b) × × × × Nosessions of × hours he effectiveness anger management treatment (ABA design) 1 Table 2. 
Sample size for adult males in secure settings NS NS N/A NS NS N/A 10 5 60 30 27 10%88 51 30 51 26 13.33% 0% 12 37 35 0.54% 10 119 92 26 71.74% 27 27 0% 10 418 Total Treatment Group Control Group Included Pre Post Attrition rate Pre Post Attrition Rate Treatment/Model Cognitive- behavioural principles, mindfulness and acceptance therapy Cognitive- behavioural and rational-emotive therapy principles Cognitive behavioural principles Cognitive behavioural principles intervention/ waiting list Waiting list control Control group, not specified Waiting list control Waiting list control 4 ) ) ) ) 2011 2009 2010 2001 Summary of study characteristics and results of studies exploring t Paper Control group Prison setting Black et al. ( Bu¸set al. ( Heseltine, Howells, and Day ( Howells et al. (

38 review of anger treatment in secure settings Intensityanger ofNAS as self-reported for reducedcontrol experimental significantly measuredtreatment. but participants by significant not post betweenand the Nofound. experimental control differences other groupParticipants were experimentalsignificant group inanger decreases arousalcompared had levelsgroup to (NPI) in the postother the treatment. significantbetween differences control No and thefound. experimental control group were Findings experimentalself-reportedless conditions anger significantly provocating toon ato variety of thecontrol situations participants2) AI condition,experimental in participantsself-reported compared the and less in significantly of condition frequent angry the of episodes, occurrences duration of anger, anger,helpful strategies as and toanger well express negative as intensity and more following episodeson having of the angerparticipants consequences DAR less incondition. compared the to control weretreatment foundconditions between pre- andgroup. 
the and post- control AEC OPSAM post-anger self-report measure(s) Continued on next page US a) NAS, and b) US a) NPI, and b) Location Pre- and 3 Canada a) AI, and b) DAR 1) Participants in the 2 US NPI No significant differences × × × × 2 2 1 1 No of sessions × hours 1 1 Sample size 40 NS NS N/A NS NS N/A 23 66 34 22 35.29% 3250 19 25 17 40.62% 15 32%31 25 15 18 9 28% 40% 12 16 13 18.75% 10 Total Treatment Group Control Group Included Pre Post Attrition rate Pre Post Attrition Rate Treatment/Model Cognitive principles, and 2) behavioural principles Cognitive- behavioural treatment, rational behavioural therapy principles Cognitive behavioural and rational behaviour therapy principles Cognitive behavioural approach, including stress inoculation principles intervention/ waiting list Waiting list control Waiting list control Waiting list control Waiting list control ) ) ) ) 1990 1979 1991 1990 Continued from last page Paper Control group Kennedy ( Meers ( Napolitano ( Stone (

Table 2 (continued)

Watt and Howells (1999, Study 1)
Treatment/model: Cognitive-behavioural principles
Location: Australia
Pre- and post-anger self-report measure(s): a) STAXI-II, b) NAS, and c) WAKS
Findings: 1) Participants in the experimental conditions had significantly greater anger knowledge following treatment compared to the control group, and 2) participants in the experimental condition were more likely to have behavioural reactions to provocation both pre- and post-treatment. No other significant results were found.

Watt and Howells (1999, Study 2)
Treatment/model: Cognitive-behavioural principles
Location: Australia
Pre- and post-anger self-report measure(s): a) NAS-modified, and b) WAKS-modified
Findings: No significant differences were found between treatment conditions pre- and post-treatment and the control group. However, anger knowledge increased significantly over time regardless of condition.

Vannoy and Hoyt (2004)
Treatment/model: Cognitive-behavioural approach, including Buddhist psychology principles
Location: US
Pre- and post-anger self-report measure(s): State and Trait scales on STAXI-II
Findings: State anger – feelings of anger, State anger – desire to verbalise anger, and Trait anger – reaction to anger provoking situations decreased significantly in the experimental group compared to the control group post treatment. Significant findings on the two other subscales were not found.

Hospital setting

Stermac (1986)
Treatment/model: Cognitive-behavioural and stress inoculation principles
Location: Canada
Pre- and post-anger self-report measure(s): NPI
Findings: Participants in the experimental group had significant decreases in anger arousal levels (NPI) compared to the control group post treatment.

Control groups on this page: three waiting list controls and one psychoeducational group (eight one hour sessions).
[Sample size and session columns, and the assignment of control groups to studies: values could not be matched to individual studies in this copy of the table.]

Notes.
NS = Not specified (i.e., information was not specified), N/A = Not applicable (i.e., attrition rates were not possible to calculate due to lack of information).
SAM = The Short Anger Measure, AI = Anger Inventory, DAR = Dimensions of Anger Reactions, AEC = Anger Experience Checklist, OPSAM = Orlinappi Prison Situational Anger
Incorporating both prison and community-corrections based participants.
Two additional publications to this research do exist (Howells et al. 2002, Howells et al. 2005).
[...] 85 participants completed post-treatment assessments; 86% of these were prison inmates. Parts of the analysis only incorporated the prisoners.
Significance set at p<.05

Quality assessment and risk of bias assessment

Only one study had predominantly low risk of bias across the categories (Buş et al., 2009). Two studies had no high risk of bias identified (Heseltine et al., 2010; Stermac, 1986), although within these studies many areas of risk were rated as unclear. All other studies had a mixed profile in relation to areas of risk, including at least one area of high risk of bias. Attempts at author contact did not clarify all the issues raised. These outcomes indicate that a high proportion of studies were impeded by limited quality and limited clarity. The distribution of bias across all studies is shown in Figure 2.

Across all studies, sampling and selection bias was the area most often identified as high risk. These high levels of risk were mainly a reflection of small sample sizes (i.e., limited power), lack of screening for anger related difficulties, and difficulties obtaining matched samples at baseline. A high degree of risk of bias and lack of clarity was also found in relation to detection bias and attrition bias. Even though most studies included at least one self-report anger assessment that is considered to have standardised psychometric properties, a number of studies included tools that have been neither validated nor standardised. The authors of only one study partly controlled for social desirability responding (Howells et al., 2001); no other study measured or controlled for such responding. This is a major limitation in forensic settings, where other factors, e.g. forthcoming parole hearings or other situational demands, may influence participants’ answers on post-treatment measures (McEwan, Davis, MacKenzie, & Mullen, 2009). A number of studies were also characterised by high attrition rates.

The areas with the least amount of bias were performance bias and results/statistical analysis bias, i.e. there was a tendency for treatment to be delivered by trained professionals using manual based interventions, which suggests treatment integrity. There was also a trend suggesting that authors conducted appropriate statistical analysis on the obtained data, although whether and how missing data were handled was not detailed in the studies.

Additionally, confounding variables were identified in a number of the studies. These were in relation to referral procedures, e.g. relying solely on self-referrals, rather than a combination of self-referrals and professional referrals. This may not fully reflect the referral processes that tend to take place in clinical practice and may not be representative of the nature of secure settings. For instance, in practice there may be participants who feel that completion of an anger management intervention is necessary to attain parole or early discharge, and who therefore feel that they have little autonomy. These participants may be completing programmes due to these external pressures, rather than their engagement being led by risk-needs-responsivity principles. Such factors are considered to potentially decrease or conceal treatment effectiveness.

Descriptive data synthesis

Insufficient reporting of data in a number of the included studies meant that the data collected within this review were not suitable for meta-analytical techniques. Attempts were made to contact authors for further information, such as means and standard deviations of data in each study; however, sufficient information was not obtained. Consequently, meta-analysis was not conducted; however, a detailed qualitative summary of the findings can be found in Table 2 and the data are synthesised below.

Figure 2. Risk of bias graph

Treatment effects

The majority of the studies (10 studies) reported that participants in the treatment groups made at least some gains following treatment compared to participants in the control groups. Five studies (Black et al., 2011; Buş et al., 2009; Kennedy, 1990; Stermac, 1986; Vannoy & Hoyt, 2004) reported predominantly positive outcomes for the participants that underwent treatment compared to participants in the control group. Specifically, it was found that these participants experienced significantly less anger when faced with trigger situations, and they coped with anger in more adaptive ways following treatment. The majority of these studies used validated and standardised outcome measures such as STAXI (Spielberger, 1999b), STAXI-II (Spielberger, 1999a) and NPI (Novaco, 1975, 1977, 1979).

Two studies reported mixed findings (Meers, 1979; Napolitano, 1991); this may be linked to the use of standardised and validated tools alongside less well validated ones. Both of these studies found change in the desired direction on the NAS (Novaco, 1975) and NPI (Novaco, 1979) for the experimental group following treatment, but not for the control condition. These tools are considered valid and reliable measures of anger and are widely used in forensic settings (Novaco, 1975, 1977, 1994, 2003). Conversely, these studies found no significant pre-post treatment change using other tools: AEC (Meers, 1979) and OPSAM (Napolitano, 1991). These particular tools have not been validated or standardised. Both the AEC and OPSAM were developed by the authors for the purposes of the research that they conducted, and this may explain their mixed findings.

Three studies found positive changes between pre- and post-treatment level of anger knowledge in participants in the treatment group compared to those in the control group, but these studies did not find other between group psychometric changes (Heseltine et al., 2010; Howells et al., 2001; Watt & Howells, 1999, Study 1). All three of these studies were conducted in Australia, which may indicate potential cultural differences in relation to the treatment provided and the skills acquired following treatment.

Two studies did not find any significant differences between the treatment and control group following treatment (Stone, 1990; Watt & Howells, 1999, Study 2). Of note, Stone (1990) had a very small sample at post treatment, with nine participants in the experimental group and 13 participants in the control group (total attrition rate 29%), which may have limited the power of this study. Watt and Howells’s (1999, Study 2) research was another Australian study, which may further highlight potential cultural components to these outcomes. Moreover, Watt and Howells used two self-report measures, one of which (WAKS) was developed by the authors at the time of the study. This tool was modified in Study 2 following its use in Study 1, and the authors also used a modified form of the NAS (Novaco, 1994). Consequently, the reliability and validity of the tools, and therefore the reliability and validity of the study, may be queried.
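The point about limited power in Stone (1990) can be illustrated with a rough calculation. The sketch below is not taken from any of the reviewed studies: it uses a normal approximation to a two-sided, two-sample comparison of means, and the effect size of d = 0.5 is purely an assumed value chosen for illustration. Only the post-treatment group sizes (nine and 13) come from the text; the function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_treatment: int, n_control: int, effect_size_d: float,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison of means.

    Uses the normal approximation to the t-test, so it is slightly
    optimistic for very small groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Non-centrality: standardised difference scaled by the group sizes.
    ncp = effect_size_d * sqrt(n_treatment * n_control / (n_treatment + n_control))
    # Probability of landing in either rejection region under the alternative.
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


# Post-treatment group sizes reported for Stone (1990); d = 0.5 is an assumption.
stone_power = approx_power(n_treatment=9, n_control=13, effect_size_d=0.5)
print(f"Approximate power: {stone_power:.2f}")
```

Under these assumed inputs the estimate falls well below the conventional 0.80 target, which is consistent with the suggestion that a null result in a sample of this size is hard to interpret. A different assumed effect size would change the figure, so the output should be read as indicative only.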

Discussion

The aim of this review was to systematically assess the effectiveness of anger management treatment for adult male offenders in secure settings and to consider the quality of the research in this area. The limited amount of research in the field revealed mixed results. Further, the available research was characterised by methodological shortcomings. Consequently, firm conclusions about the effectiveness of anger management treatment in secure settings are difficult to draw. Despite this, some research provides promising findings in relation to the reduction of anger related difficulties following anger management treatment. For example, participants were found to experience less frequent and less intense anger, as well as to have an increased ability to cope with these feelings in more adaptive ways following treatment when compared to participants in control conditions. This indicates that anger management treatment may work for some of the participants, some of the time, as assessed by self-report psychometric measures.

The positive trends of this systematic review, in relation to the provision of anger management treatment for adults in secure settings, are in accordance with previous meta-analyses that have been conducted in the more general field of anger management treatment. For example, Beck and Fernandez (1998) support the use of CBT to treat anger. Their meta-analysis included 50 studies, incorporating some studies conducted with offender populations and in forensic settings amongst other non-forensic studies.

The conflicting findings in the current review may be a reflection of specific features of secure settings. This is somewhat consistent with Andrews et al. (1990), who found that treatment provided in the community is more effective than treatment provided in prison settings. For instance, hostile attitudes and aggressive behaviours may thrive in prison as a means of survival, and the restricted and artificial prison environment may cause feelings of anger, frustration and helplessness, as well as inhibit the ability or motivation of treatment completers to effectively apply the anger management techniques that they may have developed through interventions (MacPherson, 1986; Novaco & Taylor, 2004). Additionally, opportunities to practise the acquired skills in scenarios similar to those they may face in the community are limited in prison, and prisoners are commonly exposed to other types of destabilising situations (Heseltine et al., 2010; Shinkfield & Graffam, 2014). For example, witnessing or taking part in arguments and substance use have been reported to be prevalent in secure settings (European Monitoring Centre for Drugs and Drug Addiction, 2004). Although life stressors and conflict situations are also encountered in the community, this is an environment within which the public deems the negative expression of anger to be socially unacceptable.

Moreover, in secure settings, some individuals may complete anger management treatment because they feel that this is required of them or in order to attain secondary goals, such as to aid early parole/discharge (MacPherson, 1986; Towl, 1995). Such individuals may not develop or learn as part of the treatment process and their attendance at sessions might even hinder the progression of other group members (Towl, 1995). Consistent with this, Looman (2005) found that low motivation for undertaking treatment was associated with poor treatment outcomes. Additionally, participants that are not motivated to change may misuse acquired skills.
For example, enhancing self-esteem in participants that have low motivation to change or engage in treatment may result in increasing their confidence to express their anger inappropriately, rather than vice versa (Feshbach, 1986; ?).

The mixed outcomes may also be attributed to factors such as the provision of time-limited manualised groups for populations with longstanding and complex needs. As a result of large-scale manual based treatment being provided, treatment may not be tailored to each individual’s risk, needs and responsivity. Although groups provide individuals with the opportunity to share experiences, challenge each other, attain support from both professionals and peers, and model for others, group treatment is less personal, e.g. information and education are provided by facilitators and it involves lower levels of participant self-disclosure. This may reflect the fact that confidentiality is more difficult to maintain within group work and it may not be in the individual’s best interest to share certain information, such as offence history, within group settings (e.g., Novaco et al., 2004; Towl, 1995). Therefore the delivery of group-based treatment may limit an individual’s ability to address their difficulties in a meaningful and in-depth manner, as opposed to individual and tailored treatment that is specifically designed to meet offender needs.

In this review, four out of five studies with disappointing findings were conducted in Australia (Heseltine et al., 2010; Howells et al., 2001; Watt & Howells, 1999, Study 1 and 2). Research has indicated that Indigenous people are overrepresented in Australian prisons (Day et al., 2006). Figures show that 20% of Australian prisoners are Indigenous, although they only represent about 1-2% of the general population in Australia (Australian Bureau of Statistics, 2004; J. Walker & McDonald, 1995).
Consistent with this, 19% of the participants in Howells et al.’s (2001) study were Indigenous, but the analysis of treatment outcome did not control for differences based on ethnic background. Nevertheless, research has suggested that Indigenous offenders have greater difficulties in relation to the experience, expression and control of anger (Day et al., 2008, 2006; Howells et al., 2001). Consequently, this may indicate the importance of being responsive to cultural needs when providing treatment in forensic settings in order to enhance effectiveness of the intervention.

Limitations of the included studies

Problems with quality and high risk of bias were prevalent across a number of the included studies in this review, which limits the validity of the findings of research in this area. One area that was particularly characterised by a high level of risk was sampling and selection bias. Preparing, delivering and evaluating interventions in secure settings is highly time and resource intensive (e.g., Towl, 1995). Consistent with this notion, the majority of studies in this review included small sample sizes. This is in accordance with forensic research in general, which is commonly impeded by limited power (e.g., Walton & Chou, 2014). The limited power of included studies decreases the likelihood of research finding effects that are theoretically and practically meaningful (Rosenfeld & Penrod, 2011). Some studies approached a greater number of participants; however, attrition is another common problem in research and practice within the field (e.g., Gondolf & Foster, 1991; Zinger & Wichmann, 1999).

In relation to attrition bias, a number of studies in this review had high dropout rates. For example, Howells et al. (2001) recruited 418 participants, with only 285 participants completing post-treatment assessments. Nevertheless, this is one of the biggest research projects conducted to evaluate the outcome of anger management treatment. On a smaller scale, highlighting issues relating to statistical power within studies, Stone (1990) recruited 31 participants, 22 of whom completed the post-treatment assessments, with nine of these being treatment completers. A variety of reasons for high attrition rates are possible, such as low participant motivation, prison or hospital transfers or discharge, and clash of activities or conflicting priorities (Black et al., 2011).
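The attrition figures quoted above follow from simple arithmetic on the recruited and completed counts. The short sketch below shows that calculation for the two studies mentioned; the recruitment and completion numbers are those given in the text, while the function name `attrition_rate` and the dictionary layout are illustrative choices rather than anything used in the review itself.

```python
def attrition_rate(recruited: int, completed: int) -> float:
    """Return the percentage of recruited participants lost before post-treatment assessment."""
    if recruited <= 0 or not 0 <= completed <= recruited:
        raise ValueError("need 0 <= completed <= recruited and recruited > 0")
    return 100 * (recruited - completed) / recruited


# Recruited and completed counts as reported in the text above.
studies = {
    "Howells et al. (2001)": (418, 285),
    "Stone (1990)": (31, 22),
}

rates = {name: attrition_rate(*counts) for name, counts in studies.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% attrition")
```

For Stone (1990) this reproduces the total attrition rate of about 29% reported in the results, and for Howells et al. (2001) it gives a rate of just under a third.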
Another major methodological weakness across the majority of studies was that comprehensive assessments of anger were not reported to have been completed to aid participant selection. Although pre-treatment assessments were administered, through which pre- to post-treatment change was assessed, the pre-treatment measures were applied following selection for treatment, as opposed to being a way of selecting participants based on treatment need in relation to anger. A number of authors that did complete pre-treatment assessments to explore suitability of participants to take part in anger management treatment relied on exploring previous aggressive behaviour, rather than anger per se. For example, Stermac (1986) recruited participants who 1) self-reported anger problems, 2) were assessed by professionals to have anger problems, or 3) had displayed aggressive behaviour in the past. Consequently, problems specifically related to anger were not required in order for participants to be recruited. Similarly, Watt and Howells (1999, Study 1 and 2) selected participants based on their past aggressive behaviour rather than on the construct of anger. As previously mentioned, anger is neither necessary nor sufficient for aggression to occur, as aggressive behaviour may occur independently of anger. Aggression may, for example, be predominantly instrumental in nature (Bushman & Anderson, 2001). Therefore, an individual that has presented as aggressive may not necessarily be suitable for anger management treatment (Novaco et al., 2004). Yet in the included studies, these people, who may not be suitable, were included as participants within treatment groups. Howells et al. (2001) further highlight the importance of assessing anger related difficulties in order to recruit suitable candidates for anger management treatment, as these authors found that individuals with high anger related problems benefit more from treatment compared to participants with less anger related difficulties.

One of the most robust studies in relation to participant selection, Buş et al. (2009), based participant selection on a three-stage process, assessing aggression, anger, level of motivation and education criteria. These assessments were completed based on file reviews and clinical interviews with both the participant and staff members. Such multi-stage assessments may be one way of addressing selection bias within future studies and may also be a method of selecting the most suitable candidates for group treatment within clinical practice. Considering that the experience of anger is subjective in nature, an assessment of anger, using for example psychometric instruments or clinical interviews, may not only indicate whether an individual has difficulties in relation to anger, but also whether they recognise this issue and their level of readiness to undertake treatment (Bonta & Andrews, 2003; Eckhardt et al., 2004; Novaco, 1994).

Anger as a treatment need is not the only consideration in selecting participants for anger management groups. Howells et al. (2001) suggest that level of ‘treatment readiness’ is important to establish in order to include participants that are willing to fully engage in the process of undertaking anger management treatment, and to achieve therapeutic change as part of received treatment.
Recognising that anger has been dysfunctional prior to undergoing treatment aids the prospect that skills learnt as part of treatment are used appropriately. For example, on one hand, increasing an individual’s self-esteem could decrease the perception of personal threat and subsequently the likelihood of anger expression (Feshbach, 1986). On the other hand, if the individual is not motivated or ready to undergo change, increasing their self-esteem may increase their confidence to express their anger in a dysfunctional way (Kroner & Reddon, 1995).

Another area characterised by high risk was detection bias. Although all studies used at least one self-report anger measure with good psychometric properties, a number of studies used additional tools that had not been validated. For example, Napolitano (1991) used a tool that had been developed for the purposes of the study: OPSAM. This was therefore a tool that had not been adequately validated. In addition, with the exception of one study that partly controlled for participants’ social desirability responding by using NAS-PI’s inherent validity scale (Novaco, 1994), none of the other included studies measured and controlled for this. This is despite previous research having found that forensic samples may be motivated to bias their responding for multiple reasons, such as anger being viewed as unacceptable or undesirable. Additionally, there may be assumptions on the part of the participants about consequences of responding, for example attempts to avoid treatment or to appear as though treatment has been beneficial to aid the prospect of release/discharge. It may also be that the participant lacks insight or holds cognitive distortions, which may result in minimisation or denial of difficulties (Bannatyne, Gacono, & Greene, 1999; Bornstein, Rossner, Hill, & Stepanian, 1994; Ford, 1991; Novaco & Taylor, 2004).
Therefore it is recommended that a social desirability measure, such as the Paulhus Deception Scales (PDS; Paulhus, 1998), is administered in conjunction with self-report measures, such as STAXI-II (Spielberger, 1999a). Of note, even though self-report measures capture the subjective nature of anger (Eckhardt et al., 2004), they may not demonstrate behavioural changes (Towl & Dexter, 1994). Nevertheless, simply measuring behaviours that are potentially related to the experience of anger, such as aggression or number of violent incidents, may not accurately reflect changes to the experience of anger. This type of measurement problem inevitably creates a circular argument about how to effectively and meaningfully measure change in relation to anger.

Practical implications of the findings

Based on the abovementioned findings a number of practical recommendations can be made for professionals who prepare, deliver and evaluate anger management treatment for adults in secure settings. Firstly, it is important to ensure that individuals are suitable for treatment. This process can be aided by completing comprehensive anger assessments with each individual prior to allocating them to a treatment group. The treatment should be targeted at individuals that specifically have anger related difficulties, rather than provided to people solely based on their past history of aggression (Novaco & Taylor, 2004). The selection process and assessment of treatment change can be aided by measuring and controlling for social desirability responding in conjunction with using validated and standardised self-report anger measures (McEwan et al., 2009). Moreover, where possible, it is useful to attempt to account for confounding factors that are commonly prevalent in secure and group settings. For example, the perception that treatment completion is compulsory may impede the outcome of the intervention. Howells et al. (2001) found that level of treatment readiness impacts treatment outcome. Consequently, the process of providing effective treatment can be assisted by tailoring provision of treatment based on risk-needs-responsivity principles (Bonta & Andrews, 2003). This may also incorporate being responsive to the cultural needs of individuals, such as those of ethnic minorities (Day et al., 2008).

Conclusion and recommendations

The limited amount of research that exists in the field of anger management treatment in secure settings reveals conflicting results. In addition, the available studies are characterised by methodological shortcomings. Consequently, definite conclusions about the effectiveness of anger management treatment in secure settings are difficult to draw. Despite this, some research provides promising findings in relation to reduction of anger related difficulties following anger management treatment. The majority of studies included in this review were conducted in prison settings and so the findings should be generalised to psychiatric hospital settings with caution.

It is important that future high quality research is conducted in order to gain increased knowledge in the area and to inform evidence based practice. Future research would benefit from attempting to include larger samples to increase the power of the findings. Treatment allocation should be made following the completion of a robust selection process so that suitable participants can be allocated to treatment conditions. To achieve this aim, treatment providers may complete comprehensive pre-treatment assessment of participants’ anger difficulties and their level of readiness to complete treatment. Additionally, the assessment of anger should be made using reliable and validated tools, and socially desirable responding should also be assessed. In the medium- to long-term it will be important to assess whether treatment gains are maintained by completing comprehensive follow-ups of participants. Enhancing knowledge of the applicability of anger management treatment to individuals with various ethnic backgrounds will be of practical utility. This is of importance as it appears that different ethnic groups have different profiles in relation to their anger associated difficulties and these groups may respond differently to treatment. Additionally, as previously mentioned, the current review highlighted that most of the research in this area has so far been conducted in prisons and not in forensic psychiatric hospital settings, and so more research in the latter is needed.

Given the exclusion of studies without a control group, the total number of studies in this review was just 12. There is a clear need for future research to incorporate control conditions in order to confidently assess whether treatment changes are valid in the experimental conditions. Specifically, control groups increase the possibility of reducing confounding influences; for example, the completion of self-report assessments has previously been found to result in small positive outcomes without the respondent having been exposed to treatment (Howells et al., 2001).
Lastly, when sufficient research outcomes are available, the completion of a meta-analysis in the field would further increase knowledge of the effectiveness of anger management interventions for adult male offenders in forensic settings. Such an approach would benefit from quality assessment of the included studies, because without it an important aspect of examining the literature would be missed.

Overall, this review has attempted to address the gaps in knowledge in relation to the effectiveness of anger management treatment for adult males in secure settings and to assess the quality of the available studies. Despite comprehensive searches, efforts to obtain articles and contact authors, as well as in-depth analyses of obtained data, this systematic review is not without limitation. Ten identified studies, which may have been relevant for this review, were unobtainable. Additionally, at the quality assessment stage a number of items were rated as unclear and clarification of some of these issues was not possible. These factors may have biased the outcome of this review. For example, although this review did not place restrictions on the publication status of studies to avoid publication bias, it has been recognised that the quality of unpublished research might vary from that of published research (e.g., Tully, Chou, & Browne, 2013). Regardless of these limitations, which the authors acknowledge, this review and its outcomes provide useful information and insight for clinical practice and areas of future research. Consequently, it adds to the current knowledge within the field of anger management treatment for adult male offenders in secure forensic settings.

References Andrews, D. A., Bonta, J., & Hoge, R. D. (1990). Classification for effective rehabilitation: Rediscovering psychology. Criminal Justice and Behavior, 17 (1), 19–52. doi: 10.1177/0093854890017001004

48 review of anger treatment in secure settings

Australian Bureau of Statistics. (2004). Prisoners in Australia. Retrieved from http://www.ausstats.abs.gov.au/Ausstats/subscriber.nsf/0/7C94C06BE2C67F17CA256F7200703CB5/$File/45170_2004.pdf
Bannatyne, L. A., Gacono, C. B., & Greene, R. L. (1999). Differential patterns of responding among three groups of chronic, psychotic, forensic outpatients. Journal of Clinical Psychology, 55(12), 1553–1565. doi: 10.1002/(sici)1097-4679(199912)55:12<1553::aid-jclp12>3.0.co;2-1
Baron, R. A., & Richardson, D. R. (1994). Human aggression (2nd ed.). New York, NY: Plenum Press.
Beck, R., & Fernandez, E. (1998). Cognitive-behavioral therapy in the treatment of anger: A meta-analysis. Cognitive Therapy and Research, 22(1), 63–74. doi: 10.1023/A:1018763902991
Bettany-Saltikov, J. (2012). How to do a systematic literature review in nursing: A step-by-step guide. Maidenhead: Open University Press.
Black, G., Forrester, A., Wilks, M., Riaz, M., Maguire, H., & Carlin, P. (2011). Using initiative to provide clinical intervention groups in prison: A process evaluation. International Review of Psychiatry, 23(1), 70–76. doi: 10.3109/09540261.2010.544293
Bonta, J., & Andrews, D. A. (2003). A commentary on Ward and Stewart’s model of human needs. Psychology, Crime & Law, 9(3), 215–218. doi: 10.1080/10683/16031000112115
Bornstein, R. F., Rossner, S. C., Hill, E. L., & Stepanian, M. L. (1994). Face validity and fakability of objective and projective measures of dependency. Journal of Personality Assessment, 63(2), 363–386. doi: 10.1207/s15327752jpa6302_14
Buş, I., Ştefan, E.-C., & Visu-Petra, G. (2009). Anger management in the penitentiary: An intervention study. Cognition, Brain, Behavior. An Interdisciplinary Journal, 13(3), 329–340.
Bushman, B. J., & Anderson, C. A. (2001). Is it time to pull the plug on hostile versus instrumental aggression dichotomy? Psychological Review, 108(1), 273–279. doi: 10.1037/0033-295x.108.1.273
Buss, A. H., & Durkee, A. (1957). An inventory for assessing different kinds of hostility. Journal of Consulting Psychology, 21(4), 343–349. doi: 10.1037/h0046900
Buss, A. H., & Warren, W. (2000). Aggression questionnaire: Manual. Los Angeles: Western Psychological Services.
Critical Appraisal Skills Programme (CASP). (2013). Retrieved from http://www.caspuk.net/
Daffern, M., & Howells, K. (2008). The function of aggression in personality disordered patients. Journal of Interpersonal Violence, 24(4), 586–600. doi: 10.1177/0886260508317178
Day, A., Davey, L., Wanganeen, R., Casey, S., Howells, K., & Nakata, M. (2008). Symptoms of trauma, perceptions of discrimination, and anger: A comparison between Australian indigenous and nonindigenous prisoners. Journal of Interpersonal Violence, 23(2), 245–258. doi: 10.1177/0886260507309343
Day, A., Davey, L., Wanganeen, R., Howells, K., DeSantolo, J., & Nakata, M. (2006). The meaning of anger for Australian indigenous offenders: The significance of context. International Journal of Offender Therapy and Comparative Criminology, 50, 520–539.

de Azevedo, F. B., Wang, Y.-P., Goulart, A. C., Lotufo, P. A., & Benseñor, I. M. (2010). Application of the Spielberger’s State-Trait Anger Expression Inventory in clinical patients. Arquivos de Neuro-Psiquiatria, 68(2), 231–234. doi: 10.1590/s0004-282x2010000200015
Deffenbacher, J. L., Oetting, E. R., Thwaites, G. A., Lynch, R. S., Baker, D. A., Stark, R. S., . . . Eiswerth-Cox, L. (1996). State-trait anger theory and the utility of the trait anger scale. Journal of Counseling Psychology, 43(2), 131–148. doi: 10.1037/0022-0167.43.2.131
Del Vecchio, T., & O’Leary, K. (2004). Effectiveness of anger treatments for specific anger problems: A meta-analytic review. Clinical Psychology Review, 24(1), 15–34. doi: 10.1016/j.cpr.2003.09.006
DiGiuseppe, R., & Tafrate, R. C. (2003). Anger treatment for adults: A meta-analytic review. Clinical Psychology: Science and Practice, 10(1), 70–84. doi: 10.1093/clipsy.10.1.70
Eckhardt, C., Norlander, B., & Deffenbacher, J. (2004). The assessment of anger and hostility: A critical review. Aggression and Violent Behavior, 9(1), 17–43. doi: 10.1016/s1359-1789(02)00116-7
Edmondson, C. B., & Conger, J. C. (1996). A review of treatment efficacy for individuals with anger problems: Conceptual, assessment, and methodological issues. Clinical Psychology Review, 16(3), 251–275. doi: 10.1016/s0272-7358(96)90003-3
European Monitoring Centre for Drugs and Drug Addiction. (2004). Annual report 2004: The state of the drugs problem in the European Union and Norway. Retrieved from http://www.emcdda.europa.eu/attachements.cfm/att_37253_EN_ar2004-en1.pdf
Fernandez, E. (2013). Treatments for anger in specific populations: Theory, application, and outcome. In E. Fernandez (Ed.), (pp. 1–14). New York, NY: Oxford University Press.
Feshbach, S. (1986). Reconceptualizations of anger: Some research perspectives. Journal of Social and Clinical Psychology, 4(2), 123–132. doi: 10.1521/jscp.1986.4.2.123
Ford, B. (1991). Anger and irrational beliefs in violent inmates. Personality and Individual Differences, 12(3), 211–215. doi: 10.1016/0191-8869(91)90106-l
Gaertner, G. P. (1983). A component analysis of stress inoculation training for the development of anger management skills in adult male offenders. Dissertation Abstracts International, 44, 2359.
Giancola, P. R. (2002). The influence of trait anger on the alcohol-aggression relation in men and women. Alcoholism: Clinical and Experimental Research, 26(9), 1350–1358. doi: 10.1111/j.1530-0277.2002.tb02678.x
Gondolf, E. W., & Foster, R. A. (1991). Pre-program attrition in batterer programs. Journal of Family Violence, 6(4), 337–349. doi: 10.1007/bf00980537
Heseltine, K., Howells, K., & Day, A. (2010). Brief anger interventions with offenders may be ineffective: A replication and extension. Behaviour Research and Therapy, 48(3), 246–250. doi: 10.1016/j.brat.2009.10.005
Higgins, J. P. T., Altman, D. G., & Sterne, J. A. C. (2011). Assessing risk of bias in included studies. In J. P. T. Higgins & S. Green (Eds.), Cochrane handbook for systematic reviews of interventions (Version 5.1.0). The Cochrane Collaboration.
Horn, R., & Towl, G. (1997). Anger management for women prisoners. Issues in Criminological and Legal Psychology, 29, 57–62.
Howells, K. (2004). Anger and its links to violent offending. Psychiatry, Psychology and Law, 11(2), 189–196. doi: 10.1375/pplt.2004.11.2.189
Howells, K., Daffern, M., & Day, A. (2008). Handbook of forensic mental health. In K. Soothill, M. Dolan, & P. Rogers (Eds.), (pp. 351–374). Devon: Willan Publishing.
Howells, K., Day, A., Bubner, S., Jauncey, S., Parker, A., Williamson, P., & Heseltine, K. (2001). An evaluation of anger management programs with violent offenders in two Australian states. Retrieved from http://crg.aic.gov.au/reports/37-98-9 fr.pdf
Howells, K., Day, A., Bubner, S., Jauncey, S., Williamson, P., Parker, A., & Heseltine, K. (2002). Anger management and violence prevention: Improving effectiveness. Trends and Issues in Crime and Criminal Justice, 227, 1–6.
Howells, K., Day, A., Williamson, P., Bubner, S., Jauncey, S., Parker, A., & Heseltine, K. (2005). Brief anger management programs with offenders: Outcomes and predictors of change. Journal of Forensic Psychiatry & Psychology, 16(2), 296–311. doi: 10.1080/14789940500096099
Kennedy, S. M. (1990). Anger management training with adult offenders (Unpublished doctoral dissertation). University of Ottawa, Ottawa, Ontario, Canada.
Koons, B. A., Burrow, J. D., Morash, M., & Bynum, T. (1997). Expert and offender perceptions of program elements linked to successful outcomes for incarcerated women. Crime & Delinquency, 43(4), 512–532. doi: 10.1177/0011128797043004007
Kroner, D. G., & Reddon, J. R. (1995). Anger and psychopathology in prison inmates. Personality and Individual Differences, 18(6), 783–788. doi: 10.1016/0191-8869(94)00206-8
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Looman, J. (2005). Responsivity issues in the treatment of sexual offenders. Trauma, Violence, & Abuse, 6(4), 330–353. doi: 10.1177/1524838005280857
Loza, W., & Loza-Fanous, A. (1999). Anger and prediction of violent and nonviolent offenders’ recidivism. Journal of Interpersonal Violence, 14(10), 1014–1029. doi: 10.1177/088626099014010002
MacPherson, S. L. (1986). An investigation of the components of anger management as applied to an incarcerated population (Unpublished doctoral dissertation). Pennsylvania State University, Pennsylvania, United States.
McEwan, T. E., Davis, M. R., MacKenzie, R., & Mullen, P. E. (2009). The effects of social desirability response bias on STAXI-2 profiles in a clinical forensic sample. British Journal of Clinical Psychology, 48(4), 431–436. doi: 10.1348/014466509x454886
McMurran, M., Charlesworth, P., Duggan, C., & McCarthy, L. (2001). Controlling angry aggression: A pilot group intervention with personality disordered offenders. Behavioural and Cognitive Psychotherapy, 29(4), 473–483. doi: 10.1017/s1352465801004076
Meers, D. J. (1979). The effectiveness of rational behavior therapy in reducing anger of inmates (Unpublished doctoral dissertation). Indiana State University, Indiana, United States.


Miller, J. B. (1985). The construction of anger in women and men. Retrieved from http://www.wellesleyresearchcenter.org/pdf/previews/preview_4sc.pdf
Mills, J. F., Kroner, D. G., & Forth, A. E. (1998). Novaco Anger Scale: Reliability and validity within an adult criminal sample. Assessment, 5(3), 237–248.
Mohr, P., Heseltine, K., & Howells, K. (2001). Evaluation of Operation Flinders wilderness-adventure program for youth at risk. Adelaide, Australia: University of South Australia.
Napolitano, S. (1991). Evaluation of prison anger control training (PACT): A group treatment program for incarcerated murderers and violent offenders (Unpublished doctoral dissertation). California School of Professional Psychology, California, United States.
Novaco, R. W. (1975). Anger control: The development and evaluation of an experimental treatment. Lexington, MA: Academic Press.
Novaco, R. W. (1977). A stress inoculation approach to anger management in the training of law enforcement officers. American Journal of Community Psychology, 5(3), 327–346. doi: 10.1007/BF00884700
Novaco, R. W. (1979). Cognitive-behavioral interventions: Theory, research and procedures. In P. C. Kendall & S. C. Hollon (Eds.), (pp. 241–285). New York: Academic Press.
Novaco, R. W. (1994). Violence and mental disorder: Developments in risk assessment. In J. Monahan & H. J. Steadman (Eds.), (pp. 21–59). Chicago, IL: University of Chicago Press.
Novaco, R. W. (2003). The Novaco Anger Scale and Provocation Inventory (NAS-PI). Los Angeles, CA: Western Psychological Services.
Novaco, R. W., Ramm, M., & Black, L. (2004). The essential handbook of offender assessment and treatment. In C. Hollin (Ed.), (pp. 129–146). London: John Wiley & Sons.
Novaco, R. W., & Taylor, J. L. (2004). Assessment of anger and aggression in male offenders with developmental disabilities. Psychological Assessment, 16(1), 42–50. doi: 10.1037/1040-3590.16.1.42
Parrott, D. J., & Giancola, P. R. (2007). Addressing “the criterion problem” in the assessment of aggressive behavior: Development of a new taxonomic system. Aggression and Violent Behavior, 12(3), 280–299. doi: 10.1016/j.avb.2006.08.002
Parrott, D. J., & Zeichner, A. (2005). Effects of sexual prejudice and anger on physical aggression toward gay and heterosexual men. Psychology of Men & Masculinity, 6(1), 3–17. doi: 10.1037/1524-9220.6.1.3
Paulhus, D. L. (1998). Paulhus deception scales (PDS): The balanced inventory of desirable responding - 7. Toronto: Multi-Health Systems.
Petticrew, M., & Roberts, H. (2006). Systematic reviews in the social sciences: A practical guide. Oxford: Blackwell Publishing.
Reilly, P. M., & Shopshire, M. S. (2012). Anger management for substance abuse and mental health clients: A cognitive behavioral therapy manual. Retrieved from http://store.samhsa.gov/shin/content/SMA12-4213/SMA12-4213.pdf
Rosenfeld, B., & Penrod, S. D. (2011). Research methods in forensic psychology. New Jersey: John Wiley & Sons.
Shinkfield, A. J., & Graffam, J. (2014). Experience and expression of anger among Australian prisoners and the relationship between anger and reintegration variables. International Journal of Offender Therapy and Comparative Criminology, 58(4), 435–453. doi: 10.1177/0306624x12470525
Spielberger, C. D. (1991). State-Trait Anger Expression Inventory: Revised research edition: Professional manual. Odessa, FL: Psychological Assessment Resources.
Spielberger, C. D. (1999a). Manual for the State-Trait Anger Expression Inventory-2. Odessa, FL: Psychological Assessment Resources.
Spielberger, C. D. (1999b). Manual for the State-Trait Anger Expression Inventory-2. Odessa, FL: Psychological Assessment Resources.
Stermac, L. E. (1986). Anger control treatment for forensic patients. Journal of Interpersonal Violence, 1(4), 446–457. doi: 10.1177/088626086001004004
Stone, B. L. (1990). Anger and aggression control for incarcerated offenders (Unpublished doctoral dissertation). University of Montana, Montana, United States.
Sukhodolsky, D. G., Kassinove, H., & Gorman, B. S. (2004). Cognitive-behavioral therapy for anger in children and adolescents: A meta-analysis. Aggression and Violent Behavior, 9(3), 247–269. doi: 10.1016/j.avb.2003.08.005
Tafrate, R. C. (1995). Anger disorders: Definition, diagnosis, and treatment. In H. Kassinove (Ed.), (pp. 109–128). Washington, DC: Taylor & Francis.
Towl, G. J. (1995). Groupwork in prisons. Leicester, UK: British Psychological Society.
Towl, G. J., & Dexter, P. M. (1994). Anger management with prisoners: An empirical evaluation. Groupwork, 7, 256–269.
Tully, R. J., Chou, S., & Browne, K. D. (2013). A systematic review on the effectiveness of sex offender risk assessment tools in predicting sexual recidivism of adult male sex offenders. Clinical Psychology Review, 33(2), 287–316. doi: 10.1016/j.cpr.2012.12.002
Vannoy, S. D., & Hoyt, W. T. (2004). Evaluation of an anger therapy intervention for incarcerated adult males. Journal of Offender Rehabilitation, 39(2), 39–57. doi: 10.1300/j076v39n02_03
Walker, J., & McDonald, D. (1995). The over-representation of Indigenous people in custody in Australia. Trends and Issues in Crime and Criminal Justice, 47, 1–6.
Walker, R. (2001). Anger and women prisoners: Its origins, expression and management (Unpublished master’s thesis). University of South Australia, Adelaide, South Australia, Australia.
Walton, J. S., & Chou, S. (2014). The effectiveness of psychological treatment for reducing recidivism in child molesters: A systematic review of randomized and nonrandomized studies. Trauma, Violence, & Abuse. doi: 10.1177/1524838014537905
Watt, B. D., & Howells, K. (1999). Skills training for aggression control: Evaluation of an anger management programme for violent offenders. Legal and Criminological Psychology, 4(2), 285–300. doi: 10.1348/135532599167914
Zinger, I., & Wichmann, C. (1999). The psychological effects of 60 days in administrative segregation. Retrieved from http://www.csc-scc.gc.ca/research/092/r85_e.pdf


Appendix

Areas assessed for risk of bias (Higgins et al., 2011):

• Sampling and selection bias: Refers to systematic variations in baseline characteristics and in the selection and allocation of participants.

• Performance bias: Refers to systematic variations in exposure to factors other than the treatment that may influence participants’ performance or the outcome of the study.

• Detection bias: Refers to systematic errors in how the outcome of treatment is established.

• Attrition bias: Refers to systematic errors arising from participants withdrawing from the study, which may result in the collected data being incomplete.

• Results/statistical analysis bias: Refers to systematic errors in how the outcome is assessed.

Received: December 8, 2014 Revision Received: May 20, 2015 Accepted: May 27, 2015

Archives of Forensic Psychology © 2015 Global Institute of Forensic Psychology 2015, Vol. 1, No. 2, 55–77 ISSN 2334-2749

Risky Business: Incorporating Informed Deception Detection Strategies in Violence Risk Assessments

Alysha Baker∗ Centre for the Advancement of Psychological Science and Law (CAPSL) University of British Columbia E-mail: [email protected]

Stephen Porter Centre for the Advancement of Psychological Science and Law (CAPSL) University of British Columbia

Leanne ten Brinke Haas School of Business UC Berkeley

Megan Udala Centre for the Advancement of Psychological Science and Law (CAPSL) University of British Columbia

Although the relevance of deception detection is well established in police investigations and the courtroom, it has been less salient in the context of psychological risk assessments at various levels of the legal system such as sentencing, preventative detention hearings, and conditional release. During the interview component of risk assessments, the implementation of strategies to detect lies is vital to ensuring that accurate predictions are made about an offender’s likelihood of perpetrating violence. In this paper, we discuss some of the challenges associated with attempting to detect deception in such high-stakes contexts. We then address the current understanding of the manner in which high-stakes deception is “leaked” behaviorally and active strategies that can be used to detect lies in risk assessment interviews. We suggest that recent advances in deception detection consistently be incorporated into existing violence risk assessment protocols. Further, the need for deception detection training among professionals who conduct risk assessments is emphasized. Keywords: detecting deception, violence risk assessment, remorse

Risky Business: Incorporating Informed Deception Detection Strategies in Violence Risk Assessments

While deception is a common aspect of human communication (e.g., DePaulo, Kashy, Kirkendol, Wyer, & Epstein, 1996; DePaulo et al., 2003; Hancock, 2007; Serota,


Levine, & Boster, 2010), most lies are low-stakes white lies that are of little consequence if undetected. These lies typically are told to oil the wheels of social interaction, to gain some type of minor personal benefit, or to spare the feelings of others (e.g., Abe, 2011; Vrij, 2008a). However, lies told in the context of the criminal justice system often are high-stakes and can have major consequences for both the liar and society (Porter & ten Brinke, 2010). When guilty suspects lie successfully to police, they are afforded an opportunity to continue committing crimes, sometimes for many years. For example, Gary Ridgway was interviewed by police and passed a polygraph test early in the Green River Killer murder investigation in the United States, allowing him to continue murdering women for decades before he was brought to justice. On the other hand, when an innocent suspect is perceived to be lying by an interviewer, the faulty assessment can lead to tunnel vision, biased credibility assessment in court, and, in numerous cases, a wrongful conviction. Given that suspect interviews are such a central evidence-gathering tool in investigations (Holmberg & Christianson, 2002; Vrij, Granhag, & Porter, 2010), the reliability of deception detection within this context is paramount. The application of credibility assessment, however, does not end at the investigation, or even upon reaching a verdict at trial. Criminal defendants and convicted criminal offenders are interviewed for the purpose of gathering information relating to culpability, risk for future violence, remorse, and rehabilitation. The risks of being deceived in these contexts can be as potent as during investigation and trial.
In terms of culpability, reactive, spontaneous offenses often are accompanied by a relatively light sentence compared to cold-blooded crimes (e.g., manslaughter versus first- or second-degree murder), providing a motivation for some violent offenders to lie about the nature of their offense. For example, about one-third of non-psychopathic and two-thirds of psychopathic murderers exaggerate how reactive or unplanned their crime was, framing it more as a crime of passion than a cold-blooded one (Porter & Woodworth, 2007). A recent study found, more generally, that the majority of violent offenders downplay the level of premeditation and instrumentality of their violent crimes (Laurell, Belfrage, & Hellström, 2014). Further, these claims are frequently believed – in particular, those by psychopathic offenders – and lead to undeserved lenient treatment in the legal system (Häkkänen-Nyholm & Hare, 2009). The perceived credibility of an offender’s stories relating to his/her crime and future plans has relevance when considering length of sentence and potential release into the community (Byrne, 2003). For example, an offender who provides a story of great remorse and plans to seek treatment and change his life in the future may receive less prison time or be more likely to be granted conditional release. Yet, lies frequently are not detected in these contexts. Despite their relatively high risk for re-offending, psychopathic offenders – who are prodigious liars and skilled emotional actors (e.g., Book et al., 2015; Porter, ten Brinke, Baker, & Wallace, 2011) – are far more likely to be granted conditional release than their less risky counterparts (Porter, ten Brinke, & Wilson, 2009). In fact, Ruback and Hopper (1986) found that parole decisions/predictions became less accurate after the board met with the offender than when only the file had been reviewed, emphasizing the need for better interviewing and credibility assessment strategies.
Thus, high-stakes lies with relevance to risk for violence are often successful. Empirical evidence suggests that most observers are near the level of chance when it comes to detecting lies or emotional deception (i.e., accuracy rates of approximately 50–56%; Blair, Levine, & Shaw, 2010; Bond & DePaulo, 2006; Porter & ten Brinke, 2008), despite the commonly held assumption that catching lies is common sense (e.g., Supreme Court of Canada in R. v. Marquard). Most recently, Evanoff, Porter, and Black (2014, in press) found that observers viewing extremely high-stakes lies (family members pleading for the return of a missing relative) were no better than chance at detection (cf. Wright-Whelan, Wagstaff, & Wheatcroft, 2015), even though the liars were “leaking” much behavioral information indicative of deception (ten Brinke, Porter, & Baker, 2012). Moreover, the lie catcher’s confidence in his/her assessment bears little relation – and is sometimes inversely related – to the accuracy of the evaluation (Kassin & Fong, 1999; Meissner & Kassin, 2002). But what about psychologists and other mental health professionals? While it often is assumed that such “professional” lie catchers are better at detecting lies than laypeople, they are not (e.g., Hartwig, Granhag, Strömwall, & Vrij, 2004; Vrij & Mann, 2001), and experience on the job is unrelated to accuracy (e.g., DePaulo & Pfeifer, 1986; Porter, Woodworth, & Birt, 2000). Ekman and O’Sullivan (1991) found that psychiatrists performed at the level of chance in a passive lie detection task, and the classic study by Rosenhan (1973) found that malingerers nearly always fooled psychiatrists and hospital staff. Clinical psychologists with an interest in deception, on the other hand, have been found to do somewhat better (62.5–67.5% accuracy) than psychiatrists in a passive detection task (Ekman, O’Sullivan, & Frank, 1999).
However, a recent study of forensic psychologists, forensic psychiatrists, and legal professionals found that they performed at the level of chance in detecting high-stakes deception prior to training (Shaw, Porter, & ten Brinke, 2013). In fact, a high level of self-reported emotional intelligence – which one might expect to relate positively to the detection of particularly emotional lies because of an enhanced understanding of emotions – is associated with impaired lie detection (Baker, ten Brinke, & Porter, 2012). Much of this research is based on “passive observation” in which observers’ credibility evaluations are based on viewing videos of people telling the truth or lying. While some researchers have found that informed passive observation/coding can lead to the identification of reliable behavioral cues to deception (e.g., Matsumoto, Hwang, & Sandoval, 2013; McQuaid, Woodworth, Hutton, Porter, & ten Brinke, 2015; Shaw et al., 2013; ten Brinke & Porter, 2012; Wright-Whelan et al., 2015), this skill must be complemented by active strategies to accurately unearth lies (Vrij, Hope, & Fisher, 2014). For example, Levine et al. (2014) found that expert investigators were extremely accurate at lie detection (and able to elicit confessions) when they were given the opportunity to actively question criminal suspects, in contrast to experts’ performance when viewing videotaped speakers (Vrij & Mann, 2001; but see Vrij, Meissner, and Kassin, 2015, who argue that the study’s findings are “dangerously misleading”). As such, it is important to consider the pitfalls and promises of incorporating lie detection strategies into risk assessment interviews.

Incorporating Lie Detection Strategies in Violence Risk Assessments

Violence risk assessments typically (but not always) include an interview with an offender, affording the assessor an opportunity to observe and assess credibility beyond a file assessment. Most in the clinical-forensic field now agree that actuarial risk assessment measures such as the Violence Risk Appraisal Guide (VRAG; Quinsey, Harris, Rice, & Cormier, 2006) and the Sex Offender Risk Assessment Guide (SORAG; Quinsey, Rice, & Harris, 1995), or clinical-actuarial risk assessment measures (i.e., Structured Professional Judgment tools; SPJ) such as the Spousal Assault Risk Assessment Guide (SARA; Kropp, Hart, Webster, & Eaves, 1995) and the Historical Clinical Risk Management-20 (HCR-20; Douglas, Hart, Webster, & Belfrage, 2013), among others, are critical components of any psychological risk assessment. Although not initially designed for risk assessment purposes, the Psychopathy Checklist-Revised (PCL-R; Hare, 1991, 2003) frequently is used in this context (Hurducas, Singh, de Ruiter, & Petrila, 2014) and will be included in the discussion of such instruments. These clinician-scored tools have an advantage over self-report scales because of the latter’s susceptibility to deceptive responding (e.g., Kelsey, Rogers, & Robinson, 2014). However, the results of these actuarial measures are informed by (as with the PCL-R), or complemented by (e.g., VRAG), an interview with the offender, who is motivated to appear as a low risk and may lie to achieve this goal. SPJ risk assessment tools may be vulnerable to attempts at deception because of the reliance on the clinician’s opinion and judgment; however, if clinicians are adequately trained, this very feature affords them the opportunity to identify signs of deception and incorporate them into their determination of risk level. In most manuals accompanying the major actuarial risk tools, information concerning interviewing and credibility assessment is mentioned only in passing.
For example, the Spousal Assault Risk Assessment Guide (Kropp et al., 1995) notes that the context of the interview motivates the offender to disclose information or to present information in a self-serving manner (leading to a recommendation for the assessor to rely primarily on collateral information), but provides no suggestions for assessing self-reported information. The PCL-R offers more guidance on the issue of deception; the purpose of the interview is described as being to “allow the user to compare and evaluate the consistency of statements and responses, both within the interview and between the interview and the collateral/file information”, and to “provide the user with an opportunity to probe for more information and to challenge the individual on inconsistencies in his/her statements” (p. 18). In scoring items that consider “trait” deceptive behavior (Pathological Lying, Conning/Manipulative), assessors are instructed to consider discrepancies between information from the subject within the interview and in the collateral information.¹ However, there is no guidance on how to assess credibility in the interview beyond looking for information discrepancies. And it is clear that the presence of psychopathy can be in the eye of the beholder. After accurately intuiting the presence of psychopathy in offenders based on viewing “thin slice” videos, observers’ initial perceptions are quickly charmed away with extended viewing (Fowler, Lilienfeld, & Patrick, 2009). Further, PCL-R scores and accordant risk assessment findings are strongly influenced by which side – the defense or prosecution – hired the expert evaluator (e.g., Blais, 2015; Murrie, Boccaccini, Guarnera, & Rufino, 2013), a phenomenon known as the adversarial allegiance effect. Particularly where there is little guidance on how to assess credibility, it may be that adversarial allegiance is fueled, at least in part, by unconsciously biased credibility assessments. This hypothesis is based on past research finding that unconscious biases can influence the interpretation of cues to deception (for example, an observer may be led to believe a target will be deceptive and then rely on stereotypical but false cues to deception such as averted eye gaze to support this preconception), and even lead to tunnel vision, in a credibility assessment context (see Porter, ten Brinke, & Wilson, 2009, for discussion). For example, Meissner and Kassin (2002) found support for an “investigator bias” such that those with more training and prior experience with criminals had a greater response bias toward identifying the target as deceptive (versus truthful). However, a direct empirical investigation of the biasing nature of adversarial allegiance on credibility assessments is necessary. In conducting a clinical interview that informs or complements measures of risk, assessors must be informed about deceptive behavior and the use of strategies to facilitate credibility assessment. As advocated by many scholars in the field of risk assessment (e.g., Quinsey et al., 2006), multiple sources of information should be used in reaching a decision about a subject’s risk for future violence, and an evaluation of his/her sincerity can serve as an additional source of information.

¹ In general, the assessor is advised to always give the collateral information more credibility than the self-report in instances of disagreement, which seems to us to be a problematic approach. A careful assessment of credibility should result in a more fair and accurate conclusion than assuming that the offender is deceptive.

Pitfalls in Detecting Deception in the Risk Assessment Interview

The risk assessment interview is much more than a verbal information gathering session in which the assessor asks the offender questions and records his/her responses; it is a high-stakes context in which each player has great motivation to “perform” at a high level. Each “player” is trying to read the other but may not wish the other to be aware of the nuances of the evaluation or impression management strategies. Consciously or unconsciously, the assessor makes observations about the offender’s appearance, demeanor, response style (e.g., tone of voice), and response content that rightfully or wrongfully inform inferences about character and credibility (e.g., Zebrowitz & Montepare, 2015). The offender may be careful in trying to portray credibility to the assessor and be selective or dishonest in the information he/she provides to manage his/her impression and, more specifically, to appear to be a low risk for re-offense. Lying – about issues such as remorse, rehabilitation, sexual fantasies, violent ideation, future plans, drug use, etc. – in this context can be a complex undertaking and stressful for many offenders, who may spend many more years in prison if deemed a high versus a low risk. Telling a lie requires that the offender concurrently fabricate plausible information, keep those details straight, and appear credible and emotionally sincere to a critical assessor. As such, deception in this context may be accompanied by strong emotions – fear, contempt and resentment, or even excitement – that must be inhibited and/or convincingly faked (e.g., Ekman & Friedman, 1975; ten Brinke & Porter, 2012). Before turning to the aspects of the offender’s behavior to which the assessor should attend in order to make informed conclusions about honesty, we address some challenges and observational pitfalls that can impair such evaluations (see greater elaboration in Vrij et al., 2010).
The “First Impression” problem. Typically, the first time the risk assessor and offender will meet is at the risk assessment interview. Upon meeting the offender, the assessor is subject to a host of first impressions about him/her that can powerfully influence later decision-making. Indeed, first impressions inform observers’ judgments about others’ character, personality characteristics, and intentions, and these intuitive, split-second evaluations are predicated on specific appearance-related features, including facial features, emotional expression, attractiveness, posture, and race (e.g., Andreoni & Petrie, 2008; R. Bull & Rumsey, 1988; Callan, Powell, & Ellard, 2007; Stewart et al., 2012; Zebrowitz & Montepare, 2015). Within approximately 100 ms, long-lasting first impressions are formed about the individual, with confidence in such assessments increasing over time (Aviezer, Trope, & Todorov, 2012; Bar, Neta, & Linz, 2006; Porter & ten Brinke, 2009; Todorov, Said, & Verosky, 2011; Willis & Todorov, 2006). This evaluation appears to be the result of an automatic and subconscious process that focuses on structural characteristics of the face, with features such as higher eyebrows, rounder faces, wider chins, and larger eyes signaling trustworthiness (Bar et al., 2006; Todorov, 2008; Todorov, Baron, & Oosterhof, 2008; Vartanian et al., 2012; Willis & Todorov, 2006). Impressions of trustworthiness are positively related to ratings of ‘babyfacedness’, symmetry, and attractiveness (e.g., P. Bull, 2006; R. Bull & Vine, 2003; Zebrowitz & Montepare, 2015; Zebrowitz, Voinescu, & Collins, 1996). In contrast, faces that are identified as being untrustworthy often are associated with a certain type of crime. For example, research has found that observers consistently identify certain faces as ‘good guys’ or ‘bad guys’ (Goldstein, Chance, & Gilbert, 1984; Yarmey, 1993) and that the ‘bad guys’ subsequently become classified as more likely to belong to ‘rapist’, ‘armed robber’, or ‘murderer’ offender categories (R. Bull & McAlpine, 1998). Interestingly, observers do have a limited ability to sense whether a photo is of a criminal or non-criminal, but are unable to classify the type of criminal represented in the photo (Valla, Ceci, & Williams, 2011).
Translated into real-world risk assessments, this effect can have a major impact on an assessor’s opinion of the offender, similar to the manner in which it has been documented to affect other legal decision-makers’ judgments. Despite the error-prone nature of intuitive judgments of a target’s credibility, they can have a powerful influence in a variety of social situations, including legal decision-making (e.g., Gilron & Gutchess, 2011; Korva, Porter, O’Connor, Shaw, & Brinke, 2013; Langlois et al., 2000; Porter, ten Brinke, & Gustaw, 2010). Dangerous Decisions Theory (DDT; Porter & ten Brinke, 2009) posits that because first impressions are persistent and held with great confidence, they strongly influence the manner in which information about the target encountered later is interpreted. More specifically, new information about the target will be interpreted to fit the original (sometimes erroneous) inference, potentially resulting in inaccurate evaluations. Consistent with Porter and ten Brinke’s (2009) theory, O’Sullivan (2003) found that judgments of target trustworthiness were correlated with state judgments of truthfulness in the predicted manner (e.g., a target who is first judged to be trustworthy-looking is more likely to be classified as telling the truth). Further, Porter et al. (2010) found that mock jury members required fewer pieces of evidence to render a guilty verdict for an untrustworthy-looking defendant compared to a trustworthy-looking one. Such biased decisions, guided by unfounded knowledge and an emphasis on intuition, are a dangerous recipe for inaccurate judgments during assessments of one’s sincerity and subsequent tunnel vision in verdict decisions (Porter & ten Brinke, 2009). Moreover, this pattern of biased decision-making is not confined to the laboratory; it plays out in alarming ways in actual courtrooms.
In a study of real criminal sentences, Black homicide defendants with more racially stereotypical facial features – who are evaluated as more socially threatening than those with less stereotypical features – were more likely to receive the death penalty for their crime, relative to Black defendants with less stereotypical features (Eberhardt, Davies, Purdie-Vaughns, & Johnson, 2006). In sum, first impressions of one’s trustworthiness can greatly bias critical legal decisions; assessors of risk should be aware of these biases to guard against unfair decisions.
Decision-Making Biases. Another pitfall associated with poor lie detection performance that may apply in risk assessment interviews is “default” biases regarding others’ honesty. Most laypersons hold a truth bias, tending by default to perceive others as being truthful (Levine, Park, & McCornack, 1999; Robinson, 1996). However, a different bias tends to emerge with professionals in settings where the base rate of lying is higher and the consequences of missed lies are more substantial. For example, police tend to exhibit a lie bias such that they are more likely to label a target as a liar than would be expected by chance alone (Garrido, Masip, & Herrero, 2004; Meissner & Kassin, 2002). This apparent lie bias is arguably a result of the base rate at which certain observers experience (or believe they experience) lies versus truths in their line of work, the obvious motivation to catch liars, and any guilt-presumptive training received (such as via the Reid technique). These findings suggest that some interviewers in the risk assessment context may be biased by particularly negative file information or simply by the context in which they are making the credibility assessment. Anecdotally, professionals in the clinical-forensic or correctional field will know “offender lovers” or “offender haters” – psychologists who may be particularly gullible or cynical, respectively, regarding the words coming out of the subject’s mouth during a risk assessment interview.
While experience on the job tends to increase confidence in detecting deception, it does not improve accuracy – a dangerous combination that can fuel suggestive, high-pressure interviewing practices that may not necessarily result in the assessor receiving the richest information (and have been found to make even innocent suspects confess in investigative settings; Meissner & Kassin, 2002; Shaw & Porter, 2015; Vrij & Mann, 2001). An awareness of where one’s biases may lie is necessary in order to minimize their influence on credibility assessments.
Unreliable Cues and Training. Most observers – professionals and laypeople alike – hold inaccurate beliefs about deceptive behavior, leading to a reliance on stereotypical cues that lack empirical validation (Eichenbaum & Bodkin, 2000; Garrido et al., 2004; Global Deception Team, 2006; Strömwall, Granhag, & Hartwig, 2004; Vrij & Mann, 2001), likely contributing to their poor performance on deception detection tasks. Globally, the most commonly used cues include those perceived to be indicative of nervousness, such as gaze aversion and fidgeting, despite both being unrelated to deceit (Global Deception Team, 2006; Mann, Vrij, & Bull, 2004; Vrij & Mann, 2001; Wiseman et al., 2012). Inaccurate beliefs and poor performance on lie detection tasks by legal professionals may partially be due to a lack of training, or the use of invalid training programs. Indeed, far too often it is assumed that detecting lies is a straightforward task, resulting in many observers relying on their intuition to discern truths from lies. Even in the eyes of courts, determining the veracity of testimony is considered to be “common sense” (R. v. Marquard, 1993) and is the responsibility of the jury – individuals without any training in lie detection but who are assumed to be proficient at the task (R. v. Francois, 1994).
Further, there is no standard training in lie detection for judges, despite the fact that credibility assessment is a “bread and butter” task for them. In fact, there seems to be little consensus among them in their beliefs about the best signals of deception, although the most common beliefs are exactly like those false cues mentioned above (Porter & ten Brinke, 2009). Because it appears that cues relied upon by most observers have little relation to actual deceit (Vrij, 2008a), the need for empirically-based training is critical. Importantly, recent studies and meta-analyses substantiate that empirically-based training is effective in enhancing lie detection ability (Driskell, 2012; Hauch, Sporer, Michael, & Meissner, 2014; Shaw et al., 2013). While most legal professional groups receive little or no deception detection training, others do, but their training lacks validity. The most widely utilized investigative interviewing approach for police officers is the Reid Technique (Inbau, Reid, Buckley, & Jayne, 2001) – a technique that assumes that lie detection is straightforward and that lying is associated with the same types of stereotypical cues mentioned earlier. Investigators are advised to take a guilt-presumptive approach and inform the suspect of their unquestionable involvement in the crime (Kassin et al., 2010; Shaw & Porter, 2015). The interviewer is advised to then speak “at” the suspect rather than engaging in questioning and answering, and to rely on patience and various manipulative tactics to obtain an admission of guilt. Despite the widespread adoption of the Reid Technique, research has found that this training hinders deception detection accuracy (Kassin & Fong, 1999). As such, risk assessors utilizing confrontational techniques akin to the Reid Technique need to be aware of its shortcomings with regard to credibility assessments. Research suggests that training in deception detection via the Reid Technique may actually make people worse than chance at detecting lies, while increasing their confidence in doing so (Kassin & Fong, 1999; cf. Levine et al., 2014). This likely is because the procedure fails to consider scientifically validated cues (Walsh & Bull, 2010) and promotes an active reliance on simplistic, stereotypical (but non-valid) cues.
For example, Vrij, Mann, and Fisher (2006) found that behaviors supposedly indicative of lying, such as nervousness and discomfort, were actually exhibited more often by honest participants than deceptive participants. In sum, the reliance on interviewing techniques and credibility assessment strategies with no empirical foundation can actively impair lie detection, a scenario that almost certainly has led to tunnel vision during legal decision-making and numerous wrongful convictions in North America (The Innocence Project, 2012). As such, confrontational interviewing approaches and attention to behavioral cues advocated by the Reid Technique are ill-advised for evaluators who must detect deception in a risk assessment context. Fortunately, there has been a recent movement to introduce new police training practices, such as the PEACE model (Snook, Eastwood, Stinson, Tedeschini, & House, 2010), which shows promise in truth gathering and whose principles could be translated to risk assessment interviews. According to the PEACE (Preparation and Planning, Engage and Explain, Account, Closure, Evaluation) model of interviewing, investigators initially learn as much detail as possible regarding the interviewee and prepare subsequent questions based on an analysis of existing evidence (Clarke, Milne, & Bull, 2011; Gudjonsson & Pearse, 2011; Milne & Bull, 1999; Snook et al., 2010). To apply this method of interviewing to a risk assessment setting, the assessor can engage the offender in a conversational, rapport-building dialogue and explanation of the interview process, followed by non-coercive techniques to elicit information willingly provided by the offender regarding relevant risk factors specific to him/her. As such, the PEACE model of interviewing is less aggressive and creates a more respectful dialogue between interviewer and interviewee, resulting in increased comfort – an advantage for risk assessment interviews aimed at enhancing the information-gathering process. However, despite the promise of the PEACE model (and its advances over its predecessor), there is little guidance for the detection of deception in this approach. We suggest, however, that the combination of PEACE model interviewing techniques and empirically-valid lie detection knowledge would equip well those who must assess credibility in risk assessment contexts.
Narrowed Focus and Lack of Behavioral Baseline. Despite the non-existence of a “Pinocchio’s nose” (i.e., a smoking gun in the deception realm), observers tend to focus on and overemphasize the importance of information gathered from specific non-verbal channels (Mann et al., 2004; Porter et al., 2000); however, verbal cues to deception are important to consider (e.g., Vrij, 2008b; Vrij, Leal, Mann, & Granhag, 2011). Indeed, a large body of research indicates that a holistic approach is most beneficial and is associated with higher overall accuracy rates (Ekman & O’Sullivan, 1991; Vrij, Akehurst, Soukara, & Bull, 2004; Vrij, Evans, Akehurst, & Mann, 2004). In a large-scale meta-analysis, Hartwig and Bond (2014) found that the holistic use of multiple cues to deception led to a lie detection accuracy rate of 70% across various contexts. Importantly, because the would-be lie detector ideally attends to all channels of communication, he/she can compare potential deceptive displays to the baseline truthful behavior of the individual. This is particularly important because not only does a Pinocchio’s nose not exist, but individual liars show idiosyncratic cues to lying in their nonverbal and verbal behavior (DePaulo & Friedman, 1998).
These groups include people whose natural behavior may appear suspicious to others (e.g., introverts and socially anxious individuals, to name a few; Vrij et al., 2010), an issue which may lead to credibility assessment biases. Similarly, biased credibility assessments may result from cross-cultural differences in behaviour, such as whether it is socially appropriate to maintain eye contact during conversation (Ekman, 1972; Kupperbusch et al., 1999). This could lead to an incorrect assumption of risk, as an individual may be presumed to be acting guilty when he/she is merely abiding by his/her own cultural scripts. To ensure these biases are avoided, it is necessary to establish a baseline and consider cross-cultural differences for every interviewee.
Poor Elicitation of Deceptive Cues. There has been a recent movement in the deception literature calling for a change in the manner in which lies can be strategically detected (Levine, 2014; Vrij & Granhag, 2012), with an emphasis on specific interviewing techniques that elicit more cues to deception. Indeed, much of the research to date on nonverbal and verbal cues has focused on passive observation, but active strategies are becoming increasingly acknowledged as being important (Levine et al., 2014). For example, Vrij and Granhag (2012) argued that strategies to increase cognitive load, capitalize on responses to unanticipated questions, and make calculated use of evidence are particularly valuable. Paired with passive observations, more active strategies will help increase the rate at which lies are caught. Some scholars have even suggested that the accumulated research from the past 40 years has led our understanding of deception towards a dead end and that more creative approaches must be taken to better elicit behavioral cues in the hopes of increasing accuracy rates (e.g., Levine, Shaw, & Shulman, 2010).
Consequently, assessments of risk can be improved by using strategies to elicit more behavioral cues about an interviewee’s credibility.

Moving Forward: Promises of Detecting Deception in the Risk Assessment Interview

Passive Observation. Although there is no Pinocchio’s nose, there are a variety of verbal, behavioral, and facial cues reliably associated with high-stakes deception (e.g., Colwell, Hiscock-Anisman, & Fede, 2012; Griesel, Ternes, Schraml, Cooper, & Yuille, 2012; O’Sullivan, 2003; Vrij et al., 2010). In a widely-cited meta-analysis, DePaulo and colleagues (2003) found that liars provided fewer details, spent less time talking, told less plausible stories, and utilized fewer illustrators. More specific to high-stakes contexts, ten Brinke and Porter (2012) found, with a sample of individuals pleading for the safe return of missing family members, that deceptive individuals portrayed inadequate emotional displays (i.e., surprise and happiness rather than distress) and used fewer words overall but proportionally more tentative language, relative to their genuine counterparts. Although offenders may have more practice in lying and maintaining their deceit for long periods of time (relative to the general public), research suggests that they too leak signals that can reveal their duplicity. In a study comparing autobiographical stories from offender and student samples, it was found that offenders engaged in more self-manipulations (e.g., covering their face) and less smiling during deception (Porter, Doucette, Woodworth, Earle, & MacNeil, 2008). Further, Klaver, Lee, and Hart (2007) found that offenders did not show a decrease in illustrator movements, in contrast to the pattern found more generally with non-offenders, but did find that certain linguistic patterns (e.g., decreased response time, fewer words spoken, and speech disturbances) and head movements revealed offenders’ lies. In general, an extensive literature suggests that there are salient differences between authentic and deceptive behaviors that can be relied upon during an empirically-based credibility assessment – even among criminal populations.
Attention to these behaviors can be particularly informative when the baseline method can be employed (i.e., comparing the behavior during displays in question to typical behavior; Porter & ten Brinke, 2009; Vrij, 2008a). By attending to verbal and nonverbal cues to deceit during the interview, clinicians can use this information to score items on existing risk assessment tools, such as the PCL-R’s (Hare, 1991, 2003) pathological lying item; however, we argue that this information be incorporated to a greater degree than simply informing existing items, especially since very few items specific to lying actually exist in current risk assessment tools.
Impression Management via Emotional Displays. The formal risk assessment that accompanies later stages of the legal system provides unique opportunities to incorporate information gleaned from emotional displays (both sincere and insincere). There are a number of deception dimensions within the context of a formal risk assessment (e.g., false positive presentation, denial of criminality, and conning and manipulation), but here we focus on impression management attempts regarding one’s emotions. Although lying can occur at any time during the risk appraisal process, cues to emotional deception likely will be most apparent during the interview/information-gathering stage of a risk assessment (see Mills, Kroner, & Morgan, 2011). During this stage, emotional clues can be apparent for hidden basic emotions, such as anger, happiness, and sadness, allowing a trained eye to identify inappropriate or inconsistent affect that reveals important information about an offender’s true emotions. For example, if the offender was convicted of a crime that involved a failed attempt at retaliation, he/she may claim that there will not be another attempt and may promise that any “bad blood” with the target is gone, but may reveal his/her true emotion through simultaneous displays of anger when discussing the target². The presence (or absence) of other basic emotions can also be assessed depending on the stage of the interview and topic at hand. For example, if an interviewer addresses an inconsistency in self-report and file or collateral information, or presents evidence that suggests the offender’s risk level is greater than expected, the offender may display signs of fear (Ekman & Friesen, 1969; Ekman & O’Sullivan, 1991; Frank & Ekman, 1997). It must be acknowledged that it is possible (although not empirically known) that reciting one’s offence repeatedly over time while incarcerated may cause offenders to discuss their crimes in a progressively less emotional manner. The assessor should be cognizant of the amount of time passed since the crime and how often the offender has recounted the crime prior to the assessment. However, as found in other high-stakes deception contexts (ten Brinke & Porter, 2012), discordant emotion displayed by deceptive individuals is identifiable and is not likely to become less apparent over time. For example, a brief smirk by an offender while recounting his murder a decade after it occurred likely still indicates a lack of remorse. In sum, attempts to receive a lenient sentence or undeserved release in risk assessment settings may be revealed by a focus on falsified and concealed emotions – behaviors that can be reliably detected (e.g., ten Brinke & Porter, 2012) and may provide important information regarding an offender’s risk for future violence. With training, an in-depth understanding of the way in which basic universal emotions manifest is a valuable addition to the assessor’s arsenal (see Ekman, 2007, for a thorough discussion). Information gathered from attention to concealed emotions can be incorporated into existing risk assessment tools that have items related to intentions.
For example, the pro-criminal attitude/orientation items in the Level of Service Case Management Inventory (LS/CMI; Andrews, Bonta, & Wormith, 2000) – a general recidivism risk assessment tool – can be informed by noted emotional discrepancies regarding whether the offender still holds a positive outlook on antisocial behavior and wants to participate in criminal activities in the future or if he/she is no longer in support of his/her past antisocial lifestyle. In sum, attending to emotional information that may be present during the assessment can help reveal the offender’s true emotions and intentions.

² To facilitate the process of evaluating impression management of one’s emotions, it is highly recommended that interviews be videotaped to allow for later analysis.

Remorse Appraisal. A moral emotion of particular importance in the criminal justice system is remorse. Remorseful expressions and apologies in sentencing contexts, in particular, are communicated in order to make reparations for past transgressions and to indicate to observers that the perpetrator understands the gravity of his/her wrongdoing; in turn, the perceived sincerity of emotional displays in sentencing and conditional release contexts is given considerable weight during the decision-making process (see Slovenko, 2006). Perpetrators who show remorse are considered to be good candidates for treatment and rehabilitation: “[The defendant’s] remorse, guilt, and shame should provide him with a strong motivation to work at changes that will prevent future acts of violence” (R. v. Struve, 2007). This is echoed empirically, with numerous studies attesting to the relationship between perceived remorse and reduced punishments (Day & Ross, 2011; MacLin, Downs, Maclin, & Caspers, 2009; Pipes & Alessi, 1999; Taylor & Kleinke, 1992; Wiener & Rinehart, 1986) and a related emotion – guilt – being associated with reduced recidivism (Tangney, Stuewig, & Martinez, 2014). Because of the importance of perceived remorse, leniency-seeking offenders may see the opportunity to manipulate a judge or clinician for a lesser sentence or lowered risk rating by lying about or exaggerating their level of remorse. While legal decision-makers already attempt to critically evaluate these displays, they often miss the mark by relying on cues that lack empirical support, such as assuming remorse from general admissions of guilt and responsibility (Weisman, 2004, 2009; Wood & MacMartin, 2007). Although accepting responsibility is a behavior that may result from remorse, remorse is not a necessity for the acceptance of responsibility because of the numerous motivating factors that may lead an individual to (selfishly) communicate it. This issue has likely contributed to the ambiguity and inconsistency identified in past remorse appraisals in the legal system (Ward, 2006). Although the benefits associated with expressing remorse in a sentencing or conditional release hearing, or risk assessment interview, encourage some offenders to put on a false face, decision-makers with a keen eye for insincere emotional displays can use this information to inform their risk assessment. Recent research has found support for unique expression patterns of remorse; ten Brinke, MacDonald, Porter, and O’Connor (2012) used the “unethical memory” paradigm, during which participants were asked to provide narratives about one event for which they felt remorseful and another for which they felt no remorse. Findings indicated that genuine statements (versus false remorse) were characterized by emotional facial expressions representative of the basic emotions (e.g., sadness) with periods of neutrality between subsequent emotional expressions.
In contrast, the researchers found that falsified remorse accounts were associated with increased speech hesitations and emotional turbulence (i.e., positive to negative emotion rather than a return to neutral; ten Brinke, MacDonald, et al., 2012) – features that, if detected, can inform appraisers. As such, with training, those assessing risk can look to these signs of sincerity to determine whether targets are genuinely remorseful and fully understand the gravity of their offence – important determinants of one’s propensity for future violence and recidivism during sentencing decisions. Although we speculate that real or false remorse will be predictive of an offender’s likelihood of re-offending, research to date has not directly examined this relationship. Tangney et al. (2014) found that a related emotion – guilt – was negatively related to recidivism after one year, potentially suggesting that a similar pattern between remorse and recidivism would likely be seen (i.e., reduction in re-offense rate). However, denial – which would presumably be accompanied by a lack of remorse – has not been found to be a risk factor for sexual violence (Hanson & Bussière, 1998; but see Lund, 2000, for discussion of the limitations of this meta-analysis), suggesting indirectly that a lack of remorse may not be as strongly associated with recidivism as generally assumed.
Active Strategies to Elicit More Information About a Subject’s Credibility. Although observational/passive approaches to deception detection are important, complementary active strategies to elicit such cues through specific interviewing approaches have received considerable attention recently (Levine, 2014; Levine et al., 2014; Vrij & Granhag, 2012). More specifically, the incorporation of active strategies allows for more informed decisions surrounding risk for violence. These strategies can easily be incorporated into risk assessment interviews with the offender, similar to the manner in which they would be incorporated at other stages of the legal justice system. Although the surge of research into active deception detection strategies is relatively recent, the research to date suggests these approaches are worthy of risk assessors’ attention, as they can raise deception detection accuracy above the rates reported by meta-analyses on passive observation.
Interviewing Strategies. A number of interviewing approaches have been put forth recently to elicit cues to deception, including the cognitive lie detection approach (Vrij et al., 2011) and the Strategic Use of Evidence approach (SUE; Granhag & Hartwig, 2008), which can be employed during risk assessment interviews. Building on the understanding of lying as a cognitively demanding task (Christ, Van Essen, Watson, Brubaker, & McDermott, 2008), the cognitive lie detection approach emphasizes the value of increasing cognitive load to make cues to deceit more salient and consists of two approaches: imposing-cognitive-load and strategic-questioning. While discussing events that the offender may be motivated to lie about in the hopes of appearing at a reduced risk (e.g., altercations during incarceration or level of remorse), clinicians can incorporate certain strategies to impose greater cognitive demand, as in any interview setting (Vrij et al., 2011), including instructing the interviewee to maintain eye contact while discussing events of interest or asking them to complete a simultaneous, secondary task. Secondly, because liars often prepare their lies, assessors can catch offenders off guard by asking unanticipated questions (Vrij et al., 2011) – a method that has been found to enhance discrimination between truth-tellers and liars (Leins, Fisher, Vrij, Leal, & Mann, 2011). When these types of strategies are used to increase cognitive demand, the liar will have less control over his/her verbal and behavioral communication.
Secondly, the Strategic Use of Evidence approach (SUE; Granhag & Hartwig, 2008) operates on the assumption that liars and truth-tellers will respond differently to pieces of evidence presented to them. The SUE approach can be translated to the risk assessment setting when a clinician would like to present information that may not have been disclosed to the offender and that relates to their risk level. For example, a clinician may want to discuss information received from an institutional informant. In this scenario, the SUE approach advocates that, following a free recall of the event of interest (if applicable), particularly incriminating evidence should be withheld initially but that specific questions concerning that evidence should be asked. Subsequently, the manner in which the offender responds can give insight into the credibility of his/her account. In a meta-analysis of this questioning approach, liars were found to be more likely than truth-tellers to provide contradictory evidence (Hartwig, Granhag, & Luke, 2014), avoid mentioning incriminating evidence, and deny having knowledge of incriminating evidence (Hartwig et al., 2004). Further, the reliance on statement-evidence inconsistency (which was relied upon more by SUE-trained interviewers) resulted in an overall discrimination accuracy of 85.4%.
Examining Intentions. Perhaps most unique to conditional release settings, legal decision-makers can dedicate attention to the truthfulness of offenders’ stated intentions upon release. To date, the literature concerning deception detection has largely remained focused on detection of lies regarding past events (Granhag & Strömwall, 2004; Vrij, 2008b). However, there is emerging research that has made great strides in detecting lies about genuine and false future intentions (i.e., thought processes prior to actual action). The formation of behavioral intention requires episodic future thought (EFT) – the ability to ponder a personal event that may potentially occur in the future (Schacter & Addis, 2007). These recent advancements in our scientific understanding of EFT processes (e.g., Addis, Wong, & Schacter, 2007; Hassabis & Maguire, 2007), and subsequently genuine intentions, can be capitalized upon to unearth false intent. Granhag and Knieps (2011) proposed that because forethought is used in planning intentions, those suspected of being deceitful can be questioned about the planning of their intended actions and, at this point, veracity can be assessed. For example, Granhag and Knieps (2011) asked one group of participants to formulate a plan for a mock criminal action in a local shopping mall. They were instructed to place a memory stick containing ‘illegal’ content in a store and prepare a cover story to lie about their intentions. Another group of participants was asked to formulate a plan for a non-criminal action (i.e., simply shopping for gifts). The researchers intercepted both groups of participants before they were able to carry out these actions (thus allowing for an assessment of their EFT regarding their plans) and found that truth-telling participants (i.e., the non-criminal group simply buying a gift at the mall) reported using more mental images in the planning stages and had greater levels of detail of those mental images, compared to liars. Further, Vrij and colleagues (2011) examined passengers’ true and false intentions provided at an airport about the nature of their trip; results indicated that narratives of false intentions were less plausible (but had equal degrees of detail) compared to narratives provided by truthful passengers – a finding that has subsequently been replicated (Vrij et al., 2011). Accounts of truthful intentions are also longer, more detailed, and clearer compared to deceptive accounts of planning phases (Sooniste, Granhag, Knieps, & Vrij, 2013).
Further, questioning about the planning of intended actions appears to be even more valuable than questioning about the intentions themselves, with more behavioral differences emerging between truthful and deceptive accounts in the former (e.g., Granhag & Knieps, 2011; Sooniste et al., 2013). Subsequently, questioning offenders about their intentions upon release is a valuable approach and specifically inquiring about the planning of these intentions may not be as expected by the offender, leading to less scripted responses (Sooniste et al., 2013). This emerging research on future intentions demonstrates that notable differences can be identified between truth tellers and liars – an approach that can be applied directly to predictions of an offender’s risk for future violence by evaluating displays while speaking to a clinician during a risk assessment (or directly to the parole board) about his/her plans upon release. For example, clinicians can examine whether a perpetrator is reporting genuine intentions to turn his/her life around or if his/her actual intent is to return to the previous antisocial lifestyle. Further, information gathered for an analysis of potential false intentions can be incorporated into existing risk assessment tools that have items related to intentions, such as the HCR-20 Version 3 (Douglas et al., 2013), which includes a risk management item that questions whether the offender’s plan lacks feasibility. However, it must be acknowledged that this approach would work best with those consciously and purposely lying about their intentions upon release, rather than those without the cognitive capacity to plan to reoffend or those who have no intention to reoffend but do so impulsively upon release.

Summary of Recommendations

We recommend that all risk evaluators apply both the observational skills and the active deception detection strategies discussed here when assessing risk for future violence, to ensure that deserving offenders receive release and those adept at putting on a false face do not. In general, information gathered from the observation of cues to deception can be used as an information-extracting tool during any interview with an offender. When cues to deceit are present, clinicians should identify them and probe further with additional questions that might clarify any suspicious behavior (while continuing to assess the offender’s displays during each response). Further, clinicians should complement observational approaches with more recent active strategies of lie detection. To encourage more widespread use of deception detection techniques in violence risk assessment interviews, we recommend that clinicians’ reports consistently include a section addressing the offender’s credibility based on an empirically supported evaluation. Ultimately, this will equip those charged with the difficult task of predicting future violence with another important piece of information, thereby increasing the predictive accuracy of these judgments. We further encourage researchers to continue to tackle these important issues, to move the study of deception detection forward, and to contribute to more reliable legal decision-making.

References

Abe, N. (2011). How the brain shapes deception: An integrated review of the literature. The Neuroscientist, 17 (5), 560–574. doi: 10.1177/1073858410393359
Addis, D. R., Wong, A. T., & Schacter, D. L. (2007). Remembering the past and imagining the future: Common and distinct neural substrates during event construction and elaboration. Neuropsychologia, 45 (7), 1363–1377. doi: 10.1016/j.neuropsychologia.2006.10.016
Andreoni, J., & Petrie, R. (2008). Beauty, gender and stereotypes: Evidence from laboratory experiments. Journal of Economic Psychology, 29 (1), 73–93. doi: 10.1016/j.joep.2007.07.008
Andrews, D. A., Bonta, J., & Wormith, S. J. (2000). Level of service/case management inventory: LS/CMI. Multi-Health Systems.
Aviezer, H., Trope, Y., & Todorov, A. (2012). Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, 338 (6111), 1225–1229. doi: 10.1126/science.1224313
Baker, A., ten Brinke, L., & Porter, S. (2012). Will get fooled again: Emotionally intelligent people are easily duped by high-stakes deceivers. Legal and Criminological Psychology, 18 (2), 300–313. doi: 10.1111/j.2044-8333.2012.02054.x
Bar, M., Neta, M., & Linz, H. (2006). Very first impressions. Emotion, 6 (2), 269–278. doi: 10.1037/1528-3542.6.2.269
Blair, J. P., Levine, T. R., & Shaw, A. S. (2010). Content in context improves deception detection accuracy. Human Communication Research, 36 (3), 423–442. doi: 10.1111/j.1468-2958.2010.01382.x
Blais, J. (2015). Preventative detention decisions: Reliance on expert assessments and evidence of partisan allegiance within the Canadian context. Behavioral Sciences & the Law, 33 (1), 74–91. doi: 10.1002/bsl.2155

Bond, C. F., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10 (3), 214–234. doi: 10.1207/s15327957pspr1003_2
Book, A., Methot, T., Gauthier, N., Hosker-Field, A., Forth, A., Quinsey, V., & Molnar, D. (2015). The mask of sanity revisited: Psychopathic traits and affective mimicry. Evolutionary Psychological Science, 1 (2), 91–102. doi: 10.1007/s40806-015-0012-x
Bull, P. (2006). Detecting lies and deceit: The psychology of lying and the implications for professional practice. Journal of Community & Applied Social Psychology, 16 (2), 166–167. doi: 10.1002/casp.828
Bull, R., & McAlpine, S. (1998). Facial appearance and criminality. In A. Memon, A. Vrij, & R. Bull (Eds.), Psychology and law: Truthfulness, accuracy and credibility (pp. 59–76). London: McGraw-Hill.
Bull, R., & Rumsey, N. (1988). The social psychology of facial appearance. New York: Springer.
Bull, R., & Vine, M. (2003). Attractive people tell the truth: Can you believe it? (Poster presented at the Annual Conference of the European Association of Psychology and Law, Edinburgh)
Byrne, M. K. (2003). Trauma reactions in the offender. International Journal of Forensic Psychology, 1, 59–70.
Callan, M. J., Powell, N. G., & Ellard, J. H. (2007). The consequences of victim physical attractiveness on reactions to injustice: The role of observers’ belief in a just world. Social Justice Research, 20 (4), 433–456. doi: 10.1007/s11211-007-0053-9
Christ, S. E., Van Essen, D. C., Watson, J. M., Brubaker, L. E., & McDermott, K. B. (2008). The contributions of prefrontal cortex and executive control to deception: Evidence from activation likelihood estimate meta-analyses. Cerebral Cortex, 19 (7), 1557–1566. doi: 10.1093/cercor/bhn189
Clarke, C., Milne, R., & Bull, R. (2011). Interviewing suspects of crime: The impact of PEACE training, supervision and the presence of a legal advisor. Journal of Investigative Psychology and Offender Profiling, 8 (2), 149–162. doi: 10.1002/jip.144
Colwell, K., Hiscock-Anisman, C., & Fede, J. (2012). Assessment criteria indicative of deception: An example of the new paradigm of differential recall enhancement. In Applied issues in investigative interviewing, eyewitness memory, and credibility assessment (pp. 259–291). Springer Science & Business Media. doi: 10.1007/978-1-4614-5547-9_11
Day, M. V., & Ross, M. (2011). The value of remorse: How drivers’ responses to police predict fines for speeding. Law and Human Behavior, 35 (3), 221–234. doi: 10.1007/s10979-010-9234-4
DePaulo, B. M., & Friedman, H. S. (1998). Nonverbal communication. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (4th ed., Vol. 1 and 2, pp. 3–40). New York, NY, US: McGraw-Hill.
DePaulo, B. M., Kashy, D. A., Kirkendol, S. E., Wyer, M. M., & Epstein, J. A. (1996). Lying in everyday life. Journal of Personality and Social Psychology, 70 (5), 979–995. doi: 10.1037/0022-3514.70.5.979
DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129 (1), 74–118. doi: 10.1037/0033-2909.129.1.74

DePaulo, B. M., & Pfeifer, R. L. (1986). On-the-job experience and skill at detecting deception. Journal of Applied Social Psychology, 16 (3), 249–267. doi: 10.1111/j.1559-1816.1986.tb01138.x
Douglas, K. S., Hart, S. D., Webster, C. D., & Belfrage, H. (2013). HCR-20 (Version 3): Assessing risk for violence. Burnaby, BC, Canada: Mental Health, Law, and Policy Institute, Simon Fraser University.
Driskell, J. E. (2012). Effectiveness of deception detection training: A meta-analysis. Psychology, Crime & Law, 18 (8), 713–731. doi: 10.1080/1068316x.2010.535820
Eberhardt, J. L., Davies, P. G., Purdie-Vaughns, V. J., & Johnson, S. L. (2006). Looking deathworthy: Perceived stereotypicality of black defendants predicts capital-sentencing outcomes. Psychological Science, 17 (5), 383–386. doi: 10.1111/j.1467-9280.2006.01716.x
Eichenbaum, H., & Bodkin, J. (2000). Belief and knowledge as distinct forms of memory. In D. L. Schacter & E. Scarry (Eds.), Memory, brain, and belief (pp. 176–207). Cambridge, MA, US: Harvard University Press.
Ekman, P. (1972). Universals and cultural differences in facial expressions of emotions. In Symposium on motivation. Lincoln: University of Nebraska Press.
Ekman, P. (2007). Emotions revealed: Recognizing faces and feelings to improve communication and emotional life (2nd ed.). New York, NY: Holt Paperbacks.
Ekman, P., & Friesen, W. V. (1969). Nonverbal leakage and clues to deception. Psychiatry, 32, 88–105.
Ekman, P., & Friesen, W. V. (1975). Unmasking the face: A guide to recognizing emotions from facial cues. Englewood Cliffs, NJ: Prentice-Hall.
Ekman, P., & O’Sullivan, M. (1991). Who can catch a liar? American Psychologist, 46 (9), 913–920. doi: 10.1037/0003-066x.46.9.913
Ekman, P., O’Sullivan, M., & Frank, M. G. (1999). A few can catch a liar. Psychological Science, 10 (3), 263–266. doi: 10.1111/1467-9280.00147
Evanoff, C., Porter, S., & Black, P. J. (2014, in press). Video killed the radio star? The influence of presentation modality on detecting high-stakes, emotional lies. Legal and Criminological Psychology. doi: 10.1111/lcrp.12064
Fowler, K. A., Lilienfeld, S. O., & Patrick, C. J. (2009). Detecting psychopathy from thin slices of behavior. Psychological Assessment, 21 (1), 68–78. doi: 10.1037/a0014938
Frank, M. G., & Ekman, P. (1997). The ability to detect deceit generalizes across different types of high-stake lies. Journal of Personality and Social Psychology, 72 (6), 1429–1439. doi: 10.1037/0022-3514.72.6.1429
Garrido, E., Masip, J., & Herrero, C. (2004). Police officers’ credibility judgments: Accuracy and estimated ability. International Journal of Psychology, 39 (4), 254–275. doi: 10.1080/00207590344000411
Gilron, R., & Gutchess, A. H. (2011, Dec). Remembering first impressions: Effects of intentionality and diagnosticity on subsequent memory. Cognitive, Affective, & Behavioral Neuroscience, 12 (1), 85–98. doi: 10.3758/s13415-011-0074-6
Global Deception Team. (2006). A world of lies. Journal of Cross-Cultural Psychology, 37 (1), 60–74. doi: 10.1177/0022022105282295
Goldstein, A. G., Chance, J. E., & Gilbert, B. (1984). Facial stereotypes of good guys and bad guys: A replication and extension. Bulletin of the Psychonomic Society, 22 (6), 549–552. doi: 10.3758/bf03333904

Granhag, P. A., & Hartwig, M. (2008). A new theoretical perspective on deception detection: On the psychology of instrumental mind-reading. Psychology, Crime & Law, 14 (3), 189–200. doi: 10.1080/10683160701645181
Granhag, P. A., & Knieps, M. (2011). Episodic future thought: Illuminating the trademarks of forming true and false intentions. Applied Cognitive Psychology, 25 (2), 274–280. doi: 10.1002/acp.1674
Granhag, P. A., & Strömwall, L. (2004). The detection of deception in forensic contexts. New York: Cambridge University Press.
Griesel, D., Ternes, M., Schraml, D., Cooper, B. S., & Yuille, J. C. (2012). The ABC’s of CBCA: Verbal credibility assessment in practice. In B. S. Cooper, D. Griesel, & M. Ternes (Eds.), Applied issues in investigative interviewing, eyewitness memory, and credibility assessment (pp. 293–323). New York: Springer. doi: 10.1007/978-1-4614-5547-9_12
Gudjonsson, G. H., & Pearse, J. (2011). Suspect interviews and false confessions. Current Directions in Psychological Science, 20 (1), 33–37. doi: 10.1177/0963721410396824
Häkkänen-Nyholm, H., & Hare, R. D. (2009). Psychopathy, homicide, and the courts: Working the system. Criminal Justice and Behavior, 36 (8), 761–777. doi: 10.1177/0093854809336946
Hancock, J. T. (2007). Digital deception: When, where, and how people lie online. In K. McKenna, T. Postmes, U. Reips, & A. Joinson (Eds.), Oxford handbook of internet psychology (pp. 287–301). Oxford: Oxford University Press.
Hanson, R. K., & Bussière, M. T. (1998). Predicting relapse: A meta-analysis of sexual offender recidivism studies. Journal of Consulting and Clinical Psychology, 66 (2), 348–362. doi: 10.1037/0022-006x.66.2.348
Hare, R. D. (1991). The Hare Psychopathy Checklist-Revised. Toronto, ON, Canada: Multi-Health Systems.
Hare, R. D. (2003). The Hare Psychopathy Checklist-Revised (2nd ed.). Toronto, ON, Canada: Multi-Health Systems.
Hartwig, M., & Bond, C. F. (2014). Lie detection from multiple cues: A meta-analysis. Applied Cognitive Psychology, 28 (5), 661–676. doi: 10.1002/acp.3052
Hartwig, M., Granhag, P. A., & Luke, T. (2014). Strategic use of evidence during investigative interviews: The state of the science. In D. C. Raskin, C. R. Honts, & J. C. Kircher (Eds.), Credibility assessment: Scientific research and applications (pp. 1–31). Oxford, UK: Elsevier.
Hartwig, M., Granhag, P. A., Strömwall, L. A., & Vrij, A. (2004). Detecting deception via strategic disclosure of evidence. Law and Human Behavior, 29 (4), 469–484. doi: 10.1007/s10979-005-5521-x
Hassabis, D., & Maguire, E. A. (2007). Deconstructing episodic memory with construction. Trends in Cognitive Sciences, 11 (7), 299–306. doi: 10.1016/j.tics.2007.05.001
Hauch, V., Sporer, S. L., Michael, S. W., & Meissner, C. A. (2014). Does training improve the detection of deception? A meta-analysis. Communication Research. doi: 10.1177/0093650214534974
Holmberg, U., & Christianson, S. A. (2002). Murderers’ and sexual offenders’ experiences of police interviews and their inclination to admit or deny crimes. Behavioral Sciences & the Law, 20 (1-2), 31–45. doi: 10.1002/bsl.470
Hurducas, C. C., Singh, J. P., de Ruiter, C., & Petrila, J. (2014). Violence risk assessment tools: A systematic review of surveys. International Journal of Forensic Mental Health, 13 (3), 181–192. doi: 10.1080/14999013.2014.942923
Inbau, F. E., Reid, J. E., Buckley, J. P., & Jayne, B. C. (2001). Criminal interrogation and confessions (4th ed.). Gaithersburg, MD: Aspen Publishers.
Kassin, S. M., Drizin, S. A., Grisso, T., Gudjonsson, G. H., Leo, R. A., & Redlich, A. D. (2010). Police-induced confessions, risk factors, and recommendations: Looking ahead. Law and Human Behavior, 34 (1), 49–52. doi: 10.1007/s10979-010-9217-5
Kassin, S. M., & Fong, C. T. (1999). “I’m innocent!”: Effects of training on judgments of truth and deception in the interrogation room. Law and Human Behavior, 23 (5), 499–516. doi: 10.1023/a:1022330011811
Kelsey, K. R., Rogers, R., & Robinson, E. V. (2014). Self-report measures of psychopathy: What is their role in forensic assessments? Journal of Psychopathology and Behavioral Assessment, 1–12. doi: 10.1007/s10862-014-9475-5
Klaver, J. R., Lee, Z., & Hart, S. D. (2007). Psychopathy and nonverbal indicators of deception in offenders. Law and Human Behavior, 31 (4), 337–351. doi: 10.1007/s10979-006-9063-7
Korva, N., Porter, S., O’Connor, B. P., Shaw, J., & ten Brinke, L. (2013). Dangerous decisions: Influence of juror attitudes and defendant appearance on legal decision-making. Psychiatry, Psychology and Law, 20 (3), 384–398. doi: 10.1080/13218719.2012.692931
Kropp, P. R., Hart, D. D., Webster, C. W., & Eaves, D. (1995). Manual for the spousal assault assessment guide (2nd ed.). Vancouver, BC: British Columbia Institute on Family Violence.
Kupperbusch, C., Matsumoto, D., Kooken, K., Loweinger, S., Uchida, H., Wilson-Cohn, C., & Yrizarry, N. (1999). Cultural influences on nonverbal expressions of emotion. In P. Philippot, R. S. Feldman, & E. J. Coats (Eds.), The social context of nonverbal behavior (pp. 17–44). New York, NY: Cambridge University Press.
Langlois, J. H., Kalakanis, L., Rubenstein, A. J., Larson, A., Hallam, M., & Smoot, M. (2000). Maxims or myths of beauty? A meta-analytic and theoretical review. Psychological Bulletin, 126 (3), 390–423. doi: 10.1037/0033-2909.126.3.390
Laurell, J., Belfrage, H., & Hellström, Å. (2014). Deceptive behaviour and instrumental violence among psychopathic and non-psychopathic violent forensic psychiatric patients. Psychology, Crime & Law, 20 (5), 467–479. doi: 10.1080/1068316x.2013.793341
Leins, D., Fisher, R. P., Vrij, A., Leal, S., & Mann, S. (2011). Using sketch drawing to induce inconsistency in liars. Legal and Criminological Psychology, 16 (2), 253–265. doi: 10.1348/135532510x501775
Levine, T. R. (2014). Active deception detection. Policy Insights from the Behavioral and Brain Sciences, 1 (1), 122–128. doi: 10.1177/2372732214548863
Levine, T. R., Clare, D. D., Blair, J. P., McCornack, S., Morrison, K., & Park, H. S. (2014). Expertise in deception detection involves actively prompting diagnostic information rather than passive behavioral observation. Human Communication Research, 40 (4), 442–462. doi: 10.1111/hcre.12032
Levine, T. R., Park, H. S., & McCornack, S. A. (1999). Accuracy in detecting truths and lies: Documenting the “veracity effect”. Communication Monographs, 66 (2), 125–144. doi: 10.1080/03637759909376468
Levine, T. R., Shaw, A., & Shulman, H. C. (2010). Increasing deception detection accuracy with strategic questioning. Human Communication Research, 36 (2), 216–231. doi: 10.1111/j.1468-2958.2010.01374.x
Lund, C. A. (2000). Predictors of sexual recidivism: Did meta-analysis clarify the role and relevance of denial? Sexual Abuse: A Journal of Research and Treatment, 12 (4), 275–287. doi: 10.1023/A:1009590510916
MacLin, M. K., Downs, C., MacLin, O. H., & Caspers, H. M. (2009). The effect of defendant facial expression on mock juror decision-making: The power of remorse. North American Journal of Psychology, 11, 323–332.
Mann, S., Vrij, A., & Bull, R. (2004). Detecting true lies: Police officers’ ability to detect suspects’ lies. Journal of Applied Psychology, 89 (1), 137–149. doi: 10.1037/0021-9010.89.1.137
Matsumoto, D., Hwang, H. C., & Sandoval, V. A. (2013). Ethnic similarities and differences in linguistic indicators of veracity and lying in a moderately high stakes scenario. Journal of Police and Criminal Psychology, 30 (1), 15–26. doi: 10.1007/s11896-013-9137-7
McQuaid, S. M., Woodworth, M., Hutton, E. L., Porter, S., & ten Brinke, L. (2015). Automated insights: Verbal cues to deception in real-life high-stakes lies. Psychology, Crime & Law, 21 (7), 617–631. doi: 10.1080/1068316x.2015.1008477
Meissner, C. A., & Kassin, S. M. (2002). “He’s guilty!”: Investigator bias in judgments of truth and deception. Law and Human Behavior, 26 (5), 469–480. doi: 10.1023/a:1020278620751
Mills, J. F., Kroner, D. F., & Morgan, R. D. (2011). Clinician’s guide to violence risk assessment. New York, NY, US: Guilford Press.
Milne, R., & Bull, R. (1999). Investigative interviewing: Psychology and practice. Chichester: Wiley & Sons.
Murrie, D. C., Boccaccini, M. T., Guarnera, L. A., & Rufino, K. A. (2013). Are forensic experts biased by the side that retained them? Psychological Science, 24 (10), 1889–1897. doi: 10.1177/0956797613481812
O’Sullivan, M. (2003). The fundamental attribution error in detecting deception: The boy-who-cried-wolf effect. Personality and Social Psychology Bulletin, 29 (10), 1316–1327. doi: 10.1177/0146167203254610
Pipes, R. B., & Alessi, M. (1999). Remorse and a previously punished offense in assignment of punishment and estimated likelihood of a repeated offense. Psychological Reports, 85 (5), 246–248. doi: 10.2466/pr0.85.5.246-248
Porter, S., Doucette, N. L., Woodworth, M., Earle, J., & MacNeil, B. (2008). Halfe the world knowes not how the other halfe lies: Investigation of verbal and non-verbal signs of deception exhibited by criminal offenders and non-offenders. Legal and Criminological Psychology, 13 (1), 27–38. doi: 10.1348/135532507x186653
Porter, S., & ten Brinke, L. (2008). Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions. Psychological Science, 19 (5), 508–514. doi: 10.1111/j.1467-9280.2008.02116.x
Porter, S., & ten Brinke, L. (2009). Dangerous decisions: A theoretical framework for understanding how judges assess credibility in the courtroom. Legal and Criminological Psychology, 14 (1), 119–134. doi: 10.1348/135532508x281520
Porter, S., & ten Brinke, L. (2010). The truth about lies: What works in detecting high-stakes deception? Legal and Criminological Psychology, 15 (1), 57–75. doi: 10.1348/135532509x433151
Porter, S., ten Brinke, L., Baker, A., & Wallace, B. (2011). Would I lie to you? “Leakage” in deceptive facial expressions relates to psychopathy and emotional intelligence. Personality and Individual Differences, 51 (2), 133–137. doi: 10.1016/j.paid.2011.03.031
Porter, S., ten Brinke, L., & Gustaw, C. (2010). Dangerous decisions: The impact of first impressions of trustworthiness on the evaluation of legal evidence and defendant culpability. Psychology, Crime & Law, 16 (6), 477–491. doi: 10.1080/10683160902926141
Porter, S., ten Brinke, L., & Wilson, K. (2009). Crime profiles and conditional release performance of psychopathic and non-psychopathic sexual offenders. Legal and Criminological Psychology, 14 (1), 109–118. doi: 10.1348/135532508x284310
Porter, S., & Woodworth, M. (2007). “I’m sorry I did it...but he started it”: A comparison of the official and self-reported homicide descriptions of psychopaths and non-psychopaths. Law and Human Behavior, 31 (1), 91–107. doi: 10.1007/s10979-006-9033-0
Porter, S., Woodworth, M., & Birt, A. R. (2000). Truth, lies, and videotape: An investigation of the ability of federal parole officers to detect deception. Law and Human Behavior, 24 (6), 643–658. doi: 10.1023/a:1005500219657
Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. (2006). Violent offenders: Appraising and managing risk. Washington: American Psychological Association.
Quinsey, V. L., Rice, M. E., & Harris, G. T. (1995). Actuarial prediction of sexual recidivism. Journal of Interpersonal Violence, 10 (1), 85–105. doi: 10.1177/088626095010001006
Robinson, W. (1996). Deceit, delusion, and detection. Thousand Oaks, CA, US: Sage Publications, Inc.
Rosenhan, D. L. (1973). On being sane in insane places. Science, 179 (4070), 250–258. doi: 10.1126/science.179.4070.250
Ruback, R. B., & Hopper, C. H. (1986). Decision making by parole interviewers: The effect of case and interview factors. Law and Human Behavior, 10 (3), 203–214. doi: 10.1007/bf01046210
Schacter, D. L., & Addis, D. R. (2007). The cognitive neuroscience of constructive memory: Remembering the past and imagining the future. Philosophical Transactions of the Royal Society B: Biological Sciences, 362 (1481), 773–786. doi: 10.1098/rstb.2007.2087
Serota, K. B., Levine, T. R., & Boster, F. J. (2010). The prevalence of lying in America: Three studies of self-reported lies. Human Communication Research, 36 (1), 2–25. doi: 10.1111/j.1468-2958.2009.01366.x
Shaw, J., & Porter, S. (2015). Constructing rich false memories of committing crime. Psychological Science, 26 (3), 291–301. doi: 10.1177/0956797614562862
Shaw, J., Porter, S., & ten Brinke, L. (2013). Catching liars: Training mental health and legal professionals to detect high-stakes lies. The Journal of Forensic Psychiatry & Psychology, 24 (2), 145–159. doi: 10.1080/14789949.2012.752025
Slovenko, R. (2006). Remorse. Journal of Psychiatry & Law, 34 (3), 397–432.
Snook, B., Eastwood, J., Stinson, M., Tedeschini, J., & House, J. C. (2010). Reforming investigative interviewing in Canada. Canadian Journal of Criminology and Criminal Justice, 52 (2), 215–229. doi: 10.3138/cjccj.52.2.215

Sooniste, T., Granhag, P. A., Knieps, M., & Vrij, A. (2013). True and false intentions: Asking about the past to detect lies about the future. Psychology, Crime & Law, 19 (8), 673–685. doi: 10.1080/1068316x.2013.793333
Stewart, L. H., Ajina, S., Getov, S., Bahrami, B., Todorov, A., & Rees, G. (2012). Unconscious evaluation of faces on social dimensions. Journal of Experimental Psychology: General, 141 (4), 715–727. doi: 10.1037/a0027950
Strömwall, L. A., Granhag, P. A., & Hartwig, M. (2004). Practitioners’ beliefs about deception. In P. A. Granhag & L. A. Strömwall (Eds.), The detection of deception in forensic contexts (pp. 229–250). New York, NY, US: Cambridge University Press. doi: 10.1017/cbo9780511490071.010
Tangney, J. P., Stuewig, J., & Martinez, A. G. (2014). Two faces of shame: The roles of shame and guilt in predicting recidivism. Psychological Science, 25 (3), 799–805. doi: 10.1177/0956797613508790
Taylor, C., & Kleinke, C. L. (1992). Effects of severity of accident, history of drunk driving, intent, and remorse on judgments of a drunk driver. Journal of Applied Social Psychology, 22 (21), 1641–1655. doi: 10.1111/j.1559-1816.1992.tb00966.x
ten Brinke, L., MacDonald, S., Porter, S., & O’Connor, B. (2012). Crocodile tears: Facial, verbal and body language behaviours associated with genuine and fabricated remorse. Law and Human Behavior, 36 (1), 51–59. doi: 10.1037/h0093950
ten Brinke, L., & Porter, S. (2012). Cry me a river: Identifying the behavioral consequences of extremely high-stakes interpersonal deception. Law and Human Behavior, 36 (6), 469–477. doi: 10.1037/h0093929
ten Brinke, L., Porter, S., & Baker, A. (2012). Darwin the detective: Observable facial muscle contractions reveal emotional high-stakes lies. Evolution and Human Behavior, 33 (4), 411–416. doi: 10.1016/j.evolhumbehav.2011.12.003
R. v. Francois, (1994), 2 S.C.R. 827
R. v. Marquard, (1993), 4 S.C.R. 223
R. v. Struve, (2007), BCSC 1316
The Innocence Project. (2012). Factors on post-conviction DNA exonerations. Retrieved from http://www.innocenceproject.org/
Todorov, A. (2008). Evaluating faces on trustworthiness: An extension of systems for recognition of emotions signaling approach/avoidance behaviors. Annals of the New York Academy of Sciences, 1124 (1), 208–224. doi: 10.1196/annals.1440.012
Todorov, A., Baron, S. G., & Oosterhof, N. N. (2008). Evaluating face trustworthiness: A model based approach. Social Cognitive and Affective Neuroscience, 3 (2), 119–127. doi: 10.1093/scan/nsn009
Todorov, A., Said, C. P., & Verosky, S. C. (2011). Personality impressions from facial appearance. In A. Calder, J. V. Haxby, M. Johnson, & G. Rhodes (Eds.), Handbook of face perception (pp. 631–652). New York, NY: Oxford University Press.
Valla, J. M., Ceci, S. J., & Williams, W. M. (2011). The accuracy of inferences about criminality based on facial appearance. Journal of Social, Evolutionary, and Cultural Psychology, 5 (1), 66–91. doi: 10.1037/h0099274
Vartanian, O., Stewart, K., Mandel, D. R., Pavlovic, N., McLellan, L., & Taylor, P. J. (2012). Personality assessment and behavioral prediction at first impression. Personality and Individual Differences, 52 (3), 250–254. doi: 10.1016/j.paid.2011.05.024
Vrij, A. (2008a). Detecting lies and deceit: Pitfalls and opportunities. Chichester, UK: John Wiley & Sons.
Vrij, A. (2008b). Nonverbal dominance versus verbal accuracy in lie detection: A plea to change police practice. Criminal Justice and Behavior, 35 (10), 1323–1336. doi: 10.1177/0093854808321530
Vrij, A., Akehurst, L., Soukara, S., & Bull, R. (2004). Let me inform you how to tell a convincing story: CBCA and reality monitoring scores as a function of age, coaching, and deception. Canadian Journal of Behavioural Science/Revue canadienne des sciences du comportement, 36 (2), 113–126. doi: 10.1037/h0087222
Vrij, A., Evans, H., Akehurst, L., & Mann, S. (2004). Rapid judgements in assessing verbal and nonverbal cues: Their potential for deception researchers and lie detection. Applied Cognitive Psychology, 18 (3), 283–296. doi: 10.1002/acp.964
Vrij, A., & Granhag, P. A. (2012). Eliciting cues to deception and truth: What matters are the questions asked. Journal of Applied Research in Memory and Cognition, 1 (2), 110–117. doi: 10.1016/j.jarmac.2012.02.004
Vrij, A., Granhag, P. A., & Porter, S. (2010). Pitfalls and opportunities in nonverbal and verbal lie detection. Psychological Science in the Public Interest, 11 (3), 89–121. doi: 10.1177/1529100610390861
Vrij, A., Hope, L., & Fisher, R. P. (2014). Eliciting reliable information in investigative interviews. Policy Insights from the Behavioral and Brain Sciences, 1 (1), 129–136. doi: 10.1177/2372732214548592
Vrij, A., Leal, S., Mann, S. A., & Granhag, P. A. (2011). A comparison between lying about intentions and past activities: Verbal cues and detection accuracy. Applied Cognitive Psychology, 25 (2), 212–218. doi: 10.1002/acp.1665
Vrij, A., & Mann, S. (2001). Who killed my relative? Police officers’ ability to detect real-life high-stake lies. Psychology, Crime & Law, 7 (1–4), 119–132. doi: 10.1080/10683160108401791
Vrij, A., Mann, S., & Fisher, R. P. (2006). An empirical test of the behaviour analysis interview. Law and Human Behavior, 30 (3), 329–345. doi: 10.1007/s10979-006-9014-3
Vrij, A., Meissner, C. A., & Kassin, S. M. (2015). Problems in expert deception detection and the risk of false confessions: No proof to the contrary in Levine et al. (2014). Psychology, Crime & Law, 1–18. doi: 10.1080/1068316x.2015.1054389
Walsh, D., & Bull, R. (2010). What really is effective in interviews with suspects? A study comparing interviewing skills against interviewing outcomes. Legal and Criminological Psychology, 15 (2), 305–321. doi: 10.1348/135532509x463356
Ward, G. K. (2006). Race and the justice workforce. In R. D. Peterson, L. J. Krivo, & J. Hagan (Eds.), The many colors of crime: Inequalities of race, ethnicity, and crime in America (pp. 67–87). New York, NY: New York University Press.
Weisman, R. (2004). Showing remorse: Reflections on the gap between expression and attribution in cases of wrongful conviction. Canadian Journal of Criminology and Criminal Justice, 46 (2), 121–138. doi: 10.3138/cjccj.46.2.121
Weisman, R. (2009). Being and doing: The judicial use of remorse to construct character and community. Social & Legal Studies, 18 (1), 47–69. doi: 10.1177/0964663908100333
Wiener, R. L., & Rinehart, N. (1986). Psychological causality in the attribution of responsibility for rape. Sex Roles, 14 (7-8), 369–382. doi: 10.1007/bf00288422
Willis, J., & Todorov, A. (2006). First impressions: Making up your mind after a 100-ms exposure to a face. Psychological Science, 17 (7), 592–598. doi: 10.1111/j.1467-9280.2006.01750.x
Wiseman, R., Watt, C., ten Brinke, L., Porter, S., Couper, S.-L., & Rankin, C. (2012). The eyes don’t have it: Lie detection and neuro-linguistic programming. PLoS ONE, 7 (7), e40259. doi: 10.1371/journal.pone.0040259
Wood, L. A., & MacMartin, C. (2007). Constructing remorse: Judges’ sentencing decisions in child sexual assault cases. Journal of Language and Social Psychology, 26 (4), 343–362. doi: 10.1177/0261927x07306979
Wright-Whelan, C., Wagstaff, G., & Wheatcroft, J. M. (2015). High stakes lies: Police and non-police accuracy in detecting deception. Psychology, Crime & Law, 21 (2), 127–138. doi: 10.1080/1068316x.2014.935777
Yarmey, A. D. (1993). Stereotypes and recognition memory for faces and voices of good guys and bad guys. Applied Cognitive Psychology, 7 (5), 419–431. doi: 10.1002/acp.2350070505
Zebrowitz, L. A., & Montepare, J. M. (2015). Faces and first impressions. In M. Mikulincer, P. R. Shaver, E. Borgida, & J. A. Bargh (Eds.), APA handbook of personality and social psychology, volume 1: Attitudes and social cognition (pp. 251–276). Washington, DC, US: American Psychological Association. doi: 10.1037/14341-008
Zebrowitz, L. A., Voinescu, L., & Collins, M. A. (1996). “Wide-eyed” and “crooked-faced”: Determinants of perceived and real honesty across the life span. Personality and Social Psychology Bulletin, 22 (12), 1258–1269. doi: 10.1177/01461672962212006

Received: December 8, 2014 Revision Received: May 20, 2015 Accepted: May 27, 2015
