Does Explainable Artificial Intelligence Improve Human Decision-Making?

Yasmeen Alufaisan,1 Laura R. Marusich,2 Jonathan Z. Bakdash,3 Yan Zhou,4 Murat Kantarcioglu4
1 EXPEC Computer Center at Saudi Aramco, Dhahran 31311, Saudi Arabia
2 U.S. Army Combat Capabilities Development Command Army Research Laboratory South at the University of Texas at Arlington
3 U.S. Army Combat Capabilities Development Command Army Research Laboratory South at the University of Texas at Dallas
4 University of Texas at Dallas, Richardson, TX 75080
[email protected], {laura.m.cooper20.civ, jonathan.z.bakdash.civ}@mail.mil, {yan.zhou2, muratk}@utdallas.edu

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Explainable AI provides insights to users into the why for model predictions, offering potential for users to better understand and trust a model, and to recognize and correct AI predictions that are incorrect. Prior research on human and explainable AI interactions has typically focused on measures such as interpretability, trust, and usability of the explanation. There are mixed findings on whether explainable AI can improve actual human decision-making and the ability to identify the problems with the underlying model. Using real datasets, we compare objective human decision accuracy without AI (control), with an AI prediction (no explanation), and with an AI prediction with explanation. We find providing any kind of AI prediction tends to improve user decision accuracy, but no conclusive evidence that explainable AI has a meaningful impact. Moreover, we observed the strongest predictor for human decision accuracy was AI accuracy and that users were somewhat able to detect when the AI was correct vs. incorrect, but this was not significantly affected by including an explanation. Our results indicate that, at least in some situations, the why information provided in explainable AI may not enhance user decision-making, and further research may be needed to understand how to integrate explainable AI into real systems.

Introduction

Explainable AI is touted as the key for users to "understand, appropriately trust, and effectively manage ... [AI systems]" (Gunning 2017), with parallel goals of achieving fairness, accountability, and transparency (Sokol 2019). There are a multitude of reasons for explainable AI, but there is little empirical research on its impact on human decision-making (Miller 2019; Adadi and Berrada 2018). Prior behavioral research on explainable AI has primarily focused on human understanding/interpretability, trust, and usability for different types of explanations (Doshi-Velez and Kim 2017; Hoffman et al. 2018; Ribeiro, Singh, and Guestrin 2016, 2018; Lage et al. 2019).

To fully achieve fairness and accountability, explainable AI should lead to better human decisions. Earlier research demonstrated that explainable AI can be understood by people (Ribeiro, Singh, and Guestrin 2018). Ideally, the combination of humans and machines will perform better than either alone (Adadi and Berrada 2018), such as computer-assisted chess (Cummings 2014), but this combination may not necessarily improve the overall accuracy of AI systems. While (causal) explanation and prediction share commonalities, they are not interchangeable concepts (Adadi and Berrada 2018; Shmueli et al. 2010; Edwards and Veale 2018). Consequently, a "good" explanation, that is, interpretable model predictions, may not be sufficient for improving actual human decisions (Adadi and Berrada 2018; Miller 2019) because of heuristics and biases in human decision-making (Kahneman 2011). Therefore, it is important to demonstrate whether, and what types of, explainable AI can improve the decision-making performance of humans using that AI, relative to performance using the predictions of a "black box" AI with no explanations and relative to humans making decisions with no AI prediction.

In this work, we empirically investigate whether explainable AI improves human decision-making using a two-choice classification experiment with real-world data. Using human subject experiments, we compared three different settings in which a user needs to make a decision: 1) no AI prediction (Control), 2) AI predictions but no explanation, and 3) AI predictions with explanations. Our results indicate that, while providing the AI predictions tends to help users, the why information provided in explainable AI does not specifically enhance user decision-making.

Background and Related Work

Using Doshi-Velez and Kim's (2017) framework for interpretable machine learning, our current work focuses on real humans and simplified tasks. Because our objective is evaluating decision-making, we do not compare different types of explanations and instead used one of the best available explanations: anchor LIME (Ribeiro, Singh, and Guestrin 2018). We use real tasks here, although our tasks involve relatively simple decisions with two possible choices. Additionally, we use lay individuals rather than experts. Below, we discuss prior work that is related to our experimental approach.

Explainable AI/Machine Learning

While machine learning models largely remain opaque and their decisions are difficult to explain, there is an urgent need for machine learning systems that can "explain" their reasoning. For example, European Union regulation requires a "right to explanation" for any algorithms that make decisions significantly impacting users with user-level predictors (Parliament and Council of the European Union 2016). In response to the lack of consensus on the definition and evaluation of interpretability in machine learning, Doshi-Velez and Kim (2017) propose a taxonomy for the evaluation of interpretability focusing on the synergy among human, application, and functionality. They contrast interpretability with reliability and fairness, and discuss scenarios in which interpretability is needed. To unmask the incomprehensible reasoning made by these machine learning/AI models, researchers have developed explainable models that are built on top of the machine learning model to explain its decisions.

The most common forms of explainable models that provide explanations for the decisions made by machine learning models are feature-based and rule-based models. Feature-based models resemble feature selection: the model outputs the top features that explain the machine learning prediction and their associated weights (Datta, Sen, and Zick 2016; Ribeiro, Singh, and Guestrin 2016). Rule-based models provide simple if-then-else rules to explain predictions (Ribeiro, Singh, and Guestrin 2018; Alufaisan et al. 2017). It has been shown that rule-based models provide higher human precision when compared to feature-based models (Ribeiro, Singh, and Guestrin 2018).

Lou et al. (2012) investigate generalized additive models (GAMs), which combine single-feature models through a linear function. GAMs are more accurate than simple linear models and can be easily interpreted by users. Their empirical study suggests that a shallow bagged tree with gradient boosting is the best method on low- to medium-dimensional datasets. Anchor LIME is an example of the current state-of-the-art explainable rule-based model (Ribeiro, Singh, and Guestrin 2018). It is a model-agnostic system that can explain predictions generated by any machine learning model with high precision. The model provides rules, referred to as anchors, to explain the prediction for each instance. A rule is an anchor if it sufficiently explains the prediction locally, such that changes to the rest of the features (features not included in the anchor) do not affect the prediction. Anchors can be found with two different approaches: a bottom-up approach and beam search. Wang et al. (2017) present a machine learning algorithm that produces Bayesian rule sets (BRS) comprised of short rules in disjunctive normal form. They develop two probabilistic models with prior parameters that allow the user to specify a desired size and shape and to balance accuracy against interpretability. They apply two priors, beta-binomial and Poisson, to constrain the rule generation process and provide theoretical bounds for reducing computation by iteratively pruning the search space. In our experiments, we use anchor LIME to provide explanations for all our experimental evaluation due to the high human precision of anchor LIME as reported in Ribeiro, Singh, and Guestrin (2018).

Human Decision-Making and Human Experiments with Explainable AI

A common reason for providing explanation is to improve human predictions or decisions (Keil 2006). People are not necessarily rational (i.e., maximizing an expected utility function). Instead, decisions are often driven by heuristics and biases (Kahneman 2011). Also, providing more information, even if relevant, does not necessarily lead people to make better decisions (Gigerenzer and Brighton 2009). Bounded rationality in human decision-making, using satisficing with constraints (Gigerenzer and Brighton 2009), is an alternative theory to heuristics and biases (Kahneman 2011). Regardless of the theoretical account for human decision-making, people, including experts (Dawes, Faust, and Meehl 1989), generally do not make fully optimal decisions.

At a minimum, explainable AI should not be detrimental to human decision-making. The literature on decision aids (a computational recommendation or prediction, typically without an explicit explanation) has mixed findings for human performance. Sometimes these aids are beneficial for human decision-making, whereas at other times they have negative effects on decisions (Kleinmuntz and Schkade 1993; Skitka, Mosier, and Burdick 1999). These mixed findings may be attributable to the absence of explanations; this can be investigated through human experiments testing AI predictions with explanations compared with AI predictions alone.

Most prior human experiments with explainable AI have concentrated on interpretability, trust, and subjective measures of usability, such as preferences and satisfaction, with work on decision-making performance remaining somewhat limited (Miller 2019; Adadi and Berrada 2018). Earlier results suggest explainable AI can increase interpretability (e.g., Ribeiro, Singh, and Guestrin 2018), trust (e.g., Lakkaraju and Bastani 2020; Ribeiro, Singh, and Guestrin 2016; Selvaraju et al. 2017), and usability (e.g., Ribeiro, Singh, and Guestrin 2018) to varying degrees, but this does not necessarily translate to better performance on real-world decisions about the underlying data, such as whether to actually use the AI's prediction, whether the AI has made an error, and the role of explanations. In fact, recent work has shown that commonly assessed subjective measures (e.g., preference and trust) do not predict actual human performance (Buçinca et al. 2020; Zhang, Liao, and Bellamy 2020); similarly, performance on common proxy tasks, such as predicting the AI's decision, also may not be indicative of actual decision-making performance (Buçinca et al. 2020). These findings highlight the need for more study of the impact of AI explanation on objective human performance, not just proxy or subjective measures.

In the limited studies that do examine the effect of explanation on human decision-making performance, there are mixed findings about whether the explanation provides an additional benefit over AI prediction alone. For example, some researchers found that human performance was better when an AI prediction was accompanied by explanation than with the prediction alone (Buçinca et al. 2020; Lai and Tan 2019). However, other studies did not show any additional benefit of explanation over AI prediction alone (Green and Chen 2019), with some even showing evidence of worse performance with explanation (Poursabzi-Sangdeh et al. 2018; Zhang, Liao, and Bellamy 2020).

The two papers finding a benefit for explanations consisted of a task in which users made decisions about the fat content in pictures of food (Buçinca et al. 2020) and judgments about whether text from hotel reviews was genuine or deceptive (Lai and Tan 2019). They also both used a simple binary choice as the decision-making task. In contrast, the work finding no improvement in decision accuracy with explainable AI used datasets that comprised variables and outcomes, including probabilistic assessments of risk for recidivism and loan outcomes (Green and Chen 2019), decisions about real estate valuations (Poursabzi-Sangdeh et al. 2018), and predictions about income (Zhang, Liao, and Bellamy 2020). In addition, instead of simple binary choices, these studies used prediction of values along a continuum (Green and Chen 2019; Poursabzi-Sangdeh et al. 2018) and binary choice with the option to switch after seeing the model prediction (Zhang, Liao, and Bellamy 2020).

Besides dataset and task differences, there are two other distinctions among these papers. Only a single paper assessed decision-making under time pressure (Zhang, Liao, and Bellamy 2020) and only two papers informed users whether their decisions were correct or incorrect (Green and Chen 2019; Zhang, Liao, and Bellamy 2020). Our study design uses datasets of multiple variables and outcomes and provides correct/incorrect feedback, but also uses a very straightforward binary choice task. This combination could potentially resolve the disparity in results from the studies above.

Methods

In this section, we first describe the two datasets used in our experiments. We then provide the details of our experimental design and hypotheses, participant recruitment, and general demographics of our sample.

Dataset

To conduct our experiments, we chose two different datasets that have been heavily used in prior research on algorithmic fairness and accountability issues. For example, the COMPAS dataset has been used to detect potential biases in the criminal justice system (ProPublica 2016). The Census income dataset, which has been used to test many machine learning techniques, involves predictions of individuals' income status. This has been associated with potential biases in decisions such as access to credit and job opportunities.

We chose these datasets primarily because they both involve real-world contexts that are understandable and engaging for human participants. Further, the two datasets differ widely in number of features and in the overall accuracy classifiers can achieve in their predictions. This allows us to explore the effects of these differences on human performance; in addition, it ensures that our findings are not limited to a single dataset. We briefly discuss each dataset in more detail below.

COMPAS stands for Correctional Offender Management Profiling for Alternative Sanctions (ProPublica 2016). It is a scoring system used to assign risk scores to criminal defendants to determine their likelihood of becoming a recidivist. The data has 6,479 instances and 7 features: gender, age, race, priors count, charge degree, risk score, and whether the defendant re-offended within two years. We let the binary re-offending feature be our class.

Census income (CI) data contains information used to predict individuals' income (Dua and Graff 2017). It has 32,561 instances and 14 features. These features include: age, workclass, education, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, and country. The class value is low income (less than or equal to 50K) or high income (greater than 50K). We preprocessed the dataset to allow equal class distribution.1

1 The CI dataset is from 1994. We adjusted for inflation by using a present value of 88k. From 1994 to January 2020 (when the experiment was run), inflation in the U.S. was 76.45%: https://www. .com/input/?i=inflation+from+1994+to+jan+2020
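As a rough illustration of the preprocessing described above, the sketch below equalizes the class distribution of the Census income data. It is a minimal sketch under stated assumptions, not the authors' code: the file names, column names, and the balance_classes helper are hypothetical, and the 88K figure is only the present-value equivalent of the 50K threshold noted in the footnote.

```python
# Minimal sketch (assumptions: local CSV copies named compas.csv / adult.csv
# whose column names follow the ProPublica and UCI releases; not the authors' code).
import pandas as pd

def balance_classes(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Downsample the majority class so both classes are equally frequent."""
    counts = df[label].value_counts()
    n = counts.min()
    parts = [df[df[label] == cls].sample(n=n, random_state=seed) for cls in counts.index]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# COMPAS: the binary class is whether the defendant re-offended within two years.
compas = pd.read_csv("compas.csv")

# Census income: binary class is income above/below 50K in 1994 dollars
# (roughly 88K in January 2020 dollars, per the paper's footnote).
census = pd.read_csv("adult.csv")
census["high_income"] = (census["income"].str.strip() == ">50K").astype(int)
census_balanced = balance_classes(census, "high_income")  # equal class distribution
```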

Experimental Design

Prior results demonstrate that people interpret, trust, and prefer explainable AI, suggesting it will improve the accuracy of human decisions. Hence, our primary hypothesis is that explainable AI will aid human decision-making. The hypotheses (H.) are as follows:

H. 1 Explainable AI enhances the decision-making process compared to only an AI prediction (without explanation) and a control condition with no AI.
H. 1.a A participant performs above chance in prediction tasks.
H. 2 A participant's decision accuracy is positively associated with AI accuracy.
H. 3 Average participant decision accuracy does not outperform AI accuracy.
H. 4 Participants outperform AI accuracy more often with explainable AI than with AI prediction alone.
H. 5 Participants follow explainable AI recommendations more often than AI-only recommendations.
H. 6 Explainable AI increases participants' decision confidence.
H. 7 A participant's decision confidence is positively correlated with the accuracy of his/her decision.

To investigate these hypotheses, we used a 2 (Dataset: Census and COMPAS) x 3 (AI condition: Control, AI, and AI with Explanation) between-participants experimental design. The three AI conditions were:

• Control: Participants were provided with no prediction or information from the AI.
• AI: Participants were provided with only an AI prediction.
• AI with Explanation: Participants received an AI prediction, as well as an explanation of the prediction using anchor LIME (Ribeiro, Singh, and Guestrin 2018).

To achieve more than 80% statistical power to detect a medium effect size for this design, we planned for a sample size of N = 300 (50 per condition).

In all conditions, each trial consists of a description of an individual and a two-alternative forced choice for the classification of that individual. Each choice was correct on 50% of the trials; thus, chance performance for human decision-making accuracy was 50%. Additionally, an AI prediction and/or explanation may appear, depending on the AI condition (see Figure 1). After a decision is made, participants are asked to enter their confidence in that choice, on a Likert scale of 1 (No Confidence) to 5 (Full Confidence). Feedback is then displayed, indicating whether or not the previous choice was correct.

Figure 1: Example from the study demonstrating the information appearing in the three AI conditions for a trial from the COMPAS dataset condition.

We compared the prediction accuracy of Logistic Regression, a Multi-layer Perceptron Neural Network with two layers of 50 units each, Random Forest, and a Support Vector Machine (SVM) with rbf kernel, and selected the best classifier for each dataset. We chose the Multi-layer Perceptron Neural Network for the Census income data, where it resulted in an overall accuracy of 82%, and the SVM with rbf kernel for the COMPAS data, with an overall accuracy of 68%. Census income accuracy closely matches the accuracy reported in the literature (Lichman 2013; Kag 2018; Alufaisan, Kantarcioglu, and Zhou 2016) and COMPAS accuracy matches the results published by ProPublica (ProPublica 2016). We split the data into 60% for training and 40% for testing to allow enough instances for the explanations generated using anchor LIME (Ribeiro, Singh, and Guestrin 2018).

In our behavioral experiment, 50 instances were randomly sampled without replacement for each participant. Thus, AI accuracy was experimentally manipulated for participants (Census: mean AI accuracy = 83.85%, sd = 3.67%; COMPAS: mean AI accuracy = 69.18%, sd = 4.65%). Because of the sample size and large number of repeated trials per participant, there was no meaningful difference in mean AI accuracy for participants in the AI condition vs. those in the AI Explanation condition (p = 0.90).

Participant Recruitment and Procedure

We developed the experiment using jsPsych (De Leeuw 2015) and hosted it on the Volunteer Science platform (Radford et al. 2016).2 Participants were recruited using Amazon Mechanical Turk (AMT) and were compensated $4.00 each. We collected data from 50 participants in each of the six experimental conditions, for a total of 300 participants (57.67% male). Most participants were 18 to 44 years old (80.67%). This research was approved as exempt (19-176) by the Army Research Laboratory's Institutional Review Board.

Participants read and agreed to a consent form, then received instructions on the task, specific to the experimental condition they were assigned to. They completed 10 practice trials, followed by 50 test trials and a brief questionnaire assessing general demographic information and comments on strategies used during the task. The median time to complete the practice and test trials was 18 minutes.

2 https://volunteerscience.com/
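A minimal sketch of the classifier comparison described above is shown below, assuming preprocessed numeric features X and binary labels y for one dataset. It is not the authors' code: any setting the paper does not state (e.g., random_state, max_iter) is a placeholder, and the anchor-explanation step is only indicated in a comment rather than implemented.

```python
# Sketch of the classifier selection (assumptions: X, y are the preprocessed
# feature matrix and binary labels; hyperparameters not stated in the paper are placeholders).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# 60/40 train/test split, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, test_size=0.4, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "mlp_2x50": MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=1000),
    "random_forest": RandomForestClassifier(),
    "svm_rbf": SVC(kernel="rbf"),
}

scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)

# The selected model's test-set predictions were shown to participants; anchor
# explanations for each shown instance were generated with the anchor LIME
# implementation of Ribeiro, Singh, and Guestrin (2018), which is not shown here.
```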
Results and Discussion

In this section we analyze and describe the effects of dataset, AI condition, and AI accuracy on the participants' decision-making accuracy, ability to outperform the AI, adherence to AI recommendations, confidence ratings, and reaction time.

Participant Decision-Making Accuracy

We compared participants' mean accuracy in the experiment across conditions using a 2 (Dataset) x 3 (AI condition) Analysis of Variance (ANOVA) (see Figure 2). We found significant main effects, with a small effect size for AI condition (F(2, 294) = 8.19, p < 0.001, η² = 0.04) and a nearly large effect for dataset condition (F(1, 294) = 46.51, p < 0.001, η² = 0.12). In addition, there was a significant interaction with a small effect size (F(2, 294) = 8.38, p < 0.001, η² = 0.05), indicating that the effect of AI condition depended on the dataset. Specifically, the large effect for increased accuracy with AI was driven by the Census dataset.

Contrary to H. 1, explainable AI did not substantially improve decision-making accuracy over AI alone. We followed up on significant ANOVA effects by performing pairwise comparisons using Tukey's Honestly Significant Difference. These post-hoc tests indicated that participants who viewed the Census dataset showed improved accuracy over control when given an AI prediction (p < 0.01) and higher accuracy with AI Explanation versus control (p < 0.001), but there was no statistically significant difference in participant accuracy for AI compared to AI Explanation (p = 0.28). In contrast, the COMPAS dataset had no significant differences in participant accuracy across pairwise comparisons for the three AI conditions (ps > 0.75). Also, the mean participant accuracy for the COMPAS control condition (mean = 63.7%, sd = 9.24%) was comparable to participant accuracy in prior decision-making research using the same dataset (mean = 62.8%, sd = 4.8%) (Dressel and Farid 2018).

There was strong evidence supporting H. 1.a: the vast majority of participants had mean accuracy exceeding guessing (50% accuracy). The overall participant accuracy across all conditions was 65.65% (sd = 10.92%), with 90% (270 out of 300) of participants performing above chance on the classification task. This indicates that the task was challenging but feasible for almost all participants.

Figure 2: Mean participant accuracy in each AI and dataset condition. Error bars represent 95% confidence intervals.

AI Accuracy and Participant Decision-Making Accuracy

We also evaluated the effect of the randomly varied AI accuracy for each participant on their decision-making accuracy. We used linear regression to analyze this relationship, specifying participant accuracy as the dependent variable and the following as independent variables: mean AI accuracy (per participant), AI condition, and dataset condition (see Figure 3). Regressions are represented by solid lines, with the shaded areas representing 95% confidence intervals. The control condition is not included in the analysis or figure, because the accuracy of the AI is not relevant if no AI prediction is presented to the participant. The overall regression model was significant with a large effect size, F(4, 195) = 21.23, p < 0.001, adjusted R² = 0.29. Consistent with H. 2, there was a large main effect for AI accuracy (β = 0.70, p < 0.001, R² = 0.28). Also, there was a small AI accuracy by dataset interaction (β = −0.07, p < 0.01, R² = 0.03), reflecting the same interaction depicted in Figure 2. There were no significant regression differences for dataset or for AI versus AI Explanation, ps > 0.60; there was no significant effect of dataset because dataset largely drove AI accuracy. Note it is not just that participants perform better with the higher mean AI accuracy of the Census dataset: both datasets had large positive relationships between participant accuracy and the corresponding mean AI accuracy, as shown in Figure 3.
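The two analyses above can be sketched with statsmodels as follows. The table name and column names (participant, dataset, ai_condition, accuracy, ai_accuracy) are hypothetical stand-ins for a per-participant summary; this is an illustration of the reported models, not the authors' analysis code.

```python
# Sketch of the 2 x 3 ANOVA and the AI-accuracy regression (hypothetical data layout:
# one row per participant with mean accuracy and manipulated mean AI accuracy).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("participant_summary.csv")

# 2 (Dataset) x 3 (AI condition) between-participants ANOVA on mean accuracy.
anova_model = ols("accuracy ~ C(dataset) * C(ai_condition)", data=df).fit()
print(sm.stats.anova_lm(anova_model, typ=2))

# Regression of participant accuracy on mean AI accuracy, dataset, and AI condition;
# the Control condition is excluded because no AI prediction was shown there.
with_ai = df[df["ai_condition"] != "Control"]
reg_model = ols("accuracy ~ ai_accuracy * C(dataset) + C(ai_condition)", data=with_ai).fit()
print(reg_model.summary())
```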

Figure 3: Mean AI accuracy (per participant) and mean participant accuracy by AI and AI Explanation and the two datasets. The shaded areas represent 95% confidence intervals.
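The Tukey HSD post-hoc comparisons reported in the accuracy analysis above could be run as in the brief sketch below; the data file and column names are the same hypothetical per-participant summary assumed earlier, not the authors' code.

```python
# Sketch of the Tukey HSD follow-up within one dataset (hypothetical columns:
# accuracy, dataset, ai_condition in a per-participant summary table).
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("participant_summary.csv")
census = df[df["dataset"] == "Census"]

tukey = pairwise_tukeyhsd(endog=census["accuracy"],
                          groups=census["ai_condition"], alpha=0.05)
print(tukey.summary())  # pairwise Control / AI / AI Explanation comparisons
```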

Outperforming the AI Accuracy

An interesting question is whether the combination of AI and human decision-making can outperform either alone. The previous analyses showed that the addition of AI prediction information improved human performance over controls with humans alone. We also evaluated how often human decision-making accuracy outperformed the corresponding mean AI prediction accuracy, which was experimentally manipulated. Although most participants performed well above chance, only a relatively small number of participants had decision accuracy exceeding their mean AI prediction (7%, or 14 out of 200). This result largely supports H. 3 and is also shown in Figure 3, where each dot represents an individual; dots above the black dashed line show the participants that outperformed their mean AI prediction. The black dashed line shows equivalent performance for mean AI accuracy and mean participant accuracy.

Dataset/Condition    AI    AI Explanation
Census                0     1
COMPAS               10     3

Table 1: Number of participants with decision accuracy exceeding their mean AI accuracy.

Adherence to AI Model Predictions

Participants followed the AI predictions more often when the AI was correct than when the AI was incorrect, indicating some recognition of when the AI makes bad predictions (see Figure 4, F(1, 196) = 36.15, p < 0.001, ηp² = 0.16). This was consistent with participant sensitivity to AI recommendations, evidence for H. 2. Also, participants were better able to recognize correct vs. incorrect AI predictions when they were in the Census condition, demonstrated in the significant interaction between AI correctness and dataset, F(1, 196) = 9.01, p < 0.01, ηp² = 0.04. None of the remaining ANOVA results were significant, ps > 0.16. Thus, there was no evidence for higher adherence to recommendations with explainable AI, which rejected H. 5.

Figure 4: Mean proportion of participant choices matching the AI prediction as a function of whether the AI was correct/incorrect and the dataset condition. Error bars represent 95% confidence intervals. To simplify this figure, results were collapsed for the AI and AI Explanation conditions, which did not have a significant main effect, p = 0.62.

Confidence Ratings

We found that AI (without and with explanation) resulted in slightly increased mean confidence. There was a small effect of AI condition on mean confidence (see Figure 5, F(2, 294) = 3.58, p = 0.03, η² = 0.02). Post hoc tests indicated participants had significantly lower mean confidence in the control condition than in AI, p < 0.03, but there were no statistical differences for the other pairwise comparisons, ps > 0.25. This contradicted H. 6, and there was no evidence of a confidence increase with explanations. In addition, there was no evidence for a main effect of dataset condition or an interaction, ps > 0.84.

Confirming H. 7, we found a positive relationship between accuracy and confidence rating within individuals, indicating that participants' confidence ratings were fairly well calibrated with their actual decision accuracy. We calculated each participant's mean accuracy at each confidence rating they used, and then conducted a repeated measures correlation (Bakdash and Marusich 2017) (r_rm = 0.48, p < 0.001).

Figure 5: Mean participant confidence ratings in each AI condition. Error bars represent 95% confidence intervals.
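The repeated measures correlation reported above can be computed with the pingouin implementation of the Bakdash and Marusich (2017) method. The sketch below assumes a hypothetical long-format table with one row per participant per confidence level used; it is an illustration, not the authors' code.

```python
# Sketch of the confidence-accuracy calibration analysis (assumption: `calib` has
# hypothetical columns participant, confidence, accuracy, where accuracy is each
# participant's mean accuracy at each confidence rating they used).
import pandas as pd
import pingouin as pg

calib = pd.read_csv("confidence_accuracy.csv")
rmcorr = pg.rm_corr(data=calib, x="confidence", y="accuracy", subject="participant")
print(rmcorr)  # reports r, degrees of freedom, p-value, and confidence interval
```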

Additional Results

We also assessed reaction time and summarize self-reported decision-making strategies. These results are exploratory; there were no specific hypotheses. There was no significant main effect of AI condition on participants' reaction time (F(2, 294) = 2.13, p = 0.12, η² = 0.01). There was only a main effect of dataset condition (F(1, 294) = 28.52, p < 0.001, η² = 0.09), where participants took an average of 1600 ms longer in the Census condition than the COMPAS condition (see Figure 6). This effect was most likely due to the Census dataset having more variables for each instance than the COMPAS dataset, and thus requiring more reading time on each trial. The addition of an explanation did not meaningfully increase reaction time over an AI prediction only.

Figure 6: Mean reaction time in each AI and dataset condition. Error bars represent 95% confidence intervals.

Subjective measures, such as self-reported strategies and measures of usability, often diverge from objective measures of human performance (Andre and Wickens 1995; Nisbett and Wilson 1977), such as actual decisions (Buçinca et al. 2020). Participants self-reported varying strategies to make their decisions, yet there was a clear benefit for AI prediction (without and with explanation).
In the AI and AI Explanation conditions, n = 80 indicated using the data without mentioning the AI, n = 39 reported using a combination of the data and the AI, and only n = 16 said they primarily used, trusted, or followed the AI. Despite limited self-reported use of the AI in the two relevant conditions, decision accuracy was higher with AI (Figure 2), strongly associated with AI accuracy (Figure 3), and there was some sensitivity to whether the AI was followed when it was correct versus incorrect (Figure 4). Nearly 80% of user comments could be coded; blank and nonsense responses could not be coded.

Discussion

Our results show that providing an AI prediction enhances human decision accuracy but, in opposition to the hypotheses, adding an explanation did not positively impact decisions or increase the ability to outperform the AI. This finding is in line with some previous studies that also found no added benefit for explanation over AI prediction alone (Green and Chen 2019; Poursabzi-Sangdeh et al. 2018; Zhang, Liao, and Bellamy 2020). This suggests that it was not the simplicity of the decision-making task that accounts for the opposite findings in Buçinca et al. (2020) and Lai and Tan (2019), since the current study also uses a relatively straightforward binary decision. Rather, it may be the case that tasks with highly intuitive datasets are required for explanation to improve performance. Future work may address this question directly.

One possible explanation for findings of no added benefit of explanation is that providing more information, even if task-relevant, does not necessarily improve human decision-making accuracy (Gigerenzer and Brighton 2009; Goldstein and Gigerenzer 2002; Nadav-Greenberg and Joslyn 2009). This phenomenon is attributed to cognitive limitations and people using near-optimal strategies, and corresponds with Poursabzi-Sangdeh et al.'s (2018) finding that explanation caused information overload, reducing people's ability to detect AI mistakes. However, their study and most other papers did not use the speeded response paradigm we used here, suggesting this was not solely attributable to participants responding as quickly as possible.

The lack of a significant, practically relevant effect for explainable AI was not due to lack of statistical power or ceiling performance: nearly all participants consistently performed above chance, but well below perfect accuracy. These findings also illustrate the need to compare decision-making with explainable AI to other conditions, including no AI and AI prediction without explanation. If we did not have an AI-only (decision aid) condition, a reasonable but flawed inference would have been that explainable AI enhances decisions.

Limitations The present findings have limited generalizability to decision-making with other datasets and explainable AI techniques. For example, the effectiveness of explanations, or lack thereof, for human decision-making may depend on a variety of factors: the specific explanation technique, the properties of the dataset, and the task itself (such as probabilistic predictions vs. two-choice decisions). Nevertheless, this paper demonstrates that one cannot assume explainable AI will necessarily improve human decisions, and it shows the need to evaluate objective measures of human performance in explainable AI.

Conclusions and Future Work

Much of the existing research on explainable AI focuses on the usability, trust, and interpretability of the explanation. In this paper, we fill in this research gap by investigating whether explainable AI can improve human decision-making. We designed a behavioral experiment in which each participant, recruited using Amazon Mechanical Turk, is asked to complete 50 test trials in one of six experimental conditions. Our experiment is conducted on two real datasets to compare human decisions with an AI prediction and with an AI prediction with explanation. Our experimental results demonstrate that AI predictions alone can generally improve human decision accuracy, while the advantage of explainable AI is not conclusive. We also show that users tend to follow AI predictions more often when the AI predictions are accurate. In addition, AI with or without explanation can increase the confidence of human users, which, on average, was well calibrated to user decision accuracy.

In the future, we plan to investigate whether explainable AI can help improve fairness, safety, and ethics by increasing the transparency of AI models. Human decision-making is a key outcome measure, but it is certainly not the only goal for explainable AI. We also plan to explore the difference in the distributions of errors between human decisions and AI predictions, especially at decision boundaries, and whether human-machine collaboration is feasible through interactions in closed feedback loops. We will also expand our datasets to include other data formats such as images.

Ethical Impact

For many important critical decisions, depending on the AI model prediction may not be enough. Furthermore, many recent regulations such as GDPR (Parliament and Council of the European Union 2016) allow potential audit of AI predictions by a human. Therefore, it is critical to understand whether the explanations provided by explainable AI methods improve overall prediction accuracy and help human decision makers detect errors. Our results indicate that although the existence of an AI model may improve human decision-making, the explanations provided may not automatically improve accuracy. We believe that our results could help ignite the needed research to explore how to better integrate explanations, AI models, and human operators to achieve better outcomes compared to AI models or humans alone.

Another consideration is the data itself: models created using systematically biased data will simply mirror and reinforce patterns inherent to the data. For example, in the COMPAS dataset a high risk score for re-offending has far lower accuracy for black defendants than white ones (ProPublica 2016). Furthermore, multiple AI systems for predicting criminal activity have relied on "dirty" data with racial bias due to flawed policies and procedures (Richardson, Schultz, and Crawford 2019). One solution may be to combine approaches for fair machine learning (Corbett-Davies and Goel 2018) with explainable AI, while considering dataset properties such as provenance.

Acknowledgments

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation. M.K. and Y.Z. were supported by ARL under grant W911NF-17-1-0356. We thank Jason Radford for help with implementing the experiment on the Volunteer Science platform and Katelyn Morris for independently coding the comments. We also thank Dan Cassenti for input on the paper, and Jessica Schultheis and Alan Breacher for editing the paper.

References

2018. Income prediction on UCI adult dataset. https://www.kaggle.com/overload10/income-prediction-on-uci-adult-dataset.

Adadi, A.; and Berrada, M. 2018. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 6: 52138–52160.

Alufaisan, Y.; Kantarcioglu, M.; and Zhou, Y. 2016. Detecting Discrimination in a Black-Box Classifier. In Proceedings of the 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC). IEEE.

Alufaisan, Y.; Zhou, Y.; Kantarcioglu, M.; and Thuraisingham, B. 2017. From Myths to Norms: Demystifying Data Mining Models with Instance-Based Transparency. In Proceedings of the 2017 IEEE 3rd International Conference on Collaboration and Internet Computing (CIC). IEEE.

Andre, A. D.; and Wickens, C. D. 1995. When users want what's not best for them. Ergonomics in Design 3(4): 10–14.

Bakdash, J. Z.; and Marusich, L. R. 2017. Repeated measures correlation. Frontiers in Psychology 8: 1–13. doi:10.3389/fpsyg.2017.00456.

Buçinca, Z.; Lin, P.; Gajos, K. Z.; and Glassman, E. L. 2020. Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI '20, 454–464. doi:10.1145/3377325.3377498.

Corbett-Davies, S.; and Goel, S. 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.

Cummings, M. 2014. Man versus machine or man + machine? IEEE Intelligent Systems 29(5): 62–69.

Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), 598–617.

Dawes, R. M.; Faust, D.; and Meehl, P. E. 1989. Clinical versus actuarial judgment. Science 243(4899): 1668–1674.

De Leeuw, J. R. 2015. jsPsych: A JavaScript library for creating behavioral experiments in a Web browser. Behavior Research Methods 47(1): 1–12.

Doshi-Velez, F.; and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Dressel, J.; and Farid, H. 2018. The accuracy, fairness, and limits of predicting recidivism. Science Advances 4(1): eaao5580.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml.

Edwards, L.; and Veale, M. 2018. Enslaving the algorithm: From a "Right to an Explanation" to a "Right to Better Decisions"? IEEE Security & Privacy 16(3): 46–54.

Gigerenzer, G.; and Brighton, H. 2009. Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science 1(1): 107–143.

Goldstein, D. G.; and Gigerenzer, G. 2002. Models of ecological rationality: the recognition heuristic. Psychological Review 109(1): 75.

Green, B.; and Chen, Y. 2019. The Principles and Limits of Algorithm-in-the-Loop Decision Making. Proc. ACM Hum.-Comput. Interact. 3(CSCW). doi:10.1145/3359152. URL https://doi.org/10.1145/3359152.

Gunning, D. 2017. Explainable artificial intelligence (XAI). URL https://www.darpa.mil/attachments/XAIProgramUpdate.pdf.

Hoffman, R. R.; Mueller, S. T.; Klein, G.; and Litman, J. 2018. Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608.

Kahneman, D. 2011. Thinking, fast and slow. Macmillan.

Keil, F. C. 2006. Explanation and understanding. Annu. Rev. Psychol. 57: 227–254.

Kleinmuntz, D. N.; and Schkade, D. A. 1993. Information displays and decision processes. Psychological Science 4(4): 221–227.

Lage, I.; Chen, E.; He, J.; Narayanan, M.; Kim, B.; Gershman, S.; and Doshi-Velez, F. 2019. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006.

Lai, V.; and Tan, C. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 29–38.

Lakkaraju, H.; and Bastani, O. 2020. "How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 79–85. ACM. doi:10.1145/3375627.3375833.

Lichman, M. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

Lou, Y.; Caruana, R.; and Gehrke, J. 2012. Intelligible Models for Classification and Regression. In KDD '12, Beijing, China. ACM.

Miller, T. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267: 1–38.

Nadav-Greenberg, L.; and Joslyn, S. L. 2009. Uncertainty forecasts improve decision making among nonexperts. Journal of Cognitive Engineering and Decision Making 3(3): 209–227.

Nisbett, R. E.; and Wilson, T. D. 1977. Telling more than we can know: verbal reports on mental processes. Psychological Review 84(3): 231.

Parliament and Council of the European Union. 2016. General data protection regulation. https://eur-lex.europa.eu/eli/reg/2016/679/oj.

Poursabzi-Sangdeh, F.; Goldstein, D. G.; Hofman, J. M.; Vaughan, J. W.; and Wallach, H. 2018. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810.

ProPublica. 2016. Machine Bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Radford, J.; Pilny, A.; Reichelmann, A.; Keegan, B.; Welles, B. F.; Hoye, J.; Ognyanova, K.; Meleis, W.; and Lazer, D. 2016. Volunteer science: An online laboratory for experiments in social psychology. Social Psychology Quarterly 79(4): 376–396.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Richardson, R.; Schultz, J. M.; and Crawford, K. 2019. Dirty data, bad predictions: How civil rights violations impact police data, predictive policing systems, and justice. NYUL Rev. Online 94: 15.

Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 618–626.

Shmueli, G.; et al. 2010. To explain or to predict? Statistical Science 25(3): 289–310.

Skitka, L. J.; Mosier, K. L.; and Burdick, M. 1999. Does automation bias decision-making? International Journal of Human-Computer Studies 51(5): 991–1006.

Sokol, K. 2019. Fairness, Accountability and Transparency in Artificial Intelligence: A Case Study of Logical Predictive Models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 541–542.

Wang, T.; Rudin, C.; Doshi-Velez, F.; Liu, Y.; Klampfl, E.; and MacNeille, P. 2017. A Bayesian Framework for Learning Rule Sets for Interpretable Classification. Journal of Machine Learning Research 18(70): 1–37. URL http://jmlr.org/papers/v18/16-003.html.

Zhang, Y.; Liao, Q. V.; and Bellamy, R. K. E. 2020. Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, 295–305. doi:10.1145/3351095.3372852.

Supplementary Material for "Does Explainable Artificial Intelligence Improve Human Decision-Making?"

Age Distribution

See Figure 1 for the age distribution of participants. Most participants were in the 25-34 age group.

Figure 1: Age distribution of participants.

Self-Reported Decision Strategies

We performed an exploratory analysis of the self-reported decision strategies, qualitatively described by users in their comments. We coded strategies using six categories; see the first column in the tables below. First, comments were independently coded by two raters: the third author and a research assistant. Inter-rater reliability was good, κ = 0.70, p < 0.001. Second, we reached consensus to resolve all discrepancies and finalize the categorical coding of decision strategies.

The majority of participants commented that they primarily relied on the data to make decisions, especially in the control condition: n = 74. The dominant reported strategy in the two AI conditions was also primarily using the data to make decisions, without any mention of AI: n = 80. Primary use of AI (n = 16) was surprisingly low. Partial use of AI, including the possibility of explanations in the explainable AI condition, was higher (n = 39) but still limited compared to the more widely used strategy of making decisions using the data. The number of participants using each strategy was generally comparable between the two datasets (second table below), with minor differences: more frequent use of AI and data together for COMPAS, and less use of primarily data for COMPAS. This was despite higher mean AI accuracy for the Census income dataset over COMPAS.

Strategy             Control   AI   AI Explanation
Primarily AI         –         7    9
AI and data          –         14   25
Data and gut         5         2    1
Primarily gut        2         6    6
Primarily data       74        46   34
Could not be coded   19        25   25

Strategy             Census   COMPAS
Primarily AI         9        7
AI and data          16       23
Data and gut         4        4
Primarily gut        3        11
Primarily data       83       71
Could not be coded   35       34

We compared the accuracy of participants who self-reported different strategies, grouping them into those who mentioned using the AI (in the AI and AI Explanation conditions; Primarily AI and AI and data), those who didn't mention using the AI (Data and gut, Primarily gut, and Primarily data), and those who provided blank or nonsense strategies (Could not be coded, labelled as NA). As shown in Figure 2, mean participant accuracy was very similar for those who mentioned using the AI versus those who didn't. In the future, it may be useful to compare the types of decision errors for AI versus AI with explanation, as well as errors based on self-reported strategies. Different types of errors could produce similar decision accuracy if the overall error rate is similar. However, participants whose self-reported strategies were blank or did not make sense did show a lower mean accuracy on the task, perhaps indicating a lower level of effort or engagement with the task overall, with accuracy nearing or even at chance performance (50%). The number of users with no self-reported strategy was comparable across all six conditions.
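For reference, the inter-rater reliability statistic reported above is Cohen's kappa, which could be computed as in the sketch below. The example lists of codes are hypothetical stand-ins for the two raters' strategy labels, not the study's actual coding sheet (and note that scikit-learn returns kappa only, not the accompanying p-value).

```python
# Minimal sketch of the inter-rater reliability check (assumption: two aligned
# lists of strategy codes, one per rater, one entry per participant comment).
from sklearn.metrics import cohen_kappa_score

rater_a = ["Primarily data", "AI and data", "Primarily gut", "Primarily data"]
rater_b = ["Primarily data", "Primarily AI", "Primarily gut", "Primarily data"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # the paper reports kappa = 0.70 over all comments
```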


Figure 2: Participant accuracy by self-reported strategy and AI condition. Error bars are 95% confidence intervals.

Speed-Accuracy Tradeoff

We used linear regression to explore the tradeoff between speed and accuracy across participants; that is, whether slower participants tended to be more accurate and vice versa.

Figure 3: Speed-accuracy tradeoff generated by participants in each dataset condition. Shaded areas indicate 95% confidence intervals.

We found overall a small but significant speed-accuracy tradeoff (β = 1.01, p < 0.001, η² = 0.08), where each additional second of reaction time predicted a 1.01% increase in accuracy. As shown in Figure 3, there was a small interaction between the dataset condition and reaction time (F(1, 288) = 6.65, p < 0.02, η² = 0.02), indicating that the speed-accuracy tradeoff was present for participants in the Census dataset condition (β = 1.02), but it was near zero for those in the COMPAS dataset condition (β = −0.02). We did not find a significant effect of AI condition on the speed-accuracy tradeoff.
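A regression of this form could be sketched as follows. The data file and column names are hypothetical, and reaction time is converted to seconds so the slope is directly interpretable as percentage points of accuracy per additional second; this is an illustration, not the authors' analysis code.

```python
# Sketch of the speed-accuracy regression (hypothetical per-participant table with
# mean reaction time in ms, mean accuracy in %, and dataset condition).
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("participant_summary.csv")
df["rt_seconds"] = df["mean_rt_ms"] / 1000.0

# Accuracy regressed on reaction time, dataset, and their interaction.
model = ols("accuracy ~ rt_seconds * C(dataset)", data=df).fit()
print(model.summary())
```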