Does Explainable Artificial Intelligence Improve Human Decision-Making?

Yasmeen Alufaisan,1 Laura R. Marusich,2 Jonathan Z. Bakdash,3 Yan Zhou,4 Murat Kantarcioglu4
1 EXPEC Computer Center at Saudi Aramco, Dhahran 31311, Saudi Arabia
2 U.S. Army Combat Capabilities Development Command Army Research Laboratory South at the University of Texas at Arlington
3 U.S. Army Combat Capabilities Development Command Army Research Laboratory South at the University of Texas at Dallas
4 University of Texas at Dallas, Richardson, TX 75080
[email protected], {laura.m.cooper20.civ, jonathan.z.bakdash.civ}@mail.mil, {yan.zhou2, muratk}@utdallas.edu

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Explainable AI provides insights to users into the why for model predictions, offering potential for users to better understand and trust a model, and to recognize and correct AI predictions that are incorrect. Prior research on human and explainable AI interactions has typically focused on measures such as interpretability, trust, and usability of the explanation. There are mixed findings on whether explainable AI can improve actual human decision-making and the ability to identify the problems with the underlying model. Using real datasets, we compare objective human decision accuracy without AI (control), with an AI prediction (no explanation), and with an AI prediction with explanation. We find providing any kind of AI prediction tends to improve user decision accuracy, but no conclusive evidence that explainable AI has a meaningful impact. Moreover, we observed the strongest predictor for human decision accuracy was AI accuracy and that users were somewhat able to detect when the AI was correct vs. incorrect, but this was not significantly affected by including an explanation. Our results indicate that, at least in some situations, the why information provided in explainable AI may not enhance user decision-making, and further research may be needed to understand how to integrate explainable AI into real systems.

Introduction

Explainable AI is touted as the key for users to "understand, appropriately trust, and effectively manage ... [AI systems]" (Gunning 2017), with parallel goals of achieving fairness, accountability, and transparency (Sokol 2019). There are a multitude of reasons for explainable AI, but there is little empirical research on its impact on human decision-making (Miller 2019; Adadi and Berrada 2018). Prior behavioral research on explainable AI has primarily focused on human understanding/interpretability, trust, and usability for different types of explanations (Doshi-Velez and Kim 2017; Hoffman et al. 2018; Ribeiro, Singh, and Guestrin 2016, 2018; Lage et al. 2019).

To fully achieve fairness and accountability, explainable AI should lead to better human decisions. Earlier research demonstrated that explainable AI can be understood by people (Ribeiro, Singh, and Guestrin 2018). Ideally, the combination of humans and machines will perform better than either alone (Adadi and Berrada 2018), such as computer-assisted chess (Cummings 2014), but this combination may not necessarily improve the overall accuracy of AI systems. While (causal) explanation and prediction share commonalities, they are not interchangeable concepts (Adadi and Berrada 2018; Shmueli et al. 2010; Edwards and Veale 2018). Consequently, a "good" explanation, that is, interpretable model predictions, may not be sufficient for improving actual human decisions (Adadi and Berrada 2018; Miller 2019) because of heuristics and biases in human decision-making (Kahneman 2011). Therefore, it is important to demonstrate whether, and what types of, explainable AI can improve the decision-making performance of humans using that AI, relative to performance using the predictions of a "black box" AI with no explanations and relative to humans making decisions with no AI prediction.

In this work, we empirically investigate whether explainable AI improves human decision-making using a two-choice classification experiment with real-world data. Using human subject experiments, we compared three different settings in which a user needs to make a decision: 1) no AI prediction (Control), 2) AI predictions but no explanation, and 3) AI predictions with explanations. Our results indicate that, while providing the AI predictions tends to help users, the why information provided in explainable AI does not specifically enhance user decision-making.

Background and Related Work

Using Doshi-Velez and Kim's (2017) framework for interpretable machine learning, our current work focuses on real humans and simplified tasks. Because our objective is evaluating decision-making, we do not compare different types of explanations and instead used one of the best available explanations: anchor LIME (Ribeiro, Singh, and Guestrin 2018). We use real tasks here, although our tasks involve relatively simple decisions with two possible choices. Additionally, we use lay individuals rather than experts. Below, we discuss prior work that is related to our experimental approach.

Explainable AI/Machine Learning

While machine learning models largely remain opaque and their decisions are difficult to explain, there is an urgent need for machine learning systems that can "explain" their reasoning. For example, European Union regulation requires a "right to explanation" for any algorithms that make decisions significantly impacting users with user-level predictors (Parliament and Council of the European Union 2016). In response to the lack of consensus on the definition and evaluation of interpretability in machine learning, Doshi-Velez and Kim (2017) propose a taxonomy for the evaluation of interpretability focusing on the synergy among human, application, and functionality. They contrast interpretability with reliability and fairness, and discuss scenarios in which interpretability is needed. To unmask the incomprehensible reasoning made by these machine learning/AI models, researchers have developed explainable models that are built on top of the machine learning model to explain its decisions.

The most common forms of explainable models that provide explanations for the decisions made by machine learning models are feature-based and rule-based models. Feature-based models resemble feature selection: the model outputs the top features that explain the machine learning prediction and their associated weights (Datta, Sen, and Zick 2016; Ribeiro, Singh, and Guestrin 2016). Rule-based models provide simple if-then-else rules to explain predictions (Ribeiro, Singh, and Guestrin 2018; Alufaisan et al. 2017). It has been shown that rule-based models provide higher human precision when compared to feature-based models (Ribeiro, Singh, and Guestrin 2018).

Lou et al. (2012) investigate generalized additive models (GAMs), which combine single-feature models through a linear function. GAMs are more accurate than simple linear models and can be easily interpreted by users. Their empirical study suggests that a shallow bagged tree with gradient boosting is the best method on low- to medium-dimensional datasets. Anchor LIME is an example of the current state-of-the-art explainable rule-based model (Ribeiro, Singh, and Guestrin 2018). It is a model-agnostic system that can explain predictions generated by any machine learning model with high precision. The model provides rules, referred to as anchors, to explain the prediction for each instance. A rule is an anchor if it sufficiently explains the prediction locally, such that changes to the rest of the features (features not included in the anchor) do not affect the prediction. Anchors can be found with two different approaches: a bottom-up approach and beam search. Wang et al. (2017) present a machine learning algorithm that produces Bayesian rule sets (BRS) comprised of short rules in disjunctive normal form. They develop two probabilistic models with prior parameters that allow the user to specify a desired size and shape and to balance accuracy against interpretability. They apply two priors, beta-binomial and Poisson, to constrain the rule generation process and provide theoretical bounds for reducing computation by iteratively pruning the search space. In our experiments, we use anchor LIME to provide explanations for all our experimental evaluation due to the high human precision of anchor LIME as reported in Ribeiro, Singh, and Guestrin (2018).

Human Decision-Making and Human Experiments with Explainable AI

A common reason for providing explanation is to improve human predictions or decisions (Keil 2006). People are not necessarily rational (i.e., maximizing an expected utility function). Instead, decisions are often driven by heuristics and biases (Kahneman 2011). Also, providing more information, even if relevant, does not necessarily lead people to make better decisions (Gigerenzer and Brighton 2009). Bounded rationality in human decision-making, using satisficing with constraints (Gigerenzer and Brighton 2009), is an alternative theory to heuristics and biases (Kahneman 2011). Regardless of the theoretical account for human decision-making, people, including experts (Dawes, Faust, and Meehl 1989), generally do not make fully optimal decisions.

At a minimum, explainable AI should not be detrimental to human decision-making. The literature on decision aids (a computational recommendation or prediction, typically without an explicit explanation) has mixed findings for human performance. Sometimes these aids are beneficial for human decision-making, whereas at other times they have negative effects on decisions (Kleinmuntz and Schkade 1993; Skitka, Mosier, and Burdick 1999). These mixed findings may be attributable to the absence of explanations; this can be investigated through human experiments testing AI predictions with explanations compared with AI predictions alone.

Most prior human experiments with explainable AI have concentrated on interpretability, trust, and subjective measures of usability, such as preferences and satisfaction, with work on decision-making performance remaining somewhat limited (Miller 2019; Adadi and Berrada 2018). Earlier results suggest explainable AI can increase interpretability (e.g., Ribeiro, Singh, and Guestrin 2018), trust (e.g., Lakkaraju and Bastani 2020; Ribeiro, Singh, and Guestrin 2016; Selvaraju et al. 2017), and usability (e.g., Ribeiro, Singh, and Guestrin 2018) to varying degrees, but this does not necessarily translate to better performance on real-world decisions about the underlying data, such as whether to actually use the AI's prediction, whether the AI has made an error, and the role of explanations. In fact, recent work has shown that commonly assessed subjective measures (e.g., preference and trust) do not predict actual human performance (Buçinca et al. 2020; Zhang, Liao, and Bellamy 2020); similarly, performance on common proxy tasks, such as predicting the AI's decision, also may not be indicative of actual decision-making performance (Buçinca et al. 2020). These findings highlight the need for more study of the impact of AI explanation on objective human performance, not just proxy or subjective measures.

In the limited studies that do examine the effect of explanation on human decision-making performance, there are mixed findings about whether the explanation provides an additional benefit over AI prediction alone. For example, some researchers found that human performance was better when an AI prediction was accompanied by explanation than with the prediction alone (Buçinca et al. 2020; Lai and Tan 2019). However, other studies did not show any additional benefit of explanation over AI prediction alone (Green and Chen 2019), with some even showing evidence of worse performance with explanation (Poursabzi-Sangdeh et al. 2018; Zhang, Liao, and Bellamy 2020).

The two papers finding a benefit for explanations consisted of a task in which users made decisions about the fat content in pictures of food (Buçinca et al. 2020) and judgments about whether text from hotel reviews was genuine or deceptive (Lai and Tan 2019). They also both used a simple binary choice as the decision-making task. In contrast, the work finding no improvement in decision accuracy with explainable AI used datasets that comprised variables and outcomes, including probabilistic assessments of risk for recidivism and loan outcomes (Green and Chen 2019), decisions about real estate valuations (Poursabzi-Sangdeh et al. 2018), and predictions about income (Zhang, Liao, and Bellamy 2020). In addition, instead of simple binary choices, these studies used prediction of values along a continuum (Green and Chen 2019; Poursabzi-Sangdeh et al. 2018) and binary choice with the option to switch after seeing the model prediction (Zhang, Liao, and Bellamy 2020).

Besides dataset and task differences, there are two other distinctions among these papers. Only a single paper assessed decision-making under time pressure (Zhang, Liao, and Bellamy 2020) and only two papers informed users whether their decisions were correct or incorrect (Green and Chen 2019; Zhang, Liao, and Bellamy 2020). Our study design uses datasets of multiple variables and outcomes and provides correct/incorrect feedback, but also uses a very straightforward binary choice task. This combination could potentially resolve the disparity in results from the studies above.

Methods

In this section, we first describe the two datasets used in our experiments. We then provide the details of our experimental design and hypotheses, participant recruitment, and general demographics of our sample.

Dataset

To conduct our experiments, we chose two different datasets that have been heavily used in prior research on algorithmic fairness and accountability issues. For example, the COMPAS dataset has been used to detect potential biases in the criminal justice system (ProPublica 2016). The Census income dataset, which has been used to test many machine learning techniques, involves predictions of individuals' income status. This has been associated with potential biases in decisions such as access to credit and job opportunities.

We chose these datasets primarily because they both involve real-world contexts that are understandable and engaging for human participants. Further, the two datasets differ widely in number of features and in the overall accuracy classifiers can achieve in their predictions. This allows us to explore the effects of these differences on human performance; in addition, it ensures that our findings are not limited to a single dataset. We briefly discuss each dataset in more detail below.

COMPAS stands for Correctional Offender Management Profiling for Alternative Sanctions (ProPublica 2016). It is a scoring system used to assign risk scores to criminal defendants to determine their likelihood of becoming a recidivist. The data has 6,479 instances and 7 features: gender, age, race, priors count, charge degree, risk score, and whether the defendant re-offended within two years. We let the binary re-offending feature be our class.

Census income (CI) data contains information used to predict individuals' income (Dua and Graff 2017). It has 32,561 instances and 14 features. These features include: age, workclass, education, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, and country. The class value is low income (less than or equal to 50K) or high income (greater than 50K). We preprocessed the dataset to allow equal class distribution.1

1 The CI dataset is from 1994. We adjusted for inflation by using a present value of 88k. From 1994 to January 2020 (when the experiment was run), inflation in the U.S. was 76.45%: https://www. .com/input/?i=inflation+from+1994+to+jan+2020
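As a rough illustration of the preprocessing described above, the sketch below equalizes the class distribution of the Census income data. It is a minimal sketch under stated assumptions, not the authors' code: the file names, column names, and the balance_classes helper are hypothetical, and the 88K figure is only the present-value equivalent of the 50K threshold noted in the footnote.

```python
# Minimal sketch (assumptions: local CSV copies named compas.csv / adult.csv
# whose column names follow the ProPublica and UCI releases; not the authors' code).
import pandas as pd

def balance_classes(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Downsample the majority class so both classes are equally frequent."""
    counts = df[label].value_counts()
    n = counts.min()
    parts = [df[df[label] == cls].sample(n=n, random_state=seed) for cls in counts.index]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# COMPAS: the binary class is whether the defendant re-offended within two years.
compas = pd.read_csv("compas.csv")

# Census income: binary class is income above/below 50K in 1994 dollars
# (roughly 88K in January 2020 dollars, per the paper's footnote).
census = pd.read_csv("adult.csv")
census["high_income"] = (census["income"].str.strip() == ">50K").astype(int)
census_balanced = balance_classes(census, "high_income")  # equal class distribution
```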

Experimental Design

Prior results demonstrate that people interpret, trust, and prefer explainable AI, suggesting it will improve the accuracy of human decisions. Hence, our primary hypothesis is that explainable AI will aid human decision-making. The hypotheses (H.) are as follows:

H. 1 Explainable AI enhances the decision-making process compared to only an AI prediction (without explanation) and a control condition with no AI.
H. 1.a A participant performs above chance in prediction tasks.
H. 2 A participant's decision accuracy is positively associated with AI accuracy.
H. 3 Average participant decision accuracy does not outperform AI accuracy.
H. 4 Participants outperform AI accuracy more often with explainable AI than with AI prediction alone.
H. 5 Participants follow explainable AI recommendations more often than AI-only recommendations.
H. 6 Explainable AI increases participants' decision confidence.
H. 7 A participant's decision confidence is positively correlated with the accuracy of his/her decision.

To investigate these hypotheses, we used a 2 (Dataset: Census and COMPAS) x 3 (AI condition: Control, AI, and AI with Explanation) between-participants experimental design. The three AI conditions were:

• Control: Participants were provided with no prediction or information from the AI.
• AI: Participants were provided with only an AI prediction.
• AI with Explanation: Participants received an AI prediction, as well as an explanation of the prediction using anchor LIME (Ribeiro, Singh, and Guestrin 2018).

To achieve more than 80% statistical power to detect a medium effect size for this design, we planned for a sample size of N = 300 (50 per condition).

In all conditions, each trial consists of a description of an individual and a two-alternative forced choice for the classification of that individual. Each choice was correct on 50% of the trials; thus, chance performance for human decision-making accuracy was 50%. Additionally, an AI prediction and/or explanation may appear, depending on the AI condition (see Figure 1). After a decision is made, participants are asked to enter their confidence in that choice, on a Likert scale of 1 (No Confidence) to 5 (Full Confidence). Feedback is then displayed, indicating whether or not the previous choice was correct.

Figure 1: Example from the study demonstrating the information appearing in the three AI conditions for a trial from the COMPAS dataset condition.

We compared the prediction accuracy of Logistic Regression, a Multi-layer Perceptron Neural Network with two layers of 50 units each, Random Forest, and a Support Vector Machine (SVM) with rbf kernel, and selected the best classifier for each dataset. We chose the Multi-layer Perceptron Neural Network for the Census income data, where it resulted in an overall accuracy of 82%, and the SVM with rbf kernel for the COMPAS data, with an overall accuracy of 68%. Census income accuracy closely matches the accuracy reported in the literature (Lichman 2013; Kag 2018; Alufaisan, Kantarcioglu, and Zhou 2016) and COMPAS accuracy matches the results published by ProPublica (ProPublica 2016). We split the data into 60% for training and 40% for testing to allow enough instances for the explanations generated using anchor LIME (Ribeiro, Singh, and Guestrin 2018).

In our behavioral experiment, 50 instances were randomly sampled without replacement for each participant. Thus, AI accuracy was experimentally manipulated for participants (Census: mean AI accuracy = 83.85%, sd = 3.67%; COMPAS: mean AI accuracy = 69.18%, sd = 4.65%). Because of the sample size and large number of repeated trials per participant, there was no meaningful difference in mean AI accuracy for participants in the AI condition vs. those in the AI Explanation condition (p = 0.90).

Participant Recruitment and Procedure

We developed the experiment using jsPsych (De Leeuw 2015) and hosted it on the Volunteer Science platform (Radford et al. 2016).2 Participants were recruited using Amazon Mechanical Turk (AMT) and were compensated $4.00 each. We collected data from 50 participants in each of the six experimental conditions, for a total of 300 participants (57.67% male). Most participants were 18 to 44 years old (80.67%). This research was approved as exempt (19-176) by the Army Research Laboratory's Institutional Review Board.

Participants read and agreed to a consent form, then received instructions on the task, specific to the experimental condition they were assigned to. They completed 10 practice trials, followed by 50 test trials and a brief questionnaire assessing general demographic information and comments on strategies used during the task. The median time to complete the practice and test trials was 18 minutes.

2 https://volunteerscience.com/
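A minimal sketch of the classifier comparison described above is shown below, assuming preprocessed numeric features X and binary labels y for one dataset. It is not the authors' code: any setting the paper does not state (e.g., random_state, max_iter) is a placeholder, and the anchor-explanation step is only indicated in a comment rather than implemented.

```python
# Sketch of the classifier selection (assumptions: X, y are the preprocessed
# feature matrix and binary labels; hyperparameters not stated in the paper are placeholders).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# 60/40 train/test split, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, test_size=0.4, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "mlp_2x50": MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=1000),
    "random_forest": RandomForestClassifier(),
    "svm_rbf": SVC(kernel="rbf"),
}

scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)

# The selected model's test-set predictions were shown to participants; anchor
# explanations for each shown instance were generated with the anchor LIME
# implementation of Ribeiro, Singh, and Guestrin (2018), which is not shown here.
```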
Results and Discussion

In this section we analyze and describe the effects of dataset, AI condition, and AI accuracy on the participants' decision-making accuracy, ability to outperform the AI, adherence to AI recommendations, confidence ratings, and reaction time.

Participant Decision-Making Accuracy

We compared participants' mean accuracy in the experiment across conditions using a 2 (Dataset) x 3 (AI condition) Analysis of Variance (ANOVA) (see Figure 2). We found significant main effects, with a small effect size for AI condition (F(2, 294) = 8.19, p < 0.001, η² = 0.04) and a nearly large effect for dataset condition (F(1, 294) = 46.51, p < 0.001, η² = 0.12). In addition, there was a significant interaction with a small effect size (F(2, 294) = 8.38, p < 0.001, η² = 0.05), indicating that the effect of AI condition depended on the dataset. Specifically, the large effect for increased accuracy with AI was driven by the Census dataset.

Contrary to H. 1, explainable AI did not substantially improve decision-making accuracy over AI alone. We followed up on significant ANOVA effects by performing pairwise comparisons using Tukey's Honestly Significant Difference. These post-hoc tests indicated that participants who viewed the Census dataset showed improved accuracy over control when given an AI prediction (p < 0.01) and higher accuracy with AI Explanation versus control (p < 0.001), but there was no statistically significant difference in participant accuracy for AI compared to AI Explanation (p = 0.28). In contrast, the COMPAS dataset had no significant differences in participant accuracy across pairwise comparisons for the three AI conditions (ps > 0.75). Also, the mean participant accuracy for the COMPAS control condition (mean = 63.7%, sd = 9.24%) was comparable to participant accuracy in prior decision-making research using the same dataset (mean = 62.8%, sd = 4.8%) (Dressel and Farid 2018).

There was strong evidence supporting H. 1.a: the vast majority of participants had mean accuracy exceeding guessing (50% accuracy). The overall participant accuracy across all conditions was 65.65% (sd = 10.92%), with 90% (270 out of 300) of participants performing above chance on the classification task. This indicates that the task was challenging but feasible for almost all participants.

Figure 2: Mean participant accuracy in each AI and dataset condition. Error bars represent 95% confidence intervals.

AI Accuracy and Participant Decision-Making Accuracy

We also evaluated the effect of the randomly varied AI accuracy for each participant on their decision-making accuracy. We used linear regression to analyze this relationship, specifying participant accuracy as the dependent variable and the following as independent variables: mean AI accuracy (per participant), AI condition, and dataset condition (see Figure 3). Regressions are represented by solid lines, with the shaded areas representing 95% confidence intervals. The control condition is not included in the analysis or figure, because the accuracy of the AI is not relevant if no AI prediction is presented to the participant. The overall regression model was significant with a large effect size, F(4, 195) = 21.23, p < 0.001, adjusted R² = 0.29. Consistent with H. 2, there was a large main effect for AI accuracy (β = 0.70, p < 0.001, R² = 0.28). Also, there was a small AI accuracy by dataset interaction (β = −0.07, p < 0.01, R² = 0.03), reflecting the same interaction depicted in Figure 2. There were no significant regression differences for dataset or for AI versus AI Explanation, ps > 0.60; there was no significant effect of dataset because dataset largely drove AI accuracy. Note it is not just that participants perform better with the higher mean AI accuracy of the Census dataset: both datasets had large positive relationships between participant accuracy and the corresponding mean AI accuracy, as shown in Figure 3.
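The two analyses above can be sketched with statsmodels as follows. The table name and column names (participant, dataset, ai_condition, accuracy, ai_accuracy) are hypothetical stand-ins for a per-participant summary; this is an illustration of the reported models, not the authors' analysis code.

```python
# Sketch of the 2 x 3 ANOVA and the AI-accuracy regression (hypothetical data layout:
# one row per participant with mean accuracy and manipulated mean AI accuracy).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("participant_summary.csv")

# 2 (Dataset) x 3 (AI condition) between-participants ANOVA on mean accuracy.
anova_model = ols("accuracy ~ C(dataset) * C(ai_condition)", data=df).fit()
print(sm.stats.anova_lm(anova_model, typ=2))

# Regression of participant accuracy on mean AI accuracy, dataset, and AI condition;
# the Control condition is excluded because no AI prediction was shown there.
with_ai = df[df["ai_condition"] != "Control"]
reg_model = ols("accuracy ~ ai_accuracy * C(dataset) + C(ai_condition)", data=with_ai).fit()
print(reg_model.summary())
```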

Figure 3: Mean AI accuracy (per participant) and mean participant accuracy by AI and AI Explanation and the two datasets. The shaded areas represent 95% confidence intervals.
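The Tukey HSD post-hoc comparisons reported in the accuracy analysis above could be run as in the brief sketch below; the data file and column names are the same hypothetical per-participant summary assumed earlier, not the authors' code.

```python
# Sketch of the Tukey HSD follow-up within one dataset (hypothetical columns:
# accuracy, dataset, ai_condition in a per-participant summary table).
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("participant_summary.csv")
census = df[df["dataset"] == "Census"]

tukey = pairwise_tukeyhsd(endog=census["accuracy"],
                          groups=census["ai_condition"], alpha=0.05)
print(tukey.summary())  # pairwise Control / AI / AI Explanation comparisons
```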

Outperforming the AI Accuracy

An interesting question is whether the combination of AI and human decision-making can outperform either alone. The previous analyses showed that the addition of AI prediction information improved human performance over controls with humans alone. We also evaluated how often human decision-making accuracy outperformed the corresponding mean AI prediction accuracy, which was experimentally manipulated. Although most participants performed well above chance, only a relatively small number of participants had decision accuracy exceeding their mean AI prediction (7%, or 14 out of 200). This result largely supports H. 3 and is also shown in Figure 3, where each dot represents an individual; dots above the black dashed line show the participants that outperformed their mean AI prediction. The black dashed line shows equivalent performance for mean AI accuracy and mean participant accuracy.

Dataset/Condition    AI    AI Explanation
Census                0     1
COMPAS               10     3

Table 1: Number of participants with decision accuracy exceeding their mean AI accuracy.

Adherence to AI Model Predictions

Participants followed the AI predictions more often when the AI was correct than when the AI was incorrect, indicating some recognition of when the AI makes bad predictions (see Figure 4, F(1, 196) = 36.15, p < 0.001, ηp² = 0.16). This was consistent with participant sensitivity to AI recommendations, evidence for H. 2. Also, participants were better able to recognize correct vs. incorrect AI predictions when they were in the Census condition, demonstrated in the significant interaction between AI correctness and dataset, F(1, 196) = 9.01, p < 0.01, ηp² = 0.04. None of the remaining ANOVA results were significant, ps > 0.16. Thus, there was no evidence for higher adherence to recommendations with explainable AI, which rejected H. 5.

Figure 4: Mean proportion of participant choices matching the AI prediction as a function of whether the AI was correct/incorrect and the dataset condition. Error bars represent 95% confidence intervals. To simplify this figure, results were collapsed for the AI and AI Explanation conditions, which did not have a significant main effect, p = 0.62.

Confidence Ratings

We found that AI (without and with explanation) resulted in slightly increased mean confidence. There was a small effect of AI condition on mean confidence (see Figure 5, F(2, 294) = 3.58, p = 0.03, η² = 0.02). Post hoc tests indicated participants had significantly lower mean confidence in the control condition than in AI, p < 0.03, but there were no statistical differences for the other pairwise comparisons, ps > 0.25. This contradicted H. 6, and there was no evidence of a confidence increase with explanations. In addition, there was no evidence for a main effect of dataset condition or an interaction, ps > 0.84.

Confirming H. 7, we found a positive relationship between accuracy and confidence rating within individuals, indicating that participants' confidence ratings were fairly well calibrated with their actual decision accuracy. We calculated each participant's mean accuracy at each confidence rating they used, and then conducted a repeated measures correlation (Bakdash and Marusich 2017) (r_rm = 0.48, p < 0.001).

Figure 5: Mean participant confidence ratings in each AI condition. Error bars represent 95% confidence intervals.
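The repeated measures correlation reported above can be computed with the pingouin implementation of the Bakdash and Marusich (2017) method. The sketch below assumes a hypothetical long-format table with one row per participant per confidence level used; it is an illustration, not the authors' code.

```python
# Sketch of the confidence-accuracy calibration analysis (assumption: `calib` has
# hypothetical columns participant, confidence, accuracy, where accuracy is each
# participant's mean accuracy at each confidence rating they used).
import pandas as pd
import pingouin as pg

calib = pd.read_csv("confidence_accuracy.csv")
rmcorr = pg.rm_corr(data=calib, x="confidence", y="accuracy", subject="participant")
print(rmcorr)  # reports r, degrees of freedom, p-value, and confidence interval
```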

Additional Results

We also assessed reaction time and summarize self-reported decision-making strategies. These results are exploratory; there were no specific hypotheses. There was no significant main effect of AI condition on participants' reaction time (F(2, 294) = 2.13, p = 0.12, η² = 0.01). There was only a main effect of dataset condition (F(1, 294) = 28.52, p < 0.001, η² = 0.09), where participants took an average of 1600 ms longer in the Census condition than the COMPAS condition (see Figure 6). This effect was most likely due to the Census dataset having more variables for each instance than the COMPAS dataset, and thus requiring more reading time on each trial. The addition of an explanation did not meaningfully increase reaction time over an AI prediction only.

Figure 6: Mean reaction time in each AI and dataset condition. Error bars represent 95% confidence intervals.

Subjective measures, such as self-reported strategies and measures of usability, often diverge from objective measures of human performance (Andre and Wickens 1995; Nisbett and Wilson 1977), such as actual decisions (Buçinca et al. 2020). Participants self-reported varying strategies to make their decisions, yet there was a clear benefit for AI prediction (without and with explanation).
In the AI and AI Explanation conditions, n = 80 indicated using the data without mentioning the AI, n = 39 reported using a combination of the data and the AI, and only n = 16 said they primarily used, trusted, or followed the AI. Despite limited self-reported use of the AI in the two relevant conditions, decision accuracy was higher with AI (Figure 2), strongly associated with AI accuracy (Figure 3), and there was some sensitivity to whether the AI was followed when it was correct versus incorrect (Figure 4). Nearly 80% of user comments could be coded; blank and nonsense responses could not be coded.

Discussion

Our results show that providing an AI prediction enhances human decision accuracy but, in opposition to the hypotheses, adding an explanation did not positively impact decisions or increase the ability to outperform the AI. This finding is in line with some previous studies that also found no added benefit for explanation over AI prediction alone (Green and Chen 2019; Poursabzi-Sangdeh et al. 2018; Zhang, Liao, and Bellamy 2020). This suggests that it was not the simplicity of the decision-making task that accounts for the opposite findings in Buçinca et al. (2020) and Lai and Tan (2019), since the current study also uses a relatively straightforward binary decision. Rather, it may be the case that tasks with highly intuitive datasets are required for explanation to improve performance. Future work may address this question directly.

One possible explanation for findings of no added benefit of explanation is that providing more information, even if task-relevant, does not necessarily improve human decision-making accuracy (Gigerenzer and Brighton 2009; Goldstein and Gigerenzer 2002; Nadav-Greenberg and Joslyn 2009). This phenomenon is attributed to cognitive limitations and people using near-optimal strategies, and corresponds with Poursabzi-Sangdeh et al.'s (2018) finding that explanation caused information overload, reducing people's ability to detect AI mistakes. However, their study and most other papers did not use the speeded response paradigm we used here, suggesting this was not solely attributable to participants responding as quickly as possible.

The lack of a significant, practically relevant effect for explainable AI was not due to lack of statistical power or ceiling performance: nearly all participants consistently performed above chance, but well below perfect accuracy. These findings also illustrate the need to compare decision-making with explainable AI to other conditions, including no AI and AI prediction without explanation. If we did not have an AI-only (decision aid) condition, a reasonable but flawed inference would have been that explainable AI enhances decisions.

Limitations The present findings have limited generalizability to decision-making with other datasets and explainable AI techniques. For example, the effectiveness of explanations, or lack thereof, for human decision-making may depend on a variety of factors: the specific explanation technique, the properties of the dataset, and the task itself (such as probabilistic predictions vs. two-choice decisions). Nevertheless, this paper demonstrates that one cannot assume explainable AI will necessarily improve human decisions, and it shows the need to evaluate objective measures of human performance in explainable AI.

Conclusions and Future Work

Much of the existing research on explainable AI focuses on the usability, trust, and interpretability of the explanation. In this paper, we fill in this research gap by investigating whether explainable AI can improve human decision-making. We designed a behavioral experiment in which each participant, recruited using Amazon Mechanical Turk, is asked to complete 50 test trials in one of six experimental conditions. Our experiment is conducted on two real datasets to compare human decisions with an AI prediction and with an AI prediction with explanation. Our experimental results demonstrate that AI predictions alone can generally improve human decision accuracy, while the advantage of explainable AI is not conclusive. We also show that users tend to follow AI predictions more often when the AI predictions are accurate. In addition, AI with or without explanation can increase the confidence of human users, which, on average, was well calibrated to user decision accuracy.

In the future, we plan to investigate whether explainable AI can help improve fairness, safety, and ethics by increasing the transparency of AI models. Human decision-making is a key outcome measure, but it is certainly not the only goal for explainable AI. We also plan to explore the difference in the distributions of errors between human decisions and AI predictions, especially at decision boundaries, and whether human-machine collaboration is feasible through interactions in closed feedback loops. We will also expand our datasets to include other data formats such as images.

Ethical Impact

For many important critical decisions, depending on the AI model prediction may not be enough. Furthermore, many recent regulations such as GDPR (Parliament and Council of the European Union 2016) allow potential audit of AI predictions by a human. Therefore, it is critical to understand whether the explanations provided by explainable AI methods improve overall prediction accuracy and help human decision makers detect errors. Our results indicate that although the existence of an AI model may improve human decision-making, the explanations provided may not automatically improve accuracy. We believe that our results could help ignite the needed research to explore how to better integrate explanations, AI models, and human operators to achieve better outcomes compared to AI models or humans alone.

Another consideration is the data itself: models created using systematically biased data will simply mirror and reinforce patterns inherent to the data. For example, in the COMPAS dataset a high risk score for re-offending has far lower accuracy for black defendants than white ones (ProPublica 2016). Furthermore, multiple AI systems for predicting criminal activity have relied on "dirty" data with racial bias due to flawed policies and procedures (Richardson, Schultz, and Crawford 2019). One solution may be to combine approaches for fair machine learning (Corbett-Davies and Goel 2018) with explainable AI, while considering dataset properties such as provenance.

Acknowledgments

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation. M.K. and Y.Z. were supported by ARL under grant W911NF-17-1-0356. We thank Jason Radford for help with implementing the experiment on the Volunteer Science platform and Katelyn Morris for independently coding the comments. We also thank Dan Cassenti for input on the paper, and Jessica Schultheis and Alan Breacher for editing the paper.

References

2018. Income prediction on UCI adult dataset. https://www.kaggle.com/overload10/income-prediction-on-uci-adult-dataset.

Adadi, A.; and Berrada, M. 2018. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 6: 52138–52160.

Alufaisan, Y.; Kantarcioglu, M.; and Zhou, Y. 2016. Detecting Discrimination in a Black-Box Classifier. In Proceedings of the 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC). IEEE.

Alufaisan, Y.; Zhou, Y.; Kantarcioglu, M.; and Thuraisingham, B. 2017. From Myths to Norms: Demystifying Data Mining Models with Instance-Based Transparency. In Proceedings of the 2017 IEEE 3rd International Conference on Collaboration and Internet Computing (CIC). IEEE.

Andre, A. D.; and Wickens, C. D. 1995. When users want what's not best for them. Ergonomics in Design 3(4): 10–14.

Bakdash, J. Z.; and Marusich, L. R. 2017. Repeated measures correlation. Frontiers in Psychology 8: 1–13. doi:10.3389/fpsyg.2017.00456.

Buçinca, Z.; Lin, P.; Gajos, K. Z.; and Glassman, E. L. 2020. Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI '20, 454–464. doi:10.1145/3377325.3377498.

Corbett-Davies, S.; and Goel, S. 2018. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.

Cummings, M. 2014. Man versus machine or man + machine? IEEE Intelligent Systems 29(5): 62–69.

Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), 598–617.

Dawes, R. M.; Faust, D.; and Meehl, P. E. 1989. Clinical versus actuarial judgment. Science 243(4899): 1668–1674.

De Leeuw, J. R. 2015. jsPsych: A JavaScript library for creating behavioral experiments in a Web browser. Behavior Research Methods 47(1): 1–12.

Doshi-Velez, F.; and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Dressel, J.; and Farid, H. 2018. The accuracy, fairness, and limits of predicting recidivism. Science Advances 4(1): eaao5580.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml.

Edwards, L.; and Veale, M. 2018. Enslaving the algorithm: From a "Right to an Explanation" to a "Right to Better Decisions"? IEEE Security & Privacy 16(3): 46–54.

Gigerenzer, G.; and Brighton, H. 2009. Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science 1(1): 107–143.

Goldstein, D. G.; and Gigerenzer, G. 2002. Models of ecological rationality: the recognition heuristic. Psychological Review 109(1): 75.

Green, B.; and Chen, Y. 2019. The Principles and Limits of Algorithm-in-the-Loop Decision Making. Proc. ACM Hum.-Comput. Interact. 3(CSCW). doi:10.1145/3359152. URL https://doi.org/10.1145/3359152.

Gunning, D. 2017. Explainable artificial intelligence (XAI). URL https://www.darpa.mil/attachments/XAIProgramUpdate.pdf.

Hoffman, R. R.; Mueller, S. T.; Klein, G.; and Litman, J. 2018. Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608.

Kahneman, D. 2011. Thinking, fast and slow. Macmillan.

Keil, F. C. 2006. Explanation and understanding. Annu. Rev. Psychol. 57: 227–254.

Kleinmuntz, D. N.; and Schkade, D. A. 1993. Information displays and decision processes. Psychological Science 4(4): 221–227.

Lage, I.; Chen, E.; He, J.; Narayanan, M.; Kim, B.; Gershman, S.; and Doshi-Velez, F. 2019. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006.

Lai, V.; and Tan, C. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 29–38.

Lakkaraju, H.; and Bastani, O. 2020. "How do I fool you?": Manipulating User Trust via Misleading Black Box Explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 79–85. ACM. doi:10.1145/3375627.3375833.

Lichman, M. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

Lou, Y.; Caruana, R.; and Gehrke, J. 2012. Intelligible Models for Classification and Regression. In KDD '12, Beijing, China. ACM.

Miller, T. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267: 1–38.

Nadav-Greenberg, L.; and Joslyn, S. L. 2009. Uncertainty forecasts improve decision making among nonexperts. Journal of Cognitive Engineering and Decision Making 3(3): 209–227.

Nisbett, R. E.; and Wilson, T. D. 1977. Telling more than we can know: verbal reports on mental processes. Psychological Review 84(3): 231.

Parliament and Council of the European Union. 2016. General data protection regulation. https://eur-lex.europa.eu/eli/reg/2016/679/oj.

Poursabzi-Sangdeh, F.; Goldstein, D. G.; Hofman, J. M.; Vaughan, J. W.; and Wallach, H. 2018. Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810.

ProPublica. 2016. Machine Bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Radford, J.; Pilny, A.; Reichelmann, A.; Keegan, B.; Welles, B. F.; Hoye, J.; Ognyanova, K.; Meleis, W.; and Lazer, D. 2016. Volunteer science: An online laboratory for experiments in social psychology. Social Psychology Quarterly 79(4): 376–396.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Richardson, R.; Schultz, J. M.; and Crawford, K. 2019. Dirty data, bad predictions: How civil rights violations impact police data, predictive policing systems, and justice. NYUL Rev. Online 94: 15.

Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 618–626.

Shmueli, G.; et al. 2010. To explain or to predict? Statistical Science 25(3): 289–310.

Skitka, L. J.; Mosier, K. L.; and Burdick, M. 1999. Does automation bias decision-making? International Journal of Human-Computer Studies 51(5): 991–1006.

Sokol, K. 2019. Fairness, Accountability and Transparency in Artificial Intelligence: A Case Study of Logical Predictive Models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 541–542.

Wang, T.; Rudin, C.; Doshi-Velez, F.; Liu, Y.; Klampfl, E.; and MacNeille, P. 2017. A Bayesian Framework for Learning Rule Sets for Interpretable Classification. Journal of Machine Learning Research 18(70): 1–37. URL http://jmlr.org/papers/v18/16-003.html.

Zhang, Y.; Liao, Q. V.; and Bellamy, R. K. E. 2020. Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, 295–305. doi:10.1145/3351095.3372852.

Supplementary Material for "Does Explainable Artificial Intelligence Improve Human Decision-Making?"

Age Distribution

See Figure 1 for the age distribution of participants. Most participants were in the 25-34 age group.

Figure 1: Age distribution of participants.

Self-Reported Decision Strategies

We performed an exploratory analysis of the self-reported decision strategies, qualitatively described by users in their comments. We coded strategies using six categories; see the first column in the tables below. First, comments were independently coded by two raters: the third author and a research assistant. Inter-rater reliability was good, κ = 0.70, p < 0.001. Second, we reached consensus to resolve all discrepancies and finalize the categorical coding of decision strategies.

The majority of participants commented that they primarily relied on the data to make decisions, especially in the control condition: n = 74. The dominant reported strategy in the two AI conditions was also primarily using the data to make decisions, without any mention of AI: n = 80. Primary use of AI (n = 16) was surprisingly low. Partial use of AI, including the possibility of explanations in the explainable AI condition, was higher (n = 39) but still limited compared to the more widely used strategy of making decisions using the data. The number of participants using each strategy was generally comparable between the two datasets (second table below), with minor differences: more frequent use of AI and data together for COMPAS, and less use of primarily data for COMPAS. This was despite higher mean AI accuracy for the Census income dataset over COMPAS.

Strategy             Control   AI   AI Explanation
Primarily AI         –         7    9
AI and data          –         14   25
Data and gut         5         2    1
Primarily gut        2         6    6
Primarily data       74        46   34
Could not be coded   19        25   25

Strategy             Census   COMPAS
Primarily AI         9        7
AI and data          16       23
Data and gut         4        4
Primarily gut        3        11
Primarily data       83       71
Could not be coded   35       34

We compared the accuracy of participants who self-reported different strategies, grouping them into those who mentioned using the AI (in the AI and AI Explanation conditions; Primarily AI and AI and data), those who didn't mention using the AI (Data and gut, Primarily gut, and Primarily data), and those who provided blank or nonsense strategies (Could not be coded, labelled as NA). As shown in Figure 2, mean participant accuracy was very similar for those who mentioned using the AI versus those who didn't. In the future, it may be useful to compare the types of decision errors for AI versus AI with explanation, as well as errors based on self-reported strategies. Different types of errors could produce similar decision accuracy if the overall error rate is similar. However, participants whose self-reported strategies were blank or did not make sense did show a lower mean accuracy on the task, perhaps indicating a lower level of effort or engagement with the task overall, with accuracy nearing or even at chance performance (50%). The number of users with no self-reported strategy was comparable across all six conditions.
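For reference, the inter-rater reliability statistic reported above is Cohen's kappa, which could be computed as in the sketch below. The example lists of codes are hypothetical stand-ins for the two raters' strategy labels, not the study's actual coding sheet (and note that scikit-learn returns kappa only, not the accompanying p-value).

```python
# Minimal sketch of the inter-rater reliability check (assumption: two aligned
# lists of strategy codes, one per rater, one entry per participant comment).
from sklearn.metrics import cohen_kappa_score

rater_a = ["Primarily data", "AI and data", "Primarily gut", "Primarily data"]
rater_b = ["Primarily data", "Primarily AI", "Primarily gut", "Primarily data"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # the paper reports kappa = 0.70 over all comments
```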


Figure 2: Participant accuracy by self-reported strategy and AI condition. Error bars are 95% confidence intervals.

Speed-Accuracy Tradeoff

We used linear regression to explore the tradeoff between speed and accuracy across participants; that is, whether slower participants tended to be more accurate and vice versa.

Figure 3: Speed-accuracy tradeoff generated by participants in each dataset condition. Shaded areas indicate 95% confidence intervals.

We found overall a small but significant speed-accuracy tradeoff (β = 1.01, p < 0.001, η² = 0.08), where each additional second of reaction time predicted a 1.01% increase in accuracy. As shown in Figure 3, there was a small interaction between the dataset condition and reaction time (F(1, 288) = 6.65, p < 0.02, η² = 0.02), indicating that the speed-accuracy tradeoff was present for participants in the Census dataset condition (β = 1.02), but it was near zero for those in the COMPAS dataset condition (β = −0.02). We did not find a significant effect of AI condition on the speed-accuracy tradeoff.
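A regression of this form could be sketched as follows. The data file and column names are hypothetical, and reaction time is converted to seconds so the slope is directly interpretable as percentage points of accuracy per additional second; this is an illustration, not the authors' analysis code.

```python
# Sketch of the speed-accuracy regression (hypothetical per-participant table with
# mean reaction time in ms, mean accuracy in %, and dataset condition).
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("participant_summary.csv")
df["rt_seconds"] = df["mean_rt_ms"] / 1000.0

# Accuracy regressed on reaction time, dataset, and their interaction.
model = ols("accuracy ~ rt_seconds * C(dataset)", data=df).fit()
print(model.summary())
```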