Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance

Gagan Bansal∗ and Tongshuang Wu∗ (University of Washington), Joyce Zhou† and Raymond Fok† (University of Washington), Besmira Nushi, Ece Kamar, and Marco Tulio Ribeiro (Microsoft Research), Daniel S. Weld (University of Washington & Allen Institute for Artificial Intelligence)

∗Equal contribution. †Made especially large contributions.

ABSTRACT

Many researchers motivate explainable AI with studies showing that human-AI team performance on decision-making tasks improves when the AI explains its recommendations. However, prior studies observed improvements from explanations only when the AI, alone, outperformed both the human and the best team. Can explanations help lead to complementary performance, where team accuracy is higher than either the human or the AI working solo? We conduct mixed-method user studies on three datasets, where an AI with accuracy comparable to humans helps participants solve a task (explaining itself in some conditions). While we observed complementary improvements from AI augmentation, they were not increased by explanations. Rather, explanations increased the chance that humans will accept the AI's recommendation, regardless of its correctness. Our result poses new challenges for human-centered AI: Can we develop explanatory approaches that encourage appropriate trust in AI, and therefore help generate (or improve) complementary performance?

CCS CONCEPTS

• Human-centered computing → Empirical studies in HCI; Interactive systems and tools; • Computing methodologies → Machine learning.

KEYWORDS

Explainable AI, Human-AI teams, Augmented intelligence

ACM Reference Format:
Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel S. Weld. 2021. Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. In CHI Conference on Human Factors in Computing Systems (CHI '21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3411764.3445717

1 INTRODUCTION

Although the accuracy of Artificial Intelligence (AI) systems is rapidly improving, in many cases it remains risky for an AI to operate autonomously, e.g., in high-stakes domains or when legal and ethical matters prohibit full autonomy. A viable strategy for these scenarios is to form human-AI teams, in which the AI system augments one or more humans by recommending its predictions, while people retain agency and accountability for the final decisions. Examples include AI systems that predict likely hospital readmission to assist doctors with correlated care decisions [8, 13, 15, 78] and AIs that estimate recidivism to help judges decide whether to grant bail to defendants [2, 30]. In such scenarios, it is important that the human-AI team achieves complementary performance (i.e., performs better than either alone): from a decision-theoretic perspective, a rational developer would deploy a team only if it adds utility to the decision-making process [73], for example by significantly improving decision accuracy, using human effort to close deficiencies in automated reasoning, and vice versa [35, 70].
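To make this criterion concrete, below is a minimal sketch (ours, not the authors' code) of the complementary-performance check implied by the definition above and by the "complementary zone" (max(Human, AI), 1] shown in Figure 1; the accuracy values in the usage lines are hypothetical, not results from the study.

def is_complementary(team_acc: float, human_acc: float, ai_acc: float) -> bool:
    """Check complementary performance: team accuracy must fall in the
    complementary zone (max(human, AI), 1], i.e., strictly exceed the
    better of the two solo performers."""
    return team_acc > max(human_acc, ai_acc)

# Hypothetical accuracies for illustration only:
assert is_complementary(team_acc=0.88, human_acc=0.84, ai_acc=0.85)      # inside the zone
assert not is_complementary(team_acc=0.85, human_acc=0.84, ai_acc=0.85)  # ties the AI; not complementary

Note that merely matching the best solo performer does not count: the zone is open at its lower end, so the team must strictly outperform both teammates.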
Many researchers have argued that such human-AI teams would be improved if the AI systems could explain their reasoning. In addition to increasing trust between humans and machines or improving the speed of decision making, one hopes that an explanation should help the responsible human know when to trust the AI's suggestion and when to be skeptical, e.g., when the explanation doesn't "make sense." Such appropriate reliance [46] is crucial for users to leverage AI assistance and improve task performance [10]. Indeed, at first glance, it appears that researchers have already confirmed the utility of explanations on tasks ranging from medical diagnosis [14, 53] and data annotation [67] to deception detection [43]. In each case, the papers show that, when the AI provides explanations, team accuracy reaches a level higher than human-alone.

[Figure 1 appears here: a schematic of the team setting (input → AI recommendation R + explanation → human decision) and two plots, (A) "Prior work" and (B) "Ours", comparing the "R only" and "R + Explanation" conditions on decision accuracy and change of performance, with the complementary zone (max(Human, AI), 1] marked.]

Figure 1: (Best viewed in color) Do AI explanations lead to complementary team performance? In a team setting, when given an input, the human uses (usually imperfect) recommendations from an AI model to make the final decision. We seek to understand if automatically generated explanations of the AI's recommendation improve team performance compared to baselines, such as simply providing the AI's recommendation, R, and confidence. (A) Most previous work concludes that explanations improve team performance (i.e., ∆A > 0); however, it usually considers settings where AI systems are much more accurate than people and even the human-AI team. (B) Our study considers settings where human and AI performance is comparable, to allow room for complementary improvement. We ask, "Do explanations help in this context, and how do they compare to simple confidence-based strategies?" (Is ∆B > 0?)

However, a careful reading of these papers shows another commonality: in every situation, while explanations are shown to help raise team performance closer to that of the AI, one would achieve an even better result by stripping humans from the loop and letting the AI operate autonomously (Figure 1A & Table 1). Thus, the existing work suggests several important open questions for the AI and HCI communities: Do explanations help achieve complementary performance by enabling humans to anticipate when the AI is potentially incorrect? Furthermore, do explanations provide significant value over simpler strategies, such as displaying the AI's uncertainty? In the quest to build the best human-machine teams, such questions deserve critical attention.

To explore these questions, we conduct new experiments in which we control the study design, ensuring that the AI's accuracy is comparable to the human's (Figure 1B). Specifically, we measure human skill on our experiment tasks and then control AI accuracy by purposely selecting study samples on which the AI has comparable accuracy. This setting simulates situations where there is a strong incentive to deploy human-AI teams, e.g., because there exists more potential for complementary performance (by correcting each other's mistakes), and where simple heuristics such as blindly following the AI are unlikely to achieve the highest performance.

We selected three common-sense tasks that can be tackled by crowd workers with little training: sentiment analysis of book and beer reviews, and a set of LSAT questions that require logical reasoning. We conducted large-scale studies using a variety of explanation sources (AI- versus expert-generated) and strategies (explaining just the predicted class, or explaining other classes as well). We observed complementary performance on every task, but, surprisingly, explanations did not appear to offer benefit compared to simply displaying the AI's confidence. Notably, explanations increased reliance on recommendations even when the AI was incorrect; this is consistent with prior observations that humans may accept incorrect AI suggestions or ignore the correct ones [13, 69]. However, using end-to-end studies, we go one step further and quantify the impact of such over-reliance on objective metrics of team performance.

As a first attempt to tackle the problem of blind reliance on AI, we introduce Adaptive Explanation. Our mechanism tries to reduce human trust when the AI has low confidence: it explains only the predicted class when the AI is confident, but also explains the alternative otherwise. While it failed to produce a significant improvement in final team performance over other explanation types, there is suggestive evidence that the adaptive approach can push the agreement between AI predictions and human decisions in the desired direction.
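A minimal sketch of this adaptive policy follows, under our reading of the description above; the function name and the confidence threshold are hypothetical, not drawn from the authors' implementation.

def adaptive_explanation(predicted_label: str, alternative_label: str,
                         confidence: float, threshold: float = 0.7) -> list[str]:
    """Select which class(es) to explain, given the AI's confidence.

    When the AI is confident, explain only the predicted class; otherwise
    also explain the alternative, signaling the human to be more skeptical.
    """
    labels_to_explain = [predicted_label]
    if confidence < threshold:  # low confidence: surface the alternative too
        labels_to_explain.append(alternative_label)
    return labels_to_explain

# Hypothetical usage on a binary sentiment task:
print(adaptive_explanation("positive", "negative", confidence=0.92))  # ['positive']
print(adaptive_explanation("positive", "negative", confidence=0.55))  # ['positive', 'negative']

The design intent is that the mere presence of a second explanation acts as an implicit warning, so the human's agreement with the AI tracks the AI's confidence rather than the persuasiveness of a single one-sided explanation.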
Through extensive qualitative analysis, we also summarize potential factors that should be considered in experimental settings for studying human-AI complementary performance. For example, the difference in expertise between human and AI affects whether (or how much) AI assistance will help achieve complementary performance, and the display of the explanation may affect the human's collaboration strategy. In summary:

• We highlight an important limitation of previous work on explainable AI: While many studies show that explaining the predictions of an AI increases team performance (Table 1), they all consider cases where the AI system is significantly more accurate than both the human partner and the human-AI team. In response, we argue that AI explanations for decision-making should aim for complementary performance, where the human-AI team outperforms both the solo human and the solo AI.

• To study complementary performance, we develop a new experimental setup and use it in studies with 1626 users on three tasks¹ to evaluate a variety of explanation sources and strategies. We observe complementary performance in every task