I Can Do Better Than Your AI: Expertise and Explanations
James Schaffer (US Army Research Laboratory, Playa Vista, CA) [email protected]
John O'Donovan (University of California Santa Barbara, Santa Barbara, CA) [email protected]
James Michaelis (US Army Research Laboratory, Adelphi, MD) [email protected]
Adrienne Raglin (US Army Research Laboratory, Adelphi, MD) [email protected]
Tobias Höllerer (University of California Santa Barbara, Santa Barbara, CA) [email protected]
ABSTRACT
Intelligent assistants, such as navigation, recommender, and expert systems, are most helpful in situations where users lack domain knowledge. Despite this, recent research in cognitive psychology has revealed that lower-skilled individuals may maintain a sense of illusory superiority, which might suggest that users with the highest need for advice may be the least likely to defer judgment. Explanation interfaces – a method for persuading users to take a system's advice – are thought by many to be the solution for instilling trust, but do their effects hold for self-assured users? To address this knowledge gap, we conducted a quantitative study (N=529) wherein participants played a binary decision-making game with help from an intelligent assistant. Participants were profiled in terms of both actual (measured) expertise and reported familiarity with the task concept. The presence of explanations, level of automation, and number of errors made by the intelligent assistant were manipulated while observing changes in user acceptance of advice. An analysis of cognitive metrics led to three findings for research in intelligent assistants: 1) higher reported familiarity with the task simultaneously predicted more reported trust but less adherence, 2) explanations only swayed people who reported very low task familiarity, and 3) showing explanations to people who reported more task familiarity led to automation bias.

CCS CONCEPTS
• Information systems → Expert systems; Personalization; • Human-centered computing → User models; HCI design and evaluation methods.

KEYWORDS
User Interfaces; Human-Computer Interaction; Intelligent Assistants; Information Systems; Decision Support Systems; Cognitive Modeling

ACM Reference Format:
James Schaffer, John O'Donovan, James Michaelis, Adrienne Raglin, and Tobias Höllerer. 2019. I Can Do Better Than Your AI: Expertise and Explanations. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI '19). ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3301275.3302308

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the United States government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for government purposes only.
IUI '19, March 17–20, 2019, Marina del Ray, CA, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6272-6/19/03...$15.00
https://doi.org/10.1145/3301275.3302308

1 INTRODUCTION
Many people express distrust or negative sentiment towards intelligent assistants such as recommender systems, GPS navigation aids, and general-purpose assistants (e.g., Amazon's Alexa). Simultaneously, the amount of information available to an individual decision maker is exploding, creating an increased need for automation of the information filtering process. Distrust may be due to intelligent assistants being overly complex or opaque, which generates uncertainty about their accuracy or the relevance of their results. This creates an enormous challenge for designers: even if intelligent assistants can deliver the right information at the right time, it can still be a challenge to persuade people to trust, understand, and adhere to their recommendations.

While trust is thought to be a primary issue in system acceptance, a user's self-assessment of his or her own knowledge may also play a significant role. Personality traits related to self-reported ability have been studied recently in cognitive psychology [33, 41].
A well-known result from this research is the "Dunning-Kruger" effect, where low-ability individuals maintain an overestimated belief in their own ability. This effect states that a lack of self-awareness, which is correlated with lower cognitive ability, may lead some to inflate their own skills while ignoring the skills of others. Moreover, the broader study of "illusory superiority," which tells us that it takes a highly intelligent person to gauge the intelligence of others, may play a role in the adoption of intelligent assistants. This begs the question: how can information systems sway over-confident users who believe they are smarter than the machine?

Explanations are becoming the commonly accepted answer to making intelligent assistants more usable (e.g., [29]). Explanations inform users about the details of automated information filtering (such as in recommender systems) and help to quantify uncertainty. Research into explanations started with expert systems studies [1, 28], and more recently their trust and persuasion effects have been studied in recommender systems [36, 50]. Beyond this, explanations have been shown to fundamentally alter user beliefs about the system, including competence, benevolence, and integrity [56]. On the negative side, theories of human awareness [17] and multitasking [23] predict that users who spend time perceiving and understanding explanations from intelligent assistants may have lower awareness of other objects in their environment (or must necessarily take more time to complete a given task), which could negatively affect performance [9]. This might have implications for "automation bias," or the over-trusting of technology [13], which occurs when users become complacent with delegating to the system. Whether explanations alleviate or exacerbate automation bias is still an open research question. There are still more intricacies to explanations; for instance, recent research in intelligent user interfaces [4] has found that explanations interact with user characteristics [25] to affect trusting beliefs about a recommender system, implying that not all explanations are equal for all users. Moreover, since intelligent assistants are often presented as autonomous, cognitive psychology might tell us that explanations that sway a cooperative and trusting user [7, 35] might be completely ignored by an overconfident individual. This problem could be amplified when imperfections in the systems are perceived.

The above concerns illustrate the idea that the use of explanations might need to be tailored to the user's self-confidence and the potential error of the information system or intelligent assistant. This work investigates how a person's prior reported familiarity with a task vs. their actual measured expertise affects their adherence to recommendations. A study of N=529 participants was conducted on a binary choice task (the n-player Iterated Prisoner's Dilemma [2], here introduced in the form of the Diner's Dilemma [24]) with assistance from an experimentally manipulated intelligent assistant, dubbed the "Dining Guru." With this constrained setup, we were able to make theoretically ideal recommendations to the participants, or precisely control the level of error. We present an analysis of participant data that explains the consequences of self-assessed familiarity (our measure of confidence) for intelligent assistants. Changes in adherence to recommendations, trust, and situation awareness are observed. Our analysis allows us to address the following research questions:

(1) How does a person's self-assessed task knowledge predict adherence to advice from intelligent assistants?
(2) Are rational explanations from an intelligent assistant effective on over-confident people?
(3) What is the relationship between over-confidence, automation bias, system error, and explanations?

2 BACKGROUND
This section reviews the work that serves as the foundation of the experiment design and interpretation of participant data.

Explanations and Intelligent Assistants
The majority of intelligent assistant explanation research has been conducted in recommender and expert systems [49], and older forms of explanations present multi-point arguments (e.g., Toulmin-style argumentation [32]) about why a particular course of action should be chosen. These explanations could be considered "rational" in that they explain the mathematical or algorithmic properties of the systems they explain, which contrasts with more emotional explanations, such as those that are employed by virtual agents [31, 38]. The differential impact of explanation on novices and experts has also been studied [1]: novices are much more likely to adhere to recommendations due to a lack of domain knowledge, while expert users require a strong "domain-oriented" argument before adhering to advice. In this context, "expert" refers to a true expert (indicated by past experience) rather than a self-reported expert. Most of these studies focus on decision-making domains (financial analysis, auditing problems), where success can be determined objectively. The work presented here builds on the research by Arnold et al. [1] by quantifying objectively determined expertise and assessing its relationship with self-reported familiarity and interaction effects with system explanations.

Errors by Intelligent Assistants
Intelligent assistants often vary in their degree of automation and effectiveness. The pros and cons of varying levels of automation have been studied in human-agent teaming [8]. Less prompting of the user for intervention may reduce cognitive load, but might also reduce awareness of system operations. Although system effectiveness continues to improve (e.g., [39]), it is not conclusive that improved algorithms result in improved adherence to system recommendations; for instance, no effect of system error was found in Salem et al. [52]. In contrast, Yu et al. found that system errors do have a significant effect on trust [57], and Harman et al. found that users trust inaccurate recommendations more than they should [30]. These research studies also call the relationship between trust and adherence into question. In this study, we attempt to clarify this relationship through simultaneous measurement of trust and adherence while controlling for system error and automation.

Self-reported Ability
Consequences of self-reported ability have been recently discovered in studies of cognitive psychology [33, 41]. As mentioned, the "Dunning-Kruger" effect predicts that low-ability individuals maintain an overestimated belief in their own ability. This work also illustrates how quantitative metrics collected through questionnaires do not always measure their face value. For instance, the Dunning-Kruger effect shows us that asking a user how much he knows about a particular topic will only quantify the number of "unknown unknowns" relative to the user, rather than their actual ability in that area [48]. Deficits in knowledge are a double burden for these users, not only causing them to make mistakes but also preventing them from realizing they are making mistakes [16, 42].

The Dunning-Kruger effect is part of a larger group of cognitive effects sometimes referred to as illusory superiority. Other effects in this category create an additional concern for the success of intelligent assistants. For instance, it is known that estimating the intelligence of others is a difficult task that itself requires high intelligence (the "Downing effect") [14]. This explains the tendency of people to rate themselves as "above average," even though not everyone can be so. We might expect that lower-intelligence users would fail to accurately gauge the intelligence of information systems, leading to disuse.

The research on self-reported ability leads us to hypothesize that overconfident individuals are less likely to interact with or adhere to intelligent assistants, due to the overestimation of their own ability and their inability to assess the accuracy of the system. A main challenge in this work is quantifying illusory superiority for abstract trust games. Herein, we solicited users to self-report their familiarity with games such as the Diner's Dilemma, Iterated Prisoner's Dilemma, or the Public Goods game beforehand. Anyone with a passing knowledge of the rules of these games should have an obvious advantage over their peers; for example, a commonly known fact among those familiar with trust games is that "tit-for-tat," which rewards cooperation with cooperation while consistently but not vengefully penalizing defection, is one of the best strategies to employ in the Iterated Prisoner's Dilemma.

Awareness and Automation Bias
Human awareness is a focus of many US Department of Defense agencies [19], wherein it is referred to as "situation awareness." Situation awareness is defined as "the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future." Endsley's theory models human decision-making in dynamic environments and has become well accepted. There are three states of situation awareness: perception, comprehension, and projection, all of which contribute to an individual's ability to make good decisions. Perhaps the biggest threat to situation awareness is automation bias [13], which predicts that the more of the information-gathering process is automated, the more users will "check out." This is sometimes called "overtrusting" [34] and can, in some instances, lead to catastrophic failure¹. Work on the Dunning-Kruger effect and automation bias led us to hypothesize that overconfident users would be less susceptible to automation bias, since it is likely they would prefer to perform the task on their own.

¹https://www.theguardian.com/technology/2016/jun/30/tesla-autopilot-death-self-driving-car-elon-musk

Trust
Despite the research on automation bias, research in intelligent assistants still points towards trust as an important factor to indicate usability. In the research literature, the term "trust" is used with many different semantic meanings, and the way it is measured has implications for what it can predict. For instance, previous research in recommender systems has shown that trust and system perceptions are highly correlated [10, 36, 37, 40]. More generally, research by McKnight ties the concepts of trust in technology and trust in other people [46], showing that people can discriminate between the two. In this work, we use a set of question items adapted from [47] for trust in a specific technology (in this case, the Dining Guru). This way, we can quantitatively assess the relationship between reported trust and observed adherence – a relationship that is often assumed.

Iterated Prisoner's Dilemma (IPD) and Diner's Dilemma
The Diner's Dilemma was chosen for the research platform due to its wide applicability, limited complexity, and research base. The Diner's Dilemma is an n-player, iterated version of the basic Prisoner's Dilemma. In the basic Prisoner's Dilemma, two players decide to take an action (cooperate or defect) without communication beforehand, where defection leads to a higher outcome for an individual regardless of the other player's actions, but with mutual cooperation leading to a higher outcome than mutual defection. The iterated form of this game, the Iterated Prisoner's Dilemma, is useful for studying many real-world situations such as military arms races, cooperation in business settings, economics, animal behavior, and doping in sports. Although motives to cooperate in the Iterated Prisoner's Dilemma can be dependent on many other factors such as altruistic tendencies, emotion, or the five-factor model [11, 15, 51], the game still represents a fairly simple binary choice task: in each round there are two options, with each option leading to a different objectively-measured outcome based on the current state of the game. Multi-player versions of this game, such as the Diner's Dilemma, are more complex, which has made them suitable for studying the effects of increased information available to players through a user interface [27, 45]. In this experiment, the 3-player version was used, which makes it sufficiently easy to understand but still sufficiently complex as to warrant a computational aid. Moreover, this version of the game has been previously validated in terms of understandability and information requirements [53].

3 GAME DESIGN
In the Diner's Dilemma, several diners eat out at a restaurant, repeatedly over an unspecified number of days, with the agreement to split the bill equally each time. Each diner has the choice to order the inexpensive dish (hotdog) or the expensive dish (lobster). Diners receive a better dining experience (here, quantified as dining points) when everyone chooses the inexpensive dish compared to when everyone chooses the expensive dish. To be a valid dilemma, the quality-cost ratio of the two items available in a Diner's Dilemma game must meet a few conditions. First, if the player were dining alone, ordering hotdog should maximize dining points. Second, players must earn more points when they are the sole defector than when all players cooperate. Finally, the player should earn more points when the player and the two co-diners all defect than when the player is the only one to cooperate. This "game payoff matrix" means that in one round of the game, individual diners are better off choosing the expensive dish regardless of what the others choose to do. However, over repeated rounds, a diner's choice can affect the perceptions of other co-diners and cooperation may develop, which affects the long-term prosperity of the group. Hotdog/lobster cost and quality values for the game are shown in Figure 1, under each respective item, resulting in the payoff matrix that is shown in Table 1.

Table 1: Diner's Dilemma choice payoff matrix.

    Player chooses:           Hotdog   Lobster
    2 co-diners cooperate      20.00     24.00
    1 co-diner cooperates      12.00     17.14
    Neither cooperates          8.57     13.33

Participants played the Diner's Dilemma with two simulated co-diners. The co-diners were not visually manifested so as to avoid any confounding emotional responses from participants (see [11]). The co-diners played variants of Tit-for-Tat (TFT), a proven strategy for success in the Diner's Dilemma, wherein the co-diner makes the same choice that the participant did in the previous round. To make the game more comprehensible for participants, simulated co-diners reacted only to the human decision and not to each other. In order to increase the information requirements of the game, some noise was added to the TFT strategy in the form of an increased propensity to betray (respond to a hotdog order with a lobster order) or forgive (respond to a lobster order with a hotdog order). Participants played three games with an undisclosed number of rounds (approximately 50 per game) and co-diner strategies switched between games. This means that the primary task for the user was to figure out what strategies the co-diners were employing and adjust accordingly. In the first game, co-diners betrayed often and the best strategy was to order lobster. In the second game, co-diners betrayed at a reduced rate and also forgave to some degree, which made hotdog the best choice. In the final game, co-diners were very forgiving and rarely ordered lobster even when betrayed, which again made lobster the best choice. The mean performance of participants in each game is shown in Table 4.

4 USER INTERFACE DESIGN
Participants played the game through the interface shown in Figure 1. This interface contains four components: the last round panel (left), the control panel (center), the history panel (bottom), and the virtual agent, the Dining Guru (right). Across treatments, all panels remained the same except for the Dining Guru, which varied (see Figure 2).

Figure 1: The user interface for the game.

Participants were provided with a basic interface containing all of the information that the Dining Guru used to generate its advice. The last round panel was shown on the left side of the interface and the control panel was shown in the middle. Together, these panels displayed the current dining points, the food quality and cost of each menu item, the current round, and the results from the previous round in terms of dining points. These panels allowed the participant to make a choice in each round. On the lower portion of the screen a history panel was provided. This panel contained information about who chose what in previous rounds and reciprocity rates.

Dining Guru
The Dining Guru was shown on the right side of the screen. In each round, the Dining Guru could be examined by a participant to receive a recommendation about which item (hotdog or lobster) would maximize their dining points. As with the simulated co-diners, the Dining Guru was not given any dynamic personality beyond being presented as an agent; a static drawing was used to communicate this. Users were required to mouse over the Dining Guru to invoke a recommendation, which made it possible to measure adherence. Recommendations were generated by calculating the expected value of ordering hotdog or lobster in the future, based on maximum likelihood estimates of the rates of forgiveness and betrayal from co-diners. Due to the fixed strategy of the simulated co-diners, the Dining Guru made the "best possible" choice in each round, with most of the errors occurring in earlier rounds when information was incomplete. A "manual" version of the Dining Guru was given in some treatments, which required participants to supply the Dining Guru with estimates of hotdog and lobster reciprocity rates ("Automation=0" cases in Figure 2) before receiving a recommendation. The rationale for this version is that by allowing the participant to become more engaged with the system, higher SA may result.

5 EXPERIMENT DESIGN
This section describes the experiment procedure, variables, and tested hypotheses.

Procedure
Reported familiarity was measured during the pre-study and trust was measured during the post-study (question items are given in Table 3). Before playing the game, participants were introduced to game concepts related to the user interface, reciprocity, and the Dining Guru by playing practice rounds.
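To make the game mechanics concrete, the following sketch implements the equal-bill-split payoff rule and a noisy tit-for-tat co-diner. The dish costs and qualities used here (hotdog: cost 1, quality 20; lobster: cost 3, quality 40) are our own assumptions chosen to reproduce Table 1, since the exact values appear only in Figure 1; the betray/forgive parameters are likewise illustrative.

```python
import random

# Assumed dish parameters (consistent with Table 1; the study's actual values
# are shown in Figure 1).
COST = {"H": 1.0, "L": 3.0}       # hotdog, lobster cost
QUALITY = {"H": 20.0, "L": 40.0}  # hotdog, lobster quality

def dining_points(player, codiner1, codiner2):
    """Dining points = own dish quality divided by the per-diner share of the bill."""
    bill_share = (COST[player] + COST[codiner1] + COST[codiner2]) / 3.0
    return QUALITY[player] / bill_share

def noisy_tft(last_player_choice, p_betray, p_forgive):
    """Tit-for-tat over the participant's previous order, with added noise:
    betray = answer a hotdog order with lobster; forgive = answer lobster with hotdog."""
    if last_player_choice == "H":
        return "L" if random.random() < p_betray else "H"
    return "H" if random.random() < p_forgive else "L"

# Reproduce Table 1: rows = co-diner orders, columns = player's own order.
for codiners in (("H", "H"), ("H", "L"), ("L", "L")):
    print(codiners, [round(dining_points(p, *codiners), 2) for p in ("H", "L")])
```

With these assumed values the loop prints exactly the six payoffs of Table 1, and `noisy_tft` degenerates to plain tit-for-tat when both noise rates are zero.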
Table 2: A list of variables used in the analysis. Mean (µ) and standard deviation (σ) are given for dependent variables. Self-reported familiarity and reported trust were strictly latent factors, thus having a mean of 0 and standard deviation of 1. Awareness and measured expertise were parcelled (averaged) into latent factors before analysis.

    Variable Name                              µ      σ
    Explanation (0 or 1)                       -      -
    Automation (0 or 1)                        -      -
    Error (0, 0.5, or 1)                       -      -
    Self-reported Familiarity                  0      1
    Reported trust in the Dining Guru          0      1
    Measured Expertise                         8.08   1.64
    Awareness of game elements                 0.99   0.556
    Adherence to system advice                 0.33   0.25
    Optimal Moves (as ratio, 0.0 to 1.0)       0.59   0.12

Table 3: Factors determined by participant responses to subjective questions. R² reports the fit of the item to the factor. Est. is the estimated loading of the item to the factor. α is Cronbach's alpha (a measurement of reliability, see [54]). AVE is average variance extracted.

    Self-rep. Familiarity (α = 0.80, AVE = 0.59)            R²     Est.
    I am familiar with abstract trust games.                0.52   1.27
    I am familiar with the Diner's Dilemma.                 0.47   1.14
    I am familiar with the public goods game.               0.76   1.58

    Trust (α = 0.9, AVE = 0.76)                             R²     Est.
    I trusted the Dining Guru.                              0.79   1.36
    I could rely on the Dining Guru.                        0.74   1.29
    I would advise a friend to take advice from             0.75   1.37
    the Dining Guru if they played the game.
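For reference, the Cronbach's α values reported in Table 3 are computed from item variances and the variance of the summed scores. A minimal sketch (the Likert responses below are made up for illustration, not the study data):

```python
def cronbach_alpha(items):
    """items: one list of responses per question (same respondents in each).
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Hypothetical 1-7 Likert agreement ratings from five respondents
# to the three trust items of Table 3.
trust_items = [
    [7, 5, 6, 2, 4],  # "I trusted the Dining Guru."
    [6, 5, 7, 1, 4],  # "I could rely on the Dining Guru."
    [7, 4, 6, 2, 5],  # "I would advise a friend to take advice from it."
]
print(round(cronbach_alpha(trust_items), 2))
```

Because the three fabricated item vectors move together, the reliability comes out high, mirroring the α = 0.9 reported for the trust factor.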
Several training questionnaires, which could be resubmitted as many times as needed, were used to help participants learn the game. The Dining Guru was introduced as an "AI adviser" and participants learned how to access it and what its intentions were. Participants were told that the Dining Guru was not guaranteed to make optimal decisions and that choosing to take its advice was their choice. Expertise was measured just after training by assessing retention of game and user interface concepts via an eleven-item questionnaire. During the primary game phase, participants played three games of Diner's Dilemma against three configurations of simulated co-diners with varying behavior characteristics.

Independent Variables
Two levels of automation (Automation = 0, Automation = 1), two levels of explanation (Explanation = 0, Explanation = 1), and three levels of recommendation error (Error = 0, Error = 0.5, Error = 1) were manipulated between-subjects (see Figure 2 for a visual). All manipulations (three parameters, 3 × 2² = 12 manipulations) were used as between-subjects treatments in this experiment.

The explanation for the Dining Guru was designed to accurately reflect the way that it was calculating recommendations. This was to align it with other forms of "rational" explanation [32], such as collaborative filtering explanations [49]. Since the Dining Guru calculates maximum-likelihood estimates of co-diner behavior and cross-references these with the payoff matrix to produce recommendations, the explanation thus needed to contain estimates for the expected points per round of each choice. Additionally, in the manual version, in which the human player would provide reciprocation rate estimates to the Dining Guru, a text blurb appeared explaining the connection between co-diner reciprocity rates and the expected per-round average. The explanatory version relayed the expected per-round averages graphically, in a bar chart, right above the recommendation, so that participant attention would be drawn to the explanation. The explanation variable thus represents whether the system was a "black-box" (Explanation = 0) or a "white-box" (Explanation = 1) for each participant.

In the "manual" treatment (Automation = 0), the Dining Guru did not update until prompted by the user, who was required to provide estimates of co-diners' reciprocity rates. The estimates were provided by moving two sliders, which displayed and took no value until users first interacted with them. Users could freely experiment with the sliders, which means that they could be used to understand the relationship between the payoff matrix and co-diner reciprocity rates.

Three levels of error were manipulated: no-error, weak-error, and full-error. In the no-error treatment (Error = 0.0), the Dining Guru produced recommendations that could be considered flawless, which, if followed, would result in optimal moves. The weak-error (Error = 0.5) version would randomly adjust the reciprocity estimates up or down by up to 25%. For instance, if the true hotdog reciprocity rate was 65%, the Dining Guru would use a value anywhere between 40 and 90%. This could cause the Dining Guru to "flip-flop" between recommendations. Finally, the full-error (Error = 1.0) condition adjusted reciprocity estimates by up to 50% in either direction. A practical consequence of this was that the Dining Guru would flip its recommendation almost every round. The error in the recommendations was reasonably hidden from participants and indeed was only noticeable when either explanation was present or the Dining Guru was not automated.
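A recommendation-with-error pipeline of this kind can be sketched as follows. The noise ranges follow the description above, but the function names and the payoff constants (chosen to be consistent with Table 1) are our own illustration, not the study's implementation.

```python
import random

COST = {"H": 1.0, "L": 3.0}       # assumed dish costs, consistent with Table 1
QUALITY = {"H": 20.0, "L": 40.0}  # assumed dish qualities

def points(player, c1, c2):
    """Dining points: own quality over the per-diner share of the bill."""
    return QUALITY[player] / ((COST[player] + COST[c1] + COST[c2]) / 3.0)

def perturb(rate, error_level):
    """Error = 0.5 shifts a reciprocity estimate by up to +/-25%,
    Error = 1.0 by up to +/-50%; estimates are clamped to [0, 1]."""
    shift = random.uniform(-0.5, 0.5) * error_level
    return min(1.0, max(0.0, rate + shift))

def recommend(p_h, p_l, error_level=0.0):
    """Expected points per round for ordering hotdog vs. lobster, given the
    (possibly perturbed) probabilities that a co-diner reciprocates each order."""
    p_h, p_l = perturb(p_h, error_level), perturb(p_l, error_level)
    ev = {}
    for choice, p in (("H", p_h), ("L", p_l)):
        other = "L" if choice == "H" else "H"
        ev[choice] = sum(
            prob1 * prob2 * points(choice, c1, c2)
            for c1, prob1 in ((choice, p), (other, 1 - p))
            for c2, prob2 in ((choice, p), (other, 1 - p))
        )
    return max(ev, key=ev.get), ev
```

With perfect estimates and no error this reproduces the intuition behind quiz items 9 and 10: `recommend(1.0, 1.0)` favors hotdog, while `recommend(0.0, 1.0)` favors lobster; raising `error_level` makes the returned recommendation unstable from call to call.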
Figure 2: Variations of the Dining Guru, based on automation and explanation treatments.
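As a quick check of the design size behind the Figure 2 variants, the between-subjects cells can be enumerated directly (the variable names here are our own):

```python
from itertools import product

# Between-subjects factors: three error levels crossed with two automation
# and two explanation levels -> 3 * 2 * 2 = 12 treatment cells.
ERROR_LEVELS = (0.0, 0.5, 1.0)
AUTOMATION_LEVELS = (0, 1)
EXPLANATION_LEVELS = (0, 1)

cells = list(product(ERROR_LEVELS, AUTOMATION_LEVELS, EXPLANATION_LEVELS))
print(len(cells))  # 12
```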
Dependent Variables
Adherence was said to occur for each round where the user choice matched the last recommendation given by the Dining Guru. The measurement of true adherence is a difficult research challenge, since the reason for a participant's decision following a recommendation must be the recommendation itself, and not because the participant would have chosen that regardless. The mouse-over design helped to reduce some of the uncertainty about the reason for the participant's choice. The adherence measurement was scaled between 0 and 1.

Performance, measured as the ratio of moves that were considered optimal, was analyzed on a per-participant basis for the purpose of validating the Dining Guru's design. While learning effects were found in the data, they are not the focus of this paper and were not involved in our statistical tests (discussion of learning effects in the Diner's Dilemma game can be found in [53]).

Reported familiarity was chosen to quantify illusory superiority; this was done for three reasons. First, agreement questions such as "I am an expert Iterated Prisoner's Dilemma player" would not have measured the desired factor: outside of tournaments where researchers submit algorithms to play the IPD², there is, to our knowledge, no competitive community of Diner's Dilemma or IPD players. Second, the "Dunning-Kruger" effect is demonstrated by measuring "unknown-unknowns," and Likert-scale questions about familiarity measure everything from "I don't even know what this is" to "thorough mastery," as the common dictionary definition suggests. Third, the Diner's Dilemma, similar to chess, is a game about information and decision-making. For instance, you would expect someone who is intimately familiar with the rules of chess to perform better than someone who does not know how the pieces move, but this is not true for people who may report being very familiar with baseball.

A factor analysis of trust and self-reported familiarity is given in Table 3. All of these items were taken on a Likert scale (1-7), which rated agreement. Furthermore, we demonstrate the external validity of these factors by examining correlations with both measured expertise and performance – if the self-reported familiarity metric is valid, negative correlations should be found for both. All factors achieved discriminant validity using the Campbell & Fiske test [6].

Situation awareness and (true) expertise were measured using testing methodologies. All-item parcels were used for each and the residual variance of the indicator variables was freed. We used a situation-awareness global assessment test [18] during game 2 to assess situation awareness. The situation awareness questionnaire contained 5 questions related to the current game state. The situation awareness question items each contained a slider (min: 0%, max: 100%) and asked participants to estimate: their current cooperation rate (1) and the hotdog (2,3) and lobster (4,5) reciprocity rates for each co-diner. The game interface was not available at this time. The situation awareness score was calculated by first summing up the errors from each of the 5 estimation questions and then inverting the scores based on the participant with the highest error, such that higher situation awareness scores are better. Finally, true expertise was measured just after the training with the training materials removed. This test was a set of eleven questions that relate to knowledge of the game and whether the participant possessed the ability to map from the current game state to the optimal decision:
2http://lesswrong.com/lw/7f2/prisoners_dilemma_tournament_results/ (1) How much does a hotdog cost? (slider response) IUI ’19, March 17–20, 2019, Marina del Ray, CA, USA Schaffer et al.
(2) How much does a lobster cost? (slider response)
(3) What is the quality of a hotdog? (slider response)
(4) What is the quality of a lobster? (slider response)
(5) In a one-round Diner's Dilemma game (only one restaurant visit), you get the least amount of dining points when... (four options)
(6) In a one-round Diner's Dilemma game (only one restaurant visit), you get the most amount of dining points when... (four options)
(7) Which situation gets you more points? (two options)
(8) Which situation gets you more points? (two options)
(9) Suppose you know for sure that your co-diners reciprocate your Hotdog order 100% of the time and reciprocate your Lobster order 100% of the time. Which should you order for the rest of the game? (H/L)
(10) Suppose you know for sure that your co-diners reciprocate your Hotdog order 0% of the time and reciprocate your Lobster order 100% of the time. Which should you order for the rest of the game? (H/L)
(11) Suppose you know for sure that your co-diners reciprocate your Hotdog order 50% of the time and reciprocate your Lobster order 50% of the time. Which should you order for the rest of the game? (H/L)

Hypotheses

Our analysis first considered the validity of the task and study design. Most importantly, the Dining Guru without errors should perform much better than the participants.

• H1: The Dining Guru in the no-error condition outperforms the participants on average.

Next, measured expertise and reported familiarity should have consequences for performance. To validate that self-reported familiarity quantifies the illusory superiority effect, reported familiarity should also negatively correlate with measured expertise, which in turn should positively correlate with performance. Thus:

• H2: Self-reported familiarity predicts decreased performance.
• H3: Measured expertise predicts increased performance.
• H4: Self-reported familiarity predicts decreased expertise.

To answer the research questions, we used a combination of ANOVA and structural equation modeling (SEM) [55]. To address the first research question, we test the relationships among measured expertise, reported familiarity, adherence, and trust via SEM:

• H5: Reported familiarity predicts decreased adherence and trust.
• H6: Measured expertise predicts increased adherence and trust.

To address the second research question, we test for a general effect of explanation on adherence. To generate a more fine-grained picture of this effect, we break self-reported familiarity into quartiles and test where differences exist.

• H7: Explanations increase adherence regardless of self-reported familiarity.

Third, we study the relationships between system variables and their effects on automation bias. We hypothesized that the occurrence of automation bias may depend on the particular system design; for instance, making the system manual or presenting explanations may reduce automation bias. For this, automation, explanation, and error (as well as their interaction effects) are regressed onto situation awareness, adherence, and trust in the SEM for a total of 18 hypotheses.

To further investigate automation bias, we test the relationships between reported familiarity and measured expertise on situation awareness. Theories of illusory superiority and automation bias might lead us to hypothesize that over-confident users (who may be less likely to use the adviser) may have better situation awareness. Additionally, we hypothesized that explanations might have different effects on awareness at different levels of measured expertise and self-reported familiarity. To test this, we build an indicator product of the interaction effects between reported familiarity/explanation (which was 0 or 1) and measured expertise/explanation [43]. This produced the variables Familiarity · Explanation and Expertise · Explanation. Then, we test the following hypotheses:

• H8: Self-reported familiarity predicts increased situation awareness.
• H9: Situation awareness predicts decreased adherence.
• H10: Expertise · Explanation predicts increased situation awareness.
• H11: Familiarity · Explanation predicts increased situation awareness.

For these final hypotheses, we also break expertise and familiarity up into quartiles and run an ANOVA to find the particular condition where situation awareness is being impacted (see Figure 5).

6 RESULTS

This section reports the results from the online experiment. Two SEMs are used to answer the research questions: the first to validate the task (Figure 3) and the second to build a model of participant data that can provide evidence for or against each individual hypothesis (Figure 4). For the uninitiated, an SEM is essentially a model of simultaneous multiple regressions that also allows for mediation analysis and latent factor modeling (these latter two make it the best choice for testing our hypotheses). Each circle in the reported SEMs indicates a latent factor (a factor built on multiple question items to better account for variance), each square indicates an observed variable, and each arrow indicates a regression that
was fit with either a β (normalized effect size in standard deviations) or a B value for treatment variables (the effect size observed when the treatment is switched on). Co-variances are also reported (shown as bi-directional arrows), and the choice between regression and co-variance modeling was based on the relationship between the variables (effect sizes and significance values for these two choices are typically similar, and the modeling choice between the two is a question of interpretation). Our SEM models were fit using lavaan³ in R. A complete overview of SEM is given in [55]. Multiplicity control on our 29 hypotheses (11 explicit, 18 exploratory) was enforced using the Benjamini-Hochberg procedure with Q = 0.05 [3], which is recommended for SEM analysis [12]. Significance levels in this section are reported as follows: *** = p < .001, ** = p < .01, and * = p < .05.

³ http://lavaan.ugent.be/

In total, 867 participants attempted the study on Amazon Mechanical Turk (AMT) and eventually 551 completions were obtained (64% completion). Upon inspection of the pre-study data, we suspect that many of the dropouts were likely bots attempting to game our Qualtrics system, but failed when reaching the more complex game portion of the task. Participants who completed the study were paid $3.00 and spent an average of 11 minutes 24 seconds to complete the Diner's Dilemma training and 15 minutes 30 seconds to complete all three games. Participants were between 18 and 70 years of age and 54% were male. Recruitment was opened up internationally to improve the generalizability of the results, but we required workers to have 50 prior completed AMT tasks and to be fluent in English. The data was checked for satisficing by examining input patterns (e.g., marking the same value for every question item) and those users were removed, resulting in a cleaned dataset of N = 529 participants. Attention was also controlled for by our CRT metric (it is not unreasonable to assume many users of such systems will be inattentive "in the wild," thus we did not remove these users).

Figure 3: Correlations tested for validation of the task.

Task and Metric Validation

Simulations for the Dining Guru were run offline to establish its performance, which was compared with the participant data. Table 4 shows that, on average, participants performed slightly worse than the most error-prone version of the Dining Guru. This means that participants were on average better off when adhering to the Dining Guru's advice, regardless of error. This is also evidenced by the significance of the correlation between adherence and optimal moves shown in Figure 3. Thus we accept H1.

Figure 3 shows correlations in the participant data that explain the relationships between measured expertise, reported familiarity, and the number of optimal moves. A significant negative correlation was found between reported familiarity and optimal moves (β = −0.103*) and a significant positive correlation was found for measured expertise (β = 0.163***). A large negative correlation was found between measured expertise and self-reported familiarity (β = −0.304***). Thus we accept H2, H3, and H4.

Illusory Superiority

Next, we test the hypotheses related to illusory superiority. An SEM was built using the lavaan statistical package in R (Figure 4). Reported familiarity and measured expertise were used as regressors to explain adherence and trust. Significant effects were found for reported familiarity on trust (β = 0.241***) and adherence (β = −0.289). No direct effects were found for measured expertise; however, the model predicts that measured expertise does slightly increase adherence through a full mediation effect by situation awareness. This effect is also demonstrated by the weak correlation shown in Figure 3. Thus, we reject H5 and H6.

No general effect of explanation was found on adherence. However, Figure 5 demonstrates that explanations did cause a significant increase in adherence for the participants who reported being very unfamiliar with the task (from about 30% adherence to about 40%). No effect is found in the upper quartiles, but a decreasing trend in adherence can be seen as self-reported familiarity increases. Thus we reject H7.

Automation Bias

Next, we tested the effects and interactions of automation, error, and explanation to predict trust, adherence, and situation awareness. An effect of Automation · Explanation was found at the p = 0.029 level, but it did not pass our multiplicity control. Error was found to cause reduced adherence (B = −0.151**), but this effect does not hold for the automated system (B = 0.173***). Automation was also found to cause decreased situation awareness (B = −0.100*). This exploratory analysis confirmed automation bias, but also indicates that explanations may not be an effective remedy.
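The Benjamini-Hochberg multiplicity control applied to these tests can be sketched as follows. This is a minimal illustration of the standard procedure, not the study's analysis code, and the p-values in the example are placeholders:

```python
# Sketch of Benjamini-Hochberg multiplicity control (Q = 0.05).
# Placeholder p-values for illustration; not values from the study.

def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where a test survives BH control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_passing_rank = rank
    # All tests at or below that rank are declared significant.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        survives[idx] = rank <= largest_passing_rank
    return survives

print(benjamini_hochberg([0.001, 0.029, 0.04, 0.8]))
# [True, False, False, False]
```

Note how even in this small example an effect at p = 0.029 fails to survive; with 29 simultaneous tests, as in the SEM analysis, the per-test threshold is stricter still.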
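The situation-awareness scoring described earlier (sum each participant's five estimation errors, then invert against the highest-error participant so that higher scores are better) can be sketched as below. The data and the exact form of the inversion step are illustrative assumptions, not taken from the study:

```python
# Sketch of the situation-awareness scoring: sum each participant's
# five estimation errors, then invert against the worst (highest-error)
# participant so that higher scores mean better awareness.
# Hypothetical data; the study's exact inversion may differ.

def situation_awareness_scores(errors_per_participant):
    totals = [sum(errors) for errors in errors_per_participant]
    worst = max(totals)
    # The worst participant scores 0; a perfect participant scores `worst`.
    return [worst - total for total in totals]

# Three hypothetical participants, five estimation errors each (in % points).
print(situation_awareness_scores([
    [5, 0, 10, 5, 0],      # total error 20
    [40, 30, 20, 10, 50],  # total error 150 (worst, scores 0)
    [0, 0, 5, 0, 0],       # total error 5 (best, highest score)
]))
# [130, 0, 145]
```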