How the Experts Do It: Assessing and Explaining Agent Behaviors in Real-Time Strategy Games

Jonathan Dodge, Sean Penney, Claudia Hilderbrand, Andrew Anderson, Margaret Burnett Oregon State University Corvallis, OR; USA { dodgej, penneys, minic, anderan2, burnett }@oregonstate.edu

ABSTRACT
How should an AI-based explanation system explain an agent's complex behavior to ordinary end users who have no background in AI? Answering this question is an active research area, for if an AI-based explanation system could effectively explain intelligent agents' behavior, it could enable end users to understand, assess, and appropriately trust (or distrust) the agents attempting to help them. To provide insights into this question, we turned to human expert explainers in the real-time strategy domain – "shoutcasters" – to understand (1) how they foraged in an evolving strategy game in real time, (2) how they assessed the players' behaviors, and (3) how they constructed pertinent and timely explanations out of their insights and delivered them to their audience. The results provide insights into shoutcasters' foraging strategies for gleaning the information necessary to assess and explain the players; a characterization of the types of implicit questions shoutcasters answered; and implications for creating explanations using the patterns and abstraction levels these human experts revealed.

ACM Classification Keywords
H.1.2 User/Machine systems: Human Information Processing; H.5.2 Information interfaces and presentation (e.g., HCI): User Interfaces; I.2.m Artificial Intelligence: Miscellaneous.

Author Keywords
Explainable AI; Intelligent Agents; RTS Games; StarCraft; Information Foraging

INTRODUCTION
Real-time strategy (RTS) games are becoming more popular artificial intelligence (AI) research platforms. A number of factors have contributed to this trend. First, RTS games are a challenge for AI because they involve real-time adversarial planning within sequential, dynamic, and partially observable environments [21]. Second, AI advancements made in the RTS domain can be mapped to real-world combat mission planning and execution, such as an AI system trained to control a fleet of drones for missions in simulated environments [30].

People without AI training will need to understand and ultimately assess the decisions of such a system, based on what it recommends or decides to do on its own. For example, imagine "Jake," the proud owner of a new self-driving car, who needs to monitor the AI system driving his car through complex traffic like the Los Angeles freeway system at rush hour, assessing when to trust the system [15] and when to take control himself. Ideally, an interactive explanation system could help Jake assess whether and when the AI is making its decisions "for the right reasons" – in real time.

Scenarios like this are the motivation for an emerging area of research referred to as "Explainable AI," in which an automated explanation device presents an AI system's decisions and actions in a form useful to the intended audience – here, Jake. There are recent research advances in explainable AI, as we discuss in the Related Work section, but only a few focus on explaining complex strategy environments like RTS games, and fewer still draw from expert explainers. To help fill this gap, we conducted an investigation in the setting of StarCraft II, a popular RTS game [21] available to AI researchers [32].

We looked to "shoutcasters" (sportscasters for e-sports such as RTS games) like those in Figure 1. In StarCraft e-sports, two players compete while the shoutcasters provide real-time commentary. Shoutcasters are helpful to investigate for explaining AI agents in real time to people like Jake for two reasons. First, they face an assessment task similar to explaining Jake's car-driving agent to him. Specifically, they must (1) discover the actions of the player, (2) make sense of them, and (3) assess them, particularly if they discover good, bad, or unorthodox behavior. They must do all this while simultaneously constructing an explanation of their discoveries, in real time.

Figure 1. Two shoutcasters providing commentary for a professional StarCraft match.

Second, shoutcasters are expert explainers. As communication professionals, they are paid to inform an audience they cannot see or receive feedback or questions from. Hoffman & Klein [9] researched five stages of explanation, looking at how explanations are formed from observation of an event, generating one or more possible explanations, judging the plausibility of those explanations, and either resolving or extending the explanation. Their findings help to illustrate the complexity of the shoutcasters' task, due to its abductive nature of explaining the past and anticipating the future. In short, shoutcasters must anticipate and answer the questions the audience is not able to ask, all while passively watching the video stream.

Because shoutcasters explain in parallel with gathering their information, we guided part of our investigation using Information Foraging Theory (IFT) [25], which explains how people go about their information-seeking activities. It is based on naturalistic predator-prey models, in which the predator (shoutcaster) searches patches (parts of the information environment) to find prey (evidence of the players' decision process) by following cues (signposts in the environment that seem to point toward prey) based on their scent (the predator's guess at how related to the prey a cue is). IFT constructs have been used to explain and predict people's information-seeking behavior in several domains, such as understanding navigations through web sites or programming and software engineering environments [5, 7, 8, 14, 19, 22, 23, 24, 29]. However, to our knowledge, it has not been used before to investigate explaining RTS environments like StarCraft.

Using this framework, we investigated the following research questions (RQs):

RQ1 The What and the Where: What information do shoutcasters seek to generate explanations, and where do they find it?
RQ2 The How: How do shoutcasters seek the information they seek?
RQ3 The Questions: What implicit questions do shoutcasters answer, and how do they form their answers?
RQ4 The Explanations: What relationships and objects do shoutcasters use when building their explanations?

BACKGROUND AND RELATED WORK
In the HCI community, research has begun to investigate the benefits to humans of explaining AI. For "Jake" (our test pilot), improving his mental model is critical to his success, since Jake needs a reasonable mental model of the AI system to assess whether it is making decisions for the right reasons. Mental models, defined as "internal representations that people build based on their experiences in the real world," enable users to predict system behavior [20]. Kulesza et al. [13] found that those who adjusted their mental models most in response to explanations of an AI (a recommender system) were best able to customize its recommendations. Further, participants who improved their mental models the most found debugging more worthwhile and engaging.

Building upon this finding, Kulesza et al. [12] identified principles for explaining, in a "white box" fashion, how a machine learning system makes its predictions, so as to make the system more transparent to the user. In user studies with a prototype following these principles, the quality of participants' mental models increased by up to 52%, and along with these improvements came a better ability to customize the intelligent agent. Kapoor et al. [10] also showed that explaining AI increased user satisfaction, and that interacting with the explanations enabled users to construct classifiers that were more aligned with their target preferences. Bostandjiev et al.'s work on a music recommendation system [2] found that explanation led to a remarkable increase in user satisfaction with the system.

As to what people want explained about AI systems, one influential work in explainable AI has been Lim & Dey's [17] investigation into the information users demand from context-aware intelligent systems. They categorized users' information needs into various "intelligibility types" and investigated which types provided the most benefit to user understanding. Among these types were "What" questions (What did the system do?), "Why" questions (Why did the system do X?), and so on. In this paper, we draw upon these results to categorize the kinds of questions that shoutcasters' explanations answered.

Other research confirms that explanations containing certain intelligibility types make a difference in user attitudes toward the system. For example, findings by Cotter et al. [6] showed that justifying why an algorithm works (but not how it works) was helpful for increasing users' confidence in the system, but not for improving their trust. Other work shows that the relative importance of the intelligibility types may vary with the domain; for example, findings by Castelli et al. [3] in the domain of smart homes showed a strong interest in "What" questions, but little interest in the other intelligibility types.

Constructing effective explanations of AI is not straightforward, especially when the underlying AI system is complex. Both Kulesza et al. [12] and Guestrin et al. [26] point to a potential trade-off between faithfulness and interpretability in explanation. The latter group developed an algorithm that can explain, in a "black box" fashion, the predictions of any classifier in a faithful way by approximating it locally with an interpretable model. In their work, they described the fidelity-interpretability trade-off, in which making an explanation more faithful is likely to reduce its interpretability, and vice versa. However, humans manage this trade-off by accounting for many factors, such as the audience's current situation, their background, the amount of time available, and so on. One goal of the current study is to understand how expert human explainers, like our shoutcasters, manage this trade-off.

In the domain of assessing RTS intelligent agents, Kim et al. [11] invited 20 experienced players to assess the skill levels of AI bots playing StarCraft. They observed that the humans' rankings differed in several ways from a ranking computed from the bots' competition win rates, because the humans weighed certain factors, such as decision-making skill, more heavily. The mismatch between empirical results and perception scores may be because AI bots that are effective against each other proved less effective against humans.

#  | Tournament                  | Shoutcasters             | Players             | Game
1  | 2017 IEM Katowice           | ToD and PiG              | Neeb vs Jjakji      | 2
2  | 2017 IEM Katowice           | Rotterdam and Maynarde   | Harstem vs TY       | 1
3  | 2017 GSL Season 1 Code S    | Artosis and Tasteless    | Soo vs Dark         | 2
4  | 2016 WESG Finals            | Tenshi and Zeweig        | DeMuslim vs iGXY    | 1
5  | 2017 StarLeague S1 Premier  | Wolf and Brendan         | Innovation vs Dark  | 1
6  | 2016 KeSPA Cup              | Wolf and Brendan         | Maru vs Patience    | 1
7  | 2016 IEM Geonggi            | Kaelaris and Funka       | Byun vs Iasonu      | 2
8  | 2016 IEM Shanghai           | Rotterdam and Nathanias  | ShowTime vs Iasonu  | 3
9  | 2016 WCS Global Finals      | iNcontroL and Rotterdam  | Nerchio vs Elazer   | 2
10 | 2016 DreamHack Open Leipzig | Rifkin and ZombieGrub    | Snute vs ShowTime   | 3

Table 1. Summary of StarCraft 2 games studied. Please consult our supplementary materials for transcripts and links to videos.

Cheung et al. [4] studied StarCraft from a different perspective, that of non-participant spectators. Their investigations produced a set of nine personas that helped to illuminate who these spectators are and why they watch. Since shoutcasters are one of the personas, they discussed how shoutcasters affect the spectator experience and how they judiciously decide how and when to reveal different types of information, both to entertain and to inform the audience.

The closest work to our own is Metoyer et al.'s [18] investigation into the vocabulary and language structure of explaining RTS games. In their study, novices and experts acted in pairs, with the novice watching the expert play and asking questions, while the expert thought aloud and answered those questions. They developed qualitative coding schemes for the content and structure of the explanations the expert players offered. Our investigation is subtly different in that our explainers are expert communicators about the game and must anticipate audience questions on their own. However, given the pertinence of their work, we modified their code set to analyze shoutcasters' utterances.

METHODOLOGY
In order to study high quality explanations and capable players, we considered only games from professional tournaments denoted as "Premier" by TeamLiquid (http://wiki.teamliquid.net/starcraft2/Premier_Tournaments). Using these criteria, we selected 10 matches available with video on demand from professional StarCraft 2 tournaments from 2016 and 2017 (Table 1). Professional matches have multiple games, so we randomly selected one game from each match for analysis. Sixteen distinct shoutcasters appeared across the 10 videos, with two casters commentating each time. (Here, "caster pair," or "caster" or "pair" for short, differentiates our observed individuals from the population of shoutcasters as a whole.)

Shoutcasters should both inform and entertain, so they fill dead air time with jokes. Thus, we filtered the casters' utterances by relevance. To do so, two researchers independently coded 32% of the statements in the corpus as relevant or irrelevant to explaining the game. We achieved 95% inter-rater reliability (IRR), as measured by the Jaccard index. (The Jaccard index is the size of the intersection of the codes applied by the researchers divided by the size of the union.) Then, the researchers split up and coded the rest of the corpus.
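For concreteness, a minimal sketch of the Jaccard computation we used for inter-rater reliability is shown below; the coder code sets and utterance identifiers are hypothetical, not drawn from our corpus. (The same intersection-over-union measure reappears later for the co-occurrence rates in Table 5.)

```python
def jaccard(codes_a, codes_b):
    """Jaccard index: |intersection| / |union| of two coders' code sets."""
    a, b = set(codes_a), set(codes_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical example: each researcher marks utterances as "relevant".
coder_1 = {"u01", "u02", "u04", "u07"}   # utterances coder 1 marked relevant
coder_2 = {"u01", "u02", "u05", "u07"}   # utterances coder 2 marked relevant

print(f"IRR (Jaccard) = {jaccard(coder_1, coder_2):.2f}")  # 3 shared / 5 total = 0.60
```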
Research questions RQ1 and RQ2 investigated how the casters seek information onscreen, so we used IFT constructs to discover the types of information the casters sought and how they unearthed it. For RQ1 (the patches in which they sought information), we simply counted the casters' navigations among patches; changes in the display screen identified these for us automatically. For RQ2 (how they went about their information foraging), we coded the 110 instances of caster navigation by the context in which each took place, based on player actions (Building, Fighting, Moving, Scouting) or simply caster navigation. Two researchers independently coded 21% of the data in this manner, with an IRR of 80% (Jaccard). After achieving IRR, one researcher coded the remainder of the data.

For RQ3 (the implicit questions the shoutcasters answered), we coded the casters' utterances by the Lim & Dey [17] questions they answered. We added a judgment code to capture caster evaluations of the quality of actions. The complete code set is detailed in the RQ3 Results section. Using this code set, two researchers independently coded 34% of the 1024 explanations in the corpus, with 80% inter-rater reliability (Jaccard). After achieving IRR, the researchers split up the remainder of the coding.

To investigate RQ4 (explanation content), we drew content coding rules from Metoyer et al.'s [18] analysis of explaining Wargus games and added some codes to account for differences in gameplay and study structure. (For ease of presentation, in this paper we use the terms "numeric quantity" and "indefinite quantity" instead of their terms "identified discrete" and "indefinite quantity," respectively.) Two researchers independently coded the corpus, one category at a time (e.g., Objects, Actions, ...), achieving an average of 78% IRR on more than 20% of the data in each category. One researcher then finished coding the rest of the corpus. Since all data sources are public, we have provided all data and coding rules in supplementary materials to enable replicability and support further research.

RESULTS

RQ1 Results: What information do shoutcasters seek to generate explanations, and where do they find it?

Figure 2. A screenshot from an analyzed game, modified to highlight the patches available to our casters: HUD [1, bottom] (information about current game state, e.g., resources held, income rate, supply, and upgrade status); Minimap [2, lower left] (zoomed-out version of the main window); "Tab" [3, top left] (provides details on demand, currently set on "Production"); Workers killed [4, center left] (shows that 9 Red workers have died recently); Popup [5, center] (visualizations that compare player performance, usually shown briefly). Regions 3 and 5 are detailed in Figures 4 and 5.

We used two frameworks to investigate the casters' information-seeking behaviors. To situate what information casters sought in a common framework for conceptualizing intelligent agents, we turned to the Performance, Environment, Actuators, Sensors (PEAS) model [27]. We drew from Information Foraging Theory (IFT) to understand where casters did their information seeking, beginning with the places their desired information could be found. These places are called information "patches" in IFT terminology.

Table 2 shows the correspondence between PEAS constructs and the patches in the game that the casters in our data actually used. Performance measures showed assets, resources, successes, and failures, e.g., Figure 2 region 4 (showing that Blue has killed 9 of Red's workers) and region 5 (showing that Blue has killed 19 units to Red's 3, etc.). Table 2 shows that casters rarely consulted Performance measures, especially those that examined past game states. However, they discussed basic performance measures available in the HUD (Figure 2 region 1), which contained present-state information, e.g., resources held or upgrade status.

The Environment, or where the agent is situated, corresponds to the game state (map, units, structures, etc.), shown in the main window in Figure 2. The Environment mattered for ambient awareness, but casters navigated to Sensor and Actuator information most actively, so we turn to those constructs next.

Sensors helped the agent collect information about the environment; in our domain, they correspond to the local vision area provided by individual units. Figure 2 region 2 (Minimap) shows a "bird's eye view" of the portion of the environment observable by the Sensors. Casters used patches containing information about Sensors very often, with the Minimap and Vision Toggle being among the most used patches in Table 2. The casters had "superpowers" with respect to Sensors (and Performance measures): their interface allowed full observation of the environment, whereas the players could only partially observe it. As the only ways for casters to peer through the players' sensors, the casters used the Minimap and the Vision Toggle extensively.

Actuators were the means for the agents to interact with their environment, such as building a unit. Figure 2 region 3 (Production tab) shows some of the actuators the player was using, namely that Blue was building 5 types of objects, whereas Red was building 8. Casters almost always kept visualizations of actions in progress on display. RTS actions have a duration, meaning that when a player took an action, time passed before its consequence was realized. The Production tab's popularity was likely due to the fact that it is the only stable view of information about actuators and their associated actions.

In fact, prior to one game in our corpus, Pair 3 had this exchange, which demonstrated their perception of the Production tab's importance to doing their job:

Pair 3a: "What if we took someone who knows literally nothing about StarCraft, just teach them a few phrases and what everything is on the production tab?"
Pair 3b: "Oh, I would be out of a job."

Table 2. Description, classification, and usage rates of the patches and enrichment operations we observed casters using. Each patch is classified by the part of the PEAS model it illuminates best, whether it examines past or present game states, and the degree to which it aggregates data in its visualization; usage counts are given in total, along with the breakdown of counts across caster pairs. Note that there are additional patches passively available (Main Window and HUD) which do not have navigation properties.

Performance patches (high aggregation):
- Units Lost popup (Past; total usage 6; per-pair counts 2, 2, 1, 1): shows count and resource value of the units each player has lost.
- Units Lost tab (Past; 5; 1, 1, 1, 1, 1): same as above, but as a tab.
- Income tab (Present; 2; 1, 1): provides resource gain rate.
- Income popup (Present; 2; 1, 1): shows each player's resource gain rate and worker count.
- Army tab (Present; 1; 1): shows supply and resource value of currently held non-worker units.
- Income Advantage graph (Past; 1; 1): shows time series data comparing resource gain rate.
- Units Killed popup (Past; 1; 1): essentially the opposite of the Units Lost popup.

Environment patches (low aggregation):
- Units tab (Present; 51; 1, 2, 2, 10, 1, 13, 20, 2): shows currently possessed units.
- Upgrades tab (Present; 5; 1, 3, 1): like the Units tab, but for upgrades to units and buildings.
- Structures tab (Present; 2; 1, 1): like the Units tab, but for buildings.

Actuator patches:
- Production tab (Present; low aggregation; the preferred, by-choice "always-on" tab, so navigations were not counted): shows the units, structures, and upgrades that are in progress, i.e., started but not yet finished.

Sensor patches:
- Minimap (Present; medium aggregation; too many navigations to count): shows a zoomed-out map view.
- Vision Toggle (Present; low aggregation; 36; 5, 8, 1, 2, 1, 5, 1, 7, 5, 1): shows only the vision available to one of the players.

Implications for an interactive explainer
Abstracting beyond the StarCraft components to the PEAS model revealed a pattern in the casters' behaviors with implications for future explanation systems, which we characterize as: "keep your Sensors close, but your Actuators closer." This aligns with other research showing that real-time visualization of agent actions can improve system transparency [33].

However, these results contrast with many of today's explanation systems, which tend to prioritize Performance measures; the Actuators and Sensors seemed to form the very core of these expert explainers' information seeking for presentation to their audience. Our results instead suggest that an explanation system should prioritize useful, readily accessible information about what an agent did or can do (Actuators) and about what it can see or has seen (Sensors).
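To make the "keep your Sensors close, but your Actuators closer" pattern concrete, here is a minimal sketch of how an explanation system might tag its information sources with PEAS categories and surface Actuator and Sensor patches first. The patch names mirror Table 2, but the priority ordering and the code itself are our illustration, not part of StarCraft or of any existing explanation system.

```python
from enum import Enum

class PEAS(Enum):
    PERFORMANCE = 1
    ENVIRONMENT = 2
    ACTUATOR = 3
    SENSOR = 4

# Patch -> PEAS category, following Table 2.
PATCHES = {
    "Production tab": PEAS.ACTUATOR,
    "Minimap": PEAS.SENSOR,
    "Vision Toggle": PEAS.SENSOR,
    "Units tab": PEAS.ENVIRONMENT,
    "Units Lost tab": PEAS.PERFORMANCE,
    "Income tab": PEAS.PERFORMANCE,
}

# Illustrative display priority suggested by the casters' behavior:
# Actuators and Sensors first, Performance measures last.
PRIORITY = [PEAS.ACTUATOR, PEAS.SENSOR, PEAS.ENVIRONMENT, PEAS.PERFORMANCE]

def ordered_patches():
    """Return patch names sorted by the illustrative PEAS priority."""
    return sorted(PATCHES, key=lambda name: PRIORITY.index(PATCHES[name]))

print(ordered_patches())
# ['Production tab', 'Minimap', 'Vision Toggle', 'Units tab', 'Units Lost tab', 'Income tab']
```

A real system would, of course, temper such a static ordering with the timing cues discussed next under RQ2.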

RQ2 Results: The How: How do shoutcasters seek the information they seek?
Section RQ1 discussed the What and the Where (i.e., the content casters sought and the locations where they sought it). We then considered how they decided to move among these places.

The casters' foraging moves seemed to follow a common foraging "loop" through the available information patch types: an Actuators-Environment-Performance loop with occasional forays over to Sensors (Figure 3). Specifically, the casters tended to start at the "always-on" Actuator-related patches showing the current state's actions in progress; then, when something triggered a change in their focus, they checked the Environment for current game state information and occasionally Performance measures of past states. If they needed more information along the way, they went to the Sensors to see through a player's eyes. We will refer to this process as the "A-E-P+S loop."

Figure 3. The A-E-P+S loop was a common information foraging strategy some casters used in foraging for agent behavior. It starts at the Actuators and returns there throughout the foraging process; if a caster interrupted the loop, they usually did so to return to the Actuators.

To help derive what caused casters to leave Actuator patches, which seemed to have so much importance to them, we turn to Information Foraging Theory (IFT), which explains why people (information predators) leave one patch to move to another as a cost/benefit decision, based on the value of information in the patch a predator is already in versus the value per cost of going to another patch [25]. Staying in the same patch is generally the least expensive, but when there is less value to be gained by staying versus moving to another patch, the predator moves to the other patch. However, the predator is not omniscient: decisions are based upon the predator's perception of the cost and value that other patches will actually deliver. The casters formed these perceptions from both their prior experience with different patch types [23] and from cues (signposts in their information environment) that provided concise information about content available in other patches. For the casters, certain types of cues tended to trigger a move.
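For readers less familiar with IFT, a minimal sketch of this cost/benefit rule is below; the value and cost numbers are hypothetical perceptions a forager might hold, not quantities we measured.

```python
def should_move(value_here, value_there, cost_of_move):
    """IFT-style patch decision: move when the perceived value of another
    patch, discounted by the cost of getting there, beats staying put."""
    return (value_there - cost_of_move) > value_here

# Hypothetical perceptions: little new value left in the Production tab,
# a combat cue suggests high value in the Units tab, navigation is cheap.
print(should_move(value_here=1.0, value_there=5.0, cost_of_move=0.5))  # True
```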

Figure 4. The Units Lost tab (left image) shows the number of units lost and their total value, in terms of resources spent, for both players. In this example from Pair 2, we see that Blue Player (top) has lost merely 2800 minerals worth of units so far in this game, while Red has lost more than 7000. The Units Killed popup (right image) allows shoutcasters to quickly compare player performance via a "tug-of-war" visualization. In this example from Pair 1, we see that Blue Player (left) has killed 19 units, while Red has killed merely 3. The main difference between these two styles of visualization is that the tab offers more options and information depth to "drill down" into.

Figure 5. The Production tab, showing the build actions currently in progress for each player. Each unit/structure type is represented by a glyph (which serves as a link to navigate to that object), provided a progress bar for duration, and given the number of objects of that type. Thus, we can see that Blue Player (top row) is building 5 different types of things, while Red (bottom row) is building 4 types of things. The Structures, Upgrades, and Units tabs look fairly similar to the Production tab.

Impending combat was the most common cue triggering a move from the Actuators type (Production tab) to the Environment type (Units tab), i.e., from A to E in the A-E-P+S loop. In Figure 3, the cue was co-located opposing units, indicative of imminent combat, which led to caster navigation to a new patch to illuminate the environment. In fact, combat cues triggered navigations to the Units tab most frequently, accounting for 30 of the 51 navigations there (Table 2).

Interestingly, this cue type was different from the static cues most prior IFT research has used. In previous IFT investigations, cues tended to be static decorations (text or occasionally images) that label a navigation device, like a hyperlink or button that leads to another information patch. In contrast, cues like the onset of combat are dynamic and often did not provide an affordable direct navigation. However, they were still considered cues because they "provide users with concise information about content that is not immediately available" [25]. In the case of combat, they suggested high value in another location, namely the Units tab.

Combat ending was a dynamic cue that triggered a move to a Performance measure. Of the 13 navigations to a past-facing Performance measure (via tab or popup), 10 occurred shortly after combat ended, as a form of "after-action review." Occasionally, the shoutcasters visited other Performance patches, such as the Income, Units Lost, and Army tabs, to demonstrate reasons why a player had accrued an in-game lead, or the magnitude of that lead (7 navigations). However, signs of completed fighting were the main cues for visiting a Performance patch.

The most common detour out of the A-E-P part of the loop to a Sensor patch was to enrich the information environment via the Vision Toggle (36 navigations). The data did not reveal exactly what cue(s) led to this move, but the move itself had a common theme: to assess scouting operations. The casters used the Vision Toggle to allow themselves to see the game through the eyes of only one of the players, but their default behavior was to view the game with ALL vision, which gave them the ability to observe both players' units and structures simultaneously. Toggling the Vision Sensor in this way enabled them to assess what information was or had been gathered by each player via their scouting actions (29 of the 36 total Vision Toggles), since an enemy unit would only appear to a player's sensors if that player had a friendly unit (e.g., a scout) nearby. Toggling the Vision Sensor was the second most common patch move.

Besides following cues, IFT includes another foraging operation: enriching the information environment to make it more valuable or cost-efficient [25]. The aforementioned Vision Toggle was one example of this; another was when casters added information visualizations derived from the raw data, like Performance measure popups or other basic visualizations. Two examples of the data obtained through this enrichment are shown in Figure 4.

These Performance measures gave the shoutcasters at-a-glance information about the ways one player was winning. For example, the most commonly used of them, the Units Lost tab (Figure 4), showed the number of units lost and their total value in terms of resources spent. This measure achieves "at a glance" by aggregating all the data samples together into a sum; derived values like this allow a visualization to scale to large data sets [28]. However, Table 2 indicates that the lower data aggregation patches were more heavily used. As Figure 5 shows, the casters used the Production tab to see units grouped by type, so type information was maintained and only positional data was lost. This contrasts with the Minimap (medium aggregation), in which type information is discarded but positional information is maintained at a lower granularity. The casters used Performance measure patches primarily to understand present-state data (HUD), but these patches were also the only way to access past-state information (Table 2).

Implications for an interactive explainer
These results have several implications for automated explanation systems in this domain. First, the A-E-P+S loop and how the casters traversed it reveal priority and timing implications for automated explanation systems. For example, the cues that led the casters to switch to different information patches could also cue an automated system about the need to make different information available at appropriate times. Our casters showed a strong preference for Actuator information as a "steady state" visualization, but preferred Performance information upon conclusion of a subtask.

Viewing the casters' behaviors through the dual lens of PEAS + IFT has implications for not only the kinds of patches an explanation system would need to provide, but also the cost to users of not providing these patches in a readily accessible format. For example, PEAS + IFT revealed a costly foraging problem for the casters due to the relative inaccessibility of some Actuator patches: in StarCraft, there is no easily accessible mechanism by which they could navigate to an Actuator patch with fighting or scouting actions in progress.

Instead, the only way the casters could get access to these actions was via painstaking camera placement. To accomplish this, the casters made countless navigations to move the camera using the Minimap, traditional scrolling, or tabs with links to the right unit or building. But despite all these navigation affordances, sometimes the casters were unable to place the camera on all the actions they needed to see.

For example, at one point when Pair 4 had the camera on a fight at Xy's base, a second fight broke out at DeMuslim's base, which they completely missed:

Pair 4a: "Xy actually killed the 3rd base of DeMuslim."
Pair 4b: "Oh my god, you're right Alex."
Pair 4a: "Yeah, it was killed during all that action."
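As one way of operationalizing these timing implications, the sketch below maps the dynamic cues we observed (combat starting, combat ending, scouting underway) to the patch an explanation interface might foreground, with Actuator information as the default "steady state" view. The cue names and the view_for function are our own illustration, not an existing StarCraft or explanation-system API.

```python
# Illustrative cue -> view policy distilled from the A-E-P+S loop.
CUE_TO_VIEW = {
    "combat_starting": "Units tab",        # A -> E: co-located opposing units
    "combat_ended": "Units Lost tab",      # E -> P: after-action review
    "scouting_underway": "Vision Toggle",  # detour to S: what can this player see?
}

def view_for(cues, default="Production tab"):
    """Pick the patch to foreground; fall back to the steady-state Actuator view."""
    for cue in cues:
        if cue in CUE_TO_VIEW:
            return CUE_TO_VIEW[cue]
    return default

print(view_for(["combat_starting"]))  # Units tab
print(view_for([]))                   # Production tab
```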

RQ3 Results: What implicit questions do shoutcasters answer and how do they form their answers?
For the first two research questions, we considered how the shoutcasters gathered and assessed information. We now shift focus to the explanations themselves.

Much of the prior research into explaining agent behavior starts at some kind of observable effect and then explains something about that effect or its causes [10, 12, 16, 31]. In RTS games, most such observable effects are the result of player actions, and recall from RQ1 that the casters spent most of their information-gathering effort examining the players' Actuators to discover and understand actions.

The casters used the information they gained to craft explanations answering implicit questions (i.e., questions their audience "should be" wondering) about player actions. Thus, drawing from prior work about the nature of questions people ask about AI, we coded the casters' 1024 explanations using the Lim & Dey "intelligibility types" [17].

Figure 6. Frequency of Lim & Dey questions answered by casters, with one line per caster pair. The Y-axis represents the percentage of utterances that answered each category of question (X-axis). Note how consistently the casters structured their answers.

The shoutcasters were remarkably consistent (Figure 6) in the types of implicit questions they answered. As Table 3 sums up, casters overwhelmingly chose to answer What, with What-could-happen and How-to high on their list. (The total is greater than 1024 because explanations answered multiple questions and/or fit into multiple categories.) These results surprised us. Whereas Lim & Dey [16] found that Why was the most demanded explanation type from users, the casters rarely provided Why answers. More specifically, in the Lim & Dey study, approximately 48 of 250 participants (19%) demanded a Why explanation; in contrast, only 27 of the casters' 1024 utterances (approximately 3%) were Why answers.

Discussion and implications for an interactive explainer
Why so few Whys? Should an automated explainer, like our shoutcasters, eschew Why explanations in favor of What?

One possibility is that the casters delivered exactly what their audience wanted, and thus the casters' distribution of explanation types was well chosen. After all, the casters were experts paid to provide commentary for prestigious tournaments, so they would know their audience well. The expertise level of the audience may have been fairly high, because the tournament videos were available only on demand (as opposed to broadcast like some professional sports) at websites that casual audience members may not even know about. If a well-informed audience expected the players to do exactly what they did, their expectations would not be violated, which, according to Lim & Dey, suggests less demand for Why [16]. This suggests that the extent to which an automated explainer needs to emphasize Why explanations may depend on both the expertise of the intended audience, which drives their expectations, and the agent's competence, which drives failures to meet reasonable expectations.

However, another possibility is that the audience really did want Why explanations, but the casters rarely provided them because of the time they required, both the casters' time and the audience's. The shoutcasters explained in real time as the players performed their actions. It takes time to understand the present, predict the future, and link present to future, and spending time in these ways reduces the time available for explaining interesting activities happening in the present. The corpus showed casters interrupting themselves and each other as new events transpired, as they tried to keep up with the time constraints. This also has implications for the audience's workflow, because it takes time for the audience to mentally process shoutcasters' departures from the present, particularly when interesting actions continuously occur.

Even more critical to an explanation system, Why questions also tend to require extra effort (cognitive or computing resources), because they require connecting two time slices:

Pair 10: "After seeing the first phoenix and, of course, the second one confirmed, Snute is going to invest in a couple spore crawlers."

Code | Freq | Description | Example
What | 595 | What the player did or anything about game state | "The liberators are moving forward as well"
What-could-happen | 376 | What the player could have done or what will happen | "Going to be chasing those medivacs away"
How-to | 233 | Explaining rules, directives, audience tips, high level strategies | "He should definitely try for the counter attack right away"
*How-good/bad-was-that-action | 112 | Evaluation of player actions | "Very good snipe there for Neeb"
Why-did | 27 | Why the player performed an action | "...that allowed Dark to hold onto that 4th base, it allowed him to get those ultralisks out"
Why-didn't | 6 | Why the player did not perform an action | "The probe already left a while ago, so we knew it wasn't going to be a pylon rush"

Table 3. Utterance type code set, slightly modified from the schema proposed by Lim & Dey. The asterisk denotes the code that we added, How-good/bad-was-that-action, because the casters judged actions based on their quality.

In this example, the casters had to connect past information (scouting the phoenix, a flying unit) with a prediction of the future (investing in spore crawlers, an air defense structure).

Answering Why-didn't questions was even rarer than answering Why questions (Table 3). Like Why questions, Why-didn't questions required the casters to make a connection between a previous game state and a potential current or future game state. For example, Pair 2: "The probe already left a while ago, so we knew it wasn't going to be a pylon rush." The rarity of Why-didn't answers is consistent with the finding that understanding a Why-didn't explanation requires even more mental effort than a Why explanation [17]. As for an interactive explanation system, supporting Why questions requires solving both a temporal credit assignment problem (determining the effect of an action taken at a particular time on the outcome) and a structural one (determining the effect of a particular system element on the outcome). See [1] for an accessible explanation of these problems.

The casters found a potentially "satisficing" approximation of Why: a combination of What and What-could-happen, the two most frequent explanation types. Their What answers explained what the player did, what happened in the game, and descriptions of the game state. These were all things happening in the present, and did not require the additional cognitive steps needed to answer Why or Why-didn't, which may have contributed to What's high frequency. Further, the audience needed this kind of "play-by-play" information to stay informed about the game's progression; for example, Pair 4: "This one hero, marine, is starting to kill the vikings." When adding on What-could-happen, casters paired What with what the player will or could do, i.e., a hypothetical outcome. For example, Pair 1: "...if he gets warning of this he'll be able to get back up behind his wall in." Although answering the question What-could-happen required predicting the future, it did not also require the casters to tie together information from past and future.

The other two frequent answers, How-good/bad-was-that-action and How-to, also sometimes contained "why" information. For How-good/bad-was-that-action, casters judged an action, e.g., Pair 1: "Nice maneuver from Jjakji, he knows he can't fight Neeb front on right now, he needs to go around the edges." For How-to, casters gave the audience tips and explained high-level strategies. For example, consider this rule-like explanation, which implies the reason "why" the player used a particular army composition: Pair 10: "Roach ravager in general is really good..."

The next rule-like How-to example is an even closer approximation to "why" information. Pair 8: "Obviously when there are 4 protoss units on the other side of the map, you need to produce more zerglings, which means even fewer drones for Iasonu." In this case, the casters are giving a rule: given a general game state (protoss units on their side of the map), the player should perform an action (produce zerglings). But the example does more; it also implies a Why answer to the question "Why isn't Iasonu making more drones?" Since this implied answer simply relates the present to a rule or best practice, it was produced at much lower expense than a true Why answer that required tying past events to the present.

Mechanisms casters used to circumvent the need for disruptive and resource-intensive Why explanations, such as using How-to, may also be ways to alleviate the same problems in explanation systems.

RQ4 Results: What relationships and objects do shoutcasters use when building their explanations?
To inform future explanation systems' content with expert explanations, we examined the patterns of nouns, verbs, and adjectives/adverbs in these professionally crafted explanations, drawing upon a code set from prior work [18] (see the Methodology section). Table 4 shows how much the caster pairs used each of these types of content, grouping the objects (nouns) in the first group of columns, then actions (verbs), and then properties (adjectives and adverbs). Table 5 shows how the casters' explanations used these concepts together, i.e., which properties they paired with which objects and actions.

The casters' explanation sentences tended to be noun-verb constructions, so we began with the nouns. The most frequently described objects were fighting object, production object, and enemy, with frequencies of 53%, 40%, and 9%, respectively, as shown in Table 4. (This is similar to results from [18], where production, fighting, and enemy objects were the three most popular object subcodes.)

Table 4. Occurrence frequencies for each code, as a percent of the total number of utterances in the corpus. From left to right: Object (pink), Action (orange), Spatial (green), Temporal (yellow), and Quantitative (blue) codes. The casters were consistent about the kinds of content they rarely included, but inconsistent about the kinds of content they most favored.

As to the actions ("verbs"), the casters mainly discussed fighting (40%) and building (23%). It is not surprising that the casters frequently discussed fighting, since combat skills are important in StarCraft [11], and producing is often a prerequisite to fighting. This may suggest that, in RTS environments, an explanation system may be able to focus on only the most important subset of actions and objects, without needing to track and reason about most of the others.

The casters were quite strategic in how they put these nouns and verbs together with properties. They used particular properties with these nouns and verbs to paint the bigger picture of how the game was going for each player, and how that tied to the players' strategies. We illustrate in the next subsections a few of the ways casters communicated about player decisions, succinctly enough for real time.

"This part of the map is mine!": Spatial properties
RTS players claim territory in battles with the arrangement of their military units, e.g.:

Pair 3: "He's actually arcing these roaches out in such a great way so that he's going to block anything that's going to try to come back."

As the arrangement column of Table 5 shows, the objects used most with arrangement were fighting objects (12%, 72 instances) and enemy (10%, 26 instances). Note that arrangement is very similar to point/region, but at a smaller scale; arrangement of production object, such as exactly where buildings are placed in one's base, appeared to be less significant, co-occurring only 5% of the time.

The degree to which an RTS player wishes to be aggressive or passive is often evident in their choice of what distance to keep from their opponent, and the casters often took this into account in their explanations. One example of this was the evaluation of potential new base locations.

Pair 5: "...if he takes the one [base] that's closer that's near his natural [base], then it's close to Innovation so he can harass."

Here, the casters communicated the control of parts of the map by describing bases as regions, and then relating two regions with a distance. The magnitude of that distance then informed whether the player was able to attack more easily. Of the casters' utterances that described distance along with production object, 27 out of 44 referred to the distance between bases or to moving to/from a base.

"When should I...": Temporal properties
Casters' explanations often reflected players' priorities for allocating limited resources. One way they did so was by using speed properties: Pair 4: "We see a really quick third [base] here from XY, like five minutes third." Since extra bases provide additional resource gathering capacity, the audience could infer that the player intended to follow an "economic" strategy, as those resources could have otherwise been spent on military units or upgrades. This contrasts with the following example, Pair 8: "He's going for very fast lurker den...", which indicated the player's intent to follow a different strategy: unlocking stronger units (lurkers). Speed co-occurred with building/producing most often (12%, 36 instances).
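A small sketch of how an explanation system might exploit such pairings is below: it maps a (property, object) pair onto the strategy the pairing suggests, echoing the casters' readings of "quick third base" and "very fast lurker den." The pairs, labels, and read_signal function are illustrative assumptions distilled from the examples above, not a validated model.

```python
# (property, object) -> strategy the pairing suggests (illustrative only).
SIGNALS = {
    ("quick", "expansion base"): "economic strategy: extra resource income",
    ("fast", "tech structure"): "tech strategy: unlock stronger units",
    ("close", "enemy base"): "aggressive posture: harassment is easy",
}

def read_signal(prop, obj):
    """Return the strategy a property/object pairing hints at, if any."""
    return SIGNALS.get((prop, obj), "no strategy inference")

print(read_signal("quick", "expansion base"))
# economic strategy: extra resource income
```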

"Do I care how many?": Quantitative properties
We found it surprising how often the casters described quantities without numbers. In fact, the casters often did not even include type information when they described the players' holdings, instead focusing on comparative properties (Table 5). For example, Pair 1: "There is too much supply for him to handle. Neeb finalizes the score here after a fantastic game." Here, "supply" is so generic that we do not even know what kind of things Neeb had, only that he had "too much" of it.

In contrast, when the casters discussed cheap military units, like "marines" and "zerglings," they tended to provide type information, but about half of their mentions still included no precise numbers. Perhaps it was a matter of the high cost of getting that information: cheap units are often built in large quantities, so deriving a precise quantity is often very tedious. Further, adding one weak, cheap unit has little impact on army strength, so getting a precise number may not have been worthwhile, i.e., the value of knowing precise quantities is low. To illustrate, consider the following example, which quantified the army size of both players vaguely, using indefinite quantity properties: Pair 6: "That's a lot of marines and marauders and not enough stalkers."

In the RTS domain, workers are a very important unit. Consistent with this importance, workers are the only unit whose deaths the casters were automatically alerted to (Figure 2, region 4), and they are also visible at a glance on the HUD (Figure 2, region 1). Correspondingly, the casters often gave precise quantities of workers (a production object). Workers (workers, drones, SCVs, and probes) had 46 co-occurrences with numeric quantities, but only 12 with indefinite quantities (e.g., lot, some, few). Pair 2: "...it really feels like Harstem is doing everything right, and [yet] somehow ended up losing 5 workers."

Table 5. Co-occurrence matrix. Across rows: Object (pink, top rows) and Action (orange, bottom rows) codes. Across columns: Spatial (green, left), Temporal (yellow, center), and Quantitative (blue, right) codes. Co-occurrence rates were calculated by dividing the intersection of the subcodes by the union.

Implications for an interactive explainer
These results have particularly important implications for interactive explanation systems with real-time constraints. Namely, they suggest that an effective way to communicate about strategies and tactics is to modify the critical objects and actions with particular properties that suggest strategies. This not only affords a succinct way to communicate about strategies and tactics (fewer words), but also a lighter load for both the system and the audience than attempting to build and process a rigorous explanation of strategy.

Specifically, spatial properties can communicate beyond the actual properties of objects to strategies themselves; for example, casters used distance to point out plans to attack or defend. Temporal properties can be used in explanations of strategies when choices in resource allocation determine the available strategies.

Finally, an interactive explanation system could use the quantitative property results to help ensure alignment in the level of abstraction used by the human and the system. For example, a player can abstract a quantity of units into a single group or think of them as individual units. Knowing the level of abstraction that human players use in different situations can help an interactive explanation system choose the level of abstraction that will meet human expectations. Using properties in these strategic ways may enable an interactive explanation system to meet its real-time constraints while at the same time improving its communicativeness to the audience.

CONCLUSION
The results of our study suggest that explaining intelligent agents to humans has much to gain from looking to human experts. In our case, the expert explainers, RTS shoutcasters, revealed implications for what, when, and how human audiences of such systems need explanations, and for how real-time constraints can come together with explanation-building strategies. Among the results we learned were:

RQ1 Investigating the what's and where's of the casters' real-time information foraging to assess and understand the players showed that the most commonly used patches of the information environment were the Actuators ("A" in the PEAS model). This suggests that, in contrast to today's explanation systems, which tend to present mostly Performance measures, explanations should consider presenting more from the Actuators and Sensors.

RQ2 The how's of the casters' foraging revealed a common pattern, which we termed the A-E-P+S loop, and the most common cues and triggers that led shoutcasters to move through this loop. Future explanation systems may be well-served to prioritize and recommend explanations according to this loop and its triggers.

RQ3 As model explainers, the casters revealed strategies for "satisficing" with explanations that may not have precisely answered all the questions the audience had in mind, but were feasible given the time and resource constraints in effect when comprehending, assessing, and explaining, all in real time as play progresses. These strategies may be likewise applicable to interactive explanation systems.

RQ4 The detailed contents of the casters' explanations revealed patterns of how they paired properties ("adjectives and adverbs") with different objects ("nouns") and actions ("verbs"). Interactive explanation systems may be able to leverage these patterns to communicate succinctly about an agent's tactics and strategies.

Ultimately, both shoutcasters' and explanation systems' jobs are to improve the audience members' mental models of the agents' behavior. As Cheung et al. [4] put it, "...commentators are valued for their ability to expose the depth of the game." Hopefully, future explanation systems will be valued for the same reason.

ACKNOWLEDGMENTS
Removed for anonymous review.

REFERENCES
1. Adrian K. Agogino and Kagan Tumer. 2004. Unifying temporal and structural credit assignment problems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2. IEEE Computer Society, 980–987.
2. Svetlin Bostandjiev, John O'Donovan, and Tobias Höllerer. 2012. TasteWeights: A visual interactive hybrid recommender system. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 35–42.
3. Nico Castelli, Corinna Ogonowski, Timo Jakobi, Martin Stein, Gunnar Stevens, and Volker Wulf. 2017. What happened in my home?: An end-user development approach for smart home data visualization. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 853–866.
4. Gifford Cheung and Jeff Huang. 2011. StarCraft from the stands: Understanding the game spectator. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11). ACM, New York, NY, USA, 763–772. DOI: http://dx.doi.org/10.1145/1978942.1979053
5. Ed H. Chi, Peter Pirolli, Kim Chen, and James Pitkow. 2001. Using information scent to model user information needs and actions and the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 490–497.
6. Kelley Cotter, Janghee Cho, and Emilee Rader. 2017. Explaining the news feed algorithm: An analysis of the "News Feed FYI" blog. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 1553–1560.
7. Scott D. Fleming, Chris Scaffidi, David Piorkowski, Margaret Burnett, Rachel Bellamy, Joseph Lawrance, and Irwin Kwan. 2013. An information foraging theory perspective on tools for debugging, refactoring, and reuse tasks. ACM Transactions on Software Engineering and Methodology (TOSEM) 22, 2 (2013), 14.
8. Wai-Tat Fu and Peter Pirolli. 2007. SNIF-ACT: A cognitive model of user navigation on the world wide web. Human-Computer Interaction 22, 4 (2007), 355–412.
9. Robert R. Hoffman and Gary Klein. 2017. Explaining explanation, part 1: Theoretical foundations. IEEE Intelligent Systems 32, 3 (2017), 68–73.
10. Ashish Kapoor, Bongshin Lee, Desney Tan, and Eric Horvitz. 2010. Interactive optimization for steering machine classification. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1343–1352.
11. Man-Je Kim, Kyung-Joong Kim, SeungJun Kim, and Anind K. Dey. 2016. Evaluation of artificial intelligence competition bots by experienced human players. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 1915–1921.
12. Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces. ACM, 126–137.
13. Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. 2012. Tell me more?: The effects of mental model soundness on personalizing an intelligent agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1–10.
14. Sandeep Kaur Kuttal, Anita Sarma, and Gregg Rothermel. 2013. Predator behavior in the wild web world of bugs: An information foraging theory perspective. In Visual Languages and Human-Centric Computing (VL/HCC), 2013 IEEE Symposium on. IEEE, 59–66.
15. Jason Levine. 2017. Americans are right not to trust self-driving cars. (2017).
16. Brian Y. Lim and Anind K. Dey. 2009. Assessing demand for intelligibility in context-aware applications. In Proceedings of the 11th International Conference on Ubiquitous Computing. ACM, 195–204.
17. Brian Y. Lim, Anind K. Dey, and Daniel Avrahami. 2009. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2119–2128.
18. Ronald Metoyer, Simone Stumpf, Christoph Neumann, Jonathan Dodge, Jill Cao, and Aaron Schnabel. 2010. Explaining how to play real-time strategy games. Knowledge-Based Systems 23, 4 (2010), 295–301.
19. Nan Niu, Anas Mahmoud, Zhangji Chen, and Gary Bradshaw. 2013. Departures from optimality: Understanding human analyst's information foraging in assisted requirements tracing. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 572–581.
20. Donald A. Norman. 1983. Some observations on mental models. Mental Models 7, 112 (1983), 7–14.
21. S. Ontañón, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss. 2013. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games 5, 4 (Dec 2013), 293–311. DOI: http://dx.doi.org/10.1109/TCIAIG.2013.2286295
22. Alexandre Perez and Rui Abreu. 2014. A diagnosis-based approach to software comprehension. In Proceedings of the 22nd International Conference on Program Comprehension. ACM, 37–47.
23. David Piorkowski, Scott D. Fleming, Christopher Scaffidi, Margaret Burnett, Irwin Kwan, Austin Z. Henley, Jamie Macbeth, Charles Hill, and Amber Horvath. 2015. To fix or to learn? How production bias affects developers' information foraging during debugging. In Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on. IEEE, 11–20.
24. David Piorkowski, Austin Z. Henley, Tahmid Nabi, Scott D. Fleming, Christopher Scaffidi, and Margaret Burnett. 2016. Foraging and navigations, fundamentally: Developers' predictions of value and cost. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 97–108.
25. Peter Pirolli. 2007. Information Foraging Theory: Adaptive Interaction with Information. Oxford University Press.
26. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.
27. Stuart J. Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach (2nd ed.). Pearson Education.
28. Robert Spence. 2007. Information Visualization: Design for Interaction (2nd ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
29. Sruti Srinivasa Ragavan, Sandeep Kaur Kuttal, Charles Hill, Anita Sarma, David Piorkowski, and Margaret Burnett. 2016. Foraging among an overabundance of similar variants. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 3509–3521.
30. Katia Sycara, Christian Lebiere, Yulong Pei, Donald Morrison, and Michael Lewis. 2015. Abstraction of analytical models from cognitive models of human control of robotic swarms. In International Conference on Cognitive Modeling. University of Pittsburgh.
31. Joe Tullio, Anind K. Dey, Jason Chalecki, and James Fogarty. 2007. How it works: A field study of non-technical users interacting with an intelligent system. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 31–40.
32. Oriol Vinyals. 2016. DeepMind and Blizzard to release StarCraft II as an AI research environment. (2016). https://deepmind.com/blog/deepmind-and-blizzard-release-starcraft-ii-ai-research-environment/
33. Robert H. Wortham, Andreas Theodorou, and Joanna J. Bryson. 2017. Improving robot transparency: Real-time visualisation of robot AI substantially improves understanding in naive observers. In IEEE RO-MAN 2017 (August 2017). http://opus.bath.ac.uk/55793/
