Natural Language Interaction with Explainable AI Models

Arjun R Akula1, Sinisa Todorovic2, Joyce Y Chai3 and Song-Chun Zhu4
1,4University of California, Los Angeles  2Oregon State University  3Michigan State University
[email protected] [email protected] [email protected] [email protected]

arXiv:1903.05720v2 [cs.AI] 7 Jul 2019

Abstract

This paper presents an explainable AI (XAI) system that provides explanations for its predictions. The system consists of two key components: the prediction And-Or graph (AOG) model for recognizing and localizing concepts of interest in input data, and the XAI model for providing explanations to the user about the AOG's predictions. In this work, we focus on the XAI model, specified to interact with the user in natural language, whereas the AOG's predictions are considered given and represented by the corresponding parse graphs (pg's) of the AOG. Our XAI model takes pg's as input and provides answers to the user's questions using the following types of reasoning: direct evidence (e.g., detection scores), part-based inference (e.g., detected parts provide evidence for the concept asked), and other evidence from spatiotemporal context (e.g., constraints from the spatiotemporal surround). We identify several correlations between the user's questions and the XAI answers using the YouTube Action dataset.

Figure 1: Two frames (scenes) of a video: (a) the top-left image (scene1) shows two persons sitting at the reception and others entering the auditorium, and (b) the top-right image (scene2) shows people running out of an auditorium. The bottom-left shows the AOG parse graph (pg) for the top-left image and the bottom-right shows the pg for the top-right image.

1 Introduction

An explainable AI (XAI) model aims to provide transparency (in the form of justification, explanation, etc.) for its predictions or actions (Baehrens et al., 2010; Lipton, 2016; Ribeiro et al., 2016; Miller, 2017; Biran and Cotton, 2017; Biran and McKeown, 2017). Recently, there has been a lot of focus on building XAI models, especially to provide explanations for understanding and interpreting the predictions made by deep learning models (e.g., explaining models in medical diagnosis domains (Imran et al., 2018; Hatamizadeh et al., 2019)).

Consider, for example, the two frames (scenes) of a video shown in Figure 1. An action detection model might predict that the two people in scene1 are in a sitting posture. The user might be interested in more details about this prediction, such as: Why does the model think the people are in a sitting posture? Why not standing instead of sitting? Why are two persons sitting instead of one? XAI models aim to generate explanations to these questions from different perspectives, such as: "the action detection score for them to sit is higher than for other actions such as standing", "the torso, left arm and right arm poses of both people suggest that they are in a sitting pose", and "I found chairs behind the table in the beginning of the video and cannot see them now, which is why I think they might be sitting on those chairs".
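To make these three explanation perspectives concrete, the sketch below shows one possible way to represent a pg node and read the three kinds of evidence off it. All class names, field names, and numeric scores are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PGNode:
    """A hypothetical parse-graph (pg) node for one detected concept."""
    concept: str                                          # e.g., "sitting"
    scores: Dict[str, float]                              # detection scores per candidate action
    parts: List["PGNode"] = field(default_factory=list)   # part-based evidence (torso, arms, ...)
    context: List[str] = field(default_factory=list)      # spatiotemporal cues from the surround

def explain(node: PGNode) -> List[str]:
    """Return one explanation per reasoning type described above (sketch only)."""
    explanations = []
    # 1. Direct evidence: the predicted action has the highest detection score.
    best = max(node.scores, key=node.scores.get)
    explanations.append(f"Detection score for '{best}' is highest ({node.scores[best]:.2f}).")
    # 2. Part-based inference: detected parts provide evidence for the concept.
    if node.parts:
        part_names = ", ".join(p.concept for p in node.parts)
        explanations.append(f"Detected parts ({part_names}) support '{node.concept}'.")
    # 3. Spatiotemporal context: constraints from the spatiotemporal surround.
    for cue in node.context:
        explanations.append(f"Context: {cue}.")
    return explanations

# Illustrative node for one person in scene1 of Figure 1 (all values made up).
person = PGNode(
    concept="sitting",
    scores={"sitting": 0.82, "standing": 0.11, "walking": 0.07},
    parts=[PGNode("torso pose", {"sitting": 0.90}),
           PGNode("left arm pose", {"sitting": 0.80})],
    context=["chairs were visible behind the table earlier in the video"],
)
print("\n".join(explain(person)))
```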
Explanations are considered to be interactive conversations (Miller, 2017; Akula et al., 2013). Therefore, it is necessary to understand the underlying characteristics of such conversations.

In this work, we propose a generic framework to interact with an XAI model in natural language. The framework consists of two key components: the prediction And-Or graph (AOG) model (Zhu et al., 2007) for recognizing and localizing concepts of interest in input data, and the XAI model for providing explanations to the user about the AOG's predictions. The And-Or graph is a hierarchical and compositional representation, recursively defined to capture contextual information. The AOG structure embodies the expressiveness of context-sensitive grammars and the probabilistic reasoning of graphical models. Spatial, temporal and causal decompositions of entities, and the relations between them, can be modeled using an AOG. In this work, we focus on the XAI model, specified to interact with the user in natural language, whereas the AOG's predictions are considered given and represented by the corresponding parse graphs (pg's) of the AOG. Our XAI model takes pg's as input and provides answers to the user's questions using the following types of reasoning: direct evidence (e.g., detection scores), part-based inference (e.g., detected parts provide evidence for the concept asked), and other evidence from spatiotemporal context (e.g., constraints from the spatiotemporal surround). We created a new explanation dataset using the YouTube Action Videos dataset (Liu et al., 2009). To the best of our knowledge, this is the first explanation dataset that has explicit question and explanation pairs. We present several correlations between the user's questions and the XAI answers using our explanation dataset.

2 Related Work

Several works (Miller, 2017) have been proposed in the past to understand the underlying characteristics of explanations. Lombrozo (2006) proposed that explanations are typically contrastive: they account for one state of affairs in contrast to another. However, the definitions of most of these explanation types are based on theoretical grounds (Dennett, 1989; Chin-Parker and Bradner, 2010) and cannot be applied directly in practice. In our work, we propose explanation types that are motivated from an algorithmic perspective rather than on theoretical grounds. Hilton (1990) proposed different types of contrastive questions that can be posed by the user to an XAI model, such as: i) why X rather than not X; ii) why X rather than the default value for X; iii) why X rather than Y. In our experiments, we found a similar but finer categorization to be helpful for analyzing the users' questions.

3 XAI Question Types

Questions posed by the user to obtain explanations from an XAI model are typically contrastive in nature (Hilton, 1990; Lombrozo, 2006; Miller, 2017). For example, questions such as "Why does the model think the person is in a sitting posture?" and "Why does the model think that two persons are sitting instead of one?" need contrastive explanations. In order to generate such explanations, we categorize the questions into the following 10 categories to understand the implicit contrast that a question presupposes (a rough sketch of such a categorization appears after the list):

1. WH-X: Contrastivity in this type of question is of the form "Why X rather than not X". For example, the question "Why does the model think the person is sitting?" is a WH-X question based on the video shown in Figure 1. In this question, the user wants to know why the person's action is predicted as sitting rather than not sitting.

2. WH-X-NOT-Y: Contrastivity in this type of question is of the form "Why X rather than Y". For example, the question "Why does the model think the person is sitting and not standing?" is a WH-X-NOT-Y question. In this question, the user wants to know why the person's action is predicted as sitting rather than standing. The WH-X and WH-X-NOT-Y categories look similar, and one might think they both need similar explanations. However, in our experiments, we found that they need different explanations.

3. WH-X1-NOT-X2: Contrastivity in this type of question is of the form "Why X1 rather than X2". For example, the question "Why does the model think two persons are sitting instead of three?" is a WH-X1-NOT-X2 question. It may be noted that WH-X-NOT-Y questions refer to the contrastivity between two different concepts X and Y, whereas WH-X1-NOT-X2 questions refer to the contrastivity between two different observations about a single concept.

4. WH-NOT-Y: Contrastivity in this type of question is of the form "Why not Y". For example, the question "Why does the model think the person is not standing?" is a WH-NOT-Y question. In this question, the user wants to know why the person's action is not predicted as standing.

5. NOT-X: The user might want to correct the XAI model's understanding of a concept or argue with the XAI model over the validity of an evidence. For this purpose, we propose the question categories beginning with the prefix 'NOT'. Questions of the NOT-X category are of the form "I think it is X rather than not X". For example, the question "I think the person is not sitting?" is a NOT-X question.

6. NOT-X1-BUT-X2: Questions of the NOT-X1-BUT-X2 category are of the form "I think it is X1 rather than X2". For example, the question "I think there are two persons in the video and not just one" is a NOT-X1-BUT-X2 question.
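The sketch below illustrates how the implicit contrast could be detected with simple surface patterns. It covers only the six categories listed above (the paper defines 10 in total), and every pattern and function name is a hypothetical illustration rather than the authors' method; a practical system would need far more robust question parsing.

```python
import re

def categorize_question(question: str) -> str:
    """Rough, illustrative keyword heuristics for the six categories above."""
    q = question.lower().strip()
    if q.startswith("why"):
        # "Why X1 instead of X2": contrast between two observations (e.g., counts) of one concept.
        if re.search(r"\b(one|two|three|four|\d+)\b.*\b(instead of|rather than)\b\s*(one|two|three|four|\d+)\b", q):
            return "WH-X1-NOT-X2"
        # "Why X and not Y" / "Why X rather than Y": contrast between two concepts.
        if re.search(r"\b\w+ing\b.*\b(and not|rather than|instead of)\b.*\b\w+ing\b", q):
            return "WH-X-NOT-Y"
        # "Why not Y": only the rejected concept is mentioned.
        if re.search(r"\bnot\b", q):
            return "WH-NOT-Y"
        return "WH-X"  # plain "Why X", i.e., "Why X rather than not X"
    if q.startswith("i think"):
        # The user proposes an alternative observation X1 over X2.
        if re.search(r"\b(and not|rather than|instead of|not just)\b", q):
            return "NOT-X1-BUT-X2"
        return "NOT-X"
    return "UNKNOWN"

# Example questions taken from the list above.
print(categorize_question("Why does the model think the person is sitting?"))                     # WH-X
print(categorize_question("Why does the model think the person is sitting and not standing?"))    # WH-X-NOT-Y
print(categorize_question("Why does the model think two persons are sitting instead of three?"))  # WH-X1-NOT-X2
print(categorize_question("Why does the model think the person is not standing?"))                # WH-NOT-Y
print(categorize_question("I think the person is not sitting?"))                                  # NOT-X
print(categorize_question("I think there are two persons in the video and not just one"))         # NOT-X1-BUT-X2
```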
4 XAI Explanation Types

Our XAI model takes AOG parse graphs (pg's) as input and provides answers using the following six types of explanations. In our experiments, we found that these explanation types are sufficient to answer all the 10 different question types discussed in the previous section.

1. AOG Alpha explanation: The alpha explanation is generated by the XAI model using the direct evidence (e.g., detection scores). For example, consider the question "Why does the model think that the person is sitting?". Our XAI model, using the pg of scene1 shown in Figure 1, generates the following alpha explanation: "Action detection score for the person to sit is highest". It may be noted that the XAI model used the evidence from node A1 in the pg to generate this response without taking advantage of surrounding context.
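As a narrower illustration of the alpha case alone, the following sketch shows how such an explanation string could be produced directly from the detection scores stored at a pg node such as A1. The function name, arguments, and scores are hypothetical, not the paper's implementation.

```python
from typing import Dict

def alpha_explanation(entity: str, detection_scores: Dict[str, float]) -> str:
    """Justify a prediction purely from direct evidence (detection scores at a
    pg node), ignoring part-based and spatiotemporal evidence. Sketch only;
    assumes at least two candidate actions are scored."""
    ranked = sorted(detection_scores.items(), key=lambda kv: kv[1], reverse=True)
    (predicted, top), (_, runner_up) = ranked[0], ranked[1]
    return (f"Action detection score for {entity} to {predicted} is highest "
            f"({top:.2f} vs. next best {runner_up:.2f}).")

# Made-up scores standing in for node A1 of the scene1 pg.
print(alpha_explanation("the person", {"sit": 0.82, "stand": 0.11, "walk": 0.07}))
```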