Evaluating Recurrent Neural Network Explanations
Leila Arras1, Ahmed Osman1, Klaus-Robert Müller2,3,4, and Wojciech Samek1
1 Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
2 Machine Learning Group, Technische Universität Berlin, Berlin, Germany
3 Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea
4 Max Planck Institute for Informatics, Saarbrücken, Germany
{leila.arras, [email protected]

Abstract

Recently, several methods have been proposed to explain the predictions of recurrent neural networks (RNNs), in particular of LSTMs. The goal of these methods is to understand the network's decisions by assigning to each input variable, e.g., a word, a relevance indicating to which extent it contributed to a particular prediction. In previous works, some of these methods were not yet compared to one another, or were evaluated only qualitatively. We close this gap by systematically and quantitatively comparing these methods in different settings, namely (1) a toy arithmetic task which we use as a sanity check, (2) a five-class sentiment prediction of movie reviews, and besides (3) we explore the usefulness of word relevances to build sentence-level representations. Lastly, using the method that performed best in our experiments, we show how specific linguistic phenomena such as the negation in sentiment analysis reflect in terms of relevance patterns, and how the relevance visualization can help to understand the misclassification of individual samples.

1 Introduction

Recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997) are a standard building block for understanding and generating text data in NLP. They find usage in pure NLP applications, such as abstractive summarization (Chopra et al., 2016), machine translation (Bahdanau et al., 2015), textual entailment (Rocktäschel et al., 2016); as well as in multimodal tasks involving NLP, such as image captioning (Karpathy and Fei-Fei, 2015), visual question answering (Xu and Saenko, 2016) or lip reading (Chung et al., 2017).

As these models become more and more widespread due to their predictive performance, there is also a need to understand why they took a particular decision, i.e., when the input is a sequence of words: which words are determinant for the final decision? This information is crucial to unmask “Clever Hans” predictors (Lapuschkin et al., 2019), and to allow for transparency of the decision-making process (EU-GDPR, 2016).

Early works on explaining neural network predictions include Baehrens et al. (2010); Zeiler and Fergus (2014); Simonyan et al. (2014); Springenberg et al. (2015); Bach et al. (2015); Alain and Bengio (2017), with several works focusing on explaining the decisions of convolutional neural networks (CNNs) for image recognition. More recently, this topic found a growing interest within NLP, amongst others to explain the decisions of general CNN classifiers (Arras et al., 2017a; Jacovi et al., 2018), and more particularly to explain the predictions of recurrent neural networks (Li et al., 2016, 2017; Arras et al., 2017b; Ding et al., 2017; Murdoch et al., 2018; Poerner et al., 2018).

In this work, we focus on RNN explanation methods that are solely based on a trained neural network model and a single test data point¹. Thus, methods that use additional information, such as training data statistics, sampling, or are optimization-based (Ribeiro et al., 2016; Lundberg and Lee, 2017; Chen et al., 2018) are out of our scope. Among the methods we consider, we note that the method of Murdoch et al. (2018) was not yet compared against Arras et al. (2017b); Ding et al. (2017); and that the method of Ding et al. (2017) was validated only visually.
More- (Rocktaschel¨ et al., 2016); as well as in multi- over, to the best of our knowledge, no recurrent modal tasks involving NLP, such as image cap- neural network explanation method was tested so tioning (Karpathy and Fei-Fei, 2015), visual ques- far on a toy problem where the ground truth rele- tion answering (Xu and Saenko, 2016) or lip read- 1These methods are deterministic, and are essentially ing (Chung et al., 2017). based on a decomposition of the model’s current prediction. As these models become more and more Thereby they intend to reflect the sole model’s “point of view” on the test data point, and hence are not meant to pro- widespread due to their predictive performance, vide an averaged, smoothed or denoised explanation of the there is also a need to understand why they took prediction by additionally exploiting the data’s distribution. 113 Proceedings of the Second BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 113–126 Florence, Italy, August 1, 2019. c 2019 Association for Computational Linguistics vance value is known. ize the relevance of single input variables in RNNs Therefore our contributions are the follow- for sentiment classification (Li et al., 2016). We ing: we evaluate and compare the aforementioned use the latter formulation of relevance and denote methods, using two different experimental setups, it as Gradient. With this definition the relevance thereby we assess basic properties and differences of an entire word is simply the squared L2-norm between the explanation methods. Along-the-way of the prediction function’s gradient w.r.t. the word 2 we purposely adapted a simple toy task, to serve embedding, i.e. Rxt = krxt fc(x)k2 . as a testbed for recurrent neural networks explana- A slight variation of this approach uses partial tions. Lastly, we explore how word relevances can derivatives multiplied by the variable’s value, i.e. be used to build sentence-level representations, R = @fc (x) · x . Hence, the word relevance is a xi @xi i and demonstrate how the relevance visualization dot product between prediction function gradient T can help to understand the (mis-)classification of and word embedding: Rxt = (rxt fc(x)) xt selected samples w.r.t. semantic composition. (Denil et al., 2015). We refer to this variant as Gradient×Input. 2 Explaining Recurrent Neural Network Both variants are general and can be applied to Predictions any neural network. They are computationally ef- ficient and require one forward and backward pass First, let us settle some notations. We suppose through the net. given a trained recurrent neural network based model, which has learned some scalar-valued pre- 2.2 Occlusion-based explanation diction function fc(·), for each class c of a clas- sification problem. Further, we denote by x = Another method to assign relevances to single (x1; x2; :::; xT ) an unseen input data point, where variables, or entire words, is by occluding them xt represents the t-th input vector of dimension D, in the input, and tracking the difference in the net- within the input sequence x of length T . In NLP, work’s prediction w.r.t. a prediction on the orig- the vectors xt are typically word embeddings, and inal unmodified input (Zeiler and Fergus, 2014; x may be a sentence. Li et al., 2017). 
2.2 Occlusion-based explanation

Another method to assign relevances to single variables, or entire words, is by occluding them in the input, and tracking the difference in the network's prediction w.r.t. a prediction on the original unmodified input (Zeiler and Fergus, 2014; Li et al., 2017). In computer vision the occlusion is performed by replacing an image region with a grey or zero-valued square (Zeiler and Fergus, 2014). In NLP, word vectors, or single components thereof, are replaced by zero; in the case of recurrent neural networks, the technique was applied to identify important words for sentiment analysis (Li et al., 2017).

Practically, the relevance can be computed in two ways: in terms of prediction function differences, or, in the case of a classification problem, using a difference of probabilities, i.e. $R_{x_i} = f_c(x) - f_c(x_{|x_i=0})$, or $R_{x_i} = P_c(x) - P_c(x_{|x_i=0})$, where $P_c(\cdot) = \frac{\exp f_c(\cdot)}{\sum_k \exp f_k(\cdot)}$. We refer to the former as Occlusion_f-diff, and to the latter as Occlusion_P-diff. Both variants can also be used to estimate the relevance of an entire word; in this case the corresponding word embedding is set to zero in the input. This type of explanation is computationally expensive and requires $T$ forward passes through the network to determine one relevance value per word in the input sequence $x$.

A slight variation of the above approach uses word omission (similarly to Kádár et al., 2017) instead of occlusion. On a morphosyntactic agreement experiment (see Poerner et al., 2018), omission was shown to deliver inferior results, therefore we consider only occlusion-based relevance.

2.3 Layer-wise relevance propagation

A general method to determine input space relevances is layer-wise relevance propagation (LRP) (Bach et al., 2015). It redistributes the value of the prediction function $f_c(x)$ backward through the network, layer by layer, until the input variables are reached. When the relevance is propagated through a multiplicative interaction, such as the gating in an LSTM cell, one option is to set the relevance of the gate neuron (which is usually sigmoid activated) to zero, i.e. $R_g = 0$, and to assign the whole relevance $R_j$ of the product output neuron $j$ to the remaining signal neuron (which is usually tanh activated), i.e. $R_s = R_j$. We call this LRP variant LRP-all, which stands for “signal-take-all” redistribution.
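A minimal sketch of this redistribution rule, assuming it is applied element-wise to the gated products inside an LSTM; the helper name and the numeric values below are purely illustrative:

```python
import numpy as np

def lrp_all_product(relevance_j):
    """LRP-all ('signal-take-all') rule for a product neuron j = gate * signal:
    the sigmoid-activated gate gets zero relevance, the tanh-activated signal
    inherits the full relevance R_j of the product output."""
    r_gate = np.zeros_like(relevance_j)
    r_signal = np.array(relevance_j, copy=True)
    return r_gate, r_signal

# Example: the LSTM output h_t = o_t * tanh(c_t) is such a gated product, with
# the output gate o_t as gate and tanh(c_t) as signal. Given the relevance
# already attributed to h_t (component-wise), LRP-all passes it on unchanged:
R_h_t = np.array([0.7, -0.2, 0.5])
R_gate, R_signal = lrp_all_product(R_h_t)   # R_gate == 0, R_signal == R_h_t
```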