Interpreting Neural Networks with Nearest Neighbors

Eric Wallace∗, Shi Feng∗, Jordan Boyd-Graber
University of Maryland
{ewallac2,shifeng,jbg}@umiacs.umd.edu

∗Equal contribution

Abstract

Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes it difficult to reliably judge the importance of the input features. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.

1 Introduction

The growing use of neural networks in sensitive domains such as medicine, finance, and security raises concerns about human trust in these machine learning systems. A central question is test-time interpretability: how can humans understand the reasoning behind model predictions?

A common way to interpret neural network predictions is to identify the most important input features, for instance, with a visual saliency map that highlights important pixels in an image (Sundararajan et al., 2017) or words in a sentence (Li et al., 2016). Given a model's test prediction, the importance of each input feature is the change in model confidence when that feature is removed.

However, neural network confidence is not a proper measure of model uncertainty (Guo et al., 2017). This issue is emphasized when models make highly confident predictions on inputs that are completely void of information, for example, images of pure noise (Goodfellow et al., 2015) or meaningless text snippets (Feng et al., 2018). Consequently, a model's confidence may not properly reflect whether discriminative input features are present. This issue makes it difficult to reliably judge the importance of each input feature using common confidence-based interpretation methods (Feng et al., 2018).

To address this, we apply Deep k-Nearest Neighbors (DkNN) (Papernot and McDaniel, 2018) to neural models for text classification. Concretely, predictions are no longer made with a softmax classifier, but using the labels of the training examples whose representations are most similar to the test example (Section 3). This provides an alternative metric for model uncertainty, conformity, which measures how much support a test prediction has by comparing its hidden representations to the training data. This representation-based uncertainty measurement can be used in combination with existing interpretation methods, such as leave-one-out (Li et al., 2016), to better identify important input features.

We combine DkNN with CNN and LSTM models on six NLP text classification tasks, including sentiment analysis and textual entailment, with no loss in classification accuracy (Section 4). We compare interpretations generated using DkNN conformity to baseline interpretation methods, finding that DkNN interpretations rarely assign importance to extraneous words that do not align with human perception (Section 5). Finally, we generate interpretations using DkNN conformity for a dataset with known artifacts (SNLI), helping to indicate whether a model has learned superficial patterns. We open source the code for DkNN and our results (https://github.com/Eric-Wallace/deep-knn).

2 Interpretation Through Feature Attribution

Feature attribution methods explain a test prediction by assigning an importance value to each input feature (typically pixels or words).

In the case of text classification, we have an input sequence of n words x = ⟨w_1, w_2, ..., w_n⟩, represented as one-hot vectors. The word sequence is then converted to a sequence of word embeddings e = ⟨v_1, v_2, ..., v_n⟩. A classifier f outputs a probability distribution over classes. The class with the highest probability is selected as the prediction y, with its probability serving as the model confidence. To create an interpretation, each input word is assigned an importance value, g(w_i | x, y), which indicates the word's contribution to the prediction. A saliency map (or heat map) visually highlights the words in a sentence.

2.1 Leave-one-out Attribution

A simple way to define the importance g is via leave-one-out (Li et al., 2016): individually remove a word from the input and see how the confidence changes. The importance of word w_i is the decrease in confidence (equivalently, the change in class score or cross-entropy loss) when word i is removed:

g(w_i | x, y) = f(y | x) − f(y | x_{−i}),   (1)

where x_{−i} is the input sequence with the ith word removed and f(y | x) is the model confidence for class y. This can be repeated for all words in the input. Under this definition, the sign of the importance value is opposite the sign of the confidence change: if a word's removal causes a decrease in the confidence, it gets a positive importance value. We refer to this interpretation method as Confidence leave-one-out in our experiments.
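As a concrete illustration, the sketch below implements Equation 1 for a generic classifier. The `predict_proba` callable is a hypothetical stand-in for any trained model that maps a list of tokens to class probabilities; it is not part of the paper's released code.

```python
from typing import Callable, List

import numpy as np


def confidence_leave_one_out(
    words: List[str],
    predict_proba: Callable[[List[str]], np.ndarray],  # hypothetical model interface
    label: int,
) -> np.ndarray:
    """Equation 1: importance of word i is the drop in confidence for the
    predicted class when that word is removed from the input."""
    base_confidence = predict_proba(words)[label]
    importances = np.zeros(len(words))
    for i in range(len(words)):
        reduced_input = words[:i] + words[i + 1:]
        importances[i] = base_confidence - predict_proba(reduced_input)[label]
    return importances
```

Swapping `predict_proba` for a conformity-based scorer (Section 3.3) yields the conformity leave-one-out variant with no other changes to the loop.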

2.2 Gradient-Based Feature Attribution

In the case of neural networks, the model f(x) as a function of word w_i is a highly non-linear, differentiable function. Rather than leaving one word out at a time, we can simulate a word's removal by approximating f with a function that is linear in w_i through the first-order Taylor expansion. The importance of w_i is computed as the derivative of f with respect to the one-hot vector:

∂f/∂w_i = (∂f/∂v_i) · (∂v_i/∂w_i) = (∂f/∂v_i) · v_i   (2)

Thus, a word's importance is the dot product between the gradient of the class prediction with respect to the embedding and the word embedding itself. This gradient approximation simulates the change in confidence when an input word is removed and has been used in various interpretation methods for NLP (Arras et al., 2016; Ebrahimi et al., 2017). We refer to this interpretation approach as Gradient in our experiments.
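The sketch below shows one way Equation 2 could be computed with automatic differentiation. The `model` object and its interface (a function from a sequence of embeddings to class logits) are assumptions for illustration, not the architecture used in the paper.

```python
import torch


def gradient_importance(model, embeddings: torch.Tensor, label: int) -> torch.Tensor:
    """Equation 2: score each word by the dot product between the gradient of
    the predicted class probability w.r.t. its embedding and the embedding itself.

    `embeddings` is a (sequence_length, embedding_dim) tensor; `model` is any
    differentiable module mapping it to a vector of class logits (hypothetical).
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    class_prob = torch.softmax(model(embeddings), dim=-1)[label]
    class_prob.backward()                                # fills embeddings.grad with df/dv_i
    return (embeddings.grad * embeddings).sum(dim=-1)    # one importance value per word
```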

2.3 Interpretation Method Failures

Interpreting neural networks can have unexpected negative results. Ghorbani et al. (2017) and Kindermans et al. (2017) show how a lack of model robustness and stability can cause egregious interpretation failures in computer vision settings. Feng et al. (2018) extend this to NLP and draw connections between interpretation failures and adversarial examples (Szegedy et al., 2014). To counteract this, new interpretation methods alone are not enough—models must be improved. For instance, Feng et al. (2018) argue that interpretation methods should not rely on prediction confidence, as it does not reflect a model's uncertainty.

Following this, we improve interpretations by replacing neural network confidence with a robust uncertainty estimate using DkNN (Papernot and McDaniel, 2018). This algorithm achieves comparable accuracy on image classification tasks while providing a better uncertainty metric capable of defending against adversarial examples.

3 Deep k-Nearest Neighbors for Sequential Inputs

This section describes Deep k-Nearest Neighbors, its application to sequential inputs, and how we use it to determine word importance values.

3.1 Deep k-Nearest Neighbors

Papernot and McDaniel (2018) propose Deep k-Nearest Neighbors (DkNN), a modification to the test-time behavior of neural networks. After training completes, the DkNN algorithm passes every training example through the model and saves each of the layers' representations. This creates a new dataset, whose features are the representations and whose labels are the model predictions. Test-time predictions are made by passing an example through the model and performing k-nearest neighbors classification on the resulting representations. This modification does not degrade the accuracy of image classifiers on several standard datasets (Papernot and McDaniel, 2018).

For our purposes, the benefit of DkNN is the algorithm's uncertainty metric, the conformity score. This score is the percentage of nearest neighbors belonging to the predicted class. Conformity follows from the framework of conformal prediction (Shafer and Vovk, 2008) and estimates how much the training data supports a classification decision.

The conformity score is based on the representations of every layer in the model, and therefore, a prediction only receives high conformity if it largely agrees with neighboring examples at all representation levels. This mechanism defends against adversarial examples (Szegedy et al., 2014), as it is difficult to construct a perturbation which changes the neighbors at every layer. Consequently, conformity is a better uncertainty metric for both regular examples and adversarial ones, making it suitable for generating interpretations.

3.2 Handling Sequences

The DkNN algorithm requires fixed-size vector representations. To reach a fixed-size representation for text classification, we can take the final hidden state of a recurrent model or use a form of max pooling across time (Collobert and Weston, 2008). We consider deep architectures of these two forms, using each of the layers' representations as the features.

3.3 Conformity leave-one-out

Using conformity, we generate interpretations through a modified version of leave-one-out (Li et al., 2016). After removing a word, rather than observing the drop in confidence, we instead measure the drop in conformity. Formally, we modify classifier f in Equation 1 to output probabilities based on conformity scores. We refer to this as conformity leave-one-out in our experiments.
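A minimal sketch of the conformity computation is shown below, assuming the per-layer training representations and their labels have already been collected; the neighbor search itself is discussed in Section 4.1. The function names and data layout are illustrative, not the paper's released implementation.

```python
from typing import Callable, List

import numpy as np


def conformity_score(
    test_layer_reps: List[np.ndarray],                   # one fixed-size vector per layer
    neighbor_labels_fn: Callable[[int, np.ndarray, int], np.ndarray],  # hypothetical lookup
    predicted_label: int,
    k: int = 75,
) -> float:
    """DkNN conformity: the fraction of nearest training neighbors, pooled over
    every layer's representation, whose label matches the predicted class."""
    agree, total = 0, 0
    for layer_idx, rep in enumerate(test_layer_reps):
        labels = neighbor_labels_fn(layer_idx, rep, k)
        agree += int(np.sum(labels == predicted_label))
        total += k
    return agree / total
```

Conformity leave-one-out then reuses the leave-one-out loop from Section 2.1, substituting this score for the softmax confidence.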
4 DkNN Maintains Classification Accuracy

Interpretability should not come at the cost of performance—before investigating how interpretable DkNN is, we first evaluate its accuracy. We experiment with six text classification tasks and two models, verifying that DkNN achieves accuracy comparable to regular classifiers.

4.1 Datasets and Models

We consider six common text classification tasks: binary sentiment analysis using the Stanford Sentiment Treebank (Socher et al., 2013, SST) and Customer Reviews (Hu and Liu, 2004, CR), topic classification using TREC (Li and Roth, 2002), opinion polarity (Wiebe et al., 2005, MPQA), and subjectivity/objectivity (Pang and Lee, 2004, SUBJ). Additionally, we consider natural language inference with SNLI (Bowman et al., 2015). We experiment with BiLSTM and CNN models.

CNN Our CNN architecture resembles Kim (2014). We use convolutional filters of size three, four, and five, with max-pooling over time (Collobert and Weston, 2008). The filters are followed by three fully-connected layers. We fine-tune GloVe embeddings (Pennington et al., 2014) of each word. For DkNN, we use the activations from the convolution layer and the three fully-connected layers.

BiLSTM Our architecture uses a bidirectional LSTM (Graves and Schmidhuber, 2005), with the final hidden state forming the fixed-size representation. We use three LSTM layers, followed by two fully-connected layers. We fine-tune GloVe embeddings of each word. For DkNN, we use the final activations of the three recurrent layers and the two fully-connected layers.

SNLI Classifier Unlike the other tasks, which have a single input sentence, SNLI has two inputs, a premise and a hypothesis. Following Conneau et al. (2017), we use the same model to encode the two inputs, generating representations u for the premise and v for the hypothesis. We concatenate the two representations along with their element-wise product and element-wise absolute difference, arriving at a final representation [u; v; u ∗ v; |u − v|]. This vector passes through two fully-connected layers for classification. For DkNN, we use the activations of the two fully-connected layers.

Nearest Neighbor Search For accurate interpretations, we trade efficiency for accuracy and replace the locality-sensitive hashing (Gionis et al., 1999) used by Papernot and McDaniel (2018) with a k-d tree (Bentley, 1975). We use k = 75 nearest neighbors at each layer. The empirical results are robust to the choice of k.
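The sketch below illustrates this neighbor search, assuming the training-set representations for one layer have already been computed; `scipy.spatial.cKDTree` stands in for whichever k-d tree implementation is used, and the array names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def build_layer_index(train_reps: np.ndarray, train_labels: np.ndarray):
    """Build a k-d tree over one layer's training representations
    (shape: num_train_examples x representation_dim)."""
    return cKDTree(train_reps), train_labels


def neighbor_labels(index, rep: np.ndarray, k: int = 75) -> np.ndarray:
    """Return the labels of the k nearest training examples for one layer."""
    tree, labels = index
    _, neighbor_ids = tree.query(rep, k=k)
    return labels[neighbor_ids]
```

One such index would be built per layer over the saved training representations, and the returned labels plug directly into the conformity sketch given earlier.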

4.2 Classification Results

DkNN achieves comparable accuracy on the five classification tasks (Table 1). On SNLI, the BiLSTM achieves an accuracy of 81.2% with a softmax classifier and 81.0% with DkNN.

5 DkNN is Interpretable

Following past work (Li et al., 2016; Murdoch et al., 2018), we focus on the SST dataset for generating interpretations. Due to the lack of standard interpretation evaluation metrics (Doshi-Velez and Kim, 2017), we use qualitative interpretation evaluations (Smilkov et al., 2017; Sundararajan et al., 2017; Li et al., 2016), performing quantitative experiments where possible to examine the distinction between the interpretation methods.

5.1 Interpretation Analysis

We compare our method (Conformity leave-one-out) against two baselines: leave-one-out using regular confidence (Confidence leave-one-out, see Section 2.1), and the gradient with respect to the input (Gradient, see Section 2.2). To create saliency maps, we normalize each word's importance by dividing it by the total importance of the words in the sentence. We display unknown words in angle brackets <>. Table 2 shows SST interpretation examples for the BiLSTM model. Further examples are on a supplementary website (https://sites.google.com/view/language-dknn/).

Conformity leave-one-out assigns concentrated importance values to a small number of input words. In contrast, the baseline methods assign non-zero importance values to numerous words, many of which are irrelevant. For instance, in all three examples of Table 2, both baselines highlight almost half of the input, including words such as "about" and "movie". We suspect model confidence is oversensitive to these unimportant input changes, causing the baseline interpretations to highlight unimportant words. On the other hand, the conformity score better separates word importance, generating clearer interpretations.

The tendency for confidence-based approaches to assign importance to many words holds for the entire test set. We compute the average number of highlighted words using a threshold of 0.05 (a normalized importance value corresponding to a light blue or light red highlight). Out of the average of 20.23 words per sentence in the SST test set, Gradient highlights 5.32 words, Confidence leave-one-out highlights 5.79 words, and Conformity leave-one-out highlights 3.65 words.
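One plausible way to compute this highlight-count statistic is sketched below. Normalizing by the total absolute importance mirrors the saliency-map construction described above, but the exact bookkeeping (absolute values, tie handling) is an assumption rather than the paper's released code.

```python
import numpy as np


def count_highlighted_words(importances: np.ndarray, threshold: float = 0.05) -> int:
    """Count words whose normalized importance magnitude reaches the highlight
    threshold (0.05 corresponds to a light blue or light red highlight)."""
    total = np.abs(importances).sum()
    if total == 0.0:
        return 0
    normalized = np.abs(importances) / total
    return int((normalized >= threshold).sum())
```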
The second, and related, observation for confidence-based approaches is a bias towards selecting word importance based on a word's inherent sentiment, rather than its meaning in context. For example, see "clash", "terribly", and "unfaithful" in Table 2. The removal of these words causes a small change in the model confidence. When using DkNN, the conformity score indicates that the model's uncertainty has not risen without these input words, and leave-one-out does not assign them any importance.

We characterize our interpretation method as significantly higher precision, but slightly lower recall, than confidence-based methods. Conformity leave-one-out rarely assigns high importance to words that do not align with human perception of sentiment. However, there are cases when our method does not assign significant importance to any word. This occurs when the input has high redundancy, for example, a positive movie review that describes the sentiment in four distinct ways. In these cases, leaving out a single sentiment word has little effect on the conformity, as the model's representation remains supported by the other redundant features. Confidence-based interpretations, which interpret models using the linear units that produce class scores, achieve higher recall by responding to every change in the input in a certain direction, but may have lower precision.

In the second example of Table 2, the word "terribly" is assigned a negative importance value, disregarding its positive meaning in context. To examine whether this is a stand-alone example or a more general pattern of uninterpretable behavior, we calculate the importance value of the word "terribly" in other positive examples. For each occurrence of the word "great" in positive validation examples, we paraphrase it to "awesome", "wonderful", or "impressive", and add the word "terribly" in front of it. This process yields 66 examples. For each of these examples, we compute the importance value of each input word and rank them from most negative to most positive (the most negative word has a rank of 1). We compare the average ranking of "terribly" from the three methods: 7.9 from Conformity leave-one-out, 1.68 from Confidence leave-one-out, and 1.1 from Gradient.

              SST    CR     TREC   MPQA   SUBJ
LSTM          86.7   82.7   91.5   88.9   94.8
LSTM DkNN     86.6   82.5   91.3   88.6   94.9
CNN           85.7   83.3   92.8   89.1   93.5
CNN DkNN      85.8   83.4   92.4   88.7   93.1

Table 1: Replacing a neural network's softmax classifier with DkNN maintains classification accuracy on standard text classification tasks.

Method       Saliency Map
Conformity   an intelligent fiction about learning through cultural clash.
Confidence   an intelligent fiction about learning through cultural clash.
Gradient     an intelligent fiction about learning through cultural clash.
Conformity   is talented and terribly charismatic.
Confidence   is talented and terribly charismatic.
Gradient     is talented and terribly charismatic.
Conformity   Diane Lane shines in unfaithful.
Confidence   Diane Lane shines in unfaithful.
Gradient     Diane Lane shines in unfaithful.
Color Legend: Positive Impact (blue), Negative Impact (red)

Table 2: Comparison of interpretation approaches on SST test examples for the LSTM model. Blue indicates positive impact and red indicates negative impact. Our method (Conformity leave-one-out) has higher precision, rarely assigning importance to extraneous words such as "about" or "movie".

The baseline methods consistently rank "terribly" as the most negative word, ignoring its meaning in context. This echoes our suspicion: DkNN generates interpretations with higher precision because conformity is robust to irrelevant changes.

5.2 Analyzing Dataset Annotation Artifacts

Through DkNN, we get a new uncertainty measurement, conformity, that measures how a test example's representation is positioned relative to the training data representations. In this section, we use conformity leave-one-out to interpret a model trained on SNLI. This dataset is known to contain annotation artifacts, and we demonstrate that our interpretation method can help identify when models exploit these dataset biases.

Recent studies (Gururangan et al., 2018; Poliak et al., 2018) identified annotation artifacts in the SNLI dataset. These works identified superficial patterns in the input which strongly correlate with certain labels, making it possible for models to "game" the task: obtain high accuracy without true understanding. For instance, the hypothesis of an entailment example is often a more general paraphrase of the premise, using words such as "outside" instead of "playing soccer in a park". Contradiction examples often contain negation words or non-action verbs like "sleeping". Models trained solely on the hypothesis can learn these patterns to achieve an accuracy considerably higher than the majority baseline.

These studies indicate that the SNLI task can be gamed. We look to confirm that some artifacts are indeed exploited by normally trained models that use full input pairs. We create saliency maps for examples in the validation set using conformity leave-one-out. Table 3 shows samples, and more can be found on the supplementary website (https://sites.google.com/view/language-dknn/). We use blue highlights to indicate words which positively support the model's predicted class, and red to indicate words that support a different class. The first example is a randomly sampled baseline, showing how the words "swims" and "pool" support the model's prediction of Contradiction. The other examples are selected because they contain terms identified as artifacts. In the second example, conformity leave-one-out assigns extremely high word importance to "sleeping", disregarding other words necessary to predict Contradiction (i.e., the Neutral class is still possible if "pets" is replaced with "people"). In the final two hypotheses, the interpretation method diagnoses the model failure, assigning high importance to "wearing", rather than focusing positively on the shirt color.

To explore this further, we compute the average importance rank using conformity and confidence leave-one-out for the top five artifacts in each SNLI class identified by Gururangan et al. (2018). Table 4 compares the average rank assigned by the two methods, sorting the words by the Pointwise Mutual Information provided by Gururangan et al. (2018). The word "nobody" particularly stands out: it is the most important input word every time it appears in a contradiction example.

For most of the artifacts, conformity leave-one-out assigns them a high importance, often ranking the artifacts as the most important input word. Confidence leave-one-out correlates less strongly with the known artifacts, frequently assigning importance values as low as fifth or sixth most important. Given the high correlation between conformity leave-one-out and the manually identified artifacts, this interpretation method may serve as a technique to identify undesirable biases a model may have learned.
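The rank statistics in Table 4 could be computed along the lines of the sketch below. The `word_importance` callable (mapping a tokenized hypothesis to per-word importance values from either method) and the choice to rank by descending raw importance are illustrative assumptions, not the paper's exact procedure.

```python
from typing import Callable, List, Sequence

import numpy as np


def average_importance_rank(
    artifact: str,
    examples: Sequence[List[str]],                        # tokenized hypotheses containing the artifact
    word_importance: Callable[[List[str]], np.ndarray],   # hypothetical scorer (conformity or confidence)
) -> float:
    """Average rank of an artifact word, where rank 1 means the most important
    word in the input (as in Table 4)."""
    ranks = []
    for words in examples:
        importances = word_importance(words)
        order = np.argsort(-importances)                  # most important word first
        position = np.where(np.array(words)[order] == artifact)[0][0]
        ranks.append(position + 1)                        # convert to a 1-indexed rank
    return float(np.mean(ranks))
```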

6 Discussion and Related Work

We connect the improvements made by conformity leave-one-out to model confidence issues, compare alternative interpretation improvements, and discuss further features of DkNN.

6.1 Issues in Neural Network Confidence

Gradient and leave-one-out both interpret a model by determining the importance value for each input word. This effectively reduces the problem of interpretation to one of determining model uncertainty. Past work relies on model confidence as a measure of uncertainty. However, a neural network's confidence is unreasonably high: on held-out examples, it far exceeds empirical error rates (Guo et al., 2017). This is further exemplified by the high-confidence predictions produced on inputs that are adversarial (Szegedy et al., 2014) or contain solely noise (Goodfellow et al., 2015). Most importantly for interpretation, the change in confidence often will not properly reflect whether discriminative input features have been removed (Feng et al., 2018).

6.2 Confidence Calibration is Insufficient

We attribute one interpretation failure to neural network confidence issues. Guo et al. (2017) study overconfidence and propose a calibration procedure using temperature scaling. This adds a temperature to the softmax function to align confidence with accuracy. However, this adjustment is not input dependent: the confidence is lower for both full-length examples and ones with words left out. Hence, selecting influential words remains difficult.

To verify this, we create an interpretation baseline using temperature scaling. The results corroborate the intuition: a calibrated leave-one-out does not fix the interpretation issues. Qualitatively, the calibrated interpretations are comparable to confidence leave-one-out. Furthermore, calibrating the DkNN conformity score following Papernot and McDaniel (2018) does not improve interpretability compared to the uncalibrated conformity score.

6.3 Alternative Interpretation Improvements

Recent work improves interpretation methods through other means. Smilkov et al. (2017) and Sundararajan et al. (2017) both aggregate gradient values over multiple passes to eliminate local noise or satisfy interpretation axioms. This work does not address model confidence and is orthogonal to our DkNN approach.

6.4 Interpretation Through Data Selection

Retrieval-Augmented Neural Networks (Zhao and Cho, 2018) are similar to DkNN: they augment model predictions with an information retrieval system that searches over network activations from the training data.

Retrieval-Augmented models and DkNN can both select influential training examples for a test prediction. In particular, the training data activations which are closest to the test point's activations are influential according to the model. These training examples can provide interpretations as a form of analogy (Caruana et al., 1999), an intuitive explanation for both experts and non-experts (Klein, 1989; Kim et al., 2014; Koh and Liang, 2017; Wallace and Boyd-Graber, 2018).

Prediction      Input       Saliency Map
Contradiction   Premise:    a young boy reaches for and touches the propeller of a vintage aircraft.
                Hypothesis: a young boy swims in his pool.
Entailment      Premise:    a brown a dog and a black dog in the edge of the ocean with a wave under them boats are on the water in the background.
                Hypothesis: the pets are sleeping on the grass.
Entailment      Premise:    man in a blue shirt standing in front of a structure painted with geometric designs.
                Hypothesis: a man is wearing a blue shirt.
Entailment      Hypothesis: a man is wearing a black shirt.
Color Legend: Positive Impact, Negative Impact

Table 3: Interpretations generated with conformity leave-one-out align with annotation biases identified in SNLI. In the second example, the model puts emphasis on the word “sleeping”, disregarding other words that could indicate the Neutral class. The final example diagnoses a model’s incorrect Entailment prediction (shown in red). Green highlights indicate words that support the classification decision made (shown in parenthesis), pink highlights indicate words that support a different class.

However, unlike in computer vision, where training data selection using DkNN yielded interpretable examples (Papernot and McDaniel, 2018), our experiments did not find human-interpretable data points for SST or SNLI.

Label           Artifact      Conformity   Confidence
Entailment      outdoors      2.93         3.26
                least         2.22         4.41
                instrument    3.57         4.47
                outside       4.08         4.80
                animal        2.00         4.73
Neutral         tall          1.09         2.61
                first         2.14         2.99
                competition   2.33         5.56
                sad           1.39         1.79
                favorite      1.69         3.89
Contradiction   nobody        1.00         1.00
                sleeping      1.64         2.34
                no            2.53         5.74
                tv            1.92         3.74
                cat           1.42         3.62

Table 4: The top SNLI artifacts identified by Gururangan et al. (2018) are shown on the left. For each word, we compute the average importance rank over the validation set using either Conformity or Confidence leave-one-out. A score of 1.0 indicates that a word is always ranked as the most important word in the input. Conformity leave-one-out assigns stronger importance to artifacts, suggesting it better diagnoses model biases.

6.5 Trust in Model Predictions

Model confidence is important for real-world applications: it signals how much one should trust a neural network's predictions. Unfortunately, users may be misled when a model outputs highly confident predictions on rubbish examples (Goodfellow et al., 2015; Nguyen et al., 2015) or adversarial examples (Szegedy et al., 2014). Recent work decides when to trust a neural network model (Ribeiro et al., 2016; Doshi-Velez and Kim, 2017; Jiang et al., 2018), for instance, by analyzing local linear model approximations (Ribeiro et al., 2016) or by flagging rare network activations using kernel density estimation (Jiang et al., 2018). The DkNN conformity score is a trust metric that helps defend against image adversarial examples (Papernot and McDaniel, 2018). Future work should study whether this robustness extends to interpretations.

7 Future Work and Conclusion

A robust model uncertainty estimate is critical to determine feature importance. The DkNN conformity score is one such uncertainty metric which leads to higher precision interpretations.

DkNN is only a test-time improvement; the model is still trained with maximum likelihood. Combining nearest neighbor and maximum likelihood objectives during training may further improve model accuracy and interpretability. Moreover, other uncertainty estimators do not require test-time modifications, for example, modeling p(x) and p(y | x) using Bayesian Neural Networks (Gal et al., 2016).

Similar to other NLP interpretation methods (Sundararajan et al., 2017; Li et al., 2016), conformity leave-one-out works when a model's representation is fixed-size. For other NLP tasks, such as structured prediction (e.g., translation and parsing) or span prediction (e.g., extractive summarization and reading comprehension), models output a variable number of predictions and our interpretation approach will not suffice. Developing interpretation techniques for these types of models is a necessary area for future work.

We apply DkNN to neural models for text classification. This provides a better estimate of model uncertainty—conformity—which we combine with leave-one-out. This overcomes issues stemming from neural network confidence, leading to higher precision interpretations. Most interestingly, our interpretations are supported by the training data, providing insights into the representations learned by a model.

Acknowledgments

Feng was supported under subcontract to Raytheon BBN Technologies by DARPA award HR0011-15-C-0113. JBG is supported by NSF Grant IIS1652666. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. The authors would like to thank the members of the CLIP lab at the University of Maryland and the anonymous reviewers for their feedback.

References

Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2016. Explaining predictions of non-linear classifiers in NLP. In Workshop on Representation Learning for NLP.

Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Communications of the ACM.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Rich Caruana, Hooshang Kangarloo, John David N. Dionisio, Usha S. Sinha, and David B. Johnson. 1999. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. In ACL.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In EMNLP.

Yarin Gal, Yutian Chen, Roger Frigola, S. Gu, Alex Kendall, Yingzhen Li, Rowan McAllister, Carl Rasmussen, Ilya Sutskever, Gabriel Synnaeve, Nilesh Tripuraneni, Richard Turner, Oriol Vinyals, Adrian Weller, Mark van der Wilk, and Yan Wu. 2016. Uncertainty in deep learning. Ph.D. thesis, University of Oxford.

Amirata Ghorbani, Abubakar Abid, and James Y. Zou. 2017. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547.

Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity search in high dimensions via hashing. In VLDB.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In ICLR.
Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In ICML.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD.

Heinrich Jiang, Been Kim, and Maya R. Gupta. 2018. To trust or not to trust a classifier. arXiv preprint arXiv:1805.11783.

Been Kim, Cynthia Rudin, and Julie A. Shah. 2014. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In NIPS.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867.

Gary A. Klein. 1989. Do decision biases explain too much. In Proceedings of the Human Factors and Ergonomics Society.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In ICML.

Jiwei Li, Will Monroe, and Daniel Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING.

W. James Murdoch, Peter J. Liu, and Bin Yu. 2018. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In ICLR.

Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL.

Nicolas Papernot and Patrick D. McDaniel. 2018. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In *SEM@NAACL-HLT.

Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In KDD.

Glenn Shafer and Vladimir Vovk. 2008. A tutorial on conformal prediction. JMLR.

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. 2017. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.

Richard Socher, A. V. Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In ICML.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR.

Eric Wallace and Jordan Boyd-Graber. 2018. Trick me if you can: Adversarial writing of trivia challenge questions. In Proceedings of the ACL 2018 Student Research Workshop.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. In LREC.

Jake Zhao and Kyunghyun Cho. 2018. Retrieval-augmented convolutional neural networks for improved robustness against adversarial examples. arXiv preprint arXiv:1802.09502.
