Interpreting Neural Networks with Nearest Neighbors

Eric Wallace∗, Shi Feng∗, Jordan Boyd-Graber
University of Maryland
{ewallac2,shifeng,jbg}@umiacs.umd.edu

∗Equal contribution

Abstract

Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes it difficult to reliably judge the importance of the input features. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.

1 Introduction

The growing use of neural networks in sensitive domains such as medicine, finance, and security raises concerns about human trust in these machine learning systems. A central question is test-time interpretability: how can humans understand the reasoning behind model predictions?

A common way to interpret neural network predictions is to identify the most important input features, for instance, with a visual saliency map that highlights important pixels in an image (Sundararajan et al., 2017) or words in a sentence (Li et al., 2016). Given a model's test prediction, the importance of each input feature is the change in model confidence when that feature is removed.

However, neural network confidence is not a proper measure of model uncertainty (Guo et al., 2017). This issue is emphasized when models make highly confident predictions on inputs that are completely void of information, for example, images of pure noise (Goodfellow et al., 2015) or meaningless text snippets (Feng et al., 2018). Consequently, a model's confidence may not properly reflect whether discriminative input features are present. This issue makes it difficult to reliably judge the importance of each input feature using common confidence-based interpretation methods (Feng et al., 2018).

To address this, we apply Deep k-Nearest Neighbors (DkNN) (Papernot and McDaniel, 2018) to neural models for text classification. Concretely, predictions are no longer made with a softmax classifier, but using the labels of the training examples whose representations are most similar to the test example (Section 3). This provides an alternative metric for model uncertainty, conformity, which measures how much support a test prediction has by comparing its hidden representations to the training data. This representation-based uncertainty measurement can be used in combination with existing interpretation methods, such as leave-one-out (Li et al., 2016), to better identify important input features.

We combine DkNN with CNN and LSTM models on six NLP text classification tasks, including sentiment analysis and textual entailment, with no loss in classification accuracy (Section 4). We compare interpretations generated using DkNN conformity to baseline interpretation methods, finding that DkNN interpretations rarely assign importance to extraneous words that do not align with human perception (Section 5). Finally, we generate interpretations using DkNN conformity for a dataset with known artifacts (SNLI), helping to indicate whether a model has learned superficial patterns. We open source the code for DkNN and our results (https://github.com/Eric-Wallace/deep-knn).

2 Interpretation Through Feature Attribution

Feature attribution methods explain a test prediction by assigning an importance value to each input feature (typically pixels or words).

In the case of text classification, we have an input sequence of n words x = ⟨w_1, w_2, ..., w_n⟩, represented as one-hot vectors. The word sequence is then converted to a sequence of word embeddings e = ⟨v_1, v_2, ..., v_n⟩. A classifier f outputs a probability distribution over classes. The class with the highest probability is selected as the prediction y, with its probability serving as the model confidence. To create an interpretation, each input word is assigned an importance value, g(w_i | x, y), which indicates the word's contribution to the prediction. A saliency map (or heat map) visually highlights the words in a sentence.

2.1 Leave-one-out Attribution

A simple way to define the importance g is via leave-one-out (Li et al., 2016): individually remove a word from the input and see how the confidence changes. The importance of word w_i is the decrease in confidence (equivalently, the change in class score or cross-entropy loss) when word i is removed:

g(w_i | x, y) = f(y | x) − f(y | x_{−i}),   (1)

where x_{−i} is the input sequence with the ith word removed and f(y | x) is the model confidence for class y. This can be repeated for all words in the input. Under this definition, the sign of the importance value is opposite the sign of the confidence change: if a word's removal causes a decrease in the confidence, it gets a positive importance value. We refer to this interpretation method as Confidence leave-one-out in our experiments.
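As a concrete illustration, the sketch below implements Equation 1 for a generic classifier. The `predict_proba` callable is a hypothetical stand-in for any trained model that maps a list of tokens to class probabilities; it is not part of the paper's released code.

```python
from typing import Callable, List

import numpy as np


def confidence_leave_one_out(
    words: List[str],
    predict_proba: Callable[[List[str]], np.ndarray],  # hypothetical model interface
    label: int,
) -> np.ndarray:
    """Equation 1: importance of word i is the drop in confidence for the
    predicted class when that word is removed from the input."""
    base_confidence = predict_proba(words)[label]
    importances = np.zeros(len(words))
    for i in range(len(words)):
        reduced_input = words[:i] + words[i + 1:]
        importances[i] = base_confidence - predict_proba(reduced_input)[label]
    return importances
```

Swapping `predict_proba` for a conformity-based scorer (Section 3.3) yields the conformity leave-one-out variant with no other changes to the loop.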

2.2 Gradient-Based Feature Attribution

In the case of neural networks, the model f(x) as a function of word w_i is a highly non-linear, differentiable function. Rather than leaving one word out at a time, we can simulate a word's removal by approximating f with a function that is linear in w_i through the first-order Taylor expansion. The importance of w_i is computed as the derivative of f with respect to the one-hot vector:

∂f/∂w_i = (∂f/∂v_i) · (∂v_i/∂w_i) = (∂f/∂v_i) · v_i   (2)

Thus, a word's importance is the dot product between the gradient of the class prediction with respect to the embedding and the word embedding itself. This gradient approximation simulates the change in confidence when an input word is removed and has been used in various interpretation methods for NLP (Arras et al., 2016; Ebrahimi et al., 2017). We refer to this interpretation approach as Gradient in our experiments.
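The sketch below shows one way Equation 2 could be computed with automatic differentiation. The `model` object and its interface (a function from a sequence of embeddings to class logits) are assumptions for illustration, not the architecture used in the paper.

```python
import torch


def gradient_importance(model, embeddings: torch.Tensor, label: int) -> torch.Tensor:
    """Equation 2: score each word by the dot product between the gradient of
    the predicted class probability w.r.t. its embedding and the embedding itself.

    `embeddings` is a (sequence_length, embedding_dim) tensor; `model` is any
    differentiable module mapping it to a vector of class logits (hypothetical).
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    class_prob = torch.softmax(model(embeddings), dim=-1)[label]
    class_prob.backward()                                # fills embeddings.grad with df/dv_i
    return (embeddings.grad * embeddings).sum(dim=-1)    # one importance value per word
```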

2.3 Interpretation Method Failures

Interpreting neural networks can have unexpected negative results. Ghorbani et al. (2017) and Kindermans et al. (2017) show how a lack of model robustness and stability can cause egregious interpretation failures in computer vision settings. Feng et al. (2018) extend this to NLP and draw connections between interpretation failures and adversarial examples (Szegedy et al., 2014). To counteract this, new interpretation methods alone are not enough—models must be improved. For instance, Feng et al. (2018) argue that interpretation methods should not rely on prediction confidence, as it does not reflect a model's uncertainty.

Following this, we improve interpretations by replacing neural network confidence with a robust uncertainty estimate using DkNN (Papernot and McDaniel, 2018). This algorithm achieves comparable accuracy on image classification tasks while providing a better uncertainty metric capable of defending against adversarial examples.

3 Deep k-Nearest Neighbors for Sequential Inputs

This section describes Deep k-Nearest Neighbors, its application to sequential inputs, and how we use it to determine word importance values.

3.1 Deep k-Nearest Neighbors

Papernot and McDaniel (2018) propose Deep k-Nearest Neighbors (DkNN), a modification to the test-time behavior of neural networks. After training completes, the DkNN algorithm passes every training example through the model and saves each of the layers' representations. This creates a new dataset, whose features are the representations and whose labels are the model predictions. Test-time predictions are made by passing an example through the model and performing k-nearest neighbors classification on the resulting representations. This modification does not degrade the accuracy of image classifiers on several standard datasets (Papernot and McDaniel, 2018).

For our purposes, the benefit of DkNN is the algorithm's uncertainty metric, the conformity score. This score is the percentage of nearest neighbors belonging to the predicted class. Conformity follows from the framework of conformal prediction (Shafer and Vovk, 2008) and estimates how much the training data supports a classification decision.

The conformity score is based on the representations of every layer in the model, and therefore, a prediction only receives high conformity if it largely agrees with neighboring examples at all representation levels. This mechanism defends against adversarial examples (Szegedy et al., 2014), as it is difficult to construct a perturbation which changes the neighbors at every layer. Consequently, conformity is a better uncertainty metric for both regular examples and adversarial ones, making it suitable for generating interpretations.

3.2 Handling Sequences

The DkNN algorithm requires fixed-size vector representations. To reach a fixed-size representation for text classification, we can take the final hidden state of a recurrent model or use a form of max pooling across time (Collobert and Weston, 2008). We consider deep architectures of these two forms, using each of the layers' representations as the features.

3.3 Conformity leave-one-out

Using conformity, we generate interpretations through a modified version of leave-one-out (Li et al., 2016). After removing a word, rather than observing the drop in confidence, we instead measure the drop in conformity. Formally, we modify classifier f in Equation 1 to output probabilities based on conformity scores. We refer to this as conformity leave-one-out in our experiments.
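A minimal sketch of the conformity computation is shown below, assuming the per-layer training representations and their labels have already been collected; the neighbor search itself is discussed in Section 4.1. The function names and data layout are illustrative, not the paper's released implementation.

```python
from typing import Callable, List

import numpy as np


def conformity_score(
    test_layer_reps: List[np.ndarray],                   # one fixed-size vector per layer
    neighbor_labels_fn: Callable[[int, np.ndarray, int], np.ndarray],  # hypothetical lookup
    predicted_label: int,
    k: int = 75,
) -> float:
    """DkNN conformity: the fraction of nearest training neighbors, pooled over
    every layer's representation, whose label matches the predicted class."""
    agree, total = 0, 0
    for layer_idx, rep in enumerate(test_layer_reps):
        labels = neighbor_labels_fn(layer_idx, rep, k)
        agree += int(np.sum(labels == predicted_label))
        total += k
    return agree / total
```

Conformity leave-one-out then reuses the leave-one-out loop from Section 2.1, substituting this score for the softmax confidence.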
4 DkNN Maintains Classification Accuracy

Interpretability should not come at the cost of performance—before investigating how interpretable DkNN is, we first evaluate its accuracy. We experiment with six text classification tasks and two models, verifying that DkNN achieves accuracy comparable to regular classifiers.

4.1 Datasets and Models

We consider six common text classification tasks: binary sentiment analysis using the Stanford Sentiment Treebank (Socher et al., 2013, SST) and Customer Reviews (Hu and Liu, 2004, CR), topic classification using TREC (Li and Roth, 2002), opinion polarity (Wiebe et al., 2005, MPQA), and subjectivity/objectivity (Pang and Lee, 2004, SUBJ). Additionally, we consider natural language inference with SNLI (Bowman et al., 2015). We experiment with BiLSTM and CNN models.

CNN Our CNN architecture resembles Kim (2014). We use convolutional filters of size three, four, and five, with max-pooling over time (Collobert and Weston, 2008). The filters are followed by three fully-connected layers. We fine-tune GloVe embeddings (Pennington et al., 2014) of each word. For DkNN, we use the activations from the convolution layer and the three fully-connected layers.

BiLSTM Our architecture uses a bidirectional LSTM (Graves and Schmidhuber, 2005), with the final hidden state forming the fixed-size representation. We use three LSTM layers, followed by two fully-connected layers. We fine-tune GloVe embeddings of each word. For DkNN, we use the final activations of the three recurrent layers and the two fully-connected layers.

SNLI Classifier Unlike the other tasks, which have a single input sentence, SNLI has two inputs, a premise and a hypothesis. Following Conneau et al. (2017), we use the same model to encode the two inputs, generating representations u for the premise and v for the hypothesis. We concatenate the two representations along with their element-wise product and element-wise absolute difference, arriving at a final representation [u; v; u ∗ v; |u − v|]. This vector passes through two fully-connected layers for classification. For DkNN, we use the activations of the two fully-connected layers.

Nearest Neighbor Search For accurate interpretations, we trade efficiency for accuracy and replace the locality-sensitive hashing (Gionis et al., 1999) used by Papernot and McDaniel (2018) with a k-d tree (Bentley, 1975). We use k = 75 nearest neighbors at each layer. The empirical results are robust to the choice of k.
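The sketch below illustrates this neighbor search, assuming the training-set representations for one layer have already been computed; `scipy.spatial.cKDTree` stands in for whichever k-d tree implementation is used, and the array names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree


def build_layer_index(train_reps: np.ndarray, train_labels: np.ndarray):
    """Build a k-d tree over one layer's training representations
    (shape: num_train_examples x representation_dim)."""
    return cKDTree(train_reps), train_labels


def neighbor_labels(index, rep: np.ndarray, k: int = 75) -> np.ndarray:
    """Return the labels of the k nearest training examples for one layer."""
    tree, labels = index
    _, neighbor_ids = tree.query(rep, k=k)
    return labels[neighbor_ids]
```

One such index would be built per layer over the saved training representations, and the returned labels plug directly into the conformity sketch given earlier.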

4.2 Classification Results

DkNN achieves comparable accuracy on the five classification tasks (Table 1). On SNLI, the BiLSTM achieves an accuracy of 81.2% with a softmax classifier and 81.0% with DkNN.

5 DkNN is Interpretable

Following past work (Li et al., 2016; Murdoch et al., 2018), we focus on the SST dataset for generating interpretations. Due to the lack of standard interpretation evaluation metrics (Doshi-Velez and Kim, 2017), we use qualitative interpretation evaluations (Smilkov et al., 2017; Sundararajan et al., 2017; Li et al., 2016), performing quantitative experiments where possible to examine the distinction between the interpretation methods.

5.1 Interpretation Analysis

We compare our method (Conformity leave-one-out) against two baselines: leave-one-out using regular confidence (Confidence leave-one-out, see Section 2.1), and the gradient with respect to the input (Gradient, see Section 2.2). To create saliency maps, we normalize each word's importance by dividing it by the total importance of the words in the sentence. We display unknown words in angle brackets <>. Table 2 shows SST interpretation examples for the BiLSTM model. Further examples are on a supplementary website (https://sites.google.com/view/language-dknn/).

Conformity leave-one-out assigns concentrated importance values to a small number of input words. In contrast, the baseline methods assign non-zero importance values to numerous words, many of which are irrelevant. For instance, in all three examples of Table 2, both baselines highlight almost half of the input, including words such as "about" and "movie". We suspect model confidence is oversensitive to these unimportant input changes, causing the baseline interpretations to highlight unimportant words. On the other hand, the conformity score better separates word importance, generating clearer interpretations.

The tendency for confidence-based approaches to assign importance to many words holds for the entire test set. We compute the average number of highlighted words using a threshold of 0.05 (a normalized importance value corresponding to a light blue or light red highlight). Out of the average of 20.23 words per sentence in the SST test set, Gradient highlights 5.32 words, Confidence leave-one-out highlights 5.79 words, and Conformity leave-one-out highlights 3.65 words.
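One plausible way to compute this highlight-count statistic is sketched below. Normalizing by the total absolute importance mirrors the saliency-map construction described above, but the exact bookkeeping (absolute values, tie handling) is an assumption rather than the paper's released code.

```python
import numpy as np


def count_highlighted_words(importances: np.ndarray, threshold: float = 0.05) -> int:
    """Count words whose normalized importance magnitude reaches the highlight
    threshold (0.05 corresponds to a light blue or light red highlight)."""
    total = np.abs(importances).sum()
    if total == 0.0:
        return 0
    normalized = np.abs(importances) / total
    return int((normalized >= threshold).sum())
```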
The second, and related, observation for confidence-based approaches is a bias towards selecting word importance based on a word's inherent sentiment, rather than its meaning in context. For example, see "clash", "terribly", and "unfaithful" in Table 2. The removal of these words causes a small change in the model confidence. When using DkNN, the conformity score indicates that the model's uncertainty has not risen without these input words, and leave-one-out does not assign them any importance.

We characterize our interpretation method as significantly higher precision, but slightly lower recall, than confidence-based methods. Conformity leave-one-out rarely assigns high importance to words that do not align with human perception of sentiment. However, there are cases when our method does not assign significant importance to any word. This occurs when the input has high redundancy, for example, a positive movie review that describes the sentiment in four distinct ways. In these cases, leaving out a single sentiment word has little effect on the conformity, as the model's representation remains supported by the other redundant features. Confidence-based interpretations, which interpret models using the linear units that produce class scores, achieve higher recall by responding to every change in the input in a certain direction, but may have lower precision.

In the second example of Table 2, the word "terribly" is assigned a negative importance value, disregarding its positive meaning in context. To examine whether this is a stand-alone example or a more general pattern of uninterpretable behavior, we calculate the importance value of the word "terribly" in other positive examples. For each occurrence of the word "great" in positive validation examples, we paraphrase it to "awesome", "wonderful", or "impressive", and add the word "terribly" in front of it. This process yields 66 examples. For each of these examples, we compute the importance value of each input word and rank them from most negative to most positive (the most negative word has a rank of 1). We compare the average ranking of "terribly" from the three methods: 7.9 from Conformity leave-one-out, 1.68 from Confidence leave-one-out, and 1.1 from Gradient.

              SST    CR     TREC   MPQA   SUBJ
LSTM          86.7   82.7   91.5   88.9   94.8
LSTM DkNN     86.6   82.5   91.3   88.6   94.9
CNN           85.7   83.3   92.8   89.1   93.5
CNN DkNN      85.8   83.4   92.4   88.7   93.1

Table 1: Replacing a neural network's softmax classifier with DkNN maintains classification accuracy on standard text classification tasks.

Method       Saliency Map
Conformity   an intelligent fiction about learning through cultural clash.
Confidence   an intelligent fiction about learning through cultural clash.
Gradient     an intelligent fiction about learning through cultural clash.
Conformity   is talented and terribly charismatic.
Confidence   is talented and terribly charismatic.
Gradient     is talented and terribly charismatic.
Conformity   Diane Lane shines in unfaithful.
Confidence   Diane Lane shines in unfaithful.
Gradient     Diane Lane shines in unfaithful.
Color Legend: Positive Impact (blue), Negative Impact (red)

Table 2: Comparison of interpretation approaches on SST test examples for the LSTM model. Blue indicates positive impact and red indicates negative impact. Our method (Conformity leave-one-out) has higher precision, rarely assigning importance to extraneous words such as "about" or "movie".

The baseline methods consistently rank "terribly" as the most negative word, ignoring its meaning in context. This echoes our suspicion: DkNN generates interpretations with higher precision because conformity is robust to irrelevant changes.

5.2 Analyzing Dataset Annotation Artifacts

Through DkNN, we get a new uncertainty measurement, conformity, that measures how a test example's representation is positioned relative to the training data representations. In this section, we use conformity leave-one-out to interpret a model trained on SNLI. This dataset is known to contain annotation artifacts, and we demonstrate that our interpretation method can help identify when models exploit these dataset biases.

Recent studies (Gururangan et al., 2018; Poliak et al., 2018) identified annotation artifacts in the SNLI dataset. These works identified superficial patterns in the input which strongly correlate with certain labels, making it possible for models to "game" the task: obtain high accuracy without true understanding. For instance, the hypothesis of an entailment example is often a more general paraphrase of the premise, using words such as "outside" instead of "playing soccer in a park". Contradiction examples often contain negation words or non-action verbs like "sleeping". Models trained solely on the hypothesis can learn these patterns to achieve an accuracy considerably higher than the majority baseline.

These studies indicate that the SNLI task can be gamed. We look to confirm that some artifacts are indeed exploited by normally trained models that use full input pairs. We create saliency maps for examples in the validation set using conformity leave-one-out. Table 3 shows samples, and more can be found on the supplementary website (https://sites.google.com/view/language-dknn/). We use blue highlights to indicate words which positively support the model's predicted class, and red to indicate words that support a different class. The first example is a randomly sampled baseline, showing how the words "swims" and "pool" support the model's prediction of Contradiction. The other examples are selected because they contain terms identified as artifacts. In the second example, conformity leave-one-out assigns extremely high word importance to "sleeping", disregarding other words necessary to predict Contradiction (i.e., the Neutral class is still possible if "pets" is replaced with "people"). In the final two hypotheses, the interpretation method diagnoses the model failure, assigning high importance to "wearing", rather than focusing positively on the shirt color.

To explore this further, we compute the average importance rank using conformity and confidence leave-one-out for the top five artifacts in each SNLI class identified by Gururangan et al. (2018). Table 4 compares the average rank assigned by the two methods, sorting the words by the Pointwise Mutual Information provided by Gururangan et al. (2018). The word "nobody" particularly stands out: it is the most important input word every time it appears in a contradiction example.

For most of the artifacts, conformity leave-one-out assigns them a high importance, often ranking the artifacts as the most important input word. Confidence leave-one-out correlates less strongly with the known artifacts, frequently assigning importance values as low as fifth or sixth most important. Given the high correlation between conformity leave-one-out and the manually identified artifacts, this interpretation method may serve as a technique to identify undesirable biases a model may have learned.
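The rank statistics in Table 4 could be computed along the lines of the sketch below. The `word_importance` callable (mapping a tokenized hypothesis to per-word importance values from either method) and the choice to rank by descending raw importance are illustrative assumptions, not the paper's exact procedure.

```python
from typing import Callable, List, Sequence

import numpy as np


def average_importance_rank(
    artifact: str,
    examples: Sequence[List[str]],                        # tokenized hypotheses containing the artifact
    word_importance: Callable[[List[str]], np.ndarray],   # hypothetical scorer (conformity or confidence)
) -> float:
    """Average rank of an artifact word, where rank 1 means the most important
    word in the input (as in Table 4)."""
    ranks = []
    for words in examples:
        importances = word_importance(words)
        order = np.argsort(-importances)                  # most important word first
        position = np.where(np.array(words)[order] == artifact)[0][0]
        ranks.append(position + 1)                        # convert to a 1-indexed rank
    return float(np.mean(ranks))
```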

6 Discussion and Related Work

We connect the improvements made by conformity leave-one-out to model confidence issues, compare alternative interpretation improvements, and discuss further features of DkNN.

6.1 Issues in Neural Network Confidence

Gradient and leave-one-out both interpret a model by determining the importance value for each input word. This effectively reduces the problem of interpretation to one of determining model uncertainty. Past work relies on model confidence as a measure of uncertainty. However, a neural network's confidence is unreasonably high: on held-out examples, it far exceeds empirical error rates (Guo et al., 2017). This is further exemplified by the high-confidence predictions produced on inputs that are adversarial (Szegedy et al., 2014) or contain solely noise (Goodfellow et al., 2015). Most importantly for interpretation, the change in confidence often will not properly reflect whether discriminative input features have been removed (Feng et al., 2018).

6.2 Confidence Calibration is Insufficient

We attribute one interpretation failure to neural network confidence issues. Guo et al. (2017) study overconfidence and propose a calibration procedure using temperature scaling. This adds a temperature to the softmax function to align confidence with accuracy. However, this adjustment is not input dependent: the confidence is lower for both full-length examples and ones with words left out. Hence, selecting influential words remains difficult.

To verify this, we create an interpretation baseline using temperature scaling. The results corroborate the intuition: a calibrated leave-one-out does not fix the interpretation issues. Qualitatively, the calibrated interpretations are comparable to confidence leave-one-out. Furthermore, calibrating the DkNN conformity score following Papernot and McDaniel (2018) does not improve interpretability compared to the uncalibrated conformity score.

6.3 Alternative Interpretation Improvements

Recent work improves interpretation methods through other means. Smilkov et al. (2017) and Sundararajan et al. (2017) both aggregate gradient values over multiple passes to eliminate local noise or satisfy interpretation axioms. This work does not address model confidence and is orthogonal to our DkNN approach.

6.4 Interpretation Through Data Selection

Retrieval-Augmented Neural Networks (Zhao and Cho, 2018) are similar to DkNN: they augment model predictions with an information retrieval system that searches over network activations from the training data.

Retrieval-Augmented models and DkNN can both select influential training examples for a test prediction. In particular, the training data activations which are closest to the test point's activations are influential according to the model. These training examples can provide interpretations as a form of analogy (Caruana et al., 1999), an intuitive explanation for both experts and non-experts (Klein, 1989; Kim et al., 2014; Koh and Liang, 2017; Wallace and Boyd-Graber, 2018).

Prediction      Input       Saliency Map
Contradiction   Premise:    a young boy reaches for and touches the propeller of a vintage aircraft.
                Hypothesis: a young boy swims in his pool.
Entailment      Premise:    a brown a dog and a black dog in the edge of the ocean with a wave under them boats are on the water in the background.
                Hypothesis: the pets are sleeping on the grass.
Entailment      Premise:    man in a blue shirt standing in front of a structure painted with geometric designs.
                Hypothesis: a man is wearing a blue shirt.
Entailment      Hypothesis: a man is wearing a black shirt.
Color Legend: Positive Impact, Negative Impact

Table 3: Interpretations generated with conformity leave-one-out align with annotation biases identified in SNLI. In the second example, the model puts emphasis on the word “sleeping”, disregarding other words that could indicate the Neutral class. The final example diagnoses a model’s incorrect Entailment prediction (shown in red). Green highlights indicate words that support the classification decision made (shown in parenthesis), pink highlights indicate words that support a different class.

However, unlike in computer vision, where training data selection using DkNN yielded interpretable examples (Papernot and McDaniel, 2018), our experiments did not find human-interpretable data points for SST or SNLI.

Label           Artifact      Conformity   Confidence
Entailment      outdoors      2.93         3.26
                least         2.22         4.41
                instrument    3.57         4.47
                outside       4.08         4.80
                animal        2.00         4.73
Neutral         tall          1.09         2.61
                first         2.14         2.99
                competition   2.33         5.56
                sad           1.39         1.79
                favorite      1.69         3.89
Contradiction   nobody        1.00         1.00
                sleeping      1.64         2.34
                no            2.53         5.74
                tv            1.92         3.74
                cat           1.42         3.62

Table 4: The top SNLI artifacts identified by Gururangan et al. (2018) are shown on the left. For each word, we compute the average importance rank over the validation set using either Conformity or Confidence leave-one-out. A score of 1.0 indicates that a word is always ranked as the most important word in the input. Conformity leave-one-out assigns stronger importance to artifacts, suggesting it better diagnoses model biases.

6.5 Trust in Model Predictions

Model confidence is important for real-world applications: it signals how much one should trust a neural network's predictions. Unfortunately, users may be misled when a model outputs highly confident predictions on rubbish examples (Goodfellow et al., 2015; Nguyen et al., 2015) or adversarial examples (Szegedy et al., 2014). Recent work decides when to trust a neural network model (Ribeiro et al., 2016; Doshi-Velez and Kim, 2017; Jiang et al., 2018), for instance, by analyzing local linear model approximations (Ribeiro et al., 2016) or by flagging rare network activations using kernel density estimation (Jiang et al., 2018). The DkNN conformity score is a trust metric that helps defend against image adversarial examples (Papernot and McDaniel, 2018). Future work should study whether this robustness extends to interpretations.

7 Future Work and Conclusion

A robust model uncertainty estimate is critical to determine feature importance. The DkNN conformity score is one such uncertainty metric which leads to higher precision interpretations.

DkNN is only a test-time improvement; the model is still trained with maximum likelihood. Combining nearest neighbor and maximum likelihood objectives during training may further improve model accuracy and interpretability. Moreover, other uncertainty estimators do not require test-time modifications, for example, modeling p(x) and p(y | x) using Bayesian Neural Networks (Gal et al., 2016).

Similar to other NLP interpretation methods (Sundararajan et al., 2017; Li et al., 2016), conformity leave-one-out works when a model's representation is fixed-size. For other NLP tasks, such as structured prediction (e.g., translation and parsing) or span prediction (e.g., extractive summarization and reading comprehension), models output a variable number of predictions and our interpretation approach will not suffice. Developing interpretation techniques for these types of models is a necessary area for future work.

We apply DkNN to neural models for text classification. This provides a better estimate of model uncertainty—conformity—which we combine with leave-one-out. This overcomes issues stemming from neural network confidence, leading to higher precision interpretations. Most interestingly, our interpretations are supported by the training data, providing insights into the representations learned by a model.

Acknowledgments

Feng was supported under subcontract to Raytheon BBN Technologies by DARPA award HR0011-15-C-0113. JBG is supported by NSF Grant IIS1652666. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. The authors would like to thank the members of the CLIP lab at the University of Maryland and the anonymous reviewers for their feedback.

References

Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2016. Explaining predictions of non-linear classifiers in NLP. In Workshop on Representation Learning for NLP.

Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Communications of the ACM.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Rich Caruana, Hooshang Kangarloo, John David N. Dionisio, Usha S. Sinha, and David B. Johnson. 1999. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. In ACL.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In EMNLP.

Yarin Gal, Yutian Chen, Roger Frigola, S. Gu, Alex Kendall, Yingzhen Li, Rowan McAllister, Carl Rasmussen, Ilya Sutskever, Gabriel Synnaeve, Nilesh Tripuraneni, Richard Turner, Oriol Vinyals, Adrian Weller, Mark van der Wilk, and Yan Wu. 2016. Uncertainty in deep learning. Ph.D. thesis, University of Oxford.

Amirata Ghorbani, Abubakar Abid, and James Y. Zou. 2017. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547.

Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity search in high dimensions via hashing. In VLDB.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In ICLR.
Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In ICML.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD.

Heinrich Jiang, Been Kim, and Maya R. Gupta. 2018. To trust or not to trust a classifier. arXiv preprint arXiv:1805.11783.

Been Kim, Cynthia Rudin, and Julie A. Shah. 2014. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In NIPS.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867.

Gary A. Klein. 1989. Do decision biases explain too much. In Proceedings of the Human Factors and Ergonomics Society.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In ICML.

Jiwei Li, Will Monroe, and Daniel Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING.

W. James Murdoch, Peter J. Liu, and Bin Yu. 2018. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In ICLR.

Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL.

Nicolas Papernot and Patrick D. McDaniel. 2018. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In *SEM@NAACL-HLT.

Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In KDD.

Glenn Shafer and Vladimir Vovk. 2008. A tutorial on conformal prediction. JMLR.

Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. 2017. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.

Richard Socher, A. V. Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In ICML.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In ICLR.

Eric Wallace and Jordan Boyd-Graber. 2018. Trick me if you can: Adversarial writing of trivia challenge questions. In Proceedings of the ACL 2018 Student Research Workshop.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. In LREC.

Jake Zhao and Kyunghyun Cho. 2018. Retrieval-augmented convolutional neural networks for improved robustness against adversarial examples. arXiv preprint arXiv:1802.09502.
