Arxiv:1804.07781V3 [Cs.CL] 28 Aug 2018 Regular Examples

Pathologies of Neural Models Make Interpretations Difficult Shi Feng1 Eric Wallace1 Alvin Grissom II2 Mohit Iyyer3;4 Pedro Rodriguez1 Jordan Boyd-Graber1 1University of Maryland 2Ursinus College 3UMass Amherst 4Allen Institute for Artificial Intelligence fshifeng,ewallac2,entilzha,[email protected], [email protected], [email protected] Abstract SQUAD Context In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop One way to interpret neural model predic- and produce a new lighting system. In- tions is to highlight the most important instead, Tesla used the money to fund his put features—for example, a heatmap visu- Colorado Springs experiments. alization over the words in an input sen- Original What did Tesla spend Astor’s money on ? tence. In existing interpretation methods for Reduced did NLP, a word’s importance is determined by Confidence 0.78 ! 0.91 either input perturbation—measuring the decrease in model confidence when that word is removed—or by the gradient with respect to Figure 1: SQUAD example from the validation set. that word. To understand the limitations of Given the original Context, the model makes the same these methods, we use input reduction, which correct prediction (“Colorado Springs experiments”) iteratively removes the least important word on the Reduced question as the Original, with even from the input. This exposes pathological be- higher confidence. For humans, the reduced question, haviors of neural models: the remaining words “did”, is nonsensical. appear nonsensical to humans and are not the ones determined as important by interpretation methods. As we confirm with human ex- A common, non-adversarial form of model in- periments, the reduced examples lack infor- terpretation is feature attribution: features that mation to support the prediction of any la- are crucial for predictions are highlighted in a bel, but models still make the same predic- heatmap. One can measure a feature’s importance tions with high confidence. To explain these by input perturbation. Given an input for text clas- counterintuitive results, we draw connections to adversarial examples and confidence cali- sification, a word’s importance can be measured bration: pathological behaviors reveal difficul- by the difference in model confidence before and ties in interpreting neural models trained with after that word is removed from the input—the maximum likelihood. To mitigate their defi- word is important if confidence decreases signifi- ciencies, we fine-tune the models by encourag- cantly. This is the leave-one-out method (Li et al., ing high entropy outputs on reduced examples. 2016b). Gradients can also measure feature im- Fine-tuned models become more interpretable portance; for example, a feature is influential to the under input reduction without accuracy loss on arXiv:1804.07781v3 [cs.CL] 28 Aug 2018 regular examples. prediction if its gradient is a large positive value. Both perturbation and gradient-based methods can 1 Introduction generate heatmaps, implying that the model’s prediction is highly influenced by the highlighted, im- Many interpretation methods for neural networks portant words. explain the model’s prediction as a counterfactual: Instead, we study how the model’s prediction is how does the prediction change when the input is influenced by the unimportant words. We use in- modified? Adversarial examples (Szegedy et al., put reduction, a process that iteratively removes 2014; Goodfellow et al., 2015) highlight the in- the unimportant words from the input while main- stability of neural network predictions by showing taining the model’s prediction. Intuitively, the how small perturbations to the input dramatically words remaining after input reduction should be change the output. important for prediction. Moreover, the words should match the leave-one-out method’s selec- 2.1 Importance from Input Gradient tions, which closely align with human percep- Ribeiro et al.(2016) and Li et al.(2016b) de- tion (Li et al., 2016b; Murdoch et al., 2018). How- fine importance by seeing how confidence changes ever, rather than providing explanations of the when a feature is removed; a natural approxima- original prediction, our reduced examples more tion is to use the gradient (Baehrens et al., 2010; closely resemble adversarial examples. In Fig- Simonyan et al., 2014). We formally define these ure1, the reduced input is meaningless to a hu- importance metrics in natural language contexts man but retains the same model prediction with and introduce the efficient gradient-based approx- high confidence. Gradient-based input reduction imation. For each word in an input sentence, we exposes pathological model behaviors that contra- measure its importance by the change in the con- dict what one expects based on existing interpreta- fidence of the original prediction when we remove tion methods. that word from the sentence. We switch the sign In Section2, we construct more of these coun- so that when the confidence decreases, the impor- terintuitive examples by augmenting input reduc- tance value is positive. tion with beam search and experiment with three Formally, let x = hx1; x2; : : : xni denote the in- tasks: SQUAD (Rajpurkar et al., 2016) for read- put sentence, f(y j x) the predicted probability of 0 ing comprehension, SNLI (Bowman et al., 2015) label y, and y = argmaxy0 f(y j x) the original for textual entailment, and VQA (Antol et al., predicted label. The importance is then 2015) for visual question answering. Input re- g(x j x) = f(y j x) − f(y j x ): (1) duction with beam search consistently reduces the i −i input sentence to very short lengths—often only To calculate the importance of each word in a sen- one or two words—without lowering model confi- tence with n words, we need n forward passes of dence on its original prediction. The reduced ex- the model, each time with one of the words left amples appear nonsensical to humans, which we out. This is highly inefficient, especially for longer verify with crowdsourced experiments. In Sec- sentences. Instead, we approximate the impor- tion3, we draw connections to adversarial exam- tance value with the input gradient. For each word ples and confidence calibration; we explain why in the sentence, we calculate the dot product of the observed pathologies are a consequence of the its word embedding and the gradient of the output overconfidence of neural models. This elucidates with respect to the embedding. The importance limitations of interpretation methods that rely on of n words can thus be computed with a single model confidence. In Section4, we encourage forward-backward pass. This gradient approxima- high model uncertainty on reduced examples with tion has been used for various interpretation meth- entropy regularization. The pathological model ods for natural language classification models (Li behavior under input reduction is mitigated, lead- et al., 2016a; Arras et al., 2016); see Ebrahimi ing to more reasonable reduced examples. et al.(2017) for further details on the derivation. We use this approximation in all our experiments as it selects the same words for removal as an ex- 2 Input Reduction haustive search (no approximation). 2.2 Removing Unimportant Words To explain model predictions using a set of impor- Instead of looking at the words with high important words, we must first define importance. Af- tance values—what interpretation methods com- ter defining input perturbation and gradient-based monly do—we take a complementary approach approximation, we describe input reduction with and study how the model behaves when the sup- these importance metrics. Input reduction dras- posedly unimportant words are removed. Intu- tically shortens inputs without causing the model itively, the important words should remain after to change its prediction or significantly decrease the unimportant ones are removed. its confidence. Crowdsourced experiments con- Our input reduction process iteratively removes firm that reduced examples appear nonsensical to the unimportant words. At each step, we remove humans: input reduction uncovers pathological the word with the lowest importance value un- model behaviors. til the model changes its prediction. We experiment with three popular datasets: SQUAD (Ra- SNLI Premise Well dressed man and woman dancing in jpurkar et al., 2016) for reading comprehension, the street SNLI (Bowman et al., 2015) for textual entail- Original Two man is dancing on the street ment, and VQA (Antol et al., 2015) for visual Reduced dancing Answer Contradiction question answering. We describe each of these Confidence 0.977 ! 0.706 tasks and the model we use below, providing full VQA details in the Supplement. In SQUAD, each example is a context paragraph and a question. The task is to predict a span in the paragraph as the answer. We reduce only the question while keeping the context paragraph unchanged. The model we use is the DRQA Doc- Original What color is the flower ? ument Reader (Chen et al., 2017). Reduced flower ? In SNLI, each example consists of two sen- Answer yellow Confidence 0.827 ! 0.819 tences: a premise and a hypothesis. The task is to predict one of three relationships: entailment, neutral, or contradiction. We reduce only the hy- Figure 2: Examples of original and reduced inputs pothesis while keeping the premise unchanged. where the models predict the same Answer. Reduced The model we use is Bilateral Multi-Perspective shows the input after reduction. We remove words from Matching (BIMPM) (Wang et al., 2017). the hypothesis for SNLI, questions for SQUAD and VQA. Given the nonsensical reduced inputs, humans In VQA, each example consists of an image would not be able to provide the answer with high con- and a natural language question. We reduce only fidence, yet, the neural models do. the question while keeping the image unchanged. The model we use is Show, Ask, Attend, and An- swer (Kazemi and Elqursh, 2017).

Load more