
Efficient Estimation of Influence of a Training Instance

Sosuke Kobayashi 1,2   Sho Yokoi 1,3   Jun Suzuki 1,3   Kentaro Inui 1,3
1 Tohoku University   2 Preferred Networks, Inc.   3 RIKEN
[email protected]   {yokoi, jun.suzuki, inui}@ecei.tohoku.ac.jp

Abstract

Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model's prediction would change if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset to improve generalization.

1 Introduction

What is the influence of a training instance on a machine learning model? This question has attracted the attention of the community (Cook, 1977; Koh and Liang, 2017; Zhang et al., 2018; Hara et al., 2019). Evaluating the influence of a training instance leads to more interpretable models and to other applications such as data cleansing.

A simple evaluation compares a model with another, similarly trained model whose training does not include the instance of interest. This method, however, requires time and storage costs that grow with the number of instances, which makes it extremely difficult in practice (Table 1). While computationally cheaper estimation methods have been proposed (Koh and Liang, 2017; Hara et al., 2019), they still have computational difficulties or restrictions on model choices. The contribution of this work is an estimation method that (i) is computationally more efficient, (ii) is useful for applications, and (iii) does not significantly sacrifice model performance.

We propose a trick, which we refer to as turn-over dropout, that enables a neural network without restrictions to estimate the influence. This method is computationally efficient as it requires only running two forward computations after training a single model on the entire training dataset. In addition to the efficiency, we demonstrate that it enables BERT (Devlin et al., 2019) and VGGNet (Simonyan and Zisserman, 2015) to analyze the influences of training instances through various experiments, including example-based interpretation of error predictions and data cleansing to improve the accuracy on a test set with a distributional shift.

2 Influence of a Training Instance

2.1 Problem Setup

We present preliminaries on the problem setup. In this paper, we deal with the influence of training with one instance on the prediction for another, which has been studied in Koh and Liang (2017), Hara et al. (2019), and elsewhere. Let $z := (x, y)$ be an instance representing a pair of an input $x \in \mathcal{X}$ and its output $y \in \mathcal{Y}$, and let $D := \{z_i\}_{i=1}^{N}$ be a training dataset. By using an optimization method with $D$, we aim to find a model $f_D : \mathcal{X} \to \mathcal{Y}$. Denoting the loss function by $L(f, z)$, the learning problem is obtaining $\hat{f}_D = \operatorname{argmin}_f \mathbb{E}_{z_i \in D} L(f, z_i)$.

The influence, $I(z_{\mathrm{target}}, z_i; D)$, is the quantitative benefit from $z_i$ to the prediction on $z_{\mathrm{target}}$. Letting $f_{D \setminus \{z_i\}}$ be a model trained on the dataset $D$ excluding $z_i$, the influence is defined as

$$I(z_{\mathrm{target}}, z_i; D) := L(f_{D \setminus \{z_i\}}, z_{\mathrm{target}}) - L(f_D, z_{\mathrm{target}}). \quad (1)$$

Intuitively, the larger this value, the more strongly the training instance $z_i$ contributes to reducing the loss of the prediction on another instance $z_{\mathrm{target}}$. The instance of interest $z_{\mathrm{target}}$ is typically an instance in a test or validation dataset.
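
For concreteness, the leave-one-out evaluation that Equation (1) describes can be written as in the minimal sketch below; `train` and `loss` are placeholders for the user's own training and loss routines (illustrative signatures, not from the paper). The point is only that every $z_i$ requires retraining a model from the same initialization, which is what makes the naive approach cost $O(|D|^2)$ training runs in Table 1.

```python
from typing import Any, Callable, Sequence, Tuple

Instance = Tuple[Any, Any]  # an instance z = (x, y)

def loo_influence(train: Callable[[Sequence[Instance]], Any],
                  loss: Callable[[Any, Instance], float],
                  dataset: Sequence[Instance],
                  z_target: Instance,
                  i: int) -> float:
    """Naive influence of dataset[i] on z_target, following Eq. (1):
    train once with and once without z_i (the same initialization and
    optimization are assumed inside `train`), then compare the losses."""
    f_full = train(dataset)                                          # f_D
    f_without = train([z for j, z in enumerate(dataset) if j != i])  # f_{D \ {z_i}}
    return loss(f_without, z_target) - loss(f_full, z_target)
```
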
Method      Training    Storage      Estimation
Re-train    O(|D|^2)    O(|θ||D|)    O(F|D|)
Hara+       O(|D|)      O(|θ|T)      O(F|D| + (F + F')TB)
Koh+        O(|D|)      O(|θ|)       O(F|D| + (F + F')rtb)
Ours        O(|D|)      O(|θ|)       O(F|D|)

Table 1: Comparison of computational complexity for estimating the influence of all instances on another instance, with Hara et al. (2019) and Koh and Liang (2017), where |θ| is the number of parameters, F is a forward/backward computation, F' is a double backward computation, T is the number of training steps, B is a training minibatch size, b is a minibatch size for stabilizing the approximation, and r and t are hyper-parameters; typically rt ≈ |D|. See the references for details.

Figure 1: Dropout generates a sub-network for each training instance z, and updates its parameters (red; top) only. By contrast, the (blue; bottom) sub-network is not influenced by z. Our estimation uses the difference between the two sub-networks.

2.2 Related Methods

Computing the influence in Equation (1) by re-training two models for each instance is computationally expensive, and several estimation methods have been proposed. Koh and Liang (2017) proposed an estimation method that assumes a strongly convex loss function and a globally optimal solution.[1] While the method is used even with neural models (Koh and Liang, 2017; Han et al., 2020), which do not satisfy the assumption, it still requires a high computational cost. Hara et al. (2019) proposed a method without these restrictions; however, it consumes large disk storage and computation time that depend on the number of optimization steps. Our proposed method is much more efficient, as shown in Table 1. For example, in a case where Koh and Liang (2017)'s method took 10 minutes to estimate the influences of 10,000 training instances on another instance with BERT (Han et al., 2020), our method required only 35 seconds.[2] This efficiency will expand the scope of applications of computing influence. For example, it would enable real-time interpretation of model predictions for users of the models.

[1] Strictly speaking, Koh and Liang (2017) studied a similar but different value from I in Equation (1). Briefly, the formulation in Koh and Liang (2017) considers convex models with the optimal parameters for $f_{D \setminus \{z_i\}}$ and $f_D$. The definition in Hara et al. (2019) did not have such conditions and treated the broader problem. We follow Hara et al. (2019); therefore, the definition in Equation (1) allows any $f_D$ and $f_{D \setminus \{z_i\}}$, as long as they have the same initial parameters and optimization procedures using the same mini-batches except for $z_i$.
[2] For the details, see Appendix B.

3 Proposed Method

3.1 Background: Dropout

Dropout (Hinton et al., 2012; Srivastava et al., 2014) is a popular regularization method for deep neural networks. During training, a $d$-dimensional random mask vector $m$, where $d$ refers to the number of parameters of a model $f$, is sampled, and the neural network $f$ is transformed into a variant $f^m$ whose parameter set is multiplied by $m$ at each update.[3] The elements of the mask $m \in \{0, \frac{1}{p}\}^d$ are randomly sampled as follows: $m_j := m'_j / p$, $m'_j \sim \mathrm{Bernoulli}(p)$. Parameters masked (multiplied) with 0 are disabled in an update step, as in pruning. Thus, dropout randomly selects various sub-networks $f^m$ to be updated at every step. During inference at test time, dropout is not applied. One interpretation of dropout is that it trains numerous sub-networks and uses them as an ensemble (Hinton et al., 2012; Srivastava et al., 2014; Bachman et al., 2014; Baldi and Sadowski, 2014; Bulò et al., 2016). In this work, $p = 0.5$; approximately half of the parameters are zero-masked.

[3] Typically, dropout is applied to the layers of the neural network rather than to its parameter matrices. In this case, each instance in a minibatch drops different column-wise parameters of the matrices at once.
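
As a small illustration of the mask construction just described ($m_j := m'_j / p$ with $m'_j \sim \mathrm{Bernoulli}(p)$), and not code from the paper:

```python
import numpy as np

def sample_dropout_mask(d: int, p: float = 0.5, rng=None) -> np.ndarray:
    """Sample an inverted-dropout mask with entries in {0, 1/p}."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(d) < p).astype(np.float64) / p

# With p = 0.5, roughly half of the d entries are 0 and the rest are 2.
m = sample_dropout_mask(d=8, p=0.5)
```
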
3.2 Proposed Method: Turn-over Dropout

In the standard dropout method, dropout masks are sampled independently at every update. In our proposed method, however, we use instance-specific dropout masks $m(z)$, which are also random vectors but are deterministically generated and tied with each instance $z$. Thus, when the network is trained with an instance $z$, only a deterministic subset of its parameters is updated, as shown in Figure 1. In other words, the sub-network $f^{m(z)}$ is updated; however, the corresponding counterpart of the network, $f^{\widetilde{m}(z)}$, is not affected by $z$ at all, where $\widetilde{m}(z)$ is the flipped mask of $m(z)$, i.e., $\widetilde{m}(z) := \mathbf{1}/p - m(z)$. Both sub-networks, $f^{m(z)}$ and $f^{\widetilde{m}(z)}$, can be used by applying the individual masks to $f$.

These sub-networks are analogously comprehended as two different networks trained on a dataset with or without an instance, respectively, $f_D$ and $f_{D \setminus \{z_i\}}$.[4] From this analogy, the influence of a training instance can be evaluated by considering these two sub-networks. The influence $I(z_{\mathrm{target}}, z_i; D) = L(f_{D \setminus \{z_i\}}, z_{\mathrm{target}}) - L(f_D, z_{\mathrm{target}})$ is estimated as

$$\hat{I}(z_{\mathrm{target}}, z_i; D) := L(f_D^{\widetilde{m}(z_i)}, z_{\mathrm{target}}) - L(f_D^{m(z_i)}, z_{\mathrm{target}}), \quad (2)$$

which corresponds to the gain from using $f_D^{m(z_i)}$ instead of $f_D^{\widetilde{m}(z_i)}$ for a prediction on $z_{\mathrm{target}}$. We call this estimation method turn-over dropout. Its summarized advantages are as follows:

- Lower computation time: The method only requires running the forward procedure two times.
- No snapshot or re-training: A single model can be used for all training instances.
- Easy to implement: The model modification and estimation procedure are very simple.

[4] In this paper, we associate $f^{m(z)}$ and $f^{\widetilde{m}(z)}$ with $f_D$ and $f_{D \setminus \{z_i\}}$, respectively. However, while $f_D$ does not focus on any particular instance in $D$ so much, its substitute $f^{m(z)}$ may be a little biased toward some characteristics of $z$. To ignore this bias, we can use $f_D$ itself (i.e., the full network) instead of $f^{m(z)}$, although the representation powers of $f_D$ and $f_{D \setminus \{z_i\}}$ are different. We tested this alternative but did not find large improvements. Further exploration is interesting future work.
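
To make training with instance-specific masks and the estimator in Equation (2) concrete, below is a minimal NumPy sketch on a toy logistic-regression "network" whose whole parameter vector is masked. The names (masked_loss, train_turnover, estimate_influence) and the toy model are our own illustration of the idea, not the authors' implementation; a real model would apply the masks layer-wise, as noted in footnote 3.

```python
import numpy as np

def masked_loss(w, mask, x, y):
    """Binary cross-entropy of a logistic model with masked weights:
    w * mask plays the role of the sub-network f^{m}."""
    p = 1.0 / (1.0 + np.exp(-x @ (w * mask)))
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def train_turnover(w, data, masks, lr=0.1, epochs=20):
    """SGD in which each instance z_i only updates the sub-network selected
    by its own mask m(z_i); the flipped sub-network never sees z_i."""
    for _ in range(epochs):
        for (x, y), m in zip(data, masks):
            p = 1.0 / (1.0 + np.exp(-x @ (w * m)))
            w -= lr * (p - y) * x * m  # gradient flows only through masked weights
    return w

def estimate_influence(w, m_i, x_t, y_t, p_drop=0.5):
    """Eq. (2): loss under the flipped mask minus loss under the instance's mask."""
    m_flip = 1.0 / p_drop - m_i
    return masked_loss(w, m_flip, x_t, y_t) - masked_loss(w, m_i, x_t, y_t)
```

The per-instance masks themselves can be produced by the volatile mask generation sketch in the next subsection.
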
3.3 Storage-efficient Instance-specific Masks

One may think that using instance-specific masks requires a large amount of space, depending on the dataset size and the number of parameters to be masked. However, this cost is drastically reduced to a constant O(1) using a trick. As the masks are not updated, we do not have to save them directly. Instead, we can deterministically generate the random masks with a fixed random seed number at any time. Thus, models can avoid storing masks and simply generate them when they are used. We call this trick volatile mask generation.[5]

[5] The volatile mask generation method solved storage and memory issues in our experiments. However, a memory issue could still occur even with the method, depending on the implementation. For such a particular case and another solution for it, see Appendix C.
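
One way to realize volatile mask generation is to reseed a random generator with a fixed global seed plus the instance's index every time the mask is needed, so masks are never stored. The specific seeding scheme below is our own assumption for illustration, not a detail specified in the paper.

```python
import numpy as np

GLOBAL_SEED = 12345  # fixed once for the whole trained model

def instance_mask(instance_id: int, d: int, p: float = 0.5) -> np.ndarray:
    """Deterministically (re)generate m(z) from the instance's id.
    The same (seed, id) pair always yields the same mask, so masks need
    O(1) storage instead of O(|D| * d)."""
    rng = np.random.default_rng([GLOBAL_SEED, instance_id])
    return (rng.random(d) < p).astype(np.float64) / p

def flipped_mask(mask: np.ndarray, p: float = 0.5) -> np.ndarray:
    """The complementary sub-network's mask, 1/p - m(z)."""
    return 1.0 / p - mask
```
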

4 Experiments

The computational efficiency of our method is discussed in Section 2. Moreover, we answer a further question: even if it is efficient, does it work well in applications? To demonstrate the applicability, we conducted experiments using different models and datasets.

Setup. First, we used the Stanford Sentiment TreeBank (SST-2) (Socher et al., 2013) binary sentiment classification task. Five thousand instances were sampled from the training set, and the 872 instances in the development set were used. We trained BERT-base classifiers (Wolf et al., 2019) with adapter modules (Houlsby et al., 2019), which freeze the pre-trained BERT parameters and train newly added branch networks in addition to the output layers. We applied the turn-over dropout to the adapter modules and output layers.

In addition, we used the CIFAR-10 (Krizhevsky, 2009) 10-class image classification task, with its 50,000 training instances and 10,000 validation instances. We trained a VGGNet19 classifier (Simonyan and Zisserman, 2015) with the turn-over dropout.

Models were trained with the cross-entropy loss. Further details of the setup are given in Appendix A.
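
At the module level, "applying the turn-over dropout to the adapter modules and output layers" could look roughly like the following PyTorch layer, which regenerates a fixed per-instance mask from an instance id and multiplies it into the layer's activations (layer-wise, in the spirit of footnote 3). The class, its seeding scheme, and its interface are hypothetical illustrations, not the configuration used in the paper's experiments.

```python
import torch
import torch.nn as nn

class TurnoverDropout(nn.Module):
    """Dropout whose mask is a deterministic function of the training-instance id."""

    def __init__(self, dim: int, p: float = 0.5, seed: int = 12345):
        super().__init__()
        self.dim, self.p, self.seed = dim, p, seed

    def forward(self, x: torch.Tensor, instance_ids: torch.Tensor,
                flip: bool = False) -> torch.Tensor:
        # Regenerate each instance's mask on the fly (volatile mask generation).
        masks = []
        for idx in instance_ids.tolist():
            g = torch.Generator().manual_seed(self.seed + int(idx))
            keep = (torch.rand(self.dim, generator=g) < self.p).float() / self.p
            masks.append(1.0 / self.p - keep if flip else keep)
        return x * torch.stack(masks).to(x.device)
```

During training, flip=False, so each instance updates only its own sub-network; at estimation time the same layer is run twice, once with flip=False and once with flip=True, to obtain the two losses in Equation (2).
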

4.1 Side Effect on Model Performance

Note that turn-over dropout is not aimed at improving the accuracy of models. It gives the models a way of efficiently estimating the influence of each training instance. A possible side effect is a deterioration of accuracy due to introducing instance-specific dropout with $p = 0.5$.[6] Thus, we first explored the change in classification accuracy when using the turn-over dropout.

For BERT with the adapter modules on SST-2, if we use a small dataset (N=5,000), the accuracy slightly decreased from the baseline model, from 90.0% to 88.3%. If we use a larger dataset (N=20,000), the change is negligible: 90.5% versus 90.2%. Thus, in the case of large datasets, where we typically want to use turn-over dropout for efficiency, applying the turn-over dropout does not decrease the validation accuracy compared with the baseline. However, when we use turn-over dropout on all layers of BERT without the adapter modules, training becomes unstable. The same is true for VGGNet on CIFAR-10. Instead, we applied the turn-over dropout only to the layers after the 11th layer, although this means that early layers can learn all instances in the training dataset and make the turn-over dropout leaky.[7] We found that VGGNet with turn-over dropout can overfit more than the baseline does; their accuracies are 86.2% and 92.0%, respectively. If we add regularization using the original dropout, the accuracy recovers to 91.3%. Thus, in some cases, we have to care about the decrease in model performance when using turn-over dropout. While we experimented only with the successful architectures, exploring this side effect in various architectures and its remedy is important future work.

[6] Dropout with $p = 0.5$ is often used in various neural networks, especially on their linear layers, and improves the accuracy. However, dropout on all layers could be damaging. It is also unclear how dropout with "static" masks affects performance because the idea is novel.
[7] Asano et al. (2020) demonstrated that the early layers of a CNN contain limited information about the statistics of images, and such low-level statistics can be learned even from a single image. Based on this finding, we assumed that early layers did not fit each instance so much, and the effect of leakage was small.

4.2 Sanity Check: Learning Curves

We first observed an interesting property of the turn-over dropout from the loss curves during training, as shown in Figure 2. The solid red line of the training loss using $m(z_{\mathrm{train}})$, $L(f_D^{m(z_{\mathrm{train}})}, z_{\mathrm{train}})$, showed a typical tendency of training loss. However, the solid blue line of the training loss using $\widetilde{m}(z_{\mathrm{train}})$, $L(f_D^{\widetilde{m}(z_{\mathrm{train}})}, z_{\mathrm{train}})$, indicated loss values close to the test losses (dotted lines), without overfitting. This fact agrees with the idea behind the turn-over dropout: the sub-network $f^{\widetilde{m}(z_{\mathrm{train}})}$ using the flipped mask does not learn each training instance $z_{\mathrm{train}}$.

Figure 2: Loss curves of BERT on SST-2.

4.3 Interpretation of Error Predictions

Neural network models are notorious for their black-box predictions, which harm trust and usability (Ribeiro et al., 2016). The influence estimation can mitigate this problem by suggesting possible reasons for a wrong model prediction through identifying influential training instances.

To verify this benefit, we collected the misclassified instances of the validation or test set and searched for the training instances that most influenced the wrong predictions, as sketched below. Figure 3 shows a text example from the results. Rare words of named entities were divided into many subwords (Schuster and Nakajima, 2012; Sennrich et al., 2016; Wu et al., 2016), requiring more complex processing. A guess is that BERT might have failed to understand the input due to the cluttered subwords and predicted a wrong label, which depended on a training instance that similarly contains many subwords.

Figure 3: A misclassified text in the test set and the text with the highest influence with the error label in the training set for BERT on SST-2.
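
The search for influential training instances can be written as a loop over the training set with two forward passes per instance. The sketch below is self-contained around a generic `influence(i, z)` callable, for which one could plug in, e.g., `estimate_influence` with `instance_mask(i, d)` from the earlier sketches; it only illustrates the ranking step, not the paper's actual experiment code.

```python
from typing import Any, Callable, List

def rank_by_influence(influence: Callable[[int, Any], float],
                      n_train: int,
                      z_error: Any,
                      top_k: int = 5) -> List[int]:
    """Indices of the training instances with the highest estimated influence
    (Eq. 2) on one misclassified instance z_error."""
    scores = sorted(((influence(i, z_error), i) for i in range(n_train)),
                    reverse=True)
    return [i for _, i in scores[:top_k]]
```
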

Additionally, we conducted the same experiment on the Yahoo Answers 10-label question classification dataset (Zhang et al., 2015),[8] which is more complex than sentiment analysis. Figure 4 shows the results on Yahoo Answers. The misclassified text shares the phrase "ch ##rist" with the two influential instances. Such a low-level cue is not critical in the test instance. However, it seems that the model focused on the phrase and predicted the label of the training instances containing the phrase.

Figure 4: A misclassified text in the test set and the texts with the highest influence with the error label in the training set for BERT on Yahoo Answers.

[8] We used 5,000 training instances, as for SST-2.

In addition, more intuitively, image results are shown in Figure 5. The two leftmost instances with the "bird" label were wrongly predicted as "airplane." The training instances of airplanes with the highest influence on the error predictions are shown in the row below. The corresponding images had similar visual features, such as shape, layout, or color, which probably led to the wrong predictions.

Figure 5: Misclassified images in the validation set (upper row) and images with the highest influence in the training set (lower row) for VGGNet on CIFAR-10.

4.4 Data Cleansing

Another possible application of the influence estimation is to eliminate harmful instances from the training dataset. If the mean influence of a training instance on unseen instances is negative, the instance can be harmful for generalization. We experimented with data cleansing in a case of domain shift, where the training dataset is SST-2 (movie reviews) while the validation and test datasets come from the "electronics" subset of the Multi-Domain Sentiment Dataset (Blitzer et al., 2007) (Elec). We split the Elec dataset into 200 instances for validation and 1,800 instances for the test. Note that we did not use the Elec dataset as a training dataset, in order to study only the effect of data cleansing.

We finetuned BERT models (with turn-over dropout) on the SST-2 dataset and calculated the mean influences with respect to Elec's validation set. After that, we re-trained models without turn-over dropout on datasets from which the training instances with the 1% most negative influences had been removed. Finally, the model performances on Elec's test dataset were compared, as shown in Table 2. The models trained on the cleansed datasets achieved better accuracy and lower loss than those trained on the original dataset. This result demonstrates that our estimation of the influence can also be used for data cleansing; a sketch of the procedure follows.

                     Accuracy (%)    Loss
1% Random Removal    76.8 ± 1.1      0.521 ± 0.030
No Cleansing         77.0 ± 0.9      0.536 ± 0.063
1% Cleansing         78.3 ± 0.2      0.484 ± 0.008

Table 2: The results of data cleansing. Loss is the cross-entropy loss. The averages and standard deviations over four different runs are shown.
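
A sketch of the cleansing recipe: estimate each training instance's mean influence over the validation set, then drop the most negative 1% before retraining. As above, `influence(i, z)` is a generic callable standing in for the turn-over-dropout estimator; the function and the threshold handling are our own illustration, not the paper's code.

```python
from typing import Any, Callable, List, Sequence

def cleanse(influence: Callable[[int, Any], float],
            train_data: Sequence[Any],
            val_data: Sequence[Any],
            remove_frac: float = 0.01) -> List[Any]:
    """Remove the training instances whose mean estimated influence on the
    validation set is most negative (the bottom 1% by default)."""
    n = len(train_data)
    mean_infl = [sum(influence(i, z) for z in val_data) / len(val_data)
                 for i in range(n)]
    n_remove = max(1, int(remove_frac * n))
    harmful = set(sorted(range(n), key=lambda i: mean_infl[i])[:n_remove])
    return [z for i, z in enumerate(train_data) if i not in harmful]
```
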

5 Conclusion

This paper proposed a method with a low computational cost for estimating the influence of a training instance. The method alters dropout with instance-specific masks and, for estimation, uses sub-networks that are not trained with each instance. The experiments demonstrated that this method can be applied even to complex models.

Acknowledgments

We appreciate the helpful comments from the anonymous reviewers. We thank Sho Takase, Hiroshi Noji, Hitomi Yanaka, Koki Washio, Saku Sugawara, Benjamin Heinzerling, and Kazuaki Hanawa for constructive comments. This work was supported by JSPS KAKENHI Grant Number JP19H04162.

References

Philip Bachman, Ouais Alsharif, and Doina Precup. 2014. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems 27, pages 3365–3373.

Pierre Baldi and Peter Sadowski. 2014. The dropout learning algorithm. Artificial Intelligence, 210(C):78–122.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447, Prague, Czech Republic. Association for Computational Linguistics.

Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. 2016. Dropout distillation. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 99–107.

R. Dennis Cook. 1977. Detection of influential observation in linear regression. Technometrics, pages 15–18.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Xiaochuang Han, Byron C. Wallace, and Yulia Tsvetkov. 2020. Explaining black box predictions and unveiling data artifacts through influence functions. In Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics (to appear).

Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. 2019. Data cleansing for models trained with SGD. In Advances in Neural Information Processing Systems 32, pages 4215–4224.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, Long Beach, California, USA. PMLR.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885–1894.

Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical report.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing, pages 5149–5152.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. 2020. A critical analysis of self-supervision, or what we can learn from a single image. In International Conference on Learning Representations (ICLR).

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Xuezhou Zhang, Xiaojin Zhu, and Stephen Wright. 2018. Training set debugging using trusted items. In AAAI Conference on Artificial Intelligence.
