
Efficient Estimation of Influence of a Training Instance

Sosuke Kobayashi 1,2   Sho Yokoi 1,3   Jun Suzuki 1,3   Kentaro Inui 1,3
1 Tohoku University   2 Preferred Networks, Inc.   3 RIKEN
[email protected]   {yokoi, jun.suzuki, inui}@ecei.tohoku.ac.jp

Abstract

Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model's prediction would change if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset to improve generalization.

1 Introduction

What is the influence of a training instance on a machine learning model? This question has attracted the attention of the community (Cook, 1977; Koh and Liang, 2017; Zhang et al., 2018; Hara et al., 2019). Evaluating the influence of a training instance leads to more interpretable models and to other applications such as data cleansing.

A simple evaluation compares a model with another, similarly trained model whose training does not include the instance of interest. This method, however, requires time and storage costs that grow with the number of instances, which makes it extremely difficult in practice (Table 1). While computationally cheaper estimation methods have been proposed (Koh and Liang, 2017; Hara et al., 2019), they still have computational difficulties or restrictions on model choices. The contribution of this work is an estimation method that (i) is computationally more efficient, (ii) is useful for applications, and (iii) does not significantly sacrifice model performance.

We propose a trick, which we refer to as turn-over dropout, that enables a neural network without restrictions to estimate the influence. This method is computationally efficient as it requires only running two forward computations after training a single model on the entire training dataset. In addition to the efficiency, we demonstrate that it enables BERT (Devlin et al., 2019) and VGGNet (Simonyan and Zisserman, 2015) to analyze the influences of training instances through various experiments, including example-based interpretation of error predictions and data cleansing to improve the accuracy on a test set with a distributional shift.

2 Influence of a Training Instance

2.1 Problem Setup

We present preliminaries on the problem setup. In this paper, we deal with the influence of training with one instance on the prediction for another, which has been studied in Koh and Liang (2017), Hara et al. (2019), and elsewhere. Let $z := (x, y)$ be an instance representing a pair of an input $x \in \mathcal{X}$ and its output $y \in \mathcal{Y}$, and let $D := \{z_i\}_{i=1}^{N}$ be a training dataset. By using an optimization method with $D$, we aim to find a model $f_D : \mathcal{X} \to \mathcal{Y}$. Denoting the loss function by $L(f, z)$, the learning problem is obtaining $\hat{f}_D = \operatorname{argmin}_f \mathbb{E}_{z_i \in D} L(f, z_i)$.

The influence, $I(z_{\mathrm{target}}, z_i; D)$, is the quantitative benefit from $z_i$ to the prediction on $z_{\mathrm{target}}$. Letting $f_{D \setminus \{z_i\}}$ be a model trained on the dataset $D$ excluding $z_i$, the influence is defined as

$$I(z_{\mathrm{target}}, z_i; D) := L(f_{D \setminus \{z_i\}}, z_{\mathrm{target}}) - L(f_D, z_{\mathrm{target}}). \quad (1)$$

Intuitively, the larger this value, the more strongly the training instance $z_i$ contributes to reducing the loss of the prediction on another instance $z_{\mathrm{target}}$. The instance of interest $z_{\mathrm{target}}$ is typically an instance in a test or validation dataset.
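
For concreteness, the leave-one-out evaluation that Equation (1) describes can be written as in the minimal sketch below; `train` and `loss` are placeholders for the user's own training and loss routines (illustrative signatures, not from the paper). The point is only that every $z_i$ requires retraining a model from the same initialization, which is what makes the naive approach cost $O(|D|^2)$ training runs in Table 1.

```python
from typing import Any, Callable, Sequence, Tuple

Instance = Tuple[Any, Any]  # an instance z = (x, y)

def loo_influence(train: Callable[[Sequence[Instance]], Any],
                  loss: Callable[[Any, Instance], float],
                  dataset: Sequence[Instance],
                  z_target: Instance,
                  i: int) -> float:
    """Naive influence of dataset[i] on z_target, following Eq. (1):
    train once with and once without z_i (the same initialization and
    optimization are assumed inside `train`), then compare the losses."""
    f_full = train(dataset)                                          # f_D
    f_without = train([z for j, z in enumerate(dataset) if j != i])  # f_{D \ {z_i}}
    return loss(f_without, z_target) - loss(f_full, z_target)
```
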
Method      Training    Storage      Estimation
Re-train    O(|D|^2)    O(|θ||D|)    O(F|D|)
Hara+       O(|D|)      O(|θ|T)      O(F|D| + (F + F')TB)
Koh+        O(|D|)      O(|θ|)       O(F|D| + (F + F')rtb)
Ours        O(|D|)      O(|θ|)       O(F|D|)

Table 1: Comparison of computational complexity for estimating the influence of all instances on another instance, with Hara et al. (2019) and Koh and Liang (2017), where |θ| is the number of parameters, F is a forward/backward computation, F' is a double backward computation, T is the number of training steps, B is a training minibatch size, b is a minibatch size for stabilizing the approximation, and r and t are hyper-parameters; typically rt ≈ |D|. See the references for details.

Figure 1: Dropout generates a sub-network for each training instance z, and updates its parameters (red; top) only. By contrast, the (blue; bottom) sub-network is not influenced by z. Our estimation uses the difference between the two sub-networks.

2.2 Related Methods

Computing the influence in Equation (1) by re-training two models for each instance is computationally expensive, and several estimation methods have been proposed. Koh and Liang (2017) proposed an estimation method that assumes a strongly convex loss function and a globally optimal solution.[1] While the method is used even with neural models (Koh and Liang, 2017; Han et al., 2020), which do not satisfy the assumption, it still requires a high computational cost. Hara et al. (2019) proposed a method without these restrictions; however, it consumes large disk storage and computation time that depend on the number of optimization steps. Our proposed method is much more efficient, as shown in Table 1. For example, in a case where Koh and Liang (2017)'s method took 10 minutes to estimate the influences of 10,000 training instances on another instance with BERT (Han et al., 2020), our method required only 35 seconds.[2] This efficiency will expand the scope of applications of computing influence. For example, it would enable real-time interpretation of model predictions for users of the models.

[1] Strictly speaking, Koh and Liang (2017) studied a similar but different value from I in Equation (1). Briefly, the formulation in Koh and Liang (2017) considers convex models with the optimal parameters for $f_{D \setminus \{z_i\}}$ and $f_D$. The definition in Hara et al. (2019) did not have such conditions and treated the broader problem. We follow Hara et al. (2019); therefore, the definition in Equation (1) allows any $f_D$ and $f_{D \setminus \{z_i\}}$, as long as they have the same initial parameters and optimization procedures using the same mini-batches except for $z_i$.
[2] For the details, see Appendix B.

3 Proposed Method

3.1 Background: Dropout

Dropout (Hinton et al., 2012; Srivastava et al., 2014) is a popular regularization method for deep neural networks. During training, a $d$-dimensional random mask vector $m$, where $d$ refers to the number of parameters of a model $f$, is sampled, and the neural network $f$ is transformed into a variant $f^m$ whose parameter set is multiplied by $m$ at each update.[3] The elements of the mask $m \in \{0, \frac{1}{p}\}^d$ are randomly sampled as follows: $m_j := m'_j / p$, $m'_j \sim \mathrm{Bernoulli}(p)$. Parameters masked (multiplied) with 0 are disabled in an update step, as in pruning. Thus, dropout randomly selects various sub-networks $f^m$ to be updated at every step. During inference at test time, dropout is not applied. One interpretation of dropout is that it trains numerous sub-networks and uses them as an ensemble (Hinton et al., 2012; Srivastava et al., 2014; Bachman et al., 2014; Baldi and Sadowski, 2014; Bulò et al., 2016). In this work, $p = 0.5$; approximately half of the parameters are zero-masked.

[3] Typically, dropout is applied to the layers of the neural network rather than to its parameter matrices. In this case, each instance in a minibatch drops different column-wise parameters of the matrices at once.
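
As a small illustration of the mask construction just described ($m_j := m'_j / p$ with $m'_j \sim \mathrm{Bernoulli}(p)$), and not code from the paper:

```python
import numpy as np

def sample_dropout_mask(d: int, p: float = 0.5, rng=None) -> np.ndarray:
    """Sample an inverted-dropout mask with entries in {0, 1/p}."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(d) < p).astype(np.float64) / p

# With p = 0.5, roughly half of the d entries are 0 and the rest are 2.
m = sample_dropout_mask(d=8, p=0.5)
```
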
3.2 Proposed Method: Turn-over Dropout

In the standard dropout method, dropout masks are sampled independently at every update. In our proposed method, however, we use instance-specific dropout masks $m(z)$, which are also random vectors but are deterministically generated and tied with each instance $z$. Thus, when the network is trained with an instance $z$, only a deterministic subset of its parameters is updated, as shown in Figure 1. In other words, the sub-network $f^{m(z)}$ is updated; however, the corresponding counterpart of the network, $f^{\widetilde{m}(z)}$, is not affected by $z$ at all, where $\widetilde{m}(z)$ is the flipped mask of $m(z)$, i.e., $\widetilde{m}(z) := \mathbf{1}/p - m(z)$. Both sub-networks, $f^{m(z)}$ and $f^{\widetilde{m}(z)}$, can be used by applying the individual masks to $f$.

These sub-networks are analogously comprehended as two different networks trained on a dataset with or without an instance, respectively, $f_D$ and $f_{D \setminus \{z_i\}}$.[4] From this analogy, the influence of a training instance can be evaluated by considering these two sub-networks. The influence $I(z_{\mathrm{target}}, z_i; D) = L(f_{D \setminus \{z_i\}}, z_{\mathrm{target}}) - L(f_D, z_{\mathrm{target}})$ is estimated as

$$\hat{I}(z_{\mathrm{target}}, z_i; D) := L(f_D^{\widetilde{m}(z_i)}, z_{\mathrm{target}}) - L(f_D^{m(z_i)}, z_{\mathrm{target}}), \quad (2)$$

which corresponds to the gain from using $f_D^{m(z_i)}$ instead of $f_D^{\widetilde{m}(z_i)}$ for a prediction on $z_{\mathrm{target}}$. We call this estimation method turn-over dropout. Its summarized advantages are as follows:

- Lower computation time: The method only requires running the forward procedure two times.
- No snapshot or re-training: A single model can be used for all training instances.
- Easy to implement: The model modification and estimation procedure are very simple.

[4] In this paper, we associate $f^{m(z)}$ and $f^{\widetilde{m}(z)}$ with $f_D$ and $f_{D \setminus \{z_i\}}$, respectively. However, while $f_D$ does not focus on any particular instance in $D$ so much, its substitute $f^{m(z)}$ may be a little biased toward some characteristics of $z$. To ignore this bias, we can use $f_D$ itself (i.e., the full network) instead of $f^{m(z)}$, although the representation powers of $f_D$ and $f_{D \setminus \{z_i\}}$ are different. We tested this alternative but did not find large improvements. Further exploration is interesting future work.
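
To make training with instance-specific masks and the estimator in Equation (2) concrete, below is a minimal NumPy sketch on a toy logistic-regression "network" whose whole parameter vector is masked. The names (masked_loss, train_turnover, estimate_influence) and the toy model are our own illustration of the idea, not the authors' implementation; a real model would apply the masks layer-wise, as noted in footnote 3.

```python
import numpy as np

def masked_loss(w, mask, x, y):
    """Binary cross-entropy of a logistic model with masked weights:
    w * mask plays the role of the sub-network f^{m}."""
    p = 1.0 / (1.0 + np.exp(-x @ (w * mask)))
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def train_turnover(w, data, masks, lr=0.1, epochs=20):
    """SGD in which each instance z_i only updates the sub-network selected
    by its own mask m(z_i); the flipped sub-network never sees z_i."""
    for _ in range(epochs):
        for (x, y), m in zip(data, masks):
            p = 1.0 / (1.0 + np.exp(-x @ (w * m)))
            w -= lr * (p - y) * x * m  # gradient flows only through masked weights
    return w

def estimate_influence(w, m_i, x_t, y_t, p_drop=0.5):
    """Eq. (2): loss under the flipped mask minus loss under the instance's mask."""
    m_flip = 1.0 / p_drop - m_i
    return masked_loss(w, m_flip, x_t, y_t) - masked_loss(w, m_i, x_t, y_t)
```

The per-instance masks themselves can be produced by the volatile mask generation sketch in the next subsection.
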
3.3 Storage-efficient Instance-specific Masks

One may think that using instance-specific masks requires a large amount of space, depending on the dataset size and the number of parameters to be masked. However, this cost is drastically reduced to a constant O(1) using a trick. As the masks are not updated, we do not have to save them directly. Instead, we can deterministically generate the random masks with a fixed random seed number at any time. Thus, models can avoid storing masks and simply generate them when they are used. We call this trick volatile mask generation.[5]

[5] The volatile mask generation method solved storage and memory issues in our experiments. However, a memory issue could still occur even with the method, depending on the implementation. For such a particular case and another solution for it, see Appendix C.
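
One way to realize volatile mask generation is to reseed a random generator with a fixed global seed plus the instance's index every time the mask is needed, so masks are never stored. The specific seeding scheme below is our own assumption for illustration, not a detail specified in the paper.

```python
import numpy as np

GLOBAL_SEED = 12345  # fixed once for the whole trained model

def instance_mask(instance_id: int, d: int, p: float = 0.5) -> np.ndarray:
    """Deterministically (re)generate m(z) from the instance's id.
    The same (seed, id) pair always yields the same mask, so masks need
    O(1) storage instead of O(|D| * d)."""
    rng = np.random.default_rng([GLOBAL_SEED, instance_id])
    return (rng.random(d) < p).astype(np.float64) / p

def flipped_mask(mask: np.ndarray, p: float = 0.5) -> np.ndarray:
    """The complementary sub-network's mask, 1/p - m(z)."""
    return 1.0 / p - mask
```
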

4 Experiments

The computational efficiency of our method is discussed in Section 2. Moreover, we answer a further question: even if it is efficient, does it work well in applications? To demonstrate the applicability, we conducted experiments using different models and datasets.

Setup. First, we used the Stanford Sentiment TreeBank (SST-2) (Socher et al., 2013) binary sentiment classification task. Five thousand instances were sampled from the training set, and the 872 instances in the development set were used. We trained BERT-base classifiers (Wolf et al., 2019) with adapter modules (Houlsby et al., 2019), which freeze the pre-trained BERT parameters and train newly added branch networks in addition to the output layers. We applied the turn-over dropout to the adapter modules and output layers.

In addition, we used the CIFAR-10 (Krizhevsky, 2009) 10-class image classification task, with its 50,000 training instances and 10,000 validation instances. We trained a VGGNet19 classifier (Simonyan and Zisserman, 2015) with the turn-over dropout.

Models were trained with the cross-entropy loss. Further details of the setup are given in Appendix A.
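
At the module level, "applying the turn-over dropout to the adapter modules and output layers" could look roughly like the following PyTorch layer, which regenerates a fixed per-instance mask from an instance id and multiplies it into the layer's activations (layer-wise, in the spirit of footnote 3). The class, its seeding scheme, and its interface are hypothetical illustrations, not the configuration used in the paper's experiments.

```python
import torch
import torch.nn as nn

class TurnoverDropout(nn.Module):
    """Dropout whose mask is a deterministic function of the training-instance id."""

    def __init__(self, dim: int, p: float = 0.5, seed: int = 12345):
        super().__init__()
        self.dim, self.p, self.seed = dim, p, seed

    def forward(self, x: torch.Tensor, instance_ids: torch.Tensor,
                flip: bool = False) -> torch.Tensor:
        # Regenerate each instance's mask on the fly (volatile mask generation).
        masks = []
        for idx in instance_ids.tolist():
            g = torch.Generator().manual_seed(self.seed + int(idx))
            keep = (torch.rand(self.dim, generator=g) < self.p).float() / self.p
            masks.append(1.0 / self.p - keep if flip else keep)
        return x * torch.stack(masks).to(x.device)
```

During training, flip=False, so each instance updates only its own sub-network; at estimation time the same layer is run twice, once with flip=False and once with flip=True, to obtain the two losses in Equation (2).
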

4.1 Side Effect on Model Performance

Note that turn-over dropout is not aimed at improving the accuracy of models. It gives the models a way of efficiently estimating the influence of each training instance. A possible side effect is a deterioration of accuracy due to introducing instance-specific dropout with $p = 0.5$.[6] Thus, we first explored the change in classification accuracy when using the turn-over dropout.

For BERT with the adapter modules on SST-2, if we use a small dataset (N=5,000), the accuracy slightly decreased from the baseline model, from 90.0% to 88.3%. If we use a larger dataset (N=20,000), the change is negligible: 90.5% versus 90.2%. Thus, in the case of large datasets, where we typically want to use turn-over dropout for efficiency, applying the turn-over dropout does not decrease the validation accuracy compared with the baseline. However, when we use turn-over dropout on all layers of BERT without the adapter modules, training becomes unstable. The same is true for VGGNet on CIFAR-10. Instead, we applied the turn-over dropout only to the layers after the 11th layer, although this means that early layers can learn all instances in the training dataset and make the turn-over dropout leaky.[7] We found that VGGNet with turn-over dropout can overfit more than the baseline does; their accuracies are 86.2% and 92.0%, respectively. If we add regularization using the original dropout, the accuracy recovers to 91.3%. Thus, in some cases, we have to care about the decrease in model performance when using turn-over dropout. While we experimented only with the successful architectures, exploring this side effect in various architectures and its remedy is important future work.

[6] Dropout with $p = 0.5$ is often used in various neural networks, especially on their linear layers, and improves the accuracy. However, dropout on all layers could be damaging. It is also unclear how dropout with "static" masks affects performance because the idea is novel.
[7] Asano et al. (2020) demonstrated that the early layers of a CNN contain limited information about the statistics of images, and such low-level statistics can be learned even from a single image. Based on this finding, we assumed that early layers did not fit each instance so much, and the effect of leakage was small.

4.2 Sanity Check: Learning Curves

We first observed an interesting property of the turn-over dropout from the loss curves during training, as shown in Figure 2. The solid red line of the training loss using $m(z_{\mathrm{train}})$, $L(f_D^{m(z_{\mathrm{train}})}, z_{\mathrm{train}})$, showed a typical tendency of training loss. However, the solid blue line of the training loss using $\widetilde{m}(z_{\mathrm{train}})$, $L(f_D^{\widetilde{m}(z_{\mathrm{train}})}, z_{\mathrm{train}})$, indicated loss values close to the test losses (dotted lines), without overfitting. This fact agrees with the idea behind the turn-over dropout: the sub-network $f^{\widetilde{m}(z_{\mathrm{train}})}$ using the flipped mask does not learn each training instance $z_{\mathrm{train}}$.

Figure 2: Loss curves of BERT on SST-2.

4.3 Interpretation of Error Predictions

Neural network models are notorious for their black-box predictions, which harm trust and usability (Ribeiro et al., 2016). The influence estimation can mitigate this problem by suggesting possible reasons for a wrong model prediction through identifying influential training instances.

To verify this benefit, we collected the misclassified instances of the validation or test set and searched for the training instances that most influenced the wrong predictions, as sketched below. Figure 3 shows a text example from the results. Rare words of named entities were divided into many subwords (Schuster and Nakajima, 2012; Sennrich et al., 2016; Wu et al., 2016), requiring more complex processing. A guess is that BERT might have failed to understand the input due to the cluttered subwords and predicted a wrong label, which depended on a training instance that similarly contains many subwords.

Figure 3: A misclassified text in the test set and the text with the highest influence with the error label in the training set for BERT on SST-2.
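
The search for influential training instances can be written as a loop over the training set with two forward passes per instance. The sketch below is self-contained around a generic `influence(i, z)` callable, for which one could plug in, e.g., `estimate_influence` with `instance_mask(i, d)` from the earlier sketches; it only illustrates the ranking step, not the paper's actual experiment code.

```python
from typing import Any, Callable, List

def rank_by_influence(influence: Callable[[int, Any], float],
                      n_train: int,
                      z_error: Any,
                      top_k: int = 5) -> List[int]:
    """Indices of the training instances with the highest estimated influence
    (Eq. 2) on one misclassified instance z_error."""
    scores = sorted(((influence(i, z_error), i) for i in range(n_train)),
                    reverse=True)
    return [i for _, i in scores[:top_k]]
```
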

Additionally, we conducted the same experiment on the Yahoo Answers 10-label question classification dataset (Zhang et al., 2015),[8] which is more complex than sentiment analysis. Figure 4 shows the results on Yahoo Answers. The misclassified text shares the phrase "ch ##rist" with the two influential instances. Such a low-level cue is not critical in the test instance. However, it seems that the model focused on the phrase and predicted the label of the training instances containing the phrase.

Figure 4: A misclassified text in the test set and the texts with the highest influence with the error label in the training set for BERT on Yahoo Answers.

[8] We used 5,000 training instances, as for SST-2.

In addition, more intuitively, image results are shown in Figure 5. The two leftmost instances with the "bird" label were wrongly predicted as "airplane." The training instances of airplanes with the highest influence on the error predictions are shown in the row below. The corresponding images had similar visual features, such as shape, layout, or color, which probably led to the wrong predictions.

Figure 5: Misclassified images in the validation set (upper row) and images with the highest influence in the training set (lower row) for VGGNet on CIFAR-10.

4.4 Data Cleansing

Another possible application of the influence estimation is to eliminate harmful instances from the training dataset. If the mean influence of a training instance on unseen instances is negative, the instance can be harmful for generalization. We experimented with data cleansing in a case of domain shift, where the training dataset is SST-2 (movie reviews) while the validation and test datasets come from the "electronics" subset of the Multi-Domain Sentiment Dataset (Blitzer et al., 2007) (Elec). We split the Elec dataset into 200 instances for validation and 1,800 instances for the test. Note that we did not use the Elec dataset as a training dataset, in order to study only the effect of data cleansing.

We finetuned BERT models (with turn-over dropout) on the SST-2 dataset and calculated the mean influences with respect to Elec's validation set. After that, we re-trained models without turn-over dropout on datasets from which the training instances with the 1% most negative influences had been removed. Finally, the model performances on Elec's test dataset were compared, as shown in Table 2. The models trained on the cleansed datasets achieved better accuracy and lower loss than those trained on the original dataset. This result demonstrates that our estimation of the influence can also be used for data cleansing; a sketch of the procedure follows.

                     Accuracy (%)    Loss
1% Random Removal    76.8 ± 1.1      0.521 ± 0.030
No Cleansing         77.0 ± 0.9      0.536 ± 0.063
1% Cleansing         78.3 ± 0.2      0.484 ± 0.008

Table 2: The results of data cleansing. Loss is the cross-entropy loss. The averages and standard deviations over four different runs are shown.
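
A sketch of the cleansing recipe: estimate each training instance's mean influence over the validation set, then drop the most negative 1% before retraining. As above, `influence(i, z)` is a generic callable standing in for the turn-over-dropout estimator; the function and the threshold handling are our own illustration, not the paper's code.

```python
from typing import Any, Callable, List, Sequence

def cleanse(influence: Callable[[int, Any], float],
            train_data: Sequence[Any],
            val_data: Sequence[Any],
            remove_frac: float = 0.01) -> List[Any]:
    """Remove the training instances whose mean estimated influence on the
    validation set is most negative (the bottom 1% by default)."""
    n = len(train_data)
    mean_infl = [sum(influence(i, z) for z in val_data) / len(val_data)
                 for i in range(n)]
    n_remove = max(1, int(remove_frac * n))
    harmful = set(sorted(range(n), key=lambda i: mean_infl[i])[:n_remove])
    return [z for i, z in enumerate(train_data) if i not in harmful]
```
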

5 Conclusion

This paper proposed a method with a low computational cost for estimating the influence of a training instance. The method alters dropout with instance-specific masks and, for estimation, uses sub-networks that are not trained with each instance. The experiments demonstrated that this method can be applied even to complex models.

Acknowledgments

We appreciate the helpful comments from the anonymous reviewers. We thank Sho Takase, Hiroshi Noji, Hitomi Yanaka, Koki Washio, Saku Sugawara, Benjamin Heinzerling, and Kazuaki Hanawa for constructive comments. This work was supported by JSPS KAKENHI Grant Number JP19H04162.

References

Philip Bachman, Ouais Alsharif, and Doina Precup. 2014. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems 27, pages 3365–3373.

Pierre Baldi and Peter Sadowski. 2014. The dropout learning algorithm. Artificial Intelligence, 210(C):78–122.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447, Prague, Czech Republic. Association for Computational Linguistics.

Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. 2016. Dropout distillation. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 99–107.

R. Dennis Cook. 1977. Detection of influential observation in linear regression. Technometrics, pages 15–18.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Xiaochuang Han, Byron C. Wallace, and Yulia Tsvetkov. 2020. Explaining black box predictions and unveiling data artifacts through influence functions. In Proceedings of the 2020 Annual Conference of the Association for Computational Linguistics (to appear).

Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. 2019. Data cleansing for models trained with SGD. In Advances in Neural Information Processing Systems 32, pages 4215–4224.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, Long Beach, California, USA. PMLR.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885–1894.

Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical report.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Marco Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 97–101, San Diego, California. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing, pages 5149–5152.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv, abs/1910.03771.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. 2020. A critical analysis of self-supervision, or what we can learn from a single image. In International Conference on Learning Representations (ICLR).

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Xuezhou Zhang, Xiaojin Zhu, and Stephen Wright. 2018. Training set debugging using trusted items. In AAAI Conference on Artificial Intelligence.
