
Efficient Estimation of Influence of a Training Instance

Sosuke Kobayashi (1,2), Sho Yokoi (1,3), Jun Suzuki (1,3), Kentaro Inui (1,3)
(1) Tohoku University  (2) Preferred Networks, Inc.  (3) RIKEN
[email protected]  {yokoi,jun.suzuki,inui}@ecei.tohoku.ac.jp

Abstract

Understanding the influence of a training instance on a neural network model leads to improved interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model's prediction would change if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents that sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that did or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset to improve generalization.

1 Introduction

What is the influence of a training instance on a machine learning model? This question has attracted the attention of the community (Cook, 1977; Koh and Liang, 2017; Zhang et al., 2018; Hara et al., 2019). Evaluating the influence of a training instance leads to more interpretable models and to other applications such as data cleansing.

A simple way to evaluate it is to compare a model with another, similarly trained model whose training does not include the instance of interest. This approach, however, requires time and storage that grow with the number of instances, which makes it prohibitively costly in practice (Table 1). While computationally cheaper estimation methods have been proposed (Koh and Liang, 2017; Hara et al., 2019), they still suffer from computational difficulties or restrictions on model choice. The contribution of this work is an estimation method that (i) is computationally more efficient, (ii) is useful for applications, and (iii) does not significantly sacrifice model performance.

We propose a trick, which we refer to as turn-over dropout, that enables a neural network to estimate the influence without such restrictions. The method is computationally efficient because it requires only two forward computations after training a single model on the entire training dataset. In addition to this efficiency, we demonstrate that it enables BERT (Devlin et al., 2019) and VGGNet (Simonyan and Zisserman, 2015) to analyze the influences of training instances through various experiments, including example-based interpretation of error predictions and data cleansing that improves accuracy on a test set with a distributional shift.

2 Influence of a Training Instance

2.1 Problem Setup

We first present preliminaries on the problem setup. In this paper, we deal with the influence that training with one instance has on the prediction for another, which has been studied by Koh and Liang (2017), Hara et al. (2019), and others. Let $z := (x, y)$ be an instance, a pair of an input $x \in X$ and its output $y \in Y$, and let $D := \{z_i\}_{i=1}^{N}$ be a training dataset. Using an optimization method on $D$, we aim to find a model $f_D : X \to Y$. Denoting the loss function by $L(f, z)$, the learning problem is to obtain $\hat{f}_D = \operatorname{argmin}_f \mathbb{E}_{z_i \in D} L(f, z_i)$.

The influence $I(z_{\text{target}}, z_i; D)$ is the quantitative benefit from $z_i$ to the prediction on $z_{\text{target}}$. Let $f_{D \setminus \{z_i\}}$ be a model trained on the dataset $D$ excluding $z_i$; the influence is then defined as

$I(z_{\text{target}}, z_i; D) := L(f_{D \setminus \{z_i\}}, z_{\text{target}}) - L(f_D, z_{\text{target}})$.  (1)

Intuitively, the larger this value, the more strongly the training instance $z_i$ contributes to reducing the loss on the other instance $z_{\text{target}}$. The instance of interest $z_{\text{target}}$ is typically drawn from a test or validation dataset.
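As a concrete reading of Equation (1), the following is a minimal sketch of the naive leave-one-out evaluation it describes; `train_fn` and `loss_fn` are hypothetical stand-ins (not from the paper) for a fixed training procedure and the loss $L$.

```python
from typing import Any, Callable, Sequence

def influence_by_retraining(
    train_fn: Callable[[Sequence[Any]], Any],  # trains a model on a dataset (same init/optimizer)
    loss_fn: Callable[[Any, Any], float],      # evaluates L(f, z)
    dataset: Sequence[Any],                    # D = [z_1, ..., z_N]
    i: int,                                    # index of the training instance z_i
    z_target: Any,                             # instance whose prediction we inspect
) -> float:
    r"""I(z_target, z_i; D) = L(f_{D\{z_i}}, z_target) - L(f_D, z_target), as in Eq. (1)."""
    f_full = train_fn(list(dataset))                                 # f_D
    f_loo = train_fn([z for j, z in enumerate(dataset) if j != i])   # f_{D\{z_i}}
    return loss_fn(f_loo, z_target) - loss_fn(f_full, z_target)
```

Estimating the influence of every training instance this way requires retraining a model per instance, which is the "Re-train" row of Table 1 and the cost that the methods discussed below try to avoid.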
Method     Training    Storage     Estimation
Re-train   O(|D|^2)    O(|θ||D|)   O(F|D|)
Hara+      O(|D|)      O(|θ|T)     O(F|D| + (F+F')TB)
Koh+       O(|D|)      O(|θ|)      O(F|D| + (F+F')rtb)
Ours       O(|D|)      O(|θ|)      O(F|D|)

Table 1: Comparison of computational complexity for estimating the influence of all instances on another instance, against Hara et al. (2019) and Koh and Liang (2017). |θ| is the number of parameters, F is a forward/backward computation, F' is a double-backward computation, T is the number of training steps, B is the training minibatch size, b is a minibatch size for stabilizing the approximation, and r and t are hyper-parameters; typically rt ≈ |D|. See the references for details.

[Figure 1: Dropout generates a sub-network for each training instance z and updates its parameters (red; top) only. By contrast, the other (blue; bottom) sub-network is not influenced by z. Our estimation uses the difference between the two sub-networks.]

2.2 Related Methods

Computing the influence in Equation (1) by re-training two models for each instance is computationally expensive, and several estimation methods have been proposed. Koh and Liang (2017) proposed an estimation method that assumes a strongly convex loss function and a globally optimal solution¹. While the method has been applied even to neural models (Koh and Liang, 2017; Han et al., 2020), which do not satisfy the assumption, it still requires a high computational cost. Hara et al. (2019) proposed a method without these restrictions; however, it consumes large disk storage and computation time that depend on the number of optimization steps. Our proposed method is much more efficient, as shown in Table 1. For example, in a case where Koh and Liang (2017)'s method took 10 minutes to estimate the influences of 10,000 training instances on another instance with BERT (Han et al., 2020), our method required only 35 seconds². This efficiency will expand the scope of applications of computing influence. For example, it would enable real-time interpretation of model predictions for users of machine learning models.

¹ Strictly speaking, Koh and Liang (2017) studied a value similar to but different from I in Equation (1). Briefly, the formulation in Koh and Liang (2017) considers convex models with the optimal parameters for $f_{D \setminus \{z_i\}}$ and $f_D$. The definition in Hara et al. (2019) did not have such conditions and treated the broader problem. We follow Hara et al. (2019); therefore, the definition in Equation (1) allows any $f_D$ and $f_{D \setminus \{z_i\}}$, as long as they have the same initial parameters and optimization procedures using the same mini-batches except for $z_i$.
² For the details, see Appendix B.

3 Proposed Method

3.1 Background: Dropout

Dropout (Hinton et al., 2012; Srivastava et al., 2014) is a popular regularization method for deep neural networks. During training, a $d$-dimensional random mask vector $m$, where $d$ refers to the number of parameters of a layer, is sampled, and the neural network model $f$ is transformed into a variant $f^m$ whose parameters are multiplied by $m$ at each update³. The elements of the mask $m \in \{0, 1/p\}^d$ are randomly sampled as $m_j := m'_j / p$ with $m'_j \sim \mathrm{Bernoulli}(p)$. Parameters masked (multiplied) with 0 are disabled in an update step, as in pruning. Thus, dropout randomly selects various sub-networks $f^m$ to be updated at every step. During inference at test time, dropout is not applied. One interpretation of dropout is that it trains numerous sub-networks and uses them as an ensemble (Hinton et al., 2012; Srivastava et al., 2014; Bachman et al., 2014; Baldi and Sadowski, 2014; Bulò et al., 2016).

³ Typically, dropout is applied to the layers of the neural network rather than to its parameter matrices. In this case, each instance in a minibatch drops different column-wise parameters of the matrices at once.

3.2 Proposed Method: Turn-over Dropout

In the standard dropout method, dropout masks are sampled independently at every update. In our proposed method, however, we use instance-specific dropout masks $m(z)$, which are also random vectors but are deterministically generated and tied to each instance $z$. Thus, when the network is trained with an instance $z$, only a deterministic subset of its parameters is updated, as shown in Figure 1. In other words, the sub-network $f^{m(z)}$ is updated, whereas the corresponding counterpart of the network, $f^{\tilde{m}(z)}$, is not affected by $z$ at all, where $\tilde{m}(z)$ is the flipped mask of $m(z)$, i.e., $\tilde{m}(z) := 1/p - m(z)$. In this work, $p = 0.5$; approximately half of the parameters are zero-masked. Both sub-networks, $f^{m(z)}$ and $f^{\tilde{m}(z)}$, can be used by applying the individual masks to $f$. These sub-networks can be regarded, by analogy, as two different networks trained on the dataset with and without the instance, respectively, i.e., $f_D$ and $f_{D \setminus \{z_i\}}$. From this analogy, the difference between the two sub-networks' losses on $z_{\text{target}}$ is used to estimate the influence in Equation (1).
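A minimal PyTorch sketch of this idea follows, under the assumption that each instance-specific mask m(z) is derived from a per-instance random seed (e.g., the training index). The names `TurnoverDropout`, `TinyClassifier`, and `estimated_influence`, the seeding scheme, and the toy architecture are illustrative choices, not the paper's implementation; for clarity, the sketch applies one instance's mask at a time.

```python
import torch
import torch.nn as nn

class TurnoverDropout(nn.Module):
    """Instance-tied dropout: m(z) is a deterministic function of the instance id,
    so the flipped mask ~m(z) = 1/p - m(z) selects the sub-network never updated by z."""
    def __init__(self, dim: int, p: float = 0.5):  # p = 0.5 as in the paper
        super().__init__()
        self.dim, self.p = dim, p

    def mask(self, instance_id: int, flip: bool = False) -> torch.Tensor:
        g = torch.Generator().manual_seed(int(instance_id))          # deterministic per instance
        keep = (torch.rand(self.dim, generator=g) < self.p).float()  # m'_j ~ Bernoulli(p)
        if flip:
            keep = 1.0 - keep              # complementary sub-network, ~m(z)
        return keep / self.p               # m_j = m'_j / p

    def forward(self, h: torch.Tensor, instance_id: int, flip: bool = False) -> torch.Tensor:
        return h * self.mask(instance_id, flip)

class TinyClassifier(nn.Module):
    """Toy model: during training, each z_i is forwarded with its own mask (flip=False),
    so the hidden units (and their attached weights) masked out for z_i get no gradient from it."""
    def __init__(self, d_in: int, d_hid: int, n_classes: int, p: float = 0.5):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hid)
        self.turnover = TurnoverDropout(d_hid, p)
        self.fc2 = nn.Linear(d_hid, n_classes, bias=False)  # bias=False keeps the two sub-networks disjoint here

    def forward(self, x: torch.Tensor, instance_id: int, flip: bool = False) -> torch.Tensor:
        h = torch.relu(self.fc1(x))
        return self.fc2(self.turnover(h, instance_id, flip))

def estimated_influence(model: TinyClassifier, i: int, x_t: torch.Tensor, y_t: torch.Tensor) -> float:
    """Estimate I(z_target, z_i; D) as L(f^{~m(z_i)}, z_target) - L(f^{m(z_i)}, z_target):
    two forward passes over the single trained model, one per sub-network."""
    loss = nn.CrossEntropyLoss()
    with torch.no_grad():
        l_without = loss(model(x_t, instance_id=i, flip=True), y_t)   # sub-network that never saw z_i
        l_with = loss(model(x_t, instance_id=i, flip=False), y_t)     # sub-network trained on z_i
    return (l_without - l_with).item()
```

For example, `estimated_influence(model, i, x_t, y_t)` with `x_t` of shape (1, d_in) and `y_t` a length-1 LongTensor compares the two sub-networks on one target instance. Because only the per-instance seed is needed to regenerate each mask, estimation costs two forward passes per pair with no re-training and no per-instance storage, matching the "Ours" row of Table 1.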
4 Experiments

The computational efficiency of our method is discussed in Section 2. Moreover, we address the question: even if it is efficient, does it work well in applications? To demonstrate its applicability, we conducted experiments using different models and datasets.

Setup  First, we used the Stanford Sentiment TreeBank (SST-2) (Socher et al., 2013) binary sentiment classification task. Five thousand instances were sampled from the training set, and the 872 instances in the development set were used.
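As a rough sketch of this data preparation (assuming the Hugging Face `datasets` library and an arbitrary sampling seed, neither of which is specified here):

```python
from datasets import load_dataset

# SST-2 via the GLUE benchmark; examples have 'sentence' and 'label' fields.
sst2 = load_dataset("glue", "sst2")

# Sample 5,000 training instances (the seed is an arbitrary choice, not taken
# from the paper) and keep the 872-example development (validation) set.
train_subset = sst2["train"].shuffle(seed=0).select(range(5000))
dev_set = sst2["validation"]

print(len(train_subset), len(dev_set))  # 5000 872
```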