
Optimizing Data Usage via Differentiable Rewards

Xinyi Wang*1, Hieu Pham*1,2, Paul Michel1, Antonios Anastasopoulos1, Jaime Carbonell1, Graham Neubig1

*Equal contribution. 1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. 2Google Research, Brain Team, Mountain View, CA 94043, USA. Correspondence to: Xinyi Wang, Hieu Pham <{xinyiw1,hyhieu}@cs.cmu.edu>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

To acquire a new skill, humans learn better and faster if a tutor informs them of how much attention they should pay to particular content or practice problems based on their current knowledge level. Similarly, a machine learning model could potentially be trained better if data is presented in a way that adapts to its current learning state. In this paper, we examine the problem of training an adaptive scorer that weights data instances to maximally benefit learning. Training such a scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance on the final model, a naive approach would require completing the entire training process and observing final performance. We propose an efficient alternative – Differentiable Data Selection (DDS) – that formulates a scorer as a learnable function of the training data that can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weight the data that has a gradient similar to that of a development set upon which we would finally like to perform well. Without significant computing overhead, DDS delivers consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.¹

¹Code will be released soon.

1. Introduction

While machine learning models are remarkably good at fitting large data sets, their performance is also highly sensitive to the structure and domain of their training data. Training on out-of-domain data can lead to worse model performance, while using more relevant data can assist transfer learning. Previous work has attempted to create strategies to handle this sensitivity by selecting subsets of the data to train the model on (Jiang & Zhai, 2007; Wang et al.; Axelrod et al., 2011; Moore & Lewis, 2010), providing different weights for each example (Sivasankaran et al., 2017; Ren et al., 2018), or changing the presentation order of data (Bengio et al., 2009; Kumar et al., 2019).

However, there are several challenges with existing work on better strategies for data usage. Most data filtering criteria or training curricula rely on domain-specific knowledge and hand-designed heuristics, which can be sub-optimal. To avoid hand-designed heuristics, several works propose to optimize a parameterized neural network to learn the data usage schedule, but most of them are tailored to specific use cases, such as handling noisy data for classification (Jiang et al., 2018), learning a curriculum learning strategy for specific tasks (Kumar et al., 2019; Tsvetkov et al., 2016), and actively selecting data for annotation (Fang et al., 2017; Wu et al., 2018). Fan et al. (2018b) propose a more general teacher-student framework that first trains a "teacher network" to select data that directly optimizes development set accuracy over multiple training runs. However, because running multiple runs of training simply to learn this teacher network entails an n-fold increase in training time for n runs, this is infeasible in many practical settings. In addition, in preliminary experiments we also found the single reward signal provided by dev set accuracy at the end of training to be noisy, to the extent that we were not able to achieve results competitive with simpler heuristic training methods.

In this paper, we propose an alternative: a general Reinforcement Learning (RL) framework for optimizing training data usage by training a scorer network that minimizes the model loss on the development set. We formulate the scorer network as a function of the current training examples, and update the scorer along with the main model being trained.

Thus, our method requires no heuristics and is generalizable to various tasks. To make the scorer adaptive, we train it using frequent and efficient updates towards a reward function inspired by recent work on learning from auxiliary tasks (Du et al., 2018; Liu et al., 2019b), which uses the similarity between two gradients as a measure of task relevance. We propose to use the gradient alignment between the training examples and the dev set as a reward signal for a parametric scorer network, as illustrated in Fig. 1.

Figure 1: The general workflow of DDS. (Training data (x, y) → scorer p(x, y; ψ) → model θ; the model update uses ∇θℓ(x, y; θ) and the scorer update uses ∇θJ(θ, Ddev).)

We then formulate our framework as a bi-level optimization problem similar to those found in prior works such as meta-learning for few-shot learning (Finn et al., 2017), noisy data filtering (Ren et al., 2018), and neural architecture search (Liu et al., 2019a), and demonstrate that our proposed update rules follow a direct differentiation of the scorer parameters to optimize the model loss on the dev set. Thus we refer to our framework as "Differentiable Data Selection" (DDS).

We demonstrate two concrete instantiations of the DDS framework, one for a more general case of image classification, and the other for a more specific case of neural machine translation (NMT). For image classification, we test on both CIFAR-10 and ImageNet. For NMT, we focus on a multilingual setting, where we optimize data usage from a multilingual corpus to improve the performance on a particular language. For these two very different and realistic tasks, we find the DDS framework brings significant improvements over diverse baselines for all settings.

2. Differentiable Data Selection

2.1. Risk, Training, and Development Sets

Commonly in machine learning, we seek to find the parameters θ∗ that minimize the risk J(θ, P), the expected value of a loss function ℓ(x, y; θ), where ⟨x, y⟩ are pairs of inputs and associated labels sampled from a particular distribution P(X, Y):

θ∗ = argmin_θ J(θ, P)  where  J(θ, P) = E_{x,y∼P(X,Y)}[ℓ(x, y; θ)]   (1)

Ideally, we would like the risk J(·) to be minimized over the data distribution that our system sees at test time, i.e. Ptest(X, Y). Unfortunately, this distribution is unknown at training time, so instead we collect a training set Dtrain = {(xi, yi) : i = 1, ..., Ntrain} with distribution Ptrain(X, Y) = Uniform(Dtrain), and minimize the empirical risk by taking ⟨x, y⟩ ∼ Ptrain(X, Y). Since we need a sufficiently large training set Dtrain to train a good model, it is hard to ensure that Ptrain(X, Y) ≈ Ptest(X, Y). In fact, we often accept that training data comes from a different distribution than test data. The discrepancy between Ptrain(X, Y) and Ptest(X, Y) manifests itself in the form of problems such as overfitting (Zhang et al., 2017; Srivastava et al., 2014), covariate shift (Shimodaira, 2000), and label shift (Lipton et al., 2018).

However, unlike the large training set, we can often collect a relatively small development set Ddev = {(xi, yi) : i = 1, ..., Ndev} with a distribution Pdev(X, Y) that is much closer to Ptest(X, Y) (some examples can be found in § 4). Since Ddev is a better approximation of our test-time scenario, we can use Ddev to get reliable feedback to learn to better utilize our training data from Dtrain. In particular, we propose to train a scorer network, parameterized by ψ, that adjusts the weights of examples in Dtrain to minimize J(θ, Ddev).

2.2. Learning to Optimize Data Usage

We propose to optimize the scorer's parameters ψ in an RL setting. Our environment is the model state θ and an example ⟨x, y⟩. Our RL agent is the scorer network ψ, which optimizes the data usage for the current model state. The agent's reward on picking an example approximates the dev set performance of the resulting model after the model is updated on this example.

Our scorer network is parameterized as a differentiable function that only takes as inputs the features of the example ⟨x, y⟩. Intuitively, it represents a distribution over the training data where more important data has a higher probability of being used, denoted P(X, Y; ψ). Unlike prior methods, which generally require complicated featurization of both the model state and the data as input to the RL agent (Fan et al., 2018b; Jiang et al., 2018; Fang et al., 2017), our formulation is much simpler and generalizable to different tasks. Since our scorer network does not consider the model parameters θt as input, we update it iteratively with the model so that at training step t, P(X, Y; ψt) provides up-to-date data scoring feedback for the given θt.

Although the above formulation is simpler and more general, it requires much more frequent updates to the scorer parameter ψ. Existing RL frameworks simply use the change in dev set risk as the reward signal, which makes the update expensive and unstable (Fan et al., 2018b; Kumar et al., 2019). Therefore, we propose a novel reward function as an approximation to ∆Jdev(x, y), the effect of the training example ⟨x, y⟩ on the dev set risk. Inspired by Du et al. (2018), which uses gradient similarity between two tasks to measure the effect of adapting between them, we use the agreement between the model gradient on data ⟨x, y⟩ and the gradient on the dev set to approximate the effect of ⟨x, y⟩ on dev set performance. This reward implies that we prefer data that moves θ in the direction that minimizes the dev set risk:

R(x, y) = ∆Jdev(x, y) ≈ ∇θℓ(x, y; θt−1)⊤ · ∇θJ(θt, Ddev)   (2)
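To make the reward in Eqn. 2 concrete, the sketch below computes, for every example in a minibatch, the alignment between its training gradient and the dev-set gradient. This is only our illustration of the idea, not the released implementation: model, loss_fn, train_batch, and dev_batch are placeholder names, both gradients are evaluated at the current parameters for brevity (Eqn. 2 uses the dev gradient after the model update), and the explicit loop over examples is the naive per-example computation that § 3.1 later replaces with a cheaper approximation.

```python
# A PyTorch-style sketch of the reward in Eqn. 2 (our illustration, not the
# authors' implementation). `model`, `loss_fn`, `train_batch`, and `dev_batch`
# are placeholder names; both gradients are taken at the current parameters.
import torch

def flat_grad(loss, params):
    # Flatten d(loss)/d(params) into a single vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def dds_rewards(model, loss_fn, train_batch, dev_batch, use_cosine=False):
    params = [p for p in model.parameters() if p.requires_grad]
    dev_x, dev_y = dev_batch
    # Gradient of the dev-set risk, grad_theta J(theta, D_dev).
    dev_grad = flat_grad(loss_fn(model(dev_x), dev_y), params)
    rewards = []
    for x_i, y_i in zip(*train_batch):
        # Per-example training gradient, grad_theta l(x_i, y_i; theta).
        g_i = flat_grad(loss_fn(model(x_i.unsqueeze(0)), y_i.unsqueeze(0)), params)
        if use_cosine:
            # Cosine variant used for the NMT experiments (Section 3.2).
            r = torch.nn.functional.cosine_similarity(g_i, dev_grad, dim=0)
        else:
            # Plain dot product, as in Eqn. 2.
            r = torch.dot(g_i, dev_grad)
        rewards.append(r.detach())
    return torch.stack(rewards)
```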

According to the REINFORCE algorithm (Williams, 1992), the update rule for ψ is thus

ψt+1 ← ψt + [∇θℓ(x, y; θt−1)⊤ · ∇θJ(θt, Ddev)] · ∇ψ log P(x, y; ψ)   (3)

where the term in brackets is the reward R(x, y) of Eqn. 2. The update rule for the model is simply

θt ← θt−1 − ∇θJ(θt−1, P(X, Y; ψ))   (4)

For simplicity of notation, we omit the learning rate term. The full derivation can be found in § A.1. By alternating between Eqn. 4 and Eqn. 3, we can iteratively update θ using the guidance from the scorer network, and update ψ to optimize the scorer using feedback from the model.

Our formulation of the scorer network as P(X, Y; ψ) has several advantages. First, it provides the flexibility that we can either (1) sample a training instance with probability proportional to its score, or (2) equivalently scale the update from the training instance based on its score. In later sections, we provide an algorithm under the DDS framework for multilingual NMT (see § 3.2), where the former is more efficient, and another more general algorithm for image classification (see § 3.1), where the latter choice is natural. Second, it allows easy integration of prior knowledge of the data, which is shown to be effective in § 4.

2.3. Deriving Rewards through Direct Differentiation

In this section, we show that the update for the scorer network in Eqn. 3 can be approximately derived as the solution of a bi-level optimization problem (Colson et al., 2007), which has been applied to many different lines of research in the field of meta-learning (Baydin et al., 2018; Liu et al., 2019a; Ren et al., 2018).

Under our framework, the scorer samples the data according to ⟨x, y⟩ ∼ P(X, Y; ψ), and ψ will be chosen so that the θ∗ that minimizes J(θ, P(X, Y; ψ)) will approximately minimize J(θ, Pdev(X, Y)):

ψ∗ = argmin_ψ J(θ∗(ψ), Ddev)  where  θ∗(ψ) = argmin_θ E_{x,y∼P(X,Y;ψ)}[ℓ(x, y; θ)]   (5)

The connection between ψ and θ in Eqn. 5 shows that J(θt, Ddev) is differentiable with respect to ψ. Now we can approximately compute the gradient ∇ψJ(θt, Ddev) as follows:

∇ψJ(θt, Ddev)
  (apply the chain rule:)
  = ∇θt J(θt, Ddev)⊤ · ∇ψθt(ψ)
  (substitute θt from Eqn. 4:)
  = ∇θt J(θt, Ddev)⊤ · ∇ψ(θt−1 − ∇θJ(θt−1, ψ))
  (assume ∇ψθt−1 ≈ 0:)
  ≈ −∇θt J(θt, Ddev)⊤ · ∇ψ(∇θJ(θt−1, ψ))
  = −∇ψ E_{x,y∼P(X,Y;ψ)}[∇θJ(θt, Ddev)⊤ · ∇θℓ(x, y; θt−1)]
  = −E_{x,y∼P(X,Y;ψ)}[(∇θJ(θt, Ddev)⊤ · ∇θℓ(x, y; θt−1)) · ∇ψ log P(x, y; ψ)]   (6)

Here, we make a Markov assumption that ∇ψθt−1 ≈ 0, assuming that at step t, given θt−1, we do not care about how the values of ψ from previous steps led to θt−1. Intuitively, this assumption indicates that in the previous step ψt−1 has already been updated with regard to θt−1, so the effect of ψ on θt−1 is likely to be minimal. This assumption simplifies and speeds up computation.² Moreover, it allows a natural interpretation of the update rule for the data scorer: it should up-weight the training data that have a gradient direction similar to that of the dev data. Eqn. 6 leads to a rule to update ψ using gradient descent, which is exactly the same as the RL update rule in Eqn. 3.

²Our use of the Markov assumption is based on its use and empirical success in previous work on bi-level optimization, such as Hypergradient Descent (Baydin et al., 2018) and many others. Of course, this is a simplifying assumption, but we believe that our empirical results show that the proposed method is useful nonetheless.

2.4. Additional Derivation Details and Clarifications

Note that our derivation above does not take into account that we might use different optimization algorithms, such as SGD or Adam (Kingma & Ba, 2015), to update θ. We provide detailed derivations for several popular optimization algorithms in § A.1.

One potential concern with our approach is that, because we optimize ψt directly on the dev set using J(θt, Ddev), we may risk indirectly overfitting the model parameters θt by selecting a small subset of data that is overly specialized. However, we do not observe this problem in practice, and posit that this is because (1) the influence of ψt on the final model parameters θt is quite indirect, and acts as a "bottleneck", which has similarly proven useful for preventing overfitting in neural models (Grézl et al., 2007), and (2) because the actual implementations of DDS (which we further discuss in § 3) only sample a subset of data from Dtrain at each optimization step, further limiting expressivity.
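Before turning to the concrete instantiations in § 3, the following sketch (ours, under the same placeholder names as above) summarizes how the two updates of Eqns. 3–6 interleave in practice: a model step on scorer-weighted losses, followed by a REINFORCE-style scorer step using the gradient-alignment reward. The softmax over scorer outputs corresponds to scaling each example's update by its probability rather than sampling, the variant used for classification in § 3.1.

```python
# Sketch of the alternating DDS updates (Eqns. 3-6); the arguments and the
# `dds_rewards` helper sketched above are placeholders, not the released code.
import torch

def train_dds(model, scorer, loss_fn, train_loader, dev_loader, max_steps=1000):
    model_opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scorer_opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)
    for step, ((x, y), (dev_x, dev_y)) in enumerate(zip(train_loader, dev_loader)):
        if step >= max_steps:
            break
        # Update theta (Eqn. 4): scale per-example losses by the scorer's weights.
        weights = torch.softmax(scorer(x, y), dim=0)      # P(x, y; psi) over the batch
        per_example_loss = loss_fn(model(x), y, reduction="none")
        model_opt.zero_grad()
        (weights.detach() * per_example_loss).sum().backward()
        model_opt.step()
        # Update psi (Eqns. 3 and 6): REINFORCE with the gradient-alignment reward.
        rewards = dds_rewards(model, loss_fn, (x, y), (dev_x, dev_y))
        log_probs = torch.log_softmax(scorer(x, y), dim=0)
        scorer_opt.zero_grad()
        (-(rewards * log_probs).mean()).backward()
        scorer_opt.step()
```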

3. Concrete Instantiations of DDS

We now turn to discuss two concrete instantiations of DDS that we use in our experiments: a more generic example of classification, which should be applicable to a wide variety of tasks, and a specialized application to the task of multilingual NMT, which should serve as an example of how DDS can be adapted to the needs of specific applications.

3.1. Formulation for Classification

Alg. 1 presents the pseudo code for the training process on classification tasks, using the notation introduced in § 2.

Algorithm 1: Training a classification model with DDS.
  Input: Dtrain, Ddev
  Output: Optimal parameters θ∗
  1:  Initialize θ0 and ψ0
  2:  for t = 1 to num_train_steps do
  3:      Sample B training data points xi, yi ∼ Uniform(Dtrain)
  4:      Sample B validation data points x'i, y'i ∼ Uniform(Ddev)
          // Optimize θ
  5:      gθ ← Σ_{i=1}^{B} p(xi, yi; ψt−1) ∇θℓ(xi, yi; θt−1)
  6:      Update θt ← GradientUpdate(θt−1, gθ)
          // Evaluate θt on Ddev
  7:      Let dθ ← (1/B) Σ_{j=1}^{B} ∇θℓ(x'j, y'j; θt)
          // Optimize ψ
  8:      ri ← dθ⊤ · ∇θℓ(xi, yi; θt−1)
  9:      Let dψ ← (1/B) Σ_{i=1}^{B} [ri · ∇ψ log p(xi, yi; ψ)]
  10:     Update ψt ← GradientUpdate(ψt−1, dψ)
      end

The main classification model is parameterized by θ. The scorer p(X, Y; ψ) uses an architecture identical to the main model, but with independent weights, i.e. p(X, Y; ψ) does not share weights with θ. For each example xi in a minibatch uniformly sampled from Dtrain,³ this DDS model outputs a scalar for the data point xi. All scalars are passed through a softmax to compute the relative probabilities of the examples in the minibatch, and their gradients are scaled accordingly when applied to θ.

³Note that our actual formulation of p(X, Y; ψ) does not depend on Y, but we keep Y in the notation for consistency with the formulation of the DDS framework.

We have two gradient update steps, one for the model parameter θt in line 6 and the other for the DDS scorer parameter ψ in line 10. For the model parameter update, we can simply use any standard optimization update rule. For the scorer ψ, we use the update rule derived in § 2.3.

Per-Example Gradient. In standard neural network training, a single aggregate gradient is computed with respect to a mini-batch of training data of size n to improve computational efficiency. In contrast, as seen from line 9 of Alg. 1, as well as from Eqn. 13, DDS requires us to compute ∇θℓ(xi, yi; θt−1), the gradient for each example in a batch of training data. This potentially slows down training by a factor of n. A naive implementation of this operation would be very slow and memory intensive, especially when the batch size is large, e.g. our experiments on ImageNet use a batch size of 4096 (see § 4).

We propose an efficient approximation of this per-example gradient computation via the first-order Taylor expansion of ℓ(xi, yi; θt−1). In particular, for any vector v ∈ R^|θ|, with sufficiently small ε > 0, we have:

v⊤ · ∇θℓ(xi, yi; θt−1) ≈ (1/ε) [ℓ(xi, yi; θt−1 + εv) − ℓ(xi, yi; θt−1)]   (7)

Eqn. 7 can be implemented by keeping a shadow version of the parameters θt−1, caching the training loss ℓ(xi, yi; θt−1), and computing the new loss with θt−1 + εv. Here, v is dθ as in line 9 of Alg. 1.
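The shadow-parameter trick can be implemented with two extra forward passes per batch. The sketch below is our own illustration under the same placeholder names as before (it assumes loss_fn accepts reduction="none" and that d_theta is the dev gradient dθ of Alg. 1); in practice one would keep a persistent shadow copy rather than deep-copying the model every step.

```python
# Our sketch of the finite-difference approximation in Eqn. 7: the alignment
# r_i = d_theta^T grad l(x_i, y_i; theta) is estimated from two forward passes,
# one at theta and one at a shadow copy theta + eps * d_theta.
import copy
import torch

def approx_rewards(model, loss_fn, x, y, d_theta, eps=1e-2):
    # d_theta: list of tensors shaped like model.parameters() (the dev gradient).
    with torch.no_grad():
        base_loss = loss_fn(model(x), y, reduction="none")      # cached l(x_i, y_i; theta)
        shadow = copy.deepcopy(model)                           # shadow parameters
        for p, v in zip(shadow.parameters(), d_theta):
            p.add_(eps * v)
        shifted_loss = loss_fn(shadow(x), y, reduction="none")  # l at theta + eps * v
    # Eqn. 7: v^T grad l(x_i, y_i) ~ (l(theta + eps * v) - l(theta)) / eps.
    return (shifted_loss - base_loss) / eps
```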
3.2. Formulation for Multilingual NMT

Next we demonstrate an application of DDS to multilingual models for NMT, specifically for improving accuracy on low-resource languages (LRL) (Zoph et al., 2016; Neubig & Hu, 2018).

Problem Setting. In this setting, we assume that we have a particular LRL S that we would like to translate into target language T, and we additionally have a multilingual corpus Dtrain that has parallel data between n source languages (S1, S2, ..., Sn) and the target language T. We would like to pick parallel data from any of the source languages to the target language to improve translation of the particular LRL S, so we assume that Ddev exclusively consists of parallel data between S and T. Thus, DDS will select data from Dtrain that improve accuracy on S-to-T translation as represented by Ddev.

Adaptation to NMT. To make training more efficient and stable in this setting, we make three simplifications of the main framework in § 2.3 that take advantage of the problem structure of multilingual NMT. First, instead of directly modeling p(X, Y; ψ), we assume a uniform distribution over the target sentence Y, and only parameterize the conditional distribution of which source language sentence to pick given the target sentence: p(X|y; ψ). This design follows the formulation of Target Conditioned Sampling (TCS; Wang & Neubig (2019)), an existing state-of-the-art data selection method that uses a similar setting but models the distribution p(X|y) using heuristics. Since the scorer only needs to model a simple distribution over training languages, we use a fully connected 2-layer perceptron network, and the input is a vector indicating which source languages are available for the given target sentence (a minimal sketch of such a scorer is given after Alg. 2). Second, we only update ψ after updating the NMT model for a fixed number of steps. Third, we sample the data according to p(X|y; ψ) to get a Monte Carlo estimate of the objective in Eqn. 5. This significantly reduces the training time compared to using all data. The pseudo code of the training process is in Alg. 2.

In practice, we use cosine distance instead of the dot product to measure the gradient alignment between the training and dev language, because cosine distance has smaller variance and thus makes the scorer update more stable.

Algorithm 2: Training multilingual NMT with DDS.
  Input: Dtrain; K: number of data to train the NMT model before updating ψ; E: number of updates for ψ; α1, α2: discount factors for the gradient
  Output: The converged NMT model θ∗
  Initialize ψ0, θ0
  // Initialize the gradient of each source language
  g[Si] ← 0 for i = 1..n
  while θ not converged do
      X, Y ← load_data(ψ, Dtrain, K)
      // Train the NMT model
      for xi, y in X, Y do
          θt ← GradientUpdate(θt−1, ∇θℓ(xi, y; θt−1))
          g[Si] ← α1 × g[Si] + α2 × ∇θℓ(xi, y; θt−1)
      end
      // Optimize ψ
      for iter = 1 to E do
          sample B data pairs from Dtrain
          ri ← g[Si]⊤ g[S]
          dψ ← (1/B) Σ_{j=1}^{B} Σ_{i=1}^{n} [ri ∇ψ log p(Si|yj; ψt−1)]
          ψt ← GradientUpdate(ψt−1, dψ)
      end
  end
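As a concrete picture of the scorer used here, the sketch below is our own illustration; the paper only specifies a fully connected 2-layer network over the language-availability vector, so the hidden size and the masking of unavailable languages are our choices. It maps the 0/1 availability vector for a target sentence to log-probabilities over the source languages that actually have a translation.

```python
# Minimal sketch (ours) of the language-level scorer p(S_i | y; psi) described
# above. The input is an n-dimensional 0/1 vector marking which source
# languages have a translation of the target sentence y.
import torch
import torch.nn as nn

class LanguageScorer(nn.Module):
    def __init__(self, num_languages, hidden_size=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_languages, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_languages),
        )

    def forward(self, availability):
        # availability: (batch, num_languages), 1 where a source sentence exists.
        logits = self.net(availability)
        # Mask out languages that have no source sentence for this target.
        logits = logits.masked_fill(availability == 0, float("-inf"))
        return torch.log_softmax(logits, dim=-1)   # log p(S_i | y; psi)
```

During the ψ updates in Alg. 2, these log-probabilities supply the ∇ψ log p(Si|yj; ψ) terms that are weighted by the per-language rewards ri.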
4. Experiments

We now discuss experimental results on both image classification, an instance of the general classification problem using Alg. 1, and multilingual NMT using Alg. 2.

4.1. Experimental Settings

Data. We apply our method on established benchmarks for image classification and multilingual NMT. For image classification, we use CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015). For each dataset, we consider two settings: a reduced setting where only roughly 10% of the training labels are used, and a full setting, where all labels are used. Specifically, the reduced setting for CIFAR-10 uses the first 4,000 examples in the training set, while for ImageNet, the reduced setting uses the first 102 TFRecord shards as pre-processed by Kornblith et al. (2019). We use an image size of 224 × 224 for ImageNet.

For multilingual NMT, we use the 58-language-to-English TED dataset (Qi et al., 2018). Following prior work (Qi et al., 2018; Neubig & Hu, 2018; Wang et al., 2019b), we evaluate translation from four low-resource languages (LRL), Azerbaijani (aze), Belarusian (bel), Galician (glg), and Slovak (slk), to English, where each is paired with a similar high-resource language (HRL), Turkish (tur), Russian (rus), Portuguese (por), and Czech (ces), respectively (details in § A.3). We combine data from all 8 languages, and use DDS to optimize data selection for each LRL.

To update the scorer, we construct Ddev so that it does not overlap with Dtest. For image classification, we hold out 10% of the training data as Ddev; for multilingual NMT, we simply use the dev set of the LRL as Ddev.

Models and Training Details. For image classification on CIFAR-10, we use the pre-activation WideResNet-28 (Zagoruyko & Komodakis, 2016), with width factor k = 2 for the reduced setting and k = 10 for the normal setting. For ImageNet, we use the post-activation ResNet-50 (He et al., 2016). These implementations reproduce the numbers reported in the literature (Zagoruyko & Komodakis, 2016; He et al., 2016; Xie et al., 2017), and additional details can be found in § A.4.

For NMT, we use a standard LSTM-based attentional baseline (Bahdanau et al., 2015), which is similar to previous models used in low-resource scenarios on this dataset (Neubig & Hu, 2018; Wang et al., 2019b) and others (Sennrich & Zhang, 2019), due to its relative stability compared to other options such as the Transformer (Vaswani et al., 2017). Accuracy is measured using BLEU score (Papineni et al., 2002). More experiment details are noted in § A.2.

Baselines and Our Methods. For both image classification and multilingual NMT, we compare the following data selection methods. Uniform: data is selected uniformly from all of the data that we have available, as is standard in training models. SPCL (Jiang et al., 2015): a curriculum learning method that dynamically updates the curriculum to focus more on the "easy" training examples based on model loss. DDS: our proposed method.

Table 1: Results for image classification accuracy (left) and multilingual MT BLEU (right). For MT, statistical significance is indicated with ∗ (p < 0.005) and † (p < 0.0001). DDS outperforms the best baseline in all settings. For both image classification and NMT, DDS performs better than other intelligent data selection methods.

Image classification (accuracy):
Methods          CIFAR-10 4K, k=2   CIFAR-10 Full, k=10   ImageNet 10%    ImageNet Full
                 (WRN-28-k)         (WRN-28-k)            (ResNet-50)     (ResNet-50)
Uniform          82.60±0.17         95.55±0.15            56.36/79.45     76.51/93.20
SPCL             81.09±0.22         93.66±0.12            -               -
BatchWeight      79.61±0.50         94.11±0.18            -               -
MentorNet        83.11±0.62         94.92±0.34            -               -
DDS              83.63±0.29         96.31±0.13            56.81/79.51     77.23/93.57
retrained DDS    85.56±0.20         97.91±0.12            -               -

Multilingual MT (BLEU):
Methods      aze       bel       glg      slk
Uniform      10.31     17.21     26.05    27.44
SPCL         9.07      16.99     23.64    21.44
Related      10.34     15.31     27.41    25.92
TCS          11.18     16.97     27.28    27.72
DDS          10.74     17.24     27.32    28.20∗
TCS+DDS      11.84∗    17.74†    27.78    27.74

For image classification, we compare with several additional methods designed for filtering noisy data on CIFAR-10, where we simply consider the dev set as the clean data. BatchWeight (Ren et al., 2018): a method that scales example training loss in a batch with a locally optimized weight vector using a small set of clean data. MentorNet (Jiang et al., 2018): a curriculum learning method that trains a mentor network to select clean data based on features from both the data and the main model.

For machine translation, we also compare with two state-of-the-art heuristic methods for multilingual data selection. Related: data is selected uniformly from the target LRL and a linguistically related HRL (Neubig & Hu, 2018). TCS: a recently proposed method of "target conditioned sampling", which uniformly chooses target sentences, then picks which source sentence to use based on heuristics such as word overlap (Wang & Neubig, 2019). Note that both of these methods take advantage of structural properties of the multilingual NMT problem, and do not generalize to other problems such as classification.

DDS with Prior Knowledge. DDS is a flexible framework that can incorporate prior knowledge about the data through the scorer network, which can be especially important when the data has certain structural properties such as language or domain. We test such a setting of DDS for both tasks. For image classification, we use retrained DDS, where we first train a model and scorer network using standard DDS till convergence. The trained scorer network can be considered a good prior over the data, so we use it to train the final model from scratch again using DDS. For multilingual NMT, we experiment with TCS+DDS, where we initialize the parameters of DDS with the TCS heuristic, then continue training.

4.2. Main Results

The results of the baselines and our method are listed in Tab. 1. First, comparing the standard baseline strategy of "Uniform" and the proposed method "DDS", we can see that in all 8 settings DDS improves over the uniform baseline. This is a strong indication of both the consistency of the improvements that DDS can provide and of its generality – it works well in two very different settings. Next, we find that DDS outperforms SPCL by a large margin for both tasks, especially for multilingual NMT. This is probably because SPCL weighs the data only by their easiness, while ignoring their relevance to the dev set, which is especially important in settings where the data in the training set can have very different properties, such as the different languages in multilingual NMT.

DDS also brings improvements over state-of-the-art intelligent data utilization methods. For image classification, DDS outperforms MentorNet and BatchWeight on CIFAR-10 in all settings. For NMT, in comparison to Related and TCS, vanilla DDS performs favorably with respect to these state-of-the-art data selection baselines, outperforming each in 3 out of the 4 settings (with the exceptions of slightly underperforming Related on glg and TCS on aze). In addition, we see that incorporating prior knowledge into the scorer network leads to further improvements. For image classification, retrained DDS significantly improves over regular DDS, leading to a new state-of-the-art result on the CIFAR-10 dataset. For multilingual NMT, TCS+DDS achieves the best performance in three out of four cases (with the exception of slk, where vanilla DDS already outperformed TCS).⁴

DDS does not incur much computational overhead. For image classification and multilingual NMT respectively, the training time is about 1.5× and 2× the regular training time without DDS. This contrasts favorably with previous methods that learn to select data using reinforcement learning. For example, in the IMDB movie review experiment in Fan et al. (2018a), the data agent is trained for 200 episodes, where each episode uses around 40% of the whole dataset, requiring 80× more training time than a single training run.

⁴Significance tests (Clark et al., 2011) find significant gains over the baseline for aze, slk, and bel. For glg, DDS without heuristics performs as well as the TCS baseline.

Figure 2: Example images from ImageNet and the weights assigned by DDS. A trained DDS scorer assigns higher probabilities to images in which the class content is clearer. Each image's label and weight in the minibatch is shown (the displayed weights range from about 0.024 for ambiguous images to about 0.148 for clear ones).

4.3. Analysis

Image Classification. Prior work on heuristic data selection has found that models perform better when fed higher quality or more domain-relevant data towards the end of training (van der Wees et al., 2017; Wang et al., 2019a). Here we verify this observation by analyzing the learned importance weights at the end of training for image classification. Fig. 3 shows that at the end of training, DDS learns to balance the class distribution, which is originally unbalanced due to the dataset creation. Fig. 2 shows that at the end of training, DDS assigns higher probabilities to images with clearer class content from ImageNet. These results show that DDS learns to focus on higher quality data towards the end of training.

Figure 3: A trained DDS scorer learns to balance the class distributions of CIFAR-10 4K (raw class counts vs. DDS-normalized probability over the 10 classes).

NMT. Next, we focus on multilingual NMT, where the choice of data directly corresponds to picking a language, which has an intuitive interpretation. Since DDS adapts the data weights dynamically to the model throughout training, here we analyze the dynamics of the learned weights. We plot the probability distribution over the four HRLs (because they have more data and thus a larger impact on training) over the course of training.

Fig. 4 shows the change of the language distribution for TCS+DDS. Since TCS selects the language with the largest vocabulary overlap with the LRL, the distribution is initialized to focus on the most related HRL. For all four LRLs, the percentage of their most related HRL starts to decrease as training continues. For aze, DDS quickly comes back to using its most related HRL. For glg and slk, DDS learns to mainly use both por and ces, their corresponding HRLs. However, for bel, DDS continues the trend of using all four languages. This shows that DDS is able to maximize the benefits of the multilingual data by having a more balanced usage of all languages.

Figure 4: Language usage for TCS+DDS by training step (one panel per LRL: aze, bel, glg, slk; curves show the four HRLs tur, rus, por, ces). The distribution is initialized to focus on the most related HRL, and DDS learns to have a more balanced usage of all languages.

Fig. 5 shows a more interesting trend of DDS without heuristic initialization. For both aze and bel, DDS focuses on the most related HRL after a certain number of training updates. Interestingly, for bel, DDS learns to focus on both rus, its most related HRL, and ces. Similarly for slk, DDS also learns to focus on ces, its most related HRL, and rus, although there is little vocabulary overlap between slk and rus. Also notably, the ratios change significantly over the course of training, indicating that different types of data may be more useful during different learning stages.

Figure 5: Language usage for DDS by training step (panels as in Fig. 4). DDS learns to upweight the most related HRL after a certain number of training steps.

5. Related Work

Data Selection Methods. Data selection for domain adaptation for disparate tasks has been extensively studied (Moore & Lewis, 2010; Axelrod et al., 2011; Ngiam et al., 2018; Jiang & Zhai, 2007; Foster et al., 2010; Wang et al.). These methods generally design heuristics to measure domain similarity, while DDS is a more general data selection framework that works for both classification and other use cases. Besides domain adaptation, data selection also benefits training in the face of noisy or otherwise undesirable data (Vyas et al., 2018; Pham et al., 2018). The idea of selecting training data that are similar to dev data has been used in works on submodular optimization (Kirchhoff & Bilmes, 2014; Tschiatschek et al., 2014), but they rely on features specific to the task, while DDS directly optimizes the dev set performance and is generalizable across tasks. Moreover, unlike DDS, these methods cannot adaptively change the data selection scheme.

Instance Weighting Methods. Our method is also related to work on training instance weighting (Sivasankaran et al., 2017; Ren et al., 2018; Jiang & Zhai, 2007; Ngiam et al., 2018). These methods reweigh data based on a manually computed weight vector, instead of using a parameterized neural network. Notably, Ren et al. (2018) tackle noisy data filtering for image classification by using meta-learning to calculate a locally optimized weight vector for each batch of data. In contrast, our work focuses on the general problem of optimizing data usage. We train a parameterized scorer network that optimizes over the entire data space, which can be essential in preventing the overfitting mentioned in § 2; empirically, our method outperforms Ren et al. (2018) by a large margin in § 4. Sivasankaran et al. (2017) optimize data weights by minimizing the error rate on the dev set. However, they use a single number to weigh each subgroup of augmented data, and their algorithm requires an expensive heuristic method to update the data weights, while DDS uses a more expressive parameterized neural network to model the individual data weights, which are efficiently updated by directly differentiating the dev loss.

Curriculum Learning. Many machine learning approaches consider how to best present data to models. First, difficulty-based curriculum learning estimates the presentation order based on heuristic understanding of the hardness of examples (Bengio et al., 2009; Spitkovsky et al., 2010; Tsvetkov et al., 2016; Zhang et al., 2016; Graves et al., 2017; Zhang et al., 2018; Platanios et al., 2019). These methods, though effective, often generalize poorly because they require task-specific difficulty measures. On the other hand, self-paced learning (Kumar et al., 2010; Lee & Grauman, 2011) defines the hardness of the data based on the loss from the model, but is still based on the assumption that the model should learn from easy examples. Our method does not make these assumptions.

RL for Training Data Usage. Our method is closest to the learning-to-teach framework (Fan et al., 2018b), but their formulation involves manual feature design and requires expensive multi-pass optimization. Instead, we formulate our reward using bi-level optimization, which has been successfully applied to a variety of other tasks (Colson et al., 2007; Anandalingam & Friesz, 1992; Liu et al., 2019a; Baydin et al., 2018; Ren et al., 2018). Wu et al. (2018), Kumar et al. (2019), and Fang et al. (2017) propose RL frameworks for specific natural language processing tasks, but their methods are less generalizable and require more complicated featurization.

6. Conclusion

We present Differentiable Data Selection, an efficient RL framework for optimizing training data usage. We parameterize the scorer network as a differentiable function of the data, and provide an intuitive reward function for efficiently training the scorer network. We formulate two algorithms under the DDS framework for two realistic and very different tasks, image classification and multilingual NMT, which lead to consistent improvements over strong baselines.

Acknowledgement

The authors would like to thank Quoc Le for various suggestions, and thank Hanxiao Liu for comments on the paper's first draft. The first author Xinyi Wang is supported by a research grant from the Tang Family Foundation. The authors would like to thank Amazon for providing GPU credits.

References

Anandalingam, G. and Friesz, T. L. Hierarchical optimization: An introduction. Annals OR, 1992.

Axelrod, A., He, X., and Gao, J. Domain adaptation via pseudo in-domain data selection. In EMNLP, 2011.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Baydin, A. G., Cornish, R., Martínez-Rubio, D., Schmidt, M., and Wood, F. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In ICML, 2009.

Clark, J. H., Dyer, C., Lavie, A., and Smith, N. A. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL, 2011.

Colson, B., Marcotte, P., and Savard, G. An overview of bilevel optimization. Annals OR, 153(1), 2007.

Du, Y., Czarnecki, W. M., Jayakumar, S. M., Pascanu, R., and Lakshminarayanan, B. Adapting auxiliary losses using gradient similarity. CoRR, abs/1812.02224, 2018. URL http://arxiv.org/abs/1812.02224.

Fan, Y., Tian, F., Qin, T., Bian, J., and Liu, T.-Y. Learning what data to learn. 2018a. URL https://arxiv.org/abs/1702.08635.

Fan, Y., Tian, F., Qin, T., Li, X., and Liu, T. Learning to teach. In ICLR, 2018b.

Fang, M., Li, Y., and Cohn, T. Learning how to active learn: A deep reinforcement learning approach. In EMNLP, pp. 595–605, 2017.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

Foster, G., Goutte, C., and Kuhn, R. Discriminative instance weighting for domain adaptation in statistical machine translation. In EMNLP, 2010.

Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In ICML, 2017.

Grézl, F., Karafiát, M., Kontár, S., and Černocký, J. Probabilistic and bottle-neck features for LVCSR of meetings. In ICASSP, volume 4, pp. IV-757. IEEE, 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Jiang, J. and Zhai, C. Instance weighting for domain adaptation in NLP. In ACL, 2007.

Jiang, L., Meng, D., Zhao, Q., Shan, S., and Hauptmann, A. G. Self-paced curriculum learning. In AAAI, 2015.

Jiang, L., Zhou, Z., Leung, T., Li, L., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.

Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In ICLR, 2015.

Kirchhoff, K. and Bilmes, J. A. Submodularity for data selection in machine translation. In EMNLP, 2014.

Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? In CVPR, 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Kumar, G., Foster, G., Cherry, C., and Krikun, M. Reinforcement learning based curriculum optimization for neural machine translation. In NAACL, pp. 2054–2061, 2019.

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In NIPS, 2010.

Lee, Y. J. and Grauman, K. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.

Lipton, Z. C., Wang, Y.-X., and Smola, A. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search. 2019a.

Liu, S., Davison, A. J., and Johns, E. Self-supervised generalisation with meta auxiliary learning. CoRR, abs/1901.08933, 2019b. URL http://arxiv.org/abs/1901.08933.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.

Moore, R. C. and Lewis, W. Intelligent selection of language model training data. In ACL, 2010.

Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 1983.

Neubig, G. and Hu, J. Rapid adaptation of neural machine translation to new languages. In EMNLP, 2018.

Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with specialist models. CVPR, 2018.

Papineni, K., Roukos, S., Ward, T., and Zhu, W. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.

Pham, M. Q., Crego, J., Senellart, J., and Yvon, F. Fixing translation divergences in parallel corpora for neural MT. In EMNLP, 2018.

Platanios, E. A., Stretcu, O., Neubig, G., Poczos, B., and Mitchell, T. Competence-based curriculum learning for neural machine translation. In NAACL, 2019.

Qi, Y., Sachan, D. S., Felix, M., Padmanabhan, S., and Neubig, G. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL, 2018.

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340, 2018.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. IJCV, 2015.

Sennrich, R. and Zhang, B. Revisiting low-resource neural machine translation: A case study. In ACL, 2019.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Sivasankaran, S., Vincent, E., and Illina, I. Discriminative importance weighting of augmented training data for acoustic model training. In ICASSP, 2017.

Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. In NAACL, 2010.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. In JMLR, 2014.

Tschiatschek, S., Iyer, R. K., Wei, H., and Bilmes, J. A. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.

Tsvetkov, Y., Faruqui, M., Ling, W., MacWhinney, B., and Dyer, C. Learning the curriculum with Bayesian optimization for task-specific word representation learning. In ACL, 2016.

van der Wees, M., Bisazza, A., and Monz, C. Dynamic data selection for neural machine translation. In EMNLP, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NIPS, pp. 5998–6008, 2017.

Vyas, Y., Niu, X., and Carpuat, M. Identifying semantic divergences in parallel text without annotations. In NAACL, 2018.

Wang, R., Utiyama, M., Liu, L., Chen, K., and Sumita, E. Instance weighting for neural machine translation domain adaptation. In EMNLP.

Wang, W., Caswell, I., and Chelba, C. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. In ACL, 2019a.

Wang, X. and Neubig, G. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In ACL, 2019.

Wang, X., Pham, H., Arthur, P., and Neubig, G. Multilingual neural machine translation with soft decoupled encoding. In ICLR, 2019b.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Wu, J., Li, L., and Wang, W. Y. Reinforced co-training. In NAACL, 2018.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Zhang, D., Kim, J., Crego, J., and Senellart, J. Boosting neural machine translation. arXiv 1612.06138, 2016.

Zhang, X., Kumar, G., Khayrallah, H., Murray, K., Gwinnup, J., Martindale, M. J., McNamee, P., Duh, K., and Carpuat, M. An empirical exploration of curriculum learning for neural machine translation. arXiv 1811.00739, 2018.

Zoph, B., Yuret, D., May, J., and Knight, K. Transfer learning for low-resource neural machine translation. In EMNLP, 2016.

A. Appendix

A.1. Deriving the Gradient of ψ for Different Optimizers

First, we rewrite the update rule of θ in Eqn. 4 to incorporate the effect of its specific optimization algorithm. For a fixed value of ψ, J(θ, ψ) can be optimized using a stochastic gradient update. Specifically, at time step t, we update

θt ← θt−1 − g(∇θJ(θt−1, ψ))   (8)

where g(·) is any function that may be applied to the gradient ∇θJ(θt−1, ψ). For instance, in standard gradient descent g(·) is simply a linear scaling of ∇θJ(θt−1, ψ) by a learning rate ηt, while with the Adam optimizer (Kingma & Ba, 2015) g also modifies the learning rate on a parameter-by-parameter basis.
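To make the role of g(·) concrete, the following NumPy sketch (ours, with illustrative default values) writes out g for the three optimizers considered below: plain SGD, Momentum, and the simplified Adam-style variant of Eqn. 16.

```python
# Illustration (ours) of the g(.) abstraction in Eqn. 8: the optimizer is viewed
# as a function applied to the gradient before it is subtracted from theta.
import numpy as np

def g_sgd(grad, lr=0.1):
    # Standard SGD: a linear scaling of the gradient (Eqn. 12).
    return lr * grad

def make_g_momentum(lr=0.1, mu=0.9):
    state = {"m": 0.0}
    def g(grad):
        # Momentum: g returns the updated velocity (Eqn. 14).
        state["m"] = mu * state["m"] + lr * grad
        return state["m"]
    return g

def make_g_adam(lr=0.001, beta2=0.999, eps=1e-8):
    state = {"v": 0.0, "t": 0}
    def g(grad):
        # Simplified Adam-style scaling as in Eqn. 16 (no first-moment term).
        state["t"] += 1
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
        v_hat = state["v"] / (1 - beta2 ** state["t"])
        return lr * grad / (np.sqrt(v_hat) + eps)
    return g

def gradient_update(theta, grad, g_fn):
    # One generic update step, matching Eqn. 8: theta_t = theta_{t-1} - g(grad).
    return theta - g_fn(grad)
```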

Due to the relationship between θt and ψ as in Eqn. 8, J(θt, Ddev) is differentiable with respect to ψ. By the chain rule, we can compute the gradient ∇ψJ(θt, Ddev) as follows:

∇ψJ(θt, Ddev)
  (apply the chain rule:)
  = ∇θt J(θt, Ddev)⊤ · ∇ψθt(ψ)
  (substitute θt from Eqn. 8:)
  = ∇θt J(θt, Ddev)⊤ · ∇ψ(θt−1 − g(∇θJ(θt−1, ψ)))
  (assume ∇ψθt−1 ≈ 0:)
  ≈ −∇θt J(θt, Ddev)⊤ · ∇ψ g(∇θJ(θt−1, ψ))   (9)

Here, we make a Markov assumption that ∇ψθt−1 ≈ 0, assuming that at step t, given θt−1, we do not care about how the values of ψ from previous steps led to θt−1. Eqn. 9 leads to a rule to update ψ using gradient descent:

ψt+1 ← ψt + ηψ ∇θt J(θt, Ddev)⊤ · ∇ψ g(∇θJ(θt−1, ψt))   (10)
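As a sanity check on Eqns. 9–10, the following toy example (ours, not from the paper) compares the analytic hypergradient with a finite-difference estimate for a single SGD step on a one-dimensional problem; because θ0 does not depend on ψ here, the Markov assumption is exact and the two values match.

```python
# Toy sanity check (ours) of Eqns. 9-10 for a single SGD step.
import numpy as np

a1, a2, a_dev = 1.0, -1.0, 0.8        # two training targets and a dev target
eta, theta0, psi = 0.1, 0.0, 0.3

def sigma(z):                          # scorer: weight of example 1 is sigma(psi)
    return 1.0 / (1.0 + np.exp(-z))

def dev_loss_after_one_step(psi_val):
    p1 = sigma(psi_val)
    # grad_theta J(theta_0, psi) for weighted losses (theta - a_i)^2
    g = p1 * 2 * (theta0 - a1) + (1 - p1) * 2 * (theta0 - a2)
    theta1 = theta0 - eta * g                       # Eqn. 8 with SGD
    return (theta1 - a_dev) ** 2                    # J(theta_1, D_dev)

# Analytic hypergradient (Eqn. 9):
# -grad_theta J_dev(theta_1) * eta * d/dpsi grad_theta J(theta_0, psi)
p1 = sigma(psi)
g = p1 * 2 * (theta0 - a1) + (1 - p1) * 2 * (theta0 - a2)
theta1 = theta0 - eta * g
dev_grad = 2 * (theta1 - a_dev)
d_g_d_psi = sigma(psi) * (1 - sigma(psi)) * (2 * (theta0 - a1) - 2 * (theta0 - a2))
analytic = -dev_grad * eta * d_g_d_psi

# Central finite difference of d J(theta_1, D_dev) / d psi
h = 1e-5
numeric = (dev_loss_after_one_step(psi + h) - dev_loss_after_one_step(psi - h)) / (2 * h)
print(analytic, numeric)   # the two values should agree closely
```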

Here we first derive ∇ψg for the general stochastic gradient descent (SGD) update, then provide examples for two other common optimization algorithms, namely Momentum (Nesterov, 1983) and Adam (Kingma & Ba, 2015).

SGD Updates. The SGD update rule for θ is as follows

θt ← θt−1 − ηt∇θJ(θt−1, ψ) (11)

where ηt is the learning rate. Matching the updates in Eqn. 11 with the generic framework in Eqn. 8, we can see that g in Eqn. 8 has the form:

g(∇θJ(θt−1, ψ)) = ηt ∇θJ(θt−1, ψ)   (12)

This reveals a linear dependency of g on ∇θJ(θt−1, ψ), allowing the exact differentiation of g with respect to ψ.

From Eqn. 10, we have:

∇θt J(θt, Ddev)⊤ · ∇ψ g(∇θJ(θt−1, ψ))
  = ηt · ∇ψ E_{x,y∼p(X,Y;ψ)}[∇θt J(θt, Ddev)⊤ · ∇θℓ(x, y; θt−1)]
  = ηt E_{x,y∼p(X,Y;ψ)}[(∇θt J(θt, Ddev)⊤ · ∇θℓ(x, y; θt−1)) · ∇ψ log p(x, y; ψ)]   (13)

Here, the last equation follows from the log-derivative trick in the REINFORCE algorithm (Williams, 1992). We can consider the alignment of dev set and training data gradients as the reward for updating ψ. In practice, we found that using cosine distance is more stable than simply taking the dot product between the gradients. Thus, in our implementation of the machine translation algorithm, we use cos(∇θJ(θt, Ddev), ∇θℓ(x, y; θt−1)) as the reward signal.

Momentum Updates. The momentum update rule for θ is as follows:

mt ← µt mt−1 + ηt ∇θJ(θt−1, ψ)
θt ← θt−1 − mt   (14)

where µt is the momentum coefficient and ηt is the learning rate. This means that g has the form:

g(x) = µt mt−1 + ηt x
g′(x) = ηt   (15)

Therefore, the computation of the gradient ∇ψ for the Momentum update is exactly the same as for the standard SGD update rule in Eqn. 13.

Adam Updates. We use a slightly modified update rule based on Adam (Kingma & Ba, 2015):

gt ← ∇θJ(θt−1, ψ)
vt ← β2 vt−1 + (1 − β2) gt²
v̂t ← vt / (1 − β2^t)
θt ← θt−1 − ηt · gt / (√v̂t + ε)   (16)

where β2 and ηt are hyper-parameters. This means that g is a component-wise operation of the form:

g(x) = ηt √(1 − β2^t) · x / (√(β2 vt−1 + (1 − β2) x²) + ε)
g′(x) = ηt √(1 − β2^t) (β2 vt−1 + ε) / (β2 vt−1 + (1 − β2) x² + ε)^{3/2} ≈ ηt √((1 − β2^t) / (β2 vt−1))   (17)

The last approximation holds because we assume vt−1 is independent of ψ, and because we empirically observe that the individual values of the gradient vector ∇θJ(θt−1, ψ), i.e. gt, are close to 0. Furthermore, for Adam we usually use β2 = 0.999, so the value (1 − β2)x² in the denominator of Eqn. 17 is negligible. With this approximation, the computation of the gradient ∇ψ is almost the same as that for SGD in Eqn. 13, with one extra component-wise scaling by the term in Eqn. 17.

A.2. Hyperparameters for multilingual NMT

In this section, we give a detailed description of the hyperparameters used for the multilingual NMT experiments.

• We use a 1-layer LSTM with hidden size of 512 for both the encoder and decoder, and set the word embedding size to 128.

• For multilingual NMT, we only use the scorer to model the distribution over languages. Therefore, we use a simple 2-layer perceptron network as the scorer architecture. Suppose the training data come from n different languages. For each target sentence and its corresponding source sentences, the input feature is an n-dimensional vector of 0s and 1s, where 1 indicates that a source language exists for the given target sentence.

• We simply use the dev set that comes with the dataset as Ddev to update the scorer.

• The dropout rate is set to 0.3.

• We optimize both the NMT and DDS models with Adam, using learning rates of 0.001 and 0.0001 for θ and ψ respectively.

• We train all models for 20 epochs without any learning rate decay.

A.3. Dataset for Multilingual NMT

LRL    Train    Dev     Test    HRL    Train
aze    5.94k    671     903     tur    182k
bel    4.51k    248     664     rus    208k
glg    10.0k    682     1007    por    185k
slk    61.5k    2271    2445    ces    103k

Table 2: Statistics of the multilingual NMT datasets.

A.4. Hyperparameters for image classification

In this section, we provide some additional details for the image classification task:

• We use the cosine learning rate decay schedule (Loshchilov & Hutter, 2017), starting at 0.1 for CIFAR-10 and 3.2 for ImageNet, both with 2000 warmup steps.

• For image classification, we use a network architecture identical to the main model, but with independent weights and a regressor to predict the score instead of a classifier to predict image classes.

• To construct the Ddev to update the scorer, we hold out about 10% of the training data. For example, in CIFAR-10 (4,000), Ddev is the last 400 images, while in ImageNet-10%, since we use the first 102 TFRecord shards, Ddev consists of the last 10 shards. Here, “last” follows the order in which the data is posted on their website for CIFAR-10, and the order in which the TFRecord shards are processed for ImageNet. All data in Ddev are excluded from Dtrain. Thus, for example, with CIFAR-10 (4,000), |Dtrain| = 3600, ensuring that in total, we are only using the amount of data that we claim to use.

• We maintain a moving average of all model parame- ters with the rate of 0.999. Following Kornblith et al. (2019), we treat the moving statistics of batch normal- ization (Ioffe & Szegedy, 2015) as untrained parame- ters and also add them to the moving averages.

• For ImageNet, we use the post-activation ResNet- 50 (He et al., 2016). The batch sizes for CIFAR-10 and for ImageNet are 128 and 4096, running for 200K steps and 40K steps, respectively.