Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs
Michael Gygli¹* Mohammad Norouzi² Anelia Angelova²

*Work done during an internship at Google Brain. ¹ETH Zürich & gifs.com. ²Google Brain, Mountain View, USA. Correspondence to: Michael Gygli <[email protected]>, Mohammad Norouzi <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s). arXiv:1703.04363v2 [cs.LG], 8 Aug 2017.

Abstract

We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input. Once the model is trained, we perform inference by gradient descent on the continuous relaxations of the output variables to find outputs with promising scores from the value network. When applied to image segmentation, the value network takes an image and a segmentation mask as inputs and predicts a scalar estimating the intersection over union between the input and ground truth masks. For multi-label classification, the DVN's objective is to correctly predict the F1 score for any potential label configuration. The DVN framework achieves state-of-the-art results on multi-label prediction and image segmentation benchmarks.

1. Introduction

Structured output prediction is a fundamental problem in machine learning that entails learning a mapping from input objects to complex multivariate output structures. Because structured outputs live in a high-dimensional combinatorial space, one needs to design factored prediction models that are not only expressive, but also computationally tractable for both learning and inference. Due to computational considerations, a large body of previous work (e.g., Lafferty et al. (2001); Tsochantaridis et al. (2004)) has focused on relatively weak graphical models with pairwise or small clique potentials. Such models are not capable of learning complex correlations among the random variables, making them unsuitable for tasks requiring complicated high-level reasoning to resolve ambiguity.

An expressive family of energy-based models studied by LeCun et al. (2006) and Belanger & McCallum (2016) exploits a neural network to score different joint configurations of inputs and outputs. Once the network is trained, one simply resorts to gradient-based inference as a mechanism to find low-energy outputs. Despite recent developments, optimizing the parameters of deep energy-based models remains challenging, limiting their applicability. Moving beyond the large-margin training used by previous work (Belanger & McCallum, 2016), this paper presents a simpler and more effective objective, inspired by value-based reinforcement learning, for training energy-based models.

Our key intuition is that learning to critique different output configurations is easier than learning to directly come up with optimal predictions. Accordingly, we build a deep value network (DVN) that takes an input x and a corresponding output structure y, both as inputs, and predicts a scalar score v(x, y) evaluating the quality of the configuration y and its correspondence with the input x. We exploit a loss function ℓ(y, y*) that compares an output y against a ground truth label y* to teach a DVN to evaluate different output configurations. The goal is to distill the knowledge of the loss function into the weights of a value network so that during inference, in the absence of the labeled output y*, one can still rely on the value judgments of the neural net to compare outputs.

To enable effective iterative refinement of structured outputs via gradient ascent on the score of a DVN, similar to Belanger & McCallum (2016), we relax the discrete output variables to live in a continuous space. Moreover, we extend the domain of the loss functions so that the loss applies to continuous variable outputs. For example, for multi-label classification, instead of enforcing each output dimension y_i to be binary, we let y_i ∈ [0, 1] and generalize the notion of the F1 score to apply to continuous predictions. For image segmentation, we use a similar generalization of intersection over union. We then train a DVN on many output examples, encouraging the network to predict precise (negative) loss scores for almost any output configuration. Figure 1 illustrates the gradient-based inference process on a DVN optimized for image segmentation.

Figure 1. Segmentation results of DVN on Weizmann horses test samples (panels: input x, predictions after 5, 10, and 30 inference steps, and the ground truth label y*). Our gradient-based inference method iteratively refines segmentation masks to maximize the predicted scores of a deep value network. Starting from a black mask at step 0, the predictions converge within 30 steps, yielding the output segmentation. See https://goo.gl/8OLufh for more & animated results.
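To make this iterative refinement concrete, the following minimal sketch performs projected gradient ascent on a relaxed output y ∈ [0, 1]^N. The function toy_value_net below is a hand-written stand-in with a closed-form gradient, not the paper's trained DVN, and the step size, step count, and all-zero initialization are illustrative choices loosely matching the setup of Figure 1.

import numpy as np

def toy_value_net(x, y, W):
    """Stand-in for a trained value network v(x, y; theta): a smooth function
    of (x, y) with an easy analytic gradient, NOT the paper's architecture."""
    target = 1.0 / (1.0 + np.exp(-W @ x))        # an x-dependent reference output
    return -np.sum((y - target) ** 2)            # higher value = better y

def toy_value_grad_y(x, y, W):
    """Gradient of the toy value with respect to the relaxed output y."""
    target = 1.0 / (1.0 + np.exp(-W @ x))
    return -2.0 * (y - target)

def gradient_based_inference(x, W, dim_y, steps=30, step_size=0.5):
    """Projected gradient ascent on y in [0, 1]^N, mirroring the iterative
    refinement illustrated in Figure 1 (hyperparameters are illustrative)."""
    y = np.zeros(dim_y)                          # start from an all-zero ("black") output
    for _ in range(steps):
        y = y + step_size * toy_value_grad_y(x, y, W)   # move y uphill on the value
        y = np.clip(y, 0.0, 1.0)                 # project back onto [0, 1]^N
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=8)                           # toy input
W = rng.normal(size=(4, 8))                      # parameters of the toy scorer
y_hat = gradient_based_inference(x, W, dim_y=4)
print(y_hat, toy_value_net(x, y_hat, W))

With an actual DVN, the same loop is driven by the gradient of the network's scalar output with respect to y, obtained by backpropagation through the network.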
This paper presents a novel training objective for deep structured output prediction, inspired by value-based reinforcement learning algorithms, to precisely evaluate the quality of any input-output pair. We assess the effectiveness of the proposed algorithm on multi-label classification based on text data and on image segmentation. We obtain state-of-the-art results in both cases, despite the differences between the domains and loss functions. Even given a small number of input-output pairs, we find that we are able to build powerful structured prediction models. For example, on the Weizmann horses dataset (Borenstein & Ullman, 2004), without any form of pre-training, we are able to optimize 2.5 million network parameters on only 200 training images with multiple crops. Our deep value network setup outperforms methods that are pre-trained on large datasets such as ImageNet (Deng et al., 2009) and methods that operate on 4× larger inputs. Our source code based on TensorFlow (Abadi et al., 2015) is available at https://github.com/gyglim/dvn.

2. Background

Structured output prediction entails learning a mapping from input objects x ∈ X (e.g., X ≡ R^M) to multivariate discrete outputs y ∈ Y (e.g., Y ≡ {0, 1}^N). Given a training dataset of input-output pairs, D ≡ {(x^(i), y^{*(i)})}_{i=1}^N, we aim to learn a mapping ŷ(x) : X → Y from inputs to ground truth outputs. Because finding the exact ground truth output structures in a high-dimensional space is often infeasible, one measures the quality of a mapping via a loss function ℓ(y, y*) : Y × Y → R_{≥0} that evaluates the distance between different output structures. Given such a loss function, the quality of a mapping is measured by the empirical loss over a validation dataset D′,

$$\sum_{(x,\,y^*) \in \mathcal{D}'} \ell\big(\hat{y}(x), y^*\big). \qquad (1)$$

This loss can take an arbitrary form and is often non-differentiable. For multi-label classification, a common loss is the negative F1 score, and for image segmentation, a typical loss is the negative intersection over union (IOU).
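For concreteness, here is one way to write these two quality measures so that they also accept relaxed outputs y ∈ [0, 1]^N, using elementwise min and max for intersection and union. This particular relaxation is a natural reading of the description in the introduction, not necessarily the exact definition used later in the paper.

import numpy as np

def iou_relaxed(y, y_star):
    """Intersection over union extended to y in [0, 1]^N via elementwise
    min/max; it reduces to the usual IOU when both arguments are binary."""
    intersection = np.minimum(y, y_star).sum()
    union = np.maximum(y, y_star).sum()
    return intersection / union if union > 0 else 1.0

def f1_relaxed(y, y_star):
    """F1 score extended the same way: 2 |y ∩ y*| / (|y| + |y*|)."""
    intersection = np.minimum(y, y_star).sum()
    denom = y.sum() + y_star.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

y_star = np.array([1.0, 0.0, 1.0, 1.0])   # ground-truth labels or mask pixels
y_soft = np.array([0.9, 0.2, 0.6, 0.8])   # a relaxed candidate output
print(iou_relaxed(y_soft, y_star), f1_relaxed(y_soft, y_star))

The corresponding task losses are the negatives of these scores; the value network is trained to predict the negative loss, i.e., the score itself.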
Some structured output prediction methods (Taskar et al., 2003; Tsochantaridis et al., 2004) learn a mapping from inputs to outputs via a score function s(x, y; θ), which evaluates different input-output configurations based on a linear function of some joint input-output features ψ(x, y),

$$s(x, y; \theta) = \theta^{\mathsf{T}} \psi(x, y). \qquad (2)$$

The goal of learning is to optimize the score function such that the model's predictions, denoted ŷ,

$$\hat{y} = \operatorname*{argmax}_{y} \, s(x, y; \theta), \qquad (3)$$

are closely aligned with the ground-truth labels y*, as measured by the empirical loss in (1) on the training set.

The empirical loss is not amenable to numerical optimization because the argmax in (3) is discontinuous. Structural SVM formulations (Taskar et al., 2003; Tsochantaridis et al., 2004) introduce a margin violation (slack) variable for each training pair and define a continuous upper bound on the empirical loss. The upper bound on the loss for an example (x, y*) and the model's prediction ŷ takes the form

$$\ell(\hat{y}, y^*) \le \max_{y} \big[\, \ell(y, y^*) + s(x, y; \theta) \,\big] - s(x, \hat{y}; \theta) \qquad (4a)$$
$$\phantom{\ell(\hat{y}, y^*)} \le \max_{y} \big[\, \ell(y, y^*) + s(x, y; \theta) \,\big] - s(x, y^*; \theta). \qquad (4b)$$

Previous work (Taskar et al., 2003; Tsochantaridis et al., 2004) defines a surrogate objective on the empirical loss by summing the bound in (4b) over the different training examples, plus a regularizer. This surrogate objective is convex in θ, which makes optimization convenient.

This paper is inspired by the structural SVM formulation above, but we give up the convexity of the objective to obtain more expressive models using multi-layer neural networks. Specifically, we generalize the formulation above in three ways: 1) we use a non-linear score function, denoted v(x, y; θ), that fuses ψ(·, ·) and θ together and jointly learns the features; 2) we use gradient descent in y for iterative refinement of outputs, to approximately find the best ŷ(x); and 3) we optimize the score function with a regression objective, so that the predicted scores closely approximate the negative loss values,

$$\forall y \in \mathcal{Y}, \quad v(x, y; \theta) \approx -\ell(y, y^*). \qquad (5)$$

Our deep value network (DVN) is a non-linear function trying to evaluate the value of any output configuration y ∈ Y [...]

Here, y ∩ y* denotes the number of dimensions i where both y_i and y*_i are active, and y ∪ y* denotes the number of dimensions where at least one of y_i and y*_i is active.

Assuming that one has learned a suitable value network that attains v(x, y; θ) ≈ v*(y, y*) at every input-output pair, in order to infer a prediction for an input x that is valued highly by the value network, one needs to find ŷ = argmax_y v(x, y; θ), as described below.
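As a small, self-contained illustration of the regression objective in (5), the sketch below fits a scorer to the negative task loss of sampled candidate outputs. The joint feature map, the linear scorer standing in for a deep network, the way candidates are generated, and the learning rate are all simplifying assumptions made only for this example.

import numpy as np

def neg_iou(y, y_star):
    """Task loss: negative (relaxed) intersection over union."""
    union = np.maximum(y, y_star).sum()
    return -np.minimum(y, y_star).sum() / max(union, 1e-8)

def features(x, y):
    """Hypothetical joint feature map psi(x, y); the actual DVN instead learns
    the features with a deep network, fusing psi and theta as described above."""
    return np.concatenate([x, y, np.outer(x, y).ravel()])

def value(x, y, theta):
    """Linear stand-in for the non-linear value network v(x, y; theta)."""
    return theta @ features(x, y)

def regression_step(theta, x, y_star, candidates, lr=0.01):
    """One gradient step on the objective in (5): push v(x, y; theta) towards
    the negative loss for a batch of candidate outputs y."""
    grad = np.zeros_like(theta)
    for y in candidates:
        target = -neg_iou(y, y_star)             # negative loss, here the relaxed IOU
        err = value(x, y, theta) - target
        grad += err * features(x, y)             # gradient of 0.5 * err ** 2 w.r.t. theta
    return theta - lr * grad / len(candidates)

rng = np.random.default_rng(0)
x, y_star = rng.normal(size=3), np.array([1.0, 0.0, 1.0, 1.0])
theta = np.zeros(3 + 4 + 3 * 4)
candidates = [np.clip(y_star + rng.normal(scale=0.3, size=4), 0.0, 1.0)
              for _ in range(8)]
theta = regression_step(theta, x, y_star, candidates)
print(value(x, y_star, theta))

How candidate outputs are chosen during training, and how the trained network is used for inference, are described in the remainder of the paper.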