Loss Functions for Discriminative Training of Energy-Based Models.

Yann LeCun and Fu Jie Huang, The Courant Institute, {yann,jhuangfu}@cs.nyu.edu, http://yann.lecun.com

Abstract

Probabilistic graphical models associate a probability to each configuration of the relevant variables. Energy-based models (EBM) associate an energy to those configurations, eliminating the need for proper normalization of probability distributions. Making a decision (an inference) with an EBM consists in comparing the energies associated with various configurations of the variable to be predicted, and choosing the one with the smallest energy. Such systems must be trained discriminatively to associate low energies to the desired configurations and higher energies to undesired configurations. A wide variety of loss functions can be used for this purpose. We give sufficient conditions that a loss function should satisfy so that its minimization will cause the system to approach the desired behavior. We give many specific examples of suitable loss functions, and show an application to object recognition in images.

1 Introduction

Graphical models are overwhelmingly treated as probabilistic generative models in which inference and learning are viewed as probabilistic estimation problems. One advantage of the probabilistic approach is compositionality: one can build and train component models separately before assembling them into a complete system. For example, a Bayesian classifier can be built by assembling separately-trained generative models for each class. But if a model is trained discriminatively from end to end to make decisions, mapping raw input to ultimate outputs, there is no need for compositionality. Some applications require hard decisions rather than estimates of conditional output distributions. One example is mobile robot navigation: once trained, the robot must turn left or right when facing an obstacle. Computing a distribution over steering angles would be of little use in that context. The machine should be trained from end to end to approach the best possible decision in the largest range of situations.

Another implicit advantage of the probabilistic approach is that it provides well-justified loss functions for learning, e.g. maximum likelihood for generative models, and maximum conditional likelihood for discriminative models. Because of the normalization, maximizing the likelihood of the training samples will automatically decrease the likelihood of other points, thereby driving the machine toward the desired behavior. The downside is that the negative log-likelihood is the only well-justified loss function. Yet, approximating a distribution over the entire space by maximizing likelihood may be overkill when the ultimate goal is merely to produce the right decision.

We will argue that using proper probabilistic models, because they must be normalized, considerably restricts our choice of model architecture. Some desirable architectures may be difficult to normalize (the normalization may involve the computation of intractable partition functions), or may even be non-normalizable (their partition function may be an integral that does not converge).

This paper concerns a more general class of models called Energy-Based Models (EBM). EBMs associate an (un-normalized) energy to each configuration of the variables to be modeled. Making an inference with an EBM consists in searching for a configuration of the variables to be predicted that minimizes the energy, or comparing the energies of a small number of configurations of those variables. EBMs have considerable advantages over traditional probabilistic models: (1) there is no need to compute partition functions that may be intractable; (2) because there is no requirement for normalizability, the repertoire of possible model architectures that can be used is considerably richer.

Training an EBM consists in finding values of the trainable parameters that associate low energies to "desired" configurations of variables (e.g. those observed on a training set), and high energies to "undesired" configurations.
With properly normalized probabilistic models, increasing the likelihood of a "desired" configuration of variables will automatically decrease the likelihoods of other configurations. With EBMs, this is not the case: making the energy of desired configurations low may not necessarily make the energies of other configurations high. Therefore, one must be very careful when designing loss functions for EBMs¹.

¹ It is important to note that the energy is the quantity minimized during inference, while the loss is the quantity minimized during learning.

We must make sure that the loss function we pick will effectively drive our machine to approach the desired behavior. In particular, we must ensure that the loss function has no trivial solution (e.g. one where the best way to minimize the loss is to make the energy constant for all input/output pairs). A particular manifestation of this is the so-called collapse problem that was pointed out in some early works that attempted to combine neural nets and HMMs [7, 2, 8].

This energy-based, end-to-end approach to learning has been applied with great success to sentence-level handwriting recognition in the past [10]. But there has not been a general characterization of "good" energy functions and loss functions. The main point of this paper is to give sufficient conditions that a discriminative loss function must satisfy, so that its minimization will carve out the energy landscape in input/output space in the right way and cause the machine to approach the desired behavior. We then propose a wide family of loss functions that satisfy these conditions, independently of the architecture of the machine being trained.

Figure 1: Two energy surfaces in X, Y space obtained by training two neural nets to compute the function Y = X² − 1/2. The blue dots represent a subset of the training samples. In the left diagram, the energy is quadratic in Y, therefore its exponential is integrable over Y. This model is equivalent to a probabilistic Gaussian model of P(Y|X). The right diagram uses a non-quadratic saturated energy whose exponential is not integrable over Y. This model is not normalizable, and therefore has no probabilistic counterpart.

2 Energy-Based Models

Let us define our task as one of predicting the best configuration of a set of variables denoted collectively by Y, given a set of observed (input) variables collectively denoted by X. Given an observed configuration for X, a probabilistic model (e.g. a graphical model) will associate a (normalized) probability P(Y|X) to each possible configuration of Y. When a decision must be made, the configuration of Y that maximizes P(Y|X) will be picked.

An Energy-Based Model (EBM) associates a scalar energy E(W, Y, X) to each configuration of X, Y. The family of possible energy functions is parameterized by a parameter vector W, which is to be learned. One can view this energy function as a measure of "compatibility" between the values of Y and X. Note that there is no requirement for normalization.

The inference process consists in clamping X to the observed configuration (e.g. an input image for image classification), and searching for the configuration of Y in a set {Y} that minimizes the energy. This optimal configuration is denoted Y̌:

    Y̌ = argmin_{Y∈{Y}} E(W, Y, X)

In many situations, such as classification, {Y} will be a discrete set, but in other situations {Y} may be a continuous set (e.g. a compact set in a vector space). This paper will not discuss how to perform this inference efficiently: the reader may use her favorite and most appropriate optimization method depending upon the form of E(W, Y, X), including exhaustive search, gradient-based methods, variational methods, (loopy) belief propagation, dynamic programming, etc.

Because of the absence of normalization, EBMs should only be used for discrimination or decision tasks, where only the relative energies of the various configurations of Y for a given X matter. However, if exp(−E(W, Y, X)) is integrable over Y for all X and W, we can turn an EBM into an equivalent probabilistic model by posing:

    P(Y|X, W) = exp(−βE(W, Y, X)) / ∫_{y∈{Y}} exp(−βE(W, y, X)) dy

where β is an arbitrary positive constant. The normalizing term (the denominator) is called the partition function. However, the EBM framework gives us more flexibility because it allows us to use energy functions whose exponential is not integrable over the domain of Y. Those models have no probabilistic equivalents.

Furthermore, we will see that training EBMs with certain loss functions circumvents the requirement for evaluating the partition function and its derivatives, which may be intractable. Solving this problem is a major issue with probabilistic models, if one judges by the considerable amount of recent publications on the subject.
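To make the two views concrete, here is a minimal numerical sketch (not from the paper; the linear energy, the function names, and the shapes are illustrative assumptions): inference over a discrete {Y} by exhaustive energy minimization, and the equivalent Gibbs distribution P(Y|X, W) obtained when the normalizing sum is finite.

    import numpy as np

    def infer(energy_fn, W, X, Y_set):
        """EBM inference: pick the configuration of Y with the lowest energy."""
        energies = np.array([energy_fn(W, Y, X) for Y in Y_set])
        return Y_set[int(np.argmin(energies))], energies

    def gibbs_posterior(energies, beta=1.0):
        """Equivalent probabilistic model: P(Y|X,W) proportional to exp(-beta * E)."""
        logits = -beta * energies
        logits = logits - logits.max()      # shift for numerical stability
        p = np.exp(logits)
        return p / p.sum()                  # the denominator plays the role of the partition function

    # toy example: one linear energy per discrete label (purely illustrative)
    rng = np.random.default_rng(0)
    W, X, Y_set = rng.normal(size=(3, 4)), rng.normal(size=4), [0, 1, 2]
    energy = lambda W, Y, X: -float(W[Y] @ X)
    y_best, E = infer(energy, W, X, Y_set)
    print(y_best, gibbs_posterior(E, beta=2.0))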
Probabilistic models are generally trained with the maximum likelihood criterion (or, equivalently, the negative log-likelihood loss). This criterion causes the model to approach the conditional density P(Y|X) over the entire domain of Y for each X. With the EBM framework, we allow ourselves to devise loss functions that merely cause the system to make the best decisions. These loss functions are designed to place min_{Y∈{Y}} E(W, Y, X) near the desired Y for each X. This is a considerably less complex and less constrained problem than that of estimating the "correct" conditional density over Y for each X. To convince ourselves of this, we can note that many different energy functions may have minima at the same Y for a given X, but only one of those (or a few) maximizes the likelihood.

For example, figure 1 shows two energy surfaces in X, Y space. They were obtained by training two neural nets (denoted G(W, X)) to approximate the function Y = X² − 1/2. In the left diagram, the energy E(W, Y, X) = (Y − G(W, X))² is quadratic in Y, therefore its exponential is integrable over Y. This model is equivalent to a probabilistic Gaussian model for P(Y|X). The right diagram uses a non-quadratic saturated energy E(W, Y, X) = tanh((Y − G(W, X))²), whose exponential is not integrable over Y. This model is not normalizable, and therefore has no probabilistic counterpart, yet it fulfills our desire to produce the best Y for any given X.

Figure 2: Examples of EBMs. (a) switch-based classifier; (b) a regressor; (c) constraint satisfaction architecture. Multidimensional variables are in red, scalars in green, and discrete variables in green dotted lines.

2.1 Previous work

Several authors have previously pointed out the shortcomings of normalized models for discriminative tasks. Bottou [4] first noted that discriminatively trained HMMs are unduly restricted in their expressive power because of the normalization requirements, a problem recently named "label bias" in [9]. To alleviate this problem, late normalization schemes for un-normalized discriminative HMMs were proposed in [6] and [10]. Recent works have revived the issue in the context of sequence labelling [5, 1, 14].

Some authors have touted the use of various non-probabilistic loss functions, such as the Perceptron loss or the maximum margin loss, for training decision-making systems [7, 10, 5, 1, 14]. However, the loss functions in these systems are intimately linked to the underlying architecture of the machine being trained. Some loss functions are incompatible with some architectures and may possess undesirable minima. The present paper gives conditions that "well-behaved" loss functions should satisfy.

Teh et al. [15] introduced the term "Energy-Based Model" in a context similar to ours, but they only considered the log-likelihood loss function. We use the term EBM in a slightly more general sense, which includes the possibility of using other loss functions. Bengio et al. [3] describe an energy-based language model, but they do not discuss the issue of loss functions.

2.2 Examples of EBMs

EBM for Classification: Traditional multi-class classifiers can be viewed as particular types of EBMs whose architecture is shown in figure 2(a). A parameterized discriminant function G(W, X) produces an output vector with one component for each of the k categories (G_0, G_1, ..., G_{k−1}). Component G_i is interpreted as the energy (or "penalty") for assigning X to the i-th category. A discrete switch module selects which of the components is connected to the output energy. The position of the switch is controlled by the discrete variable Y, which is interpreted as the category. The output energy is equal to

    E(W, Y, X) = Σ_{i=0}^{k−1} δ(Y − i) G(W, X)_i

where δ(Y − i) is equal to 1 for Y = i and 0 otherwise (Kronecker delta), and G(W, X)_i is the i-th component of G(W, X). Running the machine consists in finding the position of the switch (the value of Y) that minimizes the energy, i.e. the position of the switch that selects the smallest component of G(W, X).

EBM for Regression: A regression function G(W, X) trained with the squared error loss (e.g. a traditional neural network) is a trivial form of minimum energy machine (see figure 2(b)). The energy function of this machine is defined as E(W, Y, X) = ½ ||G(W, X) − Y||². The value of Y that minimizes E is simply equal to G(W, X). Therefore, running such a machine consists simply in computing G(W, X) and copying the value into Y. The energy is then zero. This architecture can be used for classification by simply making {Y} a discrete set (with one element for each category).
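The following sketch instantiates the two architectures just described, with a plain linear map standing in for the generic discriminant G(W, X); all names and dimensions are assumptions made for illustration, not the paper's implementation.

    import numpy as np

    def G(W, X):
        """Stand-in for the generic discriminant / regression function G(W, X)."""
        return W @ X

    def classifier_energy(W, Y, X):
        """Switch-based classifier (fig. 2a): E(W,Y,X) = sum_i delta(Y - i) * G(W,X)_i."""
        return G(W, X)[Y]

    def regression_energy(W, Y, X):
        """Regressor (fig. 2b): E(W,Y,X) = 0.5 * ||G(W,X) - Y||^2."""
        return 0.5 * float(np.sum((G(W, X) - Y) ** 2))

    rng = np.random.default_rng(0)
    W, X = rng.normal(size=(5, 8)), rng.normal(size=8)   # 5 categories (or output dims), 8 inputs

    # classification: the switch settles on the smallest component of G(W, X)
    y_hat = int(np.argmin(G(W, X)))
    assert classifier_energy(W, y_hat, X) == G(W, X).min()

    # regression: the minimizing Y is G(W, X) itself, and the energy there is zero
    assert np.isclose(regression_energy(W, G(W, X), X), 0.0)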
EBM for Constraint Satisfaction: Sometimes the dependency between X and Y cannot be expressed as a function that maps X's to Y's (consider, for example, the constraint X² + Y² = 1). In this case, one can resort to modeling "constraints" that X and Y must satisfy. The energy function measures the price for violating the constraints. An example architecture is shown in figure 2(c). The energy is E(W, Y, X) = C(G_x(W_x, X), G_y(W_y, Y)), where G_x and G_y are functions to be learned, and C(a, b) is a dissimilarity measure (e.g. a distance).

2.3 Deterministic Latent Variables

Many tasks are more conveniently modeled by architectures that use latent variables. Deterministic latent variables are extra variables (denoted by Z) that influence the energy, and that are not observed. During an inference, the energy is minimized over Y and Z:

    (Y̌, Ž) = argmin_{Y∈{Y}, Z∈{Z}} E(W, Y, Z, X)

By simply redefining our energy function as

    Ě(W, Y, X) = min_{Z∈{Z}} E(W, Y, Z, X)

we can essentially ignore the issue of latent variables.

Latent variables are very useful in situations where a hidden characteristic of the process being modeled can be inferred from observations, but cannot be predicted directly. This occurs for example in speech recognition, handwriting recognition, natural language processing, biological sequence analysis, and other sequence labeling tasks where a segmentation must be performed simultaneously with the recognition. Alternative segmentations are often represented as paths in a weighted lattice. Each path may be associated with a category [10, 5, 1, 14]. The path being followed in the lattice can be viewed as a discrete latent variable. Searching for the best path using dynamic programming (Viterbi) or approximate methods (e.g. beam search) can be seen as a minimization of the energy function with respect to this discrete variable. For example, in a speech recognition context, evaluating min_{Z∈{Z}} E(W, Y^i, Z, X^i) is akin to "constrained segmentation": finding the best path in the lattice that produces a particular output label Y^i.
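A minimal sketch of this construction for a small discrete latent set {Z}, where the redefined energy Ě(W, Y, X) = min_Z E(W, Y, Z, X) is obtained by exhaustive search; the toy energy and the shapes are assumptions for illustration only.

    import numpy as np

    def E_full(W, Y, Z, X):
        """Toy joint energy with a discrete latent variable Z (illustrative only)."""
        return 0.5 * float(np.sum((W[Y, Z] - X) ** 2))

    def E_check(W, Y, X, Z_set):
        """Redefined energy: E_check(W,Y,X) = min over Z of E(W,Y,Z,X), by exhaustive search."""
        energies = [E_full(W, Y, Z, X) for Z in Z_set]
        z_best = int(np.argmin(energies))
        return energies[z_best], Z_set[z_best]

    def infer(W, X, Y_set, Z_set):
        """(Y_check, Z_check) = argmin over Y in {Y} and Z in {Z} of E(W, Y, Z, X)."""
        (energy, z), y = min((E_check(W, Y, X, Z_set), Y) for Y in Y_set)
        return y, z, energy

    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 3, 8))       # 5 categories, 3 latent configurations, 8 input dims
    X = rng.normal(size=8)
    print(infer(W, X, Y_set=range(5), Z_set=range(3)))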
An EBM framework with which to perform graph manipulations and search, while preserving the ability to compute partial derivatives for learning, is the Graph Transformer Network model described in [10]. However, that paper only mentions two loss functions (the generalized perceptron and the log-likelihood), without a general discussion of how to construct appropriate loss functions.

Rather than give a detailed description of latent-variable EBMs for sequence processing, we will describe an application to visual object detection and recognition. The architecture is shown in figure 3. The input image is first turned into an appropriate representation (e.g. a feature vector) by a trainable front-end module (e.g. a convolutional network, as in [11]). This representation is then matched to models of each category. Each model outputs an energy that measures how well the representation matched the category (a low energy indicates a good match, a high energy a bad match). The switch selects the best-matching category. The object models take in latent variables that may be used to represent some instantiation parameters of the objects, such as the pose, illumination, or conformation. The optimal value of those parameters for a particular input is computed as part of the energy-minimizing inference process. Section 4 reports experimental results obtained with such a system.

Figure 3: Example of a switch-based Energy-Based Model architecture for object recognition in images, where the pose of the object is treated as a latent variable.

3 Loss Functions for EBM Training

In supervised learning, the training set S is a set of pairs S = {(X^i, Y^i), i = 1..P}, where X^i is an input and Y^i is a desired output to be predicted. Practically every learning method can be described as the process of finding the parameter W ∈ {W} that minimizes a judiciously chosen loss function L(W, S). The loss function should be a measure of the discrepancy between the machine's behavior and the desired behavior on the training set. Well-behaved loss functions for EBMs should shape the energy landscape so as to "dig holes" at (X, Y) locations near training samples, while "building hills" at undesired locations, particularly the ones that are erroneously picked by the inference algorithm. For example, a good loss function for the regression problem depicted in figure 1 should dig holes around the blue dots (which represent a subset of the training set), while ensuring that the surrounding areas have higher energy.

In the following, we characterize the general form of loss functions whose minimization will make the machine carve out the energy landscape in the right way so as to approach the desired behavior. We define the loss on the full training set as:

    L(W, S) = R( (1/P) Σ_{i=1}^{P} L(W, Y^i, X^i) )        (1)

where L(W, Y^i, X^i) is the per-sample loss function for sample (X^i, Y^i), and R is a monotonically increasing function. Loss functions that combine per-sample losses through other symmetric n-ary operations than addition (e.g. multiplication, as in the case of likelihood-like loss functions) can be trivially obtained from the above through judicious choices of R and L. With this definition, the loss is invariant under permutations of the samples and under multiple repetitions of the same training set. In the following we will set R to the identity function. We assume that L(W, Y^i, X^i) has a lower bound over W for all Y^i, X^i.
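As a sketch of equation (1) with R set to the identity (the energy of the desired label on a toy linear switch-based classifier stands in for the per-sample loss; the names and data are illustrative assumptions):

    import numpy as np

    def total_loss(per_sample_loss, W, S, R=lambda v: v):
        """Equation (1): L(W, S) = R( (1/P) * sum_i L(W, Y^i, X^i) ), with R monotonically increasing."""
        P = len(S)
        return R(sum(per_sample_loss(W, Y, X) for X, Y in S) / P)

    # illustrative per-sample loss: the energy of the desired label for a linear switch-based classifier
    energy_loss = lambda W, Y, X: float((W @ X)[Y])

    rng = np.random.default_rng(0)
    S = [(rng.normal(size=8), int(rng.integers(5))) for _ in range(20)]
    W = rng.normal(size=(5, 8))
    print(total_loss(energy_loss, W, S))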
At this point, we can note that if we minimize any such loss on a training set over a set of functions with finite VC-dimension, appropriate VC-type upper bounds for the expected loss will apply, ensuring the convergence of the empirical loss to the expected loss as the training set size increases. Therefore, we will only discuss the conditions under which a loss function will make the machine approach the desired behavior on the training set.

Sometimes the task uniquely defines a "natural" loss function (e.g. the number of mis-classified examples), but more often than not, minimizing that function is impractical. Therefore one must resort to surrogate loss functions whose choice is up to the designer of the system. One crucial, but often neglected, question must be answered before choosing a loss function: "will minimizing the loss cause the learning machine to approach the desired behavior?" We will give general conditions for that.

The inference process produces the Y that minimizes Ě(W, Y, X^i). Therefore, a well-designed loss function must drive the energy of the desired output Ě(W, Y^i, X^i) to be lower than the energies of all the other possible outputs. Minimizing the loss function should result in "holes" at X, Y locations near the training samples, and "hills" at undesired locations.

3.1 Conditions on the Energy

The condition for the correct classification of sample X^i is:

Condition 1: Ě(W, Y^i, X^i) < Ě(W, Y, X^i), ∀Y ∈ {Y}, Y ≠ Y^i.

To ensure that the correct answer is robustly stable, we may choose to impose that the energy of the desired output be lower than the energies of the undesired outputs by a margin m:

Condition 2: Ě(W, Y^i, X^i) < Ě(W, Y, X^i) − m, ∀Y ∈ {Y}, Y ≠ Y^i.

We will now consider the case where Y is a discrete variable. Let us denote by Ȳ the output that produces the smallest energy while being different from the desired output Y^i: Ȳ = argmin_{Y∈{Y}, Y≠Y^i} Ě(W, Y, X^i). Condition 2 can be rewritten as:

Condition 3: Ě(W, Y^i, X^i) < Ě(W, Ȳ, X^i) − m, with Ȳ = argmin_{Y∈{Y}, Y≠Y^i} Ě(W, Y, X^i).

For the continuous Y case, we can simply define Ȳ as the lowest-energy output outside of a ball of a given radius ε around the desired output: Ȳ = argmin_{Y∈{Y}, ||Y−Y^i||>ε} Ě(W, Y, X^i).
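These conditions are easy to check numerically for a discrete {Y}. The sketch below (illustrative, with hypothetical energy values) identifies the most offending output Ȳ and tests conditions 1 and 3 for a given margin m.

    import numpy as np

    def most_offending(energies, y_true):
        """Y_bar: the output with the smallest energy among the outputs different from Y^i."""
        e = energies.astype(float).copy()
        e[y_true] = np.inf
        return int(np.argmin(e))

    def condition_1(energies, y_true):
        """Correct classification: the desired output has strictly the lowest energy."""
        return energies[y_true] < np.delete(energies, y_true).min()

    def condition_3(energies, y_true, m):
        """Margin condition: E(Y^i) < E(Y_bar) - m."""
        return energies[y_true] < energies[most_offending(energies, y_true)] - m

    E = np.array([0.2, 1.5, 0.9, 2.0])    # hypothetical energies E(W, Y, X^i) for k = 4 outputs
    print(condition_1(E, 0), condition_3(E, 0, m=0.5), condition_3(E, 0, m=1.0))
    # -> True, True, False: a margin of 1.0 is not yet respected by these energies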
3.2 Sufficient conditions on the loss function

We will now make the key assumption that L depends on X^i only indirectly through the set of energies {Ě(W, Y, X^i), Y ∈ {Y}}. For example, if {Y} is the set of integers between 0 and k−1, as would be the case for the switch-based classifier with k categories shown in figure 2(a), the per-sample loss for sample (X^i, Y^i) should be of the form:

    L(W, Y^i, X^i) = L(Y^i, Ě(W, 0, X^i), ..., Ě(W, k−1, X^i))        (2)

With this assumption, we separate the choice of the loss function from the details of the internal structure of the machine, and limit the discussion to how minimizing the loss affects the energies.

We must now characterize the form that L can take such that its minimization will eventually drive the machine to satisfy condition 3. We must design L in such a way that minimizing it will decrease the difference Ě(W, Y^i, X^i) − Ě(W, Ȳ, X^i) whenever Ě(W, Ȳ, X^i) < Ě(W, Y^i, X^i) + m. In other words, whenever the difference between the energy of the incorrect answer with the lowest energy and the energy of the desired answer is less than the margin, our learning procedure should make that difference larger. We will now propose a set of sufficient conditions on the loss that guarantee this.

Since we are only concerned with how the loss influences the relative values of Ě(W, Y^i, X^i) and Ě(W, Ȳ, X^i), we will consider the shape of the loss surface in the space of Ě(W, Y^i, X^i) and Ě(W, Ȳ, X^i), and view the other arguments of the loss (the energies for all the other values of Y) as parameters of that surface:

    L(W, Y^i, X^i) = Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i))

where the parameter [E_y] contains the vector of energies for all values of Y except Y^i and Ȳ.

We can now state sufficient conditions that guarantee that minimizing L will eventually satisfy condition 3. In all the sufficient conditions stated below, we assume that there exists a W such that condition 3 is satisfied for a single training example (X^i, Y^i), and that Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)) is convex (convex in its two arguments, but not necessarily convex in W). The conditions must hold for all values of [E_y].

Condition 4: The minima of Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)) are in the half-plane Ě(W, Y^i, X^i) + m < Ě(W, Ȳ, X^i).

This condition on the loss function clearly ensures that minimizing it will drive the machine to find a solution that satisfies condition 3, if such a solution exists.

Another sufficient condition can be stated to characterize loss functions that do not have minima, or whose minimum is at infinity:

Condition 5: The negative gradient of Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)) on the margin line Ě(W, Ȳ, X^i) = Ě(W, Y^i, X^i) + m has a positive dot product with the direction [−1, 1].

This condition guarantees that minimizing L will drive the energies Ě(W, Ȳ, X^i) and Ě(W, Y^i, X^i) toward the half-plane Ě(W, Y^i, X^i) + m < Ě(W, Ȳ, X^i).

Yet another sufficient condition can be stated to characterize loss functions whose minima are not in the desired half-plane, but where the possible values of Ě(W, Ȳ, X^i) and Ě(W, Y^i, X^i) are constrained by their dependency on W in such a way that the minimum of the loss while satisfying the constraint is in the desired half-plane:

Condition 6: On the margin line Ě(W, Ȳ, X^i) = Ě(W, Y^i, X^i) + m, the following must hold:

    [ ∂Ě(W, Y^i, X^i)/∂W − ∂Ě(W, Ȳ, X^i)/∂W ] · ∂L(W, Y^i, X^i)/∂W > 0

This condition ensures that an update of the parameters W to minimize the loss will drive the energies Ě(W, Ȳ, X^i) and Ě(W, Y^i, X^i) toward the desired half-plane Ě(W, Y^i, X^i) + m < Ě(W, Ȳ, X^i).

We must emphasize that these are only sufficient conditions. There may be legitimate loss functions (e.g. non-convex functions) that do not satisfy them, yet have the proper behavior.
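A quick numerical illustration of condition 5 (not from the paper): for a margin-type loss surface Q(e₁, e₂) = max(0, m + e₁ − e₂), the negative gradient evaluated just inside the violating side of the margin line (the hinge is not differentiable exactly on it) points along the direction [−1, 1].

    import numpy as np

    m = 1.0

    def Q(e_desired, e_offending):
        """Margin-type loss surface: Q(e1, e2) = max(0, m + e1 - e2) (hinge / LVQ2-like)."""
        return max(0.0, m + e_desired - e_offending)

    def numerical_gradient(f, e1, e2, eps=1e-5):
        return np.array([(f(e1 + eps, e2) - f(e1 - eps, e2)) / (2 * eps),
                         (f(e1, e2 + eps) - f(e1, e2 - eps)) / (2 * eps)])

    e1 = 2.0                  # energy of the desired output
    e2 = e1 + m - 1e-3        # most offending energy, just inside the violating region
    g = numerical_gradient(Q, e1, e2)
    print(g, float(np.dot(-g, [-1.0, 1.0])))   # the negative gradient has a positive dot product with [-1, 1]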
3.3 Examples of Loss Functions

We can now give examples of loss functions that satisfy the above criteria, and examine whether some of the popular loss functions proposed in the literature satisfy them.

Energy Loss: The simplest and most widely used loss is the energy loss, which is simply of the form L_energy(W, Y^i, X^i) = Ě(W, Y^i, X^i). This loss does not satisfy condition 4 or 5 in general, but there are certain forms of Ě(W, Y^i, X^i) for which condition 6 is satisfied. For example, let us consider an energy of the form

    E(W, Y, X) = Σ_{k=1}^{K} δ(Y − k) ||U^k − G(W, X)||²

Function G(W, X) could be a neural net, on top of which are placed K radial basis functions whose centers are the vectors U^k. If the U^k are fixed and all different, then the energy loss applied to this machine fulfills condition 6. Intuitively, that is because by pulling G(W, X) towards one of the RBF centers, we push it away from the others. Therefore, when the energy of the desired output decreases, the other ones increase. However, if we allow the RBF centers to be learned, condition 6 is no longer fulfilled. In that case, the loss has spurious minima where all the RBF centers are equal, and the function G(W, X) is constant and equal to that RBF center. The loss is zero, but the machine does not produce the desired result. Picking any combination of loss and energy that satisfies any of the conditions 4, 5, or 6 solves this collapse problem.

Generalized Perceptron Loss: We define the generalized Perceptron loss for training sample (X^i, Y^i) as:

    L_ptron(W, Y^i, X^i) = E(W, Y^i, X^i) − min_{Y∈{Y}} E(W, Y, X^i)        (3)

It is easy to see that with E(W, Y^i, X^i) = −Y^i · W^T X^i and {Y} = {−1, 1}, the above loss reduces to the traditional linear Perceptron loss. The generalized Perceptron loss satisfies condition 4 with m = 0. This loss was used by [10] for training a commercially deployed handwriting recognizer that combined a heuristic segmenter, a convolutional net, and a language model (where the latent variables represented paths in an interpretation lattice). A similar loss was studied by [5] for training a text parser. Because the margin is zero, this loss may not prevent collapses for certain architectures.

Generalized Margin Loss: A more robust version of the Perceptron loss is the margin loss, which directly uses the energy of the most offending non-desired output Ȳ in the contrastive term:

    L_margin(W, Y^i, X^i) = Q_m[Ě(W, Y^i, X^i) − Ě(W, Ȳ, X^i)]        (4)

where Q_m(e) is any function that is monotonically increasing for e > −m. The traditional "hinge loss" used with kernel-based methods, and the loss used by the LVQ2 algorithm, are special cases with Q_m(e) = e + m for e > −m, and 0 otherwise. Special forms of that loss were used in [7] for discriminative speech recognition, and in [1] and [14] for text labeling. The exponential loss used in AdaBoost is a special form of equation (4) with Q_m(e) = exp(e). A slightly more general form of the margin loss that satisfies conditions 4 or 5 is given by:

    L_gmargin(W, Y^i, X^i) = Q_gm[Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)]        (5)

with the condition that ∂Q_gm(e₁, e₂)/∂e₁ > ∂Q_gm(e₁, e₂)/∂e₂ when e₁ + m > e₂. An example of such a loss is:

    L(W, Y^i, X^i) = Q⁺(Ě(W, Y^i, X^i)) + Q⁻(Ě(W, Ȳ, X^i))        (6)

where Q⁺(e) is a convex monotonically increasing function, and Q⁻(e) a convex monotonically decreasing function, such that if dQ⁺/de|_{e₁} = 0 and dQ⁻/de|_{e₂} = 0 then e₁ + m < e₂. When Ě(W, Y^i, X^i) is akin to a distance (bounded below by 0), a judicious choice for Q⁺ and Q⁻ is:

    L(W, Y^i, X^i) = Ě(W, Y^i, X^i)² + κ exp(−β Ě(W, Ȳ, X^i))        (7)

where κ and β are positive constants. A similar loss function was recently used by our group to train a pose-invariant face detection system [12]. This system can simultaneously detect faces and estimate their pose, using latent variables to represent the head pose.
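The discrete-{Y} case of the losses above can be written in a few lines; the sketch below (illustrative; the energy vector is hypothetical) implements the generalized Perceptron loss (3), the hinge form of the margin loss (4), and the square-exponential loss (7).

    import numpy as np

    def most_offending(energies, y_true):
        e = energies.astype(float).copy()
        e[y_true] = np.inf
        return int(np.argmin(e))

    def perceptron_loss(energies, y_true):
        """Eq. (3): E(W, Y^i, X^i) - min_Y E(W, Y, X^i)."""
        return energies[y_true] - energies.min()

    def hinge_loss(energies, y_true, m=1.0):
        """Eq. (4) with Q_m(e) = e + m for e > -m and 0 otherwise."""
        e = energies[y_true] - energies[most_offending(energies, y_true)]
        return e + m if e > -m else 0.0

    def square_exponential_loss(energies, y_true, kappa=1.0, beta=1.0):
        """Eq. (7), for energies bounded below by 0: E(Y^i)^2 + kappa * exp(-beta * E(Y_bar))."""
        e_bar = energies[most_offending(energies, y_true)]
        return energies[y_true] ** 2 + kappa * np.exp(-beta * e_bar)

    E = np.array([0.3, 1.2, 0.8, 2.5])    # hypothetical energies for k = 4 categories
    for loss in (perceptron_loss, hinge_loss, square_exponential_loss):
        print(loss.__name__, float(loss(E, y_true=0)))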

Contrastive Free Energy Loss: While the loss functions proposed thus far involve only Ě(W, Ȳ, X^i) in their contrastive part, loss functions can be devised to combine all the energies for all values of Y in their contrastive term:

    L(W, Y^i, X^i) = Ě(W, Y^i, X^i) − F(Ě(W, 0, X^i), ..., Ě(W, k−1, X^i))        (8)

F can be interpreted as a generalized free energy of the ensemble of systems with energies Ě(W, Y, X^i), ∀Y ∈ {Y}. It appears difficult to characterize the general form of F that ensures that L satisfies condition 5. An interesting special case of this loss is the familiar negative log-likelihood loss:

    L_nll(W, Y^i, X^i) = Ě(W, Y^i, X^i) − F_β(W, X^i)        (9)

with

    F_β(W, X^i) = −(1/β) log( Σ_{Y∈{Y}} exp[−β Ě(W, Y, X^i)] )        (10)

where β is a positive constant. The second term can be interpreted as the Helmholtz free energy (log partition function) of the ensemble of systems with energies Ě(W, Y, X^i), ∀Y ∈ {Y}. This type of discriminative loss with β = 1 is widely used for discriminative probabilistic models in the speech, handwriting, and NLP communities [10, 2, 8]. This is also the loss function used in the conditional random field model of Lafferty et al. [9].

We can see that loss (9) reduces to the generalized Perceptron loss when β → ∞. Computing this loss and its derivative requires computing integrals (or sums) over {Y} that may be intractable. It also assumes that the exponential of the energy be integrable over {Y}, which puts restrictions on the choice of E(W, Y, X) and/or {Y}.
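For a discrete {Y}, the free energy (10) is a log-sum-exp and the loss (9) follows directly; the sketch below (illustrative values; the stable log-sum-exp formulation is my own choice) also shows numerically that the loss approaches the generalized Perceptron loss (3) as β grows.

    import numpy as np

    def free_energy(energies, beta=1.0):
        """Eq. (10): F_beta = -(1/beta) * log sum_Y exp(-beta * E), via a stable log-sum-exp."""
        z = -beta * np.asarray(energies, dtype=float)
        zmax = z.max()
        return -(zmax + np.log(np.exp(z - zmax).sum())) / beta

    def nll_loss(energies, y_true, beta=1.0):
        """Eq. (9): L_nll = E(W, Y^i, X^i) - F_beta(W, X^i)."""
        return energies[y_true] - free_energy(energies, beta)

    E = np.array([0.3, 1.2, 0.8, 2.5])    # hypothetical energies for a discrete {Y}
    print(nll_loss(E, y_true=0, beta=1.0))
    # as beta grows, F_beta tends to min_Y E and the loss tends to the perceptron loss (3):
    print(nll_loss(E, y_true=0, beta=100.0), E[0] - E.min())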
4 Illustrative Experiments

To illustrate the use of contrastive loss functions with non-probabilistic latent variables, we trained a system to recognize generic objects in images independently of the pose and the illumination. We used the NORB dataset [11], which contains 50 different uniform-colored toy objects under 18 azimuths, 9 elevations, and 6 lighting conditions. The objects are 10 instances from each of 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Five instances of each category were used for training, and the other five for testing (see figure 4). A 6-layer convolutional network trained with the mean-square loss achieves 6.8% test error on this set when fed with binocular 96×96-pixel gray-scale images [11].

Figure 4: Invariant object recognition with the NORB dataset. The left portion shows sample views of the training instances, and the right portion testing instances for the 5 categories.

We used an architecture very much like the one in figure 3, where the feature extraction module is identical to the first 5 layers of the 6-layer convolutional net used in the reference experiment. The object model functions were of the form E_i = ||W_i V − F(Z)||, i = 1..5, where V is the output of the 5-layer net (100 dimensions) and W_i is a 9 × 100 (trainable) weight matrix. The latent variable Z has two dimensions that are meant to represent the azimuth and elevation of the object viewpoint. The set of possible values {Z} contained 162 values (azimuths: 0–360 degrees every 20 degrees; elevations: 30–70 degrees every 5 degrees). The output of F(Z) is a point on an azimuth/elevation half-sphere (a 2D surface) embedded in the 9D hypercube [−1, +1]^9. The minimization of the energy over Z is performed through exhaustive search (which is relatively cheap).

We used the loss function (7). This loss causes the convolutional net to produce a point as close as possible to some point on the half-sphere of the desired class, and as far as possible from the half-sphere of the best-scoring non-desired class. The system is trained "from scratch", including the convolutional net. We obtained 6.3% error on the test set, which is a moderate but significant improvement over the 6.8% of the control experiment.
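The sketch below mimics the structure of these object models under stated assumptions: the pose grid, the 9-dimensional embedding F(Z), and all weights are hypothetical stand-ins (the paper's learned embedding and trained parameters are not reproduced), but the exhaustive minimization over the 162 poses and the argmin over class energies follow the description above.

    import numpy as np

    # pose grid as described above: 18 azimuths x 9 elevations = 162 latent configurations
    azimuths = np.deg2rad(np.arange(0, 360, 20))
    elevations = np.deg2rad(np.arange(30, 75, 5))
    Z_grid = [(a, e) for a in azimuths for e in elevations]

    def F(z):
        """Hypothetical stand-in for F(Z): maps a pose to a point of a 2D surface inside [-1, 1]^9
        (the embedding actually used in the paper is not reproduced here)."""
        a, e = z
        return np.array([np.cos(a), np.sin(a), np.cos(e), np.sin(e),
                         np.cos(a) * np.sin(e), np.sin(a) * np.sin(e),
                         np.cos(a) * np.cos(e), np.sin(a) * np.cos(e), np.cos(2 * a)])

    def class_energy(W_i, V):
        """E_i = min_Z ||W_i V - F(Z)||, minimized by exhaustive search over the pose grid."""
        proj = W_i @ V
        dists = [np.linalg.norm(proj - F(z)) for z in Z_grid]
        k = int(np.argmin(dists))
        return dists[k], Z_grid[k]

    rng = np.random.default_rng(0)
    V = rng.normal(size=100)                                   # 100-dim output of the front-end net
    W = [0.1 * rng.normal(size=(9, 100)) for _ in range(5)]    # one 9 x 100 matrix per category
    energies = [class_energy(Wi, V)[0] for Wi in W]
    print("predicted category:", int(np.argmin(energies)))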
5 Discussion

Efficient End-to-End Gradient-Based Learning: To perform gradient-based training of all the modules in the architecture, we must compute the gradient of the loss with respect to all the parameters. This is easily achieved with the module-based generalization of back-propagation described in [10]. A typical learning iteration would involve the following steps: (1) one forward propagation through the modules that only depend on X; (2) a run of the energy-minimizing inference algorithm on the modules that depend on Z and Y; (3) as many back-propagations through the modules that depend on Y as there are energy terms in the loss function; (4) one back-propagation through the module that depends only on X; (5) one update of the parameters.

Efficient Inference: Most loss functions described in this paper involve multiple runs of the machine (in the worst case, one run for each energy term that enters in the loss). However, the parts of the machine that solely depend on X, and not on Y or Z, need not be recomputed for each run (e.g. the feature extractor in figure 3), because X does not change between runs.

If the energy function can be decomposed as a sum of functions (called factors) E(W, Y, X) = Σ_j E_j(W_j, Y, Z, X), each of which takes subsets of the variables in Z and Y as input, we can use a form of belief propagation algorithm for factor graphs to compute the lowest-energy configuration [13]. These algorithms are exact and tractable if Z and Y are discrete and the factor graph has no loop. They reduce to Viterbi-type algorithms when members of {Z} can be represented by paths in a lattice (as is the case for sequence segmentation/labeling tasks).

Approximate Inference: We do not really need to assume that the energy-minimizing inference process always finds the global minimum of E(W, Y, X^i) with respect to Y. We merely need to assume that this process finds approximate solutions (e.g. local minima) in a consistent, repeatable manner. Indeed, if there are minima of E(W, Y, X^i) with respect to Y that our minimization/inference algorithm never finds, we do not need to find them and increase their energy: their existence is irrelevant to our problem. However, it is important to note that those unreachable regions of low energy will affect the loss function of probabilistic models, as well as that of EBMs, if the negative log-likelihood loss is used. This is a distinct advantage of the un-normalized EBM approach: low-energy areas that are never reached by the inference algorithm are not a concern.
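As a sketch of the learning iteration described at the beginning of this section, in the simplest possible setting: a switch-based classifier with a hinge margin loss, no latent variable Z, and hand-written gradients instead of the module-based back-propagation of [10]; the data and names are illustrative assumptions.

    import numpy as np

    def energies(W, X):                    # step (1): forward pass of the part that depends only on X
        return W @ X

    def train_step(W, X, y_true, lr=0.1, m=1.0):
        E = energies(W, X)
        e = E.copy()
        e[y_true] = np.inf
        y_bar = int(np.argmin(e))          # step (2): energy-minimizing inference over the other outputs
        if E[y_true] - E[y_bar] > -m:      # hinge loss (4) is active: the margin is violated
            W[y_true] -= lr * X            # steps (3)-(5): push down the desired energy,
            W[y_bar] += lr * X             # pull up the most offending one, and update the parameters
        return W

    rng = np.random.default_rng(0)
    means = 3.0 * rng.normal(size=(3, 5))
    data = [(means[y] + 0.1 * rng.normal(size=5), y) for y in range(3) for _ in range(20)]
    W = np.zeros((3, 5))
    for epoch in range(10):
        for X, y in data:
            W = train_step(W, X, y)
    errors = sum(int(np.argmin(energies(W, X)) != y) for X, y in data)
    print("training errors:", errors)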

6 Conclusion and Outlook

Most approaches to discriminative training of graphical models in the literature use loss functions from a very small set. We show that energy-based (un-normalized) graphical models can be trained discriminatively using a very wide family of loss functions. We give sufficient conditions that the loss function must satisfy so that its minimization will make the system approach the desired behavior. We give a number of loss functions that satisfy this criterion and describe experiments in image recognition that illustrate the use of such discriminative loss functions in the presence of non-probabilistic latent variables.

Acknowledgments

The authors wish to thank Léon Bottou, Yoshua Bengio, Margarita Osadchy, and Matt Miller for useful discussions.

References

[1] Yasemin Altun, Mark Johnson, and Thomas Hofmann. Loss functions and optimization methods for discriminative learning of label sequences. In Proc. EMNLP, 2003.

[2] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe. Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2):252–259, 1992.

[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, February 2003.

[4] L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 91405 Orsay Cedex, France, 1991.

[5] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP, 2002.

[6] J. S. Denker and C. J. Burges. Image segmentation and recognition. In The Mathematics of Induction. Addison Wesley, 1995.

[7] X. Driancourt and L. Bottou. MLP, LVQ and DP: Comparison & cooperation. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 815–819, Seattle, 1991.

[8] P. Haffner. Connectionist speech recognition with a global MMI algorithm. In Eurospeech'93, Berlin, September 1993.

[9] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. International Conference on Machine Learning (ICML), 2001.

[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[11] Yann LeCun, Fu-Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. CVPR, 2004.

[12] M. Osadchy, M. Miller, and Y. LeCun. Synergistic face detection and pose estimation. In Proc. NIPS, 2004.

[13] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.

[14] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Proc. NIPS, 2003.

[15] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.