Loss Functions for Discriminative Training of Energy-Based Models.

Yann LeCun and Fu Jie Huang, The Courant Institute, {yann,jhuangfu}@cs.nyu.edu, http://yann.lecun.com

Abstract

Probabilistic graphical models associate a probability to each configuration of the relevant variables. Energy-based models (EBM) associate an energy to those configurations, eliminating the need for proper normalization of probability distributions. Making a decision (an inference) with an EBM consists in comparing the energies associated with various configurations of the variable to be predicted, and choosing the one with the smallest energy. Such systems must be trained discriminatively to associate low energies to the desired configurations and higher energies to undesired configurations. A wide variety of loss functions can be used for this purpose. We give sufficient conditions that a loss function should satisfy so that its minimization will cause the system to approach the desired behavior. We give many specific examples of suitable loss functions, and show an application to object recognition in images.

1 Introduction

Graphical models are overwhelmingly treated as probabilistic generative models in which inference and learning are viewed as probabilistic estimation problems. One advantage of the probabilistic approach is compositionality: one can build and train component models separately before assembling them into a complete system. For example, a Bayesian classifier can be built by assembling separately-trained generative models for each class. But if a model is trained discriminatively from end to end to make decisions, mapping raw input to ultimate outputs, there is no need for compositionality. Some applications require hard decisions rather than estimates of conditional output distributions. One example is mobile robot navigation: once trained, the robot must turn left or right when facing an obstacle. Computing a distribution over steering angles would be of little use in that context. The machine should be trained from end to end to approach the best possible decision in the largest range of situations.

Another implicit advantage of the probabilistic approach is that it provides well-justified loss functions for learning, e.g. maximum likelihood for generative models, and maximum conditional likelihood for discriminative models. Because of the normalization, maximizing the likelihood of the training samples will automatically decrease the likelihood of other points, thereby driving the machine toward the desired behavior. The downside is that the negative log-likelihood is the only well-justified loss function. Yet, approximating a distribution over the entire space by maximizing likelihood may be overkill when the ultimate goal is merely to produce the right decision.

We will argue that using proper probabilistic models, because they must be normalized, considerably restricts our choice of model architecture. Some desirable architectures may be difficult to normalize (the normalization may involve the computation of intractable partition functions), or may even be non-normalizable (their partition function may be an integral that does not converge).

This paper concerns a more general class of models called Energy-Based Models (EBM). EBMs associate an (un-normalized) energy to each configuration of the variables to be modeled. Making an inference with an EBM consists in searching for a configuration of the variables to be predicted that minimizes the energy, or comparing the energies of a small number of configurations of those variables. EBMs have considerable advantages over traditional probabilistic models: (1) there is no need to compute partition functions that may be intractable; (2) because there is no requirement for normalizability, the repertoire of possible model architectures that can be used is considerably richer.

Training an EBM consists in finding values of the trainable parameters that associate low energies to "desired" configurations of variables (e.g. those observed on a training set), and high energies to "undesired" configurations.
With properly normalized probabilistic models, increasing the likelihood of a "desired" configuration of variables will automatically decrease the likelihoods of other configurations. With EBMs, this is not the case: making the energy of desired configurations low may not necessarily make the energies of other configurations high. Therefore, one must be very careful when designing loss functions for EBMs¹.

¹ It is important to note that the energy is the quantity minimized during inference, while the loss is the quantity minimized during learning.

We must make sure that the loss function we pick will effectively drive our machine to approach the desired behavior. In particular, we must ensure that the loss function has no trivial solution (e.g. one where the best way to minimize the loss is to make the energy constant for all input/output pairs). A particular manifestation of this is the so-called collapse problem that was pointed out in some early works that attempted to combine neural nets and HMMs [7, 2, 8].

This energy-based, end-to-end approach to learning has been applied with great success to sentence-level handwriting recognition in the past [10]. But there has not been a general characterization of "good" energy functions and loss functions. The main point of this paper is to give sufficient conditions that a discriminative loss function must satisfy, so that its minimization will carve out the energy landscape in input/output space in the right way and cause the machine to approach the desired behavior. We then propose a wide family of loss functions that satisfy these conditions, independently of the architecture of the machine being trained.

Figure 1: Two energy surfaces in X, Y space obtained by training two neural nets to compute the function Y = X² − 1/2. The blue dots represent a subset of the training samples. In the left diagram, the energy is quadratic in Y, therefore its exponential is integrable over Y. This model is equivalent to a probabilistic Gaussian model of P(Y|X). The right diagram uses a non-quadratic saturated energy whose exponential is not integrable over Y. This model is not normalizable, and therefore has no probabilistic counterpart.

2 Energy-Based Models

Let us define our task as one of predicting the best configuration of a set of variables denoted collectively by Y, given a set of observed (input) variables collectively denoted by X. Given an observed configuration for X, a probabilistic model (e.g. a graphical model) will associate a (normalized) probability P(Y|X) to each possible configuration of Y. When a decision must be made, the configuration of Y that maximizes P(Y|X) will be picked.

An Energy-Based Model (EBM) associates a scalar energy E(W, Y, X) to each configuration of X, Y. The family of possible energy functions is parameterized by a parameter vector W, which is to be learned. One can view this energy function as a measure of "compatibility" between the values of Y and X. Note that there is no requirement for normalization.

The inference process consists in clamping X to the observed configuration (e.g. an input image for image classification), and searching for the configuration of Y in a set {Y} that minimizes the energy. This optimal configuration is denoted Y̌:

    Y̌ = argmin_{Y∈{Y}} E(W, Y, X)

In many situations, such as classification, {Y} will be a discrete set, but in other situations {Y} may be a continuous set (e.g. a compact set in a vector space). This paper will not discuss how to perform this inference efficiently: the reader may use her favorite and most appropriate optimization method depending upon the form of E(W, Y, X), including exhaustive search, gradient-based methods, variational methods, (loopy) belief propagation, dynamic programming, etc.

Because of the absence of normalization, EBMs should only be used for discrimination or decision tasks, where only the relative energies of the various configurations of Y for a given X matter. However, if exp(−E(W, Y, X)) is integrable over Y for all X and W, we can turn an EBM into an equivalent probabilistic model by posing:

    P(Y|X, W) = exp(−βE(W, Y, X)) / ∫_{y∈{Y}} exp(−βE(W, y, X)) dy

where β is an arbitrary positive constant. The normalizing term (the denominator) is called the partition function. However, the EBM framework gives us more flexibility because it allows us to use energy functions whose exponential is not integrable over the domain of Y. Those models have no probabilistic equivalents.

Furthermore, we will see that training EBMs with certain loss functions circumvents the requirement for evaluating the partition function and its derivatives, which may be intractable. Solving this problem is a major issue with probabilistic models, if one judges by the considerable amount of recent publications on the subject.
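To make the two views concrete, here is a minimal numerical sketch (not from the paper; the linear energy, the function names, and the shapes are illustrative assumptions): inference over a discrete {Y} by exhaustive energy minimization, and the equivalent Gibbs distribution P(Y|X, W) obtained when the normalizing sum is finite.

    import numpy as np

    def infer(energy_fn, W, X, Y_set):
        """EBM inference: pick the configuration of Y with the lowest energy."""
        energies = np.array([energy_fn(W, Y, X) for Y in Y_set])
        return Y_set[int(np.argmin(energies))], energies

    def gibbs_posterior(energies, beta=1.0):
        """Equivalent probabilistic model: P(Y|X,W) proportional to exp(-beta * E)."""
        logits = -beta * energies
        logits = logits - logits.max()      # shift for numerical stability
        p = np.exp(logits)
        return p / p.sum()                  # the denominator plays the role of the partition function

    # toy example: one linear energy per discrete label (purely illustrative)
    rng = np.random.default_rng(0)
    W, X, Y_set = rng.normal(size=(3, 4)), rng.normal(size=4), [0, 1, 2]
    energy = lambda W, Y, X: -float(W[Y] @ X)
    y_best, E = infer(energy, W, X, Y_set)
    print(y_best, gibbs_posterior(E, beta=2.0))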
Probabilistic models are generally trained with the maximum likelihood criterion (or, equivalently, the negative log-likelihood loss). This criterion causes the model to approach the conditional density P(Y|X) over the entire domain of Y for each X. With the EBM framework, we allow ourselves to devise loss functions that merely cause the system to make the best decisions. These loss functions are designed to place min_{Y∈{Y}} E(W, Y, X) near the desired Y for each X. This is a considerably less complex and less constrained problem than that of estimating the "correct" conditional density over Y for each X. To convince ourselves of this, we can note that many different energy functions may have minima at the same Y for a given X, but only one of those (or a few) maximizes the likelihood.

For example, figure 1 shows two energy surfaces in X, Y space. They were obtained by training two neural nets (denoted G(W, X)) to approximate the function Y = X² − 1/2. In the left diagram, the energy E(W, Y, X) = (Y − G(W, X))² is quadratic in Y, therefore its exponential is integrable over Y. This model is equivalent to a probabilistic Gaussian model for P(Y|X). The right diagram uses a non-quadratic saturated energy E(W, Y, X) = tanh((Y − G(W, X))²), whose exponential is not integrable over Y. This model is not normalizable, and therefore has no probabilistic counterpart, yet it fulfills our desire to produce the best Y for any given X.

Figure 2: Examples of EBMs. (a) switch-based classifier; (b) a regressor; (c) constraint satisfaction architecture. Multidimensional variables are in red, scalars in green, and discrete variables in green dotted lines.

2.1 Previous work

Several authors have previously pointed out the shortcomings of normalized models for discriminative tasks. Bottou [4] first noted that discriminatively trained HMMs are unduly restricted in their expressive power because of the normalization requirements, a problem recently named "label bias" in [9]. To alleviate this problem, late normalization schemes for un-normalized discriminative HMMs were proposed in [6] and [10]. Recent works have revived the issue in the context of sequence labelling [5, 1, 14].

Some authors have touted the use of various non-probabilistic loss functions, such as the Perceptron loss or the maximum margin loss, for training decision-making systems [7, 10, 5, 1, 14]. However, the loss functions in these systems are intimately linked to the underlying architecture of the machine being trained. Some loss functions are incompatible with some architectures and may possess undesirable minima. The present paper gives conditions that "well-behaved" loss functions should satisfy.

Teh et al. [15] introduced the term "Energy-Based Model" in a context similar to ours, but they only considered the log-likelihood loss function. We use the term EBM in a slightly more general sense, which includes the possibility of using other loss functions. Bengio et al. [3] describe an energy-based language model, but they do not discuss the issue of loss functions.

2.2 Examples of EBMs

EBM for Classification: Traditional multi-class classifiers can be viewed as particular types of EBMs whose architecture is shown in figure 2(a). A parameterized discriminant function G(W, X) produces an output vector with one component for each of the k categories (G_0, G_1, ..., G_{k−1}). Component G_i is interpreted as the energy (or "penalty") for assigning X to the i-th category. A discrete switch module selects which of the components is connected to the output energy. The position of the switch is controlled by the discrete variable Y, which is interpreted as the category. The output energy is equal to

    E(W, Y, X) = Σ_{i=0}^{k−1} δ(Y − i) G(W, X)_i

where δ(Y − i) is equal to 1 for Y = i and 0 otherwise (Kronecker delta), and G(W, X)_i is the i-th component of G(W, X). Running the machine consists in finding the position of the switch (the value of Y) that minimizes the energy, i.e. the position of the switch that selects the smallest component of G(W, X).

EBM for Regression: A regression function G(W, X) trained with the squared error loss (e.g. a traditional neural network) is a trivial form of minimum energy machine (see figure 2(b)). The energy function of this machine is defined as E(W, Y, X) = ½ ||G(W, X) − Y||². The value of Y that minimizes E is simply equal to G(W, X). Therefore, running such a machine consists simply in computing G(W, X) and copying the value into Y. The energy is then zero. This architecture can be used for classification by simply making {Y} a discrete set (with one element for each category).
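The following sketch instantiates the two architectures just described, with a plain linear map standing in for the generic discriminant G(W, X); all names and dimensions are assumptions made for illustration, not the paper's implementation.

    import numpy as np

    def G(W, X):
        """Stand-in for the generic discriminant / regression function G(W, X)."""
        return W @ X

    def classifier_energy(W, Y, X):
        """Switch-based classifier (fig. 2a): E(W,Y,X) = sum_i delta(Y - i) * G(W,X)_i."""
        return G(W, X)[Y]

    def regression_energy(W, Y, X):
        """Regressor (fig. 2b): E(W,Y,X) = 0.5 * ||G(W,X) - Y||^2."""
        return 0.5 * float(np.sum((G(W, X) - Y) ** 2))

    rng = np.random.default_rng(0)
    W, X = rng.normal(size=(5, 8)), rng.normal(size=8)   # 5 categories (or output dims), 8 inputs

    # classification: the switch settles on the smallest component of G(W, X)
    y_hat = int(np.argmin(G(W, X)))
    assert classifier_energy(W, y_hat, X) == G(W, X).min()

    # regression: the minimizing Y is G(W, X) itself, and the energy there is zero
    assert np.isclose(regression_energy(W, G(W, X), X), 0.0)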
EBM for Constraint Satisfaction: Sometimes the dependency between X and Y cannot be expressed as a function that maps X's to Y's (consider, for example, the constraint X² + Y² = 1). In this case, one can resort to modeling "constraints" that X and Y must satisfy. The energy function measures the price for violating the constraints. An example architecture is shown in figure 2(c). The energy is E(W, Y, X) = C(G_x(W_x, X), G_y(W_y, Y)), where G_x and G_y are functions to be learned, and C(a, b) is a dissimilarity measure (e.g. a distance).

2.3 Deterministic Latent Variables

Many tasks are more conveniently modeled by architectures that use latent variables. Deterministic latent variables are extra variables (denoted by Z) that influence the energy, and that are not observed. During an inference, the energy is minimized over Y and Z:

    (Y̌, Ž) = argmin_{Y∈{Y}, Z∈{Z}} E(W, Y, Z, X)

By simply redefining our energy function as

    Ě(W, Y, X) = min_{Z∈{Z}} E(W, Y, Z, X)

we can essentially ignore the issue of latent variables.

Latent variables are very useful in situations where a hidden characteristic of the process being modeled can be inferred from observations, but cannot be predicted directly. This occurs for example in speech recognition, handwriting recognition, natural language processing, biological sequence analysis, and other sequence labeling tasks where a segmentation must be performed simultaneously with the recognition. Alternative segmentations are often represented as paths in a weighted lattice. Each path may be associated with a category [10, 5, 1, 14]. The path being followed in the lattice can be viewed as a discrete latent variable. Searching for the best path using dynamic programming (Viterbi) or approximate methods (e.g. beam search) can be seen as a minimization of the energy function with respect to this discrete variable. For example, in a speech recognition context, evaluating min_{Z∈{Z}} E(W, Y^i, Z, X^i) is akin to "constrained segmentation": finding the best path in the lattice that produces a particular output label Y^i.
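A minimal sketch of this construction for a small discrete latent set {Z}, where the redefined energy Ě(W, Y, X) = min_Z E(W, Y, Z, X) is obtained by exhaustive search; the toy energy and the shapes are assumptions for illustration only.

    import numpy as np

    def E_full(W, Y, Z, X):
        """Toy joint energy with a discrete latent variable Z (illustrative only)."""
        return 0.5 * float(np.sum((W[Y, Z] - X) ** 2))

    def E_check(W, Y, X, Z_set):
        """Redefined energy: E_check(W,Y,X) = min over Z of E(W,Y,Z,X), by exhaustive search."""
        energies = [E_full(W, Y, Z, X) for Z in Z_set]
        z_best = int(np.argmin(energies))
        return energies[z_best], Z_set[z_best]

    def infer(W, X, Y_set, Z_set):
        """(Y_check, Z_check) = argmin over Y in {Y} and Z in {Z} of E(W, Y, Z, X)."""
        (energy, z), y = min((E_check(W, Y, X, Z_set), Y) for Y in Y_set)
        return y, z, energy

    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 3, 8))       # 5 categories, 3 latent configurations, 8 input dims
    X = rng.normal(size=8)
    print(infer(W, X, Y_set=range(5), Z_set=range(3)))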
An EBM framework with which to perform graph manipulations and search, while preserving the ability to compute partial derivatives for learning, is the Graph Transformer Network model described in [10]. However, that paper only mentions two loss functions (the generalized perceptron and the log-likelihood), without a general discussion of how to construct appropriate loss functions.

Rather than give a detailed description of latent-variable EBMs for sequence processing, we will describe an application to visual object detection and recognition. The architecture is shown in figure 3. The input image is first turned into an appropriate representation (e.g. a feature vector) by a trainable front-end module (e.g. a convolutional network, as in [11]). This representation is then matched to models of each category. Each model outputs an energy that measures how well the representation matched the category (a low energy indicates a good match, a high energy a bad match). The switch selects the best-matching category. The object models take in latent variables that may be used to represent some instantiation parameters of the objects, such as the pose, illumination, or conformation. The optimal value of those parameters for a particular input is computed as part of the energy-minimizing inference process. Section 4 reports experimental results obtained with such a system.

Figure 3: Example of a switch-based Energy-Based Model architecture for object recognition in images, where the pose of the object is treated as a latent variable.

3 Loss Functions for EBM Training

In supervised learning, the training set S is a set of pairs S = {(X^i, Y^i), i = 1..P}, where X^i is an input and Y^i is a desired output to be predicted. Practically every learning method can be described as the process of finding the parameter W ∈ {W} that minimizes a judiciously chosen loss function L(W, S). The loss function should be a measure of the discrepancy between the machine's behavior and the desired behavior on the training set. Well-behaved loss functions for EBMs should shape the energy landscape so as to "dig holes" at (X, Y) locations near training samples, while "building hills" at undesired locations, particularly the ones that are erroneously picked by the inference algorithm. For example, a good loss function for the regression problem depicted in figure 1 should dig holes around the blue dots (which represent a subset of the training set), while ensuring that the surrounding areas have higher energy.

In the following, we characterize the general form of loss functions whose minimization will make the machine carve out the energy landscape in the right way so as to approach the desired behavior. We define the loss on the full training set as:

    L(W, S) = R( (1/P) Σ_{i=1}^{P} L(W, Y^i, X^i) )        (1)

where L(W, Y^i, X^i) is the per-sample loss function for sample (X^i, Y^i), and R is a monotonically increasing function. Loss functions that combine per-sample losses through other symmetric n-ary operations than addition (e.g. multiplication, as in the case of likelihood-like loss functions) can be trivially obtained from the above through judicious choices of R and L. With this definition, the loss is invariant under permutations of the samples and under multiple repetitions of the same training set. In the following we will set R to the identity function. We assume that L(W, Y^i, X^i) has a lower bound over W for all Y^i, X^i.
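As a sketch of equation (1) with R set to the identity (the energy of the desired label on a toy linear switch-based classifier stands in for the per-sample loss; the names and data are illustrative assumptions):

    import numpy as np

    def total_loss(per_sample_loss, W, S, R=lambda v: v):
        """Equation (1): L(W, S) = R( (1/P) * sum_i L(W, Y^i, X^i) ), with R monotonically increasing."""
        P = len(S)
        return R(sum(per_sample_loss(W, Y, X) for X, Y in S) / P)

    # illustrative per-sample loss: the energy of the desired label for a linear switch-based classifier
    energy_loss = lambda W, Y, X: float((W @ X)[Y])

    rng = np.random.default_rng(0)
    S = [(rng.normal(size=8), int(rng.integers(5))) for _ in range(20)]
    W = rng.normal(size=(5, 8))
    print(total_loss(energy_loss, W, S))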
At this point, we can note that if we minimize any such loss on a training set over a set of functions with finite VC-dimension, appropriate VC-type upper bounds for the expected loss will apply, ensuring the convergence of the empirical loss to the expected loss as the training set size increases. Therefore, we will only discuss the conditions under which a loss function will make the machine approach the desired behavior on the training set.

Sometimes the task uniquely defines a "natural" loss function (e.g. the number of mis-classified examples), but more often than not, minimizing that function is impractical. Therefore one must resort to surrogate loss functions whose choice is up to the designer of the system. One crucial, but often neglected, question must be answered before choosing a loss function: "will minimizing the loss cause the learning machine to approach the desired behavior?" We will give general conditions for that.

The inference process produces the Y that minimizes Ě(W, Y, X^i). Therefore, a well-designed loss function must drive the energy of the desired output Ě(W, Y^i, X^i) to be lower than the energies of all the other possible outputs. Minimizing the loss function should result in "holes" at X, Y locations near the training samples, and "hills" at undesired locations.

3.1 Conditions on the Energy

The condition for the correct classification of sample X^i is:

Condition 1: Ě(W, Y^i, X^i) < Ě(W, Y, X^i), ∀Y ∈ {Y}, Y ≠ Y^i.

To ensure that the correct answer is robustly stable, we may choose to impose that the energy of the desired output be lower than the energies of the undesired outputs by a margin m:

Condition 2: Ě(W, Y^i, X^i) < Ě(W, Y, X^i) − m, ∀Y ∈ {Y}, Y ≠ Y^i.

We will now consider the case where Y is a discrete variable. Let us denote by Ȳ the output that produces the smallest energy while being different from the desired output Y^i: Ȳ = argmin_{Y∈{Y}, Y≠Y^i} Ě(W, Y, X^i). Condition 2 can be rewritten as:

Condition 3: Ě(W, Y^i, X^i) < Ě(W, Ȳ, X^i) − m, with Ȳ = argmin_{Y∈{Y}, Y≠Y^i} Ě(W, Y, X^i).

For the continuous Y case, we can simply define Ȳ as the lowest-energy output outside of a ball of a given radius ε around the desired output: Ȳ = argmin_{Y∈{Y}, ||Y−Y^i||>ε} Ě(W, Y, X^i).
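These conditions are easy to check numerically for a discrete {Y}. The sketch below (illustrative, with hypothetical energy values) identifies the most offending output Ȳ and tests conditions 1 and 3 for a given margin m.

    import numpy as np

    def most_offending(energies, y_true):
        """Y_bar: the output with the smallest energy among the outputs different from Y^i."""
        e = energies.astype(float).copy()
        e[y_true] = np.inf
        return int(np.argmin(e))

    def condition_1(energies, y_true):
        """Correct classification: the desired output has strictly the lowest energy."""
        return energies[y_true] < np.delete(energies, y_true).min()

    def condition_3(energies, y_true, m):
        """Margin condition: E(Y^i) < E(Y_bar) - m."""
        return energies[y_true] < energies[most_offending(energies, y_true)] - m

    E = np.array([0.2, 1.5, 0.9, 2.0])    # hypothetical energies E(W, Y, X^i) for k = 4 outputs
    print(condition_1(E, 0), condition_3(E, 0, m=0.5), condition_3(E, 0, m=1.0))
    # -> True, True, False: a margin of 1.0 is not yet respected by these energies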
3.2 Sufficient conditions on the loss function

We will now make the key assumption that L depends on X^i only indirectly through the set of energies {Ě(W, Y, X^i), Y ∈ {Y}}. For example, if {Y} is the set of integers between 0 and k−1, as would be the case for the switch-based classifier with k categories shown in figure 2(a), the per-sample loss for sample (X^i, Y^i) should be of the form:

    L(W, Y^i, X^i) = L(Y^i, Ě(W, 0, X^i), ..., Ě(W, k−1, X^i))        (2)

With this assumption, we separate the choice of the loss function from the details of the internal structure of the machine, and limit the discussion to how minimizing the loss affects the energies.

We must now characterize the form that L can take such that its minimization will eventually drive the machine to satisfy condition 3. We must design L in such a way that minimizing it will decrease the difference Ě(W, Y^i, X^i) − Ě(W, Ȳ, X^i) whenever Ě(W, Ȳ, X^i) < Ě(W, Y^i, X^i) + m. In other words, whenever the difference between the energy of the incorrect answer with the lowest energy and the energy of the desired answer is less than the margin, our learning procedure should make that difference larger. We will now propose a set of sufficient conditions on the loss that guarantee this.

Since we are only concerned with how the loss influences the relative values of Ě(W, Y^i, X^i) and Ě(W, Ȳ, X^i), we will consider the shape of the loss surface in the space of Ě(W, Y^i, X^i) and Ě(W, Ȳ, X^i), and view the other arguments of the loss (the energies for all the other values of Y) as parameters of that surface:

    L(W, Y^i, X^i) = Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i))

where the parameter [E_y] contains the vector of energies for all values of Y except Y^i and Ȳ.

We can now state sufficient conditions that guarantee that minimizing L will eventually satisfy condition 3. In all the sufficient conditions stated below, we assume that there exists a W such that condition 3 is satisfied for a single training example (X^i, Y^i), and that Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)) is convex (convex in its two arguments, but not necessarily convex in W). The conditions must hold for all values of [E_y].

Condition 4: The minima of Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)) are in the half-plane Ě(W, Y^i, X^i) + m < Ě(W, Ȳ, X^i).

This condition on the loss function clearly ensures that minimizing it will drive the machine to find a solution that satisfies condition 3, if such a solution exists.

Another sufficient condition can be stated to characterize loss functions that do not have minima, or whose minimum is at infinity:

Condition 5: The negative gradient of Q_[E_y](Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)) on the margin line Ě(W, Ȳ, X^i) = Ě(W, Y^i, X^i) + m has a positive dot product with the direction [−1, 1].

This condition guarantees that minimizing L will drive the energies Ě(W, Ȳ, X^i) and Ě(W, Y^i, X^i) toward the half-plane Ě(W, Y^i, X^i) + m < Ě(W, Ȳ, X^i).

Yet another sufficient condition can be stated to characterize loss functions whose minima are not in the desired half-plane, but where the possible values of Ě(W, Ȳ, X^i) and Ě(W, Y^i, X^i) are constrained by their dependency on W in such a way that the minimum of the loss while satisfying the constraint is in the desired half-plane:

Condition 6: On the margin line Ě(W, Ȳ, X^i) = Ě(W, Y^i, X^i) + m, the following must hold:

    [ ∂Ě(W, Y^i, X^i)/∂W − ∂Ě(W, Ȳ, X^i)/∂W ] · ∂L(W, Y^i, X^i)/∂W > 0

This condition ensures that an update of the parameters W to minimize the loss will drive the energies Ě(W, Ȳ, X^i) and Ě(W, Y^i, X^i) toward the desired half-plane Ě(W, Y^i, X^i) + m < Ě(W, Ȳ, X^i).

We must emphasize that these are only sufficient conditions. There may be legitimate loss functions (e.g. non-convex functions) that do not satisfy them, yet have the proper behavior.
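A quick numerical illustration of condition 5 (not from the paper): for a margin-type loss surface Q(e₁, e₂) = max(0, m + e₁ − e₂), the negative gradient evaluated just inside the violating side of the margin line (the hinge is not differentiable exactly on it) points along the direction [−1, 1].

    import numpy as np

    m = 1.0

    def Q(e_desired, e_offending):
        """Margin-type loss surface: Q(e1, e2) = max(0, m + e1 - e2) (hinge / LVQ2-like)."""
        return max(0.0, m + e_desired - e_offending)

    def numerical_gradient(f, e1, e2, eps=1e-5):
        return np.array([(f(e1 + eps, e2) - f(e1 - eps, e2)) / (2 * eps),
                         (f(e1, e2 + eps) - f(e1, e2 - eps)) / (2 * eps)])

    e1 = 2.0                  # energy of the desired output
    e2 = e1 + m - 1e-3        # most offending energy, just inside the violating region
    g = numerical_gradient(Q, e1, e2)
    print(g, float(np.dot(-g, [-1.0, 1.0])))   # the negative gradient has a positive dot product with [-1, 1]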
3.3 Examples of Loss Functions

We can now give examples of loss functions that satisfy the above criteria, and examine whether some of the popular loss functions proposed in the literature satisfy them.

Energy Loss: The simplest and most widely used loss is the energy loss, which is simply of the form L_energy(W, Y^i, X^i) = Ě(W, Y^i, X^i). This loss does not satisfy condition 4 or 5 in general, but there are certain forms of Ě(W, Y^i, X^i) for which condition 6 is satisfied. For example, let us consider an energy of the form

    E(W, Y, X) = Σ_{k=1}^{K} δ(Y − k) ||U^k − G(W, X)||²

Function G(W, X) could be a neural net, on top of which are placed K radial basis functions whose centers are the vectors U^k. If the U^k are fixed and all different, then the energy loss applied to this machine fulfills condition 6. Intuitively, that is because by pulling G(W, X) towards one of the RBF centers, we push it away from the others. Therefore, when the energy of the desired output decreases, the other ones increase. However, if we allow the RBF centers to be learned, condition 6 is no longer fulfilled. In that case, the loss has spurious minima where all the RBF centers are equal, and the function G(W, X) is constant and equal to that RBF center. The loss is zero, but the machine does not produce the desired result. Picking any combination of loss and energy that satisfies any of the conditions 4, 5, or 6 solves this collapse problem.

Generalized Perceptron Loss: We define the generalized Perceptron loss for training sample (X^i, Y^i) as:

    L_ptron(W, Y^i, X^i) = E(W, Y^i, X^i) − min_{Y∈{Y}} E(W, Y, X^i)        (3)

It is easy to see that with E(W, Y^i, X^i) = −Y^i · W^T X^i and {Y} = {−1, 1}, the above loss reduces to the traditional linear Perceptron loss. The generalized Perceptron loss satisfies condition 4 with m = 0. This loss was used by [10] for training a commercially deployed handwriting recognizer that combined a heuristic segmenter, a convolutional net, and a language model (where the latent variables represented paths in an interpretation lattice). A similar loss was studied by [5] for training a text parser. Because the margin is zero, this loss may not prevent collapses for certain architectures.

Generalized Margin Loss: A more robust version of the Perceptron loss is the margin loss, which directly uses the energy of the most offending non-desired output Ȳ in the contrastive term:

    L_margin(W, Y^i, X^i) = Q_m[Ě(W, Y^i, X^i) − Ě(W, Ȳ, X^i)]        (4)

where Q_m(e) is any function that is monotonically increasing for e > −m. The traditional "hinge loss" used with kernel-based methods, and the loss used by the LVQ2 algorithm, are special cases with Q_m(e) = e + m for e > −m, and 0 otherwise. Special forms of that loss were used in [7] for discriminative speech recognition, and in [1] and [14] for text labeling. The exponential loss used in AdaBoost is a special form of equation (4) with Q_m(e) = exp(e). A slightly more general form of the margin loss that satisfies conditions 4 or 5 is given by:

    L_gmargin(W, Y^i, X^i) = Q_gm[Ě(W, Y^i, X^i), Ě(W, Ȳ, X^i)]        (5)

with the condition that ∂Q_gm(e₁, e₂)/∂e₁ > ∂Q_gm(e₁, e₂)/∂e₂ when e₁ + m > e₂. An example of such a loss is:

    L(W, Y^i, X^i) = Q⁺(Ě(W, Y^i, X^i)) + Q⁻(Ě(W, Ȳ, X^i))        (6)

where Q⁺(e) is a convex monotonically increasing function, and Q⁻(e) a convex monotonically decreasing function, such that if dQ⁺/de|_{e₁} = 0 and dQ⁻/de|_{e₂} = 0 then e₁ + m < e₂. When Ě(W, Y^i, X^i) is akin to a distance (bounded below by 0), a judicious choice for Q⁺ and Q⁻ is:

    L(W, Y^i, X^i) = Ě(W, Y^i, X^i)² + κ exp(−β Ě(W, Ȳ, X^i))        (7)

where κ and β are positive constants. A similar loss function was recently used by our group to train a pose-invariant face detection system [12]. This system can simultaneously detect faces and estimate their pose, using latent variables to represent the head pose.
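The discrete-{Y} case of the losses above can be written in a few lines; the sketch below (illustrative; the energy vector is hypothetical) implements the generalized Perceptron loss (3), the hinge form of the margin loss (4), and the square-exponential loss (7).

    import numpy as np

    def most_offending(energies, y_true):
        e = energies.astype(float).copy()
        e[y_true] = np.inf
        return int(np.argmin(e))

    def perceptron_loss(energies, y_true):
        """Eq. (3): E(W, Y^i, X^i) - min_Y E(W, Y, X^i)."""
        return energies[y_true] - energies.min()

    def hinge_loss(energies, y_true, m=1.0):
        """Eq. (4) with Q_m(e) = e + m for e > -m and 0 otherwise."""
        e = energies[y_true] - energies[most_offending(energies, y_true)]
        return e + m if e > -m else 0.0

    def square_exponential_loss(energies, y_true, kappa=1.0, beta=1.0):
        """Eq. (7), for energies bounded below by 0: E(Y^i)^2 + kappa * exp(-beta * E(Y_bar))."""
        e_bar = energies[most_offending(energies, y_true)]
        return energies[y_true] ** 2 + kappa * np.exp(-beta * e_bar)

    E = np.array([0.3, 1.2, 0.8, 2.5])    # hypothetical energies for k = 4 categories
    for loss in (perceptron_loss, hinge_loss, square_exponential_loss):
        print(loss.__name__, float(loss(E, y_true=0)))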

Contrastive Free Energy Loss: While the loss functions proposed thus far involve only Ě(W, Ȳ, X^i) in their contrastive part, loss functions can be devised to combine all the energies for all values of Y in their contrastive term:

    L(W, Y^i, X^i) = Ě(W, Y^i, X^i) − F(Ě(W, 0, X^i), ..., Ě(W, k−1, X^i))        (8)

F can be interpreted as a generalized free energy of the ensemble of systems with energies Ě(W, Y, X^i), ∀Y ∈ {Y}. It appears difficult to characterize the general form of F that ensures that L satisfies condition 5. An interesting special case of this loss is the familiar negative log-likelihood loss:

    L_nll(W, Y^i, X^i) = Ě(W, Y^i, X^i) − F_β(W, X^i)        (9)

with

    F_β(W, X^i) = −(1/β) log( Σ_{Y∈{Y}} exp[−β Ě(W, Y, X^i)] )        (10)

where β is a positive constant. The second term can be interpreted as the Helmholtz free energy (log partition function) of the ensemble of systems with energies Ě(W, Y, X^i), ∀Y ∈ {Y}. This type of discriminative loss with β = 1 is widely used for discriminative probabilistic models in the speech, handwriting, and NLP communities [10, 2, 8]. This is also the loss function used in the conditional random field model of Lafferty et al. [9].

We can see that loss (9) reduces to the generalized Perceptron loss when β → ∞. Computing this loss and its derivative requires computing integrals (or sums) over {Y} that may be intractable. It also assumes that the exponential of the energy be integrable over {Y}, which puts restrictions on the choice of E(W, Y, X) and/or {Y}.
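For a discrete {Y}, the free energy (10) is a log-sum-exp and the loss (9) follows directly; the sketch below (illustrative values; the stable log-sum-exp formulation is my own choice) also shows numerically that the loss approaches the generalized Perceptron loss (3) as β grows.

    import numpy as np

    def free_energy(energies, beta=1.0):
        """Eq. (10): F_beta = -(1/beta) * log sum_Y exp(-beta * E), via a stable log-sum-exp."""
        z = -beta * np.asarray(energies, dtype=float)
        zmax = z.max()
        return -(zmax + np.log(np.exp(z - zmax).sum())) / beta

    def nll_loss(energies, y_true, beta=1.0):
        """Eq. (9): L_nll = E(W, Y^i, X^i) - F_beta(W, X^i)."""
        return energies[y_true] - free_energy(energies, beta)

    E = np.array([0.3, 1.2, 0.8, 2.5])    # hypothetical energies for a discrete {Y}
    print(nll_loss(E, y_true=0, beta=1.0))
    # as beta grows, F_beta tends to min_Y E and the loss tends to the perceptron loss (3):
    print(nll_loss(E, y_true=0, beta=100.0), E[0] - E.min())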
4 Illustrative Experiments

To illustrate the use of contrastive loss functions with non-probabilistic latent variables, we trained a system to recognize generic objects in images independently of the pose and the illumination. We used the NORB dataset [11], which contains 50 different uniform-colored toy objects under 18 azimuths, 9 elevations, and 6 lighting conditions. The objects are 10 instances from each of 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Five instances of each category were used for training, and the other five for testing (see figure 4). A 6-layer convolutional network trained with the mean-square loss achieves 6.8% test error on this set when fed with binocular 96×96-pixel gray-scale images [11].

Figure 4: Invariant object recognition with the NORB dataset. The left portion shows sample views of the training instances, and the right portion testing instances for the 5 categories.

We used an architecture very much like the one in figure 3, where the feature extraction module is identical to the first 5 layers of the 6-layer convolutional net used in the reference experiment. The object model functions were of the form E_i = ||W_i V − F(Z)||, i = 1..5, where V is the output of the 5-layer net (100 dimensions) and W_i is a 9 × 100 (trainable) weight matrix. The latent variable Z has two dimensions that are meant to represent the azimuth and elevation of the object viewpoint. The set of possible values {Z} contained 162 values (azimuths: 0–360 degrees every 20 degrees; elevations: 30–70 degrees every 5 degrees). The output of F(Z) is a point on an azimuth/elevation half-sphere (a 2D surface) embedded in the 9D hypercube [−1, +1]^9. The minimization of the energy over Z is performed through exhaustive search (which is relatively cheap).

We used the loss function (7). This loss causes the convolutional net to produce a point as close as possible to some point on the half-sphere of the desired class, and as far as possible from the half-sphere of the best-scoring non-desired class. The system is trained "from scratch", including the convolutional net. We obtained 6.3% error on the test set, which is a moderate but significant improvement over the 6.8% of the control experiment.
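The sketch below mimics the structure of these object models under stated assumptions: the pose grid, the 9-dimensional embedding F(Z), and all weights are hypothetical stand-ins (the paper's learned embedding and trained parameters are not reproduced), but the exhaustive minimization over the 162 poses and the argmin over class energies follow the description above.

    import numpy as np

    # pose grid as described above: 18 azimuths x 9 elevations = 162 latent configurations
    azimuths = np.deg2rad(np.arange(0, 360, 20))
    elevations = np.deg2rad(np.arange(30, 75, 5))
    Z_grid = [(a, e) for a in azimuths for e in elevations]

    def F(z):
        """Hypothetical stand-in for F(Z): maps a pose to a point of a 2D surface inside [-1, 1]^9
        (the embedding actually used in the paper is not reproduced here)."""
        a, e = z
        return np.array([np.cos(a), np.sin(a), np.cos(e), np.sin(e),
                         np.cos(a) * np.sin(e), np.sin(a) * np.sin(e),
                         np.cos(a) * np.cos(e), np.sin(a) * np.cos(e), np.cos(2 * a)])

    def class_energy(W_i, V):
        """E_i = min_Z ||W_i V - F(Z)||, minimized by exhaustive search over the pose grid."""
        proj = W_i @ V
        dists = [np.linalg.norm(proj - F(z)) for z in Z_grid]
        k = int(np.argmin(dists))
        return dists[k], Z_grid[k]

    rng = np.random.default_rng(0)
    V = rng.normal(size=100)                                   # 100-dim output of the front-end net
    W = [0.1 * rng.normal(size=(9, 100)) for _ in range(5)]    # one 9 x 100 matrix per category
    energies = [class_energy(Wi, V)[0] for Wi in W]
    print("predicted category:", int(np.argmin(energies)))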
5 Discussion

Efficient End-to-End Gradient-Based Learning: To perform gradient-based training of all the modules in the architecture, we must compute the gradient of the loss with respect to all the parameters. This is easily achieved with the module-based generalization of back-propagation described in [10]. A typical learning iteration would involve the following steps: (1) one forward propagation through the modules that only depend on X; (2) a run of the energy-minimizing inference algorithm on the modules that depend on Z and Y; (3) as many back-propagations through the modules that depend on Y as there are energy terms in the loss function; (4) one back-propagation through the module that depends only on X; (5) one update of the parameters.

Efficient Inference: Most loss functions described in this paper involve multiple runs of the machine (in the worst case, one run for each energy term that enters in the loss). However, the parts of the machine that solely depend on X, and not on Y or Z, need not be recomputed for each run (e.g. the feature extractor in figure 3), because X does not change between runs.

If the energy function can be decomposed as a sum of functions (called factors) E(W, Y, X) = Σ_j E_j(W_j, Y, Z, X), each of which takes subsets of the variables in Z and Y as input, we can use a form of belief propagation algorithm for factor graphs to compute the lowest-energy configuration [13]. These algorithms are exact and tractable if Z and Y are discrete and the factor graph has no loop. They reduce to Viterbi-type algorithms when members of {Z} can be represented by paths in a lattice (as is the case for sequence segmentation/labeling tasks).

Approximate Inference: We do not really need to assume that the energy-minimizing inference process always finds the global minimum of E(W, Y, X^i) with respect to Y. We merely need to assume that this process finds approximate solutions (e.g. local minima) in a consistent, repeatable manner. Indeed, if there are minima of E(W, Y, X^i) with respect to Y that our minimization/inference algorithm never finds, we do not need to find them and increase their energy: their existence is irrelevant to our problem. However, it is important to note that those unreachable regions of low energy will affect the loss function of probabilistic models, as well as that of EBMs, if the negative log-likelihood loss is used. This is a distinct advantage of the un-normalized EBM approach: low-energy areas that are never reached by the inference algorithm are not a concern.
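As a sketch of the learning iteration described at the beginning of this section, in the simplest possible setting: a switch-based classifier with a hinge margin loss, no latent variable Z, and hand-written gradients instead of the module-based back-propagation of [10]; the data and names are illustrative assumptions.

    import numpy as np

    def energies(W, X):                    # step (1): forward pass of the part that depends only on X
        return W @ X

    def train_step(W, X, y_true, lr=0.1, m=1.0):
        E = energies(W, X)
        e = E.copy()
        e[y_true] = np.inf
        y_bar = int(np.argmin(e))          # step (2): energy-minimizing inference over the other outputs
        if E[y_true] - E[y_bar] > -m:      # hinge loss (4) is active: the margin is violated
            W[y_true] -= lr * X            # steps (3)-(5): push down the desired energy,
            W[y_bar] += lr * X             # pull up the most offending one, and update the parameters
        return W

    rng = np.random.default_rng(0)
    means = 3.0 * rng.normal(size=(3, 5))
    data = [(means[y] + 0.1 * rng.normal(size=5), y) for y in range(3) for _ in range(20)]
    W = np.zeros((3, 5))
    for epoch in range(10):
        for X, y in data:
            W = train_step(W, X, y)
    errors = sum(int(np.argmin(energies(W, X)) != y) for X, y in data)
    print("training errors:", errors)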

6 Conclusion and Outlook

Most approaches to discriminative training of graphical models in the literature use loss functions from a very small set. We show that energy-based (un-normalized) graphical models can be trained discriminatively using a very wide family of loss functions. We give sufficient conditions that the loss function must satisfy so that its minimization will make the system approach the desired behavior. We give a number of loss functions that satisfy this criterion and describe experiments in image recognition that illustrate the use of such discriminative loss functions in the presence of non-probabilistic latent variables.

Acknowledgments

The authors wish to thank Léon Bottou, Yoshua Bengio, Margarita Osadchy, and Matt Miller for useful discussions.

References

[1] Yasemin Altun, Mark Johnson, and Thomas Hofmann. Loss functions and optimization methods for discriminative learning of label sequences. In Proc. EMNLP, 2003.

[2] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe. Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2):252–259, 1992.

[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, February 2003.

[4] L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 91405 Orsay Cedex, France, 1991.

[5] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP, 2002.

[6] J. S. Denker and C. J. Burges. Image segmentation and recognition. In The Mathematics of Induction. Addison Wesley, 1995.

[7] X. Driancourt and L. Bottou. MLP, LVQ and DP: Comparison & cooperation. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 815–819, Seattle, 1991.

[8] P. Haffner. Connectionist speech recognition with a global MMI algorithm. In Eurospeech'93, Berlin, September 1993.

[9] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. International Conference on Machine Learning (ICML), 2001.

[10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[11] Yann LeCun, Fu-Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. CVPR, 2004.

[12] M. Osadchy, M. Miller, and Y. LeCun. Synergistic face detection and pose estimation. In Proc. NIPS, 2004.

[13] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.

[14] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Proc. NIPS, 2003.

[15] Y. W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.