Subgradient Methods for Maximum Margin Structured Learning

Nathan D. Ratliff [email protected]
J. Andrew Bagnell [email protected]
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Martin A. Zinkevich [email protected]
Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E1, Canada

1. Introduction

Maximum margin structured learning (MMSL) has recently gained recognition within the machine learning community as a tractable method for large scale learning. However, most current methods are limited in terms of scalability, convergence, or memory requirements. The original Structured SMO method proposed in (Taskar et al., 2003) is slow to converge, particularly for Markov networks of even medium tree-width. Similarly, dual exponentiated gradient techniques suffer from sublinear convergence as well as often large memory requirements. Recently, (Taskar et al., 2006) have looked into saddle-point methods for optimization and have succeeded in efficiently solving several problems that would have otherwise had intractable memory requirements.

We propose an alternative gradient based approach using a regularized risk formulation of MMSL, derived by placing the constraints into the objective to create a convex function in w. This objective is then optimized by a direct generalization of gradient descent, popular in machine learning, called the subgradient method (Shor, 1985). The abundance of literature on subgradient methods makes this algorithm a decidedly convenient choice. In this case, it is well known that the subgradient method is guaranteed linear convergence when the stepsize is chosen to be constant. Furthermore, this algorithm becomes the well-studied Greedy Projection algorithm of (Zinkevich, 2003) in the online setting. Using tools developed in (Hazan et al., 2006), we can show that the risk of this online algorithm with respect to the prediction loss grows only sublinearly in time. Perhaps more importantly, the implementation of this algorithm is simple and has intuitive appeal, since an integral part of the computation comes from running the inference algorithm being trained in the inner loop.

In what follows, we review the basic formulation of maximum margin structured learning as a convex programming problem before deriving the convex objective and showing how its subgradients can be computed utilizing the specialized inference algorithm inherent to the problem. We finish with theoretical guarantees and some experimental results in two domains: sequence labeling for optical character recognition; and imitation learning for path planning in mobile robot navigation. The former problem is well known in the MMSL literature, but the latter is new to this domain. Indeed, although there is a tractable polynomial sized quadratic programming representation for the problem (Ratliff et al., 2006), solving it directly using one of the previously proposed methods would be practically intractable, for reasons similar to those that arise in directly solving the linear programming formulation of Markov Decision Processes.

2. Maximum margin structured learning

We present a brief review of maximum margin structured learning in terms of convex programming. In this setting, we attempt to predict a structured object y ∈ Y(x) from a given input x ∈ X. For our purposes we assume that the inference problem can be described in terms of a computationally tractable max over a score function s_x : Y(x) → R such that y* = arg max_{y∈Y(x)} s_x(y), and take as our hypothesis class functions of the form

    h(x; w) = \arg\max_{y \in \mathcal{Y}(x)} w^T f(x, y)    (1)

This class is parameterized by w in a convex parameter space W, and f(x, y) are vector valued feature functions. Given data D = {(x_i, y_i)}_{i=1}^n, we abbreviate f(x_i, y) as f_i(y) and Y(x_i) as Y_i.
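To make the hypothesis class of Equation 1 concrete, the following sketch (our illustration, not code from the paper) computes the argmax by brute-force enumeration; in practice Y(x) is exponentially large and this step is carried out by a specialized inference algorithm such as Viterbi or A*. The function and argument names are assumptions for the example.

```python
import numpy as np

def predict(w, x, candidates, feature_fn):
    """Eq. (1): h(x; w) = argmax_{y in Y(x)} w^T f(x, y).

    `candidates` enumerates Y(x) explicitly, which is feasible only for
    toy problems; `feature_fn(x, y)` returns the feature vector f(x, y).
    """
    scores = [float(w @ feature_fn(x, y)) for y in candidates]
    return candidates[int(np.argmax(scores))]
```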
The margin is chosen to scale with the loss of choosing class y over the desired y_i. We denote this prediction loss function by L(y_i, y) = L_i(y), and assume that L_i(y) > 0 for all y ∈ Y_i\y_i, and L_i(y_i) = 0. In learning, our goal is to find a score function that scores y_i higher than all other y ∈ Y_i\y_i by this margin. Formally, this gives us the following constraint:

    \forall i, \; \forall y \in \mathcal{Y}_i: \quad w^T f_i(y_i) \ge w^T f_i(y) + L_i(y).    (2)

Maximizing the right hand side over all y ∈ Y_i, and adding slack variables, we can express this mathematically as the following convex program:

    \min_{w, \zeta_i} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \zeta_i^q    (3)
    \text{s.t.} \;\; \forall i: \; w^T f_i(y_i) + \zeta_i \ge \max_{y \in \mathcal{Y}_i} \big( w^T f_i(y) + L_i(y) \big)

where λ ≥ 0 is a hyperparameter that trades off constraint violations for margin maximization (i.e. fit for simplicity), q ≥ 1 defines the penalty norm, and β_i ≥ 0 are constants that scale training examples relative to each other. β_i can be used to give training examples equal weight regardless of their differing structure. See Section 6 for a concrete example of this in the case of planning.

3. Subgradient methods and MMSL

We propose rewriting Program 3 as a regularized risk function and taking subgradients of the resulting objective. This leads to a myriad of possible algorithms, the simplest of which is the subgradient method for convex optimization. This method has shown promising experimental results, and as a direct generalization of gradient descent for differentiable functions, it is easy to implement and has good theoretical properties in both batch and online settings.

The regularized risk interpretation can be easily derived by noting that the slack variables ζ_i in Program 3 are tight, and thus equal to max_{y∈Y_i}(w^T f_i(y) + L_i(y)) − w^T f_i(y_i) at the minimum.¹ We can therefore move these constraints into the objective function, simplifying the program into a single cost function:

    c(w) = \frac{1}{n}\sum_{i=1}^n \beta_i \Big( \max_{y \in \mathcal{Y}_i} \big( w^T f_i(y) + L_i(y) \big) - w^T f_i(y_i) \Big)^q + \frac{\lambda}{2}\|w\|^2    (4)

¹ The right hand term maximizes over a set that includes y_i, and L_i(y_i) = 0 by definition.

This objective is convex, but nondifferentiable; we can optimize it by utilizing the subgradient method (Shor, 1985). A subgradient of a convex function c : W → R at w is defined as a vector g_w for which

    \forall w' \in \mathcal{W}: \quad g_w^T (w' - w) \le c(w') - c(w)    (5)

Note that subgradients need not be unique, though at points of differentiability they necessarily agree with the gradient. We denote the set of all subgradients of c(·) at point w by ∂c(w).

To compute the subgradient of our objective function, we make use of the following four well known properties: (1) subgradient operators are linear; (2) the gradient is the unique subgradient of a differentiable function; (3) if f(x, y) is differentiable in x, then ∇_x f(x, y*) is a subgradient of the convex function φ(x) = max_y f(x, y) for any y* ∈ arg max_y f(x, y); (4) an analogous chain rule holds as expected. We are now equipped to compute a subgradient g_w ∈ ∂c(w) of our objective function (4):

    g_w = \frac{1}{n}\sum_{i=1}^n q \beta_i \Big( \big( w^T f_i(y^*) + L_i(y^*) \big) - w^T f_i(y_i) \Big)^{q-1} \Delta^w f_i^* + \lambda w    (6)

where y* = arg max_{y∈Y_i}(w^T f_i(y) + L_i(y)) and Δ^w f_i^* = f_i(y*) − f_i(y_i). This latter expression emphasizes that, intuitively, the subgradient compares the feature values between the example class y_i and the current loss-augmented prediction y*.

Note that computing the subgradient requires solving the problem y* = arg max_{y∈Y_i}(w^T f_i(y) + L_i(y)) for each example. If we can efficiently solve arg max_{y∈Y_i} w^T f_i(y) using a particular specialized algorithm, we can often use the same algorithm to efficiently compute this loss-augmented optimization for a particular class of loss functions, and hence efficiently compute this subgradient. Algorithm 1 details the application of the subgradient method to maximum margin structured learning.
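To illustrate how the specialized inference algorithm can be reused for the loss-augmented problem, consider sequence labeling under Hamming loss: the loss decomposes over positions, so it simply adds to the per-position scores and ordinary Viterbi solves arg max_y (w^T f_i(y) + L_i(y)) unchanged. This sketch is our own illustration under those assumptions; the score parameterization (precomputed unary and transition score arrays) is not from the paper.

```python
import numpy as np

def loss_augmented_viterbi(unary, trans, y_true):
    """argmax_y sum_t unary[t, y_t] + sum_t trans[y_{t-1}, y_t] + Hamming(y, y_true).

    unary: (T, K) array of scores w^T f for each position/label;
    trans: (K, K) array of transition scores; y_true: length-T gold labels.
    The Hamming loss adds 1 wherever a label differs from y_true, so it
    folds directly into the unary scores.
    """
    T, K = unary.shape
    aug = unary + (np.arange(K)[None, :] != np.asarray(y_true)[:, None])
    delta = np.zeros((T, K))            # best score of any prefix ending in label k
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = aug[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans  # cand[j, k]: prev label j -> label k
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + aug[t]
    y = np.zeros(T, dtype=int)
    y[-1] = int(delta[-1].argmax())
    for t in range(T - 1, 0, -1):
        y[t - 1] = back[t, y[t]]
    return y
```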
Given g_t ∈ ∂c(w_t) and α_t, the basic iterative update is

    w_{t+1} = P_{\mathcal{W}} \big[ w_t - \alpha_t g_t \big]    (7)

where P_W projects w onto a convex set W formed by any problem specific convex constraints we may impose on w.²

² It is actually sufficient that P_W be an approximate projection operator for which P_W[w] ∈ W and ∀w' ∈ W: ‖P_W[w] − w'‖ ≤ ‖w − w'‖.
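For instance, if one takes W to be a Euclidean ball (a convenient convex constraint set, chosen here purely for illustration), the projection P_W of Equation 7 has a closed form:

```python
import numpy as np

def project_l2_ball(w, radius):
    """Exact Euclidean projection onto W = {w : ||w|| <= radius}: the
    identity inside the ball, radial rescaling outside it. Any operator
    satisfying the approximate-projection property of footnote 2 would
    also be admissible in Eq. (7)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w
```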

Algorithm 1 Subgradient Method for Maximum Margin Structured Learning

1: procedure sMMSL({x_i, y_i, f_i(·), L_i(·)}_{i=1}^n, regularization parameter λ > 0, stepsize sequence {α_t} (learning rate), iterations T)
2:   t ← 1
3:   w ← 0
4:   while t ≤ T do
5:     Solve y_i^* = arg max_{y∈Y_i} (w^T f_i(y) + L_i(y)) for each i.
6:     Compute g ∈ ∂c(w) as in Equation 6.
7:     Update w ← w − α_t g.
8:     (Optional): Project w onto any additional constraints.
9:     t ← t + 1
10:  end while
11:  return w
12: end procedure
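A direct transcription of Algorithm 1 might look as follows. This is a sketch for the hinge case q = 1, where the leading factor in Equation 6 reduces to 1; the callback interface and names are our assumptions rather than the authors' implementation.

```python
import numpy as np

def subgradient_mmsl(examples, loss_aug_argmax, feature_fn,
                     lam, stepsizes, dim, beta=None, project=None):
    """Sketch of Algorithm 1 with q = 1.

    examples:        list of (x_i, y_i) training pairs
    loss_aug_argmax: (w, x, y) -> argmax_{y'} w^T f(x, y') + L(y, y')
    feature_fn:      (x, y) -> feature vector f(x, y)
    stepsizes:       the sequence {alpha_t}; its length plays the role of T
    """
    n = len(examples)
    beta = np.ones(n) if beta is None else np.asarray(beta)
    w = np.zeros(dim)
    for alpha in stepsizes:
        g = lam * w                            # gradient of (lam/2)||w||^2
        for i, (x, y) in enumerate(examples):
            y_star = loss_aug_argmax(w, x, y)  # step 5: loss-augmented inference
            # Eq. (6) with q = 1: feature difference between y* and the example
            g += (beta[i] / n) * (feature_fn(x, y_star) - feature_fn(x, y))
        w = w - alpha * g                      # step 7
        if project is not None:                # step 8 (optional)
            w = project(w)
    return w
```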

3.1. Optimization in the batch setting

In the batch setting, this algorithm is one of a well studied class of algorithms forming the subgradient method (Shor, 1985).³ Crucial to this method is the choice of stepsize sequence {α_t}, and convergence guarantees vary accordingly. Our results are developed from (Nedic & Bertsekas, 2000), who analyze incremental subgradient algorithms, of which the subgradient method is a special case.

³ The term "subgradient method" is used in lieu of "subgradient descent" because the method is not technically a descent method. Since the stepsize sequence is chosen in advance, the objective value per iteration can, and often does, increase.

Our results require a strong convexity assumption to hold for the objective function. Given W ⊆ R^d, a function f : W → R is η-strongly convex if there exists g : W → R^d such that for all w, w' ∈ W:

    f(w') \ge f(w) + g_w^T (w' - w) + \eta \|w' - w\|^2.    (8)

In our case, Equation 4 is λ/2-strongly convex.

Theorem 3.1: Linear convergence of a constant stepsize sequence. Let the stepsize sequence {α_t} of Algorithm 1 be chosen as α_t = α ≤ 1/λ. Furthermore, assume that for a particular region of radius R around the minimum, ∀w, ∀g ∈ ∂c(w): ‖g‖ ≤ C. Then the algorithm converges at a linear rate to a region of the minimum w* = arg min_{w∈W} c(w) bounded by

    \|w_{min} - w^*\| \;\le\; \sqrt{\frac{\alpha C^2}{\lambda}} \;\le\; \frac{C}{\lambda}.

Proof: (Sketch) By the strong convexity of c(w) and Proposition 2.4 of (Nedic & Bertsekas, 2000) we have

    \|w_{t+1} - w^*\|^2 \;\le\; (1 - \alpha\lambda)^{t+1} \|w_0 - w^*\|^2 + \frac{\alpha C^2}{\lambda} \;\xrightarrow{t \to \infty}\; \frac{\alpha C^2}{\lambda} \;\le\; \frac{C^2}{\lambda^2}
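For completeness, here is the unrolling that the sketch leaves implicit, under our reading that Proposition 2.4 supplies the one-step contraction ‖w_{t+1} − w*‖² ≤ (1 − αλ)‖w_t − w*‖² + α²C²:

```latex
\|w_{t+1} - w^*\|^2
  \le (1-\alpha\lambda)^{t+1}\|w_0 - w^*\|^2
      + \alpha^2 C^2 \sum_{k=0}^{t} (1-\alpha\lambda)^k
  \le (1-\alpha\lambda)^{t+1}\|w_0 - w^*\|^2 + \frac{\alpha C^2}{\lambda},
\qquad \text{since} \quad \sum_{k=0}^{t} (1-\alpha\lambda)^k \le \frac{1}{\alpha\lambda}.
```

Taking t → ∞ and a square root then yields the stated radius √(αC²/λ) ≤ C/λ.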
This theorem shows that we attain linear convergence to a small region of the minimum using a constant stepsize. Alternatively, we can choose a diminishing stepsize rule of the form α_t = γ/t for t ≥ 1, where γ is some positive constant that can be thought of as the learning rate. Under this rule, Algorithm 1 is guaranteed to converge to the minimum, but only at a sublinear rate under the above strong convexity assumption (see (Nedic & Bertsekas, 2000), Proposition 2.8).

3.2. Optimization in an online setting

In contrast with many optimization techniques, the subgradient method naturally extends from the batch setting (as presented) to an online setting. In the online setting one imagines seeing several problems on closely related domains: in particular, one may observe a domain, be required to predict within it, and only then observe the "correct" solution (or observe the corrections to the predicted solution).

At each time step t, we are given Y_t and f_t(·) with which to select a weight vector w_t and make a prediction. Once this prediction is made, we can then observe the true class y_t and update the weight vector based on the error. Thus, we can define

    c_t(w) = \frac{\lambda}{2}\|w\|^2 + \max_{y \in \mathcal{Y}_t} \big( w^T f_t(y) + L_t(y) \big) - w^T f_t(y_t) = \frac{\lambda}{2}\|w\|^2 + r_t(w)

to be the λ/2-strongly convex cost function (see Equation 4) at time t, which we can evaluate given y_t, Y_t, and f_t(·). This is now an online convex programming problem (Zinkevich, 2003), to which we will apply Greedy Projection with learning rate α_t = 1/(tλ).

(Hazan et al., 2006) have shown that with this learning rate, our online optimization problem has logarithmic regret with respect to the objective function. However, the loss we truly care about on round t is the prediction loss, L_t(y_t^*), where y_t^* is the prediction made during this round. Space doesn't permit a proof, but the following may be derived using tools from (Hazan et al., 2006):

Theorem 3.2: Sublinear regret for subgradient MMSL. Assume that the features in each state are bounded in norm by 1. Then:

    \sum_{t=1}^T L_t(y_t^*) \;\le\; \sum_{t=1}^T r_t(w^*) + \lambda T \|w^*\|^2 + \frac{1}{\lambda}(1 + \ln T)    (9)

Choosing λ = √(1 + ln T) / (‖w*‖√T), then:

    \sum_{t=1}^T L_t(y_t^*) \;\le\; \sum_{t=1}^T r_t(w^*) + \|w^*\| \sqrt{T (1 + \ln T)}    (10)

Thus, if we know our horizon T and the achievable margin, our loss grows only sublinearly with time.

Observe that Theorem 3.2 is a result about the additive loss functions L_t specifically. If we attempted instead to consider, for instance, a setting where we did not require more margin from higher loss paths (e.g. 0/1 loss on paths), our bound would be much weaker and would scale with the size of the domain.
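To make the online procedure of Section 3.2 concrete, here is a sketch of Greedy Projection with learning rate α_t = 1/(tλ) applied to the per-round costs c_t; the two-callback interface and all names are our assumptions.

```python
import numpy as np

def online_subgradient_mmsl(stream, argmax, loss_aug_argmax, feature_fn,
                            lam, dim, project=None):
    """Greedy Projection on the costs c_t of Section 3.2.

    stream:          yields (x_t, y_t); y_t is revealed only after predicting
    argmax:          (w, x) -> argmax_y w^T f(x, y)              (prediction)
    loss_aug_argmax: (w, x, y) -> argmax_y w^T f(x, y) + L_t(y)  (update)
    """
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        y_pred = argmax(w, x)              # predict before y_t is revealed
        y_star = loss_aug_argmax(w, x, y)  # then observe y_t, take a step on c_t
        g = lam * w + feature_fn(x, y_star) - feature_fn(x, y)
        w = w - g / (lam * t)              # alpha_t = 1/(t * lam)
        if project is not None:
            w = project(w)
        yield y_pred, w
```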

4. Generalization guarantees

Our online algorithm also inherits interesting generalization guarantees when applied in the batch setting. Given independent, identically distributed data, the expected loss of our algorithm can be bounded, with probability greater than or equal to 1 − δ, by the errors it makes at each step of the incremental subgradient method, using the techniques of (Cesa-Bianchi et al., 2004):⁴

    E[L_{T+1}(\bar{w})] \;\le\; \frac{1}{T}\sum_{t=1}^T r_t(w_t) + \sqrt{\frac{2}{T}\log\frac{1}{\delta}}    (11)

⁴ To achieve this result we must actually use the average weight vector w̄ computed during learning, not merely the last one.
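Footnote 4 matters in practice: the bound applies to the averaged iterate w̄ = (1/T) Σ_t w_t, not to the final weight vector. A small helper (ours, with an assumed single-step interface) maintains this average without storing the iterate history:

```python
import numpy as np

def run_with_averaging(step_fn, w0, T):
    """Run T updates while maintaining w_bar = (1/T) sum_t w_t.

    step_fn(w, t) -> new w is an assumed interface performing one update
    of the incremental subgradient method (e.g. one round of Section 3.2).
    """
    w = np.asarray(w0, dtype=float).copy()
    w_bar = np.zeros_like(w)
    for t in range(1, T + 1):
        w = step_fn(w, t)
        w_bar += (w - w_bar) / t  # running mean of the iterates w_1, ..., w_t
    return w_bar
```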

This bound is rather similar in form to previous generalization bounds given using covering number techniques (Taskar et al., 2003). Importantly, though, this approach entirely removes the dependence on the number of bits b being predicted in structured learning; most existing techniques introduce a log b factor for the number of predicted bits.

5. Slack-scaling

In principle, we can use these tools to compute subgradients of the slack-scaling formulation detailed in (Tsochantaridis et al., 2005). Disregarding the regularization, under this formulation Equation 4 becomes

    \tilde{c}(w) = \frac{1}{n}\sum_{i=1}^n \beta_i \Big( \max_{y \in \mathcal{Y}_i} L_i(y) \big( w^T (f_i(y) - f_i(y_i)) + 1 \big) \Big)^q

Multiplying by the loss inside the maximization makes this expression more restrictive in terms of optimization, since the specialized inference algorithm cannot usually be applied directly. However, when the loss function takes on only a relatively small number of values L_i(y) ∈ {l_j}_{j=1}^k for all y ∈ Y_i, we can often solve the inner inference problem for each j individually with respect to the set Y_i^{(j)} = {y ∈ Y_i | L_i(y) = l_j}. Using this we can then compute the subgradient and utilize the subgradient method as before to optimize the objective.
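The level-set decomposition just described is easy to sketch. In the code below (our illustration), `constrained_argmax` is an assumed routine returning arg max_{y ∈ Y_i^(j)} w^T f_i(y); since l_j is constant on each level set, maximizing w^T f_i(y) there suffices.

```python
import numpy as np

def slack_scaled_inner_max(w, x, y_true, levels, constrained_argmax, feature_fn):
    """Inner maximization of Section 5:
    max_y L_i(y) * (w^T (f_i(y) - f_i(y_i)) + 1), with L_i(y) in {l_1..l_k},
    solved by one constrained inference call per loss level."""
    f_true = feature_fn(x, y_true)
    best_val, best_y = -np.inf, None
    for l_j in levels:
        y_j = constrained_argmax(w, x, l_j)  # argmax of w^T f_i over {y : L_i(y) = l_j}
        val = l_j * (float(w @ (feature_fn(x, y_j) - f_true)) + 1.0)
        if val > best_val:
            best_val, best_y = val, y_j
    return best_y, best_val
```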

6. Experimental results

We present results for two problems: sequence labeling for optical character recognition (Taskar et al., 2003); and imitation learning for path planning in mobile robot navigation (Ratliff et al., 2006).

6.1. Optical character recognition

We implemented the incremental subgradient method for the sequence labeling problem originally explored by (Taskar et al., 2003), who used the Structured SMO algorithm.⁵ Running our algorithm with 600 training examples and 5500 test examples using 10 fold cross validation, as was done in (Taskar et al., 2003), we attained an average prediction error of 0.20 using a linear kernel. This result is statistically equivalent to the previously published result; however, the entire 10 fold cross validation run completed within 17 seconds. Furthermore, when running the experiment using the entire data set partitioned into 10 folds of 5500 training and 600 test examples each, we achieved a significantly lower average error of 0.13, again using the linear kernel. Figure 1 shows the error per iteration for the larger experiment.

Figure 1. Training error (green) and test error (red) for each iteration of 10 fold cross validation using 5500 training and 600 validation examples.

⁵ This data can be found at http://www.cs.berkeley.edu/~taskar/ocr/

6.2. Path planning

We present briefly a result from our implementation of the batch learning algorithm for the MMSL approach to imitation learning. Details and additional results can be found in (Ratliff et al., 2006). In this setting, the objective is to learn to predict the correct path between two points in a world given features over the underlying MDP. Examples show the sequence of decisions a teacher might make through a number of regions. In order to give approximately equal weight to each example in the objective function given by Equation 4, we chose β_i = |y_i|^{-q}, where |y_i| denotes the length of the ith example path. We used a variant of A* as our specialized algorithm to solve the inference problem represented by Equation 1.

Differing example trajectories, demonstrated in a particular region of a map, lead to significantly different behavior in a separate holdout region after learning. Figure 2 shows the results of this experiment qualitatively. The behavior presented in the top row suggests a desire to stay on the road, while the bottom row portrays a more clandestine behavior. By column from left to right, the images depict the training example presented to the algorithm, the learned cost map on a holdout region after training, and the resulting behavior produced by A* over this region.

Figure 2. Demonstration of learning to plan based on satellite color imagery. For a particular training/holdout region pair, the top row trains the learner to follow the road while the bottom row trains the learner to "hide" in the trees. From left to right, the columns depict the single training example presented, the learned cost map over the holdout region, and the corresponding learned behavior over that region. Cost scales with intensity.

7. Related and future work

The subgradient optimization presented here makes it practical to learn in problems for which straightforward QP techniques are intractable. This is exemplified in the case of path planning. In this scenario, subgradient methods for maximum margin structured learning converge quickly, while the explicit quadratic program would be too large to even represent for a generic QP solver. Recently, a number of other techniques have been proposed to solve problems of these kinds, including cutting plane (Tsochantaridis et al., 2005) and extragradient techniques (Taskar et al., 2006). The latter is applicable where inference may be written as a linear program and is also able to achieve linear convergence rates. The subgradient method has the advantage of being applicable to any problem where loss-augmented inference may be quickly solved, including by combinatorial methods. Further, our algorithm extends naturally to the online case, where sublinear regret bounds are available. It will be interesting to compare these methods on problems where they are both applicable. In recent work, (Daumé et al., 2006) have considered reinforcement learning based approaches to structured classification. Subgradient methods for (unstructured) maximum margin linear classification were considered in (Zhang, 2004). (LeCun et al., 1998) considers the use of gradient methods for learning using decoding methods such as Viterbi; our approach (if applied to sequence labeling) extends such methods to use notions of structured maximum margin.

Acknowledgements

The first two authors gratefully acknowledge the partial support of this research by the DARPA Learning for Locomotion contract.

References

Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), 2050–2057. Preliminary version in Proceedings of the 14th Conference on Neural Information Processing Systems (NIPS 2001).

Daumé III, H., Langford, J., & Marcu, D. (2006). Search-based structured prediction. In preparation.

Hazan, E., Kalai, A., Kale, S., & Agarwal, A. (2006). Logarithmic regret algorithms for online convex optimization. To appear in COLT 2006.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Nedic, A., & Bertsekas, D. (2000). Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications.

Ratliff, N., Bagnell, J. A., & Zinkevich, M. (2006). Maximum margin planning. Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006).

Shor, N. Z. (1985). Minimization Methods for Non-Differentiable Functions. Springer-Verlag.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems 16 (NIPS 2003).

Taskar, B., Lacoste-Julien, S., & Jordan, M. (2006). Structured prediction via the extragradient method. Advances in Neural Information Processing Systems 18.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.

Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004).

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003).