Subgradient Methods for Maximum Margin Structured Learning

Nathan D. Ratliff [email protected]
J. Andrew Bagnell [email protected]
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Martin A. Zinkevich [email protected]
Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E1, Canada

1. Introduction

Maximum margin structured learning (MMSL) has recently gained recognition within the machine learning community as a tractable method for large scale learning. However, most current methods are limited in terms of scalability, convergence, or memory requirements. The original Structured SMO method proposed in (Taskar et al., 2003) is slow to converge, particularly for Markov networks of even medium tree-width. Similarly, dual exponentiated gradient techniques suffer from sublinear convergence as well as often large memory requirements. Recently, (Taskar et al., 2006) have looked into saddle-point methods for optimization and have succeeded in efficiently solving several problems that would have otherwise had intractable memory requirements.

We propose an alternative gradient based approach using a regularized risk formulation of MMSL, derived by placing the constraints into the objective to create a convex function in w. This objective is then optimized by a direct generalization of gradient descent, popular in machine learning, called the subgradient method (Shor, 1985). The abundance of literature on subgradient methods makes this algorithm a decidedly convenient choice. In this case, it is well known that the subgradient method is guaranteed linear convergence when the stepsize is chosen to be constant. Furthermore, this algorithm becomes the well-studied Greedy Projection algorithm of (Zinkevich, 2003) in the online setting. Using tools developed in (Hazan et al., 2006), we can show that the risk of this online algorithm with respect to the prediction loss grows only sublinearly in time. Perhaps more importantly, the implementation of this algorithm is simple and has intuitive appeal, since an integral part of the computation comes from running the inference algorithm being trained in the inner loop.

In what follows, we review the basic formulation of maximum margin structured learning as a convex programming problem before deriving the convex objective and showing how its subgradients can be computed utilizing the specialized inference algorithm inherent to the problem. We finish with theoretical guarantees and some experimental results in two domains: sequence labeling for optical character recognition; and imitation learning for path planning in mobile robot navigation. The former problem is well known in the MMSL literature, but the latter is new to this domain. Indeed, although there is a tractable polynomial sized quadratic programming representation for the problem (Ratliff et al., 2006), solving it directly using one of the previously proposed methods would be practically intractable, for reasons similar to those that arise in directly solving the linear programming formulation of Markov Decision Processes.

2. Maximum margin structured learning

We present a brief review of maximum margin structured learning in terms of convex programming. In this setting, we attempt to predict a structured object y ∈ Y(x) from a given input x ∈ X. For our purposes we assume that the inference problem can be described in terms of a computationally tractable max over a score function s_x : Y(x) → R such that y* = arg max_{y∈Y(x)} s_x(y), and take as our hypothesis class functions of the form

    h(x; w) = \arg\max_{y \in \mathcal{Y}(x)} w^T f(x, y)    (1)

This class is parameterized by w in a convex parameter space W, and f(x, y) are vector valued feature functions. Given data D = {(x_i, y_i)}_{i=1}^n, we abbreviate f(x_i, y) as f_i(y) and Y(x_i) as Y_i.
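To make the hypothesis class of Equation 1 concrete, the following sketch (our illustration, not code from the paper) computes the argmax by brute-force enumeration; in practice Y(x) is exponentially large and this step is carried out by a specialized inference algorithm such as Viterbi or A*. The function and argument names are assumptions for the example.

```python
import numpy as np

def predict(w, x, candidates, feature_fn):
    """Eq. (1): h(x; w) = argmax_{y in Y(x)} w^T f(x, y).

    `candidates` enumerates Y(x) explicitly, which is feasible only for
    toy problems; `feature_fn(x, y)` returns the feature vector f(x, y).
    """
    scores = [float(w @ feature_fn(x, y)) for y in candidates]
    return candidates[int(np.argmax(scores))]
```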
The margin is chosen to scale with the loss of choosing class y over the desired y_i. We denote this prediction loss function by L(y_i, y) = L_i(y), and assume that L_i(y) > 0 for all y ∈ Y_i\y_i, and L_i(y_i) = 0. In learning, our goal is to find a score function that scores y_i higher than all other y ∈ Y_i\y_i by this margin. Formally, this gives us the following constraint:

    \forall i, \; \forall y \in \mathcal{Y}_i: \quad w^T f_i(y_i) \ge w^T f_i(y) + L_i(y).    (2)

Maximizing the right hand side over all y ∈ Y_i, and adding slack variables, we can express this mathematically as the following convex program:

    \min_{w, \zeta_i} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \beta_i \zeta_i^q    (3)
    \text{s.t.} \;\; \forall i: \; w^T f_i(y_i) + \zeta_i \ge \max_{y \in \mathcal{Y}_i} \big( w^T f_i(y) + L_i(y) \big)

where λ ≥ 0 is a hyperparameter that trades off constraint violations for margin maximization (i.e. fit for simplicity), q ≥ 1 defines the penalty norm, and β_i ≥ 0 are constants that scale training examples relative to each other. β_i can be used to give training examples equal weight regardless of their differing structure. See Section 6 for a concrete example of this in the case of planning.

3. Subgradient methods and MMSL

We propose rewriting Program 3 as a regularized risk function and taking subgradients of the resulting objective. This leads to a myriad of possible algorithms, the simplest of which is the subgradient method for convex optimization. This method has shown promising experimental results, and as a direct generalization of gradient descent for differentiable functions, it is easy to implement and has good theoretical properties in both batch and online settings.

The regularized risk interpretation can be easily derived by noting that the slack variables ζ_i in Program 3 are tight, and thus equal to max_{y∈Y_i}(w^T f_i(y) + L_i(y)) − w^T f_i(y_i) at the minimum.¹ We can therefore move these constraints into the objective function, simplifying the program into a single cost function:

    c(w) = \frac{1}{n}\sum_{i=1}^n \beta_i \Big( \max_{y \in \mathcal{Y}_i} \big( w^T f_i(y) + L_i(y) \big) - w^T f_i(y_i) \Big)^q + \frac{\lambda}{2}\|w\|^2    (4)

¹ The right hand term maximizes over a set that includes y_i, and L_i(y_i) = 0 by definition.

This objective is convex, but nondifferentiable; we can optimize it by utilizing the subgradient method (Shor, 1985). A subgradient of a convex function c : W → R at w is defined as a vector g_w for which

    \forall w' \in \mathcal{W}: \quad g_w^T (w' - w) \le c(w') - c(w)    (5)

Note that subgradients need not be unique, though at points of differentiability they necessarily agree with the gradient. We denote the set of all subgradients of c(·) at point w by ∂c(w).

To compute the subgradient of our objective function, we make use of the following four well known properties: (1) subgradient operators are linear; (2) the gradient is the unique subgradient of a differentiable function; (3) if f(x, y) is differentiable in x, then ∇_x f(x, y*) is a subgradient of the convex function φ(x) = max_y f(x, y) for any y* ∈ arg max_y f(x, y); (4) an analogous chain rule holds as expected. We are now equipped to compute a subgradient g_w ∈ ∂c(w) of our objective function (4):

    g_w = \frac{1}{n}\sum_{i=1}^n q \beta_i \Big( \big( w^T f_i(y^*) + L_i(y^*) \big) - w^T f_i(y_i) \Big)^{q-1} \Delta^w f_i^* + \lambda w    (6)

where y* = arg max_{y∈Y_i}(w^T f_i(y) + L_i(y)) and Δ^w f_i^* = f_i(y*) − f_i(y_i). This latter expression emphasizes that, intuitively, the subgradient compares the feature values between the example class y_i and the current loss-augmented prediction y*.

Note that computing the subgradient requires solving the problem y* = arg max_{y∈Y_i}(w^T f_i(y) + L_i(y)) for each example. If we can efficiently solve arg max_{y∈Y_i} w^T f_i(y) using a particular specialized algorithm, we can often use the same algorithm to efficiently compute this loss-augmented optimization for a particular class of loss functions, and hence efficiently compute this subgradient. Algorithm 1 details the application of the subgradient method to maximum margin structured learning.
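To illustrate how the specialized inference algorithm can be reused for the loss-augmented problem, consider sequence labeling under Hamming loss: the loss decomposes over positions, so it simply adds to the per-position scores and ordinary Viterbi solves arg max_y (w^T f_i(y) + L_i(y)) unchanged. This sketch is our own illustration under those assumptions; the score parameterization (precomputed unary and transition score arrays) is not from the paper.

```python
import numpy as np

def loss_augmented_viterbi(unary, trans, y_true):
    """argmax_y sum_t unary[t, y_t] + sum_t trans[y_{t-1}, y_t] + Hamming(y, y_true).

    unary: (T, K) array of scores w^T f for each position/label;
    trans: (K, K) array of transition scores; y_true: length-T gold labels.
    The Hamming loss adds 1 wherever a label differs from y_true, so it
    folds directly into the unary scores.
    """
    T, K = unary.shape
    aug = unary + (np.arange(K)[None, :] != np.asarray(y_true)[:, None])
    delta = np.zeros((T, K))            # best score of any prefix ending in label k
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = aug[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans  # cand[j, k]: prev label j -> label k
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + aug[t]
    y = np.zeros(T, dtype=int)
    y[-1] = int(delta[-1].argmax())
    for t in range(T - 1, 0, -1):
        y[t - 1] = back[t, y[t]]
    return y
```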
Given g_t ∈ ∂c(w_t) and α_t, the basic iterative update is

    w_{t+1} = P_{\mathcal{W}} \big[ w_t - \alpha_t g_t \big]    (7)

where P_W projects w onto a convex set W formed by any problem specific convex constraints we may impose on w.²

² It is actually sufficient that P_W be an approximate projection operator for which P_W[w] ∈ W and ∀w' ∈ W: ‖P_W[w] − w'‖ ≤ ‖w − w'‖.
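For instance, if one takes W to be a Euclidean ball (a convenient convex constraint set, chosen here purely for illustration), the projection P_W of Equation 7 has a closed form:

```python
import numpy as np

def project_l2_ball(w, radius):
    """Exact Euclidean projection onto W = {w : ||w|| <= radius}: the
    identity inside the ball, radial rescaling outside it. Any operator
    satisfying the approximate-projection property of footnote 2 would
    also be admissible in Eq. (7)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w
```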

Algorithm 1 Subgradient Method for Maximum Margin Structured Learning

1: procedure sMMSL({x_i, y_i, f_i(·), L_i(·)}_{i=1}^n, regularization parameter λ > 0, stepsize sequence {α_t} (learning rate), iterations T)
2:   t ← 1
3:   w ← 0
4:   while t ≤ T do
5:     Solve y_i^* = arg max_{y∈Y_i} (w^T f_i(y) + L_i(y)) for each i.
6:     Compute g ∈ ∂c(w) as in Equation 6.
7:     Update w ← w − α_t g.
8:     (Optional): Project w onto any additional constraints.
9:     t ← t + 1
10:  end while
11:  return w
12: end procedure
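A direct transcription of Algorithm 1 might look as follows. This is a sketch for the hinge case q = 1, where the leading factor in Equation 6 reduces to 1; the callback interface and names are our assumptions rather than the authors' implementation.

```python
import numpy as np

def subgradient_mmsl(examples, loss_aug_argmax, feature_fn,
                     lam, stepsizes, dim, beta=None, project=None):
    """Sketch of Algorithm 1 with q = 1.

    examples:        list of (x_i, y_i) training pairs
    loss_aug_argmax: (w, x, y) -> argmax_{y'} w^T f(x, y') + L(y, y')
    feature_fn:      (x, y) -> feature vector f(x, y)
    stepsizes:       the sequence {alpha_t}; its length plays the role of T
    """
    n = len(examples)
    beta = np.ones(n) if beta is None else np.asarray(beta)
    w = np.zeros(dim)
    for alpha in stepsizes:
        g = lam * w                            # gradient of (lam/2)||w||^2
        for i, (x, y) in enumerate(examples):
            y_star = loss_aug_argmax(w, x, y)  # step 5: loss-augmented inference
            # Eq. (6) with q = 1: feature difference between y* and the example
            g += (beta[i] / n) * (feature_fn(x, y_star) - feature_fn(x, y))
        w = w - alpha * g                      # step 7
        if project is not None:                # step 8 (optional)
            w = project(w)
    return w
```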

3.1. Optimization in the batch setting

In the batch setting, this algorithm is one of a well studied class of algorithms forming the subgradient method (Shor, 1985).³ Crucial to this method is the choice of stepsize sequence {α_t}, and convergence guarantees vary accordingly. Our results are developed from (Nedic & Bertsekas, 2000), who analyze incremental subgradient algorithms, of which the subgradient method is a special case.

³ The term "subgradient method" is used in lieu of "subgradient descent" because the method is not technically a descent method. Since the stepsize sequence is chosen in advance, the objective value per iteration can, and often does, increase.

Our results require a strong convexity assumption to hold for the objective function. Given W ⊆ R^d, a function f : W → R is η-strongly convex if there exists g : W → R^d such that for all w, w' ∈ W:

    f(w') \ge f(w) + g_w^T (w' - w) + \eta \|w' - w\|^2.    (8)

In our case, Equation 4 is λ/2-strongly convex.

Theorem 3.1: Linear convergence of a constant stepsize sequence. Let the stepsize sequence {α_t} of Algorithm 1 be chosen as α_t = α ≤ 1/λ. Furthermore, assume that for a particular region of radius R around the minimum, ∀w, ∀g ∈ ∂c(w): ‖g‖ ≤ C. Then the algorithm converges at a linear rate to a region of the minimum w* = arg min_{w∈W} c(w) bounded by

    \|w_{min} - w^*\| \;\le\; \sqrt{\frac{\alpha C^2}{\lambda}} \;\le\; \frac{C}{\lambda}.

Proof: (Sketch) By the strong convexity of c(w) and Proposition 2.4 of (Nedic & Bertsekas, 2000) we have

    \|w_{t+1} - w^*\|^2 \;\le\; (1 - \alpha\lambda)^{t+1} \|w_0 - w^*\|^2 + \frac{\alpha C^2}{\lambda} \;\xrightarrow{t \to \infty}\; \frac{\alpha C^2}{\lambda} \;\le\; \frac{C^2}{\lambda^2}
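For completeness, here is the unrolling that the sketch leaves implicit, under our reading that Proposition 2.4 supplies the one-step contraction ‖w_{t+1} − w*‖² ≤ (1 − αλ)‖w_t − w*‖² + α²C²:

```latex
\|w_{t+1} - w^*\|^2
  \le (1-\alpha\lambda)^{t+1}\|w_0 - w^*\|^2
      + \alpha^2 C^2 \sum_{k=0}^{t} (1-\alpha\lambda)^k
  \le (1-\alpha\lambda)^{t+1}\|w_0 - w^*\|^2 + \frac{\alpha C^2}{\lambda},
\qquad \text{since} \quad \sum_{k=0}^{t} (1-\alpha\lambda)^k \le \frac{1}{\alpha\lambda}.
```

Taking t → ∞ and a square root then yields the stated radius √(αC²/λ) ≤ C/λ.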
This theorem shows that we attain linear convergence to a small region of the minimum using a constant stepsize. Alternatively, we can choose a diminishing stepsize rule of the form α_t = γ/t for t ≥ 1, where γ is some positive constant that can be thought of as the learning rate. Under this rule, Algorithm 1 is guaranteed to converge to the minimum, but only at a sublinear rate under the above strong convexity assumption (see (Nedic & Bertsekas, 2000), Proposition 2.8).

3.2. Optimization in an online setting

In contrast with many optimization techniques, the subgradient method naturally extends from the batch setting (as presented) to an online setting. In the online setting one imagines seeing several problems on closely related domains: in particular, one may observe a domain, be required to predict within it, and only then observe the "correct" solution (or observe the corrections to the predicted solution).

At each time step t, we are given Y_t and f_t(·) with which to select a weight vector w_t and make a prediction. Once this prediction is made, we can then observe the true class y_t and update the weight vector based on the error. Thus, we can define

    c_t(w) = \frac{\lambda}{2}\|w\|^2 + \max_{y \in \mathcal{Y}_t} \big( w^T f_t(y) + L_t(y) \big) - w^T f_t(y_t) = \frac{\lambda}{2}\|w\|^2 + r_t(w)

to be the λ/2-strongly convex cost function (see Equation 4) at time t, which we can evaluate given y_t, Y_t, and f_t(·). This is now an online convex programming problem (Zinkevich, 2003), to which we will apply Greedy Projection with learning rate α_t = 1/(tλ).

(Hazan et al., 2006) have shown that with this learning rate, our online optimization problem has logarithmic regret with respect to the objective function. However, the loss we truly care about on round t is the prediction loss, L_t(y_t^*), where y_t^* is the prediction made during this round. Space doesn't permit a proof, but the following may be derived using tools from (Hazan et al., 2006):

Theorem 3.2: Sublinear regret for subgradient MMSL. Assume that the features in each state are bounded in norm by 1. Then:

    \sum_{t=1}^T L_t(y_t^*) \;\le\; \sum_{t=1}^T r_t(w^*) + \lambda T \|w^*\|^2 + \frac{1}{\lambda}(1 + \ln T)    (9)

Choosing λ = √(1 + ln T) / (‖w*‖√T), then:

    \sum_{t=1}^T L_t(y_t^*) \;\le\; \sum_{t=1}^T r_t(w^*) + \|w^*\| \sqrt{T (1 + \ln T)}    (10)

Thus, if we know our horizon T and the achievable margin, our loss grows only sublinearly with time.

Observe that Theorem 3.2 is a result about the additive loss functions L_t specifically. If we attempted instead to consider, for instance, a setting where we did not require more margin from higher loss paths (e.g. 0/1 loss on paths), our bound would be much weaker and would scale with the size of the domain.
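To make the online procedure of Section 3.2 concrete, here is a sketch of Greedy Projection with learning rate α_t = 1/(tλ) applied to the per-round costs c_t; the two-callback interface and all names are our assumptions.

```python
import numpy as np

def online_subgradient_mmsl(stream, argmax, loss_aug_argmax, feature_fn,
                            lam, dim, project=None):
    """Greedy Projection on the costs c_t of Section 3.2.

    stream:          yields (x_t, y_t); y_t is revealed only after predicting
    argmax:          (w, x) -> argmax_y w^T f(x, y)              (prediction)
    loss_aug_argmax: (w, x, y) -> argmax_y w^T f(x, y) + L_t(y)  (update)
    """
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        y_pred = argmax(w, x)              # predict before y_t is revealed
        y_star = loss_aug_argmax(w, x, y)  # then observe y_t, take a step on c_t
        g = lam * w + feature_fn(x, y_star) - feature_fn(x, y)
        w = w - g / (lam * t)              # alpha_t = 1/(t * lam)
        if project is not None:
            w = project(w)
        yield y_pred, w
```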

4. Generalization guarantees

Our online algorithm also inherits interesting generalization guarantees when applied in the batch setting. Given independent, identically distributed data, the expected loss of our algorithm can be bounded, with probability greater than or equal to 1 − δ, by the errors it makes at each step of the incremental subgradient method, using the techniques of (Cesa-Bianchi et al., 2004):⁴

    E[L_{T+1}(\bar{w})] \;\le\; \frac{1}{T}\sum_{t=1}^T r_t(w_t) + \sqrt{\frac{2}{T}\log\frac{1}{\delta}}    (11)

⁴ To achieve this result we must actually use the average weight vector w̄ computed during learning, not merely the last one.
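Footnote 4 matters in practice: the bound applies to the averaged iterate w̄ = (1/T) Σ_t w_t, not to the final weight vector. A small helper (ours, with an assumed single-step interface) maintains this average without storing the iterate history:

```python
import numpy as np

def run_with_averaging(step_fn, w0, T):
    """Run T updates while maintaining w_bar = (1/T) sum_t w_t.

    step_fn(w, t) -> new w is an assumed interface performing one update
    of the incremental subgradient method (e.g. one round of Section 3.2).
    """
    w = np.asarray(w0, dtype=float).copy()
    w_bar = np.zeros_like(w)
    for t in range(1, T + 1):
        w = step_fn(w, t)
        w_bar += (w - w_bar) / t  # running mean of the iterates w_1, ..., w_t
    return w_bar
```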

This bound is rather similar in form to previous generalization bounds given using covering number techniques (Taskar et al., 2003). Importantly, though, this approach entirely removes the dependence on the number of bits b being predicted in structured learning; most existing techniques introduce a log b factor for the number of predicted bits.

5. Slack-scaling

In principle, we can use these tools to compute subgradients of the slack-scaling formulation detailed in (Tsochantaridis et al., 2005). Disregarding the regularization, under this formulation Equation 4 becomes

    \tilde{c}(w) = \frac{1}{n}\sum_{i=1}^n \beta_i \Big( \max_{y \in \mathcal{Y}_i} L_i(y) \big( w^T (f_i(y) - f_i(y_i)) + 1 \big) \Big)^q

Multiplying by the loss inside the maximization makes this expression more restrictive in terms of optimization, since the specialized inference algorithm cannot usually be applied directly. However, when the loss function takes on only a relatively small number of values L_i(y) ∈ {l_j}_{j=1}^k for all y ∈ Y_i, we can often solve the inner inference problem for each j individually with respect to the set Y_i^{(j)} = {y ∈ Y_i | L_i(y) = l_j}. Using this we can then compute the subgradient and utilize the subgradient method as before to optimize the objective.
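The level-set decomposition just described is easy to sketch. In the code below (our illustration), `constrained_argmax` is an assumed routine returning arg max_{y ∈ Y_i^(j)} w^T f_i(y); since l_j is constant on each level set, maximizing w^T f_i(y) there suffices.

```python
import numpy as np

def slack_scaled_inner_max(w, x, y_true, levels, constrained_argmax, feature_fn):
    """Inner maximization of Section 5:
    max_y L_i(y) * (w^T (f_i(y) - f_i(y_i)) + 1), with L_i(y) in {l_1..l_k},
    solved by one constrained inference call per loss level."""
    f_true = feature_fn(x, y_true)
    best_val, best_y = -np.inf, None
    for l_j in levels:
        y_j = constrained_argmax(w, x, l_j)  # argmax of w^T f_i over {y : L_i(y) = l_j}
        val = l_j * (float(w @ (feature_fn(x, y_j) - f_true)) + 1.0)
        if val > best_val:
            best_val, best_y = val, y_j
    return best_y, best_val
```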

6. Experimental results

We present results for two problems: sequence labeling for optical character recognition (Taskar et al., 2003); and imitation learning for path planning in mobile robot navigation (Ratliff et al., 2006).

6.1. Optical character recognition

We implemented the incremental subgradient method for the sequence labeling problem originally explored by (Taskar et al., 2003), who used the Structured SMO algorithm.⁵ Running our algorithm with 600 training examples and 5500 test examples using 10 fold cross validation, as was done in (Taskar et al., 2003), we attained an average prediction error of 0.20 using a linear kernel. This result is statistically equivalent to the previously published result; however, the entire 10 fold cross validation run completed within 17 seconds. Furthermore, when running the experiment using the entire data set partitioned into 10 folds of 5500 training and 600 test examples each, we achieved a significantly lower average error of 0.13, again using the linear kernel. Figure 1 shows the error per iteration for the larger experiment.

Figure 1. Training error (green) and test error (red) for each iteration of 10 fold cross validation using 5500 training and 600 validation examples.

⁵ This data can be found at http://www.cs.berkeley.edu/~taskar/ocr/

6.2. Path planning

We present briefly a result from our implementation of the batch learning algorithm for the MMSL approach to imitation learning. Details and additional results can be found in (Ratliff et al., 2006). In this setting, the objective is to learn to predict the correct path between two points in a world given features over the underlying MDP. Examples show the sequence of decisions a teacher might make through a number of regions. In order to give approximately equal weight to each example in the objective function given by Equation 4, we chose β_i = |y_i|^{-q}, where |y_i| denotes the length of the ith example path. We used a variant of A* as our specialized algorithm to solve the inference problem represented by Equation 1.

Differing example trajectories, demonstrated in a particular region of a map, lead to significantly different behavior in a separate holdout region after learning. Figure 2 shows the results of this experiment qualitatively. The behavior presented in the top row suggests a desire to stay on the road, while the bottom row portrays a more clandestine behavior. By column from left to right, the images depict the training example presented to the algorithm, the learned cost map on a holdout region after training, and the resulting behavior produced by A* over this region.

Figure 2. Demonstration of learning to plan based on satellite color imagery. For a particular training/holdout region pair, the top row trains the learner to follow the road while the bottom row trains the learner to "hide" in the trees. From left to right, the columns depict the single training example presented, the learned cost map over the holdout region, and the corresponding learned behavior over that region. Cost scales with intensity.

7. Related and future work

The subgradient optimization presented here makes it practical to learn in problems for which straightforward QP techniques are intractable. This is exemplified in the case of path planning. In this scenario, subgradient methods for maximum margin structured learning converge quickly, while the explicit quadratic program would be too large to even represent for a generic QP solver. Recently, a number of other techniques have been proposed to solve problems of these kinds, including cutting plane (Tsochantaridis et al., 2005) and extragradient techniques (Taskar et al., 2006). The latter is applicable where inference may be written as a linear program and is also able to achieve linear convergence rates. The subgradient method has the advantage of being applicable to any problem where loss-augmented inference may be quickly solved, including by combinatorial methods. Further, our algorithm extends naturally to the online case, where sublinear regret bounds are available. It will be interesting to compare these methods on problems where they are both applicable. In recent work, (Daumé et al., 2006) have considered reinforcement learning based approaches to structured classification. Subgradient methods for (unstructured) maximum margin linear classification were considered in (Zhang, 2004). (LeCun et al., 1998) considers the use of gradient methods for learning using decoding methods such as Viterbi; our approach (if applied to sequence labeling) extends such methods to use notions of structured maximum margin.

Acknowledgements

The first two authors gratefully acknowledge the partial support of this research by the DARPA Learning for Locomotion contract.

References

Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), 2050–2057. Preliminary version in Proceedings of the 14th Conference on Neural Information Processing Systems (NIPS 2001).

Daumé III, H., Langford, J., & Marcu, D. (2006). Search-based structured prediction. In preparation.

Hazan, E., Kalai, A., Kale, S., & Agarwal, A. (2006). Logarithmic regret algorithms for online convex optimization. To appear in COLT 2006.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Nedic, A., & Bertsekas, D. (2000). Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications.

Ratliff, N., Bagnell, J. A., & Zinkevich, M. (2006). Maximum margin planning. Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006).

Shor, N. Z. (1985). Minimization Methods for Non-Differentiable Functions. Springer-Verlag.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. Advances in Neural Information Processing Systems 16 (NIPS 2003).

Taskar, B., Lacoste-Julien, S., & Jordan, M. (2006). Structured prediction via the extragradient method. Advances in Neural Information Processing Systems 18.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.

Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004).

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003).