Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Parametric Dual Maximization for Non-Convex Learning Problems

Yuxun Zhou, Zhaoyi Kang, Costas J. Spanos
Department of EECS, UC Berkeley
[email protected], [email protected], [email protected]

Abstract

We consider a class of non-convex learning problems that can be formulated as jointly optimizing regularized hinge loss and a set of auxiliary variables. Such problems encompass, but are not limited to, various versions of semi-supervised learning, learning with hidden structures, robust learning, etc. Existing methods either suffer from local minima or have to invoke a non-scalable combinatorial search. In this paper, we propose a novel learning procedure, namely Parametric Dual Maximization (PDM), that can approach global optimality efficiently with user-specified approximation levels. The building blocks of PDM are two new results: (1) the equivalent convex maximization reformulation derived by parametric analysis; (2) the improvement of local solutions based on a necessary and sufficient condition for global optimality. Experimental results on two representative applications demonstrate the effectiveness of PDM compared to other approaches.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

To enhance performance on more challenging tasks, variations of the classic large margin learning formulation have been proposed to incorporate additional modeling flexibility. To name a few, the semi-supervised SVM (S3VM) was introduced in (Bennett, Demiriz, and others 1999; Joachims 1999) to combine labeled and unlabeled samples for overall risk minimization. To learn a classifier for datasets with unobserved information, SVMs with latent variables were proposed in (Felzenszwalb et al. 2010) for object detection and in (Yu and Joachims 2009; Zhou, Hu, and Spanos 2016) for structural learning. Inasmuch as the traditional large margin classifier with hinge loss can be sensitive to outliers, the authors of (Xu, Crammer, and Schuurmans 2006) suggest a ramp loss with which a robust version of SVM is proposed.

Nonetheless, unlike the classic SVM learning objective, which possesses amiable convexity, these variations introduce non-convex learning objectives, hindering their generalization performance and scalable deployment due to optimization difficulties. In the literature, much effort has been made to obtain at least a locally optimal solution. Viewing the problem as a biconvex optimization leads to a series of alternating optimization (AO) algorithms; for example, in (Felzenszwalb et al. 2010), the latent SVM was trained by alternately solving a standard SVM and updating the latent variables. Another widely applied technique is the concave-convex procedure (CCCP) (Yuille, Rangarajan, and Yuille 2002); among many others, (Yu and Joachims 2009; Ping, Liu, and Ihler 2014) used CCCP for latent structural SVM training. Direct application of gradient-based methods is especially attractive for large-scale problems owing to their low computational cost (Bottou 2010); examples include stochastic gradient descent (SGD) for the large margin polytope machine (Kantchelian et al. 2014; Zhou, Jin, and Spanos 2015) and for S3VM (Chapelle and Zien 2005). Combinatorial optimization methods, e.g., the local search method (Joachims 1999) and branch and bound (B&B) (Chapelle, Sindhwani, and Keerthi 2006), have also been implemented for small-scale problems. It is worth mentioning that other heuristic approaches and relaxations, such as the continuation method (Chapelle, Chi, and Zien 2006) and semidefinite programming (SDP) relaxation (Bei and Cristianini 2006; Xu, Crammer, and Schuurmans 2006), have also been examined for several applications.

Yet, except for B&B, all of the aforementioned methods, i.e., AO, CCCP, and SGD, only converge to local minima and can be very sensitive to initial conditions. Although the SDP approximation yields a convex problem, the quality of the relaxation is still an open question in both theory and practice (Park and Boyd 2015). On the other hand, it has long been recognized that a globally optimal solution can deliver excellent generalization performance in situations where locally optimal solutions fail completely (Chapelle, Sindhwani, and Keerthi 2006). The major issue with B&B is its scalability: the size of the search tree can grow exponentially with the number of integer variables (Krishnamoorthy 2008), making it suitable only for small-scale problems. Interested readers are referred to (Chapelle, Sindhwani, and Keerthi 2008) for a thorough discussion.

In this work, we propose a learning procedure, namely Parametric Dual Maximization (PDM), based on a different view of the problem. We first demonstrate that the learning objectives can be rewritten as jointly optimizing regularized hinge loss and a set of auxiliary variables. Then we show that they are equivalent to non-smooth convex maximization through a series of parametric analysis techniques. Finally, we establish PDM by exploiting a necessary and sufficient global optimality condition. Our contributions are highlighted as follows.

(1) The equivalence to non-smooth convex maximization unveils a novel view of an important class of learning problems such as S3VM: we now know that they are NP-hard, but possess gentle geometric properties that allow new solution techniques. (2) We develop a set of new parametric analysis techniques, which can be reused for many other tasks, e.g., solution path calculation. (3) By checking a necessary and sufficient optimality condition, the proposed PDM can approach the global optimum efficiently with user-specified approximation levels.

The rest of the paper is organized as follows. In Section 2, we detail the reformulation of the problem with examples. In Section 3, we derive the equivalent non-smooth convex maximization by parametric analysis. In Section 4, the optimality condition is presented and the corresponding algorithm is proposed. Numerical experiments are given in Section 5.

2 A Class of Large Margin Learning

A labeled data sample is denoted as (x_i, y_i), with x_i ∈ R^d and y_i ∈ {−1, +1}. We focus on the following joint minimization problem:

$$\min_{p\in\mathcal{P}}\;\min_{w,b}\;\; \mathcal{P}(w,b;p) = \frac{1}{2}\|w\|_{\mathcal{H}}^{2} + \sum_{i=1}^{N} c_i p_i V(y_i, h_i) \tag{OPT1}$$

where h_i = κ(w, x_i) + b, with κ(·,·) a Mercer kernel function. The function V is the hinge loss, i.e., V(y_i, h_i) = max(0, 1 − y_i h_i). We call p ≜ [p_1, ..., p_N]^T ∈ P the auxiliary variable of the problem, and assume its feasible set P to be convex. Note that with p fixed, the inner problem resembles traditional large margin learning. Depending on the context, the auxiliary variable p can be regarded as hidden states or as probability assignments for the loss terms. We focus on (OPT1) in this work because many large margin learning variations, including S3VM, latent SVM, robust SVM, etc., can be rewritten in this form. The following is an example of such a reformulation.

Example 1 Consider the learning objective of the semi-supervised support vector machine (S3VM):

$$\min_{w,b,y_u}\;\; \frac{1}{2}\|w\|_{\mathcal{H}}^{2} + C_1\sum_{i=1}^{l} V(y_i, h_i) + C_2\sum_{i=l+1}^{n} V(y_i, h_i)$$

where l is the number of labeled samples, and the n − l unlabeled samples enter the loss with "tentative" labels y_u, which constitute additional variables to minimize over. Interestingly, the learning objective has the following equivalent form:

$$\min_{w,b}\;\min_{p}\;\; \frac{1}{2}\|w\|_{\mathcal{H}}^{2} + C_1\sum_{i=1}^{l} V(y_i, h_i) + C_2\sum_{i=l+1}^{n}\big[p_i V(1, h_i) + (1-p_i)V(-1, h_i)\big]$$

The equivalence is due to the fact that minimizing over p_i causes all of its mass to concentrate on the smaller of V(1, h_i) and V(−1, h_i). Formally, for any variables ξ_1, ..., ξ_M we have min_m{ξ_1, ..., ξ_M} = min_{p∈S^M} Σ_{m=1}^M p_m ξ_m, where S^M is the simplex in R^M. Due to strict feasibility and biconvexity in (w, b) and p, we can exchange the order of minimization and obtain an equivalent form of (OPT1). The variable p_i is the "probability" of y_i = 1.

Many other learning variations can be rewritten in a similar way.¹ Observing that the inner problem of (OPT1) is convex quadratic for fixed p, we replace it with its dual and cast (OPT1) into

$$\max_{p\in\mathcal{P}}\;\min_{\alpha\in A(p)}\;\; \mathcal{J}(\alpha) = \frac{1}{2}\sum_{i,j}\alpha_i y_i \kappa(x_i,x_j) y_j \alpha_j - \sum_i \alpha_i,
\quad\text{where } A(p) = \{\alpha \mid 0 \le \alpha_i \le c_i p_i\;\forall i,\; y^T\alpha = 0\} \tag{OPT2}$$

In this equivalent formulation, we can view the inner optimization as minimizing a quadratic function subject to polyhedral constraints that are parametrized by the auxiliary variable p. Assuming the kernel matrix K, defined by K_{i,j} = κ(x_i, x_j), is strictly positive,² the optimum α* is unique by strict convexity, and the solution is a function of p. Ideally, if one can write out this functional dependence explicitly, (OPT2) is essentially max_{p∈P} J(α*(p)), which maximizes over the "parameters" p of the inner problem. In the terminology of operations research and optimization, the task of analyzing the dependence of an optimal solution on multiple parameters is called parametric programming.

Inspired by this new view of (OPT2) (and hence of (OPT1)), our solution strategy is as follows: first, determine the functional J(α*(p)) by parametric analysis, and then maximize over p ∈ P by exploiting the unique properties of J(α*(p)). Note that the first step in effect involves convex quadratic parametric programming (CQPP), which has been studied in the optimization and control communities for sensitivity analysis and explicit controller design (Tondel, Johansen, and Bemporad 2003; Wachsmuth 2013). Moreover, the solution path algorithms studied in our field (Hastie et al. 2004; Karasuyama and Takeuchi 2011) can also be regarded as special cases of CQPP. Nonetheless, existing work on CQPP is technically insufficient, because (1) due to the presence of the constraint α^T y = 0, the problem at hand corresponds to a "degenerate" case for which an existing solution is still lacking, and (2) some important properties of the parametric solution, specifically its geometric structure, are not entirely revealed in prior work.

In the next section, we analyze the inner minimization parametrically. Our results not only provide the analytical form of the solution in critical regions (defined later), but also demonstrate that the overall learning problem (OPT2) is equivalent to a convex maximization.

¹ More examples of reformulation are given in the supplementary material.
² The induced matrix Q ≜ K ∘ yy^T is then also strictly positive, hence the optimization is strictly convex. For situations in which K is only positive semidefinite, a decomposition technique, detailed in the supplementary material, can be used to reduce the problem to the strictly positive case.

3 Deriving the Equivalent Convex Maximization Problem

To begin with, the inner minimization is rewritten in a more compact form:

$$\min_{\alpha}\; \mathcal{J}(\alpha) = \frac{1}{2}\alpha^{T} Q \alpha - \mathbf{1}^{T}\alpha
\quad \text{subject to} \quad C^{\alpha}\alpha \le C^{p}p + C^{0},\;\; y^{T}\alpha = 0 \tag{IO}$$

where Q_{ij} = y_i κ(x_i, x_j) y_j, and C^α, C^p and C^0 are constant matrices encapsulating the constraints.

A Mild Sufficient Condition for Existence  We first demonstrate that, interestingly, a mild sample partition condition is sufficient for the existence and uniqueness of the parametric solution of (IO).

Definition 1 (Active Constraint) Suppose a solution of (IO) has been obtained as α*(p). The ith row of the constraint is said to be active at p if C^α_i α*(p) = C^p_i p + C^0_i, and inactive if C^α_i α*(p) < C^p_i p + C^0_i. We denote the index set of active inequalities by A and that of inactive ones by A^C. We use C^α_A to denote row selection of the matrix C^α, i.e., C^α_A contains the rows of C^α whose indices are in A.

Definition 2 (Partition of Samples) Based on the value of α_i at the optimum, the ith sample is called:
• a non-support vector, denoted i ∈ O, if α*_i = 0;
• an unbounded support vector, denoted i ∈ S_u, if 0 < α*_i < c_i p_i.

Theorem 1 Assume that the solution of (IO) is non-degenerate and induces a set of active and inactive constraints A and A^C, respectively. With H, R defined previously and T ≜ H(C^α_A)^T, v ≜ C^α_A H 1, we have:
(1) The optimal solution is a continuous piecewise affine function of p, and in the critical region defined by

$$TR^{-1}\big(C^{p}_{A}p + C^{0}_{A} + v\big) \ge 0, \qquad
C^{p}_{A^{C}}p + C^{0}_{A^{C}} - C^{\alpha}_{A^{C}}\,TR^{-1}\big(C^{p}_{A}p + C^{0}_{A} + v\big) \ge 0 \tag{1}$$

the optimal solution α* of (IO) admits the closed form

$$\alpha^{*}(p) = TR^{-1}\big(C^{p}_{A}p + C^{0}_{A} + v\big) \tag{2}$$

(2) The optimal objective J(α*(p)) is a continuous piecewise quadratic (PWQ) function of p.

Remark The theorem indicates that each time the inner optimization (IO) is solved, full information in a well-defined neighborhood (the critical region) can be retrieved as a function of the auxiliary variable. Hence one can efficiently calculate the closed-form optimal solution and its gradient in that region without having to solve (IO) again. Part (2) shows that J(α*(p)) is continuous but non-smooth.

Global Structure of the Optimality  Recall that our goal is to solve max_{p∈P} J(α*(p)). In this part, we show that the problem is equivalent to a convex maximization by revealing several important geometric properties of J(α*(p)).
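As an illustration of how Theorem 1 would be used (ours, not the authors' code), the sketch below evaluates the closed form (2) and the critical-region test (1); the matrices T, R, v and the row-selected constraint blocks are assumed to have been assembled from the active set of a solved instance of (IO), exactly as named in the theorem.

```python
# Sketch of Theorem 1: inside a critical region, alpha*(p) is affine in p and can be
# evaluated without re-solving (IO). T, R, v, C^p_A, C^0_A come from the active set A;
# C^p_AC, C^0_AC, C^alpha_AC from its complement. All are assumed precomputed.
import numpy as np

def alpha_closed_form(T, R, CpA, C0A, v, p):
    """Closed form (2): alpha*(p) = T R^{-1} (C^p_A p + C^0_A + v)."""
    return T @ np.linalg.solve(R, CpA @ p + C0A + v)

def in_critical_region(T, R, CpA, C0A, v, CpAC, C0AC, CaAC, p, tol=1e-9):
    """Membership test (1): both inequality blocks must be element-wise nonnegative."""
    core = T @ np.linalg.solve(R, CpA @ p + C0A + v)
    block1 = core
    block2 = CpAC @ p + C0AC - CaAC @ core
    return bool(np.all(block1 >= -tol) and np.all(block2 >= -tol))
```

Once a region's matrices are cached, both calls reduce to a linear solve and a few matrix-vector products, which is what makes the reuse of explored critical regions (Section 4) inexpensive.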

4 Global Optimality Condition and Parametric Dual Maximization

To ease the notation, we hide the intermediate variable and denote

$$\mathcal{F}(p) \triangleq \mathcal{J}(\alpha^{*}(p)) \tag{3}$$

so that (OPT2) becomes max_{p∈P} F(p). From the properties of F(p), or J(α*(p)), given in Theorem 1 and Theorem 2, we know that the problem is in effect a convex piecewise quadratic maximization. In this section, we propose a global optimization algorithm based on an optimality condition and a level set approximation technique.

A Global Optimality Condition  Several global optimality conditions for maximizing convex functions, particularly convex quadratic functions, have been proposed before (Tsevendorj 2001; Georgiev, Chinchuluun, and Pardalos 2011). In this work, we adapt a version of Strekalovsky's condition to the non-smooth case. First of all, the level set is defined as the set of points that produce the same function value:

Definition 4 The level set of the function F at p is defined by

$$E_{\mathcal{F}}(p) = \{q \in \mathbb{R}^{n} \mid \mathcal{F}(q) = \mathcal{F}(p)\}$$

A sufficient and necessary condition for a point p* to be the global maximizer of F(p) reads:

Theorem 3 p* is a global optimal solution of the problem max_{p∈P} F(p) if and only if for all p ∈ P, q ∈ E_F(p*), g(q) ∈ ∂F(q), we have

$$(p - q)^{T} g(q) \le 0 \tag{4}$$

where ∂F(q) is the set of subgradients of F at q.

By virtue of Theorem 3, we can verify the optimality of any point p by solving

$$\Delta(p) \triangleq \max_{q \in E_{\mathcal{F}}(p),\; p \in \mathcal{P},\; g(q) \in \partial\mathcal{F}(q)} (p - q)^{T} g(q) \tag{5}$$

and checking whether Δ(p) ≤ 0. We call the above maximization the auxiliary problem at p. The major difficulty is that the level set E_F(p) is hard to compute explicitly. Next, we study a solution method for (5) by approximating the level set with a collection of representative points.

Approximate Level Set

Definition 5 Given a user-specified approximation degree m, the approximate level set for E_F(p) is defined by

$$A^{m}_{p} = \{q^{1}, q^{2}, \dots, q^{m} \mid q^{i} \in E_{\mathcal{F}}(p),\; i = 1, 2, \dots, m\}$$

Consider solving the auxiliary problem approximately by replacing E_F(p) with A^m_p; then for each q^i, (5) becomes

$$\max_{p \in \mathcal{P},\; g(q^{i}) \in \partial\mathcal{F}(q^{i})} (p - q^{i})^{T} g(q^{i}) \tag{6}$$

Since F(p) is almost everywhere differentiable, in most cases g(q^i) is unique and equals the gradient ∇F(q^i). Then the auxiliary problem is a simple linear program. In the cases when q^i is on the boundary of critical regions, ∂F(q^i) becomes a convex set, and the auxiliary problem becomes a bilinear program. A general bilinear program is hard, but fortunately (6) has disjoint feasible sets, and one can show

Proposition 1 Problem (6) is equivalent to

$$\max_{p \in \mathcal{P}}\; \max_{g(q^{i}) \in V(\partial\mathcal{F}(q^{i}))} (p - q^{i})^{T} g(q^{i}) \tag{7}$$

which indicates that the optimal solution to (6) must be at a vertex of the feasible polyhedron. As such, (6) can be expanded into a set of linear programs, each of which is substantiated by an element of A^m_p and a vertex of ∂F(q^i).

The PDM Algorithm  With the approximate auxiliary problem solved, we can immediately determine whether an improvement can be made at the current p. More specifically, let {(u^i, s^i), i = 1, ..., m} be the solution of (6) on A^m_p, i.e.,

$$(u^{i} - q^{i})^{T} s^{i} = \max_{p \in \mathcal{P},\; g(q^{i}) \in V(\partial\mathcal{F}(q^{i}))} (p - q^{i})^{T} g(q^{i}) \tag{8}$$

and define Δ(A^m_p) = max_{i=1,...,m} (u^i − q^i)^T s^i. Then, with the convexity of F, we have

Proposition 2 For any p ∈ P, if there exist q^i ∈ A^m_p, g(q^i) ∈ V(∂F(q^i)), and u^i defined in (8) such that (u^i − q^i)^T g(q^i) > 0, then we must have F(u^i) > F(p).

Now the remaining work is to construct the approximate level set given the current p and the degree m. The following lemma shows that this is possible if a global minimizer is available.

Lemma 2 Let the global minimizer of F(p) be p_*. Then for any p ≠ p_* and h ∈ R^n, there exists a unique positive scalar γ such that p_* + γh ∈ E_F(p).

With this guarantee, we write the approximate level set as

$$A^{m}_{p} = \{q^{1}, q^{2}, \dots, q^{m} \mid q^{i} = p_{*} + \gamma_{i} h^{i} \in E_{\mathcal{F}}(p)\} \tag{9}$$

To explore directions for improvement, a natural choice of the h^i is a set of orthogonal basis vectors. Specifically, we can start with a random h^1 and use the Gram-Schmidt algorithm to extend it to m orthogonal basis vectors. For each h^i, the corresponding γ_i is found by solving

$$\Phi(\gamma_{i}) \triangleq \mathcal{F}(p_{*} + \gamma_{i} h^{i}) - \mathcal{F}(p) = 0 \tag{10}$$
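The construction in (9)-(10) can be sketched as follows (an illustration under our own assumptions, not the released implementation): orthonormal directions are obtained from a QR factorization, which plays the role of Gram-Schmidt, and each γ_i is the unique positive root of Φ guaranteed by Lemma 2, bracketed and then found with Brent's method.

```python
# Sketch of the approximate level set (9)-(10). F is a callable returning F(p) = J(alpha*(p)),
# p_star its global minimizer, p the current point, m the approximation degree (m <= len(p)).
import numpy as np
from scipy.optimize import brentq

def approximate_level_set(F, p_star, p, m, seed=0):
    n = len(p_star)
    rng = np.random.default_rng(seed)
    # Orthonormal directions h^1, ..., h^m (QR of a random matrix, i.e. Gram-Schmidt).
    H, _ = np.linalg.qr(rng.standard_normal((n, m)))
    level = F(p)
    points = []
    for i in range(m):
        h = H[:, i]
        phi = lambda g, h=h: F(p_star + g * h) - level   # Phi(gamma) from (10)
        upper = 1.0
        while phi(upper) < 0:                            # expand until the unique root is bracketed
            upper *= 2.0
        gamma = brentq(phi, 0.0, upper)                  # Lemma 2: unique positive root
        points.append(p_star + gamma * h)
    return np.array(points)                              # the points q^i of A^m_p
```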

As stated in Lemma 2, the function Φ in (10) has a unique root, which can be computed efficiently with a line search method. To obtain the global minimizer, we have to solve p_* = argmin_p F(p), which is a convex minimization problem. Using Theorem 1, we show (in the supplementary material) that a subgradient descent method with T iterations converges to the global minimum within O(1/√T).

Organizing all the building blocks developed so far, we summarize the PDM procedure in Algorithm 1.

Algorithm 1 Parametric Dual Maximization
  Choose p^(0) ∈ P; set k = 0; compute p_* with subgradient descent.
  while k ≤ iter_max do
    Starting from p^(k), find a local maximizer r^(k) ∈ P with a local solver.
    Construct A^m_{r^(k)} at r^(k) by (9), (10); solve (IO) if a new critical region is encountered, otherwise use (2).
    for q^i ∈ A^m_{r^(k)} do
      for g^j ∈ V(∂F(q^i)) do
        Solve u_{ij} = argmax_{p∈P} (p − q^i)^T g^j
      end for
      Let j* = argmax_j (u_{ij} − q^i)^T g^j; set (u^i, s^i) = (u_{ij*}, g^{j*})
    end for
    Let i* = argmax_i (u^i − q^i)^T s^i; set u^(k) = u^{i*}
    if (u^{i*} − q^{i*})^T s^{i*} > 0 then
      Set p^(k+1) = u^(k); k = k + 1    {improvement found}
    else
      Terminate and output p^(k)    {optimality verified}
    end if
    Collect the explored critical regions and the explicit forms given in (1), (2).
  end while

Given the current solution p^(k), the algorithm first tries to improve it with existing methods such as AO, CCCP, SGD, etc. After finding a local solution r^(k), the approximate level set A^m_{r^(k)} is obtained by solving (10) and constructing (9). With A^m_{r^(k)} and the current subgradients, one or several linear programs are solved to pick the vector u^(k) that maximizes the quantity in condition (4) of Theorem 3. If this maximal value, i.e., Δ(A^m_p), is greater than 0, then by Proposition 2, u^(k) is a strictly improved solution compared to r^(k), and the algorithm continues with p^(k+1) = u^(k). Otherwise, if Δ(A^m_p) ≤ 0, the algorithm terminates, since no improvement can be found at the current point with the user-specified approximation degree. For convergence, we have

Theorem 4 Algorithm 1 generates a sequence {p^(1), ..., p^(k), ...} with non-decreasing function values. The sequence converges to an approximate maximizer of F(p) in a finite number of steps.

In each iteration, we only have to solve m|V(∂F(q^i))| linear programs, and in most cases |V(∂F(q^i))| = 1 due to the almost-everywhere differentiability shown in Theorem 2. When constructing the approximate level set, we need to solve at most m convex quadratic programs of type (IO), which seems computationally expensive. However, this problem resembles the classic SVM dual, hence a variety of existing methods can be reused for acceleration (Chang and Lin 2011). Moreover, by virtue of the optimality structure revealed in Theorems 1 and 2, a list of explored critical regions and the corresponding explicit optimizers can be stored; if the current p is on this list, all information can be retrieved in explicit form and there is no need to solve the quadratic program again. To further accelerate the algorithm, one can "enlarge" critical regions; see the supplementary material for a discussion.

5 Experiments

In this section, we report the optimization and generalization performance of PDM for the training of S3VM and the latent SVM (LSVM). More results and a Matlab implementation can be found online.

Datasets and Experiment Setup  Details of the datasets are listed in Table 1. For S3VM, we report results on four popular data sets for semi-supervised learning, i.e., 2moons (D1), Coil (D2), Robot (D3) and 2spiral (D4, with simulator). In each experiment, 60% of the samples are used for training, of which only a small portion are assumed to be labeled. 10% of the data are used as a validation set for choosing hyperparameters, and with the remaining 30% we evaluate the generalization performance. For LSVM we adopt the same training, validation and testing partition on the Vowel (D5), Music (D6), Bank (D7) and Wave (D8, with simulator) data sets. To create a latent data structure, we assume only grouped binary labels are known.

Table 1: Data sets. D1-D4 for S3VM and D5-D8 for LSVM.

Data set   ID   # classes   # samples   # features   labeled
2moons     D1   2           200         2            2
Coil       D2   3           216         1024         6
Robot      D3   4           2456        25           40
2spiral    D4   2           100000      2            4
Vowel      D5   10          990         11           grouped
Music      D6   10          2059        68           grouped
Bank       D7   9           7166        649          grouped
Wave       D8   30          100000      40           grouped

The Gaussian kernel κ(x, y) = exp(−||x − y||²/2σ²) is used for all experiments. Following model selection suggestions (Chapelle, Sindhwani, and Keerthi 2008; Felzenszwalb et al. 2010), the best hyperparameter combination (C_1, C_2, σ²) is chosen by cross-validation from C_1 ∈ {10^{0:0.5:3}}, σ² ∈ {(1/2)^{−3:1:3}}, and C_2 ∈ {10^{−8:1:0}} for S3VM and C_2 ∈ {10^{−4:1:4}} for LSVM. A simple gradient ascent is used as the local solver for PDM. All experiments are conducted on a workstation with dual Xeon X5687 CPUs and 72GB of memory.

[Figure 1: PDM at each iteration of S3VM training on the D1 dataset (randomized initialization; m = 20). Panels: objective value and testing accuracy versus iteration, and the evolution of the auxiliary variables p (entries with p ≤ 0.5 and p > 0.5).]

Small-scale Demo  To get more intuition about how PDM works, we use it to train S3VM on the D1 dataset and plot the iterative evolution of the objective function (−J), the testing accuracy, and the values of p in Figure 1. The approximation level m is set to 0.1·length(p) = 20, and the initial p^(0) is chosen randomly. We observe that PDM converges within 12 iterations (top left subfigure). The testing accuracy increases from 48% to above 98% (top right subfigure), showing improvements in both optimization and generalization performance. Moreover, the auxiliary variable p approaches the global optimum even with random initial values (bottom subfigures). Note that in this process a total of 36 (IO) problems are solved, and about 2/3 of the critical regions are reused more than once.
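To make the per-iteration work of Algorithm 1 concrete, the following sketch (illustrative only) solves one linear program per level-set point and applies the check of Proposition 2; the feasible set P is modeled here as a box with optional linear inequalities, and the subgradient at each q^i is assumed unique, as in the almost-everywhere differentiable case.

```python
# Sketch of one improvement step of Algorithm 1: for each q^i in the approximate level
# set with gradient g^i = grad F(q^i), solve u^i = argmax_{p in P} (p - q^i)^T g^i and
# keep the best candidate. A strictly positive value certifies an improvement by
# Proposition 2. P is modeled here as {p : A_ub p <= b_ub, 0 <= p <= 1} (an assumption).
import numpy as np
from scipy.optimize import linprog

def improvement_step(level_points, gradients, A_ub=None, b_ub=None):
    best_val, best_u = -np.inf, None
    for q, g in zip(level_points, gradients):
        # linprog minimizes, so maximize g^T p by minimizing (-g)^T p over P.
        res = linprog(-g, A_ub=A_ub, b_ub=b_ub, bounds=(0.0, 1.0), method="highs")
        if not res.success:
            continue
        val = float((res.x - q) @ g)          # the quantity (u^i - q^i)^T g(q^i) of (8)
        if val > best_val:
            best_val, best_u = val, res.x
    return best_u if best_val > 0 else None   # None signals the termination test Delta(A^m_p) <= 0
```

With m = 20 as in the demo above, each iteration amounts to 20 such LPs, consistent with the m|V(∂F(q^i))| count discussed in Section 4.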

Table 2: Normalized objective value of (OPT1) (first row for each dataset; lower is better) and time usage (second row for each dataset; s = seconds, h = hours). D1-D4: S3VM; D5-D8: LSVM.

Data   GD      CCCP    AO      LCS     IA      BB     PDM1    PDM2
D1     2.39    2.82    4.83    5.55    1.79    1.00   1.03    1.00
       1.7s    6.2s    2.7s    6.7s    3.4s    210s   16s     35s
D2     3.74    3.92    3.46    4.98    2.35    1.00   1.19    1.03
       5.3s    6.8s    4.3s    7.9s    5.6s    362s   43s     83s
D3     3.95    4.23    3.48    6.96    2.85    *      1.11    1.00
       33s     56s     28s     43s     27s     *      231s    489s
D4     6.98    4.91    4.90    6.16    4.22    *      1.31    1.00
       0.19h   0.41h   0.33h   0.37h   0.46h   *      1.4h    2.7h
D5     4.45    5.31    4.85    4.09    *       *      1.13    1.00
       26s     54s     33s     68s     *       *      209s    451s
D6     6.51    5.34    4.77    6.82    *       *      1.28    1.00
       63s     90s     72s     101s    *       *      468s    997s
D7     6.78    7.69    4.17    6.22    *       *      1.26    1.00
       326s    371s    263s    477s    *       *      1217s   2501s
D8     10.2    5.16    6.35    7.57    *       *      1.54    1.00
       0.23h   0.73h   0.66h   0.93h   *       *      2.5h    4.8h

Table 3: Generalization performance (error rates, %), averaged over 10 random data partitions. Error rates greater than or close to 50% should be interpreted as "failed".

Data   GD     CCCP   AO     LCS    IA     BB    PDM1   PDM2
D1     51.4   60.0   52.8   65.5   37.5   0.0   1.9    0.2
D2     57.9   66.1   47.9   61.1   57.2   0.0   5.3    1.1
D3     26.6   29.3   59.8   38.8   27.4   *     9.5    3.3
D4     52.1   39.8   40.0   45.4   31.4   *     3.5    2.0
D5     15.8   16.2   13.5   9.9    *      *     2.5    1.7
D6     39.8   43.7   40.8   39.4   *      *     12.1   7.6
D7     20.0   19.4   19.8   22.5   *      *     8.9    5.1
D8     53.1   36.7   39.7   46.2   *      *     19.9   13.1

Optimization and Generalization Performance  We next compare PDM with different optimization methods in terms of their optimization and generalization performance. The algorithms considered for S3VM training are: gradient descent (GD) (Chapelle and Zien 2005), CCCP (Collobert et al. 2006), alternating optimization (AO) (Sindhwani, Keerthi, and Chapelle 2006), local combinatorial search (LCS) (Joachims 1999), infinitesimal annealing (IA) (Ogawa et al. 2013), and branch and bound (BB) (Chapelle, Sindhwani, and Keerthi 2006). The algorithms included for LSVM are GD (Kantchelian et al. 2014), CCCP (Yu and Joachims 2009), AO (Dundar et al. 2008), and an adapted LCS (Joachims 1999). The proposed PDM is tested in two versions, with approximation degree m = 0.1·length(p) (PDM1) and m = 0.2·length(p) (PDM2).

In Table 2, the objective function values of (OPT1) (normalized by the smallest one) are shown in the upper row, and the corresponding computation times are given in the second row for each dataset. Note that although BB provides the exact global optimum for the small data sets D1 and D2, it runs out of memory (72GB!) for the other datasets due to the exponential growth of its search tree. In contrast, PDM1 and PDM2 provide near-optimal solutions relative to BB with much less time and space usage. For the larger data sets (D4-D8), on which BB cannot be executed, PDM outperforms all the other local optimization methods: PDM achieves a significantly improved objective value, and the runner-up is at least 2.8 times larger. Although its running time is longer than that of the local methods, PDM is still scalable (D4 and D8 have 10^5 samples), hence it can be carried out for large-scale problems.

In Table 3, we compare the generalization performance of the different algorithms in terms of testing error rate. Clearly, the globally optimal solutions provided by BB and PDM yield excellent generalization error rates, while the other local optimization methods perform much worse and can even fail completely (e.g., on D1, D2, D4, D8). This observation is consistent with previous findings (Chapelle, Sindhwani, and Keerthi 2006; Chapelle, Sindhwani, and Keerthi 2008), justifying the extra computational overhead required to pursue the global optimum.

Choice of Approximation Degree m  Comparing PDM1 and PDM2 in Tables 2 and 3, we note that, in general, increasing the approximation degree m produces better optimization and generalization performance. To investigate the effect of m, we use PDM to train S3VM on D3 and plot in Figure 2 the optimal value, testing accuracy, and time and space usage as functions of m (from 80 to 650). It appears that increasing m beyond a large enough value (e.g., 300 in Figure 2) provides only marginal improvement in both training and testing. Also, seeing that the computational time grows (slightly) super-linearly and the space usage grows almost linearly, we suggest using m ∈ [0.1·length(p), 0.2·length(p)] as a tradeoff between training/testing accuracy and computational overhead.

[Figure 2: The effect of m for PDM on the D3 dataset; averages and confidence intervals over 50 runs. Panels: objective value, testing accuracy, time usage (sec), and memory usage (GB) as functions of m.]

6 Conclusion

In this paper we propose a novel global optimization procedure, PDM, to solve a class of non-convex learning problems. Our parametric analysis reveals an entirely different perspective: this class of learning problems is equivalent to maximizing a convex PWQ function. We then develop the PDM algorithm based on a global optimality condition for non-smooth convex maximization. Experimental results justify the effectiveness of PDM with regard to both optimization and generalization performance.

7 Acknowledgments

This research is funded by the Republic of Singapore's National Research Foundation through a grant to the Berkeley Education Alliance for Research in Singapore (BEARS) for the Singapore-Berkeley Building Efficiency and Sustainability in the Tropics (SinBerBEST) Program. BEARS has been established by the University of California, Berkeley as a center for intellectual excellence in research and education in Singapore.

References

Bei, T., and Cristianini, N. 2006. Semi-supervised learning using semi-definite programming. In Semi-Supervised Learning. MIT Press. 177–186.
Bennett, K.; Demiriz, A.; et al. 1999. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 368–374.
Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer. 177–186.
Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
Chapelle, O., and Zien, A. 2005. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, volume 1, 57–64.
Chapelle, O.; Chi, M.; and Zien, A. 2006. A continuation method for semi-supervised SVMs. In Proceedings of the 23rd International Conference on Machine Learning, 185–192. ACM.
Chapelle, O.; Sindhwani, V.; and Keerthi, S. S. 2006. Branch and bound for semi-supervised support vector machines. In Advances in Neural Information Processing Systems, 217–224.
Chapelle, O.; Sindhwani, V.; and Keerthi, S. S. 2008. Optimization techniques for semi-supervised support vector machines. The Journal of Machine Learning Research 9:203–233.
Collobert, R.; Sinz, F.; Weston, J.; and Bottou, L. 2006. Large scale transductive SVMs. The Journal of Machine Learning Research 7:1687–1712.
Dundar, M. M.; Wolf, M.; Lakare, S.; Salganicoff, M.; and Raykar, V. C. 2008. Polyhedral classifier for target detection: a case study: colorectal cancer. In ICML.
Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645.
Georgiev, P. G.; Chinchuluun, A.; and Pardalos, P. M. 2011. Optimality conditions of first order for global minima of locally Lipschitz functions. Optimization 60(1-2):277–282.
Hastie, T.; Rosset, S.; Tibshirani, R.; and Zhu, J. 2004. The entire regularization path for the support vector machine. The Journal of Machine Learning Research 5:1391–1415.
Joachims, T. 1999. Transductive inference for text classification using support vector machines. In ICML, volume 99, 200–209.
Kantchelian, A.; Tschantz, M. C.; Huang, L.; Bartlett, P. L.; Joseph, A. D.; and Tygar, J. 2014. Large-margin convex polytope machine. In Advances in Neural Information Processing Systems, 3248–3256.
Karasuyama, M., and Takeuchi, I. 2011. Suboptimal solution path algorithm for support vector machine. In ICML.
Krishnamoorthy, B. 2008. Bounds on the size of branch-and-bound proofs for integer knapsacks. Operations Research Letters 36(1):19–25.
Ogawa, K.; Imamura, M.; Takeuchi, I.; and Sugiyama, M. 2013. Infinitesimal annealing for training semi-supervised support vector machines. In Proceedings of the 30th International Conference on Machine Learning, 897–905.
Park, J., and Boyd, S. 2015. A semidefinite programming method for integer convex quadratic minimization. arXiv preprint arXiv:1504.07672.
Ping, W.; Liu, Q.; and Ihler, A. 2014. Marginal structured SVM with hidden variables. arXiv preprint arXiv:1409.1320.
Sindhwani, V.; Keerthi, S. S.; and Chapelle, O. 2006. Deterministic annealing for semi-supervised kernel machines. In Proceedings of the 23rd International Conference on Machine Learning, 841–848. ACM.
Tondel, P.; Johansen, T. A.; and Bemporad, A. 2003. An algorithm for multi-parametric quadratic programming and explicit MPC solutions. Automatica.
Tsevendorj, I. 2001. Piecewise-convex maximization problems. Journal of Global Optimization 21(1):1–14.
Wachsmuth, G. 2013. On LICQ and the uniqueness of Lagrange multipliers. Operations Research Letters 41(1):78–80.
Xu, L.; Crammer, K.; and Schuurmans, D. 2006. Robust support vector machine training via convex outlier ablation. In AAAI, volume 6, 536–542.
Yu, C.-N. J., and Joachims, T. 2009. Learning structural SVMs with latent variables. In Proceedings of the 26th International Conference on Machine Learning, 1169–1176.
Yuille, A. L.; Rangarajan, A.; and Yuille, A. 2002. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems 2:1033–1040.
Zhou, Y.; Hu, N.; and Spanos, C. J. 2016. Veto-consensus multiple kernel learning. In Thirtieth AAAI Conference on Artificial Intelligence.
Zhou, Y.; Jin, B.; and Spanos, C. J. 2015. Learning convex piecewise linear machine for data-driven optimal control. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 966–972. IEEE.
