Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Parametric Dual Maximization for Non-Convex Learning Problems

Yuxun Zhou, Zhaoyi Kang, Costas J. Spanos
Department of EECS, UC Berkeley
[email protected], [email protected], [email protected]

Abstract

We consider a class of non-convex learning problems that can be formulated as jointly optimizing regularized hinge loss and a set of auxiliary variables. Such problems encompass, but are not limited to, various versions of semi-supervised learning, learning with hidden structures, robust learning, etc. Existing methods either suffer from local minima or have to invoke a non-scalable combinatorial search. In this paper, we propose a novel learning procedure, namely Parametric Dual Maximization (PDM), that can approach global optimality efficiently with user-specified approximation levels. The building blocks of PDM are two new results: (1) the equivalent convex maximization reformulation derived by parametric analysis; (2) the improvement of local solutions based on a necessary and sufficient condition for global optimality. Experimental results on two representative applications demonstrate the effectiveness of PDM compared to other approaches.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

To enhance performance on more challenging tasks, variations of the classic large margin learning formulation have been proposed to incorporate additional modeling flexibility. To name a few, the semi-supervised SVM (S3VM) was introduced in (Bennett, Demiriz, and others 1999; Joachims 1999) to combine labeled and unlabeled samples for overall risk minimization. To learn a classifier for datasets with unobserved information, SVMs with latent variables were proposed in (Felzenszwalb et al. 2010) for object detection and in (Yu and Joachims 2009; Zhou, Hu, and Spanos 2016) for structural learning. Inasmuch as the traditional large margin classifier with hinge loss can be sensitive to outliers, the authors of (Xu, Crammer, and Schuurmans 2006) suggest a ramp loss with which a robust version of SVM is proposed.

Nonetheless, unlike the classic SVM learning objective, which possesses amiable convexity, these variations introduce non-convex learning objectives, hindering their generalization performance and scalable deployment due to optimization difficulties. In the literature, much effort has been made to obtain at least a locally optimal solution. Viewing the problem as a biconvex optimization leads to a series of alternating optimization (AO) algorithms; for example, in (Felzenszwalb et al. 2010), the latent SVM was trained by alternately solving a standard SVM and updating the latent variables. Another widely applied technique is the concave-convex procedure (CCCP) (Yuille, Rangarajan, and Yuille 2002); among many others, (Yu and Joachims 2009; Ping, Liu, and Ihler 2014) used CCCP for latent structural SVM training. Direct application of gradient-based methods is especially attractive for large-scale problems owing to their low computational cost (Bottou 2010); examples include stochastic gradient descent (SGD) for the large margin polytope machine (Kantchelian et al. 2014; Zhou, Jin, and Spanos 2015) and for S3VM (Chapelle and Zien 2005). Combinatorial optimization methods, e.g., the local search method (Joachims 1999) and branch and bound (B&B) (Chapelle, Sindhwani, and Keerthi 2006), have also been implemented for small-scale problems. It is worth mentioning that other heuristic approaches and relaxations, such as the continuation method (Chapelle, Chi, and Zien 2006) and semidefinite programming (SDP) relaxation (Bei and Cristianini 2006; Xu, Crammer, and Schuurmans 2006), have also been examined for several applications.

Yet, except for B&B, all of the aforementioned methods, i.e., AO, CCCP, and SGD, only converge to local minima and can be very sensitive to initial conditions. Although the SDP approximation yields a convex problem, the quality of the relaxation is still an open question in both theory and practice (Park and Boyd 2015). On the other hand, it has long been recognized that a globally optimal solution can deliver excellent generalization performance in situations where locally optimal solutions fail completely (Chapelle, Sindhwani, and Keerthi 2006). The major issue with B&B is its scalability: the size of the search tree can grow exponentially with the number of integer variables (Krishnamoorthy 2008), making it suitable only for small-scale problems. Interested readers are referred to (Chapelle, Sindhwani, and Keerthi 2008) for a thorough discussion.

In this work, we propose a learning procedure, namely Parametric Dual Maximization (PDM), based on a different view of the problem. We first demonstrate that the learning objectives can be rewritten as jointly optimizing regularized hinge loss and a set of auxiliary variables. Then we show that they are equivalent to non-smooth convex maximization through a series of parametric analysis techniques. Finally, we establish PDM by exploiting a necessary and sufficient global optimality condition. Our contributions are highlighted as follows.

(1) The equivalence to non-smooth convex maximization unveils a novel view of an important class of learning problems such as S3VM: we now know that they are NP-hard, but possess gentle geometric properties that allow new solution techniques. (2) We develop a set of new parametric analysis techniques, which can be reused for many other tasks, e.g., solution path calculation. (3) By checking a necessary and sufficient optimality condition, the proposed PDM can approach the global optimum efficiently with user-specified approximation levels.

The rest of the paper is organized as follows. In Section 2, we detail the reformulation of the problem with examples. In Section 3, we derive the equivalent non-smooth convex maximization by parametric analysis. In Section 4, the optimality condition is presented and the corresponding algorithm is proposed. Numerical experiments are given in Section 5.

2 A Class of Large Margin Learning

A labeled data sample is denoted as (x_i, y_i), with x_i ∈ R^d and y_i ∈ {−1, +1}. We focus on the following joint minimization problem:

$$\min_{p\in\mathcal{P}}\;\min_{w,b}\;\; \mathcal{P}(w,b;p) = \frac{1}{2}\|w\|_{\mathcal{H}}^{2} + \sum_{i=1}^{N} c_i p_i V(y_i, h_i) \tag{OPT1}$$

where h_i = κ(w, x_i) + b, with κ(·,·) a Mercer kernel function. The function V is the hinge loss, i.e., V(y_i, h_i) = max(0, 1 − y_i h_i). We call p ≜ [p_1, ..., p_N]^T ∈ P the auxiliary variable of the problem, and assume its feasible set P to be convex. Note that with p fixed, the inner problem resembles traditional large margin learning. Depending on the context, the auxiliary variable p can be regarded as hidden states or as probability assignments for the loss terms. We focus on (OPT1) in this work because many large margin learning variations, including S3VM, latent SVM, robust SVM, etc., can be rewritten in this form. The following is an example of such a reformulation.

Example 1 Consider the learning objective of the semi-supervised support vector machine (S3VM):

$$\min_{w,b,y_u}\;\; \frac{1}{2}\|w\|_{\mathcal{H}}^{2} + C_1\sum_{i=1}^{l} V(y_i, h_i) + C_2\sum_{i=l+1}^{n} V(y_i, h_i)$$

where l is the number of labeled samples, and the n − l unlabeled samples enter the loss with "tentative" labels y_u, which constitute additional variables to minimize over. Interestingly, the learning objective has the following equivalent form:

$$\min_{w,b}\;\min_{p}\;\; \frac{1}{2}\|w\|_{\mathcal{H}}^{2} + C_1\sum_{i=1}^{l} V(y_i, h_i) + C_2\sum_{i=l+1}^{n}\big[p_i V(1, h_i) + (1-p_i)V(-1, h_i)\big]$$

The equivalence is due to the fact that minimizing over p_i causes all of its mass to concentrate on the smaller of V(1, h_i) and V(−1, h_i). Formally, for any variables ξ_1, ..., ξ_M we have min_m{ξ_1, ..., ξ_M} = min_{p∈S^M} Σ_{m=1}^M p_m ξ_m, where S^M is the simplex in R^M. Due to strict feasibility and biconvexity in (w, b) and p, we can exchange the order of minimization and obtain an equivalent form of (OPT1). The variable p_i is the "probability" of y_i = 1.

Many other learning variations can be rewritten in a similar way.¹ Observing that the inner problem of (OPT1) is convex quadratic for fixed p, we replace it with its dual and cast (OPT1) into

$$\max_{p\in\mathcal{P}}\;\min_{\alpha\in A(p)}\;\; \mathcal{J}(\alpha) = \frac{1}{2}\sum_{i,j}\alpha_i y_i \kappa(x_i,x_j) y_j \alpha_j - \sum_i \alpha_i,
\quad\text{where } A(p) = \{\alpha \mid 0 \le \alpha_i \le c_i p_i\;\forall i,\; y^T\alpha = 0\} \tag{OPT2}$$

In this equivalent formulation, we can view the inner optimization as minimizing a quadratic function subject to polyhedral constraints that are parametrized by the auxiliary variable p. Assuming the kernel matrix K, defined by K_{i,j} = κ(x_i, x_j), is strictly positive,² the optimum α* is unique by strict convexity, and the solution is a function of p. Ideally, if one can write out this functional dependence explicitly, (OPT2) is essentially max_{p∈P} J(α*(p)), which maximizes over the "parameters" p of the inner problem. In the terminology of operations research and optimization, the task of analyzing the dependence of an optimal solution on multiple parameters is called parametric programming.

Inspired by this new view of (OPT2) (and hence of (OPT1)), our solution strategy is as follows: first, determine the functional J(α*(p)) by parametric analysis, and then maximize over p ∈ P by exploiting the unique properties of J(α*(p)). Note that the first step in effect involves convex quadratic parametric programming (CQPP), which has been studied in the optimization and control communities for sensitivity analysis and explicit controller design (Tondel, Johansen, and Bemporad 2003; Wachsmuth 2013). Moreover, the solution path algorithms studied in our field (Hastie et al. 2004; Karasuyama and Takeuchi 2011) can also be regarded as special cases of CQPP. Nonetheless, existing work on CQPP is technically insufficient, because (1) due to the presence of the constraint α^T y = 0, the problem at hand corresponds to a "degenerate" case for which an existing solution is still lacking, and (2) some important properties of the parametric solution, specifically its geometric structure, are not entirely revealed in prior work.

In the next section, we analyze the inner minimization parametrically. Our results not only provide the analytical form of the solution in critical regions (defined later), but also demonstrate that the overall learning problem (OPT2) is equivalent to a convex maximization.

¹ More examples of reformulation are given in the supplementary material.
² The induced matrix Q ≜ K ∘ yy^T is then also strictly positive, hence the optimization is strictly convex. For situations in which K is only positive semidefinite, a decomposition technique, detailed in the supplementary material, can be used to reduce the problem to the strictly positive case.

3 Deriving the Equivalent Convex Maximization Problem

To begin with, the inner minimization is rewritten in a more compact form:

$$\min_{\alpha}\; \mathcal{J}(\alpha) = \frac{1}{2}\alpha^{T} Q \alpha - \mathbf{1}^{T}\alpha
\quad \text{subject to} \quad C^{\alpha}\alpha \le C^{p}p + C^{0},\;\; y^{T}\alpha = 0 \tag{IO}$$

where Q_{ij} = y_i κ(x_i, x_j) y_j, and C^α, C^p and C^0 are constant matrices encapsulating the constraints.

A Mild Sufficient Condition for Existence  We first demonstrate that, interestingly, a mild sample partition condition is sufficient for the existence and uniqueness of the parametric solution of (IO).

Definition 1 (Active Constraint) Suppose a solution of (IO) has been obtained as α*(p). The ith row of the constraint is said to be active at p if C^α_i α*(p) = C^p_i p + C^0_i, and inactive if C^α_i α*(p) < C^p_i p + C^0_i. We denote the index set of active inequalities by A and that of inactive ones by A^C. We use C^α_A to denote row selection of the matrix C^α, i.e., C^α_A contains the rows of C^α whose indices are in A.

Definition 2 (Partition of Samples) Based on the value of α_i at the optimum, the ith sample is called:
• a non-support vector, denoted i ∈ O, if α*_i = 0;
• an unbounded support vector, denoted i ∈ S_u, if 0 < α*_i < c_i p_i.

Theorem 1 Assume that the solution of (IO) is non-degenerate and induces a set of active and inactive constraints A and A^C, respectively. With H, R defined previously and T ≜ H(C^α_A)^T, v ≜ C^α_A H 1, we have:
(1) The optimal solution is a continuous piecewise affine function of p, and in the critical region defined by

$$TR^{-1}\big(C^{p}_{A}p + C^{0}_{A} + v\big) \ge 0, \qquad
C^{p}_{A^{C}}p + C^{0}_{A^{C}} - C^{\alpha}_{A^{C}}\,TR^{-1}\big(C^{p}_{A}p + C^{0}_{A} + v\big) \ge 0 \tag{1}$$

the optimal solution α* of (IO) admits the closed form

$$\alpha^{*}(p) = TR^{-1}\big(C^{p}_{A}p + C^{0}_{A} + v\big) \tag{2}$$

(2) The optimal objective J(α*(p)) is a continuous piecewise quadratic (PWQ) function of p.

Remark The theorem indicates that each time the inner optimization (IO) is solved, full information in a well-defined neighborhood (the critical region) can be retrieved as a function of the auxiliary variable. Hence one can efficiently calculate the closed-form optimal solution and its gradient in that region without having to solve (IO) again. Part (2) shows that J(α*(p)) is continuous but non-smooth.

Global Structure of the Optimality  Recall that our goal is to solve max_{p∈P} J(α*(p)). In this part, we show that the problem is equivalent to a convex maximization by revealing several important geometric properties of J(α*(p)).
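As an illustration of how Theorem 1 would be used (ours, not the authors' code), the sketch below evaluates the closed form (2) and the critical-region test (1); the matrices T, R, v and the row-selected constraint blocks are assumed to have been assembled from the active set of a solved instance of (IO), exactly as named in the theorem.

```python
# Sketch of Theorem 1: inside a critical region, alpha*(p) is affine in p and can be
# evaluated without re-solving (IO). T, R, v, C^p_A, C^0_A come from the active set A;
# C^p_AC, C^0_AC, C^alpha_AC from its complement. All are assumed precomputed.
import numpy as np

def alpha_closed_form(T, R, CpA, C0A, v, p):
    """Closed form (2): alpha*(p) = T R^{-1} (C^p_A p + C^0_A + v)."""
    return T @ np.linalg.solve(R, CpA @ p + C0A + v)

def in_critical_region(T, R, CpA, C0A, v, CpAC, C0AC, CaAC, p, tol=1e-9):
    """Membership test (1): both inequality blocks must be element-wise nonnegative."""
    core = T @ np.linalg.solve(R, CpA @ p + C0A + v)
    block1 = core
    block2 = CpAC @ p + C0AC - CaAC @ core
    return bool(np.all(block1 >= -tol) and np.all(block2 >= -tol))
```

Once a region's matrices are cached, both calls reduce to a linear solve and a few matrix-vector products, which is what makes the reuse of explored critical regions (Section 4) inexpensive.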

4 Global Optimality Condition and Parametric Dual Maximization

To ease the notation, we hide the intermediate variable and denote

$$\mathcal{F}(p) \triangleq \mathcal{J}(\alpha^{*}(p)) \tag{3}$$

so that (OPT2) becomes max_{p∈P} F(p). From the properties of F(p), or J(α*(p)), given in Theorem 1 and Theorem 2, we know that the problem is in effect a convex piecewise quadratic maximization. In this section, we propose a global optimization algorithm based on an optimality condition and a level set approximation technique.

A Global Optimality Condition  Several global optimality conditions for maximizing convex functions, particularly convex quadratic functions, have been proposed before (Tsevendorj 2001; Georgiev, Chinchuluun, and Pardalos 2011). In this work, we adapt a version of Strekalovsky's condition to the non-smooth case. First of all, the level set is defined as the set of points that produce the same function value:

Definition 4 The level set of the function F at p is defined by

$$E_{\mathcal{F}}(p) = \{q \in \mathbb{R}^{n} \mid \mathcal{F}(q) = \mathcal{F}(p)\}$$

A sufficient and necessary condition for a point p* to be the global maximizer of F(p) reads:

Theorem 3 p* is a global optimal solution of the problem max_{p∈P} F(p) if and only if for all p ∈ P, q ∈ E_F(p*), g(q) ∈ ∂F(q), we have

$$(p - q)^{T} g(q) \le 0 \tag{4}$$

where ∂F(q) is the set of subgradients of F at q.

By virtue of Theorem 3, we can verify the optimality of any point p by solving

$$\Delta(p) \triangleq \max_{q \in E_{\mathcal{F}}(p),\; p \in \mathcal{P},\; g(q) \in \partial\mathcal{F}(q)} (p - q)^{T} g(q) \tag{5}$$

and checking whether Δ(p) ≤ 0. We call the above maximization the auxiliary problem at p. The major difficulty is that the level set E_F(p) is hard to compute explicitly. Next, we study a solution method for (5) by approximating the level set with a collection of representative points.

Approximate Level Set

Definition 5 Given a user-specified approximation degree m, the approximate level set for E_F(p) is defined by

$$A^{m}_{p} = \{q^{1}, q^{2}, \dots, q^{m} \mid q^{i} \in E_{\mathcal{F}}(p),\; i = 1, 2, \dots, m\}$$

Consider solving the auxiliary problem approximately by replacing E_F(p) with A^m_p; then for each q^i, (5) becomes

$$\max_{p \in \mathcal{P},\; g(q^{i}) \in \partial\mathcal{F}(q^{i})} (p - q^{i})^{T} g(q^{i}) \tag{6}$$

Since F(p) is almost everywhere differentiable, in most cases g(q^i) is unique and equals the gradient ∇F(q^i). Then the auxiliary problem is a simple linear program. In the cases when q^i is on the boundary of critical regions, ∂F(q^i) becomes a convex set, and the auxiliary problem becomes a bilinear program. A general bilinear program is hard, but fortunately (6) has disjoint feasible sets, and one can show

Proposition 1 Problem (6) is equivalent to

$$\max_{p \in \mathcal{P}}\; \max_{g(q^{i}) \in V(\partial\mathcal{F}(q^{i}))} (p - q^{i})^{T} g(q^{i}) \tag{7}$$

which indicates that the optimal solution to (6) must be at a vertex of the feasible polyhedron. As such, (6) can be expanded into a set of linear programs, each of which is substantiated by an element of A^m_p and a vertex of ∂F(q^i).

The PDM Algorithm  With the approximate auxiliary problem solved, we can immediately determine whether an improvement can be made at the current p. More specifically, let {(u^i, s^i), i = 1, ..., m} be the solution of (6) on A^m_p, i.e.,

$$(u^{i} - q^{i})^{T} s^{i} = \max_{p \in \mathcal{P},\; g(q^{i}) \in V(\partial\mathcal{F}(q^{i}))} (p - q^{i})^{T} g(q^{i}) \tag{8}$$

and define Δ(A^m_p) = max_{i=1,...,m} (u^i − q^i)^T s^i. Then, with the convexity of F, we have

Proposition 2 For any p ∈ P, if there exist q^i ∈ A^m_p, g(q^i) ∈ V(∂F(q^i)), and u^i defined in (8) such that (u^i − q^i)^T g(q^i) > 0, then we must have F(u^i) > F(p).

Now the remaining work is to construct the approximate level set given the current p and the degree m. The following lemma shows that this is possible if a global minimizer is available.

Lemma 2 Let the global minimizer of F(p) be p_*. Then for any p ≠ p_* and h ∈ R^n, there exists a unique positive scalar γ such that p_* + γh ∈ E_F(p).

With this guarantee, we write the approximate level set as

$$A^{m}_{p} = \{q^{1}, q^{2}, \dots, q^{m} \mid q^{i} = p_{*} + \gamma_{i} h^{i} \in E_{\mathcal{F}}(p)\} \tag{9}$$

To explore directions for improvement, a natural choice of the h^i is a set of orthogonal basis vectors. Specifically, we can start with a random h^1 and use the Gram-Schmidt algorithm to extend it to m orthogonal basis vectors. For each h^i, the corresponding γ_i is found by solving

$$\Phi(\gamma_{i}) \triangleq \mathcal{F}(p_{*} + \gamma_{i} h^{i}) - \mathcal{F}(p) = 0 \tag{10}$$
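The construction in (9)-(10) can be sketched as follows (an illustration under our own assumptions, not the released implementation): orthonormal directions are obtained from a QR factorization, which plays the role of Gram-Schmidt, and each γ_i is the unique positive root of Φ guaranteed by Lemma 2, bracketed and then found with Brent's method.

```python
# Sketch of the approximate level set (9)-(10). F is a callable returning F(p) = J(alpha*(p)),
# p_star its global minimizer, p the current point, m the approximation degree (m <= len(p)).
import numpy as np
from scipy.optimize import brentq

def approximate_level_set(F, p_star, p, m, seed=0):
    n = len(p_star)
    rng = np.random.default_rng(seed)
    # Orthonormal directions h^1, ..., h^m (QR of a random matrix, i.e. Gram-Schmidt).
    H, _ = np.linalg.qr(rng.standard_normal((n, m)))
    level = F(p)
    points = []
    for i in range(m):
        h = H[:, i]
        phi = lambda g, h=h: F(p_star + g * h) - level   # Phi(gamma) from (10)
        upper = 1.0
        while phi(upper) < 0:                            # expand until the unique root is bracketed
            upper *= 2.0
        gamma = brentq(phi, 0.0, upper)                  # Lemma 2: unique positive root
        points.append(p_star + gamma * h)
    return np.array(points)                              # the points q^i of A^m_p
```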

As stated in Lemma 2, the function Φ in (10) has a unique root, which can be computed efficiently with a line search method. To obtain the global minimizer, we have to solve p_* = argmin_p F(p), which is a convex minimization problem. Using Theorem 1, we show (in the supplementary material) that a subgradient descent method with T iterations converges to the global minimum within O(1/√T).

Organizing all the building blocks developed so far, we summarize the PDM procedure in Algorithm 1.

Algorithm 1 Parametric Dual Maximization
  Choose p^(0) ∈ P; set k = 0; compute p_* with subgradient descent.
  while k ≤ iter_max do
    Starting from p^(k), find a local maximizer r^(k) ∈ P with a local solver.
    Construct A^m_{r^(k)} at r^(k) by (9), (10); solve (IO) if a new critical region is encountered, otherwise use (2).
    for q^i ∈ A^m_{r^(k)} do
      for g^j ∈ V(∂F(q^i)) do
        Solve u_{ij} = argmax_{p∈P} (p − q^i)^T g^j
      end for
      Let j* = argmax_j (u_{ij} − q^i)^T g^j; set (u^i, s^i) = (u_{ij*}, g^{j*})
    end for
    Let i* = argmax_i (u^i − q^i)^T s^i; set u^(k) = u^{i*}
    if (u^{i*} − q^{i*})^T s^{i*} > 0 then
      Set p^(k+1) = u^(k); k = k + 1    {improvement found}
    else
      Terminate and output p^(k)    {optimality verified}
    end if
    Collect the explored critical regions and the explicit forms given in (1), (2).
  end while

Given the current solution p^(k), the algorithm first tries to improve it with existing methods such as AO, CCCP, SGD, etc. After finding a local solution r^(k), the approximate level set A^m_{r^(k)} is obtained by solving (10) and constructing (9). With A^m_{r^(k)} and the current subgradients, one or several linear programs are solved to pick the vector u^(k) that maximizes the quantity in condition (4) of Theorem 3. If this maximal value, i.e., Δ(A^m_p), is greater than 0, then by Proposition 2, u^(k) is a strictly improved solution compared to r^(k), and the algorithm continues with p^(k+1) = u^(k). Otherwise, if Δ(A^m_p) ≤ 0, the algorithm terminates, since no improvement can be found at the current point with the user-specified approximation degree. For convergence, we have

Theorem 4 Algorithm 1 generates a sequence {p^(1), ..., p^(k), ...} with non-decreasing function values. The sequence converges to an approximate maximizer of F(p) in a finite number of steps.

In each iteration, we only have to solve m|V(∂F(q^i))| linear programs, and in most cases |V(∂F(q^i))| = 1 due to the almost-everywhere differentiability shown in Theorem 2. When constructing the approximate level set, we need to solve at most m convex quadratic programs of type (IO), which seems computationally expensive. However, this problem resembles the classic SVM dual, hence a variety of existing methods can be reused for acceleration (Chang and Lin 2011). Moreover, by virtue of the optimality structure revealed in Theorems 1 and 2, a list of explored critical regions and the corresponding explicit optimizers can be stored; if the current p is on this list, all information can be retrieved in explicit form and there is no need to solve the quadratic program again. To further accelerate the algorithm, one can "enlarge" critical regions; see the supplementary material for a discussion.

5 Experiments

In this section, we report the optimization and generalization performance of PDM for the training of S3VM and the latent SVM (LSVM). More results and a Matlab implementation can be found online.

Datasets and Experiment Setup  Details of the datasets are listed in Table 1. For S3VM, we report results on four popular data sets for semi-supervised learning, i.e., 2moons (D1), Coil (D2), Robot (D3) and 2spiral (D4, with simulator). In each experiment, 60% of the samples are used for training, of which only a small portion are assumed to be labeled. 10% of the data are used as a validation set for choosing hyperparameters, and with the remaining 30% we evaluate the generalization performance. For LSVM we adopt the same training, validation and testing partition on the Vowel (D5), Music (D6), Bank (D7) and Wave (D8, with simulator) data sets. To create a latent data structure, we assume only grouped binary labels are known.

Table 1: Data sets. D1-D4 for S3VM and D5-D8 for LSVM.

Data set   ID   # classes   # samples   # features   labeled
2moons     D1   2           200         2            2
Coil       D2   3           216         1024         6
Robot      D3   4           2456        25           40
2spiral    D4   2           100000      2            4
Vowel      D5   10          990         11           grouped
Music      D6   10          2059        68           grouped
Bank       D7   9           7166        649          grouped
Wave       D8   30          100000      40           grouped

The Gaussian kernel κ(x, y) = exp(−||x − y||²/2σ²) is used for all experiments. Following model selection suggestions (Chapelle, Sindhwani, and Keerthi 2008; Felzenszwalb et al. 2010), the best hyperparameter combination (C_1, C_2, σ²) is chosen by cross-validation from C_1 ∈ {10^{0:0.5:3}}, σ² ∈ {(1/2)^{−3:1:3}}, and C_2 ∈ {10^{−8:1:0}} for S3VM and C_2 ∈ {10^{−4:1:4}} for LSVM. A simple gradient ascent is used as the local solver for PDM. All experiments are conducted on a workstation with dual Xeon X5687 CPUs and 72GB of memory.

[Figure 1: PDM at each iteration of S3VM training on the D1 dataset (randomized initialization; m = 20). Panels: objective value and testing accuracy versus iteration, and the evolution of the auxiliary variables p (entries with p ≤ 0.5 and p > 0.5).]

Small-scale Demo  To get more intuition about how PDM works, we use it to train S3VM on the D1 dataset and plot the iterative evolution of the objective function (−J), the testing accuracy, and the values of p in Figure 1. The approximation level m is set to 0.1·length(p) = 20, and the initial p^(0) is chosen randomly. We observe that PDM converges within 12 iterations (top left subfigure). The testing accuracy increases from 48% to above 98% (top right subfigure), showing improvements in both optimization and generalization performance. Moreover, the auxiliary variable p approaches the global optimum even with random initial values (bottom subfigures). Note that in this process a total of 36 (IO) problems are solved, and about 2/3 of the critical regions are reused more than once.
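To make the per-iteration work of Algorithm 1 concrete, the following sketch (illustrative only) solves one linear program per level-set point and applies the check of Proposition 2; the feasible set P is modeled here as a box with optional linear inequalities, and the subgradient at each q^i is assumed unique, as in the almost-everywhere differentiable case.

```python
# Sketch of one improvement step of Algorithm 1: for each q^i in the approximate level
# set with gradient g^i = grad F(q^i), solve u^i = argmax_{p in P} (p - q^i)^T g^i and
# keep the best candidate. A strictly positive value certifies an improvement by
# Proposition 2. P is modeled here as {p : A_ub p <= b_ub, 0 <= p <= 1} (an assumption).
import numpy as np
from scipy.optimize import linprog

def improvement_step(level_points, gradients, A_ub=None, b_ub=None):
    best_val, best_u = -np.inf, None
    for q, g in zip(level_points, gradients):
        # linprog minimizes, so maximize g^T p by minimizing (-g)^T p over P.
        res = linprog(-g, A_ub=A_ub, b_ub=b_ub, bounds=(0.0, 1.0), method="highs")
        if not res.success:
            continue
        val = float((res.x - q) @ g)          # the quantity (u^i - q^i)^T g(q^i) of (8)
        if val > best_val:
            best_val, best_u = val, res.x
    return best_u if best_val > 0 else None   # None signals the termination test Delta(A^m_p) <= 0
```

With m = 20 as in the demo above, each iteration amounts to 20 such LPs, consistent with the m|V(∂F(q^i))| count discussed in Section 4.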

Table 2: Normalized objective value of (OPT1) (first row for each dataset; lower is better) and time usage (second row for each dataset; s = seconds, h = hours). D1-D4: S3VM; D5-D8: LSVM.

Data   GD      CCCP    AO      LCS     IA      BB     PDM1    PDM2
D1     2.39    2.82    4.83    5.55    1.79    1.00   1.03    1.00
       1.7s    6.2s    2.7s    6.7s    3.4s    210s   16s     35s
D2     3.74    3.92    3.46    4.98    2.35    1.00   1.19    1.03
       5.3s    6.8s    4.3s    7.9s    5.6s    362s   43s     83s
D3     3.95    4.23    3.48    6.96    2.85    *      1.11    1.00
       33s     56s     28s     43s     27s     *      231s    489s
D4     6.98    4.91    4.90    6.16    4.22    *      1.31    1.00
       0.19h   0.41h   0.33h   0.37h   0.46h   *      1.4h    2.7h
D5     4.45    5.31    4.85    4.09    *       *      1.13    1.00
       26s     54s     33s     68s     *       *      209s    451s
D6     6.51    5.34    4.77    6.82    *       *      1.28    1.00
       63s     90s     72s     101s    *       *      468s    997s
D7     6.78    7.69    4.17    6.22    *       *      1.26    1.00
       326s    371s    263s    477s    *       *      1217s   2501s
D8     10.2    5.16    6.35    7.57    *       *      1.54    1.00
       0.23h   0.73h   0.66h   0.93h   *       *      2.5h    4.8h

Table 3: Generalization performance (error rates, %), averaged over 10 random data partitions. Error rates greater than or close to 50% should be interpreted as "failed".

Data   GD     CCCP   AO     LCS    IA     BB    PDM1   PDM2
D1     51.4   60.0   52.8   65.5   37.5   0.0   1.9    0.2
D2     57.9   66.1   47.9   61.1   57.2   0.0   5.3    1.1
D3     26.6   29.3   59.8   38.8   27.4   *     9.5    3.3
D4     52.1   39.8   40.0   45.4   31.4   *     3.5    2.0
D5     15.8   16.2   13.5   9.9    *      *     2.5    1.7
D6     39.8   43.7   40.8   39.4   *      *     12.1   7.6
D7     20.0   19.4   19.8   22.5   *      *     8.9    5.1
D8     53.1   36.7   39.7   46.2   *      *     19.9   13.1

Optimization and Generalization Performance  We next compare PDM with different optimization methods in terms of their optimization and generalization performance. The algorithms considered for S3VM training are: gradient descent (GD) (Chapelle and Zien 2005), CCCP (Collobert et al. 2006), alternating optimization (AO) (Sindhwani, Keerthi, and Chapelle 2006), local combinatorial search (LCS) (Joachims 1999), infinitesimal annealing (IA) (Ogawa et al. 2013), and branch and bound (BB) (Chapelle, Sindhwani, and Keerthi 2006). The algorithms included for LSVM are GD (Kantchelian et al. 2014), CCCP (Yu and Joachims 2009), AO (Dundar et al. 2008), and an adapted LCS (Joachims 1999). The proposed PDM is tested in two versions, with approximation degree m = 0.1·length(p) (PDM1) and m = 0.2·length(p) (PDM2).

In Table 2, the objective function values of (OPT1) (normalized by the smallest one) are shown in the upper row, and the corresponding computation times are given in the second row for each dataset. Note that although BB provides the exact global optimum for the small data sets D1 and D2, it runs out of memory (72GB!) for the other datasets due to the exponential growth of its search tree. In contrast, PDM1 and PDM2 provide near-optimal solutions relative to BB with much less time and space usage. For the larger data sets (D4-D8), on which BB cannot be executed, PDM outperforms all the other local optimization methods: PDM achieves a significantly improved objective value, and the runner-up is at least 2.8 times larger. Although its running time is longer than that of the local methods, PDM is still scalable (D4 and D8 have 10^5 samples), hence it can be carried out for large-scale problems.

In Table 3, we compare the generalization performance of the different algorithms in terms of testing error rate. Clearly, the globally optimal solutions provided by BB and PDM yield excellent generalization error rates, while the other local optimization methods perform much worse and can even fail completely (e.g., on D1, D2, D4, D8). This observation is consistent with previous findings (Chapelle, Sindhwani, and Keerthi 2006; Chapelle, Sindhwani, and Keerthi 2008), justifying the extra computational overhead required to pursue the global optimum.

Choice of Approximation Degree m  Comparing PDM1 and PDM2 in Tables 2 and 3, we note that, in general, increasing the approximation degree m produces better optimization and generalization performance. To investigate the effect of m, we use PDM to train S3VM on D3 and plot in Figure 2 the optimal value, testing accuracy, and time and space usage as functions of m (from 80 to 650). It appears that increasing m beyond a large enough value (e.g., 300 in Figure 2) provides only marginal improvement in both training and testing. Also, seeing that the computational time grows (slightly) super-linearly and the space usage grows almost linearly, we suggest using m ∈ [0.1·length(p), 0.2·length(p)] as a tradeoff between training/testing accuracy and computational overhead.

[Figure 2: The effect of m for PDM on the D3 dataset; averages and confidence intervals over 50 runs. Panels: objective value, testing accuracy, time usage (sec), and memory usage (GB) as functions of m.]

6 Conclusion

In this paper we propose a novel global optimization procedure, PDM, to solve a class of non-convex learning problems. Our parametric analysis reveals an entirely different perspective: this class of learning problems is equivalent to maximizing a convex PWQ function. We then develop the PDM algorithm based on a global optimality condition for non-smooth convex maximization. Experimental results justify the effectiveness of PDM with regard to both optimization and generalization performance.

7 Acknowledgments

This research is funded by the Republic of Singapore's National Research Foundation through a grant to the Berkeley Education Alliance for Research in Singapore (BEARS) for the Singapore-Berkeley Building Efficiency and Sustainability in the Tropics (SinBerBEST) Program. BEARS has been established by the University of California, Berkeley as a center for intellectual excellence in research and education in Singapore.

References

Bei, T., and Cristianini, N. 2006. Semi-supervised learning using semi-definite programming. In Semi-Supervised Learning. MIT Press. 177–186.
Bennett, K.; Demiriz, A.; et al. 1999. Semi-supervised support vector machines. Advances in Neural Information Processing Systems, 368–374.
Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer. 177–186.
Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.
Chapelle, O., and Zien, A. 2005. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, volume 1, 57–64.
Chapelle, O.; Chi, M.; and Zien, A. 2006. A continuation method for semi-supervised SVMs. In Proceedings of the 23rd International Conference on Machine Learning, 185–192. ACM.
Chapelle, O.; Sindhwani, V.; and Keerthi, S. S. 2006. Branch and bound for semi-supervised support vector machines. In Advances in Neural Information Processing Systems, 217–224.
Chapelle, O.; Sindhwani, V.; and Keerthi, S. S. 2008. Optimization techniques for semi-supervised support vector machines. The Journal of Machine Learning Research 9:203–233.
Collobert, R.; Sinz, F.; Weston, J.; and Bottou, L. 2006. Large scale transductive SVMs. The Journal of Machine Learning Research 7:1687–1712.
Dundar, M. M.; Wolf, M.; Lakare, S.; Salganicoff, M.; and Raykar, V. C. 2008. Polyhedral classifier for target detection: a case study: colorectal cancer. In ICML.
Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645.
Georgiev, P. G.; Chinchuluun, A.; and Pardalos, P. M. 2011. Optimality conditions of first order for global minima of locally Lipschitz functions. Optimization 60(1-2):277–282.
Hastie, T.; Rosset, S.; Tibshirani, R.; and Zhu, J. 2004. The entire regularization path for the support vector machine. The Journal of Machine Learning Research 5:1391–1415.
Joachims, T. 1999. Transductive inference for text classification using support vector machines. In ICML, volume 99, 200–209.
Kantchelian, A.; Tschantz, M. C.; Huang, L.; Bartlett, P. L.; Joseph, A. D.; and Tygar, J. 2014. Large-margin convex polytope machine. In Advances in Neural Information Processing Systems, 3248–3256.
Karasuyama, M., and Takeuchi, I. 2011. Suboptimal solution path algorithm for support vector machine. In ICML.
Krishnamoorthy, B. 2008. Bounds on the size of branch-and-bound proofs for integer knapsacks. Operations Research Letters 36(1):19–25.
Ogawa, K.; Imamura, M.; Takeuchi, I.; and Sugiyama, M. 2013. Infinitesimal annealing for training semi-supervised support vector machines. In Proceedings of the 30th International Conference on Machine Learning, 897–905.
Park, J., and Boyd, S. 2015. A semidefinite programming method for integer convex quadratic minimization. arXiv preprint arXiv:1504.07672.
Ping, W.; Liu, Q.; and Ihler, A. 2014. Marginal structured SVM with hidden variables. arXiv preprint arXiv:1409.1320.
Sindhwani, V.; Keerthi, S. S.; and Chapelle, O. 2006. Deterministic annealing for semi-supervised kernel machines. In Proceedings of the 23rd International Conference on Machine Learning, 841–848. ACM.
Tondel, P.; Johansen, T. A.; and Bemporad, A. 2003. An algorithm for multi-parametric quadratic programming and explicit MPC solutions. Automatica.
Tsevendorj, I. 2001. Piecewise-convex maximization problems. Journal of Global Optimization 21(1):1–14.
Wachsmuth, G. 2013. On LICQ and the uniqueness of Lagrange multipliers. Operations Research Letters 41(1):78–80.
Xu, L.; Crammer, K.; and Schuurmans, D. 2006. Robust support vector machine training via convex outlier ablation. In AAAI, volume 6, 536–542.
Yu, C.-N. J., and Joachims, T. 2009. Learning structural SVMs with latent variables. In Proceedings of the 26th International Conference on Machine Learning, 1169–1176.
Yuille, A. L.; Rangarajan, A.; and Yuille, A. 2002. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems 2:1033–1040.
Zhou, Y.; Hu, N.; and Spanos, C. J. 2016. Veto-consensus multiple kernel learning. In Thirtieth AAAI Conference on Artificial Intelligence.
Zhou, Y.; Jin, B.; and Spanos, C. J. 2015. Learning convex piecewise linear machine for data-driven optimal control. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 966–972. IEEE.
