Analytic Center Cutting Plane Method for Multiple Kernel Learning
Ann Math Artif Intell
DOI 10.1007/s10472-013-9331-4

Sharon Wulff · Cheng Soon Ong

© Springer Science+Business Media Dordrecht 2013

S. Wulff
Department of Computer Science, ETH, Zürich, Switzerland
e-mail: [email protected]

C. S. Ong
NICTA, The University of Melbourne, Melbourne, Australia
e-mail: [email protected]

Abstract Multiple Kernel Learning (MKL) is a popular generalization of kernel methods which allows the practitioner to optimize over convex combinations of kernels. We observe that many recent MKL solutions can be cast in the framework of oracle based optimization, and show that they vary in terms of query point generation. Such methods are popular because the oracle can, fortuitously, be implemented as a support vector machine. Motivated by the success of centering approaches in interior point methods, we propose a new approach to optimize the MKL objective based on the analytic center cutting plane method (accpm). Our experimental results show that accpm outperforms the state of the art in terms of rate of convergence and robustness. Further analysis sheds some light on why MKL may not always improve classification accuracy over naive solutions.

Keywords Multiple kernel learning · accpm · Oracle methods · Machine learning

Mathematics Subject Classification (2010) 68T05

1 Introduction

Kernel methods, for example the support vector machine (SVM), are a class of algorithms that consider only the similarity between examples [1]. A kernel function $k$ implicitly maps examples $x$ to a feature space given by a feature map $\Phi$ via the identity $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. It is often unclear what the most suitable kernel for the task at hand is, and hence the user may wish to combine several possible kernels. One problem with simply adding kernels is that using uniform weights is possibly not optimal. An extreme example is the case where one kernel is not correlated with the labels at all; giving such a kernel positive weight just adds noise [2]. Multiple kernel learning (MKL) is a way of optimizing the kernel weights while training the SVM. In addition to leading to good classification accuracies, MKL can also be useful for identifying relevant and meaningful features [2–5]. MKL finds a convex combination of kernels [6, 7], that is, a classifier

$$f_{w,b,\beta}(x) = \sum_{k=1}^{p} \beta_k \langle w_k, \Phi_k(x) \rangle + b$$

where $\beta$ are the kernel weights corresponding to the $p$ kernels, and $w, b$ are the SVM parameters.
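To make the combined kernel concrete, here is a minimal sketch of evaluating an SVM on a fixed convex combination of Gram matrices via scikit-learn's precomputed-kernel interface. The weights $\beta$ are set by hand here, whereas MKL treats them as optimization variables; the choice of base kernels and the data are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

BASE_KERNELS = [linear_kernel, polynomial_kernel, rbf_kernel]  # illustrative choice

def combined_gram(X, Z, beta):
    """Weighted sum of the p base Gram matrices: K = sum_k beta_k * K_k(X, Z)."""
    return sum(b_k * k(X, Z) for b_k, k in zip(beta, BASE_KERNELS))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.choice([-1, 1], size=40)
beta = np.array([0.2, 0.3, 0.5])           # convex weights: nonnegative, sum to one

svm = SVC(kernel="precomputed")
svm.fit(combined_gram(X, X, beta), y)      # train on the combined kernel
print(svm.predict(combined_gram(X[:5], X, beta)))  # test rows vs. training columns
```

MKL replaces the hand-picked $\beta$ above with weights learned jointly with the SVM parameters.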
The MKL problem was first formulated as a semidefinite programming (SDP) problem [6]. One can exploit the known structure of the MKL problem to speed up the optimization, for example using an SMO-like approach [8]; however, this requires a full reimplementation of the solver. MKL has also recently been shown to be equivalent to group LASSO [9]. More recently, leveraging the existence of efficient software for solving the SVM optimization problem, a semi-infinite linear programming (SILP) approach was developed in [7]. Their solution is based on the concept of cutting planes. Cutting plane methods alternate between choosing query points and calling an oracle. Given a query point, the oracle returns halfspaces to be included in the current set of constraints, forming the feasible set. In the context of MKL, the oracle is an SVM solver and the query points are kernel weights. One criticism of the SILP is that it requires many cutting planes (and hence calls to the SVM) before convergence. This criticism gave rise to a subgradient based approach [10] and a bundle method [11].

Our contributions in this paper are as follows. We review MKL optimization in an oracle based optimization framework and demonstrate that existing methods are essentially different ways of generating query points. We propose the use of the analytic center as a query point, since in expectation it halves the hypothesis space. We conduct experiments comparing the performance of several MKL solutions on UCI data. We show that our more "regularized" approach often requires fewer iterations, and is more robust to variations in the data. The experimental results are followed by a discussion of the correlation between the MKL objective and test accuracy.

2 Oracle based methods

Oracle based methods are widely used for solving integer programming problems. Here we focus on using oracle based methods for solving convex optimization problems [12]. The goal of oracle based algorithms is to find a point in a convex set $Z$, or to determine that the set is empty (refer to Algorithm 1 for a pseudocode description). In an optimization problem, $Z$ is the set of $\varepsilon$-suboptimal points. The method does not assume any direct access to the description of $Z$, such as the objective and constraint functions, except through an oracle. Instead, the method generates a query point $q$ (Section 2.2), which is passed to the oracle (Section 2.1). Unless $q \in Z$, the oracle returns a hyperplane which separates $q$ from the set $Z$. This hyperplane is called a "cut" since it eliminates a halfspace from our search, which is the reason oracle based methods are also known as cutting plane methods.

2.1 Oracles and cuts

For a convex optimization problem with $m$ constraints,

$$\min_z f_0(z) \quad \text{s.t.} \quad f_i(z) \le 0, \quad i = 1, \ldots, m \tag{1}$$

where $f_0, \ldots, f_m$ are convex and differentiable, the target set $Z$ is the optimal (or $\varepsilon$-optimal) set. Given a query point $q$, the oracle first checks for feasibility. If $q$ is not feasible, then we choose one of the violated constraints $f_j(q) > 0$ and form a cut

$$f_j(q) + \nabla f_j(q)^\top (z - q) \le 0 \tag{2}$$

This cut (2) is called a feasibility cut for the problem (1), since it cuts away a halfspace of points known to be infeasible (because they violate the $j$th constraint). If $q$ is feasible, we construct the cutting plane

$$\nabla f_0(q)^\top (z - q) \le 0, \tag{3}$$

which is called the objective cut for (1). It cuts out the halfspace $\{ z \mid \nabla f_0(q)^\top (z - q) > 0 \}$, since all such points have an objective value larger than $f_0(q)$ and hence cannot be optimal. If $q$ is feasible and $\nabla f_0(q) = 0$, then $q$ is optimal. In general, for non-differentiable problems, the gradients $\nabla f_j(z)$ can be replaced by subgradients.

Algorithm 1 Cutting plane algorithm for optimization
Require: an initial polyhedron $P_0$ containing $Z$; $t = 0$
repeat
    Generate a query point $q^{(t+1)}$ in $P_t$
    Query the oracle at $q^{(t+1)}$; the oracle returns a cutting plane $a_{t+1}^\top z \le b_{t+1}$
    Update the constraints: $P_{t+1} = P_t \cap \{ z \mid a_{t+1}^\top z \le b_{t+1} \}$
    $t = t + 1$
until convergence or $P_{t+1} = \emptyset$
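As a concrete rendering of Section 2.1 and Algorithm 1, the sketch below builds the feasibility cut (2) and the objective cut (3) for a toy smooth problem, and grows the constraint set one halfspace at a time. The query point generator is deliberately left as a parameter, since choosing it is the subject of Section 2.2; the problem data and names are illustrative assumptions.

```python
import numpy as np

# Toy instance of (1): minimize f0(z) = ||z - c||^2 subject to f1(z) = ||z||^2 - 1 <= 0.
c = np.array([2.0, 1.0])
f0_grad = lambda z: 2.0 * (z - c)
f1 = lambda z: z @ z - 1.0
f1_grad = lambda z: 2.0 * z

def oracle(q):
    """Return (a, b) encoding a cut a.z <= b that separates q from the target set Z."""
    if f1(q) > 0:
        g = f1_grad(q)          # feasibility cut (2): f1(q) + g.(z - q) <= 0
        return g, g @ q - f1(q)
    g = f0_grad(q)              # objective cut (3): g.(z - q) <= 0
    return g, g @ q

def cutting_plane(generate_query, A, b, max_iter=50):
    """Algorithm 1: maintain P_t = {z | A z <= b} and add one cut per iteration."""
    q = None
    for t in range(max_iter):
        q = generate_query(A, b)           # query point in P_t (see Section 2.2)
        a_new, b_new = oracle(q)
        if np.linalg.norm(a_new) < 1e-12:  # zero objective gradient at a feasible q
            break                          # -> q is optimal
        A = np.vstack([A, a_new])          # P_{t+1} = P_t ∩ {z | a_new.z <= b_new}
        b = np.append(b, b_new)
    return q, A, b
```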
2.2 Generating query points

In principle, we would like the query point $q^{(t+1)}$ corresponding to the current polyhedron $P_t$ (containing the optimal set $Z$) to be generated such that the resulting cut reduces the size of $P_{t+1}$ as much as possible. When querying the oracle with $q^{(t+1)}$, we do not know in which direction the generated cut will exclude, but we do know that $q^{(t+1)}$ will be in the excluded halfspace. One approach is to greedily use the vertex of the current polytope which minimizes the objective, leading to the following method.

2.2.1 Method of Kelley–Cheney–Goldstein

For the optimization problem (1), the piecewise linear function

$$f^{(t)}(z) = \max_{i \le t} \left[ f_0(z_i) + \nabla f_0(z_i)^\top (z - z_i) \right]$$

is a lower approximation to $f_0(z)$. The next query point $q^{(t+1)}$ is found by solving

$$\min_{z, \theta} \; \theta \quad \text{s.t.} \quad \theta \ge f_0(q_i) + \nabla f_0(q_i)^\top (z - q_i), \; \forall i \le t, \quad A_t z \le b_t, \tag{4}$$

where $A_t, b_t$ are the set of existing cutting planes which define the current polyhedron $P_t$. In the rest of the paper we refer to the above method as the Kelley–Cheney–Goldstein method (KCG).

Using $\nabla f_0(q_i) / \|\nabla f_0(q_i)\|$ instead of $\nabla f_0(q_i)$ in (4) results in finding the center of the largest inscribed sphere. This variant of (4) is called the Chebyshev center method. This modification, where the gradients are scaled to unit length, has significantly better convergence properties [12]. This already shows the power of centering, which is exploited by the following method.

2.2.2 Analytic center cutting plane method

The analytic center is a concept which has been popularized by interior point methods. Given a constraint $a_i^\top z \le b_i$, define the slack $s_i$ as $s_i = b_i - a_i^\top z$; that is, $s_i$ is a measure of how close or how far the current solution is from the constraint. An interior point of the feasible set is a point for which all the slacks are strictly positive. The analytic center is defined as the unique maximizer of the function $f(s) = \prod_{i=1}^{t} s_i$, where $s \in \mathbb{R}^t$ is the vector of slacks of the current set of constraints $\{ a_i^\top z \le b_i, \; i = 1, \ldots, t \}$. The geometrical interpretation of the analytic center is the point that maximizes the product of distances to all of the faces of the polytope. Maximizing the product of the slacks (instead of, e.g., the sum) ensures that every slack is strictly positive. We can rewrite the analytic center as

$$\operatorname*{argmax}_z f(s) = \operatorname*{argmax}_z \prod_{i=1}^{t} s_i = \operatorname*{argmax}_z \sum_{i=1}^{t} \log\left(b_i - a_i^\top z\right) \tag{5}$$

This function is also known as the logarithmic barrier, and its unique maximizer can be efficiently found using Newton iterations [12]. Going back to the cutting planes framework, computing a point which is as far as possible from the border of the feasible set ensures that in every iteration the resulting cut yields a substantial reduction of the size of the set.
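As an illustration of (5), the sketch below computes the analytic center by damped Newton steps on the log barrier. It assumes a strictly feasible starting point and a bounded polyhedron; it is a bare-bones illustration, not the solver used in the paper.

```python
import numpy as np

def analytic_center(A, b, z0, iters=50, tol=1e-10):
    """Maximize sum_i log(b_i - a_i.z) over the interior of {z | A z <= b}."""
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        s = b - A @ z                      # slacks, strictly positive at interior points
        g = A.T @ (1.0 / s)                # negative gradient of the barrier
        H = A.T @ np.diag(1.0 / s**2) @ A  # negative Hessian (positive definite)
        step = np.linalg.solve(H, -g)      # Newton direction for the maximization
        t = 1.0
        while np.any(b - A @ (z + t * step) <= 0):
            t *= 0.5                       # damp the step to stay strictly feasible
        z = z + t * step
        if np.linalg.norm(step) < tol:
            break
    return z

# Unit box {z | -1 <= z_i <= 1}: by symmetry the analytic center is the origin.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
print(analytic_center(A, b, z0=np.array([0.3, -0.2])))  # approximately [0, 0]
```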