Equivalence of Multicategory SVM and Simplex Cone SVM: Fast Computations and Statistical Theory
Guillaume A. Pouliot¹

¹Harris School of Public Policy, University of Chicago, Chicago, IL, USA. Correspondence to: Guillaume A. Pouliot <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

The multicategory SVM (MSVM) of Lee et al. (2004) is a natural generalization of the classical, binary support vector machine (SVM). However, its use has been limited by computational difficulties. The simplex-cone SVM (SCSVM) of Mroueh et al. (2012) is a computationally efficient multicategory classifier, but its use has been limited by a seemingly opaque interpretation. We show that MSVM and SCSVM are in fact exactly equivalent, and provide a bijection between their tuning parameters. MSVM may then be entertained as both a natural and computationally efficient multicategory extension of SVM. We further provide a Donsker theorem for finite-dimensional kernel MSVM and partially answer the open question pertaining to the very competitive performance of One-vs-Rest methods against MSVM. Furthermore, we use the derived asymptotic covariance formula to develop an inverse-variance weighted classification rule which improves on the One-vs-Rest approach.

1. Introduction

The support vector machine (SVM) is an established algorithm for classification with two categories (Vapnik, 1998; Smola and Schölkopf, 1998; Steinwart and Christmann, 2008; Friedman et al., 2009). The method finds the maximum margin separating hyperplane: the hyperplane dividing the input space (perhaps after mapping the data to a higher dimensional space) into two categories while maximizing the minimum distance from a point to the hyperplane. SVM can also be adapted to allow for imperfect classification, in which case we speak of soft margin SVM.

Given the success of SVM at binary classification, many attempts have been made at extending the methodology to accommodate classification with K > 2 categories (Sun et al., 2017; Dogan et al., 2016; Lopez et al., 2016; Kumar et al., 2017; a survey is available in Ma and Guo, 2014). Lee, Lin and Wahba (2004) propose what is arguably the natural multicategory generalization of binary SVM. For instance, their multicategory SVM (MSVM) is Fisher consistent (i.e., the classification rule it produces converges to the Bayes rule), which is a key property of, and motivation for, standard SVM. Furthermore, it encompasses standard SVM as a special case.

However, the method has not been widely used in application, nor has it been studied from a statistical perspective the way SVM has been. Amongst the machine learning community, MSVM has not gathered popularity commensurate with that of SVM. Likewise, three major publications (Jiang et al., 2008; Koo et al., 2008; Li et al., 2011) have established Donsker theorems for SVM, and none has done so for MSVM.

Interestingly, computation and statistical analysis of MSVM are hindered by the same obstacle: the optimization problem defining MSVM is solved under a sum-to-zero constraint on the vector argument. This makes both the numerical optimization task and the statistical asymptotic analysis of the estimator more challenging. The numerical optimization is substantially slowed down by the equality constraint,¹ as detailed in Table 1. Likewise, standard methods for deriving Donsker theorems and limit distribution theory do not apply to such constrained vectors of random variables.²

¹ In fact, a "hack" sometimes used is to ignore the equality constraint in the primal or dual formulation. This can result in arbitrarily large distortions of the optimal solution.
² For instance, the covariance matrix of a vector of random variables constrained to sum to zero is not positive definite.
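The point made in footnote 2 is easy to verify numerically. The following minimal check (ours, not from the paper; all names are illustrative) confirms that the covariance matrix of a vector whose entries are constrained to sum to zero is rank deficient, with the all-ones vector in its null space:

```python
# Minimal check of footnote 2 (illustrative code, not from the paper):
# if the K entries of a random vector sum to zero, its covariance matrix
# C satisfies C @ 1 = 0, so C has rank K - 1 and is not positive definite.
import numpy as np

rng = np.random.default_rng(0)
K = 4
Z = rng.normal(size=(10_000, K))
Z -= Z.mean(axis=1, keepdims=True)  # project each draw onto the sum-to-zero subspace

C = np.cov(Z, rowvar=False)              # K x K sample covariance
print(np.allclose(C @ np.ones(K), 0.0))  # True: all-ones vector lies in the null space
print(np.linalg.matrix_rank(C))          # K - 1, not K
```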
In a separate strand of literature, Mroueh, Poggio, Rosasco and Slotine (2012) have proposed the simplex-cone SVM (SC-SVM), a multicategory classifier developed within the vector-valued reproducing kernel Hilbert space set-up. The SC-SVM optimization program is computationally tractable, and in particular does away with the equality constraint slowing down the primal and dual formulations of MSVM (Lee et al., 2004). Nevertheless, its use has remained marginal, arguably due to more limited interpretability: e.g., the notion of distance is captured via angles, and the nesting of binary SVM as a special case is not entirely straightforward.

As our main contribution, we show that MSVM and SC-SVM are in fact exactly equivalent. As a direct consequence, we deliver faster computations for MSVM; simulations such as those presented in Table 1 display speed gains of an order of magnitude. Furthermore, the equivalence with an unconstrained estimator allows for statistical analysis of MSVM. As a second contribution, we deliver a Donsker theorem for MSVM, as well as an asymptotic covariance formula with sample analog. A third contribution is to use the asymptotic analysis to propose a statistically efficient, inverse-variance weighted modification of One-vs-Rest. Finally, as a fourth contribution, we show that the asymptotic analysis allows us to partially answer, with analytic characterizations, the open question relating to the very competitive performance of the seemingly more naive One-vs-Rest method against MSVM.

The fourth contribution is important because it provides analytical substance to a long-standing open question. To be sure, the different attempts at developing a multicategory generalization of binary SVM can be understood as subscribing to one of two broad approaches. The first approach consists in doing multicategory classification using the standard, binary SVM. For instance, the popular One-vs-Rest approach works as follows: to predict the category of a point in a test set³ (i.e., out of sample), run K binary SVMs in which the first category is one of the original K categories and the second category is the union of the remaining K − 1 categories. The predicted category is the one that was picked against all others with the greatest "confidence". In practice, the confidence criterion used is the distance of the test point to the separating hyperplane (we show in Subsection 3.1 that even this can be improved according to statistical considerations); a minimal sketch of the procedure is given below.
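The following sketch (ours, not the paper's code; it assumes scikit-learn and all names are illustrative) implements the One-vs-Rest procedure just described: fit one binary SVM per category against the union of the others, then predict with the machine reporting the largest signed distance to its hyperplane.

```python
# One-vs-Rest with binary SVMs, as described above (illustrative sketch;
# assumes scikit-learn, and all names here are ours, not the paper's).
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_fit(X, y, C=1.0):
    """Fit K binary SVMs, one per category k: label k versus the rest."""
    return {k: LinearSVC(C=C).fit(X, (y == k).astype(int))
            for k in np.unique(y)}

def one_vs_rest_predict(machines, X_test):
    """Predict the category picked with greatest 'confidence', here the
    signed distance (w.x + b) / ||w|| of the test point to each
    machine's separating hyperplane, as described in the text."""
    labels = sorted(machines)
    scores = np.column_stack([
        machines[k].decision_function(X_test) / np.linalg.norm(machines[k].coef_)
        for k in labels])
    return np.asarray(labels)[scores.argmax(axis=1)]
```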
The second approach consists in generalizing the standard SVM so as to develop a single machine which implements multicategory classification by solving a single, joint optimization problem. Many such algorithms have been suggested (Weston and Watkins, 1999; Crammer and Singer, 2002; Lee et al., 2004). Intuition would suggest that joint optimization makes for a more statistically efficient procedure, and for superior out-of-sample prediction performance. However, in a quite counterintuitive turn of events, it has been widely observed in practice that multicategory classification with binary machines offers a performance (for instance, in out-of-sample classification) which is competitive with, and sometimes superior to, that of single-machine multicategory SVM algorithms. This phenomenon is widely observed, and yet no theory has been put forth to explain it.

We make some progress towards an analytical characterization of the comparative performance, and are able to suggest an explanation as to the competitive, and sometimes superior, empirical performance of One-vs-Rest compared to MSVM. We argue that, in some respect, One-vs-Rest makes a more efficient use of the information contained in the data.

The remainder of the paper is organized as follows. Section 2 defines both MSVM and SC-SVM, and contains the proof of the equivalence between the two methods. Section 3 gives the Donsker theorem for MSVM, and describes how the asymptotic distribution may be used for more efficient classification. Section 4 suggests an analytical explanation for the surprisingly competitive performance of One-vs-Rest classifiers versus MSVM. Section 5 discusses and concludes.

2. Equivalence

The multicategory SVM (MSVM) of Lee et al. (2004) is arguably the most elegant and natural generalization of SVM to multicategory data. However, its implementation, even for moderate size data sets, is complicated by the presence of a sum constraint on the vector argument. The simplex encoding of Mroueh et al. (2012) is relieved of that linear constraint. However, we believe the simplex encoding is not more widely used because it is not known what standard encoding it corresponds to, making it challenging for practitioners to carry out interpretable classification analysis. The following result resolves both issues, making it of practical interest for analysts and researchers using multicategory classification methods.

We define MSVM and SC-SVM, and establish their equivalence. The presentation is done with finite-dimensional kernels for ease of exposition. Remark 3 details the generalization to infinite-dimensional kernels in reproducing kernel Hilbert spaces.

With K categories, the data are of the form $(x_i, y_i) \in \mathbb{R}^p \times \{1, \dots, K\}$, $i = 1, \dots, N$. When carrying out multicategory classification, different choices of encodings of the category variables $y_i$ lead to optimization problems that are differently formulated and implemented.

For their multicategory SVM (MSVM), Lee et al. (2004) encode $y_i$ associated with category $k \in \{1, \dots, K\}$ as a K-tuple with 1 in the k-th entry and $-\frac{1}{K-1}$ in every other entry. For instance, with K = 3, category 1 is encoded as $(1, -\frac{1}{2}, -\frac{1}{2})$.
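A small helper (ours, not from the paper) makes the encoding concrete; note that every code vector sums to zero, which is exactly the sum-to-zero structure discussed in the introduction.

```python
# Build the Lee et al. (2004) MSVM encoding (illustrative helper, ours):
# category k in {1, ..., K} maps to a K-tuple with 1 in the k-th entry
# and -1/(K-1) in every other entry, so each code vector sums to zero.
import numpy as np

def msvm_encode(y, K):
    """Encode integer labels y in {1, ..., K} as Lee et al. code vectors."""
    codes = np.full((K, K), -1.0 / (K - 1))
    np.fill_diagonal(codes, 1.0)
    return codes[np.asarray(y) - 1]

# With K = 3, category 1 is encoded as (1, -1/2, -1/2):
print(msvm_encode([1, 2, 3], K=3))
```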