Equivalence of Multicategory SVM and Simplex Cone SVM: Fast Computations and Statistical Theory
Guillaume A. Pouliot 1
Abstract

The multicategory SVM (MSVM) of Lee et al. (2004) is a natural generalization of the classical, binary support vector machines (SVM). However, its use has been limited by computational difficulties. The simplex-cone SVM (SCSVM) of Mroueh et al. (2012) is a computationally efficient multicategory classifier, but its use has been limited by a seemingly opaque interpretation. We show that MSVM and SCSVM are in fact exactly equivalent, and provide a bijection between their tuning parameters. MSVM may then be entertained as both a natural and computationally efficient multicategory extension of SVM. We further provide a Donsker theorem for finite-dimensional kernel MSVM and partially answer the open question pertaining to the very competitive performance of One-vs-Rest methods against MSVM. Furthermore, we use the derived asymptotic covariance formula to develop an inverse-variance weighted classification rule which improves on the One-vs-Rest approach.

1. Introduction

Support vector machines (SVM) is an established algorithm for classification with two categories (Vapnik, 1998; Smola and Schölkopf, 1998; Steinwart and Christmann, 2008; Friedman et al., 2009). The method finds the maximum margin separating hyperplane; that is, it finds the hyperplane dividing the input space (perhaps after mapping the data to a higher-dimensional space) into two categories while maximizing the minimum distance from a point to the hyperplane. SVM can also be adapted to allow for imperfect classification, in which case we speak of soft margin SVM.

Given the success of SVM at binary classification, many attempts have been made at extending the methodology to accommodate classification with K > 2 categories (Sun et al., 2017; Dogan et al., 2016; Lopez et al., 2016; Kumar et al., 2017; a survey is available in Ma and Guo, 2014). Lee, Lin and Wahba (2004) propose what is arguably the natural multicategory generalization of binary SVM. For instance, their multicategory SVM (MSVM) is Fisher consistent (i.e., the classification rule it produces converges to the Bayes rule), which is a key property of, and motivation for the use of, standard SVM. Furthermore, it encompasses standard SVM as a special case.

However, the method has not been widely used in applications, nor has it been studied from a statistical perspective the way SVM has been. Amongst the machine learning community, MSVM has not garnered popularity commensurate with that of SVM. Likewise, three major publications (Jiang et al., 2008; Koo et al., 2008; Li et al., 2011) have established Donsker theorems for SVM, and none have done so for MSVM.

Interestingly, computation and statistical analysis of MSVM are hindered by the same obstacle. The optimization problem that defines MSVM is solved under a sum-to-zero constraint on the vector argument. This makes both the numerical optimization task and the asymptotic statistical analysis of the estimator more challenging. The numerical optimization is substantially slowed down by the equality constraint,¹ as detailed in Table 1. Likewise, standard methods for deriving Donsker theorems and limit distribution theory do not apply to such constrained vectors of random variables.²

In a separate strain of literature, Mroueh, Poggio, Rosasco and Slotine (2012) have proposed the simplex-cone SVM (SC-SVM), a multicategory classifier developed within the vector-valued reproducing kernel Hilbert space set-up. The SC-SVM optimization program is computationally tractable, and in particular does away with the equality constraint slowing down the primal and dual formulations of MSVM (Lee et al., 2004).

¹ Harris School of Public Policy, University of Chicago, Chicago, IL, USA. Correspondence to: Guillaume A. Pouliot.

¹ In fact, a "hack" sometimes used is to ignore the equality
Specifically, in the case of finite-dimensional feature maps, they suggest minimizing

$$\frac{1}{n}\sum_{i=1}^{n} L(y_i)\cdot\left[Wx_i + b - y_i\right]_+ + \frac{\lambda}{2}|||W|||, \qquad (1)$$

where $|||W||| = \mathrm{trace}(W^TW)$, and $L(y_i) = \mathbf{1}_K - e_{y_i}$ is a vector that has 0 in the $k$th entry when $y_i$ designates category $k$, and a 1 in every other entry. Importantly, the decision function is constrained to sum to zero, i.e., $\mathbf{1}_K^T(Wx + b) = 0$ $\forall x$. The function $[\,\cdot\,]_+$ applies pointwise to its vector argument.
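As a concrete illustration of the bookkeeping in (1), the following minimal sketch evaluates the objective with NumPy. It assumes the encoding of Lee et al. (2004), in which category $k$ is represented by the vector with 1 in the $k$th entry and $-\frac{1}{K-1}$ elsewhere; the function and variable names are ours, for illustration only.

```python
import numpy as np

def msvm_objective(W, b, X, y, lam, K):
    """Evaluate the MSVM objective (1).

    W : (K, d) weights and b : (K,) intercepts, assumed to respect the
        sum-to-zero constraint 1_K^T (W x + b) = 0.
    X : (n, d) features; y : (n,) integer labels in {0, ..., K-1}.
    """
    n = X.shape[0]
    # Lee et al. (2004) encoding of the responses: 1 in the y_i-th
    # entry and -1/(K-1) in every other entry.
    Y = np.full((n, K), -1.0 / (K - 1))
    Y[np.arange(n), y] = 1.0
    # L(y_i) = 1_K - e_{y_i}: zero weight on the correct-category entry.
    L = np.ones((n, K))
    L[np.arange(n), y] = 0.0
    F = X @ W.T + b                   # decision function values, (n, K)
    hinge = np.maximum(F - Y, 0.0)    # [W x_i + b - y_i]_+, pointwise
    return np.sum(L * hinge) / n + 0.5 * lam * np.trace(W.T @ W)
```

A feasible candidate can be obtained from any pair (W, b) by centering across categories, e.g. W - W.mean(axis=0) and b - b.mean(), which enforces the sum-to-zero constraint exactly.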
Mroueh et al. (2012) advocate an encoding that does away with the sum-to-zero constraint. The loss function they suggest is based on the inner product between the decision function and their encoding of the $y_i$'s. Likewise, in the finite-dimensional case, the penalized minimization problem entailed by their loss function is

$$\frac{1}{n}\sum_{i=1}^{n}\sum_{y'\neq y_i}\left[\left\langle c_{y'}, \tilde{W}x_i + \tilde{b}\right\rangle + \frac{1}{K-1}\right]_+ + \frac{\lambda}{2}|||\tilde{W}|||, \qquad (2)$$

where $c_y$ is a unit vector in $\mathbb{R}^{K-1}$ which encodes the response; it is a row of a simplex coding matrix, which is the key building block of their construction.

A simplex coding matrix (Mroueh et al., 2012; Pires et al., 2013) is a matrix $C \in \mathbb{R}^{K\times(K-1)}$ whose rows $c_k$ satisfy (i) $\|c_k\|_2 = 1$; (ii) $c_i^Tc_j = -\frac{1}{K-1}$ for $i \neq j$; and (iii) $\sum_{k=1}^{K}c_k = 0_{K-1}$. It encodes the responses as unit vectors in $\mathbb{R}^{K-1}$ having maximal equal angle with each other. Further note that, because its image is the $(K-1)$-dimensional subspace $\{x \in \mathbb{R}^K : \mathbf{1}_K^Tx = 0\}$, any given $C$ has a unique inverse operator $\tilde{C}$ defined on that image.

For a given choice of simplex encoding defined by $C$, the operator $C : \mathbb{R}^{K-1} \to \mathbb{R}^K$ can be thought of as mapping decision functions and encoded $y$'s from the unrestricted simplex encoding space to the standard, restricted encoding space used by Lee et al. (2004).
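To make the object concrete, here is a minimal sketch (ours, not the authors') that builds one valid simplex coding matrix from an orthonormal basis of the sum-to-zero subspace, and verifies (i)-(iii) together with the identity $C^TC = \frac{K}{K-1}I_{K-1}$ that drives the penalty calculation below.

```python
import numpy as np

def simplex_coding(K):
    """Construct one simplex coding matrix C in R^{K x (K-1)}."""
    # Centering matrix: the orthogonal projection onto {x : 1_K^T x = 0}.
    H = np.eye(K) - np.ones((K, K)) / K
    # Its K-1 leading left singular vectors form an orthonormal basis of
    # that subspace; the rows of Q then have squared norm (K-1)/K.
    Q = np.linalg.svd(H)[0][:, :K - 1]
    return np.sqrt(K / (K - 1)) * Q   # rescale rows to unit norm

K = 4
C = simplex_coding(K)
G = C @ C.T
assert np.allclose(np.diag(G), 1.0)                            # (i) unit rows
assert np.allclose(G[~np.eye(K, dtype=bool)], -1.0 / (K - 1))  # (ii) equal angles
assert np.allclose(C.sum(axis=0), 0.0)                         # (iii) rows sum to zero
assert np.allclose(C.T @ C, K / (K - 1) * np.eye(K - 1))       # key observation
```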
A natural question is then: if $f(x) = Wx + b$ and $\tilde{f}(x) = \tilde{W}x + \tilde{b}$ are optimal solutions to (1) and (2), respectively, are $\tilde{C}(Wx + b)$ and $C(\tilde{W}x + \tilde{b})$ then optimal solutions to (2) and (1), respectively? We show that this is in fact the case. That is, both problems are exactly equivalent.

We now show the problems are equivalent. The equivalence of the loss functions is straightforward. Indeed,

$$\sum_{y'\neq y_i}\left[f_{y'}(x) + \frac{1}{K-1}\right]_+ = \sum_{y'\neq y_i}\left[\left(C\tilde{f}(x)\right)_{y'} + \frac{1}{K-1}\right]_+ = \sum_{y'\neq y_i}\left[\left\langle c_{y'}, \tilde{f}(x)\right\rangle + \frac{1}{K-1}\right]_+, \qquad (3)$$

which is exactly the SC-SVM loss of Mroueh et al. (2012). Writing out $f$ and $\tilde{f}$ as linear functions, the identity becomes

$$\sum_{y'\neq y_i}\left[\omega_{y'}^Tx + b_{y'} + \frac{1}{K-1}\right]_+ = \sum_{y'\neq y_i}\left[\left\langle c_{y'}, \tilde{W}x + \tilde{b}\right\rangle + \frac{1}{K-1}\right]_+, \qquad (4)$$

with $f(x) = Wx + b$ and $\tilde{f}(x) = \tilde{W}x + \tilde{b}$, where $\omega_{y'}$ is the $(y')$th row of $W$.
Equality (up to a change of tuning parameter) of the penalty relies on the key observation of this exercise, which is that $C^TC$ is the diagonal matrix $\frac{K}{K-1}I_{K-1}$. It then immediately follows that

$$\mathrm{trace}\left(W^TW\right) = \mathrm{trace}\left(\tilde{W}^TC^TC\tilde{W}\right) = \frac{K}{K-1}\,\mathrm{trace}\left(\tilde{W}^T\tilde{W}\right). \qquad (5)$$

In conclusion, we have

$$\frac{1}{n}\sum_{i=1}^{n} L(y_i)\cdot\left[Wx_i + b - y_i\right]_+ + \frac{\lambda}{2}|||W||| = \frac{1}{n}\sum_{i=1}^{n}\sum_{y'\neq y_i}\left[\left\langle c_{y'}, \tilde{W}x_i + \tilde{b}\right\rangle + \frac{1}{K-1}\right]_+ + \frac{K\lambda}{2(K-1)}|||\tilde{W}|||, \qquad (6)$$

as desired.
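For readers who want to verify the identity numerically, the following sketch (reusing simplex_coding and msvm_objective from the snippets above; all names are ours) maps a random unconstrained SC-SVM candidate $(\tilde{W}, \tilde{b})$ through $C$ and checks that the two objectives coincide once $\lambda$ is rescaled by $K/(K-1)$, as in (6):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n, lam = 4, 3, 50, 0.1
C = simplex_coding(K)                  # from the earlier snippet
X = rng.normal(size=(n, d))
y = rng.integers(K, size=n)

W_tilde = rng.normal(size=(K - 1, d))  # unconstrained SC-SVM candidate
b_tilde = rng.normal(size=K - 1)
W, b = C @ W_tilde, C @ b_tilde        # mapped MSVM candidate; sums to zero

# SC-SVM side of identity (6): hinge terms <c_{y'}, ~W x_i + ~b> + 1/(K-1)
# summed over y' != y_i, plus the rescaled penalty.
margins = (X @ W_tilde.T + b_tilde) @ C.T + 1.0 / (K - 1)   # (n, K)
mask = np.ones((n, K))
mask[np.arange(n), y] = 0.0
sc_objective = (np.sum(mask * np.maximum(margins, 0.0)) / n
                + 0.5 * lam * K / (K - 1) * np.trace(W_tilde.T @ W_tilde))

assert np.allclose(msvm_objective(W, b, X, y, lam, K), sc_objective)
assert np.allclose(np.ones(K) @ (X @ W.T + b).T, 0.0)  # sum-to-zero holds
```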
We now prove the key linear algebra result. There are other ways (see the remarks below) to prove it. However, it is desirable to establish the equivalence between the more practical encoding and the more interpretable one in an intuitive way. The geometric proof given below accomplishes this by establishing the equivalence through a volume preservation argument.

Proposition. Let $C \in \mathbb{R}^{K\times(K-1)}$ be a simplex coding matrix. Then its columns are orthogonal and have norm $\sqrt{\frac{K}{K-1}}$.

Proof. The key observation (Gantmacher, 1959, vol. 1, p. 251) is that

$$V = \sqrt{G}, \qquad (7)$$

where $V = V(C)$ is the volume of the parallelepiped spanned by the columns of $C$, and $G = G(C)$ is the Grammian of $C$. The Grammian (defined below) extends the notion of volume to objects determined by more vectors than the space they are embedded in has dimensions.

Let $C_{\cdot i}$ denote the $i$th column of $C$, and recall that $|||C|||$ denotes the sum of the squared entries of $C$. Note that

$$\sum_{m=0}^{n-1} e^{\frac{2\pi i jm}{n}} = \begin{cases} n, & j \bmod n = 0,\\ 0, & \text{otherwise.} \end{cases}$$

$\square$

The immediate implication of the above argument and proposition is that we may compute MSVM using the equivalent, unconstrained representation of SC-SVM. In Table 1, we display wall-clock computation times. Collected simulations suggest gains of an order of magnitude.

Remark 1. The result of the Proposition holds for a more general simplex matrix $C \in \mathbb{R}^{K\times D}$, $0 <$