Equivalence of Multicategory SVM and Simplex Cone SVM: Fast Computations and Statistical Theory

Guillaume A. Pouliot

Harris School of Public Policy, University of Chicago, Chicago, IL, USA. Correspondence to: Guillaume A. Pouliot.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

The multicategory SVM (MSVM) of Lee et al. (2004) is a natural generalization of the classical, binary support vector machines (SVM). However, its use has been limited by computational difficulties. The simplex-cone SVM (SC-SVM) of Mroueh et al. (2012) is a computationally efficient multicategory classifier, but its use has been limited by a seemingly opaque interpretation. We show that MSVM and SC-SVM are in fact exactly equivalent, and provide a bijection between their tuning parameters. MSVM may then be entertained as both a natural and computationally efficient multicategory extension of SVM. We further provide a Donsker theorem for finite-dimensional kernel MSVM and partially answer the open question pertaining to the very competitive performance of One-vs-Rest methods against MSVM. Furthermore, we use the derived asymptotic covariance formula to develop an inverse-variance weighted classification rule which improves on the One-vs-Rest approach.

1. Introduction

Support vector machines (SVM) is an established algorithm for classification with two categories (Vapnik, 1998; Smola and Schölkopf, 1998; Steinwart and Christmann, 2008; Friedman et al., 2009). The method finds the maximum margin separating hyperplane; it finds the hyperplane dividing the input space (perhaps after mapping the data to a higher dimensional space) into two categories and maximizing the minimum distance from a point to the hyperplane. SVM can also be adapted to allow for imperfect classification, in which case we speak of soft margin SVM.

Given the success of SVM at binary classification, many attempts have been made at extending the methodology to accommodate classification with K > 2 categories (Sun et al., 2017; Dogan et al., 2016; Lopez et al., 2016; Kumar et al., 2017; a survey is available in Ma and Guo, 2014). Lee, Lin and Wahba (2004) propose what is arguably the natural multicategory generalization of binary SVM. In particular, their multicategory SVM (MSVM) is Fisher consistent (i.e., the classification rule it produces converges to the Bayes rule), which is a key property and motivation for the use of standard SVM. Furthermore, it encompasses standard SVM as a special case.

However, the method has not been widely used in application, nor has it been studied from a statistical perspective, the way SVM has been. Amongst the machine learning community, MSVM has not gathered popularity commensurate with that of SVM. Likewise, three major publications (Jiang et al., 2008; Koo et al., 2008; Li et al., 2011) have established Donsker theorems for SVM, and none have done so for MSVM.

Interestingly, computation and statistical analysis of MSVM are hindered by the same obstacle. The optimization problem which MSVM consists of is solved under a sum-to-zero constraint on the vector argument. This makes both the numerical optimization task and the statistical asymptotic analysis of the estimator more challenging. The numerical optimization is substantially slowed down by the equality constraint, as detailed in Table 1. (In fact, a "hack" sometimes used is to ignore the equality constraint in the primal or dual formulation; this can result in arbitrarily large distortions of the optimal solution.) Likewise, standard methods for deriving Donsker theorems and limit distribution theory do not apply to such constrained vectors of random variables. (For instance, the covariance matrix of a vector of random variables constrained to sum to zero is not positive definite.)

In a separate strain of literature, Mroueh, Poggio, Rosasco and Slotine (2012) have proposed the simplex-cone SVM (SC-SVM), a multicategory classifier developed within the vector reproducing kernel Hilbert space set-up. The SC-SVM optimization program is computationally tractable, and in particular does away with the equality constraint slowing down the primal and dual formulations of MSVM (Lee et al., 2004). Nevertheless, its use has remained marginal, arguably due to more limited interpretability; e.g., the notion of distance is captured via angles, and the nesting of binary SVM as a special case is not entirely straightforward.

As our main contribution, we show that MSVM and SC-SVM are in fact exactly equivalent. As a direct consequence, we deliver faster computations for MSVM. Simulations such as those presented in Table 1 display speed gains of an order of magnitude. Furthermore, the equivalence with an unconstrained estimator allows for statistical analysis of MSVM. As a second contribution, we deliver a Donsker theorem for MSVM, as well as an asymptotic covariance formula with sample analog. A third contribution is to use the asymptotic analysis to propose a statistically efficient, inverse-variance weighted modification of One-vs-Rest. Finally, as a fourth contribution, we show that the asymptotic analysis allows us to partially answer, with analytic characterizations, the open question relating to the very competitive performance of the seemingly more naive One-vs-Rest method against MSVM.

The fourth contribution is important because it provides analytical substance to a long-standing open question. To be sure, the different attempts at developing a multicategory generalization of binary SVM can be understood as subscribing to one of two broad approaches. The first approach consists in doing multicategory classification using the standard, binary SVM. For instance, the popular One-vs-Rest approach works as follows: to predict the category of a point in a test set (i.e., out of sample, or the fitted category of a point in the training set), run K binary SVMs where the first category is one of the original K categories, and the second category is the union of the remaining K - 1 categories. The predicted category is the one that was picked against all others with the greatest "confidence". In practice, the confidence criterion used is the distance of the test point to the separating hyperplane (we show in Subsection 3.1 that even this can be improved according to statistical considerations). The second approach consists in generalizing the standard SVM to develop a single machine which implements multicategory classification by solving a single, joint optimization problem. Many such algorithms have been suggested (Weston and Watkins, 1999; Crammer and Singer, 2002; Lee et al., 2004). Intuition would suggest that joint optimization makes for a more statistically efficient procedure, and for superior out-of-sample prediction performance. However, in a quite counterintuitive turn of events, it has been widely observed in practice that multicategory classification with binary machines offers a performance (for instance, in out-of-sample classification) which is competitive with, and sometimes superior to, that of single-machine multicategory SVM algorithms. This phenomenon is widely acknowledged (Rifkin and Klautau, 2004) but very little theory has been put forth to explain it.

We make some progress towards an analytical characterization of the comparative performance, and are able to suggest an explanation as to the competitive, and sometimes superior, empirical performance of One-vs-Rest compared to MSVM. We argue that, in some respect, One-vs-Rest makes a more efficient use of the information contained in the data.

The remainder of the paper is organized as follows. Section 2 defines both MSVM and SC-SVM, and contains the proof of the equivalence between the two methods. Section 3 gives the Donsker theorem for MSVM, and describes how the asymptotic distribution may be used for more efficient classification. Section 4 suggests an analytical explanation for the surprisingly competitive performance of One-vs-Rest classifiers versus MSVM. Section 5 discusses and concludes.

2. Equivalence

The multicategory SVM (MSVM) of Lee et al. (2004) is arguably the more elegant and natural generalization of SVM to multicategory data. However, its implementation, even for moderate size data sets, is complicated by the presence of a sum constraint on the vector argument. The simplex encoding of Mroueh et al. (2012) is relieved of the linear constraint on the vector argument. However, we believe the simplex encoding is not more widely used because it is not known what standard encoding it corresponds to, making it challenging for practitioners to carry out interpretable classification analysis. The following result resolves both issues, making it of practical interest for analysts and researchers using multicategory classification methods.

We define MSVM and SC-SVM, and establish their equivalence. The presentation is done with finite-dimensional kernels for ease of exposition. Remark 3 details the generalization to infinite-dimensional kernels in reproducing kernel Hilbert spaces.

With K categories, data is of the form (x_i, y_i) \in R^p \times \{1, ..., K\}, i = 1, ..., N. When carrying out multicategory classification, different choices of encodings of the category variables y_i lead to optimization problems that are differently formulated and implemented.

For their multicategory SVM (MSVM), Lee et al. (2004) encode y_i associated with category k \in \{1, ..., K\} as a K-tuple with 1 in the k-th entry and -1/(K-1) in every other entry. For instance,

    "y_i in category 2"  =>  y_i = ( -1/(K-1), 1, -1/(K-1), ..., -1/(K-1) ).
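To make the encoding concrete, the short Python snippet below (our own illustration, not taken from the paper's code file) builds the Lee et al. (2004) encoding of a response and the cost vector L(y) used in the loss defined next; note that each encoded response sums to zero, which is also the constraint imposed on the MSVM decision function.

    import numpy as np

    def lee_encoding(k, K):
        """Lee et al. (2004) encoding: 1 in the k-th entry, -1/(K-1) elsewhere."""
        y = np.full(K, -1.0 / (K - 1))
        y[k - 1] = 1.0                      # categories are labelled 1, ..., K
        return y

    def cost_vector(k, K):
        """L(y): 0 in the k-th entry, 1 in every other entry."""
        L = np.ones(K)
        L[k - 1] = 0.0
        return L

    K = 4
    y = lee_encoding(2, K)         # "y_i in category 2"
    print(y)                       # [-1/3, 1, -1/3, -1/3]
    print(np.isclose(y.sum(), 0))  # True: the encoded response sums to zero
    print(cost_vector(2, K))       # [1, 0, 1, 1]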

The loss they suggest is then based on the difference between the decision function and the encoded y_i's. Specifically, in the case of finite-dimensional feature maps, they suggest minimizing

    \frac{1}{n} \sum_{i=1}^{n} L(y_i) \cdot [W x_i + b - y_i]_+ + \frac{\lambda}{2} |||W|||,    (1)

where |||W||| = trace(W^T W), and L(y_i) = 1_K - e_{y_i} is a vector that has 0 in the k-th entry when y_i designates category k, and a 1 in every other entry. Importantly, the decision function is constrained to sum to zero, i.e. \sum_k f_k(x) = 0 for all x. The function [\cdot]_+ applies pointwise to its vector argument.

Mroueh et al. (2012) preconize an encoding that does away with the sum-to-zero constraint. The loss function they suggest is based on the inner product between the decision function and their encoding of the y_i's. Likewise in the finite-dimensional case, the penalized minimization problem entailed by their loss function is

    \frac{1}{n} \sum_{i=1}^{n} \sum_{y' \neq y_i} \Big[ \frac{1}{K-1} + \langle c_{y'}, \tilde{W} x_i + \tilde{b} \rangle \Big]_+ + \frac{\tilde{\lambda}}{2} |||\tilde{W}|||,    (2)

where c_y is a unit vector in R^{K-1} which encodes the response; it is a row of a simplex coding matrix, which is the key building block of their construction.

A simplex coding matrix (Mroueh et al., 2012; Pires et al., 2013) is a matrix C \in R^{K \times (K-1)} such that its rows c_k satisfy (i) ||c_k||_2 = 1; (ii) c_i^T c_j = -1/(K-1) for i \neq j; and (iii) \sum_{k=1}^{K} c_k = 0_{K-1}. It encodes the responses as unit vectors in R^{K-1} having maximal equal angle with each other. Further note that, because its image is a (K-1)-dimensional subspace of R^K, any given C has a unique inverse operator \tilde{C} defined on the image \{ x \in R^K : 1_K^T x = 0 \}.

For a given choice of simplex encoding defined by C, the operator C : R^{K-1} -> R^K can be thought of as mapping decision functions and encoded y's from the unrestricted simplex encoding space to the standard, restricted encoding space used by Lee et al. (2004).

A natural question is then: if f(x) = W x + b and \tilde{f}(x) = \tilde{W} x + \tilde{b} are optimal solutions to (1) and (2), respectively, are \tilde{C}(W x + b) and C(\tilde{W} x + \tilde{b}) then optimal solutions to (2) and (1), respectively? We show that this is in fact the case. That is, both problems are exactly equivalent.

We now show the problems are equivalent. The equivalence of the loss functions is straightforward. Indeed,

    \sum_{y' \neq y_i} \Big[ f_{y'}(x) + \frac{1}{K-1} \Big]_+ = \sum_{y' \neq y_i} \Big[ \big( C \tilde{f}(x) \big)_{y'} + \frac{1}{K-1} \Big]_+ = \sum_{y' \neq y_i} \Big[ \langle c_{y'}, \tilde{f}(x) \rangle + \frac{1}{K-1} \Big]_+,    (3)

which is exactly the SC-SVM loss of Mroueh et al. (2012). Writing out f and \tilde{f} as linear functions, the identity becomes

    \sum_{y' \neq y_i} \Big[ \omega_{y'}^T x + b_{y'} + \frac{1}{K-1} \Big]_+ = \sum_{y' \neq y_i} \Big[ \langle c_{y'}, \tilde{W} x + \tilde{b} \rangle + \frac{1}{K-1} \Big]_+,    (4)

with f(x) = W x + b and \tilde{f}(x) = \tilde{W} x + \tilde{b}, and \omega_{y'} is the (y')-th row of W.

Equality (up to a change of tuning parameter) of the penalty relies on the key observation of this exercise, which is that C^T C is the diagonal matrix \frac{K}{K-1} I_{K-1}. It then immediately follows that

    trace(\tilde{W}^T \tilde{W}) = \frac{K-1}{K} trace(\tilde{W}^T C^T C \tilde{W}) = \frac{K-1}{K} trace(W^T W).    (5)

In conclusion, we have

    \frac{1}{n} \sum_{i=1}^{n} L(y_i) \cdot [W x_i + b - y_i]_+ + \frac{\lambda}{2} |||W||| = \frac{1}{n} \sum_{i=1}^{n} \sum_{y' \neq y_i} \Big[ \frac{1}{K-1} + \langle c_{y'}, \tilde{W} x_i + \tilde{b} \rangle \Big]_+ + \frac{K}{K-1} \frac{\lambda}{2} |||\tilde{W}|||,    (6)

as desired; in particular, the tuning parameters of the two problems are in bijection via \tilde{\lambda} = \lambda K/(K-1).

We now prove the key linear algebra result. There are other ways (see remarks below) to prove this result. However, it is desirable to establish the equivalence between the more practical encoding and the more interpretable one in an intuitive way. The geometric proof given below accomplishes this by establishing the equivalence through a volume preservation argument.

PROPOSITION. Let C \in R^{K \times (K-1)} be a simplex coding matrix. Then its columns are orthogonal and have norm \sqrt{K/(K-1)}.

Proof.

The key observation (Gantmacher, 1959, vol. 1, p. 251) is that

    V = \sqrt{G},    (7)

where V = V(C) is the volume of the parallelepiped spanned by the columns of C, and G = G(C) is the Grammian of C. The Grammian extends the notion of volume to objects determined by more vectors than the space they are embedded in has dimensions. Let C_{\cdot i} denote the i-th column of C, and recall that |||C||| denotes the sum of the squared entries of C. Note that

    \sum_{m=0}^{n-1} e^{2 \pi i j m / n} = \begin{cases} n, & j \bmod n = 0 \\ 0, & \text{otherwise.} \end{cases}

Remark 1. The result of the Proposition holds for a more general simplex matrix C \in R^{K \times D}.

The immediate implication of the above argument and the Proposition is that we may compute MSVM using the equivalent, unconstrained representation of SC-SVM. In Table 1, we display "clock-on-the-wall" computation times. Collected simulations suggest gains of an order of magnitude.
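Since the equivalence in (6) is an exact algebraic identity, it can also be checked numerically. The following Python sketch is our own illustration, not the paper's code: it builds a simplex coding matrix C by rescaling an orthonormal basis of the subspace orthogonal to the all-ones vector (one valid construction among many; the appendix describes the paper's spherical-coordinates construction), draws random data and SC-SVM parameters, maps them to MSVM parameters through C, and confirms that the two penalized objectives coincide once the tuning parameter is rescaled by K/(K-1).

    import numpy as np

    def simplex_code(K):
        """A K x (K-1) simplex coding matrix, built by rescaling an orthonormal
        basis of the subspace of R^K orthogonal to the all-ones vector."""
        A = np.eye(K) - np.ones((K, K)) / K     # projection onto the sum-to-zero subspace
        Q, _ = np.linalg.qr(A[:, :K - 1])       # K x (K-1), orthonormal columns
        return np.sqrt(K / (K - 1)) * Q         # rows become unit vectors with inner products -1/(K-1)

    def msvm_objective(W, b, X, y, lam, K):
        """MSVM objective (1): (1/n) sum_i L(y_i).[W x_i + b - enc(y_i)]_+ + (lam/2) tr(W'W)."""
        total = 0.0
        for i in range(X.shape[0]):
            enc = np.full(K, -1.0 / (K - 1)); enc[y[i]] = 1.0   # Lee et al. encoding of y_i
            L = np.ones(K); L[y[i]] = 0.0                       # cost vector L(y_i)
            total += L @ np.maximum(W @ X[i] + b - enc, 0.0)
        return total / X.shape[0] + 0.5 * lam * np.trace(W.T @ W)

    def scsvm_objective(Wt, bt, X, y, lam_t, C, K):
        """SC-SVM objective (2): (1/n) sum_i sum_{y' != y_i} [1/(K-1) + <c_{y'}, Wt x_i + bt>]_+
        + (lam_t/2) tr(Wt'Wt)."""
        total = 0.0
        for i in range(X.shape[0]):
            f = Wt @ X[i] + bt
            total += sum(max(1.0 / (K - 1) + C[yp] @ f, 0.0) for yp in range(K) if yp != y[i])
        return total / X.shape[0] + 0.5 * lam_t * np.trace(Wt.T @ Wt)

    rng = np.random.default_rng(0)
    K, p, n, lam = 4, 3, 50, 0.7
    C = simplex_code(K)
    X = rng.normal(size=(n, p))
    y = rng.integers(0, K, size=n)
    Wt, bt = rng.normal(size=(K - 1, p)), rng.normal(size=K - 1)   # arbitrary SC-SVM parameters
    W, b = C @ Wt, C @ bt                                          # mapped MSVM parameters
    lhs = msvm_objective(W, b, X, y, lam, K)
    rhs = scsvm_objective(Wt, bt, X, y, lam * K / (K - 1), C, K)
    print(np.isclose(lhs, rhs))   # True: the two objectives coincide, as in (6)

The same mapping is what makes the fast computations possible in practice: one solves the unconstrained SC-SVM problem in (W-tilde, b-tilde) and recovers the MSVM solution through C.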

Table 1. Computation times in seconds. Primal and dual MSVMs are implemented as described in Lee et al. (2004). All simulations are for K = 3 balanced categories. Computations were done in Gurobi on a MacBook Pro with a 3.1 GHz Intel Core i7 processor.

    FORMULATION      n = 200    n = 1000    n = 5000
    PRIMAL MSVM          5.5        81.4      9740.4
    DUAL MSVM            0.4         8.2       887.5
    SC-SVM               0.1         0.4         7.7

3. Donsker Theorem

By considering MSVM as a penalized M-estimator, one can in principle work out its asymptotic distribution. In (2), under simplex encoding, MSVM is phrased as an unconstrained M-estimator, and the asymptotic distribution for the estimated parameters, and thus for the separating hyperplane, can be obtained using standard empirical process theory. The expressions for the covariance matrices presented below are novel and of practical use. To the best of my knowledge, if a practitioner wants to compute the asymptotic covariance matrix of SVM or MSVM, which is essential in order to know where extrapolation is reliable, this article is the only resource displaying worked out expressions with sample analogs (Koo et al. (2008) and Jiang et al. (2008) do not provide expressions with sample analogs).

One readily obtains (Van der Vaart, 2000) a standard central limit theorem result of the form

    \sqrt{n} \big( \hat{\tilde{\Theta}}_n - \tilde{\Theta}^* \big) \to_d N\big( 0, H_{Multi}^{-1} \Omega_{Multi} H_{Multi}^{-1} \big),    (9)

where \tilde{\Theta} = (vec(\tilde{W})^T, \tilde{b}^T)^T, the information matrix \Omega_{Multi} is

    E\Big[ \Big( \sum_{y' \neq y} c_{y'} 1\{ \langle c_{y'}, \tilde{f} \rangle \geq \tilde{a} \} \Big) \Big( \sum_{y' \neq y} c_{y'} 1\{ \langle c_{y'}, \tilde{f} \rangle \geq \tilde{a} \} \Big)^T \otimes (x^T, 1)^T (x^T, 1) \Big],

and the Hessian H_{Multi} is

    E_y\Big[ \sum_{y' \neq y} \big( c_{y'} c_{y'}^T \big) \, p_{\langle c_{y'}, \tilde{W} x + \tilde{b} \rangle | y}(\tilde{a}) \; E\big[ (x^T, 1)^T (x^T, 1) \,\big|\, \langle c_{y'}, \tilde{f} \rangle = \tilde{a}, y \big] \Big].

Both are evaluated at \tilde{\Theta}^*, \tilde{f} = \tilde{f}(x), and \tilde{a} = -1/(K-1), and p_{\langle c_{y'}, \tilde{W} x + \tilde{b} \rangle | y} is the density of \langle c_{y'}, \tilde{f} \rangle conditional on y. Derivations are given in the online appendix.

3.1. Efficient Classifiers

SVM are most commonly used for classification and prediction tasks. Accordingly, the most immediate practical use for an estimate of the variance of the separating hyperplane is the construction of a more accurate classifier.

Consider the One-vs-Rest method, for instance. The One-vs-Rest method fits K hyperplanes, which in the linear case are defined by (\omega_i, b_i) \in R^{p+1}, and categorizes a point by attributing it to the category in which it is the "deepest". That is,

    \hat{y}_{new} = \arg\max_k \big\{ \hat{\omega}_k^T x_{new} + \hat{b}_k \big\}.

However, studentized distances yield more sensible and reliable classifications by accounting for the comparative uncertainty of the hyperplanes when categorizing a given point. Naturally, a point being "deeper" with respect to a classifying hyperplane, in terms of the length of the line from the point to the hyperplane and normal to the hyperplane, should make one more confident in the classification if it occurs in a section of the space where the hyperplane has lower variance. In sections with high variance, the distance could be much smaller in resamplings of the data. Accordingly, we suggest the following efficient categorization rule

    \hat{y}^*_{new} = \arg\max_k \Big\{ \frac{ \hat{\omega}_k^T x_{new} + \hat{b}_k }{ \sqrt{ (x_{new}^T, 1) \, \Sigma_k \, (x_{new}^T, 1)^T } } \Big\},

where \Sigma_k is the asymptotic variance of (\hat{\omega}_k, \hat{b}_k), or a consistent estimate. An analog modification can be applied to make the MSVM procedure more efficient.
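As an illustration of the studentized rule, the sketch below is our own; the numbers and the plug-in covariance matrices are made up for illustration, and in practice the Sigma_k would come from the sample-analog formulas of this section or from a bootstrap.

    import numpy as np

    def ovr_predict(x_new, omegas, bs):
        """Naive One-vs-Rest: assign to the category whose hyperplane the point is deepest in."""
        scores = np.array([w @ x_new + b for w, b in zip(omegas, bs)])
        return int(np.argmax(scores))

    def ovr_predict_studentized(x_new, omegas, bs, Sigmas):
        """Inverse-variance weighted rule: studentize each signed distance by the estimated
        standard deviation of the decision function at x_new."""
        z = np.append(x_new, 1.0)                 # (x_new^T, 1)
        scores = [(w @ x_new + b) / np.sqrt(z @ Sigma @ z)
                  for w, b, Sigma in zip(omegas, bs, Sigmas)]
        return int(np.argmax(scores))

    # Toy example with K = 3 hyperplanes in R^2; all values are illustrative.
    omegas = [np.array([1.0, 0.2]), np.array([-0.8, 0.5]), np.array([0.1, -1.1])]
    bs = [0.1, -0.2, 0.3]
    Sigmas = [0.05 * np.eye(3), 2.0 * np.eye(3), 0.05 * np.eye(3)]   # category 2 is estimated noisily
    x_new = np.array([0.4, 0.9])
    print(ovr_predict(x_new, omegas, bs), ovr_predict_studentized(x_new, omegas, bs, Sigmas))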

4. Efficiency of One-vs-Rest

Explaining the surprisingly competitive performance of the naive One-vs-Rest approach, comparatively to the more sophisticated MSVM approach, is an important open question. The phenomenon is detailed and documented empirically in Rifkin and Klautau (2004) and is well established in the machine learning folklore. However, there are practically no theoretical results in the way of an explanation. In this section, we consider this question from the asymptotic statistics perspective and argue that the competitive performance of One-vs-Rest may be explained by a more efficient use of information.

The idea is to consider the full One-vs-Rest method as a single M-estimator and to artificially impose a sum-to-zero constraint on the decision function. We can then use the simplex encoding and obtain the (joint) asymptotic variance of the K separating hyperplanes in the form H_{1vsR}^{-1} \Omega_{1vsR} H_{1vsR}^{-1}.

Note that we pick the geometric margin to be 1/(K-1), rather than 1 as in the standard form for binary (and thus One-vs-Rest) SVM. The loss function for One-vs-Rest in simplex encoding is

    \sum_{k=1}^{K} \Big( 1\{ y = k \} \Big[ \frac{1}{K-1} - \langle c_k, \tilde{W} x + \tilde{b} \rangle \Big]_+ + 1\{ y \neq k \} \Big[ \frac{1}{K-1} + \langle c_k, \tilde{W} x + \tilde{b} \rangle \Big]_+ \Big),    (10)

which is minimized in \tilde{W} and \tilde{b}. The first summand penalizes classifications for which the point x is not sufficiently far from the hyperplane, within the true category. This is where we speak of using the information from a point's "own" category. The second summand penalizes classifications for which the point x is not sufficiently far from the hyperplane, away from the wrong category. This is where we speak of using the information from "other" categories.

The sum-to-zero constraint is added for analytical reasons; we need it to make the covariance matrices comparable. It will be apparent that the analytical conclusion is robust to this modification.

The information matrix \Omega_{1vsR} is

    E\Big[ \big( c_y 1\{ \langle c_y, \tilde{f} \rangle \leq -\tilde{a} \} \big) \big( c_y 1\{ \langle c_y, \tilde{f} \rangle \leq -\tilde{a} \} \big)^T \otimes (x^T, 1)^T (x^T, 1) \Big]
    - 2 E\Big[ \big( c_y 1\{ \langle c_y, \tilde{f} \rangle \leq -\tilde{a} \} \big) \Big( \sum_{y' \neq y} c_{y'} 1\{ \langle c_{y'}, \tilde{f} \rangle \geq \tilde{a} \} \Big)^T \otimes (x^T, 1)^T (x^T, 1) \Big]
    + E\Big[ \Big( \sum_{y' \neq y} c_{y'} 1\{ \langle c_{y'}, \tilde{f} \rangle \geq \tilde{a} \} \Big) \Big( \sum_{y' \neq y} c_{y'} 1\{ \langle c_{y'}, \tilde{f} \rangle \geq \tilde{a} \} \Big)^T \otimes (x^T, 1)^T (x^T, 1) \Big],

and the Hessian H_{1vsR} is

    E_y\Big[ \big( c_y c_y^T \big) \, p_{\langle c_y, \tilde{W} x + \tilde{b} \rangle | y}(-\tilde{a}) \; E\big[ (x^T, 1)^T (x^T, 1) \,\big|\, \langle c_y, \tilde{f} \rangle = -\tilde{a}, y \big] \Big]
    + E_y\Big[ \sum_{y' \neq y} \big( c_{y'} c_{y'}^T \big) \, p_{\langle c_{y'}, \tilde{W} x + \tilde{b} \rangle | y}(\tilde{a}) \; E\big[ (x^T, 1)^T (x^T, 1) \,\big|\, \langle c_{y'}, \tilde{f} \rangle = \tilde{a}, y \big] \Big].

We get instructive comparisons. First of all, H_{Multi} < H_{1vsR}; note the addition of the positive y' = y summand. That is, the One-vs-Rest problem has more "curvature" than the MSVM one. Indeed, it is clear from inspection that this comes from the One-vs-Rest procedure using information from the "own" category, while MSVM does not, as it only uses information with respect to "other" categories. It is clear from the comparison of the loss functions (2) and (10), corresponding to SC-SVM and the simplex encoding of One-vs-Rest, respectively, that both penalize an observation that falls within the half-space assigned to an "other" category, but only One-vs-Rest rewards points for falling within their true, "own" category. It was not clear, however, whether rewarding a point for being in its "own" category is still informative and not redundant when it is already penalized for being in any "other" category. However, in spite of imposing an additional constraint on the solution space of the One-vs-Rest problem, we do find from inspection of the Hessian that the additional information from rewarding classification of points within their "own" category is informative and not redundant. Although this was not obvious a priori, it is revealed by the statistical asymptotic analysis.

Furthermore, in the special case of a separable data generating process (DGP), that is, in the case in which 1\{ \langle c_y, \tilde{f} \rangle \leq -\tilde{a} \} = 0 a.s., we get that \Omega_{Multi} = \Omega_{1vsR} and both procedures have the same target hyperplane. Therefore, One-vs-Rest is strictly more statistically efficient than the multicategory machine when the DGP is separable. In this specific case, this translates into smaller expected prediction error. We have thus displayed a case, that of perfect separability, where One-vs-Rest (with simplex encoding) provably dominates MSVM.

In non-separable cases, this dominance may not hold. In fact, we expect MSVM to outperform One-vs-Rest in some cases due to the more efficient gathering, by joint optimization, of the information from the "other" categories.

5. Discussion and Conclusion

We established rigorously, and with a proof conveying geometric intuition, the equivalence of MSVM and SC-SVM. This provides a formulation of the optimization problem for computing MSVM which is relieved of the sum-to-zero constraint that bogged down computations in the implementations suggested in Lee et al. (2004). Our hope is that the availability of faster computations for MSVM will encourage applied researchers and analysts to employ MSVM in multicategory classification tasks. We gave the first central limit theorem for MSVM, along with an asymptotic covariance formula having a sample analog, which is a new result even for binary SVM. The variance formula allows for the construction of studentized decision functions for One-vs-Rest procedures, improving their accuracy and statistical efficiency. These make for more reliable classification, especially for extrapolation. We gave an analytical characterization of the surprisingly good performance of the One-vs-Rest procedure, comparatively to MSVM, using the asymptotic distribution of estimators. We hope this line of study fosters further research.

Acknowledgements

I would like to thank Lorenzo Rosasco for introducing me to the material studied in this article and for his support throughout this project. Different angles for studying SVM methods as M-estimators arose in stimulating conversation with Isaiah Andrews. Jann Spiess read an early draft of the article and made helpful comments. Jules Marchand-Gagnon collaborated on a cousin project and contributed insights which this article bears the mark of.

Appendix: Construction of the Simplex Encoding Matrix C

It is straightforward to build a function that takes K and outputs the simplex encoding matrix C. We give intuitive pseudocode for the general K case. A code file is available in the online appendix.

The construction relies on mapping vectors in spherical coordinates to their representation in Cartesian coordinates. The function StoC : R^{K-2} -> R^{K-1} does just that.

    StoC <- function(theta_1, theta_2, ..., theta_{K-2}) {
        v_1     = cos(theta_1)
        v_2     = sin(theta_1) cos(theta_2)
        v_3     = sin(theta_1) sin(theta_2) cos(theta_3)
        ...
        v_{K-2} = sin(theta_1) ... sin(theta_{K-3}) cos(theta_{K-2})
        v_{K-1} = sin(theta_1) ... sin(theta_{K-3}) sin(theta_{K-2})
        v = (v_1, ..., v_{K-1})
        return v
    }

Using the StoC function, it is now easy to construct the simplex encoding matrix. This can be done with the function C : K -> R^{K x (K-1)}, which we now describe.

    C <- function(K) {
        C_{1,.}   = StoC(0, 0, ..., 0)
        C_{2,.}   = StoC(acos(-1/(K-1)), 0, ..., 0)
        C_{3,.}   = StoC(acos(-1/(K-1)), acos(-1/(K-2)), 0, ..., 0)
        ...
        C_{K-1,.} = StoC(acos(-1/(K-1)), acos(-1/(K-2)), ..., acos(-1/3), acos(-1/2))
        C_{K,.}   = StoC(acos(-1/(K-1)), acos(-1/(K-2)), ..., acos(-1/3), 2 acos(-1/2))
        return C
    }
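As a concrete transcription of this pseudocode, the following Python sketch (ours; the names sto_c and simplex_encoding are not those of the paper's code file) implements the construction for K >= 3 and checks the defining properties of a simplex coding matrix.

    import numpy as np

    def sto_c(angles):
        """Spherical-to-Cartesian map: K-2 angles to a unit vector in R^(K-1)."""
        v = np.empty(len(angles) + 1)
        sin_prod = 1.0                      # running product sin(theta_1)...sin(theta_{j-1})
        for j, t in enumerate(angles):
            v[j] = sin_prod * np.cos(t)
            sin_prod *= np.sin(t)
        v[-1] = sin_prod
        return v

    def simplex_encoding(K):
        """Build the K x (K-1) simplex coding matrix row by row, as in the pseudocode."""
        assert K >= 3
        C = np.zeros((K, K - 1))
        for k in range(K):                  # k = 0, ..., K-1 corresponds to rows 1, ..., K
            angles = np.zeros(K - 2)
            for j in range(min(k, K - 2)):
                angles[j] = np.arccos(-1.0 / (K - 1 - j))
            if k == K - 1:                  # last row: flip the final angle
                angles[K - 3] = 2 * np.arccos(-1.0 / 2.0)
            C[k] = sto_c(angles)
        return C

    K = 5
    C = simplex_encoding(K)
    print(np.allclose(np.linalg.norm(C, axis=1), 1.0))            # rows are unit vectors
    print(np.allclose(C @ C.T + 1.0 / (K - 1),
                      (1.0 + 1.0 / (K - 1)) * np.eye(K)))         # pairwise inner products -1/(K-1)
    print(np.allclose(C.sum(axis=0), 0.0))                        # rows sum to zero
    print(np.allclose(C.T @ C, K / (K - 1) * np.eye(K - 1)))      # columns orthogonal, norm^2 = K/(K-1)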

References

A. J. Chan. Gröbner Bases over Fields with Valuations and Tropical Curves by Coordinate Projections. PhD thesis, University of Warwick, 2013.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2: 265–292, 2001.

Ü. Doğan, T. Glasmachers, and C. Igel. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17(45): 1–32, 2016.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, Springer, 2009.

F. R. Gantmacher. The Theory of Matrices. Chelsea, New York, 1959.

B. Jiang, X. Zhang, and T. Cai. Estimating the confidence interval for prediction errors of support vector machine classifiers. Journal of Machine Learning Research, 9: 521–540, 2008.

J. Koo, Y. Lee, Y. Kim, and C. Park. A Bahadur representation of the linear support vector machine. Journal of Machine Learning Research, 9: 1343–1368, 2008.

D. Kumar and M. Thakur. All-in-one multicategory least squares nonparallel hyperplanes support vector machine. Pattern Recognition Letters, 2017.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99: 67–81, 2004.

B. Li, A. Artemiou, and L. Li. Principal support vector machines for linear and nonlinear sufficient dimension reduction. The Annals of Statistics, 39: 3182–3210, 2011.

J. López, S. Maldonado, and M. Carrasco. A novel multi-class SVM model using second-order cone constraints. Applied Intelligence, 44(2): 457–469, 2016.

Y. Ma and G. Guo, editors. Support Vector Machines Applications. Springer, 2014.

Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. E. Slotine. Multiclass learning with simplex coding. Advances in Neural Information Processing Systems (NIPS), 2012.

B. A. Pires, C. Szepesvári, and M. Ghavamzadeh. Cost-sensitive multiclass classification risk bounds. Proceedings of the 30th International Conference on Machine Learning, 2013.

R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5: 101–141, 2004.

A. J. Smola and B. Schölkopf. Learning with kernels. GMD-Forschungszentrum Informationstechnik, 1998.

I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.

H. Sun, B. A. Craig, and L. Zhang. Angle-based multicategory distance-weighted SVM. Journal of Machine Learning Research, 18(1): 2981–3001, 2017.

A. W. van der Vaart. Asymptotic Statistics. Vol. 3, Cambridge University Press, 2000.

V. N. Vapnik. Statistical learning theory. Vol. 1. New York: Wiley, 1998.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, vol. 99, pp. 219-224. 1999.