Tight bounds for SVM classification error

B. Apolloni, S. Bassis, S. Gaito, D. Malchiodi
Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
Via Comelico 39/41, 20135 Milano, Italy
E-mail: {apolloni, bassis, gaito, malchiodi}@dsi.unimi.it

Abstract—We find very tight bounds on the accuracy of a Support Vector Machine classification error within the Algorithmic Inference framework. The framework is specially suitable for this kind of classifier since (i) we know the number of support vectors actually employed, as an ancillary output of the learning procedure, and (ii) we can appreciate confidence intervals of the misclassification probability exactly as a function of the cardinality of these vectors. As a result we obtain confidence intervals that are up to an order of magnitude narrower than those supplied in the literature, having a slightly different meaning due to the different approach they come from, but the same operational function. We numerically check the coverage of these intervals.

I. INTRODUCTION

Support Vector Machines (SVM for short) [1] represent an operational tool widely used by the Machine Learning community. Per se an SVM is a hyperplane in an n-dimensional space committed to separate positive from negative points of a linearly separable Cartesian space. The success of these machines in comparison with analogous models, such as a real-inputs perceptron, is due to the algorithm employed to learn them from examples, which performs very efficiently and relies on a well defined small subset of examples that it manages in a symbolic way. Thus the algorithm plays the role of a specimen of computational learning theory [2], allowing theoretical forecasting of the future misclassification error. This prediction, however, may turn out to be very loose and consequently deprived of any operational consequence. This is because we are generally obliged to accept broad approximations coming from more or less sophisticated variants of the law of large numbers. In this paper we overcome this drawback by working in the Algorithmic Inference framework [3], computing bounds that are linked to the properties of the actual classification instance and typically prove tighter by one order of magnitude in comparison to analogous bounds computed by Vapnik [4]. We numerically check that these bounds delimit slightly oversized confidence intervals [5] for the actual error probability.

The paper is organized as follows: Section II introduces SVMs, while Section III describes and solves the corresponding accuracy estimation problem in the Algorithmic Inference framework. Section IV numerically checks the theoretical results.

II. LEARNING SVMS

In their basic version, SVMs are used to compute hypotheses in the class H of hyperplanes in R^n, for fixed n ∈ N. Given a sample {x_1, ..., x_m} ∈ R^{nm} with associated labels {y_1, ..., y_m} ∈ {-1, 1}^m, the related classification problem lies in finding a separating hyperplane, i.e. an h ∈ H such that all the points with a given label belong to one of the two half-spaces determined by h.

In order to obtain such an h, an SVM first computes the solution {α*_1, ..., α*_m} of the dual constrained optimization problem

\max_\alpha \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j   (1)

\text{subject to} \quad \sum_{i=1}^{m} \alpha_i y_i = 0   (2)

\alpha_i \ge 0, \quad i = 1, \ldots, m,   (3)

where · denotes the standard dot product in R^n, and then returns a hyperplane (called separating hyperplane) whose equation is w · x + b = 0, where

w = \sum_{i=1}^{m} \alpha_i^* y_i x_i   (4)

b = y_i - w \cdot x_i \quad \text{for } i \text{ such that } \alpha_i^* > 0.   (5)

In the case of a separable sample (i.e. a sample for which the existence of a separating hyperplane is guaranteed), this algorithm produces a separating hyperplane with optimal margin, i.e. a hyperplane maximizing its minimal distance from the sample points. Moreover, typically only a few components of {α*_1, ..., α*_m} are different from zero, so that the hypothesis depends on a small subset of the available examples (those corresponding to non-null α's, which are denoted support vectors or SV).

A variant of this algorithm, known as the soft-margin classifier [6], produces hypotheses for which the separability requirement is relaxed, introducing a parameter ν whose value represents an upper bound on the fraction of sample classification errors and a lower bound on the fraction of points that are allowed to have a distance less than or equal to the margin. The corresponding optimization problem is essentially unchanged, with the sole exception of (3), which now becomes

0 \le \alpha_i \le \frac{1}{m}, \quad i = 1, \ldots, m, \qquad \sum_{i=1}^{m} \alpha_i \ge \nu.   (3')

Analogously, the separating hyperplane equation is still obtained through (4)-(5), though the latter equation needs to be computed on indices i such that 0 < α_i < 1/m.
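To make the training step concrete, here is a minimal sketch, assuming scikit-learn's NuSVC as the solver (the paper does not name its implementation) and a synthetic two-class sample: it learns a linear soft-margin hyperplane, recovers (w, b) from the dual expansion (4)-(5), and reads off the two quantities used later in the paper, namely the number ρ_h of support vectors and the number t_h of misclassified sample points.

```python
# Minimal sketch (not the authors' code): learn a linear soft-margin SVM
# and recover the separating hyperplane w . x + b = 0 from the dual
# expansion (4)-(5). Assumes scikit-learn; X, y and nu are placeholders.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)),   # toy two-class sample
               rng.normal(+1.0, 0.5, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]

clf = NuSVC(nu=0.1, kernel="linear").fit(X, y)   # nu plays the role of the parameter in (3')

# Dual expansion (4): w = sum_i alpha_i* y_i x_i (dual_coef_ stores alpha_i* y_i)
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

rho_h = len(clf.support_)           # number of support vectors
t_h = np.sum(clf.predict(X) != y)   # number of misclassified sample points
print(f"support vectors: {rho_h}, training errors: {t_h}, w = {w.ravel()}, b = {b}")
```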


(a) m = 10^2   (b) m = 10^3   (c) m = 10^6

Fig. 1. Comparison between two-sided 0.9 confidence intervals for the SVM classification error. t: number of misclassified sample points; d: detail/VC dimension; u: confidence interval extremes; m: sample size. Gray surfaces: VC bounds. Black surfaces: proposed bounds.

III. THE ASSOCIATED ALGORITHMIC INFERENCE PROBLEM

The theory for computing the bounds lies in a new statistical framework called Algorithmic Inference [3]. The leading idea is to start from a specific sample, e.g. a record of observed data constituting the true basis of the available knowledge, and then to infer about the possible continuations of this record — call them the population — when the observed phenomenon remains the same. Since we may have many sample suffixes as continuation, we compute their probability on the sole condition of them being compatible with the already observed data. This constraint constitutes the inference tool to be molded with the formal knowledge already available on the phenomenon. In particular, in the problem of learning a Boolean function over a space X, our knowledge base is a labeled sample z_m = {(x_i, b_i), i = 1, ..., m}, where m is an integer, x_i ∈ X and the b_i are Boolean variables. The formal knowledge stands in a concept class C related to the observed phenomenon, and we assume that for every M and every population z_M a c exists in C such that z_{m+M} = {(x_i, c(x_i)), i = 1, ..., m+M}. We are interested in another function A: {z_m} → H which, having in input the sample, computes a hypothesis h in a (possibly) different class H such that U_{c÷h}, denoting the measure of the symmetric difference c ÷ h in the possible continuations of z_m, is low enough with high probability.¹ The bounds we propose in this paper delimit confidence intervals for the mentioned measure, within a dual notion of these intervals where the extremes are fixed and the bounded parameter is a random variable. The core of the theory is the theorem below.

¹ By default capital letters (such as U, X) will denote random variables and small letters (u, x) their corresponding specifications; the sets the specifications belong to will be denoted by capital gothic letters (𝔘, 𝔛).

Theorem 3.1: [7] For a space X and any probability measure P on it, assume we are given
* concept classes C and H on X;
* a labeled sample z_m drawn from X × {0, 1};
* a fairly strongly surjective function A: {z_m} → H.

In the case where for any z_m and any infinite suffix z_M a c ∈ C exists computing the example labels of the whole sequence, consider the family of random sets {c ∈ C: Z_{m+M} = {(X_i, c(X_i)), i = 1, ..., m+M} for any specification z_M of Z_M}. Denote by h = A(z_m) and by U_{c÷h} the random variable given by the probability measure of c ÷ h, and by F_{U_{c÷h}} its cumulative distribution function. For a given z_m, if h has detail D_{(C,H)h} and misclassifies t_h points of z_m, then for each ε ∈ (0, 1),

\sum_{i=t_h+1}^{m} \binom{m}{i} \varepsilon^i (1-\varepsilon)^{m-i} \;\ge\; F_{U_{c \div h}}(\varepsilon) \;\ge\; \sum_{i=D_{(C,H)h}+t_h}^{m} \binom{m}{i} \varepsilon^i (1-\varepsilon)^{m-i}.   (6)

Fairly strong surjectivity is a usual regularity condition [3], while D_{(C,H)h} is a key parameter of the Algorithmic Inference approach to learning called detail [8]. The general idea is that it counts the number of meaningful examples within a sample which prevent A from computing a hypothesis h' with a wider mistake region c ÷ h'. These points are supposed to be algorithmically computed for each c and h through a sentry function S. The maximum D_{C,H} of D_{(C,H)h} over the entire class C ÷ H of symmetric differences between possible concepts and hypotheses relates to the well known Vapnik-Chervonenkis dimension d_VC [4] through the relation D_{C,H} ≤ d_VC(C ÷ H) + 1 [8].

A. The detail of an SVM

The distinctive feature of the hypotheses learnt through an SVM is that the detail D_{(C,H)h} ranges from 1 to the number ρ_h of support vectors minus 1, and its value increases with the broadness of the approximation of the solving algorithm [9].

Lemma 3.2: Let us denote by C the concept class of hyperplanes on a given space X and by σ = {x_1, ..., x_s} a minimal set of support vectors of a hyperplane h (i.e. σ is a support vector set but, whatever i is, no σ \ {x_i} is). Then, for whatever goal hyperplane c separating the above set accordingly with h, there exists a sentry function S on C ÷ H and a subset of σ of cardinality at most s − 1 sentinelling c ÷ h according to S.

Proof: To identify a hyperplane in an n-dimensional Euclidean space we need to put n non-aligned points into a system of linear equations, n + 1 if these points are at a fixed (either negative or positive) maximum distance. This is also the number of support vectors required by an SVM. We may substitute one or more points with direct linear constraints on the hyperplane coefficients when the topology of the support vectors allows it. Sentinelling the expansion of the symmetric difference c ÷ h results in forbidding any rotation of h into an h' pivoted along the intersection of c with h. The membership of this intersection to h' adds from 1 to n − 1 linear relations on its coefficients, so that at most #σ − 1 points from σ are needed to sentinel c ÷ h.
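As a worked illustration of how Theorem 3.1 is used, the sketch below turns the two tails in (6) into the extremes of a two-sided (1 − δ) confidence interval for the error probability, exploiting the identity between a binomial tail and the Beta distribution CDF. Taking the detail equal to ρ_h − 1 (its largest admissible value) is a conservative assumption of ours, and the numbers m, ρ_h, t_h, δ are placeholders rather than values from the paper.

```python
# Sketch (our reading of (6), not the authors' code): two-sided (1 - delta)
# confidence interval for the SVM error probability U, given m sample points,
# t_h misclassified points and detail D (here conservatively set to rho_h - 1).
from scipy.stats import beta

def ai_confidence_interval(m, t_h, detail, delta=0.1):
    # The binomial tail sum_{i=k}^{m} C(m,i) u^i (1-u)^(m-i) equals the
    # Beta(k, m-k+1) CDF at u, so each extreme is a Beta quantile.
    lower = beta.ppf(delta / 2, t_h + 1, m - t_h)                        # from the first tail in (6)
    upper = beta.ppf(1 - delta / 2, detail + t_h, m - detail - t_h + 1)  # from the second tail in (6)
    return lower, upper

# Placeholder instance: m = 1000 examples, 12 support vectors, 5 training errors.
m, rho_h, t_h = 1000, 12, 5
lo, hi = ai_confidence_interval(m, t_h, detail=rho_h - 1, delta=0.1)
print(f"0.9 confidence interval for the error probability: ({lo:.4f}, {hi:.4f})")
```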


(a) m = 10^2   (b) m = 10^3   (c) m = 10^6

Fig. 2. Same comparison as in Fig. 1 with d = 4.

Figure 1 plots the interval extremes versus detail/VC dimension and number of misclassified points for different sizes of the sample. Companion curves in the figure are the Vapnik bounds [4]:

P_{\mathrm{err}} \le \frac{t}{m} + \sqrt{\frac{d\left(\log\frac{2m}{d}+1\right) - \log\frac{\eta}{4}}{m}}.   (7)
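For comparison, the next sketch evaluates a textbook form of the VC bound recalled in (7); the exact constants are assumed here rather than taken from [4], and t/m plays the role of the empirical error.

```python
# Sketch (textbook form of the VC bound (7), assumed here; not taken verbatim
# from the paper): upper bound on the error probability with confidence
# 1 - eta, given d = VC dimension and t misclassified points.
import math

def vc_bound(m, t, d, eta=0.1):
    emp_error = t / m
    slack = math.sqrt((d * (math.log(2 * m / d) + 1) - math.log(eta / 4)) / m)
    return emp_error + slack

# Same placeholder instance as before, with d = 4 as in Fig. 2.
print(f"VC bound: {vc_bound(m=1000, t=5, d=4, eta=0.1):.4f}")
```

On the placeholder numbers used above the resulting bound is markedly wider than the interval obtained from (6), consistently with the comparison shown in Figs. 1 and 2.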

² P_err is computed as the frequency of misclassified points from a huge set drawn with the same distribution of the examples.

Fig. 4. Course of the experimental confidence level with algorithm approximation (percentage of trespassing points vs. log10 m).


Fig. 5. Course of the misclassification probability and related bounds normalized with respect to the dimensionality n of the sample space. Gray line: our bounds; dark line: Rademacher bounds.

Fig. 3. Course of the misclassification probability with the parameters of the learning problem: (a) error probability (P_err) vs. sample size; (b) error probability vs. number of support vectors ρ_h plus number of misclassifications t_h. Points: sampled values; lines: 0.9 confidence bounds.

A small fraction of the sampled error values trespass the upper bounds. This is due to an analogous overvaluation of D_{(C,H)h} through ρ_h. As explained before, the gap between the two parameters diminishes with the approximation broadness of the support vector algorithm. This is accounted for in the graph in Fig. 4, which reports the percentage of experiments trespassing the bounds. This quantity comes close to the planned δ = 0.1 with the increase of the free parameter ν in the ν-support vector algorithm [12] employed to learn the hyperplanes. This nicely reflects an expectable rule of thumb according to which the true sample complexity of the learning algorithm decreases with the increase of its accuracy.

Bounds based on Rademacher complexity come from the Bartlett inequality [13]

P(Yf(X) \le 0) \le \hat{E}_m[\phi(Yf(X))] + \frac{4B}{\gamma m}\sqrt{\sum_{i=1}^{m} k(x_i, x_i)} + \left(\frac{8}{\gamma}+1\right)\sqrt{\frac{\log(4/\delta)}{2m}}   (10)

with notation as in the quoted reference, where the first term on the right-hand side corresponds to the empirical margin cost, a quantity that we may assume very close to the empirical error. The computation of this bound requires a sagacious choice of γ and B, which at the best of our preliminary study brought us to the graph in Fig. 5. Here we have in abscissas the dimensionality of the instance space. For the sake of comparison, we contrast the above with our upper bounds computed for the same δ (hence coming from a one-sided confidence interval) and mediated in correspondence with each hypercube dimension, since for each dimension the sampled instances may have a different number of support vectors ρ_h (while t_h we constrained to be 0).
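To show how the comparison term is obtained, the sketch below evaluates the right-hand side of (10) for a kernel classifier; the values of γ, B, δ and of the empirical margin cost are placeholders standing in for the "sagacious choice" mentioned above, not the settings actually used for Fig. 5.

```python
# Minimal sketch (placeholder gamma, B, delta and empirical margin cost;
# not the authors' experimental code): evaluate the Rademacher-complexity
# bound (10) of Bartlett and Mendelson for a kernel classifier.
import numpy as np

def bartlett_bound(K_diag, emp_margin_cost, gamma, B, delta):
    """Right-hand side of (10): empirical margin cost plus the kernel
    Rademacher term plus the confidence term."""
    m = len(K_diag)
    rademacher_term = 4.0 * B / (gamma * m) * np.sqrt(np.sum(K_diag))
    confidence_term = (8.0 / gamma + 1.0) * np.sqrt(np.log(4.0 / delta) / (2.0 * m))
    return emp_margin_cost + rademacher_term + confidence_term

# Example with a Gaussian kernel, for which k(x, x) = 1 for every point.
m, delta = 1000, 0.1
K_diag = np.ones(m)   # diagonal of the kernel Gram matrix
bound = bartlett_bound(K_diag, emp_margin_cost=0.05, gamma=0.5, B=1.0, delta=delta)
print(f"Rademacher bound on P(Yf(X) <= 0): {bound:.3f}")
```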
Mendelson, "Rademacher and Gaussian complexi- abscissas the dimensionality of the instance space. For the ties: Risk bounds and structural results." Journal of Machine Leatning sake of comparison, we contrast the above with our upper Research, vol. 3, pp. 463-482, 2002. bounds computed for the same 6 (hence coming from a one side confidence interval) and mediated in correspondence with
