United States Patent [19]          [11] Patent Number: 5,640,492
Cortes et al.                      [45] Date of Patent: Jun. 17, 1997

[54] SOFT MARGIN CLASSIFIER

[75] Inventors: Corinna Cortes, New York, N.Y.; Vladimir Vapnik, Middletown, N.J.

[73] Assignee: Lucent Technologies Inc., Murray Hill, N.J.

[21] Appl. No.: 268,361

[22] Filed: Jun. 30, 1994

[51] Int. Cl.: G06E 1/00; G06E 3/00

[52] U.S. Cl.: 395/23

[58] Field of Search: 395/23

[56] References Cited

U.S. PATENT DOCUMENTS

4,120,049  10/1978  Thaler et al.      365/230
4,122,443  10/1978  Thaler et al.      340/146.3
5,040,214   8/1991  Grossberg et al.   381/43
5,214,716   5/1993  Refregier et al.   382/42
5,239,594   8/1993  Yoda               382/15
5,239,619   8/1993  Takatori et al.    395/23
5,245,696   9/1993  Stork et al.       395/13
5,263,124  11/1993  Weaver et al.      382/37
5,271,090  12/1993  Boser              395/21
5,333,209   7/1994  Sinden et al.      382/13

OTHER PUBLICATIONS

B.E. Boser, I. Guyon, and V.N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", Proceedings of the 4th Workshop on Computational Learning Theory, vol. 4, San Mateo, CA: Morgan Kaufman, 1992.
Y. Le Cun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network" (D. Touretzky, Ed.), Advances in Neural Information Processing Systems, vol. 2, Morgan Kaufman, 1990.
A.N. Refenes et al., "Stock Ranking: Neural Networks Vs. Multiple Linear Regression", IEEE International Conf. on Neural Networks, vol. 3, San Francisco, CA, IEEE, pp. 1419-1426, 1993.
M. Rebollo et al., "A Mixed Integer Programming Approach to Multi-Spectral Image Classification", Pattern Recognition, vol. 9, No. 1, Jan. 1977, pp. 47-51, 54-55, and 57.
V. Uebele et al., "Extracting Fuzzy Rules From Pattern Classification Neural Networks", Proceedings of the 1993 International Conference on Systems, Man and Cybernetics, Le Touquet, France, 17-20 Oct. 1993, vol. 2, pp. 578-583.
C. Cortes et al., "Support-Vector Networks", Machine Learning, vol. 20, No. 3, Sep. 1995, pp. 273-297.
Bottou et al., "Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition", Proc. of the 12th IAPR Conf., Oct. 9-13, 1994, pp. 77-82.
B. Schulmeister, "The Piecewise Linear Classifier DIPOL92", ECML-94, pp. 411-414.
R. Courant and D. Hilbert, Methods of Mathematical Physics, Interscience: New York, pp. 122-141, 1953.
D.G. Luenberger, Linear and Non-Linear Programming, Addison-Wesley: Reading, MA, pp. 326-330, 1984.
V.N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag: New York, pp. 355-367, 1982.
M. Aizerman, E. Braverman, and L. Rozonoer, "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning", Automation and Remote Control, vol. 25, pp. 821-837, Jun. 1964.

Primary Examiner: Robert W. Downs
Assistant Examiner: A. Katbab

[57] ABSTRACT

A soft margin classifier and method are disclosed for processing input data of a training set into classes separated by soft margins adjacent to optimal hyperplanes. Slack variables are provided, allowing erroneous or difficult data in the training set to be taken into account in determining the optimal hyperplane. Inseparable data in the training set are separated without removal of the data obstructing separation, by determining the optimal hyperplane having a minimal number of erroneous classifications of the obstructing data. The parameters of the optimal hyperplane generated from the training set determine decision functions, or separators, for classifying empirical data.

31 Claims, 7 Drawing Sheets


FIG. 3 (Sheet 3 of 7): Classification procedure

 60  START CLASSIFICATION PROCEDURE
 65  RECEIVE DATA VECTORS OF A TRAINING SET FROM A TRAINING PATTERN DATA SOURCE
 70  PARSE RECEIVED DATA VECTORS TO OBTAIN CLASS LABELS
 75  RECEIVE AN INPUT SEPARATOR VALUE REPRESENTING A CLASS FOR CLASSIFICATION OF THE DATA VECTORS
 80  DETERMINE PARAMETERS OF A SEPARATOR CLASSIFYING DATA VECTORS IN THE CLASS ACCORDING TO THE SEPARATOR VALUE:
 85    TRANSFORM INPUT DATA VECTORS
 90    DETERMINE PARAMETERS OF AN OPTIMAL MULTIDIMENSIONAL SURFACE SEPARATING DATA VECTORS BY SOFT MARGINS USING SLACK VARIABLES
 95    DETERMINE A DECISION FUNCTION FROM THE PARAMETERS CORRESPONDING TO THE SEPARATOR VALUE
100  GENERATE A CLASSIFICATION SIGNAL FROM THE DECISION FUNCTION TO INDICATE MEMBERSHIP STATUS OF EACH DATA VECTOR IN THE CLASS
105  OUTPUT THE CLASSIFICATION SIGNAL

FIG. 4 (Sheet 4 of 7): Parameter determination subroutine

110  START DETERMINING PARAMETERS OF OPTIMAL MULTIDIMENSIONAL SURFACE
115  MINIMIZE A COST FUNCTION BY QUADRATIC OPTIMIZATION
120  DETERMINE WEIGHT VECTOR AND BIAS
125  DETERMINE MINIMUM NON-NEGATIVE SLACK VARIABLES SATISFYING SOFT MARGIN CONSTRAINTS
130  RETURN

SOFT MARGIN CLASSIFIER

BACKGROUND OF THE INVENTION

1. Field of the Invention

This disclosure relates to automated data classifiers. In particular, this disclosure relates to an apparatus and method for performing two-group classification of input data in automated data processing applications.

2. Description of the Related Art

Automated systems for data classifying applications such as, for example, pattern identification and optical character recognition, process sets of input data by dividing the input data into more readily processible subsets. Such data processing employs at least two-group classifications; i.e., the classifying of input data into two subsets.

As known in the art, some learning systems such as artificial neural networks (ANN) require training from input training data to allow the trained learning systems to perform on empirical data within a predetermined error tolerance. In one example, as described in Y. Le Cun et al., "Handwritten Digit Recognition with a Back-Propagation Network" (D. Touretzky, Ed.), ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, Volume 2, Morgan Kaufman, 1990, a five-layer back-propagation neural network is applied to handwritten digit recognition on a U.S. Postal Service database of 16x16 pixel bit-mapped digits containing 7300 training patterns and 2000 test patterns recorded from actual mail.

One classification method known in the art is the Optimal Margin Classifier (OMC) procedure, described in B.E. Boser, I. Guyon, and V.N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", PROCEEDINGS OF THE FOURTH WORKSHOP OF COMPUTATIONAL LEARNING THEORY, Vol. 4, Morgan Kaufman, San Mateo, Calif., 1992. An application of the OMC method is described in commonly assigned U.S. patent application No. 08/097,785, filed Jul. 27, 1993, and entitled AN OPTIMAL MARGIN MEMORY BASED DECISION SYSTEM, which is incorporated herein by reference.

Generally, using a set of vectors in n-dimensional space as input data, the OMC classifies the input data with non-linear decision surfaces, where the input patterns undergo a non-linear transformation to a new space using convolution of dot products for linear separation by optimal hyperplanes in the transformed space, such as shown in FIG. 1 for two-dimensional vectors in classes indicated by X's and O's. In this disclosure the term "hyperplane" means an n-dimensional surface and includes 1-dimensional and 2-dimensional surfaces, i.e., points and lines, respectively, separating classes of data in higher dimensions. In FIG. 1 the classes of data vectors may be separated by a number of hyperplanes 2, 4. The OMC determines an optimal hyperplane 6 separating the classes.

In situations having original training patterns, or dot-product-transformed training patterns, which are not linearly separable, learning systems trained therefrom may address the inseparability by increasing the number of free parameters, which introduces potential over-fitting of the data. Alternatively, inseparability may be addressed by pruning from consideration the training patterns obstructing separability, as described in V.N. Vapnik, ESTIMATION OF DEPENDENCES BASED ON EMPIRICAL DATA, New York: Springer-Verlag, pp. 355-369, 1982, followed by a restart of the training process on the pruned set of training patterns. The pruning involves a local decision with respect to a decision surface to locate and remove the obstructing data, such as erroneous, outlying, or difficult training patterns.

It is preferable to absorb such separation-obstructing training patterns within soft margins between classes, and to classify training patterns that may not be linearly separable using a global approach in locating such separation-obstructing training patterns. It is also advantageous to implement a training method which avoids restarting the training upon detection of difficult training patterns for pruning.

SUMMARY

A soft margin classification system is disclosed for differentiating data vectors to produce a classification signal indicating the membership status of each data vector in a class. The soft margin classification system includes a processing unit having memory for storing the data vectors in a training set; stored programs including a data vector processing program; and a processor controlled by the stored programs. The processor includes determining means for determining parameters including slack variables from the data vectors, the parameters representing a multidimensional surface differentiating the data vectors with respect to the class; and means for generating the classification signal from the data vectors and the parameters.

The generating means evaluates a decision function of the parameters and each data vector to indicate membership of a respective data vector in the class, to generate the classification signal. The determining means determines the parameters, including a weight vector and a bias, for all of the data vectors in a training set, and determines a minimum non-negative value of the slack variables for each data vector from a plurality of constraints.

The determining means minimizes a cost function to satisfy a plurality of constraints, and determines the weight vector and bias representing an optimal hyperplane separating classes A, B. An input device is provided for inputting the training set of data vectors, and the processor includes means for transforming the data vectors using convolution of dot products.

A method is also disclosed for differentiating pattern vectors to indicate membership in a class, comprising the steps of storing data vectors in memory; processing the data vectors using stored programs including a data vector processing program; determining parameters including slack variables from the data vectors, the parameters representing a multidimensional surface differentiating the data vectors with respect to a class; and generating a classification signal from the data vectors and the parameters.

BRIEF DESCRIPTION OF THE DRAWINGS

The features of the disclosed soft margin classifier and method will become more readily apparent and may be better understood by referring to the following detailed description of an illustrative embodiment of the present invention, taken in conjunction with the accompanying drawings, where:

FIG. 1 illustrates an example of two-dimensional classification by the OMC method;
FIG. 2 shows the components of the soft margin classifier disclosed herein;
FIG. 3 illustrates a block diagram of the operation of the soft margin classifier;
FIG. 4 illustrates a block diagram of a subroutine implementing a parameter determination procedure;
FIG. 5 illustrates exemplary bit map digits as training patterns;
FIG. 6 shows an example of soft margin classification of pattern vectors; and
FIG. 7 illustrates error contributions from the slack variables.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring now in specific detail to the drawings, with like reference numerals identifying similar or identical elements, as shown in FIG. 2, the present disclosure describes an apparatus and method implementing a soft margin classifier 10, which includes a processing unit 15 having a processor 20, memory 25, and stored programs 30 including a matrix reduction program; an input device 35; and an output device 40. In an exemplary embodiment, the processing unit 15 is preferably a SPARC workstation available from Sun Microsystems, Inc., having associated RAM memory and a 400 MB capacity hard or fixed drive as memory 25. The processor 20 operates using the UNIX operating system to run application software as the stored programs 30, providing application programs and subroutines implementing the disclosed soft margin classifier system and methods.

The processor 20 receives commands and training pattern data from a training pattern data source 45 through the input device 35, which includes a keyboard and/or a data reading device such as a disk drive for receiving the training pattern data from storage media such as a floppy disk. The received training pattern data are stored in memory 25 for further processing to determine the parameters of an optimal hyperplane as described below.

The parameters are used by the processor 20 as a decision function to classify input empirical data, generating a classification signal indicating the input empirical data as being a member or non-member of a specified class corresponding to a separator value input by the user through the input device 35. The classification signal is sent to an output device 40, such as a display, for displaying the input data classified by the decision function. Alternatively, the output device 40 may include specialized graphics programs to convert the generated classification signal to a displayed graphic of a multidimensional hyperplane representing the decision function. In additional embodiments, the generated classification signal includes a weight vector, a bias, and slack variables listed in a file with the input training patterns, for output as columns or tables of text by the output device 40, which may be a display or a hard copy printer.

The soft margin classifier 10 performs the application programs and subroutines, described hereinbelow in conjunction with FIGS. 3-4, which are implemented from compiled source code in the C programming language with a LISP interface.

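In one of the embodiments just described, the generated classification signal is a weight vector, a bias, and slack variables listed in a file with the input training patterns, output as columns of text. The short C sketch below illustrates only that output step; the file layout, field names, and function name are assumptions made for illustration, not taken from the patent's C/LISP sources.

/* Illustrative text output of a classification signal: the bias and
 * weight vector as header lines, then one row per training pattern
 * with its label, slack variable, and membership signal.
 * The layout is an assumption, not the patent's file format. */
#include <stdio.h>

static int write_signal_file(const char *path,
                             const double *w, int k, double b,
                             const int *labels, const double *slack,
                             const int *signal, int m)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    fprintf(fp, "# bias %.6f\n# weight", b);
    for (int j = 0; j < k; j++)
        fprintf(fp, " %.6f", w[j]);
    fprintf(fp, "\n# label  slack  signal\n");

    for (int i = 0; i < m; i++)
        fprintf(fp, "%d  %.6f  %+d\n", labels[i], slack[i], signal[i]);

    fclose(fp);
    return 0;
}
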
The present invention includes a method for generating a classification signal from input training patterns, including the steps of storing data vectors in memory; processing the data vectors using stored programs including a data vector processing program; determining parameters including slack variables from the data vectors, the parameters representing a multidimensional surface differentiating the data vectors with respect to a class; and generating a classification signal from the data vectors and the parameters.

In an exemplary embodiment, as shown in FIGS. 3-4, the soft margin classifier 10 starts a classification procedure in step 60 using the data vector processing program, including the steps of receiving data vectors of the training set from a training pattern data source in step 65; parsing the received data vectors to obtain class labels in step 70; receiving, in step 75, an input separator value representing a class which is selected by a user through input device 35 for classification of the data vectors; and determining parameters of a separator classifying data vectors in the class according to the separator value in step 80, which includes transforming the input data vectors in step 85 according to a predetermined vector mapping; determining parameters of an optimal multidimensional surface separating the parsed data vectors by soft margins using slack variables in step 90; and determining a decision function from the parameters corresponding to the separator value in step 95.

After determining the parameters in step 80, the classification procedure performs the steps of generating a classification signal from the decision function to indicate the membership status of each data vector in the class in step 100, and outputting the classification signal in step 105.

In step 90, the determination of parameters is performed by a subroutine, as shown in FIG. 4, including the steps of starting the determination of parameters of the optimal multidimensional surface in step 110; minimizing a cost function by quadratic optimization in step 115; determining the weight vector and bias in step 120; determining the minimum non-negative slack variables satisfying the soft margin constraints in step 125; and returning the determined weight vector, bias, and slack variables in step 130 to proceed to step 95 in FIG. 3.

In an exemplary embodiment, the soft margin classifier 10 generates ten separators, one for each digit 0, 1, . . . , 9, where a separator is a classification signal or set of parameters determining a decision function derived from the training set. The decision function derived from the training set is used to classify empirical input data with respect to the selected class.

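The ten-separator arrangement can be pictured as a small driver loop: one set of parameters per digit, and one decision-function evaluation per digit when classifying an input vector. The C sketch below is only an illustration of that arrangement; the structure, names, and fixed dimension are assumptions, and the parameters themselves are presumed to come from the FIG. 4 subroutine.

/* Illustrative one-vs-rest arrangement: ten separators, one per digit.
 * The parameters (w, b) of each separator are assumed to have been
 * determined by the soft margin procedure of FIG. 4. */
#include <stdio.h>

#define NDIGITS 10
#define K       256   /* dimension of the transformed vectors (example) */

struct separator {
    double w[K];      /* weight vector */
    double b;         /* bias          */
};

/* Decision function f(x) = w . x + b for one separator. */
static double decision_value(const struct separator *s, const double x[K])
{
    double f = s->b;
    for (int j = 0; j < K; j++)
        f += s->w[j] * x[j];
    return f;
}

/* Report the membership status of x for every digit class (step 100). */
static void classify_digits(const struct separator sep[NDIGITS],
                            const double x[K])
{
    for (int d = 0; d < NDIGITS; d++)
        printf("digit %d: %s\n", d,
               decision_value(&sep[d], x) >= 0.0 ? "member" : "non-member");
}
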
In the exemplary embodiment for use in recognizing bit-mapped digits, each pattern t_i of data in the training set T includes a bit map s_i and a label y_i of a digit D, where the pattern s_i, when reformatted, determines the shape and appearance of the digit D. Each data pattern t_i is referred to as a vector, but t_i may be in other equivalent or comparable configurations; for example, matrices, data blocks, coordinates, etc.

Each data pattern t_i is associated with the set Y_D by its corresponding label y_i, where Y_D is the set of all patterns which are bit maps of digit D. Data pattern t_i may thus be represented by the vector (s_i, y_i), indicating that pattern s_i has label y_i; s_i ∈ Y_D (belongs to set Y_D); and s_i represents a bit map of digit D. Also associated with Y_D is its complement Y_D' = not(Y_D), where Y_D' is the set of all patterns t_i having corresponding bit map data s_i which do not represent a bit map of digit D. Thus, Y_D ∪ Y_D' = T.

As shown in FIG. 5, fourteen patterns are bit maps of digits, with the corresponding label of the represented digit respectively below each. Therefore, a data pattern s_1, shown as reference 50 in FIG. 5, is a bit map of digit 6, so s_1 is associated with Y_6 and s_1 ∈ Y_6. Similarly, data pattern s_2, shown as reference 55 in FIG. 5, is a bit map of digit 5, so s_2 ∈ Y_5, s_2 ∉ Y_6, and s_1 ∉ Y_5.

In the above examples, s_1, s_2 ∈ T are bit maps of 16x16 pixels, so the length of each data pattern s_1, s_2 is 16x16 = 256 bits, and each is therefore represented in the above example by a 256-component vector.

In operation, the soft margin classifier 10 parses each pattern t_i = (s_i, y_i) off its label y_i, leaving the data vectors s_i, and then each vector s_i is transformed by convolution of a dot product to a transformed data vector x_i.

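The parsing step described above pairs each training pattern with its digit label and then strips the label off, leaving a 256-component data vector and a record of whether the pattern belongs to Y_D or to its complement. A minimal C sketch of that step follows; the structure and field names are illustrative assumptions only.

/* Illustrative parsing of a training pattern t_i = (s_i, y_i):
 * a 16x16 bit map (256 components) and its digit label. */
#include <string.h>

#define NPIX 256                      /* 16 x 16 = 256 components */

struct pattern {
    unsigned char s[NPIX];            /* bit map data vector      */
    int y;                            /* digit label, 0..9        */
};

/* Parse the pattern off its label: copy the data vector s_i and
 * report membership in Y_D (1) or its complement Y_D' (0). */
static int parse_pattern(const struct pattern *t, int D,
                         unsigned char s_out[NPIX])
{
    memcpy(s_out, t->s, NPIX);
    return t->y == D;
}
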
The transformation permits classification using non-linear decision surfaces; i.e., a non-linear transformation is performed on the input patterns in the input space to a new transformed space permitting linear separation by optimal hyperplanes of the OMC method in the references cited above.

A predetermined k-dimensional vector function

    V: R^n → R^k    (1)

maps each n-dimensional training vector s_i to a new k-dimensional training vector x_i = V(s_i). As described below, a k-dimensional weight vector w and a bias b are then constructed for each digit D such that a decision function f for class A is a linear separator of the transformed training vectors x_i, where

    f(x_i) = w · x_i + b    (2)

In the exemplary embodiment, A = Y_D and B = Y_D' for generating a decision function or separator classifying input data vectors to determine the membership status of the input data vectors in each class A, B; i.e., to determine whether or not each input data vector represents digit D.

In constructing the separator using optimal hyperplanes as in the OMC method, the weight vector may be written as:

    w = Σ_{i∈A} α_i x_i − Σ_{i∈B} β_i x_i    (3)

where the x_i are the transformed training data vectors. The linearity of the dot product implies that the decision function f for class A for unknown empirical data x depends on the dot product according to

    f(x) = Σ_{i∈A} α_i (x_i · x) − Σ_{i∈B} β_i (x_i · x) + b    (4)

The classification method may be generalized by considering different forms of the dot product

    K(u, v) = V(u) · V(v)    (5)

According to the Hilbert-Schmidt Theorem, as described in R. Courant and D. Hilbert, METHODS OF MATHEMATICAL PHYSICS, Interscience, New York, pp. 122-141, 1953, any symmetric function K(u, v) can be expanded in the form

    K(u, v) = Σ_i λ_i V_i(u) V_i(v)    (6)

where the λ_i ∈ R and the V_i are eigenvalues and eigenfunctions, respectively, of the integral equation

    ∫ K(u, v) V_i(v) dv = λ_i V_i(u)    (7)

A sufficient condition to ensure a positive norm of the transformed vectors is that all the eigenvalues in the expansion of Eq. (6) above are positive. To guarantee that these coefficients are positive, it is necessary and sufficient (according to Mercer's Theorem) that the condition

    ∫∫ K(u, v) g(u) g(v) du dv > 0    (8)

is satisfied for all g such that

    ∫ g²(u) du < ∞    (9)

Functions that satisfy Mercer's Theorem can therefore be used as dot products. As described in M. Aizerman et al., "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning", Automation and Remote Control, 25:821-837, June 1964, potential functions are of the form

    K(u, v) = exp(−‖u − v‖ / σ)    (10)

In the Boser et al. publication, cited above, the optimal hyperplane method was combined with the method of convolution of the dot product, and, in addition to the potential functions as in Eq. (10) above, polynomial classifiers of the form

    K(u, v) = (u · v + 1)^d    (11)

were considered.

Using different dot products K(u, v), one can construct different learning machines with arbitrarily shaped decision surfaces. All of these learning machines follow the same solution scheme as the original optimal hyperplane method and have the same advantage of effective stopping criteria. A polynomial of the form of Eq. (11) is used in the soft margin classifier 10 described herein. In use, it was found that the raw classification error on the U.S. Postal Service database was between 4.3% and 4.7% for d = 2 to 4, and the error decreased as d increased. With d = 4, the raw error was 4.3%.

By increasing the degree d, simple non-linear transformations as in Eq. (11) eventually lead to separation, but the dimensionality of the separating space may become larger than the number of training patterns. Better generalization ability of the learning machine, and less potential over-fitting of the data, can be achieved if the method allows for errors on the training set, as implemented in the soft margin classifier 10 and method described herein.

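A compact C sketch of the polynomial dot product of Eq. (11) and of a decision function written in the expanded form of Eqs. (3)-(4) is given below. The coefficient arrays (alpha for class A patterns, beta for class B patterns), the row-major storage of the training vectors, and the parameter names are illustrative assumptions; only the kernel and decision-function forms come from the equations above.

/* Polynomial convolution of the dot product, Eq. (11):
 *   K(u, v) = (u . v + 1)^d
 * and a decision function in the expanded form of Eqs. (3)-(4). */
#include <math.h>
#include <stddef.h>

static double poly_kernel(const double *u, const double *v, size_t n, int d)
{
    double dot = 0.0;
    for (size_t i = 0; i < n; i++)
        dot += u[i] * v[i];
    return pow(dot + 1.0, (double)d);
}

/* f(x) = sum_{i in A} alpha_i K(s_i, x) - sum_{j in B} beta_j K(s_j, x) + b
 * Training vectors are stored row-major: sA[i*n .. i*n+n-1] is pattern i. */
static double decision_function(const double *x, size_t n, int d,
                                const double *sA, const double *alpha, size_t mA,
                                const double *sB, const double *beta, size_t mB,
                                double b)
{
    double f = b;
    for (size_t i = 0; i < mA; i++)
        f += alpha[i] * poly_kernel(&sA[i * n], x, n, d);
    for (size_t j = 0; j < mB; j++)
        f -= beta[j] * poly_kernel(&sB[j * n], x, n, d);
    return f;
}
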
The OMC method described above uses margins on either side of an intermediate hyperplane separating the transformed vectors, to thus classify the transformed vectors as being on one side or the other of the intermediate hyperplane. The margins are determined by the +1 and −1 in the equation

    w · x_i + b ≥ 1    if x_i ∈ A
    w · x_i + b ≤ −1   if x_i ∈ B    (12)

In the soft margin classifier 10, the margin is "soft"; i.e., the hyperplane separating the classified data vectors in the transformed space depends on the values of slack variables. The margins are determined from the slack variables such that

    w · x_i + b ≥ 1 − ξ_i(w, b)    if x_i ∈ A
    w · x_i + b ≤ −1 + ξ_i(w, b)   if x_i ∈ B    (13)

where ξ_i(w, b) for each pattern is a non-negative slack variable. The value of ξ_i(w, b) is a function of the parameters (w, b) of the decision surface, and it is the smallest non-negative number that makes the transformed data pattern x_i satisfy the inequality in Eq. (13).

FIG. 6 illustrates the use of slack variables in relation to an optimal hyperplane between vectors of classes A, B, represented by X's and O's respectively. Hyperplanes 135, 140 correspond to w · x + b = −1 and w · x + b = 1, respectively, with soft margins extending from each hyperplane 135, 140. Hyperplane 145 corresponds to w · x + b = 0 and is intermediate of hyperplanes 135, 140.

As shown in FIG. 6, data vector 150 is in class A but is classified by the soft margin classifier 10 as a member of class B, since data vector 150 is present in the region, determined by the hyperplanes 135, 140, classifying data vectors as members of class B. Similarly, data vectors 155, 160 from class B are classified as being in class A by the soft margin classifier 10 using hyperplanes 135, 140. As shown in FIG. 6, the values of the slack variables are greater than 1 for data vectors 150, 155, 160; i.e., data vectors 150, 155, 160 are deemed erroneous data. These erroneous data vectors are taken into account by the soft margin classifier 10, as described below, without removal of these data vectors from the training set and without restart of the classification method, in determining the optimal hyperplanes. FIG. 6 also illustrates data vectors 165, 170 of classes A and B, respectively, having ξ_i = 0; i.e., lying on hyperplanes 135, 140, respectively.

The soft margin classifier 10 determines the weight vector w and the bias b having the fewest number of errors on the two sets A, B of patterns. A value of ξ_i(w, b) > 1 corresponds to an error on pattern x_i, since the pattern is classified according to the sign of w · x_i + b.
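Once (w, b) is fixed, each slack variable of Eq. (13) has a closed form: it is the smallest non-negative number restoring the margin inequality for its pattern, and a value greater than 1 marks the pattern as an error (vectors 150, 155, 160 of FIG. 6). The C sketch below computes the slack variables exactly that way; the function and variable names are illustrative.

/* Smallest non-negative slack satisfying Eq. (13) for fixed (w, b):
 *   class A:  xi = max(0, 1 - (w . x + b))
 *   class B:  xi = max(0, 1 + (w . x + b))
 * A slack greater than 1 corresponds to a classification error. */
#include <stddef.h>

static double dot(const double *a, const double *b, size_t k)
{
    double s = 0.0;
    for (size_t i = 0; i < k; i++)
        s += a[i] * b[i];
    return s;
}

/* in_A is 1 for a pattern of class A, 0 for a pattern of class B. */
static double slack(const double *w, double b, const double *x,
                    size_t k, int in_A)
{
    double f = dot(w, x, k) + b;
    double xi = in_A ? 1.0 - f : 1.0 + f;
    return xi > 0.0 ? xi : 0.0;
}

/* Error test used in the discussion of FIG. 6: xi > 1. */
static int is_error(double xi)
{
    return xi > 1.0;
}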

In the OMC, a function Φ was defined by the following equation:

    Φ(w, b) = Σ_{i∈A,B} θ(ξ_i(w, b))    (14)

As illustrated in FIG. 7, θ is the step function: θ(x) = 1 if x > 0 and zero otherwise, shown as reference 175. The minimizing of Eq. (14) is a highly non-linear problem. An analytic expression for an upper bound on the number of errors is obtained through the constraint

    ε(w, b) = Σ_{i∈A,B} ξ_i(w, b)^Λ    (15)

where Λ ≥ 0.

As shown in FIG. 7, for Λ > 1 the cost of an error increases more than linearly with the pattern's deviation from the desired value, as shown by the graph 180 of ξ^Λ. As also seen in FIG. 7, non-errors, i.e., patterns having 0 < ξ < 1, make small contributions ξ^Λ, shown as reference 185, to the sum ε(w, b), approximating the zeroing-out of the non-errors by the step function. As Λ → 0, all ξ > 1 contribute with the same cost of 1, as with the step function, since ξ^Λ is approximately 1, shown as reference 190; this is preferable to the contribution of each ξ when Λ ≥ 1.

A constraint of the form in Eq. (15) departs from the quadratic programming problem of determining the optimal hyperplane in the OMC, and departs from guaranteed convergence time, unless Λ = 1 or 2. In the preferred embodiment of the soft margin classifier 10, Λ = 1 is the best mode, since it provides a constraint that is near to the desired constraint. In alternative embodiments, a hybrid method is implemented by the soft margin classifier 10, with Λ = 2 for 0 ≤ ξ ≤ 1 and Λ = 1 for ξ ≥ 1, which provides a moderate increase in computational complexity.
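The two cost notions above can be compared numerically. The small C program below is a sketch under the assumption that Eq. (14) sums the step function over the slacks and Eq. (15) sums ξ_i^Λ; it evaluates both for a handful of arbitrary example slack values, so the effect of choosing Λ = 1 versus Λ close to 0 discussed in connection with FIG. 7 can be inspected.

/* Step-function count of Eq. (14) versus the analytic sum of Eq. (15)
 * for a given exponent Lambda.  Example slack values are arbitrary. */
#include <math.h>
#include <stdio.h>

static double step(double x)            /* theta(x): 1 if x > 0, else 0 */
{
    return x > 0.0 ? 1.0 : 0.0;
}

static double phi_step(const double *xi, int m)                  /* Eq. (14) */
{
    double sum = 0.0;
    for (int i = 0; i < m; i++)
        sum += step(xi[i]);
    return sum;
}

static double eps_bound(const double *xi, int m, double lambda)  /* Eq. (15) */
{
    double sum = 0.0;
    for (int i = 0; i < m; i++)
        if (xi[i] > 0.0)
            sum += pow(xi[i], lambda);
    return sum;
}

int main(void)
{
    double xi[] = { 0.0, 0.2, 0.9, 1.3, 2.5 };   /* example slacks */
    printf("step count:          %.1f\n", phi_step(xi, 5));
    printf("bound, Lambda=1:     %.2f\n", eps_bound(xi, 5, 1.0));
    printf("bound, Lambda=0.05:  %.2f\n", eps_bound(xi, 5, 0.05));
    return 0;
}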

A unique solution to the quadratic optimization problem is provided by the cost function of Eq. (16), which combines the error bound ε(w, b) of Eq. (15) with a ‖w‖² term, where η and γ are parameters of the cost function that are kept fixed under the optimization. The cost function is minimized in step 115 in FIG. 4 with respect to the following constraints:

    w · x_i + b ≥ 1 − ξ_i    if x_i ∈ A
    w · x_i + b ≤ −1 + ξ_i   if x_i ∈ B
    ξ_i ≥ 0
    Σ_i ξ_i ≤ ε    (17)

The terms in the cost function serve different purposes. The ε term enforces a small number of errors. When several solutions exist with the same value of ε, the unique one that also minimizes ‖w‖² is chosen. The ‖w‖² term is common with the OMC, and it tends to keep the distance between the convex hulls of the correctly classified patterns as large as possible; this term is included to obtain a unique solution, so its multiplier γ is kept at a small positive value.

Eq. (16) can be solved in the dual space of the Lagrange multipliers, in a similar manner as in the OMC for determining optimal hyperplanes. An additional m+1 non-negative Lagrange multipliers are introduced: m multipliers ε_i enforce the constraints ξ_i ≥ 0, and a multiplier δ enforces the last constraint of Eq. (17). The optimal solution is found to be the (w*, b*) which minimizes the expression W of Eq. (18), where

    w = Σ_{i∈A} α_i x_i − Σ_{i∈B} β_i x_i    (19)

and

    Σ_{i∈A} α_i = Σ_{i∈B} β_i    (20)

In the soft margin classifier system and method disclosed herein, the function W in Eq. (18) is altered to the form of Eq. (21), with respect to which the 2m+1 non-negative multipliers are maximized, where the (2m+1)-dimensional vector of multipliers is defined in Eq. (22) and A is a positive definite (2m+1)×(2m+1) matrix.

The optimal weight vector w* and bias b* are determined in step 120 in FIG. 4 by minimizing the cost function of Eq. (16) over w, b and the slack variables (Eq. (23)), under the constraints

    w · x_i + b ≥ 1 − ξ_i    if x_i ∈ A
    w · x_i + b ≤ −1 + ξ_i   if x_i ∈ B
    ξ_i ≥ 0
    Σ_i ξ_i ≤ ε    (24)

The vectors U, 1, u, v, e, of dimension n+m+1, are defined as block vectors in Eqs. (25)-(26), and non-negative Lagrange multipliers α, β, ε_i and δ are applied for the constraints. The Lagrange function corresponding to the optimization problem is, with this notation, given by Eq. (27), where Q is an (n+m+1)×(n+m+1) diagonal matrix, Eq. (28).

The minimization of Eq. (27) with respect to U is performed according to the following Kuhn-Tucker conditions:

    w = Σ_{i∈A} α_i x_i − Σ_{i∈B} β_i x_i
    Σ_{i∈A} α_i = Σ_{i∈B} β_i
    α_i, β_i, ε_i, δ ≥ 0,    i = 0, 1, . . . , m
    (w · x_i + b − 1 + ξ_i) α_i = 0    for x_i ∈ A
    (w · x_i + b + 1 − ξ_i) β_i = 0    for x_i ∈ B
    ε_i ξ_i = 0,    i = 0, 1, . . . , m
    (Σ_i ξ_i − ε) δ = 0    (29)

together with relations expressing each slack variable ξ_i in terms of α_i (or β_i, according to the sign of the label y_i), ε_i and δ.

Back-substituting the first five expressions of Eq. (29) into Eq. (27) determines a maximization problem in the (2m+1)-dimensional multiplier space of α, β, ε, δ:

    (α*, β*, ε*, δ*) = arg max_{α_i, β_i, ε_i, δ} W    (30)
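Two of the Kuhn-Tucker relations above are straightforward to check numerically once candidate multipliers are available: the expansion of the weight vector over the training vectors (Eqs. (19), (29)) and the balance between the multiplier sums of the two classes (Eq. (20)). The C sketch below performs only those two computations; the multiplier values themselves are assumed to come from the quadratic optimization, which is not reproduced here.

/* Reconstruct w from candidate multipliers, Eqs. (19)/(29), and check
 * the balance condition of Eq. (20).  Multipliers are assumed given. */
#include <math.h>
#include <stddef.h>

/* w = sum_{i in A} alpha_i x_i - sum_{i in B} beta_i x_i
 * Training vectors are stored row-major: xA[i*k .. i*k+k-1], etc. */
static void weight_from_multipliers(double *w, size_t k,
                                    const double *xA, const double *alpha,
                                    size_t mA,
                                    const double *xB, const double *beta,
                                    size_t mB)
{
    for (size_t j = 0; j < k; j++)
        w[j] = 0.0;
    for (size_t i = 0; i < mA; i++)
        for (size_t j = 0; j < k; j++)
            w[j] += alpha[i] * xA[i * k + j];
    for (size_t i = 0; i < mB; i++)
        for (size_t j = 0; j < k; j++)
            w[j] -= beta[i] * xB[i * k + j];
}

/* Eq. (20): the alphas over A and the betas over B must sum equally. */
static int multipliers_balanced(const double *alpha, size_t mA,
                                const double *beta, size_t mB, double tol)
{
    double sa = 0.0, sb = 0.0;
    for (size_t i = 0; i < mA; i++) sa += alpha[i];
    for (size_t i = 0; i < mB; i++) sb += beta[i];
    return fabs(sa - sb) <= tol;
}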

From the above, W is a quadratic form of the multipliers, due to the introduction of the multiplier vector defined in Eq. (22). One obtains

    W = c · γ − γᵀ A γ    (31)

with the vector c and the block matrix A, containing the matrix G and its sub-blocks h and k, given by Eqs. (32)-(34), where the matrix H within the block matrix A collects the signed dot products between the training vectors, and m* is the number of patterns with non-zero ξ_i in the solution.

The constraint of Eq. (35), bounding the sum of the slack variables, is to be valid when the sum is taken over all non-zero ξ_i. It also has to be fulfilled for any subset of the ξ_i. Lagrange multipliers δ_1, δ_2, . . . , δ_p, corresponding to every subset of patterns considered to form the intermediate solutions, are therefore introduced. The dimension of the Lagrange multiplier space is now n+m+p, and the Kuhn-Tucker conditions read as in Eq. (29), with the single multiplier δ replaced by the subset multipliers δ_j (Eq. (36)); the last column and row of the G matrix are widened to a band of width p. For every subset j of which ξ_i is a member, there will be an entry k, as defined above, in positions (i, n+m+j), (i+m, n+m+j), (n+m+j, i) and (n+m+j, i+m). The bottom right p×p block involves the counts of vectors belonging to the subsets, m' + m'' being the number of vectors in both subsets 1 and 2. The off-diagonal part of this block may be ignored in order to perform the optimization with respect to the individual δ_j's independently, and also to reduce the number of support patterns in the intermediate solutions. Thus, the obtained solution provides an accurate approximation for (w*, b*) to a true optimum.

Upon determining (w*, b*) in step 120 in FIG. 4 and minimizing the slack variables in step 125 in FIG. 4, the classification signal is then generated with respect to class A from the sign of the value of the decision function f(x), where x is empirical data transformed by a dot product mapping function and input to the determined decision function. Referring to Eq. (13) in conjunction with FIG. 6, for non-erroneous patterns the slack variables satisfy the inequality 0 ≤ ξ_i ≤ 1.

4. The system of claim 3 wherein the determining means determines the parameters including a weight vector and a bias from a plurality of constraints.

5. The system of claim 4 wherein the generating means evaluates the decision function from a linear sum of the bias and a dot product of the weight vector and a respective data vector.

6. The system of claim 4 wherein the determining means determines a minimum non-negative value of the slack variables for each data vector from the weight vector and the bias satisfying the plurality of constraints.

7. A classifier for classifying data vectors to produce a