Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Classification by Discriminative Regularization

Bin Zhang^1*, Fei Wang^2, Ta-Hsin Li^3, Wenjun Yin^1, Jin Dong^1
^1 IBM China Research Lab, Beijing, China
^2 Department of Automation, Tsinghua University, Beijing, China
^3 IBM Watson Research Center, Yorktown Heights, NY, USA
* Corresponding author, email: [email protected]

Abstract

Classification is one of the most fundamental problems in machine learning, which aims to separate the data from different classes as far apart as possible. A common way to get a good classification function is to minimize its empirical prediction loss or structural loss. In this paper, we point out that we can also enhance the discriminality of those classifiers by further incorporating the discriminative information contained in the data set as a prior into the classifier construction process. We will show that the classifiers constructed in this way are more powerful, which is also validated by the empirical study on several benchmark data sets.

Introduction

Classification, or supervised learning with discrete outputs (Duda et al., 2000), is one of the vital problems in machine learning and artificial intelligence. Given a set of training data points and their labels \mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n}, a classification algorithm aims to construct a good classifier f \in \mathcal{F} (where \mathcal{F} is the hypothesis space) and use it to predict the labels of the data points in the testing set \mathcal{X}_T. Most real-world data analysis problems can be reformulated as classification problems, such as text categorization (Sebastiani, 2002), face recognition (Li & Jain, 2005), and genetic sequence identification (Chuzhanova et al., 1998).

More concretely, in classification problems we usually assume that the data points and their labels are sampled from a joint distribution P(x, y), and the optimal classifier f we seek should minimize the following expected loss (Vapnik, 1995)

J = E_{P(x, y)}\,\mathcal{L}(y, f(x)),

where \mathcal{L}(\cdot, \cdot) is some loss function. However, in practice P(x, y) is usually unknown. Therefore, we minimize the following empirical loss instead

\hat{J} = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(y_i, f(x_i))

When deriving a specific classification algorithm by minimizing \hat{J}, we may encounter two common problems:

* Generally, achieving a good classifier needs a sufficiently large amount of training data (so that \hat{J} becomes a better approximation of J); however, in real-world applications it is usually expensive and time consuming to obtain enough labeled points.

* Even if we have enough training data, a classifier that fits the training set very well may still not generalize well to the testing set. This is the problem known as overfitting.

Regularization is one of the most important techniques for handling the above two problems. First, for the overfitting problem, Vapnik showed that J and \hat{J} satisfy the following inequality (Vapnik, 1995)

J \leq \hat{J} + \varepsilon(n, d_{\mathcal{F}}),

where \varepsilon(n, d_{\mathcal{F}}) measures the complexity of the hypothesis model. Therefore we can minimize the following structural risk to get an optimal f

\bar{J} = \hat{J} + \lambda\Omega(f),

where \Omega(f) is a structural regularization term which measures the complexity of f. In such a way, the resulting f could also have a good generalization ability on the testing data.

Second, for the first problem, we can apply semi-supervised learning methods, which aim to learn from partially labeled data (Chapelle et al., 2006). Belkin et al. proposed to add an additional regularization term to \bar{J} that penalizes the geometric smoothness of the decision function with respect to the intrinsic data manifold (Belkin et al., 2006), where such smoothness is estimated from the whole data set. Belkin et al. also pointed out that this technique can be incorporated into the procedure of supervised classification, in which case the smoothness of the classification function is measured class-wise.

In this paper, we propose a novel regularization approach for classification. Instead of using the geometrical smoothness of f as the penalty criterion as in (Belkin et al., 2006), we directly use the discriminality of f as the penalty function, since the final goal of classification is exactly to discriminate the data from different classes. Therefore we call our method Discriminative Regularization (DR). Specifically, we derive two margin based criteria to measure the discriminality of f, namely the global margin and the local margin.

We show that the final solution for the classification function is obtained by solving a linear system of equations, which can be implemented efficiently with many standard numerical methods. Experimental results on several benchmark data sets are also presented to show the effectiveness of our method. One issue that should be mentioned here is that, independent of this work, (Xue et al., 2007) also proposes a similar discriminative regularization technique for classification; however, their technical report contains no detailed mathematical derivations.

The rest of this paper is organized as follows. In section 2 we briefly review the basic procedure of manifold regularization, since it is closely related to this paper. The algorithmic details of DR are derived in section 3. In section 4 we present the experimental results, followed by the conclusions and discussions in section 5.

A Brief Review of Manifold Regularization

As mentioned in the introduction, manifold regularization (Belkin et al., 2006) seeks an optimal classification function by minimizing the following loss

\bar{J} = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(f(x_i), y_i) + \lambda\Omega(f) + \gamma\mathcal{I}(f)    (1)

where \Omega(f) measures the complexity of f, and \mathcal{I}(f) penalizes the geometrical behavior of f. Generally, in semi-supervised learning, \mathcal{I}(f) can be approximated by

\mathcal{I}(f) = \mathbf{f}^T L \mathbf{f} = \sum_{ij} w_{ij}(f_i - f_j)^2

where w_{ij} measures the similarity between x_i and x_j, f_i = f(x_i), and \mathbf{f} = [f_1, f_2, \ldots, f_n]^T. Here L is an n \times n graph Laplacian matrix with (i, j)-th entry

L_{ij} = \begin{cases} \sum_{k} w_{ik}, & i = j \\ -w_{ij}, & \text{otherwise} \end{cases}

Therefore \mathcal{I}(f) reflects the variation, or the smoothness, of f over the intrinsic data manifold.

For supervised learning, we can define a block-diagonal matrix L as

L = \mathrm{diag}(L_1, L_2, \ldots, L_C)

where C is the number of classes and L_i is the graph Laplacian constructed on class i; that is, if we construct a supervised data graph on \mathcal{X}, there are only connections among the data points in the same class (see Fig. 1).

Figure 1: The data graph constructed in supervised manifold regularization. There are only connections among the data points in the same class.

Then, rearranging the elements of \mathbf{f} as

\mathbf{f} = [\mathbf{f}_1^T, \mathbf{f}_2^T, \ldots, \mathbf{f}_C^T]^T

where \mathbf{f}_1 is the classification vector on class 1, we can adopt the following term as \mathcal{I}(f)

\mathcal{I}(f) = \sum_i \mathbf{f}_i^T L_i \mathbf{f}_i = \mathbf{f}^T L \mathbf{f}

which is in fact the summation of the smoothness of f within each class.
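To make the class-wise smoothness penalty concrete, here is a minimal sketch (not from the original paper; the Gaussian similarity, the bandwidth sigma, and the function names are our own illustrative assumptions) that builds the supervised graph of Fig. 1 and evaluates \mathcal{I}(f) = \mathbf{f}^T L \mathbf{f}.

```python
import numpy as np

def supervised_graph_laplacian(X, y, sigma=1.0):
    """Graph Laplacian whose edges connect only points of the same class,
    with Gaussian similarity w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    W[y[:, None] != y[None, :]] = 0.0   # drop connections across classes (Fig. 1)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W   # L_ii = sum_j w_ij, L_ij = -w_ij

def smoothness_penalty(f, L):
    """I(f) = f^T L f, the class-wise smoothness of f (up to a constant factor
    relative to the pairwise sum written above)."""
    return float(f @ L @ f)
```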

Discriminative Regularization

Although (Belkin et al., 2006) proposed an elegant framework for learning by taking the smoothness of f into account, the final goal of learning (e.g., classification or clustering) is to discriminate the data from different classes. A reasonable classification function should therefore be smooth within each class while simultaneously producing a large contrast between different classes. Hence we propose to penalize the discriminality of f directly, rather than its smoothness, when constructing it. Generally, the discriminality of f can be measured by the difference (or quotient) of the within-class scatterness and the between-class scatterness, which can be computed either globally or locally.

Global Margin Based Discriminality

Similar to the maximum margin criterion (Li et al., 2006), we measure the discriminality of f by

\mathcal{D}_1(f) = \mathcal{W}(f) - \mathcal{B}(f)    (2)

where \mathcal{W}(f) and \mathcal{B}(f) measure the within-class scatterness and between-class scatterness of f, which can be computed by

\mathcal{W}(f) = \sum_{c=1}^{C}\sum_{x_i \in \pi_c}\Big\| f_i - \frac{1}{n_c}\sum_{x_j \in \pi_c} f_j \Big\|^2    (3)

\mathcal{B}(f) = \sum_{c=1}^{C} n_c \Big\| \frac{1}{n_c}\sum_{x_j \in \pi_c} f_j - \frac{1}{n}\sum_{x_j} f_j \Big\|^2    (4)

where \pi_c denotes the c-th class and n_c = |\pi_c|, with |\cdot| denoting the cardinality of a set. Define the total scatterness of f as

\mathcal{T}(f) = \mathcal{W}(f) + \mathcal{B}(f) = \sum_i \Big\| f_i - \frac{1}{n}\sum_j f_j \Big\|^2    (5)

Expanding \mathcal{W}(f) gives

\mathcal{W}(f) = \sum_{c=1}^{C}\sum_{x_i \in \pi_c}\Big[ f_i^2 - \frac{2}{n_c}\sum_{x_j \in \pi_c} f_j f_i + \frac{1}{n_c^2}\sum_{x_k, x_l \in \pi_c} f_k f_l \Big]
= \sum_i f_i^2 - \sum_{ij} g_{ij} f_i f_j
= \frac{1}{2}\sum_{ij} g_{ij}(f_i - f_j)^2 = \mathbf{f}^T L_W \mathbf{f}

where

g_{ij} = \begin{cases} 1/n_c, & x_i, x_j \in \pi_c \\ 0, & \text{otherwise} \end{cases}    (6)

\mathbf{f} = [f_1, f_2, \ldots, f_n]^T \in \mathbb{R}^{n \times 1}    (7)

L_W = \mathrm{diag}(L_{W1}, L_{W2}, \ldots, L_{WC}) \in \mathbb{R}^{n \times n}    (8)

and L_{Wc} \in \mathbb{R}^{n_c \times n_c} is a square matrix with entries L_{Wc}(i, i) = 1 - 1/n_c and L_{Wc}(i, j) = -1/n_c for i \neq j. Since

\mathcal{T}(f) = \mathbf{f}^T\Big(I - \frac{1}{n}\mathbf{e}\mathbf{e}^T\Big)\mathbf{f} = \mathbf{f}^T L_T \mathbf{f}    (9)

where L_T = I - \frac{1}{n}\mathbf{e}\mathbf{e}^T can be viewed as a centralization operator, I is the n \times n identity matrix, and \mathbf{e} \in \mathbb{R}^{n \times 1} is a column vector with all elements equal to 1, we have

\mathcal{B}(f) = \mathcal{T}(f) - \mathcal{W}(f) = \mathbf{f}^T(L_T - L_W)\mathbf{f} = \mathbf{f}^T L_B \mathbf{f}    (10)

where L_B = L_T - L_W. The discriminality of f in Eq. (2) can then be rewritten as

\mathcal{D}_1(f) = \mathbf{f}^T(L_W - L_B)\mathbf{f}    (11)
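As a concrete companion to Eqs. (6)-(11), the following sketch (ours, not the authors' code; the function names are illustrative) assembles L_W, L_T, and L_B from the label vector and evaluates \mathcal{D}_1(f) = \mathbf{f}^T(L_W - L_B)\mathbf{f}.

```python
import numpy as np

def global_margin_laplacians(y):
    """Build L_W, L_T and L_B of Eqs. (8)-(10) from the labels y (length n)."""
    n = len(y)
    # g_ij = 1/n_c if x_i and x_j belong to the same class c, else 0 (Eq. 6)
    G = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        G[np.ix_(idx, idx)] = 1.0 / len(idx)
    # Equivalent to the block-diagonal L_W of Eq. (8) up to a permutation of samples
    L_W = np.eye(n) - G                    # so that f^T L_W f = W(f)
    L_T = np.eye(n) - np.ones((n, n)) / n  # centralization operator of Eq. (9)
    L_B = L_T - L_W                        # Eq. (10)
    return L_W, L_T, L_B

def global_discriminality(f, L_W, L_B):
    """D_1(f) = f^T (L_W - L_B) f, Eq. (11)."""
    return float(f @ (L_W - L_B) @ f)
```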

Local Margin Based Discriminality

Before going into the details of our local margin based discriminative regularization, let us first define two types of neighborhoods (Wang & Zhang, 2007):

Definition 1 (Homogeneous Neighborhood). For a data point x_i, its \xi nearest homogeneous neighborhood \mathcal{N}_i^o is the set of the \xi most similar data points that are in the same class as x_i.

Definition 2 (Heterogeneous Neighborhood). For a data point x_i, its \zeta nearest heterogeneous neighborhood \mathcal{N}_i^e is the set of the \zeta most similar data points that are not in the same class as x_i.

(In this paper two data vectors are considered to be similar if the Euclidean distance between them is small; two data tensors are considered to be similar if the Frobenius norm of their difference tensor is small.)

Then the neighborhood margin \gamma_i of f at x_i is defined as

\gamma_i = \sum_{k:\, x_k \in \mathcal{N}_i^e \text{ or } x_i \in \mathcal{N}_k^e} \|f_i - f_k\|^2 - \sum_{j:\, x_j \in \mathcal{N}_i^o \text{ or } x_i \in \mathcal{N}_j^o} \|f_i - f_j\|^2

and the total neighborhood margin of f is

\gamma = \sum_i \gamma_i

The discriminality of f can be described by \gamma. Define two weight matrices

W^o(i, j) = \begin{cases} 1, & x_i \in \mathcal{N}_j^o \text{ or } x_j \in \mathcal{N}_i^o \\ 0, & \text{otherwise} \end{cases}    (12)

W^e(i, j) = \begin{cases} 1, & x_i \in \mathcal{N}_j^e \text{ or } x_j \in \mathcal{N}_i^e \\ 0, & \text{otherwise} \end{cases}    (13)

and their corresponding degree matrices

D^o = \mathrm{diag}(d_1^o, d_2^o, \ldots, d_n^o)    (14)

D^e = \mathrm{diag}(d_1^e, d_2^e, \ldots, d_n^e)    (15)

where d_i^o = \sum_j W^o(i, j) and d_i^e = \sum_j W^e(i, j). Then we can define the homogeneous graph Laplacian L^o and the heterogeneous graph Laplacian L^e as

L^o = D^o - W^o    (16)

L^e = D^e - W^e    (17)

With the above definitions, the discriminality of f can be defined as

\mathcal{D}_2(f) = \mathbf{f}^T(L^o - L^e)\mathbf{f}    (18)
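A possible construction of the homogeneous and heterogeneous graph Laplacians of Eqs. (12)-(17) is sketched below (our own illustration rather than the authors' code); it uses Euclidean distance for similarity, as in the note above, with xi and zeta standing for the neighborhood sizes \xi and \zeta.

```python
import numpy as np

def local_margin_laplacians(X, y, xi=5, zeta=5):
    """Build the homogeneous Laplacian L^o and heterogeneous Laplacian L^e."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Wo, We = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(dist[i, same])][:xi]:    # xi nearest same-class points
            Wo[i, j] = Wo[j, i] = 1.0   # symmetric: x_j in N_i^o or x_i in N_j^o (Eq. 12)
        for j in diff[np.argsort(dist[i, diff])][:zeta]:  # zeta nearest other-class points
            We[i, j] = We[j, i] = 1.0   # Eq. (13)
    Lo = np.diag(Wo.sum(axis=1)) - Wo   # Eq. (16)
    Le = np.diag(We.sum(axis=1)) - We   # Eq. (17)
    return Lo, Le

def local_discriminality(f, Lo, Le):
    """D_2(f) = f^T (L^o - L^e) f, Eq. (18)."""
    return float(f @ (Lo - Le) @ f)
```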

Classification by Discriminative Regularization

The optimal f can be learned by minimizing

J = \frac{1}{n}\sum_i \mathcal{L}(f_i, y_i) + \lambda\Omega(f) + \gamma\mathcal{D}_i(f)    (19)

where i = 1, 2. Assume f lies in a Reproducing Kernel Hilbert Space (RKHS) \mathcal{H} associated with a Mercer kernel k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}; then the complexity of f can be measured by its induced norm \|f\|_k in \mathcal{H}. We have the following representer theorem.

Theorem 1. The minimizer of Eq. (19) admits an expansion

f(x) = \sum_{i=1}^{n}\alpha_i k(x, x_i)    (20)

Let \boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_n]^T be the coefficient vector and K \in \mathbb{R}^{n \times n} the kernel matrix with (i, j)-th entry K(i, j) = k(x_i, x_j); then Theorem 1 tells us that \mathbf{f} = K\boldsymbol{\alpha}. Therefore we can rewrite Eq. (19) as

J = \frac{1}{n}(K\boldsymbol{\alpha} - \mathbf{y})^T(K\boldsymbol{\alpha} - \mathbf{y}) + \lambda\boldsymbol{\alpha}^T K\boldsymbol{\alpha} + \gamma\boldsymbol{\alpha}^T K L_D K\boldsymbol{\alpha}

where L_D is the discriminative graph Laplacian, with L_D = L_W - L_B for the global margin based criterion and L_D = L^o - L^e for the local margin based criterion. Since

\frac{\partial J}{\partial \boldsymbol{\alpha}} = 2\Big(\frac{1}{n}K(K\boldsymbol{\alpha} - \mathbf{y}) + (\lambda K + \gamma K L_D K)\boldsymbol{\alpha}\Big)

setting \partial J / \partial \boldsymbol{\alpha} = 0 gives

\boldsymbol{\alpha}^* = (K + n\lambda I + n\gamma L_D K)^{-1}\mathbf{y}    (21)

Table 1 summarizes the basic procedure of our discriminative regularization based classification method.

Table 1: Discriminative Regularization for Classification
Inputs: training data set \mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n}; testing data set \mathcal{X}_T = \{z_i\}_{i=1}^{m}; kernel type and parameters; regularization parameters \lambda, \gamma.
Output: the predicted labels of the data in \mathcal{X}_T.
Procedure:
1. Construct the kernel matrix K for the data set \mathcal{X};
2. Construct the discriminative graph Laplacian L_D;
3. Compute \boldsymbol{\alpha} using Eq. (21);
4. Construct the kernel matrix K_T \in \mathbb{R}^{m \times n} between \mathcal{X}_T and \mathcal{X}, with (i, j)-th entry K_T(i, j) = k(z_i, x_j);
5. Output the data labels by \mathbf{y}_T = K_T\boldsymbol{\alpha}.
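The procedure of Table 1 can be written compactly as below. This is a minimal sketch under our own assumptions (a Gaussian kernel, binary labels y in {-1, +1}, and one of the L_D constructions sketched earlier), not the authors' implementation; for multi-class problems y would be a label-indicator matrix and the prediction the column-wise argmax.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def train_dr(X, y, L_D, lam=0.1, gamma=0.1, sigma=1.0):
    """Steps 1-3 of Table 1: solve Eq. (21), alpha = (K + n*lam*I + n*gamma*L_D K)^(-1) y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    A = K + n * lam * np.eye(n) + n * gamma * L_D @ K
    return np.linalg.solve(A, y)

def predict_dr(X, Z, alpha, sigma=1.0):
    """Steps 4-5 of Table 1: y_T = K_T alpha with K_T(i, j) = k(z_i, x_j)."""
    return gaussian_kernel(Z, X, sigma) @ alpha
```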

The Linear Case. To make our algorithm more efficient, we also derive a linear discriminative regularization based classification method. Mathematically, a linear classifier f predicts the label of x by

f(x) = \mathbf{w}^T x + b    (22)

where \mathbf{w} \in \mathbb{R}^d is the weight vector and b \in \mathbb{R} is the offset. Using our discriminative regularization framework, we can solve for the optimal f by minimizing

J' = \frac{1}{n}\sum_i (\mathbf{w}^T x_i + b - y_i)^2 + \lambda\|\mathbf{w}\|^2 + \gamma\mathbf{w}^T X^T L_D X\mathbf{w}

Let

G = \begin{bmatrix} x_1^T/\sqrt{n} & 1/\sqrt{n} \\ x_2^T/\sqrt{n} & 1/\sqrt{n} \\ \vdots & \vdots \\ x_n^T/\sqrt{n} & 1/\sqrt{n} \\ \sqrt{\lambda} I_d & \mathbf{0}_d \end{bmatrix}, \quad
\tilde{\mathbf{y}} = \begin{bmatrix} y_1/\sqrt{n} \\ y_2/\sqrt{n} \\ \vdots \\ y_n/\sqrt{n} \\ \mathbf{0}_d \end{bmatrix}, \quad
\tilde{\mathbf{w}} = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}

where I_d is the d \times d identity matrix and \mathbf{0}_d \in \mathbb{R}^{d \times 1} is a zero column vector. Define the augmented discriminative graph Laplacian as

\tilde{L}_D = \begin{bmatrix} X^T L_D X & \mathbf{0} \\ \mathbf{0} & 0 \end{bmatrix} \in \mathbb{R}^{(d+1) \times (d+1)}

Then we can rewrite J' as

J' = \|G\tilde{\mathbf{w}} - \tilde{\mathbf{y}}\|^2 + \gamma\tilde{\mathbf{w}}^T\tilde{L}_D\tilde{\mathbf{w}}

The partial derivative of J' with respect to \tilde{\mathbf{w}} is

\frac{\partial J'}{\partial \tilde{\mathbf{w}}} = 2\big(G^T(G\tilde{\mathbf{w}} - \tilde{\mathbf{y}}) + \gamma\tilde{L}_D\tilde{\mathbf{w}}\big)

Setting \partial J'/\partial \tilde{\mathbf{w}} = 0, we obtain the optimal \tilde{\mathbf{w}} by solving the following linear equation system

(G^T G + \gamma\tilde{L}_D)\tilde{\mathbf{w}} = G^T\tilde{\mathbf{y}}    (23)

which gives

\tilde{\mathbf{w}} = (G^T G + \gamma\tilde{L}_D)^{-1} G^T\tilde{\mathbf{y}}    (24)

It is also worth mentioning that Eq. (23) is a (d+1) \times (d+1) linear system which can be solved easily when d is small. When d is large but the feature vectors are highly sparse, with z non-zero entries on average, we can employ Conjugate Gradient (CG) methods to solve this system. CG techniques are Krylov methods that solve a system Ax = b by repeatedly multiplying a candidate solution by A. In the case of linear discriminative regularized least squares, the matrix-vector product on the left-hand side of Eq. (23) can be formed in time O(n(z + k)), where k (typically small) is the average number of entries per column of L_D. This is done by using an intermediate vector X\mathbf{w} and appropriately forming sparse matrix-vector products. Thus the algorithm can employ well-developed numerical methods for efficiently obtaining the solution.

Table 2 summarizes the basic procedure of the linear discriminative regularization based classification method.

Table 2: Linear Discriminative Regularization
Inputs: training data set \mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n}; testing data set \mathcal{X}_T = \{z_i\}_{i=1}^{m}; regularization parameters \lambda, \gamma.
Output: the predicted labels of the data in \mathcal{X}_T.
Procedure:
1. Construct the discriminative graph Laplacian L_D;
2. Compute the augmented weight vector \tilde{\mathbf{w}} using Eq. (24);
3. Output z_i's label by y_{z_i} = \mathbf{w}^T z_i + b.
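For the linear case, the sketch below (our own illustration, not the released code) forms G, the augmented target vector, and the augmented Laplacian, and solves the (d+1)x(d+1) system of Eq. (23); for large, sparse problems np.linalg.solve could be replaced by a conjugate gradient solver such as scipy.sparse.linalg.cg, as discussed above.

```python
import numpy as np

def train_linear_dr(X, y, L_D, lam=0.1, gamma=0.1):
    """Solve (G^T G + gamma * L_D_tilde) w_tilde = G^T y_tilde, Eqs. (23)-(24)."""
    n, d = X.shape
    # G stacks the scaled data block [X/sqrt(n), 1/sqrt(n)] on top of [sqrt(lam)*I_d, 0_d]
    G = np.vstack([np.hstack([X / np.sqrt(n), np.ones((n, 1)) / np.sqrt(n)]),
                   np.hstack([np.sqrt(lam) * np.eye(d), np.zeros((d, 1))])])
    y_tilde = np.concatenate([y / np.sqrt(n), np.zeros(d)])
    L_tilde = np.zeros((d + 1, d + 1))
    L_tilde[:d, :d] = X.T @ L_D @ X          # augmented discriminative Laplacian
    w_tilde = np.linalg.solve(G.T @ G + gamma * L_tilde, G.T @ y_tilde)
    return w_tilde[:d], w_tilde[d]           # weight vector w and offset b

def predict_linear_dr(Z, w, b):
    """Step 3 of Table 2: y_z = w^T z + b."""
    return Z @ w + b
```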

Experiments

In this section we present a set of experimental results to demonstrate the effectiveness of our method. First, let us consider an illustrative example showing the difference between the global and local margin based criteria.

A Toy Example

In this part we present a toy example to demonstrate the effectiveness of our discriminative regularization framework as well as the difference between global margin based discriminative regularization (GDR) and local margin based discriminative regularization (LDR). We construct three toy data sets based on the two-moon pattern (Zhou et al., 2004), in which the overlap between the two classes becomes heavier and heavier from data set I to data set III. For both GDR and LDR we apply the Gaussian kernel with its width set to 0.1, and the regularization parameters for the complexity and discriminality regularizers are also set to 0.1. The experimental results are shown in Fig. 2.

[Figure 2 panels: (a) GDR on data set I; (b) GDR on data set II; (c) GDR on data set III; (d) LDR on data set I; (e) LDR on data set II; (f) LDR on data set III. Each panel plots the two classes and the learned decision boundary.]

Figure 2: A toy example showing the difference between global margin based discriminative regularization (GDR) and local margin based discriminative regularization (LDR). In the example we construct three toy data sets using the two-moon pattern, and the overlap between the two classes becomes heavier from data set I to data set III. From the figures we can observe that the decision boundaries achieved by GDR are generally smoother than those achieved by LDR.

From the figures we can observe that both GDR and LDR can discover the "optimal" decision boundary to discriminate the two classes. As expected, the decision boundaries achieved by GDR are generally smoother than the ones obtained by LDR, since GDR maximizes the discriminality of the classification function based on the global information of the data set, while LDR pays more attention to the local discriminality of the decision function.

The Data Sets

We adopt three categories of data sets in our experiments:

* UCI data. We perform experiments on 9 UCI data sets (http://mlearn.ics.uci.edu/MLRepository.html). The basic information of these data sets is summarized in Table 3.

Table 3: Descriptions of the UCI data sets.
Data        Size (n)   Feature (N)   Class
Balance     625        3             2
BUPA        345        6             2
Glass       214        7             9
Iris        150        4             3
Ionosphere  351        34            2
Lenses      24         4             3
Pima        768        8             2
Soybean     307        35            19
Wine        138        13            3

* CMU PIE Face Data Set (Sim et al., 2003). This is a database of 41,368 face images of 68 people, each person under 13 different poses, 43 different illumination conditions, and with 4 different expressions. In our experiments, we only adopt a subset containing five near-frontal poses (C05, C07, C09, C27, C29) and all the images under different illuminations and expressions. As a result, there are 170 images for each individual. All the images are resized to 32x32 for computational convenience.

* COIL-20 Object Image Data Set (Nene et al., 1996). This is a database of gray-scale images of 20 objects. For each object there are 72 images of size 128x128, taken around the object at pose intervals of 5 degrees. In our experiments, all the images are resized to 32x32.

Comparisons and Experimental Settings

In our experiments, we compare the performance of our discriminative regularization algorithm with four other competitive algorithms:

* Linear Support Vector Machines (LSVM). We use libSVM (Fan et al., 2005) to implement the SVM algorithm with a linear kernel, and the cost parameter is searched from {10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4}.

* Kernel Support Vector Machines (KSVM). We also use libSVM (Fan et al., 2005) to implement the kernel SVM algorithm with a Gaussian kernel, with its width tuned from {4^-4 σ0, 4^-3 σ0, 4^-2 σ0, 4^-1 σ0, σ0, 4^1 σ0, 4^2 σ0, 4^3 σ0, 4^4 σ0}, where σ0 is the mean distance between any two examples in the training set. The cost parameter is tuned in the same way as for LSVM.

* Linear Laplacian Regularized Least Squares (LLapRLS) (Belkin et al., 2006). The implementation code is downloaded from http://manifold.cs.uchicago.edu/manifold_regularization/software.html, in which the linear kernel is adopted in the RKHS. The variance of the Gaussian used for constructing the graph is set in the same way as in (Zelnik-Manor & Perona, 2005), and the extrinsic and intrinsic regularization parameters are searched from {10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4}.

* Kernel Laplacian Regularized Least Squares (KLapRLS). All the experimental settings are the same as for LLapRLS except that we adopt a Gaussian kernel in the RKHS, with its width tuned from {4^-4 σ0, 4^-3 σ0, 4^-2 σ0, 4^-1 σ0, σ0, 4^1 σ0, 4^2 σ0, 4^3 σ0, 4^4 σ0}, where σ0 is the mean distance between any two examples in the training set.

Experimental Results

The experimental results are shown in Table 4 and Table 5, where we have highlighted the best performances. GDR and LDR denote the kernel versions of our method with the global and local margin criteria, and LGDR and LLDR denote their linear counterparts. For the PIE data set we randomly labeled 20 images per person for training, while for the COIL data set we randomly labeled 2 images per object. The performances reported in Table 4 and Table 5 are averaged over 50 independent runs. From these results we can clearly see the effectiveness of our methods.

Table 4: Results on the UCI data sets (%).
Data        LSVM    KSVM    LLapRLS   KLapRLS   LGDR    LLDR    GDR     LDR
Balance     87.76   92.04   88.34     91.63     89.20   88.72   91.45   92.87
BUPA        66.99   72.43   68.90     73.06     66.94   70.23   74.09   82.13
Glass       62.57   72.75   61.65     76.24     62.04   63.68   71.35   74.20
Iris        94.53   92.04   98.27     98.63     89.32   87.86   97.34   98.77
Ionosphere  87.78   95.14   91.63     98.30     85.20   88.20   95.24   98.75
Lenses      74.62   79.25   82.65     81.54     82.09   85.64   87.35   87.68
Pima        77.08   79.23   69.50     77.45     77.24   79.82   79.12   87.70
Soybean     100.00  97.75   100.00    100.00    99.04   99.68   100.00  100.00
Wine        95.57   79.87   97.11     96.54     99.00   99.00   98.35   98.24

Table 5: Experimental results on the image data sets (%).
Method     PIE (20 train)   COIL (2 train)
LSVM       36.92            94.30
KSVM       33.27            97.24
LLapRLS    30.18            96.78
KLapRLS    29.49            98.37
LGDR       30.12            97.12
LLDR       28.77            99.02
GDR        27.93            98.93
LDR        26.49            99.14

Conclusions

In this paper we propose a novel regularization approach for supervised classification. Instead of penalizing the geometrical smoothness of the classification function as in (Belkin et al., 2006), we directly penalize its discriminality. The experiments show that our algorithm is more effective than several conventional methods.

References

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold Regularization: A Geometric Framework for Learning from Examples. Journal of Machine Learning Research. 2006.
Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. Cambridge, Massachusetts: The MIT Press. 2006.
Chuzhanova, N. A., Jones, A. J., and Margetts, S. Feature Selection for Genetic Sequence Classification. Bioinformatics. 1998.
Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification (2nd ed.). Wiley. 2000.
Fan, R.-E., Chen, P.-H., and Lin, C.-J. Working Set Selection Using Second Order Information for Training SVM. Journal of Machine Learning Research 6, 1889-1918. 2005.
Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, New York, 2nd edition. 1990.
Li, H., Jiang, T., and Zhang, K. Efficient and Robust Feature Extraction by Maximum Margin Criterion. IEEE Transactions on Neural Networks 17(1), 157-165. 2006.
Li, S. Z., and Jain, A. K. Handbook of Face Recognition. 2005.
Nene, S. A., Nayar, S. K., and Murase, H. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96. February 1996.
Sebastiani, F. Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1): 1-47. 2002.
Sim, T., Baker, S., and Bsat, M. The CMU Pose, Illumination, and Expression Database. IEEE TPAMI 25(12): 1615-1618. 2003.
Vapnik, V. The Nature of Statistical Learning Theory. 1995.
Wang, F., and Zhang, C. Feature Extraction by Maximizing the Average Neighborhood Margin. CVPR 2007.
Xue, H., Chen, S., and Yang, Q. Discriminative Regularization: A New Classifier Learning Method. Technical Report, NUAA. 2007.
Zelnik-Manor, L., and Perona, P. Self-Tuning Spectral Clustering. In NIPS 17, 1601-1608. 2005.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with Local and Global Consistency. NIPS 16. 2004.
