Classification by Discriminative Regularization
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Bin Zhang$^{1*}$, Fei Wang$^{2}$, Ta-Hsin Li$^{3}$, Wenjun Yin$^{1}$, Jin Dong$^{1}$
$^{1}$IBM China Research Lab, Beijing, China
$^{2}$Department of Automation, Tsinghua University, Beijing, China
$^{3}$IBM Watson Research Center, Yorktown Heights, NY, USA
$^{*}$Corresponding author, email: [email protected]

Copyright © 2008, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Classification is one of the most fundamental problems in machine learning; it aims to separate the data from different classes as far apart as possible. A common way to obtain a good classification function is to minimize its empirical prediction loss or structural loss. In this paper we point out that we can also enhance the discriminality of such classifiers by further incorporating the discriminative information contained in the data set as a prior into the classifier construction process. We show that classifiers constructed in this way are more powerful, which is also validated by an empirical study on several benchmark data sets.

Introduction

Classification, or supervised learning with discrete outputs (Duda et al., 2000), is one of the vital problems in machine learning and artificial intelligence. Given a set of training data points and their labels $\mathcal{X} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, a classification algorithm aims to construct a good classifier $f \in \mathcal{F}$ (where $\mathcal{F}$ is the hypothesis space) and use it to predict the labels of the data points in the testing set $\mathcal{X}_T$. Most real-world data analysis problems can ultimately be reformulated as classification problems, such as text categorization (Sebastiani, 2002), face recognition (Li & Jain, 2005) and genetic sequence identification (Chuzhanova et al., 1998).

More concretely, in classification problems we usually assume that the data points and their labels are sampled from a joint distribution $P(\mathbf{x}, y)$, and the optimal classifier $f$ we seek should minimize the following expected loss (Vapnik, 1995)
$$\mathcal{J} = \mathbb{E}_{P(\mathbf{x},y)}\,\mathcal{L}(y, f(\mathbf{x})),$$
where $\mathcal{L}(\cdot,\cdot)$ is some loss function. However, in practice $P(\mathbf{x}, y)$ is usually unknown, so we should instead minimize the following empirical loss
$$\hat{\mathcal{J}} = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(y_i, f(\mathbf{x}_i)).$$
When deriving a specific classification algorithm by minimizing $\hat{\mathcal{J}}$, we may encounter two common problems:

- Achieving a good classifier generally requires a sufficiently large number of training data points (so that $\hat{\mathcal{J}}$ approximates $\mathcal{J}$ well); however, in real-world applications it is usually expensive and time consuming to obtain enough labeled points.
- Even if we have enough training data points and the constructed classifier fits the training set very well, it may not generalize well to the testing set. This is the problem we call overfitting.

Regularization is one of the most important techniques for handling the above two problems. First, for the overfitting problem, Vapnik proposed that $\mathcal{J}$ and $\hat{\mathcal{J}}$ satisfy the following inequality (Vapnik, 1995)
$$\mathcal{J} \leq \hat{\mathcal{J}} + \varepsilon(n, d_{\mathcal{F}}),$$
where $\varepsilon(n, d_{\mathcal{F}})$ measures the complexity of the hypothesis model. Therefore we can minimize the following structural risk to get an optimal $f$:
$$\bar{\mathcal{J}} = \hat{\mathcal{J}} + \lambda\Omega(f),$$
where $\Omega(f)$ is a structural regularization term that measures the complexity of $f$. In such a way, the resulting $f$ can also generalize well to the testing data.
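As a concrete illustration of minimizing such a regularized empirical loss, the sketch below fits a linear classifier with a squared loss and an L2 complexity penalty. The squared loss, the linear model $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$ and this particular choice of $\Omega(f)$ are illustrative assumptions, not choices prescribed by the paper.

```python
# Minimal sketch of structural risk minimization: empirical loss plus a complexity penalty.
# Assumptions (not from the paper): f(x) = w^T x, squared loss, Omega(f) = ||w||^2.
import numpy as np

def fit_regularized_linear_classifier(X, y, lam=0.1):
    """Minimize (1/n) * ||Xw - y||^2 + lam * ||w||^2 (labels y encoded as +1/-1)."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X / n + lam * I) w = X^T y / n.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Usage (hypothetical data): predictions are taken as sign(X_test @ w).
# w = fit_regularized_linear_classifier(X_train, y_train, lam=0.1)
```

Increasing the penalty weight shrinks the classifier toward a simpler function, trading a larger empirical loss for better expected generalization, which is exactly the role of the structural term above.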
Second, for the first problem, we can apply semi-supervised learning methods, which aim to learn from partially labeled data (Chapelle et al., 2006). Belkin et al. proposed to add a further regularization term to $\bar{\mathcal{J}}$ that penalizes the geometric smoothness of the decision function with respect to the intrinsic data manifold (Belkin et al., 2006), where such smoothness is estimated from the whole data set. Belkin et al. also pointed out that this technique can be incorporated into the procedure of supervised classification, in which case the smoothness of the classification function can be measured class-wise.

In this paper, we propose a novel regularization approach for classification. Instead of using the geometrical smoothness of $f$ as the penalty criterion, as in (Belkin et al., 2006), we directly use the discriminality of $f$ as the penalty function, since the final goal of classification is precisely to discriminate the data from different classes. We therefore call our method Discriminative Regularization (DR). Specifically, we derive two margin-based criteria to measure the discriminality of $f$, namely the global margin and the local margin. We show that the final classification function is obtained by solving a linear equation system, which can be implemented efficiently with many numerical methods, and we present experimental results on several benchmark data sets that demonstrate the effectiveness of our method.

One issue that should be mentioned here is that, independently of this work, (Xue et al., 2007) also proposes a similar discriminative regularization technique for classification. However, their technical report contains no detailed mathematical derivations.

The rest of this paper is organized as follows. In Section 2 we briefly review the basic procedure of manifold regularization, since it is closely related to this paper. The algorithmic details of DR are derived in Section 3. In Section 4 we present the experimental results, followed by the conclusions and discussions in Section 5.

A Brief Review of Manifold Regularization

As mentioned in the introduction, manifold regularization (Belkin et al., 2006) seeks an optimal classification function by minimizing the following loss
$$\bar{\mathcal{J}} = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f(\mathbf{x}_i), y_i) + \lambda\Omega(f) + \gamma\,\mathcal{I}(f) \qquad (1)$$
where $\Omega(f)$ measures the complexity of $f$ and $\mathcal{I}(f)$ penalizes the geometrical property of $f$. Generally, in semi-supervised learning, $\mathcal{I}(f)$ can be approximated by
$$\mathcal{I}(f) = \mathbf{f}^T L \mathbf{f} = \sum_{ij} w_{ij}(f_i - f_j)^2$$
where $w_{ij}$ measures the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$, $f_i = f(\mathbf{x}_i)$, and $\mathbf{f} = [f_1, f_2, \cdots, f_n]^T$. Here $L$ is an $n \times n$ graph Laplacian matrix with its $(i,j)$-th entry
$$L_{ij} = \begin{cases} \sum_{k} w_{ik}, & \text{if } i = j \\ -w_{ij}, & \text{otherwise.} \end{cases}$$
Therefore $\mathcal{I}(f)$ reflects the variation, or the smoothness, of $f$ over the intrinsic data manifold.
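The following sketch illustrates this smoothness penalty, building $L$ from a similarity matrix and evaluating $\mathbf{f}^T L \mathbf{f}$. The Gaussian similarity used to define $w_{ij}$ is an illustrative assumption; the formulation above only requires some similarity measure.

```python
# Minimal sketch of the smoothness penalty I(f) = f^T L f used in manifold regularization.
# Assumption (not from the paper): Gaussian similarity w_ij = exp(-||x_i - x_j||^2 / (2*sigma^2)).
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Return the similarity matrix W and the graph Laplacian L = D - W, whose diagonal
    holds the degrees sum_k w_ik and whose off-diagonal entries are -w_ij."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # no self-similarity
    L = np.diag(W.sum(axis=1)) - W
    return W, L

def smoothness_penalty(f, L):
    """I(f) = f^T L f, which equals the sum over pairs i < j of w_ij * (f_i - f_j)^2."""
    return float(f @ L @ f)
```

A function that changes little between strongly connected (similar) points yields a small penalty, which is exactly the kind of smoothness manifold regularization rewards.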
For supervised learning, we can define a block-diagonal matrix $L$ as
$$L = \begin{bmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_C \end{bmatrix}$$
where $C$ is the number of classes and $L_i$ is the graph Laplacian constructed on class $i$; that is, if we construct a supervised data graph on $\mathcal{X}$, then connections exist only among the data points in the same class (see Fig. 1).

Figure 1: The data graph constructed in supervised manifold regularization. There are only connections among the data points in the same class.

Then, by rearranging the elements in $\mathbf{f}$ as
$$\mathbf{f} = [\mathbf{f}_1^T, \mathbf{f}_2^T, \cdots, \mathbf{f}_C^T]^T,$$
where $\mathbf{f}_1$ is the classification vector on class 1, we can adopt the following term as $\mathcal{I}(f)$:
$$\mathcal{I}(f) = \sum_i \mathbf{f}_i^T L_i \mathbf{f}_i = \mathbf{f}^T L \mathbf{f},$$
which is in fact the summation of the smoothness of $f$ within each class.

Discriminative Regularization

Although (Belkin et al., 2006) proposed an elegant framework for learning by taking the smoothness of $f$ into account, the final goal of learning (e.g., classification or clustering) is to discriminate the data from different classes. Therefore, a reasonable classification function should be smooth within each class, but simultaneously produce great contrast between different classes. Hence we propose to directly penalize the discriminality of $f$, rather than its smoothness, when constructing it. Generally, the discriminality of $f$ can be measured by the difference (or quotient) of the within-class scatterness and the between-class scatterness, which can be computed either globally or locally.

Global Margin Based Discriminality

Similar to the maximum margin criterion (Li et al., 2006), we measure the discriminality of $f$ by
$$\mathcal{D}_1(f) = \mathcal{W}(f) - \mathcal{B}(f) \qquad (2)$$
where $\mathcal{W}(f)$ and $\mathcal{B}(f)$ measure the within-class scatterness and between-class scatterness of $f$, which can be computed by
$$\mathcal{W}(f) = \sum_{c=1}^{C}\sum_{i \in \pi_c}\Big\| f_i - \frac{1}{n_c}\sum_{j \in \pi_c} f_j \Big\|^2 \qquad (3)$$
$$\mathcal{B}(f) = \sum_{c=1}^{C} n_c \Big\| \frac{1}{n_c}\sum_{j \in \pi_c} f_j - \frac{1}{n}\sum_{j} f_j \Big\|^2 \qquad (4)$$
where we use $\pi_c$ to denote the $c$-th class and $n_c = |\pi_c|$. Define the total scatterness of $f$ as
$$\mathcal{T}(f) = \mathcal{W}(f) + \mathcal{B}(f) = \sum_i \Big\| f_i - \frac{1}{n}\sum_j f_j \Big\|^2. \qquad (5)$$
The within-class scatterness can be expanded as
$$\begin{aligned}
\mathcal{W}(f) &= \sum_{c=1}^{C}\sum_{i \in \pi_c}\Big\| f_i - \frac{1}{n_c}\sum_{j \in \pi_c} f_j \Big\|^2 \\
&= \sum_{c=1}^{C}\sum_{i \in \pi_c}\Big( f_i^2 - \frac{2}{n_c}\sum_{j \in \pi_c} f_j f_i + \frac{1}{n_c^2}\sum_{k,l \in \pi_c} f_k f_l \Big) \\
&= \sum_i f_i^2 - \sum_{ij} g_{ij}\, f_i f_j
 = \frac{1}{2}\sum_{ij} g_{ij}\,(f_i - f_j)^2
 = \frac{1}{2}\,\mathbf{f}^T L_W \mathbf{f}
\end{aligned}$$
where
$$g_{ij} = \begin{cases} 1/n_c, & \mathbf{x}_i, \mathbf{x}_j \in \pi_c \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
$$\mathbf{f} = [f_1, f_2, \cdots, f_n]^T \in \mathbb{R}^{n \times 1} \qquad (7)$$
$$L_W = \mathrm{diag}(L_{W_1}, L_{W_2}, \cdots, L_{W_C}) \in \mathbb{R}^{n \times n} \qquad (8)$$
and $L_{W_c} \in \mathbb{R}^{n_c \times n_c}$ is a square matrix with its entries …

Local Margin Based Discriminality

The discriminality of $f$ can also be measured locally, through a within-class (homogeneous) neighborhood $\mathcal{N}_i^o$ and a between-class (heterogeneous) neighborhood $\mathcal{N}_i^e$ of each point $\mathbf{x}_i$ (here $|\cdot|$ represents the cardinality of a set). The total neighborhood margin of $f$ can be defined as
$$\gamma = \sum_i \gamma_i = \sum_i \Bigg( \sum_{k:\, \mathbf{x}_k \in \mathcal{N}_i^e \text{ or } \mathbf{x}_i \in \mathcal{N}_k^e} \| f_i - f_k \|^2 \;-\; \sum_{j:\, \mathbf{x}_j \in \mathcal{N}_i^o \text{ or } \mathbf{x}_i \in \mathcal{N}_j^o} \| f_i - f_j \|^2 \Bigg).$$
The discriminality of $f$ can be described by $\gamma$. Define two weight matrices
$$W^o(i,j) = \begin{cases} 1, & \mathbf{x}_i \in \mathcal{N}_j^o \text{ or } \mathbf{x}_j \in \mathcal{N}_i^o \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$
$$W^e(i,j) = \begin{cases} 1, & \mathbf{x}_i \in \mathcal{N}_j^e \text{ or } \mathbf{x}_j \in \mathcal{N}_i^e \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$
and their corresponding degree matrices
$$D^o = \mathrm{diag}(d_1^o, d_2^o, \cdots, d_n^o) \qquad (14)$$
$$D^e = \mathrm{diag}(d_1^e, d_2^e, \cdots, d_n^e) \qquad (15)$$
where $d_i^o = \sum_j W^o(i,j)$ and $d_i^e = \sum_j W^e(i,j)$.
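To make the two criteria concrete, the sketch below computes the global-margin quantities of Eqs. (2)-(5) and the neighborhood matrices of Eqs. (12)-(15). The function names and the k-nearest-neighbor construction of $\mathcal{N}_i^o$ (same-class neighbors) and $\mathcal{N}_i^e$ (different-class neighbors) are assumptions made for illustration; only the equations cited in the comments come from the text.

```python
# Minimal NumPy sketch of the two discriminality measures above. The k-NN construction of
# the neighborhoods N^o (same class) and N^e (different class) is an illustrative assumption.
import numpy as np

def global_margin(f, y):
    """D_1(f) = W(f) - B(f), Eqs. (2)-(4); also checks T(f) = W(f) + B(f), Eq. (5)."""
    f, y = np.asarray(f, float), np.asarray(y)
    W = sum(((f[y == c] - f[y == c].mean()) ** 2).sum() for c in np.unique(y))          # Eq. (3)
    B = sum((y == c).sum() * (f[y == c].mean() - f.mean()) ** 2 for c in np.unique(y))  # Eq. (4)
    assert np.isclose(W + B, ((f - f.mean()) ** 2).sum())                               # Eq. (5)
    return W - B                                                                        # Eq. (2)

def neighborhood_matrices(X, y, k=3):
    """Weight and degree matrices of Eqs. (12)-(15), with N^o / N^e taken as the
    k nearest same-class / different-class neighbors of each point (assumption)."""
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Wo, We = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(dist[i, same])][:k]:
            Wo[i, j] = Wo[j, i] = 1.0        # Eq. (12): symmetric because of the "or"
        for j in diff[np.argsort(dist[i, diff])][:k]:
            We[i, j] = We[j, i] = 1.0        # Eq. (13)
    Do, De = np.diag(Wo.sum(axis=1)), np.diag(We.sum(axis=1))   # Eqs. (14)-(15)
    return Wo, We, Do, De
```

Note that, with the definitions above, the total neighborhood margin can be evaluated in matrix form as $\gamma = 2\,\mathbf{f}^T\big[(D^e - W^e) - (D^o - W^o)\big]\mathbf{f}$, since any symmetric weight matrix satisfies $\sum_{ij} W(i,j)(f_i - f_j)^2 = 2\,\mathbf{f}^T(D - W)\mathbf{f}$.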