A Loss Function Analysis for Classification Methods in Text Categorization

472 A Loss Function Analysis for Classification Methods in Text Categorization Fan LI HUSTLF~CS.CMU.EDU Carnegie Mellon Univ, 4502 NSH, 5000 Forbes Avenue, Pittsburgh, PA 15213 USA Yiming Yang [email protected] Carnegie Mellon Univ, 4502 NSH, 5000 Forbes Avenue, Pittsburgh, PA 15213 USA Abstract Support Vector Machine (SVM)and other approaches (Yang & Liu, 1999; Yang, 1999; Joachims, 1998; Mc- This paper presents a formal analysis of pop- Callum & Nigam; Zhang & Oles, 2001; Lewis et al., ular text classification methods, focusing on 2003). As more methods are published, we need to their loss functions whoseminimization is es- have a sound theoretical framework for cross-method sential to the optimization of those methods, comparison. Recent work in machine learning focus- and whose decomposition into the training- ing on the regularization of classification methodsand set loss and the model complexity enables on the analysis of their loss functions is a step in this cross-method comparisons on a commonba- direction. sis from an optimization point of view. Those methods include Support Vector Machines, Vapnik (Vapnik, 1995) defined the objective function Linear Regression, Logistic Regression, Neu- in SVMas minimizing the expected risk on test exam- ral Network, Naive Baycs, K-Nearest Neigh- ples, and decomposedthat risk into two components: bor, Rocchio-style and Multi-class Prototype the empirical risk that reflects the training-set errors classifiers. Theoretical analysis (including of the classifier, and the inverse of marginwidth that our new derivations) is provided for each reflects howfar the positive and negative training ex- method,along with e~-aluation results for all amples of a category are separated by the decision sur- the methods on the Reuters-21578 bench- face. Thus, both the minimization of training-set er- mark corpus. Using linear regression, neural rors and the maximization of the margin width are the networks and logistic regression methods as criteria used in the optimization of SVM.Balancing examples, we show that properly tuning the betweenthe two criteria has been referred as the regu- balance betweenthe training-set loss and the larization of a classifier; the degree of regularization is complexity penalty would have a significant often controlled by a parameter in that method (sec- impact to the performanceof a classifier. In tion 2). SVMhave been extremely successful in text linear regression, in particular, the tuning of categorization, often resulting in the best performance the complexity penalty yielded a result (mea- in benchmark evaluations (Joachims, 1998; Yang sured using macro-averaged F1) that outper- Liu, 1999; Lewiset al., 2003). formed all text categorization methods ever Hastie et al. (Hastie et al. 2001) presented a moregen- evaluated on that benchmarkcorpus, includ- eral frameworkfor estimating the potential of a model ing Support Vector Machines. in makingclassification errors, and used a slightly dif- ferent terminology: loss or generalization error corre- spondingto the expected risk, training-set loss corre- 1. Introduction sponding to the empirical risk, and model complexity Text categorization is an active research area in ma~ corresponding to the margin-related risk in SVM.Us- ing this framework they comparedalternative ways to chine learning and information retrieval. A large num- ber of statistical classification methodshave been ap- penalize the model complexity, including the Akalke plied to this problem, including linear regression, lo- Information Criterion (AIC), the Bayesian Informa- gistic regression (LR), neural networks (NNet), Naive tion Criterion (BIC), and the MinimumDescription Bayes (NB), k-nearest neighbor (kNN), Rocchio-style, Length (MDL)criterion. More interestingly, they Proceedingsof the Twentieth International Conferenceon MachineLearning (ICML-2003),Washington DC, 2003. 473 comparedthe differences in the training-set loss func- represents the values of the p input variables in the tions for SVM,LLSF, LR and AdaBoost, in a way such ith training example. Scalar Yl E {-1,1} (unless that the sensitivity of those methods with respect to otherwise specified) is the class label. classification errors on training examplescan be easily compared(section 2). ¯ Vector ~ = (ill,...,/~v) T consists of the parame- ters in a linear classifier, whichare estimated us- It wouldbe valuable to analyze a broader range of clas- ing the training data. sification methodsin a similar fashion as presented by Hestie et al., so that the comparison amongmethods ¯ Alinea scalar f(a~i,/~) = ~’i/~is the classifier’s out- can be madeexplicitly in terms of their inductive bi- put given input gi, and the quantity Yif(~i,/~) ases with respect to training examples, or in terms of shows how much the system’s output agrees with their penalty functions for model complexity. For this the truth label we need a formal analysis on the optimization criterion of each method, in the form of a loss function that ¯ The 2-norm of/~ is represented as [[/~1[ and the decomposes into the training-set error term and the 1-normof/~ is represented as IIEII model complexity term. Such a formal analysis, how- ever, often is not available in the literature for popular Note that we purposely chose to define gi as a hori- text categorization methods, such as Nave Bayes, kNN zontal vector and fl as a vertical vector, so that we can and Rocchio-style classifiers. conveniently write xi~ for the dot product ~-]~=1x~/~k The primary contribution we offer here is a loss- (and vice versa), which will be frequently seen in our function based study for eight classifiers popular in derivations. text categorization, including SVM,linear regression, logistic regression, neural networks, Rocchio-style, 2.1. SVM Prototypes, kNN and Nave Bayes. We provide our SVMhas been extremely successful in text categoriza- own derivations for the loss function decomposition tion. Multiple versions of SVMexist; in this paper we in Rocchio-style, NB, kNNand multi-class prototypes only use linear SVMfor our analysis, partly for clarity (Prototypes), which have not been reported before. and simplicity of our analysis, and partly because lin- Wealso show the importance of properly tuning the ear SVMperformed as well as other versions of SVMin amount of regularization by using controlled examina- text categorization evaluations(Joachims, 1998). SVM tions of LLSF, LR and NNet with and without reg- emphasizesthe generalization capability of the classi- ularization. Finally, we compare the performance of the eight classifiers with properly tuned regulariza- fier (Hastie et al. 2001), whoseloss function (for class c) has the form of tion (though validation) using a benchmark corpus (Reuters-21578) in text categorization. fl The organization of the remaining parts of this pa- Lc : E(1 - yiEi/~)+ + ~11/~[12 (1) per is aa follows: Section 2 outlines the classifiers and i=1 provides a formal analysis on their loss functions. Sec- in which the training-set loss on a single training ex- tion 3 describes the experiment settings and results. ample is defined to be Section 4 summarizes and offers the concluding re- marks. le = (1-Yi ~ i~)+~ ( 1 =Yi 0 ~i~otherwise wh en Yl Xi~ <_ 1 2. Loss functions of the classifiers The first term in the right hand side of formula 1 is the In order to comparedifferent classifiers on a common cumulative training-set loss and the second term is the basis, we need to present their loss functions in the complexity penalty and both are functions of vector unified form: Lc = gl(YJ(,~i,~)) g2(/~). We cal /L The optimization in SVMis to find/~ that mini- the first term gl (ylf ( Ei, fl) the tr aining-set loss and mizes the sum of the two terms in formula 1. In other the second term g2(/~) the complexity penalty or the words, the optimization in SVMis not only driven by regularizer. The following notation will be used in the the training-set loss, but also driven by the 2-norm rest of this paper: of vector fl, which is determined by the squared sum of the coefficients in /~ and reflects the sensitivity of the mapping function with respect to the input vari- ¯ The training data consL~ts of N pairs of (El,y1), ables. The value of A controls the trade-off between (~’2,y2),...,(X.N,YN). Vector ~i = (Xil,...,Xlp) the two terms, that is, it is the weight (algorithmically 474 determined in the training phase of SVM)of the sec- with the scoring function ](~, fi) = ~fl where fi is the ond term relative to the first term. Formula 1 can be prototype vector, defined to be: transformed into dual form and solved using quadratic 1 b programming. y" = ~-~=~ ~, ~,-~, ~ Ecj ~i ~c This kind of analysis on the loss function in SVMis not new, of course. In fact, it is a part of the SVM : ~1 ~ j, + ~b ~ v,~, (3) theory, and has been presented by other researchers yi:l yi:--I (Vapnik, 1995; Hastie et al. 2001). Our point here where b is a parameter which can be tuned. Nowwe is to start with a good framework and carry out the formal analysis for the other classifiers chosen for this showthat the rcgularized loss function in the Rocchio- style classifier is study in a consistent fashion; someof those classifiers have not been formally analyzed in this manner. Lc =- ~ yi~,ifi bg~ E (4) yi:l y~=--i 2.2. Linear Least Squares Fit (LLSF) In order to minimizethe loss function, we need to take Linear regression, also called Linear Least Squares Fit the first order derivative of formula 4 with respect to (LLSF) in the literature, has performed competitively fi and set it to zero, to SVMand other high performing classifiers (includ- r ing kNNand logistic regression) in text categorization dLc bh evaluations (Yang, 1999). LLSFis similar to linear ,¢ = - ~,yl---1 Y’~’ N~y~=--I~C Y’~’ + Jvd:o SVMin the sense that both leaxn a linear mapping ](~, fi) = ~fi based on tile training data.

Load more