A Loss Function Analysis for Classification Methods in Text Categorization

Fan Li ([email protected]), Carnegie Mellon University, 4502 NSH, 5000 Forbes Avenue, Pittsburgh, PA 15213 USA

Yiming Yang ([email protected]), Carnegie Mellon University, 4502 NSH, 5000 Forbes Avenue, Pittsburgh, PA 15213 USA

Abstract

This paper presents a formal analysis of popular text classification methods, focusing on their loss functions, whose minimization is essential to the optimization of those methods, and whose decomposition into the training-set loss and the model complexity enables cross-method comparisons on a common basis from an optimization point of view. Those methods include Support Vector Machines, linear regression, logistic regression, neural networks, Naive Bayes, k-nearest neighbor, Rocchio-style and multi-class Prototype classifiers. Theoretical analysis (including our new derivations) is provided for each method, along with evaluation results for all the methods on the Reuters-21578 benchmark corpus. Using linear regression, neural networks and logistic regression as examples, we show that properly tuning the balance between the training-set loss and the complexity penalty can have a significant impact on the performance of a classifier. In linear regression, in particular, the tuning of the complexity penalty yielded a result (measured using macro-averaged F1) that outperformed all text categorization methods ever evaluated on that benchmark corpus, including Support Vector Machines.

1. Introduction

Text categorization is an active research area in machine learning and information retrieval. A large number of statistical classification methods have been applied to this problem, including linear regression, logistic regression (LR), neural networks (NNet), Naive Bayes (NB), k-nearest neighbor (kNN), Rocchio-style, Support Vector Machines (SVM) and other approaches (Yang & Liu, 1999; Yang, 1999; Joachims, 1998; McCallum & Nigam, 1998; Zhang & Oles, 2001; Lewis et al., 2003). As more methods are published, we need a sound theoretical framework for cross-method comparison. Recent work focusing on the regularization of classification methods and on the analysis of their loss functions is a step in this direction.

Vapnik (1995) defined the objective function in SVM as minimizing the expected risk on test examples, and decomposed that risk into two components: the empirical risk, which reflects the training-set errors of the classifier, and the inverse of the margin width, which reflects how far the positive and negative training examples of a category are separated by the decision surface. Thus, both the minimization of training-set errors and the maximization of the margin width are the criteria used in the optimization of SVM. Balancing the two criteria has been referred to as the regularization of a classifier; the degree of regularization is often controlled by a parameter in that method (Section 2). SVM has been extremely successful in text categorization, often yielding the best performance in benchmark evaluations (Joachims, 1998; Yang & Liu, 1999; Lewis et al., 2003).

Hastie et al. (2001) presented a more general framework for estimating the potential of a model in making classification errors, and used a slightly different terminology: generalization loss corresponding to the expected risk, training-set loss corresponding to the empirical risk, and model complexity corresponding to the margin-related risk in SVM. Using this framework they compared alternative ways to penalize the model complexity, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Minimum Description Length (MDL) criterion.


More interestingly, they compared the differences in the training-set loss functions of SVM, LLSF, LR and AdaBoost, in a way such that the sensitivity of those methods to classification errors on training examples can be easily compared (Section 2).

It would be valuable to analyze a broader range of classification methods in a similar fashion as presented by Hastie et al., so that the comparison among methods can be made explicit in terms of their inductive biases with respect to training examples, or in terms of their penalty functions for model complexity. For this we need a formal analysis of the optimization criterion of each method, in the form of a loss function that decomposes into a training-set error term and a model complexity term. Such a formal analysis, however, is often not available in the literature for popular text categorization methods, such as Naive Bayes, kNN and Rocchio-style classifiers.

The primary contribution we offer here is a loss-function based study of eight classifiers popular in text categorization: SVM, linear regression, logistic regression, neural networks, Rocchio-style, multi-class Prototype, kNN and Naive Bayes. We provide our own derivations for the loss function decomposition in Rocchio-style, NB, kNN and multi-class Prototype (Prototypes), which have not been reported before. We also show the importance of properly tuning the amount of regularization, using controlled examinations of LLSF, LR and NNet with and without regularization. Finally, we compare the performance of the eight classifiers with properly tuned regularization (through validation) on a benchmark corpus (Reuters-21578) in text categorization.

The organization of the remaining parts of this paper is as follows: Section 2 outlines the classifiers and provides a formal analysis of their loss functions. Section 3 describes the experiment settings and results. Section 4 summarizes and offers concluding remarks.

2. Loss functions of the classifiers

In order to compare different classifiers on a common basis, we need to present their loss functions in the unified form L_c = g_1(y_i f(x_i, \beta)) + g_2(\beta). We call the first term g_1(y_i f(x_i, \beta)) the training-set loss and the second term g_2(\beta) the complexity penalty or the regularizer. The following notation will be used in the rest of this paper:

* The training data consists of N pairs (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). Vector x_i = (x_{i1}, ..., x_{ip}) represents the values of the p input variables in the ith training example. Scalar y_i \in {-1, 1} (unless otherwise specified) is the class label.

* Vector \beta = (\beta_1, ..., \beta_p)^T consists of the parameters in a classifier, which are estimated using the training data.

* The scalar f(x_i, \beta) = x_i\beta is the classifier's output given input x_i, and the quantity y_i f(x_i, \beta) shows how much the system's output agrees with the true label.

* The 2-norm of \beta is written \|\beta\| and the 1-norm of \beta is written \|\beta\|_1.

Note that we purposely define x_i as a horizontal (row) vector and \beta as a vertical (column) vector, so that we can conveniently write x_i\beta for the dot product \sum_{k=1}^{p} x_{ik}\beta_k, which will appear frequently in our derivations.

2.1. SVM

SVM has been extremely successful in text categorization. Multiple versions of SVM exist; in this paper we only use linear SVM for our analysis, partly for clarity and simplicity, and partly because linear SVM performed as well as other versions of SVM in text categorization evaluations (Joachims, 1998). SVM emphasizes the generalization capability of the classifier (Hastie et al., 2001), and its loss function (for class c) has the form

L_c = \sum_{i=1}^{N} (1 - y_i x_i\beta)_+ + \lambda\|\beta\|^2    (1)

in which the training-set loss on a single training example is defined to be

l_c = (1 - y_i x_i\beta)_+ = 1 - y_i x_i\beta when y_i x_i\beta \le 1, and 0 otherwise.

The first term on the right-hand side of formula 1 is the cumulative training-set loss and the second term is the complexity penalty; both are functions of vector \beta. The optimization in SVM is to find the \beta that minimizes the sum of the two terms in formula 1. In other words, the optimization in SVM is driven not only by the training-set loss, but also by the 2-norm of vector \beta, which is determined by the squared sum of the coefficients in \beta and reflects the sensitivity of the mapping function with respect to the input variables. The value of \lambda controls the trade-off between the two terms, that is, it is the weight (algorithmically determined in the training phase of SVM) of the second term relative to the first term. Formula 1 can be transformed into dual form and solved using quadratic programming.

This kind of analysis of the loss function in SVM is not new, of course. In fact, it is a part of the SVM theory, and has been presented by other researchers (Vapnik, 1995; Hastie et al., 2001). Our point here is to start with a good framework and carry out the formal analysis for the other classifiers chosen for this study in a consistent fashion; some of those classifiers have not been formally analyzed in this manner.
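
As a concrete illustration of formula 1, the following minimal sketch (ours, not part of the original paper) evaluates the regularized hinge loss on a small toy data set; the function name, the toy data and the value of lambda are all hypothetical.

```python
import numpy as np

def svm_loss(X, y, beta, lam):
    """Regularized hinge loss of formula 1: sum_i (1 - y_i x_i beta)_+ + lam * ||beta||^2."""
    margins = y * (X @ beta)                 # y_i * f(x_i, beta)
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example training-set loss
    return hinge.sum() + lam * np.dot(beta, beta)

# Toy example: four documents with three features each, labels in {-1, +1}.
X = np.array([[1.0, 0.2, 0.0],
              [0.9, 0.1, 0.1],
              [0.0, 0.3, 1.0],
              [0.1, 0.2, 0.8]])
y = np.array([1, 1, -1, -1])
beta = np.array([0.5, 0.0, -0.5])
print(svm_loss(X, y, beta, lam=0.1))
```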

2.2. Linear Least Squares Fit (LLSF)

Linear regression, also called Linear Least Squares Fit (LLSF) in the text categorization literature, has performed competitively with SVM and other high-performing classifiers (including kNN and logistic regression) in text categorization evaluations (Yang, 1999). LLSF is similar to linear SVM in the sense that both learn a linear mapping f(x, \beta) = x\beta based on the training data. Its optimization criterion in estimating \beta, however, is strictly the minimization of the training-set error in terms of the sum of squared residuals. The loss function is defined to be

L_{lsf} = \sum_{i=1}^{N} (y_i - x_i\beta)^2

Expanding the right-hand side and rearranging, we obtain the equivalent formula in the desired form of g_1(y_i x_i\beta):

L_{lsf} = \sum_{i=1}^{N} [ y_i^2 + (x_i\beta)^2 - 2 y_i x_i\beta ] = \sum_{i=1}^{N} (1 - y_i x_i\beta)^2

where the second equality uses y_i^2 = 1, so the training-set loss on a single training example is (1 - y_i x_i\beta)^2. Adding the regularizer \lambda\|\beta\|^2 to the training-set loss, we obtain the loss function of the regularized LLSF (which is also called Ridge Regression):

L_c = \sum_{i=1}^{N} (1 - y_i x_i\beta)^2 + \lambda\|\beta\|^2
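
The regularized LLSF objective has a closed-form minimizer, \beta = (X^T X + \lambda I)^{-1} X^T y. The sketch below (ours; the toy data and names are hypothetical) computes it and checks numerically that, for labels in {-1, +1}, the squared-residual form and the (1 - y_i x_i\beta)^2 form of the training-set loss coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=20) >= 0, 1, -1)   # labels in {-1, +1}
lam = 1e-2

# Regularized LLSF (ridge regression): beta minimizes sum_i (y_i - x_i beta)^2 + lam * ||beta||^2.
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# With y_i in {-1, +1}, (y_i - x_i beta)^2 equals (1 - y_i x_i beta)^2 for every example.
res_squared = np.sum((y - X @ beta) ** 2)
res_margin = np.sum((1.0 - y * (X @ beta)) ** 2)
print(np.allclose(res_squared, res_margin))   # True
```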

2.3. Rocchio-style

Rocchio-style classifiers are widely used in text categorization for their simplicity and relatively good performance (Lewis et al., 2003). They construct a prototype vector for each category using both the centroid of the positive training examples and the centroid of the negative training examples. When documents are normalized, a Rocchio-style classifier can be seen as a linear classifier with the scoring function f(x, \beta) = x\beta, where \beta is the prototype vector, defined to be

\beta = \frac{1}{N_c} \sum_{x_i \in c} x_i - \frac{b}{N_{\bar{c}}} \sum_{x_i \notin c} x_i = \frac{1}{N_c} \sum_{y_i=1} y_i x_i + \frac{b}{N_{\bar{c}}} \sum_{y_i=-1} y_i x_i    (3)

where b is a parameter which can be tuned, N_c is the number of positive training examples of category c, and N_{\bar{c}} is the number of negative training examples. Now we show that the regularized loss function of the Rocchio-style classifier is

L_c = - \sum_{y_i=1} y_i x_i\beta - \frac{b N_c}{N_{\bar{c}}} \sum_{y_i=-1} y_i x_i\beta + \frac{N_c}{2}\|\beta\|^2    (4)

In order to minimize the loss function, we take the first-order derivative of formula 4 with respect to \beta and set it to zero:

\frac{dL_c}{d\beta} = - \sum_{y_i=1} y_i x_i - \frac{b N_c}{N_{\bar{c}}} \sum_{y_i=-1} y_i x_i + N_c\beta = 0

It is easy to see that the \beta in formula 3 is just the solution. In other words, formula 4 is the loss function that the Rocchio-style classifier is minimizing. Presenting its loss function in this fashion enables us to compare the Rocchio-style approach with other classification methods on the same basis, i.e., loss-function based analysis.

Observing formula 4 is interesting. The loss function consists of three parts, instead of two as in the other classifiers analyzed so far. The first part is the training-set loss on positive examples; the second part is the training-set loss on negative examples; the third part is the complexity penalizer \|\beta\|^2. The training-set loss on a single training example depends on whether it is a positive or a negative example. That is,

l_c = -y_i x_i\beta when y_i = 1, and l_c = -\frac{b N_c}{N_{\bar{c}}} y_i x_i\beta when y_i = -1.

2.4. Multi-class Prototype Classifier

The multi-class Prototype classifier, or just "Prototype" as an abbreviation, is even simpler than Rocchio-style. It is the same as Rocchio-style except that only positive examples are used to construct the prototype of each category. That is, the method is obtained by setting the parameter b to zero in formulas 3 and 4. Accordingly, the regularized loss in the Prototype method is

L_c = - \sum_{y_i=1} y_i x_i\beta + \frac{N_c}{2}\|\beta\|^2    (5)

Including Prototype in this study gives us a good baseline.
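
A quick numerical check of the derivation above: the sketch below (ours; the toy data, function names and the value of b are hypothetical) builds the prototype of formula 3 and verifies that the gradient of formula 4 vanishes there. Setting b = 0 recovers the Prototype classifier of Section 2.4.

```python
import numpy as np

def rocchio_prototype(X, y, b):
    """Formula 3: beta = (1/N_c) sum_{pos} x_i - (b/N_cbar) sum_{neg} x_i."""
    pos, neg = X[y == 1], X[y == -1]
    return pos.mean(axis=0) - b * neg.mean(axis=0)

def rocchio_loss_gradient(X, y, beta, b):
    """Gradient of formula 4: -sum_{pos} x_i + (b N_c / N_cbar) sum_{neg} x_i + N_c beta."""
    pos, neg = X[y == 1], X[y == -1]
    n_c, n_cbar = len(pos), len(neg)
    return -pos.sum(axis=0) + (b * n_c / n_cbar) * neg.sum(axis=0) + n_c * beta

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = np.array([1] * 12 + [-1] * 18)
b = 0.5
beta = rocchio_prototype(X, y, b)
print(np.allclose(rocchio_loss_gradient(X, y, beta, b), 0.0))   # True: formula 3 minimizes formula 4
```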

2.5. kNN

kNN has been popular in text categorization, both for its simplicity and for its good performance in benchmark evaluations (Yang & Liu, 1999; Lewis et al., 2003). kNN is very similar to Prototype except that only the training examples inside the neighborhood local to each test example have a non-zero loss. The nearness of each neighbor in kNN is often measured using the cosine similarity between the test example and the training example, which is equivalent to using the dot product after both vectors are normalized. For simplicity of analysis, we restrict our discussion to the case of normalized vectors. Under this assumption, kNN has a locally linear classification function with the vector of coefficients

\beta_x = \frac{1}{N_{c,x}} \sum_{x_i \in R_k(x),\, y_i=1} x_i    (6)

where R_k(x) is the set of k training examples nearest to test example x, N_{c,x} is the number of positive examples of category c in R_k(x), and \beta_x is thus the local centroid of the positive examples of category c. The classification decision on test example x is obtained by thresholding on the dot product x\beta_x. Now we need to formally analyze exactly what kNN is optimizing. Defining a loss function of the following form

L_c = - \sum_{x_i \in R_k(x),\, y_i=1} y_i x_i\beta + \frac{N_{c,x}}{2}\|\beta\|^2    (7)

and setting the first-order derivative of the right-hand side to zero yields the coefficient vector in formula 6. That is to say, the optimization criterion in kNN is the minimization of the loss function L_c in formula 7, which has both the training-set error component (the first term) and the complexity penalization component (the second term). Accordingly, the training-set loss on a single training example is

l_c = -y_i x_i\beta when y_i = 1 and x_i \in R_k(x), and l_c = 0 otherwise.

Analyzing kNN's optimization criterion in the form of the loss function presented above has not been reported before, to our knowledge. Note that we use \beta_x instead of \beta to emphasize the local nature of the classification in kNN. The loss function depends on each test example, which strongly differentiates kNN from the other classifiers.
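
The following sketch (ours; the names, the toy data and the choice of k are hypothetical) scores a normalized test document by the dot product with the local centroid of its positive neighbors, i.e., the \beta_x of formula 6 that minimizes formula 7.

```python
import numpy as np

def knn_local_centroid_score(x, X_train, y_train, k):
    """Score a normalized test example against the local centroid of the positive
    training examples among its k nearest neighbors (formulas 6 and 7)."""
    sims = X_train @ x                     # dot product == cosine similarity for normalized vectors
    neighbors = np.argsort(-sims)[:k]      # indices of R_k(x)
    pos = neighbors[y_train[neighbors] == 1]
    if len(pos) == 0:
        return 0.0                         # no positive neighbors: zero score for this category
    beta_x = X_train[pos].mean(axis=0)     # local centroid (formula 6)
    return float(x @ beta_x)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # normalize the documents
y = np.array([1] * 20 + [-1] * 30)
print(knn_local_centroid_score(X[0], X, y, k=5))   # threshold this score to obtain the decision
```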

2.6. Logistic Regression (LR)

Logistic regression methods have also shown good performance (competitive with SVM, LLSF and kNN) in evaluations on benchmark collections (Yang, 1999; Zhang & Oles, 2001). LR estimates the conditional probability of y given x in the form

P(y|x) = \frac{1}{1 + \exp(-y x\beta)} \; \stackrel{def}{=} \; \pi(y x\beta)

and learns the regression coefficients \beta so as to maximize \prod_{i=1}^{N} P(y_i|x_i). This is equivalent to minimizing the training-set loss defined in logarithmic form:

L_c = \sum_{i=1}^{N} \log(1 + \exp(-y_i x_i\beta))

The regularized version of LR (Zhang & Oles, 2001) has the loss function

L_c = \sum_{i=1}^{N} \log(1 + \exp(-y_i x_i\beta)) + \lambda\|\beta\|^2    (8)
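
A minimal sketch of minimizing formula 8 by gradient descent (ours; the data, step size, iteration count and value of lambda are hypothetical, and the paper itself does not prescribe a particular optimizer):

```python
import numpy as np

def lr_loss(X, y, beta, lam):
    """Regularized logistic loss, formula 8: sum_i log(1 + exp(-y_i x_i beta)) + lam * ||beta||^2."""
    margins = y * (X @ beta)
    return np.sum(np.logaddexp(0.0, -margins)) + lam * np.dot(beta, beta)

def lr_gradient(X, y, beta, lam):
    margins = y * (X @ beta)
    # d/dbeta of log(1 + exp(-m_i)) is -y_i x_i / (1 + exp(m_i)); add 2*lam*beta for the regularizer.
    return -(X.T @ (y / (1.0 + np.exp(margins)))) + 2.0 * lam * beta

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 6))
y = np.where(X[:, 0] + 0.2 * rng.normal(size=40) >= 0, 1, -1)
beta, lam, step = np.zeros(6), 1e-3, 0.05
for _ in range(200):
    beta -= step * lr_gradient(X, y, beta, lam)
print(lr_loss(X, y, beta, lam))
```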

2.7. Neural Networks (NNet)

Neural networks have also shown competitive performance in text categorization evaluations (Yang & Liu, 1999; Yang, 1999). We restrict our analysis to a two-level (no hidden layers) neural network in this paper. NNets without hidden layers are very similar to LR in the sense that they estimate P(y = 1|\beta, x) in the form

P(y = 1|\beta, x) = \frac{1}{1 + \exp(-x\beta)} = \pi(x\beta)

However, the objective is to minimize L_c = \sum_i (y_i' - \pi(x_i\beta))^2, where y_i' is 1 or 0. To make this loss function consistent and comparable with those of the other classifiers, we need to write it using y_i instead of y_i', where y_i = 1 when y_i' = 1 and y_i = -1 when y_i' = 0. The training-set loss is:

L_c = \sum_{y_i=1} (1 - \pi(x_i\beta))^2 + \sum_{y_i=-1} (0 - \pi(x_i\beta))^2
    = \sum_{y_i=1} (1 - \pi(y_i x_i\beta))^2 + \sum_{y_i=-1} (\pi(-y_i x_i\beta))^2
    = \sum_{i=1}^{N} (1 - \pi(y_i x_i\beta))^2    (9)

where the last step uses the identity \pi(-z) = 1 - \pi(z). Adding a regularization term \lambda\|\beta\|^2 yields the loss function of the regularized NNet:

L_c = \sum_{i=1}^{N} (1 - \pi(y_i x_i\beta))^2 + \lambda\|\beta\|^2    (10)
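
The rewriting step behind formula 9 is easy to check numerically. The sketch below (ours, with hypothetical random data) confirms that the squared error against 0/1 targets equals the squared error written in terms of y_i x_i\beta:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 5))
beta = rng.normal(size=5)
y = np.where(rng.random(25) < 0.5, 1, -1)     # labels in {-1, +1}
y01 = (y + 1) // 2                            # the same labels coded as y' in {0, 1}

loss_01 = np.sum((y01 - sigmoid(X @ beta)) ** 2)          # NNet objective with 0/1 targets
loss_pm = np.sum((1.0 - sigmoid(y * (X @ beta))) ** 2)    # formula 9, using pi(-z) = 1 - pi(z)
print(np.allclose(loss_01, loss_pm))          # True
```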

2.8. Naive Bayes (NB)

We restrict our analysis to the most popular multinomial NB classifier (McCallum & Nigam, 1998). It estimates the posterior probability of test document D being a member of category c using the formula

P(c|D) = \frac{P(c) \prod_{k=1}^{p} P(W_k|c)^{n(W_k,D)}}{P(D)}

where P(c) is the prior probability of the category, P(D) is the probability of D occurring by chance, P(W_k|c) is the probability of word W_k conditioned on category c, and n(W_k, D) is the count of word W_k in document D. Then

\log P(c|D) = \sum_{k=1}^{p} n(W_k, D) \log P(W_k|c) + \log P(c) - \log P(D)    (11)

Rewriting formula 11 using x_k = n(W_k, D), \theta_k = P(W_k|c) and \beta_k = \log\theta_k, we have

\log P(c|D) = \sum_{k=1}^{p} x_k \log\theta_k + \log P(c) - \log P(D)
            = \sum_{k=1}^{p} x_k\beta_k + \log P(c) - \log P(D)
            = x\beta + \log P(c) - \log P(D)    (12)

Optimization in NB regards the estimation of the model parameters based on the training data (for now we consider NB without smoothing for simplicity; Laplace smoothing is treated at the end of this subsection): \hat{P}(c) = N_c/N and \hat{\theta}_k = F_{ck}/S_c, where N_c is the number of positive training examples of category c, F_{ck} is the frequency of word W_k in those positive training examples, and S_c = \sum_{k=1}^{p} F_{ck} is the total number of word occurrences in category c.

We now show how to relate the parameter estimates in NB to a loss function of the form L = g_1(y_i x_i\beta) + g_2(\beta), so that we can compare NB with the other classifiers on the same basis. Let us use vector x_i to represent the ith training document, whose elements are the within-document term frequencies of individual words, the vector sum F_c = \sum_{x_i \in c} x_i to represent the within-category term frequencies of category c, and \theta_k to denote P(W_k|c). We define the loss function in the following form:

L_c = - \sum_{k=1}^{p} F_{ck} \log\theta_k + S_c \sum_{k=1}^{p} \theta_k    (13)

To minimize this loss function, we take the first-order partial derivative with respect to \theta_k and set it to zero:

\frac{\partial L_c}{\partial\theta_k} = - \frac{F_{ck}}{\theta_k} + S_c = 0

Clearly, \theta_k = F_{ck}/S_c is just the solution. This means that the loss function in formula 13 is the optimal objective function NB is using to estimate its parameters P(W_k|c) (it is also equivalent to maximizing the likelihood \prod_{k=1}^{p} \theta_k^{F_{ck}} subject to \sum_{k=1}^{p} \theta_k = 1).

We now rewrite formula 13 as a function of F_c = (F_{c1}, ..., F_{cp}) and \beta = (\beta_1, ..., \beta_p)^T, where \beta_k = \log\theta_k. Since \theta_k is a word probability, all the elements in \theta are positive numbers, so \sum_{k=1}^{p} \theta_k = \|\theta\|_1 is the 1-norm of vector \theta. Substituting these terms in 13 yields the loss function in the form

L_c = - F_c\beta + S_c\|\theta\|_1    (14)

Furthermore, from F_c = \sum_{x_i \in c} x_i we have F_c\beta = \sum_{y_i=1} y_i x_i\beta, and from \beta = \log\theta we have \theta = e^\beta, where e^\beta = (e^{\beta_1}, ..., e^{\beta_p}). Substituting these terms in 14 yields the loss function in the form

L_c = - \sum_{y_i=1} y_i x_i\beta + S_c\|e^\beta\|_1    (15)

Now we have successfully decomposed NB's loss function into the form L = g_1(y_i x_i\beta) + g_2(\beta). Note that so far we have only discussed NB without any smoothing, although smoothing is known to be important for the effectiveness of NB. It is easy to see in the second term of formula 15 that \|e^\beta\|_1 would be overly sensitive to estimation errors in the elements \beta_k = \log P(W_k|c) if those (negative) numbers have large absolute values, that is, when the word probabilities are near zero.

We now present the loss function for NB with Laplace smoothing, which is common in NB. Here the estimate of \theta_k is \hat{\theta}_k = (1 + F_{ck})/(p + S_c). Let us use \mathbf{1} to represent the vector (1, 1, ..., 1). Note that the elements in \beta are all negative numbers because \beta = \log\theta and the \theta_k are probabilities, so -\mathbf{1}\beta = \|\beta\|_1. Then we have the loss function for NB with Laplace smoothing as follows:

L_c = - \sum_{k=1}^{p} (1 + F_{ck}) \log\theta_k + (p + S_c) \sum_{k=1}^{p} \theta_k
    = -(\mathbf{1} + F_c)\beta + (p + S_c)\|e^\beta\|_1
    = -(\mathbf{1} + \sum_{y_i=1} y_i x_i)\beta + (p + S_c)\|e^\beta\|_1
    = - \sum_{y_i=1} y_i x_i\beta + (p + S_c)\|e^\beta\|_1 + \|\beta\|_1    (16)

Comparing this to formula 15 of NB without smoothing, we can see the correction by Laplace smoothing in the third term, which prevents the coefficients in \beta from becoming too large. In other words, it prevents the classification decisions from being overly sensitive to small changes in the input.

Also, both formulas 15 and 16 show a unique property of NB classifiers, namely the influence of the term S_c in the loss functions, which causes the amount of regularization to vary across categories. To our knowledge, this is the first time that this property is made explicit in a loss-function based analysis. Whether this is a theoretical weakness or a desirable property of NB requires future research.
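
A minimal sketch of the parameter estimation and the linear scoring form of formula 12 (ours; the toy term-count matrix and the function names are hypothetical, and the practical issues of real collections are ignored):

```python
import numpy as np

def multinomial_nb_beta(X_counts, y, laplace=True):
    """beta_k = log theta_k, with theta_k = F_ck / S_c, or (1 + F_ck) / (p + S_c) under Laplace smoothing."""
    F_c = X_counts[y == 1].sum(axis=0)      # within-category term frequencies F_c
    p, S_c = X_counts.shape[1], F_c.sum()
    theta = (1.0 + F_c) / (p + S_c) if laplace else F_c / S_c
    return np.log(theta)

# Toy term-count matrix (documents x vocabulary) and labels in {-1, +1}.
X = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 4]])
y = np.array([1, 1, -1, -1])
beta = multinomial_nb_beta(X, y)
log_prior = np.log(np.mean(y == 1))
doc = np.array([1, 0, 1, 0])
# Formula 12: log P(c|D) = x.beta + log P(c) - log P(D); log P(D) is the same for all categories.
print(doc @ beta + log_prior)
```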

Table 1. The training-set loss functions and the regularizers of the eight classifiers

Classifier                   Training-set loss g_1(y_i x_i\beta)                                        Regularizer g_2(\beta)
Regularized LLSF             (1 - y_i x_i\beta)^2                                                       \lambda\|\beta\|^2
Regularized LR               \log(1 + \exp(-y_i x_i\beta))                                              \lambda\|\beta\|^2
Regularized 2-layer NNet     (1 - \pi(y_i x_i\beta))^2                                                  \lambda\|\beta\|^2
SVM                          (1 - y_i x_i\beta)_+                                                       \lambda\|\beta\|^2
Rocchio                      -y_i x_i\beta (positives); -(b N_c/N_{\bar{c}}) y_i x_i\beta (negatives)   (N_c/2)\|\beta\|^2
Prototype                    -y_i x_i\beta (positives only)                                             (N_c/2)\|\beta\|^2
kNN                          -y_i x_i\beta (positive examples in R_k(x) only)                           (N_{c,x}/2)\|\beta_x\|^2
NB without smoothing         -y_i x_i\beta (positives only)                                             S_c\|e^\beta\|_1
NB with Laplace smoothing    -y_i x_i\beta (positives only)                                             (p + S_c)\|e^\beta\|_1 + \|\beta\|_1

Figure 1. The training-set loss functions of eight classifiers
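
The original plot is not reproduced in this copy; the sketch below (ours, over an arbitrary grid) evaluates the per-example training-set losses of Table 1 as functions of s = y_i x_i\beta, which is the quantity Figure 1 displays on its X axis. The Rocchio negative-example line and the local kNN losses are omitted, as in the figure discussion of Section 2.9.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
s = np.linspace(-1.5, 1.5, 7)        # s = y_i * x_i.beta

losses = {
    "0-1 misclassification":    (s < 0).astype(float),
    "SVM (hinge)":              np.maximum(0.0, 1.0 - s),
    "LLSF (squared)":           (1.0 - s) ** 2,
    "LR (logistic)":            np.log1p(np.exp(-s)),
    "NNet (squared sigmoid)":   (1.0 - sigmoid(s)) ** 2,
    "Prototype/NB (positives)": -s,
}
for name, values in losses.items():
    print(f"{name:>26}: " + "  ".join(f"{v:6.2f}" for v in values))
```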

2.9. Comparative Analysis

The loss functions of the eight classifiers are summarized in Table 1. All the regularized classifiers, except NB, have regularizers in the form of the vector 2-norm \|\beta\|^2 multiplied by a constant or a (category-specific) weight. Among those, regularized LLSF, NNet and LR have exactly the same regularizer as SVM, so the differences among those methods lie only in their training-set loss functions. Prototype and NB, on the other hand, are exactly the same in terms of their training-set loss, but fundamentally different in their regularizer terms.

The curve of the training-set loss on individual training examples is shown for each classifier in Figure 1; the 0-1 misclassification loss is also shown for comparison. The Y axis represents the loss, and the X axis is the value of y_i x_i\beta. Examples with y_i x_i\beta \ge 0 are those correctly categorized by a classifier, assuming the classification decisions are obtained by thresholding at zero; examples with y_i x_i\beta < 0 are those misclassified. Examples with y_i x_i\beta = 1 are those perfectly scored, in the sense that the scores x_i\beta given by the classifier are in total agreement with the true scores y_i.

From Figure 1, we can see that LLSF gives the highest penalty to misclassified examples with a negative y_i x_i\beta of very large absolute value, while NNet gives those errors the lowest penalty. In other words, LLSF tries very hard to correctly classify such outliers (with relatively small scores) while NNet does not focus on those outliers. As for the correctly classified examples with a large positive value of y_i x_i\beta, LLSF is the only method which penalizes them heavily. SVM, NNet and LR tend to ignore these examples by giving them zero or near-zero penalties. On the other hand, Rocchio, NB, Prototype and kNN give these examples negative loss rather than neglecting them.

It should be noticed that we have two lines for Prototype and NB: a linear function with a non-zero slope for the positive examples, and another with a flat slope for the negative examples. This reflects the fact that only the positive training examples of each category are used to train the category-specific models in those methods. kNN is similar in this sense except that its loss functions are local, depending on the neighborhood of each input example; we omit the lines of kNN in this figure. Rocchio-style, on the other hand, uses both positive and negative examples to construct the category prototype, and should have two linear lines (with non-zero slopes) as its loss functions. For convenience, we show the specific case of Rocchio-style with parameter b = N_{\bar{c}}/N_c in this figure, i.e., the case in which the lines for positive and negative examples coincide.

3. Empirical Evaluation

We conducted two sets of experiments: one set was for the global comparison of the eight classifiers in text categorization using a benchmark collection, the Reuters-21578 corpus, ApteMod version (Yang & Liu, 1999) (http://www-2.cs.cmu.edu/~yiming), and the other set was for examining the effectiveness of regularization in individual classifiers. For the latter, we chose LLSF, LR and NNet.

Figure 2 shows the results of the eight classifiers on the test set of the Reuters corpus. Both macro- and micro-averaged F1 are reported, which have been the conventional performance measures in text categorization evaluations (Yang & Liu, 1999). All the parameters are tuned using five-fold cross-validation on the training data. Feature selection was applied to documents as a preprocessing step before the training of the classifiers; the chi-square criterion was used for feature selection in all the classifiers. For NB, we selected the top 1000 features. For Rocchio-style (implemented using the version in (Lewis et al., 2003)) we used the top 2000 features and set parameter b = -2. For Prototype we used the top 2000 features. For kNN we set k = 85 and used the top 2000 features when micro-averaged F1 was the performance measure, and the top 1000 features when macro-averaged F1 was the performance measure. For regularized LLSF, LR, NNet (2-layer) and SVM, we used all the features without selection. We used a T-test to compare the macro-averaged F1 scores of regularized LLSF and SVM and found regularized LLSF to be significantly better.

Figure 3 shows the performance curves of regularized LLSF, NNet and LR on a validation set (a held-out subset of the training data), with respect to the varying value of \lambda that controls the amount of regularization. Clearly, the performance of those classifiers depends on the choice of the value of \lambda: all the curves peak at some \lambda value larger than zero. For LLSF and NNet, in particular, having regularization (with a properly chosen \lambda) can yield significant improvements over the case of no regularization (\lambda = 0). Based on the macro-averaged F1 curves, we chose \lambda = 10^{-4} for regularized LLSF and \lambda = 10^{-7} for regularized NNet and LR for the evaluation of those methods on the test set.

Figure 4 compares our results for regularized LLSF and regularized NNet with the published results of LLSF and NNet without regularization on the same corpus (Yang & Liu, 1999). Clearly, our new results are significantly better than the previous results of those methods, and the regularization played an important role in making the difference. Note that in (Yang & Liu, 1999) the LLSF used truncated SVD to obtain the solution and the NNet had three layers. Thus, those scores are not directly comparable, but rather just indicative.
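
The paper tunes parameters by five-fold cross-validation on the training data and chooses \lambda from macro-averaged F1 curves on a validation set. The sketch below (ours) illustrates the idea in a heavily reduced form: a single held-out split, a single category, synthetic data and regularized LLSF as the classifier; the grid of \lambda values and all names are hypothetical.

```python
import numpy as np

def f1_score(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fit_regularized_llsf(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) >= 0, 1, -1)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

scores = []
for lam in [0.0, 1e-7, 1e-4, 1e-2, 1.0, 100.0]:
    beta = fit_regularized_llsf(X_train, y_train, lam)
    y_pred = np.where(X_val @ beta >= 0, 1, -1)
    scores.append((f1_score(y_val, y_pred), lam))
best_f1, best_lam = max(scores)
print("best validation F1 = %.3f at lambda = %g" % (best_f1, best_lam))
```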

" 4. Concluding Remarks ’ii In this paper, we presented a loss-function based anal- NB Pr~o~ P.oo~ SV’M KNN Rog LR P,O~NN.m Rog_LLS.~ ysis for eight classifications methodsthat are popular F/gure 2. Performanceof eight classifers on Reuters-21578 in text categorization. Our main research findings are: of the eight classifiers on the test set of the Reuterscor- * The optimization criteria in all the eight meth- 479

* Regularization made significant performance improvements in LLSF and NNet on Reuters. Regularized LLSF, in particular, performed surprisingly well although its training-set loss function is not monotonic, which has been considered a weakness of this method in some theoretical analyses (Hastie et al., 2001). Its macro-averaged F1 performance (0.6398) is the best score ever reported on the Reuters-21578 corpus, statistically significantly outperforming SVM, which was the best until this study.

* Our new derivation shows that NB has a regularizer of the form (p + S_c)\|e^\beta\|_1 + \|\beta\|_1, which is radically different from the \|\beta\|^2 regularizer in the other classification methods. Whether or not this explains the suboptimal performance of NB requires future research.

References

Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.

Joachims, T. (1998) Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer.

Lewis, D., Yang, Y., Rose, T. & Li, F. (2002) RCV1: A New Text Categorization Test Collection. To appear in Journal of Machine Learning Research.

McCallum, A. & Nigam, K. (1998) A comparison of event models for Naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization.

Vapnik, V. (1995) The Nature of Statistical Learning Theory. Springer, New York.

Yang, Y. & Chute, C.G. (1994) An example-based mapping method for text classification and retrieval. ACM Transactions on Information Systems.

Yang, Y. & Liu, X. (1999) A re-examination of text categorization methods. ACM Conference on Research and Development in Information Retrieval (SIGIR), pp 42-49.

Yang, Y. (1999) An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, Vol 1, pp 67-88.

Zhang, T. & Oles, F.J. (2001) Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31.