
Text Classification with Kernels on the Multinomial Manifold

Dell Zhang (1,2), Xi Chen (3), Wee Sun Lee (1,2)

(1) Department of Computer Science, School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543
(2) Singapore-MIT Alliance, E4-04-10, 4 Engineering Drive 3, Singapore 117576
(3) Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada

[email protected]  [email protected]  [email protected]

ABSTRACT

Support Vector Machines (SVMs) have been very successful in text classification. However, the intrinsic geometric structure of text data has been ignored by standard kernels commonly used in SVMs. It is natural to assume that the documents are on the multinomial manifold, which is the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information metric. We prove that the Negative Geodesic Distance (NGD) on the multinomial manifold is conditionally positive definite (cpd), and thus can be used as a kernel in SVMs. Experiments show the NGD kernel on the multinomial manifold to be effective for text classification, significantly outperforming standard kernels on the ambient Euclidean space.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]; H.3.3 [Information Search and Retrieval]; I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern Recognition]: Design Methodology - classifier design and evaluation.

General Terms
Algorithms, Experimentation, Theory.

Keywords
Text Classification, Machine Learning, Support Vector Machine, Kernels, Riemannian Geometry, Differential Geometry.

1. INTRODUCTION

Recent research has established the Support Vector Machine (SVM) as one of the most powerful and promising machine learning methods for text classification [8, 15, 16, 36].

"The crucial ingredient of SVMs and other kernel methods is the so-called kernel trick, which permits the computation of dot products in high-dimensional feature spaces, using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example." [32]

However, standard kernels commonly used in SVMs neglect a-priori knowledge about the intrinsic geometric structure of text data. We think it makes more sense to view document feature vectors as points on a Riemannian manifold, rather than in the much larger ambient Euclidean space. This paper studies kernels on the multinomial manifold that enable SVMs to effectively exploit the intrinsic geometric structure of text data to improve text classification accuracy.

In the rest of this paper, we first examine the multinomial manifold (§2), then propose the new kernel based on the geodesic distance (§3) and present experimental results to demonstrate its effectiveness (§4); we then review related works (§5) and finally make concluding remarks (§6).

2. THE MULTINOMIAL MANIFOLD

This section introduces the concept of the multinomial manifold and the trick to compute geodesic distances on it, followed by how documents can be naturally embedded in it.

2.1 Concept

Let S = {p(· | θ)}_{θ ∈ Θ} be an n-dimensional regular statistical model family on a set X. For each x ∈ X assume the mapping θ ↦ p(x | θ) is C^∞ at each point in the interior of Θ. Let ∂_i denote ∂/∂θ_i and ℓ_θ(x) denote log p(x | θ). The Fisher information metric [1, 19, 21] at θ ∈ Θ is defined in terms of the matrix given by


g_{ij}(\theta) = E_\theta\left[ \partial_i \ell_\theta \, \partial_j \ell_\theta \right] = \int_X p(x|\theta)\, \partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta)\, dx   (1)

or equivalently as

g_{ij}(\theta) = 4 \int_X \partial_i \sqrt{p(x|\theta)}\, \partial_j \sqrt{p(x|\theta)}\, dx.   (2)

Note that g_{ij}(θ) can be thought of as the variance of the score ∂_i ℓ_θ. In coordinates θ_i, g_{ij}(θ) defines a Riemannian metric on Θ, giving S the structure of an n-dimensional Riemannian manifold.

Intuitively the Fisher information may be seen as the amount of information a single data point supplies with respect to the problem of estimating the parameter θ. The choice of the Fisher information metric is motivated by its attractive properties in theory and good performance in practice [21, 23, 24].

Consider multinomial distributions that model mutually independent events X_1, ..., X_{n+1} with Pr[X_i] = θ_i. Obviously θ = (θ_1, ..., θ_{n+1}) should be on the n-simplex defined by \sum_{i=1}^{n+1} \theta_i = 1. The probability that X_1 occurs x_1 times, ..., X_{n+1} occurs x_{n+1} times is given by

p(x|\theta) = \frac{N!}{\prod_{i=1}^{n+1} x_i!} \prod_{i=1}^{n+1} \theta_i^{x_i}   (3)

where N = \sum_{i=1}^{n+1} x_i.

The multinomial manifold is the parameter space of the multinomial distribution

P^n = \left\{ \theta \in \mathbb{R}^{n+1} : \sum_{i=1}^{n+1} \theta_i = 1;\ \theta_i \ge 0\ \forall i \right\}   (4)

equipped with the Fisher information metric, which can be shown to be

g_\theta(u, v) = \sum_{i=1}^{n+1} \frac{u_i v_i}{\theta_i}   (5)

where θ ∈ P^n, and u, v ∈ T_θ P^n are vectors tangent to P^n at θ represented in the standard basis of \mathbb{R}^{n+1}.

2.2 Geodesic

It is a well-known fact that the multinomial manifold P^n is isometric to the positive portion of the n-sphere of radius 2 [18]

S^n_+ = \left\{ \psi \in \mathbb{R}^{n+1} : \|\psi\| = 2;\ \psi_i \ge 0\ \forall i \right\}   (6)

through the diffeomorphism F : P^n → S^n_+,

F(\theta) = \left( 2\sqrt{\theta_1}, \ldots, 2\sqrt{\theta_{n+1}} \right).   (7)

Therefore the geodesic distance between θ, θ′ ∈ P^n can be computed as the geodesic distance between F(θ), F(θ′) ∈ S^n_+, i.e., the length of the shortest curve on S^n_+ connecting F(θ) and F(θ′), which is actually a segment of a great circle. Specifically, the geodesic distance between θ, θ′ ∈ P^n is given by

d_G(\theta, \theta') = 2 \arccos\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right).   (8)

2.3 Embedding

In text retrieval, clustering and classification, a document is usually considered as a "bag of words" [2]. It is natural to assume that the "bag of words" of a document is generated by independent draws from a multinomial distribution θ over the vocabulary V = {w_1, ..., w_{n+1}}. In other words, every document is modeled by a multinomial distribution, which may change from document to document. Given the feature representation of a document, d = (d_1, ..., d_{n+1}), it can be embedded in the multinomial manifold P^n by applying L1 normalization,

\hat{\theta}(d) = \left( \frac{d_1}{\sum_{i=1}^{n+1} d_i}, \ldots, \frac{d_{n+1}}{\sum_{i=1}^{n+1} d_i} \right).   (9)

The simple TF representation of a document D sets d_i = tf(w_i, D), the term frequency (TF) of word w_i in document D, i.e., how many times w_i appears in D. The embedding that corresponds to the TF representation is theoretically justified as the maximum likelihood estimator for the multinomial distribution [15, 21, 24].

The popular TF×IDF representation [2] of a document D sets d_i = tf(w_i, D) · idf(w_i), where the TF component tf(w_i, D) is weighted by idf(w_i), the inverse document frequency (IDF) of word w_i in the corpus. If there are m documents in the corpus and word w_i appears in df(w_i) documents, then idf_i = log( m / df(w_i) ). The embedding that corresponds to the TF×IDF representation can be interpreted as a pullback metric of the Fisher information through the transformation

G_\lambda(\theta) = \left( \frac{\theta_1 \lambda_1}{\sum_{i=1}^{n+1} \theta_i \lambda_i}, \ldots, \frac{\theta_{n+1} \lambda_{n+1}}{\sum_{i=1}^{n+1} \theta_i \lambda_i} \right)   (10)

with \lambda_i = idf_i / \sum_{j=1}^{n+1} idf_j.

How to find the optimal embedding is an interesting research problem to explore. It is possible to learn a Riemannian metric that works even better than the TF×IDF weighting [23].

Under the above embeddings, the kernel (to be discussed later) between two documents d and d′ means k(θ̂(d), θ̂(d′)).

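To make the embedding (9) and the geodesic distance (8) concrete, here is a minimal NumPy sketch (our own illustration; the function names are not from the paper):

```python
import numpy as np

def embed(d):
    """Embed a nonnegative feature vector d (TF or TF*IDF counts) in the
    multinomial simplex P^n via L1 normalization, as in equation (9)."""
    d = np.asarray(d, dtype=float)
    return d / d.sum()

def geodesic_distance(theta, theta_prime):
    """Geodesic distance on the multinomial manifold, equation (8):
    d_G(theta, theta') = 2 * arccos( sum_i sqrt(theta_i * theta'_i) )."""
    s = np.sum(np.sqrt(theta * theta_prime))
    # Clip to guard against floating-point overshoot slightly beyond 1.
    return 2.0 * np.arccos(np.clip(s, -1.0, 1.0))

# Toy "bag of words" term-frequency vectors over a 4-word vocabulary.
d1, d2 = np.array([3, 0, 1, 2]), np.array([1, 1, 1, 1])
print(geodesic_distance(embed(d1), embed(d2)))  # a value in [0, pi]
```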

3. DISTANCE BASED KERNELS

Good kernels should be consistent with one's intuition of pairwise similarity/dissimilarity in the domain. The motivation of this paper is to exploit the intrinsic geometric structure of text data to design a kernel that can better capture document similarity/dissimilarity. Standard text retrieval, clustering and classification usually rely on the similarity measure defined by the dot product (inner product) of two document vectors in a Euclidean space [2]. The geometric interpretation of the dot product is that it computes the cosine of the angle between two vectors provided they are normalized to unit length. When turning to Riemannian geometry, this similarity measure is no longer available on general manifolds, because the concept of dot product is only defined locally on the tangent space but not globally on the manifold itself. However, there exists a natural dissimilarity measure on general manifolds: geodesic distance.

This section first describes the concept of kernels (§3.1), then discusses the Negative Euclidean Distance kernel (§3.2) and the Negative Geodesic Distance kernel (§3.3) in detail.

3.1 PD and CPD Kernels

Definition 1 (Kernels [32]). Let X be a nonempty set. A real-valued symmetric function k : X × X → \mathbb{R} is called a positive definite (pd) kernel if for all m ∈ \mathbb{N} and all x_1, ..., x_m ∈ X the induced m × m Gram (kernel) matrix K with elements K_{ij} = k(x_i, x_j) satisfies c^T K c ≥ 0 for any vector c ∈ \mathbb{R}^m. The function k is called a conditionally positive definite (cpd) kernel if K satisfies the above condition for any vector c ∈ \mathbb{R}^m with c^T 1 = 0.

As a direct consequence of Definition 1, we have

Lemma 1 ([32]).
(i) Any pd kernel is also a cpd kernel.
(ii) Any constant c ≥ 0 is a pd kernel; any constant c ∈ \mathbb{R} is a cpd kernel.
(iii) If k_1 and k_2 are pd (resp. cpd) kernels and α_1, α_2 ≥ 0, then α_1 k_1 + α_2 k_2 is a pd (resp. cpd) kernel.
(iv) If k_1 and k_2 are pd kernels, then the product k_1 k_2 defined by k_1 k_2 (x, x') = k_1(x, x') k_2(x, x') is a pd kernel.

Lemma 2 (Connection of PD and CPD Kernels [4, 29, 32]). Let k be a real-valued symmetric function defined on X × X. Then we have

(i) \tilde{k}(x, x') := \frac{1}{2}\left( k(x, x') - k(x, x_0) - k(x_0, x') + k(x_0, x_0) \right) is pd if and only if k is cpd;

(ii) \exp(t k) is pd for all t > 0 if and only if k is cpd.

Theorem 1 (Hilbert Space Representation of PD Kernels [32]). Let k be a real-valued pd kernel on X. Then there exists a Hilbert space H of real-valued functions on X and a mapping Φ : X → H such that

\langle \Phi(x), \Phi(x') \rangle = k(x, x').   (11)

Theorem 2 (Hilbert Space Representation of CPD Kernels [32]). Let k be a real-valued cpd kernel on X. Then there exists a Hilbert space H of real-valued functions on X and a mapping Φ : X → H such that

\| \Phi(x) - \Phi(x') \|^2 = -k(x, x') + \frac{1}{2}\left( k(x, x) + k(x', x') \right).   (12)

The former theorem implies that pd kernels are justified to be used in all kernel methods. The latter theorem implies that cpd kernels are justified to be used in kernel methods that are translation invariant, i.e., distance based, in the feature space. Since SVMs are translation invariant in the feature space, they are able to employ not only pd kernels but also cpd kernels [31, 32].

Standard (commonly used) kernels [32] include:

Linear:      k_{LIN}(x, x') = \langle x, x' \rangle,   (13)

Polynomial:  k_{POL}(x, x') = \left( \langle x, x' \rangle + c \right)^d with d ∈ \mathbb{N}, c ≥ 0,   (14)

Gaussian:    k_{RBF}(x, x') = \exp\left( - \frac{\| x - x' \|^2}{2\sigma^2} \right) with σ > 0, and   (15)

Sigmoid:     k_{SIG}(x, x') = \tanh\left( \kappa \langle x, x' \rangle + \vartheta \right) with κ > 0, ϑ < 0.   (16)

The former three are pd but the last one is not.

3.2 The Negative Euclidean Distance Kernel

Lemma 3 ([31, 32]). The negative squared Euclidean distance -d_E(x, x')^2 = -\| x - x' \|^2 is a cpd kernel.

Lemma 4 (Fractional Powers and Logs of CPD Kernels [4, 32]). If k : X × X → (-∞, 0] is cpd, then so are -(-k)^\beta for 0 < β < 1, and -\ln(1 - k).

Proposition 1. The Negative Euclidean Distance (NED) function

k_{NED}(x, x') = -d_E(x, x')   (17)

is a cpd kernel.

Proof. It follows directly from Lemma 3 and Lemma 4 with β = 1/2.
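The cpd condition in Definition 1 can also be checked numerically on a finite sample: a Gram matrix K is cpd if and only if it is positive semidefinite on the subspace {c : c^T 1 = 0}. The following sketch (our own illustration, not part of the paper) applies this test to the NED kernel (17):

```python
import numpy as np

def ned_gram(X):
    """Gram matrix of the Negative Euclidean Distance kernel (17),
    k_NED(x, x') = -||x - x'||, for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return -np.linalg.norm(diff, axis=-1)

def looks_cpd(K, tol=1e-9):
    """Empirical cpd test: c^T K c >= 0 must hold whenever sum(c) = 0.
    Project K onto the subspace orthogonal to the all-ones vector and
    check that the projected matrix has no significantly negative eigenvalue."""
    m = K.shape[0]
    P = np.eye(m) - np.ones((m, m)) / m   # projector onto {c : c^T 1 = 0}
    return np.linalg.eigvalsh(P @ K @ P).min() >= -tol

rng = np.random.default_rng(0)
X = rng.random((50, 10))
print(looks_cpd(ned_gram(X)))   # expected: True, consistent with Proposition 1
```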


3.3 The Negative Geodesic Distance Kernel

Theorem 3 (Dot Product Kernels in Finite Dimensions [30, 32]). A function k(x, x') = f(\langle x, x' \rangle) defined on the unit sphere in a finite n-dimensional Hilbert space is a pd kernel if and only if its Legendre polynomial expansion has only nonnegative coefficients, i.e.,

f(t) = \sum_{r=0}^{\infty} b_r P_r^n(t) with b_r ≥ 0.   (18)

Theorem 4 (Dot Product Kernels in Infinite Dimensions [30, 32]). A function k(x, x') = f(\langle x, x' \rangle) defined on the unit sphere in an infinite dimensional Hilbert space is a pd kernel if and only if its Taylor series expansion has only nonnegative coefficients, i.e.,

f(t) = \sum_{r=0}^{\infty} a_r t^r with a_r ≥ 0.   (19)

Since (19) is more stringent than (18), in order to prove positive definiteness for arbitrary dimensional dot product kernels it suffices to show that condition (19) holds [32].

Proposition 2. The Negative Geodesic Distance (NGD) function on the multinomial manifold

k_{NGD}(\theta, \theta') = -d_G(\theta, \theta')   (20)

for θ, θ′ ∈ P^n is a cpd kernel. Moreover, the shifted NGD kernel k_{NGD}(θ, θ′) + π is a pd kernel.

Proof. Plugging the formula (8) into (20), we have

k_{NGD}(\theta, \theta') = -2 \arccos\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right).   (21)

Denote (\sqrt{\theta_1}, \ldots, \sqrt{\theta_{n+1}}) and (\sqrt{\theta'_1}, \ldots, \sqrt{\theta'_{n+1}}) by \tilde{\theta} and \tilde{\theta}' respectively. It is obvious that \tilde{\theta} and \tilde{\theta}' are both on the unit sphere. Let

f(t) = \pi - 2 \arccos(t).   (22)

Then we can rewrite k_{NGD}(θ, θ′) as

k_{NGD}(\theta, \theta') = f\left( \langle \tilde{\theta}, \tilde{\theta}' \rangle \right) - \pi.   (23)

The Maclaurin series (the Taylor series expansion of a function about 0) for the inverse cosine function with -1 ≤ t ≤ 1 is

\arccos(t) = \frac{\pi}{2} - \sum_{r=0}^{\infty} \frac{\Gamma\left( r + \frac{1}{2} \right)}{\sqrt{\pi}\, (2r+1)\, r!}\, t^{2r+1},   (24)

where Γ(x) is the gamma function. Hence

f(t) = \sum_{r=0}^{\infty} c_r t^{2r+1},   (25)

where

c_r = \frac{2\, \Gamma\left( r + \frac{1}{2} \right)}{\sqrt{\pi}\, (2r+1)\, r!}.   (26)

Since Γ(x) > 0 for all x > 0, we have c_r > 0 for all r = 0, 1, 2, .... By Theorems 3 and 4, the dot product kernel f(\langle \tilde{\theta}, \tilde{\theta}' \rangle) is pd. Thus the NGD kernel k_{NGD}(θ, θ′) is cpd according to Lemma 1. The shifted NGD kernel k_{NGD}(θ, θ′) + π equals f(\langle \tilde{\theta}, \tilde{\theta}' \rangle), which has been proved to be pd.

Definition 2 (Support Vector Machine, Dual Form [32]). Given a set of m labeled examples (x_1, y_1), ..., (x_m, y_m) and a kernel k, the decision function of the Support Vector Machine (SVM) is

f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, k(x, x_i) + b^* \right),   (27)

where (\alpha_1^*, \ldots, \alpha_m^*) = \alpha^* solves the following quadratic optimization problem:

maximize_{\alpha \in \mathbb{R}^m}  W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)   (28)

subject to  0 ≤ \alpha_i ≤ C for all i = 1, ..., m   (29)

and  \sum_{i=1}^{m} \alpha_i y_i = 0   (30)

for some C > 0, and b^* is obtained by averaging y_j - \sum_{i=1}^{m} \alpha_i^* y_i\, k(x_j, x_i) over all training examples with 0 < \alpha_j^* < C.

Proposition 3. Let k be a valid kernel, and \bar{k} = k + β where β ∈ \mathbb{R} is a constant. Then k and \bar{k} lead to the identical SVM, given the same training data.

Proof. Denote the SVMs with kernel k and \bar{k} learned from the training data (x_1, y_1), ..., (x_m, y_m) by SVM(k) and SVM(\bar{k}) respectively. The objective function (28) of the quadratic optimization problem for training SVM(\bar{k}) is

\bar{W}(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, \bar{k}(x_i, x_j)   (31)

= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( k(x_i, x_j) + \beta \right)   (32)

= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) - \frac{\beta}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j   (33)

= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) - \frac{\beta}{2} \left( \sum_{i=1}^{m} \alpha_i y_i \right)^2.   (34)

In addition, for all training examples with 0 < \alpha_j^* < C,

y_j - \sum_{i=1}^{m} \alpha_i^* y_i\, \bar{k}(x_j, x_i) = y_j - \sum_{i=1}^{m} \alpha_i^* y_i \left( k(x_j, x_i) + \beta \right)   (35)

= y_j - \sum_{i=1}^{m} \alpha_i^* y_i\, k(x_j, x_i) - \beta \left( \sum_{i=1}^{m} \alpha_i^* y_i \right).   (36)

Furthermore, the decision function of SVM(\bar{k}) is

\bar{f}(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, \bar{k}(x, x_i) + b^* \right)   (37)

= \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, k(x, x_i) + \beta \left( \sum_{i=1}^{m} \alpha_i^* y_i \right) + b^* \right).   (38)


Making use of the equality constraint (30), all terms containing β vanish in (34), (36) and (38). Therefore SVM(k) and SVM(\bar{k}) arrive at equivalent α^* and b^*, and their decision functions for classification are identical.

Proposition 2 and Proposition 3 reveal that although the NGD kernel k_{NGD}(θ, θ′) is cpd (not pd), it leads to the identical SVM as the pd kernel k_{NGD}(θ, θ′) + π. This deepens our understanding of the NGD kernel.

Using the NGD kernel as the starting point, a family of cpd and pd kernels can be constructed based on the geodesic distance on the multinomial manifold using Lemma 1, Lemma 2, Lemma 4, etc. [32]. In particular, we have the following pd kernel, which will be discussed later in the section on related works.

Proposition 4. The kernel

k_{NGD\text{-}E}(\theta, \theta') = (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{1}{t} \arccos\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right) \right)   (39)

with t > 0 for θ, θ′ ∈ P^n is pd.

Proof. It is not hard to see that k_{NGD-E}(θ, θ′) equals (4\pi t)^{-\frac{n}{2}} \exp\left( \frac{1}{2t}\, k_{NGD}(\theta, \theta') \right), whose positive definiteness is a trivial consequence of Proposition 2, Lemma 1 and Lemma 2.
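As an illustration (a sketch of ours, not code from the paper), the NGD kernel (21) and the shifted pd variant of Proposition 2 can be evaluated on a batch of embedded documents as a Gram matrix:

```python
import numpy as np

def ngd_gram(Theta, Theta_prime=None, shift=False):
    """Gram matrix of the Negative Geodesic Distance kernel (21),
    k_NGD(theta, theta') = -2 * arccos( sum_i sqrt(theta_i * theta'_i) ).
    Rows of Theta and Theta_prime are points on the multinomial simplex
    (L1-normalized TF or TF*IDF vectors).  With shift=True the constant
    pi is added, giving the pd kernel of Proposition 2."""
    if Theta_prime is None:
        Theta_prime = Theta
    S = np.sqrt(Theta) @ np.sqrt(Theta_prime).T   # Bhattacharyya affinities
    K = -2.0 * np.arccos(np.clip(S, -1.0, 1.0))
    return K + np.pi if shift else K
```

By Proposition 3, passing either the shifted or the unshifted Gram matrix to an SVM yields the same classifier, since the constant π cancels under the equality constraint (30).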

4. EXPERIMENTS

We have conducted experiments on two real-world datasets, WebKB (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) and 20NG (http://people.csail.mit.edu/people/jrennie/20Newsgroups), to evaluate the effectiveness of the proposed NGD kernel for text classification using SVMs.

The WebKB dataset contains manually classified Web pages that were collected from the computer science departments of four universities (Cornell, Texas, Washington and Wisconsin) and some other universities (misc.). Only pages in the top 4 largest classes were used, namely student, faculty, course and project. All pages were pre-processed using the Rainbow toolkit [25] with the options "--skip-header --skip-html --lex-pipe-command=tag-digits --no-stoplist --prune-vocab-by-occur-count=2". The task is essentially a four-fold cross-validation in a leave-one-university-out manner: training on three of the universities plus the misc. collection, and testing on the pages from the fourth, held-out university. This way of splitting training and test data is recommended because each university's pages have their own idiosyncrasies. The performance measure is the multi-class classification accuracy averaged over these four splits.

The 20NG (20newsgroups) dataset is a collection of approximately 20,000 documents that were collected from 20 different newsgroups [22]. Each newsgroup corresponds to a different topic and is considered as a class. All documents were pre-processed using the Rainbow toolkit [25] with the option "--prune-vocab-by-doc-count=2". The "bydate" version of this dataset along with its train/test split is used in our experiments due to the following considerations: duplicates and newsgroup-identifying headers have been removed; there is no randomness in training and testing set selection, so cross-experiment comparison is easier; and separating the training and testing sets in time is more realistic. The performance measure is the multi-class classification accuracy.

LIBSVM [5] was employed as the implementation of SVM, with all the parameters set to their default values. (We further tried a series of values {…, 0.001, 0.01, 0.1, 1, 10, 100, 1000, …} for the parameter C and found the superiority of the NGD kernel unaffected.) LIBSVM uses the "one-vs-one" ensemble method for multi-class classification because of its effectiveness and efficiency [12].

We have tried standard kernels, the NED kernel and the NGD kernel. The linear (LIN) kernel worked better than or as well as other standard kernels (such as the Gaussian RBF kernel) in our experiments, which is consistent with previous observations that the linear kernel can usually achieve the best performance for text classification [8, 15-17, 36]. Therefore the experimental results of standard kernels other than the linear kernel are not reported here.

The text data represented as TF or TF×IDF vectors can be embedded in the multinomial manifold by applying L1 normalization (9), as described in §2.3. Since kernels that assume Euclidean geometry (including the LIN and NED kernels) often perform better with L2 normalization, we report such experimental results as well. The NGD kernel essentially relies on the multinomial manifold, so we stick to L1 normalization when using it.

The experimental results obtained using SVMs with the LIN, NED and NGD kernels are shown in Tables 1 and 2. (Our experimental results on the WebKB and 20NG datasets should not be directly compared with most published results because of differences in experimental settings and performance measures.)

Table 1. Experimental results on the WebKB dataset.

representation  normalization  kernel  accuracy
TF              L1             LIN     72.57%
TF              L1             NED     84.39%
TF              L1             NGD     91.88%
TF              L2             LIN     89.52%
TF              L2             NED     90.08%
TF×IDF          L1             LIN     55.84%
TF×IDF          L1             NED     81.04%
TF×IDF          L1             NGD     91.42%
TF×IDF          L2             LIN     88.23%
TF×IDF          L2             NED     87.96%
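To indicate how the NGD kernel plugs into an SVM in practice, the following sketch uses scikit-learn's SVC with a precomputed Gram matrix and scikit-learn's bundled copy of the 20NG "bydate" split. This is our own stand-in for the paper's setup (the paper used LIBSVM and Rainbow preprocessing), restricted to four newsgroups and a capped vocabulary just to keep the dense matrices small, so it will not reproduce the numbers in Table 2:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def ngd_gram(Theta, Theta_prime=None):
    """NGD Gram matrix, same as the sketch in Section 3.3."""
    if Theta_prime is None:
        Theta_prime = Theta
    S = np.sqrt(Theta) @ np.sqrt(Theta_prime).T
    return -2.0 * np.arccos(np.clip(S, -1.0, 1.0))

# A small subset of the 20NG "bydate" split, TF representation.
cats = ["comp.graphics", "rec.autos", "sci.space", "talk.politics.guns"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=cats,
                          remove=("headers", "footers", "quotes"))
vec = CountVectorizer(min_df=2, max_features=5000)  # cap vocabulary for a small sketch
Xtr = vec.fit_transform(train.data).toarray().astype(float)
Xte = vec.transform(test.data).toarray().astype(float)

# Embed in the multinomial simplex by L1 normalization, equation (9).
Xtr /= Xtr.sum(axis=1, keepdims=True) + 1e-12
Xte /= Xte.sum(axis=1, keepdims=True) + 1e-12

clf = SVC(kernel="precomputed", C=1.0)   # one-vs-one multi-class, as in LIBSVM
clf.fit(ngd_gram(Xtr), train.target)
pred = clf.predict(ngd_gram(Xte, Xtr))   # test-vs-train Gram matrix
print("accuracy:", np.mean(pred == test.target))
```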


Table 2. Experimental results on the 20NG dataset.

representation  normalization  kernel  accuracy
TF              L1             LIN     24.68%
TF              L1             NED     69.16%
TF              L1             NGD     81.94%
TF              L2             LIN     79.76%
TF              L2             NED     79.76%
TF×IDF          L1             LIN     21.47%
TF×IDF          L1             NED     72.20%
TF×IDF          L1             NGD     84.61%
TF×IDF          L2             LIN     82.33%
TF×IDF          L2             NED     82.06%

The NED kernel worked comparably to the LIN kernel under L2 normalization, and superiorly under L1 normalization. This observation has not been reported before.

The NGD kernel consistently outperformed its Euclidean counterpart, the NED kernel, as well as the representative of standard kernels, the LIN kernel, throughout the experiments. All improvements made by the NGD kernel over the LIN or NED kernel are statistically significant according to the (micro) sign test [36] at the 0.005 level (P-value < 0.005), as shown in Table 3. (Using McNemar's test also indicates that the improvements brought by the NGD kernel are statistically significant.)

Table 3. Statistical significance tests of the differences between the NGD kernel (under L1 normalization) and the LIN/NED kernel (under L2 normalization).

comparison   dataset      representation  Z
NGD vs. LIN  WebKB-top4   TF              3.4744
NGD vs. LIN  WebKB-top4   TF×IDF          4.2500
NGD vs. LIN  20NG-bydate  TF              7.5234
NGD vs. LIN  20NG-bydate  TF×IDF          7.4853
NGD vs. NED  WebKB-top4   TF              2.6726
NGD vs. NED  WebKB-top4   TF×IDF          4.4313
NGD vs. NED  20NG-bydate  TF              7.6028
NGD vs. NED  20NG-bydate  TF×IDF          8.4718

We think the success of the NGD kernel for text classification is attributable to its ability to exploit the intrinsic geometric structure of text data.

5. RELATED WORKS

It is very attractive to design kernels that can combine the merits of generative modeling and discriminative learning. This paper lies along this line of research.

An influential work on this topic is the Fisher kernel proposed by Jaakkola and Haussler [13]. For any suitably regular probability model p(x | θ) with parameters θ, the Fisher kernel is based on the Fisher score U_x = \nabla_\theta \log p(x \,|\, \theta) at a single point in the parameter space:

k_F(x, x') = U_x^T I^{-1} U_{x'}   (40)

where I = E_x\left[ U_x U_x^T \right]. In contrast, the NGD kernel is based on the full geometry of statistical models.

Another typical kernel of this type is the probability product kernel proposed by Jebara et al. [14]. Let p(· | θ) be probability distributions on a space X. Given a positive constant ρ > 0, the probability product kernel is defined as

k_{PP}(\theta, \theta') = \int_X p(x|\theta)^\rho\, p(x|\theta')^\rho\, dx = \left\langle p(\cdot|\theta)^\rho,\ p(\cdot|\theta')^\rho \right\rangle_{L_2}   (41)

assuming that p(·|θ)^ρ, p(·|θ′)^ρ ∈ L_2(X), i.e., \int_X p(x|\theta)^{2\rho} dx and \int_X p(x|\theta')^{2\rho} dx are well-defined (not infinite). For any Θ such that p(·|θ)^ρ ∈ L_2(X) for all θ ∈ Θ, the probability product kernel defined by (41) is pd. In the special case ρ = 1/2, the probability product kernel is called the Bhattacharyya kernel, because in the statistics literature this quantity is known as Bhattacharyya's affinity between probability distributions. When applied to the multinomial manifold, the Bhattacharyya kernel of θ, θ′ ∈ P^n turns out to be

k_B(\theta, \theta') = \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i}   (42)

which is closely related to the NGD kernel given by (21) through

k_B = \cos\left( -\tfrac{1}{2} k_{NGD} \right),   (43)

though the two kernels are proposed from different perspectives.

The idea of treating text data as points on the multinomial manifold in order to construct new classification methods has been investigated by Lebanon and Lafferty [21, 24]. In particular, they have proposed the information diffusion kernel, based on the heat equation on the Riemannian manifold defined by the Fisher information metric [21]. On the multinomial manifold, the information diffusion kernel can be approximated by

k_{ID}(\theta, \theta') \approx (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{1}{t} \arccos^2\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right) \right).   (44)

It is identical to the pd kernel k_{NGD-E}(θ, θ′) constructed from the NGD kernel in (39), except that the inverse cosine component is squared. Since -\arccos^2\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right) = -\tfrac{1}{4} d_G^2(\theta, \theta') does not appear to be cpd, the kernel k_{ID}(θ, θ′) is probably not pd, according to Lemma 2. Whether it is cpd still remains unclear. While the information diffusion kernel generalizes the Gaussian kernel of Euclidean space, the NGD kernel generalizes the NED kernel and provides more insight on this issue.
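The relation (43) between the Bhattacharyya kernel and the NGD kernel, and the squared arccos in (44), are easy to verify numerically; the following sketch (ours, with assumed function names) does so for a pair of toy distributions:

```python
import numpy as np

def bhattacharyya(theta, theta_p):
    """Bhattacharyya kernel (42): sum_i sqrt(theta_i * theta'_i)."""
    return float(np.sum(np.sqrt(theta * theta_p)))

def ngd(theta, theta_p):
    """NGD kernel (21): -2 * arccos of the Bhattacharyya affinity."""
    return -2.0 * np.arccos(np.clip(bhattacharyya(theta, theta_p), -1.0, 1.0))

def info_diffusion(theta, theta_p, t=1.0):
    """Approximate information diffusion kernel (44); note the squared arccos."""
    n = len(theta) - 1
    a = np.arccos(np.clip(bhattacharyya(theta, theta_p), -1.0, 1.0))
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-(a ** 2) / t)

theta = np.array([0.5, 0.3, 0.2])
theta_p = np.array([0.2, 0.2, 0.6])
# Relation (43): k_B = cos(-k_NGD / 2).
print(np.isclose(bhattacharyya(theta, theta_p), np.cos(-ngd(theta, theta_p) / 2.0)))
print(info_diffusion(theta, theta_p))
```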


context of text classification using TF or TF×IDF feature 8. REFERENCES representation. It is not hard to see that the above three kernels [1] Amari, S., Nagaoka, H. and Amari, S.-I. Methods of all invoke the -root squashing function on the term Information Geometry. American Mathematical Society, frequencies, thus provides an explanation for the long-standing 2001. mysterious wisdom that preprocessing term frequencies by taking squared roots often improves performance of text clustering or [2] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information classification [14]. Retrieval. Addison-Wesley, 1999. Although the NGD kernel is not restricted to the multinomial [3] Bahlmann, C., Haasdonk, B. and Burkhardt, H. On-Line manifold, it may be hard to have a closed-form formula to Handwriting Recognition with Support Vector Machines -- compute geodesic distances on manifolds with complex structure. A Kernel Approach. in Proceedings of the 8th International One possibility to overcome this obstacle is to use manifold Workshop on Frontiers in Handwriting Recognition learning techniques [28, 33, 35]. For example, given a set of data (IWFHR), 2002, 49-54. points that reside on a manifold, the Isomap algorithm of [4] Berg, C., Christensen, J.P.R. and Ressel, P. Harmonic Tenenbaum et al. [35] estimates the geodesic distance between a Analysis on Semigroups: Theory of Positive Definite and pair of points by the length of the shortest path connecting them Related Functions. Springer-Verlag, 1984. with respect to some graph (e.g., the k-nearest-neighbor graph) [5] Chang, C.-C. and Lin, C.-J. LIBSVM: a Library for Support constructed on the data points. In case of the Fisher information Vector Machines. 2001. metric, the distance between nearby points (distributions) of X http://www.csie.ntu.edu.tw/~cjlin/libsvm. can be approximated in terms of the Kullback-Leibler via the following relation. When θθ′ =+δ θ with δθ a [6] Chapelle, O., Haffner, P. and Vapnik, V.N. SVMs for perturbation, the Kullback-Leibler divergence is proportional to Histogram Based Image Classification. IEEE Transactions the density’s Fisher information [7, 20] on Neural Networks, 10 (5). 1055-1064.

δ θ→0 [7] Dabak, A.G. and Johnson, D.H. Relations between 1 2 Dp()(|x θ′ )||(|) px θθθ= F ()()δ . (45) Kullback-Leibler distance and Fisher information 2 Manscript, 2002. Another relevant line of research is to incorporate problem- [8] Dumais, S., Platt, J., Heckerman, D. and Sahami, M. specific distance measures into SVMs. One may simply represent Inductive Learning Algorithms and Representations for Text each example as a vector of its problem-specific distances to all Categorization. in Proceedings of the 7th ACM examples, or embed the problem-specific distances in a International Conference on Information and Knowledge (regularized) , and then employ standard SVM Management (CIKM), Bethesda, MD, 1998, 148-155. algorithm [9, 27]. This approach has a disadvantage of losing [9] Graepel, T., Herbrich, R., Bollmann-Sdorra, P. and sparsity, consequently they are not suitable for large-scale dataset. Obermayer, K. Classification on Pairwise Proximity Data. Another kind of approach directly uses kernels constructed based on the problem-specific distance, such as the Gaussian RBF in Advances in Neural Information Processing Systems kernel with the problem-specific distance measure plugged in [3, (NIPS), Denver, CO, 1998, 438-444. 6, 10, 11, 26]. Our proposed NGD kernel on the multinomial [10] Haasdonk, B. and Bahlmann, C. Learning with Distance manifold encodes a-priori knowledge about the intrinsic Substitution Kernels. in Proceedings of the 26th DAGM geometric structure of text data. It has been shown to be Symposium, Tubingen, Germany, 2004, 220-227. theoretically justified (cpd) and practically effective. [11] Haasdonk, B. and Keysers, D. Tangent Distance Kernels for Support Vector Machines. in Proceedings of the 16th 6. CONCLUSIONS International Conference on Pattern Recognition (ICPR), The main contribution of this paper is to prove that the Negative Quebec, Canada, 2002, 864-868. Geodesic Distance (NGD) on the multinomial manifold is a [12] Hsu, C.-W. and Lin, C.-J. A Comparison of Methods for conditionally positive definite (cpd) kernel, and it leads to Multi-class Support Vector Machines. IEEE Transactions accuracy improvements over kernels assuming Euclidean on Neural Networks, 13 (2). 415-425. geometry for text classification. [13] Jaakkola, T. and Haussler, D. Exploiting Generative Models Future works are to extend the NGD kernel to other manifolds in Discriminative Classifiers. in Advances in Neural (particularly for multimedia tasks), and apply it in other kernel Information Processing Systems (NIPS), Denver, CO, 1998, methods for pattern analysis (kernel PCA, kernel Spectral 487-493. Clustering, etc.) [34]. [14] Jebara, T., Kondor, R. and Howard, A. Probability Product Kernels. Journal of Machine Learning Research, 5. 819- 7. ACKNOWLEDGMENTS 844. We thank the anonymous reviewers for their helpful comments. [15] Joachims, T. Learning to Classify Text using Support Vector Machines. Kluwer, 2002. [16] Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. in


[17] Joachims, T., Cristianini, N. and Shawe-Taylor, J. Composite Kernels for Hypertext Categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML), Williamstown, MA, 2001, 250-257.
[18] Kass, R.E. The Geometry of Asymptotic Inference. Statistical Science, 4 (3). 188-234.
[19] Kass, R.E. and Vos, P.W. Geometrical Foundations of Asymptotic Inference. Wiley, 1997.
[20] Kullback, S. Information Theory and Statistics. Wiley, 1959.
[21] Lafferty, J.D. and Lebanon, G. Information Diffusion Kernels. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2002, 375-382.
[22] Lang, K. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML), Tahoe City, CA, 1995, 331-339.
[23] Lebanon, G. Learning Riemannian Metrics. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico, 2003, 362-369.
[24] Lebanon, G. and Lafferty, J.D. Hyperplane Margin Classifiers on the Multinomial Manifold. In Proceedings of the 21st International Conference on Machine Learning (ICML), Alberta, Canada, 2004.
[25] McCallum, A.K. Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. 1996. http://www.cs.cmu.edu/~mccallum/bow.
[26] Moreno, P.J., Ho, P. and Vasconcelos, N. A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications. In Advances in Neural Information Processing Systems (NIPS), Vancouver and Whistler, Canada, 2003.
[27] Pekalska, E., Paclik, P. and Duin, R.P.W. A Generalized Kernel Approach to Dissimilarity-based Classification. Journal of Machine Learning Research, 2. 175-211.
[28] Roweis, S.T. and Saul, L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290 (5500). 2323-2326.
[29] Schoenberg, I.J. Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, 44. 522-536.
[30] Schoenberg, I.J. Positive Definite Functions on Spheres. Duke Mathematical Journal, 9 (1). 96-108.
[31] Scholkopf, B. The Kernel Trick for Distances. In Advances in Neural Information Processing Systems (NIPS), Denver, CO, 2000, 301-307.
[32] Scholkopf, B. and Smola, A.J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[33] Seung, H.S. and Lee, D.D. The Manifold Ways of Perception. Science, 290 (5500). 2268-2269.
[34] Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[35] Tenenbaum, J.B., Silva, V.d. and Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290 (5500). 2319-2323.
[36] Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. In Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR), Berkeley, CA, 1999, 42-49.
