
Text Classification with Kernels on the Multinomial Manifold

Dell Zhang (1,2), Xi Chen (3), Wee Sun Lee (1,2)

(1) Department of Computer Science, School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543
(2) Singapore-MIT Alliance, E4-04-10, 4 Engineering Drive 3, Singapore 117576
(3) Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada

[email protected]  [email protected]  [email protected]

ABSTRACT

Support Vector Machines (SVMs) have been very successful in text classification. However, the intrinsic geometric structure of text data has been ignored by standard kernels commonly used in SVMs. It is natural to assume that the documents are on the multinomial manifold, which is the simplex of multinomial models furnished with the Riemannian structure induced by the Fisher information metric. We prove that the Negative Geodesic Distance (NGD) on the multinomial manifold is conditionally positive definite (cpd), and thus can be used as a kernel in SVMs. Experiments show the NGD kernel on the multinomial manifold to be effective for text classification, significantly outperforming standard kernels on the ambient Euclidean space.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]; H.3.3 [Information Search and Retrieval]; I.2.6 [Artificial Intelligence]: Learning; I.5.2 [Pattern Recognition]: Design Methodology - classifier design and evaluation.

General Terms
Algorithms, Experimentation, Theory.

Keywords
Text Classification, Machine Learning, Support Vector Machine, Kernels, Riemannian Geometry, Differential Geometry.

1. INTRODUCTION

Recent research has established the Support Vector Machine (SVM) as one of the most powerful and promising machine learning methods for text classification [8, 15, 16, 36].

"The crucial ingredient of SVMs and other kernel methods is the so-called kernel trick, which permits the computation of dot products in high-dimensional feature spaces, using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example." [32]

However, standard kernels commonly used in SVMs neglect a-priori knowledge about the intrinsic geometric structure of text data. We think it makes more sense to view document feature vectors as points on a Riemannian manifold, rather than in the much larger ambient Euclidean space. This paper studies kernels on the multinomial manifold that enable SVMs to effectively exploit the intrinsic geometric structure of text data to improve text classification accuracy.

In the rest of this paper, we first examine the multinomial manifold (§2), then propose the new kernel based on the geodesic distance (§3) and present experimental results to demonstrate its effectiveness (§4); we then review related works (§5) and finally make concluding remarks (§6).

2. THE MULTINOMIAL MANIFOLD

This section introduces the concept of the multinomial manifold and the trick to compute geodesic distances on it, followed by how documents can be naturally embedded in it.

2.1 Concept

Let S = {p(· | θ)}_{θ ∈ Θ} be an n-dimensional regular statistical model family on a set X. For each x ∈ X assume the mapping θ ↦ p(x | θ) is C^∞ at each point in the interior of Θ. Let ∂_i denote ∂/∂θ_i and ℓ_θ(x) denote log p(x | θ). The Fisher information metric [1, 19, 21] at θ ∈ Θ is defined in terms of the matrix given by


g_{ij}(\theta) = E_\theta\left[ \partial_i \ell_\theta \, \partial_j \ell_\theta \right] = \int_X p(x|\theta)\, \partial_i \log p(x|\theta)\, \partial_j \log p(x|\theta)\, dx   (1)

or equivalently as

g_{ij}(\theta) = 4 \int_X \partial_i \sqrt{p(x|\theta)}\, \partial_j \sqrt{p(x|\theta)}\, dx.   (2)

Note that g_{ij}(θ) can be thought of as the variance of the score ∂_i ℓ_θ. In coordinates θ_i, g_{ij}(θ) defines a Riemannian metric on Θ, giving S the structure of an n-dimensional Riemannian manifold.

Intuitively the Fisher information may be seen as the amount of information a single data point supplies with respect to the problem of estimating the parameter θ. The choice of the Fisher information metric is motivated by its attractive properties in theory and good performance in practice [21, 23, 24].

Consider multinomial distributions that model mutually independent events X_1, ..., X_{n+1} with Pr[X_i] = θ_i. Obviously θ = (θ_1, ..., θ_{n+1}) should be on the n-simplex defined by \sum_{i=1}^{n+1} \theta_i = 1. The probability that X_1 occurs x_1 times, ..., X_{n+1} occurs x_{n+1} times is given by

p(x|\theta) = \frac{N!}{\prod_{i=1}^{n+1} x_i!} \prod_{i=1}^{n+1} \theta_i^{x_i}   (3)

where N = \sum_{i=1}^{n+1} x_i.

The multinomial manifold is the parameter space of the multinomial distribution

P^n = \left\{ \theta \in \mathbb{R}^{n+1} : \sum_{i=1}^{n+1} \theta_i = 1;\ \theta_i \ge 0\ \forall i \right\}   (4)

equipped with the Fisher information metric, which can be shown to be

g_\theta(u, v) = \sum_{i=1}^{n+1} \frac{u_i v_i}{\theta_i}   (5)

where θ ∈ P^n, and u, v ∈ T_θ P^n are vectors tangent to P^n at θ represented in the standard basis of \mathbb{R}^{n+1}.

2.2 Geodesic

It is a well-known fact that the multinomial manifold P^n is isometric to the positive portion of the n-sphere of radius 2 [18]

S^n_+ = \left\{ \psi \in \mathbb{R}^{n+1} : \|\psi\| = 2;\ \psi_i \ge 0\ \forall i \right\}   (6)

through the diffeomorphism F : P^n → S^n_+,

F(\theta) = \left( 2\sqrt{\theta_1}, \ldots, 2\sqrt{\theta_{n+1}} \right).   (7)

Therefore the geodesic distance between θ, θ′ ∈ P^n can be computed as the geodesic distance between F(θ), F(θ′) ∈ S^n_+, i.e., the length of the shortest curve on S^n_+ connecting F(θ) and F(θ′), which is actually a segment of a great circle. Specifically, the geodesic distance between θ, θ′ ∈ P^n is given by

d_G(\theta, \theta') = 2 \arccos\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right).   (8)

2.3 Embedding

In text retrieval, clustering and classification, a document is usually considered as a "bag of words" [2]. It is natural to assume that the "bag of words" of a document is generated by independent draws from a multinomial distribution θ over the vocabulary V = {w_1, ..., w_{n+1}}. In other words, every document is modeled by a multinomial distribution, which may change from document to document. Given the feature representation of a document, d = (d_1, ..., d_{n+1}), it can be embedded in the multinomial manifold P^n by applying L1 normalization,

\hat{\theta}(d) = \left( \frac{d_1}{\sum_{i=1}^{n+1} d_i}, \ldots, \frac{d_{n+1}}{\sum_{i=1}^{n+1} d_i} \right).   (9)

The simple TF representation of a document D sets d_i = tf(w_i, D), the term frequency (TF) of word w_i in document D, i.e., how many times w_i appears in D. The embedding that corresponds to the TF representation is theoretically justified as the maximum likelihood estimator for the multinomial distribution [15, 21, 24].

The popular TF×IDF representation [2] of a document D sets d_i = tf(w_i, D) · idf(w_i), where the TF component tf(w_i, D) is weighted by idf(w_i), the inverse document frequency (IDF) of word w_i in the corpus. If there are m documents in the corpus and word w_i appears in df(w_i) documents, then idf_i = log( m / df(w_i) ). The embedding that corresponds to the TF×IDF representation can be interpreted as a pullback metric of the Fisher information through the transformation

G_\lambda(\theta) = \left( \frac{\theta_1 \lambda_1}{\sum_{i=1}^{n+1} \theta_i \lambda_i}, \ldots, \frac{\theta_{n+1} \lambda_{n+1}}{\sum_{i=1}^{n+1} \theta_i \lambda_i} \right)   (10)

with \lambda_i = idf_i / \sum_{j=1}^{n+1} idf_j.

How to find the optimal embedding is an interesting research problem to explore. It is possible to learn a Riemannian metric that works even better than the TF×IDF weighting [23].

Under the above embeddings, the kernel (to be discussed later) between two documents d and d′ means k(θ̂(d), θ̂(d′)).

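To make the embedding (9) and the geodesic distance (8) concrete, here is a minimal NumPy sketch (our own illustration; the function names are not from the paper):

```python
import numpy as np

def embed(d):
    """Embed a nonnegative feature vector d (TF or TF*IDF counts) in the
    multinomial simplex P^n via L1 normalization, as in equation (9)."""
    d = np.asarray(d, dtype=float)
    return d / d.sum()

def geodesic_distance(theta, theta_prime):
    """Geodesic distance on the multinomial manifold, equation (8):
    d_G(theta, theta') = 2 * arccos( sum_i sqrt(theta_i * theta'_i) )."""
    s = np.sum(np.sqrt(theta * theta_prime))
    # Clip to guard against floating-point overshoot slightly beyond 1.
    return 2.0 * np.arccos(np.clip(s, -1.0, 1.0))

# Toy "bag of words" term-frequency vectors over a 4-word vocabulary.
d1, d2 = np.array([3, 0, 1, 2]), np.array([1, 1, 1, 1])
print(geodesic_distance(embed(d1), embed(d2)))  # a value in [0, pi]
```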

3. DISTANCE BASED KERNELS

Good kernels should be consistent with one's intuition of pairwise similarity/dissimilarity in the domain. The motivation of this paper is to exploit the intrinsic geometric structure of text data to design a kernel that can better capture document similarity/dissimilarity. Standard text retrieval, clustering and classification usually rely on the similarity measure defined by the dot product (inner product) of two document vectors in a Euclidean space [2]. The geometric interpretation of the dot product is that it computes the cosine of the angle between two vectors provided they are normalized to unit length. When turning to Riemannian geometry, this similarity measure is no longer available on general manifolds, because the concept of dot product is only defined locally on the tangent space but not globally on the manifold itself. However, there exists a natural dissimilarity measure on general manifolds: geodesic distance.

This section first describes the concept of kernels (§3.1), then discusses the Negative Euclidean Distance kernel (§3.2) and the Negative Geodesic Distance kernel (§3.3) in detail.

3.1 PD and CPD Kernels

Definition 1 (Kernels [32]). Let X be a nonempty set. A real-valued symmetric function k : X × X → \mathbb{R} is called a positive definite (pd) kernel if for all m ∈ \mathbb{N} and all x_1, ..., x_m ∈ X the induced m × m Gram (kernel) matrix K with elements K_{ij} = k(x_i, x_j) satisfies c^T K c ≥ 0 for any vector c ∈ \mathbb{R}^m. The function k is called a conditionally positive definite (cpd) kernel if K satisfies the above condition for any vector c ∈ \mathbb{R}^m with c^T 1 = 0.

As a direct consequence of Definition 1, we have

Lemma 1 ([32]).
(i) Any pd kernel is also a cpd kernel.
(ii) Any constant c ≥ 0 is a pd kernel; any constant c ∈ \mathbb{R} is a cpd kernel.
(iii) If k_1 and k_2 are pd (resp. cpd) kernels and α_1, α_2 ≥ 0, then α_1 k_1 + α_2 k_2 is a pd (resp. cpd) kernel.
(iv) If k_1 and k_2 are pd kernels, then the product k_1 k_2 defined by k_1 k_2 (x, x') = k_1(x, x') k_2(x, x') is a pd kernel.

Lemma 2 (Connection of PD and CPD Kernels [4, 29, 32]). Let k be a real-valued symmetric function defined on X × X. Then we have

(i) \tilde{k}(x, x') := \frac{1}{2}\left( k(x, x') - k(x, x_0) - k(x_0, x') + k(x_0, x_0) \right) is pd if and only if k is cpd;

(ii) \exp(t k) is pd for all t > 0 if and only if k is cpd.

Theorem 1 (Hilbert Space Representation of PD Kernels [32]). Let k be a real-valued pd kernel on X. Then there exists a Hilbert space H of real-valued functions on X and a mapping Φ : X → H such that

\langle \Phi(x), \Phi(x') \rangle = k(x, x').   (11)

Theorem 2 (Hilbert Space Representation of CPD Kernels [32]). Let k be a real-valued cpd kernel on X. Then there exists a Hilbert space H of real-valued functions on X and a mapping Φ : X → H such that

\| \Phi(x) - \Phi(x') \|^2 = -k(x, x') + \frac{1}{2}\left( k(x, x) + k(x', x') \right).   (12)

The former theorem implies that pd kernels are justified to be used in all kernel methods. The latter theorem implies that cpd kernels are justified to be used in kernel methods that are translation invariant, i.e., distance based, in the feature space. Since SVMs are translation invariant in the feature space, they are able to employ not only pd kernels but also cpd kernels [31, 32].

Standard (commonly used) kernels [32] include:

Linear:      k_{LIN}(x, x') = \langle x, x' \rangle,   (13)

Polynomial:  k_{POL}(x, x') = \left( \langle x, x' \rangle + c \right)^d with d ∈ \mathbb{N}, c ≥ 0,   (14)

Gaussian:    k_{RBF}(x, x') = \exp\left( - \frac{\| x - x' \|^2}{2\sigma^2} \right) with σ > 0, and   (15)

Sigmoid:     k_{SIG}(x, x') = \tanh\left( \kappa \langle x, x' \rangle + \vartheta \right) with κ > 0, ϑ < 0.   (16)

The former three are pd but the last one is not.

3.2 The Negative Euclidean Distance Kernel

Lemma 3 ([31, 32]). The negative squared Euclidean distance -d_E(x, x')^2 = -\| x - x' \|^2 is a cpd kernel.

Lemma 4 (Fractional Powers and Logs of CPD Kernels [4, 32]). If k : X × X → (-∞, 0] is cpd, then so are -(-k)^\beta for 0 < β < 1, and -\ln(1 - k).

Proposition 1. The Negative Euclidean Distance (NED) function

k_{NED}(x, x') = -d_E(x, x')   (17)

is a cpd kernel.

Proof. It follows directly from Lemma 3 and Lemma 4 with β = 1/2.
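The cpd condition in Definition 1 can also be checked numerically on a finite sample: a Gram matrix K is cpd if and only if it is positive semidefinite on the subspace {c : c^T 1 = 0}. The following sketch (our own illustration, not part of the paper) applies this test to the NED kernel (17):

```python
import numpy as np

def ned_gram(X):
    """Gram matrix of the Negative Euclidean Distance kernel (17),
    k_NED(x, x') = -||x - x'||, for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return -np.linalg.norm(diff, axis=-1)

def looks_cpd(K, tol=1e-9):
    """Empirical cpd test: c^T K c >= 0 must hold whenever sum(c) = 0.
    Project K onto the subspace orthogonal to the all-ones vector and
    check that the projected matrix has no significantly negative eigenvalue."""
    m = K.shape[0]
    P = np.eye(m) - np.ones((m, m)) / m   # projector onto {c : c^T 1 = 0}
    return np.linalg.eigvalsh(P @ K @ P).min() >= -tol

rng = np.random.default_rng(0)
X = rng.random((50, 10))
print(looks_cpd(ned_gram(X)))   # expected: True, consistent with Proposition 1
```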


3.3 The Negative Geodesic Distance Kernel

Theorem 3 (Dot Product Kernels in Finite Dimensions [30, 32]). A function k(x, x') = f(\langle x, x' \rangle) defined on the unit sphere in a finite n-dimensional Hilbert space is a pd kernel if and only if its Legendre polynomial expansion has only nonnegative coefficients, i.e.,

f(t) = \sum_{r=0}^{\infty} b_r P_r^n(t) with b_r ≥ 0.   (18)

Theorem 4 (Dot Product Kernels in Infinite Dimensions [30, 32]). A function k(x, x') = f(\langle x, x' \rangle) defined on the unit sphere in an infinite dimensional Hilbert space is a pd kernel if and only if its Taylor series expansion has only nonnegative coefficients, i.e.,

f(t) = \sum_{r=0}^{\infty} a_r t^r with a_r ≥ 0.   (19)

Since (19) is more stringent than (18), in order to prove positive definiteness for arbitrary dimensional dot product kernels it suffices to show that condition (19) holds [32].

Proposition 2. The Negative Geodesic Distance (NGD) function on the multinomial manifold

k_{NGD}(\theta, \theta') = -d_G(\theta, \theta')   (20)

for θ, θ′ ∈ P^n is a cpd kernel. Moreover, the shifted NGD kernel k_{NGD}(θ, θ′) + π is a pd kernel.

Proof. Plugging the formula (8) into (20), we have

k_{NGD}(\theta, \theta') = -2 \arccos\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right).   (21)

Denote (\sqrt{\theta_1}, \ldots, \sqrt{\theta_{n+1}}) and (\sqrt{\theta'_1}, \ldots, \sqrt{\theta'_{n+1}}) by \tilde{\theta} and \tilde{\theta}' respectively. It is obvious that \tilde{\theta} and \tilde{\theta}' are both on the unit sphere. Let

f(t) = \pi - 2 \arccos(t).   (22)

Then we can rewrite k_{NGD}(θ, θ′) as

k_{NGD}(\theta, \theta') = f\left( \langle \tilde{\theta}, \tilde{\theta}' \rangle \right) - \pi.   (23)

The Maclaurin series (the Taylor series expansion of a function about 0) for the inverse cosine function with -1 ≤ t ≤ 1 is

\arccos(t) = \frac{\pi}{2} - \sum_{r=0}^{\infty} \frac{\Gamma\left( r + \frac{1}{2} \right)}{\sqrt{\pi}\, (2r+1)\, r!}\, t^{2r+1},   (24)

where Γ(x) is the gamma function. Hence

f(t) = \sum_{r=0}^{\infty} c_r t^{2r+1},   (25)

where

c_r = \frac{2\, \Gamma\left( r + \frac{1}{2} \right)}{\sqrt{\pi}\, (2r+1)\, r!}.   (26)

Since Γ(x) > 0 for all x > 0, we have c_r > 0 for all r = 0, 1, 2, .... By Theorems 3 and 4, the dot product kernel f(\langle \tilde{\theta}, \tilde{\theta}' \rangle) is pd. Thus the NGD kernel k_{NGD}(θ, θ′) is cpd according to Lemma 1. The shifted NGD kernel k_{NGD}(θ, θ′) + π equals f(\langle \tilde{\theta}, \tilde{\theta}' \rangle), which has been proved to be pd.

Definition 2 (Support Vector Machine, Dual Form [32]). Given a set of m labeled examples (x_1, y_1), ..., (x_m, y_m) and a kernel k, the decision function of the Support Vector Machine (SVM) is

f(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, k(x, x_i) + b^* \right),   (27)

where (\alpha_1^*, \ldots, \alpha_m^*) = \alpha^* solves the following quadratic optimization problem:

maximize_{\alpha \in \mathbb{R}^m}  W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)   (28)

subject to  0 ≤ \alpha_i ≤ C for all i = 1, ..., m   (29)

and  \sum_{i=1}^{m} \alpha_i y_i = 0   (30)

for some C > 0, and b^* is obtained by averaging y_j - \sum_{i=1}^{m} \alpha_i^* y_i\, k(x_j, x_i) over all training examples with 0 < \alpha_j^* < C.

Proposition 3. Let k be a valid kernel, and \bar{k} = k + β where β ∈ \mathbb{R} is a constant. Then k and \bar{k} lead to the identical SVM, given the same training data.

Proof. Denote the SVMs with kernel k and \bar{k} learned from the training data (x_1, y_1), ..., (x_m, y_m) by SVM(k) and SVM(\bar{k}) respectively. The objective function (28) of the quadratic optimization problem for training SVM(\bar{k}) is

\bar{W}(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, \bar{k}(x_i, x_j)   (31)

= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \left( k(x_i, x_j) + \beta \right)   (32)

= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) - \frac{\beta}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j   (33)

= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) - \frac{\beta}{2} \left( \sum_{i=1}^{m} \alpha_i y_i \right)^2.   (34)

In addition, for all training examples with 0 < \alpha_j^* < C,

y_j - \sum_{i=1}^{m} \alpha_i^* y_i\, \bar{k}(x_j, x_i) = y_j - \sum_{i=1}^{m} \alpha_i^* y_i \left( k(x_j, x_i) + \beta \right)   (35)

= y_j - \sum_{i=1}^{m} \alpha_i^* y_i\, k(x_j, x_i) - \beta \left( \sum_{i=1}^{m} \alpha_i^* y_i \right).   (36)

Furthermore, the decision function of SVM(\bar{k}) is

\bar{f}(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, \bar{k}(x, x_i) + b^* \right)   (37)

= \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i^* y_i\, k(x, x_i) + \beta \left( \sum_{i=1}^{m} \alpha_i^* y_i \right) + b^* \right).   (38)


Making use of the equality constraint (30), all terms containing β vanish in (34), (36) and (38). Therefore SVM(k) and SVM(\bar{k}) arrive at equivalent α^* and b^*, and their decision functions for classification are identical.

Proposition 2 and Proposition 3 reveal that although the NGD kernel k_{NGD}(θ, θ′) is cpd (not pd), it leads to the identical SVM as the pd kernel k_{NGD}(θ, θ′) + π. This deepens our understanding of the NGD kernel.

Using the NGD kernel as the starting point, a family of cpd and pd kernels can be constructed based on the geodesic distance on the multinomial manifold using Lemma 1, Lemma 2, Lemma 4, etc. [32]. In particular, we have the following pd kernel, which will be discussed later in the section on related works.

Proposition 4. The kernel

k_{NGD\text{-}E}(\theta, \theta') = (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{1}{t} \arccos\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right) \right)   (39)

with t > 0 for θ, θ′ ∈ P^n is pd.

Proof. It is not hard to see that k_{NGD-E}(θ, θ′) equals (4\pi t)^{-\frac{n}{2}} \exp\left( \frac{1}{2t}\, k_{NGD}(\theta, \theta') \right), whose positive definiteness is a trivial consequence of Proposition 2, Lemma 1 and Lemma 2.
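As an illustration (a sketch of ours, not code from the paper), the NGD kernel (21) and the shifted pd variant of Proposition 2 can be evaluated on a batch of embedded documents as a Gram matrix:

```python
import numpy as np

def ngd_gram(Theta, Theta_prime=None, shift=False):
    """Gram matrix of the Negative Geodesic Distance kernel (21),
    k_NGD(theta, theta') = -2 * arccos( sum_i sqrt(theta_i * theta'_i) ).
    Rows of Theta and Theta_prime are points on the multinomial simplex
    (L1-normalized TF or TF*IDF vectors).  With shift=True the constant
    pi is added, giving the pd kernel of Proposition 2."""
    if Theta_prime is None:
        Theta_prime = Theta
    S = np.sqrt(Theta) @ np.sqrt(Theta_prime).T   # Bhattacharyya affinities
    K = -2.0 * np.arccos(np.clip(S, -1.0, 1.0))
    return K + np.pi if shift else K
```

By Proposition 3, passing either the shifted or the unshifted Gram matrix to an SVM yields the same classifier, since the constant π cancels under the equality constraint (30).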

4. EXPERIMENTS

We have conducted experiments on two real-world datasets, WebKB (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) and 20NG (http://people.csail.mit.edu/people/jrennie/20Newsgroups), to evaluate the effectiveness of the proposed NGD kernel for text classification using SVMs.

The WebKB dataset contains manually classified Web pages that were collected from the computer science departments of four universities (Cornell, Texas, Washington and Wisconsin) and some other universities (misc.). Only pages in the top 4 largest classes were used, namely student, faculty, course and project. All pages were pre-processed using the Rainbow toolkit [25] with the options "--skip-header --skip-html --lex-pipe-command=tag-digits --no-stoplist --prune-vocab-by-occur-count=2". The task is essentially a four-fold cross-validation in a leave-one-university-out manner: training on three of the universities plus the misc. collection, and testing on the pages from the fourth, held-out university. This way of splitting training and test data is recommended because each university's pages have their own idiosyncrasies. The performance measure is the multi-class classification accuracy averaged over these four splits.

The 20NG (20newsgroups) dataset is a collection of approximately 20,000 documents that were collected from 20 different newsgroups [22]. Each newsgroup corresponds to a different topic and is considered as a class. All documents were pre-processed using the Rainbow toolkit [25] with the option "--prune-vocab-by-doc-count=2". The "bydate" version of this dataset along with its train/test split is used in our experiments due to the following considerations: duplicates and newsgroup-identifying headers have been removed; there is no randomness in training and testing set selection, so cross-experiment comparison is easier; and separating the training and testing sets in time is more realistic. The performance measure is the multi-class classification accuracy.

LIBSVM [5] was employed as the implementation of SVM, with all the parameters set to their default values. (We further tried a series of values {…, 0.001, 0.01, 0.1, 1, 10, 100, 1000, …} for the parameter C and found the superiority of the NGD kernel unaffected.) LIBSVM uses the "one-vs-one" ensemble method for multi-class classification because of its effectiveness and efficiency [12].

We have tried standard kernels, the NED kernel and the NGD kernel. The linear (LIN) kernel worked better than or as well as other standard kernels (such as the Gaussian RBF kernel) in our experiments, which is consistent with previous observations that the linear kernel can usually achieve the best performance for text classification [8, 15-17, 36]. Therefore the experimental results of standard kernels other than the linear kernel are not reported here.

The text data represented as TF or TF×IDF vectors can be embedded in the multinomial manifold by applying L1 normalization (9), as described in §2.3. Since kernels that assume Euclidean geometry (including the LIN and NED kernels) often perform better with L2 normalization, we report such experimental results as well. The NGD kernel essentially relies on the multinomial manifold, so we stick to L1 normalization when using it.

The experimental results obtained using SVMs with the LIN, NED and NGD kernels are shown in Tables 1 and 2. (Our experimental results on the WebKB and 20NG datasets should not be directly compared with most published results because of differences in experimental settings and performance measures.)

Table 1. Experimental results on the WebKB dataset.

representation  normalization  kernel  accuracy
TF              L1             LIN     72.57%
TF              L1             NED     84.39%
TF              L1             NGD     91.88%
TF              L2             LIN     89.52%
TF              L2             NED     90.08%
TF×IDF          L1             LIN     55.84%
TF×IDF          L1             NED     81.04%
TF×IDF          L1             NGD     91.42%
TF×IDF          L2             LIN     88.23%
TF×IDF          L2             NED     87.96%
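To indicate how the NGD kernel plugs into an SVM in practice, the following sketch uses scikit-learn's SVC with a precomputed Gram matrix and scikit-learn's bundled copy of the 20NG "bydate" split. This is our own stand-in for the paper's setup (the paper used LIBSVM and Rainbow preprocessing), restricted to four newsgroups and a capped vocabulary just to keep the dense matrices small, so it will not reproduce the numbers in Table 2:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

def ngd_gram(Theta, Theta_prime=None):
    """NGD Gram matrix, same as the sketch in Section 3.3."""
    if Theta_prime is None:
        Theta_prime = Theta
    S = np.sqrt(Theta) @ np.sqrt(Theta_prime).T
    return -2.0 * np.arccos(np.clip(S, -1.0, 1.0))

# A small subset of the 20NG "bydate" split, TF representation.
cats = ["comp.graphics", "rec.autos", "sci.space", "talk.politics.guns"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=cats,
                          remove=("headers", "footers", "quotes"))
vec = CountVectorizer(min_df=2, max_features=5000)  # cap vocabulary for a small sketch
Xtr = vec.fit_transform(train.data).toarray().astype(float)
Xte = vec.transform(test.data).toarray().astype(float)

# Embed in the multinomial simplex by L1 normalization, equation (9).
Xtr /= Xtr.sum(axis=1, keepdims=True) + 1e-12
Xte /= Xte.sum(axis=1, keepdims=True) + 1e-12

clf = SVC(kernel="precomputed", C=1.0)   # one-vs-one multi-class, as in LIBSVM
clf.fit(ngd_gram(Xtr), train.target)
pred = clf.predict(ngd_gram(Xte, Xtr))   # test-vs-train Gram matrix
print("accuracy:", np.mean(pred == test.target))
```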


Table 2. Experimental results on the 20NG dataset.

representation  normalization  kernel  accuracy
TF              L1             LIN     24.68%
TF              L1             NED     69.16%
TF              L1             NGD     81.94%
TF              L2             LIN     79.76%
TF              L2             NED     79.76%
TF×IDF          L1             LIN     21.47%
TF×IDF          L1             NED     72.20%
TF×IDF          L1             NGD     84.61%
TF×IDF          L2             LIN     82.33%
TF×IDF          L2             NED     82.06%

The NED kernel worked comparably to the LIN kernel under L2 normalization, and superiorly under L1 normalization. This observation has not been reported before.

The NGD kernel consistently outperformed its Euclidean counterpart, the NED kernel, as well as the representative of standard kernels, the LIN kernel, throughout the experiments. All improvements made by the NGD kernel over the LIN or NED kernel are statistically significant according to the (micro) sign test [36] at the 0.005 level (P-value < 0.005), as shown in Table 3. (Using McNemar's test also indicates that the improvements brought by the NGD kernel are statistically significant.)

Table 3. Statistical significance tests of the differences between the NGD kernel (under L1 normalization) and the LIN/NED kernel (under L2 normalization).

comparison   dataset      representation  Z
NGD vs. LIN  WebKB-top4   TF              3.4744
NGD vs. LIN  WebKB-top4   TF×IDF          4.2500
NGD vs. LIN  20NG-bydate  TF              7.5234
NGD vs. LIN  20NG-bydate  TF×IDF          7.4853
NGD vs. NED  WebKB-top4   TF              2.6726
NGD vs. NED  WebKB-top4   TF×IDF          4.4313
NGD vs. NED  20NG-bydate  TF              7.6028
NGD vs. NED  20NG-bydate  TF×IDF          8.4718

We think the success of the NGD kernel for text classification is attributable to its ability to exploit the intrinsic geometric structure of text data.

5. RELATED WORKS

It is very attractive to design kernels that can combine the merits of generative modeling and discriminative learning. This paper lies along this line of research.

An influential work on this topic is the Fisher kernel proposed by Jaakkola and Haussler [13]. For any suitably regular probability model p(x | θ) with parameters θ, the Fisher kernel is based on the Fisher score U_x = \nabla_\theta \log p(x \,|\, \theta) at a single point in the parameter space:

k_F(x, x') = U_x^T I^{-1} U_{x'}   (40)

where I = E_x\left[ U_x U_x^T \right]. In contrast, the NGD kernel is based on the full geometry of statistical models.

Another typical kernel of this type is the probability product kernel proposed by Jebara et al. [14]. Let p(· | θ) be probability distributions on a space X. Given a positive constant ρ > 0, the probability product kernel is defined as

k_{PP}(\theta, \theta') = \int_X p(x|\theta)^\rho\, p(x|\theta')^\rho\, dx = \left\langle p(\cdot|\theta)^\rho,\ p(\cdot|\theta')^\rho \right\rangle_{L_2}   (41)

assuming that p(·|θ)^ρ, p(·|θ′)^ρ ∈ L_2(X), i.e., \int_X p(x|\theta)^{2\rho} dx and \int_X p(x|\theta')^{2\rho} dx are well-defined (not infinite). For any Θ such that p(·|θ)^ρ ∈ L_2(X) for all θ ∈ Θ, the probability product kernel defined by (41) is pd. In the special case ρ = 1/2, the probability product kernel is called the Bhattacharyya kernel, because in the statistics literature this quantity is known as Bhattacharyya's affinity between probability distributions. When applied to the multinomial manifold, the Bhattacharyya kernel of θ, θ′ ∈ P^n turns out to be

k_B(\theta, \theta') = \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i}   (42)

which is closely related to the NGD kernel given by (21) through

k_B = \cos\left( -\tfrac{1}{2} k_{NGD} \right),   (43)

though the two kernels are proposed from different perspectives.

The idea of treating text data as points on the multinomial manifold in order to construct new classification methods has been investigated by Lebanon and Lafferty [21, 24]. In particular, they have proposed the information diffusion kernel, based on the heat equation on the Riemannian manifold defined by the Fisher information metric [21]. On the multinomial manifold, the information diffusion kernel can be approximated by

k_{ID}(\theta, \theta') \approx (4\pi t)^{-\frac{n}{2}} \exp\left( -\frac{1}{t} \arccos^2\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right) \right).   (44)

It is identical to the pd kernel k_{NGD-E}(θ, θ′) constructed from the NGD kernel in (39), except that the inverse cosine component is squared. Since -\arccos^2\left( \sum_{i=1}^{n+1} \sqrt{\theta_i \theta'_i} \right) = -\tfrac{1}{4} d_G^2(\theta, \theta') does not appear to be cpd, the kernel k_{ID}(θ, θ′) is probably not pd, according to Lemma 2. Whether it is cpd still remains unclear. While the information diffusion kernel generalizes the Gaussian kernel of Euclidean space, the NGD kernel generalizes the NED kernel and provides more insight on this issue.
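The relation (43) between the Bhattacharyya kernel and the NGD kernel, and the squared arccos in (44), are easy to verify numerically; the following sketch (ours, with assumed function names) does so for a pair of toy distributions:

```python
import numpy as np

def bhattacharyya(theta, theta_p):
    """Bhattacharyya kernel (42): sum_i sqrt(theta_i * theta'_i)."""
    return float(np.sum(np.sqrt(theta * theta_p)))

def ngd(theta, theta_p):
    """NGD kernel (21): -2 * arccos of the Bhattacharyya affinity."""
    return -2.0 * np.arccos(np.clip(bhattacharyya(theta, theta_p), -1.0, 1.0))

def info_diffusion(theta, theta_p, t=1.0):
    """Approximate information diffusion kernel (44); note the squared arccos."""
    n = len(theta) - 1
    a = np.arccos(np.clip(bhattacharyya(theta, theta_p), -1.0, 1.0))
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-(a ** 2) / t)

theta = np.array([0.5, 0.3, 0.2])
theta_p = np.array([0.2, 0.2, 0.6])
# Relation (43): k_B = cos(-k_NGD / 2).
print(np.isclose(bhattacharyya(theta, theta_p), np.cos(-ngd(theta, theta_p) / 2.0)))
print(info_diffusion(theta, theta_p))
```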


context of text classification using TF or TF×IDF feature 8. REFERENCES representation. It is not hard to see that the above three kernels [1] Amari, S., Nagaoka, H. and Amari, S.-I. Methods of all invoke the -root squashing function on the term Information Geometry. American Mathematical Society, frequencies, thus provides an explanation for the long-standing 2001. mysterious wisdom that preprocessing term frequencies by taking squared roots often improves performance of text clustering or [2] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information classification [14]. Retrieval. Addison-Wesley, 1999. Although the NGD kernel is not restricted to the multinomial [3] Bahlmann, C., Haasdonk, B. and Burkhardt, H. On-Line manifold, it may be hard to have a closed-form formula to Handwriting Recognition with Support Vector Machines -- compute geodesic distances on manifolds with complex structure. A Kernel Approach. in Proceedings of the 8th International One possibility to overcome this obstacle is to use manifold Workshop on Frontiers in Handwriting Recognition learning techniques [28, 33, 35]. For example, given a set of data (IWFHR), 2002, 49-54. points that reside on a manifold, the Isomap algorithm of [4] Berg, C., Christensen, J.P.R. and Ressel, P. Harmonic Tenenbaum et al. [35] estimates the geodesic distance between a Analysis on Semigroups: Theory of Positive Definite and pair of points by the length of the shortest path connecting them Related Functions. Springer-Verlag, 1984. with respect to some graph (e.g., the k-nearest-neighbor graph) [5] Chang, C.-C. and Lin, C.-J. LIBSVM: a Library for Support constructed on the data points. In case of the Fisher information Vector Machines. 2001. metric, the distance between nearby points (distributions) of X http://www.csie.ntu.edu.tw/~cjlin/libsvm. can be approximated in terms of the Kullback-Leibler via the following relation. When θθ′ =+δ θ with δθ a [6] Chapelle, O., Haffner, P. and Vapnik, V.N. SVMs for perturbation, the Kullback-Leibler divergence is proportional to Histogram Based Image Classification. IEEE Transactions the density’s Fisher information [7, 20] on Neural Networks, 10 (5). 1055-1064.

δ θ→0 [7] Dabak, A.G. and Johnson, D.H. Relations between 1 2 Dp()(|x θ′ )||(|) px θθθ= F ()()δ . (45) Kullback-Leibler distance and Fisher information 2 Manscript, 2002. Another relevant line of research is to incorporate problem- [8] Dumais, S., Platt, J., Heckerman, D. and Sahami, M. specific distance measures into SVMs. One may simply represent Inductive Learning Algorithms and Representations for Text each example as a vector of its problem-specific distances to all Categorization. in Proceedings of the 7th ACM examples, or embed the problem-specific distances in a International Conference on Information and Knowledge (regularized) , and then employ standard SVM Management (CIKM), Bethesda, MD, 1998, 148-155. algorithm [9, 27]. This approach has a disadvantage of losing [9] Graepel, T., Herbrich, R., Bollmann-Sdorra, P. and sparsity, consequently they are not suitable for large-scale dataset. Obermayer, K. Classification on Pairwise Proximity Data. Another kind of approach directly uses kernels constructed based on the problem-specific distance, such as the Gaussian RBF in Advances in Neural Information Processing Systems kernel with the problem-specific distance measure plugged in [3, (NIPS), Denver, CO, 1998, 438-444. 6, 10, 11, 26]. Our proposed NGD kernel on the multinomial [10] Haasdonk, B. and Bahlmann, C. Learning with Distance manifold encodes a-priori knowledge about the intrinsic Substitution Kernels. in Proceedings of the 26th DAGM geometric structure of text data. It has been shown to be Symposium, Tubingen, Germany, 2004, 220-227. theoretically justified (cpd) and practically effective. [11] Haasdonk, B. and Keysers, D. Tangent Distance Kernels for Support Vector Machines. in Proceedings of the 16th 6. CONCLUSIONS International Conference on Pattern Recognition (ICPR), The main contribution of this paper is to prove that the Negative Quebec, Canada, 2002, 864-868. Geodesic Distance (NGD) on the multinomial manifold is a [12] Hsu, C.-W. and Lin, C.-J. A Comparison of Methods for conditionally positive definite (cpd) kernel, and it leads to Multi-class Support Vector Machines. IEEE Transactions accuracy improvements over kernels assuming Euclidean on Neural Networks, 13 (2). 415-425. geometry for text classification. [13] Jaakkola, T. and Haussler, D. Exploiting Generative Models Future works are to extend the NGD kernel to other manifolds in Discriminative Classifiers. in Advances in Neural (particularly for multimedia tasks), and apply it in other kernel Information Processing Systems (NIPS), Denver, CO, 1998, methods for pattern analysis (kernel PCA, kernel Spectral 487-493. Clustering, etc.) [34]. [14] Jebara, T., Kondor, R. and Howard, A. Probability Product Kernels. Journal of Machine Learning Research, 5. 819- 7. ACKNOWLEDGMENTS 844. We thank the anonymous reviewers for their helpful comments. [15] Joachims, T. Learning to Classify Text using Support Vector Machines. Kluwer, 2002. [16] Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. in


[17] Joachims, T., Cristianini, N. and Shawe-Taylor, J. Composite Kernels for Hypertext Categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML), Williamstown, MA, 2001, 250-257.
[18] Kass, R.E. The Geometry of Asymptotic Inference. Statistical Science, 4 (3). 188-234.
[19] Kass, R.E. and Vos, P.W. Geometrical Foundations of Asymptotic Inference. Wiley, 1997.
[20] Kullback, S. Information Theory and Statistics. Wiley, 1959.
[21] Lafferty, J.D. and Lebanon, G. Information Diffusion Kernels. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, 2002, 375-382.
[22] Lang, K. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML), Tahoe City, CA, 1995, 331-339.
[23] Lebanon, G. Learning Riemannian Metrics. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico, 2003, 362-369.
[24] Lebanon, G. and Lafferty, J.D. Hyperplane Margin Classifiers on the Multinomial Manifold. In Proceedings of the 21st International Conference on Machine Learning (ICML), Alberta, Canada, 2004.
[25] McCallum, A.K. Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. 1996. http://www.cs.cmu.edu/~mccallum/bow.
[26] Moreno, P.J., Ho, P. and Vasconcelos, N. A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications. In Advances in Neural Information Processing Systems (NIPS), Vancouver and Whistler, Canada, 2003.
[27] Pekalska, E., Paclik, P. and Duin, R.P.W. A Generalized Kernel Approach to Dissimilarity-based Classification. Journal of Machine Learning Research, 2. 175-211.
[28] Roweis, S.T. and Saul, L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290 (5500). 2323-2326.
[29] Schoenberg, I.J. Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, 44. 522-536.
[30] Schoenberg, I.J. Positive Definite Functions on Spheres. Duke Mathematical Journal, 9 (1). 96-108.
[31] Scholkopf, B. The Kernel Trick for Distances. In Advances in Neural Information Processing Systems (NIPS), Denver, CO, 2000, 301-307.
[32] Scholkopf, B. and Smola, A.J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[33] Seung, H.S. and Lee, D.D. The Manifold Ways of Perception. Science, 290 (5500). 2268-2269.
[34] Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[35] Tenenbaum, J.B., Silva, V.d. and Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290 (5500). 2319-2323.
[36] Yang, Y. and Liu, X. A Re-examination of Text Categorization Methods. In Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR), Berkeley, CA, 1999, 42-49.
