DIMENSIONALITY REDUCTION USING NON-NEGATIVE FACTORIZATION FOR INFORMATION RETRIEVAL

Satoru Tsuge, Masami Shishibori, Shingo Kuroiwa, Kenji Kita

Department of Information Science & Intelligent Systems, Faculty of Engineering, Tokushima University, 2-1 Minami-josanjima, Tokushima 770-8506, Japan. E-mail: {tsuge, bori, kuroiwa, kita}@is.tokushima-u.ac.jp

Abstract

The Vector Space Model (VSM) is a conventional information retrieval model which represents a document collection by a term-by-document matrix. Since term-by-document matrices are usually high-dimensional and sparse, they are susceptible to noise, and it is difficult to capture the underlying semantic structure. Additionally, the storage and processing of such matrices place great demands on computing resources. Dimensionality reduction is a way to overcome these problems. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are popular techniques for dimensionality reduction based on matrix decomposition; however, they produce both positive and negative values in the decomposed matrices. In the work described here, we use Non-negative Matrix Factorization (NMF) for dimensionality reduction of the vector space model. Since the matrices decomposed by NMF contain only non-negative values, the original data are represented by purely additive, not subtractive, combinations of the basis vectors. This characteristic of a parts-based representation is appealing because it reflects the intuitive notion of combining parts to form a whole. Also, the NMF computation is based on a simple iterative algorithm and is therefore advantageous for applications involving large matrices. Using the MEDLINE collection, we experimentally show that NMF offers a great improvement over the vector space model.

Keywords: information retrieval, vector space model, non-negative matrix factorization, dimensionality reduction

1 Introduction

With the rapid growth of online information, e.g., the World Wide Web (WWW), large collections of full-text documents have become available, and the opportunity to obtain useful pieces of information has increased. Information retrieval is now becoming one of the most important technologies for handling large text data.

The Vector Space Model (VSM) is one of the major conventional information retrieval models; it represents documents and queries by vectors in a multidimensional space [1]. Since these vectors are usually high-dimensional and sparse, they are susceptible to noise, and it is difficult to capture the underlying semantic structure. Additionally, the storage and processing of such data place great demands on computing resources. Dimensionality reduction is a way to overcome these problems. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) [2] are popular techniques for dimensionality reduction based on matrix decomposition. However, the cost of their computation becomes prohibitive when the matrices grow large.

This paper proposes applying Non-negative Matrix Factorization (NMF) [3][4] to dimensionality reduction of the document vectors in term-by-document matrices. NMF decomposes a non-negative matrix into two non-negative matrices, one of which can be regarded as a set of basis vectors. Dimensionality reduction is then performed by projecting the document vectors onto the lower-dimensional space formed by these basis vectors.


NMF is distinguished from other methods such as PCA and SVD by its non-negativity constraints. These constraints lead to a parts-based representation, because they allow only additive, not subtractive, combinations. Also, the NMF computation is based on a simple iterative algorithm and is therefore advantageous for applications involving large matrices.

The remainder of this paper is organized as follows. In Section 2, we introduce non-negative matrix factorization. In Section 3, a method of dimensionality reduction with NMF is described. Section 4 presents information retrieval results on the MEDLINE test collection and discusses these results. Finally, Section 5 gives conclusions and future work.

2 Non-Negative Matrix Factorization

This section provides a brief overview of Non-negative Matrix Factorization (NMF) [3][4]. Given a non-negative $n \times m$ matrix $V$, NMF finds a non-negative $n \times r$ matrix $W$ and a non-negative $r \times m$ matrix $H$ such that

$$V \approx WH. \qquad (1)$$

The rank $r$ is generally chosen to satisfy $(n + m)r < nm$, so that the product $WH$ can be regarded as a compressed form of the data in $V$. Equation (1) can be rewritten column by column as

$$v \approx Wh, \qquad (2)$$

where $v$ and $h$ are the corresponding columns of $V$ and $H$.

NMF does not allow negative entries in the matrices $W$ and $H$. These constraints lead to a parts-based representation, because they allow only additive, not subtractive, combinations. This characteristic of a parts-based representation is appealing because it reflects the intuitive notion of combining parts to form a whole.

2.1 NMF computation

Here, we introduce two algorithms based on iterative estimation of $W$ and $H$ [3][4]. At each iteration of these algorithms, the new value of $W$ or $H$ is found by multiplying the current value by some factor that depends on the quality of the approximation in equation (1). Repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization.

First, we introduce the following update rules:

$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad (3)$$

$$W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}. \qquad (4)$$

Repeated iteration of these update rules converges to a local minimum of the objective function

$$F = \sum_{i,j} \left( V_{ij} - (WH)_{ij} \right)^2, \qquad (5)$$

which is the square of the Euclidean distance between $V$ and $WH$. We call these update rules update rule 1; a short code sketch of this rule is given at the end of this subsection.

Next, we introduce the update rules which maximize the following objective function:

$$F = \sum_{i,j} \left[ V_{ij} \log (WH)_{ij} - (WH)_{ij} \right]. \qquad (6)$$

The corresponding updates, following [4], are

$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}, \qquad (7)$$

$$H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}. \qquad (8)$$
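For concreteness, the following is a minimal NumPy sketch of update rule 1 (equations (3)-(5)). It is an illustration rather than the authors' code: the function names, the default of 20 iterations, and the small constant eps added to the denominators to avoid division by zero are our own choices; the random initialization in [0.0, 1.0) follows the experimental setup described later in Section 4.1.

    import numpy as np

    def nmf_update_rule_1(V, r, n_iter=20, eps=1e-9, seed=0):
        # Factor non-negative V (n x m) into W (n x r) and H (r x m)
        # with the multiplicative updates of equations (3) and (4).
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.random((n, r))   # random initial values in [0.0, 1.0)
        H = rng.random((r, m))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # equation (3)
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # equation (4)
        return W, H

    def objective_F(V, W, H):
        # Squared Euclidean distance between V and WH, equation (5).
        return float(np.sum((V - W @ H) ** 2))

Update rule 2 can be implemented analogously by replacing the two update lines with equations (7) and (8).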

We call these update rules update rule 2.

3 Dimensionality Reduction Using NMF

We now describe our vector space information retrieval model, which incorporates NMF-based dimensionality reduction. To apply NMF to dimensionality reduction of a term-by-document matrix in information retrieval, we regard the term-by-document matrix as the data matrix $V$ of the NMF. The following is a summary of information retrieval using NMF-based dimensionality reduction (a code sketch of steps 4-6 follows the list):

1. Extract indexing terms from the entire document collection using an appropriate stop list and stemming algorithm. Suppose this yields $n$ indexing terms and $m$ documents.

2. Create $m$ document vectors $d_1, d_2, \ldots, d_m$, where the $i$-th component $a_{ij}$ of document vector $d_j$ is defined as $a_{ij} = L_{ij} \times G_i$. Here, $L_{ij}$ is the local weighting for the $i$-th term in document $d_j$, and $G_i$ is the global weighting for the $i$-th term.

3. Apply non-negative matrix factorization to the term-by-document matrix. This process computes the basis vectors.

4. Project the document vectors onto the new $r$-dimensional space. The columns of $W$ form the axes of this space.

5. Using the same transformation, map a query vector into the $r$-dimensional space.

6. Calculate the similarity between the transformed document vectors and the query vector.
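As an illustration of steps 4-6, the sketch below projects document and query vectors into the reduced space and ranks documents by cosine similarity. The paper does not spell out the projection operator here; we assume the common least-squares projection onto the column space of W (equivalently, multiplication by the pseudo-inverse of W), and cosine similarity for step 6. All function names are our own.

    import numpy as np

    def project(W, X):
        # Least-squares coordinates of the columns of X in the column
        # space of W, i.e., multiplication by the pseudo-inverse of W
        # (our assumed reading of steps 4 and 5).
        coords, _, _, _ = np.linalg.lstsq(W, X, rcond=None)
        return coords

    def rank_documents(W, D, q):
        # Step 6: rank the columns of the term-by-document matrix D
        # against query q by cosine similarity in the r-dimensional space.
        D_hat = project(W, D)                      # r x m coordinates
        q_hat = project(W, q.reshape(-1, 1))[:, 0]
        sims = (D_hat.T @ q_hat) / (
            np.linalg.norm(D_hat, axis=0) * np.linalg.norm(q_hat) + 1e-12)
        return np.argsort(-sims)                   # best documents first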

4 Information Retrieval Experiments

4.1 Conditions

In the experiments, we used the MEDLINE collection. This collection consists of thirty queries and 1,033 documents; the average number of relevant documents per query is 23.2. We first preprocessed the documents to eliminate non-content-bearing stopwords using a stop list of 439 common English words. Terms occurring in only one document were also removed. The remaining terms were then stemmed using the Porter algorithm [6]. The preprocessing step resulted in 4,328 indexing terms.

Term weighting combines two components: a local weight, which depends on how often a term occurs within a single document, and a global weight, which depends on how the term is distributed over the entire collection. The $i$-th element $d_{ij}$ of the document vector $d_j$ was given by

$$d_{ij} = L_{ij} G_i, \qquad (9)$$

where $L_{ij}$ is the local weight for term $i$ in document $d_j$ and $G_i$ is the global weight for term $i$. As the term weighting scheme, we used log-entropy [7] (a code sketch follows at the end of this subsection):

Local weight:

$$L_{ij} = \log(1 + f_{ij}), \qquad (10)$$

Global weight:

$$G_i = 1 + \frac{\sum_{j} p_{ij} \log p_{ij}}{\log n}, \qquad (11)$$

where $n$ is the number of documents in the collection, $f_{ij}$ is the frequency of the $i$-th term in the $j$-th document, and $p_{ij} = f_{ij} / \sum_{j} f_{ij}$.

We used random values from 0.0 to 1.0 as the initial values of the decomposed matrices $W$ and $H$.

For the retrieval evaluation, we measured the non-interpolated average precision, which is an average of precision at various points of recall using the top fifty documents retrieved [6][8]. This score was calculated with the "trec_eval" program [9].
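The log-entropy weighting of equations (9)-(11) can be computed as follows. Note that equation (11) was reconstructed above from the standard log-entropy formula in [7], so this sketch should be read under that assumption; the small constant in the denominator only guards against empty term rows.

    import numpy as np

    def log_entropy_matrix(F):
        # F is the n x m matrix of raw term frequencies f_ij.
        n_docs = F.shape[1]
        L = np.log(1.0 + F)                             # local weight, eq. (10)
        p = F / (F.sum(axis=1, keepdims=True) + 1e-12)  # p_ij = f_ij / sum_j f_ij
        safe_p = np.where(p > 0, p, 1.0)                # log(1) = 0 for empty cells
        plogp = p * np.log(safe_p)
        G = 1.0 + plogp.sum(axis=1) / np.log(n_docs)    # global weight, eq. (11)
        return L * G[:, None]                           # d_ij = L_ij * G_i, eq. (9)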

4.2 Experimental Results

Figure 1 (a) shows the cost of the objective function $F$ given by equation (5) as a function of the number of iterations, with the number of reduced dimensions $r$ fixed to six hundred. Likewise, Figure 1 (b) shows the cost of the objective function given by equation (6) as a function of the number of iterations under the same conditions. We can see from these figures that the cost of the objective function converged after only 20 iterations for both update rules.

Figure 1: Cost of the objective function as a function of the number of iterations. (a) Update rule 1; (b) Update rule 2.

Next, Figure 2 shows the average precision as a function of the number of reduced dimensions, with the number of iterations fixed to twenty, at which the cost of the objective function had converged. For comparison, the average precision of the VSM is also shown in this figure; the dimensionality of the VSM equals the number of indexing terms, 4,328.

Figure 2: Average precision as a function of the number of dimensions. (a) Update rule 1; (b) Update rule 2.

We can see from this figure that NMF with more than 100 dimensions improved performance over the VSM for both update rules. The best performance was achieved at 600 dimensions with update rule 1 and at 400 dimensions with update rule 2, and these results were comparable to SVD with the same number of dimensions. From this we consider that the number of semantic dimensions in this data is about 500.

We can also see from this figure that the average precision degraded significantly when the number of dimensions was less than 100. This result implies that 100 basis vectors are not enough to capture the semantic structure in this data.

4.3 Discussion

In this section, we discuss the experimental results described in Section 4.2. First, we consider the relationship between the average precision and the number of iterations. Because the NMF is based on iterative updates of $W$ and $H$, the similarity between $V$ and $WH$ depends on the number of iterations. Therefore, the number of iterations can be expected to affect the retrieval performance.

Figure 3: Average precision as a function of the number of iterations. (a) Update rule 1; (b) Update rule 2.

Figure 4: Cost of the objective function as a function of the number of dimensions. (a) Update rule 1; (b) Update rule 2.

Figure 3 shows the average precision as a function of the number of iterations. The reduced dimensions were fixed to 600 for update rule 1 and 400 for update rule 2, the settings that gave the best performance in the previous experiments (see Section 4.2). In Figure 3, the left panel (a) and the right panel (b) show the average precision using update rule 1 and update rule 2, respectively.

We can see from these figures that the average precision remained almost unchanged as the number of iterations increased. Once the number of iterations exceeded 20, the average precision curves closely followed the cost curves shown in Figure 1. From this we conclude that a strong dependence exists between the average precision and the cost of the objective function; the cost of the objective function can therefore be used as a criterion for limiting the number of iterations.

Next, we investigate the relationship between the cost of the objective function and the number of dimensions. Figure 4 shows the cost of the objective function as a function of the number of dimensions, with the number of iterations fixed to 20. We can see from these figures that the matrix product $WH$ approximated the original data matrix $V$ more closely as the number of dimensions increased. However, as shown in Figure 2, the average precision did not improve as the number of dimensions increased. Consequently, no dependence was observed between the cost of the objective function and the average precision when the number of dimensions was more than 300.
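As noted above, the cost of the objective function tracks retrieval quality closely enough to serve as a stopping test for the iteration. The following is a minimal sketch of such a test; the loop structure, names, and tolerance are our illustrative choices (the paper itself simply fixed the number of iterations to 20).

    def run_until_converged(step, current_cost, max_iter=300, tol=1e-4):
        # step() performs one update of W and H (either update rule);
        # current_cost() returns the objective F after the latest update.
        # Stop once the relative change in F falls below tol.
        f_prev = current_cost()
        for i in range(max_iter):
            step()
            f_curr = current_cost()
            if abs(f_prev - f_curr) <= tol * max(abs(f_prev), 1.0):
                return i + 1   # number of iterations actually used
            f_prev = f_curr
        return max_iter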

5 Conclusions

We have proposed a method for dimensionality reduction of the vector space information retrieval model using Non-negative Matrix Factorization (NMF). NMF decomposes a non-negative matrix, here the term-by-document matrix, into two non-negative matrices, one of which can be regarded as a set of basis vectors. Dimensionality reduction is performed by projecting the document vectors onto the lower-dimensional space formed by these basis vectors.

Experimental results on the MEDLINE test collection showed that the proposed method gives better performance than the conventional vector space model. Therefore, we can conclude that the basis vectors, i.e., the columns of the decomposed matrix $W$, can discover the semantic structure that is latent in the data. We are now planning to analyze the basis vectors in order to discover what kinds of semantic structure exist in them.

References

[1] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.

[2] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[3] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (NIPS 2000), 2000.

[4] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.

[5] M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995.

[6] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.

[7] E. Chisholm and T. Kolda. New term weighting formulas for the vector space method in information retrieval. Technical Memorandum ORNL-13756, 1999.

[8] D. Lewis. Evaluating text categorization. Proc. of Speech and Natural Language Workshop, pages 312-318, 1991.

[9] TREC homepage. http://trec.nist.gov/.