Document Clustering Based on Non-Negative Matrix Factorization
Wei Xu, Xin Liu, Yihong Gong
NEC Laboratories America, Inc.
10080 North Wolfe Road, SW3-350, Cupertino, CA 95014, U.S.A.
{xw,xliu,ygong}@ccrl.sj.nec.com

ABSTRACT
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracy.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering

General Terms
Algorithms

Keywords
Document Clustering, Non-negative Matrix Factorization

SIGIR'03, July 28–August 1, 2003, Toronto, Canada.

1. INTRODUCTION
Document clustering techniques have been receiving more and more attention as a fundamental and enabling tool for the efficient organization, navigation, retrieval, and summarization of huge volumes of text documents. With a good document clustering method, computers can automatically organize a document corpus into a meaningful cluster hierarchy, which enables efficient browsing and navigation of the corpus. Efficient document browsing and navigation is a valuable complement to the deficiencies of traditional IR technologies. As pointed out in [3], the variety of information retrieval needs can be expressed by a spectrum: at one end is a narrowly specified search for documents matching the user's query, and at the other end is a broad information need, such as what the major international events in the year 2001 were, or a need without well-defined goals other than to learn more about the general contents of the data corpus. Traditional text search engines fit well for covering one end of the spectrum, namely keyword-based search for specific documents, while browsing through a cluster hierarchy is more effective for serving the information retrieval needs from the rest of the spectrum.

In recent years, research on topic detection and tracking, document content summarization, and filtering has received enormous attention in the information retrieval community. Topic detection and tracking (TDT) aims to automatically detect salient topics from either a given document corpus or an incoming document stream, and to associate each document with one of the detected topics. The TDT problem can be considered a special case of the document clustering problem, and indeed most of the TDT systems in the literature were realized by adapting various document clustering techniques. Document summarization, on the other hand, is intended to create a document abstract by extracting the sentences/paragraphs that best present the main content of the original document. Again, many proposed summarization systems employ clustering techniques for identifying distinct content and finding semantically similar sentences of the document.

Document clustering methods can be mainly categorized into two types: document partitioning (flat clustering) and agglomerative (bottom-up hierarchical) clustering. Although both types of methods have been extensively investigated for several decades, accurately clustering documents without domain-dependent background information, predefined document categories, or a given list of topics is still a challenging task.

In this paper, we propose a novel document partitioning method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF) [7], each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our method differs from the latent semantic indexing method based on the singular value decomposition (SVD) and from the related spectral clustering methods in that the latent semantic space derived by NMF does not need to be orthogonal, and that each document is guaranteed to take only non-negative values in all the latent semantic directions. These two differences bring about an important benefit: each axis in the space derived by NMF has a much more straightforward correspondence with a document cluster than in the space derived by the SVD, and thereby document clustering results can be directly derived without additional clustering operations. Our experimental evaluations show that the proposed document clustering method surpasses SVD- and eigenvector-based clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracy.

2. RELATED WORKS
Generally, clustering methods can be categorized as agglomerative or partitional. Agglomerative clustering methods group the data points into a hierarchical tree structure, or dendrogram, in a bottom-up fashion. The procedure starts by placing each data point in a distinct cluster and then iteratively merges the two most similar clusters into one parent cluster. Upon completion, the procedure automatically generates a hierarchical structure for the data set. The complexity of these algorithms is O(n^2 log n), where n is the number of data points in the data set. Because of this quadratic order of complexity, bottom-up agglomerative clustering methods can become computationally prohibitive for clustering tasks that deal with millions of data points.

Document partitioning methods, on the other hand, decompose a document corpus into a given number of disjoint clusters that are optimal in terms of some predefined criterion functions. Partitioning methods can also generate a hierarchical structure of the document corpus by iteratively partitioning a large cluster into smaller ones. Typical methods in this category include K-means clustering [12], probabilistic clustering using the Naive Bayes or Gaussian mixture model [1, 9], etc. K-means produces a cluster set that minimizes the sum of squared errors between the documents and the cluster centers, while both the Naive Bayes and the Gaussian mixture models assign each document to the cluster that provides the maximum likelihood probability. The common drawback associated with these methods is that they all make harsh simplifying assumptions on the dis- [...]rithms (such as K-means) in the transformed space. Although it was claimed that each dimension of the singular vector space captures a base latent semantic of the document corpus, and that each document is jointly indexed by the base latent semantics in this space, the negative values in some of the dimensions generated by the SVD make the above explanation less meaningful.

In recent years, spectral clustering based on graph partitioning theories has emerged as one of the most effective document clustering tools. These methods model the given document set as an undirected graph in which each node represents a document and each edge (i, j) is assigned a weight wij reflecting the similarity between documents i and j. The document clustering task is accomplished by finding the best cuts of the graph that optimize certain predefined criterion functions. The optimization of the criterion functions usually leads to the computation of the singular vectors or eigenvectors of certain graph affinity matrices, and the clustering result can be derived from the obtained eigenvector space. Many criterion functions, such as the Average Cut [2], Average Association [11], Normalized Cut [11], and Min-Max Cut [5], have been proposed along with efficient algorithms for finding their optimal solutions. It can be proven that, under certain conditions, the eigenvector spaces computed by these methods are equivalent to the latent semantic space derived by the LSI method. As spectral clustering methods do not make naive assumptions on data distributions, and the optimization accomplished by solving certain generalized eigenvalue systems theoretically guarantees globally optimal solutions, these methods are generally far superior to traditional document clustering approaches. However, because of the use of singular vector or eigenvector spaces, all the methods in this category share the same problem as LSI: the eigenvectors computed from the graph affinity matrices usually do not correspond directly to individual clusters, and consequently traditional data clustering methods such as K-means have to be applied in the eigenvector space to find the final document clusters.

3. THE PROPOSED METHOD
Assume that a document corpus is comprised of k clusters, each of which corresponds to a coherent topic. Each document in the corpus either completely belongs to a particular topic, or is more or less related to several topics. To accurately cluster the given document corpus, it is ideal to project the corpus into a k-dimensional semantic space in which each axis corresponds to a particular topic. In such a semantic space, each document can be represented as a linear combination of the k topics.
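The projection idea above, i.e. factorize the term-document matrix and assign each document to the axis on which it has the largest projection value, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' exact algorithm: it assumes the standard Lee-Seung multiplicative update rules [7] for minimizing the Frobenius reconstruction error, and the function name and toy data are ours.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=300, seed=0):
    """Cluster the columns (documents) of a non-negative
    term-document matrix X into k clusters via NMF."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + 0.1   # basis vectors (base topics)
    V = rng.random((k, n)) + 0.1   # per-document topic weights
    eps = 1e-9                     # guards against division by zero
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates for min ||X - U V||_F^2
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
    # rescale weights by the basis-vector norms so that projection
    # values are comparable across axes before taking the argmax
    norms = np.linalg.norm(U, axis=0) + eps
    return (V * norms[:, None]).argmax(axis=0)

# toy corpus: documents 0-2 use only terms 0-1,
# documents 3-5 use only terms 2-3
X = np.array([[3., 2., 4., 0., 0., 0.],
              [1., 2., 1., 0., 0., 0.],
              [0., 0., 0., 2., 3., 1.],
              [0., 0., 0., 4., 1., 2.]])
labels = nmf_cluster(X, k=2)
```

Rescaling V by the column norms of U mirrors the kind of normalization needed because the NMF factorization is only determined up to a diagonal rescaling of U and V; without it, the argmax could be dominated by an arbitrarily scaled basis vector rather than by genuine cluster membership.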