Document Clustering using Word Clusters via the Information Bottleneck Method

Noam Slonim and Naftali Tishby
School of Computer Science and Engineering and The Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem 91904, Israel
email: {noamm,tishby}@cs.huji.ac.il

Abstract

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, $p(x,y)$, we first cluster the words, $Y$, so that the obtained word clusters, $\hat{Y}$, maximally preserve the information on the documents. The resulting joint distribution, $p(X,\hat{Y})$, contains most of the original information about the documents, $I(X;\hat{Y}) \approx I(X;Y)$, but it is much less sparse and noisy. Using the same procedure we then cluster the documents, $X$, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20 Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.

1 Introduction

Document clustering has long been an important problem in information retrieval. Early works suggested improving the efficiency and increasing the effectiveness of document retrieval systems by first grouping the documents into clusters (cf. [27] and the references therein). Recently, document clustering has been put forward as an important tool for Web search engines [15] [16] [18] [30], navigating and browsing document collections [5] [6] [8] [9] [23] and distributed retrieval [29]. Two types of clustering have been studied in the context of information retrieval systems: clustering the documents on the basis of the distributions of words that co-occur in the documents, and clustering the words using the distributions of the documents in which they occur (see [28] for an in-depth review). In this paper we propose a new method for document clustering, which combines these two approaches under a single information theoretic framework.

A recently introduced principle, termed the information bottleneck method [26], is based on the following simple idea. Given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other is preserved as much as possible. In our case these two variables correspond to the set of documents and the set of words. Thus, we may find word clusters that capture most of the information about the document corpus, or we may extract document clusters that capture most of the information about the words that occur. In this work we combine the two alternatives. We approach this problem using a two-stage algorithm. First, we extract word clusters that capture most of the information about the documents. In the second stage we replace the original representation of the documents, the co-occurrence matrix of documents versus words, by a much more compact representation based on the co-occurrences of the word clusters in the documents. Using this new document representation, we re-apply the same clustering procedure to obtain the desired document clusters. The main advantage of this double-clustering procedure lies in a significant reduction of the inevitable noise of the original co-occurrence matrix, due to its very high dimension. The reduced matrix, based on the word clusters, is denser and more robust, providing a better reflection of the inherent structure of the document corpus.

Our main concern is how well this method actually discovers this inherent structure. Therefore, instead of evaluating our procedure by its effectiveness for an IR system (e.g. [30]), we evaluate the method on a standard labeled corpus, commonly used to evaluate supervised text classification algorithms. In this way we circumvent the bias caused by the use of a specific IR system. In addition, we view the 'correct' labels of the documents as objective knowledge on the inherent structure of the dataset. Specifically, we used the 20 Newsgroups dataset, collected by Lang [12], which contains about 20,000 articles evenly distributed over 20 UseNet discussion groups. From this corpus we generated several subsets and measured clustering performance via the correlation between the obtained document clusters and the original newsgroups. We compared several clustering algorithms including the single-stage information bottleneck algorithm [24], Ward's method [1] and complete-linkage [28] using the standard tf-idf term weights [20]. We found that double-clustering, using the information bottleneck method, was significantly superior to all the other examined algorithms. In addition, the double-clustering procedure improved the performance of all the clustering methods examined in our experiments. In other words, clustering the documents by their words was always inferior to clustering by word-clusters.
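To make the two-stage procedure concrete, here is a minimal Python sketch of the double-clustering pipeline described above. It is not the authors' implementation: the `cluster_rows` hook stands in for any distributional clustering routine (such as the information bottleneck algorithm of Section 2), and all names are illustrative.

```python
import numpy as np

def conditional_rows(counts):
    """Normalize each row of a count matrix into a conditional distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def double_cluster(doc_word_counts, cluster_rows, n_word_clusters, n_doc_clusters):
    """Sketch of the two-stage ("double") clustering procedure.

    doc_word_counts : (n_docs, n_words) co-occurrence matrix of counts n(y, x);
                      every row and column is assumed to be non-empty.
    cluster_rows    : any routine mapping (rows, k) -> integer label per row,
                      e.g. a distributional / information bottleneck clusterer.
    """
    # Stage 1: cluster the words by their distributions over the documents.
    word_labels = cluster_rows(conditional_rows(doc_word_counts.T), n_word_clusters)

    # Replace the sparse document-word matrix by the much more compact
    # document-vs-word-cluster co-occurrence matrix.
    n_docs = doc_word_counts.shape[0]
    reduced = np.zeros((n_docs, n_word_clusters))
    for word, cluster in enumerate(word_labels):
        reduced[:, cluster] += doc_word_counts[:, word]

    # Stage 2: re-apply the same procedure to cluster the documents,
    # now represented by their distributions over the word clusters.
    doc_labels = cluster_rows(conditional_rows(reduced), n_doc_clusters)
    return word_labels, doc_labels
```

Any row-clustering procedure can be plugged into this skeleton, which is how the double-clustering variants compared in the experiments can be obtained from their single-stage counterparts.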
2 The Information Bottleneck Method

Most clustering algorithms start either from pairwise 'distances' between points (pairwise clustering) or with a distortion measure between a data point and a class centroid (vector quantization). Given the distance matrix or the distortion measure, the clustering task can be adapted in various ways into an optimization problem consisting of finding a small number of classes with low intraclass distortion or with high intraclass connectivity. The main problem with this approach is in the choice of the distance or distortion measures. Too often this is an arbitrary choice, sensitive to the specific representation, which may not accurately reflect the structure of the various components in the high dimensional data.

In the context of document clustering, a natural measure of similarity of two documents is the similarity between their conditional word distributions. Specifically, let $X$ be the set of documents and let $Y$ be the set of words; then for every document $x$ we can define

$$p(y|x) = \frac{n(y,x)}{\sum_{y \in Y} n(y,x)} \qquad (1)$$

where $n(y,x)$ is the number of occurrences of the word $y$ in the document $x$. Roughly speaking, we would like documents with similar conditional word distributions to belong to the same cluster. This formulation of finding a cluster hierarchy of the members of one set (e.g. documents), based on the similarity of their conditional distributions w.r.t. the members of another set (e.g. words), was first introduced in [17] and was called "distributional clustering".
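As a small worked illustration of Eq. (1), the empirical conditional word distribution of a document is simply its normalized count vector; the toy counts below are hypothetical and not data from the paper.

```python
import numpy as np

def word_distribution(counts_x):
    """Eq. (1): p(y|x) = n(y,x) / sum_y n(y,x), for one document's word counts."""
    counts_x = np.asarray(counts_x, dtype=float)
    return counts_x / counts_x.sum()

# Hypothetical toy example: two documents over a 4-word vocabulary.
doc_a = word_distribution([6, 2, 0, 0])   # p(y|a) = [0.75, 0.25, 0.0, 0.0]
doc_b = word_distribution([3, 1, 0, 0])   # p(y|b) = [0.75, 0.25, 0.0, 0.0]
# doc_a and doc_b have identical conditional word distributions, so a
# distributional clustering procedure should place them in the same cluster.
```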
The issue of selecting the 'right' distance measure between distributions remains, however, unresolved in that earlier work. Recently, Tishby, Pereira, and Bialek [26] proposed a principled approach to this problem, which avoids the arbitrary choice of a distortion or a distance measure. In this new approach, given the empirical joint distribution of two random variables $p(x,y)$, one variable is compressed so that the mutual information about the other is preserved as much as possible. Perhaps surprisingly, this general problem has an exact optimal formal solution without any assumption about the origin of the joint distribution $p(x,y)$ [26]. This solution is given in terms of the three distributions that characterize every cluster $\hat{x} \in \hat{X}$: the prior probability for this cluster, $p(\hat{x})$, its membership probabilities $p(\hat{x}|x)$, and its distribution over the relevance variable, $p(y|\hat{x})$. In general, the membership probabilities, $p(\hat{x}|x)$, are 'soft', i.e. every $x \in X$ can be assigned to every $\hat{x} \in \hat{X}$ with some (normalized) probability. The information bottleneck principle determines the distortion measure between the points $x$ and $\hat{x}$ to be $D_{KL}[p(y|x)\,\|\,p(y|\hat{x})] = \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|\hat{x})}$, the Kullback-Leibler divergence [4] between the conditional distributions $p(y|x)$ and $p(y|\hat{x})$. Specifically, the formal solution is given by the following equations which must be solved together,

$$p(\hat{x}|x) = \frac{p(\hat{x})}{Z(\beta,x)} \exp\!\left(-\beta\, D_{KL}\left[p(y|x)\,\|\,p(y|\hat{x})\right]\right)$$
$$p(y|\hat{x}) = \frac{1}{p(\hat{x})} \sum_{x} p(\hat{x}|x)\, p(x)\, p(y|x) \qquad (3)$$
$$p(\hat{x}) = \sum_{x} p(\hat{x}|x)\, p(x)$$

where $Z(\beta,x)$ is a normalization factor, and the single positive (Lagrange) parameter $\beta$ determines the "softness" of the classification. Intuitively, in this procedure the information contained in $X$ about $Y$ is 'squeezed' through a compact 'bottleneck' of clusters $\hat{X}$, which is forced to represent the 'relevant' part in $X$ w.r.t. $Y$.

2.1 Relation to previous work

An important information theoretic approach to word clustering was carried out by Brown et al. [3], who used n-gram models, and about the same time by Pereira, Tishby and Lee [17], who introduced an early version of the bottleneck method, using verb-object tagging for word sense disambiguation. Hofmann [10] has recently proposed another procedure, called probabilistic latent semantic indexing (PLSI), for automated document indexing, motivated by and based upon our earlier work. Using this procedure one can represent documents (and words) in a low-dimensional 'latent semantic space'. The latent variables defined in this scheme are somewhat analogous to our $\hat{X}$ variable. However, there are important differences between these approaches.
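Returning to the formal solution in Section 2, the following sketch iterates the three self-consistent equations in (3) at a fixed $\beta$. This is only an illustrative fixed-point iteration, not the information bottleneck algorithm of [24] used in the paper's experiments; the random initialization and all names are assumptions.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """D_KL[p_i || q_j] for every row i of p against every row j of q."""
    p = p + eps
    q = q + eps
    return np.sum(p[:, None, :] * np.log(p[:, None, :] / q[None, :, :]), axis=2)

def ib_fixed_point(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the self-consistent equations (3) at a fixed beta.

    p_xy : (|X|, |Y|) joint distribution p(x, y); entries sum to 1 and
           every row is assumed to carry positive mass.
    Returns the soft memberships p(x_hat|x), the cluster priors p(x_hat)
    and the cluster conditionals p(y|x_hat).
    """
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_x = p_xy / p_x[:, None]                  # p(y|x)

    # Random soft initialization of the memberships p(x_hat|x).
    q = rng.random((p_xy.shape[0], n_clusters))
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_xhat = q.T @ p_x                                         # p(x_hat) = sum_x p(x_hat|x) p(x)
        p_y_xhat = (q * p_x[:, None]).T @ p_y_x / p_xhat[:, None]  # p(y|x_hat)
        d = kl_rows(p_y_x, p_y_xhat)                               # D_KL[p(y|x) || p(y|x_hat)]
        q = p_xhat[None, :] * np.exp(-beta * d)                    # unnormalized p(x_hat|x)
        q /= q.sum(axis=1, keepdims=True)                          # divide by Z(beta, x)

    p_xhat = q.T @ p_x
    p_y_xhat = (q * p_x[:, None]).T @ p_y_x / p_xhat[:, None]
    return q, p_xhat, p_y_xhat
```

As the text notes, $\beta$ controls the softness of the classification: a small $\beta$ keeps the memberships $p(\hat{x}|x)$ close to uniform, while a large $\beta$ drives them toward hard assignments.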