Document Clustering using Word Clusters via the Information Bottleneck Method

Noam Slonim and Naftali Tishby

School of Computer Science and Engineering and The Interdisciplinary Center for Neural Computation

The Hebrew University, Jerusalem 91904, Israel.  Email: {noamm,tishby}@cs.huji.ac.il

Abstract

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x,y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X,Ỹ), contains most of the original information about the documents, I(X;Ỹ) ≈ I(X;Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word-clusters is preserved. Thus, we first find word-clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.

1 Introduction

Document clustering has long been an important problem in information retrieval. Early works suggested improving the efficiency and increasing the effectiveness of document retrieval systems by first grouping the documents into clusters (cf. [27] and the references therein). Recently, document clustering has been put forward as an important tool for Web search engines [15] [16] [18] [30], for navigating and browsing document collections [5] [6] [8] [9] [23], and for distributed retrieval [29]. Two types of clustering have been studied in the context of such systems: clustering the documents on the basis of the distributions of words that co-occur in the documents, and clustering the words using the distributions of the documents in which they occur (see [28] for an in-depth review). In this paper we propose a new method for document clustering which combines these two approaches under a single information theoretic framework.

A recently introduced principle, termed the information bottleneck method [26], is based on the following simple idea. Given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other is preserved as much as possible. In our case these two variables correspond to the set of documents and the set of words. Thus, we may find word-clusters that capture most of the information about the document corpus, or we may extract document clusters that capture most of the information about the words that occur. In this work we combine the two alternatives. We approach this problem using a two stage algorithm. First, we extract word-clusters that capture most of the information about the documents. In the second stage we replace the original representation of the documents, the co-occurrence matrix of documents versus words, by a much more compact representation based on the co-occurrences of the word-clusters in the documents. Using this new document representation, we re-apply the same clustering procedure to obtain the desired document clusters. The main advantage of this double-clustering procedure lies in a significant reduction of the inevitable noise of the original co-occurrence matrix, due to its very high dimension. The reduced matrix, based on the word-clusters, is denser and more robust, providing a better reflection of the inherent structure of the document corpus.

Our main concern is how well this method actually discovers this inherent structure. Therefore, instead of evaluating our procedure by its effectiveness for an IR system (e.g. [30]), we evaluate the method on a standard labeled corpus, commonly used to evaluate supervised text classification algorithms. In this way we circumvent the bias caused by the use of a specific IR system. In addition, we view the 'correct' labels of the documents as objective knowledge about the inherent structure of the dataset. Specifically, we used the 20Newsgroups dataset, collected by Lang [12], which contains about 20,000 articles evenly distributed over 20 UseNet discussion groups. From this corpus we generated several subsets and measured clustering performance via the correlation between the obtained document clusters and the original newsgroups. We compared several clustering algorithms, including the single-stage information bottleneck algorithm [24], Ward's method [1] and complete-linkage [28] using the standard tf-idf term weights [20]. We found that double-clustering, using the information bottleneck method, was significantly superior to all the other examined algorithms. In addition, the double-clustering procedure improved the performance of all the distance measures examined, in all our experiments. In other words, clustering the documents by their words was always inferior to clustering them by word-clusters.

2 The Information Bottleneck Method

Most clustering algorithms start either from pairwise 'distances' between points (pairwise clustering) or from a distortion measure between a data point and a class centroid (vector quantization). Given the distance matrix or the distortion measure, the clustering task can be adapted in various ways into an optimization problem consisting of finding a small number of classes with low intraclass distortion or with high intraclass connectivity. The main problem with this approach is the choice of the distance or distortion measure. Too often this is an arbitrary choice, sensitive to the specific representation, which may not accurately reflect the structure of the various components in the high dimensional data.

In the context of document clustering, a natural measure of similarity of two documents is the similarity between their conditional word distributions. Specifically, let X be the set of documents and let Y be the set of words; then for every document x we can define

    p(y|x) = n(x,y) / Σ_{y'∈Y} n(x,y')                                        (1)

where n(x,y) is the number of occurrences of the word y in the document x. (Footnote 1: Note that under this definition the priors of all documents are uniformly normalized to p(x) = 1/|X|; thus we avoid an undesirable bias due to different document lengths.) Roughly speaking, we would like documents with similar conditional word distributions to belong to the same cluster. This formulation of finding a cluster hierarchy of the members of one set (e.g. documents), based on the similarity of their conditional distributions w.r.t. the members of another set (e.g. words), was first introduced in [17] and was called "distributional clustering".

The issue of selecting the 'right' distance measure between distributions, however, remains unresolved in that earlier work. Recently, Tishby, Pereira and Bialek [26] proposed a principled approach to this problem, which avoids the arbitrary choice of a distortion or a distance measure. In this new approach, given the empirical joint distribution of two random variables p(x,y), one looks for a compact representation of X which preserves as much information as possible about the relevant variable Y. This simple intuitive idea has a natural information theoretic formulation: find clusters of the members of the set X, denoted here by X̃, such that the mutual information I(X̃;Y) is maximized, under a constraint on the information extracted from X, I(X̃;X).

The mutual information I(X;Y) between the random variables X and Y is given by (e.g. [4])

    I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]              (2)

and is the only consistent statistical measure of the information that variable X contains about variable Y. The compactness of the representation is determined by I(X̃;X), while the quality of the clusters, X̃, is measured by the fraction of the information they capture about Y, namely I(X̃;Y)/I(X;Y). Perhaps surprisingly, this general problem has an exact optimal formal solution without any assumption about the origin of the joint distribution p(x,y) [26]. This solution is given in terms of the three distributions that characterize every cluster x̃ ∈ X̃: the prior probability of the cluster, p(x̃), its membership probabilities p(x̃|x), and its distribution over the relevance variable, p(y|x̃). In general, the membership probabilities p(x̃|x) are 'soft', i.e. every x ∈ X can be assigned to every x̃ ∈ X̃ with some (normalized) probability. The information bottleneck principle determines the distortion measure between the points x and x̃ to be D_KL[p(y|x) || p(y|x̃)] = Σ_y p(y|x) log [ p(y|x) / p(y|x̃) ], the Kullback-Leibler divergence [4] between the conditional distributions p(y|x) and p(y|x̃). Specifically, the formal solution is given by the following equations, which must be solved together:

    p(x̃|x) = [ p(x̃) / Z(β,x) ] exp( -β D_KL[ p(y|x) || p(y|x̃) ] )
    p(y|x̃) = (1/p(x̃)) Σ_{x∈X} p(y|x) p(x̃|x) p(x)                             (3)
    p(x̃)   = Σ_{x∈X} p(x̃|x) p(x)

where Z(β,x) is a normalization factor, and the single positive (Lagrange) parameter β determines the "softness" of the classification. Intuitively, in this procedure the information contained in X about Y is 'squeezed' through a compact 'bottleneck' of clusters X̃, which is forced to represent the 'relevant' part of X w.r.t. Y.
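To make Eqs. (1) and (2) concrete, here is a minimal Python sketch (ours, not part of the original paper; the function names are illustrative only) that builds the joint distribution p(x,y) from a raw document-word count matrix, using the uniform document priors of footnote 1, and then evaluates the mutual information of Eq. (2).

```python
import numpy as np

def joint_from_counts(counts):
    """Build p(x, y) from an |X| x |Y| count matrix n(x, y).

    Documents get uniform priors p(x) = 1/|X| (footnote 1), so
    p(x, y) = p(x) * p(y|x) with p(y|x) = n(x, y) / sum_y' n(x, y').
    """
    counts = np.asarray(counts, dtype=float)
    p_y_given_x = counts / counts.sum(axis=1, keepdims=True)   # Eq. (1)
    p_x = np.full(counts.shape[0], 1.0 / counts.shape[0])       # uniform prior
    return p_x[:, None] * p_y_given_x                           # p(x, y)

def mutual_information(p_xy):
    """I(X;Y) of Eq. (2), in bits, skipping zero entries."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Tiny illustration: 3 documents over a 4-word vocabulary (made-up counts).
n = np.array([[4, 1, 0, 0],
              [3, 2, 0, 1],
              [0, 0, 5, 2]])
print(mutual_information(joint_from_counts(n)))
```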

2.1 Relation to previous work

An important information theoretic approach to word clustering was carried out by Brown et al. [3], who used n-gram models, and at about the same time by Pereira, Tishby and Lee [17], who introduced an early version of the bottleneck method, using verb-object tagging for word sense disambiguation. Hofmann [10] has recently proposed another procedure, called probabilistic latent semantic indexing (PLSI), for automated document indexing, motivated by and based upon our earlier work. Using this procedure one can represent documents (and words) in a low-dimensional 'latent semantic space'. The latent variables defined in this scheme are somewhat analogous to our cluster variable X̃. However, there are important differences between these approaches. First, while PLSI assumes a generative hidden variable model for the data and uses maximum likelihood for estimating the latent variables, the information bottleneck method makes no assumption about the structure of the data distribution (it is not a hidden variable model) and uses a variational principle to optimize the relevant information in the co-occurrence data directly, in order to extract the new representation. Second, the PLSI model is based on a conditional-independence assumption, i.e. given the latent variables the words and documents are independent, which is not needed in our approach. Another important advantage of our method is that it has a complete (formal) analytic solution, enabling a better understanding of the resulting classification.

3 The Agglomerative Information Bottleneck Algorithm

As has been shown in [24] [25], there is a simple implementation of the information bottleneck method, restricted to the case of 'hard' clusters. In this case every x ∈ X belongs to precisely one cluster x̃ ∈ X̃. This restriction, which corresponds to the limit β → ∞ in Eqs. (3), yields a natural distance measure between distributions which can easily be implemented in an agglomerative hierarchical clustering procedure.

Let x̃ ⊆ X denote a specific (hard) cluster; then, following [25], we define

    p(x̃|x) = 1 if x ∈ x̃, and 0 otherwise
    p(y|x̃) = (1/p(x̃)) Σ_{x∈x̃} p(x,y)                                         (4)
    p(x̃)   = Σ_{x∈x̃} p(x)

Using these distributions one can easily evaluate the mutual information between the set of clusters X̃ and Y via Eq. (2). As stated earlier, the objective of the information bottleneck method is to extract partitions of X, defined by the mapping p(x̃|x), that maximize the mutual information functional. Note that the roles of X and Y in this scenario can be switched: we may extract clusters of words which capture maximum mutual information about the documents, or find clusters of documents that capture the mutual information about the words. We utilize this symmetry in the double-clustering procedure (see section 5).

The general framework applied here is an agglomerative greedy hierarchical clustering algorithm. The algorithm starts with a trivial partitioning into |X| singleton clusters, where each cluster contains exactly one element of X. At each step we merge two components of the current partition into a single new component in a way that locally minimizes the loss of mutual information I(X̃;Y). Every merger, (x̃_i, x̃_j) ⇒ x̃*, is formally defined by the following equations:

    p(x̃*|x) = 1 if x ∈ x̃_i or x ∈ x̃_j, and 0 otherwise
    p(y|x̃*) = [ p(x̃_i)/p(x̃*) ] p(y|x̃_i) + [ p(x̃_j)/p(x̃*) ] p(y|x̃_j)          (5)
    p(x̃*)   = p(x̃_i) + p(x̃_j)

The decrease in the mutual information I(X̃;Y) due to this merger is defined by δI(x̃_i, x̃_j) ≡ I(X̃_before; Y) − I(X̃_after; Y), where I(X̃_before; Y) and I(X̃_after; Y) are the information values before and after the merger, respectively. After a little algebra [25] one can see that

    δI(x̃_i, x̃_j) = [ p(x̃_i) + p(x̃_j) ] · D_JS[ p(y|x̃_i), p(y|x̃_j) ]           (6)

where the functional D_JS is the Jensen-Shannon (JS) divergence (see [13] [7]), defined as

    D_JS[p_i, p_j] = π_i D_KL[ p_i || p̄ ] + π_j D_KL[ p_j || p̄ ]               (7)

where in our case

    { p_i, p_j } ≡ { p(y|x̃_i), p(y|x̃_j) }
    { π_i, π_j } ≡ { p(x̃_i)/p(x̃*), p(x̃_j)/p(x̃*) }                             (8)
    p̄ ≡ π_i p(y|x̃_i) + π_j p(y|x̃_j)

The JS-divergence is non-negative and equals zero if and only if both of its arguments are identical. It is upper bounded (by 1) and symmetric, though it is not a metric. Note that the "merger cost", δI(x̃_i, x̃_j), can now be interpreted as the product of the 'weight' of the merged elements, p(x̃_i) + p(x̃_j), and their 'distance', D_JS[p(y|x̃_i), p(y|x̃_j)].

By introducing the information optimization criterion, the resulting similarity measure emerges directly from the analysis. The algorithm is now very simple: at each step we perform "the best possible merge", i.e. we merge the clusters {x̃_i, x̃_j} which minimize δI(x̃_i, x̃_j). In figure 1 we provide the pseudo-code of this agglomerative procedure.

Input: Joint probability distribution p(x,y)
Output: A partition of X into m clusters, for every m = 1, ..., |X|
Initialization:
  - Construct X̃ ≡ X
  - For every pair x̃_i, x̃_j ∈ X̃, calculate d_{i,j} = [ p(x̃_i) + p(x̃_j) ] · D_JS[ p(y|x̃_i), p(y|x̃_j) ]
Loop:
  - For t = 1, ..., |X| − 1
      - Find the indices {i, j} for which d_{i,j} is minimized
      - Merge {x̃_i, x̃_j} ⇒ x̃*
      - Update X̃ = { X̃ − {x̃_i, x̃_j} } ∪ { x̃* }
      - Update the d_{i,j} costs w.r.t. x̃*
  - End For

Figure 1: Pseudo-code of the agglomerative information bottleneck algorithm.
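The following Python sketch (ours, for illustration only, not the authors' implementation) follows Figure 1 directly: it builds the cluster statistics of Eq. (4), merges with Eq. (5), and scores candidate mergers with the cost of Eq. (6). It recomputes all pairwise costs at every step, so it makes no attempt at the efficiency a real implementation would need.

```python
import numpy as np

def js_divergence(p, q, pi_p, pi_q):
    """Jensen-Shannon divergence of Eq. (7) with priors {pi_p, pi_q}."""
    m = pi_p * p + pi_q * q
    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log2(a[mask] / b[mask])).sum())
    return pi_p * kl(p, m) + pi_q * kl(q, m)

def merge_cost(p_y_given, weights, i, j):
    """Merger cost of Eq. (6): (p(xi~) + p(xj~)) times the JS divergence."""
    w = weights[i] + weights[j]
    return w * js_divergence(p_y_given[i], p_y_given[j],
                             weights[i] / w, weights[j] / w)

def agglomerative_ib(p_xy, n_clusters):
    """Greedy bottom-up merging of Figure 1; returns lists of member indices."""
    p_xy = np.asarray(p_xy, dtype=float)
    weights = list(p_xy.sum(axis=1))                   # p(x~), Eq. (4)
    p_y_given = [row / row.sum() for row in p_xy]      # p(y|x~), Eq. (4)
    members = [[i] for i in range(p_xy.shape[0])]
    while len(members) > n_clusters:
        # pick the pair whose merge loses the least mutual information
        i, j = min(((a, b) for a in range(len(members))
                    for b in range(a + 1, len(members))),
                   key=lambda ab: merge_cost(p_y_given, weights, *ab))
        w = weights[i] + weights[j]
        p_y_given[i] = (weights[i] * p_y_given[i]
                        + weights[j] * p_y_given[j]) / w        # Eq. (5)
        weights[i] = w
        members[i] += members[j]
        del p_y_given[j], weights[j], members[j]
    return members
```

Calling agglomerative_ib(p_xy, m) on a joint distribution such as the one produced by the earlier sketch returns m hard clusters as lists of row indices.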



4 Other Clustering Methods

Within the same general framework of the agglomerative clustering algorithm, we applied two other similarity criteria to construct two additional algorithms for purposes of comparison. First, a common and natural distance measure between probability distributions is the L1 norm (or variational distance), defined as

    || p(y|x̃_i) − p(y|x̃_j) ||_1 = Σ_{y∈Y} | p(y|x̃_i) − p(y|x̃_j) |             (9)

Unlike the KL-divergence, the L1 norm is a distance measure satisfying all the metric properties, including the triangle inequality. It also approximates the KL-divergence for close distributions [13]. Our second clustering algorithm therefore used the following distributional similarity measure:

    d_{i,j} = [ p(x̃_i) + p(x̃_j) ] · || p(y|x̃_i) − p(y|x̃_j) ||_1               (10)

Note that the multiplication by the 'weight' of the clusters to be merged is crucial; otherwise there is a strong bias for assigning all objects into one cluster. Besides these two algorithms, which are motivated by probability theory, our third comparison algorithm is the standard Ward's method, which is based on the Euclidean distance [1]. The similarity measure for this algorithm is given by

    d_{i,j} = Σ_y ( p(y|x̃_i) − p(y|x̃_j) )^2 · [ p(x̃_i) p(x̃_j) / ( p(x̃_i) + p(x̃_j) ) ]   (11)

In addition, we also implemented a complete-linkage (agglomerative) algorithm (see e.g. [28]) which uses the conventional tf-idf term weights [20] to represent the documents in a vector space model. In this method the least similar pair of documents, one from each cluster, determines the similarity between clusters. Specifically,

    s(x̃_i, x̃_j) = min_{x ∈ x̃_i, x' ∈ x̃_j} s(x, x')                             (12)

where for s(x,x') we used the cosine of the angle between the two tf-idf vectors representing the documents. We also implemented a single-linkage algorithm, for which the most similar pair of documents, one from each cluster, determines the similarity between clusters. However, the results for this method were significantly inferior to those of the complete-linkage algorithm (due to a strong tendency to cluster all documents into one huge cluster), so we do not report these data here.
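For comparison, the three distributional merge criteria of Eqs. (6), (10) and (11) can be written side by side. The sketch below is ours (illustrative names, made-up toy distributions); it only shows how each criterion combines the cluster 'weights' p(x̃_i), p(x̃_j) with a different notion of distance between the conditionals p(y|x̃_i) and p(y|x̃_j).

```python
import numpy as np

def ib_cost(p_i, p_j, w_i, w_j):
    """Information bottleneck merge cost, Eq. (6): weight times JS divergence."""
    w = w_i + w_j
    m = (w_i * p_i + w_j * p_j) / w
    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log2(a[mask] / b[mask])).sum())
    return w * ((w_i / w) * kl(p_i, m) + (w_j / w) * kl(p_j, m))

def l1_cost(p_i, p_j, w_i, w_j):
    """L1 (variational) criterion of Eq. (10)."""
    return (w_i + w_j) * float(np.abs(p_i - p_j).sum())

def ward_cost(p_i, p_j, w_i, w_j):
    """Ward / Euclidean criterion of Eq. (11)."""
    return float(((p_i - p_j) ** 2).sum()) * (w_i * w_j) / (w_i + w_j)

# Two toy cluster conditionals p(y|x~) with cluster priors 0.6 and 0.4:
p1 = np.array([0.5, 0.3, 0.2, 0.0])
p2 = np.array([0.1, 0.2, 0.3, 0.4])
for cost in (ib_cost, l1_cost, ward_cost):
    print(cost.__name__, round(cost(p1, p2, 0.6, 0.4), 4))
```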

5 The Double Clustering Procedure

The three criteria described in Eqs. (6), (10) and (11) are essentially symmetric with regard to the roles of X and Y. In other words, there is no prior requirement regarding which variable should be compressed. In this work we suggest a combination of the two options. To do so we introduce a two-stage clustering procedure. In the first stage we represent each word y by its conditional distribution over the set of documents, p(x|y). We then use a distributional clustering algorithm to obtain word-clusters, denoted by Ỹ, with |Ỹ| ≪ |Y|. In the second stage we use these word-clusters to replace the original representation of the documents. Instead of representing a document by its conditional distribution of words, p(y|x), we represent it by its conditional distribution of word-clusters, p(ỹ|x), defined by

    p(ỹ|x) = Σ_{y∈ỹ} p(x,y) / Σ_{y'∈Y} p(x,y') = Σ_{y∈ỹ} p(y|x)                (13)

Using this compact representation, we re-apply the distributional clustering algorithm to extract the desired document clusters, X̃. In figure 2 we outline the double clustering procedure.

Input: Joint probability distribution p(x,y)
First Stage: Find word-clusters Ỹ, using the word representations p(x|y)
Second Stage:
  - For every x ∈ X, replace the representation p(y|x) by the more compact representation p(ỹ|x)
  - Find document clusters X̃, using p(ỹ|x)

Figure 2: The double-clustering procedure.

Using the information bottleneck method in this double-clustering framework provides clear-cut information about the nature of the clusters obtained. In the first stage the algorithm extracts word-clusters which capture most of the relevant information about the given documents; more formally stated, in the first stage the algorithm finds a set of clusters Ỹ such that I(Ỹ;X) ≈ I(Y;X). In the second stage, the algorithm extracts document clusters, X̃, that capture most of the relevant information about the word-clusters. We therefore obtain a significant reduction in both dimensions of the original variables, without losing too much of their mutual information: I(X̃;Ỹ) ≈ I(X;Ỹ) ≈ I(X;Y).
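Assuming word-clusters are already available (for instance from the agglomerative_ib sketch of Section 3), the second stage reduces each document to the compact representation of Eq. (13). The following sketch is ours and only illustrative; word_to_cluster is a hypothetical list mapping each word index to its cluster index.

```python
import numpy as np

def word_cluster_representation(p_xy, word_to_cluster, n_word_clusters):
    """Second stage of the double-clustering procedure (Eq. 13).

    Sums the document-word joint distribution over the words of each
    word-cluster and renormalizes per document, yielding p(y~|x).
    """
    p_xy = np.asarray(p_xy, dtype=float)
    collapsed = np.zeros((p_xy.shape[0], n_word_clusters))
    for y, c in enumerate(word_to_cluster):
        collapsed[:, c] += p_xy[:, y]          # sum of p(x, y) over y in cluster c
    return collapsed / collapsed.sum(axis=1, keepdims=True)    # Eq. (13)

# A possible end-to-end usage, reusing the agglomerative_ib sketch of Section 3:
#   word_clusters = agglomerative_ib(p_xy.T, n_clusters=10)    # cluster words on p(x|y)
#   word_to_cluster = [0] * p_xy.shape[1]
#   for c, ys in enumerate(word_clusters):
#       for y in ys:
#           word_to_cluster[y] = c
#   p_yt_given_x = word_cluster_representation(p_xy, word_to_cluster, 10)
#   p_x_yt = p_yt_given_x / p_xy.shape[0]                      # uniform document priors
#   doc_clusters = agglomerative_ib(p_x_yt, n_clusters=5)      # cluster documents on p(y~|x)
```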

6 The Experimental Design

In this section we describe our experiments and present a new objective method for evaluating document clustering procedures. In addition we describe the datasets used in our experiments, which are all based on a standard IR corpus, the 20Newsgroups corpus.

6.1 The evaluation method

In general, measuring clustering effectiveness is not a trivial issue. Standard measures such as the average distance between data points and candidate class centroids are rather abstract for our needs and, furthermore, as already mentioned, it is not clear what distance measure should be used. In most of the previous work on document clustering, the performance of the clustering has been measured in terms of its effectiveness over some information retrieval system. Specifically, the clustering results are used to reorder the list of documents returned by the IR system, under the assumption that the user is able to select the clusters with the highest relevant-document density [9] [22] [30]. There are several problems with this evaluation method. First, as noted in [30], empirical tests have shown that users fail to choose the best cluster about 20% of the time [9]. Second, generating the document collections from the results obtained by an IR system w.r.t. some queries is sensitive to the specific IR system and queries being used, which may introduce an unclear bias over the datasets. Third, this evaluation method does not measure directly how well the inherent structure of the document corpus is revealed by the clustering procedure, but rather provides indirect estimates through the IR system performance. To overcome these problems we propose a simple solution, which essentially estimates document clustering performance with the tools used for supervised text classification tasks. In other words, since our interest is in measuring how well the clustering process can reveal the inherent structure of a given document collection, we use a standard labeled text classification corpus to construct our datasets, while using the labels as clear objective knowledge reflecting the inherent structure of the data. In addition, we adapt the accuracy measure used by supervised learning algorithms to our needs. Specifically, we measure clustering performance by the accuracy given by the contingency table of the obtained clusters and the 'real' document categories.
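One natural reading of this accuracy measure (the paper does not spell out the exact cluster-to-category assignment, so this is an assumption on our part) credits each obtained cluster with its dominant 'real' category and counts the documents that fall there. The sketch below, ours and illustrative, computes it for the contingency table reported later in Table 2.

```python
import numpy as np

def clustering_accuracy(contingency):
    """Accuracy read off a clusters-by-categories contingency table.

    Each obtained cluster is credited with its dominant 'real' category;
    accuracy is the fraction of documents in their cluster's dominant category.
    """
    contingency = np.asarray(contingency)
    return contingency.max(axis=1).sum() / contingency.sum()

# The Multi5_2 contingency table shown in Table 2 below:
table2 = [[78,  3, 11,  6, 10],
          [ 3, 68,  7,  5,  5],
          [ 4,  5, 59,  8,  9],
          [ 6, 14, 13, 68, 13],
          [ 9, 10, 10, 13, 63]]
print(clustering_accuracy(table2))   # ~0.672
```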

6.2 The datasets

We constructed 10 different document subsets of the 20Newsgroups corpus collected by Lang [12]. This corpus contains about 20,000 articles evenly distributed among 20 UseNet discussion groups, and is usually employed for evaluating supervised text classification techniques (e.g. [2] [21]). Many of these groups have similar topics (e.g. five groups discuss different issues concerning computers). In addition, as pointed out by Schapire and Singer [21], a small percentage of the documents in this corpus are present in more than one group (since people tend to post articles to multiple newsgroups). Therefore, the 'real' clusters are inherently fuzzy. For our tests we used 10 different randomly chosen subsets of this corpus. The details of these subsets are given in table 1. Our pre-processing included ignoring all file headers, converting upper case characters to lower case, and ignoring all words that contained digits or non alpha-numeric characters. We did not use a stop-list or any stemming procedure. However, we included a standard feature selection mechanism, where for each dataset we selected the 2000 words with the highest contribution to the mutual information between the words and the documents. More formally stated, for each dataset we sorted all words by

    I(y) ≡ Σ_{x∈X} p(x,y) log [ p(x|y) / p(x) ]

and selected the top 2000.
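The feature selection step can be written directly from this formula. The sketch below is ours (illustrative names); it scores every word by its additive contribution to I(X;Y) and keeps the indices of the 2000 highest-scoring words.

```python
import numpy as np

def top_words_by_information(p_xy, k=2000):
    """Rank words by their contribution to I(X;Y) and keep the top k.

    The per-word score is sum_x p(x, y) * log[ p(x|y) / p(x) ], i.e. the
    y-th term of the mutual information sum, as used in Section 6.2.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)            # document priors
    p_y = p_xy.sum(axis=0, keepdims=True)            # word marginals
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = np.where(p_xy > 0, p_xy / (p_x @ p_y), 1.0)
    scores = (p_xy * np.log2(ratio)).sum(axis=0)     # one score per word
    return np.argsort(scores)[::-1][:k]              # indices of the k best words
```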

Dataset         Newsgroups included                                              Documents    Total
                                                                                 per group    documents
Science         sci.crypt, sci.electronics, sci.med, sci.space                   500          2000
Binary_1,2,3    talk.politics.mideast, talk.politics.misc                        250          500
Multi5_1,2,3    comp.graphics, rec.motorcycles, rec.sport.baseball,              100          500
                sci.space, talk.politics.mideast
Multi10_1,2,3   alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos,      50          500
                rec.sport.hockey, sci.crypt, sci.electronics, sci.med,
                sci.space, talk.politics.guns

Table 1: Dataset details. For example, for each of the three Binary datasets we randomly chose 500 documents, evenly distributed between the newsgroups talk.politics.mideast and talk.politics.misc. This resulted in three document collections, Binary_1, Binary_2 and Binary_3, each of which consisted of 500 documents.

7 Experimental Results

For each of our 10 document collections we tested the following clustering algorithms:

- IB_double: the double-clustering procedure using the distance measure derived from the information bottleneck method (Eq. 6).
- L1_double: the double-clustering procedure using the L1-norm distance measure (Eq. 10).
- Ward_double: the double-clustering procedure using Ward's distance measure, i.e. the Euclidean norm (Eq. 11).
- IB_single: clustering the documents based on the original co-occurrence matrix of documents versus words, using the distance measure derived from the information bottleneck method (Eq. 6).
- L1_single: clustering the documents based on the original co-occurrence matrix of documents versus words, using the L1-norm distance measure (Eq. 10).
- Ward_single: clustering the documents based on the original co-occurrence matrix of documents versus words, using Ward's distance measure (Eq. 11).
- tf-idf: clustering the documents based on the original co-occurrence matrix of documents versus words, using a complete-linkage algorithm and the tf-idf term weights. The similarity between documents was estimated by the cosine of the angle between the two tf-idf vectors representing the documents.

To avoid bias due to the number of word-clusters used by the double-clustering procedures, we tested the performance of these algorithms for various numbers of word-clusters; specifically, we used 10, 20, 30, 40 and 50 word-clusters. Performance was estimated as the accuracy given by the contingency table of the obtained clusters and the real document categories. (Footnote 2: Another possible measure for the quality of the obtained clusters is the mutual information given by the contingency table. This measure is highly correlated with the accuracy measure we used in this work. For purposes of comparison, we also present the averaged results using the mutual information as the quality measure in figure 3.) For example, in table 2 we present the contingency table and the accuracy for the IB_double algorithm over the Multi5_2 dataset.

          graphics   motorcycles   baseball   space   mideast
    x̃_1      78           3           11        6        10
    x̃_2       3          68            7        5         5
    x̃_3       4           5           59        8         9
    x̃_4       6          14           13       68        13
    x̃_5       9          10           10       13        63

Table 2: Contingency table for the IB_double algorithm over the Multi5_2 dataset (documents clustered via word-clusters). The accuracy is 67.2%.

Note that this accuracy was obtained in an unsupervised manner, without using any of the document labels. The number of document clusters used for evaluating the contingency table was generally set to be identical to the number of 'real' categories (except for the Binary datasets, for which we used more document clusters than the 2 'real' categories). This is equivalent to a simplifying assumption that a user is approximately aware of the number of 'real' categories in the document collection. Choosing the appropriate number of document clusters without any prior knowledge about the data is a question of model selection which is beyond the scope of this work. We note, however, that this problem could be addressed using standard techniques such as cross-validation.

Detailed results for all three double-clustering algorithms over all 10 datasets are given in figure 3. In table 3 we list the results for the three single-clustering procedures, as well as for the tf-idf algorithm; for purposes of comparison we also include the average results of the double-clustering procedures.

Data / algorithm   IB_double   IB_single   L1_double   L1_single   Ward_double   Ward_single   tf-idf
Science              0.59        0.49        0.41        0.34         0.33          0.29        0.47
Binary_1             0.70        0.71        0.61        0.62         0.59          0.56        0.67
Binary_2             0.68        0.60        0.60        0.57         0.56          0.51        0.61
Binary_3             0.75        0.70        0.66        0.65         0.60          0.60        0.52
Multi5_1             0.59        0.42        0.43        0.38         0.34          0.27        0.51
Multi5_2             0.58        0.40        0.43        0.36         0.29          0.28        0.34
Multi5_3             0.53        0.50        0.46        0.34         0.34          0.29        0.63
Multi10_1            0.35        0.24        0.31        0.24         0.20          0.19        0.27
Multi10_2            0.35        0.26        0.28        0.27         0.21          0.20        0.33
Multi10_3            0.35        0.29        0.27        0.28         0.20          0.17        0.34
Average              0.55        0.46        0.45        0.40         0.37          0.34        0.47

Table 3: Averaged results for all clustering procedures over all datasets. For the double-clustering algorithms the results for every dataset are averaged over the different numbers of word-clusters used in the process.

Several results should be noted specifically:

- Using the information bottleneck algorithm within the double-clustering procedure (i.e. the IB_double algorithm) resulted in superior performance compared to the other algorithms. Specifically, the average performance over all datasets attained 0.55 accuracy, while the second best result was 0.47 accuracy (for the tf-idf algorithm).
- The double-clustering algorithms were tested using 10, 20, 30, 40 and 50 word-clusters for every dataset, i.e. 50 runs in all (5 for each dataset). In almost all of these runs the IB_double performance was superior to that of the other double-clustering algorithms; of these, L1_double was usually better than Ward_double. In addition, in the great majority of these 50 runs, double-clustering improved the performance for all the distance measures used in the clustering process. In other words, IB_double, L1_double and Ward_double were almost always superior to IB_single, L1_single and Ward_single, respectively. The most significant improvement was for the information bottleneck algorithms (IB_double versus IB_single).
- The single-stage procedures were tested once for each dataset, i.e. 10 runs. Averaging over these runs, the tf-idf algorithm was slightly better than the IB_single algorithm, which on its own was significantly better than the L1_single algorithm. The Ward_single algorithm exhibited the weakest performance for almost all datasets.
- The best performance of all algorithms was obtained over the three Binary datasets. For the Multi5 and Science datasets performance was usually similar. The weakest performance was obtained consistently for the Multi10 datasets. In other words, as expected, increasing the number of categories resulted in poorer performance, regardless of the algorithm used.

Footnote 3: To gain some perspective we also tested the performance of the Rainbow software package [14] using a naive Bayes supervised classifier. The training set for each dataset was set to 200 documents (evenly distributed), and the test sets were identical to those listed in table 1. Averaging over several runs, the supervised classifier naturally attained a higher overall accuracy; however, for the three Multi10 datasets, Rainbow's averaged performance was definitely comparable to the unsupervised IB_double average performance of 0.35 accuracy.

8 Discussion and Further Work

In this paper we presented a novel, principled approach to the clustering of documents, which outperformed several other commonly used algorithms. The combination of two novel ingredients contributed to this work. The first is the information bottleneck method, a principled information theoretic approach to distributional clustering. It provides an objective measure of the quality of the obtained clusters - the extracted relevant information - as well as a well justified distributional similarity measure. This measure, which emerges directly from first principles, is the KL-divergence to the mean, or the Jensen-Shannon divergence, and is in this information theoretic sense the optimal similarity measure. The second ingredient is the double clustering procedure. This mechanism, which can be used with any distributional clustering algorithm, amounts to clustering in both dimensions: first, words based on their document distributions, and second, documents based on their word-cluster distributions. We demonstrated that the double-clustering procedure is useful for all the similarity measures we examined. When combined with the information bottleneck measure, the results were clearly better. The method is shown to provide good document classification accuracy for the 20Newsgroups dataset, in a fully unsupervised manner, without using any training labels. We argue that these results demonstrate both the validity of the information bottleneck method and the power of the double-clustering procedure for this problem.

The agglomerative procedure used in this work has a time complexity which grows cubically with the number of clustered elements, and is therefore not suitable for very large datasets. Several techniques have been proposed for dealing with this issue, and they can also be employed here. For example, a somewhat similar clustering procedure was recently used by Baker and McCallum [2] for finding word clusters in supervised text categorization. To avoid high complexity they suggested using a fixed small number of clusters, where in each step the two most similar clusters are joined and another new word is added as a new cluster. Their work pointed out that distributional clustering can be useful for a significant reduction of the feature dimensionality with only a minor decrease in classification accuracy. The present work, in contrast, shows that for unsupervised document classification, using word clusters is not only more efficient, but also leads to a significant improvement in performance.

The double-clustering procedure used here is a two-stage process: first we find word-clusters and then we use them to obtain document clusters. A natural generalization is to try to compress both dimensions of the original co-occurrence matrix simultaneously, and we are working in this direction. In addition, the agglomerative information bottleneck algorithm used here is a special case of a more general algorithm which yields even better performance on the same data. This more general approach is beyond the scope of this work and will be presented elsewhere [25].

Acknowledgments

Useful discussions with Yoram Singer and Fernando Pereira are greatly appreciated. This research was supported by grants from the Israeli Ministry of Science, and by the US-Israel Bi-national Science Foundation (BSF).

[Figure 3 appears here: twelve panels plotting performance against the number of word-clusters (10 to 50) for the IB_double, L1_double and Ward_double procedures - averaged mutual information results, averaged accuracy results, and accuracy curves for the Science, Binary_1-3, Multi5_1-3 and Multi10_1-3 datasets.]

Figure 3: Results for all three double-clustering procedures over all data sets. The horizontal axis corresponds to the number of word-clusters used to cluster the documents. The top left panel presents averaged results in terms of the mutual information given by the contingency table of the document clusters and the real categories; note the similarity with the averaged accuracy results.

References

[1] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
[2] L. D. Baker and A. K. McCallum. Distributional Clustering of Words for Text Classification. In ACM SIGIR 98, 1998.
[3] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4), pages 467-477, 1992.
[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.
[5] D. R. Cutting, D. R. Karger, J. O. Pedersen and J. W. Tukey. Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections. In ACM SIGIR 92, pages 318-329, 1992.
[6] D. R. Cutting, D. R. Karger and J. O. Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections. In ACM SIGIR 93, pages 126-134, 1993.
[7] R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of Markovian sequences. In Advances in Neural Information Processing Systems (NIPS-97), pages 465-471, 1997.
[8] K. Eguchi. Adaptive Cluster-based Browsing Using Incrementally Expanded Queries and Its Effects. In ACM SIGIR 99, pages 265-266, 1999.
[9] M. A. Hearst and J. O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In ACM SIGIR 96, pages 76-84, 1996.
[10] T. Hofmann. Probabilistic Latent Semantic Indexing. In ACM SIGIR 99, pages 50-57, 1999.
[11] M. Iwayama and T. Tokunaga. Cluster-Based Text Categorization: A Comparison of Category Search Strategies. In ACM SIGIR 95, pages 273-280, 1995.
[12] K. Lang. Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning, pages 331-339, 1995.
[13] J. Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991.
[14] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[15] M. Mechkour, D. J. Harper and G. Muresan. The WebCluster Project: Using Clustering for Mediating Access to the WWW. In ACM SIGIR 98, pages 357-358, 1998.
[16] G. Muresan, D. J. Harper and M. Mechkour. WebCluster, a Tool for Mediated Information Access. In ACM SIGIR 99, page 337, 1999.
[17] F. C. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190, 1993.
[18] D. Roussinov, K. Tolle, M. Ramsey and H. Chen. Interactive Internet Search through Automatic Clustering: an Empirical Study. In ACM SIGIR 99, pages 289-290, 1999.
[19] G. Salton. The SMART Retrieval System. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[20] G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974-980, 1990.
[21] R. E. Schapire and Y. E. Singer. BoosTexter: A System for Multiclass Multi-label Text Categorization, 1998.
[22] H. Schutze and C. Silverstein. Projections for Efficient Document Clustering. In ACM SIGIR 97, pages 74-81, 1997.
[23] C. Silverstein and J. O. Pedersen. Almost-Constant-Time Clustering for Arbitrary Corpus Subsets. In ACM SIGIR 97, pages 60-66, 1997.
[24] N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In Proc. of Neural Information Processing Systems (NIPS-99), pages 617-623, 1999.
[25] N. Slonim and N. Tishby. The Hard Clustering Limit of the Information Bottleneck Method. In preparation.
[26] N. Tishby, F. C. Pereira and W. Bialek. The Information Bottleneck Method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
[27] C. J. van Rijsbergen. Information Retrieval. London: Butterworths, 1979.
[28] P. Willett. Recent Trends in Hierarchic Document Clustering: A Critical Review. Information Processing & Management, Vol. 24(5), pages 577-597, 1988.
[29] J. Xu and W. B. Croft. Cluster-based Language Models for Distributed Retrieval. In ACM SIGIR 99, pages 254-261, 1999.
[30] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In ACM SIGIR 98, pages 46-54, 1998.