WordNet-based Text Document Clustering

Julian Sedding
Department of Computer Science
University of York
Heslington, York YO10 5DD, United Kingdom
[email protected]

Dimitar Kazakov
AIG, Department of Computer Science
University of York
Heslington, York YO10 5DD, United Kingdom
[email protected]

Abstract

Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Algorithms to solve this task exist; however, the algorithms are only as good as the data they work on. Problems include ambiguity and synonymy, the former allowing for erroneous groupings and the latter causing similarities between documents to go unnoticed. In this research, naïve, syntax-based disambiguation is attempted by assigning each word a part-of-speech tag and by enriching the ‘bag-of-words’ data representation often used for document clustering with synonyms and hypernyms from WordNet.

1 Introduction

Text document clustering is the grouping of text documents into semantically related groups, or, as Hayes puts it, “they are grouped because they are likely to be wanted together” (Hayes, 1963). Initially, document clustering was developed to improve precision and recall of information retrieval systems. More recently, however, driven by the ever increasing amount of text documents available in corporate document repositories and on the Internet, the focus has shifted towards providing ways to efficiently browse large collections of documents and to reorganise search results for display in a structured, often hierarchical, manner.

The clustering of Internet search results has attracted particular attention. Some recent studies explored the feasibility of clustering ‘in real-time’ and the problem of adequately labeling clusters. Zamir and Etzioni (1999) have created a clustering interface for the meta-search engine ‘HuskySearch’, and Zhang and Dong (2001) present their work on a system called SHOC. The reader is also referred to Vivisimo,¹ a commercial clustering interface based on results from a number of search engines.

¹http://www.vivisimo.com

Ways to increase clustering speed are explored in many research papers, and the recent trend towards web-based clustering, requiring real-time performance, does not seem to change this. However, van Rijsbergen points out, “it seems to me a little early in the day to insist on efficiency even before we know much about the behaviour of clustered files in terms of the effectiveness of retrieval” (van Rijsbergen, 1989). Indeed, it may be worth exploring which factors influence the quality (or effectiveness) of document clustering.

Clustering can be broken down into two stages. The first is to preprocess the documents, i.e. to transform them into a suitable and useful data representation. The second stage is to analyse the prepared data and divide it into clusters, i.e. the clustering algorithm.

Steinbach et al. (2000) compare the suitability of a number of algorithms for text clustering and conclude that bisecting k-means, a partitional algorithm, is the current state of the art. Its processing time increases linearly with the number of documents and its quality is similar to that of hierarchical algorithms.

Preprocessing the documents is probably at least as important as the choice of algorithm, since an algorithm can only be as good as the data it works on. While a number of preprocessing steps are almost standard now, the effects of adding background knowledge are still not very extensively researched. This work explores if and how the two following methods can improve the effectiveness of clustering.

Part-of-Speech Tagging. Segond et al. (1997) observe that part-of-speech (PoS) tagging solves semantic ambiguity to some extent (40% in one of their tests). Based on this observation, we study whether naïve word sense disambiguation by PoS tagging can help to improve clustering results.

WordNet. Synonymy and hypernymy can reveal hidden similarities, potentially leading to better clusters. WordNet,² an ontology which models these two relations (among many others) (Miller et al., 1991), is used to include synonyms and hypernyms in the data representation, and the effects on clustering quality are observed and analysed.

²available at http://www.cogsci.princeton.edu/~wn

The overall aim of the approach outlined above is to cluster documents by meaning, hence it is relevant to language understanding. The approach has some of the characteristics expected from a robust language understanding system. Firstly, learning relies only on unannotated text data, which is abundant and does not contain the individual bias of an annotator. Secondly, the approach is based on general-purpose resources (Brill’s PoS tagger, WordNet), and the performance is studied under pessimistic (hence more realistic) assumptions, e.g., that the tagger is trained on a standard dataset with potentially different properties from the documents to be clustered. Similarly, the approach studies the potential benefits of using all possible senses (and hypernyms) from WordNet, in an attempt to postpone (or avoid altogether) the need for Word Sense Disambiguation (WSD), and the related pitfalls of a WSD tool which may be biased towards a specific domain or language style.

The remainder of the document is structured as follows. Section 2 describes related work and the techniques used to preprocess the data, as well as to cluster it and evaluate the results achieved. Section 3 provides some background on the selected corpus, the Reuters-21578 test collection (Lewis, 1997b), and presents the sub-corpora that are extracted for use in the experiments. Section 4 describes the experimental framework, while Section 5 presents the results and their evaluation. Finally, conclusions are drawn and further work discussed in Section 6.

2 Background

This work is most closely related to the recently published research of Hotho et al. (2003b), and can be seen as a logical continuation of their experiments. While these authors have analysed the benefits of using WordNet synonyms and up to five levels of hypernyms for document clustering (using the bisecting k-means algorithm), this work describes the impact of tagging the documents with PoS tags and/or adding all hypernyms to the information available for each document.

Here we use the vector space model, as described in the work of Salton et al. (1975), in which a document is represented as a vector or ‘bag of words’, i.e., by the words it contains and their frequency, regardless of their order. A number of fairly standard techniques have been used to preprocess the data. In addition, a combination of standard and custom software tools has been used to add PoS tags and WordNet categories to the data set. These are described briefly below to allow for the experiments to be repeated.

The first preprocessing step is to PoS tag the corpus. The PoS tagger relies on the text structure and morphological differences to determine the appropriate part of speech. For this reason, if it is required, PoS tagging is the first step to be carried out. After this, stopword removal is performed, followed by stemming. This order is chosen to reduce the number of words to be stemmed. The stemmed words are then looked up in WordNet and their corresponding synonyms and hypernyms are added to the bag-of-words. Once the document vectors are completed in this way, the frequency of each word across the corpus can be counted, and every word occurring less often than a pre-specified threshold is pruned. Finally, after the pruning step, the term weights are converted to tfidf as described below.

Stemming, stopword removal and pruning all aim to improve clustering quality by removing noise, i.e. meaningless data. They all lead to a reduction in the number of dimensions in the term-space. Weighting is concerned with estimating the importance of individual terms. All of these techniques have been used extensively and are considered the baseline for comparison in this work. However, the two techniques under investigation both add data to the representation: PoS tagging adds syntactic information, and WordNet is used to add synonyms and hypernyms. The rest of this section discusses preprocessing, clustering and evaluation in more detail.

PoS Tagging. PoS tags are assigned to the corpus using Brill’s PoS tagger. As PoS tagging requires the words to be in their original order, this is done before any other modifications on the corpora.

Stopword Removal. Stopwords, i.e. words thought not to convey any meaning, are removed from the text. The approach taken in this work does not compile a static list of stopwords, as is usually done. Instead, PoS information is exploited and all tokens that are not nouns, verbs or adjectives are removed.
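As a concrete illustration, the PoS-based stopword filter just described can be sketched as follows. This is a minimal sketch, not the code used in the experiments; it assumes tokens have already been tagged with Penn-Treebank-style tags such as those produced by Brill's tagger, where noun, verb and adjective tags begin with NN, VB and JJ.

```python
# Minimal sketch of PoS-based stopword removal (illustrative only).
# Input: (word, tag) pairs with Penn-Treebank-style tags; only nouns,
# verbs and adjectives are kept, everything else counts as a stopword.

KEPT_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives

def remove_stopwords(tagged_tokens):
    """Keep only (word, tag) pairs tagged as noun, verb or adjective."""
    return [(word, tag) for word, tag in tagged_tokens
            if tag.startswith(KEPT_PREFIXES)]

tagged = [("the", "DT"), ("profits", "NNS"), ("rose", "VBD"),
          ("sharply", "RB"), ("last", "JJ"), ("quarter", "NN")]
print(remove_stopwords(tagged))
# → [('profits', 'NNS'), ('rose', 'VBD'), ('last', 'JJ'), ('quarter', 'NN')]
```

Note that keeping the tag attached to the word (as in the PoS Only configuration described later) means the noun ‘run’ and the verb ‘run’ remain distinct features.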

Stemming. Words with the same meaning appear in various morphological forms. To capture their similarity, they are normalised into a common root-form, the stem. The morphology function provided with WordNet is used for stemming, because it only yields stems that are contained in the WordNet dictionary.

WordNet Categories. WordNet, the lexical database developed by Miller et al., is used to include background information on each word. Depending on the experiment setup, words are replaced with their synset IDs, which constitute their different possible senses, and different levels of hypernyms, i.e. more general terms for the word, are also added.

Pruning. Words appearing with low frequency throughout the corpus are unlikely to appear in more than a handful of documents and would therefore, even if they contributed any discriminating power, be likely to cause distinctions too fine-grained to be useful, i.e. clusters containing only one or two documents. Therefore, all words (or synset IDs) that appear less often than a pre-specified threshold are pruned.

Weighting. Weights are assigned to give an indication of the importance of a word. The most trivial weight is the word frequency. However, more sophisticated methods can provide better results. Throughout this work, tfidf (term frequency × inverse document frequency), as described by Salton et al. (1975), is used. One problem with term frequency is that the lengths of the documents are not taken into account. The straightforward solution to this problem is to divide the term frequency by the total number of terms in the document, the document length. Effectively, this approach is equivalent to normalising each document vector to length one and is called relative term frequency. However, for this research a more sophisticated measure is used: the product of term frequency and inverse document frequency, tfidf. Salton et al. define the inverse document frequency idf as

    idf_t = log2 n − log2 df_t + 1    (1)

where df_t is the number of documents in which term t appears and n is the total number of documents. Consequently, the tfidf measure is calculated as

    tfidf_t = tf_t · (log2 n − log2 df_t + 1)    (2)

i.e. simply the product of tf and idf. This means that larger weights are assigned to terms that appear relatively rarely throughout the corpus, but very frequently in individual documents. Salton et al. (1975) measure a 14% improvement in recall and precision for tfidf in comparison to the standard term frequency tf.

Clustering. Clustering is done with the bisecting k-means algorithm as described by Steinbach et al. (2000). In their comparison of different algorithms, they conclude that bisecting k-means is the current state of the art for document clustering. Bisecting k-means combines the strengths of partitional and hierarchical clustering methods by iteratively splitting the biggest cluster using the basic k-means algorithm. Basic k-means is a partitional clustering algorithm based on the vector space model. At the heart of such algorithms is a similarity measure; we choose the cosine distance, which measures the similarity of two documents by calculating the cosine of the angle between them. The cosine distance is defined as follows:

    s(d_i, d_j) = cos(∠(d_i, d_j)) = (d_i · d_j) / (|d_i| · |d_j|)    (3)

where |d_i| and |d_j| are the lengths of the vectors d_i and d_j, respectively, and d_i · d_j is the dot-product of the two vectors. When the vectors are normalised to unit length, the cosine distance is equivalent to the dot-product of the vectors, i.e. d_i · d_j.
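For illustration, the tfidf weighting of equations (1) and (2) and the cosine measure of equation (3) can be sketched as follows. This is a schematic sketch, not the project's actual implementation; `tf`, `df` and `n` follow the notation above.

```python
import math

def tfidf(tf, df, n):
    """tfidf per equations (1)-(2): tf · (log2 n − log2 df + 1)."""
    return tf * (math.log2(n) - math.log2(df) + 1)

def cosine(d1, d2):
    """Cosine similarity of two term-weight vectors, equation (3)."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    return dot / (norm1 * norm2)

# A term occurring 3 times in a document and in 2 of 8 documents overall:
print(tfidf(3, 2, 8))                  # 3 · (3 − 1 + 1) = 9.0
print(cosine([1.0, 0.0], [1.0, 1.0]))  # ≈ 0.7071
```

A rare term (small df) receives a larger weight than a term spread evenly across the corpus, which is exactly the behaviour equations (1)–(2) are meant to produce.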
Evaluation. Three different evaluation measures are used in this work, namely purity, entropy and overall similarity. Purity and entropy are both based on precision,

    prec(C, L) := |C ∩ L| / |C|    (4)

where each cluster C from a clustering 𝒞 of the set of documents D is compared with the manually assigned category labels L from the manual categorisation ℒ; this requires a category-labeled corpus. Precision is the probability of a document in cluster C being labeled L. Purity is the percentage of correctly clustered documents and can be calculated as

    purity(𝒞, ℒ) := Σ_{C ∈ 𝒞} (|C| / |D|) · max_{L ∈ ℒ} prec(C, L)    (5)

yielding values in the range between 0 and 1. The intra-cluster entropy (ice) of a cluster C, as described by Steinbach et al. (2000), considers the dispersion of documents in a cluster, and is defined as

    ice(C) := − Σ_{L ∈ ℒ} prec(C, L) · log(prec(C, L))    (6)

Based on the intra-cluster entropies of all clusters, the average, weighted by cluster size, is calculated. This results in the following formula, which is based on the one used by Steinbach et al. (2000):

    entropy(𝒞) := Σ_{C ∈ 𝒞} (|C| / |D|) · ice(C)    (7)

Overall similarity is independent of pre-annotation. Instead, the intra-cluster similarities are calculated, giving an idea of the cohesiveness of a cluster. This is the average similarity between each pair of documents in a cluster, including the similarity of a document with itself. Steinbach et al. (2000) show that this is equivalent to the squared length of the cluster centroid, i.e. |c|². The overall similarity is then calculated as

    overall similarity(𝒞) := Σ_{C ∈ 𝒞} (|C| / |D|) · |c|²    (8)

Similarity is expressed as a percentage, therefore the possible values for overall similarity range from 0 to 1.

3 The Corpus

Here we look at what kind of corpus is required to assess the quality of clusters, and present our choice, the Reuters-21578 test collection. This is followed by a discussion of the ways sub-corpora can be extracted from the whole corpus in order to address some of the problems of the Reuters corpus.

A corpus useful for evaluating text document clustering needs to be annotated with class or category labels. This is not a straightforward task, as even human annotators sometimes disagree on which label to assign to a specific document. Therefore, all results depend on the quality of annotation. It is consequently unrealistic to aim at high rates of agreement with regard to the corpus, and any evaluation should rather focus on the relative comparison of the results achieved by different experiment setups and configurations.

Due to the aforementioned difficulty of agreeing on a categorisation and the lack of a definition of ‘correct’ classification, no standard corpora for the evaluation of clustering techniques exist. Still, although not standardised, a number of pre-categorised corpora are available. Apart from various domain-specific corpora with class annotations, there is the Reuters-21578 test collection (Lewis, 1997b), which consists of 21578 newswire articles from 1987. The Reuters corpus is chosen for use in the experiments of this project for four reasons:

1. Its domain is not specific, therefore it can be understood by a non-expert.

2. WordNet, an ontology which is not tailored to a specific domain, would not be effective for domains with a very specific vocabulary.

3. It is freely available for download.

4. It has been used in comparable studies before (Hotho et al., 2003b).

On closer inspection of the corpus, there remain some problems to solve. First of all, only about half of the documents are annotated with category labels. On the other hand, some documents are attributed to multiple categories, meaning that categories overlap. Some confusion seems to have been caused in the research community by the fact that there is a TOPICS attribute in the SGML, the value of which is set to either YES or NO (or BYPASS). However, this does not correspond to the values observed within the TOPICS tag; sometimes categories can be found even if the TOPICS attribute is set to NO, and sometimes there are no categories assigned even if the attribute indicates YES. Lewis explains that this is not an error in the corpus, but has to do with the evolution of the corpus and is kept for historic reasons (Lewis, 1997a).

Therefore, to prepare a base corpus, the TOPICS attribute is ignored and all documents that have precisely one category assigned to them are selected. Additionally, all documents with an empty document body are discarded. This results in the corpus ‘reut-base’ containing 9446 documents. The distribution of category sizes in ‘reut-base’ is shown in Figure 1. It illustrates that there are a few categories occurring extremely often; in fact, the two biggest categories contain about two thirds of all documents in the corpus. This unbalanced distribution would blur test results, because even ‘random clustering’ would potentially obtain purity values of 30% and more, only due to the contribution of the two main categories.

Similar to Hotho et al. (2003b), we get around this problem by deriving new corpora from the base corpus. Their maximum category size is reduced to 20, 50 and 100 documents respectively. Categories containing more documents are not excluded; instead, they are reduced in size to comply with the defined maximum, i.e., all documents in excess of the maximum are removed. Creating derived corpora has the further advantages of reducing the size of the corpora and thus the computational requirements for the test runs. Also, tests can be run on more and less homogeneous corpora (homogeneous with regard to the cluster size, that is), which can give an idea of how the method performs under different conditions. Especially for this purpose, a fourth, extremely homogeneous test corpus, ‘reut-min15-max20’, is derived. It is like the ‘reut-max20’ corpus, but all categories containing less than 15 documents are entirely removed. ‘reut-min15-max20’ is thus the most homogeneous test corpus, with a standard deviation in cluster size of only 0.7 documents.

A summary of the derived test corpora is shown in Table 1, including the number of documents they contain, i.e. their size, the average category size and the standard deviation. Figure 2 shows the distribution of categories within the derived corpora graphically.

Table 1: Corpus Statistics

                             Category Size
  Name                Size      ø    stdev
  reut-min15-max20     713     20      0.7
  reut-max20           881     13      7.7
  reut-max50          1690     24     19.9
  reut-max100         2244     34     35.2
  reut-base           9446    143    553.2

Note: The average category size is rounded to the nearest whole number, the standard deviation to the first decimal place.
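The purity and entropy measures of equations (5)–(7) can be sketched as follows. This is an illustrative sketch, not the evaluation code used in this work; clusters are represented as lists of document ids, and the manual categorisation as a mapping from document id to label.

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Equation (5): size-weighted share of each cluster's majority label.
    `clusters` maps cluster id -> list of document ids,
    `labels` maps document id -> category label."""
    n = sum(len(docs) for docs in clusters.values())
    return sum(Counter(labels[d] for d in docs).most_common(1)[0][1]
               for docs in clusters.values()) / n

def entropy(clusters, labels):
    """Equations (6)-(7): size-weighted average intra-cluster entropy."""
    n = sum(len(docs) for docs in clusters.values())
    total = 0.0
    for docs in clusters.values():
        counts = Counter(labels[d] for d in docs)
        ice = -sum((c / len(docs)) * math.log(c / len(docs))
                   for c in counts.values())
        total += (len(docs) / n) * ice
    return total

labels = {1: "grain", 2: "grain", 3: "crude", 4: "crude"}
clusters = {"A": [1, 2], "B": [3, 4]}
print(purity(clusters, labels))   # 1.0 — both clusters are pure
print(entropy(clusters, labels))  # 0.0 — no dispersion within clusters
```

A perfect clustering gives purity 1 and entropy 0; mixing the two categories evenly would drive purity towards 0.5 and entropy towards log 2.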

[Figure 1: Category Distribution for Corpus ‘reut-base’ (number of documents per category, ordered by frequency; only selected categories are listed).]

4 Clustering with PoS and Background Knowledge

The aim of this work is to explore the benefits of partial disambiguation of words by their PoS and the inclusion of WordNet concepts. This has been tested on five different setups, as shown in Table 2.

Table 2: Experiment Configurations

  Name       Base   PoS   Syns   Hyper
  Baseline   yes
  PoS Only   yes    yes
  Syns       yes    yes   yes
  Hyper 5    yes    yes   yes    5
  Hyper All  yes    yes   yes    all

Base: stopword removal, stemming, pruning and tfidf weighting are performed; PoS tags are stripped.
PoS: PoS tags are kept attached to the words.
Syns: all senses of a word are included using synset offsets.
Hyper: hypernyms to the specified depth are included.
Empty fields indicate ‘no’ or ‘0’.

Baseline. The first configuration setting is used to get a baseline for comparison. All basic preprocessing techniques are used, i.e. stopword removal, stemming, pruning and weighting. PoS tags are removed from the tokens to get the equivalent of a raw text corpus.

PoS Only. Identical to the baseline, but the PoS tags are not removed.

Syns. In addition to the previous configuration, all WordNet senses (synset IDs) of each PoS-tagged token are included.

Hyper 5. Here, five levels of hypernyms are included in addition to the synset IDs.

Hyper All. Same as above, but all hypernyms for each word token are included.

Each of the configurations is used to create 16, 32 and 64 clusters from each of the four test corpora. Due to the random choice of initial cluster centroids in the bisecting k-means algorithm, the mean of three test runs with the same configuration is calculated for the corpora ‘reut-max20’ and ‘reut-max50’. The existing project time constraints allowed us to gain some additional insight by doing one test run for each of ‘reut-max100’ and ‘reut-min15-max20’. This results in 120 experiments in total.

All configurations use tfidf weighting and pruning. The pruning thresholds vary. For all experiments using the ‘reut-max20’ corpus, all terms occurring less than 20 times are pruned. The experiments on the corpora ‘reut-max50’ and ‘reut-min15-max20’ are carried out with a pruning threshold of 50. For the corpus ‘reut-max100’, the pruning threshold is set to 50 when the configurations Baseline, PoS Only or Syns are used, and to 200 otherwise. This relatively high threshold is chosen in order to reduce memory requirements. To ensure that this inconsistency does not distort the conclusions drawn from the test data, the results of these tests are considered with great care and are explicitly referred to when used.

Further details of this research are described in an unpublished report (Sedding, 2004).

[Figure 2: Category Distributions for Derived Corpora (one panel per corpus: ‘reut-min15-max20’, ‘reut-max20’, ‘reut-max50’ and ‘reut-max100’, each showing the number of documents per category, ordered by frequency).]

5 Results and Evaluation

The results are presented in the format of one graph per corpus, showing the entropy, purity and overall similarity values for each of the configurations shown in Table 2. On the X-axis, the different configuration settings are listed: hype refers to the hypernym depth, syn refers to whether synonyms were included or not, pos refers to the presence or absence of PoS tags, and clusters refers to the number of clusters created. For improved readability, lines are drawn, splitting the graphs into three sections, one for each number of clusters. For experiments on the corpora ‘reut-max20’ and ‘reut-max50’, the values in the graphs are the average of three test runs, whereas for the corpora ‘reut-min15-max20’ and ‘reut-max100’, the values are those obtained from a single test run.

The Y-axis indicates the numerical values for each of the measures. Note that the values for purity and similarity are percentages, and thus limited to the range between 0 and 1. For those two measures, higher values indicate better quality. High entropy values, on the other hand, indicate lower quality. Entropy values are always greater than 0 and, for the particular experiments carried out, they never exceed 1.3.

In analysing the test results, the main focus is on the data for the corpora ‘reut-max20’ and ‘reut-max50’, shown in Figure 3 and Figure 4, respectively. This data is more reliable, because it is the average of repeated test runs. Figures 6 and 7 show the test data obtained from clustering the corpora ‘reut-min15-max20’ and ‘reut-max100’, respectively.

The fact that the purity and similarity values are far from 100 percent is not unusual. In many cases, not even human annotators agree on how to categorise a particular document (Hotho et al., 2003a). More importantly, the number of clusters is not adjusted to the number of labels present in a corpus, which makes complete agreement impossible.

All three measures indicate that the quality increases with the number of clusters. The graph in Figure 5 illustrates this for the entropy in ‘reut-max50’. For any given configuration, it appears that the decrease in entropy is almost constant as the number of clusters increases. This is easily explained by the average cluster sizes, which decrease with an increasing number of clusters; when clusters are smaller, the probability of having a high percentage of documents with the same label in a cluster increases. This becomes obvious when very small clusters are looked at. For instance, the minimum purity value for a cluster containing three documents is 33 percent, for two documents it is 50 percent, and, in the extreme case of a single document per cluster, purity is always 100 percent.

The PoS Only experiment results in performance which is very similar to the Baseline, sometimes a little better, sometimes a little worse. This is expected, and the experiment is included to allow for a more accurate interpretation of the subsequent experiments using synonyms and hypernyms.

A more interesting observation is that the purity and entropy values indicate better clusters for Baseline than for any of the configurations using background knowledge from WordNet (i.e. Syns, Hyper 5 and Hyper All). One possible conclusion is that adding background knowledge is not helpful at all. However, the relatively poor performance could also be due to the way the experiments are set up. A possible explanation for these results could be that the benefit of the extra overlap between documents, which the added synonyms and hypernyms should provide, is outweighed by the additional noise they create. WordNet often provides five or more senses for a word, which means that for one correct sense a number of incorrect senses are added, even if the PoS tags eliminate some of them.

The overall similarity measure gives a different indication. Its values appear to increase in the cases where background knowledge is included, especially when hypernyms are added. Overall similarity is the weighted average of the intra-cluster similarities of all clusters, so the intra-cluster similarity actually increases with the added information.

[Figure 3: Test Results for ‘reut-max20’. Figure 4: Test Results for ‘reut-max50’. Both graphs plot purity, entropy and overall similarity for 16, 32 and 64 clusters across the five configurations.]

As similarity increases with additional overlap, the overall similarity measure shows that additional overlap is achieved.

The main problem with the approach of adding all synonyms and all hypernyms to the document vectors seems to be the added noise. The expectation that tfidf weighting would take care of these quasi-random new concepts is not met, but the results also indicate possible improvements to this approach.

If word-by-word disambiguation were used, the correct sense of a word could be chosen and only the hypernyms for that correct sense taken into account. This should drastically reduce noise. The benefit of the added ‘correct’ concepts would then probably improve cluster quality. Hotho et al. (2003a) experimented successfully with simple disambiguation strategies, e.g., they used only the first sense provided by WordNet.

[Figure 5: Entropies for Different Cluster Sizes in ‘reut-max50’.]

As an alternative to word-by-word disambiguation, a strategy to disambiguate based on document vectors could be devised: after adding all alternative senses of the terms, the least frequent ones could be removed. This is similar to pruning, but would be done on a document-by-document basis, rather than globally on the whole corpus. The basis for this idea is that only concepts that appear repeatedly in a document contribute (significantly) to its meaning. It is important that this is done before hypernyms are added, especially when all levels of hypernyms are added, because the most general terms are bound to appear more often than the more specific ones. Otherwise, this would lead to lots of very similar, but meaningless, bags of words or bags of concepts.

[Figure 6: Test Results for ‘reut-min15-max20’.]

Comparing Syns, Hyper 5 and Hyper All with each other, in many cases Hyper 5 gives the best results. A possible explanation could again be the equilibrium between the valuable information and the noise that are added to the vector representations. From these results it seems that there is a point where the amount of information added reaches its maximum benefit; adding more knowledge afterwards results in decreased cluster quality again. It should be noted that a fixed threshold for the levels of hypernyms used is unlikely to be optimal for all words. Instead, a more refined approach could set this threshold as a function of the semantic distance (Resnik and Yarowsky, 2000; Stetina, 1997) between the word and its hypernyms.

The maximised benefit is most evident in the ‘reut-max100’ corpus (Figure 7). However, it needs to be kept in mind that for the last two data points, Hyper 5 and Hyper All, the pruning threshold is 200. Therefore, the comparison with Syns needs to be made with care. This is not much of a problem, because the other graphs consistently show that the performance for Syns is worse than for Hyper 5. The difference between Hyper 5 and Hyper All in ‘reut-max100’ can be directly compared, though, because the pruning threshold of 200 is used for both configurations.

[Figure 7: Test Results for ‘reut-max100’.]
This leads The other point of interest that could be fur- to decreased cluster quality because the gen- ther analysed is to find out why using five lev- eral concepts have little discriminating power. els of hypernyms produces better results than In the corresponding experiments on other cor- using all levels of hypernyms. It would be in- pora, more of the specific concepts are retained. teresting to see whether this effect persists when Therefore, a better balance between general better disambiguation is used to determine ‘cor- and specific concepts is maintained, keeping the rect’ word senses. cluster quality higher than in the case of corpus ‘reut-max100’. Acknowledgements PoS Only performs similar to Baseline, al- though usually a slight decrease in quality can The authors wish to thank the three anonymous be observed. Despite the assumption that the referees for their valuable comments. disambiguation achieved by the PoS tags should improve clustering results, this is clearly not References the case. PoS tags only disambiguate the cases R.M. Hayes. 1963. Mathematical models in in- where different word classes are represented by formation retrieval. In P.L. Garvin, editor, the same stem, e.g., the noun ‘run’ and the verb Natural Language and the Computer, page ‘run’. Clearly the meanings of these pairs are 287. McGraw-Hill, New York. in most cases related. Therefore, distinguishing A. Hotho, S. Staab, and G. Stumme. 2003a. between them reduces the weight of their com- Text clustering based on background knowl- mon concept by splitting it between two con- edge. Technical Report No. 425. cepts. In the worst case, they are pruned if A. Hotho, S. Staab, and G. Stumme. 2003b. treated separately, instead of contributing sig- Wordnet improves text document clustering. nificantly to the document vector as a joint con- In Proceedings of the Semantic Web Work- cept. shop at SIGIR-2003, 26th Annual Interna- tional ACM SIGIR Conference. 6 Conclusions D.D. Lewis. 1997a. 
Readme file of Reuters- The main finding of this work is that includ- 21578 text categorization test collection, dis- ing synonyms and hypernyms, disambiguated tribution 1.0. only by PoS tags, is not successful in improv- D.D. Lewis. 1997b. Reuters-21578 text catego- ing clustering effectiveness. This could be at- rization test collection, distribution 1.0. tributed to the noise introduced by all incor- G. Miller, R. Beckwith, C. Fellbaum, D. Gross, rect senses that are retrieved from WordNet. It and K. Miller. 1991. Five papers on . appears that disambiguation by PoS alone is in- International Journal of Lexicography. sufficient to reveal the full potential of including P. Resnik and D. Yarowsky. 2000. Distin- background knowledge. One obviously imprac- guishing systems and distinguishing senses: tical alternative would be manual sense disam- New evaluation methods for word sense dis- biguation. The automated approach of only us- ambiguation. Natural Language Engineering, ing the most common sense adopted by Hotho 5(2):113–133. G. Salton, A. Wong, and C.S. Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18:613–620. J. Sedding. 2004. Wordnet-based text docu- ment clustering. Bachelor’s Thesis. F. Segond, A. Schiller, G. Grefenstette, and J.P. Chanod. 1997. An experiment in seman- tic tagging using hidden Markov model tag- ging. In Automatic and Building of Lexical Semantic Resources for NLP Applications, pages 78–81. Asso- ciation for Computational Linguistics, New Brunswick, New Jersey. M. Steinbach, G. Karypis, and V. Kumar. 2000. A comparison of document clustering tech- niques. In KDD Workshop on . Jiri Stetina. 1997. Corpus Based Natural Lan- guage Ambiguity Resolution. Ph.D. thesis, Kyoto University. C.J. van Rijsbergen. 1989. Information Re- trieval (Second Edition). Buttersworth, Lon- don. O. Zamir and O. Etzioni. 1999. Grouper: A dynamic clustering interface to Web search results. 
Computer Networks, Amsterdam, Netherlands, 31(11–16):1361–1374. D. Zhang and Y. Dong. 2001. Semantic, hierar- chical, online clustering of web search results. 3rd International Workshop on Web Informa- tion and Data Management, Atlanta, Geor- gia.