
WordNet-based Text Document Clustering

Julian Sedding
Department of Computer Science
University of York
Heslington, York YO10 5DD, United Kingdom
[email protected]

Dimitar Kazakov
AIG, Department of Computer Science
University of York
Heslington, York YO10 5DD, United Kingdom
[email protected]

Abstract

Text document clustering can greatly simplify browsing large collections of documents by reorganising them into a smaller number of manageable clusters. Algorithms to solve this task exist; however, the algorithms are only as good as the data they work on. Problems include ambiguity and synonymy, the former allowing for erroneous groupings and the latter causing similarities between documents to go unnoticed. In this research, naïve, syntax-based disambiguation is attempted by assigning each word a part-of-speech tag and by enriching the 'bag-of-words' data representation often used for document clustering with synonyms and hypernyms from WordNet.

1 Introduction

Text document clustering is the grouping of text documents into semantically related groups, or, as Hayes puts it, "they are grouped because they are likely to be wanted together" (Hayes, 1963). Initially, document clustering was developed to improve precision and recall of information retrieval systems. More recently, however, driven by the ever increasing amount of text documents available in corporate document repositories and on the Internet, the focus has shifted towards providing ways to efficiently browse large collections of documents and to reorganise search results for display in a structured, often hierarchical manner.

The clustering of Internet search results has attracted particular attention. Some recent studies explored the feasibility of clustering 'in real-time' and the problem of adequately labelling clusters. Zamir and Etzioni (1999) have created a clustering interface for the meta-search engine 'HuskySearch', and Zhang and Dong (2001) present their work on a system called SHOC. The reader is also referred to Vivisimo (http://www.vivisimo.com), a commercial clustering interface based on results from a number of search engines.

Ways to increase clustering speed are explored in many research papers, and the recent trend towards web-based clustering, requiring real-time performance, does not seem to change this. However, van Rijsbergen points out, "it seems to me a little early in the day to insist on efficiency even before we know much about the behaviour of clustered files in terms of the effectiveness of retrieval" (van Rijsbergen, 1989). Indeed, it may be worth exploring which factors influence the quality (or effectiveness) of document clustering.

Clustering can be broken down into two stages. The first one is to preprocess the documents, i.e. to transform them into a suitable and useful data representation. The second stage is to analyse the prepared data and divide it into clusters, i.e. the clustering algorithm.

Steinbach et al. (2000) compare the suitability of a number of algorithms for text clustering and conclude that bisecting k-means, a partitional algorithm, is the current state of the art. Its processing time increases linearly with the number of documents, and its quality is similar to that of hierarchical algorithms.

Preprocessing the documents is probably at least as important as the choice of an algorithm, since an algorithm can only be as good as the data it works on. While there are a number of preprocessing steps that are almost standard now, the effects of adding background knowledge are still not very extensively researched. This work explores if and how the two following methods can improve the effectiveness of clustering.

Part-of-Speech Tagging. Segond et al. (1997) observe that part-of-speech (PoS) tagging solves semantic ambiguity to some extent (40% in one of their tests). Based on this observation, we study whether naïve word sense disambiguation by PoS tagging can help to improve clustering results.

WordNet. Synonymy and hypernymy can reveal hidden similarities, potentially leading to better clusters. WordNet (available at http://www.cogsci.princeton.edu/~wn), an ontology which models these two relations (among many others) (Miller et al., 1991), is used to include synonyms and hypernyms in the data representation, and the effects on clustering quality are observed and analysed.

The overall aim of the approach outlined above is to cluster documents by meaning, hence it is relevant to language understanding. The approach has some of the characteristics expected from a robust language understanding system. Firstly, learning relies only on unannotated text data, which is abundant and does not contain the individual bias of an annotator. Secondly, the approach is based on general-purpose resources (Brill's PoS tagger, WordNet), and the performance is studied under pessimistic (hence more realistic) assumptions, e.g., that the tagger is trained on a standard dataset with potentially different properties from the documents to be clustered. Similarly, the approach studies the potential benefits of using all possible senses (and hypernyms) from WordNet, in an attempt to postpone (or avoid altogether) the need for Word Sense Disambiguation (WSD), and the related pitfalls of a WSD tool which may be biased towards a specific domain or language style.

The remainder of the document is structured as follows. Section 2 describes related work and the techniques used to preprocess the data, as well as to cluster it and evaluate the results achieved. Section 3 provides some background on the selected corpus, the Reuters-21578 test collection (Lewis, 1997b), and presents the sub-corpora that are extracted for use in the experiments. Section 4 describes the experimental framework, while Section 5 presents the results and their evaluation. Finally, conclusions are drawn and further work discussed in Section 6.

2 Background

This work is most closely related to the recently published research of Hotho et al. (2003b), and can be seen as a logical continuation of their experiments. While these authors have analysed the benefits of using WordNet synonyms and up to five levels of hypernyms for document clustering (using the bisecting k-means algorithm), this work describes the impact of tagging the documents with PoS tags and/or adding all hypernyms to the information available for each document.

Here we use the vector space model, as described in the work of Salton et al. (1975), in which a document is represented as a vector or 'bag of words', i.e., by the words it contains and their frequency, regardless of their order.

A number of fairly standard techniques have been used to preprocess the data. In addition, a combination of standard and custom software tools has been used to add PoS tags and WordNet categories to the data set. These will be described briefly below to allow for the experiments to be repeated.

The first preprocessing step is to PoS tag the corpus. The PoS tagger relies on the text structure and morphological differences to determine the appropriate part of speech. For this reason, if it is required, PoS tagging is the first step to be carried out. After this, stopword removal is performed, followed by stemming. This order is chosen to reduce the number of words to be stemmed. The stemmed words are then looked up in WordNet, and their corresponding synonyms and hypernyms are added to the bag-of-words. Once the document vectors are completed in this way, the frequency of each word across the corpus can be counted, and every word occurring less often than the pre-specified threshold is pruned. Finally, after the pruning step, the term weights are converted to tf·idf as described below.

Stemming, stopword removal and pruning all aim to improve clustering quality by removing noise, i.e. meaningless data. They all lead to a reduction in the number of dimensions in the term space. Weighting is concerned with the estimation of the importance of individual terms. All of these techniques have been used extensively and are considered the baseline for comparison in this work. However, the two techniques under investigation both add data to the representation: PoS tagging adds syntactic information, and WordNet is used to add synonyms and hypernyms. The rest of this section discusses preprocessing, clustering and evaluation in more detail.

PoS Tagging PoS tags are assigned to the corpus using Brill's PoS tagger. As PoS tagging requires the words to be in their original order, this is done before any other modifications on the corpora.

Stopword Removal Stopwords, i.e. words thought not to convey any meaning, are removed from the text. The approach taken in this work does not compile a static list of stopwords, as usually done. Instead, PoS information is exploited, and all tokens that are not nouns, verbs or adjectives are removed.

Stemming Words with the same meaning appear in various morphological forms. To capture their similarity, they are normalised into a common root-form, the stem.

Weighting In the simplest scheme, each term's frequency is divided by the document length. Effectively, this approach is equivalent to normalising each document vector to length one and is called relative term frequency.

However, for this research a more sophisticated measure is used: the product of term frequency and inverse document frequency, tf·idf. Salton et al. define the inverse document frequency idf as

    idf_t = log2(n) − log2(df_t) + 1        (1)

where df_t is the number of documents in which term t appears and n is the total number of documents. Consequently, the tf·idf measure is calculated as

    tfidf_t = tf_t · (log2(n) − log2(df_t) + 1)        (2)

simply the multiplication of tf and idf.
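The weighting in equations (1) and (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the toy corpus and the use of plain dictionaries as document vectors are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight bag-of-words vectors by tf.idf, following equations (1) and (2)."""
    n = len(docs)
    bags = [Counter(doc) for doc in docs]
    # df_t: number of documents in which term t appears at least once
    df = Counter(term for bag in bags for term in bag)
    # tfidf_t = tf_t * (log2(n) - log2(df_t) + 1)   -- equation (2)
    return [{t: tf * (math.log2(n) - math.log2(df[t]) + 1)
             for t, tf in bag.items()}
            for bag in bags]

# Toy corpus (hypothetical): 'trade' occurs in both documents, so its idf
# is lower than that of 'tariff' or 'grain', which each occur in only one.
docs = [["trade", "tariff", "trade"], ["trade", "grain"]]
vecs = tfidf_vectors(docs)
```

Note that the "+1" in equation (1) keeps terms occurring in every document (where log2(n) − log2(df_t) = 0) from being zeroed out entirely.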
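The preprocessing pipeline described above (PoS-based stopword removal, stemming, WordNet enrichment, pruning) can be sketched roughly as below. The tiny tag set, stemming table and WordNet lookup table are stand-ins for Brill's tagger, a real stemmer and the WordNet database; none of these names come from the paper itself.

```python
from collections import Counter

# Hypothetical stand-ins for Brill's tagger tag set, a stemmer and WordNet.
OPEN_CLASS = {"NN", "NNS", "VB", "VBD", "JJ"}    # noun/verb/adjective tags kept
STEMS = {"rates": "rate"}                        # toy stemming table
WORDNET = {("rate", "NN"): ["charge", "amount"]} # stem -> synonyms + hypernyms

def preprocess(tagged_docs, prune_below=1):
    """PoS-filter, stem, WordNet-enrich and prune PoS-tagged documents."""
    bags = []
    for doc in tagged_docs:
        bag = Counter()
        for word, tag in doc:
            if tag not in OPEN_CLASS:            # PoS-based stopword removal
                continue
            stem = STEMS.get(word.lower(), word.lower())
            bag[stem] += 1
            # add synonyms/hypernyms of all senses -- no disambiguation
            for related in WORDNET.get((stem, tag[:2]), []):
                bag[related] += 1
        bags.append(bag)
    total = Counter()                            # corpus-wide frequencies
    for bag in bags:
        total.update(bag)
    # prune every term occurring no more often than the threshold
    return [Counter({t: f for t, f in bag.items() if total[t] > prune_below})
            for bag in bags]

tagged = [[("The", "DT"), ("rates", "NNS")],
          [("rate", "NN"), ("cuts", "NNS")]]
bags = preprocess(tagged)
```

In this toy run, "The" is dropped by the PoS filter, "rates" is stemmed to "rate" and enriched with its related terms, and "cuts" is pruned because it occurs only once in the corpus.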
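Bisecting k-means, the algorithm the paper adopts following Steinbach et al. (2000), repeatedly splits the largest cluster in two with standard 2-means. The sketch below is a simplified stand-in: it works on dense vectors with Euclidean distance for brevity, whereas the paper's setting would use the weighted term vectors.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean (centroid) of a list of vectors."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def two_means(points, iters=20, seed=0):
    """Standard k-means with k = 2: returns the two sub-clusters."""
    rng = random.Random(seed)
    centres = rng.sample(points, 2)
    halves = (list(points), [])
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            # index 0 or 1 depending on the nearer centre
            halves[dist2(p, centres[0]) > dist2(p, centres[1])].append(p)
        if not halves[0] or not halves[1]:
            break
        centres = [mean(halves[0]), mean(halves[1])]
    return halves

def bisecting_kmeans(points, k):
    """Split the largest cluster with 2-means until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        a, b = two_means(clusters.pop())     # bisect the largest cluster
        clusters += [a, b]
    return clusters

pts = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
cs = bisecting_kmeans(pts, 2)
```

Splitting only the largest cluster is what keeps the running time roughly linear in the number of documents, as noted above.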