A WORD-BASED SOFT CLUSTERING ALGORITHM FOR DOCUMENTS

King-Ip Lin, Ravikumar Kondadadi
Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA.
linki,[email protected]

Abstract:
Document clustering is an important tool for applications such as Web search engines. It enables the user to get a good overall view of the information contained in the documents. However, existing algorithms suffer from various drawbacks: hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually inefficient. We propose WBSC (Word-Based Soft Clustering), an efficient soft clustering algorithm based on a given similarity measure. WBSC uses a hierarchical approach to cluster documents having similar words. WBSC is very effective and efficient when compared with existing hard clustering algorithms like K-means and its variants.

Keywords:
Document clustering, word-based clustering, soft clustering

1. Introduction

Document clustering is a very useful tool in today's world, where many documents are stored and retrieved electronically. It enables one to discover hidden similarities and key concepts. Moreover, it enables one to summarize a large number of documents using the key or common attributes of the clusters. Thus clustering can be used to categorize document databases and digital libraries, as well as to provide useful summary information about the categories for browsing purposes. For instance, a typical search on the World Wide Web can return thousands of documents. Automatically clustering these documents gives the user a clear and easy grasp of what kinds of documents were retrieved, providing tremendous help in locating the right information.

Many clustering techniques have been developed, and they can be applied to clustering documents. [3, 6] contain examples of using such techniques. Most of these traditional approaches use documents as the basis of clustering. In this paper, we take an alternative approach, word-based soft clustering (WBSC). Instead of comparing documents and clustering them directly, we cluster the words used in the documents. For each word, we form a cluster containing the documents that have that word. After that, we combine clusters that contain similar sets of documents, using an agglomerative approach [5]. The resulting clusters will contain documents that share a similar set of words.

We believe this approach has many advantages. Firstly, it allows a document to reside in multiple clusters, which makes it possible to capture documents that contain multiple topics. Secondly, it lends itself to a natural representation of the clusters, namely the words associated with the clusters themselves. Also, the running time is strictly linear in the number of documents. Even though the running time grows quadratically with the number of clusters, this number is bounded by the number of distinct words, which has a rather fixed upper bound (the number of words in the language). This allows the algorithm to scale well.

The rest of the paper is organized as follows: Section 2 summarizes related work in the field. Section 3 describes our algorithm in more detail. Section 4 provides some preliminary experimental results. Section 5 outlines future directions of this work.

2. Related work

Clustering algorithms have been developed and used in many fields. [4] provides an extensive survey of various clustering techniques. In this section, we highlight work done on document clustering.

Many clustering techniques have been applied to clustering documents. For instance, Willett [9] provided a survey on applying hierarchical clustering algorithms to document clustering. Cutting et al. [1] adapted various partition-based clustering algorithms to clustering documents. Two of their techniques are Buckshot and Fractionation. Buckshot selects a small sample of documents, pre-clusters them using a standard clustering algorithm, and assigns the rest of the documents to the clusters formed. Fractionation splits the N documents into m buckets, where each bucket contains N/m documents. Fractionation takes an input parameter ρ, which indicates the reduction factor for each bucket: the standard clustering algorithm is applied so that if there are n documents in a bucket, they are clustered into n/ρ clusters. Each of these clusters is then treated as if it were an individual document, and the whole process is repeated until there are only K clusters.
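To make the Fractionation procedure concrete, the following Python sketch paraphrases the description above. It is an illustration only, not code from [1]; in particular, cluster_into(items, k) is a hypothetical stand-in for the "standard clustering algorithm" used inside each bucket.

    def fractionation(docs, K, m, rho, cluster_into):
        # Sketch of Fractionation as described above. cluster_into(items, k)
        # is a hypothetical stand-in for the standard clustering algorithm:
        # it must return k clusters, each a list of items.
        items = [[d] for d in docs]                # singleton pseudo-documents
        while len(items) > K:
            size = max(1, len(items) // m)         # m buckets of about N/m items
            buckets = [items[i:i + size] for i in range(0, len(items), size)]
            merged = []
            for bucket in buckets:
                k = max(1, int(len(bucket) / rho)) # n items -> n/rho clusters
                for cluster in cluster_into(bucket, k):
                    # Each cluster is then treated as one pseudo-document.
                    merged.append([doc for item in cluster for doc in item])
            if len(merged) >= len(items):          # guard against non-termination
                break
            items = merged
        return items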
A relatively new technique was proposed by Zamir and Etzioni [10]. They introduced the notion of phrase-based document clustering: they use a generalized suffix tree to obtain information about the phrases in the documents, and use the phrases to cluster the documents.

3. WBSC

As the name suggests, WBSC uses a word-based approach to build clusters. It first forms initial clusters of the documents, with each cluster representing a single word. For instance, WBSC forms a cluster for the word 'tiger' made up of all the documents that contain the word 'tiger'. After that, WBSC merges similar clusters (clusters are similar if they contain similar sets of documents) using a hierarchical approach, until some stopping criterion is reached. At the end, the clusters are displayed based on the words associated with them.

Cluster Initialization:
We represent each cluster as a vector of documents. For each word, we create a bit vector (called a term vector), with each dimension representing whether a certain document is present or absent (denoted by 1 and 0, respectively). This step requires only one pass through the set of documents. However, we do maintain some sort of index (e.g. a hash table) on the list of words in the documents to speed up the building process.

Notice that the number of clusters is independent of the number of documents. However, it grows with the total number of distinct words in all the documents. Thus, we do not consider stop-words: words that either carry no meaning (like prepositions) or are very common. The second idea can be extended. Words that appear in too many documents are not going to help clustering. Similarly, if a word appears in only one document, its cluster is unlikely to merge with any other cluster, and it remains useless for clustering. Thus, we remove all initial clusters that contain only one document or contain more than half of the total number of documents.
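As a minimal sketch of this initialization step (in Python, with names of our choosing; the stop-word list is assumed to be given), each term vector is represented below as a set of document ids, which is equivalent to the bit vector just described:

    def build_initial_clusters(docs, stop_words):
        # One initial cluster per word; docs is a list of token lists.
        index = {}                                 # hash table: word -> doc ids
        for doc_id, tokens in enumerate(docs):     # a single pass over the documents
            for word in tokens:
                if word not in stop_words:
                    index.setdefault(word, set()).add(doc_id)
        n = len(docs)
        # Remove useless initial clusters: those with a single document,
        # or with more than half of the total number of documents.
        return {w: ids for w, ids in index.items() if 1 < len(ids) <= n / 2}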
Cluster Building:
Once the initial clusters are formed, we apply a hierarchical algorithm to combine them. Here we need a measure of similarity between any pair of clusters. Assume we have two clusters A and B, with n1 and n2 documents respectively (without loss of generality, assume n1 ≤ n2), and with c documents in common. Intuitively, we would like to combine the two clusters if they have many common documents (relative to the sizes of the clusters). We first tried c / min(n1, n2) as our similarity measure; however, this measure does not perform well. After trying different measures, we settled on the Tanimoto measure [2], c / (n1 + n2 - c).

WBSC iterates over the clusters and merges similar clusters using this similarity measure. We consider every pair of clusters, merging them if their similarity value is higher than a pre-defined threshold. Also, there are cases where one cluster is simply a subset of the other; we merge them in that case as well. Currently we use 0.33 as our threshold. We are experimenting with different ways of finding a better threshold.

Displaying Results:
In the algorithm, we keep track of the words that are used to represent each cluster. When two clusters merge, the new cluster acquires the words from both clusters. Thus, at the end of the algorithm, each cluster has a set of representative words. This set can be used to provide a description of the cluster. Also, we found that many clusters have very few words associated with them, as they have either never been merged or have been merged only two or three times. Our results show that discarding such clusters does not affect the results significantly. In fact, it allows the results to be interpreted more clearly.
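The merging pass and the word tracking just described can be sketched as follows (again an illustration in Python with our own names, not the authors' code; the 0.33 threshold is the one quoted above). A production version would use better data structures to avoid repeatedly scanning all cluster pairs, which is the kind of speed-up mentioned as future work in Section 5.

    def tanimoto(a, b):
        # Tanimoto measure c / (n1 + n2 - c) on two sets of document ids.
        c = len(a & b)
        return c / (len(a) + len(b) - c)

    def merge_clusters(initial, threshold=0.33):
        # initial: word -> set of document ids (the initial clusters).
        # Each cluster carries its representative words; when two clusters
        # merge, the new cluster acquires the words of both.
        clusters = [({word}, ids) for word, ids in initial.items()]
        i = 0
        while i < len(clusters):
            merged_any = False
            j = i + 1
            while j < len(clusters):
                words_i, docs_i = clusters[i]
                words_j, docs_j = clusters[j]
                # Merge on high similarity, or when one cluster is simply
                # a subset of the other.
                if (tanimoto(docs_i, docs_j) > threshold
                        or docs_i <= docs_j or docs_j <= docs_i):
                    clusters[i] = (words_i | words_j, docs_i | docs_j)
                    del clusters[j]                # cluster i absorbed cluster j
                    merged_any = True
                else:
                    j += 1
            i = 0 if merged_any else i + 1         # rescan after any merge
        return clusters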
4. Experiments & Results

This section describes the results of the various experiments that we carried out. In order to evaluate the performance of WBSC, we compared it with other algorithms, namely K-Means, Fractionation and Buckshot [1].

Experimental Setup:
Our test bed consists of two separate datasets:
- Web: This contains 2000 documents downloaded from the Web, from different categories such as Food, Cricket, Astronomy, Clustering, Genetic Algorithms, Baseball, Movies, XML, Salsa (Dance) and Olympics. We used search engines to help us locate the appropriate documents.
- Newsgroups: This contains 2000 documents from the UCI KDD archive (http://kdd.ics.uci.edu/) [8]. It has documents retrieved from 20 different newsgroups.

All the experiments were carried out on a 733 MHz PC with 260 MB of RAM. We experimented with document sets of different sizes. We ran the algorithm to get the clusters and tested the effectiveness of the clustering (the types of clusters formed), both qualitatively and quantitatively. We also compared the execution times of all the algorithms on document sets of different sizes.

Effectiveness of Clustering:
We did many experiments with different numbers of documents taken from the above-mentioned test bed. All the algorithms were run to produce the same number of clusters with the same input parameters. WBSC formed clusters for each of the different categories in the document sets, while the other algorithms (K-Means, Fractionation and Buckshot) did not. In addition, the other algorithms formed clusters containing documents from many categories that are not related to each other. Figure 1 shows sample output from one of the experiments, with 500 documents picked from 10 of the categories of the newsgroups data set. The categories include atheism, graphics, x-windows, hardware, space, electronics, Christianity, guns, baseball and hockey.

Cluster results:

Keywords | Category
BTW, remind, NHL, hockey, Canada, Canucks, teams, players | Hockey
Human, Christian, bible, background, research, errors, April, President, members, member, private, community | Christian
Cult, MAGIC, baseball, Bob, rules, poor, teams, players, season | Baseball
Analog, digital, straight, board, chip, processor | Hardware
Planes, pilots, ride, signals, offered | Electronics
Distribution, experience, anti, message, Windows, error, week, X Server, Microsoft, problems | Windows-x
America, carry, isn, weapons, science, armed | Guns
Military, fold, NASA, access, backup, systems, station, shuttle | Space
Notion, soul, several, hell, Personally, created, draw, exists, god | Atheism
Drivers, SVGA, conversion, VGA, compression, cpu, tools | Graphics
Keith, Jon, attempt, questions | Atheism
Limit, Police, Illegal, law | Guns
Ohm, impedance, circuit, input | Electronics
Season, Lemieux, Bure, runs, wings | Hockey

Figure 1: Clusters formed by WBSC on a sample from the UCI data archive.

WBSC formed 14 clusters. It formed more than one cluster for some of the categories, such as Hockey and Guns. The hockey document set contains different documents related to the Detroit Red Wings and the Vancouver Canucks, and our algorithm formed two different clusters for these two sets. We ran the other algorithms with 10 as the required number of clusters. Many of the clusters they formed contain documents that are not related to each other at all (they contain documents from different categories).

To show the effectiveness of our algorithm, we ran the various algorithms on our web data set, where some documents belong to multiple categories (for example, a document related to both baseball and movies should be in all three clusters: baseball, movies, and baseball-movies). We cannot present the full results due to space limitations. K-Means, Fractionation and Buckshot formed clusters only about baseball and movies, and did not form a cluster for documents related to both categories. WBSC, however, formed a baseball-movies cluster and put the documents related to both categories in it. This shows the effectiveness of our method.

We also measured the effectiveness of WBSC quantitatively. We compared the clusters formed against the original categories, matching the clusters with the categories one-to-one. For each matching, we counted the number of documents that are common to the corresponding clusters and categories; the matching with the largest number of common documents is used to measure effectiveness. This matching can be found by a maximum-weight bipartite matching algorithm [7]. We report the number of documents in the matching: the more documents are matched, the more the clusters resemble the original categories. For our algorithm, and for the purpose of this comparison, we assigned each document to the cluster with which it has the largest similarity value.
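As an illustration, this evaluation can be computed with any maximum-weight bipartite matching (assignment) routine; the sketch below uses SciPy's linear_sum_assignment for convenience, which is our choice of solver rather than necessarily the specific algorithm of [7]:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def matching_score(clusters, categories):
        # clusters and categories are lists of sets of document ids. Match
        # clusters to categories one-to-one so that the total number of
        # common documents is maximized, and return that total.
        weight = np.array([[len(c & g) for g in categories] for c in clusters])
        rows, cols = linear_sum_assignment(weight, maximize=True)
        return int(weight[rows, cols].sum())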

Figure 2 shows the number of matches with the original categories for the different algorithms on document sets of different sizes.

[Figure 2: Comparison of the quality of clusters produced by different clustering algorithms. Number of matches with the original categories versus number of documents (100 to 500) for WBSC, K-Means, Buckshot and Fractionation.]

We can clearly see that WBSC outperforms the other algorithms in effectiveness.

Execution Time:
We also measured the execution times of the various algorithms. Figure 3 gives the comparison of execution time. As the graph shows, WBSC outperforms almost all the other algorithms in execution time, especially as the number of documents increases.

[Figure 3: Execution times of various clustering algorithms. Execution time in minutes versus number of documents (200 to 1400) for WBSC, K-Means, Buckshot and Fractionation.]

Web Search Results:
We also tested WBSC on the results returned by Web search engines. We downloaded the documents returned by the Google search engine (www.google.com) and applied WBSC to them. Space limitations prevent us from showing all the results; here we show some of the clusters found by WBSC when clustering the top 100 URLs returned for the search term "cardinal". The categories correspond to the common usages of the word "cardinal" in documents over the Web (the name of baseball teams, a nickname for schools, a kind of bird, and Catholic clergy). Figure 4 shows some of the clusters formed by WBSC.

Cluster results:

Keywords and sample documents | Related topic
Bishops, Catholic, world, Roman, Church, cardinals, College, Holy; sample documents: 1. Cardinals in the Catholic Church (Catholicpages.com), 2. The Cardinals of the Holy Roman Church library | Roman Catholic Church
Dark, Bird, female, color; sample documents: 1. Cardinal Page, 2. First Internet Bird Club | Cardinal Bird
Benes, rookie, players, hit, innings, runs, Garrett, teams, league, Saturday; sample documents: 1. St.Louis Cardinals Notes, 2. National League Baseball - Brewers vs. Cardinals | St.Louis Cardinals (MLB team)

Figure 4: Clusters formed for the search term "cardinals" by WBSC.

This shows that our algorithm is effective even on Web search results.

5. Conclusions

In this paper, we presented WBSC, a word-based document clustering algorithm. We have shown that it is a promising means of clustering documents, as well as of presenting the clusters to the user in a meaningful way. We have compared WBSC with other existing algorithms and have shown that it performs well, both in terms of running time and in the quality of the clusters formed.

Future work includes tuning the parameters of the algorithm, as well as adopting new similarity measures and data structures to speed up the algorithm.

6. References

[1] Douglass R. Cutting, David R. Karger, Jan O. Pedersen and John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, in Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pp. 318-329, June 1992.
[2] P. M. Dean (Ed.), Molecular Similarity in Drug Design, Blackie Academic & Professional, 1995, pp. 111-137.
[3] D. R. Hill, A vector clustering technique, in: Samuelson (Ed.), Mechanized Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam, 1968.
[4] A. K. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31(3): 264-323, September 1999.
[5] F. Murtagh, A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26(4): 354-359, 1983.
[6] J. J. Rocchio, Document Retrieval Systems: Optimization and Evaluation, Ph.D. Thesis, Harvard University, 1966.
[7] Robert E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, 1983.
[8] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html, last visited November 18, 2000.
[9] P. Willett, Recent Trends in Hierarchical Document Clustering: A Critical Review, Information Processing and Management, 24: 577-597, 1988.
[10] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), pp. 46-54, 1998.