Document Clustering using K-Means and K-Medoids
Rakesh Chandra Balabantaray*, Chandrali Sarma**, Monica Jha***
Article can be accessed online at http://www.publishingindia.com
International Journal of Knowledge Based Computer System, Volume 1, Issue 1, June 2013

Abstract

With the huge upsurge of information in day-to-day life, it has become difficult to assemble relevant information in the nick of time. People are always short of time and need everything quickly; hence clustering was introduced to gather relevant information into clusters. Among the several algorithms available for clustering information, in this paper we implement the K-Means and K-Medoids clustering algorithms and carry out a comparison to find which algorithm is better for clustering. On the best clusters formed, document summarization is executed based on sentence weight to focus on the key points of the whole document, which makes it easier for people to find the information they want and thus read only those documents that are relevant from their point of view.

Keywords: Clustering, K-Means, K-Medoids, WEKA 3.9, Document Summarization

1. Introduction

Achieving better efficiency in the retrieval of relevant information from an explosive collection of data is challenging. In this context, a process called document clustering can be used for easier information access. The goal of document clustering is to discover the natural grouping(s) of a set of patterns, points, objects or documents. Objects that are in the same cluster are similar among themselves and dissimilar to the objects belonging to other clusters. The purpose of document clustering is to meet human interests in information searching and understanding. The challenging problems of document clustering are big volume, high dimensionality and complex semantics. Our motive in the present paper is to extract a particular domain of work from a huge collection of documents using the K-Means and K-Medoids clustering algorithms and to obtain the best clusters, which can later be used for document summarization. Document clustering is a more specific technique for document organization, automatic topic extraction and fast IR1; it has been carried out using the K-means clustering algorithm implemented on a tool named WEKA2 (Waikato Environment for Knowledge Analysis), while the K-Medoids algorithm is executed on the Java NetBeans IDE to obtain clusters. This allows a collection of various documents from different domains to be segregated into groups of similar domain. The best cluster obtained then undergoes summarization to retrieve the highest-weighted sentences and reveal the gist of the whole document.

* IIIT Bhubaneswar, Bhubaneswar, Odisha, India. E-mail: [email protected]
** Department of Information and Technology, Gauhati University, Guwahati, India. E-mail: [email protected]
*** Department of Information and Technology, Gauhati University, Guwahati, India. E-mail: [email protected]

2. Reviews

Throughout these years, a lot of work on document clustering using k-means has been carried out by various researchers employing different means. Initially, researchers worked with the simple k-means algorithm, and in later years various modifications were executed. K-means clustering and the Vector Space Model were employed on text data by treating it as high dimensional; it was shown that the time taken for the entire clustering process was linear in the size of the document collection [Inderjit S. Dhillon et al. 2001]. Some researchers found an effective technique for K-means clustering which proves that principal components are the continuous solutions to the discrete cluster membership [Chris Ding et al. 2004]. Recently, work on the performance of partition clustering techniques in terms of complex data objects, together with a comparative study of cluster algorithms for corresponding data and proximity measures for a specific objective function based on the K-means and EM algorithms, was executed. Comparison and evaluation of clustering algorithms with multiple data sets, such as text, business, and stock market data, was performed; this comparative study identified one or more problematic factors such as high dimensionality, efficiency, scalability with data size, and sensitivity to noise in the data [Satheelaxmi G. et al. 2012]. Work integrating constraints into the trace formulation of the sum-of-squared-Euclidean-distance function of K-means, by transforming the criterion function into trace maximization, was optimized by eigen-decomposition [Guobiao Hu et al. 2008]. Plenty of work on the K-means algorithm with the WEKA model has been implemented in the past, which in turn has improved the WEKA tool set [Sapna Jain et al. 2010]. Previous work based on several datasets, including synthetic and real data, showed that the proposed algorithm may reduce the number of distance calculations by a factor of more than a thousand compared to existing algorithms, while producing clusters of comparable quality [Maria Camila N. et al. 2006]. A new language model that clusters and summarizes simultaneously has also been proposed; the method yields a good document clustering method with more meaningful interpretation and a better document summarization method that takes the document context information into consideration [Wang, Shenghuo Zhu et al. 2008].

3. Methodology

Execution of the algorithm is based on the input provided to it. In this paper, the input provided to the algorithm had to undergo certain refinement. The experiment comprises four major steps:

[Figure: experiment pipeline: Collection of Datasets → Convert Documents into Vectors → Clustering algorithm → Multi-document summarization]

The same input is provided to both algorithms, and after the algorithm implementation is over, the best cluster obtained is used for document summarization.

3.1. Collection of Datasets

At the beginning, one hundred documents were collected, consisting of twenty each from Entertainment (e1, e2, …, e20), Literature (l1, l2, …, l20), Sport (s1, s2, …, s20), Political (p1, p2, …, p20) and Zoology (z1, z2, …, z20). These documents undergo refinement and are then fed to the algorithm to obtain clusters containing documents from similar domains.

3.2. Convert Documents into Vectors

A document usually consists of a huge number of words, and not every word is important. The resulting high dimensionality of a document therefore has to be reduced. Hence, processing is carried out on each document to reduce this dimensionality, get rid of extra words, and obtain the weight of each word to be used in the algorithm. Conversion of documents into vectors is carried out in several steps.

3.2.1. Tokenization

All processes in information retrieval require the words of the data set. The main use of tokenization is identifying the meaningful keywords, called tokens. Tokenization splits sentences into individual tokens, typically words.

3.2.2. Stop-Words Removal

Collected documents contain some unnecessary words which increase the dimensionality of a document; we should remove those words to get a proper result. Pronouns, adverbs, prepositions, etc., which are used constantly throughout a document, have to be removed.

3.2.3. Weight Calculation

This step involves calculating the weight of each word twice: once using the frequency of words, and then using term frequency-inverse document frequency (tf-idf).
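The tokenization and stop-word removal steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the stop-word list here is a tiny sample (real lists contain hundreds of entries), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class Preprocess {
    // Tiny illustrative stop-word list; a real run would use a full list.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "is", "are", "of", "in", "to", "and", "it"));

    // Tokenization: split a sentence into lower-case word tokens.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stop-word removal: drop tokens that carry little domain meaning.
    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) kept.add(t);
        }
        return kept;
    }
}
```

For the sentence "The cat is in the hat.", tokenization yields six tokens and stop-word removal keeps only "cat" and "hat", illustrating how dimensionality shrinks before weighting.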
a. Weight calculated using frequency

The ratio of the number of occurrences of each word in the document to the total number of words in that document gives the weight:

Weight = frequency / (total number of words in the document)

b. Weight calculated using the tf-idf method

The other method, often used in information retrieval and text mining, is the tf-idf weight, where tf is the "Term Frequency", denoted tf(t,d) with the subscripts denoting the term and the document, and idf is the "Inverse Document Frequency". The term vector for a string is defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by

tf(t,cs) = sqrt(count(t,cs))

If df(t,docs) is the document frequency of token t, i.e. the number of documents in which the token t appears, then the inverse document frequency (IDF) of t is defined by

idf(t,docs) = log(D / df(t,docs))

where D is the total number of documents. Therefore, if the term weight is denoted by Wi, then typically

Wi = TF * IDF

3.2.4. Weighted Matrix Formation

A matrix is formed in which we store the weight of each word in the rows, and the document names in the columns. This matrix serves as the input for both algorithms. For the formation of this matrix we employ the idea of the Vector Space Model (VSM)3, an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms.

[Table: fragment of the weighted matrix, showing term weights Tw1-Tw5 against document d1; the extracted values are unreadable]

3.3. Clustering Based Algorithms

Among the various clustering-based algorithms, we have selected the K-means and K-Medoids algorithms. Implementation of the K-means algorithm was carried out via the WEKA tool, and of K-Medoids on the Java platform. Both algorithms are implemented using the same input matrix, and the results obtained are compared to get the best cluster.

K-Means Clustering using the WEKA Tool

To cluster documents, after the preprocessing tasks we have to form a flat file compatible with the WEKA tool and then send that file through the tool to form clusters for those documents. This section gives a brief mechanism of the WEKA tool and the use of the K-means algorithm on it. WEKA has four applications: Explorer, Experimenter, Knowledge Flow and Simple CLI; here only the Explorer application is used, and within it only the Preprocess and Cluster fields.

3.3.1. WEKA Data File Format (input)

As already mentioned, the weighted matrix is the input for the implementation of the algorithm, so an .arff
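The two weight calculations of Section 3.2.3 can be sketched directly from the formulas given there (Weight = frequency / total words; tf = sqrt(count); idf = log(D/df); Wi = tf * idf). The class and method names below are illustrative, not from the paper.

```java
public class Weights {
    // a. Frequency-based weight: occurrences of a word divided by total words.
    public static double frequencyWeight(int count, int totalWords) {
        return (double) count / totalWords;
    }

    // b. tf-idf weight, following the paper's formulas:
    // tf(t) = sqrt(count(t))
    public static double tf(int count) {
        return Math.sqrt(count);
    }

    // idf(t) = log(D / df(t)), where D is the total number of documents
    // and df(t) is the number of documents containing t.
    public static double idf(int totalDocs, int docFreq) {
        return Math.log((double) totalDocs / docFreq);
    }

    // Wi = TF * IDF
    public static double tfIdf(int count, int totalDocs, int docFreq) {
        return tf(count) * idf(totalDocs, docFreq);
    }
}
```

Note that a term appearing in every document gets idf = log(1) = 0, so its tf-idf weight vanishes; this is how tf-idf discounts words that do not discriminate between domains.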
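For reference, an .arff file encoding such a weighted matrix might look like the fragment below. The relation name, attribute names and weight values are invented for illustration; the authors' actual input file is not reproduced in the text.

```
@relation weighted_documents

@attribute Tw1 numeric
@attribute Tw2 numeric
@attribute Tw3 numeric

@data
0.05,0.00,0.12
0.00,0.08,0.03
```

Each @attribute line declares one term-weight column, and each row under @data holds the weights for one document, which WEKA's Explorer can then load through its Preprocess panel.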