Document Clustering with Concept Based Vector Suffix Tree Document Model
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Electronics Communication and Computer Engineering a Volume 6, Issue (5) Sept., NCRTCST-2015, ISSN 2249–071X 3rd National Conference on Research Trends in Computer Science & Technology NCRTCST-2015 Document Clustering with Concept Based Vector Suffix Tree Document Model Dr. N. Sandhya J Sireesha Subir Kumar Sarkar Professor, VNRVJIET Assoc. Prof., MREC(A), Dept. of ETCE, Email: [email protected] Email: [email protected] Jadavpur Kolkata, India University Abstract – This work aims to extend a document each node can be used to represent the documents. This representation model which is elegant by combining the model can be extended by attempting the word sense versatility of the vector space model, the increased relevance disambiguation using WordNet. Once the vector of of the suffix tree document model, which takes word ordering weights is obtained, similarity measures and K-Means into account and retaining the relationship between words clustering algorithm with the Vector Space Model are like synonyms. The effectiveness and the relevance of this document model can be evaluated by a partitioning applied. The lack of an effective measure for the quality of clustering technique K-Means and then a systematic clusters which is an important problem with the original comparative study of the impact of similarity measures in suffix tree is overcome with this model. conjunction with different types of vector space representation on cluster quality may be performed. This III. DOCUMENTS PREPROCESSING USING document model will be called the concept based vector suffix tree document model. WORDNET Keywords – Wireless Network, Failures, Checkpoints, Initially we need to preprocess the documents. This step Rollback Recovery. is imperative. The next step is to analyze the prepared data and divide it into clusters using clustering algorithm. The I. INTRODUCTION effectiveness of clustering is improved by adding the Part- Of-Speech Tagging and making use of the WordNet for The most popular way for representing documents is the synsets. The most important procedure in the VSM, because of its speed and versatility. TF-IDF score preprocessing of documents is to enrich the term vectors can be computed easily, so this model is fast. In the worst with concepts from the core ontology. WordNet [7] covers case the document needs to be traversed twice for semantic and lexical relations between word forms and computing document frequencies and TF-IDF values. To word meanings. The first preprocessing step is POS overcome the bag of words problems, text documents are tagging. POS tagging is a process of assigning correct treated as sequence of words and documents are retrieved syntactic categories to each word. Tag set and word based on sharing of frequent word sequences from text disambiguation rules are fundamental parts of any POS databases. The sequential relationship between the words tagger. The POS tagger relies on the text structure and and documents is preserved using suffix tree data structure morphological differences to determine the appropriate [1]. Syntax based disambiguation is attempted by part-of-speech. POS Tagging is an essential part of Natural enriching the text document representations by Language Processing (NLP) applications such as speech background knowledge provided in a core ontology- recognition, text to speech, word sense disambiguation, WordNet. information retrieval, semantic processing, parsing, This paper is organized as follows. The section 2 deals information extraction and machine translation. WordNet with the related work in text document clustering, section contains only nouns, verbs, adjectives and adverbs. Since 3 describes the preprocessing, POS tagging and the use of nouns and verbs are more important in representing the WordNet for replacing words with their Synset IDs. content of documents and also mainly form the frequent Section 4 and Section 5 discusses the document word meaning sequences, we focus only on nouns and representation, similarity measures and their semantics. verbs and remove all adjectives and adverbs from the Section 6 describes the steps in the CBVSTDM. documents. For those word forms that do not have entries in WordNet, we keep them in the documents since these II. RELATED WORK unidentified word forms may capture unique information about the documents. We remove the stop words and An approach that combines suffix trees with the vector then stemming is performed. The morphology function space model was proposed earlier [2] [3]. They use the provided with WordNet is used for stemming as it only same Suffix Tree Document Model proposed by Zamir yields stems that are contained in the WordNet dictionary and Etzioni but they map the nodes from the common and also achieves improved results than Porter stemmer. suffix tree to a M dimensional space in the Vector Space The stemmed words are then looked up in the WordNet Model. Thus, a feature vector containing the weights of the lexical database to replace the words by their synset IDs. These preprocessing steps aim to improve the cluster All copyrights Reserved by NCRTCST-2015, Departments of CSE CMR College of Engineering and Technology, Kandlakoya (V), Medchal Road, Hyderabad, India. Published by IJECCE (www.ijecce.org) 35 International Journal of Electronics Communication and Computer Engineering a Volume 6, Issue (5) Sept., NCRTCST-2015, ISSN 2249–071X 3rd National Conference on Research Trends in Computer Science & Technology NCRTCST-2015 quality. These steps lead to the reduction of dimensions in Coefficient. The details of different similarity measures the term space. are described below. 5.1 Cosine Similarity Measure IV. DOCUMENT REPRESENTATION Classification The most commonly used is the cosine function [8]. For two documents di and dj, the similarity The representation of a set of documents as vectors in a between them can be calculated common vector space is known as the vector space model. ddij. Documents in vector space can be represented using cos (ddij , )= (4) d dj d Boolean, Term Frequency and Term Frequency – Inverse ij Document Frequency. In Boolean representation, if a term Where’d, and dj are m-dimensional vectors over the term exists in a document, then the corresponding term value is set T= {t1, t2,…tm}. Each dimension represents a term set to one otherwise it is set to zero. Boolean with its weight in the document, which is non-negative. representation is used when every term has equal As a Result the cosine similarity is non-negative and importance and is applied when the documents are of bounded between [0, 1]. Cosine similarity captures a scale small size. In Term Frequency and Term Frequency invariant understanding of similarity and is independent of Inverse Document Frequency the term weights have to be document length. When the document vectors are of unit set. The term weights are set as the simple frequency length, the above equation is simplified to: counts of the terms in the documents. This reflects the cos (di , dj) = di . dj intuition that terms occur frequently within a document may reflect its meaning more strongly than terms that occur less frequently and should thus have higher weights. Each document d is considered as a vector in the term- space and represented by the term frequency (TF) vector: dtf = [ tf1 , tf2 , ……..tfD ] (1) where tfi is the frequency of term i in the document and D is the total number of unique terms in the text database. The tf–idf representation of the document d = [ tf log( n / df ), tf log( n / df ),….. tf – idf 1 1 2 2 Fig.1. Suffix tree for strings “cat ate ....., tfD log( n /dfD )] (2) To account for the documents of different lengths, each When the cosine value is 1 the two documents are document vector is normalized to a unit vector (i.e., ||dtf- identical, and 0 if there is nothing in common between idf|||=1). In the rest of this paper, we assume that this vector them. (i.e., their document vectors are orthogonal to each space model is used to represent documents during the other). clustering. Given a set Cj of documents and their 5.2 Jaccard Coefficient corresponding vector representations, the centroid vector cj The Jaccard coefficient, which is sometimes referred to is defined as: as the Tanimoto coefficient [9, 10] measures similarity as 1 the intersection divided by the union of the objects. For cdji (3) text document, the Jaccard coefficient compares the sum C dC j ij weight of shared terms to the sum weight of terms that are where each di is the document vector in the set Cj, and j is present in either of the two documents but are not the the number of documents in cluster Cj. It should be noted common terms. that even though each document vector di is of unit length, dd. ij the centroid vector cj is not necessarily of unit length. Jaccard Coff (dd , )= ij 2 2 di d j d i* d i V. SIMILARITY MEASURES ddij Jaccard Index (dd , )= ijdd Document clustering groups similar documents to form ij a coherent cluster. However, the definition of a pair of The Jaccard Coefficient ranges between [0, 1]. The documents being similar or different is not always clear Jaccard value is 1 if two documents are identical and 0 if and normally varies with the actual problem setting. the two documents are disjoint. The Cosine Similarity may Accurate clustering requires a precise definition of the be extended to yield Jaccard Coefficient in case of Binary closeness between a pair of objects, in terms of either the attributes. pair wise similarity or distance [20]. A variety of 5.3 Pearson Correlation Coefficient similarity or distance measures have been proposed and Correlation is a technique for investigating the widely applied, such as cosine similarity, Jaccard relationship between two quantitative, continuous coefficient, Euclidean distance and Pearson Correlation variables, for example, age and blood pressure.