Using Word Net for Document Clustering: a Detailed Review Harsha Patil 1 , Dr

IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

Using Word Net for Document Clustering: A Detailed Review Harsha Patil 1 , Dr. R. S. Thakur 2

1 Research Scholar, MANIT, Bhopal, Ashoka Center For Business and Computer Studies, Nashik, India. 2 Maulana Azad National Institute of Technology, Bhopal, India. 1 [email protected], 2 [email protected]

ABSTRACT: Document Clustering is an unsupervised technique for categorized documents in groups on the basis of their similarity. Document clustering techniques are basically very useful to efficiently manage and organize the result of search engine query. Mostly document clustering techniques use Vector Space Model (VSM) to represent any document. VSM based methods generates bag of Words. But VSM based approaches doesn’t consider Semantic relationship among the words. Many researchers are working on semantic aspects of document clustering to improve cluster quality. Since last eight nine years efforts have been seen in applying semantics to document clustering. Many external knowledge bases like Word Net, Wikipedia, and Lucerne etc. are utilized to handle this challenge. The article explains semantic approach in detail and different semantic similarity measures used in algorithms for finding semantic association among words.

Keywords: Document Clustering, Semantic, WordNet, Similarity measures, synonyms.

INTRODUCTION

Document Clustering deals with unstructured text, which generates many challenges. Abstract concepts consist by text are difficult to represent as well as having countless combination of abstract relationships. Most of the traditional methods are based on BOW (Bag of Words) which discounts semantic associations among the words and consequences in less qualitative output. In last nine to ten years span, many of the researchers explore the semantic aspects of the words and improve the text mining techniques. Use of external knowledge base is being very helpful to develop semantic based approaches for document clustering. Word Net (Miller, 1995) is one of the most widely used thesauruses for English language. Ambiguity and synonymy are two of the major problems that document clustering techniques regularly fail to tackle with. In recent research work, WordNet has been widely used to improve quality of document clustering. WordNet is a lexical knowledge, based on conceptual look up which organize lexical information in terms of word meaning, rather than word form. Here we extant extensive survey of numerous Semantic based techniques for document clustering which exploits WordNet and improved the quality of the clusters formed.

DOCUMENT CLUSTERING TECHNIQUES: AN OVERVIEW

Document clustering is the process of making cohesive groups which consist documents, which are more similar with each other. So documents in same groups have high similarity as compare to document from different cluster. Document clustering techniques can be classified in to three broad categories: Partitioning method [1], Agglomerative and divisive clustering [2] and item set based clustering [3]. Many clustering algorithms, based on partitioning or hierarchical methodology like K-Means [4], Bisecting K-Means [1], Hierarchical Agglomerative clustering (HAC) [x], and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [x] performs efficiently for low dimensional data but in case of high dimensional data they results in poor clustering. Frequent item set based algorithms handle the problem of high dimensionality of text documents by selecting only frequent item sets as features for clustering. Hierarchical Frequent Term based Clustering (HFTC) [5] proposed by Beil F, Ester M, Xu X (2002) did great contribution in this direction. After that Fung, et al proposed Hierarchical Document Clustering using frequent item sets (FIHC) [3] which use association rule mining and provides meaningful labels to the clusters. All the above algorithms are not considered the semantic associations of the words.

VOLUME 4, ISSUE 7, DEC/2017 339 http://iaetsdjaras.org/ IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

From last nine-ten years many researchers have been using External knowledge base for associate meanings with words. WordNet is extensively used by researchers for this purpose. Related works using WordNet is exhaustively explained in further section of this paper.

WORDNET

Word Net is one of the widely used lexical reference system based on psycholinguistic theories of human lexica memory. There is a multilingual WordNet for European languages which is structured in the same way as the English language WordNet. Now it is available in some Indian languages also like in Hindi, Marathi, Punjabi etc. and are going to connect with English and European WordNet. Recent Windows version of WordNet is 2.1, released in March 2005. Version 3.0 for Unix/Linux/Solaris/etc. was released in December, 2006.

Various Natural language processing applications used WordNet for Word Sense disambiguation, find semantic distance between words, Machine Translation, Search engine processing, Plagiarism detection, Sentiment analysis etc. WordNet established the lexical or semantical connections between noun, verb, adjective, and adverb form of words, which expressing distinct concepts. This distinct concept is called sense and has distinct sysnet, which explain specific meaning of a word, its explanation, and its synonyms. The fundamental unit which searched in WordNet is concept rather than word however in dictionary where fundamental unit is results in meaningful unit.

SEMANTIC SIMILARITY MEASURES:

Semantic similarity is confidence score of two words which explains likeness of their meaning. Increased in use of WordNet results in many semantic similarity methods for find distance between words. According to Meng et al. [6] Semantic similarity measures can be broadly categorized in four classes: path length based measures, information content based measures, feature based measures, and hybrid measures.

1 Path length based measures: It is based on the length of the path connecting the concepts and the location of the concepts in the taxonomy [7]. It counts edges between concepts. The disadvantage of this method is two pairs with equal length of shortest path will have the same similarity. 2 Information content based measures: It is based on the principle is that if two concepts are sharing more common information that means they are more similar [8]. 3 feature based measures: According to this measure two concepts are more similar if they have more common features and less uncommon features [9]. This measure is not work properly if complete feature sets of concepts are not available. 4 Hybrid measure: This measure combines the principles proposed in path length based measures, information content based measures and feature based measure. It also consider the relations like IS-A, Part- of more finding semantic similarity.

RELATED WORKS

In this section we provide tabulated representation of all related research of document clustering using WordNet. This detail study will be very precious collection for all researchers who are looking this area for their future research.

VOLUME 4, ISSUE 7, DEC/2017 340 http://iaetsdjaras.org/ IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

Table 1: Summarized details on review of Document clustering algorithms using WordNet

VOLUME 4, ISSUE 7, DEC/2017 341 http://iaetsdjaras.org/ IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

CONCLUSION

This paper cover Semantic similarity measures in WordNet to find semantic association among words. This survey will be very useful for the researchers who want to uses Semantic based techniques for document clustering which exploits WordNet and improved the quality of the clusters formed. Survey gives summarized description about research going on using WordNet and also provide briefing about algorithms they are using, evaluation measures and future scope.

REFERENCES

[1] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. Proc. Of the 6th ACM SIGKDD international conference on TextMining Workshop, KDD 2000, 2000

[2] Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data, An introduction to Cluster Analysis, John Wiley & Sons, Inc (1990)

[3] Fung, B., Wang, K., Ester, M.: Hierarchical Document Clustering using Frequent Item sets, In Proc. of SIAM Intl. Conf. on Data Mining. (2003)

[4] Hartigan, J. A., and Wong, M. A.: Algorithm AS 136: A K-Meanss Clustering Algorithm. In: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28, pp. 100-108, Royal Statistical Society (1979)

VOLUME 4, ISSUE 7, DEC/2017 342 http://iaetsdjaras.org/ IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

[5] Beil, F., Ester, M., Xu, X.: Frequent Term-based Text Clustering, In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining,. (2002)

[6] Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in wordnet. International Journal of Hybrid Information Technology, 6(1), 1–12.

[7] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis and E. E. Milios, “Semantic similarity methods in WordNet and their application to information retrieval on the web”, Proceedings of the 7th annual ACM international workshop on Web information and data management, (2005) October 31- November 05, Bremen, Germany.

[8] P. Resnik, “Using information content to evaluate semantic similarity”, Proceedings of the 14th International Joint Conference on Artificial Intelligence, (1995) August 20-25; Montréal Québec, Canada.

[9] A. Tversky, “Features of Similarity”, Psycological Review, vol. 84, no. 4, (1977).

[10] Andreas Hotho , Steffen Staab , Gerd Stumme, “Wordnet improves Text Document Clustering,” In Proc. of the SIGIR 2003 Semantic Web Workshop, 2003

[11] Chihli Hung, Stefan Wermter, Peter Smith, “Hybrid Neural Document Clustering Using Guided Self-Organization and WordNet,” Journal IEEE Intelligent Systems archive, Vol. 19 Issue 2, pp. 68-77, Mar. 2004

[12] Yong Wang, Julia Hodges, “Document Clustering with Semantic Analysis,” In Proc. of the 39th Annual Hawaii International Conference on System Sciences, HICSS, Vol. 03, pp. 54.3, 2006

[13] Hai-Tao Zheng, Bo-Yeong Kang, Hong-Gee Kim, “Exploiting noun phrases and semantic relationships for text document clustering,” Journal of Information Sciences, Vol. 179, Issue 13, pp. 2249-2262, Jun. 2009

[14] Chun-Ling Chen, Frank S. Tseng, Tyne Liang, “An Integration of Fuzzy Association Rules and WordNet for Document Clustering,” In Proc. of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD, pp. 147-159, 2009

[15] S.Vijayalakshmi, Dr.D.Manimegalai, “Query based Text Document Clustering using its Hypernymy Relation,” International Journal of Computer Applications 23(1):13–16, Jun. 2011

[16] Dang, Q., Zhang, J., Lu, Y., & Zhang, K. (2013). WordNet-based suffix tree clustering algorithm. In Paper presented at the 2013 international conference on information science and computer applications (ISCA 2013).

[17] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, “A semantic approach for text clustering using Word Net and lexical chains,” Expert Systems with Applications, vol. 42, no. 4, pp. 2264–2275,2015.

[18] Sujata R. Kolhe ; S. D. Sawarkar, “A concept driven document clustering using WordNet,” In Proc. of the International conference Nascent Technologies in Engineering (ICNTE),2017.

VOLUME 4, ISSUE 7, DEC/2017 343 http://iaetsdjaras.org/