
Semantic Document Clustering Using Information from WordNet and DBPedia

Lubomir Stanchev
Computer Science Department
California Polytechnic State University
San Luis Obispo, CA, USA
[email protected]

Abstract—Semantic document clustering is a type of unsupervised learning in which documents are grouped together based on their meaning. Unlike traditional approaches that cluster documents based on common keywords, this technique can group documents that share no words in common as long as they are on the same subject. We compute the similarity between two documents as a function of the semantic similarity between the words and phrases in the documents. We model information from WordNet and DBPedia as a probabilistic graph that can be used to compute the similarity between two terms. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keyword matching and one that uses the probabilistic graph. We show that the second approach produces higher precision and recall, which corresponds to better alignment with the classification that was done by human experts.

I. INTRODUCTION

Consider an RSS feed of news stories. Organizing them in categories will make search easier. For example, a smart classifier will put a story about "Tardar Sauce" (better known as Grumpy Cat) and a story about "Henri, le Chat Noir" (Henry, the black cat) in the same category because both stories are about famous cats from the Internet. Our approach uses information from WordNet [25] and DBPedia [20] to construct a probabilistic graph that can be used to compute the semantic similarity between the two documents.

The problem of semantic document clustering is interesting because it can improve the quality of the clustering results as compared to keyword matching algorithms. For example, the latter algorithms will likely put documents that use different terminology to describe the same concept in separate categories. Consider a document that contains the term "ascorbic acid" multiple times and a document that contains the term "vitamin C" multiple times. The documents are semantically similar because "ascorbic acid" and "vitamin C" refer to the same organic compound and therefore a clustering algorithm should take this fact into account. However, this will only happen when the close relationship between the two terms is stored in the system and applied during document clustering. The need for a semantic document clustering system becomes even more apparent when the number of documents is small or when they are very short. In this case, it is likely that the documents will not share many words and a keyword matching strategy will struggle to find evidence for grouping any two documents together.

The problem of semantic document clustering is difficult because it involves some understanding of the English language and our world. For example, our system can use information from DBPedia to determine that "Henri, le Chat Noir" and "Tardar Sauce" are both famous cats. Although significant effort has been put forward in automated natural language processing [9], [10], [23], current approaches fall short of understanding the precise meaning of human text. In our approach, we make limited use of natural language processing techniques (for example, we use the Stanford CoreNLP tool [22]) and we rely on high-quality information about the words in the English language (WordNet) and our world (DBPedia) to process the input documents.

A traditional approach uses k-means clustering [21] to cluster documents. The algorithm is based on a vector representation of the documents (based on term frequencies) and a distance metric (e.g., the cosine similarity between two document vectors). Unfortunately, this approach will incorrectly compute the similarity between two documents that describe the same concept using different words. It will only consider the common words and their frequencies and it will ignore the meaning of the words. In [41], we explore how information from WordNet can be used to create a probabilistic graph that is used to cluster the documents. However, this approach does not take into account information from DBPedia and will not be able to determine that "Tardar Sauce" and "Henri, le Chat Noir" are both famous Internet cats.

In this paper, we extend the approach from [41] in two ways. First, we apply the Stanford CoreNLP tool to lemmatize the words in the documents and assign them to the correct part of speech (i.e., noun, verb, adjective, or adverb). Second, we add information from DBPedia to the probabilistic graph. DBPedia contains knowledge from Wikipedia. This includes the title of each Wikipedia page, the short abstract for the page, the length of the Wikipedia page, the category of each Wikipedia page (e.g., "Anarchism" belongs to the category "Political Cultures"), information that an object belongs to a class (e.g., "Azerbaijan" is a type of a country), RDF triplets between objects (e.g., "Algeria" has official language that is "Arabic"), and information about disambiguation (e.g., "Alien" can refer to "Alien (law)", that is, the legal meaning of the word). All this information allows us to extend the probabilistic graph and find new evidence about the semantic similarities between the phrases in the documents that are to be clustered.

In what follows, in Section II we present an overview of related research. Our main contribution is in Section III, where we present a modified algorithm for creating the probabilistic graph that stores the part of speech for each word. The algorithm is then extended with information from DBPedia. Section IV describes our algorithms for measuring the semantic similarity between documents and clustering the documents. Our other contribution is in Section V, where we describe our implementation of the algorithm using a distributed Hadoop environment and validate our approach by showing how it can produce data of better quality than the algorithm that is based on simple keywords matching and our previous algorithm that relies exclusively on data from WordNet. Lastly, Section VI summarizes the paper and outlines areas for future research.

II. RELATED RESEARCH

The probabilistic graph that is presented in this paper is based on the research from [37], which shows how to measure the semantic similarity between words based on information from WordNet. Later on, in [38] we explain how the graph can be extended with information from Wikipedia. After that, in [40] we show how the Markov Logic Network model [29] can use the probabilistic graph to compute the probability that a word is relevant to a user given that a different word from the user's input query is relevant. Lastly, in [36] we show how a random walk in a bounded box in the graph can be used to make the computation of the semantic similarity between two words more precise and more efficient. In this paper, we extend this existing research in two ways: (1) we store the part of speech together with each word in the graph and (2) we incorporate knowledge from DBPedia in the probabilistic graph.

Note that a plethora of research papers have been published on the subject of using supervised learning models with training sets for document classification [5], [42]. Our approach differs because it is unsupervised, it does not use a training set, and it can cluster documents in any number of classes rather than just classify the documents in preexisting topics. One alternative to supervised learning is using a knowledge-base that contains information about the relationship between the words and phrases that can be found in the documents to be clustered. For example, in 1986, W. B. Croft proposed the use of a thesaurus that contains semantic information, such as what words are synonyms [7]. Subsequently, there have been multiple papers on the use of a thesaurus to represent the semantic relationship between words and phrases [14], [30]. This approach, although very progressive for the times, differs from our approach because we consider indirect relationships between words (i.e., relationships along paths of several words). We also do not apply document expansion (e.g., adding the synonyms of the words in a document to the document) when comparing two documents. Instead, we use the probabilistic graph to compute the distance between two documents. Some limited user interaction is possible when classifying documents – see for example the research on collective knowledge systems [11]. Our system currently does not allow for user interaction when creating the document clusters.

In later years, the research of Croft was extended by creating a graph in the form of a semantic network [4], [28], [31] and graphs that contain the semantic relationships between words [2], [1], [6]. Later on, Simone Ponzetto and Michael Strube showed how to create a graph that only represents inheritance of words in WordNet [18], [32], while Glen Jeh and Jennifer Widom showed how to approximate the similarity between phrases based on information about the structure of the graph in which they appear [15]. All these approaches differ from our approach because they do not consider the strength of the relationship between the nodes in the graph. In other words, weights are not assigned to the edges of the graph.

Natural language techniques can be used to analyze the text in a document [13], [26], [35]. For example, a natural language analyzer may determine that a document talks about animals, and words or phrases that can represent an animal can be identified in other documents. As a result, documents that are identified to refer to the same or similar concepts can be classified together. One problem with this approach is that it is computationally expensive. A second problem is that it is not a probabilistic model and therefore it is difficult to be applied towards generating a document similarity metric.

Note our limited use of ontologies to cluster the documents. Unlike existing approaches that annotate each document with a formal description [17], [27], [12], we use ontological information from DBPedia to calculate the weights of the edges in the probabilistic graph. The problem with the traditional approach is that: (1) manual annotation is time consuming and automatic annotation is not very reliable and (2) a query language, such as SPARQL [33], can tell us which documents are similar, but it will not give us a similarity metric.

Since the early 1990s, research on LSA (stands for latent semantic analysis [8]) has been carried out. The approach has the advantage of not relying on external information. Instead, it considers the adjacency of words in text documents as proof of their semantic similarity. For example, LSA can be used to detect words that are synonyms [19]. This differs from our approach because we do not consider the location of the words in the documents for the most part. The only exceptions are when we extract the part of speech for each word and when we compute the conditional probability between a sense and the words in the sense and give higher weight to the first words because they are more relevant.

Lastly, note that our approach is different from that of Word2Vec ([24]). Word2Vec explores existing documents to find which words go together. Instead, we use high quality knowledge from WordNet and DBPedia to find the degree of semantic similarity between phrases.

III. BUILDING THE PROBABILISTIC GRAPH

In this section, we extend previous approaches to building the probabilistic graph [38], [41] by considering the part of speech for each word and using information from DBPedia.

A. Modeling WordNet

WordNet gives us information about the words in the English language. We use WordNet 3.0, which contains about 150,000 different terms. Both words and phrases can be found in WordNet. For example, "sports utility vehicle" is a term from WordNet. WordNet uses the terminology word form to refer to both words and phrases. Note that the meaning of a word form is not precise. For example, the word "spring" can mean the season after winter, a metal elastic device, or natural flow of ground water, among other meanings. This is the reason why WordNet uses the concept of a sense. For example, earlier in this paragraph we cited three different senses of the word "spring". Every word form has one or more senses and every sense is represented by one or more word forms. A human can usually determine which of the many senses a word form represents by the context in which the word form appears. Each word form is classified in one of four categories: noun, verb, adjective, or adverb.

WordNet contains a plethora of information about word forms and senses. For example, it contains the definition and example use of each sense. Consider the word "chair". One of its senses has the definition: "a seat for one person, with a support for the back" and the example use: "he put his coat over the back of the chair and sat down". Two other senses of the word have the definitions: "the position of a professor" and "the officer who presides at the meetings of an organization". We process these textual descriptions to extract evidence about the strength of the relationship between a word form and the word forms that appear in the definition and example use of the word's senses. Note that WordNet also provides information about the frequency of use of each sense. This represents the popularity of the sense in the English language relative to the popularity of the other senses of the word form. For example, in WordNet the first sense of the word "chair" (a seat for one person, with a support for the back) is given a frequency of 35, the second sense (the position of a professor) is given a frequency of just two, while the third sense (the officer who presides at the meetings of an organization) is given a frequency of one.

WordNet also contains information about the relationship between senses. The senses in WordNet are divided into four categories: nouns, verbs, adjectives, and adverbs. For example, WordNet stores information about the hypernym and hyponym relationships between nouns. The hypernym relationship corresponds to the "kind-of" relationship. For example, "canine" is a hypernym of "dog". The hyponym relationship is the reverse. For example, "dog" is a hyponym of "canine". WordNet also provides information about the meronym and holonym relationship between noun senses. The meronym relationship corresponds to the "part-of" relationship. The holonym relationship is the reverse of the meronym relationship. For example, "building" is a holonym of "window". For verbs, WordNet defines the hypernym and troponym relationships. X is a hypernym of Y if performing X is one way of performing Y. For example, "to perceive" is a hypernym of "to listen". The verb Y is a troponym of the verb X if the activity Y is doing X in some manner. For example, "to lisp" is a troponym of "to talk". Lastly, WordNet defines the related to and similar to relationships between adjective senses, which are self explanatory.

We create a node in the probabilistic graph for each word form and part of speech pair. For example, we create a node for [spring, wf_noun] and a node for [spring, wf_verb], where wf stands for word form. The first node represents the word form noun "spring", while the second node represents the word form verb "spring". In this paper, we consider the two word forms to be distinct entities and we do not explicitly create an edge between them just because they share the same syntax. We also create a node for each sense. For example, we create a node with label [natural flow of ground water, s_noun], where s stands for sense. Instead of revisiting our previous algorithm from [40], we summarize how the probabilities are computed in Table I, where these probabilities are used to create the weighted edges between the nodes. The different constants that appear in the formulas were determined using experimental evaluation [39]. The formulas are slightly modified and we do some extra processing to handle the part of speech for each word form node and for each sense node. Next, we show a few examples that demonstrate the algorithm for creating the probabilities.

Consider the noun chair and its most popular meaning: "a seat for one person". In total, the noun has four meanings, where WordNet defines their frequencies to be 35, 2, 1, and 1. Accordingly, we create the following formula.

[chair, wf_noun] ⇒ [a seat for one person, s_noun], 35/39

The keyword s_noun stands for noun sense. The formula shows that there is probability 35/39 that if we are interested in the noun chair, then we are also interested in its most popular sense. Note that we store the type of speech for both word forms and senses. All the senses of a word form must have the same type of speech as the word form. The number 35/39 is computed by dividing the frequency of the sense by the sum of the frequencies of all the senses of the word form. We also create the reverse relationship.

[a seat for one person, s_noun] ⇒ [chair, wf_noun], 1

This formula means that if we are interested in a sense, then we must be also interested in each of the word forms that represent the sense with probability 100%.

Next, let us consider the relationships between the most popular sense of the word chair and the words in the definition. We create the following formulas.

[a seat for one person, s_noun] ⇒ [seat, wf_noun], 0.6
[a seat for one person, s_noun] ⇒ [person, wf_noun], 0.4

TABLE I
THE DIFFERENT FORMULAS FOR MODELING WORDNET

part of speech | from | to | probability
general | word form w | sense s of w | frequency[1](w, s) / Σ_{si ∈ senses(w)} frequency(w, si)
general | sense s | word form w of s | 1
general | sense s | word form w in the definition d of s | c[2] ∗ norm[3]( count[4](w, d) / Σ_{wi ∈ d} count(wi, d) )
general | word form w | sense s that has w in its definition d | 0.3 / sdf[5](w)
general | sense s | word form w in the example use e of s | 0.3 ∗ norm( count(w, e) / Σ_{wi ∈ e} count(wi, e) )
general | word form w | a sense s that has w in its example use e | 0.15 / sef[6](w)
noun | noun sense s1 | noun hypernym sense s2 of s1 | 0.9 ∗ |s2| / Σ_{s is a hypernym of s1} |s|
noun | noun sense s1 | noun hyponym sense s2 of s1 | 0.3
noun | noun sense s1 | noun meronym sense s2 of s1 | 0.6 ∗ |s2| / Σ_{s is a meronym of s1} |s|
noun | noun sense s1 | noun holonym sense s2 of s1 | 0.15
verb | verb sense s1 | verb troponym sense s2 of s1 | 0.9 ∗ |s2| / Σ_{s is a troponym of s1} |s|
verb | verb sense s1 | verb sense s2, where s1 is a verb troponym of s2 | 0.3
verb | verb sense s1 | verb hyponym sense s2 of s1 | 0.9 ∗ |s2| / Σ_{s is a hyponym of s1} |s|
verb | verb sense s1 | verb sense s2, where s1 is a verb hypernym of s2 | 0.3
adjective | adjective sense s1 | adjective sense s2 that is related to s1 | 0.6
adjective | adjective sense s1 | adjective sense s2 that is similar to s1 | 0.8

[1] frequency(w, s) is the popularity of the sense s for the word form w as determined by WordNet.
[2] c = 0.6 for the first word, c = 0.4 for the second word, c = 0.2 for the rest of the words.
[3] norm(x) = -1/log2(x) when x ≤ 0.5 and norm(x) = 1.2 when x > 0.5.
[4] count(w, str) is the number of times the word form w appears in the string str.
[5] sdf stands for sense definition frequency. sdf(w) returns the number of senses that contain the word form w in their definition.
[6] sef stands for sense example use frequency. sef(w) returns the number of senses that contain the word form w in their example use.
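To make the first rows of Table I concrete, the sketch below (in Scala, the implementation language mentioned in Section V) computes the word form to sense probability and the sense to definition word probabilities using the constants c and the norm function from the table. It is an illustration only, not the system's actual code, and the shortened sense labels are hypothetical. With the four frequencies of the noun "chair" it reproduces the 35/39 value above, and with the two non-noise definition words "seat" and "person" it reproduces the probabilities 0.6 and 0.4.

object WordNetEdgeProbabilities {
  // The norm function from Table I (footnote [3]).
  def norm(x: Double): Double =
    if (x <= 0.5) -1.0 / (math.log(x) / math.log(2)) else 1.2

  // Word form -> sense: frequency of the sense divided by the sum of the
  // frequencies of all the senses of the word form.
  def wordFormToSense(senseFrequencies: Map[String, Int]): Map[String, Double] = {
    val total = senseFrequencies.values.sum.toDouble
    senseFrequencies.map { case (sense, f) => sense -> f / total }
  }

  // Sense -> i-th non-noise word of its definition: c * norm(count / total count),
  // where c = 0.6 for the first word, 0.4 for the second, and 0.2 for the rest.
  def senseToDefinitionWords(definitionWords: Seq[String]): Map[String, Double] = {
    val counts = definitionWords.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }
    val total = counts.values.sum
    definitionWords.distinct.zipWithIndex.map { case (w, i) =>
      val c = if (i == 0) 0.6 else if (i == 1) 0.4 else 0.2
      w -> c * norm(counts(w) / total)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val chair = Map("a seat for one person" -> 35, "professorship" -> 2,
                    "presiding officer" -> 1, "electric chair" -> 1)
    println(wordFormToSense(chair))                         // a seat for one person -> 35/39
    println(senseToDefinitionWords(Seq("seat", "person")))  // seat -> 0.6, person -> 0.4
  }
}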

We use the Stanford CoreNLP tool [22] to parse the definition of a sense. The tool returns the main part of each word (e.g., "ing", "s", and "ed" endings are stripped) and the part of speech for the word. The tool also removes the noise words. Note that, following the formula from Table I, c = 0.6 for the first word and c = 0.4 for the second. The reason is that the first words in the definition of a sense are more important. The norm function is an inverse logarithmic function that smoothens the difference between a word that appears in the definition of a sense that has five words and a word that appears in the definition of a sense that has 20 words. The special case of the function applies when we have a single non-noise word in the definition of a sense. In this case, we set the probability to be 1.2 because we have stronger evidence about the relationship between the sense and the word.

Next, we will show an example of creating an edge based on the structured information in WordNet. Note that the formulas in Table I use the notion of a size of a sense, or |s| for the sense s. We use information from Oxford's British National Corpus (BNC) [3], which contains information about the frequency of use of word forms. Let |w| be the popularity of the word form w that is shown in BNC. Let senses(w) be the set of senses of the word form w and wordforms(s) be the set of all word forms that represent the sense s. Then we define |s| as follows.

|s| = Σ_{w ∈ wordforms(s)} |w| ∗ frequency(w, s) / Σ_{si ∈ senses(w)} frequency(w, si)

The above formula approximates the size of a sense by looking at all the word forms that represent the sense and figuring out how much each word form contributes to the size of the sense. The formula is used to approximate the popularity of a sense.

WordNet defines the hyponym (a.k.a. kind-of) relationship between senses that represent nouns. For example, the most popular sense of the word "dog" is a hyponym of the most popular sense of the word "canine". Consider the first sense of the word "chair": "a seat for one person ...". WordNet defines 15 hyponyms for this sense, including senses for the words "armchair" and "wheelchair". We add formulas that show the probability between this first sense of the word "chair" and each of the hyponyms. In the British National Corpus, the frequency of "armchair" is 657 and the frequency of "wheelchair" is 551. Since both senses are associated with a single word form, we do not need to consider the frequency of use of each sense. If "armchair" and "wheelchair" were the only hyponyms of the sense "a seat for one person ...", then we will add the following formula.

[a seat for one person, s_noun] ⇒ [chair with support, s_noun], 0.49

The formula shows the probability for the sense "chair with a support on each side for arms" of the word "armchair". The probability is computed as 0.9 ∗ 657/(657 + 551) = 0.49.

B. Modeling DBPedia

Wikipedia contains information about our world. This includes information about people, organizations, movies, songs, and places, where most of this information is not part of WordNet. DBPedia [20] contains structured information that is extracted from Wikipedia. Specifically, we incorporate information from the six files that are shown in Table II. The information in the files uses the Turtle (Terse RDF Triple Language) syntax.

Our algorithm first creates a node for every Wikipedia article and category. The label of the node will be the title of the Wikipedia article or Wikipedia category (i.e., either wt for Wikipedia title or wc for Wikipedia category). The information about the Wikipedia titles and categories can be extracted from the article categories en.tql file. This differs from our approach in [38], where we store no meta information.

Table III shows the formulas for computing the probabilities based on information from DBPedia. The coefficients for DBPedia are in general smaller than those for WordNet because the latter contains information of higher quality. Fine-tuning these coefficients remains an area for future research.

Consider the Wikipedia page with title "National Hockey League". We will create the following formula.

[National Hockey League, wt] ⇒ [national, wf_adj], 0.25

The number 0.25 is computed as 0.4 ∗ norm(1/3) because there are three words in the Wikipedia title. We will create similar formulas between "National Hockey League" and "hockey" and "league". Note that if "National Hockey League" was a word form in WordNet, then we create a single formula as follows: [national hockey league, wt] ⇒ [national hockey league, wf_noun], 0.4 ∗ norm(1) = 0.48. The idea of the formula is to connect the nodes from WordNet and DBPedia. Note that the formula applies to both Wikipedia titles and Wikipedia categories.

Next, consider the information that the Wikipedia article "Algeria" corresponds to the Wikipedia category "Countries in Africa". We create the following formula.

[Algeria, wt] ⇒ [Countries in Africa, wc], 0.6 ∗ 400/10,000

The example assumes that the Wikipedia page for Algeria has 400 lines and the total number of lines of all Wikipedia pages in the category "Countries in Africa" is 10,000. The idea of the formula is that if there were only few Wikipedia pages in a category, then the relationship between the Wikipedia page and category would be stronger.

Next, consider the information that the capital of Alabama is Montgomery. We create the following formula.

[Alabama, wt] ⇒ [Montgomery Alabama, wt], 0.2/20

The example assumes that there are 20 different RDF triplets where Alabama is the subject. The idea is that if Alabama has only a few RDF triplets where it is the subject, then we give higher importance to these relations.

Next, consider the DBPedia information that "chair" is a type of "furniture". We will create the following formula.

[furniture, wt] ⇒ [chair, wt], 0.8 ∗ 100/1000

The formula assumes that the Wikipedia page for "chair" has 100 lines, while the number of lines of all Wikipedia things that are of type furniture is 1000. The formula tries to estimate what percent of furniture refers to chairs, that is, what is the probability that someone who is interested in furniture is also interested to know more about chairs.

Next, consider the example that one disambiguation of "ADA" is "Ada Programming Language". We create the following formula.

[ADA, wt] ⇒ [ADA Programming Language, wt], 0.8/40

The formula assumes that there are a total of 40 different disambiguations of ADA. The idea of the formula is that if there are only few disambiguations, then the strength of the relationship between the disambiguation page concept and each of the disambiguation artifacts is stronger.

In order to save space, we do not show examples of the reverse relationships from Table III.

C. Computing the Edge Weights

So far, we have created the nodes of the graph and shown formulas that contain conditional probabilities between nodes. For example, the formula

[X] ⇒ [Y], p

means that if we are interested in X, then we are also interested in Y with probability p. We will adopt the Markov Logic Network [29] model and rewrite the formula as a first order formula with probability, where the predicate rel tells us whether or not the concept is relevant to the user.

rel(X) ⇒ rel(Y), p

Next, suppose that the formulas from Tables I and III generate one or more formulas between the nodes X and Y. We first convert each probability to a MLN weight using the formula.

weight(rel(X) ⇒ rel(Y)) = ln((1 + p)/(1 − p)) when p < 0.9999, and ln((1 + 0.9999)/(1 − 0.9999)) when p ≥ 0.9999

Note that we first transform the probability into the range [0.5, 1] because we want each formula to contribute positively to the weight. In other words, p' = 0.5 + p/2. We then apply the MLN model that computes the weight of a formula as the natural logarithm of the odds, or w = ln(p'/(1 − p')) = ln((1 + p)/(1 − p)). An extreme case is when p = 1 and the weight will be equal to infinity. We address this case by setting the weight equal to 9.9034 when the probability is too high.
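The probability to weight conversion above, its reverse (used at the end of this subsection when several formulas connect the same two nodes), and the 9.9034 cap can be summarized with the following short Scala sketch. It is an illustration of the formulas, not the actual implementation, and the helper names are hypothetical.

object MlnWeights {
  val MaxWeight = 9.9034  // cap used when the probability is too close to 1

  // Probability -> MLN weight: the probability is first shifted into [0.5, 1]
  // so that every formula contributes a positive weight, and the weight is the
  // natural logarithm of the odds, which simplifies to ln((1 + p) / (1 - p)).
  def probToWeight(p: Double): Double =
    if (p >= 0.9999) MaxWeight else math.log((1 + p) / (1 - p))

  // MLN weight -> probability (the reverse formula).
  def weightToProb(w: Double): Double = (math.exp(w) - 1) / (math.exp(w) + 1)

  // When several formulas exist between the same two nodes, their weights are
  // added and the total is converted back to a single edge probability.
  def combine(probabilities: Seq[Double]): Double =
    weightToProb(probabilities.map(probToWeight).sum)
}

For example, two parallel formulas with probabilities 0.6 and 0.3 combine to weightToProb(ln(1.6/0.4) + ln(1.3/0.7)) ≈ 0.76.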

TABLE II
FILES FROM WIKIPEDIA

file | content | example
short abstracts en.tql | Wikipedia page title and short abstract | title = "American Football Conference", abstract = "The American Football Conference (AFC) is one of the two conferences of the National Football League (NFL) ..."
page length en.tql | the title of a Wikipedia page and the number of lines | title = "American Football Conference", number of lines = 120
mappingbased objects en.tql | RDF triplets | subject = "Alabama", predicate = "capital", object = "Montgomery, Alabama"
article categories en.tql | Wikipedia page title and the corresponding Wikipedia category | title = "Algeria", category = "Countries in Africa"
instance types en.tql | Wikipedia object and its type | "American Film Institute" is a type of "organization"
disambiguations en.tql | a term and the different Wikipedia webpages that it can refer to | term = "ADA", disambiguation = "Ada Programming Language"
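The files in Table II are processed line by line. The sketch below is a minimal, hypothetical reader for such lines; it assumes each data line contains a subject IRI, a predicate IRI, and either an object IRI or a quoted literal (any trailing context IRI and the final period are ignored), which may differ in detail from the exact layout of the DBPedia dumps.

object TripleParser {
  // Matches an IRI in angle brackets or a quoted literal with an optional
  // language tag or datatype; findAllIn returns the terms in order.
  private val Term = """<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[\w-]+|\^\^<[^>]*>)?""".r

  // Returns the first three RDF terms of a line, if the line has that shape.
  def parse(line: String): Option[(String, String, String)] =
    Term.findAllIn(line).toList match {
      case subject :: predicate :: obj :: _ => Some((subject, predicate, obj))
      case _                                => None  // comment or malformed line
    }

  def main(args: Array[String]): Unit = {
    val line = """<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/capital> <http://dbpedia.org/resource/Montgomery,_Alabama> ."""
    println(parse(line))
  }
}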

TABLE III
THE DIFFERENT FORMULAS FOR MODELING DBPEDIA

type | from | to | probability
title | Wikipedia title t | word w in the Wikipedia title | 0.4 ∗ norm( count(w, t) / Σ_{wi ∈ t} count(wi, t) )
title | word w | Wikipedia title t that contains w | 0.2 / dtf[1](w)
abstract | Wikipedia page t | word w that appears in the abstract a of t three or more times | 0.2 ∗ norm( count(w, a) / Σ_{wi ∈ frequent3[2](a)} count(wi, a) )
abstract | word w | Wikipedia page t that contains w 3 or more times in the abstract | 0.1 / daf[3](w)
category | Wikipedia category c | Wikipedia article t that belongs to the category | 0.6 ∗ |t|[4] / Σ_{ti ∈ c} |ti|
category | Wikipedia article t | Wikipedia category c, where t belongs to c | 0.1 / cf[5](t)
RDF triplet | subject s of RDF triplet | object o of RDF triplet | 0.2 / sf[6](s)
RDF triplet | object o of RDF triplet | subject s of RDF triplet | 0.2 / of[7](o)
instance | type t | object o that belongs to t | 0.8 ∗ |o| / Σ_{oi ∈ t} |oi|
instance | object o | type t, where o belongs to t | 0.1 / tf[8](o)
disambiguation | disambiguation page d | article t that is one of the disambiguations | 0.8 / df[9](d)
disambiguation | Wikipedia article t | disambiguation page d that points to t | 0.1

[1] dtf stands for document title frequency. dtf(w) returns the number of documents that contain the word w in their title.
[2] frequent3(s) returns the number of words that occur three or more times in the string s.
[3] daf stands for document abstract frequency. daf(w) returns the number of documents that contain the word w in their abstract three or more times.
[4] |t| returns the number of lines in the Wikipedia article t.
[5] cf stands for category frequency. cf(t) returns the number of categories that t belongs to.
[6] sf stands for subject frequency. sf(s) returns the number of RDF triplets that have subject s.
[7] of stands for object frequency. of(o) returns the number of RDF triplets that have object o.
[8] tf stands for type frequency. tf(t) returns the number of types that t belongs to.
[9] df stands for disambiguation frequency. df(d) returns the number of disambiguations for the disambiguation page d.
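As an illustration of the two category rows of Table III, the following Scala sketch computes the edge probabilities between a Wikipedia category and its articles, reusing the hypothetical line counts from the Algeria example in Section III-B. It is a sketch only, not the actual implementation.

object DbpediaCategoryEdges {
  // Category -> article: 0.6 * |t| / sum of |ti| over all articles in the
  // category, where |t| is the number of lines of the Wikipedia article t.
  def categoryToArticle(article: String, linesPerArticle: Map[String, Int]): Double =
    0.6 * linesPerArticle(article) / linesPerArticle.values.sum.toDouble

  // Article -> category: 0.1 / cf(t), where cf(t) is the number of categories
  // that the article belongs to.
  def articleToCategory(numberOfCategories: Int): Double =
    0.1 / numberOfCategories

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: "Algeria" has 400 lines and all the articles in
    // "Countries in Africa" have 10,000 lines in total.
    val countriesInAfrica = Map("Algeria" -> 400, "rest of the category" -> 9600)
    println(categoryToArticle("Algeria", countriesInAfrica))  // 0.6 * 400 / 10000 = 0.024
    println(articleToCategory(5))                             // 0.1 / 5 = 0.02
  }
}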

Next, if there are multiple formulas between two nodes, we follow the MLN model and just add the weights. Finally, we convert the total weight back to probability using the reverse formula.

p = (e^w − 1)/(e^w + 1)

We create an edge between every two nodes that participate in a formula, where the weight of the edge will be computed using the above formula. As a last step, we normalize the weights of the edges so that the sum of the weights of the outgoing edges from every node is equal to 1.

IV. CLUSTERING THE DOCUMENTS

We add each document as a node in the graph. Consider a document d and a word form w in the document. We create the following formula.

[d] ⇒ [w], norm( count(w, d) / |d| )

In the formula, count(w, d) denotes the number of times the word form appears in the document and |d| denotes the total number of words in the document. We also create a formula for the reverse relationship.

[w] ⇒ [d], log2( m / docFr(w) )

The function docFr(w) returns the number of documents that contain the word w and m is the total number of documents. Both formulas are similar to the abstract formula from Table III. The reason is that we can think of the abstract of a Wikipedia article as the text of a document. As explained in [39], this formula allows us to compute the distance between two documents in a way that is similar to normalizing the document vectors using the TF-IDF function [16] and then using the cosine distance formula.

Next, we create an edge in the graph for each formula, where we will normalize the weights again to make sure that the sum of the weights of the outgoing edges for every node is equal to 1. Note that we did not have to convert the probabilities to MLN weights in this step because we do not create duplicate edges.

Given two nodes X and Y in the graph, we define Pr(X|Y) as the conditional probability that X is relevant given that Y is relevant. This number can be estimated, for example, by doing multiple random walks starting at Y and calculating the percent that reach X [36]. Since the sum of the weights of the outgoing edges is always 1, at each node we can randomly decide where to hop next. For example, if we are at a node n and there is an outgoing edge to n1 with weight w1, to n2 with weight w2 and to n3 with weight w3, then we can generate a random number between 0 and 1. If the number is smaller or equal to w1, then we will hop to n1. If the number is between w1 and w1 + w2, then we will hop to n2. Otherwise, we will hop to n3. When conducting random walks, we only have to be careful not to revisit the same node multiple times. In particular, our algorithm keeps a hash table of visited nodes and always looks for a path that does not involve nodes that are already visited. We also apply the bounded box techniques from [36] to make the calculations efficient.
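The following Scala sketch illustrates the random walk estimate of Pr(X|Y) just described. It is a simplification: the bounded box pruning from [36] is omitted, and the number of walks and the maximum walk length are hypothetical parameters.

import scala.util.Random

object RandomWalkEstimate {
  // graph(n) lists the outgoing edges of node n as (neighbor, weight) pairs;
  // the weights of every node's outgoing edges sum to 1.
  type Graph = Map[String, Seq[(String, Double)]]

  // One weighted hop that avoids nodes that were already visited. The weights
  // of the remaining candidate edges are re-normalized before sampling.
  private def step(graph: Graph, node: String, visited: Set[String], rnd: Random): Option[String] = {
    val candidates = graph.getOrElse(node, Seq.empty).filterNot { case (n, _) => visited(n) }
    if (candidates.isEmpty) None
    else {
      var r = rnd.nextDouble() * candidates.map(_._2).sum
      candidates.find { case (_, w) => { r -= w; r <= 0 } }
                .orElse(Some(candidates.last)).map(_._1)
    }
  }

  // Estimate of Pr(x | y): the fraction of random walks starting at y that reach x.
  def estimate(graph: Graph, x: String, y: String, walks: Int = 10000, maxHops: Int = 10): Double = {
    val rnd = new Random(17)
    val hits = (1 to walks).count { _ =>
      var node = y
      var visited = Set(y)
      var reached = false
      var hops = 0
      while (!reached && hops < maxHops) {
        step(graph, node, visited, rnd) match {
          case Some(next) => reached = next == x; visited += next; node = next
          case None       => hops = maxHops  // dead end: stop this walk
        }
        hops += 1
      }
      reached
    }
    hits.toDouble / walks
  }
}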
Given two documents d1 and d2, we compute the distance between them using the formula.

distance(d1, d2) = ( Pr(d1|d2) + Pr(d2|d1) ) / 2

We use the k-means clustering algorithm [21] to classify the documents. The algorithm starts with k document seeds. It then finds the documents that are closest to each seed using the distance metric. Next, the centroid (i.e., mean) of each cluster is found and then new clusters are created using the centroids as the seeds. The process repeats and it is guaranteed to converge. Computing the mean of a set of documents amounts to adding the document vectors and dividing by the number of documents. The document vector for a document contains the frequency of each word in the document.

V. EXPERIMENTAL VALIDATION

We used a Hadoop cluster of seven nodes running Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 14 cores and 32 GB RAM to run the experiments. All code was written in Scala and used Spark. Information from WordNet was first extracted using the Java API for WordNet Searching (JAWS) [34] and saved to files. We did not use any search structures, such as hash tables and trees. Instead, we used Spark's join operation when we wanted to join the result of two Resilient Distributed Datasets (RDDs). It took less than one hour to create the probabilistic graph using the files for WordNet and DBPedia.

We next read all the documents from the Reuters-21578 benchmark. The benchmark contains 21,578 documents that are stored in 22 text files. Our program read the files and we extracted information about the 11,362 documents that were classified in one of 82 categories using human judgment (the other documents do not have human judgment associated with them). For every document, we stored its title, its text, the category it belongs to, and a document vector. The latter contains the non-noise words in the document and their frequencies. Since the words in the title are more important, we counted these words twice. We next added the documents to the probabilistic graph.

We next clustered the documents using the k-means clustering algorithm. We chose the value k = 82 because this is the number of categories as determined by the human judgment. The first 82 documents were put in 82 distinct clusters. At this point, the lonely document in each category was designated as the centroid. We next processed the rest of the documents. Every document was compared to the 82 centroids and assigned to the cluster with the closest centroid. Next, a new centroid was chosen for each cluster. This was done by adding the document vectors in each cluster and dividing the result by the number of vectors. Next, the documents were reclustered around the new centroids and the process was repeated until it converged.

The k-means clustering algorithm is based on two document functions: finding the distance between two documents and computing the average of several documents. We have three choices for the distance metric: the standard cosine function, the logarithmic function from [41] that uses only information from WordNet, and the random walk function on the full probabilistic graph that stores information from WordNet and DBPedia.

TABLE IV
RESULTS ON THE REUTERS-21578 BENCHMARK

 | cosine | logarithmic | this paper
# of rounds | 30 | 45 | 49
precision | 0.57 | 0.66 | 0.71
recall | 0.05 | 0.07 | 0.14
F1-measure | 0.09 | 0.13 | 0.23

Table IV shows the F1-measure when using the three different distance metrics. The measure gives a single number based on the precision and recall of the result of the clustering algorithm. We computed the precision as TP/(TP + FP) and the recall as TP/(TP + FN). In the formula, TP is the number of true positives, that is, the number of documents that were classified in the same category by both the program and human judgment. FP is the number of false positives, that is, the number of documents that were classified in the same category by the program, but were classified in different categories by human judgment. Lastly, FN is the number of false negatives, that is, the number of documents that were classified in the same category by human judgment but were classified in different categories by the program.
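The precision, recall, and F1 computation just described corresponds to the following small Scala helper (an illustration, not the evaluation code used to produce Table IV); F1 is the standard harmonic mean of precision and recall.

object ClusteringMetrics {
  // Precision and recall from the true positive, false positive and false
  // negative counts defined above.
  def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)
  def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)

  // F1 is the harmonic mean of precision and recall.
  def f1(tp: Long, fp: Long, fn: Long): Double = {
    val p = precision(tp, fp)
    val r = recall(tp, fn)
    if (p + r == 0) 0.0 else 2 * p * r / (p + r)
  }
}

For example, the precision of 0.71 and recall of 0.14 in the last column of Table IV combine to 2 ∗ 0.71 ∗ 0.14 / (0.71 + 0.14) ≈ 0.23, which matches the reported F1-measure.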
As the table suggests, using the full probabilistic graph with information from WordNet and DBPedia can lead to both higher precision and recall. The reason is that now we are not just comparing the semantic similarity between words, but also the semantic similarity between phrases.

VI. CONCLUSION AND FUTURE RESEARCH

In this paper, we reviewed how information from WordNet can be used to build a probabilistic graph. We extended existing algorithms by adding a part of speech tag to each word and sense. We then showed how the graph can be extended with information from DBPedia. We validated the algorithm experimentally by comparing it to an algorithm that uses the cosine similarity metric and an algorithm that uses only information from WordNet. The results show that adding information from DBPedia increases both the precision and the recall of the algorithm on the Reuters-21578 benchmark. One area for future research is moving beyond the bag of words model and considering the ordering of the words in the documents.

REFERENCES

[1] M. Agosti and F. Crestani. Automatic Authoring and Construction of Hypermedia for Information Retrieval. ACM Multimedia Systems, 15(24), 1995.
[2] M. Agosti, F. Crestani, G. Gradenigo, and P. Mattiello. An Approach to Conceptual Modeling of IR Auxiliary Data. IEEE International Conference on Computer and Communications, 1990.
[3] L. Burnard. Reference Guide for the British National Corpus (XML Edition). http://www.natcorp.ox.ac.uk, 2007.
[4] P. Cohen and R. Kjeldsen. Information Retrieval by Constrained Spreading Activation on Semantic Networks. Information Processing and Management, pages 255–268, 1987.
[5] R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. Twenty Fifth International Conference on Machine Learning, 2008.
[6] F. Crestani. Application of Spreading Activation Techniques in Information Retrieval. Artificial Intelligence Review, 11(6):453–482, 1997.
[7] W. B. Croft. User-specified Domain Knowledge for Document Retrieval. Ninth Annual International ACM Conference on Research and Development in Information Retrieval, pages 201–206, 1986.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[9] C. Fox. Lexical Analysis and Stoplists. Information Retrieval: Data Structures and Algorithms, pages 102–130, 1992.
[10] W. Frakes. Stemming Algorithms. Information Retrieval: Data Structures and Algorithms, pages 131–160, 1992.
[11] T. Gruber. Collective Knowledge Systems: Where the Social Web Meets the Semantic Web. Journal of Web Semantics, 2008.
[12] R. V. Guha, R. McCool, and E. Miller. Semantic Search. Twelfth International World Wide Web Conference (WWW 2003), pages 700–709, 2003.
[13] E. H. Hovy, L. Gerber, U. Hermjakob, M. Junk, and C. Y. Lin. Question Answering in Webclopedia. TREC-9 Conference, 2000.
[14] K. Jarvelin, J. Keklinen, and T. Niemi. ExpansionTool: Concept-based Query Expansion and Construction. Springer Netherlands, pages 231–255, 2001.
[15] G. Jeh and J. Widom. SimRank: A Measure of Structural-context Similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, 2002.
[16] K. Jones. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28(1):11–21, 1972.
[17] A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff. Semantic Annotation, Indexing, and Retrieval. Journal of Web Semantics, 2(1):49–79, 2004.
[18] R. Knappe, H. Bulskov, and T. Andreasen. Similarity Graphs. Fourteenth International Symposium on Foundations of Intelligent Systems, 2003.
[19] T. K. Landauer, P. Foltz, and D. Laham. Introduction to Latent Semantic Analysis. Discourse Processes, pages 259–284, 1998.
[20] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia: A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
[21] J. B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
[22] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
[23] M. F. Porter. An Algorithm for Suffix Stripping. Readings in Information Retrieval, pages 313–316, 1997.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
[25] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995.
[26] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, and R. Girju. LASSO: A Tool for Surfing the Answer Net. Text Retrieval Conference (TREC-8), 1999.
[27] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, and A. Kirilov. KIM: A Semantic Platform for Information Extraction and Retrieval. Journal of Natural Language Engineering, 10(3):375–392, 2004.
[28] L. Rau. Knowledge Organization and Access in a Conceptual Information System. Information Processing and Management, 23(4):269–283, 1987.
[29] M. Richardson and P. Domingos. Markov Logic Networks. Machine Learning, 62(1-2):107–136, 2006.
[30] M. Sanderson. Word Sense Disambiguation and Information Retrieval. Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.
[31] P. Shoval. Expert Consultation System for a Retrieval Database with Semantic Network of Concepts. Fourth Annual International ACM SIGIR Conference on Information Storage and Retrieval: Theoretical Issues in Information Retrieval, pages 145–149, 1981.
[32] S. P. Ponzetto and M. Strube. Deriving a Large Scale Taxonomy from Wikipedia. 22nd International Conference on Artificial Intelligence, 2007.
[33] E. Sirin and B. Parsia. SPARQL-DL: SPARQL Query for OWL-DL. OWL: Experiences and Directions Workshop, 2007.
[34] B. Spell. Java API for WordNet Searching (JAWS). http://lyle.smu.edu/~tspell/jaws/index.html, 2009.
[35] K. Srihari, W. Li, and X. Li. Information Extraction Supported Question Answering. In Advances in Open Domain Question Answering, 2004.
[36] L. Stanchev. Implementing Semantic Document Search Using a Bounded Random Walk in a Probabilistic Graph. IEEE International Conference on Semantic Computing, 2008.
[37] L. Stanchev. Building Semantic Corpus from WordNet. The First International Workshop on the Role of Semantic Web in Literature-Based Discovery, 2012.
[38] L. Stanchev. Creating a Phrase Similarity Graph from Wikipedia. Eighth IEEE International Conference on Semantic Computing, 2014.
[39] L. Stanchev. Fine-Tuning an Algorithm for Semantic Search Using a Similarity Graph. International Journal on Semantic Computing, 9(3):283–306, 2015.
[40] L. Stanchev. Creating a Probabilistic Graph for WordNet Using Markov Logic Network. Proceedings of the Sixth International Conference on Web Intelligence, Mining and Semantics, pages 1–12, 2016.
[41] L. Stanchev. Semantic Document Clustering Using a Similarity Graph. Tenth IEEE International Conference on Semantic Computing, pages 1–8, 2016.
[42] J. Turian, L. Ratinov, and Y. Bengio. Word Representations: A Simple and General Method for Semi-supervised Learning. Forty Eighth Annual Meeting of the Association for Computational Linguistics, pages 384–394, 2010.