
Semantic Document Clustering Using Information from WordNet and DBPedia

Lubomir Stanchev
Computer Science Department
California Polytechnic State University
San Luis Obispo, CA, USA
[email protected]

Abstract—Semantic document clustering is a type of unsupervised learning in which documents are grouped together based on their meaning. Unlike traditional approaches that cluster documents based on common keywords, this technique can group documents that share no words in common as long as they are on the same subject. We compute the similarity between two documents as a function of the semantic similarity between the words and phrases in the documents. We model information from WordNet and DBPedia as a probabilistic graph that can be used to compute the similarity between two terms. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keyword matching and one that uses the probabilistic graph. We show that the second approach produces higher precision and recall, which corresponds to better alignment with the classification that was done by human experts.

I. INTRODUCTION

Consider an RSS feed of news stories. Organizing them in categories will make search easier. For example, a smart classifier will put a story about "Tardar Sauce" (better known as Grumpy Cat) and a story about "Henri, le Chat Noir" (Henry, the black cat) in the same category because both stories are about famous cats from the Internet. Our approach uses information from WordNet [25] and DBPedia [20] to construct a probabilistic graph that can be used to compute the semantic similarity between the two documents.

The problem of semantic document clustering is interesting because it can improve the quality of the clustering results as compared to keyword matching algorithms. For example, the latter algorithms will likely put documents that use different terminology to describe the same concept in separate categories. Consider a document that contains the term "ascorbic acid" multiple times and a document that contains the term "vitamin C" multiple times. The documents are semantically similar because "ascorbic acid" and "vitamin C" refer to the same organic compound and therefore a clustering algorithm should take this fact into account. However, this will only happen when the close relationship between the two terms is stored in the system and applied during document clustering. The need for a semantic document clustering system becomes even more apparent when the number of documents is small or when they are very short. In this case, it is likely that the documents will not share many words and a keyword matching strategy will struggle to find evidence for grouping any two documents together.

The problem of semantic document clustering is difficult because it involves some understanding of the English language and our world. For example, our system can use information from DBPedia to determine that "Henri, le Chat Noir" and "Tardar Sauce" are both famous cats. Although significant effort has been put forward in automated natural language processing [9], [10], [23], current approaches fall short of understanding the precise meaning of human text. In our approach, we make limited use of natural language processing techniques (for example, we use the Stanford CoreNLP tool [22]) and we rely on high-quality information about the words in the English language (WordNet) and our world (DBPedia) to process the input documents.

A traditional approach uses k-means clustering [21] to cluster documents. The algorithm is based on a vector representation of the documents (based on term frequencies) and a distance metric (e.g., the cosine similarity between two document vectors). Unfortunately, this approach will incorrectly compute the similarity between two documents that describe the same concept using different words. It will only consider the common words and their frequencies and it will ignore the meaning of the words. In [41], we explore how information from WordNet can be used to create a probabilistic graph that is used to cluster the documents. However, this approach does not take into account information from DBPedia and will not be able to determine that "Tardar Sauce" and "Henri, le Chat Noir" are both famous Internet cats.

In this paper, we extend the approach from [41] in two ways. First, we apply the Stanford CoreNLP tool to lemmatize the words in the documents and assign them to the correct part of speech (i.e., noun, verb, adjective, or adverb). Second, we add information from DBPedia to the probabilistic graph. DBPedia contains knowledge from Wikipedia. This includes the title of each Wikipedia page, the short abstract for the page, the length of the Wikipedia page, the category of each Wikipedia page (e.g., "Anarchism" belongs to the category "Political Cultures"), information that an object belongs to a class (e.g., "Azerbaijan" is a type of a country), RDF triplets between objects (e.g., "Algeria" has official language that is "Arabic"), and information about disambiguation (e.g., "Alien" can refer to "Alien (law)", that is, the legal meaning of the word). All this information allows us to extend the probabilistic graph and find new evidence about the semantic similarities between the phrases in the documents that are to be clustered.

In what follows, in Section II we present an overview of related research. Our main contribution is in Section III, where we present a modified algorithm for creating the probabilistic graph that stores the part of speech for each word. The algorithm is then extended with information from DBPedia. Section IV describes our algorithms for measuring the semantic similarity between documents and clustering the documents. Our other contribution is in Section V, where we describe our implementation of the algorithm using a distributed Hadoop environment and validate our approach by showing how it can produce data of better quality than the algorithm that is based on simple keywords matching and our previous algorithm that relies exclusively on data from WordNet. Lastly, Section VI summarizes the paper and outlines areas for future research.

II. RELATED RESEARCH

The probabilistic graph that is presented in this paper is based on the research from [37], which shows how to measure the semantic similarity between words based on information from WordNet. Later on, in [38] we explain how the graph can be extended with information from Wikipedia. After that, in [40] we show how the Markov Logic Network model [29] can use the probabilistic graph to compute the probability that a word is relevant to a user given that a different word from the user's input query is relevant. Lastly, in [36] we show how a random walk in a bounded box in the graph can be used to make the computation of the semantic similarity between two words more precise and more efficient. In this paper, we extend this existing research in two ways: (1) we store the part of speech together with each word in the graph and (2) we incorporate knowledge from DBPedia in the probabilistic graph.

Note that a plethora of research papers have been published on the subject of using supervised learning models with training sets for document classification [5], [42]. Our approach differs because it is unsupervised, it does not use a training set, and it can cluster documents in any number of classes rather than just classify the documents in preexisting topics. One alternative to supervised learning is using a knowledge-base that contains information about the relationship between the words and phrases that can be found in the documents to be clustered. For example, in 1986, W. B. Croft proposed the use of a thesaurus that contains semantic information, such as what words are synonyms [7]. Subsequently, there have been multiple papers on the use of a thesaurus to represent the semantic relationship between words and phrases [14], [30]. This approach, although very progressive for the times, differs from our approach because we consider indirect relationships between words (i.e., relationships along paths of several words). We also do not apply document expansion (e.g., adding the synonyms of the words in a document to the document) when comparing two documents. Instead, we use the probabilistic graph to compute the distance between two documents. Some limited user interaction is possible when classifying documents – see for example the research on collective knowledge systems [11]. Our system currently does not allow for user interaction when creating the document clusters.

In later years, the research of Croft was extended by creating a graph in the form of a semantic network [4], [28], [31] and graphs that contain the semantic relationships between words [2], [1], [6]. Later on, Simone Ponzetto and Michael Strube showed how to create a graph that only represents inheritance of words in WordNet [18], [32], while Glen Jeh and Jennifer Widom showed how to approximate the similarity between phrases based on information about the structure of the graph in which they appear [15]. All these approaches differ from our approach because they do not consider the strength of the relationship between the nodes in the graph. In other words, weights are not assigned to the edges of the graph.

Natural language techniques can be used to analyze the text in a document [13], [26], [35]. For example, a natural language analyzer may determine that a document talks about animals, and words or phrases that can represent an animal can be identified in other documents. As a result, documents that are identified to refer to the same or similar concepts can be classified together. One problem with this approach is that it is computationally expensive. A second problem is that it is not a probabilistic model and therefore it is difficult to be applied towards generating a document similarity metric.

Note our limited use of ontologies to cluster the documents. Unlike existing approaches that annotate each document with a formal description [17], [27], [12], we use ontological information from DBPedia to calculate the weights of the edges in the probabilistic graph. The problem with the traditional approach is that: (1) manual annotation is time consuming and automatic annotation is not very reliable and (2) a query language, such as SPARQL [33], can tell us which documents are similar, but it will not give us a similarity metric.

Since the early 1990s, research on LSA (stands for latent semantic analysis [8]) has been carried out. The approach has the advantage of not relying on external information. Instead, it considers the adjacency of words in text documents as proof of their semantic similarity. For example, LSA can be used to detect words that are synonyms [19]. This differs from our approach because we do not consider the location of the words in the documents for the most part. The only exceptions are when we extract the part of speech for each word and when we compute the conditional probability between a sense and the words in the sense and give higher weight to the first words because they are more relevant.

Lastly, note that our approach is different from that of Word2Vec ([24]). Word2Vec explores existing documents to find which words go together. Instead, we use high quality knowledge from WordNet and DBPedia to find the degree of semantic similarity between phrases.

III. BUILDING THE PROBABILISTIC GRAPH

In this section, we extend previous approaches to building the probabilistic graph [38], [41] by considering the part of speech for each word and using information from DBPedia.

A. Modeling WordNet

WordNet gives us information about the words in the English language. We use WordNet 3.0, which contains about 150,000 different terms. Both words and phrases can be found in WordNet. For example, "sports utility vehicle" is a term from WordNet. WordNet uses the terminology word form to refer to both words and phrases. Note that the meaning of a word form is not precise. For example, the word "spring" can mean the season after winter, a metal elastic device, or natural flow of ground water, among other meanings. This is the reason why WordNet uses the concept of a sense. For example, earlier in this paragraph we cited three different senses of the word "spring". Every word form has one or more senses and every sense is represented by one or more word forms. A human can usually determine which of the many senses a word form represents by the context in which the word form appears. Each word form is classified in one of four categories: noun, verb, adjective, or adverb.

WordNet contains a plethora of information about word forms and senses. For example, it contains the definition and example use of each sense. Consider the word "chair". One of its senses has the definition: "a seat for one person, with a support for the back" and the example use: "he put his coat over the back of the chair and sat down". Two other senses of the word have the definitions: "the position of a professor" and "the officer who presides at the meetings of an organization". We process these textual descriptions to extract evidence about the strength of the relationship between a word form and the word forms that appear in the definition and example use of the word's senses. Note that WordNet also provides information about the frequency of use of each sense. This represents the popularity of the sense in the English language relative to the popularity of the other senses of the word form. For example, in WordNet the first sense of the word "chair" (a seat for one person, with a support for the back) is given a frequency of 35, the second sense (the position of a professor) is given a frequency of just two, while the third sense (the officer who presides at the meetings of an organization) is given a frequency of one.

WordNet also contains information about the relationship between senses. The senses in WordNet are divided into four categories: nouns, verbs, adjectives, and adverbs. For example, WordNet stores information about the hypernym and hyponym relationships between nouns. The hypernym relationship corresponds to the "kind-of" relationship. For example, "canine" is a hypernym of "dog". The hyponym relationship is the reverse. For example, "dog" is a hyponym of "canine". WordNet also provides information about the meronym and holonym relationship between noun senses. The meronym relationship corresponds to the "part-of" relationship. The holonym relationship is the reverse of the meronym relationship. For example, "building" is a holonym of "window". For verbs, WordNet defines the hypernym and troponym relationships. X is a hypernym of Y if performing X is one way of performing Y. For example, "to perceive" is a hypernym of "to listen". The verb Y is a troponym of the verb X if the activity Y is doing X in some manner. For example, "to lisp" is a troponym of "to talk". Lastly, WordNet defines the related to and similar to relationships between adjective senses, which are self explanatory.

We create a node in the probabilistic graph for each word form and part of speech pair. For example, we create a node for [spring, wf_noun] and a node for [spring, wf_verb], where wf stands for word form. The first node represents the word form noun "spring", while the second node represents the word form verb "spring". In this paper, we consider the two word forms to be distinct entities and we do not explicitly create an edge between them just because they share the same syntax. We also create a node for each sense. For example, we create a node with label [natural flow of ground water, s_noun], where s stands for sense. Instead of revisiting our previous algorithm from [40], we summarize how the probabilities are computed in Table I, where these probabilities are used to create the weighted edges between the nodes. The different constants that appear in the formulas were determined using experimental evaluation [39]. The formulas are slightly modified and we do some extra processing to handle the part of speech for each word form node and for each sense node. Next, we show a few examples that demonstrate the algorithm for creating the probabilities.

Consider the noun chair and its most popular meaning: "a seat for one person". In total, the noun has four meanings, where WordNet defines their frequencies to be 35, 2, 1, and 1. Accordingly, we create the following formula.

[chair, wf_noun] ⇒ [a seat for one person, s_noun], 35/39

The keyword s_noun stands for noun sense. The formula shows that there is probability 35/39 that if we are interested in the noun chair, then we are also interested in its most popular sense. Note that we store the type of speech for both word forms and senses. All the senses of a word form must have the same type of speech as the word form. The number 35/39 is computed by dividing the frequency of the sense by the sum of the frequencies of all the senses of the word form. We also create the reverse relationship.

[a seat for one person, s_noun] ⇒ [chair, wf_noun], 1

This formula means that if we are interested in a sense, then we must be also interested in each of the word forms that represent the sense with probability 100%.

Next, let us consider the relationships between the most popular sense of the word chair and the words in the definition. We create the following formulas.

[a seat for one person, s_noun] ⇒ [seat, wf_noun], 0.6
[a seat for one person, s_noun] ⇒ [person, wf_noun], 0.4

TABLE I
THE DIFFERENT FORMULAS FOR MODELING WORDNET

part of speech | from | to | probability
general | word form w | sense s of w | frequency[1](w, s) / Σ_{si ∈ senses(w)} frequency(w, si)
general | sense s | word form w of s | 1
general | sense s | word form w in the definition d of s | c[2] ∗ norm[3]( count[4](w, d) / Σ_{wi ∈ d} count(wi, d) )
general | word form w | sense s that has w in its definition d | 0.3 / sdf[5](w)
general | sense s | word form w in the example use e of s | 0.3 ∗ norm( count(w, e) / Σ_{wi ∈ e} count(wi, e) )
general | word form w | a sense s that has w in its example use e | 0.15 / sef[6](w)
noun | noun sense s1 | noun hypernym sense s2 of s1 | 0.9 ∗ |s2| / Σ_{s is a hypernym of s1} |s|
noun | noun sense s1 | noun hyponym sense s2 of s1 | 0.3
noun | noun sense s1 | noun meronym sense s2 of s1 | 0.6 ∗ |s2| / Σ_{s is a meronym of s1} |s|
noun | noun sense s1 | noun holonym sense s2 of s1 | 0.15
verb | verb sense s1 | verb troponym sense s2 of s1 | 0.9 ∗ |s2| / Σ_{s is a troponym of s1} |s|
verb | verb sense s1 | verb sense s2, where s1 is a verb troponym of s2 | 0.3
verb | verb sense s1 | verb hyponym sense s2 of s1 | 0.9 ∗ |s2| / Σ_{s is a hyponym of s1} |s|
verb | verb sense s1 | verb sense s2, where s1 is a verb hypernym of s2 | 0.3
adjective | adjective sense s1 | adjective sense s2 that is related to s1 | 0.6
adjective | adjective sense s1 | adjective sense s2 that is similar to s1 | 0.8

[1] frequency(w, s) is the popularity of the sense s for the word form w as determined by WordNet.
[2] c = 0.6 for the first word, c = 0.4 for the second word, c = 0.2 for the rest of the words.
[3] norm(x) = -1/log2(x) when x ≤ 0.5 and norm(x) = 1.2 when x > 0.5.
[4] count(w, str) is the number of times the word form w appears in the string str.
[5] sdf stands for sense definition frequency. sdf(w) returns the number of senses that contain the word form w in their definition.
[6] sef stands for sense example use frequency. sef(w) returns the number of senses that contain the word form w in their example use.
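To make the first rows of Table I concrete, the sketch below (in Scala, the implementation language mentioned in Section V) computes the word form to sense probability and the sense to definition word probabilities using the constants c and the norm function from the table. It is an illustration only, not the system's actual code, and the shortened sense labels are hypothetical. With the four frequencies of the noun "chair" it reproduces the 35/39 value above, and with the two non-noise definition words "seat" and "person" it reproduces the probabilities 0.6 and 0.4.

object WordNetEdgeProbabilities {
  // The norm function from Table I (footnote [3]).
  def norm(x: Double): Double =
    if (x <= 0.5) -1.0 / (math.log(x) / math.log(2)) else 1.2

  // Word form -> sense: frequency of the sense divided by the sum of the
  // frequencies of all the senses of the word form.
  def wordFormToSense(senseFrequencies: Map[String, Int]): Map[String, Double] = {
    val total = senseFrequencies.values.sum.toDouble
    senseFrequencies.map { case (sense, f) => sense -> f / total }
  }

  // Sense -> i-th non-noise word of its definition: c * norm(count / total count),
  // where c = 0.6 for the first word, 0.4 for the second, and 0.2 for the rest.
  def senseToDefinitionWords(definitionWords: Seq[String]): Map[String, Double] = {
    val counts = definitionWords.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }
    val total = counts.values.sum
    definitionWords.distinct.zipWithIndex.map { case (w, i) =>
      val c = if (i == 0) 0.6 else if (i == 1) 0.4 else 0.2
      w -> c * norm(counts(w) / total)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val chair = Map("a seat for one person" -> 35, "professorship" -> 2,
                    "presiding officer" -> 1, "electric chair" -> 1)
    println(wordFormToSense(chair))                         // a seat for one person -> 35/39
    println(senseToDefinitionWords(Seq("seat", "person")))  // seat -> 0.6, person -> 0.4
  }
}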

We use the Stanford CoreNLP tool [22] to parse the definition of a sense. The tool returns the main part of each word (e.g., "ing", "s", and "ed" endings are stripped) and the part of speech for the word. The tool also removes the noise words. Note that, following the formula from Table I, c = 0.6 for the first word and c = 0.4 for the second. The reason is that the first words in the definition of a sense are more important. The norm function is an inverse logarithmic function that smoothens the difference between a word that appears in the definition of a sense that has five words and a word that appears in the definition of a sense that has 20 words. The special case of the function applies when we have a single non-noise word in the definition of a sense. In this case, we set the probability to be 1.2 because we have stronger evidence about the relationship between the sense and the word.

Next, we will show an example of creating an edge based on the structured information in WordNet. Note that the formulas in Table I use the notion of a size of a sense, or |s| for the sense s. We use information from Oxford's British National Corpus (BNC) [3], which contains information about the frequency of use of word forms. Let |w| be the popularity of the word form w that is shown in BNC. Let senses(w) be the set of senses of the word form w and wordforms(s) be the set of all word forms that represent the sense s. Then we define |s| as follows.

|s| = Σ_{w ∈ wordforms(s)} |w| ∗ frequency(w, s) / Σ_{si ∈ senses(w)} frequency(w, si)

The above formula approximates the size of a sense by looking at all the word forms that represent the sense and figuring out how much each word form contributes to the size of the sense. The formula is used to approximate the popularity of a sense.

WordNet defines the hyponym (a.k.a. kind-of) relationship between senses that represent nouns. For example, the most popular sense of the word "dog" is a hyponym of the most popular sense of the word "canine". Consider the first sense of the word "chair": "a seat for one person ...". WordNet defines 15 hyponyms for this sense, including senses for the words "armchair" and "wheelchair". We add formulas that show the probability between this first sense of the word "chair" and each of the hyponyms. In the British National Corpus, the frequency of "armchair" is 657 and the frequency of "wheelchair" is 551. Since both senses are associated with a single word form, we do not need to consider the frequency of use of each sense. If "armchair" and "wheelchair" were the only hyponyms of the sense "a seat for one person ...", then we will add the following formula.

[a seat for one person, s_noun] ⇒ [chair with support, s_noun], 0.49

The formula shows the probability for the sense "chair with a support on each side for arms" of the word "armchair". The probability is computed as 0.9 ∗ 657/(657 + 551) = 0.49.

B. Modeling DBPedia

Wikipedia contains information about our world. This includes information about people, organizations, movies, songs, and places, where most of this information is not part of WordNet. DBPedia [20] contains structured information that is extracted from Wikipedia. Specifically, we incorporate information from the six files that are shown in Table II. The information in the files uses the Turtle (Terse RDF Triple Language) syntax.

Our algorithm first creates a node for every Wikipedia article and category. The label of the node will be the title of the Wikipedia article or Wikipedia category (i.e., either wt for Wikipedia title or wc for Wikipedia category). The information about the Wikipedia titles and categories can be extracted from the article categories en.tql file. This differs from our approach in [38], where we store no meta information.

Table III shows the formulas for computing the probabilities based on information from DBPedia. The coefficients for DBPedia are in general smaller than those for WordNet because the latter contains information of higher quality. Fine-tuning these coefficients remains an area for future research.

Consider the Wikipedia page with title "National Hockey League". We will create the following formula.

[National Hockey League, wt] ⇒ [national, wf_adj], 0.25

The number 0.25 is computed as 0.4 ∗ norm(1/3) because there are three words in the Wikipedia title. We will create similar formulas between "National Hockey League" and "hockey" and "league". Note that if "National Hockey League" was a word form in WordNet, then we create a single formula as follows: [national hockey league, wt] ⇒ [national hockey league, wf_noun], 0.4 ∗ norm(1) = 0.48. The idea of the formula is to connect the nodes from WordNet and DBPedia. Note that the formula applies to both Wikipedia titles and Wikipedia categories.

Next, consider the information that the Wikipedia article "Algeria" corresponds to the Wikipedia category "Countries in Africa". We create the following formula.

[Algeria, wt] ⇒ [Countries in Africa, wc], 0.6 ∗ 400/10,000

The example assumes that the Wikipedia page for Algeria has 400 lines and the total number of lines of all Wikipedia pages in the category "Countries in Africa" is 10,000. The idea of the formula is that if there were only few Wikipedia pages in a category, then the relationship between the Wikipedia page and category would be stronger.

Next, consider the information that the capital of Alabama is Montgomery. We create the following formula.

[Alabama, wt] ⇒ [Montgomery Alabama, wt], 0.2/20

The example assumes that there are 20 different RDF triplets where Alabama is the subject. The idea is that if Alabama has only a few RDF triplets where it is the subject, then we give higher importance to these relations.

Next, consider the DBPedia information that "chair" is a type of "furniture". We will create the following formula.

[furniture, wt] ⇒ [chair, wt], 0.8 ∗ 100/1000

The formula assumes that the Wikipedia page for "chair" has 100 lines, while the number of lines of all Wikipedia things that are of type furniture is 1000. The formula tries to estimate what percent of furniture refers to chairs, that is, what is the probability that someone who is interested in furniture is also interested to know more about chairs.

Next, consider the example that one disambiguation of "ADA" is "Ada Programming Language". We create the following formula.

[ADA, wt] ⇒ [ADA Programming Language, wt], 0.8/40

The formula assumes that there are a total of 40 different disambiguations of ADA. The idea of the formula is that if there are only few disambiguations, then the strength of the relationship between the disambiguation page concept and each of the disambiguation artifacts is stronger.

In order to save space, we do not show examples of the reverse relationships from Table III.

C. Computing the Edge Weights

So far, we have created the nodes of the graph and shown formulas that contain conditional probabilities between nodes. For example, the formula

[X] ⇒ [Y], p

means that if we are interested in X, then we are also interested in Y with probability p. We will adopt the Markov Logic Network [29] model and rewrite the formula as a first order formula with probability, where the predicate rel tells us whether or not the concept is relevant to the user.

rel(X) ⇒ rel(Y), p

Next, suppose that the formulas from Tables I and III generate one or more formulas between the nodes X and Y. We first convert each probability to a MLN weight using the formula.

weight(rel(X) ⇒ rel(Y)) = ln((1 + p)/(1 − p)) when p < 0.9999, and ln((1 + 0.9999)/(1 − 0.9999)) when p ≥ 0.9999

Note that we first transform the probability into the range [0.5, 1] because we want each formula to contribute positively to the weight. In other words, p' = 0.5 + p/2. We then apply the MLN model that computes the weight of a formula as the natural logarithm of the odds, or w = ln(p'/(1 − p')) = ln((1 + p)/(1 − p)). An extreme case is when p = 1 and the weight will be equal to infinity. We address this case by setting the weight equal to 9.9034 when the probability is too high.
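The probability to weight conversion above, its reverse (used at the end of this subsection when several formulas connect the same two nodes), and the 9.9034 cap can be summarized with the following short Scala sketch. It is an illustration of the formulas, not the actual implementation, and the helper names are hypothetical.

object MlnWeights {
  val MaxWeight = 9.9034  // cap used when the probability is too close to 1

  // Probability -> MLN weight: the probability is first shifted into [0.5, 1]
  // so that every formula contributes a positive weight, and the weight is the
  // natural logarithm of the odds, which simplifies to ln((1 + p) / (1 - p)).
  def probToWeight(p: Double): Double =
    if (p >= 0.9999) MaxWeight else math.log((1 + p) / (1 - p))

  // MLN weight -> probability (the reverse formula).
  def weightToProb(w: Double): Double = (math.exp(w) - 1) / (math.exp(w) + 1)

  // When several formulas exist between the same two nodes, their weights are
  // added and the total is converted back to a single edge probability.
  def combine(probabilities: Seq[Double]): Double =
    weightToProb(probabilities.map(probToWeight).sum)
}

For example, two parallel formulas with probabilities 0.6 and 0.3 combine to weightToProb(ln(1.6/0.4) + ln(1.3/0.7)) ≈ 0.76.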

TABLE II
FILES FROM WIKIPEDIA

file | content | example
short abstracts en.tql | Wikipedia page title and short abstract | title = "American Football Conference", abstract = "The American Football Conference (AFC) is one of the two conferences of the National Football League (NFL) ..."
page length en.tql | the title of a Wikipedia page and the number of lines | title = "American Football Conference", number of lines = 120
mappingbased objects en.tql | RDF triplets | subject = "Alabama", predicate = "capital", object = "Montgomery, Alabama"
article categories en.tql | Wikipedia page title and the corresponding Wikipedia category | title = "Algeria", category = "Countries in Africa"
instance types en.tql | Wikipedia object and its type | "American Film Institute" is a type of "organization"
disambiguations en.tql | a term and the different Wikipedia webpages that it can refer to | term = "ADA", disambiguation = "Ada Programming Language"
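The files in Table II are processed line by line. The sketch below is a minimal, hypothetical reader for such lines; it assumes each data line contains a subject IRI, a predicate IRI, and either an object IRI or a quoted literal (any trailing context IRI and the final period are ignored), which may differ in detail from the exact layout of the DBPedia dumps.

object TripleParser {
  // Matches an IRI in angle brackets or a quoted literal with an optional
  // language tag or datatype; findAllIn returns the terms in order.
  private val Term = """<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[\w-]+|\^\^<[^>]*>)?""".r

  // Returns the first three RDF terms of a line, if the line has that shape.
  def parse(line: String): Option[(String, String, String)] =
    Term.findAllIn(line).toList match {
      case subject :: predicate :: obj :: _ => Some((subject, predicate, obj))
      case _                                => None  // comment or malformed line
    }

  def main(args: Array[String]): Unit = {
    val line = """<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/capital> <http://dbpedia.org/resource/Montgomery,_Alabama> ."""
    println(parse(line))
  }
}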

TABLE III
THE DIFFERENT FORMULAS FOR MODELING DBPEDIA

type | from | to | probability
title | Wikipedia title t | word w in the Wikipedia title | 0.4 ∗ norm( count(w, t) / Σ_{wi ∈ t} count(wi, t) )
title | word w | Wikipedia title t that contains w | 0.2 / dtf[1](w)
abstract | Wikipedia page t | word w that appears in the abstract a of t three or more times | 0.2 ∗ norm( count(w, a) / Σ_{wi ∈ frequent3[2](a)} count(wi, a) )
abstract | word w | Wikipedia page t that contains w 3 or more times in the abstract | 0.1 / daf[3](w)
category | Wikipedia category c | Wikipedia article t that belongs to the category | 0.6 ∗ |t|[4] / Σ_{ti ∈ c} |ti|
category | Wikipedia article t | Wikipedia category c, where t belongs to c | 0.1 / cf[5](t)
RDF triplet | subject s of RDF triplet | object o of RDF triplet | 0.2 / sf[6](s)
RDF triplet | object o of RDF triplet | subject s of RDF triplet | 0.2 / of[7](o)
instance | type t | object o that belongs to t | 0.8 ∗ |o| / Σ_{oi ∈ t} |oi|
instance | object o | type t, where o belongs to t | 0.1 / tf[8](o)
disambiguation | disambiguation page d | article t that is one of the disambiguations | 0.8 / df[9](d)
disambiguation | Wikipedia article t | disambiguation page d that points to t | 0.1

[1] dtf stands for document title frequency. dtf(w) returns the number of documents that contain the word w in their title.
[2] frequent3(s) returns the number of words that occur three or more times in the string s.
[3] daf stands for document abstract frequency. daf(w) returns the number of documents that contain the word w in their abstract three or more times.
[4] |t| returns the number of lines in the Wikipedia article t.
[5] cf stands for category frequency. cf(t) returns the number of categories that t belongs to.
[6] sf stands for subject frequency. sf(s) returns the number of RDF triplets that have subject s.
[7] of stands for object frequency. of(o) returns the number of RDF triplets that have object o.
[8] tf stands for type frequency. tf(t) returns the number of types that t belongs to.
[9] df stands for disambiguation frequency. df(d) returns the number of disambiguations for the disambiguation page d.
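As an illustration of the two category rows of Table III, the following Scala sketch computes the edge probabilities between a Wikipedia category and its articles, reusing the hypothetical line counts from the Algeria example in Section III-B. It is a sketch only, not the actual implementation.

object DbpediaCategoryEdges {
  // Category -> article: 0.6 * |t| / sum of |ti| over all articles in the
  // category, where |t| is the number of lines of the Wikipedia article t.
  def categoryToArticle(article: String, linesPerArticle: Map[String, Int]): Double =
    0.6 * linesPerArticle(article) / linesPerArticle.values.sum.toDouble

  // Article -> category: 0.1 / cf(t), where cf(t) is the number of categories
  // that the article belongs to.
  def articleToCategory(numberOfCategories: Int): Double =
    0.1 / numberOfCategories

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: "Algeria" has 400 lines and all the articles in
    // "Countries in Africa" have 10,000 lines in total.
    val countriesInAfrica = Map("Algeria" -> 400, "rest of the category" -> 9600)
    println(categoryToArticle("Algeria", countriesInAfrica))  // 0.6 * 400 / 10000 = 0.024
    println(articleToCategory(5))                             // 0.1 / 5 = 0.02
  }
}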

Next, if there are multiple formulas between two nodes, we follow the MLN model and just add the weights. Finally, we convert the total weight back to probability using the reverse formula.

p = (e^w − 1)/(e^w + 1)

We create an edge between every two nodes that participate in a formula, where the weight of the edge will be computed using the above formula. As a last step, we normalize the weights of the edges so that the sum of the weights of the outgoing edges from every node is equal to 1.

IV. CLUSTERING THE DOCUMENTS

We add each document as a node in the graph. Consider a document d and a word form w in the document. We create the following formula.

[d] ⇒ [w], norm( count(w, d) / |d| )

In the formula, count(w, d) denotes the number of times the word form appears in the document and |d| denotes the total number of words in the document. We also create a formula for the reverse relationship.

[w] ⇒ [d], log2( m / docFr(w) )

The function docFr(w) returns the number of documents that contain the word w and m is the total number of documents. Both formulas are similar to the abstract formula from Table III. The reason is that we can think of the abstract of a Wikipedia article as the text of a document. As explained in [39], this formula allows us to compute the distance between two documents in a way that is similar to normalizing the document vectors using the TF-IDF function [16] and then using the cosine distance formula.

Next, we create an edge in the graph for each formula, where we will normalize the weights again to make sure that the sum of the weights of the outgoing edges for every node is equal to 1. Note that we did not have to convert the probabilities to MLN weights in this step because we do not create duplicate edges.

Given two nodes X and Y in the graph, we define Pr(X|Y) as the conditional probability that X is relevant given that Y is relevant. This number can be estimated, for example, by doing multiple random walks starting at Y and calculating the percent that reach X [36]. Since the sum of the weights of the outgoing edges is always 1, at each node we can randomly decide where to hop next. For example, if we are at a node n and there is an outgoing edge to n1 with weight w1, to n2 with weight w2 and to n3 with weight w3, then we can generate a random number between 0 and 1. If the number is smaller or equal to w1, then we will hop to n1. If the number is between w1 and w1 + w2, then we will hop to n2. Otherwise, we will hop to n3. When conducting random walks, we only have to be careful not to revisit the same node multiple times. In particular, our algorithm keeps a hash table of visited nodes and always looks for a path that does not involve nodes that are already visited. We also apply the bounded box techniques from [36] to make the calculations efficient.
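The following Scala sketch illustrates the random walk estimate of Pr(X|Y) just described. It is a simplification: the bounded box pruning from [36] is omitted, and the number of walks and the maximum walk length are hypothetical parameters.

import scala.util.Random

object RandomWalkEstimate {
  // graph(n) lists the outgoing edges of node n as (neighbor, weight) pairs;
  // the weights of every node's outgoing edges sum to 1.
  type Graph = Map[String, Seq[(String, Double)]]

  // One weighted hop that avoids nodes that were already visited. The weights
  // of the remaining candidate edges are re-normalized before sampling.
  private def step(graph: Graph, node: String, visited: Set[String], rnd: Random): Option[String] = {
    val candidates = graph.getOrElse(node, Seq.empty).filterNot { case (n, _) => visited(n) }
    if (candidates.isEmpty) None
    else {
      var r = rnd.nextDouble() * candidates.map(_._2).sum
      candidates.find { case (_, w) => { r -= w; r <= 0 } }
                .orElse(Some(candidates.last)).map(_._1)
    }
  }

  // Estimate of Pr(x | y): the fraction of random walks starting at y that reach x.
  def estimate(graph: Graph, x: String, y: String, walks: Int = 10000, maxHops: Int = 10): Double = {
    val rnd = new Random(17)
    val hits = (1 to walks).count { _ =>
      var node = y
      var visited = Set(y)
      var reached = false
      var hops = 0
      while (!reached && hops < maxHops) {
        step(graph, node, visited, rnd) match {
          case Some(next) => reached = next == x; visited += next; node = next
          case None       => hops = maxHops  // dead end: stop this walk
        }
        hops += 1
      }
      reached
    }
    hits.toDouble / walks
  }
}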
Given two documents d1 and d2, we compute the distance between them using the formula.

distance(d1, d2) = ( Pr(d1|d2) + Pr(d2|d1) ) / 2

We use the k-means clustering algorithm [21] to classify the documents. The algorithm starts with k document seeds. It then finds the documents that are closest to each seed using the distance metric. Next, the centroid (i.e., mean) of each cluster is found and then new clusters are created using the centroids as the seeds. The process repeats and it is guaranteed to converge. Computing the mean of a set of documents amounts to adding the document vectors and dividing by the number of documents. The document vector for a document contains the frequency of each word in the document.

V. EXPERIMENTAL VALIDATION

We used a Hadoop cluster of seven nodes running Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 14 cores and 32 GB RAM to run the experiments. All code was written in Scala and used Spark. Information from WordNet was first extracted using the Java API for WordNet Searching (JAWS) [34] and saved to files. We did not use any search structures, such as hash tables and trees. Instead, we used Spark's join operation when we wanted to join the result of two Resilient Distributed Datasets (RDDs). It took less than one hour to create the probabilistic graph using the files for WordNet and DBPedia.

We next read all the documents from the Reuters-21578 benchmark. The benchmark contains 21,578 documents that are stored in 22 text files. Our program read the files and we extracted information about the 11,362 documents that were classified in one of 82 categories using human judgment (the other documents do not have human judgment associated with them). For every document, we stored its title, its text, the category it belongs to, and a document vector. The latter contains the non-noise words in the document and their frequencies. Since the words in the title are more important, we counted these words twice. We next added the documents to the probabilistic graph.

We next clustered the documents using the k-means clustering algorithm. We chose the value k = 82 because this is the number of categories as determined by the human judgment. The first 82 documents were put in 82 distinct clusters. At this point, the lonely document in each category was designated as the centroid. We next processed the rest of the documents. Every document was compared to the 82 centroids and assigned to the cluster with the closest centroid. Next, a new centroid was chosen for each cluster. This was done by adding the document vectors in each cluster and dividing the result by the number of vectors. Next, the documents were reclustered around the new centroids and the process was repeated until it converged.

The k-means clustering algorithm is based on two document functions: finding the distance between two documents and computing the average of several documents. We have three choices for the distance metric: the standard cosine function, the logarithmic function from [41] that uses only information from WordNet, and the random walk function on the full probabilistic graph that stores information from WordNet and DBPedia.

TABLE IV
RESULTS ON THE REUTERS-21578 BENCHMARK

 | cosine | logarithmic | this paper
# of rounds | 30 | 45 | 49
precision | 0.57 | 0.66 | 0.71
recall | 0.05 | 0.07 | 0.14
F1-measure | 0.09 | 0.13 | 0.23

Table IV shows the F1-measure when using the three different distance metrics. The measure gives a single number based on the precision and recall of the result of the clustering algorithm. We computed the precision as TP/(TP + FP) and the recall as TP/(TP + FN). In the formula, TP is the number of true positives, that is, the number of documents that were classified in the same category by both the program and human judgment. FP is the number of false positives, that is, the number of documents that were classified in the same category by the program, but were classified in different categories by human judgment. Lastly, FN is the number of false negatives, that is, the number of documents that were classified in the same category by human judgment but were classified in different categories by the program.
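The precision, recall, and F1 computation just described corresponds to the following small Scala helper (an illustration, not the evaluation code used to produce Table IV); F1 is the standard harmonic mean of precision and recall.

object ClusteringMetrics {
  // Precision and recall from the true positive, false positive and false
  // negative counts defined above.
  def precision(tp: Long, fp: Long): Double = tp.toDouble / (tp + fp)
  def recall(tp: Long, fn: Long): Double = tp.toDouble / (tp + fn)

  // F1 is the harmonic mean of precision and recall.
  def f1(tp: Long, fp: Long, fn: Long): Double = {
    val p = precision(tp, fp)
    val r = recall(tp, fn)
    if (p + r == 0) 0.0 else 2 * p * r / (p + r)
  }
}

For example, the precision of 0.71 and recall of 0.14 in the last column of Table IV combine to 2 ∗ 0.71 ∗ 0.14 / (0.71 + 0.14) ≈ 0.23, which matches the reported F1-measure.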
As the table suggests, using the full probabilistic graph with information from WordNet and DBPedia can lead to both higher precision and recall. The reason is that now we are not just comparing the semantic similarity between words, but also the semantic similarity between phrases.

VI. CONCLUSION AND FUTURE RESEARCH

In this paper, we reviewed how information from WordNet can be used to build a probabilistic graph. We extended existing algorithms by adding a part of speech tag to each word and sense. We then showed how the graph can be extended with information from DBPedia. We validated the algorithm experimentally by comparing it to an algorithm that uses the cosine similarity metric and an algorithm that uses only information from WordNet. The results show that adding information from DBPedia increases both the precision and the recall of the algorithm on the Reuters-21578 benchmark. One area for future research is moving beyond the bag of words model and considering the ordering of the words in the documents.

REFERENCES

[1] M. Agosti and F. Crestani. Automatic Authoring and Construction of Hypermedia for Information Retrieval. ACM Multimedia Systems, 15(24), 1995.
[2] M. Agosti, F. Crestani, G. Gradenigo, and P. Mattiello. An Approach to Conceptual Modeling of IR Auxiliary Data. IEEE International Conference on Computer and Communications, 1990.
[3] L. Burnard. Reference Guide for the British National Corpus (XML Edition). http://www.natcorp.ox.ac.uk, 2007.
[4] P. Cohen and R. Kjeldsen. Information Retrieval by Constrained Spreading Activation on Semantic Networks. Information Processing and Management, pages 255–268, 1987.
[5] R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. Twenty Fifth International Conference on Machine Learning, 2008.
[6] F. Crestani. Application of Spreading Activation Techniques in Information Retrieval. Artificial Intelligence Review, 11(6):453–482, 1997.
[7] W. B. Croft. User-specified Domain Knowledge for Document Retrieval. Ninth Annual International ACM Conference on Research and Development in Information Retrieval, pages 201–206, 1986.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[9] C. Fox. Lexical Analysis and Stoplists. Information Retrieval: Data Structures and Algorithms, pages 102–130, 1992.
[10] W. Frakes. Stemming Algorithms. Information Retrieval: Data Structures and Algorithms, pages 131–160, 1992.
[11] T. Gruber. Collective Knowledge Systems: Where the Social Web Meets the Semantic Web. Journal of Web Semantics, 2008.
[12] R. V. Guha, R. McCool, and E. Miller. Semantic Search. Twelfth International World Wide Web Conference (WWW 2003), pages 700–709, 2003.
[13] E. H. Hovy, L. Gerber, U. Hermjakob, M. Junk, and C. Y. Lin. Question Answering in Webclopedia. TREC-9 Conference, 2000.
[14] K. Jarvelin, J. Keklinen, and T. Niemi. ExpansionTool: Concept-based Query Expansion and Construction. Springer Netherlands, pages 231–255, 2001.
[15] G. Jeh and J. Widom. SimRank: A Measure of Structural-context Similarity. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, 2002.
[16] K. Jones. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28(1):11–21, 1972.
[17] A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff. Semantic Annotation, Indexing, and Retrieval. Journal of Web Semantics, 2(1):49–79, 2004.
[18] R. Knappe, H. Bulskov, and T. Andreasen. Similarity Graphs. Fourteenth International Symposium on Foundations of Intelligent Systems, 2003.
[19] T. K. Landauer, P. Foltz, and D. Laham. Introduction to Latent Semantic Analysis. Discourse Processes, pages 259–284, 1998.
[20] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia: A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
[21] J. B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
[22] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
[23] M. F. Porter. An Algorithm for Suffix Stripping. Readings in Information Retrieval, pages 313–316, 1997.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
[25] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995.
[26] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, and R. Girju. LASSO: A Tool for Surfing the Answer Net. Text Retrieval Conference (TREC-8), 1999.
[27] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, and A. Kirilov. KIM: A Semantic Platform for Information Extraction and Retrieval. Journal of Natural Language Engineering, 10(3):375–392, 2004.
[28] L. Rau. Knowledge Organization and Access in a Conceptual Information System. Information Processing and Management, 23(4):269–283, 1987.
[29] M. Richardson and P. Domingos. Markov Logic Networks. Machine Learning, 62(1-2):107–136, 2006.
[30] M. Sanderson. Word Sense Disambiguation and Information Retrieval. Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.
[31] P. Shoval. Expert Consultation System for a Retrieval Database with Semantic Network of Concepts. Fourth Annual International ACM SIGIR Conference on Information Storage and Retrieval: Theoretical Issues in Information Retrieval, pages 145–149, 1981.
[32] S. P. Ponzetto and M. Strube. Deriving a Large Scale Taxonomy from Wikipedia. 22nd International Conference on Artificial Intelligence, 2007.
[33] E. Sirin and B. Parsia. SPARQL-DL: SPARQL Query for OWL-DL. OWL: Experiences and Directions Workshop, 2007.
[34] B. Spell. Java API for WordNet Searching (JAWS). http://lyle.smu.edu/~tspell/jaws/index.html, 2009.
[35] K. Srihari, W. Li, and X. Li. Information Extraction Supported Question Answering. In Advances in Open Domain Question Answering, 2004.
[36] L. Stanchev. Implementing Semantic Document Search Using a Bounded Random Walk in a Probabilistic Graph. IEEE International Conference on Semantic Computing, 2008.
[37] L. Stanchev. Building Semantic Corpus from WordNet. The First International Workshop on the Role of Semantic Web in Literature-Based Discovery, 2012.
[38] L. Stanchev. Creating a Phrase Similarity Graph from Wikipedia. Eighth IEEE International Conference on Semantic Computing, 2014.
[39] L. Stanchev. Fine-Tuning an Algorithm for Semantic Search Using a Similarity Graph. International Journal on Semantic Computing, 9(3):283–306, 2015.
[40] L. Stanchev. Creating a Probabilistic Graph for WordNet Using Markov Logic Network. Proceedings of the Sixth International Conference on Web Intelligence, Mining and Semantics, pages 1–12, 2016.
[41] L. Stanchev. Semantic Document Clustering Using a Similarity Graph. Tenth IEEE International Conference on Semantic Computing, pages 1–8, 2016.
[42] J. Turian, L. Ratinov, and Y. Bengio. Word Representations: A Simple and General Method for Semi-supervised Learning. Forty Eighth Annual Meeting of the Association for Computational Linguistics, pages 384–394, 2010.