
A Semantic Network Approach to Measuring Relatedness

Brian Harrington
Oxford University Computing Laboratory
[email protected]

Abstract

Humans are very good at judging the strength of relationships between two terms, a task which, if it can be automated, would be useful in a range of applications. Systems attempting to solve this problem automatically have traditionally either used relative positioning in lexical resources such as WordNet, or distributional relationships in large corpora. This paper proposes a new approach, whereby relationships are derived from natural language text by using existing nlp tools, then integrated into a large scale semantic network. Spreading activation is then used on this network in order to judge the strengths of all relationships connecting the terms. In comparisons with human measurements, this approach was able to obtain results on par with the best purpose built systems, using only a relatively small corpus extracted from the web. This is particularly impressive, as the network creation system is a general tool for information collection and integration, and is not specifically designed for tasks of this type.

Coling 2010: Poster Volume, pages 356–364, Beijing, August 2010

1 Introduction

The ability to determine semantic relatedness between terms is useful for a variety of nlp applications, including word sense disambiguation, information extraction and retrieval, and text summarisation (Budanitsky and Hirst, 2006). However, there is an important distinction to be made between semantic relatedness and semantic similarity. As Resnik (1999) notes, “Semantic similarity represents a special case of semantic relatedness: for example, cars and gasoline would seem to be more closely related than, say, cars and bicycles, but the latter pair are certainly more similar”. Budanitsky and Hirst (2006) further note that “Computational applications typically require relatedness rather than just similarity; for example, money and river are cues to the in-context meaning of bank that are just as good as trust company”.

Systems for automatically determining the degree of semantic relatedness between two terms have traditionally either used a measurement based on the distance between the terms within WordNet (Banerjee and Pedersen, 2003; Hughes and Ramage, 2007), or used co-occurrence statistics from a large corpus (Mohammad and Hirst, 2006; Padó and Lapata, 2007). Recent systems have, however, shown improved results using extremely large corpora (Agirre et al., 2009), and existing large-scale resources such as Wikipedia (Strube and Ponzetto, 2006).

In this paper, we propose a new approach to determining semantic relatedness, in which a semantic network is automatically created from a relatively small corpus using existing nlp tools and a network creation system called ASKNet (Harrington and Clark, 2007), and then spreading activation is used to determine the strength of the connections within that network. This process is more analogous to the way the task is performed by humans. Information is collected from fragments and assimilated into a large semantic knowledge structure which is not purposely built for a single task, but is constructed as a general resource containing a wide variety of information. Relationships represented within this structure can then be used to determine the total strength of the relations between any two terms.

2 Existing Approaches

2.1 Resource Based Methods

A popular method for automatically judging semantic distance between terms is through WordNet (Fellbaum, 1998), using the lengths of paths between words in the network as a measure of distance. While WordNet-based approaches have obtained promising results for measuring semantic similarity (Jiang and Conrath, 1997; Banerjee and Pedersen, 2003), the results for the more general notion of semantic relatedness have been less promising (Hughes and Ramage, 2007).

One disadvantage of using WordNet for evaluating semantic relatedness is its hierarchical taxonomic structure. This results in terms such as car and bicycle being close in the network, but terms such as car and gasoline being far apart. Another difficulty arises from the non-scalability of WordNet. While the quality of the network is high, the manual nature of its construction means that arbitrary word pairs may not occur in the network. Hence in this paper we pursue an approach in which the resource for measuring semantic relatedness is created automatically, based on naturally occurring text.

A similar project, not using WordNet, is WikiRelate (Strube and Ponzetto, 2006), which uses the existing link structure of Wikipedia as its base network, and uses similar path-based measurements to those found in WordNet approaches to compute semantic relatedness. This project has seen improved results over most WordNet-based approaches, largely due to the nature of Wikipedia, where articles tend to link to other articles which are related, rather than just ones which are similar.

2.2 Distributional Methods

An alternative method for judging semantic distance is using word co-occurrence statistics derived from a very large corpus (McDonald and Brew, 2004; Padó and Lapata, 2007) or from the web using search engine results (Turney, 2001).

In a recent paper, Agirre et al. (2009) parsed 4 billion documents (1.6 Terawords) crawled from the web, and then used a search function to extract syntactic relations and context windows surrounding key words. These were then used as features for a vector space model, in a similar manner to work done in (Padó and Lapata, 2007), using the British National Corpus (bnc). This system has produced excellent results, indicating that the quality of the results for these types of approaches is related to the size and coverage of their corpus. This does however present problems moving forward, as 1.6 Terawords is obviously an extremely large corpus, and it is likely that there would be a diminishing return on investment for increasingly large corpora. In the same paper, another method was shown which used the PageRank algorithm, run over a network formed from WordNet and the WordNet gloss tags, in order to produce equally impressive results.

3 A Semantic Network Approach

The resource we use is a semantic network, automatically created by the large scale network creation program, ASKNet. The relations between nodes in the network are based on the relations returned by a parser and semantic analyser, which are typically the arguments of predicates found in the text. Hence terms in the network are related by the chain of syntactic/semantic relations which connect the terms in documents, making the network ideal for measuring the general notion of semantic relatedness.

Distinct occurrences of terms and entities are combined into a single node using a novel form of spreading activation (Collins and Loftus, 1975). This combining of distinct mentions produces a cohesive connected network, allowing terms and entities to be related across sentences and even larger units such as documents.
Once the network is built, spreading activation is used to determine semantic relatedness between terms. For example, to determine how related car and gasoline are, activation is given to one of the nodes, say car, and the network is “fired” to allow the activation to spread to the rest of the network. The amount of activation received by gasoline is then a measure of the strength of the semantic relation between the two terms.

We use three datasets derived from human judgements of semantic relatedness to test our technique. Since the datasets contain general terms which may not appear in an existing corpus, we create our own corpus by harvesting text from the web via Google. This approach has the advantage of requiring little human intervention and being extensible to new datasets. Our results using the semantic network derived from the web-based corpus are comparable to the best performing existing methods tested on the same datasets.

As an example of the usefulness of information integration, consider the monk-asylum example, taken from the rg dataset (described in Section 5.1). It is possible that even a large corpus could contain sentences linking monk with church, and linking church with asylum, but no direct links between monk and asylum. However, with an integrated semantic network, activation can travel across multiple links, and through multiple paths, and will show a relationship, albeit probably not a very strong one, between monk and asylum, which corresponds nicely with our intuition.

Figure 1, which gives an example network built from duc documents describing the Elian Gonzalez custody battle, gives an indication of the kind of network that ASKNet builds. This figure does not give the full network, which is too large to show in a single figure, but shows the “core” of the network, where the core is determined using the technique described in (Harrington and Clark, 2009). The black boxes represent named entities mentioned in the text, which may have been mentioned a number of times across documents, and possibly using different names (e.g. Fidel Castro vs. President Castro). The diamonds are named directed edges, which represent relationships between entities.

A manual evaluation using human judges has been performed to measure the accuracy of ASKNet networks. On a collection of duc documents, the “cores” of the resulting networks were judged to be 80% accurate on average, where accuracy was measured for the merged entity nodes in the networks and the relations between those entities (Harrington and Clark, 2009).

4 Creating the Semantic Networks

ASKNet creates the semantic networks using existing nlp tools to extract syntactic and semantic information from text. This information is then combined using a modified version of the update algorithm used by Harrington and Clark (2007) to create an integrated large-scale network. By mapping together concepts and objects that relate to the same real-world entities, the system is able to transform the output of various nlp tools into a single network, producing semantic resources which are more than the sum of their parts. Combining information from multiple sources results in a representation which would not have been possible to obtain from analysing the original sources separately.

The nlp tools used by ASKNet are the C&C parser (Clark and Curran, 2007) and the semantic analysis program Boxer (Bos et al., 2004), which operates on the ccg derivations output by the parser to produce a first-order representation. The named entity recognizer of Curran and Clark (2003) is also used to recognize the standard set of muc entities, including person, location and organisation.

The motivation for fully automatic creation is that very large networks, containing millions of edges, can be created in a matter of hours. Automatically creating networks does result in lower precision than manual creation, but this is offset by the scalability and speed of creation. The experiments described in this paper are a good test of the automatic creation methodology.
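As a rough illustration only (ASKNet's actual representation, with named edges, mapping scores, and merged entity nodes, is far richer), the construction step above can be sketched as building a weighted adjacency map from predicate-argument triples; the triples below are hypothetical stand-ins for parser/Boxer output:

```python
from collections import defaultdict

def build_network(triples):
    """Build a weighted adjacency map from (head, relation, dependent)
    triples, such as a parser/semantic analyser might emit.
    Repeated co-mentions strengthen the edge between two terms."""
    graph = defaultdict(lambda: defaultdict(float))
    for head, rel, dep in triples:
        graph[head][dep] += 1.0  # edge weight grows with each co-mention
        graph[dep][head] += 1.0  # keep the graph symmetric for firing
    return graph

# Hypothetical triples, as if extracted from separate sentences;
# "car" and "gasoline" become linked through the shared node "burn".
triples = [
    ("car", "arg_of", "burn"),
    ("gasoline", "arg_of", "burn"),
    ("car", "arg_of", "drive"),
]
net = build_network(triples)
print(sorted(net["burn"]))  # ['car', 'gasoline']
```

Merging distinct mentions of the same real-world entity into one node, the key step that makes the network "more than the sum of its parts", is omitted here; it is the job of the update algorithm described next.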

Figure 1: Example network core derived from duc documents describing a news story from 2000.

The update algorithm

In order to merge information from multiple sources into a single cohesive resource, ASKNet uses a spreading activation based update algorithm (Harrington and Clark, 2007). As the system encounters new information, it attempts to map named entities in the new sentences it encounters to those already in its network. Each new sentence is turned into an update fragment: a fragment of the network representing the information contained in the sentence. Initial mapping scores, based on lexical similarity and named entity type, are made between the update fragment's named entities and those in the main network. The job of the update algorithm is to improve upon those initial scorings using the semantic information contained within the network.

The update algorithm iterates over each named entity node in the update fragment. This base node is provided with activation, which is allowed to spread throughout the fragment. All named entities which receive activation in this process then transfer their activation to their target named entity nodes (nodes in the main network with which they have a current mapping score greater than zero). The amount transferred is based on the strength of the mapping score. The activation is then allowed to circulate through the main network until it reaches a stable state. At this point, the base node's mappings are updated based on which of its target nodes received activation. The more activation a target node receives, the more its mapping score with the base node will increase.

The intuition behind the update algorithm is that we can use the relatedness of nodes in the update fragment to determine appropriate mappings in the main network. So if our base node has the label “Crosby”, and is related to named entity nodes referring to “Canada”, “Vancouver” and “2010”, those nodes will pass their activation onto their main network targets, and hopefully onto the node representing the ice hockey player Sidney Crosby. We would then increase the mapping score between our base node and this target, while at the same time decreasing the mapping score between our base node and the singer Bing Crosby, who (hopefully) would have received little or no activation. The update algorithm is also self-reinforcing, as in successive stages the improved scores will focus the activation further. In our example, in successive iterations, more of the activation coming to the “Crosby” node will be sent to the appropriate target node, and therefore there will be less spurious activation in the network to create noise.

For the purposes of these experiments, we extended the update algorithm to map together general object nodes, rather than focusing solely on named entities. This was necessary due to the nature of the task. Simply merging named entities would not be sufficient, as many of the words in the datasets would not likely be associated strongly with any particular named entities. Extending the algorithm in this way resulted in a much higher frequency of mapping, and a much more connected final network. Because of this, we found that several of the parameters had to be changed from those used in Harrington and Clark (2009). Our initial activation input was set to double that used in Harrington and Clark's experiments (100 instead of 50), to compensate for the activation lost over the higher number of links. We also found that the number of iterations required to reach a stable state had increased to more than 4 times the previous number. This was to be expected due to the increased number of links passing activation. We also had to remove the named entity type calculation from the initial mapping score, thus leaving the initial scoring to be simply the ratio of labels in the two nodes which overlapped. These changes were all done after manual observation of test networks built from searches not relating to any dataset, and were not changed once the experiments had begun.

4.1 Measuring semantic relatedness

Once a large-scale network has been constructed from a corpus of documents, spreading activation can be used to efficiently obtain a distance score between any two nodes in the network, which will represent the semantic relatedness of the pair. Each node in the network has a current amount of activation and a threshold (similar to classical ideas from the neural network literature). If a node's activation exceeds its threshold, it will fire, sending activation to all of its neighbours, which may cause them to fire, and so on. The amount of activation sent between nodes decreases with distance, so that the effect of the original firing is localized. The localized nature of the algorithm is important because it means that semantic relatedness scores can be calculated efficiently even for pairs of nodes in very large networks.

To obtain a score between nodes x and y, first a set amount of activation is placed in node x; then the network is fired until it stabilises, and the total amount of activation received by node y is stored as act(x,y). This process is repeated starting with node y to obtain act(y,x). The sum of these two values, which we call dist(x,y), is used as the measure of semantic relatedness between x and y.¹

dist(x,y) is a measure of the total strength of connection between nodes x and y, relative to the other nodes in their region. This takes into account not just direct paths, but also indirect paths, if the links along those paths are of sufficient strength. Since the networks potentially contain a wide variety of relations between terms, the calculation of dist(x,y) has access to a wide variety of information linking the two terms. If we consider the (car, gasoline) example mentioned earlier, the intuition behind our approach is that these two terms are likely to be closely related in a semantic network built from text, either fairly directly because they appear in the same sentence or document, or indirectly because they are related to the same entities.

¹The average could be used also, but this has no effect on the ranking statistics used in the later experiments.

5 Experiments

The purpose of the experiments was to develop an entirely automated approach for replicating human judgements of semantic relatedness of words. We used three existing datasets of human judgements: the Hodgson, Rubenstein & Goodenough (rg) and Wordsimilarity-353 (ws-353) datasets. For each dataset we created a corpus using results returned by Google when queried for each word independently (described in Section 5.2). We then built a semantic network from that corpus and used the spreading activation technique described in the previous section to measure semantic relatedness between the word pairs in the dataset.

The parser and semantic analysis tool used to create the networks were developed on newspaper data (a ccg version of the Penn Treebank (Steedman and Hockenmaier, 2007; Clark and Curran, 2007)), but our impression from informally inspecting the parser output was that the accuracy on the web data was reasonable. The experimental results show that the resulting networks were of high enough quality to closely replicate human judgements.

5.1 The datasets

Many studies have shown a marked priming effect for semantically related words. In his single-word lexical priming study, Hodgson (1991) showed that the presentation of a prime word such as election directly facilitates processing of a target word such as vote. Hodgson showed an increase in both response speed and accuracy when the prime and target are semantically related. 143 word pairs were tested across 6 different lexical relations: antonymy (e.g., enemy, friend); conceptual association (e.g., bed, sleep); category coordination (e.g., train, truck); phrasal association (e.g., private, property); superordination/subordination (e.g., travel, drive); and synonymy (e.g., value, worth). It was shown that equivalent priming effects (i.e., reduced processing time) were present across all relation types, thus indicating that priming was a result of the terms' semantic relatedness, not merely their similarity or other simpler relation type.

The Hodgson dataset consists of the 143 word pairs divided by lexical category. There were no scores given, as all pairs were shown to have relatively similar priming effects. No examples of unrelated pairs are given in the dataset. We therefore used the unrelated pairs created by McDonald and Brew (2004). The task in this experiment was to obtain scores for all pairs, and to do an ANOVA test to determine if there is a significant difference between the scores for related and unrelated pairs.

The ws-353 dataset (Finkelstein et al., 2002) contains human rankings of the semantic distance between pairs of terms. Although the name may imply that the scores are based on similarity, human judges were asked to score 353 pairs of words for their relatedness on a scale of 1 to 10, and so the dataset is ideal for our purposes. For example, the pair (money, bank) is in the dataset and receives a high relatedness score of 8.50, even though the terms are not lexically similar. The dataset contains regular nouns and named entities, as well as at least one term which does not appear in WordNet (Maradona). In this experiment, we calculated scores for all word pairs, and then used rank correlation to compare the similarity of our generated scores to those obtained from human judgements.

The rg dataset (Rubenstein and Goodenough, 1965) is very similar to the ws-353, though with only 65 word pairs, except that the human judges were asked to judge the pairs based on synonymy, rather than overall relatedness. Thus, for example, the pair (monk, asylum) receives a significantly lower score than the pair (crane, implement).

5.2 Data collection, preparation and processing

In order to create a corpus from which to build the semantic networks, we first extracted each individual word from the pairings, resulting in a list of 440 words for the ws-353 collection, 48 words for the rg (some words were used in multiple pairings), and 282 words for the Hodgson collection. For each of the words in this list, we then performed a query using Google, and downloaded the first 5 page results for that query. The choice of 5 as the number of documents to download for each word was based on a combination of informal intuition about the precision and recall of search engines, as well as the practical issue of obtaining a corpus that could be processed in reasonable space and time.

Each of the downloaded web pages was then cleaned by a set of Perl scripts which removed all HTML markup. Three rules were added to the retrieval process to deal with problems encountered in formatting of web-pages:

1. Pages from which no text could be retrieved were ignored and replaced with the next result.

2. HTML lists preceded by a colon were recombined into sentences.

3. For Wikipedia disambiguation pages (pages which consist of a list of links to articles relating to the various possible senses of a word), all of the listed links were followed and the resulting pages added to the corpus.

Each of these heuristics was performed automatically and without human intervention. Statistics for the resulting corpora are given in Table 1.

corpus    sentences    words
Hodgson   814,779      3,745,870
rg        150,165      573,148
ws-353    1,042,128    5,027,947

Table 1: Summary statistics for the corpora generated for the experiments.

The largest of the networks, created for the ws-353 dataset, took slightly over 24 hours to complete, including time for parsing and semantic analysis.

6 Results

6.1 Hodgson priming dataset

After processing the Hodgson corpus to build a semantic network with approximately 500,000 nodes and 1,300,000 edges, the appropriate node pairs were fired to obtain the distance measure as previously described. Those measurements were then recorded as measurements of semantic relatedness between two terms. If a term was used as a label in two or more nodes, all nodes were tried, and the highest scoring pairs were used.

As the Hodgson dataset did not provide examples of unrelated pairs against which we could compare, unrelated pairs were generated as described in McDonald and Brew (2004). This is not an ideal method, as several pairs that were identified as unrelated did have some relatively obvious relationship (e.g. tree – house, poker – heart). However we chose to retain the methodology for consistency with previous literature, as it was also used in Padó and Lapata (2007).

Scores were obtained from the network for the word pairs, and for each target an average score was calculated for all primes in its category. Example scores are given in Table 2. Two-way analysis of variance (ANOVA) was carried out on the network scores, with the relatedness status of the pair being the independent variable. A reliable effect was observed for the network scores, with the primes for related words being significantly larger than those for unrelated words. The results are given in Table 3. The use of ANOVA shows that there is a difference in the scores of the related and unrelated word pairs that cannot be accounted for by random variance.
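The clean-up step above was done with Perl scripts; a rough Python equivalent of just the markup-removal stage (not the authors' code, and omitting the three retrieval rules) can be built on the standard library's HTML parser:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping script/style content,
    roughly mirroring the HTML-stripping step described above."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_html(page):
    parser = TextExtractor()
    parser.feed(page)
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Cars burn gasoline.</p></body></html>")
print(strip_html(page))  # Cars burn gasoline.
```

A production pipeline would also need encoding detection and boilerplate removal, which this sketch ignores.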

However, in order to compare the strength of the experimental effects between two experiments, additional statistics must be used. Eta-squared (η2) is a measure of the strength of an experimental effect. A high η2 indicates that the independent variable accounts for more of the variability, and thus indicates a stronger experimental effect. In our experiments, we found an η2 of 0.411, which means that approximately 41% of the overall variance can be explained by the relatedness scores.

Word pair            Related    Network Score
empty - full         Yes        10.13
coffee - mug         Yes        5.86
horse - goat         Yes        0.96
dog - leash          Yes        4.70
friend - antonym     No         0.53
vote - conceptual    No         1.37
property - phrasal   No         2.47
drive - super/sub    No         1.86

Table 2: Example scores obtained from the network for related and unrelated word pairs from the Hodgson dataset.

For comparison, we provide the ANOVA results for experiments by McDonald and Brew (2004) and Padó and Lapata (2007) on the same dataset. Both of these experiments obtained scores using vector based models populated with data from the bnc.

We also include the results obtained from performing the same ANOVA tests on Pointwise Mutual Information (pmi) scores collected over our corpus. These results were intended to provide a baseline when using the web-based corpus. To calculate the pmi scores for this experiment, we computed scores for the number of times the two words appeared in the same paragraph or document, and the total number of occurrences of words in the corpus. The pmi scores were calculated by simply dividing the number of times the words co-occurred within a paragraph by the product of the number of occurrences of each word within a document.

                   F         MSE      p           η2
McDonald & Brew    71.73     0.004    < 0.001
Padó & Lapata      182.46    0.93     < 0.01      0.332
pmi                42.53     3.79     < 0.001     0.263
Network            50.71     3.28     < 0.0001    0.411

Table 3: ANOVA results of scores generated from the Hodgson dataset compared to those reported for existing systems. (F = F-test statistic, MSE = mean squared error, p = p-value, η2 = effect size)

6.2 ws-353 and rg datasets

The methodology used to obtain scores for the ws-353 and rg collections was identical to that used for the Hodgson data, except that scores were only obtained for those pairs listed in the dataset. Because both collections provided direct scores, there was no need to retrieve network scores for unrelated pairings.

For consistency with previous literature, the scores obtained by the semantic network were compared with those from the collections using Spearman's rank correlation. The correlation results are given in Table 4. For comparison, we have included the results of the same correlation on scores from three top scoring systems using the approaches described above. We also include the scores obtained by using a simple pmi calculation as in the previous experiment.

                 ws-353    rg
WikiRelate!      0.48      0.86
Hughes-Ramage    0.55      0.84
Agirre et al.    0.66      0.89
pmi              0.41      0.80
Network          0.62      0.86

Table 4: Rank correlation scores for the semantic network and pmi-based approaches, calculated on the ws-353 and rg collections, shown against scores for existing systems.

The scores obtained by our system were not an improvement on those obtained by existing systems. However, our scores were on par with the best performing systems, which were purpose built for this application, and at least in the case of the system by Agirre et al. used a corpus several orders of magnitude larger.
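The pmi baseline used above (paragraph co-occurrence count divided by the product of the words' occurrence counts) can be sketched as follows; the whitespace tokenisation and the toy corpus are stand-ins for the real pipeline:

```python
from collections import Counter
from itertools import combinations

def pmi_scores(paragraphs):
    """Return a scoring function implementing the simple pmi-style
    measure described above: co-occurrences within a paragraph,
    normalised by the product of total occurrence counts."""
    word_counts = Counter()
    pair_counts = Counter()
    for para in paragraphs:
        words = para.lower().split()
        word_counts.update(words)
        # count each unordered pair at most once per paragraph
        for w1, w2 in combinations(sorted(set(words)), 2):
            pair_counts[(w1, w2)] += 1

    def score(a, b):
        pair = tuple(sorted((a, b)))
        denom = word_counts[a] * word_counts[b]
        return pair_counts[pair] / denom if denom else 0.0

    return score

corpus = ["the car needs gasoline", "the car is fast", "gasoline prices rose"]
score = pmi_scores(corpus)
print(score("car", "gasoline"))  # 0.25: one co-occurrence, 2 x 2 occurrences
```

Note that this is not the standard log-ratio form of pointwise mutual information; it follows the simpler ratio described in the text.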

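The rank-correlation evaluation in Table 4 uses Spearman's rho; a minimal implementation of the difference-of-ranks formula (valid only when there are no ties; a library routine such as scipy.stats.spearmanr would normally be used) looks like this, with hypothetical score lists:

```python
def rankdata(values):
    """Assign ranks 1..n (ties broken arbitrarily in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks

def spearman(xs, ys):
    """Spearman's rho via rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rx, ry = rankdata(xs), rankdata(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human  = [8.5, 7.2, 3.1, 1.0]   # hypothetical human relatedness scores
system = [10.1, 5.9, 4.7, 0.5]  # hypothetical network scores
print(spearman(human, system))  # 1.0: the two rankings are identical
```

Because only ranks matter, the network's raw activation totals need not be on the same scale as the human judgements, which is why rank correlation is the natural evaluation here.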
7 Conclusion

In this paper we have shown that a semantic network approach to determining semantic relatedness of terms can achieve performance on par with the best purpose built systems. This is interesting for two reasons. Firstly, the approach we have taken in this paper is much more analogous to the way humans perform similar tasks. Secondly, the system used was not purpose built for this application. It is instead a general tool for information collection and integration, and this result indicates that it will likely be useful for a wide variety of language processing applications.

References

Agirre, Eneko, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL-HLT.

Banerjee, Satanjeev and T. Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico.

Bos, Johan, Stephen Clark, Mark Steedman, James R. Curran, and Julia Hockenmaier. 2004. Wide-coverage semantic representations from a CCG parser. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), pages 1240–1246, Geneva, Switzerland.

Budanitsky, Alexander and Graeme Hirst. 2006. Evaluating wordnet-based measures of semantic distance. Computational Linguistics, 32:13–47, March.

Clark, Stephen and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.

Collins, Allan M. and Elizabeth F. Loftus. 1975. A spreading-activation theory of semantic processing. Psychological Review, 82(6):407–428.

Curran, James R. and Stephen Clark. 2003. Language independent NER using a maximum entropy tagger. In Proceedings of the Seventh Conference on Natural Language Learning, pages 164–167, Edmonton, Canada.

Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass, USA.

Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. In ACM Transactions on Information Systems, volume 20(1), pages 116–131.

Harrington, Brian and Stephen Clark. 2007. ASKNet: Automated semantic knowledge network. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI'07), pages 889–894, Vancouver, Canada.

Harrington, Brian and Stephen Clark. 2009. ASKNet: Creating and evaluating large scale integrated semantic networks. International Journal of Semantic Computing, 2(3):343–364.

Hodgson, James. 1991. Information constraints on pre-lexical priming. Language and Cognitive Processes, 6:169–205.

Hughes, Thad and Daniel Ramage. 2007. Lexical semantic relatedness with random graph walks. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 581–589, Prague, Czech Republic.

Jiang, J. J. and D. W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference on Research on Computational Linguistics, Taipei, Taiwan, September.

McDonald, Scott and Chris Brew. 2004. A distributional model of semantic context effects in lexical processing. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 17–24, Barcelona, Spain.

Mohammad, Saif and Graeme Hirst. 2006. Distributional measures of concept-distance: A task-oriented evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia.

Padó, Sebastian and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

Resnik, Philip. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95–130.

Rubenstein, H. and J. B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8:627–633.

Steedman, Mark and Julia Hockenmaier. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33:355–396.

Strube, Michael and Simone Paolo Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1419–1424. AAAI Press.

Turney, Peter D. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001).
