A Comparison of WordNet and Roget's Taxonomy for Measuring Semantic Similarity

Michael L. McHale
Intelligent Information Systems
Air Force Research Laboratory
525 Brooks Road, Rome, NY 13441, USA
[email protected]

Abstract

This paper presents the results of using Roget's International Thesaurus as the taxonomy in a semantic similarity measurement task. Four similarity metrics were taken from the literature and applied to Roget's. The experimental evaluation suggests that the traditional edge counting approach does surprisingly well (a correlation of r=0.88 with a benchmark set of human similarity judgements, against an upper bound of r=0.90 for human subjects performing the same task).

Introduction

The study of semantic relatedness has been a part of artificial intelligence and psychology for many years. Much of the early semantic relatedness work in natural language processing centered around the use of Roget's thesaurus (Yarowsky 92). As WordNet (Miller 90) became available, most of the new work used it (Agirre & Rigau 96, Resnik 95, Jiang & Conrath 97). This is understandable, as WordNet is freely available, fairly large and was designed for computing. Roget's remains, though, an attractive lexical resource for those with access to it. Its wide, shallow hierarchy is densely populated with nearly 200,000 words and phrases. The relationships among the words are also much richer than WordNet's IS-A or HAS-PART links. The price paid for this richness is a somewhat unwieldy tool with ambiguous links.

This paper presents an evaluation of Roget's for the task of measuring semantic similarity. This is done by using four metrics of semantic similarity found in the literature while using Roget's International Thesaurus, third edition (Roget 1962) as the taxonomy. Thus the results can be compared to those in the literature (that used WordNet). The end result is the ability to compare the relative usefulness of Roget's and WordNet for this type of task.

1 Semantic Similarity

Each metric of semantic similarity makes assumptions about the taxonomy in which it works. Generally, these assumptions go unstated, but since they are important for understanding the results we obtain, we will cover them for each metric. All the metrics assume a taxonomy with some semantic order.

1.1 Distance Based Similarity

A common method of measuring semantic similarity is to consider the taxonomy as a tree, or lattice, in semantic space. The distance between concepts within that space is then taken as a measurement of the semantic similarity.

1.1.1 Edges as distance

If all the edges (branches of the tree) are of equal length, then the number of intervening edges is a measure of the distance. The measurement usually used (Rada et al. 89) is the shortest path between concepts. This, of course, relies on an ideal taxonomy with edges of equal length. In taxonomies based on natural languages, the edges are not the same length. In Roget's, for example, the distance (counting edges) between Intellect and Grammar is the same as the distance between Grammar and Phrase Structure. This does not seem intuitive. In general, the edges in this type of taxonomy tend to grow shorter with depth.
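As a rough illustration of edge counting, the sketch below computes the shortest path between two concepts through their nearest common ancestor. The child-to-parent links and concept names are invented for the example; they are not Roget's actual structure.

```python
# Rough sketch of edge counting (Rada et al. 89): similarity as the
# shortest path between two concepts, counting edges. The child -> parent
# links below are invented for the example, not Roget's real structure.
PARENT = {
    "phrase structure": "grammar",
    "grammar": "intellect",
    "kayak": "ship/boat",
    "tugboat": "ship/boat",
}

def ancestors(concept):
    """The concept itself plus every node above it, up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def edge_distance(c1, c2):
    """Edges on the shortest path from c1 to c2 through a common ancestor."""
    up1, up2 = ancestors(c1), ancestors(c2)
    shared = set(up1) & set(up2)
    if not shared:
        return None  # no common ancestor in this toy fragment
    return min(up1.index(a) + up2.index(a) for a in shared)

# Both pairs come out one edge apart, the counter-intuitive equality the
# text notes for Intellect/Grammar versus Grammar/Phrase Structure.
print(edge_distance("intellect", "grammar"))         # 1
print(edge_distance("grammar", "phrase structure"))  # 1
```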
1.1.2 Related Metrics

A number of different metrics related to distance have used edges that have been modified to correct for the problem of non-uniformity. The modifications include the density of the subhierarchies, the depth in the hierarchy where the word is found, the type of links, and the information content of the nodes subsuming the word.

McHale (95) decomposed Roget's taxonomy and used five different metrics to show the usefulness of the various attributes of the taxonomy. Two of those metrics deal with distance, but only one is of interest to us for this task: the number of intervening words. The number of intervening words ignores the hierarchy completely, treating it as a flat file. For the measurement to be an accurate metric, two conditions must be met. First, the ordering of the words must be correct. Second, either all the words of the language must be represented (virtually impossible) or they must be evenly distributed throughout the hierarchy [1]. Since it is unlikely that either of these conditions holds for any taxonomy, the most that can be expected of this measurement is that it might provide a reasonable approximation of the distance (similar to density). It is included here, not because the approximation is reasonable, but because it provides information that helps explain the other results. (A small sketch of this measure appears at the end of this section.)

[1] This condition certainly does not hold true in WordNet, where animals and plants represent a disproportionately large section of the hierarchy.

The use of density is based on the observation that words in a more densely populated part of the hierarchy are more closely related than words in sparser areas (Agirre and Rigau 96). For density to be a valid metric, the hierarchy must be fairly complete, or at least the distribution of words in the hierarchy has to closely reflect the distribution of words in the language. Neither of these conditions ever holds completely. Furthermore, the observation about density may be an overgeneralization. In Roget's, for instance, category 277 Ship/Boat has many more words (is much denser) than category 372 Blueness. That does not mean that kayak is more closely related to tugboat than sky blue is to turquoise. In fact, it does not even mean that kayak is closer to Ship/Boat than turquoise is to Blueness.

Depth in the hierarchy is another attribute often used. It may be more useful in the deep hierarchy of WordNet than it is in Roget's, where the hierarchy is fairly flat and uniform. All the words in Roget's are at either level 6 or 7 in the hierarchy.

The type of link in WordNet is explicit; in Roget's it is never clear, but it consists of more than IS-A and HAS-PART. One such link is HAS-ATTRIBUTE.

Some of the researchers that have used the above metrics include Sussna (Sussna 93), who weighted the edges by using the density of the subhierarchy, the depth in the hierarchy and the type of link. Richardson and Smeaton (Richardson and Smeaton 95) used density, hierarchy depth and the information content of the concepts. Jiang and Conrath (Jiang and Conrath 97) used the number of edges and information content. They all reported improvement in results compared to straight edge counting.
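The sketch below illustrates the intervening-words measure mentioned above: the hierarchy is ignored and the thesaurus is treated as one flat, ordered word list. The five-word list and its ordering are invented for the example and are not taken from Roget's.

```python
# Rough sketch of the intervening-words measure: the hierarchy is ignored
# and the thesaurus is treated as one flat, ordered word list. The list
# below is invented for the example; it is not Roget's ordering.
FLAT_WORD_LIST = ["sky blue", "turquoise", "blueness", "kayak", "tugboat"]
INDEX = {word: i for i, word in enumerate(FLAT_WORD_LIST)}

def intervening_words(w1, w2):
    """Number of words that lie between w1 and w2 in the flat ordering."""
    i, j = sorted((INDEX[w1], INDEX[w2]))
    return max(j - i - 1, 0)

print(intervening_words("sky blue", "turquoise"))  # 0: adjacent, judged close
print(intervening_words("sky blue", "tugboat"))    # 3: far apart in the ordering
```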
1.2 Information Based Similarity

Given the above problems with distance related measures, Resnik (Resnik 95) decided to use just the information content of the concepts and compared the results to edge counting and human replication of the same task. Resnik defines the similarity of two concepts as the maximum of the information content of the concepts that subsume them in the taxonomy. The information content of a concept relies on the probability of encountering an instance of the concept. To compute this probability, Resnik used the relative frequency of occurrence of each word in the Brown Corpus [2]. The probabilities thus found should fairly well approximate the true values for other generalized texts. The concept probabilities were then computed from the occurrences as simply the relative frequency of the concept:

p(c) = Freq(c) / N

The information content of each concept is then given by IC(c) = -log p(c), where p(c) is the probability. Thus, more common words have lower information content.

[2] Resnik used the semantic concordance (semcor) that comes with WordNet. Semcor is derived from a hand-tagged subset of the Brown Corpus. His calculations were done using WordNet 1.5.

To replicate the metric using Roget's, the frequency of occurrence of the words found in the Brown Corpus was divided by the total number of occurrences of the word in Roget's [3]. From the information content of each word, the information content for each node in the Roget hierarchy was computed. These are simply the minimum of the information content of all the words beneath the node in the taxonomy. Therefore, the information content of a parent node is never greater than any of its children.

The metric of relatedness for two words according to Resnik is the information content of the lowest common ancestor for any of the word senses. What this implies is that, for the purpose of measuring relatedness, each synset in WordNet or each semicolon group in Roget's [...] computations for the entire Roget hierarchy. This is sizeable overhead compared to edge counting, which requires no a priori computations. Of course, once the computations are done they do not need to be recomputed until a new word is added to the hierarchy. Since the values for information content bubble up from the words, each addition of a word would require that all the hierarchy above it be recomputed.

Jiang and Conrath (Jiang and Conrath 97) also used information content to measure semantic relatedness, but they combined it with edge counting using a formula that also took into consideration local density, node depth and link type. They optimized the formula by using two parameters, α and β, that controlled the degree to which the node depth and density factors contributed to the edge weighting computation. If α=0 and β=1, then their formula for the distance between two concepts c1 and c2 simplifies to

Dist(c1, c2) = IC(c1) + IC(c2) - 2 × IC(LS(c1, c2))

where LS(c1, c2) denotes the lowest superordinate of c1 and c2.
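To make the two information-content measures concrete, the sketch below uses an invented five-word hierarchy and invented frequency counts (none of the figures come from the Brown Corpus or Roget's). Node information content is taken, as described above, to be the minimum IC of the words beneath the node; the function names (ic_word, resnik_sim, jc_dist) are illustrative, not drawn from the cited papers.

```python
import math

# Rough sketch of the information-content measures, using an invented
# hierarchy (child -> parent) and invented corpus counts.
PARENT = {"kayak": "boat", "tugboat": "boat", "boat": "vehicle", "car": "vehicle"}
FREQ = {"kayak": 2, "tugboat": 3, "boat": 10, "car": 80, "vehicle": 5}
N = sum(FREQ.values())

def ic_word(w):
    """IC(c) = -log p(c), with p(c) = Freq(c) / N."""
    return -math.log(FREQ[w] / N)

def ancestors(c):
    """The concept itself plus every node above it, up to the root."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def ic_node(node):
    """Node IC = minimum IC of the words at or beneath the node,
    so a parent's IC is never greater than any of its children's."""
    beneath = [w for w in FREQ if node in ancestors(w)]
    return min(ic_word(w) for w in beneath)

def lowest_superordinate(c1, c2):
    """Lowest common ancestor LS(c1, c2) in the toy tree."""
    up1 = ancestors(c1)
    for a in ancestors(c2):
        if a in up1:
            return a
    return None

def resnik_sim(c1, c2):
    """Resnik: similarity is the IC of the lowest common ancestor."""
    return ic_node(lowest_superordinate(c1, c2))

def jc_dist(c1, c2):
    """Jiang-Conrath with alpha=0, beta=1:
    Dist = IC(c1) + IC(c2) - 2 * IC(LS(c1, c2))."""
    return ic_word(c1) + ic_word(c2) - 2 * resnik_sim(c1, c2)

print(round(resnik_sim("kayak", "tugboat"), 2))  # IC of their common ancestor 'boat'
print(round(jc_dist("kayak", "tugboat"), 2))     # small distance: closely related
```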
