Using Wordnet to Supplement Corpus Statistics

Rose Hoberman and Roni Rosenfeld
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA 15213
[email protected] [email protected]

Abstract

Data-driven techniques, although commonly used for many natural language processing tasks, require large amounts of data to perform well. Even with significant amounts of data there is always a long tail of infrequent linguistic events, which results in poor statistical estimation. To lessen the effect of these unreliable estimates, we propose augmenting corpus statistics with linguistic knowledge encoded in existing resources. This paper evaluates the usefulness of the information encoded in WordNet for two tasks: improving perplexity of a bigram language model trained on very little data, and finding longer-distance correlations in text. Word similarities derived from WordNet are evaluated by comparing them to association statistics derived from large amounts of data. Although we see the trends we were hoping for, the overall effect is small. We have found that WordNet does not currently have the breadth or quantity of relations necessary to make substantial improvements over purely data-driven approaches for these two tasks.

1 Motivation and Outline

Data-driven techniques are commonly used for many natural language processing tasks. However, these techniques require large amounts of data to perform well, and even with significant amounts of data there is always a long tail of infrequent linguistic events. The majority of words, for example, occur only a few times even in a very large corpus. Poor statistical estimation of these rare events will always be a problem when relying on data-driven techniques, especially when only small amounts of data are available.

One proposed solution is to augment corpus-derived statistics with linguistic knowledge, available in the form of existing lexical and semantic resources. Such resources include lexical databases like WordNet (Fellbaum, 1998), knowledge bases like Cyc (Lenat, 1995), thesauri like Roget's (Chapman, 1977), and machine-readable dictionaries like the Longman Dictionary of Contemporary English (Proctor, 1978). These linguistic resources have been used for many natural language processing tasks, such as resolving syntactic ambiguity (Resnik, 1999), identifying spelling errors (Budanitsky and Hirst, 2001), and disambiguating word senses (Agirre and Rigau, 1996). However, as they are not frequency-based, it is not clear in general how to use them within a statistical framework.

In this paper, we consider the use of linguistic knowledge derived from WordNet for combating data sparseness in two language modeling tasks. WordNet is a large, widely-used, general-English semantic network that groups words together into synonym sets and links these sets with a variety of linguistic and semantic relations. The taxonomic structure of WordNet enables us to automatically derive word similarities, based on variants of a method proposed in (Resnik, 1995). The goal of this paper is to examine the usefulness of these WordNet-derived word similarities for supplementing corpus statistics in the following two tasks.

The first task is to improve the perplexity of a bigram language model trained on very little data. If semantically similar words have similar bigram distributions, then for rare words we can use WordNet to find similar, more common proxy words. By combining the bigram data of rare words with their nearest neighbors, we hope to reduce data sparseness and thus better approximate each word's true bigram distribution.

The second task is to find long-distance correlations in text; specifically, we would like to find words that tend to co-occur within a sentence. By identifying these "sticky pairs" we can build language models that better reflect the semantic coherence evident within real sentences. Association statistics collected from data are only reliable, however, for high-frequency words. Long-distance associations are generally semantic, and thus WordNet seems an appropriate resource for augmenting the data.

In Section 2 we describe three measures proposed in the literature for measuring the semantic similarity between concepts in a taxonomy. In addition, we introduce a novel measure of word similarity that takes into account sense frequency. Sections 3 and 4 present a more detailed motivation, methodology, results, and error analysis for the bigram and semantic coherence tasks, respectively. Finally, in Section 5 we conclude and discuss future work.
2 Measuring Similarity in a Taxonomy

2.1 Measuring Concept Similarity

WordNet is a large, general-English semantic network that represents concepts as groups of synonymous words, called synsets. WordNet is comprised of approximately 110K synsets, which are connected by about 150K edges representing a variety of linguistic and semantic relations. The largest component of WordNet consists of noun synsets connected by hypernym (IS-A) relations, effectively forming a noun taxonomy.

The simplest measure of semantic similarity between two concepts in a taxonomy (synsets in WordNet) is the length of the shortest path between them. The shorter the path, the more similar the concepts should be. However, this simple correspondence between path length and similarity is not always valid, because edges in a taxonomy often span significantly different semantic distances (Resnik, 1999). For example, in WordNet, eight edges separate rabbit from organism (rabbit IS-A leporid ... IS-A mammal ... IS-A organism), whereas only one edge separates plankton from organism (plankton IS-A organism).

(Resnik, 1995) has proposed an alternative measure of semantic distance in a taxonomy, based on the intuition that the similarity between two concepts is equal to the amount of information they share. He collects counts from a corpus to estimate the probability of each concept in the taxonomy, then uses these probabilities to obtain the information content (negative log likelihood) of a concept. The Resnik similarity between two concepts c1 and c2 is defined as the information content of their least common ancestor (lca):

    sim_res(c1, c2) = − log p(lca(c1, c2))

The Resnik similarity measure has some properties that may not be desirable. For instance, the extent of self-similarity depends on how specific a concept is; two items x and y can be more similar to each other than another item z is to itself. In an attempt to address this and other issues, many other similarity measures have subsequently been proposed. We selected two additional measures suggested in the literature which have been shown to be highly correlated with human judgments of similarity.

(Jiang and Conrath, 1997) calculate the inverse of similarity – semantic distance. However, since any distance measure can easily be converted into a measure of similarity through simple algebraic manipulation, we still refer to it as a measure of similarity. Although Jiang and Conrath motivate their measure differently, it essentially quantifies the distance between two concepts as the amount of information that is not shared between them. This is just the total amount of information in the two synsets minus their shared information:

    sim_jc(c1, c2) = 2 · log p(lca(c1, c2)) − (log p(c1) + log p(c2))

(Lin, 1998) takes into account both similarities and differences between the two concepts. He normalizes the amount of shared information by the total amount of information, essentially calculating the percentage of information which is shared:

    sim_lin(c1, c2) = 2 · log p(lca(c1, c2)) / (log p(c1) + log p(c2))
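To make the three concept-similarity measures concrete, the following sketch (in Python) computes them from precomputed concept probabilities. It is only an illustration of the formulas above, not the authors' implementation: the probability table, the lca lookup, and the toy numbers are hypothetical stand-ins for counts propagated up a real taxonomy.

    import math

    # Hypothetical concept probabilities p(c), as if estimated by counting corpus
    # occurrences of each concept together with all of its descendants.
    p = {"entity": 1.0, "organism": 0.2, "mammal": 0.05,
         "rabbit": 0.001, "plankton": 0.002}

    # Hypothetical least-common-ancestor lookup; a real implementation would
    # walk the WordNet hypernym (IS-A) graph instead.
    lca_table = {frozenset(["rabbit", "plankton"]): "organism",
                 frozenset(["rabbit", "mammal"]): "mammal"}

    def lca(c1, c2):
        return c1 if c1 == c2 else lca_table[frozenset([c1, c2])]

    def sim_res(c1, c2):
        # Resnik: information content (negative log probability) of the LCA.
        return -math.log(p[lca(c1, c2)])

    def sim_jc(c1, c2):
        # Jiang-Conrath: information not shared by the two concepts.
        # As noted in the text, this is really a distance (0 when c1 == c2).
        return 2 * math.log(p[lca(c1, c2)]) - (math.log(p[c1]) + math.log(p[c2]))

    def sim_lin(c1, c2):
        # Lin: fraction of the total information that is shared.
        return 2 * math.log(p[lca(c1, c2)]) / (math.log(p[c1]) + math.log(p[c2]))

    print(sim_res("rabbit", "plankton"),
          sim_jc("rabbit", "plankton"),
          sim_lin("rabbit", "plankton"))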

2.2 Measuring Word Similarity

The three similarity measures just described are all designed for determining the similarity between concepts in a taxonomy. However, most words have multiple possible senses corresponding to distinct concepts in the taxonomy. Given a concept similarity measure sim(c1, c2), Resnik defines the similarity between words as the maximum similarity between any of their senses. He searches over all possible combinations of senses and takes the most informative ancestor:

    wsim_max(w1, w2) = max over c1, c2 of sim(c1, c2)

where c1 ranges over the senses of w1 and c2 ranges over the senses of w2.

However, as (Resnik, 1999) notes, this method "sometimes produces spuriously high similarity measures for words on the basis of inappropriate word senses." In particular, it measures words as highly similar even if it is only their rare senses that are similar in meaning. For instance, wsim_max will find that brazil and pecan are very similar because they are both nuts, which is a rare concept and thus has high information content. However, in broadcast news brazil almost always refers to a country and so on average it is not at all similar to pecan.

To address this issue, we devised another word similarity measure, wsim_wgt, which is the weighted sum of sim(c1, c2) over all pairs of senses c1 of w1 and c2 of w2, where more frequent senses are weighted more heavily:

    wsim_wgt(w1, w2) = Σ over c1, c2 of p(c1|w1) · p(c2|w2) · sim(c1, c2)

Here, p(cj|wi) is the probability of word i mapping to sense j and is derived from a sense-tagged corpus (Miller et al., 1993).
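A small sketch of the two word-level measures described above, assuming a concept similarity function such as one of those in the previous sketch, a senses table mapping each word to its candidate synsets, and sense probabilities p(c|w) taken from a sense-tagged corpus. All of these inputs, including the toy similarity function, are hypothetical placeholders.

    def wsim_max(w1, w2, senses, sim):
        # Resnik's word similarity: the best-scoring pair of senses.
        return max(sim(c1, c2) for c1 in senses[w1] for c2 in senses[w2])

    def wsim_wgt(w1, w2, senses, sense_prob, sim):
        # Weighted variant: sum over all sense pairs, weighting each pair by how
        # often the words actually take those senses, so rare senses contribute little.
        return sum(sense_prob[w1][c1] * sense_prob[w2][c2] * sim(c1, c2)
                   for c1 in senses[w1] for c2 in senses[w2])

    # Toy example: 'brazil' is usually a country and only rarely a nut, so the
    # weighted measure discounts the spuriously similar nut senses.
    senses = {"brazil": ["brazil_country", "brazil_nut"], "pecan": ["pecan_nut"]}
    sense_prob = {"brazil": {"brazil_country": 0.95, "brazil_nut": 0.05},
                  "pecan": {"pecan_nut": 1.0}}
    toy_sim = lambda c1, c2: 8.0 if c1.endswith("nut") and c2.endswith("nut") else 0.5

    print(wsim_max("brazil", "pecan", senses, toy_sim))              # 8.0
    print(wsim_wgt("brazil", "pecan", senses, sense_prob, toy_sim))  # 0.875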

3 Improving Bigram Estimation

A general technique for combating data sparseness for n-gram language models is to derive clusters of similar words or phrases. The hope is that pooling words in the same equivalence class will result in more reliable estimation of model parameters and better generalization to unseen sequences. Several algorithms have been suggested for automatically clustering the vocabulary using information theoretic criteria (Brown et al., 1992; Kneser and Ney, 1993). Another method (Dagan et al., 1999) groups words according to their distributional similarity. All data-driven vocabulary clustering algorithms, however, suffer from the same limitation: when a word is rare there is usually not enough data to identify an appropriate equivalence class. Rare words, which could most benefit from being assigned to an appropriate cluster, cannot be reliably clustered. In this section we test whether the bigram distributions of semantically similar words (according to WordNet) can be combined to reduce the bigram perplexity of rare words.

3.1 Methodology

We use simple linear interpolation to combine a target word's bigram estimate with that of a similar, but more common, proxy word. This use of a proxy is similar to the notion of synonym in (Jelinek et al., 1990). Formally, let p_ml(·|w) be the unsmoothed (maximum likelihood) bigram distribution following word w as derived from the training corpus, and p_gt(·|w) be the corresponding Good-Turing smoothed distribution. Then p_gt(·|t) is the baseline word predictor following target word t, and

    p^s(·|t) = (1 − λ) p_gt(·|t) + λ p_ml(·|s)

is the hopefully improved prediction using proxy s, where λ is optimized using 10-way cross-validation on the training set. When the amount of training data is small, the 10-way cross-validation leads to highly erratic weights because counts are often small, and sometimes even zero. To avoid extreme values, we smooth (shrink) λ towards 0.20, with less shrinkage as the amount of training data grows.

For each target word t we use WordNet to choose a proxy s from the set of candidate proxies; we choose the proxy which is most similar to t according to the WordNet similarity measure. To evaluate WordNet's proxy selection, we compare to the proxies selected by two measures of similarity derived solely from the training data. The first is the Kullback-Leibler distance between the target distribution and the proxy distribution: D(p_gt(·|t) || p_ml(·|s)). The second measure is the training set perplexity reduction of word s, i.e. the improvement in perplexity of the interpolated model p^s(·|t) compared to the 10-way cross-validated model.

To evaluate each proxy s we calculate its percent reduction in perplexity on a test set, comparing the perplexity of p^s(·|t) with the baseline model p_gt(·|t).
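As a concrete but simplified sketch of this interpolation and its evaluation: the bigram distributions, the proxy, the followers, and the value of λ below are hypothetical inputs, and real Good-Turing smoothing and the cross-validated choice of λ are outside the scope of the snippet.

    import math

    def interpolate(p_gt_t, p_ml_s, lam):
        # p^s(.|t) = (1 - lam) * p_gt(.|t) + lam * p_ml(.|s)
        vocab = set(p_gt_t) | set(p_ml_s)
        return {w: (1 - lam) * p_gt_t.get(w, 0.0) + lam * p_ml_s.get(w, 0.0)
                for w in vocab}

    def perplexity(dist, followers):
        # Perplexity over the words observed after the target word in the test
        # set (assumes every observed follower has nonzero probability).
        return math.exp(-sum(math.log(dist[w]) for w in followers) / len(followers))

    def pct_pp_reduction(p_gt_t, p_ml_s, lam, followers):
        base = perplexity(p_gt_t, followers)
        interp = perplexity(interpolate(p_gt_t, p_ml_s, lam), followers)
        return 100.0 * (base - interp) / base

    # Toy distributions for a rare target word t and a more common proxy s.
    p_gt_t = {"the": 0.5, "a": 0.3, "said": 0.2}
    p_ml_s = {"the": 0.4, "said": 0.5, "today": 0.1}
    print(pct_pp_reduction(p_gt_t, p_ml_s, lam=0.2, followers=["said", "the", "said"]))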

3.2 Experiments

Our corpus consists of 140 Million Words (MW) of broadcast news (Graff, 1997). Of this data, 40MW was reserved for testing, and various subsets of the remainder were used for training.

The set of target words and their candidate proxies were selected from the nouns included in WordNet. We randomly selected a sample of 150 target words that occurred only once or twice in the first one million words of our training set but at least 50 times in the test set (so that we can accurately estimate perplexity reduction). For the proxies we selected all the nouns occurring at least 50 times in the first one million words. This gave us a set of 1862 candidate proxies.

In order to evaluate the usefulness of WordNet for perplexity reduction on varying training set sizes, we created nine random subsets of the training corpus, ranging in size from 1MW to the entire 100MW.

For each training set size we chose the highest scoring proxy word for each target word, according to each of the similarity measures. For each proxy selected we then calculated the resulting reduction in perplexity on words that follow the target word in the test set. The average perplexity reduction is defined as the weighted average of the perplexity reduction achieved for each target word, where each target word is weighted by its frequency in the test set.
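For concreteness, a minimal sketch of this weighted averaging; the per-target reductions and test-set frequencies are hypothetical numbers, not results from the paper.

    def avg_pp_reduction(reduction, freq):
        # Weighted average of per-target perplexity reductions, each target
        # weighted by its frequency in the test set.
        total_freq = sum(freq[t] for t in reduction)
        return sum(freq[t] * reduction[t] for t in reduction) / total_freq

    # e.g. three target words with their percent reductions and test-set counts
    print(avg_pp_reduction({"rug": 4.2, "name": 1.5, "blizzard": 6.0},
                           {"rug": 60, "name": 300, "blizzard": 55}))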

Figure 1 shows the average perplexity reduction on the test set as a function of training data size. With large amounts of training data the proxies selected by WordNet reduce perplexity more than the randomly selected proxies, but less than the proxies selected from the data. On the other hand, when the amount of training data is small (1-2MW), the proxies selected by the data alone have close to random performance and WordNet achieves a larger reduction in perplexity. Although the expected trends are present, the magnitude of WordNet's perplexity reduction is small, with a maximum at around 5%.

Figure 1: Perplexity reduction as a function of training data size for four similarity measures: WordNet, random, KL divergence, and training set PP reduction.

The WordNet perplexity reduction remains relatively consistent regardless of which measure of word or concept similarity is selected. Figure 2 compares the performance of the three different WordNet concept similarity measures and a random baseline. Although all three WordNet measures perform significantly better than the baseline, the differences between them are small, and none of the variants performs consistently better over all data sizes.

Figure 2: Perplexity reduction as a function of training data size for a random baseline and three WordNet similarity measures: Resnik, Jiang-Conrath, and Lin.

Another interesting question is to what extent the average perplexity reduction changes if we choose a lower-ranked proxy instead of the highest-ranked. For each of the three similarity measures, as well as a random proxy ranking, Figure 3 shows the reduction in perplexity achieved by always using the nth ranked proxy, where n ranges from 1 to 1862. For the three similarity measures, we can achieve a reduction in perplexity by using any proxy ranked among the first few hundred. The random performance, as expected, remains the same at every rank.

Figure 3: Perplexity reduction as a function of proxy rank for four similarity measures (WordNet, random, KL divergence, and training set PP reduction), with 100MW of training data.

3.3 Error Analysis

In order to understand why the optimal proxies aren't often chosen by WordNet, we selected a small number of proxies to analyze manually. For each target word we found the proxy which resulted in the largest test set perplexity reduction, and classified it into one of three categories, shown in Table 1.

    %    Classification               Example
    15   Closely related in WN        blizzard-storm
    40   Not closely related in WN    aluminum-steel
    45   Not IS-A related             rug-arm

Table 1: Classification of best proxies for 150 target words.

Our analysis shows that about 45% of the best proxies are not related by an IS-A relation; in most cases we could identify no close semantic relationship. For instance, the best proxy for rug is arm, because they both tend to be followed by words like draped, pulled, behind, under. Likewise, the best proxy for name is spelling, because in broadcast news they tend to be followed by words like please and correctly. In a small number of cases (e.g. testament-religion) the words are related topically, but not via an IS-A relation. A few more cases consist of words whose usage is domain specific: in the broadcast news corpus, for instance, one of the target words refers almost exclusively to a piece of evidence in the OJ Simpson trial, and Beard refers to a man.

Another large category of misses is due to idiosyncrasies, errors, or incompleteness in the WordNet taxonomy. For instance, aluminum is classified as a chemical element whereas steel is classified as an alloy/mixture; a bomb is classified as an explosive device while a shell is listed under weaponry; commentary and testimony are both messages but have no closer common ancestor.

In summary, about half of WordNet's misses are due to inherent limitations in the types of relationships it seeks to encode, and the other half are due to idiosyncrasies or errors in the current database.

4 Semantic Coherence

Perhaps the most salient deficiency of conventional n-gram language models is their complete failure at modeling semantic coherence. These models capture short-distance correlations among words in a sentence, yet are unable to distinguish meaningful sentences (where the content words come from the same semantic domain) from "fake" sentences (where the content words are drawn randomly). If we could devise an accurate measure of semantic similarity we could incorporate it as a feature into a language model. An exponential language model (Rosenfeld, 1997), for instance, could include a constraint on the expected similarity between all content words in a sentence.

One way of extracting similarity statistics from data is to collect a large corpus of utterances and then statistically analyze the sentences to determine groups of words that tend to co-occur (Cai et al., 2000; Eneva et al., 2001).

Many measures of association have been proposed which work well when calculated from large amounts of data. For words that occur only a few times, however, it is usually not possible to identify with confidence which other words are likely to co-occur with them in the future. If semantically similar words tend to co-occur, however, then we can use WordNet to find associated words even for rare words.

4.1 Methodology and Experiments

To assess the potential effectiveness of WordNet for finding long-distance word correlations, we compare the similarities derived from WordNet to a statistical measure of association calculated from large amounts of data. These statistical associations will serve as a "ground truth" with which to evaluate WordNet. If frequent words that co-occur in our corpus tend to have high WordNet scores then we will feel confident in relying on WordNet's similarity judgments in situations where only little data is available.

For these experiments we selected a set of about 500,000 noun pairs, where the expected number of chance co-occurrences of each pair (assuming pairwise independence) in our 100MW corpus was required to be greater than five. This constraint on the expectation ensures that the statistical associations are based on enough data to be credible, and can thus serve as an effective "ground truth."

To evaluate word association we chose to use the Q measure of association, also known as Yule's Q statistic (Cai et al., 2000). It is based on a 2 × 2 contingency table, shown in Figure 4. C11 is the number of sentences in the training corpus which contain both words, C21 is the number of sentences which contain only word 1, etc.

                  WORD 1 YES   WORD 1 NO
    WORD 2 YES    C11          C12
    WORD 2 NO     C21          C22

Figure 4: Contingency table.

From the counts in the contingency table we can then compute the Q statistic for each pair of words:

    Q = (C11 · C22 − C12 · C21) / (C11 · C22 + C12 · C21)

The values of Q range from -1 to 1; Q is -1 when two words have never occurred in the same sentence, and is 1 when they always occur together.
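A minimal sketch of the Q computation from sentence-level co-occurrence counts; the toy corpus and the word pair are hypothetical, and degenerate tables (denominator of zero) are not handled.

    def yules_q(sentences, w1, w2):
        # 2x2 contingency table over sentences: c11 = both words present,
        # c12 = only word 2, c21 = only word 1, c22 = neither.
        c11 = c12 = c21 = c22 = 0
        for sent in sentences:
            words = set(sent)
            has1, has2 = w1 in words, w2 in words
            if has1 and has2:
                c11 += 1
            elif has2:
                c12 += 1
            elif has1:
                c21 += 1
            else:
                c22 += 1
        return (c11 * c22 - c12 * c21) / (c11 * c22 + c12 * c21)

    corpus = [["the", "senate", "passed", "the", "bill"],
              ["the", "house", "and", "senate", "adjourned"],
              ["rain", "fell", "on", "the", "city"]]
    print(yules_q(corpus, "house", "senate"))  # 1.0: 'house' never occurs without 'senate'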

For each wordpair we calculated its Q score and its WordNet similarity score. The overall correlation between these two scores can be seen in Figure 5. The graph shows a sample of 10,000 noun pairs, with their wsim_wgt scores plotted against their Q score. In general, words with high WordNet similarity scores are likely to co-occur. Figure 6 illustrates this even more clearly. When the WordNet scores are high (wsim_wgt > 6), the Q scores are generally positive. Unfortunately, however, this happens very rarely. Only 0.1% of the wordpairs have WordNet similarity scores above 5 and only 0.03% are above 6. Thus, over the whole range of possible values, the two scores have essentially no correlation (ρ = 0.01).

Figure 5: WordNet similarity scores versus Q scores for 10,000 noun pairs.

Figure 6: Distribution of Q scores for wordpairs with high WordNet similarity (wsim_wgt > 6) vs. the distribution of Q scores for all wordpairs.

Although there was no significant difference in performance among the three information-based concept similarity measures (discussed in Section 2.1), they all performed significantly better than the simple edge-counting technique.

The two word similarity measures (discussed in Section 2.2), however, did yield significantly different results. Figure 7 compares wsim_wgt to wsim_max. This graph is a precision/recall curve where we assume there is some reference standard (some set of truly similar words), then plot the trade-off between precision and recall as the WordNet similarity threshold is lowered. For this graph we defined "truly associated" to be a Q score greater than 0.5. This is a somewhat arbitrary choice, but when we vary this cutoff the graph does not change much. The weighted similarity (wsim_wgt) performs better; it consistently has higher precision for equivalent recall values. However, for both measures, the recall is disappointingly low.

Figure 7: Precision/recall curve for wsim_wgt vs. wsim_max, for wordpairs with Q > 0.5.
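The precision/recall trade-off described above can be sketched as follows; the similarity scores, Q scores, and the 0.5 cutoff for "truly associated" pairs are placeholder values consistent with the setup in the text, not the authors' data.

    def precision_recall_curve(pairs, q_cutoff=0.5):
        # pairs: list of (wordnet_similarity, q_score) tuples. "Truly associated"
        # pairs are those with Q > q_cutoff; sweep the WordNet similarity
        # threshold downward and record (threshold, precision, recall).
        pairs = sorted(pairs, key=lambda x: -x[0])
        total_true = sum(1 for _, q in pairs if q > q_cutoff)
        curve, true_so_far = [], 0
        for rank, (sim, q) in enumerate(pairs, start=1):
            if q > q_cutoff:
                true_so_far += 1
            curve.append((sim, true_so_far / rank, true_so_far / total_true))
        return curve

    toy_pairs = [(6.3, 0.9), (5.8, 0.7), (4.1, -0.2), (2.0, 0.8), (0.5, -0.6)]
    for sim, prec, rec in precision_recall_curve(toy_pairs):
        print(f"threshold >= {sim:.1f}  precision = {prec:.2f}  recall = {rec:.2f}")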

4.2 Error Analysis

We manually conducted a small analysis to try to understand WordNet's poor recall on the semantic coherence task. In particular, we tried to determine the cause of misses (when nouns that co-occur are given low WordNet similarity scores).

From the original set of 500K noun pairs we selected those pairs which were most highly associated (Q > 0.9). These pairs were then manually classified into categories based on the relationship between the words (e.g. part-whole, antonyms, ...). The final categories and counts are listed in Table 2.

    Relation         Num        Examples
    WN               277 (163)
      part/member     87 (15)   student-school
      phrase is-a     65 (47)   death tax IS-A tax
      coordinates     41 (31)   house-senate, gas-oil
      morphology      30 (28)   hospital-hospitals
      is-a            28 (23)   gun-weapon
      antonyms        18 (13)   majority-minority
      other            8 (6)    doctor-patient
    non-WN           461
      topical         336       evidence-guilt
      news/events     102       iraq-weapons
      other            23       end of the spectrum

Table 2: Relation types of wordpairs that tend to co-occur in broadcast news. Counts in parentheses are the number of pairs that were actually related in the current version of WordNet.

The majority of co-occurring word pairs were related only topically. WordNet does not currently encode these topical relationships, explaining why only a third of the pairs were found to be connected by a relationship that WordNet was designed to encode. Furthermore, only 59% of that third were actually found to be related in the current version of WordNet. This lack of coverage considerably limits the usefulness of WordNet for predicting which words will co-occur together in the news.

5 Conclusions and Future Work

We find that word similarities derived from WordNet are a very weak source of knowledge for these two language modeling tasks. Interpolating the bigram distribution of rare words with proxies selected from WordNet results in a small perplexity improvement compared to proxies selected only from the data. Although words with very high WordNet similarity do tend to co-occur within sentences, recall is poor because most words with strong long-distance correlations are topically related, and WordNet currently does not include topical links. However, topical clusters are being added to the next version of WordNet, so performance on this task might improve.
Nonetheless, we have found that the lim- C. Cai, R. Rosenfeld, and L. Wasserman. 2000. Expo- ited types and quantities of relationships in Word- nential language models, logistic regression, and se- Net compared to the spectrum of relationships found mantic coherence. In NIST/DARPA Speech Transcrip- in real data make it difficult to make substantial tion Workshop. improvements for these tasks by using a linguistic R. Chapman, editor. 1977. Roget’s International The- knowledge source even as large as WordNet. saurus. Harper and Row, fourth edition. Despite the disappointing performance of the I. Dagan, L. Lee, and F. Pereira. 1999. Similarity-based WordNet-derived similarity measures for these two models of word cooccurrence probabilities. Machine tasks, we see some possible directions for future Learning, 34(1-3):43–69. Special issue on natural lan- work. While the experiments in this paper inter- guage learning. polate each target word with only a single proxy, E. Eneva, R. Hoberman, and L. Lita. 2001. Within- Figure 3 shows that many of the top-ranked prox- sentence semantic coherence. In EMNLP. ies reduce perplexity. It seems likely, therefore, that C. Fellbaum, editor. 1998. WordNet: An Electronic Lex- interpolating each target word with multiple proxies ical Database. The MIT Press. would result in further perplexity reduction. In ad- D. Graff. 1997. The 1996 broadcast news speech and dition, while our present work merely compares the language model corpus. In DARPA Workshop on Spo- usefulness of the proxies selected by the data and ken Language Technology. WordNet individually, a straightforward extension F. Jelinek, R. Mercer, and S. Roukos. 1990. Classify- would be to combine the two knowledge sources. ing words for improved statistical language models. In Incorporating WordNet’s word similarities as a prior ICASSP. would allow them to dominate when there is little J. Jiang and D. Conrath. 1997. Semantic similarity data, but eventually be washed out by the more reli- based on corpus statistics and lexical taxonomy. In able statistical measures. ROCLING. Another possibility is to try similar techniques R. Kneser and H. Ney. 1993. Improved clustering tech- on a slightly different task, like prediction, niques for class-based statistical language modelling. where data sparsity is a more significant problem. In EUROSPEECH. Prediction of distance two bigrams might also be D. B. Lenat. 1995. Cyc: A large-scale investment more successful, because at larger distances linguis- in knowledge infrastructure. Communications of the tic behavior tends to be governed more by seman- ACM, 38(11), November. tic rather than syntactic relations. Another more D. Lin. 1998. An information-theoretic definition of long term direction is to automatically learn which similarity. In ICML. subsets of WordNet are most reliable. For exam- G. Miller, C. Leacock, R. Tengi, and R. Bunker. 1993. ple, anecdotal evidence indicates that proxies de- A semantic concordance. ARPA WorkShop on Human rived from WordNet’s organism hierarchy are con- Language Technology. sistently useful, whereas proxies from the communi- P. Proctor, editor. 1978. The Longman Dictionary of cation hierarchy are less reliable. Contemporary English (LDOCE). Longman Group. P. Resnik. 1995. Using information content to evaluate References semantic similarity in a taxonomy. In IJCAI. E. Agirre and G. Rigau. 1996. Word sense disambigua- P. Resnik. 1999. Semantic similarity in a taxonomy: An tion using conceptual density. In COLING. 
information-based measure and its application to prob- lems of ambiguity in natural language. Journal of Ar- P. Brown, V. DellaPietra, P. deSouza, J. Lai, and R. Mer- tificial Intelligence Research, 11:95–130. cer. 1992. Class-based n-gram models of natural lan- guage. Computational Linguistics, 18(4):467–479. R. Rosenfeld. 1997. A whole sentence maximum en- tropy language model. In IEEE Workshop on Auto- A. Budanitsky and G. Hirst. 2001. Semantic distance in matic and Understanding. WordNet: An experimental, application-oriented eval-