Determining Word Sense Dominance Using a Thesaurus

Saif Mohammad and Graeme Hirst
Department of Computer Science
University of Toronto
Toronto, ON M5S 3G4, Canada
{smm,gh}@cs.toronto.edu

Abstract

The degree of dominance of a sense of a word is the proportion of occurrences of that sense in text. We propose four new methods to accurately determine word sense dominance using raw text and a published thesaurus. Unlike the McCarthy et al. (2004) system, these methods can be used on relatively small target texts, without the need for a similarly-sense-distributed auxiliary text. We perform an extensive evaluation using artificially generated thesaurus-sense-tagged data. In the process, we create a word–category co-occurrence matrix, which can be used for unsupervised word sense disambiguation and estimating distributional similarity of word senses, as well.

1 Introduction

The occurrences of the senses of a word usually have a skewed distribution in text. Further, the distribution varies in accordance with the domain or topic of discussion. For example, the ‘assertion of illegality’ sense of charge is more frequent in the judicial domain, while in the domain of economics, the ‘expense/cost’ sense occurs more often. Formally, the degree of dominance of a particular sense of a word (the target word) in a given text (the target text) may be defined as the ratio of the occurrences of the sense to the total occurrences of the target word. The sense with the highest dominance in the target text is called the predominant sense of the target word.

Determination of word sense dominance has many uses. An unsupervised system will benefit by backing off to the predominant sense in case of insufficient evidence. The dominance values may be used as prior probabilities for the different senses, obviating the need for labeled training data in a sense disambiguation task. Natural language systems can choose to ignore infrequent senses of words or consider only the most dominant senses (McCarthy et al., 2004). An unsupervised algorithm that discriminates instances into different usages can use word sense dominance to assign senses to the different clusters generated.

Sense dominance may be determined by simple counting in sense-tagged data. However, dominance varies with domain, and existing sense-tagged data is largely insufficient. McCarthy et al. (2004) automatically determine domain-specific predominant senses of words, where the domain may be specified in the form of an untagged target text or simply by name (for example, the financial domain). The system (Figure 1) automatically generates a thesaurus (Lin, 1998) using a measure of distributional similarity and an untagged corpus. The target text is used for this purpose, provided it is large enough to learn a thesaurus from. Otherwise, a large corpus with a sense distribution similar to the target text (text pertaining to the specified domain) must be used.

Figure 1: The McCarthy et al. system. (The target text, or a similarly-sense-distributed auxiliary corpus, is used to build Lin's thesaurus, which, together with WordNet, yields the dominance values.)

The thesaurus has an entry for each word type, which lists a limited number of words (neighbors) that are distributionally most similar to it. Since Lin's distributional measure overestimates the distributional similarity of more-frequent word pairs (Mohammad and Hirst, Submitted), the neighbors of a word corresponding to the predominant sense are distributionally closer to it than those corresponding to any other sense. For each sense of a word, the distributional similarity scores of all its neighbors are summed, using the semantic similarity of the word with the closest sense of the neighbor as weight. The sense that gets the highest score is chosen as the predominant sense.

The McCarthy et al. system needs to re-train (create a new thesaurus) every time it is to determine predominant senses in data from a different domain. This requires large amounts of part-of-speech-tagged and chunked data from that domain. Further, the target text must be large enough to learn a thesaurus from (Lin (1998) used a 64-million-word corpus), or a large auxiliary text with a sense distribution similar to the target text must be provided (McCarthy et al. (2004) separately used 90-, 32.5-, and 9.1-million-word corpora).

By contrast, in this paper we present a method that accurately determines sense dominance even in relatively small amounts of target text (a few hundred sentences); although it does use a corpus, it does not require a similarly-sense-distributed corpus. Nor does our system (Figure 2) need any part-of-speech-tagged data (although that may improve results further), and it does not need to generate a thesaurus or execute any such time-intensive operation at run time. Our method stands on the hypothesis that words surrounding the target word are indicative of its intended sense, and that the dominance of a particular sense is proportional to the relative strength of association between it and co-occurring words in the target text. We therefore rely on first-order co-occurrences, which we believe are better indicators of a word's characteristics than second-order co-occurrences (distributionally similar words).

Figure 2: Our system. (The target text and an untagged auxiliary corpus feed the word–category co-occurrence matrix (WCCM), which, together with a published thesaurus, yields the dominance values.)

2 Thesauri

Published thesauri, such as Roget's and Macquarie, divide the English vocabulary into around a thousand categories. Each category has a list of semantically related words, which we will call category terms or c-terms for short. Words with multiple meanings may be listed in more than one category. For every word type in the vocabulary of the thesaurus, the index lists the categories that include it as a c-term. Categories roughly correspond to coarse senses of a word (Yarowsky, 1992), and the two terms will be used interchangeably. For example, in the Macquarie Thesaurus, bark is a c-term in the categories ‘animal noises’ and ‘membrane’. These categories represent the coarse senses of bark. Note that published thesauri are structurally quite different from the “thesaurus” automatically generated by Lin (1998), wherein a word has exactly one entry, and its neighbors may be semantically related to it in any of its senses. All future mentions of thesaurus will refer to a published thesaurus.

While other sense inventories such as WordNet exist, use of a published thesaurus has three distinct advantages: (i) coarse senses—it is widely believed that the sense distinctions of WordNet are far too fine-grained (Agirre and Lopez de Lacalle Lekuona (2003) and citations therein); (ii) computational ease—with just around a thousand categories, the word–category matrix has a manageable size; (iii) widespread availability—thesauri are available (or can be created with relatively less effort) in numerous languages, while WordNet is available only for English and a few Romance languages. We use the Macquarie Thesaurus (Bernard, 1986) for our experiments. It consists of 812 categories with around 176,000 c-terms and 98,000 word types. Note, however, that using a sense inventory other than WordNet means that we cannot directly compare performance with McCarthy et al. (2004), as that would require knowing exactly how thesaurus senses map to WordNet. Further, it has been argued that such a mapping across sense inventories is at best difficult and maybe impossible (Kilgarriff and Yallop (2001) and citations therein).

3 Co-occurrence Information

3.1 Word–Category Co-occurrence Matrix

The strength of association between a particular category of the target word and its co-occurring words can be very useful—calculating word sense dominance being just one application. To this end we create the word–category co-occurrence matrix (WCCM), in which one dimension is the list of all words (w1, w2, ...) in the vocabulary, and the other dimension is the list of all categories (c1, c2, ...).

          c1     c2     ...    cj     ...
   w1     m11    m12    ...    m1j    ...
   w2     m21    m22    ...    m2j    ...
   ...    ...    ...    ...    ...
   wi     mi1    mi2    ...    mij    ...
   ...    ...    ...    ...    ...

A particular cell, mij, pertaining to word wi and category cj, is the number of times wi occurs in a predetermined window around any c-term of cj in a text corpus. We will refer to this particular WCCM, created after the first pass over the text, as the base WCCM.

The cells of the WCCM are populated using a large untagged corpus (usually different from the target text), which we will call the auxiliary corpus. In our experiments we use a subset (all except every twelfth sentence) of the British National Corpus World Edition (BNC) (Burnard, 2000) as the auxiliary corpus, and a window size of ±5 words. The remaining one twelfth of the BNC is used for evaluation purposes. Note that if the target text belongs to a particular domain, then the creation of the WCCM from an auxiliary text of the same domain is expected to give better results than the use of a domain-free text.

3.2 Analysis of the Base WCCM

The use of untagged data for the creation of the base WCCM means that words that do not really co-occur with a certain category, but rather do so with a homographic word used in a different sense, will (erroneously) increment the counts corresponding to the category. [...] are incremented. The steps of word sense disambiguation and creating new bootstrapped WCCMs can be repeated until the bootstrapping fails to improve accuracy significantly.
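As a concrete illustration of how the base WCCM of Section 3.1 is populated, the following sketch builds the matrix from a tokenized corpus and a toy category-to-c-terms mapping. The function name, data structures, and the miniature "thesaurus" here are ours for illustration only; the actual system uses the Macquarie Thesaurus and the BNC.

```python
from collections import defaultdict

def build_wccm(sentences, thesaurus, window=5):
    """Build a word-category co-occurrence matrix (WCCM).

    sentences -- the auxiliary corpus, as a list of token lists
    thesaurus -- dict mapping category name -> set of c-terms
    window    -- tokens considered on each side of a c-term

    Returns a nested dict: wccm[w][c] is the number of times word w
    occurs within the window around any c-term of category c.
    """
    # Invert the thesaurus into an index: c-term -> categories
    # listing it. An ambiguous word like "bark" maps to several.
    index = defaultdict(set)
    for category, cterms in thesaurus.items():
        for term in cterms:
            index[term].add(category)

    wccm = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, token in enumerate(tokens):
            # Each occurrence of a c-term increments the counts of
            # ALL its categories: this is the noise in the base WCCM
            # that motivates the bootstrapping step of Section 3.2.
            for category in index.get(token, ()):
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        wccm[tokens[j]][category] += 1
    return wccm

# Toy example: "bark" is a c-term of two categories, as in the
# Macquarie example of Section 2.
thesaurus = {"animal noises": {"bark", "meow"},
             "membrane": {"bark", "skin"}}
corpus = [["the", "dog", "began", "to", "bark", "loudly"]]
wccm = build_wccm(corpus, thesaurus)
# "dog" falls within the window of "bark", so the counts for both
# categories are incremented -- one correctly ('animal noises'),
# one erroneously ('membrane').
```

The toy run makes the homograph problem visible: every window around bark credits both of its categories, which is exactly the kind of erroneous increment that the bootstrapped WCCM is meant to reduce.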