Word Sense Disambiguation Using a Second Language Monolingual Corpus
Ido Dagan,* AT&T Bell Laboratories
Alon Itai,† Technion--Israel Institute of Technology

* AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974, USA. E-mail: [email protected]. The work reported here was done while the author was at the Technion--Israel Institute of Technology.
† Department of Computer Science, Technion--Israel Institute of Technology, Haifa 32000, Israel. E-mail: [email protected].

© 1994 Association for Computational Linguistics

This paper presents a new approach for resolving lexical ambiguities in one language using statistical data from a monolingual corpus of another language. This approach exploits the differences between mappings of words to senses in different languages. The paper concentrates on the problem of target word selection in machine translation, for which the approach is directly applicable. The presented algorithm identifies syntactic relations between words, using a source language parser, and maps the alternative interpretations of these relations to the target language, using a bilingual lexicon. The preferred senses are then selected according to statistics on lexical relations in the target language. The selection is based on a statistical model and on a constraint propagation algorithm, which simultaneously handles all ambiguities in the sentence. The method was evaluated using three sets of Hebrew and German examples and was found to be very useful for disambiguation. The paper includes a detailed comparative analysis of statistical sense disambiguation methods.

1. Introduction

The resolution of lexical ambiguities in nonrestricted text is one of the most difficult tasks of natural language processing. A related task in machine translation, on which we focus in this paper, is target word selection. This is the task of deciding which target language word is the most appropriate equivalent of a source language word in context. In addition to the alternatives introduced by the different word senses of the source language word, the target language may specify additional alternatives that differ mainly in their usage.

Traditionally, several linguistic levels were used to deal with this problem: syntactic, semantic, and pragmatic. Computationally, the syntactic methods are the most affordable, but they are of no avail in the frequent situation in which the different senses of the word show the same syntactic behavior, having the same part of speech and even the same subcategorization frame. Substantial application of semantic or pragmatic knowledge about the word and its context requires compiling huge amounts of knowledge, the usefulness of which for practical applications in broad domains has not yet been proven (e.g., Lenat et al. 1990; Nirenburg et al. 1988; Chodorow, Byrd, and Heidorn 1985). Moreover, such methods usually do not reflect word usages.

Statistical approaches, which were popular several decades ago, have recently been revived and found useful for computational linguistics. Within this framework, a possible (though partial) alternative to using manually constructed knowledge can be found in the use of statistical data on the occurrence of lexical relations in large corpora (e.g., Grishman, Hirschman, and Nhan 1986).
The use of such relations (mainly relations between verbs or nouns and their arguments and modifiers) for various purposes has received growing attention in recent research (Church and Hanks 1990; Zernik and Jacobs 1990; Hindle 1990; Smadja 1993). More specifically, two recent works have suggested using statistical data on lexical relations for resolving ambiguity of prepositional phrase attachment (Hindle and Rooth 1991) and pronoun references (Dagan and Itai 1990, 1991).

Clearly, statistics on lexical relations can also be useful for target word selection. Consider, for example, the following Hebrew sentence, extracted from the foreign news section of the daily Ha-Aretz, September 1990 (transcribed to Latin letters):

(1) Nose ze mana" mi-shtei ha-mdinot mi-lahtom "al hoze shalom.
    issue this prevented from-two the-countries from-signing on treaty peace

This sentence would translate into English as

(2) This issue prevented the two countries from signing a peace treaty.

The verb lahtom has four senses: 'sign,' 'seal,' 'finish,' and 'close.' The noun hoze means both 'contract' and 'treaty,' where the difference is mainly in usage rather than in meaning (in Hebrew the word hoze is used for both sub-senses).

One possible solution is to consult a Hebrew corpus tagged with word senses, from which we would probably learn that the sense 'sign' of lahtom appears more frequently with hoze as its object than all the other senses. Thus we should prefer that sense. However, the size of corpora required to identify lexical relations in a broad domain is very large, and therefore it is usually not feasible to have such corpora manually tagged with word senses.1 The problem of choosing between 'treaty' and 'contract' cannot be solved using only information on Hebrew, because Hebrew does not distinguish between them.

1 Hearst (1991) suggests a sense disambiguation scheme along this line. See Section 7 for a comparison of several sense disambiguation methods.

The solution suggested in this paper is to identify the lexical relations in corpora of the target language, instead of the source language. We consider word combinations and count how often they appear in the same syntactic relation as in the ambiguous sentence. For the above example, the noun compound 'peace treaty' appeared 49 times in our corpus (see Section 4.3 for details on our corpus), whereas the compound 'peace contract' did not appear at all; the verb-object combination 'to sign a treaty' appeared 79 times, whereas none of the other three alternatives appeared more than twice. Thus, we first prefer 'treaty' to 'contract' because of the noun compound 'peace treaty,' and then proceed to prefer 'sign,' since it appears most frequently with the object 'treaty.' The order of selection is determined by a constraint propagation algorithm. In both cases, the correctly selected word is not the most frequent one: 'close' is more frequent in our corpus than 'sign,' and 'contract' is more frequent than 'treaty.' Also, by using a model of statistical confidence, the algorithm avoids a decision in cases in which no alternative is significantly better than the others.
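To make this selection procedure concrete, the sketch below (in Python) illustrates the core idea on example (1): for each syntactic relation, the target-language alternative with the highest corpus count is preferred, and no decision is made when the evidence is weak. This is only an informal illustration under our own assumptions; the names relation_counts, select, and MIN_RATIO, the fixed-ratio threshold, and the counts other than 49 and 79 are placeholders, not the multinomial confidence model of Section 3.

    # Minimal illustration only (not the model of Section 3): prefer the target
    # alternative whose lexical relation is most frequent in the target-language
    # corpus, and abstain when no alternative is clearly better.
    from typing import Optional

    # Hypothetical counts of (relation, word1, word2) tuples, as if extracted by
    # parsing an English corpus; only the figures 49 and 79 come from the paper.
    relation_counts = {
        ("noun-compound", "peace", "treaty"): 49,
        ("noun-compound", "peace", "contract"): 0,
        ("verb-object", "sign", "treaty"): 79,
        ("verb-object", "seal", "treaty"): 2,
        ("verb-object", "finish", "treaty"): 0,
        ("verb-object", "close", "treaty"): 1,
    }

    MIN_RATIO = 3.0  # arbitrary stand-in for the statistical confidence model


    def select(relation: str, pairs: list) -> Optional[tuple]:
        """Return the preferred target-language word pair for one relation,
        or None if no alternative is clearly better than the rest."""
        scored = sorted(pairs, key=lambda p: relation_counts.get((relation, *p), 0),
                        reverse=True)
        best_count = relation_counts.get((relation, *scored[0]), 0)
        runner_count = relation_counts.get((relation, *scored[1]), 0)
        if best_count < MIN_RATIO * max(runner_count, 1):
            return None  # abstain: the evidence is not decisive
        return scored[0]


    print(select("noun-compound", [("peace", "treaty"), ("peace", "contract")]))
    print(select("verb-object", [("sign", "treaty"), ("seal", "treaty"),
                                 ("finish", "treaty"), ("close", "treaty")]))
    # -> ('peace', 'treaty') and ('sign', 'treaty'), matching the discussion above.

Note that the second call already assumes that 'treaty' has been chosen over 'contract'; deciding the order in which such choices are committed is the role of the constraint propagation algorithm, for which a possible greedy reading is sketched at the end of this introduction.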
Our approach can be analyzed from two different points of view. From that of monolingual sense disambiguation, we exploit the fact that the mapping between words and word senses varies significantly among different languages. This enables us to map an ambiguous construct from one language to another, obtaining representations in which each sense corresponds to a distinct word. Now it is possible to collect co-occurrence statistics automatically from a corpus of the other language, without requiring manual tagging of senses.2

From the point of view of machine translation, we suggest that some ambiguity problems are easier to solve at the level of the target language than at the level of the source language. The source language sentences are considered a noisy source for target language sentences, and our task is to devise a target language model that prefers the most reasonable translation. Machine translation is thus viewed in part as a recognition problem, and the statistical model we use specifically for target word selection may be compared with other language models in recognition tasks (e.g., Katz 1987; Jelinek 1990, for speech recognition). To a limited extent, this view is shared with the statistical machine translation system of Brown et al. (1990), which employs a target language n-gram model (see Section 8 for a comparison with this system). In contrast to this view, previous approaches in machine translation typically resolve examples like (1) by stating various constraints in terms of the source language (Nirenburg 1987). As explained above, such constraints cannot be acquired automatically and therefore are usually limited in their coverage.

The experiments we conducted clearly show that statistics on lexical relations are very useful for disambiguation. Most notable is the result for the set of examples of Hebrew to English translation, which was picked randomly from foreign news sections in the Israeli press. For this set, the statistical model was applicable for 70% of the ambiguous words, and its selection was then correct for 91% of the cases. We also cite the results of a later experiment (Dagan, Marcus, and Markovitch 1993) that tested a weaker variant of our method on texts in the computer domain, achieving a precision of 85%. Both results significantly improve upon a naive method that uses only a priori word probabilities. These results are comparable to recent reports in the literature (see Section 7). It should be emphasized, though, that our results were achieved for a realistic simulation of a broad-coverage machine translation system, on randomly selected examples. We therefore believe that our figures reflect the expected performance of the algorithm in a practical implementation. On the other hand, most other results relate to a small number of words and senses that were determined by the experimenters.

Section 2 of the paper describes the linguistic model we use, employing a syntactic parser and a bilingual lexicon. Section 3 presents the statistical model, assuming a multinomial model for a single lexical relation and then using a constraint propagation algorithm to account simultaneously for all relations in the sentence. Section 4 describes the experimental setting. Section 5 presents and analyzes the results of the experiment and cites additional results (Dagan, Marcus, and Markovitch 1993). In Section 6 we analyze the limitations of the algorithm in different cases and suggest enhancements to improve it. We also discuss the possibility of adapting the algorithm for monolingual applications. Finally, in Section 7 we present a comparative analysis of statistical sense disambiguation methods, and then conclude in Section 8.
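Before turning to the linguistic model, we close the introduction with a second sketch that continues the one given above. The constraint propagation algorithm determines the order in which decisions are committed: the most confidently supported relation is resolved first, and its chosen target words then restrict the alternatives left for the remaining relations. The code below, which reuses relation_counts and MIN_RATIO from the earlier sketch, is one plausible greedy reading of this idea under our own assumptions; the names alternatives, relations, confidence_and_choice, and resolve are illustrative, and the ratio-based confidence score is a crude stand-in for the model of Section 3.

    # A plausible greedy reading of the constraint propagation step, applied to
    # example (1). Not the algorithm of Section 3.

    def confidence_and_choice(relation, pairs):
        """Return a crude confidence score (best count over runner-up count,
        add-one smoothed) and the best target pair for one relation."""
        scored = sorted(pairs, key=lambda p: relation_counts.get((relation, *p), 0),
                        reverse=True)
        best_count = relation_counts.get((relation, *scored[0]), 0)
        runner_count = (relation_counts.get((relation, *scored[1]), 0)
                        if len(scored) > 1 else 0)
        return best_count / (runner_count + 1), scored[0]


    def resolve(alternatives, relations):
        """Commit the most confident relation first; its chosen target words
        shrink the alternative sets seen by the relations still pending."""
        chosen = {}
        pending = list(relations)
        while pending:
            candidates = []
            for rel, w1, w2 in pending:
                pairs = [(a, b) for a in alternatives[w1] for b in alternatives[w2]]
                conf, best = confidence_and_choice(rel, pairs)
                candidates.append((conf, (rel, w1, w2), best))
            conf, (rel, w1, w2), (t1, t2) = max(candidates)
            if conf < MIN_RATIO:
                break  # no remaining decision is confident enough; abstain
            alternatives[w1], alternatives[w2] = [t1], [t2]
            chosen[w1], chosen[w2] = t1, t2
            pending.remove((rel, w1, w2))
        return chosen


    # The ambiguous words of example (1), their English alternatives from a
    # bilingual lexicon, and the syntactic relations found by the source parser.
    alternatives = {
        "shalom": ["peace"],
        "hoze": ["treaty", "contract"],
        "lahtom": ["sign", "seal", "finish", "close"],
    }
    relations = [
        ("noun-compound", "shalom", "hoze"),
        ("verb-object", "lahtom", "hoze"),
    ]

    print(resolve(alternatives, relations))
    # -> {'shalom': 'peace', 'hoze': 'treaty', 'lahtom': 'sign'}; the noun compound
    #    is resolved first (49 vs. 0), then the verb-object relation (79 vs. 2),
    #    matching the order of preferences described for example (1).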