
WORD SENSE DISAMBIGUATION: METHODS AND APPLICATIONS

Zeynep ALTAN

Maltepe University Faculty of Engineering Department of Computer Engineering

[email protected] http://www.maltepe.edu.tr/akademik/fakulteler/muhendislik/muh_eng_cv/zeynep_altan.asp http://geocities.com/altanzeynep/ Address: Marmara Egitim Koyu 34857 Maltepe - İSTANBUL Turkey : +90 216 626 10 50 Fax: +90 216 626 10 70

Abstract: This tutorial aims to provide an overview of the importance of word sense disambiguation in recent computational applications. The tutorial summarizes the primary disambiguation studies, which were mostly carried out by hand, so that today's large projects can be evaluated more realistically. The significance of the disambiguation task arises from its broad range of application areas, including but not limited to machine translation, information retrieval, hypertext navigation, parsing, speech synthesis, spelling correction, reference resolution, automatic text summarization, and search engines. Almost all vital words in the abovementioned application areas have numerous senses, and any disambiguation program attempts to select the most appropriate of these senses according to the context. The disambiguation process can be accomplished either by statistical learning methods that utilize an annotated corpus or by knowledge-based methods matching the context with a given knowledge source. Any electronic information resource used in the second disambiguation process aims to combine the characteristics of both the dictionary and the structured semantic network.

Tutorial Duration: 3 hours

Audience: People who are interested in, but not limited to, various natural language processing applications and machine translation, as well as researchers working on Internet search engines.

Tutorial Contents∗

The Location of Word Sense Disambiguation in Natural Language Processing Research

The disambiguation task is to determine which of the senses of an ambiguous word is invoked for a particular use of that word in any text or discourse [Manning and Schütze, 1999]. We can then distinguish that meaning from the other senses which could potentially apply to that word. Determining the ambiguity of words is one of the fundamental problems of natural language applications and their related tasks. Since machine translation, information retrieval and hypertext navigation, parsing, speech synthesis, spelling correction, reference resolution, automatic text summarization, and search engines are the primary application areas of word sense disambiguation (WSD), it is very significant across natural language processing (NLP) as well. Therefore we can claim that WSD is an essential research area attracting considerable interest in the computational linguistics literature. However, the research has not yet matured for many languages, and there is no firm consensus on the definition of WSD. Whenever we take WSD as a general problem, we may state the following:

• Many words have more than one sense. Generally, a sense ambiguity exists for a given word when the context of that word is not considered.
• The main goal of WSD research is to resolve the ambiguity among the different senses of a given word listed in a dictionary, thesaurus, or another source.
• A person eliminates the other possible senses of a word and considers only one sense whenever he/she understands a sentence containing an ambiguous word. Understanding an ambiguous sentence means that "the human language understanding process selects the appropriate sense among a possible set of senses" [Kilgarriff, 1999].

If WSD can be defined as a part of the human language understanding process and can be modeled in a computer program, then a fundamental function that is essential for many NLP applications can be implemented. This idea was put into words by Cottrell [Cottrell, 1989]:

“Sense ambiguity is one of the main problems faced by many natural language understanding (NLU) systems. Understanding the correct sense is a must for NLU. Resolving the disambiguation mechanism of humans is very important since, whatever their method is, it functions perfectly”.

∗ PPT slides will be distributed during the presentation.

In short, WSD can be summarized as follows: it determines the sense of an ambiguous word in a certain domain by considering the context in which it is used. All words have a finite number of senses provided in a source, and a WSD program attempts to select one of these senses, the most appropriate one, according to the context.
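Stated as a formula (a generic formulation added here for clarity, not drawn from any one of the systems cited in this tutorial), a WSD program computes

\hat{s} = \arg\max_{s \in \mathrm{Senses}(w)} \mathrm{Score}(s, c)

where Senses(w) is the finite sense inventory that the source provides for the ambiguous word w, c is the context in which w occurs, and Score(s, c) measures how well sense s fits that context.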

Word Sense Disambiguation History

WSD has been utilized as a powerful intermediate task for many years. The initial research on automatic WSD started in machine translation (MT) studies. Weaver stated the necessity of WSD for MT in his Memorandum [Weaver, 1949], and formulated the first WSD approach, which would become the building block of later work, as follows:

“If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine one at a time the meaning of the words. [...] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word. [...] The practical question is: what minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?”

Kaplan attempted to answer this question with a study in which the ambiguous word and one or two words on either side of it were given to seven different translators [Kaplan, 1950]. He reported that the results of this approach were not significantly different from those obtained by giving the whole sentence. Similar results were then reported for other languages such as Russian and French.
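Weaver's N-word window is easy to make concrete. The following minimal Python sketch (the function name and the example sentence are illustrative, not taken from Kaplan's study) extracts the ±N context that such an experiment would present to a translator:

import sys

def context_window(tokens, index, n):
    # Return up to n tokens on either side of tokens[index],
    # mirroring Weaver's "slit in the opaque mask".
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    return left, right

tokens = "the bank raised the interest rate again".split()
print(context_window(tokens, tokens.index("interest"), 2))
# -> (['raised', 'the'], ['rate', 'again'])

Weaver's "practical question" then becomes: how large must n be before the returned window reliably determines the sense of the central word?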

At the beginning, the goal of MT was modest, concentrating mainly on specific kinds of texts, especially the translation of technical texts. Several decades later, Weaver's consideration in the Memorandum about the role of domain in WSD was repeated [Gale et al., 1992]:

“In mathematics, to take what is probably the easiest example, one can very nearly say that each word, within the general context of a mathematical article, has one and only one meaning”.

Semantic and syntactic relations between a word and its context had also been recognized. For example, the correct sense of the ambiguous word "can" in the sentence "I can can the can" can easily be determined by using syntactic features. The first studies in MT were limited, and their main goal was the translation of specific texts from particular technical domains. These observations led to the development of specialized dictionaries and micro-glossaries in the early days of MT. Such sources included only the sense of an ambiguous word relevant to the domain at hand. For example, a micro-glossary for economics contains only the appropriate definition of "interest" in economics, but not the other, irrelevant senses. Many researchers had also tried to map the words of any language onto a common, language-independent semantic/conceptual representation (an interlingua) by using logical and mathematical principles in order to solve the WSD problem.

Another important step in WSD research was the use of knowledge-based systems [Masterman, 1961]. The system offered by Masterman was very similar to modern knowledge-based WSD systems. Weaver emphasized the importance of statistical methods in his Memorandum; accordingly, many researchers studied the percentage of words that have more than one sense, the average number of senses of polysemous words, the ratio of the different senses in both the target and source languages for translation, etc. [Harper, 1957]. The frequencies of senses were calculated using the Bayes formula, taking into account the observation that the frequency of senses depends on the domain of the texts.
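In modern notation (the standard Bayes rule, added here for clarity; Harper's exact formulation is not reproduced in this tutorial), the sense frequencies enter the disambiguation decision as

P(s_k \mid c) = \frac{P(c \mid s_k)\, P(s_k)}{P(c)}

where the prior P(s_k) is the (domain-dependent) frequency of sense s_k and c is the observed context. Since P(c) is constant across senses, the most probable sense is the one maximizing P(c | s_k) P(s_k).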

What is essential about the early work on WSD is that it marked out the fundamental difficulties of, and approaches to, the problem, and it indicates which of them were anticipated and developed at that time. However, without large-scale resources, most of these ideas remained untested and were forgotten until several decades later [Ide and Veronis, 1998].

Although the fundamental problems of all these approaches were recognized and solutions were proposed a few decades ago, those solutions could not be tested thoroughly due to insufficient resources. Nowadays, electronic resources have become highly available and the technological developments in computers are remarkable; together, these possibilities have produced significant improvement in WSD research. The knowledge types used in modern systems can be summarized as part-of-speech (POS), morphology, collocations, semantic word associations, syntactic cues, selectional preferences, domain, frequency of senses, and pragmatics. In addition to knowledge types, today's systems make use of information resources such as machine-readable dictionaries, ontologies, and corpora.

Word Sense Disambiguation Application Areas

Sense disambiguation is an intermediate task, not an application on its own; but it is necessary and useful for many applications in one way or another. It is obviously essential for language understanding applications such as message understanding, man-machine communication, etc. WSD is at least helpful, and in some instances required, for applications whose aim is not language understanding. Thus, it is necessary to examine where and how WSD is used in order to understand the need for it in NLP applications.

WSD in Machine Translation

Machine translation researchers claim that effective WSD mechanisms can provide great progress in this field. MT applications must deal with two types of word ambiguity: ambiguity in the source language and ambiguity in the target language, so selecting the most appropriate translation of a word can only be achieved by WSD [Hutchins and Sommers, 1992]. For example, the word "yüz" in Turkish can be translated into English as "face, swim, skin, hundred, etc."; the correct translation can only be made by WSD.

WSD in Information Retrieval and Hypertext Navigation

It is necessary to eliminate documents containing inappropriate senses of the keyword that is being searched. Choosing the correct sense of a word in a search query and associating this query with the documents that contain the suitable senses will considerably increase the quality of the search results. If a Turkish query about computers contains the word "fare" ("mouse" in English), then eliminating the documents that contain the "animal" sense is desirable.

WSD in Grammatical Analysis

Selecting the correct sense of an ambiguous word also plays an important role in part-of-speech tagging, morphological analysis, etc. To take an example from Turkish, "çıkarlar" can be analyzed as:

çıkar(Noun) + lar(Plural) or
çık(Verb) + ar(Present) + lar(Plural)

The word "çıkar" as a noun means "self-interest", and the verb "çık" means "to exit" in English. Thus, if we know that the word is a noun, we select the first morphological analysis as the correct one (a small sketch of this selection follows at the end of this section).

WSD in Speech Processing

Sense disambiguation is required for the correct phonetization of words in speech synthesis, and for word segmentation and homophone discrimination in speech analysis. For example, the Turkish words "hala" and "hâlâ" are two distinct instances, with the senses "the father's sister" and "still" respectively. The ^ sign affects the pronunciation and changes the meaning of a word.

WSD in Text Processing

WSD is also helpful in spelling correction, case changes, and lexical access in Semitic languages such as Arabic and Hebrew, in which vowels are usually not written [Yarowsky, 1994a].
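The POS-guided selection among morphological analyses mentioned above can be sketched as follows. The two analyses of "çıkarlar" are taken from the text; the data structure and the pick_analysis helper are hypothetical illustrations, not the output of a real Turkish analyzer:

# A minimal sketch of POS-guided selection among morphological analyses.
analyses = [
    {"pos": "Noun", "parse": "çıkar(Noun)+lar(Plural)"},           # "self-interests"
    {"pos": "Verb", "parse": "çık(Verb)+ar(Present)+lar(Plural)"}, # "they exit"
]

def pick_analysis(analyses, pos_tag):
    # Keep only the analyses whose root POS matches the tag
    # assigned by an external POS tagger.
    return [a for a in analyses if a["pos"] == pos_tag]

print(pick_analysis(analyses, "Noun"))  # selects the 'self-interest' reading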

Different Word Sense Disambiguation Approaches

Despite the breadth of the approaches that have been examined, there are not enough systems that give highly accurate WSD results. The basic explanation of this failure is the richness of natural languages and the mismatch of words, or bags of words, between different languages. Four distinct observations about WSD explained by Resnik and Yarowsky in 1997 are still current [Resnik and Yarowsky, 1997]. These are as follows:

• The evaluation of WSD systems is not yet standardized,
• The potential for WSD varies by task,
• Adequately large sense-tagged data sets are difficult to obtain,
• The field has narrowed down the approaches only a little.

Since these assertions concern English, the most studied language, other less studied languages such as Turkish have had no chance of obtaining a complete system. The first problem is the nonexistence of an extensive corpus for such languages. The Brown Corpus and the Penn Treebank, two well-known English corpora, include complete part-of-speech tagging and parsing, and have been applied in many WSD projects. Meanwhile, the Sixth Message Understanding Conference (MUC 6) was successful in the separate evaluation of parsers on both training and testing data [MUC 6, 1995]. In another study related to MUC 6, the London-Lund Corpus was annotated with the type of anaphora (subject pronoun or full noun phrase), the type of antecedent (explicit or implicit), the current status of the antecedent (whether the antecedent is the discourse topic, segment topic, or subsegment topic), and the processing slot (anaphora cases can be classified according to the type of knowledge, such as syntactic, collocational, or discourse knowledge) [Rocha, 1997]. Thus the task of anaphora resolution became an evaluation standard under the MUC 6 conference. Research then continued by extending the coreference classes. For example, the coreference resolution system of Soon et al. became an essential reference point in the development of such systems [Soon et al., 2001]. Later, a noun phrase coreference system was realized to improve these machine learning approaches and to extend the work of Soon et al. [Ng and Cardie, 2002]; the best results were obtained on the MUC-6 and MUC-7 coreference resolution data sets. Many studies have since been developed toward robust, broad-coverage semantic parsing. One of them is the automatic labeling of semantic roles, which identifies the semantic roles filled by the constituents of a sentence within a semantic frame [Gildea and Jurafsky, 2002]. When an input sentence, a target word, and a frame are given, the system labels constituents either with abstract semantic roles or with more domain-specific roles.

Evaluation has a serious effect on information extraction (IE), so the MUC conferences, where most of the evaluation methodologies were developed, contributed to IE methodology with annotated corpora for training and testing documents. Before the second half of the 1990s, on the other hand, researchers used different sets of classes and different evaluation metrics. Therefore the MUC 6 and MUC 7 conferences fostered the expansion of IE and machine translation studies, as well as WSD.

According to the second observation of Resnik and Yarowsky, WSD cannot be made equally successful across applications. In IE, if a bag of words is matched to another bag of words within a document, is it possible to obtain perfect sense information? In speech recognition, the language models give more emphasis to the sense information than in IE, and equivalence classes rather than word classes facilitate the research. Another important enhancement in annotated data was the public availability of WordNet in 1994. WordNet has been the primary large-scale resource offering broad-coverage annotation of all words. There are essentially two methods for constructing an electronic lexical database. One is automatic acquisition; the other is to craft the dictionary by hand. In the first approach, the text of a book-format dictionary is scanned into the computer and its contents are manipulated so as to derive information. The second approach relies on slow, unwieldy, hand-built construction. WordNet was initially conceived as a test bed for a particular model of lexical organization that had never before been implemented on a large scale, and it was constructed manually. But semantic networks containing more than a few dozen words did not exist at the time, and the researchers also did not know what kinds of relations they would need to create. Since WordNet represents a continuing experiment without predecessors, it has the advantage of being able to make changes freely and to insert additions to its contents during its development [Fellbaum, 1999]. The lexicon is conceptualized through synsets, structured in hierarchies and linked by meaning relations: hyper/hyponymy, meronymy, role, etc. In each concept, or synset, the differences in meaning (polysemies) are separated, numbered, and defined through taxonomic and associative relations. WordNet is considered one of the most important standard lexicons for English; it can be referenced and downloaded freely from the Internet.
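The synset structure described above can be inspected directly. This small Python sketch uses NLTK's WordNet interface (the choice of nltk is an assumption about tooling, not something the tutorial names; the WordNet data must be downloaded once with nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

for synset in wn.synsets("interest"):
    # Each synset bundles synonymous lemmas with a numbered gloss ...
    print(synset.name(), "-", synset.definition())
    # ... and is linked to other synsets by relations such as hypernymy.
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])

Each line of output corresponds to one numbered sense of "interest", which is exactly the sense inventory a WSD system must choose from.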

Afterwards, it became feasible to apply supervised algorithms to new projects without difficulty. Supervised approaches start from a semantically annotated corpus and obtain better results than unsupervised methods on selected ambiguous words. Decision Lists [Yarowsky, 1994b], Neural Networks [Towell and Voorhees, 1998], Bayesian learning [Bruce and Wiebe, 1999], Exemplar-Based learning [Ng, 1997], and Boosting [Escudero et al., 2000] are a few of the standard machine learning algorithms applied to WSD.

The general problem addressed by Yarowsky's decision list analysis is the resolution of lexical, syntactic, and semantic ambiguity based on properties of the surrounding context. The algorithm consists of seven steps: identifying the ambiguities in accent patterns, collecting the training contexts, measuring the collocational distributions, sorting by log-likelihood into decision lists, optional pruning and interpolation, training the decision lists for general classes of ambiguity, and finally using the decision lists. One of the WSD problems studied by Yarowsky is accent restoration; word choice selection in machine translation, homograph and homophone disambiguation, and capitalization restoration are related examples. Capitalization restoration concerns distinct semantic concepts such as "AIDS/aids" (disease or helpful tools) and "Bush/bush" (president or shrub). Homophones are one of the specific problems of automatic speech recognition (ASR). Words having different orthography but the same pronunciation are homophones of one another. Words in the same homophone class may differ in their number and gender, the syntactic category they belong to, or their meaning. The quantity, size, and frequency of homophone classes are language-dependent parameters. It is necessary to combine local models with large language models like n-grams in order to capture the word dependencies. In homophone disambiguation, distinct semantic concepts such as "ceiling" and "sealing" are likewise represented by the same ambiguous form, but in the medium of speech and with similar disambiguating clues. Different parts of speech, same part of speech, proper names, Roman numerals, fractions/dates, and years/quantifiers form Yarowsky's classification of homographs [Yarowsky, 1996]. Furthermore, missing accents may create both semantic and syntactic ambiguities, including tense or mood distinctions which may only be resolved by distant temporal markers or non-syntactic cues. In Turkish there are many sentences with this property. For example, the sentence "Fenerbahçe Galatasaray'a 5 attı" (the football team Fenerbahçe scored 5 goals against Galatasaray) does not include the verb "score", so ambiguity arises. Such sentences are commonly used in newspapers and in daily discourse. Accent restoration offers several advantages for the explanation and evaluation of the proposed decision list algorithm. First, it exhibits both syntactic and semantic types of ambiguity. Second, the correct accent pattern is directly recoverable, whereas in traditional word sense disambiguation, hand-labeling training and test data is a difficult and subjective task. Third, the task of restoring missing accents and resolving ambiguous forms has significant commercial applications, such as a large potential market for its use in grammar and spelling correctors.
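The core of the decision list idea, ranking context features by log-likelihood and letting the strongest matching feature decide, can be sketched in a few lines of Python. This is an illustrative miniature, not Yarowsky's full seven-step accent-restoration pipeline; the feature names, sense labels, and counts are invented for the example:

import math

# counts[(feature, sense)] would come from a sense-tagged training corpus.
counts = {
    ("word:rate", "finance"): 50, ("word:rate", "curiosity"): 2,
    ("word:hobby", "finance"): 1, ("word:hobby", "curiosity"): 30,
}

def build_decision_list(counts, smoothing=0.1):
    # Rank each feature by the (smoothed) log-likelihood ratio between
    # the two competing senses, strongest evidence first.
    rules = []
    for feature in {f for f, _ in counts}:
        p1 = counts.get((feature, "finance"), 0) + smoothing
        p2 = counts.get((feature, "curiosity"), 0) + smoothing
        sense = "finance" if p1 > p2 else "curiosity"
        rules.append((abs(math.log(p1 / p2)), feature, sense))
    return sorted(rules, reverse=True)

def classify(decision_list, context_features):
    # The first (highest-ranked) rule whose feature occurs in the
    # context decides the sense; later rules are ignored.
    for _, feature, sense in decision_list:
        if feature in context_features:
            return sense
    return None

dl = build_decision_list(counts)
print(classify(dl, {"word:rate"}))  # -> 'finance'

The single-best-evidence decision contrasts with the Naive Bayes classifier discussed next, which combines the evidence of all features.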

Naive Bayes is intended as a simple representative of statistical learning methods. A Naive Bayes classifier assumes that all the feature variables representing a problem are conditionally independent given the value of a classification variable. In word sense disambiguation, the context in which an ambiguous word occurs is represented by the feature variables, and the sense of the ambiguous word is represented by the classification variable. Model probabilities are estimated during the training process using relative frequencies. Despite its simplicity, Naive Bayes is claimed to obtain state-of-the-art accuracy on supervised WSD. Briefly, Bayesian classification and exemplar-based algorithms are two different statistical approaches in the disambiguation field. The Bayes classifier utilizes a large context window surrounding the ambiguous word. Every component of the context has an effect on the selection of the appropriate sense: the classifier does not select among features, but uses the combination of all of them. It assumes that a corpus exists in which the correct sense of each target word has been assigned. The Bayes classifier applies the Bayes decision rule and tries to minimize the error. In the exemplar-based approach, a set of features is extracted and recorded for each sentence in the training corpus. Then, similar features are obtained for the sentences in the test corpus and compared with the features of the training sentences. The sense of the ambiguous word in the training sentence with the closest features is then selected as the sense of the ambiguous word in the test corpus. As a result, the heart of the algorithm is to select the appropriate feature sets and to calculate the distance between these features. In exemplar-based (instance-based or memory-based) learning, no generalization over training examples is performed. Instead, the examples are stored in memory, and the classification of new examples is based on the classes of the most similar stored examples. Exemplar-based learning is said to be the best option for WSD.
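A minimal Python sketch of the Naive Bayes classifier under exactly the assumptions stated above: the features are the words of the context window, assumed conditionally independent given the sense, with probabilities estimated as (smoothed) relative frequencies. The tiny training set is invented for illustration:

import math
from collections import Counter, defaultdict

train = [
    ("finance",   "the bank raised the interest rate".split()),
    ("finance",   "interest on the loan grew".split()),
    ("curiosity", "her interest in music grew".split()),
]

sense_counts = Counter()
feature_counts = defaultdict(Counter)
vocab = set()
for sense, context in train:
    sense_counts[sense] += 1
    for w in context:
        feature_counts[sense][w] += 1
        vocab.add(w)

def disambiguate(context):
    # Bayes decision rule: argmax_s log P(s) + sum_w log P(w|s),
    # with add-one smoothing over the vocabulary.
    best, best_score = None, float("-inf")
    total = sum(sense_counts.values())
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for w in context:
            score += math.log((feature_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

print(disambiguate("loan interest rate".split()))  # -> 'finance'

Note that every context word contributes to the score, in contrast to the decision list, which lets a single feature decide.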

A Brief Explanation of SENSEVAL

There are now many computer programs to evaluate word sense disambiguation systems automatically. The first Senseval took place in the summer of 1998 for English, French, and Italian, to bring a standard to disambiguation projects [Kilgarriff, 1998]. In the English Senseval, 35 different words (nouns, verbs, adjectives, and indeterminates) were used as the lexical sample. The distribution of each word according to part of speech and the number of test instances were given. SemCor, Hector, and the DSO corpus are manually sense-tagged datasets in English. Each participating system was scored on each task at three granularities: coarse-grained, mixed-grained, and fine-grained. Coarse-grained scoring maps all subsense tags (corresponding to codes such as 1.1, 2.1) to main sense tags (corresponding to codes such as 1, 2). Mixed-grained scoring gives full credit for a guess in the answer file if it is subsumed by an answer in the key file; a tag subsumes another tag if it is a main sense tag (corresponding to a code such as 2) and the other tag is a subsense tag under it (corresponding to a code such as 2.1). Fine-grained scoring counts only identical sense tags as a match; that is, even if the guess in the answer file subsumes, or is subsumed by, the correct answer in the key, no credit at all is given. For example, under fine-grained scoring, a guess of 2.1 receives no credit if the answer in the key file is 2. Throughout the Senseval studies, the systems were classified into three categories: A, all words (disambiguate all content words in a text); S, supervised training (require a substantial quantity, over 30 sense-tagged instances, of each word); and O, other training (do not require over 30 tagged training instances, but require a learning phase to be applied for each word to be disambiguated).
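The three scoring regimes reduce to a few comparisons on the tag codes. The following Python sketch (assuming sense tags written like "2" for a main sense and "2.1" for a subsense, as in the examples above) mirrors the rules just described:

def main_sense(tag):
    return tag.split(".")[0]

def score(guess, key, granularity):
    if granularity == "fine":
        return guess == key                           # identical tags only
    if granularity == "coarse":
        return main_sense(guess) == main_sense(key)   # subsenses mapped to main senses
    if granularity == "mixed":
        # full credit for an exact match, or when the guess (e.g. "2.1")
        # is subsumed by the main-sense answer in the key (e.g. "2")
        return guess == key or main_sense(guess) == key
    raise ValueError(granularity)

print(score("2.1", "2", "fine"))    # False: no credit under fine-grained scoring
print(score("2.1", "2", "coarse"))  # True
print(score("2.1", "2", "mixed"))   # True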

A Recent Study on Word Sense Disambiguation

Among the various methodological approaches to WSD, two complementary methods are generally proposed, knowledge-based and corpus-based, in addition to the hybrid approaches obtained by combining them [Montoyo et al., 2005]. The knowledge-based method disambiguates words by matching the context against a given knowledge source. An electronic information source like WordNet can be used to combine the characteristics of both a dictionary and a structured semantic network. It provides the definitions of different word senses and represents distinct lexical concepts as synsets, which are defined as groups of synonymous words. WordNet also organizes words into a conceptual structure by representing a number of semantic relationships such as hyponymy, hypernymy, meronymy, etc. As the corpus-based approach, any supervised machine-learning algorithm trained on annotated sentences may be employed. Corpus-based systems usually represent the linguistic information in the form of feature vectors; it is not necessary to annotate these feature sets for all contexts of each sentence. Word collocations, part-of-speech labels, keywords, topic and domain information, and grammatical relationships are some of these features. By combining the two approaches, it can be shown that the knowledge-based method helps the corpus-based method achieve better performance, and vice versa.
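One concrete instance of the knowledge-based matching described above is a simplified Lesk-style gloss overlap: the sense whose WordNet definition shares the most words with the context wins. The choice of Lesk here is an illustrative assumption (the tutorial text does not prescribe a particular matching scheme), and NLTK is again assumed as the WordNet interface:

from nltk.corpus import wordnet as wn

def knowledge_based_wsd(word, context_words):
    # Score each candidate synset by the overlap between its gloss
    # and the bag of context words; return the best-scoring synset.
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

# Typically selects the financial-institution sense of "bank".
print(knowledge_based_wsd("bank", "deposit money account loan".split()))

A hybrid system in the spirit of Montoyo et al. could use such gloss overlaps as additional features inside a corpus-based classifier, which is one way the two methods can help each other.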

References

Bruce, R.F. and Wiebe, J.M., 1999. "Decomposable Modeling in Natural Language Processing", Computational Linguistics, vol. 25(2), pp. 195-207.
Cottrell, G.W., 1989. A Connectionist Approach to Word Sense Disambiguation, Research Notes in Artificial Intelligence, London: Pitman.
Escudero, G., Marquez, L., and Rigau, G., 2000. "Boosting Applied to Word Sense Disambiguation", in Proceedings of the 12th European Conference on Machine Learning, Barcelona, Spain.
Fellbaum, C., 1999. WordNet: An Electronic Lexical Database, The MIT Press.
Gale, W.A., Church, K.W., and Yarowsky, D., 1992. "One Sense per Discourse", in Proceedings of the Speech and Natural Language Workshop, San Francisco, pp. 233-237, Morgan Kaufmann.
Gildea, D. and Jurafsky, D., 2002. "Automatic Labeling of Semantic Roles", Computational Linguistics, vol. 28(3), pp. 245-288.
Harper, K.E., 1957. "Semantic Ambiguity", Mechanical Translation, 4(3), pp. 68-69.
Hutchins, J. and Sommers, H., 1992. Introduction to Machine Translation, Academic Press.
Ide, N. and Veronis, J., 1998. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art", Computational Linguistics, 24(1), pp. 1-40.
Kaplan, A., 1950. "An Experimental Study of Ambiguity and Context", mimeographed, November, 18 pp. Reprinted in Mechanical Translation, 1955, vol. 2(2), pp. 39-46.
Kilgarriff, A., 1998. "SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs", in Proceedings of LREC, Granada, pp. 581-588.
Kilgarriff, A., 1999. "I Don't Believe in Word Senses", Computers and the Humanities.
Manning, C.D. and Schütze, H., 1999. Foundations of Statistical Natural Language Processing, Chapter 7, The MIT Press.
Masterman, M., 1961. "Semantic Message Detection for Machine Translation, Using an Interlingua", in 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, Her Majesty's Stationery Office, London, 1962, pp. 437-475.
Montoyo, A., Suarez, A., Rigau, G., and Palomar, M., 2005. "Combining Knowledge- and Corpus-Based Word-Sense-Disambiguation Methods", Journal of Artificial Intelligence Research, vol. 23, pp. 299-330.
Ng, H.T., 1997. "Exemplar-Based Word Sense Disambiguation: Some Recent Improvements", in Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing.
Ng, V. and Cardie, C., 2002. "Improving Machine Learning Approaches to Coreference Resolution", in Proceedings of ACL-02, pp. 104-111.
Proceedings of the Sixth Message Understanding Conference (MUC 6), 1995, November, San Mateo: Morgan Kaufmann.
Resnik, P. and Yarowsky, D., 1997. "A Perspective on Word Sense Disambiguation Methods and Their Evaluation", in Proceedings of SIGLEX '97, Washington, DC, pp. 79-86.
Rocha, M., 1997. "A Corpus-Based Study of Anaphora in English and Portuguese", in Corpus-Based and Computational Approaches to Discourse Anaphora, Botley, S.P. and McEnery, T. (eds.), UCL Press.
Soon, W.M., Ng, H.T., and Lim, D.C., 2001. "A Machine Learning Approach to Coreference Resolution of Noun Phrases", Computational Linguistics, vol. 27(4), pp. 521-544.
Towell, G. and Voorhees, E.M., 1998. "Disambiguating Highly Ambiguous Words", Computational Linguistics, vol. 24(1), pp. 125-146.
Weaver, W., 1949. "Translation", mimeographed, July, 12 pp. Reprinted in Locke, W.N. and Booth, A. (eds.), 1955, Machine Translation of Languages, John Wiley & Sons, New York.
Yarowsky, D., 1994a. "A Comparison of Corpus-Based Techniques for Restoring Accents in Spanish and French Text", in Proceedings of the 2nd Annual Workshop on Very Large Text Corpora, Las Cruces, pp. 19-32.
Yarowsky, D., 1994b. "Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French", in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, pp. 88-95.
Yarowsky, D., 1996. "Homograph Disambiguation in Text-to-Speech Synthesis", in Progress in Speech Synthesis, Springer, New York, pp. 157-172.

Zeynep Altan attended Istanbul Technical University both as an undergraduate and as a graduate student, receiving a B.A. in Mathematical Engineering and an M.S. from the System Analysis Section. She completed her Ph.D. at Istanbul University in the Numerical Methods Section. Until 1993 she was a research assistant at the ITU Mathematical Engineering Department. Until her retirement in 2002, she was an assistant professor in the Computer Engineering Department at Istanbul University. She is interested in computational linguistics, especially the processing of the Turkish language. In autumn 2004 she joined Maltepe University as a part-time lecturer, and she became a full-time lecturer in the Engineering Faculty in March 2004.