<<

MULTILINGUAL WORD SENSE DISAMBIGUATION USING WIKIPEDIA

Bharath Dandala

Dissertation Prepared for the Degree of

DOCTOR OF PHILOSOPHY

UNIVERSITY OF NORTH TEXAS

August 2013

APPROVED:

Rada Mihalcea, Major Professor
Paul Tarau, Committee Member
Rodney Nielsen, Committee Member
Razvan Bunescu, Committee Member
Barrett Bryant, Chair of the Department of Computer Science and Engineering
Costas Tsatsoulis, Dean of the College of Engineering
Mark Wardell, Dean of the Toulouse Graduate School

Dandala, Bharath. Multilingual Word Sense Disambiguation Using Wikipedia. Doctor of Philosophy (Computer Science), August 2013, 86 pp., 17 tables, 10 illustrations, 189 numbered references.

Ambiguity is inherent to human language. In particular, word sense ambiguity is prevalent in all natural languages, with a large number of the words in any given language carrying more than one meaning. Word sense disambiguation is the task of automatically assigning the most appropriate meaning to a polysemous word within a given context. Generally, the problem of resolving ambiguity in the literature has revolved around the famous quote "you shall know the meaning of the word by the company it keeps." In this thesis, we investigate the role of context for resolving ambiguity through three different approaches.

Instead of using a predefined monolingual sense inventory such as WordNet, we use a language-independent framework where the word senses and sense-tagged data are derived automatically from Wikipedia. Using Wikipedia as a source of sense annotations provides a much-needed solution to the knowledge acquisition bottleneck. In order to evaluate the viability of Wikipedia-based sense annotations, we cast the task of disambiguating polysemous nouns as a monolingual classification task and experimented on lexical samples from four different languages (viz. English, German, Italian and Spanish). The experiments confirm that the Wikipedia-based sense annotations are reliable and can be used to construct accurate monolingual sense classifiers.

It is a long-held belief that exploiting multiple languages helps in building accurate word sense disambiguation systems. Subsequently, we developed two approaches that recast the task of disambiguating polysemous nouns as a multilingual classification task. The first approach to multilingual word sense disambiguation attempts to effectively use a machine translation system to leverage two relevant multilingual aspects of the semantics of text. First, the various senses of a target word may be translated into different words, which constitute unique, yet highly salient, signals that effectively expand the target word's feature space. Second, the translated context words themselves embed co-occurrence information that a translation engine gathers from very large parallel corpora. The second approach to multilingual word sense disambiguation attempts to reduce the reliance on the machine translation system during training by using the multilingual knowledge available in Wikipedia through its interlingual links. Finally, the experiments on a lexical sample from four different languages confirm that the multilingual systems perform better than the monolingual system and significantly improve the disambiguation accuracy.

Copyright 2013

by

Bharath Dandala

ACKNOWLEDGEMENTS

I wish to express my sincere appreciation to a number of people without whom this thesis might not have been written, and to whom I am greatly indebted.

First of all, I am extremely grateful to my advisor, Dr. Rada Mihalcea, for her visionary support and unwavering guidance throughout the course of this work. I could not have imagined a better advisor and mentor for my Ph.D. study. Aside from being a superb researcher, she has been a magnificent mentor and guided me in the paths of righteousness throughout the years.

Thank You Rada Mihalcea.

My sincere gratitude is reserved for the members of my PhD committee, Dr. Paul Tarau,

Dr. Rodney Nielsen and Dr. Razvan Bunescu. Their feedback was very valuable and helped me to improve my dissertation.

I am also thankful to my colleagues and friends for their various forms of support during my graduate study: Veronica, Samer, Carmen, Erwin, Ravi, Vanessa, Chris and Madalina. Many friends have helped me throughout these years. Their support and care helped me overcome setbacks and stay focused. I greatly value their friendship and I deeply appreciate their belief in me.

Most importantly, none of this would have been possible without the love, understanding, and patience of my parents. I would like to express my heartfelt gratitude to my parents, Ramesh

Dandala and Varalakshmi Dandala and dedicate this thesis to them.

TABLE OF CONTENTS

Page

ACKNOWLEDGMENTS ...... iii

LIST OF TABLES ...... viii

LIST OF FIGURES ...... x

CHAPTER 1. INTRODUCTION ...... 1

1.1. Problem ...... 1

1.2. Motivation ...... 2

1.3. Proposed Solution ...... 4

1.4. Contributions...... 5

1.5. Thesis Outline ...... 6

CHAPTER 2. BACKGROUND ...... 8

2.1. Ambiguity ...... 8

2.2. Word Sense Disambiguation versus Word Sense Discrimination ...... 8

2.2.1. Coarse-Grained versus Fine-Grained WSD ...... 9

2.3. Wikipedia ...... 9

2.4. Machine Learning ...... 13

2.4.1. Multinomial Naive-Bayes Classification ...... 14

2.5. Parts of Speech Tagging ...... 15

2.6. Stemming ...... 15

2.7. Tools ...... 16

2.7.1. Wikipedia Miner ...... 16

2.7.2. TreeTagger ...... 16

2.7.3. Snowball Stemmer ...... 17

2.7.4. Weka ...... 17

2.7.5. Google Translate ...... 17

2.8. Applications of Word Sense Disambiguation ...... 17

CHAPTER 3. RELATED WORK ...... 19

3.1. Early Research in WSD ...... 19

3.2. AI-based Systems...... 20

3.3. Prominent Problem in NLP ...... 21

3.4. Categories of Approaches to WSD ...... 22

3.4.1. Knowledge Based Methods ...... 22

3.4.1.1. Lesk Algorithm ...... 22

3.4.1.2. Semantic Similarity Measures ...... 23

3.4.1.3. Selectional Preferences ...... 24

3.4.1.4. Graph Based Methods ...... 24

3.4.1.5. Heuristic Based Methods ...... 25

3.4.2. Supervised Methods ...... 25

3.4.2.1. Lack of Sense Tagged Corpora ...... 26

3.4.2.2. Lack of Right Sized Training Sets ...... 27

3.4.2.3. Lack of Portability across Domains ...... 27

3.4.2.4. Decision Based WSD ...... 28

3.4.2.5. Neural Network WSD ...... 29

3.4.2.6. Instance Based WSD...... 29

3.4.2.7. Probabilistic WSD ...... 30

3.4.2.8. Support Vector Machines ...... 30

3.4.2.9. Ensemble Classifiers ...... 31

3.4.3. Unsupervised Methods...... 32

3.5. Related Work in Cross-Lingual/Multilingual WSD ...... 32

CHAPTER 4. WORD SENSE DISAMBIGUATION FRAMEWORKS...... 35

4.1. The WikiMonoSense System ...... 35

4.1.1. A Monolingual Dataset through Wikipedia Links ...... 35

4.1.2. An Example ...... 38

4.1.3. Wikipedia and WordNet ...... 39

4.2. The WikiMonoSense Learning Framework ...... 39

4.3. The WikiTransSense System ...... 40

4.3.1. A Multilingual Dataset through Machine Translation ...... 41

4.3.2. The WikiTransSense Learning Framework ...... 42

4.4. The WikiMuSense System ...... 42

4.4.1. A Multilingual Dataset through Interlingual Wikipedia Links ...... 42

4.4.1.1. Sense Gaps Problem ...... 44

4.4.1.2. Multilingual Bias Problem ...... 44

4.4.2. The WikiMuSense Learning Framework...... 46

CHAPTER 5. EXPERIMENTAL RESULTS ...... 48

5.1. Experiments and Results ...... 48

5.2. Discussion on WikiMonoSense System ...... 54

5.2.1. Learning Curve for WikiMonoSense System ...... 54

5.3. Discussion on WikiTransSense and WikiMuSense Systems...... 55

5.3.1. Learning Curves for Multilingual WSD Systems ...... 56

5.3.2. WikiMuSense Without Translations During Training ...... 58

5.4. Word Sense Disambiguation in a Coarse-Grained Setting ...... 59

CHAPTER 6. CONCLUSIONS AND FUTURE WORK ...... 65

6.1. Future Work ...... 67

BIBLIOGRAPHY ...... 69

LIST OF TABLES

Page

2.1 Number of articles, redirects, and users for the top ten Wikipedia editions. The total number of articles also includes the disambiguation pages...... 12

4.1 Word senses for the word “bar”, based on annotation labels used in Wikipedia ...... 38

4.2 Minimum ratio distribution (MRD) and the required number of translations for building the German corpus given English as the reference language for the word "argument." The table also indicates the number of examples in the reference language (NR), the number of examples in the supporting language (NS), reference language percentages (RP), proportional distributions (PD), threshold (T) and ratio (R). The minimum ratio is shown in bold in the ratio column. NIL indicates that no interlingual link is available from the English Wikipedia for that particular sense ...... 46

5.1 WSD macro accuracies ...... 49

5.2 WSD micro accuracies ...... 49

5.3 Word sense disambiguation results on English, including one baseline (MFS = most frequent sense) and the three proposed word sense disambiguation systems based on Wikipedia 2012 data. Number of senses (#s) and number of examples (#ex) are also indicated ...... 50

5.4 Word sense disambiguation results on Spanish, using Wikipedia 2012 data. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated ...... 51

5.5 Word sense disambiguation results on German, using Wikipedia 2012 data. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated ...... 52

5.6 Word sense disambiguation results on Italian, using Wikipedia 2012 data. In addition to a baseline (MFS = most frequent sense) and the word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated...... 53

5.7 Total number of translations per language, in the two multilingual approaches. .... 55

5.8 WSD performance with no sense gaps ...... 58

5.9 Coarse grained macro accuracies...... 60

5.10 Coarse grained micro accuracies...... 60

5.11 Word sense disambiguation results on English, including one baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results based on Wikipedia 2012 data in a coarse grained setting. Number of senses (#s) and number of examples (#ex) are also indicated ...... 61

5.12 Word sense disambiguation results on Spanish, using Wikipedia 2012 data in a coarse grained setting. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated ...... 62

5.13 Word sense disambiguation results on German, using Wikipedia 2012 data in a coarse- grained setting. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated...... 63

5.14 Word sense disambiguation results on Italian, using Wikipedia 2012 data in a coarse- grained setting. In addition to a baseline (MFS = most frequent sense) and the word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated...... 64

LIST OF FIGURES

Page

2.1 Snapshot of a fragment from the Wikipedia article on Alan Turing ...... 10

2.2 Example redirect page, from Turing to Alan Turing ...... 11

2.3 Disambiguation page for the noun Sense ...... 12

4.1 Architecture of WikiTransSense Framework ...... 40

4.2 English to German translations from Google Translate, with the target words aligned ...... 41

4.3 Architecture of WikiMuSense system ...... 43

5.1 Learning curves for WikiMonoSense ...... 54

5.2 Impact of the number of supporting languages on the two multilingual WSD systems...... 56

5.3 Learning curves for WikiMuSense with respect to the supporting language examples ...... 57

5.4 Learning curves for WikiMuSense with respect to the reference language examples ...... 57

CHAPTER 1

INTRODUCTION

1.1. Problem Definition

Ambiguity is inherent to human language. In particular, word sense ambiguity is prevalent in all natural languages, with a large number of the words in any given language carrying more than one meaning. For instance, the English noun "plant" can mean "green plant" or "factory"; similarly the French word "feuille" can mean "leaf" or "paper". The correct sense of an ambiguous word can be selected based on the context where it occurs, and correspondingly the problem of word sense disambiguation is defined as the task of automatically assigning the most appropriate meaning to a polysemous word within a given context. The word sense disambiguation (WSD) task is known as a classic AI-complete problem in the literature, that is, "its solution requires a solution to all the general artificial intelligence (AI) problems of representing and reasoning about arbitrary real-world knowledge" or "a problem which can be solved only after resolving all the general artificial intelligence (AI) problems such as the representation of common sense and encyclopedic knowledge" [72, 66].

Two well-studied categories of approaches to WSD are represented by knowledge-based [88, 49, 122] and data-driven [186, 126, 136] methods. Knowledge-based methods rely on information drawn from wide-coverage lexical resources such as WordNet [110]. Their performance has been generally constrained by the limited amount of lexical and semantic information present in these resources. In a recent effort to alleviate the semantic information bottleneck, Ponzetto and Navigli [142] created WordNet++, an extended version of WordNet that was augmented with unlabeled relations extracted from Wikipedia. The resulting knowledge-based system was shown to be competitive with state-of-the-art supervised approaches in a coarse-grained all-words setting and on domain-specific datasets. Knowledge-based methods have also been observed to be robust when tested on data with different sense distributions [9], a setting where supervised methods would normally need to use domain adaptation [9].

Among the various data-driven word sense disambiguation methods proposed to date, supervised systems have been observed to lead to the highest performance. In these systems, the sense disambiguation problem is formulated as a supervised learning task, where each sense-tagged occurrence of a particular word is transformed into a feature vector which is then used in an automatic learning process. Despite their high performance, these supervised systems have an important drawback: their applicability is limited to those few words for which sense tagged data is available, and their accuracy is strongly connected to the amount of labeled data available at hand. To address the sense-tagged data bottleneck problem, different methods have been proposed in the past, with various degrees of success. This includes the automatic generation of sense-tagged data using monosemous relatives [83, 107, 7], automatically bootstrapped disambiguation patterns [186, 101], parallel texts as a way to point out word senses bearing different translations in a second language [38, 127, 37], and the use of volunteer contributions over the Web [27]. The granularity of word sense repositories has also been recognized as an important factor in the development of annotated datasets for word sense disambiguation (WSD) [164], with significant impacts upon both the performance of automatic WSD systems and their utility for downstream applications. Previous work on manual sense annotations with respect to WordNet has shown low levels of agreement between human annotators, ranging between 65% [27] and 72% [165], which is a clear indicator of very fine-grained distinctions between word senses that are difficult to differentiate, even for humans. A large number of techniques have been proposed for clustering the collection of fine-grained senses available in WordNet [139, 105, 97, 65, 119, 164]. The OntoNotes project, a large-scale effort to cluster and supplement word senses in WordNet in order to produce a high-quality dataset for automatic WSD [65], has been beneficial for other language processing tasks such as discourse analysis, coreference resolution, or semantic parsing. Coarser sense inventories make it easier to identify the senses or translations of selected words in context, which can lead to improvements in information retrieval [188], semantic indexing [51], and machine translation [25].

1.2. Motivation

The driving motivation of this research is the belief in the natural language processing (NLP) community that having an accurate and comprehensive word sense disambiguation (WSD) system would help in accurate machine translation, question answering and information retrieval, speech processing, text-to-speech translation, information extraction and text mining.

It is also a long-held belief that exploiting multiple languages helps in building accurate word sense disambiguation systems. There have been a number of attempts to exploit parallel corpora for word sense disambiguation, both for bilingual models [48, 33, 38, 127, 24] and for multilingual models [170, 13, 124, 87]. So, the primary focus of this research is placed on overcoming the drawbacks of current state-of-the-art supervised WSD approaches and building a highly accurate supervised WSD system that exploits information drawn from multiple languages.

Subsequently, an interesting trend has been observed in the growing line of WSD research from recent years, where Wikipedia has been used as a resource of world knowledge for building WSD systems [104, 121, 17]. The vast amount of knowledge available in Wikipedia has been shown to benefit supervised WSD systems [104, 17]. The proposed approaches use the semi-structured information available in Wikipedia either directly or indirectly, by mapping automatically to resources such as DBPedia [18] or YAGO [167] that distill information from Wikipedia, in order to alleviate the semantic information bottleneck problem. There is also an increasing number of languages that have a Web presence, and correspondingly there is a growing need for effective word sense disambiguation systems for multiple natural languages. Wikipedia has become one of the primary knowledge resources for multilingual natural language processing research, yielding a combined total of more than 10 million articles in more than 280 languages1.

Briefly, this research is motivated by three important factors:

(1) The need for building comprehensive and accurate supervised WSD systems for different natural languages.
(2) The importance of Wikipedia as a large semantic resource for building supervised WSD systems.
(3) The use of multilingual evidence available in Wikipedia for building multilingual joint word sense disambiguation systems.

1 http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

1.3. Proposed Solution

This research proposes three approaches to word sense disambiguation that use Wikipedia as a source of sense annotations. Starting from a basic monolingual approach, I develop two multilingual systems: one that uses a machine translation system to create multilingual features, and one where multilingual features are extracted primarily through the interlingual links available in Wikipedia. The three proposed approaches to the WSD task are enumerated as follows:

(1) WikiMonoSense: In this approach, I address the sense-tagged data bottleneck problem by using Wikipedia as a source of sense annotations. Starting with the hyperlinks available in Wikipedia, I first generate sense annotated corpora that can be used for training accurate and robust monolingual sense classifiers (WikiMonoSense).
(2) WikiTransSense: The sense tagged corpus extracted for the reference language is translated into a number of supporting languages. The word alignments between the reference sentences and the supporting translations computed by Google Translate are used to generate complementary features in our first approach to multilingual WSD.
(3) WikiMuSense: The reliance on machine translation (MT) is significantly reduced during the training phase of this approach to multilingual WSD, in which sense tagged corpora in the supporting languages are created through the interlingual links available in Wikipedia. Separate classifiers are trained for the reference and the supporting languages and their probabilistic outputs are integrated at test time into a joint disambiguation decision for the reference language (an illustrative sketch of such a combination follows this list).
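To give a flavor of what integrating probabilistic outputs into a joint disambiguation decision can look like, the following minimal Python sketch simply averages the sense probability distributions produced by a reference-language classifier and several supporting-language classifiers. This is a generic illustration only; the function name and the toy probabilities are invented for the example, and Chapter 4 gives the actual formulation used in WikiMuSense.

    def joint_disambiguation(reference_probs, supporting_probs_list):
        """Combine sense probability distributions by simple averaging.

        reference_probs: dict mapping a sense label to its probability, produced by
        the reference-language classifier. supporting_probs_list: one such dict per
        supporting language, already mapped onto the reference-language sense labels.
        """
        distributions = [reference_probs] + supporting_probs_list
        combined = {
            sense: sum(d.get(sense, 0.0) for d in distributions) / len(distributions)
            for sense in reference_probs
        }
        # Pick the sense with the highest combined probability.
        return max(combined, key=combined.get)

    # Toy probability distributions for the senses of "plant" (invented for illustration).
    english = {"plant (factory)": 0.55, "plant (living thing)": 0.45}
    german = {"plant (factory)": 0.30, "plant (living thing)": 0.70}
    spanish = {"plant (factory)": 0.35, "plant (living thing)": 0.65}
    print(joint_disambiguation(english, [german, spanish]))  # -> plant (living thing)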

Through the proposed systems, I show that the Wikipedia annotations are reliable, and the quality of a sense tagging classifier built on this data set exceeds by a large margin the accuracy of an informed baseline that selects the most frequent word sense by default. I also show that the multilingual sense classifiers WikiTransSense and WikiMuSense significantly outperform the WikiMonoSense system (Chapter 5).

1.4. Contributions

This work covers multiple facets, ranging from incorporating world knowledge into the word sense disambiguation task to stepping across language boundaries for a richer and more robust word sense disambiguation system. The contributions are enumerated as follows:

(1) Propose a new method for generating sense-tagged corpora using Wikipedia as a source of sense annotations: I propose a new method to create sense-annotated corpora in different languages using Wikipedia to address the semantic information bottleneck problem. Starting with the hyperlinks available in Wikipedia, I first generate sense annotated corpora that can be used for training accurate and robust monolingual sense classifiers (WikiMonoSense, in Section 4.1).
(2) Propose two multilingual WSD frameworks that exploit the cumulative ability of features originating from multiple languages to improve on the proposed monolingual WSD framework: I propose a new method for multilingual WSD, WikiTransSense, that explores the expansion of a monolingual feature set with features drawn from multiple languages computed by machine translation, in order to generate a more robust and more effective vector-space representation that can be used for the task of word sense disambiguation, in Section 4.3. I also propose another multilingual WSD approach, WikiMuSense, that leverages the sense annotations and interlingual links available in the multilingual editions of Wikipedia in order to reduce the reliance on machine translation, in Section 4.4.
(3) Examine the effectiveness and robustness of the proposed three WSD frameworks: I examine whether the proposed three WSD frameworks are effective and robust and whether they perform better than a state-of-the-art baseline. I also examine whether our multilingual frameworks WikiTransSense and WikiMuSense perform better than our monolingual framework WikiMonoSense through established benchmark statistics for WSD.
(4) Explore the importance of the size of the sense annotated corpora in the reference language and the supporting languages, and of the number of supporting languages: I propose a new methodology to explore the association between the ambiguity in the reference language and the amount of available sense annotated data in the reference language through two different approaches. I also explore the association between the ambiguity in the reference language and the amount of available sense annotated data in the supporting languages. I also propose a new methodology to explore the association between ambiguity in the reference language and the number of supporting languages, and see how different combinations help in resolving ambiguity in the reference language through our two proposed multilingual systems.

1.5. Thesis Outline

This dissertation consists of three main parts. The first introduces the background of word sense disambiguation and the main approaches that have been proposed to solve it. The second describes the architecture of the WikiMonoSense, WikiTransSense and WikiMuSense frameworks I propose for WSD, whereas the third part outlines the experimental setup and results. Below is a more detailed overview of the different chapters. Chapter 2 details the background information required for the proposed approaches to word sense disambiguation in both monolingual and multilingual settings. It also introduces Wikipedia and the semi-structured nature of Wikipedia. Finally, this chapter provides a brief background about machine learning and the Naive Bayes classifier used in this research. Chapter 3 summarizes the three main approaches that have been used to tackle the WSD problem: (1) knowledge-based approaches, (2) supervised approaches and (3) unsupervised approaches. It also highlights the main shortcomings of the various approaches, namely the knowledge-acquisition bottleneck and the rigid sense inventories. It also summarizes the approaches most relevant to our work that have been used to tackle the WSD problem in a cross-lingual or multilingual setting. Chapter 4 discusses an approach for using Wikipedia as a source of sense annotations for word sense disambiguation. It also introduces the monolingual WSD framework, WikiMonoSense, and two multilingual WSD frameworks, WikiTransSense and WikiMuSense. Chapter 5 provides the detailed results and evaluations on a subset of the ambiguous words using our three word sense disambiguation systems WikiMonoSense, WikiTransSense and WikiMuSense in four different languages, namely English, German, Italian and Spanish. Chapter 6 summarizes the conclusions of this thesis and presents prospects for future research.

CHAPTER 2

BACKGROUND

In this chapter I address some of the terminology used throughout the thesis and discuss some real life applications of word sense disambiguation. I also introduce the tools that are used in this research.

2.1. Ambiguity

Ambiguity is a term used to characterize phenomena that have more than a single meaning. Human language contains ambiguity at many levels of linguistic representation, but two kinds of ambiguity, lexical ambiguity and syntactic ambiguity are very common in natural language.

Syntactic or structural ambiguity occurs when two or more syntactic structures can be interpreted from the same phrase or sentence. Syntactic ambiguity arises from the relationship between the words and clauses of a sentence and the derived sentence structure. A classic example is "flying planes can be dangerous," which can be interpreted either as "the act of flying planes is dangerous" or "planes that are flying are dangerous."

Lexical ambiguity is concerned with multiple interpretations of a single word form. A word is ambiguous if it involves two lexical items that have identical forms but have distinct, i.e. unrelated, meanings. The most classical example of lexical ambiguity is bank, which denotes either "an organization providing financial services" or "the side of a river."

2.2. Word Sense Disambiguation versus Word Sense Discrimination

Word sense discrimination is an open problem of natural language processing in which the task is to automatically identify the senses of words. Given a word used in a number of different contexts, word sense discrimination is the process of grouping the instances of the word together by determining which contexts are the most similar to each other. The hypothesis is that words with similar meanings are often used in similar contexts [111]. The manual construction of such a sense inventory is a tedious and time-consuming job, and the result is highly dependent on the annotators and the domain at hand. In contrast, word sense disambiguation (WSD) aims to assign a sense label from a predefined sense inventory to a particular instance of a word in context.

2.2.1. Coarse-Grained versus Fine-Grained WSD

Sense granularity is the extent to which a word is broken down into different meanings based on the various contexts in which it can be used. Sense granularity has become an important factor in recent WSD systems. If the senses are too coarse-grained, some important senses may be missed; in contrast, if the senses are too fine-grained, unwanted disambiguation errors occur even with manual effort [27, 165].

2.3. Wikipedia

Wikipedia is a free online encyclopedia, representing the outcome of a continuous collaborative effort of a large number of volunteer contributors. Virtually any Internet user can create or edit a Wikipedia web page, and this "freedom of contribution" has a positive impact on both the quantity (fast-growing number of articles) and the quality (potential mistakes are quickly corrected within the collaborative environment) of this online resource.

The basic entry in Wikipedia is an "article" (or "page"), which defines and describes a concept, an entity, or an event, and consists of a hypertext document with hyperlinks to other pages within or outside Wikipedia. The role of the hyperlinks is to guide the reader to pages that provide additional information about the entities or events mentioned in an article. Articles are organized into "categories", which in turn are organized into category hierarchies. For instance, the article on "Alan Turing" shown partially in Figure 2.1 is included in the category "British cryptographers", which in turn has a parent category named "British scientists", and so forth. The left pane of the page in this figure contains a set of links, of which the most important for this work are the "interlingual links" that map to equivalent articles in other languages. The right pane contains the first two paragraphs of the actual article, with the hyperlinks shown in gray, whereas the bottom part of the figure shows the first ten categories associated with the article.

FIGURE 2.1. Snapshot of a fragment from the Wikipedia article on Alan Turing.

Each article in Wikipedia is uniquely referenced by an identifier, consisting of one or more words separated by spaces or underscores and occasionally a parenthetical explanation. For example, the article for the entity Turing that refers to the "English computer scientist" has the unique identifier "Alan Turing", whereas the article on Turing with the "stream cipher" meaning has the unique identifier TURING (CIPHER). The hyperlinks within Wikipedia are created using these unique identifiers, together with an "anchor text" that represents the surface form of the hyperlink. For instance, "Alan Mathison Turing, [[Order of the British Empire|OBE]], [[Fellow of the Royal Society|FRS]] was an English [[mathematician]]" is the wiki source for the first sentence in the example page on ALAN TURING, containing links to the articles ORDER OF THE BRITISH EMPIRE, "Fellow of the Royal Society", and MATHEMATICIAN. If the surface form and the unique identifier of an article coincide, then the surface form can be turned directly into a hyperlink in the HTML version by placing double brackets around it (e.g. "[[mathematician]]"). Alternatively, if the surface form should be hyperlinked to an article with a different unique identifier, e.g. linking "FRS" to the article on "Fellow of the Royal Society", then a piped link is used instead, as in "[[Fellow of the Royal Society|FRS]]".
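As a concrete illustration of how sense annotations can be read off this markup, the short Python sketch below extracts the link target (unique identifier) and the anchor text from wiki source using a regular expression. It is only an illustrative fragment; the actual processing in this research relies on the Wikipedia Miner toolkit described in Section 2.7.1, and the function name here is chosen for the example.

    import re

    # Matches [[Target]] and [[Target|anchor text]] in wiki source.
    WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

    def extract_links(wiki_source):
        """Return (unique identifier, surface form) pairs for every internal link."""
        links = []
        for match in WIKILINK.finditer(wiki_source):
            target = match.group(1).strip()
            # For plain links, the surface form is the identifier itself.
            anchor = (match.group(2) or target).strip()
            links.append((target, anchor))
        return links

    sentence = ("Alan Mathison Turing, [[Order of the British Empire|OBE]], "
                "[[Fellow of the Royal Society|FRS]] was an English [[mathematician]]")
    print(extract_links(sentence))
    # [('Order of the British Empire', 'OBE'),
    #  ('Fellow of the Royal Society', 'FRS'),
    #  ('mathematician', 'mathematician')]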

One of the implications of the large number of contributors editing the Wikipedia articles is the occasional lack of consistency with respect to the unique identifier used for a certain entity. For instance, “Alan Turing” is also referred to using the last name “Turing”, or the full name “Alan Mathison Turing”. This has led to the so-called “redirect pages”, which consist of a redirection hyperlink from an alternative name (e.g. “Turing”) to the article actually containing the description of the entity (e.g. “Alan Turing”), as shown in Figure 2.2.

FIGURE 2.2. Example redirect page, from Turing to Alan Turing.

Another structure that is particularly relevant to the work described in this research is the disambiguation page. Disambiguation pages are specifically created for ambiguous entities, and consist of links to articles defining the different meanings of the entity. The unique identifier for a disambiguation page typically consists of the parenthetical explanation (disambiguation) attached

to the name of the ambiguous entity, as in e.g. SENSE (DISAMBIGUATION), which is the unique identifier for the disambiguation page of the noun Sense, as shown in Figure 2.3.

Wikipedia editions are available for more than 280 languages, with a number of entries varying from a few pages to three million articles or more per language. Table 2.1 shows the ten largest Wikipedias (as of March 2012), along with the number of articles and approximate number of contributors.1

1 http://meta.wikimedia.org/wiki/List_of_Wikipedias

FIGURE 2.3. Disambiguation page for the noun Sense.

Language    Code   Articles    Redirects   Users
English     en     4,674,066   4,805,557   16,503,562
French      fr     3,298,615     789,408    1,250,266
German      de     3,034,238     678,288    1,398,424
Italian     it     2,874,747     319,179      731,750
Polish      pl     2,598,797     158,956      481,079
Spanish     es     2,587,613     504,062    2,162,925
Dutch       nl     2,530,250     226,201      446,458
Russian     ru     2,300,769     682,402      819,812
Japanese    ja     1,737,565     372,909      607,152
Portuguese  pt       719,944     100,000      919,782

TABLE 2.1. Number of articles, redirects, and users for the top ten Wikipedia editions. The total number of articles also includes the disambiguation pages.

Finally, also relevant for the work described in this thesis are the interlingual links, which explicitly connect articles in different languages. For instance, the English article for the noun

SENSE is connected, among others, to the Spanish article SENTIDO (PERCEPCIÓN) and the Latin article SENSUS (BIOLOGIA). On average, about half of the articles in a Wikipedia version include interlingual links to articles in other languages. The number of interlingual links per article varies from one language edition to another, from an average of about 5 to as many as 23.

2.4. Machine Learning

Machine learning is a branch of artificial intelligence which involves computationally learning patterns from data, and applying the learned patterns to new or unseen data. Tom M. Mitchell [113] provided a widely quoted definition for machine learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Machine learning has been extensively used with remarkable success to solve several Natural Language Processing problems, including speech recognition, document categorization, document segmentation, part-of-speech tagging, word-sense disambiguation, named entity recognition, parsing, machine translation etc. In particular, for automatic WSD, supervised learning is one of the most successful approaches. Supervised learning is a machine learning technique for creating a function from training data. The training data consist of features of input objects, and desired outputs. The output of the function can be a continuous value (known as regression) or a discrete value (known as classification). The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this, the learner has to generalize from the presented data to unseen situations in a "reasonable" way. In particular, WSD is treated as a classification problem, in which the class labels are discrete senses of the word.

Formal Definition of Supervised Classification: Let X be a set of variables, X = X1, ..., Xn, such that n > 0, known as attribute or feature variables, which can be discrete or continuous, and let C represent a dependent nominal classification or class variable with class values c1, ..., ck such that k > 1. The classification problem consists of learning a reliable mapping function Ĉ = f(X) from the attributes to the class variable over a dataset D containing several instances of X assigned a class C, known as training data, such that for an unseen instance or instances of X, also known as test data, I can predict the value of C with high expected accuracy.

2.4.1. Multinomial Naive-Bayes Classification

The Naive Bayes algorithm [156] is one of the simplest, most versatile and computationally efficient supervised machine learning algorithms, and it performs well over a wide range of classification problems, including medical diagnosis, text categorization, collaborative and email filtering, and information retrieval [183]. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. It is a probabilistic classifier based on Bayesian decision theory that assumes naive independence between attribute variables [40, 52].

Formal Definition: Let X = X1, ..., Xn, C be a finite set of random variables, where X1, ..., Xn define attribute variables and C defines a class variable; then a unique joint probability distribution on the random variables is defined by the following equation:

P(X1, ..., Xn, C) = P(C) ∗ P(X1, ..., Xn | C)    (1)

where P(C) is the prior probability of the class labels.

Once P(C) and P(X1, ..., Xn | C) are estimated from the training data, Bayes' rule gives the conditional probability of the class given the attributes, and mathematically it can be written as:

P(C | X1, ..., Xn) = α ∗ P(C) ∗ P(X1, ..., Xn | C)    (2)

where α is a normalization factor that ensures that the conditional probability of all possible class labels sums up to 1.

However, learning P(X1, X2, ..., Xn | C) is in general impractical, so a "naive" assumption is made: all attribute variables X1, ..., Xn are mutually independent given the class variable C. With this assumption:

P(X1, ..., Xn | C) = P(X1 | C) ∗ P(X2 | C) · · · P(Xn | C)    (3)

Combining equations 1, 2 and 3, I get the Naive Bayes classification rule:

P(C | X1, ..., Xn) = α ∗ P(C) ∗ ∏_{i=1}^{n} P(Xi | C)    (4)

where α is a normalizing factor that ensures that the conditional probability of all possible class labels sums up to 1. Equation 4 is used to estimate the posterior probability distribution for a new unseen example; for each class the posterior is computed, and the class with the maximum posterior probability is selected.
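To make Equation 4 concrete, the following sketch implements the Naive Bayes decision rule over bag-of-words contexts, using log probabilities and Laplace smoothing for numerical stability. It is an illustrative Python re-implementation, not the classifier used in the experiments (those rely on Weka, Section 2.7.4), and the toy sense-tagged contexts are invented for the example.

    import math
    from collections import Counter, defaultdict

    def train(examples):
        """examples: list of (context_words, sense) pairs. Returns class priors,
        per-sense word counts and the vocabulary."""
        priors, word_counts, vocab = Counter(), defaultdict(Counter), set()
        for words, sense in examples:
            priors[sense] += 1
            word_counts[sense].update(words)
            vocab.update(words)
        return priors, word_counts, vocab

    def predict(context, priors, word_counts, vocab):
        """Return the sense maximizing P(C) * prod_i P(Xi | C), computed in log space."""
        total = sum(priors.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in priors.items():
            score = math.log(count / total)
            denom = sum(word_counts[sense].values()) + len(vocab)  # Laplace smoothing
            for w in context:
                score += math.log((word_counts[sense][w] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    # Toy sense-tagged contexts for the ambiguous noun "bar" (invented for illustration).
    data = [(["drink", "beer", "night"], "bar (establishment)"),
            (["law", "exam", "attorney"], "bar (law)"),
            (["music", "note", "rhythm"], "bar (music)")]
    model = train(data)
    print(predict(["passed", "law", "exam"], *model))  # -> bar (law)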

2.5. Parts of Speech Tagging

Part-of-speech tagging, also called grammatical tagging or word-category disambiguation, is the process of assigning to each word in a text document a label corresponding to a particular part of speech, based on both its definition and its context. Traditional grammar classifies words based on eight parts of speech: the verb, the noun, the pronoun, the adjective, the adverb, the preposition, the conjunction, and the interjection. Part-of-speech tagging is an essential tool in many natural language processing applications such as information extraction, word sense disambiguation, parsing, question answering, and named entity recognition. Manually assigning part-of-speech tags to words is a tedious and time-consuming task. Automatic part-of-speech tagging is the process of assigning part-of-speech tags by a computer instead of doing it manually. The contextual behavior of words varies largely across languages, so the key issue is to identify the part of speech of a word based on the context in which it occurs. There are several approaches to automatic part-of-speech tagging, classified as rule-based, probabilistic, and transformation-based approaches. Probabilistic approaches determine the most probable tag of a token based on the surrounding context words, using probability values obtained from a training corpus. The most widely known probabilistic approaches for tagging are Hidden Markov Models, Maximum Entropy Markov Models and Conditional Random Fields (CRF).
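Purely as an illustration of what a tagger's output looks like, the snippet below tags a sentence with NLTK's default English tagger. This is an assumption of the example rather than the tool used in this research; the experiments rely on TreeTagger, described in Section 2.7.2.

    import nltk

    # One-time download of the tokenizer and tagger models.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "The plant produces steel bars for the local market."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('plant', 'NN'), ('produces', 'VBZ'), ('steel', 'NN'), ...]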

2.6. Stemming

Stemming is the process of removing the affixes from inflected words, without doing complete morphological analysis. A stemming algorithm is a procedure to reduce all words with the same stem to a common form. Stemming is an essential tool in many natural language processing applications such as information retrieval, word sense disambiguation, information extraction and domain analysis. Manually stemming words is a tedious and time-consuming task. Automatic stemming is the process of stemming by a computer instead of doing it manually. There are several approaches to stemming, classified as rule-based (Lovins algorithm, Porter's algorithm, Paice/Husk algorithm, Dawson algorithm) and statistical (N-gram stemming, Hidden Markov Models) approaches. The most widely known rule-based approach for stemming is Porter's algorithm, because of its simplicity, speed and availability of implementations for multiple natural languages.

2.7. Tools

2.7.1. Wikipedia Miner

Wikipedia Miner [112] is a toolkit for working with the structure and content of Wikipedia and for understanding the relations between its entities. It mainly aims at providing an easy way to integrate Wikipedia's knowledge into our own applications, by providing simplified, object-oriented access to Wikipedia's structure and content. It helps in detecting and disambiguating Wikipedia topics when they are mentioned in documents. It also helps in understanding and measuring the semantic relatedness between terms or concepts in Wikipedia. However, in this research the purpose of using Wikipedia Miner is to exploit its simple, object-oriented access for dealing with a complex multilingual knowledge base like Wikipedia.

2.7.2. TreeTagger

TreeTagger [158]2 is a language-independent part-of-speech tagger implemented using Markov Models. It performs tokenization and outputs part-of-speech (PoS) tags and lemma information. In this research, I particularly used TreeTagger because it works across multiple languages with very good performance.

2 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

2.7.3. Snowball Stemmer

The Snowball stemmer3 is a publicly available implementation of the Porter stemmer [143] in multiple languages. Currently, the project website provides implementations for many Romance, Germanic, Uralic and Scandinavian languages, as well as English, Russian and Turkish.
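A brief illustration of the normalization a stemmer performs, using the Snowball implementations bundled with NLTK for two of the languages studied in this thesis; this is an illustrative choice of implementation, and any Snowball port should behave similarly.

    from nltk.stem.snowball import SnowballStemmer

    english = SnowballStemmer("english")
    german = SnowballStemmer("german")

    print([english.stem(w) for w in ["arguments", "argued", "arguing"]])
    # e.g. ['argument', 'argu', 'argu']
    print([german.stem(w) for w in ["Häusern", "Hauses"]])
    # e.g. ['haus', 'haus']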

2.7.4. Weka

Waikato Environment for Knowledge Analysis (WEKA) [55]4 is an open-source workbench containing implementations of a wide range of state-of-the-art machine learning algorithms and data preprocessing methodologies. The basic way of interacting with these methods is by invoking them from the command line or the Graphical User Interface. The system is written in Java, and it can be completely integrated into our environment through Java code. The package is distributed under the terms of the GNU General Public License.

2.7.5. Google Translate

Google Translate5 is an online statistical multilingual machine-translation service provided by Google Inc. to translate text from one language into another. Specifically, to help researchers, the University Research Program for Google Translate6 was introduced to compare and contrast with, and build on top of, Google's statistical machine translation system. This program provides a Java Application Programming Interface (API) through which researchers can submit text for translation and then obtain detailed results, including phrasal alignments between the submitted text and the translated text and n-best translations for the submitted text.

2.8. Applications of Word Sense Disambiguation

Weaver [175], in his famous memorandum, pointed out that WSD is a very important "intermediate task" for Machine Translation. Following this, it has become an important intermediate task

3 http://snowball.tartarus.org/
4 http://www.cs.waikato.ac.nz/ml/weka/
5 http://translate.google.com/
6 http://research.google.com/university/translate/index.html

for many natural language processing tasks, including natural language understanding and communication. Some of the well-known WSD applications are:

(1) Automatic machine translation of sentences, in which the correct translation of foreign words is required based on the context.
(2) Question answering and information retrieval, in which correct information needs to be provided based on the context of the words in the query.
(3) Speech processing, in which the correct phonetization of words needs to be identified. For example, the words "seal" and "ceil" are spelled differently but pronounced the same way.
(4) Text-to-speech translation, in which the correct pronunciation for heteronym words is required depending on their meaning. For example, the word "minute" can mean "extremely small" or "a unit of time."
(5) Finally, information extraction and text mining, in which accurate analysis of text is required.

CHAPTER 3

RELATED WORK

The task of WSD is one of the oldest problems in the field of computational linguistics, and it has been of interest and concern since the earliest days of the computer treatment of language. For a complete overview of the field, I refer to [66], [5], [123] and [85].

3.1. Early Research in WSD

In 1949, Weaver [175] in his famous Memorandum on Machine Translation introduced the WSD problem in a computational context and conceived it as a fundamental task of Machine Translation (MT). He emphasized that the problem of multiple meanings can be tackled by examining the context in which the word occurs. He stated in his Memorandum:

If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words .... But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word .... The practical question is: "what minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"

During the 1950s, in an attempt to answer this question, several efforts were made to understand to what extent the immediate context and the number of distinct senses affect the ambiguity of a particular word [68, 77]. These studies proved that by considering the immediate context or adjacent words, a marked reduction of ambiguity is possible. Another important observation was that sense resolution given the adjacent words of the target word was not significantly better or worse than when given the entire sentence. Further, Reifler [150] pointed out how grammatical structure, in particular syntactic relations, can help in WSD. Weaver [175] also discussed the role of the domain in sense disambiguation, which led to much effort on building specialized dictionaries or micro-glossaries in order to reduce the number of senses for a target word [131, 132, 133, 130, 39, 53, 134]. Madhu and Lytle [91] proposed one of the notable approaches, in which they calculated sense frequencies using corpora in different domains and then applied a Bayesian formula to determine the probability of each sense in a given context. Despite several efforts, researchers considered the problems of MT and WSD as completely distinct fields of investigation. They also understood that WSD was a very difficult problem, given the limited resources available for computation at that time. In 1960, Bar-Hillel [15] argued that Fully Automatic High Quality Translation (FAHQT) was "unattainable" in principle. Considering the example word "pen" in a very innocuous example context, "Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.", he argued that "no existing program or an imaginable program could enable an electronic computer to resolve the ambiguity of the word 'pen' in this context". He also pointed out the lack of world knowledge and argued that MT systems should be supplied with a universal encyclopaedia in addition to the dictionaries. This became a main obstacle to further research and development of Machine Translation and Word Sense Disambiguation throughout the 1960s. In 1966, the Automatic Language Processing Advisory Committee (ALPAC) [28] concluded that Machine Translation was slower, less accurate and twice as expensive as human translation and that "there is no immediate or predictable prospect of useful Machine Translation", which brought a virtual end to Machine Translation for over a decade. However, the committee recommended continued support for basic research in "Computational Linguistics". Throughout the 1960s, though several approaches were proposed [95, 91, 145, 179], without large-scale resources most of these ideas remained untested.

3.2. AI-based Systems

WSD research flourished in the 1970s with the introduction of Artificial Intelligence (AI) approaches for language understanding. This led to the development of word expert systems. These systems consisted of sets of decision rules, used to disambiguate a word, that rely on the context of the target word. Word expert systems were first proposed by Stone [166] and later implemented by Stone and his student Kelly [166, 69]. Weiss [176] developed the first word expert system and showed that disambiguation rules can be learned from manually sense-tagged corpora. Later, Rieger [155] applied word expert systems to natural language understanding. Wilks [177, 178] described a rather different approach that performs WSD using semantic information in the form of preference restrictions and frame-based representations. Throughout the 1970s, most of the approaches heavily relied on manually created semantic networks and frames containing words, their roles and their relationships with other words in the sentence [56, 57, 58, 59].

3.3. Prominent Problem in NLP

In the 1980s, researchers [163, 162, 1, 2, 3, 4] continued to build complex communicating word expert systems and extended them to other natural languages like Dutch and French. WSD research reached a turning point with the release of large-scale lexical resources such as the Oxford Advanced Learner's Dictionary of Current English (OALD)1, which enabled automatic methods for knowledge extraction. Following the release, Lesk [88] developed an algorithm that attempts to identify the most likely meanings for the words in a given context based on a measure of contextual overlap between the current context and the OALD dictionary definitions of the ambiguous words.

During the 1990s, according to Agirre and Edmonds [5] three major events drastically changed further WSD research:

(1) The flourishing of statistical and machine learning methods for natural language processing (NLP), and consequently in WSD research.
(2) The development of WordNet [110], a lexical knowledge base that combines the properties of a thesaurus with those of a semantic network and covers a vast majority of nouns, verbs, adjectives and adverbs from English.
(3) The organization of the first Senseval2 competition in 1998. The purpose of Senseval is to evaluate the strengths and weaknesses of WSD systems with respect to different words and different languages.

1 http://oaadonline.oxfordlearnersdictionaries.com/
2 http://www.senseval.org/

3.4. Categories of Approaches to WSD

In this section, I introduce the three most important and well-studied categories of approaches to WSD and the existing methodologies in each category. They are categorized as:

(1) Knowledge-based methods
(2) Supervised methods
(3) Unsupervised methods

3.4.1. Knowledge Based Methods

Knowledge-based methods mainly rely on external knowledge bases, such as machine-readable dictionaries (MRDs), large lexical semantic resources (LSRs) and corpora, to tackle the WSD problem. Please refer to [109] for a complete overview of the knowledge-based methods that are briefly described here. Five main types of knowledge-based methods have been developed so far for word sense disambiguation:

3.4.1.1. Lesk Algorithm. The Lesk [88] algorithm is one of the first algorithms for the semantic disambiguation of all words. It is a dictionary-based method for identifying the correct word sense by counting word overlaps between the dictionary definition of the word and the sentence in which the word to be disambiguated occurs. This approach tries to evaluate all words in a given context simultaneously, thus exploding combinatorially in proportion to the number of senses of all the words in the context. To overcome this combinatorial explosion, Cowie et al. [31] proposed an approach that uses simulated annealing, a global optimization technique, to quickly find the choice of senses using a simple count of words in common with their definitions. Further, Agirre and Rigau [12] used Conceptual Density for disambiguating the meanings of all words simultaneously.

Kilgarriff and Rosensweig [73] proposed a simplified version of the original Lesk algorithm that measures the overlap between the sense definitions of a word and the current context. It tackles each word individually, independent of the meaning of the other words occurring in the current context. Vasilescu et al. [171] demonstrated that the simplified Lesk algorithm showed an improvement of 16% over the classic Lesk algorithm in Senseval-2's3 all-words English task. Another shortcoming of the original Lesk algorithm is that the words in the context of the target word must exactly match the words used in the dictionary definitions for it to perform well. To address this problem, Banerjee and Pedersen [14] proposed an approach using augmented semantic spaces, called the Adapted Lesk algorithm. It exploits the hierarchical relationships in the WordNet [110] ontology to include the glosses of words that are related to the target word and its neighbours, i.e. expanding the definitions to include the definitions of hypernyms and hyponyms.

3 http://86.188.143.199/senseval2/

Though several methods were proposed to take advantage of the Lesk algorithm, one major disadvantage of Lesk-based methods is their heavy dependency on ontologies like WordNet [110], which are not complete for several natural languages.
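As a small illustration of the overlap idea behind the simplified Lesk algorithm, the Python sketch below scores each sense by the number of content words shared between its gloss and the context of the target word; the glosses and the stop-word list are toy inputs invented for the example.

    def simplified_lesk(context_words, sense_glosses,
                        stopwords=frozenset({"the", "a", "an", "of", "in", "and", "to", "at", "that"})):
        """Return the sense whose gloss has the largest word overlap with the context."""
        context = {w.lower() for w in context_words} - stopwords
        best_sense, best_overlap = None, -1
        for sense, gloss in sense_glosses.items():
            gloss_words = {w.lower() for w in gloss.split()} - stopwords
            overlap = len(context & gloss_words)
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    glosses = {  # toy glosses for the noun "bank"
        "bank (finance)": "an institution that accepts deposits and provides financial services",
        "bank (river)": "the sloping land alongside a body of water such as a river",
    }
    context = "he went to the bank to make deposits and ask about financial services".split()
    print(simplified_lesk(context, glosses))  # -> bank (finance)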

3.4.1.2. Semantic Similarity Measures. A second group of knowledge-based methods is based on semantic similarity or semantic distance measures between the surrounding context of the word to be disambiguated and typical patterns of use of that word in a knowledge base. These measures are often based on a semantic ontology or repository like WordNet.

Rada et al. [147] proposed the first semantic similarity measure in the biomedical domain, using the path length between biomedical terms in a medical hierarchy ontology called MeSH4. Several measures have been proposed to calculate semantic similarity using ontologies, such as L&C [83], Wu&Palmer [181], Resnik [151], J&C [67], H&S [63] and Mihalcea and Moldovan [106]. Patwardhan et al. [135] generalized the Adapted Lesk Algorithm of Banerjee and Pedersen [14] using several semantic similarity measures and conducted experiments on the SENSEVAL-2 dataset5 for WSD.
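Path-based measures of this kind can be experimented with directly through NLTK's WordNet interface; the snippet below is an illustrative use of that interface (not the experimental setup of this thesis) and compares two WordNet senses of "bank" against "money" using simple path similarity.

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    money = wn.synset("money.n.01")
    bank_finance = wn.synset("depository_financial_institution.n.01")  # financial sense of "bank"
    bank_river = wn.synset("bank.n.01")  # sloping land beside a body of water

    # The financial sense is typically measured as closer to "money" than the river-bank sense.
    print(bank_finance.path_similarity(money))
    print(bank_river.path_similarity(money))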

Another set of approaches, which disambiguate in a global context, i.e. considering a larger context window, uses lexical chains [117]. Lexical chains are built using relationships between word senses. The construction of lexical chains requires structured ontologies such as WordNet, with explicitly defined semantic relations between word senses. Barzilay and Elhadad [16] proposed an exhaustive approach for text summarization using lexical chains, in which the chains were used to identify the cohesion in the text. In this approach, initially all possible lexical chains were built for a reference text, and then the lexical chain with the highest score or the strongest representation was chosen. However, this algorithm suffered from quadratic time complexity. To address this, Silber and McCoy [160] proposed a computationally efficient linear-time approach, which nonetheless has inaccuracies in WSD. Galley and McKeown [49] proposed an even more efficient approach that separates WSD from the actual lexical chaining. This approach performs WSD by enforcing the constraint of one sense per discourse [46] before building the actual lexical chains, and it achieved higher accuracies compared to the earlier approaches.

4 http://www.nlm.nih.gov/mesh/
5 http://86.188.143.199/senseval2/

3.4.1.3. Selectional Preferences. Some of the earliest knowledge-based WSD methods used selectional preferences. A selectional preference is a kind of semantic restriction that a predicate imposes on its arguments. For example, the verb "eat" prefers animate entities as its subject and edible entities as its object.

Resnik [152] described a statistical method to acquire a set of conceptual classes for word senses and applied the acquired selectional preferences to word sense disambiguation. Agirre and Martinez [10], motivated by the assumption that different senses of a verb may have different selectional preferences and that some classes of verbs can share preferences, extracted class-to-class selectional preferences. Using the obtained selectional preferences, a probabilistic model for WSD was created. Several other approaches have been introduced to exploit both class-to-class selectional preferences [187, 11] and word-to-class selectional preferences [98, 99] for WSD. Nevertheless, the performance of the methods based on selectional preferences is a concern, since none of them exceeded the baseline. An overview of these methods can be found in [19].

3.4.1.4. Graph Based Methods. Recently, graph-based methods have been applied to the WSD task. In general, these methods are well suited for all-words disambiguation in context.

Mihalcea [103] introduced an algorithm for automatic sequence data labelling and tested it on the problem of WSD. This algorithm uses a global annotation approach, annotating all the words in a sequence simultaneously by exploiting dependencies across labels and random walks. Navigli and Lapata [120] presented a generic graph-based algorithm, in which a graph is constructed based on all the possible sequences of senses for the words to be disambiguated and then the structure of

the graph is exploited to determine the most important nodes in the graph, which eventually leads to the disambiguation of the polysemous words in the context. Sinha and Mihalcea [161] extended their previous work [103] using a collection of semantic similarity measures to assign weights to the links across the graph. Different weighted graph-based centrality algorithms are used to rank the vertices of the obtained complete graph, which eventually leads to the disambiguation of all the words in the context. Recently, several other approaches have been introduced for performing graph-based WSD that exploit the Personalized PageRank algorithm [9], semantic trees and N-cliques in the graph [54], and a travelling salesman problem formulation solved with ant-colony graph optimization [128].

3.4.1.5. Heuristic Based Methods. Finally, heuristic methods can also be applied to the WSD task. The first and most basic heuristic is the most frequent sense. It is based on the observation that the dominance of the most common sense increases with the frequency of the word. Kilgarriff [74] created a mathematical model of word sense frequency distributions, in which he showed that word senses roughly follow a Zipfian distribution [189]. Gale, Church and Yarowsky [46] introduced the one-sense-per-discourse heuristic, stating that a word tends to preserve the same sense within an entire discourse. Further, Yarowsky [184] introduced one-sense-per-collocation, stating that a word tends to preserve the same sense when used in the same collocation. However, the heuristic methods are not well suited for fine-grained WSD; that is, the larger the number of senses, the weaker these heuristics hold.

3.4.2. Supervised Methods

Supervised systems are based on machine learning algorithms, trained using manually or automatically sense-tagged corpora. In general, these methods are well suited for target word disambiguation, which tackles one word at a time. These methods involve two steps: the learning step consists of learning a sense classification model from the sense-annotated examples, and the classification step consists of classifying new or unseen examples in order to assign a sense to the target word. In general, supervised WSD systems are based on one or more machine learning algorithms and perform better than other WSD techniques. Nevertheless, supervised methods for WSD are considered brittle, since they face a number of challenges. In this section I introduce the challenges for supervised WSD and the methodologies

used to tackle them in previous research. This section also provides an overview of several machine learning techniques that have been tried for supervised WSD. For a complete treatment of supervised WSD research, please refer to [92, 94].

3.4.2.1. Lack of Sense Tagged Corpora. The availability of large sense-annotated corpora is a necessary prerequisite for training accurate all-words supervised WSD systems. The main weakness of supervised WSD methods is the lack of widely available sense-annotated corpora, which is well known as the knowledge-acquisition bottleneck problem. Several automatic approaches have been tried to address the lack of sense-annotated corpora for supervised WSD systems. A first set of methods consists of algorithms that generate sense-annotated data using words semantically related to a given ambiguous word [83, 107, 7]. Related non-ambiguous words, such as monosemous words or phrases from dictionary definitions, are used to automatically collect examples from the Web. These examples are then turned into sense-tagged data by replacing the non-ambiguous words with their ambiguous equivalents.

Another approach proposed in the past is based on the idea that an ambiguous word tends to have different translations in a second language [154, 153]. Starting with a collection of parallel texts, sense annotations were generated either for one word at a time [127, 37], or for all words in unrestricted text [38], and in both cases the systems trained on these data were found to be competitive with other word sense disambiguation systems.

The lack of sense-tagged corpora can also be circumvented using semi-supervised bootstrapping algorithms, which start with a few annotated seeds and iteratively generate a large set of disambiguation patterns. This method, initially proposed by [186], was successfully evaluated in the context of the SENSEVAL framework [101].

A series of studies have explored an alternative line of research in which large-scale sense-annotated corpora are extracted automatically from collaboratively constructed language resources. In an effort related to the Wikipedia collection process, Chklovski and Mihalcea [27] implemented the Open Mind Word Expert system for collecting sense annotations from volunteer contributors over the Web. The data generated using this method was then used by the systems participating in several of the SENSEVAL-3 tasks. More recently, Henrich et al. [62] used the

web-based dictionary Wiktionary6 to create a sense-annotated corpus for German, based on a mapping [61] between GermaNet [78] and Wiktionary. A similar effort [62] has led to a sense-tagged corpus for English, by using the mapping between WordNet and Wiktionary created by Meyer and Gurevych [100]. Recent methods for mapping Wikipedia to the WordNet sense repository [141, 129] could be used to replace the role of Wiktionary with Wikipedia in the same corpus acquisition approach. Recently, Bharath et al. [17] explored word sense disambiguation using Wikipedia as a source of sense annotations, similar to [104]. Through experiments on four different languages, they showed that the Wikipedia-based sense annotations are reliable and can be used to construct accurate sense classifiers.

3.4.2.2. Lack of Right Sized Training Sets. Another challenge for supervised WSD systems is the strong dependence of disambiguation accuracy on the size of the training corpora. Typically, 1000-2500 examples of each word are manually annotated to create a sense-annotated benchmark corpus for supervised WSD. Several studies have measured the number of examples required to build a supervised WSD system. For instance, one study reports that high-precision WSD requires at least 500 examples per ambiguous word on average, although the exact number is highly related to the word's entropy [125]. At a throughput of one annotation per minute [41], the necessary effort for constructing such a training corpus requires an estimated 80 man-years of human annotation work [108]. Agirre and Martinez [6] reported that the distribution of the number of examples per word sense has a strong influence on the quality of the WSD results, using two different corpora known as SemCor and DSO. Agirre and Martinez [7] extended their previous work and showed that the distribution of the number of examples per sense (bias) is also a key factor in building automatic sense-annotated corpora. However, to date there is no exact methodology that determines the minimum number of training examples required for the best performance of WSD systems.

3.4.2.3. Lack of Portability across Domains. An additional challenge for WSD systems is portability across different domains, i.e., training and testing on corpora from different domains. Since, in general, words are used differently and different words occur in the context of the target word

6http://www.wiktionary.org

in different domains, there is a performance degradation for supervised WSD systems trained and tested on different domains [42, 93]. To address this problem, the corpora should be large and diverse, covering all the domains and senses, in order to make WSD systems portable. However, this is very costly, considering the number of senses and the number of domains in a language and the amount of manual effort required.

Several approaches have been introduced to identify the strong influence of domain on WSD and the salience of words across domains [22, 76, 149]. Although these methods can tackle the problem partially, they achieve low recall, since the information about target words in a single domain is sparse compared to that available across all domains [123].

Even though supervised WSD faces a number of challenges, a large number of studies have been based on these methods. In general, the main approaches to supervised WSD are categorized based on the machine learning algorithm that is used to train the classifier. In this section I provide a brief overview of previous work in each category of supervised WSD. The supervised approaches are often categorized as follows:

3.4.2.4. Decision Based WSD. Decision tree learning is a predictive model for expressing mappings between a given class and its features using a tree structure. In decision trees, the leaf nodes indicate the class labels (senses of a word) and the internal nodes represent conjunctions of features or attributes that lead to those class labels or senses. The advantage of using decision trees is that they are simple for humans to understand, consistent and robust. Kelly and Stone [69] were the first to use decision trees for learning WSD rules, represented with an n-ary branching tree. Mooney [116] adapted the C4.5 algorithm implemented by Ross Quinlan [146], a variant of decision tree learning, and performed experiments on several words. However, compared with other machine learning algorithms, his experiments showed that decision trees are not the best method for WSD. They also suffer from several issues, such as data sparseness due to features with a large number of values, and low reliability, since rules that cover very few training examples do not produce reliable estimates of the class label.

In contrast, the decision list algorithm uses induction to produce an ordered list of conditions or conjunctive rules, ordered by their importance (weight). Rivest [157] attempted to use decision

lists, while Yarowsky [185, 186] applied decision lists in which the list of features is sorted using the log-likelihood measure, indicating the probability of the sense given the feature value. It was one of the best performing algorithms for WSD in the first SENSEVAL competition [73]. Agirre and Martinez [6] carried out an in-depth study of the performance of decision lists on three different corpora.
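
A hedged sketch of how a Yarowsky-style decision list for a two-sense word can be built from sense-tagged examples: each (feature, sense) pair receives a log-likelihood weight, the rules are sorted by strength, and classification stops at the first matching rule. The data structures, add-one smoothing and function names are assumptions made for illustration, not the exact formulation of the cited systems.

    # Minimal decision-list sketch for a target word with two senses.
    # training_data: list of (set_of_features, sense) pairs.
    import math
    from collections import defaultdict

    def build_decision_list(training_data, senses):
        counts = defaultdict(lambda: defaultdict(int))
        for features, sense in training_data:
            for f in features:
                counts[f][sense] += 1
        rules = []
        for f, per_sense in counts.items():
            # add-one smoothing; the weight reflects how strongly the
            # feature favours its majority sense
            p = [per_sense[senses[0]] + 1, per_sense[senses[1]] + 1]
            weight = abs(math.log(p[0] / p[1]))
            best = senses[0] if p[0] >= p[1] else senses[1]
            rules.append((weight, f, best))
        return sorted(rules, reverse=True)   # strongest evidence first

    def classify(rules, features, default_sense):
        for weight, f, sense in rules:
            if f in features:
                return sense
        return default_sense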

3.4.2.5. Neural Network WSD. An artificial neural network is a mathematical model that tries to replicate the structure and behavior of the brain. It is a collection of neurons in which the outputs of some neurons serve as the inputs to other neurons, potentially forming a complex network. The network learns a mathematical function from the training data, so that it can evaluate new inputs with this function and produce an output.

Cottrell and Small [30] and Waltz and Pollack [174] suggested neural network approaches for WSD. Towell and Voorhees [168] showed that neural networks without hidden layers perform better than those with hidden layers for WSD, which suggests that WSD is a linear classification problem. Later, Veronis and Ide [173] demonstrated the use of very large neural networks (VLNNs) built from the definitions of machine readable dictionaries for supervised WSD. Tsatsaronis et al. [169] proposed another approach, based on generating Spreading Activation Networks (SANs) from the senses of a thesaurus and the relations between them, which helps in linking the senses to current knowledge repositories.

Recently, Rao and Pujari [148] used the Hopfield model of neural networks and demonstrated good performance for WSD. Another recent approach was proposed by Wiriyathammabhum et al. [180], which uses a probabilistic generative model composed of multiple layers of hidden units, known as Deep Belief Networks (DBNs). This method demonstrated a significant improvement in performance when compared to existing supervised methods.

3.4.2.6. Instance Based WSD. Instance-based learning, also known as exemplar-based or memory-based learning, is a lazy machine learning method. It stores all training examples together with their sense labels in memory, hence the name memory-based learning. When an unseen or new example is presented, a similarity metric is used to find the most similar examples in memory, and an "average" of their

senses is used to predict a class label. Previous work on memory-based WSD includes [126], [172], [64], [102] and [35]. More recently, Lefever [85] demonstrated the use of memory-based learning implemented in TiMBL [32] for cross-lingual WSD.

3.4.2.7. Probabilistic WSD. Probabilistic methods are statistical methods that estimate a set of probabilistic parameters using observed and latent variables. A probabilistic classifier assigns to each word the tag that has the highest estimated probability of having occurred in the given context [21]. A classic example of a probabilistic model is the Naive Bayes classifier, which was explained in detail in Chapter 2. Several systems created in the late 90s are based on this classifier [45, 82, 21, 125]. Chao and Dyer [26] tried to take advantage of dependencies between features, i.e., context words, by using Bayesian graphical models. Molina et al. [115] proposed another supervised approach to word sense disambiguation based on Hidden Markov Models. Since the early days of supervised WSD, the Naive Bayes classifier has been considered one of the best performing classifiers for WSD. Mooney [116] demonstrated this by trying several machine learning algorithms and showing that Naive Bayes performed best among them for WSD.
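
Since the Naive Bayes classifier recurs throughout this thesis, the following sketch shows the standard formulation over simple bag-of-context-words features, using scikit-learn purely for illustration; the training sentences are invented, and the actual systems described in Chapter 4 use their own feature extraction rather than raw bags of words.

    # Bag-of-words Naive Bayes sense classifier; a sketch, not the exact
    # feature set used by the frameworks described later.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_contexts = ["admitted to the bar and began to practice law",
                      "the waltz turns once every bar of music"]
    train_senses   = ["bar (law)", "bar (music)"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_contexts, train_senses)
    print(model.predict(["the melody repeats after four bars"]))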

3.4.2.8. Support Vector Machines. Support Vector Machines (SVM) [29] are non-probabilistic binary classifiers that find the hyperplane with the largest margin that best separates a set of training examples into two classes. A test example is classified depending on the side of the hyperplane on which it lies in the training example space. For a complete understanding of Support Vector Machines, I refer the reader to [140, 60].

Several supervised approaches proposed in recent years for WSD are based on Support Vector Machines [23, 84, 50]. Cabezas et al. [23, 118] were the first to experiment with Support Vector Machines on the lexical sample provided in the Senseval-2 lexical sample task dataset. Later, Lee and Ng [84] compared different machine learning algorithms on the Senseval-3 lexical sample task dataset, and Support Vector Machines with a linear kernel function performed best among the machine learning algorithms compared for WSD. Gliozzo et al. [50] presented domain kernels for word sense disambiguation. They proposed a supervised learning approach using "domain kernels", a function that exploits domain information and estimates the domain

similarity among the contexts of the words to be disambiguated. Finally, in recent work, Lefever [85] demonstrated the use of support vector machines for cross-lingual WSD.

3.4.2.9. Ensemble Classifiers. Ensemble learning is based on the idea that different classifiers learn complementary information; hence, combining evidence from classifiers of different natures in a meaningful way can improve the overall performance of the task at hand. There are basically two combination scenarios for ensemble learning: 1) all classifiers use the same feature set; 2) different classifiers use different feature sets.

Pedersen [137] introduced an ensemble approach for WSD in which an ensemble of Naive Bayes classifiers is built, each of them generated from context features using a different context window size around the target word. This resulted in 81 different classifiers, and a majority voting scheme is used to select the sense of the word. Florian and Yarowsky [44] tried several count-based, rank-based and probability-based combination strategies for combining evidence from different classifiers with the same feature set for WSD. The experiments were performed on the Senseval-2 dataset and demonstrated the substantial empirical success of classifier combination for the word sense disambiguation task. Klein et al. [75] tried different combination strategies, such as majority voting, weighted voting and a stacking method based on maximum entropy, on the Senseval-2 dataset and examined how ensemble performance depends on error independence and the difficulty of the task.
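
The simplest of these combination strategies, majority voting, can be stated in a few lines; the sketch below is illustrative, with the three predicted senses standing in for the outputs of ensemble members trained on different context window sizes.

    # Majority voting over the predictions of an ensemble of sense classifiers.
    from collections import Counter

    def majority_vote(predictions):
        # predictions: one predicted sense per ensemble member
        return Counter(predictions).most_common(1)[0][0]

    print(majority_vote(["bar (music)", "bar (law)", "bar (music)"]))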

Le et al. [80] addressed the second scenario, i.e., combining classifiers with different feature representations. In their approach, six different combination rules were tried on four words, and they observed the importance of combination strategies for classifiers trained with different feature sets for WSD. Following their previous observations, Le et al. [81] introduced two more combination strategies, based on Dempster-Shafer theory [36, 159] and Ordered Weighted Averaging operators [182], and conducted a detailed study on the Senseval-2 and Senseval-3 datasets. Their experiments showed that classifiers with different feature sets offer complementary information, which makes combination strategies helpful in predicting the correct sense. However, combining classifiers trained on feature sets derived from different languages is, to our knowledge, new; I introduce it in this research and compare it directly with well known WSD techniques.

3.4.3. Unsupervised Methods

Unsupervised methods start with large raw corpora and cluster occurrences of words based on their contexts using clustering techniques, thereby inducing word senses automatically. The number of clusters is determined by the number of possible senses for a particular word. Once the clusters are obtained, a sense-labelling algorithm assigns a sense to each cluster. However, the automatic labelling of the obtained clusters remains a considerable challenge. Lin [90] proposed a clustering approach that tries to cluster words that are semantically close. This approach identifies similar words using the distributional patterns of words in a large corpus, which provides the nearest neighbours of each word, along with the distributional similarity score between the target word and its neighbour words. In the next step, an iterative clustering approach is applied to induce the senses of the words. Pedersen and Bruce [138] proposed another approach, in which the contexts of words are represented as vectors using co-occurrence and part-of-speech features. Several methods have been proposed for better representations of the context vectors, exploiting external information such as glosses taken from an ontology like WordNet [144], word-category co-occurrence matrices in which the categories are coarse senses obtained from an existing thesaurus [114], and Web-based word occurrence counts [79]. However, very little work has been done on automatically labelling the clusters. One such work is proposed by McCarthy [96], in which she exploited the frequency distributions of senses available in WordNet.

3.5. Related Work in Cross-Lingual/Multilingual WSD

In the previous sections, I introduced a large number of word sense disambiguation methods that have been proposed, targeting the resolution of word ambiguity in different languages. However, only a few methods try to take evidence from multiple languages simultaneously to solve WSD. Brown et al. [20] made the observation that the mapping between word forms and senses differs across languages, and applied a statistical machine learning technique that considers these mappings for word sense disambiguation. Gale et al. [47] used parallel translations in French to automatically create large training and testing datasets in English for word sense disambiguation. Dagan [34, 33] also used parallel translations and proposed a statistical model that relies on lexical relations in bilingual corpora to resolve the sense ambiguity of sentences. Further,

Resnik and Yarowsky [154] conducted a set of empirical studies on multilingual sense inventories and showed that sense distinctions can indeed be captured by translations into other languages. Diab and Resnik [38, 37] proposed an unsupervised bootstrapping approach for WSD that exploits translation correspondences in automatically-generated parallel corpora. Ng et al. [127] proposed an approach to automatically acquire sense-tagged training data from manually created English-Chinese parallel corpora, which are then used to train a Naive Bayes classifier to determine the most probable sense of a word. Chan and Ng [24] and Chan et al. [25] tried a small variant of their previous approach [127] and experimented on lexical samples from the Senseval-2 and SemEval-2007 English all-words datasets.

Li and Li [89] introduced a bilingual bootstrapping approach in which, starting with in-domain corpora in two different languages, English and Chinese, word translations are automatically disambiguated using information iteratively drawn from the bilingual corpora. Related to their work, Khapra et al. [70, 71] proposed another bilingual bootstrapping method, using an aligned multilingual dictionary and bilingual corpora, and showed how resource-deprived languages can benefit from a resource-rich language. They introduced a technique called "parameter projections", in which parameters learned using both the aligned multilingual WordNet and the bilingual corpora are projected from one language to the other to solve WSD.

In recent years, the exponential growth of the Web has led to an increased interest in multilinguality. Lefever and Hoste [86] introduced a SemEval task on cross-lingual WSD in SemEval-2010 that received 16 submissions. The corresponding dataset contains a collection of sense-annotated English sentences for a few words, with their contextually appropriate translations in Dutch, German, Italian, Spanish and French.

Recently, Banea and Mihalcea [13] explored the utility of features drawn from multiple languages for WSD. In their approach, a multilingual parallel corpus in four languages (English, German, Spanish, and French) is generated using Google Translate. For each example sentence in the training and test sets, features are drawn from multiple languages in order to generate more robust and more effective representations, known as "multilingual vector-space representations". Finally, training a multinomial Naive Bayes learner showed that a classifier based on multilingual

vector representations obtains an error reduction ranging from 10.58% to 25.96% as compared to the monolingual classifiers. Lefever [85] proposed a similar strategy for multilingual WSD using a different feature set and different machine learning algorithms. Along similar lines, [43] used the Lesk algorithm for unsupervised WSD applied on definitions translated into four languages, and obtained significant improvements compared to a monolingual application of the same algorithm. Although these three methodologies are closely related to our WikiTransSense system, our approach exploits a sense inventory and sense-tagged data extracted automatically from Wikipedia. Navigli and Ponzetto [124] proposed a different approach to multilingual WSD based on a large multilingual encyclopedic dictionary, called BabelNet [121], that was built from WordNet and Wikipedia. Their approach exploits the graph structure of BabelNet to identify complementary sense evidence from translations in different languages.

CHAPTER 4

WORD SENSE DISAMBIGUATION FRAMEWORKS

In this chapter, I introduce a method for generating sense-tagged data using Wikipedia. I also present our three WSD frameworks: the WikiMonoSense, WikiTransSense and WikiMuSense systems.

4.1. The WikiMonoSense System

A large number of the concepts mentioned in Wikipedia are explicitly linked to their corresponding article through the use of links or piped links. Interestingly, these links can be regarded as sense annotations for the corresponding concepts, which is a property particularly valuable for entities that are ambiguous. In fact, it is precisely this observation that I rely on in order to generate sense-tagged corpora starting with the Wikipedia annotations.

4.1.1. A Monolingual Dataset through Wikipedia Links

For example, ambiguous words such as "plant", "bar", or "chair" are linked to different Wikipedia articles depending on their meaning in the context where they occur. Note that the links are manually created by Wikipedia users, which means that they are most of the time accurate and reference the correct article. The following are five example sentences for the ambiguous word "bar", with their corresponding Wikipedia annotations (links):

(1) In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston.
(2) It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].
(3) Vehicles of this type may contain expensive audio players, televisions, video players, and [[bar (counter)|bar]]s, often with refrigerators.
(4) Jenga is a popular beer in the [[bar (establishment)|bar]]s of Thailand.
(5) This is a disturbance on the water surface of a river or estuary, often caused by the presence of a [[bar (landform)|bar]] or dune on the riverbed.

To derive sense annotations for a given ambiguous word, I use the links extracted for all the hyperlinked Wikipedia occurrences of the given word, and map these annotations to word senses. For instance, for the "bar" example above, I extract five possible annotations: bar (counter), bar (establishment), bar (landform), bar (law), and bar (music).

Although Wikipedia provides the so-called disambiguation pages that list the possible meanings of a given word, I decided to use instead the annotations collected directly from the Wikipedia links. This decision is motivated by two main reasons. First, a large number of the occurrences of ambiguous words are not linked to the articles mentioned by the disambiguation page, but to related concepts. This can happen when the annotation is performed using a concept that is similar, but not identical to the concept defined. For instance, the annotation for the word bar in the sentence “The blues uses a rhythmic scheme of twelve 4/4 [[measure (music)|bars]]” is measure (music), which, although correct and directly related to the meaning of bar (music), is not listed in the disambiguation page for bar.

Second, there are several inconsistencies that make it difficult to use the disambiguation pages in an automatic system. For example, for the word "bar", the Wikipedia page with the identifier bar is a disambiguation page, whereas for the word "paper", the page with the identifier paper contains a description of the meaning of paper as "material made of cellulose," and a different page, "paper (disambiguation)", is defined as a disambiguation page. Moreover, in other cases, such as the entries for the word "organization", no disambiguation page is defined; instead, the articles corresponding to different meanings of this word are connected by links labeled as "alternative meanings."

Therefore, rather than using the senses listed in a disambiguation page as the sense inven- tory for a given ambiguous word, I chose instead to collect all the annotations available for that word in the Wikipedia pages, and then cluster these together to form the sense inventory.

I derive a sense-tagged corpus following three main steps:

First, I extract all the paragraphs in Wikipedia that contain an occurrence of the ambiguous word as part of a link or a piped link. I select paragraphs based on the Wikipedia paragraph

segmentation, which typically lists one paragraph per line.1 Next, I collect all the possible labels for the given ambiguous word by extracting the leftmost component of the links. For instance, in the piped link "[[musical notation|bar]]", the label musical notation is extracted. In the case of simple links (e.g. "[[bar]]"), the word itself can also play the role of a valid label if the page it links to is not a disambiguation page.
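
The label-extraction step can be pictured with a small sketch over raw Wikipedia markup; the regular expression and function names are illustrative assumptions, and real Wikipedia dumps require considerably more careful parsing.

    # Sketch: collect candidate sense labels for a target word from Wikipedia
    # markup, using the leftmost component of piped links ([[label|word]])
    # and the word itself for simple links ([[word]]).
    import re

    LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

    def labels_for(target, paragraph):
        labels = []
        for m in LINK.finditer(paragraph):
            label, surface = m.group(1), m.group(2) or m.group(1)
            if surface.lower().startswith(target.lower()):
                labels.append(label)
        return labels

    text = "In 1834, Sumner was admitted to the [[bar (law)|bar]] in Boston."
    print(labels_for("bar", text))   # ['bar (law)']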

For the purpose of the experiments in this research, the sense inventory has been created from the Wikipedia labels, primarily using several simple heuristics. This step is mainly motivated by the fact that words often have a large number of labels (e.g., more than 100 labels for the word "bar" in the 2012 Wikipedia), which cannot be directly used as senses. The heuristics that I used are as follows (a small illustrative sketch follows the list):

(1) To focus on the problem of word sense disambiguation, rather than named entity recognition, I explicitly filter labels using a simple heuristic based on the capitalization of the annotations throughout Wikipedia. I keep a label only if the number of occurrences as a common noun is at least 20% of the total number of occurrences. For example, consider the word "atmosphere": "Atmosphere of Earth" is used as a sense 376 times for the word "atmosphere" throughout Wikipedia. Out of these 376 times, it occurred with the anchor text "Atmosphere" 210 times, i.e. as a candidate proper noun, and the remaining 166 times it occurred with "atmosphere", i.e. as a common noun. Since the total number of occurrences is 376, the threshold is 20 percent of 376, which is 75. Because there are 166 examples with the label "atmosphere" as a common noun, which is more than the threshold, I consider "Atmosphere of Earth" as a sense for the word "atmosphere", even though it is a proper noun according to the Wikipedia title capitalization rules.
(2) The next heuristic is based on resolving redirects, i.e., linking those labels that refer to the same article in Wikipedia: all redirect labels are resolved into the corresponding article labels. This is motivated by the fact that labels are often redundant (e.g., both "musical notation" and "Bar (music)" are used as labels for the word "bar"), or refer to senses that may be too fine grained.

1The average length of a paragraph is 80 words.

(3) The final heuristic is based on the frequency of the annotations. I eliminate all labels that have fewer examples than a certain frequency. For the purpose of this research, I set a threshold of 10. In this way, I ensure that each sense contributes a minimal number of training examples in the machine learning approach to WSD. Further, I reduced the threshold to 5 for labels of the form "Word (Concept)" (e.g., "Bar (establishment)" but not "pub").
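
The sketch below summarizes the three heuristics; the input dictionaries, the helper names and the reduction of redirect resolution to a simple lookup table are assumptions made for illustration, standing in for what would actually be extracted from a Wikipedia dump.

    # Sketch of the sense-inventory heuristics: capitalization filtering,
    # redirect resolution, and frequency thresholds.
    def build_sense_inventory(label_counts, lowercase_counts, redirects):
        # label_counts: label -> total occurrences as an annotation of the word
        # lowercase_counts: label -> occurrences where the anchor was a common noun
        # redirects: label -> canonical article title
        inventory = {}
        for label, total in label_counts.items():
            # Heuristic 1: keep labels used as a common noun >= 20% of the time
            if lowercase_counts.get(label, 0) < 0.20 * total:
                continue
            # Heuristic 2: collapse redirect labels into their target article
            canonical = redirects.get(label, label)
            inventory[canonical] = inventory.get(canonical, 0) + total
        # Heuristic 3: frequency thresholds (10 in general, 5 for "Word (Concept)")
        def threshold(label):
            return 5 if "(" in label else 10
        return {l: n for l, n in inventory.items() if n >= threshold(l)}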

4.1.2. An Example

As an example, consider the ambiguous word bar, with 3,784 examples extracted from Wikipedia where “bar” appeared as the rightmost component of a piped link or as a word in a simple link. Since the page with the identifier bar is a disambiguation page, all the examples containing the single link [[bar]] are removed, as the link does not remove the ambiguity. From the remaining examples, I extract their labels and group them into 23 different senses. Table 4.1 shows a subset of these labels, to illustrate the senses that can be extracted from the Wikipedia annotations.

Word sense           Labels in Wikipedia                              Wikipedia definition
bar (establishment)  bar (establishment), nightclub,                  a retail establishment which serves
                     gay club, pub                                    alcoholic beverages
bar (counter)        bar (counter)                                    the counter from which drinks are dispensed
bar (unit)           bar (unit)                                       a scientific unit of pressure
bar (music)          bar (music), measure music,                      a period of music
                     musical notation
bar (law)            bar association, bar law,                        the community of persons engaged
                     law society of upper canada,                     in the practice of law
                     state bar of california
bar (landform)       bar (landform)                                   a type of beach behind which lies a lagoon
bar (metal)          bar metal, pole (object)
bar (sports)         gymnastics uneven bars, handle bar
bar (solid)          candy bar, chocolate bar
TABLE 4.1. Word senses for the word "bar", based on annotation labels used in Wikipedia

4.1.3. Wikipedia and WordNet

It is interesting to note that Wikipedia has a different sense coverage and distribution compared to more "traditional" lexical resources such as WordNet [110]. For instance, the meaning of "ambiance" for the ambiguous word "atmosphere" does not appear at all in the Wikipedia corpus, although it has the highest frequency in other annotated data such as SENSEVAL. This is partly due to the coarser sense distinctions made in Wikipedia; for instance, Wikipedia does not make the distinction between the act of grasping and the actual hold for the noun "grip", and occurrences of both of these meanings are annotated with the label "grip (handle)".

There are also cases when Wikipedia makes different or finer sense distinctions than WordNet. For instance, there are several Wikipedia annotations for "image" as "copy", but this meaning is not even defined in WordNet. Similarly, Wikipedia makes the distinction between "dance performance" and "theatre performance", but both these meanings are listed under one single entry in WordNet ("performance" as "public presentation").

4.2. The WikiMonoSense Learning Framework

Provided a set of sense-annotated examples for a given ambiguous word, the task of a word sense disambiguation system is to automatically learn a disambiguation model that can predict the correct sense for a new, previously unseen occurrence of the word. Assuming that such a system can be reliably constructed, the implications are two-fold. First, accurate disambiguation models suggest that the data is reliable and consists of correct sense annotations. Second, and perhaps more importantly, the availability of a system able to correctly predict the sense of a word can have important implications for applications that require such information, including machine translation and automatic reasoning.

I use a word sense disambiguation system that integrates local and topical features within a machine learning framework, similar to several of the top-performing supervised word sense disambiguation systems participating in the recent SENSEVAL evaluations2.

The disambiguation algorithm starts with a preprocessing step, where the text is tokenized,

2http://www.senseval.org

stemmed and annotated with part-of-speech tags. Collocations are identified using a sliding window approach, where a collocation is defined as a sequence of words that forms a compound concept defined in Wikipedia. Next, local and topical features are extracted from the context of the ambiguous word. Specifically, I use the current word and its part-of-speech, a local context of three words to the left and right of the ambiguous word, the parts-of-speech of the surrounding words, the verb and noun before and after the ambiguous word, and a global context implemented through sense-specific keywords, determined as a list of words occurring at least three times in the contexts defining a certain word sense. I used TreeTagger for part-of-speech tagging3 and the Snowball stemmer4 for stemming, as they both have publicly available implementations for multiple languages. The features are integrated in a Naive Bayes classifier, which was selected mainly for its state-of-the-art performance in previous WSD systems.
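
The following sketch outlines the shape of the feature extraction just described; the tokens and part-of-speech tags are assumed to come from a preprocessing step (TreeTagger and Snowball stemming in the actual system), and the feature names are illustrative rather than the exact encoding used internally.

    # Sketch of local and topical feature extraction around a target position i.
    def extract_features(tokens, pos_tags, i, sense_keywords):
        feats = {"word": tokens[i], "pos": pos_tags[i]}
        # local context: three words to the left and right, with their POS tags
        for offset in range(-3, 4):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                feats[f"w{offset}"] = tokens[j]
                feats[f"p{offset}"] = pos_tags[j]
        # topical context: sense-specific keywords present anywhere in the sentence
        for kw in sense_keywords:
            if kw in tokens:
                feats[f"kw={kw}"] = 1
        return feats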

4.3. The WikiTransSense System

FIGURE 4.1. Architecture of WikiTransSense Framework

3www.cis.uni-muenchen.de/˜schmid/tools/TreeTagger
4snowball.tartarus.org

An airline seat is a chair on an airliner in which passengers are accommodated for the duration of the journey. Ein Flugzeugsitz ist ein Stuhl auf einem Flugzeug, in dem Passagiere für die Dauer der Reise untergebracht sind.

For a year after graduation, Stanley served as chair of belles-lettres at Christian College in Hustonville. Seit einem Jahr nach dem Abschluss, diente Stanley als Vorsitzender Belletristik bei Christian College in Hustonville.

FIGURE 4.2. English to German translations from Google Translate, with the target words aligned.

Consider the examples centered around the ambiguous noun "chair", as shown in Figure 4.2, where English is the reference language and German is a supporting language. The figure shows only 2 out of the 5 possible meanings from the Wikipedia sense inventory. The two examples illustrate two important ways in which the translation can help disambiguation. First, two different senses of the target ambiguous word may be translated into different words in the supporting language. Therefore, assuming access to word alignments, knowledge of the target word translation can help in disambiguation. Second, features extracted from the translated sentence can be used to enrich the feature space. Even though the target word translation is a strong feature in general, there may be cases where different senses of the target word are translated into the same word in the supporting language. For example, the two senses "bar (unit)" and "bar (establishment)" of the English word "bar" translate to the same German word "bar". In cases like this, words in the context of the German translation may help in identifying the correct English meaning. While this multilingual representation can sometimes result in redundancy when there is a one-to-one translation between languages, in most cases the translations enrich the feature space, by either indicating that two features in English share the same meaning or by disambiguating ambiguous English features through different translations. Appending multilingual features to the monolingual vector therefore generates a more orthogonal vector space.

4.3.1. A Multilingual Dataset through Machine Translation

In order to generate a multilingual representation for the monolingual dataset, I used Google Translate to translate the data from English into several other languages. The use of Google

Translate is motivated by the fact that Google's statistical machine translation system is available for many languages. Furthermore, through the University Research Program, Google Translate also provides the word alignments. Given a target word in an English sentence, I used the word alignments to identify the position of the target word translation in the translated sentence. Each of the four languages is used as a reference language, with the remaining three used as supporting languages. Additionally, French was added as a supporting language in all the multilingual systems, which means that each reference sentence was translated into four supporting languages.

4.3.2. The WikiTransSense Learning Framework

Figure 4.1 shows the architecture of our WikiTransSense system with English as the reference language and three supporting languages, namely Italian, German and Spanish. Similar to the WikiMonoSense approach described in Section 4.2, I extract the same types of features from the reference sentence, as well as from the translations in each of the supporting languages, which are represented as "Tran-It", "Tran-Es" and "Tran-De" in Figure 4.1. Correspondingly, the feature vector contains a section with the reference language features, followed by a multilingual section containing features extracted from the translations in the supporting languages. The resulting multilingual feature vectors are then used with a Naive Bayes classifier.
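
A hedged sketch of this multilingual vector construction: features from the reference sentence and from each supporting-language translation are kept in disjoint namespaces before being handed to the classifier. The prefixing scheme and the dictionary-based feature representation are assumptions made for illustration.

    # Sketch: concatenate reference-language features with features extracted
    # from each supporting-language translation, using language prefixes to
    # keep the sections of the vector disjoint (e.g. "Tran-De:Stuhl").
    def multilingual_vector(ref_feats, translated_feats_by_lang):
        vector = {f"Ref:{k}": v for k, v in ref_feats.items()}
        for lang, feats in translated_feats_by_lang.items():
            vector.update({f"Tran-{lang}:{k}": v for k, v in feats.items()})
        return vector

    print(multilingual_vector({"word": "chair"},
                              {"De": {"word": "Stuhl"}, "Es": {"word": "silla"}}))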

4.4. The WikiMuSense System

If one were to train a WSD system for all ambiguous nouns, the large number of translations required may be prohibitive. In order to reduce the dependency on the machine translation system, I developed a second multilingual approach to WSD, WikiMuSense, that exploits the availability of Wikipedia in multiple languages. Another key advantage of this approach is that the features extracted from each supporting language come from independent contexts found in that language's Wikipedia, and therefore enrich the feature space with less redundant multilingual context features than in our WikiTransSense system, where the supporting features are derived from translations of the same reference sentence.

4.4.1. A Multilingual Dataset through Interlingual Wikipedia Links

Wikipedia articles on the same topic in different languages are often connected through inter-lingual links. These are the small navigation links that show up in the “Languages” sidebar

in most Wikipedia articles. For example, the English Wikipedia sense "bar (music)" is connected through an interlingual link to the German Wikipedia sense "Takt (Musik)". Given a sense inventory for a word in the reference language, I automatically build the sense repository for a supporting language by following the interlingual links connecting equivalent senses in the two languages. Thus, given the English sense repository for the word "bar" EN = {bar (establishment), bar (landform), bar (law), bar (music)}, the corresponding German sense repository will be DE = {Bar (Lokal), Sandbank, NIL, Takt (Musik)}5. The resulting sense repositories can then be used in conjunction with Wikipedia links to build sense-tagged corpora in the supporting languages, using the approach described in Section 4.1.1.

FIGURE 4.3. Architecture of the WikiMuSense system

However, this approach poses the following two problems:

5NIL indicates that there is no corresponding page for bar (law) in the German Wikipedia

(1) Sense gaps problem: There may be reference language senses that do not have interlingual links to the supporting language. In the "bar" example above, the English sense "bar (law)" does not have an interlingual link to German.
(2) Multilingual bias problem: The distribution of examples per sense in the automatically created sense-tagged corpus for the supporting language may be different from the corresponding distribution for the reference language. Previous work [8, 7] has shown that WSD performance is sensitive to differences between the two distributions.

4.4.1.1. Sense Gaps Problem. I address the first problem using a very simple approach: when- ever there is a sense gap, I randomly sample a number of examples for that sense in the reference language and use Google Translate to create examples in the supporting language.

4.4.1.2. Multilingual Bias Problem. To address this problem, I introduce a statistical method based on [7] that automatically identifies the number of examples per sense to be taken from the supporting language examples and from translations of the reference language examples in order to build the sense-annotated corpus in the supporting language. With this statistical approach, the number of examples per sense is proportional to the bias of the word senses in the reference language.

Definition: Let W be a word in the reference language and let n be the number of senses of W, represented as si for i = 1..n. Let NRsi represent the number of examples extracted for a sense si from the reference language Wikipedia, let NSsi represent the number of examples extracted for the sense si from the corresponding article in the supporting language Wikipedia, identified through the interlingual link, and let RPsi represent the percentage of examples for the sense si in the reference language.

To build the supporting language data, I first empirically assigned 500 examples to the most frequent sense (MFS) from the reference language, and assigned the other senses a number of examples proportional to the reference language bias. The proportion distribution (PD) is measured using Equation 5.

PD_{s_i} = \begin{cases} 500 & \text{if } s_i = \text{MFS} \\ 500 \times RP_{s_i} / RP_{\text{MFS}} & \text{otherwise} \end{cases}    (5)

However, in some cases the distribution of the reference language bias and that of the available supporting language examples do not fit. The problem arises when there are not enough examples in the supporting language to fill the expectations of a certain word sense.

To address this problem of not having enough examples, I define a threshold T_{s_i} for each

sense s_i, for i = 1..n. Equation 6 states that, for a given sense s_i, the maximum number of reference language examples whose translations I can use to create supporting language examples, when there are not enough examples in the supporting language, is 90% of the reference language examples for that sense. The reason for using only 90% of the reference language examples is that our system uses the remaining 10% as test data (10-fold cross-validation).

T_{s_i} = \begin{cases} \frac{90}{100} \times NR_{s_i} & \text{if } NS_{s_i} < NR_{s_i} \text{ and } NS_{s_i} < 50 \\ NS_{s_i} & \text{otherwise} \end{cases}    (6)

Subsequently, I calculated the minimum ratio between the number of examples available and the target bias, over all senses, for a given set of supporting language examples. This minimum ratio is calculated using Equation 7.

MinR = \min_i \left[ T_{s_i} / PD_{s_i} \right]    (7)

Once the minimum ratio is obtained, I calculate the minimum ratio distribution (MRD) of the examples in the supporting language, based on the distribution of examples in the reference language. This is measured using Equation 8.

MRD_{s_i} = MinR \times PD_{s_i}    (8)

NT_{s_i} = MRD_{s_i} - NS_{s_i}    (9)
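
The computation in Equations 5-9 can be summarized with the following sketch. The 90% factor and the threshold of 50 follow the description above, while the truncation to non-negative integers in the last step is an assumption made so that senses that already have enough supporting examples require no translations.

    # Sketch of the supporting-language sampling procedure (Equations 5-9).
    # NR, NS: per-sense example counts in the reference / supporting Wikipedia
    # (NS uses 0 where there is no interlingual link).
    def supporting_language_plan(NR, NS, mfs):
        total = sum(NR.values())
        RP = {s: 100.0 * n / total for s, n in NR.items()}
        PD = {s: 500.0 if s == mfs else 500.0 * RP[s] / RP[mfs] for s in NR}   # Eq. 5
        T  = {s: 0.9 * NR[s] if NS[s] < NR[s] and NS[s] < 50 else NS[s]        # Eq. 6
              for s in NR}
        min_ratio = min(T[s] / PD[s] for s in NR)                              # Eq. 7
        MRD = {s: min_ratio * PD[s] for s in NR}                               # Eq. 8
        NT  = {s: int(max(0.0, MRD[s] - NS[s])) for s in NR}                   # Eq. 9
        return MRD, NT

    # The "argument" example of Table 4.2 (English reference, German supporting):
    NR = {"Logical Argument": 289, "Argument (linguistics)": 61,
          "Argument (computer science)": 54, "Argument (complex analysis)": 22,
          "Argument of a function": 8}
    NS = {"Logical Argument": 139, "Argument (linguistics)": 26,
          "Argument (computer science)": 101, "Argument (complex analysis)": 0,
          "Argument of a function": 0}
    print(supporting_language_plan(NR, NS, "Logical Argument"))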

Example: Table 4.2 shows the detailed steps to calculate the minimum ratio distribution (MRD) for the word "argument", given English as the reference language and German as the supporting language. The MRD column indicates how, using this approach, I obtain supporting language data that better

reflects the proportion of examples in the reference language. This approach addresses both problems and at the same time requires very few translations, as indicated in the NT column. The number of required translations to fill the gaps is calculated using Equation 9.

Sense                          NR    RP    NS    PD     T    Ratio  MRD  NT
Logical Argument               289   66.6  139   500.0  139  0.28   139  0
Argument (linguistics)         61    14.0  26    105.5  54   0.51   29   3
Argument (computer science)    54    12.5  101   93.4   101  1.08   25   0
Argument (complex analysis)    22    5.1   NIL   38.0   19   0.57   10   10
Argument of a function         8     1.8   NIL   13.8   7    0.43   3    3
TABLE 4.2. Minimum ratio distribution (MRD) and the required number of translations (NT) for building the German corpus, given English as the reference language, for the word "argument." The table also indicates the number of examples in the reference language (NR), the number of examples in the supporting language (NS), the reference language percentages (RP), the proportional distributions (PD), the threshold (T) and the ratio. The minimum ratio is the smallest value in the Ratio column. NIL indicates that no interlingual link is available from the English Wikipedia to the German Wikipedia for that particular sense.

4.4.2. The WikiMuSense Learning Framework

Figure 4.3 shows the architecture of our WikiMuSense system with English as the reference language and Italian, German and Spanish as the three supporting languages. Once the datasets in the supporting languages are created using the method above, I train a Naive Bayes classifier for each language (reference or supporting). Note that the classifiers built for the supporting languages use the same senses/classes as the reference classifier. Thus, for the word "bar" in the example above, if English is the reference language and German is a supporting language, the Naive Bayes classifier for German will compute probabilities for the four English senses, even though it is trained and tested on German sentences.

For each classifier, the features are extracted using the same approach as in the WikiMonoSense system.

At test time, the reference sentence is translated into all four supporting languages using

Google Translate. The five probabilistic outputs – one from the reference classifier (P_R) and four from the supporting classifiers (P_S) – are combined into an overall disambiguation score using Equation 10

46 below. Finally, disambiguation is done by selecting the sense that obtains the maximum score.

P = P_R + \sum_{S} P_S \times \min(1, |D_S| / |D_R|)    (10)

In Equation 10, D_R is the set of training examples in the reference language R, whereas D_S is the set of training examples in a supporting language S. When the number of training examples in a supporting language is smaller than the number of examples in the reference language, the probabilistic output from the corresponding supporting classifier has a weight smaller than 1 in the disambiguation score, and thus a smaller influence on the disambiguation output. In general, the influence of a supporting classifier is always less than or equal to the influence of the reference classifier.
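
A small sketch of the score combination in Equation 10; the classifiers are represented abstractly by their probability distributions over the reference-language senses, and the example numbers are invented for illustration.

    # Sketch of Equation 10: combine the reference classifier's probabilities
    # with down-weighted probabilities from each supporting classifier.
    def combine(p_ref, p_support, n_ref_examples, n_support_examples):
        # p_ref: {sense: probability} from the reference classifier
        # p_support: {language: {sense: probability}}
        # n_support_examples: {language: training set size in that language}
        scores = dict(p_ref)
        for lang, probs in p_support.items():
            weight = min(1.0, n_support_examples[lang] / n_ref_examples)
            for sense, p in probs.items():
                scores[sense] += weight * p
        return max(scores, key=scores.get)

    p_ref = {"bar (music)": 0.6, "bar (law)": 0.4}
    p_support = {"De": {"bar (music)": 0.7, "bar (law)": 0.3}}
    print(combine(p_ref, p_support, n_ref_examples=500,
                  n_support_examples={"De": 400}))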

CHAPTER 5

EXPERIMENTAL RESULTS

The primary focus of this chapter is to present the experiments performed in this setting for supervised WSD.

5.1. Experiments and Results

In our experiments, the WSD dataset was built for a subset of the ambiguous words used during the SENSEVAL-2 and SENSEVAL-3 evaluations, and a subset of ambiguous words in four languages: English, Spanish, Italian and German. Since the Wikipedia annotations are focused on nouns (associated with the entities typically defined by Wikipedia), the sense annotations I generate and the WSD experiments are also focused on nouns. I also avoided those words that have only one Wikipedia label. This resulted in a set of 105 words in four different languages: 30 for English, 25 for Italian, 25 for Spanish, and 25 for German. I generate sense-tagged data for these ambiguous words following the approach described in Chapter 4, and use these datasets for two main disambiguation experiments. I used two different accuracy metrics to report the performance of the proposed WSD systems (a short sketch of both metrics follows the list):

(1) "macro accuracy": an accuracy number was calculated separately for each ambiguous word. Macro accuracy was then computed as the average of these per-word accuracy numbers.
(2) "micro accuracy": the system outputs for all ambiguous words were pooled together, and the micro accuracy was computed as the percentage of instances that were disambiguated correctly.
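
The two metrics can be stated compactly as follows; the per-word lists of system predictions and gold senses are assumed to be available, and the tiny example data is invented.

    # Sketch of the two evaluation metrics. 'results' maps each ambiguous word
    # to a list of (predicted_sense, gold_sense) pairs over its test instances.
    def macro_and_micro_accuracy(results):
        per_word = [sum(p == g for p, g in pairs) / len(pairs)
                    for pairs in results.values()]
        macro = sum(per_word) / len(per_word)
        correct = sum(p == g for pairs in results.values() for p, g in pairs)
        total = sum(len(pairs) for pairs in results.values())
        micro = correct / total
        return macro, micro

    results = {"bar":   [("bar (music)", "bar (music)"), ("bar (law)", "bar (music)")],
               "chair": [("chair (furniture)", "chair (furniture)")]}
    print(macro_and_micro_accuracy(results))   # (0.75, 0.666...)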

Tables 5.1 and 5.2 show the macro and micro accuracies for the three systems. The tables also show the accuracy of a simple WSD baseline that selects the most frequent sense (MFS). Tables 5.3, 5.4, 5.5 and 5.6 show the disambiguation results using the three proposed word sense disambiguation frameworks described in Chapter 4, in a ten-fold cross-validation evaluation applied on the 2012 Wikipedia data for four languages: English, Spanish, German and Italian. For each word, the tables also show the number of senses, the total number of examples, and a simple baseline that selects the most frequent sense by default.1

Language   MFS    WikiMonoSense   WikiTransSense   WikiMuSense
English    62.2   78.9            81.9             81.3
German     69.5   81.2            84.6             85.6
Italian    66.0   81.8            84.0             84.7
Spanish    66.8   76.0            78.7             79.7
TABLE 5.1. WSD macro accuracies.

Language   MFS    WikiMonoSense   WikiTransSense   WikiMuSense
English    59.2   79.3            80.6             80.3
German     75.6   83.9            86.5             87.0
Italian    74.3   84.6            86.3             87.5
Spanish    72.6   79.8            81.1             82.7
TABLE 5.2. WSD micro accuracies.

1Note that this baseline assumes the availability of a sense tagged corpus in order to determine the most frequent sense of a word. The baseline is therefore “informed,” as compared to a random, “uninformed” sense selection.

word         #s    #ex    MFS    WikiMonoSense   WikiTransSense   WikiMuSense
argument     5     434    66.6   77.9            78.3             82.0
arm          4     289    73.7   84.1            85.8             85.8
atmosphere   5     1037   51.9   64.9            65.9             64.6
bank         5     237    40.5   78.0            82.7             81.9
bar          17    3211   28.7   72.6            73.0             72.1
chair        5     599    46.1   72.5            73.8             74.3
channel      11    813    38.4   81.4            82.9             82.0
circuit      9     460    31.1   58.0            59.6             48.9
degree       11    1945   56.0   85.9            87.2             84.1
detention    4     141    75.9   65.9            76.5             71.6
difference   3     37     27.5   81.7            85.0             81.1
disc         6     319    30.7   83.1            83.1             74.9
dyke         3     298    49.3   82.9            85.6             82.6
fatigue      3     600    72.7   87.5            87.8             90.5
grip         2     63     79.8   79.5            87.6             87.3
image        5     552    68.1   77.7            80.6             85.1
interest     3     565    97.2   93.1            94.5             94.0
material     5     412    92.2   88.8            91.2             84.5
mouth        3     637    80.1   83.1            86.3             86.3
nature       2     1047   96.9   90.2            90.2             94.4
paper        8     1482   92.1   87.0            88.5             93.7
party        7     785    48     74.1            78.7             68.3
performance  5     724    80.5   65.1            68.8             73.2
plan         5     121    74.4   75.1            82.6             77.7
post         3     83     60.4   74.0            90.6             89.2
sense        4     296    87.5   88.5            88.2             91.9
shelter      4     52     26.7   61.3            73.3             78.8
source       5     268    70.5   88.9            85.5             87.3
spade        2     98     73.6   82.3            79.2             89.8
stress       6     1353   48.9   83.4            83.4             80.0
AVERAGE      5.3   632    62.2   78.9            81.9             81.3
TABLE 5.3. Word sense disambiguation results on English, including one baseline (MFS = most frequent sense) and the three proposed word sense disambiguation systems, based on Wikipedia 2012 data. The number of senses (#s) and number of examples (#ex) are also indicated.

word         #s    #ex    MFS    WikiMonoSense   WikiTransSense   WikiMuSense
arco         6     659    35.4   57.8            61.0             59.5
argumento    3     144    88.2   71.7            75.2             83.3
atmósfera    3     1860   90.5   88.0            87.4             88.9
banda        4     412    55.8   67.5            68.7             66.0
barra        3     45     49.5   84.5            91.0             88.9
bomba        4     166    62.1   79.0            85.6             80.7
cámara       6     166    33.7   67.1            75.9             68.7
circuito     3     148    67.6   49.1            51.3             62.8
columna      6     736    77.9   87.2            88.3             91.8
copa         2     74     92.1   83.8            85.4             81.1
corona       9     148    22.3   53.2            59.5             55.4
disco        9     651    40.2   69.6            71.0             72.5
gato         2     566    96.1   94.5            92.9             96.6
grado        7     415    78.1   86.1            86.1             91.8
imagen       3     210    88.1   78.1            79.0             86.7
letra        2     420    82.9   74.8            79.5             70.5
masa         2     868    92.5   88.5            91.8             94.8
mina         4     418    62.5   76.8            80.4             84.7
período      4     1160   91.6   96.0            96.6             97.6
resistencia  11    349    30.4   63.0            64.5             65.6
serie        4     330    82.1   88.8            92.1             91.5
sierra       6     286    64.3   63.9            70.2             72.7
silla        2     130    89.2   76.9            79.2             86.2
tabla        4     159    31.5   81.2            81.8             77.4
teclado      6     1576   65.9   73.9            73.0             76.6
average      4.6   484    69.5   81.2            84.6             85.6
TABLE 5.4. Word sense disambiguation results on Spanish, using Wikipedia 2012 data. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated.

word         #s    #ex    MFS    WikiMonoSense   WikiTransSense   WikiMuSense
arm          4     162    54.3   80.9            90.7             93.8
becken       5     562    47.7   80.8            83.6             83.6
bild         6     232    42.7   58.1            66.8             70.3
brücke       4     1056   87.4   85.8            88.2             90.7
decke        4     177    37.9   76.2            82.4             80.2
fisch        5     1453   85.4   81.6            82.6             79.9
flügel       11    513    28.9   68.6            73.3             74.7
fuge         3     316    85.1   79.8            84.8             81.6
fusion       7     849    47.9   88.7            92.6             90.6
grad         5     213    65.8   85.9            86.9             93.0
kadenz       4     311    46.9   83.3            87.1             87.1
kanal        4     356    90.2   85.4            90.5             91.6
klinge       2     256    89.1   88.6            89.4             92.6
legende      3     867    97.1   91.9            93.9             96.9
mine         6     590    53.1   59.8            60.0             63.7
nadel        3     99     92     87.8            95.9             98.0
natur        6     654    90.5   88.2            90.1             89.6
norm         6     819    52.6   76.2            79.4             79.6
post         3     365    97     93.7            95.6             97.0
sinn         3     119    52.1   77.4            79.9             79.8
stern        3     1514   89.6   92.7            93.7             92.5
stollen      4     626    96.6   91.9            95.0             96.8
stuhl        3     90     48.9   62.2            61.1             66.7
zelle        4     1176   94.7   91.8            94.0             96.7
zug          7     371    64.4   73.9            78.2             73.0
average      4.5   550    69.5   81.2            84.6             85.6
TABLE 5.5. Word sense disambiguation results on German, using Wikipedia 2012 data. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated.

word           #s    #ex    MFS    WikiMonoSense   WikiTransSense   WikiMuSense
arco           7     696    69.8   82.6            86.6             87.9
atmosfera      6     977    85.2   80.7            82.3             89.4
canale         5     427    48     72.3            74.7             75.6
cavallo        7     1432   91.7   90.3            91.1             92.9
centro         13    513    37.8   83.6            80.7             89.1
circuito       4     108    48.2   72.9            75.9             73.1
complesso      5     356    57.3   87.9            88.5             90.2
differenziale  2     311    90.4   95.2            93.9             97.1
elemento       4     293    80.9   69.8            85.6             69.5
fase           2     212    65.1   90.1            91.0             91.0
grado          6     405    43.7   86.2            88.2             88.9
gruppo         11    1742   57.3   86.2            87.2             84.6
lettera        2     311    57.2   87.5            93.6             94.9
massa          5     1236   93     87.1            89.2             93.7
ordine         12    2801   88.6   91.0            90.7             92.2
pesca          5     1150   75.6   75.6            78.9             84.7
pianta         2     2667   89.8   90.1            93.5             93.8
piatto         4     118    48.4   63.4            65.1             66.9
porta          6     264    44.3   52.3            56.1             57.6
radice         6     816    67.2   84.9            85.1             89.5
resistenza     10    1606   71.8   78.6            79.5             79.6
rete           10    653    34     74.4            74.0             71.2
seno           2     284    62.7   91.6            96.1             92.3
spazio         6     649    49.2   78.3            78.1             78.9
testa          2     499    92.2   92.0            94.2             93.8
average        5.4   815    66.0   81.8            84.0             84.7
TABLE 5.6. Word sense disambiguation results on Italian, using Wikipedia 2012 data. In addition to a baseline (MFS = most frequent sense) and the word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated.

5.2. Discussion on WikiMonoSense System

Overall, the Wikipedia-based sense annotations were found to be reliable, leading to accurate sense classifiers for the WikiMonoSense system, with average relative error reductions of 44%, 38%, 44%, and 28% compared to the most frequent sense baseline in terms of macro accuracy. WikiMonoSense performed better than the MFS baseline for 76 out of the 105 words in the four languages, which further indicates that Wikipedia data can be useful for creating accurate and robust WSD systems. There were a few exceptions to this general trend. Several words in the dataset had highly skewed sense distributions, such as "nature", which has a total of 1,047 examples, out of which 1,015 pertain to the meaning "nature (physical world)", or the word "material", with 380 out of 412 examples annotated with the meaning "material (substance)".

5.2.1. Learning Curve for WikiMonoSense System

[Plot omitted: classification accuracy (averaged %) as a function of the fraction of training data (%), with one learning curve per language (EN, DE, ES, IT).]

FIGURE 5.1. Learning curves for WikiMonoSense.

One aspect that is particularly relevant for any supervised system is the learning rate with respect to the amount of available data. To determine the learning curve for the WikiMonoSense system, I measured the disambiguation accuracy under the assumption that only a percentage of the data were available. I ran 10-fold cross-validation experiments using 10%, 20%, ..., 100% of the data, and averaged the results over all the words in the data set. The learning curves for the four languages are plotted in Figure 5.1.

Overall, the curves indicate a continuously growing accuracy with increasingly larger amounts of data. Although the learning pace slows down after a certain number of examples (about 50% of the data currently available), the general trend of the curves seems to indicate that more data is likely to lead to increased accuracy. Given that Wikipedia is growing at a fast pace, the curves suggest that the accuracy of the word sense classifiers built on this data is likely to increase for future versions of Wikipedia.
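The procedure behind these learning curves can be sketched as follows. The snippet assumes a generic scikit-learn style classifier and a per-word dataset of feature vectors and sense labels; it only illustrates the protocol (training fractions from 10% to 100%, 10-fold cross-validation, accuracies averaged over words), not the exact features or learner used in WikiMonoSense.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB


def learning_curve(datasets, fractions=np.arange(0.1, 1.01, 0.1), seed=0):
    """datasets: list of (X, y) pairs, one per ambiguous word, where X holds
    non-negative bag-of-words feature vectors and y the Wikipedia sense labels."""
    rng = np.random.RandomState(seed)
    averaged = []
    for frac in fractions:
        word_accuracies = []
        for X, y in datasets:
            # Random subset of the sense-tagged examples for this fraction.
            n = max(10, int(frac * len(y)))
            idx = rng.permutation(len(y))[:n]
            scores = cross_val_score(MultinomialNB(), X[idx], y[idx], cv=10)
            word_accuracies.append(scores.mean())
        averaged.append(100.0 * float(np.mean(word_accuracies)))
    return list(zip(list(fractions), averaged))
```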

5.3. Discussion on WikiTransSense and WikiMuSense Systems

Compared to the monolingual WikiMonoSense system, the multilingual WikiTransSense system obtained an average relative error reduction of 13.7%, thus confirming the utility of using translated contexts. Relative to the MFS baseline, WikiTransSense performed better on 83 of the 105 words. Finally, WikiMuSense had an even higher average error reduction of 16.5% with respect to WikiMonoSense, demonstrating that the multilingual data available in Wikipedia can successfully replace the machine translation component during training. Relative to the MFS baseline, the multilingual WikiMuSense system performed better on 89 out of the 105 words.

Since WikiMuSense is proposed in order to reduce the dependency on the machine translation system compared to WikiTransSense, I measured the number of translations required by the WikiTransSense and WikiMuSense systems. Table 5.7 shows the number of sentence translations required to train the two systems. As expected, due to the use of interlingual links and sense-annotated corpora from multilingual Wikipedias, the number of translations required by WikiMuSense is substantially smaller than the number required by WikiTransSense.

Language  WikiTransSense  WikiMuSense
English  75,832  13,151
German  54,984  8,901
Italian  81,468  4,697
Spanish  48,384  6,560

TABLE 5.7. Total number of sentence translations per language, in the two multilingual approaches.

[Plot omitted: classification accuracy (averaged %) as a function of the number of supporting languages (0 to 4), with one curve per reference language (EN, DE, SP, IT) for each of the two systems, WikiTransSense and WikiMuSense.]

FIGURE 5.2. Impact of the number of supporting languages on the two multilingual WSD systems.

5.3.1. Learning Curves for Multilingual WSD Systems

One aspect that is particularly relevant for the multilingual WSD systems is the impact that the number of supporting languages has on their performance. Both WikiTransSense and WikiMuSense were evaluated using all possible combinations of 1, 2, 3, and 4 supporting languages. The resulting macro accuracy numbers were then averaged for each number of supporting languages. Figure 5.2 indicates that the accuracies continue to improve as more languages are added, for both multilingual systems.
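The evaluation over language subsets can be sketched as follows. The language list and the evaluate(reference, supporting) function are illustrative assumptions standing in for training and scoring one of the multilingual systems with a given set of supporting languages.

```python
from itertools import combinations

# Illustrative pool of languages; the reference language is excluded from its
# own supporting set.
LANGUAGES = ["en", "de", "es", "it", "fr"]


def accuracy_by_num_supporting(reference, evaluate):
    """Average macro accuracy over all supporting-language subsets of each size.

    evaluate(reference, supporting) is assumed to train and score a multilingual
    system with the given tuple of supporting languages and return its macro accuracy."""
    pool = [lang for lang in LANGUAGES if lang != reference]
    results = {}
    for k in range(1, len(pool) + 1):
        scores = [evaluate(reference, combo) for combo in combinations(pool, k)]
        results[k] = sum(scores) / len(scores)
    return results
```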

Another relevant aspect is the dependency between the amount of data available in the supporting languages and the performance of the WikiMuSense system, which I analyzed using two different settings. In the first setting, I ran 10-fold cross-validation experiments using all the data from the reference language while varying the amount of supporting language data from 10% to 100% in all supporting languages. The accuracy results were averaged over all the words. Figure 5.3 shows the learning curves for the four languages. When using 0% of the supporting data, the results correspond to the monolingual WikiMonoSense system; when using 100% of the supporting data, the results correspond to the final multilingual WikiMuSense system. WikiMuSense starts to perform better than WikiMonoSense when at least 70-80% of the available supporting data is used, and continues to increase its performance with increasing amounts of supporting language data.

[Plot omitted: classification accuracy (averaged %) as a function of the fraction of supporting language data (%), with one learning curve per language (EN, DE, ES, IT).]

FIGURE 5.3. Learning curves for WikiMuSense with respect to the supporting language examples.

[Plot omitted: classification accuracy (averaged %) as a function of the fraction of data (%), with one learning curve per language (EN, DE, ES, IT).]

FIGURE 5.4. Learning curves for WikiMuSense with respect to the reference language examples.

In the second setting, I ran 10-fold cross-validation experiments using all the data from the reference language while varying the amount of supporting language data from 10% to 100% of the number of reference language examples, in all supporting languages. The accuracy results were averaged over all the words. Figure 5.4 shows the learning curves for the four languages. When using 0% of the supporting data, the results correspond to the monolingual WikiMonoSense system.

When the fraction is 100%, the number of supporting examples is equal to the number of examples in the reference language. WikiMuSense starts to perform better than WikiMonoSense as the number of supporting language examples gets closer to the number of reference language examples, except for English. The reason may be that the supporting language classifiers are not as accurate as the English classifier; as more supporting data is used, their importance in formula 10 increases and they start to dominate the English output. I therefore continued to increase the supporting data for English from 110% to 200% of the number of reference language examples, and the system reached the performance of the final WikiMuSense system when the supporting data amounted to 150-160% of the reference language examples.
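The combination behavior discussed above, where the simple addition of posterior probabilities can be replaced by a weighted sum that keeps the more accurate reference classifier from being dominated, can be sketched as follows; the weights and the per-language probability values are illustrative assumptions, not the exact formulation used in WikiMuSense.

```python
def combine_posteriors(posteriors, weights=None):
    """posteriors: {language: {sense: probability}}, reference language included.
    weights: optional {language: weight}; equal weights reproduce the simple
    addition rule used to combine the classifier outputs."""
    weights = weights or {lang: 1.0 for lang in posteriors}
    combined = {}
    for lang, dist in posteriors.items():
        for sense, prob in dist.items():
            combined[sense] = combined.get(sense, 0.0) + weights[lang] * prob
    return max(combined, key=combined.get)


# Illustrative example: with equal weights the (noisier) supporting classifiers
# outvote the reference classifier; down-weighting them restores its prediction.
example = {"en": {"bank#1": 0.7, "bank#2": 0.3},
           "de": {"bank#1": 0.3, "bank#2": 0.7},
           "es": {"bank#1": 0.4, "bank#2": 0.6}}
print(combine_posteriors(example))                                             # bank#2
print(combine_posteriors(example, weights={"en": 1.0, "de": 0.4, "es": 0.4}))  # bank#1
```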

5.3.2. WikiMuSense Without Translations During Training

Since WikiMuSense still uses machine translation when interlingual links are missing, I ran an additional experiment in which MT was completely removed during training, in order to demonstrate the advantage of the sense-annotated corpora available in the supporting language Wikipedias. Thus, for the 105 ambiguous words, I eliminated all senses that required machine translation to fill the sense gaps. After filtering, 52 words from the four languages had two or more senses in Wikipedia for which interlingual links were available in all the supporting languages. The results averaged over the 52 words are shown in Table 5.8 and demonstrate that WikiMuSense still outperforms WikiMonoSense significantly. Compared to the monolingual WikiMonoSense system, the multilingual WikiMuSense system obtained an average relative error reduction of 20.4%, thus confirming the usefulness of the multilingual sense-annotated corpora available in Wikipedia. Relative to the WikiMonoSense system, the multilingual WikiMuSense system performed better on 40 out of the 52 words.
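The filtering step used for this experiment can be sketched as follows; the data structure (for each word, the set of supporting Wikipedias that contain an interlingually linked article for each sense) is an illustrative assumption.

```python
def words_without_sense_gaps(sense_inventory, supporting_langs):
    """sense_inventory: {word: {sense: set of languages with a linked article}}.
    Keeps only the senses covered in every supporting language, then keeps only
    the words that remain ambiguous (two or more senses) after filtering."""
    required = set(supporting_langs)
    kept = {}
    for word, senses in sense_inventory.items():
        covered = {sense for sense, langs in senses.items() if required <= langs}
        if len(covered) >= 2:
            kept[word] = covered
    return kept
```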

Accuracy  WikiMonoSense  WikiMuSense
Macro  83.9  87.2
Micro  87.5  89.8

TABLE 5.8. WSD performance with no sense gaps.

5.4. Word Sense Disambiguation in a Coarse-Grained Setting

I have also evaluated the proposed WSD systems in a coarse-grained setting on the same dataset. Two annotators were provided with the sense inventory automatically extracted from Wikipedia, along with the corresponding Wikipedia articles, and were asked to discuss and create clusters of senses for the 105 words in the four languages. I then removed all the words that were left with only one label after clustering, which resulted in 103 words. The macro and micro accuracies on this coarse-grained sense inventory, shown in Tables 5.9 and 5.10, indicate that our multilingual systems outperform the monolingual system.

A trend similar to the fine-grained setting is observed in the coarse-grained setting. The Wikipedia-based sense annotations were found to be reliable, leading to accurate sense classifiers for the WikiMonoSense system with average relative error reductions of 53%, 42%, 56%, and 36% compared to the most frequent sense baseline in terms of macro accuracy. WikiMonoSense performed better than the MFS baseline for 72 out of the 103 words in the four languages, which further indicates that Wikipedia data can be useful for creating accurate and robust WSD systems.

Compared to the monolingual WikiMonoSense system, the multilingual WikiTransSense system obtained an average relative error reduction of 14%, thus confirming the utility of using translated contexts. Relative to the MFS baseline, WikiTransSense performed better on 79 of the 103 words. Finally, WikiMuSense had an even higher average error reduction of 19.4% with respect to WikiMonoSense, demonstrating that the multilingual data available in Wikipedia can successfully replace the machine translation component during training. Relative to the MFS baseline, the multilingual WikiMuSense system performed better on 86 out of the 103 words.

Tables 5.11, 5.12, 5.13, and 5.14 show the disambiguation results obtained with the three proposed word sense disambiguation frameworks in a ten-fold cross-validation evaluation applied to the 2012 Wikipedia data for the four languages (English, Spanish, German, and Italian) in the coarse-grained setting. For each word, the tables also show the number of senses, the total number of examples, and a simple baseline that selects the most frequent sense by default.
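A minimal sketch of how the fine-grained Wikipedia sense labels can be collapsed onto the manually created clusters before re-scoring is shown below; the cluster mapping and the example labels are purely illustrative.

```python
def to_coarse(labels, clusters):
    """Map fine-grained Wikipedia sense labels to their coarse cluster labels."""
    return [clusters[label] for label in labels]


# Illustrative mapping for one word: two Wikipedia senses judged to share a coarse meaning.
clusters = {"bar_(establishment)": "bar_place", "bar_(counter)": "bar_place",
            "bar_(unit)": "bar_unit", "bar_(music)": "bar_music"}
gold = ["bar_(establishment)", "bar_(unit)", "bar_(counter)"]
pred = ["bar_(counter)", "bar_(unit)", "bar_(music)"]
coarse_gold, coarse_pred = to_coarse(gold, clusters), to_coarse(pred, clusters)
accuracy = sum(g == p for g, p in zip(coarse_gold, coarse_pred)) / len(coarse_gold)
print(accuracy)  # ~0.67: the first prediction becomes correct under the coarse clustering
```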

Language  MFS  WikiMonoSense  WikiTransSense  WikiMuSense
English  72.9  87.3  88.9  89.9
German  72.8  84.1  87.8  87.9
Italian  71.7  87.6  89.4  90.0
Spanish  73.6  83.2  86.1  86.9

TABLE 5.9. Coarse-grained macro accuracies.

Language  MFS  WikiMonoSense  WikiTransSense  WikiMuSense
English  69.7  88.9  89.7  91.0
German  78.5  86.7  89.6  89.3
Italian  78.4  88.7  90.3  90.9
Spanish  79.8  87.0  88.7  90.0

TABLE 5.10. Coarse-grained micro accuracies.

word  #s  #ex  MFS  WikiMonoSense  WikiTransSense  WikiMuSense
argument  3  434  66.6  80.2  80.4  82.0
arm  3  289  93.1  84.8  86.5  90.3
atmosphere  2  1037  90.2  93.1  93.8  95.8
bank  3  237  55.7  88.2  88.6  86.9
bar  9  3211  50.3  93.7  93.3  93.6
chair  2  599  53.9  82.0  83.3  82.1
channel  3  813  56.7  89.4  89.8  90.5
circuit  4  460  69.8  87.8  88.3  89.3
degree  2  1945  58.6  98.0  98.9  98.7
detention  2  141  88.7  83.0  86.5  87.9
difference  2  37  60.0  92.5  92.5  97.3
disc  6  319  30.7  83.1  83.1  85.0
dyke  2  298  83.6  85.5  90.6  91.9
fatigue  2  600  73.7  87.8  88.0  83.5
grip  2  63  79.8  79.5  87.6  88.9
image  3  552  73.9  86.8  89.3  90.9
interest  2  565  98.6  96.6  97.0  97.5
material  3  412  93.7  91.0  92.3  89.6
mouth  2  637  80.1  92.0  93.1  92.6
nature  2  1047  96.9  90.2  90.2  93.3
paper  6  1482  93.1  88.5  89.7  92.2
party  3  785  50.4  77.1  83.1  83.7
performance  2  724  93.8  82.2  84.1  88.0
post  3  83  60.4  74.0  90.6  86.7
sense  2  296  95.3  97.7  97.0  99.0
source  5  268  70.5  88.9  85.5  90.3
spade  2  98  73.6  82.3  79.2  86.7
stress  3  1353  51.0  87.9  87.1  82.0
average  3.0  671  72.9  87.3  88.9  89.9

TABLE 5.11. Word sense disambiguation results on English, including one baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results based on Wikipedia 2012 data in a coarse-grained setting. Number of senses (#s) and number of examples (#ex) are also indicated.

word  #s  #ex  MFS  WikiMonoSense  WikiTransSense  WikiMuSense
arco  5  659  64.0  83.6  86.0  88.6
argumento  2  144  98.7  100.0  99.3  98.6
atmósfera  2  1860  94.6  94.1  95.4  95.7
banda  3  412  94.7  88.8  94.7  97.1
barra  3  45  49.5  84.5  91.0  84.4
bomba  4  166  62.1  79.0  85.6  81.3
cámara  3  166  60.9  85.5  96.4  96.4
circuito  2  148  87.2  79.0  85.7  89.2
columna  5  736  77.9  87.0  88.9  91.8
copa  2  74  92.1  83.8  85.4  81.1
corona  6  148  30.4  56.8  60.2  61.5
disco  5  651  40.2  74.7  76.6  76.8
gato  2  566  96.1  94.5  92.9  96.1
grado  6  415  78.1  87.5  87.5  92.0
imagen  3  210  88.1  78.1  79.0  87.1
letra  2  420  82.9  74.8  79.5  70.2
masa  2  868  92.5  88.5  91.8  94.8
mina  4  418  62.5  76.8  80.4  86.4
período  4  1160  91.6  96.0  96.6  97.2
resistencia  7  349  45.0  83.7  84.5  85.7
serie  4  330  82.1  88.8  92.1  91.5
sierra  5  286  64.3  64.3  73.0  72.7
silla  2  130  89.2  76.9  79.2  88.5
tabla  4  159  31.5  81.2  81.8  75.5
teclado  2  1576  84.5  91.1  88.7  90.8
average  3.6  484  73.6  83.2  86.1  86.9

TABLE 5.12. Word sense disambiguation results on Spanish, using Wikipedia 2012 data in a coarse-grained setting. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated.

word  #s  #ex  MFS  WikiMonoSense  WikiTransSense  WikiMuSense
arm  4  162  54.3  80.9  90.7  93.8
becken  3  562  47.7  85.9  91.6  88.1
bild  4  232  48.3  62.9  70.7  72.0
brücke  3  1056  95.8  93.7  95.5  96.7
decke  4  177  37.9  76.2  82.4  78.5
fisch  5  1453  85.4  81.6  82.6  80.5
flügel  8  513  28.8  69.0  77.0  74.3
fuge  3  316  85.1  79.8  84.8  79.7
fusion  6  849  47.9  88.7  92.6  87.5
grad  3  213  74.7  91.5  92.0  93.9
kadenz  3  311  46.9  89.1  92.0  91.3
kanal  3  356  92.1  86.3  92.7  93.8
klinge  2  256  89.1  88.6  89.4  93.4
legende  3  867  97.1  91.9  93.9  97.0
mine  2  590  78.6  82.9  86.8  85.6
nadel  3  99  92.0  87.8  95.9  97.0
natur  2  654  98.9  97.6  96.8  98.9
norm  4  819  55.2  76.6  79.1  80.1
post  2  365  98.6  94.0  97.0  98.6
sinn  2  119  67.3  77.2  79.7  84.9
stern  2  1514  89.8  93.1  94.5  92.5
stollen  4  626  96.6  91.9  95.0  96.8
stuhl  2  90  51.1  65.6  66.7  70.0
zelle  3  1176  95.4  91.7  94.0  96.9
zug  5  371  66.3  76.8  80.9  74.7
average  3.4  550  72.8  84.1  87.8  87.9

TABLE 5.13. Word sense disambiguation results on German, using Wikipedia 2012 data in a coarse-grained setting. In addition to a baseline (MFS = most frequent sense) and the three proposed word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated.

word  #s  #ex  MFS  WikiMonoSense  WikiTransSense  WikiMuSense
arco  5  696  70.5  84.1  88.2  87.8
atmosfera  2  977  93.8  96.1  95.4  98.0
canale  3  427  66.3  86.9  91.1  89.5
cavallo  7  1432  91.7  90.3  91.1  92.7
circuito  3  108  62.0  86.3  90.7  87.0
complesso  4  356  59.3  89.9  89.6  92.7
corona  7  354  60.8  73.4  77.1  76.8
differenziale  2  311  90.4  95.2  93.9  96.8
elemento  4  293  80.9  83.6  85.6  88.7
fase  2  212  65.1  90.1  91.0  91.0
grado  5  405  50.1  88.9  90.4  91.4
gruppo  9  1742  59.2  86.8  88.2  85.9
lettera  2  311  57.2  87.5  93.6  94.5
massa  5  1236  93.0  87.1  89.2  93.7
ordine  12  2801  88.6  91.0  90.7  91.8
pesca  3  1150  84.6  89.3  94.7  96.6
pianta  2  2667  89.8  90.1  93.5  94.5
piatto  2  118  65.2  88.1  84.7  89.8
porta  4  264  63.3  81.4  86.0  84.5
radice  5  816  67.2  85.3  85.3  90.0
resistenza  5  1606  84.6  92.2  92.6  88.5
rete  6  653  46.2  84.4  84.4  81.9
seno  2  284  62.7  91.6  96.1  91.9
spazio  4  649  49.2  78.4  78.3  79.8
testa  2  499  92.2  92.0  94.2  93.6
average  4.3  815  71.7  87.6  89.4  90.0

TABLE 5.14. Word sense disambiguation results on Italian, using Wikipedia 2012 data in a coarse-grained setting. In addition to a baseline (MFS = most frequent sense) and the word sense disambiguation system results, the number of senses (#s) and number of examples (#ex) are also indicated.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

I will now refer back to the contributions proposed in section 1.4 and describe how this work addressed them.

(1) Propose a new method for generating sense-tagged corpora using Wikipedia as a source of sense annotations: The availability of large scale sense annotated corpora is essential for building robust and accurate supervised word sense disambiguation systems. In section 4.1 I described the method for generating sense-tagged data using Wikipedia as a source of sense annotations. Despite some limitations inherent to this approach (definitions and annotations in Wikipedia are available almost exclusively for nouns, word and sense distributions are sometimes skewed, and the annotation labels are occasionally inconsistent), these limitations are outweighed by the clear advantage that comes with the use of Wikipedia: large sense-tagged data for a large number of words at virtually no cost. We believe that this approach is particularly promising for two main reasons. First, the size of Wikipedia is growing at a steady pace, which consequently means that the size of the sense-tagged corpora that can be generated based on this resource is also continuously growing. While techniques for supervised word sense disambiguation have been repeatedly criticized in the past for their limited coverage, mainly due to the associated sense-tagged data bottleneck, Wikipedia seems a promising resource that could provide the much needed solution to this problem. Second, Wikipedia editions are available for many languages (currently more than 280), which means that this method can be used to generate sense-tagged corpora and build accurate word sense classifiers for a large number of languages. (2) Propose two multilingual WSD frameworks that exploit the cumulative ability of features originating from multiple languages to improve on the proposed monolingual WSD framework: We proposed two multilingual WSD frameworks, WikiTransSense and WikiMuSense, that explored the cumulative impact of features originating from multiple languages to improve on the monolingual word sense disambiguation task.

I showed that a multilingual model is suited to better leverage two aspects of the semantics of text by using the Google Translate machine translation engine. First, the various senses of a target word may be translated into other languages using different words, which constitute unique, yet highly salient features that effectively expand the target word's feature space. Second, the translated context words themselves embed co-occurrence information that a translation engine gathers from very large parallel corpora. This information is infused into the model and allows thematic spaces to emerge, where features from multiple languages can be grouped together based on their semantics, leading to a more effective context representation for word sense disambiguation. In order to reduce the reliance on the machine translation system during training, I proposed the WikiMuSense framework, which explored the possibility of using the multilingual knowledge available in Wikipedia through its interlingual links. (3) Examine the effectiveness and robustness of the proposed three WSD frameworks: Through experiments performed on a subset of polysemous words from four different languages, I showed that the Wikipedia sense annotations can be used to build word sense disambiguation systems leading to relative error rate reductions of 44%, 38%, 44%, and 28% compared to the most frequent sense baseline in terms of macro accuracy. The WikiTransSense system features led to an average error reduction of 13.7% compared to the monolingual system. Similarly, the WikiMuSense system obtained an average relative error reduction of 16.5% compared to the monolingual system, while requiring significantly fewer translations than the alternative WikiTransSense system. We have also evaluated the proposed WSD systems in a coarse-grained setting on the same subset of polysemous words, and the results indicate that our multilingual systems outperform the monolingual system. (4) Explore the importance of the size of the sense annotated corpora in the reference language and supporting languages, and of the number of supporting languages: Through learning curves for the four languages I demonstrated the importance of the size of the sense annotated corpora

for our monolingual system. The curves showed a continuously growing accuracy with increasingly larger amounts of data, thus indicating the importance of the sense annotated corpora built using Wikipedia. Then, I explored the importance of the size of the sense annotated corpora in the supporting languages for our WikiMuSense system through two different settings. The curves indicated that WikiMuSense starts to perform better than WikiMonoSense when at least 70-80% of the available supporting data is used, and continues to increase its performance with increasing amounts of supporting language data. We also explored the impact that the number of supporting languages has on the performance of the two multilingual WSD systems. Both WikiTransSense and WikiMuSense were evaluated using all possible combinations of 1, 2, 3, and 4 supporting languages. The curves indicate that the accuracies continue to improve as more languages are added, for both multilingual systems. Since our WikiTransSense and WikiMuSense systems can be expanded with any number of supporting languages and Wikipedia is available in more than 280 languages, these curves indicate that adding more languages helps in building more accurate and robust systems.

6.1. Future Work

Currently, I am exploring the use of rule-based and ensemble techniques to improve the performance of our WikiMuSense WSD system. The combination strategy used in WikiMuSense in this research combines the posterior probabilities using a simple addition rule. I am also exploring the use of combination rules such as product, sum, max, min, median, and majority voting, which have previously demonstrated success for the task of WSD [80]; a sketch of these rules is given below. In addition, I am exploring the possibility of a weighted combination of classifiers for word sense disambiguation based on the Dempster-Shafer theory of evidence [36, 159] and ordered weighted averaging (OWA) operators [182], which have also been applied successfully to WSD [81]. This would directly help in training the weights in our WikiMuSense system and in creating an empirical combination strategy for the reference language and supporting language classifiers.
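As a sketch of the alternative combination rules mentioned above (the exact formulations in [80] may differ in their details), the following illustrates sum, product, max, min, median, and majority voting over the per-classifier posterior distributions.

```python
import math
from collections import Counter
from statistics import median


def combine(distributions, rule="sum"):
    """distributions: list of {sense: probability} dictionaries, one per classifier.
    Returns the sense selected by the chosen combination rule."""
    senses = set().union(*distributions)
    if rule == "majority":
        # Each classifier votes for its top sense.
        votes = Counter(max(d, key=d.get) for d in distributions)
        return votes.most_common(1)[0][0]
    reducers = {"sum": sum, "product": math.prod, "max": max, "min": min, "median": median}
    scores = {s: reducers[rule]([d.get(s, 0.0) for d in distributions]) for s in senses}
    return max(scores, key=scores.get)
```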

Additionally, I am exploring the possibility of building a framework around our multilingual model WikiMuSense for all the polysemous words in English, Italian, German, and Spanish. This framework would greatly reduce the number of translations by exploiting the availability of Wikipedia in more than 280 languages: instead of fixing the supporting languages in advance, it would choose N supporting languages from the available 280 based on the number of supporting language examples available per word or per sense, so that the number of translations required is minimal. This allows us to build a WikiMuSense system that relies only minimally on machine translation during the training phase.

Additionally, I am exploring the potential benefits of the proposed WSD systems in a statistical machine translation framework. In this framework, I am planning to use our WSD systems with the output of two state-of-the-art statistical machine translation (SMT) systems, Moses and Google Translate, and explore whether they help in improving Moses and Google Translate.

Finally, I am exploring a novel method for generating a coarse-grained sense inventory from Wikipedia using a machine learning framework. The aim of this method is to replace the manual clustering of Wikipedia senses described in section 5.4. For this, I am planning to extract structural and content-based features to induce clusters of articles representative of a word sense. Additionally, I am planning to explore the possibility of adding multilingual features extracted from Wikipedia to test whether multilinguality helps in sense clustering.

BIBLIOGRAPHY

[1] Geert Adriaens, Word expert parsing: a natural language analysis program revised and applied to dutch, Leuvensche Bijdragen 75 (1986), no. 1, 73–154. [2] , Wep (word expert parsing) revised and applied to dutch, Proceedings of the 7th European Conference on Artificial Intelligence, ECAI86, 1987, pp. 222–235. [3] , The parallel expert parser: A meaning-oriented, lexically guided, parallel- interactive model of natural language understanding, International Workshop on Parsing Technologies, 1989, pp. 309–319. [4] Geert Adriaens and Steven L Small, Word expert parsing revisited in a cognitive science perspective, In lexical ambiguity resolution: Perspectives from psycholinguistics, neuropsy- chology, and artificial intelligence (1988), 13–43. [5] E. Agirre and P. Edmonds, Word sense disambiguation: Algorithms and applications, Springer, 2006. [6] E. Agirre and D. Martinez, Exploring automatic word sense disambiguation with decision lists and the Web, Proceedings of the Semantic Annotation And Intelligent Annotation work- shop organized by COLING Luxembourg 2000, 2000. [7] , Unsupervised word sense disambiguation based on automatically retrieved exam- ples: The importance of bias, Proceedings of EMNLP 2004 (Barcelona, Spain), July 2004. [8] E. Agirre, G. Rigau, L. Padro, and J. Asterias, Supervised and unsupervised lexical knowl- edge methods for word sense disambiguation, Computers and the Humanities 34 (2000), 103–108. [9] E. Agirre and A. Soroa, Personalizing pagerank for word sense disambiguation, Proceed- ings of the Conference of the European Chapter of the Association for Computational Lin- guistics (Athens, Greece), 2009, pp. 33–41. [10] Eneko Agirre and David Martinez, Learning class-to-class selectional preferences, Pro-

69 ceedings of the 2001 workshop on Computational Natural Language Learning-Volume 7, Association for Computational Linguistics, 2001, p. 3. [11] , Integrating selectional preferences in wordnet, arXiv preprint cs/0204027 (2002). [12] Eneko Agirre and German Rigau, Word sense disambiguation using conceptual density, Proceedings of the 16th conference on Computational linguistics-Volume 1, Association for Computational Linguistics, 1996, pp. 16–22. [13] C. Banea and R. Mihalcea, Word sense disambiguation with multilingual features, Interna- tional Conference on Semantic Computing (Oxford, UK), 2011. [14] Satanjeev Banerjee and Ted Pedersen, An adapted lesk algorithm for word sense disam- biguation using wordnet, Computational linguistics and intelligent text processing (2002), 117–171. [15] Yehoshua Bar-Hillel, The present status of automatic translation of languages, Advances in Computers 1 (1960), 91–163. [16] R. Barzilay and M. Elhadad, Using lexical chains for text summarization, Proceedings of the Intelligent Scalable Text Summarization ACL Workshop (ISTS’97) (Madrid, Spain), 1997. [17] Rada Mihalcea Bharath Dandala and Razvan Bunescu, Word sense disambiguation using wikipedia, The People’s Web Meets NLP: Collaboratively Constructed Language Resources (2012), To appear. [18] Christian Bizer, Jens Lehmann, Georgi Kobilarov, S¨oren Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann, DBpedia - A crystallization point for the Web of Data, Web Semantics: Science, Services and Agents on the World Wide Web 7 (2009), 154–165. [19] Carsten Brockmann and Mirella Lapata, Evaluating and combining approaches to selec- tional preference acquisition, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1, EACL ’03, 2003, pp. 27–34. [20] Peter F Brown, Stephen A Della Pietra, Vincent J Della Pietra, and Robert L Mercer, Word- sense disambiguation using statistical methods, Proceedings of the 29th annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1991, pp. 264–270.

70 [21] R. Bruce and J. Wiebe, Word sense disambiguation using decomposable models, Proceed- ings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994) (LasCruces, NM), June 1994, pp. 139–146. [22] P. Buitelaar, B. Magnini, C. Strapparava, and P. Vossen, Domain specific sense disambigua- tion, Word Sense Disambiguation: Algorithms, Applications, and Trends (P. Edmonds and E. Agirre, eds.), vol. 33, Springer, June 2006. [23] C. Cabezas, P. Resnik, and J. Stevens, Supervised sense tagging using support vector ma- chines, Proceedings of Senseval-2 Workshop, Association of Computational Linguistics (Toulouse, France), 2002, pp. 59–62. [24] Yee Seng Chan and Hwee Tou Ng, Scaling up word sense disambiguation via parallel texts, Proceedings of the 20th national conference on Artificial intelligence - Volume 3, AAAI’05, 2005, pp. 1037–1042. [25] Y.S. Chan, H.T. Ng, and D. Chiang, Word sense disambiguation improves statistical machine translation, Proceedings of the Association for Computational Linguistics (Prague, Czech Republic), 2007. [26] G. Chao and M. Dyer, Word sense disambiguation of adjectives using probabilistic net- works, Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000) (Saarbrucken), 2000. [27] T. Chklovski and R. Mihalcea, Building a sense tagged corpus with Open Mind Word Ex- pert, Proceedings of the ACL 2002 Workshop on ”Word Sense Disambiguation: Recent Successes and Future Directions” (Philadelphia), July 2002. [28] Automatic Language Processing Advisory Committee et al., Language and machines: com- puters in translation and linguistics, Division of behavioral sciences, National Research Council, National Academy of Sciences, Washington (1966). [29] Corinna Cortes and Vladimir Vapnik, Support-vector networks, Machine learning 20 (1995), no. 3, 273–297. [30] Garrison W Cottrell and Steven L Small, A connectionist scheme for modelling word sense disambiguation., Cognition & Brain Theory (1983).

71 [31] J. Cowie, L. Guthrie, and J. Guthrie, Lexical disambiguation using simulated annealing, Proceedings of the 5th International Conference on Computational Linguistics (COLING 1992), 1992. [32] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch, Timbl: Tilburg memory based learner, version 4.0, reference guide, Tech. report, University of Antwerp, 2001. [33] I. Dagan and A. Itai, Word sense disambiguation using a second language monolingual corpus, Computational Linguistics 20 (1994), no. 4, 563–596. [34] Ido Dagan, Alon Itai, and Ulrike Schwall, Two languages are more informative than one, Proceedings of the 29th annual meeting on Association for Computational Linguistics, ACL ’91, 1991, pp. 130–137. [35] B. Decadt, V. Hoste, W. Daelemans, and A. Van den Bosch, Gambl, genetic algorithm optimization of memory-based wsd, Senseval-3: Third International Workshop on the Eval- uation of Systems for the Semantic Analysis of Text (Barcelona, Spain), July 2004. [36] AP Dempster, Upper and lower probabilities induced by a multivalued mapping, The An- nals of Mathematical Statistics (1967), 325–339. [37] M. Diab, Relieving the data acquisition bottleneck in word sense disambiguation, Proceed- ings of the 42nd Meeting of the Association for Computational Linguistics (ACL 2004) (Barcelona, Spain), July 2004. [38] M. Diab and P. Resnik, An unsupervised method for word sense tagging using parallel corpora, Proceedings of the 40st Annual Meeting of the Association for Computational Linguistics (ACL 2002) (Philadelphia, PA), July 2002. [39] Leon E Dostert, The georgetown-ibm experiment, 1955). Machine translation of languages. John Wiley & Sons, New York (1955), 124–135. [40] Richard O Duda and Peter E Hart, Pattern classification and scene analysis, 1973. [41] P. Edmonds, Designing a task for Senseval-2, May 2000, Available online at http://www.itri.bton.ac.uk/events/senseval. [42] Gerard Escudero, Lluis Marquez, and German Rigau, An empirical study of the domain de- pendence of supervised word sense disambiguation systems, Proceedings of the 2000 Joint

72 SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Compu- tational Linguistics-Volume 13, Association for Computational Linguistics, 2000, pp. 172– 180. [43] Erwin Fernandez-Ordonez, Rada Mihalcea, and Samer Hassan, Unsupervised word sense disambiguation with multilingual representations, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), European Language Re- sources Association (ELRA), may 2012. [44] R. Florian, S. Cucerzan, C. Schafer, and D. Yarowsky, Combining classifiers for word sense disambiguation, JNLE Special Issue on Evaluating Word Sense Disambiguation Systems (2002). [45] W. Gale, K. Church, and D. Yarowsky, Estimating upper and lower bounds on the perfor- mance of word-sense disambiguation programs, Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL 1992), 1992, pp. 249–256. [46] , One sense per discourse, Proceedings of the DARPA Speech and Natural Language Workshop (Harriman, New York), 1992, pp. 233–237. [47] WA GALE, KW CHURCH, and D YAROWSKY, A method for disambiguating word senses in a large corpus, Computers and the humanities 26 (1992), no. 5-6, 415–439. [48] William A. Gale and Kenneth W. Church, A program for aligning sentences in bilingual cor- pora, Proceedings of the 29th annual meeting on Association for Computational Linguis- tics (Stroudsburg, PA, USA), ACL ’91, Association for Computational Linguistics, 1991, pp. 177–184. [49] M. Galley and K. McKeown, Improving word sense disambiguation in lexical chaining, Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI 2003) (Acapulco, Mexico), August 2003. [50] A. Gliozzo, C. Giuliano, and C. Strapparava, Domain kernels for word sense disambigua- tion, Proceedings of the 43th Annual Meeting of the Association for Computational Lin- guistics (Ann Arbor, Michigan), 2005.

73 [51] J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran, Indexing with WordNet synsets can im- prove text retrieval, Proceedings of COLING-ACL ’98 Workshop on Usage of WordNet in Natural Language Processing Systems (Montreal, Canada), August 1998. [52] Irving John Good, The estimation of probabilities: An essay on modern bayesian methods, vol. 30, MIT press Cambridge, MA, 1965. [53] R Gould, Multiple correspondence, Mechanical translation 4 (1957), no. 1/2, 14–27. [54] Yoan Guti´errez, Sonia V´azquez, and Andr´es Montoyo, A graph-based approach to wsd using relevant semantic trees and n-cliques model, CICLing (1), 2012, pp. 225–237. [55] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten, The weka data mining software: an update, ACM SIGKDD Explorations Newsletter 11 (2009), no. 1, 10–18. [56] Philip J Hayes, A process to implement some word-sense disambiguations, 1976. [57] , On semantic nets, frames and associations, Proceedings of the 5th International. Joint Conference on Artificial Intelligence, Citeseer, 1977, pp. 99–107. [58] , Some association-based techniques for lexical disambiguation by machine, Ph.D. thesis, Ecole polytechnique federale de Lausanne., 1977. [59] , Mapping input onto schemas, University of Rochester, Department of Computer Science, 1978. [60] Marti A. Hearst, ST Dumais, E Osman, John Platt, and Bernhard Scholkopf, Support vector machines, Intelligent Systems and their Applications, IEEE 13 (1998), no. 4, 18–28. [61] Verena Henrich, Erhard W. Hinrichs, and Tatiana Vodolazova, Semi-automatic extension of GermaNet with sense definitions from Wiktionary, Proceedings of the 5th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Sci- ence and Linguistics, 2011, pp. 126–130. [62] , Webcage - A Web-harvested corpus annotated with GermaNet senses, 13th Con- ference of the European Chapter of the Association for Computational Linguistics, EACL ’12, 2012, pp. 387–396. [63] Graeme Hirst and David St-Onge, Lexical chains as representations of context for the detec-

74 tion and correction of malapropisms, WordNet: An electronic lexical database 305 (1998), 305–332. [64] V. Hoste, W. Daelemans, I. Hendrickx, and A. van den Bosch, Evaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation, Proceed- ings of the ACL Workshop on ”Word Sense Disambiguation: Recent Successes and Future Directions” (Philadelphia), July 2002. [65] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel, Ontonotes: The 90% so- lution, Proceedings of the Human Language Technology Conference of the NAACL, Com- panion Volume: Short Papers (New York City), 2006. [66] Nancy Ide and Jean Vronis, Word sense disambiguation: The state of the art, Computational Linguistics 24 (1998), 1–40. [67] J. Jiang and D. Conrath, Semantic similarity based on corpus statistics and lexical taxon- omy, Proceedings of the International Conference on Research in Computational Linguistics (Taiwan), 1997. [68] A. Kaplan, An experimental study of ambiguity and context, Mechanical Translation 2 (1955), 39–46. [69] Edward F Kelly and Philip J Stone, Computer recognition of english word senses, vol. 13, North-Holland Amsterdam, 1975. [70] Mitesh M Khapra, Sapan Shah, Piyush Kedia, and Pushpak Bhattacharyya, Projecting pa- rameters for multilingual word sense disambiguation, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, 2009, pp. 459– 467. [71] Mitesh M. Khapra, Saurabh Sohoney, Anup Kulkarni, and Pushpak Bhattacharyya, Value for money: balancing annotation effort, lexicon building and accuracy for multilingual wsd, Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, 2010, pp. 555–563. [72] A. Kilgarriff, Foreground and background lexicons and word sense disambiguation for in-

75 formation extraction, Proceedings of the Workshop on Lexicon Driven Information Extrac- tion (Frascati, Italy), 1997. [73] A. Kilgarriff and R. Rosenzweig, English SENSEVAL: Report and results, Proceedings of LREC 2000 (Athens, Greece), June 2000. [74] Adam Kilgarriff, How dominant is the commonest sense of a word?, Text, Speech and Dia- logue, Springer, 2004, pp. 103–111. [75] D. Klein, K. Toutanova, I.T. Ilhan, S.D. Kamvar, and C. Manning, Combining heterogeneous classifiers for word-sense disambiguation, Proceedings of the ACL Workshop on ”Word Sense Disambiguatuion: Recent Successes and Future Directions, July 2002, pp. 74–80. [76] Rob Koeling, Diana McCarthy, and John Carroll, Domain-specific sense distributions and predominant sense acquisition, Proceedings of the conference on Human Language Tech- nology and Empirical Methods in Natural Language Processing, 2005, pp. 419–426. [77] Andreas K Koutsoudas and R Korfhage, Mt and the problem of multiple meaning, Mechan- ical Translation 2 (1956), no. 2, 46–51. [78] C. Kunze and L. Lemnitzer, GermaNet – Representation, visualization, application, 3rd In- ternational Conference on Language Resources and Evaluation, LREC’02, 2002, pp. 1485– 1491. [79] Mirella Lapata and Frank Keller, An information retrieval approach to sense ranking, Com- putational Linguistics 348 (2007), no. 355. [80] Cuong Le, Van-Nam Huynh, and Akira Shimazu, Combining classifiers with multi- representation of context in word sense disambiguation, Advances in Knowledge Discovery and Data Mining (2005), 425–449. [81] Cuong Anh Le, Van-Nam Huynh, Akira Shimazu, and Yoshiteru Nakamori, Combining classifiers for word sense disambiguation based on dempster-shafer theory and owa opera- tors, Data Knowl. Eng. 63 (2007), no. 2, 381–396. [82] C. Leacock and M. Chodorow, Combining local context and WordNet sense similarity for word sense identification, WordNet, An Electronic Lexical Database, The MIT Press, 1998.

76 [83] C. Leacock, M. Chodorow, and G.A. Miller, Using corpus statistics and WordNet relations for sense identification, Computational Linguistics 24 (1998), no. 1, 147–165. [84] Y.K. Lee, H.T. Ng, and T.K. Chia, Supervised word sense disambiguation with support vector machines and multiple knowledge sources, Proceedings of ACL/SIGLEX Senseval-3 (Barcelona, Spain), July 2004. [85] Els Lefever, Parasense: parallel corpora for word sense disambiguation, Ph.D. thesis, Ghent University, 2012, pp. XI, 205. [86] Els Lefever and Veronique Hoste, Semeval-2010 task 3: Cross-lingual word sense disam- biguation, Proceedings of the 5th International Workshop on Semantic Evaluation, Associ- ation for Computational Linguistics, 2010, pp. 15–20. [87] Els Lefever, V´eronique Hoste, and Martine De Cock, Parasense or how to use parallel cor- pora for word sense disambiguation, Proceedings of the 49th Annual Meeting of the Asso- ciation for Computational Linguistics: Human Language Technologies (Portland, Oregon, USA), June 2011, pp. 317–322. [88] M.E. Lesk, Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, Proceedings of the SIGDOC Conference 1986 (Toronto), June 1986. [89] Hang Li and Cong Li, Word translation disambiguation using bilingual bootstrapping, Computational Linguistics 30 (2004), no. 1, 1–22. [90] D. Lin, An information-theoretic definition of similarity, Proceedings of the 15th Interna- tional Conference on Machine Learning (Madison, Wisconsin), 1998. [91] Swaminathan Madhu and Dean W Lytle, A figure of merit technique for the resolution of non-grammatical ambiguity, Mechanical translation 8 (1965), no. 2, 9–13. [92] Llu´ıs M`arquez, Gerard Escudero, David Mart´ınez, and German Rigau, Supervised corpus- based methods for wsd, Word Sense Disambiguation (2006), 167–216. [93] D. Martinez and E. Agirre, One sense per collocation and genre/topic variations, Proceed- ings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Process- ing and Very Large Corpora (Hong Kong), 2000.

77 [94] David Martinez Iraola, Supervised word sense disambiguation facing current challenges, Ph.D. thesis, Universidad del Pa´ıs Vasco, 2004. [95] Margaret Masterman, Semantic message detection for machine translation, using an inter- lingua, 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, 1961, pp. 437–475. [96] D. McCarthy, R. Koeling, J. Weeds, and J. Carroll, Finding predominant senses in untagged text, Proceedings of the 42nd Annual Meeting of the Association for Computational Lin- guistics (ACL 2004) (Barcelona, Spain), July 2004. [97] Diana Mccarthy, Relating wordnet senses for word sense disambiguation, In Proceedings of the ACL Workshop on Making Sense of Sense, 2006, pp. 17–24. [98] Diana McCarthy and John Carroll, Disambiguating nouns, verbs, and adjectives using au- tomatically acquired selectional preferences, Computational Linguistics 29 (2003), no. 4, 639–654. [99] Paola Merlo and Suzanne Stevenson, Automatic verb classification based on statistical dis- tributions of argument structure, Computational Linguistics 27 (2001), no. 3, 373–408. [100] Christian M. Meyer and Iryna Gurevych, What psycholinguists know about chemistry: Aligning Wiktionary and WordNet for increased domain coverage, Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), November 2011, pp. 883–892. [101] R. Mihalcea, Bootstrapping large sense tagged corpora, Proceedings of the Third Inter- national Conference on Language Resources and Evaluation LREC 2002 (Canary Islands, Spain), May 2002, pp. 1407–1411. [102] , Instance based learning with automatic feature selection applied to Word Sense Disambiguation, Proceedings of the 19th International Conference on Computational Lin- guistics (COLING 2002) (Taipei, Taiwan), August 2002. [103] , Large vocabulary unsupervised word sense disambiguation with graph-based al- gorithms for sequence data labeling, Proceedings of the Human Language Technology / Empirical Methods in Natural Language Processing conference (Vancouver), 2005.

78 [104] , Using Wikipedia for automatic word sense disambiguation, Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (Rochester, New York), April 2007. [105] R. Mihalcea and D. Moldovan, Automatic generation of a coarse grained wordnet, NAACL 2001 Workshop on WordNet and Other Lexical Resources: applications, extensions and customizations (Pittsburgh), June 2001, pp. 35–41. [106] R. Mihalcea and D.I. Moldovan, Word sense disambiguation based on semantic density, Proceedings of COLING-ACL ’98 Workshop on Usage of WordNet in Natural Language Processing Systems (Montreal, Canada), August 1998, pp. 16–22. [107] , An automatic method for generating sense tagged corpora, Proceedings of AAAI- 99 (Orlando, FL), July 1999, pp. 461–466. [108] Rada Mihalcea, Making sense out of the web, Workshop on Lexical Resources and the Web for Word Sense Disambiguation, IBERAMIA, Mexico, 2004. [109] , Knowledge-based methods for wsd, Word Sense Disambiguation (2006), 107–131. [110] G. Miller, Wordnet: A lexical database, Communication of the ACM 38 (1995), no. 11. [111] G.A. Miller and W.G. Charles, Contextual correlates of semantic similarity, Language and Cognitive Processes 6 (1991), no. 1. [112] David Milne and Ian H Witten, An open-source toolkit for mining wikipedia, Artificial In- telligence (2012). [113] Thomas M. Mitchell, Machine learning, 1 ed., McGraw-Hill, Inc., 1997. [114] S. Mohammad and G. Hirst, Distributional measures of concept-distance: A task-oriented evaluation, Proceedings of the Conference on Empirical Methods in Natural Language Pro- cessing (Sydney, Australia), 2006. [115] Antonio Molina, Ferran Pla, and Encarna Segarra, A hidden markov model approach to word sense disambiguation, Advances in Artificial IntelligenceIBERAMIA 2002 (2002), 655–663. [116] R. Mooney, Comparative experiments on disambiguating word senses: An illustration of the

79 role of bias in machine learning, Proceedings of the 1996 Conference on Empirical Methods in Natural Language Processing (EMNLP 1996) (Philadelphia), May 1996. [117] J. Morris and G. Hirst, Lexical cohesion, the thesaurus, and the structure of text, Computa- tional Linguistics 17 (1991), no. 1, 21–48. [118] Masaki Murata, Masao Utiyama, Kiyotaka Uchimoto, Qing Ma, and Hitoshi Isahara, Japan- ese word sense disambiguation using the simple bayes and support vector machine methods, The Proceedings of the Second International Workshop on Evaluating Word Sense Disam- biguation Systems, SENSEVAL ’01, no. 4, 2001, pp. 135–138. [119] R. Navigli, Meaningful clustering of senses helps boost word sense disambiguation perfor- mance, Proceedings of the Annual Meeting of the Association for Computational Linguis- tics (Sydney, Australia), 2006. [120] R. Navigli and M. Lapata, Graph connectivity measures for unsupervised word sense dis- ambiguation, Proceedings of the International Joint Conference on Artificial Intelligence (Hyderabad, India), 2007. [121] R. Navigli and S. Ponzetto, BabelNet: Building a very large multilingual semantic network, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (Uppsala, Sweden), 2010. [122] R. Navigli and P. Velardi, Structural semantic interconnections: A knowledge-based ap- proach to word sense disambiguation, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27 (2005). [123] Roberto Navigli, Word sense disambiguation: A survey, ACM Computing Surveys (CSUR) 41 (2009), no. 2, 10. [124] Roberto Navigli and Simone Paolo Ponzetto, Joining forces pays off: Multilingual joint word sense disambiguation, Proceedings of the 2012 Joint Conference on Empirical Meth- ods in Natural Language Processing and Computational Natural Language Learning (Jeju Island, Korea), July 2012, pp. 1399–1410. [125] H.T. Ng, Getting serious about word sense disambiguation, Proceedings of the ACL

80 SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How? (Washington), 1997, pp. 1–7. [126] H.T. Ng and H.B. Lee, Integrating multiple knowledge sources to disambiguate word sense: An examplar-based approach, Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996) (Santa Cruz), 1996. [127] H.T. Ng, B. Wang, and Y.S. Chan, Exploiting parallel texts for word sense disambigua- tion: An empirical study, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003) (Sapporo, Japan), July 2003. [128] Kiem-Hieu Nguyen and Cheol-Young Ock, Word sense disambiguation as a traveling sales- man problem, Artificial Intelligence Review (2011), 1–23. [129] Elisabeth Niemann and Iryna Gurevych, The people’s Web meets linguistic knowledge: Au- tomatic sense alignment of Wikipedia and Wordnet, Proceedings of the Ninth International Conference on Computational Semantics, IWCS ’11, 2011, pp. 205–214. [130] Anthony G Oettinger, The design of an automatic russian-english technical dictionary, 1955). Machine translation of languages. John Wiley & Sons, New York (1955), 47–65. [131] Victor A Oswald Jr, Microsemantics, Communication presented at the first MIT conference on Mechanical Translation, 1952, pp. 17–20. [132] , The rationale of the idioglossary technique, Research in Machine Translation, Georgetown University Press, Washington, DC (1957), 63–69. [133] Victor A Oswald Jr and Richard H Lawson, An idioglossary for mechanical translation, Modern Language Forum, vol. 38, 1953, pp. 1–11. [134] DY Panov, Machine translation and human being, Impact of Science on Society Vol 10 (1960), 16–25. [135] S. Patwardhan, S. Banerjee, and T. Pedersen, Using measures of semantic relatedness for word sense disambiguation, Proceedings of the Fourth International Conference on Intelli- gent Text Processing and Computational Linguistics (Mexico City), February 2003. [136] T. Pedersen, A decision tree of bigrams is an accurate predictor of word sense, Proceedings

81 of the North American Chapter of the Association for Computational Linguistics (NAACL 2001) (Pittsburgh), June 2001, pp. 79–86. [137] Ted Pedersen, A simple approach to building ensembles of naive bayesian classifiers for word sense disambiguation, Proceedings of the 1st North American chapter of the Associa- tion for Computational Linguistics conference, NAACL 2000, 2000, pp. 63–69. [138] Ted Pedersen and Rebecca Bruce, Distinguishing word senses in untagged text, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing: August 1- 2, 1997, Brown University, Providence, Rhode Island, USA, Association for Computational Linguistics, 1997, p. 197. [139] W. Peters, I. Peters, and P. Vossen, Automatic sense clustering in EuroWordNet, Proceedings of the First International Conference on Language Resources and Evaluation (Granada), May 1998, pp. 409–416. [140] J Platt, Sequential minimal optimization: A fast algorithm for training support vector ma- chines, 1998. [141] Simone Paolo Ponzetto and Roberto Navigli, Large-scale taxonomy mapping for restruc- turing and integrating Wikipedia, Proceedings of the 21th International Joint Conference on Artificial Intelligence (Pasadena, CA), July 2009. [142] , Knowledge-rich word sense disambiguation rivaling supervised systems, Proceed- ings of the 48th Annual Meeting of the Association for Computational Linguistics (Strouds- burg, PA, USA), Association for Computational Linguistics, 2010, pp. 1522–1531. [143] M. Porter, An algorithm for suffix stripping, Program 14 (1980), no. 3, 130–137. [144] A. Purandare and T. Pedersen, Word sense discrimination by clustering contexts in vector and similarity spaces, Proceedings of the Conference on Computational Natural Language Learning (CoNLL 2004) (Boston), 2004. [145] M. Quillian, Semantic memory, Semantic Information Processing (M. Minsky, ed.), MIT Press, 1968. [146] J. Quinlan, C4.5: programs for machine learning, Morgan Kaufman, 1993.

82 [147] R. Rada, H. Mili, E. Bickell, and B. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man and Cybernetics 19 (1989), 17–30. [148] M Sreenivasa Rao and Arun K Pujari, Word sense disambiguation using the hopfield model of neural networks, Journal of Intelligent Systems 9 (2011), no. 3-4, 219–258. [149] Siva Reddy, Abhilash Inumella, Diana MacCarthy, and Mark Stevenson, Iiith: Domain specific word sense disambiguation, Proceedings of the 5th International Workshop on Se- mantic Evaluation, Association for Computational Linguistics, 2010, pp. 387–391. [150] ERWIN REIFLER, The mechanical determination of meaning, Machine translation of lan- guages: fourteen essays (1955), 136. [151] P. Resnik, Using information content to evaluate semantic similarity, Proceedings of the 14th International Joint Conference on Artificial Intelligence (Montreal, Canada), 1995. [152] , Selectional preference and sense disambiguation, Proceedings of ACL Siglex Workshop on Tagging Text with Lexical Semantics, Why, What and How? (Washington DC), April 1997. [153] , Mining the Web for bilingual text, Proceedings of 37th Annual Meeting of the As- sociation for Computational Linguistics (ACL 1999) (College Park, Maryland), June 1999. [154] P. Resnik and D. Yarowsky, Distinguishing systems and distinguishing senses: New evalua- tion methods for word sense disambiguation, Natural Language Engineering 5 (1999), no. 2, 113–134. [155] Chuck Rieger and Steve Small, Word expert parsing, Proceedings of the 6th international joint conference on Artificial intelligence-Volume 2, Morgan Kaufmann Publishers Inc., 1979, pp. 723–728. [156] Irina Rish, An empirical study of the naive bayes classifier, IJCAI 2001 workshop on em- pirical methods in artificial intelligence, vol. 3, 2001, pp. 41–46. [157] Ronald L Rivest, Learning decision lists, Machine learning 2 (1987), no. 3, 229–246. [158] H. Schmid, Probabilistic part-of-speech tagging using decision trees, International Confer- ence on New Methods in Language Processing (Manchester, UK), 1994.

83 [159] Glenn Shafer, A mathematical theory of evidence, vol. 76, Princeton university press Prince- ton, 1976. [160] H.G. Silber and K. McCoy, Efficiently computed lexical chains as an intermediate represen- tation for automatic text summarization, Computational Linguistics (2002). [161] R. Sinha and R. Mihalcea, Unsupervised graph-basedword sense disambiguation using measures of word semantic similarity, Proceedings of the International Conference on Se- mantic Computing, 2007, pp. 363–369. [162] S.L. Small and University of Rochester. Department of Computer Science, Parsing as coop- erative distributed inference: Understanding through memory interactions, TR (Rochester), Department of Computer Science, University of Rochester, 1981. [163] Steven L Small, Viewing word expert parsing as linguistic theory, Proceedings of the 7th international joint conference on Artificial intelligence-Volume 1, 1981, pp. 70–76. [164] R. Snow, S. Prakash, D. Jurafsky, and A. Ng, Learning to merge word senses, Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Com- putational Natural Language Learning (EMNLP-CoNLL) (Prague, Czech Republic), 2007. [165] B. Snyder and M. Palmer, The English all-words task, Proceedings of ACL/SIGLEX Senseval-3 (Barcelona, Spain), July 2004. [166] Philip J Stone, Improved quality of content analysis categories: Computerized- disambiguation rules for high-frequency english words, Gerbner, George; Holsti, Ole, Krip- pendorf, Klaus; Paisley, William J (1969), 199–221. [167] F. M. Suchanek, G. Kasneci, and G. Weikum, Yago: A core of semantic knowledge, Pro- ceedings of the 16th World Wide Web Conference, 2007. [168] G. Towell and E.M. Voorhees, Disambiguating highly ambiguous words, Computational Linguistics 24 (1998), no. 1, 125–145. [169] G. Tsatsaronis, M. Vazirgiannis, and I. Androutsopoulos, Word sense disambiguation with spreading activation networks generated from thesauri, Proceedings of the 20th Interna- tional Joint Conference on Artificial Intelligence (Hyderabad, India), 2007, pp. 1725–1730. [170] Dan TufiS¸, Radu Ion, and Nancy Ide, Fine-grained word sense disambiguation based on

84 parallel corpora, word alignment, word clustering and aligned wordnets, Proceedings of the 20th international conference on Computational Linguistics, COLING ’04, 2004. [171] F. Vasilescu, P. Langlais, and G. Lapalme, Evaluating variants of the Lesk approach for dis- ambiguating words, Proceedings of the Conference of Language Resources and Evaluations (LREC 2004) (Lisbon, Portugal), 2004. [172] J. Veenstra, A. van den Bosch, S. Buchholz, W. Daelemans, and J. Zavrel, Memory-based word sense disambiguation, Computers and the Humanities 34 (2000), 171–177. [173] J. Veronis and N. Ide, Word sense disambiguation with very large neural networks extracted from machine readable dictionaries, Proceedings of the 13th International Conference on Computational Linguistics (COLING 1990) (Helsinki, Finland), August 1990. [174] David L Waltz and B Pollack, Massively parallel parsing: A strongly interactive model of natural language interpretation, Cognitive Science 9 (1985), no. 1, 51–74. [175] Warren Weaver, Translation, Machine Translation of Languages (William N. Locke and A. Donald Boothe, eds.), MIT Press, 1949, Reprinted from a memorandum written by Weaver in 1949., pp. 15–23. [176] Stephen F Weiss, Learning to disambiguate, Information Storage and Retrieval 9 (1973), no. 1, 33–41. [177] Yorick Wilks, Preference semantics., Tech. report, DTIC Document, 1973. [178] , A preferential, pattern-seeking, semantics for natural language inference, Artificial Intelligence 6 (1975), no. 1, 53–74. [179] Yorick A Wilks, On-line semantic analysis of english texts, 1968. [180] Peratham Wiriyathammabhum, Boonserm Kijsirikul, Hiroya Takamura, and Manabu Oku- mura, Applying deep belief networks to word sense disambiguation, arXiv preprint arXiv:1207.0396 (2012). [181] Z. Wu and M. Palmer, Verb semantics and lexical selection, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, New Mexico), 1994. [182] Ronald R. Yager, On ordered weighted averaging aggregation operators in multicriteria decisionmaking, IEEE Trans. Syst. Man Cybern. 18 (1988), no. 1, 183–190.

85 [183] Ying Yang, Discretization for naive-bayes learning, Ph.D. thesis, Monash University, 2003. [184] D. Yarowsky, One sense per collocation, Proceedings of the ARPA Human Language Tech- nology Workshop, 1993. [185] , Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, NM), 1994, pp. 88–95. [186] , Unsupervised word sense disambiguation rivaling supervised methods, Proceed- ings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL 1995) (Cambridge, MA), June 1995. [187] Xuling Zheng, Zhou Changle, Li Tangqiu, and Chen Yidong, Automatic acquisition of chi- nese semantic collocation rules based on association rule mining technique, JOURNAL- XIAMEN UNIVERSITY NATURAL SCIENCE 46 (2007), no. 3, 331. [188] Zhi Zhong and Hwee Tou Ng, Word sense disambiguation improves information retrieval, ACL (1), 2012, pp. 273–282. [189] George Kingsley Zipf, Human behavior and the principle of least effort., 1949.
