<<

Automatically creating multilingual lexical resources

by

Khang Nhut Lam

MSc., Ewha Womans University, Seoul, Korea, 2009

A dissertation submitted to the Graduate Faculty of the

University of Colorado at Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2015

© Copyright by Khang Nhut Lam 2015. All Rights Reserved.

This thesis for the Ph.D. in Computer Science degree by

Khang Nhut Lam

has been approved for the

Department of Computer Science

by

Dr. Jugal Kalita, Chair

Dr. Edward Chow

Dr. Rory Lewis

Dr. Martha Palmer

Dr. Jia Rao

Date

Khang Nhut Lam, Ph.D., Computer Science

Title: Automatically creating multilingual lexical resources

Supervisor: Dr. Jugal Kalita

Bilingual dictionaries and WordNets are important resources for natural language processing tasks such as information retrieval and machine translation. However, lexical resources are usually available only for resource-rich languages, e.g., English, Spanish and

French. Resource-poor languages, e.g., Cherokee, Dimasa and Karbi, have very few resources with limited numbers of entries. Current approaches for creating new lexical resources work with languages that already have good-quality resources available in sufficient quantities.

This thesis proposes novel approaches to generate bilingual dictionaries, translate phrases and construct WordNets for several natural languages, including some languages in the UNESCO Endangered Languages List (viz., Cherokee, Cheyenne, Dimasa and Karbi), by bootstrapping from just a few existing resources and publicly available resources in resource-rich languages such as the Princeton WordNet, the Japanese WordNet and the Microsoft Translator.

This thesis not only constructs new lexical resources but also supports communities using languages with limited resources.

Dedication

I would like to express my deep love to my parents. Without your love and support, I could not have brought this dissertation this far. Thank you for everything you have done for me. Even though I have been alone half a world away from you, I have never felt lonely, because you are always with me.

I would like to thank the Le family, and my best friends, Vicky Collier and

Janet Gardner, who are always at my side, care for me as their daughter, and have given me a real family during the time I have been in the United States.

Acknowledgments

I would like to take this opportunity to express my warm thanks to my advisor, Dr.

Jugal Kalita, who has supported and guided me with patience and encouragement, and has provided me with a professional environment for studying and doing research since my first day in the Ph.D. program at UCCS. I also owe my gratitude to my dissertation committee members: Dr. Edward Chow, Dr. Jia Rao, Dr. Martha Palmer and Dr. Rory Lewis, for their enthusiasm, insightful comments, constructive suggestions and critical evaluations of my research.

Special thanks are due to Feras Al Tarouti, my lab mate and co-author, for his stimulating contributions and discussions, his help in programming and evaluating results, and his excellent company during stressful days when we worked together to meet crucial paper deadlines. Many thanks to all of my lab mates for their help, questions, suggestions and all the fun we have had in our lab.

Many thanks to Dubari Borah, Francisco Torres Reyes, Conner Clark, Tri Doan,

Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Faris

Kateb, Abhijit Bendale, Lalit Prithviraj Jain and Svati Dhamija for helping me evaluate lexical resources. I also thank all my friends in the Xobdo, Microsoft and PanLex projects who provided me with dictionaries and translations.

This research was supported by Vietnam International Education Development - Ministry of Education and Training of Vietnam (VIED). I gratefully acknowledge VIED's financial support. I also thank the Graduate School at UCCS for fellowships and the Computer

Science department at UCCS for my grader and teaching jobs.

TABLE OF CONTENTS

1 Introduction 1

1.1 Overview ...... 1

1.2 Types of lexical resources ...... 3

1.3 Research focus and contribution ...... 6

1.4 Intellectual and scientific merit ...... 9

1.5 Broader impact ...... 9

1.6 Organization of the dissertation ...... 10

2 Related work 11

2.1 Introduction ...... 11

2.2 Structure of lexical resources ...... 11

2.2.1 Structure of a bilingual dictionary ...... 11

2.2.2 Structure of the Princeton WordNet ...... 12

2.3 Language codes ...... 15

2.4 Creating new bilingual dictionaries ...... 16

2.4.1 Generating bilingual dictionaries using one intermediate language . . 17

2.4.2 Generating bilingual dictionaries using many intermediate languages 21

2.4.3 Extracting bilingual dictionaries from corpora ...... 24

2.4.4 Generating dictionaries from multiple linguistic resources ...... 29

2.5 Generating translations for phrases ...... 33

2.6 Constructing WordNets ...... 38

2.6.1 Constructing WordNets using the merge approach ...... 38

2.6.2 Constructing WordNets using the expand approach ...... 40

2.7 Chapter summary ...... 45

3 Input resources and evaluation methods 47

3.1 Introduction ...... 47

3.2 Input bilingual dictionaries ...... 47

3.3 Input WordNets ...... 48

3.4 Evaluation method ...... 49

3.5 Chapter summary ...... 50

4 Creating reverse bilingual dictionaries 52

4.1 Introduction ...... 52

4.2 Related work ...... 53

4.3 Proposed approaches ...... 54

4.3.1 Direct reversal (DR) ...... 54

4.3.2 Direct reversal with distance (DRwD) ...... 56

4.3.3 Direct reversal with similarity (DRwS) ...... 58

4.3.4 Direct reversal with similarity and distance (DRwSD) ...... 60

4.4 Experimental results ...... 62

4.4.1 Preprocessing entries in the existing dictionaries ...... 63

4.4.2 Results ...... 65

4.5 Future work ...... 69

4.6 Chapter summary ...... 72

5 Creating new bilingual dictionaries 74

5.1 Introduction ...... 74

5.2 Related work ...... 75

5.3 Proposed approaches ...... 76

5.3.1 Direct translation approach (DT) ...... 76

5.3.2 Using publicly available WordNets as intermediate resources (IW) . 77

5.4 Experimental results ...... 81

5.4.1 Results and human evaluation ...... 82

5.4.2 Comparing with existing approaches ...... 88

5.4.3 Comparing with Google Translator ...... 89

5.5 Future work ...... 90

5.6 Chapter summary ...... 91

6 Creating WordNets 93

6.1 Introduction ...... 93

6.2 Related work ...... 94

6.3 Proposed approaches ...... 95

6.3.1 Generating synset candidates ...... 95

6.3.1.1 The direct translation (DT) approach ...... 96

6.3.1.2 Approach using intermediate WordNets (IW) ...... 96

6.3.1.3 Approach using intermediate WordNets and a dictionary

(IWND) ...... 99

6.3.2 Ranking method ...... 100

6.3.3 Selecting candidates based on ranks ...... 101

6.4 Experiments ...... 104

6.5 Future work ...... 106

6.6 Chapter summary ...... 107

7 Generating translations for phrases using a dictionary and n-gram data 109

7.1 Introduction ...... 109

7.2 Vietnamese ...... 110

7.3 Related work ...... 111

7.4 Proposed approach ...... 112

7.4.1 Segmenting Vietnamese ...... 112

7.4.2 Filtering segmentations ...... 113

7.4.3 Generating ad hoc translations ...... 114

7.4.4 Selecting the best ad hoc translation ...... 114

7.4.5 Finding and ranking translation candidates ...... 116

7.5 Experiments ...... 117

7.6 Future work ...... 119

7.7 Conclusion ...... 120

8 Conclusions 122

References 124

Appendix A: Reverse dictionaries generated 134

Appendix B: New bilingual dictionaries created 136

Appendix C: New WordNets constructed 138

TABLES

2.1 Languages mentioned and their ISO 639-3 codes ...... 15

3.1 The number of entries in the input dictionaries...... 48

3.2 The number of synsets in WordNets ...... 49

3.3 The average scores of entries in the input dictionaries...... 51

4.1 Words related to the “south”, obtained from the Princeton WordNet. . 59

4.2 Reverse dictionaries created using the DR and DRwD approaches...... 65

4.3 Reverse dictionaries created using the DRwS approach ...... 65

4.4 Reverse dictionaries created using the DRwSD approach ...... 66

4.5 Examples of unknown words from the source dictionaries...... 67

4.6 Examples of bad translations from the source dictionaries ...... 68

4.7 Reverse of reverse dictionaries generated ...... 70

4.8 Some new entries, evaluated as excellent or good, in the reverse of reverse

dictionaries ...... 70

5.1 The average score and the number of lexical entries in the dictionaries created

using the DT approach ...... 83

5.2 The average score of lexical entries in the dictionaries we create using the IW

approach...... 84

5.3 The number of lexical entries in the dictionaries we create using the IW

approach ...... 85

5.4 The average score of entries and the number of lexical entries in some other

bilingual dictionaries constructed using 4 WordNets: PWN, FWN, JWN and

WWN...... 86

5.5 Examples of entries, evaluated as excellent, in the new bilingual dictionaries

we created...... 86

5.6 The number of lexical entries in some other dictionaries we create using the

best approach...... 87

5.7 Examples of entries, not yet evaluated, in the new bilingual dictionaries we

create ...... 88

5.8 Some “unmatched” lexical entries...... 90

6.1 Different senses of the word “chair” ...... 97

6.2 Synsets obtained from different WordNets and their translations in Vietnamese 98

6.3 Example of calculating the ranks of candidates in Arabic...... 101

6.4 Example of Case 2 to select candidates ...... 103

6.5 Example of Case 3 to select candidates ...... 103

6.6 The number of WordNet synsets we create using the IW approach. . . . . 105

6.7 The number of WordNets synsets we create using the IWND approach. . . 105

6.8 The number and the average score of WordNets synsets we create...... 105

7.1 Some examples of Vietnamese phrases and their translations ...... 118

7.2 Some translations we create are correct but do not match with translations

by the Google Translator...... 119

1 Sample entries in the English-Assamese reverse dictionary ...... 134

2 Sample entries in the English-Vietnamese reverse dictionary ...... 134

3 Sample entries in the English-Dimasa reverse dictionary ...... 135

4 Sample entries in the English-Karbi reverse dictionary ...... 135

5 Sample entries in the Assamese-Vietnamese and Assamese-Arabic dictionar-

ies ...... 136

6 Sample entries in the Assamese-German and Assamese-Spanish dictionaries 136

7 Sample entries in the Arabic-German and Arabic-Spanish dictionaries . . . 137

8 Sample entries in the Vietnamese-German and Vietnamese-Spanish dictionaries 137

9 Sample entries in the Assamese WordNet synsets ...... 138

10 Sample entries in the Arabic WordNet synsets ...... 139

11 Sample entries in the Vietnamese WordNet synsets ...... 140

FIGURES

1.1 “A new Vietnamese-English dictionary” compiled by William Peter Hyde [41].4

2.1 A general method to create a new bilingual dictionary...... 16

2.2 An example of the lexical triangulated translation method ...... 22

4.1 The idea behind the DR algorithm ...... 54

4.2 The drawback of the DR algorithm...... 56

4.3 The idea behind the DRwD algorithm ...... 57

4.4 The drawback of the DRwD algorithm ...... 59

4.5 The idea of the DRwS algorithm ...... 60

4.6 The idea behind the DRwSD algorithm ...... 62

5.1 An example of generating an entry for a Dimasa-Vietnamese dictionary

using the DT approach ...... 78

5.2 The IW approach for creating a new bilingual dictionary ...... 78

5.3 Example of generating lexical entries for a Dimasa-Arabic dictionary using

the IW approach ...... 81

5.4 New bilingual dictionaries created ...... 82

6.1 The DT approach to construct WordNet synsets in a target language T... 96

6.2 The IW approach to construct WordNet synsets in a target language T .. 98

6.3 The IWND approach to construct WordNet synsets ...... 99

6.4 Example of Case 1 to select candidates ...... 102

CHAPTER 1

INTRODUCTION

1.1 Overview

The Ethnologue organization1, which compiles the most comprehensive catalogue of the languages of the world, lists 7,106 living languages. Half the world’s population speaks the 13 most populous languages; the other half of the world speaks the rest2. Eighty languages,

1.2% of all languages, are spoken by 79.5% of the world’s population, and 305 (5.5%) are spoken by 94.2%3. One hundred languages are spoken by at least 7.4 million people each, the rest by fewer4. 81.3% of the world’s languages are spoken by fewer than a million people each. Many languages spoken by even tens of millions of people do not have official status or have only

(low) regional status, even within their own countries5. With so many languages spoken by so few, many languages do not have high political or economic status. In addition to many that are isolated by inhospitable geography, most languages lack resources to survive and thrive. These resources include books for infants and children, books for adults of various kinds, newspapers, magazines, monolingual dictionaries, bilingual dictionaries, thesauri, and these days electronic versions of these same resources. In contrast to resource-poor languages, resource-rich languages have better access to resources like dictionaries, thesauri, ontologies and possibly have plentiful text corpora as well. In truth, no language can be considered truly resource-rich in absolute terms, but we may consider a few languages (e.g.,

1 http://www.ethnologue.com/
2 http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
3 http://www.ethnologue.com/statistics/size
4 http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
5 http://en.wikipedia.org/wiki/List_of_languages_without_official_status

English, Spanish and Japanese), to be resource-rich in relative terms; researchers have created many resources to facilitate various aspects of computational processing for such languages. There are a few other languages that have a limited number of resources, but can benefit from additional resources (e.g., Arabic and Vietnamese). Other languages have very few resources, if any. Many other languages are becoming endangered, a state which is likely to lead to their extinction, without determined intervention. Some endangered languages are Chrau and Tai Daeng in Vietnam, Karbi and Dimasa in India, Cherokee and

Cheyenne in America.

We construct lexical resources necessary for computational processing of natural languages in areas such as information retrieval, automatic word-sense disambiguation, computing document similarity, machine learning, and machine translation. Consider bilingual dictionaries, an essential tool for human language learners. Most existing (print or online) bilingual dictionaries are between two resource-rich languages (e.g., English-Spanish,

Japanese-Chinese and French-German dictionaries), or between a resource-rich language and a resource-poor language (e.g., English-Assamese and English-Cherokee dictionaries).

The powerful online machine translators (MT) developed by Google6 and Bing7 provide pairwise translations for 80 and 50 languages, respectively. These machines provide translations for single words and phrases also. In spite of so much information for some “privileged” language pairs, there are many languages for which we are lucky to find a single bilingual dictionary online or in print. For example, we can find an online Karbi-English dictionary and an English-Vietnamese dictionary, but we cannot find a Karbi-Vietnamese dictionary.

6 https://translate.google.com/
7 http://www.bing.com/translator/

Another important resource that is very helpful in computational processing and in human language learning is a thesaurus providing synonyms and antonyms of words. An enriched thesaurus that provides additional relations among words in the computational context is called a WordNet. An English version of such a WordNet has been produced over several decades at Princeton University, and similar complete WordNets have also been produced for a small number of additional languages (e.g., the French, Hindi and Japanese WordNets).

Most such resources do not really exist for resource-poor and endangered languages.

This dissertation focuses on developing new techniques that leverage existing resources for resource-rich languages to build bilingual dictionaries and WordNets for languages, es- pecially languages having very few resources. In addition, a phrase translation model using a bilingual dictionary augmented by n-gram data is also proposed to obtain translations for phrases that occur within these resources or even outside. We believe using approaches that are not language-specific to create computational lexical resources, some of which may be adapted to produce printed resources as well, may work in concert with other similar efforts to invigorate speakers, learners and users of these languages.

1.2 Types of lexical resources

According to Landau [58], a dictionary or a lexicon consists of a list of entries sorted by the lexical unit. Each entry usually contains a lexical unit, the definition associated with it, part-of-speech (POS), pronunciation, examples showing the uses of words, and possibly additional information. The lexical unit is usually a single word, whereas its definition is a single word, a multiword expression, or a phrase. A monolingual dictionary contains only one language such as the Oxford English Dictionary8. A bilingual dictionary consists of

translations of words between two languages such as “A Dictionary in Assamese and En-

glish” [18]. The monolingual dictionary is mainly used by the native speaker for reading

and understanding texts. The bilingual dictionary is used to understand the words in the source language [58], or to translate [84]. A bilingual dictionary can be unidirectional or bidirectional. A unidirectional dictionary contains translations from the source language to the target language, but the reverse translations are not provided. In contrast, a bidirectional dictionary consists of translations from the source language to the target language, and from the target language to the source language. Besides the obvious bilingual dictionaries that cover all words used generally in a language, one finds specific dictionaries such as a synonym dictionary (e.g., Merriam-Webster’s Dictionary of Synonyms [73]), a dictionary focused on proper names (e.g., A Dictionary of Surnames [36]), or one focused on a narrow and specific area (e.g., Black’s Law Dictionary [30], and Stedman’s Medical Dictionary [113]). Figure 1.1 is an example of a Vietnamese-English paper bilingual dictionary [41].

8 http://www.oed.com/

Figure 1.1: “A new Vietnamese-English dictionary” compiled by William Peter Hyde [41].

Kilgarriff [47] defines a thesaurus as a resource that groups words according to similarity. According to Roget [97], a thesaurus should be “...not in alphabetical order as they are in a dictionary, but according to the ideas which they express”. In particular, according to Soergel [111], a thesaurus contains a set of descriptors, an indexing language, a classification scheme, or a system vocabulary. A thesaurus also consists of relationships among descriptors. Each descriptor is a term, a notation, or another string of symbols used to designate the concept. Examples of thesauri are Roget’s International Thesaurus [98], Open

Thesaurus9 or a large online English thesaurus simply called thesaurus.com.

Miller [75] introduces WordNet, which is a large lexical database where nouns, verbs, adjectives, and adverbs are grouped into unordered sets of cognitive synonyms, the so-called synsets. Each synset expresses a distinct concept. The WordNet is both an enriched dic- tionary and thesaurus. Given a lexical unit, the general dictionary and WordNet return definitions, POSes and examples. For the lexical unit, the dictionary mainly contains single words while the WordNet can consist of short phrases such as “tabular array”, “scholarly person”, and “grape vine”. Given a concept, the WordNet and thesaurus return terms which fit the concept. The words in WordNet synsets are disambiguated in terms of senses. The relationships between words (such as hypernyms or generalization, hyponyms or particular- ization, and meronymy or part-whole relationships) in the WordNet are labeled. Currently, the biggest WordNet is the Princeton WordNet10 version 3.0 which has 117,659 synsets including 82,115 noun synsets, 13,767 verb synsets, 18,156 adjective synsets, and 3,621 ad- verb synsets. Some other WordNets are the FinnWordNet [66], the Japanese WordNet [43], the EuroWordNet [122]. The AsianWordNet11 (AWN) provides a platform for building and sharing WordNets among Asian languages (viz., Bengali, Hindi, Indonesian, Japanese,

Korean, Lao, Mongolian, Burmese, Nepali, Sinhala, Sundanese, Thai, and Vietnamese).

9 http://www.openthesaurus.de/
10 http://wordnet.princeton.edu/
11 http://www.asianwordnet.org/progress

Unfortunately, the progress of the WordNets in AWN is extremely slow, and they are far from being finished.

Schmidt and Wörner [105] define parallel corpora as “collections of written texts and their translations into one or more languages, edited and aligned for the purpose of linguistic analysis”. Zanettin [125] introduces a comparable corpus consisting of “texts in the languages involved, which share similar criteria of composition, genre and topic”. A corpus containing only one language is called a monolingual corpus such as the British National

Corpus12 and the Brown Corpus13. A bilingual corpus involves two languages, such as the

English-Vietnamese Bilingual Corpus (EVbcorpus) [83], while a multilingual corpus consists of three or more languages such as the International Cambridge Language Survey14.

1.3 Research focus and contribution

The dissertation concentrates on automatically constructing multilingual lexical resources, especially bilingual dictionaries and WordNets, for several natural languages. We also introduce a novel method to translate a given phrase in a source language to a target language. The languages we focus on are the following.

- Languages that are widely spoken but have limited computational resources such as

Arabic and Vietnamese.

- A language that is spoken by tens of millions in northeast India, but has almost no

resources, such as Assamese.

12 http://www.natcorp.ox.ac.uk/
13 http://clu.uni.no/icame/brown/bcm.html
14 http://www.cambridge.org/us/academic/subjects/languages-linguistics/english-language-and-linguistics-general-interest/series/cambridge-language-surveys

- Languages that are in the UNESCO Endangered Languages15 list such as Cherokee,

Cheyenne, Dimasa and Karbi.

We note that Cherokee16 is the Iroquoian language spoken by 13,500 Cherokee people in

Oklahoma and North Carolina. Cheyenne17 is a Native American language spoken by 2,100

Cheyenne people in Montana and Oklahoma. Dimasa18 and Karbi19 are spoken by 110,000 and 420,000 people, respectively, in India. Assamese20 is an Indo-European language spoken by about 16 million people and is resource-poor. Vietnamese21 is an Austroasiatic language spoken by 75 million people in Vietnam and Vietnamese Diaspora, whereas Arabic22 is an

Afro-Asiatic language spoken by 290 million people in countries of the Arab League.

First, we focus on creating reverse bilingual dictionaries. Published methods for automatically creating new dictionaries from existing dictionaries use intermediate dictionaries.

Unfortunately, we are lucky to find a single bilingual dictionary online or in software form for many resource-poor languages. So, our first effort, to increase lexical resources for a language under consideration, is to investigate the creation of a reverse dictionary from an existing dictionary, if we can find one. To remove ambiguous entries and increase the number of entries in created dictionaries, WordNets of resource-rich languages will be used to compute similarities between words or phrases. Of course, a new reverse dictionary is associated with the same two languages as the original dictionary that is reversed.

15 http://www.unesco.org/new/en/culture/themes/endangered-languages/
16 http://en.wikipedia.org/wiki/Cherokee_language
17 http://en.wikipedia.org/wiki/Cheyenne_language
18 http://en.wikipedia.org/wiki/Dimasa_language
19 http://en.wikipedia.org/wiki/Karbi_language
20 http://en.wikipedia.org/wiki/Assamese_language
21 http://en.wikipedia.org/wiki/Vietnamese_language
22 http://en.wikipedia.org/wiki/Arabic_language

Our next effort at increasing lexical resources will be to create bilingual dictionaries for language pairs for which such dictionaries do not exist. We will create dictionaries from resource-poor languages to several other languages by exploiting publicly available

WordNets, bilingual dictionaries, and the dictionaries we create in the first task. Resource- rich languages will provide the pivots for such translations. In general, if a word b (which may be polysemous) in language B is translated into a word a in language A and a word c in language C, we cannot necessarily conclude that a is a translation of c because of their association with b. Hence, statistical techniques and WordNets are used to remove ambiguous entries.

WordNets are among the most heavily used lexical resources. We develop algorithms and models to automatically build WordNets for languages using available resources, but also by bootstrapping with resources we create ourselves. If we can create a number of

WordNets of acceptable quality, we believe it will contribute significantly to the repository of resources for languages that lack them.

A problem we have encountered in our previous tasks is that quite often a dictionary entry has a sense that is given in terms of a sequence of words or a phrase. When we reverse a bilingual dictionary or create bilingual dictionaries for new language pairs, so far we have ignored such sense entries since we do not know how to translate a phrase into the target language. Jackendoff [44, page 156] estimated that the number of multiword expressions or phrases in a person’s vocabulary is of the same order as the number of single words. In addition, Sag et al. [100] found that 41% of the words in WordNet 1.7 are multiword expressions. In the last research task, we develop a model to translate phrases in a given source language to a target language using a dictionary-based approach and n-gram data, in a case study of translating phrases from Vietnamese to English. This is an initial step in generating translations for phrases occurring both outside and inside bilingual dictionaries using the information from existing bilingual dictionaries.

1.4 Intellectual and scientific merit

This dissertation will present several novel approaches, from simple to complex, for automatically generating bilingual dictionaries and WordNets. We will also compare our proposed methods against existing methods to find positive and negative points of difference, and the reasons for the drawbacks. In addition, most existing research works with languages that have some available lexical resources, each of which is expensive to construct. Using many intermediate lexical resources for creating a new one may cause ambiguity in the lexical resource created. The approaches we propose have the potential not only to create new lexical resources using just a few existing lexical resources, reducing the cost and time required, but also to improve the quality of the lexical resources we create.

Briefly, to be able to automatically create many lexical resources for languages, especially resource-poor and endangered ones, we need processes that do not require many resources to begin with, presenting challenging problems for the computational linguist. Our research will make substantial progress on these problems by bootstrapping and leveraging WordNets and dictionaries for resource-rich languages.

1.5 Broader impact

The goal of this dissertation is to study the feasibility of creating multilingual lexical resources for languages by bootstrapping from a few existing resources. Our research has the potential not only to construct new lexical resources, but also to support communities using languages with limited resources.

1.6 Organization of the dissertation

The thesis is organized as follows. Existing approaches for constructing new bilingual dictionaries and WordNets for languages, and generating phrase translation are presented in Chapter 2. Chapter 3 introduces notations, input resources used and the methods to evaluate resources we create. Chapter 4 and Chapter 5 propose methods to create reverse bilingual dictionaries and new bilingual dictionaries, respectively. Approaches to construct

WordNet synsets for many languages are proposed in Chapter 6. In Chapter 7, we present algorithms to generate translations for phrases with a case study on translating from Viet- namese to English. Future work is discussed at the end of each chapter. Chapter 8 concludes the thesis.

Acknowledgment

A synopsis of this dissertation is presented in the paper “Automatically creating multilingual lexical resources” in the Proceedings of the Doctoral Consortium at the 28th Conference on Artificial Intelligence (AAAI), pages 3077-3078, Quebec, Canada, July 2014.

CHAPTER 2

RELATED WORK

2.1 Introduction

Understanding existing approaches to create new bilingual dictionaries, to generate

translations for phrases and to construct WordNets provides us the background knowledge

required to develop techniques to solve our problems, discussed in this dissertation. In this

chapter, we summarize and discuss related work to build relevant lexical resources. The

remainder of this chapter is organized as follows. In Section 2.2, we describe the structure

of lexical resources. Section 2.3 gives the ISO 639-3 codes of the languages mentioned in

this dissertation. Specific approaches to generate dictionaries, translations for phrases and

WordNets from different linguistic resources are presented in Section 2.4, Section 2.5 and

Section 2.6, respectively. Section 2.7 summarizes the chapter.

2.2 Structure of lexical resources

This thesis proposes approaches to automatically construct bilingual dictionaries and

WordNets. Therefore, this section presents the structure of bilingual dictionaries and Word-

Nets, focusing on the Princeton WordNet.

2.2.1 Structure of a bilingual dictionary

For notational purposes, we assume that a bilingual dictionary Dict(A,B)

contains entries of word or phrase translations from the source language A to the target

language B. Similarly, a dictionary Dict(B,A) consists of translation entries from words

or phrases in the language B to words or phrases in language A. In particular, Dict(A,B)

contains entries (a,b) whereas Dict(B,A) contains entries (b,a).

A dictionary entry, called LexicalEntry, is a 2-tuple <LexicalUnit, Definition>. Here

LexicalUnit is a word or a phrase being defined, also called definiendum more formally,

based on Aristotle’s analysis [58]. Usually, a LexicalUnit is lemmatized (i.e., reduced to

a representative or citation form such as infinitives for verbs), but not always. A list of

entries sorted by the LexicalUnit is called a lexicon or a dictionary. Given a LexicalUnit, the Definition associated with it usually contains its class (e.g., part-of-speech (POS)) and pronunciation, its meaning, and possibly additional information, including usage. The meaning associated with it can have several Senses. A Sense is a discrete representation of a single aspect of the meaning of a word. Thus, a dictionary entry is of the form

<LexicalUnit, Sense1, Sense2, · · ·>.
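To make the structure concrete, the following minimal sketch (in Python, with illustrative field names that are not part of the thesis) shows one way such entries and a bilingual dictionary Dict(A,B) could be represented.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Sense:
    pos: str                                  # part-of-speech, e.g., "noun"
    gloss: str                                # one discrete aspect of the meaning
    examples: List[str] = field(default_factory=list)

@dataclass
class LexicalEntry:
    lexical_unit: str                         # the word or phrase being defined (definiendum)
    senses: List[Sense] = field(default_factory=list)

def make_dict_ab(entries: List[LexicalEntry]) -> Dict[str, LexicalEntry]:
    """A bilingual dictionary Dict(A,B): entries sorted by the lexical unit in language A,
    each sense carrying a definition/translation in language B."""
    return {e.lexical_unit: e for e in sorted(entries, key=lambda e: e.lexical_unit)}

entry = LexicalEntry("mùa xuân", [Sense("noun", "spring (the season)")])
dict_vie_eng = make_dict_ab([entry])
print(dict_vie_eng["mùa xuân"].senses[0].gloss)   # -> spring (the season)
```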

2.2.2 Structure of the Princeton WordNet

The main relation between words in a WordNet is synonymy. A synset contains one or many words. A polysemous word is assigned to many synsets. Each synset has one gloss, which is a brief definition of the concept, along with sentences showing the use of words in the synset. The WordNet 2.1 overview by Marin Dantchev [26] says that each synset is linked to other synsets by numerous conceptual relations. The rest of this section will discuss the synsets from the four syntactic categories: nouns, adjectives, adverbs and verbs.

The Princeton WordNet version 3.0 has 117,798 nouns with 82,115 synsets. The noun synsets are organized into hierarchies. WordNet distinguishes types and instances in noun synsets [29]. Types contain common nouns such as “location”, “president” and “car”, whereas most proper nouns are instances, such as “Colorado”, “Obama” and “Ford”. The

instances always are leaves of trees, or terminal nodes in the hierarchy. The relations among

noun synsets are super-subordinate relations (viz., hypernymy and hyponymy), part-whole

relations (viz., meronymy) and antonymy; a short code sketch after the list below illustrates querying these relations.

- Hypernymy is a semantic relation that links a more general word to a more specific

word. For example, the hypernym set of the word “dog” is {canine, canid}.

- Hyponymy links a more specific word to a general word. The hyponym set of the

word “canid” is {bitch, dog, wolf, jackal, hyena, hyaena, fox}. Hyponymy is transitive.

For example, the word “dog” represents a kind of the word “canine”, which represents

a kind of the word “carnivore”; so “dog” represents a kind of “carnivore”.

- Meronymy links synsets denoting parts to synsets denoting the whole. In particular,

if a word a is a meronym of a word b, a is one part of b. For example, the words

{back, backrest, leg} are meronyms of the word “chair”. The inverse of meronymy is

holonymy. Therefore, the word “chair” is the holonym of {back, backrest, leg}.

- Antonymy expresses the relation between two nouns. For instance, the word

“woman” is an antonym of the word “man”.
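The sketch below illustrates how these noun relations can be queried through the NLTK interface to the Princeton WordNet; it assumes the nltk package and its WordNet data are installed and is only an illustration, not part of the thesis.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

dog = wn.synset('dog.n.01')
print(dog.definition())                  # gloss of the synset
print(dog.hypernyms())                   # more general synsets, e.g., canine.n.02
print(dog.hyponyms()[:3])                # a few more specific synsets
print(dog.member_holonyms())             # wholes that 'dog' belongs to, e.g., pack.n.06

chair = wn.synset('chair.n.01')
print(chair.part_meronyms())             # parts of a chair, e.g., back.n.08, leg.n.03

# Antonymy is recorded between lemmas (word forms), not whole synsets:
man = wn.lemma('man.n.01.man')
print(man.antonyms())                    # -> [Lemma('woman.n.01.woman')]
```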

The current WordNet contains 21,479 adjectives organized into 18,156 synsets. Adjective synsets are classified into two categories: descriptive adjectives and relational adjectives. The main relation in descriptive adjectives is antonymy, e.g., the antonym of the word “short” is {long}. Adjective synsets are organized into bipolar clusters where words similar to one adjective are grouped with all words similar to its antonym [26]. The relation in relational adjectives is pertainym, which points to the nouns they are derived from, e.g.,

“criminal” and “crime”, “nuclear” and “nucleus”.

There are 3,748 adverbs with 733 synsets. Adverbs in WordNet are usually derived from adjectives via morphological affixation, such as “strongly”, “shortly” and “rarely”. The relations among adverb synsets are synonymy and, sometimes, antonymy.

WordNet contains 6,277 verbs with 5,252 synsets. Verb synsets are also organized into hierarchies. The common relations between verb synsets are troponymy, entailment, and the cause relation.

- Troponymy is when the activity of one verb is doing the activity of another verb in

some manner. For example, the verb “run” is the troponym of the verb “walk”.

- Entailment holds when one verb logically implies that another event has occurred. For instance, the

verb “divorce” entails the verb “marry”.

- The cause relation relates one verb, which is causative and another, which is resulta-

tive. For example, the verb “show” and the verb “see” have a cause relation between

them.

Another widely used term is Common Base Concepts, firstly introduced in building

EuroWordNet [96]. A concept is important if it is widely used. In the EuroWordNet, the

Common Base Concepts are classified using a Top Ontology. The Top Ontology is divided into three categories named 1stOrderEntities, 2ndOrderEntities, and 3rdOrderEntities.

- The 1stOrderEntities contain concrete synsets which are specified for four roles, viz.,

“origin”, “form”, “composition” and “function”. For example, vehicle is classified as

Artifact (Origin) + Object (Form) + Vehicle (Function). The 1stOrderEntities always

are nouns.

- The 2ndOrderEntities include synsets which are located in time, occur or take place

rather than existing, e.g., “continue”, “occur” and “play”. The 2ndOrderEntities can

be nouns, verbs and non-dynamic adjectives.

- The 3rdOrderEntities consist of synsets which exist independently of time and space.

They can be true or false rather than real, e.g., “idea”, “thought”, “information” and

“plan”. The 3rdOrderEntities are always nouns.

2.3 Language codes

In this thesis, we use names of languages and their ISO 639-3 codes interchangeably.

The ISO 639-3 codes of the languages mentioned, including those in the discussion of related work and in our experiments, are presented in Table 2.1.

Table 2.1: Languages mentioned and their ISO 639-3 codes

Language     Code    Language     Code    Language    Code    Language      Code
Arabic       arb     Assamese     asm     Bengali     ben     Cherokee      chr
Cheyenne     chy     Chinese      cht     Croatian    hrv     Dimasa        dis
Dutch        nld     English      eng     French      fra     Finnish       fin
Galician     glg     German       deu     Hindi       hin     Hungarian     hun
Indonesian   ind     Japanese     jpn     Karbi       ajz     Korean        kor
Italian      ita     Lithuanian   lit     Malay       zlm     Thai          tha
Spanish      spa     Slovenian    slv     Swedish     swe     Vietnamese    vie

2.4 Creating new bilingual dictionaries

To construct a new bilingual dictionary, we may use diverse available resources such

as existing dictionaries, thesauri, corpora or WordNets. Whatever resources are used, there

are two main steps to create a new bilingual dictionary. First, translation candidates are

extracted from resources used (e.g., dictionaries, thesauri or corpora). Second, heuristic

algorithms or statistical information is used to disambiguate and to rank translation can-

didates. The general method for constructing a new bilingual dictionary is presented in

Figure 2.1. The approaches we discuss in the next subsections all fit within this general

architecture.

Figure 2.1: A general method to create a new bilingual dictionary.

Human evaluation is the first choice in evaluating the quality of a new dictionary.

However, it is really hard to find volunteers familiar with both languages in a dictionary Dict(A,B) we may create, such as Assamese-Vietnamese or Cherokee-Karbi. Researchers have therefore evaluated their approaches by generating a dictionary for another language pair Dict(C,D) such that there exists at least one published good-quality dictionary Dict*(C,D), which is used as the gold standard. Using this second evaluation method, we can compute precision, recall,

2.4.1 Generating bilingual dictionaries using one intermediate language

A basic approach to create a new dictionary and handle ambiguities is a pivot- based method that uses inverse consultation, introduced by Tanaka and Umemura [115].

They generate a Japanese-French dictionary Dict(jpn, fra) and a French-Japanese dic- tionary Dict(fra,jpn) from a Japanese-English harmonized dictionary, Dicthm(jpn, eng), and an English-French harmonized dictionary, Dicthm(eng, fra). A harmonized dictionary

Dicthm(A, B) is a symmetrical dictionary created by integrating two unidirectional dictio- naries Dict(A,B) and Dict(B,A). In the one time inverse consultation method, for each given word in the source language, Japanese, they find a translation chain jpn → eng1 → fra → eng2, and then count the number of matches between eng1 and eng2, where eng1 and eng2 are two sets of words obtained by translation as shown by the arrows. The greater the number of matches, the better the translation candidate. Similarly, in two-time inverse consultation, for each given Japanese word jpn1, they experiment with the trans- lation chain jpn1 → eng → fra → eng → jpn2, and then, count the number of matches between the input Japanese word and the returned Japanese words. For evaluation, Tanaka and Umemura [115] randomly select 100 entries from each of the dictionaries they create,

Dict(jpn,fra) and Dict(fra,jpn), then evaluate them manually and by calculating the match- ing percentage between these entries and entries in the published dictionaries. The greatest 18

matching fraction for manual evaluation and matching percentage are 56% and 58%, re-

spectively.

Shirai et al. [109], and Shirai and Yamamoto [108] conclude that the inverse consul-

tation approach does not resolve the WSD problem well. In addition, differences in the

linguistic natures of languages, such as Japanese and English, affect the content of the har-

monized dictionaries. The authors introduce methods to improve the quality of dictionaries

created using inverse consultation. Shirai and Yamamoto [108] generate translation candi-

dates from Korean to Japanese using one-time inverse consultation from two dictionaries:

Korean-English and Japanese-English. Then, the degree of similarity between words is used

to select correct translations. Given a word w_K in the source language (Korean) and a word w_J in the target language (Japanese), the degree of similarity between w_K and w_J is based on the number of common translations of these words in the intermediate language (English):

degree\_of\_similarity(w_K, w_J) = \frac{2 \times |common(E_{w_K}, E_{w_J})|}{|E_{w_K}| + |E_{w_J}|},        (2.1)

where E_{w_K} and E_{w_J} are the sets of translations in English of w_K and w_J, respectively.

For evaluation, they randomly select 1,000 Korean words in a published Korean-Japanese dictionary, and then create the Japanese translations for these Korean words using their approach. They evaluate their translations against the translations in a published dictionary. The accuracy of their translations is 72% when the degree of similarity is equal to or greater than 0.8.
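A small sketch of the degree-of-similarity computation in Equation 2.1, assuming each word's English translations are available as a set; the words and thresholds below are illustrative only.

```python
def degree_of_similarity(eng_of_wk: set, eng_of_wj: set) -> float:
    """Dice-style overlap between the English translation sets of w_K and w_J (Eq. 2.1)."""
    if not eng_of_wk or not eng_of_wj:
        return 0.0
    common = eng_of_wk & eng_of_wj
    return 2 * len(common) / (len(eng_of_wk) + len(eng_of_wj))

# Toy example: a Korean word and two Japanese candidates sharing English translations.
kor_eng = {"spring", "springtime"}              # translations of the Korean word
jpn_eng_candidate1 = {"spring", "fountain"}      # first Japanese candidate
jpn_eng_candidate2 = {"spring", "springtime"}    # second Japanese candidate
print(degree_of_similarity(kor_eng, jpn_eng_candidate1))  # 0.5
print(degree_of_similarity(kor_eng, jpn_eng_candidate2))  # 1.0 -> kept if >= 0.8
```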

Zhang et al. [126] create a Japanese-Chinese dictionary from Japanese-English and

English-Chinese dictionaries using one-time inverse consultation. To rank candidates and remove irrelevant candidates, they calculate a penalty value for each pair of candidates.

The smaller the penalty value, the better the translation:

penalty(w_J, w_C) = k_1 \cdot F_1(w_J, w_C) - k_2 \cdot F_2(w_J, w_C),        (2.2)

where k_1 and k_2 are weights, which are set based on preliminary experiments, F_1 is the similarity value in POS between a Japanese word w_J and a Chinese word w_C, and F_2 is the one-time inverse consultation score of that pair. 172 Japanese words were randomly selected for human evaluation, to be marked either “correct” or “wrong”. The accuracy of their best dictionary is 70.12%.

According to Shirai et al. [109], selecting correct translations among many translation candidates produced using two-time inverse consultation is a challenge. Starting with a Korean-English dictionary and an English-Japanese dictionary, Shirai et al. [109] use the two-time inverse consultation method to generate Korean-Japanese candidates; then, look for overlaps to limit the number of translation candidates. They evaluate their translations by comparing with a published Korean-Japanese dictionary. The precision of their dictionary is 85.7%, while the recall ratio is 35%.

Paik et al. [92] experiment with different input bilingual dictionaries and take di- rectionality into account in creating new Korean-Japanese dictionaries with different ac- curacies. First, given a Korean-English dictionary Dict(kor,eng) and a Japanese-English dictionary Dict(jpn,eng), the one-time inverse consultation method is used. According to their experiment, the more similar the source and target languages1 are, the more correct the translations are. The same approach with several pivot languages is also used by Paik et al. [91]. Their second experiment computes the overlapping constraints of translation can- didates created from Dict(kor,eng) and Dict(eng,jpn). The candidate with a high overlap

1 For example, Korean and Japanese share vocabularies using equivalent Chinese characters.

similarity score is likely to be the correct translation:

overlap\_similarity\_score(w_J, w_K) = |w_J|, \quad w_J \in J(E(w_K)),        (2.3)

where E(wK ) is a set of translations in English of a Korean word wK , and J(E) is the set of

translations in Japanese of words in English. This method can increase the number of entries

in the new dictionaries created significantly. However, many ambiguous entries are created

in the new dictionaries due to the presence of polysemous words in the pivot language.

Finally, a new dictionary is created from Dict(eng,kor) and Dict(eng,jpn). The candidates

whose similarity scores are greater than a threshold are added to the new dictionary. The

similarity score for wJ and wk is computed as below:

similarity\_score(w_J, w_K) = \frac{|K(E(w_K) \cap E(w_J))| + |J(E(w_K) \cap E(w_J))|}{|E(w_K) \cap E(w_J)|}.        (2.4)

Paik et al. [92] claim that it is appropriate to construct a new dictionary Dict(A,C) using the two bilingual dictionaries Dict(A,B) and Dict(C,B), when A and C are very similar.
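The following sketch shows one way the similarity score of Equation 2.4 could be computed, assuming simple dictionaries mapping a word to its translation set (E(.), K(.) and J(.) in the equation); the data and function names are illustrative, not Paik et al.'s implementation.

```python
def similarity_score(w_j: str, w_k: str,
                     eng_of: dict,      # word -> set of English translations, E(.)
                     kor_of_eng: dict,  # English word -> set of Korean translations, K(.)
                     jpn_of_eng: dict   # English word -> set of Japanese translations, J(.)
                     ) -> float:
    """Similarity score of Eq. 2.4 for a Japanese/Korean candidate pair."""
    shared_eng = eng_of.get(w_k, set()) & eng_of.get(w_j, set())
    if not shared_eng:
        return 0.0
    kor_side = set().union(*(kor_of_eng.get(e, set()) for e in shared_eng))
    jpn_side = set().union(*(jpn_of_eng.get(e, set()) for e in shared_eng))
    return (len(kor_side) + len(jpn_side)) / len(shared_eng)

# Toy data only: one Korean word, one Japanese candidate, linked through English "spring".
eng_of = {"봄": {"spring"}, "はる": {"spring", "springtime"}}
kor_of_eng = {"spring": {"봄"}, "springtime": {"봄"}}
jpn_of_eng = {"spring": {"はる", "ばね"}, "springtime": {"はる"}}
print(similarity_score("はる", "봄", eng_of, kor_of_eng, jpn_of_eng))  # -> 3.0
```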

The pivot-based method is also used by Sjöbergh [110] to create a new Japanese-

Swedish dictionary Dict(jpn,swe) from a Japanese-English dictionary Dict(jpn,eng) and a Swedish-English dictionary Dict(swe,eng). After removing English stop words in the existing dictionaries, each English word w_E is assigned a weight, calculated by the idf-like measure

weight(w_E) = \log\left(\frac{|Dict(swe,eng)| + |Dict(jpn,eng)|}{|Dict(swe,eng)_{w_E}| + |Dict(jpn,eng)_{w_E}|}\right),        (2.5)

where |Dict(A,B)| is the number of entries in the dictionary, and |Dict(A,B)_{w_E}| is the number of descriptions in the dictionary containing the word w_E. Then, they match English words in the two existing dictionaries and score the matches as follows:

score = \frac{\sum_{w_E \in a} 2 \cdot weight(w_E)}{\sum_{w_{E_1}} weight(w_{E_1}) + \sum_{w_{E_2}} weight(w_{E_2})},        (2.6)

where a ∈ Dict(swe,eng) ∩ Dict(jpn,eng), w_{E_1} ∈ Dict(swe,eng), and w_{E_2} ∈ Dict(jpn,eng). A better translation has a higher score. For multiword expressions that have no translation in the target language, the concatenations of translations of single words in the target language are accepted as correct translations. Volunteers are asked to evaluate 300 words using a 5-point scale: all correct, majority correct, some correct, similar (which means the translation is not correct, but close to being correct), and wrong. The accuracies of their translations are 75% all correct with a score greater than 0.9 and 89% all correct with a score equal to 1.0.
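A rough sketch of Sjöbergh's idf-like weighting and match scoring (Equations 2.5 and 2.6), assuming each input dictionary maps a source word to a list of English definition words; this is an illustration under those assumptions, not the original implementation.

```python
import math

def weight(w_e: str, swe_eng: dict, jpn_eng: dict) -> float:
    """idf-like weight of an English word over the two input dictionaries (Eq. 2.5)."""
    total = len(swe_eng) + len(jpn_eng)
    containing = sum(w_e in defn for defn in swe_eng.values()) + \
                 sum(w_e in defn for defn in jpn_eng.values())
    return math.log(total / max(containing, 1))

def match_score(shared: set, defn_swe: set, defn_jpn: set,
                swe_eng: dict, jpn_eng: dict) -> float:
    """Score a Swedish/Japanese match via their shared English definition words (Eq. 2.6)."""
    numerator = sum(2 * weight(w, swe_eng, jpn_eng) for w in shared)
    denominator = sum(weight(w, swe_eng, jpn_eng) for w in defn_swe) + \
                  sum(weight(w, swe_eng, jpn_eng) for w in defn_jpn)
    return numerator / denominator if denominator else 0.0

swe_eng = {"vår": ["spring", "season"], "hund": ["dog"]}
jpn_eng = {"はる": ["spring"], "いぬ": ["dog", "hound"]}
print(match_score({"spring"}, set(swe_eng["vår"]), set(jpn_eng["はる"]), swe_eng, jpn_eng))
```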

2.4.2 Generating bilingual dictionaries using many intermediate languages

To increase the precision of new dictionaries, one can construct new bilingual dic- tionaries using transitivity with two or more pivot languages. Gollins and Sanderson [32] introduce a triangulated translation method for improving cross-language information re- trieval. To create a translation of a word a in the source language A in the target language

B, they translate a to two intermediate languages C and D to generate words c and d, respectively. Then, they translate c and d to the target language B and merge the results in different ways. Adding one more intermediate language to the triangulated translation method produces “three-way” triangulated translation. Their experiments are with Euro- pean languages that are covered by the EuroWordNet [122]. They select words in a source language, create translations in a target language, and evaluate by comparing their trans- lations with the translations obtained from the EuroWordNet. According to Gollins and

Sanderson, triangulated translation outperforms the transitive method by over 55% when the accuracy metric is used because it helps reduce ambiguous senses of words in translations. The three-route triangulated scheme provides higher accuracy than the two-route

triangulated scheme. The addition of pseudo-relevance feedback [6] as pre-translation to

triangulation translation improves precision of translations. An example of the triangulated

translation method applied to a non-European language with English and French as pivots

to create entries for a new dictionary is shown in Figure 2.2. The Hindi word “vasant” is

translated to English and French. Then, the resulting words in the intermediate languages

are translated to Vietnamese in order to generate translation candidate sets. The correct

translations of this Hindi word in Vietnamese are the words that survive after applying

different merge strategies on the translation candidate sets. As a result, the translation of

“vasant” is “mùa xuân”.

Figure 2.2: An example of the lexical triangulated translation method
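A minimal sketch of the triangulated translation idea of Figure 2.2, using one simple merge strategy (keeping only candidates produced by both pivot routes); the toy dictionaries below are illustrative.

```python
def triangulated_candidates(word: str,
                            src_to_pivot1: dict, pivot1_to_tgt: dict,
                            src_to_pivot2: dict, pivot2_to_tgt: dict) -> set:
    """Translate a source word through two pivot languages and keep only the
    target-language candidates produced by both routes (one possible merge strategy)."""
    via_pivot1 = set()
    for p in src_to_pivot1.get(word, []):
        via_pivot1.update(pivot1_to_tgt.get(p, []))
    via_pivot2 = set()
    for p in src_to_pivot2.get(word, []):
        via_pivot2.update(pivot2_to_tgt.get(p, []))
    return via_pivot1 & via_pivot2

# Toy data mirroring Figure 2.2: Hindi "vasant" via English and French pivots into Vietnamese.
hin_eng = {"vasant": ["spring"]}
eng_vie = {"spring": ["mùa xuân", "lò xo", "suối"]}
hin_fra = {"vasant": ["printemps"]}
fra_vie = {"printemps": ["mùa xuân"]}
print(triangulated_candidates("vasant", hin_eng, eng_vie, hin_fra, fra_vie))  # {'mùa xuân'}
```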

Bond et al. [14], and Bond and Ogura [13] create new dictionaries via one or more pivots. Created entries are ranked in different ways such as using the one-time inverse con- sultation score introduced by Tanaka and Umemura [115], or a semantic matching score, which is the number of times the semantic classes of ai and cj match, mainly focusing on nouns. Samples of random words in the source language and their translations in the target language are selected for evaluation by lexicographers. The evaluation of entries in the Japanese-Malay dictionary they created from a Japanese-English dictionary and a

Malay-English dictionary shows that 80% of the translations are acceptable. To handle homonyms2, they use two intermediate languages: English and Chinese. Using the two intermediate languages, 97% of the entries in the new dictionary become acceptable, but the number of entries decreases significantly from 75,872 to 5,238.

A link structure, introduced by Ahn and Frampton [2], is also used to handle ambiguous translations. The central idea is that if (i) a word a in a source language A is translated to a word b in an intermediate language B, which is translated to a word c in a target language C, and (ii) conversely, the word c is translated to the word b, which is translated back to the word a, then the word c is a correct translation of the word a. The problem with this method is the presence of polysemous words in the intermediate languages. Ahn and Frampton ameliorate the effect of polysemous words in the following manner. They find all words b_k which are translations of each word c_i; then, they find all translations a_j of each word b_k. The words a_j that are the same as the source word a are selected. Finally, they retrace the path to get the words c_i, which are correct translations of the word a. The newly created dictionary, a Spanish-German dictionary, covers 78.4% of the entries in an existing dictionary that was created manually. Issues affecting their results include the observation that the manually-generated dictionary does not contain many entries created using their approach, that the sizes of the input dictionaries are limited, and that different font encodings in the input dictionaries mess up their results.
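The retracing idea behind the link structure can be sketched as follows; this is a simplified illustration assuming each dictionary is a mapping from a word to a list of translations, not Ahn and Frampton's exact procedure.

```python
def link_structure_translations(a: str,
                                ab: dict,  # Dict(A,B): word in A -> list of words in B
                                bc: dict,  # Dict(B,C): word in B -> list of words in C
                                cb: dict,  # Dict(C,B): word in C -> list of words in B
                                ba: dict   # Dict(B,A): word in B -> list of words in A
                                ) -> set:
    """Keep a candidate c only if some path a -> b -> c can be retraced c -> b -> a."""
    kept = set()
    for b in ab.get(a, []):
        for c in bc.get(b, []):
            # retrace: c must translate back to b, and b back to the source word a
            if b in cb.get(c, []) and a in ba.get(b, []):
                kept.add(c)
    return kept

# Toy example: Spanish -> English -> German, with the retrace filtering out a wrong sense.
ab = {"banco": ["bank", "bench"]}
bc = {"bank": ["Bank", "Ufer"], "bench": ["Bank"]}
cb = {"Bank": ["bank", "bench"], "Ufer": ["shore"]}
ba = {"bank": ["banco"], "bench": ["banco"], "shore": ["orilla"]}
print(link_structure_translations("banco", ab, bc, cb, ba))  # {'Bank'}
```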

A well-known effort to construct many new bilingual dictionaries is by Mausam et al. [69]. They report several algorithms for creating dictionaries using probabilistic inference. They extract entries from multiple dictionaries of multiple language pairs using the concept of a translation graph in which each vertex represents a word in a language and an edge connecting two vertices represents a belief that the two vertices share a sense.

2 Homonyms are words that are spelled and sound the same but have different meanings.

The Transgraph algorithm computes the equivalence score that two words in a translation

graph share the same sense. If this score is greater than a threshold, these two words in

two distinct languages are accepted as sharing the same sense. The main idea behind the

Unpruned SenseUniformPaths (uSP) algorithm is that two vertices share the same sense if

there exists at least one translation circuit found by using a random walk and choosing ran-

dom edges without having duplicate vertices in the path from the source word to the target

word. However, the uSP algorithm faces errors that occur in processing source dictionaries

to generate the translation graph and correlated sense shifts in translation circuits. The

SenseUniformPaths (SP) algorithm solves uSP’s problems by pruning paths whose vertices

enter an ambiguity set twice. An ambiguity set is a set of nodes sharing more than one

sense. Their best algorithm is the SP algorithm at precision 0.90, producing 4.5 times as many translations as the dictionaries supported by the Wiktionary, producing 73% more translations over other source dictionary translations.

2.4.3 Extracting bilingual dictionaries from corpora

If languages A and C have substantial corpora of documents that are readily available, researchers have attempted to derive translations between A and C using several methods.

This subsection presents a variety of approaches for extracting translations from parallel corpora, bi-texts,3 comparable corpora and monolingual corpora.

Brown [19] derives bilingual lexicons from a Spanish-English parallel corpus containing

685,000 sentence pairs. They construct a correspondence table based on symmetric co-occurrence ratios and asymmetric co-occurrence ratios among words to show the existence of word or phrase translations within sentence pairs. Two thresholds, one symmetric and

one asymmetric, are set up through experiments to handle the ambiguous candidates and coincidental co-occurrences. The value of each cell in the table is from 0.0 to 1.0. Elements in the table with values greater than 0.0 are added to the new bilingual dictionary. The best dictionary they extracted used a fixed threshold of 1.0 and consisted of 14,446 entries

(covering 15% of the vocabulary in the corpus) with the lowest error rate at 29%.

3 A bi-text is a collection of parallel sentences in two languages.
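A rough sketch of the kind of co-occurrence statistics such an approach relies on, assuming a sentence-aligned corpus given as (source tokens, target tokens) pairs; the exact symmetric and asymmetric ratios Brown uses differ in detail.

```python
from collections import Counter
from itertools import product

def cooccurrence_ratios(aligned_pairs):
    """Count how often a source word and a target word appear in aligned sentence pairs,
    and turn the counts into simple conditional ratios."""
    pair_counts, src_counts, tgt_counts = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        for s, t in product(set(src_sent), set(tgt_sent)):
            pair_counts[(s, t)] += 1
        src_counts.update(set(src_sent))
        tgt_counts.update(set(tgt_sent))
    ratios = {}
    for (s, t), c in pair_counts.items():
        ratios[(s, t)] = (c / src_counts[s], c / tgt_counts[t])  # two asymmetric ratios
    return ratios

corpus = [(["la", "casa"], ["the", "house"]), (["la", "puerta"], ["the", "door"])]
print(cooccurrence_ratios(corpus)[("casa", "house")])  # (1.0, 1.0)
```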

If a language pair does not have a parallel corpus, but there are some directly translated texts from one language to the other or texts translated into both languages from an intermediate language, researchers may be able to construct a parallel corpus using the intermediate language as a pivot. Then, a new dictionary can be extracted from the generated parallel corpora. For example, Héja [38] collects texts translated to the source languages (Lithuanian and Slovenian) and the target language (Hungarian) from an intermediate language (English) to construct parallel corpora (Lithuanian-Hungarian and Slovenian-Hungarian). In the corpora he creates, sentences in one language might be combined or split into many sentences in another language because they are not perfect direct translations. Hence, translation units, instead of sentences, are used to measure the sizes of these corpora. The Lithuanian-Hungarian corpus contains 147,158 translation units, whereas the Slovenian-Hungarian corpus consists of 38,574 translation units. Then,

GIZA++ [86] is used to compute translation probabilities for every translation candidate and perform word alignment. Héja also calculates frequencies of words in the source and target languages. A translation candidate is added to the new bilingual dictionary if its translation probability, its frequency in the source language, and its frequency in the target language are higher than some thresholds. From experiments, he finds that a candidate with a low translation probability but high frequency is a good translation, whereas a candidate with a high translation probability but low frequency is usually a bad translation. Héja reports

If a language pair (A,C) has a very small size of bi-texts, but there exists a third language B such that B is related to and has a large parallel corpus or bi-texts with A or C, researchers might be able to construct bilingual lexicons for A and C from available resources based on transliterations and cognates. The CLDR project4 defines transliteration as “the general process of converting characters from one script to another, where the result is roughly phonetic for languages in the target script”. For example, “Niu Di-lân” is a transliteration of “New Zealand” in Vietnamese. According to Molina [76], “cognates are words descended from a common ancestor; that is, words having the same linguistic family or derivation”. Some examples of cognates in English and Spanish are “family” - “familia”,

“elephant” - “elefante”, and “gorilla” - “gorila”. Nakov and Ng [80] concatenate the two bi-texts, align words, then extract cognates. One of their main experiments is to extract translations from Spanish to English from the bi-texts of Portuguese-English and Spanish-

English, and they consider Portuguese as a language closely related to Spanish. They extract cognates based on the translation probabilities of words from Portuguese to Spanish using

English as a pivot, and orthographic similarities using the longest common subsequence ratio (LCSR) [71], calculated by dividing the length of the longest common subsequence by the length of the longer word. The LCSR threshold is set to 0.58 or greater.

4http://cldr.unicode.org/index/cldr-spec/transliteration-guidelines

Then, they estimate the translations using the competitive linking algorithm [72]. Cognates are extracted from a training dataset and then used, on the same training dataset, to learn how to transform Portuguese words into Spanish. The Bleu score of their translations is 3.37. In addition, they claim that their approach achieves better results than methods using parallel corpora and pivot languages.
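
The LCSR used here is simple to compute; the following Python sketch (an independent re-implementation, not the authors' code) derives it from a standard dynamic-programming longest common subsequence.

def lcs_length(a, b):
    # dynamic-programming longest common subsequence length
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if ca == cb else max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def lcsr(a, b):
    # longest common subsequence ratio: LCS length over the length of the longer word
    return lcs_length(a, b) / max(len(a), len(b))

print(lcsr("gorilla", "gorila"))   # 6/7, above the 0.58 cognate threshold
print(lcsr("familia", "family"))   # 5/7, above the 0.58 cognate threshold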

Ljubešić and Fišer [67] extract a Croatian-Slovene dictionary from a comparable corpus of news articles. Initially, a seed dictionary, with 33,495 entries, is created by detecting words that are identically spelled in both languages and also have the same POS in both languages. The similarity between the two languages is high: the average cosine distance between corresponding 3-grams picked from the corpus is 74%.5 The average precision of their seed dictionary is 72% as computed by manual evaluation. The first dictionary is created by expanding the seed dictionary with cognates found by using a modified LCSR algorithm, named BI-SIM [55]. The second dictionary is generated by adding to the seed dictionary the first set of translation candidates, with a frequency of at least 200. They evaluate the dictionaries they create by comparing against a hand-created gold standard with 500 entries. Their first dictionary consists of 34,823 entries with a precision of 68.5% whereas the second dictionary has 34,817 entries with a precision of 71.4%. According to

Ljubešić and Fišer, simply considering the first translation candidates as correct translations is very effective.

Given a Chinese-English dictionary Dict(cht,eng), Shao and Ng [106] extract new translations from a Chinese-English comparable corpus using both context and transliteration information. The existing Chinese-English dictionary they use has about 10,000 entries. The size of the English corpus is 730M bytes, and the size of the Chinese corpus is

5http://borel.slu.edu/crubadan/table.html

120M bytes. They divide the corpus into time periods, perform segmentation, and determine unknown Chinese and English words appearing in each period. Next, they estimate the translation probability for each translation candidate based on the context:

P(C(c)|C(e)) = \prod_{t_c \in C(c)} P(t_c|T_c(C(e)))^{q(t_c)},   (2.7)

where q(t_c) is the number of occurrences of a Chinese word t_c in the context C(c), e denotes the English words in the context C(e), and T_c(C(e)) is a bag of Chinese words created by translating the English words in C(e) using a bilingual dictionary. Then, a probability of translation for each candidate based on transliteration is obtained as follows:

P(e|c) = P(e|pinyin) = \sum_{a} \prod_{i} P(l_i^a|p_i),   (2.8)

where p_i is the i-th syllable of Pinyin (the official romanization used in China) created by converting each character in a Chinese word c, and l_i^a is the English letter sequence that the i-th Pinyin syllable maps to in a particular alignment a. Finally, they rank candidates based on the probabilities of translation. The number of new Chinese source words and English translations found are 4,499 and 192,521, respectively. The precision of newly found correct translations is 78.2% as evaluated by humans.

Researchers have derived bilingual lexicons even for language pairs that have neither a parallel corpus nor a comparable corpus. The goal of Koehn and Knight [53] is to derive one-to-one bilingual noun translations from German to English using such disparate corpora: an English corpus and a German corpus that differ in time period and orientation. They find translation candidates based on (i) identical words adopted from other languages (e.g., "email" and "internet"), (ii) words with similar spelling due to cognate origin (e.g., "website" in English and "webseite" in German), (iii) words in similar contexts, (iv) similarity scores between all word pairs in the same language (e.g., the word "dog" is similar to the word "cat") and (v) frequencies of words. They extract 1,339 bilingual noun translations, which can be considered to constitute a seed lexicon, with accuracy of 89%, starting with just the identical words. According to Koehn and Knight, finding identical words, words with similar spelling, and words in similar contexts helps find significantly more new bilingual translations. The authors report that the translations they extract cover 39% of the translations extracted at the word level from a German-English parallel corpus.

2.4.4 Generating dictionaries from multiple linguistic resources

To improve the quality and the quantity of entries in newly created dictionaries, researchers extract translation candidates from available bilingual dictionaries, as discussed in prior sections, but extend the process by using resources such as thesauri, corpora, and WordNets to identify senses of words and to remove irrelevant candidates. Sanfilippo and Steinberger [102] enrich a bilingual dictionary Dict(A,B) by linking its senses to senses in a thesaurus of A. The enriched dictionary can be used to distinguish translation candidates of a word in a given context. In the thesaurus, each word ai has one or many senses and corresponding synonyms for each sense. Each sense has a unique identification number sensei:

ai: sensei1: ai11, ai12, ai13, ...; sensei2: ai21, ai22, ...; ...; senseij: ...

Given a word ai in language A in the dictionary, they obtain all words belonging to each sense of this word from the thesaurus, translate them to the target language B, and rank the translation candidates based on their occurrence counts. Finally, they match the translation candidates with translations in the available dictionary. The newly discovered translations may be kept or discarded. As a result, translations b of the source word a are grouped based on the senses of a:

ai: sensei1: bi11, bi12, ...; sensei2: bi21, ...; ...; senseij: ...

The precision and recall of linking senses are 86% and 97%, respectively, whereas those of ranking translations are 87% and 92%, respectively. The approach of Sanfilippo and Steinberger [102] can be used to create a new dictionary Dict(B,C) from the given dictionaries Dict(A,B) and Dict(A,C), and a thesaurus in language A. They link senses in each dictionary to senses in the thesaurus, generate translations between B and C using A as a pivot, and align translations using the unique sense numbers of the pivot word ai in A.

Goh et al. [31] construct a new Japanese-Chinese dictionary from Japanese-English and Chinese-English dictionaries using the pivot-based method through English, relying on the one-time inverse consultation method. Samples of 200 randomly selected words of each category (nouns, verbal nouns, and verbs) are evaluated manually using a 4-point scale

{correct, not-first, acceptable, wrong}. Their dictionary has 20,554 entries with an average accuracy of 77%. Because many Japanese words are combinations of Kanji characters, which are similar to Hanzi in Chinese, they find 7,941 new translations with accuracy of

97% for nouns and 97.5% for verbal nouns by converting Kanji to Hanzi.

Nerima and Wehrli [81] create a new bilingual dictionary Dict(A, C) from two input bilingual dictionaries Dict(A, B) and Dict(B,C) using the transitive method. The translation candidates are validated by checking their appearance in an A-C parallel corpus. An example of their experiments is to construct an English-German dictionary from English-

French and German-French dictionaries consisting of 76,311 and 45,492 entries, respectively.

Their new English-German dictionary has 21,600 entries, of which 26% of entries are found in the English-German parallel corpus. The rest of the entries, which cannot be validated using the corpus, are evaluated manually. The authors do not report a precision value for their dictionary, but they claim that the translations they create are very good.

A comparable corpus has also been used to validate translation candidates. Otero and Campos [90] create a new dictionary Dict(A, C) from Dict(A, B) and Dict(B,C) using transitivity; then, they remove ambiguous entries in the created dictionary using an A-C comparable corpus. They split Dict(A, C) into two subsets: Dict(A, C)amb containing ambiguous entries, and Dict(A, C)unamb consisting of unambiguous entries. To remove ambiguous entries, they generate a temporary dictionary Dict(A, C)corpus from the comparable corpus such that every word in A is translated into the top-N best translations in C and every word in C is also translated into the top-N best translations in A. The final bilingual dictionary

Dict(A, C) is created using the following formula:

Dict(A, C) = (Dict(A, C)amb ∩ Dict(A, C)corpus) ∪ Dict(A, C)unamb.   (2.9)

They create an English-Galician dictionary from the English-Spanish and Spanish-Galician dictionaries, and a comparable corpus of English and Galician. The dictionary created contains 12,064 entries, and 22% of the entries are found in the comparable corpus. Similar to Nerima and Wehrli [81], Otero and Campos claim that there is no need to manually evaluate the entries they generate because their quality is the same as that of entries created by lexicographers; however, they do not discuss how they compared their entries against a resource created by a lexicographer.
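
Because Equation 2.9 is just set algebra over dictionary entries, it can be sketched directly in Python; dictionaries are modelled as sets of (source word, target word) pairs, and the entries below are illustrative.

def combine(dict_amb, dict_unamb, dict_corpus):
    # Equation 2.9: ambiguous entries must be confirmed by the corpus dictionary;
    # unambiguous entries are kept unconditionally.
    return (dict_amb & dict_corpus) | dict_unamb

dict_amb = {("bank", "banco"), ("bank", "beira")}    # ambiguous pivot translations
dict_unamb = {("tree", "arbore")}                    # unambiguous pivot translations
dict_corpus = {("bank", "banco")}                    # top-N translations from the corpus
print(combine(dict_amb, dict_unamb, dict_corpus))    # {('bank', 'banco'), ('tree', 'arbore')}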

In addition to parallel or comparable corpora, researchers have also used monolingual corpora to validate translation candidates. Kaji et al. [45] create a Japanese-Chinese dictionary from Japanese-English and Chinese-English dictionaries using the pivot-based method. A correlation matrix of associated words versus translations, obtained from two monolingual corpora of Japanese and Chinese in the same domain, is used to eliminate ambiguous translation candidates. To construct a correlation matrix, they first extract word associations from the corpora, align the extracted Japanese word associations with the extracted Chinese word associations using the dictionary created by the pivot-based method, and iteratively compute the correlations between associated words and translations. The correlation matrix is converted to a binary matrix such that the highest value in each row of the matrix is converted to 1.0 whereas the remaining values are converted to 0.0. Finally, the support for each translation is obtained by dividing the number of times 1.0 occurs in its column by the number of rows in the matrix. The translations with support values greater than a threshold are accepted as the correct translations. For evaluation, 384 Japanese noun entries and their translations are manually validated. Evaluation produced a precision of 64.9% and a recall of 15.8%.

WordNets have been used to remove irrelevant translation candidates. Varga and

Yokohama [119,120] generate a Japanese-Hungarian dictionary from Japanese-English and

Hungarian-English dictionaries using the pivot-based method. A translation candidate is considered unambiguous if there exists only one translation from the source language to the pivot language, which in turn has only one translation to the target language. To handle ambiguities, they compute scores using information obtained from a WordNet of the pivot language, the English WordNet, as below:

score_B(w_J, w_H) = \max_{i'} \frac{|sns(w_J \to i') \cap sns(w_H \to i')|}{|sns(w_J \to i') \cup sns(w_H \to i')|},   (2.10)

score_{C,D,E}(w_J, w_H) = \frac{|ext(w_J \to w_E) \cap ext(w_H \to w_E)|}{|ext(w_J \to w_E) \cup ext(w_H \to w_E)|},   (2.11)

score_F(w_J, w_H) = \prod_{rel} ((c_1 + \max(score_{rel}(w_J, w_H))) \cdot (c_2 + c_3 \cdot mfactor_{rel}(w_J, w_H))),   (2.12)

where i' ∈ (w_J → w_E) ∩ (w_H → w_E); sns(w) is the set of senses of word w; ext(w) is the extension set with synonyms, antonyms, and semantic categories of word w; mfactor_rel is between 0 and 1; and c_1, c_2, and c_3 are constants. If a candidate is selected using the bi-directional method, mfactor_rel is 1; otherwise, mfactor_rel is 0. Unambiguous candidates and candidates that remain after computing score_B and score_F are the best and are added to the new dictionary. A sample of entries in the new dictionary is evaluated by human evaluators using a 3-point scale {correct, undecided, wrong}. The average scores of

“correct”, “undecided”, and “wrong” are 72.75%, 6.42% and 20.83%, respectively. They also compare dictionaries created using their approach against the dictionaries created using the

Sjöbergh approach [110], and the Tanaka and Umemura approach [115], both discussed earlier in Section 2.4.1. The proportion of "correct" translations in their dictionary is the highest, at 79.15%, whereas those of Sjöbergh, and of Tanaka and Umemura, are 54.05% and 62.50%, respectively, for the 1-to-1 entry precision evaluation.

2.5 Generating translations for phrases

Sag et al. [100] define multiword expressions (MWEs) as "idiosyncratic interpretations that cross word boundaries". They also classify MWEs into several classes and subclasses.

The two top-level classes are lexicalized phrases and institutionalized phrases. According to them, the lexicalized phrases have "at least partially idiosyncratic syntax or semantics, or contain words which do not occur in isolation", whereas the institutionalized phrases are

“syntactically and semantically compositional, but occur with markedly high frequency” in texts. Lexicalized phrases are further classified into compound nominals or compound nouns such as “information technology”, “car park” and “part of speech”, as well as proper nouns or named entities such as “the United States of America”, “Johns Hopkins” and “Colorado

Springs". They also include light verb constructions such as "make a mistake" and "give a demo".

Different authors have made various claims regarding the level of processing complexity required to translate MWEs. Some methods do shallow processing, i.e., simply translate the word forms in the source language to word forms in the target language, whereas others use deep processing, i.e., use semantics of the MWEs. In general, the existing approaches study translations of compound words from different resources such as bilingual dictionaries, parallel corpora and comparable corpora.

Abiola et al. [1] study a model for noun phrase translation from English to Yoruba.

They manually generate 29 grammar rules to transfer noun phrases from English to Yoruba.

Given an English noun phrase, they segment it, then perform translation from English to Yoruba at the word-to-word level using a bilingual dictionary.

Finally, they restructure the translations in the target language using the generated rules.

The accuracy of their translations for 400 randomly selected input noun phrases is 91%.

Hai et al. [35] introduced a phrase transfer model for Vietnamese-English machine translation focusing on one-to-zero mapping, which means that a word in Vietnamese may not have appropriate single-word translation(s) and may need to be translated into a phrase in English. For example, the Vietnamese word “đi xe” will be translated into a phrase “go by vehicle” in English. They translate Vietnamese words to English using a bilingual dictionary, then use conversion rules to modify the structures of the English translation candidates.

The modifying process builds phrases level-by-level from simple to complex, restructures phrases using a syntactic parser and additional rules, and applies measures to solve phrase conflict.

Cao and Li [20] translate base noun phrases, each defined as "a simple and non-recursive noun phrase", from English to Chinese. They extract the base noun phrases from an English corpus. Given a base noun phrase in English, they generate translation candidates in

Chinese by using an English-Chinese dictionary for translation at the word-to-word level.

Then, they obtain the document frequency for each candidate from the Web. Finally,

the correct translations are selected using two methods: Naïve Bayesian Classification and

computing similarities between context vectors (TF-IDF vectors). Both algorithms are

based on the Expectation and Maximization (EM) algorithm [27]. Human evaluation is used

to evaluate translations for 1,000 English base noun phrases created using their approaches.

The accuracy of their translations for the top-3 highest ranks using the first and the second

approaches are 80.3% and 80.8%, respectively.

Ohmori and Higashida [87] extract compound words in Japanese and their transla-

tions in English from non-parallel corpora. They first extract collocations in English and

Japanese from corpora using n-gram statistics and the entropy values of words before and

after the main words which are calculated using the following equation:

H_b(w) = -\sum_{b} P_b(w|w_b) \log P_b(w|w_b),   (2.13)

where P_b(w|w_b) is the probability of the word w being followed by the word w_b. Only words having entropy values greater than a threshold survive. Then, they compute a correlation R(e, j) for each pair of translation candidates:

R(e_u, j_u) = \frac{|C(e_u) \cap C(j_u)|}{|C(e_u)| + |C(j_u)| - |C(e_u) \cap C(j_u)|},   (2.14)

C(e_u) = \{(ew_i, f_i) \mid i = 1, 2, \ldots, l\}, \quad |C(e_u)| = \sum_{i=1}^{l} f_i,   (2.15)

C(j_u) = \{(jw_k, f_k) \mid k = 1, 2, \ldots, m\}, \quad |C(j_u)| = \sum_{k=1}^{m} f_k,   (2.16)

showing that the English word ew_i co-occurs with the English word e_u f_i times, and the Japanese word jw_k co-occurs with the Japanese word j_u f_k times. Finally, the pair of Japanese and English words having the greatest value of correlation will be considered correct translations of each other. They are able to extract 1,282 compound words with an accuracy of 76.2%

for the top-3 translations from corpora in the economic domain, consisting of 8,090 English

sentences and 7,674 Japanese sentences.
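
A small sketch of the correlation in Equation 2.14 is given below; the co-occurrence sets C(e_u) and C(j_u) are modelled as Counters, and, as an assumption to make the intersection meaningful in one script, the Japanese profile is taken to be already mapped into English words by a seed dictionary.

from collections import Counter

def correlation(c_e, c_j):
    # |C(e) ∩ C(j)| with multiplicity, over the size of the union (Equation 2.14)
    shared = sum((c_e & c_j).values())
    return shared / (sum(c_e.values()) + sum(c_j.values()) - shared)

c_e = Counter({"market": 3, "price": 2, "rate": 1})   # co-occurrence profile of eu
c_j = Counter({"market": 2, "price": 1, "bank": 1})   # mapped profile of ju
print(round(correlation(c_e, c_j), 3))                # 3 / (6 + 4 - 3) ≈ 0.429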

Tanaka [116] extracts English translations for Japanese compound nouns from non-

parallel corpora. Given a noun compound in Japanese ts, they collect the translation candi-

dates in English tt from the corpus based on existing POS patterns. The candidates whose components are related to components of ts through bilingual dictionaries and thesauri sur-

vive. Then, they measure the similarity between ts and each tt based on the context in

the corpora, which means information about the co-occurrence of words in the compound

nouns in the same sentences. In particular, they look for words having dependent relations

using word sequence templates found by Kilgarriff and Tugwell [48]. They also compute

the semantic co-occurrence of words based on semantic attributes [42]. The similarity Sw

and the semantic attribute based similarity Sa between ts and tt are computed using the

following formulas:

S_w(t_s, t_t) = T c_w(t_s) c_w(t_t),   (2.17)

S_a(t_s, t_t) = c_a(t_s) c_a(t_t),   (2.18)

where cw and ca are a context word vector and a context attribute vector, respectively, and

T is the translation matrix constructed by [114]. They obtain translations for 400 Japanese

compound nouns and receive translations for 393 input compound nouns with an accuracy

of 70%.

Tanaka and Baldwin [117] study noun noun compound translations from Japanese

to English using a bilingual dictionary and a monolingual corpus in the target language.

They extract 2-gram units, which are noun noun compounds, from a Japanese corpus. A

translation candidate of a Japanese noun noun compound is generated by translating each

Japanese word in the noun noun compound to English using a Japanese-English dictionary, 37

and then restructuring using suitable templates. A list of templates is created by matching

between the Japanese noun noun compounds and their English translations found in a gold

standard. Finally, they calculate a score for each translation candidate using the following

formula.

score(w1, w2, t) = αp(w1, w2, t) + βp(w1, t)p(w2, t) + γp(w1)p(w2)p(t), (2.19)

where w1 and w2 are English translations of Japanese words in the noun noun compound,

and t is the translation template. Different values of α, β and γ will achieve different F-score

performance. The highest F-score, which is 0.49, is achieved when α = 1 and β = 1.

Bouamor et al. [15] extract MWEs and their translations from a French-English paral-

lel corpus. First, a list of syntactic patterns (e.g., adj-noun, noun-adj and noun-prep-noun)

is generated manually. The CEA LIST Multilingual analysis platform [8] is used to produce

a set of POS tagged normalized lemmas from the corpus. Only 2-, 3- and 4-gram units, which

match the existing patterns, are kept. Then, they assign a vector of size N, which is the

number of sentences in the corpus, to every MWE in order to indicate whether or not this

MWE occurs in a sentence. The MWE pair having the greatest confidence value, computed

using the Jaccard Index presented in the following equation, is considered correct.

confidence(MWE_s, MWE_t) = \frac{I_{st}}{V_{MWE_s} + V_{MWE_t} - I_{st}},   (2.20)

where I_{st} is the number of sentences shared by MWE_s and MWE_t, and V_{MWE_s} and V_{MWE_t} are the numbers of sentences independently containing MWE_s and MWE_t, respectively. Then, they integrate the extracted MWEs into the statistical machine translation system Moses [52].

Their training corpus contains 100,000 parallel French-English sentences whereas their test set has 1,000 parallel sentences. They received a Bleu score of 25.94.
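
The confidence score of Equation 2.20 reduces to a Jaccard index over the sentences in which each MWE occurs, as in this sketch; the sentence indices are made up.

def confidence(sentences_s, sentences_t):
    # Jaccard index over sentence-occurrence sets (Equation 2.20)
    shared = len(sentences_s & sentences_t)
    return shared / (len(sentences_s) + len(sentences_t) - shared)

occ_fr = {1, 4, 7, 9}               # sentences containing the French MWE
occ_en = {1, 4, 9, 12}              # sentences containing the English MWE
print(confidence(occ_fr, occ_en))   # 3 / (4 + 4 - 3) = 0.6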

2.6 Constructing WordNets

According to Vossen [123], there are two main methods to construct a WordNet.

The most common method is called the expand approach which translates the Princeton

WordNet “to another language and take over the structure”, whereas the second method

is called the merge approach which creates “an independent WordNet in another language and align the separate hierarchies by generating the appropriate translations”. In terms of popularity, the expand approach dominates the merge approach. WordNets generated using the merge approach have different structures from the Princeton WordNet; however, the complex agglutinative morphology, culture specific meanings and usages of words and phrases of target languages can be maintained. In contrast, WordNets created using the expand approach have the same structure as the Princeton WordNet. Bhattacharyya [9] suggests that to construct WordNets, we should first generate universal concepts, which are

common concepts across languages, then apply the expand method to construct WordNets.

In addition, he also contends that if the source language, which has a WordNet, and the

target languages are related, the expand approach to construct a WordNet for the target

language is “all the more attractive”.

2.6.1 Constructing WordNets using the merge approach

A French WordNet was generated from multilingual resources by Sagot and Fišer

[101]. Using the merge approach, they perform word alignment and extract bilingual lex-

icons from a given multilingual corpus; then, every lexical entry is assigned a synset ID

obtained from the BalkaNet WordNet [112]. Using the translation approach, monosemous

literals from the English WordNet are translated into French using bilingual resources such

as dictionaries and multilingual . Finally, they merge synsets collected from 39

two approaches to get the final French WordNet. Their French WordNet contains 32,351

non-empty synsets, and its accuracy based on manual evaluation is 80%.

Gunawan and Saputra [33] introduce a method to generate a prototype version of

synsets for an Indonesian WordNet using a monolingual dictionary of Bahasa Indonesia and

an Indonesian thesaurus. In a WordNet, synonym concepts should be bi-directional relations

such that if the synonym of the word wi is wj, a synonym of the word wj is wi. However, the

thesaurus does not contain bi-directional relations. They first extract synonym concepts

from the thesaurus. Let the synonyms of words w1, w2, w3 be {w2, w4, w5}, {w1, w6, w7}

and {w8, w9}, respectively. Then, they conclude that the bi-directional synonym of w1 is w2, and vice versa, because only these two words have bi-directional relations with each other. To reduce the number of redundant entries between the monolingual dictionary and the thesaurus, they do not re-process entries that exist in both the dictionary and the thesaurus. Then, they remove duplicate extracted entries. Finally, a hierarchical clustering technique is applied to merge synsets. Their Bahasa WordNet consists of 60,673 synsets.

No evaluation was performed.
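
The bi-directional check can be sketched as follows in Python; the toy thesaurus mirrors the w1/w2/w3 example above.

def bidirectional_pairs(thesaurus):
    # keep (word, s) only if each word lists the other as a synonym
    pairs = set()
    for word, synonyms in thesaurus.items():
        for s in synonyms:
            if word in thesaurus.get(s, set()):
                pairs.add(frozenset((word, s)))
    return pairs

thesaurus = {
    "w1": {"w2", "w4", "w5"},
    "w2": {"w1", "w6", "w7"},
    "w3": {"w8", "w9"},
}
print(bidirectional_pairs(thesaurus))   # {frozenset({'w1', 'w2'})}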

The Hindi WordNet6 is constructed manually by “looking up the various list meanings of words in different dictionaries” [21]. The current version 1.4 has 63,800 unique words and

28,687 synsets, and has not been linked to the Princeton WordNet. The Hindi WordNet is the first WordNet for Indian languages and is used to construct WordNets for other Indian languages (e.g., Marathi, Sanskrit and Gujarati) in the IndoWordNet project7 by using

the expand approach.

6http://www.cfilt.iitb.ac.in/wordnet/webhwn/index.php 7http://www.cfilt.iitb.ac.in/indowordnet/index.jsp

2.6.2 Constructing WordNets using the expand approach

Chang et al. [22] introduce the Class-Based Translation Model to construct a prototype Chinese WordNet for nouns. The model contains a translation table including English words wE and their translation candidates obtained from a dictionary; a semantic class table consisting of English words wE, WordNet sense numbers, POSes and class codes which are the "path designator" of hypernyms of the wE in WordNet; and a class translation table having class codes, potential translation characters which are broken from candidate translations, and their translation probabilities. The translation candidates with the highest translation probabilities are chosen and assigned to the corresponding synsets. The translation probability of a translation pair is estimated using the Expectation Maximization algorithm. Unigrams and bigrams in the target language are used to solve the problem of data sparseness. Their approach builds a Chinese WordNet covering 10,314 word senses, about

76.43% of senses in SEMCOR8. They also manually evaluate 500 cases. The recall rates of

Top 1, Top 2 and Top 3 translations are 70%, 80% and 90%, respectively.

Bilgin et al. [10] build a Turkish WordNet by translating extracted sets from the existing WordNets into Turkish. The first set containing 1,310 base concepts of the EuroWordNet [122] is extracted and translated into Turkish using bilingual dictionaries. Then, synonyms, antonyms and hyponyms of Turkish concepts are extracted from a monolingual

Turkish dictionary. Next, the Turkish WordNet is expanded by using additional information

(e.g., corpus frequencies, defining vocabularies, dictionaries and the Princeton WordNet) to determine new sets; then those new sets are translated into Turkish. The final extracted sets, which are sets existing in at least five WordNets in the EuroWordNet, are translated

8http://www.gabormelli.com/RKB/SemCor_Corpus

into Turkish. Their Turkish WordNet contains 11,628 synsets, 16,095 synset members and

17,550 semantics relations.

Kaji and Watanabe [46] construct a Japanese WordNet by translating the Princeton WordNet synsets into Japanese, and use a correlation matrix to deal with translation ambiguity. They extract word associations from comparable corpora, align words using a bilingual dictionary, iteratively compute the correlations between translations and associated words, and generate a correlation matrix of translations for each word. Next, they translate the words in each synset by calculating the scores of translation candidates according to the associated words appearing in the gloss appended to each synset, or the texts retrieved by using glosses as queries. They claim that their approach is promising for constructing a Japanese WordNet. Later, Bond et al. [12] and Isahara et al. [43] construct a Japanese WordNet by extracting synsets from the Princeton WordNet and translating them into Japanese using bilingual dictionaries. Then, they enrich the Japanese WordNet using the most common words obtained from different resources. Currently, the Japanese

WordNet contains 57,238 synsets with 93,834 words.

Barbu and Mititelu [7] propose and compare several heuristic rules to generate Word-

Nets for a target language. The first and the second WordNets are generated by translating, respectively, synsets, and hypernyms and hyponyms of words obtained from the English

WordNet. For the third WordNet, they translate English synsets into the target language.

Then, Barbu and Mititelu label domains for each word in a bilingual dictionary by using

WordNet domains. Next, each translation synset whose domain matches an English synset is considered as the correct synset and is added to the target WordNet. For the fourth

WordNet, they translate definitions of words obtained from an explanatory dictionary in the target language into English by using a bilingual dictionary, create vectors for each translated definition of a word, create vectors for synset glosses and compare vectors. In their experiments, they construct a Romanian WordNet and use the existing Romanian

WordNet of the Balkanet project9 as a gold standard for evaluating their work. Their best

WordNet consisting of 9,610 synsets and 11,969 relations with accuracy of 91% is created by combining all methods.

Sathapornrungkij and Pluempitiwiriyawej [103] propose a semi-automatic method to construct a Thai WordNet from machine readable dictionaries. They design a WordNet

Builder system which extracts lexical, semantic, and translation relations from the English

WordNet and a machine readable dictionary. The extracted data is then evaluated according to thirteen criteria (e.g., monosemic one-to-one, polysemic one-to-one and polysemic many-to-one). The created Thai WordNet contains 19,582 synsets with a coverage of 80% at 76% accuracy. Later on, Akaraputthiporn et al. [3], Leenoi et al. [60] and Leenoi et al. [61] construct Thai WordNets from several English-Thai and Thai-English bilingual dictionaries using a bi-directional translation method. In their translation approach, a word bj is kept if the word bj is translated from the word ai in the A-B dictionary, and then that word bj is translated into the word ai in the B-A dictionary. Akaraputthiporn et al. [3] and Leenoi et al. [61] translate the English words in the 2ndOrderEntities and

1stOrderEntities, respectively, of Common Base Concepts into Thai; whereas Leenoi et al. [60] translate English synsets in the Princeton WordNet to Thai. Their results evalu- ated with a gold standard test set are good for precision (78.82%) but not good for recall

(54.58%) and F-measure (64.50%). They conclude that using different input dictionaries created by different methods such as corpora-based method or author’s expertise will pro- duce WordNets with different accuracies. In addition, cultural issues such as categorization,

9http://www.dblab.upatras.gr/balkanet/index.htm

gender, and collective perception need to be taken into account to maintain the structure

of Thai data.

Montazery and Faili [77] build a Persian WordNet by mapping the synsets in PWN to

Persian words using a Persian-English bilingual dictionary. They introduce three scenarios

to assign a Persian word to a particular synset. If the English translation of a given Persian

word has only one synset, this Persian word is assigned to that synset. If a Persian word

has more than one English translation and at least two of these translations belong to the

same synset, that Persian word is also assigned to this synset. Finally, if the Persian word

has many synset candidates, it will be assigned to the synset candidate with the greatest

score. The score of each synset candidate syn of a given Persian word w is computed using

the following equation:

score(syn) = \sum_{w_i \in RTS} \sum_{e_i \in GW} Sim(w_i, syn) * MI(w_i, e_i),   (2.21)

where RTS is the set of related translations of the word w, GW is the set of words in the gloss of the synset and the hypernym synset, Sim(w_i, syn) is the similarity between words in RTS and the synset syn, and MI(w_i, e_i) is the mutual information between words in

RTS and words in GW, computed based on the co-occurrence of words in the corpus. Their new Persian WordNet consists of 29,716 synsets linked to the PWN. For evaluation, they randomly pick 500 Persian words and their corresponding synsets, then manually evaluate them. The accuracy of the synsets they create is 82.6%.
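
The three assignment scenarios can be sketched as a small decision procedure; the scoring of the third scenario (Equation 2.21) is abstracted behind a score callable, and the synset identifiers and the toy dictionary are illustrative assumptions.

def assign_synset(translations, synsets_of, score):
    """translations: English translations of a Persian word;
    synsets_of: maps an English word to its synset ids;
    score: callable implementing Equation 2.21 for a candidate synset."""
    candidates = [s for e in translations for s in synsets_of.get(e, [])]
    unique = set(candidates)
    if len(unique) == 1:                               # scenario 1: a single candidate synset
        return unique.pop()
    shared = [s for s in unique if candidates.count(s) >= 2]
    if shared:                                         # scenario 2: two translations share a synset
        return shared[0]
    return max(unique, key=score) if unique else None  # scenario 3: highest-scoring candidate

synsets_of = {"book": ["book.n.01"], "volume": ["book.n.01", "volume.n.03"]}
print(assign_synset(["book", "volume"], synsets_of, score=lambda s: 0.0))
# book.n.01 (scenario 2: shared by both translations)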

Saveski and Trajkovski [104] construct a WordNet for Macedonian using the expand approach. They translate the synsets in the Princeton WordNet to Macedonian using a bilingual dictionary. This produces the so-called candidate words. To remove irrelevant translations, the English synset gloss is translated into Macedonian, and then the Google similarity distance [25] is applied to compute the similarity score, between 0 and 1, showing

Google result counts returned when using words as a query. The selected words are words with Google similarity distance with the translated gloss greater than a threshold. The

Macedonian WordNet they create has 33,276 synsets.

Oliver and Climent [88] introduce and compare the accuracies of WordNets created by different methods. The first WordNet is created using the Google machine translation system to translate a sense-tagged English corpus into the target language, Spanish. The generated WordNet has about 8,000 synsets with an accuracy of 80%. In the second method, given a parallel corpus of English and a target language, they use a linguistic analyzer to tag senses of words with the English WordNet. Then, constructing a WordNet for the target language simply becomes a word alignment problem. The accuracy of the second approach is lower than that of the first approach, and depends on the size of the corpus. A bigger corpus will increase the accuracy of the created WordNet. They also conclude that sense tagging introduces more errors than statistical machine translation.

The Asian WordNet Project (AWN) provides a platform for building and sharing

WordNets of Asian languages based on the Princeton WordNet. Several Asian WordNets have been contributed, including WordNets in Hindi, Indonesian, Japanese, Korean, Thai, and Vietnamese. However, all of these WordNets are in early stages and are far from finished. A distributed management system allowing a cross language WordNet interface for the AWN project was generated by Robkop et al. [95]. Some other researchers are also trying to create WordNets for additional Asian languages. For example, Duc and Thao [28] attempt to create Vietnamese WordNet focusing on noun synsets, Le at al. [59] attempt to 45 develop consensus for collaborative ontology-based Vietnamese WordNet construction, and

Hussain et al. [40] are developing an Assamese WordNet.

Virach et al. [121] introduce a novel method to assign synsets for words in bilingual dictionaries of English and a language with limited resources based on the Princeton Word-

Net. If a word w in a target language has more than one English equivalent such as eng1 and eng2, and both English equivalents are in the same synset S0, the synset S0 is assigned to the word w with a confidence score CS of 4. If a word w1 has only one English equivalent eng0 belonging to synsets S0 and S1, and a word w2 which is a synonym of w1 also has one equivalent eng2 belonging to synset S1, the synset S1 will be assigned to both w1 and w2 with CS of 3. In case the word w has no synonym, and only one equivalent belonging to a synset, this synset will be assigned to the word w with CS of 2. If the word w has equivalents, each of which belongs to different synsets, all these synsets are assigned to the word w with CS of 1. Words assigned synsets with higher values of CS will have higher accuracies. The accuracy of their assignments is 80%.
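
A partial sketch of these rules is shown below, covering only the CS = 4 case (several English equivalents sharing a synset) and the CS = 1 fallback; the synset identifiers and the example pair are illustrative assumptions.

def assign_with_confidence(english_equivalents, synsets_of):
    """synsets_of: maps an English word to the set of Princeton WordNet synsets it belongs to."""
    synset_sets = [set(synsets_of.get(e, set())) for e in english_equivalents]
    if len(synset_sets) > 1:
        common = set.intersection(*synset_sets)
        if common:                                    # equivalents share a synset -> CS = 4
            return [(s, 4) for s in common]
    # fallback: every synset of every equivalent is kept with the lowest confidence, CS = 1
    return [(s, 1) for synsets in synset_sets for s in synsets]

synsets_of = {"handwriting": {"handwriting.n.02"}, "script": {"handwriting.n.02", "script.n.01"}}
print(assign_with_confidence(["handwriting", "script"], synsets_of))
# [('handwriting.n.02', 4)]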

2.7 Chapter summary

We have scoured the literature and covered many methods for constructing bilingual dictionaries, phrase translation and WordNets. The primary similarity among almost all methods is that existing approaches work with languages that have several available lexical resources, or use related languages with some lexical resources. The number of entries in the new lexical resources created from existing resources are significantly lower than those in the source resources. The more similar two languages are, the more the number of entries in the new resource is created. External resources such as monolingual dictionaries, corpora, thesauri and WordNets are also used to remove irrelevant candidates. In the next 46 chapters, we present our approaches to construct lexical resources such as reverse bilingual dictionaries, new bilingual dictionaries, WordNets and translations for phrases using just a few existing resources.

Acknowledgment

This chapter is based on the paper “Automatically Creating Bilingual Dictionaries:

A Survey”, written in collaboration with Jugal Kalita, that is currently under revision for

Natural Language Engineering published by Cambridge University Press.

CHAPTER 3

INPUT RESOURCES AND EVALUATION METHODS

3.1 Introduction

This chapter introduces input resources we use and the methods to evaluate resources we create. Our proposed approaches are general, but to demonstrate the effectiveness and usefulness of our algorithms, we have carefully selected a few languages to experiment with.

As mentioned earlier, we create lexical resources for Arabic, Assamese, Cherokee, Cheyenne,

Dimasa, Karbi and Vietnamese. In addition, we also work with some languages supported by a machine translation (MT) system, the Microsoft Translator. Finally, we describe a method to evaluate the lexical resources we create.

3.2 Input bilingual dictionaries

We work with seven existing bilingual dictionaries that translate between one language of choice and a resource-rich language, which happens always to be English in our experiments:

- Arabic-English and Vietnamese-English dictionaries, Dict(arb,eng) and Dict(vie,eng),

supported by PanLex1,

- Karbi-English and Dimasa-English dictionaries, Dict(ajz,eng) and Dict(dis,eng), sup-

ported by Xobdo2,

1http://panlex.org/ 2http://xobdo.org/

- One Assamese-English dictionary, Dict(asm,eng), created by integrating Assamese to

English dictionaries provided by PanLex and Xobdo,

- Cherokee-English3 and Cheyenne-English4 dictionaries, Dict(chr,eng) and Dict(chy,eng),

obtained from Web pages.

The dictionaries vary in size and quality. The numbers of entries in these input dictionaries are presented in Table 3.1.

Table 3.1: The number of entries in the input dictionaries.

Dictionary Entries Dictionary Entries

Arabic-English 53,194 Assamese-English 76,634

Cherokee-English 3,199 Cheyenne-English 28,097

Dimasa-English 6,628 Karbi-English 4,682

Vietnamese-English 231,665

The Microsoft Translator Java API5 is used as another main resource for translations.

The Microsoft Translator supports translations for 50 languages. We were given free access to the Microsoft Translator for research purposes.

3.3 Input WordNets

In our experiments, we use the Princeton WordNet (PWN) [29] and other WordNets

linked to the PWN 3.0 provided by the Open Multilingual WordNet6 project [11]: WOLF

WordNet (WWN) [101], FinnWordNet (FWN) [66] and JapaneseWordNet (JWN) [43]. Ta-

3http://www.manataka.org/page122.html 4http://www.cdkc.edu/cheyennedictionary/index-english/index.htm 5https://datamarket.azure.com/dataset/bing/microsofttranslator 6http://compling.hss.ntu.edu.sg/omw/

ble 3.2 provides some details of the WordNets used. In the table, the number of synsets in

the WordNets linked to the PWN 3.0 are obtained from the Open Multilingual WordNet,

along with the percentage of synsets covered from the list of the 5,000 most frequently used

word senses in PWN. Synsets which are not linked to the PWN are not taken into account.

Table 3.2: The number of synsets in WordNets

WordNet Synsets Core WordNet Synsets Core

JWN 57,179 95% FWN 116,763 100%

PWN 117,659 100% WWN 59,091 92%

3.4 Evaluation method

A standard method to evaluate translations in a new, machine-generated dictionary is human evaluation. A number of entries in the new dictionary are randomly picked and evaluated by volunteers using different scales such as a two-point scale {good, bad}, a three-point scale {correct, acceptable, incorrect} or a five-point scale {excellent, good, average, fair, bad}. Evaluators should be fluent in both the source and target languages of the dictionary. Human evaluation is ideal for getting an accurate estimate of the quality of translations. If evaluators can effectively evaluate and correct all entries in the new dictionary, such a dictionary is more reliable than others evaluated by methods that dispense with human evaluation. An issue with human evaluation is that it takes too much time and effort to evaluate an entire dictionary. In addition, not every fluent or native speaker has a vocabulary large enough to judge all entries in a dictionary. This issue becomes more acute when we need a single human evaluator to have high (near-native) fluency in two or more languages. As a result, we often have to rely on the concept of the "simple random method" discussed by Hays and Winkler [37] to select entries for evaluation. We are also forced to use general rules of thumb such as the one by Ross [99, page 30] that says we can be "confident of the normal approximation whenever the sample size is at least 30".

This means evaluating at least 30 randomly selected entries in the dictionary is acceptable, when exhaustive human evaluation is too onerous or impossible. Similar to evaluating a new dictionary, human evaluation also works well for evaluating a new WordNet.

The main goals of our study are to create high-precision dictionaries and WordNets, and to increase the numbers of lexical entries in the created dictionaries and WordNets.

Evaluations are performed by volunteers who are fluent in both the source and destination languages. To achieve reliable judgment, we evaluate the same set of 100 entries randomly chosen from each dictionary we create, or 500 entries from each WordNet. We note that each entry in the dictionary or WordNet has the same probability of being chosen. Due to the small number of entries in the dictionaries of Karbi and Dimasa, we evaluate the same set of 50 randomly picked entries. Each volunteer was requested to evaluate using a five-point scale: 5: excellent, 4: good, 3: average, 2: fair and 1: bad.

To study the effect of the existing resources used to create new dictionaries, WordNets and translations of phrases, we also evaluate the input bilingual dictionaries. The average scores of the existing bilingual dictionaries used are presented in Table 3.3. The percentage of agreement between raters is around 70% in all cases.
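
The sampling itself is a simple random draw, as in the Python sketch below; the entry format and the fixed seed (so that all raters see the same sample) are assumptions made for illustration.

import random

def sample_entries(entries, k=100, seed=0):
    random.seed(seed)                 # fixed seed: every rater receives the same entries
    return random.sample(entries, min(k, len(entries)))

entries = [("an dưỡng", "to be on convalescent leave"), ("lipi", "script")] * 60
for source, target in sample_entries(entries, k=5):
    print(source, "->", target)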

3.5 Chapter summary

The next four chapters will present approaches to construct new lexical resources using the existing resources discussed in this chapter. We will also analyze the effect of the input resources on the new resources we create and suggest methods to overcome the issues. 51

Table 3.3: The average scores of entries in the input dictionaries.

Dictionary Score Dictionaries Score

Arabic-English 3.58 Assamese-English 4.65

Cherokee-English N/A Cheyenne-English N/A

Dimasa-English 3.60 Karbi-English 4.30

Vietnamese-English 3.77

CHAPTER 4

CREATING REVERSE BILINGUAL DICTIONARIES

4.1 Introduction

This chapter focuses on developing new techniques that leverage existing resources for resource-rich languages to automatically generate reverse bilingual dictionaries from existing bilingual dictionaries. Published algorithms for creating new bilingual dictionaries usually use intermediate languages or intermediate dictionaries to find chains of words with the same meaning. Given a Vietnamese-English dictionary and an English-Assamese dictionary, the existing algorithms can create a new Vietnamese-Assamese dictionary or a new

Assamese-Vietnamese dictionary using English as an intermediate language. To illustrate the problem we want to solve, let us assume that we have a Vietnamese-English dictionary. Now, we want to ask questions such as: is it possible to construct an English-

Vietnamese dictionary based on this input resource alone, and if so, how many additional dictionaries or languages do we need to create a good quality English-Vietnamese dictionary? To create an English-Vietnamese dictionary using published approaches, we need at least two dictionaries: an English-Intermediate_Language dictionary and an Intermediate_Language-Vietnamese dictionary, where Intermediate_Language is an intermediate or pivot language used to decrease the ambiguity in the senses of words. Clearly, using such approaches, the existing Vietnamese-English dictionary is useless in creating an

English-Vietnamese dictionary.

We study approaches to create reverse bilingual dictionaries. Given an L1–L2 dictionary, where L1 and L2 are two languages, our objective is to automatically obtain a good

L2–L1 dictionary. A dictionary consists of a list of entries, each of which contains at least a headword, part-of-speech and definitions of the headword. Every headword serves as a keyword for all the information given in the entry [5]. At first inspection, creating a reverse bilingual dictionary from an existing bilingual dictionary is simple and straightforward because if a word a in L1 is translated to a word b in L2 in the L1–L2 dictionary, then the new

L2–L1 dictionary will contain a translation from the word b to the word a. However, not all entries in the input dictionary can be reversed to create new translations in the reverse dictionary. For example, given the first entry ("an dưỡng", "to be on convalescent leave") in the Vietnamese-English dictionary from Figure 1.1, we cannot reverse the direction of this entry to have a new entry ("to be on convalescent leave", "an dưỡng") and add it to the reverse dictionary. The phrase "to be on convalescent leave" cannot be considered as a keyword. As a result, the number of entries in the reverse dictionary is likely to be significantly smaller than that of the input dictionary used.

The remainder of this chapter is organized as follows. Related work is presented in Section 4.2. Section 4.3 describes the algorithms we propose to create new bilingual dictionaries from existing dictionaries. Experimental results are presented in Section 4.4.

Future work is discussed in Section 4.5. Section 4.6 summarizes the chapter.

4.2 Related work

We have not seen any attempts to create a reverse bilingual dictionary directly from an existing one. The only study found in [107] proposes a set of algorithms to create a reverse dictionary in the context of a single language using converse mapping. In particular, given an English-English dictionary, Shaw et al. [107] attempt to find the original words or terms given a synonymous word or phrase describing the meaning of a word. Our goal is to study 54 the feasibility of creating a reverse dictionary by using only one existing dictionary (in which

English is one of the two languages) and a WordNet lexical ontology. For example, given a

Karbi-English dictionary, we will construct an English-Karbi dictionary.

4.3 Proposed approaches

We introduce a series of algorithms, each of which automatically creates a reverse dictionary, ReverseDictionary, from a dictionary that translates a word in language L1 to a word or phrase in language L2. We start with a simple algorithm, and slowly introduce sophistication to our algorithms. We require that at least one of these two languages has a WordNet type lexical ontology [75]. In developing the algorithms, we take advantage of the fact that English has a WordNet, the Princeton WordNet.

4.3.1 Direct reversal (DR)

The existing dictionary has alphabetically sorted LexicalUnits in L1 and each of them has one or more Senses in L2. The idea and the algorithm of the DR algorithm are presented in Figure 4.1 and Algorithm 1, respectively.

Figure 4.1: The idea behind the DR algorithm

To create ReverseDictionary, we simply take every pair <LexicalUnit, Sense> in SourceDictionary and swap the positions of the two (Algorithm 1, line 4). This is a baseline algorithm so that we can compare improvements as we create new algorithms.

Algorithm 1 DR Algorithm
Input: SourceDictionary, Dict(L1,L2).
Output: ReverseDictionary, Dict(L2,L1).
1: ReverseDictionary := φ
2: for all LexicalEntryi ∈ SourceDictionary do
3:   for all Sensej ∈ LexicalEntryi do
4:     Add tuple <Sensej, LexicalEntryi.LexicalUnit> to ReverseDictionary
5:   end for
6: end for

If in our input dictionary the sense definitions are mostly single words, and occasionally a simple phrase, even such a simple algorithm gives fairly good results. During the reversal, we skip senses containing more than 30 characters in order to maintain the significance of headwords, which serve as keywords, in the new dictionary we create.
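
A compact Python rendering of Algorithm 1 is sketched below, under the assumption that the source dictionary is a list of (lexical unit, list of senses) pairs; the 30-character filter from the paragraph above is included, and the toy entries are illustrative.

def direct_reversal(source_dictionary, max_sense_length=30):
    reverse_dictionary = []
    for lexical_unit, senses in source_dictionary:
        for sense in senses:
            if len(sense) <= max_sense_length:        # skip long multiword senses
                reverse_dictionary.append((sense, lexical_unit))
    return reverse_dictionary

source = [
    ("hostolipi", ["handwriting"]),
    ("lipi", ["script"]),
    ("x", ["a deliberately long multiword sense definition"]),   # skipped: over 30 characters
]
print(direct_reversal(source))   # [('handwriting', 'hostolipi'), ('script', 'lipi')]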

The DR approach is easy to implement. The accuracy of each ReverseDictionary is usually equal to that of the SourceDictionary used. However, the numbers of entries

in the created dictionaries are limited because this algorithm just swaps the positions of

LexicalUnit and Sense of each entry in the SourceDictionary and does not have any method

to find additional words that have the same meanings. Entries whose senses are multiword expressions, and which are therefore not reversed, also decrease the number of entries. An example of the

drawback of the DR approach is shown in Figure 4.2. The word “hostolipi” in Assamese

means “handwriting” in English and the word “lipi” means “script”. From the Oxford English

dictionary1, “handwriting” means “a particular form, style, or method of writing by hand;

the form or style of writing used by a particular person” and “script” means “handwriting, the

1http://www.oed.com/

characters used in hand-writing (as distinguished from print)”. Hence, “hostolipi” and “lipi”

have nearly the same meaning. However, this baseline algorithm does not conclude that "hostolipi"

and “lipi” have both senses “handwriting” and “script”.

Figure 4.2: The drawback of the DR algorithm.

4.3.2 Direct reversal with distance (DRwD)

To increase the number of entries in the output dictionary, we compute the distance between words in the WordNet hierarchy. The words “hostolipi” and “lipi” in Assamese have the meanings “handwriting” and “script”, respectively. The distance between “handwriting” and “script” in the WordNet hierarchy is 0.0, so that “handwriting” and “script” are likely to have the same meaning. Thus, each of “hostolipi” and “lipi” should have both meanings

"handwriting" and "script". This approach helps find additional words having the same meanings and possibly increases the number of lexical entries in the reverse dictionaries.

Figure 4.3 shows the idea behind the DRwD approach and Algorithm 2 presents the DRwD algorithm.

To create a ReverseDictionary, for every LexicalEntryi in the existing dictionary, we find all LexicalEntryj, i ≠ j, with distance to LexicalEntryi equal to or smaller than a threshold α (Algorithm 2, line 6). Note that LexicalEntryi and LexicalEntryj need to have the same POS (Algorithm 2, line 4). As a result, we have new pairs of entries

Figure 4.3: The idea behind the DRwD algorithm

<LexicalEntryi.LexicalUnit, Sensev>; then we swap positions in the two-tuples, and add them into the ReverseDictionary (Algorithm 2, line 7).

Algorithm 2 DRwD Algorithm
Input: SourceDictionary, Dict(L1,L2).
Output: ReverseDictionary, Dict(L2,L1).
1: ReverseDictionary := φ
2: for all LexicalEntryi ∈ SourceDictionary do
3:   for all Senseu ∈ LexicalEntryi do
4:     for all LexicalEntryj ∈ SourceDictionary having the same POS as LexicalEntryi do
5:       for all Sensev ∈ LexicalEntryj do
6:         if distance(LexicalEntryi, LexicalEntryj) ≤ α then
7:           Add tuple <Sensev, LexicalEntryi.LexicalUnit> to ReverseDictionary
8:         end if
9:       end for
10:     end for
11:   end for
12: end for

The distance between the two LexicalEntrys is the distance between the two LexicalUnits

if the LexicalUnits occur in WordNet; otherwise, it is the distance between the two Senses.

The distance between each phrase pair is the average of the total distances between every

word pair in the phrases provided by RiTa.WordNet, which is currently a part of Rita core2.

If the distance between two words or phrases is 1.00, there is no similarity between these words or phrases, but if they have the exact same meaning, the distance is 0.00. Changing the value of parameter α in the DRwD algorithm will lead to change in the number of words or phrases in the target language in the new reverse dictionaries. If the value of

α is larger, the number of pairs in the ReverseDictionary is larger, but the accuracy of the ReverseDictionary is smaller. From experiments, we find that a ReverseDictionary created using the value 0.0 for the threshold α has the highest average score. This approach significantly increases the number of entries in the Reverse-

Dictionary. However, there is an issue with this approach. For instance, the word “tuhbi” in Dimasa means “crowded”, “compact”, “dense”, or “packed”. Because the distance between the English words “slow” and “dense” in the WordNet is 0.0, this algorithm concludes that

"slow" has the meaning "tuhbi" also, which is wrong. Figure 4.4 shows an example of the drawback of the DRwD algorithm.
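
The dissertation computes these distances with RiTa.WordNet; the sketch below is an NLTK-based approximation that treats the distance between two words as one minus their best path similarity over sense pairs of the same POS, and assumes the WordNet data has been downloaded with nltk.download('wordnet').

from nltk.corpus import wordnet as wn

def wordnet_distance(word1, word2, pos=wn.NOUN):
    best = 0.0
    for s1 in wn.synsets(word1, pos=pos):
        for s2 in wn.synsets(word2, pos=pos):
            sim = s1.path_similarity(s2)              # 1.0 when the synsets coincide
            if sim is not None and sim > best:
                best = sim
    return 1.0 - best                                 # 0.0 means the words share a meaning

print(wordnet_distance("handwriting", "script"))      # expected ~0.0: the nouns share a sense
print(wordnet_distance("mango", "papaya"))            # non-zero: the two fruits do not share a synset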

4.3.3 Direct reversal with similarity (DRwS)

The DRwD approach computes simply the distance between two senses, but does not look at the meanings of the senses in any depth. The DRwS approach represents a concept in terms of its WordNet synsets, synonyms, hyponyms and hypernyms. In the Princeton WordNet, synonyms are "words that denote the same concept and are interchangeable in many contexts" and synsets are sets of cognitive synonyms. Hyponyms are words describing things more specifically, while hypernyms are words illustrating things more generally. Table 4.1 shows synonyms, synsets, hypernyms and hyponyms of the word "south".

2http://www.rednoise.org/rita/index.html

Figure 4.4: The drawback of the DRwD algorithm

Table 4.1: Words related to the word “south”, obtained from the Princeton WordNet.

Relation Words

Synonyms west, line, windward, earth, southeast, orientation, point, base, frontage,

northward, northeast, sodom, somewhere, here, home, part, region, there,

space, westward, whereabouts, north, southwest, leeward, southward, jungle,

seat, east, notch, northwest, pass, bilocation, seaward, opposition, eastward

Synsets southward

Hypernyms location, direction

Hyponyms null

The DRwS approach is like the DRwD approach, but instead of computing the distance between lexical entries in each pair, we calculate a similarity, called simValue. The idea behind the DRwS approach is shown in Figure 4.5.

Figure 4.5: The idea of the DRwS algorithm

The algorithm for computing the simValue of entries is shown in Algorithm 3. If the simValue of a <LexicalEntryi, LexicalEntryj>, i ≠ j, pair is equal to or larger than a threshold β, we conclude that LexicalEntryi has the same meaning as LexicalEntryj.

Again, LexicalEntryi and LexicalEntryj need to have the same POS.

To calculate simValue between two phrases, we obtain the ExpansionSet for every word in each phrase from the WordNet database (Algorithm 3, lines 2-10). We define

ExpansionSet of a phrase as a union of synset, and/or synonym, and/or hyponym, and/or hypernym of every word in it. We compare the similarity between the ExpansionSets by counting the minimum similar words between the two ExpansionSets of LexicalEntryi and

LexicalEntryj (Algorithm 3, lines 11-14). The value of β and the kinds of ExpansionSets are changed to create different ReverseDictionarys. Based on experiments, we find that the best value of β is 1.0, and the best ExpansionSet is the union of synset, synonyms, hyponyms, and hypernyms.
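
The following sketch approximates the ExpansionSet and simValue computation with NLTK in place of the WordNet interface used in the dissertation: the expansion of a word is the union of its synset, hypernym and hyponym lemmas, and simValue is the smaller of the two overlap ratios, as in Algorithm 3 (lines 11-14). It assumes the NLTK WordNet data is installed.

from nltk.corpus import wordnet as wn

def expansion_set(word, pos=wn.NOUN):
    # union of synset lemmas (synonyms), hypernym lemmas and hyponym lemmas
    expansion = set()
    for synset in wn.synsets(word, pos=pos):
        for related in [synset] + synset.hypernyms() + synset.hyponyms():
            expansion.update(lemma.name() for lemma in related.lemmas())
    return expansion

def sim_value(word1, word2, pos=wn.NOUN):
    e1, e2 = expansion_set(word1, pos), expansion_set(word2, pos)
    if not e1 or not e2:
        return 0.0
    shared = e1 & e2
    return min(len(shared) / len(e1), len(shared) / len(e2))

print(sim_value("handwriting", "script"))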

4.3.4 Direct reversal with similarity and distance (DRwSD)

The DRwS algorithm maintains the quality of entries created fairly well by looking at the meanings of the senses, but it still has a problem due to the WordNet database. For instance, the simValue of the words "mango" and "papaya" is 1.0 because the

Algorithm 3 simValue Input: A pair of LexicalEntryi , LexicalEntryj having the same POS.

Output: The similarity value simValue between LexicalEntryi and LexicalEntryj

1: simWords := φ

2: if LexicalEntryi.LexicalUnit & LexicalEntryj.LexicalUnit have a WordNet lexical

ontology then

3: for all (LexicalUnitu ∈ LexicalEntryi)&(LexicalUnitv ∈ LexicalEntryj) do

4: Find ExpansionSet of every LexicalEntry based on LexicalUnit

5: end for

6: else

7: for all (Senseu ∈ LexicalEntryi)&(Sensev ∈ LexicalEntryj) do

8: Find ExpansionSet of every LexicalEntry based on Sense

9: end for

10: end if

11: simWords ← ExpansionSet (LexicalEntryi) ∩ ExpansionSet(LexicalEntryj)

12: n ←ExpansionSet(LexicalEntryi).length

13: m ←ExpansionSet(LexicalEntryj).length

14: simValue ← min{simWords.length / n, simWords.length / m}

ExpansionSets of two words are the same. As a result, the DRwS algorithm concludes that

“mango” and “papaya” have the same meaning, which is wrong. Fortunately, the distance value between the two words is 0.0769. Hence, the DRwD algorithm might conclude that the word “mango” has a different meaning from the word “papaya” if we use the best value of the distance threshold α, which is 0.0. We notice that the ExpansionSets of numerals are also similar. The DRwS algorithm concludes that the words “sixteen” and “seventeen”

have the same meaning; while the DRwD algorithm recognizes that they have different

meanings because their distance value is 0.125. Therefore, we combine the DRwS and

DRwD algorithms to create a new algorithm, the so-called DRwSD algorithm, to increase

the accuracies of dictionaries created. The idea behind the DRwSD algorithm is presented

in Figure 4.6.

Figure 4.6: The idea behind the DRwSD algorithm

The DRwSD algorithm is presented in Algorithm 4. If the simValue of < LexicalEntryi,

LexicalEntryj >, i ≠ j, pair is equal to or greater than a threshold β (Algorithm 4, line 6) and the distance of the < LexicalEntryi, LexicalEntryj > pair is equal to or smaller than a threshold α (Algorithm 4, line 7), we conclude that LexicalEntryi has the same meaning as LexicalEntryj and add the new entry into the ReverseDictionary (Algorithm 4, line

8). Changing the values of α and β will change the number of entries created and their accuracies.
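Conceptually, the DRwSD acceptance test simply conjoins the two thresholds. A small sketch follows, with sim_value as above and wordnet_distance standing in for the WordNet-based distance used by the DRwD algorithm; both names are illustrative assumptions.

```python
def drwsd_same_meaning(entry_i, entry_j, sim_value, wordnet_distance,
                       beta=1.0, alpha=0.0):
    # DRwSD acceptance test (sketch): the pair must be similar enough
    # (simValue >= beta) AND close enough in WordNet (distance <= alpha)
    # before we conclude that the two entries share a meaning.
    return (sim_value(entry_i, entry_j) >= beta and
            wordnet_distance(entry_i, entry_j) <= alpha)
```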

4.4 Experimental results

Given Karbi-English dictionary Dict(ajz,eng), Arabic-English dictionary Dict(arb,eng),

Assamese-English dictionary Dict(asm,eng), Dimasa-English dictionary Dict(dis,eng) and

Vietnamese-English dictionary Dict(vie,eng), we generate the reverse bilingual dictionaries:

Dict(eng,ajz), Dict(eng,arb), Dict(eng,asm), Dict(eng,dis) and Dict(eng,vie).

Algorithm 4 DRwSD Algorithm
Input: SourceDictionary, Dict(L1,L2).

Output: ReverseDictionary, Dict(L2,L1).

1: ReverseDictionary := φ

2: for all LexicalEntryi ∈ SourceDictionary do

3: for all Senseu ∈ LexicalEntryi do

4: for all LexicalEntryj ∈ SourceDictionary having the same POS with

LexicalEntryi do

5: for all Sensev ∈ LexicalEntryj do

6: if simValue(LexicalEntryi, LexicalEntryj) ≥ β then

7: if distance(LexicalEntryi, LexicalEntryj) ≤ α then

8: Add tuple to ReverseDictionary

9: end if

10: end if

11: end for

12: end for

13: end for

14: end for

4.4.1 Preprocessing entries in the existing dictionaries

Every Sense in a language which has a WordNet is given in terms of one or many words, some of which are common, the so-called stop words, and do not convey much particularized or specific meaning. Examples of such words are “someone”, “to” and “a”. We use a standard list of 571 stop words3 and remove them from the sense phrases. In addition, the words that remain after removing stop words may appear in various morphological forms.

3http://www.lextek.com/manuals/onix/stopwords2.html

We normalize them into a common root-form by lemmatizing. For example, the English words “a raisin” and “a dried grape” become “raisin” and “dry grape”, respectively, after removing stop words and lemmatizing. We do not stem words by using a stemmer algorithm such as the Porter stemmer4 [94] because the remaining words after stemming sometimes are not correct base forms. For instance, the results of using the Porter Stemmer to stem the words

“imitate”, “language”, and “software” are “imit”, “languag”, and “softwar”, respectively, which do not exist in English. In addition, we obtain what we call an ExpansionSet consisting of synsets, synonyms, hyponyms, and hypernyms of each word in each phrase from the

WordNet for computing similarities among entries in the next chapters. Therefore, we use the stemming function supported by RiTa.WordNet to lemmatize words and phrases in the dictionaries. Although this stemming function is also not perfect, it allows us to find the

ExpansionSet of a stemmed word from the WordNet database. The stemmed words or phrases will be used to compute similarities among entries in dictionaries.

The POS of each entry in the existing dictionaries will be taken into account to find all words that have the same meaning as a target word, using the WordNet database. However, many entries in the input dictionaries have no POS information. For example, 100% and 6.63% of the entries in the Arabic-English and Vietnamese-English dictionaries, respectively, do not have POS. Hence, for each entry in the existing dictionaries that does not have POS information, we obtain the best POS of the English word in that entry and consider the obtained POS as the POS of that entry.

4http://tartarus.org/martin/PorterStemmer/
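As an illustration of this preprocessing, the sketch below uses NLTK's morphy and synsets functions; the stop-word set is only a small stand-in for the 571-word list, and our implementation actually relies on RiTa.WordNet.

```python
from nltk.corpus import wordnet as wn

# Small stand-in for the 571-word stop list referenced above (footnote 3).
STOP_WORDS = {"a", "an", "the", "to", "of", "someone", "something"}

def preprocess_sense(phrase):
    # Remove stop words, then lemmatize the remaining words. wn.morphy is an
    # illustrative stand-in for the RiTa.WordNet stemming function; it returns
    # None for unknown forms, in which case the word is kept unchanged.
    kept = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    return " ".join(wn.morphy(w) or w for w in kept)

def best_pos(english_word):
    # Fallback POS assignment: pick the most frequent WordNet POS of the
    # English word when a dictionary entry carries no POS information.
    counts = {}
    for synset in wn.synsets(english_word):
        counts[synset.pos()] = counts.get(synset.pos(), 0) + 1
    return max(counts, key=counts.get) if counts else None

# e.g., preprocess_sense("a dried grape") yields roughly "dry grape",
# and best_pos("book") yields 'n', since "book" is most often a noun.
```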

4.4.2 Results

The average scores of entries and the number of entries in ReverseDictionarys we created using different approaches with different parameters are presented in Table 4.2,

Table 4.3, and Table 4.4.

Table 4.2: Reverse dictionaries created using the DR and DRwD approaches.

Dictionary DR DRwD(α = 0.0)

created Score Entries Score Entries

Dict(eng,ajz) 4.18 4,683 2.20 7,697

Dict(eng,arb) 3.26 13,107 3.54 53,676

Dict(eng,asm) 4.67 69,417 3.00 103,290

Dict(eng,dis) 3.60 6,629 2.22 11,655

Dict(eng,vie) 3.05 132,778 2.04 67,543

Table 4.3: Reverse dictionaries created using the DRwS approach

Dictionary DRwS(β > 0.9) DRwS(β = 1.0)

created Score Entries Score Entries

Dict(eng,ajz) 3.74 10,737 4.88 6,383

Dict(eng,arb) 1.62 250,741 1.70 136,255

Dict(eng,asm) 2.76 188,182 4.01 85,215

Dict(eng,dis) 3.86 12,046 3.80 8,752

Dict(eng,vie) 2.10 640,111 3.28 216,382

Table 4.4: Reverse dictionaries created using the DRwSD approach

Dictionary DRwSD(β > 0.9, α = 0.0) DRwSD(β = 1.0, α = 0.0)

created Score Entries Score Entries

Dict(eng,ajz) 4.12 5,830 4.20 5,758

Dict(eng,arb) 2.93 20,715 2.68 19,745

Dict(eng,asm) 4.20 77,754 4.31 75,994

Dict(eng,dis) 3.99 8,506 4.31 8,386

Dict(eng,vie) 3.60 161,304 3.48 156,375

The DRwD approach significantly increases the number of entries, but the accuracy of the created dictionaries is much lower. The DRwS approach using a union of the synset, and the sets of synonyms, hyponyms, and hypernyms of words, and β = 1.0 produces good reverse dictionaries. The DRwSD approach combines the advantages of both DRwD and

DRwS approaches by comparing the ExpansionSets and the distances among words to find words with the same meaning. The DRwSD approach creates the best reverse dictionaries for each language pair using thresholds β = 1.0 and α = 0.0. We also create the reverse dictionaries Dict(eng,chr) and Dict(eng,chy) using our best approach (DRwSD) from the existing Dict(chr,eng) and Dict(chy,eng). The Dict(eng,chr) and Dict(eng,chy) we created contain 3,538 and 28,072 entries, respectively, without human evaluation.

The DRwD, DRwS, and DRwSD algorithms take into account the POS of each entry in the existing dictionaries to find words having the same meanings. As a result, if the entries in existing dictionaries do not have good POS information, or do not have POS information at all, the accuracies of the new ReverseDictionarys will not be high. For example, because the existing Dict(arb,eng) does not have any POS, the accuracies of the new Dict(eng,arb) created using different algorithms are very low compared to other ReverseDictionarys created from existing dictionaries with POS such as Dict(eng,ajz), Dict(eng,dis) and Dict(eng,asm).

The dictionaries we work with frequently have several meanings for a word. Some of these meanings are unusual, rare or very infrequently used. The DR algorithm creates entries for the rare or unusual meanings by direct reversal. We noticed that our evaluators do not like such entries in the reversed dictionaries and mark them low. Table 4.5 shows some example words unknown to the evaluators. This results in lower average scores in the DR algorithm compared to average scores in the DRwS and DRwSD algorithms. The DRwS and

DRwSD algorithms seem to have removed a number of such unusual or rare meanings (and entries similar to the rare meanings, recursively), improving the average score. However, in another sense, rare words should also occur in a good dictionary. So, for future work, we need to discover which rare words should be kept in a dictionary.

Table 4.5: Examples of unknown words from the source dictionaries.

Definitely, the accuracies of ReverseDictionarys created are affected by the existing

SourceDictionarys used. If the SourceDictionarys are not of good quality, the Reverse-

Dictionarys cannot be of excellent quality. Therefore, one of our goals is to maintain and

improve the accuracies of new dictionaries created compared to the existing dictionaries

used. Table 4.6 shows some examples of different errors coming from the SourceDictionarys

such as incorrect translations, unknown words in the source or the target language and

misspelled words in the source or the target language. Using such bad entries in the Source-

Dictionarys to create new entries for the ReverseDictionarys will lead to low accuracies.

Table 4.6: Examples of bad translations from the source dictionaries

Our approaches do not work well for dictionaries containing an abundance of complex phrases. As mentioned earlier, we experiment with single words or multiword expressions with fewer than 30 characters. The original dictionaries, except Dict(vie,eng), do not contain many long phrases or complex words. In Vietnamese, most words we find in the dictionary can be considered compound words composed of simpler words put together. However, the component words are separated by spaces. Some examples, found in Figure 1.1, are “an dưỡng đường”, “an giấc ngàn thu” and “an giấc nghìn thu”. The presence of a large number of compound words written in this manner causes problems with Dict(eng,vie).

To assess the accuracies of the reverse dictionaries we produce at a gross level, without the assistance of human evaluators, we reverse our best ReverseDictionarys created to generate new reversal-of-reverse dictionaries, RoRDictionarys, then integrate the RoRDictionarys with the SourceDictionarys to improve the quality of dictionaries. For instance, we use our best Dict(eng,ajz) dictionary to create a new Dict(ajz,eng). Then, we integrate our Dict(ajz,eng) created with the Dict(ajz,eng) provided by Xobdo to create a new Dict(ajz,eng), with a higher number of entries and better accuracies. During the process of generating new ReverseDictionarys, we already computed the distances among words and the semantic similarity values among words to find words with the same meanings. We use the DR approach to create the RoRDictionarys. The number of entries and the average score of entries in the RoRDictionarys generated, and the percentage increase in the number of entries in the RoRDictionarys compared to the number of entries in the SourceDictionarys, are shown in Table 4.7. Some new entries in the RoRDictionarys, which do not exist in the source dictionaries, are shown in Table 4.8.

4.5 Future work

We currently remove from consideration dictionary entries that have more than 30 characters to keep things simple. However, frequently dictionary entries are multiword expressions. When is it appropriate to consider such entries when creating reverse dictionaries?

Table 4.7: Reverse of reverse dictionaries generated

RoRDictionary Score Entries % inc.

Dict(ajz,eng) 3.96 5,764 23.11%

Dict(arb,eng) 2.20 53,550 0.66%

Dict(asm,eng) 4.07 81,742 6.25%

Dict(chr,eng) N/A 3,618 13.1%

Dict(chy,eng) N/A 37,529 33.57%

Dict(dis,eng) 2.72 8,389 26.57%

Dict(vie,eng) 3.48 256,154 10.57%

Table 4.8: Some new entries, evaluated as excellent or good, in the reverse of reverse

dictionaries

How can we extend our approaches to allow for the incorporation of such entries in the reverse dictionary?

In fact, many languages have only a few bilingual dictionaries, or just one. Some of those dictionaries may just consist of words in the source language and their translations in the target language without any POS information or example sentences showing the uses of words. Developing approaches to assign the correct POS to entries in available dictionaries is greatly needed. Currently, our algorithms choose the best POS of the English word. This fallback behavior is often insufficient. For instance, the word “book” has two POSes, viz.,

“verb” and “noun”, of which “noun” is more common. Hence, all translations to the word

“book” will have the same POS “noun”. As a result, all LexicalEntrys translating to the word “book” will be treated as nouns, leading to many wrong translations. We will work on developing algorithms to discover the POS of entries from available corpora or existing documents in the source and the target languages.

If in Dict(L1, L2), the word a in L1 translates to the word b in L2 such that b is the best translation of a, this does not mean the word a is also the best translation of the word b in Dict(L2, L1). We need to find a method to rank translations in the reverse dictionaries generated. For example, a statistical approach has been used to rank translation candidates obtained from bilingual web data [65]. We will also construct comparable corpora or parallel corpora from different sources (e.g., the Web, existing corpora or example sentences in bilingual dictionaries), then rank translations based on the probability of each translation pair in the built corpora.

Measuring the semantic similarity between word pairs is a challenge. We will experiment with different approaches to calculate the distance between words, such as computing WordNet distance [124], calculating the local density [64] or analyzing the graph associated with WordNet [127]. In addition, we will study suggesting new dictionary entries using word embedding methods [62] and [74], then searching parallel corpora to rank the suggested entry candidates.

Our algorithms require us to compute similarities, using WordNet-based distance measures, among N² meaning pairs, where N is the average size of a dictionary, each pair containing n words on average. This leads to an approximate average complexity of N²n² slow WordNet similarity lookups. To decrease the time complexity, we will group entries in the source dictionaries based on the similarity in senses of entries. This will necessitate finding entries with the same meanings as the source entry only in the cluster that the source entry belongs to, instead of searching the whole dictionary. We consider our dictionaries to be our corpus. We generate context vectors for words in the corpus, then group words having the same meaning based on the similarities of their context vectors. Cosine distance or a bottom-up agglomerative clustering algorithm [24] will be used to compute the similarities between context vectors.
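For reference, a small sketch of the cosine similarity between two context vectors follows; representing the vectors as word-count dictionaries is an assumption made only for illustration.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two context vectors, represented here
    # (as an assumption) by dicts mapping context words to counts.
    common = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```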

4.6 Chapter summary

We introduce approaches to create a reverse bilingual dictionary from an existing bilingual dictionary using WordNet. An important novelty of our approaches is to use a WordNet to compute two measures, similarity and distance, between words in the pivot language, which is English or a resource-rich language with an available WordNet, allowing us to increase the number of entries in the resulting dictionary by leveraging synonyms. In our current work, this can increase the size of dictionaries created by an order of magnitude compared to our baseline DR approach, without decreasing their evaluation scores. We show that a high precision reverse dictionary can be created without using any other intermediate dictionaries or languages. We perform experiments with several resource-poor languages including four that are in UNESCO's list of endangered languages. Our approaches not only create new lexical resources using few existing lexical resources, reducing time and cost, but also have the potential to improve the quality of the resources created. Our work may also support the creation of linguistic infrastructure for communities using languages with limited resources. Some example entries and their evaluation in the reverse dictionaries created using the DR approach and the best approach

(DRwSD) are presented in the appendix section.

Acknowledgments

We would like to thank the volunteers evaluating the dictionaries we create: Morn- ingkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Dubari Borah,

Feras Al Tarouti, Tri Doan, Faris Kateb, Abhijit Bendale and Lalit Prithviraj Jain.

This chapter is based on the paper “Creating reverse bilingual dictionaries”, written in collaboration with Jugal Kalita, that appeared in the Proceedings of the Conference of the North

American Chapter of the Association for Computational Linguistics: Human Language

Technologies (NAACL-HLT), pages 524-528, Atlanta, USA, June 2013.

CHAPTER 5

CREATING NEW BILINGUAL DICTIONARIES

5.1 Introduction

Bilingual dictionaries play a major role in applications such as machine translation, information retrieval, cross-lingual document processing, automatic word sense disambiguation, computing similarities among documents and increasing translation accuracy [50]. Bilingual dictionaries are also useful to general readers, who may need help in translating documents in a given language to their native language or to a language with which they are familiar.

Such dictionaries may also be important from an intelligence perspective, especially when they deal with smaller languages from sensitive areas of the world. Creating new bilingual dictionaries is also a purely intellectual and scholarly endeavor important to the humanities and other scholars.

The question we address in this chapter is the following: Given a language, especially a resource-poor language, with only one available dictionary translating from that language to a resource-rich language, can we automatically construct several good dictionaries translating from the original language to many other languages using publicly available resources such as bilingual dictionaries, MTs and WordNets? For example, given a Karbi-English dictionary, can we construct new bilingual dictionaries such as Karbi-Arabic, Karbi-Chinese and Karbi-Vietnamese dictionaries? We call a dictionary good if each entry in it is of high quality and we have the largest number of entries possible. We must note that these two objectives conflict: frequently, if an algorithm produces a large number of entries, there is a high probability that the entries are of low quality. Restating our goal, with only one input dictionary translating from a source language to a language which has a WordNet linked to the Princeton WordNet, we create a number of good bilingual dictionaries from that source language to all other languages supported by an MT, with different levels of accuracy and sophistication.

Our contribution in this work is the reliance on the existence of just one bilingual dictionary between a low-resource language and a resource-rich language, viz., English. This strict constraint on the number of input bilingual dictionaries can be met by even many endangered languages. We consciously decided not to depend on additional bilingual dictionaries or external corpora because such languages usually do not have such resources. The simplicity of our algorithms along with low-resource requirements are our main strengths.

5.2 Related work

As presented in Chapter 2, the common method to create a new bilingual dictionary is the pivot-based method. After obtaining an initial bilingual dictionary, past researchers have used several approaches to mitigate the effect of the ambiguity problem. All the methods used for word sense disambiguation use WordNet distance between source and target words in some ways, in addition to looking at dictionary entries in forward and backward directions and computing the amount of overlap or match to obtain disambiguation scores [2], [13], [14],

[32], [57], [69], [107] and [115]. The formulas used and the names used for the disambiguation scores by different authors are different. Researchers have also merged information from sources such as parallel corpora or comparable corpora [81], [90] and a WordNet [120]. Some researchers have also extracted bilingual dictionaries from parallel corpora or comparable corpora using statistical methods [16], [19], [34], [38], [67] and [80]. 76

The rest of this chapter proposes methods for creating a significant number of bilingual dictionaries from a single available bilingual dictionary, which translates a source language to a resource-rich language with an available WordNet. We use publicly available WordNets in several resource-rich languages and a publicly available MT as well.

5.3 Proposed approaches

This section describes approaches to automatically create new bilingual dictionaries

Dict(S,D), each of which translates a word in language S to a word or multiword expression in a target language D. Our starting point is just one existing bilingual dictionary Dict(S,R), where S is the source language and R is an “intermediate helper” language. We require that the language R has an available WordNet linked to the PWN. We do not think this is a big imposition since the PWN and other WordNets are freely available for research purposes.

5.3.1 Direct translation approach (DT)

We first develop a direct translation method which we call the DT approach (see

Algorithm 5). The DT approach uses transitivity to create new bilingual dictionaries from existing dictionaries and an MT. An existing dictionary Dict(S,R) contains alphabetically sorted LexicalUnits in a source language S and each has one or more Senses in the language

R. We call such a sense SenseR. To create a new bilingual dictionary Dict(S,D), we simply take every <LexicalUnit, SenseR> pair in Dict(S,R) and translate SenseR to D to generate translation candidates candidateSet (lines 2-4). When there is no translation of SenseR in D, we skip that pair. Each candidate in candidateSet becomes a SenseD in language D of that LexicalUnit. We add the new tuple <LexicalUnit, SenseD> to Dict(S,D) (lines 5-7).

Algorithm 5 DT algorithm
Input: Dict(S, R)

Output: Dict(S, D)

1: Dict(S, D) := φ

2: for all LexicalEntry ∈ Dict(S, R) do

3: for all SenseR ∈ LexicalEntry do

4: candidateSet = translate(SenseR, D)

5: for all candidate ∈ candidateSet do

6: SenseDj = candidate

7: add tuple to Dict(S,D)

8: end for

9: end for

10: end for
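A minimal sketch of the DT approach follows; the dictionary representation and the translate function (standing in for a call to a machine translator such as the Bing Translator) are illustrative assumptions.

```python
def build_dt_dictionary(dict_s_r, target_lang, translate):
    # DT approach (sketch of Algorithm 5): for every (LexicalUnit, SenseR)
    # pair in Dict(S,R), translate SenseR to the target language D and record
    # each returned candidate as a SenseD of that LexicalUnit.
    # `dict_s_r` maps a LexicalUnit to its list of SenseR strings;
    # `translate(text, lang)` stands in for an MT call and returns a list.
    dict_s_d = {}
    for lexical_unit, senses_r in dict_s_r.items():
        for sense_r in senses_r:
            candidates = translate(sense_r, target_lang)
            if not candidates:          # no translation in D: skip this pair
                continue
            dict_s_d.setdefault(lexical_unit, set()).update(candidates)
    return dict_s_d
```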

An example of generating an entry for a Dimasa-Vietnamese dictionary using the

DT approach from an input Dimasa-English dictionary is presented in Figure 5.1. In the

Dimasa-English dictionary, the word “tijurutai” in Dimasa has two translations in English

“suggestion” and “advice”, which are translated to Vietnamese as “đề nghị” and “tư vấn”, respectively, using the Bing Translator. Therefore, in the new Dimasa-Vietnamese dictionary, the word “tijurutai” has two translations in Vietnamese, which are “đề nghị” and “tư vấn”.

5.3.2 Using publicly available WordNets as intermediate resources (IW)

To handle ambiguities in the dictionaries created, we propose the IW approach as in

Figure 5.2 and Algorithm 6. 78

Figure 5.1: An example of generating an entry for a Dimasa-Vietnamese

dictionary using the DT approach

Figure 5.2: The IW approach for creating a new bilingual dictionary

For each SenseR in every given LexicalEntry from Dict(S,R), we find all Offset-POSes1 in the WordNet of the language R to which SenseR belongs (Algorithm 6, lines 2-5).

1Offset-POS refers to the offset for a synset with a particular POS, from the beginning of its data file. Words in a synset have the same sense.

Algorithm 6 IW algorithm
Input: Dict(S,R)

Output: Dict(S, D)

1: Dict(S, D) := φ

2: for all LexicalEntry ∈ Dict(S,R) do

3: for all SenseR ∈ LexicalEntry do

4: candidateSet := φ

5: Find all Offset-POSes of synsets containing SenseR from the R WordNet

6: candidateSet = FindCandidateSet(Offset-POSes, D)

7: sort all candidates in descending order based on their rank values

8: for all candidate ∈ candidateSet do

9: SenseD=candidate.word

10: add tuple to Dict(S,D)

11: end for

12: end for

13: end for

Then, we find a candidate set for translations from the Offset-POSes and the destination language D using Algorithm 7. For each Offset-POS from the extracted Offset-POSes, we obtain each word belonging to that Offset-POS from different WordNets (Algorithm 7, lines

2-3) and translate it to D using an MT to generate translation candidates (Algorithm 7, line 4). We add translation candidates to the candidateSet (Algorithm 7, line 6). Each candidate in the candidateSet has 2 attributes: a translation of the word word in the target language D, the so-called candidate.word and the occurrence count or the rank value of the candidate.word, the so-called candidate.rank. A candidate with a greater rank value is

more likely to become a correct translation. Candidates having the same ranks are treated

similarly. Then, we sort all candidates in the candidateSet in descending order based on

their rank values (Algorithm 6, line 7), and add them into the new dictionary Dict(S,D)

(Algorithm 6, lines 8-10). We can vary the WordNets and the numbers of WordNets used

during experiments, producing different results.

Algorithm 7 FindCandidateSet(Offset-POSes, D)
Input: Offset-POSes, D

Output: candidateSet

1: candidateSet := φ

2: for all Offset-POS ∈ Offset-POSes do

3: for all word in the Offset-POS extracted from the PWN and other available Word-

Nets linked to the PWN do

4: candidate.word= translate (word, D)

5: candidate.rank++

6: candidateSet += candidate

7: end for

8: end for

9: return candidateSet
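The candidate generation and ranking of Algorithm 7 can be sketched as follows; the wordnets lookup structure and the translate function are illustrative assumptions rather than our actual implementation.

```python
from collections import Counter

def find_candidate_set(offset_poses, target_lang, wordnets, translate):
    # IW candidate generation (sketch of Algorithm 7): for each Offset-POS,
    # gather the words of that synset from every intermediate WordNet,
    # translate each word to the target language D, and count how often each
    # translation occurs. `wordnets` maps a WordNet name to a lookup dict
    # offset_pos -> [words]; `translate(word, lang)` stands in for an MT call.
    ranks = Counter()
    for offset_pos in offset_poses:
        for lookup in wordnets.values():
            for word in lookup.get(offset_pos, []):
                for candidate in translate(word, target_lang):
                    ranks[candidate] += 1
    # candidates sorted in descending order of rank (occurrence count)
    return ranks.most_common()
```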

Figure 5.3 shows an example of creating entries for a Dimasa-Arabic dictionary from a Dimasa-English dictionary using the IW approach. The word “tijurutai” in Dimasa has two senses as in Figure 5.1: “suggestion” and “advice”. This example only shows the IW approach to find the translation of “tijurutai” with the sense “suggestion”. We find all

Offset-POSes in the PWN containing “suggestion”. Then, we extract words belonging to all Offset-POSes from the PWN, FWN, WWN and JWN. Next, we translate extracted

words to Arabic and rank them based on the occurrence counts. According to the ranks,

the best translation of “tijurutai” in Dimasa, which has the greatest rank value, is the word

“almshwrh” in Arabic.

Figure 5.3: Example of generating lexical entries for a Dimasa-Arabic

dictionary using the IW approach

5.4 Experimental results

In our experiments, we create dictionaries from any of {ajz, arb, asm, chr, chy, dis, vie} to any non-eng language supported by the Microsoft Translator, e.g., arb, cht, deu,

mww, ind, kor, zlm, tha, spa, and vie, as shown in Figure 5.4. Please refer to Table 2.1

in Chapter 3 for the ISO codes.

Figure 5.4: New bilingual dictionaries created

5.4.1 Results and human evaluation

Ideally, evaluation should be performed by volunteers who are fluent in both the source and destination languages. However, for evaluating dictionaries created, we could not recruit any individuals who are experts in two appropriate languages. This is not surprising, considering the languages we focus on are disparate, belonging to different classes, with provenance spread out around the world, and frequently resource-poor and even endangered. For example, it is almost impossible to find an individual fluent in both the endangered language Karbi and Vietnamese. Hence, every dictionary is evaluated by two people sitting together, one using the destination language as their mother tongue, and the other the source language.

Each volunteer pair was requested to evaluate using a 5-point scale – 5: excellent, 4: good,

3: average, 2: fair and 1: bad. The average score and the number of lexical entries in the dictionaries we create using the DT approach are presented in Table 5.1.

Table 5.1: The average score and the number of lexical entries in the

dictionaries created using the DT approach.

Dictionary Score Entries Dictionary Score Entries

arb-deu 4.29 1,323 arb-spa 3.61 1,709

arb-vie 3.66 2,048 asm-arb 4.18 47,416

asm-spa 4.81 20,678 asm-vie 4.57 42,743

vie-arb 2.67 85,173 vie-spa 3.55 35,004

The average scores and the numbers of lexical entries in the dictionaries created by the IW approach are presented in Table 5.2 and Table 5.3, respectively. In these tables,

Top n means dictionaries created by picking only translations with the top n highest ranks for each word; A: dictionaries created using PWN only; B: using PWN and FWN; C: using

PWN, FWN and JWN; D: using PWN, FWN, JWN and WWN. The method using all 4

WordNets produces dictionaries with the highest scores and the highest number of lexical entries as well.

The number of lexical entries and the accuracies of the newly created dictionaries definitely depend on the sizes and qualities of the input dictionaries. Therefore, if the sizes and the accuracies of the dictionaries we create are comparable to those of the input dictionaries, we conclude that the new dictionaries are acceptable. Using four WordNets as intermediate resources to create new bilingual dictionaries increases not only the accuracies but also the number of lexical entries in the dictionaries created. We also evaluate several bilingual dictionaries we create for a few of the language pairs. Table 5.4 presents the number of lexical entries and the average score of some of the bilingual dictionaries generated using the four WordNets.

Table 5.2: The average score of lexical entries in the dictionaries we

create using the IW approach.

Dictionary WordNets used Dictionary WordNets used

we create A B C D we create A B C D

Top 1 3.42 3.65 3.33 3.71 Top 1 4.43 4.31 3.86 4.43

arb-vie Top 3 3.33 3.58 3.76 3.61 asm-vie Top 3 3.93 3.59 3.33 3.94

Top 5 2.99 3.04 3.08 3.31 Top 5 3.74 3.34 3.4 2.91

Top 1 4.51 3.83 4.69 4.67 Top 1 3.11 2.94 2.78 3.11

asm-arb Top 3 4.03 3.75 3.80 4.10 vie-arb Top 3 2.47 2.72 2.61 3.02

Top 5 3.78 3.85 3.42 4.00 Top 5 2.54 2.37 2.60 2.73

The dictionaries created from Arabic to other languages have low accuracies because our algorithms rely on the POS of the lexical units to find the Offset-POSes and the input Arabic-English dictionary does not have POS. We were unable to access a better Arabic-English dictionary for free. For lexical entries without POS, our algorithms choose the best POS of the English word. For instance, the word “study” has two POSes, viz., “verb” and “noun”, of which “noun” is more common. Hence, all translations to the word “study” in the Arabic-English dictionary will have the same POS “noun”. As a result, all lexical entries translating to the word “study” will be treated as nouns, leading to many wrong translations.

Based on experiments, we conclude that using the four public WordNets, viz., PWN,

FWN, JWN and WWN as intermediate resources, we are able to create good bilingual dictionaries, considering the dual objective of high quality and a large number of entries.

For example, in the Cherokee-English bilingual dictionary, the word “ayvtseni” with a POS of noun in Cherokee is translated to “throat” in English.

Table 5.3: The number of lexical entries in the dictionaries we create

using the IW approach

Dictionary WordNets used

we create A B C D

Top 1 1,786 2,132 2,169 2,200

arb-vie Top 3 3,434 4,611 4,908 5,110

Top 5 4,123 5,926 6,529 6,853

Top 1 27,039 27,336 27,449 27,468

asm-arb Top 3 70,940 76,695 78,979 79,585

Top 5 104,732 118,261 125,087 126,779

Top 1 25,824 26,898 27,064 27,129

asm-vie Top 3 64,636 73,652 76,496 77,341

Top 5 92,863 111,977 120,090 122,028

Top 1 63,792 65,606 66,040 65,862

vie-arb Top 3 152,725 177,666 183,098 185,221

Top 5 210,220 261,392 278,117 282,398

We want to find translations of

“ayvtseni” in Vietnamese and add them into a Cherokee-Vietnamese bilingual dictionary.

Using the DT approach, we translate the word “throat” to Vietnamese using the Bing trans- lator. We can find only one translation “cổ họng” in Vietnamese. Using the IW approach, the word “throat” belongs to Offset-POSes: “01514549-n”, “04428763-n”, “04428920-n” and

“05547508-n”. We extract all words in these Offset-POSes in PWN, FWN, WWN and JWN, translate them to Vietnamese and rank translation candidates. As a result, the best translations of the word “ayvtseni” are “cổ họng” and “họng”.

Table 5.4: The average score of entries and the number of lexical entries in some other

bilingual dictionaries constructed using 4 WordNets: PWN, FWN, JWN and WWN.

Dictionary Top 1 Top 3

we create Score Entries Score Entries

arb-deu 4.27 1,717 4.21 3,859

arb-spa 4.54 2,111 4.27 4,673

asm-spa 4.65 26,224 4.40 72,846

vie-spa 3.42 61,477 3.38 159,567

These two candidates have the same ranks, so we accept them both as correct translations of the word “ayvtseni”. In other words, the IW approach using the four intermediate WordNets is our best approach. We note that if we include only translations with the highest ranks, the resulting dictionaries have accuracies even better than the input dictionaries used. Some entries, evaluated as excellent translations, in the bilingual dictionaries we created are shown in Table 5.5.

Table 5.5: Examples of entries, evaluated as excellent, in the new

bilingual dictionaries we created.

Table 5.6 presents the number of entries in some of the dictionaries we create using the

best approach, without human evaluation.

Table 5.6: The number of lexical entries in some other dictionaries we create

using the best approach.

Dict. Entries Dict. Entries Dict. Entries Dict. Entries

ajz-arb 4,345 ajz-cht 3,577 ajz-deu 3,856 ajz-mww 4,314

ajz-ind 4,086 ajz-kor 4,312 ajz-zlm 4,312 ajz-spa 3,923

ajz-tha 4,265 ajz-vie 4,344 asm-cht 67,544 asm-deu 71,789

asm-mww 79,381 asm-ind 71,512 asm-kor 79,926 asm-zlm 80,101

asm-tha 78,317 chr-arb 2,623 chr-cat 2,639 chr-cht 2,607

chr-dan 2,655 chr-deu 2,629 chr-mww 2,694 chr-ind 2,580

chr-zlm 2,633 chr-spa 2,607 chr-tha 2,645 chr-vie 2,618

chy-arb 10,604 chy-cat 10,748 chy-cht 10,538 chy-dan 10,654

chy-deu 10,708 chy-mww 10,790 chy-ind 10,434 chy-zlm 10,690

chy-spa 10,580 chy-tha 10,696 chy-vie 10,848 dis-arb 7,651

dis-cht 6,120 dis-deu 6,744 dis-mww 7,552 dis-ind 6,762

dis-kor 7,539 dis-zlm 7,606 dis-spa 6,817 dis-tha 7,348

dis-vie 7,652 vie-cht 140,664 vie-deu 152,106 vie-mww 181,819

vie-ind 150,538 vie-kor 181,101 vie-zlm 188,808 vie-tha 171,365

We are in the process of finding volunteers to evaluate dictionaries translating from

Cherokee, Cheyenne, Karbi and Dimasa to other languages. Some entries in the bilingual dictionaries we created using the best approach without human evaluation are shown in

Table 5.7.

Table 5.7: Examples of entries, not yet evaluated, in the new

bilingual dictionaries we create

5.4.2 Comparing with existing approaches

It is difficult to compare approaches because the languages involved in different papers are different, the number and quality of input resources vary and the evaluation methods are not standard. However, for the sake of completeness, we make an attempt at comparing our results with [120]. The precision of the best dictionary created by [120] is 79.15%. Although our score is not in terms of percentage, the average score of all dictionaries we created using 4 WordNets and containing the top-3 greatest-ranked LexicalEntrys is 3.87/5.00, with the highest score being 4.10/5.00, which means the entries are very good on average.

If we look at the greatest ranks only (Top 1 ranks), the highest score is 4.69/5.00, which is almost excellent. We believe that we can apply these algorithms to create dictionaries whose source is any language that has a bilingual dictionary to English.

To handle ambiguities, the existing methods need at least two intermediate dictionaries translating from the source language to intermediate languages. For example, to create an Assamese-Arabic dictionary, Gollins and Sanderson [32] and Mausam et al. [69] will need at least two dictionaries, e.g., an Assamese-English and an Assamese-French dictionary.

For Assamese, the second dictionary simply does not exist to the best of our knowledge.

The IW approach requires only one input dictionary. This is a strength of our method, in

the context of resource-poor languages.

5.4.3 Comparing with Google Translator

We evaluate the dictionaries we create against a well-known high quality MT: the

Google Translator. We do not compare our work against the Microsoft Translator because

we use it as an input resource. We randomly pick 300 lexical entries from each of our

created dictionaries for language pairs supported by the Google Translator. Then, we

compute the matching percentages between translations in our dictionaries and translations

from the Google Translator. For example, in the Vietnamese-Spanish dictionary we create,

Dict(vie,spa), “người quảng cáo”, which means “someone whose business is advertising”,

in Vietnamese translates to “anunciante” in Spanish, which is the same as the translation

given by the Google Translator. As the result, we mark the lexical entry (“người quảng

cáo”, “anunciante”) as “matching”. The Google Translator returns translations for 100% of the

words we query. The matching percentages of translations in our dictionaries Dict(arb,spa),

Dict(arb,vie), Dict(arb,deu), Dict(vie,deu), Dict(vie,spa) and the Google Translator are

55.56%, 39.16%, 58.17%, 25.71%, and 35.71%, respectively. The lexical entries marked as

“unmatched” are not necessarily incorrect translations. For instance, the Vietnamese word

“định mệnh”, which means “an event (or a course of events) that will inevitably happen in the future”, is translated to “schicksal” in German in our Vietnamese-German dictionary.

This translation does not match the translation obtained from the Google Translator, which is “das schicksal”, but it is evaluated as excellent. Table 5.8 presents some lexical

entries which are correct but are marked as “unmatched”. According to our evaluators, the

translations from the Google Translator of the first four source words are bad.

Table 5.8: Some “unmatched” lexical entries.

5.5 Future work

Evaluation of bilingual dictionaries is quite onerous, requiring hours of time to evaluate just a sample of entries from one such dictionary. It is quite difficult to find volunteers who know a pair of languages well, or to find two individuals who each know one of the languages well and have the time to sit together to evaluate dictionary entries. Finding volunteers to evaluate dictionaries translating from ajz, dis, chr and chy to other languages is a difficult task and may require travel to the provenance of these languages or finding experts who have access to the Internet and are able to sit together for a conference through

Skype or some other means.

Currently, four WordNets are used as intermediate resources to create new bilingual dictionaries. We want to experiment extensively to find which combination of WordNets and how many WordNets should be used to create the best dictionaries. Different WordNets have different amounts of coverage and may have been constructed in different manners, and may vary in how they capture language-specific information.

We initially use the occurrence counts of translation candidates to rank translations of a word. Our current approaches translate the word s in language S to the word r in language R, then translate r to the word d in language D. We will use different methods to improve the translation quality such as the two-time inverse consultation method [115], link structures [2], or translation graphs [69].

Our approaches use the public WordNets linked to the Princeton WordNet. Some existing WordNets such as the Hindi and Indonesian [33] WordNets are constructed in their own languages and have not been aligned to the Princeton WordNet. Such WordNets better maintain the morphology of their own languages. We would like to mine the information from those WordNets to improve the quality of the new dictionaries.

Finally, we will integrate our approaches to construct new lexical resources using existing lexical resources with other machine translation based approaches [70] and [72].

5.6 Chapter summary

We present two approaches to create a large number of good bilingual dictionaries from only one input dictionary, publicly available WordNets and a machine translator. In particular, we created 56 new bilingual dictionaries from 7 input bilingual dictionaries. We note that 49 of the dictionaries we created are not supported by any publicly available machine translation system yet. We believe that our research will help significantly increase the number of resources for languages which do not have many existing resources or are not supported by publicly available machine translators like Microsoft and Google. This includes languages such as Assamese, Cherokee, Cheyenne, Dimasa, Karbi, and other similar languages. We use WordNets as intermediate resources to create new bilingual dictionaries because these WordNets are available online for unfettered use and they contain information that can be used to remove ambiguities.

Acknowledgement

We would like to thank the volunteers evaluating the dictionaries we have created:

Dubari Borah, Francisco Torres Reyes, Conner Clark and Tri Doan.

This chapter is based on the paper “Automatically Creating a Large Number of New

Bilingual Dictionaries”, written in collaboration with Feras Al Tarouti and Jugal Kalita, that appeared in the Proceedings of the 29th Conference on Artificial Intelligence (AAAI), Texas,

USA, January, 2015. Another version of this paper focusing on the endangered languages is

“Creating Lexical Resources for Endangered Languages” in the Proceedings of Workshop on

The Use of Computational Methods in The Study of Endangered Languages (ComputEL), pages 54-62, Baltimore, USA, June, 2014. Association for Computational Linguistics.

CHAPTER 6

CREATING WORDNETS

6.1 Introduction

WordNets are intricate and substantive repositories of lexical knowledge and have become important resources for computational processing of natural languages and for information retrieval. Morato et al. [78] summarize the uses of WordNets: as a comprehensive resource in a model of the information retrieval process, as a “linguistic knowledge tool” to represent the meaning of words or the interpretation of semantic equivalents, and as resources to construct thesauri, to disambiguate word senses, and to compute semantic distances between words. In Chapter 4 and Chapter 5, we have shown that WordNets can be used to generate good bilingual dictionaries in terms of both quality and the number of entries. However, good quality WordNets are available only for a few “resource-rich” languages such as English and Japanese. Manually constructing a WordNet is a difficult task, needing years of experts' time. Published approaches to building new WordNets are manual or semi-automatic and can be used only for languages that already possess some lexical resources. This chapter proposes approaches to generate WordNet synsets for languages, both resource-rich and resource-poor, using publicly available WordNets, a machine translator and/or a single bilingual dictionary. In general, our algorithms translate synsets of existing WordNets to a target language T, then apply a ranking method on the translation candidates to find the best translations in T. Our approaches are applicable to any language which has at least one existing bilingual dictionary translating from it to English.

One of our goals is to automatically generate high quality synsets for WordNets having the same structure as the Princeton WordNet in several languages. This chapter discusses approaches to automatically construct WordNet synsets for languages with low amounts of resources (viz., Arabic and Vietnamese), resource-poor languages (viz., Assamese) or endangered languages (viz., Dimasa and Karbi). The sizes and the qualities of freely existing resources, if any, for these languages vary, but are not usually high. Hence, our second goal is to use a limited number of freely available resources in the target languages as input to our algorithms to ensure that our methods can be felicitously used with languages that lack many resources. In addition, our approaches need to be able to reduce noise coming from the existing resources that we use. For translation, we restrict ourselves to using a free MT as the “dictionary” and/or a single bilingual dictionary translating from the language to English. In particular, given public WordNets aligned to the Princeton WordNet (PWN) (such as the FinnWordNet (FWN) [66] and the JapaneseWordNet (JWN) [43]) and the Microsoft Translator, we build WordNet synsets for Arabic, Assamese, Dimasa, Karbi and Vietnamese.

6.2 Related work

The Princeton WordNet was constructed manually over many decades. WordNets, except the PWN, have usually been constructed by one of two approaches. The expand approach translates the PWN to T [7], [10], [46], [88], [101] and [104]; while the merge approach builds a WordNet in T, and then aligns it with the PWN by generating translations [33]. In terms of popularity, the first approach dominates over the second approach. WordNets generated using the merge approach have different structures from the PWN; however, the complex agglutinative morphology, culture-specific meanings and usages of words and phrases of the target languages can be maintained. In contrast, WordNets created using the expand approach have the same structure as the PWN. As mentioned in the previous section, our goal is to construct WordNets for several languages which have different morphologies. Therefore, the expand approach is a perfect choice for us to construct WordNets.

6.3 Proposed approaches

In this section, we propose approaches to create WordNet synsets for a target language T using existing WordNets and an MT and/or a single bilingual dictionary. We take advantage of the fact that every synset in PWN has a unique Offset-POS. Each synset may have one or more words, each of which may be in one or more synsets. Words in a synset have the same sense. The basic idea is to extract corresponding synsets for each Offset-POS from existing WordNets linked to PWN, in several languages. Next, we translate the extracted synsets in each language to T to produce so-called synset candidates using the MT. Then, we apply a ranking method on these candidates to find the correct words for a specific Offset-POS in T.

6.3.1 Generating synset candidates

We propose three approaches to generate synset candidates for each Offset-POS in the target language T. 96

6.3.1.1 The direct translation (DT) approach

The first approach directly translates synsets in PWN to T as in Figure 6.1. For

each Offset-POS, we extract words in that synset from the PWN and translate them to the

target language to generate translation candidates.

Figure 6.1: The DT approach to construct WordNet synsets in a target language T.

For example, we generate the translation candidates in Dimasa for the synset having an Offset-POS “00648224-v”. From the PWN, this Offset-POS has three members {“research”, “search”, “explore”}. There is no translation of the words “research” and “explore” in the English-Dimasa bilingual dictionary. The translations for the word “search” with a

POS of verb are {“shamai”, “maikha”}. Therefore, the translation candidates of the Offset-

POS “00648224-v” are {“shamai”, “maikha”} in Dimasa.
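A small sketch of this DT candidate generation for synsets follows, assuming (only for illustration) that the bilingual dictionary is represented as a mapping from (English word, POS) to target-language translations.

```python
def dt_synset_candidates(synset_words, pos, eng_to_target):
    # DT candidate generation for one synset (sketch): translate every English
    # member of the PWN synset to the target language T via the bilingual
    # dictionary. `eng_to_target` maps (english_word, pos) -> list of T
    # translations; words with no entry are simply skipped.
    candidates = []
    for word in synset_words:
        candidates.extend(eng_to_target.get((word, pos), []))
    return candidates

# For the example above, with a hypothetical Dimasa dictionary dis_dict,
# dt_synset_candidates(["research", "search", "explore"], "v", dis_dict)
# would return ["shamai", "maikha"].
```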

6.3.1.2 Approach using intermediate WordNets (IW)

Generating translation candidates by simply translating all members of a synset to a target language is straightforward. However, an important issue arises. We analyze a concrete example of generating synset candidates in Vietnamese for some Offset-POSes, which are senses of the noun “chair”, presented in Table 6.1.

We notice that the Offset-POSes “03001627-n” and “03002096-n” have only one synset member: “chair”. The word “chair” has only one translation: “ghế”, having the meaning of the Offset-POS “03001627-n”, obtained from the MT. Applying the DT approach, the candidates of both Offset-POSes “03001627-n” and “03002096-n” are the same “ghế”, which is

Table 6.1: Different senses of the word “chair”

Offset-POS Synset member Meaning

00598056-n professorship, chair the position of professor

03001627-n chair a seat for one person, with a support

for the back

03002096-n chair a particular seat in an orchestra

03271030-n electric chair, chair, death an instrument of execution by electro-

chair, hot seat cution; resembles an ordinary seat for

one person

10468962-n president, chairman, chair- the officer who presides at the meetings

woman, chair, chairperson of an organization

incorrect. In other words, the DT approach cannot recognize the different senses between words in the Offset-POSes “03001627-n” and “03002096-n”.

To handle ambiguities in synset translation, we propose the IW approach as in Figure

6.2. Publicly available WordNets in various languages, which we call intermediate Word-

Nets, are used as resources to create synsets for WordNets. For each Offset-POS, we extract its corresponding synsets from intermediate WordNets. Then, the extracted synsets, which are in different languages, are translated to T using MT to generate synset candidates.

Depending on which WordNets are used and the number of intermediate WordNets, the number of candidates in each synset and the number of synsets in the new WordNets change.

For example, we use PWN, FWN and WWN as intermediate WordNets. The words belonging to synsets having the Offset-POSes “03001627-n” and “03002096-n” in those WordNets

Figure 6.2: The IW approach to construct WordNet synsets in a target language T

and their translations in Vietnamese, shown inside the curly braces {}, are presented in Table 6.2. The translations are obtained by using the Bing Translator.

Table 6.2: Synsets obtained from different WordNets and their translations in Vietnamese

Offset-POS PWN synset FWN synset WWN synset

03001627-n chair {ghế} tuoli {ghế} chaise {ghế}; fauteuil {ghế}

03002096-n chair{ghế} paikka {ở nơi của các} null

As a result, the candidates of the Offset-POSes “03001627-n” and “03002096-n” are

{“ghế”, “ghế”, “ghế”, “ghế”} and {“ghế”, “ở nơi của các”}, respectively.

6.3.1.3 Approach using intermediate WordNets and a dictionary (IWND)

The IW approach for creating WordNet synsets decreases ambiguities in translations.

However, we need bilingual dictionaries translating from each intermediate language to T. Such dictionaries are not always available for many languages, especially the ones that are resource-poor. The IWND approach is like the IW approach, but instead of translating directly from the intermediate languages to the target language, we translate synsets extracted from intermediate WordNets to English, then translate them to the target language. The IWND approach is presented in Figure 6.3.

Figure 6.3: The IWND approach to construct WordNet synsets

6.3.2 Ranking method

For each Offset-POS, we have many translation candidates. A translation candidate

with a higher rank is more likely to become a word belonging to the corresponding Offset-

POS of the new WordNet in the target language. Candidates having the same ranks are

treated similarly. The rank value is in the range 0.00 to 1.00. The rank of a word w, the

so-called rankw, is computed as below.

rankw = (occurw / numCandidates) × (numDstWordNets / numWordNets), where:

- numCandidates is the total number of translation candidates of an Offset-POS,

- occurw is the occurrence count of the word w in the numCandidates,

- numWordNets is the number of intermediate WordNets used, and

- numDstWordNets is the number of distinct intermediate WordNets that have words

translated to the word w in the target language.

Our motivation for this rank formula is the following. If a candidate has a higher occurrence count, it has a greater chance to become a correct translation. Therefore, the occurrence count of each candidate needs to be taken into account. We normalize the occurrence count of a word by dividing it by numCandidates. In addition, if a candidate is translated from different words having the same sense in different languages, this candidate is more likely to be a correct translation. Hence, we multiply the first fraction by numDstWordNets. To normalize, we divide the result by the number of intermediate WordNets used.
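The rank computation can be sketched as follows, assuming the translations contributed by each intermediate WordNet for one Offset-POS have already been collected; the data layout here is an illustrative assumption.

```python
from collections import Counter

def rank_candidates(translations_by_wordnet, num_wordnets):
    # rank_w = (occur_w / numCandidates) * (numDstWordNets / numWordNets).
    # `translations_by_wordnet` maps an intermediate WordNet name to the list
    # of target-language translations it contributed for one Offset-POS.
    all_candidates = [w for ws in translations_by_wordnet.values() for w in ws]
    occur = Counter(all_candidates)
    num_candidates = len(all_candidates)
    ranks = {}
    for w in occur:
        num_dst = sum(1 for ws in translations_by_wordnet.values() if w in ws)
        ranks[w] = (occur[w] / num_candidates) * (num_dst / num_wordnets)
    return ranks

# With {"PWN": ["ghế"], "FWN": ["ghế"], "WWN": ["ghế", "ghế"]} and 3 WordNets,
# rank_candidates(...) gives {"ghế": 1.0}.
```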

For instance, in our experiments we use 4 intermediate WordNets, viz., PWN, FWN,

JWN and WWN. The words in the Offset-POS value “00006802-v” obtained from all 4

WordNets, their translations to Arabic, the occurrence count and the rank of each translation are presented in the second, the fourth and the fifth columns, respectively, of Table 6.3. In this table, the wordA, wordB and wordC are obtained from PWN, FWN and WWN, respectively. The JWN does not contain this Offset-POS. TL presents transliterations of the words in arb. The numWordNets is 4 and the numCandidates is 7. The rank of each candidate is shown in the last column of Table 6.3.

Table 6.3: Example of calculating the ranks of candidates in Arabic.

6.3.3 Selecting candidates based on ranks

We separate candidates based on three cases as below.

Case 1: A candidate w has the highest chance of becoming a correct word belonging to a specific synset in the target language if its rank is 1.0. This means that all intermediate

WordNets contain the synset with a specific Offset-POS and all words belonging to these synsets are translated to the same word w. The more intermediate WordNets used, the higher the chance a candidate with the rank of 1.0 has of being the correct translation. Therefore, we accept all translations that satisfy this criterion. An example of this scenario is presented in Figure 6.4 using the IW approach with four intermediate

WordNets: PWN, FWN, JWN and WWN. All words belonging to the OffSet-POS value

“00952615-n” in all 4 WordNets are translated to the same word “điện” in Vietnamese. The word “điện” is accepted as the correct word belonging to the OffSet-POS at “00952615-n” in the Vietnamese WordNet we create.

Figure 6.4: Example of Case 1 to select candidates

Case 2: If an Offset-POS does not have candidates with the rank of 1.0, we accept the candidates with the greatest rank. Table 6.4 shows an example of the second scenario using the IW approach with three intermediate WordNets: PWN, FWN and WWN. For the Offset-POS value “01437254-v”, there is no candidate with the rank of 1.0. The highest rank of the candidates in Vietnamese is 0.67, for the word “gửi”. We accept “gửi” as the correct word at the Offset-POS value “01437254-v” in the Vietnamese WordNet we create.

Case 3: If all candidates of an Offset-POS have the same rank which is also the greatest rank, we skip these candidates. Table 6.5 gives another example of the last scenario using the DT approach. For the Offset-POS value “00010435-v”, there is no candidate with the rank of 1.0. The highest rank of the candidates in Vietnamese is 0.167. All 3 candidates have the same rank as the highest rank. Therefore, we do not accept any candidate as the

Table 6.4: Example of Case 2 to select candidates

correct word in the OffSet-POS value “00010435-v” in the Vietnamese WordNet we create.

Table 6.5: Example of Case 3 to select candidates

WordNet Words Cand. Rank

PWN act hành động 0.167

PWN behave hoạt động 0.167

FWN do làm 0.167

Our scenarios to select translation candidates for Offset-POSes can reduce the ambiguity of words. For instance, the Vietnamese translation candidates of the Offset-POSes “03001627-n” and “03002096-n”, created using the IW approach and three intermediate WordNets (viz., PWN, FWN and WWN), presented in Table 6.2, are {“ghế”, “ghế”, “ghế”, “ghế”} and {“ghế”, “ở nơi của các”}, respectively. Now, we select the correct translations for these Offset-POSes in the Vietnamese WordNet using our cases.

- The Offset-POS value “03001627-n”: All words belonging to synsets having this Offset-

POS in all three intermediate WordNets are translated to the same word “ghế” in

Vietnamese. This translation candidate has a rank value of 1.0. Hence, we accept it

as a correct word belonging to this Offset-POS in the Vietnamese WordNet.

- The Offset-POS value “03002096-n”: We note that WWN does not contain this synset.

This Offset-POS has two translation candidates in Vietnamese, each of which has the

same rank of 0.167. Hence, we do not accept any of them as a correct word for the Offset-POS “03002096-n”. In other words, the Vietnamese WordNet does not contain the

Offset-POS value “03002096-n”.
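The three cases can be summarized as a small selection routine. The sketch below takes the ranks computed for one Offset-POS and returns the accepted candidates; it is an illustrative reading of the cases above, not the actual implementation.

def select_candidates(ranks):
    if not ranks:
        return []
    # Case 1: accept every candidate whose rank is 1.0.
    winners = [w for w, r in ranks.items() if r == 1.0]
    if winners:
        return winners
    best = max(ranks.values())
    # Case 3: all candidates share the same (greatest) rank -- skip them.
    if len(ranks) > 1 and all(r == best for r in ranks.values()):
        return []
    # Case 2: otherwise accept the candidate(s) with the greatest rank.
    return [w for w, r in ranks.items() if r == best]

For the examples above, such a routine would return “điện” for “00952615-n”, “gửi” for “01437254-v”, and nothing for “00010435-v”.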

6.4 Experiments

Our primary goal is to build high-quality synsets for WordNets in languages with low amounts of resources: ajz, asm, arb, dis and vie. The numbers of WordNet synsets we create for arb and vie using the DT approach, and their coverage percentages compared to the PWN synsets, are 4,813 (4.10%) and 2,983 (2.54%), respectively.

The number of synsets for each WordNet we create using the IW approach with different numbers of intermediate WordNets and the coverage percentage compared to the

PWN synsets are presented in Table 6.6. WNs is the number of intermediate WordNets used: 2 means PWN and FWN; 3 means PWN, FWN and JWN; and 4 means PWN, FWN, JWN and WWN.

For the IWND approach, we use all 4 WordNets as intermediate resources. The numbers of WordNet synsets we create using the IWND approach are presented in Table

6.7. We only construct WordNet synsets for ajz, asm and dis using the IWND approach

because these languages are not supported by MT.

Finally, we combine all of the WordNet synsets we create using different approaches

to generate the final WordNet synsets. Table 6.8 presents the final number of WordNet

synsets we create and their coverage percentages.

Table 6.6: The number of WordNet synsets we create using the IW approach.

App.   Lang.   WNs   Synsets   % coverage
IW     arb     2     48,245    41.00%
IW     vie     2     42,938    36.49%
IW     arb     3     61,354    52.15%
IW     vie     3     57,439    48.82%
IW     arb     4     75,234    63.94%
IW     vie     4     72,010    61.20%

Table 6.7: The number of WordNet synsets we create using the IWND approach.

App.    Lang.   Synsets   % coverage
IWND    ajz     21,882    18.60%
IWND    arb     70,536    59.95%
IWND    asm     43,479    36.95%
IWND    dis     24,131    20.51%
IWND    vie     42,592    36.20%

Table 6.8: The number and the average score of WordNet synsets we create.

Lang.   Synsets   % coverage      Lang.   Synsets   % coverage
ajz     21,882    18.60%          arb     76,322    64.87%
asm     43,479    36.95%          dis     24,131    20.51%
vie     98,210    83.47%

The average scores of the WordNet synsets for arb, asm and vie are 3.82, 3.78 and 3.75, respectively. We have not evaluated the WordNet synsets for ajz and dis. We notice that the WordNet synsets generated using the IW approach with all 4 intermediate WordNets have the highest average scores: 4.16/5.00 for arb and 4.26/5.00 for vie.

It is difficult to compare WordNets because the languages involved in different papers are different, the number and quality of input resources vary, and the evaluation methods are not standard. However, for the sake of completeness, we make an attempt at comparing our results with published papers. Although our score is not in terms of percentage, we obtain an average score of 3.78/5.00 (about 75.60% precision, if we transform the score to precision, possibly with some reservation), which we believe is better than the 55.30% obtained by [13] and the 43.20% obtained by [23]. In addition, the average coverage percentage of all WordNet synsets we create is 44.85%, which is better than 12% in [23] and 33,276 synsets (about 28.28%) in [104].

The previous studies need more than one dictionary to translate between a target language and intermediate helper languages. For example, to create the JWN, Bond and Ogura [13] need the Japanese-Multilingual dictionary, a Japanese-English lexicon and a Japanese-English life science dictionary. For Assamese, there is only a small number of Dict(eng,asm); to the best of our knowledge, only two online dictionaries, both between English and Assamese, are available. The IWND approach requires only one input dictionary between a pair of languages. This is a strength of our method.

6.5 Future work

We intend to extend our approaches to generate synonyms, antonyms, hypernyms and hyponyms. To enhance the quality of the WordNets we create, different approaches will be used to measure relatedness between concepts or words. Some potential approaches for measuring semantic relationships are a dictionary-based approach [56] and a thesaurus-based approach [79]. We will compute semantic relations between members of concepts using the approaches introduced by Hirst and St-Onge [39] based on the semantic distance algorithm [79], then compare the results against human judgment. In addition, we may attempt to construct dense and weighted connections between concepts in WordNets [17].

WordNets constructed by translating the Princeton WordNet lose some of the language-specific relationships present in the target languages. We will investigate ways to maintain such relationships in the newly created WordNets. In addition, the Princeton WordNet has some gaps itself. For example, the synonyms of “mango” and “papaya” are the same. We will manually address these gaps to improve the quality of the WordNets we create.

There are many other potential issues in creating lexical resources across languages, such as in the context of Hebrew and English [89]. Some of these arise due to the need to represent gender in nouns in Hebrew and the complexity in representation of passive voice.

Similar language-specific problems occur across languages, especially if we want to build lexical resources that are faithful to the complexities of linguistic phenomena.

6.6 Chapter summary

We present approaches to create WordNet synsets for natural languages using available WordNets, a public MT and a single bilingual dictionary. We create WordNet synsets with good accuracy and high coverage for languages with low resources (arb and vie), resource-poor languages (asm) and endangered languages (ajz and dis). We believe that our work has the potential to construct full WordNets for languages which do not have many existing resources.

Acknowledgement

We would like to thank the volunteers evaluating the WordNet synsets we create:

Dubari Borah and Tri Doan.

This chapter is based on the paper “Automatically constructing WordNet synsets”, written in collaboration with Feras A. Tarouti and Jugal Kalita, which appears in the Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 106-111, Baltimore, USA, June 2014.

CHAPTER 7

GENERATING TRANSLATIONS FOR PHRASES USING A

BILINGUAL DICTIONARY AND N-GRAM DATA

7.1 Introduction

During the processes of creating bilingual dictionaries, our algorithms so far do not take into account multiword expressions or phrases. As a result, the numbers of entries in the new dictionaries we create are significantly smaller than those in the input dictionaries in our experiments. For example, from an available Vietnamese-English dictionary consisting of 231,665 entries, we were able to create English-Vietnamese, Vietnamese-Arabic and Vietnamese-Spanish dictionaries containing 156,375, 185,221 and 159,567 entries, respectively.

This chapter tackles the problem of phrase translation from a source language L1 to a target language L2. The common approach translates words in the given phrase to L2 using an L1–L2 dictionary, then restructures translations using grammar rules which have been created by experts or are extracted from corpora. We propose a method for phrase translation using an L1–L2 dictionary and n-gram data in L2, instead of grammar rules, with a case study in translating phrases from Vietnamese to English. We note that the given Vietnamese phrases for translation do not exist in the dictionary. For example, we translate Vietnamese phrases “bộ môn khoa học máy tính”, “thuế thu nhập cá nhân” and

“đợi một chút” to English: “computer science department”, “individual income tax”, and

“wait a little”, respectively. In particular, given a Vietnamese phrase, our algorithms return a list of ranked translations in English.

Currently, the purpose of the phrase translations in our work is to support language learners. For example, assume that from the Vietnamese-English dictionary, a learner has found that the translations for “bộ môn”, “khoa học” and “máy tính” are “department/faculty”, “science” and “calculator/computer”, respectively. Now, he wants to find the translation for “bộ môn khoa học máy tính”, which does not exist in that Vietnamese-English dictionary. We present a method to generate phrase translations based on information in the existing dictionary.

The remainder of this chapter is organized as follows. Section 7.2 describes Vietnamese morphology. Related work is presented in Section 7.3. Section 7.4 and Section 7.5 present approaches to generate translations for phrases and our experiments, respectively.

Future work is discussed in Section 7.6. Section 7.7 concludes the chapter.

7.2 Vietnamese morphology

Vietnamese is an Austroasiatic language [63] and does not have morphology [118], [4, page 10]. The modern Vietnamese alphabet, introduced by Alexandre de Rhodes in the seventeenth century, is an extension of the Latin script with diacritics for tones. In Vietnamese, whitespace separates syllables rather than words. The smallest meaningful part of Vietnamese orthography is a syllable [82]. Vietnamese words are mainly divided into three types [85] and [82].

- Single words are monosyllabic such as “nhà” - house, “lụa” - silk, and “nhặt” - pick

up.

- Compound words are polysyllabic. (i) Coordinate compounds are collocations such that each morpheme has its own meaning (e.g., “mua bán” - buy and sell, “bàn ghế” - table and chair). (ii) Subordinate compounds are collocations such that each morpheme supports the others (e.g., “đồng ruộng” - rice field, “mè đen” - black sesame),

or one morpheme does not have any meaning (e.g., “cây cối” - trees, “đường xá” -

street). (iii) Isolated compounds are collocations whose morphemes cannot be separated (e.g., “mẫu giáo” - kindergarten, “hành chánh” - administration, “thổ cẩm”

- brocade).

- Reduplicative words are special kinds of compound words such that morphemes are

duplicated (e.g., “vàng vàng” - yellowish, “ngại ngại” - hesitate), or parts of morphemes

are duplicated one or many times (e.g., “gật gà gật gù” - nod repeatedly out of

satisfaction, “lải nhải” - annoyingly insistent).

As a result, segmentation of the given Vietnamese phrase is required before applying any method to generate its translations.

7.3 Related work

The two methods commonly used for phrase translation are dictionary-based and corpus-based. Dictionary-based approaches [1] and [35] generate translation candidates by translating the given phrase to the target language using a bilingual dictionary. The candidates are restructured using grammar rules which are developed manually or learned from a corpus. In corpus-based approaches, a statistical method is used to identify bilingual phrases from a comparable or parallel corpus [49], [93], [54], and [15]. Researchers may also extract phrases from a given monolingual corpus in the source language and translate them to the target language using a bilingual dictionary [20], and [117]. Finally, a variety of methods are used to rank translation candidates. These include counting the frequency of candidates in a monolingual corpus in the target language [93], standard statistical calculations [93], or even using Naïve Bayes classifiers and TF-IDF vectors with the EM algorithm [20]. Mariño et al. [68] extract translations from a bilingual corpus using an n-gram model augmented by additional information such as a target-language model, a word-bonus model, and two lexicon models.

7.4 Proposed approach

This section introduces a new, simple and effective approach to translate from Vietnamese to English using a bilingual dictionary and n-gram data. An entry in n-gram data is a 2-tuple < wE, frq >, where wE is a sequence of n words in English and frq is the frequency of wE. An entry in a bilingual dictionary is also a 2-tuple < ws, wt >, where ws and

wt are a word or a phrase in the source language and its translation in the target language.

If the word ws has many translations in the target language, there are several entries such

as < ws, wt1 >, < ws, wt2 > and < ws, wt3 >. We note that an existing bilingual dictionary

may contain phrases and their translations. Our work finds translations for phrases which

do not exist in the dictionary. The general idea of our approach is that we translate each

word in the given phrase to English using a Vietnamese-English dictionary, then use n-gram

data to restructure translations. Our work is divided into five steps: segmenting Vietnamese

words, filtering segmentations, generating ad hoc translations, selecting the best ad hoc

translation, and finding and ranking English translation candidates.
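To make these resources concrete, the sketch below shows one way the n-gram entries < wE, frq > and the dictionary entries < ws, wt > could be loaded into memory. The tab-separated file format, file paths and function names are assumptions for illustration, not a description of the actual data files.

from collections import defaultdict

def load_ngrams(path):
    # one entry per line: an English word sequence and its frequency,
    # e.g. "science department\t112" (format assumed for illustration)
    ngrams = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            phrase, freq = line.rstrip("\n").split("\t")
            ngrams[phrase] = int(freq)
    return ngrams

def load_dictionary(path):
    # one entry per line: a Vietnamese word or phrase and one English
    # translation; a word with several translations appears on several lines
    dictionary = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ws, wt = line.rstrip("\n").split("\t")
            dictionary[ws].append(wt)
    return dictionary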

7.4.1 Segmenting Vietnamese words

A Vietnamese phrase P, consisting of a sequence of n syllables < s1 s2 ... sn >, can be

segmented in different ways, each of which will produce a segmentation S. A segmentation

S is defined as an ordered sequence of words wi separated by semicolons “;”:

S = < w1; w2; w3; ...; wi; ...; wm >, where m is the number of words in S, m ≤ n and 1 ≤ i ≤ m. We note that a word may contain one or more syllables s. Generally, we have 2^(n-1) possible segmentations for a Vietnamese phrase P. For example, the phrase “khoa khoa học” - science department/faculty, has 4 possible segmentations, only one of which is correct:

S1 = < khoa; khoa; học >,
S2 = < khoa; khoa học >,
S3 = < khoa khoa; học >, and
S4 = < khoa khoa học >.
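A minimal sketch of this enumeration is given below; each segmentation is represented as a list of words, with multi-syllable words joined by spaces (the representation and function name are illustrative).

def segmentations(syllables):
    # enumerate all 2^(n-1) segmentations of a syllable sequence
    result = []
    for i in range(1, len(syllables) + 1):
        first = " ".join(syllables[:i])
        if i == len(syllables):
            result.append([first])
        else:
            for rest in segmentations(syllables[i:]):
                result.append([first] + rest)
    return result

# segmentations(["khoa", "khoa", "học"]) yields the four segmentations
# S1-S4 listed above.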

7.4.2 Filtering segmentations

Each word in each segmentation may have k ≥ 0 translations in English. The total number of English translation candidates for a Vietnamese phrase, with m words per segmentation, is O(2^(n-1) × k^m).

To reduce the number of candidates, we check whether or not a candidate Vietnamese word in each segmentation has an English translation in a Vietnamese-English dictionary. If at least one candidate word does not have a translation in the dictionary, we delete that segmentation. For example, we delete S3 and S4 because they contain the words “khoa khoa” and “khoa khoa học” which do not have translations in the dictionary. As a result, the phrase “khoa khoa học” has 2 remaining segmentations:

S1 = < khoa; khoa; học > and
S2 = < khoa; khoa học >.
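A sketch of this filtering step, assuming the dictionary representation from the loading sketch above (a mapping from a Vietnamese word to its list of English translations):

def filter_segmentations(segs, dictionary):
    # keep only segmentations in which every word has a dictionary translation
    return [s for s in segs if all(w in dictionary for w in s)]

# With a dictionary containing "khoa", "học" and "khoa học" but not
# "khoa khoa" or "khoa khoa học", only S1 and S2 survive.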

7.4.3 Generating ad hoc translations

To generate an ad hoc translation T, we translate each word in a segmentation S to

English using the Vietnamese-English dictionary. The ad hoc translations of a given phrase are the translations of its segmentations, written sequentially. For instance, the translations of the segmentation S1 for “khoa khoa học” are < faculty; faculty; study >, < department; department; study >, < subject of study; subject of study; study >; and the translations for S2 are < faculty; science >, < department; science >, < subject of study; science >. Therefore, the

six ad hoc translations of “khoa khoa học” are T1=“faculty faculty study”, T2=“department

department study”, T3=“subject of study subject of study study”, T4=“faculty science”,

T5=“department science”, and T6= “subject of study science”.
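One straightforward way to realize this step, shown below for illustration only, is to enumerate the combinations of per-word dictionary translations for each surviving segmentation; note that the example above keeps repeated source words consistent, which this simple product does not enforce.

from itertools import product

def adhoc_translations(segmentation, dictionary):
    # translate each word with the bilingual dictionary and write the
    # chosen translations sequentially
    options = [dictionary[w] for w in segmentation]
    return [" ".join(choice) for choice in product(*options)]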

7.4.4 Selecting the best ad hoc translation

We have generated several ad hoc translations by simply translating each word in the

segmentations to English. Most are not grammatically correct. We use a method, presented

in Algorithm 8, to reduce the number of ad hoc translations. We consider words in each

entry in the English n-gram data as a bag of words NB (lines 1-3), i.e., the words in each

entry are simply considered a set of words instead of a sequence. For example, the 3-gram

“computer science department” is considered as the set {computer, science, department}.

Each ad hoc translation T , created in Section 7.4.3, is also considered a bag of words TB

(lines 4-6). For every bag of words TB, we find each bag of words NB′, belonging to the set of all NBs, such that NB′ contains all words in TB (lines 7-9), i.e., TB ⊆ NB′. Each

bag of words TB is given a score scoreTB which is the sum of frequencies of all bags of

words NB′ (line 10). The bag of words TB with the greatest score is considered the best ad hoc translation (lines 12-18).

Algorithm 8 Selecting the best ad hoc translation
Input: all ad hoc translations Ts
Output: the best ad hoc translation bestAdhocTran
1: for all entries N ∈ n-gram data do
2:    generate the bag of words NB
3: end for
4: for all ad hoc translations T do
5:    generate the bag of words TB
6: end for
7: for all TB do
8:    scoreTB = 0
9:    find all NB′ ∈ set of all NBs that contain all words in TB
10:   scoreTB = Σ Frequency(NB′)
11: end for
12: bestAdhocTran = TB0
13: for all TB do
14:   if scoreTB > scorebestAdhocTran then
15:      bestAdhocTran = TB
16:   end if
17: end for
18: return bestAdhocTran

After this step, only one ad hoc translation T will remain. For example, we eliminate

5 ad hoc translations (viz., T1, T2, T3, T4 and T6) of the Vietnamese phrase “khoa khoa

học”, and select “department science” (T5) as the best ad hoc translation of it. We note that the best ad hoc translation may still be grammatically incorrect English.
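A Python rendering of Algorithm 8 is sketched below; it assumes the n-gram data is a mapping from a phrase to its frequency, as in the loading sketch above, and is illustrative rather than the actual implementation.

def best_adhoc_translation(adhoc_trans, ngrams):
    # score each ad hoc translation by summing the frequencies of all
    # n-gram entries whose bag of words contains all of its words
    ngram_bags = [(set(phrase.split()), freq) for phrase, freq in ngrams.items()]
    best, best_score = None, -1
    for t in adhoc_trans:
        tb = set(t.split())
        score = sum(freq for bag, freq in ngram_bags if tb <= bag)
        if score > best_score:
            best, best_score = t, score
    return best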

7.4.5 Finding and ranking translation candidates

To restructure translations, we use n-gram data in English instead of grammar rules.

We take advantage of the fact that the n-gram information implicitly “encodes” the grammar

of a language. Having the best ad hoc translation TB and several corresponding bags NB′ from the previous step, we find and rank the translation candidates. For every NB′, we retrace its corresponding entry in the n-gram data, and mark the words in the entry as a translation candidate cand. Then, we rank the selected translation candidates.

- If there exist one or more cands such that the sizes of cand and TB are equal,

these cands are more likely to be correct translations than other candidates. We

simply rank cands based on their n-gram frequencies. The candidate cand with the

greatest frequency is considered the best translation. For example, the best ad hoc

translation of “khoa khoa học” is “department science”. In the n-gram data, we find

an entry <“science department”, 112> which contains exactly the same words as the best ad hoc translation found. In this entry, the value 112 is the n-gram frequency

of “science department”. We accept “science department” as a correct translation of

“khoa khoa học” and its rank is 112.

- The rest of the candidates are ranked using the following formula:

rank(cand) = Frequency(cand) / (|size(cand) − size(TB)| × 100).

Our motivation for the rank formula is the following. If a candidate has a greater

frequency, it has a greater likelihood to be a correct translation. However, if the

size of the candidate and the size of TB are very different, that candidate may be

inappropriate. Hence, we divide the frequency of cand by the difference in the number

of words between cand and TB. To normalize, we divide results by 100.
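Under the same assumptions as above, the retrieval and ranking step can be sketched as follows; candidates whose size equals that of TB are ranked by their raw n-gram frequency, and the remaining candidates follow the formula above.

def rank_translation_candidates(best_adhoc, ngrams):
    tb = set(best_adhoc.split())
    ranked = {}
    for phrase, freq in ngrams.items():
        words = phrase.split()
        if tb <= set(words):  # the entry contains all words of TB
            diff = abs(len(words) - len(tb))
            ranked[phrase] = freq if diff == 0 else freq / (diff * 100)
    # highest-ranked candidates first
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)

# For "khoa khoa học", the best ad hoc translation "department science"
# retrieves the entry <"science department", 112>, which is accepted as a
# correct translation with rank 112.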

7.5 Experiments

We use the free lists of English n-gram data available at the ngrams.info Website (http://www.ngrams.info/).

The free lists have the one million most frequent entries for each of 2, 3, 4 and 5-grams.

The n-gram data has been obtained from the Corpus of Contemporary American English (http://corpus.byu.edu/coca/).

Currently, we limit our experiments to translation candidates of 5 or fewer syllables. We obtain 200 common Vietnamese phrases, which do not exist in the dictionary, from 4 volunteers who are fluent in both Vietnamese and English. Later, these volunteers are asked to evaluate our translations using a 5-point scale, 5: excellent, 4: good,

3: average, 2: fair, and 1: bad.

The average score of translations created using the baseline approach, which is simply translating words in segments to English, is 2.20/5.00. The average score of translations created using our proposed approach is 4.29/5.00, which is quite high. The rating reliability is 0.72 obtained by calculating the Intraclass Correlation Coefficient [51]. Our approach returns translations for 101 phrases out of the 200 input phrases. This means the precision and recall of our translations are 85.8% and 50.5%, respectively.

The average score of our translations is high; however, the recall is low due to the small number of entries in the bilingual dictionary we use. If our algorithms can return a translation for an input phrase, that translation is usually specific, and is evaluated as excellent or good in most cases. Our approach relies on an existing bilingual dictionary

and n-gram data in English. If we have a dictionary covering the most common words in

Vietnamese, and the n-gram data in English is extensive with different lengths, we believe that our approach will produce even better translations.

Table 7.1 presents some examples of Vietnamese phrases and their translations we create. In this table, the first and the third columns show the input given phrases in

Vietnamese and their translations in English, respectively; the last column reports the human evaluation; the second column presents the translations of each word after segmenting the Vietnamese phrases.

Table 7.1: Some examples of Vietnamese phrases and their translations

We also compute the matching percentage between our translations and translations performed by the Google Translator. The matching percentage of our translations is 42%. The translations marked as “unmatched” are not necessarily incorrect. A few such examples are presented in Table 7.2.

Table 7.2: Some translations we create are correct but do not match with translations by

the Google Translator.

7.6 Future work

Currently, the number of translation candidates is too high. We would like to develop heuristic approaches to eliminate unlikely segmentations, instead of having to look up all possibilities.

There are some words which cannot be translated to the target language. For example, we cannot obtain the English translations for phrases “từ hôn”, which means “a couple breaking off an engagement”, or “nước mắm tỏi ớt”, which means “fish sauce with garlic and chili”. We note that the component words used in these given phrases have translations

Our work relies on the knowledge about the morphology of the source language. For future work, we will study the morphology of other languages such as Arabic, Assamese,

Karbi and Dimasa, then apply our method to translate phrases from these languages to

English. We also want to design an intelligent system such that when users request to translate phrases in their language to English, our system will generate the translations and save the input phrases and their translations as entries in the dictionary. We hope that our system will not only support users but also help enrich the dictionaries.

7.7 Conclusion

We have introduced a new method to translate a given phrase in Vietnamese to

English using a bilingual dictionary and English n-gram data. Our approach generates translations for phrases with fairly high accuracy (85.8%). The strength of our work is that our approach can generate correct translations for phrases in terms of both semantic and syntactic constraints. In addition, our approach may add words such as prepositions or articles to improve translations. We believe that our work can be applied to other language pairs that have a bilingual dictionary and n-gram data in one of the two languages.

Acknowledgement

We would like to thank the volunteers who provided Vietnamese phrases and evaluated the translations we create: Tri Doan, Cuong Nguyen and Hoang Nguyen.

This chapter is based on the paper “Phrase translation using a bilingual dictionary and n-gram data”, written in collaboration with Feras Al Tarouti and Jugal Kalita, that is to appear at the 11th Workshop on Multiword Expressions (MWE) at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Denver, USA, June 2015.

CHAPTER 8

CONCLUSIONS

This chapter concludes the dissertation with a summary of the main contributions of our work. We contribute to the state of the art in constructing new lexical resources for many languages, especially languages that do not have many resources or are endangered.

- We have developed approaches to create reverse bilingual dictionaries and to signif-

icantly increase the number of entries in both reverse dictionaries and the original

dictionaries with almost the same quality.

- We introduce a novel method using aligned WordNets to construct many new bilingual

dictionaries from a single input bilingual dictionary and a machine translator. We note

that all of the resources we use are publicly available and free. In addition, the input

bilingual dictionaries we use in this work are the dictionaries enriched during the

process of creating reverse dictionaries. The new bilingual dictionaries we create are

good in terms of both quality and size. Starting with 7 available bilingual dictionaries,

we create 56 new dictionaries. Of these, 49 pairs of languages are not supported by

the Google and Bing translators.

- We construct WordNet synsets for several languages using their existing limited re-

sources. Our contribution is the reliance on the existence of just one dictionary

between a source language and a resource-rich language. This strict constraint on

the number of input dictionaries can be met by even many endangered languages.

The simplicity of our algorithms along with low-resource requirements are our main

strengths.

- We make initial attempts at translating phrases or word sequences with a case study

from Vietnamese to English using a bilingual dictionary and n-gram data. Currently,

our work can obtain translations of a given phrase which does not exist in a bilingual

dictionary, but its components exist in the dictionary.

- We have collected resources for many languages from disparate sources and constructed

new resources for them. The new resources we create are available on the website of

Language Information and Computation (LINC)1 lab of the University of Colorado,

Colorado Springs.

- We have also set up a small group of people who use these languages as mother tongues

and are willing to construct resources for their communities.

We have presented several approaches, from simple to complex, for creating lexical resources and translating phrases. We have also compared our methods against existing methods to find positive and negative points in different approaches, and the reasons for their drawbacks. In addition, most other researchers work with languages that have multiple available lexical resources, each of which is very expensive to construct. Our studies have the potential not only to create new lexical resources using just a few existing lexical resources, but also to provide support for communities using languages with limited resources.

1 http://cs.uccs.edu/~linclab/index.html

REFERENCES

[1] O. B. Abiola, Adetunmbi A. O., Fasiku A. I., and Olatunji K. A. A Web-based English to Yoruba noun-phrases machine translation system. International Journal of English and Literature, pages 71–78, 2014.

[2] Kisuh Ahn and Matthew Frampton. Automatic generation of translation dictionaries using intermediary languages. In Proceedings of the International Workshop on Cross- Language Knowledge Induction, pages 41–44, Trento, Italy, April, 2006. Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[3] Prissana Akaraputthiporn, Krit Kosawat, and Wirote Aroonmanakun. A bi- directional translation approach for building Thai WordNet. In Proceedings of the International Conference on Asian Language Processing (IALP), pages 97–101, Sin- gapore, December, 2009.

[4] Mark Aronoff and Kirsten Fudeman. What is morphology. John Wiley & Sons, New York City, NY, 2011.

[5] Peter K. Austin and Julia Sallabank. Cambridge Handbook of Endangered Languages. Cambridge Cambridge University Press, Cambridge, United Kingdom, 2013.

[6] Lisa Ballesteros and Bruce Croft. Dictionary methods for cross-lingual information retrieval. In Proceedings of Database and Expert Systems Applications, pages 791–801, Zurich, Switzerland, 1996.

[7] Eduard Barbu and Verginica Barbu Mititelur. Automatic building of WordNets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, 2005.

[8] Romaric Besancon, Gael de Chalendar, Olivier Ferret, Faiza Gara, Meriama Laib, Olivier Mesnard, and Nasredine Semmar. LIMA: A multilingual framework for lin- guistic analysis and linguistic resources development and evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 3697–3704, Malta, Irec, May, 2010.

[9] Pushpak Bhattacharyya. IndoWordNet. In Proceedings of the International Confer- ence on Language Resources and Evaluation (LREC), Malta, Irec, May, 2010.

[10] Orhan Bilgin, Ozlem Cetinoglu, and Kemal Oflazer. Building a WordNet for Turkish. Romanian Journal of Information Science and Technology, 7(1-2):163–172, 2004.

[11] Francis Bond and Ryan Foster. Linking and extending an open multilingual Word- net. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1352–136, Sofia, Bulgaria, 2013.

[12] Francis Bond, Hitoshi Isahara, Kyoko Kanzaki, and Kiyotaka Uchimoto. Boot- strapping a Wordnet using multiple existing WordNets. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), 2008.

[13] Francis Bond and Kentaro Ogura. Combining linguistic resources to create a machine- tractable Japanese-Malay dictionary. Language Resources and Evaluation, 42(2):127– 136, 2008.

[14] Francis Bond, Ruhaida Binti Sulong, Takefumi Yamazaki, and Kentaro Ogura. Design and construction of a machine-tractable Japanese-Malay dictionary. In Proceedings of Machine Translation Summit VIII, European Association for Machine Translation, pages 53–58, Santiago de Compostela, Spain, September, 2001.

[15] Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. Identifying bilingual multi-word expressions for statistical machine translation. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pages 674– 679, Istanbul, Turkey, 2012.

[16] Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. Using WordNet and semantic similarity for bilingual terminology mining from comparable corpora. In Pro- ceedings of the 6th Workshop on Building and Using Comparable Corpora, pages 16–23, Sofia, Bulgaria, August, 2013. Association for Computational Linguistics (ACL).

[17] Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. Adding dense, weighted connections to WordNet. In Proceedings of the 3rd International WordNet Conference, pages 29–36, 2006.

[18] Miles Bronson. A Dictionary in Assamese and English, First Ed. American Baptist Mission, Sibsagor, Assam, India, 1867.

[19] Ralf D. Brown. Automated dictionary extraction for “Knowledge-free” example-based translation. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 111–118, Santa Fe, USA, 1997.

[20] Yunbo Cao and Hang Li. Base noun phrase translation using Web data and the EM algorithm. In Proceedings of the 19th International Conference on Computational Lin- guistics (COLING), pages 1–7, Taipei, Taiwan, 2002. Association for Computational Linguistics (ACL).

[21] Debasri Chakrabarti, Vaijayanthi Sarma, and Pushpak Bhattacharyya. Complex pred- icates in Indian language WordNets. Lexical Resources and Evaluation Journal, 40(3- 4), 2007.

[22] Jason S. Chang, Tracy Lin, and Geeng-Neng You. Building a Chinese WordNet via Class-Based Translation Model. Computational Linguistics and Chinese Language Processing, 8(2):61–76, 2003.

[23] Thatsanee Charoenporn, Virach Sornlertlamvanich, Chumpol Mokarat, and Hitoshi Isahara. Semi-automatic compilation of Asian WordNet. In Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing (ACL), pages 1041–1044, Tokyo, Japan, 2008.

[24] Keh-Jiann Chen and Jia-Ming You. A study on word similarity using context vector models. Computational Linguistics and Chinese Language Processing, 7(2):37–58, 2002.

[25] Rudi L. Cilibrasi and Paul M.B. Vitanyi. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3):370–383, 2007.

[26] Marin Dantchev. WordNet 2.1 overview. ECS 595/SI 661 & 761/LING 541 Natural Language Processing Fall.

[27] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.

[28] Ho Ngoc Duc and Nguyen Thi Thao. Towards building a WordNet for Vietnamese. In Proceedings of the 1st International Workshop for Computer, Information and Communication Technologies, Hanoi, Vietnam, 2003.

[29] Christiane Fellbaum. WordNet. Blackwell Publishing Ltd, United Kingdom, 1999.

[30] Bryan A. Garner and Henry Campbell Black. Black’s law dictionary. St. Paul, MN: Thomson/West, Saint Paul, US, 2004.

[31] Chooi-Ling Goh, Masayuki Asahara, and Yuji Matsumoto. Building a Japanese- Chinese dictionary using Kanji/Hanzi conversion. In Proceedings of the 2nd Interna- tional Joint Conference on Natural Language Processing (IJCNLP), pages 670–681, Jeju Island, Korea, October, 2005.

[32] Tim Gollins and Mark Sanderson. Improving cross language information retrieval with triangulated translation. In Proceeding of the 24th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 90–95, New York, USA, September, 2001.

[33] Gunawan and Andy Saputra. Building synsets for Indonesian WordNet with mono- lingual lexical resources. In Proceedings of the International Conference on Asian Language Processing (IALP), pages 297–300, Harbin, China, December, 2010.

[34] Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. Learning bilin- gual lexicons from monolingual corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), volume 2008, pages 771–779, Ohio, USA, June, 2008.

[35] Le Manh Hai, Asanee Kawtrakul, and Yuen Poovorawan. Phrasal transfer model for Vietnamese-English machine translation. In Proceedings of the conference on Natural Language Processing Pacific Rim Symposium (NLPRS).

[36] Patrick Hanks, Flavia Hodges, and David L. Gold. A dictionary of surnames, vol- ume 92. Oxford University Press,Oxford, United Kingdom, 1988.

[37] William L. Hays and Robert L. Winkler. Statistics: Probability, Inference and Deci- sion. Decision. Holt, Rinehart and Winston. Inc., New York, USA, 1971.

[38] Enikő Héja. Dictionary building based on parallel corpora and word alignment. In Proceedings of the XIV Euralex International Congress, pages 6–10, Leeuwarden/Ljouwert, Netherlands, July, 2010.

[39] Graeme Hirst and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An electronic lexical database, 305(1998):305–332, 1998.

[40] Iftikar Hussain, Navanth Saharia, and Utpal Sharma. Development of Assamese Word- Net. Machine Intelligence: Recent Advances, Narosa Publishing House, Editors. B. Nath, U. Sharma and DK Bhattacharyya, ISBN-978-81-8487-140-1, 2011.

[41] William Peter Hyde. A new Vietnamese–English dictionary. Dunwoody Press, Hy- attsville, Maryland, USA, 2008.

[42] Satoru Ikehara, Masahiro Miyazaki, Akio Yokoo, Satoshi Shirai, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi. Nihongo Goi Taikei - A Japanese lexicon. Iwanami Shoten 5, 1997.

[43] Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama, and Kyoko Kan- zaki. Development of the Japanese WordNet. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pages 2420–2423, Mar- rakech, Morocco, May, 2008.

[44] Ray Jackendoff. The architecture of the linguistic-spatial interface. Language and space, pages 1–30, 1996.

[45] Hiroyuki Kaji, Shin’ichi Tamamura, and Dashtseren Erdenebat. Automatic construc- tion of a Japanese-Chinese dictionary via English. In Proceedings of the 6th Inter- national Conference on Language Resources and Evaluation (LREC), volume 2008, pages 699–706, Marrakech, Morocco, May, 2008.

[46] Hiroyuki Kaji and Mariko Watanabe. Automatic construction of Japanese Word- Net. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, May, 2006.

[47] Adam Kilgarriff. Thesauruses for natural language processing. In Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), pages 5–13, Beijing, China, October, 2003.

[48] Adam Kilgarriff and David Tugwell. WASP-Bench: an MT lexicographers’ worksta- tion supporting state-of-the-art lexical disambiguation. In Proceedings of the 8th Ma- chine Translation Summit, pages 187–190, Santiago de Compostela, Spain, September, 2001.

[49] Kevin Knight and Vasileios Hatzivassiloglou. Two-level, many-paths generation. In Proceedings of the 33rd annual meeting on Association for Computational Linguistic, pages 252–260, Massachusetts, USA, June, 1995.

[50] Kevin Knight and Steve K. Luk. Building a large-scale knowledge base for machine translation. In Proceedings of the 12th National Conference on Artificial Intelligence, pages 773–778, Seattle, Washington, August, 1994.

[51] Gary G. Koch. Intraclass correlation coefficient. Encyclopedia of statistical sciences, 1982.

[52] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed- erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation.

[53] Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual corpora. In Proceedings of the Workshop on Unsupervised Lexical Acquisition, vol- ume 9, pages 9–16, Philadelphia, USA, July, 2002. Association for Computational Linguistics (ACL).

[54] Philipp Koehn and Kevin Knight. Feature-rich statistical translation of noun phrases. In Proceedings of the 41st Annual Meeting on Association for Computational Linguis- tics (ACL), pages 311–318, Sapporo, Japan, 2003.

[55] Grzegorz Kondrak and Bonnie Dorr. Identification of confusable drug names: A new approach and evaluation methodology. In Proceedings of the 20th International Con- ference on Computational Linguistics (COLING), volume 26, pages 952–958, Switzer- land, 2000.

[56] Hideki Kozima and Teiji Furugori. Similarity between words computed by spreading activation on an English dictionary. In Proceedings of the 6th conference on European chapter of the Association for Computational Linguistics, pages 232–239, 1993.

[57] Khang Nhut Lam and Jugal Kalita. Creating reverse bilingual dictionaries. In Proceed- ings of the International Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), pages 524–528, Atlanta, USA, June, 2013.

[58] Sidney Landau. Dictionaries: The Art and Craft of Lexicography. Cambridge University Press, Cambridge, United Kingdom, 2001.

[59] Tuong Le, Trong Hai Duong, Bay Vo, and Sanggil Kang. Consensus for collaborative ontology-based Vietnamese WordNet building. Intelligent Information and Database Systems, pages 499–508, 2013.

[60] Dhanon Leenoi, Thepchai Supnithi, and Wirote Aroonmanakun. Building Thai WordNet with a bi-directional translation method. In Proceedings of the International Conference on Asian Language Processing (IALP), pages 48–52, Singapore, December, 2009.

[61] Dhanon Leenoi, Thepchai Supnithi, and Wirote Arronmanakun. Building a gold standard for Thai WordNet. In Proceeding of The International Conference on Asian Language Processing 2008 (IALP), pages 78–82, Chiang Mai, Thailand, November, 2008.

[62] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the 18th conference on Computational Natural Lan- guage Learning, pages 171–180, Baltimore, Maryland, USA, June, 2014.

[63] Paul M. Lewis, Gary F. Simons, and Charles D. Fennig (eds.). Ethnologue: Languages of the world, 7th edition. Dallas, Texas: SIL International.

[64] William D. Lewis. Measuring conceptual distance using WordNet: the design of a metric for measuring semantic similarity. In The University of Arizona working papers in linguistics, 2002.

[65] Hang Li, Yunbo Cao, and Cong Li. Using bilingual web data to mine and rank translations. IEEE Intelligent Systems.

[66] Krister Linden and Lauri Carlson. FinnWordNet – WordNet på finska via översättning. LexicoNordica, 17:119–140, 2010.

[67] Nikola Ljubešić and Darja Fišer. Bootstrapping bilingual lexicons from comparable corpora for closely related languages. In Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD), pages 91–98, Plzeň, Czech Republic, September, 2011.

[68] José B. Marino, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A.R. Fonollosa, and Marta R. Costa-Jussà. N-gram-based machine translation. Computational Linguistics.

[69] Mausam, Stephen Soderland, Oren Etzioni, Daniel S. Weld, Kobi Reiter, Michael Skinner, Marcus Sammer, and Jeff Bilmes. Panlingual lexical translation via proba- bilistic inference. Artificial Intelligence, 174:619–637, 2010.

[70] Dan Melamed. Empirical methods for MT lexicon constructions. Machine Translation and the Information Soup, Springer-Verlag, 1998.

[71] I. Dan Melamed. Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(10):107–130, 1999.

[72] I. Dan Melamed. Models of translational equivalence among words. Computational Linguistics, 26(2):221–249, 2000.

[73] Merriam-Webster. Merriam-Webster’s dictionary of synonyms. Springfield, US: Springfield, Mass.: Merriam-Webster, 1984.

[74] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[75] G.A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[76] M. Rube Molina. Cognate Linguistics (Cognates Book 1), Kindle Ed. Cognates.org, 2011.

[77] Mortaza Montazery and Heshaam Faili. Automatic Persian WordNet construction. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 846–850, Beijing, China, August, 2010.

[78] Jorge Morato, Miguel Angel Marzal, Juan Lloréns, and José Moreiro. WordNet applications. In Proceedings of the 2nd Global WordNet Conference, pages 270–278, Brno, Czech Republic, January, 2004.

[79] Jane Morris and Graeme Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21–48, 1991.

[80] Preslav Nakov and Hwee Tou Ng. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), vol- ume 3, pages 1358–1367, Singapore, August, 2009. Association for Computational Linguistics (ACL).

[81] Luka Nerima and Eric Wehrli. Generating bilingual dictionaries by transitivity. In Proceedings of the 6th International Conference on Language Resources and Evalua- tion (LREC), pages 2584–2587, Marrakech, Morocco, May, 2008.

[82] Binh N. Ngo. The Vietnamese language learning framework. Journal of Southeast Asian Language Teaching, 10:1–24, 2001.

[83] Q. H. Ngo, W. Winiwarter, and Bartholomaus Wloka. EVBCorpus-a multi-layer English-Vietnamese bilingual corpus for studying tasks in comparative linguistics. In Proceedings in the 6th International Joint Conference on Natural Language Processing (IJCNLP), pages 1–92, Nagoya, Japan, 2013.

[84] Sandro Nielsen. A functional approach to user guides. Dictionaries: Journal of the Dictionary Society of North America 27, 2006(1):1–20, 2006.

[85] Rolf Noyer. Vietnamese’morphology’and the definition of word. In University of Pennsylvania Working Papers in Linguistics, Ed. Alexis Dimitriadis, Hikyoung Lee, Christine Moisset, and Alexander Williams, pages 65–89, Philadelphia: U. Penn De- partment of Linguistics, 1998.

[86] Franz Josef Och and Hermann Ney. Giza++: Training of statistical translation mod- els. 2000.

[87] Kumiko Ohmori and Masanobu Higashida. Extracting bilingual collocations from non-aligned parallel corpora. In Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI99), pages 88–97, 1999.

[88] Antoni Oliver and Salvador Climent. Parallel corpora for wordnet construction: ma- chine translation vs. automatic sense tagging. Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), pages 110–121, 2012.

[89] Noam Ordan and Shuly Wintner. Hebrew WordNet: a test case of aligning lexical databases across languages. International Journal of Translation, 19(1):39–58, 2007.

[90] Pablo G. Otero and Jose R.P. Campos. Automatic generation of bilingual dictionaries using intermediate languages and comparable corpora. In Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), pages 473–483, Iaşi, Romania, March, 2010.

[91] Kyonghee Paik, Francis Bond, and Shirai Satoshi. Using multiple pivots to align Korean and Japanese lexical resources. In Proceedings of the 6th Natural Language

Processing Pacific Rim Symposium (NLPRS), pages 63–70, Tokyo, Japan, November, 2001.

[92] Kyonghee Paik, Satoshi Shirai, and Hiromi Nakaiwa. Automatic construction of a transfer dictionary considering directionality. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 31–38, Geneva, Switzerland, August, 2004. Association for Computational Linguistics (ACL).

[93] Pavel Pecina. A machine learning approach to multiword expression extraction. In Proceedings of the Workshop Towards a Shared Task for Multiword Expressions, pages 54–61, Marrakech, Morocco, 2008. Conference on Language Resources and Evaluation (LREC).

[94] Martin F. Porter. An algorithm for suffix stripping. Program: Electronic library and information systems, 3(40):211–218, 2006.

[95] Kergrit Robkop, Sareewan Thoongsup, Thatsanee Charoenporn, Virach Sornlertlam- vanich, and Hitoshi Isahara. WNMS: connecting the distributed WordNet in the case of Asian Wordnet. In Proceedings of the 5th International Conference of the Global WordNet Association (GWC), Mumbai, India, 2010.

[96] Horacio Rodriguez, Salvador Climent, Piek Vossen, Laura Bloksma, Wim Peters, An- tonietta Alonge, Francesca Bertagna, and Adriana Roventini. The top-down strategy for building EuroWordNet: Vocabulary coverage, base concepts and top ontology. Computers and the Humanities, 32(2-3), 1998.

[97] Peter Mark Roget. Roget’s Thesaurus of English Words and Phrases... TY Crowell Company, USA, 1911.

[98] Peter Mark Roget. Roget’s International Thesaurus. 3/E**. Oxford and IBH Pub- lishing, 2008.

[99] Sheldon M. Ross. Introductory statistics. Academic Press, 2010.

[100] Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for NLPs. In the Computational Linguistics and Intelligent Text Processing, pages 1–15, 2002.

[101] Benoit Sagot and Darja Fiser. Building a free French WordNet from multilingual resources. In In Proceedings of Ontolex, Marrakech, Morocco, 2008.

[102] Antonio Sanfilippo and Ralf Steinberger. Automatic selection and ranking of trans- lation candidates. In Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), volume 97, pages 200–207, Santa Fe, USA, 1997.

[103] Patanakul Sathapornrungkij and Charnyote Pluempitiwiriyawej. Construction of Thai Wordnet lexical database from machine readable dictionaries. In Proceedings of 10th Machine Translation Summit, pages 78–82, Phuket, Thailand, 2005.

[104] Martin Saveski and Igor Trajkovski. Automatic construction of wordnets by using machine translation and language modeling. In Proceedings of the 13th International MultiConference Information Society, volume C, Ljubljana, Slovenia, 2010.

[105] Thomas Schmidt and Kai Wörner. Multilingual corpora and multilingual corpus analysis. John Benjamins Publishing, Amsterdam, Netherlands, 2012.

[106] Li Shao and Hwee Tou Ng. Mining new word translations from comparable corpora. In Proceedings of the 20th International Conference on Computational Linguistics, pages 618–624, Geneva, Switzerland, August, 2004.

[107] Ryan Shaw, Anindya Datta, Debra VanderMeer, and Kaushik Dutta. Building a scalable database-driven reverse dictionary. IEEE Transactions on Knowledge and Data Engineering, 25(3):528–540, 2013.

[108] Satoshi Shirai and Kazuhide Yamamoto. Linking English words in two bilingual dictionaries to generate another language pair dictionary. In Proceedings of the 19th International Conference on Computer Processing of Oriental Languages (ICCPOL), pages 174–179, Seoul, Korea, May, 2001.

[109] Satoshi Shirai, Kazuhide Yamamoto, and Kyonghee Paik. Overlapping constraints of two step selection to generate a transfer dictionary. In Proceedings of the International Conference on Stochastic Programming(ICSP). Berlin, Germany, August, 2001.

[110] Jonas Sjöbergh. Creating a free digital Japanese-Swedish lexicon. In Proceedings of the Conference Pacific Association for Computational Linguistics (PACLING), pages 296–300, Tokyo, Japan, August, 2005.

[111] Dagobert Soergel. Indexing languages and thesauri: construction and maintenance. Melville Pub. Co., New York, USA, 1974.

[112] Sofia Stamou, Kemal Oflazer, Karel Pala, Dimitris Christoudoulakis, Dan Cristea, Dan Tufis, Svetla Koeva, George Totkov, Dominique Dutoit, and Maria Grigoriadou. Balkanet: A multilingual for the Balkan languages. In Proceedings of the International Wordnet Conference, pages 21–25, Mysore, India, 2002.

[113] Thomas Lathrop Stedman. Stedman’s Medical Dictionary, Volume 1, ed. 28. Lippin- cott Williams & Wilkins, Philadelphia, USA, 2006.

[114] Kumiko Tanaka and Hideya Iwasaki. Extraction of lexical translations from non- aligned corpora. In Proceedings of the 16th Conference on ComputationalLlinguistics, volume 2, pages 580–585, Netherlands, 1996.

[115] Kumiko Tanaka and Kyoji Umemura. Construction of a bilingual dictionary interme- diated by a third language. In Proceedings of the 15th Conference on Computational Linguistics (COLING), volume 1, pages 297–303, Kyoto, Japan, August, 1994.

[116] Takaaki Tanaka. Measuring the similarity between compound nouns in different lan- guages using non-parallel corpora. In Proceedings of the 19th international conference on Computational linguistics (COLING), volume 1, pages 1–7, Taipei, Taiwan, Au- gust, 2002.

[117] Takaaki Tanaka and Timothy Baldwin. Translation selection for Japanese-English noun-noun compounds. In Proceedings of Machine Translation Summit IX, pages 378–385, Marrakech, Morocco, 2003.

[118] Laurence C. Thompson. The problem of the word in Vietnamese. Word journal of the International Linguistic Association, 19(1):39–52, 2003.

[119] Istvan Varga and Shoichi Yokoyama. Japanese-Hungarian dictionary generation using ontology resources. In Proceedings of the Machine Translation Summit XI, pages 483– 490, Copenhagen, Denmark, September, 2007.

[120] Istvan Varga and Shoichi Yokoyama. Bilingual dictionary generation for low-resourced language pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 2, pages 862–870, Singapore, August, 2009.

[121] Sornlertlamvanich Virach, Thatsanee Charoenporn, Chumpol Mokarat, Hitoshi Isa- hara, Hammam Riza, and Purev Jaimai. Synset assignment for bi-lingual dictionary with limited resource. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP), 2008.

[122] Piek Vossen. A multilingual database with lexical semantic networks. Kluwer Academic Publishers, Dordrecht, Netherlands, 1998.

[123] Piek Vossen. Building Wordnets. http://www.globalwordnet.org/gwa/BuildingWordnets.ppt, 2005.

[124] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (ACL), pages 133–138, New Mexico, USA, June, 1994.

[125] Federico Zanettin. Bilingual comparable corpora and the training of translators. Meta 43, (4):616–630, 1998.

[126] Yujie Zhang, Qing Ma, and Hitoshi Isahara. Automatic acquisition of a Japanese- Chinese bilingual lexicon using english as an intermediary. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLPKE), pages 471–476, 2003.

[127] Mădălina Zurini. Word sense disambiguation using aggregated similarity based on WordNet graph representation. Informatica Economica.

Appendix A: Reverse dictionaries generated

Table 1: Sample entries in the English-Assamese reverse dictionary

Table 2: Sample entries in the English-Vietnamese reverse dictionary

Table 3: Sample entries in the English-Dimasa reverse dictionary

Table 4: Sample entries in the English-Karbi reverse dictionary

Appendix B: New bilingual dictionaries created

Table 5: Sample entries in the Assamese-Vietnamese and Assamese-Arabic dictionaries

Table 6: Sample entries in the Assamese-German and Assamese-Spanish dictionaries

Table 7: Sample entries in the Arabic-German and Arabic-Spanish dictionaries

Table 8: Sample entries in the Vietnamese-German and Vietnamese-Spanish dictionaries

Appendix C: New WordNets constructed

Table 9: Sample entries in the Assamese WordNet synsets

Table 10: Sample entries in the Arabic WordNet synsets

Table 11: Sample entries in the Vietnamese WordNet synsets