<<

Creating Reverse Bilingual

Khang Lam Jugal Kalita Department of Computer Science Department of Computer Science University of Colorado at Colorado SpringsUniversity of Colorado at Colorado Springs Colorado Springs, USA Colorado Springs, USA [email protected] [email protected]

Abstract fuse the results. They query several existing dictio- naries and then merge results to maximize accuracy. Bilingual dictionaries are expensive resources They use four pivot , German, Spanish, and not many are available when one of the Dutch and Italian, as intermediate languages. An- languages is resource-poor. In this paper, we other existing approach for creating bilingual dictio- propose algorithms for creation of new reverse bilingual dictionaries from existing bilingual naries is using probabilistic inference (Mausam et dictionaries in which English is one of the two al., 2010). They organize dictionaries in a graph languages. Our algorithms exploit the simi- topology and use random walks and probabilistic larity between -concept pairs using the graph sampling. (Shaw et al., 2011) propose a set English Wordnet to produce reverse of algorithms to create a in the entries. Since our algorithms rely on available context of single by using converse map- bilingual dictionaries, they are applicable to ping. In particular, given an English-English dictio- any bilingual dictionary as long as one of the two languages has Wordnet type lexical ontol- nary, they attempt to find the original or terms ogy. given a synonymous word or phrase describing the meaning of a word. The goal of this research is to study the feasibility 1 Introduction of creating a reverse dictionary by using only one ex- The Ethnologue organization1 lists 6,809 distinct isting dictionary and Wordnet lexical ontology. For 3 languages in the world, most of which are resource- example, given a Karbi -English dictionary, we will poor. Most existing online bilingual dictionaries are construct an ENG-AJZ dictionary. The remainder of between two resource-rich languages (e.g., English, this paper is organized as follows. In Section 2, we Spanish, French or German) or between a resource- discuss the nature of bilingual dictionaries. Section rich language and a resource-poor language. There 3 describes the algorithms we propose to create new are languages for which we are lucky to find a single bilingual dictionaries from existing dictionaries. Re- bilingual dictionary online. For example, the Uni- sults of our experiments are presented in Section 4. versity of Chicago hosts bilingual dictionaries from Section 5 concludes the paper. 29 Southeast Asian languages2, but many of these languages have only one bilingual dictionary online. 2 Existing Online Bilingual Dictionaries Existing algorithms for creating new bilingual Powerful online translators developed by Google dictionaries use intermediate languages or interme- and Bing provide pairwise (including diate dictionaries to find chains of words with the for individual words) for 65 and 40 languages, re- same meaning. For example, (Gollins and Sander- spectively. Wiktionary, a dictionary created by vol- son, 2001) use lexical triangulation to translate in unteers, supports over 170 languages. We find a parallel across multiple intermediate languages and 3Karbi is an endangered language spoken by 492,000 peo- 1http://www.ethnologue.com/ ple (2007 Ethnologue data) in Northeast India, ISO 639-3 code 2http://dsal.uchicago.edu/dictionaries/list.html AJZ. ISO 693-3 code for English is ENG. large number of bilingual dictionaries at PanLex4 meaning, and possibly additional information. The including an ENG-Hindi5 and a Vietnamese6-ENG meaning associated with it can have several Senses. dictionary. The University of Chicago has a number A Sense is a discrete representation of a single aspect of bilinugal dictionaries for South Asian languages. of the meaning of a word. Thus, a dictionary entry Xobdo7 has a number of dictionaries, focused on is of the form . 1 2 ··· Northeast India. In this section, we propose a series of algorithms, We classify the many freely available dictionaries each one of which automatically creates a reverse into three main kinds. dictionary, or ReverseDictionary, from a dictio- Word to word dictionaries: These are dictionar- nary that translates a word in language L to a word • 1 ies that translate one word in one language to or phrase in language L2. We require that at least one word or a phrase in another language. An one of two these languages has a Wordnet type lexi- example is an ENG-HIN dictionary at Panlex. cal ontology (Miller, 1995). Our algorithms are used Definition dictionaries: One word in one lan- to create reverse dictionaries from them at various • guage has one or more meanings in the second levels of accuracy and sophistication. language. It also may have pronunciation, parts 3.1 Direct Reversal (DR) of speech, synonyms and examples. An exam- ple is the VIE-ENG dictionary, also at Panlex. The existing dictionary has alphabetically sorted One language dictionaries: A dictionary of this LexicalUnits in L1 and each of them has one or • kind is found at dictionary.com. more Senses in L2. To create ReverseDictionary, we simply take every pair We have examined several hundred online dictionar- in SourceDictionary and swap the positions of the ies and found that they occur in many different for- two. mats. Extracting information from these dictionaries is arduous. We have experimented with five existing Algorithm 1 DR Algorithm bilingual dictionaries: VIE-ENG, ENG-HIN, and a ReverseDictionary := dictionary supported by Xobdo with 4 languages: for all LexicalEntryi SourceDictionary do 8 9 2 Assamese , ENG, AJZ, and Dimasa . We consider for all Sensej LexicalEntry do 2 i the last one to be a collection of 3 bilingual dictio- Add tuple to choose these languages since one of our goals is to ReverseDictionary work with resource-poor languages to enhance the end for quantity and quality of resources available. end for 3 Proposed Solution Approach This is a baseline algorithm so that we can com- A dictionary entry, called LexicalEntry, is 2-tuple pare improvements as we create new algorithms. .ALexicalUnit is a If in our input dictionary, the sense definitions word or a phrase being defined, also called definien- are mostly single words, and occasionally a sim- dum (Landau, 1984). A list of entries sorted by ple phrase, even such a simple algorithm gives the LexicalUnit is called a or a dictionary. fairly good results. In case there are long or com- Given a LexicalUnit, the Definition associated with plex phrases in senses, we skip them. The ap- it usually contains its class and pronunciation, its proach is easy to implement, and produces a high- 4http://panlex.org/ accuracy ReverseDictionary. However, the num- 5ISO 693-3 code HIN ber of entries in the created dictionaries are lim- 6ISO 693-3 code VIE ited because this algorithm just swaps the posi- 7http://www.xobdo.org/ 8 tions of LexicalUnit and Sense of each entry in the Assamese is an Indo-European language spoken by about SourceDictionary 30 million people, but it is resource-poor, ISO 693-3 code ASM. and does not have any method 9Dimasa is another endangered language from Northeast In- to find the additional words having the same mean- dia, spoken by about 115,000 people, ISO 693-3 code DIS. ings. 3.2 Direct Reversal with Distance (DRwD) Algorithm 2 DRwD Algorithm ReverseDictionary := To increase the number of entries in the output dic- for all LexicalEntry SourceDictionary do tionary, we compute the distance between words i 2 for all Sensej LexicalEntry do in the Wordnet hierarchy. For example, the words 2 i for all LexicalEntry "hasta-lipi" and "likhavat" in HIN have the meanings u 2 SourceDictionary do "handwriting" and "script", respectively. The dis- for all Sensev LexicalEntry do tance between "handwriting" and "script" in Word- 2 u if distance( ,)6 ↵ then "hasta-lipi" and "likhavat" should have both mean- Add tuple helps us find additional words having the same u to ReverseDictionary meanings and possibly increase the number of lexi- end if cal entries in the reverse dictionaries. end for To create a ReverseDictionary, for every end for LexicalEntryi in the existing dictionary, end for we find all LexicalEntry ,i = j with dis- j 6 end for tance to LexicalEntryi equal to or smaller than a threshold ↵. As results, we have new 3.3 Direct Reversal with Similarly (DRwS) pairs of entries ; then we swap positions j tance between two senses, but does not look at in the two-tuples, and add them into the Reverse- the meanings of the senses in any depth. The Dictionary. The value of ↵ affects the number of DRwS approach represents a concept in terms of entries and the quality of created dictionaries. The its Wordnet synset10, synonyms, hyponyms and greater the value of ↵, the larger the number of hypernyms. This approach is like the DRwD lexical entries, but the smaller the accuracy of the approach, but instead of computing the distance ReverseDictionary. between lexical entries in each pair, we calcu- The distance between the two LexicalEntrys is the late the similarity, called simValue. If the sim- distance between the two LexicalUnits if the Lexi- Value of a , i = i j 6 calUnits occur in Wordnet ontology; otherwise, it is j pair is equal or larger than ↵, we conclude the distance between the two Senses. The distance that the LexicalEntryi has the same meaning as between each phrase pair is the average of the to- LexicalEntryj. tal distances between every word pair in the phrases To calculate simValue between two phrases, we (Wu and Palmer, 1994). If the distance between two obtain the ExpansionSet for every word in each words or phrases is 1.00, there is no similarity be- phrase from the WordNet database. An Expansion- tween these words or phrases, but if they have the Set of a phrase is a union of synset, and/or synonym, same meaning, the distance is 0.00. and/or hyponym, and/or hypernym of every word in We find that a ReverseDictionary created using it. We compare the similarity between the Expan- the value 0.0 for ↵ has the highest accuracy. This ap- sionSets. The value of ↵ and the kinds of Expan- proach significantly increases the number of entries sionSets are changed to create different ReverseDic- in the ReverseDictionary. However, there is an is- tionarys. Based on experiments, we find that the best sue in this approach. For instance, the word "tuhbi" value of ↵ is 0.9, and the best ExpansionSet is the in DIS means "crowded", "compact", "dense", or union of synset, synonyms, hyponyms, and hyper- "packed". Because the distance between the En- nyms. The algorithm for computing the simValue of glish words "slow" and "dense" in Wordnet is 0.0, entries is shown in Algorithm 3. this algorithm concludes that "slow" has the mean- ing "tuhbi" also, which is wrong. 10Synset is a set of cognitive synonyms. Algorithm 3 simValue(LexicalEntryi, ing a union of synset, synonyms, hyponyms, and LexicalEntryj) hypernyms of words, and ↵ 0.9 produces the simW ords := best revese dictionaries for each language pair. The if LexicalEntryi.LexicalUnit & DRwS approach increases the number of entries in LexicalEntryj.LexicalUnit have a Word- the created dictionaries compared to the DR algo- net lexical ontology then rithm as shown in Figure 2. ReverseDictionary for all (LexicalUnitu LexicalEntryi)& Figure 1: Average entry score in 2 (LexicalUnitv LexicalEntryj) do 2 Find ExpansionSet of every LexicalEntry based on LexicalUnit end for else for all (Senseu LexicalEntryi)& 2 (Sensev LexicalEntryj) do 2 Find ExpansionSet of every LexicalEntry based on Sense end for Figure 2: Number of lexical entries in end if ReverseDictionarys generated from 100 common simWords ExpansionSet (LexicalEntry ) words i \ ExpansionSet(LexicalEntryj) n ExpansionSet(LexicalEntry ).length i m ExpansionSet(LexicalEntry ).length j simValue min{ simW ords.length , simW ords.length } n m

4 Experimental results The goals of our study are to create the high- precision reverse dictionaries, and to increase the numbers of lexical entries in the created dictio- We also create the entire reverse dictionary for the naries. Evaluations were performed by volunteers AJZ-ENG dictionary. The total number of entries in who are fluent in both source and destination lan- the ENG-AJZ dictionaries created by using the DR guages. To achieve reliable judgment, we use the algorithm and DRwS algorithm are 4677 and 5941, same set of 100 non-stop word ENG words, ran- respectively. Then, we pick 100 random words from domly chosen from a list of the most common the ENG-AJZ created by using the DRwS algorithm 11 words . We pick randomly 50 words from each for evaluation. The average score of every entry in created ReverseDictionary for evaluation. Each this created dictionary is 4.07. volunteer was requested to evaluate using a 5-point scale, 5: excellent, 4: good, 3: average, 2: fair, and 5 Conclusion 1: bad. The average scores of entries in the Reverse- We proposed approaches to createa reverse dictio- Dictionarys is presented in Figure 1. The DRwS dic- nary from an existing bilingual dictionary using tionaries are the best in each case. The percentage of Wordnet. We show that a high precision reverse dic- agreements between raters is in all cases is around tionary can be created without using any other inter- 70%. mediate dictionaries or languages. Using the Word- The DRwD approach significantly increases the net hierarchy increases the number of entries in the number of entries, but the accuracy of the created created dictionaries. We performe experiments with dictionaries is much lower. The DRwS approach us- serveral resource-poor languages including two that 11http://www.world-english.org/english500.htm are in the UNESCO’s list of endangered languages. References Mausam, S. Soderlan, O. Etzioni, D.S. Weld, K. Reiter, M. Skinner, M. Sammer, and J. Bilmers 2010. Pan- lingual lexical via probabilistic inference, Artificial Intelligence, 174:619–637. R. Shaw, A. Datta, D. VanderMeer, and K. Datta 2011. Building a scalable database - Driven Reverse Dic- tionary, IEEE Transactions on Knowledge and Data Engineering, volume 99. T. Gollins and M. Sanderson. 2001. Improving cross lan- guage information retrieval with triangulated transla- tion, SIGIR ’01 Proceedings of the 24th annual in- ternational ACM SIGIR conference on Research and development in information retrieval, New York, 90– 95. S.I. Landau. 1984. Dictionaries, Cambridge Univ Press. G.A. Miller. 1995. Wordnet: a lexical database for English, Communications of the ACM, vol- ume 38(11):39–41. Z. Wu and P. Palmer. 1994. Verbs semantics and lexical selection, In proceeding of the 32nd annual meeting on Association for computaional linguistics, Strouds- burg, 133–138.