Creating Reverse Bilingual

Khang Nhut Lam Jugal Kalita Department of Computer Science Department of Computer Science University of Colorado University of Colorado Colorado Springs, USA Colorado Springs, USA [email protected] [email protected]

Abstract fuse the results. They query several existing dictio- naries and then merge results to maximize accuracy. Bilingual dictionaries are expensive resources They use four pivot languages, German, Spanish, and not many are available when one of the Dutch and Italian, as intermediate languages. An- languages is resource-poor. In this paper, we other existing approach for creating bilingual dictio- propose algorithms for creation of new reverse bilingual dictionaries from existing bilingual naries is using probabilistic inference (Mausam et dictionaries in which English is one of the two al., 2010). They organize dictionaries in a graph languages. Our algorithms exploit the simi- topology and use random walks and probabilistic larity between -concept pairs using the graph sampling. (Shaw et al., 2011) propose a set English Wordnet to produce reverse of algorithms to create a reverse dictionary in the entries. Since our algorithms rely on available context of single language by using converse map- bilingual dictionaries, they are applicable to ping. In particular, given an English-English dictio- any as long as one of the two languages has Wordnet type lexical ontol- nary, they attempt to find the original or terms ogy. given a synonymous word or phrase describing the meaning of a word. The goal of this research is to study the feasibility 1 Introduction of creating a reverse dictionary by using only one ex- The Ethnologue organization1 lists 6,809 distinct isting dictionary and Wordnet lexical ontology. For 3 languages in the world, most of which are resource- example, given a Karbi -English dictionary, we will poor. Most existing online bilingual dictionaries are construct an ENG-AJZ dictionary. The remainder of between two resource-rich languages (e.g., English, this paper is organized as follows. In Section 2, we Spanish, French or German) or between a resource- discuss the nature of bilingual dictionaries. Section rich language and a resource-poor language. There 3 describes the algorithms we propose to create new are languages for which we are lucky to find a single bilingual dictionaries from existing dictionaries. Re- bilingual dictionary online. For example, the Uni- sults of our experiments are presented in Section 4. versity of Chicago hosts bilingual dictionaries from Section 5 concludes the paper. 29 Southeast Asian languages2, but many of these 2 Existing Online Bilingual Dictionaries languages have only one bilingual dictionary online. Existing algorithms for creating new bilingual Powerful online translators developed by Google dictionaries use intermediate languages or interme- and Bing provide pairwise translations (including diate dictionaries to find chains of words with the for individual words) for 65 and 40 languages, re- same meaning. For example, (Gollins and Sander- spectively. Wiktionary, a dictionary created by vol- son, 2001) use lexical triangulation to translate in unteers, supports over 170 languages. We find a parallel across multiple intermediate languages and 3Karbi is an endangered language spoken by 492,000 peo- 1http://www.ethnologue.com/ ple (2007 Ethnologue data) in Northeast India, ISO 639-3 code 2http://dsal.uchicago.edu/dictionaries/list.html AJZ. ISO 693-3 code for English is ENG.

524

Proceedings of NAACL-HLT 2013, pages 524–528, Atlanta, Georgia, 9–14 June 2013. c 2013 Association for Computational Linguistics large number of bilingual dictionaries at PanLex4 meaning, and possibly additional information. The including an ENG-Hindi5 and a Vietnamese6-ENG meaning associated with it can have several Senses. dictionary. The University of Chicago has a number A Sense is a discrete representation of a single aspect of bilingual dictionaries for South Asian languages. of the meaning of a word. Thus, a dictionary entry 7 Xobdo has a number of dictionaries, focused on is of the form . Northeast India. In this section, we propose a series of algorithms, We classify the many freely available dictionaries each one of which automatically creates a reverse into three main kinds. dictionary, or ReverseDictionary, from a dictio- • Word to word dictionaries: These are dictionar- nary that translates a word in language L1 to a word ies that translate one word in one language to or phrase in language L2. We require that at least one word or a phrase in another language. An one of two these languages has a Wordnet type lexi- example is an ENG-HIN dictionary at Panlex. cal ontology (Miller, 1995). Our algorithms are used • Definition dictionaries: One word in one lan- to create reverse dictionaries from them at various guage has one or more meanings in the second levels of accuracy and sophistication. language. It also may have pronunciation, parts 3.1 Direct Reversal (DR) of speech, synonyms and examples. An exam- ple is the VIE-ENG dictionary, also at Panlex. The existing dictionary has alphabetically sorted • One language dictionaries: A dictionary of this LexicalUnits in L1 and each of them has one or kind is found at dictionary.com. more Senses in L2. To create ReverseDictionary, we simply take every pair We have examined several hundred online dictionar- in SourceDictionary and swap the positions of the ies and found that they occur in many different for- two. mats. Extracting information from these dictionaries is arduous. We have experimented with five existing Algorithm 1 DR Algorithm bilingual dictionaries: VIE-ENG, ENG-HIN, and a ReverseDictionary := φ dictionary supported by Xobdo with 4 languages: for all LexicalEntryi ∈ SourceDictionary do 8 9 Assamese , ENG, AJZ, and Dimasa . We consider for all Sensej ∈ LexicalEntryi do the last one to be a collection of 3 bilingual dictio- Add tuple to choose these languages since one of our goals is to ReverseDictionary work with resource-poor languages to enhance the end for quantity and quality of resources available. end for 3 Proposed Solution Approach This is a baseline algorithm so that we can com- A dictionary entry, called LexicalEntry, is a 2-tuple pare improvements as we create new algorithms. .A LexicalUnit is a If in our input dictionary, the sense definitions word or a phrase being defined, also called definien- are mostly single words, and occasionally a sim- dum (Landau, 1984). A list of entries sorted by ple phrase, even such a simple algorithm gives the LexicalUnit is called a or a dictionary. fairly good results. In case there are long or com- Given a LexicalUnit, the Definition associated with plex phrases in senses, we skip them. The ap- it usually contains its class and pronunciation, its proach is easy to implement, and produces a high- 4http://panlex.org/ accuracy ReverseDictionary. However, the num- 5ISO 693-3 code HIN ber of entries in the created dictionaries are lim- 6ISO 693-3 code VIE ited because this algorithm just swaps the posi- 7http://www.xobdo.org/ 8 tions of LexicalUnit and Sense of each entry in the Assamese is an Indo-European language spoken by about SourceDictionary and does not have any method 30 million people, but it is resource-poor, ISO 693-3 code ASM. 9Dimasa is another endangered language from Northeast In- to find the additional words having the same mean- dia, spoken by about 115,000 people, ISO 693-3 code DIS. ings.

525 3.2 Direct Reversal with Distance (DRwD) Algorithm 2 DRwD Algorithm ReverseDictionary := φ To increase the number of entries in the output dic- for all LexicalEntry ∈ SourceDictionary do tionary, we compute the distance between words i for all Sense ∈ LexicalEntry do in the Wordnet hierarchy. For example, the words j i for all LexicalEntry ∈ "hasta-lipi" and "likhavat" in HIN have the meanings u SourceDictionary do "handwriting" and "script", respectively. The dis- for all Sense ∈ LexicalEntry do tance between "handwriting" and "script" in Word- v u if distance( , ) α then "hasta-lipi" and "likhavat" should have both mean- v 6 Add tuple helps us find additional words having the same u to ReverseDictionary meanings and possibly increase the number of lexi- end if cal entries in the reverse dictionaries. end for To create a ReverseDictionary, for every end for LexicalEntryi in the existing dictionary, end for we find all LexicalEntryj, i 6= j with dis- end for tance to LexicalEntryi equal to or smaller than a threshold α. As results, we have new 3.3 Direct Reversal with Similarly (DRwS) pairs of entries ; then we swap positions j tance between two senses, but does not look at in the two-tuples, and add them into the Reverse- the meanings of the senses in any depth. The Dictionary. The value of α affects the number of DRwS approach represents a concept in terms of entries and the quality of created dictionaries. The its Wordnet synset10, synonyms, hyponyms and greater the value of α, the larger the number of hypernyms. This approach is like the DRwD lexical entries, but the smaller the accuracy of the approach, but instead of computing the distance ReverseDictionary. between lexical entries in each pair, we calcu- The distance between the two LexicalEntrys is the late the similarity, called simValue. If the sim- distance between the two LexicalUnits if the Lexi- Value of a , i 6= calUnits occur in Wordnet ontology; otherwise, it is j pair is equal or larger than β, we conclude the distance between the two Senses. The distance that the LexicalEntryi has the same meaning as between each phrase pair is the average of the to- LexicalEntryj. tal distances between every word pair in the phrases To calculate simValue between two phrases, we (Wu and Palmer, 1994). If the distance between two obtain the ExpansionSet for every word in each words or phrases is 1.00, there is no similarity be- phrase from the WordNet database. An Expansion- tween these words or phrases, but if they have the Set of a phrase is a union of synset, and/or synonym, same meaning, the distance is 0.00. and/or hyponym, and/or hypernym of every word in We find that a ReverseDictionary created using it. We compare the similarity between the Expan- the value 0.0 for α has the highest accuracy. This ap- sionSets. The value of β and the kinds of Expan- proach significantly increases the number of entries sionSets are changed to create different ReverseDic- in the ReverseDictionary. However, there is an is- tionarys. Based on experiments, we find that the best sue in this approach. For instance, the word "tuhbi" value of β is 0.9, and the best ExpansionSet is the in DIS means "crowded", "compact", "dense", or union of synset, synonyms, hyponyms, and hyper- "packed". Because the distance between the En- nyms. The algorithm for computing the simValue of glish words "slow" and "dense" in Wordnet is 0.0, entries is shown in Algorithm 3. this algorithm concludes that "slow" has the mean- ing "tuhbi" also, which is wrong. 10Synset is a set of cognitive synonyms.

526 Algorithm 3 simValue(LexicalEntryi, DR algorithm creates entries for the rare or unusual LexicalEntryj) meanings by direct reversal. We noticed that our simW ords := φ evaluators do not like such entries in the reversed if LexicalEntryi.LexicalUnit & dictionaries and mark them low. This results in LexicalEntryj.LexicalUnit have a Word- lower average scores in the DR algorithm compar- net lexical ontology then ing to averages cores in the DRwS algorithm. The for all (LexicalUnitu ∈ LexicalEntryi)& DRwS algorithm seems to have removed a number (LexicalUnitv ∈ LexicalEntryj) do of such unusual or rare meanings (and entries simi- Find ExpansionSet of every lar to the rare meanings, recursively) improving the LexicalEntry based on LexicalUnit average score end for Our proposed approaches do not work well for else dictionaries containing an abundance of complex for all (Senseu ∈ LexicalEntryi)& phrases. The original dictionaries, except the VIE- (Sensev ∈ LexicalEntryj) do ENG dictionary, do not contain many long phrases Find ExpansionSet of every or complex words. In Vietnamese, most words LexicalEntry based on Sense we find in the dictionary can be considered com- end for pound words composed of simpler words put to- end if gether. However, the component words are sepa- simWords ← ExpansionSet (LexicalEntryi) ∩ rated by space. For example, "bái thần giáo" means ExpansionSet(LexicalEntryj) "idolatry". The component words are "bái" mean- n ←ExpansionSet(LexicalEntryi).length ing "bow low"; "thần" meaning "deity"; and "giáo" m ←ExpansionSet(LexicalEntryj).length meaning "lance", "spear", "to teach", or "to edu- simW ords.length simW ords.length cate". The presence of a large number of compound simValue←min{ n , m } words written in this manner causes problems with the ENG-VIE dictionary. If we look closely at Fig- 4 Experimental results ure 1, all language pairs, except ENG-VIE show substantial improvement in score when we compare The goals of our study are to create the high- the DR algorithm with DRwS algorithm. precision reverse dictionaries, and to increase the numbers of lexical entries in the created dictio- Figure 1: Average entry score in ReverseDictionary naries. Evaluations were performed by volunteers who are fluent in both source and destination lan- guages. To achieve reliable judgment, we use the same set of 100 non-stop word ENG words, ran- domly chosen from a list of the most common words11. We pick randomly 50 words from each created ReverseDictionary for evaluation. Each volunteer was requested to evaluate using a 5-point scale, 5: excellent, 4: good, 3: average, 2: fair, and 1: bad. The average scores of entries in the Reverse- Dictionarys is presented in Figure 1. The DRwS dic- The DRwD approach significantly increases the tionaries are the best in each case. The percentage of number of entries, but the accuracy of the created agreements between raters is in all cases is around dictionaries is much lower. The DRwS approach us- 70%. ing a union of synset, synonyms, hyponyms, and hy- The dictionaries we work with frequently have pernyms of words, and β ≥ 0.9 produces the best re- several meanings for a word. Some of these mean- verse dictionaries for each language pair. The DRwS ings are unusual, rare or very infrequently used. The approach increases the number of entries in the cre- 11http://www.world-english.org/english500.htm ated dictionaries compared to the DR algorithm as

527 shown in Figure 2. References Mausam, S. Soderlan, O. Etzioni, D.S. Weld, K. Reiter, Figure 2: Number of lexical entries in M. Skinner, M. Sammer, and J. Bilmers 2010. Pan- ReverseDictionarys generated from 100 common lingual lexical translation via probabilistic inference, words Artificial Intelligence, 174:619–637. R. Shaw, A. Datta, D. VanderMeer, and K. Datta 2011. Building a scalable database - Driven Reverse Dic- tionary, IEEE Transactions on Knowledge and Data Engineering, volume 99. T. Gollins and M. Sanderson. 2001. Improving cross lan- guage information retrieval with triangulated transla- tion, SIGIR ’01 Proceedings of the 24th annual in- ternational ACM SIGIR conference on Research and development in information retrieval, New York, 90– 95. S.I. Landau. 1984. Dictionaries, Cambridge Univ Press. G.A. Miller. 1995. Wordnet: a lexical database We also create the entire reverse dictionary for for English, Communications of the ACM, vol- the AJZ-ENG dictionary. The total number of en- ume 38(11):39–41. tries in the ENG-AJZ dictionaries created by us- Z. Wu and P. Palmer. 1994. Verbs semantics and lexical ing the DR algorithm and DRwS algorithm are selection, In proceeding of the 32nd annual meeting 4677 and 5941, respectively. Then, we pick 100 on Association for computaional linguistics, Strouds- burg, 133–138. random words from the ENG-AJZ created by us- ing the DRwS algorithm for evaluation. The av- erage score of every entry in this created dictio- nary is 4.07. Some of the reversal bilingual dictio- naries can be downloaded at http://cs.uccs.edu/ lin- clab/creatingBilingualLexicalResource.html.

5 Conclusion

We proposed approaches to create a reverse dic- tionary from an existing bilingual dictionary using Wordnet. We show that a high precision reverse dic- tionary can be created without using any other inter- mediate dictionaries or languages. Using the Word- net hierarchy increases the number of entries in the created dictionaries. We perform experiments with several resource-poor languages including two that are in the UNESCO’s list of endangered languages.

Acknowledgements

We would like to thank the volunteers evaluating the dictionaries we create: Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phon- glosa, Abhijit Bendale, and Lalit Prithviraj Jain. We also thank all friends in the Xobdo project who pro- vided us with the ASM-ENG-DIS-AJZ dictionaries.

528