Creating Reverse Bilingual Dictionaries
Khang Nhut Lam and Jugal Kalita
Department of Computer Science, University of Colorado, Colorado Springs, USA
[email protected] [email protected]
Proceedings of NAACL-HLT 2013, pages 524–528, Atlanta, Georgia, 9–14 June 2013. © 2013 Association for Computational Linguistics

Abstract

Bilingual dictionaries are expensive resources, and not many are available when one of the languages is resource-poor. In this paper, we propose algorithms for creating new reverse bilingual dictionaries from existing bilingual dictionaries in which English is one of the two languages. Our algorithms exploit the similarity between word-concept pairs using the English Wordnet to produce reverse dictionary entries. Since our algorithms rely on available bilingual dictionaries, they are applicable to any bilingual dictionary as long as one of the two languages has a Wordnet-type lexical ontology.

1 Introduction

The Ethnologue organization[1] lists 6,809 distinct languages in the world, most of which are resource-poor. Most existing online bilingual dictionaries are between two resource-rich languages (e.g., English, Spanish, French or German) or between a resource-rich language and a resource-poor language. There are languages for which we are lucky to find a single bilingual dictionary online. For example, the University of Chicago hosts bilingual dictionaries from 29 Southeast Asian languages[2], but many of these languages have only one bilingual dictionary online.

Existing algorithms for creating new bilingual dictionaries use intermediate languages or intermediate dictionaries to find chains of words with the same meaning. For example, (Gollins and Sanderson, 2001) use lexical triangulation to translate in parallel across multiple intermediate languages and fuse the results. They query several existing dictionaries and then merge the results to maximize accuracy. They use four pivot languages, German, Spanish, Dutch and Italian, as intermediate languages. Another existing approach for creating bilingual dictionaries uses probabilistic inference (Mausam et al., 2010). They organize dictionaries in a graph topology and use random walks and probabilistic graph sampling. (Shaw et al., 2011) propose a set of algorithms to create a reverse dictionary in the context of a single language by using converse mapping. In particular, given an English-English dictionary, they attempt to find the original words or terms given a synonymous word or phrase describing the meaning of a word.

The goal of this research is to study the feasibility of creating a reverse dictionary by using only one existing dictionary and the Wordnet lexical ontology. For example, given a Karbi[3]-English dictionary, we will construct an ENG-AJZ dictionary. The remainder of this paper is organized as follows. In Section 2, we discuss the nature of bilingual dictionaries. Section 3 describes the algorithms we propose to create new bilingual dictionaries from existing dictionaries. Results of our experiments are presented in Section 4. Section 5 concludes the paper.

[1] http://www.ethnologue.com/
[2] http://dsal.uchicago.edu/dictionaries/list.html
[3] Karbi is an endangered language spoken by 492,000 people (2007 Ethnologue data) in Northeast India, ISO 639-3 code AJZ. The ISO 639-3 code for English is ENG.

2 Existing Online Bilingual Dictionaries

Powerful online translators developed by Google and Bing provide pairwise translations (including for individual words) for 65 and 40 languages, respectively. Wiktionary, a dictionary created by volunteers, supports over 170 languages. We find a large number of bilingual dictionaries at PanLex[4], including an ENG-Hindi[5] and a Vietnamese[6]-ENG dictionary. The University of Chicago has a number of bilingual dictionaries for South Asian languages. Xobdo[7] has a number of dictionaries, focused on ...

... meaning, and possibly additional information. The meaning associated with it can have several Senses. A Sense is a discrete representation of a single aspect of the meaning of a word. Thus, a dictionary entry is of the form ...
3.2 Direct Reversal with Distance (DRwD)

To increase the number of entries in the output dictionary, we compute the distance between words in the Wordnet hierarchy. For example, the words "hasta-lipi" and "likhavat" in HIN have the meanings "handwriting" and "script", respectively. The distance between "handwriting" and "script" in Wordnet ...

Algorithm 2 DRwD Algorithm
ReverseDictionary := φ
for all LexicalEntry_i ∈ SourceDictionary do
  for all Sense_j ∈ LexicalEntry_i do
    for all LexicalEntry_u ∈ SourceDictionary do
      for all Sense_v ∈ LexicalEntry_u do
        if distance( ...
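The similarity score defined in Algorithm 3 (simValue) is the overlap between the expansion sets of two lexical entries, divided by the size of each set, keeping the smaller of the two ratios. The following Python sketch illustrates only that final computation. The function name `sim_value`, the example word sets, and the use of β as a cut-off on the score are assumptions made for illustration; how the expansion sets are built from Wordnet is not shown.

```python
from typing import Set


def sim_value(expansion_i: Set[str], expansion_j: Set[str]) -> float:
    """Return min(|shared| / n, |shared| / m) for two expansion sets."""
    if not expansion_i or not expansion_j:
        # Assumption: an empty expansion set gives a similarity of 0.0.
        return 0.0
    shared = expansion_i & expansion_j  # simWords in Algorithm 3
    n = len(expansion_i)
    m = len(expansion_j)
    return min(len(shared) / n, len(shared) / m)


# Hypothetical expansion sets (not taken from Wordnet or from the paper).
expansion_a = {"handwriting", "hand", "script", "writing"}
expansion_b = {"handwriting", "hand", "script", "writing", "book"}

BETA = 0.9  # assumed use of beta as a cut-off on the simValue score

score = sim_value(expansion_a, expansion_b)
accepted = score >= BETA
print(score, accepted)
```

In this toy case the four shared words give ratios 4/4 and 4/5, so the score is 0.8 and the pair falls below the assumed cut-off. Taking the minimum rather than the maximum means a small set that is fully contained in a much larger one does not automatically score as highly similar.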
Algorithm 3 simValue(LexicalEntry_i, LexicalEntry_j)
simWords := φ
if LexicalEntry_i.LexicalUnit & LexicalEntry_j.LexicalUnit have a Wordnet lexical ontology then
  for all (LexicalUnit_u ∈ LexicalEntry_i) & (LexicalUnit_v ∈ LexicalEntry_j) do
    Find ExpansionSet of every LexicalEntry based on LexicalUnit
  end for
else
  for all (Sense_u ∈ LexicalEntry_i) & (Sense_v ∈ LexicalEntry_j) do
    Find ExpansionSet of every LexicalEntry based on Sense
  end for
end if
simWords ← ExpansionSet(LexicalEntry_i) ∩ ExpansionSet(LexicalEntry_j)
n ← ExpansionSet(LexicalEntry_i).length
m ← ExpansionSet(LexicalEntry_j).length
simValue ← min{simWords.length / n, simWords.length / m}

4 Experimental results

The goals of our study are to create high-precision reverse dictionaries and to increase the number of lexical entries in the created dictionaries. Evaluations were performed by volunteers who are fluent in both the source and destination languages. To achieve reliable judgment, we use the same set of 100 non-stop-word ENG words, randomly chosen from a list of the most common words[11]. We randomly pick 50 words from each created ReverseDictionary for evaluation. Each volunteer was asked to evaluate using a 5-point scale, 5: excellent, 4: good, 3: average, 2: fair, and 1: bad. The average scores of entries in each ReverseDictionary are presented in Figure 1. The DRwS dictionaries are the best in each case. The percentage of agreement between raters is in all cases around 70%.

[11] http://www.world-english.org/english500.htm

The dictionaries we work with frequently have several meanings for a word. Some of these meanings are unusual, rare or very infrequently used. The DR algorithm creates entries for the rare or unusual meanings by direct reversal. We noticed that our evaluators do not like such entries in the reversed dictionaries and mark them low. This results in lower average scores for the DR algorithm compared to the average scores for the DRwS algorithm. The DRwS algorithm seems to have removed a number of such unusual or rare meanings (and entries similar to the rare meanings, recursively), improving the average score.

Our proposed approaches do not work well for dictionaries containing an abundance of complex phrases. The original dictionaries, except the VIE-ENG dictionary, do not contain many long phrases or complex words. In Vietnamese, most words we find in the dictionary can be considered compound words composed of simpler words put together. However, the component words are separated by spaces. For example, "bái thần giáo" means "idolatry". The component words are "bái" meaning "bow low"; "thần" meaning "deity"; and "giáo" meaning "lance", "spear", "to teach", or "to educate". The presence of a large number of compound words written in this manner causes problems with the ENG-VIE dictionary. If we look closely at Figure 1, all language pairs except ENG-VIE show substantial improvement in score when we compare the DR algorithm with the DRwS algorithm.

Figure 1: Average entry score in ReverseDictionary

The DRwD approach significantly increases the number of entries, but the accuracy of the created dictionaries is much lower. The DRwS approach, using a union of the synset, synonyms, hyponyms, and hypernyms of words with β ≥ 0.9, produces the best reverse dictionaries for each language pair. The DRwS approach increases the number of entries in the created dictionaries compared to the DR algorithm, as shown in Figure 2.

Figure 2: Number of lexical entries in the ReverseDictionaries generated from 100 common words

We also create the entire reverse dictionary for the AJZ-ENG dictionary. The total numbers of entries in the ENG-AJZ dictionaries created by using the DR algorithm and the DRwS algorithm are 4677 and 5941, respectively. We then pick 100 random words from the ENG-AJZ dictionary created by using the DRwS algorithm for evaluation. The average entry score in this created dictionary is 4.07. Some of the reverse bilingual dictionaries can be downloaded at http://cs.uccs.edu/ linclab/creatingBilingualLexicalResource.html.

5 Conclusion

We proposed approaches to create a reverse dictionary from an existing bilingual dictionary using Wordnet. We show that a high-precision reverse dictionary can be created without using any other intermediate dictionaries or languages. Using the Wordnet hierarchy increases the number of entries in the created dictionaries. We perform experiments with several resource-poor languages, including two that are on UNESCO's list of endangered languages.

Acknowledgements

We would like to thank the volunteers who evaluated the dictionaries we created: Morningkeey Phangcho, Dharamsing Teron, Navanath Saharia, Arnab Phonglosa, Abhijit Bendale, and Lalit Prithviraj Jain. We also thank all friends in the Xobdo project who provided us with the ASM-ENG-DIS-AJZ dictionaries.

References

Mausam, S. Soderland, O. Etzioni, D.S. Weld, K. Reiter, M. Skinner, M. Sammer, and J. Bilmes. 2010. Panlingual lexical translation via probabilistic inference, Artificial Intelligence, 174:619–637.

R. Shaw, A. Datta, D. VanderMeer, and K. Datta. 2011. Building a scalable database-driven reverse dictionary, IEEE Transactions on Knowledge and Data Engineering, volume 99.

T. Gollins and M. Sanderson. 2001. Improving cross language information retrieval with triangulated translation, SIGIR '01 Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New York, 90–95.

S.I. Landau. 1984. Dictionaries, Cambridge Univ Press.

G.A. Miller. 1995. Wordnet: a lexical database for English, Communications of the ACM, volume 38(11):39–41.

Z. Wu and P. Palmer. 1994. Verbs semantics and lexical selection, In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Stroudsburg, 133–138.