Creating Reverse Bilingual Dictionaries

Creating Reverse Bilingual Dictionaries Khang Lam Jugal Kalita Department of Computer Science Department of Computer Science University of Colorado at Colorado SpringsUniversity of Colorado at Colorado Springs Colorado Springs, USA Colorado Springs, USA [email protected] [email protected] Abstract fuse the results. They query several existing dictionaries and then merge results to maximize accuracy. Bilingual dictionaries are expensive resources They use four pivot languages, German, Spanish, and not many are available when one of the Dutch and Italian, as intermediate languages. An- languages is resource-poor. In this paper, we other existing approach for creating bilingual dictio- propose algorithms for creation of new reverse bilingual dictionaries from existing bilingual naries is using probabilistic inference (Mausam et dictionaries in which English is one of the two al., 2010). They organize dictionaries in a graph languages. Our algorithms exploit the simi- topology and use random walks and probabilistic larity between word-concept pairs using the graph sampling. (Shaw et al., 2011) propose a set English Wordnet to produce reverse dictionary of algorithms to create a reverse dictionary in the entries. Since our algorithms rely on available context of single language by using converse map- bilingual dictionaries, they are applicable to ping. In particular, given an English-English dictio- any bilingual dictionary as long as one of the two languages has Wordnet type lexical ontol- nary, they attempt to find the original words or terms ogy. given a synonymous word or phrase describing the meaning of a word. The goal of this research is to study the feasibility 1 Introduction of creating a reverse dictionary by using only one ex- The Ethnologue organization1 lists 6,809 distinct isting dictionary and Wordnet lexical ontology. For 3 languages in the world, most of which are resource- example, given a Karbi -English dictionary, we will poor. Most existing online bilingual dictionaries are construct an ENG-AJZ dictionary. The remainder of between two resource-rich languages (e.g., English, this paper is organized as follows. In Section 2, we Spanish, French or German) or between a resource- discuss the nature of bilingual dictionaries. Section rich language and a resource-poor language. There 3 describes the algorithms we propose to create new are languages for which we are lucky to find a single bilingual dictionaries from existing dictionaries. Re- bilingual dictionary online. For example, the Uni- sults of our experiments are presented in Section 4. versity of Chicago hosts bilingual dictionaries from Section 5 concludes the paper. 29 Southeast Asian languages2, but many of these languages have only one bilingual dictionary online. 2 Existing Online Bilingual Dictionaries Existing algorithms for creating new bilingual Powerful online translators developed by Google dictionaries use intermediate languages or interme- and Bing provide pairwise translations (including diate dictionaries to find chains of words with the for individual words) for 65 and 40 languages, re- same meaning. For example, (Gollins and Sander- spectively. Wiktionary, a dictionary created by vol- son, 2001) use lexical triangulation to translate in unteers, supports over 170 languages. We find a parallel across multiple intermediate languages and 3Karbi is an endangered language spoken by 492,000 peo- 1http://www.ethnologue.com/ ple (2007 Ethnologue data) in Northeast India, ISO 639-3 code 2http://dsal.uchicago.edu/dictionaries/list.html AJZ. ISO 693-3 code for English is ENG. large number of bilingual dictionaries at PanLex4 meaning, and possibly additional information. The including an ENG-Hindi5 and a Vietnamese6-ENG meaning associated with it can have several Senses. dictionary. The University of Chicago has a number A Sense is a discrete representation of a single aspect of bilinugal dictionaries for South Asian languages. of the meaning of a word. Thus, a dictionary entry Xobdo7 has a number of dictionaries, focused on is of the form <LexicalUnit, Sense , Sense , >. 1 2 ··· Northeast India. In this section, we propose a series of algorithms, We classify the many freely available dictionaries each one of which automatically creates a reverse into three main kinds. dictionary, or ReverseDictionary, from a dictio- Word to word dictionaries: These are dictionar- nary that translates a word in language L to a word • 1 ies that translate one word in one language to or phrase in language L2. We require that at least one word or a phrase in another language. An one of two these languages has a Wordnet type lexi- example is an ENG-HIN dictionary at Panlex. cal ontology (Miller, 1995). Our algorithms are used Definition dictionaries: One word in one lan- to create reverse dictionaries from them at various • guage has one or more meanings in the second levels of accuracy and sophistication. language. It also may have pronunciation, parts 3.1 Direct Reversal (DR) of speech, synonyms and examples. An example is the VIE-ENG dictionary, also at Panlex. The existing dictionary has alphabetically sorted One language dictionaries: A dictionary of this LexicalUnits in L1 and each of them has one or • kind is found at dictionary.com. more Senses in L2. To create ReverseDictionary, we simply take every pair <LexicalUnit, Sense> We have examined several hundred online dictionar- in SourceDictionary and swap the positions of the ies and found that they occur in many different for- two. mats. Extracting information from these dictionaries is arduous. We have experimented with five existing Algorithm 1 DR Algorithm bilingual dictionaries: VIE-ENG, ENG-HIN, and a ReverseDictionary := φ dictionary supported by Xobdo with 4 languages: for all LexicalEntryi SourceDictionary do 8 9 2 Assamese , ENG, AJZ, and Dimasa . We consider for all Sensej LexicalEntry do 2 i the last one to be a collection of 3 bilingual dictio- Add tuple <Sensej, naries: ASM-ENG, AJZ-ENG, and DIS-ENG. We LexicalEntryi.LexicalUnit> to choose these languages since one of our goals is to ReverseDictionary work with resource-poor languages to enhance the end for quantity and quality of resources available. end for 3 Proposed Solution Approach This is a baseline algorithm so that we can com- A dictionary entry, called LexicalEntry, is 2-tuple pare improvements as we create new algorithms. <LexicalUnit, Definition>.ALexicalUnit is a If in our input dictionary, the sense definitions word or a phrase being defined, also called definien- are mostly single words, and occasionally a sim- dum (Landau, 1984). A list of entries sorted by ple phrase, even such a simple algorithm gives the LexicalUnit is called a lexicon or a dictionary. fairly good results. In case there are long or com- Given a LexicalUnit, the Definition associated with plex phrases in senses, we skip them. The ap- it usually contains its class and pronunciation, its proach is easy to implement, and produces a high- 4http://panlex.org/ accuracy ReverseDictionary. However, the num- 5ISO 693-3 code HIN ber of entries in the created dictionaries are lim- 6ISO 693-3 code VIE ited because this algorithm just swaps the posi- 7http://www.xobdo.org/ 8 tions of LexicalUnit and Sense of each entry in the Assamese is an Indo-European language spoken by about SourceDictionary 30 million people, but it is resource-poor, ISO 693-3 code ASM. and does not have any method 9Dimasa is another endangered language from Northeast In- to find the additional words having the same mean- dia, spoken by about 115,000 people, ISO 693-3 code DIS. ings. 3.2 Direct Reversal with Distance (DRwD) Algorithm 2 DRwD Algorithm ReverseDictionary := φ To increase the number of entries in the output dic- for all LexicalEntry SourceDictionary do tionary, we compute the distance between words i 2 for all Sensej LexicalEntry do in the Wordnet hierarchy. For example, the words 2 i for all LexicalEntry "hasta-lipi" and "likhavat" in HIN have the meanings u 2 SourceDictionary do "handwriting" and "script", respectively. The dis- for all Sensev LexicalEntry do tance between "handwriting" and "script" in Word- 2 u if distance(<LexicalEntry .LexicalUnit, net hierarchy is 0.0, so that "handwriting" and i Sensej> ,<LexicalEntry .LexicalUnit, "script" likely have the same meaning. Thus, each of u Sensev>)6 ↵ then "hasta-lipi" and "likhavat" should have both mean- Add tuple <Sensej, ings "handwriting" and "script". This approach LexicalEntry .LexicalUnit> helps us find additional words having the same u to ReverseDictionary meanings and possibly increase the number of lexi- end if cal entries in the reverse dictionaries. end for To create a ReverseDictionary, for every end for LexicalEntryi in the existing dictionary, end for we find all LexicalEntry ,i = j with dis- j 6 end for tance to LexicalEntryi equal to or smaller than a threshold ↵. As results, we have new 3.3 Direct Reversal with Similarly (DRwS) pairs of entries <LexicalEntry .LexicalUnit, i The DRwD approach computes simply the dis- LexicalEntry .Sense> ; then we swap positions j tance between two senses, but does not look at in the two-tuples, and add them into the Reverse- the meanings of the senses in any depth. The Dictionary. The value of ↵ affects the number of DRwS approach represents a concept in terms of entries and the quality of created dictionaries. The its Wordnet synset10, synonyms, hyponyms and greater the value of ↵, the larger the number of hypernyms. This approach is like the DRwD lexical entries, but the smaller the accuracy of the approach, but instead of computing the distance ReverseDictionary. between lexical entries in each pair, we calcu- The distance between the two LexicalEntrys is the late the similarity, called simValue. If the sim- distance between the two LexicalUnits if the Lexi- Value of a <LexicalEntry ,LexicalEntry >, i = i j 6 calUnits occur in Wordnet ontology; otherwise, it is j pair is equal or larger than ↵, we conclude the distance between the two Senses.

Load more