Building Specialized Bilingual Lexicons Using Sense Disambiguation

Dhouha Bouamor
CEA, LIST, Vision and Content Engineering Laboratory,
91191 Gif-sur-Yvette CEDEX, France
[email protected]

Nasredine Semmar
CEA, LIST, Vision and Content Engineering Laboratory,
91191 Gif-sur-Yvette CEDEX, France
[email protected]

Pierre Zweigenbaum
LIMSI-CNRS,
F-91403 Orsay CEDEX, France
[email protected]

Abstract

This paper presents an extension of the standard approach used for bilingual lexicon extraction from comparable corpora. We study the ambiguity problem revealed by the seed bilingual dictionary used to translate context vectors, and augment the standard approach with a Word Sense Disambiguation process. Our aim is to identify the translations of polysemous words that are more likely to give the best representation of words in the target language. On two specialized French-English and Romanian-English comparable corpora, empirical experimental results show that the proposed method consistently outperforms the standard approach.

1 Introduction

Over the years, bilingual lexicon extraction from comparable corpora has attracted a wealth of research works (Fung, 1998; Rapp, 1995; Chiao and Zweigenbaum, 2003). The main work in this research area could be seen as an extension of Harris's distributional hypothesis (Harris, 1954). It is based on the simple observation that a word and its translation are likely to appear in similar contexts across languages (Rapp, 1995). Based on this assumption, the alignment method, known as the standard approach, builds and compares context vectors for each word of the source and target languages.

A particularity of this approach is that, to enable the comparison of context vectors, it requires the existence of a seed bilingual dictionary to translate source context vectors. The use of the bilingual dictionary is problematic when a word has several translations, whether they are synonymous or polysemous. For instance, the French word action can be translated into English as share, stock, lawsuit or deed. In such cases, it is difficult to identify which translations are most relevant in flat resources like bilingual dictionaries, wherein entries are usually unweighted and unordered. The standard approach considers all available translations and gives them the same importance in the resulting translated context vectors, independently of the domain of interest and of word ambiguity. Thus, in the financial domain, translating action into deed or lawsuit would probably introduce noise in context vectors.

In this paper, we present a novel approach which addresses the word ambiguity problem neglected in the standard approach. We introduce the use of a WordNet-based semantic similarity measure permitting the disambiguation of translated context vectors. The basic intuition behind this method is that instead of taking all translations of each seed word to translate a context vector, we only use the translations that are more likely to give the best representation of the context vector in the target language. We test the method on two comparable corpora specialized on the Breast Cancer domain, for the French-English and Romanian-English pairs of languages. This choice allows us to study the behavior of the disambiguation for a pair of languages that are richly represented and for a pair that includes Romanian, a language that has fewer associated resources than French and English.
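To make the standard approach described above concrete, the following Python sketch builds context vectors from a tokenized corpus, translates them with a seed dictionary, and compares them with the cosine measure. This is a minimal sketch under our own assumptions, not the authors' implementation: the function names are ours, and the window of three tokens on each side corresponds to the seven-word window used later in Section 4.2.

    from collections import defaultdict
    import math

    def build_context_vectors(tokens, window=3):
        # Co-occurrence counts within a +/- `window` token span.
        vectors = defaultdict(lambda: defaultdict(float))
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[word][tokens[j]] += 1.0
        return vectors

    def translate_vector(vector, seed_dict):
        # Standard approach: keep every dictionary translation of every
        # context word, all with the same importance.
        translated = defaultdict(float)
        for word, weight in vector.items():
            for trans in seed_dict.get(word, []):
                translated[trans] += weight
        return translated

    def cosine(u, v):
        dot = sum(w * v.get(k, 0.0) for k, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

A source word is then paired with the target words whose context vectors maximize the cosine. Note that translate_vector is precisely where domain-irrelevant translations such as deed for action slip in, which is the problem this paper addresses.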

2 Related Work

Recent improvements of the standard approach are based on the assumption that the more representative the context vectors are, the better the bilingual lexicon extraction will be. Prochasson et al. (2009) used transliterated words and scientific compound words as 'anchor points'. Giving these words higher priority when comparing target vectors improved bilingual lexicon extraction. In addition to this, Rubino and Linarès (2011) combined the contextual representation with a thematic one. The basic intuition of their work is that a term and its translation share thematic similarities. Hazem and Morin (2012) recently proposed a method that filters the entries of the bilingual dictionary based upon POS-tagging and domain relevance criteria, but no improvement was demonstrated.

Gaussier et al. (2004) attempted to solve the problem of different word ambiguities in the source and target languages. They investigated a number of techniques, including canonical correlation analysis and multilingual probabilistic latent semantic analysis. The best results, with a very small improvement, were reported for a mixed method. One important difference with Gaussier et al. (2004) is that they focus on word ambiguities in the source and target languages, whereas we consider that it is sufficient to disambiguate only the translated source context vectors.

3 Context Vector Disambiguation

The approach we propose augments the standard approach used for bilingual lexicon mining from comparable corpora. As mentioned in Section 1, when the lexical extraction applies to a specific domain, not all translations in the bilingual dictionary are relevant for the target context vector representation. For this reason, we introduce a WordNet-based WSD process that aims at improving the adequacy of context vectors and therefore improving the results of the standard approach.

A large number of WSD techniques have been proposed in the literature. The most popular ones are those that compute semantic similarity with the help of existing thesauri such as WordNet (Fellbaum, 1998). This thesaurus has been applied to many tasks relying on word-based similarity, including document (Hwang et al., 2011) and image (Cho et al., 2007; Choi et al., 2012) retrieval systems. In this work, we use this resource to derive a semantic similarity between lexical units within the same context vector. To the best of our knowledge, this is the first application of WordNet to the task of bilingual lexicon extraction from comparable corpora.

Once the context vectors have been translated into the target language, the disambiguation process intervenes. It operates locally on each context vector and aims at finding the most prominent translations of polysemous words. For this purpose, we use monosemic words as a seed set of disambiguated words from which to infer the senses of the polysemous words' translations. We hypothesize that a word is monosemic if it is associated with only one entry in the bilingual dictionary. We checked this assumption by probing the monosemic entries of the bilingual dictionary against WordNet and found that 95% of the entries are monosemic in both resources.

Formally, we derive a semantic similarity value between all the translations provided for each polysemous word by the bilingual dictionary and all the monosemic words appearing within the same context vector. A relatively large number of word-to-word similarity metrics have been proposed in the literature, ranging from path-length measures computed on semantic networks to metrics based on models of distributional similarity learned from large text collections. For simplicity, we use in this work the Wu and Palmer (1994) (WUP) path-length-based semantic similarity measure; Lin (1998) demonstrated that this metric achieves good performance compared with other measures. WUP computes a score (equation 1) denoting how similar two word senses are, based on the depth of the two synsets s1 and s2 in the WordNet taxonomy and that of their Least Common Subsumer (LCS), i.e., the most specific concept that they share as an ancestor:

    WupSim(s1, s2) = 2 × depth(LCS) / (depth(s1) + depth(s2))    (1)
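For readers who want to reproduce equation 1, NLTK's WordNet interface exposes the Wu-Palmer score directly on synset pairs. The sketch below also shows the paper's monosemy test as we understand it (exactly one dictionary entry means monosemic); seed_dict is a hypothetical word-to-translations mapping and the example words are our own.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def is_monosemic(word, seed_dict):
        # The paper's working hypothesis: one entry in the seed
        # bilingual dictionary => the word is monosemic.
        return len(seed_dict.get(word, [])) == 1

    # Equation 1 on a single pair of senses:
    s1 = wn.synsets('cancer')[0]
    s2 = wn.synsets('tumor')[0]
    print(s1.wup_similarity(s2))  # 2*depth(LCS) / (depth(s1)+depth(s2))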
In practice, since a word can belong to more than one synset in WordNet, we determine the semantic similarity between two words w1 and w2 as the maximum WupSim over the synsets that contain them, synsets(w1) and synsets(w2), according to the following equation:

    SemSim(w1, w2) = max{ WupSim(s1, s2) ; (s1, s2) ∈ synsets(w1) × synsets(w2) }    (2)

Then, to identify the most prominent translations of each polysemous unit w_p, an average similarity is computed for each of its translations w_p^j:

    Ave_Sim(w_p^j) = ( Σ_{i=1..N} SemSim(w_i, w_p^j) ) / N    (3)

where N is the total number of monosemic words in the context vector and SemSim(w_i, w_p^j) is the similarity value between w_p^j and the i-th monosemic word. Hence, according to the average relatedness values Ave_Sim(w_p^j), we obtain for each polysemous word w_p an ordered list of translations w_p^1 ... w_p^n. This allows us to select the translations of a word that are more salient than the others to represent it in the target language.
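Equations 2 and 3 map almost directly onto code. The sketch below (our names, not the authors' implementation; it assumes NLTK with WordNet data and a non-empty list of monosemic context words) ranks the dictionary translations of one polysemous word by their average WUP similarity to the monosemic words of the translated context vector.

    from nltk.corpus import wordnet as wn

    def sem_sim(w1, w2):
        # Equation 2: maximum WupSim over all pairs of senses of w1 and w2.
        scores = [s1.wup_similarity(s2) or 0.0
                  for s1 in wn.synsets(w1)
                  for s2 in wn.synsets(w2)]
        return max(scores, default=0.0)

    def rank_translations(translations, monosemic_words):
        # Equation 3: average similarity of each candidate translation
        # w_p^j to the N monosemic words, then sort by Ave_Sim.
        n = len(monosemic_words)  # assumed >= 1
        ave_sim = {t: sum(sem_sim(m, t) for m in monosemic_words) / n
                   for t in translations}
        return sorted(translations, key=ave_sim.get, reverse=True)

    # The introduction's example: 'action' in a financial context.
    print(rank_translations(['share', 'stock', 'lawsuit', 'deed'],
                            ['dividend', 'market', 'shareholder']))

In the WN-Ti experiments of Section 4.3, only the first i elements of this ranked list would be kept when translating the context vector.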

Corpus      French      English
            396,524     524,805
Corpus      Romanian    English
            22,539      322,507

Table 1: Comparable corpora sizes in terms of words.

4 Experiments and Results

4.1 Resources

4.1.1 Comparable corpora

We conducted our experiments on two French-English and Romanian-English comparable corpora specialized on the breast cancer domain. Both corpora were extracted from Wikipedia (http://dumps.wikimedia.org/) and will be shared publicly. We consider the topic in the source language (for instance cancer du sein [breast cancer]) as a query to Wikipedia and extract all its sub-topics (i.e., sub-categories in Wikipedia) to construct a domain-specific category tree. Then, based on the constructed tree, we collect all Wikipedia pages belonging to one of these categories and use inter-language links to build the comparable corpus. Both corpora were normalized through the following linguistic preprocessing steps: tokenisation, part-of-speech tagging, lemmatisation, and function word removal. The resulting corpora sizes are given in Table 1.

4.1.2 Bilingual dictionary

The French-English bilingual dictionary used to translate context vectors is an in-house, manually revised bilingual dictionary which contains about 120,000 entries belonging to the general domain; it is important to note that each word has on average 7 translations in this dictionary. The Romanian-English dictionary consists of translation pairs extracted from Wikipedia; it contains about 136,681 entries, with an average of 1 translation per word.

4.1.3 Evaluation list

In bilingual terminology extraction from comparable corpora, a reference list is required to evaluate the performance of the alignment. Such lists are usually composed of about 100 single terms (Hazem and Morin, 2012; Chiao and Zweigenbaum, 2002). Here, we created a reference list for each pair of languages; these lists will be shared publicly. The French-English list contains 96 terms extracted from the French-English MeSH and UMLS thesauri (http://www.nlm.nih.gov/). The Romanian-English reference list was created by a native speaker and contains 38 pairs of words. Note that reference term pairs appear at least five times in each part of both comparable corpora.

4.2 Experimental setup

Three other parameters need to be set: (1) the window size, (2) the association measure and (3) the similarity measure. To define context vectors, we use a seven-word window, as it approximates syntactic dependencies. For the remaining parameters, we follow Laroche and Langlais (2010), who carried out a complete study of the influence of these parameters on bilingual alignment and showed that the most effective configuration is to combine the Discounted Log-Odds ratio (equation 4) with the cosine similarity. The Discounted Log-Odds ratio is defined as:

    Odds-Ratio_disc = log [ (O11 + 1/2)(O22 + 1/2) / ( (O12 + 1/2)(O21 + 1/2) ) ]    (4)

where the O_ij are the cells of the 2 × 2 contingency matrix of a token s co-occurring with the term S within a given window size.

4.3 Results and discussion

It is difficult to compare results across the studies published on bilingual lexicon extraction from comparable corpora because of differences between (1) the corpora used (in particular their construction constraints and volume), (2) the target domains, and (3) the coverage and relevance of the linguistic resources used for translation. To the best of our knowledge, there is no common benchmark that can serve as a reference. For this reason, in each experiment we use the results of the standard approach (SA) as a reference.
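Equation 4 is a one-liner in code; the sketch below (argument names ours) adds the 1/2 discount to each cell of the 2 × 2 contingency table before taking the log-odds.

    import math

    def odds_ratio_disc(o11, o12, o21, o22):
        # Equation 4: discounted log-odds ratio of the 2x2 contingency
        # table counting co-occurrences of a token s with the term S.
        return math.log(((o11 + 0.5) * (o22 + 0.5)) /
                        ((o12 + 0.5) * (o21 + 0.5)))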

We evaluate the performance of both the SA and our method with the F-Measure at Top20, which computes the harmonic mean between precision and recall over the top 20 ranked translation candidates.

Our method provides a ranked list of translations for each polysemous word. A question that arises here is whether we should introduce only the best-ranked translation in the context vector or consider a larger number of translations, especially when a word's list of translations contains synonyms. For this reason, our experiments take into account different numbers of translations, ranging from the best-ranked translation (noted WN-T1) up to the seventh (WN-T7). This choice is motivated by the fact that words in the French-English list have on average 7 translations in the bilingual dictionary. The baseline (SA) uses all the translations associated with each entry in the bilingual dictionary.
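The paper does not spell out its evaluation script. Under one common reading of Top20 scores in this literature (a reference translation counts as found if it appears among a source term's 20 best-ranked candidates), the metric can be sketched as follows; the names and the exact precision/recall definitions are our assumptions.

    def f_measure_at_top_n(ranked, gold, n=20):
        # ranked: source term -> list of target candidates, best first.
        # gold:   source term -> expected reference translation.
        found = sum(1 for src, ref in gold.items()
                    if ref in ranked.get(src, [])[:n])
        precision = found / len(ranked) if ranked else 0.0
        recall = found / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)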

Table 2: F-Measure at Top20 on the Breast Cancer domain for (a) FR-EN and (b) RO-EN: the Standard Approach (SA) and WN-T1 through WN-T7, each combined with the single similarity measures LEACOCK, LESK, PATH, VECTOR and WUP. In each column, italics marks the best result for a single similarity measure, bold the best result, and underline the best result over the two pairs of languages.

Table 2a displays the results obtained for the French-English comparable corpus. The first observation is that our method, which consists in disambiguating polysemous words within context vectors, consistently outperforms the standard approach. The maximum F-measure was obtained by LESK when up to four translations (WN-Ti, i = 1 ... 4) are considered for each polysemous word, an improvement of +10% over the standard approach.

Concerning the Romanian-English pair of languages, no improvement has been reported. The reason is that the words in the bilingual dictionary used to shape the context vectors are not heavily polysemous: each word is associated with only one translation in the bilingual dictionary.

5 Conclusion

We presented in this paper a novel method that extends the standard approach used for bilingual lexicon extraction from comparable corpora. The proposed method disambiguates polysemous words in context vectors and selects only the translations that are most relevant to the general context of the highly specialized corpus. Experiments conducted on specialized comparable corpora show that integrating such a process leads to better performance than the standard approach. Although our initial experiments are positive, we believe that they could be improved in a number of ways. It would be interesting to mine larger comparable corpora and to focus on their quality, as presented in (Li and Gaussier, 2010). We also want to test our method for bilingual lexicon extraction on a larger panel of specialized corpora, where disambiguation methods are needed to prune translations that are irrelevant to the domain.

References

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), Volume 2, pages 1-5. Association for Computational Linguistics.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2003. The effect of a general lexicon in corpus-based identification of French-English medical word translations. In Proceedings of Medical Informatics Europe, volume 95 of Studies in Health Technology and Informatics, pages 397-402, Amsterdam.

Miyoung Cho, Chang Choi, Hanil Kim, Jungpil Shin, and PanKoo Kim. 2007. Efficient image retrieval using conceptualization of annotated images. In Lecture Notes in Computer Science, pages 426-433. Springer.

Dongjin Choi, Jungin Kim, Hayoung Kim, Myunggwon Hwang, and Pankoo Kim. 2012. A method for enhancing image retrieval based on annotation using modified WUP similarity in WordNet. In Proceedings of the 11th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (AIKED '12), pages 83-87, Stevens Point, Wisconsin, USA. World Scientific and Engineering Academy and Society (WSEAS).

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Pascale Fung. 1998. A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora. In Machine Translation and the Information Soup, pages 1-17. Springer.

Éric Gaussier, Jean-Michel Renders, Irina Matveeva, Cyril Goutte, and Hervé Déjean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In ACL, pages 526-533.

Z. S. Harris. 1954. Distributional structure. Word.
Amir Hazem and Emmanuel Morin. 2012. Adaptive dictionary for bilingual lexicon extraction from comparable corpora. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, May.

Myunggwon Hwang, Chang Choi, and Pankoo Kim. 2011. Automatic enrichment of semantic relation network and its application to word sense disambiguation. IEEE Transactions on Knowledge and Data Engineering, 23:845-858.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In 23rd International Conference on Computational Linguistics (Coling 2010), pages 617-625, Beijing, China, August.

Bo Li and Éric Gaussier. 2010. Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, August.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), pages 296-304, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Emmanuel Prochasson, Emmanuel Morin, and Kyo Kageura. 2009. Anchor points for bilingual lexicon extraction from small comparable corpora. In Proceedings of the 12th Machine Translation Summit (MT Summit XII), pages 284-291, Ottawa, Ontario, Canada.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), pages 320-322. Association for Computational Linguistics.

Raphaël Rubino and Georges Linarès. 2011. A multi-view approach for term translation spotting. In Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, pages 29-40.

Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94), pages 133-138. Association for Computational Linguistics.
