Building Specialized Bilingual Lexicons Using Word Sense Disambiguation
Dhouha Bouamor
CEA, LIST, Vision and Content Engineering Laboratory, 91191 Gif-sur-Yvette CEDEX, France
[email protected]

Nasredine Semmar
CEA, LIST, Vision and Content Engineering Laboratory, 91191 Gif-sur-Yvette CEDEX, France
[email protected]

Pierre Zweigenbaum
LIMSI-CNRS, F-91403 Orsay CEDEX, France
[email protected]

International Joint Conference on Natural Language Processing, pages 952-956, Nagoya, Japan, 14-18 October 2013.

Abstract

This paper presents an extension of the standard approach used for bilingual lexicon extraction from comparable corpora. We study the ambiguity problem revealed by the seed bilingual dictionary used to translate context vectors, and augment the standard approach with a Word Sense Disambiguation process. Our aim is to identify the translations of words that are most likely to give the best representation of words in the target language. On two specialized French-English and Romanian-English comparable corpora, empirical results show that the proposed method consistently outperforms the standard approach.

1 Introduction

Over the years, bilingual lexicon extraction from comparable corpora has attracted a wealth of research (Fung, 1998; Rapp, 1995; Chiao and Zweigenbaum, 2003). The main work in this area can be seen as an extension of Harris's distributional hypothesis (Harris, 1954). It is based on the simple observation that a word and its translation are likely to appear in similar contexts across languages (Rapp, 1995). Based on this assumption, the alignment method known as the standard approach builds and compares context vectors for each word of the source and target languages.

A particularity of this approach is that, to enable the comparison of context vectors, it requires a seed bilingual dictionary to translate source context vectors. The use of the bilingual dictionary is problematic when a word has several translations, whether they are synonymous or polysemous. For instance, the French word action can be translated into English as share, stock, lawsuit or deed. In such cases, it is difficult to identify, in flat resources like bilingual dictionaries, wherein entries are usually unweighted and unordered, which translations are most relevant. The standard approach considers all available translations and gives them the same importance in the resulting translated context vectors, independently of the domain of interest and of word ambiguity. Thus, in the financial domain, translating action into deed or lawsuit would probably introduce noise into context vectors.

In this paper, we present a novel approach which addresses the word ambiguity problem neglected by the standard approach. We introduce the use of a WordNet-based semantic similarity measure to disambiguate translated context vectors. The basic intuition behind this method is that instead of taking all translations of each seed word to translate a context vector, we only use the translations that are most likely to give the best representation of the context vector in the target language. We test the method on two comparable corpora specialized in the Breast Cancer domain, for the French-English and Romanian-English language pairs. This choice allows us to study the behavior of the disambiguation both for a pair of richly resourced languages and for a pair that includes Romanian, a language with fewer associated resources than French and English.

2 Related Work

Recent improvements of the standard approach are based on the assumption that the more representative the context vectors are, the better the bilingual lexicon extraction is. Prochasson et al. (2009)
used transliterated words and scientific compound words as "anchor points". Giving these words higher priority when comparing target vectors improved bilingual lexicon extraction. In addition to transliteration, Rubino and Linarès (2011) combined the contextual representation with a thematic one. The basic intuition of their work is that a term and its translation share thematic similarities. Hazem and Morin (2012) recently proposed a method that filters the entries of the bilingual dictionary based on POS-tagging and domain-relevance criteria, but no improvement was demonstrated.

Gaussier et al. (2004) attempted to solve the problem of different word ambiguities in the source and target languages. They investigated a number of techniques, including canonical correlation analysis and multilingual probabilistic latent semantic analysis. The best results, with a very small improvement, were reported for a mixed method. One important difference with Gaussier et al. (2004) is that they focus on word ambiguities in both the source and target languages, whereas we consider it sufficient to disambiguate only the translated source context vectors.

3 Context Vector Disambiguation

The approach we propose augments the standard approach used for mining bilingual lexicons from comparable corpora. As mentioned in Section 1, when lexical extraction applies to a specific domain, not all translations in the bilingual dictionary are relevant for the target context vector representation. For this reason, we introduce a WordNet-based WSD process that aims at improving the adequacy of context vectors and thereby the results of the standard approach.

A large number of WSD techniques have been proposed in the literature. The most popular ones compute semantic similarity with the help of existing thesauri such as WordNet (Fellbaum, 1998). This thesaurus has been applied to many tasks relying on word-based similarity, including document (Hwang et al., 2011) and image (Cho et al., 2007; Choi et al., 2012) retrieval systems. In this work, we use this resource to derive a semantic similarity between lexical units within the same context vector. To the best of our knowledge, this is the first application of WordNet to the task of bilingual lexicon extraction from comparable corpora.

Once context vectors are translated into the target language, the disambiguation process intervenes. This process operates locally on each context vector and aims at finding the most prominent translations of polysemous words. For this purpose, we use monosemic words as a seed set of disambiguated words from which to infer the senses of the translations of polysemous words. We hypothesize that a word is monosemic if it is associated with only one entry in the bilingual dictionary. We checked this assumption by probing the monosemic entries of the bilingual dictionary against WordNet and found that 95% of the entries are monosemic in both resources.

Formally, we derive a semantic similarity value between all the translations provided for each polysemous word by the bilingual dictionary and all monosemic words appearing within the same context vector. A relatively large number of word-to-word similarity metrics have been proposed in the literature, ranging from path-length measures computed on semantic networks to metrics based on models of distributional similarity learned from large text collections. For simplicity, we use in this work the Wu and Palmer (1994) path-length-based semantic similarity measure (WUP). Lin (1998) demonstrated that this metric achieves good performance compared with other measures. WUP computes a score (Equation 1) denoting how similar two word senses are, based on the depths of the two synsets s1 and s2 in the WordNet taxonomy and that of their Least Common Subsumer (LCS), i.e., the most specific concept that they share as an ancestor:

    WupSim(s1, s2) = 2 × depth(LCS) / (depth(s1) + depth(s2))    (1)

In practice, since a word can belong to more than one synset in WordNet, we determine the semantic similarity between two words w1 and w2 as the maximum WupSim over all pairs of their synsets:

    SemSim(w1, w2) = max{ WupSim(s1, s2) : (s1, s2) ∈ synsets(w1) × synsets(w2) }    (2)

Then, to identify the most prominent translations of each polysemous unit wp, an average similarity is computed for each translation wp^j of wp:

    Ave_Sim(wp^j) = ( Σ_{i=1}^{N} SemSim(wi, wp^j) ) / N    (3)

where N is the total number of monosemic words and SemSim(wi, wp^j) is the similarity value between wp^j and the i-th monosemic word wi. Hence, according to the average relatedness values Ave_Sim(wp^j), we obtain for each polysemous word wp an ordered list of translations wp^1 ... wp^n. This allows us to select the translations that are more salient than the others to represent the word to be translated.

4 Experiments and Results

4.1 Resources

4.1.1 Comparable corpora

We conducted our experiments on two French-English and Romanian-English comparable corpora specialized in the Breast Cancer domain.

    Corpus             Source     English
    French-English     396,524    524,805
    Romanian-English   22,539     322,507

    Table 1: Comparable corpora sizes in terms of words.

4.1.2 Bilingual dictionary

The resulting bilingual dictionary contains about 136,681 entries for Romanian-English, with an average of 1 translation per word.

4.1.3 Evaluation list

In bilingual terminology extraction from comparable corpora, a reference list is required to evaluate the performance of the alignment. Such lists are usually composed of about 100 single terms (Hazem and Morin, 2012; Chiao and Zweigenbaum, 2002). Here, we created a reference list for each language pair. The French-English list contains 96 terms extracted from the French-English MeSH and UMLS thesauri. The Romanian-English reference list was created by a native speaker and contains 38 pairs of words. Note that reference term pairs appear at least five times in each part of both comparable corpora.

4.2 Experimental setup

Three other parameters need to be set: (1) the window size, (2) the association measure, and (3) the similarity measure.
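To make the disambiguation step concrete, the following is a minimal sketch of Equations 1-3. It substitutes a small hypothetical is-a taxonomy for WordNet, and maps each word to a single toy synset (real WordNet words map to several); the taxonomy, word lists, and function names are illustrative assumptions, not the authors' code or data.

```python
from itertools import product

# Hypothetical is-a taxonomy: child -> parent; the root's parent is None.
PARENT = {
    "entity": None,
    "asset": "entity",
    "act": "entity",
    "share": "asset",
    "stock": "asset",
    "dividend": "asset",
    "lawsuit": "act",
    "deed": "act",
}

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    d = 1
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def ancestors(node):
    """Chain from node up to the root, node first."""
    chain = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        chain.append(node)
    return chain

def lcs(s1, s2):
    """Least Common Subsumer: the deepest ancestor shared by s1 and s2."""
    shared = set(ancestors(s2))
    for n in ancestors(s1):  # ordered from s1 upward, so first hit is deepest
        if n in shared:
            return n

def wup_sim(s1, s2):
    # Equation 1: WupSim(s1, s2) = 2*depth(LCS) / (depth(s1) + depth(s2))
    return 2 * depth(lcs(s1, s2)) / (depth(s1) + depth(s2))

# Toy synset inventory: one "synset" per word (WordNet would give several).
SYNSETS = {w: [w] for w in PARENT}

def sem_sim(w1, w2):
    # Equation 2: maximum WupSim over all synset pairs of the two words.
    return max(wup_sim(s1, s2)
               for s1, s2 in product(SYNSETS[w1], SYNSETS[w2]))

def rank_translations(translations, monosemic_context):
    # Equation 3: average similarity of each candidate translation to the
    # N monosemic words of the translated context vector, sorted descending.
    n = len(monosemic_context)
    avg = {t: sum(sem_sim(m, t) for m in monosemic_context) / n
           for t in translations}
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)

# Candidate translations of French "action" against a financial context
# vector whose only monosemic word is "dividend":
ranking = rank_translations(["share", "stock", "lawsuit", "deed"], ["dividend"])
print(ranking)  # share and stock rank first (0.67), lawsuit and deed last (0.33)
```

In practice, a WordNet API such as NLTK's `Synset.wup_similarity` would replace the toy `wup_sim`, and the ranked list would be cut off to keep only the most salient translations in the translated context vector.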