Building Specialized Bilingual Lexicons Using Large-Scale Background Knowledge

Dhouha Bouamor1, Adrian Popescu1, Nasredine Semmar1, Pierre Zweigenbaum2
1 CEA, LIST, Vision and Content Engineering Laboratory, 91191 Gif-sur-Yvette CEDEX, France; [email protected]
2 LIMSI-CNRS, F-91403 Orsay CEDEX, France; [email protected]

Abstract

Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of specific domain bilingual lexicons that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation lexicons. The approach is tested on four specialized domains and is compared to three state of the art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.

1 Introduction

The plethora of textual information shared on the Web is strongly multilingual and users' information needs often go well beyond their knowledge of foreign languages. In such cases, efficient machine translation and cross-lingual information retrieval systems are needed. Machine translation already has a decades-long history and an array of commercial systems have already been deployed, including Google Translate1 and Systran2. However, due to the intrinsic difficulty of the task, a number of related problems remain open, including: the gap between text semantics and statistically derived translations, the scarcity of resources in a large majority of languages, and the quality of automatically obtained resources and translations. While the first challenge is general and inherent to any automatic approach, the second and the third can be at least partially addressed by an appropriate exploitation of multilingual resources that are increasingly available on the Web.

In this paper we focus on the automatic creation of domain-specific bilingual lexicons. Such resources play a vital role in Natural Language Processing (NLP) applications that involve different languages. At first, research on lexical extraction relied on the use of parallel corpora (Och and Ney, 2003). The scarcity of such corpora, in particular for specialized domains and for language pairs not involving English, pushed researchers to investigate the use of comparable corpora (Fung, 1998; Chiao and Zweigenbaum, 2003). These corpora include texts which are not exact translations of each other but share common features such as domain, genre, sampling period, etc.

The basic intuition that underlies bilingual lexicon creation is the distributional hypothesis (Harris, 1954), which posits that words with similar meanings occur in similar contexts. In a multilingual formulation, this hypothesis states that the translations of a word are likely to appear in similar lexical environments across languages (Rapp, 1995). The standard approach to bilingual lexicon extraction builds

1 http://translate.google.com/
2 http://www.systransoft.com/


Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 479-489, Seattle, Washington, USA, 18-21 October 2013. (c) 2013 Association for Computational Linguistics

on the distributional hypothesis and compares context vectors for each word of the source and target languages. In this approach, the comparison of context vectors is conditioned by the existence of a seed bilingual dictionary. A weakness of the method is that poor results are obtained for language pairs that are not closely related (Ismail and Manandhar, 2010). Another important problem occurs whenever the size of the seed dictionary is small, since many context words are then ignored. Conversely, when dictionaries are detailed, ambiguity becomes an important drawback.

We introduce a bilingual lexicon extraction approach that exploits Wikipedia in an innovative manner in order to tackle some of the problems mentioned above. Important advantages of using Wikipedia are:

• The resource is available in hundreds of languages and it is structured as unambiguous concepts (i.e. articles).

• The languages are explicitly linked through concept translations proposed by Wikipedia contributors.

• It covers a large number of domains and is thus potentially useful in order to mine a wide array of specialized lexicons.

Mirroring the advantages, there are a number of challenges associated with the use of Wikipedia:

• The comparability of concept descriptions in different languages is highly variable.

• The translation graph is partial since, when considering any language pair, only a part of the concepts are available in both languages and explicitly connected.

• Domains are unequally covered in Wikipedia (Halavais and Lackaff, 2008) and efficient domain targeting is needed.

The approach introduced in this paper aims to draw on Wikipedia's advantages while appropriately addressing the associated challenges. Among the techniques devised to mine Wikipedia content, we hypothesize that an adequate adaptation of Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) is fitted to our application context. ESA was already successfully tested in different NLP tasks, such as word relatedness estimation or text classification, and we modify it to mine specialized domains, to characterize these domains and to link them across languages.

The evaluation of the newly introduced approach is realized on four diversified specialized domains (Breast Cancer, Corporate Finance, Wind Energy and Mobile Technology) and for two pairs of languages: French-English and Romanian-English. This choice allows us to study the behavior of different approaches for a pair of languages that are richly represented and for a pair that includes Romanian, a language that has fewer associated resources than French and English. Experimental results show that the newly introduced approach outperforms the three state of the art methods that were implemented for comparison.

2 Related Work

In this section, we first give a review of the standard approach and then introduce methods that build upon it. Finally, we discuss works that rely on Explicit Semantic Analysis to solve other NLP tasks.

2.1 Standard Approach (SA)

Most previous approaches that address bilingual lexicon extraction from comparable corpora are based on the standard approach (Fung, 1998; Chiao and Zweigenbaum, 2002; Laroche and Langlais, 2010). This approach is composed of three main steps:

1. Building context vectors: Vectors are first extracted by identifying the words that appear around the term to be translated Wcand in a window of n words. Generally, association measures such as the mutual information (Morin and Daille, 2006), the log-likelihood (Morin and Prochasson, 2011) or the Discounted Odds-Ratio (Laroche and Langlais, 2010) are employed to shape the context vectors.

2. Translation of context vectors: To enable the comparison of source and target vectors, source vectors are translated into the target language by using a seed bilingual dictionary. Whenever several translations of a context word exist,

all translation variants are taken into account. Words not included in the seed dictionary are simply ignored.

3. Comparison of source and target vectors: Given Wcand, its automatically translated context vector is compared to the context vectors of all possible translations from the target language. Most often, the cosine similarity is used to rank translation candidates but alternative metrics, including the weighted Jaccard index (Prochasson et al., 2009) and the city-block distance (Rapp, 1999), were studied.

2.2 Improvements of the Standard Approach

Most of the improvements of the standard approach are based on the observation that the more representative the context vectors of a candidate word are, the better the bilingual lexicon extraction is. At first, additional linguistic resources, such as specialized dictionaries (Chiao and Zweigenbaum, 2002) or transliterated words (Prochasson et al., 2009), were combined with the seed dictionary to translate context vectors.

The ambiguities that appear in the seed bilingual dictionary were taken into account more recently. (Morin and Prochasson, 2011) modify the standard approach by weighting the different translations according to their frequency in the target corpus. In (Bouamor et al., 2013), we proposed a method that adds to the standard approach a word sense disambiguation process relying on semantic similarity measurement from WordNet. Given a context vector in the source language, the most probable translation of polysemous words is identified and used for building the corresponding vector in the target language. The most probable translation is identified using the monosemic words that appear in the same lexical environment. On specialized French-English comparable corpora, this approach outperforms the one proposed in (Morin and Prochasson, 2011), which is itself better than the standard approach. The main weakness of (Bouamor et al., 2013) is that the approach relies on WordNet and its application depends on the existence of this resource in the target language. Also, the method is highly dependent on the coverage of the seed bilingual dictionary.

2.3 Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) is a method that maps textual documents onto a structured semantic space using classical text indexing schemes such as TF-IDF. Examples of semantic spaces used include Wikipedia or the Open Directory Project but, due to superior performances, Wikipedia is most frequently used. In the original evaluation, ESA outperformed state of the art methods in a word relatedness estimation task.

Subsequently, ESA was successfully exploited in other NLP tasks and in information retrieval. Radinsky et al. (2011) added a temporal dimension to word vectors and showed that this addition improves the results of word relatedness estimation. (Hassan and Mihalcea, 2011) introduced Salient Semantic Analysis (SSA), a development of ESA that relies on the detection of salient concepts prior to mapping words to concepts. SSA and the original ESA implementation were tested on several word relatedness datasets and results were mixed. Improvements were obtained for text classification when comparing SSA with the authors' in-house representation of the method. ESA has weak language dependence and was already deployed in multilingual contexts. (Sorg and Cimiano, 2012) extended ESA to other languages and showed that it is useful in cross-lingual and multilingual retrieval tasks. Their focus was on creating a language independent conceptual space in which documents would be mapped and then retrieved.

Some open ESA topics related to bilingual lexicon creation include: (1) the document representation, which is simply done by summing individual contributions of words, (2) the adaptation of the method to specific domains and (3) the coverage of the underlying resource in different languages.

3 ESA for Bilingual Lexicon Extraction

The main objective of our approach is to devise lexicon translation methods that are easily applicable to a large number of language pairs, while preserving the overall quality of results. A subordinated objective is to exploit large scale background multilingual knowledge, such as the encyclopedic content available in Wikipedia. As we mentioned, ESA

(Gabrilovich and Markovitch, 2007) was exploited in a number of NLP tasks but not in bilingual lexicon extraction.

Figure 1: Overview of the Explicit Semantic Analysis enabled bilingual lexicon extraction.

Figure 1 shows the overall architecture of the lexical extraction process we propose. The process is completed in the following three steps:

1. Given a word to be translated and its context vector in the source language, we derive a ranked list of similar Wikipedia concepts (i.e. articles) using the ESA inverted index.

2. Then, a translation graph is used to retrieve the corresponding concepts in the target language.

3. Candidate translations are found through a statistical processing of concept descriptions from the ESA direct index in the target language.

In this section, we first introduce the elements of the original formulation of ESA necessary in our approach. Then, we detail the three steps that compose the main bilingual lexicon extraction method illustrated in Figure 1. Finally, as a complement to the main method, we introduce a measure of domain word specificity and present a method for extracting generic translation lexicons.

3.1 ESA Word and Concept Representation

Given a semantic space structured using a set of M concepts and including a dictionary of N words, a mapping between words and concepts can be expressed as the following matrix:

    w(W1,C1)  w(W2,C1)  ...  w(WN,C1)
    w(W1,C2)  w(W2,C2)  ...  w(WN,C2)
    ...
    w(W1,CM)  w(W2,CM)  ...  w(WN,CM)

When Wikipedia is exploited, concepts are equated to Wikipedia articles and the texts of the articles are processed in order to obtain the weights that link words and concepts. In (Gabrilovich and Markovitch, 2007), the weights w that link words and concepts were obtained through a classical TF-IDF weighting of Wikipedia articles. A series of tweaks destined to improve the method's performance were used and disclosed later3. For instance, administration articles, lists, and articles that are too short or have too few links are discarded. Higher weight is given to words in the article title and longer articles are favored over shorter ones. We implemented a part of these tweaks and tested our own version of ESA with the Wikipedia version used in the original implementation. The correlation with human judgments of word relatedness was 0.72, against 0.75 reported by (Gabrilovich and Markovitch, 2007). The ESA matrix is sparse since N, the size of the dictionary, is usually in the range of hundreds of thousands and each concept is usually described by hundreds of distinct words. The direct ESA index from Figure 1 is obtained by reading the matrix horizontally while the inverted ESA index is obtained by reading the matrix vertically.

3 https://github.com/faraday/wikiprep-esa/wiki/roadmap
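To make the direct and inverted readings of the matrix concrete, here is a minimal sketch (not the paper's implementation: the Wikipedia-specific filtering tweaks are skipped and the toy articles are invented for illustration):

```python
import math
from collections import Counter, defaultdict

def build_esa_indexes(articles):
    """articles: {concept_title: text}.
    Returns (direct, inverted):
      direct[concept] -> {word: TF-IDF weight}    (matrix read horizontally)
      inverted[word]  -> {concept: TF-IDF weight} (matrix read vertically)
    """
    tokenized = {c: text.lower().split() for c, text in articles.items()}
    n_docs = len(tokenized)
    df = Counter()                      # document frequency of each word
    for words in tokenized.values():
        df.update(set(words))
    direct, inverted = {}, defaultdict(dict)
    for concept, words in tokenized.items():
        tf = Counter(words)
        weights = {}
        for word, freq in tf.items():
            idf = math.log(n_docs / df[word])
            if idf > 0:                 # drop words occurring in every article
                weights[word] = (freq / len(words)) * idf
        direct[concept] = weights
        for word, w in weights.items():
            inverted[word][concept] = w
    return direct, inverted

articles = {
    "Corporate finance": "share stock capital finance market",
    "Wind turbine": "turbine wind rotor blade energy",
    "Stock market": "stock share market trading index",
}
direct, inverted = build_esa_indexes(articles)
# Concepts most related to the word "share", best first:
ranked = sorted(inverted["share"], key=inverted["share"].get, reverse=True)
```

Step 1 of Figure 1 queries the inverted index, while step 3 reads the direct index of the translated concepts.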

Terme         Concepts
action        évaluation d'action, communisme, actionnaire activiste, socialisme, développement durable ...
déficit       crise de la dette dans la zone euro, dette publique, règle d'or budgétaire, déficit, trouble du déficit de l'attention ...
cisaillement  taux de cisaillement, zone de cisaillement, cisaillement, contrainte de cisaillement, viscoanalyseur ...
turbine       ffc turbine potsdam, turbine à gaz, turbine, turbine hydraulique, cogénération ...
cryptage      TEMPEST, chiffrement, liaison 16, Windows Vista, transfert de fichiers ...
protocole     Ad-hoc On-demand Distance Vector, protocole de Kyoto, optimized link state routing protocol, liaison 16, IPv6 ...
biopsie       biopsie, maladie de Horton, cancer du sein, cancer du poumon, imagerie par résonance magnétique ...
palpation     cancer du sein, cellulite, examen clinique, appendicite, ténosynovite ...

Table 1: The five most similar Wikipedia concepts to the French terms action [share], déficit [deficit], cisaillement [shear], turbine [turbine], cryptage [encryption], biopsie [biopsy] and palpation [palpation], given their context vectors.

3.2 Source Language Processing

The objective of the source language processing is to obtain a ranked list of similar Wikipedia concepts for each candidate word (Wcand) in a specialized domain. To do this, a context vector is first built for each Wcand from a specialized monolingual corpus. The association measure between Wcand and its context words is obtained using the Odds-Ratio (defined in equation 5). Wikipedia concepts in the source language Cs that are similar to Wcand and to a part of its context words are extracted and ranked using equation 1:

    Rank(Cs) = 10 * max_i(Odds(Wcand, Wsi)) * w(Wcand, Cs) + sum_{i=1..n} Odds(Wcand, Wsi) * w(Wsi, Cs)    (1)

where max_i(Odds(Wcand, Wsi)) is the highest Odds-Ratio association between Wcand and any of its context words Wsi; the factor 10 was empirically set to give more importance to Wcand over context words; w(Wcand, Cs) is the weight of the association between Wcand and Cs from the ESA matrix; n is the total number of words Wsi in the context vector of Wcand; Odds(Wcand, Wsi) is the association value between Wcand and Wsi; and w(Wsi, Cs) are the weights of the associations between each context word Wsi and Cs from the ESA matrix. The use of contextual information in equation 1 serves to characterize the candidate word in the target domain.

In Table 1, we present the five most similar Wikipedia concepts to the French terms action, déficit, cisaillement, turbine, cryptage, biopsie and palpation, given their context vectors. These terms are part of the four specialized domains we are studying here. From these examples, we note that despite the differences between the specialized domains and word ambiguity (words action and protocole), our method has the advantage of successfully representing each word to be translated by a relevant conceptual space.

3.3 Translation Graph Construction

To bridge the gap between the source and target languages, a concept translation graph that enables the multilingual extension of ESA is used. This concept translation graph is extracted from the explicit translation links available in Wikipedia articles and is exploited in order to connect a word's conceptual space in the source language with the corresponding conceptual space in the target language. Only a part of the articles have translations, and the size of the conceptual space in the target language is usually smaller than the space in the source language. For instance, the French-English translation graph contains 940,215 pairs of concepts while the French and English Wikipedias contain approximately 1.4 million and 4.25 million articles, respectively.

3.4 Target Language Processing

The third step of the approach takes place in the target language. Using the translation graph, we select the 100 most similar concept translations (threshold determined empirically after preliminary experiments) from the target language and use their direct ESA representations in order to retrieve potential translations for the candidate word Wcand from the source language. These candidate translations Wt are ranked using equation 2:

    Rank(Wt) = ( sum_{i=1..n} w(Wt, Cti) / avg(Cti) ) * log(count(Wt, S))    (2)

where w(Wt, Cti) is the weight of the translation candidate Wt for concept Cti from the ESA matrix in the target language; avg(Cti) is the average TF-IDF score of the words that appear in Cti; S is the set of similar concepts Cti in the target language; and count(Wt, S) is the number of different concepts from S in which the candidate translation Wt appears.

The accumulation of weights w(Wt, Cti) follows the way original ESA text representations are calculated (Gabrilovich and Markovitch, 2007), and avg(Cti) is used in order to correct the bias of the TF-IDF scheme towards short articles. log(count(Wt, S)) is used to favor words that are associated with a larger number of concepts. The log weighting was chosen after preliminary experiments with a wide range of functions.

3.5 Domain Specificity

In previous works, ESA was usually exploited in generic tasks that did not require any domain adaptation. Here we process information from specific domains and we need to measure the specificity of words in those domains. The domain extraction is seeded by using the Wikipedia concept (noted Cseed) that best describes the domain in the target language. For instance, in English, the Corporate Finance domain is seeded with https://en.wikipedia.org/wiki/Corporate_finance. We extract the set of 10 words with the highest TF-IDF scores from this article (noted SW) and use them to retrieve a domain ranking of concepts in the target language, Rankdom(Ct), with equation 3:

    Rankdom(Ct) = ( sum_{i=1..n} w(Wti, Ct) * w(Cseed, Wti) ) * count(SW, Ct)    (3)

where n is the size of the seed list of words (i.e. 10 items); w(Wti, Ct) is the weight of the domain word Wti in the concept Ct; w(Cseed, Wti) is the weight of Wti in Cseed, the seed concept of the domain; and count(SW, Ct) is the number of distinct seed words from SW that appear in Ct. The first part of equation 3 sums up the contributions of the different words from SW that appear in Ct, while the second part is meant to further reinforce articles that contain a larger number of domain keywords from SW.

Domain delimitation is performed by retaining articles whose Rankdom(Ct) is at least 1% of the top Rankdom(Ct) score. This threshold was set during preliminary experiments. Given the delimitation obtained with equation 3, we calculate a domain specificity score specifdom(Wt) for each word that occurs in the domain (equation 4). specifdom(Wt) estimates how much of a word's use in an underlying corpus is related to a target domain:

    specifdom(Wt) = DFdom(Wt) / DFgen(Wt)    (4)

where DFdom and DFgen stand for the domain and the generic document frequency of the word Wt. specifdom(Wt) will be used to favor words with greater domain specificity over more general ones when several translations are available in a seed generic translation lexicon. For instance, the French word action is ambiguous and has English translations such as action, stock, share, etc. In a general case, the most frequent translation is action, whereas in a corporate finance context, share or stock are more relevant. The specificity of the three translations, from highest to lowest, is: share, stock and action, and this order is used to rank these potential translations.
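As a minimal sketch, the concept ranking of equation 1 can be implemented as follows (the ESA weights and Odds-Ratio values are toy figures, not derived from a real corpus):

```python
def rank_source_concepts(w_cand, context_odds, esa_weight, concepts):
    """Equation 1: rank source-language concepts Cs for a candidate word.

    context_odds: {context word Wsi: Odds-Ratio association with w_cand}
    esa_weight(word, concept): association weight from the ESA matrix.
    """
    max_odds = max(context_odds.values())
    scores = {}
    for c in concepts:
        # the factor 10 gives more importance to w_cand over its context
        score = 10 * max_odds * esa_weight(w_cand, c)
        score += sum(odds * esa_weight(w, c)
                     for w, odds in context_odds.items())
        scores[c] = score
    return sorted(concepts, key=scores.get, reverse=True)

# Toy example for the French word "action" in a corporate finance corpus:
weights = {("action", "evaluation d'action"): 0.5,
           ("dividende", "evaluation d'action"): 0.2,
           ("action", "communisme"): 0.1,
           ("greve", "communisme"): 0.4}
esa_weight = lambda w, c: weights.get((w, c), 0.0)
ranked = rank_source_concepts("action",
                              {"dividende": 2.0, "greve": 0.5},
                              esa_weight,
                              ["communisme", "evaluation d'action"])
# ranked[0] -> "evaluation d'action"
```

The finance-related context word dividende pulls the finance concept ahead of the politics concept, mirroring the behavior shown in Table 1.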

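The domain specificity score of equation 4, and its use for reranking the translations proposed by a generic dictionary, can be sketched as follows (the document frequencies are invented for illustration):

```python
def specif_dom(word, df_dom, df_gen):
    """Equation 4: specif_dom(w) = DF_dom(w) / DF_gen(w)."""
    return df_dom.get(word, 0) / df_gen[word]

# Hypothetical document frequencies in a generic corpus and in the
# corporate finance domain delimited with equation 3:
df_gen = {"share": 50, "stock": 40, "action": 300}
df_dom = {"share": 24, "stock": 10, "action": 3}

# Rerank the generic-dictionary translations of the French word "action":
candidates = ["action", "stock", "share"]
ranked = sorted(candidates,
                key=lambda w: specif_dom(w, df_dom, df_gen),
                reverse=True)
# ranked -> ["share", "stock", "action"], as in the example above
```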
3.6 Generic Dictionaries

Generic translation dictionaries, already used by existing bilingual lexicon extraction approaches, can also be integrated in the newly proposed approach. The Wikipedia translation graph is transformed into a translation dictionary by removing the disambiguation marks from ambiguous concept titles, as well as lists, categories and other administration pages. Moreover, since the approach does not handle multiword units, we retain only translation pairs that are composed of unigrams in both languages. When they exist, unigram redirections are also added in each language.

The obtained dictionaries are incomplete since: (1) Wikipedia focuses on concepts that are most often nouns, (2) specialized domain terms often do not have an associated Wikipedia entry and (3) the translation graph covers only a fraction of the concepts available in a language. For instance, the resulting translation dictionaries have 193,543 entries for French-English and 136,681 entries for Romanian-English. They can be used in addition to or instead of other available resources and are especially useful when there are only few other resources that link the pair of languages processed.

4 Evaluation

The performances of our approach are evaluated against the standard approach and its developments proposed by (Morin and Prochasson, 2011) and (Bouamor et al., 2013). In this section, we first describe the data and resources we used in our experiments. We then present the different parameters needed in the implementation of the methods tested. Finally, we discuss the obtained results.

4.1 Data and Resources

Comparable corpora

We conducted our experiments on four French-English and Romanian-English specialized comparable corpora: Corporate Finance, Breast Cancer, Wind Energy and Mobile Technology. For the Romanian-English language pair, we used Wikipedia to collect comparable corpora for all domains since they were not already available. The Wikipedia corpora are harvested using a category-based selection. We consider the topic in the source language (for instance Cancer Mamar [Breast Cancer]) as a query to Wikipedia and extract all its sub-topics (i.e., sub-categories) to construct a domain-specific category tree. Then, based on the constructed tree, we collect all Wikipedia articles belonging to at least one of these categories and use inter-language links to build the comparable corpora.

Concerning the French-English pair, we followed the strategy described above to extract the comparable corpora related to the Corporate Finance and Breast Cancer domains since they were otherwise unavailable. For the two other domains, we used the corpora released in the TTC project4. All corpora were normalized through the following linguistic preprocessing steps: tokenization, part-of-speech tagging, lemmatization, and function word removal. The resulting corpora5 sizes are presented in Table 2. The sizes of the domain corpora vary within and across languages, with the Corporate Finance domain being the richest in both languages. In Romanian, Breast Cancer is particularly small, with approximately 22,000 tokens included. This variability will allow us to test if there is a correlation between corpus size and quality of results.

Domain               FR        EN
Corporate Finance    402,486   756,840
Breast Cancer        396,524   524,805
Wind Energy          145,019   345,607
Mobile Technology    197,689   144,168

Domain               RO        EN
Corporate Finance    206,169   524,805
Breast Cancer        22,539    322,507
Wind Energy          121,118   298,165
Mobile Technology    200,670   124,149

Table 2: Number of content words in the comparable corpora.

Bilingual dictionary

The seed generic French-English dictionary used to translate French context vectors consists of an in-house manually built resource which contains approximately 120,000 entries. For Romanian-English, we used the generic dictionary extracted following the procedure described in Subsection 3.6.

Gold standard

In bilingual lexicon extraction from comparable corpora, a reference list is required to evaluate the performance of the alignment. Such lists are usually composed of around 100 single terms (Hazem and Morin, 2012; Chiao and Zweigenbaum, 2002). Reference lists6 were created for the four specialized domains and the two pairs of languages. For French-English, reference words from the Corporate Finance domain were extracted from the glossary of bilingual micro-finance terms7. For Breast Cancer, the list is derived from the MESH and the UMLS thesauri8. Concerning Wind Energy and Mobile Technology, lists were extracted from specialized glossaries found on the Web. The Romanian-English gold standard was manually created by a native speaker starting from the French-English lists. Table 3 displays the sizes of the obtained lists. Reference term pairs were retained if each word composing them appeared at least five times in the comparable domain corpora.

Domain               FR-EN   RO-EN
Corporate Finance    125     69
Breast Cancer        96      38
Wind Energy          89      38
Mobile Technology    142     94

Table 3: Sizes of the evaluation lists.

4.2 Experimental setup

Aside from those already mentioned, three parameters need to be set: (1) the window size that defines contexts, (2) the association measure that quantifies the strength of the association between words and (3) the similarity measure that ranks candidate translations for the state of the art methods. Context vectors are defined using a seven-word window which approximates syntactic dependencies. The association and similarity measures (the Discounted Log-Odds ratio of equation 5 and the cosine similarity) were set following Laroche and Langlais (2010), a comprehensive study of the influence of these parameters on bilingual alignment.

    Odds-Ratio_disc = log( ((O11 + 1/2)(O22 + 1/2)) / ((O12 + 1/2)(O21 + 1/2)) )    (5)

where Oij are the cells of the 2 x 2 contingency matrix of a token s co-occurring with the term S within a given window size.

The F-measure of the Top 20 results (F-Measure@20), the harmonic mean of precision and recall, is used as the evaluation metric. Precision is the total number of correct translations divided by the number of terms for which the system returned at least one answer. Recall is equal to the ratio between the number of correct translations and the total number of words to translate (Wcand).

4.3 Results and discussion

In addition to the basic approach based on ESA (denoted ESA), we evaluate the performances of a method called Dicospec, in which the translations are extracted from a generic dictionary, and a method we call ESAspec, which combines ESA and Dicospec. Dicospec is based on the generic dictionary we presented in Subsection 3.6 and proceeds as follows: we extract a list of translations for each word to be translated from the generic dictionary. The domain specificity introduced in Subsection 3.5 is then used to rank these translations. For instance, the French term port, referring in the Mobile Technology domain to the system that allows computers to receive and transmit information, is translated into port and seaport. According to domain specificity values, the following ranking is obtained: the English term port obtains the highest specificity value (0.48); seaport comes next, with a specificity value of 0.01. In ESAspec, the translations set out in the translation lists proposed by both ESA and the generic dictionary are weighted according to their domain specificity values. The main intuition behind this method is that by adding information about domain specificity, we obtain a new ranking of the bilingual extraction results.

The obtained results are displayed in Table 4.

4 http://www.ttc-project.eu/index.php/releases-publications
5 Comparable corpora will be shared publicly.
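The F-Measure@20 metric described in the experimental setup can be sketched as follows (the candidate and reference lists are hypothetical):

```python
def f_measure_at_k(proposed, gold, k=20):
    """proposed: {term: ranked list of candidate translations}
    gold: {term: set of correct translations}
    A term counts as correct if one of its top-k candidates is in the gold set."""
    answered = [t for t in gold if proposed.get(t)]
    correct = sum(1 for t in answered if set(proposed[t][:k]) & gold[t])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"actionnaire": {"shareholder"},
        "dette": {"debt"},
        "cisaillement": {"shear"}}
proposed = {"actionnaire": ["stakeholder", "shareholder"],
            "dette": ["debt"]}
# precision = 2/2, recall = 2/3, F-Measure@20 = 0.8
```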

The comparison of the state of the art methods shows that BA13 performs better than STAPP and MP11 for French-English and has comparable performances

6 Reference lists will be shared publicly.
7 http://www.microfinance.lu/en/
8 http://www.nlm.nih.gov/

76 44 21 21 21 74 50 61 55 49 idEnergy Wind oani Roma- in domain spec spec spec spec oprt Finance Corporate Finance Corporate idEn- Wind sasim- a is oBA13 to spec spec smore is perfor- Also, . 0 0 0 0 0 0 0 0 0 0 0.24 0.56 xlisagnrcdcinr,cmie ihteueo do- of use the with combined dictionary, generic a exploits spec and ...... 487 17 11 14 13 13 50 20 37 33 17 F-Measure@20 F-Measure@20 n ESA. and fseilzdblnullxcn,oeo h central the of creation one the lexicons, to bilingual approach specialized new of a presented have We Conclusion 5 ap- our change. that language to indicates robust more result is The ESA proach the for methods. than BA13 based for significant more is performance average drop the pass- RO-EN When to FR-EN domain. from by ing each explained of probably difficulty intrinsic is the finding This formances. domains, three Finance clear other Corporate re- the no than of richer is quality Although and There sults. size domain order domains. between in possible correlation sufficient such all being cover that from clear to far is are it However, dictionaries trans- approaches. lexicon lation specialized qual- in This good dictionaries RO-EN. of generic importance ity for the higher for and advocates those finding FR-EN to for comparable BA13 are of performances a but average dictionary of generic its meanings a in different available the word candidate ranks that method ple idEenrgy Wind Eenrgy Wind 0 0 0 0 0 0 0 0 0 0.58 0.58 0.86 ...... 21 08 08 08 83 36 30 24 08 oieTechnology Mobile Technology Mobile a h oetascae per- associated lowest the has 0 0 0 0 0 0 0 0 0 0 0.55 0.75 ...... 53 16 17 16 16 72 25 24 05 06 building blocks of machine translation systems. The descriptions (i.e. articles) by their relatedness to the proposed approach directly tackles two of the ma- candidate word. Similar to the second direction, this jor challenges identified in the Introduction. 
The process involves finding comparable text blocks but scarcity of resources is addressed by an adequate rather at a paragraph or sentence level than at the exploitation of Wikipedia, a resource that is avail- article level. able in hundreds of languages. The quality of auto- matic translations was improved by appropriate do- main delimitation and linking across languages, as References well as by an adequate statistical processing of con- Dhouha Bouamor, Nasredine Semmar, and Pierre cepts similar to a word in a given context. Zweigenbaum. 2013. Context vector disambiguation The main advantages of our approach compared for bilingual lexicon extraction. In Proceedings of the to state of the art methods come from: the increased 51st Association for Computational Linguistics (ACL- number of languages that can be processed, from HLT), Sofia, Bulgaria, August. the smaller sensitivity to structured resources and Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in spe- the appropriate domain delimitation. Experimental cialized, comparable corpora. In Proceedings of the validation is obtained through evaluation with four 19th international conference on Computational lin- different domains and two pairs of languages which guistics - Volume 2, COLING ’02, pages 1–5. Associ- shows consistent performance improvement. For ation for Computational Linguistics. French-English, two languages that have rich asso- Yun-Chuang Chiao and Pierre Zweigenbaum. 2003. The ciated Wikipedia representations, performances are effect of a general lexicon in corpus-based identifi- very interesting and are starting to approach those of cation of french-english medical word translations. manual translations for three domains out of four (F- In Proceedings Medical Informatics Europe, volume 95 of Studies in Health Technology and Informatics, Measure@20 around 0.8). For Romanian-English, a pages 397–402, Amsterdam. 
pair involving a language with a sparser Wikipedia Pascale Fung. 1998. A statistical view on bilingual lexi- representation, the performances of our method drop con extraction: From parallel corpora to non-parallel compared to French-English . However, they do not corpora. In Parallel Text Processing, pages 1–17. decrease to the same extent as those of the best state Springer. of the art method tested. This finding indicates that Evgeniy Gabrilovich and Shaul Markovitch. 2007. Com- our approach is more general and, given its low lan- puting semantic relatedness using wikipedia-based guage dependence, it can be easily extended to a explicit semantic analysis. In Proceedings of the large number of language pairs. 20th international joint conference on Artifical intel- ligence, IJCAI’07, pages 1606–1611, San Francisco, The results presented here are very encouraging CA, USA. Morgan Kaufmann Publishers Inc. and we will to pursue work in several directions. Alexander Halavais and Derek Lackaff. 2008. An Anal- First, we will pursue the integration of our method, ysis of Topical Coverage of Wikipedia. Journal of notably through comparable corpora creation using Computer-Mediated Communication, 13(2):429–440. the data driven domain delimitation technique de- Z.S. Harris. 1954. Distributional structure. Word. scribed in Subsection 3.5. Equally important, the Samer Hassan and Rada Mihalcea. 2011. Semantic re- size of the domain can be adapted so as to find latedness using salient semantic analysis. In AAAI. enough context for all the words in domain reference Amir Hazem and Emmanuel Morin. 2012. Adaptive dic- lists. Second, given a word in a context, we currently tionary for bilingual lexicon extraction from compara- exploit all similar concepts from the target language. ble corpora. 
In Proceedings, 8th international confer- Given that comparability of article versions in the ence on Language Resources and Evaluation (LREC), source and the target language varies, we will eval- Istanbul, Turkey, May. uate algorithms for filtering out concepts from the Azniah Ismail and Suresh Manandhar. 2010. Bilin- gual lexicon extraction from comparable corpora us- target language that have low alignment with their ing in-domain terms. In Proceedings of the 23rd In- source language versions. A final line of work is ternational Conference on Computational Linguistics: constituted by the use of distributional properties of Posters, COLING ’10, pages 481–489. Association for texts in order to automatically rank parts of concept Computational Linguistics.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In 23rd International Conference on Computational Linguistics (Coling 2010), pages 617–625, Beijing, China, August.

Emmanuel Morin and Béatrice Daille. 2006. Comparabilité de corpus et fouille terminologique multilingue. In Traitement Automatique des Langues (TAL).

Emmanuel Morin and Emmanuel Prochasson. 2011. Bilingual lexicon extraction from comparable corpora enhanced with parallel corpora. In Proceedings, 4th Workshop on Building and Using Comparable Corpora (BUCC), pages 27–34, Portland, Oregon, USA.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist., 29(1):19–51, March.

Emmanuel Prochasson, Emmanuel Morin, and Kyo Kageura. 2009. Anchor points for bilingual lexicon extraction from small comparable corpora. In Proceedings, 12th Machine Translation Summit (MT Summit XII), pages 284–291, Ottawa, Ontario, Canada.

Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 337–346, New York, NY, USA. ACM.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL '95, pages 320–322. Association for Computational Linguistics.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL '99, pages 519–526. Association for Computational Linguistics.

P. Sorg and P. Cimiano. 2012. Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl. Eng., 74:26–45, April.
