Building Specialized Bilingual Lexicons Using Large-Scale Background Knowledge

Dhouha Bouamor1, Adrian Popescu1, Nasredine Semmar1, Pierre Zweigenbaum2
1 CEA, LIST, Vision and Content Engineering Laboratory, 91191 Gif-sur-Yvette CEDEX, France; [email protected]
2 LIMSI-CNRS, F-91403 Orsay CEDEX, France; [email protected]

Abstract

Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of specific domain bilingual lexicons that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation lexicons. The approach is tested on four specialized domains and is compared to three state of the art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.

1 Introduction

The plethora of textual information shared on the Web is strongly multilingual and users' information needs often go well beyond their knowledge of foreign languages. In such cases, efficient machine translation and cross-lingual information retrieval systems are needed. Machine translation already has a decades-long history and an array of commercial systems have already been deployed, including Google Translate1 and Systran2. However, due to the intrinsic difficulty of the task, a number of related problems remain open, including: the gap between text semantics and statistically derived translations, the scarcity of resources in a large majority of languages, and the quality of automatically obtained resources and translations. While the first challenge is general and inherent to any automatic approach, the second and the third can be at least partially addressed by an appropriate exploitation of multilingual resources that are increasingly available on the Web.

In this paper we focus on the automatic creation of domain-specific bilingual lexicons. Such resources play a vital role in Natural Language Processing (NLP) applications that involve different languages. At first, research on lexical extraction relied on the use of parallel corpora (Och and Ney, 2003). The scarcity of such corpora, in particular for specialized domains and for language pairs not involving English, pushed researchers to investigate the use of comparable corpora (Fung, 1998; Chiao and Zweigenbaum, 2003). These corpora include texts which are not exact translations of each other but share common features such as domain, genre, sampling period, etc.

The basic intuition that underlies bilingual lexicon creation is the distributional hypothesis (Harris, 1954), which posits that words with similar meanings occur in similar contexts. In a multilingual formulation, this hypothesis states that the translations of a word are likely to appear in similar lexical environments across languages (Rapp, 1995). The standard approach to bilingual lexicon extraction builds

1 http://translate.google.com/
2 http://www.systransoft.com/


Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 479-489, Seattle, Washington, USA, 18-21 October 2013. (c) 2013 Association for Computational Linguistics

on the distributional hypothesis and compares context vectors for each word of the source and target languages. In this approach, the comparison of context vectors is conditioned by the existence of a seed bilingual dictionary. A weakness of the method is that poor results are obtained for language pairs that are not closely related (Ismail and Manandhar, 2010). Another important problem occurs whenever the size of the seed dictionary is small, since many context words are then ignored. Conversely, when dictionaries are detailed, ambiguity becomes an important drawback.

We introduce a bilingual lexicon extraction approach that exploits Wikipedia in an innovative manner in order to tackle some of the problems mentioned above. Important advantages of using Wikipedia are:

• The resource is available in hundreds of languages and it is structured as unambiguous concepts (i.e. articles).

• The languages are explicitly linked through concept translations proposed by Wikipedia contributors.

• It covers a large number of domains and is thus potentially useful in order to mine a wide array of specialized lexicons.

Mirroring the advantages, there are a number of challenges associated with the use of Wikipedia:

• The comparability of concept descriptions in different languages is highly variable.

• The translation graph is partial since, when considering any language pair, only a part of the concepts are available in both languages and explicitly connected.

• Domains are unequally covered in Wikipedia (Halavais and Lackaff, 2008) and efficient domain targeting is needed.

The approach introduced in this paper aims to draw on Wikipedia's advantages while appropriately addressing the associated challenges. Among the techniques devised to mine Wikipedia content, we hypothesize that an adequate adaptation of Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) is fitted to our application context. ESA was already successfully tested in different NLP tasks, such as word relatedness estimation or text classification, and we modify it to mine specialized domains, to characterize these domains and to link them across languages.

The evaluation of the newly introduced approach is realized on four diversified specialized domains (Breast Cancer, Corporate Finance, Wind Energy and Mobile Technology) and for two pairs of languages: French-English and Romanian-English. This choice allows us to study the behavior of different approaches for a pair of languages that are richly represented and for a pair that includes Romanian, a language that has fewer associated resources than French and English. Experimental results show that the newly introduced approach outperforms the three state of the art methods that were implemented for comparison.

2 Related Work

In this section, we first give a review of the standard approach and then introduce methods that build upon it. Finally, we discuss works that rely on Explicit Semantic Analysis to solve other NLP tasks.

2.1 Standard Approach (SA)

Most previous approaches that address bilingual lexicon extraction from comparable corpora are based on the standard approach (Fung, 1998; Chiao and Zweigenbaum, 2002; Laroche and Langlais, 2010). This approach is composed of three main steps:

1. Building context vectors: Vectors are first extracted by identifying the words that appear around the term to be translated Wcand in a window of n words. Generally, association measures such as the mutual information (Morin and Daille, 2006), the log-likelihood (Morin and Prochasson, 2011) or the Discounted Odds-Ratio (Laroche and Langlais, 2010) are employed to shape the context vectors.

2. Translation of context vectors: To enable the comparison of source and target vectors, source vectors are translated into the target language by using a seed bilingual dictionary. Whenever several translations of a context word exist,

all translation variants are taken into account. Words not included in the seed dictionary are simply ignored.

3. Comparison of source and target vectors: Given Wcand, its automatically translated context vector is compared to the context vectors of all possible translations from the target language. Most often, the cosine similarity is used to rank translation candidates but alternative metrics, including the weighted Jaccard index (Prochasson et al., 2009) and the city-block distance (Rapp, 1999), were studied.

2.2 Improvements of the Standard Approach

Most of the improvements of the standard approach are based on the observation that the more representative the context vectors of a candidate word are, the better the bilingual lexicon extraction is. At first, additional linguistic resources, such as specialized dictionaries (Chiao and Zweigenbaum, 2002) or transliterated words (Prochasson et al., 2009), were combined with the seed dictionary to translate context vectors.

The ambiguities that appear in the seed bilingual dictionary were taken into account more recently. (Morin and Prochasson, 2011) modify the standard approach by weighting the different translations according to their frequency in the target corpus. In (Bouamor et al., 2013), we proposed a method that adds to the standard approach a word sense disambiguation process relying on semantic similarity measurement from WordNet. Given a context vector in the source language, the most probable translation of polysemous words is identified and used for building the corresponding vector in the target language. The most probable translation is identified using the monosemic words that appear in the same lexical environment. On specialized French-English comparable corpora, this approach outperforms the one proposed in (Morin and Prochasson, 2011), which is itself better than the standard approach. The main weakness of (Bouamor et al., 2013) is that the approach relies on WordNet and its application depends on the existence of this resource in the target language. Also, the method is highly dependent on the coverage of the seed bilingual dictionary.

2.3 Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) is a method that maps textual documents onto a structured semantic space using classical text indexing schemes such as TF-IDF. Examples of semantic spaces used include Wikipedia or the Open Directory Project but, due to superior performances, Wikipedia is most frequently used. In the original evaluation, ESA outperformed state of the art methods in a word relatedness estimation task.

Subsequently, ESA was successfully exploited in other NLP tasks and in information retrieval. Radinsky et al. (2011) added a temporal dimension to word vectors and showed that this addition improves the results of word relatedness estimation. (Hassan and Mihalcea, 2011) introduced Salient Semantic Analysis (SSA), a development of ESA that relies on the detection of salient concepts prior to mapping words to concepts. SSA and the original ESA implementation were tested on several word relatedness datasets and results were mixed. Improvements were obtained for text classification when comparing SSA with the authors' in-house representation of the method. ESA has weak language dependence and was already deployed in multilingual contexts. (Sorg and Cimiano, 2012) extended ESA to other languages and showed that it is useful in cross-lingual and multilingual retrieval tasks. Their focus was on creating a language independent conceptual space in which documents would be mapped and then retrieved.

Some open ESA topics related to bilingual lexicon creation include: (1) the document representation, which is simply done by summing individual contributions of words, (2) the adaptation of the method to specific domains and (3) the coverage of the underlying resource in different languages.

3 ESA for Bilingual Lexicon Extraction

The main objective of our approach is to devise lexicon translation methods that are easily applicable to a large number of language pairs, while preserving the overall quality of results. A subordinated objective is to exploit large scale background multilingual knowledge, such as the encyclopedic content available in Wikipedia. As we mentioned, ESA

(Gabrilovich and Markovitch, 2007) was exploited in a number of NLP tasks but not in bilingual lexicon extraction.

Figure 1: Overview of the Explicit Semantic Analysis enabled bilingual lexicon extraction.

Figure 1 shows the overall architecture of the lexical extraction process we propose. The process is completed in the following three steps:

1. Given a word to be translated and its context vector in the source language, we derive a ranked list of similar Wikipedia concepts (i.e. articles) using the ESA inverted index.

2. Then, a translation graph is used to retrieve the corresponding concepts in the target language.

3. Candidate translations are found through a statistical processing of concept descriptions from the ESA direct index in the target language.

In this section, we first introduce the elements of the original formulation of ESA necessary in our approach. Then, we detail the three steps that compose the main bilingual lexicon extraction method illustrated in Figure 1. Finally, as a complement to the main method, we introduce a measure of domain word specificity and present a method for extracting generic translation lexicons.

3.1 ESA Word and Concept Representation

Given a semantic space structured using a set of M concepts and including a dictionary of N words, a mapping between words and concepts can be expressed as the following matrix:

    w(W1,C1)  w(W2,C1)  ...  w(WN,C1)
    w(W1,C2)  w(W2,C2)  ...  w(WN,C2)
    ...
    w(W1,CM)  w(W2,CM)  ...  w(WN,CM)

When Wikipedia is exploited, concepts are equated to Wikipedia articles and the texts of the articles are processed in order to obtain the weights that link words and concepts. In (Gabrilovich and Markovitch, 2007), the weights w that link words and concepts were obtained through a classical TF-IDF weighting of Wikipedia articles. A series of tweaks destined to improve the method's performance were used and disclosed later3. For instance, administration articles, lists, and articles that are too short or have too few links are discarded. Higher weight is given to words in the article title and longer articles are favored over shorter ones. We implemented a part of these tweaks and tested our own version of ESA with the Wikipedia version used in the original implementation. The correlation with human judgments of word relatedness was 0.72, against 0.75 reported by (Gabrilovich and Markovitch, 2007). The ESA matrix is sparse since N, the size of the dictionary, is usually in the range of hundreds of thousands and each concept is usually described by hundreds of distinct words. The direct ESA index from Figure 1 is obtained by reading the matrix horizontally while the inverted ESA index is obtained by reading the matrix vertically.

3 https://github.com/faraday/wikiprep-esa/wiki/roadmap
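To make the direct and inverted readings of the matrix concrete, here is a minimal sketch (not the paper's implementation: the Wikipedia-specific filtering tweaks are skipped and the toy articles are invented for illustration):

```python
import math
from collections import Counter, defaultdict

def build_esa_indexes(articles):
    """articles: {concept_title: text}.
    Returns (direct, inverted):
      direct[concept] -> {word: TF-IDF weight}    (matrix read horizontally)
      inverted[word]  -> {concept: TF-IDF weight} (matrix read vertically)
    """
    tokenized = {c: text.lower().split() for c, text in articles.items()}
    n_docs = len(tokenized)
    df = Counter()                      # document frequency of each word
    for words in tokenized.values():
        df.update(set(words))
    direct, inverted = {}, defaultdict(dict)
    for concept, words in tokenized.items():
        tf = Counter(words)
        weights = {}
        for word, freq in tf.items():
            idf = math.log(n_docs / df[word])
            if idf > 0:                 # drop words occurring in every article
                weights[word] = (freq / len(words)) * idf
        direct[concept] = weights
        for word, w in weights.items():
            inverted[word][concept] = w
    return direct, inverted

articles = {
    "Corporate finance": "share stock capital finance market",
    "Wind turbine": "turbine wind rotor blade energy",
    "Stock market": "stock share market trading index",
}
direct, inverted = build_esa_indexes(articles)
# Concepts most related to the word "share", best first:
ranked = sorted(inverted["share"], key=inverted["share"].get, reverse=True)
```

Step 1 of Figure 1 queries the inverted index, while step 3 reads the direct index of the translated concepts.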

Terme         Concepts
action        évaluation d'action, communisme, actionnaire activiste, socialisme, développement durable ...
déficit       crise de la dette dans la zone euro, dette publique, règle d'or budgétaire, déficit, trouble du déficit de l'attention ...
cisaillement  taux de cisaillement, zone de cisaillement, cisaillement, contrainte de cisaillement, viscoanalyseur ...
turbine       ffc turbine potsdam, turbine à gaz, turbine, turbine hydraulique, cogénération ...
cryptage      TEMPEST, chiffrement, liaison 16, Windows Vista, transfert de fichiers ...
protocole     Ad-hoc On-demand Distance Vector, protocole de Kyoto, optimized link state routing protocol, liaison 16, IPv6 ...
biopsie       biopsie, maladie de Horton, cancer du sein, cancer du poumon, imagerie par résonance magnétique ...
palpation     cancer du sein, cellulite, examen clinique, appendicite, ténosynovite ...

Table 1: The five most similar Wikipedia concepts to the French terms action [share], déficit [deficit], cisaillement [shear], turbine [turbine], cryptage [encryption], biopsie [biopsy] and palpation [palpation], given their context vectors.

3.2 Source Language Processing

The objective of the source language processing is to obtain a ranked list of similar Wikipedia concepts for each candidate word (Wcand) in a specialized domain. To do this, a context vector is first built for each Wcand from a specialized monolingual corpus. The association measure between Wcand and its context words is obtained using the Odds-Ratio (defined in equation 5). Wikipedia concepts in the source language Cs that are similar to Wcand and to a part of its context words are extracted and ranked using equation 1:

    Rank(Cs) = 10 * max_i(Odds(Wcand, Wsi)) * w(Wcand, Cs) + sum_{i=1..n} Odds(Wcand, Wsi) * w(Wsi, Cs)    (1)

where max_i(Odds(Wcand, Wsi)) is the highest Odds-Ratio association between Wcand and any of its context words Wsi; the factor 10 was empirically set to give more importance to Wcand over context words; w(Wcand, Cs) is the weight of the association between Wcand and Cs from the ESA matrix; n is the total number of words Wsi in the context vector of Wcand; Odds(Wcand, Wsi) is the association value between Wcand and Wsi; and w(Wsi, Cs) are the weights of the associations between each context word Wsi and Cs from the ESA matrix. The use of contextual information in equation 1 serves to characterize the candidate word in the target domain.

In Table 1, we present the five most similar Wikipedia concepts to the French terms action, déficit, cisaillement, turbine, cryptage, biopsie and palpation, given their context vectors. These terms are part of the four specialized domains we are studying here. From these examples, we note that despite the differences between the specialized domains and word ambiguity (words action and protocole), our method has the advantage of successfully representing each word to be translated by a relevant conceptual space.

3.3 Translation Graph Construction

To bridge the gap between the source and target languages, a concept translation graph that enables the multilingual extension of ESA is used. This concept translation graph is extracted from the explicit translation links available in Wikipedia articles and is exploited in order to connect a word's conceptual space in the source language with the corresponding conceptual space in the target language. Only a part of the articles have translations, and the size of the conceptual space in the target language is usually smaller than the space in the source language. For instance, the French-English translation graph contains 940,215 pairs of concepts while the French and English Wikipedias contain approximately 1.4 million and 4.25 million articles, respectively.

3.4 Target Language Processing

The third step of the approach takes place in the target language. Using the translation graph, we select the 100 most similar concept translations (threshold determined empirically after preliminary experiments) from the target language and use their direct ESA representations in order to retrieve potential translations for the candidate word Wcand from the source language. These candidate translations Wt are ranked using equation 2:

    Rank(Wt) = ( sum_{i=1..n} w(Wt, Cti) / avg(Cti) ) * log(count(Wt, S))    (2)

where w(Wt, Cti) is the weight of the translation candidate Wt for concept Cti from the ESA matrix in the target language; avg(Cti) is the average TF-IDF score of the words that appear in Cti; S is the set of similar concepts Cti in the target language; and count(Wt, S) is the number of different concepts from S in which the candidate translation Wt appears.

The accumulation of weights w(Wt, Cti) follows the way original ESA text representations are calculated (Gabrilovich and Markovitch, 2007), and avg(Cti) is used in order to correct the bias of the TF-IDF scheme towards short articles. log(count(Wt, S)) is used to favor words that are associated with a larger number of concepts. The log weighting was chosen after preliminary experiments with a wide range of functions.

3.5 Domain Specificity

In previous works, ESA was usually exploited in generic tasks that did not require any domain adaptation. Here we process information from specific domains and we need to measure the specificity of words in those domains. The domain extraction is seeded by using the Wikipedia concept (noted Cseed) that best describes the domain in the target language. For instance, in English, the Corporate Finance domain is seeded with https://en.wikipedia.org/wiki/Corporate_finance. We extract the set of 10 words with the highest TF-IDF scores from this article (noted SW) and use them to retrieve a domain ranking of concepts in the target language, Rankdom(Ct), with equation 3:

    Rankdom(Ct) = ( sum_{i=1..n} w(Wti, Ct) * w(Cseed, Wti) ) * count(SW, Ct)    (3)

where n is the size of the seed list of words (i.e. 10 items); w(Wti, Ct) is the weight of the domain word Wti in the concept Ct; w(Cseed, Wti) is the weight of Wti in Cseed, the seed concept of the domain; and count(SW, Ct) is the number of distinct seed words from SW that appear in Ct. The first part of equation 3 sums up the contributions of the different words from SW that appear in Ct, while the second part is meant to further reinforce articles that contain a larger number of domain keywords from SW.

Domain delimitation is performed by retaining articles whose Rankdom(Ct) is at least 1% of the top Rankdom(Ct) score. This threshold was set during preliminary experiments. Given the delimitation obtained with equation 3, we calculate a domain specificity score specifdom(Wt) for each word that occurs in the domain (equation 4). specifdom(Wt) estimates how much of a word's use in an underlying corpus is related to a target domain:

    specifdom(Wt) = DFdom(Wt) / DFgen(Wt)    (4)

where DFdom and DFgen stand for the domain and the generic document frequency of the word Wt. specifdom(Wt) will be used to favor words with greater domain specificity over more general ones when several translations are available in a seed generic translation lexicon. For instance, the French word action is ambiguous and has English translations such as action, stock, share, etc. In a general case, the most frequent translation is action, whereas in a corporate finance context, share or stock are more relevant. The specificity of the three translations, from highest to lowest, is: share, stock and action, and this order is used to rank these potential translations.
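As a minimal sketch, the concept ranking of equation 1 can be implemented as follows (the ESA weights and Odds-Ratio values are toy figures, not derived from a real corpus):

```python
def rank_source_concepts(w_cand, context_odds, esa_weight, concepts):
    """Equation 1: rank source-language concepts Cs for a candidate word.

    context_odds: {context word Wsi: Odds-Ratio association with w_cand}
    esa_weight(word, concept): association weight from the ESA matrix.
    """
    max_odds = max(context_odds.values())
    scores = {}
    for c in concepts:
        # the factor 10 gives more importance to w_cand over its context
        score = 10 * max_odds * esa_weight(w_cand, c)
        score += sum(odds * esa_weight(w, c)
                     for w, odds in context_odds.items())
        scores[c] = score
    return sorted(concepts, key=scores.get, reverse=True)

# Toy example for the French word "action" in a corporate finance corpus:
weights = {("action", "evaluation d'action"): 0.5,
           ("dividende", "evaluation d'action"): 0.2,
           ("action", "communisme"): 0.1,
           ("greve", "communisme"): 0.4}
esa_weight = lambda w, c: weights.get((w, c), 0.0)
ranked = rank_source_concepts("action",
                              {"dividende": 2.0, "greve": 0.5},
                              esa_weight,
                              ["communisme", "evaluation d'action"])
# ranked[0] -> "evaluation d'action"
```

The finance-related context word dividende pulls the finance concept ahead of the politics concept, mirroring the behavior shown in Table 1.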

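The domain specificity score of equation 4, and its use for reranking the translations proposed by a generic dictionary, can be sketched as follows (the document frequencies are invented for illustration):

```python
def specif_dom(word, df_dom, df_gen):
    """Equation 4: specif_dom(w) = DF_dom(w) / DF_gen(w)."""
    return df_dom.get(word, 0) / df_gen[word]

# Hypothetical document frequencies in a generic corpus and in the
# corporate finance domain delimited with equation 3:
df_gen = {"share": 50, "stock": 40, "action": 300}
df_dom = {"share": 24, "stock": 10, "action": 3}

# Rerank the generic-dictionary translations of the French word "action":
candidates = ["action", "stock", "share"]
ranked = sorted(candidates,
                key=lambda w: specif_dom(w, df_dom, df_gen),
                reverse=True)
# ranked -> ["share", "stock", "action"], as in the example above
```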
3.6 Generic Dictionaries

Generic translation dictionaries, already used by existing bilingual lexicon extraction approaches, can also be integrated in the newly proposed approach. The Wikipedia translation graph is transformed into a translation dictionary by removing the disambiguation marks from ambiguous concept titles, as well as lists, categories and other administration pages. Moreover, since the approach does not handle multiword units, we retain only translation pairs that are composed of unigrams in both languages. When they exist, unigram redirections are also added in each language.

The obtained dictionaries are incomplete since: (1) Wikipedia focuses on concepts that are most often nouns, (2) specialized domain terms often do not have an associated Wikipedia entry and (3) the translation graph covers only a fraction of the concepts available in a language. For instance, the resulting translation dictionaries have 193,543 entries for French-English and 136,681 entries for Romanian-English. They can be used in addition to or instead of other available resources and are especially useful when there are only few other resources that link the pair of languages processed.

4 Evaluation

The performances of our approach are evaluated against the standard approach and its developments proposed by (Morin and Prochasson, 2011) and (Bouamor et al., 2013). In this section, we first describe the data and resources we used in our experiments. We then present the different parameters needed in the implementation of the methods tested. Finally, we discuss the obtained results.

4.1 Data and Resources

Comparable corpora

We conducted our experiments on four French-English and Romanian-English specialized comparable corpora: Corporate Finance, Breast Cancer, Wind Energy and Mobile Technology. For the Romanian-English language pair, we used Wikipedia to collect comparable corpora for all domains since they were not already available. The Wikipedia corpora are harvested using a category-based selection. We consider the topic in the source language (for instance Cancer Mamar [Breast Cancer]) as a query to Wikipedia and extract all its sub-topics (i.e., sub-categories) to construct a domain-specific category tree. Then, based on the constructed tree, we collect all Wikipedia articles belonging to at least one of these categories and use inter-language links to build the comparable corpora.

Concerning the French-English pair, we followed the strategy described above to extract the comparable corpora related to the Corporate Finance and Breast Cancer domains since they were otherwise unavailable. For the two other domains, we used the corpora released in the TTC project4. All corpora were normalized through the following linguistic preprocessing steps: tokenization, part-of-speech tagging, lemmatization, and function word removal. The resulting corpora5 sizes are presented in Table 2. The sizes of the domain corpora vary within and across languages, with the Corporate Finance domain being the richest in both languages. In Romanian, Breast Cancer is particularly small, with approximately 22,000 tokens included. This variability will allow us to test if there is a correlation between corpus size and quality of results.

Domain               FR        EN
Corporate Finance    402,486   756,840
Breast Cancer        396,524   524,805
Wind Energy          145,019   345,607
Mobile Technology    197,689   144,168

Domain               RO        EN
Corporate Finance    206,169   524,805
Breast Cancer        22,539    322,507
Wind Energy          121,118   298,165
Mobile Technology    200,670   124,149

Table 2: Number of content words in the comparable corpora.

Bilingual dictionary

The seed generic French-English dictionary used to translate French context vectors consists of an in-house manually built resource which contains approximately 120,000 entries. For Romanian-English, we used the generic dictionary extracted following the procedure described in Subsection 3.6.

Gold standard

In bilingual lexicon extraction from comparable corpora, a reference list is required to evaluate the performance of the alignment. Such lists are usually composed of around 100 single terms (Hazem and Morin, 2012; Chiao and Zweigenbaum, 2002). Reference lists6 were created for the four specialized domains and the two pairs of languages. For French-English, reference words from the Corporate Finance domain were extracted from the glossary of bilingual micro-finance terms7. For Breast Cancer, the list is derived from the MESH and the UMLS thesauri8. Concerning Wind Energy and Mobile Technology, lists were extracted from specialized glossaries found on the Web. The Romanian-English gold standard was manually created by a native speaker starting from the French-English lists. Table 3 displays the sizes of the obtained lists. Reference term pairs were retained if each word composing them appeared at least five times in the comparable domain corpora.

Domain               FR-EN   RO-EN
Corporate Finance    125     69
Breast Cancer        96      38
Wind Energy          89      38
Mobile Technology    142     94

Table 3: Sizes of the evaluation lists.

4.2 Experimental setup

Aside from those already mentioned, three parameters need to be set: (1) the window size that defines contexts, (2) the association measure that quantifies the strength of the association between words and (3) the similarity measure that ranks candidate translations for the state of the art methods. Context vectors are defined using a seven-word window which approximates syntactic dependencies. The association and similarity measures (the Discounted Log-Odds ratio of equation 5 and the cosine similarity) were set following Laroche and Langlais (2010), a comprehensive study of the influence of these parameters on bilingual alignment.

    Odds-Ratio_disc = log( ((O11 + 1/2)(O22 + 1/2)) / ((O12 + 1/2)(O21 + 1/2)) )    (5)

where Oij are the cells of the 2 x 2 contingency matrix of a token s co-occurring with the term S within a given window size.

The F-measure of the Top 20 results (F-Measure@20), the harmonic mean of precision and recall, is used as the evaluation metric. Precision is the total number of correct translations divided by the number of terms for which the system returned at least one answer. Recall is equal to the ratio between the number of correct translations and the total number of words to translate (Wcand).

4.3 Results and discussion

In addition to the basic approach based on ESA (denoted ESA), we evaluate the performances of a method called Dicospec, in which the translations are extracted from a generic dictionary, and a method we call ESAspec, which combines ESA and Dicospec. Dicospec is based on the generic dictionary we presented in Subsection 3.6 and proceeds as follows: we extract a list of translations for each word to be translated from the generic dictionary. The domain specificity introduced in Subsection 3.5 is then used to rank these translations. For instance, the French term port, referring in the Mobile Technology domain to the system that allows computers to receive and transmit information, is translated into port and seaport. According to domain specificity values, the following ranking is obtained: the English term port obtains the highest specificity value (0.48); seaport comes next, with a specificity value of 0.01. In ESAspec, the translations set out in the translation lists proposed by both ESA and the generic dictionary are weighted according to their domain specificity values. The main intuition behind this method is that by adding information about domain specificity, we obtain a new ranking of the bilingual extraction results.

The obtained results are displayed in Table 4.

4 http://www.ttc-project.eu/index.php/releases-publications
5 Comparable corpora will be shared publicly.
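The F-Measure@20 metric described in the experimental setup can be sketched as follows (the candidate and reference lists are hypothetical):

```python
def f_measure_at_k(proposed, gold, k=20):
    """proposed: {term: ranked list of candidate translations}
    gold: {term: set of correct translations}
    A term counts as correct if one of its top-k candidates is in the gold set."""
    answered = [t for t in gold if proposed.get(t)]
    correct = sum(1 for t in answered if set(proposed[t][:k]) & gold[t])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {"actionnaire": {"shareholder"},
        "dette": {"debt"},
        "cisaillement": {"shear"}}
proposed = {"actionnaire": ["stakeholder", "shareholder"],
            "dette": ["debt"]}
# precision = 2/2, recall = 2/3, F-Measure@20 = 0.8
```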

The comparison of the state of the art methods shows that BA13 performs better than STAPP and MP11 for French-English and has comparable performances

6 Reference lists will be shared publicly.
7 http://www.microfinance.lu/en/
8 http://www.nlm.nih.gov/

76 44 21 21 21 74 50 61 55 49 idEnergy Wind oani Roma- in domain spec spec spec spec oprt Finance Corporate Finance Corporate idEn- Wind sasim- a is oBA13 to spec spec smore is perfor- Also, . 0 0 0 0 0 0 0 0 0 0 0.24 0.56 xlisagnrcdcinr,cmie ihteueo do- of use the with combined dictionary, generic a exploits spec and ...... 487 17 11 14 13 13 50 20 37 33 17 F-Measure@20 F-Measure@20 n ESA. and fseilzdblnullxcn,oeo h central the of creation one the lexicons, to bilingual approach specialized new of a presented have We Conclusion 5 ap- our change. that language to indicates robust more result is The ESA proach the for methods. than BA13 based for significant more is performance average drop the pass- RO-EN When to FR-EN domain. from by ing each explained of probably difficulty intrinsic is the finding This formances. domains, three Finance clear other Corporate re- the no than of richer is quality Although and There sults. size domain order domains. between in possible correlation sufficient such all being cover that from clear to far is are it However, dictionaries trans- approaches. lexicon lation specialized qual- in This good dictionaries RO-EN. of generic importance ity for the higher for and advocates those finding FR-EN to for comparable BA13 are of performances a but average dictionary of generic its meanings a in different available the word candidate ranks that method ple idEenrgy Wind Eenrgy Wind 0 0 0 0 0 0 0 0 0 0.58 0.58 0.86 ...... 21 08 08 08 83 36 30 24 08 oieTechnology Mobile Technology Mobile a h oetascae per- associated lowest the has 0 0 0 0 0 0 0 0 0 0 0.55 0.75 ...... 53 16 17 16 16 72 25 24 05 06 building blocks of machine translation systems. The descriptions (i.e. articles) by their relatedness to the proposed approach directly tackles two of the ma- candidate word. Similar to the second direction, this jor challenges identified in the Introduction. 
The process involves finding comparable text blocks but scarcity of resources is addressed by an adequate rather at a paragraph or sentence level than at the exploitation of Wikipedia, a resource that is avail- article level. able in hundreds of languages. The quality of auto- matic translations was improved by appropriate do- main delimitation and linking across languages, as References well as by an adequate statistical processing of con- Dhouha Bouamor, Nasredine Semmar, and Pierre cepts similar to a word in a given context. Zweigenbaum. 2013. Context vector disambiguation The main advantages of our approach compared for bilingual lexicon extraction. In Proceedings of the to state of the art methods come from: the increased 51st Association for Computational Linguistics (ACL- number of languages that can be processed, from HLT), Sofia, Bulgaria, August. the smaller sensitivity to structured resources and Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in spe- the appropriate domain delimitation. Experimental cialized, comparable corpora. In Proceedings of the validation is obtained through evaluation with four 19th international conference on Computational lin- different domains and two pairs of languages which guistics - Volume 2, COLING ’02, pages 1–5. Associ- shows consistent performance improvement. For ation for Computational Linguistics. French-English, two languages that have rich asso- Yun-Chuang Chiao and Pierre Zweigenbaum. 2003. The ciated Wikipedia representations, performances are effect of a general lexicon in corpus-based identifi- very interesting and are starting to approach those of cation of french-english medical word translations. manual translations for three domains out of four (F- In Proceedings Medical Informatics Europe, volume 95 of Studies in Health Technology and Informatics, Measure@20 around 0.8). For Romanian-English, a pages 397–402, Amsterdam. 
pair involving a language with a sparser Wikipedia Pascale Fung. 1998. A statistical view on bilingual lexi- representation, the performances of our method drop con extraction: From parallel corpora to non-parallel compared to French-English . However, they do not corpora. In Parallel Text Processing, pages 1–17. decrease to the same extent as those of the best state Springer. of the art method tested. This finding indicates that Evgeniy Gabrilovich and Shaul Markovitch. 2007. Com- our approach is more general and, given its low lan- puting semantic relatedness using wikipedia-based guage dependence, it can be easily extended to a explicit semantic analysis. In Proceedings of the large number of language pairs. 20th international joint conference on Artifical intel- ligence, IJCAI’07, pages 1606–1611, San Francisco, The results presented here are very encouraging CA, USA. Morgan Kaufmann Publishers Inc. and we will to pursue work in several directions. Alexander Halavais and Derek Lackaff. 2008. An Anal- First, we will pursue the integration of our method, ysis of Topical Coverage of Wikipedia. Journal of notably through comparable corpora creation using Computer-Mediated Communication, 13(2):429–440. the data driven domain delimitation technique de- Z.S. Harris. 1954. Distributional structure. Word. scribed in Subsection 3.5. Equally important, the Samer Hassan and Rada Mihalcea. 2011. Semantic re- size of the domain can be adapted so as to find latedness using salient semantic analysis. In AAAI. enough context for all the words in domain reference Amir Hazem and Emmanuel Morin. 2012. Adaptive dic- lists. Second, given a word in a context, we currently tionary for bilingual lexicon extraction from compara- exploit all similar concepts from the target language. ble corpora. 
In Proceedings, 8th international confer- Given that comparability of article versions in the ence on Language Resources and Evaluation (LREC), source and the target language varies, we will eval- Istanbul, Turkey, May. uate algorithms for filtering out concepts from the Azniah Ismail and Suresh Manandhar. 2010. Bilin- gual lexicon extraction from comparable corpora us- target language that have low alignment with their ing in-domain terms. In Proceedings of the 23rd In- source language versions. A final line of work is ternational Conference on Computational Linguistics: constituted by the use of distributional properties of Posters, COLING ’10, pages 481–489. Association for texts in order to automatically rank parts of concept Computational Linguistics.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In 23rd International Conference on Computational Linguistics (Coling 2010), pages 617–625, Beijing, China, August.

Emmanuel Morin and Béatrice Daille. 2006. Comparabilité de corpus et fouille terminologique multilingue. In Traitement Automatique des Langues (TAL).

Emmanuel Morin and Emmanuel Prochasson. 2011. Bilingual lexicon extraction from comparable corpora enhanced with parallel corpora. In Proceedings, 4th Workshop on Building and Using Comparable Corpora (BUCC), pages 27–34, Portland, Oregon, USA.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist., 29(1):19–51, March.

Emmanuel Prochasson, Emmanuel Morin, and Kyo Kageura. 2009. Anchor points for bilingual lexicon extraction from small comparable corpora. In Proceedings, 12th Machine Translation Summit (MT Summit XII), pages 284–291, Ottawa, Ontario, Canada.

Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 337–346, New York, NY, USA. ACM.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL '95, pages 320–322. Association for Computational Linguistics.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL '99, pages 519–526. Association for Computational Linguistics.

P. Sorg and P. Cimiano. 2012. Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl. Eng., 74:26–45, April.
