<<

ACL-IJCNLP 2015

Eighth Workshop on Building and Using Comparable Corpora

Proceedings of the Workshop

July 30, 2015
Beijing, China

Production and Manufacturing by Taberg Media Group AB, Box 94, 562 02 Taberg, Sweden

© 2015 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360, USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-941643-60-0

Introduction to BUCC 2015

Comparable corpora are collections of documents that are comparable in content and form in various degrees and dimensions. This definition includes many types of parallel and non-parallel multilingual corpora, but also sets of monolingual corpora that are used for comparative purposes. Research on comparable corpora is active but used to be scattered among many workshops and conferences. The workshop series on “Building and Using Comparable Corpora” (BUCC) aims at promoting progress in this exciting emerging field by bundling its research, thereby making it more visible and giving it a better platform.

Research on comparable corpora spans a number of topics from machine translation to contrastive linguistics. Distributional analysis, a topic which has seen renewed interest in recent years, has formed the core of a large part of the methods used to identify translations in comparable corpora. As a matter of fact, the standard techniques of word alignment in comparable corpora can be seen as methods for cross-language distributional semantics.

Following the previous editions of the workshop, which took place at LREC 2008 (Marrakech), ACL-IJCNLP 2009 (Singapore), LREC 2010 (Malta), ACL-HLT 2011 (Portland), LREC 2012 (Istanbul), ACL 2013 (Sofia), and LREC 2014 (Reykjavik), this year’s workshop is co-located with ACL-IJCNLP 2015 in Beijing, China.

This year’s workshop also hosts a companion shared task, the first evaluation exercise on the identification of comparable texts: given a large multilingual collection of texts derived from Wikipedia, the task is to detect the most similar texts across languages. Evaluation is performed against a gold standard based on actual inter-language links. Three teams submitted eleven runs linking texts in three languages to comparable English texts. A special section in this proceedings volume reports on the shared task.

Finally, we would like to thank all the people who in one way or another helped to make this workshop once again a success. Our special thanks go to Benjamin K. Tsou for agreeing to give the invited presentation, to the members of the program committee, who did an excellent job in reviewing the submitted papers, and to the ACL-IJCNLP 2015 workshop chairs and organizers. We also thank LIMSI-CNRS for financial support for our invited speaker. Last but not least, we would like to thank our authors and the participants of the workshop.

Pierre Zweigenbaum, Serge Sharoff, Reinhard Rapp


Organizers

Pierre Zweigenbaum, LIMSI, CNRS, Orsay (France), Chair
Serge Sharoff, University of Leeds (UK), Shared Task Chair
Reinhard Rapp, University of Mainz (Germany)

Programme Committee

Ahmet Aker (University of Sheffield, UK)
Srinivas Bangalore (AT&T Labs, US)
Caroline Barrière (CRIM, Montréal, Canada)
Hervé Déjean (Xerox Research Centre Europe, Grenoble, France)
Kurt Eberle (Lingenio, Heidelberg, Germany)
Andreas Eisele (European Commission, Luxembourg)
Éric Gaussier (Université Joseph Fourier, Grenoble, France)
Gregory Grefenstette (INRIA, Saclay, France)
Silvia Hansen-Schirra (University of Mainz, Germany)
Hitoshi Isahara (Toyohashi University of Technology, Japan)
Kyo Kageura (University of Tokyo, Japan)
Adam Kilgarriff (Lexical Computing Ltd, UK)
Natalie Kübler (Université Paris Diderot, France)
Philippe Langlais (Université de Montréal, Canada)
Michael Mohler (Language Computer Corp., US)
Emmanuel Morin (Université de Nantes, France)
Dragos Stefan Munteanu (Language Weaver, Inc., US)
Lene Offersgaard (University of Copenhagen, Denmark)
Ted Pedersen (University of Minnesota, Duluth, US)
Reinhard Rapp (Université Aix-Marseille, France)
Sujith Ravi (Google, US)
Serge Sharoff (University of Leeds, UK)
Michel Simard (National Research Council Canada)
Tim Van de Cruys (IRIT-CNRS, Toulouse, France)
Stephan Vogel (QCRI, Qatar)
Guillaume Wisniewski (Université Paris Sud & LIMSI-CNRS, Orsay, France)
Pierre Zweigenbaum (LIMSI-CNRS, Orsay, France)

Invited Speaker

Benjamin K. Tsou (City University of Hong Kong)


Table of Contents

Augmented Comparative Corpora and Monitoring Corpus in Chinese: LIVAC and Sketch Search Engine Compared
Benjamin K. Tsou

A Factory of Comparable Corpora from Wikipedia
Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez

Knowledge-lean projection of coreference chains across languages
Yulia Grishina and Manfred Stede

Projective methods for mining missing translations in DBpedia
Laurent Jakubina and Philippe Langlais

Attempting to Bypass Alignment from Comparable Corpora via Pivot Language
Alexis Linard, Béatrice Daille and Emmanuel Morin

Application of a Corpus to Identify Gaps between English Learners and Native Speakers
Katsunori Kotani and Takehiko Yoshimi

A Generative Model for Extracting Parallel Fragments from Comparable Documents
Somayeh Bakhshaei, Shahram Khadivi and Reza Safabakhsh

Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families
Zi Long, Takehito Utsuro, Tomoharu Mitsuhashi and Mikio Yamamoto

Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps
Hyeong-Won Seo, Minah Cheon and Jae-Hoon Kim

Obtaining SMT dictionaries for related languages
Miguel Rios and Serge Sharoff

BUCC Shared Task: Cross-Language Document Similarity
Serge Sharoff, Pierre Zweigenbaum and Reinhard Rapp

AUT Document Alignment Framework for BUCC Workshop Shared Task
Atefeh Zafarian, Amir Pouya Agha Sadeghi, Fatemeh Azadi, Sonia Ghiasifard, Zeinab Ali Panahloo, Somayeh Bakhshaei and Seyyed Mohammad Mohammadzadeh Ziabary

LINA: Identifying Comparable Documents from Wikipedia
Emmanuel Morin, Amir Hazem, Florian Boudin and Elizaveta Loginova-Clouet


Workshop Program

Thursday, July 30, 2015

Session 1: 09:00–10:30 Opening Session

09:00–09:05 Introduction to the BUCC Workshop Pierre Zweigenbaum, Serge Sharoff, Reinhard Rapp

09:05–10:05 Invited presentation: Augmented Comparative Corpora and Monitoring Corpus in Chinese: LIVAC and Sketch Search Engine Compared Benjamin K. Tsou

10:05–10:30 A Factory of Comparable Corpora from Wikipedia Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez

Session 2: 11:00–12:30

11:00–11:25 Knowledge-lean projection of coreference chains across languages Yulia Grishina and Manfred Stede

11:25–11:50 Projective methods for mining missing translations in DBpedia Laurent Jakubina and Philippe Langlais

11:50–12:05 Attempting to Bypass Alignment from Comparable Corpora via Pivot Language Alexis Linard, Béatrice Daille and Emmanuel Morin

12:05–12:20 Application of a Corpus to Identify Gaps between English Learners and Native Speakers Katsunori Kotani and Takehiko Yoshimi

Thursday, July 30, 2015 (continued)

Session 3: 14:00–15:30 Alignment

14:00–14:25 A Generative Model for Extracting Parallel Fragments from Comparable Documents Somayeh Bakhshaei, Shahram Khadivi and Reza Safabakhsh

14:25–14:50 Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families Zi Long, Takehito Utsuro, Tomoharu Mitsuhashi and Mikio Yamamoto

14:50–15:05 Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps Hyeong-Won Seo, Minah Cheon and Jae-Hoon Kim

15:05–15:20 Obtaining SMT dictionaries for related languages Miguel Rios and Serge Sharoff

Session 4: 16:00–17:00 Shared Task

16:00–16:15 BUCC Shared Task: Cross-Language Document Similarity Serge Sharoff, Pierre Zweigenbaum and Reinhard Rapp

16:15–16:30 AUT Document Alignment Framework for BUCC Workshop Shared Task Atefeh Zafarian, Amir Pouya Agha Sadeghi, Fatemeh Azadi, Sonia Ghiasifard, Zeinab Ali Panahloo, Somayeh Bakhshaei and Seyyed Mohammad Mohammadzadeh Ziabary

16:30–16:45 LINA: Identifying Comparable Documents from Wikipedia Emmanuel Morin, Amir Hazem, Florian Boudin and Elizaveta Loginova-Clouet

16:45–17:00 Shared Task: General Discussion

Closing: 17:00

Augmented Comparative Corpora and Monitoring Corpus in Chinese: LIVAC and Sketch Search Engine Compared

Benjamin K. Tsou
City University of Hong Kong / The Chinese University of Hong Kong / Hong Kong University of Science and Technology

The increasing availability of numerous corpora has significantly contributed to the understanding of words in terms of their underlying semantic structures and lexical networks (e.g. COBUILD, WordNet etc.). Through data mining and information retrieval, research in this area has vastly expanded our appreciation that what constitutes lexical knowledge goes beyond synonymy, hyponymy, metonymy, meronymy, grammatical and other collocations. Furthermore, such resources are fundamental to a universalistic conceptual base of ontologies and knowledge representation, which are often enriched by deeper and newer analysis. In this context, each language foregrounds specific features or nodes within this knowledge base, usually by non-uniform means.

At the same time, the arrival of the age of Big Data has attracted extensive studies on the actual and dynamic use of language as contextualized (à la Jakobson 1960) within a given society, especially through the mass media. What is foregrounded in this medium tends to have graded cognitive saliency characterizing members of the common speech community, and such shared knowledge is usually at great variance with the thesaurus approach and shows noticeable localized features. It is proposed here that the two kinds of knowledge (thesauric vs cognitive-cultural) complement each other in human cognition, and are integral to it.

We draw on two large Chinese media databases, the Sketch Engine (2.1 billion character tokens, as per the Sketch Engine website) and LIVAC (550 million character tokens, as per the LIVAC website), for illustration and discussion. The Sketch Engine in Chinese shows how apple is, as expected, primarily related to orange, peach, fruit, vegetable, food etc. At the same time, the three sub-corpora of LIVAC we draw on show that apple has a different set of saliency linkages, with computer, iPhone, Jobs, roll out, share price, company etc. This linkage is related less to the universalistic semantic network for apple than to the foregrounded awareness of apple as a cultural artifact in actual human social interaction, encoded as social knowledge (Park 1955, Longino 1990). We also show and examine how the salient information associated with apple varies across the three major Chinese speech communities: Beijing, Hong Kong and Taipei, reflecting social and societal differences and regional developments, as well as variations over time. Similarly, free-freedom in Chinese varies in associated saliency linkage in the three speech communities in interesting ways, but also contrasts with the Sketch Engine results.

The above comparison in LIVAC is made possible by rigorous improvement to the common and simplistic approach to the cultivation and use of databases. The augmentation efforts included the rigorous cultivation of three comparable (sub-)corpora for Beijing, Hong Kong and Taipei through geographical (horizontal), chronological (vertical) and domain (topical) partitioning of what is often assumed to be a common linguistic database. This partitioning required well-reasoned, pre-conceived criteria to ensure adequate equivalency in comparability in terms of size, period and depth of analysis.

To facilitate comparison we propose a Cognitive-cultural Salience Index (CSI), which draws on comparable corpus data (e.g. LIVAC) to compare the relative saliency of target words in the relevant corpus, presented as word clouds. The results are viewed in the light of the Sketch Engine output to explore how our appreciation of knowledge representation may be enhanced. It will also serve to echo the call to optimize our data collection efforts and to broaden our queries with data judiciously curated and cultivated.

References

K. E. Boulding. 1956. The Image: Knowledge in Life and Society. University of Michigan Press, Ann Arbor, MI.

C. R. Huang, A. Kilgarriff, Y. Wu, C. M. Chiu, S. Smith, P. Rychly, and K. J. Chen. 2005. Chinese Sketch Engine and the extraction of grammatical collocations. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 48–55.

R. Jakobson. 1960. Closing statement: Linguistics and poetics. In T. Sebeok, editor, Style in Language.

A. Kilgarriff, V. Baisa, J. Bušta, M. Jakubíček, V. Kovář, J. Michelfeit, P. Rychlý, and V. Suchomel. 2014. The Sketch Engine: Ten years on. Lexicography: Journal of ASIALEX, 1(1):7–36.

LIVAC. http://www.livac.org; https://en.wikipedia.org/wiki/LIVAC_Synchronous_Corpus.

H. E. Longino. 1990. Science as Social Knowledge: Values and Objectivity in Scientific Inquiry. Princeton University Press, Princeton, NJ.

R. E. Park. 1955. Society: Collective Behavior, News and Opinion, Sociology and Modern Society. The Free Press, Glencoe, IL.

Sketch Engine. https://the.sketchengine.co.uk; https://en.wikipedia.org/wiki/Sketch_Engine.

B. Tsou and O. Kwong. 2015. LIVAC as a monitoring corpus for tracking trends beyond linguistics. In B. K. Tsou and O. K. Kwong, editors, Linguistic Corpus and Corpus Linguistics in the Chinese Context, number 25 in Journal of Chinese Linguistics Monograph Series, pages 447–471, Hong Kong. Hong Kong Chinese University Press.

A Factory of Comparable Corpora from Wikipedia

Alberto Barrón-Cedeño (1), Cristina España-Bonet (2), Josu Boldoba (2) and Lluís Màrquez (1)
(1) Qatar Computing Research Institute, HBKU, Doha, Qatar
(2) TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
{albarron,lmarquez}@qf.org.qa, [email protected], [email protected]

Abstract

Multiple approaches to grab comparable data from the Web have been developed to date. Nevertheless, coming up with a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. In order to prove the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.

1 Introduction

Multilingual corpora with different levels of comparability are useful for a range of natural language processing (NLP) tasks. Comparable corpora were first used for extracting parallel lexicons (Rapp, 1995; Fung, 1995). Later they were used for feeding statistical machine translation (SMT) systems (Uszkoreit et al., 2010) and in multilingual retrieval models (Schönhofen et al., 2007; Potthast et al., 2008). SMT systems estimate their statistical models from bilingual texts (Koehn, 2010). Since only the words that appear in the corpus can be translated, having a corpus of the right domain is important for high coverage. However, it is evident that no large collections of parallel texts exist for all domains and language pairs. In some cases, only general-domain parallel corpora are available; in some others there are no parallel resources at all.

One of the main sources of parallel data is the Web: websites in multiple languages are crawled and their contents retrieved to obtain multilingual data. Wikipedia, an on-line community-curated encyclopædia with editions in multiple languages, has been used as a source of data for these purposes; for instance, (Adafre and de Rijke, 2006; Potthast et al., 2008; Otero and López, 2010; Plamada and Volk, 2012). Due to its encyclopædic nature, editors aim at organising its content within a dense taxonomy of categories (http://en.wikipedia.org/wiki/Help:Category). Such a taxonomy can be exploited to extract comparable and parallel corpora on specific topics and knowledge domains. This allows one to study how different topics are analysed in different languages, to extract multilingual lexicons, or to train specialised machine translation systems, just to mention some instances. Nevertheless, the process is not straightforward. The community-generated nature of Wikipedia has produced a reasonably good —yet chaotic— taxonomy in which categories are linked to each other at will, even if sometimes no relationship among them exists, and the borders dividing different areas are far from being clearly defined.

The rest of the paper is organised as follows. We briefly overview the definitions of comparability levels in the literature and show the difficulties inherent to extracting comparable corpora from Wikipedia (Section 2). We propose a simple and effective platform for the extraction of comparable corpora from Wikipedia (Section 3). We describe a simple model for the extraction of parallel sentences from comparable corpora (Section 4). Experimental results are reported on each of these sub-tasks for three domains using the English and Spanish Wikipedia editions. We present an application-oriented evaluation of the comparable corpora by studying the impact of the extracted parallel sentences on a statistical machine translation system (Section 5). Finally, we draw conclusions and outline ongoing work (Section 6).


2 Background

Comparability in multilingual corpora is a fuzzy concept that has received alternative definitions without reaching an overall consensus (Rapp, 1995; EAGLES, 1996; Fung, 1998; Fung and Cheung, 2004; Wu and Fung, 2005; McEnery and Xiao, 2007; Sharoff et al., 2013). Ideally, a comparable corpus should contain texts in multiple languages which are similar in terms of form and content. Regarding content, they should observe similar structure, function, and a long list of characteristics: register, field, tenor, mode, time, and dialect (Maia, 2003).

Nevertheless, finding these characteristics in real-life data collections is virtually impossible. Therefore, we adhere to the following simpler four-class classification (Skadiņa et al., 2010): (i) parallel texts are true and accurate translations or approximate translations with minor language-specific variations; (ii) strongly comparable texts are closely related texts reporting the same event or describing the same subject; (iii) weakly comparable texts include texts in the same narrow subject domain and genre, but describing different events, as well as texts within the same broader domain and genre, but varying in sub-domains and specific genres; (iv) non-comparable texts are pairs of texts drawn at random from a pair of very large collections of texts in two or more languages.

Wikipedia is a particularly suitable source of multilingual text with different levels of comparability, given that it covers a large number of languages and topics. (Wikipedia contains 288 language editions, out of which 277 are active and 12 have more than 1M articles at the time of writing, June 2015; http://en.wikipedia.org/wiki/List_of_Wikipedias.) Articles can be connected via interlanguage links (i.e., a link from a page in one Wikipedia language to an equivalent page in another language). Although there are some missing links, and an article can be linked by two or more articles from the same language (Hecht and Gergle, 2010), the number of available links makes it possible to exploit the multilinguality of Wikipedia.

Still, extracting a comparable corpus on a specific domain from Wikipedia is not so straightforward. One can take advantage of the user-generated categories associated to most articles. Ideally, the categories and sub-categories would compose a hierarchically organised taxonomy, e.g., in the form of a category tree. Nevertheless, the categories in Wikipedia compose a densely-connected graph with highly overlapping categories, cycles, etc. As they are manually crafted, the categories are somewhat arbitrary and, among other consequences, the resulting categorisation of articles does not fulfil the properties of a trustworthy categorisation of articles from different domains. Moreover, many articles are not associated to the categories they should belong to, and there is a phenomenon of over-categorization, stressed in Wikipedia itself (http://en.wikipedia.org/wiki/Wikipedia:Overcategorization).

[Figure 1: Slice of the Spanish Wikipedia category graph (as of May 2015) departing from the categories Sport and Science; node labels translated for clarity. The paths from both root categories converge on the node Pyrenees, and the cycle Mountains of Andorra, Pyrenees, Mountains of the Pyrenees, Mountains of Andorra is visible.]

Figure 1 is an example of the complexity of Wikipedia's category graph topology. Although this particular example comes from the Wikipedia in Spanish, similar phenomena exist in other editions. Firstly, the paths from two apparently unrelated categories —Sport and Science— converge in a common node early in the graph (node Pyrenees). As a result, not only could Pyrenees be considered a sub-category of both Sport and Science, but so could all its descendants. Secondly, cycles exist among the different categories, as in the sequence Mountains of Andorra → Pyrenees → Mountains of the Pyrenees → Mountains of Andorra. Ideally, every sub-category of a category should share the same attributes, since the "failure to observe this principle reduces the predictability [of the taxonomy] and can lead to cross-classification" (Rowley and Hartley, 2000, p. 196). Although fixing this issue —inherent to all the Wikipedia editions— falls out of the scope of our research, some heuristic strategies are necessary to diminish its impact on the domain definition process.

Plamada and Volk (2012) dodge this issue by extracting a domain comparable corpus using IR techniques. They use the characteristic vocabulary of the domain (100 terms extracted from an external in-domain corpus) to query a Lucene search engine (https://lucene.apache.org/) over the whole encyclopædia. Our approach is completely different: we try to get along with Wikipedia's structure, with a strategy to walk through the category graph departing from a root or pseudo-root category, which defines our domain of interest. We empirically set a threshold to stop exploring the graph such that the included categories most likely represent an entire domain (cf. Section 3). This approach is more similar to Cui et al. (2008), who explore the Wiki-Graph and score every category in order to assess its likelihood of belonging to the domain.

Other tools are being developed to extract corpora from Wikipedia. Linguatools (http://linguatools.org) released a comparable corpus extracted from Wikipedias in 253 language pairs; unfortunately, neither their tool nor a description of the applied methodology is available. CatScan2 (http://tools.wmflabs.org/catscan2/catscan2.php) is a tool that allows one to explore and search categories recursively. The Accurat toolkit (Pinnis et al., 2012; Ştefănescu et al., 2012; http://www.accurat-project.eu) aligns comparable documents and extracts parallel sentences, lexicons, and named entities. Finally, the tool most related to ours, CorpusPedia (http://gramatica.usc.es/pln/tools/CorpusPedia.html), extracts non-aligned, softly-aligned, and strongly-aligned comparable corpora from Wikipedia (Otero and López, 2010). The difference with respect to our model is that they only consider the articles associated to one specific category and not to an entire domain.

The inter-connection among Wikipedia editions in different languages has been exploited for multiple tasks, including lexicon induction (Erdmann et al., 2008), extraction of bilingual dictionaries (Yu and Tsujii, 2009), and identification of particular translations (Chu et al., 2014; Prochasson and Fung, 2011). Different cross-language NLP tasks have particularly taken advantage of Wikipedia. Articles have been used for query translation (Schönhofen et al., 2007) and for cross-language semantic representations for similarity estimation (Cimiano et al., 2009; Potthast et al., 2008; Sorg and Cimiano, 2012). The extraction of parallel corpora from Wikipedia has been a hot topic during the last years (Adafre and de Rijke, 2006; Patry and Langlais, 2011; Plamada and Volk, 2012; Smith et al., 2010; Tomás et al., 2008; Yasuda and Sumita, 2008).

3 Domain-Specific Comparable Corpora Extraction

In this section we describe our proposal to extract domain-specific comparable corpora from Wikipedia. The input to the pipeline is the top category of the domain (e.g., Sport). The terminology used in this description is as follows. Let c be a Wikipedia category and c* be the top category of a domain. Let a be a Wikipedia article; a ∈ c if a contains c among its categories. Let G be the Wikipedia category graph.

Vocabulary definition. The domain vocabulary represents the set of terms that best characterises the domain. We do not expect to have at our disposal the vocabulary associated to every category. Therefore, we build it from Wikipedia itself. We collect every article a ∈ c* and apply standard pre-processing, i.e., tokenisation, stopwording, filtering of numbers and punctuation marks, and stemming (Porter, 1980). In order to reduce noise, tokens shorter than four characters are discarded as well. The vocabulary is then composed of the top n terms, ranked by term frequency; this value is empirically determined.
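To make the vocabulary-definition step concrete, the following is a minimal Python sketch; it is our illustrative rendering, not the authors' released code. The function name build_domain_vocabulary and the root_articles iterable (plain texts of the articles filed directly under c*) are assumptions for illustration, and the NLTK Porter stemmer stands in for Porter (1980).

```python
from collections import Counter
import re

from nltk.stem import PorterStemmer  # stand-in for Porter (1980) stemming


def build_domain_vocabulary(root_articles, stopwords, top_fraction=0.10):
    """Build the characteristic vocabulary of a domain from the articles
    directly associated with its top category c*.

    root_articles: iterable of article plain texts (an assumption; the paper
    obtains the articles from a Wikipedia dump pre-processed with JWPL).
    """
    stemmer = PorterStemmer()
    counts = Counter()
    for text in root_articles:
        # Tokenise and lowercase; the character class loosely covers
        # English and Spanish letters, dropping numbers and punctuation.
        tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
        for tok in tokens:
            # Discard stopwords and tokens shorter than four characters.
            if len(tok) < 4 or tok in stopwords:
                continue
            counts[stemmer.stem(tok)] += 1
    # Keep the top n terms by frequency; the paper empirically finds that
    # the top 10% of the ranked list works best.
    n = max(1, int(top_fraction * len(counts)))
    return {term for term, _ in counts.most_common(n)}
```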

Graph exploration. The input to this step is G, c* (i.e., the departing node in the graph), and the domain vocabulary. Departing from c*, we perform a breadth-first search, looking for all those categories which most likely belong to the required domain. Two constraints are applied in order to make a controlled exploration of the graph: (i) to avoid loops and the exploration of already traversed paths, a node can only be visited once; (ii) to avoid exploring the whole category graph, a stopping criterion is pre-defined. Our stopping criterion is inspired by the classification tree-breadth first search algorithm (Cui et al., 2008). The core idea is scoring the explored categories to determine whether they belong to the domain. Our heuristic assumes that a category belongs to the domain if its title contains at least one of the terms in the characteristic vocabulary. Nevertheless, many categories exist that may not include any of the terms in the vocabulary (e.g., consider the category pato in Spanish —literally "duck" in English— which, somewhat surprisingly, refers to a sport rather than an animal). Our naïve solution to this issue is to consider subsets of categories according to their depth with respect to the root. An entire level of categories is considered part of the domain if a minimum percentage of its elements include vocabulary terms.
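The controlled breadth-first exploration with this level-wise stopping criterion can be sketched as follows, again as an illustration under assumptions rather than the actual implementation: graph is a hypothetical adjacency dictionary mapping a category title to its sub-category titles, and matching a title against the (stemmed) vocabulary is simplified to substring containment.

```python
def explore_category_graph(graph, root, vocabulary, min_ratio=0.5):
    """Return the categories accepted as in-domain.

    graph: dict from a category title to the list of its sub-categories
    (a hypothetical adjacency representation of Wikipedia's category graph).
    A whole depth level is accepted only if at least min_ratio of its
    categories contain some vocabulary term in their title.
    """
    def in_vocabulary(title):
        # Simplification: substring match of stemmed terms in the title.
        return any(term in title.lower() for term in vocabulary)

    accepted = []
    visited = {root}
    frontier = [root]
    while frontier:
        hits = sum(1 for cat in frontier if in_vocabulary(cat))
        if hits / len(frontier) < min_ratio:
            break  # stopping criterion: this level is mostly out of domain
        accepted.extend(frontier)
        next_frontier = []
        for cat in frontier:
            for sub in graph.get(cat, ()):
                if sub not in visited:  # each node is visited at most once
                    visited.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return accepted
```

Exploring level by level, rather than node by node, is what allows the percentage test to be applied to an entire depth of the graph at once.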

In our experiments we use the English and Spanish Wikipedia editions (dumps downloaded from https://dumps.wikimedia.org in July 2013 and pre-processed with JWPL (Zesch et al., 2008); https://code.google.com/p/jwpl/). Table 1 shows some statistics, after filtering out disambiguation and redirect pages. The intersection of articles and categories between the two languages represents the ceiling for the amount of parallel corpora one can gather for this pair. We focus on three domains: Computer Science (CS), Science (Sc), and Sports (Sp) —the top categories c* from which the graph is explored in order to extract the corresponding comparable corpora.

Edition        Articles    Categories   Ratio
English        4,123,676   1,032,222    4.0
Spanish        965,543     210,803      4.6
Intersection   631,710     107,313      –

Table 1: Amount of articles and categories in the Wikipedia editions and in the intersection (i.e., pages linked across languages).

Table 2 shows the number of root articles associated to c* for each domain and language. From them, we obtain domain vocabularies with a size between 100 and 400 lemmas (right-hand columns) when using the top 10% of terms. We ran experiments using the top 10%, 15%, 20% and 100% of terms. The relatively small size of these vocabularies allows us to check manually that 10% is the best option to characterise the desired category; higher percentages add more noise than in-domain terms.

          Articles      Vocabulary
Domain    en     es     en     es
CS        4      130    106    447
Sc        29     3      464    140
Sp        3      10     122    100

Table 2: Number of articles in the root categories and size of the resulting domain vocabulary.

The plots in Figure 2 show the percentage of categories with at least one domain term in the title: the starting point for our graph-based method for selecting the in-domain articles. As expected, nearly 100% of the categories in the root include domain terms, and this percentage decreases with increasing depth in the tree.

[Figure 2: Percentage of categories with at least one domain term in the title, as a function of depth, for the two languages (English and Spanish) and the three domains under study (CS, Sc, Sp).]

When extracting the corpus, one must decide the adequate percentage of positive categories allowed. High thresholds lead to small corpora, whereas low thresholds lead to larger —but noisier— corpora. As in many applications, this is a trade-off between precision and recall, and it depends on the intended use of the corpus. Table 3 shows some numbers for two different thresholds. Increasing the threshold does not always mean lowering the selected depth but, when it does, the difference in the number of extracted articles can be significant. The same table shows the number of article pairs extracted for each value: the resulting comparable corpus for each domain. The stopping level is selected for every language independently but, in order to reduce noise, the comparable corpus is only built from those articles that appear in both languages and are related via an interlanguage link. We validate the quality, in terms of application-based utility, of the generated comparable corpora when used in a translation system (cf. Section 5). Therefore, we choose to give more importance to recall and opt for the corpora obtained with a threshold of 50%.

      Article pairs (en-es)   Distance from the root
      50%        60%          50% (en, es)   60% (en, es)
CS    18,168     8,251        6, 5           5, 5
Sc    161,130    21,459       6, 4           4, 4
Sp    72,315     1,980        8, 8           3, 4

Table 3: Number of article pairs according to the percentage of positive categories used to select the levels of the graph, and distance from the root at which the percentage drops below the desired one.

4 Parallel Sentence Extraction

In this section we describe a simple technique for extracting parallel sentences from a comparable corpus. Given a pair of articles related by an interlanguage link, we estimate the similarity between all their pairs of cross-language sentences with different text similarity measures. We repeat the process for all the pairs of articles and rank the resulting sentence pairs according to their similarity. After defining a threshold for each measure, those sentence pairs with a similarity higher than the threshold are extracted as parallel sentences. This is an unsupervised method that generates a noisy parallel corpus. The quality of the similarity measures will then affect the purity of the parallel corpus and, therefore, the quality of the translator. However, we do not need to be very restrictive with the measures here and can still favour a large corpus, since the word alignment process in the SMT system can take care of part of the noise.

Similarity computation. We compute similarities between pairs of sentences by means of cosine and length factor measures. The cosine similarity is calculated on three well-known characterisations in cross-language information retrieval and parallel corpora alignment: (i) character n-grams (cng) (McNamee and Mayfield, 2004); (ii) pseudo-cognates (cog) (Simard et al., 1992); and (iii) word 1-grams, after translation into a common language, both from English to Spanish and vice versa (mono_en, mono_es). We add (iv) the length factor (len) (Pouliquen et al., 2003), both as an independent measure and as a penalty (multiplicative factor) on the cosine similarity.

The threshold for each of the measures just introduced is set empirically on a manually annotated corpus. We define it as the value that maximises the F1 score on this development set. To create this set, we manually annotated a corpus of 30 article pairs (10 per domain) at sentence level. We considered three sentence classes: parallel, comparable, and other. The volunteers in the exercise were given as guidelines the definitions by Skadiņa et al. (2010) of parallel text and strongly comparable text (cf. Section 2). A pair that did not match either of these definitions had to be classified as other. Each article pair was annotated by two volunteers, native speakers of Spanish with a high command of English (a total of nine volunteers participated in the process). The mean agreement between annotators had a kappa coefficient (Cohen, 1960) of κ ∼ 0.7. A third annotator resolved disagreements. The corpus is publicly available at http://www.cs.upc.edu/~cristinae/CV/recursos.php.

Table 4 shows the thresholds that obtain the maximum F1 scores. It is worth noting that, even if the values of precision and recall are relatively low —the maximum recall is 0.57 for len—, our intention with these simple measures is not to obtain the highest performance in terms of retrieval, but to inject the most useful data into the translator, even at the cost of some noise. The performance with character 3-grams is the best one, comparable to that of mono, with an F1 of 0.36. This suggests that a translator is not mandatory for performing the sentence selection. Len and word 1-grams have no discriminating power and lead to the worst scores (F1 of 0.14 and 0.21, respectively). We ran a second set of experiments to explore the combination of the measures.
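Before turning to the combined measures, the basic machinery —cosine over character n-grams with the length factor as a multiplicative penalty— can be sketched as below. This is a simplified reading of the measures: in particular, the length factor is rendered here as a Gaussian penalty on the character-length ratio, in the spirit of Pouliquen et al. (2003), with placeholder parameter values rather than the exact parameterisation used in the paper.

```python
import math
from collections import Counter


def char_ngrams(sentence, n=3):
    """Bag of character n-grams of a sentence (c3g for n=3)."""
    s = sentence.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    common = set(v1) & set(v2)
    dot = sum(v1[k] * v2[k] for k in common)
    norm = math.sqrt(sum(x * x for x in v1.values())) * \
           math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0


def length_factor(src, tgt, mu=1.0, sigma=0.3):
    """Gaussian penalty on the character-length ratio of the pair.
    mu and sigma would be estimated from a parallel corpus; the
    values here are placeholders, not the paper's parameters."""
    if not src or not tgt:
        return 0.0
    ratio = len(tgt) / len(src)
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)


def c3g_with_len(src, tgt):
    # Cosine over character 3-grams, penalised by the length factor.
    return cosine(char_ngrams(src), char_ngrams(tgt)) * length_factor(src, tgt)
```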

         c1g    c2g    c3g    c4g    c5g    cog    mono_en  mono_es  len
Thres.   0.95   0.60   0.25   0.20   0.15   0.30   0.20     0.15     0.90
P        0.18   0.29   0.28   0.24   0.23   0.16   0.30     0.26     0.08
R        0.25   0.31   0.53   0.47   0.47   0.49   0.46     0.34     0.57
F1       0.21   0.30   0.36   0.32   0.31   0.24   0.36     0.30     0.14

Table 4: Best thresholds and their associated precision (P), recall (R), and F1.
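The thresholds in Table 4 are the values that maximise F1 on the annotated development set. A minimal grid search of this kind could look as follows; scored_pairs is a hypothetical list of (similarity, is_parallel) tuples obtained from the 30 annotated article pairs, an assumed input format for illustration.

```python
def best_threshold(scored_pairs, steps=20):
    """Pick the threshold maximising F1 over a development set.

    scored_pairs: list of (similarity, is_parallel) tuples; an assumption
    standing in for the sentence pairs of the annotated development set.
    """
    best = (0.0, 0.0)  # (f1, threshold)
    for i in range(1, steps):
        thr = i / steps
        tp = sum(1 for s, gold in scored_pairs if s >= thr and gold)
        fp = sum(1 for s, gold in scored_pairs if s >= thr and not gold)
        fn = sum(1 for s, gold in scored_pairs if s < thr and gold)
        if tp == 0:
            continue  # undefined precision/recall at this threshold
        p, r = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * p * r / (p + r)
        best = max(best, (f1, thr))
    return best[1], best[0]  # threshold and its F1
```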

Table 5 shows the performance obtained by averaging all the similarities (S̄), also after multiplying them by the length factor and/or by the observed F1 obtained in the previous experiment. Even though the length factor had shown a poor performance in isolation, it helps to lift the F1 figures consistently when applied to the similarities: in this case, F1 grows up to 0.43. The impact is not so relevant when the individual F1 is used for weighting S̄.

         S̄      S̄·len   S̄·F1   S̄·F1·len
Thres.   0.25   0.15    0.05   0.05
P        0.27   0.33    0.18   0.32
R        0.50   0.62    0.77   0.65
F1       0.35   0.43    0.29   0.43

Table 5: Precision, recall, and F1 for the average of the similarities (S̄) weighted by the length model (len) and/or their F1.

We applied all the measures —both combined and in isolation— to the entire comparable corpora previously extracted. Table 6 shows the amount of parallel sentences extracted by applying the empirically defined thresholds of Tables 4 and 5. As expected, more flexible alternatives, such as low-order n-grams or the length factor, result in a higher number of retrieved instances, but in all cases the size of the corpora is remarkable. For the most restricted domain, CS, we get around 200k parallel sentences for a given similarity measure. For the widest domain, Sc, we surpass 1M sentence pairs. As will be shown in the following section, these sizes are already useful for training SMT systems; some standard parallel corpora have the same order of magnitude. For tasks other than MT, where the precision of the extracted pairs can be more important than recall, one can obtain cleaner corpora by using a threshold that maximises precision instead of F1.

            CS        Sc          Sp
c1g         207,592   1,585,582   404,656
c2g         99,964    745,821     326,882
c3g         96,039    724,210     335,147
c4g         110,701   863,090     394,105
c5g         126,692   1,012,993   466,007
cog         182,981   1,215,008   451,941
len         271,073   1,941,866   550,338
mono_en     211,209   1,367,917   461,731
mono_es     183,439   1,273,509   435,671
S̄           154,917   1,098,453   450,933
S̄·len       121,697   957,662     390,783
S̄·F1        153,056   1,085,502   448,076
S̄·F1·len    121,407   957,967     392,241

Table 6: Size of the parallel corpora extracted with each similarity measure.

5 Evaluation: Statistical Machine Translation Task

In this section we validate the quality of the obtained corpora by studying their impact on statistical machine translation. Several parallel corpora exist for the English–Spanish language pair. We select as a general-purpose corpus Europarl v7 (Koehn, 2005), with 1.97M parallel sentences. The order of magnitude is similar to that of the largest corpus we have extracted from Wikipedia, so we can compare the results in a size-independent way. If our corpus extracted from Wikipedia is made up of parallel fragments of the desired domain, it should be the most adequate for translating these domains. If the quality of the parallel fragments is acceptable, it should also help when translating out-of-domain texts. In order to test these hypotheses we analyse three settings: (i) training SMT systems only with Wikipedia (WP) or Europarl (EP) to translate domain-specific texts; (ii) training SMT systems with Wikipedia and Europarl to translate domain-specific texts; and (iii) training SMT systems with Wikipedia and Europarl to translate out-of-domain texts (news).

For the out-of-domain evaluation we use the News Commentaries 2011 test set and the News Commentaries 2009 set for development (both available at http://www.statmt.org/wmt14/translation-task.html). For the in-domain evaluation we build the test and development sets in a semi-automatic way. We depart from the parallel corpora gathered in Section 4, from which sentences with more than four tokens and beginning with a letter are selected. We estimate their perplexity with respect to a language model obtained from Europarl in order to select the most fluent sentences, and we then rank the parallel sentences according to their similarity and perplexity. The top-n fragments were manually revised and extracted to build the Wikipedia test (WPtest) and development (WPdev) sets. We repeated the process for the three studied domains and drew 300 parallel fragments for development and 500 for test for every domain. We removed these sentences from the corresponding training corpora. For one of the domains, CS, we also gathered a test set from a parallel corpus of GNOME localisation files (Tiedemann, 2012). Table 7 shows the size, in number of sentences, of these test sets and of the 20 Wikipedia training sets used for translation. Only one measure, the one with the highest F1 score, is selected from each family: c3g, cog, mono_en and S̄·len (cf. Tables 4 and 5). We also compile the corpus that results from the union of the previous four. Notice that, although we eliminate duplicates from this corpus, the size of the union is close to the sum of the individual corpora. This indicates that every similarity measure selects a different set of parallel fragments. Besides the specialised corpus for each domain, we build a larger corpus with all the data (Un). Again, duplicate fragments coming from articles belonging to more than one domain are removed.

           CS        Sc          Sp          Un
c3g        95,715    723,760     334,828     883,366
cog        182,283   1,213,965   451,324     1,430,962
mono_en    210,664   1,367,169   461,237     1,638,777
S̄·len      120,835   956,346     389,975     1,160,977
union      577,428   3,847,381   1,181,664   4,948,241
WPdev      300       300         300         900
WPtest     500       500         500         1500
GNOME      1000      –           –           –

Table 7: Number of sentences of the Wikipedia parallel corpora used to train the SMT systems (top rows) and of the sets used for development and test.

SMT systems are trained using standard, freely available software. We estimate a 5-gram language model using interpolated Kneser–Ney discounting with SRILM (Stolcke, 2002). Word alignment is done with GIZA++ (Och and Ney, 2003), and both phrase extraction and decoding are done with Moses (Koehn et al., 2007). We optimise the feature weights of the model with Minimum Error Rate Training (MERT) (Och, 2003) against the BLEU evaluation metric (Papineni et al., 2002). Our model considers the language model, direct and inverse phrase probabilities, direct and inverse lexical probabilities, phrase and word penalties, and a lexicalised reordering.

(i) Training systems with Wikipedia or Europarl for domain-specific translation. Table 8 shows the evaluation results on WPtest. All the specialised systems obtain significant improvements with respect to the Europarl system, regardless of their size. For instance, the worst specialised system (c3g, with only 95,715 sentences for CS) outperforms the general Europarl translator by more than 10 BLEU points. The most complete system (the union of the four representatives) doubles the BLEU score for all the domains, with an impressive improvement of 30 points. This is of course possible due to the nature of the test set, which has been extracted from the same collection as the training data and therefore shares its structure and vocabulary.

           CS      Sc      Sp      Un      Comp.
Europarl   27.99   34.00   30.02   30.63   –
c3g        38.81   40.53   46.94   43.68   43.68
cog        57.32   56.17   57.60   58.14   54.89
mono_en    54.27   52.96   55.74   55.17   52.45
S̄·len      56.14   57.40   58.39   58.80   56.78
union      64.65   62.95   62.65   64.47   –

Table 8: BLEU scores obtained on the Wikipedia test sets for the 20 specialised systems described in Section 5. A comparison column (Comp.), where all the systems are trained with corpora of the same size, is also included (see text).

To put these high numbers in perspective, we evaluate the systems trained on the CS domain against the GNOME dataset (Table 9).

           CS      Un      Comp.
c3g        11.08   9.56    9.56
cog        18.48   17.66   16.31
mono_en    19.48   20.58   18.84
S̄·len      20.71   20.56   19.76
union      22.41   20.63   –

Table 9: BLEU scores obtained on the GNOME test set for systems trained only with Wikipedia. A system trained on Europarl achieves a score of 18.15.

Except for c3g, the Wikipedia translators always outperform the baseline trained on EP; the union system improves on it by 4 BLEU points (22.41 compared to 18.15) with a four-times-smaller corpus. This confirms that a corpus automatically extracted with an F1 smaller than 0.5 is still useful for SMT. Notice also that using only the in-domain data (CS) is always better than using the whole WP corpus (Un), even though the former is in general ten times smaller (cf. Table 7).

According to this indirect evaluation of the similarity measures, character n-grams (c3g) represent the worst alternative. These results contradict the direct evaluation, where c3g and mono_en had the highest F1 scores on the development set among the individual similarity measures. The size of the corpus is not relevant here: when we train all the systems with the same amount of data, the ranking of the quality of the measures remains the same. To see this, we trained four additional systems with the top m parallel fragments, where m is the size of the smallest corpus for the union of domains, Un-c3g. This new comparison is reported in the "Comp." columns of Tables 8 and 9. In this fair comparison, c3g is still the worst measure and S̄·len the best one. The translator built from its associated corpus outperforms the general one with less than half of the training data (883,366 vs. 1,965,734 parallel fragments), both on WPtest (56.78 vs. 30.63) and on GNOME (19.76 vs. 18.15).

(ii) Training systems on Wikipedia and Europarl for domain-specific translation. Now we enrich the general translator with Wikipedia data or, equivalently, complement the Wikipedia translator with out-of-domain data. Table 10 shows the results. Augmenting the in-domain corpus by 2 million fragments improves the results even more, by about 2 BLEU points when using all the union data. System c3g benefits the most from the inclusion of the Europarl data; the reason is that it is the individual system with the least corpus available and the one obtaining the worst results. In fact, the better the Wikipedia system, the less important the contribution from Europarl. For the independent test set GNOME, Table 11 shows that the union corpus on CS is better than any combination of Wikipedia and Europarl. Still, as mentioned above, the best performance on this test set is obtained with a pure in-domain system (cf. Table 9).

              CS      Sc      Sp      Un
Europarl      27.99   34.00   30.02   30.63
union         64.65   62.95   62.65   64.47
EP+c3g        46.07   48.29   50.40   49.34
EP+cog        58.39   57.70   59.05   58.98
EP+mono_en    54.44   53.93   56.05   55.88
EP+S̄·len      56.05   57.53   59.78   58.72
EP+union      66.22   64.24   64.39   65.67

Table 10: BLEU scores obtained on the Wikipedia test set for the 20 systems trained with the combination of the Europarl (EP) and Wikipedia corpora. The results of a Europarl system and of the best system from Table 8 (union) are shown for comparison.

              CS      Un
EP+c3g        19.78   19.49
EP+cog        21.09   20.14
EP+mono_en    21.27   20.66
EP+S̄·len      21.58   20.65
EP+union      22.37   21.43

Table 11: BLEU scores obtained on the GNOME test set for systems trained with Europarl and Wikipedia. A system trained on Europarl achieves a score of 18.15.

(iii) Training systems on Wikipedia and Europarl for out-of-domain translation. Finally, we check the performance of the Wikipedia translators on the out-of-domain news test set. Table 12 shows the results.

In this domain, neutral for both Europarl and Wikipedia, the in-domain Wikipedia systems show a lower performance: the BLEU score obtained by the Europarl system is 27.02, whereas the Wikipedia union system achieves 22.16. When the two corpora are combined, the results are controlled by the Europarl baseline. In general, systems in which we include only texts from an unrelated domain do not improve the performance of the Europarl system alone; the results of the combined system are better when we use Wikipedia texts from all the domains together (column Un) for training. This suggests that, as expected, a general Wikipedia corpus is necessary to build a general translator. That is a different problem to deal with.

              CS       Sc       Sp       Un
union         16.74    22.28    15.82    22.16
EP+c3g        26.06    26.35    26.81    27.07*
EP+cog        26.61    27.33*   26.71    27.08*
EP+mono_en    27.18*   26.80    26.96    27.44*
EP+S̄·len      27.59*   26.80    27.58*   27.22*
EP+union      26.76    27.52*   27.35*   26.72

Table 12: BLEU scores for the out-of-domain evaluation on the News Commentaries 2011 test set. We mark with an asterisk all the systems that improve on the Europarl translator, which achieves a score of 27.02.

6 Conclusions and Ongoing Work

In this paper we presented a model for the automatic extraction of in-domain comparable corpora from Wikipedia. It makes possible the automatic extraction of monolingual and comparable article collections, as well as one-click parallel corpus generation, for on-demand language pairs and domains. Given a pair of languages and a main category, the model explores the Wikipedia category graph and identifies a subset of categories (and their associated articles) from which to generate a document-aligned comparable corpus. The resulting corpus can be exploited for multiple natural language processing tasks. Here we applied it as part of a pipeline for the extraction of domain-specific parallel sentences. These parallel instances allowed for a significant improvement in machine translation quality when compared to a generic system and applied to a domain-specific corpus (in-domain). The experiments are shown for the English–Spanish language pair and the domains Computer Science, Science, and Sports, but the approach can be applied to other language pairs and domains. The prototype is currently operating for other languages; the only prerequisites are the existence of the corresponding Wikipedia edition and some basic processing tools such as a tokeniser and a lemmatiser. Our current efforts are aimed at a more robust model for parallel sentence identification and at the design of other indirect evaluation schemes to validate the model's performance.

Acknowledgments

This work was partially funded by the TACARDI project (TIN2012-38523-C02) of the Spanish Ministerio de Economía y Competitividad (MEC).

References

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding Similar Sentences across Multiple Languages in Wikipedia. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 62–69.

Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors. 2008. Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008), Marrakech, Morocco. European Language Resources Association (ELRA).

Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2014. Iterative Bilingual Lexicon Extraction from Comparable Corpora with Topical and Contextual Knowledge. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 8404 of Lecture Notes in Computer Science, pages 296–309. Springer Berlin Heidelberg.

Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Sorg, and Steffen Staab. 2009. Explicit Versus Latent Concept Models for Cross-language Information Retrieval. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, pages 1513–1518, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Gaoying Cui, Qin Lu, Wenjie Li, and Yirong Chen. 2008. Corpus Exploitation from Wikipedia for Ontology Construction. In Calzolari et al. (2008), pages 2126–2128.

EAGLES. 1996. Preliminary Recommendations on Corpus Typology. EAGLES Document EAG–TCWG–CTYP/P.

Maike Erdmann, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. 2008. An Approach for Extracting Bilingual Terminology from Wikipedia. In Proceedings of the 13th International Conference on Database Systems for Advanced Applications, DASFAA'08, pages 380–392, Berlin, Heidelberg. Springer-Verlag.

Pascale Fung and Percy Cheung. 2004. Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of EMNLP, pages 57–63, Barcelona, Spain, July.

Pascale Fung. 1995. Compiling Bilingual Lexicon Entries from a Non-Parallel English-Chinese Corpus. In Proceedings of the Third Annual Workshop on Very Large Corpora, pages 173–183.

Pascale Fung. 1998. A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora. Lecture Notes in Computer Science, 1529:1–17.

Brent Hecht and Darren Gergle. 2010. The Tower of Babel Meets Web 2.0: User-generated Content and Its Applications in a Multilingual Context. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 291–300, New York, NY, USA. ACM.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Machine Translation Summit X, pages 79–86.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Belinda Maia. 2003. What are comparable corpora? In Proceedings of the Corpus Linguistics Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives.

Anthony M. McEnery and Zhonghua Xiao. 2007. Parallel and comparable corpora: What are they up to? In Incorporating Corpora: Translation and the Linguist, Translating Europe. Multilingual Matters.

Paul McNamee and James Mayfield. 2004. Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, 7(1-2):73–97.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51. See also http://www.fjoch.com/GIZA++.html.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan.

Pablo Gamallo Otero and Isaac González López. 2010. Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, pages 21–25. Available at http://www.fb06.uni-mainz.de/lk/bucc2010/documents/Proceedings-BUCC-2010.pdf.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 311–318, Philadelphia, PA. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia. In Pierre Zweigenbaum, Reinhard Rapp, and Serge Sharoff, editors, Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 87–95, Portland, Oregon. Association for Computational Linguistics.

Mārcis Pinnis, Radu Ion, Dan Ştefănescu, Fangzhong Su, Inguna Skadiņa, Andrejs Vasiļjevs, and Bogdan Babych. 2012. ACCURAT toolkit for multi-level alignment and information extraction from comparable corpora. In Proceedings of the ACL 2012 System Demonstrations, ACL'12, pages 91–96, Stroudsburg, PA, USA. Association for Computational Linguistics.

Magdalena Plamada and Martin Volk. 2012. Towards a Wikipedia-extracted alpine corpus. In The Fifth Workshop on Building and Using Comparable Corpora, May.

Martin F. Porter. 1980. An Algorithm for Suffix Stripping. Program, 14:130–137.

Martin Potthast, Benno Stein, and Maik Anderka. 2008. A Wikipedia-Based Multilingual Retrieval Model. In Advances in Information Retrieval, 30th European Conference on IR Research, LNCS 4956, pages 522–530. Springer-Verlag.

Bruno Pouliquen, Ralf Steinberger, and Camelia Ignat. 2003. Automatic Identification of Document Translations in Large Multilingual Document Collections. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-2003), pages 401–408, Borovets, Bulgaria.

Emmanuel Prochasson and Pascale Fung. 2011. Rare Word Translation Extraction from Aligned Comparable Documents. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1327–1335, Stroudsburg, PA, USA. Association for Computational Linguistics.

Reinhard Rapp. 1995. Identifying Word Translations in Non-Parallel Texts. CoRR, cmp-lg/9505037.

Jennifer Rowley and Richard Hartley. 2000. Organizing Knowledge. An Introduction to Managing Access to Information. Ashgate, 3rd edition.

Péter Schönhofen, András A. Benczúr, István Bíró, and Károly Csalogány. 2007. Cross-language retrieval with Wikipedia. In Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers, pages 72–79.

Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. 2013. Overviewing Important Aspects of the Last Twenty Years of Research in Comparable Corpora. In Building and Using Comparable Corpora. Springer.

Michel Simard, George F. Foster, and Pierre Isabelle. 1992. Using Cognates to Align Sentences in Bilingual Corpora. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation.

Inguna Skadiņa, Ahmet Aker, Voula Giouli, Dan Tufiş, Robert Gaizauskas, Madara Mieriņa, and Nikos Mastropavlos. 2010. A collection of comparable corpora for under-resourced languages. In Human Language Technologies – The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010, pages 161–168, Amsterdam, The Netherlands. IOS Press.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 403–411, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Sorg and Philipp Cimiano. 2012. Exploiting Wikipedia for Cross-lingual and Multilingual Information Retrieval. Data & Knowledge Engineering, 74:26–45, April.

Dan Ştefănescu, Radu Ion, and Sabine Hunsicker. 2012. Hybrid Parallel Sentence Mining from Comparable Corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy. European Association for Machine Translation.

Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In International Conference on Spoken Language Processing, Denver, Colorado.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Jesús Tomás, Jordi Bataller, Francisco Casacuberta, and Jaime Lloret. 2008. Mining Wikipedia as a parallel and comparable corpus. Language Forum, 34(1). Article presented at CICLing-2008, 9th International Conference on Intelligent Text Processing and Computational Linguistics, February 17–23, 2008, Haifa, Israel.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Chu-Ren Huang and Dan Jurafsky, editors, Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 1101–1109, Beijing, China, August. COLING 2010 Organizing Committee.

Dekai Wu and Pascale Fung. 2005. Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In Natural Language Processing - IJCNLP 2005, Second International Joint Conference, pages 257–268, Jeju Island, Korea, October 11–13.

Keiji Yasuda and Eiichiro Sumita. 2008. Method for Building Sentence-Aligned Corpus from Wikipedia. Association for the Advancement of Artificial Intelligence.

Kun Yu and Junichi Tsujii. 2009. Bilingual dictionary extraction from Wikipedia. In Proceedings of Machine Translation Summit XII.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Calzolari et al. (2008).

Knowledge-lean projection of coreference chains across languages

Yulia Grishina and Manfred Stede
Applied Computational Linguistics, University of Potsdam
[email protected], [email protected]

Abstract

Common technologies for automatic coreference resolution require either a language-specific rule set or large collections of manually annotated data, which is typically limited to newswire texts in major languages. This makes it difficult to develop coreference resolvers for a large number of the so-called low-resourced languages. We apply a direct projection algorithm on a multi-genre and multilingual corpus (English, German, Russian) to automatically produce coreference annotations for two target languages without exploiting any linguistic knowledge of the languages. Our evaluation of the projected annotations shows promising results, and the error analysis reveals structural differences of referring expressions and coreference chains for the three languages, which can now be targeted with more linguistically-informed projection algorithms.

1 Introduction

Coreference resolution requires relatively expensive resources, usually in terms of manual annotation. To alleviate this problem for low-resourced languages, techniques of annotation projection can be applied. In this paper, we report on experiments with projecting nominal coreference chains across bilingual corpora. Our goal is to see how well a knowledge-lean projection algorithm works for two relatively similar languages (English-German) and for less similar languages (English-Russian). Furthermore, we are interested in differences incurred by the text genre and therefore use three different genres: argumentative newspaper articles, narratives, and medicine instruction leaflets.

Our general aim is to explore the limitations of a knowledge-lean approach to the problem, so that it is easy to generalize to other low-resourced languages. For the annotation of the corpus, we created common annotation guidelines that make few assumptions on the structural features of the target languages. We used the guidelines to annotate texts of the three genres in the three languages, and provide results on inter-annotator agreement (see Section 3). For projection, we use a procedure based on sentence and word alignment as calculated by a standard tool (GIZA++) that was trained on corpora of moderate size. Thus at this point we deliberately do not apply linguistic knowledge on the languages involved. The experiments and results are described in Section 4. We present a qualitative error analysis showing that a number of structural divergences are responsible for many of the problems; this suggests that limited syntactic knowledge can be helpful for improving performance in follow-up work. Section 5 compares our results to the most closely related earlier work, and Section 6 concludes.

2 Related work

A projection approach is used to automatically transfer different types of linguistic annotation from one language to another. The idea of mapping from well-studied languages to low-resourced languages was initially introduced in the work of Yarowsky et al. (2001), who studied the induction of PoS and NE taggers, NP chunkers and morphological analyzers for different languages using annotation projection. Thereafter, the technique has been used for a variety of

tasks, including PoS tagging and syntactic parsing (Hwa et al., 2005; Ozdowska, 2006; Tiedemann, 2014), semantic role labelling (Padó and Lapata, 2005), sentiment analysis (Mihalcea et al., 2007), mention detection (Zitouni and Florian, 2008), or named-entity recognition (Ehrmann et al., 2011).

To our knowledge, the first application to coreference is due to Harabagiu and Maiorano (2000), who experimented with manually projecting coreference chains from English to Romanian using a translated parallel corpus. They showed that a coreference resolver trained on a parallel corpus can achieve better results than one trained on monolingual data. Then, Postolache and colleagues (2006) used automatic word alignment to project coreference annotations for the same data. Their goal was to achieve high precision, and thus they discarded from projection those referring expressions (henceforth: REs) whose syntactic heads were not properly aligned. Their results indeed show high precision (over 95%), but considerably lower recall (around 70%). We will discuss their approach in relation to ours in Section 5.

Mitkov and Barbu (2002) performed anaphora resolution using projection on a parallel English-French corpus, which led to an improvement in the success rate of roughly 4% for both English and French. Sayeed et al. (2009) used cross-lingual projection to improve the detection of coreferent named entities with the help of English-Arabic translations, and they reported better results than a monolingual resolver could achieve. Rahman and Ng (2012) used translation-based projection to train a coreference resolver, and achieved around 90% of the average F-scores of a supervised resolver in experiments with Spanish and Italian using few resources (only a mention extractor) for the target languages.

3 Multilingual coreference corpus

3.1 The corpus

Our corpus consists of 38 parallel texts in English, German and Russian, belonging to three genres: newswire articles (7 texts per language), short stories (3 texts per language), and medicine instruction leaflets (4 per language, only English-German).[1] This choice is motivated by (i) the common observation that narrative texts are easier to process for coreference, (ii) the fact that news text is important for many applications, and (iii) the consideration of medical leaflets representing a somewhat "exotic" genre that exhibits many differences to the other two.

Corpus statistics are shown in Table 1. The stories contain more REs than the newswire texts, and the coreference chains of the stories tend to be much longer.

3.2 Annotation

Usually, coreference annotation guidelines have been designed with one target language in mind. In contrast, our goal was to have common guidelines for the three languages, in order to (i) obtain uniform nominal coreference annotations in our corpus (supporting the projection task), and (ii) facilitate extension to further languages. Regarding English, our guidelines are of similar length and quite compatible with the scheme used for OntoNotes - the largest annotated coreference corpus for the English language (Hovy et al., 2006). One exception is that we handle only NPs and do not annotate verbs that are coreferent with NPs.

Our guidelines borrow many decisions from the (relatively language-neutral) Potsdam Coreference Scheme (PoCoS) (Krasavina and Chiarcos, 2007), and we also considered the recently developed guidelines for the English-German parallel corpus ParCor (Guillou et al., 2014). But ParCor considers only pairwise annotation of anaphoric pronouns and their antecedents, whereas we annotate all REs appearing in a coreference chain (i.e. that are mentioned in the text at least twice).

For the time being, our annotation is restricted to referential identity; we thus exclude cases of 'bridging' (also called 'indirect anaphora') or near-identity. The following types of REs are considered as markables: full NPs, proper names, and pronouns (personal, demonstrative, relative, reflexive, and pronominal adverbs). As in OntoNotes, generic nouns can corefer with definite full NPs or pronouns, but not with other generic nouns. In case of English nominal premodifiers, we only annotate a nominal premodifier if it can refer to a named entity (the [US]1 politicians) or is an independent noun in the Genitive form ([creditor's]1 choice); in all other cases, nominal premodifiers are not annotated as separate markables (e.g., [bank account]).

[1] Newswire is taken from the multilingual newswire agency Project Syndicate (www.project-syndicate.org). Stories are taken from an online collection of parallel texts for second language acquisition (http://www.lonweb.org). Medical texts are from the EMEA subcorpus of the OPUS collection of parallel corpora (Tiedemann, 2009).

             Newswire           Stories            Medicine leaflets   Total
             En    De    Ru     En    De    Ru     En    De            En     De     Ru
Tokens       5903  6268  5763   2619  2642  2343   3386  3002          11908  11912  8106
Sentences    239   252   239    190   186   192    160   160           589    598    431
REs          558   589   606    470   497   479    322   309           1350   1395   1085
Chains       124   140   140    45    45    48     90    88            259    273    188

Table 1: Statistics for the experimental corpus

When annotators identify a markable, they also record its RE type from an attribute menu. The markable span includes the syntactic head of the NP and all its modifiers, except for dependent relative clauses (because relative pronouns are treated as separate markables). As a divergence from OntoNotes, which has a separate relation for appositions, we only include them in the head NP markable. Technically, we used the MMAX-2 coreference annotation tool[2], and the corpus was tokenized and split into sentences using the Europarl preprocessing tools[3]. Table 2 shows a breakdown of NP types of our markables for the three genres.

                    Newswire  Stories  Med. leaflets
Named Entities      39.3      27.5     48.0
Personal pronouns   15.9      51.4     8.2
Definite NP         30.1      16.1     16.9
Relative pronouns   9.9       1.1      14.4
Indefinite NP       4.7       3.5      12.3
Other               0.1       0.4      0.2

Table 2: Types of NPs in the three genres (%)

3.3 Agreement

The English-German corpus was annotated by two lightly-trained independent annotators - students of linguistics. (For Russian, we had only one annotator available; the agreement study will therefore be done later.) For markables, we computed the inter-annotator agreement using Cohen's kappa in two settings: binary overlap and proportional overlap. For binary overlap, we consider two markables as "agreed" if they overlap by at least one token; proportional overlap measures the extent to which annotators agree on the identification of spans (number of overlapping tokens). For the coreference annotation, we computed MUC scores with strict mention matching. The results for the newswire texts and stories are shown in Table 3. For the medical leaflets, the results are somewhat lower: κ = 0.76 with binary overlap and 0.67 with proportional overlap; the MUC score is 70%. For the NP type attribute, Cohen's kappa for the texts from all genres on average is κ = 0.94.

                          English  German
Binary overlap κ          0.87     0.86
Proportional overlap κ    0.81     0.81
MUC F-score               77.28    73.91

Table 3: Inter-annotator agreement for news and stories

4 Experiment

4.1 Experimental setup

Automatic sentence and word alignment. We aligned the source and target parts of the corpus at the sentence level using the HunAlign sentence aligner (Varga et al., 2007) and its wrapper LF Aligner[4], which already includes alignment dictionaries for the required language pairs.

Word alignment was performed with GIZA++ (Och and Ney, 2003) using the standard settings. Before the alignment, all texts in the corpus were tokenized and lower-cased using the Europarl preprocessing tools. The word aligner was trained on a collection of bilingual newswire text from our source given above, preprocessed in the same way as described above. The training set consists of around 200 000 parallel sentences for English-German, and 170 000 for English-Russian.

We computed both bidirectional alignments and the intersection of source-target / target-source alignments. (Annotation projection is often done with intersective alignments, as they provide higher precision than bidirectional alignments.) For English-German, we evaluated our word alignment against a set of 1000 manually annotated parallel sentences made available by S. Padó[5]. For English-Russian, we are not aware of any similar gold alignments and thus did not evaluate.
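The intersection step can be illustrated with a short sketch. This is an illustration under assumptions of our own, not the authors' code: it presumes Moses/GIZA++-style alignment lines of "srcIndex-tgtIndex" pairs, a format the paper does not actually specify.

```python
def parse_links(line):
    """Turn a line like '0-1 1-0 2-2' into a set of (first, second) index pairs."""
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}

def intersective_alignment(src2tgt_line, tgt2src_line):
    """Keep only the links proposed by both directions; the target-to-source
    pairs are flipped before intersecting, so both sets use (src, tgt) order."""
    forward = parse_links(src2tgt_line)
    backward = {(s, t) for (t, s) in parse_links(tgt2src_line)}
    return forward & backward

print(sorted(intersective_alignment("0-0 1-2 2-1", "0-0 1-2 2-1")))
# -> [(0, 0), (1, 2), (2, 1)]
```

Because a link survives only if both directional models propose it, the intersection trades recall for the higher precision mentioned above.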

[2] http://mmax2.sourceforge.net
[3] http://www.statmt.org/europarl/
[4] http://aligner.sourceforge.net
[5] http://nlpado.de/~sebastian/data/srl_data.shtml

Results are given in Table 4. Following (Padó, 2007), we evaluated only the resulting intersective alignments. We compared our results to those of (Padó, 2007) and (Spreyer, 2011), who used the English-German part of the Europarl dataset. Our results are somewhat lower, probably due to the much smaller training set.

                 Bisentences  Prec.   Recall  F-m.
Padó (2007)      1 029 400    98.6    52.9    68.86
Spreyer (2011)   1 314 944    94.88   62.04   75.02
Our alignment    205 208      92.95   51.23   66.05

Table 4: Evaluation of the automatic word alignment
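The scores in Table 4 correspond to standard link-level precision, recall and F-measure. A hedged sketch, assuming both predicted and gold alignments are plain sets of (source, target) links with no sure/possible distinction (an assumption; the paper does not say how its gold standard is annotated):

```python
def alignment_prf(predicted, gold):
    """Precision, recall and F1 of predicted alignment links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(alignment_prf({(0, 0), (1, 1)}, {(0, 0), (1, 2), (2, 2)}))
# -> (0.5, 0.3333333333333333, 0.4)
```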

To simplify subsequent processing, we converted the corpus annotations into the CoNLL table format[6] using the discoursegraphs converter (Neumann, 2015).

Extraction of REs and transfer of coreference chains. For each RE in the source language we extract the corresponding RE in the target language, together with its coreference set number. Following the approach of Postolache et al. (2006), for each word span representing an RE in the source language, we extract the corresponding set of aligned words in the target language. The resulting target RE is the span between the first and the last extracted word, and it belongs to the same set as the source RE (a small illustrative sketch follows below). Table 5 shows the number of REs and coreference chains projected through word alignment (from English).

4.2 Evaluation

We evaluate both the quality of the identification of mentions and the extraction of coreference chains using the CoNLL scorer[7].

1. Evaluation of the identification of mentions. We compute the scores for the identification of mentions using strict mention matching as in the CoNLL-2011 (Pradhan et al., 2011) and CoNLL-2012 shared tasks (Pradhan et al., 2012), so that we score only those projected markable spans that are exactly the same as the gold ones. The values for English-German and English-Russian are given in Table 6 as mentions.

2. Evaluation of coreference chains. We evaluate all the projected coreference chains against gold chains using the standard coreference evaluation metrics MUC (Vilain et al., 1995), CEAF (Luo, 2005) and B3 (Bagga and Baldwin, 1998) to get complete performance characteristics. We also use strict matching as in the evaluation of the identification of mentions and evaluate the projected markables against all the markables of the gold standard. These scores depend on the identification of mentions evaluated in the previous step. We report the micro-averaged Precision, Recall and F1 scores in Table 6. In addition, Figure 1 shows the distribution of macro-averaged F1 scores for two of the metrics (MUC and B3) for both language pairs as boxplots.

[Figure 1: Comparison of English-German and English-Russian projections: boxplots of the macro-averaged F1 scores (MUC and B-cubed) for different genres.]

3. Evaluation of coreference chains with minimal spans. Finally, we evaluate using just minimal spans of the REs, i.e., syntactic heads. This indicates how well the REs can be projected, not punishing the algorithm for detecting only partially correct REs. We manually annotated syntactic heads of the gold and projected REs.

[6] http://conll.cemantix.org/2012/data.html
[7] http://conll.cemantix.org/2012/software.html
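The RE transfer step referenced above can be summarized in a few lines. This is a minimal illustration of the described span-projection rule, not the authors' implementation; all names and the data layout are hypothetical:

```python
def project_span(src_span, alignment):
    """Project a source RE span through word alignment.

    src_span: (start, end) source token indices, inclusive.
    alignment: set of (src_idx, tgt_idx) word-alignment pairs.
    Returns the target span between the first and last aligned token,
    or None if no token of the source RE is aligned at all.
    """
    targets = [t for (s, t) in alignment if src_span[0] <= s <= src_span[1]]
    if not targets:
        return None
    return (min(targets), max(targets))

# A coreference chain keeps its set number: every projected mention joins the same set.
chain = {"set_id": 1, "mentions": [(0, 1), (5, 5)]}
alignment = {(0, 0), (1, 2), (5, 4)}
projected = {"set_id": chain["set_id"],
             "mentions": [project_span(m, alignment) for m in chain["mentions"]]}
print(projected)  # {'set_id': 1, 'mentions': [(0, 2), (4, 4)]}
```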

                                Newswire    Stories    Medicine
                                De    Ru    De    Ru   De
Transferred REs                 465   493   329   357  214
Transferred coreference chains  122   122   44    44   82

Table 5: Number of REs and coreference chains transferred through bilingual projections

                 Mentions         MUC              CEAF             B3                  Average (coref)
                 P    R    F1     P    R    F1     P    R    F1     P     R     F1      P    R    F1
de-News          61.5 48.6 54.3   55.9 43.2 48.7   58.6 46.7 51.9   45.8  34.2  39.1    53.4 41.4 46.6
de-Stories       82.0 54.5 65.5   81.9 51.6 63.3   81.7 53.7 64.8   71.6  32.5  44.7    78.4 45.9 57.6
de-Medicine      61.2 44.7 51.7   66.2 42.7 51.9   59.1 43.3 50.0   53.43 35.16 42.41   59.6 40.4 48.1
de-News-min      89.9 71.2 79.4   87.3 66.2 75.3   85.5 67.5 75.5   80.4  58.1  67.5    84.4 63.9 72.8
de-Stories-min   95.4 62.2 75.3   94.4 58.5 72.2   95.1 61.2 74.5   90.9  40.2  55.7    93.5 53.3 67.5
de-Medicine-min  79.9 58.4 67.5   84.2 54.4 66.1   77.7 56.9 65.7   73.3  47.2  57.4    78.4 52.8 63.1
ru-News          79.3 64.5 71.2   76.3 60.7 67.6   76.3 62.0 68.4   69.0  52.2  59.4    73.9 58.3 65.1
ru-Stories       87.4 65.1 74.6   87.9 64.4 74.3   86.1 64.6 73.8   79.7  47.9  59.8    84.6 59.0 69.3
ru-News-min      90.9 72.6 80.7   89.6 69.8 78.5   87.3 69.7 77.5   83.7  61.4  70.9    86.9 67.0 75.6
ru-Stories-min   94.3 72.4 81.9   94.0 70.9 80.9   93.6 71.7 81.2   90.2  57.3  70.1    92.6 66.6 77.4

Table 6: Results for German and Russian: micro-averaged Precision, Recall and F1 score for different genres

Following the approach of Postolache et al. (2006), we select the leftmost noun, pronoun or numeral as head; otherwise, the RE is discarded. Results are given in Table 6 with the tag 'min'.

4.3 Error Analysis

From a formal viewpoint, there are three categories of projection problems:

1. An RE is present in both source and target text, but it is not projected correctly, or not at all, on the grounds of mistakes in the word alignment phase.

2. An RE is present in the source text and correctly projected into the target text, but it does not show up in the gold standard, because the target language text does not have a corresponding RE pair (the target language does not reproduce the complete chain of the source).

3. An RE in the gold standard is not present in the target text and therefore cannot be projected (the dual problem to (2): the source text does not have an RE pair that would correspond to one in the target text).

The number of errors caused by wrong word alignment (1) can be estimated on the basis of the alignment evaluation (Section 4.1), albeit only for the English-German language pair; due to the lack of resources, this is not possible for English-Russian.

Problems (2) and (3) are the more interesting ones for a qualitative error analysis. For this purpose, we visualized the projected files and the gold standard using the coreference module of the ICARUS corpus analysis platform (Gärtner et al., 2014). 50% of the data was randomly selected for the detailed analysis, and we determined the most frequent projection errors and categorized them into three different groups. Thereafter, we tried to verify our resulting hypotheses about variation in pronominal coreference in the three languages using a larger external corpus: InterCorp[8] (Čermák and Rosen, 2012) offers an online interface for searching parallel corpora in different languages and sub-corpora. We performed both monolingual and multilingual queries (e.g. querying one side of a parallel corpus vs. querying parallel data).

Further, we were interested in comparing our findings to available studies on multilingual nominal coreference in Contrastive Linguistics. However, the only work we found on this topic is a comparative study of nominal referring expressions for newswire texts in English and German (Kunz, 2010).

In our data, the problematic cases are those where the source language (SL) referring expression is missing or reformulated in the target text (TL), and therefore is not being projected. We identified three categories of errors caused by structural differences among the three languages:

[8] www.korpus.cz/intercorp

Morphological differences. These are cases of German contractions and compound nouns. For example, as in the case of policy towards [minorities]1 and [Minderheiten]politik, the SL markable is not present in the TL as a separate unit, since we cannot split compound nouns and mark only a part. Also, cases like zum Bahnhof, short for zu dem Bahnhof ('to the station'), cause errors in the identification of spans, because we do not annotate prepositions as parts of markables on the English side. However, such cases are frequent in the German data, where, in general, the prepositions an, bei, in, von, zu can be contracted with subsequent determiners in written text. Our corpus study has shown that for the preposition zu ('to') the frequency of the contraction is 16 times higher than for the full form (InterCorp, measured in items per million (henceforth i.p.m.)).

Differences in NP syntax.

1: The use of articles. Some NPs are more frequently used with a definite article in German than in English, which resulted in the misidentification of spans. According to Kunz (2010), English allows the use of nouns with zero article more frequently than German. This is true for both singular and plural nouns. In our guidelines, nouns with zero article can only be linked to anaphoric pronouns (if any), but not to each other (as in OntoNotes). This resulted in mismatching chains: English NPs with zero article do not form chains and therefore cannot be projected, while the same NPs actually form a chain in German. For example:

(1) a. Lastly, the G-20 could also help drive momentum on climate change. <...> We also have to find a way to provide funding for adaptation and mitigation - to protect people from the impact of climate change and enable economies to grow while holding down pollution levels - while guarding against trade protection in the name of climate change mitigation.

b. Schließlich könnten die G-20 auch für neue Impulse im Bereich [des Klimawandels]1 sorgen. Ebenso müssen wir einen Weg finden, finanzielle Mittel für die Anpassung an [den Klimawandel]1 sowie dessen Eindämmung bereitzustellen - um die Menschen zu schützen und den Ökonomien Wachstum zu ermöglichen, aber den Grad der Umweltverschmutzung trotzdem in Grenzen zu halten. Außerdem gilt es, sich vor handelspolitischen Schutzmaßnahmen im Namen der Eindämmung [des Klimawandels]1 zu hüten.

The query of InterCorp data has shown that German exhibits a higher number of NPs with definite article (57.928,55 i.p.m.) compared to English (31.405,22 i.p.m.). We also noticed that article use with named entities can vary in both languages (for example, the English Hamas corresponds to the German die Hamas). However, our corpus queries did not show any regularities yet; this issue requires a more detailed study regarding the types of named entities (which we assume to be the reason for the different use of articles). In the case of Russian, the absence of articles led to better results in the identification of REs, since in general, shorter spans increase the chance for a perfect alignment.

2: The use of reflexive pronouns. According to our annotation scheme, we annotated reflexive pronouns only when they are independent constituents (rather than verb particles), but we observe differences in the use of these pronouns for the three languages, so that in most cases these are non-parallel. These differences have to do with the form and distribution of reflexive pronouns. In English, we only have -self to express reflexivity, while in German and Russian a wider range of reflexives can be used. In German and Russian, it is possible to use more than one reflexive in a sentence to emphasize the action, which is not possible in English. As a result, there are fewer reflexives to be transferred from English to the target (German and Russian) sides of the corpus, which led to errors in the projection.

3: Pre- and post-modification. In general, we noticed that German NPs allow more complicated premodification than English and Russian. According to Kunz (2010), English tends to postmodification, while German is less restrictive with premodification. These variations result in syntactic differences in markables and in non-parallelism. Regarding participial constructions, one of the complications is that in German, they occur only in pre-position, while in English and Russian they can be placed in both pre- and post-position. For example:

(2) a. Pakistan needs international help to bring hope to [the young people]1 [who]1 live there.

b. Pakistan braucht internationale Hilfe, um [den dort lebenden jungen Menschen]1 Hoffnung zu bringen.

Non-equivalences in translation. The following cases of non-parallelism resulted in projection errors in our dataset; however, we could not find enough evidence to characterize them as systematic.

• Personal pronouns vs. indefinite pronouns.

(3) a. [It]1 was pursuing a two-pronged strategy.
    b. [Man] verfolgte eine Doppelstrategie. ('One followed a two-pronged strategy.')

The German indefinite pronoun man is the target of the projected annotations, but it is not a markable according to our guidelines: it is non-referring and thus unable to participate in RE chains.

• Possessive NPs vs. adjectives. Some possessive NPs in the SL (for example, the government of [India]1) can be expressed through adjectives in the TL (die [indische] Regierung or [indijskoe] pravitel'stvo) and therefore are no markables.

• Determiners vs. possessive pronouns. Personal pronouns in English can be translated as articles in German (for example, [its]1 broader goal = das weiter gefasste Ziel), so that the source RE has no correspondent in the TL. For Russian, in this case a possessive form of a reflexive pronoun svoj can be used, or the possessive pronoun can be omitted.

• Relative clauses in one language can correspond to participle constructions or PPs in another. Examples:

a. [a fat lady]1 [who]1 wore a fur around her neck
b. [eine dicke Dame mit einer Pelzstola]1 ('a fat lady with a fur')

4.4 Comparing the genres

According to Table 6 and Figure 1, we see that newswire texts get the lowest scores, the reason most likely being the more complicated NPs. In setting 2 (evaluation of minimal spans), both newswire texts and stories obtain closer F1 scores, but the stories still have better precision scores. The medicine instruction leaflets in setting 2 have the worst results, and we observe lower improvement for precision between the two settings compared to the newswire texts. This indicates that the quality of coreference resolution for medical texts depends to a higher degree on the coreference relations than on the identification of mentions. In these texts, we frequently find borderline cases of non-/reference, when diseases, parts of the body, etc. are being mentioned. Here, we will try to make the annotation guidelines more specific.

5 Discussion

The most closely related work is the approach of Postolache et al. (2006), but some differences are noteworthy. In contrast to Postolache and colleagues, we do not focus on maximising precision; instead, our goal is to assess how well projection can work for all the annotations. In general, we use neither language-dependent software nor any additional linguistic information about the target language in the coreference projection and evaluation. Postolache et al., in contrast, applied a dedicated Romanian-English word aligner[9] (which achieves an F-score of 83.3% compared to the 66.05% of the language-independent GIZA++) and used special rules that rely upon the POS information and syntactic heads to produce their annotations, and then discarded the incorrectly projected ones (we used such rules only in the evaluation of the projected heads of REs). These rules reduced the number of gold and projected REs in the English-Romanian corpus considerably: from 3422 to 2491 (Postolache et al., 2006).

In our case, we use all REs to evaluate the spans of the projected annotations and the resulting coreference chains. Comparing our evaluation to Postolache's evaluation of all REs, we can see that our results yield a higher MUC precision for all of the genres (average 68.0 for English-German, 82.1 for English-Russian vs. 52.3 for English-Romanian), but a lower recall for both languages (45.8/62.6 vs. 82.04), which results in a different F-measure (Postolache et al. obtained an average F1 of 63.9 compared to our F1 of 54.6 for German and 71.0 for Russian). This can be explained by the lower quality of our automatic English-German alignments compared to

[9] The COWAL word aligner is a lexical aligner which is adjusted only for Romanian-English and requires a corpus with morpho-syntactic annotations (Tufis et al., 2006).

the English-Romanian one; the Russian REs were extracted slightly more accurately due to the structural differences in NPs. We also observed different scores for newswire texts, stories and medical leaflets, while Postolache et al. only used texts of one genre and in fact one author (different chapters of the same fiction book).

Keeping these different parameters in mind, in order to compare our results in a fair way, we evaluated the RE heads following the same rules to extract minimal spans of the projected REs, and evaluated them against manually annotated heads in the gold standard. In this setting, we obtained higher precision than in the previous setting, and in comparison to Postolache et al. (English-Romanian, avg. F1 = 80.5), our results are somewhat lower for English-German (avg. F1 = 74.1) and slightly better for English-Russian (avg. F1 = 81.3), which we attribute to the overall more difficult (and therefore more generalizable) projection scenario in our approach.

6 Conclusions

The goal of this study was to explore to what extent the coreference projection task can be tackled with a decidedly "light-weight" approach. In contrast to earlier work, we used a well-known, standard word alignment tool trained on a corpus of moderate size. Furthermore, we deliberately worked with projecting English annotations to two relatively different languages, Russian and German, in order to study the limitations of the approach. In order to be as "generalizable" as possible (especially for other low-resourced languages), we work on the basis of common, relatively lean annotation guidelines for coreference, which make few assumptions on the specifics of the languages considered here.

We compared our results quantitatively to the most closely related work and argued that they are competitive, in particular because our task setting is more target-language-neutral, we used three languages rather than two, and we worked on three different genres of text.

Our qualitative error analysis showed that problems are due to a set of structural differences of NPs in the three languages. Having completed this "light-weight" study, we will now move forward by introducing limited syntactic knowledge of the languages involved (NP chunking) and explore how much performance can be gained in that way. Still, our emphasis remains on devising procedures that are generalizable to other low-resourced languages, so we will do these extensions in small steps only.

Our annotation guidelines and other material will be made available via our website http://www.ling.uni-potsdam.de/acl-lab/.

References

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The first international conference on language resources and evaluation workshop on linguistics coreference, volume 1, pages 563-566.

František Čermák and Alexandr Rosen. 2012. The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3):411-427.

Maud Ehrmann, Marco Turchi, and Ralf Steinberger. 2011. Building a multilingual named entity-annotated corpus using annotation projection. In RANLP, pages 118-124.

Markus Gärtner, Anders Björkelund, Gregor Thiele, Wolfgang Seeker, and Jonas Kuhn. 2014. Visualization, search, and error analysis for coreference annotations. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 7-12.

Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, and Bonnie Webber. 2014. ParCor 1.0: A parallel pronoun-coreference corpus to support statistical MT. In Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3191-3198. European Language Resources Association.

Sanda M. Harabagiu and Steven J. Maiorano. 2000. Multilingual coreference resolution. In Proceedings of the sixth conference on Applied natural language processing, pages 142-149. Association for Computational Linguistics.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pages 57-60. Association for Computational Linguistics.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(03):311-325.

Olga Krasavina and Christian Chiarcos. 2007. PoCoS: Potsdam coreference scheme. In Proceedings of the Linguistic Annotation Workshop, pages 156-163. Association for Computational Linguistics.

Kerstin Anna Kunz. 2010. Variation in English and German Nominal Coreference: A Study of Political Essays, volume 21. Peter Lang.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 25-32. Association for Computational Linguistics.

Rada Mihalcea, Carmen Banea, and Janyce M. Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In Proceedings of ACL-2007.

Ruslan Mitkov and Catalina Barbu. 2002. Using bilingual corpora to improve pronoun resolution. Languages in Contrast, 4(2):201-211.

Arne Neumann. 2015. discoursegraphs: A graph-based merging tool and converter for multilayer annotated corpora. In Nordic Conference of Computational Linguistics NODALIDA 2015, page 309.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Sylwia Ozdowska. 2006. Projecting POS tags and syntactic dependencies from English and French to Polish in aligned corpora. In Proceedings of the International Workshop on Cross-Language Knowledge Induction, pages 53-60. Association for Computational Linguistics.

Sebastian Padó and Mirella Lapata. 2005. Cross-linguistic projection of role-semantic information. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 859-866. Association for Computational Linguistics.

Sebastian Padó. 2007. Cross-lingual annotation projection models for role-semantic information. Ph.D. thesis, German Research Center for Artificial Intelligence and Saarland University.

Oana Postolache, Dan Cristea, and Constantin Orasan. 2006. Transferring coreference chains through word alignment. In Proceedings of LREC-2006.

Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel, and Nianwen Xue. 2011. CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 1-27. Association for Computational Linguistics.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1-40. Association for Computational Linguistics.

Altaf Rahman and Vincent Ng. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720-730. Association for Computational Linguistics.

Asad Sayeed, Tamer Elsayed, Nikesh Garera, David Alexander, Tan Xu, Douglas W. Oard, David Yarowsky, and Christine Piatko. 2009. Arabic cross-document coreference detection. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 357-360. Association for Computational Linguistics.

Kathrin Spreyer. 2011. Does it have to be trees?: Data-driven dependency parsing with incomplete and noisy training data. Ph.D. thesis, Universitätsbibliothek Potsdam.

Jörg Tiedemann. 2009. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237-248.

Jörg Tiedemann. 2014. Rediscovering annotation projection for cross-lingual parser induction. In Proc. COLING.

Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan Stefanescu. 2006. Improved lexical alignment by combining multiple reified alignments. In EACL.

Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2007. Parallel corpora for medium density languages. Amsterdam Studies in the Theory and History of Linguistics Science series 4, 292:247.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message understanding, pages 45-52. Association for Computational Linguistics.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on Human language technology research, pages 1-8. Association for Computational Linguistics.

Imed Zitouni and Radu Florian. 2008. Mention detection crossing the language barrier. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 600-609. Association for Computational Linguistics.

Projective methods for mining missing translations in DBpedia

Laurent Jakubina and Philippe Langlais
RALI - DIRO, Université de Montréal
[email protected], [email protected]

Abstract

Each entry (concept) in DBpedia comes along with a set of surface strings (property rdfs:label) which are possible realizations of the concept being described. Currently, only a fifth of the English DBpedia entries have a surface string in French, which severely limits the deployment of Semantic Web annotation for this language. In this paper, we investigate the task of identifying missing translations, contrasting two projective approaches. We show that the problem is actually challenging, and that a carefully engineered baseline is not easy to outperform.

1 Introduction

The LOD (Linked Open Data) (Bizer et al., 2009) is conceived as a language-independent resource in the sense that the information is represented by abstract concepts to which "human-readable" strings — possibly in different languages — are attached, e.g. the rdfs:label property in DBpedia. For instance, we can access the abstract concept of computer by natural language queries such as ordinateur (rdfs:label@fr) in French or computer (rdfs:label@en) in English. Thanks to this, the Semantic Web offers the advantage of having a truly multilingual World Wide Web (Gracia et al., 2012).

At the core of LOD lies DBpedia (Jens Lehmann, 2014), the largest dataset, which constitutes a hub to which most other LOD datasets are linked.[1] Since DBpedia is (automatically) generated from Wikipedia, which is multilingual, one would expect that each concept in DBpedia is labeled with a French surface string. This is for instance the case of the concept House of Commons of Canada[2], which is labeled in French as Chambre des communes du Canada. One problem, however, is that most labels are currently in English (Gómez-Pérez et al., 2013).

Indeed, the majority of datasets in LOD are primarily generated from the extraction of anglophone resources. DBpedia, the endogenous RDF dataset of Wikipedia, is no exception here, since it proposes labels in French (rdfs:label@fr) for only one fifth[3] of the concepts. Of course, all concepts in English Wikipedia have at least one English label. For instance, the concept School life expectancy[4] has — at least at the time of writing — no label in French, while durée moyenne de scolarité appears in the (French) article Indice_de_développement_humain[5] and is a good translation of the English term.

This situation comes from the fact that currently, a concept in DBpedia receives as its rdfs:label property in a given language the title of the Wikipedia article which is inter-language linked to the (English) Wikipedia article associated to the DBpedia concept.

The lack of surface strings in a foreign language does not only reduce the usefulness of RDF indexing engines such as sig.ma[6], but also limits the deployment of Semantic Web Annotator (SWA) systems; e.g. (Mihalcea and Csomai, 2007; Milne and Witten, 2008). This motivates the present study, which aims at automatically mining French labels for the concepts in DBpedia that do not possess one yet.

[1] December 2014 - http://lod-cloud.net/
[2] http://dbpedia.org/page/House_of_Commons_of_Canada
[3] http://wiki.dbpedia.org/Datasets/DatasetStatistics
[4] http://dbpedia.org/page/School_life_expectancy
[5] http://fr.wikipedia.org/wiki/Indice_de_developpement_humain
[6] http://sig.ma
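In data terms, the mining task starts from concepts whose label set has an English entry but no French one. A toy illustration of our own (not the authors' code), using the two concepts discussed above, with labels keyed by language tag as in rdfs:label@en / rdfs:label@fr:

```python
labels = {
    "House_of_Commons_of_Canada": {"en": "House of Commons of Canada",
                                   "fr": "Chambre des communes du Canada"},
    "School_life_expectancy": {"en": "School life expectancy"},
}
# Concepts with an English label but no French one are translation candidates.
missing_fr = [c for c, l in labels.items() if "en" in l and "fr" not in l]
print(missing_fr)  # ['School_life_expectancy']
```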

Identifying the translations of (English) Wikipedia article titles is partially solved in the BabelNet project (Navigli and Ponzetto, 2012). In this project, the translation of concepts in Wikipedia that are not inter-language linked is taken care of by applying machine translation on (minimum 3 and maximum 10) sentences extracted from Wikipedia that contain a link to the article whose title they seek to translate. The most frequent translation is finally selected. There are on the order of 500k articles in English Wikipedia that do not link to an article in French and which are not named entities (which typically do not require translation).[7] BabelNet provides a translation (not necessarily a good one) for 13% of them. This suggests that the projection of a resource such as DBpedia into French is not yet a solved problem.

In the remainder, we describe the approaches we tested in Section 2. Our experimental protocol is presented in Section 3. Section 4 reports the results we obtained. We conclude in Section 5.

2 Approaches

Identifying the translations of a term in a comparable corpus — two texts (one in each language of interest) that share similar topics without being in translation relation — is a challenge that has attracted many researchers. See (Sharoff et al., 2013) for a recent overview of the state of the art in this field. In this work, we investigated several variants of two approaches for extracting translations from a comparable corpus: the seminal approach described in (Rapp, 1995), which uses a seed bilingual lexicon to induce new translations, and the approach of Bouamor et al. (2013), which instead exploits the Wikipedia structure. The latter approach has been shown to outperform the former significantly on a task of translating 110 terms in 4 different domains, making use of medium-sized corpora.[8]

2.1 Standard Approach (STAND)

The idea that the context of a term and the one of its translation share similarities that can be used to rank translation candidates has been previously investigated in (Rapp, 1995; Fung, 1998). Since then, many variants of this idea have been tested; see (Sharoff et al., 2013) for a recent discussion. We reproduced this approach in this work. In a nutshell, each term to be translated is represented by a so-called context vector; that is, the set of words that co-occur with this term in the source part of the corpus. An association measure is typically used to score the strength of the correlation between the term and the context words. Each translation candidate (typically each word of the target vocabulary) is similarly represented in the target language. Thanks to a bilingual seed lexicon, the source context vector is projected into a target one.[9] This projected target-language vector is then compared to the vector of each of the target-language candidates by means of a similarity measure.

There are several parameters to the approach, among which the size of the window used to collect co-occurrent words, the association and similarity measures, as well as the seed lexicon.

We investigate the impact of the window size in Section 4. We also compare two different association measures, namely the discontinuous odds-ratio (Evert, 2005, p. 86), named ORD hereafter, and the log-likelihood ratio (Dunning, 1993), named LLR, the most popular measures used in this line of work. Both measures (Eq. 1 and 2) are computed directly from the (monolingual) contingency table depicted in Table 1 for two words w1 and w2, where, for instance, O12 stands for the number of times w1 occurs in a window while w2 does not.

        w1    ¬w1
w2      O11   O12    R1
¬w2     O21   O22    R2
        C1    C2     N

Table 1: Contingency table

$$ORD(w_1, w_2) = \log \frac{(O_{11} + \frac{1}{2})(O_{22} + \frac{1}{2})}{(O_{12} + \frac{1}{2})(O_{21} + \frac{1}{2})} \quad (1)$$

$$LLR(w_1, w_2) = 2 \sum_{ij} O_{ij} \log \frac{N \times O_{ij}}{R_i \times C_j} \quad (2)$$

[7] Version 2.0.1 - March 2014
[8] 400k words on the English side, 260k words on the French side.
[9] In our implementation, when no translation is found for a source word, the word is left as such in the target context vector. On the contrary, multiple translations are all added to the target context vector.
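To make Equations 1 and 2 concrete, here is a small illustrative sketch of our own (not the authors' code) computing both measures from the four cells of the contingency table:

```python
from math import log

def ord_measure(o11, o12, o21, o22):
    """Discontinuous odds-ratio of Eq. 1, with the 0.5 smoothing terms."""
    return log(((o11 + 0.5) * (o22 + 0.5)) / ((o12 + 0.5) * (o21 + 0.5)))

def llr(o11, o12, o21, o22):
    """Log-likelihood ratio of Eq. 2; zero cells are skipped since 0*log(0) = 0."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22          # row marginals R1, R2
    c1, c2 = o11 + o21, o12 + o22          # column marginals C1, C2
    cells = ((o11, r1, c1), (o12, r1, c2), (o21, r2, c1), (o22, r2, c2))
    return 2 * sum(o * log(n * o / (r * c)) for o, r, c in cells if o > 0)

# A pair that nearly always co-occurs scores high under both measures:
print(round(ord_measure(10, 1, 1, 988), 2), round(llr(10, 1, 1, 988), 2))
# -> 8.44 98.6
```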

We did not investigate the impact of the nature and size of the bilingual seed lexicon, but decided to use one large lexicon comprising 116 354 word pairs populated from several available resources as well as an in-house bilingual lexicon.[10] A similar choice is made in (Bouamor et al., 2013), where a seed lexicon of approximately 120 000 entries is being used, and in (Hazem et al., 2013), where the authors use a lexicon of 200 000 entries (before preprocessing).

Since in (Laroche and Langlais, 2010) the best performing variant uses the cosine similarity measure (Eq. 3), we used it in our experiments.[11]

$$\cos(v_{src}, v_{trg}) = \frac{v_{src} \cdot v_{trg}}{\|v_{src}\| \cdot \|v_{trg}\|} \quad (3)$$

In the standard approach, the co-occurrent words are extracted from all the source documents of the comparable corpus in which the term to translate appears. We name this variant STAND hereafter.

2.2 Neighbourhood variants (LKI, LKO, CMP and RND)

Since we are interested in translating Wikipedia titles, a natural way of populating the context vector of a term is to consider the occurrences of this term in the article whose title we seek to translate. This avoids populating the context vector with words co-occurring with different senses of the word to translate. We implemented such a variant, which inherently faces the issue that too few occurrences of the term of interest may appear in a single article, especially in our case where the average length of a Wikipedia article is approximately 1 400 words. Therefore we considered a variant which involves a neighbourhood function, that is, a function that returns a set of Wikipedia articles related to the one under consideration for translation. We investigated three such functions (as well as many combinations of them):

LKI(a) returns the set of articles that have a link pointing to the article a under consideration (in links). For instance, both Computer_Science and Art are two articles pointing to Entertainment.

LKO(a) returns the set of articles to which a points (out links). For instance, the article Entertainment points to Party and Fun.

CMP(a) returns the set of articles that are the most similar to a. We used the MoreLikeThis method of the search engine Lucene[12] for this. For instance, Dance and Dance in Indonesia are the top-2 documents returned by this function for the article Entertainment.

For sanity check purposes, we also considered the RND function, which randomly returns articles. Note that the LKI() and LKO() functions were obtained with the Wikipedia Miner toolkit (Milne and Witten, 2013).

2.3 Explicit Semantic Analysis (ESA-B)

We also implemented the approach described in (Bouamor, 2014), which has been shown by the author to be more accurate than the aforementioned standard approach. The proposed method is an adaptation of the Explicit Semantic Analysis approach described in (Gabrilovich and Markovitch, 2007).

A term to translate is represented by the titles of the Wikipedia articles in which it appears. The projection of the resulting context vector into the target language is obtained by following the available inter-language links.[13] The words of the articles reached this way are candidates for the translation and are further ranked by a tf-idf schema. This approach avoids the need for a seed bilingual lexicon, but uses instead the structure of Wikipedia, and its multilingualism more particularly.

One meta-parameter of this approach is the maximum size of the context vector, that is, the maximum number of article titles to keep for describing a term. One might think that considering all the articles in which a term to translate is found is a good idea, but this strategy faces some sort of semantic drift. For instance, while translating the term tears, the context vector is populated with articles related to music albums that contain this term in their text content, while the associated French article (when available) almost never contains the translation. We investigate this meta-parameter in Section 4. The other parameters were set as recommended in (Bouamor, 2014).

[10] Ergane (12 914 entries - http://download.travlang.com), Freelang (38 869 entries - http://www.freelang.net), as well as an in-house lexicon (99 747 entries).
[11] Actually, the authors reported that with the LLR association measure, the Dice similarity was a better choice, but we kept the cosine measure for simplicity.
[12] http://www.lucene.org
[13] Articles with no inter-language links are simply ignored.
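The projection-and-ranking loop of the standard approach can be sketched as follows. This is a minimal illustration under our own assumptions (sparse vectors stored as word-to-score dictionaries; all function names are hypothetical), not the authors' implementation:

```python
from math import sqrt

def project(src_vector, lexicon):
    """Project a source context vector into the target language through a seed
    lexicon (dict: source word -> list of translations). Untranslated words are
    kept as-is and multiple translations are all added (cf. footnote 9)."""
    projected = {}
    for word, score in src_vector.items():
        for trans in lexicon.get(word, [word]):
            projected[trans] = projected.get(trans, 0.0) + score
    return projected

def cosine(u, v):
    """Eq. 3 over sparse vectors."""
    dot = sum(s * v.get(w, 0.0) for w, s in u.items())
    norm = sqrt(sum(s * s for s in u.values())) * sqrt(sum(s * s for s in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(src_vector, lexicon, candidate_vectors):
    """Rank target candidates by similarity to the projected source vector."""
    proj = project(src_vector, lexicon)
    return sorted(candidate_vectors,
                  key=lambda c: cosine(proj, candidate_vectors[c]),
                  reverse=True)
```

The design point worth noting is that the candidate set is, in principle, the whole target vocabulary, which is why the technical considerations of Section 3.5 matter.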

3 Experimental Protocol

3.1 Comparable corpus

DBpedia is extracted from Wikipedia (Jens Lehmann, 2014). Thus, we downloaded the Wikipedia dump of June 2013 in both English and French. The English dump contains 4 262 946 articles, and the French one contains 1 398 932. Although some articles that share an inter-language link are parallel (Patry and Langlais, 2011), most article pairs are actually only comparable (Hovy et al., 2013).

3.2 English terms without translation

The vast majority (82.3%) of articles in the English Wikipedia do not have a link to an article in the French Wikipedia. We are interested in identifying the translations of their titles. Yet, we noticed that many of them are actually describing named entities (persons, geographic places, etc.), which typically do not require translation.[14] In order to filter named entities, we applied the BabelNet filter.[15] We ended up with a list of 521 895 (18.5%) terms we ultimately seek to translate. In this study, we further narrowed down our interest to unigrams.[16] This represents roughly 30% of those English terms.

3.3 Reference List

To evaluate our different approaches, we build a test set — a list of English source terms and their reference (French) translation. For this, we randomly sampled pairs of articles in Wikipedia that are inter-language linked. It is accepted that the titles of a pair of inter-language linked articles often constitute good translations (Hovy et al., 2013). Therefore, for each term (title) of our test set, we collected the associated title as a reference translation.

The sampling was done without considering named entities. For this purpose, we only considered article pairs whose English title belongs to the bilingual lexicon we used as a seed lexicon for the STAND approach. Since the frequency of a source term is a key parameter of projective approaches, we also paid attention to vary the frequency range of the English terms we considered in our test set. More precisely, we gathered terms in those different ranges: infrequent [1-25], moderate [26-100], large [101-1000] and huge [1001+], where the frequency is the one in (English) Wikipedia. Some examples of pairs in each range are displayed in Table 2.

[1-25]      74 (8.5%)      myringotomy      paracentèse
[26-100]    267 (30.7%)    syllabification  césure
[101-1000]  259 (29.8%)    numerology       numérologie
[1001+]     269 (30.9%)    entertainment    divertissement
Total       869 (100%)

Table 2: Distribution of the number of test forms at a given frequency range, along with an example of an English term and its reference (French) translation.

We measured that using a large parallel corpus,[17] we could only identify the translation of roughly 1% of those terms, which indicates that parallel data might be of little interest in identifying the translations of Wikipedia article titles.

3.4 Evaluation

Our approaches have been configured to produce a ranked list of (at most) 20 candidates for each source (English) term. We compute two metrics to compare them: precision at rank 1 (P@1), which indicates the percentage of terms for which the best ranked candidate is the reference one, and Mean Average Precision at rank 20 (MAP-20), a measure commonly used in information retrieval (Manning et al., 2008) which averages precision at various recall rates.

3.5 Technical considerations

The standard approach (STAND) can be rather computation- and time-consuming, since any target word in Wikipedia is a potential candidate for a given source term, and we are dealing with a rather large comparable corpus.

[14] Some languages do involve transliteration, but this is definitely beyond the scope of this paper.
[15] We used the BabelSynset.getSynsetType() function of the BabelNet API for this purpose.
[16] Methods that handle multi-word expressions typically embed single-word translation (Morin and Daille, 2009); therefore our choice.
[17] We gathered 32 million sentence pairs from different available parallel corpora, including the GIGAWORD corpus we downloaded from http://www.statmt.org/wmt13/translation-task.html.
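Both metrics of Section 3.4 are easy to state precisely. The following is an illustrative sketch of our own, assuming a single reference translation per term, in which case average precision reduces to the reciprocal rank of the reference within the top 20:

```python
def p_at_1(ranked, reference):
    """Precision at rank 1 for a single term."""
    return 1.0 if ranked and ranked[0] == reference else 0.0

def map_at_20(all_ranked, references):
    """Mean Average Precision at rank 20 over a test set.

    all_ranked: dict mapping each source term to its ranked candidate list.
    references: dict mapping each source term to its reference translation.
    """
    scores = []
    for term, ranked in all_ranked.items():
        ap = 0.0
        for rank, cand in enumerate(ranked[:20], start=1):
            if cand == references[term]:
                ap = 1.0 / rank
                break
        scores.append(ap)
    return sum(scores) / len(scores) if scores else 0.0

print(map_at_20({"myringotomy": ["paracentèse", "tympan"]},
                {"myringotomy": "paracentèse"}))  # -> 1.0
```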

Just as an illustration, the word france occurs more than 1 million times in the French Wikipedia, and its context vector potentially contains as many as 136 514 words (considering a context window of 6 words). Therefore, in our experiments, we only consider the first 50 000 occurrences of each term while populating the context vectors. Also, comparing source and target vectors can be time-consuming, especially with context vectors of very high dimension. To save some time (and memory), we only represent a context vector (source or target) by (at most) the 1000 top-ranked terms according to the association measure being used.

4 Results

4.1 STAND

In some calibration experiments,[18] we observed that increasing the size of the window in which we collect the context words leads to noise (see Table 3). The optimal window size was 6 (3 words on each side of the word under consideration, excluding function words), which means that the co-occurrent words should be taken in the immediate vicinity of the term to translate. This corroborates the study in (Bullinaria and Levy, 2007). Therefore, we set the value of this meta-parameter to 6 in the remainder.

window MAP-20 Figure 1: Top words in the context vector com- | | 2 0.72 puted with ORD and LLR for the source term 6 0.75 Myringotomy. Words in bold appear in both 14 0.62 context vectors. 30 0.55 A second observation that can be made is the Table 3: MAP-20 of STAND (ORD) measured on a strong correlation between the frequency of the development set, as a function of the window size term to translate and the performance of the ap- (counted in word). proach. As a matter of fact, the performance for very frequent terms ([1001+]) is more than The results of two variants of the standard ap- ten times the one measured on infrequent ones proach are reported in Table 4 (line 1 and 2). ([1–25]). This is a well-know fact that has been Clearly, using ORD as an association measure analyzed for instance in (Prochasson and Fung, drastically improves performance. This definitely 2011) where the authors report a precision of 60% corroborates the findings of Laroche and Langlais for frequent test words (words seen at least 400 (2010). Still, the differences between both vari- times), but only 5% for rare words (seen less than ants is surprisingly high: ORD delivers over six 15 times). time higher performance than LLR does on av- Overall, and even if a close comparison is dif- ficult, the results we obtained for STAND are in- 18We used a development set of 125 (unigram) terms, con- sidering a candidate list of 50k words randomly selected to 19In Table 3 of their article, the authors measured on a test- which we added the reference translations. set of 500 terms a MAP of 0.536 for ORD, and 0.413 for LLR.
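For completeness, the P@1 and MAP-20 figures reported in Tables 3 and 4 can be computed along the following lines. This is a sketch of ours, under the assumption that each test term has exactly one reference translation, in which case average precision at rank 20 reduces to the reciprocal rank of the reference when it appears in the top 20. The candidate lists in the usage example are invented.

```python
def p_at_1(candidates, reference):
    """1.0 if the top-ranked candidate is the reference translation."""
    return 1.0 if candidates and candidates[0] == reference else 0.0

def ap_at_20(candidates, reference):
    """Average precision with a single relevant item: 1/rank in the top 20."""
    for rank, cand in enumerate(candidates[:20], start=1):
        if cand == reference:
            return 1.0 / rank
    return 0.0

def evaluate(test_set):
    """test_set: list of (ranked_candidates, reference_translation) pairs."""
    n = len(test_set)
    p1 = sum(p_at_1(c, r) for c, r in test_set) / n
    map20 = sum(ap_at_20(c, r) for c, r in test_set) / n
    return p1, map20

# Hypothetical example: two English terms with ranked French candidates.
tests = [(["paracentese", "tympan"], "paracentese"),
         (["cesure", "syllabe", "coupure"], "syllabe")]
print(evaluate(tests))  # -> (0.5, 0.75)
```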

               [1-25]        [26-100]      [101-1000]    [1001+]       [Total]
               P@1    MAP    P@1    MAP    P@1    MAP    P@1    MAP    P@1    MAP
STAND (LLR)    0.000  0.003  0.011  0.019  0.019  0.023  0.134  0.154  0.051  0.061
STAND (ORD)    0.027  0.057  0.217  0.281  0.425  0.474  0.461  0.506  0.338  0.389
STAND (o-100)  0.027  0.058  0.146  0.201  0.154  0.219  0.104  0.162  0.125  0.182
LKI-1000       0.000  0.002  0.064  0.080  0.124  0.156  0.126  0.155  0.096  0.119
LKO-1000       0.000  0.000  0.016  0.022  0.089  0.119  0.033  0.046  0.044  0.058
CMP-1000       0.016  0.022  0.072  0.099  0.131  0.170  0.093  0.120  0.092  0.121
RND-1000       0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
ESA-B          0.014  0.080  0.056  0.122  0.205  0.300  0.424  0.513  0.211  0.293

Table 4: Precision (at rank 1) and MAP-20 of some variants we tested. Each neighbourhood function was asked to return (at most) 1000 English articles. The ESA-B variant makes use of context vectors of (at most) 30 titles.

Overall, and even if a close comparison is difficult, the results we obtained for STAND are in line with those reported in (Laroche and Langlais, 2010), which also focused on Wikipedia, but mining translations of medical terms. The authors reported a precision at rank one ranging from 20.7% up to 42.3% depending on the test sets and configurations considered.

As we discussed in Section 3.5, due to computational issues, we cut the context vectors of the STAND approach after 1,000 terms. In order to measure how sensitive this cut-off is, we computed a variant where only the top 100 terms are kept (considering the association measure). The results of this variant are reported in line 3 of Table 4. As expected, the performance of the STAND approach drops significantly on average, and especially for very frequent terms ([1001+]).

4.2 Neighbourhood variants

We tested our neighbourhood functions as well as several combinations of them. One meta-parameter we investigated is the maximum number of articles returned by a function. We observed early on that the more the better, something we explain shortly. Thereafter, each function was asked to return at most 1,000 articles. The results obtained by the 3 neighbourhood functions we described in Section 2 are reported in lines 4 to 6 of Table 4.

Clearly, all the neighbourhood variants we considered yielded a significant drop in performance, which is disappointing from a practical point of view. This suggests that there is no obvious way to reduce the number of source documents to consider while populating the context vector of the term to translate. One explanation for this is that in our implementation, the context vector of each target candidate is computed by considering the full (French) Wikipedia collection. This dissymmetry introduces a mismatch between the source and target context vectors, leading to poor performance. A solution to this problem consists in computing target context vectors online from a subset of target documents of interest.20 A drawback of this solution is (of course) that the computation must take place for each term to translate. This is left as future work.

At least, the neighbourhood variants we experimented with outperform the one where random documents are sampled (RND). This latter variant could not translate a single term of the test set.

4.3 ESA-B

In the default configuration of the approach described in (Bouamor et al., 2013), the authors limit the size of the context vector to 100, which we found suboptimal in our case. We varied the dimension of the context vectors and observed the best value to be 30 (see Table 5). This is the value used in the sequel.

Table 5: MAP-20 of ESA-B measured on the test set, as a function of the context vector dimension.

|context|   MAP-20
10          0.248
20          0.287
30          0.293
50          0.291
100         0.271

20 This subset could, for instance, be defined by following the inter-language links of the source documents returned by the neighbourhood function.

Somehow contrary to what has been observed in (Bouamor et al., 2013), we observe that ESA-B (P@1 = 0.211) under-performs the STAND approach with the ORD association measure (P@1 = 0.338). One explanation for the difference is that, in (Bouamor et al., 2013), the authors filter in words such as nouns, verbs and adjectives when populating the context vectors, while we do not. This filter might interfere with the observation made in Section 4.1 that, with ORD, rare words (which might be filtered out, such as URLs or even spelling mistakes) tend to appear in the context vectors, and happen to help in discriminating translations.

4.4 Analysis

If we consider the 528 test terms that appear over a hundred times in Wikipedia ([101+]), a test case where both approaches perform well, STAND (ORD) translates correctly 362 of them (considering the top-20 solutions), while ESA-B translates 351. If we had an oracle telling us which variant to trust for a given term, we could translate correctly 431 terms (81.6%), which indicates the complementarity of both approaches.
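The oracle combination above is a simple set union; the following sketch (ours) reproduces the arithmetic. The term identifiers are invented, with an overlap chosen only so that the counts match the ones reported.

```python
def oracle_coverage(correct_stand, correct_esa, n_terms):
    """Terms translated correctly (within the top 20) by at least one
    approach. Both arguments are sets of term identifiers."""
    union = correct_stand | correct_esa
    return len(union), len(union) / n_terms

# Hypothetical illustration: 362 terms for STAND(ORD), 351 for ESA-B,
# overlapping on 282 of them (the overlap is fabricated for the example).
stand = set(range(362))
esa = set(range(80, 80 + 351))
print(oracle_coverage(stand, esa, 528))  # -> (431, 0.816...)
```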
ber of different test cases would help in de- Third, it happens in a few cases that the refer- veloping expertise in doing so. With the ence translation, although correct is very specific. hope that this might help, the code and re- Of course this penalizes equally both approaches sources used in this work will be available at we tested. For instance, the reference translation this url: http://rali.iro.umontreal. of veneration is dulie, while the first trans- ca/rali/?q=fr/Ressources lation produced by STAND is vénération (a Acknowledgments correct translation). Also, and by far the most frequent case, we ob- This work has been funded by the Quebec funding served a thesaurus effect of both approaches where agency Fonds de Recherche Nature et Technolo- terms related to the source one are proposed. This gies (FRQNT).

myringotomy [1-25]

ESA-B – laryngologie (0.209) oto (0.191) rhino (0.180) traitement (0.125) otite (0.080) STAND (ORD) – permette (0.0489) devra (0.0473) scopie (0.0471) nécessitait (0.046) pût (0.045) STAND (LLR) – melanosporum (0.274) neural (0.272) séminifère (0.269) ncathodique (0.269)

syllabification [26-100]

ESA-B – langues (0.517) consonne (0.420) langue (0.353) lettre (0.223) phonétique (0.166) STAND (ORD) – modifier (0.079) suffit (0.074) vouloir (0.074) syllabique (0.074) intonation (0.072) STAND (LLR) – édicté (0.106) exécutoire (0.097) syllabique (0.096) irrévocable (0.092)

numerology [101-1000]

ESA-B 20 œuvre (0.053) gematria (0.037) angels (0.031) nombres (0.029) chiffre (0.027) STAND (ORD) 1 numérologie (0.095) occultisme (0.062) ésotérisme (0.062) divinatoire (0.058) STAND (LLR) 5 jyotish (0.415) conditionaliste (0.412) karmique (0.364) domification (0.358)

entertainment [1001+]

ESA-B 2 entertainment (0.392) divertissement (0.151) vidéo (0.121) sony (0.111) jeu (0.073) STAND (ORD) – beatmakers (0.012) manglobe (0.011) spycraft (0.011) déduplication (0.010) STAND (LLR) – dsi (0.299) eshop (0.294) cocoto (0.231) ead (0.225) imagesoft (0.210)

Figure 2: Top candidates produced by several variants of interest for some test terms. The second column indicates the rank of the oracle translation when present in the top-20 returned list (or – if absent).

References

Christian Bizer, Tom Heath, and Tim Berners-Lee. 2009. Linked data: the story so far. International Journal on Semantic Web and Information Systems, 5(3):1-22.

Dhouha Bouamor, Adrian Popescu, Nasredine Semmar, and Pierre Zweigenbaum. 2013. Building specialized bilingual lexicons using large scale background knowledge. In EMNLP, pages 479-489.

Dhouha Bouamor. 2014. Constitution de ressources linguistiques multilingues à partir de corpus de textes parallèles et comparables. PhD thesis, Université Paris Sud - Paris XI, February.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510-526, August.

Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, March.

Stefan Evert. 2005. The statistics of word cooccurrences. Ph.D. thesis, Stuttgart University.

Pascale Fung. 1998. A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup, AMTA '98, pages 1-17, London, UK. Springer-Verlag.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606-1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Jorge Gracia, Elena Montiel-Ponsoda, Philipp Cimiano, Asunción Gómez-Pérez, Paul Buitelaar, and John McCrae. 2012. Challenges for the multilingual web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 11:63-71, March.

Asunción Gómez-Pérez, Daniel Vila-Suero, Elena Montiel-Ponsoda, Jorge Gracia, and Guadalupe Aguado-de Cea. 2013. Guidelines for multilingual linked data. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, WIMS '13, pages 3:1-3:12, New York, NY, USA. ACM.

Amir Hazem and Emmanuel Morin. 2013. Word co-occurrence counts prediction for bilingual terminology extraction from comparable corpora. In IJCNLP 2013.

Eduard Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and artificial intelligence: The story so far. Artificial Intelligence, 194:2-27, January.

Jens Lehmann, Robert Isele, et al. 2014. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal.

Audrey Laroche and Philippe Langlais. 2010. Revisiting context-based projection methods for term-translation spotting in comparable corpora. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 617-625, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge manage- ment, pages 233–242. ACM.

David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge manage- ment, pages 509–518. ACM.

David Milne and Ian H. Witten. 2013. An open-source toolkit for mining wikipedia. Artif. Intell., 194:222– 239, January.

Emmanuel Morin and Béatrice Daille. 2009. Com- positionality and lexical alignment of multi-word terms. Language Resources and Evaluation, page 0, August.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual se- mantic network. Artificial Intelligence, 193:217– 250, December.

Alexandre Patry and Philippe Langlais. 2011. Iden- tifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Cor- pora: Comparable Corpora and the Web, BUCC ’11, pages 87–95, Stroudsburg, PA, USA. Associ- ation for Computational Linguistics.

Emmanuel Prochasson and Pascale Fung. 2011. Rare word translation extraction from aligned comparable documents. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1327-1335, Stroudsburg, PA, USA. Association for Computational Linguistics.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL '95, pages 320-322, Stroudsburg, PA, USA. Association for Computational Linguistics.

Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. 2013. Overviewing important aspects of the last twenty years of research in comparable corpora. In Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum, and Pascale Fung, editors, Building and Using Comparable Corpora, pages 1-17. Springer Berlin Heidelberg.

Attempting to Bypass Alignment from Comparable Corpora via Pivot Language

Alexis Linard, Béatrice Daille, Emmanuel Morin
Université de Nantes, LINA UMR CNRS 6241
2 rue de la Houssinière, BP 92208
44322 Nantes Cedex 03, France
[email protected]

Abstract

Alignment from comparable corpora usually involves two languages, one source and one target language. Previous works on bilingual lexicon extraction from parallel corpora demonstrated that more than two languages can be useful to improve the alignments. Our work has investigated to which extent a third language could be interesting to bypass the original alignment. We have defined two original alignment approaches involving pivot languages, and we have evaluated them over four languages and two pivot languages in particular. The experiments have shown that in some cases the quality of the extracted lexicon has been enhanced.

1 Introduction

The main goal of this work is to investigate to which extent bilingual lexicon extraction using comparable corpora can be improved using a third language when dealing with poor-resource language pairs. Indeed, the quality of the extracted bilingual lexicon strongly depends on the quality of the resources, that is to say the corpora and a general-language bilingual dictionary. In this study, we stress the key role of the potentially high-quality resources of the pivot language (Chiao and Zweigenbaum, 2004; Morin and Prochasson, 2011; Hazem and Morin, 2012). The idea of involving a third language is to benefit from the lexical information conveyed by the additional language. We also assume that in the case of not-so-usual language pairs the two comparable corpora are of medium quality, and the bilingual dictionary is weak, due to the nonexistence of such a dictionary. We expect as a consequence a bad quality of the extracted lexicon. Nevertheless, we are highly confident that a language for which we have a lot of resources can thwart the effect of the poor original resources. English is probably the first language in terms of work and resources in Natural Language Processing, hence it can appear as a good candidate as pivot language.

The paper is organized as follows: we give a short overview of the standard method for bilingual lexicon extraction in Section 2. Our proposed approaches are described in Section 3. The resources we have used are presented in Section 4 and experimental results in Section 5. Finally, we expose further works and improvements in Sections 6 and 7.

2 Bilingual Lexicon Extraction

Initially designed for parallel corpora (Chen, 1993), and due to the scarcity of this kind of resources (Martin et al., 2005), bilingual lexicon extraction then tried to deal with comparable corpora instead (Fung, 1995; Rapp, 1995). An algorithm using comparable corpora is the standard method (Fung and McKeown, 1997), closely based on the notion of context vectors. Many implementations have been designed in order to do so (Rapp, 1999; Chiao and Zweigenbaum, 2002; Morin et al., 2010). A context vector w is, for a given word w, the representation of its contexts ct1 ... cti and the number of occurrences found in the window of a corpus. In this approach, context vectors are calculated both in the source and target language corpora. They are also normalized according to association scores. Then, thanks to a seed dictionary, source context vectors are transferred into the target language. The similarity between the translated context vector w for a given source word w to translate and all target context vectors t leads to the creation of a list of ranked candidate translations. The rank is a function of the similarity between context vectors, so that the closer they are, the better the ranked translation is.
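As an illustration of the standard method just described, here is a minimal sketch (ours, not the authors' implementation): a source context vector is transferred through the seed dictionary and compared to all target context vectors with a similarity measure, here cosine. Vectors are plain word-to-weight dictionaries, and all names are hypothetical.

```python
import math

def transfer(source_vector, seed_dict):
    """Transfer a source context vector into the target language through
    the seed dictionary (words without an entry are simply dropped)."""
    transferred = {}
    for word, weight in source_vector.items():
        for trans in seed_dict.get(word, []):
            transferred[trans] = transferred.get(trans, 0.0) + weight
    return transferred

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    num = sum(u[w] * v[w] for w in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def rank_candidates(source_vector, target_vectors, seed_dict, k=20):
    """Return the k target words most similar to one source word."""
    sv = transfer(source_vector, seed_dict)
    scores = [(t, cosine(sv, tv)) for t, tv in target_vectors.items()]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:k]
```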


Figure 1: Transferring Context Vectors Successively.

Research in this field aims at improving the quality of the extracted lexicon. For instance, we can cite the use of a bilingual thesaurus (Déjean et al., 2002), the implication of predictive methods for word co-occurrence counts (Hazem and Morin, 2013) or the use of unbalanced corpora (Morin and Hazem, 2014). Among them, and in the case of comparable corpora, we can note that none looked into pivot-language approaches.

Nevertheless, the idea of involving a pivot language for translation tasks is not recent. Bilingual lexicon extraction from parallel corpora has already been improved via the use of an intermediary language (Kwon et al., 2013; Seo et al., 2014; Kim et al., 2015), and so has statistical translation (Simard, 1999; Och and Ney, 2001). Those works rely on the assumption that another language brings additional information (Dagan and Itai, 1991).

3 Alignment Approaches with Pivot Language

In this paper, we present two original approaches which derive from the standard method and involve a third language. We assume that the bilingual dictionary is unavailable or of low quality, but that the source/pivot and pivot/target dictionaries are much better.

3.1 Transferring Context Vectors Successively

The first method, and the most naive, is to translate context vectors successively: to start with, from source to pivot language, and to follow, from pivot to target language. Hence, the context vectors in the source language are computed as is usually done in the standard method. Then, the second step is to transfer them into the pivot language thanks to a source/pivot dictionary. This operation is done a second time from pivot to target language with a pivot/target dictionary in order to obtain source context vectors translated into the target language. We can say that we transferred the context vectors via a pivot language. Finally, the last step of similarity computation stays unchanged: for one source word w for which we want to find the translation in the target language, we compute the similarity between its context vector transferred successively w and all target context vectors t. This method is presented in Figure 1.

3.2 Transposing Context Vectors to Pivot Language

The second method based on pivot dictionaries consists in translating both source and target context vectors into the pivot language. Thus, the operation of computing similarity occurs in the vectorial space of the pivot language. In order to do so, the context vector of a word in the source language to translate is computed as it is usually done in the standard method. The second step is to transfer the source and target context vectors into the pivot language using source/pivot and target/pivot dictionaries. At this stage, we gather in the pivot language the translated source and all target context vectors. The next and last operation is to compute the similarity between the source context vector transferred into the pivot language w and all target context vectors transferred into the pivot language t. This method is presented in Figure 2.
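The two pivot approaches can be sketched in the same style as the standard method above; the helpers are re-stated so this fragment is self-contained. Here d_sp, d_pt and d_tp stand for hypothetical source/pivot, pivot/target and target/pivot dictionaries; this is an illustration of the described procedure, not the authors' code.

```python
import math

def transfer(vec, dico):
    """Project a context vector through a bilingual dictionary."""
    out = {}
    for word, weight in vec.items():
        for trans in dico.get(word, []):
            out[trans] = out.get(trans, 0.0) + weight
    return out

def cosine(u, v):
    num = sum(u[w] * v[w] for w in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def rank_p1(src_vec, target_vectors, d_sp, d_pt, k=20):
    """P1: transfer the source vector successively (source -> pivot ->
    target), then compare it with the native target context vectors."""
    via_pivot = transfer(transfer(src_vec, d_sp), d_pt)
    scored = [(t, cosine(via_pivot, tv)) for t, tv in target_vectors.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

def rank_p2(src_vec, target_vectors, d_sp, d_tp, k=20):
    """P2: transpose both source and target vectors into the pivot space
    and compute the similarity there."""
    s_piv = transfer(src_vec, d_sp)
    scored = [(t, cosine(s_piv, transfer(tv, d_tp)))
              for t, tv in target_vectors.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]
```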

Figure 2: Transposing Context Vectors to Pivot Language.

4 Multilingual Resources

In this paper, we perform translation-candidate extraction from all pairs of languages from/to English, French, German and Spanish, involving English or French as the pivot language. The use of those pivot languages in particular is motivated by two factors: first, English, because it is the language for which we have by default a quasi-infinite amount of data, and second, French, because we know that our resources (corpus and dictionaries) are of good quality.

4.1 Comparable Corpora

The first comparable corpus we used during our experiments is the Wind Energy corpus.1 It was built from a crawl of webpages using many keywords related to the wind energy field. The comparable corpus is composed of documents in 7 languages, among others German, English, Spanish and French. The second comparable corpus we used is the Mobile Technologies corpus. It was also built by crawling the web. Both of them were composed of 300k to 470k words in each language.

4.2 Bilingual Dictionaries

In order to perform bilingual lexicon extraction from comparable corpora, a bilingual dictionary was mandatory. Nevertheless, we only had French/English, French/Spanish and French/German dictionaries from the ELRA catalogue.2 These dictionaries were generalist, and contained few terms related to the Wind Energy and Mobile Technologies domains. So, the French/English, French/Spanish and French/German dictionaries were reversed to obtain English/French, Spanish/French and German/French dictionaries. As for the others, they were built by triangulation from the ones above (see Table 1). As a consequence, we expect those dictionaries to be very mediocre.

Table 1: Number of entries in each dictionary.

         EN-DE   EN-ES   EN-FR   FR-ES   FR-DE   DE-ES
         DE-EN   ES-EN   FR-EN   ES-FR   DE-FR   ES-DE
#entr.   600k    26k     240k    100k    170k    15k

4.3 Reference Lists

In order to evaluate the output of the different approaches, terminology reference lists were built from each corpus in each language (Loginova et al., 2012). Depending on the corpus and the language, the lists were composed of 48 to 88 single word terms (abbreviated SWT; see Table 2).

Table 2: Number of SWT in reference lists.

     EN   FR   ES   DE
WE   48   58   55   55
MT   52   58   60   88

5 Experiments and Results

Pre-processing: French, English, Spanish and German documents were pre-processed using TTC TermSuite (Rocheteau and Daille, 2011).3 The operations done during pre-processing were the following: tokenization, part-of-speech tagging and lemmatization. Moreover, function words and hapaxes were removed.

1 http://www.lina.univ-nantes.fr/?Ressources-linguistiques-du-projet.html
2 http://catalog.elra.info/
3 https://logiciels.lina.univ-nantes.fr/redmine/projects
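The filtering step of this pre-processing (removing function words and hapaxes after lemmatization) can be sketched as follows. Tokenization, tagging and lemmatization are delegated to TTC TermSuite in the paper and are not reproduced here; the token list in the example is invented.

```python
from collections import Counter

def filter_corpus(lemmas, function_words):
    """Drop function words and hapaxes (lemmas occurring exactly once),
    as done after tokenization, POS tagging and lemmatization."""
    freq = Counter(lemmas)
    return [l for l in lemmas if l not in function_words and freq[l] > 1]

tokens = ["the", "turbine", "rotor", "turbine", "of", "blade", "blade"]
print(filter_corpus(tokens, {"the", "of"}))
# -> ['turbine', 'turbine', 'blade', 'blade']   ('rotor' is a hapax)
```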

Wind Energy                                    Mobile Technologies

Lang. Pivot Std. P1 P2 RMAX C Std. P1 P2 RMAX C EN-ES FR 0.268 0.646 0.445 0.882 0.390 0.374 65.76% 0.523 0.467 66.52% ES-EN FR 0.119 0.232 0.233 0.491 0.193 0.272 0.321 0.533 EN-DE FR 0.158 0.125 0.458 0.622 0.205 0.570 0.896 0.215 66.21% 68.95% DE-EN FR 0.018 0.018 0.018 0.200 0.074 0.070 0.069 0.455 FR-DE EN 0.056 0.418 0.053 0.597 0.118 0.132 77.63% 0.063 0.061 80.06% DE-FR EN 0.038 0.028 0.028 0.151 0.034 0.023 0.026 0.432 FR-ES EN 0.366 0.150 0.176 0.528 0.514 0.275 0.280 0.807 82.36% 82.02% ES-FR EN 0.210 0.103 0.117 0.357 0.238 0.207 0.186 0.552 FR 0.000 0.001 ES-DE 0.041 0.097 0.273 0.058 0.067 0.500 EN 0.000 0.001 0.045 0.027 44.24% 0.033 0.035 44.02% FR 0.001 0.126 DE-ES 0.018 0.018 0.218 0.355 0.347 0.585 EN 0.001 0.018 0.018 0.126 0.189 0.179

Table 3: MRR achieved for pivot dictionary based approaches.

Context vectors: In order to compute and normalize context vectors, the value a(ct) associated to each co-occurrence ct of a given word w in the corpus was computed. Such a value could be computed thanks to Log-Likelihood (Fano and Hawkins, 1961) or Mutual Information (Dunning, 1993), for instance. Among them we chose Log-Likelihood, as its representativity is the most accurate (Bordag, 2008). Context vectors were computed by TermSuite, as one of its components performs this operation.

Similarity measures: The similarity could be computed according to Cosine similarity (Salton and Lesk, 1968) or Weighted Jaccard Distance (Grefenstette, 1994). We decided to only present the results achieved using Cosine similarity; the differences between them in terms of Mean Reciprocal Rank (MRR) were insignificant.

Cosine(w, t) = ( sum_k a(w_k) a(t_k) ) / ( sqrt(sum_k a(w_k)^2) * sqrt(sum_k a(t_k)^2) )

Evaluation metrics: In order to evaluate our approaches, we used Mean Reciprocal Rank (Voorhees, 1999). The strength of this metric is that it takes into account the rank of the candidate translations. Hereinafter, the MRR is defined as follows (t stands for the list of terms to evaluate, and r_{t_k} for the rank achieved by the system for the correct translation of term t_k):

MRR = (1 / |t|) * sum_{k=1}^{|t|} ( 1 / r_{t_k} )

Results: The MRR achieved by both approaches is shown in Table 3 for the Wind Energy and Mobile Technologies corpora respectively. We present, for the sake of comparison, the results achieved by the standard method (Std.), the method transferring context vectors successively (P1), and the method transposing context vectors to the pivot language (P2). We also give additional information, such as the best achievable result according to the reference lists and the words belonging to the filtered corpus (RMAX), and the corpora comparability C (Li and Gaussier, 2010).

The corpus comparability metric consists in the expectation of finding the translation in the target language for each source word in the corpus. Therewith, it is a good way of measuring the distributional symmetry between two corpora given a dictionary. We can also notice that the maximum recall RMAX is quite low for some pairs of languages: this is due to the high number of hapaxes belonging to the reference lists that were filtered out during pre-processing.

According to the results, we can see that there is a strong correlation between the improvements achieved by pivot-based approaches and corpus comparability. We have improved the quality of the extracted bilingual lexicon only in the case of poorly comparable corpora, respectively C <= 65.76% and C <= 66.52% for the Wind Energy and Mobile Technologies corpora. For instance, we have increased the MRR from 0.268 to 0.390 and 0.374 in the case of translation from English to Spanish for the Wind Energy corpus, and from 0.126 to 0.355 and 0.347 for German to Spanish via French for the Mobile Technologies corpus.

6 Discussion

In Section 5 we have shown that results can be enhanced only in the case of poorly comparable pairs of languages. For fairly comparable corpora (68% <= C <= 80%), results remain unchanged in comparison with the standard approach. Finally, for highly comparable corpora (C > 80%) the quality of the extracted lexicon gets worse.

Table 4: Corpora comparability.

     EN-DE    EN-ES    EN-FR    FR-ES    FR-DE    DE-ES
     DE-EN    ES-EN    FR-EN    ES-FR    DE-FR    ES-DE
WE   66.21%   65.76%   80.23%   82.36%   77.63%   44.24%
MT   68.95%   66.52%   80.99%   82.02%   80.06%   44.02%

The interpretation we suggest is the following: given two corpora, S in the source language, T in the target language, and a source/target bilingual dictionary D_{S/T}, the comparability is a function of S, T and D_{S/T}. Therefore, a low comparability measure can be due to a poor expectation of finding the translation in the target language for each source word in the corpus, either because the two corpora are not lexically close enough, or because the dictionary is weak. We checked this second option, and this is how we substantiate the pivot-dictionary-based approaches. Thus, the use of a source/pivot dictionary D_{S/P} and a pivot/target dictionary D_{P/T} can artificially improve the comparability and enhance the extracted lexicon. We have also remarked that the coverage of dictionaries is an important factor: a large dictionary is better than a short one.

Of course, we do not pretend that our methods can compete with initially very highly comparable corpora, since the use of pivot dictionaries will introduce more noise than it will bring additional information.

7 Conclusion

We have presented two pivot-based approaches for bilingual lexicon extraction from comparable specialized corpora. Both of them rely on pivot dictionaries. We have shown that bilingual lexicon extraction depends on the quality of the resources. Furthermore, we have also demonstrated that the problem can be fixed by involving a third, strongly supported language such as English. We have also shown that the enhancements are a function of the comparability of the corpora. These first experiments have shown that using a pivot language can make improvements in the case of poorly comparable initial corpora.

In future works, we will try to benefit from the information brought by an unbalanced pivot corpus. Unlike this article, in which we have only looked into pivot dictionaries in order to increase the comparability of the source and target corpora, we think that the next step is to reshape context vectors with a pivot corpus. In addition, we will see whether linear regression models to reshape context vectors can make improvements or not.

Acknowledgments

This work is supported by the French National Research Agency under grant ANR-12-CORD-0020.

References

Stefan Bordag. 2008. A comparison of co-occurrence and similarity measures as simulations of context. In Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, pages 52-63, Haifa, Israel.

Stanley F. Chen. 1993. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, pages 9-16, Columbus, Ohio, USA.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-5, Taipei, Taiwan.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2004. Aligning words in French-English non-parallel medical texts: Effect of term frequency distributions. In Medinfo 2004: Proceedings of the 11th World Congress on Medical Informatics, pages 23-27, Amsterdam, Netherlands. Ios Pr Inc.

Ido Dagan and Alon Itai. 1991. Two languages are more informative than one. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 130-137, Berkeley, California, USA.

Hervé Déjean, Éric Gaussier, and Fatia Sadat. 2002. An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1-7, Taipei, Taiwan.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Robert M. Fano and David Hawkins. 1961. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793-794.

Pascale Fung and Kathleen McKeown. 1997. Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora, pages 192-202, Beijing, China.

Pascale Fung. 1995. Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In Proceedings of the 3rd Annual Workshop on Very Large Corpora, pages 173-183, Cambridge, Massachusetts, USA.

Gregory Grefenstette. 1994. Explorations in automatic thesaurus discovery. Springer Science & Business Media.

Amir Hazem and Emmanuel Morin. 2012. Adaptive Dictionary for Bilingual Lexicon Extraction from Comparable Corpora. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pages 288-292, Istanbul, Turkey.

Amir Hazem and Emmanuel Morin. 2013. Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora. In 6th International Joint Conference on Natural Language Processing, pages 1392-1400, Nagoya, Japan.

Jae-Hoon Kim, Hong-Seok Kwon, and Hyeong-Won Seo. 2015. Evaluating a pivot-based approach for bilingual lexicon extraction. Computational Intelligence and Neuroscience, 2015.

Hong-Seok Kwon, Hyeong-Won Seo, and Jae-Hoon Kim. 2013. Bilingual lexicon extraction via pivot language and word alignment tool. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pages 11-15, Sofia, Bulgaria, August.

Bo Li and Eric Gaussier. 2010. Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 644-652, Beijing, China.

Elizaveta Loginova, Anita Gojun, Helena Blancafort, Marie Guégan, Tatiana Gornostay, and Ulrich Heid. 2012. Reference lists for the evaluation of term extraction tools. In Proceedings of the 10th International Congress on Terminology and Knowledge Engineering, Madrid, Spain.

Joel Martin, Rada Mihalcea, and Ted Pedersen. 2005. Word alignment for languages with scarce resources. In Proceedings of the ACL Workshop on Building and Using Parallel Text, pages 65-74, Ann Arbor, Michigan, USA.

Emmanuel Morin and Amir Hazem. 2014. Looking at Unbalanced Specialized Comparable Corpora for Bilingual Lexicon Extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1284-1293, Baltimore, USA.

Emmanuel Morin and Emmanuel Prochasson. 2011. Bilingual lexicon extraction from comparable corpora enhanced with parallel corpora. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora, pages 27-34, Portland, Oregon, USA.

Emmanuel Morin, Béatrice Daille, Kyo Kageura, and Koichi Takeuchi. 2010. Brains, not Brawn: The Use of "Smart" Comparable Corpora in Bilingual Terminology Mining. ACM Transactions on Speech and Language Processing (TSLP), 7(1):1-23.

Franz Josef Och and Hermann Ney. 2001. Statistical multi-source translation. In Machine Translation Summit, pages 253-258, Santiago de Compostela, Spain.

Reinhard Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 320-322, Cambridge, Massachusetts, USA.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 519-526.

Jérôme Rocheteau and Béatrice Daille. 2011. TTC TermSuite: A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 9-12, Chiang Mai, Thailand.

Gerard Salton and Michael E. Lesk. 1968. Computer evaluation of indexing and text processing. Journal of the ACM, 15(1):8-36.

Hyeong-Won Seo, Hong-Seok Kwon, and Jae-Hoon Kim. 2014. Extended pivot-based approach for bilingual lexicon extraction. Journal of the Korean Society of Marine Engineering, 38(5):557-565.

Michel Simard. 1999. Text-translation alignment: Three languages are better than two. In Proceedings of Empirical Methods in Natural Language Processing and Very Large Corpora, pages 2-11, College Park, Maryland, USA.

Ellen M. Voorhees. 1999. The TREC-8 question answering track report. In Proceedings of TREC-8, volume 99, pages 77-82.

Application of a Corpus to Identify Gaps between English Learners and Native Speakers

Katsunori Kotani
16-1 Nakamiyahigashino-cho, Hirakata,
Osaka, Japan 573-1001
[email protected]

Takehiko Yoshimi
1-5 Yokotani, Seta, Otsu,
Shiga, Japan 520-7729
[email protected]

Abstract

In order to develop effective computer-assisted language teaching systems for learners of English as a foreign language, it is first necessary to identify gaps between learners and native speakers in the four basic linguistic skills (reading, writing, pronunciation, and listening). To identify these gaps, the accuracy and fluency in language use between learners and native speakers should be compared using a learner corpus. However, previous corpora have not included all necessary types of linguistic data. Therefore, in this study, we aimed to design and build a new corpus comprising all types of linguistic data necessary for comparing accuracy and fluency in basic linguistic skills between learners and native speakers.

1 Introduction

Learners of English as a foreign language (EFL) frequently demonstrate a level of language ability that differs from that of native speakers (the gap between learner and native speaker). Compared with native speakers, learners more frequently generate grammatically incorrect sentences and speak at a slower rate (Brand and Götz 2011, Chang 2012, Thewissen 2013). To develop tools and methods for effective learning of EFL, gaps in the four basic linguistic skills (reading, writing, pronunciation, and listening) need to be clearly identified and bridged.

A comparative learner corpus is a promising linguistic resource for identifying the gaps. To identify the gaps, a learner corpus should cover the basic linguistic skills (Treiman et al. 2003), because these skills are prerequisite for developing a level of ability adequate for effective communication with English speakers in a global society (Ono 2005).

A learner corpus should address gaps in both accuracy and fluency. Gaps in accuracy result from a lack of linguistic knowledge and manifest as misunderstandings when reading, grammatically incorrect usage when writing, mispronunciations when speaking, and misunderstandings when listening. Gaps in fluency result from a limited ability to perform cognitive-linguistic operations (Juffs and Rodríguez 2015), and manifest as slower rates of reading, writing, and pronunciation. In addition, fluency gaps also tend to result in a lack of confidence among learners.

Some learner corpora have been developed for the purpose of comparative analysis with native speakers (Sugiura 2007, Friginal et al. 2013, Barron and Black 2014); however, these corpora have only focused on writing and speaking, not reading or listening. A learner corpus compiled by Kotani et al. (2011) was composed of reading, writing, pronunciation, and listening data, but did not include data from native speakers.

In this study, we aimed to construct a comparative learner corpus to analyze gaps in the accuracy and fluency of the four basic linguistic skills. Specifically, this study collected corpus data from native speakers and merged these data with those from learners in the corpus compiled by Kotani et al. (2011). For this study, English speakers were categorized into four proficiency levels as follows: learners in Kotani et al. (2011) were classified into three groups based on their level of proficiency, and native speakers were designated as the fourth and most advanced level.

With the goal of supporting EFL teachers who use "authentic" materials such as web pages that are used in English speakers' daily life, we also constructed a statistical model for calculating the

difficulty of each sentence in authentic materials, which demonstrates the effectiveness of our corpus. Because the difficulty level of authentic materials is not always clear, teachers must personally inspect all such materials in order to verify that they are appropriate for the proficiency level of learners. Therefore, we developed a statistical model to automatically measure sentence difficulty and thereby reduce the effort required by teachers for this preparatory task.

2 Corpus between learners and native speakers

2.1 Corpus data

This study collected corpus data from native speakers following the method of Kotani et al. (2011). The corpus data of Kotani et al. (2011) consisted of data collected from learners for analyzing the accuracy and fluency of reading, writing, pronunciation, and listening, and the data are summarized in Table 1.

Table 1: Summary of corpus data

Language use    Accuracy                         Fluency
Reading         Comprehension rate               Silent-reading rate; difficulty judgment score
Writing         Written sentence (correct rate)  Writing rate; difficulty judgment score
Pronunciation   Speech sound (correct rate)      Reading-aloud rate; difficulty judgment score
Listening       Comprehension rate               Difficulty judgment score

The accuracy of reading (comprehension rate) was assessed by calculating the percentage of correct answers to comprehension questions based on written text. The accuracy of writing and pronunciation (the correct rate) was assessed by calculating the percentage of correctly written words or pronounced speech sounds from the total number of words in written sentences or spoken words, respectively. The accuracy of listening was assessed in terms of the comprehension rate for spoken text, similarly to that of reading.

Fluency in terms of reading, writing, and pronunciation was assessed based on silent-reading, writing, and reading-aloud rates, respectively. Fluency was also assessed based on a difficulty judgment score. Difficulty judgment scores for reading were assessed in terms of learners' judgment on the difficulty of reading comprehension, which they indicated on a five-point Likert scale (1: easy; 2: somewhat easy; 3: average; 4: somewhat difficult; or 5: difficult). Scores for writing were assessed in terms of learners' confidence in accuracy on a five-point Likert scale (1: confident; 2: somewhat confident; 3: average; 4: not very confident; or 5: not confident). Those for pronunciation and listening were assessed on the five-point Likert scale in terms of learners' judgment on the difficulty of pronunciation and listening comprehension, respectively.

2.2 Data collection method

Corpus data were collected through a series of reading, writing, pronunciation, and listening tasks. In the reading task, learners silently read 80 sentences in four news articles sentence-by-sentence, selected a difficulty score for each sentence, and answered five multiple-choice comprehension questions for each article. In the writing task, learners wrote sentences to describe four pictures and answered 20 questions about their background and computer skills, and then selected a difficulty score for each sentence. The pronunciation task proceeded similarly to the reading task: learners read aloud 80 sentences in four news articles, and selected a difficulty score for each sentence. Their voices were recorded in a sound-attenuated recording booth. In the listening task, similar to the reading task, learners listened to 80 sentences from four audio news clips sentence-by-sentence, and then selected a difficulty score for each sentence. After a clip was finished, the learner answered five multiple-choice comprehension questions for each clip.

The learner corpus of Kotani et al. (2011) compiled corpus data from three different proficiency groups of learners (beginner-level, intermediate-level, and advanced-level) based on TOEIC (Test of English for International Communication) scores; each group comprised 30 learners. Hence, for this study, we chose to collect corpus data from 30 native speakers (16 male, 14 female; mean age ± standard deviation [SD], 22.5 ± 2.0 years; age range, 20-27 years) to represent a level higher than that of advanced-

level learners. The native speakers were recruited from among university students living in areas in and around Tokyo. All native speakers were compensated for their participation.

2.3 Descriptive statistics

All distributions shown in Tables 2, 3, and 4 followed our expectation that the difficulty of a task would decrease from the beginner to the native speaker level. This outcome suggests the validity of our corpus data.

Mean comprehension rates (± SD) of 120 instances collected from each group (n = 30) of learners and native speakers in four articles and clips involving the reading and listening tasks are summarized in Table 2.

Table 2: Comprehension rates of the four groups (%); mean (SD)

Group            Reading        Listening
Beginner         47.3 (23.6)    43.3 (22.2)
Intermediate     52.0 (22.7)    49.8 (20.9)
Advanced         65.8 (24.0)    67.0 (20.0)
Native speaker   76.5 (21.8)    75.7 (17.4)

Mean difficulty judgment scores (± SD) of 2400 instances collected from each group (n = 30) in 80 sentences involving the reading task are summarized in Table 3. Mean difficulty judgment scores (± SD) of 30*m instances collected from each group (n = 30) in m sentences involving the writing task, in which the number of written sentences (m) differed for each individual, are also summarized in Table 3. Mean difficulty judgment scores (± SD) of 2400 instances collected from each group (n = 30) in 80 sentences involving the pronunciation and listening tasks are also shown.

Table 3: Difficulty judgment scores of the four groups; mean (SD)

Group            Reading       Writing       Pronunciation   Listening
Beginner         3.26 (0.84)   3.07 (0.94)   3.61 (0.89)     3.63 (0.76)
Intermediate     2.72 (0.81)   3.02 (0.60)   3.29 (0.80)     3.18 (0.72)
Advanced         2.18 (0.92)   2.36 (1.00)   2.73 (1.05)     2.28 (0.96)
Native speaker   1.92 (0.84)   1.56 (0.73)   2.15 (0.88)     1.87 (0.82)

Mean processing rates (± SD) of 2400 instances collected from each group in 80 sentences involving the reading task are summarized in Table 4. Mean writing rates (± SD) of 30*l instances collected from each group (n = 30) in l sentences involving the writing task, in which the number of written sentences (l) differed for each individual, are also summarized in Table 4. Mean processing rates (± SD) of 2400 instances collected from each group in 80 sentences involving the pronunciation task are also shown. Processing rates were calculated as the number of words read/written/pronounced in one minute (WPM: words per minute).

Table 4: Processing rates of the four groups (WPM); mean (SD)

Group            Reading          Writing        Pronunciation
Beginner         86.91 (42.19)    9.21 (3.50)    66.28 (13.10)
Intermediate     97.17 (32.11)    10.21 (3.96)   76.97 (15.27)
Advanced         128.32 (44.99)   13.35 (5.42)   91.68 (12.87)
Native speaker   206.21 (61.15)   17.34 (5.78)   119.91 (14.73)
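The processing rates of Table 4 are plain words-per-minute computations; a minimal sketch (ours, with invented timings) follows.

```python
import statistics

def words_per_minute(word_count, seconds):
    """Processing rate in WPM for one sentence-level instance."""
    return word_count / (seconds / 60.0)

def group_stats(instances):
    """instances: list of (word_count, seconds) pairs for one group/task."""
    rates = [words_per_minute(w, s) for w, s in instances]
    return statistics.mean(rates), statistics.stdev(rates)

# Hypothetical reading-task instances: (words in sentence, seconds spent)
native = [(18, 5.2), (22, 6.4), (15, 4.5)]
print(group_stats(native))  # mean and SD in WPM
```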

3 Measurement of sentence difficulty

3.1 Goal

In order to select online materials that are appropriate for the proficiency level of learners, a teacher must personally assess the difficulty of the materials, which is often unclear. A method that would enable the automatic measuring of sentence difficulty of online materials would thereby be expected to reduce the burden of this preparatory task.

To achieve this, we constructed a statistical model based on our corpus data. Our statistical model calculates sentence difficulty in terms of gaps in language use between learners and native speakers on the basis of linguistic features of sentences.

3.2 Methods

We carried out a multiple regression analysis of our corpus data using sentence length (number of words) and mean length of words in a sentence (mean number of syllables) as independent variables.

For the dependent variable, we used the gaps in the silent-reading rate, which were derived for each sentence (n = 80) by subtracting the mean silent-reading rate of advanced-level learners (n = 30) from that of native speakers (n = 30). The distribution of these gaps is summarized in Figure 1. The gaps ranged from < 25 to > 125 WPM, and the distribution of silent-reading rates followed a normal distribution according to the Kolmogorov-Smirnov test (K = 0.49, p < 0.01).

Figure 1: Gaps in the silent-reading rate between learners and native speakers (WPM).

3.3 Results

A significant relationship was observed between the linear combination of linguistic features and gaps in the silent-reading rate (F(2, 77) = 17.42, p < 0.01). The sample multiple correlation coefficient adjusted for degrees of freedom was 0.54, indicating that approximately 31.1% of the variance in the gaps in the sample could be accounted for by the linear combination of linguistic features.

We then assessed our method using a leave-one-out cross-validation test. In this test, our method was examined n times (n = 80) by using one instance as test data and n−1 instances as training data. Spearman's correlation coefficient was used to compare the gaps predicted using our method with those that were actually measured. The correlation coefficient (r = 0.48) was statistically significantly different from zero (p < 0.01).

Errors in the cross-validation test results are summarized in Figure 2. Errors were calculated as absolute values of the differences between gaps predicted using our method and the actual gaps. Our method was associated with a lower error rate (0 to 25 WPM).

Figure 2: Errors in the cross-validation test results (WPM).
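A sketch of the analysis in Sections 3.2 and 3.3 (two-feature multiple regression, leave-one-out cross-validation, Spearman correlation) is given below. It assumes scikit-learn and SciPy are available and substitutes synthetic data for the 80 sentences of the corpus; it illustrates the procedure only, not the reported numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from scipy.stats import spearmanr

# X: one row per sentence -> [sentence length in words, mean syllables/word]
# y: gap in silent-reading rate (native mean minus advanced-learner mean)
X = np.random.RandomState(0).uniform([5, 1.0], [40, 2.5], size=(80, 2))
y = 30 + 2.0 * X[:, 0] + 15 * X[:, 1] + \
    np.random.RandomState(1).normal(0, 20, 80)

# Leave-one-out cross-validation: refit on 79 sentences, predict the 80th.
predicted = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    predicted[test_idx] = model.predict(X[test_idx])

rho, p = spearmanr(predicted, y)
errors = np.abs(predicted - y)
print(f"Spearman rho={rho:.2f} (p={p:.3g}), "
      f"mean abs. error={errors.mean():.1f} WPM")
```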

4 Conclusion

This paper described the construction of a corpus comprising all types of linguistic data necessary for comparing accuracy and fluency in basic linguistic skills between learners and native speakers. We expect that this corpus will enable teachers to more accurately assess the performance of learners in greater detail through a comparison with native speakers. We also expect our statistical model to serve as an effective method for measuring the difficulty of online materials, thereby reducing the burden of this preparatory task and allowing teachers to more easily select online materials that are appropriate for the proficiency level of learners.

Acknowledgments

This work was supported in part by Grant-in-Aid for Scientific Research (B) (22300299) and (15H02940).

References

Christiane Brand and Sandra Götz. 2011. Fluency versus accuracy in advanced spoken learner language. Errors and Disfluencies in Spoken Corpora. Special Issue of International Journal of Corpus Linguistics, 16(2): 255-275.

Anne Barron and Emily Black. 2014. Constructing small talk in learner-native speaker voice-based telecollaboration: A focus on topic management and backchanneling. System, 48: 112-128.

Anna C.-S. Chang. 2012. Improving reading rate activities for EFL students: Timed reading and repeated oral reading. Reading in a Foreign Language, 24(1): 56-83.

Eric Friginal, Man Li, and Sara C. Weigle. 2013. Revisiting multiple profiles of learner compositions: A comparison of highly rated NS and NNS essays. Journal of Second Language Writing, 23: 1-16.

Alan Juffs and Guillermo A. Rodríguez. 2015. Second Language Sentence Processing. New York: Routledge.

Katsunori Kotani, Takehiko Yoshimi, Hiroaki Nanjo, and Hitoshi Isahara. 2011. Compiling learner corpus data of linguistic output and language processing in speaking, listening, writing, and reading. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011): 1418-1422.

Hiroshi Ono. 2005. A development of placement test and e-learning system for Japanese university students: Research on support improving academic ability based on IT. Research Report, National Institute of Multimedia Education 2005-6.

Jennifer Thewissen. 2013. Capturing L2 accuracy developmental patterns: Insights from an error-tagged EFL learner corpus. The Modern Language Journal, 97(1): 77-101.

Rebecca Treiman, Charles Clifton Jr., Antje S. Meyer, and Lee H. Wurm. 2003. Language comprehension and production. Comprehensive Handbook of Psychology 4: Experimental Psychology. New York: John Wiley & Sons, Inc.: 527-548.

Masatoshi Sugiura, Masumi Narita, Tomomi Ishida, Tatsuya Sakaue, Remi Murao, and Kyoko Muraki. 2007. A discriminant analysis of non-native speakers and native speakers of English. Proceedings of the 2007 Corpus Linguistics Conference.

A Generative Model for Extracting Parallel Fragments from Comparable Documents

Somayeh Bakhshaei1, Shahram Khadivi2* and Reza Safabakhsh1 1Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran 2eBay Inc., Aachen, Germany [email protected], [email protected], [email protected]

Abstract*

Although parallel corpora are essential language resources for many NLP tasks, they are rare or even not available for many language pairs. Instead, comparable corpora are widely available and contain parallel fragments of information that can be used in applications like statistical machine translation. In this research, we propose a generative LDA-based model for extracting parallel fragments from comparable documents without using any initial parallel data or bilingual lexicon. The experimental results show significant improvement if the extracted sentence fragments generated by the proposed method are used in addition to an existing parallel corpus in an SMT task. According to human judgment, the accuracy of the proposed method for an English-Persian task is about 66%. Also, the OOV rate for the same task is reduced by 28%.

1 Introduction

Parallel corpora are essential for many applications like statistical machine translation (SMT). Even language pairs that are resource-rich in terms of parallel corpora always need more data, since languages evolve and diversify over time. Comparable corpora are considered a widely available language resource that contains a notably large amount of parallel sentence fragments. However, mining these fragments is a challenging task, and therefore many different approaches were proposed to extract parallel sentences, parallel fragments, or parallel lexicons. It has been shown in previous works that extracting parallel sentences from comparable corpora usually results in a noisy parallel corpus (Munteanu & Marcu, 2006). Since comparable documents rarely contain exact parallel sentences, they instead contain a good amount of parallel sub-sentences or fragments. Thus, it is better to search for parallel fragments instead of parallel sentences.

Recent research works in fragment extraction have shown significant improvements in SMT quality if parallel fragments are also used in the training phase (Chiao & Zweigenbaum, 2002; Déjean et al., 2002; Fung & McKeown, 1997; Fung & Yee, 1998; Gupta et al., 2013; Otero, 2007; Rapp, 1999; Saralegui et al., 2008). In this work, we also focus on extracting parallel fragments from comparable corpora. Our proposed approach is a generative model based on latent Dirichlet allocation (LDA) principles (Blei & Jordan, 2003).

In our proposed generative model, we assume there are parallel topics as hidden variables that model the parallel fragments in a comparable document corpus. We define parallel fragments as a sequence of occurrences of one of these parallel topics. This sequence occurs densely on a pair of comparable documents. It is possible to consider more than one topic in the structure of the topic sequence, but in this work we have limited it to one for simplicity and lower computational complexity. Considering more topics in the structure of a sequence that produces parallel fragments is suggested as our future work.

The rest of the paper is organized as follows. Section 2 describes the related works. Section 3 describes the generative process for producing comparable documents. The model architecture is described in Section 4 with a graphical model. Section 5 describes the data, tools and resources

* This work has been done when Shahram Khadivi was with Amirkabir University of Technology.

2 Related works

Comparable corpora are useful resources for many research fields of NLP. SMT, as one of the major problems in the NLP field, can also benefit from comparable corpora. Previous research has suggested different approaches for extracting parallel information from comparable corpora. The main approaches can be categorized as: Lexicon Induction, Wikipedia based, Bridge Language, Graph based, Bootstrapping, and EM.

Works reported under Lexicon Induction mostly focus on extracting words from comparable corpora. These works use different methods that we categorize as seed based, model based, and graph based.

The aim of the Seed based Lexicon Induction approach is to expand an initial parallel seed. Most of these works use the context vector idea (Fung & Yee, 1998; Irvine & Callison-Burch, 2013; Rapp, 1995; Rapp, 1999). Gaussier et al. (2004) propose a geometric model for finding synonymous words in the space of context vectors. Garera et al. (2009) define context vectors on the dependency tree rather than using adjacency. Some works use specific features for describing words, like temporal co-occurrences (Schafer & Yarowsky, 2002), linguistic features (Kholy et al., 2013; Koehn & Knight, 2002), and web-based visual similarity features (Bergsma & Van Durme, 2011; Fiser & Ljubesic, 2011). The suggested features are mostly effective for similar or closely related languages, but not for all language pairs.

The Model based Lexicon Induction approach contains works that suggest a model for extracting parallel words. Daumé III & Jagarlamudi (2011) and Haghighi et al. (2008) use a generative model based on Canonical Correlation Analysis (CCA) (Hardoon et al., 2004). They assume that by mapping words to a feature space, similar words are located in a subspace called the latent space of common concepts. Although their model is strong, they define it based on orthographic features (in addition to context vectors), which reduces the efficiency of the model for unrelated languages. Diab & Finch (2000) also define a matching function on similar words of two languages. They assume that for two synonyms with close distributional profiles, the distributional profiles of their corresponding translations should also be correlated in a comparable corpus. The optimization phase of their model, based on gradient descent, is very complex, and time complexity is the biggest challenge of this model when facing big data; the experiment is restricted to highly frequent words. Quirk et al. (2007) also propose a generative model; their model is a developed version of the IBM 1 and 2 models. Although these are generative models for extracting parallel fragments, they completely differ from our model. Our model is based on the LDA model, and we define a simpler but more efficient model with an accurate probabilistic distribution for parallel fragments in comparable corpora.

Wikipedia, as a multilingual encyclopedia, is a rich source of multilingual comparable corpora, and many works are reported under the Wikipedia based approach (Otero & López, 2010). Otero & López (2010) download the entire Wikipedia for any two languages, build the "CorpusPedia", and then extract information from this corpus. However, recent work has shown that even a small ad-hoc corpus of Wikipedia articles can be beneficial for an existing MT system (Pal et al., 2014). Although the Wikipedia based approach is a successful method for producing parallel information, the limited coverage of Wikipedia articles for most language pairs is a big problem.

The methods of Cross-lingual Information Retrieval are widely used for mining comparable corpora. The Bridge language idea is especially used for extracting parallel information between languages (Gispert & Mario, 2006; Kumar et al., 2007; Mann & Yarowsky, 2001; Wu & Wang, 2007). Some papers use multiple languages for pivoting (Soderland et al., 2009). The big problem of this approach is its unavoidably noisy output; thus, some papers use a two-step version of this model to solve the problem: they first produce output and then refine it by removing its noise (Shezaf & Rappoport, 2010; Kaji et al., 2008).

A wide range of works use a Graph for extracting parallel information from comparable corpora. Laws et al. (2010) build a graph on the source (src) and target (trg) words (nodes are src/trg words) and find similar nodes using the SimRank idea (Jeh & Widom, 2002). Some works define an optimization problem for finding the similarity on the edges of the graph of src and trg words (Muthukrishnan et al., 2011).

Razmara et al. (2013) and Saluja & Navrátil (2013) use graphs for solving the out-of-vocabulary (OOV) problem in MT. Razmara et al. (2013) build the nodes of the graph on phrases in addition to words. Minkov & Cohen (2012) use words and their stems as graph nodes, and also use the dependency tree for preserving the structure of words in source and target sentences. Some other works use the simple but efficient EM algorithm for producing a bilingual lexicon (Koehn & Knight, 2000).

A wide range of bootstrapping approaches are applied for extracting bilingual information from comparable corpora. Two-level approaches start with Munteanu & Marcu (2006), who change a sentence to a signal based on the LLR score and then use a filter for extracting parallel fragments. This approach is continued in later works (Xiang et al., 2013). Chu et al. (2013) use a similar idea on quasi-comparable corpora. Klementiev et al. (2012) use a heuristic approach for building context vectors directly on parallel phrases instead of parallel words. Aker & Gaizauskas (2012) and Hewavitharana & Vogel (2013) define a classifier for extracting parallel fragments.

3 LDA Based Generative Model

For extracting parallel fragments we use the LDA concept (Blei & Jordan, 2003). The base of our model is a bilingual topic model. Bilingual topic models were studied in previous works. Multilingual topic models similar to this work were presented in Ni et al. (2009) and Mimno et al. (2009); however, these are polylingual topic models trained on words, and our model is an extended version of this type of model with the additional capability of producing parallel fragments. In Boyd-Graber (2009) a bilingual topic model is presented; the model is trained on pairs of src and trg words which are prepared by a matching function while training the topic models. Another proposed model is Boyd-Graber & Resnik (2010), which is a customized version of LDA for sentiment analysis.

We infer topics as distributions over words, as usual in topic models, but the model is biased toward a specific distribution of topics over the words of documents. We assume that a pair of comparable documents is made of a topic distribution. We define topics over words, but only the topics that are proper for producing parallel fragments are chosen; therefore, we limit them to those that produce a dense bilingual sequence of source and target words in a comparable document pair. We use a definite function for controlling the topics and producing parallel fragments; this function is called m(). Function m accepts a pair of fragments ⟨f^s, f^t⟩ if Conditions (1) are satisfied and rejects it otherwise. The graphical presentation of the proposed model is depicted in Figure 1, where the model variables and relations are also shown. Here, we have used a known variable, λ.

Each pair of comparable documents is generated with the generative process of Table 1. In this process, β^s, β^t and α are hyper-parameters of the Dirichlet distributions. The topic distributions φ^s and φ^t are drawn from Dir(β^s) and Dir(β^t), respectively. First a sample distribution θ ~ Dir(α) is drawn for both the source and the target document. Then the topic of each word of the comparable document pair is drawn from a multinomial distribution parameterized with θ, z ~ Mult(θ). Source and target words are generated from the respective topic distribution: w* | z ~ φ*_z.

Figure 1: Graphical model for extracting parallel fragments. Function m determines if the sequence of topics assigned to the observed fragments satisfies Conditions (1). If so, the parallel pair of source and target fragments is produced.

According to the given definition of parallel fragments in Section 1, we produce a dense sequence of topics. In fact, by a dense sequence of a topic we mean a sub-sentence of the source and target document with limited length in which most of the words come from one topic distribution. For controlling these sub-sentences, we define the following conditions (a sketch of this check as code is given below):

1. The length of the fragments is limited;
2. At least 50% of the words of a valid fragment come from one specific topic.
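For concreteness, the following sketch shows one way these conditions could be checked for a candidate fragment. This is our illustration, not the authors' code; the bound max_len stands in for the Poisson-sampled length limit λ introduced in Section 4.

    from collections import Counter

    def is_valid_fragment(topic_seq, max_len, min_share=0.5):
        """Conditions (1): a candidate fragment is valid when (1) its
        length stays within the limit and (2) at least half of its words
        are assigned to one specific topic."""
        if not topic_seq or len(topic_seq) > max_len:
            return False
        # most common topic and how often it occurs in the fragment
        _, top_count = Counter(topic_seq).most_common(1)[0]
        return top_count / len(topic_seq) >= min_share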

We refer to these conditions as Conditions (1) in the rest of the paper.

Table 1: Generative model for producing a comparable document corpus.

1. Sample source & target topic distributions: φ^s ~ Dir(β^s), φ^t ~ Dir(β^t).
2. For each document pair D = ⟨S, T⟩:
   a) Sample a topic distribution: θ ~ Dir(α);
   b) Sample the length of parallel fragments: λ ~ Poisson(ε);
   c) Produce random indexes from S & T;
   d) Sample a topic z ~ Mult(θ);
   e) Sample each of the chosen indexes: w* | z ~ φ*_z;
   f) Sample the rest of the indexes: z ~ Mult(θ), w* | z ~ φ*_z;
   g) If the sequence of chosen topics satisfies Conditions (1): produce parallel fragments.

4 Inference

Given a corpus of comparable documents, our goal is to infer the unknown parameters of the model. According to Figure 2, we infer the topics φ^t and φ^s, the distribution of parallel topics on the source and target documents θ, and the topic assignment z.

We use a collapsed Gibbs sampler (Neal, 2000) for sampling the latent variable of topic assignment z. We use two sets, I and J, which are random indexes chosen from the word indexes of the source and target documents, respectively:

    I = {Rand(N_{d,s})}_1^λ,   J = {Rand(N_{d,t})}_1^λ

The size of these two sets is defined by the maximum length of parallel fragments in each document pair. The maximum length of parallel fragments, λ, is randomly sampled from a Poisson distribution: λ ~ Poisson(ε).

The words that appear at the indexes of sets I and J are shown as w(I) and w(J), respectively. The words of these two sets are generated from one topic and build the dense sequence of words. We let N_k^{w(I)} and N_k^{w(J)} denote the numbers of assignments of topic k to the source and target words w(I) and w(J) occurring at the word indexes of the sets I and J. Also, N_{I,i} and N_{J,i} are the numbers of times topic k occurs at the indexes defined by the I and J sets in the source and target documents. The collapsed Gibbs update for the topic of the current index i then combines the evidence from both sides:

    p(z_i = k | z_{-i}, I, J) ∝ p(I | z_i = k, z_{-i}) p(z_i) · p(J | z_i = k, z_{-i}) p(z_i)
                              ∝ [(N_k^{w(I)} + β^s) / Σ_n (N_{k,n}^s + β^s)] · [(N_{k,i} + α) / (N_{I,i} + Tα)]
                                · [(N_k^{w(J)} + β^t) / Σ_n (N_{k,n}^t + β^t)] · [(N_{k,i} + α) / (N_{J,i} + Tα)]

We assume the source and target words located at the current index i, w_i^s and w_i^t, are members of w(I) and w(J), respectively; but while we are generating a word index outside I and J, N_k^{w(I)} and N_k^{w(J)} change to N_k^{w_i} and N_k^{w_j}. Finally, the function m() produces parallel fragments ⟨f^s, f^t⟩ only if they are consistent with Conditions (1) defined in Section 3.
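To make the generative story of Table 1 concrete, here is a minimal simulation sketch. It is our illustration under stated assumptions: phi_s and phi_t are given topic-word matrices whose rows sum to 1, eps is the Poisson rate ε, and all chosen indexes receive the single fragment topic of step (d); the m() filter and the inference step are omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_document_pair(phi_s, phi_t, alpha, eps, n_src, n_tgt):
        T, _ = phi_s.shape
        theta = rng.dirichlet([alpha] * T)      # step (a): theta ~ Dir(alpha)
        lam = max(1, rng.poisson(eps))          # step (b): lambda ~ Poisson(eps)
        I = set(rng.choice(n_src, size=min(lam, n_src), replace=False))  # step (c)
        J = set(rng.choice(n_tgt, size=min(lam, n_tgt), replace=False))
        k = rng.choice(T, p=theta)              # step (d): one fragment topic

        def emit(n, chosen, phi):
            words = []
            for idx in range(n):
                # steps (e)/(f): fragment topic inside I/J, theta-sampled topic outside
                z = k if idx in chosen else rng.choice(T, p=theta)
                words.append(rng.choice(phi.shape[1], p=phi[z]))  # w ~ phi_z
            return words

        return emit(n_src, I, phi_s), emit(n_tgt, J, phi_t), I, J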


Table 4: Some of the worst parallel fragments produced by our model, recognized by manually checking the model output. The errors are highlighted, and the correct translation of the English part for the extracted Farsi fragment is given in the En CT row.

1. En: permanent security council members (Type 1.1 in target fragment)
   Fa: شورای دایم سازمان امنیت
   En CT: permanent security council
2. En: nobel peace prize winner (Type 1.2 in target fragment)
   Fa: برنده هندی جایزه صلح نوبل
   En CT: indian nobel peace prize winner
3. En: official irna news agency (Type 2)
   Fa: خبرگزاری نیمه رسمی فارس
   En CT: official fars news agency
4. En: eu foreign (Type 1.1 in source fragment)
   Fa: مسئول سیاست خارجی اتحادیه اروپا
   En CT: eu foreign policy chief
5. En: unanimously adopted the resolution imposing sanctions (Type 3)
   Fa: شورای امنیت سازمان ملل روز شنبه به اتفاق آرا
   En CT: the un security council on saturday unanimously

Table 2: Statistics of the comparable corpora used. The number of documents and running words is reported for each side of the corpus.

    Corpus          Side   #Documents   #Words
    Raw_ccNews      en     194K         47M
                    fa     194K         42M
    Refined_ccNews  en     97K          29M
                    fa     97K          23M

Table 3: Statistics of the extracted parallel fragments.

    Side                              #Fragments   #Words
    Extracted parallel fragments, en  75K          416K
    Extracted parallel fragments, fa  75K          448K

5 Experimental Setup

We have two strategies for evaluating our model. In the first, we try to measure the quality of the fragments extracted from comparable documents directly. In the second, we evaluate the quality of the extracted parallel fragments by evaluating the quality of an SMT system equipped with this extra information.

5.1 Data

The data we use is a corpus of comparable documents, ccNews. The languages of these data are Farsi (fa) and English (en), and the domain is news gathered between the years 2007 and 2010. The raw version of this corpus (Raw_ccNews) has about 193K documents and about 47M and 42M words on the en and fa sides, respectively. We did some refinement on the corpus, and the result is named the Refined_ccNews corpus, as seen in Table 2: we removed repeated documents, and pairs of documents with an incompatible ratio of words were also removed.

The incompatibility of the word ratio is defined via the proportion of the number of words on one side to that of the other side. This ratio is required to be in the interval [0.5, 2], that is:

    0.5 ≤ (#words of source-side document) / (#words of target-side document) ≤ 2

The full information on the corpus is reported in Table 2.
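As a small illustration, the refinement's ratio constraint could be implemented as follows; this is our sketch, and the function and variable names are hypothetical.

    def compatible_word_ratio(src_words: int, tgt_words: int) -> bool:
        """Keep a document pair for Refined_ccNews only when the
        side-to-side word-count ratio lies in [0.5, 2]."""
        if tgt_words == 0:
            return False
        return 0.5 <= src_words / tgt_words <= 2.0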

5.2 Topic Model Parameters

In the experiments, the hyper-parameters of the model are manually set to β^s, β^t = 0.8 and α = 1, and the number of topics in the models is set to T = 800. A side effect of training the model is a parallel topic model: these topics are the ones that have words in common with the source and target sides of at least one comparable document pair. The number of Gibbs sampling iterations is set to 1000. The parallel fragments of the last iteration produced by the m() function are reported as the final result.

5.3 Results Analysis

The statistics of the extracted parallel fragments are reported in Table 3. On average, 75K parallel fragments are extracted from 97K comparable documents. These numbers show that the model only produces high-confidence samples and ignores most of them.

Evaluation Strategy 1 - To our knowledge, there is no criterion to automatically evaluate the quality of the extracted data. Thus, for evaluating the quality of the results we use human judgment. We asked a human translator familiar with both Farsi and English to check the quality of the parallelized fragments, to mark the pairs that are wrongly parallelized, and to write down a description of the error that occurred.

The results of manually checking the extracted fragments are shown in Table 4, in which we report some of the worst errors of the model. According to the human judge, we recognized some specific types of error in the model output. These errors are categorized into three types:

1. Wrong boundaries for parallel fragments:
   1.1. tighter boundaries that lead to incomplete phrases;
   1.2. wider boundaries that lead to additional wrong tokens at the start/end of parallel fragments.
2. Same-class words that are not the exact translation of each other.
3. Completely wrong samples.

Type 1 errors are related to samples for which the boundaries are not correctly chosen by the model. This type is separated into two sub-parts for tighter or wider boundaries, which respectively drop or add some key tokens of the parallel fragments, leading to error.

Type 2 errors are produced by using co-class words instead of synonyms. This is because the model intentionally groups words based on co-occurrence instead of considering meaning, a behavior it has inherited from its LDA base (the model is actually a topic model, and this is the usual behavior of topic models). Fixing this shortcoming can be considered as future work for improving the model's accuracy.

Finally, the reason for Type 3 errors is not obviously known. These samples are produced by the inner noise of the model; we guess these are the unavoidable noises of comparable documents that propagate to the model output.

According to this classification of errors, the proportion of each error type is computed. The results are reported in Table 5. These are the proportions of each type observed in a set of 400 random fragments evaluated by the human translator. The most frequently observed error is of Type 1. In total, the human evaluation suggests 66% accuracy for the model output.

Table 5: Analysis of the model output by error type, as recognized by the human translator's judgment.

    Error Type   P
    Type 1       33%
    Type 2       0.04%
    Type 3       0.02%

Evaluation Strategy 2 - In the second step, for evaluating the model output, we consider the effect of the extracted data on the quality of an existing SMT system. For this aim, we first train a baseline system on a parallel corpus. Our corpus is the Mizan parallel corpus [2], whose domain is literature. To challenge the translation system, we used an out-of-domain test: our test set is selected from the news domain.

The standard phrase-based decoder that we use for training models is the Moses system (Koehn et al., 2007), in which we use default values for all of the decoder parameters. We also use a 4-gram language model trained using SRILM (Stolcke, 2002) with Kneser-Ney smoothing (Kneser & Ney, 1995). To tune the decoder's feature weights with minimum error rate training (Och, 2003), we use a development (dev) set of 1000 single-reference sentences, and we evaluate the models' performance on a test set of 1032 multiple-reference sentences. For more information on the data, see Table 6. The domain of the dev set and the training corpus is literature, while the test set domain is news.

As seen in Table 7, different approaches are proposed for how to use parallel fragments for improving the baseline system. The models are described in the following.

Baseline - This is an SMT system trained on the main corpus (Mizan). The BLEU score of the baseline system is 10.41% on the test set and 8.01% on the dev set. The OOV errors in this system are 3509 and 768 on the test and dev sets, respectively.

Baseline+ParallelFragments - In this system, we directly add the parallel fragments to our main corpus and train a new system. The BLEU score improvement is about 0.27% and 0.22% on the test and dev sets, respectively. The OOV error is reduced as well.

Baseline+ParallelFragments (Giza weighted) - This approach is the same as Baseline+ParallelFragments, but we use a weighted corpus for the Giza alignment. The weights of the main corpus and the parallel fragments are set to 10 and 1, respectively.

Baseline+PT_ParallelFragments - In this approach, we combine the phrase tables of the baseline and of the system trained on the parallel fragments. Because of the different domains of the main corpus and the parallel fragments, it is expected that naively combining these two resources would harm the quality of the baseline system; so we use the phrase table trained on the parallel fragments as a back-off for the phrase table of the baseline system (see the sketch below). The results show significant improvement in this case: the BLEU score improves by about 1% on the test set, and the OOV error is decreased by 28%.

Thus, the results shown in Table 7 reveal that the extracted parallel fragments can improve the quality of the translation output.

[2] Supreme Council of Information and Communication Technology. (2013). Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.
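The back-off combination used in Baseline+PT_ParallelFragments can be pictured with a small sketch: the out-of-domain fragment table is consulted only for source phrases the in-domain table does not cover. This is our simplified illustration with dict-shaped tables; Moses configures back-off between phrase tables through its own decoding options rather than by merging files.

    def combine_with_backoff(baseline_pt: dict, fragment_pt: dict) -> dict:
        """Each table maps a source phrase to its translation options.
        The fragment-derived table serves purely as a back-off."""
        merged = dict(fragment_pt)   # back-off entries first ...
        merged.update(baseline_pt)   # ... then the in-domain table overrides
        return merged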

6 Conclusion

In this paper we have proposed a generative LDA-based model for extracting parallel fragments from comparable corpora. The main contribution of the proposed model is that it extracts parallel fragments from a corpus of comparable documents without the need for any parallel data such as an initial seed or a dictionary.

We have evaluated the output of the model using a human translator's judgment, and also by using the extracted data to expand the training data of an SMT system. The results of the augmented system show an improvement in output quality.

The human judgment categorizes the dominant errors of the model into three types; most errors are related to boundaries wrongly recognized by the model. We consider the refinement of these kinds of errors as future work. We have also shown that the model is able to reduce the OOV error.

Table 6: Statistics of the Train, Test and Dev sets used for building the SMT system.

    Set     Side   #Lines      #Words       Domain
    Train   en     1,021,323   13,636,292   Literature
            fa     1,021,323   13,686,642
    Test    en     1,032       28,112       News
            fa 1   1,032       30,451
            fa 2   1,032       33,725
            fa 3   1,032       33,128
            fa 4   1,032       32,417
    Dev     en     1,000       23,055       Literature
            fa     1,000       26,351

Table 7: Results of the trained SMT systems.

    SMT system                           Test BLEU (%)   Test OOV   Dev BLEU (%)   Dev OOV
    Baseline                             10.41           3509       8.01           768
    +ParallelFragments                   10.68           2459       8.22           737
    +ParallelFragments (Giza weighted)   10.53           2460       8.23           737
    +PT_ParallelFragments                11.46           2530       8.14           734

References

Aker, A., & Gaizauskas, Y. F. (2012). Automatic bilingual phrase extraction from comparable corpora. In Proceedings of COLING 2012: Posters (pp. 23-32). Mumbai.

Bergsma, S., & Van Durme, B. (2011). Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 22, No. 3 (p. 1764).

Blei, D. M., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Boyd-Graber, J. (2009). Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 75-82). AUAI Press.

Boyd-Graber, J., & Resnik, P. (2010). Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In Proceedings of the 2010 Conference on EMNLP (pp. 45-55).

Chiao, Y. C., & Zweigenbaum, P. (2002). Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 2 (pp. 1-5).

Chu, C., Nakazawa, T., & Kurohashi, S. (2013). Accurate parallel fragment extraction from quasi-comparable corpora using alignment model and translation lexicon. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (pp. 1144-1150).

Daumé III, H., & Jagarlamudi, J. (2011). Domain adaptation for machine translation by mining unseen words. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Volume 2. Association for Computational Linguistics.

Déjean, H., Gaussier, É., & Sadat, F. (2002). Bilingual terminology extraction: An approach based on a multilingual thesaurus applicable to comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics (COLING).

Diab, M., & Finch, S. (2000). A statistical word-level translation model for comparable corpora. In Proceedings of the Conference on Content-Based Multimedia Information Access (RIAO).

Fiser, D., & Ljubesic, N. (2011). Bilingual lexicon extraction from comparable corpora for closely related languages. In RANLP (pp. 125-131).

Fung, P., & McKeown, K. (1997). Finding terminology translations from non-parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora (pp. 192-202).

Fung, P., & Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, ACL '98 (pp. 414-420). Association for Computational Linguistics.

Garera, N., Callison-Burch, C., & Yarowsky, D. (2009). Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.

Gaussier, E., Renders, J. M., Matveeva, I., Goutte, C., & Déjean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting on ACL (p. 526).

Gispert, A. d., & Mario, B. (2006). Catalan-English statistical machine translation without parallel corpus: Bridging through Spanish. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genoa, Italy.

Gupta, R., Pal, S., & Bandyopadhyay, S. (2013). Improving MT system using extracted parallel fragments of text from comparable corpora. In Proceedings of the 6th Workshop on Building and Using Comparable Corpora, ACL, Sofia, Bulgaria (pp. 69-76).

Haghighi, A., Liang, P., Berg-Kirkpatrick, T., & Klein, D. (2008). Learning bilingual lexicons from monolingual corpora. In ACL 2008 (pp. 771-779).

Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639-2664.

Hewavitharana, S., & Vogel, S. (2013). Extracting parallel phrases from comparable data. In Building and Using Comparable Corpora (pp. 191-204). Springer Berlin Heidelberg.

Irvine, A., & Callison-Burch, C. (2013). Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of NAACL-HLT.

Jeh, G., & Widom, J. (2002). SimRank: A measure of structural-context similarity. In KDD '02 (pp. 538-543).

Kaji, H., Tamamura, S., & Erdenebat, D. (2008). Automatic construction of a Japanese-Chinese dictionary via English. In LREC.

Kholy, E., Nizar Habash, A., Leusch, G., Matusov, E., & Sawaf, H. (2013). Language independent connectivity strength features for phrase pivot statistical machine translation. In Proceedings of ACL, Vol. 13.

Klementiev, A., Irvine, A., Callison-Burch, C., & Yarowsky, D. (2012). Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the ACL (pp. 130-140).

Kneser, R., & Ney, H. (1995). Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), Vol. 1 (pp. 181-184). IEEE.

Koehn, P., & Knight, K. (2000). Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. In AAAI/IAAI.

Koehn, P., & Knight, K. (2002). Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, Volume 9. Association for Computational Linguistics.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (pp. 177-180).

Kumar, S., Och, F. J., & Macherey, W. (2007). Improving word alignment with bridge languages. In EMNLP-CoNLL.

Laws, F., Michelbacher, L., Dorow, B., Scheible, C., Heid, U., & Schütze, H. (2010). A linguistically grounded graph model for bilingual lexicon extraction. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 614-622). Association for Computational Linguistics.

Mann, G. S., & Yarowsky, D. (2001). Multipath translation lexicon induction via bridge languages. In NAACL.

Mimno, D., Wallach, H. M., Naradowsky, J., Smith, D. A., & McCallum, A. (2009). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 2 (pp. 880-889).

Minkov, E., & Cohen, W. W. (2012). Graph based similarity measures for synonym extraction from parsed text. In Workshop Proceedings of TextGraphs-7 on Graph-based Methods for Natural Language Processing. Association for Computational Linguistics.

Munteanu, D. S., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL (pp. 81-88).

Muthukrishnan, P., Radev, D., & Mei, Q. (2011). Simultaneous similarity learning and feature-weight learning for document clustering. In Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249-265.

Ni, X., Sun, J. T., Hu, J., & Chen, Z. (2009). Mining multilingual topics from Wikipedia. In Proceedings of the 18th International Conference on World Wide Web (pp. 1155-1156). ACM.

Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1. Association for Computational Linguistics.

Otero, P. G. (2007). Learning bilingual lexicons from comparable English and Spanish corpora. In Proceedings of MT Summit XI (pp. 191-198).

Otero, P. G., & López, I. G. (2010). Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on BUCC, LREC (pp. 21-25).

Pal, S., Pakray, P., & Naskar, S. K. (2014). Automatic building and using parallel resources for SMT from comparable corpora. In Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL (pp. 48-57).

Quirk, C., Udupa, R., & Menezes, A. (2007). Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the Machine Translation Summit XI (pp. 377-384).

Rapp, R. (1995). Identifying word translations in non-parallel texts. In Proceedings of the Association for Computational Linguistics.

Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL '99 (pp. 519-526). Association for Computational Linguistics.

Razmara, M., Siahbani, M., Haffari, G., & Sarkar, A. (2013). Graph propagation for paraphrasing out-of-vocabulary words in statistical machine translation. In Proceedings of the Conference of the Association for Computational Linguistics (ACL).

Saluja, A., & Navrátil, J. (2013). Graph-based unsupervised learning of word similarities using heterogeneous feature types. In Graph-Based Methods for Natural Language Processing.

Saralegui, X. I., San Vicente, I., & Gurrutxaga, A. (2008). Automatic generation of bilingual lexicons from comparable corpora in a popular science domain. In LREC 2008 Workshop on BUCC.

Schafer, C., & Yarowsky, D. (2002). Inducing translation lexicons via diverse similarity measures and bridge languages. In Proceedings of the 6th Conference on Natural Language Learning, Volume 20, COLING-02 (pp. 1-7). Stroudsburg, PA, USA: Association for Computational Linguistics.

Shezaf, D., & Rappoport, A. (2010). Bilingual lexicon generation using non-aligned signatures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 98-107). Association for Computational Linguistics.

Soderland, S., Etzioni, O., Weld, D. S., Skinner, M., & Bilmes, J. (2009). Compiling a massive, multilingual dictionary via probabilistic inference. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 1 (pp. 262-270).

Stolcke, A. (2002). SRILM: An extensible language modeling toolkit. In INTERSPEECH.

Wu, H., & Wang, H. (2007). Pivot language approach for phrase-based statistical machine translation. Machine Translation, 21(3), 165-181.

Xiang, L., Zhou, Y., & Zon, C. (2013). An efficient framework to extract parallel units from comparable data. In Natural Language Processing and Chinese Computing (pp. 151-163). Springer Berlin Heidelberg.

Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

Zi Long1, Tomoharu Mitsuhashi2, Mikio Yamamoto1, Takehito Utsuro1
1 Grad. Sch. Sys. & Inf. Eng., University of Tsukuba, Tsukuba, 305-8573, Japan
2 Japan Patent Information Organization, 4-1-7, Toyo, Koto-ku, Tokyo, 135-0016, Japan

Abstract

In the process of translating patent documents, a bilingual lexicon of technical terms is an indispensable knowledge source. It is important to develop techniques for automatically acquiring technical term translation equivalent pairs from parallel patent documents. We take the approach of utilizing the phrase table of a state-of-the-art phrase-based statistical machine translation model. First, we collect candidates of synonymous translation equivalent pairs from parallel patent sentences. Then, we apply Support Vector Machines (SVMs) to the task of identifying bilingual synonymous technical terms. This paper especially focuses on the issue of examining the effectiveness of each feature, and identifies the minimum number of features that perform comparatively as well as the optimal set of features. Finally, we achieve a performance of over 90% precision under the condition of more than or equal to 25% recall.

1 Introduction

For both high quality machine and human translation, a large scale and high quality bilingual lexicon is the most important key resource. Since the manual compilation of a bilingual lexicon requires plenty of time and huge manual labor, automatic bilingual lexicon compilation has been studied in the research area of knowledge acquisition from natural language text. Techniques invented so far include translation term pair acquisition based on a statistical co-occurrence measure from parallel sentences (Matsumoto and Utsuro, 2000), compositional translation generation based on an existing bilingual lexicon for human use (Tonoike et al., 2006), translation term pair acquisition by collecting partially bilingual texts through a search engine (Huang et al., 2005), and translation term pair acquisition from comparable corpora (Fung and Yee, 1998; Aker et al., 2013; Kontonatsios et al., 2014; Rapp and Sharoff, 2014).

Among those efforts of acquiring bilingual lexicons from text, Morishita et al. (2008) studied acquiring a Japanese-English technical term translation lexicon from phrase tables, which are trained by a phrase-based SMT model with parallel sentences automatically extracted from parallel patent documents. Furthermore, based on the achievement above, Liang et al. (2011a) studied the issue of identifying Japanese-English synonymous translation equivalent pairs in the task of acquiring Japanese-English technical term translation equivalent pairs. Based on the technique and results of identifying Japanese-English synonymous translation equivalent pairs in Liang et al. (2011a), Long et al. (2014) next studied how to identify Japanese-Chinese synonymous translation equivalent pairs from Japanese-Chinese patent families.

In the task of identifying Japanese-Chinese synonymous translation equivalent pairs from Japanese-Chinese patent families (Figure 1) studied in Long et al. (2014), this paper modifies some of the features studied in Long et al. (2014) and further focuses on the issue of examining the effectiveness of each feature. This paper especially identifies the minimum number of features that perform comparatively as well as the optimal set of features, where the most effective feature is discovered to be the rate of intersection in translation by the phrase table. Based on the evaluation results, we finally achieve a performance of over 90% precision under the condition of more than or equal to 25% recall.

2 Japanese-Chinese Parallel Patent Documents


Figure 1: Framework of Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

Japanese-Chinese parallel patent documents are collected from the Japanese patent documents published by the Japanese Patent Office (JPO) in 2004-2012 and the Chinese patent documents published by the State Intellectual Property Office of the People's Republic of China (SIPO) in 2005-2010. From them, we extract 312,492 patent families; the method of Utiyama and Isahara (2007) is applied [1] to the text of those patent families, and Japanese and Chinese sentences are aligned. In this paper, we use the 3.6M parallel patent sentences with the highest sentence alignment scores [2].

3 Phrase Table of an SMT Model

As a toolkit of a phrase-based SMT model, we use Moses (Koehn et al., 2007) and apply it to the whole 3.6M parallel patent sentences. Before applying Moses, Japanese sentences are segmented into a sequence of morphemes by the Japanese morphological analyzer MeCab [3] with the morpheme lexicon IPAdic [4] (a short segmentation example follows below). For Chinese sentences, we examine two types of segmentation: segmentation by characters [5] and segmentation by morphemes [6].

[1] We used a Japanese-Chinese translation lexicon consisting of about 170,000 Chinese head words.
[2] The maximum score of the method of Utiyama and Isahara (2007) is set to be 1.0, while the lower bound of its score is about 0.152 with the 3.6M parallel patent sentences.
[3] http://mecab.sourceforge.net/
[4] http://sourceforge.jp/projects/ipadic/
[5] A consecutive sequence of numbers, as well as a consecutive sequence of alphabetical characters, is segmented into a single token.
[6] Chinese sentences are segmented into a sequence of morphemes by the Chinese morphological analyzer Stanford Word Segmenter (Tseng et al., 2005) trained on the Chinese Penn Treebank.
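As an illustration of the Japanese preprocessing step, the following sketch segments a sentence with MeCab through the mecab-python3 binding. The binding choice and the installed IPAdic dictionary are our assumptions; the paper only names the analyzer and lexicon.

    import MeCab  # pip package: mecab-python3

    # "-Owakati" makes MeCab print the morpheme sequence separated by
    # spaces, using whichever dictionary (here assumed IPAdic) is installed.
    tagger = MeCab.Tagger("-Owakati")

    def segment_japanese(sentence: str) -> list:
        return tagger.parse(sentence).split()

    print(segment_japanese("特許文書を翻訳する"))
    # e.g. ['特許', '文書', 'を', '翻訳', 'する']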

Figure 2: Developing a Reference Set of Bilingual Synonymous Technical Terms

As the result of applying Moses, we have a phrase table in the direction of Japanese to Chinese translation, and another one in the opposite direction of Chinese to Japanese translation. In the direction of Japanese to Chinese translation, when the Chinese side of the parallel sentences is segmented by morphemes, we finally obtain 108M translation pairs with 75M unique Japanese phrases, with Japanese to Chinese phrase translation probabilities P(p_C | p_J) of translating a Japanese phrase p_J into a Chinese phrase p_C. When Chinese sentences are segmented by characters, on the other hand, we obtain 274M translation pairs with 197M unique Japanese phrases. For each Japanese phrase, the multiple translation candidates in the phrase table are ranked in descending order of the Japanese to Chinese phrase translation probabilities. In a similar way, in the phrase table in the opposite direction of Chinese to Japanese translation, for each Chinese phrase, multiple Japanese translation candidates are ranked in descending order of the Chinese to Japanese phrase translation probabilities.

Those two phrase tables are then referred to when identifying a bilingual technical term pair, given a parallel sentence pair ⟨S_J, S_C⟩ and a Japanese technical term t_J or a Chinese technical term t_C. In the direction of Japanese to Chinese, as shown in Figure 1 (a), given a parallel sentence pair ⟨S_J, S_C⟩ containing a Japanese technical term t_J, the Chinese translation candidates collected from the Japanese to Chinese phrase table are matched against the Chinese sentence S_C of the parallel sentence pair. Among those found in S_C, the candidate t̂_C with the largest translation probability P(t_C | t_J) is selected, and the bilingual technical term pair ⟨t_J, t̂_C⟩ is identified. Similarly, in the opposite direction of Chinese to Japanese, given a parallel sentence pair ⟨S_J, S_C⟩ containing a Chinese technical term t_C, the Chinese to Japanese phrase table is referred to when identifying a bilingual technical term pair.
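The Japanese-to-Chinese direction of this matching step can be sketched as follows. The dict-based layout of the phrase table is our assumption, not the Moses file format: j2c_table maps a Japanese phrase to a list of (chinese_phrase_tokens, probability) pairs.

    def identify_term_pair(t_j, s_c_tokens, j2c_table):
        """Return the Chinese candidate for t_j that occurs in the Chinese
        sentence S_C and has the largest P(t_C | t_J), or None if no
        candidate occurs in the sentence."""
        best, best_p = None, 0.0
        for cand, p in j2c_table.get(t_j, []):
            n = len(cand)
            found = any(s_c_tokens[k:k + n] == cand
                        for k in range(len(s_c_tokens) - n + 1))
            if found and p > best_p:
                best, best_p = cand, p
        return best  # the selected t_C-hat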

4 Developing a Reference Set of Bilingual Synonymous Technical Terms

When developing a reference set of bilingual synonymous technical terms (the detailed procedure can be found in Long et al. (2014)), as illustrated in Figure 2, starting from a seed bilingual term pair s_JC = ⟨s_J, s_C⟩, we repeat the translation estimation procedure of the previous section, in both the Japanese-Chinese direction and the Chinese-Japanese direction, six times in total, and generate the set CBP(s_J) of candidates of bilingual synonymous technical term pairs. Then, we manually divide the set CBP(s_J) into SBP(s_JC), those pairs which are synonymous with s_JC, and the remaining NSBP(s_JC). As in Table 1, we collect 114 seeds, where the number of bilingual technical terms included in SBP(s_JC) in total for all of the 114 seed bilingual technical term pairs is around 2,300 to 2,400, which amounts to around 21 per seed on average [7]. As shown in Figure 1 (b), the procedure of identifying the synonymous sets is applied to all of those bilingual term pairs.

5 Identifying Bilingual Synonymous Technical Terms by Machine Learning

In this section, we apply Support Vector Machines (SVMs) (Vapnik, 1998) to the task of identifying bilingual synonymous technical terms. In this paper, we model this task with SVMs as that of judging whether or not the input bilingual term pair ⟨t_J, t_C⟩ is synonymous with the seed bilingual technical term pair s_JC = ⟨s_J, s_C⟩.

5.1 The Procedure

First, let CBP be the union of the sets CBP(s_J) of candidates of bilingual synonymous technical term pairs for all of the 114 seed bilingual technical term pairs. In the training and testing of the classifier for identifying bilingual synonymous technical terms, we first divide the set of 114 seed bilingual technical term pairs into 10 subsets. Here, for each i-th subset (i = 1,...,10), we construct the union CBP_i of the sets CBP(s_J) of candidates of bilingual synonymous technical term pairs, where CBP_1,...,CBP_10 are 10 disjoint subsets [8] of CBP.

As a tool for learning SVMs, we use TinySVM (http://chasen.org/~taku/software/TinySVM/). As the kernel function, we use the polynomial (1st order) kernel [9]. In the testing of an SVMs classifier, we regard the distance from the separating hyperplane to each test instance as a confidence measure, and return only the test instances whose confidence measure is over a certain lower bound as positive samples (i.e., synonymous with the seed); a sketch of this thresholding is given below. In the training of the SVMs, we use 8 subsets out of the whole 10 subsets CBP_1,...,CBP_10. Then, we tune the lower bound of the confidence measure with one of the remaining two subsets; with this subset, we also tune the parameter of TinySVM for the trade-off between training error and margin. Finally, we test the trained classifier against the other one of the remaining two subsets. We repeat this procedure of training / tuning / testing 10 times, and average the 10 results of test performance.

5.2 Features

Table 2 lists all the features used for training and testing of the SVMs for identifying bilingual synonymous technical terms. The features are roughly divided into two types: those of the first type, f1,...,f6, simply represent various characteristics of the input bilingual technical term ⟨t_J, t_C⟩, while those of the second type, f7,...,f17, represent the relation between the input bilingual technical term ⟨t_J, t_C⟩ and the seed bilingual technical term pair s_JC = ⟨s_J, s_C⟩.

Among the features of the first type are the frequency (f1), the ranks of terms with respect to the conditional translation probabilities (f2 and f3), the lengths of terms (f4 and f5), and the number of times the procedure of generating translations with the phrase tables is repeated until the input terms t_J and t_C are generated from the Japanese seed term s_J (f6).

Among the features of the second type are the identity of monolingual terms (f7 and f8), the edit distance of monolingual terms (f9), the character bigram similarity of monolingual terms (f10), the rate of identical morphemes (in Japanese, f11) / characters (in Chinese, f12), string subsumption and variants for Japanese (f13), identical stems for Chinese (f14), the rate of intersection in translation by the phrase table (f15), the rate of intersection in translation by the phrase table for the substrings not common between the seed and a term (f16), and translation by the phrase tables (f17). As we discuss in the next section, among all of those features, f15 and f16, which utilize the rate of intersection in translation by the phrase table, are the most effective, where f16 is added in this paper to those studied in Long et al. (2014).

[7] We manually generate the reference set by discarding the bilingual pairs which are judged as not synonymous with the seed pair. The procedure of generating the whole reference set took about 30 hours, i.e., about 3 seconds for judging a bilingual term pair on average.
[8] Here, we divide the set of 114 seed bilingual technical term pairs into 10 subsets so that the numbers of positive (i.e., synonymous with the seed) / negative (i.e., not synonymous with the seed) samples in each CBP_i (i = 1,...,10) are comparable among the 10 subsets.
[9] We compared the performance of the 1st order and 2nd order kernels, which have almost comparable performance.

Table 1: Number of Bilingual Technical Terms: Candidates and Reference of Synonyms

(a) With the Phrase Table based on Chinese Sentences Segmented by Morphemes

                                                             # of bilingual technical terms   average
                                                             for the total 114 seeds          per seed
    Candidates of Synonyms ∪ CBP(s_J):
      included only in the set (a)                           12,640                           110.9
      included in the intersection of the sets (a) and (b)   11,981                           105.1
      total                                                  24,621                           216.0
    Reference of Synonyms ∪ SBP(s_JC):
      included only in the set (a)                           228                              2.0
      included in the intersection of the sets (a) and (b)   2,245                            19.7
      total                                                  2,473                            21.7

(b) With the Phrase Table based on Chinese Sentences Segmented by Characters

                                                             # of bilingual technical terms   average
                                                             for the total 114 seeds          per seed
    Candidates of Synonyms ∪ CBP(s_J):
      included only in the set (b)                           6,358                            55.8
      included in the intersection of the sets (a) and (b)   11,120                           97.5
      total                                                  17,478                           153.3
    Reference of Synonyms ∪ SBP(s_JC):
      included only in the set (b)                           287                              2.5
      included in the intersection of the sets (a) and (b)   2,031                            17.8
      total                                                  2,318                            20.3

5.3 Evaluating the Effectiveness of Features

Table 3 shows the evaluation results for a baseline as well as for the SVMs. As the baseline, we simply judge the input bilingual term pair ⟨t_J, t_C⟩ as synonymous with the seed bilingual technical term pair s_JC = ⟨s_J, s_C⟩ when t_J and s_J are identical, or t_C and s_C are identical (stated as code below). When training / testing an SVMs classifier, we tune the lower bound of the confidence measure of the distance from the separating hyperplane in two ways: for maximizing precision and for maximizing F-measure. As shown in Table 3, when we use the set of features which maximizes precision, we achieve higher precisions of 89.0% and 90.4% for morpheme-based segmentation and character-based segmentation, respectively, compared with when we use all of the proposed features (86.5% and 89.0%), under the condition of more than or equal to 40% F-measure [10]. The sets of features which maximize precision are f1∼6 + f9∼16 for morpheme-based segmentation and f2,3 + f6∼9 + f11,12,15,16 for character-based segmentation, respectively; however, their differences are not significant (5% significance level).

Next, we evaluate the effect of each single feature as well as of combinations of small numbers of features; among those results, Table 4 shows the pairs of features each of which achieves a precision with no significant difference (5% significance level) from the set of features having the maximum precision. It is obvious that features f15 and f16, which utilize the rate of intersection in translation by the phrase table, are the most effective. Also, when we remove features f15 and f16 from the full feature set, precisions are significantly damaged (5% significance level), dropping to 78.5% and 79.4% for morpheme-based and character-based segmentations, respectively. The reason why these features are the most effective is that they directly measure the degree of being synonymous within one language with respect to the rate of intersection of translations into the other language, while the other features just measure character-based or morpheme-based similarity within one language.

We further compare the performance of the proposed features with those studied in Tsunakawa and Tsujii (2008), where we modify the features of Tsunakawa and Tsujii (2008) as shown in Table 5 and then evaluate those modified features. As we compare the performance of the proposed features and the modified features of Tsunakawa and Tsujii (2008) in Table 3, it is clear that the proposed features outperform the modified features of Tsunakawa and Tsujii (2008).

Table 4: Pairs of Features having No Significant Difference (5% Significance Level) from the Maximum Precision Features, and their Evaluation Results (%)

(a) Chinese sentences are segmented by morphemes

    feature           precision   recall   f-measure
    f15 + f16         85.6        25.4     39.2
    f9 + f16          86.8        24.9     38.7
    f13 + f14 + f16   86.8        24.8     38.6

(b) Chinese sentences are segmented by characters

    feature           precision   recall   f-measure
    f9 + f15          87.4        25.4     39.3

[10] Out of the 655 (for morpheme-based segmentation) / 605 (for character-based segmentation) pairs which are correctly judged as synonymous with the seed pair by the SVM, 197 (30.1%) / 161 (26.6%) are not judged as synonymous by the baseline method, i.e., neither the Japanese term nor the Chinese term is identical to that of the seed pair. On the other hand, out of the 986 (for morpheme-based segmentation) / 927 (for character-based segmentation) pairs which are correctly judged as synonymous by the baseline method, 458 (46.5%) / 444 (47.9%) are judged as synonymous with the seed pair by the SVM, while the rest are not judged as synonymous by the SVM.
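The baseline rule just described is trivial to state as code (our formulation):

    def baseline_synonym(t_j: str, t_c: str, s_j: str, s_c: str) -> bool:
        """Baseline of Section 5.3: the input pair counts as synonymous
        with the seed as soon as either monolingual side matches exactly."""
        return t_j == s_j or t_c == s_c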

Table 2: Features for Identifying Bilingual Synonymous Technical Terms by Machine Learning (where X denotes J or C, and ⟨s_J, s_C⟩ denotes the seed bilingual technical term pair; sketch implementations of f9, f10, f15, and f16 follow the table)

Features for the bilingual technical term ⟨t_J, t_C⟩:

f1 (frequency): log of the frequency of ⟨t_J, t_C⟩ within the whole parallel patent sentences.
f2 (rank of the Chinese term given t_J): log of the rank of t_C with respect to the descending order of the conditional translation probability P(t_C | t_J).
f3 (rank of the Japanese term given t_C): log of the rank of t_J with respect to the descending order of the conditional translation probability P(t_J | t_C).
f4 (number of Japanese characters): number of characters in t_J.
f5 (number of Chinese characters): number of characters in t_C.
f6 (number of times generating translation by applying the phrase tables): the number of times the procedure of generating translations by applying the phrase tables is repeated until t_C or t_J is generated from s_J, as in s_J → ··· → t_J → t_C, or s_J → ··· → t_C → t_J.

Features for the relation of ⟨t_J, t_C⟩ and the seed ⟨s_J, s_C⟩:

f7 (identity of Japanese terms): returns 1 when t_J = s_J.
f8 (identity of Chinese terms): returns 1 when t_C = s_C.
f9 (edit distance similarity of monolingual terms): f9(t_X, s_X) = 1 − ED(t_X, s_X) / max(|t_X|, |s_X|), where ED is the edit distance of t_X and s_X, and |t| denotes the number of characters of t.
f10 (character bigram similarity of monolingual terms): f10(t_X, s_X) = |bigram(t_X) ∩ bigram(s_X)| / (max(|t_X|, |s_X|) − 1), where bigram(t) is the set of character bigrams of the term t.
f11 (rate of identical morphemes, for Japanese terms): f11(t_J, s_J) = |const(t_J) ∩ const(s_J)| / max(|const(t_J)|, |const(s_J)|), where const(t) is the set of morphemes in the Japanese term t.
f12 (rate of identical characters, for Chinese terms): f12(t_C, s_C) = |const(t_C) ∩ const(s_C)| / max(|const(t_C)|, |const(s_C)|), where const(t) is the set of characters in the Chinese term t.
f13 (subsumption relation of strings / variants relation of surface forms, for Japanese terms): returns 1 when the difference between t_J and s_J is only in their suffixes, or only in whether or not they have the prolonged sound "ー", or only in their hiragana parts.
f14 (identical stem, for Chinese terms): returns 1 when the difference between t_C and s_C is only whether or not they contain the word "的", where "的" is not the prefix or suffix.
f15 (rate of intersection in translation by the phrase table): f15(t_X, s_X) = |trans(t_X) ∩ trans(s_X)| / max(|trans(t_X)|, |trans(s_X)|), where trans(t) is the set of translations of the term t from the phrase table.
f16 (rate of intersection in translation by the phrase table, for the substrings not common between t_X and s_X): suppose that x_t^1,...,x_t^m and x_s^1,...,x_s^n are the substrings which are not common between t_X and s_X. We find l (= min(m, n)) one-to-one mappings between the x_t^i (i = 1,...,m) and the x_s^j (j = 1,...,n) which maximize the product of the rates f15(x_t, x_s) of intersection in translation by the phrase table, and return this product.
f17 (translation by the phrase table): returns 1 when s_J can be generated by translating t_C with the phrase table, or s_C can be generated by translating t_J with the phrase table.
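As a concrete reading of the formulas in Table 2, here are sketch implementations of f9, f10, f15, and f16. This is our code, not the authors'; trans is a hypothetical mapping from a term to its set of phrase-table translations.

    from itertools import permutations

    def f9(t: str, s: str) -> float:
        """Edit-distance similarity: 1 - ED(t, s) / max(|t|, |s|)."""
        if not t and not s:
            return 1.0
        m, n = len(t), len(s)
        d = list(range(n + 1))          # one-row Levenshtein DP
        for i in range(1, m + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (t[i - 1] != s[j - 1]))
        return 1 - d[n] / max(m, n)

    def f10(t: str, s: str) -> float:
        """Character-bigram similarity of two terms."""
        big = lambda x: {x[i:i + 2] for i in range(len(x) - 1)}
        denom = max(max(len(t), len(s)) - 1, 1)  # guard (ours) for 1-char terms
        return len(big(t) & big(s)) / denom

    def f15(t: str, s: str, trans: dict) -> float:
        """Rate of intersection of the phrase-table translation sets."""
        a, b = trans.get(t, set()), trans.get(s, set())
        return len(a & b) / max(len(a), len(b)) if (a or b) else 0.0

    def f16(xs_t: list, xs_s: list, trans: dict) -> float:
        """Best one-to-one mapping of non-shared substrings, maximizing
        the product of f15 scores (brute force; the substring lists are
        short in practice)."""
        if not xs_t or not xs_s:
            return 0.0
        short, long_ = sorted((xs_t, xs_s), key=len)
        best = 0.0
        for perm in permutations(long_, len(short)):
            p = 1.0
            for a, b in zip(short, perm):
                p *= f15(a, b, trans)
            best = max(best, p)
        return best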

Table 3: Evaluation Results (%)

                                                     segmented by morphemes      segmented by characters
                                                     prec.   recall   f-meas.    prec.   recall   f-meas.
    baseline (t_J and s_J are identical,
    or t_C and s_C are identical)                    71.4    40.0     51.3       74.0    40.1     52.0
    SVM (all features), maximum precision            86.5    26.5     40.5       89.0    26.1     40.4
    SVM (all features), maximum f-measure            64.3    64.1     64.2       63.5    65.3     64.4
    SVM (features with maximum precision:
    f1∼6 + f9∼16 for morphemes,
    f2,3 + f6∼9 + f11,12,15,16 for characters)       89.0    23.9     37.7       90.4    25.5     40.4
    SVM (features in Tsunakawa and Tsujii
    (2008)), maximum precision                       72.6    26.1     38.4       74.4    36.7     49.2
    SVM (features in Tsunakawa and Tsujii
    (2008)), maximum f-measure                       71.0    54.7     61.5       72.7    53.7     61.8

Table 5: Features for Identifying Bilingual Synonymous Technical Terms by Tsunakawa and Tsujii (2008)

Basic features:

h1J, h1C (agreement of the first characters): returns 1 when the first characters of t_X and s_X match.
h2J, h2C (edit distance similarity of monolingual terms): the same as f9.
h3J, h3C (character bigram similarity of monolingual terms): the same as f10.
h4J, h4C (agreement of word substrings): return the count of substrings of t_X that match s_X. (Here, Tsunakawa and Tsujii (2008) count not only the common substrings but also substrings in a known synonymous relation between t_X and s_X; however, in our work we have no lexicon available for synonymous relations, so we utilize only the count of common substrings.)
h5J, h5C (translation by the phrase table): the same as f17. (Here, instead of the phrase table, Tsunakawa and Tsujii (2008) utilize a bilingual lexicon and consider the existence of bilingual lexical items as features.)
h6 (identical stem for Chinese terms): the same as f14. (Although Tsunakawa and Tsujii (2008) define this feature as examining the acronym relation of English terms, we modify it to examine whether the Chinese terms differ only in the Chinese word "的".)
h7 (subsumption relation of strings / variants relation of surface forms for Japanese terms): the same as f13. (Although Tsunakawa and Tsujii (2008) examine only the katakana variants, we additionally examine the difference of suffixes and the variants of hiragana parts.)

Combinatorial features (see the sketch after this table):

h1J ∧ h1C, √(h2J · h2C), √(h3J · h3C), h5J ∧ h5C, h6 · h2J, h7 · h2C
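A sketch of how the combinatorial features at the bottom of Table 5 could be computed from the basic feature values; the dict layout and key names are our assumptions.

    import math

    def combinatorial_features(h: dict) -> dict:
        """h maps basic-feature names such as "h2J" to their values."""
        return {
            "h1J^h1C":       float(h["h1J"] and h["h1C"]),
            "sqrt(h2J*h2C)": math.sqrt(h["h2J"] * h["h2C"]),
            "sqrt(h3J*h3C)": math.sqrt(h["h3J"] * h["h3C"]),
            "h5J^h5C":       float(h["h5J"] and h["h5C"]),
            "h6*h2J":        h["h6"] * h["h2J"],
            "h7*h2C":        h["h7"] * h["h2C"],
        }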

Next, Table 6 shows examples of improvement by the SVM compared with the baseline. As shown in Table 6 (a), the relation between the input bilingual term pair and the seed bilingual term pair is correctly judged as "synonym", while the judgement by the baseline is "not synonym", since neither the Chinese terms nor the Japanese terms are identical. Among our proposed features, f17 contributes to the correct judgement, where it returns 1 because of the existence of the translation pairs ⟨"ガラス転移温度", "…"⟩ and ⟨"ガラス転移点", "…"⟩ in the phrase table. In the case of the other example, shown in Table 6 (b), the proposed method correctly judges the pair as "not synonym" compared with the baseline, where both the edit distance similarity

Table 6: Examples of Improvement in Identifying Bilingual Synonymous Technical Terms by SVM. Baseline: judge the input bilingual term pair ⟨tJ, tC⟩ as synonymous with the seed bilingual term pair ⟨sJ, sC⟩ when tJ and sJ are identical, or tC and sC are identical. SVM: maximize precision by tuning the lower bound of the confidence measure of the distance from the separating hyperplane (Chinese sentences are segmented by morphemes).

(a) Correct Judgement as “Synonym” only by SVM

(b) Correct Judgement as “Not Synonym” only by SVM

Here, both the edit distance similarity (f9) and the character bigram similarity (f10) between the Japanese terms “集電装置” and “コレクト” are 0 (f9(⟨tJ, tC⟩, ⟨sJ, sC⟩) = 0 and f10(⟨tJ, tC⟩, ⟨sJ, sC⟩) = 0).

Finally, Table 7 shows examples of erroneous judgements by SVM. As shown in Table 7 (a), since the erroneous translation pairs ⟨“断熱体”, “ ”⟩ and ⟨“インシュレーター”, “ ”⟩ exist in the phrase table, both f17 (either the translation pair ⟨sJ, tC⟩ or ⟨tJ, sC⟩ exists in the phrase table) and f18 (both of the translation pairs ⟨sJ, tC⟩ and ⟨tJ, sC⟩ exist in the phrase table) return 1, resulting in an erroneous judgement.

Another example is shown in Table 7 (b), where the proposed method returns the erroneous judgement “not synonym”. In this case, since the translation pair ⟨“成膜チャンバー”, “ ”⟩ only exists in the phrase table, f17 (either the translation pair ⟨sJ, tC⟩ or ⟨tJ, sC⟩ exists in the phrase table) returns 1, while f18 (both of the translation pairs ⟨sJ, tC⟩ and ⟨tJ, sC⟩ exist in the phrase table) returns 0. Furthermore, even though the Chinese words “ ” and “ ” are synonymous, their character bigram similarity is computed as 0, since they have opposite character orderings.

6 Related Work

Among related work on acquiring bilingual lexicons from text, Lu and Tsou (2009) and Yasuda and Sumita (2013) studied extracting bilingual terms from comparable patents, where they first extract parallel sentences from comparable patents and then extract bilingual terms from those parallel sentences. Those studies differ from this paper in that they did not address the issue of acquiring bilingual synonymous technical terms. Tsunakawa and Tsujii (2008) is most closely related to our study, in that they also proposed to apply a machine learning technique to the task of identifying bilingual synonymous technical terms. However, Tsunakawa and Tsujii (2008) studied the issue of identifying bilingual synonymous technical terms only within a manually compiled bilingual technical term lexicon and are thus quite limited in applicability. Our approach, on the other hand, is quite advantageous in that we start from parallel patent documents, which continue to be published every year, and in that we can generate candidates of bilingual synonymous technical terms automatically. Furthermore, as we show in the previous section, the features proposed in this paper outperform those of Tsunakawa and Tsujii (2008).

7 Conclusion

In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper studied the issue of identifying synonymous translation equivalent pairs. This paper especially focused on the issue of examining the effectiveness of each feature, and identified the minimum number of features that perform comparatively as well as the optimal set of features. One of the most important pieces of future work is definitely to improve recall. To do this, we plan to apply the semi-automatic framework (Liang et al., 2011b) which was invented in the task of identifying Japanese-English synonymous translation equivalent pairs and has been proven to be effective in improving recall. Another important piece of future work is to train the SVM for identifying bilingual synonymous technical term pairs with one set

Table 7: Examples of Errors in Identifying Bilingual Synonymous Technical Terms by the Proposed Method

(a) Incorrect Judgement as “Synonym” by SVM

(b) Incorrect Judgement as “Not Synonym” by SVM

of patent families, and then to evaluate the trained SVM against parallel patent sentences and phrase tables extracted from another set of patent families.

References

A. Aker, M. Paramita, and R. Gaizauskas. 2013. Extracting bilingual terminologies from comparable corpora. In Proc. 51st ACL, pages 402–411.

P. Fung and L. Y. Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proc. 17th COLING and 36th ACL, pages 414–420.

F. Huang, Y. Zhang, and S. Vogel. 2005. Mining key phrase translations from Web corpora. In Proc. HLT/EMNLP, pages 483–490.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. 45th ACL, Companion Volume, pages 177–180.

G. Kontonatsios, I. Korkontzelos, J. Tsujii, and S. Ananiadou. 2014. Using random forest classifier to compile bilingual dictionaries of technical terms from comparable corpora. In Proc. 14th EACL, pages 111–116.

B. Liang, T. Utsuro, and M. Yamamoto. 2011a. Identifying bilingual synonymous technical terms from phrase tables and parallel patent sentences. Procedia - Social and Behavioral Sciences, 27:50–60.

B. Liang, T. Utsuro, and M. Yamamoto. 2011b. Semi-automatic identification of bilingual synonymous technical terms from phrase tables and parallel patent sentences. In Proc. 25th PACLIC, pages 196–205.

Z. Long, L. Dong, T. Utsuro, T. Mitsuhashi, and M. Yamamoto. 2014. Identifying Japanese-Chinese bilingual synonymous technical terms from patent families. In Proc. 7th BUCC, pages 49–54.

B. Lu and B. K. Tsou. 2009. Towards bilingual term extraction in comparable patents. In Proc. 23rd PACLIC, pages 755–762.

Y. Matsumoto and T. Utsuro. 2000. Lexical knowledge acquisition. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of Natural Language Processing, chapter 24, pages 563–610. Marcel Dekker Inc.

Y. Morishita, T. Utsuro, and M. Yamamoto. 2008. Integrating a phrase-based SMT model and a bilingual lexicon for human in semi-automatic acquisition of technical term translation lexicon. In Proc. 8th AMTA, pages 153–162.

R. Rapp and S. Sharoff. 2014. Extracting multiword translations from aligned comparable documents. In Proc. 3rd Workshop on Hybrid Approaches to Translation, pages 83–91.

M. Tonoike, M. Kida, T. Takagi, Y. Sasaki, T. Utsuro, and S. Sato. 2006. A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the web. In Proc. 2nd Intl. Workshop on Web as Corpus, pages 11–18.

H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning. 2005. A conditional random field word segmenter for Sighan bakeoff 2005. In Proc. 4th SIGHAN Workshop on Chinese Language Processing, pages 168–171.

T. Tsunakawa and J. Tsujii. 2008. Bilingual synonym identification with spelling variations. In Proc. 3rd IJCNLP, pages 457–464.

M. Utiyama and H. Isahara. 2007. A Japanese-English patent parallel corpus. In Proc. MT Summit XI, pages 475–482.

V. N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.

K. Yasuda and E. Sumita. 2013. Building a bilingual dictionary from a Japanese-Chinese patent corpus. In Computational Linguistics and Intelligent Text Processing, volume 7817 of LNCS, pages 276–284. Springer.


Extracting Bilingual Lexica from Comparable Corpora Using Self-Organizing Maps

Hyeong-Won Seo Min-Ah Cheon Jae-Hoon Kim

Department of Computer Engineering, Korea Maritime and Ocean University, Busan 606-791, Republic of Korea

[email protected] [email protected] [email protected]

Abstract

This paper aims to present a novel method of extracting bilingual lexica from comparable corpora using one of the artificial neural network algorithms, self-organizing maps (SOMs). The proposed method is very useful when a seed dictionary for translating source words into target words is insufficient. Our experiments have shown stunning results when contrasted with one of the other approaches. For future work, we need to fine-tune various parameters to achieve stronger performances. Also we should investigate how to construct good synonym vectors.

1 Introduction

Bilingual lexicon extraction from comparable corpora has been studied by many researchers since the late 1990s (Fung, 1998; Rapp, 1999; Chiao & Zweigenbaum, 2002; Ismail & Manandhar, 2010; Hazem & Morin, 2012).

To our knowledge, one of the basic approaches is the context vector-based approach (Rapp, 1995; Fung, 1998), called the standard approach in the literature, and many other studies have been derived from this approach. Some of these are concerned with similarity score measurement (Fung, 1998; Rapp, 1999; Koehn & Knight, 2002; Prochasson et al., 2009), the size of the context window (Daille & Morin, 2005; Prochasson et al., 2009), and the size of the seed dictionary (Fung, 1998; Rapp, 1999; Chiao & Zweigenbaum, 2002; Koehn & Knight, 2002; Daille & Morin, 2005).

The extended approach, one such derivation (Déjean et al., 2002; Daille & Morin, 2005), has been proposed in order to reduce the load on the seed dictionary. It gathers the k nearest neighbors to augment the context of the word to be translated. In spite of these efforts, using comparable corpora for extracting such lexica yields quite poor performances unless orthographic features are used; however, such features may bring other costs. Under circumstances like this, this paper is motivated to propose an efficient method in which comparable corpora with a minimum of resources are considered for extracting bilingual lexica. The SOM-based approach we propose in this paper can yield stronger performances under the same experimental circumstances than earlier studies can. In order to show this, we compare the proposed method to the standard approach. Of course, this does not mean our method outperforms on every dataset; we just show that the proposed method is reasonable for this field.

The rest of the paper is structured as follows: Section 2 presents several works closely related to our method. Section 3 describes our method (the SOM-based approach) in more detail. Section 4 shows experimental results with discussions, and Section 5 concludes the paper and presents future research directions.

2 Related Works

2.1 Context-based approach

As has been noted earlier, the standard approach (Rapp, 1995; Fung, 1998) was proposed to extract bilingual lexica from comparable corpora. It uses contextually relevant words in a small-sized window.

Selecting similar context vectors between the source and target languages is the key feature of the approach. Since the approach uses comparable corpora, a seed dictionary to translate from one language to the other is required. Additionally, large-scale corpora as well as a sufficient amount of initial seed dictionary entries should be prepared for better performance.

[Figure 1. Overall structure of the SOM-based approach: Source corpus / Target corpus → Building synonym vectors → SOM training → Extracting translations → Bilingual dictionary]
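For orientation, here is a compact sketch of the standard approach just described, under stated assumptions: context vectors are collected in a small window, their components are mapped through the seed dictionary, and candidates are compared with cosine similarity. Association weighting (PMI/chi-square) is omitted for brevity; raw co-occurrence counts stand in for it, and all names are illustrative.

```python
from collections import Counter
from math import sqrt

def context_vector(target, tokens, window=5):
    """Co-occurrence counts of words around `target` in a fixed window."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            for ctx in tokens[lo:i] + tokens[i + 1:hi]:
                vec[ctx] += 1
    return vec

def translate_vector(vec, seed_dict):
    # keep only the components translatable by the seed dictionary
    return Counter({seed_dict[w]: c for w, c in vec.items() if w in seed_dict})

def cosine(u, v):
    num = sum(u[k] * v[k] for k in u)
    den = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def standard_approach(src_word, src_tokens, tgt_vocab, tgt_tokens, seed_dict, k=5):
    """Rank target-vocabulary words by similarity of their context vectors
    to the dictionary-translated context vector of the source word."""
    q = translate_vector(context_vector(src_word, src_tokens), seed_dict)
    scored = [(w, cosine(q, context_vector(w, tgt_tokens))) for w in tgt_vocab]
    return sorted(scored, key=lambda x: -x[1])[:k]
```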

2.2 Self-organizing maps

A self-organizing map (SOM) (Kohonen, 1982; 1995) is one of the artificial neural network models and represents a huge amount of input data in a more illustrative form in a lower-dimensional space. In general, a SOM is an unsupervised and competitive learning network. It has been studied extensively in recent years. For example, SOMs have been studied in pattern recognition (Li et al., 2006; Ghorpade et al., 2010), signal processing (Wakuya et al., 2007), multivariate statistical analysis (Nag et al., 2005), data mining (Júnior et al., 2013), word categorization (Klami & Lagus, 2006), and clustering (Juntunen et al., 2013).

Since a SOM tries to keep the topological properties of the input data, semantically/geometrically similar inputs are generally mapped around one neuron, usually in the form of a two-dimensional lattice (i.e. a map). Significantly, the SOM can be used for clustering the input vectors and finding features inherent to the problem. In this perspective, we can expect that actually similar words will have one common winner (winning neuron) or share the same neighbors if the input vectors are semantically well-formed.

Based on this characteristic, the main idea of the proposed method is to make two different words that are translations of each other have one common winner. If a new input is similar to an input trained already, the SOMs can extract its translations based on its neighbors. Consequently, neighbors (i.e. semantically similar words) also share similar areas in the feature map.

3 SOM-based approach

The overall structure of the SOM-based approach can be summarized as follows (see Figure 1 for more details):

i. Building synonym vectors: In this paper, a synonym vector is a vector that consists of words semantically related to each other. Therefore, synonym vectors should be constructed in a semantic fashion, not a co-relational fashion. For example, the vector for baby should be very similar to the vector for kid, not just for closely related words such as toy or sitter. Therefore, building synonym vectors is one of the most important issues in this work. For this, we first build context vectors from contextually related words in a fixed-size window. These context vectors are weighted by an association measure, such as PMI or the chi-square. After the context vectors are built, similarity scores between the vectors are computed. In this paper, the similarity score, as occurs so often in information retrieval, is computed by cosine similarity. Synonyms can be identified based on scores higher than a reasonable threshold, and synonym vectors are then weighted by these scores. For instance, let kid be a base word to be vectorized. In this case, its elements are similarity scores between kid and the most similar k words, such as baby, teenager, and youth. Consequently, with well-made synonym vectors, a SOM reflects the topological properties of the input data and will obtain common winners after the SOMs are trained.

Note that such context vectors are very sensitive to the experimental data and to parameters such as association/similarity measures, so any kind of vector is welcome here. We just assume semantically formed synonym vectors are already available before we train the SOMs.
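The following is a minimal sketch of step i under stated assumptions: dense word vectors (e.g. from word2vec, as the authors use in Section 4) stand in for the context vectors, and a synonym vector for a base word is read off as the cosine scores of its k most similar words above a threshold. The function names and toy vectors are illustrative only.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def synonym_vector(base, word_vecs, k=3, threshold=0.5):
    """The k nearest neighbours of `base` with their similarity scores;
    these scores become the elements of the synonym vector."""
    scores = {w: cosine(word_vecs[base], v)
              for w, v in word_vecs.items() if w != base}
    top = sorted(scores.items(), key=lambda item: -item[1])[:k]
    return [(w, s) for w, s in top if s >= threshold]

# Toy example mirroring the paper's illustration: the vector for "kid"
# should be dominated by near-synonyms such as "baby", not by merely
# related words such as "toy".
vecs = {
    "kid":      np.array([0.9, 0.1, 0.0]),
    "baby":     np.array([0.8, 0.2, 0.1]),
    "teenager": np.array([0.7, 0.3, 0.2]),
    "toy":      np.array([0.1, 0.9, 0.3]),
}
print(synonym_vector("kid", vecs, k=2))  # baby and teenager, not toy
```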

ii. SOM training: After the source and target synonym vectors are built, we train two sorts of SOMs in different ways. Figure 2 describes how the two SOMs are trained interactively.

[Figure 2. SOM training: the source SOM is trained on source synonym vectors in an unsupervised fashion (e.g. 아이 (kid): 애기 (baby) 0.82, 십대 (teenage) 0.64), the target SOM on target synonym vectors in a supervised fashion (e.g. enfant (kid): bébé (baby) 0.90, jeunesse (youth) 0.83), with iterative SOM training linked through the seed dictionary.]

Firstly, we train the source SOM in an unsupervised fashion. The general SOM algorithm to train all source words can be summarized as follows:

1) Set an initial weight vector w(0) with small random values in [0, 1], and set the learning rate η(t) to a small positive value (0 < η(t) ≤ η(t−1) ≤ 1). The iteration t runs over single input data items.

2) For every single input x, find the winning neuron (i.e. winner) c which has the minimum score based on the Euclidean distances between the input and the weight vectors: c = argmin_i ‖x − w_i‖.

3) Update the weight vectors of the winning neuron and its neighbors as follows: w_i(t+1) = w_i(t) + η(t) h_c(t) [x(t) − w_i(t)], where t denotes time, x(t) is an input vector at t, and h_c(t) is the neighborhood kernel around the winner c. In this step, we update them in online mode, which means one update per input (cf. an offline mode, which means one update per pass over all inputs).

4) Repeat steps 2) and 3) until a certain termination condition, such as the maximum number of iterations, is reached.

After the source SOM is trained in an unsupervised fashion, we train the target SOM in a supervised fashion. In this case, most of the steps are the same as in the case of the source. Note, however, how the target weight vectors are updated: target winners whose words are excluded from the seed dictionary are updated naturally, as in the case of the source, while the others, whose words are included in the seed dictionary, are updated by calling the related source winners. Therefore, two different words which are translations of each other can be located in the same topological location of the two different SOMs. In other words, we can teach correct labels to insiders (i.e. the target words included in the seed dictionary) but not to outsiders. As mentioned before, if the synonym vectors are well-formed and the two SOMs are well-trained, a source word and its translation will have one common winner. Even if a target word has not been trained yet, the word can be extracted when its synonym has been trained.
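Below is a minimal numpy sketch of the online training loop in steps 1)–4). The Gaussian neighbourhood kernel and the linear decay schedules are assumptions: the paper only states that h_c(t) is a neighbourhood kernel around the winner and that updates happen once per input. For the supervised target SOM, one reading of the description is to run the same loop but force the winner to the source winner's grid position whenever the word is in the seed dictionary.

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=100, lr0=0.1, sigma0=1.5, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = data.shape[1]
    # 1) initial weight vectors with small random values in [0, 1]
    weights = rng.random((h * w, dim))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    t_max = epochs * len(data)
    t = 0
    for _ in range(epochs):                      # 4) repeat until done
        for x in data:
            # 2) winner = neuron with minimum Euclidean distance to x
            c = np.argmin(np.linalg.norm(weights - x, axis=1))
            # decaying learning rate and neighbourhood width (assumed schedule)
            frac = 1.0 - t / t_max
            lr, sigma = lr0 * frac, max(sigma0 * frac, 1e-3)
            # Gaussian kernel h_c(t) over grid distance to the winner
            d2 = ((coords - coords[c]) ** 2).sum(axis=1)
            kernel = np.exp(-d2 / (2 * sigma ** 2))
            # 3) online update: w_i(t+1) = w_i(t) + lr * h * (x - w_i(t))
            weights += lr * kernel[:, None] * (x - weights)
            t += 1
    return weights
```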

iii. Extracting translations: After the two SOMs are trained interactively, SOM vectors are constructed based on each feature map (i.e. the source and the target). In this case, similarity scores between an input vector and the weight vectors become the elements of the SOM vectors; that is, the length of the SOM (its number of neurons) is the dimension of the SOM vector. After the SOM vectors are built, similarity scores between one source SOM vector and all target SOM vectors are calculated by cosine similarity, and then the top k candidates are selected and added to the bilingual lexicon.
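And a sketch of step iii under the same assumptions: a word's "SOM vector" holds its cosine similarities to every neuron's weight vector, and the translation candidates are the target words whose SOM vectors are closest to the source word's. `src_som` and `tgt_som` would be the trained weight matrices from the sketch above; the names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def som_vector(x, som_weights):
    """Similarity of input x to each neuron; length = number of neurons."""
    return np.array([cosine(x, w) for w in som_weights])

def extract_translations(src_word_vec, tgt_vecs, src_som, tgt_som, k=5):
    """Rank target words by cosine similarity of SOM vectors; keep top k."""
    q = som_vector(src_word_vec, src_som)
    scored = [(w, cosine(q, som_vector(v, tgt_som)))
              for w, v in tgt_vecs.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]
```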

4 Experiments

In this paper, we evaluate the proposed method for two language pairs – Korean–French (KR–FR) and Korean–Spanish (KR–ES). For the comparison, we implemented the standard approach mentioned in Section 2.1. Note that the standard approach implemented here is not complete: there are many chances to show better performances by fine-tuning several parameters, such as the size of the context window and the association/similarity measures. However, we can still compare the two methods fairly, because both are implemented using the same resources. Several parameters are fixed as follows: the context window size as 5, the association measure as a chi-square test, and the similarity measure as cosine similarity. These measures were empirically chosen from our experimental data.

We used three comparable corpora (Kwon et al., 2014) in Korean, French, and Spanish. Each corpus included around 800k sentences collected from the Web¹. The Korean corpus consists of news articles, and some are derived from different sources (Seo et al., 2006).

¹ Korean: http://www.naver.com, French: http://www.lemonde.fr, and Spanish: http://www.abc.es

The other corpora also consist of news articles (around 400k sentences), and some are combined with the European Parliament proceedings (400k randomly sampled sentences) (Koehn, 2005). The Korean corpus has around 280k word types (180k for French and 185k for Spanish), and the average number of words per sentence is 16.2 (15.9 for French and 16.1 for Spanish). Consequently, the three corpora are well balanced.

We extracted nouns from these corpora for our test sets as well as for the input data. We considered only nouns in order to reduce the dimensionality of both the synonym vectors and the SOMs. We finally collected almost 190k Korean noun types (45k for French and 58k for Spanish). The reason why the number of Korean noun types is higher than the others is due to Korean characteristics: we have to split Korean words into morpheme units, because there are a lot of compound words and omitted morphemes. Furthermore, we collected very finely segmented Korean nouns to eliminate compound nouns that were possibly missed during the word segmentation task. All collected nouns were considered as candidates for both the test sets and the seed words, independently.

After the input data was prepared, we built synonym vectors as mentioned previously. We already introduced the method for constructing synonym vectors; however, this paper does not mainly propose an efficient way of representing words semantically in vector spaces. If synonym vectors are built directly from context vectors and their similarity scores, the vector dimension would be very large, which would cause many time-consuming problems. In this paper, we simply use word2vec² to build synonym vectors. As far as we know, word2vec does not necessarily yield semantically related vectors as output; however, we used this tool to reduce vector sizes, and we assume these outputs (i.e. vectors) are reasonable as input data for training SOMs. The parameters for building synonym vectors are as follows: the window size is 5, the word vector size is 100, and the number of training iterations is 100.
4.1 Evaluation dictionary

We manually built evaluation dictionaries to evaluate our method, because such dictionaries for KR–FR and KR–ES are publicly unavailable. Each dictionary contains 200 high-frequency nouns. The reason why we picked high-frequency nouns is that these nouns have more chances to have neighbors than low-frequency words. In order to evaluate whether the proposed approach is valid (i.e. whether a trained SOM can extract new input data that it was not trained on), we need to train words having many neighbors. These 200 source words were selected only if their actual translations were in the corpora; thus, the 200th source word does not indicate the 200th high-frequency word. The KR→FR³ dictionary had a total of 288 translations (451 translations in the FR→KR dictionary), and the KR→ES dictionary contained 377 translations (687 translations in the ES→KR dictionary). Additionally, regretfully, there were several duplicated translations for every language. In the case of KR–FR, the Korean words had 447 French translations (420 types) and the French words had 209 Korean translations (189 types). In the case of KR–ES, the Korean words had 456 Spanish translations (369 types) and the Spanish words had 509 Korean translations (421 types). We did not perform any heuristic process to give each source word a unique sense. Instead, we assumed that related source words corresponding to a single translation were semantically the same.

4.2 Seed dictionary

The seed dictionaries were also built manually, based on the high-frequency nouns as mentioned before. Seed words, however, did not overlap with the evaluation data. We chose 11,910 Korean noun types (8,105 French types and 7,458 Spanish types), covering 94% of the total words in the corpus. As mentioned before, 11,910 Korean noun types out of 190k (total) noun types is an extremely low number. Excluding the 200 highest-frequency words (contained in the evaluation dictionary), we finally collected 2,399 Korean seed nouns having their translations in the target corpora for KR→FR, 4,387 Korean seed nouns for KR→ES, 2,138 French seed nouns for FR→KR, and 1,813 Spanish seed nouns for ES→KR, respectively.

5 Results

Unfortunately, we do not have a publicly accepted gold standard or experimental guidelines for these language pairs. By and large, the best performances depended on various experimental settings, such as languages, document domains, and seed dictionaries. Doubtless, the quality of the synonym vectors and seed dictionaries, including the trained SOMs, are the most important issues for achieving high performances. We ignore evaluations of the quality of the synonym vectors in this paper, and only consider accuracies for the top 20 candidates for the two sets of language pairs (i.e. KR–FR and KR–ES).

² http://code.google.com/p/word2vec/
³ The symbol '→' means a unidirectional pair (i.e. source to target only).

For simplified experiments, we fixed several parameters as follows: the dimension of the synonym vector as 100, the size of the Gaussian function as 25 (5×5), the learning rate as 0.1, and the number of epochs as 2000. These parameters were chosen based on preceding experiments. In the case of the SOM size, all sizes are different in order to cover most of the seed words (one-to-one mapping had shown poor performances due to the fixed and small-sized Gaussian function). We tried to find the best parameters via fine-tuning, but most could be further improved in future research.

The accuracies for the two sets of language pairs are shown in Figures 2 to 5. In those figures, BASE means the standard approach, SOM means the SOM-based approach, the number in brackets is the size of the SOM, the x-axis indicates ranks, and the y-axis indicates accuracies. As can be seen, the SOM-based approach outperformed the standard approach over all language settings.

[Figure 2. Accuracies of the KR→FR pair (BASE vs. SOM (1400)). Figure 3. Accuracies of the FR→KR pair (BASE vs. SOM (1600)). Figure 4. Accuracies of the KR→ES pair (BASE vs. SOM (3600)). Figure 5. Accuracies of the ES→KR pair (BASE vs. SOM (1600)). x-axis: rank (1–20); y-axis: accuracy (0%–50%).]

In our experimental results for the KR→FR pair, for example, we extracted stratégie (strategy) as the translation of the source word 전략 (jeonryak⁴; strategy, operation), where its neighbors 작전⁵ (jakjeon; operation, tactic, strategy) and opération⁶ (operation) are included in the seed dictionary. If new input data (to be tested) have very similar seed words, we can extract correct translations through well-trained SOMs. Although the sizes of the SOMs were neither the same nor the best sizes, we can see that the proposed approach is quite outstanding compared with the standard approach.

6 Conclusion

This paper proposes a novel method for extracting bilingual lexica from comparable corpora by using SOMs. The method trains two sorts of SOMs, in an unsupervised fashion and a supervised fashion, respectively. As the experimental results show, our method generally outperforms the standard approach under the same experimental conditions (i.e. the same seed dictionaries and corpora). Although the given parameters are not yet the best for both approaches, our method shows stunning results.

For future work, we can tune parameter factors such as the size of the SOMs, the Gaussian function, and the number of epochs. Moreover, various parts-of-speech could be considered, as we only considered nouns in this work. In addition, a deep analysis of errors is required.

⁴ The Korean gloss is presented before a semicolon in brackets.
⁵ The similarity score between 작전 and 전략 is 0.88.
⁶ The similarity score between opération and stratégie is 0.82.

Acknowledgements

This work was supported by the ICT R&D program of MSIP/IITP. [10041807, Development of Original Software Technology for Automatic Speech Translation with Performance 90% for Tour/International Event focused on Multilingual Expansibility and based on Knowledge Learning]

References

Y.-C. Chiao and P. Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics, Coling'02, pp. 1208–1212.

B. Daille and E. Morin. 2005. French-English terminology extraction from comparable corpora. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, IJCNLP'05, pp. 707–718.

H. Déjean, É. Gaussier, and F. Sadat. 2002. An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. In Proceedings of the 19th International Conference on Computational Linguistics (COLING'02), 1: 1–7.

P. Fung. 1998. A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas, AMTA'98, pp. 1–16.

S. Ghorpade, J. Ghorpade, and S. Mantri. 2010. Pattern recognition using neural networks. International Journal of Computer Science & Information Technology, IJCSIT, 2(6): 92–98.

A. Hazem and E. Morin. 2012. QAlign: A new method for bilingual lexicon extraction from comparable corpora. In Proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing'12, volume 2 of Lecture Notes in Computer Science, pp. 83–96.

A. Ismail and S. Manandhar. 2010. Bilingual lexicon extraction from comparable corpora using in-domain terms. In Proceedings of the 23rd International Conference on Computational Linguistics, Coling'10, pp. 481–489.

E. Júnior, G. Breda, E. Marques, and L. Mendes. 2013. Knowledge discovery: Data mining by self-organizing maps. Web Information Systems and Technologies, Lecture Notes in Business Information Processing, 140: 185–200.

P. Juntunen, M. Liukkonen, M. Lehtola, and H. Yrjö. 2013. Cluster analysis by self-organizing maps: An application to the modelling of water quality in a treatment process. Applied Soft Computing, 13(7): 3191–3196.

M. Klami and K. Lagus. 2006. Unsupervised word categorization using self-organizing maps and automatically extracted morphs. Intelligent Data Engineering and Automated Learning, IDEAL 2006, volume 4224 of Lecture Notes in Computer Science, pp. 912–919.

P. Koehn and K. Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL Workshop on Unsupervised Lexical Acquisition, pp. 9–16.

P. Koehn. 2005. Europarl: a parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, pp. 79–86.

T. Kohonen. 1982. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1): 59–69.

T. Kohonen. 1995. Self-organizing maps. Springer, volume 30.

H. Kwon, H.-W. Seo, M.-A. Cheon, and J.-H. Kim. 2014. Iterative bilingual lexicon extraction from comparable corpora using a modified perceptron algorithm. Journal of Contemporary Engineering Sciences, 7(24): 1335–1343.

C. Li, H. Zhang, J. Wang, and R. Zhao. 2006. A new pattern recognition model based on heuristic SOM network and rough set theory. In Vehicular Electronics and Safety 2006, ICVES 2006, IEEE International Conference on, pp. 45–48.

A. K. Nag, A. Mitra, and S. Mitra. 2005. Multiple outlier detection in multivariate data using self-organizing maps. Computational Statistics, 20(2): 245–264.

E. Prochasson, E. Morin, and K. Kageura. 2009. Anchor points for bilingual lexicon extraction from small comparable corpora. In Proceedings of the 12th Machine Translation Summit (MT Summit XII), pp. 284–291.

R. Rapp. 1995. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 320–322.

R. Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL '99, pp. 519–526.

H.-W. Seo, H.-C. Kim, H.-Y. Cho, J.-H. Kim, and S.-I. Yang. 2006. Automatically constructing English–Korean parallel corpus from Web documents (in Korean). In Proceedings of the 26th KIPS Fall Conference, 13(2): 161–164.

H. Wakuya, H. Harada, and K. Shida. 2007. An architecture of self-organizing map for temporal signal processing and its application to a braille recognition task (in Japanese). Systems and Computers in Japan, 38(3): 62–71. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, J87-D-II (3): 884–892.

Obtaining SMT dictionaries for related languages

Miguel Rios, Serge Sharoff
Centre for Translation Studies, University of Leeds
m.riosgaona,[email protected]

Abstract

This study explores methods for developing Machine Translation dictionaries on the basis of word frequency lists coming from comparable corpora. We investigate (1) various methods to measure the similarity of cognates between related languages, and (2) detection and removal of noisy cognate translations using SVM ranking. We show preliminary results on several Romance and Slavonic languages.

1 Introduction

Cognates are words having similarities in their spelling and meaning in two languages, either because the two languages are typologically related, e.g., maladie vs malattia ('disease'), or because they were both borrowed from the same source (informatique vs informatica). The advantage of their use in Statistical Machine Translation (SMT) is that the procedure can be based on comparable corpora, i.e., similar corpora which are not translations of each other (Sharoff et al., 2013). Given that there are more sources of comparable corpora in comparison to parallel ones, the lexicon obtained from them is likely to be richer and more variable.

Detection of cognates is a well-known task, which has been explored for a range of languages using different methods. The two main approaches applied to detection of cognates are the generative and discriminative paradigms. The first one is based on detection of the edit distance between potential candidate pairs. The distance can be a simple Levenshtein distance, or a distance measure with the scores learned from an existing parallel set (Tiedemann, 1999; Mann and Yarowsky, 2001). The discriminative paradigm uses standard approaches to machine learning, which are based on (1) extracting features, e.g., character n-grams, and (2) learning to predict the transformations of the source word needed to produce the target word (Jiampojamarn et al., 2010; Frunza and Inkpen, 2009).

Given that SMT is usually based on a full-form lexicon, one of the possible issues in the generation of cognates concerns the similarity of words in their root form vs the similarity in endings. For example, the Ukrainian wordform ближнього 'near (gen.)' is cognate to Russian ближнего: the root is identical, while the ending is considerably different (ього vs его). Regular differences in the endings, which are shared across a large number of words, can be learned separately from the regular differences in the roots.

One also needs to take into account the false friends among cognates. For example, diseñar means 'to design' in Spanish, while desenhar in Portuguese means 'to draw'. There are also often cases of partial cognates, when the words share the meaning in some contexts, but not in others; e.g., жена in Russian means 'wife', while its Bulgarian cognate жена has two meanings: 'wife' and 'woman'. Yet another complexity concerns frequency mismatches: two cognates might differ in their frequency. For example, dibujo in Spanish ('a drawing', rank 1779 in the Wikipedia frequency list) corresponds to a relatively rare cognate word debuxo in Portuguese (rank 104,514 in Wikipedia), while another Portuguese word, desenho, is more commonly used in this sense (rank 884 in the Portuguese Wikipedia). For MT tasks we need translations that are equally appropriate in the source and the target language; therefore, cognates useful for a high-quality SMT dictionary need to have roughly the same frequency in comparable corpora, and they need to be used in similar contexts.

This study investigates the settings for extracting cognates for related languages in the Romance and Slavonic language families for the task of reducing the number of unknown words for SMT. This includes the effects of having constraints for the cognates to be similar in their roots and in their endings, to occur in distributionally similar contexts, and to have similar frequency.

2 Methodology

The methodology for producing the list of cognates is based on the following steps: 1) produce several lists of cognates using a family of distance measures (discussed in Section 2.1) from comparable corpora; 2) prune the candidate lists by ranking items; this is done using a Machine Learning (ML) algorithm trained over parallel corpora for detecting the outliers (discussed in Section 2.2).

The initial frequency lists for alignment are based on Wikipedia dumps for the following languages: Romance (French, Italian, Spanish, Portuguese) and Slavonic (Bulgarian, Russian, Ukrainian), where the target languages are Spanish and Russian¹.

2.1 Cognate detection

We extract possible lists of cognates from comparable corpora by using a family of similarity measures:

L: direct matching between the languages using Levenshtein distance (Levenshtein, 1966): L(ws, wt) = 1 − ed(ws, wt);

L-R: Levenshtein distance with weights computed separately for the roots and for the endings: LR(rs, rt, es, et) = (α · ed(rs, rt) + β · ed(es, et)) / (α + β);

L-C: Levenshtein distance over words with a similar number of starting characters (i.e. prefix): LC(cs, ct) = 1 − ed(cs, ct) if the prefixes match, and 0 otherwise;

where ed(·,·) is the normalised Levenshtein distance in characters between the source word ws and the target word wt. rs and rt are the stems produced by the Snowball stemmer²; since the Snowball stemmer does not support Ukrainian and Bulgarian, we used the Russian model for making the stem/ending split. es and et are the characters at the end of a word form given a stem, and cs and ct are the first n characters of a word. In this work, we set the weights to α = 0.6 and β = 0.4, giving more importance to the roots. We set a higher weight to roots in L-R, which is language dependent, and compare it to the L-C metric, which is language independent. We transform the Levenshtein distances into similarity metrics by subtracting the normalised distance score from one.

¹ For the Slavonic family we only use languages based on the Cyrillic alphabet to avoid character set problems.
² http://snowball.tartarus.org/
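A sketch of the three measures follows, with the distances already turned into similarities by subtracting from one as described above. The normalisation of ed(·,·) by the length of the longer string is an assumption, since the paper only says the distance is normalised over characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ed(a: str, b: str) -> float:
    """Normalised edit distance (normaliser assumed: longer string)."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def sim_L(ws: str, wt: str) -> float:
    return 1 - ed(ws, wt)

def sim_LR(rs, rt, es, et, alpha=0.6, beta=0.4) -> float:
    # weighted root/ending distance from the paper, converted to a
    # similarity by subtracting from one
    lr_dist = (alpha * ed(rs, rt) + beta * ed(es, et)) / (alpha + beta)
    return 1 - lr_dist

def sim_LC(ws: str, wt: str, n: int = 4) -> float:
    # only words sharing the first n characters are compared
    return 1 - ed(ws, wt) if ws[:n] == wt[:n] else 0.0
```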
The produced lists contain, for each source word, the possible n-best target words according to the maximum scores with one of the previous measures. The n-best list allows possible cognate translations for a given source word that share a part of the surface form. Differently from Mann and Yarowsky (2001), we produce n-best cognate lists scored by edit distance instead of 1-best lists. An important problem when comparing comparable corpora is the way of representing the search space, where an exhaustive method compares all the combinations of source and target words (Mann and Yarowsky, 2001). We constrain the search space by comparing each source word only against the target words that belong to a frequency window around the frequency of the source word. This constraint only applies to the L and L-R metrics. We use Wikipedia dumps for the source and target sides, processed into frequency lists. We order the target-side list into bins of similar frequency, and for the source side we filter out words that appear only once. We use the window approach given that the frequencies between the corpora under study cannot be directly compared. During testing we use a wide window of ±200 bins to minimise the loss of good candidate translations. The second search-space constraint heuristic is the L-C metric, which only compares source words with target words up to a given prefix length n. For cs and ct in L-C, we use the first four characters to compare groups of words, as suggested in (Kondrak et al., 2003).

2.2 Cognate Ranking

Given that the n-best lists contain noise, we aim to prune them with an ML ranking model. However, there is a lack of resources to train a classification model for cognates (i.e. cognate vs. false friend), as mentioned in (Fišer and Ljubešić, 2013). Available data that can be used to judge the cognate lists are the alignment pairs extracted from parallel data. We decided to use a ranking model to avoid the data imbalance present in classification, and to use the probability scores of the alignment pairs as ranks, as opposed to the classification model used by Irvine and Callison-Burch (2013).

Moreover, we also use a popular domain adaptation technique (Daumé et al., 2010), given that we have access to different domains of parallel training data that might be compatible with our comparable corpora.

The training data are the alignments between pairs of words, which we rank according to their corresponding alignment probability from the output of GIZA++ (Och and Ney, 2003). We then use a heuristic to prune the training data in order to simulate cognate words: pairs of words scored below a Levenshtein similarity threshold of 0.5 are not considered as cognates, given that they are likely to have a different surface form.

We represent the training and test data with features extracted from different edit distance scores and distributional measures. The edit distance features are as follows: 1) the similarity measure L, and 2) the number of times each edit operation is used; thus, the model assigns a different importance to each operation. The distributional feature is based on the cosine between the distributional vectors of a window of n words around the word currently under comparison. We train distributional similarity models with word2vec (Mikolov et al., 2013a) for the source and target sides separately. We extract the continuous vector for each word in the window, concatenate them, and then compute the cosine between the concatenated vectors of the source and the target. We expect the vectors to behave similarly between the source and the target, given that they are trained on parallel Wikipedia articles. We develop two ML models: 1) edit distance scores only, and 2) edit distance scores plus the distributional similarity score.

We use SVMlight (Joachims, 1998) for the ranking model, with the augmented features for domain adaptation. The domains are as follows: Wikipedia aligned titles, open source subtitles and proprietary subtitles, discussed in Section 3.1.
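As an illustration of the "frustratingly easy" feature augmentation of Daumé et al. (2010) in this setting (a sketch, not the authors' code): each feature vector is copied into a shared block plus a block for its own domain, so the ranker can learn both general and domain-specific weights. The example feature layout is an assumption.

```python
import numpy as np

DOMAINS = ["wiki-titles", "opensubs", "zoo"]   # the three domains of Section 3.1

def augment(features: np.ndarray, domain: str) -> np.ndarray:
    """Map a d-dim vector to (1 + |domains|) * d dims:
    [shared copy | wiki-titles block | opensubs block | zoo block]."""
    d = features.shape[0]
    out = np.zeros((1 + len(DOMAINS)) * d)
    out[:d] = features                           # shared block
    k = DOMAINS.index(domain)
    out[(1 + k) * d:(2 + k) * d] = features      # domain-specific block
    return out

# e.g. a pair's features (hypothetical layout): [sim_L, n_insertions,
#   n_deletions, n_substitutions, context_cosine]
x = np.array([0.82, 1, 0, 1, 0.64])
print(augment(x, "opensubs").shape)   # (20,)
```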
words algorithm for word2vec and set the param- We use SVMlight (Joachims, 1998) for the eters for training to 200 dimensions and a window ranking model with the augmented features for of 8 words. The Wikipedia documents with an domain adaptation. The domains are as follows: average number of 70K documents for each lan- Wikipedia aligned titles, open source subtitles and guage, and Opensubs subtitles with 1M sentences proprietary subtitles, discussed in Section 3.1. for each language. In practice we only use the Wikipedia data given that for Opensubs the model is able to find relatively few vectors, for example 3 Results and Discussion a vector is found only for 20% of the words in the pt-es pair. In this section we describe the data used to pro- 3.2 Evaluation of the Ranking Model duce the n-best lists and train the cognate rank- ing models. We evaluate the ranking models with We define two ranking models as: model E for heldout data from each training domain. We also edit distance features and model EC for both edit provide manual evaluation over the ranked n-best 3The aligned links have been extracted with: lists for error analysis. https://github.com/clab/wikipedia-parallel-titles

                 Zoo error%          Opensubs error%     Wiki-titles error%
Lang pairs       Model E  Model EC   Model E  Model EC   Model E  Model EC
Romance
  pt-es          53.31    53.72      54.81    48.31      12.22     9.87
  it-es          56.00    42.86      63.95    63.03       8.44    11.23
  fr-es          59.05    53.00      43.00    41.19      10.75    10.09
Slavonic
  uk-ru          47.90    40.84      37.06    30.19      10.71    10.72
  bg-ru          54.17    43.98      49.12    57.89      18.72    17.13

Table 1: Zero/one-error percentage results on heldout test parallel data for each training domain.

We evaluate these models with heldout data from each domain used for training. Each test dataset is evaluated with the zero/one-error percentage, that is, the fraction of perfectly correct rankings. We evaluate the models for the Romance and Slavonic families, where the target languages are Spanish and Russian respectively.

Table 1 shows the results of the ranking procedure. For the Romance family language pairs, the model EC with context features consistently reduces the error compared to the sole use of edit distance metrics. The only exception is the it-es EC model, with poor results for the domain of Wiki-titles. The models for the Slavonic family behave similarly to the Romance family, where the use of context features reduces the ranking error. The exception is the bg-ru model on the Opensubs domain.

A possible reason for the poor results of the it-es and bg-ru models is that the model often assigns a high similarity score to unrelated words. For example, in it-es, mortes 'deads' is treated as close to categoria 'category'. A possible solution is to map the vectors from the source side into the space of the target side via a learned transformation matrix (Mikolov et al., 2013b).

3.3 Preliminary Results on Comparable Corpora

After we extracted the n-best lists for the Romance family comparable corpora, we applied one of the ranking models to these lists and manually evaluated a sample of 50 words⁴. We set n to 10 for the n-best lists. We use a frequency window of ±200 for the n-best list search heuristic, and we set the domain of the comparable corpora to Wiki-titles for the domain adaptation technique. The purpose of the manual evaluation is to decide whether the ML setup is sensible for the objective task. Each list is evaluated by accuracy at 1 and accuracy at 10. We also show success and failure examples of the ranking and the n-best lists. Table 2 shows the preliminary results of the ML model E on a sample of Wikipedia dumps. The L and L-R lists consistently show poor results. A possible reason is the amount of errors in the first step, which extracts the n-best lists. For example, in pt-es, for the word vivem 'live' the 10-best list only contains one word with a similar meaning, viva 'living', which can however also be translated as 'cheers'.

In the pt-es list for the word representação 'description', the correct translation representación is not among the 10-best in the L list. However, it is present in the 10-best for the L-C list, and the ML model EC ranks it in first place. The edit distance model E still makes mistakes: as with the L-C list, the word vivem 'live' is translated into viven 'living', while the correct translation is vivir; however, given a certain context/sense the former translation can be correct. The ranking scores given by the SVM vary with each list version. For the L-C lists the scores are more uniform, in increasing order and with a small variance. The L and L-R lists show the opposite behaviour.

⁴ The sample consists of words with a frequency between 2K and 5.

             List L             List L-R           List L-C
Lang pairs   acc@1   acc@10     acc@1   acc@10     acc@1   acc@10
pt-es        20      60         22      59         32      70
it-es        16      53         18      45         44      66
fr-es        10      48         12      51         29      59

Table 2: Accuracy at 1 and at 10 of the ML model E over a sample of 50 words on Wikipedia dumps comparable corpora for the Romance family.

We add the produced Wikipedia n-best lists with the L metric to an SMT training dataset for the pt-es pair. We use the Moses SMT toolkit (Koehn et al., 2007) to test the augmented datasets. We compare the augmented model with a baseline, both trained on the Zoo corpus of subtitles. We use a 1-best list consisting of 100K pairs. The dataset used for the pt-es baseline is: 80K training sentences, 1K sentences for tuning and 2K sentences for testing. We use fast-align⁵, KenLM⁶ with a 5-gram language model, and Moses with the standard feature set. The BLEU score for the baseline is 20.68, and 20.86 for the augmented version, where the +0.18 increase is not statistically significant. However, the augmented dataset improves the coverage of the model: the out-of-vocabulary (OOV) words decrease from 1476 tokens (9.4%) and 623 types (21.1%) to 896 tokens (5.7%) and 337 types (11.4%). For uk-ru the baseline training data is: 140K training sentences, 1K sentences for tuning and 2K sentences for testing. The uk-ru 1-best list consists of 100K pairs. The BLEU result for the baseline is 28.72, and 29.56 for the augmented dataset, a difference of +0.93 which is not statistically significant⁷. The results for OOV are: from 1202 tokens (8.1%) and 756 types (21.6%) to 894 tokens (6.0%) and 545 types (15.6%).

A possible reason for the low improvement in terms of BLEU scores is that MT evaluation metrics such as BLEU compare the MT output with a human reference. The human reference translations in our corpus have been done from English (e.g., En→Es), while the test translations come from a related language (En→Pt→Es), often resulting in different paraphrases of the same English source. While our OOV rate improved, the evaluation scores did not reflect this, because our MT output was still far from the reference even in cases where it was otherwise acceptable.

⁵ https://github.com/clab/fast_align
⁶ https://kheafield.com/code/kenlm/
⁷ The p-value for the uk-ru pair is 0.06; we do not consider this result statistically significant.

4 Conclusions and Future Work

We have presented work in progress on developing MT dictionaries extracted from comparable resources for related languages. The extraction heuristics show positive results for the n-best lists that group words with the same starting characters, because the comparable corpora used consist of related languages that share a similar orthography. However, the lists based on the frequency window heuristic show poor results in including the correct translations during the extraction step. Our ML models based on similarity metrics over parallel corpora show rankings similar to the heldout data. However, we created our training data using simple heuristics that simulate cognate words (i.e. pairs of words with a small surface difference). The ML models are able to rank similar words at the top of the list, and they give a reliable score to discriminate wrong translations. Preliminary results on the addition of the n-best lists to an SMT system show modest improvements compared to the baseline. However, the OOV rate shows an improvement of around 10% reduction in word types, because of the wide variety of lexical choices introduced by the MT dictionaries.

Future work involves the addition of unsupervised morphology features to the n-best list extraction (i.e. the first step), given that the use of starting characters has shown itself to be an effective and language-independent heuristic to prune the search space. Finally, we will measure the contribution of all the produced cognate lists, where we can try different strategies to add the dictionaries to an SMT system (Irvine and Callison-Burch, 2014).

Acknowledgments

The research was funded by Innovate UK and ZOO Digital Group plc.

References

Hal Daumé, III, Abhishek Kumar, and Avishek Saha. 2010. Frustratingly easy semi-supervised domain adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, DANLP 2010, pages 53–59, Stroudsburg, PA, USA.

Darja Fišer and Nikola Ljubešić. 2013. Best friends or just faking it? Corpus-based extraction of Slovene-Croatian translation equivalents and false friends. Slovenščina 2.0, 1.

Oana Frunza and Diana Inkpen. 2009. Identification and disambiguation of cognates, false friends, and partial cognates using machine learning techniques. International Journal of Linguistics, 1(1).

Ann Irvine and Chris Callison-Burch. 2013. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013), Atlanta, Georgia, June.

Ann Irvine and Chris Callison-Burch. 2014. Using comparable corpora to adapt MT models to new domains. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 437–444, June.

Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2010. Integrating joint n-gram features into a discriminative training framework. In Proceedings of NAACL-2010, Los Angeles, CA, June.

T. Joachims. 1998. Making large-scale SVM learning practical. LS8-Report 24, Universität Dortmund, LS VIII-Report.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA.

Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In Proceedings of HLT-NAACL 2003 – Short Papers, Volume 2, pages 46–48, Stroudsburg, PA, USA.

Vladimir Iosifovich Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710.

Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In Proceedings of NAACL, pages 151–158, Pittsburgh, PA, June.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proc. Workshop at ICLR'13.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Serge Sharoff, Reinhard Rapp, and Pierre Zweigenbaum. 2013. Overviewing important aspects of the last twenty years of research in comparable corpora. In Serge Sharoff, Reinhard Rapp, Pierre Zweigenbaum, and Pascale Fung, editors, BUCC: Building and Using Comparable Corpora, pages 1–17. Springer.

Jörg Tiedemann. 1999. Automatic construction of weighted similarity measures. In Proc. Empirical Methods in Natural Language Processing and Very Large Corpora, pages 213–219.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May.

BUCC Shared Task: Cross-Language Document Similarity

Serge Sharoff, University of Leeds, Leeds, UK, [email protected]
Pierre Zweigenbaum, LIMSI, CNRS, Orsay, France, [email protected]
Reinhard Rapp, University of Mainz, Mainz, Germany, [email protected]

Abstract

We summarise the organisation and results of the first shared task aimed at detecting the most similar texts in a large multilingual collection. The dataset of the shared task was based on Wikipedia dumps with inter-language links, with further filtering to ensure comparability of the paired articles. The eleven system runs we received have been evaluated using the TREC evaluation metrics.

1 Task description

Parallel corpora of original texts with their translations have provided the basis for multilingual NLP applications since the beginning of the 1990s. The relative scarcity of such resources has led to greater attention to comparable (= less parallel) resources to mine information about possible translations. Many studies have been produced within the paradigm of comparable corpora, including publications in the BUCC workshop series since 2008.¹

However, the community so far has not conducted an evaluation comparing different approaches for identifying more or less parallel documents in a large amount of multilingual data. Also, it is not clear how language-specific such approaches are. In this shared task we propose the first evaluation exercise, which is aimed at detecting the most similar texts in a large multilingual collection.

2 Data set

2.1 Description

The dataset is derived from static Wikipedia dumps of the main articles. A feature of Wikipedia is that it provides so-called inter-language links between many corresponding articles of different languages, i.e. between articles describing the same or corresponding headwords. These inter-language links are provided by the authors of the articles, i.e. they are based on expert judgement.

For the shared task we selected bilingual pairs of articles which fulfilled the following requirements:

1. The inter-language links between the articles had to be bidirectional, i.e. not only does an article in Language1 need to be linked to the corresponding article in Language2, but also vice versa. This ensured that a page in one language is not linked only to a portion of a page in another one.

2. The size of the textual content of the two articles within a pair (i.e. their length measured as the number of characters) had to be similar (see Section 2.2 below).

Note that this selection procedure for the article pairs implies that an article pair selected for one language pair may or may not be selected for another language pair. All articles which satisfied the selection conditions have been considered for the evaluation run.

The data for each language pair has been split randomly into two sets:

Training set: articles with information about the correct links for the respective language pairs, provided to the participants;

Test set: articles without the links.

The task is, for each article in the test set, to submit up to five ranked suggestions for its linked article, assuming that the gold standard contains its counterpart in another language. The submissions had to be in the tab-separated format as used in the submissions to the shared tasks of the Text Retrieval Conference (TREC²), with six fields:

¹ See http://comparable.limsi.fr/
² See http://trec.nist.gov/.


     Min.    1st Qu.  Median   Mean    3rd Qu.  Max.      Selected pairs
De   0.010   0.420    0.790    1.244   1.370    206.000   294990
Fr   0.000   0.370    0.740    1.194   1.260    255.800   229591
Ru   0.010   0.300    0.620    0.987   1.070    108.600   159810
Tr   0.000   0.140    0.350    0.616   0.760    46.730    32614
Zh   0.010   0.280    0.610    1.010   1.090    111.500   42944

Table 1: Ratios of lengths of aligned articles to English

id1  X  id2  Y  score  run.name

The X and Y fields are not used, but they are reserved by the TREC evaluation script (which does not use them either). id1 and id2 are the respective article identifiers in a source language and in English. The score should reflect the similarity between id1 and id2: the higher, the closer. The participants were invited to submit up to five runs of their system with different parameters, as identified by a keyword in the last field.

The evaluation script and more information about the format were made available in advance.3

The languages in the shared task were Chinese, French, German, Russian and Turkish. Pages in these languages needed to be linked to a page in English.

The choice of languages reflects variation in the available clues for linking the pages. The languages vary in:

• their writing systems (Latin, Cyrillic, logographic);
• tokenisation (clitics in French, compounds in German, no orthographic word boundaries in Chinese);
• their morphology (covering isolating, inflecting and agglutinative languages).

Even though the writing system issue is superficial, it shifts the clues for linking the articles, and thus requires more intelligent mapping between the languages. Within the same writing system, many clues remain the same or nearly identical (Paris, Frankfurt), while in a different writing system they have to adapt to the target language requirements: Париж ('Paris', transliterated as Parizh in Russian) or 巴黎 ('Paris', pinyin Bali in Chinese). Morphology accounts for the variation of word forms to be matched against the dictionaries. It is considerably larger in morphologically rich languages, such as Russian or Turkish; therefore, the mapping of word forms is likely to be more sparse.

2.2 Preparation

We started with the downloadable Wikipedia dumps,4 which were cleaned to their text-only contents by removing standard formatting codes, figures (with their captions), templates, tables and external links. Given that the first sentence of a Wikipedia article provides a concise summary of the article contents, the first sentence (defined as the sequence of characters up to the first full stop) has also been removed, to make the task more similar to detection of webpages in a context unrelated to Wikipedia. Shaded areas in Figure 1 demonstrate the extent of cleaning.

We selected a subset of articles aligned to English. Table 1 lists the distribution of the length ratios of the respective articles to their English counterparts and the number of articles remaining after pruning by length. A small number of articles are much shorter than their English counterparts. Less frequently this happens in the opposite direction, with a length ratio of more than one (the median is always less than one). Usually articles which differ in their length are not good candidates for comparable corpora. We took only those within the inter-quartile range. This left us with 50% of the article pairs in the original list, which are all reasonably comparable in their contents. Examples for each language bordering on the 1st quartile in ratio to English all show a reasonable amount of text to be considered as comparable entries:

de  Aaron Ramsey
fr  Adena culture
ru  Quantum mechanics
tr  Cyrano de Bergerac (play)
zh  Blood transfusion
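A minimal sketch of the inter-quartile pruning described above, under the assumption that the paired articles are available as (source id, English id, length ratio) triples; the function and variable names are ours, for illustration only:

import numpy as np

def interquartile_filter(pairs):
    # pairs: list of (src_id, en_id, ratio), where ratio is
    # len(src_text) / len(en_text) measured in characters.
    ratios = np.array([r for (_, _, r) in pairs])
    q1, q3 = np.percentile(ratios, [25, 75])
    # Keep only pairs within the inter-quartile range;
    # by construction roughly 50% of the pairs survive.
    return [(s, e, r) for (s, e, r) in pairs if q1 <= r <= q3]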

3See http://trec.nist.gov/trec_eval/
4Downloaded in November 2011.

Figure 1: Example of cleanup (shaded areas indicate removed text).

For example, the Adena culture article has been selected only for the French-English pair, since the articles in the other languages are much shorter than the English one and thus cannot be considered reasonably comparable.

3 Evaluation

Evaluation has been done using standard TREC evaluation measures, modeling the task as the retrieval of a ranked list of links from a source page.

Extrinsic evaluation setups, for example via terminology extraction, would possibly provide more interesting measures, but this would require a baseline system which works with all the languages in question.

3.1 Metrics

For each source page there exists exactly one correct linked page in the gold standard. Systems were required to return a ranked list of hypotheses in which the correct target page should be ranked as high as possible.

Several evaluation measures are relevant to this situation in the trec_eval program used in TREC evaluations. The Success measures correspond to commonly used measures when evaluating term translations in comparable corpora. We use them here to evaluate the proposed inter-language links between the articles. Success@1 determines the proportion of source articles for which the correct target article has been ranked in the top position; Success@5 determines the proportion of source articles for which the correct target article has been ranked among the top 5 positions. Mean Reciprocal Rank (MRR) is also a relevant measure: if the correct target article is ranked at position N, a score of 1/N is given to this source article. These scores are then averaged over the set of source articles. These measures are respectively obtained with the parameters success.1, success.5 and recip_rank in trec_eval.

4 Results

Overall, we received eleven runs: one entry for Chinese (Table 2), three entries for French (Table 2), and seven for German (Table 3).

4.1 Methods used

The method used by the CCNUNLP system is described in (Li and Gaussier, 2013). In essence, it uses a bilingual dictionary for converting the word feature vectors between the languages and estimating their overlap. The other systems are discussed in detail in the current proceedings (Morin et al., 2015; Zafarian et al., 2015). The LINA system (Morin et al., 2015) is based on matching hapax legomena, i.e., words occurring only once. In addition to using hapax legomena, the quality of linking in one language pair, e.g., French-English, is also assessed by using information available in pages of another language pair, e.g., German-English. The AUT system (Zafarian et al., 2015) uses the most complicated setup by combining several steps.
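Before turning to the result tables, a small illustration of the measures of Section 3.1 may help (the rankings here are invented for the example). If a system ranks the correct target at position 3 for one source article and at position 1 for another, then success@1 = 1/2 = 0.5, success@5 = 2/2 = 1.0, and MRR = (1/3 + 1/1)/2, which is approximately 0.67. In code:

def evaluate(ranks):
    # ranks[i] = position of the correct target article for source
    # article i, or None if it was not returned at all.
    n = len(ranks)
    success1 = sum(1 for r in ranks if r == 1) / n
    success5 = sum(1 for r in ranks if r is not None and r <= 5) / n
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    return success1, success5, mrr

print(evaluate([3, 1]))  # (0.5, 1.0, 0.666...)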

                 French                        Chinese
runid            ccnunlp   lina.p    lina.cl   ccnunlp
num_q            114423    114423    78529     21467
num_ret          572115    572111    143542    107335
num_rel          114423    114423    78529     21467
num_rel_ret      87367     42777     47561     18474
MRR              0.669     0.329     0.590     0.769
success@1        0.607     0.300     0.577     0.710
success@5        0.764     0.374     0.606     0.861

Table 2: Evaluation results for French and Chinese. lina.p corresponds to Pigeonhole, lina.cl to Cross-lingual in the authors’ paper.

German
runid            lina.p    lina.cl   aut1      aut2      aut3      aut4      aut5
num_q            147220    92020     147515    147515    147515    147515    147515
num_ret          736100    166051    147516    147516    147516    147516    147516
num_rel          147220    92020     147515    147515    147515    147515    147515
num_rel_ret      52223     58828     6870      2703      2029      1371      890
MRR              0.290     0.622     0.047     0.018     0.014     0.009     0.006
success@1        0.249     0.607     0.047     0.018     0.014     0.009     0.006
success@5        0.355     0.639     0.047     0.018     0.014     0.009     0.006

Table 3: Evaluation results for German

First, documents in different languages are mapped into the same space using a feature transformation matrix. This helps in selecting a relatively small subset of pages to detect possible links. Second, document similarity is assessed using three pipelines, namely a polylingual topic model, a named entity detection tool and a word feature mapping procedure using MT.

4.2 Comparison of results

Since AUT submitted exactly one target article for each source article, its MRR, success@1 and success@5 measures are identical.

For each run, success@1 is the strictest measure and hence provides the lowest score, because it can only obtain points if the top-ranked article is the correct one. Mean reciprocal rank (MRR) yields the same score when the top-ranked article is correct, but also scores decreasing fractions of one when the correct article is found anywhere in the ranking: this results in a higher average score than success@1. Finally, success@5 also takes into account articles beyond the first, but only up to the fifth; if the correct article is present in this range, the full score of one is assigned to the article; otherwise no point is obtained. Therefore a system which generally ranks correct articles beyond the fifth position will have a lower success@5 than its MRR, while a system which ranks correct articles before the sixth position often enough will have a higher success@5 than MRR. This is the case for all systems except AUT, which only returned one target article per source article.

The tables show that the rankings obtained by the three measures (MRR, success@1 and success@5) are the same in all cases, i.e., the rank correlation of the results is always 1. This suggests that system results ranked the correct article in the top 5 often enough.

4.3 Comparison of methods

The best results were obtained on Chinese, with a success@1 of 0.710 and a success@5 of 0.861. This is a very good performance, but it also reveals that the problem is not solved.

Although the number of different runs is not sufficient to draw general conclusions, we can compare the same methods across different language pairs and different methods on the same language pairs.

CCNUNLP obtained better results on Chinese than on French, probably because of the quality of the underlying dictionaries. LINA.CL worked better on German than on French, while the reverse was true for LINA.P.

After the evaluation run, it transpired that the submissions of AUT had a data processing bug.

Overall, the CCNUNLP method obtained the best results on Chinese and French, followed by the LINA.CL method (second best on French, and best on German).

4.4 Discussion

The results are encouraging. Success@1 rates reach 0.71 for Chinese and 0.61 for French and German. However, this level of accuracy is still far from reliable identification of comparable pages. Given the small number of participating systems and the uneven coverage of the language pairs involved, it is difficult to make predictions about which methods are more or less successful. A dictionary-based method (CCNUNLP) is slightly ahead of a method based on hapax legomena (LINA.*). A multi-stage method like the one used by AUT is promising, but its complexity makes it prone to errors.

Another question concerns the evaluation scenario. The shared task has been evaluated using gold standard data in an intrinsic evaluation. Given that the purpose of collecting comparable corpora is to provide more data for terminology extraction or Machine Translation, we need to evaluate text collections by referring to their successful use in such tasks. The limitation of extrinsic evaluation is the lack of gold-standard methods and resources.

In the next shared task we plan to address this issue by specifically targeting either terminology extraction or MT development methods using comparable corpora. This shared task will use the resources we developed for the current one.

References

Li, B. and Gaussier, E. (2013). Exploiting comparable corpora for lexicon extraction: Measuring and improving corpus quality. In Sharoff, S., Rapp, R., Zweigenbaum, P., and Fung, P., editors, Building and Using Comparable Corpora, pages 131–149. Springer-Verlag.

Morin, E., Hazem, A., Boudin, F., and Loginova-Clouet, E. (2015). LINA: Identifying comparable documents from Wikipedia. In Proc. Workshop on Building and Using Comparable Corpora at ACL 2015.

Zafarian, A., Agha Sadeghi, A. P., Azadi, F., Ghiasifard, S., Ali Panahloo, Z., Bakhshaei, S., and Mohammadzadeh Ziabary, S. M. (2015). AUT document alignment framework for BUCC workshop shared task. In Proc. Workshop on Building and Using Comparable Corpora at ACL 2015.

4.5 Conclusions

In addition to providing an estimate of the quality of various methods for measuring comparability, a major outcome of the evaluation exercise is the availability of a standardised dataset, split into training and testing parts. We encourage our readers to develop better systems and to test them on our data. The dataset is available from: http://corpus.leeds.ac.uk/serge/BUCC/

We intend to keep the data on the web for many years as a benchmark for measuring comparability at the text level.

AUT Document Alignment Framework for BUCC Workshop Shared Task

Atefeh Zafarian, Amirpouya Aghasadeghi, Fatemeh Azadi, Sonia Ghiasifard, Zeinab Alipanahloo, Somayeh Bakhshaei, Seyed Mohammad Mohammadzadeh Ziabary
Human Language Technology Lab, Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran
{atefeh.zafarian, aghasadeghi, ft.azadi, s.ghiasi, apanahloo, bakhshaei, mehran.m}@aut.ac.ir

Abstract

This paper presents a framework for aligning collections of comparable documents. Our feature-based model is able to consider different characteristics of documents when evaluating their similarities. The model uses only the content of the documents; no links, special tags or metadata are required. We also apply a filtering mechanism which makes our model applicable to a large collection of data. According to the results, our model is able to recognize related documents in the target language with a recall of 45.67% for the 1-best and 62% for the 5-best output.

1 Introduction

Comparable corpora (CC) are collections of similar documents with different levels of comparability (Fung and Cheung, 2004). They are useful resources for many Natural Language Processing (NLP) and Information Retrieval (IR) tasks, such as cross-lingual text mining (Tang et al., 2011), bilingual lexicon extraction (Li and Gaussier, 2010), cross-lingual information retrieval (Knoth et al., 2011) and machine translation (Smith et al., 2010; Delpech, 2011).

The sub-fields of NLP address human language tasks that are mostly hard problems, such as Language Understanding (Winograd, 1972) and Machine Translation. The modern algorithms in these sub-fields are based on machine learning and statistical approaches. Most of the systems developed in these fields require large amounts of parallel corpora; as a result, the lack of parallel corpora limits the success of such tasks. Recent research has shown that comparable corpora can be a valuable alternative to rare parallel corpora.

Information Retrieval (IR) is "the act of finding materials, usually documents of an unstructured form, that satisfy an information need within large collections stored in computers" (Manning et al., 2008). IR is not limited to monolingual documents; if the task involves mapping bilingual or multilingual documents, a new area of IR is introduced: Cross/Multilingual IR. The idea of Cross-Lingual IR (CLIR) is to retrieve documents in a language different from the language of the input text (Oard, 1998). The input text may be either a query or a document, which divides the field into document-based and query-based approaches. CLIR is a way of expanding input queries to other languages. This is a useful approach in search engines, enabling users to formulate queries in their preferred languages and retrieve relevant documents in whatever language they are written. For this purpose, instead of parallel corpora for translating input queries, comparable corpora might be helpful. Conversely, document-based CLIR can be used for producing comparable corpora. Related work is reviewed in Section 2.

Our model is a framework consisting of different modules. Each module considers disparate features for matching each source document with the target documents, so we call this a feature-based model. The pipeline of the modules in our model is shown in Figure 1.

We assume that two similar documents contain the same sets of names, occurring in the same order. The Name module of our model is responsible for checking this structure. In addition, the translations of similar texts in the source and target languages must be similar, so we use an SMT system as another module in the model.


Figure 1. The structure of the pipeline model (boxes: Filter; Bilingual Topic Module; Name Module; Translation Module; System Combination).

We also assume that similar documents, converted to vector representations using neural networks, will have a shorter Euclidean distance between each other. This characteristic of similar documents is considered in the Document-to-Vector module. To recognize similar conceptual structures in documents, we use bilingual topic models.

The problem is aligning source documents to target documents, but because of the amount of data, in addition to similarity concerns, our model faces a very-large-corpora problem. It is not possible to evaluate every pair of documents in this big space, so we use some of our modules as a filter for shrinking the target space. From the framework's modules, we choose the ones with higher recall, to be sure that our filter is able to recognize and remove wrong samples. Another factor in selecting the filtering modules is execution speed: slow modules are not suitable for the model's pipeline.

2 Related Works

Common methods for comparing documents extract features from the texts, for example comparing documents through their most frequent words (Kilgarriff, 2001).

In a multilingual context, some approaches translate features and compare documents using differences in the frequencies of the translations of the keywords, e.g. using the cosine similarity measure between the feature vectors (Su and Babych, 2012).

A successful approach is the Cross Language Character N-Gram (CL-CNG) model (Mcnamee and Mayfield, 2004), which uses character n-grams and is based on the syntax of documents; it shows remarkable performance for languages with syntactic similarities.

A primary approach for aligning comparable document corpora is based on statistical machine translation technology, such as CL-ASA (Barrón-Cedeño et al., 2008), which uses a combination of a translation model and a length model for measuring the similarity between source and target documents. The translation model estimates how likely the source document is to be a translation of the target document, and the length model measures the similarity of the two documents with respect to their lengths: a pair of translated documents is expected to have closely related lengths.

A common language-independent approach for representing documents is based on vector representation. Representing the documents of a collection as bags of words is called the Vector Space Model. Each component of the vector shows the importance of that term in the document. In large document collections, document vectors have high dimensions. For this reason, some approaches use linear projection, a map from the high-dimensional to a low-dimensional vector space. Early approaches for linear projection are LSA (Deerwester et al., 1990) and LDA (Blei et al., 2003).

Cross-language latent semantic indexing (CL-LSI) (Dumais et al., 1997) is based on LSA, applied to multiple languages by reducing the dimensionality of a matrix whose rows are obtained by concatenating comparable documents from different languages. Another projection model, Latent Dirichlet Allocation (LDA), is based on the extraction of generative models from documents. Polylingual Topic Models (Mimno et al., 2009) are multilingual versions of LDA.

Cross Language Explicit Semantic Analysis (CL-ESA) is another model in the vector space approach (Potthast et al., 2008) that uses comparable Wikipedia corpora. Each document is represented by a concept vector, where each dimension is the similarity of the document to one of the Wikipedia documents in the corpus.

Newer approaches to the comparable document retrieval task and to measuring document similarities are knowledge-based, in contrast to previous works that were supervised. Knowledge-based Document Similarity (KBSim) (Franco-Salvador et al., 2014) is one of the most recent of them. It turns source and target documents into knowledge graphs using a Multilingual Semantic Network (MSN) such as BabelNet (Navigli and Ponzetto, 2012) and then compares the two graphs using KBSim.

3 Model description

The framework of our model is constructed from four modules: Doc2Vec, Name, Topic Model and SMT. These modules evaluate the similarity of each pair of documents, considering different characteristics of a document pair.

According to the framework of our model (Figure 1), the first step contains filters for reducing the size of the target space. Two filters, based on the Doc2Vec and Name modules, are used serially for this purpose. In the following subsections, we explain each of the modules used in our framework in more detail.

3.1 Document-to-Vector Module (Doc2Vec)

Recent works on learning vector representations of words using neural networks (Mikolov et al., 2013) show that these models can capture great detail about semantic and syntactic relationships and patterns between similar words, and that many of those patterns can be obtained from simple linear transitions. For example, it has been shown that the result of the vector calculation vec("Madrid") - vec("Spain") + vec("France") is closer to vec("Paris") than to any other word vector (Mikolov et al., 2013).

Another great property of these models is that if they are trained on comparable corpora in different languages, vectors from the source language can be projected to the target space by using a simple transformation matrix, and be used to build a larger dictionary (Mikolov et al., 2013). Such transformation matrices can be obtained from a few thousand aligned words from the source and target sides.

The models mentioned above can only work on fixed-length text inputs such as words or short phrases, but many tasks in NLP need variable-length input. A new extension of these models is the Paragraph Vector (Le and Mikolov, 2014), which can convert any variable-length input, from a sentence to a document, into a fixed-length vector output. Since Paragraph Vector training is similar to the word-to-vector model, they share many properties, and this new model also captures the relationships between similar words and sentences. Previous work on this model shows state-of-the-art results in the fields of sentiment analysis and document classification. As far as we know, there has been no previous use of the Paragraph Vector model for bilingual and multilingual tasks. Since the Paragraph Vector model is based on the word-to-vector model, we conclude that by using the same method mentioned in (Mikolov et al., 2013) we can build a bilingual paragraph model to align source and target documents.

The transformation matrix can be found by solving the following optimization problem, where $x_i$ is the vector representation of the $i$-th source document and $z_i$ is the vector representation of its paired document in the target space:

$$\min_{W} \sum_{i=1}^{n} \lVert W x_i - z_i \rVert \qquad (1)$$

$W$ can be found by any optimization method, but we solved it with a stochastic gradient descent approach. By computing $z = W x$, any source vector is projected to the target space; we can then search for the closest neighbors of $z$ among the target vectors to find our answers.

Training this model and the transformation matrix is relatively fast compared with our other modules. Even though the precision of this method is low, it can discriminate related and unrelated documents from each other very well. Since generating the closest-neighbors list with this method is simple, this module is used for filtering the target search space for our slower modules, such as topic models and machine translation.

3.2 Bilingual Topic Model Module (BiTM)

The basic idea behind topic models is that documents are mixtures of topics, where a topic is a probability distribution over words (Blei et al., 2003). Topic models have a major benefit: they do not need documents to be sentence-aligned, which makes them a good choice for finding comparable corpora. To model bilingual topics, we used an extension of latent Dirichlet allocation (LDA) called the Polylingual Topic Model (Mimno et al., 2009). We consider each document as a bag of words. This approach consists of three main steps: the first step is creating sets of topics for both sides (the source and target languages), then calculating the probability of each topic in each document, and finally finding document similarities.

Figure 2 shows the graphical model of the polylingual topic model, where $\alpha$ and $\beta$ are the hyperparameters on the Dirichlet priors for the topic distributions $\theta$ and the topic-word distributions $\varphi$, respectively. This model actually finds and aligns topics of different languages.

Now that the topic distributions of the target and source languages are created, we use these topics to find the topic probabilities over each document using Gibbs sampling.

Accordingly, each document is converted to a T-dimensional vector $v = [p_1, p_2, \ldots, p_T]$, where $p_1$ is the probability of assigning topic one to this document and T is the number of topics. To find similar documents in the two languages, we use the well-known cosine similarity. In our case, two vectors (from the source and the target language) are compared using the cosine similarity as below:

$$\mathrm{Sim}(v', v) = \frac{\sum_{i=1}^{T} v'_i v_i}{\sqrt{\sum_{i=1}^{T} v'^2_i}\,\sqrt{\sum_{i=1}^{T} v^2_i}} \qquad (2)$$

where $v'$ and $v$ are documents of the source and target language, respectively. The result is a number between 0 and 1, where a similarity of 1 means the documents are completely similar.
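The two similarity computations above lend themselves to a compact implementation. The following is a minimal sketch under our own naming conventions, not the authors' code: it assumes document vectors are already available as NumPy arrays, learns the transformation matrix W of equation (1) by stochastic gradient descent on the (unsquared) norm, and scores topic vectors with the cosine similarity of equation (2):

import numpy as np

def train_projection(X, Z, epochs=10, lr=0.01):
    # X: (n, d) source document vectors; Z: (n, d) paired target vectors.
    # Minimizes sum_i ||W x_i - z_i|| (equation 1) by SGD.
    n, d = X.shape
    W = np.eye(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            diff = W @ X[i] - Z[i]                  # residual of pair i
            norm = np.linalg.norm(diff) + 1e-12
            W -= lr * np.outer(diff / norm, X[i])   # (sub)gradient step
    return W

def cosine(v1, v2):
    # Cosine similarity of two topic-probability vectors (equation 2).
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)

def nearest_targets(W, x, Z_all, k=12000):
    # Project a source vector and return the k nearest target indices,
    # as in the Doc2Vec filtering step.
    z = W @ x
    sims = Z_all @ z / (np.linalg.norm(Z_all, axis=1) * np.linalg.norm(z) + 1e-12)
    return np.argsort(-sims)[:k]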
Figure 2. Graphical model of the polylingual topic model (Mimno et al., 2009).

3.3 Names Module

Named entities play an important role in Cross-Lingual Information Retrieval (CLIR). Comparable documents generally share many named entities (NEs) (Gupta and Bandyopadhyay, 2013). In this section, we build a Name model for checking the effectiveness of NEs in aligning comparable documents. In this model, we extract NEs from the documents and classify them into three types: location, person and organization; we then compute document similarity based on the similarity of names that have the same type. As NEs are usually transliterated phonetically, we consider the phonetic similarity of two words as the similarity criterion. Our Name model contains two sections:

A. Named Entity Recognition: we apply a CRF-based supervised classifier as the NER model.

B. Computing Phonetic Similarity: the main bottleneck in computing phonetic similarity is the lack of available transliteration training data, so we propose a solution to this problem. Our proposed method includes the following three steps.

3.3.1 Transliteration Mining

In this step, we use an unsupervised transliteration mining model for extracting transliterated names from a parallel corpus, as described in (Durrani et al., 2014). We apply this to the Europarl parallel corpus and extract a transliterated bilingual German-English dictionary that we call ENTD (Europarl Transliterated Names Dictionary).

3.3.2 Mapping Table

In this step, we extract the high-probability transliterated names of the ENTD and apply an iterative alignment model to them to generate a table of characters that are aligned with high probability in the source and target languages. This method is similar to the method described in (Mousavi Nejad and Khadivi, 2011). The alignment model is a Levenshtein distance based on the mapping table. In each iteration of the model, the characters with high alignment probabilities are added to the mapping table, and the algorithm is repeated until no change in the mapping happens anymore.

3.3.3 Compute Phonetic Distance

In this step, we compute the phonetic distance between named entities in comparable training documents using a recursive function based on the mapping table. This method is similar to the method described in (Mehdizadeh Seraj et al., 2014). For measuring the distance between an English character at position i and a German character at position j, we use the recursive definition in the following equation, in which e and g are the English and German words respectively:

$$d(i, j) = \min \begin{cases} d(i-1, j) + (1 - p_{remove}(e_i)) \\ d(i, j-1) + (1 - p_{remove}(g_j)) \\ d(i-1, j-1) + (1 - p_{replace}(e_i, g_j)) \end{cases} \qquad (3)$$

where $p_{remove}$ and $p_{replace}$ are obtained from the mapping table. Finally, we generate a bilingual German-English dictionary of transliterated names that have a low phonetic distance, named BTND (BUCC Transliterated Names Dictionary).
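A sketch of the recursive distance of equation (3), implemented bottom-up with dynamic programming; the mapping table used at the end is a toy stand-in, and the helper names are ours:

def phonetic_distance(e, g, p_remove, p_replace):
    # Weighted edit distance of equation (3) between an English word e
    # and a German word g. p_remove(c) and p_replace(c1, c2) return
    # alignment probabilities taken from the mapping table (0 if unseen).
    m, n = len(e), len(g)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + (1 - p_remove(e[i - 1]))
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + (1 - p_remove(g[j - 1]))
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + (1 - p_remove(e[i - 1])),                 # drop e_i
                d[i][j - 1] + (1 - p_remove(g[j - 1])),                 # drop g_j
                d[i - 1][j - 1] + (1 - p_replace(e[i - 1], g[j - 1])),  # map e_i to g_j
            )
    return d[m][n]

# Toy usage: character pairs that frequently align are cheap to replace.
table = {("k", "k"): 0.9, ("c", "k"): 0.8}
dist = phonetic_distance(
    "carl", "karl",
    p_remove=lambda c: 0.1,
    p_replace=lambda a, b: 1.0 if a == b else table.get((a, b), 0.0),
)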

3.3.4 Computing the Similarity of Documents at Test Time

Computing the phonetic similarity with a recursive function takes a lot of time and is not efficient at test time, so we use the bilingual dictionaries generated in the previous sections. When there is enough time, we can use the method described in Section 3.3.3, which is a language-independent method. In this setting, we divide the source-target names in each of the two documents into three states:

1. Named entities of the same type that have the same letter form.
2. Named entities of the same type that exist in ENTD.
3. Named entities of the same type that exist in BTND.

We search for each source-target name in state 3 only if it does not exist in states 1 and 2, and we search for it in state 2 only if it does not exist in state 1. We also extract URLs from the documents and consider these as state 4:

4. The same URLs in the two documents.

Finally, we define a score function for computing the document similarity between a German document G and an English document E as follows:

$$\mathrm{score}(G, E) = \frac{w_1 s_1 + w_2 s_2 + w_3 s_3 + w_4 s_4}{C_{NE} + C_{url}} \qquad (4)$$

where $w_1$ is the weight of state 1 and $s_1$ is the number of common NEs in documents G and E (and likewise for the other states). $C_{NE}$ and $C_{url}$ are the numbers of NEs and URLs in the German document. For estimating the weight of each state, we apply our models on a development set and increase the impact of the phonetic dictionaries ENTD and BTND by filtering out the pairs of names with low alignment probabilities. We compute the thresholds by testing on the development set.

3.4 SMT Module

When two documents in two different languages are similar, the translation of the first document into the other's language should produce a document similar to the second one. That is why we use statistical machine translation (Brown et al., 1993) as another module for measuring the document similarities.

In this module, we first train a phrase-based SMT system on a sentence-aligned parallel corpus (Zens et al., 2002; Koehn et al., 2003). Then we translate each source document with the trained SMT system, and in the next step, for each source document, we calculate similarity scores by comparing its translation to the list of filtered target documents that were produced by the former modules.

For the similarity scores, we use two well-known translation evaluation metrics. The first metric is BLEU, which is computed by comparing the system output against the reference translation (Papineni et al., 2002). Given the precision $p_n$ of n-grams of size up to N (usually N = 4), the length of the translation output in words (c), and the length of the reference translation in words (r), the BLEU metric is computed as follows:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{4} \log p_n\Big) \qquad (5)$$

$$\mathrm{BP} = \min(1, e^{1 - r/c}) \qquad (6)$$

Here the translation output is our SMT system's output for the source document, and the reference translation is a target document. As these two documents are not necessarily sentence-aligned, we concatenate each of them to make one-sentence documents. As we know, one of the BLEU metric's shortcomings is that it was designed for the corpus level and might not perform well on single sentences, since the 4-gram precision can often be zero, which makes the whole BLEU score zero.

Since BLEU might perform badly in some cases, we also used another metric, called Position-independent word Error Rate (PER) (Tillmann et al., 1997). This metric measures a position-independent Levenshtein distance (a bag-of-words based distance) between the translation output and the reference. The resulting number is then divided by the number of words in the reference. The reason we used this instead of other error rates such as WER (Nießen et al., 2000) and TER (Snover et al., 2006) is that it completely neglects word order. Since in our task the sentences of two similar documents might be displaced, and we do not want this displacement to influence our similarity score, PER is more reasonable to use. As the BLEU score contains higher-order n-grams, it also considers correct phrases instead of just the words considered by PER, and so it has a higher recall in our experiments (shown in Section 4). But as PER might help in the cases where BLEU is not working well, we use both of these scores in our final system.
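The PER score reduces to a bag-of-words overlap; a minimal sketch of one common formulation, applied here to whole documents treated as single token sequences (this is our own rendering of the metric, not the authors' code):

from collections import Counter

def per(translation, reference):
    # Position-independent word error rate: the bag-of-words edit
    # distance max(|hyp|, |ref|) - matches, divided by |ref|.
    hyp, ref = translation.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())  # order ignored
    errors = max(len(hyp), len(ref)) - matches
    return errors / max(1, len(ref))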

3.5 System Combination

In our model, the big space of English documents is first filtered with the high-speed modules. Then, for each pair of documents in this filtered space, we compute the values of their features, i.e. the similarity scores of the modules. The scores of the TM, Name and SMT modules are used here:

$$(d_i, d_j) \rightarrow \big(\mathrm{BiTM}(d_i, d_j),\ \mathrm{Names}(d_i, d_j),\ \mathrm{SMT}(d_i, d_j)\big) \qquad (7)$$

Finally, we use a simple linear combination of these features as the final score for the document pairs:

$$\mathrm{Score}(d_i, d_j) := W_{TM} \cdot \mathrm{BiTM}(d_i, d_j) + W_{Name} \cdot \mathrm{Names}(d_i, d_j) + W_{SMT} \cdot \mathrm{SMT}(d_i, d_j) \qquad (8)$$

In this equation, the scores for each pair of documents $(d_i, d_j)$ are used: $\mathrm{BiTM}(d_i, d_j)$ is the BiTM score, $\mathrm{Names}(d_i, d_j)$ is the Name score, and $\mathrm{SMT}(d_i, d_j)$ comprises the two scores, BLEU and PER, of the SMT module. The weight of each model is tuned on a development set using a least-squares-error approach.

4 Experiments

4.1 Training data

The available data set is a very large corpus of comparable documents, coming from the BUCC shared task. The documents are Wikipedia pages without any links, special tags or metadata. The training data (train.en/de) is a corpus of linked comparable documents with about 147(K) documents. The non-linked data is a set of about 166(K) English documents that have no similar document in the German document space. The test set (test.en/de) is a random subset of the training data; we use its de side as the source while ignoring the en side. Also, the tuning set for the system combination parameters is about 1(K) documents of the training data that are not seen in the test set. Statistical information about the data is reported in Table 1.

           #Documents   #Running Words   Lexicon Size
total.en   313471       264(M)           2(M)
train.en   147474       83(M)            1(M)
train.de   147474       121(M)           1(M)
test.en    10000        8(M)             239(K)
test.de    10000        5(M)             263(K)

Table 1. Statistical information about the data.

4.2 Preprocessing

The first step of our work is preprocessing the input documents. For tokenization and normalization we use the E4SMT tools (Jabbari et al., 2012). This tool normalizes different character representations to be uniform, tokenizes the input text, and also tags specific tokens like numbers, dates, abbreviations, URL addresses, etc. In addition, the compound words of the de side needed to be split. We have used the Cdec tools for this purpose (Dyer, 2009).

4.3 Preparing modules

In this section, we introduce the tools and corpora used for training and preparing each of the modules.

4.3.1 Doc2Vec

Training the Doc2Vec module consists of two steps. First we need to train a word vector model. Since the quality of the word2vec model depends on the size of the training data, we train our model on all documents in the training and test sets. After this step, we train the paragraph vector model and convert each document in the source and target test sets to a 200-dimensional vector. After that, by selecting 5000 random aligned documents from the training set, we calculated our transformation matrix by minimizing the error rate on those documents. Training and querying this model for all German documents can be done in several hours. The precision of this module is not very high; hence, it cannot be used effectively for predicting document alignments on its own. But due to its speed it can be a great filter for our other modules. The results of this experiment are shown in Figure 4.

4.3.2 BiTM

For training bilingual topic models, we use the Mallet toolkit (McCallum, 2002). One important decision in topic modeling is choosing the number of topics and the hyper-parameters, because of their significant impact on the resulting topic assignments. For choosing the number of topics, we calculate perplexity, which is a way of evaluating the predictive power of the model (Figure 3). From now on, in all the experiments with BiTM we set the number of topics to 1200.

Figure 3. Perplexity for different numbers of topics, with α = 1 and β = 0.7; lower perplexity is better.

Also, we use the method in (Wallach, 2009) to find the hyper-parameters.

Figure 4. Diagram of module recall at different neighborhood sizes (from 1-best to 12000-best) for the Doc2Vec, TM, SMT.BLEU, SMT.PER and Name modules.

4.3.3 Names

In this work, we use German and English NER models to tag the NEs. For this purpose, we use the Stanford NER tagger tool, and we also use unsupervised transliteration mining with the Moses toolkit1 (Koehn et al., 2007).

4.3.4 SMT

For this module, we train a German-to-English SMT system. For this purpose, we use the Moses toolkit for training the translation models and for decoding, as well as SRILM2 (Stolcke, 2002) to build the language models. Also, we used the German-English part of the Europarl3 (Koehn, 2005) parallel corpora as the SMT's training corpora.

4.4 Evaluation

In this phase, we align the documents of the test.de set with a proper en document from the collection of English documents. In the two filtering steps of the model pipeline, we reduce the size of the target space from 313(K) documents to 5(K) documents for each de document. The first filter is the Doc2Vec module, which is the fastest module in our model. This filter reduces the target space to the 12(K) English documents that are nearest to the de one, with 87.6% accuracy. The second filter is the Name module. This filter reduces the size of the target space from 12(K) documents to 5(K) documents with an accuracy of 84.46%.

Each de document in the test set is evaluated against the filtered en documents (5K documents). Then the vector of similarity scores for each pair is produced, and the score of the system combination module is computed for each pair of documents. The result is a matrix of similarity values. Finally, for each row of this matrix the 5-best results are extracted.

The precision, recall and F-measure for the 1-best output of the system combination module and for the 5-best results list are shown in Table 2.

            1-best System Combination   5-best results
Precision   12.6                        45.67
Recall      62.98                       45.67
F-measure   21                          45.67

Table 2. Results of our model: precision, recall and F-measure for the 1-best system combination output and the 5-best results list.

The final precision of our model is about 12%; this is because of the variation in the modules' votes. Each module considers the en documents from a different view, so the 5-best list of the final results contains the en documents most similar to the de input. But from this list just one of them is the exact translation. Although each Wikipedia page has a specific equivalent page in the target language, it is probable that a set of pages are highly similar to it, especially for pages with related topics.

1 https://github.com/moses-smt/mosesdecoder
2 http://www.speech.sri.com/projects/srilm/
3 http://www.statmt.org/europarl/

So, because of this characteristic of Wikipedia pages, deciding on the exact translation with the content information alone is a vague task. Also, aligning the de document to a proper en one from a large collection of en samples increases this ambiguity.

5 Conclusion

Our work is a framework consisting of several modules for retrieving similar Wikipedia pages for German documents from a large collection of English documents. Our model is suitable for dealing with very large corpora. The results show that our model is able to find the correct answer in 62% of the samples.

The framework proposed here has two advantages over previous works: first, it can handle searching through a large collection of data, which is achieved by applying the filtering modules; and second, everything is done using only the content information of the documents, without any special tags or metadata.

Also, all of the modules used in our framework are language independent, so it could be used for any other language pairs.

Acknowledgement

This research was partially supported by targoman.com. We thank our colleagues from targoman.com, who provided insight and expertise that greatly assisted the research.

References

Barrón-Cedeño, Alberto, Paolo Rosso, David Pinto, and Alfons Juan. (2008). On cross-lingual plagiarism analysis using a statistical model. In PAN.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Brown, Peter F., Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263-311.

Deerwester, Scott C., Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391-407.

Delpech, Estelle. (2011). Evaluation of terminologies acquired from comparable corpora: an application perspective. In Proceedings of the 18th International Nordic Conference of Computational Linguistics (NODALIDA 2011), pages 66-73.

Dumais, Susan T., Todd A. Letsche, Michael L. Littman, and Thomas K. Landauer. (1997, March). Automatic cross-language retrieval using latent semantic indexing. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval (Vol. 15, p. 21).

Durrani, Nadir, Hieu Hoang, Philipp Koehn, and Hassan Sajjad. (2014). Integrating an unsupervised transliteration model into statistical machine translation. EACL 2014, 148.

Dyer, Chris. (2009). Using a maximum entropy model to build segmentation lattices for MT. In Proceedings of NAACL HLT 2009, Boulder, Colorado.

Franco-Salvador, Marc, Paolo Rosso, and Roberto Navigli. (2014, April). A knowledge-based representation for cross-language document retrieval and categorization. In Proceedings of EACL (pp. 414-423).

Fung, Pascale, and Percy Cheung. (2004). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the 20th International Conference on Computational Linguistics, page 1051. Association for Computational Linguistics.

Gupta, Rajdeep, and Sivaji Bandyopadhyay. (2013). Testing the effectiveness of named entities in aligning comparable English-Bengali document pairs. In Intelligent Interactive Technologies and Multimedia (pp. 102-110). Springer Berlin Heidelberg.

Jabbari, Fattaneh, Somayeh Bakhshaei, Seyed Mohammad Mohammadzadeh Ziabary, and Shahram Khadivi. (2012). Developing an open-domain English-Farsi translation system using AFEC: Amirkabir bilingual Farsi-English corpus. In The Fourth Workshop on Computational Approaches to Arabic Script-based Languages (p. 17).

Kilgarriff, Adam. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97-133.

Knoth, Petr, Lukas Zilka, and Zdenek Zdrahal. (2011). Using explicit semantic analysis for cross-lingual link discovery. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 2-10.

Koehn, Philipp, Franz Josef Och, and Daniel Marcu. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pp. 48-54.

Koehn, Philipp. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit, pp. 79-86.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177-180.

Le, Quoc V., and Tomas Mikolov. (2014). Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning (ICML 2014), vol. 32, pp. 1188-1196.

Li, Bo, and Eric Gaussier. (2010). Improving corpus comparability for bilingual lexicon extraction from comparable corpora. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 644-652. Association for Computational Linguistics.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. (2008). An Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press.

McCallum, Andrew K. (2002). MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.

Mcnamee, Paul, and James Mayfield. (2004). Character n-gram tokenization for European language text retrieval. Information Retrieval, 7(1-2), 73-97.

Mehdizadeh Seraj, Ramtin, Fattaneh Jabbari, and Shahram Khadivi. (2014). A novel unsupervised method for named-entity identification in resource-poor languages using bilingual corpus. In Telecommunications (IST), 2014 7th International Symposium on (pp. 519-523). IEEE.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Mimno, David, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. (2009, August). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 2 (pp. 880-889). Association for Computational Linguistics.

Mousavi Nejad, Najmeh, and Shahram Khadivi. (2011). An unsupervised alignment model for sequence labeling: Application to name transliteration. 2011 Named Entities Workshop.

Navigli, Roberto, and Simone Paolo Ponzetto. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217-250.

Nießen, Sonja, Franz Josef Och, Gregor Leusch, and Hermann Ney. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation. Athens, Greece.

Oard, Douglas W. (1998). A comparative study of query and document translation for cross-language information retrieval. Machine Translation and the Information Soup. Springer Berlin Heidelberg, 472-483.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).

Potthast, Martin, Benno Stein, and Maik Anderka. (2008). A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval (pp. 522-530). Springer Berlin Heidelberg.

Smith, Jason R., Chris Quirk, and Kristina Toutanova. (2010). Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 403-411.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. (2006). A study of translation edit rate with targeted human annotation. Proceedings of the Association for Machine Translation in the Americas.

Steyvers, Mark, and Tom Griffiths. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7), 424-440.

Stolcke, Andreas. (2002). SRILM: an extensible language modeling toolkit. INTERSPEECH.

Su, Fangzhong, and Bogdan Babych. (2012). Development and application of a cross-language document comparability metric. In LREC (pp. 3956-3962).

Tang, Guoyu, Yunqing Xia, Min Zhang, Haizhou Li, and Fang Zheng. (2011). CLGVSM: Adapting generalized vector space model to cross-lingual document clustering. In IJCNLP, pages 580-588.

Tillmann, Christoph, Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf. (1997). Accelerated DP based search for statistical translation. Eurospeech.

Wallach, Hanna M., David Mimno, and Andrew McCallum. (2009). Rethinking LDA: Why priors matter.

Winograd, Terry. (1972). Understanding natural language. Cognitive Psychology, 3(1), 1-191.

Zens, Richard, Franz Josef Och, and Hermann Ney. (2002). Phrase-based statistical machine translation. In KI 2002, Advances in Artificial Intelligence (pp. 18-32). Springer Berlin Heidelberg.

LINA: Identifying Comparable Documents from Wikipedia

Emmanuel Morin (2), Amir Hazem (1), Elizaveta Loginova-Clouet (2), Florian Boudin (2)
(1) LIUM - EA 4023, Université du Maine, France. [email protected]
(2) LINA - UMR CNRS 6241, Université de Nantes, France. {elizaveta.loginova, florian.boudin, emmanuel.morin}@univ-nantes.fr

Abstract

This paper describes the LINA system for the BUCC 2015 shared track. Following (Enright and Kondrak, 2007), our system identifies comparable documents by collecting counts of hapax words. We extend this method by filtering out document pairs sharing target documents using pigeonhole reasoning and cross-lingual information.

1 Introduction

Parallel corpora, that is, collections of documents that are mutual translations, are used in many natural language processing applications, particularly for statistical machine translation. Building such resources is however exceedingly expensive, requiring highly skilled annotators or professional translators (Preiss, 2012). Comparable corpora, that are sets of texts in two or more languages without being translations of each other, are often considered as a solution for the lack of parallel corpora, and many techniques have been proposed to extract parallel sentences (Munteanu et al., 2004; Abdul-Rauf and Schwenk, 2009; Smith et al., 2010), or mine word translations (Fung, 1995; Rapp, 1999; Chiao and Zweigenbaum, 2002; Morin et al., 2007; Vulić and Moens, 2012).

Identifying comparable resources in a large amount of multilingual data remains a very challenging task. The purpose of the Building and Using Comparable Corpora (BUCC) 2015 shared task1 is to provide the first evaluation of existing approaches for identifying comparable resources. More precisely, given a large collection of Wikipedia pages in several languages, the task is to identify the most similar pages across languages.

In this paper, we describe the system that we developed for the BUCC 2015 shared track and show that a language-agnostic approach can achieve promising results.

2 Proposed Method

The method we propose is based on (Enright and Kondrak, 2007)'s approach to parallel document identification. Documents are treated as bags of words, in which only blank-separated strings that are at least four characters long and that appear only once in the document (hapax words) are indexed. Given a document in language A, the document in language B that shares the largest number of these words is considered as parallel.

Although very simple, this approach was shown to perform very well in detecting parallel documents in Wikipedia (Patry and Langlais, 2011). The reason for this is that most hapax words are in practice proper nouns or numerical entities, which are often cognates. An example of the hapax words extracted from a document is given in Table 1. We purposely keep URLs and special characters, as these are useful clues for identifying translated Wikipedia pages.

website major gaston links flutist marcel debost states sources college crunelle conservatoire principal rampal united currently recorded chastain competitions music http://www.oberlin.edu/faculty/mdebost/ under international flutists jean-pierre profile moyse french repertoire amazon lives external *http://www.amazon.com/michel-debost/dp/b000s9zsk0 known teaches conservatory school professor studied kathleen orchestre replaced michel

Table 1: Example of an indexed document as a bag of hapax words (en-bacde.txt).

1https://comparable.limsi.fr/bucc2015/
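A minimal sketch of this indexing scheme, as we read the description above (not the authors' code): blank-separated strings of length at least four that occur exactly once in a document are kept, and a source document is paired with the target document sharing the most of them:

from collections import Counter

def hapax_words(text, min_len=4):
    # Bag of hapax words: blank-separated strings at least four
    # characters long that appear exactly once in the document.
    counts = Counter(w for w in text.split() if len(w) >= min_len)
    return {w for w, c in counts.items() if c == 1}

def best_match(source_text, target_texts):
    # Return the id of the target document (target_texts maps ids to
    # raw text) sharing the largest number of hapax words.
    src = hapax_words(source_text)
    return max(target_texts,
               key=lambda tid: len(src & hapax_words(target_texts[tid])))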


Here, we experiment with this approach for docfr 2. This strategy further reduces the number detecting near-parallel (comparable) documents. of multiply assigned source documents from 10% Following (Patry and Langlais, 2011), we first to less than 4%. search for the potential source-target document pairs. To do so, we select for each document in the source language, the N = 20 documents in docfr 1 docfr 2 the target language that share the largest number of hapax words (hereafter baseline). 8 10 Scoring each pair of documents independently 6 10 of other candidate pairs leads to several source documents being paired to a same target docu- ment. As indicated in Table 2, the percentage docde docen of English articles that are paired with multiple 14 source documents is high (57.3% for French and 60.4% for German). To address this problem, we Figure 1: Example of the use of cross-lingual remove potential multiple source documents by information to order multiple documents that re- keeping the document pairs with the highest num- ceived the same scores. The number of shared ber of shared words (hereafter pigeonhole). This words are labelled on the edges. strategy greatly reduces the number of multiply assigned source documents from roughly 60% to 10%. This in turn removes needlessly paired doc- 3 Experiments uments and greatly improves the effectiveness of 3.1 Experimental settings the method. The BUCC 2015 shared task consists in returning for each Wikipedia page in a source language, up Strategy FR EN DE EN → → to five ranked suggestions to its linked page in En- baseline 57.3 60.4 glish. Inter-language links, that is, links from a + pigeonhole 10.7 10.8 page in one language to an equivalent page in an- + cross-lingual 3.7 3.4 other language, are used to evaluate the effective- ness of the systems. Here, we only focus on the Table 2: Percentage of English articles that are French-English and German-English pairs. Fol- paired with multiple French or German articles on lowing the task guidelines, we use the following the training data. evaluation measures investigate the effectiveness of our method: In an attempt to break the remaining score ties Mean Average Precision (MAP). Average of between document pairs, we further extend our • model to exploit cross-lingual information. When precisions computed at the point of each cor- multiple source documents are paired to a given rectly paired document in the ranked list of English document with the same score, we use paired documents. the paired documents in a third language to or- Success (Succ.). Precision computed on the • der them (hereafter cross-lingual). Here we make first returned paired document. two assumptions that are valid for the BUCC 2015 shared Task: (1) we have access to comparable Precision at 5 (P@5). Precision computed on • documents in a third language, and (2) source doc- the 5 topmost paired documents. uments should be paired 1-to-1 with target docu- ments. 3.2 Results An example of two French documents (docfr 1 Results are presented in Table 3. Overall, we ob- and docfr 2) being paired to the same English doc- serve that the two strategies that filter out multi- ument (docen) is given in Figure 1. We use the ply assigned source documents improve the per- German document (docde) paired with docen and formance of the method. The largest part of the select the French document that shares the largest improvement comes from using pigeonhole rea- number of hapax words, which for this example is soning. 
In an attempt to break the remaining score ties between document pairs, we further extend our model to exploit cross-lingual information. When multiple source documents are paired with a given English document with the same score, we use the paired documents in a third language to order them (hereafter cross-lingual). Here we make two assumptions that are valid for the BUCC 2015 shared task: (1) we have access to comparable documents in a third language, and (2) source documents should be paired one-to-one with target documents.

An example of two French documents (docfr1 and docfr2) being paired with the same English document (docen) is given in Figure 1. We use the German document (docde) paired with docen and select the French document that shares the largest number of hapax words with it, which in this example is docfr2. This strategy further reduces the number of multiply assigned source documents, from 10% to less than 4%.

[Figure 1: Example of the use of cross-lingual information to order multiple documents that received the same scores. The numbers of shared words are labelled on the edges.]

Strategy          FR→EN   DE→EN
baseline           57.3    60.4
+ pigeonhole       10.7    10.8
+ cross-lingual     3.7     3.4

Table 2: Percentage of English articles that are paired with multiple French or German articles on the training data.
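The tie-breaking step can be sketched in the same style, reusing hapax_words from the sketch above. Here en_to_de, a mapping from English documents to their paired German documents, is a hypothetical name for pivot information that the shared-task setting makes available.

def break_ties(tied_fr_ids, en_id, en_to_de, fr_docs, de_docs):
    """Order French documents tied on the same English document by the
    number of hapax words they share with the German document paired
    with that English document (the 'cross-lingual' strategy)."""
    de_id = en_to_de.get(en_id)
    if de_id is None:  # no third-language pivot: leave the tie unresolved
        return list(tied_fr_ids)
    de_hapax = hapax_words(de_docs[de_id])
    return sorted(tied_fr_ids,
                  key=lambda fr_id: len(hapax_words(fr_docs[fr_id]) & de_hapax),
                  reverse=True)

Sorting rather than taking a single maximum keeps a full ordering, which fits a task that allows several ranked suggestions per page.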

3 Experiments

3.1 Experimental settings

The BUCC 2015 shared task consists in returning, for each Wikipedia page in a source language, up to five ranked suggestions for its linked page in English. Inter-language links, that is, links from a page in one language to an equivalent page in another language, are used to evaluate the effectiveness of the systems. Here, we only focus on the French-English and German-English pairs. Following the task guidelines, we use the following evaluation measures to investigate the effectiveness of our method (a sketch of all three follows the list):

• Mean Average Precision (MAP). Average of the precisions computed at the rank of each correctly paired document in the ranked list of paired documents.

• Success (Succ.). Precision computed on the first returned paired document.

• Precision at 5 (P@5). Precision computed on the 5 topmost paired documents.
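Since each source page has a single gold-linked English page, MAP reduces to the mean reciprocal rank of that page. A minimal sketch under that reading (all names are ours) is:

def evaluate(suggestions, gold):
    """Score ranked suggestion lists against gold inter-language links.
    suggestions: source page id -> ranked list of up to 5 English page ids.
    gold: source page id -> the English page actually linked to it."""
    ap_sum = succ_sum = p5_sum = 0.0
    for s_id, ranked in suggestions.items():
        if gold[s_id] in ranked:
            rank = ranked.index(gold[s_id]) + 1
            ap_sum += 1.0 / rank               # one relevant document per query
            succ_sum += 1.0 if rank == 1 else 0.0
            p5_sum += 1.0 / 5.0 if rank <= 5 else 0.0
    n = len(suggestions)
    return {"MAP": ap_sum / n, "Succ.": succ_sum / n, "P@5": p5_sum / n}

This reading is consistent with Table 3 below, where the P@5 values are roughly one fifth of the success rates.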

3.2 Results

Results are presented in Table 3. Overall, we observe that the two strategies that filter out multiply assigned source documents improve the performance of the method. The largest part of the improvement comes from using pigeonhole reasoning. The use of cross-lingual information to break ties between the remaining multiply assigned source documents only gives a small improvement. We assume that the limited number of potential source-target document pairs considered in our experiments (N = 20) is one reason for this.

Interestingly, results are consistent across languages and datasets (train and test). Our best configuration, that is, with pigeonhole and cross-lingual, achieves nearly 60% success for the first returned pair. This shows that a simple and straightforward approach that requires no language-specific resources can still yield interesting results.

                    FR→EN                           DE→EN
                Train           Test            Train           Test
Strategy        MAP  Succ. P@5  MAP  Succ. P@5  MAP  Succ. P@5  MAP  Succ. P@5
baseline        31.4 28.0  7.4  32.9 30.0  7.5  28.7 24.9  6.9  29.0 24.9  7.1
+ pigeonhole    57.7 56.4 11.9  61.6 60.1 12.8   −    −     −    −    −     −
+ cross-lingual 58.9 57.7 12.1  59.0 57.7 12.1  62.3 60.9 12.8  62.2 60.7 12.8

Table 3: Performance in terms of MAP, success (Succ.) and precision at 5 (P@5) of our model.

4 Discussion

In this paper we described the LINA system for the BUCC 2015 shared task. We proposed to extend the approach of Enright and Kondrak (2007) to parallel document identification by filtering out document pairs sharing target documents, using pigeonhole reasoning and cross-lingual information. Experimental results show that our system identifies comparable documents with a precision of about 60%.

Scoring document pairs by the number of shared hapax words was initially intended as a baseline for comparison purposes. We also tried a finer-grained scoring approach relying on bilingual dictionaries and information retrieval weighting schemes. To keep computation time reasonable, we were unable to include low-frequency words in our system; the partial results were very low, and we are still investigating the reasons for this.

Acknowledgments

This work is supported by the French National Research Agency under grant ANR-12-CORD-0020.

References

Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the use of comparable corpora to improve SMT performance. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23, Athens, Greece.

Yun-Chuang Chiao and Pierre Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 2, COLING '02, pages 1–5, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jessica Enright and Grzegorz Kondrak. 2007. A fast method for parallel document identification. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'07), pages 29–32, Rochester, New York, USA.

Pascale Fung. 1995. Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In Proceedings of the 3rd Annual Workshop on Very Large Corpora (VLC'95), pages 173–183, Cambridge, MA, USA.

Emmanuel Morin, Béatrice Daille, Koichi Takeuchi, and Kyo Kageura. 2007. Bilingual terminology mining - using brain, not brawn comparable corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 664–671, Prague, Czech Republic, June. Association for Computational Linguistics.

Dragos Stefan Munteanu, Alexander Fraser, and Daniel Marcu. 2004. Improved machine translation performance via parallel sentence extraction from comparable corpora. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 265–272, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC'11), pages 87–95, Portland, Oregon, USA.

Judita Preiss. 2012. Identifying comparable corpora using LDA. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 558–562, Montréal, Canada, June. Association for Computational Linguistics.

Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), pages 519–526, College Park, MD, USA.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 403–411, Los Angeles, California, June. Association for Computational Linguistics.

Ivan Vulić and Marie-Francine Moens. 2012. Detecting highly confident word translations from comparable corpora without any prior knowledge. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 449–459, Avignon, France, April. Association for Computational Linguistics.


Author Index

Agha Sadeghi, Amir Pouya, 79
Ali Panahloo, Zeinab, 79
Azadi, Fatemeh, 79

Bakhshaei, Somayeh, 43, 79
Barrón-Cedeño, Alberto, 3
Boldoba, Josu, 3
Boudin, Florian, 88

Cheon, Minah, 62

Daille, Béatrice, 32

España-Bonet, Cristina, 3

Ghiasifard, Sonia, 79
Grishina, Yulia, 14

Hazem, Amir, 88

Jakubina, Laurent, 23

Khadivi, Shahram, 43
Kim, Jae-Hoon, 62
Kotani, Katsunori, 38

Langlais, Philippe, 23
Linard, Alexis, 32
Loginova-Clouet, Elizaveta, 88
Long, Zi, 52

Màrquez, Lluís, 3
Mitsuhashi, Tomoharu, 52
Mohammadzadeh Ziabary, Seyyed Mohammad, 79
Morin, Emmanuel, 32, 88

Rapp, Reinhard, 74
Rios, Miguel, 68

Safabakhsh, Reza, 43
Seo, Hyeong-Won, 62
Sharoff, Serge, 68, 74
Stede, Manfred, 14

Tsou, Benjamin K., 1

Utsuro, Takehito, 52

Yamamoto, Mikio, 52
Yoshimi, Takehiko, 38

Zafarian, Atefeh, 79
Zweigenbaum, Pierre, 74
